
Cite This: J. Proteome Res. 2019, 18, 2747−2758. Received: December 31, 2018. Published: June 24, 2019.

pValid: Validation Beyond the Target-Decoy Approach for Peptide Identification in Shotgun Proteomics

Wen-Jing Zhou,†,‡ Hao Yang,†,‡ Wen-Feng Zeng,†,‡ Kun Zhang,†,‡ Hao Chi,*,†,‡ and Si-Min He*,†,‡


†Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
‡University of Chinese Academy of Sciences, Beijing 100049, China

ABSTRACT: As the de facto validation method in mass spectrometry-based proteomics, the target-decoy approach determines a threshold to estimate the false discovery rate and then filters those identifications beyond the threshold. However, the incorrect identifications within the threshold are still unknown and further validation methods are needed. In this study, we characterized a framework of validation and investigated a number of common and novel validation methods. We first defined the accuracy of a validation method by its false-positive rate (FPR) and false-negative rate (FNR) and, further, proved that a validation method with lower FPR and FNR led to identifications with higher sensitivity and precision. Then we proposed a validation method named pValid that incorporated an open database search and a theoretical spectrum prediction strategy via a machine-learning technology. pValid was compared with four common validation methods as well as a synthetic peptide validation method. Tests on three benchmark data sets indicated that pValid had an FPR of 0.03% and an FNR of 1.79% on average, both superior to the other four common validation methods. Tests on a synthetic peptide data set also indicated that the FPR and FNR of pValid were better than those of the synthetic peptide validation method. Tests on a large-scale human proteome data set indicated that pValid successfully flagged the highest number of incorrect identifications among all five methods. Further considering its cost-effectiveness, pValid has the potential to be a feasible validation tool for peptide identification.

KEYWORDS: tandem mass spectrometry, target-decoy approach, validation methods, false-positive rate, false-negative rate



INTRODUCTION

As the principal method in proteomics for identifying peptides and proteins, tandem mass spectrometry (MS/MS) can acquire millions of spectra in a single experiment.1 Although plenty of proteome search engines, e.g., SEQUEST,2 Mascot,3 MaxQuant,4,5 PEAKS,6 and pFind,7,8 have been designed to analyze large-scale MS/MS data sets, the top-1 results for each spectrum reported by search engines still need to be further validated. The correctness of identified peptides and proteins reported by search engines is vital to further biological analysis.9,10 The target-decoy approach (TDA) was developed to validate large-scale identifications; it estimates the false discovery rate (FDR) of all identifications with a decoy database generated by reversing or shuffling protein sequences in the target (original) database.11 TDA assumes that the probability of reporting a random (incorrect) target identification is the same as that of reporting a decoy one; hence, the number of false identifications in the target results can be estimated by the number of decoy identifications, and the FDR can be estimated by the ratio of the number of decoy identifications to the number of target ones. Although TDA has been widely used in MS/MS data analysis, a few problems with its usage and credibility are still controversial, e.g., the way to construct decoy databases, the

selection of the FDR calculation formula, the choice of combined or separate search of target and decoy databases, combined or separate FDR control, and how to use trap databases.12−20 Furthermore, the original assumption is seldom validated before using TDA, which makes the credibility of TDA not as perfect as theoretically expected. Moreover, the real FDR may be significantly higher than that reported by TDA.21 In addition, TDA should be used with special care when filtering identifications that are rich in post-translational modifications or identified from multistage search processes.22−24 A few studies have focused on improving the performance of TDA. Separate FDR control was proposed for processing search results with different types of enzymes or modifications and might increase the precision of subgroup FDR.23,25 Meanwhile, a few machine-learning models, e.g., support vector machines (SVM) and Markov models, were used in the design of scoring functions to better distinguish between correct and incorrect identifications.26−28 However, the discrimination power of these approaches is still limited by two main factors. First, restricted search is the dominant database search strategy in routine experiments, which


Figure 1. Workflow of benchmark construction, usage of validation methods, and training and testing workflows of pValid. (a) Workflow of benchmark construction. Mass deviations of +5 and +10 Da are added to the precursor masses of the intersected spectra of three search engines to construct trap spectra. (b) Usage of five validation methods. Positive results flagged by each validation method are drawn as one circle, where positive means the search result is suspicious; purple, yellow, green, red, and blue circles respectively denote suspicious (positive) results flagged by Trap-database, Open-search, pDeep, Open-search-pDeep, and pValid. (c) Training and testing workflows of pValid. Green arrows represent the steps used in the training workflow. Orange arrows represent the steps used in the testing workflow. Purple arrows represent the steps used in the practical application workflow.

considers a relatively incomplete search space (not covering all types of cleaved peptides and all kinds of modifications), with fully tryptic peptides and only a few modifications. If a peptide in the biological sample is a semispecific tryptic peptide with a modification not set in the restricted search, such as carbamidomethylation of lysine, then the peptide may not be correctly identified. Second, the intensities of theoretical fragment ions are not well estimated by search engines, leading to a weak basis for peptide-spectrum match (PSM) scoring. In addition, the traditional TDA cannot assess the individual credibility of identifications that pass the FDR control; in other words, it cannot tell whether any one PSM within FDR 1% is correct or not, even when this PSM yields a significantly high score. Although several common credibility validation methods, such as trap database and synthetic peptide methods, have been developed and used after TDA FDR control, the discrimination power of these methods has not been studied yet.13,29,30
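For illustration, the TDA-based FDR filtering discussed above can be written as a minimal Python sketch; the PSM records and the q-value-style cutoff below are hypothetical and do not reproduce the implementation of any particular search engine:

    from collections import namedtuple

    # Hypothetical PSM records; a higher score means a better match.
    PSM = namedtuple("PSM", ["spectrum", "peptide", "score", "is_decoy"])

    def tda_filter(psms, fdr_threshold=0.01):
        """Keep target PSMs whose best achievable FDR is <= the threshold,
        with FDR estimated as #decoy / #target above each score cutoff."""
        ranked = sorted(psms, key=lambda p: p.score, reverse=True)
        targets, decoys, running_fdr = 0, 0, []
        for p in ranked:
            decoys += p.is_decoy
            targets += not p.is_decoy
            running_fdr.append(decoys / max(targets, 1))
        # Convert running FDR values to q-values (monotone from the bottom up).
        qvalues = running_fdr[:]
        for i in range(len(qvalues) - 2, -1, -1):
            qvalues[i] = min(qvalues[i], qvalues[i + 1])
        return [p for p, q in zip(ranked, qvalues)
                if not p.is_decoy and q <= fdr_threshold]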

In this study, we have investigated how to characterize a validation method and proposed the pValid algorithm to automatically validate peptide identifications in proteomic experiments. In this algorithm, an open database search strategy is used to examine the match quality of each PSM by introducing many more peptide candidates, and a theoretical spectrum prediction algorithm is used to estimate the intensities of fragment ions more precisely. In particular, we defined the accuracy of a validation method by its false-positive rate (FPR) and false-negative rate (FNR) and further proved that lower FPR and FNR indicate higher sensitivity and precision. Tests on three benchmark data sets indicated that pValid had an FPR of 0.03% and an FNR of 1.79% on average, superior to those of the synthetic peptide validation method and four other common validation methods. The time cost of pValid was only about twice that of a regular database search, meaning that pValid is fast enough to be widely applied after database search and TDA filtration.




METHODS

In this section, we will first introduce the method to construct the benchmark data sets, each of which consists of correct and incorrect identifications. These data sets are used for comparing the discrimination power of different validation methods. Then, we will briefly introduce four validation methods evaluated in this study, i.e., Trap-database, Open-search, pDeep, and Open-search-pDeep. Further, we will introduce the acknowledged gold-standard validation method, i.e., the Synthetic-peptide method. Finally, we will introduce the training and testing workflows of pValid.

against the database consisting of both original and trap proteins. For the same spectrum, if the newly reported peptide is different from the original peptide, then the original identification is flagged as suspicious. In this study, all reviewed proteins of Arabidopsis downloaded from UniProt (released in 2018.03) are used as trap proteins, and then pFind is used for the trap database search. For each spectrum, the identification under the original database search with TDA FDR control is flagged as suspicious if it is not the same as that of the trap database search, implying that it is not competitive enough. Note that the search results under the trap database search do not need to pass FDR 1%.

Benchmark Construction

Open-Search Validation

Four published mass spectrometry data sets are used in this study (Supplementary Table 1 of the Supporting Information). Two of them (Kuster_PT_Training and Kuster_PT) are synthetic data sets chosen from ProteomeTools31 and the other two (Olsen_Hela and Mann_Hela) are chosen from HeLa data sets.32,33 pFind, MaxQuant, and PEAKS are used to search these data (Supplementary Tables 2 and 3). Four benchmark data sets are separately constructed from these four data sets using the benchmark construction method described in this study (Figure 1a). Note that Kuster_PT_Training is only used in the training workflow of pValid, and the other three data sets are used for testing the performance of all validation methods. Figure 1a illustrates the workflow of benchmark construction. PSMs consistently reported by all three search engines are regarded as correct, and the corresponding spectra are referred to as target spectra. However, it is hard to construct a benchmark data set with sufficient well-matched incorrect identifications. The decoy identifications within FDR 1% meet the need, but their number is too limited. The decoy identifications beyond FDR 1% are numerous enough, but their PSM scores and match qualities are worse than those of incorrect identifications within FDR 1%. Moreover, a decoy identification may still be correct, because the decoy peptide may be the same as a target one with a modification or a mutation. Adding precursor mass deviations to target spectra is another way to generate incorrect identifications. In order to construct adequate incorrect identifications, two mass deviations, +5 and +10 Da, are respectively added to the precursors of target spectra, and thus two trap spectra are generated from each target spectrum. Search results identified from trap spectra are without doubt incorrect identifications. Finally, all the target and trap spectra generated from one data set are merged into one MGF file. Each MGF file containing target and trap spectra is searched by pFind using the restricted search mode, considering fully tryptic peptides with two variable modifications (Supplementary Table 3). Peptide FDR 1% is used to filter PSMs, and then the filtered results are compared with the annotations of the benchmark data sets constructed before. If a PSM is identified from a target spectrum and is consistent with the annotation in the benchmark data set, it is regarded as correct; otherwise, it is deemed incorrect. All these correct and incorrect identifications are then examined by each validation method, and identifications that do not pass the validation are flagged as positive or suspicious identifications (Figure 1b).
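A minimal sketch of this trap-spectrum construction is given below. The MGF field names (TITLE, PEPMASS, CHARGE) are standard, but applying the +5 and +10 Da deviations as +shift/charge on the PEPMASS m/z value, and the default charge of 2 when no CHARGE line is present, are assumptions made only for illustration:

    import re

    def make_trap_spectra(mgf_in, mgf_out, mass_shifts=(5.0, 10.0)):
        """Write each spectrum unchanged plus one trap copy per precursor mass shift.
        A shift of +x Da on the precursor mass corresponds to +x/charge on the
        PEPMASS m/z value stored in the MGF file."""
        with open(mgf_in) as fin, open(mgf_out, "w") as fout:
            block = []
            for line in fin:
                block.append(line)
                if line.strip() == "END IONS":
                    fout.writelines(block)            # original (target) spectrum
                    charge = 2                        # assumed default charge
                    for l in block:
                        m = re.match(r"CHARGE=(\d+)", l.strip())
                        if m:
                            charge = int(m.group(1))
                    for shift in mass_shifts:         # trap copies
                        for l in block:
                            m = re.match(r"PEPMASS=([\d.]+)(.*)", l.strip())
                            if m:
                                mz = float(m.group(1)) + shift / charge
                                fout.write("PEPMASS=%.5f%s\n" % (mz, m.group(2)))
                            elif l.startswith("TITLE="):
                                fout.write(l.strip() + "_trap+%g\n" % shift)
                            else:
                                fout.write(l)
                    block = []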

The idea of open-search validation (referred to as Open-search) is to use an open search engine to perform the database search with the same search parameters as the restricted search except for more enzyme cleavage types and more modifications. For the same spectrum, if the newly reported peptide is different from the original peptide and has a higher score, then the original identification is flagged as suspicious. The open-search algorithm Open-pFind,8 which considers all kinds of digestion types (i.e., nonspecific digestion) and all modifications of Unimod34 and, hence, a search space about 100 000 times larger than the regular search space, is used in this study. Obviously, if an identification is still the best candidate after searching against a search space expanded by thousands of times, it tends to be much more credible. As Open-pFind and pFind use the same scoring functions, the credibility of each identification reported by pFind can be directly evaluated by comparing this identification with the one reported by Open-pFind for the same spectrum. For each spectrum, if the identification of pFind is not the same as the top-1 candidate (not necessarily within the FDR 1% threshold) of Open-pFind and the score from pFind is less (worse) than that of Open-pFind, the identification of pFind is flagged as suspicious. For example, for the same spectrum, if pFind reports ACDEFGK with a score of 11 and Open-pFind reports ACDEGFK with a score of 12, then the result reported by pFind is flagged as suspicious. Supplementary Figure 1 demonstrates an identification reported by pFind and then flagged as suspicious by the Open-search validation.

pDeep Validation

The idea of theoretical spectrum validation is to predict the theoretical spectrum of the reported peptide and to compute the cosine similarity between the original spectrum and the predicted spectrum. If the cosine similarity is less than the predefined similarity threshold, then the original search result is flagged as suspicious. pDeep is an algorithm for theoretical spectrum prediction based on a deep learning model.35 For one PSM from spectrum s and peptide p, a theoretical spectrum (referred to as s′) of p can be predicted by pDeep, and then the cosine similarity between s and s′ is computed. In this study, given one PSM identified by pFind, we use pDeep to produce the theoretical spectra of the top-3 peptide candidates reported by pFind and the top-3 peptide candidates reported by Open-pFind for the spectrum, and then the similarity between the original spectrum and each of the six theoretically predicted spectra is computed. According to the report of pDeep, more than 96% of the predicted spectra have a cosine similarity of 0.7 or higher when compared with the original spectra.35 Thus, for each identification, if the similarity of the peptide reported

Trap-Database Validation

The idea of trap-database validation13,20 (referred to as Trap-database) is to add trap proteins that do not exist in the biological sample to the original database and then search


Figure 2. Performance comparison of the five validation methods. (a) Definition of FPR and FNR. FPR is calculated as the ratio of the number of false-positive instances to the number of all correct identifications. FNR is calculated as the ratio of the number of false-negative instances to the number of all incorrect identifications. (b) FPR and FNR comparison of the five validation methods at the PSM level. Trap-database has the lowest FPR, and the FPR of pValid is similar to that of Trap-database. pValid has the lowest FNR, far lower than those of the other validation methods. (c) The change of PSM-level error rate and recall after eliminating suspicious identifications flagged by the five validation methods. “Before Elimination” indicates that the identifications are controlled only by the TDA FDR. Yellow strips represent the ratio of the error rate after elimination to the error rate before elimination. Blue histograms represent the change of recall; one blue column is removed for every 0.1% reduction in recall. pValid reduces the error rate by a factor of 59 (from 1.18% to 0.02%) and at the same time keeps the recall as high as 99.97%.

new one such that an identification is flagged as suspicious if at least one method flags it as suspicious. This combined validation method is referred to as Open-search-pDeep validation.
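Stated as boolean rules, the flagging criteria of these rule-based methods can be sketched as follows; the function and argument names are illustrative only and are not part of pFind, Open-pFind, or pDeep:

    def flag_trap_database(original_peptide, trap_search_peptide):
        # Trap-database: suspicious if the trap-database search reports a different peptide.
        return trap_search_peptide != original_peptide

    def flag_open_search(original_peptide, original_score, open_peptide, open_score):
        # Open-search: suspicious if Open-pFind reports a different peptide with a higher score.
        return open_peptide != original_peptide and open_score > original_score

    def flag_pdeep(similarity_best, similarity_max_of_candidates, threshold=0.7):
        # pDeep: suspicious if the reported peptide's predicted-spectrum similarity is not
        # the highest among the six candidates or does not reach the 0.7 threshold.
        return similarity_best < similarity_max_of_candidates or similarity_best < threshold

    def flag_open_search_pdeep(open_flag, pdeep_flag):
        # Open-search-pDeep: suspicious if either individual method flags the PSM.
        return open_flag or pdeep_flag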

by pFind is not the highest among all six peptide candidates or its similarity does not reach 0.7, this identification is flagged as suspicious. This validation method is referred to as pDeep validation. The example shown in Supplementary Figure 1 demonstrates an identification that is reported by pFind but flagged as suspicious by the pDeep validation. Note that the training data of the pDeep software are not used in this study to avoid overfitting.
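The cosine similarity used here can be sketched as below, assuming the observed and pDeep-predicted spectra have already been reduced to intensities on a common set of fragment ions; keying ions by tuples such as ('b', 3) or ('y', 5) is an illustrative choice, not the actual pDeep data format:

    import math

    def cosine_similarity(observed, predicted):
        """Cosine similarity between two fragment-ion intensity dictionaries;
        missing ions count as zero intensity."""
        keys = set(observed) | set(predicted)
        dot = sum(observed.get(k, 0.0) * predicted.get(k, 0.0) for k in keys)
        norm_o = math.sqrt(sum(v * v for v in observed.values()))
        norm_p = math.sqrt(sum(v * v for v in predicted.values()))
        if norm_o == 0.0 or norm_p == 0.0:
            return 0.0
        return dot / (norm_o * norm_p)

    # A PSM would be flagged if cosine_similarity(spectrum, predicted_spectrum) < 0.7
    # or if another of the six candidates achieves a higher similarity.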

Synthetic-Peptide Validation

The synthetic-peptide validation method (referred to as Synthetic-peptide) is the most popular method in proteomics to validate the credibility of a single identification.29,30 The identified peptides are synthesized and fragmented to generate synthetic spectra, and the cosine similarity between each original spectrum and its related synthetic spectrum is calculated. If the cosine similarity is less than the predefined

Open-Search-pDeep Validation

Open-search and pDeep are regarded as two independent validation methods and are separately used in a validation process. Note that these two methods can be combined into a


similarity threshold, such as 0.9, then the original search result is flagged as suspicious.29,30 Generally, identifications reported by search engines after TDA filtration are further filtered by a series of other rules (such as the threshold of the PSM score, the peptide length, and the number of missed cleavage sites), and only a small part of them (called candidate PSMs) are sent out for peptide synthesis to validate their credibility.29,30 Specifically, for each candidate PSM, the corresponding peptide with the same sequence, modification, and modification sites as the candidate PSM is synthesized. Then, this synthetic peptide is fragmented under the same liquid chromatography (LC)−MS/MS condition, and in general, quite a number of synthetic peptide spectra for the same peptide are acquired (e.g., 20 on average in ProteomeTools31). Lastly, the cosine similarity between each synthetic peptide spectrum and the spectrum of the candidate PSM is calculated, and the maximum similarity value is considered as the synthetic peptide similarity for the candidate PSM. If the synthetic peptide similarity for the candidate PSM does not reach a threshold, 0.9 in general and in this study, this candidate PSM is flagged as suspicious by the Synthetic-peptide validation.
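Reusing the cosine_similarity function sketched earlier, the Synthetic-peptide decision reduces to taking the maximum similarity over the replicate synthetic spectra of the identified peptide; a minimal, hypothetical sketch:

    def flag_synthetic_peptide(psm_spectrum, synthetic_spectra, threshold=0.9):
        """synthetic_spectra: list of fragment-ion intensity dicts acquired for the
        synthesized peptide (often around 20 replicates); the PSM is suspicious if
        even the best-matching synthetic spectrum stays below the 0.9 threshold."""
        best = max((cosine_similarity(psm_spectrum, s) for s in synthetic_spectra),
                   default=0.0)
        return best < threshold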

data in the testing workflow of pValid (Supplementary Table 6). Each data set is searched against one appropriate database to obtain correct and incorrect instances (Supplementary Table 3). Four features are extracted from each instance, and the previously trained model is used to predict the label of each instance.
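The classifier can be sketched roughly as follows. The study uses LIBSVM with an RBF kernel and an SVM score threshold of 0.5; the sketch below instead uses scikit-learn's SVC with probability estimates standing in for that score, and the training data are synthetic placeholders rather than real PSM features:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    # Four features per PSM: pFind score, Open-pFind score for the same spectrum,
    # pDeep similarity of pFind's best candidate, and the highest pDeep similarity
    # among the up-to-six candidates from pFind and Open-pFind (synthetic values here).
    rng = np.random.default_rng(0)
    correct = np.column_stack([rng.normal(12, 1, 40), rng.normal(12, 1, 40),
                               rng.uniform(0.8, 1.0, 40), rng.uniform(0.8, 1.0, 40)])
    incorrect = np.column_stack([rng.normal(7, 1, 40), rng.normal(11, 1, 40),
                                 rng.uniform(0.2, 0.6, 40), rng.uniform(0.5, 0.9, 40)])
    X_train = np.vstack([correct, incorrect])
    y_train = np.array([1] * 40 + [0] * 40)        # 1 = correct, 0 = incorrect

    scaler = MinMaxScaler()                        # normalize all feature values into [0, 1]
    clf = SVC(kernel="rbf", probability=True)      # RBF kernel, other parameters left at defaults
    clf.fit(scaler.fit_transform(X_train), y_train)

    def is_suspicious(features, threshold=0.5):
        """Flag a PSM as suspicious when its score (probability of being correct) is below 0.5."""
        score = clf.predict_proba(scaler.transform([features]))[0, 1]
        return score < threshold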



RESULTS

First, we will define two metrics for evaluating the performance of validation methods and analyze the relationship between these two metrics of validation and the two metrics (sensitivity and precision) of identifications. Second, we will characterize the performance of pValid and the other four validation methods mentioned above using the three benchmark data sets, i.e., Kuster_PT, Olsen_Hela, and Mann_Hela, and characterize the performance of Synthetic-peptide validation using one synthetic peptide data set of ProteomeTools. Finally, we will illustrate the application of pValid on two large-scale data sets.

Two Metrics for Evaluating the Performance of Validation Methods

Each validation method checks all identifications and flags a part of them as positive or suspicious, similar to a medical test in which suspicious results are called positive. In terms of validation, if a suspicious identification comes from an incorrect identification, this identification is a true-positive instance; otherwise, it comes from a correct identification and hence is a false-positive instance. In contrast, for an identification not flagged as suspicious by the validation method, if it comes from a correct identification, it is a true-negative instance; otherwise, it comes from an incorrect identification and hence is a false-negative one. Then, two metrics named false-positive rate (FPR) and false-negative rate (FNR) are developed to characterize the performance of these validation methods (Figure 2a): FPR is calculated as the ratio of the number of false-positive instances to the number of all correct identifications, and FNR is calculated as the ratio of the number of false-negative instances to the number of all incorrect identifications. In addition, identifications flagged by validation methods can be further used by search engines; e.g., eliminating suspicious ones may reduce incorrect identifications. Therefore, two metrics, recall (r) and error rate (f), are defined to evaluate the sensitivity and precision of identification. Recall means the ratio of the number of correct identifications retrieved by a search engine to the total number of correct identifications, and error rate means the ratio of the number of incorrect identifications to the number of all retrieved identifications, which equals one minus precision. If we use a validation method to eliminate suspicious identifications, the recall and the error rate of the remaining search results will change; they are referred to as r′ and f′, respectively. After defining the number of correct identifications (T) and incorrect identifications (F), the relationships among FPR (referred to as e1), FNR (referred to as e2), recall, and error rate are investigated (Supplementary Note 1), and two theorems are proposed as follows.

pValid Validation

Instead of simply combining Open-search and pDeep as mentioned above, pValid is an SVM classifier whose features come from the open search and theoretical spectrum prediction strategies. After benchmark construction, correct identifications and incorrect identifications are used in both the training and testing workflows. Figure 1c illustrates the detailed steps of the training and testing workflows of pValid. Extracting Features. Four features are extracted from each instance, i.e., each PSM in this context (Supplementary Figure 1e). The first two features are related to Open-search validation: one is the score of this PSM calculated by pFind, and the other is the score of the PSM reported by Open-pFind for the same spectrum. The last two features are related to pDeep validation: one is the cosine similarity between the original spectrum and the spectrum predicted by pDeep (referred to as pDeep similarity) of the best candidate of pFind, and the other is the highest pDeep similarity among up to six peptide candidates of the PSM, i.e., the top-3 peptide candidates reported by pFind and the top-3 peptide candidates reported by Open-pFind. Training an SVM Classifier. LIBSVM36 is used in this study to train an SVM classifier. The radial basis function is used as the SVM kernel function. All feature values are first normalized into [0,1] and then used for training. The other parameters are set to their defaults. LIBSVM reports an SVM score for each instance, and an SVM score of 0.5 is used as the threshold of the final judgment, i.e., an instance with an SVM score less than 0.5 will be flagged as suspicious. In the testing workflow, the process of extracting features is the same as that in the training workflow, and the classifier generated by the training workflow is used to predict whether a PSM is suspicious. The identifications of Kuster_PT_Training are used as training instances. To obtain more training instances, four databases of different sizes (Supplementary Table 4) are separately searched, and 212 402 correct instances and 1491 incorrect instances generated from the four groups of database search results are used in the training process (Supplementary Table 5). The other three data sets are used to generate test

theorem 1:
r′ = (1 − e1)r,  f′ = f(T + F)/[T(1 − e1)/e2 + F];  f′ < f if (e1 + e2) < 1  (1)


Figure 3. Receiver-operating-characteristic (ROC) and precision-recall (PR) curves of five validation methods on Kuster_PT. (a) The ROC curves. (b) The ROC curves in the FPR region of [0, 2%] in part a. (c) The PR curves.

and pDeep; thus, it had the second-lowest FNR. It is worth noting that the FPR was less than 0.5% for all methods, while the FNR was more than 5% on average for all methods except pValid (1.79%), suggesting the particular advantages of pValid. Similar results were obtained when the same analysis was performed at the peptide level (Supplementary Tables 7 and 8; in this study, a peptide is defined by its sequence and modifications). Note that the sum of FPR and FNR of all five validation methods was less than 1, and thus, the error rate of identifications will be reduced after eliminating the suspicious identifications flagged by these validation methods (Figure 2c). After elimination, the error rate of pValid was 0.02% on average, which was reduced by a factor of 59 compared with the original value before elimination (1.18%) and was significantly lower than those of the other four methods. Note that the three preconditions of theorem 2 were all satisfied in the experiments, and thus, the error rate after elimination would be reduced to the range between FNR and double FNR times the error rate before elimination. The actual experimental results showed that the error rate after elimination was closer to the lower bound, i.e., FNR times the error rate before elimination, than to the upper bound. At the same time, pValid kept the recall as high as 99.97% (reduced by only 0.02%), only slightly lower than Trap-database (99.98%). The receiver-operating-characteristic (ROC) and precision-recall (PR) curves of the five methods on Kuster_PT are illustrated in Figure 3. The ROC curve of pValid was the nearest to the top left corner, and the PR curve of pValid was the nearest to the top right corner, illustrating the best validation performance of pValid among these five validation methods. From the perspective of ROC and PR curves, the validation performance of the five validation methods can be ranked as pValid > Open-search-pDeep > pDeep > Trap-database > Open-search. ROC and PR curves of the five validation methods on Olsen_Hela and Mann_Hela also led to the same conclusion as that of Kuster_PT, except that the performance of Trap-database was worse than that of Open-search (Supplementary Figures 2 and 3). Next, we checked the consistency of the PSMs that were correctly classified by the five validation methods, including the true-positive PSMs and the true-negative PSMs. The more true-positive and true-negative PSMs flagged by one validation method, the more accurate this method is. For the Kuster_PT data set, nearly 99% of both the true-positive and true-negative PSMs flagged by each of the other four validation methods were also consistently flagged by pValid

theorem 2:
e2 f < f′ < 2 e2 f  if  T > F, (e1 + e2) < 1, and e1 ≤ e2  (2)
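As a quick numeric check of these two relationships, with hypothetical counts T and F and hypothetical rates e1 and e2 (not values taken from the experiments):

    # Hypothetical numbers: 9900 correct and 100 incorrect identifications pass the TDA filter.
    T, F = 9900, 100
    e1, e2 = 0.0003, 0.018            # hypothetical FPR and FNR of a validation method

    f = F / (T + F)                                   # error rate before elimination
    remaining_correct = T * (1 - e1)                  # recall is reduced by the factor (1 - e1)
    remaining_incorrect = F * e2
    f_prime = remaining_incorrect / (remaining_correct + remaining_incorrect)

    # Theorem 1: f' = f * (T + F) / (T * (1 - e1) / e2 + F)
    assert abs(f_prime - f * (T + F) / (T * (1 - e1) / e2 + F)) < 1e-12
    # Theorem 2: e2 * f < f' < 2 * e2 * f, given T > F, (e1 + e2) < 1, and e1 <= e2
    assert e2 * f < f_prime < 2 * e2 * f
    print(f"recall retained: {1 - e1:.4%}, error rate before: {f:.4%}, after: {f_prime:.4%}")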

As shown in theorem 1, after eliminating suspicious identifications, the recall is reduced by a fraction of e1; in other words, a validation method with a lower FPR will lead to a higher recall for the remaining identifications. If the sum of FPR and FNR is less than 1, the error rate of the remaining identifications after elimination will be less than that before elimination. As shown in theorem 2, assuming that the number of correct identifications is larger than the number of incorrect identifications, which is reasonable after the traditional TDA-based FDR control, and further assuming that the sum of FPR and FNR is less than 1 and FPR is not larger than FNR, which holds for all five validation methods as shown later, the error rate after elimination of suspicious identifications will be reduced to the range between FNR and double FNR times the error rate before elimination. Thus, a lower FNR will lead to a lower error rate, i.e., a higher precision. In summary, a validation method with lower FPR and FNR will lead to higher sensitivity and precision for the remaining identifications after eliminating suspicious ones. Once the FPR and FNR of a validation method are known and the suspicious rate (the rate of suspicious identifications among all identifications) is calculated, the real error rate or FDR of all identifications reported by a search engine can be estimated. That is, FDR = (suspicious rate − FPR)/(1 − FPR − FNR); for further discussion, please see Supplementary Note 1.

Comparison between pValid and the Other Four Methods

First, we compared the FPRs and FNRs of the five validation methods on the three test data sets at the PSM level (Figure 2b). Owing to the lack of synthetic data, the performance of the Synthetic-peptide method could not be estimated on most of the three data sets as it was for the other five methods, and its performance will be illustrated in the next section. The FPR of Trap-database was the lowest, while the FPR of pValid was the second lowest. The FPRs of pDeep and Open-search-pDeep were worse than the others, and the FPR of Open-search-pDeep must be higher than both Open-search and pDeep, owing to its inherent design. By contrast, pValid also combined the features of Open-search and pDeep, but it performed significantly better than the two in terms of FPR. On the other hand, pValid showed the lowest FNR, while Trap-database showed the highest. The FNRs of Trap-database and Open-search were ultrahigh compared with those of the other methods, in striking contrast with their low FPRs. Open-search-pDeep was a direct combination of Open-search


as this PSM is calculated. The maximum cosine similarity is retained and regarded as the synthetic cosine similarity of the original PSM. Finally, the FPR of Synthetic-peptide can be calculated as the ratio of the number of correct identifications whose synthetic cosine similarities are less than 0.9 to the number of all correct identifications. However, the FNR of Synthetic-peptide cannot be easily compared with those of pValid and the other validation methods, since the incorrect identifications in the synthetic benchmark data set were not all synthesized. Thus, an experiment is designed to construct the incorrect identifications by changing the correct peptide of each spectrum to an incorrect peptide (also in the benchmark correct peptide collection) with the lowest mass deviation from the correct peptide (Supplementary Note 2). The FNR of Synthetic-peptide can then be calculated as follows. First, incorrect identifications are constructed. Second, for each incorrect PSM, the cosine similarity between this incorrect PSM and each of the correct PSMs of the same PCM as this incorrect PSM is calculated, and the maximum cosine similarity is retained and regarded as the synthetic similarity of the incorrect PSM. Finally, the FNR of Synthetic-peptide can be calculated as the ratio of the number of incorrect identifications whose cosine similarities exceed 0.9 to the number of all incorrect identifications. In the process of estimating the FPR and FNR of the Synthetic-peptide method, spectra of the same collision energy (CE) were used to compute the cosine similarity, and we also discuss the similarities of synthetic spectra with the same or different collision energies (Supplementary Note 3). Experimental results on Kuster_PT indicated that Synthetic-peptide validation had an FPR of 0.06% and an FNR of 1.44% at the PSM level (Table 2). Compared with Trap-database, Open-search, pDeep, and Open-search-pDeep, Synthetic-peptide validation had a similar FPR but the lowest FNR, confirming the rationality of using Synthetic-peptide validation as the gold standard. However, compared with pValid (Figure 2, FPR 0.04%, FNR 1.44%), Synthetic-peptide validation had a higher FPR. Peptide-level comparison of FPR and FNR showed that the Synthetic-peptide validation had higher FPR and FNR than pValid (Supplementary Table 9). Furthermore, considering the time and financial cost, it is difficult nowadays to use Synthetic-peptide to validate large-scale identifications. However, pValid incurs only a computational cost and has lower FPR and FNR, which makes it the most feasible method to validate identifications from large-scale proteome data.

(Table 1). Peptide level comparison of Kuster_PT is shown in Supplementary Figure 4. Detailed comparisons of TP and TN

Table 1. True-Positive and True-Negative PSMs Comparison of pValid and the Other Four Validation Methods on Kuster_PTa

        method               only from other validation methods   consistent   only from pValid
  TP    Trap-database        0 (0.00%)                            1 047        461 (30.57%)
        Open-search          9 (1.34%)                            663          845 (56.03%)
        pDeep                10 (0.75%)                           1 328        180 (11.94%)
        Open-search-pDeep    11 (0.80%)                           1 362        146 (9.68%)
  TN    Trap-database        35 (0.04%)                           90 153       5 (0.01%)
        Open-search          34 (0.04%)                           90 137       21 (0.02%)
        pDeep                29 (0.03%)                           89 962       196 (0.22%)
        Open-search-pDeep    29 (0.03%)                           89 943       215 (0.24%)

a TP and TN in the table above respectively represent true-positive and true-negative PSMs. The percentage in parentheses is calculated as the ratio of the number of PSMs flagged by only one method to the number of all PSMs flagged by this method.

PSMs and peptides among the five validation methods on Olsen_Hela and Mann_Hela are shown in Supplementary Figures 5 and 6. All of these results suggest the superior validation ability of pValid. Figure 4 illustrates an example of an incorrect PSM that was successfully flagged by pValid yet neglected by the other four methods. The incorrect PSM is shown in Figure 4a. The first four amino acids had no supporting peaks, and the whole peptide had no b ions and hence no complementary b−y ion pairs. This spectrum was constructed from a correct PSM by adding a +10 Da mass deviation to the precursor mass (Figure 4b). In the Trap-database and Open-search methods, the search results were both the same as the incorrect peptide (Figure 4c). The pDeep similarity value was 0.84, the highest among all six peptide candidates, which made this identification pass the validation of pDeep and also Open-search-pDeep. However, pValid gave this PSM an SVM score of 0.07, flagging it as suspicious or positive. Although the pDeep similarity value of the incorrect PSM was 0.84, higher than the similarity threshold of 0.7, pValid might have learned a stricter similarity threshold for this peptide and thus flagged this PSM as positive. In the benchmark data set of Kuster_PT, consisting of 1530 incorrect PSMs, if all true-positive PSMs flagged by the other four methods were merged and compared with those flagged by pValid, pValid could still flag an extra 7.06% (108/1530) of true-positive PSMs.

Different Variants of pValid

The effect of different features on pValid is further investigated in this study. We try two alternative designs of pValid, i.e., using only the two features from Open-search (referred to as pValid_OnlyOpen) and using only the two features from pDeep (referred to as pValid_OnlypDeep), and then compare them with the original pValid. The FPR of pValid_OnlyOpen is 3−7 times larger than that of pValid, and the FPR of pValid_OnlypDeep is similar to that of pValid (Supplementary Figure 7). The FNR of pValid_OnlyOpen is 2−8 times larger than that of pValid, and the FNR of pValid_OnlypDeep is 5−19 times larger than that of pValid. Therefore, the two features from Open-search are complementary to the other two features from pDeep. We also analyze the FPR and FNR of pValid when different SVM classification thresholds are chosen rather than the fixed value of 0.5. As the threshold changes from 0.5 to 0.9, the

Comparison between pValid and Synthetic-Peptide Validation

While the Synthetic-peptide method is regarded as the gold standard for validation, few studies have characterized its performance, and this study may be the first to estimate its FPR and FNR. The FPR of Synthetic-peptide is easy to estimate as follows. First, correct identifications are the same as those of Kuster_PT (a data set chosen from ProteomeTools) constructed in the benchmark construction. Second, for each correct PSM, the cosine similarity between this PSM and each other PSM of the same peptide-charge-modifications (PCM37)


Figure 4. An example of a true-positive PSM flagged by pValid. (a) The incorrect PSM. (b) The original correct PSM without the +10 Da precursor mass deviation. *y8+ represents the neutral loss H(4)COS of y8+, which is a unique neutral loss of oxidation equaling 63.998285 Da. (c) Validation results of the five methods. Under Trap-database and Open-search validation, the results are also the same as those of the original database search. The pDeep similarity is the highest among all six related peptide candidates, and the similarity value is 0.84, which makes this identification pass the validation of pDeep and Open-search-pDeep. However, pValid gives this PSM an SVM score of 0.07, flagging it as positive, i.e., suspicious.

FPR increases and the FNR remains unchanged or decreases (Supplementary Figure 8). From 0.5 to 0.6, both the FPR and FNR change little. From 0.6 to 0.7, the change of FPR is larger than that of FNR. From 0.7 to 0.9, the FPR greatly increases and the FNR greatly decreases. The choice of the pValid threshold depends on the goal of validation. If few false-positive PSMs are expected, a threshold such as 0.5 can be used to obtain a lower FPR. If few false-negative PSMs are expected, a higher threshold such as 0.9 can be used. In fact, the FPR is less than 0.5% all the time as the threshold changes from 0.5 to 0.9. Under this circumstance, 0.9 would be a proper choice when the lowest FNR is expected.
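The trade-off described above can be examined directly by sweeping the SVM score threshold over a set of benchmark PSMs; a small Python sketch with hypothetical scores and labels (not the actual benchmark values):

    def fpr_fnr_at_threshold(svm_scores, is_correct, threshold):
        """A PSM is flagged as suspicious when its SVM score is below the threshold.
        FPR = flagged correct / all correct; FNR = unflagged incorrect / all incorrect."""
        flagged = [s < threshold for s in svm_scores]
        n_correct = sum(is_correct)
        n_incorrect = len(is_correct) - n_correct
        fp = sum(f and c for f, c in zip(flagged, is_correct))
        fn = sum((not f) and (not c) for f, c in zip(flagged, is_correct))
        return fp / n_correct, fn / n_incorrect

    # Hypothetical benchmark: scores near 1 for correct PSMs, near 0 for incorrect ones.
    scores = [0.98, 0.91, 0.85, 0.75, 0.62, 0.55, 0.45, 0.30, 0.08]
    labels = [True, True, True, False, True, True, False, False, False]
    for t in (0.5, 0.6, 0.7, 0.8, 0.9):
        fpr, fnr = fpr_fnr_at_threshold(scores, labels, t)
        print(f"threshold {t:.1f}: FPR {fpr:.2%}, FNR {fnr:.2%}")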

Table 2. FPR and FNR of Synthetic-Peptide Validation on Kuster_PT at the PSM Levela

  CE         FPR (%)    FNR (%)
  25         0.07       0.82
  30         0.07       0.96
  35         0.04       2.53
  average    0.06       1.44

a CE in the table above represents collision energy. Correct instances for estimating the FPR of Synthetic-peptide are the same as those used for the five other validation methods illustrated before, but incorrect instances for estimating the FNR are different. The average FPR of Synthetic-peptide is 0.06%, which is higher than that of pValid. The average FNR of Synthetic-peptide is 1.44%, which is equal to that of pValid.

pValid Validation on Large-Scale Complex Data Sets

The performance of pValid was first evaluated on the results of a large-scale complex data set called the Olsen data set, consisting


Figure 5. Performance of pValid on the Olsen data set. (a) Time cost of the database search and pValid. Time is given in hours, and pValid costs only about twice the time of pFind for the extra validation. (b) Validation results of all target and decoy PSMs within FDR 1%. (c) Analysis of all decoy PSMs within FDR 1%. “Target/Variants of target” means that the decoy peptides are identical to semi- or nonspecifically digested target peptides or to specifically digested target peptides with a modification or a mutation. 12.60% of the decoy PSMs related to short peptides (less than 10 amino acids) and 7.87% of the decoy PSMs related to long peptides (equal to or more than 10 amino acids) are flagged as negative PSMs by pValid.

of 46 RAW files whose titles start with 20150410_QE3_UPLC9_DBJ_SA_46fractions_Rep1 and are numbered from 1 to 46 (termed “46 fractions”).33 A total of 1 018 676 spectra from these 46 RAW files were searched by pFind, and 488 242 PSMs were reported after TDA FDR control at 1%. Search parameters were set the same as in Supplementary Table 3. The total time cost for pValid validation was only 8.1 h on a server computer (Intel Xeon CPU E5-2620, 16 cores, 128 GB RAM, Windows Server 2012 R2, 64 bit), completing the validation of all 488 242 identifications in half a day (Figure 5a). We also used a GPU (TITAN Xp, 12 GB RAM) for the pDeep prediction, and the time for this step was only half an hour, which further reduced the total running time of pValid by about 12% (1/8.1). In total, pValid took less than twice the time of pFind for the extra validation, proving the advantage of pValid over the Synthetic-peptide method in terms of time cost. Of all 488 242 target PSMs identified by pFind within 1% FDR at the peptide level, 16 711 (3.43%) were flagged as suspicious by pValid. On the other hand, 74.46% (1539/2067) of the decoy PSMs were flagged as suspicious (Figure 5b). In order to analyze the reasons why 25.54% of the decoy PSMs were flagged as negative by pValid, we first separated the decoy PSMs into two sets according to whether the peptide length was less than 10 amino acids or not. Intersected PSMs of the three search engines were used for training pValid, and they were always long peptides. According to our statistics, the shortest intersected peptide of the training data set had a length of 10 amino acids. Thus, short peptides (less than 10 amino acids) were difficult for pValid to validate, and it was better to separate them from long peptides (equal to or more than 10 amino acids). Of all 2067 decoy PSMs, 1317 identified short peptides and 750 identified long peptides (Figure 5c). For the decoy PSMs related to short peptides, 469 were flagged as negative, but 303 of them might

be correct, since these peptides were each identical to some other peptide from target proteins with a modification or a mutation (e.g., the decoy peptide GLTSVLDQK was identical to a target peptide GLTSVLNQK with a deamidation on asparagine). For the decoy PSMs related to long peptides, 59 were possible false-negative PSMs, and the false-negative rate (59/750 = 7.87%) was better than that of the decoy PSMs related to short peptides (166/1317 = 12.60%). These results were consistent with our expectation, because we used long peptides to train pValid; constructing training sets of short peptides and validating short peptides well is a goal of our future work. We also investigated the ratio of PSMs related to olfactory receptor proteins (referred to as OR PSMs) flagged by the validation methods. As reported by Ezkurdia et al., PSMs identified as OR PSMs in nonolfactory organ samples should be considered incorrect.38 Thus, these incorrect PSMs can be used to test the effect of the validation methods on positive (incorrect) instances. In the Olsen data set, only one PSM was identified as being from an olfactory protein, which was not enough for a comparison among the five methods. Therefore, a human proteome data set (PXD000561, termed Pandey_Draft),39 consisting of 25 million spectra, was searched by pFind to collect sufficient OR PSMs (Supplementary Data). In total, 135 pseudo-OR PSMs were identified after the TDA FDR control, of which 61 were removed before validation, including 19 PSMs related to nonunique peptides shared by two or more proteins,40 13 PSMs related to peptides that are the same as peptides semi- or nonspecifically digested from nonolfactory proteins, and 29 PSMs related to peptides that are the same as peptides of nonolfactory proteins with one modification or one mutation. The remaining 74 OR PSMs were used for further validation, and pValid successfully flagged the maximum number of OR PSMs (59 PSMs) among all five methods,


Figure 6. Validation results of five methods on the OR PSMs of the Pandey_Draft data set. (a) Numbers of false-negative PSMs flagged by the five methods. (b) Comparison of true-positive OR PSMs flagged by the five methods. pValid successfully flagged the highest number of incorrect OR identifications among all five methods.

incorrect identifications, and we also discuss other mass deviations for constructing trap spectra (Supplementary Note 4). As also described in the Methods section, pDeep, Open-search-pDeep, and pValid all use the top-3 peptide candidates in the validation process. However, reporting the top-3 peptide candidates is not supported by some search engines, which report only the top-1 peptide. We therefore compared using the top-3 peptides with using only the top-1 peptide in these three validation methods (i.e., pDeep, Open-search-pDeep, and pValid) (Supplementary Table 10). For pDeep and Open-search-pDeep, using the top-1 peptide led to a decrease of FPR (pDeep, from 0.26% to 0.01%; Open-search-pDeep, from 0.30% to 0.06%) and an increase of FNR (pDeep, from 10.80% to 13.74%; Open-search-pDeep, from 6.50% to 7.97%). For pValid, both the FPR and FNR increased when using the top-1 peptide (FPR, from 0.03% to 0.07%; FNR, from 1.79% to 1.97%). Moreover, for the Olsen_Hela data set, the FPR of using the top-1 peptide increased to as much as 5 times that of using the top-3 peptides (from 0.03% to 0.14%). In summary, using the top-3 peptide candidates yields smaller FPR and FNR than using the top-1, which is the main reason that we use the top-3 peptide candidates as the default setting in pValid. The validation methods of Trap-database and Open-search in this paper are similar: both of them validate PSMs by finding competitors in larger search spaces. However, the details of these two methods as presented in this study are slightly different. Trap-database validation does not compare the scores of the two PSMs searched from the original database and the trap database, while Open-search validation compares the scores of the two PSMs searched by pFind and Open-pFind, i.e., if the peptide searched by Open-pFind is different from the original peptide and has a higher score, then the original identification is flagged as suspicious. In fact, these criteria are based on our experimental results (Supplementary Tables 11 and 12). For Trap-database validation, comparing the two scores (i.e., flagging the original identification as suspicious only if the peptide searched from the trap database is different from the original peptide and has a higher score) did not affect the FPR but led to an increase of the FNR (from 56.13% to 76.71%). Thus, the Trap-database validation, which

i.e., neglected the fewest OR PSMs (15 PSMs) and uniquely flagged 13 OR PSMs (Figure 6a,b and Supplementary Data). Considering the time consumption, financial cost, and validation performance, pValid was an excellent choice for proteome-scale validation.



DISCUSSION

In this study, we have developed a credibility validation method named pValid, which combines the advantages of open search, theoretical spectrum prediction, and machine learning. We have also proposed two metrics, FPR and FNR, for evaluating the classification performance of validation methods and have established the relationship between validation accuracy (i.e., FPR and FNR) and identification accuracy (i.e., recall and precision). pValid has the lowest FPR and FNR among the five validation methods (i.e., Trap-database, Open-search, pDeep, Open-search-pDeep, and pValid) and even has lower FPR and FNR than the Synthetic-peptide validation method. In consideration of the time and financial cost, pValid is an excellent tool for the validation of large-scale identifications. As described in the Results section, only 11 true-positive PSMs were flagged by the union of the other four validation methods but failed to be flagged by pValid. The phenomenon observed in 9 of the 11 PSMs was the same as that observed in Supplementary Figure 1a. The benchmark peptide and the newly identified peptide had the same sequence but different modification sites (the second site or the fifth site); thus, the two PSMs of the same sequence and different modification sites were too similar to be distinguished. The other two PSMs that failed to be flagged by pValid each identified a different peptide instead of the benchmark peptide reported by the three search engines. However, the backbone fragment ions, especially the y ions, of the newly identified peptide were also nearly complete, and the matched peaks were complementary to those of the benchmark peptide. Thus, the different peptides may have been cofragmented with the benchmark peptides, and it is reasonable that pValid did not flag them as suspicious. In summary, pValid only misflagged a few PSMs, in which the identified peptides were similar to or cofragmented with the benchmark peptides. As described in the Methods section, +5 and +10 Da mass deviations are used to construct trap spectra to generate


2016YFA0501300 to S.-M.H.), the CAS Interdisciplinary Innovation Team (Y604061000 to S.-M.H.), the National Natural Science Foundation of China (21475141 to S.-M.H.), and the Youth Innovation Promotion Association CAS (No. 2014091 to H.C.).

does not compare scores, is the better choice. For Open-search validation, eliminating the criterion of comparing the two scores led to a decrease of FNR by a factor of 6 (from 49.05% to 8.28%); however, the FPR increased by a factor of 42 (from 0.04% to 1.69%) at the same time. Finally, for Open-search validation in this study, we chose the criterion that compares the two scores, which produces far fewer false-positive instances.





ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.8b00993. Tables summarizing the four data sets used, the search engines used, the database search parameters, the four databases used for Kuster_PT_Training, the training and test sets, FPRs and FNRs (%) of five validation methods at the peptide level, FPR and FNR of the Synthetic-peptide validation of Kuster_PT at the peptide level, comparisons of using the top-1 peptide or top-3 peptides in three validation methods, of different implementations of the Trap-database, and of different implementations of Open-search; figures showing an example identified from pFind which both Open-search and pDeep validations flag as suspicious identifications, ROC and PR curves of five validation methods on Olsen_Hela and on Mann_Hela, true-positive and truenegative PSMs and peptides comparison among five methods on Kuster_PT, on Olsen_Hela, and on Mann_Hela, influence of different features and thresholds on pValid; and notes detailing the relationship between metrics of validation (i.e., FPR and FNR) and metrics of identification (i.e., recall and error rate), the workflow of calculating FPR and FNR of Syntheticpeptide validation, the similarity of synthetic-peptide with the same or different collision energies, and the other mass deviations to construct trap spectra (Supplementary Tables 1−12, Figures 1−8, and Notes 1−4) (PDF) Supplementary data listing validation results on PSMs related to olfactory proteins identified from the Pandey_Draft data set and database search parameters for the Pandey_Draft data set (XLSX)



REFERENCES

(1) Chick, J. M.; Kolippakkam, D.; Nusinow, D. P.; Zhai, B.; Rad, R.; Huttlin, E. L.; Gygi, S. P. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat. Biotechnol. 2015, 33 (7), 743−9. (2) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5 (11), 976−89. (3) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (20), 3551−67. (4) Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008, 26 (12), 1367−72. (5) Cox, J.; Neuhauser, N.; Michalski, A.; Scheltema, R. A.; Olsen, J. V.; Mann, M. Andromeda: a peptide search engine integrated into the Max Quant environment. J. Proteome Res. 2011, 10 (4), 1794−805. (6) Zhang, J.; Xin, L.; Shan, B.; Chen, W.; Xie, M.; Yuen, D.; Zhang, W.; Zhang, Z.; Lajoie, G. A.; Ma, B. PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification. Mol. Cell. Proteomics 2012, 11 (4), M111.010587. (7) Chi, H.; He, K.; Yang, B.; Chen, Z.; Sun, R. X.; Fan, S. B.; Zhang, K.; Liu, C.; Yuan, Z. F.; Wang, Q. H.; Liu, S. Q.; Dong, M. Q.; He, S. M. pFind-Alioth: A novel unrestricted database search algorithm to improve the interpretation of high-resolution MS/MS data. J. Proteomics 2015, 125, 89−97. (8) Chi, H.; Liu, C.; Yang, H.; Zeng, W. F.; Wu, L.; Zhou, W. J.; Wang, R. M.; Niu, X. N.; Ding, Y. H.; Zhang, Y.; Wang, Z. W.; Chen, Z. L.; Sun, R. X.; Liu, T.; Tan, G. M.; Dong, M. Q.; Xu, P.; Zhang, P. H.; He, S. M. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat. Biotechnol. 2018, 36 (11), 1059−1061. (9) White, F. M. The potential cost of high-throughput proteomics. Sci. Signaling 2011, 4 (160), No. pe8. (10) Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 2014, 11 (11), 1114−25. (11) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207−14. (12) Reidegeld, K. A.; Eisenacher, M.; Kohl, M.; Chamrad, D.; Korting, G.; Bluggel, M.; Meyer, H. E.; Stephan, C. An easy-to-use Decoy Database Builder software tool, implementing different decoy strategies for false discovery rate calculation in automated MS/MS protein identifications. Proteomics 2008, 8 (6), 1129−37. (13) Feng, X. D.; Ma, J.; Chang, C.; Shu, K. X.; Zhu, Y. P. An Improved Target-decoy Strategy for Evaluation of Database Search Engines and Quality Control Methods in Shotgun Proteomics. Proceedings of the 2016 International Conference on Biomedical and Biological Engineering; Advances in Biological Sciences Research; Atlantis Press, 2016; pp 366−372, DOI: 10.2991/bbe-16.2016.56. (14) Shen, C.; Sheng, Q.; Dai, J.; Li, Y.; Zeng, R.; Tang, H. On the estimation of false positives in peptide identifications using decoy search strategy. Proteomics 2009, 9 (1), 194−204. (15) Feng, J.; Naiman, D. Q.; Cooper, B. Probability-based pattern recognition and statistical framework for randomization: modeling tandem mass spectrum/peptide sequence false match frequencies. Bioinformatics 2007, 23 (17), 2210−7. 
(16) Navarro, P.; Vazquez, J. A refined method to calculate false discovery rates for peptide identification using decoy databases. J. Proteome Res. 2009, 8 (4), 1792−6.

AUTHOR INFORMATION

Corresponding Authors

*H.C. e-mail: [email protected]. *S.-M.H. e-mail: [email protected].

ORCID

Wen-Jing Zhou: 0000-0002-5154-6156
Hao Yang: 0000-0002-1277-2628
Wen-Feng Zeng: 0000-0003-4325-2147

Notes

The authors declare no competing financial interest. pValid can be downloaded from http://pfind.ict.ac.cn/software/pValid/index.html.



ACKNOWLEDGMENTS

This work was supported in part by grants from the National Key Research and Development Program of China (No.


(17) Eisenacher, M.; Kohl, M.; Turewicz, M.; Koch, M. H.; Uszkoreit, J.; Stephan, C. Search and decoy: the automatic identification of mass spectra. Methods Mol. Biol. 2012, 893, 445−88. (18) Hather, G.; Higdon, R.; Bauman, A.; von Haller, P. D.; Kolker, E. Estimating false discovery rates for peptide and protein identification using randomized databases. Proteomics 2010, 10 (12), 2369−2376. (19) Li, H. L.; Park, J.; Kim, H.; Hwang, K. B.; Paek, E. Systematic Comparison of False-Discovery-Rate-Controlling Strategies for Proteogenomic Search Using Spike-in Experiments. J. Proteome Res. 2017, 16 (6), 2231−2239. (20) The, M.; MacCoss, M. J.; Noble, W. S.; Kall, L. Fast and Accurate Protein False Discovery Rates on Large-Scale Proteomics Data Sets with Percolator 3.0. J. Am. Soc. Mass Spectrom. 2016, 27 (11), 1719−1727. (21) Jeong, K.; Kim, S.; Bandeira, N. False discovery rates in spectral identification. BMC Bioinf. 2012, 13 (Suppl 16), S2. (22) Everett, L. J.; Bierl, C.; Master, S. R. Unbiased statistical analysis for multi-stage proteomic search strategies. J. Proteome Res. 2010, 9 (2), 700−7. (23) Fu, Y.; Qian, X. Transferred subgroup false discovery rate for rare post-translational modifications detected by mass spectrometry. Mol. Cell. Proteomics 2014, 13 (5), 1359−68. (24) Hart-Smith, G.; Yagoub, D.; Tay, A. P.; Pickford, R.; Wilkins, M. R. Large Scale Mass Spectrometry-based Identifications of Enzyme-mediated Protein Methylation Are Subject to High False Discovery Rates. Mol. Cell. Proteomics 2016, 15 (3), 989−1006. (25) Sheng, Q.; Dai, J.; Wu, Y.; Tang, H.; Zeng, R. Build Summary: using a group-based approach to improve the sensitivity of peptide/protein identification in shotgun proteomics. J. Proteome Res. 2012, 11 (3), 1494−502. (26) Liang, X.; Xia, Z.; jian, L.; Niu, X.; Link, A. An adaptive classification model for peptide identification. BMC Genomics 2015, 16 (Suppl 11), S1. (27) Grover, H.; Wallstrom, G.; Wu, C. C.; Gopalakrishnan, V. Context-Sensitive Markov Models for Peptide Scoring and Identification from Tandem Mass Spectrometry. OMICS 2013, 17 (2), 94−105. (28) Webb-Robertson, B. J. M.; Oehmen, C. S.; Cannon, W. R. Support Vector Machine Classification of Probability Models and Peptide Features for Improved Peptide Identification from Shotgun Proteomics. International Conference on Machine Learning and Applications 2007, 500−505. (29) Peng, X.; Xu, F.; Liu, S.; Li, S.; Huang, Q.; Chang, L.; Wang, L.; Ma, X.; He, F.; Xu, P. Identification of Missing Proteins in the Phosphoproteome of Kidney Cancer. J. Proteome Res. 2017, 16 (12), 4364−4373. (30) Wang, Y.; Chen, Y.; Zhang, Y.; Wei, W.; Li, Y.; Zhang, T.; He, F.; Gao, Y.; Xu, P. Multi-Protease Strategy Identifies Three PE2-Missing Proteins in Human Testis Tissue. J. Proteome Res. 2017, 16 (12), 4352−4363. (31) Zolg, D. P.; Wilhelm, M.; Schnatbaum, K.; Zerweck, J.; Knaute, T.; Delanghe, B.; Bailey, D. J.; Gessulat, S.; Ehrlich, H. C.; Weininger, M.; Yu, P.; Schlegl, J.; Kramer, K.; Schmidt, T.; Kusebauch, U.; Deutsch, E. W.; Aebersold, R.; Moritz, R. L.; Wenschuh, H.; Moehring, T.; Aiche, S.; Huhmer, A.; Reimer, U.; Kuster, B. Building ProteomeTools based on a complete synthetic human proteome. Nat. Methods 2017, 14 (3), 259−262. (32) Kulak, N. A.; Pichler, G.; Paron, I.; Nagaraj, N.; Mann, M. Minimal, encapsulated proteomic-sample processing applied to copy-number estimation in eukaryotic cells. Nat. Methods 2014, 11 (3), 319−24. (33) Bekker-Jensen, D. B.; Kelstrup, C. D.; Batth, T.
S.; Larsen, S. C.; Haldrup, C.; Bramsen, J. B.; Sorensen, K. D.; Hoyer, S.; Orntoft, T. F.; Andersen, C. L.; Nielsen, M. L.; Olsen, J. V. An Optimized Shotgun Strategy for the Rapid Generation of Comprehensive Human Proteomes. Cell Syst. 2017, 4 (6), 587−599. (34) Creasy, D. M.; Cottrell, J. S. Unimod: Protein modifications for mass spectrometry. Proteomics 2004, 4 (6), 1534−6.

(35) Zhou, X. X.; Zeng, W. F.; Chi, H.; Luo, C.; Liu, C.; Zhan, J.; He, S. M.; Zhang, Z. pDeep: Predicting MS/MS Spectra of Peptides with Deep Learning. Anal. Chem. 2017, 89 (23), 12690−12697. (36) Chang, C. C.; Lin, C. J. LIBSVM: A Library for Support Vector Machines. Acm T Intel Syst. Tec 2011, 2 (3), 1−27. (37) Savitski, M. M.; Wilhelm, M.; Hahne, H.; Kuster, B.; Bantscheff, M. A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets. Mol. Cell. Proteomics 2015, 14 (9), 2394−2404. (38) Ezkurdia, I.; Vazquez, J.; Valencia, A.; Tress, M. Correction to “Analyzing the first drafts of the human proteome. J. Proteome Res. 2015, 14 (4), 1991. (39) Kim, M. S.; Pinto, S. M.; Getnet, D.; Nirujogi, R. S.; Manda, S. S.; Chaerkady, R.; Madugundu, A. K.; Kelkar, D. S.; Isserlin, R.; Jain, S.; Thomas, J. K.; Muthusamy, B.; Leal-Rojas, P.; Kumar, P.; Sahasrabuddhe, N. A.; Balakrishnan, L.; Advani, J.; George, B.; Renuse, S.; Selvan, L. D.; Patil, A. H.; Nanjappa, V.; Radhakrishnan, A.; Prasad, S.; Subbannayya, T.; Raju, R.; Kumar, M.; Sreenivasamurthy, S. K.; Marimuthu, A.; Sathe, G. J.; Chavan, S.; Datta, K. K.; Subbannayya, Y.; Sahu, A.; Yelamanchi, S. D.; Jayaram, S.; Rajagopalan, P.; Sharma, J.; Murthy, K. R.; Syed, N.; Goel, R.; Khan, A. A.; Ahmad, S.; Dey, G.; Mudgal, K.; Chatterjee, A.; Huang, T. C.; Zhong, J.; Wu, X.; Shaw, P. G.; Freed, D.; Zahari, M. S.; Mukherjee, K. K.; Shankar, S.; Mahadevan, A.; Lam, H.; Mitchell, C. J.; Shankar, S. K.; Satishchandra, P.; Schroeder, J. T.; Sirdeshmukh, R.; Maitra, A.; Leach, S. D.; Drake, C. G.; Halushka, M. K.; Prasad, T. S.; Hruban, R. H.; Kerr, C. L.; Bader, G. D.; Iacobuzio-Donahue, C. A.; Gowda, H.; Pandey, A. A draft map of the human proteome. Nature 2014, 509 (7502), 575−81. (40) The, M.; Edfors, F.; Perez-Riverol, Y.; Payne, S. H.; Hoopmann, M. R.; Palmblad, M.; Forsstrom, B.; Kall, L. A Protein Standard That Emulates Homology for the Characterization of Protein Inference Algorithms. J. Proteome Res. 2018, 17 (5), 1879− 1886.
