
A decoy-free approach to the identification of peptides Giulia Gonnelli, Michiel Stock, Jan Verwaeren, Davy Maddelein, Bernard De Baets, Lennart Martens, and Sven Degroeve J. Proteome Res., Just Accepted Manuscript • Publication Date (Web): 25 Feb 2015 Downloaded from http://pubs.acs.org on February 26, 2015




Journal of Proteome Research

Research Article

A decoy-free approach to the identification of peptides

Giulia Gonnelli1, 2, Michiel Stock3, Jan Verwaeren3, Davy Maddelein1, 2, Bernard De Baets3, Lennart Martens1, 2 * and Sven Degroeve1, 2

1 Department of Medical Protein Research, VIB, B-9000 Ghent, Belgium

2 Department of Biochemistry, VIB, B-9000 Ghent, Belgium

3 Dept. of Mathematical Modelling, Statistics and Bioinformatics, Ghent, Belgium

* Correspondence: Professor Lennart Martens, Department of Medical Protein Research and Biochemistry, VIB and Faculty of Medicine and Health Sciences, Ghent University, A. Baertsoenkaai 3, B-9000 Ghent, Belgium Email: [email protected] Tel.: +3292649358 Fax: +32-92649484

Keywords: peptide identification, decoy databases, machine learning

Abbreviations: PSM: Peptide-to-Spectrum Match, FDR: False Discovery Rate, FP: False Positive, FN: False Negative, CPTAC: Clinical Proteomic Tumor Analysis Consortium


Abstract

A growing number of proteogenomics and metaproteomics studies indicate potential limitations of the application of the “decoy” database paradigm used to separate correct peptide identifications from incorrect ones in traditional shotgun proteomics. We therefore propose a binary classifier called Nokoi that allows fast yet reliable decoy-free separation of correct from incorrect peptide-to-spectrum matches (PSMs). Nokoi was trained on a very large collection of heterogeneous data, using ranks supplied by the Mascot search engine to label correct and incorrect PSMs. We show that Nokoi outperforms Mascot and achieves a performance very close to that of Percolator, at substantially higher processing speeds.


Introduction

Over the past several years, mass-spectrometry-based proteomics has become the main technique for the identification and quantification of proteins1. In a typical workflow, experimental fragmentation spectra are acquired from peptides originating from a proteolytic digest of sample proteins, and these fragmentation spectra are subsequently matched to biological entities via specialized software tools called search engines. Several such search engines exist, such as Mascot2, X!Tandem3, OMSSA4, Sequest5, Crux6 and InsPecT7. The identification methods used by these tools work by estimating the likelihood, in the form of a matching score, that an observed spectrum was generated from a given peptide sequence. The output from a search engine consists of a list of possible peptide-to-spectrum matches (PSMs), each associated with a score and a corresponding rank. This list is typically further processed using a statistical method to estimate the expected number of false positives (FP) and the False Discovery Rate (FDR). The FDR is typically estimated as the ratio of incorrect PSMs to all PSMs accepted at a given score cutoff8,9.

In mass-spectrometry-based proteomics, so-called ‘target-decoy’ approaches are commonly used to establish an FDR. In these approaches, the experimental spectra are also searched against artificial, ‘decoy’ databases (often created by reversing or shuffling the sequences in the original database), and the number of hits in these databases is then used as an estimate of the number of incorrect hits obtained when searching the original database. The FDR is then estimated as the ratio of the number of decoy matches above threshold to the number of target database hits above threshold. However, this approach is not always optimal, especially in the two compelling scenarios of proteogenomics10 and metaproteomics11, where the increased size of the search database and the high similarity between sequences make the task of separating correct from incorrect PSMs extremely challenging. Indeed, in these two scenarios, the experimental spectra are either searched against whole-genome databases that generate a large number of translated sequences that do not exist in nature (i.e. only one out of the six reading frames actually accounts for a functional protein product), or, in the case of metaproteomics, against a composite database compiled from many closely related genomes, which leads to many highly similar sequences. These factors interfere with the construction of a proper target database, which should contain only


“correct” sequences, and contribute to a vast increase in potential false positive hits12. It thus becomes extremely difficult to separate correct hits from incorrect ones, and to estimate the FDR in a reliable manner13,14. Furthermore, the use of a decoy database in proteogenomics analyses of complex organisms may also be problematic for purely technical reasons: it is very hard to construct a genome-sized decoy database that does not contain any peptides observed in the original genome, but that follows a nucleotide distribution similar to that of the target database. Therefore, several strategies have been developed to overcome the problem of searching such big, error-prone databases15,16,17,18.

All of these strategies are, however, either based on somehow reducing the size of the database, or on the use of initial assumptions to prevent statistical bias originating from the huge number of incorrect sequences gathered in the six-reading-frame target database. In this work we propose an alternative way to model incorrect PSMs by using lower scoring, and therefore lower ranked, PSMs provided by the search engine. We train a binary classifier, called Nokoi, on forty-seven features that describe a PSM, and label correct and incorrect matches using first and lower ranks, respectively. We compare the performance of Nokoi to that of Percolator, showing very good identification performance along with short processing times. Although solutions for computing the FDR without a decoy database exist (e.g. mixture modeling of the target PSM score distributions19,20,21, or using generating functions22), decoy approaches have remained far more popular, mainly because the decoy approach does not rely on restrictive parametric assumptions23.

Experimental Procedures

Training data retrieval — 2,229,663 PSMs were extracted from our in-house repository ms-lims24. These PSMs derive from thirty distinct data sets from different samples, processed at different times on different Orbitrap/ion-trap instruments. All samples used were treated with trypsin, and the precursor ion charge was limited to +2 and +3. For each experiment, the resulting MS/MS spectra were matched with appropriate search settings to the respective protein database using the Mascot search engine. PSMs up to rank five were collected and extracted from the resulting Mascot output files using the MascotDatfile tool25. Several sets of PSMs were created according to the rank assigned by


Mascot: the rank1 set contains those PSMs that were ranked first by Mascot and that were scored above the Mascot significance threshold at 99% confidence, while the rank2, rank3, rank4 and rank5 sets contain those PSMs ranked second to fifth, respectively. An additional decoy set, containing decoy PSMs, was created for evaluation purposes. This latter data set was obtained by searching 26,968 random subsets of MS/MS spectra from the initial thirty experiments against shuffled versions of the original target databases, yielding a grand total of 53,915 decoy hits. It must be noted that some false positives may still be present among the selected rank1 PSMs used to model correct hits, as well as some false negatives among the rank2 PSMs used to model incorrect ones. This is a well-known problem of peptide identification in MS-based proteomics, and it affects all real-world datasets. Yet even though this sort of label noise is most likely unavoidable in a general sense, it need not interfere with the creation of accurate models26.
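The rank-based labelling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PSM` record, its field names and the per-spectrum `threshold` field are hypothetical, and the actual extraction in this work was done with MascotDatfile and in-house scripts.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PSM:
    """Hypothetical PSM record; field names are assumptions for illustration."""
    spectrum_id: str
    peptide: str
    rank: int         # rank assigned by the search engine (1 = best)
    score: float      # search engine score
    threshold: float  # significance threshold for this spectrum


def label_psm(psm: PSM, negative_ranks: Tuple[int, ...] = (2,)) -> Optional[int]:
    """Return 1 (positive class), 0 (negative class) or None (not used)."""
    if psm.rank == 1:
        # Only rank1 PSMs scoring above the threshold model correct matches.
        return 1 if psm.score > psm.threshold else None
    if psm.rank in negative_ranks:
        # Lower-ranked PSMs model incorrect matches.
        return 0
    return None


def build_training_labels(psms: List[PSM],
                          negative_ranks: Tuple[int, ...] = (2,)) -> List[Tuple[PSM, int]]:
    """Keep only the PSMs that receive a label, paired with that label."""
    labelled = []
    for psm in psms:
        label = label_psm(psm, negative_ranks)
        if label is not None:
            labelled.append((psm, label))
    return labelled


# Toy input (invented values, not data from this study).
example_psms = [
    PSM("s1", "PEPTIDEK", 1, 62.0, 35.0),
    PSM("s1", "PEPTLDEK", 2, 20.0, 35.0),
    PSM("s2", "AAAK", 1, 12.0, 30.0),   # rank1 but below threshold: dropped
    PSM("s2", "AAGK", 3, 8.0, 30.0),
]
labels_rank2 = [lab for _, lab in build_training_labels(example_psms)]
labels_rank2to5 = [lab for _, lab in build_training_labels(example_psms, (2, 3, 4, 5))]
```

Passing a different `negative_ranks` tuple corresponds to choosing rank3, rank4, rank5 or a combination of ranks as the negative class.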

Feature computation — Each PSM is represented by a forty-seven-dimensional feature vector. Most of these features were based on the features used in Mascot Percolator27 and describe the PSM at the spectrum and at the peptide level. It should be noted that the Mascot score is not included among the features, as we show that it is not necessary to obtain very good performance. Features were computed using in-house Perl scripts. A complete picture of the workflow is given in Figure 1. Table 1 lists all feature classes along with a brief description. A complete list of all forty-seven features used to train the classifier is reported in Supplementary Table 1. These forty-seven features were used as input features for both Nokoi and Percolator (version 2.04) during the model building and evaluation of these methods.

Model building — Nokoi uses an L1-regularized logistic regression classifier28. To train this classifier, a balanced training dataset was created by randomly sampling 10,000 rank1 PSMs and 10,000 rank2 PSMs from the initial dataset. The rank1 PSMs above the Mascot threshold model correct PSMs and form the positive class, while the rank2 PSMs model incorrect PSMs and form the negative class. When a regularized classifier is used, finding a good value for the regularization parameter is key to the performance of the classifier, as it prevents overfitting. To tune the regularization parameter (λ), a 10-fold cross-validation scheme was used. Within this cross-validation scheme, the following range of values for λ was tested: $[2^{-10}, 2^{-9}, \ldots, 2^{5}]$. Once an optimal value for λ was found, the model was retrained with that λ on the full dataset. The R package ‘glmnet’29 was used to train the model. This package implements elastic-net regularization for logistic regression. The elastic net is a generalization of L1- and L2-regularization, but with the appropriate settings, an L1-regularized model can easily be obtained. In order to test the model that was learned, a separate test set of 20,000 PSMs, not containing the PSMs used to train the model, was sampled from the initial dataset. The AUC obtained on that set was about 99%, showing that our model is capable of generalizing from the data. In the explanation above, rank2 PSMs were used to form the negative class. However, the same approach was used in settings where the negative class was formed by rank3 to rank5 PSMs. Similar AUCs were obtained for these models. A plot of the largest coefficients assigned by the logistic regression to the features used by Nokoi is shown in Supplementary Figure S1.
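The tuning procedure above (a grid of λ values, 10-fold cross-validation, then retraining on the full set) can be illustrated with a small, dependency-free Python sketch. It is not the glmnet implementation used in this work: a plain proximal-gradient solver stands in for glmnet, held-out accuracy stands in for whatever criterion was used to select λ, and the three-feature toy data are invented for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    # Proximal operator of t * |v|; this is what produces exact zeros (L1).
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logreg(X, y, lam, n_iter=30, lr=0.5):
    """Proximal gradient descent on mean log-loss + lam * ||beta||_1.
    The intercept is not penalized."""
    n, d = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * d
    for _ in range(n_iter):
        resid = [sigmoid(beta0 + sum(b * x for b, x in zip(beta, row))) - t
                 for row, t in zip(X, y)]
        beta0 -= lr * sum(resid) / n
        for j in range(d):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            beta[j] = soft_threshold(beta[j] - lr * grad_j, lr * lam)
    return beta0, beta


def accuracy(model, X, y):
    beta0, beta = model
    hits = sum((sigmoid(beta0 + sum(b * x for b, x in zip(beta, row))) >= 0.5) == (t == 1)
               for row, t in zip(X, y))
    return hits / len(y)


def cross_validate(X, y, lam, k=10):
    """Mean held-out accuracy over k folds (sample i belongs to fold i % k)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        test = [i for i in range(len(y)) if i % k == fold]
        model = fit_l1_logreg([X[i] for i in train], [y[i] for i in train], lam)
        scores.append(accuracy(model, [X[i] for i in test], [y[i] for i in test]))
    return sum(scores) / k


# Grid from the text: 2^-10, 2^-9, ..., 2^5.
LAMBDA_GRID = [2.0 ** e for e in range(-10, 6)]

# Toy data (assumption): two informative features and one noise feature.
random.seed(0)
X, y = [], []
for i in range(60):
    label = (i // 10) % 2            # keeps every fold class-balanced
    center = 1.5 if label == 1 else -1.5
    X.append([center + random.uniform(-1, 1),
              center + random.uniform(-1, 1),
              random.uniform(-1, 1)])
    y.append(label)

cv_scores = {lam: cross_validate(X, y, lam) for lam in LAMBDA_GRID}
best_lam = max(LAMBDA_GRID, key=lambda lam: cv_scores[lam])
final_model = fit_l1_logreg(X, y, best_lam)  # retrain on the full data set
```

With a very large λ the penalty dominates and all coefficients are driven to exactly zero, which is why the grid needs to extend to small values as well.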

Evaluation data retrieval (CPTAC) — Publicly available data obtained from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) initiative, which was supported by the NCI30, were used to evaluate the models. Here, six yeast samples with spiked-in Sigma 48 UPS1 standard were analysed in triplicate by three different laboratories on LTQ-Orbitrap instruments. All samples from the three laboratories together accounted for a total of 52 datasets to assess the performance of the classifier. All CPTAC datasets were searched with Mascot version 2.4.1 against a target database containing both yeast and Sigma 48 proteins. The following settings were used: acetylation of N-terminus, carbamidomethylation of cysteine, pyro-Glu formation of N-terminal glutamine, pyro-Glu formation of N-terminal glutamate, oxidation of methionine, pyro-carbamidomethylation of N-terminal cysteine, a peptide tolerance of 10 ppm, a fragment tolerance of 0.5 Dalton (Da), 1 missed cleavage, and possible precursor charges of +2 and +3. Separate decoy searches were also performed against the shuffled version of the combined yeast and Sigma 48 database. In order to be able to use the CPTAC data as a test set, MascotDatfile was used to extract all rank1 PSMs from both target and decoy results from the Mascot searches and the 47


features were computed for each rank1 PSM. The FDR was computed as the number of decoy hits divided by the number of target hits.
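As an illustration of this FDR definition, the following Python sketch computes the decoy-to-target ratio at a score threshold and the largest number of target PSMs that can be accepted at a given FDR. The function names, the use of `>=` at the threshold and the toy scores are assumptions made for the example, not details taken from the Nokoi scripts.

```python
from typing import List


def fdr_at_threshold(target_scores: List[float],
                     decoy_scores: List[float],
                     threshold: float) -> float:
    """FDR estimate: decoy hits divided by target hits at or above the threshold."""
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    if n_target == 0:
        return 0.0
    return n_decoy / n_target


def targets_at_fdr(target_scores: List[float],
                   decoy_scores: List[float],
                   max_fdr: float = 0.01) -> int:
    """Largest number of target PSMs accepted with estimated FDR <= max_fdr."""
    best = 0
    # Every distinct target score is a candidate cutoff (quadratic, but simple).
    for threshold in sorted(set(target_scores), reverse=True):
        n_target = sum(s >= threshold for s in target_scores)
        if fdr_at_threshold(target_scores, decoy_scores, threshold) <= max_fdr:
            best = max(best, n_target)
    return best


# Toy scores (invented): six target PSMs and two decoy PSMs.
toy_targets = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60]
toy_decoys = [0.75, 0.30]
n_at_1_percent = targets_at_fdr(toy_targets, toy_decoys, 0.01)
```

In this toy case the first decoy enters between the fourth and fifth target score, so four target PSMs are retained at 1% FDR.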

Availability — The source code and manual of all scripts making up Nokoi, as well as a test data set, are available online at http://genesis.ugent.be/files/costore/Nokoi_utilities.zip, under the permissive Apache2 open source software license. Nokoi runs on Unix-based systems (Linux and Mac OS X) and requires Java, Perl and R.

Results

Performance of the classifier and ability to model PSMs with ranks — In a usual proteomics pipeline, the experimental spectra are matched to peptide sequences through a search engine. As a result of this matching process, a list of scored PSMs is obtained. Most often, more than one peptide is found to match a spectrum, usually with a different score. The PSM with the highest score is then taken to be the correct match and used in downstream analysis, while lower ranked PSMs are excluded. However, the latter PSMs could contain useful information, as they represent suboptimal hits and can thus be used to model incorrect matches. Figure 2 shows the Mascot score distributions of 10,000 PSMs, sampled at random from the initial dataset, belonging to: rank1 scored above the Mascot significance threshold at 99% confidence, rank2, rank3, rank4, rank5 and decoys, respectively. Rank1 PSMs are clearly separated from lower ranks and decoys, while the latter are increasingly similar at decreasing ranks. This indicates that lower-ranked hits bear a useful similarity to decoys, and can thus serve as replacements for these decoys. An additional figure of individually stacked histograms for each rank is shown in Supplementary Figure S2. Five binary classifiers were trained on different sets of PSMs (extracted from our in-house database): Nokoi1_2, trained on rank1 (positive class) and rank2 PSMs (negative class); Nokoi1_3, on rank1 (positive class) and rank3 PSMs (negative class); Nokoi1_4, on rank1 (positive class) and rank4 PSMs (negative class); Nokoi1_5, on rank1 (positive class) and rank5 PSMs (negative class); and Nokoi1_2.3.4.5, on rank1 (positive class) and a combination of rank2, 3, 4 and 5 PSMs (negative class). The effect of subsampling is shown in Supplementary Figure S3, which indicates that the model is stable.


Unsurprisingly, the Nokoi1_2 model shows a larger coefficient of variation than the other models. This effect might be due to the higher similarity between rank1 and rank2 peptides as compared with the similarity between rank1 and lower ranked peptides. This makes the task of separating good from bad PSMs more difficult for the model trained on rank1 and rank2 peptides12. An additional model, Nokoi1_decoy, was trained on rank1 and decoy PSMs in order to allow for a comparison between the decoy-trained model and the rank-based models. For each individual PSM of the test set, Nokoi outputs a score reflecting the probability that the PSM is a correct match. The ability of each model to separate correct from incorrect matches was tested on the 52 CPTAC datasets. Figure 3 shows the performance of the six models, here computed as the number of target identifications at a fixed 1% FDR. Interestingly, all models perform similarly, indicating that ranks can be used to model correct and incorrect hits effectively and that lower ranks can be a plausible substitute for decoys. Furthermore, the Nokoi1_decoy model exhibits the lowest overall performance, suggesting that decoy PSMs may even be slightly sub-optimal models of incorrect matches. This phenomenon is more pronounced when looking at the increased number of identified PSMs when using Nokoi trained on rank1 and lower ranks compared to Nokoi trained on rank1 and decoys, as reported in Supplementary Figure S4.

Computation of the FDR using the CPTAC data: a comparison between Mascot, Percolator and Nokoi — An important aspect of shotgun proteomics is the estimation of the false discovery rate. Indeed, when choosing a peptide identification tool, one wants to obtain the highest number of reliable peptides to be able to characterize as many proteins as possible from the original biological sample. As such, the best tools are typically the ones that are able to provide the highest number of peptides at a given FDR. State-of-the-art tools like Percolator, which works by rescoring the list of PSMs obtained as the output from a search engine, are able to significantly improve the number of correctly identified peptides. We therefore compared the number of retained peptides at a fixed 1% FDR between Nokoi, the well-known Mascot search engine, and the popular and highly performant post-processor Percolator. As depicted in Figure 4, Nokoi identified more PSMs than Mascot, yet performed slightly less well than Percolator; however, the numbers of correctly identified peptides for Nokoi and Percolator are overall quite similar. It should be noted that the overlap in


identified peptide sequences for the three tools is very high, as shown in Supplementary Figure S5. The slight advantage that Percolator has over Nokoi likely stems from the retraining of Percolator for every individual data set, while Nokoi uses a fixed prediction model that was trained only once, on a different (albeit very large) data set. This recurring training phase in the case of Percolator comes with a considerable computational cost. In contrast, Nokoi does not need to go through a training phase and simply classifies each PSM individually. Note that this also means that Nokoi can be parallelized very easily, while this is harder to achieve for Percolator31. Indeed, bypassing the retraining of the algorithm, coupled with the use of a simpler prediction model, dramatically speeds up the classification task. An evaluation of the running time for both Nokoi and Percolator is available in Supplementary Figures S6 and S7.
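Because a fixed logistic model scores every PSM independently, the work can be split into arbitrary chunks without changing the result. The Python sketch below illustrates this idea only; the coefficients, feature rows and function names are invented, and threads are used merely to keep the example self-contained (for CPU-bound scoring a process pool or separate jobs would be the more likely choice).

```python
import math
from concurrent.futures import ThreadPoolExecutor


def score_psm(features, intercept, coefficients):
    """Score of one PSM under a fixed, pre-trained logistic model."""
    z = intercept + sum(c * f for c, f in zip(coefficients, features))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score_chunk(chunk, intercept, coefficients):
    # No state is shared between PSMs, so chunks can be scored in any order.
    return [score_psm(row, intercept, coefficients) for row in chunk]


def score_all(feature_rows, intercept, coefficients, chunk_size=2, workers=2):
    """Split the PSMs into chunks, score the chunks concurrently, and merge."""
    chunks = [feature_rows[i:i + chunk_size]
              for i in range(0, len(feature_rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order of the input chunks.
        results = list(pool.map(lambda c: score_chunk(c, intercept, coefficients), chunks))
    return [s for chunk_scores in results for s in chunk_scores]


# Hypothetical two-feature model and toy feature rows.
toy_intercept, toy_coefficients = -0.5, [1.2, 0.8]
toy_rows = [[0.0, 0.0], [1.0, 2.0], [-3.0, 0.5], [2.0, -1.0], [0.5, 0.5]]

serial_scores = [score_psm(r, toy_intercept, toy_coefficients) for r in toy_rows]
parallel_scores = score_all(toy_rows, toy_intercept, toy_coefficients)
```

The chunked result is identical to the serial one, which is the property that makes this kind of classifier straightforward to distribute.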

Conclusion

In this work we have introduced a new approach to the target/decoy paradigm, in which the decoy PSMs are not derived from searches against dedicated decoy databases, but rather are selected from the lower-ranked hits from the target database. Our tool, called Nokoi, is therefore proposed as a decoy-database-free alternative that is trained on PSMs from different ranks. Our novel method is very fast, and capable of quickly processing a large number of PSMs. Finally, even though Nokoi has been trained on data derived primarily from human and mouse PSMs, it has been shown to work well for the fully independent CPTAC yeast data sets. This ability to transfer the pre-trained model across data sets of different origins is a key benefit of Nokoi. In summary, Nokoi demonstrates that a PSM classification model can be successfully trained using lower-ranked PSMs instead of decoy sequences, and that this model shows good performance as well as broad applicability without requiring retraining.


Acknowledgments

This work was funded by the Multidisciplinary Research Partnership “Bioinformatics: from nucleotides to networks” of Ghent University. S.D. is supported by the IWT SBO grant ‘INSPECTOR’ (120025) and L.M. and S.D. also acknowledge support from the PRIME-XS project, grant agreement number 262067, funded by the European Union 7th Framework Program.


Figures and tables

| Feature class | Description |
| --- | --- |
| charge | charge of the precursor ion |
| log tot int | logarithm of the total spectrum intensity |
| norm highest peak | intensity of the highest peak in the spectrum normalized by the total spectrum intensity |
| pep len | number of amino acids in the peptide sequence |
| num mod | number of modified residues (including termini) |
| exp mass | experimental value of the precursor mass |
| calc mass | calculated value of the precursor mass |
| uniqueDM | (calc_mass - exp_mass) with isotope correction, in Dalton |
| uniqueDMppm | (calc_mass - exp_mass) with isotope correction, in ppm |
| nm pl | ratio of num mod to pep len |
| norm matched int | sum of the intensity of all matched fragment ions normalized by the total spectrum intensity |
| log matched int | logarithm of norm matched int |
| fragment ion norm int | normalized intensity of the matched fragment ions for the following ion series: b, y, b++, y++, b nl, y nl, b++ nl, y++ nl |
| fragment ion ratio | ratio of the number of matched fragment ions of the same ion series (series as above) to the total number of matched fragment ions of the spectrum |
| count ion series | number of matched fragment ions per ion series |
| long ion series | longest consecutive series of matched fragment ions for each ion series, e.g. for (b1, b2, b3, b6) the longest series is 3 (b1, b2, b3) |
| medianfragDelta Da | median value of all matched fragment ion errors in Dalton |
| meanfragDelta Da | mean value of all matched fragment ion errors in Dalton |
| iqr fragDelta Da | interquartile range of all matched fragment ion errors in Dalton |

Table 1. Features used to train the Nokoi binary classifier.
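A few of the feature classes in Table 1 are simple enough to sketch in Python. The helpers below are illustrative assumptions rather than a port of the in-house Perl scripts: in particular, the ppm error is taken relative to the calculated mass and the isotope correction mentioned in the table is left out.

```python
import statistics
from typing import Iterable, List


def longest_consecutive_series(matched_positions: Iterable[int]) -> int:
    """'long ion series': longest run of consecutive matched ion indices."""
    longest = current = 0
    previous = None
    for pos in sorted(set(matched_positions)):
        if previous is not None and pos == previous + 1:
            current += 1
        else:
            current = 1
        longest = max(longest, current)
        previous = pos
    return longest


def mod_ratio(num_mod: int, pep_len: int) -> float:
    """'nm pl': ratio of modified residues to peptide length."""
    return num_mod / pep_len


def mass_error_ppm(calc_mass: float, exp_mass: float) -> float:
    """Precursor mass error in ppm (no isotope correction in this sketch)."""
    return (calc_mass - exp_mass) / calc_mass * 1e6


def median_fragment_error(errors_da: List[float]) -> float:
    """'medianfragDelta Da': median of the matched fragment ion errors."""
    return statistics.median(errors_da)


# The example from Table 1: b1, b2, b3 and b6 are matched.
b_series_run = longest_consecutive_series([1, 2, 3, 6])
```

For the matched b-ions (b1, b2, b3, b6) the helper returns 3, in line with the example given in the table.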


Figure 1. Schematic flow of the pipeline used to build Nokoi. Data were collected from the ms-lims repository and PSMs were extracted using the Mascot datfile parser. Afterwards, features were computed and Nokoi classifiers were trained on these features. As a last step, the trained models were evaluated on the CPTAC dataset.


Figure 2. Score distribution according to ranks and decoy PSMs. The distribution of highly confident Rank1 PSMs, which were above the 99% Mascot significance threshold, is clearly well-separated from the lower ranked PSMs and decoy PSMs. Decoy PSMs and lower ranked PSMs have been assigned by Mascot with low scores and follow similar distributions. Note that the bars of the histograms corresponding to the lower ranked PSMs and the decoy PSMs are stacked for clarity.


[Figure 3 plot: y-axis: #PSMs (600 to 4000); x-axis: model (nokoi1_2, nokoi1_2.3.4.5, nokoi1_3, nokoi1_4, nokoi1_5, nokoi1_decoy)]

Figure 3. Boxplots of the number of PSMs at 1% FDR for six different models trained using different sets of PSMs: rank1 to model correct PSMs (positive class) and the specified rank or decoy hits to model incorrect PSMs (negative class). The plot displays, for each of these six Nokoi models, the number of PSMs found at 1% FDR across fifty-two CPTAC experiments. All models perform similarly, providing comparable numbers of identifications.


[Figure 4 plot: y-axis: #PSMs (0 to 4000); x-axis: method (mascot, Nokoi1_2.3.4.5, percolator)]

Figure 4. Line plot of the identification rate at 1% FDR: a comparison between Mascot, Nokoi and Percolator. The number of PSMs is plotted for each of the 52 CPTAC experiments. Lines are coloured according to the tool used. Nokoi consistently outperforms Mascot, and its overall performance is comparable to Percolator.


Supporting Information Available: Description of the material

Supplementary Figure S1. Boxplot of the logistic regression coefficients of the Nokoi features across 10 repetitions.

Supplementary Figure S2. Histograms of Mascot scores for rank 1, 2, 3, 4, 5 and decoy PSMs.

Supplementary Figure S3. Boxplots of the coefficient of variation of the number of PSMs at 1% FDR measured ten times for six different Nokoi models.

Supplementary Figure S4. Boxplots of the increase (percentage) in the number of PSMs retained at 1% FDR between Nokoi trained on rank1 and lower ranks (Nokoi1_2.3.4.5) and Nokoi trained on rank1 and decoy PSMs.

Supplementary Figure S5. Percentage of peptide sequence overlap between Nokoi and Mascot/Percolator for the 52 CPTAC evaluation sets.

Supplementary Figure S6. Running time estimation of Nokoi and Percolator on an increasing number of PSMs.

Supplementary Figure S7. Running time estimation of Nokoi and Percolator on the 52 CPTAC experiments.

Supplementary Table 1. Complete list of the 47 features used to train Nokoi and the corresponding feature class as reported in the article.


References

1. Aebersold, R. & Mann, M. Mass spectrometry-based proteomics. Nature 422, 198–207 (2003).

2. Perkins, D. N., Pappin, D. J., Creasy, D. M. & Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–67 (1999).

3. Craig, R. & Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 20, 1466–7 (2004).

4. Geer, L. Y. et al. Open mass spectrometry search algorithm. J. Proteome Res. 3, 958–64

5. Diament, B. J. & Noble, W. S. Faster SEQUEST searching for peptide identification from tandem mass spectra. J. Proteome Res. 10, 3871–9 (2011).

6. Park, C. Y., Klammer, A. A., Käll, L., MacCoss, M. J. & Noble, W. S. Rapid and accurate peptide identification from tandem mass spectra. J. Proteome Res. 7, 3022–7 (2008).

7. Tanner, S. et al. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 77, 4626–39 (2005).

8. Käll, L., Storey, J. D., MacCoss, M. J. & Noble, W. S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 7, 29–34 (2008).

9. Elias, J. E. & Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 4, 207–14 (2007).

10. Renuse, S., Chaerkady, R. & Pandey, A. Proteogenomics. Proteomics 11, 620–30 (2011).

11. Muth, T., Benndorf, D., Reichl, U., Rapp, E. & Martens, L. Searching for a needle in a stack of needles: challenges in metaproteomics data analysis. Mol. Biosyst. 9, 578–85 (2013).

12. Colaert, N., Degroeve, S., Helsens, K. & Martens, L. Analysis of the Resolution Limitations of Peptide Identification Algorithms. J. Proteome Res. 10, 5555–5561 (2011).

13. Ansong, C., Purvine, S. O., Adkins, J. N., Lipton, M. S. & Smith, R. D. Proteogenomics: needs and roles to be filled by proteomics in genome annotation. Brief. Funct. Genomic. Proteomic. 7, 50–62 (2008).

14.

Castellana, N. & Bafna, V. Proteogenomics to discover the full coding content of genomes: a computational perspective. J. Proteomics 73, 2124–35 (2010).

15.

Jagtap, P. et al. A two-step database search method improves sensitivity in peptide sequence matches for metaproteomics and proteogenomics studies. Proteomics 13, 1352–7 (2013).

16.

Helmy, M., Sugiyama, N., Tomita, M. & Ishihama, Y. Mass spectrum sequential subtraction speeds up searching large peptide MS/MS spectra datasets against large nucleotide databases for proteogenomics. Genes Cells 17, 633–44 (2012).

17.

Bitton, D. A., Smith, D. L., Connolly, Y., Scutt, P. J. & Miller, C. J. An integrated mass-spectrometry pipeline identifies novel protein coding-regions in the human genome. PLoS One 5, e8949 (2010).

18.

Blakeley, P., Overton, I. M. & Hubbard, S. J. Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies. J. Proteome Res. 11, 5221–34 (2012).

19.

Keller, A., Nesvizhskii, A. I., Kolker, E. & Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74, 5383–92 (2002).

20.

Nesvizhskii, A. I., Keller, A., Kolker, E. & Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 75, 4646–58 (2003).

21.

Renard, B. Y. et al. Estimating the confidence of peptide identifications without decoy databases. Anal. Chem. 82, 4314–8 (2010).

22.

Kim, S., Gupta, N. & Pevzner, P. A. Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases. J. Proteome Res. 7, 3354–63 (2008).

23.

Choi, H., Ghosh, D. & Nesvizhskii, A. I. Statistical Validation of Peptide Identifications in Large-Scale Proteomics Using the Target-Decoy Database Search Strategy and Flexible Mixture Modeling. J. Proteome Res. 7, 286–292 (2008).

24.

Helsens, K. et al. ms_lims, a simple yet powerful open source laboratory information management system for MS-driven proteomics. Proteomics 10, 1261–4 (2010).

25.

Helsens, K., Martens, L., Vandekerckhove, J. & Gevaert, K. MascotDatfile: an open-source library to fully parse and analyse MASCOT MS/MS search results. Proteomics 7, 364–6 (2007).

26.

Frenay, B. & Verleysen, M. Classification in the Presence of Label Noise: A Survey. IEEE Trans. Neural Networks Learn. Syst. 25, 845–869 (2014).

27.

Brosch, M., Yu, L., Hubbard, T. & Choudhary, J. Accurate and Sensitive Peptide Identification with Mascot Percolator. J. Proteome Res. 8, 3176–3181 (2009).

28.

Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning. (Springer New York Inc., 2001).

29.

Friedman, J., Hastie, T. & Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 33, 1–22 (2010).

30.

Paulovich, A. G. et al. Interlaboratory study characterizing a yeast performance standard for benchmarking LC-MS platform performance. Mol. Cell. Proteomics 9, 242–54 (2010).

31.

Verheggen, K., Barsnes, H. & Martens, L. Distributed computing and data storage in proteomics: Many hands make light work, and a stronger memory. Proteomics 14, 367–377 (2014).

[Figure: #PSMs (y-axis) by model (x-axis). Labels recovered from the plot: nokoi1_2, nokoi1_2.3.4.5, nokoi1_3, nokoi1_4, nokoi1_5, nokoi1_decoy.]