Origin of Disagreements in Tandem Mass Spectra Interpretation by


Article

On the origin of disagreements in MS/MS spectra interpretation by search engines
Dominique Tessier, Virginie Lollier, Colette Larré, and Hélène Rogniaux
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.6b00024 • Publication Date (Web): 29 Aug 2016
Downloaded from http://pubs.acs.org on August 29, 2016


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

On the origin of disagreements in MS/MS spectra interpretation by search engines

Dominique Tessier*, Virginie Lollier, Colette Larré, Hélène Rogniaux

INRA, UR 1268 Biopolymères Interactions Assemblages, F-44300 Nantes

Abstract: Several proteomic database search engines that interpret LC-MS/MS data do not identify the same set of peptides. These disagreements occur even when the scores of the peptide-to-spectrum matches suggest good confidence in the interpretation. Our study shows that the disagreements observed in the interpretation of a given spectrum are almost exclusively due to variation in what we call the “peptide space”, i.e., the set of peptides that are actually compared to the experimental spectra. We discuss the potential difficulties of precisely defining the “peptide space.” Indeed, although several parameters that are generally reported in publications can easily be set to the same values, many additional parameters, with much less straightforward user access, might affect the “peptide space” used by each program. Moreover, in a configuration where each search engine identifies the same candidates for each spectrum, protein inference may still differ considerably depending on the FDR selected.

KEYWORDS: shotgun proteomics, bioinformatics, database search engine, spectrum identification


INTRODUCTION

Comparison to protein databases is the most common strategy for interpreting the thousands of tandem mass spectra (MS/MS spectra) generated in a single run by shotgun proteomics. A large number of algorithms implement this approach in two steps, and all of them are based mainly on the same paradigm. In the first step, they provide a peptide-spectrum scoring function that measures how closely the experimental spectra fit the “ideal” spectra extrapolated from the predicted fragmentation of peptides derived from a protein database. This step may yield several peptide candidates for a given spectrum. The top-scoring candidate peptide is then assigned to the spectrum, resulting in a peptide-spectrum match (PSM). After a PSM is created for each observed spectrum, the complete list of PSMs is sorted by score. A threshold based on a measure of statistical significance determines which PSMs are accepted according to a given false discovery rate1 (FDR). In the second step, the retained PSMs are used to infer which proteins might be present in the sample.

Many programs are available to interpret MS/MS spectra, and despite an extensive literature comparing their performance, none of them clearly stands out from the others. What remains surprising, however, is that different search engines generate different PSMs for the same set of spectra. If the score of a PSM is too weak to be significant, we expect interpretations to vary between search engines. Unfortunately, as already described2, we observed that such disagreements might appear even when the score is high, i.e., when the score suggests good confidence in the identification.
The disagreements between programs have often been attributed to differences in the underlying algorithms, which are consequently often considered “black boxes.” Analyzing these disagreements is complicated because many parameters in the data processing workflow can influence the PSMs proposed by the search engines. As summarized in Figure 1, these parameters can be roughly classified into three categories. The first category encompasses the parameters related to spectra, the second is related to the scoring function of the search engine, and the third determines the set of peptides compared to the spectra. We refer to this set as the “peptide space.” Because the program algorithms follow the same paradigm, we tested the hypothesis that an identical “peptide space” explored by different programs could lead to the same top-ranked PSMs. If the algorithms can be configured so that they identify the same set of PSMs, we can then assess how their score functions select the peptides used to infer the proteins.

MATERIALS AND METHODS

Data collection

Two datasets, acquired on two different mass spectrometers, were used in this analysis. One was generated from the human cell line HEK293 and downloaded from the PRIDE database3; the other is an in-house dataset generated from Brachypodium distachyon, a model plant for cereals.

The Human HEK293 cells dataset: We downloaded the b1906_293T_proteinID_01A_QE3_122212.raw file deposited by J. Chick and S.P. Gygi from the PRIDE project PXD001468. These data were acquired on a Q-Exactive Orbitrap mass spectrometer (Thermo-Fisher Scientific). Sample preparation and MS/MS analysis have been reported previously4. The raw file was converted to MGF files using the RawConverter tool5. The MzXML format required for TPP was obtained by transforming the MGF format into MzXML via the MSConvert tool included in the cross-platform toolkit ProteoWizard6. This dataset, limited to the spectra recorded from 2+ and 3+ charged precursors, contains 37685 spectra.


The Brachy dataset: The dataset was generated in-house and consisted of a complex mixture of proteins extracted from the seeds of the model plant Brachypodium distachyon. Briefly, the cells were lysed, the lysate was separated by 1D polyacrylamide gel electrophoresis, and the gel lane was cut into 12 slices. These gel pieces were then digested with trypsin essentially as described in Larré et al7. LC-MS/MS analyses of the samples were performed on a nanoscale capillary liquid chromatography-tandem mass spectrometry (LC-MS/MS) system consisting of an Ultimate 3000 RSLC system (Thermo-Fisher Scientific) coupled with an LTQ-Orbitrap VELOS mass spectrometer (Thermo-Fisher Scientific). MS data acquisitions were performed using Xcalibur 2.1 software. Briefly, full MS scans were acquired at high resolution (FWHM 30000) using the Orbitrap analyzer (m/z 400–2000), and CID spectra were recorded for the five most intense precursor ions in the linear ion trap. Three gel slices, each run separately by LC-MS/MS, were arbitrarily selected for this study. As a consequence, the dataset contains three subsets, and each subset consists of the spectra obtained for one given gel slice (i.e., from one LC-MS/MS run). Each subset is limited to the spectra recorded from 2+ and 3+ charged precursors and contains 4674, 4340 and 4530 spectra, respectively, giving a dataset of 13544 spectra. Raw files were converted to MGF files using Proteome Discoverer version 1.7 (Thermo-Fisher Scientific).

Protein database

The Human sequence database was downloaded from the Ensembl genome assembly GRCh37 (GRCh37.61.pep.all). The common Repository of Adventitious Proteins (cRAP), which includes the common contaminants, was downloaded from the Global Proteome Machine Organization and added to the Human database. The final database consisted of 87051 proteins.


The Brachypodium distachyon database was downloaded from http://brachypodium.org (release 1.2, 02-2010) and complemented by a list of common contaminants. The final database consisted of 31087 proteins. Both decoy databases were generated by reversing the protein sequences.

Database search engines

Among the many database search programs available, four were assessed in this study because of their frequent use: X!Tandem (version Sledgehammer)8, TPP (version Jackhammer)9, Mascot (version 2.1)10 and MS-GF+ (version v9979, 3/26/2014)11. Note that the search engine embedded in TPP is a modified version of X!Tandem. Although the original scoring function implemented in X!Tandem has been replaced by the k-score scoring function in TPP12, the underlying approach is the same in both implementations. It is nevertheless interesting to compare the two, especially because they are not provided with the same default parameters. To distinguish them, we treat them as two different programs. For each spectrum, all four software programs (i) compute scores to evaluate the quality of the PSM against all of the peptides in the peptide search space whose mass matches that of the parent ion, plus or minus the mass tolerance defined by the user; (ii) find the best PSM (i.e., the best score); and (iii) report a value to assess the statistical significance of the best PSM.

X!Tandem is a free, open-source search engine available at the web site http://thegpm.org. Most of the parameters that might influence the database search are configurable in two files. The first file (usually named default-input.xml) includes default settings for a large number of parameters that are likely to be adequate for most applications. Each default value can be overridden by adding a corresponding entry in a second file, usually named input.xml, which is simpler for the user because it contains only a subset of parameters. It is important to note that X!Tandem has a second-pass search turned on by default.
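The two-file override mechanism described above can be pictured as a dictionary merge. The following is a minimal sketch, not X!Tandem's own code: it assumes the usual layout of `<note type="input" label="...">` entries inside a `<bioml>` root, and the three labels and their values are given for illustration only and should be checked against the X!Tandem parameter documentation.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for default-input.xml (values are assumptions).
DEFAULTS_XML = """<bioml>
  <note type="input" label="spectrum, parent monoisotopic mass isotope error">yes</note>
  <note type="input" label="scoring, maximum missed cleavage sites">1</note>
  <note type="input" label="refine">yes</note>
</bioml>"""

# Illustrative stand-in for input.xml: only the entries the user wants to change.
OVERRIDES_XML = """<bioml>
  <note type="input" label="refine">no</note>
</bioml>"""


def read_params(xml_text):
    """Return {label: value} for every <note type="input"> element."""
    root = ET.fromstring(xml_text)
    return {
        note.get("label"): (note.text or "").strip()
        for note in root.iter("note")
        if note.get("type") == "input"
    }


def effective_params(defaults_xml, overrides_xml):
    """Start from the defaults, then let the user file win on shared labels."""
    params = read_params(defaults_xml)
    params.update(read_params(overrides_xml))
    return params


params = effective_params(DEFAULTS_XML, OVERRIDES_XML)
for label, value in sorted(params.items()):
    print(f"{label} = {value}")
```

Printing the effective parameter set for each engine in this way is one practical means of noticing defaults, such as the second-pass search, that silently change the set of peptides searched.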


Trans-Proteomic Pipeline (TPP) is a suite of software tools for the analysis of MS/MS datasets and includes a modified version of the X!Tandem algorithm. As already mentioned, in this modified version, the original scoring function has been replaced by the k-score scoring function12. The TPP distribution also ships with default parameter settings that differ from those of the standalone X!Tandem software. TPP can be downloaded from the Seattle Proteome Center (http://tools.proteomecenter.org).

Mascot is a commercial search engine from the company Matrix Science (http://www.matrixscience.com/). Unlike the other software, the underlying algorithm is not completely available. However, Mascot was adapted from the MOWSE program13, and much information is provided by its author on the above website.

MS-GF+ is available at the website of the Center for Computational Mass Spectrometry at UCSD (http://proteomics.ucsd.edu) and uses a sophisticated scoring system. Search parameters are specified using command line arguments, and the post-translational modifications (PTMs) are described in a file named Mods.txt. Fewer parameters can be modified than in the other software.

PSMs collection

In this study, when one program associated several PSMs with the same spectrum, only the PSM with the best score was retained. The results obtained by each search engine in the series of runs were parsed with custom jobs built in Talend Open Studio, an ETL (Extract, Transform and Load) tool. Extracted data were stored in a PostgreSQL database. When necessary, PSMs were examined and compared with the interactive viewer SeeMS included in the cross-platform toolkit ProteoWizard6.
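The rule of keeping only the best-scoring PSM per spectrum can be sketched as follows. This is an illustrative Python version, not the Talend/PostgreSQL workflow used in the study; the tuple layout, the function name and the example values are assumptions. The `higher_is_better` flag covers engines that rank by score as well as those that rank by e-value.

```python
def best_psm_per_spectrum(psms, higher_is_better=True):
    """Keep one PSM per spectrum.

    psms: iterable of (spectrum_id, peptide, score) tuples (hypothetical layout).
    Returns {spectrum_id: (peptide, score)}.
    """
    best = {}
    for spectrum_id, peptide, score in psms:
        current = best.get(spectrum_id)
        if current is None:
            best[spectrum_id] = (peptide, score)
            continue
        # A score ranks higher when larger; an e-value ranks higher when smaller.
        improved = score > current[1] if higher_is_better else score < current[1]
        if improved:
            best[spectrum_id] = (peptide, score)
    return best


# Made-up example: two candidates for scan_1, one for scan_2.
example = [
    ("scan_1", "PEPTIDEK", 35.2),
    ("scan_1", "PEPTLDEK", 12.0),
    ("scan_2", "ELVISLIVESK", 48.9),
]
print(best_psm_per_spectrum(example))
```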


PSMs were sorted according to their score or their e-value in order to extract top-ranked fractions for each program (i.e., cutoff thresholds for the top 10, 20, 30, 40 and 50% of PSMs).

Peptide space configurations

The influence of the peptide space was investigated by analyzing several runs generated by each search engine with one of the two datasets described above.

The minimal set of parameters

To assess the validity and reproducibility of MS/MS data interpretation as well as the relevance of published proteomics results, a set of minimum information is required for publication14. It typically includes the search engine employed, the protein database, the protease used to generate peptides, the precursor ion mass tolerance, the fragment mass tolerance, the peak list file format, the fixed and variable modifications and the number of missed cleavage sites allowed in the peptide. In this study, we set those parameters as follows:

(1) Human HEK293 dataset: the Human database, trypsin enzyme, 5 ppm precursor mass tolerance, 0.01 Da fragment ion tolerance, carbamidomethylation of cysteine as a fixed modification, and methionine oxidation as a variable modification.
(2) Brachy dataset: the Brachypodium distachyon database, trypsin enzyme, 5 ppm precursor mass tolerance, 0.4 Da fragment ion tolerance, carbamidomethylation of cysteine as a fixed modification, and methionine oxidation as a variable modification.

Comparison of minimal versus optimized parameters

• Run 1: minimal set of parameters, HEK293 dataset, miscleavage = 1
• Run 2: optimized parameters to obtain “an identical peptide space”, HEK293 dataset, miscleavage = 1
• Run 3: minimal set of parameters, Brachy dataset, miscleavage = 2
• Run 4: optimized parameters to obtain “an identical peptide space”, Brachy dataset, miscleavage = 2
• Run 5: optimized parameters to obtain a similar peptide space, HEK293 dataset, miscleavage = 3
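The runs listed above differ in the number of missed cleavages ("miscleavage") allowed, so it helps to pin down how a missed cleavage can be counted on an identified peptide. The sketch below is an assumption about the counting rule, not the code of any of the four engines: it counts internal K or R residues, optionally ignoring those followed by proline, and uses that count to discard PSMs above a chosen limit.

```python
def count_missed_cleavages(peptide, proline_blocks=True):
    """Count internal tryptic sites (K or R) left uncleaved in a peptide.

    The C-terminal residue is skipped because it is the site that was cleaved.
    With proline_blocks=True, K or R followed by P is not counted as a site.
    """
    missed = 0
    for i, residue in enumerate(peptide[:-1]):
        if residue in "KR":
            if proline_blocks and peptide[i + 1] == "P":
                continue
            missed += 1
    return missed


def filter_by_missed_cleavages(psms, max_missed=1, proline_blocks=True):
    """psms: {spectrum_id: peptide}. Keep PSMs with at most max_missed sites."""
    return {
        spectrum_id: peptide
        for spectrum_id, peptide in psms.items()
        if count_missed_cleavages(peptide, proline_blocks) <= max_missed
    }


# Made-up peptides with zero, one and two missed cleavages.
psms = {"scan_1": "AAGGTR", "scan_2": "AAKGGR", "scan_3": "AAKGGRTTK"}
print(filter_by_missed_cleavages(psms, max_missed=1))
```

Whether K or R followed by proline counts as a missed cleavage depends on the cleavage rule in force, which is why the proline behaviour is exposed as a parameter here.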

We conducted Run 1 and Run 2 with the parameter “maximum number of missed cleavage sites” (“miscleavage”) set to 1 because this is common practice. Because we did not find a way to set this value in MS-GF+, we eliminated the spectra that this program had interpreted with more than 1 miscleavage when comparing the PSMs in those two runs. Run 5 confirms that the removal of these spectra does not introduce a large bias. Moreover, setting the number of miscleavages to 3 has the advantage of retaining many more spectra, so this is the configuration used in Run 12. A more detailed description of the parameters is available (Table S9, Supporting Information).

Individual effect of some parameters

In the following runs, starting from the “identical peptide space” parameters, we changed just one parameter at a time to evaluate how many disagreements that parameter generates. The measurements were made on the Human HEK293 dataset with X!Tandem.

• Run 6: “identical peptide space” parameters, miscleavage = 2
• Run 7: “identical peptide space” parameters except specific N-terminal protein modification (quick acetyl) and peptide N-terminus cyclization (quick pyrolidone) allowed
• Run 8: “identical peptide space” parameters except cleavage between RP and KP made effective
• Run 9: “identical peptide space” parameters except a parent monoisotopic mass isotope error allowed
• Run 10: “identical peptide space” parameters in a first round, followed by a refinement round on the selected proteins including additional potential modifications
• Run 11: “identical peptide space” parameters except semi-cleavage rules allowed



Influence of the algorithms on the peptide identifications

This last run was used to evaluate how each search engine ranks its PSMs and to measure the effect of the FDR on validated peptides.

• Run 12: identical peptide space parameters, HEK293 dataset, miscleavage = 3

For MS-GF+, the FDR is based solely on the spectral E-value of the PSM (MSGF:SpecEValue).

RESULTS AND DISCUSSION

Validation of our hypothesis

We sequentially performed several distinct interpretations of the HEK293 and Brachy sets of spectra using the four search engines chosen in this study with two different sets of parameters. In the first series of runs (Run 1 and Run 3), parameters were set according to the “minimal set of parameters” (see the Materials and Methods section for further details). If a program provided parameters in addition to this minimal set, they were left at their default values. Depending on the search engine, there are more or fewer of these additional parameters, and users often regard their default values as recommended values. In the second series of tests (Run 2 and Run 4), the settings were configured to generate the closest possible “peptide space” for the four search engines, so that each spectrum was compared to the same set of peptides. In this manuscript, we refer to these settings as “identical peptide space parameters”. It is very difficult to match the parameters perfectly between search engines, as subtle differences in meaning might exist between them. Moreover, there is no way to evaluate directly the “peptide space” used by each program. When we observed divergent interpretations of spectra and found them to be due to differences in “peptide space”, we adjusted the parameters to achieve the best convergence attainable. This process was performed through several successive rounds.

For each program, we extracted several collections of PSMs according to a rank threshold (e.g., top-ranked 10%). For a given threshold, we then measured PSM agreement as the percentage of spectra associated with peptides shared by the four programs, regardless of the rank of the PSMs in the other programs (detailed results are available in Table S1 to Table S6, Supporting Information). In Table 1, we summarize the agreement measured among the four search engines applied to the human HEK293 cells dataset when parameters were defined according to the “minimal set of parameters” and to the “identical peptide space parameters.”

When all of the PSMs were considered (100% top-ranked PSMs in Table 1), in both cases, the percentage of spectra interpreted in the same way remained low, around 50%, and this percentage is even lower on the Brachy dataset, around 30%. This means that the search engines do not agree on the best candidate peptide for many spectra. Nevertheless, the lowest-scoring PSMs are usually of low significance, and only the top-ranked PSMs are used to infer proteins. This is why, among all of the PSMs reported by each search engine, a cutoff was applied to consider only a subset of the best results (cutoffs were set at the best 10%, 20%, 30%, 40% and 50%). With the “minimal set” of parameters, we can observe in Table 1 that agreement between search engines varied from 80.5% to 99.8% for the 10% best-ranked PSMs. Not surprisingly, disagreements increased with the percentage of highest-ranked PSMs considered. A different situation is observed when we configure the parameters to generate the same “peptide space” among the four search engines.
The disagreements between algorithms were negligible for the 10% and 20% highest-ranked PSMs (99% to 100% agreement), and agreement remained above 97.5% when considering the 40% top-ranked PSM interpretations (Table S2 and Table S3, Supporting Information). Agreement then decreased sharply beyond the 40% top PSMs. The same behavior can be observed with the Brachy dataset in Run 3 and Run 4 (Table S4 and Table S5, Supporting Information). Based on these results, we suggest that the variation of “peptide space” constitutes a major source of disagreements between the validated interpretations arising from different search engines.
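The agreement measure used in this section can be sketched in a few lines. This is one reading of the definition given in the text, not the authors' implementation: for a reference engine, take its top-ranked fraction of PSMs, then count the spectra for which every other engine reports the same peptide, whatever the rank of that PSM in the other engines. The data layout, names and example values are hypothetical.

```python
def top_fraction(psms, fraction, higher_is_better=True):
    """psms: {spectrum_id: (peptide, score)}. Return ids of the top-ranked fraction."""
    ranked = sorted(psms, key=lambda sid: psms[sid][1], reverse=higher_is_better)
    return ranked[: int(len(ranked) * fraction)]


def agreement_percent(reference, others, fraction, higher_is_better=True):
    """Percentage of the reference engine's top-ranked spectra that all
    other engines assign to the same peptide, regardless of their own ranking."""
    top = top_fraction(reference, fraction, higher_is_better)
    if not top:
        return 0.0
    agreed = 0
    for sid in top:
        peptide = reference[sid][0]
        if all(sid in other and other[sid][0] == peptide for other in others):
            agreed += 1
    return 100.0 * agreed / len(top)


# Made-up results for three engines on four spectra.
reference = {"s1": ("PEPA", 90), "s2": ("PEPB", 80), "s3": ("PEPC", 20), "s4": ("PEPD", 10)}
other_1 = {"s1": ("PEPA", 5), "s2": ("PEPX", 50), "s3": ("PEPC", 1), "s4": ("PEPQ", 3)}
other_2 = {"s1": ("PEPA", 0.9), "s2": ("PEPB", 0.8), "s3": ("PEPC", 0.7)}

for fraction in (0.25, 0.5, 1.0):
    print(fraction, agreement_percent(reference, [other_1, other_2], fraction))
```

Peptides are compared here as plain strings; a real comparison would also have to decide how to treat modifications and isobaric residues such as I and L, which the text does not specify.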

The “peptide space” generated by algorithms evolves over time

Several parameters that affect the “peptide space” have been introduced over the years in light of new knowledge on the behavior of proteins and/or peptides, or with new generations of instruments. One example concerns the cleavage rules. Before 2008, the generally accepted “Keil rule” was that trypsin cleaves after arginine or lysine, unless either residue is followed in the protein sequence by a proline. Based on statistical evidence derived from large MS/MS data sets, J. Rodriguez and his co-authors re-evaluated this rule and showed that unexpected “RP” or “KP” cleavages occurred at significant rates15. Currently, most search engines allow the user to choose between the two options, but some implement only one of the two. Similarly, the peptide bond between aspartic acid and proline residues is the weakest peptide bond and can be easily hydrolyzed in solution as well as in the gas phase upon in-source fragmentation16. X!Tandem and TPP systematically and implicitly consider this cleavage rule when generating peptide candidates, whereas other programs do not.

For ease of use, several chemical modifications often induced by sample preparation, or very common post-translational modifications, can also be added by default in the search engines, such as removal of the N-terminal methionine, acetylation of the N-terminal methionine, or derivation of the N-terminal glutamine residues. As a consequence, the user can easily overlook these modifications when reporting results.

Another significant variation in the searched “peptide space” is a consequence of the high accuracy that instruments can reach today. A tight mass tolerance is often used by the search engine to filter the candidate peptides matching an experimental spectrum. Unfortunately, during the transformation of raw data into the peak lists used by database search engines, the peak-picking algorithms sometimes fail to detect the correct isotope and identify the 13C peak rather than the monoisotopic peak of the peptide, which results in the peptide mass being incorrect by 1 Da. Most search engines account for this problem with a parameter that expands the parent mass tolerance to multiple tolerance windows; it can be switched “on” or “off” and can be tailored for one 13C peak or for several. The number of isotope peaks considered can be defined by the user or fixed by the search engine. It is important to note that this parameter significantly expands the “peptide space” for each spectrum.

Finally, we must also mention algorithms that use a two-pass approach for peptide identification, first searching the whole database with restricted modifications and afterwards a smaller database compiled in the first pass. In particular, the second pass often accepts what are called “semitryptic” peptides as well as several amino acid modifications that can substantially change the “peptide space” between the two passes.

To appraise the impact of each parameter on the PSM disagreements, we conducted a series of runs with X!Tandem (see Materials and Methods, Run 6 to Run 11), changing just one parameter at a time. For each run, we calculated the number of disagreements, and we summarize their influence in Table 2 (detailed results in Table S8, Supporting Information). The effects of the measured parameters vary from 1% to 7.5% of disagreement according to the threshold. They are cumulative because they do not concern the same spectra.
On the HEK293 dataset, it is interesting to note that the “monoisotopic error” has a major effect when the 20% top-ranked PSMs are considered. Unsurprisingly, the parameters “refine” and “cleavage-semi” also have a significant impact. The examples mentioned above constitute a limited list of the parameters that affect the “peptide space”; many others might exist.
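To make the effect of the cleavage rule and of the missed-cleavage setting on the “peptide space” concrete, here is a minimal in-silico tryptic digestion sketch. It is an illustration under simple assumptions (fully tryptic peptides only, no length or mass filter, no modifications) and does not reproduce the digestion code of any search engine; the protein sequence is invented.

```python
def cleavage_sites(protein, proline_rule=True):
    """Positions after which trypsin cleaves (after K or R).

    With proline_rule=True (the "Keil rule"), K or R followed by P is not a site.
    """
    sites = []
    for i in range(len(protein) - 1):
        if protein[i] in "KR":
            if proline_rule and protein[i + 1] == "P":
                continue
            sites.append(i + 1)
    return sites


def digest(protein, missed_cleavages=0, proline_rule=True):
    """Return the set of fully tryptic peptides with up to N missed cleavages."""
    bounds = [0] + cleavage_sites(protein, proline_rule) + [len(protein)]
    peptides = set()
    for start_idx in range(len(bounds) - 1):
        last = min(start_idx + missed_cleavages + 1, len(bounds) - 1)
        for end_idx in range(start_idx + 1, last + 1):
            peptides.add(protein[bounds[start_idx]:bounds[end_idx]])
    return peptides


protein = "MKAGRPLEKTTR"  # invented sequence containing one "RP" site
for rule in (True, False):
    for missed in (0, 1):
        peptides = digest(protein, missed, rule)
        print(f"proline_rule={rule} missed={missed}: {len(peptides)} peptides")
```

Even on this short sequence, relaxing the proline rule or allowing one more missed cleavage changes the candidate list, which is the kind of silent divergence between engines discussed above.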


How to choose the best “peptide space”?

When we investigated the first series of runs, we observed that when the search parameters widened the “peptide space,” the best PSMs were rarely affected, as previously mentioned2. However, new PSMs that were discordant between algorithms were inserted with high rankings, causing the disagreements between tools observed in Table 1. The percentage of disagreements increases with the difference in the sizes of the “peptide space.” Because in our experiment the “peptide space” generated by Mascot was the smallest one and was approximately included in the “peptide space” generated by the other search engines, Mascot achieved the highest percentage of agreement of the four tools. Indeed, X!Tandem executes a two-pass search by default, with the parent ion mass tolerance expanded to the first and second 13C isotope peaks, whereas TPP, although limited to a single-pass search, includes semi-tryptic peptides by default. In both cases, the “peptide space” thus becomes substantially broader. As X!Tandem and TPP report the number of peptides used for matching in their result files, we observed that the default TPP “peptide space” is much larger than the default X!Tandem “peptide space.” Furthermore, the MS-GF+ “peptide space” was estimated to be intermediate between those of Mascot and X!Tandem because it considers monoisotopic error by default and allows a large number of missed cleavages.

With a large “peptide space,” each spectrum was tested against a wide range of candidate sequences. Yet it has been shown that current search engines lack the ability to distinguish very close peptides17, and such a situation is likely to occur when the “peptide space” is enlarged. This increases the chance of high-scoring random peptides, and the result is reduced statistical significance for all identifications. This difficulty likely explains why each tool introduces only a limited number of common variations by default. However, specifying too limited a “peptide space” is a common reason for failing to obtain a match, and given the time needed to complete an experiment, researchers tend to optimize the use of data and reduce the rate of non-interpreted data.
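The way an isotope-error setting widens the “peptide space” for a single spectrum can be illustrated with a short candidate filter. This is a sketch under stated assumptions, not the logic of any particular engine: candidates are matched on neutral monoisotopic mass within a ppm tolerance, and each allowed isotope error adds a further window shifted by the 13C/12C mass difference. The peptide names and masses are invented.

```python
C13_C12_DELTA = 1.0033548  # mass difference between 13C and 12C, in Da


def within_ppm(observed, theoretical, tol_ppm):
    """True if observed lies within tol_ppm of the theoretical mass."""
    return abs(observed - theoretical) <= theoretical * tol_ppm * 1e-6


def candidates(observed_mass, peptide_masses, tol_ppm=5.0, max_isotope_error=0):
    """Return (peptide, isotope_offset) pairs compatible with the precursor.

    peptide_masses: {peptide: monoisotopic neutral mass} (hypothetical layout).
    """
    hits = []
    for peptide, mass in peptide_masses.items():
        for k in range(max_isotope_error + 1):
            # If the k-th 13C peak was picked instead of the monoisotopic peak,
            # the observed mass is about k * C13_C12_DELTA too high.
            if within_ppm(observed_mass - k * C13_C12_DELTA, mass, tol_ppm):
                hits.append((peptide, k))
                break
    return hits


# Invented masses chosen so that each peptide falls in a different window.
peptide_masses = {"pepA": 1500.0000, "pepB": 1498.9970, "pepC": 1497.9935}
observed = 1500.0010

for max_error in (0, 1, 2):
    print(max_error, candidates(observed, peptide_masses, 5.0, max_error))
```

With a 5 ppm tolerance the single window admits one candidate, while allowing one or two isotope errors admits two and three, even though the nominal tolerance has not changed.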

Disagreements between PSMs beyond the “peptide space”

By constraining the programs to use the same “peptide space,” as we did in our second series of runs (Run 2 and Run 4), we attempted to examine actual differences between the underlying algorithms of the search programs in the choice of the peptide candidate. As it is often claimed that each program has its own strengths and weaknesses, one wonders whether some profiles of experimental spectra are better resolved by one program or another. This question must be assessed on a case-by-case basis depending on the algorithm and perhaps on the dataset. In this work, we observed that X!Tandem showed certain limitations in the correct interpretation of spectra exhibiting a wide range of peak intensities. This tool was designed to work quickly, and for this purpose, the software translates intensities into integers. The maximum intensity in a given mass spectrum is set to 100 by default. Therefore, if the ratio between the most intense peak and another peak exceeds this value, the weaker peak might be treated as noise and lost for further comparisons because its intensity is set to zero. This occurred for a few spectra in our dataset but had little impact on the overall results.

Some search engines penalize the remaining non-matching peaks in a given MS/MS spectrum, whereas others consider only matching peaks in the scoring function. Thus, we also expected different interpretations for chimeric spectra arising from co-eluting peptides with similar m/z values. Several studies have shown the significant prevalence of this situation in datasets arising from complex samples18, 19. However, if such spectra were present, their impact on the observed disagreements in spectrum interpretation was minimal, perhaps because none of the tested programs had been calibrated to specifically account for this event.
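The loss of weak peaks described above for integer-scaled intensities can be reproduced with a toy normalization. This is only an illustration of the effect, assuming a simple truncating rescale to a maximum of 100; it is not X!Tandem's actual preprocessing code, and the intensities are made up.

```python
def scale_to_integers(intensities, max_value=100):
    """Rescale intensities so the base peak equals max_value, then truncate.

    Peaks weaker than 1/max_value of the base peak become 0 and would be
    ignored by any later step that skips zero-intensity peaks.
    """
    top = max(intensities)
    if top <= 0:
        return [0 for _ in intensities]
    return [int(max_value * value / top) for value in intensities]


# Made-up spectrum: the last peak is below 1% of the base peak.
raw = [250000.0, 50000.0, 2600.0, 1200.0]
print(scale_to_integers(raw))
```

In this example the fourth peak is about 0.5% of the base peak and is truncated to zero, which is the situation the text describes for spectra with a wide intensity range.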


Disagreements between validated peptides with the same peptide space

Once the PSMs are collected from an experiment, they are sorted by score and a threshold is selected such that the PSMs above this threshold are considered valid interpretations and used to infer the proteins. The score function implemented by the algorithms must rank the full set of PSMs so that correct PSMs have a better score (or a lower E-value) than incorrect PSMs at each level of FDR. Two strategies can be used to fix the threshold: an empirical statistical model as implemented by PeptideProphet (21) or the target-decoy strategy (20) used in this study. As shown in Table 1, below the 30% threshold, each search engine interprets almost every spectrum as the same peptide when the parameters are set to an “identical peptide space”. In such a configuration, the “peptide space” source of disagreements is eliminated. We can therefore compare how each algorithm ranks these shared PSMs from the most reliable to the least reliable. If the algorithms ranked the shared PSMs in the same manner, we would observe the same set of PSMs at each top-ranked threshold (10%, 20%, and 30%). This is not the case, especially if we consider the lowest thresholds. In Table 3, we can see that the percentage of PSM agreements at the same threshold is only 32% for the 10% top-ranked PSMs but that this value reaches 82% for the 40% top-ranked PSMs. Once the PSMs are ranked, the peptide validation process is based on an FDR evaluation. At each top-ranked threshold, we evaluated the FDR for each search engine. It is interesting to note that whereas the FDRs are similar for the 10% top-ranked PSMs, suggesting good confidence, they are quite different at the 40% top-ranked threshold, varying from 0.2% to 11.9% depending on the search engine. In Table S7, a complementary view of these results is given with several FDR settings.
Moreover, peptide-level FDRs were evaluated to rule out any bias due to multiple identifications (the same peptide identified by multiple spectra). If we accept that the decoy database is a good way to evaluate how well an algorithm differentiates correct PSMs from incorrect ones, then in our run the scoring function of MSGF+ outperforms the others.
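The comparison described in this section (agreement between engines at a top-ranked threshold, plus a decoy-based FDR at that threshold) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study. The `PSM` record, the engine names, and the toy scores are hypothetical. The decoys/targets estimator is one common choice and is not necessarily the formula used to produce Table 3. The denominator chosen for the agreement ratio is also an assumption.

```python
from collections import namedtuple

# Hypothetical PSM record; a higher score is assumed to be better.
# For engines reporting E-values, negate the value before ranking.
PSM = namedtuple("PSM", "spectrum peptide score is_decoy")


def top_fraction(psms, fraction):
    """Return the best-scoring fraction of a list of PSMs."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    n = max(1, int(len(ranked) * fraction))
    return ranked[:n]


def decoy_fdr(psms):
    """Estimate the FDR as decoy hits divided by target hits (assumed estimator)."""
    decoys = sum(p.is_decoy for p in psms)
    targets = len(psms) - decoys
    return decoys / targets if targets else 0.0


def agreement_at(results, fraction):
    """Share of (spectrum, peptide) pairs found in the top fraction of every engine."""
    tops = [top_fraction(psms, fraction) for psms in results.values()]
    pairs = [{(p.spectrum, p.peptide) for p in top} for top in tops]
    shared = set.intersection(*pairs)
    return len(shared) / max(len(top) for top in tops)


# Toy data: both engines assign the same peptides but rank them differently.
results = {
    "engine_a": [PSM("s1", "PEPTIDEA", 90, False), PSM("s2", "PEPTIDEB", 80, False),
                 PSM("s3", "PEPTIDEC", 70, False), PSM("s4", "DECOYX", 60, True)],
    "engine_b": [PSM("s1", "PEPTIDEA", 50, False), PSM("s3", "PEPTIDEC", 45, False),
                 PSM("s2", "PEPTIDEB", 40, False), PSM("s4", "DECOYX", 10, True)],
}

for fraction in (0.5, 1.0):
    fdrs = {name: decoy_fdr(top_fraction(psms, fraction)) for name, psms in results.items()}
    print(fraction, agreement_at(results, fraction), fdrs)
```

In the toy data, the two engines agree on every interpretation when all PSMs are considered, yet they share only half of their top 50% lists, which mirrors the ranking effect discussed above.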

What about intersecting results from several search engines with the same peptide space?

Disagreements between search engines are the reason why several publications suggest keeping only the PSMs that are common to multiple search engines to maximize the confidence of peptide and protein identifications. The argument is that spectra matching the same peptide across different programs are more likely to be correct. It has also been shown previously that the union of search engine results provides higher sensitivity, but that the intersection produces better specificity (22-24). However, it is important to note that intersecting the PSMs eliminates the PSMs that are outside the common “peptide space”. Moreover, we can observe that when the peptide space is the same and the FDR is stringent, each search engine provides the same peptide interpretation. Nevertheless, each of them has its own ordered list of PSMs, and this order may have significant consequences for the validation of peptides. In our dataset, intersecting the PSM agreements at a low FDR (< 0.2%, for instance) dramatically limits the number of validated peptides, whereas at this level the union of the PSMs may significantly increase the number of validated peptides without introducing many misinterpretations. However, when higher FDR values are applied, most of the validated peptides become shared peptides; consequently, the gain from the union becomes smaller while the higher FDR introduces more misinterpretations.
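The trade-off between intersection and union can be made concrete with a short sketch. The engine names and peptide sequences below are hypothetical, and each set is assumed to hold the peptides that one engine validated at a chosen FDR. The function `combine` is an illustrative helper and does not belong to any of the tools discussed.

```python
from collections import Counter


def combine(validated):
    """Return the intersection, the union and per-peptide support counts.

    validated: dict mapping an engine name to its set of validated peptides.
    """
    sets = list(validated.values())
    if not sets:
        return set(), set(), Counter()
    common = set.intersection(*sets)
    combined = set.union(*sets)
    # Number of engines that validated each peptide.
    support = Counter(pep for s in sets for pep in s)
    return common, combined, support


# Hypothetical validated peptides at a stringent FDR.
validated = {
    "engine_a": {"AAK", "CCR", "DDK"},
    "engine_b": {"AAK", "EEK"},
    "engine_c": {"AAK", "CCR"},
}

common, combined, support = combine(validated)
print(len(common), len(combined), support["CCR"])
```

Here the intersection keeps one peptide while the union keeps four. The support counts show which peptides depend on a single engine and therefore deserve closer inspection before being accepted.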

CONCLUSION

Differences in the dimensionality of the “peptide space” explain most of the disagreements concerning the best PSMs suggested for the interpretation of a given spectrum by different search engines. To assist in comparisons, it might be useful to report the number of peptides contained in the “peptide space” during the interpretation of spectra, in the same way that the database name, the enzyme, and the PTMs are reported in publications. This would also help the reproducibility of results, considering the evolution of search engine parameters. Particular attention must be paid to those parameters that control the “peptide space”. The question thus remains of how to define this “peptide space” with the highest relevance for a given dataset or, in other words, how to determine which parameters best reflect the actual variations expected for a given sample and/or instrumental conditions. To the best of our knowledge, there is no practical method or tool that helps users determine these parameters according to a sample, a protocol or instrumental conditions. The differences observed between the default parameters of each search engine and the absence of any method to fix them indicate that there is no consensus. At the moment, this question is mainly one of expertise, although there is a clear unmet need that is independent of the search engines. In addition, the validated peptides also depend on the capacity of the search engines to separate reliable PSMs from unreliable ones. At very low FDR, each search engine provides a complementary view of the most reliable PSMs. However, when a larger number of PSMs is considered, some algorithms seem to have more difficulty separating correct from incorrect PSMs.
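As an illustration of what reporting the size of the “peptide space” could involve, the following sketch counts the distinct peptides produced by an in-silico digestion. It is a simplified assumption, not the procedure of any search engine: it cleaves after every K or R, ignores modifications, semi-specific cleavage and mass tolerances, and uses hypothetical function names, length limits and a toy sequence.

```python
def cleave(protein):
    """Split a protein sequence after every K or R (simplified trypsin rule)."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])
    return fragments


def peptide_space(proteins, missed_cleavages=1, min_len=6, max_len=30):
    """Return the set of distinct peptides allowed by the digestion settings."""
    space = set()
    for protein in proteins:
        fragments = cleave(protein)
        for i in range(len(fragments)):
            last = min(i + missed_cleavages, len(fragments) - 1)
            for j in range(i, last + 1):
                peptide = "".join(fragments[i:j + 1])
                if min_len <= len(peptide) <= max_len:
                    space.add(peptide)
    return space


# Toy sequence (hypothetical); length limits relaxed so every fragment counts.
toy = ["MKAAAKCCCRDDD"]
sizes = {mc: len(peptide_space(toy, missed_cleavages=mc, min_len=1)) for mc in (0, 1, 2)}
print(sizes)
```

Even on this toy sequence, the number of candidate peptides more than doubles between zero and two missed cleavages, which suggests why a single parameter can change the “peptide space” seen by a search engine.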


AUTHOR INFORMATION

E-mail: [email protected]. Tel: 33-240675176. Fax: 33-240675005

SUPPORTING INFORMATION

Table S1 - Variation of parameters used in several Runs
Table S2 - Number of disagreements in Run 1 (minimal set of parameters, HEK93 dataset, miscleavage = 1)
Table S3 - Number of disagreements in Run 2 (identical peptide space parameters, HEK93 dataset, miscleavage = 1)
Table S4 - Number of disagreements in Run 3 (minimal set of parameters, Brachy dataset, miscleavage = 2)
Table S5 - Number of disagreements in Run 4 (identical peptide space parameters, Brachy dataset, miscleavage = 2)
Table S6 - Number of disagreements in Run 5 (identical peptide space parameters, HEK93 dataset, miscleavage = 3)
Table S7 - Number of disagreements in Run 5 under specified PSM-level FDR and peptide-level FDR (identical peptide space parameters, HEK93 dataset, miscleavage = 3)
Table S8 - Influence of each parameter modification applied to X!Tandem on the number of disagreements (identical peptide space parameters, HEK93 dataset, miscleavage = 3)
Table S9 - More detailed parameters used by each program to generate the same peptide space

ACKNOWLEDGMENTS

We would like to thank all of the trainees who participated during the last few years in the development of software tools that have been useful for this study: Guillaume Merceron, Charles-Henri Van Nuvel, Valentin Estréguil and Romain Anseaume. We also thank the anonymous reviewers of this manuscript. The final version was significantly improved by their comments and suggestions. This project has been funded in part by the Conseil Régional, Pays de la Loire (France) with its “GRIOTE program” (2013-2017).


REFERENCES

(1) Choi, H., and Nesvizhskii, A. I. False discovery rates and related statistical concepts in mass spectrometry-based proteomics, J Proteome Res. 2008, 7, 47-50.
(2) Shteynberg, D., Nesvizhskii, A. I., Moritz, R. L., and Deutsch, E. W. Combining results of multiple search engines in proteomics, Mol Cell Proteomics. 2013, 12, 2383-2393.
(3) Martens, L., Hermjakob, H., Jones, P., Adamski, M., Taylor, C., States, D., Gevaert, K., Vandekerckhove, J., and Apweiler, R. PRIDE: the proteomics identifications database, Proteomics. 2005, 5, 3537-3545.
(4) Chick, J. M., Kolippakkam, D., Nusinow, D. P., Zhai, B., Rad, R., Huttlin, E. L., and Gygi, S. P. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides, Nat Biotechnol. 2015, 33, 743-749.
(5) He, L., Diedrich, J., Chu, Y. Y., and Yates, J. R. Extracting Accurate Precursor Information for Tandem Mass Spectra by RawConverter, Anal Chem. 2015, 87, 11361-11367.
(6) Chambers, M. C., Maclean, B., Burke, R., Amodei, D., Ruderman, D. L., Neumann, S., Gatto, L., Fischer, B., Pratt, B., Egertson, J., Hoff, K., Kessner, D., Tasman, N., Shulman, N., Frewen, B., Baker, T. A., Brusniak, M. Y., Paulse, C., Creasy, D., Flashner, L., Kani, K., Moulding, C., Seymour, S. L., Nuwaysir, L. M., Lefebvre, B., Kuhlmann, F., Roark, J., Rainer, P., Detlev, S., Hemenway, T., Huhmer, A., Langridge, J., Connolly, B., Chadick, T., Holly, K., Eckels, J., Deutsch, E. W., Moritz, R. L., Katz, J. E., Agus, D. B., MacCoss, M., Tabb, D. L., and Mallick, P. A cross-platform toolkit for mass spectrometry and proteomics, Nat Biotechnol. 2012, 30, 918-920.
(7) Larré, C., Penninck, S., Bouchet, B., Lollier, V., Tranquet, O., Denery-Papini, S., Guillon, F., and Rogniaux, H. Brachypodium distachyon grain: identification and subcellular localization of storage proteins, J Exp Bot. 2010, 61, 1771-1783.
(8) Craig, R., and Beavis, R. C. TANDEM: matching proteins with tandem mass spectra, Bioinformatics. 2004, 20, 1466-1467.
(9) Deutsch, E., Mendoza, L., Shteynberg, D., Farrah, T., Lam, H., Tasman, N., Sun, Z., Nilsson, E., Pratt, B., Prazen, B., Eng, J., Martin, D., Nesvizhskii, A., and Aebersold, R. A guided tour of the Trans-Proteomic Pipeline, Proteomics. 2010, 10, 1150-1159.
(10) Perkins, D. N., Pappin, D. J., Creasy, D. M., and Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data, Electrophoresis. 1999, 20, 3551-3567.
(11) Kim, S., Gupta, N., and Pevzner, P. A. Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases, J Proteome Res. 2008, 7, 3354-3363.
(12) MacLean, B., Eng, J. K., Beavis, R. C., and McIntosh, M. General framework for developing and evaluating database scoring algorithms using the TANDEM search engine, Bioinformatics. 2006, 22, 2830-2832.
(13) Pappin, D. J., Hojrup, P., and Bleasby, A. J. Rapid identification of proteins by peptide-mass fingerprinting, Curr Biol. 1993, 3, 327-332.
(14) Binz, P. A., Barkovich, R., Beavis, R. C., Creasy, D., Horn, D. M., Julian, R. K., Seymour, S. L., Taylor, C. F., and Vandenbrouck, Y. Guidelines for reporting the use of mass spectrometry informatics in proteomics, Nat Biotechnol. 2008, 26, 862.
(15) Rodriguez, J., Gupta, N., Smith, R., and Pevzner, P. Does trypsin cut before proline?, J Proteome Res. 2008, 7, 300-305.
(16) Olsen, J. V., Ong, S. E., and Mann, M. Trypsin cleaves exclusively C-terminal to arginine and lysine residues, Mol Cell Proteomics. 2004, 3, 608-614.
(17) Colaert, N., Degroeve, S., Helsens, K., and Martens, L. Analysis of the resolution limitations of peptide identification algorithms, J Proteome Res. 2011, 10, 5555-5561.


(18) Houel, S., Abernathy, R., Renganathan, K., Meyer-Arendt, K., Ahn, N. G., and Old, W. M. Quantifying the impact of chimera MS/MS spectra on peptide identification in large-scale proteomics studies, J Proteome Res. 2010, 9, 4152-4160.
(19) Wang, J., Bourne, P. E., and Bandeira, N. MixGF: spectral probabilities for mixture spectra from more than one peptide, Mol Cell Proteomics. 2014, 13, 3688-3697.
(20) Elias, J. E., and Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry, Nat Methods. 2007, 4, 207-214.
(21) Keller, A., Nesvizhskii, A. I., Kolker, E., and Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search, Anal Chem. 2002, 74, 5383-5392.
(22) Alves, G., Wu, W. W., Wang, G., Shen, R. F., and Yu, Y. K. Enhancing Peptide Identification Confidence by Combining Search Methods, J Proteome Res. 2008, 7, 3102-3113.
(23) Shteynberg, D., Deutsch, E. W., Lam, H., Eng, J. K., Sun, Z., Tasman, N., Mendoza, L., Moritz, R. L., Aebersold, R., and Nesvizhskii, A. I. iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates, Mol Cell Proteomics. 2011, 10, M111.007690.
(24) Dagda, R. K., Sultana, T., and Lyons-Weiler, J. Evaluation of the Consensus of Four Peptide Identification Algorithms for Tandem Mass Spectrometry Based Proteomics, J Proteomics Bioinform. 2010, 3, 39-47.


Table 1. Percentage of PSM agreement between search engines. It represents the number of shared PSMs, even though these PSMs are ranked at a lower score by any search engine.

[Chart residue. Recoverable labels: percentage of the top-ranked PSMs (100%, 10%, 20%, 30%, 50%); panels “Similar minimal set of parameters” and “Identical peptide space parameters”; category label “All software”; vertical axes in percent. The plotted values are not recoverable from the extracted text.]


Table 2. Influence of each parameter on the percentage of disagreements (Run 6 to Run 11 compared to Run 2) on the 10% top-ranked and 40% top-ranked identifications with X!Tandem. Starting from the “identical peptide space” parameter set, we changed one parameter at a time to measure its effect.

[Chart residue. Recoverable labels: panels “10% top-ranked threshold” and “40% top-ranked threshold”; vertical axes from 0 to 8 (percentage of disagreements). The plotted values are not recoverable from the extracted text.]

Table 3. Percentage of peptide agreement between search engines. The percentage of peptide agreement at the same threshold, calculated for each search engine at x% (x = 10, 20, 30, 40), represents the percentage of shared PSMs at this threshold for the four search engines (Run 12). The FDR represents the PSM-level FDR.

Number of spectra

Peptide agreement at the same threshold

FDR Mascot

FDR X!Tandem

FDR TPP

FDR MSGF+

10%

3670

32%

0.03%