
IdentiPy: an extensible search engine for protein identification in shotgun proteomics
Lev I. Levitsky, Mark V. Ivanov, Anna A. Lobas, Julia A. Bubis, Irina A. Tarasova, Elizaveta M. Solovyeva, Marina L. Pridatchenko, and Mikhail V. Gorshkov
J. Proteome Res., Just Accepted Manuscript. DOI: 10.1021/acs.jproteome.7b00640. Publication Date (Web): 23 Apr 2018.



Journal of Proteome Research

IdentiPy: an extensible search engine for protein identification in shotgun proteomics

Lev I. Levitsky,†,‡,¶ Mark V. Ivanov,‡,¶ Anna A. Lobas,‡ Julia A. Bubis,‡ Irina A. Tarasova,‡ Elizaveta M. Solovyeva,‡ Marina L. Pridatchenko,‡ and Mikhail V. Gorshkov∗,‡

†Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia
‡V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow, Russia
¶Authors Levitsky and Ivanov contributed equally to this work

E-mail: [email protected]
Phone: +7 499 1378257

Abstract

We present an open-source, extensible search engine for shotgun proteomics. Implemented in the Python programming language, IdentiPy shows competitive processing speed and sensitivity compared with the state-of-the-art search engines. It is equipped with a user-friendly web interface, IdentiPy Server, enabling the use of a single server installation accessed from multiple workstations. Using a simplified version of the X!Tandem scoring algorithm and its novel “auto-tune” feature, IdentiPy outperforms the popular alternatives on high-resolution data sets. Auto-tune adjusts the search parameters for the particular data set, resulting in improved search efficiency and simplifying the user experience. IdentiPy with the auto-tune feature shows higher sensitivity compared with the evaluated search engines. IdentiPy Server has built-in post-processing and protein inference procedures and provides graphic visualization of the statistical properties of the data set and the search results. It is open-source and can be freely extended to use third-party scoring functions or processing algorithms, and allows customization of the search workflow for specialized applications.

Keywords

proteomics, proteomic search engine, peptide identification

Introduction

Bottom-up proteomics remains the dominant approach for protein identification in biological samples. 1 With the progress in mass spectrometry instrumentation and separation methods, the size of proteomic data sets grows substantially, imposing new challenges on data processing algorithms. 2 The efforts of the last 15 years have resulted in the development of a number of search engines, including SEQUEST, 3 X!Tandem, 4 Mascot, 5 OMSSA, 6 Comet, 7 MyriMatch, 8 Andromeda, 9 MS Amanda, 10 MS-GF+, 11 Morpheus, 12 and MSFragger, 13 among others. The single score typically produced by a search engine is usually significantly suboptimal as a means of filtering, 14 as demonstrated by the crucial effect of post-search processing tools, such as PeptideProphet, 15 iProphet, 16 Percolator, 17 and MP score. 18 Post-search validation and filtering may drastically improve both sensitivity and specificity of peptide and protein identification. 19

Another layer of complexity is imposed by the protein inference problem. 20,21 Two principal approaches to protein inference are the parsimony rule 22–24 and statistical models additionally attempting to estimate probabilities associated with individual proteins. 25–27 The already vast number of possible combinations of processing software is increased further by the recent trend of applying several search algorithms to the same data set to improve sensitivity. 28 This has led to the introduction of “aggregating” software responsible for unification of the complex data processing workflows, such as iProphet, 16 SearchGUI 29 and PeptideShaker, 30 or Proteome Discoverer, as well as software frameworks such as OpenMS, 31 and adaptation of workflow engines like Galaxy. 32 Workflow organizers developed specifically for proteomics, such as Trans-Proteomic Pipeline (TPP), 33 are also commonly used. Using TPP with the Prophet algorithms, or SearchGUI with PeptideShaker, allows significant improvement of the overall sensitivity of proteome analysis by combining the advantages of different search engines in a statistically sound manner, while retaining reproducibility. 16,28,30

Every aspect of data handling, from spectrum preprocessing to scoring spectral matches and post-search validation, affects the results and needs to be tailored to the properties of the data set and instrumentation. A recent trend is the development of approaches suited for high-resolution, high-accuracy mass spectrometry. 10,12 These approaches, however, are not applicable to data sets obtained on older instruments with lower mass accuracy and resolving power. Designing a single, versatile workflow becomes a challenge, given the variety of instruments used in the field. 34

In this work we present IdentiPy, a Python-based open-source extensible engine for peptide identification in shotgun proteomics. We compare its performance to a number of popular search engines and discuss the effect of scoring-independent optimizations that can be applied in other search engines. IdentiPy is complemented by a GUI encompassing peptide identification, post-search validation, protein inference and quantification.

Materials and Methods

Search algorithm

Architecture

The software is divided into two Python packages: the search engine and the web interface. The IdentiPy search engine is a Python package responsible for peptide identification. Its operation is based on the previously developed Pyteomics library 35 and managed by a single plain-text configuration file. The file incorporates the common search parameters, such as precursor and fragment ion mass tolerances, cleavage rules, database options, etc. It also controls the unique features of IdentiPy, including hooks for user-written code for scoring or multi-stage processing.

The web interface (IdentiPy Server) is written in Python using the Django framework. It implements a client-server architecture, with IdentiPy installed on the server only. Multiple clients can use the server via a web browser. The user interface includes a user authentication system and allows uploading MS/MS data, protein databases and configuration files to the server, starting searches, as well as viewing the results of the completed searches. Search parameters can also be set directly in the web interface. IdentiPy Server automatically performs post-search validation, filtering and label-free protein quantification using MP score software. 18 Examples of the interface are shown as Supplementary Figures S-4 and S-5.

Spectrum preprocessing

IdentiPy accepts centroided spectra in MGF and mzML formats. Spectrum preprocessing procedures include peak filtering and deisotoping. Peak filtering eliminates low-intensity fragment ion peaks from MS/MS spectra prior to matching and is governed by two parameters that behave identically to their X!Tandem counterparts: “dynamic range” and “maximum peaks”. The deisotoping algorithm is straightforward and follows the principles described by Coon et al. 12 When a peak spacing of 1/z is found in the MS/MS spectrum, where z is a possible value of fragment charge, the extra peaks are eliminated as long as they are less intense than the monoisotopic peak candidate, and the fragment m/z is recalculated to correspond to a singly-charged fragment.
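The peak filtering step can be illustrated with a minimal sketch. It is not taken from the IdentiPy source: the function name, the argument names and the default values are hypothetical, and the semantics are an assumption modelled on the description above, namely that “dynamic range” discards peaks weaker than the base peak intensity divided by the given value and that “maximum peaks” keeps only the most intense remaining peaks.

```python
def filter_peaks(peaks, dynamic_range=100.0, max_peaks=50):
    """Filter a centroided MS/MS spectrum given as (m/z, intensity) tuples.

    Assumed (hypothetical) semantics:
    - dynamic_range: drop peaks weaker than base_peak / dynamic_range;
    - max_peaks: keep at most this many of the most intense peaks.
    """
    if not peaks:
        return []
    base_peak = max(intensity for _, intensity in peaks)
    threshold = base_peak / dynamic_range
    # Dynamic range filter: relative intensity cut-off.
    kept = [(mz, intensity) for mz, intensity in peaks if intensity >= threshold]
    # Maximum peaks filter: retain the top-N peaks by intensity.
    kept = sorted(kept, key=lambda p: p[1], reverse=True)[:max_peaks]
    # Return the surviving peaks in ascending m/z order.
    return sorted(kept)


spectrum = [(100.0, 1000.0), (200.0, 5.0), (300.0, 500.0), (400.0, 20.0)]
filtered = filter_peaks(spectrum, dynamic_range=100.0, max_peaks=2)
```

In this toy spectrum the peak at m/z 200 falls below 1% of the base peak and is removed by the dynamic range filter, after which only the two most intense peaks are retained.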

Scoring

The IdentiPy search engine allows employing a user-defined Python function for peptide scoring. This is done by specifying the importable function name in the configuration file, as illustrated in the default file included in the software. We implemented two scoring functions in the current version of the platform: the Morpheus score 12 and the Hyperscore from X!Tandem. 4 We also proposed and implemented a modification to the Hyperscore function, named RNHS (renormalized Hyperscore). It uses a different normalization of the fragment ion intensities in MS/MS spectra compared with X!Tandem. In X!Tandem, the fragment peak intensities are effectively normalized by the highest peak, followed by summing the matching peak intensities and multiplying the sum by factorials corresponding to ion series lengths. In RNHS, the sum of intensities of the matched peaks is instead normalized by the total intensity. The rest of the scoring algorithm is identical to Hyperscore:

$$\mathrm{RNHS} = \frac{\sum_i I_i \cdot m_i}{\sum_i I_i} \cdot N_b! \cdot N_y!, \tag{1}$$

where $I_i$ are absolute fragment ion intensities, $m_i$ is 1 for matching peaks and 0 for others, and $N_b$ and $N_y$ are the numbers of matched b- and y-ions. It is worth noting that in X!Tandem Hyperscore is not normally used as the final measure of peptide-spectrum match (PSM) quality, but rather as a means of calculating the expectation value. In IdentiPy, however, RNHS is used directly for PSM ranking. In this work, we used RNHS for scoring of PSMs with IdentiPy in all search engine evaluations.
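Equation (1) translates directly into a few lines of Python. The sketch below is only an illustration of the formula: the function name and signature are hypothetical, it is not the scoring code shipped with IdentiPy, and it assumes that peak matching has already been done elsewhere.

```python
from math import factorial


def rnhs(intensities, matched, n_b, n_y):
    """Renormalized Hyperscore following Eq. (1).

    intensities: absolute fragment peak intensities I_i of the spectrum
    matched:     flags m_i, truthy for peaks matching a theoretical fragment
    n_b, n_y:    numbers of matched b- and y-ions (N_b and N_y)
    """
    total = sum(intensities)
    if total == 0:
        return 0.0  # empty or all-zero spectrum: nothing to score
    matched_sum = sum(i for i, m in zip(intensities, matched) if m)
    return matched_sum / total * factorial(n_b) * factorial(n_y)


# Three peaks, two of them matched: 70% of the total intensity is explained.
score = rnhs([10.0, 30.0, 60.0], [1, 0, 1], n_b=2, n_y=3)
```

Here the matched fraction is 0.7 and the factorial term is 2! · 3! = 12, so the score is approximately 8.4. Because the intensity term is bounded by 1, the factorials of the matched ion counts dominate the score for well-matched spectra.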

Post-processing and FDR filtering

IdentiPy Server automatically performs post-search analysis of the search results using the MP score algorithm. 18 The MP score is a measure of PSM quality orthogonal to the search score, calculated using a number of descriptors. These descriptors include precursor ion mass error, median fragment ion mass error, retention time prediction error, number of potential modifications, number of missed cleavages, peptide and protein PSM counts, and isotope mass error. Each descriptor contributes a factor to the final score value. The factor is calculated as the value of the probability mass function or the cumulative distribution function (depending on the descriptor) estimated empirically from the top-scoring PSMs based on the initial search score. After re-scoring of PSMs, MP score produces peptide and protein lists and performs FDR filtering based on the target-decoy approach (TDA). 36

Proteins are grouped according to shared peptides; each group contains the minimum set of proteins necessary to explain a set of peptides. If a protein is identified by one or more unique peptides, it becomes a “group leader”; other proteins sharing peptides with the leader but having no unique peptide evidence are added into the group. The grouping is performed by a greedy algorithm that iteratively picks out proteins identified by the highest number of peptides and excludes those peptides from further consideration. This approach is similar to the ProteinProphet 25 algorithm as implemented by Scaffold. 37 However, unlike in ProteinProphet, individual protein probabilities are not calculated; only “group leaders” are scored. Processing parameters for MP score can also be set via the web interface.

For the comparisons performed in this study, however, we used IdentiPy and the other evaluated search engines directly and in combination with Percolator. Q-value curves and bar charts were plotted from search engine output files using the Pyteomics library and based on the target-decoy approach. All searches were performed against a concatenated target-decoy database constructed using Pyteomics. Decoy protein sequences were generated by reversing the original sequences.
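For readers unfamiliar with target-decoy filtering, the following sketch shows one common way to turn a ranked list of PSMs into q-values. It is an assumption-laden illustration rather than the procedure used in this work (which relied on Pyteomics): the function names are hypothetical, the FDR at each score threshold is estimated as the ratio of decoy to target PSMs (a usual choice for concatenated target-decoy searches), and score ties are not treated specially.

```python
def qvalues(psms):
    """Assign q-values to PSMs given as (score, is_decoy) pairs.

    Higher scores are assumed to be better. Returns (score, is_decoy, q)
    tuples sorted by descending score.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Assumed estimator: decoys / targets above the current threshold.
        fdrs.append(decoys / max(targets, 1))
    # q-value = lowest FDR at which this PSM would still be accepted,
    # i.e. the running minimum of the FDR taken from the bottom of the list.
    q = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q.append(running_min)
    q.reverse()
    return [(score, is_decoy, qv) for (score, is_decoy), qv in zip(ranked, q)]


def filter_fdr(psms, fdr=0.01):
    """Return target PSMs whose q-value does not exceed the FDR level."""
    return [(s, d) for s, d, qv in qvalues(psms) if qv <= fdr and not d]


psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
accepted = filter_fdr(psms, fdr=0.01)
```

In this toy list the first decoy appears at rank three, so only the two top-scoring target PSMs pass the 1% threshold.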

Data sets

Three data sets were used for evaluation of search engine performance: tryptic digestion data from the Confetti data set 38 (3 replicate runs, HCD fragmentation); HEK293 data set 39 (one replicate run from each of the six fractions); and one-hour yeast proteome data 40 (single run), obtained using Q Exactive Orbitrap FTMS, Orbitrap FTMS Velos, and Orbitrap FTMS Fusion instruments from Thermo Fisher Corp. (Bremen, Germany), respectively. For all search engines except MaxQuant, we used MGF and mzML files produced with the MSConvert utility from the ProteoWizard library. 41

Results and discussion

Auto-tuning

One of the unique features of IdentiPy is the “auto-tune” mode for optimizing the search parameters. Auto-tune is implemented as an instance of the general multi-stage processing functionality of IdentiPy. This functionality allows an arbitrary function to preprocess the given spectra and initial parameter set and return the updated parameter set for the next search over the same spectra.

The optimization procedure is as follows. First, IdentiPy performs a preliminary search with user-defined parameters. The search results are filtered to 1% PSM FDR using the target-decoy approach and analyzed statistically to derive the optimal parameters. Statistical analysis includes calculating the distributions of peptide ion charge states, precursor and fragment ion mass errors and number of missed cleavage sites, as well as calibration of the retention time (RT) prediction model. The distributions are used to determine the optimal values for the corresponding search parameters (precursor and fragment ion mass tolerance and maximum number of missed cleavages). This is done by choosing parameter values that cover empirically established percentages of the identifications from the preliminary search.

For precursor ion mass, a normal distribution is fit to the precursor mass errors of the filtered PSMs, and the optimized tolerance window is centered at the maximum of the fit. The width of the window is set to 8 times the standard deviation calculated from the fit. This corrects for systematic errors and, if needed, narrows the window significantly; however, the window remains several times wider than the tolerance usually chosen. This choice does not have a significant impact on search efficiency, 42 but it can be beneficial for subsequent post-search analysis. 43–45 Gaussian fitting of the precursor mass error distribution may fail due to artifacts of mass calibration (see Supplementary Figure S-3 for an example). In this case, the tolerance is set at the 0.1-th and 99.9-th percentiles of the mass error distribution. For missed cleavages, the cut-off value is calculated by rounding up the 99.5-th percentile of the corresponding distribution. For fragment ion mass error, the 68-th percentile of the median fragment mass error is calculated (in ppm) and quadrupled to obtain the optimal fragment ion mass tolerance.

Retention time prediction is used in the second search to filter out incorrect sequence candidates. The “additive” RT prediction model (i.e. based on the retention coefficient theory) 46 is trained on the filtered PSMs, followed by calculation of the RT prediction error distribution on the same set of PSMs. The 0.5-th and 99.5-th percentiles of the distribution are used as a filter for all candidate peptides in the second search: peptides with predicted RT outside of the specified window relative to a given spectrum are not scored.

It should be noted that several existing implementations of multi-pass searching cause systematic discrepancies in FDR estimation. 47,48 This is not the case for auto-tune, because decoy sequences are not discriminated based on the preliminary search results. Unlike X!Tandem’s refinement mode or Mascot’s error-tolerant search, which impose a score threshold for admission to the final stage of the search, auto-tune affects only the search parameters, which apply equally to decoy and target proteins.

Optimization of search parameters for a particular data set has been used before. For example, the Param-Medic tool 49 derives the optimal parameters from analyzing the spectra alone, without knowledge of the peptides that produced them. While this makes Param-Medic compatible with any search engine, optimizing the results from a subset of reliable identifications as implemented in IdentiPy is inherently more robust.
As another example, MaxQuant 50 performs mass recalibration of the spectra, followed by setting an optimized (much smaller) value of precursor ion mass tolerance.

The main purpose of auto-tuning is increasing the number of reliable PSMs, but it also significantly simplifies the researcher’s work and reduces the probability of human errors during data processing. For example, the use of inaccurate search parameters (100 ppm precursor and 0.5 Da fragment mass tolerances and 5 missed cleavages) significantly reduces the number of identifications for the HEK293 data set, as shown in Figure 1. Auto-tuning restores the search efficiency by narrowing the mass tolerances and reducing the number of allowed miscleavages based on preliminary results of the search with incorrect input parameters. Full q-value curves, which allow the search sensitivity to be evaluated at all FDR levels, are shown in Supplementary Figure S-1. As shown in Figure 1, auto-tune successfully restores the efficiency with unrealistic initial parameters, and yields an improvement of 9.3 thousand PSMs (≈ 8.8%) at 1% FDR for near-optimal initial parameters. Initial and optimized parameters are listed in Table 1 (for HEK293 data) and Supplementary Table S-1 (for Confetti and yeast data). Exact numbers of PSMs with and without auto-tune are shown in Supplementary Table S-3.

The contribution of optimization of individual search parameters to the net effect of auto-tune depends on the properties of the data set and the choice of initial parameters. Using the Confetti data set as an example, we evaluated the effect of tuning single parameters one by one (Supplementary Table S-2). The main role in the net effect of auto-tune is typically played by fragment ion mass error optimization. This is to be expected, as the other search parameters only affect the search space, reducing it by a factor of 1–5 (depending on the parameter and the initial guess of the user). Fragment ion mass error, on the other hand, affects the scoring of peptides, rather than the search space size, which has a more prominent effect on the search engine’s ability to discriminate between correct and incorrect matches.
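The parameter derivation rules described in this section can be summarized in a short sketch. It is an illustration under stated assumptions, not the IdentiPy implementation: all names are hypothetical, the Gaussian fit is replaced by the sample mean and sample standard deviation of the precursor errors, percentiles use linear interpolation, the fragment errors are assumed to be supplied as absolute values in ppm, and the decision that the Gaussian fit has failed is left to the caller.

```python
import math
import statistics


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    data = sorted(values)
    if not data:
        raise ValueError("no values given")
    pos = (len(data) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def tune_parameters(precursor_errors_ppm, fragment_errors_ppm,
                    missed_cleavages, rt_errors, gaussian_fit_ok=True):
    """Derive second-pass search parameters from PSMs filtered to 1% FDR."""
    if gaussian_fit_ok:
        # Window of 8 standard deviations centered at the distribution maximum
        # (mean and stdev stand in for a fitted Gaussian here).
        center = statistics.mean(precursor_errors_ppm)
        sigma = statistics.stdev(precursor_errors_ppm)
        precursor_window = (center - 4 * sigma, center + 4 * sigma)
    else:
        # Fallback when the fit fails: 0.1-th and 99.9-th percentiles.
        precursor_window = (percentile(precursor_errors_ppm, 0.1),
                            percentile(precursor_errors_ppm, 99.9))
    return {
        "precursor_window_ppm": precursor_window,
        # 68-th percentile of the median fragment errors, quadrupled.
        "fragment_tolerance_ppm": 4 * percentile(fragment_errors_ppm, 68),
        # 99.5-th percentile of missed cleavage counts, rounded up.
        "max_missed_cleavages": math.ceil(percentile(missed_cleavages, 99.5)),
        # RT prediction error window for candidate filtering.
        "rt_window": (percentile(rt_errors, 0.5), percentile(rt_errors, 99.5)),
    }


params = tune_parameters(
    precursor_errors_ppm=[-2.0, 0.0, 2.0],
    fragment_errors_ppm=[1.0, 2.0, 3.0, 4.0, 5.0],
    missed_cleavages=[0, 0, 0, 1, 2],
    rt_errors=[-1.0, 0.0, 1.0],
)
```

With these toy inputs the precursor window is ±8 ppm around zero (mean 0, standard deviation 2), and the missed cleavage limit is 2. Because the same parameter set is then applied to target and decoy candidates alike, a procedure of this kind does not by itself bias the target-decoy FDR estimate.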

Table 1: Initial and optimized parameters for the HEK293 data set. Top: unrealistic initial parameters. Bottom: near-optimal initial parameters. Multiple values correspond to optimized settings for individual runs.

Unrealistic initial parameters

Parameter                    Initial value    Optimized values
precursor mass error, ppm    -100 : +100      -15.9 : +7.9; -19.3 : +10.8; -19.7 : +12.7; -18.4 : +12.4; -19.4 : +12.9; -21.0 : +13.6
fragment mass error, ppm     500              9; 10; 11; 11; 12; 12
allowed miscleavages         5                2; 3; 2; 3; 3; 3

Near-optimal initial parameters

Parameter                    Initial value    Optimized values
precursor mass error, ppm    -20 : +20        -15.9 : +7.9; -19.3 : +10.8; -19.7 : +12.7; -18.4 : +12.4; -19.4 : +12.9; -21.0 : +13.6
fragment mass error, ppm     50               10; 10; 10; 10; 11; 11
allowed miscleavages         2                1; 2; 2; 2; 2; 2

Comparative performance evaluation

Performance of IdentiPy was compared to X!Tandem (versions Cyclone, 2012.10.01.1, and Alanine, 2017.2.1.2), Morpheus (rev. 272), MS-GF+ (v. 2017.01.27), MaxQuant (1.5.3.30), and Comet (2017.01 rev. 3). Figure 2 shows the numbers of PSMs at 1% FDR obtained by these search engines for all three considered data sets (full q-value curves are shown in Supplementary Figure S-2). In this comparison, we used the following search parameters for all search engines: for the Confetti data set, 10 ppm and 0.05 Da for precursor and fragment mass tolerance, respectively, and the SwissProt human protein database; for the HEK293 data set, 20 ppm and 0.05 Da, and the same database; for the yeast data set, 10 ppm and 0.5 Da, and the SwissProt yeast protein database. In all searches, up to 2 missed cleavage sites were allowed, precursor charge states considered were from 2 to 5, and enabled modifications included cysteine carbamidomethylation (fixed) and methionine oxidation (variable). Auto-tuning was enabled for IdentiPy (without miscleavage optimization).

As shown in Figure 2, IdentiPy and MS-GF+ produce the most PSMs on all data sets. For the two data sets obtained with low fragment ion mass errors (Fig. 2a,2b), IdentiPy demonstrates the best search efficiency of all search engines under evaluation. For the ion trap data set, in which the MS/MS spectra were acquired at low mass measurement accuracy (Fig. 2c), MS-GF+ was the most efficient, yet IdentiPy keeps the second place at 1% FDR. This difference between high- and low-mass-tolerance MS/MS data is explained by the much higher numbers of false matches associated with the wide fragment mass tolerance windows. 42 Low resolution, low mass accuracy MS/MS data impose very strong requirements on statistical estimation of PSM significance, which are apparently met in the best way by MS-GF+. For high mass accuracy MS/MS data, even utterly simple scoring functions, such as the one employed in Morpheus, allow distinguishing between correct and incorrect matches. 12

Raw search engine output is often subsequently analyzed by post-search processing algorithms. Thus, a comparison of search engine performances is not complete without applying some post-search processing to obtain optimal results. However, no single software is applicable to all of the search engines. We picked Percolator as the most versatile post-processing tool and applied it to all search engines for which this was possible. The results are shown as orange bars in Figure 2. It should be noted that, even though we apply the same machine learning algorithm to all data, it operates on different sets of features, defined by the respective file converters. Still, this is probably the clearest way to compare the results of different search engines after post-search processing. Due to optimization of parameters, Percolator does not improve IdentiPy results as much as those of other search engines. However, IdentiPy, Comet and MS-GF+ show the best performance on all three data sets with or without Percolator.

Figure 2c also demonstrates that X!Tandem Alanine is unable to produce reliable PSMs for low-mass-accuracy data. Analysis of the issue has shown that the problem occurs due to the changes in e-value calculation, which has a lower limit of $10^{-15}$. With low mass tolerance, many PSMs reach this limit, making it impossible to rank them for subsequent target-decoy filtering. This problem occurs in all X!Tandem versions following Cyclone. Percolator successfully restores the possibility to filter the PSMs by calculating sensible q values.
Additionally, we performed a peptide-level comparison of the results of all search engines. We considered the sets of peptides at 1% peptide-level FDR after Percolator processing (where applicable). Modified peptides were not considered separately from their unmodified counterparts. The results are shown in Supplementary Figures S-6 and S-7. Relative performances at the peptide level are similar to those at the level of PSMs. Since every search engine, including IdentiPy, has unique peptide identifications, it is beneficial for performance to aggregate their results. Supplementary Figure S-8 shows a proof-of-principle example of combined analysis using three search engines, including IdentiPy.

MS/MS deisotoping played a significant role in overall search efficiency. For search engines having built-in deisotoping capabilities (Morpheus and IdentiPy), non-deisotoped mzML files were used. For other search engines, we enabled the “MS2 Deisotope” filter in the MSConvert utility when producing MGF files. Using the built-in deisotoping procedures resulted in a higher number of reliable PSMs for Morpheus and IdentiPy compared with the MSConvert deisotoping filter. For other search engines, the effect of deisotoping on sensitivity varied relative to non-deisotoped spectra, increasing or decreasing the number of reliable PSMs by approximately 1%, depending on the data set. Deisotoped spectra were used for these engines.

Figure 3 shows the effect of using built-in deisotoping on IdentiPy performance, as well as the difference between the original Hyperscore and RNHS. On the right in Figure 3, we show the number of PSMs at 1% FDR obtained with X!Tandem Cyclone on Confetti data. It is worth noting that IdentiPy yields significantly more reliable PSMs than X!Tandem when using the same parameters and scoring and without auto-tune. This is due to the difference in fragment peak matching: IdentiPy is configured to match singly-charged ion peaks only. With this change, Hyperscore becomes a much better metric for PSM ranking, whereas in X!Tandem it shows a much poorer discriminative power than the e-value. However, in this case efficient deisotoping of fragment ion spectra becomes more important.
As illustrated in Figure 3, built-in deisotoping further improves the sensitivity of IdentiPy, so that its overall efficiency is better than that of X!Tandem using e-values for PSM ranking. We applied ProteoWizard’s MSConvert to deisotope the MS/MS spectra before processing them with X!Tandem. The difference between Hyperscore and RNHS is not as significant for IdentiPy’s performance as deisotoping and fragment ion matching.

The numbers of PSMs at 1% FDR produced by all discussed search engines on all data sets (alone and with Percolator), as well as by IdentiPy with and without auto-tune and with Hyperscore used instead of RNHS, are listed in Supplementary Table S-3.

Search speed

We performed a speed comparison of the search engines under evaluation. The comparison was made using one replicate run of the Confetti data set on two different platforms: a Windows 7 PC and a Linux PC. The results are shown in Figure 4. For Windows 7 OS, Morpheus outperformed the other search engines, even surpassing both X!Tandem versions. IdentiPy and MS-GF+ were several times slower, taking several minutes to process a single MGF file (121 Mb, 51048 MS/MS spectra), while MaxQuant took as long as 50 minutes. On Linux OS, X!Tandem and Comet were the fastest, while IdentiPy was 3 to 4 times faster than MS-GF+. MaxQuant was not included in the Linux comparison. The reported measurements do not include the time needed to produce input files supported by the corresponding search engines. However, the presented times do include output conversion to pepXML or TSV format, using Tandem2XML from TPP for X!Tandem output and msgf2tsv for MS-GF+. The other search engines generate pepXML files directly.

Conclusions

We developed and evaluated IdentiPy, an open-source extensible search platform for shotgun proteomics. Using the scoring algorithm RNHS, closely based on the X!Tandem Hyperscore function, IdentiPy demonstrated search efficiency and processing time competitive with those delivered by the popular search engines. The sensitivity of IdentiPy exceeds that of all evaluated algorithms when applied to high-resolution data sets. The search engine comes with a user-friendly multi-user web interface, IdentiPy Server, incorporating peptide and protein identification, validation and filtering. Source code of IdentiPy and IdentiPy Server is available at http://hg.theorchromo.ru/identipy and http://hg.theorchromo.ru/identipy_server.

Acknowledgement

The work was supported by the Russian Science Foundation, grant #14-14-00971. The authors thank Drs. Arthur T. Kopylov, Sergei A. Moshkovskii, Vladimir A. Gorshkov, and Yury O. Tsybin for helpful discussions of the IdentiPy search engine functionality.

Supporting Information Available

Table S-1 – Initial and optimized parameters for Confetti and Yeast data sets.
Table S-2 – Contribution of individual parameters to the effect of auto-tune.
Table S-3 – Number of PSMs with 1% FDR identified by the considered search engines.
Figure S-1 – Q-value curves for IdentiPy auto-tuning comparison (HEK293 data set).
Figure S-2 – Q-value curves for search engine sensitivity comparison (all data sets).
Figure S-3 – Precursor mass error distributions in Confetti.
Figure S-4 – Screen shot of IdentiPy Server settings page.
Figure S-5 – Screen shot of IdentiPy Server result page.
Figure S-6 – Pairwise peptide-level comparison between IdentiPy and other search engines.
Figure S-7 – Nested peptide set intersections between the considered search engines.
Figure S-8 – Example of combined analysis using iProphet.

References

(1) Aebersold, R.; Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 2016, 537, 347–355.
(2) Reiter, L.; Claassen, M.; Schrimpf, S. P.; Jovanovic, M.; Schmidt, A.; Buhmann, J. M.; Hengartner, M. O.; Aebersold, R. Protein Identification False Discovery Rates for Very Large Proteomics Data Sets Generated by Tandem Mass Spectrometry. Molecular & Cellular Proteomics 2009, 8, 2405–2417.
(3) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry 1994, 5, 976–989.
(4) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466–7.
(5) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567.
(6) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open Mass Spectrometry Search Algorithm. Journal of Proteome Research 2004, 3, 958–964.
(7) Eng, J. K.; Jahan, T. A.; Hoopmann, M. R. Comet: An open-source MS/MS sequence database search tool. Proteomics 2013, 13, 22–24.
(8) Tabb, D. L.; Fernando, C. G.; Chambers, M. C. MyriMatch: Highly Accurate Tandem Mass Spectral Peptide Identification by Multivariate Hypergeometric Analysis. Journal of Proteome Research 2007, 6, 654–661.
(9) Cox, J.; Neuhauser, N.; Michalski, A.; Scheltema, R. A.; Olsen, J. V.; Mann, M. Andromeda: A Peptide Search Engine Integrated into the MaxQuant Environment. Journal of Proteome Research 2011, 10, 1794–1805.
(10) Dorfer, V.; Pichler, P.; Stranzl, T.; Stadlmann, J.; Taus, T.; Winkler, S.; Mechtler, K. MS Amanda, a Universal Identification Algorithm Optimized for High Accuracy Tandem Mass Spectra. Journal of Proteome Research 2014, 13, 3679–3684.
(11) Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nature Communications 2014, 5, 5277.
(12) Wenger, C. D.; Coon, J. J. A Proteomics Search Algorithm Specifically Designed for High-Resolution Tandem Mass Spectra. Journal of Proteome Research 2013, 12, 1377–1386.
(13) Kong, A. T.; Leprevost, F. V.; Avtonomov, D. M.; Mellacheruvu, D.; Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nature Methods 2017, 14, 513–520.
(14) Nesvizhskii, A. I.; Vitek, O.; Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nature Methods 2007, 4, 787–797.
(15) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search. Analytical Chemistry 2002, 74, 5383–5392.
(16) Shteynberg, D.; Deutsch, E. W.; Lam, H.; Eng, J. K.; Sun, Z.; Tasman, N.; Mendoza, L.; Moritz, R. L.; Aebersold, R.; Nesvizhskii, A. I. iProphet: Multi-level Integrative Analysis of Shotgun Proteomic Data Improves Peptide and Protein Identification Rates and Error Estimates. Molecular & Cellular Proteomics 2011, 10, M111.007690.
(17) Käll, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature Methods 2007, 4, 923–925.
(18) Ivanov, M. V.; Levitsky, L. I.; Lobas, A. A.; Panic, T.; Laskay, Ü. A.; Mitulovic, G.; Schmid, R.; Pridatchenko, M. L.; Tsybin, Y. O.; Gorshkov, M. V. Empirical Multidimensional Space for Scoring Peptide Spectrum Matches in Shotgun Proteomics. Journal of Proteome Research 2014, 13, 1911–1920.
(19) Zhang, Y.; Fonslow, B. R.; Shan, B.; Baek, M.-C.; Yates, J. R. Protein Analysis by Shotgun/Bottom-up Proteomics. Chemical Reviews 2013, 113, 2343–2394.
(20) Nesvizhskii, A. I.; Aebersold, R. Interpretation of Shotgun Proteomic Data. Molecular & Cellular Proteomics 2005, 4, 1419–1440.
(21) Huang, T.; Wang, J.; Yu, W.; He, Z. Protein inference: a review. Briefings in Bioinformatics 2012, 13, 586–614.
(22) Zhang, B.; Chambers, M. C.; Tabb, D. L. Proteomic Parsimony through Bipartite Graph Analysis Improves Accuracy and Transparency. Journal of Proteome Research 2007, 6, 3549–3557.
(23) Slotta, D. J.; McFarland, M. A.; Markey, S. P. MassSieve: Panning MS/MS peptide data for proteins. Proteomics 2010, 10, 3035–3039.
(24) Ma, Z.-Q.; Dasari, S.; Chambers, M. C.; Litton, M. D.; Sobecki, S. M.; Zimmerman, L. J.; Halvey, P. J.; Schilling, B.; Drake, P. M.; Gibson, B. W.; Tabb, D. L. IDPicker 2.0: Improved Protein Assembly with High Discrimination Peptide Identification Filtering. Journal of Proteome Research 2009, 8, 3872–3881.
(25) Nesvizhskii, A. I.; Keller, A. A statistical model for identifying proteins by tandem mass spectrometry. Analytical Chemistry 2003, 75, 4646–4658.
(26) Li, J.; Zimmerman, L. J.; Park, B.-H.; Tabb, D. L.; Liebler, D. C.; Zhang, B. Network-assisted protein identification and data interpretation in shotgun proteomics. Molecular Systems Biology 2009, 5, 1–11.
(27) Serang, O.; MacCoss, M. J.; Noble, W. S. Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data. Journal of Proteome Research 2010, 9, 5346–5357.
(28) Shteynberg, D.; Nesvizhskii, A. I.; Moritz, R. L.; Deutsch, E. W. Combining Results of Multiple Search Engines in Proteomics. Molecular & Cellular Proteomics 2013, 12, 2383–2393.
(29) Vaudel, M.; Barsnes, H.; Berven, F. S.; Sickmann, A.; Martens, L. SearchGUI: An open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics 2011, 11, 996–999.
(30) Vaudel, M.; Burkhart, J. M.; Zahedi, R. P.; Oveland, E.; Berven, F. S.; Sickmann, A.; Martens, L.; Barsnes, H. PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nature Biotechnology 2015, 33, 22–24.
(31) Röst, H. L. et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nature Methods 2016, 13, 741–748.
(32) Jagtap, P. D.; Johnson, J. E.; Onsongo, G.; Sadler, F. W.; Murray, K.; Wang, Y.; Sheynkman, G. M.; Bandhakavi, S.; Smith, L. M.; Griffin, T. J. Flexible and Accessible Workflows for Improved Proteogenomic Analysis Using the Galaxy Framework. Journal of Proteome Research 2014, 13, 5898–5908.
(33) Deutsch, E. W.; Mendoza, L.; Shteynberg, D.; Slagel, J.; Sun, Z.; Moritz, R. L. Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics. Proteomics - Clinical Applications 2015, 9, 745–754.
(34) Noble, W. S.; MacCoss, M. J. Computational and Statistical Analysis of Protein Mass Spectrometry Data. PLoS Computational Biology 2012, 8, e1002296.
(35) Goloborodko, A. A.; Levitsky, L. I.; Ivanov, M. V.; Gorshkov, M. V. Pyteomics – a Python Framework for Exploratory Data Analysis and Rapid Software Prototyping in Proteomics. Journal of The American Society for Mass Spectrometry 2013, 24, 301–304.
(36) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods 2007, 4, 207–214.
(37) Searle, B. C. Scaffold: A bioinformatic tool for validating MS/MS-based proteomic studies. Proteomics 2010, 10, 1265–1269.
(38) Guo, X.; Trudgian, D. C.; Lemoff, A.; Yadavalli, S.; Mirzaei, H. Confetti: A Multiprotease Map of the HeLa Proteome for Comprehensive Proteomics. Molecular & Cellular Proteomics 2014, 13, 1573–1584.
(39) Geiger, T.; Wehner, A.; Schaab, C.; Cox, J.; Mann, M. Comparative Proteomic Analysis of Eleven Common Cell Lines Reveals Ubiquitous but Varying Expression of Most Proteins. Molecular & Cellular Proteomics 2012, 11, M111.014050–M111.014050.
(40) Hebert, A. S.; Richards, A. L.; Bailey, D. J.; Ulbrich, A.; Coughlin, E. E.; Westphall, M. S.; Coon, J. J. The One Hour Yeast Proteome. Molecular & Cellular Proteomics 2014, 13, 339–347.
(41) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24, 2534–2536.
(42) Ivanov, M. V.; Levitsky, L. I.; Lobas, A. A.; Tarasova, I. A.; Pridatchenko, M. L.; Zgoda, V. G.; Moshkovskii, S. A.; Mitulovic, G.; Gorshkov, M. V. Peptide identification in “shotgun” proteomics using tandem mass spectrometry: Comparison of search engine algorithms. Journal of Analytical Chemistry 2015, 70, 1614–1619.
(43) Jung, H.-J. et al. Integrated Post-Experiment Monoisotopic Mass Refinement: An Integrated Approach to Accurately Assign Monoisotopic Precursor Masses to Tandem Mass Spectrometric Data. Analytical Chemistry 2010, 82, 8510–8518.
(44) Nesvizhskii, A. I. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. Journal of Proteomics 2010, 73, 2092–2123.
(45) Ding, Y.; Choi, H.; Nesvizhskii, A. I. Adaptive Discriminant Function Analysis and Reranking of MS/MS Database Search Results for Improved Peptide Identification in Shotgun Proteomics. Journal of Proteome Research 2008, 7, 4878–4889.
(46) Meek, J. L. Prediction of Peptide Retention Times in High-pressure Liquid Chromatography on the Basis of Amino Acid Composition. Proceedings of the National Academy of Sciences 1980, 77, 1632–1636.
(47) Ivanov, M. V.; Levitsky, L. I.; Gorshkov, M. V. Adaptation of Decoy Fusion Strategy for Existing Multi-Stage Search Workflows. Journal of The American Society for Mass Spectrometry 2016, 27, 1579–1582.
(48) Gupta, N.; Bandeira, N.; Keich, U.; Pevzner, P. A. Target-decoy approach and false discovery rate: when things may go wrong. Journal of the American Society for Mass Spectrometry 2011, 22, 1111–20.
(49) May, D. H.; Tamura, K.; Noble, W. S. Param-Medic: A Tool for Improving MS/MS Database Search Yield by Optimizing Parameter Settings. Journal of Proteome Research 2017, 16, 1817–1824.
(50) Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nature Biotechnology 2008, 26, 1367–1372.


Graphical TOC Entry

[Graphic: IdentiPy workflow with the labels "spectrum processing", "database", "scoring", "filtering", "optimization", "representation", "validation", and "single config file", shown alongside an example numbered list of seven peptide sequences with numeric values.]


[Figure 1 plot: bar chart; y-axis: # of PSMs (0 to 100k); bars: auto-tune (AT) off and AT on, each for near-optimal parameters and for bad parameters.]

Figure 1: Number of PSMs produced by IdentiPy search engine with four sets of parameters (HEK293 data set). The identifications were filtered to 1% FDR. Enabling the auto-tune (AT) feature improves search efficiency for both near-optimal and poorly set parameters.


[Figure 2 plots: bar charts in three panels, a), b) and c); y-axis: # of PSMs; x-axis: Comet, IdentiPy, MS-GF+, MaxQuant, Morpheus, X!Tandem Alanine, X!Tandem Cyclone.]

Figure 2: Blue bars: number of PSMs at 1% FDR obtained by the considered search engines for three data sets: a) Confetti; b) HEK293; c) yeast proteome. IdentiPy is the most efficient on data sets (a) and (b). Orange bars: increase in the number of PSMs at 1% FDR after applying Percolator to the search engine output. Morpheus and MaxQuant results were not processed with Percolator.


Figure 3: Number of PSMs after filtering to 1% FDR using different metrics for PSM ranking (Confetti data set). “With deisotoping”: IdentiPy’s built-in deisotoping is used. “Without deisotoping”: IdentiPy is run on files deisotoped with MSConvert. Auto-tune was disabled in this comparison.


[Figure 4 plots: bar charts in two panels; y-axis: processing time, min (a: 0 to 50; b: 0 to 25); bars in a): Comet, X!Tandem Cyclone, X!Tandem Alanine, Morpheus, IdentiPy, IdentiPy AT, MS-GF+, MaxQuant; bars in b): the same engines without MaxQuant.]

Figure 4: Processing times for different search engines on one replicate run of Confetti data: a) on a Windows 7 PC, Intel Core i3, 4 virtual cores, b) on a Linux PC, Intel Core i7, 12 virtual cores.
