Article pubs.acs.org/jpr
Cite This: J. Proteome Res. 2018, 17, 2249−2255
IdentiPy: An Extensible Search Engine for Protein Identification in Shotgun Proteomics
Lev I. Levitsky,†,‡,§ Mark V. Ivanov,‡,§ Anna A. Lobas,‡ Julia A. Bubis,‡ Irina A. Tarasova,‡ Elizaveta M. Solovyeva,‡ Marina L. Pridatchenko,‡ and Mikhail V. Gorshkov*,‡
† Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow Region 141700, Russian Federation
‡ V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, 38 Leninsky Pr., Bld. 2, Moscow 119334, Russia
ABSTRACT: We present an open-source, extensible search engine for shotgun proteomics. Implemented in the Python programming language, IdentiPy shows competitive processing speed and sensitivity compared with the state-of-the-art search engines. It is equipped with a user-friendly web interface, IdentiPy Server, enabling the use of a single server installation accessed from multiple workstations. Using a simplified version of the X!Tandem scoring algorithm and a novel “autotune” feature, IdentiPy outperforms the popular alternatives on high-resolution data sets. Autotune adjusts the search parameters for the particular data set, resulting in improved search efficiency and simplifying the user experience. IdentiPy with the autotune feature shows higher sensitivity compared with the evaluated search engines. IdentiPy Server has built-in postprocessing and protein inference procedures and provides graphic visualization of the statistical properties of the data set and the search results. It is open-source and can be freely extended to use third-party scoring functions or processing algorithms and allows customization of the search workflow for specialized applications.
KEYWORDS: proteomics, proteomic search engine, peptide identification
Received: September 6, 2017
Published: April 23, 2018
DOI: 10.1021/acs.jproteome.7b00640
© 2018 American Chemical Society
■
INTRODUCTION
Bottom-up proteomics remains the dominant approach for protein identification in biological samples.1 With the progress in mass spectrometry instrumentation and separation methods, the size of proteomic data sets grows substantially, imposing new challenges on data processing algorithms.2 The efforts of the last 15 years have resulted in the development of a number of search engines, including SEQUEST,3 X!Tandem,4 Mascot,5 OMSSA,6 Comet,7 MyriMatch,8 Andromeda,9 MS Amanda,10 MS-GF+,11 Morpheus,12 and MSFragger,13 among the others. A single score typically produced by search engines is usually significantly suboptimal as a means of filtering,14 as demonstrated by the crucial effect of postsearch processing tools, such as PeptideProphet,15 iProphet,16 Percolator,17 and MP score.18 Postsearch validation and filtering may drastically improve both sensitivity and specificity of peptide and protein identification.19 Another layer of complexity is imposed by the protein inference problem.20,21 Two principal approaches to protein inference are the parsimony rule22−24 and statistical models additionally attempting to estimate probabilities associated with individual proteins.25−27 The vast amount of possible combinations of processing software is exacerbated by the recent trend of applying several search algorithms to the same data set to improve sensitivity.28 This has led to the introduction of “aggregating” software responsible for unification of the complex data processing workflows, such as iProphet,16 SearchGUI29 and PeptideShaker,30 or Proteome Discoverer, as well as software frameworks such as OpenMS31 and adaptation of workflow engines like Galaxy.32 Workflow organizers developed specifically for proteomics, such as Trans-Proteomic Pipeline (TPP),33 are also commonly used. Using TPP with the Prophet algorithms or SearchGUI with PeptideShaker allows significant improvement of the overall sensitivity of proteome analysis by combining the advantages of different search engines in a statistically sound manner while retaining reproducibility.16,28,30 Every aspect of data handling, from spectrum preprocessing to scoring spectral matches and postsearch validation, affects the results and needs to be tailored to the properties of the data set and instrumentation. A recent trend is the development of approaches suited for high-resolution, high-accuracy mass spectrometry.10,12 These approaches, however, are not applicable to data sets obtained on older instruments with lower mass accuracy and resolving power. Designing a single, versatile workflow becomes a challenge given the variety of instruments used in the field.34
In this work we present IdentiPy, a Python-based open-source extensible engine for peptide identification in shotgun proteomics. We compare its performance to a number of popular search engines and discuss the effect of scoring-independent optimizations that can be applied in other search engines. IdentiPy is complemented by a GUI encompassing peptide identification, postsearch validation, protein inference, and quantification.
$$\mathrm{RNHS} = \frac{\sum_i I_i \cdot m_i}{\sum_i I_i} \cdot N_b! \cdot N_y! \qquad (1)$$
where $I_i$ are absolute fragment ion intensities, $m_i$ is 1 for matching peaks and 0 for others, and $N_b$ and $N_y$ are numbers of matched b- and y-ions. It is worth noting that in X!Tandem Hyperscore is not normally used as the final measure of peptide-spectrum match (PSM) quality but rather as a means of calculating the expectation value. In IdentiPy, however, RNHS is used directly for PSM ranking. In this work, we used RNHS for scoring of PSMs with IdentiPy in all search engine evaluations.
Postprocessing and FDR Filtering. IdentiPy Server automatically performs postsearch analysis of the search results using the MP score algorithm.18 The MP score is a measure of PSM quality orthogonal to the search score, calculated using a number of descriptors. These descriptors include precursor ion mass error, median fragment ion mass error, retention time prediction error, number of potential modifications, number of missed cleavages, peptide and protein PSM counts, and isotope mass error. Each descriptor contributes a factor to the final score value. The factor is calculated as the value of the probability mass function or the cumulative distribution function (depending on the descriptor) estimated empirically from the top-scoring PSMs based on the initial search score. After rescoring of PSMs, MP score produces peptide and protein lists and performs FDR filtering based on the target-decoy approach (TDA).36 Proteins are grouped according to shared peptides; each group contains the minimum set of proteins necessary to explain a set of peptides. If a protein is identified by one or more unique peptides, then it becomes a “group leader”; other proteins sharing peptides with the leader but having no unique peptide evidence are added into the group. The grouping is performed by a greedy algorithm that iteratively picks out proteins identified by the highest number of peptides and excludes those peptides from further consideration.
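The greedy grouping step described above can be illustrated with a short sketch. The function name, the input mapping, and the alphabetical tie-breaking rule are assumptions made for this illustration and are not taken from the MP score source code. In particular, a protein is attached here to the leader whose peptides exhaust its remaining evidence, which is only one possible reading of the description.

```python
def greedy_protein_groups(protein_to_peptides):
    """Group proteins greedily by shared peptides (illustrative sketch).

    protein_to_peptides: dict mapping a protein ID to an iterable of peptides.
    Returns a list of dicts with keys 'leader', 'members', and 'peptides'.
    """
    # Peptides of each protein that are not yet explained by a chosen leader.
    remaining = {prot: set(peps)
                 for prot, peps in protein_to_peptides.items() if peps}
    groups = []
    while remaining:
        # Pick the protein explaining the most unexplained peptides.
        # Sorting first makes ties resolve alphabetically (an assumption).
        leader = max(sorted(remaining), key=lambda prot: len(remaining[prot]))
        explained = remaining.pop(leader)
        members = []
        for prot in sorted(remaining):
            # Exclude the leader's peptides from further consideration.
            remaining[prot] -= explained
            if not remaining[prot]:
                # No peptide evidence left: the protein joins this group.
                members.append(prot)
        for prot in members:
            del remaining[prot]
        groups.append({"leader": leader,
                       "members": members,
                       "peptides": sorted(explained)})
    return groups


example = {
    "P1": {"a", "b", "c"},
    "P2": {"b", "c"},   # fully covered by P1
    "P3": {"c", "d"},   # keeps the unique peptide "d"
    "P4": {"e"},
}
groups = greedy_protein_groups(example)
```

In this toy input, P2 has no peptide that P1 lacks, so it ends up as a member of the P1 group, whereas P3 and P4 each lead their own group.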
This grouping approach is similar to the ProteinProphet25 algorithm, as implemented by Scaffold.37 However, unlike in ProteinProphet, individual protein probabilities are not calculated; only “group leaders” are scored. Processing parameters for MP score can also be set via the web interface.
For the comparisons performed in this study, however, we used IdentiPy and the other evaluated search engines directly and in combination with Percolator. Q-value curves and bar charts were plotted from search engine output files using the Pyteomics library and based on the target-decoy approach. All searches were performed against a concatenated target-decoy database constructed using Pyteomics. Decoy protein sequences were generated by reversing the original sequences.
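As a minimal illustration of the target-decoy bookkeeping described here, the sketch below builds a concatenated database with reversed decoys and converts a ranked PSM list into Q-values. It is written in plain Python and does not reproduce the Pyteomics API; the function names, the `DECOY_` prefix, and the FDR estimate (number of decoys divided by number of targets above a score threshold) are assumptions of this sketch.

```python
def make_concatenated_db(proteins, decoy_prefix="DECOY_"):
    """Return target sequences plus reversed decoys in one dict.

    proteins: dict mapping a protein name to its amino acid sequence.
    The decoy prefix is a hypothetical naming convention.
    """
    db = dict(proteins)
    for name, sequence in proteins.items():
        db[decoy_prefix + name] = sequence[::-1]  # reversed sequence
    return db


def qvalues(psms):
    """Compute Q-values for PSMs given as (score, is_decoy) pairs.

    Higher scores are assumed to be better. Returns a list of
    (score, is_decoy, q) tuples sorted by descending score.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # FDR estimate at this threshold: decoys / targets.
        fdrs.append(decoys / max(targets, 1))
    # Q-value: the smallest FDR at this or any more permissive threshold.
    q = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q.append(running_min)
    q.reverse()
    return [(score, is_decoy, qv) for (score, is_decoy), qv in zip(ranked, q)]


db = make_concatenated_db({"P1": "MKTAY"})
scored = qvalues([(10, False), (9, False), (8, True), (7, False), (6, False)])
```

Filtering to 1% FDR then amounts to keeping the target PSMs whose Q-value does not exceed 0.01.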
MATERIALS AND METHODS
Search Algorithm
Architecture. The software is divided into two Python packages: the search engine and the web interface. IdentiPy search engine is a Python package responsible for peptide identification. Its operation is based on the previously developed Pyteomics library35 and managed by a single plaintext configuration file. The file incorporates the common search parameters, such as precursor and fragment ion mass tolerances, cleavage rules, database options, and so on. It also controls the unique features of IdentiPy, including hooking the user-written code for scoring or multistage processing. The web interface (IdentiPy Server) is written in Python using the Django framework. It implements the client−server architecture, with IdentiPy installed on the server only. Multiple clients can use the server via a web browser. The user interface includes a user authentication system and allows users to upload MS/MS data, protein databases, and configuration files to the server, start searches, as well as view the results of the completed searches. Search parameters can also be set directly in the web interface. IdentiPy Server automatically performs postsearch validation, filtering, and label-free protein quantification using MP score software.18 Examples of the interface are shown as Supplementary Figures S-4 and S-5. Spectrum Preprocessing. IdentiPy accepts centroided spectra in MGF and mzML formats. Spectrum preprocessing procedures include peak filtering and deisotoping. Peak filtering eliminates low-intensity fragment ion peaks from MS/MS spectra prior to matching and is governed by parameters that behave identically to X!Tandem’s “dynamic range” and “maximum peaks”. 
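The peak filtering step can be sketched as follows. This is an illustration under stated assumptions rather than IdentiPy's code: the function and parameter names are hypothetical, and the semantics follow a common reading of the two X!Tandem options, namely that peaks weaker than the base peak intensity divided by the dynamic range are dropped and that only the most intense peaks are kept up to the allowed maximum.

```python
def filter_peaks(mz, intensity, dynamic_range=100.0, maximum_peaks=50):
    """Drop low-intensity fragment peaks from a centroided MS/MS spectrum.

    mz, intensity: equally long sequences describing the peaks.
    dynamic_range: peaks below (base peak / dynamic_range) are removed.
    maximum_peaks: at most this many of the most intense peaks are kept.
    Returns two lists (m/z values and intensities) sorted by m/z.
    """
    if not mz:
        return [], []
    base_peak = max(intensity)
    if base_peak <= 0:
        return [], []
    threshold = base_peak / dynamic_range
    kept = [(m, i) for m, i in zip(mz, intensity) if i >= threshold]
    # Keep the most intense peaks first, then restore m/z order.
    kept.sort(key=lambda peak: peak[1], reverse=True)
    kept = sorted(kept[:maximum_peaks])
    return [m for m, _ in kept], [i for _, i in kept]


filtered_mz, filtered_int = filter_peaks(
    [100.0, 200.0, 300.0, 400.0],
    [1000.0, 5.0, 500.0, 20.0],
    dynamic_range=100.0,
    maximum_peaks=2,
)
```

With these toy values the threshold is 10, so the peak at m/z 200 is removed by the dynamic range rule and the peak at m/z 400 by the peak count limit.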
The deisotoping algorithm is straightforward and follows the principles described by Coon et al.12 When a peak distance of 1/z is found in the MS/MS spectrum, where z is a possible value of fragment charge, the extra peaks are eliminated as long as they are less intense than the monoisotopic peak candidate, and the fragment m/z is recalculated to correspond to a singly charged fragment.
Scoring. The IdentiPy search engine can employ a user-defined Python function for peptide scoring. This is done by specifying the importable function name in the configuration file, as illustrated in the default file included in the software. We implemented two scoring functions in the current version of the platform: the Morpheus score12 and the Hyperscore from X!Tandem.4 We also proposed and implemented a modification to the Hyperscore function, named RNHS (renormalized Hyperscore). It uses a different normalization of the fragment ion intensities in MS/MS spectra compared with X!Tandem. In X!Tandem, the fragment peak intensities are effectively normalized by the highest peak, followed by summing the matching peak intensities and multiplying the sum by factorials corresponding to ion series lengths. In RNHS, the sum of intensities of the matched peaks is instead normalized by the total intensity. The rest of the scoring algorithm is identical to Hyperscore (see eq 1).
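The RNHS definition of eq 1 translates almost directly into code. The sketch below is an illustration, not the function shipped with IdentiPy: the name `rnhs` and its argument layout are assumptions, and peak matching is taken as already done, so that the caller supplies one match flag per peak together with the numbers of matched b- and y-ions.

```python
from math import factorial


def rnhs(intensities, matched, n_b, n_y):
    """Renormalized Hyperscore following eq 1.

    intensities: absolute fragment ion intensities of the spectrum.
    matched: flags of the same length, truthy for peaks that match a
        theoretical fragment (the m_i of eq 1).
    n_b, n_y: numbers of matched b- and y-ions.
    """
    total = sum(intensities)
    if total <= 0:
        return 0.0  # empty or all-zero spectrum: nothing to score
    matched_sum = sum(i for i, m in zip(intensities, matched) if m)
    # Matched intensity normalized by the total intensity,
    # multiplied by the factorials of the matched ion counts.
    return matched_sum / total * factorial(n_b) * factorial(n_y)


score = rnhs([10.0, 20.0, 30.0, 40.0], [True, False, True, False], n_b=2, n_y=3)
```

In this example the matched peaks carry 40% of the total intensity, and the factorial terms contribute a factor of 2! · 3! = 12, which shows how strongly the score rewards additional matched ions.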
Data sets
Three data sets were used for evaluation of search engine performance: tryptic digestion data from the Confetti data set38 (three replicate runs, HCD fragmentation); HEK293 data set39 (one replicate run from each of the six fractions); and 1 h yeast proteome data40 (single run), obtained using Q Exactive Orbitrap FTMS, Orbitrap FTMS Velos, and Orbitrap FTMS Fusion instruments from Thermo Fisher (Bremen, Germany), respectively. For all search engines except MaxQuant, we used MGF and mzML files produced with the MSConvert utility from the ProteoWizard library.41
■
RESULTS AND DISCUSSION
While this makes Param-Medic compatible with any search engine, optimizing the results from a subset of reliable identifications, as implemented in IdentiPy, is inherently more robust. As another example, MaxQuant50 performs mass recalibration of the spectra, followed by setting an optimized (much smaller) value of precursor ion mass tolerance. The main purpose of autotuning is increasing the number of reliable PSMs, but it also significantly simplifies the researcher’s work and reduces the probability of human errors during data processing. For example, the use of inaccurate search parameters (100 ppm precursor and 0.5 Da fragment mass tolerances and five missed cleavages) significantly reduces the number of identifications for HEK293 data set, as shown in Figure 1. Autotuning restores the search efficiency by
Autotuning
One of the unique features of IdentiPy is the “autotune” mode for optimizing the search parameters. Autotune is implemented as an instance of arbitrary multistage processing functionality of IdentiPy. This functionality allows for an arbitrary function to preprocess the given spectra and initial parameter set and return the updated parameter set for the next search over the same spectra. The optimization procedure is as follows. First, IdentiPy performs a preliminary search with user-defined parameters. The search results are filtered to 1% PSM FDR using the target-decoy approach and analyzed statistically to derive the optimal parameters. Statistical analysis includes calculating the distributions of peptide ion charge states, precursor and fragment ion mass errors, and number of missed cleavage sites as well as calibration of the retention time (RT) prediction model. The distributions are used to determine the optimal values for the corresponding search parameters (precursor and fragment ion mass tolerance and maximum number of missed cleavages). This is done by setting the parameters to empirically established percentages of identifications from the preliminary search. For precursor ion mass, the normal distribution is fit to the precursor mass errors of the filtered PSMs, and the optimized tolerance window is centered at the maximum of the fit. The width of the window is set to eight times the standard deviation calculated from the fit. This allows us to correct for systematic errors and significantly narrow down the window if it is needed; however, the window is several times wider than the usually chosen tolerance. This choice does not have a significant impact on search efficiency,42 but it can be beneficial for subsequent postsearch analysis.43−45 Gaussian fitting of the precursor mass error distribution may fail due to artifacts of mass calibration (see Supplementary Figure S-3 for an example). 
In this case, the tolerance is set at the 0.1th and 99.9th percentiles of the mass error distribution. For missed cleavages, the cutoff value is calculated by rounding up the 99.5th percentile of the corresponding distribution. For fragment ion mass error, the 68th percentile of the median fragment mass error is calculated (in ppm) and quadrupled to obtain the optimal fragment ion mass tolerance. RT prediction is used in the second search to filter out incorrect sequence candidates. The “additive” RT prediction model (i.e., based on the retention coefficient theory)46 is trained on the filtered PSMs, followed by calculation of the RT prediction error distribution on the same set of PSMs. The 0.5th and 99.5th percentiles of the distribution are used as a filter for all candidate peptides in the second search: Peptides with predicted RT outside of the specified window relative to a given spectrum are not scored. It should be noted that several existing implementations of multipass searching cause systematic discrepancies in FDR estimation.47,48 This is not the case for autotune because decoy sequences are not discriminated based on the preliminary search results. Unlike X!Tandem’s refinement mode or Mascot’s error-tolerant search, which impose a score threshold for admission to the final stage of the search, the only thing affected by autotune are the search parameters, which apply equally to decoy and target proteins. Optimization of search parameters for a particular data set has been used before. For example, the Param-Medic tool49 derives the optimal parameters from analyzing the spectra alone, without knowledge of the peptides that produced them.
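The parameter derivation rules of the autotune procedure can be summarized in a compact sketch. Several simplifications are assumed here and are not taken from the IdentiPy source: the Gaussian fit is replaced by the sample mean and standard deviation, the percentile uses linear interpolation, fragment errors are treated as absolute values, and all function names and dictionary keys are hypothetical.

```python
import math
import statistics


def percentile(values, q):
    """Percentile with linear interpolation; q is given in percent."""
    data = sorted(values)
    if not data:
        raise ValueError("no values given")
    pos = (len(data) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def precursor_window(errors_ppm, gaussian_ok=True):
    """Precursor tolerance window (ppm) from errors of the filtered PSMs."""
    if gaussian_ok:
        # Stand-in for the Gaussian fit: center at the mean and use a
        # total width of eight standard deviations (mean +/- 4 sigma).
        mu = statistics.fmean(errors_ppm)
        sigma = statistics.stdev(errors_ppm)
        return mu - 4 * sigma, mu + 4 * sigma
    # Fallback when the fit fails: 0.1th and 99.9th percentiles.
    return percentile(errors_ppm, 0.1), percentile(errors_ppm, 99.9)


def autotune_parameters(precursor_errors_ppm, median_fragment_errors_ppm,
                        missed_cleavages, gaussian_ok=True):
    """Derive second-pass search parameters from first-pass PSM statistics."""
    abs_fragment = [abs(e) for e in median_fragment_errors_ppm]
    return {
        "precursor_window_ppm": precursor_window(precursor_errors_ppm,
                                                 gaussian_ok),
        # Four times the 68th percentile of the median fragment error.
        "fragment_tolerance_ppm": 4 * percentile(abs_fragment, 68),
        # 99.5th percentile of missed cleavage counts, rounded up.
        "max_missed_cleavages": math.ceil(percentile(missed_cleavages, 99.5)),
    }


params = autotune_parameters(
    precursor_errors_ppm=[-2, -1, 0, 1, 2],
    median_fragment_errors_ppm=[1, 2, 3, 4, 5],
    missed_cleavages=[0, 0, 0, 1, 1, 2],
)
```

The RT window for the second search could be derived in the same way by applying `percentile` with 0.5 and 99.5 to the RT prediction errors of the filtered PSMs.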
Figure 1. Number of PSMs produced by IdentiPy search engine with four sets of parameters (HEK293 data set). The identifications were filtered to 1% FDR. Enabling the autotune (AT) feature improves search efficiency for both near-optimal and poorly set parameters.
narrowing the mass tolerances and reducing the number of allowed miscleavages based on preliminary results of the search with incorrect input parameters. Full Q-value curves, allowing us to evaluate the search sensitivity at all FDR levels, are shown in Supplementary Figure S-1. As shown in Figure 1, autotune successfully restores the efficiency with unrealistic initial parameters and yields an improvement of 9.3 thousand PSMs (∼8.8%) at 1% FDR for near-optimal initial parameters. Initial and optimized parameters are listed in Table 1 (for HEK293 data) and Supplementary Table S-1 (for Confetti and yeast data). Exact numbers of PSMs with and without autotune are shown in Supplementary Table S-3. Contribution of optimization of individual search parameters to the net effect of autotune depends on the properties of the data set and the choice of initial parameters. Using the Confetti data set as an example, we evaluated the effect of tuning single parameters one by one (Supplementary Table S-2). The main role in the net effect of autotune is typically played by fragment ion mass error optimization. This is to be expected, as the other search parameters only affect the search space, reducing it by a factor of 1−5 (depending on the parameter and the initial guess of the user). Fragment ion mass error, on the contrary, affects the scoring of peptides, rather than the search space size, which has a more prominent effect on the search engine’s ability to discriminate between correct and incorrect matches. Comparative Performance Evaluation
The performance of IdentiPy was compared with X!Tandem (versions Cyclone, 2012.10.01.1, and Alanine, 2017.2.1.2), Morpheus (rev. 272), MS-GF+ (v. 2017.01.27), MaxQuant (1.5.3.30), and Comet (2017.01 rev. 3). Figure 2 shows the number of PSMs at 1% FDR obtained by these search engines for all three considered data sets (full Q-value curves are shown in Supplementary Figure S-2). In this comparison, we used the following search parameters for all search engines: for Confetti data set, 10 ppm and 0.05 Da for precursor and fragment mass tolerance, respectively, SwissProt human protein database; for HEK293 data set, 20 ppm and 0.05 Da, and the same database; for the yeast data set, 10 ppm and 0.5 Da, and SwissProt yeast protein database. In all searches, up to two missed cleavage sites were enabled, precursor charge states considered were from 2 to 5, and enabled modifications included cysteine carbamidomethylation (fixed) and methionine oxidation (variable). Autotuning was enabled for IdentiPy (without miscleavage optimization).
As shown in Figure 2, IdentiPy and MS-GF+ produce the most PSMs on all data sets. For the two data sets obtained with low fragment ion mass errors (Figure 2a,b), IdentiPy demonstrates the best search efficiency of all search engines under evaluation. For the ion trap data set, in which the MS/MS spectra were acquired at low mass measurement accuracy (Figure 2c), MS-GF+ was the most efficient, yet, IdentiPy keeps the second place at 1% FDR. This difference between high- and low-mass-tolerance MS/MS data is explained by the much higher numbers of false matches associated with the wide fragment mass tolerance windows.42 Low-resolution, low-mass-accuracy MS/MS data impose very strong requirements on statistical estimation of PSM significance, which are apparently met in the best way by MS-GF+. For high-mass-accuracy MS/MS data, even the utterly simple scoring functions, such as the one employed in Morpheus, allow us to distinguish between correct and incorrect matches.12
Raw search engine output is often subsequently analyzed by postsearch processing algorithms. Thus a comparison of search engine performances is not complete without applying some postsearch processing to obtain optimal results. However, no single software is compatible with all of the search engines. We picked Percolator as the most versatile postprocessing tool and applied it to all search engines possible. The results are shown as orange bars in Figure 2. It should be noted that even though we apply the same machine learning algorithm to all data, it operates on different sets of features, defined by the respective file converters. Still, this is probably the clearest way to compare the results of different search engines after postsearch processing. Because of optimization of parameters, Percolator does not improve IdentiPy results as much as other search engines. However, IdentiPy, Comet, and MS-GF+ show the best performance on all three data sets with or without Percolator.
Figure 2c also demonstrates that X!Tandem Alanine is unable to produce reliable PSMs for low-mass-accuracy data. Analysis of the issue has shown that the problem occurs due to the changes in e-value calculation, which has a lower limit of 10⁻¹⁵. With low mass tolerance, many PSMs reach this limit, making it impossible to rank them for subsequent target-decoy filtering. This problem occurs in all X!Tandem versions following Cyclone. Percolator successfully restores the possibility to filter the PSMs by calculating sensible Q-values.
Additionally, we performed a peptide-level comparison of the results of all search engines. We considered the sets of peptides at 1% peptide-level FDR after Percolator processing (where applicable). Modified peptides were not considered separately from their unmodified counterparts. The results are shown in Supplementary Figures S-6 and S-7. Relative performances at the peptide level are similar to those at the level of PSMs. Because every search engine, including IdentiPy, has unique peptide identifications, it is beneficial for performance to aggregate their results. Supplementary Figure S-8 shows a proof-of-principle example of combined analysis using three search engines, including IdentiPy.
MS/MS deisotoping played a significant role in overall search efficiency. For search engines having built-in deisotoping capabilities (Morpheus and IdentiPy), nondeisotoped mzML files were used. For other search engines, we enabled the “MS2 Deisotope” filter in the MSConvert utility when producing MGF files. Using the built-in deisotoping procedures resulted in a higher number of reliable PSMs for Morpheus and IdentiPy compared with the MSConvert deisotoping filter. For other search engines, using deisotoping had different effect on sensitivity compared with nondeisotoped spectra, increasing or decreasing the number of reliable PSMs by ∼1%, depending on the data set. Deisotoped spectra were used for these engines.
Figure 3 shows the effect of using built-in deisotoping on IdentiPy performance as well as the difference between the original Hyperscore and RNHS. On the right in Figure 3, we show the number of PSMs at 1% FDR obtained with X!Tandem Cyclone on Confetti data. It is worth noting that IdentiPy yields significantly more reliable PSMs than X!Tandem when using the same parameters and scoring and without autotune. This is due to the difference in fragment peak matching: IdentiPy is configured to match singly charged ion peaks only. With this change, Hyperscore becomes a much better metric for PSM ranking, whereas in X!Tandem it shows a much poorer discriminative power than e-value. However, in this case, efficient deisotoping of fragment ion spectra becomes more important. As illustrated in Figure 3, built-in deisotoping further improves the sensitivity of IdentiPy, so that its overall efficiency is better than that of X!Tandem using e-values for PSM ranking. We applied ProteoWizard’s MSConvert to deisotope the MS/MS spectra before processing them with X!Tandem. The difference between Hyperscore and RNHS is not as significant for IdentiPy’s performance as deisotoping and fragment ion matching. The numbers of PSMs at 1% FDR produced by all discussed search engines on all data sets (alone and with Percolator), as well as by IdentiPy with and without autotune and with Hyperscore used instead of RNHS, are listed in Supplementary Table S-3.

Table 1. Initial and Optimized Parameters for HEK293 Data Set (a)

Unrealistic initial parameters
parameter                     initial value    optimized value
precursor mass error (ppm)    −100:+100        −15.9:+7.9; −19.3:+10.8; −19.7:+12.7; −18.4:+12.4; −19.4:+12.9; −21.0:+13.6
fragment mass error (ppm)     500              9; 10; 11; 11; 12; 12
allowed miscleavages          5                2; 3; 2; 3; 3; 3

Near-optimal initial parameters
parameter                     initial value    optimized value
precursor mass error (ppm)    −20:+20          −15.9:+7.9; −19.3:+10.8; −19.7:+12.7; −18.4:+12.4; −19.4:+12.9; −21.0:+13.6
fragment mass error (ppm)     50               10; 10; 10; 10; 11; 11
allowed miscleavages          2                1; 2; 2; 2; 2; 2

(a) First block: unrealistic initial parameters. Second block: near-optimal initial parameters. Multiple values correspond to optimized settings for individual runs.

Figure 2. Blue bars: number of PSMs at 1% FDR obtained by the considered search engines for three data sets: (a) Confetti, (b) HEK293, and (c) yeast proteome. IdentiPy is the most efficient on data sets a and b. Orange bars: increase in the number of PSMs at 1% FDR after applying Percolator to the search engine output. Morpheus and MaxQuant results were not processed with Percolator.

Figure 3. Number of PSMs after filtering to 1% FDR using different metrics for PSM ranking (Confetti data set). “With deisotoping”: IdentiPy’s built-in deisotoping is used. “Without deisotoping”: IdentiPy is run on files deisotoped with MSConvert. Autotune was disabled in this comparison.

Search Speed
We performed a speed comparison of the search engines under evaluation. The comparison was made using one replicate run of the Confetti data set on two different platforms: a Windows 7 PC and a Linux PC. The results are shown in Figure 4. For Windows 7 OS, Morpheus outperformed the other search engines, even surpassing both X!Tandem versions. IdentiPy and MS-GF+ were several times slower, taking several minutes to process a single MGF file (121 Mb, 51 048 MS/MS spectra), while MaxQuant took as long as 50 min. On Linux OS, X!Tandem and Comet were the fastest, while IdentiPy was three to four times faster than MS-GF+. MaxQuant was not included in the Linux comparison. The reported measurements do not include the time needed to produce input files supported by the corresponding search engines. However, the presented times do include output conversion to pepXML or TSV format, using Tandem2XML from TPP for X!Tandem output and msgf2tsv for MS-GF+. The other search engines generate pepXML files directly.
Figure 4. Processing times for different search engines on one replicate run of Confetti data: (a) on a Windows 7 PC, Intel Core i3, 4 virtual cores and (b) on a Linux PC, Intel Core i7, 12 virtual cores.
■
CONCLUSIONS
We developed and evaluated IdentiPy, an open-source extensible search platform for shotgun proteomics. Using the scoring algorithm RNHS, closely based on X!Tandem Hyperscore function, IdentiPy demonstrated search efficiency and processing time competitive with the ones delivered by the popular search engines. The sensitivity of IdentiPy exceeds all evaluated algorithms when applied to high-resolution data sets. The search engine comes with a user-friendly multiuser web interface, IdentiPy Server, incorporating peptide and protein identification, validation, and filtering.
■
ASSOCIATED CONTENT
Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.7b00640.
Table S-1. Initial and optimized parameters for Confetti and Yeast data sets.
Table S-2. Contribution of individual parameters to the effect of autotune.
Table S-3. Number of PSMs with 1% FDR identified by the considered search engines.
Figure S-1. Q-value curves for IdentiPy autotuning comparison (HEK293 data set).
Figure S-2. Q-value curves for search engine sensitivity comparison (all data sets).
Figure S-3. Precursor mass error distributions in Confetti.
Figure S-4. Screenshot of IdentiPy Server settings page.
Figure S-5. Screenshot of IdentiPy Server result page.
Figure S-6. Pairwise peptide-level comparison between IdentiPy and other search engines.
Figure S-7. Nested peptide set intersections between the considered search engines.
Figure S-8. Example of combined analysis using iProphet. (PDF)
■
AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected]. Tel: +7 499 1378257.
ORCID
Mikhail V. Gorshkov: 0000-0001-9572-3452
Author Contributions
§ L.I.L. and M.V.I. contributed equally to this work.
Notes
The authors declare no competing financial interest. The source code of IdentiPy and IdentiPy Server is available at http://hg.theorchromo.ru/identipy and http://hg.theorchromo.ru/identipy_server.
■
ACKNOWLEDGMENTS
The work was supported by Russian Science Foundation, grant no. 14-14-00971. We thank Drs. Arthur T. Kopylov, Sergei A. Moshkovskii, Vladimir A. Gorshkov, and Yury O. Tsybin for helpful discussion of IdentiPy search engine functionality.
■
REFERENCES
(1) Aebersold, R.; Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 2016, 537, 347−355.
(2) Reiter, L.; Claassen, M.; Schrimpf, S. P.; Jovanovic, M.; Schmidt, A.; Buhmann, J. M.; Hengartner, M. O.; Aebersold, R. Protein Identification False Discovery Rates for Very Large Proteomics Data Sets Generated by Tandem Mass Spectrometry. Mol. Cell. Proteomics 2009, 8, 2405−2417.
(3) Eng, J. K.; Mccormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976−989.
(4) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466−7.
(5) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551−3567.
(6) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open Mass Spectrometry Search Algorithm. J. Proteome Res. 2004, 3, 958−964.
(7) Eng, J. K.; Jahan, T. A.; Hoopmann, M. R. Comet: An open-source MS/MS sequence database search tool. Proteomics 2013, 13, 22−24.
(8) Tabb, D. L.; Fernando, C. G.; Chambers, M. C. MyriMatch: Highly Accurate Tandem Mass Spectral Peptide Identification by Multivariate Hypergeometric Analysis. J. Proteome Res. 2007, 6, 654−661.
(9) Cox, J.; Neuhauser, N.; Michalski, A.; Scheltema, R. A.; Olsen, J. V.; Mann, M. Andromeda: A Peptide Search Engine Integrated into the MaxQuant Environment. J. Proteome Res. 2011, 10, 1794−1805.
(10) Dorfer, V.; Pichler, P.; Stranzl, T.; Stadlmann, J.; Taus, T.; Winkler, S.; Mechtler, K. MS Amanda, a Universal Identification Algorithm Optimized for High Accuracy Tandem Mass Spectra. J. Proteome Res. 2014, 13, 3679−3684.
(11) Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 2014, 5, 5277.
Journal of Proteome Research (12) Wenger, C. D.; Coon, J. J. A Proteomics Search Algorithm Specifically Designed for High-Resolution Tandem Mass Spectra. J. Proteome Res. 2013, 12, 1377−1386. (13) Kong, A. T.; Leprevost, F. V.; Avtonomov, D. M.; Mellacheruvu, D.; Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 2017, 14, 513−520. (14) Nesvizhskii, A. I.; Vitek, O.; Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 2007, 4, 787−797. (15) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical Statistical Model To Estimate the Accuracy of Peptide Identifications Made by MS/MS and Database Search. Anal. Chem. 2002, 74, 5383− 5392. (16) Shteynberg, D.; Deutsch, E. W.; Lam, H.; Eng, J. K.; Sun, Z.; Tasman, N.; Mendoza, L.; Moritz, R. L.; Aebersold, R.; Nesvizhskii, A. I. iProphet: Multi-level Integrative Analysis of Shotgun Proteomic Data Improves Peptide and Protein Identification Rates and Error Estimates. Mol. Cell. Proteomics 2011, 10, M111.007690. (17) Käll, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 2007, 4, 923−925. (18) Ivanov, M. V.; Levitsky, L. I.; Lobas, A. A.; Panic, T.; Laskay, Ü . A.; Mitulovic, G.; Schmid, R.; Pridatchenko, M. L.; Tsybin, Y. O.; Gorshkov, M. V. Empirical Multidimensional Space for Scoring Peptide Spectrum Matches in Shotgun Proteomics. J. Proteome Res. 2014, 13, 1911−1920. (19) Zhang, Y.; Fonslow, B. R.; Shan, B.; Baek, M.-C.; Yates, J. R. Protein Analysis by Shotgun/Bottom-up Proteomics. Chem. Rev. 2013, 113, 2343−2394. (20) Nesvizhskii, A. I.; Aebersold, R. Interpretation of Shotgun Proteomic Data. Mol. Cell. Proteomics 2005, 4, 1419−1440. (21) Huang, T.; Wang, J.; Yu, W.; He, Z. Protein inference: a review. Briefings Bioinf. 2012, 13, 586−614. 
(22) Zhang, B.; Chambers, M. C.; Tabb, D. L. Proteomic Parsimony through Bipartite Graph Analysis Improves Accuracy and Transparency. J. Proteome Res. 2007, 6, 3549−3557.
(23) Slotta, D. J.; McFarland, M. A.; Markey, S. P. MassSieve: Panning MS/MS peptide data for proteins. Proteomics 2010, 10, 3035−3039.
(24) Ma, Z.-Q.; Dasari, S.; Chambers, M. C.; Litton, M. D.; Sobecki, S. M.; Zimmerman, L. J.; Halvey, P. J.; Schilling, B.; Drake, P. M.; Gibson, B. W.; Tabb, D. L. IDPicker 2.0: Improved Protein Assembly with High Discrimination Peptide Identification Filtering. J. Proteome Res. 2009, 8, 3872−3881.
(25) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003, 75, 4646−4658.
(26) Li, J.; Zimmerman, L. J.; Park, B.-H.; Tabb, D. L.; Liebler, D. C.; Zhang, B. Network-assisted protein identification and data interpretation in shotgun proteomics. Mol. Syst. Biol. 2009, 5, 1−11.
(27) Serang, O.; MacCoss, M. J.; Noble, W. S. Efficient Marginalization to Compute Protein Posterior Probabilities from Shotgun Mass Spectrometry Data. J. Proteome Res. 2010, 9, 5346−5357.
(28) Shteynberg, D.; Nesvizhskii, A. I.; Moritz, R. L.; Deutsch, E. W. Combining Results of Multiple Search Engines in Proteomics. Mol. Cell. Proteomics 2013, 12, 2383−2393.
(29) Vaudel, M.; Barsnes, H.; Berven, F. S.; Sickmann, A.; Martens, L. SearchGUI: An open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics 2011, 11, 996−999.
(30) Vaudel, M.; Burkhart, J. M.; Zahedi, R. P.; Oveland, E.; Berven, F. S.; Sickmann, A.; Martens, L.; Barsnes, H. PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nat. Biotechnol. 2015, 33, 22−24.
(31) Röst, H. L.; et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat. Methods 2016, 13, 741−748.
(32) Jagtap, P. D.; Johnson, J. E.; Onsongo, G.; Sadler, F. W.; Murray, K.; Wang, Y.; Shenykman, G. M.; Bandhakavi, S.; Smith, L. M.; Griffin, T. J. Flexible and Accessible Workflows for Improved Proteogenomic Analysis Using the Galaxy Framework. J. Proteome Res. 2014, 13, 5898−5908.
(33) Deutsch, E. W.; Mendoza, L.; Shteynberg, D.; Slagel, J.; Sun, Z.; Moritz, R. L. Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics. Proteomics: Clin. Appl. 2015, 9, 745−754.
(34) Noble, W. S.; MacCoss, M. J. Computational and Statistical Analysis of Protein Mass Spectrometry Data. PLoS Comput. Biol. 2012, 8, e1002296.
(35) Goloborodko, A. A.; Levitsky, L. I.; Ivanov, M. V.; Gorshkov, M. V. Pyteomics - a Python Framework for Exploratory Data Analysis and Rapid Software Prototyping in Proteomics. J. Am. Soc. Mass Spectrom. 2013, 24, 301−304.
(36) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207−214.
(37) Searle, B. C. Scaffold: A bioinformatic tool for validating MS/MS-based proteomic studies. Proteomics 2010, 10, 1265−1269.
(38) Guo, X.; Trudgian, D. C.; Lemoff, A.; Yadavalli, S.; Mirzaei, H. Confetti: A Multiprotease Map of the HeLa Proteome for Comprehensive Proteomics. Mol. Cell. Proteomics 2014, 13, 1573−1584.
(39) Geiger, T.; Wehner, A.; Schaab, C.; Cox, J.; Mann, M. Comparative Proteomic Analysis of Eleven Common Cell Lines Reveals Ubiquitous but Varying Expression of Most Proteins. Mol. Cell. Proteomics 2012, 11, M111.014050.
(40) Hebert, A. S.; Richards, A. L.; Bailey, D. J.; Ulbrich, A.; Coughlin, E. E.; Westphall, M. S.; Coon, J. J. The One Hour Yeast Proteome. Mol. Cell. Proteomics 2014, 13, 339−347.
(41) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24, 2534−2536.
(42) Ivanov, M. V.; Levitsky, L. I.; Lobas, A. A.; Tarasova, I. A.; Pridatchenko, M. L.; Zgoda, V. G.; Moshkovskii, S. A.; Mitulovic, G.; Gorshkov, M. V. Peptide identification in “shotgun” proteomics using tandem mass spectrometry: Comparison of search engine algorithms. J. Anal. Chem. 2015, 70, 1614−1619.
(43) Jung, H.-J.; et al. Integrated Post-Experiment Monoisotopic Mass Refinement: An Integrated Approach to Accurately Assign Monoisotopic Precursor Masses to Tandem Mass Spectrometric Data. Anal. Chem. 2010, 82, 8510−8518.
(44) Nesvizhskii, A. I. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J. Proteomics 2010, 73, 2092−2123.
(45) Ding, Y.; Choi, H.; Nesvizhskii, A. I. Adaptive Discriminant Function Analysis and Reranking of MS/MS Database Search Results for Improved Peptide Identification in Shotgun Proteomics. J. Proteome Res. 2008, 7, 4878−4889.
(46) Meek, J. L. Prediction of Peptide Retention Times in High-pressure Liquid Chromatography on the Basis of Amino Acid Composition. Proc. Natl. Acad. Sci. U. S. A. 1980, 77, 1632−1636.
(47) Ivanov, M. V.; Levitsky, L. I.; Gorshkov, M. V. Adaptation of Decoy Fusion Strategy for Existing Multi-Stage Search Workflows. J. Am. Soc. Mass Spectrom. 2016, 27, 1579−1582.
(48) Gupta, N.; Bandeira, N.; Keich, U.; Pevzner, P. A. Target-decoy approach and false discovery rate: when things may go wrong. J. Am. Soc. Mass Spectrom. 2011, 22, 1111−20.
(49) May, D. H.; Tamura, K.; Noble, W. S. Param-Medic: A Tool for Improving MS/MS Database Search Yield by Optimizing Parameter Settings. J. Proteome Res. 2017, 16, 1817−1824.
(50) Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008, 26, 1367−1372.