Prediction of Peptide Fragment Ion Mass Spectra by Data Mining

Jul 17, 2014 - 7College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410083, P. R. China. •S Supporting Information...
Article pubs.acs.org/ac

Prediction of Peptide Fragment Ion Mass Spectra by Data Mining Techniques

Nai-ping Dong,† Yi-Zeng Liang,*,† Qing-song Xu,‡ Daniel K. W. Mok,§,∥ Lun-zhao Yi,⊥ Hong-mei Lu,† Min He,# and Wei Fan7

† College of Chemistry and Chemical Engineering and ‡ School of Mathematics and Statistics, Central South University, Changsha, 410083, P. R. China
§ Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
∥ State Key Laboratory of Chinese Medicine and Molecular Pharmacology (Incubation), Shenzhen, 518000, P. R. China
⊥ Yunnan Food Safety Research Institute, Kunming University of Science and Technology, Kunming, 650500, P. R. China
# Department of Pharmaceutical Engineering, School of Chemical Engineering, Xiangtan University, Xiangtan, 411105, P. R. China
7 College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410083, P. R. China

*S Supporting Information

ABSTRACT: Accurate prediction of peptide fragment ion mass spectra is one of the critical factors in guaranteeing confident peptide identification by protein sequence database search in bottom-up proteomics. In an attempt to predict this type of mass spectrum accurately and comprehensively, a framework named MS2PBPI is proposed. MS2PBPI first extracts fragment ions from large-scale MS/MS spectra data sets according to the peptide fragmentation pathways and uses binary trees to divide the resulting bulky data into tens to more than 1000 regions. For each adequate region, a stochastic gradient boosting tree regression model is constructed. By constructing hundreds of these models, MS2PBPI is able to predict MS/MS spectra for unmodified and modified peptides with reasonable accuracy. Moreover, high consistency is achieved between predicted and experimental MS/MS spectra derived from different ion trap instruments with low and high resolving power. MS2PBPI outperforms the existing algorithms MassAnalyzer and PeptideART.

Tandem mass spectrometry (MS/MS) has achieved great success in the comprehensive characterization of proteins in tissues and cells, from qualitative analysis to function determination.1 This method, termed the bottom-up proteomic approach, first digests proteins into peptide mixtures, which then pass through single- or multidimensional liquid chromatography (LC) for separation. The eluent is ionized by a soft ionization method (e.g., electrospray ionization (ESI), matrix-assisted laser desorption/ionization (MALDI)) to produce a molecular ion mass spectrum. A certain number (e.g., 3) of molecular ions are then fragmented by a method such as collision-induced dissociation (CID) to obtain fragment ion mass spectra (MS/MS spectra). From these mass spectra, the peptide sequences in the sample can be deduced. Unfortunately, a single LC-MS/MS run can generate thousands to tens of thousands of MS/MS spectra, making it impossible to interpret them manually. Thus, computer-aided programs are required to deduce peptide sequences from mass spectra, to infer protein structure from the sequences, and to quantify them for subsequent analysis.

Since the first attempts to identify peptides from large-scale MS/MS spectra two decades ago,2,3 many algorithms have been developed for the peptide identification task.4 These algorithms can be categorized into four strategies: protein sequence database search, de novo sequencing, the sequence tag approach, and spectral library search. Because of its operational simplicity and the possibility of automation, protein sequence database search has become the most popular strategy. In this strategy, protein sequences in a database (the target database) are theoretically digested into peptides according to the enzyme used in the experiment. Criteria such as the precursor ion mass tolerance can be used to narrow down the number of calculated peptide sequences for subsequent analysis. Finally, and most importantly, the theoretical MS/MS spectra of the peptide candidates are simulated and matched against the experimental spectra to select the peptides that best explain them.

© 2014 American Chemical Society

Received: March 26, 2014 Accepted: July 17, 2014 Published: July 17, 2014

dx.doi.org/10.1021/ac501094m | Anal. Chem. 2014, 86, 7446−7454

Analytical Chemistry

Article

One of the bottlenecks of protein sequence database search, as indicated by the procedure described above, is the prediction of MS/MS spectra, which is still not very reliable and limits the accuracy of identification. For this reason, all database search algorithms provide validation scores based on the initial match scores to indicate the significance of the match (e.g., X!Tandem’s E-value5) or the actual similarity between the predicted and experimental spectra (e.g., SEQUEST’s XCorr2). Peptide-spectrum matches (PSMs) ranked first by these validation scores are considered correct and used for further analysis. Yet, because a large number of MS/MS spectra are generated in an experiment and matched against a large number of peptide candidates generated from the protein sequence database, the identification results contain many false positives. Thus, statistical models based on the match and validation scores and other features are used to estimate the confidence of PSMs.6,7 Another approach is to construct a decoy sequence database from the target database and search it in parallel to estimate the false positive rate (FPR) of the identification results.8 In addition, methods that reduce the sample space, such as spectral quality assessment9 and spectra clustering,10 or that reduce the search space, such as precursor ion charge determination11 and mass calibration,12 have been developed to eliminate likely sources of false positives. All these methods improve the chance of correct identification. However, developing more accurate MS/MS spectra prediction models for PSM is certainly the most effective way to improve the confidence of peptide identification, because it eliminates errors in the initial match and, consequently, error propagation in subsequent analysis.13 With this in mind, two strategies have been developed to date.
The first strategy calculates the probability of occurrence of each fragment ion in MS/MS spectra data sets and then employs it in a scoring scheme to obtain the likelihood of a PSM.14 These probability-based models can be further improved by integrating the fragmentation rules of peptides.15,16 The second strategy, on the other hand, attempts to predict MS/MS spectra explicitly. To this end, MassAnalyzer17 uses reaction kinetics to simulate the process of peptide cleavage in the collision cell, whereas PeptideART,18 Riptide,19 and Zhou et al.20 rely on machine learning algorithms. Most recently, MS2PIP21 was developed based on random forest regression to predict MS/MS spectra from a large number of high-confidence PSMs. In addition, PepNovo22 uses RankBoost23 models to predict peak intensity ranks and a rank-based scoring scheme to support its model. Although these algorithms have significantly improved peptide identification results,24 the prediction of MS/MS spectra for PSM is far from perfect. For instance, because posttranslational modifications (PTMs) are common in proteins and some of them have been shown to change the fragmentation pattern considerably compared with that of the unmodified peptide,25 new models for predicting MS/MS spectra of PTM-containing peptides are also needed. Moreover, different fragmentation methods (e.g., electron-transfer dissociation, CID) or mass analyzers (e.g., ion trap (IT), time-of-flight (TOF)) can generate significantly different MS/MS spectra, so spectral prediction algorithms trained on MS/MS spectra sets from a single instrumental platform port poorly to other platforms. To our knowledge, these two aspects have so far been considered only by MassAnalyzer, which addresses them by constructing different prediction models.26

In the past decade, the number of recorded MS/MS spectra has increased explosively, and a vast variety of proteomic data sets have been deposited in public repositories. This provides opportunities to construct more comprehensive and accurate peptide MS/MS spectrum prediction models by mining fragmentation patterns from huge data sets. In the present work, a new peptide MS/MS spectra prediction framework named MS2PBPI (MS/MS Spectrum Prediction Boosting Peptide Identification) is constructed based on the behavior of peptide fragmentation pathways in low-energy CID. To predict peptide MS/MS spectra accurately, the framework extracts the fragment ion types that dominate MS/MS spectra from the whole NIST peptide fragment ion libraries (http://peptide.nist.gov/) and trains on them with an ensemble regression algorithm. By constructing hundreds of ensemble models, MS2PBPI can simulate peptide fragment ion mass spectra that are highly consistent with experimental ones. To extend MS2PBPI to PTM-containing peptides, new sets of variables are extracted to characterize the fragment ions derived from these peptides, and the models are retrained. The results show that MS2PBPI also predicts mass spectra for these modified peptides with reasonable accuracy, indicating that the framework can be extended to different types of mass spectra (e.g., mass spectra generated by peptides with different modifications or by different ionization modes).



EXPERIMENTAL SECTION

Training Data Collection and Processing. To obtain a high-quality MS/MS spectra data set for model training, we used the NIST Libraries of Peptide Tandem Mass Spectra. The Human, Mouse, Yeast, Drosophila, C. elegans, E. coli, Rat, BSA, and Sigmaups1 libraries derived from IT instrumentation (release 2012_05) were downloaded from the NIST Web site and combined. All replicate mass spectra (i.e., those derived from the same precursor ion) were removed from the combined library. Because the current framework predicts MS/MS spectra only for singly, doubly, and triply charged peptide precursor ions, mass spectra generated by precursor ions with higher charge states were also removed. Moreover, peptides containing modifications other than oxidation of methionine, carboxymethylation of cysteine, N-terminal acetylation, and pyro-Glu modification were excluded. The final training set contained 668 844 MS/MS spectra, 174 811 of which were derived from modified peptides. All fragment ion intensities were normalized to the base peak intensity of the corresponding MS/MS spectrum. In addition to constructing models for peptides with and without the modifications described above, we also trained models for phosphorylated peptides. To this end, the PhosphoPep libraries27 (version 2.0) were downloaded from PeptideAtlas (http://www.peptideatlas.org/speclib/index.php#ISB) and processed in a similar way to the NIST libraries. Finally, 125 337 MS/MS spectra of phosphorylated peptides were used for training.

Model Construction. It is beneficial to construct models from a large set of MS/MS spectra because such a set contains a large amount of information for mining the real fragmentation rules of peptides, which can then be applied to predict MS/MS spectra of new peptides. However, training on such a huge data set directly is almost impossible and would be biased toward noise (i.e., low intensities).
To handle these MS/MS spectra efficiently, MS2PBPI used a heuristic approach to divide the sample space into small data sets and then trained each data set


be discriminated from nonselective cleavages, which makes the resulting regions straightforward to interpret in terms of peptide fragmentation at the corresponding charge state. To obtain optimal binary trees, each data set (i.e., the intensities of the fragment ions) was divided into a positive and a negative group using an intensity criterion and trained by the classification and regression tree (CART) algorithm36 with Gini impurity as the splitting criterion and a minimum of 500 ions per tree leaf. Intensity criteria of 0.6 and 0.5 were used for y ions derived from selective fragmentation channels and for the other ions, respectively. However, further binary tree construction was always required because the regions (i.e., tree leaves) obtained in the initial procedure generally contained too many fragment ions whose intensities were dominated by low values, which made regression model construction inefficient (e.g., expensive computation and poor prediction ability). Thus, for each region containing more than 3000 fragment ions, an intensity criterion of 0.6 was applied to partition the ions into a positive and a negative group if the fraction of intensities greater than 0 was larger than 0.95; otherwise, a criterion of 0.3 was applied. The CART algorithm was then run with the same parameters as in the previous binary tree construction. These procedures were iterated until every region that could be further partitioned by CART had been divided into regions containing fewer than 3000 fragment ions. As a result, more than 1000 regions can be obtained for some types of fragment ions. In total, 60 binary trees were obtained for unmodified peptides, and 69 each for modified and phosphorylated peptides.

Regression Model Construction. Once a binary tree had been constructed for a fragment ion type derived from singly, doubly, or triply charged precursor ions, stochastic gradient boosting tree (SGBTree) regression37 was performed for each region.
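As a rough illustration of the binary tree construction described above, the following sketch binarizes normalized intensities with an intensity criterion, scores candidate splits on a single numeric variable by weighted Gini impurity, and picks the criterion for re-partitioning an oversized region. It is a minimal sketch under stated assumptions and not the authors' implementation: the function names are hypothetical, the comparison `>= criterion` is assumed to be inclusive, only one variable is split, and the quadratic-time search is meant for toy data only.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def binarize(intensities, criterion):
    """Positive group (1): normalized intensity >= criterion (assumed inclusive)."""
    return [1 if x >= criterion else 0 for x in intensities]


def best_split(values, labels, min_leaf=500):
    """Best threshold on one numeric variable by weighted Gini impurity.

    Returns (threshold, impurity), or None if no split leaves at least
    `min_leaf` ions on both sides.
    """
    min_leaf = max(1, min_leaf)
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    best = None
    for k in range(min_leaf, n - min_leaf + 1):
        lo, hi = values[order[k - 1]], values[order[k]]
        if lo == hi:
            continue  # cannot separate identical values
        left = [labels[i] for i in order[:k]]
        right = [labels[i] for i in order[k:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = ((lo + hi) / 2.0, score)
    return best


def repartition_criterion(intensities, max_ions=3000):
    """Intensity criterion for re-splitting an oversized region, else None."""
    if len(intensities) <= max_ions:
        return None
    frac_nonzero = sum(1 for x in intensities if x > 0) / len(intensities)
    return 0.6 if frac_nonzero > 0.95 else 0.3
```

In practice a full CART implementation would search over all variables and recurse on both children; the sketch only shows the splitting criterion and the thresholds quoted in the text.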
SGBTree constructs additive regression tree models sequentially, each fitting the “pseudo”-residuals of the cumulative model built so far. This stepwise procedure iteratively combines weak learners (here, regression trees) into a strong learner with high accuracy. Because a single model based on those fragmentation rules cannot predict peptide MS/MS spectra accurately, such an ensemble method is very helpful. To construct optimal models for fragment ions, parameters such as the number of regression trees, the fraction of samples selected for training, the shrinkage parameter, and the size of each regression tree must first be defined. Here we used 10-fold cross-validation, with the correlation coefficient between predicted and experimental intensities (R10FCV) as the objective function, to select the optimal parameter combination. If a region was dominated by low or high intensities, that is, seriously imbalanced, the model would be biased toward these values, making the intensities predicted by SGBTree models no better than those calculated directly from the intensity distribution. Thus, for any region in which the fraction of ion intensities greater than 0 was less than 0.8 or the fraction of ion intensities equal to 1.0 was greater than 0.75, the intensities were sorted from lowest to highest and the value located at 0.618 of the length of the sorted array was taken as the predicted intensity of that region. This location in the intensity array was adopted to balance underestimation and overestimation of intensity during prediction. For instance, an intensity of 1.0 assigned to a region in this way indicates that fragment ions in this region have a high probability of being base peaks, whereas a median value of 0.9 or a mean value of 0.72 can be overconservative for

independently, as outlined in Figure S-1 in the Supporting Information and described in detail below.

Pathway Extraction. All fragment ions in the training set were divided into subsets according to the fragmentation pathways of peptides, because different fragment ion series are generated by different pathways. Currently, only the following ions were considered: y and b ions generated by the bx-yz pathway;28 y-18, b-18, y-17, and b-17 ions generated by water and ammonia loss pathways; precursor ions (p ions) and their p-18 and p-17 losses; and a ions generated by the bx → ax pathway. For multiply charged precursors, these fragment ions in multiple charge states were also considered. Other types of fragment ions, for example, y, b, and p ions with two neutral losses, internal fragments, and scrambled ions,29 were excluded because they are generally of low abundance in MS/MS spectra and are always removed as noise in the PSM approach. Because selective fragmentation channels such as the proline effect,30 aspartic acid effect,31 glutamic acid effect, diketopiperazine-yN−2 pathway,32 and N-terminal glutamine/asparagine effect33 can produce highly abundant fragment ions, prediction models for these fragment ions were constructed separately. Each selective fragmentation channel can produce two types of fragment ions, namely, y and b ions for the first four channels and p-18 and p-17 ions for the last channel. For modified peptides, losses of CH3SOH from oxidized methionine (−64 Da) and of H3PO4 from phosphoserine, phosphothreonine, and phosphotyrosine (−98 Da) in y, b, and p ions were added. These types of fragments were trained separately because, among all the modifications considered in the current model, only these two have been shown to change the fragmentation patterns of the peptides containing them by adding new types of fragment ions.
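To make the ion series listed above concrete, the following sketch enumerates m/z values of b, y, and a ions, their water and ammonia losses, and the precursor ion series for an unmodified peptide. It is a minimal illustration, not the MS2PBPI code: the function name and the label format (for example `b2+1` for a singly charged b2 ion) are hypothetical, the masses are standard monoisotopic values rounded to five decimals, and modification-specific losses and the selective-channel subsets are not handled.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
AMMONIA = 17.02655
PROTON = 1.00728
CO = 27.99491  # lost when a b ion converts to an a ion


def fragment_mz(peptide, max_charge=1):
    """Return {label: m/z} for b, y, a, neutral-loss, and precursor ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    n = len(peptide)
    ions = {}
    for z in range(1, max_charge + 1):
        def mz(neutral_mass, z=z):
            # Convert a neutral fragment mass to m/z at charge z.
            return (neutral_mass + z * PROTON) / z

        def add_series(name, mass):
            ions[f"{name}+{z}"] = mz(mass)
            ions[f"{name}-18+{z}"] = mz(mass - WATER)
            ions[f"{name}-17+{z}"] = mz(mass - AMMONIA)

        prefix = 0.0
        for i in range(1, n):
            prefix += masses[i - 1]
            add_series(f"b{i}", prefix)                      # N-terminal part
            add_series(f"y{n - i}", total - prefix + WATER)  # C-terminal part
            ions[f"a{i}+{z}"] = mz(prefix - CO)
        add_series("p", total + WATER)                       # precursor series
    return ions
```

For example, `fragment_mz("PEPTIDE", max_charge=2)["p+1"]` gives the singly protonated precursor m/z of about 800.367.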
Then, the m/z values of all possible fragment ions described above were calculated for each peptide from its amino acid sequence and searched against the MS/MS spectrum with a mass tolerance of 0.6 Da. If more than one mass spectral peak matched, the most intense one was retained. If no peak was found, an intensity of 0 was assigned.

Variable Generation. We collected as input variables all factors that have been shown to strongly influence peptide fragmentation in previous statistical and theoretical investigations.34,35 These variables are described in detail in Table S-1 in the Supporting Information. Note that different pathways were trained with different subsets of variables because some variables are redundant for certain pathways. For example, the variable fragCterm, which characterizes the residue on the C-terminal side of the cleavage site, was ignored when training fragment ions produced by the proline effect because this residue is always proline. For modified peptides, variables representing information on the modifications were added to the above variable sets; these added variables are described in Table S-2 in the Supporting Information.

Binary Tree Construction. Prior to model construction, a binary tree was constructed for each fragment ion type according to the charge state of its precursor ions. This is advantageous because it divides the large training set (hundreds of thousands to millions of ions) into a number of regions suitable for regression in the following steps. Moreover, because the variables used are derived from the amino acid composition of the peptide sequences and the proton information on the precursors, the binary tree illustrates the fragmentation behavior of the current pathway.15,35 That is, by using a binary tree and meaningful variables, selective cleavages of each fragmentation pathway can


Table 1. Training Results of MS2PBPI for Unmodified Peptides

Columns 3 to 5 give the number of models/number of regions and columns 6 to 8 give the average R10FCV, each for precursor charge states +1, +2, and +3 (b).

ion type (a) | total number of ions | models/regions +1 | models/regions +2 | models/regions +3 | avg R10FCV +1 | avg R10FCV +2 | avg R10FCV +3
y            | 6955487 | 16/50 | 118/984  | 149/1608 | 0.7963 | 0.7490 | 0.7426
b            | 4071016 | 8/81  | 107/1116 | 86/977   | 0.8544 | 0.7772 | 0.6927
yPro         | 559645  | 16/31 | 58/167   | 62/376   | 0.7830 | 0.7573 | 0.6970
bPro         | 647998  | 2/15  | 30/157   | 17/191   | 0.8201 | 0.7791 | 0.6685
yAsp         | 665152  | 4/72  | 49/284   | 27/264   | 0.8140 | 0.7924 | 0.7071
bAsp         | 740456  | 3/14  | 30/231   | 11/113   | 0.8640 | 0.8287 | 0.6118
yGlu         | 826920  | 18/26 | 80/296   | 29/318   | 0.7566 | 0.7805 | 0.7111
bGlu         | 1361064 | 3/40  | 35/452   | 10/439   | 0.8516 | 0.7912 | 0.5971
yyN−2−b2     | 795896  | 8/51  | 64/442   | 30/153   | 0.8842 | 0.8159 | 0.7354
byN−2−b2     | 665031  | 1/20  | 3/272    | 0/94     | 0.7881 | 0.7343 |
a            | 5628260 | 1/79  | 3/1073   | 1/700    | 0.6444 | 0.6769 | 0.6699
y-18         | 5402866 | 1/110 | 5/1551   | 5/895    | 0.7418 | 0.7114 | 0.5559
b-18         | 5628260 | 7/104 | 27/1478  | 25/1051  | 0.6574 | 0.5883 | 0.5728
y-17         | 5402866 | 7/113 | 6/1508   | 5/901    | 0.6857 | 0.6998 | 0.5613
b-17         | 5628260 | 3/80  | 6/1556   | 5/1022   | 0.6608 | 0.5226 | 0.5153
p            | 1084351 | 0/1   | 0/19     | 0/26     |        |        |
p-17         | 747086  | 8/68  | 22/219   | 0/86     | 0.6669 | 0.6200 |
pQN          | 63186   | 0/1   | 4/19     | 0/9      |        | 0.7005 |
p-18         | 1013594 | 4/42  | 28/246   | 0/112    | 0.6583 | 0.5690 |
pE           | 70757   | 0/4   | 1/15     | 0/5      |        | 0.7594 |

(a) y, b, yPro, bPro, yAsp, bAsp, yGlu, bGlu, yyN−2−b2, and byN−2−b2 are the y and b ions generated by the bx-yz pathway, proline effect, aspartic acid effect, glutamic acid effect, and diketopiperazine-yN−2 pathway, respectively; a stands for a ions, and p stands for precursor ions; y-18, b-18, y-17, b-17, p-17, and p-18 are water and ammonia losses of y, b, and p ions; pQN and pE are fragment ions generated by the N-terminal glutamine/asparagine effect and the N-terminal glutamic acid effect.
(b) +1, +2, and +3 are the charge states of the precursor ions.

Hence we downloaded a large number of additional spectra of these types from PeptideAtlas (http://www.peptideatlas.org/repository/). Because the current models and other peptide prediction algorithms were learned from IT MS/MS spectrum sets, the data sets derived from the LTQ instrument were used as the standard for testing and for comparison with other algorithms, whereas data sets derived from other instruments, such as 3D IT (Thermo LCQ Deca and Agilent XCT Ultra), QTOF (ABI QSTAR Pulsar i and Waters/Micromass Q-TOF Ultima), and Thermo LTQ-FT, were used to test the performance of MS2PBPI models on different instruments. For the ISB data sets, the following criteria were used to extract high-confidence PSMs: PPeptideProphet ≥ 0.95, XCorr ≥ 1.6, and ΔCn ≥ 0.08 for singly charged PSMs (PPeptideProphet stands for the PeptideProphet probability6); PPeptideProphet ≥ 0.995 for doubly charged PSMs; and PPeptideProphet ≥ 0.95, XCorr ≥ 2.5, and ΔCn ≥ 0.3 for triply charged PSMs. In addition, all identified peptides had to belong to the standard protein mixtures or to the confidently identified contaminants reported in the original publication. For the other data sets, PSMs with PPeptideProphet greater than 0.99 were retained. This strict criterion guarantees the high quality of the testing data set, especially for released MS/MS data sets identified without a decoy database search. In addition, recently released high-resolution raw data sets generated on LTQ-Orbitrap XL and LTQ-Orbitrap Velos instruments were downloaded from PeptideAtlas and used as an additional data set to test the performance of MS2PBPI models on state-of-the-art experimental LC-MS/MS data. A PeptideProphet probability threshold of 0.95 was then used to extract confidently identified MS/MS spectra. Because a decoy database search was also performed in these experiments, an FPR of 1% was found in the extracted PSM sets under this probability criterion, confirming the high quality of the LTQ-Orbitrap testing data sets.
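The charge-dependent acceptance thresholds quoted above can be written as a small filter. This is a hedged sketch: the function names and argument layout are hypothetical, and the additional requirement that peptides belong to the standard protein mixtures or reported contaminants is not modeled.

```python
def passes_isb_filter(charge, probability, xcorr=0.0, delta_cn=0.0):
    """Charge-dependent criteria quoted in the text for the ISB data sets.

    probability: PeptideProphet probability; xcorr and delta_cn: SEQUEST scores.
    """
    if charge == 1:
        return probability >= 0.95 and xcorr >= 1.6 and delta_cn >= 0.08
    if charge == 2:
        return probability >= 0.995
    if charge == 3:
        return probability >= 0.95 and xcorr >= 2.5 and delta_cn >= 0.3
    return False  # higher charge states are outside the scope of the models


def passes_other_filter(probability):
    """Criterion quoted for the other (non-ISB, non-Orbitrap) data sets."""
    return probability > 0.99
```

A list of PSM records could then be reduced with an ordinary comprehension, for example `[p for p in psms if passes_isb_filter(*p)]` when each record is a `(charge, probability, xcorr, delta_cn)` tuple.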

that region. However, an intensity of 1.0 may be an overestimate if the value located at a higher fraction of the array length (e.g., 0.8) is used, because the intensities of most fragment ions in the region are lower than that. For the other regions, R10FCV criteria were applied to select valid models. Valid models constructed for singly and doubly charged precursors had to have an R10FCV greater than 0.7, and those for triply charged precursors an R10FCV greater than 0.6. Otherwise, the models were considered invalid because they could produce large prediction errors. However, we found that if several of these regions were combined and retrained by SGBTree, a much higher R10FCV could be obtained. This might be caused by the more uniform distribution of intensities in the combined regions. Thus, we combined these regions and constructed new models, applying the same R10FCV criteria described above. If the models constructed for the combined regions still failed to meet the criteria, the regions were treated like the seriously imbalanced regions; namely, the intensity located at 0.618 of the length of the sorted intensity array of each region was taken as the predicted intensity. For modified peptides, however, the criteria adopted above were relaxed because the fragmentation of these peptides is more complicated than that of unmodified peptides and the training set was much smaller. We constructed models for all regions in which the fraction of intensities larger than 0 was greater than 0.5 and which contained at least 10 fragment ions with intensity larger than 0.8. Among all these models, an R10FCV criterion of 0.5 was applied to select valid ones.

Model Validation. To test the MS2PBPI models constructed for unmodified peptides, the ISB standard protein mixture data sets38 were used. However, many more experimental MS/MS spectra derived from the same instruments as the ISB data sets can also be employed to test the models on a wider range of peptides.
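The region-level rules from Regression Model Construction (the R10FCV thresholds and the single-value fallback at 0.618 of the sorted intensity array) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the index is obtained by truncating `0.618 * length` because the text does not state a rounding rule, and applying the same fallback to invalid models for modified peptides is an assumption.

```python
def fallback_intensity(intensities, position=0.618):
    """Value located at `position` of the length of the sorted intensity array."""
    ordered = sorted(intensities)
    index = min(int(position * len(ordered)), len(ordered) - 1)  # assumed rounding
    return ordered[index]


def r10fcv_threshold(charge, modified=False):
    """Minimum R10FCV a model must exceed to be considered valid."""
    if modified:
        return 0.5
    return 0.7 if charge in (1, 2) else 0.6


def predict_region(intensities, r10fcv, charge, model_prediction, modified=False):
    """Use the model output if the model is valid, else the fallback value.

    `r10fcv` may be None when no model was trained for the region.
    """
    if r10fcv is not None and r10fcv > r10fcv_threshold(charge, modified):
        return model_prediction
    return fallback_intensity(intensities)
```

With this convention, a region whose sorted intensities are mostly zero still yields a nonzero prediction when more than about 38% of its ions were observed, which matches the stated aim of balancing under- and overestimation.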


intensity fragments for regression are still seriously biased toward low intensities, leading to poorer models than those for y and b ions. In addition, the poorest models are obtained for the precursor ion series (i.e., p, p-17, and p-18). These trends imply that only y and b ions can be predicted accurately. Because y ions are generally believed to be more abundant and stable than b ions, it is surprising that the R10FCV of the latter is higher than that of the former. Among models constructed for precursor ions with different charge states, those for singly charged precursors have the highest R10FCV and those for triply charged precursors the lowest. This is reasonable because the fragmentation of peptides carrying three protons is much more complicated. However, a more efficient sample reduction procedure is also required for these peptides to filter out low-intensity fragment ions, because such ions can significantly affect the performance of SGBTree models. It should be mentioned that the N-terminal glutamine/asparagine effect (pQN) and the N-terminal glutamic acid effect (pE) are the most selective fragmentation pathways in our unmodified peptide training set, especially for singly charged precursors. Thus, MS2PBPI uses a single value as the prediction for these ions instead of constructing SGBTree models.

MS/MS Prediction of Unmodified Peptides. MS2PBPI was tested on an MS/MS spectra data set containing high-confidence identifications. The distribution of similarity scores between predicted and experimental MS/MS spectra is shown in Figure 1. For singly and doubly charged precursors,

To test the prediction models for peptides containing modifications other than phosphorylation, all LTQ MS/MS spectra assigned to peptides with modifications considered in the current models were selected using the same PPeptideProphet criterion as that used for extracting unmodified PSMs. To test the models for phosphorylated peptides, additional proteome data sets rich in these peptides were downloaded from PeptideAtlas. PSMs with PeptideProphet probabilities of 0.95 or higher were retained. This probability criterion also ensured an FPR of 1% in this testing data set. Once all testing data sets had been constructed, a further filtering step was performed to check whether any of the extracted PSMs matched peptides in the training set. All such replicate PSMs were then removed. Statistics of the final testing data sets and the PeptideAtlas accessions of the above LC-MS/MS sets are provided in Table S-3 in the Supporting Information. For each PSM in the testing data sets, a theoretical MS/MS spectrum was simulated by MS2PBPI and compared with the corresponding experimental spectrum. Similarity scores for these comparisons were calculated by formula 1:

$$S = \frac{\sum_i I_{ei} \times I_{ti}}{\sqrt{\sum_i I_{ei}^{2} \times \sum_i I_{ti}^{2}}} \qquad (1)$$

where $I_{ei}$ and $I_{ti}$ are the intensities of the ith matched fragment ion in the experimental and simulated MS/MS spectra, respectively. Prior to similarity score calculation, all intensities were normalized to the base peak intensity.
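A minimal sketch of the similarity score in formula 1 follows, assuming that it is the normalized dot product (cosine similarity) of the matched intensities, with a square root over the product of the two sums of squares. The function names are hypothetical, and the two inputs are assumed to be already aligned so that position i in each list refers to the same matched fragment ion.

```python
import math


def normalize_to_base_peak(intensities):
    """Scale intensities so that the most intense peak equals 1.0."""
    base = max(intensities, default=0.0)
    if base > 0:
        return [x / base for x in intensities]
    return list(intensities)


def similarity(experimental, theoretical):
    """Normalized dot product of two aligned intensity lists (0.0 if either is empty or all zero)."""
    e = normalize_to_base_peak(experimental)
    t = normalize_to_base_peak(theoretical)
    numerator = sum(a * b for a, b in zip(e, t))
    denominator = math.sqrt(sum(a * a for a in e) * sum(b * b for b in t))
    return numerator / denominator if denominator > 0 else 0.0
```

Because the score is scale invariant, the base peak normalization does not change its value; it is kept here only to mirror the procedure described in the text.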



RESULTS AND DISCUSSION

Model Characterization. Statistics of the fragment ions derived from different fragmentation pathways in the training set and the training results of the MS2PBPI models are given in Table 1. Because we generated, for each peptide, all possible fragment ions produced by the fragmentation pathways considered in the current MS2PBPI framework and assigned an intensity of 0 to ions that could not be matched to peaks in the experimental MS/MS spectra, the intensity distribution of each ion type is dominated by low intensities. The situation is even more severe in the training sets generated from triply charged precursors because we also enumerated all possible charge states of the fragment ions. However, applying binary trees reveals the fragmentation behavior of the different ion types (an example for y ions generated by the proline effect in singly charged precursors is shown in Supporting Information Figure S-2), providing valuable insights into peptide fragmentation. Further, the binary trees filter out the majority of low intensities, which would probably be considered noise. Moreover, these low-intensity peaks contribute almost nothing to PSMs because the similarities between experimental and theoretical MS/MS spectra are dominated by abundant peaks.39 Hence MS2PBPI constructed models only for regions containing high-intensity fragment ions, leaving the majority of regions, which are filled with low-intensity fragments, to be predicted by single values calculated from their intensity distributions, as shown in Table 1. The R10FCV distributions of all types of fragment ions in Table 1 show some very interesting trends. Generally, y and b ions require the most models and achieve the highest R10FCV, whereas far fewer models are obtained for less abundant ions, such as neutral losses of y and b ions, and these achieve much lower R10FCV. Although binary trees can effectively filter out low intensities and extract high intensities for stochastic gradient boosting tree regression, the regions obtained from these low

Figure 1. Distribution of similarity scores between predicted mass spectra and LTQ tandem mass spectra. +1, +2, and +3 in legend represent the charge states of precursor ions.

MS2PBPI can predict MS/MS spectra that are highly consistent with the experimental ones (average similarity scores are 0.9393 and 0.9266, respectively). For triply charged precursors, the similarity is much lower (average similarity score is 0.8667). Previous investigations have shown a substantial effect of Coulomb repulsion on the fragmentation pattern of highly charged ions.40 Consequently, the fragmentation rules learned from precursors with low charge states are weakened,35 making the MS/MS spectra of these precursors less predictable. Nevertheless, MS2PBPI can predict MS/MS spectra for highly charged precursors with reasonable accuracy. Some examples of predicted MS/MS spectra for peptides with different charge states are provided in Supporting Information Figure S-3.


Figure 2. Distributions of similarity scores between predicted and experimental MS/MS spectra derived from singly (A, +1), doubly (B, +2), triply (C, +3) charged unmodified peptides, and from modified peptides (D). The algorithms used in current comparisons are MS2PBPI, MassAnalyzer, and PeptideART.

MS/MS Prediction of Modified Peptides. Currently, only five peptide modification types are considered in the MS2PBPI models. Specifically, methanesulfenic acid and phosphoric acid losses from p, y, and b ions are trained separately because they are the diagnostic fragments of peptides containing oxidized methionine and phosphorylated residues, both of which are common modifications encountered in protein sequencing. Because modifications can change the fragmentation patterns of the peptides containing them, the models constructed for unmodified peptides cannot be applied to predict MS/MS spectra of modified ones. Therefore, it is necessary to construct new models for all other fragments using new variable sets that include the modification information in the peptide sequences. Accordingly, we trained new models for modified peptides and phosphorylated peptides separately. Note that the other modifications are also considered when training on MS/MS spectra derived from phosphorylated peptides. The characteristics of the modified peptide training sets and the training results are shown in Tables S-4 and S-5 in the Supporting Information. As shown there, the performance of the retrained models is comparable with that for unmodified peptides, indicating the good extensibility of MS2PBPI. The distributions of similarity scores between predicted and experimental MS/MS spectra generated by precursors containing oxidized methionine and phosphorylated serine or tyrosine are shown in Figure S-4 in the Supporting Information. For precursors containing methionine oxidation, very high consistency between predicted and experimental MS/MS spectra is observed (average similarity scores are 0.9220 and 0.8647 for doubly and triply charged precursors, respectively). As

methionine oxidation in peptide can lead to an extremely strong selective fragmentation channel,41 it seems to be simple to achieve such high consistency because the behavior of the channel can be well characterized by binary tree prior to model construction. For other modifications, though fragmentation patterns of these peptides vary considerably comparing to unmodified counterparts and result in nonselective cleavages, for example, pyroglutamate (pyro-Glu) formation of glutamic acid and glutamine can suppress the selective formation of p-18 when the free amino acids locate at N-terminus of peptides, high similarity scores are still obtained (average scores are 0.9266 and 0.8660 for doubly and triply precursors respectively). This indicates that MS2PBPI has high efficiency in predicting these types of peptides. Though selective fragmentation channel resulting in loss of neutral molecule from precursor ion is also observed during fragmentation of phosphorylated peptides, much lower similarity scores are obtained (average similarity scores are 0.8785 and 0.8805 for doubly and triply charged precursors). This phenomenon may be due to the less selective of the fragmentation channel and the poor prediction ability of MS2PBPI models for the intensity of p ions (see Table S-5 in the Supporting Information). Comparison of MS2PBPI to Other Algorithms. In this comparison, only PeptideART and MassAnalyzer are used. We attempted to compare our models to another two algorithms Riptide and MS2PIP. However, the former has been stopped from maintaining and could not be recompiled, while current distribution of the latter has some bugs and is being recompiled. Figure 2 shows the similarity scores between MS/MS spectra predicted by the algorithms and experimental ones. Since MassAnalyzer has developed models for predicting 7451

dx.doi.org/10.1021/ac501094m | Anal. Chem. 2014, 86, 7446−7454

Analytical Chemistry

Article

Figure 3. Distributions of similarity scores between MS2PBPI predicted and experimental MS/MS spectra. Labels Agilent, LCQ Deca, LTQ-FT, LTQ-Orbitrap, QSTAR, and Waters in the legend indicate that the scores were derived from Agilent XCT Ultra, LCQ-Deca, LTQ-FT, LTQ Orbitrap, ABI QSTAR Pulsar i, and Waters/Micromass Q-TOF Ultima. (A) doubly charged precursors; (B) triply charged precursors.

complexity of peptide fragmentation, which can also be proved by the tail toward very low scores in score distributions of other two algorithms. But as cost of this, much larger models must be constructed to store the fragmentation patterns (e.g., total 65 505 regions are obtained and 2420 SGBTree models are constructed from all our training data sets) and much more times will be spent to process the patterns (e.g., model import, fragmentation pattern extraction) during MS/MS spectrum predicting accordingly.

MS/MS spectra for modified peptides, we also compared the predicting ability of MS2PBPI for these peptides to that of MassAnalyzer and show results in Figure 2D. Generally, MS2PBPI performs much better than the other two algorithms, while PeptideART performs better than MassAnalyzer for precursors with low charge states but analogically for triply charged precursors. The superiority of MS2PBPI may indicate the advantages learning from large-scale data sets because of the sheer 7452
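The storage and lookup cost discussed in the comparison with other algorithms comes from keeping many regions and region-specific models. The sketch below illustrates one way a fragment could be routed through a binary tree to a region and then passed to that region's model, falling back to a default when no model was built for the region. It is a sketch under stated assumptions, not the MATLAB implementation of MS2PBPI: the variable names `has_proline` and `relative_position`, the thresholds, and the stand-in lambda models are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

@dataclass
class Leaf:
    region_id: int            # terminal node: identifies one region of the data

@dataclass
class Split:
    variable: str             # name of the variable tested at this node
    threshold: float
    left: "Node"              # followed when value <= threshold
    right: "Node"             # followed when value > threshold

Node = Union[Leaf, Split]
Model = Callable[[Dict[str, float]], float]

def find_region(node: Node, features: Dict[str, float]) -> int:
    """Walk the binary tree until a leaf is reached and return its region id."""
    while isinstance(node, Split):
        go_left = features[node.variable] <= node.threshold
        node = node.left if go_left else node.right
    return node.region_id

def predict_intensity(tree: Node, models: Dict[int, Model],
                      features: Dict[str, float], default: float = 0.0) -> float:
    """Use the region's model if one exists, otherwise return the default."""
    model = models.get(find_region(tree, features))
    return default if model is None else model(features)

# Hypothetical tree with three regions; only regions 0 and 2 have a model.
tree = Split("has_proline", 0.5,
             Split("relative_position", 0.5, Leaf(0), Leaf(1)),
             Leaf(2))
models = {
    0: lambda f: 0.10 + 0.2 * f["relative_position"],  # stand-in for a trained model
    2: lambda f: 0.9,                                   # stand-in for a trained model
}

fragment = {"has_proline": 1, "relative_position": 0.3}
print(find_region(tree, fragment), predict_intensity(tree, models, fragment))
```

Keeping the models in a dictionary keyed by region reflects the situation described above, where far fewer models (2420) than regions (65 505) are constructed, so a lookup can legitimately come back empty.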


Applying MS2PBPI to Different Instrumental Platforms. Technological developments in tandem mass spectrometry allow proteins to be analyzed on various instrumental platforms. Because the instruments differ in physical principle, operation mode, and other respects, the MS/MS spectra they generate vary considerably. This variability hinders MS/MS spectral prediction models constructed for one instrumental platform from being applied to other platforms. Figure 3 shows the distributions of similarity scores between MS2PBPI-predicted and experimental MS/MS spectra derived from different instruments. For the IT instruments Agilent XCT Ultra, LCQ-Deca, LTQ-FT, and LTQ-Orbitrap, the MS/MS spectra predicted by the models agree very well with the experimental ones (average scores are 0.9304, 0.9044, 0.9248, and 0.9163 for doubly charged precursors and 0.8845, 0.8622, 0.8694, and 0.8956 for triply charged precursors). As the ion trap is one of the most commonly used mass analyzers, this high consistency indicates a fairly wide applicability of MS2PBPI, regardless of resolving power. However, the similarities between predicted and experimental MS/MS spectra derived from the TOF instruments ABI QSTAR Pulsar i and Waters/Micromass Q-TOF Ultima are much lower than those for ion traps (average similarity scores are 0.7834 and 0.7781 for doubly charged precursors and 0.8161 and 0.7503 for triply charged precursors). Because completely different principles and raw data processing procedures are applied in IT and TOF instrumentation, the MS/MS spectra produced by these two types of instruments can be expected to differ, making the current models inapplicable to TOF instruments. Nevertheless, this limitation can be overcome by constructing targeted models.
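To make the gap between ion trap and TOF instruments concrete, the averages reported above can be summarized per analyzer type. The short calculation below is only illustrative: it takes the unweighted mean of the per-instrument averages, which is an assumption, because the number of spectra behind each average is not given in this section.

```python
from statistics import mean

# Average similarity scores quoted in the text, keyed by analyzer type and
# precursor charge. Instrument order: Agilent XCT Ultra, LCQ-Deca, LTQ-FT,
# LTQ-Orbitrap for ion traps; ABI QSTAR Pulsar i, Waters/Micromass Q-TOF
# Ultima for TOF.
reported = {
    "ion trap": {2: [0.9304, 0.9044, 0.9248, 0.9163],
                 3: [0.8845, 0.8622, 0.8694, 0.8956]},
    "TOF":      {2: [0.7834, 0.7781],
                 3: [0.8161, 0.7503]},
}

# Unweighted mean per analyzer type and charge state (assumption: equal weight).
summary = {
    analyzer: {charge: mean(scores) for charge, scores in by_charge.items()}
    for analyzer, by_charge in reported.items()
}

# Difference between ion trap and TOF means for each charge state.
gap = {charge: summary["ion trap"][charge] - summary["TOF"][charge]
       for charge in (2, 3)}

for charge in (2, 3):
    print(f"+{charge}: IT {summary['ion trap'][charge]:.4f}, "
          f"TOF {summary['TOF'][charge]:.4f}, gap {gap[charge]:.4f}")
```

Under this equal-weight assumption the difference is roughly 0.14 for doubly charged and 0.09 for triply charged precursors.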

CONCLUSION

Peptide fragmentation is an exceedingly complex process. It is determined by experimental conditions such as collision energy and ionization temperature as well as by the amino acid composition of the peptide itself, which makes peptide MS/MS spectrum prediction a challenging task. In the present work, we developed a general MS/MS spectrum prediction framework named MS2PBPI and applied it to different conditions, i.e., unmodified and modified peptides and different ion trap instruments. By constructing tens to hundreds of targeted SGBTree models, MS2PBPI is able to accurately predict MS/MS spectra for each condition, providing a valuable solution to the problem of rough spectrum prediction during peptide identification by tandem mass spectrometry. However, one limitation of the current framework is that MS/MS spectrum prediction models for TOF instruments, as well as for peptides containing PTMs other than those currently considered, are still lacking. Another limitation is the poor ability to predict intensities for the precursor ion series. The former can be readily addressed by constructing suitable models within the MS2PBPI framework, whereas the latter needs more investigation. Nevertheless, benefiting from the enormous amount of available LC-MS/MS raw data, accurate prediction of MS/MS spectra by machine learning algorithms becomes possible, as shown in the present work. The current MS2PBPI framework is implemented in MATLAB (www.mathworks.com). The source code is provided and can be obtained from https://code.google.com/p/ms2pbpi/ under the Apache License 2.0. We also provide text versions of each binary tree and SGBTree model, as well as an introduction to variable calculation, on the same Web site to help implement MS2PBPI in other languages such as Python and Java.

ASSOCIATED CONTENT

Supporting Information

Figure S-1: Flow chart of MS2PBPI. Table S-1: Variables derived from unmodified peptides. Table S-2: Additional variables for modified peptides. Figure S-2: Binary tree for singly charged y fragments derived from proline effect. Table S-3: Statistics of testing data sets. Figure S-3: Examples of predicted MS/MS spectra by MS2PBPI. Table S-4: Characteristics of training set for modified peptides. Table S-5: Characteristics of training results for phosphorylated peptides. Figure S-4: Distributions of similarity scores for modified peptides. This material is available free of charge via the Internet at http://pubs.acs.org/.

AUTHOR INFORMATION

Corresponding Author

*Phone: +86 731-88830831. E-mail: [email protected].

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

This work was financially supported by the National Nature Foundation Committee of P. R. China (grant nos. 21275164 and 21105129).

REFERENCES

(1) Yates, J. R. J. Am. Chem. Soc. 2013, 135, 1629−1640.
(2) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976−989.
(3) Mann, M.; Wilm, M. Anal. Chem. 1994, 66, 4390−4399.
(4) Nesvizhskii, A. I. J. Proteomics 2010, 73, 2092−2123. Eng, J. K.; Searle, B. C.; Clauser, K. R.; Tabb, D. L. Mol. Cell Proteomics 2011, DOI: 10.1074/mcp.R111.009522.
(5) Fenyo, D.; Beavis, R. C. Anal. Chem. 2003, 75, 768−774.
(6) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383−5392.
(7) Renard, B. Y.; Timm, W.; Kirchner, M.; Steen, J. A. J.; Hamprecht, F. A.; Steen, H. Anal. Chem. 2010, 82, 4314−4318.
(8) Elias, J. E.; Gygi, S. P. Nat. Methods 2007, 4, 207−214.
(9) Bern, M.; Goldberg, D.; McDonald, W. H.; Yates, J. R. Bioinformatics 2004, 20, 49−54. Nesvizhskii, A. I.; Roos, F. F.; Grossmann, J.; Vogelzang, M.; Eddes, J. S.; Gruissem, W.; Baginsky, S.; Aebersold, R. Mol. Cell Proteomics 2006, 5, 652−670.
(10) Tabb, D. L.; MacCoss, M. J.; Wu, C. C.; Anderson, S. D.; Yates, J. R. Anal. Chem. 2003, 75, 2470−2477. Beer, I.; Barnea, E.; Ziv, T.; Admon, A. Proteomics 2004, 4, 950−960. Frank, A. M.; Bandeira, N.; Shen, Z.; Tanner, S.; Briggs, S. P.; Smith, R. D.; Pevzner, P. A. J. Proteome Res. 2008, 7, 113−122.
(11) Sadygov, R. G.; Eng, J.; Durr, E.; Saraf, A.; McDonald, H.; MacCoss, M. J.; Yates, J. R. J. Proteome Res. 2002, 1, 211−215. Na, S.; Paek, E.; Lee, C. Anal. Chem. 2008, 80, 1520−1528. Sadygov, R. G.; Hao, Z.; Huhmer, A. F. R. Anal. Chem. 2008, 80, 376−386.
(12) Venable, J. D.; Xu, T.; Cociorva, D.; Yates, J. R. Anal. Chem. 2006, 78, 1921−1929. Petyuk, V. A.; Mayampurath, A. M.; Monroe, M. E.; Polpitiya, A. D.; Purvine, S. O.; Anderson, G. A.; Camp, D. G.; Smith, R. D. Mol. Cell Proteomics 2010, 9, 486−496.
(13) Steen, H.; Mann, M. Nat. Rev. Mol. Cell Bio 2004, 5, 699−711.
(14) Havilio, M.; Haddad, Y.; Smilansky, Z. Anal. Chem. 2003, 75, 435−444.
(15) Elias, J. E.; Gibbons, F. D.; King, O. D.; Roth, F. P.; Gygi, S. P. Nat. Biotechnol. 2004, 22, 214−219.


(16) Li, W. Z.; Ji, L.; Goya, J.; Tan, G. H.; Wysocki, V. H. J. Proteome Res. 2011, 10, 1593−1602. Xiao, C. L.; Chen, X. Z.; Du, Y. L.; Sun, X. S.; Zhang, G.; He, Q. Y. J. Proteome Res. 2013, 12, 328−335.
(17) Zhang, Z. Q. Anal. Chem. 2004, 76, 3908−3922. Zhang, Z. Q. Anal. Chem. 2005, 77, 6364−6373.
(18) Arnold, R. J.; Jayasankar, N.; Aggarwal, D.; Tang, H.; Radivojac, P. Pac. Symp. Biocomput 2006, 219−230.
(19) Klammer, A. A.; Reynolds, S. M.; Bilmes, J. A.; MacCoss, M. J.; Noble, W. S. Bioinformatics 2008, 24, I348−I356.
(20) Zhou, C.; Bowler, L. D.; Feng, J. F. BMC Bioinf. 2008, 9, 325.
(21) Degroeve, S.; Martens, L. Bioinformatics 2013, 29, 3199−3203.
(22) Frank, A. M. J. Proteome Res. 2009, 8, 2226−2240. Frank, A. M. J. Proteome Res. 2009, 8, 2241−2252.
(23) Freund, Y.; Iyer, R.; Schapire, R. E.; Singer, Y. J. Mach Learn Res. 2003, 4, 933−969.
(24) Sun, S. J.; Meyer-Arendt, K.; Eichelberger, B.; Brown, R.; Yen, C. Y.; Old, W. M.; Pierce, K.; Cios, K. J.; Ahn, N. G.; Resing, K. A. Mol. Cell Proteomics 2007, 6, 1−17. Li, S.; Arnold, R. J.; Tang, H.; Radivojac, P. Anal. Chem. 2011, 83, 790−6.
(25) DeGnore, J. P.; Qin, J. J. Am. Soc. Mass Spectrom. 1998, 9, 1175−1188. Tholey, A.; Reed, J.; Lehmann, W. D. J. Mass Spectrom 1999, 34, 117−123. Steen, H.; Mann, M. J. Am. Soc. Mass Spectrom. 2001, 12, 228−232. Dodds, E. D. Mass Spectrom Rev. 2012, 31, 666−682.
(26) Zhang, Z. Q. Anal. Chem. 2010, 82, 1990−2005. Zhang, Z. Q.; Shah, B. Anal. Chem. 2010, 82, 10194−10202. Zhang, Z. Q. Anal. Chem. 2011, 83, 8642−8651.
(27) Bodenmiller, B.; Campbell, D.; Gerrits, B.; Lam, H.; Jovanovic, M.; Picotti, P.; Schlapbach, R.; Aebersold, R. Nat. Biotechnol. 2008, 26, 1339−1340.
(28) Paizs, B.; Suhai, S. Mass Spectrom Rev. 2005, 24, 508−548.
(29) Harrison, A. G.; Young, A. B.; Bleiholder, C.; Suhai, S.; Paizs, B. J. Am. Chem. Soc. 2006, 128, 10364−5.
(30) Schwartz, B. L.; Bursey, M. M. Biol. Mass Spectrom 1992, 21, 92−96. Vaisar, T.; Urban, J. J. Mass Spectrom 1996, 31, 1185−1187.
(31) Tsaprailis, G.; Nair, H.; Somogyi, A.; Wysocki, V. H.; Zhong, W. Q.; Futrell, J. H.; Summerfield, S. G.; Gaskell, S. J. J. Am. Chem. Soc. 1999, 121, 5142−5154.
(32) Cordero, M. M.; Houser, J. J.; Wesdemiotis, C. Anal. Chem. 1993, 65, 1594−1601. Savitski, M. M.; Fälth, M.; Fung, Y. M. E.; Adams, C. M.; Zubarev, R. A. J. Am. Soc. Mass Spectrom. 2008, 19, 1755−1763.
(33) Harrison, A. G. J. Mass Spectrom 2003, 38, 174−187.
(34) Barton, S. J.; Whittaker, J. C. Mass Spectrom Rev. 2009, 28, 177−187.
(35) Dong, N. P.; Zhang, L. X.; Liang, Y. Z. Int. J. Mass Spectrom. 2011, 308, 89−97.
(36) Breiman, L.; Friedman, J.; Stone, C. J.; Olshen, R. A. Classification and Regression Trees, 1st ed.; CRC Press: Belmont, CA, 1984.
(37) Friedman, J. H. Comput. Stat. Data Anal. 2002, 38, 367−378.
(38) Klimek, J.; Eddes, J. S.; Hohmann, L.; Jackson, J.; Peterson, A.; Letarte, S.; Gafken, P. R.; Katz, J. E.; Mallick, P.; Lee, H.; Schmidt, A.; Ossola, R.; Eng, J. K.; Aebersold, R.; Martin, D. B. J. Proteome Res. 2008, 7, 96−103.
(39) Stein, S. E.; Scott, D. R. J. Am. Soc. Mass Spectrom. 1994, 5, 859−66. Dromey, R. G. Anal Chim Acta-Comp 1979, 3, 133−141.
(40) Rockwood, A. L.; Busman, M.; Smith, R. D. Int. J. Mass Spectrom Ion Process 1991, 111, 103−129. Tang, X. J.; Thibault, P.; Boyd, R. K. Anal. Chem. 1993, 65, 2824−2834.
(41) Reid, G. E.; Roberts, K. D.; Kapp, E. A.; Simpson, R. I. J. Proteome Res. 2004, 3, 751−9.
