Anal. Chem. 2007, 79, 4870-4878
MSNovo: A Dynamic Programming Algorithm for de Novo Peptide Sequencing via Tandem Mass Spectrometry Lijuan Mo, Debojyoti Dutta, Yunhu Wan, and Ting Chen*
Department of Biology, Department of Mathematics, University of Southern California, Los Angeles, California 90089
Tandem mass spectrometry (MS/MS) has become the experimental method of choice for high-throughput proteomics-based biological discovery. The two primary ways of analyzing MS/MS data are database search and de novo sequencing. In this paper, we present a new approach to peptide de novo sequencing, called MSNovo, which has the following advanced features. (1) It works on data generated from both LCQ and LTQ mass spectrometers and interprets singly, doubly, and triply charged ions. (2) It integrates a new probabilistic scoring function with a mass array-based dynamic programming algorithm. The simplicity of the scoring function, with only 6-10 parameters to be trained, avoids the problem of overfitting and allows MSNovo to be adopted for other machines and data sets easily. The mass array data structure explicitly encodes all possible peptides and allows the dynamic programming algorithm to find the best peptide. (3) Compared to existing programs, MSNovo predicts peptides as well as sequence tags with a higher accuracy, which is important for those applications that search protein databases using the de novo sequencing results. More specifically, we show that MSNovo outperforms other programs on various ESI ion trap data. We also show that for high-resolution data the performance of MSNovo improves significantly. Supporting Information, executable files and data sets can be found at http:// msms.usc.edu/supplementary/msnovo. Tandem mass spectrometry (MS/MS) is now the most widely used high-throughput experimental technique to analyze peptides, proteins, and protein complexes.1,35,39,46 The primary applications of MS/MS in proteomics include identification of peptides,4,5,11,12,14-16,20,23,26,27,29,41,42,46,47,49,50 identification of proTo whom the correspondence should be addressed. E-mail:
[email protected] (1) Aebersold, R.; Mann, M. Nature 2003, 422 (6928), 198-207. (2) Aebersold, R.; Mann, M. Nature 2003, 422 (6928), 198-207. (3) Alves, G.; Yu, Y. K. Bioinformatics 2005; (Aug 16). (4) Anderson, D. C.; Li, W.; Payan, D. G.; Noble, W. S. J. Proteome Res. 2003, 2 (2), 137-46. (5) Bafna, V.; Edwards, N. Bioinformatics 2001, 17 (Suppl 1), S13-21. (6) Bandeira, N.; Tang, H.; Bafna, V.; Pevzner, P. Anal Chem. 2004, 76 (24), 7221-33. (7) Bern, M.; Goldberg, D. EigenMS: de novo analysis of peptide tandem mass spectra by spectral graph paritioning. RECOMB 2005. (8) Fischer, B.; Roth, V.; Roos, F.; Grossmann, J.; Baginsky, S.; Widmayer, P.; Gruissem, W.; Buhmann, J. M. Anal. Chem. 2005, 77, 7265-73.
4870 Analytical Chemistry, Vol. 79, No. 13, July 1, 2007
teins,19,33,37,43 identification of protein complexes,22,28,29 protein sequencing,6 protein quantification,25 identification of modified or mutated proteins,21,44 identification of sequence tags,34,36 and identification of protein cross-links for protein structure studies10,48 among others. Tandem mass spectrometry enables specific peptides to be detected in complex mixtures due to their specific and characteristic fragmentation patterns inside the spectrometer. There are two major classes of methods for peptide identification using MS/ MS. One is the database search, and the other is the de novo sequencing. In database search, such as in SEQUEST16 and Mascot,41 the first step is to preprocess a protein database by digesting each protein into smaller peptides in silico and indexing these peptides by their masses. Then for a query spectrum with a precursor ion mass M, these algorithms extract, or filter, a set of peptides SM that contains all the peptides whose masses are within a certain range δ of M: [M - δ,M + δ]. All these extracted peptides are then compared or scored against the query spectrum using some scoring functions such as the cross-correlations used in SEQUEST or the probabilistic model used in Mascot. The peptides with the best scores are reported upon ranking. Though it is the method of choice for most current discovery projects, one major limitation of the database search method is that it (9) Chen, T.; Kao, M. Y.; Tepel, M.; Rush, J.; Church, G. M. J. Comput. Biol. 2001, 8 (3), 325-37. (10) Chen, T.; Jaffe, J.; Church, G. M. J. Comput. Biol. 2001, 8 (6), 571-83. (11) Colinge, J.; Masselot, A.; Giron, M.; Dessingy, T.; Magnin, J. Proteomics 2003, 3 (8), 1454-63. (12) Creasy, D. M.; Cottrell, J. S. Proteomics 2002, 2, 1426-34. (13) Dancik, V.; et al. J. Comput Biol. 1999, 6, 327-42. (14) Demine R.; Walden P. Rapid Commun. Mass Spectrom. 2004, 18 (8), 90713. (15) Elias, J. E.; Gibbons, F. D.; King, O. D.; Roth, F. P.; Gygi, S. P. Nat. Biotechnol. 2004, 22 (2), 214-9. (16) Eng, J. K.; et al. J. Am. Soc. Mass Spectrom. 1994, 5, 976-89. (17) Frank, A.; Pevzner, P. Anal. Chem. 2005, 77, 964-73. (18) Frank, A.; et al. J. Proteome Res. 2007, 6 (1), 114-23. (19) Fenyo, D.; Beavis, R. C. Anal. Chem. 2003, 75 (4), 768-74. (20) Field, H. I.; Fenyo, D.; Beavis R. C. Proteomics 2002, 2 (1), 36-47. (21) Gatlin, C.; Eng, J.; Cross, S.; Detter, J.; Yates, J. Anal. Chem. 2000, 72, 757-63. (22) Gavin, A. C.; et al. Nature 2001, 415 (6868), 141-7. (23) Geer, L. Y.; Markey, S. P.; Kowalak J. A.; Wagner, L.; Xu M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. J. Proteome Res. 2004, 3 (5), 958-64. (24) Lyris, M. F.; de Godoy, J. V.; Olsen, G.; de Souza, A.; Li, G.; Mortensen, P.; Mann, M. Genome Biol. 2006, 7, R50. (25) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Nat. Biotechnol. 1999, 17 (10), 994-9. (26) Hansen, B. T.; Jones, J. A.; Mason, D. E.; Liebler, D. C. Anal. Chem. 2001, 73 (8), 1676-83. 10.1021/ac070039n CCC: $37.00
© 2007 American Chemical Society Published on Web 06/06/2007
cannot identify novel proteins whose sequences are not already included in sequence databases. De novo sequencing algorithms aim to reconstruct the peptide sequence directly from a MS/MS spectrum without the aid of protein sequence databases. Therefore, in principle, it can overcome some of the limitations of database search algorithms. Many de novo sequencing algorithms have been developed since the late 1990s, including Sherenga,13 Lutefisk,44 the dynamic programming method,9 and a suboptimal method,31 PEAKS,32 DACSIM,51 EigenMS,7 PepNovo,17,18 and NovoHMM.8 The fundamental problem of these de novo sequencing algorithms is to find a complete sequence or a ladder of b- or y-ions such that the distance between the consecutive ladders or peaks is equal to the mass of an amino acid. However, the existence of isotopes, incomplete fragmentations, multiple fragmentations, unknown fragmentations, and random noises severely inhibits the efficacy of these de novo algorithms in practice and frequently leads to false positive predictions. Thus, as of today, de novo sequencing tools are not as widely used as database search methods. Mathematically, the de novo sequencing problem can be formulated as follows: Given a parent mass M, an error range , an experimental spectrum S, and a scoring function f ( ), find a peptide P with mass m such that |M - m| < and the hypothetical spectrum H of the peptide P has the best match with S, i.e., f (H,S) is maximized. There are two major steps in developing a reliable de novo sequencing algorithm. The first is to generate a pool of candidates, one of which is the correct peptide. The second is to design a good scoring function to select the best candidate from the pool. For the first task, most de novo algorithms3,9,13,31,32 create a spectrum graph and then apply a dynamic programming algorithm to find paths in the graph that represent peptide candidates. Some algorithms51 use a divide and conquer heuristics combined with more sophisticated scoring functions to generate peptide candidates. The choice of scoring functions is quite limited because the dynamic programming algorithm works only if the scoring function is additive: the score of a peptide equals the sum or the multiplication of the score of each ion (or fragmentation). Nevertheless many scoring functions have been proposed. Dancik (1999) considered the probability of different ion types in their (27) Havilio, M.; Haddad, Y.; Smilansky, Z. Anal. Chem. 2003, 75, 435-44. (28) Ho, Y.; Gruhler, A.; Heilbut, A.; Bader, G. D.; Moore, L.; Adams, S.; Millar, A.; Taylor, P.; Bennett, K.; Boutilier, K.; et al. Nature 2002, 415, 180-3. (29) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold R. Anal. Chem. 2002, 74, 5383-92. (30) Keller, A.; et al. Omics 2002, 6, 207-12. (31) Lu B.; Chen T. J. Comput. Biol. 2003, 10, 1-12. (32) Ma, B.; Doherty-Kirby, A.; Lajoie, G. Rapid Commun. Mass Spectrom. 2003, 17, 2337-42. (33) MacCoss, M. J.; Wu, C. C.; Yates, J. R., 3rd. Anal. Chem. 2002, 74 (21), 5593-9. (34) Mann, M.; Wilm, M. Anal. Chem. 1994, 66, (24), 4390-9. (35) Mann, M.; Jensen, O. N. Nat. Biotechnol. 2003, 21, 255-261. (36) Mortz, E.; O’Connor, P. B.; Roepstorff, P.; Kelleher, N. L.; Wood, T. D.; McLafferty, F. W.; Mann, M. Proc. Natl. Acad. Sci. U.S.A. 1996, 93 (16), 8264-7. (37) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A. Anal. Chem. 2003, 75 (17), 4646-58. (38) Omenn, G. S.; et al. Proteomics 2005, 5 (13), 3226-45. (39) Pandey, A.; Mann, M. Nature 2003, 405, 837-46. (40) Prince, J. T.; Carlson, M. W.; Want, R.; Lu, P.; Marcotte, E. M. Nat. Biotechnol. 2004, 22, 471-474.
scoring function. Later, Frank and Pevzner(2005) developed a de novo sequencing program called PepNovo by using a probability network with a hypothesis testing. At the same time, Fischer et al.8 proposed an HMM model called NovoHMM. Recently, Frank and Pevzner released a new version of PepNovo that deals with LTQ-FT data.18 The new version of PepNovo generates sequence tags up to eight amino acids long, which are then used to query the database for peptide identification. In this paper, we present a new de novo sequencing algorithm that outperforms the existing ones and is novel in the following ways: (1) We use a novel scoring function based on the probabilistic distributions of the match tolerance, defined as the distance between a mass peak and a hypothetical ion, and the normalized ranking of peak intensities, defined as the ratio of the ranking of the intensity in descending order over the total number of peaks. We have shown that the match tolerance follows a normal distribution centered at around 0 and that the normalized ranking of peak intensities follows an exponential distribution. Our contribution is to leverage the above two distributions to calculate the score for the match of a peak with an ion, the likelihood ratio of the probability that the peak corresponds to this ion over the probability that this peak is a noise peak. The concept of likelihood ratio scores has been used in different ways in other peptide identification programs.13-15 (2) We do not use the traditional spectrum graph in our model. Instead, we construct a mass array starting from 0 with a predetermined resolution. The idea of the “mass array” was first used in ref 9 to determine whether a given mass corresponds to the mass of some peptides or b/y ions in the construction of the spectrum graph. The idea was later applied to sample random peptides that have the same mass in ref 46. Our mass array data structure is different from the spectrum graph in that the mass array explicitly encodes all possible peptides with the same mass M, while the spectrum graph only encodes peptides that are fit into the given spectrum. We will explain the differences in details in the later sections. Empirical results show that our program, MSNovo, performs better than existing de novo tools on multiple data sets, due to the improved scoring function and the novel mass array based dynamic programming algorithm. MATERIALS AND METHODS (1) Data Sets. We obtained MS/MS spectra data from four sources: the ISB data set,30 Open Proteomics Database (OPD),40 PeptideAtlas,38 and a recent LTQ data set of Godoy et al.24 We use five data sets ISB769, ISB646(charge +3), OPD280, HUPO513, and LTQ600 from the above sources to compare the performance (41) Perkins, D. N.; et al. Electrophoresis 1999, 20, 3551-67. (42) Sadygov, R. G.; Yates, J. R., 3rd. Anal. Chem. 2003, 75 (15), 3792-8. (43) Sadygov, R. G.; Liu, H.; Yates, J. R. Anal. Chem. 2004, 76 (6), 1664-71. (44) Taylor, J. A.; Johnson, R. S. Rapid Commun. Mass Spectrom. 1997, 11 (9), 1067-75. (45) Tsur, D.; Tanner, S.; Zandi, E.; Bafna, V.; Pevzner, P. Nat. Biotechnol. 2005, 23 (12), 1562-7. (46) Wan, Y.; Yang, A.; Chen, T. Anal. Chem. 2006, 78 (2), 432-7. (47) Yates, J. R., 3rd; Eng, J. K.; McCormack, A. L.; Schieltz, D. Anal Chem. 1995, 67, 1426-36. (48) Young, M. M.; Tang, N.; Hempel, J. C.; Oshiro, C. M.; Taylor, E. W.; Kunte, I. D.; Gibson, B. W.; Dollinger, G. Proc. Natl. Acad. Sci. U.S.A. 2000, 97 (11), 5802-6. (49) Zhang, N.; Aebersold, R.; Schwikowski, B. Proteomics 2002, 2, 1406-12. (50) Zhang, W.; Chait, B. T. Anal. Chem. 2000, 72, 2482-9. (51) Zhang, Z. Anal. Chem. 2004, 76, 6374-83.
Analytical Chemistry, Vol. 79, No. 13, July 1, 2007
4871
Table 1. Means and Standard Deviations of Precursor Mass Errors and Fragments Mass Errors of Each Data Set precursor mass error data set
mean
standard deviation
ISB769 ISB646 OPD280 HUPO513 LTQ600 LTQ2448
-0.7717 -1.384 -0.5701 -0.7949 -0.02843 -0.1517
0.3974 0.3915 0.3222 0.6647 0.01275 0.4723
fragments mass error mean
standard deviation
-0.02048 0.03882 -0.03210 -0.07674 -0.01317 -0.05700
0.1296 0.1862 0.1377 0.1578 0.1129 0.1562
of MSNovo with other de novo sequencing programs. The mean and the standard deviation of precursor mass errors and fragments’ mass errors for each data set is listed in Table 1. For ISB646 (triply charged spectra), we consider major fragments b-, y-, b2+- and y2+-ions. For other data sets, we only consider the major fragments, i.e., the b- and y-ions. ISB769: We obtained a tandem mass spectra data set from The Institute of Systems Biology (ISB).30 The spectra data came from 22 runs of LC/MS/MS of two protein mixtures consisting of 18 purified proteins of different physicochemical properties with different relative molar amounts and modifications. The data set was analyzed by SEQUEST followed by manual validation. We selected 769 [M + 2H]2+ tryptic spectra, referred to as ISB769, which have SEQUEST Xcorr score of 2.5 or more. In this data set, on average 47.7% of the b-ions and 51.2% of the y-ions are present, and we use these numbers as indicators of the quality of spectra. ISB646 (charge +3): There are 646 + 3 charged tryptic spectra identified by SEQUEST in the ISB data set. We name these spectra as ISB646 data set. In this data set, on average only ∼30.1% of the b-ions and 34.9% of the y-ions are present. OPD280: From the OPD,40 we obtained 280 doubly charged spectra with Xcorr > 2.5 that were used by PepNovo. This data set is referred to as OPD280. In this data set, on average around 47.8% of the b-ions and 53.1% of the y-ions are present. HUPO513: From PeptideAtlas, we obtained the HUPO Plasma Proteome Project (PPP) protein data sets, contributed by HUPO lab 28 (Pacific Northwest National Laboratories).38 We selected 513 doubly charged tryptic spectra with Xcorr > 2.5 from those generated from human serum. We restrict the length of peptides to be 16 residues or less. In this data set, on average about 49% of the b-ions and 62.1% of the y-ions are present. LTQ600: From a LTQ data set generated by Mann’s group,24 we selected 600 charge +1/+2 tandem mass spectra and named them LTQ600. The accuracy of the parent ion mass is 2 ppm, and the accuracy of the MS/MS spectra is roughly ( 0.1 Da. We used Mascot’s annotations to evaluate the accuracy of our de novo tool. We ranked all of these spectra according to Mascot scores in descending order and chose the first 600 as our data set. In this data set, on average about 80.4% of the b-ions and 81.3% of the y-ions are present, and also the percentages for b - H2O, b NH3, y - H2O, and y - NH3 ions are 63.7, 63.7, 53.7, and 54.4%, respectively. LTQ2448: Usually one run of LTQ experiment generates ∼8000 tandem mass spectra. We took two full runs of LTQ spectra 4872
Analytical Chemistry, Vol. 79, No. 13, July 1, 2007
data24 and used the Mascot annotation (p-value < 0.05) to obtain a set of 2448 doubly charged LTQ spectra. In this data set, on average, 61.9% of the b-ions and 71.3% of the y-ions are present. In addition, the percentages of b - H2O, b - NH3, y - H2O, and y - NH3 ions are 49.5, 50.2, 47, and 47%, respectively. Preprocessing Spectra. The Achilles heel of tandem mass spectra analysis is the amount of noise in the mass spectra. We call ∼80% noninterpretable peaks “noise” peaks and peaks matched to b-ions and y-ions “signal” peaks. The existence of a large number of noise peaks will cause false positive predictions. Removing those noise peaks improves the signal-to-noise ratio and, therefore, increases the prediction accuracy. As a first step, we remove noise peaks by using a sliding window-based noise removal method. We choose a window of size 100 u and select the top six picks from each window. Another important preprocessing step is to resolve the isotopic peaks in the spectrum. The presence of isotopic peaks leads to ambiguity in distinguishing amino acids that are 1 Da apart, such I/L(113), N(114), and D(115). (2) Scoring Function. The core of any peptide identification method is the scoring function. In database search, a scoring function calculates the similarity between an experimental spectrum and a hypothetical spectrum that is generated after the in silico digestion of protein sequences from a database. In de novo sequencing, an appropriate scoring function is used to (1) decide whether a pair of peaks is indeed a part of a sequence ladder and (2) to calculate the similarity between the experimental spectrum and the hypothetical spectrum generated from the (partial) de novo peptide sequence, i.e., the output. Distributions for Match Tolerance and Peak Intensity. In our model, there are two main parameters for calculating a good scoring function: the match tolerance and the peak intensity. Given a mass peak pi ) (mi,Ii) from an experimental spectrum and a peak (or an ion) qj ) (m′j,I′j) from another spectrum (or a hypothetical spectrum), we need to determine whether there is a match between the two peaks pi and qj. Intuitively a match would occur, when the difference of the mass-to-charge ratio or the match tolerance between the two peaks is small and when the intensity ranks of the two peaks are similar. The match tolerance and the normalized ranking of peak intensities are the two major novel features used in our scoring function as they determine whether a peak in the spectrum matches a hypothetical ion of a peptide or not. We use the ISB769 as the training data set to obtain the probabilistic distributions of the match tolerance and the peak intensities of b- and y-ions. We observe that the distribution of the match tolerance follows a normal distribution, and the distribution of the normalized ranking of the peak intensities follows an exponential distribution. The distributions for b- and y-ion intensities are different. For the LTQ data, we also observe a large fraction (60%) of the neutral loss ions, i.e., the b - H2O and the b - NH3 ions. Similarly, we found that the match tolerance and the normalized ranking of the peak intensities of the neutral loss ions also follow the normal distribution and the exponential distribution respectively. Figure 1 and Figure 2 show the distribution of the match tolerance and that of the peak intensities of the b - H2O ions of the LTQ spectra. Other ions in the LTQ spectra follow the similar distributions. Scoring a Match. We use the above distributions to create a new and effective scoring function for our de novo sequencing
Figure 1. Match tolerance of b - H2O ions follows a normal distribution.
program. For each ion in the hypothetical spectrum, we first determine a matching peak in the experimental spectrum. If there is no matching peak, we say the hypothetical ion is missing. If there is a match, then the peak could be a signal peak or a noise peak. For now, we consider two kinds of signals in our algorithm: b-ions and y-ions. It can easily be extended to include other ions such as a-ions and neutral loss ions. We denote the match tolerance and the normalized ranking of peak intensity by T, I, respectively. Note that the match tolerance is due to the difference of masses of a hypothetical ion and a mass peak while the normalized ranking of peak intensity is due to the intensity of the peak that matches with the hypothetical ion. We denote the probability that the matching peak is a signal given the match tolerance T and the normalized intensity rank I to be P(signal|T,I) and the probability that the matching peak is a noise peak given T and I to be P(noise|T,I). Our score is then defined to be the likelihood ratio of these two probabilities:
matchScore )
P(signal|T,I) P(noise|T,I)
)
P(T,I|signal) P(signal) P(T,I|noise) P(noise)
∝
P(T,I|signal) P(T,I|noise)
(1)
For the simplest case, we consider only two kinds of signals (band y-ions) in our algorithm, so that P(T,I|signal) can be further divided into P(T,I|b) and P(T,I|y) according to the types of the ions. Assuming T and I are independent, we have
scoreb )
P(T,I|b) P(T|b)P(I|b) ) P(T,I|noise) P(T,I|noise)
(2)
scorey )
P(T,I|y) P(T|y)P(I|y) ) P(T,I|noise) P(T,I|noise)
(3)
where P(T|b) ∼ N(µ,σ), P(T|y) ∼ N(µ,σ), P(I|b) ∼ exp(λb) and P(I|y) ∼ exp(λy). Here, µ ) 0 if the machine is calibrated. σ is the standard deviation of tolerance, and λb and λy are the mean of the normalized ranking of intensities of b- and y-ions respectively.
Figure 2. Normalized ranking of intensities of b - H2O ions follows an exponential distribution. Table 2. Parameters Used in MSNovo +1 spectra P(b) P(y) P(b2+) P(y2+) µ σ λb λy λb2+ λy2+ P(noise) resolution
+2 spectra
+3 spectra
0.77 0.70
0.61 0.63
0 0.10 5.11 5.38 0.002 0.1 m/z
0 0.10 3.61 5.03 0.002
0.43 0.44 0.46 0.50 0 0.10 2.87 3.39 2.93 3.14 0.003
All these parameters will be determined from the training data set. Table 2 lists all the parameters for singly charged, doubly charged, and triply charged spectra. We also simplify P(T,I|noise) ) P(noise), a constant that can be determined by the average of the lowest scores of signal peaks among all of the spectra in the training data set. We will discuss how to determine its approximate value later in the parameter tuning section. Likelihood Ratio Score. We denote the b- and y-ions in the hypothetical spectrum to be b1,...,bn and y1,...,yn, respectively, where n is the number of b- or y-ions in the hypothetical spectrum. Usually, n is equal to the number of peptide bonds and n + 1 is equal to the length of the peptide. For ion bi, we define the probability of an observation (Ti,Ii) to be
SBi ) P(Ti,Ii|bi)
(4)
where Ti and Ii represent the match tolerance and the normalized ranking of intensity of a mass peak from the spectrum S that matches with the ith b-ion, respectively. Similarly, the probability of observing Ti and Ii for an ion yi is
SYi ) P(Ti,Ii|yi)
(5)
When the peak corresponding to ion bi or ion yi is missing, we define SBi ) P(missing) ) 1 - P(bi) or SYi ) P(missing) ) 1 Analytical Chemistry, Vol. 79, No. 13, July 1, 2007
4873
P(yi). The log likelihood ratio of SBi and SYi are defined in the following.
{ {
SBi log , when ion is present P(noise) log(1 - P(bi)), when ion is absent
(6)
SYi log , when ion is present LSYi ) P(noise) log(1 - P(yi)), when ion is absent
(7)
LSBi )
In the case that both the b-ion and the y-ion are missing, the score should be log (1 - P(b)) + log (1 - P(y)). The probability of observing a spectrum S should be the probability of all peaks in this spectrum, as defined in the following: n
Pr(S|b1,...,bn,y1,...,yn) )
∏SB SY × P(noise) i
m-m1
i
(8)
i)1
where m is the number of peaks in spectrum S and m1 is the number of matched peaks. We consider the other m - m1 peaks that have no matches as noise peaks for the term P(noise)m-m1 in eq 8. Our goal is to maximize the log likelihood ratio of the probability that S is generated from peptide P over the probability that S is generated by random noise (i.e., all the m peaks in the spectrum are noise peaks, so the probability of this situation is P(noise)m). Hence, we need to maximize the final score:
Pr(S|b1,...,bn,y1,...,yn) Pr(S|P) log ) log Pr(S|noise) P(noise)m n
∏ SB SY × P(noise) i
) log
m-m1
i
i)1
P(noise)m1 × P(noise)m-m1 n
∏ SB SY i
) log
i
i)1
P(noise)m1
n
)
∑LSB + LSY i
i
(9)
i)1
(3) Dynamic Programming. One of the unique features of our algorithm is that we use a mass array-based dynamic programming algorithm instead of using mass peaks directly in a spectral graph, as previous dynamic programming algorithms do. The mass array is different in principle from the spectrum graph in that the mass array data structure explicitly encodes all possible peptides that have the given mass M and uses the spectrum S as the observation to find the best peptide, while the spectrum graph is constructed directly using peaks in S and finds the peptide that best fits S. Construction of the Mass Array. Given M and a predefined resolution e, the mass array can be constructed in the following two steps. Note that M is defined as the singly charged b-ion mass 4874
Analytical Chemistry, Vol. 79, No. 13, July 1, 2007
of the whole peptide, ∼19 Da (the mass of a water molecule plus one proton) less than the doubly charged precursor ion mass. First, we construct an array with M/e + 1 indices, starting from 0 to M/e. One can view our mass array data structure as a graph where the number of vertexes are the number of elements of the mass array in the same order as they occur in the array. For example, given M ) 1000 Da and e ) 0.1 Da, we can construct a mass array with 1000/0.1 + 1 ) 10 001 elements, starting from 0.0 to 1000.0. From the graph point of view, there are 10 001 vertexes in this graph. However, for the clearness of representation, we still refer to an index or a vertex using the actual mass throughout this paper. Second, for each pair of vertexes i and j, i < j, we define a directed edge (i,j) if and only if (1) i is equal to the mass of some b-ion or i ) 0 and (2) j - i is equal to the mass of one of the 20 amino acids. These two conditions immediately restrict edges to be defined over indexes (actually b-ions) whose masses are equal to some b-ions. Note that a b-ion mass is equal to the sum of the residue mass of every amino acid in the sequence plus a proton (1 Da). Therefore, we can prove that there is a one-to-one mapping between a peptide whose mass is equal to M and a path from 1 to M. The de novo sequencing is equivalent to finding the best path of which the corresponding peptide has the best score defined in eq 9. Note that we do not need to explicitly construct edges in the mass array. The edges will be implicitly constructed in the dynamic programming algorithm. The mass array data structure is different from both the spectrum graph-based methods and the Markov chain method (used in NovoHMM) in (1) that the mass array explicitly encodes all possible peptides with the same mass M, while the others only encode peptides that are fit into the given spectrum and (2) that the mass array defines both vertexes and edges using exact (or theoretical) masses while the others define vertexes and edges using observed masses in the spectrum. Using exact masses avoids a serious problem that happens frequently in other methods: a small mass error, typically up to 0.5 Da for a vertex and 1.0 Da for an edge in a spectrum graph, will be accumulated and usually lead to a much larger mass error for a path corresponding to a peptide. These kinds of errors will cause false predictions because the masses of several amino acids are within 1 Da of each other. Scoring a Path in the Mass Array. Let LSB[m] be the likelihood ratio score defined in eq 6 for a b-ion at mass m in the mass array and let LSY[m] be the likelihood ratio score defined in eq 7 for a y-ion at mass m. For a mass m or a position m in the mass array, we can calculate the score LSB[m] using the normal and exponential distributions described before, where the match tolerance and the normalized ranking of the matching peak are derived from the mass peak that is closest to m in spectrum S. Similarly, we can calculate the score LSY[m] by finding the mass peak that is closest to m in spectrum S. The goal of the algorithm is to identify a path P ) {P0 ) 1,P1,P2,...,Pn,Pn+1 ) M} in the mass array that maximizes n
∑(LSB[P ] + LSY[M - P + 19]) i
i
i)1
where 19 is for an extra water molecule plus an extra proton in the y-ion.
Dynamic Programming Algorithm. Define Score(i) to be the maximum score among all paths from 1 to i. The recursion of the function Score() is
Score(1) ) 0
(10)
Score(j) ) max (Score(i) + LSB[j] + j-i)aa
LSY[M - j + 19]) (11) where aa stands for the mass of one of the 20 amino acids. For each mass m in the mass array, its complementary mass M - m + 19 should be forbidden to appear in the same path because it would interpret m as both a b-ion and a y-ion, a very rare event in actual peptide sequences. After we find the best path along the mass array, we check whether there exist forbidden pairs in the path. If there is no forbidden pair in the path, then we trace back to reconstruct the best peptide. Otherwise we need to run the dynamic programming again by prohibiting each of these forbidden pairs. For example, if the best path contains a forbidden pair of m and M - m + 19. Then we need to run dynamic programming twice, excluding m or M - m + 19 from the path each time. So this process grows at 2k times if we encounter k forbidden pairs in the best path. We have observed that usually k ) 0 in practice. So this process does not usually affect the speed of our program. Dealing with Parent Ion Mass Errors. The accuracy of the parent ion mass is critical to the mass array data structure and the dynamic programming algorithm because it is used to calculate the mass of y-ions. One problem in the spectrum graphbased methods is that the precursor mass error tends to propagate throughout the steps in the dynamic programming algorithm leading to wrong solutions. In our algorithm, we allow a flexible range( (2.0 u) for the precursor ion mass in LCQ data. For this correction interval [M - 2.0,M + 2.0] and with a precision of 0.1 u, we execute 41 runs of the dynamic programming algorithm each with one parent ion mass in the range of M 2.0, M - 1.9 ,..., M + 2.0. After each run, we trace back from the precursor mass used at that run to identify the b-ion ladders to reconstruct the best peptide sequence. We then sort the 41 candidates to find the best peptide. (4) Triply Charged Spectra. For charge +3 spectra, the b2+and y2+-ions become dominant. On average, 46% of the b2+-ions and 51% of the y2+-ions are present, and they are higher than those of the b-ions (42%) and y-ions (44%). We need to consider these ions in the scoring function and as well as in the dynamic programming algorithm. Let LSB2[m] be the likelihood ratio score for a b2+-ion with mass m in the mass array and LSY2[m] be the likelihood ratio score of a y2+-ion with mass m. Using the parameters shown in Table 2, we can calcualte the score for the bi2+-ion and the score for the yi2+-ion using the following equations:
LSB2i ) log
P(Ti,Ii|bi2+) P(noise)
P(Ti,Ii|yi2+)
LSY2i ) log
P(noise)
(12)
(13)
If the mass of a b-ion is m, we can derive its corresponding b2+ion mass as (m + 1)/2 and y2+-ion mass as (M - m + 20)/2.
Figure 3. Comparison of the accuracy of MSNovo against that of PepNovo and NovoHMM as a function of peptide length.
Then, we integrate them into the dynamic programming algorithm using the following recursion.
(
Score[j] ) max Score[i] + LSB[j] + LSY[M - j] + j-i)aa
LSB2
[j+12] + LSY2[M - 2j + 20]) (14)
(5) LTQ-FT Spectra. The mass resolution of the linear ion trap-Fourier transformation mass spectrometer (LTQ-FT) is close to one part per million (ppm). The LTQ data we used in this paper has a resolution bettern than 0.1 Da in measuring the parent ion mass. Thus, we construct a the mass array with a high resolution of 0.01 Da. There are two modes of acquiring LTQ-FT spectra: (1) the profile mode and (2) the centroid mode. In the profile mode, the mass analyzer scans at a resolution of 0.1 Da and reads an intensity value. In the centroid mode, the mass analyzer reports only the centroid of the profile data. The LTQ-FT data24 we used were captured in the profile mode, which cannot be directly used by de novo sequencing programs. We converted the profile mode spectra into centroid data by using a simple weighted averaging scheme. We calculated the weighted average of m/z value of each bin of each 1 Da, using the intensity value of each profile peak as the weight. This preprocessing step generates a MS/MS spectrum similar to that from ordinary ion trap machines. We then run MSNovo on this weighted data. RESULTS The parameters for MSNovo are listed in Table 2. For each spectrum, MSNovo reports the top 20 peptide candidates from the results of the multiple runs of the dynamic programming algorithm using different precursor ion masses, because the correct peptide may not always be the peptide ranked on the top so we also seek useful information from the rest of the candidates. (1) Effect of Different Peptide Lengths. First we compare the performance of MSNovo with PepNovo v1.01 and NovoHMM on spectra that are generated from peptides of different lengths. We define a prediction to be correct when the predicted residues Analytical Chemistry, Vol. 79, No. 13, July 1, 2007
4875
Table 3. Percentage of Correctly Predicted Sequence Tags of Length at Least x (Data Set Used OPD280) algorithm
x)3
x)4
x)5
x)6
x)7
x)8
x)9
x ) 10
MSNovo NovoHMM PepNovo Sherenga PEAKS Lutefisk
0.957 0.893 0.946 0.821 0.889 0.661
0.918 0.796 0.871 0.711 0.814 0.521
0.829 0.711 0.800 0.564 0.689 0.425
0.735 0.589 0.654 0.364 0.575 0.339
0.668 0.486 0.525 0.279 0.482 0.268
0.564 0.404 0.411 0.207 0.371 0.200
0.442 0.293 0.271 0.121 0.275 0.104
0.314 0.193 0.193 0.071 0.179 0.057
Table 4. Comparison of MSNovo with PepNovo and NovoHMM on ISB769, OPD280, HUPO513 (from PeptideAtlas), LTQ600, and ISb646(Charge +3 Spectra) Data Sets Num.Pepa
Av.Lenb
correct Pep c
MSNovo MSNovoh PepNovo NovoHMM
769 769 769 769
11.7 11.7 11.7 11.7
373 503 252 269
MSNovo MSNovoh PepNovo NovoHMM
280 280 280 280
10.5 10.5 10.5 10.5
MSNovo MSNovoh PepNovo NovoHMM
513 513 513 513
MSNovo MSNovoh MSNovo MSNovoh NovoHMM
accuracyd (%)
Av.Pre.Lene
precisionf (%)
recallg (%)
48.50 65.41 32.77 34.98
10.4 10.2 10.6 11.3
86.85 91.96 82.67 79.21
77.36 80.59 75.36 77.13
141 181 99 102
50.36 50 35.36 36.43
9.7 9.8 9.7 10.4
85.13 87.63 81.66 79.52
78.98 82.31 75.84h 78.40
14.1 14.1 14.1 14.1
106 192 50 80
HUPO513 20.66 37.43 9.75 15.59
12.1 12.0 12.1 13.6
77.61 83.58 70.96 73.92
66.58 71.34 61.24 71.78
600 600
12.3 12.3
254 298
9.6 9.6
81.16 86.89
68.77 73.83
646 646 646
17.5 17.5 17.5
12.3 12.7 14.7
66.85 78.27 50.39
47.08 56.87 40.43
ISB769
OPD280
LTQ600 42.33 49.67
ISB646(Charge +3 Spectra) 79 12.23 99 20.28 0 0
a Num.Pep is the total number of peptides in the data set. b Av.Len is the average length of the peptides. c Correct pep is the number of correctly predicted peptides. d Accuracy is the percentage of correctly predicted peptides. e Av.Pre.Len is the average predicted length of the peptides. f Precision is defined as the ratio of correct aa and predicted aa. g Recall is defined as the ratio of correct aa and total aa. h Search for top 20 ranking predicted candidates.
Table 5. Effect of Adding Neutral Loss Ions into MSNovoa ion combinations b,y-ion b,y-ions, b,y - H2O ions b,y-ions, b,y - NH3 ions b,y-ions, a-ions b,y-ions, b,y - H2O and b,y - NH3 ions b,y-ions, a-ions and b,y - H2O ions b,y-ions, a-ions and b,y - NH3 ions b,y-ions, a-ions, b,y - H2O ions and b,y - NH3 ions a
correct peptides (fraction, %)
correct amino acids (fraction, %)
predicted amino acids
674(27.5) 824(33.7) 823(33.6) 889(36.3) 1028(42) 1037(42.4) 1049(42.9) 1100(44.9)
16 932(55.6) 15 183(49.8) 14 170(46.5) 13 889(45.6) 12 506(41) 12 537(41.1) 10 930(35.9) 10 776(35.4)
23 917 21 029 19 991 19 471 17 443 17 324 15 499 15 594
This LTQ Data set has a total of 2448 peptides and 30 472 amino acids.
are correct and at the same position as in the case of the true peptide. That is to say, it may not contain all the residues of the original peptide, but the content is accurate. The accuracy(pep%) is defined as the percentage of correctly predicted peptides. We ordered all of the 1639 SEQUEST-annotated spectra from the ISB data set according to the lengths of the peptides, from 7 to 29. Then we compared the accuracy of the three programs in each group, the results are shown in Figure 3. We observe from Figure 3 that the accuracy of MSNovo is the highest for peptides 4876
Analytical Chemistry, Vol. 79, No. 13, July 1, 2007
shorter than 19 amino acids, and as the length of the peptides increases beyond 23, PepNovo begins to perform better. NovoHMM does not perform well for long peptides because it tries to predict the whole length of the peptide. In general, all programs perform substantially poorer in spectra of long peptides because the quality of these spectra is not as good as that of spectra of shorter peptides. For example, for the spectrum from the peptide YGDFGTAAQQPDGLAVVGVFLK, MSNovo predicts [711.1]AKKPDNGAVVRIFK, of which nine amino acids are correct.
Figure 4. Peptide prediction accuracy, the precision and the recall of MSNovo prediction results using different values of mass resolution. Note the jump is ∼0.5 Da.
PepNovo predicts QPD, starting at 910.52, and is considered as a correct prediction by the above definition. NovoHMM predicts EEFRWYKRFKYGRFIK, of which six amino acids are correct. Table 3 shows the frequency of correctly predicted subsequences (or sequence tags) of length at least x, the same criteria as used by PepNovo and NovoHMM. MSNovo is clearly the best in every category among all programs. (2) Charge +1 and +2 Spectra. We compared the precision and the recall of the three programs, MSNovo, PepNovo, and NovoHMM, on charge +1/+2 MS/MS spectra from four data sets of ISB769, OPD280, HUPO513, and LTQ600. The results are shown in Table 4. The precision is defined as the ratio of correctly predicted residues over the total number of predicted residues, and the recall is defined as the ratio of the correctly predicted residues over the total number of residues in true peptides. For a fair comparison, we also include the average predicted length of the peptides. For the three ion trap data (ISB, OPD, PeptideAtlas), the average predicted length of MSNovo is similar to that of the other two, but the precision and the recall of MSNovo are better. For LTQ data, the b1- and b2-ions are usually missing. In such a case, we cannot predict the first two amino acids of the peptide so we leave a mass gap at the beginning of the peptide. Hence the average predicted length for the LTQ600 data is two amino acids less than the average length of the peptides. The prediction accuracy for the LTQ data is substantially higher than that for the ion trap data. Note that our comparison of PepNovo and NovoHMM is preliminary as these tools were probably not cognizant of the low errors in precursor ion masses. NovoHMM works poorly on the LTQ data because it always overpredicted the whole peptides. The results clearly show that MSNovo performs better for data with better quality. (3) Triply Charged Spectra. We compare the performance of MSNovo and NovoHMM on charge +3 spectra using ISB646. The results are shown in Table 4. MSNovo predicts 76 correct peptides compared to 0 by NovoHMM. If we consider the top 20 predicted peptides for each spectrum, MSNovo had 99 correct
Figure 5. Peptide prediction accuracy, the precision and the recall of MSNovo prediction results using different values of P(noise).
predictions. Table 4 clearly shows that MSNovo outperforms NovoHMM in almost every category. (4) Adding Neutral Loss Ions. For high-quality LTQ MS/ MS spectra, neutral loss ions appear with a high probability. We added the information of neutral loss ions in our algorithm in order to improve the prediction accuracy. We ran MSNovo on the LTQ2448 data set, adding different combinations of neutral loss ions in the dynamic programming algorithm. Table 5 shows the results of adding neutral loss ions. For each case, we kept the top ranking peptides with p-value of