Anal. Chem. 2004, 76, 1664-1671
Statistical Models for Protein Validation Using Tandem Mass Spectral Data and Protein Amino Acid Sequence Databases Rovshan G. Sadygov,† Hongbin Liu,† and John R. Yates*
Department of Cell Biology, The Scripps Research Institute, La Jolla, California 92037
The purpose of this work is to develop and verify statistical models for protein identification using peptide identifications derived from the results of tandem mass spectral database searches. Recently we have presented a probabilistic model for peptide identification that uses hypergeometric distribution to approximate fragment ion matches of database peptide sequences to experimental tandem mass spectra. Here we apply statistical models to the database search results to validate protein identifications. For this we formulate the protein identification problem in terms of two independent models, twohypothesis binomial and multinomial models, which use the hypergeometric probabilities and cross-correlation scores, respectively. Each database search result is assumed to be a probabilistic event. The Bernoulli event has two outcomes: a protein is either identified or not. The probability of identifying a protein at each Bernoulli event is determined from relative length of the protein in the database (the null hypothesis) or the hypergeometric probability scores of the protein’s peptides (the alternative hypothesis). We then calculate the binomial probability that the protein will be observed a certain number of times (number of database matches to its peptides) given the size of the data set (number of spectra) and the probability of protein identification at each Bernoulli event. The ratio of the probabilities from these two hypotheses (maximum likelihood ratio) is used as a test statistic to discriminate between true and false identifications. The significance and confidence levels of protein identifications are calculated from the model distributions. The multinomial model combines the database search results and generates an observed frequency distribution of cross-correlation scores (grouped into bins) between experimental spectra and identified amino acid sequences. The frequency distribution is used to generate p-value probabilities of each score bin. The probabilities are then normalized with respect to score bins to generate normalized probabilities of all score bins. A protein identification probability is the multinomial probability of observing the given set of peptide scores. To reduce the effect of random matches, we employ a marginalized multinomial model * Corresponding author. Phone: (858) 784-8862. Fax: (858) 784-8883. E-mail:
[email protected]. † These two authors contributed equally to this work.
1664 Analytical Chemistry, Vol. 76, No. 6, March 15, 2004
for small values of cross-correlation scores. We demonstrate that the combination of the two independent methods provides a useful tool for protein identification from results of database search using tandem mass spectra. A receiver operating characteristic curve demonstrates the sensitivity and accuracy level of the approach. The shortcomings of the models are related to the cases when protein assignment is based on unusual peptide fragmentation patterns that dominate over the model encoded in the peptide identification process. We have implemented the approach in a program called PROT_PROBE. Proteomics involves systematic studies of protein expression, protein-protein interactions, protein localization, and protein modifications exerting regulatory control over processes.1-3 A widely used tool for proteomic studies is the tandem mass spectrometer, a device that can be used for high-throughput identification of proteins.4-6 In this process, proteins are first digested and the resulting peptides are isolated one by one in the tandem mass spectrometer, fragmented using collisionactivated dissociation, and the fragment ions measured.7 In combination with liquid chromatography complex mixtures of peptides can be analyzed and then identified using database searching methods.8 To increase the dynamic range of the process and improve the ability to identify the components of complex mixtures, multidimensional liquid chromatography can be used (1) Gavin, A. C.; Bosche, M.; Krause, R.; Grandi, P.; Marzioch, M.; Bauer, A.; Schultz, J.; Rick, J. M.; Michon, A. M.; Cruciat, C. M.; Remor, M.; Hofert, C.; Schelder, M.; Brajenovic, M.; Ruffner, H.; Merino, A.; Klein, K.; Hudak, M.; Dickson, D.; Rudi, T.; Gnau, V.; Bauch, A.; Bastuck, S.; Huhse, B.; Leutwein, C.; Heurtier, M. A.; Copley, R. R.; Edelmann, A.; Querfurth, E.; Rybin, V.; Drewes, G.; Raida, M.; Bouwmeester, T.; Bork, P.; Seraphin, B.; Kuster, B.; Neubauer, G.; Superti-Furga, G. Nature 2002, 415 (6868), 141147. (2) Schirmer, E. C.; Florens, L.; Guan, T.; Yates, J. R., 3rd; Gerace, L. Science 2003, 301 (5638), 1380-1382. (3) MacCoss, M. J.; McDonald, H. W.; Saraf, A.; Sadygov, R.; Clark, J. M.; Tasto, J. J.; Gould, K. L.; Wolters, D.; Washburn, M.; Weiss, A.; Clark, J. I.; Yates. J. R., III. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 7900-7905. (4) Lin, D., Tabb; D. L.; Yates, J. R., 3rd. Biochim. Biophys. Acta 2003, 1646 (1-2), 1-10. (5) Yates, J. R., I. Trends Genet. 2000, 16 (1), 5-8. (6) Yates, J. R., III. Electrophoresis 1998, 19, 893-900. (7) Hunt, D. F.; Yates, J. R., 3rd; Shabanowitz, J.; Winston, S.; Hauer, C. R. Proc. Natl. Acad. Sci. U.S.A. 1986, 83 (17), 6233-6237. (8) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989. 10.1021/ac035112y CCC: $27.50
© 2004 American Chemical Society Published on Web 02/17/2004
to separate peptide mixtures.9-11 This approach has been used to identify proteins in complexes, proteins localized to subcellular structures, proteins in cells, and proteins in tissues.2,9,10,12-14 The process of automated MS/MS data collection creates very large data sets. To interpret these data, several algorithms have been developed to search protein sequence databases.8,15-24 In principle, if the peptide identification process was absolute and/ or if it was possible to uniquely separate true and false positives, and sequence databases were truly accurate, then each identified peptide would indicate the presence of the corresponding protein. These conditions are difficult to achieve; thus, scoring measures are used to discriminate true and false positives.15 Inaccurate peptide matches can lead to incorrectly identified proteins; therefore, manual validation is often used to substantiate matches. Computer algorithms have been developed to facilitate protein assignment from database search results of peptide tandem mass spectra. These algorithms can be classified into two categories. The first category encompasses programs such as SEQUEST_SUMMARY,25 DTASelect,26 INTERACT,27 and CHOMPER28 that use empirically determined scoring threshold values to filter correct peptide identifications and then sort peptides by protein identity. The choice of threshold parameters is empirically determined depending on the program used for the database search. For example, if peptides have been identified using the SEQUEST algorithm, the two most widely used threshold parameters are cross-correlation score and deltaCn (a normalized difference of cross-correlation scores between the first- and second-best matching peptides). The above computer programs also filter random or incorrect protein identifications by requiring a minimum number of peptide matches with scores above a threshold value. The choice of a threshold value can be subjective because the minimum score or number of required matches for (9) Link, A. J.; Eng, J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., 3rd. Nat. Biotechnol. 1999, 17 (7), 676-682. (10) Washburn, M. P.; Wolters, D.; Yates, J. R., 3rd. Nat. Biotechnol. 2001, 19 (3), 242-247. (11) Wolters, D. A.; Washburn M. P.; Yates, J. R. 3rd. Anal. Chem. 2001, 73 (23), 5683-5690. (12) Link, A. J.; Hays, L. G.; Carmack, E. B.; Yates, J. R., 3rd. Electrophoresis 1997, 18 (8), 1314-1334. (13) Wu, C. C.; MacCoss, M. J.; Howell, K. E.; Yates, J. R. Nat. Biotechnol. 2003, 21, 532-538. (14) Bell, A. W.; Ward, M. A.; Blackstock, W. P.; Freeman, H. N.; Choudhary, J. S.; Lewis, A. P.; Chotai, D.; Fazel, A.; Gushue, J. N.; Paiement, J.; Palcy, S.; Chevet, E.; Lafreniere-Roula, M.; Solari, R.; Thomas, D. Y.; Rowley, A.; Bergeron, J. J. J. Biol. Chem. 2001, 276 (7), 5152-5165. (15) Sadygov, R. G.; Yates, J. R. Anal. Chem. 2003, 75 (15), 3792-3798. (16) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567. (17) Havilo, M.; Haddad Y.; Smilansky, Z. Anal. Chem. 2003, 75 (3), 435-444. (18) Fenyo, D.; Beavis, R. C. Anal. Chem. 2003, 75 (4), 768-774. (19) Bafna, V.; Edwards, N. Bioinformatics 2001, 17, S13-S21. (20) Zhang, W.; Chait, B. T. Anal. Chem. 2000, 72, 2482-2489. (21) Zhang, N.; Aebersold, R.; Schwikowski, B. Proteomics 2002, 2, 1406-1412. (22) Mann, M.; Wilm, M. Anal. Chem. 1994, 66, 4390-4399. (23) Hansen, B. T.; Jones, J. A.; Mason, D. E.; Liebler, D. C. Anal. Chem. 2001, 73, 1676-1683. (24) Clauser, K. R.; Baker, P.; Burlingame, A. L. Anal. Chem. 1999, 71, 28712882. (25) Tabb, D. L.; Eng, J.; Yates, J. R., III. In Proteome Research: Mass Spectrometry; James, P., Ed.; Springer: New York, 2001; pp 125-142. (26) Tabb, D. L.; McDonakd, W. H.; Yates, I. J. R., J. Proteome Res. 2002, 1, 21-26. (27) Han, D. K.; Eng, J.; Zhou, H.; Aebersold, R. Nat. Biotechnol. 2001, 19 (10), 946-951. (28) Eddes, J. S.; Kapp, E. A.; Frecklington, D. F.; Connolly, L. M.; Layton, M. J.; Moritz, R. L.; Simpson, R. J. Proteomics 2002, 2, 1097-1103.
high confidence levels can change depending on the size of the data set and database. Additionally, determining the impact of data set or database size on search results is often empirical. In the second category are statistical tools used to validate protein identifications based on a statistical analysis of a peptide’s score and other search parameters. Qscore applies a statistical algorithm resembling an approximation to a binomial distribution and uses such parameters as protein size, peptide match quality, number of peptide matches to a protein, and size of spectral data set.29 The algorithm was presented specifically in the context of database search results from SEQUEST, but probably can be extended to search results from other database search programs. ProteinProphet30 is based on a more general approach that can be used with any database search results analyzed by the algorithm PeptideProphet.31 PeptideProphet uses an expectation maximization method to generate probabilities. Cutoff values are determined empirically by searching a model data set through a database. The main estimate of protein probability used in ProteinProphet is the same as the one described by MacCoss et al.32 The formula used is an empirical estimate and does not have proper coefficients to handle asymptotic cases. Therefore, as the number of peptide matches to a specific protein increases the probability that this protein is a true match will approach 1 no matter the quality of peptide matches to spectra (assuming they pass a low cutoff value). ProteinProphet also uses a concept called “number of sibling peptides” (NSP) for peptide matches to a protein that has other matching peptides. An NSP is high when numerous peptides match the same protein. It was shown that inclusion of NSP into the original Bayesian model (based on data subsets such as protease cleavage specificity, cross-correlation scores, deltaCn, etc.) significantly improves the identification results. No significance to values used in protein assignments was reported, and the explicit dependence of protein assignment on the size of mass spectral data sets and databases is not clear. Fenyo and Beavis18 calculated the expectation value for a protein by considering only peptides that pass a preset threshold value for a survival function. It was suggested that such assignments are rare and their stochastic occurrence is governed by the Poisson distribution. The formula for protein expectation is equal to the product of the square root of the number of spectra in the data set and expectation values of all peptides of the protein. Emili and co-workers33 used a randomized protein database by the addition of inverted amino acid sequences to assess false positive rates. A probability space was generated based on the SEQUEST cross-correlation scores for peptides from a model data set. The protein probability was assigned to be the maximum probability of its constituent peptides. In this paper, we apply statistical models to generate protein identification scores from database search results of tandem mass spectra. We show how one can generalize probabilities (in the (29) Moore, R. E.; Y. M. K.; Lee, T. D. J. Am. Mass Spectrom. 2002, 13, 378386. (30) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. Anal. Chem. 2003, 75, 4646-4658. (31) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383-5392. (32) MacCoss, M. J.; Wu, C. C.; Yates, I. J. R. Anal. Chem. 2002, 74, 55935599. (33) Kislinger, T.; Rahman, K.; Radulovic, D.; Cox, B.; Rossant, J.; Emili, A. Mol. Cell Proteomics 2003, 2, 96-106.
Analytical Chemistry, Vol. 76, No. 6, March 15, 2004
1665
case where appropriate probabilities for database matches are available) to generate a binomial distribution. For the more general case, we develop a multinomial model that can be used with any database search results. The models do not make assumptions about proteolytic specificity nor do they use a priori generated empirical probabilities. All scores and confidence values are determined from the models directly and depend on data set and database size. The approach has been implemented in a program called PROT_PROBE written in Java and is applied to two data sets with different characteristics to demonstrate performance. MATERIALS AND METHODS Materials. A micro BCA reagent kit was obtained from Pierce (Rockford, IL). SDS-PAGE molecular weight standards (low range) were obtained from Bio-Rad (Hercules, CA). Poroszyme immobilized trypsin was obtained from Perseptive Biosystems (Framingham, MA). Endoproteinase Lys-C was obtained from Roche Diagnostics (Indianapolis, IN). Other standard laboratory chemicals were obtained from Sigma (St. Louis, MO). The chromosomal protein isolate was purified from Chinese hamster ovary cells following a protocol and generously provided by Dr. Aaron Van Hooser at University of California at Berkeley.34 Yeast strain BI5460 was grown to log phase (OD 0.9) in YPD at 30 °C, then harvested, and washed by DDI water. Yeast cells were lysed by grinding a frozen cell pellet under liquid nitrogen. Proteins were extracted using digestion buffer (100 mM NH4HCO3, 8 M urea, pH 8.5), and the soluble fraction was obtained by centrifuging total cell lysates at 21000g. MS/MS Data Collection and Pep_Probe Database Search. A total of 400 µg of a protein mixture that contains 399 µg of protein from the soluble fraction of a yeast cell lysate and 1 µg of the SDS-PAGE low molecular weight standards (phosphorylase b, serum albumin, ovalbumin, carbonic anhydrase, trypsin inhibitor, lysozyme) (yeast+6Prot) was sequentially digested with endoproteinase Lys-C and trypsin using the method of Link et al.9 Approximately 70 µg of the digested protein mixture was loaded on to a two-dimensional liquid chromatography (2DLC) capillary column and washed with a buffer that contains 5% acetonitrile (J. T. Baker, HPLC grade), 0.1% formic acid, and 95% DDI water. 2DLC separation and tandem mass spectrometry conditions were used as described.9,10 Three separate 12-step MudPIT experiments were performed using the digested yeast+6Prot mixture to generate a tandem mass spectrometry data set. A protein mixture isolated from mammalian chromosomes was solublized in 100 mM NH4HCO3, 8 M urea, pH 8.5, and then digested using the method of Link et al.9 The digested protein mixture was loaded on to a 2DLC capillary column and followed by a six-step MudPIT experiment to generate an MS/MS data set. Both data sets were analyzed using the PEP_PROBE algorithm. The yeast+6Prot data set was searched against the yeast protein sequence database (ORF3, downloaded from www.yeastgenome.org) to which were added the six additional protein sequences and the sequences of common protein contaminants (e.g., keratin, trypsin). The number of tandem mass spectra in the yeast+6Prot data set is ∼220 000 while the number of entries in the yeast sequence database is ∼6300. The data set generated (34) Ouspenski, I. I.; Brinkley, B. R. J. Cell Sci. 1993, 105, 359-367.
1666
Analytical Chemistry, Vol. 76, No. 6, March 15, 2004
from the mammalian chromosomal protein mixture had ∼26 000 spectra and was searched against a composite database consisting of human-mouse-rat protein sequences (from NCBI downloaded on 04.24.03) containing 106 000 protein entries. Probability Models. In general, only 10-20% of all database search results lead to true peptide identifications. The rest of the database search results correspond to random assignment of tandem mass spectra to database amino acid sequences. When the database search results are grouped together to produce protein identifications, many random protein assignments are expected to arise. Therefore, it is desirable to model the protein assignment process by a probabilistic model and attach a probability to each protein assignment. We use two independent probability models, two-hypothesis binomial and multinomial models, to assess protein identification results generated by tandem mass spectral database searching. The multinomial model as it is formulated can be used with any scoring measure used to identify peptides while the binomial model is specifically tailored for peptide identification results from PEP_PROBE (where the measure of peptide identification is the hypergeometric probability of the randomness of a match). Binomial Model. In this approach, we assume that protein identification by a collection of spectra follows a binomial model. Each database search result is a random Bernoulli event. The trial has two outcomessa protein is either identified or not. Two binomial distributions are determined to be important for protein assignment. The probability of the protein identification at each Bernoulli event is determined either from the relative length of the protein in the database (null hypothesis, H0) or from the hypergeometric probabilities of peptides (an alternative hypothesis, H1). By comparing the two distributions, the one that the protein belongs to is determined. H0 states that the distribution of peptide matches to a protein is governed by the binomial distribution determined from the database. An alternative hypothesis, H1, states that the peptide match distribution is governed by the hypergeometric probabilities determined by PEP_PROBE results. We define a score for protein assignment as the natural logarithm of a maximum likelihood ratio, the ratio of H1 to H0. The procedure is described below. In the alternative hypothesis, the probabilities of all peptides belonging to the same protein are combined to produce a data set probability for the protein identification. If we have N spectra in the data set and K matches (each with probability pi) to a protein A, then the probability of all combinations leading to K matches (with specified probabilities) is estimated as
P1 )
K
N!
∏p
(1 - P(A))N-K
K!(N - K)!
i
(1)
i)1
where P(A) is
x∏ K
P(A) )
K
pi
i)1
The model can be interpreted as a Bernoulli trial where the probability to match a protein A is P(A), and the probability to
miss this protein is (1 - P(A)). The formula (1) contains information about the number of peptide matches (K) to the protein, quality of matches (pi’s through P(A)), and number of spectra in the data set, N. The probabilities of all proteins used in the above formula are normalized. In the alternative hypothesis, the probability, pr(A), of a random match to a protein A is determined as the ratio of the number of amino acids of A to the total number of amino acids in the database. The binomial probability of K matches in N trials is
N! P (A)K(1 - Pr(A))N-K P0 ) K!(N - K)! r
r
K!
P)
r
∏k !
∏p
ki
(4)
i
i)1
i
i)1
(2)
The maximum likelihood ratio, LR, is the ratio of (1) to (2). It is used to elucidate which distribution (H0 or H1) protein A belongs to.
LR ) P1/P0
this multinomial model. Assume a protein A has K peptide matches and these matches, {ki}ri)1, are distributed (drawn from) in r score bins with probabilities {pi}ri)1. This notation means that ki peptides with the pi probability match the protein A, K ) r ∑i)1 ki. Then as a probability for the protein A we use the multinomial probability of the event {ki}ri)1 in K tries:
(3)
To each protein we assign -log(LR) as a score and sort the proteins based on this score. As will be seen below, the use of a likelihood ratio (3) has one major advantage. Normally protein identification probability will increase with the addition of a peptide match no matter how low the quality of the match. However, when the above likelihood ratio is used, the addition of a low-quality match to a group of high-quality matches reduces the overall likelihood of protein identification. Multinomial Model. The above model assumes that prior peptide probabilities have been calculated using PEP_PROBE. It is also useful to have a more general approach that does not depend on the method of peptide identification. As implemented, the hypergeometric probability analysis only uses information on the number of matching fragments. Other search programs such as SEQUEST use a different mathematical method to assess closeness of fit between peptide sequence and tandem mass spectrum, and thus, a priori probabilities are not available. A multinomial model can be used to determine the protein identification probability without the use of a priori peptide probabilities. A multinomial distribution of database searching scores can be constructed for any data set in the following way. After a database search, we obtain peptide identification scores, which in the case of a SEQUEST search are cross-correlation scores. The database search scores of peptides are grouped into bins, and a nonnormalized probability is assigned to each bin. This probability is the portion of the spectra that have the same or better score than the score corresponding to the bin. Bins containing large scores will have small probabilities while bins containing small scores will have large probabilities. It should be noted that the generation of probabilities is based on the observation that about 80-90% of all spectra do not contribute to the identification of peptides or proteins because their scores are low. The probabilities of the database search scores are then normalized with respect to the score bins (the sum of all probabilities over all score bins is set equal to 1). Each score bin has a probability. An exclusive outcome of each multinomial event is the observation of a specific score bin, and the corresponding probability is the probability of this score bin. Next, we calculate a probability of each protein match occurring by random from
Marginal Distribution Model. Many peptide sequence matches occur by random in a database search. At present, however, there is not a reliable way of separating random matches from poor quality but true matches. For this reason, we opt for a multinomial model that takes into account all matches to a protein, even though not all are on equal footing. One such model is the marginalized multinomial distribution.35 In this model, contributions of all events that have probabilities less than a certain cutoff value are summarized. The probability of marginal distribution is r
∑k )!
( P)
i
i)1
r
f
∏
∏k !m!
r
piki(1 -
∑ p)
m
i
(5)
i)f+1
i)1
i
i)1
where f is the number of score bins above the threshold value and m is the number of peptide matches that have scores less than a cutoff score. It is assumed that these peptides are grouped to start from f + 1. The marginal probability is strongly dependent on the quality of matches of peptides that pass the cutoff probability score and at the same time summarizes the low-quality matches. Database Search Methods. In PROT_PROBE, we use two different scores of peptide identification from database search results. The binomial probability model is implemented using the hypergeometric probabilities,15 while the multinomial model uses cross-correlation scores.8 Here we also apply the multinomial model to database search results using a slight modification to cross-correlation scores. In PEP_PROBE, we have implemented a modification scheme that normalizes cross-correlation scores against the experimental and theoretical spectra. The corresponding, normalized cross-correlation score (ET) is
ET )
XCorr
x(E*E)(T*T)
(6)
where (E*E) is the autocorrelation of the experimental spectrum, (T*T) is the autocorrelation of theoretical spectrum, and Xcorr is the cross-correlation of theoretical and experimental spectra. RESULTS AND DISCUSSION Two statistical models have been developed and tested using two tandem mass spectral data sets with different characteristics. (35) Ewens, W. J.; Grant, G. R. Statistical Methods. In Bioinformatics. Statistics for Biology and Health; Springer-Verlag: New York, 2002.
Analytical Chemistry, Vol. 76, No. 6, March 15, 2004
1667
Figure 1. Distribution of the hypergeometric probability scores of yeast+6Port data set. The global modes of all charge-state distributions are close.
The first MS/MS data set is obtained from the yeast+6Prot protein digest. The ratio of tandem mass spectra to database entries is 35. The second MS/MS data set was obtained from a digest of a chromosomal protein isolate from Chinese hamster ovary cells. The tandem mass spectra to database entry ratio in the data set is ∼0.25. In the first data set, most of the randomness is expected to originate from the tandem mass spectra, while in the second data set, most of the randomness is expected from the database. The latter can be dealt with effectively while the former will introduce more randomness to the process of protein identification. Both issues can be effectively minimized using the described statistical methods. PROT_PROBE starts protein identification first by analyzing database search results from peptide matches of tandem mass spectra. For the binomial model, it uses probabilities of peptide identifications calculated from the hypergeometric model. The distribution of hypergeometric probability scores is shown for all three charge states from the yeast+6Prot data set in Figure 1 (the distribution for the chromosomal protein complex is similar and is not shown). The shapes of the distributions are such that their global modes are very close and it is possible to use a uniform (for all charge states) cutoff value to filter out poor-quality matches. This helps to reduce the effects of poor-quality spectra (or purely “junk” spectra) on the protein assignment. The advantage of the probability distribution shown in Figure 1 is that all spectra are treated in the same way regardless of charge state or precursor peptide masses. The uniform probability cutoff value used in this study is 4.25. This value is only slightly larger than the average probability score, 4.1, of the yeast+6Prot data set. The other set of the numerical data that PROT_PROBE uses are the cross-correlation scores. Figure 2 shows the distribution of cross-correlation scores (Figure 2A), ET normalized crosscorrelation scores (Figure 2B) for the yeast+6Prot data set and cross-correlation scores for the chromosomal protein complex (Figure 2C). The main differences between the data sets are their sizes (there are ∼10 times more spectra in yeast+6Prot) and databases against which the spectra were searched (chromosomal protein complex was searched against human-mouse-rat database which has ∼80 times more entries). The distributions for 1668 Analytical Chemistry, Vol. 76, No. 6, March 15, 2004
Figure 2. (A) Distribution of cross-correlation scores for yeast+6Prot data set. Note that we do not fit or smooth any of the data presented in the figures of this paper. (B) ET normalized (described in the text) cross-correlation scores of yeast+6Prot data set. (C) Distribution of cross-correlation scores for chromosomal protein complex. Compared to similar data for yeast+6Prot, the data are relatively noisy. It has to do with the fact that there are almost 10 times more spectra in yeast+6Prot compared to chromosomal protein complex data.
the chromosomal protein isolate data set (Figure 2C) look “nosier” than the corresponding distribution for the yeast+6Prot data set,
Figure 2A, due to the difference in the number of spectra in the data sets. More importantly, the +2 and +3 charge-state spectra distributions (where the majority of identifications originate from) from the two data sets peak at approximately the same values. The +1 charge-state spectra of the chromosomal protein complex have an additional, spurious peak at a lower cross-correlation score. A reason for this could be that salt clusters or ions associated with other chemical noise were characterized as peptide signals with +1 charge state. For both data sets, we used the same threshold cross-correlation values to minimize the effects of lowquality matches on protein identification. As was discussed in the Materials and Methods section, contributions from peptides with scores less than a certain threshold value were summarized in the marginalized multinomial distribution. The threshold values for peptides in this study are 2.0, 2.5, and 3.8 for +1, +2, and +3 charge-state spectra, respectively. The same threshold values were used in DTASelect program filters to compare results obtained from the statistical models. Additional parameters specified in DTASelect were a threshold value for deltaCn of 0.08, a minimum of two unique peptides per protein or more than four copies of a single peptide, and the requirement that peptides show half or full trypsin cleavage specificity. PROT_PROBE assigns an empirical p-value to each crosscorrelation score obtained from the score distributions (Figure 2) as the score probability (note: p-values can be generated not only from cross-correlation scores but also from any scores that measure the closeness of fit between spectra and amino acid sequences). The resulting probabilities are normalized with respect to all score bins and their distributions are presented in Figure 3. Given the peptide probability distribution and assuming a multinomial model, the probability that protein identification is correct can be determined. In the marginal multinomial distribution for cross-correlation scores, the same threshold values used in DTASelect are used for the marginal matches in the multinomial model. The distributions of the ET normalized cross-correlation scores have not been extensively studied; therefore, we used a putative cutoff value of 0.3 for all three charge states as the threshold value in the marginal multinomial distribution model. No requirements on the enzyme specificity are made in PROT_PROBE. The distribution of PROT_PROBE scores for proteins is closely related to the number and quality of peptide matches. Normally the scores from the two methods correlate and proteins with scores above 20 are true identifications in our data sets. There are 1050 loci identified by DTASelect using the threshold values indicated above. We identify 100 more loci using PROT_PROBE. Receiver operating characteristic (ROC) curves are shown in Figure 4 for cross-correlation scores and normalized crosscorrelation scores for the yeast+6Prot data set. The ROC curve is a plot of true positives versus false positives and is indicative of a methods’ performance. The best performance of the multinomial model is observed for cross-correlation scores. The use of cross-correlation scores for peptide identification has been extensively studied and optimized, a factor resulting in the better performance. The corresponding ROC curves indicate that threshold cutoff values used in this work for normalized crosscorrelations values may not be as well optimized as the threshold values for cross-correlation scores. ET normalization has not been
Figure 3. (A) Probabilities generated from original cross-correlation score distributions of yeast+6Prot data set. (B) Probabilities generated from ET normalized (describe in the text) cross-correlation score distributions of yeast+6Prot data set.
Figure 4. ROC curves for yeast+6Prot data set using statistics based on original cross-correlation and its modification, ET normalizations.
extensively studied; therefore, it is not unexpected that results obtained in using this method do not exhibit the least optimal behavior. Analytical Chemistry, Vol. 76, No. 6, March 15, 2004
1669
Table 1. Examples of False Negatives from a Threshold-Based Method (Using DTASelect with the Threshold Parameters Specified in the Text) That Were Identified by the PROT_PROBE Programa
CAI
protein molecules/ cell
length
spectrum copy
locus ID
identified peptide
Xcorr
probability
confidence (%)
YML086C YKL010C YIR026C YOR281C YNL016W YML110C YOR271C YEL037C YHR098C
R.YFQPSSIDEVVELVK.S R.IIPMETLIGNIAAILSDK.I K.NIPIDDDDVTDVLQYFDETNR.F K.FGEVFHINKPEYNK.E K.QYFQVGGPIANIK.I K.FGDTESTMDIVDINPDMLK.E R.VINATPTMVIPPLILVR.L R.QVVSGNPEALAPLLENISAR.Y K.SSLPLVLLRPNVDQYYSNVMSQLLCEDK.T K.YFTDQLDESTLQLLDGLR.D R.EILDLIDEYDSEGR.H R.WPIIIQNAIDDLSK.H
3.29 3.62 4.83 4.33 3.73 4.39 3.45 5.08 5.16
12.85 11.01 9.61 7.43 8.99 11.62 8.27 12.22 7.06
94.71 93.47 92.08 90.00 92.57 93.5 90.96 94.35 85.32
0.217 0.139 0.164 0.165 0.291 0.182 0.197 0.164 0.176
6.19 × 103 7.38 × 103 7.57 × 103 7.7 × 103 4.96 × 104 8.1 × 103 9.29 × 103 1.09 × 104 1.22 × 104
526 1483 364 286 453 307 327 398 929
2 2 2 2 2 2 2 2 2
5.42 3.65 4.9
16.46 17.5 13.82
95.84 96.32 94.8
0.143 0.108 0.188
3.76 × 104
355 161 470
1 1 1
YLR287C YOR257W YMR027W
6.19 × 103
a All loci, except YHR098C, which is identified with +3 charge peptide, were identified by +2 charged peptides. The locus IDs follow nomenclature at http://www.yeastgenome.org. The CAI values are obtained from the same web site. Protein numbers per cell are from the literature.36 All calculated database search scores were produced by PEP_PROBE.
Examples of false negatives by DTASelect identified using PROT_PROBE are presented in Table 1. Manual examination of the tandem mass spectra confirmed the identifications as valid. The reason these proteins were not in the DTASelect list is that each of these proteins is identified with a single, high-quality spectrum and thus they were filtered from the final list. Clearly, by lowering the required number of peptide matches to a protein these peptides (and proteins) would be included in the DTASelect list; however, given the number of spectra in the data set (220 000), a number of false positives would appear on the list as well. For example, if we use the same threshold values for cross-correlation scores, minimum deltaCn, consider only at least half tryptic peptides, then the number of candidate proteins will be ∼1700 (including redundant proteins). PROT_PROBE was able to identify these proteins because the high quality of identifications offset the small number of matches and the combination of scores that is implemented in the programs. PROT_PROBE was able to confidently identify proteins even though only a single spectrum matched to the protein. For example, entry gi|27706630|ref|XP_217571.1| was identified by PROT_PROBE but was not found by DTASelect because only one peptide matched to the protein (minimum number of required matches in this case was two). In PROT_PROBE, the protein scores a relatively high multinomial probability and subsequent manual inspection of the tandem mass spectrum verifies it is a true positive. In addition to the probability-based scores, PROT_PROBE generates two additional values that also are helpful in protein identification. These are the levels of significance at which the hypotheses H0 and H1 can be rejected, type I error. For example, in Figure 5, we present H0 and H1 distributions for the yeast locusYOL001W identified in the yeast+6Prot data set. This locus has three peptide matches that pass the threshold values. However, the expected number of matches by random according to H0 is 45 and in H1 is 14. Because the number of matches from the data set is less than the number of random matches expected from the database, this assignment qualifies as false. Another and more difficult example is the yeast locus YJL008C. It has numerous high-quality peptide identifications in our data 1670 Analytical Chemistry, Vol. 76, No. 6, March 15, 2004
Figure 5. Probability distributions for a falsely identified locus YOL001W. The expected number of peptide matches (in both H0 and H1 distributions) is higher than the observed number of matches, 3.
set, and the type I error of H1 is zero. Therefore, even though the type I error of H0 for this protein is rather high, ∼45%, this identification is valid. The distributions of the two hypotheses for this locus are depicted in Figure 6. The shadowed area is the type I error of H0sthe proportion of random distributions that lead to a higher number of peptide matches. The probabilities used to generate H0 are solely based on the protein length. Clearly this is a drawback for our approach, because the actual values that are needed to calculate H0 probability distribution are the protein abundance levels. But since these data are now available only for yeast,36 we derive random probability of protein match from its length and size of the database that was searched against. We observe an important property of the scores generated by the maximum likelihood ratio. Thus, addition of a poor-quality match to a protein identified by high scoring peptides decreases the maximum likelihood score (-log(LR)) of the protein. For (36) Ghaemmaghami, S.; Huh, W. K.; Bower, K.; Howson, R. W.; Belle, A.; Dephoure, N.; O’Shea, E. K.; Weissman, J. S. Nature 2003, 425, 737-741.
those to YBL022C (cross-correlation scores are 2.63, 3.38, and 3.53). No distinction in the protein identifications would be observed using threshold values, while PROT_PROBE assigns substantially different scores to these loci. Another advantage of the probability-based method over the threshold-based methods is a dependence on the size of the data set. Thus, in thresholdbased methods, two protein identifications would have the same validity regardless if the data set contained 10 000 or 100 000 spectra. In these cases, an experienced scientist would factor in the size of the data set when deciding on the minimal number of peptide matches necessary for a protein to be considered truly identified. In probability-based methods, the overall number of spectra enters the probability values in the multinomial model (implicitly, Figure 3) and in the LR scores (explicitly).
Figure 6. Protein probability hypotheses for S. cerevisiae locus YJL008C. The number of matches to this protein is 29, the expected number of matches by random is 28.
example, the locus YGR244C has three peptides matching it. The likelihood score of this locus being identified by three peptides is 4.1. However, the likelihood score of the first two (the two highest scoring peptides) is 15.4. The results are interpreted such that the model does not predict that there are three matches to this protein, but it is more likely that this locus has two matches. This behavior of the maximum likelihood ratio creates a possibility of determining an optimal number of peptide matches to the protein without a strict reference to cutoff values. This can be done by excluding from the protein hit list the peptide with the smallest probability. If the maximum likelihood score increases, then the process is continued until we reach the (highest) optimal maximum likelihood score. The advantages of the presented statistical methods relative to the threshold-based methods can be summarized in the following way. First, we are able to identify proteins missed by the filtering methods employed in threshold-type algorithms. This is mainly related to the threshold values used for the number of peptide matches necessary for a protein to be considered as present in the sample. Lowering threshold values results in a higher than acceptable false positive rate. A second advantage is the ability to differentiate the match quality of proteins that have the same number of peptide matches. For example, two loci YNL045W and YBL022C, in our yeast data set each has three matches that pass threshold values. The multinomial probability ratio of the loci is exp(5.2), and the score difference is 5.2. This reflects the fact the matches to the locus YNL045W (crosscorrelation scores 4.14, 4.54, and 5.32) are higher quality than
CONCLUSION In this paper, we present two probabilistic methods for protein assignment using results of database searches and tandem mass spectra. In the binomial distribution, we approximate protein identification probability at each Bernoulli event with peptide identification probabilities obtained from hypergeometric distribution of database search results (H1). This distribution is tested against the null hypothesis that states that the protein probability belongs to a random binomial distribution whose probability is obtained from the relative protein length in the database (H0). A maximum likelihood ratio of these probabilities is used as protein identification score. In the second, more general model, we employ multinomial distribution to calculate probability of protein identification by random. This approach can be applied to any database peptide identification schemes. In this paper, we test the method using cross-correlation score and its modification. We have shown both of these statistical models improve the efficiency of protein identification by reducing the amount of manual analysis required of tandem mass spectral peptide identifications and increase the numbers of identifications over threshold-based methods. ACKNOWLEDGMENT We thank W. Hayes McDonald, J. Venable, and J. Wohlschlegel for stimulating discussions on the subject. We thank J. Venable for critical reading of the manuscript. The authors gratefully acknowledge financial support from National Institutes of Health Grants RR11823-08 and R01 MH067880 and the Office of Naval Research N00014-00-1-0421. Received for review September 22, 2003. Accepted January 6, 2004. AC035112Y
Analytical Chemistry, Vol. 76, No. 6, March 15, 2004
1671