Anal. Chem. 2006, 78, 89-95

Central Limit Theorem as an Approximation for Intensity-Based Scoring Function

Rovshan Sadygov,† James Wohlschlegel,‡ Sung Kyu Park,‡ Tao Xu,‡ and John R. Yates, III*,‡

Thermo Electron Corporation, San Jose, California 95134, and The Scripps Research Institute, La Jolla, California 92037

In this paper, we present an intensity-based probability function to identify peptides from tandem mass spectra and amino acid sequence databases. The function is an approximation to the central limit theorem, and it explicitly depends on the cumulative product ion intensities, the number of product ions of a peptide, and the expectation value of the cumulative intensity. We compare the results of database searches using the new scoring function and scoring functions from earlier algorithms, which implement hypergeometric probability, Poisson’s model, and cross-correlation scores. For a standard protein mixture (tandem mass spectra generated from a mixture of five known proteins), we generate receiver operating characteristic (ROC) curves with all scoring schemes. The ROC curves show that the shared peak count-based probability methods (like the Poisson and hypergeometric models) are the most specific for matching high-quality tandem mass spectra. The intensity-based (central limit model) and intensity-modeled (cross-correlation) methods are more sensitive when matching low-quality tandem mass spectra, where the number of shared peaks is insufficient to correctly identify a peptide. Cross-correlation methods show a small advantage over the intensity-based probability method.

Large-scale analysis of mass spectrometry data is enabled by methods to analyze the data using computer algorithms and protein sequence databases. Two general strategies emerged for protein identification: searching databases using peptide molecular weight fingerprints and searching them using amino acid sequence fragmentation patterns represented in tandem mass spectra. The ability to perform large-scale analysis of tandem mass spectra created new analytical strategies for the analysis of proteins by enabling protein mixture analysis.
Methods to analyze tandem mass spectra have been reviewed recently, and the new analytical capabilities created have also been reviewed.1-3 Database searching programs compare tandem mass spectra to candidate sequences using two general methods to measure closeness of fit between spectra and sequences. The first method uses a shared peak model to generate a quantitative measure of the fit, while the second method uses fragment ion frequency to generate the probability that the sequence and spectrum are the best fit.1 Both methods have strengths and weaknesses relative to the large number of tandem mass spectra generated in a shotgun proteomics-type experiment, and the overall challenge is to increase the sensitivity of searches while maintaining adequate discrimination between correct answers and false positives.

* To whom correspondence should be addressed. Tel: (858) 784-8862. Fax: (858) 784-8883. E-mail: [email protected].
† Thermo Electron Corp.
‡ The Scripps Research Institute.
(1) Sadygov, R.; Cociorva, D.; Yates, J. R., III. Nat Methods 2004, 1, 195-202.
(2) Yates, J. R., 3rd. Annu. Rev. Biophys. Biomol. Struct. 2004, 33, 297-316.
(3) Lin, D.; Tabb, D. L.; Yates, J. R., 3rd. Biochim. Biophys. Acta 2003, 1646, 1-10.
10.1021/ac051206r CCC: $33.50 Published on Web 11/24/2005
© 2006 American Chemical Society

A number of probability-based models have been applied in the peptide identification process.4-9 These models mostly use the number of shared peaks as a random variable or test statistic to determine the closeness of fit between a spectrum and an amino acid sequence. A shared peak count is a suitable parameter for developing a random matching model. Normally, a random matching model assumes that the product ions of a sequence match a tandem mass spectrum at random; therefore, the peptide with the highest number of matches will be the least random assignment and the most likely candidate. The approach normally works well and, in a number of implemented models, leads to true identifications for good- to high-quality spectra. However, the shared peak method of matching may leave highly abundant fragment ions unexplained and thus lead to misassignment of the spectrum. One way to overcome this is to require that a good-quality match explain at least one (or a certain number) of the highest peaks in the spectrum. This approach works well to bar random matches from being considered candidates. However, the cumulative fragment ion intensity does not explicitly figure into the scoring functions of these methods. In this paper, we present a method that uses a scoring function directly dependent on the cumulative intensity of the explained product ions. The motivation for this scoring function comes from the availability of probability statistics for cumulative random values.
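The difference between a shared peak count and a cumulative matched intensity can be illustrated with a small sketch. This is not the authors' implementation: the function name `match_fragments`, the fixed m/z tolerance, and the rule of taking the most intense peak inside the tolerance window are all assumptions made for illustration.

```python
def match_fragments(predicted_mz, spectrum, tol=0.5):
    """Match predicted fragment m/z values against an observed spectrum.

    predicted_mz: iterable of predicted product ion m/z values (hypothetical input).
    spectrum: list of (mz, intensity) pairs from one tandem mass spectrum.
    tol: matching tolerance in m/z units (an assumed value, not from the paper).

    Returns (shared_peak_count, summed_matched_intensity).
    """
    shared = 0
    summed = 0.0
    for mz in predicted_mz:
        # Intensities of all observed peaks within the tolerance window.
        hits = [inten for peak_mz, inten in spectrum if abs(peak_mz - mz) <= tol]
        if hits:
            shared += 1
            summed += max(hits)  # assumption: keep the most intense matching peak
    return shared, summed


# Toy spectrum: two candidates share the same number of peaks
# but explain very different amounts of intensity.
spectrum = [(100.0, 10.0), (200.2, 50.0), (300.0, 5.0)]
candidate_a = [100.1, 200.0, 400.0]
candidate_b = [100.1, 300.1, 400.0]
print(match_fragments(candidate_a, spectrum))  # (2, 60.0)
print(match_fragments(candidate_b, spectrum))  # (2, 15.0)
```

In this toy case a pure shared peak count cannot separate the two candidates, whereas the summed intensity can, which is the situation the intensity-based score is meant to address.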
Statistics for cumulative random values have been used in bioinformatics in a number of studies. Karlin and Altschul used cumulative random variables to determine significant sequence similarities where the similarity score of a whole sequence is not significant, but the sum of local similarity scores is significant.10 Also, sum statistics have been used by Sunyaev et al. for peptide identification using tandem mass spectra and protein sequence databases for cross-species protein identification. The model used by Sunyaev et al. is based on renewal theory, where the distribution of the sum statistics is calculated from the convolution of the elementary distributions.11

(4) Mann, M.; Wilm, M. Anal. Chem. 1994, 66, 4390-4399.
(5) Eng, J.; McCormack, A.; Yates, J. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(6) Hansen, B. T.; Jones, J. A.; Mason, D. E.; Liebler, D. C. Anal. Chem. 2001, 73, 1676-1683.
(7) Bafna, V.; Edwards, N. Bioinformatics 2001, 17 (Suppl 1), S13-S21.
(8) Colinge, J.; Masselot, A.; Giron, M.; Dessingy, T.; Magnin, J. Proteomics 2003, 3, 1454-1463.
(9) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. J. Proteome Res. 2004, 3, 958-964.
Analytical Chemistry, Vol. 78, No. 1, January 1, 2006

In this paper, we propose a scoring function based on the formulation of the peptide identification problem in terms of the central limit theorem. Each candidate amino acid sequence explains a certain percentage of the total ion current of a tandem mass spectrum. This value is often used as a distinguishing parameter in validating peptide assignments. We use the central limit theorem to quantify the cumulative fragment ion intensities explained by fragment ions predicted from a peptide sequence. The central limit theorem has previously been used in proteomics for differential protein quantification.12 One of the problems in developing an intensity-based probability model is that the random variables (intensities) are not integer values, unlike the shared peak count. Therefore, direct application of the known probability models, where we model the distribution of product ions or their properties, does not seem to be straightforward. The central limit theorem allows us to bypass this problem by constructing a statistic from the sum of the random variables, without making any assumptions about the distribution function of the random variable itself.

EXPERIMENTAL SECTION

The data set used for this study was previously described by MacCoss et al. and is summarized here.13

Standard Preparation and Digestion. A known mixture of proteins containing equimolar levels of phosphorylase a (rabbit skeletal muscle), cytochrome c (horse), apomyoglobin (horse heart), albumin (bovine serum), and β-casein (bovine) was used for all experiments.
The resulting mixture (∼1 nmol/mL in water) was adjusted to 8 M urea with the addition of solid urea, reduced with dithiothreitol (20 mM final concentration at 50 °C for 20 min), and alkylated with iodoacetamide (50 mM final concentration in the dark at room temperature). The denatured, reduced, and alkylated standard mixture was divided into four aliquots, and each was digested using a different protease. Aliquot 1 was diluted 3-fold with 100 mM Tris, pH 8.5, and CaCl2 was added to a final concentration of 1 mM. Modified trypsin (Promega) was added at a 1:100 enzyme-to-substrate ratio (w/w), and the mixture was incubated overnight at 37 °C with constant mixing (Thermomixer, Eppendorf). Aliquot 2 was diluted 3-fold with 100 mM Tris, pH 8.5, elastase (Roche) was added at a 1:50 enzyme-to-substrate ratio (w/w), and the resultant mixture was incubated overnight with mixing at 37 °C. Aliquot 3 was diluted 3-fold with 100 mM Tris, pH 8.5, subtilisin (Sigma) was added at a 1:50 enzyme-to-substrate ratio (w/w), and the resultant mixture was incubated with mixing for 3 h at 37 °C. Aliquot 4 was adjusted to pH 11 with 1 M NaOH, proteinase K (Roche) was added at a 1:100 enzyme-to-substrate ratio (w/w), and the resultant mixture was incubated at 37 °C for 3 h with constant mixing. Each digestion was quenched with the addition of formic acid to 5% and frozen at -80 °C until analysis by MudPIT as described below. The combined digest (obtained from the different enzymatic digests) was used for the mass spectral searches we report below.

(10) Karlin, S.; Altschul, S. F. Proc. Natl. Acad. Sci. U.S.A. 1993, 90, 5873-5877.
(11) Sunyaev, S.; Liska, A. J.; Golod, A.; Shevchenko, A. Anal. Chem. 2003, 75, 1307-1315.
(12) Li, X. J.; Zhang, H.; Ranish, J. A.; Aebersold, R. Anal. Chem. 2003, 75, 6648-6657.
(13) MacCoss, M. J.; Wu, C. C.; Yates, J. R., 3rd. Anal. Chem. 2002, 74, 5593-5599.

Multidimensional Protein Identification Technology. A triphasic microcapillary column was constructed from 100-µm-i.d. fused-silica capillary tubing pulled to a 5-µm-i.d. tip using a Sutter Instruments P-2000 CO2 laser puller (Novato, CA).14 Each column was packed with 7 cm of 5-µm Aqua C18 material (Phenomenex, Ventura, CA), followed by 3 cm of 5-µm Partisphere strong cation exchanger (Whatman, Clifton, NJ) and another 3 cm of Aqua C18. The columns were equilibrated with 5% acetonitrile/0.1% formic acid, and ∼4 pmol of each protein digest was loaded directly onto separate capillary columns using a high-pressure bomb. After loading the peptide digests, the column was placed inline with a Surveyor quaternary HPLC (ThermoFinnigan, Palo Alto, CA) and analyzed using a modified six-step separation described previously.15,16 The buffer solutions used were 5% acetonitrile/0.1% formic acid (buffer A), 80% acetonitrile/0.1% formic acid (buffer B), and 500 mM ammonium acetate/5% acetonitrile/0.1% formic acid (buffer C). Step 1 consisted of a 100-min gradient from 0 to 100% buffer B. Steps 2-5 had the following profile: 3 min of 100% buffer A, 2 min of X% buffer C, a 10-min gradient from 0 to 15% buffer B, and a 97-min gradient from 15 to 45% buffer B. The 2-min buffer C percentages (X) were 10, 20, 30, 40, and 50% for the six-step analysis. In the final step, the gradient contained the following: 3 min of 100% buffer A, 20 min of 100% buffer C, a 10-min gradient from 0 to 15% buffer B, and a 107-min gradient from 15 to 70% buffer B.
As peptides eluted from the microcapillary column, they were electrosprayed directly into an LCQ-Deca mass spectrometer (ThermoFinnigan, Palo Alto, CA) with the application of a distal 2.4-kV spray voltage. A cycle of one full-scan mass spectrum (400-1400 m/z) followed by three data-dependent tandem mass spectra at a 35% normalized collision energy was repeated continuously throughout each step of the multidimensional separation. The application of all mass spectrometer scan functions and HPLC solvent gradients was controlled by the Xcalibur data system.

Database Search. The combined digest mixture (trypsin, elastase, subtilisin, and proteinase K digests combined) was used in this study. The overall number of MS/MS spectra searched is ∼59 900 (this includes +2 and +3 copies of some spectra for which 2to3 could not determine the correct charge state). The spectra were searched against the nonredundant database downloaded from the NCBI site. The database search was not restricted to any type of enzymatic specificity. Cysteine carboxymethylation was the only allowed modification. In our presentation, we combine the search results of all spectra, without a breakdown by enzymatic specificity.

(14) McDonald, W. H.; Ohi, R.; Miyamoto, D. T.; Mitchison, T. J.; Yates, J. R. Int. J. Mass Spectrom. 2002, 219, 245-251.
(15) Link, A. J.; Eng, J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., 3rd. Nat. Biotechnol. 1999, 17, 676-682.
(16) Washburn, M. P.; Wolters, D.; Yates, J. R., 3rd. Nat. Biotechnol. 2001, 19, 242-247.

Figure 1. Summed intensity distribution of candidate peptides to a tandem mass spectrum plotted as a function of the number of peptides in the database. As seen from the figure, the summed intensity serves as a differentiating parameter for validity of peptide identification. The intensities are normalized to 100.

THEORY

We formulate an approximation to the central limit theorem as a scoring function for matching tandem mass spectra to peptide sequences. The problem is constructed as follows. We assume that we are dealing with a random variable whose values are abundances from an experimental spectrum. We do not know the probability distribution function of this random variable. What we know are the values that this random variable can take. These are the processed intensities of a tandem mass spectrum (we work with the logarithm of intensities; assume there are N mass peaks in the spectrum, each with intensity X_i) plus a zero intensity value, that is, a set {X_i} of N + 1 values. Every amino acid sequence (peptide) is considered as a random sequence. The values that the random sequence takes are the corresponding intensities of the product ions, or zero if there is no fragment ion match. Figure 1 shows an example of the summed intensity distribution of candidate sequences for a tandem mass spectrum (the experimental intensities are normalized to a maximum of 100). The summed intensities show a random distribution, as there are significant differences among the candidate sequences. The role of the scoring function is to determine and statistically estimate how far a specific summed intensity value, corresponding to a candidate peptide, lies from the random distribution. Given that we do not know the underlying distribution of the intensity values, we use a general model in the form of the central limit theorem, which predicts the distribution function of the sum of n trials as

P\left( X > a = \frac{\sum_{i=1}^{n} X_i - E}{\sigma \sqrt{n}} \right) = \int_{-\infty}^{a} \exp(-x^{2})\, dx \qquad (1)
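As a minimal sketch of how the argument a in eq 1 might be computed for one candidate peptide, the snippet below evaluates (sum of matched intensities minus E) divided by σ√n and converts it to a tail probability. The function names and inputs are hypothetical, and E and σ are assumed to be supplied by the caller. The tail probability uses the conventional standard normal normalization (via `math.erfc`), which is an assumption here and may differ from the authors' tabulated error function.

```python
import math


def clt_score(matched_log_intensities, n_fragments, expected_sum, sigma):
    """Argument of the normal integral: (sum of X_i - E) / (sigma * sqrt(n)).

    matched_log_intensities: log-transformed intensities of matched product ions
        (unmatched ions contribute zero and can simply be omitted).
    n_fragments: number of product ions predicted for the candidate sequence.
    expected_sum: E, the expected summed intensity for a random sequence (assumed given).
    sigma: spread parameter of the random intensity model (assumed given).
    """
    if n_fragments <= 0 or sigma <= 0:
        raise ValueError("n_fragments and sigma must be positive")
    total = sum(matched_log_intensities)
    return (total - expected_sum) / (sigma * math.sqrt(n_fragments))


def upper_tail_probability(score):
    """P(Z > score) for a standard normal Z (assumed normalization)."""
    return 0.5 * math.erfc(score / math.sqrt(2.0))


score = clt_score([2.0, 3.0, 1.0], n_fragments=9, expected_sum=3.0, sigma=0.5)
print(score)                        # 2.0
print(upper_tail_probability(0.0))  # 0.5
```

Because the intensities, E, and σ all carry the same units, multiplying every one of them by a common factor leaves the score unchanged, so the sketch is dimensionless in the same sense as the scoring function.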

Here X_i is the (log-transformed) intensity of a fragment match (candidate peptide), E is the expectation value (of summed intensities) for a random sequence, σ is the standard deviation, and n is the number of product ion fragments of a sequence. The distribution parameters, like the expectation value and variance, were obtained from the overall distribution of peptide intensities (all candidate peptides from the database) matched to the tandem mass spectrum, Figure 1. The score assigned to each peptide is the argument of the error function. In this model, the significance test follows naturally as the type I error of the model. Strictly speaking, the central limit theorem applies to situations where the average of averages from several samples is considered. The straightforward analogue in our case would be an application to the average of the averaged intensities of sequences, where the number of samples is the number of candidate sequences from the database. In this implementation, we average the intensities over all product ions of the database and calculate the probability of observing a certain intensity value in n trials, where n is the number of product ions in the sequence. The requirements that need to be satisfied for application of the central limit theorem are verifiable. Thus, in Lyapunov’s formulation, it is required that the expectation of the cube of the absolute difference between a random variable and its average be bounded:17



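\frac{1}{\left( \sum_{i=1}^{n} \sigma_i^{2} \right)^{3/2}} \sum_{i=1}^{n} M\left| X_i - \bar{X}_i \right|^{3} \xrightarrow{\; n \to \infty \;} 0 \qquad (2)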

where M denotes an expectation value. Condition 2 is computationally easy to verify, since we store all statistics about the product ion matches to the tandem mass spectrum as they are processed. In the central limit theorem, there is a term that is an expectation value of the accumulated random intensity. The expectation is dependent on two random variables: the number of shared peaks and the intensity of each shared peak. The estimate of this expectation is given by Wald’s approximation18 (p 601):

E = \sum_{i=0}^{n} i \cdot X_i = E(n) \cdot E(X) \qquad (3)

where E(n) is the expected number of product ion matches and E(X) is the expected or average intensity per product ion match. The scoring function (1) is intuitive and in accord with the empirical rules that a researcher applies when deciding about the quality of a peptide identification. Thus, it is dependent on the amount of the fragment ion intensity in the spectrum that can be explained (shared peak intensity). This value is empirically known to be one of the most useful parameters for deciding spectrum assignment for intermediate-quality identifications. In the score, there is a term (in the numerator) that is subtracted from the amount of shared peak intensity. This term can be thought of as taking account of the random matches to the spectrum. [The numerator is thus the amount of intensity above the random intensity that peptides from the database can acquire against this spectrum.] In this model, we remove from the shared peak intensity the amount of intensity that is expected to be matched at random. Scoring functions are often normalized for the length of a peptide because a longer peptide results in a higher value produced by the scoring function. In the above formula, the number of fragment ions in a peptide enters the denominator of the scoring function; thus, the function takes into account the length of the amino acid sequence explicitly. Also, the scoring function is dimensionless; both the numerator and denominator contain factors dependent on the intensity values. Therefore, rescaling of the spectrum’s intensity is not expected to affect the scoring results. The right-hand side of the scoring function is the well-tabulated error function. It is not necessary to calculate it at every point, as the table is stored in memory.

(17) Wilks, S. S. Mathematical Statistics; Wiley: New York, 1962.
(18) Feller, W. An Introduction to Probability. Theory and Its Applications, 2nd ed.; John Wiley and Sons: Singapore, 1971.

RESULTS AND DISCUSSION

We searched spectra from a digested mixture of five known proteins using a number of algorithms. Each algorithm uses two scoring schemes chosen from hypergeometric, Poisson’s, central limit theorem, cross-correlation, and the preliminary scoring of SEQUEST. The specific combinations that have been tested are the ones based solely on the number of matches (hypergeometric and Poisson), the number of matches and intensities (hypergeometric and central limit theorem), the number of matches and cross-correlation (hypergeometric and cross-correlation score), and the normal SEQUEST search (cross-correlation and preliminary scoring). With the exception of the last combination (cross-correlation and preliminary score), all reported combinations (and the corresponding individual scoring schemes) have been implemented in Pep_Probe. We have previously described the hypergeometric and Poisson’s scoring functions. Geer et al. extended the Poisson model to include the accuracy of product ion matches.9 In the following discussion, we treat each of the above scoring schemes (except Poisson’s distribution, which in terms of peptide identification is close to the hypergeometric) separately and in combination. For all database searches, we used the nonredundant protein sequence database available from NCBI.
We assumed that, because the database is so large, if a spectrum matches a peptide from the five-protein mix, then the match is valid, regardless of the quality of the match. This is not a completely valid assumption; one reason is that, even though the database is large, the variation in its amino acid sequence content (the number of nonredundant sequences) is not proportionally high. However, this design currently allows us to determine the performance of peptide identification algorithms in controlled experiments (where we know the identity of the proteins in the sample). It is also expected that the protein mixture is not 100% pure, and therefore, we also accept as true hits spectral matches to proteins derived from the original five proteins (modified versions of the original proteins and their precursors). Every match to a target protein (or to its variants) is accepted as a true hit, without regard to the quality of the match. Thus, in reporting the number of identified peptides, we do not use any score cutoffs or thresholds.

For each algorithm, the data set is searched twice: once using the original database and once using the database obtained from it by reversing the amino acid sequence of each protein. The resulting score distributions are shown in Figure 2. The search engines behave similarly with respect to the database reversal. In the low-scoring areas of the curves, there is not much difference between the scores from the original and reversed databases. The main differences emerge in the areas of the scoring functions where identifications correspond to real peptides. From the score distribution of the reversed database, we obtain an estimate of the false positive rates.19 In Figure 2, we have combined search scores for all peptides without separating them into charge states and true and false positives. The score distributions are mainly multimodal: there are several local maxima corresponding to random and true positive distributions of different charge states. True and false positive score distributions for XCorr and hypergeometric distributions for the five-protein data set have been presented elsewhere.20 Figure 3 shows the corresponding distribution obtained from the implementation of the central limit theorem. The distributions from the reversed and original databases differ mainly in the high-score portion of the curves. The reversed database search results are used below to generate false positive rates.

Figure 2. (A) Distribution of peptides with XCorr scores as a function of the number of tandem mass spectra searched with the algorithm SEQUEST using a sequence database and the reverse of the sequence database. The distribution is the combined frequency of the +1, +2, and +3 peptides. (B) Score distributions of hypergeometric probability with regular (black solid line) and reversed (red dash-dotted line) databases searched with the algorithm Pep_Probe. (C) Score distributions of regular (black solid line) and reversed database (dashed red line) for the intensity-based probability scoring function as a function of the number of tandem mass spectra searched.

Figure 3. Score distributions for true (green line) and false (red line) positives calculated using the central limit theorem as a function of the number of tandem mass spectra searched.

Since we know the identities of the proteins in the mixture, we can compare the performance of each algorithm with the others. For this purpose, the most informative figures in this study appear to be the receiver operating characteristic (ROC) curves. The ROC curves from the different methods are shown in Figure 4. In the figure, we present the number of true spectral matches as a function of the number of false spectral matches. We can infer from the curves the overall sensitivity and learn the true and false positive rates of each method. The overall highest sensitivity is achieved by the cross-correlation function of SEQUEST. A shared peak count-based probability method (hypergeometric probability-based scoring functions) shows the best specificity for high-scoring spectra (this is hard to see in the figure, but clear from the numerical data). The intensity-based probability method performs better in the midportion of the ROC curves. The crossing point between the cross-correlation and central limit theorem methods is at ∼5800 true positives and ∼20 400 false positives. The crossing point occurs at cross-correlation and central limit scores of 2.3 and 2.8, respectively. The intensity-based probability scoring function yields 6557 true spectral matches, or 4% less than SEQUEST (6837 true matches).
However, its specificity is higher than that of both SEQUEST and the shared peak count methods in the middle portion of the ROC. It appears from the ROCs that, until we encounter roughly four false positives for every true positive, the central limit theorem-based scoring function performs better than the other two methods. It is likely that in most experiments a researcher would stop accepting hits well before this false positive rate. This implies that, for many experimental setups, the intensity-based probability scores could be more informative than the cross-correlation and hypergeometric scores.

Figure 4. ROC curves for different peptide identification scores. For the highest scoring peptides, the best performance is achieved by the hypergeometric model, while the intensity-based model performs better in the middle of the ROC curve. The best overall performance is achieved by SEQUEST.

(19) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43-50.
(20) Sadygov, R.; Yates, J. R., I. Anal. Chem. 2003, 75, 3792-3798.

The main deficiency in the illustrations of the ROC curves in this study is the absence of score values. Even though the false and true positive identification numbers are shown, it is not clear at which score values they are obtained. It is important for practical applications to know the true and false positive rates at given score thresholds. The next set of figures (Figure 5A and B) shows true and false positive rates for every method that we test, alone and in combination with other methods. The depicted false positive rate is the expected number of false identifications called significant at the score threshold, divided by the total number of trials (which is the number of spectra). It is also called a type I error. The true positive rate at a certain score is the portion of all true positives that pass the score. Figure 5 plots false (broken lines) and true (solid lines) positive rates for a single method (red lines, cross-correlation or central limit theorem) and for a combination of methods (black lines; hypergeometric and cross-correlation, Figure 5A, and hypergeometric and central limit theorem, Figure 5B). In these figures, we consider all peptides without regard to enzymatic cleavage specificity. A breakdown of the figures into fully tryptic and nontryptic peptides is presented in the Supporting Information. As seen from the figures, the search results from “paired” methods have lower false positive rates than single-method search results. The downside is that the corresponding true positive rates are also lower.
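The rate definitions above can be sketched in a few lines. This is an illustrative reading of the text, not the authors' code: the function names are hypothetical, a match is assumed to "pass" when its score is greater than or equal to the threshold, and the false positive rate is normalized by the number of spectra as described in the paragraph above.

```python
def rates_at_threshold(true_scores, false_scores, n_spectra, threshold):
    """Return (tp, fp, tpr, fpr) for matches scoring at or above `threshold`.

    true_scores / false_scores: scores of matches labeled true or false
        (for example, false matches taken from a reversed-database search).
    n_spectra: total number of spectra searched (the number of trials).
    """
    tp = sum(1 for s in true_scores if s >= threshold)
    fp = sum(1 for s in false_scores if s >= threshold)
    tpr = tp / len(true_scores) if true_scores else 0.0  # portion of all true positives
    fpr = fp / n_spectra                                 # false hits per spectrum searched
    return tp, fp, tpr, fpr


def roc_counts(true_scores, false_scores):
    """(false, true) match counts as the threshold is lowered through every observed score."""
    thresholds = sorted(set(true_scores) | set(false_scores), reverse=True)
    return [
        (sum(1 for s in false_scores if s >= t), sum(1 for s in true_scores if s >= t))
        for t in thresholds
    ]


true_scores = [3.0, 2.5, 1.0]
false_scores = [2.0, 0.5]
print(rates_at_threshold(true_scores, false_scores, n_spectra=10, threshold=2.0))
print(roc_counts(true_scores, false_scores))
# [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
```

Plotting the second element of each pair from `roc_counts` against the first gives a curve of true versus false spectral matches in the style of Figure 4, while `rates_at_threshold` ties each point back to an explicit score value.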
Since there are only five proteins in the mixture, the number of unique peptides obtained from the digestion of these proteins is not high. There are 1296 unique peptides identified in the search using SEQUEST. Of these, 541 are either fully or partially tryptic. The corresponding numbers for the intensity-based probability method are 1248 and 507, respectively. In regard to the number of unique peptide assignments, the methods do not show a statistically significant difference (less than 4%). However, when we compare the peptides identified in both searches, we make interesting observations. Only 986 of the unique peptides are common to both sets; of these, 420 are tryptic. Each of the methods identifies ∼300 peptides that are unique to the method. It should be noted here that this observation cannot be attributed to random assignments (there is no other combination of five proteins in the database that has 300 total hits). When the results from these two methods are summed together, the number of unique peptide identifications is ∼25% higher than if only one search result is used. This analysis reinforces that different methods of scoring can produce complementary results, a phenomenon that has been observed by others.21 The number of unique peptides identified using the hypergeometric model is 685, and of these 329 are tryptic. This model produces the highest percentage of peptides identified in all three methods. Thus, 617 of the unique peptides identified by the hypergeometric model are in the assignment list of the intensity-based probability model, while 634 are in the identification list of the SEQUEST results. A total of 585 peptides are identified by all three search algorithms. The number of peptides identified by one method only is 8, 230, and 261, respectively, for the hypergeometric probability method, the central limit-based probability method, and SEQUEST. As the analysis shows, combining results from the central limit theorem and SEQUEST will significantly improve the sensitivity, while the hypergeometric-based model is best suited for searches where specificity is most important. Only 1.1% of identifications by the hypergeometric model are not supported by evidence from the other algorithms. The results are summarized in a Venn diagram in Figure 6.

Figure 5. (A) True and false positive rates for cross-correlation scores and the combination of cross-correlation and hypergeometric scores. The solid lines denote true positive rates for cross-correlation scores (red line) and for the combination of cross-correlation and hypergeometric scores (black line; in this case, only peptides that are ranked number 1 by both methods are considered as candidate peptides). The broken lines are the false positive rates of the cross-correlation scores (red broken line) and the combination of the cross-correlation and hypergeometric scores (black broken line). (B) False (broken line) and true (solid line) positive rates of the intensity-based probability scoring function as obtained from the reverse database (green broken line, FPR), from the intensity-based method only (red lines), and from the combination of hypergeometric and intensity-based probability scoring functions (black lines).

Figure 6. Venn diagram summarizing peptide identifications from three algorithms: SEQUEST, central limit theorem, and hypergeometric distribution.

(21) Resing, K. A.; Meyer-Arendt, K.; Mendoza, A. M.; Aveline-Wolf, L. D.; Jonscher, K. R.; Pierce, K. G.; Old, W. M.; Cheung, H. T.; Russell, S.; Wattawa, J. L.; Goehle, G. R.; Knight, R. D.; Ahn, N. G. Anal. Chem. 2004, 76, 3556-3568.

In summary, our analysis above shows that the central limit theorem-based probability model adds two main advantages to peptide identification using tandem mass spectra and amino acid sequence databases. The first is that the sensitivity of the method allows the identification of a large number of peptides by this method alone. The second advantage is that the method has better specificity in the middle portion of the ROC curve, which means for the high-scoring peptides the specificity of this method is better than that of the other algorithms considered in this work.

Conclusion. We have implemented an intensity-based probability model for peptide identification using tandem mass spectra and amino acid sequence databases. The model uses an approximation to the central limit theorem, where cumulative intensities are used as the test statistic. We have compared the performance of the algorithm with other algorithms using a tandem mass spectrometry data set obtained from a known five-protein mixture. The analysis shows that the intensity-based models are significantly more sensitive than the methods based on the shared peak count. Even for this simple data set, combining the results from the central limit and SEQUEST searches improves the overall identification statistics. The central limit theorem approximation method identifies ∼25% of peptides that are unique to this method alone. The overall sensitivity of the method is ∼4% less than that of SEQUEST; however, for the middle portion of the ROC curves, the intensity-based probability method may be more useful because it shows higher specificity.
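The pairwise overlap figures quoted in the discussion (1296 unique peptides from SEQUEST, 1248 from the intensity-based method, 986 in common) can be checked with simple inclusion-exclusion arithmetic. The helper below is a hypothetical illustration; which single-method count the authors used as the baseline for the ∼25% gain is not stated in the text, so both are shown.

```python
def overlap_summary(n_a, n_b, n_both):
    """Inclusion-exclusion summary for two sets of identified peptides."""
    union = n_a + n_b - n_both
    return {
        "only_a": n_a - n_both,  # peptides found by method A alone
        "only_b": n_b - n_both,  # peptides found by method B alone
        "union": union,          # peptides found by either method
        "gain_over_a_pct": 100.0 * (union - n_a) / n_a,
        "gain_over_b_pct": 100.0 * (union - n_b) / n_b,
    }


# A = SEQUEST, B = intensity-based (central limit) method; counts taken from the text.
summary = overlap_summary(1296, 1248, 986)
print(summary["only_a"], summary["only_b"], summary["union"])  # 310 262 1558
print(round(summary["gain_over_a_pct"], 1))                    # 20.2
print(round(summary["gain_over_b_pct"], 1))                    # 24.8
```

Under these counts, each method contributes roughly 300 peptides the other misses, and pooling the two lists raises the number of unique peptides by about 20 to 25% relative to either search alone.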

A Web-based version of the program can be accessed at the URL http://bart.scripps.edu/public/search/pep_probe/search.jsp.

SUPPORTING INFORMATION AVAILABLE Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

ACKNOWLEDGMENT Funding for this research was provided by NIH R01 MH067880 and P41 RR11823. We also thank Dr. Daniel Cociorva for technical assistance in this study.

Received for review July 7, 2005. Accepted October 24, 2005. AC051206R
