Intensity-Based Statistical Scorer for Tandem Mass Spectrometry

In this paper, we provide benchmarking data of our new statistical scorer versus the ... multiple protonated fragments, which are often ignored in MS/...
0 downloads 0 Views 247KB Size
Anal. Chem. 2003, 75, 435-444

Intensity-Based Statistical Scorer for Tandem Mass Spectrometry Moshe Havilio,* Yariv Haddad, and Zeev Smilansky

Compugen Limited, 72 Pinhas Rozen Street, Tel Aviv, 69512 Israel

We describe a new statistical scorer for tandem mass spectrometry. The scorer is based on the probability that fragments with given chemical properties create measured intensity levels in the experimental spectrum. The scorer’s parameters are computed using a fully automated procedure. Benchmarking the new scorer on a large set of experimental spectra, we show that it performs significantly better than the widely used cross-correlation scoring algorithm of Eng et al. (Eng, J. K.; McKormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.). Tandem mass spectrometry (MS/MS) has become the method of choice for high-throughput identification of proteins. The ability to rapidly identify small quantities of peptides in complex mixtures of proteins is matched by no other method.1 The introduction of an efficient protein identification algorithm by Eng et al.2 turned MS/MS into a dominant proteomics technique and demonstrated that computational developments can significantly extend the range of MS/MS applications in proteomics. A key component of any MS/MS interpretation program is the scoring function, which evaluates the match between the predicted spectra of a given peptide and the experimental spectrum. A peptide from a database with maximal match to the experimental spectrum is a likely candidate for the solution of the protein identification problem.2-7 Similarly, a peptide with a maximal match to the experimental spectrum among all possible peptides provides a likely solution to the de novo sequencing problem.7,8,9 Most existing scoring algorithms rely on cross-correlating the experimental spectrum with predicted spectra of peptides from a database.10 The cross-correlation often amounts to variants of the * Corresponding author: (e-mail) [email protected]. (1) Link, A. J.; Eng, J. K.; Schieltz, D. M.; Carmack, E.; Mize, J. G.; Morris, D. R.; Garvik, B. M.; Yates, J. R. Nat. Biotechnol. 1999, 17, 676-682. (2) Eng, J. K.; McKormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989. (3) Mann, M.; Wilm, M. Anal. Chem. 1994, 66, 4390-99. (4) Taylor, J. A.; Johnson, R. S. Rapid Commun. Mass. Spectrom. 1997, 11, 1067-1075. (5) Fenyo, D.; Qin, J.; Chait, B. Electrophoresis 1998, 19, 998-1005. (6) Clauser, K. R.; Baker, P. R.; Burlingame, A. L. Anal. Chem. 1999, 71, 287182. (7) Bean, M. F.; Carr, S. A.; Throne, G. C.; Reilly, M. H.; Gaskell, S. J. Anal. Chem. 1991, 63, 1473-81. (8) Bartlets, C. Biomed. Environ. Mass Spectrom. 1990, 19, 363-368,. (9) Dancik, V.; Addona, T. A.; C;aiser. L. R.; Vath, J. E.; Pevzner, P. A. J. Comp. Biol. 1999, 6, 327-342. (10) Owens, K. Appl. Spectrosc. Rev. 1992, 27, 1-49. 10.1021/ac0258913 CCC: $25.00 Published on Web 01/04/2003

© 2003 American Chemical Society

shared peaks count.11 This approach to MS/MS was pioneered by Eng et al.,2 and used in such applications as SEQUEST2 and Sonar.12 A different scorer evaluates the probability of finding the collection of detected fragments in a protein database13,14 as implemented in Mascot,13 MOWSE,14 and Protocall15,16 algorithms. A major disadvantage of these algorithms is the absence of a formal way to incorporate MS/MS “wisdom” (the expertise obtained through extensive experimental studies) in peptide identification. Peak intensity intricately depends on ion type and mass, as well as on other experimental parameters. The crosscorrelation algorithm translates this dependency into a single peak in the predicted spectrum. Using the occurrence of the detected fragments’ masses in the database ignores the observed aspects of the spectrum apart from intensity thresholds. Dancik et al.9 recently presented a new likelihood-based scoring approach that takes into account the probabilities to detect certain fragment types in the experimental spectrum. The probabilistic parameters used in the MS/MS search are learned from a given set of interpreted spectra rather than postulated in advance in an ad hoc manner as in other algorithms. This learning process opens a channel through which the observed experimental information affects the score function. Recently, Bafna and Edwards17 at Celera attempted to generalize the probabilistic learning approach of Dancik et al.9 to include more dependencies. Their attempt was limited by the need for expertly curated spectra. Another key component in MS/MS protein identification is the criterion to evaluate the reliability of the identification. Eng et al.2 used the difference between the normalized scores of the first- and second-ranked peptides (∆Cn) for this purpose. Other applications in mass spectrometry, such as Mascot, MOWSE,13 Sonar,12 and Protocall,15,16 use Pvalue estimations to assess the significance of peptide identification (see algorithm section for exact definitions). Pvalue estimates have a long history of applications in other fields of bioinformatics.18,19 The main advantage of Pvalue lies in providing an estimate of the expected rate of false identifications. (11) Pevzner, P.; Dancik, V.; Tang, C. J. Comp. Biol. 2000, 7, 761-770. (12) Sonar (http://65.219.84.5/ProteinId.html). (13) Perkins, D. N. et al. Electrophoresis 1999, 20, 3551-3567. (14) MOWSE (http://www.hgmp.mrc.ac.uk/Bioinformatics/Webapp/mowse/ mowsedoc.html - 4.6). (15) Protocall(http://protocall.labonweb.com). (16) Wool, A.; Smilansky, Z. Proteomics 2002, 2, 1365-1373. (17) Bafna, V.; Edwards, N. Bioinformatics 2001, 17, S13-S21. (18) Pearson, W. R. Methods Enzymol. 1996, 266, 227-259. (19) Levitt, M.; Gerstein, M. Proc. Natl. Acad. Sci. U.S.A. 1998, 95, 5913-5920.

Analytical Chemistry, Vol. 75, No. 3, February 1, 2003 435

The empirical Pvalue assessment procedure, which was recently introduced into mass spectrometry by Eriksson et al.20 and used in Sonar,12 is scorer independent. Such a procedure places different scorers in a common framework and provides a welldefined methodology, previously unavailable, that enables their performances to be compared for a given error rate cutoff. Levitt and Gerstein19 used a similar procedure of Pvalue computations to compare different protein similarity detection algorithms. Despite efforts to improve the reliability of an MS/MS search, further algorithmic and statistical improvements in protein identification are needed. Current experimental practice indicates that the false positive and false negative identification rates remain high in high-throughput MS/MS experiments. In a typical run of liquid chromatography (LC) followed by MS/MS, where the sample is extracted from 2D gel, often only 10% or less of the spectra are reliably identified (cf. Results section below). Increasing the true positive rates will significantly improve the analysis of protein mixtures. Rigorous monitoring of identification error rates is essential in high-throughput processes. Although lower error rate cutoffs are necessary to minimize the number of false positives, the low thresholds may cause correct identifications to be missed. Thus, the number of identified spectra for a given false positive rate is the most important benchmarking parameter when protein identification software tools are compared. To the best of our knowledge, reports on MS/MS benchmarking efforts are scarce. In this paper, we provide benchmarking data of our new statistical scorer versus the cross-correlation scorer of Eng et al.2 and demonstrate that the statistical scorer provides significantly lower false positive/false negative rates. Our work extends the scoring approach of Dancik et al.9 We design a family of scoring functions for protein identification that incorporates a large variety of experimental observations and prior theoretical knowledge on peptide fragmentation. We treat intensity levels and correlations between intensity levels of different ion types as special cases of experimental observations. Although intensity levels were formally treated by Dancik et al.,9 correlation between intensity levels of different fragments is distinctive to our new score function. We treat fragment type, fragment mass, and fragment mass/peptide mass as special cases of fragments’ chemical properties. In our scorer, we are able to take into account significant ion types, such as isotopes and multiple protonated fragments, which are often ignored in MS/MS analysis software. For the learning stage, when the required statistical parameters of the scorer are determined from spectra of known peptides, we begin with a simple score function that is independent of the data set. At the next stage, we automatically choose the best identifications and employ them to derive parameters that can be used by the final scorer. This method makes parameter learning a fully automated procedure. We used scorer-independent Pvalue estimations and a large set of experimental data to test the resulting score function. We found that it performed considerably better in comparison with both the scorer we started with and the cross-correlation algorithm.

samples). The entire spectrum collection was used in the analysis. There was no merging of spectra and hence one spectrum ) one peptide. All protein samples originated from 1D PAGE or 2D gel electrophoresis separations. Bands or spots were extracted from the gel and subjected to in-gel digestion by trypsin. Cysteines were reduced and acylated by iodoacetamide. Analysis of the resulting peptide mixtures was performed on a Finnigan LCQ (San Jose, CA) in the Smoler Protein Center at the Faculty of Biology, The Technion, Haifa, Israel. Database Search Procedure. The database used in the searching procedure was a combination of NCBI Nr (May 2001) and Swiss Prot version 39.16. It contained 697 136 proteins. When searching a database with an MS/MS spectrum, it is assumed that the precursor ion in the first MS is the fragmented peptide in the second MS. The mass/charge (m/z) value of the precursor ion is given. The precursor ion can have one of several possible masses according to its possible charge states. In each search, all peptides with mass within (1000 ppm of the possible masses of the precursor ion were retrieved from the database, without taxonomy limitations. A database indexing method was introduced to support efficient database access. The possible masses of the precursor ion were determined in the following way: for each spectrum, a scorer-independent algorithm was used to determine whether the precursor ion is singly protonated (the description of this algorithm is outside of the scope of this paper). Any precursor ion, recognized to be singly protonated was searched as such (i.e., one possible mass). A precursor ion that was suspected to be of a higher protonation state was searched as singly, doubly, or triply protonated (i.e., three possible masses). The amino acid cysteine was treated as modified by iodoacetamide (+57 Da). The search took into account up to two oxidized methionines (+16 Da each) and one missed-cleavage site per peptide. DESCRIPTION OF THE ALGORITHMS Statistical Scoring. The goal of a score function is to quantify how well the candidate peptide P explains the experimental spectrum S, and to choose the peptide that best explains the spectrum. In our statistical score functions we evaluate the probability Ω(P, S) that P produced S. We are looking for the peptide Ph, for which

Ω(P, S ) ) max Ω(P, S ) Pdatabase

METHODS Data Set. The data set included 42 392 different spectra, collected during 46 runs of LC/MS/MS (henceforth protein

We describe a family of statistical models that enable the evaluation of Ω(P, S). Each model corresponds to a different score function. L denotes a member of the family and ΩL(P, S) the probability according to this member. Three properties characterize a family member L: the fragmentation model, fragment properties, and experimental observation. These properties are explained below. Fragmentation Model. In tandem mass spectrometry, molecules of a peptide are fragmented and both the m/z and abundance (intensity) of their fragments are measured. Many types of fragments are known,21 and the fragmentation model determines which of the known types will be considered for any candidate P. The fragmentation model on which it relies charac-

(20) Eriksson, J.; Chait, B. T.; Fenyo, D. Anal. Chem. 2000, 72, 999-1005.

(21) Papayannopoulos, I. A. Mass Spectrom. Rev. 1995, 14, 49-73.

436

Analytical Chemistry, Vol. 75, No. 3, February 1, 2003

terizes a statistical model. Below we describe the fragmentation model we used. A peptide is a sequence of n amino acids a1, ..., an, with masses m(ai). The mass of a peptide is



mp ) m(H2O) +

fragment typea

m(ai)

i

where m(H2O) = 18 is the additional mass of the peptide’s N and C terminals. When a singly charged peptide undergoes fragmentation between amino acids ai and ai+1, mass spectrometers may detect one of the possible two singly charged fragments termed bi and yn-i. These are called the principal fragments. The fragment bi includes the N terminal and the amino acids a1, ..., ai; its mass is given by i

m(bi) ) 1 +

Table 1. Fragment Types Included in Our Fragmentation Model

∑ m(a ) j

yn (full ionized peptide) y y+1* y - H2O * y - NH3 * y - H2O - NH3 y - H2O - H2O doubly protonate y y or b residue mass y or b - 1 triply protonated y or b doubly protonated y or b - H2O bn (full ionized peptide - H2O) b b+1*

j)1

The second principal fragment, yn-i, includes the C terminal and the amino acids ai+1, ..., an; its mass is given by n

m(yn-i) ) 19 +



m(aj)

j)i+1

Our fragmentation model includes other fragment types associated with one of the principal fragments, and their m/z values are given according to the mass of their associated principal fragment. Table 1 contains a list of the fragment types included in our model. The correlation between the intensities attributed to stable isotopes of molecules is well documented22,23 and is used in peak extraction algorithms.24 The notion of correlated fragments can be extended to any type and number of fragments, so the fragmentation model may include groups of fragments. For example, the fragments y and y + 1 may be grouped into a pair. Fragments of the “unexplained” type were not included in the fragmentation model. Given a candidate P, any experimental observation that is not attributed to one of its fragments according to the fragmentation model is attributed to a fragment of type “unexplained”. Fragment Properties. Fragments are characterized by properties such as type, mass, charge, parent peptide mass, parent peptide chemical composition, and fragment chemical composition. One can think of many other properties to help characterize a fragment. A statistical model L is distinguished by the set of properties it uses. We denote by g this set of chemical properties. As examples, we can consider the property of fragment type, denoted by

g(t) ≡ (fragment type ) t)

(1)

A set, which takes into account the fragment’s m/z25, is denoted (22) Watson, J. T. Introduction to mass spectrometry, 3rd ed.; Lippincott-Raven: Philadelphia, 1997; Chapter 7. (23) Yergey, J. A. Int. J. Mass Spectrom. Ion Phys. 1983, 52, 337-349. (24) Caroll, J. A.; Ronald, C. B. Rapid Commun. Mass Spectrom. 1996, 10, 16831687.

b - CO (a fragment) * b - H2O - NH3 b - H2O - H2O doubly protonated b “unexplained” fragment

fragment m/z, in terms of mass of the corresponding y or bb peptide mass + 1

fragment detection probabilityc

y+1 y - 18 y - 17 y - 35 y - 36 (y + 1)/2 y + 56 y-1 (y + 2)/3 (y - 17)/2

0.049 0.62 0.51 0.23 0.26 0.14 0.12 0.19 0.1 0.13 0.11 0.13

peptide mass - 17

0.32

b+1 b - 18 b - 17 b - 28 b - 35 b - 36 (b + 1)/2 not associated to y or b

0.5 0.4 0.33 0.32 0.18 0.19 0.16 0.15 0.047

aPeptide fragment types. Each row stands for a separate fragment type. The symbol y represents all y fragments except yn (yn ) full ionized peptide including terminals). The symbol b represents all b fragments except bn (bn ) full ionized peptide - H2O). In cases where the name indicates “y or b”, the fragment types associated with the y and b principal fragments were treated as one. Fragments marked by (*) were considered as correlated pair with their associated principal fragment. bFragment m/z, given in terms of the mass of the associate principal fragment. In cases where the name indicates “y or b”, b should replace y in the mass row to obtain the m/z value of the fragment associated with the b fragment. cThe probability to detect a peak attributed to the fragment type (Prob(relative intensity >0|fragment type)).

by

g(t, γ) ≡ (fragment type ) t, fragment m/z ) γ) (2) Given that the peptide’s mass is mp, a set that uses the ratio (m/ z)/mp is

g(t, γ, r) ≡ (type ) t, m/z ) γ, (m/z)/mp ) r) A set of properties characterizing two grouped fragments denoted by 1 and 2 is

g(t1, γ1; t2, γ2) ≡ (g(t1, γ1), g(t2, γ2)) Experimental Observation. The choice of observation, which is extracted from the experimental spectrum, is the third characteristic of L. We denote such an observation by b. A possible observation extracted from a certain segment in the spectrum is

b1 t (measured relative intensity in segment > Ic) or Analytical Chemistry, Vol. 75, No. 3, February 1, 2003

437

b2 t (measured relative intensity in segment e Ic)

(3)

where Ic is a relative intensity threshold. For example, using Ic ) 0, b1 is a possible experimental observation extracted from the m/z segment marked by y in Figure 1 (for 947 e m/z e 948). Another possible observation from a segment is

b(I) ≡ (measured relative intensity in segment ) I)

(4)

An alternative experimental observation extracted from the m/z segment marked by y in Figure 1 is b(I ) 90) (see comment a in Figure 1). A possible experimental observation regarding two grouped segments marked by R and β is b(IR, Iβ) ≡ (relative intensity in segment R ) IR, relative intensity in segment β ) Iβ). For example, from the segments marked by y and y + 1 in Figure 1, we can extract the information b(IR ) 90, Iβ ) 4). Alternative statistical models may rely on these alternative observations from the same segment. Computation of the Statistical Score. To evaluate ΩL(P, S), the probability that a candidate peptide P produces the experimental spectrum S, according to a model L, we divide the m/z axis of S, in the interval between the lowest and highest measured values of m/z, into NS segments of equal size. If, according to our fragmentation model, a fragment of the candidate P is expected in a certain segment, this segment will be labeled by the set of properties of the expected fragment, according to the model L. If no fragment of P is expected in a certain segment, this segment is labeled with a set of properties that includes the fragment type “unexplained” (UE). We denote by gi(P, L) the properties by which the segment i, 1 e i e NS, is labeled. An example of a labeled experimental spectrum appears in Figure 1 (see comment b in Figure 1). Denoting by bi(L) the experimental observation in the segment i, we attribute bi(L) to a fragment with properties gi(P, L). If the experimental observation in segment i is attributed to an ungrouped fragment, we treat this segment as an independent experimental event. According to our model, the probability for this event is Prob[bi(L)|gi(L)], which is the probability that a fragment of properties gi(P, L) produced the experimental observation bi(L) (see comment c in Figure 1 for examples). In the following sections, we describe how these probabilities can be evaluated. If the experimental observation measured in two segments, i1 and i2, is attributed to a grouped pair of fragments in the fragmentation model, the measurement in these segments is a single independent event. Denoting by gi1(P, L) and gi2(P, L) the expected fragments’ properties in segments i1 and i2, respectively, and by bi1(L) and bi2(L) the corresponding experimental observations, the probability for this event is Prob[bi1(L), bi2(L); |gi1(L, P), gi2(L, P)]. See an example in comment d in Figure 1. Denoting the number of independent events by Ne(L) (Ne ) Ns if there are no grouped fragments), the probability that the candidate peptide P produced the experimental spectrum S is Ne(L)

ΩL(P, S) )

∏ Prob[b (L)|g (L, P)] i

i

i)1

438

Analytical Chemistry, Vol. 75, No. 3, February 1, 2003

(5)

Figure 1. Section of an experimental mass spectrum for 942 e m/z e 952. (a) In relation to the segment marked by y, several alternative experimental observations can be made: (i) b1 ) (I > 0), (ii) b(Iy) ) (Iy ) 90), and paired with the segment labeled by y + 1, (iii) b(Iy, Iy+1) ) (Iy ) 90, Iy+1 ) 4). These observations are used in different statistical models. (b) Score calculation step 1: The spectrum is divided to segments of size 1 Da. Each section is labeled by the properties of the expected fragment of the candidate peptide P. Here P ) TFAETATAGEWQGK. The labeling property here is g(t) ) (fragment type ) t) (eq 1). If according to the fragmentation model no fragment of P is expected in a certain segment, this segment is marked as unexplained ()UE). (c) According to a statistical model that takes into account the property eq 1, and the experimental observation of eq 3, the probability that the event occurs in the segment labeled by y is Prob(I > 0|fragment type ) y), which is the probability to detect a peak attributed to a y fragment. According to another statistical model, which takes into account the fragment properties in eq 2, and the experimental observation of eq 4, the probability for the same event is Prob(I ) 90| fragment type ) y, fragment mass ) 947 Da), which is the probability for y fragment with m ) 947 Da to produce a peak with relative intensity I ) 90. (d) Using a fragmentation model in which the fragment types y and y + 1 are grouped, the probability for the event occurring in the corresponding pair of segments, labeled by 1 and 2, is Prob(I1 ) 90, I2 ) 4|fragment 1 ) y, fragment 1 mass ) 947 Da, fragment 2 ) y + 1).

where i is an observation index. Here we consider two types of statistical scorers: type a, where all fragments are assumed independent, and type b, where grouped pairs of fragments are considered. For type a, each segment is a distinct event and eq 5 becomes NS

ΩLa(P, S) )

∏Prob[b (L)|g (L, P)] i

i

(6)

i)1

where NS is the number of segments, gi(L, P) is the properties of the fragment by which the segment i is labeled, and bi(L) is the experimental observation measured in the segment i. For scorers of type b,

Npairs

ΩLb(P, S) )

∏ Prob[b

i1(L),

bi2(L)|gi1(L, P), gi2

i)1

NS - 2Npairs

(L, P)]



Prob[bj(L)|gj(L, P)] (7)

j)1

where Npairs is the number of grouped pairs of fragments, i1 and i2 denote the first and second members of the ith pair, and NS 2Npairs is the number of remaining segments. The probability that the spectrum was produced completely by unexplained fragments is NS

ΩL(UE, S) )

∏Prob[b (L)|g (UE, L)] i

i

i)1

where gi(UE, L) includes the fragment type “unexplained”. When gi(UE, L), and thus ΩL(UE, S), are P independent, it becomes more convenient to compute the ratio ΩL(P, S)/ΩL(UE, S). For scorers of type a, this ratio is

ΩLa(P, S) ΩL(UE, S)

N(P)

)

Prob[b(si, L)|gi(L, P)]

∏ Prob[b(s , L)|g (UE, L)] i)1

i

i

where N(P) is the number of expected fragments of the candidate peptide P, i denotes an expected fragment with properties gi(L, P), si is the segment in which the fragment i is expected, and b(si, L) is the experimental observation in the segment si. The final score in this case is computed by taking the logarithm of this ratio:

G(P) t score(P) ) ln

[

ΩLa(P, S)

]

ΩL(UE, S)

(8)

With the fragment property of eq 1 and the experimental observation of eq 3, eq 8 is reduced to the intensity and m/zindependent scorer used in ref 9. In practice we used eq 8, which includes contributions only from those segments that contained a fragment of the candidate peptide P. For an expected fragment with m/z ) γ, we considered the experimental observation in the range γ ( 0.5 Da. If an experimental peak was interpreted by more than one set of properties, we took all the sets into account.26 Assigning Significance Level to Peptide Identification and Identification Criterion. Below we explain the notion of Pvalue and the way it is used to assign a significance level to peptide identification. To the best of our knowledge this is the first time (25) We assume that any continuous numerical property or observation we mention is discretize and defined up to a segment size. Hence by saying, for example, that (m/z)fragment ) x, we mean (m/z)fragment [y, y + ∆y] and x[y, y + ∆y]. This allows us to use probabilities rather than probability distributions. In practice, our m/z resolution is 0.5Da, and relative intensity resolution is 2%. (26) For the statistical model defined by eqs 2 and 4, the correct way to handle a situation where the fragments R and β are expected in the segment s IS with the measured intensity Is is to calculate Prob(Is| (R ∪ β) ) ∑I)0 Prob[I | g(R)]Prob[Is - I]g (β)].

Figure 2. Stages in assessing Pvalue[G(Ph)]. (a) The distribution of scores (score tG) for the Nmax peptides retrieved from the database, F(G), plotted for the statistical scorer (eq 8). (b) Dashed line: Γ(G) ) ∑q∞)G F(q), which is the number of peptides scored higher or equal to G. Solid line: approximation of Γ(G) by curve fitting for 50 e Γ(G) e Nmax/4. (c) Solid line: extrapolation of Γ(G) to get Pvalue[G(Ph)]. In this case, Log{Pvalue[G(Ph)]} ) -5.7. (d) Another example of extrapolation. Here Log{Pvalue[G(Ph)]} ) -0.06.

this notion is used in mass spectrometry to compare score function performances. Suppose we ran a database search procedure for a given experimental spectrum and obtained the resulting set of scores, G(P), of the peptides P in the database. We denote by F(G) the distribution of scores in the database: that is, how many peptides have a given score. Ph denotes the highest ranked peptide. We define Pvalue [G(Ph)] as the probability that Ph is a false identification. In other words, given a large set of spectra with a similar F[G] (the definition of similar will be made precise in the following), Pvalue [G(Ph)] is the rate of true negative peptides receiving a score higher or equal to G(Ph). Given a probability threshold Pcutoff, an identification is considered statistically significant if Pvalue [G(Ph)] e Pcutoff. To evaluate Pvalue [G(Ph)], we used the empirical scorerindependent method suggested in ref 20 and used by Levitt and Gerstein.19 Our implementation of the method proceeds by the following stages: 1. Nmax peptides are retrieved from the database, and F(G) is computed (Figure 2a). It is assumed that all the Nmax peptides but the highest graded one are true negatives. ∞ 2. Γ(G) ) ∑q)G F(q) is computed (Figure 2b, dashed line). Γ(G) is the number of peptides P in the database such that G(P) g G. 3. To avoid high-grade fluctuations, the values of Γ(G) are approximated for 50 e Γ(G) e Nmax/4, by Γ(G) = exp[f(G)], where f(G) is a second-degree polynomial (Figure 2b, solid line). Denoting by G50 the score of the 50th scored peptide, for G >G50, exp[f(G)] is an approximation to the average, over many spectra with a similar F[G]G 0.5. (iii) number of amino acids >10. (b) A supported peptide, where peptide identification is done using criterion a. In the learning stage of our algorithm, this automatic procedure was performed on our entire set of data. A total of 2040 supported peptides were used to build the fragment statistics, from which 97% were supported by our in-house implementation of the cross(27) Denoting by nd the number of potentially detectable principal fragment, and by mh and ml the highest and lowest measured masses in the spectrum, respectively; we estimate: nd ≈ 2(mh - ml)/maa, where maa = 120 Da is the average mass of an amino acid. We estimate by 2 (approximate sum over detection probabilities of confirming ions) the number of explained peaks per potentially detectable principal fragment. The rest of the peaks are considered unexplained. Denoting their number by nu, the probability to detect an unexplained peak is approximated by nu/NS.

correlation algorithm. The final statistical scorer supported 3025 peptides. The scoring algorithms were also tested on our entire set of data. Since the Pvalue test characterizes the scorer’s performance regardless of the origin of its parameters, this is a valid procedure. Because our learning set is large enough to faithfully represent the instrument modus operandi, we argue that identification rates would not be significantly different had the test set excluded the learning set. Cross-Correlation Scoring Algorithm. We tested our score function by comparing it with an existing algorithm. We chose the cross-correlation algorithm, a widely used scoring algorithm that is part of the SEQUEST scorer of ref 2. We implemented the cross-correlation algorithm in our program to enable unbiased comparison of the two scorers. The implemented version was used to obtain all our results. It should be noted that our implementation does not include SEQUEST’s preliminary scoring; its limitation of the number of peptides graded by the cross-correlation scorer to 500 makes it difficult to combine with the Pvalue evaluation method we used. Hence, this paper does not provide a direct comparison with SEQUEST but only with its isolated crosscorrelation part. The cross-correlation scorer is carried out in three stages: 1. Preprocessing of the experimental spectrum: A 10-Da region around the precursor ion mass is removed. The spectrum is divided into 10 equal mass sections of range (highest mass lowest mass)/10. The maximal relative fragment intensity in each mass section is normalized to 50.0. 2. Spectrum prediction: Relative intensity of 50 is assigned to the predicted m/z of y and b fragments. Relative intensity of 25 is assigned to the m/z values of y ( 1 and b ( 1. The natural loss of H2O, and NH3, as well as type a fragments (b - 28), are assigned the relative intensity of 10. 3. Final score calculation: The cross correlation between the n-1 experimental and predicted spectrum, Rτ ) ∑i)0 x[i]x[i + τ] is 28 calculated using fast Fourier transform. The final score attributed to each candidate peptide is R0 minus the mean of Rτ over the range -75 < τ < 75. In stage 3, we took the integer part of the m/z values. We found this practice to produce better identification rates than rounding the m/z values to the nearest integer. When two intensities shared the same integer value, the higher was chosen. RESULTS AND DISCUSSION Probabilities Prob(b|g). The probability of a fragment with properties g to produce experimental observation b is Prob(b|g). These probabilities provide an insight into the connection between fragment properties and the statistical characteristics of the associated spectral peak. The results below represent spectra chosen by the criteria specified in the algorithmic section, in particular criterion 3 in eq 10. Uncorrelated Fragments. Assuming the fragments of a peptide are uncorrelated, we get the score function of eqs 6 and 8. For this case, we computed three sets of probabilities: (1) Prob(relative intensity > 0 | fragment type ) t) (eqs 1 and 3, with Ic ) 0); (2) Prob(relative intensity ) I |fragment type ) t, (m/z)/ (28) Press, W. H.; Teukolsky, S. A.; Vettering, W. T.; Flannery, B. P. Numerical Recipes; Cambridge University Press: New York, 1994; Chapter 12.

Figure 3. Normalized probability for y fragments with mass/peptide mass ) r, to have relative intensity I, Prob(relative intensity ) I|type ) y, mass/peptide mass ) r). The probability is plotted here for I > 0. The normalization is such that, for any r, ∑I100 )0 Prob(I | y, r) ) 1. For this instrument, y fragments are, on the average, the most intense. For b fragments, the high-intensity part of the distribution is much weaker.

Figure 4. Probability not to detect a y fragment (zero intensity) as a function of the relative ion mass, Prob(relative intensity ) 0|y, mass/ peptide mass). It is much more likely to detect y fragments from the middle of the peptide (mass/peptide mass ) 0.5).

peptide mass ) r); (3) Prob(relative intensity ) I |fragment type ) t, m/z ) γ) (eqs 2 and 4). The probability to detect a peak attributed to the fragment type t is Prob[relative intensity > 0 | fragment type ) t]. The third column in Table 1 shows the probability for the fragment types we incorporated in our statistical models. Several fragment types, such as y + 1, b + 1, and multiple protonated fragments, were never previously considered in this context. Of those, y + 1 and b + 1 have a relatively very high detection probability, almost as high as their corresponding principal fragments. Such fragments contribute significantly to the score function. The correlation between the intensities of y and y + 1 fragments is discussed in the following section. The probability for fragment of type t and (m/z)/peptide mass ) r to produce a peak with relative intensity I is Prob(relative intensity ) I|fragment type ) t, (m/z)/peptide mass ) r). Figure 3 shows this probability for t ) y and I > 0. Figure 4 shows this probability for t ) y and I ) 0. Considering the intensity dependence of the probability distribution, we find that y and other Analytical Chemistry, Vol. 75, No. 3, February 1, 2003

441

Figure 5. Probability of an “unexplained” (random) fragment of a given mass to have relative intensity I, Prob(relative intensity ) I| fragment type ) “unexplained”, m/z). The probability is plotted here for I > 0. The probability distribution practically vanishes for relative intensity I > 20%. For any m/z, Prob(I ) 0|t ) unexplained, m/z) ) 0.95.

fragments have no characteristic intensity and that their probability distribution is smeared. The distribution is high for low intensities and there is no valid low-intensity threshold. For this instrument. y fragments turned out to be on the average the most intense. For all other fragments, including b fragments, the high-intensity part of the distribution disappears, including the peak at I ) 100%. As for the dependence on the ratio fragment mass/peptide mass, peptides tend to be fragmented in the middle (r ) 0.5). This is pronounced in high relative intensities, as well as in the probability for zero intensity (Figure 4), which has a distinct minimum in the range 0.4 e r e 0.6. Other fragments demonstrate similar dependency. The probability for fragments of type t and m/z ) γ to produce a peak with relative intensity I is Prob(relative intensity ) I|fragment type ) t, m/z ) γ). For t ) y and 300 e m/z e 2000, the probability distribution is exemplified by Figures 3 and 4, and the high-intensity part, peak at I ) 100%, and zero-intensity minimum are centered around m ) 1000 Da. Figure 5 shows Prob(I|t ) “unexplained”, m/z). The probability distribution for unexplained fragments decays fast along the relative intensity axis and practically vanishes for I > 20%. For I ) 0 and any m/z, Prob-

(I|t ) unexplained, m/z) ≈ 0.95. Figures 3 and 5 tell us that most of the peaks in the spectrum have low relative intensity so lowintensity thresholds cannot distinguish “explained” from “unexplained” peaks. In contrast, we see that the high-intensity peaks are regularly explained by the correct peptide. Correlated Fragments. We are currently assessing the effect of including information about correlated fragments on scorer performance; here, we only show that such information is meaningful. Table 1 lists the fragment pairs tested for possible correlations. Of that list, the strongest correlation exists in the two pairs y, y + 1 and b, b + 1. It is difficult to notice the correlation when plotting the probability Prob[Iy, Iy+1 | my, my + 1], which is the probability to detect the pair of intensities Iy, Iy+1, given that the mass of the y fragment is my. The correlation is better displayed in Figure 6a, where the conditional probability Prob(Iy+1 | Iy) is plotted. This is the probability for a peak attributed to the y + 1 fragment to have relative intensity Iy+1, given that the y fragment has relative intensity Iy. The conditional probability can be incorporated into the scorer in eq 7 using

Prob(Iy, Iy+1|my, my + 1) ) Prob(Iy+1 | Iy)Prob(Iy|my)

The probability distribution is plotted for y fragments in the range of 700 Da e m e 1400 Da. Although it is smeared for high values of Iy, the positive correlation between the two intensities is clear. The distribution decays quickly for Iy+1 > Iy. Figure 6b shows the linear correlation between Iy and the expectation value- 〈Iy+1〉. One of the main advantages of the automated procedure for parameter learning and our statistical family of score functions is that such experimental observations can readily be identified and incorporated into the scorer. Figure 6a shows that additional data or an additional step of artificial smoothing is needed in order to make the most of this result. Performances of Scorers. In this section, we compare the performance of four score functions by measuring the number of peptides that a scorer identifies at a given error rate. This test is performed for our entire set of data, which includes 42 392 spectra. The tested scorers are (1) the initial score function used to collect the fragment statistics, (2) the cross-correlation algorithm of Eng et al.,2 and (3) a statistical scorer that includes information about uncorrelated fragments only (eq 8) and depends only on fragment type. This scorer takes into account the properties of eq 1 and

Figure 6. (a) Prob(Iy+1|Iy) ) probability for the y + 1 peak to have relative intensity Iy+1 (int. ) intensity), given that the y fragment has relative intensity Iy. y fragments are in the range 700 e my e 1400. The probability distribution decays fast for Iy+1 > Iy. (b) The average relative intensity Iy+1 (〈Iy+1〉), for a given Iy. 〈Iy+1〉 shows linear correlation with Iy. 442

Analytical Chemistry, Vol. 75, No. 3, February 1, 2003

Table 2. Comparison between Our Statistical Intensity (SI) Score Function of Eq 8 and the Cross-Correlation Algorithm (CC) of Ref 2 -log(Pcutoff)

no.of peptides (SI)a

no. of peptides (CC)b

mutual peptidesc

SI supported by CCd

CC supported by SIe

disagreementsf

peptides/ protein SIg

peptides/ protein CCh

2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

2230 1885 1564 1275 1022 801 632 513 415 338 289 252 210 183 156 137 116 102 85

1824 1419 1004 689 473 313 195 116 61 31 12 9 6 2 1 1 0 0 0

1650 1299 892 575 367 234 136 79 39 13 6 3 2 1 1 1 0 0 0

2033 1789 1515 1243 998 785 621 507 411 334 286 249 209 182 156 137 116 102 85

1734 1396 997 684 471 312 194 115 60 30 12 9 6 2 1 1

12 7 4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

5.0 5.4 5.2 4.7 4.2 3.7 3.2 2.9 2.6 2.3 2.2 2.1 2.0 1.8 1.8 1.7 1.7 1.6 1.5

4.9 5.0 4.2 3.6 3.3 2.7 2.3 1.9 1.6 1.3 1.1 1.1 1 1 1 1

a Identified peptide with P b c value e Pcutoff for SI. Identified peptide with Pvalue e Pcutoff for CC. Peptides identified by SI and CC for Pvalue e Pcutoff. Peptides identified by SI and supported by CC. See algorithmic section for the definition of a supported peptide. e Peptides identified by CC and supported by SI. f Spectra identified with Pvalue e Pcutoff, where SI and CC disagreed on the identification. The definition of disagreement: if PSI and PCC are the identified peptides by SI and CC, respectively, and Naa is the number of amino acids, disagreement exists if Naa(PSI) * Naa(PCC) or N(different aa) > Naa(PSI)/10. The total number of times PSI * PCC for Pcutoff )10-2 was 45. g Average number of identified peptides with Pvalue e Pcutoff per distinct group of proteins SI. See algorithmic section for the definition of a distinct group of proteins. h Average number of identified peptides with Pvalue e Pcutoff per distinct group of proteins CC. d

Figure 7. Comparison of scorers’ performances. The graph shows the number of identified peptides (out of 42 392 spectra), with Pvalue e Pcutoff as a function of -log(Pcutoff), for 10-20 e Pcutoff e 10-2. False positive frequency ) Pcutoff. The statistical scorer without intensities depends on the fragment type only (eqs 1 and 3 with Ic ) 0). The intensity-based statistical scorer tested here is based on eq 8 (uncorrelated fragments) and has m/z-dependent probabilities (eqs 2 and 4). The intensity-based statistical score function has the advantage over the other score functions, an advantage that grows with Pcutoff and hence with lower error rates.

the experimental observations of eq 3 with Ic ) 0. Fragment types and their detection probabilities appear in Table 1; (4) an intensitybased statistical scorer that includes information about uncorrelated fragments only (eq 8) and uses m/z-dependent probabilities Prob(relative intensity ) I |fragment type ) t, m/z ) γ) (eqs 2 and 4). Figure 7 shows our results for the number of identified peptides () number of identified spectra) with Pvalue e Pcutoff versus

Pcutoff for 10-20 e Pcutoff e 10-2. According to eq 9, a false identification (false positive) is expected on the average every 1/Pcutoff identified peptide. For all the scorers in the graph, valid identifications were characterized by the criteria in eq 10. The intensity-based statistical score function shows a definite advantage over the other scorers, proving the main claim of the paper: that experimental study with rigorous consideration of spectra properties leads to a significant improvement in the scorer’s performances. It also shows that parameter learning can be a fully automated process that starts from a simple identifying procedure and ends with a much better one. It is natural to ask, what is the correct Pcutoff? Since a typical LC/MS/MS run contains about 1000-2000 spectra for a single peptide identifying a single protein, Pcutoff ) 10-3 seems to be an appropriate choice. This will lead to an average of one or less false positive identifications per run. Table 2 compares the performance of the intensity-based statistical score function (SI, (4) above) with that of the crosscorrelation algorithm (CC, (2) above). For Pcutoff ) 10-3, the statistical scorer identified 32% more peptides than the crosscorrelation algorithm. The statistical scorer identified 91.5% and supported 98.4% of the peptides identified by the cross-correlation algorithm (see algorithmic section for the definition of a supported peptide). Thus, while almost no information is lost, a large amount of information is gained by working with the novel statistical score function. Note also that the cross-correlation algorithm supports 95% of the peptides identified by the statistical scorer (including the additional 32%). Hence, for the large majority of the additional identifications, the statistical scorer only magnified the difference between the peptide and the rest of the database. Analytical Chemistry, Vol. 75, No. 3, February 1, 2003

443

SUMMARY We have formulated a new score function for tandem mass spectrometry. It rigorously incorporates experimental aspects of the spectrum and theoretical knowledge about peptides and their fragments for a given experimental setup, including the type of machine and laboratory protocols employed. The relevant knowledge needed for the statistical model underlying the scorer is obtained by a fully automated procedure. This procedure opens the way to answering a variety of questions about the statistical aspects of experimental mass spectrometry and using the quantified results in the score function. The effect of the peptide’s chemical composition on the spectrum is one such question. We discussed some of the statistical aspects of the connection between fragments’ properties and the intensity of their corresponding peaks in the spectrum. We showed that inclusion of correlations between intensities attributed to different fragment types might further improve the scorer. Statistical inclusion of correlations between peaks’ intensities may be crucially important in applications of this formalism to peptide mass fingerprinting. Using a large experimental data set and a scorer-independent significance assessment method, we showed that the new scorer

444

Analytical Chemistry, Vol. 75, No. 3, February 1, 2003

performs considerably better than currently used scorers. The benchmarking procedure described here can serve as an efficient instrument for assessing potential improvements in mass spectrometry identification tools. ACKNOWLEDGMENT We thank Prof. Pavel Pevzner for his help in reviewing early versions of this paper and providing critical comments. We thank Assaf Wool, Rotem Sorek, Itamar Elem, Osnat Sela-Tavor, Hershel Safer, Jimmy Eng, and Bill Lane for valuable discussions. We thank the Smoler Protein Center at The Technion, and in particular Prof. Arie Admon, for the data set and a much-needed introduction to the experimental aspects of protein mass spectrometry.

Received for review June 26, 2002. Accepted November 12, 2002. AC0258913