Large-Scale Unrestricted Identification of Post-Translation

Larry L. David , William F. Loomis , Steven P. Briggs , and Vineet Bafna ... Attila Kertész-Farkas , Beáta Reiz , Roberto Vera , Michael P. Myer...
1 downloads 0 Views 265KB Size
Anal. Chem. 2007, 79, 1362-1368

Large-Scale Unrestricted Identification of Post-Translation Modifications Using Tandem Mass Spectrometry Moshe Havilio*

Notal Vision Limited, 5 Droyanov Street, Tel Aviv, 63143 Israel Assaf Wool

Compugen Limited, 72 Pinhas Rozen Street, Tel Aviv, 69512 Israel

TwinPeaks, a close variant of the SEQUEST protein identification algorithm, is capable of unrestricted, largescale, identification of post-translation modifications (PTMs). TwinPeaks is applied on a sample of 100441 tandem mass spectra from the HUPO Plasma Proteome Project data set, with full non-redundant human as a reference protein database. With a 3.5% error rate, TwinPeaks identifies a collection of 539 spectra that were not identified by the usual PTM-restricted identification algorithm. At this error rate, TwinPeaks increases the rate of spectra identifications by at least 17.6%, making unrestricted PTM identification an integral part of proteomics. Despite the great importance of post-translational modification (PTMs) for biological function, there are no suitable methods for their large-scale analysis.1 There are three major computational approaches to the identification of PTMs from protein tandem mass spectrometry. The first approach, introduced by Yates and collaborators in the early days of high throughput peptide identification2,3 is based on the restricted identification of a selected, possibly highly abundant set of modifications by multiple considerations of potentially modified database peptides. The number of treated PTMs is limited by the exponential growth in database size, and only known modifications can be considered. Most identification programs routinely use this approach to study the highly abundant modifications of cysteine and methionine (see, for example, ref 4 and below). The second approach, which is generally not PTM restricted, is based on de novo sequencing of a short section of the peptide (sequence tag), which is then used for mass unrestricted database search.5-7 The major drawback of this approach is its reliance on * Corresponding author. E-mail: [email protected]. (1) Mann, M.; Jensen, O. N. Nat. Biotechnol. 2003, 21 (3), 255-61. (2) Yates, J. R., 3rd; Eng, J. K.; et al. Anal. Chem. 1995, 67 (18), 3202-10. (3) Yates, J. R., 3rd; Eng, J. K.; et al. Anal. Chem. 1995, 67 (8), 1426-36. (4) Havilio, M.; Haddad, Y.; et al. Anal. Chem. 2003, 75 (3), 435-44. (5) Havilio, M. Automatic peptide identification using de novo sequencing and efficient indexing. Fifth International Symposium on Mass Spectrometry in Health and Life Sciences, 2001, San Francisco. (6) Frank, A.; Tanner, S.; et al. J. Proteome Res. 2005, 4 (4), 1287-95. (7) Tanner, S.; Shu, H.; et al. Anal. Chem. 2005, 77 (14), 4626-39.

1362 Analytical Chemistry, Vol. 79, No. 4, February 15, 2007

high quality de novo peptide sequencing, which is still one of the most challenging problems in proteomics. The third approach, which was recently introduced by Tsur et al.10 is based on spectral, Smith-Waterman-like alignment algorithm (MS alignment). This protocol is truly non-restrictive in the PTM sense, but it suffers from two major drawbacks, which make it unpractical for large-scale proteomics: (1) MS alignment is nonrestrictive with regard to PTM but is highly restrictive in terms of the protein database. It was demonstrated to run against three very small protein databases of 10, 20, and 37 proteins, with reference to three different data sets; this is in comparison to thousands of proteins that can be identified in a single MudPit run.9 (2) MS alignment includes several steps aimed at increasing the reliability of the identifications, yet the computational protocol used by Tsur et al.10 gives no information about the error rate of the identifications. We introduce here a new computational protocol that deals with these two limitations of MS alignment. We first show that with a slight modification, SEQUEST2,11 inherently contains all the necessary information to enable nonrestrictive PTM identification. We use the modified SEQUEST version in a program called TwinPeaks, which is not restricted by the set of potential PTMs and not limited by the size of the protein database. We applied TwinPeaks to a data set of 100441 spectra from the Human Proteome Organization (HUPO) Plasma Proteome Project (PPP), with the full nonredundant (nr) human as a reference protein database. With an error rate of 3.5%, we identify 539 spectra attributed to mass shifted peptides that were not identified by the restricted identification protocol. This is an increase of at least 17.6% in the rate of identified spectra as compared to the conventional restrictive identification procedure. PPP is a multi-center large-scale initiative aimed at comprehensive analysis of the protein constituent of the human plasma and serum.13 This project is not complete without a comprehensive (8) Pevzner, P. A.; Mulyukov, Z.; et al. Genome Res 2001, 11 (2), 290-9. (9) Washburn, M. P.; Wolters, D.; et al. Nat. Biotechnol. 2001, 19 (3), 242-7. (10) Tsur, D.; Tanner, S.; et al. Nat. Biotechnol. 2005, 23 (12), 1562-7. (11) Eng J. K., M. A.; Yates J. R., 3rd. J. Am. Soc. Mass Spectrom. 1994, 5, 97689. (12) Press, W.; Teukolsky, S., et al. Numerical Recipes; Cambridge University Press: New York, 1994. 10.1021/ac061515x CCC: $37.00

© 2007 American Chemical Society Published on Web 01/17/2007

analysis of the serum proteins’ PTM content. To the best of our knowledge, to date only restricted PTM analysis have been performed on serum proteins,14,15 making the present work a significant contribution to serum research. METHODS Experimental Data. Two data sets were downloaded from PeptideAtlas (http://www.peptideatlas.org/repository/). The first data set was used only to set thresholds for conventional (restricted) identification, as described below, and included 92 831 spectra from the ICAT samples Caco2vsEcoli and Trex_1 and 26 783 spectra from the sample OPDHumanSqCC. A full description of the data can be found following the links in PeptideAtlas. Spectra were produced using ESI-LC-MS/MS. TwinPeaks results were obtained on a total of 100 441 spectra from the PPP directories HUPO12_run31, HUPO12_run32, HUPO12_run33, and HUPO12_run34. A full description of the samples can be found in ref 13. In brief, the four directories correspond to three different preparations of plasma (no clotting) samples and one serum (after clotting) sample. Cysteines were reduced using iodoacetamide (+57 Da), albumin and immunoglobulins were depleted, and samples were fractionated by liquid chromatograpy. Protein Database. The reference protein database was full nr human (3098361 peptides), which, for technical reasons included nr bovine (additional 135356 peptides). Since the serum data was purely human, we expected no nonhuman identifications. In order to determine the error rate in a manner that will be described below, the nr human database was inflated with randomly sequenced proteins. The lengths of the random proteins were tuned so that their distribution was identical to that of the lengths of the NR human proteins. The number of random proteins was tuned so that the number of peptides after in-silico trypsin digestion, in the relevant range of peptide lengths (6-40 AA), was at least as large as the number of peptides originating from nr human in the same range of lengths. The final numbers of peptides in the unified database were 323 3717 and 334 6049 from NR human and the random sections, respectively. The conventional database search procedure, similar to the one used in ref 4, is able to take into account modified cysteine and up to two oxidized methionines per peptide. Computations were performed on a cluster of 20 CPUs working in parallel. The main computational optimization used was to group, in resolution of 3 Da, similar precursor-mass spectra, and to unify the treatment of each group’s spectra in relation to database candidate peptides. It takes around one min per spectrum per CPU to run the full TwinPeaks protocol described below. BASIC ALGORITHM (A) Basic scoring and computation of the mass of the PTM. SEQUEST’s basic principle is to compute, for each candidate database peptide, the cross correlation between a predicted spectrum (Spr) and the experimental spectrum (Sex). In order to reduce the effect of random spectral peak matching, SEQUEST computes the cross correlation between translations of the (13) Omenn, G. S.; States, D. J.; et al. Proteomics 2005, 5 (13), 3226-45. (14) Molina, H.; Bunkenborg, J.; et al. Mol. Cell. Proteomics 2005, 4 (5), 63750. (15) Ping, P.; Vondriska, T. M.; et al. Proteomics 2005, 5 (13), 3506-19.

experimental spectrum and the predicted spectrum: M

pr C(δ) ) 〈Sex δ S0 〉 )



ex Sm+δ Spr m

(1)

m)-M

where the experimental and predicted spectra are defined in the range [-M, M], and periodic boundary conditions are used. Fast Fourier transform (FFT)12 is used in order to compute C(δ) efficiently for any δ ∈ [-M, M]. SEQUEST uses C(δ) in the range δ ∈ [-75, 75] to assess the random matching between the experimental and predicted spectra. TwinPeaks uses C(δ) to identify potential PTMs. The basic principle is simple. Let us assume that the experimental spectrum is produced by a database peptide A1, ..., An, with a single addition of mass ∆ on the amino acid Ak, 1 e k e n. Any mass peak in the experimental spectrum that corresponds to a fragment of the peptide that include Ak, among which are yn, ..., yn-k+1 and bk, ..., bn, is translated by a mass ∆ with respect to the predicted spectrum. Any mass peak in the experimental spectrum that corresponds to a non-translated fragment matches a peak in the predicted spectrum. Hence, we expect two peaks in the graph of C(δ) versus δ: one in δ ) 0, which represents the contribution of experimental non-translated peaks, and one in δ ) -∆, which represents the contribution of the translated peaks. Figure 1 presents a few examples of correlation spectra with translated peaks. Figure 1a, for example, shows C(δ) for the peptide GITIDISLWKFETSK, which is identified (see below) with PTM of 42 Da on its middle lysine. The main stage of TwinPeaks protocol is to score each candidate database peptide by C(0) + C(∆), where ∆ is the translation value of the highest peak in a certain mass range. The highest score candidate spectra is assumed to be modified by a single mass -∆. If we put all peaks in the experimental and predicted spectra, down to intensity threshold, to be of a constant height of 1, TwinPeaks effectively counts the mass shift dependent-shared peaks similar to Pevzner’s spectral convolution algorithm.8 (B) The location of the PTM. A relatively high peak in the correlation spectrum C(δ) for a shift δ ) -∆ suggests that the corresponding database peptide has a PTM with mass ∆. But it does not tell us where along the peptide the PTM is located. The location of the PTM is determined by creating predicted spectrum for any possible location of ∆ along the candidate’s peptide’s amino acids chain and maximizing the usual SEQUEST score. TwinPeaks protocol determines the location of the mass shift only for the highest scored peptide in stage A above. COMPUTATIONAL PROTOCOL Stage 1: Conventional Spectra Identification. A conventional SEQUEST score was attempted for each of the spectra. The highest scored peptide was assigned a p value using the scores’ distribution. The p value was computed as described by Havilio et al.4 The y/b ratio defined by

y/b ratio ) no. of y and b detected fragments (2) no. of y and b fragments that can potentially be detected in the spectrum mass range was also computed. Nonmodified quality identifications were Analytical Chemistry, Vol. 79, No. 4, February 15, 2007

1363

Figure 1. (a) Correlation spectrum for the peptide GITIDISLWKFETSK (EF1A1_HUMAN). The mass shift of 42 Da indicates trimethylation. The optimal location of this mass shift is on the middle K. The 42 ( 1 Da mass shift was relatively abundant in our identifications set, occurring in 30 spectra. (b) Correlation spectrum for the peptide EAGGAFGKR (sp|Q9UII2|ATIF1_HUMAN). This peptide seems to have a mass shift of 71 Da on its left glycine. This shift might be due to sarcosyl (see http://www.abrf.org/index.cfm/dm.home?AvgMass)all). (c) Correlation spectrum for the peptide VVLVAVDKGVFVLNKK (CO3_HUMAN). The mass shift of 114 Da, with optimal location on the peptide’s c-terminal, indicates possible cleavage after the following N in the protein. Apparently this peptide carries other mass shifts (see Supporting Information). The 114 Da mass shift was identified only in this peptide but also in nine spectra. (d) Correlation spectrum for the peptide KVPQVSTPTLVEVSRNLGK (human serum albumin precursor). The peptide has a mass shift of 196 or 197 Da, with optimal location on the 10th or 12th amino acid. The 196 ( 1 Da mass shift was relatively abundant in our identifications set, occurring in 41 spectra. (e) Correlation spectrum for the peptide YSSCSTIFLDDSTVSQPNLKHTIK (gi|18766395|gb|AAL78999.1|AF465729_1). This peptide seems to have a mass shift of 68 Da on its right threonine. 1364

Analytical Chemistry, Vol. 79, No. 4, February 15, 2007

expected number of false human peptides can be determined by the number of red points northwest of the separator. Thus, the expected error rate in the green points is

ER )

no. of red points no. of green points

(5)

R was tuned until the desired error rate of about 4% was reached. In the conventional restricted identification of stage 1, the linear separator was

y/b ratio ) 0.1527 × [log10(p value)] + 0.9356

Figure 2. Error rate calculation for the conventional (PTM-restricted) identification procedure, performed on our first data set of 119 614 spectra (see Methods). Each spectra was given a point in the twodimensional plane corresponding to (log10(p value), y/b ratio) of the highest scored candidate peptide. For a given error rate, the best linear separator was computed (see text). For 4.05% error rate, the linear separator was y/b ratio ) 0.1527 × [log10(p value)] + 0.9356. In the PPP data (second data set), 3061 spectra were identified at this error rate.

excluded from the following TwinPeaks run. The definition of quality identification and the calculation of the error rate are discussed below. Stage 2: Analysis of the Non-TwinPeak Error Rate. The error rate was computed using the random peptides as negative controls. Database inflation by negative controls is a general method to get an empirical estimate of the false discovery rate. The non-TwinPeaks analysis was performed using the spectra of the first data set (see Methods). For each identification in stage 1 (conventional restricted identification), with p value e -1.5, a point in the coordinates (log10(p value), y/b ratio) was marked on a two-dimensional plane. Random peptides are shown in red, and NR human peptides are shown in green (Figure 2). In this twodimensional plane, a linear separator was sought between the red and green identifications, as follows. For each line, a red point northwest of the line was considered false positive, and a green point northwest of the line was considered true positive. A benefit function was defined:

B(R) ) TP - RFP

no. of red points no. of red points + no. of green points

TP )

no. of green points no. of red points + no. of green points

In the PPP data, the result of the restricted error rate analysis was 3061 identified spectra with an error rate of 4%. Only nonmodified spectra that were identified in the restricted identification process did not proceed to the next stage of the TwinPeaks phase. All other spectra participated in the following TwinPeaks analysis. Stage 3: TwinPeaks Run. Each run of TwinPeaks was dedicated to a pre-defined mass range in which mass diversions (such as PTMs) were detected. Two mass ranges were attempted in two runs: ∆ ∈ [-30, +100] Da, and ∆ ∈ [+100, +200] Da. In the second run, cysteines were assumed to be modified by iodoacetamide (+57 Da), which in fact allowed us to identify many doubly shifted peptides. The following computational procedure was repeated for each experimental spectrum and mass range: (1) Retrieval of candidate peptides from the database. Candidate peptides were retrieved in the relevant ∆ mass range from the precursor mass of the fragmented molecule, in each of the three attempted protonation states -1, 2, and 3. Regardless of the mass range, candidate peptides in the range of (1000 ppm were also retrieved from the possible precursor masses. On the average about 8 × 105 predicted spectra of candidate database random + nr peptide were scored against each experimental spectrum. (2) C(δ), where δ ∈ [-2000, 2000], was computed for each predicted spectrum with respect to the experimental spectrum. If mex and mdb are the precursor mass in a given protonation state and the mass of the candidate peptide, respectively, we set

∆ ) arg max C(δ):δ ∈ [mdb - mex - 5, mdb - mex + 5]

(3)

where R > 0, the price of false positive identification, is a parameter and

FP )

(6)

(4)

For a given R, the program looks, up to a certain resolution, for the linear separator that maximizes B(R). Since the number of random peptides exceeds the number of human peptides, the

that is, we set ∆ to be the mass shift in the highest peak in C(δ) in a range of (5 Da around the expected mass diversion. The score of the candidate peptide is Ω ) C(∆) + C(0). Candidate peptides in the range of (1000 ppm around the precursor mass were scored only by C(0); only the possibility of no mass shift was considered for these peptides. The fact that the protocol does not treat small mass shifts up to 3Da, which may be prevalent,10 is a drawback and will be repaired in future work. (3) The highest scored TwinPeaks peptide was given four scores: (i) A p value was computed according TwinPeaks’ scores’ distribution. The distribution was computed using the full set of (∼8 × 105) TwinPeaks scores. In general, the scores’ distributions Analytical Chemistry, Vol. 79, No. 4, February 15, 2007

1365

match the exponential fit used for the high scores region.4 This p value is referred to as TP_pvalue. (ii) The y/b ratio for the non-shifted highest scored peptide was calculated and stands for the number of detected non-shifted principal fragments. This ratio is referred to as TP_ybratio. (iii) The location of the mass shift was computed, and the modified peptide (i.e., the database peptide, where the mass modified amino acid, with mass mAA, is replaced by mAA + ∆) was re-scored using C(0) alone (regular SEQUEST score). This score was used to compute another p value using the distribution of SEQUEST scores of the peptides in the range of (1000 ppm from the possible protonation states. This p value, referred to as Pvalue, measures the quality of the identification as compared to the conventional scoring procedure. In this stage any other score function (e.g., MASCOT) can be used to score the highest TwinPeaks candidate. (iv) The peptide y/b ratio after the mass shift. This ratio is similar to (2), but with the shifted masses. It is referred to as yb_ratio. (4) For a best scored peptide to be considered a quality TwinPeaks identification the following conditions must be met: (i) At least one of the three following conditions: TP_pvalue < 0 and Pvalue < TP_pvalue; or TP_pvalue < -4; or Pvalue < -4. (ii) At least one of the two following conditions: (a) yb_ratio g 1.25 × TP_ybratio, or (b) yb_ratio g 0.65, and yb_ratio g 1.1 × TP_ybratio. This condition reflects our expectation to find more matching peaks with the mass shifted peptide. (iii) In addition to these conditions, any quality identification peptide should be scored best by more than one spectrum. An alternative to this last condition is discussed in Results. Stage 4: TwinPeaks Error Rate. The analysis of TwinPeaks error rate on the PPP data has two stages: (i) An analysis similar to stage 2 (error rate of the restricted identification) was performed on all the quality identification candidates of stage 3. Results are shown in Figure 3. The quality identifications with error rate of 2.6% obey the following: (a) y/b ratio g y/b ratio ) 0.048 × [log10(p value)] + 0.7180 and (b) log10(p value) < -1.8. After this stage there were 842 nr identifications and 22 random TwinPeaks identifications. (ii) Addition of spectra, where the highest scored peptide occurred, did not meet the quality identification criteria but an identical peptide was quality identified. This addition was done to the random peptides as well, and the final error rate was calculated by eq 5. This stage added 294 nr identified spectra and 18 random identified spectra (and no new peptides). After this stage the error rate slightly rose to 3.5%. RESULTS The computational protocol was applied to 100 441 spectra of the PPP data set, and 1136 and 40 spectra were identified by nr and random mass shifted peptides, respectively. Thus, the NR human spectra were identified with 3.5% error. Of the nr identified spectra, 618 were identified by peptides, most probably including modified cysteines or methionines. These peptides contained one or more of these amino acids, and ∆ ∈ [14, 18] or ∆ ∈ ([55, 59] Since we assumed in the second mass range (see stage 3) that every cysteine was modified, the minus sign in the second range was to account for unmodified cysteine. 1366 Analytical Chemistry, Vol. 79, No. 4, February 15, 2007

Figure 3. Error rate calculation for TwinPeaks performed on the PPP data, demanding two or more spectra per identified peptide. Each spectrum was given a point in the two-dimensional plane. Explanation for the graph is similar to Figure 2. Here the linear separator was y/b ratio ) 0.048 × [log10(p value)] + 0.7180. In addition to the linear separator, we also limited the quality TwinPeaks identification zone to p value < -1.8. There were 842 green points northwest of the separator and 22 red points, which means an error rate of 2.6%.

Figure 4. Error rate calculation for TwinPeaks, demanding only 1 spectrum or more per quality identification. Explanation for the graph is similar to Figure 2. Black: Very high quality identification zone, to identify single spectrum quality identifications. The linear separator was y/b ratio ) 0.033 × [log10(p value)] + 0.8740. Adding the demand log10(p value) < -3, the additional identifications (with respect to Figure 3) were one random and 28 nonrandom single-spectrum identifications. Brown: 4% error identification zone (like in Figure 2). Here the linear separator was y/b ratio ) 0.068 × [log10(p value)] + 0.8995. In addition to the linear separator, we also limited the quality TwinPeaks identification zone to yb_ratio > 0.45. Note the large difference between this quality identification zone and that of Figure 2. Hundreds of TwinPeaks identifications lie between the two lines.

Next, we relaxed demand (iii) above and settled for only one spectrum for each “quality identification”. We were interested in additions to the previous identifications, and assumed the worstcase scenario that attributed all the random “quality identifications” to single spectrum identifications. The resulting separation in this case is described in Figure 4. The linear separator was y/b ratio ) 0.033 × [log10(p value)] + 0.8740. Adding the demand log10(p

value) < -3, the additional identifications were one random and 28 nr single-spectrum identifications (note that the single random quality identification was in fact part of a triplet spectra “identifying” the same random peptide). Hence, the total numbers were 1164 and 41 nr and random identifications, respectively, which maintains the final error rate at 3.5%. The final number of cysteines/methionines modifications was 625, of which 454 were also identified in the restricted process. Thus, TwinPeaks increased the rate of spectra identification by 1164/(3061-454) ) 45% if the trivial modifications of cysteines/methionines are included, and by 539/3061 ) 17.6% if they are not included. Figure 4 also shows the TwinPeaks “quality identification” zone for at least one spectra and a 4% error rate (the same error rate as in Figure 1). The linear separator becomes y/b ratio ) 0.068 × [log10(p value)] + 0.8995. Note the difference between the two quality identification zones in Figures 2 and 4. Hundreds of spectra lie between the two linear separators, indicating that PTM analysis is useless without a careful quality control. The full set of our identified collection can be seen in the Supporting Information, divided into three files: (a) results of restricted identification (restricted_ids); (b) peptides with modified cysteine/methionine (C_or_M_2ormore_final); and (c) peptides with mass shifts that are not attributed to cysteine/methionine (good_ids_2ormore_final). Files b and c include, in addition to the two or more spectra identifications, the high quality single spectrum identifications. It is outside of the scope of this paper (and the ability of the authors) to explain the origin of the various mass shifts. Among the 539 noncysteine/methionine modified peptides, we described three examples. The peptide GITIDISLWKFETSK (Figure 1a) appears to have a mass difference of 42 Da on the middle K. This peptide comes from EF1A1_HUMAN, an elongation factor. Although this protein is cytoplasmic, it is identified in plasma: in our analysis, one peptide from this protein was identified with a good score; in HUPO, more peptides were identified relying on data from better spectrometers. The Swissprot entry of this protein contains several di- and tri- methylations on lysines in the protein, with the comment (by similarity). One of these (position 79) is exactly the middle K in our peptide. The mass shift of 42 Da occurred another 7 times in our set of identifications. The correlation spectrum for the peptide EAGGAFGKR (sp|Q9UII2|ATIF1_HUMAN), appears in Figure 1b. This peptide seems to have a mass shift of 71 Da on its left glycine. This shift might be due to sarcosyl (see http://www.abrf.org/index.cfm/ dm.home?AvgMass)all). The peptide VVLVAVDKGVFVLNKK (Figure 1c) is a possible example of a non-tryptic peptide. It appears to have a mass difference of 114 Da on its c terminal, which matches the amino acid N that is next to the peptide in the protein sequence. This peptide was identified with several other mass shifts (68 and 196 Da) in its c terminal. Two additional PTM examples are given in Figure 1d and e. DISCUSSION This paper documents the first large-scale high-throughput non-restricted identification of PTMs. No assumptions were made about the PTM content of the tissue or the protein content of the data. We showed that in such realistic circumstances, careful estimate of the error rate is essential.

Computational PTM analysis might be improved in several ways. First, it might be possible to detect n > 1 mass shifts by referring, in addition to the peak in δ ) 0, to the highest n peaks in C(δ) where their translations sum roughly equals the total known candidate peptide mass shift. Allowing a single mass shift in the identification process increases the false discovery rate substantially with respect to no mass shift identification process, in the same quality-identification criterion. Similarly, multiple mass shifts consideration is expected to increase the false discovery rate considerably; hence it is probably best to be attempted in succession to the no- and single-modification discovery stages performed here. Therefore multiple mass shift consideration protocol will probably not replace TwinPeaks but only complement it. Second, it is possible to employ likelihood-based score function, shown recently to improve SEQUEST identification rate.4,16 It is, of course, possible to use likelihood-based score function for best (or more) scored TwinPeaks peptide, that is, to get p values (see section 4 in stage 3 above). The question is how to incorporate likelihood-based score function in the TwinPeaks scoring. Below we show that such statistical scorers can be used through the same cross-correlation/FFT formulation. For simplicity, we assume two sets of n ) the number of fragment types in the n fragmentation model, probabilities: {p(i ) fragment type)}i)1 for fragment detection, and the corresponding {1 - p(i ) fragment n type)}i)1 for the fragment to be undetected. The log likelihood spectrum can be calculated using the following stages: (1) Divide the experimental and predicted spectra into bins. In a separate synthetic spectrum, put a peak of height one in every bin with positive intensity peak in the experimental spectrum. (2) In the predicted spectrum, put a peak of size -log[p(ion type)] in every expected location of each ion type fragment. A peak of size -log[p(unexplained ion type)] should be placed in every bin where no fragment is expected. (3) Calculate the predicted/experimental cross correlation spectrum, C(detected). (4) In a second synthetic spectrum, put a peak of height one for every bin with zero intensity in the experimental spectrum. (5) In the second predicted spectrum, put a peak of size -log[1 - p(ion type)] in every expected location of each ion type fragment. A peak of size -log[1 - p(unexplained ion type)] should be placed in every bin where no fragment is expected. (6) Calculate the predicted/experimental cross correlation spectrum, C(nondetected). (7) Log-likelihood spectrum ) -[C(detected) + C(nondetected)]. Similarly, for a statistical model with k intensity stages, the log-likelihood spectrum can be calculated with k applications of the FFT. A third way to possibly improve computational PTM analysis is to use machine learning classifiers to distinguish identifiable from unidentifiable spectra.17 For TwinPeaks, such algorithms can reduce the error rate by filtering in advance spectra that are most likely to be “identified” by random peptides. We have described a PTM protein identification algorithm that is truly non-restricted both in the mass shift and the protein (16) Cannon, W. R.; Jarman, K. H.; et al. J. Proteome Res. 2005, 4 (5), 1687-98. (17) Bern, M.; Goldberg, D.; et al. Bioinformatics 2004, 20 (Suppl. 1), I49-54.

Analytical Chemistry, Vol. 79, No. 4, February 15, 2007

1367

database size. The algorithm works by extracting extra information out of the basic SEQUEST algorithm. Rigorous analysis of PTM identification error rate showed that extra care should be taken in the definition of quality PTM identification. We added valuable PTM and mass shift information to the human serum analysis. By its simplicity and careful error rate analysis, TwinPeaks makes unrestricted PTM identification an integral part of proteomics. ACKNOWLEDGMENT We thank Aviad Kipnis for collaboration and valuable discussions. SUPPORTING INFORMATION AVAILABLE Three files containing our PPP identifications: (a) restricted_ids: results of 3061 dta files identified in our first stage of restricted identification. The identification criterion is that of Figure 2. The data given for each dta file is the identified peptide and identification P value and y/b ratio. (b)

1368

Analytical Chemistry, Vol. 79, No. 4, February 15, 2007

C_or_M_2ormore_final: 625 TwinPeaks identification of peptides with modified cysteine/methionine. The identification criterion is that of Figure 3 (618 spectra) or of Figure 4 (additional 7 spectra). The data given for each spectrum is the identified peptide, TP_pvalue, TP_yb_ratio, P value, y/b ratio, predicted mass shift (mass_delta), predicted location of the shift (location), and whether the identification is supported or not. Note that the location starts from 0 (N-terminal). (c) good_ids_2ormore_final: 539 TwinPeaks identification of peptides, which includes mass shifts other than the conventional modified cysteine/methionine. Identification criteria and data are the same as in (b). This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review August 15, 2006. Accepted November 16, 2006. AC061515X