Anal. Chem. 2007, 79, 7439-7449
Comprehensive Phosphorylation Site Analysis of Individual Phosphoproteins Applying Scoring Schemes for MS/MS Data
Andreas Schlosser,* Jens T. Vanselow, and Achim Kramer
Laboratory of Chronobiology, Charité Universitätsmedizin Berlin, Hessische Strasse 3-4, 10115 Berlin, Germany
We have developed novel scoring schemes for the identification of (phospho)peptides (PeptideScore) and for pinpointing phosphorylation sites (PhosphoSiteScore) using MS/MS data. These scoring schemes have been developed for the in-depth analysis of individual phosphoproteins, not for large-scale phosphoproteomic-type data. The scoring schemes are implemented into the new software tool Phosm, which provides a concise and comprehensive presentation of the results. For development and evaluation of these schemes, we have analyzed ∼500 phosphopeptide MS/MS spectra, most of them nontryptic peptides. The novel scoring schemes turned out to be very powerful, even with CID MS/MS spectra of very low quality. Many phosphopeptides and phosphorylation sites that remained unassigned in our LC-MS/MS data sets with Mascot could be identified with Phosm. Especially the number of identified multiply phosphorylated peptides could be significantly increased. The applied scoring parameters are described, and the scoring for several selected examples of phosphopeptides is discussed in detail. Furthermore, a new and simple nomenclature for all types of phosphorylated fragment ions is introduced in this publication. Posttranslational modifications (PTMs) of proteins are of growing interest in biological and biomedical research, since it becomes increasingly evident that the correct protein PTM pattern as well as the correct time of modification is indispensable for most proteins to fulfill their cellular functions.1 Most proteins are stringently controlled via PTMs, and perturbation of this delicate regulation can cause severe diseases, such as cancer, autoimmune diseases, and neurodegeneration. Reversible phosphorylation on Ser, Thr, and Tyr is by far the most prominent PTM in eukaryotic cells. 
In humans 518 protein kinases and ∼150 protein phosphatases strictly control the function of a countless number of proteins.2
* To whom correspondence should be addressed. Current address: Center for Systems Biology (ZBSA), Core Facility Proteomics, Schänzlestr. 1, 79104 Freiburg, Germany. E-mail: [email protected].
(1) Walsh, C. T.; Garneau-Tsodikova, S.; Gatto, G. J. Angew. Chem., Int. Ed. 2005, 44, 7342-7372.
(2) Manning, G.; Whyte, D. B.; Martinez, R.; Hunter, T.; Sudarsanam, S. Science 2002, 298, 1912-1934.
10.1021/ac0707784 CCC: $37.00 Published on Web 08/25/2007
© 2007 American Chemical Society
Tandem mass spectrometry (MS/MS) in combination with either matrix-assisted laser desorption/ionization or electrospray
ionization (ESI) has become the method of choice for the analysis of protein phosphorylation and other posttranslational protein modifications.3 A large number of different strategies for the analysis of protein phosphorylation have been developed during the past decade,4-6 and a growing number of publications report on the large-scale identification of phosphorylation sites.7-10 Although these studies demonstrate that large-scale identification of phosphorylation sites is feasible with state-of-the-art MS technology, the comprehensive mapping of phosphorylation sites, i.e., the identification of each and every phosphorylation site of a protein, is still a formidable challenge and in general not (yet) feasible on the phosphoproteome level. This becomes evident, for example, for the protein period 2, for which only two phosphorylation sites (S-693 and S-697) have been identified in large-scale studies so far.10 However, the detailed analysis of the individual phosphoprotein resulted in the detection of more than 20 in vivo phosphorylation sites.11,12 Detecting phosphopeptides and pinpointing phosphorylation sites by mass spectrometry is a complex process that involves multiple steps, such as phosphopeptide enrichment, phosphopeptide separation, and data-dependent acquisition of MS/MS spectra. Each single step favors certain phosphopeptides and discriminates against others, even if all the steps have been carefully optimized. Thus, only a fraction of all phosphopeptides finally reaches the detector of the mass spectrometer.
(3) Jensen, O. Nat. Rev. Mol. Cell Biol. 2006, 7, 391-403.
(4) Loyet, K. M.; Stults, J. T.; Arnott, D. Mol. Cell. Proteomics 2003, 4, 235-245.
(5) McLachlin, D. T.; Chait, B. T. Curr. Opin. Chem. Biol. 2001, 5, 591-602.
(6) Mann, M.; Ong, S. E.; Grønborg, M.; Steen, H.; Jensen, O. N.; Pandey, A. Trends Biotechnol. 2002, 20, 261-268.
(7) Trinidad, J. C.; Specht, C. G.; Thalhammer, A.; Schoepfer, R.; Burlingame, A. L. Mol. Cell. Proteomics 2006, 5, 914-922.
(8) Beausoleil, S. A.; Jedrychowski, M.; Schwartz, D.; Elias, J. E.; Villen, J.; Li, J.; Cohn, M. A.; Cantley, L. C.; Gygi, S. P. Proc. Natl. Acad. Sci. U.S.A. 2004, 101, 12130-12135.
(9) Ficarro, S. B.; McCleland, M. L.; Stukenberg, P. T.; Burke, D. J.; Ross, M. M.; Shabanowitz, J.; Hunt, D. F.; White, F. M. Nat. Biotechnol. 2002, 20, 301-305.
(10) Villen, J.; Beausoleil, S. A.; Gerber, S. A.; Gygi, S. P. Proc. Natl. Acad. Sci. U.S.A. 2007, 104, 1488-1493.
(11) Schlosser, A.; Vanselow, J. T.; Kramer, A. Anal. Chem. 2005, 77, 5243-5250.
(12) Vanselow, K.; Vanselow, J. T.; Westermark, P. O.; Reischl, S.; Maier, B.; Korte, T.; Herrmann, A.; Herzel, H.; Schlosser, A.; Kramer, A. Genes Dev. 2006, 20, 2660-2672.
Analytical Chemistry, Vol. 79, No. 19, October 1, 2007 7439
Some types of phosphopeptides, such as phosphopeptides with multiple phosphorylation sites, phosphopeptides with several basic amino acids, or very hydrophobic
or very hydrophilic phosphopeptides are particularly difficult to detect. Good-quality MS/MS spectra that enable effortless phosphopeptide identification and phosphorylation site pinpointing are usually obtained only from a fraction of the detected phosphopeptides. To increase the number of detectable phosphopeptides and the number of pinpointed phosphorylation sites, we have recently introduced an optimized multi-protease approach that provides more comprehensive phosphorylation site maps.11,12 Regardless of the specific strategy that is applied for phosphorylation site mapping, MS/MS data (up to now mostly collision-induced dissociation (CID) MS/MS data) are finally used to identify phosphopeptides and to pinpoint phosphorylation sites. Fragment ion spectra can be interpreted either manually or by using one of various database search engines, such as Mascot,13 Sequest,14 and VEMS.15 In addition, Köcher et al. recently described a software tool (PhosTShunter) for the detection of phosphorylated peptides in LC-FTICR MS/MS data sets.16 This software tool allows the detection of peptides phosphorylated on serine or threonine by analyzing the MS/MS spectra for the presence of a neutral loss of phosphoric acid (or the combined neutral loss of phosphoric acid and water) from the precursor ion. However, this software does not assign phosphopeptide sequences and does not pinpoint phosphorylation sites, and manual data interpretation is still necessary. Another software tool (InsPecT) allows searching for modified peptides without specifying the PTM (blind search) and can also be applied for the identification of phosphopeptides.17 However, this software is based on de novo sequencing and thus requires MS/MS spectra of relatively high quality with at least one short b or y ion series. Very recently, Beausoleil et al. 
described a promising approach for protein phosphorylation analysis and site localization.18 They introduced a probability-based score, the Ascore, that measures the probability of correct phosphorylation site localization based on the presence and intensity of site-determining ions in MS/MS spectra. Finally, Yates and co-workers have developed approaches for the automated validation of phosphopeptide MS/MS spectra identified by database searching algorithms (preferentially SEQUEST).19 Using the search engine Mascot, we noticed that a substantial number of phosphopeptides in our LC-MS/MS data sets remain unassigned. The main reason for this is the low quality of the corresponding MS/MS spectra, which is caused either by low precursor intensity or by poor fragmentation behavior of the phosphopeptides.
(13) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(14) Eng, J. K.; McCormack, A.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(15) Matthiesen, R.; Trelle, M. B.; Højrup, P.; Bunkenborg, J.; Jensen, O. N. J. Proteome Res. 2005, 4, 2338-2347.
(16) Köcher, T.; Savitski, M. M.; Nielsen, M. L.; Zubarev, R. A. J. Proteome Res. 2006, 5, 659-668.
(17) Tanner, S.; Shu, H. J.; Frank, A.; Wang, L. C.; Zandi, E.; Mumby, M.; Pevzner, P. A.; Bafna, V. Anal. Chem. 2005, 77, 4626-4639.
(18) Beausoleil, S. A.; Villén, J.; Gerber, S. A.; Rush, J.; Gygi, S. P. Nat. Biotechnol. 2006, 24, 1285-1292.
(19) Lu, B.; Ruse, C.; Xu, T.; Park, S. K.; Yates, J. R., III. Anal. Chem. 2007, 79, 1301-1310.
CID MS/MS spectra of phosphopeptides are often dominated by intense neutral loss(es) of phosphoric acid and mostly contain only a reduced number of sequence-specific b and y ions, especially if the peptides are phosphorylated on
multiple sites. In addition, the fragmentation behavior of nontryptic phosphopeptides, generated by a low-specificity protease such as proteinase K, can be adversely affected if these peptides contain several basic amino acids.20 Finally, database searching without protease specificity makes it more difficult to discriminate between true positive and false positive identifications, since the number of peptide sequences that match a certain precursor mass is strongly increased under these circumstances. However, the majority of the phosphopeptide CID MS/MS spectra that remained unassigned using Mascot could be identified by manual in-depth analysis. Different strategies, such as patchwork sequencing21 or the sequence-tag approach,22 have been applied for this purpose. However, this is a very time-consuming and tiring procedure that cannot be performed for a large number of MS/MS spectra. In addition, the success of this strategy strongly depends on the experience of the scientist analyzing the data sets and on that scientist's individual performance, which is probably not constant over the course of a day. This led us to the development of scoring schemes that closely mimic the approach an expert mass spectrometrist would use for manual interpretation of (phospho)peptide MS/MS spectra.
EXPERIMENTAL SECTION
Sample Preparation. All proteins were reduced and alkylated prior to SDS-PAGE by boiling for 5 min in 3× sample buffer (150 mM Tris-HCl pH 6.8, 6 mM EDTA, 3% SDS, 30% glycerin) supplemented with 10 mM tris(2-carboxyethyl)phosphine hydrochloride (Calbiochem, Schwalbach/Ts., Germany). Afterward, 50 mM iodoacetamide (Fluka, Buchs, Switzerland) was added to the supernatant. Proteins were separated by SDS-PAGE and stained with Coomassie R250 (Merck, Darmstadt, Germany).
Protein Digestion. Excised gel bands were destained with 30% acetonitrile (ACN), shrunk with 100% ACN, and dried in a vacuum concentrator (Concentrator 5301, Eppendorf, Hamburg, Germany). 
Digests with trypsin, elastase, proteinase K, and thermolysin were performed overnight at 30 °C in 0.1 M NH4HCO3 (pH 8). About 0.1 µg of protease was used for one gel band. Peptides were extracted from the gel slices with 30% ACN, 5% formic acid (FA). The combined supernatants and extracts were dried in a vacuum concentrator and redissolved in 10 µL of 30% ACN, 2% FA.
Phosphopeptide Enrichment. A TiO2 nanocolumn (50-µm i.d., 1.5-cm length; 5-µm particle size, GLSciences) was used for phosphopeptide enrichment as previously described.10 Briefly, samples were loaded on the column at a flow rate of 1 µL/min. After washing with 10 µL of 30% ACN, 2% FA, elution was performed with 10 µL of 0.1 M NH4HCO3 pH 9.5 at a flow rate of 0.3 µL/min. Eluates were acidified with FA before nanoLC-MS/MS analysis.
NanoLC and Mass Spectrometry. A CapLC (Micromass, Manchester, UK) coupled to a Q-TOF micro (Micromass, Manchester, UK) was used for nanoLC-MS/MS. A column switching system consisting of a 50-µm-i.d. trap column (Waters Atlantis dC18, 5-µm particle size, 1.5-cm length) and an analytical column (Waters Atlantis dC18, 3-µm particle size, 12-cm length) was used.
(20) Tabb, D. L.; Huang, Y. Y.; Wysocki, V. H.; Yates, J. R., III. Anal. Chem. 2004, 76, 1243-1248.
(21) Schlosser, A.; Lehmann, W. D. Proteomics 2002, 2, 524-533.
(22) Mann, M.; Wilm, M. Anal. Chem. 1994, 66, 4390-4399.
Table 1. Examples of the New Nomenclature for Phosphorylated Precursor and Fragment Ions
precursor ions
    pM+             = [M + HPO3 + H]+
    pM2+            = [M + HPO3 + 2H]2+
    p2M+            = [M + 2HPO3 + H]+
    p2M2+           = [M + 2HPO3 + 2H]2+
    pM+ - P         = [M + HPO3 - H3PO4 + H]+
fragment ions
    py7             = y7 (singly phosphorylated)
    p2y7            = y7 (doubly phosphorylated)
    pb3 - P         = b3 (phosphorylated, with neutral loss of H3PO4)
    p4y162+ - 3P    = y162+ (4× phosphorylated with neutral loss of 3 H3PO4)

PicoTip emitters (360-µm o.d., 20-µm i.d., 5-µm tip i.d., CE coating) from NewObjective (Woburn, MA) have been applied. Spray voltage was set to 1600 V. No cone gas was used. Gradients were run from 0% ACN, 0.1% FA to 40% ACN, 0.1% FA in 60 min (0.67%/min) with a postsplit flow rate of 100 nL/min. Essential parameters of the data-dependent acquisition of MS/MS spectra were set as follows: a maximum of two parallel MS/MS spectra, each for 2 s with two different collision offsets. Threshold for MS-to-MS/MS switching was set to 30 counts/s.
Parameter for Peak Detection. Mascot Distiller 2.0 (Matrix Science, London, UK) was used to generate peak lists from LC-MS/MS raw data. Essentially, the settings for Q-TOF instruments as recommended by Matrix Science (default settings) have been used.
Parameter for Database Searching. Mascot Server 2.1 (Matrix Science) was used for database searching. The following parameters have been used: instrument ion series definitions, b ion series and y ion series. Mod file, Phospho (STY), residues, S 166.998359 167.0572, T 181.014010 181.0838, Y 243.029660 243.1532, NeutralLoss, 97.976896 97.9952 0, NeutralLoss, 0 0 0, PepNeutralLoss, 97.976856 97.9952. Taxonomy, all entries. Enzyme, none. Variable modifications, phospho (STY), carbamidomethyl (C), deamidation (N), pyro-Glu (N-term Q), oxidation (M). Peptide tol., ±0.1 Da; MS/MS tol., ±0.1 Da. Monoisotopic mass.
Programming. Scoring schemes for PeptideScore and PhosphoSiteScore were implemented in a PERL script (Phosm) using ActivePerl 5.8.6 for Windows from ActiveState (http://www.activestate.com/Products/ActivePerl/?mp=1) and the open-source editor Open Perl IDE 1.0 (http://open-perl-ide.sourceforge.net/). Phosm runs under a Microsoft IIS with installed ActivePerl 5.8.
Nomenclature for Phosphorylated Peptide Ions. We have extended the conventional Biemann nomenclature for peptide fragment ions23 in terms of phosphorylated fragment ions. The following simple definitions have been made and are used throughout this publication: p = HPO3 (80 Da), P = H3PO4 (98 Da). With these definitions, all possible phosphorylated fragment ions can be assigned unambiguously. For example, p4y162+ - 3P is the four times phosphorylated, doubly charged y16 ion that shows a neutral loss of three molecules of phosphoric acid (see Figure 3b). Nonfragmented precursor ions are assigned in the following way: pM+ = [M + HPO3 + H]+, p2M+ = [M + 2 HPO3 + H]+, p2M2+ = [M + 2 HPO3 + 2H]2+, etc. This new nomenclature for phosphorylated precursor and fragment ions is very concise. For illustration, some more examples are shown in Table 1.
(23) Biemann, K. Biomed. Environ. Mass Spectrom. 1988, 16, 99-111.

Figure 1. Flowchart for the software tool Phosm. LC-MS/MS data and the sequence of the protein of interest are used to create a list of mass matches. A number of common covalent modifications are considered (see text for details). No restrictions concerning the protease specificity are made. Novel scoring schemes are finally applied for scoring all mass matches to identify phosphopeptides (PeptideScore) and phosphorylation sites (PhosphoSiteScore).

RESULTS AND DISCUSSION
We have developed novel scoring schemes for the identification of (phospho)peptides (PeptideScore) and phosphorylation sites
(PhosphoSiteScore). For development and evaluation of these schemes, we have analyzed ∼500 (mostly unique) phosphopeptide MS/MS spectra. We have applied our previously developed multiprotease approach:11,12 in-gel digests have been performed with elastase, proteinase K, and thermolysin in addition to trypsin, resulting in mainly nontryptic peptides. Phosphopeptides have been enriched using TiO2 nanocolumns, and the phosphopeptide-enriched fractions have been analyzed by nanoLC-MS/MS on a Q-TOF instrument. We have implemented the novel scoring schemes into the software tool Phosm, which provides an input mask where nanoLC-MS/MS data (mgf-type files obtained with Mascot Distiller) and protein sequences (plain text) can be uploaded and several parameters, such as mass accuracy and thresholds, can be set. The workflow of Phosm is shown in Figure 1. In the first step, Phosm creates a list of mass matches, thereby considering a set of common covalent modifications: oxidation of Met, deamidation of Asn if followed by Gly, pyro-Glu formation if the peptide has an N-terminal Gln, carbamidomethylation of Cys, and of course phosphorylation on Ser, Thr, and Tyr. Since no protease specificity is applied, a relatively large number of mass matches is obtained.

Table 2. Scoring Scheme for the Calculation of the PeptideScore
rule                                                                  bonus/malus
sequence-specific fragment ions
    b or y ion                                                        +10
    a ion (if b ion is present)                                       +10
    complementary b/y ions                                            +10
    series of b or y ions (2, 3, 4, 5, 6, etc. adjacent ions)         +5, +11, +20, +34, +55, etc. (a)
Pro-directed fragment ions
    b or y ion in Top10                                               +20
    cleavage on C-term. side of Pro                                   -20
    no Pro-directed b or y ion in Top10                               -20
neutral loss (NL) of H3PO4 from precursor
    # NL = # phosphates (b)                                           +20
    # NL > # phosphates                                               -50
    # NL < # phosphates (Δ = 1, 2, 3, 4, etc.)                        ±0, -20, -30, -40, etc.
charge state
    |basic sites - # charges| ≥ 2, 2.5, 3, 3.5, etc. (c)              -20, -30, -40, -50, etc.
    (basic sites: Arg = Lys = N-term. NH2 = 1; His = 0.5)
Top10
    if 10, 9, 8, etc. of the 10 most abundant peaks can be matched    +250, +200, +150, etc.
    if 10, 9, 8, etc. of the 10 most abundant peaks cannot be matched -250, -200, -150, etc.
(a) bonus = $\sum_{k=2}^{n} (5 + (k-2)^2)$ with n = number of adjacent fragment ions (b or y ions). (b) # = number of. (c) |x| = absolute value of x.

For example, for the mouse protein PER2 (1257 amino
acids, SwissProt accession number O54943), an average of 36 peptide sequences match one observed molecular mass if the search is performed with a mass accuracy of ±0.1 Da. Subsequently, a score (PeptideScore) is calculated for each individual mass match, and a site-specific score (PhosphoSiteScore) is calculated for each Ser, Thr, and Tyr if the peptide is phosphorylated. Phosm finally creates a concise and comprehensive phosphopeptide report (html-file; see Supporting Information S1), where all identified phosphopeptides are listed according to their phosphorylation site. Within this report, detailed information is given for each individual phosphopeptide, such as retention time, charge state (expected and observed), mass error, additional modifications, number and intensities of the neutral loss(es) of phosphoric acid, PeptideScore, and PhosphoSiteScore. All assigned fragment ions, mostly b and y ions, are listed. In the following, a detailed description and evaluation of the PeptideScore and the PhosphoSiteScore is given.
PeptideScore. We have tried to elaborate a scoring scheme that closely resembles manual interpretation of CID MS/MS spectra. For this purpose, we have developed a bonus-malus system that scores a number of different parameters. The optimized set of parameters that is used to calculate the PeptideScore is shown in Table 2. The most obvious parameter that is used for the calculation of the PeptideScore is the presence of sequence-specific fragment ions. In CID MS/MS spectra, the most prominent fragment ions are b and y ions. A predicted peptide sequence (mass match) gets a bonus of 10 if an experimental fragment ion mass in the MS/MS spectrum can be assigned to a b or y ion. Some phosphorylated fragment ions show a neutral loss of phosphoric acid, some do not, and some are present in the MS/MS spectrum with and without the neutral loss of phosphoric acid. Therefore, each version of a b or y ion gets a bonus of 10 if present.
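The b- and y-ion rule can be illustrated with a minimal Python sketch. It is not the Phosm code (which was written in Perl and is not reproduced in this paper); the function names `fragment_ions` and `b_or_y_bonus` are hypothetical, only singly charged ions are considered, residue masses are standard monoisotopic values, and further modifications such as carbamidomethylation are ignored. Each ion is labeled with the nomenclature defined above (p = HPO3, P = H3PO4), and every version of an ion that matches a peak within the tolerance earns +10.

```python
# Monoisotopic residue masses (Da); standard values rounded to 5 decimals.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
HPO3 = 79.96633    # "p" in the nomenclature
H3PO4 = 97.97690   # "P" in the nomenclature
WATER = 18.01056
PROTON = 1.00728


def label(kind, index, n_phos, n_loss):
    """Build a label such as 'pb3 - P' or 'p2y7'."""
    prefix = "" if n_phos == 0 else ("p" if n_phos == 1 else f"p{n_phos}")
    suffix = "" if n_loss == 0 else (" - P" if n_loss == 1 else f" - {n_loss}P")
    return f"{prefix}{kind}{index}{suffix}"


def fragment_ions(seq, phospho_sites):
    """Return {label: m/z} for singly charged b and y ions.

    phospho_sites holds 0-based positions of phosphorylated residues.
    Each phosphorylated fragment is listed with 0..n_phos losses of H3PO4.
    """
    ions = {}
    n = len(seq)
    for i in range(1, n):
        pieces = {
            "b": (range(0, i), PROTON),              # first i residues
            "y": (range(n - i, n), WATER + PROTON),  # last i residues
        }
        for kind, (positions, end_mass) in pieces.items():
            n_phos = sum(1 for j in positions if j in phospho_sites)
            base = sum(MONO[seq[j]] for j in positions) + end_mass + n_phos * HPO3
            for n_loss in range(n_phos + 1):
                ions[label(kind, i, n_phos, n_loss)] = base - n_loss * H3PO4
    return ions


def b_or_y_bonus(ions, peaks, tol=0.1):
    """+10 for every ion version that has an observed peak within tol (Da)."""
    return 10 * sum(
        1 for mz in ions.values() if any(abs(mz - p) <= tol for p in peaks)
    )


if __name__ == "__main__":
    ions = fragment_ions("ASK", {1})  # ApSK, phosphorylated on Ser
    for name, mz in sorted(ions.items(), key=lambda kv: kv[1]):
        print(f"{name:10s} {mz:9.4f}")
```

In this toy example the candidate ApSK yields b1, pb2, pb2 - P, y1, py2, and py2 - P; a peak list containing m/z 147.11 and 141.07 would match y1 and pb2 - P and therefore contribute a bonus of 20.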
For a ions, a bonus of 10 is given only if the corresponding b ion is present. If both the b and the complementary y ion are present, an additional bonus of 10 is given. If a series of b or y ions can be assigned, an additional bonus of 5, 11, 20, 34, etc., is given for 2, 3, 4, 5, etc., adjacent fragment ions (bonus = $\sum_{k=2}^{n} (5 + (k-2)^2)$ with n = number of adjacent fragment ions). For example, if the y2, y3, y4, and y5 can be assigned, a bonus of 20 is given. Cleavage N-terminal to Pro dominates CID MS/MS spectra of peptides that have a mobile or partially mobile proton, whereas cleavage C-terminal to Pro is rarely observed.24,25 Thus, a predicted peptide sequence (mass match) gets a bonus of 20 if one of the 10 most abundant peaks in an MS/MS spectrum can be assigned to a b or y ion that is generated by cleavage on the N-terminal side of Pro. A predicted peptide sequence gets a malus of 20 if it contains a Pro, but no b or y ion generated by cleavage N-terminal to Pro can be assigned to one of the 10 most abundant peaks. If one of the 10 most abundant peaks is assigned to a b or y ion that is generated by cleavage C-terminal to Pro, the corresponding predicted peptide sequence gets a malus of 20. CID MS/MS spectra of phosphopeptides containing phosphoserine or phosphothreonine often show intense neutral loss(es) of phosphoric acid from the precursor ion. If the number of observed neutral losses of phosphoric acid in an MS/MS spectrum equals the number of phosphates in a predicted peptide sequence (mass match), this peptide gets a bonus of 20. If the number of observed neutral losses from the precursor exceeds the number of phosphates, the corresponding predicted peptide sequence gets a malus of 50, since an MS/MS spectrum that shows multiple losses of phosphoric acid does not correspond to a singly phosphorylated peptide. If the number of observed neutral losses is smaller than the number of phosphates, the corresponding predicted peptide sequence gets a malus of 20, 30, 40, etc., for a difference of 2, 3, 4, etc., phosphates (a difference of 1 incurs no malus; see Table 2). For example, if the precursor ion shows the successive loss of two molecules of phosphoric acid, but the predicted peptide sequence is four times phosphorylated, this predicted peptide will get a malus of 20.
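The ion-series bonus and the neutral-loss rule are simple enough to write down directly. The following Python sketch uses the hypothetical function names `series_bonus` and `neutral_loss_score`; the handling of a candidate with zero phosphates is an assumption, since that case is not spelled out in the text.

```python
def series_bonus(n_adjacent):
    """Bonus for n adjacent b (or y) ions: sum over k = 2..n of 5 + (k - 2)^2."""
    return sum(5 + (k - 2) ** 2 for k in range(2, n_adjacent + 1))


def neutral_loss_score(n_observed, n_phosphates):
    """Score for neutral losses of H3PO4 observed from the precursor ion."""
    if n_observed > n_phosphates:
        return -50                      # more losses than phosphates
    if n_phosphates == 0:
        return 0                        # assumption: rule not applied
    if n_observed == n_phosphates:
        return 20
    deficit = n_phosphates - n_observed
    return 0 if deficit == 1 else -10 * deficit   # 2 -> -20, 3 -> -30, ...


if __name__ == "__main__":
    print([series_bonus(n) for n in range(2, 7)])
    print(neutral_loss_score(2, 4))
```

The first line printed reproduces the series 5, 11, 20, 34, 55 from Table 2, and the second reproduces the worked example above (two observed losses, four postulated phosphates, malus of 20).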
(24) Huang, Y. Y.; Triscari, J. M.; Tseng, G. C.; Pasa-Tolic, L.; Lipton, M. S.; Smith, R. D.; Wysocki, V. H. Anal. Chem. 2005, 77, 5800-5813. (25) Kapp, E. A.; Schutz, F.; Reid, G. E.; Eddes, J. S.; Moritz, R. L.; O’Hair, R. A. J.; Speed, T. P.; Simpson, R. J. Anal. Chem. 2003, 75, 6251-6264.
Tryptic peptides with no missed cleavage site are mostly doubly charged, since the observed charge state in ESI-MS roughly correlates with the number of basic sites in the peptide. In contrast to tryptic peptides, nontryptic peptides show a much wider range of charge states, since these peptides can have very different numbers of basic amino acids. If the observed charge state of a peptide differs greatly from the number of basic sites, the corresponding predicted peptide sequence gets a malus of 20 or more (for details, see Table 2). Finally, if 10, 9, 8, etc., of the 10 most abundant peaks (the precursor, neutral loss(es) of phosphoric acid and water are not included) in a MS/MS spectrum can be assigned, the corresponding predicted peptide sequence gets a bonus of 250, 200, 150, etc. If 10, 9, 8, etc., of the 10 most abundant peaks in a MS/MS spectrum remain unassigned, the corresponding predicted peptide sequence gets a malus of 250, 200, 150, etc. In principle, all peaks above a user-defined threshold (ion threshold) are used for peptide scoring. However, the 10 most abundant peaks are by far the most important for the PeptideScoring. Lower intensity peaks can only further increase the PeptideScore, never decrease it. For evaluating the usefulness of our scoring parameters, we have analyzed the distribution of the individual parameter scores (bonus/malus) for correct and for incorrect peptide sequences. Peptide sequences with the highest positive PeptideScore for each precursor mass are indicated as correct peptides, all other peptide sequences are indicated as incorrect peptides. All correct peptides were checked manually and by comparison with the results obtained from Mascot. Figure 2 shows the analyses of five different parameters: the b-and-y-ion parameter, the ion-series parameter, the Top10 parameter, the neutral-loss, and the proline parameter. 
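The charge-state and Top10 rules can be sketched in the same way. In the Python sketch below, `basic_sites`, `charge_state_score`, and `top10_score` are hypothetical names. Two assumptions are made: the "etc." in Table 2 is continued linearly (so that five matched Top10 peaks score 0), and every peptide is taken to have a free N-terminal amino group, which would not hold for pyro-Glu peptides.

```python
def basic_sites(seq):
    """Arg = Lys = N-terminal NH2 = 1; His = 0.5 (Table 2)."""
    return seq.count("R") + seq.count("K") + 1 + 0.5 * seq.count("H")


def charge_state_score(seq, observed_charge):
    """Malus of 20, 30, 40, ... for |basic sites - charges| >= 2, 2.5, 3, ..."""
    diff = abs(basic_sites(seq) - observed_charge)
    steps = int(round(2 * diff)) - 4    # half-unit steps above the threshold of 2
    return 0 if steps < 0 else -(20 + 10 * steps)


def top10_score(n_matched):
    """+250, +200, +150, ... for 10, 9, 8, ... matched Top10 peaks;
    -250, -200, -150, ... for 10, 9, 8, ... unmatched Top10 peaks.
    Assumption: linear continuation, i.e. 50 * (matched - 5)."""
    return 50 * (n_matched - 5)


if __name__ == "__main__":
    print(charge_state_score("KRKHA", 1))   # 4.5 basic sites vs charge 1
    print(top10_score(9), top10_score(2))
```

For the hypothetical peptide KRKHA observed as a singly charged ion, the difference of 3.5 gives a malus of 50; nine matched Top10 peaks give +200, and two matched (eight unmatched) give -150.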
The analyses were performed with data from a nanoLC-MS/MS run with 122 MS/MS spectra (phosphopeptide-enriched fraction of a multi-protease digest of the protein PER2; SwissProt accession number O54943). The analysis of this data set with a precursor mass accuracy of ±0.1 Da resulted in a total of 3252 PER2-derived peptide sequences (mass matches). A total of 92 of these were identified as correct peptides, 3160 as incorrect peptides. Figure 2a shows the analysis of the b- and y-ion parameter. The majority of incorrect peptides get only a small bonus from this parameter. A total of 66% get no bonus or a bonus of 10; only 0.5% get a bonus of more than 70. The highest bonus that is obtained by an incorrect peptide is 160. In contrast, 64% of the correct peptides get a bonus of 100 or higher. The highest bonus that is obtained by a correct peptide in this data set is 290 (VSHESGGQKEApSVAEMQ; see Supporting Information S1, q39). The analysis of the ion-series parameter shown in Figure 2b reveals that 93% of the incorrect peptides do not get any bonus from this parameter due to the complete absence of fragment ion series. However, ion series of very different length can be assigned for correct peptide sequences. The highest bonus that is obtained by a correct peptide from this parameter is 309 (AVpTTIERDSSGA, see Supporting Information S1, q69). An almost complete b-ion series (b2-b11), in addition to a y-ion series (y6-y11), was assigned in this case. The Top10 parameter, the analysis of which is shown in Figure 2c, is very selective. All incorrect peptides get a malus from this parameter, most of them (63%) the highest possible malus of 250. In contrast, 74% of the correct peptides get a bonus of at least 50
from this parameter; only 13% of the correct peptides get a malus from this parameter. The great majority of identified peptides is singly phosphorylated, shows a neutral loss of one molecule phosphoric acid, and, therefore, gets a bonus of 20 from the neutral-loss parameter (see Figure 2d). On the other hand, a large number of incorrect peptide sequences can be discriminated by this parameter, since for these the number of observed neutral losses of phosphoric acid often greatly differs from the postulated number of phosphates in the predicted peptide sequence. For 17% of the incorrect peptides, the number of the observed neutral losses of phosphoric acid from the precursor is greater than the number of phosphates in the postulated sequence. Thus, 17% of the incorrect peptides get a malus of 50. A total of 80-90% of our analyzed CID MS/MS spectra of pSer/pThr-containing peptides show a neutral loss of phosphoric acid. However, about 10-20% of the spectra do not show any neutral loss of phosphoric acid at all. Examples of such peptide sequences are IVSTPGpTVVAPPAAT, VVAPPAApTHTG, or ACPVTPPAGpTVA. Two factors seem to hamper the neutral loss of phosphoric acid from a peptide: (i) phosphorylation on Thr and (ii) localization of the phosphorylated residue far away from the N- or C-terminus. Thus, singly phosphorylated peptides that do not show a neutral loss of phosphoric acid get neither a bonus nor a malus from our scoring scheme. Finally, Figure 2e shows the results obtained from the analysis of the proline parameter. A total of 57% of the correct peptides get a bonus from this parameter. The highest bonus (180) from the proline parameter in all analyzed data sets is obtained from a peptide with the sequence FSPSPTSPTKEPGAPQPT, which contains 6 proline residues; 9 of the 10 most abundant peaks can be assigned to proline-directed fragment ions (b and y ions) in this case. 
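The proline bonus of 180 quoted above is consistent with a bonus of 20 for each Top10 peak assigned to a proline-directed ion (9 × 20). A minimal Python sketch under that per-peak assumption follows; `proline_score` is a hypothetical name, each Top10 assignment is given as an ion type and index, and a cleavage between two adjacent Pro residues is counted as N-terminal to Pro, which is also an assumption.

```python
def proline_score(seq, top10_ions):
    """Proline parameter for one candidate sequence.

    top10_ions: list of (kind, index) tuples, e.g. ("b", 4) or ("y", 7),
    for those Top10 peaks that were assigned to b or y ions.
    """
    n = len(seq)
    score = 0
    n_directed = 0
    for kind, index in top10_ions:
        # Number of residues on the N-terminal side of the cleaved bond.
        bond = index if kind == "b" else n - index
        if seq[bond] == "P":            # cleavage N-terminal to Pro
            score += 20
            n_directed += 1
        elif seq[bond - 1] == "P":      # cleavage C-terminal to Pro
            score -= 20
    if "P" in seq and n_directed == 0:  # Pro present, no Pro-directed ion
        score -= 20
    return score


if __name__ == "__main__":
    peptide = "FSPSPTSPTKEPGAPQPT"
    assigned = [("b", 2), ("b", 4), ("b", 7), ("b", 11), ("b", 14),
                ("y", 16), ("y", 14), ("y", 11), ("y", 7), ("b", 9)]
    print(proline_score(peptide, assigned))
```

The assignment list in the usage example is invented for illustration (the paper does not list which nine ions were matched); it merely shows that nine proline-directed Top10 peaks reproduce the reported value of 180.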
Figure 2. Distribution of individual parameter scores (bonus/malus) (a-e) and the PeptideScore (f) for correct and for incorrect peptides. Peptide sequences with the highest positive PeptideScore for each precursor mass are indicated as correct peptides; all other peptide sequences are indicated as incorrect peptides. The data are derived from the analysis of a nanoLC-MS/MS run with 122 MS/MS spectra. A total of 3252 PER2-derived peptide sequences (mass matches) resulted in the assignment of 92 correct peptides and 3160 incorrect peptides. The data in histograms b, c, and f have been binned; abscissa labels are the starting values of the bins.

Figure 2f shows the distribution of the sum of the individual parameter scores, the PeptideScore, for incorrect and for correct peptide sequences. A total of 99.9% of the incorrect peptides get a negative PeptideScore; 0.1% (3 peptide sequences) of the incorrect peptides get a positive PeptideScore. However, there are peptides with a much higher PeptideScore for all these three spectra. Thus, the three incorrect peptides with a positive PeptideScore are not displayed in the Phosm results. In the present data file, only two peptide sequences were identified (by manual interpretation of the MS/MS spectra) as false positives (see Supporting Information S1, q22 and q90 and Supporting Information S3). These peptide sequences have PeptideScores of 10 and 25, respectively. All peptides identified as false positives in all our data sets had PeptideScores below 100. Thus, peptide sequences with a PeptideScore of 100 or smaller are displayed shaded within the Phosm results and have to be verified carefully. In summary, all parameters of our scoring scheme preferentially give a bonus to correct peptides and a malus to incorrect peptides, and thus fulfill the criterion for a good scoring parameter. For testing the performance of the PeptideScoring (the sum of the individual parameter scores), we have analyzed the data from a nanoLC-MS/MS run with 122 MS/MS spectra with Phosm and with Mascot (Mascot Server 2.1) under identical conditions (database: PER2 sequence (SwissProt accession number O54943); variable modifications: oxidation of Met, deamidation of Asn, pyro-Glu formation if the peptide has an N-terminal Gln, carbamidomethylation of Cys, and phosphorylation on Ser, Thr, and Tyr; mass accuracy for MS and MS/MS, ±0.1 Da). The Mascot Peptide Summary Report showed 59 phosphopeptides (see Supporting Information S2). In-depth manual interpretation of questionable MS/MS spectra exposed five false positives (incorrectly identified phosphopeptides) and eight false negatives (missed phosphopeptides). Using Phosm, 64 phosphopeptides with only two false positives and no false negative could be assigned (see Supporting Information S1). That is, Phosm has correctly assigned all phosphopeptides that have been identified with Mascot. In addition, eight MS/MS spectra that remained unassigned with Mascot could be assigned to phosphopeptides with Phosm (see Supporting Information S3). These phosphopeptides have been critically verified by manual interpretation of the corresponding MS/MS spectra. It is striking that most of the multiply phosphorylated peptides that have been identified with Phosm are missed with Mascot. Phosm was designed and optimized for the comprehensive phosphorylation site analysis of individual proteins. We apply a two-step strategy for phosphorylation site mapping: we use Mascot in a first-pass search to identify the proteins that are present in our sample. In the second-pass search, we use Phosm for the comprehensive identification of phosphopeptides and phosphorylation sites. The second-pass search is thereby restricted to the proteins identified in the first-pass search. This dramatically reduces search time and the number of random hits. This strategy works well as long as the number of phosphoproteins is manageable. However, it is not suited for large-scale studies (phosphoproteomics). The ability to discriminate against random hits is a key feature for each kind of database search engine. 
To evaluate this feature for Phosm, we created a protein database by randomly picking 500 different proteins (a total of 205 025 amino acids) from the SwissProt database. We then searched selected MS/MS spectra of phosphopeptides against this random database, to which the correct protein sequence had been added. The results are shown in Figure 3. Figure 3a displays the MS/MS spectrum of a singly phosphorylated peptide. The spectrum is of high quality and contains a wealth of peptide sequence information. Searching with the corresponding precursor mass (1285.56 ± 0.1 Da) resulted in 4711 predicted peptide sequences (mass matches), 4673 of which get a negative PeptideScore. The calculated PeptideScore for the correct peptide is 909; more than 30 sequence-specific fragment ions could be matched with Phosm. As depicted in Figure 3a, 38 predicted peptide sequences get a positive PeptideScore. However, the best incorrect predicted peptide sequence gains a PeptideScore of 210 and thus is still very clearly separated from the correct peptide sequence with its PeptideScore of 909. Mascot is also able to unambiguously identify this phosphopeptide, with a score of 58. Figure 3b shows an MS/MS spectrum of a peptide with four phosphorylated residues. Due to the intense neutral loss of several molecules of phosphoric acid, this spectrum contains only a few sequence-specific fragment ions, the signal-to-noise ratio of which is very low. Nevertheless, the calculated PeptideScore for this phosphopeptide is 340. A search against our random database resulted in 3353 predicted peptide sequences with a negative
PeptideScore and 18 predicted peptide sequences with a positive PeptideScore. The best incorrect predicted peptide sequence has a PeptideScore of 75, so the correct phosphopeptide (PeptideScore 340) is still clearly separated from all random hits. In contrast, this MS/MS spectrum remained unassigned with Mascot; it is listed in neither the Peptide Summary Report nor the Protein Summary Report. A similar situation is depicted in Figure 3c. Only four sequence-specific fragment ions with a very low signal-to-noise ratio are detectable in the MS/MS spectrum of this doubly phosphorylated peptide with an N-terminal pyro-Glu modification. The calculated PeptideScore for the correct phosphopeptide is 85 and is still clearly separated from the two other sequences with a positive PeptideScore. A total of 4815 predicted peptide sequences get a negative PeptideScore. Mascot again failed to assign this MS/MS spectrum. Figure 3d shows the MS/MS spectrum of the phosphopeptide that was the most difficult to assign in this data set. Although the signal-to-noise ratio of the observed fragment ions is good in this case, this phosphopeptide has only one preferred cleavage site (under CID conditions), located on the N-terminal side of Pro6. Thus, except for the neutral loss of phosphoric acid from the precursor, only two complementary sequence-specific fragment ions are observed, the b5 ion and the y7 ion. The b5 ion is phosphorylated and shows a neutral loss of phosphoric acid. Although the amount of information is very limited in this case, our PeptideScore is able to clearly rule out 4140 predicted peptide sequences. Only 39 predicted peptide sequences get a positive PeptideScore, with the correct peptide sequence showing the highest PeptideScore of all (PeptideScore 125).
Of the 4179 mass matches, only 2 peptide sequences (QCQQLWGPGSKP + 1p and VIANSPEGDIEGK + 1p) are not distinguishable from the correct peptide sequence (LGRASPPLFQSR + 1p) with the mass accuracy applied here (see Figure 3d). However, no predicted peptide sequence obtained a higher PeptideScore than the correct phosphopeptide. QCQQLWGPGSKP and VIANSPEGDIEGK are assumed to be incorrect, since they derive from proteins that were not present in our sample (randomly picked sequences from SwissProt). LGRASPPLFQSR is assumed to be the correct sequence, since (i) it is the only PER2-derived peptide sequence, (ii) five other (overlapping) phosphopeptides with the same phosphorylation site (S-971) are identified with high confidence (see Supporting Information S1, q85, q116, q111, q84, q100), and (iii) the observed fragmentation behavior of this peptide is in perfect agreement with the expected behavior. Mascot was not able to assign this phosphopeptide, even when the search was restricted to the protein sequence of PER2.
Figure 3 thus demonstrates the high performance and robustness of the PeptideScore. Our scoring scheme clearly identified all four phosphopeptides, even though some of the MS/MS spectra are of very low quality and an extended database (500 randomly picked proteins) was used. In contrast, Mascot failed to identify three of the four phosphopeptides shown in Figure 3, even when the search was restricted to the correct protein sequence (PER2) alone. Only the high-quality MS/MS spectrum shown in Figure 3a was correctly assigned with Mascot. Applying a Mascot error-tolerant search did not result in any additionally assigned phosphopeptides (data not shown).
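That the three candidate sequences discussed for Figure 3d are mass matches of one another can be checked with a short calculation. The following is a minimal sketch, not part of Phosm: it uses standard monoisotopic residue masses, and it assumes that Cys and the N-terminal Gln of QCQQLWGPGSKP are unmodified (an inference made here, since the text does not state the modification state of this random hit). The function name `peptide_mass` is hypothetical.

```python
# Monoisotopic residue masses (Da) for the amino acids occurring below.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "F": 147.06841, "R": 156.10111, "W": 186.07931,
}
WATER = 18.01056  # H2O for the free N- and C-termini
HPO3 = 79.96633   # mass shift of one phosphorylation


def peptide_mass(sequence: str, n_phospho: int = 0) -> float:
    """Neutral monoisotopic mass of an otherwise unmodified peptide."""
    residues = sum(RESIDUE_MASS[aa] for aa in sequence)
    return residues + WATER + n_phospho * HPO3


# Assumption: Cys and N-terminal Gln are taken as unmodified.
candidates = ["LGRASPPLFQSR", "VIANSPEGDIEGK", "QCQQLWGPGSKP"]
masses = {seq: peptide_mass(seq, n_phospho=1) for seq in candidates}

for seq, mass in masses.items():
    print(f"{seq:>14} + 1p  {mass:.3f} Da")

# A +/-0.1 Da precursor tolerance corresponds to a window 0.2 Da wide.
spread = max(masses.values()) - min(masses.values())
print(f"spread between heaviest and lightest candidate: {spread:.3f} Da")
```

Under these assumptions the three masses come out at about 1407.70, 1407.63, and 1407.60 Da, so all three fit into a single ±0.1 Da precursor window, in line with their appearance as mass matches for the same spectrum.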
Figure 3. Evaluation of the specificity of the PeptideScore. PeptideScoring was applied to selected phosphopeptide MS/MS spectra with different features. The MS/MS spectra were searched against a test database containing 500 randomly picked proteins from the SwissProt database in addition to the correct protein sequence. The search for each MS/MS spectrum resulted in ∼4000 peptide sequences (mass matches) that are scored. The mass matches with a positive PeptideScore (including the correct hit) are displayed for each MS/MS spectrum in a diagram. (a) MS/MS spectrum of a singly phosphorylated peptide. The spectrum is of high quality and contains a wealth of sequence-specific information. (b) MS/MS spectrum of a peptide with four phosphorylated residues. Due to the intense multiple loss of phosphoric acid, the spectrum contains only a reduced number of sequence-specific fragment ions. (c) MS/MS spectrum of a doubly phosphorylated peptide with an additional N-terminal modification (pyro-Glu formation). The spectrum contains very few sequence-specific fragment ions. (d) MS/MS spectrum of a singly phosphorylated peptide with a single preferred cleavage site (N-terminal to an internal Pro).
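The counts quoted in the text for the four spectra of Figure 3 can be summarized in a few lines. The sketch below only restates numbers given in the text; it assumes that every mass match receives either a negative or a positive PeptideScore and that the positive counts include the correct hit. The class name `PanelResult` is hypothetical, and a best incorrect score is entered only where the text reports one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PanelResult:
    panel: str
    negative: int                  # mass matches with a negative PeptideScore
    positive: int                  # mass matches with a positive PeptideScore
    correct_score: int             # PeptideScore of the correct phosphopeptide
    best_incorrect: Optional[int]  # best random hit; None if not given in the text

    @property
    def total(self) -> int:
        return self.negative + self.positive

    @property
    def rejected_percent(self) -> float:
        return 100.0 * self.negative / self.total


panels = [
    PanelResult("3a", 4673, 38, 909, 210),
    PanelResult("3b", 3353, 18, 340, 75),
    PanelResult("3c", 4815, 3, 85, None),
    PanelResult("3d", 4140, 39, 125, None),
]

for p in panels:
    if p.best_incorrect is None:
        separation = "not reported"
    else:
        separation = f"{p.correct_score / p.best_incorrect:.1f}-fold"
    print(f"Figure {p.panel}: {p.total} mass matches, "
          f"{p.rejected_percent:.1f}% negative, separation {separation}")
```

With these inputs, more than 99% of the mass matches receive a negative PeptideScore for each spectrum, and for the two spectra where the best incorrect score is reported, the correct hit scores more than 4-fold higher.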
Figure 4. Basic principle of the PhosphoSiteScoring. Representative examples for singly and doubly phosphorylated peptides are shown to explain the applied bonus/malus system. If a singly phosphorylated fragment is assigned, the Ser, Thr, (and Tyr) residues within this fragment share a bonus of 10. If a multiply phosphorylated fragment is assigned, the Ser, Thr, (and Tyr) residues within this fragment share a bonus of 10 multiplied by the number of observed phosphates. If an assigned phosphorylated fragment ion contains all phosphates of a peptide, the corresponding (not observed) part of the peptide has to be nonphosphorylated and thus each Ser, Thr, (and Tyr) residue within this part gets a malus of 10. Examples for the calculation of the PhosphoSiteScore are shown in Figure 5.
PhosphoSiteScore. For pinpointing the phosphorylation site(s) within identified phosphopeptides, we have developed the PhosphoSiteScore. The basic idea behind this scoring is straightforward and is illustrated in Figure 4. If a phosphorylated fragment ion (b or y ion, with or without neutral loss(es) of phosphoric acid) is assigned in the MS/MS spectrum of a singly phosphorylated peptide, all Ser, Thr, (and Tyr) residues within this fragment share a bonus of 10 (Tyr residues are excluded if the number of phosphates equals the number of observed neutral losses from the precursor). Each Ser, Thr, (and Tyr) residue within the nonphosphorylated part of the peptide gets a malus of 10. Similar rules are applied for doubly or higher phosphorylated peptides (see examples in Figure 4). Figure 5a demonstrates the calculation of the PhosphoSiteScore for a singly phosphorylated peptide. Although the MS/MS spectrum is of low quality, a trustworthy localization of the phosphorylation site is possible. S-939 of PER2 is identified as the phosphorylated residue with a PhosphoSiteScore of 29.9. Phosphorylation of the other Ser and Thr residues can be ruled out because of their significantly lower PhosphoSiteScores of -26.7, -23.3, and 0. The Mascot scores for the peptides phosphorylated at the four potential phosphorylation sites are 1.4 (S-933), 16.4 (T-936), 18.4 (S-939), and 0.5 (S-945) and thus allow a less definite localization of the phosphorylation site. For validation of the PhosphoSiteScore, the results obtained with Phosm were compared with the corresponding Mascot results (see Supporting Information S4 for the singly phosphorylated peptides of one data set). In all cases, Phosm indicated the same position as Mascot for the most likely site of phosphorylation, as well as for the second most likely site of phosphorylation, suggesting that our concept of the PhosphoSiteScore is reasonable.
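The bonus/malus rules described above and in Figure 4 can be written down as a small function. This is a minimal sketch under stated assumptions, not the Phosm implementation: "share" is read as dividing the bonus equally among the candidate residues of a fragment, the Tyr exclusion rule is omitted, and the peptide ASGTLSK with its two assigned fragment ions is a made-up example. The names `Fragment` and `phospho_site_scores` are hypothetical.

```python
from dataclasses import dataclass

PHOSPHO_RESIDUES = set("STY")
UNIT = 10.0  # bonus/malus unit given in Figure 4


@dataclass
class Fragment:
    ion_type: str    # "b" (N-terminal) or "y" (C-terminal)
    length: int      # number of residues in the fragment
    phosphates: int  # phosphates on the fragment, including those lost as H3PO4


def phospho_site_scores(sequence, total_phosphates, fragments):
    """Return {1-based position: score} for every Ser/Thr/Tyr residue."""
    sites = [i for i, aa in enumerate(sequence, start=1) if aa in PHOSPHO_RESIDUES]
    scores = {pos: 0.0 for pos in sites}
    n = len(sequence)
    for frag in fragments:
        if frag.phosphates == 0:
            continue  # only phosphorylated fragments are scored in this sketch
        if frag.ion_type == "b":
            covered = range(1, frag.length + 1)
        else:
            covered = range(n - frag.length + 1, n + 1)
        inside = [pos for pos in sites if pos in covered]
        outside = [pos for pos in sites if pos not in covered]
        if not inside:
            continue
        # Candidate residues within the fragment share the bonus equally (assumption).
        share = UNIT * frag.phosphates / len(inside)
        for pos in inside:
            scores[pos] += share
        # If the fragment carries all phosphates, the remaining part of the
        # peptide must be nonphosphorylated: malus for each candidate residue there.
        if frag.phosphates == total_phosphates:
            for pos in outside:
                scores[pos] -= UNIT
    return scores


# Hypothetical singly phosphorylated peptide with a phosphorylated b4 and y5 ion.
example_scores = phospho_site_scores(
    "ASGTLSK", 1, [Fragment("b", 4, 1), Fragment("y", 5, 1)]
)
print(example_scores)
```

In this made-up example, only Thr-4 lies within both phosphorylated fragments, so it ends with a score of 10, whereas Ser-2 and Ser-6 each end with -5.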
The differentiation between the most likely and the second most likely site is in many cases more pronounced for the PhosphoSiteScore than for the Mascot Score (see Supporting
Figure 5. Calculating the PhosphoSiteScore. (a) Calculation of the PhosphoSiteScore for the singly phosphorylated peptide LASEITPASQAEFPS. S-939 is clearly identified as the phosphorylation site. (b) Calculation of the PhosphoSiteScore for the doubly phosphorylated peptide pyro-QVKAVTTIERDSSG. S-561 and T-554 or T-555 are identified as the phosphorylated residues.
Information S4). This might be due to the fact that all kinds of fragment ions contribute to the Mascot Score, whereas only those fragment ions that allow differentiation between different phosphorylation sites (fragment ions that contain at least one potential phosphorylation site) contribute to the PhosphoSiteScore. In contrast to the Mascot Score, the PhosphoSiteScore can adopt negative values and thus allows a better identification of very unlikely sites. Finally, the PhosphoSiteScore allowed us to pinpoint phosphorylation sites in cases where Mascot failed to identify the phosphopeptide and thus no phosphorylation site localization was possible with Mascot.
The situation gets more complex for multiply phosphorylated peptides (two or more phosphates) if the number of potential phosphorylation sites (all Ser, Thr, and Tyr residues) exceeds the number of phosphates by more than one. One such example is shown in Figure 5b. The peptide is doubly phosphorylated, and the number of potential phosphorylation sites is four (two Thr and two Ser). The calculated PhosphoSiteScores are 23.3 (T-554), 23.3 (T-555), 23.3 (S-560), and 30 (S-561). This leads to the interpretation that S-561 and one of the other sites are phosphorylated. However, manual interpretation of the MS/MS spectrum shows that some of the possible combinations can be ruled out. In the lower part of Figure 5b, all possible phosphorylation site permutations are listed, and it is shown that only two combinations, T-554/S-561 and T-555/S-561, are in accordance with all observed fragment ions. Phosm provides this kind of additional information for multiply phosphorylated peptides, if necessary. In this case, neither the phosphopeptide nor its phosphorylation sites could be identified with Mascot.
In our experience, the localization of phosphorylation sites can be more complex than expected, even if the peptide is only singly phosphorylated. In one case, the PhosphoSiteScores of all potential phosphorylation sites were highly negative, although the corresponding MS/MS spectrum was of high quality (see Supporting Information S1, q88).
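The listing of site combinations described for the doubly phosphorylated peptide of Figure 5b can be sketched as a simple consistency check. The following is an illustration only: the two fragment observations (a y2 ion and a b7 ion, each carrying one phosphate) are hypothetical, since the fragment ions actually assigned in Figure 5b are not listed in the text, and the numbering offset is inferred from the residue numbers quoted above. The helper names are hypothetical as well.

```python
from itertools import combinations

SEQUENCE = "QVKAVTTIERDSSG"  # N-terminal pyro-Glu does not affect the site logic
FIRST_RESIDUE = 549          # inferred offset so that position 13 is S-561
N_PHOSPHATES = 2

# Hypothetical observations: (ion type, fragment length, phosphates on fragment,
# counting phosphates lost as H3PO4).
OBSERVED = [("y", 2, 1), ("b", 7, 1)]


def covered_positions(ion_type, length, peptide_length):
    """1-based positions covered by a b ion (N-terminal) or y ion (C-terminal)."""
    if ion_type == "b":
        return range(1, length + 1)
    return range(peptide_length - length + 1, peptide_length + 1)


def is_consistent(site_set, observed, peptide_length):
    """True if the site set puts the observed phosphate count on every fragment."""
    for ion_type, length, phosphates in observed:
        span = covered_positions(ion_type, length, peptide_length)
        if sum(1 for pos in site_set if pos in span) != phosphates:
            return False
    return True


def label(pos):
    return f"{SEQUENCE[pos - 1]}-{FIRST_RESIDUE + pos - 1}"


candidate_sites = [i for i, aa in enumerate(SEQUENCE, start=1) if aa in "STY"]
allowed = [
    "/".join(label(pos) for pos in combo)
    for combo in combinations(candidate_sites, N_PHOSPHATES)
    if is_consistent(combo, OBSERVED, len(SEQUENCE))
]
print(allowed)
```

With these two hypothetical observations, four of the six possible pairs are excluded and T-554/S-561 and T-555/S-561 remain. For a singly phosphorylated peptide, an empty result from such a check would mean that no single site explains all assigned fragment ions, which is relevant to the q88 spectrum mentioned above.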
In-depth manual interpretation of this spectrum revealed the reason for this: the MS/MS spectrum shows fragment ions from two different peptide species having the same sequence and the same number of phosphates, but phosphorylated at different sites. In this case, these “phosphorylation site isomers” have obviously not been separated on the reversed-phase column.
CONCLUSIONS
We have developed novel scoring schemes for the identification of (phospho)peptides (PeptideScore) and phosphorylation sites (PhosphoSiteScore) of individual proteins. These scoring schemes have proven to be very powerful, and the obtained results are in many cases superior to the results obtained with Mascot. The novel scoring schemes are implemented into the software tool Phosm, which provides a concise and comprehensive phosphorylation site report. Phosm was developed and optimized for the in-depth analysis of a single or a few phosphoproteins, not for high-throughput phosphoproteomics data. In this context, it has proven to be a very useful tool to assist phosphosite mapping. Phosm is able to identify phosphopeptides and to pinpoint phosphorylation sites even from spectra with very low information content. However, especially these hits (in our experience, peptides with a PeptideScore below 100, which are displayed shaded in the Phosm result file; see Supporting Information S1) need to be verified carefully. Since Phosm displays much information, verification is in most cases straightforward and is possible without looking at the original MS/MS spectrum.
With our novel scoring schemes, we have tried to mimic the approach an expert mass spectrometrist would use for manual interpretation of phosphopeptide MS/MS spectra. Many parameters that are normally not taken into account by most other search engines, such as the accordance of the observed and the expected peptide charge state or the intensity of proline-directed fragment ions, are used to obtain a more reliable, more significant identification of phosphopeptides and phosphorylation sites. This ultimately enabled us to identify phosphorylation sites even from MS/MS spectra with very low information content that could not be identified with other search engines. However, we still see potential for further improving the performance of our scoring schemes. For example, the accordance of the observed and the calculated retention time of a peptide could be used as a further scoring parameter. There are several approaches described in the literature for the calculation of peptide retention times.26,27 Using instruments with better mass accuracy and complementary fragmentation techniques, such as electron capture dissociation28 or electron-transfer dissociation,29 would of course also further improve the reliability and significance of phosphorylation site identification. One fundamental limitation of our approach, as with almost every other search engine, is that it is limited to a relatively short list of specified modifications. Phosphopeptides with an additional unknown or unexpected covalent modification will not be identified. However, approaches described for the blind search of modifications17 will probably fail with MS/MS spectra of very low quality. The results presented here are focused on pSer/pThr-containing peptides, since identification of this type of phosphopeptide is the most challenging.
This is especially true for multiply phosphorylated peptides, since multiple neutral losses of phosphoric acid almost completely prevent the formation of sequence-specific fragment ions. Upon CID, pTyr-containing peptides do not show a neutral loss of phosphoric acid and behave very much like nonphosphorylated peptides. Although Tyr-phosphorylated peptides can form a highly specific phosphotyrosine immonium ion,30,31 this marker ion was absent from all MS/MS spectra of pTyr-containing peptides used for evaluation of our scoring schemes (data not shown). Thus, we decided not to give a bonus for the presence or a malus for the absence of this immonium ion. However, as far as this can be judged from the limited number of Tyr-phosphorylated peptides we have analyzed so far with Phosm, the PeptideScore as well as the PhosphoSiteScore seems to work fine with this type of phosphopeptide.
The concept of the PeptideScore is not restricted to phosphopeptides, but is also applicable for the identification of nonphosphorylated peptides. Nonphosphorylated peptides are by default displayed at the end of the Phosm phosphorylation site report (see Supporting Information S1). For all data sets analyzed so far, the PeptideScore identified exactly the same set of nonphosphorylated peptides as Mascot. Thus, we do not expect any interference from nonphosphorylated peptides and claim that Phosm performs just as well when applied to samples without prior phosphopeptide enrichment. Phosm also uses the scoring scheme that pinpoints the phosphorylation sites (PhosphoSiteScore) for the localization of other modifications, such as methionine oxidation. Extending the applicability to other types of posttranslational modification, such as methylation, would be an easy task.
In summary, our novel scoring schemes turned out to be highly useful for a more comprehensive and more reliable identification of phosphorylation sites of individual proteins. Using Phosm, additional phosphopeptides and phosphorylation sites could be identified in almost every analyzed LC-MS/MS data set comprising a variety of different proteins. In particular, the number of identified multiply phosphorylated peptides was significantly increased.
ACKNOWLEDGMENT
The Deutsche Forschungsgemeinschaft is acknowledged for financial support (Emmy-Noether program and grant SFB 740/D2). Our work is supported by the 6th EU framework program EUCLOCK. A.S. and J.T.V. contributed equally to this paper.
SUPPORTING INFORMATION AVAILABLE
Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.
Received for review April 18, 2007. Accepted July 19, 2007. AC0707784
(26) Krokhin, O. V.; Craig, R.; Spicer, V.; Ens, W.; Standing, K. G.; Beavis, R. C.; Wilkins, J. A. Mol. Cell. Proteomics 2004, 3, 908-919.
(27) Petritis, K.; Kangas, L. J.; Yan, B.; Monroe, M. E.; Strittmatter, E. F.; Qian, W. J.; Adkins, J. N.; Moore, R. J.; Xu, Y.; Lipton, M. S.; Camp, D. G., II; Smith, R. D. Anal. Chem. 2006, 78, 5026-5039.
(28) Stensballe, A.; Jensen, O. N.; Olsen, J. V.; Haselmann, K. F.; Zubarev, R. A. Rapid Commun. Mass Spectrom. 2000, 14, 1793-1800.
(29) Molina, H.; Horn, D. M.; Tang, N.; Mathivanan, S.; Pandey, A. Proc. Natl. Acad. Sci. U.S.A. 2007, 104, 2199-2204.
(30) Lehmann, W. D. In Proceedings of the 32nd Annual Meeting of the German Mass Spectrometry Society, Oldenburg, Germany, 1999; p 112.
(31) Steen, H.; Kuster, B.; Fernandez, M.; Pandey, A.; Mann, M. Anal. Chem. 2001, 73, 1440-1448.