PTMSearchPlus: Software Tool for Automated Protein Identification

Anal. Chem. 2009, 81, 8387–8395

PTMSearchPlus: Software Tool for Automated Protein Identification and Post-Translational Modification Characterization by Integrating Accurate Intact Protein Mass and Bottom-Up Mass Spectrometric Data Searches Vilmos Kertesz,*,† Heather M. Connelly,†,‡,§ Brian K. Erickson,†,‡ and Robert L. Hettich*,† Organic and Biological Mass Spectrometry Group, Chemical Sciences Division, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, Tennessee 37831-6131, and Graduate School of Genome Science and Technology, University of Tennessee-Oak Ridge National Laboratory, 1060 Commerce Park, Oak Ridge, Tennessee 37830

PTMSearchPlus is a software tool for the automated integration of accurate intact protein mass (AIPM) and bottom-up (BU) mass spectra searches/data in order to both confidently identify the intact proteins and to characterize their post-translational modifications (PTMs). The development of PTMSearchPlus was motivated by the desire to effectively integrate high-resolution intact protein molecular masses with bottom-up peptide MS/MS data. PTMSearchPlus requires as input both intact protein and proteolytic peptide mass spectra collected from the same protein mixture, a FASTA protein database, and a selection of possible PTMs, the types and ranges of which can be specified. The output of PTMSearchPlus is a list of intact and modified proteins matching the AIPM data concomitant with their respective peptides found by the BU search. This list also contains protein and peptide sequence coverage information, scores, etc. that can be used for further evaluation or refiltering of the results. Corresponding and annotated AIPM and BU mass spectra are also displayed for visual inspection when a listed protein or a peptide is selected. These and other controls ensure that the user can manually evaluate, modify (e.g., remove obvious false positives, low quality spectra etc.), and save the results of the automated search if necessary. Driven by the exponential growth in the number of possible peptide candidates in a BU search when multiple PTMs are probed, the advantages on search speed by limiting the total number of possible PTMs on a peptide in the BU search or by performing an “AIPM predicted” BU search are also discussed in addition to the integration approach. The features of PTMSearchPlus are demonstrated using both a protein standard mixture and a complex protein mixture from Escherichia coli. Experimental data revealed a unique advantage of coupling AIPM and the BU data sets that is mutually beneficial for both approaches. Namely, AIPM data can confirm that no PTM peptides were missed in a BU search, while the BU search determines the location of the PTM. This information is not available using an AIPM search alone.

* To whom correspondence should be addressed. Senior author for computational work: Vilmos Kertesz, Oak Ridge National Lab. Phone: (865) 574-4878. Fax: (865) 576-8559. E-mail: [email protected]. Senior author for experimental MS work: Robert Hettich, Oak Ridge National Lab. Phone: (865)-574-4968. Fax: (865)-576-8559. E-mail: [email protected]. † Organic and Biological Mass Spectrometry Group, Chemical Sciences Division, Oak Ridge National Laboratory. ‡ Graduate School of Genome Science and Technology, University of Tennessee-Oak Ridge National Laboratory. § Current address: Amgen Inc., 4000 Nelson Road, Longmont, Colorado 80503. 10.1021/ac901163c CCC: $40.75 © 2009 American Chemical Society. Published on Web 09/23/2009

Various mass spectrometric approaches are available for characterizing complex protein mixtures by either interrogating the intact proteins (using accurate intact protein mass (AIPM) or top-down approaches) or their constitutive proteolytic peptides (termed “bottom-up” (BU)).1 While the BU approach is more well-developed and widely represented, each of these methods features a unique set of strengths and weaknesses. Clearly, the comprehensive characterization of complex proteomes will require further development in each method. Top-down mass spectrometry for intact protein characterization was first introduced with electrospray-Fourier transform ion cyclotron resonance-mass spectrometry (ESI-FTICR-MS).2-4 The dynamic range, sensitivity, and mass accuracy achievable by high-performance FTICR-MS affords not only high-resolution protein identification in most cases but also detailed information about the molecular state of intact proteins. This high-resolution measurement can reveal protein details that include post-translational modifications (PTMs), truncations, mutations, signal peptides, and isoforms due to the ability to accurately measure covalent modifications that alter the molecular mass.5,6 While intact protein measurement methodologies provide a powerful

(1) Bogdanov, B.; Smith, R. D. Mass Spectrom. Rev. 2004, 24, 168–200. (2) Little, D. P.; Speir, J. P.; O’Connor, P. B.; McLafferty, F. W. Anal. Chem. 1994, 66, 2809–2815. (3) Mortz, E.; O’Connor, P. B.; Roepstorff, P.; Kelleher, N. L.; Wood, T. D.; McLafferty, F. W.; Mann, M. Proc. Natl. Acad. Sci. U.S.A. 1996, 93, 8264–8267. (4) Kelleher, N. L.; Taylor, S. V.; Grannis, D.; Kinsland, C.; Chiu, H. J.; Begley, T. P.; McLafferty, F. W. Protein Sci. 1998, 7, 1796–1801. (5) VerBerkmoes, N. C.; Connelly, H. M.; Pan, C.; Hettich, R. L. Expert Rev. Proteomics 2004, 1:4, 433–447. (6) Larsen, M. R.; Roepstorff, P. Fresenius J. Anal. Chem. 2000, 366, 677–690.

Analytical Chemistry, Vol. 81, No. 20, October 15, 2009

8387

analytical approach, there are some remaining challenges for this approach. For example, online chromatography of intact proteins is often difficult due to the wide range of protein sizes and hydrophobicities, intact proteins often do not yield extensive fragmentation information, and the resulting data is often difficult to analyze and to interpret due to limited bioinformatics tools.4 The more common peptide or BU mass spectrometric approach to identify proteins and their modifications involves enzymatic digestion of proteins with a protease such as trypsin, Glu-C, or cyanogen bromide to generate a peptide mixture. This peptide mixture is then analyzed by MS/MS methods to generate peptide fragmentation spectra that are compared to theoretical spectra of possible peptide candidates from a database using different searching algorithms.7 This “shotgun” proteomics approach is able to efficiently provide a comprehensive list of proteins present even in a large multiprotein complex. However, vital information about the molecular nature of the protein may be missed if the peptides containing particular modifications or variations escape detection. Furthermore, identifying peptides that come from a complex protein mixture may not provide information to distinguish between isoforms of the same protein. One obvious solution to a more comprehensive characterization of complex protein mixtures would involve an integrated intact protein and proteolytic peptide characterization approach, which would exploit the unique strengths of each method. In such an integrated approach, the information from the comprehensive list of proteins identified by their intact molecular mass can be compared against information from the comprehensive list of proteolytic peptides corresponding to the same protein, thus revealing detailed information about modified protein isoforms. 
The correlation between the two methods can provide detailed PTM location and identity and may be more generically applicable than fragmentation information from the intact proteins. It is important to realize that while accurate molecular masses can be measured for most intact proteins (provided they are within the accessible molecular range of the mass spectrometer employed), the quality of the tandem mass spectra from intact proteins varies greatly and in some cases is not sufficient to provide much detailed information. We were one of the first groups to demonstrate the integrated intact protein and proteolytic peptide measurement approach for the characterization of the Shewanella oneidensis proteome8 and have extended this for the 70S ribosomal complex from Rhodopseudomonas palustris.9 For these studies, most of the integrated data sets were interrogated manually. Integrated intact protein and proteolytic peptide approaches have seen increased development in the last several years, focusing on both experimental10-15 and computational aspects16 but range greatly in their ability to handle high-resolution data sets and how the scoring is conducted. While there are a variety of software searching tools for BU data analysis (e.g., SEQUEST,17 Mascot,18 X!Tandem,19 etc.), there are relatively few tools for top-down and AIPM analyses. The (7) Hunt, D. F.; Yates, J. R., III; Shabanowitz, J.; Winston, S.; Hauer, C. R. Proc. Natl. Acad. Sci. U.S.A. 1986, 83, 6233–6237. (8) VerBerkmoes, N. C.; Bundy, J. L.; Hauser, L.; Asano, K. G.; Razumovskaya, J.; Larimer, F.; Hettich, R. L.; Stephenson, J. L., Jr. J. Proteome Res. 2002, 1, 239–252. (9) Strader, M. B.; VerBerkmoes, N. C.; Tabb, D. L.; Connelly, H. M.; Barton, J. W.; Bruce, B. D.; Pelletier, D. A.; Davison, B. H.; Hettich, R. L.; Larimer, F. W.; Hurst, G. B. J. Proteome Res. 2004, 3, 965–978.


current software standard for top-down work is ProSightPTM (commercially available from ThermoFisher Scientific Corporation as ProSightPC), which combines a number of search engines and a browser environment into a Web application that allows the user to analyze AIPM and corresponding protein fragmentation data.20 This program uses the masses of intact proteins and the tandem mass spectrometry information (i.e., product ion masses) of the same proteins to provide protein and PTM identifications. This software relies on the use of top-down dissociation methods that are often not as comprehensive for complex mixtures as BU methods employing an enzymatic digestion. Frequently employed methods to generate intact protein fragments include collision-induced dissociation (CID), infrared multiphoton dissociation (IRMPD), electron capture dissociation (ECD),21 or electron transfer dissociation (ETD). PROCLAME is another top-down software analysis tool that uses intact protein mass measurements to determine sets of putative protein cleavage and modification events to account for the measured protein masses observed.22 PROCLAME provides a reasonable prediction algorithm but is unable to incorporate tandem mass spectrometry (MS/MS) data within the process. More recently, BIG Mascot was introduced; it operates similarly to ProSightPTM, utilizing intact protein mass and corresponding product ion masses generated from intact protein dissociation.23 While there is some progress in the demonstration of computational software to integrate AIPM and BU data sets, as listed above, this active field is still very much under development. In this article, we describe a new software algorithm, PTMSearchPlus, which provides a comprehensive search method to enable the integration of AIPM identification with the BU generated peptide data to more rapidly and more confidently identify proteins and their associated PTMs.
The software can perform independent AIPM or BU searches, as well as integrate both approaches. By combination of these two search capabilities, the results from the AIPM search can be used to limit the number of the proteins that are used to generate the peptide database for (10) Wu, S.-L.; Kim, J.; Hancock, W. S.; Karger, B. J. Proteome Res. 2005, 4, 1155. (11) Ogorzalek-Loo, R. R.; Hayes, R.; Yang, Y.; Hung, F.; Ramachandran, P.; Kim, N.; Gunsalus, R.; Loo, J. A. Int. J. Mass Spectrom. 2005, 240, 317– 325. (12) Borchers, C. H.; Thapar, R.; Petrotchenko, E. V.; Torres, M. P.; Speir, J. P.; Easterling, M.; Dominski, Z.; Marzluff, W. F. Proc. Natl. Acad. Sci. U.S.A. 2006, 103, 3094–3099. (13) Whitelegge, J.; Halgand, F.; Souda, P.; Zabrouskov, V. Expert Rev. Proteomics 2006, 3, 585–596. (14) Johnson, K. A.; Paisley-Flango, K.; Tangarone, B. S.; Porter, T. J.; Rouse, J. C. Anal. Biochem. 2007, 360, 75–83. (15) Wu, S.; Lourette, N. M.; Tolic, N.; Zhao, R.; Robinson, E. W.; Tolmachev, A. V.; Smith, R. D.; Pasa-Tolic, L. J. Proteome Res. 2009, 8, 1347–1357. (16) Whitelegge, J.; Cournoyer, J.; Pevzner, P. J. Biomol. Tech. 2007, 18, 89. (17) Eng, J. K.; McCormack, A. L.; Yates, J. R., III J. Am. Soc. Mass Spectrom. 1994, 5, 976–989. (18) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551–3567. (19) Craig, R.; Beavis, R. C. Bioinformatics 2004, 20, 1466–1467. (20) LeDuc, R. D.; Taylor, G. K.; Kim, Y. B.; Januszyk, T. E.; Bynum, L. H.; Sola, J. V.; Garavelli, J. S.; Kelleher, N. L. Nucleic Acids Res. 2004, 32, W340–W345. (21) McLafferty, F. W.; Horn, D. M.; Breuker, K.; Ge, Y.; Lewis, M. A.; Cerda, B.; Zubarev, R. A.; Carpenter, B. K. J. Am. Soc. Mass Spectrom. 2001, 12, 245–249. (22) Holmes, M. R.; Giddings, M. C. Anal. Chem. 2004, 76, 276–282. (23) Karabacak, N. M.; Li, L.; Tiwari, A.; Hayward, L. J.; Hong, P. Y.; Easterling, M. L.; Agar, J. N. Mol. Cell. Proteomics 2009, 8, 846–856.

the BU search (“AIPM predicted” search) and, in return, the results of the BU search can be used as confirmation for the proteins with associated PTMs found in the AIPM search. The limitation of the database used in the BU search based on the results of the AIPM search may reduce the search time dramatically, allowing the user to search for more PTMs on proteins and peptides within a reasonable time frame. The power of this integrated search method is demonstrated using data from analysis of a protein standard mixture and a complex Escherichia coli ribosomal protein mixture. In addition to the integration approach, we also present a novel way to reduce the number of peptide candidates in a BU search when multiple PTMs are probed. The method allows the user to limit the number of possible PTMs on a peptide based on chemical considerations that may result in a significant decrease in the number of peptide candidates. Dramatic increases in search throughput with this method are demonstrated using data from a complex Escherichia coli protein mixture database. METHODS AND MATERIALS All proteins, salts, and buffers were obtained from Sigma Chemical Co. (St. Louis, MO). Sequencing grade trypsin was purchased from Promega (Madison, WI). Formic acid was obtained from EM Science (affiliate of Merck KGaA, Darmstadt, Germany). HPLC-grade acetonitrile and water were used for all LC-MS/MS analyses (Burdick and Jackson, Muskegon, MI). Ultrapure 18 MΩ water used for sample buffers was obtained from Millipore Milli-Q system (Bedford, MA). Fused silica capillary tubing was purchased from Polymicro Technologies (Phoenix, AZ). Sample Preparation for the Protein Standard Mixture and Escherichia coli Ribosomal Proteins. All prepared samples were divided into two portions. One portion was examined by 1D LC-MS/MS BU mass spectrometry, and the other portion of the sample was examined using LC-FTICR-MS for AIPM mass spectrometry. 
Five proteins were used in the five protein standard mixture (PSM): ubiquitin (MW 8 kDa), chicken lysozyme C (MW 14 kDa), bovine ribonuclease A (MW 13 kDa), bovine carbonic anhydrase II (MW 29 kDa), and bovine β-lactoglobulin-B (MW 18 kDa). The proteins were dissolved in HPLC grade water to give a final concentration of 1 mg/mL of each protein and diluted as required for the analysis. The PSM was neither reduced nor digested before LC-FTICR-MS characterization of the intact proteins. The PSM was digested for BU analysis with sequencing grade trypsin added at 1:20 (w/w) ratio. The digestions were run with gentle shaking at 37 °C for 12 h. After digestion, the samples were immediately desalted with an Omics 100 µL solid phase extraction pipet tip (Varian, Palo Alto, CA). All samples were frozen at -80 °C until LC-MS/MS analysis. Proteins from Escherichia coli were purified and fractionated using a high salt sucrose cushion and sucrose density fractionation, as previously described.8 For BU analysis, acid extracted ribosomal proteins were denatured and reduced in 6 M guanidine HCl, 50 mM Tris-HCl (pH 7.6), and 10 mM DTT at 60 °C for 4 h. Afterward, the proteins were digested overnight at 37 °C with sequencing grade trypsin at a 1:100 (w/w) ratio. Remaining disulfides were reduced with 10 mM DTT at 60 °C for 45 min. To

perform AIPM characterizations, the ribosomal samples were neither reduced nor digested. LC-FTICR-MS for AIPM Mass Spectrometry. All capillary HPLC-FTICR-MS experiments were conducted with an Eksigent NanoLC-2D HPLC interfaced directly to a Micromass Z-Spray source on a Varian (Lake Forest, CA) 9.4 T (Cryomagnetics Inc., Oak Ridge, TN) HiRes electrospray Fourier transform ion cyclotron resonance mass spectrometer.24 A C4 reverse-phase column (Phenomenex Jupiter, 300 Å with 5 µm particles) was packed via a pressure cell in-house and was employed for all intact protein separations. The ribosomal purification eluent or PSM consisting of 5-20 µg of total protein was injected onto the column and eluted at 2.5 µL/min into the electrospray ion source of the FTICR-MS. The gradient was run from 90% solvent A (95/5/0.1 (v/v/v) H2O/ ACN/formic acid) to 100% solvent B (95/5/0.1 (v/v/v) ACN/ H2O/formic acid) over a 60 min linear gradient. Calibration of the mass spectrometer was accomplished externally using a ubiquitin solution resulting in a mass accuracy of ±3-10 ppm and resolution of 50 000-160 000 (fwhm). 1D LC-MS/MS for BU Mass Spectrometry. For all peptide samples, one-dimensional (1D) LC-MS/MS experiments were performed with a Famos/Switchos/Ultimate HPLC system (Dionex, Sunnyvale, CA) coupled to an LTQ quadrupole ion trap mass spectrometer (Thermo Finnigan, San Jose, CA) equipped with a nanospray source, as previously described.25 A 160 min linear gradient from 100% solvent A (95/5/0.1 (v/v/v) H2O/ACN/ formic acid) to 100% solvent B (30/70/0.1 (v/v/v) H2O/ACN/ formic acid) was employed. For all 1D LC-MS/MS data acquisition, the LTQ was operated in the data dependent mode with dynamic exclusion enabled (repeat count 2), where the five most abundant peaks in every MS scan were subjected to MS/MS analysis. Data dependent LC-MS/MS was performed over a parent m/z range of 400-2000. Software. 
PTMSearchPlus was developed using Delphi 3 computer language (Borland Software Corp., Scotts Valley, CA) under the Microsoft Windows XP Home Edition (Microsoft Corp., Redmond, WA) operating system and can be run in any 32-bit Windows environment with at least 256 MB RAM. Currently, the program is freely available upon request to any government or educational institute.26 RESULTS AND DISCUSSION Search Options. PTMSearchPlus currently supports the following search options: (a) a standalone AIPM search, (b) a standalone BU search using the MyriMatch27 scoring algorithm, and (c) an integrated AIPM and MyriMatch-based BU search. These search options are discussed briefly below. For more detailed information, please see the Supporting Information. Standalone AIPM Search. Deconvoluted isotopic peak envelopes from FTICR-MS measurements were matched against (24) Connelly, H. M.; Pelletier, D. A.; Lu, T.-Y.; Lankford, P. K.; Hettich, R. L. Anal. Biochem. 2006, 357, 93–104. (25) VerBerkmoes, N. C.; Shah, M. B.; Lankford, P. K.; Pelletier, D. A.; Strader, M. B.; Tabb, D. L.; McDonald, W. H.; Barton, J. W.; Hurst, G. B.; Hauser, L.; Davison, B. H.; Beatty, J. T.; Harwood, C. S.; Tabita, F. R.; Hettich, R. L.; Larimer, F. W. J. Proteome Res. 2006, 5, 287–298. (26) E-mail: [email protected]. (27) Tabb, D. L.; Fernando, C. G.; Chambers, M. C. J. Proteome Res. 2007, 6, 654–661.


Figure 1. Flowchart of integration of accurate intact protein mass (AIPM) and bottom-up (BU) searching algorithms. The filled arrow indicates the “AIPM predicted” BU search approach.

calculated28 isotopic peak envelopes of modified and nonmodified proteins from a database, which contains FASTA formatted protein sequences. A match was judged on the mass difference of the most abundant peaks of the experimental and calculated isotopic envelopes. In general, a maximum difference of 50 mDa was declared as a match in the searches. Standalone BU Search. In this mode, the software used the MyriMatch27 scoring algorithm to compare modified and nonmodified peptides of a given protein database against BU mass spectra information stored in MS2 files.29 Peptides with scores above a certain limit were assigned as a match and used in calculation of protein coverages. Integrated AIPM and BU Search. The hollow arrows in Figure 1 illustrate the most straightforward approach to integrate AIPM and BU searching algorithms in general. In this case, AIPM and BU data were searched independently using the specified full PTM database, and the results were then compared and combined. This approach was considered to be a “complete” search, as all proteins (and their possible PTMs) were checked against the two different AIPM and BU data sets. The filled arrow in Figure 1 represents a different approach that was also implemented in PTMSearchPlus to limit search space for the BU search. The search space limitation was based on the proteins and PTMs found in the AIPM search. With the use of this approach, an AIPM search was conducted first, followed by assigning the union of possible PTMs found for a particular protein. For example, if protein 1 was found in three different forms in the AIPM search, e.g., once with two methylations, once with a phosphorylation, and once with a β-methylthiolation, then the union of these PTMs was assigned to protein 1. 
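The union step in the protein 1 example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PTMSearchPlus implementation: the function name `aipm_predicted_ptm_space`, the input layout (one `(protein, {PTM: count})` pair per form matched in the AIPM search), and the reading of "union" as the per-PTM maximum count across forms are all hypothetical choices made for the sketch.

```python
from collections import Counter


def aipm_predicted_ptm_space(aipm_hits):
    """Map each protein to the union of PTMs over all of its AIPM matches.

    aipm_hits: iterable of (protein_id, {ptm_name: count}) pairs, one pair
    per modified or unmodified form matched in the intact-mass search
    (hypothetical input layout).
    """
    space = {}
    for protein_id, ptms in aipm_hits:
        # Counter's | operator keeps the larger count per PTM name,
        # which is the multiset union assumed in this sketch.
        space[protein_id] = space.get(protein_id, Counter()) | Counter(ptms)
    return space


hits = [
    ("protein 1", {"methylation": 2}),
    ("protein 1", {"phosphorylation": 1}),
    ("protein 1", {"beta-methylthiolation": 1}),
    ("protein 2", {}),  # matched only in its unmodified form
]
space = aipm_predicted_ptm_space(hits)
print(dict(space["protein 1"]))
```

The resulting per-protein dictionary is the kind of object that could bound the PTM peptides generated for that protein in the subsequent BU search.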
This individually assigned PTM (two methylations + a phosphorylation + a β-methylthiolation in this example) represents the maximum PTM search space that was used to create PTM peptides of the given protein (protein 1 in the example) in the BU search. For proteins not found in the AIPM search, peptides were generated without any PTM from the intact (nonmodified) sequence of a given protein and tested in the BU search. The advantage of this method over the “complete” search was the significant decrease in the number of theoretical peptide

candidate sequences generated during the BU search by taking advantage of the “AIPM predicted” BU search. Obviously, such a method requires good quality separation and identification of intact proteins. Otherwise, a valid, modified protein that was not identified in the AIPM search, but truly existed in the sample, would not be represented in the BU search.

Decreasing the Number of Peptide Candidates by Restricting the Maximum Number of PTMs on a Single Peptide. To the best of our knowledge, the current commercially available BU search engines do not have the ability to limit the total number of different PTMs on a single peptide to a reasonable level that could be considered acceptable from a chemical viewpoint. However, a dramatic decrease in the number of peptide candidates and a noticeable search speed increase can be achieved when applying a limitation on the total number of PTMs on a single peptide, as described in the three scenarios below:

Same Amino Acid with Multiple PTMs. If a peptide has $n_X$ residues of amino acid X, and each X can have $p_X$ different PTMs, then a simple combinatory approach of

$$\binom{n_X}{i} p_X^{i}$$

describes the number of peptide sequences that have exactly $i$ PTMs. On the basis of this formula, the number of all possible PTM peptide sequences (one of them being the intact peptide, $i = 0$) can be derived by summing the number of peptide sequences that have $i = 0, 1, \ldots, n_X$ PTMs in their sequence and has the form of

$$\sum_{i=0}^{n_X} \binom{n_X}{i} p_X^{i}$$

Similarly, if one allows a maximum of $m$ PTMs per peptide (where $m \le n_X$) of $p_X$ different PTMs on $n_X$ residues of amino acid X, it results in

$$\sum_{i=0}^{m} \binom{n_X}{i} p_X^{i}$$

different peptide candidate sequences.

Different Amino Acids with the Same PTMs. The situation is still simple when not one, but $q$ different amino acids ($X_1, X_2, \ldots, X_q$) can have the same $p_X$ PTMs (e.g., mono-, di-, and trimethylation on arginine and lysine). In this case, $n_X$ is simply the sum of the number of amino acids $X_1, X_2, \ldots, X_q$,

$$n_X = \sum_{j=1}^{q} n_{X_j}$$

and the number of possible peptide candidates is again

$$\sum_{i=0}^{n_X} \binom{n_X}{i} p_X^{i}$$

without restricting the maximum PTM/peptide value, and

$$\sum_{i=0}^{m} \binom{n_X}{i} p_X^{i}$$

if one allows maximum $m$ PTMs on a single peptide ($m \le n_X$).

(28) Rockwood, A. L.; Van Orden, S. L.; Smith, R. D. Anal. Chem. 1995, 67, 2699–2704. (29) McDonald, W. H.; Tabb, D. L.; Sadygov, R. G.; MacCoss, M. J.; Venable, J.; Graumann, J.; Johnson, J. R.; Cociorva, D.; Yates, J. R., III Rapid Commun. Mass Spectrom. 2004, 18, 2162–2168.

Multiple Amino Acids with Different PTMs. The situation becomes more complicated when different amino acids can have different PTMs. Suppose we have a peptide that has $n_{X_1}$ of amino acids $X_1$, $n_{X_2}$ of amino acids $X_2$, ..., $n_{X_q}$ of amino acids $X_q$, that each can have $p_{X_1}, p_{X_2}, \ldots, p_{X_q}$ different PTMs, respectively. If no restriction is applied on the number of PTMs on a single peptide, the number of possible peptide candidates is

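$$\prod_{j=1}^{q} \sum_{i=0}^{n_{X_j}} \binom{n_{X_j}}{i} p_{X_j}^{i}$$

However, if the number of PTMs on a single peptide is limited to $m$, then the number of possible peptide candidates is

$$\sum_{i_1=0}^{\min(m,\,n_{X_1})} \left( \binom{n_{X_1}}{i_1} p_{X_1}^{i_1} \sum_{i_2=0}^{\min\left(m-\sum_{k=1}^{1} i_k,\; n_{X_2}\right)} \left( \binom{n_{X_2}}{i_2} p_{X_2}^{i_2} \times \cdots \times \sum_{i_q=0}^{\min\left(m-\sum_{k=1}^{q-1} i_k,\; n_{X_q}\right)} \binom{n_{X_q}}{i_q} p_{X_q}^{i_q} \cdots \right) \right)$$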
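The three counting scenarios above can be checked numerically with a short Python sketch. It is an illustration only, not code from PTMSearchPlus; the function names `count_single_pool` and `count_multi_pool` and the `(n, p)` pair layout are hypothetical. By the binomial theorem, the unrestricted single-pool sum equals $(1 + p_X)^{n_X}$, which gives a convenient sanity check.

```python
from math import comb
from typing import Optional, Sequence, Tuple


def count_single_pool(n_x: int, p_x: int, m: Optional[int] = None) -> int:
    """Candidates when n_x modifiable residues each accept one of p_x PTMs.

    Covers the first two scenarios (in the second, n_x is the pooled count
    over all residue types sharing the same PTM set). m caps the number of
    PTMs per peptide; None means unrestricted.
    """
    cap = n_x if m is None else min(m, n_x)
    return sum(comb(n_x, i) * p_x**i for i in range(cap + 1))


def count_multi_pool(pools: Sequence[Tuple[int, int]],
                     m: Optional[int] = None) -> int:
    """Candidates when residue type j has n_j sites and p_j possible PTMs.

    pools holds (n_j, p_j) pairs. The nested sums of the restricted formula
    are evaluated recursively: each level spends part of the remaining PTM
    budget on one residue type and passes the rest down.
    """
    def recurse(j: int, budget: int) -> int:
        if j == len(pools):
            return 1
        n_j, p_j = pools[j]
        return sum(
            comb(n_j, i) * p_j**i * recurse(j + 1, budget - i)
            for i in range(min(budget, n_j) + 1)
        )

    total_sites = sum(n for n, _ in pools)
    return recurse(0, total_sites if m is None else m)


# Example peptide: two K/R residues (mono-, di-, or trimethylation, p = 3)
# and one aspartic acid (beta-methylthiolation only, p = 1).
pools = [(2, 3), (1, 1)]
print(count_multi_pool(pools))        # 32 = (1 + 3)**2 * (1 + 1)
print(count_multi_pool(pools, m=2))   # 23
print(count_multi_pool(pools, m=1))   # 8
```

For this small peptide the cap of one PTM already removes three-quarters of the candidates; the effect compounds quickly for longer peptides with many modifiable residues.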

Evaluation of Reduction in the Number of Peptide Candidates by Restricting the Maximum Number of PTMs on a Single Peptide. We have evaluated the theory above by counting the number of tryptic PTM peptide candidates and measuring the time necessary to score them (search time) for the proteins of an Escherichia coli ribosomal protein complex. The database included all possible Escherichia coli proteins (4287 in total), from which tryptic peptides with a maximum of four missed cleavages and molecular weights ranging from 400 to 6000 Da were considered. Three different conditions were investigated: (a) monomethylation can occur on arginine and lysine (see Different Amino Acids with the Same PTMs); (b) mono-, di- and trimethylation can occur on arginine and lysine (see Different Amino Acids with the Same PTMs); and (c) mono-, di- and trimethylation can occur on arginine and lysine and β-methylthiolation can occur on aspartic acid (see Multiple Amino Acids with Different PTMs). Parts a and b of Figure 2 present the number of tryptic PTM peptide candidates and the corresponding BU search time, respectively, on a logarithmic scale for the 4287 proteins of the Escherichia coli proteome. The filled bars represent case a, as discussed above, where monomethylation was considered on arginine and lysine. Scoring the approximately 5 670 000 different tryptic peptide candidate sequences took 46.3 min on a desktop PC equipped with an AMD Duron 1800+ CPU if the maximum PTM/peptide value was not restricted. The number of PTM peptide sequence candidates was decreased to approximately 5 118 000 (90.2% of the original 5 670 000 peptide candidates), 3 843 000 (67.8%), or 2 004 000 (35.3%) and took 43.9 (94.8% of the original 46.3 min), 34.1 (72.7%), or 20 min (43.2%) to score them when the maximum number of allowable PTMs was restricted to 3, 2, or 1, respectively.
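To make the peptide enumeration concrete, the following Python sketch generates tryptic peptides with a limited number of missed cleavages. It is a minimal illustration, not the digestion routine of PTMSearchPlus: the function name `tryptic_peptides` is hypothetical, the cleavage rule (C-terminal to K or R, but not before P) is a commonly used convention assumed here rather than taken from the text, and the 400 to 6000 Da molecular weight filter is omitted.

```python
import re


def tryptic_peptides(sequence: str, max_missed: int = 4):
    """Yield (start, peptide) pairs for tryptic peptides of `sequence`.

    Assumed rule: cleave after K or R unless the next residue is P.
    max_missed is the maximum number of missed cleavage sites per peptide.
    """
    if not sequence:
        return
    # End index of every fragment produced by a complete digestion.
    ends = [match.end() for match in re.finditer(r"[KR](?!P)", sequence)]
    if not ends or ends[-1] != len(sequence):
        ends.append(len(sequence))
    starts = [0] + ends[:-1]
    for i, start in enumerate(starts):
        # Join fragment i with up to max_missed following fragments.
        last = min(i + max_missed, len(ends) - 1)
        for j in range(i, last + 1):
            yield start, sequence[start:ends[j]]


for start, peptide in tryptic_peptides("AKRPGKC", max_missed=1):
    print(start, peptide)
```

Counting the K, R, and D residues of each generated peptide then gives the site counts that enter the candidate-number formulas of the preceding section.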

Figure 2. (a) Number of tryptic PTM peptide candidates and (b) search time of 4287 proteins of the Escherichia coli proteome without using “AIPM predicted” limitation as a function of the maximum number of PTMs allowed on a single peptide when (filled bars) monomethylation was considered on arginine and lysine; (hatched bars) mono-, di-, and trimethylation were allowed on arginine and lysine, and (empty bars) mono-, di-, and trimethylation were allowed on arginine and lysine, and β-methylthiolation was considered on aspartic acid. (c) Number of tryptic PTM peptide candidates and (d) search time (filled bars) with and (empty bars) without using “AIPM predicted” limitation when mono-, di-, and trimethylation are allowed on arginine and lysine, and β-methylthiolation was considered on aspartic acid. NS stands for a non-restricted search.

The decrease in search time and in the number of tryptic PTM peptide candidates was even more dramatic in case b. The hatched bars in parts a and b of Figure 2 show the number of peptide candidates when mono-, di-, and trimethylation were allowed on arginine and lysine. In this case, approximately 111 820 000 different PTM peptide candidates were generated and scored in 696 min if the maximum number of PTMs in a peptide was not limited. The number of PTM peptide sequences decreased to about 11 752 000 (10.5%), 5 325 000 (4.8%), or 2 004 000 (1.8%) and the peptide candidates were scored in 99 (14.2%), 48 (6.9%), and 20 min (2.9%) if a maximum of 3, 2, or 1 PTM(s) was allowed on a single peptide, respectively.

Empty bars in parts a and b in Figure 2 represent case c when mono-, di-, and trimethylation were allowed on arginine and lysine and β-methylthiolation was considered on aspartic acid. The number of different tryptic PTM peptide candidates was about 753 662 000; 23 022 000 (3%), 8 826 000 (1.2%), or 2 685 000 (0.4%), and took 4700; 158 (3.4%), 69 (1.5%), or 25 min (0.5%) to score when no limitation was set for maximum PTM/peptide or a maximum of 3, 2, or 1 PTM(s) was allowed on a peptide, respectively. Evaluation of Reduction in the Number of Peptide Candidates by Applying “AIPM Predicted” Limitation. Parts c and d of Figure 2 show the number of tryptic PTM peptide candidates and the corresponding BU search time, respectively, on a logarithmic scale with and without utilization of the “AIPM predicted” limitation. For this test, an AIPM peak data set was first collected from an Escherichia coli ribosomal protein complex mixture. This data set was then analyzed by an AIPM search allowing a maximum of 10 methylations and 3 β-methylthiolations on a single protein. Filled and empty bars in parts c and d of Figure 2 correspond to the number of different tryptic PTM peptide candidates and the corresponding BU search time with and without applying the “AIPM predicted” limitation, respectively. In the BU search, mono-, di-, and trimethylation on arginine and lysine and β-methylthiolation on aspartic acid were allowed. The reduction in the number of PTM peptide candidates and search time is clearly displayed in parts c and d of Figure 2 when the “AIPM predicted” limitation was applied. The approximately 7 561 000 peptide candidate sequences (1.0% of that of the “complete” search) were searched in 65 min (1.4% of that of the “complete“ search) using the “AIPM predicted” limitation and without limiting the maximum number of PTMs on a single peptide. 
An additional decrease in the number of peptide candidates and BU search time compared to a “complete” search without limits on the maximum value of PTM/peptide can be achieved by combining “AIPM predicted” and PTM/peptide limitations. The number of possible peptide candidates were decreased to approximately 973 000 (0.13%), 710 000 (0.09%), or 576 000 (0.08%) and took 9.7 (0.20%), 7.1 (0.15%), or 5.8 min (0.12%) to score when a maximum of 3, 2, or 1 PTM was allowed on a peptide, respectively. Note that BU search time decreased by approximately 800-fold (5.8 min versus 4700 min corresponding to an “AIPM predicted” search with a maximum of 1 PTM/peptide versus a “complete” search without a limit on the maximum value of PTM/peptide, respectively) using a combination of the two presented peptide candidate number reduction approaches. The rate of the reduction on the number of PTM peptide candidates depends on many factors (e.g., number of proteins identified in the AIPM search, various search parameters, etc.). Here, we have tried to present a general, randomly selected case. Obviously, limiting the maximum number of PTMs on a single peptide prevents detection of peptides with more PTMs. Furthermore, an “AIPM predicted” BU search may miss a corresponding PTM peptide of a PTM protein not found by the AIPM search, as stated above. For these reasons, the presented possible limitation strategies on the number of PTM peptide candidates should always be used according to the chemical information about the sample (e.g., how 8392

Analytical Chemistry, Vol. 81, No. 20, October 15, 2009

extensive the PTM of the proteins is or is expected to be) and the quality of the AIPM analysis (e.g., good quality spectra, signal strength, chances of missing some low abundance proteins, elution efficiency/detectability of proteins with high molecular weight, etc.).

Evaluation of PTMSearchPlus for a Protein Standard Mixture. A five protein standard mixture (PSM) consisting of ubiquitin, lysozyme C, ribonuclease A, β-lactoglobulin B, and carbonic anhydrase II was prepared and divided into two parts followed by their independent AIPM and BU analyses. A combined AIPM and “complete” BU search was performed on the data obtained. These results served as a training set to evaluate the performance of PTMSearchPlus with an initial simple mixture. All searching was performed with a database composed of the five proteins as well as common contaminants to give a total of 43 proteins within the database. Specified PTMs were disulfide bond formation between cysteine residues and methionine truncation at the N-terminus based on the nature of the sample. Details on the search parameters are included in the Supporting Information. The result of the combined search is summarized in Table 1. Ubiquitin, lysozyme C, and β-lactoglobulin B were identified by both the AIPM and the BU searches. Post-translational modifications in the form of disulfide bonds were identified on lysozyme C, β-lactoglobulin B, and ribonuclease A in the AIPM search. These modifications were expected due to the use of purchased protein stocks with these known modifications. Trypsin and carbonic anhydrase were only found by the BU search. Trypsin was used as the proteolytic enzyme to generate peptides for the BU sample and was not expected to be identified in the AIPM analysis. Additionally, carbonic anhydrase is difficult to elute from a C4 reverse phase column due to its relatively large molecular mass (29 kDa) and hydrophobicity and thus was not measured in the intact protein mass spectra.
Surprisingly, ribonuclease A was found by the AIPM analysis but not by BU. Manual inspection revealed high-quality mass spectra and excluded the mass spectra quality as a possible reason for not finding peptides confirming ribonuclease A. However, the low sequence coverage may indicate that low digestion efficiency is responsible for the lack of corresponding peptides. Parts a and b of Figure 3 show an example of matching calculated and measured isotopic distributions, respectively, for lysozyme C with three disulfide bonds discovered by the AIPM search and exhibiting 2.1 ppm mass difference between their most abundant peaks. Also, an annotated experimental MS/MS spectrum confirming the presence of peptide NTDGSTDYGILQINSR from lysozyme C in the digested sample solution and discovered by the BU search is presented in Figure 3c.

Evaluation of PTMSearchPlus for Escherichia coli Ribosomal Proteins. A full protein database of Escherichia coli provided a base for PTMSearchPlus to evaluate its effectiveness with a more complex sample. A purified ribosomal protein mixture was divided into two parts followed by their independent AIPM and BU analyses. Similarly to the PSM experiment, a combined AIPM and BU search was performed on the data obtained. The search was accomplished using “complete” and “AIPM predicted” BU searches as well. The PTMs included in the AIPM search were mono-, di-, and trimethylation on arginine and lysine, methionine truncation at the N-terminus, and disulfide formation

Table 1. Proteins Identified in the Accurate Intact Protein Mass (AIPM) and Bottom-Up (BU) Searches from the Protein Standard Mixture by PTMSearchPlus^a

| protein | AIPM PTM | ∆ppm | BU% | BU peptides |
|---|---|---|---|---|
| ribonuclease A | 4 disulfides | 0.5 | | |
| lysozyme C | 3 disulfides | 2.1 | 21.7 | FESNFNTQATNR, NTDGSTDYGILQINSR |
| β-lactoglobulin B | 2 disulfides | 1.9 | 42.0 | GLDIQK, IIAEK, IDALNENK, LIVTQTMK, VLVLDTDYK, VLVLDTDYKK, TPEVDDEALEK, VYVEELKPTPEGDLEILLQK |
| ubiquitin | intact | 0.1 | 32.9 | ESTLHLVLR, TITLEVEPSDTIENVK |
| carbonic anhydrase II | | | 31.2 | VLDALDSIK, VGDANPALQK, DFPIANGER, EPISVSSQQMLK, MVNNGHSFNVEYDDSQDK, YGDFGTAAQQPDGLAVVGVFLK, RMVNNGHSFNVEYDDSQDK |
| trypsin precursor | | | 17.3 | LGEHNIDVLEGNEQFINAAK, IITHPNFNGNTLDNDIMLIK |

a Corresponding PTM modifications of proteins (AIPM PTM), mass difference between the theoretical and experimental protein masses (∆ppm), sequence coverage based on the BU search (BU%), and peptides found in the BU search (BU peptides) are also displayed.
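As an illustration of how a BU% value such as those in Table 1 can be obtained, the sketch below computes sequence coverage as the percentage of protein residues covered by at least one identified peptide. This is an assumed definition, not code from PTMSearchPlus; the function name is hypothetical, and the 76-residue ubiquitin sequence is supplied from general knowledge rather than from the source.

```python
from typing import List

# Assumption: the standard 76-residue ubiquitin sequence (not given in the text).
UBIQUITIN = (
    "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQ"
    "QRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"
)


def sequence_coverage(protein: str, peptides: List[str]) -> float:
    """Return the percentage of residues covered by at least one peptide."""
    covered = [False] * len(protein)
    for peptide in peptides:
        start = protein.find(peptide)
        while start != -1:  # mark every occurrence of the peptide
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return 100.0 * sum(covered) / len(protein)


if __name__ == "__main__":
    # Ubiquitin peptides listed in Table 1
    found = ["ESTLHLVLR", "TITLEVEPSDTIENVK"]
    print(f"{sequence_coverage(UBIQUITIN, found):.1f}%")
```

With these inputs, 25 of 76 residues are covered, which gives about 32.9%, in line with the BU% listed for ubiquitin in Table 1. Overlapping peptides (e.g., VLVLDTDYK and VLVLDTDYKK) are counted only once because coverage is tracked per residue.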

Figure 3. (a) Calculated and (b) measured isotopic distributions for P00698|LYC_CHICK Lysozyme C protein with 3 disulfide bonds exhibiting a 2.1 ppm mass difference between their most abundant peaks. (c) MS/MS spectrum from peptide NTDGSTDYGILQINSR of the same protein.
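The ppm mass difference quoted in the caption can be illustrated with the conventional relative mass error. The sketch below is a generic calculation, not the PTMSearchPlus code; the function names, the tolerance parameter, and the two example masses are hypothetical values chosen only to give an error near 2 ppm.

```python
def ppm_difference(theoretical_mass: float, measured_mass: float) -> float:
    """Signed mass error in parts per million relative to the theoretical mass."""
    return (measured_mass - theoretical_mass) / theoretical_mass * 1e6


def within_tolerance(theoretical_mass: float, measured_mass: float,
                     tolerance_ppm: float) -> bool:
    """True if the absolute mass error does not exceed `tolerance_ppm`."""
    return abs(ppm_difference(theoretical_mass, measured_mass)) <= tolerance_ppm


if __name__ == "__main__":
    # Hypothetical masses (Da) of the most abundant isotopic peaks
    theoretical, measured = 14305.00, 14305.03
    print(f"{ppm_difference(theoretical, measured):.2f} ppm")
    print(within_tolerance(theoretical, measured, tolerance_ppm=5.0))
```

For a protein of roughly 14 kDa, an error of about 2 ppm corresponds to only about 0.03 Da, which shows why accurate intact mass measurement can constrain the number and type of PTMs on a protein.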

between cysteine residues. Within the BU search, the specified PTMs were mono-, di-, and trimethylation on arginine and lysine and methionine truncation at the N-terminus. (Note that acetylation was not specified explicitly as a PTM but must be considered when trimethylation with the same approximately 42 Da mass shift was found.) Details on the search parameters are included in the Supporting Information. From this integrated AIPM-BU search, we identified 52 out of the total 54 possible ribosomal proteins, many of which were not modified or only exhibited methionine truncation. Table 2 summarizes the PTM-containing ribosomal proteins and peptides confidently identified by an AIPM and/or a BU search. Four PTM proteins identified (L7/L12, L11, S5, and S11) all had PTMs that exactly matched the PTM of the corresponding peptide found using an “AIPM predicted” BU search. These data demonstrate the unique advantage of coupling the AIPM and BU data sets, in which higher confidence is achieved by the related but independent measurements. Namely, the AIPM data of these four proteins confirm that all of the PTM peptides were found in the BU search, i.e., no peptide with a PTM was missed. This confirmation is not available without coupling the approaches together. On the other hand, the BU search determines the location of the PTM, which is difficult to ascertain by the AIPM search. As an example, Figure 4 presents corresponding identifications from AIPM and BU searches of the same protein. Parts a and b of Figure 4 show calculated and measured isotopic distributions, respectively, of 50S ribosomal protein L7/L12 with methionine loss and trimethylation/acetylation found by the AIPM search. The mass difference between the most abundant peaks of the calculated and measured isotopic distributions was 0.2 ppm. The location of the trimethylation/acetylation identified by the AIPM search was determined by BU data.
The spectrum in Figure 4c is assigned to a peptide of 50S ribosomal protein L7/L12 with a sequence of SIT(K + trimethylation/acetylation)DQIIEAVAAMSVMDVVELISAMEEK. This peptide contains a trimethylation on K5 and is also the result of a methionine truncation of the original protein. A “complete” BU search was also performed to check the validity of the “AIPM predicted” analysis with a complex sample. In the “complete” search, a peptide of protein S4 with a methylation was found, which had not been identified previously in the “AIPM predicted” BU search (see Table 2). The methylated peptide was likely missed by the “AIPM predicted” BU search because the modified S4 protein was not found by the AIPM search (the unmodified S4 also was not detected). This resulted in a peptide database containing only the non-PTM peptides of S4 during the BU search. As S4 is a 23.5 kDa protein, the reason for not identifying it in the AIPM search is most likely that it was not eluted off the C4 reverse phase column used in the AIPM analysis, similarly to the 29 kDa carbonic anhydrase in the PSM experiment. Manual inspection revealed very few peaks above 20 kDa identified by the AIPM analysis. At present, the integrated AIPM-BU search discussed above does not provide the capability to track the PTM peptides of a protein that is not found by the AIPM method (i.e., if it did not elute from the column or was not detected in the intact form for any reason). However, the “AIPM predicted” BU search does look for non-PTM peptides of proteins not found in the AIPM search. While the current version of PTMSearchPlus does not include this feature, confident identification of peptides of these proteins in the BU search could be used as a trigger to search for PTMs on peptides of a protein not identified in the AIPM search. Furthermore, based on the experimental data, a cutoff mass for the “AIPM predicted” BU search could be specified in the software to target this problem from another angle. The cutoff mass would define the mass that a protein has to exceed in order to generate its peptides using a “complete” BU search. This modification would decrease the chance of missing a PTM peptide even if the protein is not eluted from the separation column during the AIPM analysis, while keeping the speed advantage of the “AIPM predicted” BU search for proteins below the cutoff mass.

Table 2. Escherichia coli Ribosomal Proteins and Peptides Confidently Identified with PTMs by Accurate Intact Protein Mass (AIPM) and Bottom-Up (BU) Searches Using PTMSearchPlus^a

| protein | AIPM PTM | ∆ppm | BU PTM peptides | BU score |
|---|---|---|---|---|
| L7/L12 | (M loss) + TriMet/Ace | 0.2 | (M loss)SIT(K + TriMet/Ace)DQIIEAVAAMSVMDVVELISAMEEK | 85.61 |
| L11 | TriMet/Ace | 1.8 | LQVAAGMANPSPPVGPALGQQGVNIMEFC(K + TriMet/Ace)AFNAK | 43.43 |
| S4^b | | | C(K + Met)IEQAPGQHGAR | 33.71 |
| S5 | (M loss) + TriMet/Ace | 18.5 | (M loss)AHIE(K + TriMet/Ace)QAGELQEK | 32.33 |
| S11 | (M loss) + Met | 27.2 | (M loss)A(K + Met)APIRAR | 28.19 |

a Corresponding PTM modifications of proteins (AIPM PTM), mass difference between the theoretical and experimental protein masses (∆ppm), and PTM peptides found in the BU search (BU PTM peptides) with associated BU scores are also displayed. “M loss”, “Met”, and “TriMet/Ace” stand for loss of methionine, methylation, and trimethylation/acetylation, respectively. The PTM peptide of the S4 protein was identified only by a “complete” BU search. All other peptides were also found by the “AIPM predicted” BU search. b Ribosomal protein S4 was not found by AIPM.

Figure 4. (a) Calculated and (b) measured isotopic distributions for 50S ribosomal protein L7/L12 with methionine loss and trimethylation/acetylation exhibiting 0.2 ppm mass difference between their most abundant peaks. (c) MS/MS spectrum of peptide SIT(K + 3xMethyl/Acetyl)DQIIEAVAAMSVMDVVELISAMEEK of the same protein.


CONCLUSIONS PTMSearchPlus provides a novel computational approach for the integration of accurate intact protein mass (AIPM) and bottom-up (BU) searches to both confidently identify intact proteins and characterize their PTMs. The required input data are a FASTA protein database, a selection of possible PTMs, the types and ranges of which can be specified, and both intact protein and proteolytic peptide mass spectra collected from the same protein mixture. After a search is conducted, the software outputs a list of intact and PTM proteins matching the AIPM data with their respective peptides found by the BU search. This list also includes protein and peptide sequence coverage information, scores, etc. Furthermore, the software allows manual evaluation of the results, including visual inspection of annotated AIPM and BU mass spectra, modification (e.g., removal of obvious false positives, low quality spectra, etc.), and (automatic) refiltering. Improvement in BU search speed when limiting the total number of possible PTMs on a peptide or performing an “AIPM predicted” search was also evaluated. All of these features of PTMSearchPlus were demonstrated using a protein standard mixture or a complex protein mixture from Escherichia coli. Also demonstrated was a unique advantage of coupling the AIPM and BU data sets that is mutually beneficial for both approaches: AIPM data can confirm that no PTM peptides were missed in a BU search, while the BU search determines the location of the PTM, which is not readily determined through an AIPM search alone. Currently, development of a new scoring algorithm for the AIPM search is underway in which the score is based on mass and intensity differences of the peaks in the theoretical and measured isotopic envelopes. Future work also includes evaluation of using a cutoff mass for the “AIPM predicted” BU search. Furthermore, triggering a “complete” BU search of a protein when it is not identified by the AIPM search, but its peptides are confidently identified by the BU search, will be assessed.
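The two future-work ideas, a cutoff mass and a BU-triggered “complete” search, could be combined in a simple per-protein decision rule. The sketch below is purely hypothetical, since the text states that the current version of PTMSearchPlus does not include these features; all names, the 20 kDa cutoff, and the minimum peptide count are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProteinStatus:
    name: str
    mass_da: float                  # theoretical protein mass in Da
    found_by_aipm: bool             # identified in the AIPM search?
    confident_bu_peptides: int = 0  # non-PTM peptides confidently found by BU


def choose_bu_mode(protein: ProteinStatus,
                   cutoff_mass_da: float,
                   min_bu_peptides: int = 1) -> str:
    """Pick how PTM peptide candidates are generated for one protein."""
    # Cutoff mass: large proteins may not elute in the AIPM analysis,
    # so all of their PTM peptides are generated.
    if protein.mass_da > cutoff_mass_da:
        return "complete"
    # Found intact: generate only PTM peptides consistent with the AIPM hit.
    if protein.found_by_aipm:
        return "aipm_predicted"
    # Trigger: not found intact, but confident BU peptides exist.
    if protein.confident_bu_peptides >= min_bu_peptides:
        return "complete"
    # Otherwise only non-PTM peptides are searched.
    return "non_ptm_only"


if __name__ == "__main__":
    s4 = ProteinStatus("S4", 23500.0, found_by_aipm=False, confident_bu_peptides=2)
    print(choose_bu_mode(s4, cutoff_mass_da=20000.0))
```

Under this rule a protein such as S4 (23.5 kDa, not found by AIPM) would be searched with all PTM candidates, while smaller proteins found intact would keep the speed advantage of the “AIPM predicted” search.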

ACKNOWLEDGMENT Co-first authors V.K. and H.M.C. contributed equally to this work. The authors thank David L. Tabb (Vanderbilt University, Nashville, TN) and Alan Rockwood (ARUP Laboratories, Salt Lake City, UT) for their help in integrating the MyriMatch scoring algorithm and the isotopic envelope calculator, respectively, into PTMSearchPlus. The authors thank Morgan Giddings (University of North Carolina, Chapel Hill, NC) for supplying the Escherichia coli ribosomal sample as part of another project. Research support was provided by the National Institutes of Health, General Medicine Program (Grant NIH-R01-GM070754). H.M.C. and B.K.E. wish to acknowledge the ORNL-UTK Genome Science and Technology Graduate School. Oak Ridge National Laboratory is managed and operated by the University of Tennessee-Battelle, LLC under Contract DE-AC05-00OR22725 with the U.S. Department of Energy.

Received for review May 28, 2009. Accepted September 11, 2009.

SUPPORTING INFORMATION AVAILABLE Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

AC901163C
