Automated Proteomics of E. coli via Top-Down Electron-Transfer

Anal. Chem. 2008, 80, 1459-1467

Automated Proteomics of E. coli via Top-Down Electron-Transfer Dissociation Mass Spectrometry

Maureen K. Bunger, Benjamin J. Cargile, Anne Ngunjiri, Jonathan L. Bundy,* and James L. Stephenson, Jr.

Mass Spectrometry Research Program, Research Triangle Institute, 3040 Cornwallis Road, Research Triangle Park, North Carolina 27709

Electron-transfer dissociation (ETD) has recently been introduced as a fragmentation method for peptide and protein analysis. Unlike fragmentation by collisionally induced dissociation (CID), fragmentation by ETD occurs randomly along the peptide backbone. With the use of the sequences determined from the protein termini and the parent protein mass, intact proteins can be unambiguously identified. Because of the fast kinetics of these reactions, top-down proteomics can be performed using ETD in a linear ion trap mass spectrometer on a chromatographic time scale. Here we demonstrate the utility of ETD in high-throughput top-down proteomics using soluble extracts of E. coli. Development of a multidimensional fractionation platform, as well as a custom algorithm and scoring scheme specifically designed for this type of data, is described. The analysis resulted in the robust identification of 322 different protein forms representing 174 proteins, comprising one of the most comprehensive data sets assembled on intact proteins to date.

One of the greatest challenges currently being addressed by the discipline of analytical chemistry is the efficient analysis of proteomes, i.e., the entire protein complement of an organism. Mass spectrometry (MS) has been established as the technique of choice for large-scale protein analysis and identification.1-3 For much proteomics work, this is accomplished by two-dimensional gel electrophoresis (2-DE), followed by in-gel digestion of the separated proteins and mass spectrometric analysis. Because this analysis provides information concerning protein isoelectric point and relative molecular weight along with the mass and sequence of the proteolytic peptides, one can trace not only protein expression but also protein modifications that may arise in response to stimuli. 
Although 2-D gels are often the platform of choice for proteome-scale differential display analysis, the recovery of digested peptides from gels is relatively inefficient and can introduce additional modifications such as methionine oxidation, acrylamide adduction, and methylation of aspartic acid. In addition, for the aforementioned reasons, although the migration pattern on a 2-D gel may reveal post-translational modification (PTM) status, determining the identity of the PTM via in-gel digestion is often difficult or impossible.4 * To whom correspondence should be addressed. E-mail: [email protected]. (1) Gygi, S. P.; Aebersold, R. Curr. Opin. Chem. Biol. 2000, 4, 489-494. (2) Godovac-Zimmermann, J.; Brown, L. R. Mass Spectrom. Rev. 2001, 20, 1-57. (3) Chalmers, M. J.; Gaskell, S. J. Curr. Opin. Biotechnol. 2000, 11, 384-390. 10.1021/ac7018409 CCC: $40.75 Published on Web 01/30/2008

© 2008 American Chemical Society

In the past decade, much effort has been directed toward developing the use of liquid chromatography followed by tandem mass spectrometry (MS/MS) technology to analyze proteolytic digests of an entire complex sample, in what has come to be known as shotgun proteomics. The use of multidimensional separation techniques such as strong cation exchange and isoelectric focusing, followed by reversed-phase chromatography for separating complex peptide mixtures in conjunction with MS analysis, has facilitated the almost routine identification of thousands of proteins5-10 from a single sample in 24-48 h of analysis time. Although the advent of shotgun proteomics has allowed large-scale robust identifications of proteins from complex samples, the capacity of this technique to determine the full complement of a proteome, including post-translational modifications, is limited. This has led to the parallel development of “top-down” proteomics technology. This approach, pioneered by McLafferty and co-workers, was made possible by the integration of electrospray ionization with Fourier transform ion cyclotron resonance (FTICR) mass spectrometry technology. These systems have the capacity to obtain accurate mass measurements and routinely resolve charge states of highly charged ions.11 Initial efforts focused on using accurate mass measurement and MS/MS analysis of intact protein ions as a means of protein identification.12-14 Further developments, including integration of quadrupolar ion accumulation15 and the development of hybrid ion trapping-FT-MS instruments,16 have made this type of experiment possible on a chromatographic time scale and enabled the rapid identification (4) Gilar, M.; Bouvier, E. S. P.; Compton, B. J. J. Chromatogr., A 2001, 909, 111-135. (5) Link, A. J.; Eng, J.; Schieltz, D. M.; Carmack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., III. Nat. Biotechnol. 1999, 17, 676-682. (6) Wolters, D. A.; Washburn, M. P.; Yates, J. R., III. Anal. Chem. 2001, 73, 5683-5690. (7) Washburn, M. P.; Wolters, D.; Yates, J. R., III. Nat. Biotechnol. 2001, 19, 242-247. (8) Cargile, B. J.; Bundy, J. L.; Freeman, T. W.; Stephenson, J. L., Jr. J. Proteome Res. 2004, 3, 112-119. (9) Cargile, B. J.; Sevinsky, J. R.; Essader, A. S.; Stephenson, J. L., Jr.; Bundy, J. L. J. Biomol. Tech. 2005, 16, 181-189. (10) Cargile, B. J.; Talley, D. L.; Stephenson, J. L., Jr. Electrophoresis 2004, 25, 936-945. (11) Henry, K. D.; Williams, E. R.; Wang, B. H.; McLafferty, F. W.; Shabanowitz, J.; Hunt, D. F. Proc. Natl. Acad. Sci. U.S.A. 1989, 86, 9075-9078. (12) Loo, J. A.; Quinn, J. P.; Ryu, S. I.; Henry, K. D.; Senko, M. W.; McLafferty, F. W. Proc. Natl. Acad. Sci. U.S.A. 1992, 89, 286-289. (13) Senko, M. W.; Speir, J. P.; McLafferty, F. W. Anal. Chem. 1994, 66, 2801-2808. (14) Kelleher, N. L.; Lin, H. Y.; Valaskovic, G. A.; Aaserud, D. J.; Fridriksson, E. K.; McLafferty, F. W. J. Am. Chem. Soc. 1999, 121, 806-812.

Analytical Chemistry, Vol. 80, No. 5, March 1, 2008 1459

of hundreds of protein forms.17 The introduction of two new collisional regimes for electrospray ionization Fourier transform tandem mass spectrometry (ESI-FT-MS/MS) analysis of proteins, electron capture dissociation (ECD)18-20 and infrared multiphoton dissociation,21 also increased the amount of information obtained in FT-MS/MS experiments on intact proteins, including those normally refractive to commonly used sustained off resonance irradiation-collisionally induced dissociation (SORI-CID). However, the high cost, often lower sensitivity, and long duty cycles of FTMS-based platforms have, outside work by specialists,22-24 limited their application mostly to single purified proteins and simplified mixtures. Recently, two methods for performing top-down MS analyses that use simpler quadrupole ion trap instrumentation modified to perform ion-ion reactions have been developed.25-32 Initially, ionion reactions were used as a means to manipulate the charge state of electrosprayed protein ions on the millisecond time scale leading to predominantly singly charged forms by gas-phase transfer of protons from multiply charged cations to reagent anions.33 This offers two main analytical advantages. First, since ions are reduced in charge, ions of similar m/z ratios are resolved from one another in m/z space, which increases the ability to resolve complex biochemical mixtures without relying heavily upon the use of chromatographic techniques.31 Also, this same process can be used to simplify the interpretation of MS/MS spectra generated from multiply charged precursor ions by reducing the product ion population to primarily the +1 charge (15) Patrie, S. M.; Charlebois, J. P.; Whipple, D.; Kelleher, N. L.; Hendrickson, C. L.; Quinn, J. P.; Marshall, A. G.; Mukhopadhyay, B. J. Am. Soc. Mass Spectrom. 2004, 15, 1099-1108. (16) Syka, J. E.; Marto, J. A.; Bai, D. L.; Horning, S.; Senko, M. W.; Schwartz, J. 
C.; Ueberheide, B.; Garcia, B.; Busby, S.; Muratore, T.; Shabanowitz, J.; Hunt, D. F. J. Proteome Res. 2004, 3, 621-626. (17) Parks, B. A.; Jiang, L.; Thomas, P. M.; Wenger, C. D.; Roth, M. J.; Ii, M. T.; Burke, P. V.; Kwast, K. E.; Kelleher, N. L. Anal. Chem. 2007, 79, 79847991. (18) Zubarev, R. A.; Horn, D. M.; Fridriksson, E. K.; Kelleher, N. L.; Kruger, N. A.; Lewis, M. A.; Carpenter, B. K.; McLafferty, F. W. Anal. Chem. 2000, 72, 563-573. (19) Horn, D. M.; Zubarev, R. A.; McLafferty, F. W. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 10313-10317. (20) Axelsson, J.; Palmblad, M.; Hakansson, K.; Hakansson, P. Rapid Commun. Mass Spectrom. 1999, 13, 474-477. (21) Little, D. P.; Speir, J. P.; Senko, M. W.; O’Connor, P. B.; McLafferty, F. W. Anal. Chem. 1994, 66, 2809-2815. (22) Patrie, S. M.; Ferguson, J. T.; Robinson, D. E.; Whipple, D.; Rother, M.; Metcalf, W. W.; Kelleher, N. L. Mol. Cell. Proteomics 2006, 5, 14-25. (23) Du, Y.; Parks, B. A.; Sohn, S.; Kwast, K. E.; Kelleher, N. L. Anal. Chem. 2006, 78, 686-694. (24) Du, Y.; Meng, F.; Patrie, S. M.; Miller, L. M.; Kelleher, N. L. J. Proteome Res. 2004, 3, 801-806. (25) VerBerkmoes, N. C.; Bundy, J. L.; Hauser, L.; Asano, K. G.; Razumovskaya, J.; Larimer, F.; Hettich, R. L.; Stephenson, J. L., Jr. J. Proteome Res. 2002, 1, 239-252. (26) McLuckey, S. A.; Stephenson, J. L., Jr. Mass Spectrom. Rev. 1998, 17, 369407. (27) Reid, G. E.; McLuckey, S. A. J. Mass Spectrom. 2002, 37, 663-675. (28) Stephenson, J. L.; McLuckey, S. A.; Reid, G. E.; Wells, J. M.; Bundy, J. L. Curr. Opin. Biotechnol. 2002, 13, 57-64. (29) Syka, J. E.; Coon, J. J.; Schroeder, M. J.; Shabanowitz, J.; Hunt, D. F. Proc. Natl. Acad. Sci. U.S.A. 2004, 101, 9528-9533. (30) Stephenson, J. L., Jr.; McLuckey, S. A. J. Mass Spectrom. 1998, 33, 664672. (31) Stephenson, J. L., Jr.; McLuckey, S. A. J. Am. Soc. Mass Spectrom. 1998, 9, 585-596. (32) Stephenson, J. L., Jr.; McLuckey, S. A. Anal. Chem. 1998, 70, 3533-3544. (33) Stephenson, J. L.; McLuckey, S. A. Int. 
J. Mass Spectrom. Ion Processes 1997, 162, 89-106.


state.32 Using this approach, Cargile et al. demonstrated the identification of the MS2 bacteriophage viral coat protein from a crude lysate of infected host.34 Subsequently, this methodology was employed to screen for site-directed mutant forms of dihydrofolate reductase35 and the semiautomated identification of proteins in the 5-20 kDa range from an HPLC-fractionated lysate of E. coli.36 However the instrumental platforms used for this work were based on older technology that lacked automated data acquisition schemes available on more recent instruments. Recently, an instrument was described that employs an ionion chemistry termed electron-transfer dissociation (ETD), in which a reagent gas, such as fluoranthene, is subjected to negative ion chemical ionization and subsequently reacted with a polypeptide ion species.29,37,38 This reaction transfers an electron to the multiply charged protein cation, producing an odd electron species which then undergoes cleavage in the gas phase. Similar to ECD, the advantages of the use of ETD for analysis of proteins and peptides over conventional CID include increased random cleavage along the peptide backbone and the preservation of otherwise labile post-translational modifications, such as phosphorylation.29 In subsequent reports, the analysis of proteins on a chromatographic time scale was achieved via a combination of ETD for dissociation and proton-transfer reaction (PTR) for charge state reduction. Identification of the proteins was made by sequence tags generated from the N- and C-termini of the resultant MS/ MS spectra.39,40 A significant barrier to the widespread adoption of top-down proteomics is that mass spectrometry compatible separations techniques for intact proteins are not as well developed as those for peptides. For example, Meng et al. 
demonstrated a practical separation platform for such work, employing preparative SDSPAGE and an acid-labile surfactant.41 Other investigators have also recently addressed this challenging area, proposing a combination of weak anion-exchange chromatography and reversed-phase highperformance liquid chromatography (RP-HPLC)42 as well as systematic manipulation of RP-HPLC conditions.43 Moreover, the development of automated mass spectrometry approaches as well as bioinformatics tools for handling large-scale top-down proteomics analysis has lagged, although a comprehensive system for automated data acquisition and “top-down” analysis of single proteins has been introduced.44 (34) Cargile, B. J.; McLuckey, S. A.; Stephenson, J. L., Jr. Anal. Chem. 2001, 73, 1277-1285. (35) VerBerkmoes, N. C.; Strader, M. B.; Smiley, R. D.; Howell, E. E.; Hurst, G. B.; Hettich, R. L.; Stephenson, J. L., Jr. Anal. Biochem. 2002, 305, 68-81. (36) Reid, G. E.; Shang, H.; Hogan, J. M.; Lee, G. U.; McLuckey, S. A. J. Am. Chem. Soc. 2002, 124, 7353-7362. (37) Mikesh, L. M.; Ueberheide, B.; Chi, A.; Coon, J. J.; Syka, J. E.; Shabanowitz, J.; Hunt, D. F. Biochim. Biophys. Acta 2006, 1764, 1811-1822. (38) Good, D. M.; Coon, J. J. BioTechniques 2006, 40, 783-789. (39) Chi, A.; Huttenhower, C.; Geer, L. Y.; Coon, J. J.; Syka, J. E.; Bai, D. L.; Shabanowitz, J.; Burke, D. J.; Troyanskaya, O. G.; Hunt, D. F. Proc. Natl. Acad. Sci. U.S.A. 2007, 104, 2193-2198. (40) Coon, J. J.; Ueberheide, B.; Syka, J. E.; Dryhurst, D. D.; Ausio, J.; Shabanowitz, J.; Hunt, D. F. Proc. Natl. Acad. Sci. U.S.A. 2005, 102, 94639468. (41) Meng, F.; Cargile, B. J.; Patrie, S. M.; Johnson, J. R.; McLoughlin, S. M.; Kelleher, N. L. Anal. Chem. 2002, 74, 2923-2929. (42) Sharma, S.; Simpson, D. C.; Tolic, N.; Jaitly, N.; Mayampurath, A. M.; Smith, R. D.; Pasa-Tolic, L. J. Proteome Res. 2007, 6, 602-610. (43) Wang, Y.; Balgley, B. M.; Rudnick, P. A.; Lee, C. S. J. Chromatogr., A 2005, 1073, 35-41.

Figure 1. Overall scheme of the analysis of the E. coli lysate by TD-ETD-MS/MS. E. coli lysate is initially separated by strong anionexchange chromatography, followed by direct reversed-phase nanoLC-MS/MS of the intact protein mixture. Data is then searched by the WP-ETD algorithm to generate a final high-quality list of protein identifications.

In this report, we detail the application of an automated top-down LC-ETD-MS/MS platform (TD-ETD), shown schematically in Figure 1, for the analysis of the soluble proteome of E. coli using a multidimensional separation approach based on strong anion-exchange chromatography and reversed-phase chromatography directly coupled to a commercial linear ion trap equipped with an ETD source. A custom-designed informatics platform was employed to search the data against the E. coli protein database annotated by NCBI. We were able to successfully identify 322 different protein forms that corresponded to 174 protein species, which comprises one of the most comprehensive ETD-MS/MS data sets collected on intact proteins to date.

EXPERIMENTAL SECTION

Methods. Microbial Growth. E. coli, strain K12 (ATCC 700926), was grown to log phase in LB broth at 37 °C, with shaking at 200 rpm. Cells were harvested by centrifugation at 4 °C and washed twice with phosphate-buffered saline (PBS). Lysis of cells was carried out by repeated bursts of sonication (5 × 2 s) in a buffer of 8 M urea, 100 mM TrisCl, pH 7.6, 150 mM NaCl, and protease (44) Johnson, J. R.; Meng, F.; Forbes, A. J.; Cargile, B. J.; Kelleher, N. L. Electrophoresis 2002, 23, 3217-3223.

inhibitors (protease inhibitor cocktail II, Sigma-Aldrich). The protein content of the lysate was determined using the Bradford assay (Pierce, Rockford, IL). Protein Fractionation Using Strong Anion-Exchange Chromatography. Anion-exchange chromatography was performed using a 50 mm × 4.6 mm polymeric strong anion-exchange column (PLSAX, Polymer Labs/Varian, Amherst, MA) connected to a Amersham/GE Healthcare (Piscataway, NJ) AKTA Purifier LC system. The mobile phase buffers were A, 10 mM TrisCl, pH 8.0, and B, 0.5 M NaCl, 10 mM TrisCl, pH 8.0. Four milligrams of protein was injected onto the column at a flow rate of 1 mL min-1. Bound material was eluted with a linear gradient from 0% to 100% B over 40 min followed by 5 min at 100% B. Absorbance at 280 nm was monitored using the AKTA Purifer’s integrated detector module. Collected fractions (1 mL) were reduced to approximately 200 µL using a Speedvac (Thermo Savant). Samples for top-down ETD were used in this state. For bottom-up analyses, tryptic digestion was performed by diluting concentrated fractions (20 µL) 1:10 with 1 M urea, 25 mM Tris, pH 7.6 and addition of 0.5 µg of sequencing grade trypsin (Promega, Madison, WI). Digested fractions were cleaned up by a Waters (Millford, MA) Oasis 96well plate SPE device. The SPE eluates were taken to dryness in a Speedvac and resuspended in 50 µL of 0.1% TFA. Five microliters was loaded to a C18 column for LC-MS/MS. Protein Identification Using Tandem Mass Spectrometry with Electron-Transfer Dissociation. For inline LC-ETD-MS, Zorbax C18 media (Agilent Technologies, Santa Clara, CA) was used to make a 1.5 cm 360 µm o.d. × 100 µm i.d. fused-silica trap contained in a column holder (Upchurch Scientific, Oak Harbor, WA) and a 10 cm 360 µm × 75 µm i.d. fused-silica analytical column that was integral with the nanospray tip. The mobile phases employed were A, 0.1 M HOAc, and B, 70% acetonitrile, 5% isopropyl alcohol, 0.1 M HOAc. 
Samples (10 µL) were loaded onto trap columns using an Eksigent (Dublin, CA) Nano-2D LC system with autosampler and automatic 10-port switching valve (Valco Instruments, Houston, TX) at a flow rate of 3 µL/min with 98% A. Samples were washed with 98% A for 60 min before switching inline with the analytical column. Chromatography was performed using a gradient from 15% to 70% B over 50 min followed by a 4 min wash at 100% B and a 15 min re-equilibration at 15% B at a flow rate of 300 nL/min. ETD was performed on a ThermoFisher Scientific LTQ XL (San Jose, CA) fitted in the field with a module for performing negative ion chemical ionization and axial injection of ETD reagent anions (fluoranthene) similar to the originally published design.29 The instrument was fitted with a New Objective Picoview (Woburn, MA) source for online nanospray. Precursor ion isolation width was set to 4 Da, with a default charge state setting of 5, a maximum reagent anion injection period of 100 ms, and an ion-ion reaction period of 300 ms. Six microscans were accumulated and averaged for each individual ETD-MS/MS scan. Data analysis was performed by first generating .dta files using the extract_msn.exe program included with the BioWorks package. These files were then queried against an E. coli K12 protein database using an in-house developed search algorithm, Whole Protein-ETD (WP-ETD), written in the C programming language. For LC-ESI-MS/MS analysis of the tryptically digested fractions a similar dual trap setup was used for chromatography;


however, the LC system employed was an Agilent 1100 HPLC system (Santa Clara, CA) consisting of an autosampler and a binary pump with flow split to 500 nL/min that was used in conjunction with an isocratic pump in a similar configuration to what was employed for the top-down ETD analyses. The mass spectrometry system was a Thermo LCQ Deca XP Plus equipped with a New Objective Picoview nanospray source. The MS/MS parameters employed were single full scan (200-2000 m/z) followed by data-dependent MS/MS (3 µscans) of the three most intense peaks. Dynamic exclusion was enabled using a repeat count of 1 over duration of 60 s with a mass width of 1.5. MS/MS spectra were searched against the same E. coli protein database used for the top-down analysis, using scrambled database filtering45 to reduce the number of false positives. RESULTS AND DISCUSSION Although top-down proteomics platforms do not provide the depth of coverage (defined as the total number of protein forms detected) compared to bottom-up approaches, they do excel in providing a “holistic view” of intact protein structure by preserving post-translational modifications and truncated protein forms which are often masked in global shotgun analyses.14,18,25,46 Until recently, the ion activation techniques available on commercial ESI, nonFT-MS instrumentation were confined almost exclusively to CID, which has been employed for top-down proteomics in a number of reports using quadrupole time-of-flight instrumentation47 and ion trap instrumentation modified for proton-transfer ion-ion reactions (PTR).27 With the development of ETD, an activation regime is now accessible in a lower cost ion trap instrument that is relatively nonselective as to amino acid sequence and the presence of post-translational modifications, which should lead to greater depth and breadth of proteome coverage than has previously been feasible. Although successful chromatographic-scale analysis of a purified E. 
coli ribosomal mixture has been accomplished using a combination of ETD and PTR,39 the commercial embodiment of the Thermo ETD linear ion trap instrument is not equipped with PTR capabilities. Therefore, we sought to examine the feasibility of employing ETD alone to generate N- and C-terminal sequence tags for identification of proteins in whole cell extracts. A twodimensional approach was used to first fractionate sample using strong anion exchange (SAX) followed by conventional RP-HPLC that was interfaced with an electrospray ionization source inline with the mass spectrometer (Figure 1). Approximately 5% of each of 36 fractions was analyzed. To maximize the population of Nand C-terminal product ions that could be used for subsequent database searching,40,48 we extended analyte ion reaction time with fluoranthene from the reported 15 ms, when used in combination with 150 ms PTR, to 300 ms. We then averaged six microscans to result in spectra with mostly singly charged ions derived from the N- and C-terminus of each protein. For a proteomics platform to be of any practical utility, a robust informatics platform is paramount. We initially employed the (45) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43-50. (46) Kelleher, N. L. Anal. Chem. 2004, 76, 197A-203A. (47) Running, W. E.; Ravipaty, S.; Karty, J. A.; Reilly, J. P. J. Proteome Res. 2007, 6, 337-347. (48) Chi, A.; Bai, D. L.; Geer, L. Y.; Shabanowitz, J.; Hunt, D. F. Int. J. Mass Spectrom. 2007, 259, 197-203.


search algorithm OMSSA (open mass spectrometry search algorithm),49 which has support for TD-ETD data,48 to query our data set; this yielded fewer than 100 proteins over all of the SAX fractions (not shown). Given the low number of identifications observed as compared to the amount of high-quality data collected, we surmised that a significant portion of the proteins had been subjected to in vivo or ex vivo proteolysis, which is not currently supported by the OMSSA algorithm (save the cleavage of N-terminal methionine).49 We then interrogated the data against the E. coli protein database with our in-house developed algorithm WP-ETD. The program (written in C) generates theoretical spectra based on the first 17 amino acids of any intact protein, similar to OMSSA. The choice to consider only the first 17 amino acids is based on the limited mass range and accuracy of the LTQ-XL instrument (1-2000 m/z). Furthermore, the extended ETD reaction time was chosen specifically so that the majority of the ions found in the m/z range of 1-2000 would be in the +1 charge state. To find N-terminal truncations such as methionine and signal peptide losses, the program then cleaves the most N-terminal amino acid and generates a new theoretical spectrum of the next 17 amino acids. This is continued iteratively through the length of the entire protein, generating a theoretical spectrum for every possible N-terminal 17 amino acids. Principles from the SEQUEST algorithm were utilized in the overall algorithm design.50 In this regard, for any given spectrum, only the top 200 ions by intensity were used for analysis. The top identifications were generated based on total ion intensities of matching c ions only, normalized for overall intensities within mass bins of 100 Da. Identifications were made based on c ions (N-terminus) only for two reasons. 
First, because iterative searching of the C-terminus was not performed, considering z ions in the identification would allow only proteins with intact C-termini to be identified. Second, ETD often results in transfer of a proton from c ions to corresponding z ions, resulting in occasional mass differences between theoretical and observed z ions. This phenomenon is difficult to incorporate into an automated search algorithm. For each spectrum, the top match was then reported and written to a results file. A search score S was calculated according to eq 1:

$$ (N_c)(N_{c,\mathrm{cons}}) + (N_z)(N_{z,\mathrm{cons}}) = S \qquad (1) $$

where Nc and Nz are the total numbers of c and z ions, respectively, matching a given protein sequence and Nc,cons and Nz,cons are the numbers of sequence-contiguous c and z ions observed for that matching sequence. This scoring equation was determined empirically to increase the score of true positives relative to the highest scoring scrambled protein. We theorized that including consecutive ions in the scoring would reduce the false positive rate because ETD (like ECD) results in more uniform backbone cleavage than traditional CID. Although z ions were not considered in determining the top protein match, they were considered with equal weight to c ions in the score calculations. Thus, the majority of the top scoring identifications (49) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. J. Proteome Res. 2004, 3, 958-964. (50) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.

Figure 2. Graphical depiction of the computational method developed for searching top-down ETD-MS/MS data (WP-ETD). Raw spectral data was employed in ThermoFisher.dta file format. A protein database was initially constructed consisting of the 17 N- and C-terminal amino acids of all of the proteins found in the database, along with scrambled decoy forms. For the searching of cleaved forms, the program also iteratively cleaved residues off the N-terminus of the molecule, generating new termini for searching. A search score is then calculated by summing the product of the number of ion cleavages and the number of observed consecutive ions for both c and z ion types.
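The search and scoring steps summarized in Figure 2 can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' C implementation of WP-ETD: the function names, the 0.5 m/z matching tolerance, the use of monoisotopic residue masses, and the reading of "consecutive ions" as the longest unbroken run of matched ions are all assumptions introduced here. Intensity-based ranking of candidates and z-ion mass calculation are omitted.

```python
# Hypothetical sketch of WP-ETD-style window generation and scoring.
# Monoisotopic residue masses (Da); the original may have used other values.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728    # mass of a proton (Da)
AMMONIA = 17.02655  # NH3, added to a b-type fragment to give a c-type fragment
WINDOW = 17         # residues considered per terminus, as stated in the text


def c_ions(window):
    """Singly charged c-ion m/z values for each prefix of a residue window."""
    mz, total = [], 0.0
    for aa in window:
        total += RESIDUE_MASS[aa]
        mz.append(total + AMMONIA + PROTON)
    return mz


def truncated_windows(protein, window=WINDOW):
    """Yield (offset, window) pairs, removing one N-terminal residue each step."""
    for start in range(max(1, len(protein) - window + 1)):
        yield start, protein[start:start + window]


def match_flags(theoretical, observed, tol=0.5):
    """Mark each theoretical ion that has an observed peak within tol (assumed)."""
    return [any(abs(t - o) <= tol for o in observed) for t in theoretical]


def longest_run(flags):
    """Length of the longest unbroken run of matched ions."""
    best = run = 0
    for matched in flags:
        run = run + 1 if matched else 0
        best = max(best, run)
    return best


def wp_score(c_flags, z_flags):
    """Eq 1: S = Nc * Nc,cons + Nz * Nz,cons."""
    return (sum(c_flags) * longest_run(c_flags)
            + sum(z_flags) * longest_run(z_flags))
```

For example, a spectrum matching three c ions (two of them contiguous) and three contiguous z ions would score 3 × 2 + 3 × 3 = 15 under these assumptions.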

have intact C-termini. A decoy database45 was generated for each protein by scrambling individual protein sequences and was used as a filtering criterion in the final data set. Individual matching spectra were manually validated by inspection of matching c and z ions. The search process is depicted graphically in Figure 2. The overall score distribution for the MS/MS spectra obtained in the TD-ETD analysis is shown in Figure 3a. Using the criterion that a spectrum must score above the first or second decoy hit in each SAX fraction resulted in a spectral false positive rate of 0.5% and an overall protein false positive identification rate of 3.7%. As can be seen from the figure, the number of reverse hits increased significantly below a WP-ETD score of 50. Illustrated in Figure 3b is a representation of the total number of proteins identified per fraction via the use of the WP-ETD algorithm with purple bars indicating original identifications and yellow indicating redundant identifications. From 14 829 ETD-MS/MS spectra, 2568 scored above the score criteria defined above. From this subset, a total of 174 unique proteins were identified, which represented 322 protein forms with different N-termini. A list of all of the identified proteins, their accession numbers, and WP-ETD scores is supplied as Supporting Information. As mentioned previously, WP-ETD implements support for the searching of N-terminal truncated species. Shown in Figure 4a is a summary of the protein forms identified in the data set as a whole by N-terminal cleavage status. Of these proteins, only 30 were found with intact N- and C-termini. The largest group (143 proteins, 209 forms) consisted of products of both N- and

C-terminal proteolysis. Therefore, it is clear that proteolysis products are a major component of the detectable protein forms, and this needs to be taken into account by either chemical or informatic means when top-down proteomics is the analytical method of choice. A potential application of top-down proteomics is in the analysis of complex microbial communities. In such samples, having molecular weight information as well as a sequence tag greatly facilitates assignment of a particular protein to its originating organism. Given the potential utility of ribosomal proteins in the analysis of complex microbial communities, we examined this subgroup more closely. The ribosomal proteome of E. coli (and the post-translational modifications therein) has previously been studied by several groups. Arnold and Reilly employed MALDI-TOF analysis of intact proteins,51 and more recently, another group identified ribosomal proteins by their intact mass observed by CE-ESI-MS on a quadrupole ion trap.52 Ribosomal subunits have also been previously examined using ETD-PTR by isolating a sucrose gradient ribosomal subfraction.48 In that report, 46 of 53 ribosomal proteins and 7 PTMs were identified, as well as several N-terminal and C-terminal truncated forms. In our data set we identified 43 of 52 ribosomal subunits represented by 141 forms, as shown graphically by cleavage status in Figure 4b. As was observed in the data set at large, the majority of the species observed had some degree of N- and C-terminal cleavage. (51) Arnold, R. J.; Reilly, J. P. Anal. Biochem. 1999, 269, 105-112. (52) Moini, M.; Huang, H. Electrophoresis 2004, 25, 1981-1987.


Figure 3. Protein identification summary. (a) Breakdown of proteins by WP-ETD database search algorithm score. (b) Graph delineating the number of total proteins identified by TD-ETD stratified into unique to a fraction (purple bars) and overall (yellow bars).
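The decoy-based acceptance criterion described in the text (a spectrum must score above the first or second decoy hit within its SAX fraction) can be illustrated with a short sketch. The data layout, the function name, and the example scores below are hypothetical and chosen only for illustration; the authors' actual filtering code is not shown in the paper.

```python
def accept_above_decoy(hits, k=1):
    """Return target scores that exceed the k-th highest decoy score.

    hits: list of (score, is_decoy) tuples for one SAX fraction (assumed layout).
    k:    1 to cut at the first decoy hit, 2 to cut at the second.
    """
    decoy_scores = sorted((s for s, is_decoy in hits if is_decoy), reverse=True)
    target_scores = sorted((s for s, is_decoy in hits if not is_decoy), reverse=True)
    if len(decoy_scores) < k:
        # Fewer than k decoy hits in this fraction: nothing to cut against.
        return target_scores
    threshold = decoy_scores[k - 1]
    return [s for s in target_scores if s > threshold]


# Made-up scores for a single fraction: (WP-ETD score, matched a scrambled decoy?)
fraction_hits = [(120, False), (95, False), (80, True),
                 (70, False), (60, True), (40, False)]
strict = accept_above_decoy(fraction_hits, k=1)   # cut at the first decoy hit
relaxed = accept_above_decoy(fraction_hits, k=2)  # cut at the second decoy hit
```

Cutting at the second decoy hit admits more target spectra at the cost of letting one decoy score sit above the threshold, which is the trade-off behind a nonzero spectral false positive rate.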

Although the decoy database searching indicated that most of the identifications were valid, we attempted to verify the identifications by several orthogonal means. First, we looked at the biological function of the identified proteins to determine whether known high-abundance proteins were identified. To do this we examined the breakdown of the proteins by functional category, since proteins involved in mRNA translation and energy production should be present in high abundance in the log growth phase. This is depicted, as per the cluster of orthologous groups (COG)53,54 classification scheme, in Figure 5a. The second most abundant category was protein translation, with 42 ribosomal proteins identified, 9 from general translation, and an additional 7 proteins involved in protein folding and modification. When the category with the third highest number of identifications (energy production) is combined with the carbohydrate metabolism category, this represents more than 25% (17/62) of the rest of the identified proteins (save those with unknown function and mRNA translation). Thus, if one expected to identify a large percentage of abundant proteins, as estimated by their purported function, then almost 63% (75/120) of the identified proteins are from known core functional categories. The largest category of proteins identified was the “unknown/predicted” category, which comprised 72/174 (41%) of the unique protein identifications. This is not surprising, given that approximately 38% of the E. coli predicted open reading frames fall into this category.55

(53) Tatusov, R. L.; Koonin, E. V.; Lipman, D. J. Science 1997, 278, 631-637.
(54) Tatusov, R. L.; Natale, D. A.; Garkavtsev, I. V.; Tatusova, T. A.; Shankavaram, U. T.; Rao, B. S.; Kiryutin, B.; Galperin, M. Y.; Fedorova, N. D.; Koonin, E. V. Nucleic Acids Res. 2001, 29, 22-28.
(55) Wasinger, V. C.; Humphery-Smith, I. FEMS Microbiol. Lett. 1998, 169, 375-382.

The second method used to validate the identified proteins was retrospective comparison with a bottom-up proteomics experiment. In a previous report,10 we examined the soluble proteome of E. coli using a multidimensional separation regime based on tryptic peptide IPG-IEF (immobilized pH gradient-isoelectric focusing) in the first dimension and nano-RPLC in the second, resulting in the identification of 1082 proteins from fully tryptic peptides, with a calculated false positive identification rate of 1%. We expected these data sets to be reasonably complementary and that the bottom-up data would provide a cross-validation

Figure 4. (a) Protein identifications categorized according to N-terminal and C-terminal cleavages. An intact C-terminus was determined automatically if a spectrum matched five or more consecutive z ions. For intact N- and C-termini, those with loss of methionine are indicated by shaded bars. Numbers inside each bar represent the numbers of proteins identified in each category. The number of different internal fragments is indicated as a separate category. Many proteins appear in several categories. (b) Graph depicting the classification of ribosomal proteins in this study, as described for (a).
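The caption's rule for calling an intact C-terminus (five or more consecutive matched z ions) can be sketched in a few lines of Python. This is a minimal illustration rather than the WP-ETD implementation: the function names and the representation of matched ions as a list of z-ion indices are assumptions, and because the text does not state whether the run must begin at z1, the sketch accepts a run anywhere in the series.

```python
def longest_consecutive_run(matched_indices):
    """Return the length of the longest run of consecutive integers."""
    indices = set(matched_indices)
    best = 0
    for i in indices:
        if i - 1 not in indices:  # i is the first index of a run
            length = 1
            while i + length in indices:
                length += 1
            best = max(best, length)
    return best


def has_intact_c_terminus(matched_z_ions, min_run=5):
    """Apply the caption's rule: at least `min_run` consecutive z ions matched.

    `matched_z_ions` is a hypothetical input format: the indices n of the
    z_n ions that were matched in one spectrum.
    """
    return longest_consecutive_run(matched_z_ions) >= min_run


# Hypothetical matched z-ion indices for two spectra.
print(has_intact_c_terminus([3, 4, 5, 6, 7, 12]))
print(has_intact_c_terminus([1, 2, 3, 4, 6, 7, 8, 9]))
```

The first spectrum contains a run of five (z3 through z7) and would be called intact; the second contains only runs of four and would not.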

of the efficacy of our TD-ETD search algorithm. However, when these two data sets were compared, only 42% of the proteins identified in the TD-ETD data set were also observed in the IPG-IEF data. Possible reasons for this observation include the use of a different preparation of E. coli, which may have biased the results, as proteome composition is significantly dependent upon growth state. Other possibilities include the difficulty of chromatographic separation of larger and more hydrophobic protein species on reversed-phase media and lower ionization yields due to issues in desolvating proteins of increasing molecular mass. To further investigate the nature of the two aforementioned E. coli proteomic data sets, we performed traditional “bottom-up” shotgun proteomics analysis of the same SAX fractions used for the TD-ETD work. This experiment yielded a total of 1840 peptides derived from 423 protein forms. These results are compared with the results from the TD-ETD analysis in Figure 5b. Eighty proteins were detected in common between the two methods, with 343 and 94 detected uniquely by the bottom-up and the top-down methods, respectively, implying that 46% (80/174) of the TD-ETD identifications were also found in the bottom-up data set described here, similar to what was observed with the peptide IPG-IEF experiment. One potential reason for this modest overlap is that some proteins may not produce tryptic peptides of the size, amino acid composition, and physicochemical properties amenable to shotgun proteomics. These tryptic peptide species may be weakly retained

on the stationary phase due to small size or low hydropathy. In addition, a digest may produce peptides that have lower ionization efficiencies than the intact protein species. In this regard, examination of the sequences of two 30S ribosomal proteins (S15 and S17) that were not detected in the bottom-up analysis revealed highly basic proteins (20% K and R residues), suggesting that these proteins, when digested, may produce peptides too small to be detected in the typical automated LC-MS/MS scheme. Thus, in these instances, a top-down approach proves to be complementary to bottom-up in increasing the depth of protein coverage.

One of the oft-cited benefits of a top-down proteomics experiment is the ability to readily identify post-translationally modified protein forms. The most common of these modifications in prokaryotic organisms is the N-terminal cleavage of methionine, which was observed in 47 of the identified proteins. Cleavage of N-terminal signal peptides is another modification that commonly goes undetected in bottom-up proteomics analyses. The custom search algorithm developed herein is capable of identifying N-terminal cleavages of any length by iteratively searching every amino acid in the E. coli proteome as a possible N-terminus. Through this, several proteins were detected with known and putative N-terminal signal peptide cleavages. Shown in Figure 6a is a representative ETD-MS/MS spectrum derived from one such protein, a thiosulfate sulfurtransferase (PspE) of 11.7 kDa molecular weight. Ten ETD spectra matched to this protein, all appearing with an observed 19 amino acid signal peptide cleavage matching the predicted signal peptide loss as annotated in the Swiss-Prot database. Two other proteins identified with N-termini resulting from predicted signal peptide cleavage were hypothetical protein yahO (Swiss-Prot accession P75694) and spheroplast protein Y precursor (Swiss-Prot accession P77754), a putative Zn2+ transporter.
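The idea of treating every residue as a possible N-terminus can be illustrated with a short Python sketch. It is not the authors' algorithm, whose scoring is not described in this section: it only enumerates N-terminal truncations of one sequence and keeps those whose average mass agrees with an observed intact mass. The function names, the mass tolerance, the assumption of an intact C-terminus, and the toy sequence are all hypothetical; the residue masses are standard average values.

```python
# Standard average residue masses in Da (amino acid minus one water).
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # average mass of H2O in Da


def average_mass(seq):
    """Average mass of an unmodified polypeptide chain."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in seq) + WATER


def candidate_n_termini(protein_seq, observed_mass, tol=1.0):
    """Try every residue as the N-terminus (C-terminus assumed intact).

    Returns (start, mass) pairs, where `start` is the 0-based index of the
    new first residue, i.e. the number of N-terminal residues removed.
    """
    hits = []
    mass = WATER
    # Walk from the C-terminus so each truncated form costs one addition.
    for start in range(len(protein_seq) - 1, -1, -1):
        mass += AVG_RESIDUE_MASS[protein_seq[start]]
        if abs(mass - observed_mass) <= tol:
            hits.append((start, mass))
    return sorted(hits)


# Toy precursor (not a real E. coli sequence) that has lost 5 residues.
toy_precursor = "MKTLLAVSAGHWDEQPK"
observed = average_mass(toy_precursor[5:])
for start, mass in candidate_n_termini(toy_precursor, observed):
    print(f"{start} residues removed, calculated mass {mass:.2f} Da")
```

Applied to a whole proteome, the same loop would simply run over every database entry; loss of the initiator methionine corresponds to `start == 1`, and a signal peptide cleavage to a larger `start`.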
The WP-ETD algorithm was designed to detect mass deductions from intact proteins rather than mass additions. To identify post-translational modifications that result in mass addition, we manually scanned data obtained from several fractions for proteins for which multiple spectra with the same N-terminus were identified but were significantly separated in chromatographic retention time. The PspE protein described above fit this criterion, and subsequent deconvolution of several charge state envelopes corresponding to the same PspE N- and C-terminus revealed the presence of seven different protein species, all with the same N- and C-terminus of the processed PspE protein, suggesting the presence of multiple internal modifications (Figure 6b). Besides the mass of the unmodified protein, species corresponding to additions of 31, 120, 133, 175, 191, and 248 Da were detected. PspE is a member of the rhodanese family of sulfurtransferases, which catalyze a two-step reaction that transfers a sulfur atom from thiosulfate (S2O3(2-)) to cyanide (CN-).56 The first reaction transfers the sulfur moiety to the catalytic cysteine residue in the rhodanese enzyme, generating a thiolated protein form. One of the PspE species observed had a molecular mass of 9462 Da, which corresponds to a putative thiolation (+31 Da). This fraction was then subjected to LC-MS/MS using CID at a lower collision energy than normally employed, to produce limited fragmentation in an attempt to localize the modification

(56) Adams, H.; Teertstra, W.; Koster, M.; Tommassen, J. FEBS Lett. 2002, 518, 173-176.


Figure 5. (a) Breakdown of proteins identified in this study by the COG (cluster of orthologous groups) classification scheme. (b) Venn diagram illustrating the numbers of proteins identified by top-down ETD and bottom-up shotgun analysis of an identical set of strong anion-exchange fractions.
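The proportions discussed for both panels of Figure 5 follow directly from the counts quoted in the text, and the short Python sketch below recomputes them. The variable names are illustrative, and the grouping of the counts (for example, treating 120 as the 58 translation-related proteins plus the 62 remaining proteins of known function) is an inference from the quoted fractions rather than something stated explicitly.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Panel (a): functional categories, using counts quoted in the text.
translation = 42 + 9 + 7   # ribosomal, general translation, folding/modification
energy_and_carbohydrate = 17
rest_of_known = 62         # known function, excluding translation
unknown_or_predicted = 72
unique_proteins = 174

core = translation + energy_and_carbohydrate   # 75
known = translation + rest_of_known            # 120
print(f"energy + carbohydrate: {percent(energy_and_carbohydrate, rest_of_known):.1f}% of the rest")
print(f"core categories:       {percent(core, known):.1f}% of proteins with known function")
print(f"unknown/predicted:     {percent(unknown_or_predicted, unique_proteins):.1f}% of all identifications")

# Panel (b): overlap between top-down and bottom-up identifications.
shared, bottom_up_only, top_down_only = 80, 343, 94
top_down_total = shared + top_down_only      # 174
bottom_up_total = shared + bottom_up_only    # 423
print(f"top-down proteins also seen bottom-up: {percent(shared, top_down_total):.1f}%")
print(f"bottom-up proteins also seen top-down: {percent(shared, bottom_up_total):.1f}%")
```

The two panel (b) totals reproduce the 174 top-down and 423 bottom-up identifications reported in the text, which is a useful consistency check on the Venn diagram counts.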

(Figure 6c). CID-MS/MS analysis of the ion at m/z 728.69 indicated a 31 Da modification between residues 38 and 48, as evidenced by the observed mass shift in the b38 and b48 ions in the spectrum. Since PspE’s only cysteine residue is at position 48, it is reasonable to conclude that the 9462 Da protein is modified with sulfur at this residue. Shown in Figure 6d is a fragmentation map of the combined ETD-CID analysis of this protein. Explanations for the additional five species were not readily apparent, and they could be the result of multiple modifications. Further manual study of these samples will lead to improvements in automated searching algorithms.

Although these analyses resulted in a modest number of protein identifications, considering that the E. coli proteome consists of 3762 known open reading frames, the result is a significant improvement compared to previous top-down mass spectrometry studies of the soluble fraction of E. coli. Recently a report by Millea et al. demonstrated a combined top-down/bottom-up analysis of this organism using a sophisticated multidimensional SAX-RPLC system coupled directly with ESI (for molecular weight determination) and offline MALDI (for identification of tryptic peptides) on a quadrupole time-of-flight mass spectrometer.57 In that work, 103 proteins were detected by ESI, but only 46 of these could be correlated to peptide MS/MS data. Fifty-five proteins were found by peptide MS/MS alone. An earlier report by Reid et al. using offline reversed-phase chromatography, coupled with a PTR-equipped modified ion trap,36 reported the identification of five proteins, with extensive manual annotation and interpretation of the data. Other reports on top-down analysis of this organism58,59 have relied on accurate mass measurement alone as an identification criterion, which is possible

(57) Millea, K. M.; Krull, I. S.; Cohen, S. A.; Gebler, J. C.; Berger, S. J. J. Proteome Res. 2006, 5, 135-146.
(58) Martinovic, S.; Veenstra, T. D.; Anderson, G. A.; Pasa-Tolic, L.; Smith, R. D. J. Mass Spectrom. 2002, 37, 99-107.

Figure 6. (a) ETD-MS/MS spectrum of the E. coli thiosulfate sulfurtransferase PspE. (b) Full scan ESI-MS spectrum of protein that matched in sequence, but not in mass, to unmodified PspE. (c) CID-MS/MS spectrum of m/z 728.69 from the modified form of PspE, localizing the site of protein thiolation to Cys 48. (d) Primary sequence showing sites of ETD c/z cleavage (red and blue) and CID (b/y) fragments, indicated by bars.
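The charge-state deconvolution behind panel (b) rests on a simple relation: an ion carrying z protons appears at m/z = M/z + m_p, so two adjacent peaks of one envelope fix both z and the neutral mass M. The Python sketch below illustrates this under stated assumptions. It is not the deconvolution software used in the study; the function names are hypothetical, and the peak lists are synthetic, generated from a 9462 Da species and a hypothetical 9431 Da unmodified form (9462 Da minus the reported 31 Da shift).

```python
PROTON = 1.007276  # mass of a proton in Da


def mz_of(mass, z):
    """m/z of a protein of neutral mass `mass` carrying z protons."""
    return mass / z + PROTON


def neutral_mass(mz, z):
    """Invert mz_of: neutral mass from an observed m/z and its charge."""
    return z * (mz - PROTON)


def infer_charge(mz_a, mz_b):
    """Charge of the higher-m/z peak of two adjacent charge states."""
    hi, lo = max(mz_a, mz_b), min(mz_a, mz_b)
    return round((lo - PROTON) / (hi - lo))


def deconvolute(envelope):
    """Average the neutral masses implied by each adjacent pair of peaks."""
    peaks = sorted(envelope, reverse=True)
    masses = []
    for hi, lo in zip(peaks, peaks[1:]):
        z = infer_charge(hi, lo)
        masses.append(neutral_mass(hi, z))
    return sum(masses) / len(masses)


# Synthetic envelopes (charge states 10+ to 14+), not measured data.
modified = [mz_of(9462.0, z) for z in range(10, 15)]
unmodified = [mz_of(9431.0, z) for z in range(10, 15)]
shift = deconvolute(modified) - deconvolute(unmodified)
print(f"mass shift between the two species: {shift:.2f} Da")
```

Real spectra would additionally require peak picking, isotope averaging, and noise handling, none of which is attempted here.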

with reasonable fidelity for small genome organisms with limited post-translational modification.

CONCLUSIONS

Presented here is an initial evaluation of a workflow for top-down proteomics analysis of a highly complex sample employing a benchtop linear ion trap mass spectrometer equipped with ETD. This platform utilizes a two-dimensional separation approach that can be performed with minimal sample handling. Sample preparation and data collection can be completed in 2 days from as little as 20 µg of sample (from an individual multidimensional chromatography fraction) introduced into the mass spectrometer. Data analysis software and a robust scoring method were developed that are capable of protein identifications that include all possible N-terminal cleavages. Although many more internal cleavages were identified than expected in these analyses, cataloging of common cleavage sites can be used as a means to discover mechanisms of proteolytic cleavage in vivo. Further development of these methods to improve full-length protein identifications and automated deconvolution of precursor charge states as part of the data analysis is forthcoming. The data presented in this report illustrate the power of TD-ETD as a high-throughput analysis platform, capable of robust identification of a large number of proteins and their post-translationally modified forms, and a worthy complement to traditional bottom-up analysis schemes.

(59) Martinovic, S.; Pasa-Tolic, L.; Smith, R. D. Methods Mol. Biol. 2004, 276, 291-304.

ACKNOWLEDGMENT

Support of this research was provided under Agreement HSHQPA-05-9-0006 with the Department of Homeland Security, Science and Technology Directorate, Chem Bio Division. We also thank John E. P. Syka of ThermoFisher Scientific for assistance with the ETD hardware.

SUPPORTING INFORMATION AVAILABLE

Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review August 31, 2007. Accepted November 27, 2007. AC7018409