Evaluation of “Shotgun” Proteomics for Identification of Biological

Anal. Chem. 2005, 77, 923-932

Evaluation of “Shotgun” Proteomics for Identification of Biological Threat Agents in Complex Environmental Matrixes: Experimental Simulations

Nathan C. VerBerkmoes,†,‡ W. Judson Hervey,†,‡ Manesh Shah,§ Miriam Land,†,§ Loren Hauser,†,§ Frank W. Larimer,†,§ Gary J. Van Berkel,†,‡ and Douglas E. Goeringer*,‡

Graduate School of Genome Science and Technology, University of Tennessee-Oak Ridge National Laboratory, 1060 Commerce Park, Oak Ridge, Tennessee 37830-8026, and Organic and Biological Mass Spectrometry, Chemical Sciences Division, and Genome Analysis and Systems Modeling, Life Sciences Division, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, Tennessee 37831-6131

There is currently a great need for rapid detection and positive identification of biological threat agents, as well as microbial species in general, directly from complex environmental samples. This need is most urgent in the area of homeland security, but also extends into medical, environmental, and agricultural sciences. Mass-spectrometry-based analysis is one of the leading technologies in the field, with a diversity of methodologies for biothreat detection. Over the past few years, “shotgun” proteomics has become one method of choice for the rapid analysis of complex protein mixtures by mass spectrometry. Recently, it was demonstrated that this methodology is capable of distinguishing a target species against a large database of background species from a single-component sample or dual-component mixtures with roughly equal concentrations (Dworzanski, J. P.; Snyder, A. P.; Chen, R.; Zhang, H.; Wishart, D.; Li, L. Anal. Chem. 2004, 76, 2355-2366). Here, we examine the potential of shotgun proteomics to analyze a target species in a background of four contaminant species. We tested the capability of a common commercial mass-spectrometry-based shotgun proteomics platform for the detection of the target species (Escherichia coli) at four different concentrations and four different time points of analysis. We also tested the effect of database size on positive identification of the four microbes used in this study by testing a small (13-species) database and a large (261-species) database. The results clearly indicated that this technology could easily identify the target species at 20% in the background mixture at a 60, 120, 180, or 240 min analysis time with the small database. The results also indicated that the target species could easily be identified at 20% or 6% but could not be identified at 0.6% or 0.06% in either a 240 min analysis or a 30 h analysis with the small database. The effects of the large database were severe on the target species: detection above the background was impossible at any concentration used in this study, although the three other microbes used in this study were clearly identified above the background when analyzed with the large database. This study points to the potential application of this technology for biological threat agent detection but highlights many areas of needed research before the technology will be useful in real-world samples.

* To whom correspondence should be addressed. Phone: (865) 574-3469. Fax: (865) 576-8559. E-mail: [email protected].
† University of Tennessee-Oak Ridge National Laboratory.
‡ Chemical Sciences Division, Oak Ridge National Laboratory.
§ Life Sciences Division, Oak Ridge National Laboratory.
10.1021/ac049127n CCC: $30.25 © 2005 American Chemical Society. Published on Web 01/04/2005.
Analytical Chemistry, Vol. 77, No. 3, February 1, 2005

There is currently a great need for rapid detection and positive identification of biological threat agents, including bacteria, toxins, and viruses, directly from complex environmental samples due to the recent increased threat of terrorism. This need also exists in the medical, environmental, and agricultural sciences. The detection/identification of microbial pathogens can be based on the presence of unique biomarkers from at least one of the major classes of macromolecules: DNA/RNA, lipids, and proteins. The selective detection of viruses can be centered on either DNA/RNA or protein biomarkers, whereas for protein toxins the detection methods are limited to proteins. Many different technologies currently exist or are under development for the positive identification of potential bioweapons directly from environmental samples. While PCR-based methods rely on recognition of unique stretches of DNA or RNA (2, 3) and antibody-based methods (4-7) depend mainly on detection of cell surface proteins and lipids, analysis of all three major macromolecules is possible with mass spectrometry (MS) (8-11). Consequently, MS has recently been the focus of increasing efforts to develop new analytical techniques for the rapid, positive identification of biological threat agents.

One of the earliest forms of MS-based detection of biological weapons, and the first used in actual field studies, is pyrolysis followed by MS analysis, with fatty acid signatures subsequently matched to databases (12, 13). This methodology has been shown to be rapid, robust, field portable, and capable of integration into the same system as a chemical weapons detector. While this method can be effective for simple mixtures in which the target is the main biological constituent, its use is problematic if many biological species are present and the target is not the major component.

Clearly, proteins are present and abundant in all possible biological threat agents. Given this, together with the high degree of protein variability between species, it is likely that analysis of unique proteins will provide the most reliable biothreat detection. The main difficulty in protein-based identification, however, is that the analysis of proteins in complex mixtures is a tremendous analytical challenge. Nevertheless, progress has been made in meeting the challenge by using MS-based methods.

One such methodology under development involves the detection of intact proteins liberated from a toxin, virus, or bacterium by matrix-assisted laser desorption/ionization (MALDI) time-of-flight (TOF) mass spectrometry (14-21). Furthermore, tandem mass spectrometry (MS/MS) of intact proteins generated from MALDI followed by database searching has recently been demonstrated for small bacterial proteins (22). An alternative methodology to MALDI-TOF is the use of electrospray (ES) ionization, coupled with either Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR-MS) or quadrupole ion trap mass spectrometry (utilizing ion-ion chemistry) for the detection and subsequent fragmentation of intact proteins (23-25). While both methodologies have their advantages and disadvantages, it is clear that the MALDI-TOF methods are inherently faster, while the ES methods provide better dynamic range. Given the current state of technology, ES methodologies are better suited to MS/MS of intact proteins than the MALDI-TOF technologies.

(1) Dworzanski, J. P.; Snyder, A. P.; Chen, R.; Zhang, H.; Wishart, D.; Li, L. Anal. Chem. 2004, 76, 2355-2366.
(2) Broussard, L. A. Mol. Diagn. 2001, 6, 323.
(3) Ivnitski, D.; O’Neil, D. J.; Gattuso, A.; Schlicht, R.; Calidonna, M.; Fisher, R. Biotechniques 2003, 35, 862.
(4) Long, G. W.; O’Brien, T. J. Appl. Microbiol. 1999, 87, 214.
(5) De, B. K.; Bragg, S. L.; Sanden, G. N.; Wilson, K. E.; Diem, L. A.; Marston, C. K.; Hoffmaster, A. R.; Barnett, G. A.; Weyant, R. S.; Abshire, T. G.; Ezzell, J. W.; Popovic, T. Emerging Infect. Dis. 2002, 8, 1060.
(6) McBride, M. T.; Gammon, S.; Pitesky, M.; O’Brien, T. W.; Smith, T.; Aldrich, J.; Langlois, R. G.; Colston, B.; Venkateswaran, K. S. Anal. Chem. 2003, 75, 1924.
(7) McBride, M. T.; Masquelier, D.; Hindson, B. J.; Makarewicz, A. J.; Brown, S.; Burris, K.; Metz, T.; Langlois, R. G.; Tsang, K. W.; Bryan, R.; Anderson, D. A.; Venkateswaran, K. S.; Milanovich, F. P.; Colston, B. W., Jr. Anal. Chem. 2003, 75, 5293.
(8) Mann, M.; Hendrickson, R. C.; Pandey, A. Annu. Rev. Biochem. 2001, 70, 437.
(9) Guo, B. Anal. Chem. 1999, 71, 333R.
(10) Christie, W. W. Lipids 1998, 33, 343.
(11) Murphy, R. C.; Fiedler, J.; Hevko, J. Chem. Rev. 2001, 101, 479.
(12) Barshick, S. A.; Wolf, D. A.; Vass, A. A. Anal. Chem. 1999, 71, 633.
(13) Griest, W. H.; Wise, M. B.; Hart, K. J.; Lammert, S. A.; Thompson, C. V.; Vass, A. A. Field Anal. Chem. Technol. 2001, 5, 177.
(14) Krishnamurthy, T.; Ross, P. L. Rapid Commun. Mass Spectrom. 1996, 10, 1992.
(15) Demirev, P. A.; Ho, Y. P.; Ryzhov, V.; Fenselau, C. Anal. Chem. 1999, 71, 2732.
(16) Demirev, P. A.; Ramirez, J.; Fenselau, C. Anal. Chem. 2001, 73, 5725.
(17) Fenselau, C.; Demirev, P. A. Mass Spectrom. Rev. 2001, 20, 157.
(18) Jones, J. J.; Stump, M. J.; Fleming, R. C.; Lay, J. O., Jr.; Wilkins, C. L. Anal. Chem. 2003, 75, 1340.
(19) Hathout, Y.; Setlow, B.; Cabrera-Martinez, R.-M.; Fenselau, C.; Setlow, P. Appl. Environ. Microbiol. 2003, 69, 1100.
(20) Wahl, K. L.; Wunschel, S. C.; Jarman, K. H.; Valentine, N. B.; Petersen, C. E.; Kingsley, M. T.; Zartolas, K. A.; Saenz, A. J. Anal. Chem. 2002, 74, 6191.
(21) Lee, H.; Williams, S. K.; Wahl, K. L.; Valentine, N. B. Anal. Chem. 2003, 75, 2746.
(22) Demirev, P. A.; Ramirez, J.; Fenselau, C. Anal. Chem. 2001, 73, 5725.
(23) Cargile, B. J.; McLuckey, S. A.; Stephenson, J. L., Jr. Anal. Chem. 2001, 73, 1277.
(24) Stephenson, J. L.; McLuckey, S. A.; Reid, G. E.; Wells, J. M.; Bundy, J. L. Curr. Opin. Biotechnol. 2002, 13, 57.

Both the ES and MALDI techniques for protein analysis discussed above can be classified as “top-down” methods, which are generally defined as identification of intact proteins via either mass analysis or mass analysis followed by MS/MS of the mass-selected protein. An alternative methodology to the separation and mass analysis of intact proteins is the “bottom-up” or “shotgun” method, which involves digestion of the protein material with enzymatic and/or chemical methods. The resultant crude peptide mixture is then amenable to direct analysis by LC-ES-MS/MS or MALDI-TOF followed by proteome bioinformatic methods for peptide identification. MALDI-TOF has found applications in the analysis of enzymatic digests of bacterial proteins followed by either peptide mass fingerprinting or MS/MS of the peptides (26-28). This application is somewhat limited, however, by the complexity of the peptide mixtures generated from tryptic digestions of whole bacteria, bacterial spores, viruses, or toxin proteomes. LC-MS/MS analysis is more amenable to the complex peptide mixtures resulting from enzymatic digestion of a complex protein mixture.

The core analytical technologies for the analysis of complex peptide mixtures derived from cell lysate(s) by LC-MS/MS have improved greatly over the past few years. As a result, the routine analysis of 500-1000 proteins from a bacterial or yeast cell lysate on a single instrument platform is now possible in a single day (29-32). The overall speed, sensitivity, and dynamic range of current shotgun proteomics techniques cannot be matched by other methods at this time.
Furthermore, ongoing developments and advances in instrumentation for peptide analysis via LC-MS/MS will likely enhance these numbers dramatically.

Recently, it was demonstrated that MS-based shotgun proteomics is capable of distinguishing a target species against a large database of background species from a single-component mixture or dual-component mixtures with roughly equal concentrations (1). That study demonstrated the capability of shotgun proteomics to detect a variety of microbes grown in single cultures against a background database of 87 microbes. It also developed a new scoring algorithm to test the significance of positive identifications against a large database of background proteins. It demonstrated the utility of MS-based shotgun proteomics for detecting a specific target species against a large database of background species. However, the study did not test the limits of current commercial technologies in more complicated mixtures of background species or test the effects of instrument analysis time on the ability to positively identify a target in a mixture of competing background species.

(25) Reid, G. E.; Shang, H.; Hogan, J. M.; Lee, G. U.; McLuckey, S. A. J. Am. Chem. Soc. 2002, 124, 7353.
(26) Harris, W. A.; Reilly, J. P. Anal. Chem. 2002, 74, 4410.
(27) English, R. D.; Warscheid, B.; Fenselau, C.; Cotter, R. J. Anal. Chem. 2003, 75, 6886.
(28) Warscheid, B.; Jackson, K.; Sutton, C.; Fenselau, C. Anal. Chem. 2003, 75, 5608.
(29) Washburn, M. P.; Wolters, D.; Yates, J. R., III Nat. Biotechnol. 2001, 19, 242.
(30) VerBerkmoes, N. C.; Bundy, J. L.; Hauser, L.; Asano, K. G.; Razumovskaya, J.; Larimer, F.; Hettich, R. L.; Stephenson, J. L., Jr. J. Proteome Res. 2002, 1, 239.
(31) Lipton, M. S.; Pasa-Tolic, L.; Anderson, G. A.; Anderson, D. J.; Auberry, D. L.; Battista, J. R.; Daly, M. J.; Fredrickson, J.; Hixson, K. K.; et al. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 11049.
(32) Corbin, R. W.; Paliy, O.; Yang, F.; Shabanowitz, J.; Platt, M.; Lyons, C. E., Jr.; Root, K.; McAuliffe, J.; Jordan, M. I.; Kustu, S.; Soupene, E.; Hunt, D. F. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 9232.

On the basis of the above, we have evaluated some of these current shotgun proteomics technologies for the positive identification of a bioweapon simulant in a mixture of other background species. The Escherichia coli K-12 strain served as the target, and two soil microbes, a yeast, and a plant served as the four background species. Two different scenarios were used in this study. The first, termed the “instrumental analysis time test”, evaluated LC-MS/MS under conditions that demanded rapid, reliable identification. The target was present at a relatively high concentration compared to the background species, and identification reliability was assessed as a function of analysis time. In the second, termed the “target concentration level test”, the biothreat agent was assumed to be present at a much lower concentration relative to the background. Thus, identification was not as time sensitive, although a high degree of reliability was still required. This scenario involved varying the target concentration to increasingly lower levels while fixing the analysis time.

The second part of this study focused on the effects of database size on the positive identification of unique peptides from the four microbial species. One of the major challenges of this technique is the identification of unique peptides for a given target species when the database size gets very large. Unique peptide identifications can be defined as those peptides identified in a proteome measurement that are specific to a single species in that database.
As the number of species related to a target species in a database increases, fewer unique peptides are available for positive identification of that target species. Another problem associated with database size is the positive matching of MS/MS spectra to unique peptides from species not in the sample but in the database. For this study, false positive peptides were defined as those unique peptides that match species in the database that are not present in the test mixtures as target or background species. This does not define the false positive identification of a species in a mixture, but rather the random matching to peptides in the background database. This study compares the rate of unique peptide identification for the 4 microbes with a small database (13 species) and a large database (261 species).

EXPERIMENTAL SECTION

Chemicals and Reagents. All salts, DTT, trifluoroacetic acid, and guanidine were obtained from Sigma Chemical Co. (St. Louis, MO). Protein concentrations were determined with BCA reagents from Pierce Chemical Co. (Rockford, IL). Modified sequencing grade trypsin, from Promega (Madison, WI), was used for all protein digestion reactions. The acetone, water, and acetonitrile used in all sample cleanup and HPLC applications were HPLC grade from Burdick & Jackson (Muskegon, MI), and the 98% formic acid used in these applications was purchased from EM Science (an affiliate of Merck KGaA, Darmstadt, Germany).

Bacterial Growth, Mixture Preparation, and Protein Extraction, Digestion, and Cleanup. All bacterial, plant, and yeast samples used in this study were obtained from laboratories proficient in their growth and protein preparation. Shewanella oneidensis MR-1, Rhodopseudomonas palustris CGA009, and E. coli K-12 were grown individually, aerobically in 2 L beakers with shaking at 30 or 37 °C to between midlogarithmic and stationary phase on Luria-Bertani broth (S. oneidensis and E. coli) or succinate, yeast extract, and mineral salts (R. palustris). All three were individually harvested by centrifugation (5000g for 15 min) and washed twice with ice-cold wash buffer (50 mM Tris, pH 7.8, with 10 mM EDTA). Cells were lysed by sonication in ice-cold wash buffer, and unbroken cells were removed by centrifugation (5000g for 15 min). The crude protein extracts were quantitated by BCA analysis, aliquoted, and frozen at -80 °C until mixing.

Saccharomyces cerevisiae cells were grown at 30 °C to logarithmic phase in 1 L of synthetic medium lacking tryptophan. After being harvested by centrifugation (3000g for 15 min) and washed with ice-cold water, the cells were resuspended in ice-cold 50 mM Hepes (pH 7.5) and 5 mM EDTA and lysed by vigorous shaking with glass beads for six 2 min pulses in a Braun Scientific (Allentown, PA) cell harvester. Unlysed cells and cell debris were removed by centrifugation at 700g. Soluble cell extracts were collected after membrane fractions were centrifuged at 150000g for 1 h, and were frozen in 45 mL aliquots at -20 °C. Membrane fractions were needed for a separate study and were not included in this analysis. The crude soluble protein extracts were thawed, quantitated by BCA analysis, aliquoted, and frozen at -80 °C until mixing.

Arabidopsis thaliana protein material was obtained by grinding the plant material in liquid nitrogen with a mortar and pestle. The protein material was extracted into 50 mM Tris, pH 7.8, with 10 mM EDTA, and unbroken cells and plant debris were removed by centrifugation (5000g for 15 min). The crude protein extract was quantitated by BCA analysis, aliquoted, and frozen at -80 °C until mixing.
For all experiments (detailed below), the required volume of each species (based on the total protein amount from the BCA assay) was mixed into a single test tube and treated with an excess of ice-cold 100% acetone. In general, twice as much total protein as was necessary was used for the study and all replicate analyses. All mixtures were digested by the following protocol: The protein mixtures were precipitated in the acetone for 30 min at -20 °C, collected by centrifugation (5000g for 15 min), washed a second time with acetone, and precipitated again. The precipitated protein material was resuspended in 2 mL of 6 M guanidine and 10 mM DTT and then heated at 60 °C for 1 h. The guanidine and DTT were diluted with 12 mL of 50 mM Tris (pH 7.8), and sequencing grade trypsin was added at a 1:100 (w/w) ratio. The digestions were run with gentle shaking at 37 °C for 18 h followed by a second addition of trypsin at a 1:100 ratio and an additional 5 h incubation. The samples were then treated with 20 mM DTT for 1 h at 60 °C as a final reduction step. The samples were immediately desalted with Sep-Pak Plus C18 solid-phase extraction (Waters, Milford, MA). All samples were concentrated and solvent exchanged into 0.1% formic acid in water by centrifugal evaporation to ∼10 µg/µL starting material, filtered, aliquoted, and frozen at -80 °C until LC-MS/MS analysis.

Liquid Chromatography-Mass Spectrometry. LC-MS/MS experiments were performed on an integrated Famos/Switchos/Ultimate 1D/2D HPLC system (LC Packings, a division of Dionex, San Francisco, CA) directly coupled to a quadrupole ion trap mass spectrometer (LCQ-DECAXPplus, Thermo Finnigan, San Jose, CA) outfitted with either a Finnigan nanospray (NS) source or a Finnigan orthogonal electrospray (ES) source. The entire system was fully automated and under direct control of the Xcalibur software system (Thermo Finnigan). For all reversed-phase separations, gradients were run from 100% solvent A (95% H2O/5% ACN/0.5% formic acid) to 100% solvent B (30% H2O/70% ACN/0.5% formic acid), followed by a wash with 100% solvent B and a 20 min equilibration back to solvent A before the next run (run times depended on the experiment type below). For nanomode reversed-phase separations, 0.1% formic acid was used instead of 0.5%. The loading solvent used in the 2D LC-MS/MS experiments was 100% H2O with 0.1% formic acid. For all capillary mode reversed-phase separations, the flow rate was set at 4 µL/min; for all nanomode reversed-phase separations the flow rate was set at 200 nL/min. The Ultimate micropump provided the flow for the reversed-phase separation. For capillary mode reversed-phase separations a capillary C18 column (300 µm i.d. × 25 cm, 300 Å with 5 µm particles) (Grace-Vydac, Hesperia, CA) was used. For all nanomode reversed-phase separations a nano C18 column (75 µm i.d. × 25 cm, 300 Å with 5 µm particles) (Grace-Vydac) was used. For the 2D experiments peptides were first trapped on an LC Packings SCX cartridge (500 µm i.d. × 15 mm, 300 Å with 5 µm Polysulfoethyl particles), eluted from this cartridge with increasing concentrations of ammonium acetate, trapped on an LC Packings C18 precolumn (300 µm i.d. × 5 mm, 100 Å with PepMap C18 5 µm particles), and eluted after desalting onto the nano resolving C18 column as detailed above. The outlet from the resolving column was directly connected to the ES or NS source with a short piece of fused silica (20 µm i.d. for NS and 100 µm i.d. for ES).
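To make the solvent arithmetic concrete, the following minimal Python sketch computes the acetonitrile content delivered at a given point of the reversed-phase gradient from the solvent A and solvent B compositions stated above. The linear ramp, the 60 min elution window, and the function names are assumptions for illustration only; the text gives the start and end compositions but not the gradient shape.

```python
# Acetonitrile content (% v/v) of the two reversed-phase solvents, from the text.
ACN_SOLVENT_A = 5.0   # solvent A: 95% H2O / 5% ACN
ACN_SOLVENT_B = 70.0  # solvent B: 30% H2O / 70% ACN


def acn_percent(fraction_b: float) -> float:
    """Return % ACN when the mobile phase contains `fraction_b` (0 to 1) of solvent B."""
    if not 0.0 <= fraction_b <= 1.0:
        raise ValueError("fraction_b must be between 0 and 1")
    return ACN_SOLVENT_A + (ACN_SOLVENT_B - ACN_SOLVENT_A) * fraction_b


def linear_gradient(elution_min: float, n_points: int = 5) -> list[tuple[float, float]]:
    """Hypothetical linear ramp from 100% A to 100% B over `elution_min` minutes.

    Returns (time in minutes, % ACN) pairs at evenly spaced points.
    """
    if n_points < 2:
        raise ValueError("need at least two points to describe a ramp")
    points = []
    for i in range(n_points):
        frac = i / (n_points - 1)
        points.append((elution_min * frac, acn_percent(frac)))
    return points


if __name__ == "__main__":
    # 60 min is an illustrative elution window, not a value taken from the paper.
    for t, acn in linear_gradient(elution_min=60.0):
        print(f"{t:6.1f} min  {acn:5.1f}% ACN")
```

Under this assumption the mobile phase spans 5% to 70% acetonitrile, so the midpoint of the ramp corresponds to 37.5% acetonitrile regardless of how long the elution window is.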
For all experiments, the mass spectrometer was operated with the following parameters: ES voltage, 4.5 kV (Thermo Finnigan orthogonal ES source, 25 units of sheath gas); nanospray voltage, 2.0 kV (Thermo Finnigan nanospray source); heated capillary, 200 °C. ES was performed directly from the 100 µm i.d. fused silica, which rested inside the high-voltage needle; for NS experiments the 20 µm i.d. fused silica was directly connected to a liquid junction (Thermo Finnigan), which was connected to a 10 µm i.d. uncoated fused silica tip (New Objective, Woburn, MA). The mass spectrometer was operated with five microscans averaged for full scans and MS/MS scans, an isolation width of 5 m/z for MS/MS isolations, and a 35% collision energy for collision-induced dissociation. For all experiments, the MS was operated in the data-dependent MS/MS mode, where the four most abundant peaks in every full MS scan were subjected to MS/MS analysis. To prevent repetitive analysis of the same intense peptides, dynamic exclusion was enabled with a repeat count of 1 in 1D analysis and a repeat count of 2 in 2D analysis. The exclusion duration was set to 1 min in both cases. Spectra were acquired continually from injection until 100% B was reached. The instrument did not collect data during the reequilibration step.

Instrumental Analysis Time Test. For this scenario, crude protein lysates from all four background species (A. thaliana, R. palustris, S. oneidensis, S. cerevisiae) and the target species (E. coli) were mixed, precipitated, denatured, digested, and prepared for LC-MS/MS analysis as detailed above. All species were present at roughly equal concentrations (20%) as determined from BCA analysis of the individual lysates. The samples were loaded onto the 1D LC-MS/MS system with 50 µL injections by the Famos autosampler onto a 50 µL loop, which was flushed directly onto the C18 capillary column. A 10 min load time was used for each sample, followed by a gradient elution, and then a 15 min equilibration time. For all experiments, the load time and equilibration time stayed the same while the elution time was varied. Four separate time points were tested: 60, 120, 180, and 240 min of total experiment time (from injection through reequilibration and ready for the next injection). Each time point was repeated in triplicate. Between each time point two automated column washes were run.

Target Concentration Level Test. For this scenario, crude protein lysates from all four background species (A. thaliana, R. palustris, S. oneidensis, S. cerevisiae) and the target species (E. coli) were mixed, precipitated, denatured, digested, and prepared for LC-MS/MS analysis as detailed above. The four background species were kept at equal concentrations, but the target species (E. coli) was at varying concentrations. The first experimental set was run exactly as the 240 min 1D LC-MS/MS experiment detailed above, except that the concentration of the target species was set to 20, 6, 0.6, or 0.06% of the total protein quantity. To avoid cross-contamination problems, the analyses were started at the lowest target concentration (0.06%). Each test concentration was run in triplicate, and as above, two automated column washes were run between each concentration point. The second experimental set involved a 2D LC-MS/MS analysis of the 20, 0.6, and 0.3% concentration samples. The 2D analysis involved an injection step followed by nine subsequent salt steps (ammonium acetate), each taking 3 h, for a total of 30 h of analysis time. These analyses were run a single time only.

Control Analysis (No E. coli Sample).
To test the level of false positive peptide identifications (see definition below) for the E. coli species, a sample was prepared with the four background species (A. thaliana, R. palustris, S. oneidensis, S. cerevisiae) at 25% each, but no target species (E. coli). This sample was prepared exactly as the samples above. The sample was analyzed in triplicate with the same method as the 240 min 1D LC-MS/MS analysis. The number of false positive peptides for E. coli was determined as detailed below.

Database Generation, Data Searching, and Target Identification. Two databases were used for this study. The first database, termed the small database, contained 13 species and was used for the analysis of all the data files from this study. Table 1 lists all the species contained in the small protein database. The database contained all predicted open reading frames (ORFs) for the five species present in the test mixtures, plus ORFs for seven other bacterial species to determine false positive peptide identification rates (see the definition below). All protein files were obtained from public sources if the genome had been sequenced, annotated, and made publicly available. The source for these genomes is designated as public in Table 1, and the genome-sequencing project is referenced. For some of the genomes, only a draft sequence and annotation were available from the Department of Energy (DOE) Joint Genome Institute-Oak Ridge National Laboratory Genome Modeling and Annotation Group (DOE/JGI-ORNL) (http://compbio.ornl.gov/channel/). The source for these genomes is designated as DOE in Table 1.

Table 1. Species in the Database and Test Mixture

| organism | source (a) | function within database |
|---|---|---|
| Bacillus anthracis | public | potential biothreat agent |
| Burkholderia xenovorans | DOE | opportunistic pathogen |
| Deinococcus radiodurans | public | radiation resistant |
| Geobacter metallireducens | DOE | common in soil |
| Nitrosomonas europaea | DOE | N2 fixer, common in soil |
| Pseudomonas aeruginosa | public | common in soil |
| Yersinia pestis CO92 | public | potential biothreat agent |
| Yersinia pestis KIM | public | potential biothreat agent |
| Arabidopsis thaliana | public | background in experimental mixture |
| Saccharomyces cerevisiae | public | background in experimental mixture |
| Shewanella oneidensis | public | background in experimental mixture |
| Rhodopseudomonas palustris | DOE | background in experimental mixture |
| Escherichia coli | public | target organism |

(a) Source of protein database.

An exception should be noted for R. palustris. Its genome was recently published (33), and a final public annotation is now available, but this study was started before the final public annotation was available, so an earlier draft annotation was obtained from the website mentioned above. The resulting database of 13 organisms contains 83 777 predicted proteins. Each protein entry was designated with an organism code followed by a numerical identifier and, in some cases, a predicted functional annotation, followed by the amino acid sequence. This simple numbering system of organism code and unique numerical identifier allowed for easy sorting of target identifications from the background as detailed below.

The second database was used for only one of the datasets, to compare the effects of database size on the identification of unique peptides. This database was termed the large database and contained 261 species with 1 011 612 protein entries (note that only one of the sequenced versions of E. coli was included, the K-12 sequenced strain). This database was created by concatenating FASTA files of all proteins from published genomes and all predicted proteins from the annotated DOE/JGI-ORNL draft sequences.

All MS/MS spectra were searched against the final concatenated protein database using the SEQUEST algorithm (Thermo Finnigan) with the following parameters: enzyme type, trypsin; parent mass tolerance, 3.0; fragment ion tolerance, 0.5. The output data files were then filtered and sorted to obtain useful peptide identifications and, in this case, provide definitive evidence for target identification. This process was achieved with the DTASelect algorithm (34) using the following parameters: fully tryptic peptides only, with a delCN of at least 0.08 and cross-correlation scores (Xcorrs) of at least 1.8 (+1), 2.5 (+2), and 3.5 (+3). All peptides passing these criteria were kept for further analysis.
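The filter thresholds above can be written down as a small predicate. The following Python sketch is illustrative only and is not the DTASelect implementation: the `PeptideMatch` record, its field names, and the simplified tryptic check (cleavage after K or R, ignoring any proline rule) are assumptions, and the delCN value and flanking residues in the usage example are hypothetical. Only the numerical thresholds and the Xcorr of 3.46 for the +2 peptide SLYEADLVDEAKR come from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text: minimum Xcorr per charge state and minimum delCN.
MIN_XCORR = {1: 1.8, 2: 2.5, 3: 3.5}
MIN_DELTA_CN = 0.08


@dataclass(frozen=True)
class PeptideMatch:
    """Hypothetical record for one peptide-spectrum match."""
    sequence: str
    charge: int
    xcorr: float
    delta_cn: float
    prev_residue: str  # residue preceding the peptide in the protein, "-" at the N-terminus
    next_residue: str  # residue following the peptide in the protein, "-" at the C-terminus


def is_fully_tryptic(match: PeptideMatch) -> bool:
    """Simplified check: both peptide ends must be consistent with cleavage after K or R."""
    n_term_ok = match.prev_residue in ("K", "R", "-")
    c_term_ok = match.sequence[-1] in ("K", "R") or match.next_residue == "-"
    return n_term_ok and c_term_ok


def passes_filter(match: PeptideMatch) -> bool:
    """Apply the charge-dependent Xcorr cutoff, the delCN cutoff, and the tryptic rule."""
    threshold = MIN_XCORR.get(match.charge)
    if threshold is None:  # charge states other than +1, +2, +3 are not covered by the text
        return False
    return (
        match.xcorr >= threshold
        and match.delta_cn >= MIN_DELTA_CN
        and is_fully_tryptic(match)
    )


if __name__ == "__main__":
    # Xcorr and charge are from the Figure 1 example; delta_cn and flanking residues are made up.
    example = PeptideMatch("SLYEADLVDEAKR", 2, 3.46, 0.20, "K", "A")
    print(passes_filter(example))
```

Keeping the cutoffs in a dictionary keyed by charge state makes the charge dependence explicit: the same Xcorr of 3.46 would fail if the match had been assigned a +3 charge state.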
(33) Larimer, F. W.; Chain, P.; Hauser, L.; Lamerdin, J.; Malfatti, S.; Do, L.; Land, M. L.; Pelletier, D. A.; Beatty, J. T.; Lang, A. S.; Tabita, F. R.; Gibson, J. L.; Hanson, T. E.; Bobst, C.; Torres, J. L.; Peres, C.; Harrison, F. H.; Gibson, J.; Harwood, C. S. Nat. Biotechnol. 2004, 22, 55.
(34) Tabb, D. L.; McDonald, W. H.; Yates, J. R., III J. Proteome Res. 2002, 1, 21.

Perl scripts were then used to extract unique peptides from the DTASelect-filtered file. All peptides that were identified above filter criteria but were not unique, or that were identified below filter criteria, were discarded from further analysis. The resultant unique peptides passing the minimum SEQUEST scores from target species, background species, and database background species were collected into spreadsheets and compared to determine the level of accurate identification from each scenario test. False positive peptides were defined as those unique peptides that match species in the database that are not present in the test mixtures as target or background species. This does not define the false positive identification of a species in a mixture, but rather the random matching to peptides in the background database.
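The unique-peptide bookkeeping described above can be sketched in a few lines. This Python example is a hedged illustration rather than a reconstruction of the authors' Perl scripts: the input layout (a mapping from peptide sequence to the set of species whose proteins contain it), the organism codes, and the peptide strings are all hypothetical.

```python
from collections import defaultdict


def unique_peptides(peptide_to_species: dict[str, set[str]]) -> dict[str, set[str]]:
    """Group peptides that occur in exactly one species of the database by that species."""
    by_species: dict[str, set[str]] = defaultdict(set)
    for peptide, species in peptide_to_species.items():
        if len(species) == 1:  # shared peptides are discarded, as in the text
            (only,) = species
            by_species[only].add(peptide)
    return dict(by_species)


def count_by_role(unique: dict[str, set[str]], target: str, background: set[str]) -> dict[str, int]:
    """Count unique peptides from the target, the background species, and everything else.

    Unique peptides assigned to a species that is in the database but not in the
    sample are counted as false positive peptides, following the paper's definition.
    """
    counts = {"target": 0, "background": 0, "false_positive": 0}
    for species, peptides in unique.items():
        if species == target:
            role = "target"
        elif species in background:
            role = "background"
        else:
            role = "false_positive"
        counts[role] += len(peptides)
    return counts


if __name__ == "__main__":
    # Toy identifications with made-up peptide strings and organism codes.
    hits = {
        "AAAK": {"Ecoli"},
        "CCCR": {"Ecoli", "Ypestis_CO92"},  # shared between two species: not unique
        "DDDK": {"Soneidensis"},
        "EEEK": {"Banthracis"},             # species absent from the sample
    }
    unique = unique_peptides(hits)
    print(count_by_role(unique, target="Ecoli", background={"Soneidensis", "Rpalustris"}))
```

The sketch also shows why database size matters: adding a relative of the target to the database can only move peptides from the unique set into the shared set, never the reverse.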

RESULTS AND DISCUSSION

The goal of this study was to determine experimentally the ability of standard shotgun proteomics techniques to identify a target biological species, which simulated a biothreat agent, in a complex mixture of representative environmental background species. As described in the Experimental Section, two different biothreat scenarios were constructed. Experiments employing the instrumental analysis time test investigated the effectiveness of 1D LC-MS/MS for target identification as a function of decreasing instrumental analysis time. Experiments using the target concentration level test sought to determine the low concentration limit for target identification with either 1D LC-MS/MS or 2D LC-MS/MS. We also examined the effect of database size on the ability to detect unique positive peptides by testing the results from the target concentration level test against a small database (13 species) and a large database (261 species).

The target species was the E. coli K-12 strain (35), which is related to some potential biological weapons, but is itself innocuous. Background organisms were selected to mimic some common biological species found in a typical environmental sample. In addition, the background species were chosen on the basis of access to starting cultures and ease of growth under normal laboratory conditions. S. oneidensis (36) and R. palustris (33) are common soil microbes, S. cerevisiae (37) is a model yeast species, and A. thaliana (38) is a model plant species. All corresponding genomes have been sequenced and are publicly available.

(35) Blattner, F. R.; Plunkett, G., III; Bloch, C. A.; Perna, N. T.; Burland, V.; Riley, M.; Collado-Vides, J.; Glasner, J. D.; Rode, C. K.; Mayhew, G. F.; Gregor, J.; Davis, N. W.; Kirkpatrick, H. A.; Goeden, M. A.; Rose, D. J.; Mau, B.; Shao, Y. Science 1997, 277, 1453.
(36) Heidelberg, J. F.; Paulsen, I. T.; Nelson, K. E.; Gaidos, E. J.; Nelson, W. C.; Read, T. D.; Eisen, J. A.; Seshadri, R.; Ward, N.; Methe, B.; Clayton, R. A.; Meyer, T.; et al. Nat. Biotechnol. 2002, 20, 1118.
(37) Mewes, H. W.; Albermann, K.; Bahr, M.; Frishman, D.; Gleissner, A.; Hani, J.; Heumann, K.; Kleine, K.; Maierl, A.; Oliver, S. G.; Pfeiffer, F.; Zollner, A. Nature 1997, 387S, 7.

Figure 1. Experimental flowchart for shotgun proteomics analysis of the simulant biothreat sample.

Shotgun Proteomics Methodology. The experimental concept is highlighted in Figure 1 for a representative peptide from the target species E. coli. The top left panel depicts a typical base peak chromatogram from a reversed-phase separation of a complex peptide mixture. For this case, the 25 mM salt step from the highest E. coli concentration in the 2D concentration test described above is represented. The top right panel depicts the full MS scan obtained at a given time point (39.34 min). At this point, ∼30 ionic species were detected above the noise level by the mass spectrometer. Four of these ionic species were then selected for tandem mass spectrometry. The fragment ion spectrum for the m/z 755.11 species shown in the bottom left panel was observed in the second MS/MS scan after the full mass scan. This MS/MS spectrum, along with all other MS/MS spectra from a given run, was processed by the SEQUEST (39) algorithm to give the most likely candidate peptide for a given charge state, as depicted in the bottom right panel. In this case, the MS/MS spectrum matched a predicted peptide, SLYEADLVDEAKR, that is unique to the E. coli protein phosphoglycerate kinase. The resultant match was of high quality, with an Xcorr of 3.46 from a +2 charge state peptide.

(38) The Arabidopsis Initiative. Nature 2000, 408, 796.
(39) Eng, J. K.; McCormack, A. L.; Yates, J. R., III J. Am. Soc. Mass Spectrom. 1994, 5, 976.

DTASelect was then used to process the

Figure 2. Triplicate analysis via 1D LC-MS/MS as a function of experimental time with a constant target concentration. This dataset was searched against the 13-component database.

many thousands of MS/MS spectra to align peptides to parent proteins, label peptides as unique or not, and then filter the dataset to remove as many false positive peptides as possible. Though peptides may be identified above a certain filtering level, in a large database of many proteins (and especially if that database contains many related species with many conserved proteins), there will exist a large number of replicate tryptic peptides with the exact same sequence. If a specific peptide is found in multiple species, then it cannot be used as a unique peptide for identification purposes. This issue was handled with the DTASelect algorithm since it labels all unique peptide identifications with an asterisk; nonunique peptides are listed under all possible originating proteins and do not contain an asterisk. This is depicted in the table at the bottom of Figure 1. The unique peptide identified in the above example, as well as other unique peptides identified in the same experiment from this given E. coli protein (phosphoglycerate kinase), are listed and labeled with asterisks. Peptides that are not unique, but also matching this protein, are unlabeled. Control Analysis (No E. coli Sample). To test the level of false positive peptide identifications for the E. coli species, a sample was prepared with the four background species (A. thaliana, R. palustris, S. oneidensis, S. cerevisiae) at 25% each, but no target species (E. coli). The sample was analyzed in triplicate with the 240 min 1D LC-MS/MS analysis and searched against the small database. For this analysis three, two, and zero unique E. coli peptides were identified by the SEQUEST search of the MS/MS data in each of the three runs, respectively. These results are very similar to the random level of matching to background database species not present in the sample discussed below. The results on the other four species were typical of all analyses discussed below, validating the quality of this analysis. 
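The uniqueness rule described for DTASelect earlier in this section (a peptide counts toward a species only if its sequence occurs in no other species in the database) can be sketched as follows. This is a minimal illustration, not the DTASelect implementation: it assumes a simplified in-silico tryptic digest (cleavage after K or R unless the next residue is P, no missed cleavages), and all function names and protein sequences are hypothetical toy values.

```python
import re
from collections import defaultdict


def tryptic_peptides(protein, min_len=6):
    """Simplified tryptic digest: cut after K or R unless followed by P."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {p for p in pieces if len(p) >= min_len}


def unique_peptides_by_species(proteomes, min_len=6):
    """Map each species to the peptides found in that species only.

    proteomes: dict of species name -> list of protein sequences.
    """
    owners = defaultdict(set)  # peptide -> set of species containing it
    for species, proteins in proteomes.items():
        for protein in proteins:
            for peptide in tryptic_peptides(protein, min_len):
                owners[peptide].add(species)
    unique = defaultdict(set)
    for peptide, species_set in owners.items():
        if len(species_set) == 1:  # shared peptides are not diagnostic
            unique[next(iter(species_set))].add(peptide)
    return dict(unique)


# Toy sequences (hypothetical): a small database, then the same database
# with one added species that shares a peptide with the target.
small_db = {
    "target": ["AAAAAAKCCCCCCR"],
    "unrelated": ["DDDDDDK"],
}
large_db = dict(small_db, relative=["AAAAAAKEEEEEER"])

unique_small = unique_peptides_by_species(small_db)
unique_large = unique_peptides_by_species(large_db)
```

In this toy case the target has two unique peptides in the small database but only one once a related species is added, which is the same mechanism by which a larger database with many related genomes reduces the number of unique peptides available for a given species.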
Instrumental Analysis Time Test. The results from this study, involving triplicate analysis via 1D LC-MS/MS as a function of experimental time with constant target concentration, are illustrated in Figure 2. This dataset was searched against the small database. The reported numbers include only those peptides identified above conservative filter values that were unique identifications to the target organism. The number of peptides identified from the target was quite reproducible: time point 240 (283.7 average peptide IDs, standard deviation 9.0), time point 180 (236.0 average peptide IDs, standard deviation 7.8), time point 120 (154.3 average peptide IDs, standard deviation 12.9), and time point 60 (44.0 average peptide IDs, standard deviation 3.6). The unique false positive peptide identifications (i.e., those originating from background database peptides not present in the sample) showed a similar consistency, but with much lower average values. It seems clear that this common and simple shotgun proteomics technique was capable of correctly identifying the target species at approximately the same concentration as four other background species against a database containing seven additional background species. Even at the 60 min mark for total instrument time, the system clearly picked up the E. coli peptides at a substantially higher number than the unique false positive peptide identifications. It would seem reasonable to test shorter time periods, but with the current simple setup, 25 min of the time is taken up with sample loading and column reequilibration. Other HPLC instrumental setup designs should be considered first in future experiments before more work is done on the MS end to reduce the time.

Target Concentration Level Test. Figure 3a illustrates the positive peptide identifications and false peptide identifications for a range of target concentrations with a constant time for 1D LC-MS/MS analysis, for the target species as well as all other species in the sample. The data illustrated in Figure 3a were searched against the small database. Again, the resultant number of peptide identifications from the target was very reproducible: concentration point 20% (270.3 average peptide IDs, SD 15.3), concentration point 6% (83.3 average peptide IDs, SD 0.6), concentration point 0.6% (4.3 average peptide IDs, SD 2.08), and concentration point 0.06% (1.7 average peptide IDs, SD 0.5). The unique false peptide identifications showed a similar consistency.
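The reported averages can be compared with the per-species false positive level in a short calculation. The sketch below uses only numbers stated in this section: the control-run counts of unique E. coli peptides (three, two, and zero), the averages and standard deviations listed above, and the range of zero to five false positive peptides per database species reported later in this section. The decision rule (average minus one standard deviation must exceed the top of that range) and all names are assumptions made for illustration; they are not a criterion defined by the authors.

```python
from statistics import mean, stdev

# Unique E. coli peptides in the three control runs (no E. coli added).
control_counts = [3, 2, 0]
control_mean = mean(control_counts)
control_sd = stdev(control_counts)  # sample standard deviation

# Reported (average, SD) of unique E. coli peptide IDs, 1D LC-MS/MS,
# 240 min, 13-species database.
reported = {
    "20%": (270.3, 15.3),
    "6%": (83.3, 0.6),
    "0.6%": (4.3, 2.08),
    "0.06%": (1.7, 0.5),
}

# Upper end of the per-species false positive range reported in the text.
FALSE_POSITIVE_CEILING = 5


def clearly_detected(avg, sd, ceiling=FALSE_POSITIVE_CEILING):
    """Illustrative rule (assumption): average minus one SD exceeds ceiling."""
    return avg - sd > ceiling


calls = {level: clearly_detected(avg, sd) for level, (avg, sd) in reported.items()}
```

Under this assumed rule the 20% and 6% levels are flagged as detected and the 0.6% and 0.06% levels are not; the control runs themselves average fewer than two unique E. coli peptides.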
For this experimental setup and this database size, the data for average peptide identifications and the corresponding standard deviations suggest that positive target identification is attainable at the 20% and the 6% concentration levels. However, the same data imply that positive target identification is not achievable at the 0.6% or the 0.06% concentration levels with the current false peptide identification rate. Figure 3a also illustrates the detection of unique peptides from the other species in the mixtures. The microbial background species (R. palustris, S. cerevisiae, and S. oneidensis) are all easily identified above the background, with a consistent rise in peptide identifications from each as the E. coli concentration decreases. Reproducibility similar to that with E. coli was found for these species. Furthermore, all three background microbial species were easily detected in all experiments in this study. The A. thaliana species was consistently difficult to detect. Indeed, A. thaliana was not clearly detected above the background in any experiment in this study, even though it was added at a concentration equal to that of R. palustris, S. cerevisiae, and S. oneidensis according to the BCA analysis. While the A. thaliana species did not contribute as we would have hoped to the complexity at the peptide level, it did contribute to the complexity of the sample matrix and sample processing. Indeed, the acetone precipitation step was needed to remove the large excess of chlorophyll from A. thaliana and bacteriochlorophyll from R. palustris. It is unclear to the authors why A. thaliana was consistently so hard to detect in the presence of the other species. For this study, all species were processed separately by routine methodologies and then mixed on the basis of protein quantity estimates from BCA analysis. For real world samples, all species would need to be processed by the same technique, which could bias for or against any given species. The effect of sample preparation on known microbial-plant mixtures is an area that clearly needs further research.

Figure 3. (a, top) Unique positive peptide identifications and unique false positive peptide identifications for a range of target concentrations of E. coli with a constant time for 1D LC-MS/MS. The analysis was run in triplicate and searched against the 13-component database. Column labels from left to right: E. coli, R. palustris, S. cerevisiae, S. oneidensis, A. thaliana, false positive peptide IDs. (b, middle) E. coli target species (crosshatched bar) and any given database background species (black bars) at the 6% target concentration from a single LC-MS/MS analysis at this concentration. The data were searched against the 13-component database. (c, bottom) Unique positive peptide identifications and unique false positive peptide identifications for a range of target concentrations of E. coli with a constant time for 1D LC-MS/MS. The analysis was run in triplicate and searched against the 261-component database. Column labels from left to right: E. coli, R. palustris, S. cerevisiae, S. oneidensis, A. thaliana, false positive peptide IDs.

Figure 3b illustrates the clear separation achieved between the E. coli target species and any given background species at the 6% target concentration from a single LC-MS/MS analysis at this concentration with the 13-species database. A total of 9 false peptide identifications were made at this concentration compared with 83 unique target peptides, but no background species had more than three unique false positive peptide identifications. This random distribution of false positive peptide identifications, ranging from zero to five unique peptides per species, was observed for all experiments in this study. Though an average of 4.3 peptides were observed from the target species E. coli at the 0.6% concentration, this number is clearly too close to the range (0-5) of false positive peptide identifications common for any given background species at this length of data acquisition (240 min). In general, the number of false positive peptides for a given species increases as the instrument acquisition time increases and decreases as the instrument acquisition time decreases. It is unclear whether any changes could be made (with this setup and in this time frame) to produce positive identifications at the 0.6 and 0.06% levels.

A serious consideration for the positive identification of an unknown species in a complex background matrix is the effect of database size. As the size of the database increases, the number of unique peptides for any given species in that database decreases (especially if the database contains many species closely related to the species of interest).
This will become a serious concern when the concentration of the target species decreases and fewer peptides can be sequenced. We tested this effect by taking the same dataset used in Figure 3a (range of target concentrations with a constant time for 1D LC-MS/MS analysis) and searching against a very large microbial and plant database. This database was generated from public and DOE/JG-ORNL sources and contained 2 plant species (Oryza sativa and A. thaliana) as well as 259 microbial species. The results from this database test are illustrated in Figure 3c. As should be expected, the number of unique false positive peptide identifications rose dramatically relative to the 13-species database. Across all concentrations, an average of 8.5 false positive peptides were identified with the 13-species database, whereas an average of 71 false positive peptides were identified with the 261-species database. The larger database did not seem to have a dramatic effect on the identification of unique peptides from R. palustris, S. cerevisiae, and S. oneidensis. Indeed, for any of these three species there was only a loss of 30-50 unique peptides on average in going from the small database to the large database. In contrast, the losses for the E. coli target were drastic, to the point that positive identification could not be made at any of the E. coli concentrations. While an average of seven unique peptides were identified at the 20% concentration mark, this was clearly well below the false positive peptide level. The loss of unique peptides was much more severe for E. coli than for the other microbial species, which was not altogether surprising. These results are most likely due to the fact that E. coli is a γ-proteobacterium, and a large number of microbial species in this family have been sequenced and were included in this database. The same cannot be said for R. palustris, S. cerevisiae, and S. oneidensis, none of which has nearly as many related species whose genomes have been sequenced at this time. Clearly, the effects of database size and composition are an area of needed research for both bioweapons detection and general environmental proteomics applications.

Figure 4. Positive peptide identifications and false positive peptide identifications for a range of target concentrations with a constant time for a single 2D LC-MS/MS analysis at 20% target species, 0.6% target species, and 0.3% target species. The data were searched against the 13-component database.

Although 2D LC-MS/MS methods typically result in longer total analysis times, they generally improve the dynamic range, separation ability, and sensitivity. Thus, the 2D LC-MS/MS method described above was investigated for positive target identification at the 0.6 and the 0.06% concentration levels (the 20% concentration level was also included in these experiments as a control). Each sample was analyzed a single time by the 2D LC-MS/MS system, and the data were searched against the 13-species database. The results from this study are shown in Figure 4. Clearly, the target was again easily identified at the 20% concentration level; indeed, the number of peptides identified was much greater than in the 1D analysis. Unfortunately, the data for the 0.6 and 0.06% concentration levels imply that positive target identification is still not achievable with the 2D methodology against the 13-species database. These results are not altogether surprising, since it is known that common 2D methodologies based on SCX and RP separations give the MS more time to collect spectra from a given sample but do not greatly increase dynamic range over a 1D separation.
Indeed, if one repeats the same 1D experiment five to six times, with five to six segregated m/z ranges scanned by the mass spectrometer, and combines the data sets, the results will be very similar to those of the 2D methodology (30). Thus, the 2D methodology gives much better results with the same starting material than a 1D experiment, but if excess sample is available, the same results can be achieved with a replicate 1D method with multiple mass range scans. Current 1D or 2D LC-MS/MS systems, based on current mass spectrometry techniques, will not easily be able to analyze target species if they are at less than 1% in the background matrix. We believe this study gives a good starting point for future work with possible alternative separation techniques and MS methodologies, which may extend the dynamic range of proteome measurements to the point of allowing for positive identifications of target species at less than 1% in a background matrix. This level of dynamic range must be reached for shotgun proteomics techniques to be truly applicable to bioweapons threat, agricultural, medical, and environmental applications where a need exists for positive identifications of low-level species components in complex background matrixes. The species-dependent difficulties associated with database size are another major challenge for this technology, which must be addressed for numerous potential biothreat agents to determine the efficiency of this technique for a given species.

CONCLUSION

To our knowledge, this is the first investigation of the capability of shotgun proteomics techniques to identify a target species in a background matrix of multiple background species, which tests the capabilities of a current commercial LC-MS/MS system as well as the effect of database size. This study examined three central concepts: the speed with which a complex microbial mixture can be analyzed by LC-MS/MS, the effects of target concentration, and the effects of database size. We demonstrated that the system performed well in the instrumental analysis time test where the target species was at a high concentration relative to the background species (∼20%). The target species was clearly identified above the false positive peptide identification rate for all time points tested (60-240 min) with the 13-species database. To determine the effects of concentration on target detection, the LC-MS/MS system was tested with four different E. coli target concentrations with the 13- and the 261-species databases. With the 13-species database, the E. coli target was clearly identified at the 20% and 6% concentrations but could not be detected at the 0.6 and 0.06% concentrations. With the 261-species database, the background microbial species were easily detected at all E. coli concentrations, but the E. coli target species could no longer be detected above the background at any concentration. This is presumably due to the larger number of species in the current database related to the E. coli target than to the other three background microbes. This study points to the potential application of shotgun proteomics for biological threat agent detection but highlights many areas of needed research before the technology will be useful in real world samples. These areas include sample processing, LC-MS/MS system development, and detailed studies of the effects of database size.

ACKNOWLEDGMENT

We thank Grace-Vydac for their generous supply of HPLC columns, and LC Packings, a Dionex company, for the 1D/2D HPLC systems through a collaborative agreement. We thank Dr. David Tabb and the Yates Proteomics Laboratory at Scripps Research Institute for the DTASelect software. We thank Kyle P. Ellrott and Adam M. Tebbe of the University of Tennessee-Oak Ridge National Laboratory (UT-ORNL) school of Genome Science and Technology for invaluable discussions and assistance in software engineering. We thank Dr. Barry Bruce, David McWilliams, Ayca-Akal Strader, Dr. M. Brad Strader, Patricia Lankford, and Dr. Dale Pelletier for generous gifts of organisms and assistance in their growth and harvest. Becky R. Maggard (ORNL) is thanked for secretarial assistance in the preparation of this manuscript. W.J.H. and N.C.V. acknowledge support through the UT-ORNL school of Genome Science and Technology. This research was sponsored by the Laboratory Directed Research and Development Program of ORNL, managed by UT-Battelle, LLC, for the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Received for review June 14, 2004. Accepted November 11, 2004. AC049127N