Anal. Chem. 2009, 81, 9633–9642
Top-Down Identification of Protein Biomarkers in Bacteria with Unsequenced Genomes
Colin Wynne,† Catherine Fenselau,*,† Plamen A. Demirev,‡ and Nathan Edwards*,§
Department of Chemistry & Biochemistry, University of Maryland, College Park, Maryland, Applied Physics Laboratory, Johns Hopkins University, Laurel, Maryland, and Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, D.C.
MALDI mass spectrometry-based systems for rapid characterization of microorganisms in biodefense or medical diagnostics usually detect intact proteins in the 5000-20,000 Da range. To evaluate the reliability of species discrimination, and also for forensic applications, it is important that these biomarker proteins be identified. In the present study we apply high resolution tandem mass analysis on an Orbitrap and top-down bioinformatics to identify major biomarker proteins observed in MALDI spectra of intact bacteria for which little genomic or protein sequence information is available. The strategy depends on recognition of proteins with very high homology in related (sequenced) species, making it possible to place unsequenced organisms in their correct phylogenetic context. We show that this rapid proteomics based approach to phylogenetic characterization produces similar results to the traditional techniques, and may even be applied to target organisms of undetermined taxonomy. We further discuss important issues in combining genomics/proteomics databases and MALDI MS for the rapid characterization of microorganisms.
Considerable work over the last 15 years has demonstrated the capabilities of MALDI mass spectrometry for rapid characterization of intact microorganisms.1 Typically, such analyses take the form of spectral matching to a mass spectral database of microorganism signatures (“fingerprints”).2,3 Several commercial systems have been designed to provide identifications of carefully cultured bacteria in laboratory settings. In such systems, control of the matrix-to-sample ratio and all other aspects of the measurement of the signature spectrum are important for faithful fingerprint comparisons.
* To whom correspondence should be addressed. E-mail: [email protected] (C.F.), [email protected] (N.E.).
† University of Maryland. ‡ Johns Hopkins University. § Georgetown University Medical Center.
(1) Fenselau, C.; Demirev, P. A. Mass Spectrom. Rev. 2001, 20, 157–171.
(2) Wunschel, D. S.; Hill, E. A.; McLean, J. S.; Jarman, K.; Gorby, Y. A.; Valentine, N.; Wahl, K. J. Microbiol. Methods 2005, 62, 259–271.
(3) Mandrell, R. E.; Harden, L. A.; Bates, A.; Miller, W. G.; Haddon, W. F.; Fagerquist, C. K. Appl. Environ. Microbiol. 2005, 71, 6292–6307.
10.1021/ac9016677. 2009 American Chemical Society. Published on Web 11/02/2009.
In contrast, instrumental systems that provide near real time detection, required for deployment, for example, on the battlefield or in subways, cannot accommodate culturing and do not provide the controlled measurements necessary for reliable matching with spectral
database signatures. Consequently, a variety of proteomic strategies have been developed to interpret the MALDI spectrum of the organism rather than merely fingerprint it.1,4,5 These approaches undertake to identify the biomarkers detected in the spectrum, and thereby the source organism, by reference to databases of genome and protein sequences. Protein mass mapping, bottom-up peptide sequencing, and top-down protein sequencing have all been demonstrated, and the first two have been implemented in automated portable instruments.6,7 Despite significant increases in reliability and flexibility provided by interpretive analysis of the mass spectra, the limited (∼1000) number of bacteria and archaea whose genomes have been sequenced is considered a limitation to the application of broadband untargeted proteomic approaches.
Top-down proteomics, based on tandem mass spectrometry of intact fractionated and separated proteins, provides an approach for large scale characterization of proteins from microorganisms with sequenced genomes.8 Both types of FTMS instruments, ICR9 and Orbitrap,10 have been used for top-down proteomics. The molecular mass of the intact precursor protein along with fragment ions from MS/MS experiments allows high confidence mapping to database protein entries as well as characterization of posttranslational modifications. Top-down analysis of biomarkers for bacterial spores has been reported previously, carried out after sample enrichment on an ESI FTICR instrument,11 and from an unfractionated bacterial sample on a MALDI TOF-TOF instrument.12 In these experiments, proteins were identified either by performing a homology search of extracted sequence tags11 or by comparison of MS/MS spectra with fragmentations predicted from protein sequences in existing proteome databases.12
(4) Demirev, P. A.; Fenselau, C. Annu. Rev. Anal. Chem. 2008, 1, 71–93.
(5) Demirev, P. A.; Fenselau, C. J. Mass Spectrom. 2008, 43, 1441–1457.
(6) Ecelberger, S. A.; Cornish, T. J.; Collins, B. F.; Lewis, D. L.; Bryden, W. A. Johns Hopkins APL Tech. Dig. 2004, 25, 14–19.
(7) Sundaram, A. K.; Gudlavalleti, S. K.; Oktem, B.; Razumovskaya, J.; Gamage, C. M.; Serino, R. M.; Doroshenko, V. M. Proceedings of the 56th Conference of the American Society for Mass Spectrometry; Denver, CO, June 1-5, 2008.
(8) Kelleher, N. L. Anal. Chem. 2004, 76, A197–A203.
(9) Meng, F.; Cargile, B. J.; Miller, L. M.; Forbes, A. J.; Johnson, J. R.; Kelleher, N. L. Nat. Biotechnol. 2001, 19, 952–956.
(10) Macek, B.; Waanders, L. F.; Olsen, J. V.; Mann, M. Mol. Cell. Proteomics 2006, 5, 949–958.
(11) Demirev, P. A.; Ramirez, J.; Fenselau, C. Anal. Chem. 2001, 73, 5725–5731.
(12) Demirev, P. A.; Feldman, A. B.; Kowalski, P.; Lin, J. S. Anal. Chem. 2005, 77, 7455–7461.
Analytical Chemistry, Vol. 81, No. 23, December 1, 2009
Figure 1. MALDI MS signature spectrum (positive ions) of intact Y. rohdei.
In the present study, proteins solubilized from Yersinia rohdei are
analyzed by LC-MS/MS on an ESI LTQ-Orbitrap for identification without reference to that organism’s genome. We note that the Y. rohdei genome project (NCBI Project ID: 29767) at the Naval Medical Research Center recently released (June 2009) a number of whole-genome shotgun (WGS) contigs to Genbank, but an assembled genome sequence has not yet been submitted and our work was carried out before any Y. rohdei genomic sequence was available. We provide a retrospective analysis of the newly released WGS contigs in the Results and Discussion Section. Y. rohdei is a non-pathogenic (BSL1) gram-negative species that was initially isolated from feces of dogs as well as humans in 1987.13 It exhibits several of the key biochemical reactions for the genus Yersinia. Y. rohdei is often used as a simulant for the potential bioterrorism agent Yersinia pestis, the causative agent of plague. Y. rohdei’s MALDI signature, obtained directly from whole organisms, is shown in Figure 1. The initial goal of this study has been to identify the proteins detected in this low resolution spectrum. However, the genome and high-quality protein annotations for this species are not yet available, and only a few protein sequences are available for Y. rohdei in public databases. Here, we test the hypothesis that protein biomarkers (and other proteins) from bacteria with unsequenced genomes can be identified by their homology to highly conserved sequences
of known proteins in related species. In the strategy used here, the data set from an automated top-down analysis of low mass proteins extracted from Y. rohdei is used to search a database of publicly available protein sequences from other Yersinia species as well as other related Enterobacteriaceae. The identified proteins make it possible to place Y. rohdei in its correct phylogenetic context, demonstrating a rapid, proteomics-based approach to phylogenetic characterization that gives results similar to those of the traditional techniques and may even be applied to target organisms of undetermined taxonomy.
(13) Aleksic, S.; Steigerwalt, A. G.; Bockemuhl, J.; Huntley-Carter, G. P.; Brenner, D. J. Int. J. System. Bacteriol. 1987, 37, 327–332.
(14) Hathout, Y.; Ho, Y. P.; Ryzhov, V.; Demirev, P.; Fenselau, C. J. Nat. Prod. 2000, 63, 1492–1496.
EXPERIMENTAL METHODS
Preparation of Bacteria Extracts. Bacillus anthracis Sterne cells were cultured on Nutrient Broth medium plates (ThermoFisher, Fair Lawn, NJ) using previously published methods.14 Cells were scraped into 10 mL of broth in 15 mL tubes using a sterile loop. Cell suspensions were centrifuged at 6000 rpm for 10 min, then washed with 3 mL of Milli-Q water, and centrifuged again for 5 min at 6000 rpm. This wash step was repeated two additional times, with the supernatant discarded each time. The pellet was then resuspended in 3 mL of 10% formic acid and centrifuged at 10000 rpm for 5 min. The supernatant was transferred to a vial for injection into the LC-MS/MS. Eight milliliters of a solution of 4.6 × 10⁸ cells/mL of Y. rohdei (grown
at the Johns Hopkins Applied Physics Lab under standard growth conditions) was further washed and lysed following the same procedures.
MALDI-TOF MS Analysis. A signature spectrum was constructed for intact Y. rohdei by averaging multiple spectra, each taken from an individual sample well, 600 shots per spectrum. The spectra were obtained on a commercial MALDI TOF instrument with resolving power ∼10³ at fwhm (Bruker Microflex, Bruker Daltonics, Billerica, MA) in positive ion linear mode following standard sample preparation and data acquisition procedures.15
LC-MS/MS Analysis. An Accela HPLC unit (ThermoFisher), interfaced to an LTQ-Orbitrap (ThermoFisher), was used. For the online HPLC separation of the intact proteins prior to ESI, solvent A was 95% water/5% acetonitrile/0.1% formic acid and solvent B was 95% acetonitrile/5% water/0.1% formic acid (ThermoFisher). Both sets of proteins were separated on the same 1 mm inner diameter BioBasic C-8 column (ThermoFisher) using the Accela pump. The gradient was 5 min at 95% A, followed by a linear gradient of 45 min to 65% B. For the MS/MS analysis, CID was carried out in the LTQ with helium target gas at the 35% pressure setting. Masses of both precursor and product ions were determined in the Orbitrap used at 30,000 resolving power. Each cycle of high resolution precursor and product ion scans took approximately 600 ms. Four MS/MS scans were obtained for every MS scan. Dynamic exclusion was implemented with a 10 s exclusion period during which precursor ions were not resampled. MS/MS was only performed on species with known charge states, and no MS/MS was performed on +1 and +2 charge state species.
Protein Identification. ProSightPC 2.016 was used to decharge precursor and product ions and to search the MS/MS spectra against custom protein sequence databases. Charge deconvolution was carried out by the Thrash program17 embedded in ProSightPC 2.0.
Experimental measurements were compared to the average molecular weights of theoretical precursors and the monoisotopic molecular weights of theoretical fragments. The precursor ion mass tolerance was set to 150 Da to allow for posttranslational modifications, such as N-terminal methionine cleavage, and a limited number of amino acid substitutions. The fragment ion mass tolerance was set to 15 ppm. For the analysis of the B. anthracis Sterne spectra, a custom sequence database was constructed, containing all proteins from B. anthracis Sterne, Bacillus thuringiensis konkukian, Bacillus cereus strain AH167, and Bacillus subtilis 168 available in the Swiss-Prot database (Version 57.2, 5/5/09). For the analysis of the Y. rohdei spectra, a custom sequence database was constructed, containing all available Swiss-Prot protein sequences from Yersinia species, Salmonella typhimurium, Escherichia coli, Shigella sonnei, Klebsiella pneumoniae, Enterobacter sp. 638, and Enterobacter agglomerans (Erwinia herbicola). Identified proteins were checked for membership in highly homologous protein families by collecting and aligning cross-species orthologues, using BlastP18 and ClustalW.19
(15) Pineda, F. J.; Antoine, M. D.; Demirev, P. A.; Feldman, A. B.; Jackman, J.; Longenecker, M.; Lin, J. S. Anal. Chem. 2003, 75, 3817–3822.
(16) Boyne, M. T.; Garcia, B. A.; Li, M. X.; Zamdborg, L.; Wenger, C. D.; Babai, S.; Kelleher, N. L. J. Proteome Res. 2009, 8, 374–379.
(17) Horn, D. M.; Zubarev, R. A.; McLafferty, F. W. J. Am. Soc. Mass Spectrom. 2000, 11, 320–332.
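The mass arithmetic behind the search settings described under Protein Identification can be illustrated with a minimal Python sketch. It is not the ProSightPC or Thrash implementation; the function names are hypothetical, and it assumes simple protonated ions [M + zH]z+. The m/z 807.80 (9+) precursor is the one reported for ribosomal protein L29 in this paper, while the fragment values in the usage lines are invented for illustration.

```python
PROTON_MASS = 1.00728      # Da, mass of a proton
MET_RESIDUE_AVG = 131.19   # Da, average mass of a methionine residue


def neutral_mass(mz: float, charge: int) -> float:
    """Neutral mass M of an [M + zH]z+ ion observed at the given m/z."""
    return charge * (mz - PROTON_MASS)


def within_da(observed: float, theoretical: float, tol_da: float = 150.0) -> bool:
    """Precursor check: absolute mass difference within a tolerance in Da."""
    return abs(observed - theoretical) <= tol_da


def ppm_error(observed: float, theoretical: float) -> float:
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def within_ppm(observed: float, theoretical: float, tol_ppm: float = 15.0) -> bool:
    """Fragment check: relative mass error within a tolerance in ppm."""
    return abs(ppm_error(observed, theoretical)) <= tol_ppm


# Precursor at m/z 807.80 carrying 9 charges (values from this paper).
precursor = neutral_mass(807.80, 9)
print(round(precursor, 2))  # about 7261.13

# A database entry that still carries its N-terminal methionine is about
# 131 Da heavier, which a 150 Da precursor window still accepts.
print(within_da(precursor, 7261.41 + MET_RESIDUE_AVG))

# Hypothetical fragment masses: a 10 ppm error passes, a 20 ppm error fails.
print(within_ppm(1000.01, 1000.0), within_ppm(1000.02, 1000.0))
```

The wide precursor window and the narrow fragment window play different roles: the first admits sequence variants and processed forms as candidates, and the second keeps individual fragment assignments specific.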
Phylogenetic Analysis. The Rapid Microorganism Identification Database (RMIDb)20 at http://rmidb.org was used to construct a set of all Swiss-Prot, TrEMBL, RefSeq, Genbank, JCVI’s CMR,21 and aggressive Glimmer322 predicted protein sequences from the Enterobacteriaceae family with molecular weights between 4 kDa and 16 kDa, grouped by PFam23 protein family assignment. Protein sequences corresponding to the protein families of the 10 identified Y. rohdei proteins were extracted, and Enterobacteriaceae species with extracted protein sequences in all 10 families identified. For each of these 27 species and each protein family, the protein sequence matching the identified Y. rohdei sequence best was selected using BlastP, and the selected sequences concatenated in a predetermined order for phylogenetic analysis, using the web-server phylogeny.fr.24 Similarly, identified Y. rohdei protein sequences were concatenated in the same predetermined order and added to the phylogeny analysis. The resulting 28 meta-sequences ranged from 759 to 770 amino-acids in length. For the phylogenetic analysis using the traditional 16S-rRNA sequences, the respective sequences were downloaded from the Ribosomal Database Project25 for as many of these 28 species as possible. In all, 21 species’ sequences (including Y. rohdei) were assembled for phylogenetic analysis using the web-server phylogeny.fr, ranging in length from 1449 to 1540 nucleotides. The “one-click” analysis mode of phylogeny.fr was applied to each set of sequences, using MUSCLE26 for multiple-sequence alignment; Gblocks27 for conserved sequence block selection; PhyML and aLRT28,29 for phylogenetic analysis; and TreeDyn30 for tree rendering. Tree branches with branch support value less than 10% were collapsed. Shared Ribosomal Protein Analysis. 
The RMIDb was used to construct a set of all available RefSeq Genome protein sequences associated with the Gene Ontology (GO) cellular component term “ribosome” with molecular weight between 4 kDa and 16 kDa, grouped by organisms with at least 10 ribosomal proteins. An organism’s protein sequence is determined to be shared, cross-species, if the same sequence is found in any organism representing a different species. Species-species shared ribosomal protein counts, for graph-based analysis and visualization using Cytoscape,31 were determined as the minimum organism-organism shared ribosomal protein count for organisms from the respective species.
(18) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. J. Mol. Biol. 1990, 215, 403–410.
(19) Thompson, J. D.; Higgins, D. G.; Gibson, T. J. Nucleic Acids Res. 1994, 22, 4673–4680.
(20) Edwards, N. J.; Pineda, F. Proceedings of the 54th Conference of the Annual American Society of Mass Spectrometry; Seattle, WA, May 28-June 1, 2006.
(21) Peterson, J. D.; Umayam, L. A.; Dickinson, T. M.; Hickey, E. K.; White, O. Nucleic Acids Res. 2001, 29, 123–125.
(22) Delcher, A. L.; Bratke, K. A.; Powers, E. C.; Salzberg, S. L. Bioinformatics 2007, 23, 673–679.
(23) Finn, R. D.; Tate, J.; Mistry, J.; Coggill, P. C.; Sammut, J. S.; Hotz, H. R.; Ceric, G.; Forslund, K.; Eddy, S. R.; Sonnhammer, E. L.; Bateman, A. Nucleic Acids Res. 2008, 36, D281–D288.
(24) Dereeper, A.; Guignon, V.; Blanc, G.; Audic, S.; Buffet, S.; Chevenet, F.; Dufayard, J.-F.; Guindon, S.; Lefort, V.; Lescot, M.; Claverie, J.-M.; Gascuel, O. Nucleic Acids Res. 2008, 36, W465–W469.
(25) Cole, J. R.; Wang, Q.; Cardenas, E.; Fish, J.; Chai, B.; Farris, R. J.; Kulam-Syed-Mohideen, A. S.; McGarrell, D. M.; Marsh, T.; Garrity, G. M.; Tiedje, J. M. Nucleic Acids Res. 2009, 37, D141–D145.
(26) Edgar, R. C. Nucleic Acids Res. 2004, 32, 1792–1797.
(27) Castresana, J. Mol. Biol. Evol. 2000, 17, 540–552.
(28) Anisimova, M.; Gascuel, O. Syst. Biol. 2006, 55, 539–552.
(29) Guindon, S.; Gascuel, O. Syst. Biol. 2003, 52, 696–704.
(30) Chevenet, F.; Brun, C.; Banuls, A. L.; Jacq, B.; Christen, R. BMC Bioinformatics 2006, 7, 439.
Statistical Identification of Related Species. The public RMIDb model “Bacterial Ribosomal Proteins (all sources)” comprising all bacterial Swiss-Prot, TrEMBL, RefSeq, Genbank, JCVI’s CMR, and aggressive Glimmer3 predicted protein sequences with molecular weight between 4 kDa and 15 kDa, associated with the GO term “ribosome” and grouped by species with at least 10 such proteins, was used to perform a statistical identification of related species based on the observed Y. rohdei signature masses, using the “Identify” search feature of the RMIDb.
RESULTS AND DISCUSSION
Feasibility Study. The study was designed to test the strategy that top-down fragmentation spectra and their proteins might be identified by comparison with protein sequences in related bacteria. Tandem spectra and accurate molecular masses obtained from five proteins from a mixed culture (vegetative cells and spores) of B. anthracis Sterne were searched against the Bacillus custom database (see Experimental Section). B. anthracis protein sequences, although available, were deliberately excluded from the custom database. Five top-down analyses were carried out, resulting in protein identifications for four protein sequences with E-values ranging from 10⁻²⁵ to 10⁻¹⁰. In these four cases, the tandem mass spectra of B. anthracis Sterne proteins were matched to homologous proteins, including two different small, acid-soluble spore proteins, a cold-shock protein, and a DNA binding protein, in the closely related species B. cereus and B. thuringiensis. Further examination confirmed that these four proteins are in fact identical across the three species of the Cereus group. The tandem mass spectra of the fifth protein did not provide a good match to proteins from either B. cereus or B.
thuringiensis, and the precursor was identified as a β-small acid soluble spore protein (NCBI accession number YP_030791) by searching protein entries for B. anthracis. This protein has previously been identified as distinguishing B. anthracis from other members of the Cereus group.32,33 None of the MS/MS spectra matched those predicted for any protein from B. subtilis. A table summarizing this search is shown as Supporting Information, and the five MS/MS spectra, the decharged spectra, and the protein assignments are shown in Supplemental Figures 1-5. This preliminary study indicated that an unknown microorganism might be characterized as related to a group or family of the organisms that contain matching, and therefore highly homologous, protein sequences. However, it is clear from the absence of matches with B. subtilis proteins that protein sequences from very closely related species must be available for this approach to be viable. We examine this possibility in a more general case, in the following study of Y. rohdei.
(31) Shannon, P.; Markiel, A.; Ozier, O.; Baliga, N. S.; Wang, J. T.; Ramage, D.; Amin, N.; Schwikowski, B.; Ideker, T. Genome Res. 2003, 13, 2498–2504.
(32) Demirev, P. A.; Feldman, A. B.; Lin, J. S. Johns Hopkins APL Tech. Dig. 2004, 25, 27.
(33) Fenselau, C.; Russell, S.; Swatkoski, S.; Edwards, N. Eur. J. Mass Spectrom. 2007, 13, 35–39.
Analysis of Y. rohdei. The analysis of Y. rohdei proteins was then undertaken without access or reference to an assembled,
annotated Y. rohdei genome. Accurate precursor and product ion mass measurements performed from 26 to 39 min of the LC separation run were decharged and searched by ProSightPC 2.0 against the custom database described in the Experimental Section. Table 1 summarizes the protein sequence matches that provided highly statistically significant identifications for 10 proteins from Y. rohdei. Half of these are identified on the basis of matches to 100% homologous proteins in more than one of the species represented in the search database. BlastP and ClustalW similarity searches were used to confirm the cross-species homology of these proteins. Table 1 also lists the charge states and deconvoluted molecular masses of the precursor ions, and summarizes the number of backbone cleavages that contributed to the match. The CID spectrum of one protein biomarker is presented in Figure 2, along with the corresponding spectrum after decharging, and the sequence reported by the search program as the best match. This particular spectrum of decharged fragments from +9 precursors contains an 8 amino acid sequence tag. The spectrum is matched with high confidence (E-value 10⁻²⁴) to the ribosomal protein L29 in the related species Y. enterocolitica. Tandem mass spectra for the other nine proteins, as well as their mapped protein sequences, are provided as Supporting Information, Figures 6-14. Molecular mass matches indicate that 6 of the 10 proteins identified in the LC-MS/MS experiment (Table 1) coincide with biomarkers in the signature MALDI spectrum (Figure 1). As previously observed,15,32,34-36 many of the intense biomarker ions observed in the MALDI spectra of intact vegetative microorganisms are abundant and highly basic ribosomal proteins. Figure 3 shows which of the 10 identified Y. rohdei proteins are shared with specific Enterobacteriaceae species used in the subsequent phylogenetic analysis. Eighty percent of the Y. rohdei protein biomarkers identified here are shared with Y.
enterocolitica, which strongly indicates the approximate phylogenetic placement of Y. rohdei within the Yersinia genus. Furthermore, 4 other Yersinia species share 7 of the 10 protein biomarkers, and another 2, including Y. pestis, share 5. The extent of 100% homologous protein matching between these Yersinia species clearly places Y. rohdei in the Yersinia genus, and suggests that it is more closely related to Y. enterocolitica, and a number of other Yersinia species, than to Y. pestis, consistent with the results of DNA based studies of Enterobacteriaceae and, more specifically, Yersinia species.13,37-39 Teramoto et al.36 use a 0/1 incidence matrix of mass matches, similar to Figure 3, to derive a distance metric between Pseudomonas putida strains, and thereby conduct a phylogenetic analysis of the purported evolutionary relationships between the strains.
(34) Ryzhov, V.; Fenselau, C. Anal. Chem. 2001, 73, 746–750.
(35) Suh, M. J.; Hamburg, D. M.; Gregory, S. T.; Dahlberg, A. E.; Limbach, P. A. Proteomics 2005, 5, 4818–4831.
(36) Teramoto, K.; Sato, H.; Sun, L.; Torimura, M.; Tao, H.; Yoshikawa, H.; Hotta, Y.; Hosoda, A.; Tamura, H. Anal. Chem. 2007, 79, 8712–8719.
(37) Demarta, A.; De Respinis, S.; Dolina, M.; Peduzzi, R. FEMS Microbiol. Lett. 2004, 238, 423–428.
(38) Paradis, S.; Boissinot, M.; Paquette, N.; Belanger, S. D.; Martel, E. A.; Boudreau, D. K.; Picard, F. J.; Ouellette, M.; Roy, P. H.; Bergeron, M. G. Int. J. Syst. Evol. Microbiol. 2005, 55, 2013–2025.
(39) Fagerquist, C. K.; Garbus, B. R.; Williams, K. E.; Bates, A. H.; Boyle, S.; Harden, L. A. Appl. Environ. Microbiol. 2009, 75, 4341–4353.
However, with the explicit characterization of full-length Y. rohdei protein sequences by top-down tandem mass spectrometry, we can use techniques that consider not only whether or not at least
one amino-acid mutation has occurred, but also consider the number of mutations, their precise position in the protein sequence, and the residues involved. In short, with full-length protein sequences, we can carry out a formal phylogenetic analysis of identified proteins from Y. rohdei in the context of the corresponding orthologous proteins in 27 other Enterobacteriaceae species. Figure 4 shows the two phylogenetic trees computed, as described in the Experimental Methods section, from the protein sequences of the identified Y. rohdei proteins and the traditional 16S-rRNA sequences. Visual comparison of these trees with each other and with published Yersinia and Enterobacteriaceae phylogenetic trees13,37-39 shows excellent phylogenetic concordance. Cladograms with branch support values, for the proteomics and 16S-rRNA phylogenies, are also provided as Supporting Information, Figures 15 and 16. Significant Y. rohdei features consistent in all of these phylogenies include the distinct Yersinia clade, clearly separated from the clade containing the Enterobacter, Escherichia, Salmonella, and Shigella genera; the tight evolutionary relationship between Y. pestis and Y. pseudotuberculosis, and between Y. rohdei and Y. enterocolitica; and the relative placement of the remaining Yersinia species.
Table 1. Proteins Identified from Y. rohdei When MS/MS Spectra Were Searched against a Custom Database Containing Protein Sequences from All Yersinia in the Swiss-Prot Database and Other Enterobacteriaceae
m/z       charge  matching fragments  observed fragments  observed mass  theoretical mass  protein description
643.22    14      7                   41                  8991.92        8992.34           50S ribosomal protein L27
682.76    13      14                  65                  8862.89        8863.32           50S ribosomal protein L28
756.70    8       27                  57                  6044.11        6044.82           50S ribosomal protein L32
763.10    11      8                   32                  8368.61        8368.77           30S ribosomal protein S21
781.10    8       13                  27                  6239.55        6240.4            50S ribosomal protein L33
802.90    8       7 22                93                  6413.56        6414.6            50S ribosomal protein L30
807.80    9       24                  92                  7260.92        7261.41           50S ribosomal protein L29
857.72    8       9                   31                  6852.67        6852.95           Carbon storage regulator
1105.83   11      8                   31                  12155.73       12156.2           50S ribosomal protein L22
1155.74   13      8                   31                  15007.33       15007.8           30S ribosomal protein S6
protein description         selected organisms            accession number  E value
50S ribosomal protein L27   Y. pseudotuberculosis         A7FMT7            6.06 × 10⁻⁵
                            S. sonnei                     Q3YX56            6.06 × 10⁻⁵
                            Y. pestis (strain Antiqua)    Q1CBZ2            6.06 × 10⁻⁵
                            Y. pestis (strain Nepal516)   Q1CEJ8            6.06 × 10⁻⁵
                            Y. pestis                     Q8ZBA7            6.06 × 10⁻⁵
50S ribosomal protein L28   Y. enterocolitica             A1JHR2            1.64 × 10⁻¹²
50S ribosomal protein L32   Y. enterocolitica             A1JN60            2.49 × 10⁻³⁶
                            Y. pseudotuberculosis         A7FH23            2.49 × 10⁻³⁶
                            Y. pestis                     Q8ZFT9            2.49 × 10⁻³⁶
                            Y. pestis (strain Antiqua)    Q1C6M6            2.49 × 10⁻³⁶
30S ribosomal protein S21   S. typhimurium                P68684            2.82 × 10⁻⁵
                            Y. pestis (strain Antiqua)    Q1C365            2.82 × 10⁻⁵
                            Y. pseudotuberculosis         A7FE70            2.82 × 10⁻⁵
                            Enterobacter strain 638       A4WEK0            2.82 × 10⁻⁵
                            S. sonnei                     Q3YXH8            2.82 × 10⁻⁵
                            Y. enterocolitica             A1JQX1            2.82 × 10⁻⁵
                            K. pneumoniae                 A6TE47            2.82 × 10⁻⁵
                            Y. pestis                     P68686            2.82 × 10⁻⁵
50S ribosomal protein L33   Enterobacter strain 638       A4W513            8.15 × 10⁻¹⁶
                            S. typhimurium                P0A7P2            8.15 × 10⁻¹⁶
                            S. sonnei                     Q3YVZ8            8.15 × 10⁻¹⁶
                            K. pneumoniae                 A6TFM7            3.18 × 10⁻⁵
50S ribosomal protein L30   Y. pestis (strain Antiqua)    Q1C2W5            6.00 × 10⁻²²
                            Y. enterocolitica             A1JS10            6.00 × 10⁻²²
                            Y. pestis                     Q7CFT2            6.00 × 10⁻²²
                            Y. pseudotuberculosis         A7FNL6            6.00 × 10⁻²²
50S ribosomal protein L29   Y. enterocolitica             A1JS26            6.79 × 10⁻²⁴
Carbon storage regulator    Y. pestis (strain Nepal516)   Q1CL18            5.84 × 10⁻⁹
                            Y. pseudotuberculosis         A7FLR6            5.84 × 10⁻⁹
                            Y. pestis                     P63876            5.84 × 10⁻⁹
                            Y. enterocolitica             A1JK11            5.84 × 10⁻⁹
50S ribosomal protein L22   Y. enterocolitica             A1JS31            1.78 × 10⁻⁶
30S ribosomal protein S6    Y. pseudotuberculosis         A7FMW5            5.65 × 10⁻⁸
                            Y. pestis (strain Antiqua)    Q1CBW4            5.65 × 10⁻⁸
                            Y. enterocolitica             A1JIS8            5.65 × 10⁻⁸
                            Y. pestis (strain Nepal516)   Q1CEH0            5.65 × 10⁻⁸
                            Y. pestis                     Q8ZB81            5.65 × 10⁻⁸
The Y. rohdei whole genome shotgun (WGS) contigs submitted to Genbank in June 2009 contain DNA sequences supporting four of the 10 proteins identified by top-down tandem mass spectrometry, but only one of these is annotated with a translation start-site consistent with the experimental evidence. The correctly matching sequence is labeled lactoylglutathione lyase, in contrast to the 30S ribosomal protein S6 label for the same sequence in Y. pestis and Y. enterocolitica. The correctly labeled 50S ribosomal protein L32 is annotated with an incorrect translation start site, resulting in a 1583 Da error in molecular weight determination. While we believe that a fully assembled genome will result in improved protein annotations, it is clear that not only are the recently submitted WGS contigs not sufficient for our purpose, but also that mass-spectrometry based protein identifications may be able to correct and improve bacterial genome annotation, in general.
Implications for Identifying Unknown Organisms. Implications for identifying unknown organisms by MALDI MS are derived from our results. Retrospectively examining those steps in our approach that require fore-knowledge of the sample, a number of issues must be resolved. First, we must assess the extent to which protein sequences from unknown organisms without a genome sequence are likely to be shared with known, sequenced organisms. Second, we must find a way to construct a suitable protein sequence database for ProSightPC 2.0 to search. Lastly, we must consider the difficult task of correctly determining whether a sample contains an unknown unsequenced organism, as opposed to single known, or mixtures of known organisms. We consider each of these issues in turn.
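The contrast drawn earlier between a 0/1 incidence entry and a comparison of full-length sequences can be made concrete with a small Python sketch. The sequences and species labels below are invented toy data, not proteins from this study, and the function names are hypothetical; the sketch also assumes the orthologues are already aligned and of equal length, whereas the analysis in this paper relies on multiple-sequence alignment and tree building.

```python
from typing import Dict, List, Tuple


def incidence(query: str, ortholog: str) -> int:
    """0/1 entry: 1 if the two sequences are identical, else 0."""
    return int(query == ortholog)


def substitutions(query: str, ortholog: str) -> List[Tuple[int, str, str]]:
    """List (1-based position, query residue, ortholog residue) for each mismatch.

    Assumes pre-aligned sequences of equal length (no insertions or deletions).
    """
    if len(query) != len(ortholog):
        raise ValueError("sequences must be aligned to the same length")
    return [(i + 1, a, b) for i, (a, b) in enumerate(zip(query, ortholog)) if a != b]


# Hypothetical toy sequences for one protein family.
query = "MKTAYIAKQR"
orthologs: Dict[str, str] = {
    "species_A": "MKTAYIAKQR",   # identical to the query
    "species_B": "MKTAYLAKQR",   # one substitution
    "species_C": "MRTAYLAKQK",   # three substitutions
}

for species, seq in orthologs.items():
    print(species, incidence(query, seq), substitutions(query, seq))
```

In this toy case the incidence entry is 0 for both species_B and species_C, so the matrix cannot tell them apart, while the substitution lists show that species_B is the closer of the two.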
Figure 2. Top: MS/MS spectrum of the precursor ion at m/z 807.80 (9+ charge state, intact mass 7260.92 Da, resolution 30,000). Middle: The same MS/MS spectrum with all fragment ions converted to zero charge state. Bottom: Protein sequence (50S ribosomal protein L29, Swiss-Prot A1JS26) assigned by ProSightPC 2.0 showing observed fragmentation sites.
We use the ribosomal proteins from existing sequenced genomes extracted from the RMIDb to assess the extent of cross-species sharing of potential protein biomarkers. Ribosomal proteins from 675 organisms (474 species) with sequenced genomes and protein annotations in the RefSeq Genome database, and with molecular weights between 4 kDa and 16 kDa, were extracted from the RMIDb. The extracted sequences represent, on average, about 35 potential biomarkers per sequenced organism. A biomarker sequence in each organism is cross-species shared if the same sequence is found in protein annotations for more than one species. Figure 5 shows the percentage of each organism’s potential ribosomal biomarkers found in another species. The dashed line of Figure 5 shows the number of sequenced organisms (127 out of 675) with at least 50% of their putative biomarker proteins shared with some other species. Two hundred and one sequenced organisms share at least 25% of their potential biomarkers, 99 sequenced organisms share at least 75% of their potential markers, and 11 sequenced organisms share all of their potential ribosomal biomarkers with another species. These 11 organisms include, notably, a variety of strains of Y. pestis and Yersinia pseudotuberculosis, and B. cereus and B. thuringiensis strains. Two hundred and thirty-eight sequenced organisms share
none of their potential ribosomal biomarker sequences with another species. Clearly, the space of all organisms is not uniformly sampled by those already sequenced, and these results suggest that not all unsequenced organisms may feasibly be characterized using this shared biomarker approach. To further explore this phenomenon, we constructed a network visualization of the cross-species shared potential biomarker space using the protein sequences extracted from the RMIDb in Cytoscape. Between two nodes, representing different species, we draw an edge if two organisms from these species share at least one potential biomarker. Edge thickness represents the minimum number of shared protein sequences of any two organisms representing the correct pair of species. The resulting graph has 273 nodes and 691 edges, with edge thicknesses from 1 to 10 pixels, representing shared biomarker counts from 1 to 38. The network contains a number of quite large components, with the largest component containing 26 species, and 135 species in connected components of at least 5 nodes. The largest component contains Y. pestis, E. coli, H. influenzae, and a variety of other species with disease and biodefense implications. This is consistent with the earlier phylogenetic analysis, where the 30S ribosomal protein S21 sequence was found in all modeled Enterobac-
Figure 3. Incidence matrix for observed Y. rohdei proteins in Enterobacteriaceae species.
Figure 4. Phylogenetic tree for Y. rohdei based on top-down protein identifications (A) and 16S-rRNA sequences (B).
teriaceae species. Supporting Information, Figure 17 shows this network, with NCBI taxonomy ID numbers as node labels. Figure 6 shows a subgraph of the network, after components with fewer than five nodes are removed, and nodes labeled with species names. Figure 7 shows the largest component, containing Y. pestis and other important species, drawn in a circular layout. Clearly, many species share potential ribosomal biomarkers, and for some organisms, a significant number of potential biomarkers are shared cross-species.
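As an illustration only, the sharing statistic and the species network described above can be sketched in a few lines of Python. The sketch is not the RMIDb or Cytoscape workflow used in this study: the record layout, the function names (shared_fraction, species_network, components), and the toy organism, species, and sequence labels are all hypothetical. Edge weights follow one reading of the text (the minimum shared-sequence count over organism pairs that share at least one sequence), and species without any edge are simply left out of the graph.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (organism, species, ribosomal protein sequence).
# Real input would be sequences in the 4-16 kDa range exported from a database.
RECORDS = [
    ("org_A1", "species_A", "SEQ1"), ("org_A1", "species_A", "SEQ2"),
    ("org_A2", "species_A", "SEQ1"), ("org_A2", "species_A", "SEQ3"),
    ("org_B1", "species_B", "SEQ1"), ("org_B1", "species_B", "SEQ4"),
    ("org_C1", "species_C", "SEQ5"), ("org_C1", "species_C", "SEQ6"),
    ("org_D1", "species_D", "SEQ4"), ("org_D1", "species_D", "SEQ7"),
]


def index_records(records):
    """Return (sequences per organism, species of each organism)."""
    seqs_by_org = defaultdict(set)
    species_of = {}
    for organism, species, seq in records:
        seqs_by_org[organism].add(seq)
        species_of[organism] = species
    return seqs_by_org, species_of


def shared_fraction(records):
    """Fraction of each organism's sequences annotated in more than one species."""
    species_by_seq = defaultdict(set)
    for _, species, seq in records:
        species_by_seq[seq].add(species)
    seqs_by_org, _ = index_records(records)
    return {
        org: sum(len(species_by_seq[s]) > 1 for s in seqs) / len(seqs)
        for org, seqs in seqs_by_org.items()
    }


def species_network(records):
    """Map each species pair to its edge weight (assumed: minimum nonzero count)."""
    seqs_by_org, species_of = index_records(records)
    edges = {}
    for a, b in combinations(sorted(seqs_by_org), 2):
        if species_of[a] == species_of[b]:
            continue  # only cross-species sharing creates an edge
        n_shared = len(seqs_by_org[a] & seqs_by_org[b])
        if n_shared:
            pair = tuple(sorted((species_of[a], species_of[b])))
            edges[pair] = min(edges.get(pair, n_shared), n_shared)
    return edges


def components(edges):
    """Connected components of the species graph, largest first."""
    adjacent = defaultdict(set)
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    seen, found = set(), []
    for start in adjacent:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adjacent[node] - comp)
        seen |= comp
        found.append(comp)
    return sorted(found, key=len, reverse=True)


if __name__ == "__main__":
    print(shared_fraction(RECORDS))
    print(species_network(RECORDS))
    print(components(species_network(RECORDS)))
```

With the toy records, org_B1 shares all of its sequences with another species, org_C1 shares none, and species A, B, and D form a single connected component. Filtering the returned components by size would mimic the removal of components with fewer than five nodes.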
For a given unknown sample we cannot know in advance whether it shares a significant number of protein sequences with organisms with sequenced genomes, and we have no way to select a small set of bacteria around which to build a protein sequence database to search. Large protein sequence databases, such as all of SwissProt, are impractical for searching with ProSightPC 2.0 on a Windows platform. Therefore, a rational methodology is required to limit the number of species whose proteins must be searched. Many previous studies,15,32,34-36 as well as this one, have observed ribosomal proteins as abundant biomarkers in MALDI spectra of intact vegetative microorganisms. The publicly available Rapid Microorganism Identification Database (www.RMIDb.org)20 applies this insight to statistically identify microorganisms from a MALDI signature such as that in Figure 1. Using the public model “Ribosomal Proteins (all sources)” and the “Identify” tool, the labeled peaks of Figure 1 can be readily matched against a database of intact protein masses from ribosomal proteins in the appropriate mass range. The species identified by the RMIDb, with E-value 2.8 × 10⁻⁵, is Yersinia enterocolitica. This result recognizes that the Y. enterocolitica sequences match too many peaks for the matches to have occurred at random, and thereby effectively identifies either the correct or a closely related organism. This is clearly sufficient information to propose an appropriate selection of bacterial protein sequences to search using ProSightPC 2.0 and top-down protein fragmentation spectra, for much greater protein identification specificity. Furthermore, it indicates that the unknown organism may share a significant number of its protein sequences with other sequenced organisms, providing confidence that this strategy will ultimately be fruitful.
Figure 5. Percentage of sequenced organisms’ ribosomal proteins between 4 kDa and 16 kDa shared among different species.
From the perspective of rapid detection and identification of novel (hitherto unknown or with unsequenced genomes) microorganisms by MALDI MS, additional tools are required to confidently interpret spectra such as that of Figure 1 as originating from a single unknown or known organism (Y. rohdei or Y. enterocolitica) and not a mixture containing Y. enterocolitica and other organisms. Such a task is a subproblem of the general problem of deconvolution of MALDI spectra from mixtures of organisms. For example, identification of individual components at the species level in a mixture of organisms has been achieved with specialized data
extraction and analysis algorithms.40 However, correctly differentiating between closely related organisms, such as Serratia marcescens, E. coli, and Y. enterocolitica, has been a challenge for these algorithms as well. Novel bioinformatics algorithms for deconvolution of MALDI spectra from mixtures may involve Bayesian belief networks41 utilizing prior knowledge of the individual signatures (experimentally obtained and/or generated in silico) of as many different microorganisms as possible. The high-specificity identification of full-length protein sequences by top-down fragmentation, however, affords significantly more power to detect and deconvolute mixtures than protein mass matching or bottom-up peptide identification. Individual protein mass matches have a relatively high false-positive rate, while short peptides are often shared among homologous protein sequences in different species. When observed, top-down fragmentation spectra identify full-length protein sequences with very high specificity and a guarantee that organism-distinguishing amino acid positions on the protein are captured. In particular, closely related organisms that are traditionally very difficult to distinguish can often, as evidenced by the incidence matrix of Figure 3, be distinguished by full-length protein sequences. Furthermore, the incidence matrix makes it possible to reason about the potential for mixtures and known organisms.
Several bottom-up proteomics approaches for identification of proteins of organisms with unsequenced genomes by homology searches against databases with known protein sequences have already been described.42,43 A major difference between these approaches and the top-down approach described here is that the mass of the intact unknown protein is used here as a very important constraint in the database searches. This intact protein mass constraint results in matches only to protein sequences shared with other organisms, but when such top-down spectra are successfully assigned to a protein sequence, the full-length protein sequence is a much more powerful signature for the organism’s identity than a short peptide, as already discussed. The availability of the intact protein mass can provide direct additional information on post-translational modifications undergone by the respective protein, and further facilitate its characterization. The top-down approach bypasses the need for protein digestion, a requirement for bottom-up proteomics. It also decreases the possibility of undesirable protein modifications during sample preparation.
(40) Wahl, K. L.; Wunschel, S. C.; Jarman, K. H.; Valentine, N. B.; Petersen, C. E.; Kingsley, M. T.; Zartolas, K. A.; Saenz, A. J. Anal. Chem. 2002, 74, 6191–6199.
(41) Saksena, A.; Lucarelli, D.; Wang, I. J. Neural Networks 2005, 18, 843–849.
(42) Shevchenko, A.; Sunyaev, S.; Loboda, A.; Shevchenko, A.; Bork, P.; Ens, W.; Standing, K. G. Anal. Chem. 2001, 73, 1917–1926.
(43) Waridel, P.; Frank, A.; Thomas, H.; Surendranath, V.; Sunyaev, S.; Pevzner, P.; Shevchenko, A. Proteomics 2007, 7, 2318–2329.
Figure 6. Network of sequenced bacterial species labeled by species, with edges representing the number of shared potential ribosomal proteins between each pair of species. Edge thickness varies with the number of shared potential biomarkers. Components of fewer than 5 nodes removed.
CONCLUSIONS
This paper reports a rapid and reliable top-down strategy for identification of proteins from bacteria that lack sequence information. The strategy is based on the tandem mass spectrometry of multiply charged intact proteins, excited by collisions with a buffer gas, and subsequent high resolution and high mass accuracy measurement of the resulting sequence-specific fragment ions. This approach is expected to be more rapid than the more rigorous approaches of sequencing the entire genome, or purifying sufficient amounts of these proteins for de novo sequencing. The strategy additionally allows rapid phyloproteomic classification to be performed for newly isolated (unknown or with unsequenced genomes) microorganisms. Such microorganisms can be characterized by the relationships of identified homologous proteins as belonging to families and genera of microorganisms. Admittedly, success in the experiments reported here has been facilitated by the global conservation of small acid soluble proteins and ribosomal proteins in bacteria. The extent to which the strategy can be extended to archaea or mammals remains to be explored.
Figure 7. Largest component of network of sequenced bacterial species labeled by species, with edges representing the number of shared potential ribosomal proteins between each pair of species. Edge thickness varies with the number of shared potential biomarkers. Y. pestis and adjacent edges are highlighted.
ACKNOWLEDGMENT
This research was supported in part by a contract from the U.S. Department of Homeland Security (DHS). C.W. was supported by a Fellowship from the Metro Washington Chapter of Achievement Rewards for College Scientists Foundation. N.E. is supported by CPTI Grant R01 CA126189. We thank Jason Quizon, Miquel Antoine, Nathan Hagan, and Jeff Lin (Applied Physics Laboratory, Laurel, MD) for culturing of Y. rohdei and helping with acquisition of the MALDI signature spectrum of the intact
organism. We also thank Neil Kelleher and Paul Thomas, University of Illinois, Urbana-Champaign, for providing a beta version of ProSightPC 2.0 and for advice on its use; and Yan Wang, University of Maryland, College Park, for assistance in the acquisition of the tandem mass spectra. This work reflects solely the views of the authors and not necessarily those of DHS. Mention of any product or trademark does not imply endorsement by DHS.
SUPPORTING INFORMATION AVAILABLE
Additional information as noted in the text. This material is available free of charge via the Internet at http://pubs.acs.org.
Received for review July 27, 2009. Accepted October 9, 2009. AC9016677