Anal. Chem. 2010, 82, 145–155
Discrimination and Phylogenomic Classification of Bacillus anthracis-cereus-thuringiensis Strains Based on LC-MS/MS Analysis of Whole Cell Protein Digests

Jacek P. Dworzanski,*,† Danielle N. Dickinson,‡ Samir V. Deshpande,§ A. Peter Snyder,| and Brian A. Eckenrode⊥

Science Applications International Corporation, Aberdeen Proving Ground, Maryland 21010-0068, Northrop Grumman Electronic Systems, Baltimore, Maryland 21203, Science and Technology Corporation, Edgewood, Maryland 21040, U.S. Army Edgewood Chemical Biological Center, Aberdeen Proving Ground, Maryland 21010-5424, and FBI Counterterrorism and Forensic Science Research Unit, Quantico, Virginia 22135

Modern taxonomy, diagnostics, and forensics of bacteria benefit from technologies that provide data for genome-based classification and identification of strains; however, full genome sequencing is still costly, lengthy, and labor intensive. Therefore, other methods are needed to estimate genomic relatedness among strains in an economical and timely manner. Although DNA-DNA hybridization and techniques based on genome fingerprinting or sequencing selected genes like 16S rDNA, gyrB, or rpoB are frequently used as phylogenetic markers, analyses of complete genome sequences showed that global measures of genome relatedness, such as the average genome conservation of shared genes, can provide better strain resolution and give phylogenies congruent with relatedness revealed by traditional phylogenetic markers. Bacterial genomes are characterized by a high gene density; therefore, we investigated the integration of mass spectrometry-based proteomic techniques with statistical methods for phylogenomic classification of bacterial strains. For this purpose, we used a set of well characterized Bacillus cereus group strains isolated from poisoned food to describe a method that relies on liquid chromatography-electrospray ionization-tandem mass spectrometry of tryptic peptides derived from whole cell digests.
Peptides were identified and matched to a prototype database (DB) of reference bacteria with fully sequenced genomes to obtain their phylogenetic profiles. These profiles were processed for predicting genomic similarities with DB bacteria estimated by fractions of shared peptides (FSPs). FSPs served as descriptors for each food isolate and were jointly analyzed using hierarchical cluster analysis methods for revealing relatedness among investigated strains. The results showed that phylogenomic classification of tested food isolates was in consonance with results from established genomic methods, thus validating our findings. In conclusion, the proposed approach could be used as an alternative method for predicting relatedness among microbial genomes of B. cereus group members and potentially may circumvent the need for whole genome sequencing for phylogenomic typing of strains.

* To whom correspondence should be addressed. E-mail: [email protected].
† Science Applications International Corporation.
‡ Northrop Grumman Electronic Systems.
§ Science and Technology Corporation.
| U.S. Army Edgewood Chemical Biological Center.
⊥ FBI Counterterrorism and Forensic Science Research Unit.
10.1021/ac9015648 © 2010 American Chemical Society. Published on Web 11/25/2009.
Analytical Chemistry, Vol. 82, No. 1, January 1, 2010

Strains of Bacillus anthracis, Bacillus cereus, and Bacillus thuringiensis (BACT) belong to the B. cereus group of bacteria comprising organisms that may be pathogenic to humans and diverse animals. These microorganisms are widely distributed in the environment and share morphological, biochemical, and genomic similarities; however, they differ in host specificity and pathogenicity. For instance, B. cereus strains are common contaminants of rice and dairy products and may cause emetic- and/or diarrheal-type food poisoning;1 B. anthracis causes the potentially lethal disease anthrax,2 and B. thuringiensis spores are commonly used as a biopesticide.3 The use of B. anthracis spores during terrorist attacks in 2001 and food safety considerations exemplify the need for a reliable, genome-based classification of BACT strains that is central in identifying outbreaks of disease and tracking the spread of pathogens.

The ultimate strain typing technique should be based on the complete DNA sequence, which is considered the reference standard for determining the phylogeny of bacteria.4 However, full genome sequencing is costly and labor intensive, and weeks are required to determine the complete genome sequence5 even with new technologies.6

(1) Granum, P. E.; Lund, T. FEMS Microbiol. Lett. 1997, 157, 223–228.
(2) Mock, M.; Fouet, A. Annu. Rev. Microbiol. 2001, 55, 647–671.
(3) Schnepf, E.; Crickmore, N.; Van Rie, J.; Lereclus, D.; Baum, J.; Feitelson, J.; Zeigler, D. R.; Dean, D. H. Microbiol. Mol. Biol. Rev. 1998, 62, 775–806.
(4) Wayne, L. G.; Brenner, D. J.; Colwell, R. R.; Grimont, P. A. D.; Kandler, O.; Krichevsky, M. I.; Moore, L. H.; Moore, W. E. C.; Murray, R. G. E.; Stackebrandt, E.; Starr, M. P.; Truper, H. G. Int. J. Syst. Bacteriol. 1987, 37, 463–464.
(5) La Scola, B.; Elkarkouri, K.; Li, W.; Wahab, T.; Fournous, G.; Rolain, J.; Biswas, S.; Drancourt, M.; Robert, C.; Audic, S.; Löfdahl, S.; Raoult, D. Genome Res. 2008, 18, 742–750.
(6) Metzker, M. L. Genome Res. 2005, 15, 1767–1776.

Therefore, other genotyping methods are needed to estimate genomic similarities between strains
in an economical and timely manner. Since the 1960s, estimates of overall similarities between DNA strands have been based on DNA-DNA hybridization (DDH). Although the DDH method is lengthy and laborious, it is still considered a gold standard for the delineation of bacterial species,7 because a species is defined as a collection of strains with a DDH value of at least 70%.4 Hence, any alternative technique should be validated with collections of strains for which extensive DNA-DNA similarity data are available.7 Alternative approaches focused on improvements of the DNA hybridization methods by the use of microarrays8-10 and the development of rapid DNA typing techniques based on fingerprinting of genomes and gene clusters, sequencing individual genes, or intergenic regions.11 In addition, discrimination of strains by Fourier transform infrared12 and Raman13 spectroscopies and mass spectrometry (MS) (pyrolysis, laser desorption ionization, electrospray ionization)14 was also investigated by many research groups.

Currently, sequencing of 16S rRNA15 combined with searching a database (DB) with almost a million entries has become the most commonly used method for identifying and classifying bacteria.16 An MS method based on searching a DB with masses of 16S rRNA cleavage products also was developed.17 Unfortunately, sequences of 16S rRNA are highly conserved and, therefore, are limited in their ability to differentiate closely related strains such as Bacillus pumilus18 or the B. cereus group bacteria.19,20 Therefore, genes with less conserved sequences, namely those encoding proteins like the beta subunit of DNA gyrase (gyrB) or RNA polymerase (rpoB), are usually sequenced due to higher strain resolution and congruence with groupings revealed by DDH.21

(7) Stackebrandt, E.; Frederiksen, W.; Garrity, G. M.; Grimont, P. A. D.; Kampfer, P.; Maiden, M. C. J.; Nesme, X.; Rossello-Mora, R.; Swings, J.; Truper, H. G.; Vauterin, L.; Ward, A. C.; Whitman, W. B. Int. J. Syst. Evol. Microbiol. 2002, 52, 1043–1047.
(8) Cho, J.-C.; Tiedje, J. M. Appl. Environ. Microbiol. 2001, 67, 3677–3682.
(9) Dorrell, N.; Hinchliffe, S. J.; Wren, B. W. Curr. Opin. Microbiol. 2005, 8, 620–626.
(10) Zwick, M. E.; Kiley, M. P.; Stewart, A. C.; Mateczun, A.; Read, T. D. PLoS ONE 2008, 3 (7), e2513.
(11) Pershing, D. H.; Tenover, F. C.; Versalovic, J.; Tang, Y. W.; Unger, E. R.; Relman, D. A.; White, T. J., Eds. Molecular Microbiology: Diagnostic Principles and Practice; American Society for Microbiology Press: Washington, DC, 2003.
(12) Samuels, A. C.; Snyder, A. P.; Emge, D. K.; Amant, D. ST.; Minter, J.; Campbell, M.; Tripathi, A. Appl. Spectrosc. 2009, 63, 14–24.
(13) Hutsebaut, D.; Maquelin, K.; De Vos, P.; Vandenabeele, P.; Moens, L.; Puppels, G. J. Anal. Chem. 2004, 76, 6274–6281.
(14) Wilkins, C. L.; Lay, J. O., Eds. Identification of Microorganisms by Mass Spectrometry; Wiley & Sons, Inc.: Hoboken, NJ, 2006.
(15) Woese, C. R. Microbiol. Rev. 1987, 51, 221–271.
(16) Cole, J. R.; Wang, Q.; Cardenas, E.; Fish, J.; Chai, B.; Farris, R. J.; Kulam-Syed-Mohideen, A. S.; McGarrell, D. M.; Marsh, T.; Garrity, G. M.; Tiedje, J. M. Nucleic Acids Res. 2009, 37, D141–D145.
(17) Jackson, G. W.; McNichols, R. J.; Fox, G. E.; Willson, R. C. Int. J. Mass Spectrom. 2007, 261, 218–226.
(18) Dickinson, D. N.; La Duc, M. T.; Satomi, M.; Winefordner, J. D.; Powell, D. H.; Venkateswaran, K. J. Microbiol. Methods 2004, 58, 1–12.
(19) La Duc, M. T.; Satomi, M.; Agata, N.; Venkateswaran, K. J. Microbiol. Methods 2004, 56, 383–394.
(20) Bavykin, S. G.; Lysov, Y. P.; Zakhariev, V. M.; Kelly, J. J.; Jackman, J.; Stahl, D. A.; Cherni, A. J. Clin. Microbiol. 2004, 42, 3711–3730.
(21) Yamamoto, S.; Harayama, S. Int. J. Syst. Bacteriol. 1996, 46, 506–511.

For instance, concatenated sequences of gyrB
have been used in phylogenetic studies of numerous bacteria including the BACT group19 and other Bacillus species.18,22 Different genes may give different patterns of interspecies relationships due to horizontal gene transfer (HGT) or unequal rates of nucleotide substitution. Therefore, sequence analysis of 6-8 housekeeping genes (a multilocus approach) was designed to increase the resolution and to buffer the potential impact of interspecies HGTs on the determined relatedness.23 A multilocus sequence analysis (MLSA) proved to be an accurate method for genetic typing of B. cereus group strains25 and for epidemiology of food-borne diseases.24 The MLSA approach was also implemented for the electrospray ionization-MS analysis of polymerase chain reaction-amplified genomic regions (PCR/ESI-MS)26 and used for the classification/identification and genotyping of microbial strains through the assignment of base composition of amplified regions.

Currently, there is a growing interest in applying MS techniques for rapid classification of bacteria, especially by the use of MS-based proteomics methods.27-37 Efficient protocols were developed for bacterial protein extraction and mass analysis, especially by matrix-assisted laser desorption/ionization (MALDI) time-of-flight MS methods38 targeting mainly ribosomal proteins as phylogenetic characters.39 In addition, whole cell protein extracts and digests of Bacillus spores were frequently used for discrimination of strains based on sequencing small acid-soluble proteins (SASPs).37

The increasing number of complete genome sequences provides a wealth of new data for analysis of genomic similarities.

(22) Dickinson, D. N.; La Duc, M. T.; Haskins, W. E.; Gornushkin, I.; Winefordner, J. D.; Powell, D. H.; Venkateswaran, K. Appl. Environ. Microbiol. 2004, 70, 475–482.
(23) Helgason, E.; Tourasse, N. J.; Meisal, R.; Caugant, D. A.; Kolstø, A.-B. Appl. Environ. Microbiol. 2004, 70, 191–201.
(24) Cardazzo, B.; Negrisolo, E.; Carraro, L.; Alberghini, L.; Patarnello, T.; Giaccone, V. Appl. Environ. Microbiol. 2008, 74, 850–860.
(25) Priest, F. G.; Barker, M.; Baillie, L. W. J.; Holmes, E. C.; Maiden, M. C. J. J. Bacteriol. 2004, 186, 7959–7970.
(26) Ecker, D. J.; Massire, C.; Blyn, L. B.; Hofstadler, S. A.; Hannis, J. C.; Eshoo, M. W.; Hall, T. A.; Sampath, R. Methods Mol. Biol. 2009, 551, 71–87.
(27) Demirev, P. A.; Ho, Y. P.; Ryzhov, V.; Fenselau, C. Anal. Chem. 1999, 71, 2732–2738.
(28) Demirev, P. A.; Lin, J. S.; Pineda, F. J.; Fenselau, C. Anal. Chem. 2001, 73, 4566–4573.
(29) Pineda, F. J.; Antoine, M. D.; Demirev, P. A.; Feldman, A. B.; Jackman, J.; Longenecker, M.; Lin, J. S. Anal. Chem. 2003, 75, 3817–3822.
(30) Pribil, P. A.; Patton, E.; Black, G.; Doroshenko, V.; Fenselau, C. J. Mass Spectrom. 2005, 40, 464–474.
(31) Castanha, E. R.; Fox, A.; Fox, K. F. J. Microbiol. Methods 2006, 67, 230–240.
(32) Castanha, E. R.; Vestal, M.; Hattan, S.; Fox, A.; Fox, K. F.; Dickinson, D. Mol. Cell. Probes 2007, 21, 190–201.
(33) Norbeck, A. D.; Callister, S. J.; Monroe, M. E.; Jaitly, N.; Elias, D. A.; Lipton, M. S.; Smith, R. D. J. Microbiol. Methods 2006, 67, 473–86.
(34) Hu, A.; Lo, A. A.; Chen, C. T.; Lin, K. C.; Ho, Y. P. Electrophoresis 2007, 28, 1387–1392.
(35) Moura, H.; Woolfitt, A. R.; Carvalho, M. G.; Pavlopoulos, A.; Teixeira, L. M.; Satten, G. A.; Barr, J. R. FEMS Immunol. Med. Microbiol. 2008, 53, 333–342.
(36) Dworzanski, J. P.; Snyder, A. P. Expert Rev. Proteomics 2005, 2, 863–878.
(37) Demirev, P. A.; Fenselau, C. J. Mass Spectrom. 2008, 43, 1441–1457.
(38) Freiwald, A.; Sauer, S. Nat. Protoc. 2009, 4, 732–742.
(39) Teramoto, K.; Sato, H.; Sun, L.; Torimura, M.; Tao, H.; Yoshikawa, H.; Hotta, Y.; Hosoda, A.; Tamura, H. Anal. Chem. 2007, 79, 8712–8719.

Most frequently, these similarities are calculated by determining the content of orthologous genes shared by two genomes or by
quantifying the DNA conservation of the core genome.40 The latter method was developed by Konstantinidis and Tiedje41,42 and gained a broad interest.43-45 According to this method, the relatedness between two bacterial strains can be determined by comparing sequences of all shared genes or their protein products through the computation of sequence-derived parameters that estimate average nucleotide identities (ANI)41 or average amino acid identities (AAI).42 Phylogenomic relationships revealed by the use of AAI were congruent with findings based on the analysis of concatenated sequences of all shared genes. These relationships showed a strong correlation with results obtained by DDH and the sequencing of established phylogenetic markers such as 16S rRNA gene, gyrB, or rpoB.46

However, computation of these indices is a multistep procedure: first, a two-way BLAST search is used to find conserved genes; second, identities are computed between all orthologs and then averaged to give the ANI or AAI index. Moreover, despite its taxonomic value as a robust and universal measure of strain similarities, the AAI index is not applicable to nonsequenced species. Therefore, the authors noted that in the future “it may also be feasible to devise a new method to indirectly measure AAI, i.e., to circumvent the need for whole-genome sequencing.”42

It seems that AAI could be estimated by the use of proteomics methods because bacterial genomes are characterized by a high gene density47 and predicted protein coding sequences usually make up 88% of a genome.48 Moreover, multidimensional analyses of microbial proteins indicate that high coverage of predicted proteomes may be achieved on a routine basis; therefore, comparative analysis of expressed proteomes could be used for the detection of differences between a query genome from a bacterial isolate and a reference set of fully sequenced genomes.
Previously, we showed that tryptic peptides identified during liquid chromatography (LC)-ESI tandem mass spectrometry (MS/MS) analysis of a whole cell protein digest could be used for successful identification of strains represented in a DB through the comparison of similarities expressed as the fraction of peptides shared between each DB bacterium and a test strain.49 Here, we report on the use of indices representing fractions of shared peptides (FSP) between each test strain and n bacteria in the DB as n-dimensional coordinates of strains, which were jointly analyzed by hierarchical cluster analysis for revealing genomic relatedness among investigated strains. To validate this approach, we selected a group of Bacillus cereus strains that were isolated from poisoned food and classified as diverse H-serotypes based on the polymorphism of flagellar H-antigen.50

(40) Kunin, V.; Ahren, D.; Goldovsky, L.; Janssen, P.; Ouzounis, C. A. Nucleic Acids Res. 2005, 33, 616–621.
(41) Konstantinidis, K. T.; Tiedje, J. M. Proc. Natl. Acad. Sci. U.S.A. 2005, 102, 2567–2572.
(42) Konstantinidis, K. T.; Tiedje, J. M. J. Bacteriol. 2005, 187, 6258–6264.
(43) Eppley, J. M.; Tyson, G. W.; Getz, W. M.; Banfield, F. Genetics 2007, 177, 407–416.
(44) Adekambi, T.; Shinnick, T. M.; Raoult, D.; Drancourt, M. Int. J. Syst. Evol. Microbiol. 2008, 58, 1807–1814.
(45) Nesbo, C. L.; Dlutek, M.; Doolittle, W. F. Genetics 2006, 172, 759–769.
(46) Goris, J.; Konstantinidis, K. T.; Klappenbach, J. A.; Coenye, T.; Vandamme, P.; Tiedje, J. M. Int. J. Syst. Evol. Microbiol. 2007, 57, 81–91.
(47) Koonin, E. V.; Wolf, Y. I. Nucleic Acids Res. 2008, 36, 6688–6719.
(48) Mira, A.; Ochman, H.; Moran, N. A. Trends Genet. 2001, 17, 589–596.
(49) Dworzanski, J. P.; Deshpande, S. V.; Chen, R.; Jabbour, R. E.; Snyder, A. P.; Wick, C. H.; Li, L. J. Proteome Res. 2006, 5, 76–87.
(50) Taylor, A. J.; Gilbert, R. J. J. Med. Microbiol. 1975, 8, 543–550.

Moreover,
these strains were extensively characterized by the use of microbiological and molecular methods including DDH,19 sequence analysis of phylogenetic markers such as 16S rRNA and gyrB,19,20 and the mass spectra of spore biomarkers.32 Results reported here indicate that relatedness among H-serotypes revealed by our proteomics approach is congruent with findings obtained by reference methods, thus validating the use of FSP as a universal estimator that could be applied for predicting the whole genome relatedness on a strain level. Furthermore, this approach integrates both the gene content and sequence similarities, which allows for the exploration of functional information associated with identified proteins that is important for epidemiological and forensic applications. Therefore, this approach may circumvent the need for whole genome sequencing to yield the phylogenomic typing of bacterial strains.

EXPERIMENTAL SECTION

Materials and Reagents. Ammonium bicarbonate (ABC), urea, dithiothreitol (DTT), iodoacetamide (IAM), isopropanol, guanidine hydrochloride, HPLC grade acetonitrile (ACN), trifluoroacetic acid (TFA), and formic acid (FA) were purchased from Sigma (St. Louis, MO). Sequencing grade modified trypsin was from Promega (Madison, WI). A TRIO-Extraction reagent was from MoBiTec (Göttingen, Germany) while distilled water was from a Milli-Q UV plus ultrapure system (Millipore, Billerica, MA). Centrifugation membrane filters with a molecular mass cutoff of 3 kDa (Microcon YM-3, regenerated cellulose) were purchased from Millipore Corp.

Bacterial Strains. A list of bacterial strains used in this study is shown in Table 1. All strains that caused food poisoning were previously characterized, and sequences of their gyrB and 16S rRNA genes were deposited in GenBank.19 For comparative purposes, strains of B. cereus ATCC 14579, B. anthracis Sterne, and B. thuringiensis (Israelensis and Berliner) were also analyzed.
Bacterial strains were cultivated in Tryptic Soy Broth (Difco, St. Louis, MO) containing 1.5% glycine with shaking at 30 °C for 16 h. Next, bacterial cells were harvested by centrifugation, rinsed twice with phosphate buffered saline and water, lyophilized, and stored at -25 °C.

Cell Lysis, Protein Digestion, and Sample Cleanup. The preparation of peptides from microbial cells (10^8 cfu) was performed using well-established protocols that rely on cell lysis followed by denaturation of proteins, reduction of disulfide bonds, and the optional cysteine carboxyamidomethylation. To facilitate protein extraction from bacteria, cells were washed with phosphate-buffered saline and pelleted by centrifugation. One milliliter of the TRIO-Extraction reagent was added to every cell pellet, and after removal of RNA and DNA, proteins were precipitated by isopropanol. The protein pellet was washed with 0.3 M guanidine hydrochloride/95% ethanol, then centrifuged, and washed with 100% ethanol. After centrifugation and removal of ethanol, proteins were denatured with 8 M urea for 1 h, reduced with DTT, alkylated with 20 mM IAM, and digested overnight with trypsin at a ratio of 1:50 (w/w) at 37 °C.

LC-ESI-MS/MS Analysis of Peptides. Peptides were separated on an Atlantis C18 column (300 Å, 5 µm, 1 mm i.d. × 10 cm) by the use of the Alliance type 2690 liquid chromatograph from Waters (Milford, MA).

Table 1. Bacterial Strains Used in this Study

species | serotype/strain | source/country | origin | 16S rDNA (GenBank accession number) | gyrB (GenBank accession number) | number of tryptic peptides retained for analysis
Bacillus cereus | H-01 | vomit/UK | PHLS, UK (a) | AY461742 | AY461763 | 558
Bacillus cereus | H-02 | meat loaf/USA | J.M. Goepfert (b) | AY461743 | AF136388 | 485
Bacillus cereus | H-03 | boiled rice/UK | PHLS, UK (a) | AY461744 | AY461764 | 661
Bacillus cereus | H-04 | fried rice/UK | PHLS, UK (a) | AY461745 | AY461765 | 541
Bacillus cereus | H-05 | fried rice/UK | PHLS, UK (a) | AY461746 | AY461766 | 683
Bacillus cereus | H-06 | barbecued chicken/Canada | E. Todd (c) | AY461747 | AF136389 | 630
Bacillus cereus | H-07 | curry powder/UK | PHLS, UK (a) | AY461748 | AY461767 | 626
Bacillus cereus | H-08 | Indonesian rice dish/Ned. | PHLS, UK (a) | AY461749 | AY461768 | 605
Bacillus cereus | H-09 | vanilla pudding/Ned. | PHLS, UK (a) | AY461750 | AY461769 | 602
Bacillus cereus | H-10 | Indonesian rice dish/Ned. | J.M. Goepfert (b) | AY461751 | AY461770 | 599
Bacillus cereus | H-11 | neonatal brain abscess/UK | PHLS, UK (a) | AY461752 | AY461771 | 494
Bacillus cereus | H-12 | risotto/UK | PHLS, UK (a) | AY461753 | AY461772 | 601
Bacillus cereus | H-14 | fried rice/UK | PHLS, UK (a) | AY461755 | AY461774 | 561
Bacillus cereus | H-15 | uncooked rice/UK | PHLS, UK (a) | AY461756 | AY461775 | 554
Bacillus cereus | H-17 | uncooked rice/UK | PHLS, UK (a) | AY461758 | AY461776 | 625
Bacillus cereus | H-18 | uncooked rice/UK | PHLS, UK (a) | AY461759 | AY461777 | 525
Bacillus anthracis | Sterne | | JPL (d) | X55059 | AF090333 | 723
Bacillus thuringiensis | Israelensis | | JPL (d) | AY461762 | AY461780 | 580
Bacillus thuringiensis | Berliner | | IAM, Japan (e) | D16281 | AF090331 | 528
Bacillus cereus | ATCC 14579 | | ATCC (f) | 1202360 | AF090330 | 523

(a) PHLS, Public Health Laboratory Service, UK. (b) Food Research Institute, The University of Wisconsin, Madison. (c) Health and Welfare, Canada. (d) JPL, Jet Propulsion Laboratory, Pasadena, CA. (e) IAM, Institute of Applied Microbiology, Japan. (f) ATCC, American Type Culture Collection.

The elution was performed using a
linear gradient from 98% A (0.1% FA in water) and 2% B (0.1% FA in ACN) to 60% B over 60 min at a flow rate of 100 µL/min and was continued for an additional 20 min in the isocratic mode. The resolved peptides were electrosprayed into a linear ion trap mass spectrometer (LTQ, Thermo Scientific, San Jose, CA). Product ion mass spectra were obtained in the data dependent acquisition mode that consisted of a survey scan over the m/z range of 400-2000 followed by ten scans on the most intense precursor ions activated for 30 ms by the excitation energy level of 35% (cycle time, 4.4 s). A dynamic exclusion was activated for 3 min after the first MS/MS spectrum acquisition for a given ion. Finally, uninterpreted product ion mass spectra were searched against a microbial database with TurboSEQUEST (Bioworks 3.1, Thermo Scientific). Bacterial Proteome Database. Amino acid sequences of chromosome and plasmid encoded proteins predicted from completely sequenced genomes were downloaded from the National Institutes of Health National Center for Biotechnology Information (ftp://ftp.ncbi.nih.gov/genomes/Bacteria) on January 15, 2007. They were used to create a prototype microbial DB by adding abbreviated strain names in header lines with a script written in PERL. These names were used as specific codes that linked strains to taxonomic information derived from the NCBI taxonomy database (http://www.ncbi.nlm.nih.gov/Taxonomy/). Finally, the database was indexed using a TurboSEQUEST utility program (Thermo Scientific) by assuming trypsin digestion rules with up to two missed cleavages per peptide. In addition, only peptides with Mr in the 700-3500 Da range were accepted and the oxidation of methionine was allowed. Data Processing. 
The details of the mass spectral processing procedure were described previously.49 Briefly, *.dta files were searched against the indexed DB using a TurboSEQUEST algorithm (Thermo Scientific) with a peptide mass tolerance of 2.5 Da and a fragment ion tolerance of 1.00 Da. The other parameters were left as default; however, in order to improve the efficiency of DB searches, duplicate 2+/3+ ion spectra were removed by applying a z-state determination algorithm (ZSA) utility program (Thermo Scientific). The output files were analyzed with PeptideProphet,52 and peptides with probabilities of correct assignments higher than 0.95 were retained (Table 1) and used to generate a matrix of sequence-to-bacterium (STB) assignments49 for each analyzed strain. This process is depicted schematically in Figure 1 for an unknown (U) strain $s_1$ that is characterized by four identified peptides $p_1$ through $p_4$, which are assigned to the closest DB strains $b_1$-$b_5$.

For LC-MS/MS analysis of any unknown strain $s_i$, an STB matrix of assignments is created with entries representing the presence (1) or absence (0) of a given peptide sequence ($p_x$) in each DB proteome ($b_j$). Each column of the matrix represents a peptide profile of a DB bacterium while each row represents a phylogenetic profile of a peptide sequence. Peptide sequences shared between an unknown strain $s_i$ and a DB strain $b_j$ were used to determine strain similarity calculated as a fraction of shared peptides (FSP) between $s_i$ and the j-th strain in the DB in accordance with the following equation:

$$\mathrm{Sim}(i,j) = \frac{|p_i \cap p_j|}{|p_i|} = \mathrm{FSP}_{ji} \tag{1}$$
where $p_i$ denotes the sequence-unique peptides of strain $s_i$ matched to all DB bacteria ($|p_i|$ being their total number), and $p_j$ denotes the peptides matched to the proteome of the j-th DB strain. Because FSPs represent coordinates of an unknown strain $s_i$ in an n-dimensional data space in which each axis represents a DB bacterium, these values were presented as similarity histograms and used to assemble a similarity matrix that was analyzed using hierarchical cluster analysis (HCA) methods (Figure 1). HCA was performed by the use of the STACluster libraries of STATISTICA (release 6, StatSoft, Inc., Tulsa, OK) and PermutMatrix software51 that is available from the authors’ Internet Web site (http://www.lirmm.fr/~caraux/PermutMatrix). PermutMatrix was used only for a two-way clustering of STB matrices and ordering of dendrogram leaves.

Figure 1. Schematic representation of the sample and data processing workflow for the phylogenomic classification of Bacillus cereus group strains. Abbreviations: U, unknown strain; STB, sequence-to-bacterium; BacId, in-house developed software for taxonomic classification of bacteria.49

(51) Caraux, G.; Pinloche, S. Bioinformatics 2004, 21, 1280–1281.
(52) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383–5392.

Whole Proteome Similarities between DB Bacteria. The comparison of complete proteomes was performed by the use of all tryptic peptides with Mr in the 700-3500 Da range (two missed cleavages were allowed) that were predicted from theoretical proteomes of the following Bacillaceae species: (1) Bacillus anthracis ‘Ames Ancestor’, (2) Bacillus anthracis Ames, (3) Bacillus anthracis Sterne, (4) Bacillus cereus ATCC 10987, (5) Bacillus cereus ATCC 14579, (6) Bacillus cereus E33L, (7) Bacillus thuringiensis serovar konkukian 97-27, (8) Bacillus thuringiensis Al Hakam, (9) Bacillus subtilis subsp. subtilis 168, (10) Bacillus licheniformis ATCC 14580, (11) Bacillus halodurans C-125, (12) Bacillus clausii KSM-K16, (13) Geobacillus kaustophilus HTA426, and (14) Oceanobacillus iheyensis HTE831. A set of DBs was formed that included predicted proteomes of single strains and all potential pairs (XY) of the above strains. They were ‘digested’ in silico following trypsin rules, and the results were analyzed to calculate the number of shared tryptic peptides (Figure S-1, Supporting Information). Briefly, the numbers of shared tryptic peptides ($SP_{xy} = |UP_x \cap UP_y|$) were obtained from reports generated by running the indexing utility program (TurboSEQUEST) for each DB and were calculated according to the formula $SP_{xy} = UP_x + UP_y - UP_{xy}$, where $UP_x$ and $UP_y$ represent numbers of unique peptides determined for proteomes X and Y, respectively, and $UP_{xy}$ is the number of unique peptides for a combined proteome of strains X and Y. Note that a similarity can be calculated as the ratio of $SP_{xy}$ to the number of all predicted tryptic peptides derived from either proteome X or proteome Y. Because these similarities are nonreciprocal, a matrix of similarities between every pair of individuals was created using the Dice similarity index:

$$\mathrm{Sim}_D = \frac{SP_{xy}}{0.5\,(UP_x + UP_y)} = \mathrm{FSP}_{xy} \tag{2}$$
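The whole-proteome comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TurboSEQUEST indexing utility used in the study: the function names are hypothetical, cleavage is taken to occur after K or R except before P (one common reading of "trypsin rules"), and a peptide length window stands in for the 700-3500 Da mass window.

```python
def cleave(protein):
    """Split one protein sequence after K or R, except when followed by P (assumed rule)."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        following = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and following != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])
    return fragments


def tryptic_peptides(proteome, missed=2, min_len=6, max_len=30):
    """Unique peptides of a proteome (list of sequences), allowing missed cleavages.

    The length window is a simplification replacing the 700-3500 Da mass filter.
    """
    peptides = set()
    for protein in proteome:
        fragments = cleave(protein)
        for start in range(len(fragments)):
            last = min(start + missed + 1, len(fragments))
            for end in range(start + 1, last + 1):
                peptide = "".join(fragments[start:end])
                if min_len <= len(peptide) <= max_len:
                    peptides.add(peptide)
    return peptides


def dice_fsp(proteome_x, proteome_y):
    """Return (SPxy, Dice similarity) using SPxy = UPx + UPy - UPxy and eq 2."""
    up_x = len(tryptic_peptides(proteome_x))
    up_y = len(tryptic_peptides(proteome_y))
    up_xy = len(tryptic_peptides(proteome_x + proteome_y))  # combined proteome XY
    sp_xy = up_x + up_y - up_xy
    if up_x + up_y == 0:
        return 0, 0.0
    return sp_xy, sp_xy / (0.5 * (up_x + up_y))
```

Counting unique peptides of the combined proteome and applying inclusion-exclusion mirrors the way the shared-peptide numbers are described as being derived from the indexing reports; computing the set intersection directly would give the same value.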
The Dice index (eq 2) takes into account the proportion of common peptides to the averaged number of unique peptides found in both proteomes.

RESULTS AND DISCUSSION

Relationship between the Fraction of Shared Peptides (FSP) and the Genome/Proteome Conservation. The relationship between the FSP and the genome/proteome conservation expressed as AAI42 values can be obtained using an approach based on the Kimura model53 for the estimation of amino acid substitution rate for homologous proteins. Let us substitute homologous proteins in the Kimura model with predicted proteomes of two strains and further assume these proteomes have the same total number of amino acids, denoted as $T_{aa}$. Under these circumstances, we can calculate the number of amino acid sites in which they differ ($d$) using the following equation:

$$d = T_{aa}\left[1 - \exp(-K_{aa})\right] \tag{3}$$
where $K_{aa}$ is the mean number of substitutions per amino acid site over the evolutionary period ($t$) that separated the microorganisms, and $\exp(-K_{aa})$ is the probability, obtained from the Poisson distribution, that no substitution occurred at an amino acid site.

(53) Kimura, M. Proc. Natl. Acad. Sci. U.S.A. 1969, 63, 1181–1188.

Hence, assuming the independence of substitutions, $K_{aa}$ may be estimated from the rearranged eq 3 as

$$K_{aa} = -2.3 \log\left(1 - d/T_{aa}\right) \tag{4}$$
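The rearrangement from eq 3 to eq 4 can be written out explicitly. The steps below assume that "log" in eq 4 denotes the base-10 logarithm, which is our reading of the factor 2.3 (since ln 10 ≈ 2.303) rather than something stated in the text:

```latex
\begin{align}
\frac{d}{T_{aa}} &= 1 - e^{-K_{aa}}
  && \text{(eq 3 divided by } T_{aa}\text{)} \\
e^{-K_{aa}} &= 1 - \frac{d}{T_{aa}} \\
K_{aa} &= -\ln\!\left(1 - \frac{d}{T_{aa}}\right)
        = -\ln(10)\,\log_{10}\!\left(1 - \frac{d}{T_{aa}}\right)
        \approx -2.3\,\log_{10}\!\left(1 - \frac{d}{T_{aa}}\right)
\end{align}
```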
Because $d/T_{aa}$ is the fraction of sites that differs between strains, $1 - d/T_{aa}$ represents the fraction of identical sites that, in the case of amino acids, would be equivalent to the AAI index. Further, the rate of substitution per amino acid per time ($k_{aa}$) may be written as $k_{aa} = K_{aa}/(2t)$, where time is doubled because substitutions could occur independently in both proteomes since the divergence of organisms from a common ancestor. By substituting these parameters and rearranging eq 4, the time since the divergence of bacterial strains is $t = -2.3 \log(\mathrm{AAI})/(2k_{aa})$.

The same treatment could be applied to two microbial proteomes by substituting amino acid sites in each proteome with $T_p$ peptide sites of length $L$, where $L$ represents the number of amino acid residues. Under these circumstances, the fraction of identical sites would be equivalent to the FSP index. Therefore, the time since the divergence could be described by the equation $t = -2.3 \log(\mathrm{FSP})/(2k_p)$, where $k_p$ refers to the constant for the peptide substitution rate. For any given pair of microorganisms, the same time since divergence may be expressed by the use of different similarity measures; therefore, by combining and rearranging these equations, the following relationship is obtained (eq 5):

$$\frac{\log(\mathrm{AAI})}{k_{aa}} = \frac{\log(\mathrm{FSP})}{k_p} \tag{5}$$
Assuming that the probability of each amino acid being conserved is equal to AAI, the relationship between the constants for substitution rates is $k_p = L k_{aa}$. Substituting this into eq 5 gives $\log(\mathrm{FSP}) = L \log(\mathrm{AAI})$. Hence, the relationship between FSP values and genome/proteome conservation can be estimated from the exponential form of eq 5 as

$$\mathrm{FSP} = (\mathrm{AAI})^{L} \tag{6}$$
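As a quick numerical check of eq 6, the short sketch below evaluates the predicted FSP for a few peptide lengths at a fixed AAI. The function names are illustrative only, and the inverse relation is simply eq 6 solved for AAI under the same model assumptions (equal conservation probability at every site, a single peptide length).

```python
def fsp_from_aai(aai, length):
    """Predicted fraction of shared peptides of a given length (eq 6)."""
    return aai ** length


def aai_from_fsp(fsp, length):
    """Eq 6 solved for AAI; valid only under the same model assumptions."""
    return fsp ** (1.0 / length)


if __name__ == "__main__":
    # Tabulate the predicted FSP at AAI = 0.94 for several peptide lengths.
    for length in (1, 8, 15, 30):
        print(length, round(fsp_from_aai(0.94, length), 2))
```

In practice identified peptides span a range of lengths, so a single value of L is at best an effective average when eq 6 is inverted.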
In accordance with this model, the fraction of peptides shared between two microbial proteomes is always lower than the AAI index value and depends on the peptide length. These interrelationships are depicted in Figure 2A for peptides with 8, 15, and 30 amino acids that represent a typical range of peptide lengths identified in our experiments. For comparison, the L dependent frequency distribution of 24 856 tryptic peptides accepted for the classification of investigated H-serotypes is shown in Figure 2B and indicates that 84% of accepted assignments consist of peptides with 8-22 amino acid residues. In Figure 2A, similarities between proteomes on the level of ‘peptides’ with one residue ($L = 1$) are equivalent to AAI. However, for longer peptides ($L > 1$) relatively small differences in the amino acid identity are associated with substantially decreased values of the FSP index because the two indices are related through their logarithms ($\log \mathrm{FSP} = L \log \mathrm{AAI}$). For example, the model predicts that for proteomes characterized by 94% sequence identity on the amino acid level (AAI = 0.94), the fraction of shared peptides with 8 amino acid residues represents 61% of all peptides and drops to 40% and 16% for peptides with 15 and 30 residues, respectively. Because the 94% sequence identity may be assumed as a cutoff value for strains belonging to the same species,42 the FSP index provides a good resolving power that is required for discrimination and identification of closely related strains for diagnostic or epidemiological purposes.

Figure 2. (A) Relationships between proteome conservation expressed as the averaged amino acid identity (AAI) index and the fraction of shared peptides (FSP) calculated using eq 6 for peptides of different lengths (L); (B) frequency distribution of peptide lengths (L, number of amino acids) determined for 24 856 tryptic peptides identified in this work.

Sequence-Based Identification of B. cereus Strains. Peptides identified during analysis of each tested strain were matched to DB bacteria and classified on the basis of similarities estimated with FSPs.49 The results confirmed that all investigated H-serotypes belong to the B. cereus group of the Bacillaceae family in agreement with their taxonomic position established by La Duc et al.19 Nevertheless, none of them could be identified on the strain level because they significantly differed from the DB strains (Table S-1 in the Supporting Information) as exemplified by a histogram of similarities between the serotype H-17 and DB strains (Figure S-2 in the Supporting Information).
Figure 3. Heat map representation of the clustered data matrix of 599 peptide sequences from the B. cereus strain H-10 assigned to the nearest neighbors in the DB. Each yellow cell represents the presence and each black cell the absence of a match. Two-way hierarchical cluster analysis was performed with PermutMatrix51 using Euclidean distances and unweighted pair group averages as the aggregation method. The dendrogram of bacterial strains shows that the H-10 strain clusters with the B. cereus group of bacteria and forms a subcluster with a type strain B. cereus ATCC 14579. The dendrogram of peptides allows visual selection of sequences. For instance, clusters marked ‘a’ through ‘e’ indicate groupings of peptides with different discriminative/diagnostic power (see text and Table 2 for details). Abbreviations of database bacterial strains: XYYY_Z...Z, where X represents the first letter of a genus name, YYY represent the first three letters of a species name, and Z...Z represents the strain name.
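The clustering behind such a heat map can be sketched in a few lines of Python. The example below is only an illustration and not the PermutMatrix implementation: it uses a small, made-up sequence-to-bacterium matrix (the strain labels `query`, `db_A` ... `db_D` and all 0/1 values are hypothetical), computes the FSP of each database strain as the fraction of query peptides matched, and clusters the strain profiles with Euclidean distances and unweighted pair group averages.

```python
import math
from itertools import combinations

# Hypothetical presence (1) / absence (0) profiles over ten query peptides.
PROFILES = {
    "query": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "db_A":  [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "db_B":  [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    "db_C":  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "db_D":  [1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
}


def fsp(profile):
    """Fraction of query peptides that have a match in one database strain."""
    return sum(profile) / len(profile)


def euclidean(u, v):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def upgma(profiles):
    """Average-linkage agglomerative clustering of named profiles.

    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    def average_distance(a, b):
        # Unweighted average over all member pairs of the two clusters.
        total = sum(euclidean(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(*pair))
        merges.append((a, b, average_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


if __name__ == "__main__":
    for name, profile in PROFILES.items():
        print(f"{name}: FSP = {fsp(profile):.2f}")
    for a, b, distance in upgma(PROFILES):
        print(f"merge {a} + {b} at distance {distance:.3f}")
```

Applying the same `upgma` function to the transposed matrix (one profile per peptide instead of per strain) would give the second dendrogram of a two-way analysis. The exhaustive pair search is adequate for a toy matrix but would be slow for hundreds of peptides.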
Differences and similarities between each H-serotype and DB bacteria were further investigated by cluster analysis of the sequence-to-bacterium (STB) assignments. For example, in a heat map representation of the STB matrix for H10 (Figure 3), both DB strains and peptide sequences were rearranged into displayed groupings by the use of a two-way cluster analysis.51 Further, inspection of Figure 3 indicates that B. cereus ATCC 14579 is the closest DB relative to the H10 strain. They share 526 out of 599 peptides (FSP = 0.88) while other members of this group, and especially a clade of B. anthracis strains, are more distant (FSP = 0.82). Moreover, the remaining Bacillaceae strains shared only 3% to 6% of peptides with the serotype H10 (Table S-1 in the Supporting Information). The heat map representation of SEQUEST search results is very advantageous for the discriminative analysis of strains because peptides and their source proteins can easily be investigated in the context of observed similarities to DB strains. It is exemplified by clusters of tryptic peptides, marked ‘a’ through ‘e’ in Figure 3, indicating groupings of peptide sequences with different discriminative/diagnostic power. For instance, the majority of peptides only discriminate between the B. cereus group and remaining DB strains (cluster ‘d’), while clusters ‘a’ and ‘c’ reveal sequences that discriminate serotype H10 and its closest DB neighbor, that is, B. cereus ATCC 14579. Further, the inspection of cluster ‘e’ indicates that 25 sequences
that form this group have low discriminative power because they are broadly distributed among DB Firmicutes. Closer inspection of these peptides (Table 2) indicates that they originate from ribosomal proteins, elongation factors, chaperones, and other proteins (Table S-4 in the Supporting Information) known for containing highly conserved amino acid sequences,54 thus providing molecular insights into the biochemical basis of observed similarities.

Table 2. Peptides Identified in the Trypsin Digest of B. cereus H-10 Proteins and Grouped as Cluster ‘e’ in Figure 3

peptide (a)                 protein (b)                                  accession (c)
RLGISLSGTGK                 30S ribosomal protein S4                     NP_847106.1
LGISLSGTGK                  30S ribosomal protein S4
LGEFAPTR                    ribosomal protein S19                        NP_842682.1
KNESLEDALR                  30S ribosomal protein S21                    NP_846757.1
SAGTSAQVLGK                 50S ribosomal protein L2                     NP_842681.1
VATIEYDPNR                  50S ribosomal protein L2
GYGTTLGNSLR                 DNA-directed RNA polymerase alpha subunit    NP_842705.1
FATSDLNDLYR                 DNA-directed RNA polymerase beta subunit     NP_842671.1
IGETHEGASQMDWMEQEQER        elongation factor EF-2                       NP_842675.1
VLDGAVAVLDAQSGVEPQTETVWR    elongation factor EF-2
ALQGEADWEAK                 elongation factor Tu                         NP_842676.1
TTLTAAITTVLAK               elongation factor Tu
IAGLEVER                    molecular chaperone DnaK                     NP_846762.1
IEDALNSTR                   chaperonin GroEL                             NP_842820.1
VGVMFGNPETTPGGR             recombinase A                                NP_846162.1
HVLVVYDDLSK                 ATP synthase subunit A                       NP_847707.1
TAMVFGQMNEPPGAR             ATP synthase subunit B                       NP_847705.1
ERENGGVLDMDTK               formate acetyltransferase                    NP_843045.1
KSGVITGLPDAYGR              formate acetyltransferase
SGVITGLPDAYGR               formate acetyltransferase
VSGYAVNFIK                  formate acetyltransferase
RPEDTDMVIFR                 isocitrate dehydrogenase                     NP_847041.1
AMIELDGTPNK                 enolase                                      NP_847538.1
HDVVAFTR                    adenylosuccinate lyase                       NP_842840.1
VDFNVPMK                    phosphoglycerate kinase                      NP_847541.1

(a) For additional details see Table S-4 (in the Supporting Information). (b) Protein functions were identified by sequence similarity to database proteins of B. cereus group strains. (c) RefSeq accession.version numbers are given in accordance with the Bacillus anthracis str. Ames annotations (NC_003997.3); each accession is listed once, at the first peptide of the corresponding protein.

Figure 4. Dendrograms of bacterial strains obtained by hierarchical cluster analysis (HCA) of sequence-to-bacterium assignment matrices for selected Bacillus strains: (A) B. cereus serotype H6, (B) B. cereus strain H2, (C) B. cereus strain H4, and (D) B. thuringiensis Berliner. (Euclidean distances and unweighted pair group averages were used as the aggregation method for HCA.) For abbreviations of database bacterial strains, see the legend to Figure 3.

(54) Bern, M.; Goldberg, D.; Lyashenko, E. Nucleic Acids Res. 2006, 34, 4342–4353.

Representative results of clusters formed by other strains with DB bacteria are shown in Figure 4, and for the clarity of presentation, only DB Bacillaceae strains were included. It can be noted that in all cases the dendrograms consisted of two large clusters in a fashion similar to that in Figure 3. One cluster always comprised the investigated strain clustered with the B. cereus group, while the other major cluster grouped the remaining, distantly related members of the Bacillaceae family. By choosing a nearest DB neighbor as a discrimination criterion, all analyzed strains could be subdivided into four groups. For instance, the closest DB neighbor of H10 (Figure 3), H4 (Figure 4C), and four other strains (H11, H14, H15, H18) was the B. cereus strain ATCC 14579, while Bacillus sp. serotype H6 (Figure 4A) and H17 were most similar to the B. anthracis strains. On the other hand, H2 (Figure 4B) and H7 were closest to the DB B. cereus strain called ‘zebra killer’ (ZK, recently renamed as E33L), while the other six strains (H1, H3, H5, H8, H9, H12) formed a subcluster with B. cereus ATCC 10987 as their closest relative. It is obvious that these
interrelationships are quite complex. Therefore, the structure of genomic relatedness among these strains was further explored by multivariate analysis of genomic similarities to DB strains (Figure 1).

Calibration of Proteomic Similarities Using DNA-DNA Reassociation Values. Our preliminary data showed that H-serotypes with high proteomic similarities to B. anthracis Sterne (FSP, 0.87-0.97) also showed high DDH values with its DNA (60-78%).19 A similar trend also was observed in regard to B. cereus ATCC 14579 (Figure S-3 in the Supporting Information), because conceptually the overall genomic similarity between two strains expressed as a DDH value is equivalent to an FSP index. That is, the matching tryptic peptides reflect DNA segments, which potentially would form hybrid pairings. Therefore, for consistency with the FSP term, the DDH value will be considered here as a fraction of shared DNA segments (FSD) between strains. To calibrate proteomic similarities, experimentally determined FSP values (EFSP) were plotted against FSD data for the same pair of strains (Figure 5). FSD data used for construction of this graph covered hybridization levels in the range from 0.4 to 0.97, while the range of proteomic similarities for the same strains was slightly higher (from 0.65 to 0.98). These relationships were approximated by a linear function (FSD = 1.597 × EFSP - 0.707, R² = 0.78) and used to calibrate proteomic similarities against FSD values. The results indicate that the DDH cutoff standard of 70%, which is used for species discrimination,7 is equivalent to experimentally determined proteomic similarities of 88% (FSP, 0.86-0.90, p = 95%). Accordingly, strains with FSP values higher than 88% should be treated as one species.

Figure 5. Calibration of experimentally determined proteomic similarities (EFSP) against DNA-DNA reassociation values (FSD) and theoretically predicted proteomic similarities calculated as a fraction of shared tryptic peptides (TFSP). Triangles refer to FSD data of La Duc et al.19 that were fitted with the dashed line. Diamonds refer to the TFSP data calculated from the in silico digestion of predicted proteomes that were fitted with the solid line.

In theory, the relationship between similarities expressed by FSD and FSP indices could be estimated using the model developed above for comparison between FSP and AAI indices, by substituting $1 - d/T_{aa}$ in eq 4 with the FSD index. This substitution does not change the conceptual framework of the previously developed model, because it only reflects the replacement of amino acid sequences by DNA segments that encode them. Under these circumstances, the time from the common ancestor is described by the equation $t = -2.3 \log(\mathrm{FSD})/(2k_D)$, where $k_D$ is the substitution rate constant for DNA segments. Hence, for any pair of microorganisms, the following relationship between FSD, FSP, and AAI indices is obtained

$$\mathrm{FSD} = (\mathrm{FSP})^{k_D/k_p} = (\mathrm{AAI})^{k_D/k_{aa}} \qquad (7)$$
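Both calibration routes can be tried out numerically. The Python sketch below, with hypothetical function names, inverts the reported linear fit to find the EFSP value that corresponds to a given FSD, and evaluates eq 7 for a user-supplied ratio of rate constants; the ratio is left as a parameter, and the value 2.5 used in the usage lines is the one adopted in the text for $k_D/k_p$.

```python
# Coefficients of the reported linear fit FSD = 1.597 * EFSP - 0.707.
SLOPE = 1.597
INTERCEPT = -0.707


def fsd_from_efsp(efsp: float) -> float:
    """Linear calibration: predicted DNA-DNA reassociation fraction."""
    return SLOPE * efsp + INTERCEPT


def efsp_from_fsd(fsd: float) -> float:
    """Inverse of the linear calibration."""
    return (fsd - INTERCEPT) / SLOPE


def fsd_from_fsp_model(fsp: float, rate_ratio: float) -> float:
    """Eq 7: FSD = FSP ** (k_D / k_p), with rate_ratio = k_D / k_p."""
    if not 0.0 < fsp <= 1.0:
        raise ValueError("FSP must lie in (0, 1]")
    return fsp ** rate_ratio


if __name__ == "__main__":
    cutoff = efsp_from_fsd(0.70)  # 70% DDH species cutoff
    print(f"EFSP equivalent to FSD = 0.70: {cutoff:.2f}")
    print(f"Eq 7 with k_D/k_p = 2.5 at FSP = 0.88: "
          f"{fsd_from_fsp_model(0.88, 2.5):.2f}")
```

The inverted fit reproduces the 88% proteomic cutoff quoted for a DDH value of 70%, and eq 7 with a ratio of 2.5 gives a value close to 0.7 for the same FSP, so the two descriptions are broadly consistent near the species boundary.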
According to this model, the FSP values obtained during our experiments could be used to predict DNA hybridization results for investigated strains by assuming the ratio of the substitution rate constants $k_D/k_p = 2.5$. To investigate a possible role of the proteome coverage on the discrepancies between FSD and EFSP values, theoretical proteomic similarities (TFSP) were calculated between reference proteomes (B. anthracis Sterne and B. cereus ATCC 14579) and other Bacillaceae strains by taking into account all predicted tryptic peptides (Figure S-4 in the Supporting Information). Results of these calculations are also plotted in Figure 5 and show a nonlinear relationship between TFSP and EFSP approximated by a solid line. However, for EFSP values higher than 0.65, a linear relationship could be assumed. Under these conditions, differences between FSD values and theoretical proteomic similarities are not statistically significant (p, 95%), and the overlap of both genome similarity estimators can be observed. For example, strains sharing 70% tryptic peptides predicted by the in silico analysis of whole proteomes (TFSP, 0.7) showed on average 86% of common peptides in our experiments (EFSP range, 0.83-0.89, p = 95%) and an almost identical level of EFSP values was observed for strains with FSD of 0.7. Because relationships between EFSP and FSD values are mirrored by those between the EFSP and TFSP indices, discrepancies between the theoretical and experimental fraction of shared peptides most likely originate from the overrepresentation of peptides derived from the most conserved proteins. It is known that proteins involved in information processing are usually present in a high copy number and hence 1-D LC-MS/MS analysis may give results biased toward a higher fraction of shared peptides than whole proteomes.
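The in silico comparison of whole proteomes can be illustrated with a minimal Python sketch. Several details are assumptions made only for this example: the cleavage rule (C-terminal to K or R, but not before P) is the commonly used trypsin rule and is not stated in the text, missed cleavages are ignored, the minimum peptide length is an arbitrary parameter, and the toy sequences are invented. TFSP is computed here as the fraction of predicted query peptides that also occur in the reference digest.

```python
import re


def tryptic_peptides(protein, min_length=8):
    """Fully cleaved tryptic peptides of one protein sequence.

    Assumed rule: cleave after K or R unless the next residue is P.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {piece for piece in pieces if len(piece) >= min_length}


def digest_proteome(proteins, min_length=8):
    """Union of tryptic peptides over all proteins of a proteome."""
    peptides = set()
    for protein in proteins:
        peptides |= tryptic_peptides(protein, min_length)
    return peptides


def tfsp(query_proteins, reference_proteins, min_length=8):
    """Fraction of predicted query peptides also present in the reference."""
    query = digest_proteome(query_proteins, min_length)
    reference = digest_proteome(reference_proteins, min_length)
    if not query:
        raise ValueError("query digest contains no peptides of the required length")
    return len(query & reference) / len(query)


if __name__ == "__main__":
    # Invented toy proteomes; the reference carries one I -> L substitution.
    query = ["MKTAYIAKQRQISFVK", "GASLLPRDDEEK"]
    reference = ["MKTAYIAKQRQLSFVK", "GASLLPRDDEEK"]
    print(sorted(digest_proteome(query, min_length=5)))
    print(tfsp(query, reference, min_length=5))
```

A single substitution removes one whole peptide from the shared set, which is why peptide-level similarity falls faster than residue-level identity.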
Multidimensional analyses of protein digests could therefore increase the discriminatory power of the proteomic method by providing results closer to theoretical values of shared peptides. However, the outcome of the above relationships could also be affected by other factors like the length of peptides accepted for proteomic analysis or details of the DDH procedure.

Figure 6. Relatedness among B. cereus serotypes H1 through H18 and selected Bacillus type strains determined by hierarchical cluster analysis of distance matrices obtained from (A) proteomic and (B) DNA-DNA reassociation data.19 GyrB Groups 1-3 stand for clusters of H-serotypes revealed by the analysis of concatenated nucleotide sequences of gyrB genes.19

Interrelationships among B. cereus Group Strains. Having established the rationale for the use of FSP values for the phylogenomic analysis of strains, we performed cluster analysis of strain similarities to DB bacteria (Figure 1) from the Bacillaceae family using Euclidean distances (Table S-2 in the Supporting Information) and unweighted pair group averages (UPGA) as an agglomerative method. The results are shown in Figure 6A as a tree diagram that is compared to a dendrogram of the same strains (Figure 6B) that was obtained by cluster analysis of the literature DDH data19 (Table S-3 in the Supporting Information). The topologies of both dendrograms shown in Figure 6 are strikingly similar. Moreover, both trees closely resemble clusters
and subclusters of strains revealed by cluster analysis of concatenated sequences from the 1.2 kb gyrB gene, which were reported by La Duc et al.19 and Bavykin et al.20 Therefore, to facilitate a 3-way comparison of analyzed strains, groupings inferred from the analysis of gyrB sequences were superimposed on both trees shown in Figure 6 as cluster designations. Inspection of the graphs in Figure 6A,B indicates that strains belonging to gyrB group 1 include B. anthracis Sterne and ten H-serotypes, and the same cluster was inferred from both DNA hybridization results and the proteomics data. Moreover, these strains were also grouped according to amino acid substitutions in SASPs that are commonly used spore biomarkers.32 As seen in Figure 6A,B, two distinct subgroupings emerge from group 1. The subcluster marked as ‘a’ indicates a grouping of strains highly similar to B. anthracis, while the subcluster ‘b’ agglomerates strains only moderately similar to the reference strain. Moreover, on the basis of specific 16S rRNA sequences, Bavykin et al.20 classified these serotypes as ‘Anthracis’ and ‘Cereus A’, respectively. It is interesting to note that serotypes H1, H3, and H12 of the ‘b’ subcluster are known as cereulide-producing strains and correspond to MLST sequence types 26, 165, 144, and 164.55

(55) Vassileva, M.; Torii, K.; Oshimoto, M.; Okamoto, A.; Agata, N.; Yamada, K.; Hasegawa, T.; Ohta, O. J. Clin. Microbiol. 2007, 45, 1274–1277.

The comparison of strains classified as members of gyrB subclusters
‘a’ and ‘b’ shows only one discrepancy, namely, on the basis of proteomic similarities, strain H5 was assigned to subcluster ‘b’ while it was placed, together with serotype H9, into subcluster ‘a’ by the use of both gyrB sequences and hybridization values reported by La Duc and colleagues.19 Nevertheless, phylogenetic trees built using sequences of many housekeeping proteins and the B. cereus virulence factor sphingomyelinase indicate a substantial similarity of H5 with serotypes H3 and H12 (unpublished results) and support findings revealed by proteomic similarities. The remaining H-serotypes form a separate cluster that consists of gyrB clusters marked as ‘Group 2’ and ‘Group 3’ in Figure 6A,B. The former cluster can be subdivided on the basis of proteomic distances into two subclusters. The first includes serotypes H4, H11, H14, and H15 that form a subcluster identical with gyrB subgroup ‘b’, and the second includes H10 and H18 that also grouped together as gyrB subgroup ‘a’.19 The phylogenetic significance of this subdivision is also supported by analysis of 16S rRNA sequences indicating affiliation of serotypes H10 and H18 with organisms grouped as ‘Thuringiensis A’.20 A very similar tree topology for these strains was also obtained from DDH results. On the other hand, the B. thuringiensis strains Berliner and Israeliensis formed ‘Group 3’ based on the analysis of gyrB sequences and were clearly separated from food isolates on the basis of proteomic and DDH results. In accordance with these findings, 16S rRNA analyses indicated that they form a separate cluster (‘Thuringiensis B’).20 The only discrepancy observed in this portion of a dendrogram was the position of B. cereus ATCC 14579 that was grouped with the same H-serotypes both on the basis of gyrB sequences (Group 2) and DDH, while proteomic results indicate some differences between this strain and the investigated H-serotypes.
Overall, the comparison of Figure 6A,B indicates that proteomic similarities, DNA-DNA hybridization, and gyrB sequencing provide very similar strain classification results, thus validating the proteomics-based approach. Therefore, proteomic similarities expressed as FSP values could potentially replace DDH, as well as the gyrB or 16S rRNA analyses in revealing phylogenomic affiliations among the B. cereus group.

CONCLUSIONS

Our studies demonstrate that MS-based proteomics integrated with a suite of statistical tools may be used to delineate bacterial species and uncover intraspecies relatedness based on genomic similarities revealed by analysis of the multidimensional structure of peptide conservation profiles. Predictions of genomic relatedness revealed by this method are in full agreement with findings obtained by DNA reassociation and sequencing of established phylogenetic markers, thus providing a proof-of-concept for the use of FSP indices in the identification and phylogenomic classification of BACT group strains. Further, FSP-based similarities may be used as a supplement to DDH and other genomic methods that are commonly used for the confirmation and discrimination of B. cereus strains in environmental, clinical, or food samples. In addition, this approach complements DNA-based assays by providing orthogonal detection capabilities that could be advantageous to prevent system-wide false positives or negatives. It seems obvious that this method could be used for the analysis of many other strains of pathological, technological, and environmental importance because the number of reference strains with fully sequenced genomes exceeds 900 as of July 2009.
Finally, this is a cost-effective method for studying differences between bacterial strains without complete genome sequence information, and it also provides a platform for in-depth exploration of observed similarities and differences among strains.

ACKNOWLEDGMENT

This research was supported in part by an appointment to the Research Participation Program at the Federal Bureau of Investigation, Counterterrorism and Forensic Science Research Unit (FBI-CTFSRU), administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and FBI-CTFSRU. The authors would also like to thank M. Satomi and K. Venkateswaran for providing the bacterial strains used in this study.

SUPPORTING INFORMATION AVAILABLE

Additional information as noted in the text. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review July 14, 2009. Accepted November 6, 2009.

AC9015648