Anal. Chem. 2004, 76, 2355-2366

Identification of Bacteria Using Tandem Mass Spectrometry Combined with a Proteome Database and Statistical Scoring

Jacek P. Dworzanski,*,† A. Peter Snyder,‡ Rui Chen,§ Haiyan Zhang,| David Wishart,| and Liang Li*,§

Geo-Centers, Inc., Aberdeen Proving Ground, Maryland 21010-0068, U.S. Army Edgewood Chemical Biological Center, Aberdeen Proving Ground, Maryland 21010-5424, Department of Chemistry, University of Alberta, Edmonton, Alberta T6G 2G2, Canada, and Department of Biological Sciences, University of Alberta, Edmonton, Alberta T6G 2N8, Canada

Detection and identification of pathogenic bacteria and their protein toxins play a crucial role in a proper response to natural or terrorist-caused outbreaks of infectious diseases. The recent availability of whole genome sequences of priority bacterial pathogens opens new diagnostic possibilities for identification of bacteria by retrieving their genomic or proteomic information. We describe a method for identification of bacteria based on tandem mass spectrometric (MS/MS) analysis of peptides derived from bacterial proteins. This method involves bacterial cell protein extraction, trypsin digestion, liquid chromatography MS/MS analysis of the resulting peptides, and a statistical scoring algorithm to rank MS/MS spectral matching results for bacterial identification. To facilitate spectral data searching, a proteome database was constructed by translating genomes of bacteria of interest with fully or partially determined sequences. In this work, a prototype database was constructed by the automated analysis of 87 publicly available, fully sequenced bacterial genomes with the GLIMMER gene finding software. MS/MS peptide spectral matching for peptide sequence assignment against this proteome database was done by SEQUEST. To gauge the relative significance of the SEQUEST-generated matching parameters for correct peptide assignment, discriminant function (DF) analysis of these parameters was applied and DF scores were used to calculate probabilities of correct MS/MS spectra assignment to peptide sequences in the database. The peptides with DF scores exceeding a threshold value determined by the probability of correct peptide assignment were accepted and matched to the bacterial proteomes represented in the database. Sequence filtering or removal of degenerate peptides matched with multiple bacteria was then performed to further improve identification. It is demonstrated that using a preset criterion with known distributions of discriminant function scores and probabilities of correct peptide sequence assignments, a test bacterium within the 87 database microorganisms can be unambiguously identified.

10.1021/ac0349781 CCC: $27.50 Published on Web 03/13/2004

© 2004 American Chemical Society

The fast detection and identification of pathogenic agents of biological origin, including viruses, bacteria, and toxins, play a crucial role in a proper response to unintentional or terrorist-caused outbreaks of infectious diseases and the use of biological warfare agents on the battlefield. Recently, genomes of all bacteria listed as priority bacterial pathogens for biodefense purposes1 have been sequenced, and this achievement opens new possibilities for their fast and reliable identification on a molecular level by retrieving their genomic or proteomic information. Our goal was to develop a method for microorganism or protein toxin identification based on peptide amino acid sequence information from tandem mass spectrometry (MS/MS) analysis of peptides originating from these sources. Physiological, biochemical, and chemotaxonomic characteristics of microorganisms traditionally have been used for species classification and identification. However, recent advances in molecular biology suggest that classification and identification of microorganisms reflecting relationships encoded in DNA sequences are much more reliable. Retrieving such information is not straightforward because technologies allowing for full genome sequencing were developed only recently.2 Hence, there is an urgent need to develop fast and reliable methods to retrieve parts of genomic information that are thought to be representative of the whole genome. The growing number of completely sequenced bacterial genomes3 and, as a result, the availability of genomic databases for almost 100 bacteria (as of March 2003) provide the sequence information of every potentially expressed protein encoded by these organisms.4 The combination of this unprecedented resource with advances in mass spectrometry (MS) technologies capable of identifying proteins that are actually

* To whom correspondence should be addressed. E-mail: jdworzanski@geo-centers.com or [email protected].
† Geo-Centers, Inc., Aberdeen Proving Ground.
‡ Edgewood Chemical Biological Center, Aberdeen Proving Ground.
§ Department of Chemistry, University of Alberta.
| Department of Biological Sciences, University of Alberta.
(1) http://www.niaid.nih.gov/biodefense/bandc_priority.htm.
(2) Fleischmann, R. D.; et al. (40 coauthors) Science 1995, 269, 496-512.
(3) Doolittle, R. F. Nature 2002, 416, 697-700.
(4) Blattner, F. R.; Plunkett, G., 3rd.; Bloch, C. A.; Perna, N. T.; Burland, V.; Riley, M.; Collado-Vides, J.; Glasner, J. D.; Rode, C. K.; Mayhew, G. F.; Gregor, J.; Davis, N. W.; Kirkpatrick, H. A.; Goeden, M. A.; Rose, D. J.; Mau, B.; Shao, Y. Science 1997, 277, 1453-1462.

Analytical Chemistry, Vol. 76, No. 8, April 15, 2004 2355

expressed5,6 enables the design of new molecular diagnostic procedures to study the relatedness and identity of microorganisms. The identification of microbial proteins using MS and proteomic approaches relies on two different strategies. The classical approach consists of high-resolution separation of cellular proteins by one- or two-dimensional polyacrylamide gel electrophoresis (2D PAGE) to obtain individual proteins for MS analysis.7,8 The recently developed proteomics approach relies on a global, proteome-wide enzymatic digestion followed by separation of peptides by high-performance liquid chromatography (HPLC) coupled with electrospray ionization tandem mass spectrometry (ESI-MS/MS) for protein identification. MS analysis of bacterial proteins9,10 or digests of protein extracts11 followed by statistical matching of protein/peptide masses detected from an unknown sample to those in a proteome database has been developed as a useful tool for bacterial identification.11-15 Several groups have also reported the application of MS/MS to obtain partial protein sequence information for the purpose of microorganism identification. Yates and Eng16 claimed a method for identifying an organism of interest by determining whether product ion mass spectra of peptides obtained from its proteins indicate a homology to a portion of any proteins specified by amino acid sequences in a library of known organisms. Chen et al.17 investigated the ORFFIND program from NCBI to predict protein sequences from all plausible open reading frames (ORFs) in the unfinished genomic database of Porphyromonas gingivalis and found that SEQUEST18 search results of uninterpreted MS/MS spectra with such a protein library provide better results than those from six reading frame translations of the genome DNA sequence. 
Harris and Reilly19 demonstrated that searching matrix-assisted laser desorption/ionization postsource decay mass spectra of peptides derived from on-probe digested Bacillus subtilis cell extract against only the B. subtilis SwissProt database produces matches to particular proteins. Yao et al.20

(5) Mann, M.; Hendrickson, R. C.; Pandey, A. Annu. Rev. Biochem. 2001, 70, 437-473.
(6) Aebersold, R.; Goodlett, D. R. Chem. Rev. 2001, 101, 269-296.
(7) Fountoulakis, M.; Langen, H.; Evers, S.; Gray, C.; Takacs, B. Electrophoresis 1997, 18, 1193-1202.
(8) Tonella, L.; Hoogland, C.; Binz, P. A.; Appel, R. D.; Hochstrasser, D. F.; Sanchez, J. C. Proteomics 2001, 1, 409-423.
(9) Fenselau, C.; Demirev, P. A. Mass Spectrom. Rev. 2001, 20, 157-171 and references therein.
(10) Krishnamurthy, T.; Rajamani, U.; Ross, P. L.; Jabbour, R.; Nair, H.; Eng, J.; Yates, J. R., 3rd.; Davis, M. T.; Stahl, D. C.; Lee, T. D. J. Toxicol.-Toxin Rev. 2000, 19, 95-117.
(11) Zhou, X.; Gonnet, G.; Hallett, M.; Munchbach, M.; Folkers, G.; James, P. Proteomics 2001, 1, 683-690.
(12) Demirev, P. A.; Ho, Y.-P.; Ryzhov, V.; Fenselau, C. Anal. Chem. 1999, 71, 2732-2738.
(13) Arnold, R. J.; Reilly, J. P. Anal. Biochem. 1999, 269, 105-112.
(14) Wang, Z.; Dunlop, K.; Long, S. R.; Li, L. Anal. Chem. 2002, 74, 3174-3182.
(15) Pineda, F. J.; Antoine, M. D.; Demirev, P. A.; Feldman, A. B.; Jackman, J.; Longenecker, M.; Lin, J. S. Anal. Chem. 2003, 75, 3817-3822.
(16) Yates, J. R., 3rd.; Eng, J. K. Identification of nucleotides, amino acids, or carbohydrates by mass spectrometry. U.S. Patent 6,017,693, Jan 25, 2000.
(17) Chen, W.; Laidig, K. E.; Park, Y.; Park, K.; Yates, J. R., III; Lamont, R. J.; Hackett, M. Analyst 2001, 126, 52-57.
(18) Eng, J. K.; McCormack, A. L.; Yates, J. R., 3rd. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(19) Harris, W. A.; Reilly, J. P. Anal. Chem. 2002, 74, 4410-4416.
(20) Yao, Z.-P.; Alfonso, C.; Fenselau, C. Rapid Commun. Mass Spectrom. 2002, 16, 1953-1956.


found that searching a SwissProt database using product ion mass spectra of a few peptides from limited proteolytic digestion of a virus provided high-scoring matches to proteins coded by the subject virus and on this basis claimed its identification. More recently, Warscheid et al.21,22 demonstrated that the acid extraction of bacterial spore proteins followed by peptide microsequencing, using MALDI-MS/MS data searched against the NCBInr database restricted to bacteria, has the capability to identify species from the genus Bacillus in spore mixtures.

We have developed an ESI-MS/MS method for the identification of bacteria represented in a comprehensive bacterial proteome database. In this method, which is experimentally similar to the proteomics work by Yates and co-workers,16,17,23,24 proteins released during cell lysis are digested with trypsin, and the resulting peptides are separated by one-dimensional (1D) reversed-phase HPLC and analyzed by ESI-MS/MS. However, instead of searching the product ion mass spectra against the genome database of microorganisms, we have constructed a bacterial proteome database dedicated to bacterial identification. Full or partial sequences of bacterial genomes of interest are analyzed by the gene finding software GLIMMER,25 and all putative ORFs are translated into protein sequences. The use of this proteome database allows for much faster product ion mass spectra searching, compared to searching the genome database directly,17 and eliminates potential inconsistencies in publicly available protein databases of newly sequenced organisms due to the use of diverse gene finding programs by sequencing centers.2,4,25,26 Different gene recognition programs may differ in choosing alternative start codons near the beginning of an ORF or in resolving the problem of overlapping potential genes. Moreover, publicly available protein databases usually exclude sequences shorter than 70 codons.
To facilitate the process of bacterial identification based on product ion mass spectra searching results, we have also developed a statistical scoring algorithm to rank the matches of the experimental data generated by the SEQUEST searching engine to identify a bacterium in the proteome database with high confidence.

EXPERIMENTAL SECTION

Materials and Reagents. Ammonium bicarbonate, sodium chloride, urea, dithiothreitol (DTT), iodoacetamide (IAM), trypsin, HPLC grade acetonitrile (MeCN), trifluoroacetic acid, and acetic acid were purchased from Sigma (St. Louis, MO). Distilled water was from a Milli-Q UV plus ultrapure system (Millipore, Mississauga, ON).

Bacterial Samples. Escherichia coli K-12 (ATCC 47076, Ec), B. subtilis (ATCC 23857, Bs), and Bacillus thuringiensis (ATCC 33679, Bt) were ordered from the American Type Culture Collection. Bacterial cells were incubated under ATCC recommended conditions, harvested, washed with distilled water, lyophilized, and stored at -25 °C.

(21) Warscheid, B.; Fenselau, C. Anal. Chem. 2003, 75, 5618-5627.
(22) Warscheid, B.; Jackson, K.; Sutton, C.; Fenselau, C. Anal. Chem. 2003, 75, 5608-5617.
(23) Link, A. J.; Eng, J.; Schieltz, D. M.; McCormack, E.; Mize, G. J.; Morris, D. R.; Garvik, B. M.; Yates, J. R., 3rd. Nat. Biotechnol. 1999, 17, 676-682.
(24) Washburn, M. P.; Wolters, D.; Yates, J. R., 3rd. Nat. Biotechnol. 2001, 19, 242-247.
(25) Salzberg, S. L.; Delcher, A. L.; Kasif, S.; White, O. Nucleic Acids Res. 1998, 26, 544-548.
(26) Shibuya, T.; Rigoutsos, I. Nucleic Acids Res. 2002, 30, 2710-2725.

Cell Lysis, Protein Digestion, and Sample Cleanup. Proteins were extracted from bacterial cells after lysis with sonication (Branson probe sonicator; Branson Ultrasonics Corp., Danbury, CT) in 100 mM ammonium bicarbonate buffer (pH 8.5) for 2 min (1 pulse/s with 0.75-s pulse duration). The resulting suspension was centrifuged at 11750g. The supernatant was then filtered using Microcon-3 filters (Millipore, Mississauga, ON, Canada) with a 3000-Da molecular mass cutoff. The cell extract was denatured with urea, reduced with DTT, alkylated by IAM, and digested by trypsin at a ratio of 1:50 (w/w). A ZipTip C18 (Millipore) was used to desalt the resulting peptide mixture before HPLC analysis.

HPLC and Acquisition of Mass and Tandem Mass Spectra. 1D HPLC-MS/MS was conducted on an LCQ DECA Surveyor LC-MS system (ThermoFinnigan, San Jose, CA). Chromatographic separation was performed on a Vydac C18 column (300 Å, 5 µm, 150 µm i.d. × 150 mm) with a flow rate of 1 µL/min. The mobile phase consisted of water and MeCN, and both contained 0.5% (v/v) acetic acid. For 2D HPLC-MS/MS, the peptide mixtures were first separated on a Vydac sulfonic acid cation-exchange column (900 Å, 8 µm, 300 µm i.d. × 150 mm) using a step gradient with increasing NaCl concentration from 0 to 500 mM. Solvent delivery was performed on an Agilent (Palo Alto, CA) HP 1100 HPLC system. The effluent from the first-dimension Vydac column was then separated, in an automated fashion, on two alternating Vydac C18 columns. The LCQ DECA ion trap was operated using the Instrument Method files of Xcalibur, and some of the key operational parameters of this instrument are listed below. The LCQ was set to acquire a full-scan mass spectrum between m/z 400 and 1400 followed by three data-dependent product ion mass spectral scans between m/z 400 and 2000 of the most intense precursor ions.
The excitation energy for the precursor ions selected for collision-induced dissociation was set at 35% (using the operational parameter “% relative collision energy”) with a 30-ms activation time. To avoid repeated acquisition of spectra of the same ion during a specified time period, a method of acquisition termed dynamic exclusion was used with the following parameters specified in the ThermoFinnigan software: a repeat duration of 0.5 min, a repeat count of 2, and a 3-min exclusion duration window.

Database Construction. There were 87 complete bacterial genome sequences publicly available at the time this work was initiated in March 2003. A prototype in-house proteome database was constructed from these genome sequences. Genome sequences of these 87 bacteria (Table S1 in the Supporting Information) were downloaded via the Internet from the National Center for Biotechnology Information (NCBI) site27 in FASTA format and were automatically processed. A computational Gene Locator and Interpolated Markov Modeler (GLIMMER 2.02), made available by The Institute for Genomic Research (TIGR, Rockville, MD), was used to recognize protein coding sequences and to identify ORFs possessing more than 30 consecutive codons.25 In-house-written software was applied for automatic translation of these codons into amino acid sequences of all putative proteins and for assembling a proteome database from the available bacterial genomes. The proteome database was searched directly or was additionally processed to create a peptide sequence

(27) http://www.ncbi.nlm.gov/PMGifs/Genomes/micr/html.

database that contains all potential tryptic peptides derived from the 87 proteomes using a TurboSEQUEST utility (ThermoFinnigan). These translated sequences of all bacterial peptides were stored and indexed in a file that can be read and searched by product ion mass spectra data mining software such as SEQUEST. The 296 942 protein-encoding putative genes recognized by GLIMMER in the 87 genomes shown in Table S1 (see Supporting Information) were used for the database construction. In some special cases of data processing, as indicated in the Results and Discussion, predicted peptide sequences were matched against the NCBI microbial database using BLASTP28 to identify known gene products.

Data Processing. Searches of product ion mass spectra against the prototype database using the SEQUEST algorithm18 produce a list of peptides with several matching parameters that gauge the goodness of matching. Five parameters, namely, Xcorr, ∆Cn, Sp, RSp, and ∆Mpep (see detailed descriptions of these terms in the Results and Discussion), were used in this work to arrive at a unified matching score that could be used to rank the significance of product ion mass spectral matching with the predicted peptide sequence in the prototype database. To determine the unified matching score, discriminant analysis was performed on the five parameters using a training data set consisting of 3019 peptide ion MS/MS spectra generated from a known bacterium. In this case, product ion mass spectra were obtained by reversed-phase HPLC-MS/MS analyses of 17 ion-exchange-LC separated fractions from an E. coli K-12 digest. Each spectrum was searched using SEQUEST against the proteome database. If the top peptide candidate listed in the search result was from a putative protein of E. coli K-12, this result was considered a correct assignment and all five matching parameters associated with this search were entered into a table of correct assignments collected over many different product ion mass spectra. If the top peptide candidate belonged to the proteome of another bacterium, it was labeled an incorrect assignment and the five matching parameters from this search were entered into the incorrect peptide assignment table. Discriminant analysis and modeling of discriminant function score distributions among correct and incorrect peptide assignments were then performed using Statistica software (release 6, StatSoft, Inc., Tulsa, OK) with the five matching parameters as variables.

RESULTS AND DISCUSSION

Figure 1 shows the schematic representation of the proteomic approach for identification of bacteria based on MS/MS analysis of a whole cell protein digest, database search, and statistical analysis of the matching scores. To obtain product ion mass spectra of tryptic peptides, cellular proteins were extracted and digested using trypsin, and HPLC-MS/MS analysis was performed. Although searching uninterpreted product ion mass spectra against nucleic acid sequences is possible, translation in all six frames and the inclusion of noncoding sequences significantly increase the computation time.29 Hence, searching sequences corresponding only to ORFs represents a substantial improvement in the speed of analysis.17 In this work, product ion

(28) Altschul, S. F.; Madden, T. L.; Schäffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Nucleic Acids Res. 1997, 25, 3389-3402.
(29) Yates, J. R., 3rd.; Eng, J. K.; McCormack, A. L. Anal. Chem. 1995, 67, 3202-3210.


Figure 1. Schematic representation of the experimental setup and data processing used for the identification of bacteria.

mass spectra are searched against a database composed of genome-translated proteomes of bacteria. The searching results are analyzed using in-house-developed software for statistical scoring of peptide assignments and bacterial identities. The key components of the method presented in Figure 1 are described below, including the validation of the data analysis procedure using test samples containing one bacterium or a mixture of bacteria.

Proteome Database. Genomic databases of microorganisms are expanding rapidly and it is possible, in cases of fully sequenced genomes, to download a translated proteome of a bacterium from a public site such as the NCBI website. However, sequencing centers use different gene recognition algorithms. To preserve uniformity in gene finding procedures, we have used the same algorithm to find ORFs in all sequenced genomes. In addition, we are interested in developing an in-house proteome database tailored to the identification of certain bacteria of interest. An in-house proteome database offers several advantages. It can include proteomes of bacteria whose genomic sequences are not available in the public domain. The database can be linked to a genome sequencing effort for a specific bacterium of interest, and ORFs from a partial sequence of a genome can be translated in-house as soon as it becomes available. The proteome database can also be expanded to a broad range of applications including identification of viruses and protein toxins. For this work, we used the 87 publicly available bacterial genomes (as of March 2003) to construct a prototype proteome database.
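The peptide-level view of such a database (all potential tryptic peptides of every proteome, indexed by organism) can be illustrated with a minimal Python sketch. This is not the TurboSEQUEST utility or the in-house software used in this work. The function names, the toy sequences, the minimum peptide length, and the cleavage rule (after K or R unless the next residue is P, a common convention for trypsin) are all assumptions made for illustration.

```python
from collections import defaultdict


def tryptic_peptides(protein, min_len=6):
    """Split a protein sequence after K or R, except when followed by P.

    The cleavage rule and the min_len cutoff are illustrative assumptions,
    not parameters reported in the paper. No missed cleavages are modeled.
    """
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (residue in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return [p for p in peptides if len(p) >= min_len]


def build_peptide_index(proteomes, min_len=6):
    """Map each tryptic peptide to the set of organisms whose proteome contains it.

    `proteomes` is a hypothetical dict: organism name -> list of protein sequences.
    """
    index = defaultdict(set)
    for organism, proteins in proteomes.items():
        for protein in proteins:
            for peptide in tryptic_peptides(protein, min_len):
                index[peptide].add(organism)
    return dict(index)


def unique_peptides(index):
    """Keep only peptides that occur in exactly one organism (non-degenerate)."""
    return {pep: next(iter(orgs)) for pep, orgs in index.items() if len(orgs) == 1}


if __name__ == "__main__":
    # Toy sequences, invented for the example.
    toy = {"org_A": ["MKTAYIAKQRPLSR"], "org_B": ["TAYIAKGGGGGGK"]}
    idx = build_peptide_index(toy)
    print(sorted(idx))
    print(unique_peptides(idx))
```

In this toy case the peptide shared by both organisms is dropped by `unique_peptides`, which mirrors the idea of removing degenerate peptides matched with multiple bacteria before assigning an identity.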


We used the GLIMMER algorithm to predict ORFs from the genome of a microorganism. The evaluation of this algorithm for gene finding performance using 10 completed microbial genomes indicates that 97-98% of all genes were correctly identified.30 The genes missed were mainly representatives of hypothetical proteins found by other programs. Therefore, comparisons limited to genes with significant homology to known genes of other organisms indicate the accuracy of the system to be better than 99%.30 The error frequency in finished sequences of bacterial chromosomes has never been precisely measured, but it is thought to be one error (frame shift or base substitution) in 10^3-10^5 nucleotides.31 Nevertheless, even with one error per gene, the translated proteome sequences should still be very useful for database searching applications. Sequencing errors should have a negligible effect on the overall performance of the bacterial identification method, since the method presented in this work relies on the use of multiple peptide sequences selected using probability-based criteria (see below).

Database Match Scoring. Product ion mass spectra of peptides generated from a bacterial cell protein extract digest were searched against the prototype proteome database by using SEQUEST. Although SEQUEST provides assignments of peptides to the bacterial proteins or proteomes automatically, the validity

(30) Delcher, A. L.; Harmon, D.; Kasif, S.; White, O.; Salzberg, S. L. Nucleic Acids Res. 1999, 27, 4636-4641.
(31) Weinstock, G. M. Emerg. Infect. Dis. 2000, 6, 496-504.

of each match has to be evaluated to distinguish between real hits (i.e., correct assignments) and random matches (i.e., incorrect assignments). To discriminate between correctly and incorrectly assigned peptides, several approaches have been developed. For example, it has been suggested that threshold values for a cross-correlation function (Xcorr) can be used to evaluate the match quality between an MS/MS spectrum and a predicted theoretical spectrum utilizing amino acid sequences in the proteome database.23,24 Recently, a few new techniques have been developed32-35 in order to improve the separation of correct and incorrect peptide assignments. The approach used in this work to distinguish correct and incorrect peptide matches was based on modeling of the SEQUEST computed scores using discriminant function analysis. The SEQUEST-generated scores represent variables associated with the quality of fit between an MS/MS spectrum and a theoretical spectrum generated from the proteome sequences in the database. Our modeling was performed using a multivariate DF analysis33 that maximizes the ratio of between-class variance to within-class variance by calculating appropriate weights associated with each variable and transforms SEQUEST scores into DF scores. The model was used to support the selection of peptides that would be subsequently used for analysis of sequence-based similarities between the analyzed test sample and microorganisms in the database. The variables used for the multivariate DF analysis include a raw cross-correlation score of the top candidate peptide (Xcorr), the difference or delta in cross-correlation score (normalized to the highest Xcorr value) between the top-ranked peptide sequences (∆Cn), and the preliminary score of the top candidate peptide (Sp) used to rank the top 500 matching sequences.
Additional variables contributing to discrimination comprised the natural logarithm33 of the rank of the preliminary raw score Sp (RSp) among the candidate peptides and the absolute value of the mass difference between a peptide characterized by a molecular ion with a postulated charge state and the theoretical mass of the assigned peptide (∆Mpep).

To carry out the DF analysis, a large number of MS/MS peptide spectra from a known bacterium are required as a training data set. To this end, a 2D HPLC-MS/MS analysis of an E. coli K-12 tryptic digest was performed and a total of 3019 MS/MS spectra were generated. These spectra were searched against the prototype database by SEQUEST, and the resulting matching data were manually analyzed. They were grouped into four categories according to the charge state of a given peptide ion (i.e., doubly charged or triply charged) and whether the top candidate of the matched peptide by SEQUEST was from E. coli K-12 or not. In this case, 485 doubly charged ions and 182 triply charged ions belonged to the correctly assigned peptides, and 1199 doubly charged ions and 1153 triply charged ions were incorrectly assigned peptides. Figure 2A plots the results from discriminant analysis used to separate these four groups of peptides. Table 1 presents statistical parameters associated with the DFs in Figure 2A. The eigenvalues (roots) indicate that DF1 explains 94.9% of the total variance in the E. coli K-12 training data set, DF2 explains 5%, and the discriminating power of DF3 appears to be negligible at only 0.1%.

(32) Anderson, D. C.; Li, W.; Payan, D. G.; Noble, W. S. J. Proteome Res. 2003, 2, 137-146.
(33) Keller, A.; Nesvizhskii, A. I.; Kolker, I.; Aebersold, R. Anal. Chem. 2002, 74, 5383-5392.
(34) Moore, R. E.; Young, M. K.; Lee, T. D. J. Am. Soc. Mass Spectrom. 2002, 13, 378-386.
(35) Sadygov, R. G.; Eng, J.; Durr, E.; Saraf, A.; McDonald, H.; MacCoss, M. J.; Yates, J. R., 3rd. J. Proteome Res. 2002, 1, 211-215.

Figure 2. Distributions of DF scores for a training data set. (A) Distributions of DF1-DF3 scores for [M + 2H]2+ ions assigned as correct (O) and incorrect (|) matches and [M + 3H]3+ ions assigned as correct (b) and incorrect (s) matches. (B) Distributions of DF1 scores modeled by a Gaussian distribution for positive peptide assignments [correct assignments for spectra of precursor ions at 2+ (s) and 3+ (- - -) charge state] and by the log-normal function distribution for negative assignments [incorrect assignments for 2+ (s) and 3+ (- - -) type precursor ions]. (C) Normalized distributions of positively (s, correct assignments for precursor ions at 2+ and 3+ charge states) and negatively assigned peptides (- - -, incorrect assignments for both types of precursor ions). A vertical solid line indicates an example of a decision criterion that divides assignments into accepted (true and false positive) and rejected matches (true and false negative).

Table 1. Discriminant Functions (DFs) Derived from SEQUEST-Generated Peptide Assignments of a Training Data Set Consisting of 3019 MS/MS Spectra Obtained from 2D HPLC MS/MS Analysis of the E. coli K-12 Cell Extract Digest (a)

variable      DF1 coefficient   DF1 correlation   DF2 coefficient   DF2 correlation   DF3 coefficient   DF3 correlation
Xcorr              0.608             0.766             2.817             0.603             0.131             0.168
∆Cn                6.813             0.883            -6.858            -0.105            -9.320            -0.408
Sp                -0.002             0.567            -0.002             0.173             0.002             0.482
Ln(RSp)           -0.234            -0.660             0.311             0.314            -0.430            -0.616
∆Mpep             -0.139            -0.077             0.006             0.019             0.243             0.126
constant          -0.816                              -4.191                               1.627
eigenvalue         1.530                               0.080                               0.002

(a) Peptides were grouped into four data sets according to correct and incorrect assignments as well as their charge states (doubly charged and triply charged) for deriving DFs.

The final results of the discriminant analysis model for the E. coli training data set are shown as a classification matrix in Table 2. The correct classification of positive [M + 2H]2+ precursor ions has a value of 76.3%, while only one MS/MS spectral search (0.5%) of positive [M + 3H]3+ ions was correctly classified. The overall classification accuracy is relatively low, i.e., 62.7%.

Table 2. Classification Matrix of Correctly (Positive) and Incorrectly (Negative) Assigned Peptides in the Training Data Set Using Discriminant Analysis of Four Groups of Peptides As Shown in Table 1

                            predicted classification
observed classification     [M+2H]2+ negative   [M+2H]2+ positive   [M+3H]3+ negative   [M+3H]3+ positive   correct classifn (%)
[M+2H]2+ ions, negative            801                  27                 370                   1                 66.81
[M+2H]2+ ions, positive             74                 370                  41                   0                 76.29
[M+3H]3+ ions, negative            414                  17                 720                   2                 62.45
[M+3H]3+ ions, positive             29                 113                  39                   1                  0.55
total                             1318                 527                1170                   4                 62.67

To improve the classification accuracy, the discriminant analysis results were further examined. Figure 2B presents the observed distributions of DF1 scores for the four groups of peptides. They were plotted by dividing DF1 scores in the range from -3.0 to 6.5 into 50 categories (bins) and counting the number of correct and incorrect peptide matches separately for [M + 2H]2+ and [M + 3H]3+ ions in each bin (i.e., frequency). It was found that these distributions can be modeled separately for incorrect and correct peptide assignments by the log-normal and Gaussian distribution functions, respectively. Figure 2B shows that the distributions for the correct doubly and triply charged ion assignments overlap almost completely, as do those for the incorrect doubly and triply charged ion assignments. Thus, it can be concluded that the separation of charge states does not improve the discrimination between correct and incorrect assignments. In a subsequent analysis, charge states of the precursor ions were ignored and only two groups of assignments (i.e., correct and incorrect) were considered. By regrouping the MS/MS spectral matches into correct and incorrect assignments, a new discriminant function is obtained as shown in eq 1.

$$\mathrm{DF} = 0.595\,X_{\mathrm{corr}} + 6.620\,\Delta C_n - 0.0001\,S_p - 0.237\,\ln(RS_p) - 0.134\,\Delta M_{\mathrm{pep}} - 0.77 \qquad (1)$$
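As a worked illustration of eq 1, the following minimal Python sketch converts the five SEQUEST matching parameters of a single peptide assignment into a DF score and compares it with a user-selected threshold. The coefficients are copied from eq 1. The minus sign on the ln(RSp) term is assumed by analogy with DF1 in Table 1. The function names, the example parameter values, and the threshold of 2.0 are hypothetical and are not taken from the authors' software.

```python
import math


def df_score(xcorr, delta_cn, sp, rsp, delta_m_pep):
    """Discriminant function score of eq 1 for one peptide assignment.

    xcorr        raw cross-correlation score of the top candidate (Xcorr)
    delta_cn     normalized difference in cross-correlation (dCn)
    sp           preliminary score of the top candidate (Sp)
    rsp          rank of the preliminary score (RSp, >= 1); enters as ln(RSp)
    delta_m_pep  precursor mass difference; its absolute value is used (dMpep)
    """
    return (0.595 * xcorr
            + 6.620 * delta_cn
            - 0.0001 * sp
            - 0.237 * math.log(rsp)   # sign assumed to match DF1 in Table 1
            - 0.134 * abs(delta_m_pep)
            - 0.77)


def accept(score, threshold):
    """Accept an assignment when its DF score exceeds the chosen threshold."""
    return score > threshold


if __name__ == "__main__":
    # Invented example values, for illustration only.
    strong = df_score(xcorr=3.5, delta_cn=0.30, sp=800, rsp=1, delta_m_pep=0.5)
    weak = df_score(xcorr=1.2, delta_cn=0.02, sp=150, rsp=40, delta_m_pep=1.1)
    for label, score in (("strong", strong), ("weak", weak)):
        print(label, round(score, 3), accept(score, threshold=2.0))
```

With these invented inputs, the first assignment scores well above the illustrative threshold and the second falls below it. The threshold itself is a free parameter that trades sensitivity against error rate.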

An eigenvalue of 1.463 for this new DF is obtained. It should be noted that the differences between coefficient values for DF1 of Table 1 and the new DF shown in eq 1 are small. This is reasonable considering that the distributions of doubly and triply charged ion assignments as shown in Figure 2B are quite similar and the combined correct and incorrect assignments from the ions of these two charge states should be similar to those from the ions of either charge state alone. Figure 2C shows the normalized distributions of the new DF scores for the incorrect and correct peptide assignments fitted by the log-normal and Gaussian distribution functions, respectively. The incorrect and correct assignment distributions overlap to some extent. The overlap region indicates where this test cannot distinguish between an incorrect and a correct peptide assignment to a given proteome. A user-selected decision point (indicated by the vertical line in Figure 2C) represents a decision criterion associated with the tradeoff between the sensitivity (i.e., the fraction of peptides belonging to the correct assignments to be used for bacterial identification) and the error rate (i.e., the fraction of peptides belonging to the incorrect assignments to be used for bacterial identification). The position of the decision point divides the distributions into four areas and determines the number (or fraction) of peptides that would be rejected or accepted in the final assignment of a peptide to a microorganism proteome. The accepted peptides or assignments are composed of true positive (TPos) and false positive (FPos) matches (see Figure 2C, areas falling on the right side of the decision line), while those rejected comprise both true negative (TNeg) and false negative (FNeg) assignments (see the left side of the decision line in Figure 2C). For example, the decision point at the intersection of the two distribution curves (see Figure 2C) is associated with a classification matrix displayed in Table 3. The overall classification accuracy in this case reaches 92.4%, which includes 72.6% of correctly identified peptides (TPos), while only 7.35% of matches are among incorrectly identified peptides (FPos).

Table 3. Classification Matrix of Correctly (Positive) and Incorrectly (Negative) Assigned Peptides in the Training Data Set Using Discriminant Analysis of Two Groups of Peptides

observed classification   predicted negative   predicted positive   correct matches (%)
negative                        2307                   45                 98.09
positive                         183                  484                 72.56
false matches (%)               7.35                 8.51

To gauge the probability of correct assignment of a peptide from an unknown sample MS/MS spectrum to a microorganism proteome, statistical evaluation of the spectral matching can be carried out. In this work, DF score distributions experimentally observed and approximated using modeling functions were used to calculate the observed and expected probabilities that a peptide is correctly identified. The observed probabilities were calculated as the ratio of correctly identified peptides to the total number of peptides in a given bin, while the expected posterior probabilities were calculated in accordance with Bayes’ rule:

$$p(\mathrm{Pos}\mid S) = \frac{p(S\mid \mathrm{Pos})\,p(\mathrm{Pos})}{p(S\mid \mathrm{Pos})\,p(\mathrm{Pos}) + p(S\mid \mathrm{Neg})\,p(\mathrm{Neg})} \qquad (2)$$

where p(Pos|S) represents the probability that a peptide with a discriminant function score S was correctly matched to a proteome; p(S|Pos) and p(S|Neg) are the probabilities that correctly and incorrectly identified peptides, respectively, are associated with a discriminant function score S; and p(Pos) and p(Neg) represent a priori probabilities or fractions of correctly and incorrectly matched peptides, respectively, in the training data set. The results of the analysis are presented in Figure 3 in the form of a probability curve, which indicates how DF scores can be translated into the probability of correct peptide assignments. In addition to the probability curve, Figure 3 also shows a plot of the fraction of peptides with true positive assignments (sensitivity) as a function of the DF score and a plot of the fraction of peptides with false positive assignments (error) as a function of the DF score. It can be seen from Figure 3 that, as the probability of correct assignment increases, the fraction of peptides with true positive assignments and the fraction of peptides with false positive assignments both decrease. However, they decrease at different rates. For example, if a minimum DF score of 2.0 is used as the threshold for accepting peptide assignments to proteomes of the microorganisms, the probability of a correct peptide assignment for any MS/MS spectrum with a DF score of >2.0 is greater than 95%. In this case, ∼46% of the true positive peptides have DF scores of >2.0 and, therefore, are used for identification of the bacterium, while less than 1% of the incorrect assignments are included as false positive matches. If a threshold DF score is set at 3.0, the probability of correct peptide assignment is 100%. There are no false positives and the specificity of using these peptides for bacterial identification is very high. However, only 25% of the correctly identified peptides are included for bacterial identification. It is clear that there is a tradeoff between the number of peptides to be used for bacterial identification and the reliability of the peptide data set gauged by the probability of correct peptide assignment. It should be noted that, since only peptides with true positive assignments to the proteome of a microorganism contribute to the identification of this microorganism, a highly sensitive method should detect as many peptides with true positive assignments as possible. However, a number of factors can cause poor matches between recorded MS/MS spectra and the respective theoretical spectra of peptides, resulting in lower DF scores. These factors may be of both biological and methodological origin. The biological reasons include unsuspected posttranslational modifications or single-nucleotide polymorphisms as well as other mutations specific for a given strain and expressed in the investigated proteome. The methodological errors include those associated with the preparation of the database (e.g., DNA sequencing, gene finding), manifested by the lack of a given sequence in the database. Other methodological errors include those related to the analytical procedure, such as sample handling and the preparation process itself. For example, the spectrum may represent either a nonproteinaceous contaminant or a peptide ion derived from nonspecific cleavage by trypsin or endogenous proteases in the bacterial sample. Finally, a mass spectrum of poor quality may be caused by the presence of isobaric ions due to poor separation of peptides, random instrumental errors, or experimental noise. Efforts directed at minimizing these sources of error should result in an increased proportion of product ion spectra that match database peptides with high scores, thereby improving the overall sensitivity of the proteomic method for bacterial identification.

Figure 3. Probability of correct peptide assignments and the fractions of correct (sensitivity) and incorrect peptide assignments (error) at different DF threshold scores. The observed probabilities (●) represent the fraction of correctly matched peptides for each bin of DF scores, while the expected probabilities (solid line) were calculated using modeled distributions of correct and incorrect matches (eq 2); sensitivities were calculated from the relationship (1 - cumulative fraction of correct peptide assignments) and are displayed as the observed true positive rates (△) and rates computed from modeled distributions (- -). The observed (▲) and computed (- - -) false positive rates (error) were calculated from the relationship (1 - cumulative fraction of incorrect peptide assignments).
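As an illustration of how eq 2 converts a DF score into a probability of correct assignment, the following Python sketch may be helpful. The a priori fractions are taken from the training-set counts in Table 3; the Gaussian and (shifted) log-normal parameters are hypothetical placeholders, since the fitted values behind Figure 2C are not given in this section, and the function names are illustrative only.

```python
import math

# A priori fractions from the observed classes in Table 3 (training data set).
N_NEG = 2307 + 45    # incorrectly assigned peptides
N_POS = 183 + 484    # correctly assigned peptides
P_POS = N_POS / (N_POS + N_NEG)
P_NEG = 1.0 - P_POS


def gaussian_pdf(s, mean, sd):
    """Normal density, used here to model DF scores of correct assignments."""
    z = (s - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def shifted_lognormal_pdf(s, mu, sigma, shift):
    """Log-normal density in (s - shift); zero at or below the shift."""
    x = s - shift
    if x <= 0.0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


# HYPOTHETICAL distribution parameters, chosen only so the example runs.
# They are not the fitted values from the paper.
POS_MEAN, POS_SD = 2.0, 1.2
NEG_MU, NEG_SIGMA, NEG_SHIFT = 1.0, 0.35, -4.0


def posterior_correct(s):
    """Eq 2: p(Pos|S) from the class-conditional densities and the priors."""
    like_pos = gaussian_pdf(s, POS_MEAN, POS_SD) * P_POS
    like_neg = shifted_lognormal_pdf(s, NEG_MU, NEG_SIGMA, NEG_SHIFT) * P_NEG
    total = like_pos + like_neg
    return like_pos / total if total > 0.0 else float("nan")


for score in (-1.0, 0.0, 1.0, 2.0, 3.0):
    print(f"DF score {score:+.1f}: p(Pos|S) = {posterior_correct(score):.3f}")
```

With real fitted parameters in place of the placeholders, the same function would trace the expected probability curve of Figure 3, and a threshold DF score could be read off for any chosen probability level.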
Figure 4. Number of accepted peptides from the analysis of an E. coli K-12 protein digest matched to proteomes of bacteria in the prototype database. [Bacterium number, where consecutive numbers refer to the order of bacterial genomes shown in Table S1 (see Supporting Information).] Three sets of peptide assignments passing a filter (see text) at the predetermined probability levels are shown in panels A-C, while the corresponding bacterial identification plots (D-F) were obtained by filtering out the degenerate peptides found in each set. Abbreviations: Bj, B. japonicum; Hi, H. influenzae; Pm, P. multocida; Sf, S. flexneri; St, S. typhimurium; Yp, Y. pestis; Ec, E. coli strains; Pa, Pseudomonas aeruginosa; Pr, probability; T, total number of assignments of peptides to different bacteria; U, total number of unique peptides accepted. The horizontal dashed line indicates the expected total number of unique peptides incorrectly assigned to all organisms at the indicated probability level.

Table 4. Selected Peptides from Proteins in the E. coli K-12 Tryptic Digest and Their Matching Sequences Found in Proteomes of Bacteria in the Prototype Database

Statistical Scoring for Bacterial Identification. As depicted in Figure 1, for unknown bacterial identification, the MS/MS spectra obtained from a 1D HPLC MS/MS experiment are searched against the prototype proteome database by using SEQUEST, and the search results are subjected to DF analysis. A DF score is calculated along with the probability of correct assignment for a given MS/MS spectrum search result. The first filter used to assign each peptide to the proteome of a bacterium is the DF threshold score with a known probability of correct assignment. The effect of this probability-based filter on discrimination of different bacteria is shown in panels A-C of Figure 4. Figure 4A displays a histogram with each bar representing the number of accepted peptides assigned to an individual bacterium. In this case, the threshold DF score used to accept peptides corresponds to an expected 60% probability of correct assignment. There are 287 unique peptides (U) found in the database. Among them, 154 peptides were matched to E. coli K-12 (54%), in comparison to the expected 60% from the probability calculation based on the use of a training data set. At the 80% probability level (Figure 4B), the number of accepted peptides drops to 171, with 137 of them correctly assigned (80%) to the K-12 strain. At the 95% probability level (Figure 4C), 86 out of the 94 accepted peptides were matched to this strain, representing an actual correct assignment of 91.5%. Considering that the number of peptides used in the 1D HPLC MS/MS experiment for this sample analysis is considerably smaller than that of the training data set, the observed small discrepancy between the predicted probability of correct assignments and the fraction of actual correct assignments is not surprising. However, the real number of peptides matched to a given organism, as shown in Figure 4A-C, is higher than the numbers indicated above. This discrepancy originates from the presence of paralogous proteins, i.e., products of related genes formed by a gene duplication event. For example, the protein chain elongation factor EF-Tu from E. coli is encoded by two separate genes, tufA and tufB (see Supporting Information, Table S2), that were presumably formed by gene duplication; hence, their 394 amino acid long sequences differ by only one C-terminal residue. Moreover, note that a given peptide may match several proteomes of different bacteria due to the presence of the same amino acid sequences in these bacterial proteomes. These degenerate peptides are not uncommon and can compromise bacterial differentiation. For example, in panels A-C of Figure 4, the total number of all assigned peptides (T) is 5-7 times the number of unique peptides accepted (U).

To examine the origins of some of these degenerate peptides, a set of 24 peptides with the highest probabilities of correct matches was selected and listed in Table 4. The most probable origin of these tryptic peptides was determined on the basis of independent BLASTP sequence alignment searches against the E. coli K-12 proteome database. The search results, including the E. coli protein names and the amino acid sequences of the peptides in their respective proteins, are shown in Table 4. Each of the peptide sequences shown in Table 4 was then assigned to proteins of matching sequences in the database of 87 proteomes. Table 4 lists only the bacteria whose proteomes matched at least one peptide sequence. It can be seen from Table 4 that the 24 unique peptides can be assigned to 19 bacterial proteomes and that the total number of matches is 179. A close examination of these matches indicates that peptides from particular proteins differ in their discriminating power for bacterial differentiation. For example, the three peptides derived from the periplasmic binding protein (Table 4), involved in oligopeptide transport, are found only in proteomes of E. coli strains and Shigella flexneri, while the peptide SGEIDMTNNSMPIELFQK is found exclusively in the proteome of the K-12 strain. On the other hand, peptides derived from isocitrate lyase (Table 4) do not provide discrimination between E. coli strains. However, they can be used to differentiate between S. flexneri and Salmonella typhimurium strains, and collectively these peptides can provide discrimination to exclude the remaining 81 bacteria as possible bacterial matches. Strain specificities of peptide sequences derived from glyceraldehyde-3-phosphate dehydrogenase, chaperone Hsp70, and the chain elongation factor EF-Tu are increasingly lower. However, each of them provides a unique discriminating power, and collectively they provide important information about the similarity between the investigated strain and bacteria represented in the database. For example, taking into account only 24 peptides from 5 selected proteins (Table 4), it is possible to draw an important diagnostic conclusion about the analyzed sample. Namely, the sample contains peptides derived from a bacterium with a high similarity to E. coli, S. flexneri, and S. typhimurium, moderate similarity to Yersinia pestis and Pasteurella multocida, and a low similarity to Haemophilus influenzae, Buchnera aphidicola, and Vibrio cholerae. The observed bacterial similarities, as indicated in Table 4 and panels A-C of Figure 4, are reasonable and do not reflect a random coincidence.
These similarities originate from the sequence homology among these and many other proteins (see Supporting Information, Table S2) that are coded by homologous genes and have significant matches in related bacterial species. It is assumed that, upon duplication of an ancestral gene, copies of the gene may be retained as a redundant system for executing the original biological function, or they may diverge, with one or both copies giving rise to a novel function. The process of duplication and divergence, along with the occasional transfer of genes between strains and species, gave rise to the present contents of bacterial genomes (36). Data shown in Table 4 and in Figure 4A-C reflect these genomic similarities and consequently are highly redundant. To address the issue of an accepted peptide possibly matching proteomes of different bacteria, a simple deconvolution filter can be applied for bacterial identification scoring. In the proteomic approach to bacterial identification, the bacterium with the highest number of matching peptides is deemed to be the most likely candidate for a true match (37). Given these circumstances, deconvolution can be performed iteratively by selecting the highest scoring bacterium and filtering out peptides assigned to this microorganism from histogram bins associated with all remaining bacteria, which generates a new peptide-matching histogram. A subsequent step involves the removal of peptides from the second highest scoring organism in the newly assembled histogram, and so on. The application of this filter to the data shown in Table 4 indicates that only one deconvolution step is necessary, because all degenerate peptides found in proteomes of bacteria other than E. coli K-12 are smaller subsets of the tryptic peptides found in E. coli K-12. As a consequence of this filtering process, panels A-C in Figure 4 are transformed into the new peptide histograms shown in Figure 4D-F, respectively. A clear identification of E. coli K-12 is represented in these new histograms. The above deconvolution filter effectively removes identical sequences mainly associated with orthologous proteins, that is, proteins encoded by genes in separate species that are derived from the same ancestral genes or are products of horizontal gene transfer between different strains. However, in Figure 4D-F, the total number of distributed peptides (T) is still higher than the number of accepted peptides (U) due to the presence of paralogous proteins. We note that the analysis of peptides from E. coli K-12 represents a difficult case, because the database contains four different E. coli strains. In addition, S. flexneri should be considered as another E. coli strain on the basis of the peptide analysis data presented in Figure 4, i.e., a similar number of peptides matching the proteomes of the E. coli strains and S. flexneri. This conclusion is in agreement with the results based on the sequencing of eight S. flexneri housekeeping genes (7160 bp) (38) and whole genome comparisons (39). Pupo et al. (38) and Jin et al. (39) have suggested that the S. flexneri species should be reclassified as an E. coli strain. Another example of using the statistical scoring approach to identify a bacterium represented in a proteome database is shown in Figure 5. In this case, a tryptic digest of the cell extract from a B. subtilis sample was analyzed by using 1D HPLC MS/MS. Panels A-C in Figure 5 show the histograms of peptide-to-bacterium matching results.
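The two filters described in this section for the E. coli K-12 data (the DF threshold filter followed by iterative deconvolution of degenerate peptides) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the data layout, the function names, the tie-breaking rule, and the toy peptides and scores (apart from the K-12-specific sequence quoted from the text) are all assumptions made for the example.

```python
from collections import defaultdict


def filter_by_score(matches, threshold):
    """Keep peptides whose DF score exceeds the threshold.

    `matches` maps peptide sequence -> (df_score, set of bacteria whose
    proteomes contain that sequence). Returns peptide -> set of bacteria.
    """
    return {pep: set(orgs)
            for pep, (score, orgs) in matches.items()
            if score > threshold}


def histogram(accepted):
    """Count accepted peptides per bacterium; degenerate peptides count for each."""
    counts = defaultdict(int)
    for orgs in accepted.values():
        for org in orgs:
            counts[org] += 1
    return dict(counts)


def deconvolve(accepted):
    """Iteratively credit each peptide only to the highest-scoring bacterium."""
    remaining = {pep: set(orgs) for pep, orgs in accepted.items()}
    result = {}
    while remaining:
        counts = histogram(remaining)
        if not counts:          # peptides with no matching proteome are left out
            break
        # Ties are broken alphabetically here; the text does not specify a rule.
        top = min(counts, key=lambda org: (-counts[org], org))
        claimed = [pep for pep, orgs in remaining.items() if top in orgs]
        result[top] = len(claimed)
        for pep in claimed:     # remove these peptides from all other bins
            del remaining[pep]
    return result


# Toy input: only the first sequence comes from the text; the other
# peptide names and all DF scores are HYPOTHETICAL.
TOY_MATCHES = {
    "SGEIDMTNNSMPIELFQK": (3.1, {"E. coli K-12"}),
    "PEPTIDEA": (2.6, {"E. coli K-12", "S. flexneri"}),
    "PEPTIDEB": (2.2, {"E. coli K-12", "S. flexneri", "Y. pestis"}),
    "PEPTIDEC": (2.4, {"Y. pestis"}),
    "PEPTIDED": (0.4, {"B. japonicum"}),
}

accepted = filter_by_score(TOY_MATCHES, threshold=2.0)
before = histogram(accepted)
after = deconvolve(accepted)
print("U =", len(accepted), " T =", sum(before.values()))
print("before deconvolution:", before)
print("after deconvolution: ", after)
```

In this toy case a single pass removes every peptide shared with the top-scoring organism, which mirrors the observation that only one deconvolution step was needed for the Table 4 data. Bacteria left with no peptides simply drop out of the result.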
At the expected probability level of correct peptide assignments equal to 60%, there were 154 accepted peptides from DF analysis of SEQUEST search results of MS/MS spectra against the prototype proteome database. Among them, 80 peptides were found to match the proteome of B. subtilis, corresponding to 52% correct assignments. At the 80% probability level, 72 of the 90 accepted peptides were from B. subtilis (80% correct assignments). When the probability level increases to 95%, only 56 peptides are accepted, and 54 of them are from B. subtilis (96% correct assignments). Inspection of these histograms indicates that correct identification of B. subtilis is possible even without the use of the deconvolution filter. This is due to the relatively low number of amino acid sequences among the accepted peptides that can be found in proteomes of other microorganisms represented in the database. In addition, the number of paralogous proteins among them is also very low. For example, at the 95% probability level (Figure 5C), 54 out of the 56 accepted peptides are matched to B. subtilis and none of the remaining proteomes are associated with more than two matching sequences. The analysis of B. thuringiensis is shown in Figure 5D-F. In this case, the proteome of B. thuringiensis as well as the proteomes of its close relatives, that is, bacteria belonging to the Bacillus cereus group, were not included in the prototype database. Because the number of sequenced bacterial genomes is still limited in comparison to the number of known bacteria, this case represents the most realistic situation that may be encountered by an investigator. Panels D-F in Figure 5 show the histograms of accepted peptides at different levels of probability of correct assignment from analysis of a total of 606 product ion spectra of peptides generated from the 1D HPLC-MS/MS analysis of a B. thuringiensis cell extract digest. The distributions of peptide assignments to bacterial proteomes shown in Figure 5D-F appear to be random. For example, by using a DF score threshold equivalent to 80% expected probability of correct peptide assignments (Figure 5E), only 59 peptides are retained. Interestingly, these accepted peptides are distributed among 47 different bacteria; however, the maximum number of peptides matched the B. subtilis proteome, while the next highest numbers of accepted peptides matched Bacillus halodurans and Streptococcus agalactiae strains, thus providing some guidance on the taxonomic position of this microorganism. Moreover, the total expected number of incorrectly matched peptides (horizontal dashed line) exceeds the numbers associated with any bacterium. In contrast, at the 95% probability level (Figure 5F), the number of expected incorrect assignments is low (dashed line), and up to 18 bacteria emerge as possible sources of these peptides. It should be noted that during this analysis it was impossible to determine the actual number of peptides correctly assigned to B. thuringiensis, because this bacterium was not represented in the database. The assumed fractions of incorrectly assigned peptides at a given threshold were based on the data from B. subtilis and therefore were underestimated. It can be readily concluded from the data shown in Figure 5D-F that there is no significant matching to any bacterial proteome in the prototype database from the analysis of the B. thuringiensis sample.

Figure 5. Number of accepted tryptic peptides from the analysis of B. subtilis (A-C) and B. thuringiensis (D-F) protein extract digests that matched to proteomes in the database (bacterium number). Three sets of peptide matches are shown with various predetermined probability (Pr) levels of correct assignments: 0.60 (A, D), 0.80 (B, E), and 0.95 (C, F). Abbreviations: Bh, B. halodurans; Cp, Clostridium perfringens; Ct, Clostridium tetani; Sa, S. agalactiae. For remaining abbreviations, see legend in Figure 4.

(36) Hooper, S. D.; Berg, O. G. Mol. Biol. Evol. 2003, 20, 945-954.
(37) English, R. D.; Warsheid, B.; Fenselau, C.; Cotter, R. J. Anal. Chem. 2003, 75, 6886-6893.
(38) Pupo, G. M.; Lan, R.; Reeves, P. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 10567-10572.
(39) Jin, Q.; and 32 coauthors. Nucleic Acids Res. 2002, 30, 4432-4441.

Application to Mixture Analysis. The relatively high resolving power of this analytical method allows the analysis of mixtures of microorganisms, as documented in Figure 6. In this case, a bacterial mixture composed of E. coli K-12 and B. subtilis cells (2:1, w/w) was investigated by 1D HPLC-MS/MS analysis of a protein extract digest. There were 800 unique peptides detected from the SEQUEST analysis of the MS/MS spectra. Using these raw data without applying any filters, the peptide-to-bacterium assignment histogram is shown in Figure 6A. Because of the presence of degenerate peptides, the histogram consists of 1470 peptide assignments. Although assignments to E. coli and B. subtilis strains predominate, a substantial number of matches are associated with S. flexneri, S. typhimurium, and Bradyrhizobium japonicum strains. The large number of assignments to S. flexneri and S. typhimurium mainly originates from genomic similarities between these microorganisms and E. coli (see Figure 4A). However, the substantial number of matches associated with B. japonicum and the lowest number of matches assigned to K-12 among E. coli strains reflect a substantial contribution of randomly assigned peptides in this histogram. The B. japonicum genome is the largest in the database (9.1 Mbp), while the K-12 genome (4.6 Mbp) is 15-20% smaller than those of the three other E. coli strains in the database. After applying the 80% probability filter to these raw peptide assignments, only 119 peptides are accepted, and their assignments to different bacteria in the prototype database are shown in Figure 6B. The number of matches associated with B. japonicum is diminished from 52 to 4, and K-12 clearly predominates among E. coli strains. The deconvolution of the histogram shown in Figure 6B by removing the degenerate peptides produces the histogram shown in Figure 6C. In this histogram, the number of peptides assigned to K-12 remains the same (62) as that in Figure 6B. However, the number of B. subtilis matches was diminished from 50 to 46 and reflects the presence of four identical sequences shared by these microorganisms, presumably originating from orthologous proteins. From the total number of accepted peptides (119), the peptides matching B. subtilis (46) and E. coli K-12 (57) represent 86% of correct assignments (the expected probability is 80%) and only 16 assignments are distributed among the remaining organisms. However, due to the presence of paralogs, the actual number of peptides assigned to E. coli is 62. From the histogram shown in Figure 6C, it can be concluded that this sample is a mixture of B. subtilis and E. coli K-12.

Figure 6. Number of accepted tryptic peptides from the analysis of a protein extract digest of a bacterial mixture composed of E. coli K-12 and B. subtilis (Bs) cells (2:1, w/w) that matched to proteomes in the database (bacterium number): (A) raw output data from SEQUEST; (B) assignments of peptides passing the filter with the 80% probability of correct assignment; and (C) bacterial identification plot obtained by filtering out degenerate peptides. For abbreviations, see legend to Figure 4.
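The observed fractions of correct assignments quoted in this section can be recomputed directly from the peptide counts given in the text. The short Python sketch below does this for the E. coli K-12, B. subtilis, and mixture experiments; the table layout and function name are illustrative choices, and the counts are copied from the text rather than derived from raw data.

```python
# (sample, expected probability, accepted unique peptides,
#  peptides matched to the organism(s) actually present), as quoted in the text.
RESULTS = [
    ("E. coli K-12", 0.60, 287, 154),
    ("E. coli K-12", 0.80, 171, 137),
    ("E. coli K-12", 0.95, 94, 86),
    ("B. subtilis", 0.60, 154, 80),
    ("B. subtilis", 0.80, 90, 72),
    ("B. subtilis", 0.95, 56, 54),
    ("E. coli K-12 + B. subtilis mixture", 0.80, 119, 57 + 46),
]


def observed_fraction(accepted, correct):
    """Fraction of accepted peptides that match the organism actually analyzed."""
    return correct / accepted


for sample, expected, accepted, correct in RESULTS:
    obs = observed_fraction(accepted, correct)
    print(f"{sample:36s} expected {expected:.0%}  observed {obs:.1%}"
          f"  ({correct}/{accepted})")

# In the mixture, the peptides not matched to either component are the
# 16 assignments distributed among the remaining organisms.
print("mixture, other organisms:", 119 - (57 + 46))
```

The comparison shows the pattern discussed above: the observed fraction tracks the expected probability closely at the 80% level and deviates by a few percent at the 60% and 95% levels.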

CONCLUSIONS

The 45-60-min 1D HPLC-MS/MS analysis of tryptic digests derived from protein extracts of pure cultures of E. coli, B. subtilis, B. thuringiensis, and a bacterial mixture, combined with a new data processing algorithm that includes SEQUEST and discriminant function analysis, allows for a highly reliable identification of bacteria represented in a proteome database. The present approach is based on (a) HPLC-MS/MS analysis of microbial tryptic peptides combined with (b) searching an in-house database composed of proteomes of microorganisms and (c) an in-house-developed scoring system. It allows for a high-throughput analysis of SEQUEST database search results, identification of correct peptide assignments at a chosen probability level, and high confidence level identification of pure cultures as well as mixtures of microorganisms. Although the prototype proteome database consists of only bacteria with their complete genomes available in the public domain, it can readily incorporate other sequenced bacterial genomes, including all priority pathogenic bacteria for biodefense purposes as well as their protein toxins. In addition, there is no conceptual limitation in the extension of this proteomic approach to the analysis of hundreds of viruses with sequenced genomes. Future work will include the expansion of the proteome database to a broader range of microorganisms, toxins, and expected environmental interferences. Future work will also focus on the seamless automation of the entire identification process.

ACKNOWLEDGMENT

This work was supported by the U.S. Army Edgewood Chemical Biological Center (L.L.). R.C. thanks the Alberta Ingenuity Foundation for a Research Associateship.

SUPPORTING INFORMATION AVAILABLE

Additional information as noted in the text. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review August 21, 2003. Accepted February 13, 2004. AC0349781