Direct MALDI-TOF MS Identification of Bacterial Mixtures


Article

Direct MALDI-TOF MS Identification of Bacterial Mixtures Yi Yang, Yu Lin, and Liang Qiao Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.8b02258 • Publication Date (Web): 09 Aug 2018 Downloaded from http://pubs.acs.org on August 10, 2018


Analytical Chemistry

Yi Yang 1, Yu Lin 2,*, and Liang Qiao 1,*

1. Department of Chemistry, Shanghai Stomatological Hospital, Fudan University, Shanghai 200000, China
2. College of Engineering and Computer Science, The Australian National University, Canberra 0200, Australia

ABSTRACT: Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is now widely used to characterize bacterial samples for clinical diagnosis, food safety control, environmental monitoring, etc. However, existing standard approaches are applied only to single colonies purified by plate culture, which limits them to cultivable bacteria and makes the whole procedure time-consuming. In this work, we propose a new framework to analyze MALDI-TOF spectra of bacterial mixtures and to directly characterize each component without purification procedures. The framework combines a synthetic mixture model, based on a non-negative linear combination of candidate reference spectra, with a statistical assessment by in-silico generated spectra via jackknife resampling. Ninety-seven model bacterial mixture samples and 8 co-cultured blind-coded bacterial mixture samples, containing up to 6 strains in varied ratios in each sample, together with a reference database containing the mass spectra of 1081 strains, were used to validate the framework. High sensitivity (> 80%, with error rate < 5%) was achieved for balanced binary and ternary mixtures. The sensitivity was > 60% for balanced quaternary and pentabasic mixtures, and 48%-71% for asymmetric mixtures, with error rate < 5%. The work can facilitate rapid and reliable characterization of bacterial mixtures without purification procedures, which is of practical value in clinical diagnosis, food safety control, environmental monitoring, etc. The framework can be further applied to many other spectroscopy-based analytics to interpret spectra from mixed samples.

Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) has been used for rapid and reliable identification of bacteria at the genus, species, and in some cases, strain levels 1, becoming a routine technique in clinical diagnosis, food industries, environmental monitoring, military safety, etc. 2-6 Biological molecules, typically small proteins originating from cell surfaces, intracellular membranes and ribosomes, constitute the fingerprint mass spectrum of a bacterium. 7 Fingerprint mass spectra acquired from bacterial samples are usually compared to reference spectra of purified known strains for identification. Current commercial systems can identify several hundred bacterial species 8-10, and have been licensed by the US Food and Drug Administration for identification of human pathogens in clinical settings. 11 However, the current approaches suffer from difficulties in sample preparation, especially with poly-bacterial samples, wherein bacteria must be isolated in pure culture for a typical MALDI-TOF MS identification. 12 Isolation and cultivation procedures are very time-consuming and have been observed to affect spectrum quality and reproducibility in MALDI-TOF MS based bacterial identification. 13 Furthermore, only a small proportion of bacterial samples can be successfully cultured in practice. For this reason, there has been considerable interest in applying MALDI-TOF MS to bacterial sample characterization without pure culture isolation.

MALDI-TOF has been used to directly characterize mono-bacterial samples without pure culture isolation. 14-21 For poly-bacterial samples, to the best of our knowledge, only a limited number of methods have been proposed for MALDI-TOF based characterization without pure culture isolation and without specific antibodies to purify target bacteria. Most of this work uses simple model mixture systems constructed by mixing equal amounts of two or three types of bacteria, and compares the mass spectra of the mixture with reference spectra of pure cultures. 22,23 Clinical samples such as positive blood cultures and urine samples containing two bacterial species in varied ratios have also been studied, but failed or partially mismatched identifications are often reported. 24,25 In 2014, Mahe et al. successfully identified bi-microbial model systems by MALDI-TOF with a penalized non-negative linear regression framework. 26 However, a penalty had to be set to adjust the number of positive coefficients, trading off the reconstruction error, the sparsity of the solution, and the model complexity. In 2015, Zhang et al. characterized a more complex mixture containing six types of bacteria by MALDI-TOF using approaches based on biomarker recognition and spectral similarity calculation. 27 In current similarity-based methods for MALDI-TOF MS characterization of bacteria, a similarity score threshold is usually applied for effective identification. For instance, Biotyper suggests a cutoff value of 2.0 for its log(score) 28, wherein the log(score) is an ad-hoc unit score between 0 and 3 reflecting the similarity between sample and reference spectra. 29 Some researchers suggested that the cutoff value of Biotyper could be lower or higher depending on the spectra quality and the bacterial samples analyzed. 30 Previously, we proposed a framework that provided a statistical assessment of MALDI-TOF based bacterial identification by introducing the bootstrapping method 31, leading to a more reliable identification at both the genus and species level. With this method, species in the Bacillus cereus group, which were previously claimed to be impossible to resolve by MALDI-TOF, 32 could be correctly identified.

In this work, we propose a framework to characterize poly-bacterial samples based on their MALDI-TOF spectra, using a reference database containing 1081 mass spectra of pure cultures of 1081 strains/480 species/64 genera. A synthetic mixture model (SMM) has been developed to decompose experimental mass spectra of poly-bacterial samples into linear mixtures of the spectra of the corresponding pure cultures. Optimal deconvolution results were achieved when the highest correlation was obtained between the experimental mass spectra and the in-silico synthesized mixture spectra, i.e. the linear mixtures. Moreover, we have proposed a jackknife model to provide a statistical assessment of the deconvolution results. Ninety-seven model mixtures were used to validate the framework. The performance of the framework was further evaluated with 8 co-cultured blind-coded bacterial mixtures, which were more similar to real samples. The experimental results show that our framework can successfully characterize bacterial mixtures with varied mixing ratios, demonstrating the practical application of MALDI-TOF MS in direct characterization of poly-bacterial samples. All source code is available at https://github.com/lmsac/BacteriaMS-mixture.

MATERIALS AND METHODS

Overview of the Framework for Bacterial Mixture Identification

The framework for bacterial mixture identification uses a four-step pipeline, as presented in Figure 1. In the first step, raw mass spectra (usually ≥ 5) from the same bacterial mixture sample are preprocessed to extract lists of peaks with mass-to-charge ratio (m/z), normalized intensity (I), signal-to-noise ratio (S/N) and full width at half maximum (fwhm). The peak lists are then combined by hierarchical clustering with tolerance = 2000 ppm to form one combined sample spectrum. In the second step, a number of candidate reference spectra from a database containing the spectra of 1081 strains (Data Set S-1) are chosen based on their similarity to the combined sample spectrum by Jaccard coefficients, and then used to synthesize a number of in-silico mixture spectra by linear combination. In the third step, cosine correlation similarity scores between the combined sample spectrum and each in-silico mixture spectrum are calculated. The composition of the in-silico mixture spectrum with the highest similarity score is reported as the deconvolution result. In the fourth step, further in-silico statistical assessment is performed with a jackknife model. In the jackknife model, a number of in-silico mass spectra are generated, where the peaks are randomly chosen from the sample spectrum by jackknife resampling. The in-silico spectra are subjected to the second and third steps to obtain their deconvolution results. For a given component identified from the real sample spectrum, the confidence score (conf) is calculated as its detection rate among the in-silico spectra generated by jackknife resampling. The hypothesis is that the confidence is high for a true positive identification and low for a random match. When the jackknife confidence score is less than 0.88, the result is rejected. Details on how the parameters of the framework were determined are given in Supporting Information Text S-1.

Reagents

Acetonitrile (ACN) (HPLC grade) and trifluoroacetic acid (TFA) (HPLC grade) were purchased from ANPEL Laboratory Technologies (Shanghai, China). 2,5-Dihydroxybenzoic acid (DHB) and protein calibration standard I including insulin, cytochrome C, myoglobin and ubiquitin were from Bruker Daltonic Inc (Madison, USA). Trypticase soy broth (TSB) and tryptone soy agar (TSA) were purchased from Beijing Land Bridge Technology Company Ltd. (Beijing, China). Tryptone, agar, sodium chloride, sodium deoxycholate and glycerol were purchased from Sangon Biotech (Shanghai) Co., Ltd. (Shanghai, China). α-Lactose, iron(III) citrate, neutral red and cetyltrimethylammonium bromide (CTAB) were purchased from BBI Life Sciences Corporation (Shanghai, China). Trisodium citrate dihydrate (AR grade), dipotassium hydrogen phosphate trihydrate (AR grade), potassium sulfate (AR grade) and magnesium chloride (AR grade) were purchased from Sinopharm Chemical Reagent Co., Ltd. (Shanghai, China). Deionized (DI) water (18.2 MΩ cm) was purified by a Smart-Q deionized water system (Hitech pure water technology, Shanghai, China) and used in all aqueous solutions.

Bacteria Cultivation and Construction of Mixture Samples

Ten types of bacteria were used in this study, i.e. Enterobacter cloacae ATCC 23373 (EL), Escherichia coli ATCC 25922 (EC), Klebsiella oxytoca ATCC 13182 (KO), Klebsiella pneumoniae ATCC 700603 (KP), Pseudomonas aeruginosa ATCC 27853 (PA), Staphylococcus aureus ATCC 25923 (SA), Vibrio alginolyticus ATCC 17749 (VA), Vibrio fluvialis ATCC 33809 (VF), Vibrio mimicus ATCC 33653 (VM), and Vibrio parahaemolyticus ATCC 17802 (VP). All bacterial strains were cultivated in TSB medium at 37 °C for 12 h with continuous shaking at 175 rpm. Specifically, Vibrio bacteria (i.e. VA, VF, VM and VP) were cultivated in TSB with 3% sodium chloride. The optical density (OD) of each bacterial suspension was determined by UV-visible absorption spectroscopy at 600 nm. Model mixtures containing 2, 3, 4, 5 and 6 types of bacteria were prepared by mixing the bacterial suspensions at specified volume ratios in a single sterile microcentrifuge tube.

Blind-coded bacterial mixtures were constructed by mixing 6 bacterial cell suspensions (500 μL, ~10 CFU/mL each) into TSB medium to reach a final volume of 5 mL, and incubated at 37 °C for 4 h with continuous shaking (175 rpm). The concentration of each individual bacterium in the blind-coded mixtures was determined by plate counting. Total aerobic counts of the samples were measured with TSA plates. Enterobacteriaceae bacteria (i.e. EL, EC, KP and KO) were counted with self-made deoxycholate lactose agar (DCLA) plates containing tryptone (10 g/L), α-lactose (10 g/L), sodium chloride (5 g/L), trisodium citrate (1 g/L), iron(III) citrate (1 g/L), sodium deoxycholate (1 g/L), neutral red (0.03 g/L), dipotassium hydrogen phosphate (2 g/L) and agar (13 g/L). Pseudomonas bacteria (i.e. PA) were counted with self-made cetrimide agar plates containing tryptone (20 g/L), magnesium chloride (1.4 g/L), potassium chloride (10 g/L), CTAB (0.3 g/L), agar (15 g/L) and glycerol (10 mL/L).

Mass Spectra Acquisition

Bacterial suspensions (1 mL) were centrifuged at 13,000 rpm for 3 min at room temperature and the supernatant was discarded. The cell pellets were washed three times with DI water and then re-suspended in DI water (100 μL). Finally, 1 μL of each bacterial suspension in DI water was transferred to one sample spot on a MALDI target plate and allowed to dry at room temperature before being overlaid with 1 μL of DHB matrix solution (10 mg/mL in 70% water, 30% ACN, 0.1% TFA).

All bacterial samples were analyzed on a Bruker Microflex LRF MALDI-TOF Mass Spectrometer (Bruker Daltonik, Germany) with the mass-to-charge ratio (m/z) range of 2-20 kDa in positive linear mode. Four peaks of insulin (m/z 5734), ubiquitin (m/z 8565), cytochrome C (m/z 12360) and myoglobin (m/z 16952) were used for external calibration of the instrument.

MS Data Preprocessing

Raw mass spectra were exported as text files (.txt) using flexAnalysis software (version 3.0; Bruker Daltonics, Germany). All subsequent data analyses were conducted with R from the R Foundation for Statistical Computing (http://www.r-project.org). The m/z range of 4-12 kDa was selected for spectral pattern matching. Peaks with a signal-to-noise ratio (S/N) of at least 3 were extracted from each spectrum after baseline correction and intensity normalization.
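The peak-extraction criteria stated in the preprocessing step (m/z window of 4-12 kDa, S/N of at least 3) can be illustrated with a minimal Python sketch. This is not the authors' R code: the function name `extract_peaks`, the dictionary keys, and the choice to rescale intensities to the most intense retained peak are assumptions made for illustration, and baseline correction is not shown.

```python
def extract_peaks(peaks, mz_min=4000.0, mz_max=12000.0, min_sn=3.0):
    """Keep peaks inside the m/z window with S/N >= min_sn, then rescale intensities.

    `peaks` is a list of dicts with keys "mz", "intensity" and "sn"
    (hypothetical layout chosen for this sketch).
    """
    kept = [p for p in peaks if mz_min <= p["mz"] <= mz_max and p["sn"] >= min_sn]
    top = max((p["intensity"] for p in kept), default=0.0)
    if top <= 0.0:
        return []
    # Assumption: normalise to the most intense retained peak; the text does not
    # say which normalisation the authors applied.
    return [
        {"mz": p["mz"], "intensity": p["intensity"] / top, "sn": p["sn"]}
        for p in kept
    ]
```

With the default arguments, a peak at m/z 3500 or a peak with S/N 2.5 would be dropped, and the remaining intensities would lie between 0 and 1.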

Figure 1. Overview of the framework for bacterial mixture identification by MALDI-TOF. (A) Extraction of peak lists from raw sample mass spectra and combination by hierarchical clustering to form combined sample spectrum; (B) in-silico synthesis of mixture spectrum using a non-negative linear combination of the candidate reference spectra of individual bacteria; (C) calculation of the similarity score between the sample and the synthetic mixture spectrum with cosine correlation, colored spots representing the identified components; (D) the in-silico jackknife model for statistical assessment of the identified components from (C).

Replicated spectra of each sample were merged into a combined spectrum using a hierarchical clustering algorithm (Figure 1A). The distance between two peaks is defined as

$$ d = \frac{\left| (m/z)_1 - (m/z)_2 \right|}{\max\{(m/z)_1, (m/z)_2\}} \qquad \text{(M1)} $$

if the two peaks are in different spectra; otherwise, d = 1. Complete linkage was used when calculating inter-cluster distances. The hierarchical clustering tree was cut at a specified height (the tolerance, 2000 ppm in this study), and the peaks were divided into several bins. Within each bin, the means of m/z and intensity were calculated to make up the combined spectrum.
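The merging of replicate peak lists can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' R implementation: the names `peak_distance` and `merge_replicates` and the dictionary keys are hypothetical, the clustering is a naive agglomerative loop (adequate for short peak lists only), and averaging the intensity over the peaks present in a bin, rather than over all replicates, is an assumption.

```python
from itertools import combinations


def peak_distance(p, q):
    """Distance (M1); peaks from the same replicate get d = 1 so they never merge."""
    if p["spectrum"] == q["spectrum"]:
        return 1.0
    return abs(p["mz"] - q["mz"]) / max(p["mz"], q["mz"])


def merge_replicates(peaks, tolerance=2000e-6):
    """Complete-linkage agglomerative clustering, cut at `tolerance` (2000 ppm)."""
    clusters = [[p] for p in peaks]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Complete linkage: cluster distance is the largest pairwise peak distance.
            d = max(peak_distance(p, q) for p in clusters[i] for q in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > tolerance:  # the next merge would lie above the cut height
            break
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    combined = [
        {
            "mz": sum(p["mz"] for p in c) / len(c),
            "intensity": sum(p["intensity"] for p in c) / len(c),
        }
        for c in clusters
    ]
    return sorted(combined, key=lambda p: p["mz"])
```

Because two peaks from the same replicate are at distance 1, far above the 2000 ppm cut, each bin contains at most one peak per replicate spectrum.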


Synthetic Mixture Model and Similarity Scoring Method

We consider the mixture spectrum as a non-negative linear combination of the spectra of individual bacteria (Figure 1B):

$$ s_m = \sum_{i=1}^{m} c_i s_i \qquad \text{(M2)} $$

where s_m is a vector representing the mixture spectrum, s_i is a vector representing the spectrum of an individual bacterium i, the intensity coefficient c_i accounts for the contribution of bacterium i to s_m, and m is a specified positive integer representing the maximum possible number of individual bacteria. The Jaccard coefficient

$$ J(A, B) = \frac{N_{AB}}{N_A + N_B - N_{AB}} \qquad \text{(M3)} $$

was calculated between the combined experimental spectrum (A) and the spectrum of each individual bacterium (B) in a reference database, where N_A and N_B are the numbers of peaks in the two spectra, and N_AB is the number of their common peaks. In this study, the m/z tolerance between a pair of common peaks was set to 2000 ppm. The tolerance was selected based on the resolution of the mass spectra of bacteria obtained with the Bruker Microflex in linear TOF mode. The reference spectra of different species with the top n Jaccard coefficients were selected as candidates. In this study, we used n = 7 and m = 6 to deconvolve the experimental mass spectra of bacterial mixtures, i.e. always considering 6 significant components to form the mixtures.

For a sample, $C_n^m$ candidate combinations of individual bacteria were listed. For each candidate combination, the hierarchical clustering algorithm described above was used for peak alignment among the experimental mixture spectrum and the m individual bacteria in the combination, and the intensity coefficients of the non-negative linear combination model were calculated with the R package “nnls” (https://cran.r-project.org/package=nnls), which implements the Lawson-Hanson algorithm for non-negative least squares 33. A synthetic mixture spectrum was generated with the individual bacteria and their corresponding intensity coefficients. Similarity between the synthetic mixture spectrum and the experimental spectrum was calculated using cosine correlation 34:

$$ C(A, B) = \frac{s_A \cdot s_B}{\lVert s_A \rVert \, \lVert s_B \rVert} \qquad \text{(M4)} $$

where s_A and s_B are vectors representing the two spectra (Figure 1C).

The Jackknife Model

Quenouille first introduced the idea of jackknifing 35,36, which was further developed by Tukey 37, who also gave it its name. Given n data points X = {x1, …, xn} and a statistical estimator E(x1, …, xn), a jackknife leaves out one data point at a time, thus creating a sample set X(i) = {x1, …, xi−1, xi+1, …, xn}. From each such new sample, a value of the estimator E can be obtained. The distribution of values thus obtained closely matches the original distribution of E and can be used to estimate the robustness of the estimator. Later, Shao et al. found that the generalized version of the jackknife, which omits a fixed percentage of the data points, works well in practice even for non-smooth estimators 38. In our setting, given a sample spectrum with N peaks and a jackknife ratio r, a jackknife spectrum is a fictional spectrum constructed by randomly sampling (1 − r)N peaks from the sample spectrum. As the non-negative linear regression model is sensitive to unknown noisy peaks of high intensity in the spectrum, a jackknife spectrum (which omits a fraction r of all the peaks) may create a replicate with fewer such noisy peaks and thus admit more reliable estimates.

In our experiments, N_j = 100 jackknife spectra were generated for each input spectrum. Each jackknife spectrum (S_j) was decomposed into a linear combination of m = 6 reference spectra of different species (SP_j1, ..., SP_j6) in the database using the synthetic mixture model. The confidence score (conf) of each species (SP) was defined as the fraction of the jackknife spectra whose corresponding linear combination contains the species SP (Figure 1D):

$$ \mathrm{conf}_{SP} = \frac{\sum_{i=1}^{m} \mathrm{Count}(SP_{ji} = SP)}{N_j} \qquad \text{(M5)} $$

RESULTS AND DISCUSSION

Contribution of Individual Bacteria to the Mass Spectra of Bacterial Mixture

According to previous publications on MALDI-TOF MS analysis of bacterial mixtures, mass spectra of bacterial mixtures contain peaks originating from the spectrum of each individual bacterium composing the mixture. 11,22,23,26,27,39-41 The mass spectrum of a mixture of four bacteria, i.e. Klebsiella pneumoniae ATCC 700603 (KP), Staphylococcus aureus ATCC 25923 (SA), Klebsiella oxytoca ATCC 13182 (KO) and Escherichia coli ATCC 25922 (EC), in equal amounts (~10^5 cell copies each), and the mass spectra of each individual bacterium (~10^5 cell copies) are shown in Figure 2. It can be observed that (i) most peaks in the experimental mixture spectrum can be assigned to at least one individual bacterium; (ii) some peaks in the experimental mixture spectrum are shared by two or more types of bacteria; (iii) not all peaks in the individual mass spectra are observed in the mixture spectrum (in gray color, as shown in Figure 2); (iv) a few peaks are mixture-specific (in gray color, as shown in Figure 2). Mixture-specific peaks were reported in previous publications and were presumed to result from protein interactions. 11,27,41

To evaluate the presence of peaks from each individual bacterium in the mass spectrum of the bacterial mixture, the Jaccard coefficient (J), explained in detail in the Materials and Methods section, was calculated between the experimental mixture spectrum and the mass spectrum of each individual bacterium. KP provided the fewest peaks to the mixture spectrum (J = 0.15), while the other three types of bacteria provided more peaks (J = 0.25 for SA, J = 0.26 for KO and J = 0.27 for EC). A synthetic mixture spectrum of the quaternary mixture was then computer-generated from the mass spectra of the individual bacteria using the linear regression method, namely the synthetic mixture model (SMM), as illustrated in Figure 1B and 1C. The synthetic mixture spectrum with the highest cosine correlation similarity score (0.83) to the experimental mixture spectrum was the linear combination of KP, SA, KO and EC with intensity coefficients of 0.82, 0.65, 0.55 and 0.44, respectively (Figure 2).
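The way such intensity coefficients and similarity scores arise can be illustrated with a minimal, dependency-free Python sketch of the synthetic mixture model. It is not the authors' code: they used the Lawson-Hanson algorithm from the R package nnls, whereas this sketch swaps in a simple cyclic coordinate-descent solver for the same non-negative least-squares problem. All function names are hypothetical, and the spectra are assumed to be already aligned onto shared m/z bins.

```python
import math
from itertools import combinations


def nnls_coordinate_descent(columns, target, sweeps=500):
    """Minimise ||sum_i c_i * columns[i] - target||^2 subject to c_i >= 0."""
    coef = [0.0] * len(columns)
    residual = list(target)  # target minus the current fit
    for _ in range(sweeps):
        for i, col in enumerate(columns):
            norm2 = sum(v * v for v in col)
            if norm2 == 0.0:
                continue
            step = sum(v * r for v, r in zip(col, residual)) / norm2
            new = max(0.0, coef[i] + step)  # project onto c_i >= 0
            delta = new - coef[i]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, col)]
                coef[i] = new
    return coef


def cosine(a, b):
    """Cosine correlation (M4) between two aligned intensity vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def best_combination(sample, references, m):
    """Fit every m-subset of the candidate references; keep the best cosine score."""
    best = None
    for names in combinations(sorted(references), m):
        cols = [references[n] for n in names]
        coef = nnls_coordinate_descent(cols, sample)
        # Synthetic mixture spectrum (M2) for this candidate combination.
        synthetic = [sum(c * col[k] for c, col in zip(coef, cols))
                     for k in range(len(sample))]
        score = cosine(sample, synthetic)
        if best is None or score > best[0]:
            best = (score, dict(zip(names, coef)))
    return best
```

With n = 7 candidates and m = 6 components, as in the paper, `best_combination` would evaluate 7 candidate combinations per spectrum.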

Figure 2. Comparison of peaks in the mass spectrum of a mixture of four bacteria (KP, SA, KO and EC, ~10^5 cell copies each), the in-silico synthetic mixture spectrum of the four bacteria, and the mass spectra of each individual bacterium (~10^5 cell copies each).

The intensity coefficients are very different from statistics based on the numbers of peaks, such as the Jaccard coefficient and the percentage of presence 27, which only consider the ratio of peaks shared between the mixture spectrum and each individual spectrum. In SMM, we further take the intensity of peaks into consideration, providing a more reliable evaluation of the contribution of each individual bacterium to the mixture in generating MALDI-TOF spectra. Therefore, the in-silico synthetic mixture spectrum by SMM shows high similarity to the real mixture spectrum.
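For contrast with the intensity-based coefficients, the count-based Jaccard coefficient (M3) can be sketched as follows. The function names are hypothetical, and the greedy one-to-one matching of sorted peak lists is an assumption about how common peaks are counted; the paper only specifies the 2000 ppm tolerance.

```python
def count_common_peaks(mz_a, mz_b, tolerance=2000e-6):
    """Count one-to-one peak matches whose relative m/z difference is within tolerance."""
    a, b = sorted(mz_a), sorted(mz_b)
    i = j = common = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) / max(a[i], b[j]) <= tolerance:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def jaccard(mz_a, mz_b, tolerance=2000e-6):
    """Jaccard coefficient (M3): N_AB / (N_A + N_B - N_AB)."""
    n_ab = count_common_peaks(mz_a, mz_b, tolerance)
    union = len(mz_a) + len(mz_b) - n_ab
    return n_ab / union if union else 0.0
```

Only m/z values enter this statistic, so two spectra sharing the same peak positions score J = 1 regardless of how different their peak intensities are.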

The mixture in Figure 2 was artificially generated by mixing four types of bacteria, each giving approximately the same optical density (OD), hence similar cell numbers. The intensity coefficients differed between the individual bacteria but were of the same order of magnitude. This is reasonable because (i) the exact cell numbers can differ between bacteria with similar OD, and (ii) the proteins from different bacteria differ in quantity and in detection efficiency by MALDI-TOF. We further experimentally constructed 97 bacterial mixtures, including 33 binary, 41 ternary, 16 quaternary, 6 pentabasic and 1 hexabasic model mixtures, using 10 species of bacteria (listed in the Materials and Methods section). Boxplots of the intensity coefficients of the components in the mixtures, calculated by SMM using the mass spectra of the corresponding individual bacteria, are shown in Figure S-1. As in Figure 2, the intensity coefficients cannot precisely measure the relative cell numbers of each bacterium, but they reflect the order of magnitude. For the balanced mixtures, the median of the intensity coefficients decreased as the number of components in the mixtures increased. For asymmetric mixtures, components of higher abundance corresponded to larger intensity coefficients.

Performance of Bacterial Mixture Identification by Synthetic Mixture Model

Based on the SMM and cosine correlation, spectral matching was applied to characterize the 97 bacterial mixtures, which were treated as unknown samples during the characterization. A database containing the mass spectra of 1081 individual bacterial strains was used as the reference library (Data Set S-1). Regardless of the actual number of bacterial species in the mixtures, 6 components were always considered when deconvolving the mass spectra of the mixtures. The composition of the model mixtures and the results of deconvolution by the SMM are shown in Data Set S-2.

Figure 3. Receiver operating characteristic (ROC) curves for the identification of bacterial mixtures with intensity coefficient at the species level.

Figure 4. Performance of the identification of low-abundant components in bacterial mixtures without and with jackknife statistical assessment at the species level. (A) ROC curves for the identification of low-abundant components (intensity coefficient c ≤ 0.5) with intensity coefficients or jackknife conf scores. (B) Sensitivities and error rates for the identification of low-abundant components against intensity coefficients or jackknife (ratio r = 1/3) conf scores.

Bacteria with lower intensity coefficients in an identification result are more likely to be wrong matches. If we consider intensity coefficient c ≥ c0 as a threshold for identification, we can calculate the sensitivity (S) and error rate (ER) as:

$$ S = \frac{\mathrm{Count}(\text{correct and } c \ge c_0)}{\mathrm{Count}(\text{species in sample})} \qquad \text{(R1)} $$

$$ ER = 1 - \frac{\mathrm{Count}(\text{correct and } c \ge c_0)}{\mathrm{Count}(c \ge c_0)} \qquad \text{(R2)} $$

The receiver operating characteristic (ROC) curves for the identification were then obtained, as shown in Figure 3. The identification of binary and ternary mixtures with balanced cell numbers (~10^5 each per sample spot for MALDI-TOF analysis) of each type of bacterium (AUC = 96.2%) showed better performance than that of binary and ternary mixtures with asymmetric cell numbers (10^5:10^4, 10^5:10^5:10^4, 10^5:10^4:10^4 per sample spot for MALDI-TOF analysis) (AUC = 82.6%) and of quaternary to hexabasic balanced mixtures (~10^5 each per sample spot for MALDI-TOF analysis) (AUC = 82.5%). In asymmetric mixtures, bacteria with low abundance are more likely to have low intensity coefficients, resulting in low sensitivity at a given coefficient threshold.

Jackknife Statistical Assessment for the Identification of Bacterial Mixtures

For bacterial mixtures with asymmetric cell numbers for each type of bacterium, the intensity coefficients can be very different for each bacterium. A high intensity coefficient threshold (e.g. 0.1 to 0.2) would exclude most identification results of low-abundant bacteria and cannot be used to predict the accuracy of the identification of low-abundant bacteria in bacterial mixtures. As shown in Figure 4, for components with intensity coefficient c ≤ 0.5, the AUC value for identification based on c was only 68.6%. When the error rate was kept at 5% (2/43), the sensitivity was only 29% (41/143), i.e. > 70% of bacteria could not be identified.

Therefore, we have further introduced a jackknife model to provide a statistical assessment of the identification results. The workflow of the jackknife model is explained in Figure 1 and in detail in the Materials and Methods section. The jackknife ratio r was set to 1/10, 1/4, 1/3 and 1/2, respectively. Data Sets S-3 to S-6 show the identification results with conf. Considering confidence score conf ≥ conf0 as a threshold for reliable identification, the sensitivity (S) and error rate (ER) were calculated as:

$$ S = \frac{\mathrm{Count}(\text{correct and } \mathrm{conf} \ge \mathrm{conf}_0)}{\mathrm{Count}(\text{species in sample})} \qquad \text{(R3)} $$

$$ ER = 1 - \frac{\mathrm{Count}(\text{correct and } \mathrm{conf} \ge \mathrm{conf}_0)}{\mathrm{Count}(\mathrm{conf} \ge \mathrm{conf}_0)} \qquad \text{(R4)} $$

We observed that the identification of low-abundant bacteria (with c ≤ 0.5) was enhanced after jackknife assessment with jackknife ratio r = 1/4, 1/3 or 1/2 (Figure 4). When r = 1/3, the AUC value for identification increased to 70.7%. Using the conf scores, sensitivities increased markedly at the same error rates. When conf0 = 0.90, the error rate was ~5% (4/78), while the sensitivity increased from 29% (41/143) to 52% (74/143).
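The jackknife confidence score can be sketched in Python as follows. This is a minimal illustration rather than the authors' R code: `jackknife_confidence` and its parameters are hypothetical names, and `identify` stands for any callable that maps a peak list to the species reported for it (in the paper, the SMM deconvolution with m = 6).

```python
import random
from collections import Counter


def jackknife_confidence(peaks, identify, ratio=1 / 3, n_spectra=100, seed=0):
    """Fraction of jackknife spectra in which each species is reported (conf)."""
    rng = random.Random(seed)
    keep = round((1 - ratio) * len(peaks))  # (1 - r) * N peaks per jackknife spectrum
    hits = Counter()
    for _ in range(n_spectra):
        subset = rng.sample(peaks, keep)  # sampling without replacement
        hits.update(set(identify(subset)))  # count each species once per spectrum
    return {species: count / n_spectra for species, count in hits.items()}


def accept(confidences, threshold=0.88):
    """Keep the species whose confidence reaches the threshold."""
    return {s for s, conf in confidences.items() if conf >= threshold}
```

A species whose identification hinges on a single peak is reported only in the jackknife spectra that happen to retain that peak, so with r = 1/3 its conf score stays near 2/3 and it falls below a 0.88 threshold.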

For different jackknife ratios, plots of the sensitivity and error rate against the conf score threshold for all bacterial components are shown in Figure S-2. When jackknife ratio r = 1/3, with 0.88 as the confidence score threshold, the error rates were ~5%, while the sensitivities were 89% for binary and ternary balanced mixtures, and 67% for binary and ternary asymmetric mixtures and for quaternary to hexabasic balanced mixtures. The global sensitivity for all the samples was 73% (212/289), and the error rate was 3.6% (8/220). With the jackknife ratio r = 1/3 and the confidence score threshold conf ≥ 0.88, sensitivities and error rates in the identification of bacteria at the species level from binary to pentabasic mixtures are listed in Table 1. Error rates were ≤ 7% for the different kinds of bacterial mixtures. The sensitivity of identification decreased as the number of species in the mixtures increased. Most species in the balanced binary mixtures were correctly identified (sensitivity = 95%). However, it became difficult to identify all the species in the pentabasic mixtures, where the sensitivity was only 63%. For bacterial mixtures with asymmetric cell numbers, sensitivities were lower than for the mixtures with balanced cell numbers, i.e. 71%, 63% and 48% for binary mixtures (1:0.1), ternary mixtures (1:1:0.1) and ternary mixtures (1:0.1:0.1), respectively. Ion suppression accounts for the underrepresentation of low-abundance bacteria in the mass spectra of the corresponding mixtures. As a result, bacteria with low abundance contributed less information to the mass spectra of mixtures, making them difficult to identify.
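The sensitivity and error rate definitions can be checked against the global figures quoted above with a short Python sketch. The function name and the `(score, is_correct)` data layout are assumptions made for illustration; the same function applies whether the score is an intensity coefficient c or a jackknife conf value.

```python
def sensitivity_and_error_rate(results, n_species_in_samples, threshold):
    """Sensitivity and error rate for identifications scored at or above a threshold.

    `results` is an iterable of (score, is_correct) pairs, one per reported species.
    """
    reported = [ok for score, ok in results if score >= threshold]
    correct = sum(reported)
    sensitivity = correct / n_species_in_samples
    error_rate = 1 - correct / len(reported) if reported else 0.0
    return sensitivity, error_rate


# 212 correct and 8 incorrect identifications pass conf >= 0.88; 289 species were
# present in total. The scores themselves are placeholders, not measured values.
results = [(0.95, True)] * 212 + [(0.95, False)] * 8 + [(0.50, True)] * 10
s, er = sensitivity_and_error_rate(results, 289, threshold=0.88)
```

Here `s` evaluates to 212/289 (about 73%) and `er` to 8/220 (about 3.6%), matching the global values reported for r = 1/3 and conf ≥ 0.88.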


Table 1. Sensitivities and error rates in the identification of bacterial mixtures.

Mixture                              Sensitivity    Error rate
Balanced, binary                     95% (40/42)    7.0% (3/43)
Balanced, ternary                    86% (59/69)    4.8% (3/62)
Balanced, quaternary                 69% (44/64)    4.3% (2/46)
Balanced, pentabasic                 63% (19/30)    < 5% (0/19)
Asymmetric, binary (1:0.1)           71% (17/24)    < 5% (0/17)
Asymmetric, ternary (1:1:0.1)        63% (17/27)    < 5% (0/17)
Asymmetric, ternary (1:0.1:0.1)      48% (13/27)    < 5% (0/13)

Characterization of Co-cultured Blind-coded Bacterial Mixtures

The jackknife model provides a statistical assessment of the identification of bacterial mixtures by evaluating the confidence of identification. Higher confidence scores indicate better reliability of the reported components. However, the assessment does not propose new components to replace the ones with low conf scores. Therefore, we further calculated the conf score of each species in the reference database, and the species re-ranked by their corresponding conf scores (conf ≥ conf0) were given as the identification results.

We co-cultured different bacteria together to obtain blind-coded binary and ternary mixtures, where the final abundance of each type of bacterium was unknown during MALDI-TOF analysis and data interpretation. The blind-coded model mixtures were similar to real samples and were used to demonstrate the performance of the framework. Spectra of the blind-coded mixtures were acquired and analyzed using the SMM with the jackknife ratio r = 1/3 and conf ≥ 0.88. Table 2 shows the identification results. All the dominant species in each mixture were correctly identified. Three species (SA in sample A, EL in sample G, and KP in sample H) were correctly predicted by the SMM but ruled out because of low conf scores. Two species were wrongly predicted by the SMM (No. 2 in samples D and E), and correctly ruled out by the conf score. According to the plate culture counting, the two species were at very low abundance in the corresponding mixtures, i.e. < 9.3% for PA in sample D, and < 1.1% for SA in sample E.

Table 2. Identification of blind-coded mixture samples (sample IDs A to H).

      Composition                                          Identification results
ID    No.   Bacterium   Concentration by plate counting    Bacterium   conf
                        (×10^8 CFU/mL)
A     1     EC          42                                 EC          1.00
A     2     SA          20                                 SA          0.80
B     1     SA          6.4                                SA          1.00
B     2     EL          4.6                                EL          0.98
C     1     SA          1.4                                SA          1.00
C     2     KP          12                                 KP          0.89
D     1     EL          9.8                                EL          1.00

2

PA