Anal. Chem. 2007, 79, 9107-9114

Artificial Neural Network Study of Whole-Cell Bacterial Bioreporter Response Determined Using Fluorescence Flow Cytometry Sirisha Busam,† Maia McNabb,† Anke Wackwitz,‡ Wasana Senevirathna,† Siham Beggah,§ Jan Roelof van der Meer,§ Mona Wells,† Uta Breuer,‡ and Hauke Harms*,‡

Department of Chemistry, Tennessee Technological University, Cookeville, Tennessee 38505, Department of Environmental Microbiology, Helmholtz Center for Environmental Research, UFZ, D-04318, Leipzig, Germany, and Department of Fundamental Microbiology, University of Lausanne, CH-1015, Lausanne, Switzerland

Genetically engineered bioreporters are an excellent complement to traditional methods of chemical analysis. The application of fluorescence flow cytometry to detection of bioreporter response enables rapid and efficient characterization of bacterial bioreporter population response on a single-cell basis. In the present study, intrapopulation response variability was used to obtain higher analytical sensitivity and precision. We have analyzed flow cytometric data for an arsenic-sensitive bacterial bioreporter using an artificial neural network-based adaptive clustering approach (a single-layer perceptron model). Results for this approach are far superior to other methods that we have applied to this fluorescent bioreporter (e.g., the arsenic detection limit is 0.01 µM, substantially lower than for other detection methods/algorithms). The approach is highly efficient computationally and can be implemented on a real-time basis, thus having potential for future development of high-throughput screening applications.

In recent years, an alternative to common chemical methods of analysis has been developed, wherein the detection of chemical compounds by genetically engineered microorganisms prompts the expression of spectroscopically active reporter proteins.1-3 These organisms are often generically referred to as whole-cell living biosensors or, alternately, bioreporters. Bioreporters function as living transducers potentially capable of providing information on a compound's bioavailability, effect on living systems, and synergistic or antagonistic behavior toward biota in mixtures.1,3,4 Although numerous research groups report on strain engineering, only a few reports concentrate on optical methods for reporter

* Corresponding author. Phone: +49 341 235 2225. Fax: +49 341 235 2247.
† Tennessee Technological University.
‡ Helmholtz Center for Environmental Research.
§ University of Lausanne.

(1) Belkin, S. Curr. Opin. Microbiol. 2003, 6, 206-212.
(2) Kohler, S.; Belkin, S.; Schmid, R. D. Fresenius' J. Anal. Chem. 2000, 366, 769-779.
(3) van der Meer, J. R.; Tropel, D.; Jaspers, M. Environ. Microbiol. 2004, 6, 1005-1020.
(4) Kohlmeier, S.; Mancuso, M.; Deepthike, U.; Tecon, R.; van der Meer, J. R.; Harms, H.; Wells, M. Environ. Pollut. submitted for publication.

10.1021/ac0713508 CCC: $37.00 Published on Web 10/24/2007

© 2007 American Chemical Society

protein detection and/or algorithms for response characterization.5 One promising detection method is fluorescence flow cytometry (FCM). In FCM, individual cells constrained to the center of a fluid stream are interrogated with a laser beam to measure parameters such as cell size, shape/roughness, DNA content, surface receptors, enzyme activity, membrane permeability, calcium flux, etc. In a handful of reports, fluorescence response of bacterial bioreporters to analytes has been measured.6-10 This approach has obvious advantages:5 (1) a large number of cells can be analyzed individually and rapidly, (2) this potentially permits very small experiment sizes and high-throughput work, and (3) multiple data (e.g., cell size and multiple fluorophore intensities) of multivariate nature are readily collected and can be subjected to sophisticated data analyses.

One major difference of bioreporter response from chemical analysis is the intrapopulation variability for a single analyte concentration.11-14 Typically, some individual bacteria respond much more than the average and some do not respond at all (i.e., they behave as a negative control or blank). Though one can imagine a number of causative factors for this intrapopulation response variability of bioreporters (e.g., growth stage or cell cycle of individual bacteria), studies that examine the underlying reasons are as yet uncommon. Walt and co-workers have used optical well arrays to study gene expression kinetics and gene circuit noise for a recA (global DNA damage responder) bioreporter15 and in another effort have also employed artificial neural networks

(5) Wells, M. Curr. Opin. Biotechnol. 2006, 17, 28-33.
(6) Bahl, M. I.; Hansen, L. H.; Licht, T. R.; Sorensen, S. J. Antimicrob. Agents Chemother. 2004, 48, 1112-1117.
(7) Bahl, M. I.; Hansen, L. H.; Sorensen, S. J. FEMS Microbiol. Lett. 2005, 253, 201-205.
(8) Burmolle, M.; Hansen, L. H.; Sorensen, S. J. Microb. Ecol. 2005, 50, 221-229.
(9) Hansen, L. H.; Ferrari, B.; Sorensen, A. H.; Veal, D.; Sorensen, S. J. Appl. Environ. Microbiol. 2001, 67, 239-244.
(10) Norman, A.; Hansen, L. H.; Sorensen, S. J. Mutat. Res., Genet. Toxicol. Environ. Mutagen. 2006, 603, 164-172.
(11) Miller, W. G.; Brandl, M. T.; Quinones, B.; Lindow, S. E. Appl. Environ. Microbiol. 2001, 67, 1308-1317.
(12) Stiner, L.; Halverson, L. J. Appl. Environ. Microbiol. 2002, 68, 1962-1971.
(13) Wells, M.; Gösch, M.; Harms, H.; van der Meer, J. R. Microchim. Acta 2005, 151, 209-216.
(14) Kohlmeier, S.; Mancuso, M.; Tecon, R.; Harms, H.; van der Meer, J. R.; Wells, M. Biosens. Bioelectron. 2007, 22, 1578-1585.
(15) Kuang, Y.; Biran, I.; Walt, D. R. Anal. Chem. 2004, 76, 6282-6286.

Analytical Chemistry, Vol. 79, No. 23, December 1, 2007 9107

(ANNs) for analysis of chemical sensor data.16 Despite intrapopulation variability, there is typically a concentration range where the average bioreporter response is proportional to the analyte concentration (analogous to the linear dynamic range in chemical methods). Meanwhile, the existence of "more sensitive subpopulations" conveys the potential to identify and exploit their response, though a recent work of ours14 demonstrates that the most sensitive subpopulations exhibit poor response precision. Hence, it appears necessary to identify a subpopulation that possesses an optimum combination of sensitivity and precision. Large multivariate FCM data sets seem ideal for investigation of pattern recognition algorithms of data analysis. Therefore, the purpose of the present work is to use pattern recognition to identify the best-performing bioreporter subpopulation. This promises to improve the analytical performance of extant bioreporter populations and may also help to elucidate the biological reasons for intrapopulation variability as a basis for future strain optimization.

To our knowledge, none of the various algorithms for analysis of multivariate FCM data has ever been applied to bioreporters. In this paper we report the use of a single-layer perceptron artificial neural network (SLP-ANN) for cluster analysis of bioreporter response obtained with FCM. We primarily used Escherichia coli DH5α (pPROBE-arsR-ABS), which generates enhanced green fluorescent protein (eGFP) as a reporter molecule in response to arsenic. This strain is of interest because it has been used for field and lab measurements on the bioavailability of arsenic, an environmental toxin and public health hazard. For us it has the advantage of previously well-characterized response characteristics obtained by other detection techniques. We chose the SLP-ANN because of its past success in a wide variety of applications and the simplicity and computational efficiency of the particular SLP algorithm chosen. During the course of the work reported, an opportunity arose to analyze another set of FCM bioreporter data from collaborating investigators, and hence the algorithm optimized for the arsenic reporter was also tested on a second reporter strain, E. coli pHBP269A0, sensitive to 2-hydroxybiphenyl.

EXPERIMENTAL SECTION

Bacterial Strains, Culture Conditions, Activation. Two strains were used in this work. The primary strain used for algorithm development was E. coli DH5α (pPROBE-arsR-ABS), which produces eGFP in response to arsenic and was constructed as described elsewhere.17 For additional validation of the algorithm after initial development and testing, we used strain E. coli DH5α with plasmid pHBP269A0. This strain harbors the transcription activator HbpR and produces eGFP upon exposure to 2-hydroxybiphenyl (HBP).18 Because the full names are unwieldy, we henceforth refer to these strains as Str1-As (strain 1, responsive to arsenic) and Str2-HBP (strain 2, responsive to 2-hydroxybiphenyl). For Str1-As, established culturing and activation protocols13,19 were slightly adapted. Briefly, overnight cultures were prepared by inoculating 5-10 mL of Luria broth (LB, all LB media contained

(16) White, J.; Kauer, J. S.; Dickinson, T. A.; Walt, D. R. Anal. Chem. 1996, 68, 2191-2202.
(17) Stocker, J.; Balluch, D.; Gsell, M.; Harms, H.; Feliciano, J.; Daunert, S.; Malik, K. A.; Van der Meer, J. R. Environ. Sci. Technol. 2003, 37, 4743-4750.
(18) Jaspers, M. C.; Suske, W. A.; Schmid, A.; Goslings, D. A.; Kohler, H. P.; van der Meer, J. R. J. Bacteriol. 2000, 182, 405-417.
(19) Wells, M.; Gösch, M.; Rigler, R.; Harms, H.; Lasser, T.; van der Meer, J. R. Anal. Chem. 2005, 77, 2683-2689.


50 µg/mL kanamycin sulfate) with a single bacterial colony from an LB agar plate followed by incubation for 16 h at 37 °C and 150 rpm. Overnight cultures were diluted 1:50 in LB and then grown under the same conditions to an optical density at 600 nm (OD600) of 0.6, harvested by centrifugation (2500g), and resuspended at the same density in modified M9 medium (or MM9; MM9 is 100 mL of MM9 salts solution, 2 mL of 1 M MgSO4, 0.1 mL of 1 M CaCl2 solution, and 10 mL of 20% w/w glucose solution per liter; MM9 salts solution is 5 g of NaCl, 10 g of NH4Cl, 54.8 g of MOPS (3-(N-morpholino)propanesulfonic acid), 51.0 g of MOPS sodium salt, 0.59 g of Na2HPO4·2H2O, and 0.45 g of KH2PO4 in 1 liter of water, final solution adjusted to pH 7). We used MM9 medium to minimize growth and variations in eGFP production during extended exposure to arsenic. Arsenic was introduced as sodium arsenite (Merck, Germany) freshly diluted from a 0.05 M standard solution. Cells were induced for 4 h prior to collection for FCM analysis. For Str2-HBP, the growth procedure was the same but cultures were harvested at an OD600 of 0.4. For induction of Str2-HBP, 3 mL of E. coli suspension was mixed with HBP at concentrations between 1 and 100 µM for 3 h at 30 °C and 180 rpm rotary shaking.

Sample Preparation and Analysis. For Str1-As, 2.5 mL of cell suspension was fixed for subsequent FCM analysis by centrifuging, removing the supernatant, and adding 4 mL of 10% sodium azide solution for preservation. As eGFP has an optimal pH range for fluorescence,20 0.1 mL of 10 mM Tris of pH 8 was added. FCM measurements were performed using a MoFlo cell sorter (DakoCytomation) equipped with two water-cooled argon-ion lasers (Innova 90C and Innova 70C, Coherent). A total of 48 000 events were analyzed per sample at a rate of 200-500 s⁻¹. Data were extracted from .fcs format for subsequent ANN analysis as five-element vectors labeled as pulse width (PW), side scatter (SSC), forward scatter (FSC), nonspecific fluorescence from UV excitation (UV), and fluorescence from excitation of eGFP (GFP). Each vector corresponded to a measurement on a single bioreporter cell. For Str2-HBP, cells were diluted after induction to around 10⁶ per mL in PBS, and FACS Calibur (BD Biosciences, Erembodegem, Belgium) analysis was performed at 488 nm excitation and 535 nm emission wavelengths. FSC and SSC were detected along with eGFP, and data were acquired in the program CellQuest Pro (version 4.0.2, BD Biosciences).

Theory. FCM data are still often analyzed by histograms (univariate) or scatter plots (bivariate). Rapid and efficient pattern recognition techniques enable analysis of the whole of a multivariate data set at once and in an automated fashion. With the use of ANNs, two basic approaches are possible: identification and clustering,21 the latter being of importance here. Clustering partitions a data set of measurements on many cells into groups or clusters having distinctly different multidimensional signatures. There are several challenges to clustering of FCM data:21 (1) to exploit the multivariate nature of the data, dimensionality reduction is undesirable; however, with larger data sets this can be intractable for some algorithms requiring iterative solution, (2) many clustering algorithms cannot handle data sets with very different cardinalities and densities (i.e., different cluster patterns, populations, and sparseness), and (3) some clustering algorithms

(20) Haupts, U.; Maiti, S.; Schwille, P.; Webb, W. W. Proc. Natl. Acad. Sci. U.S.A. 1998, 95, 13573-13578.
(21) Boddy, L.; Wilkins, M. F.; Morris, C. W. Cytometry 2001, 44, 195-209.

depend on cluster density such that clusters narrowly ranged in one or more dimensions result in hypervolume approaching zero, destabilizing the numerical implementation of the algorithm. Adaptive resonance theory (ART)22,23 and real-time adaptive clustering (RTAC)24,25 employ an unsupervised learning ANN that performs sequential parameter estimation (SPE) and do not require repeated iterations over the whole data set. Thus, computational efficiency, size of the data set, and questions pertaining to cluster density are not an issue since each vector is processed independently. Here we use adaptive clustering for the analysis of Str1-As and Str2-HBP response data. The implemented ANN (Figure 1) is comprised of a single-layer perceptron feedforward (i.e., without recursive connections) and fully connected (i.e., every input node connected to every output node) configuration. Perceptrons map an input x (vector) to output values y (scalar) via nodal elements (circles). Mapping takes place via adjustment of weights, w, controlled by activations, a. Source code for the ANN work was implemented in Matlab. The discretized equation24 is

$$\Delta W = \frac{1}{n_i}\, a^{T} (o_t - aW) \qquad (1)$$

where $1/n_i$ is the learning rate for the ith addition to a particular cluster, $n_i$ is equal to the number of cells in cluster i plus 1, $a$ is a length m row vector denoting activations ($a^{T}$ is the transpose), $o_t$ is a length n test vector for each new datum tested, and $W$ is an m × n weight matrix. For the learning algorithm, input nodes are instantiated with the first vector in the data set, i.e., PW, SSC, FSC, UV, and GFP measurements for a single cell, defining the first cluster or class; activation is binary, based on a winner-take-all strategy (competitive learning), wherein elements of $a$ are 0 except for the ith element corresponding to the "winning node" whose value is 1; for the first vector, activation for the single output node is set at 1. With the learning rate equal to 1, eq 1 sets the cluster weight vector to equal the input vector. Henceforward, each subsequent vector in the data set is evaluated individually with respect to existing clusters via a distance function, s, ranging from 0 to 1. The test vector under evaluation is deemed to match a class if s is above a preset vigilance, θ, in which case the activation for the corresponding class output node is set to 1, all others set to 0, and the weight matrix is updated according to eq 1. If the test vector does not match an existing class, a new output node is added in the manner that the first class was instantiated. In this way, clustering can be controlled by choice of distance function and vigilance.

Figure 1. Graphical depiction of feed-forward SLP-ANN based on sequential parameter estimation (SPE).

(22) Ressom, H. W.; Natarajan, P. In Bioinformatics; Yan, P. V., Ed.; Nova Science Publishers: New York, 2005; pp 1-25.
(23) Song, X.-H.; Hopke, P. K.; Fergenson, D. P.; Prather, K. A. Anal. Chem. 1999, 71, 860-865.
(24) Fu, L.; Yang, M.; Braylan, R.; Benson, N. Pattern Recognit. 1993, 2, 365-373.
(25) Mucha, H.-J.; Bartel, H.-G. In Innovations in Classification, Data Science, and Information Systems; Baier, D., Wernecke, K.-D., Eds.; Springer: Berlin, 2005.

For s we used

$$s = 1 - \frac{\lVert o_t - o_j \rVert_2}{\lVert o_t \rVert_2 + \lVert o_j \rVert_2} \qquad (2)$$

where the numerator is the Euclidean distance between the test vector, $o_t$, and each cluster vector, $o_j$ ($j = 1, ..., m$), and the denominator is the sum of the two vector L2-norms. Vigilance must also range from 0 to 1; if θ = 0 then one large cluster must result, and as θ approaches 1, each test vector will define its own class, such that only identical vectors could coinhabit a class. With this algorithm, weights associated with a single class or output node j correspond to the mean vector of the cluster ($w_j = \sum^{n_i} o_j / n_i$).26 Weights adapt most rapidly at the beginning of cluster formation when the learning rate is changing most rapidly. Eventually, at an appropriate vigilance value, the clusters formed are well separated from each other; simple refinements of the basic process discussed in the results can be implemented to ensure optimized clustering and cluster stability. The algorithm chosen here is attractive for use with FCM data as its simplicity enables real-time application during measurements.

RESULTS AND DISCUSSION

Initial SLP-ANN Analysis of Bioreporter Response. One way to understand and exploit specific response characteristics of bioreporters has been to plot percentile rank of individual response magnitudes, as shown in Figure 2A for Str1-As response to arsenic. We readily see for all arsenic concentrations a substantial proportion of relatively inactive individuals and the appearance of a highly responsive subpopulation at higher concentrations. A Gaussian distribution would result in a sigmoidal shape to such a plot, with multiple, separate, normally distributed subpopulations resulting in a series of sigmoidal trends with multiple inflection points. Such plots for epi-fluorescence microscopy (EFM) data are somewhat different from the FCM plots, so caution is required when comparing EFM and FCM results.13 In particular, the discrimination between different arsenic concentrations by the more responsive subpopulation is better than with EFM.
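The sequential clustering procedure described in the Theory section (the update of eq 1, the distance function of eq 2, and the vigilance test) can be sketched in a few lines. The original work was implemented in Matlab; the Python below is only an illustrative sketch, not the authors' code. Several points are assumptions made for the sketch: the function names `similarity` and `adaptive_cluster` are hypothetical, eq 2 is read as a ratio of Euclidean norms, a match is accepted when s is greater than or equal to the vigilance, and when several clusters pass the vigilance test the best-matching one is taken as the winning node.

```python
import math


def similarity(o_t, o_j):
    """Distance function s of eq 2 (assumed form): 1 - |o_t - o_j| / (|o_t| + |o_j|)."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(o_t, o_j)))
    denom = math.sqrt(sum(a * a for a in o_t)) + math.sqrt(sum(b * b for b in o_j))
    if denom == 0.0:
        return 1.0  # both vectors are all zeros, so they are identical
    return 1.0 - diff / denom


def adaptive_cluster(vectors, vigilance):
    """Single sequential pass over the data; returns (cluster mean vectors, cluster sizes)."""
    weights = []  # one weight (mean) vector per output node
    counts = []   # number of cells assigned to each output node
    for o_t in vectors:
        # Winner-take-all: locate the existing cluster with the highest s.
        best, best_s = None, -1.0
        for j, w_j in enumerate(weights):
            s = similarity(o_t, w_j)
            if s > best_s:
                best, best_s = j, s
        if best is not None and best_s >= vigilance:
            # Eq 1 for the winning node, with learning rate 1/n_i where
            # n_i is the previous cluster size plus 1 (a running mean).
            counts[best] += 1
            rate = 1.0 / counts[best]
            weights[best] = [w + rate * (x - w) for w, x in zip(weights[best], o_t)]
        else:
            # No match: add a new output node initialized to the test vector.
            weights.append(list(o_t))
            counts.append(1)
    return weights, counts
```

In this sketch each cell is visited once, so the cost per cell scales with the number of existing clusters rather than with the size of the data set, which is the property the text relies on for real-time use.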
We first wanted to see if the SLP-ANN could identify a responsive population and track it across arsenic concentrations. From preliminary tests, we chose θ = 0.8 as an initial vigilance. For training with all 48 000 points, the average number of

(26) Bishop, C. M. Neural Networks for Pattern Recognition; Oxford University Press: Oxford, 1995.


Figure 2. Bioreporter intrapopulation heterogeneity of response represented as a percentile rank distribution plot (A), and representative SLP-ANN cluster eGFP values as a function of concentration for an FCM data set of 48 000 vectors trained with a vigilance of 0.8 (B). Response is in relative fluorescence units (RFU).

Figure 3. SLP-ANN response for a single class (cluster of interest) as a function of vigilance (A), given 48 000 training points, and as a function of training points (B), given a vigilance of 0.8.

multivariate clusters formed varied around 25 but generally yielded groupings without dividing the data into an intractable number of classes. We plot selected results of cluster-averaged eGFP response versus arsenic concentration in Figure 2B. Tracking clusters across the concentrations can present a problem as cluster stability perforce changes with increased vigilance, and at least the FCM parameter eGFP varies with the concentration. In some cases tracking is straightforward: cluster 2 (C2) in Figure 2B is readily tracked as it typically has the largest number of cells and a characteristic vector. Other clusters can be tracked by their self-similarity across concentrations (as measured by s) utilizing the multivariate FCM cluster vectors and the cluster size at the end of training as an additional parameter. However, some of the clusters in Figure 2B are more readily identified than others across arsenic concentrations or separate experiments. The situation can be simplified by focusing on those clusters that manifest the known13,19 linear increase in eGFP with arsenic concentration between ∼0-1 µM. C1-C3 meet this criterion, but only C2 is consistently identifiable in replicate experiments, so we concentrate on this cluster. We next investigated the effect of vigilance and number of training points on the responsive cluster of interest (Figure 3).


From studies using methods other than FCM, we knew that the linear response region for Str1-As is in the range of ∼0-1 µM. As vigilance increases (Figure 3A), sensitivity (slope) also increases slightly; however, at a vigilance of 0.9, the cluster of interest begins to fragment. The algorithm is, nominally, not very sensitive to the number of training points (Figure 3B), in that the same general trend is recovered; nonetheless, the level of reproducibility we desire, even within the same training data set, is not yet attained. From eq 1 we expect that 1000 training points would be sufficient for cluster stabilization, and this is consistent with our observation. The responsive cluster is 50% of the total population for 1000 training points, 42% for 7000, and 29% for 48 000, i.e., after stabilization the algorithm rejects an increasing number of incoming data. However, additional points continue to mold the final vector, and we see that results for 7000 and 48 000 training points are most similar to each other and most different from 1000. Training Approaches and Validation. Once trained, the algorithm requires verification that we can recover the same cluster from unprocessed data. We used the first 24 000 data points for any given sample for training, and the second 24 000 for validation. As the points are from the same sample, results should

be identical to within some level of uncertainty. Several approaches to training were tested: (1) feature reduction + development, (2) simple calibration, (3) reclustering, (4) targeted reclustering, (5) repicking, and (6) cluster merging; results from these are shown in Figure 4A-E.

Feature reduction + development involves an initial reduction of dimensionality followed by an increase. We tried this approach because of peculiarities of our data regarding the correlation of eGFP responses and DNA dye intensities. For each sample taken, a duplicate sample was stained with dye for DNA determination, but the staining process altered the eGFP response. Cross-correlation of the eGFP and DNA responses thus required us to identify the cluster of interest in the absence of the eGFP response. Dimensionality reduction compromised the performance of the algorithm, so we tried reincreasing dimensionality by calculating a new parameter from existing response parameters, based on observed differences of parameters between clusters (i.e., new features involved simple products of existing parameters, the approach deemed likely to create a more distinct feature). To test the efficacy of this approach, we used the same set of 24 000 points twice. We first clustered the data for each arsenic concentration and removed the eGFP response from the cluster vectors and the raw data, and second, we added a developed feature to each before using the resulting vector for reclustering the altered data (Figure 4A). For all features tested we see that the number of cells in the cluster differs greatly from that of the known cluster of interest, indicating that a different population is now sampled. Besides this, in all cases response sensitivity to arsenic is reduced.

Simple Calibration. The training set and validation set are independently clustered (results in Figure 4B). The cluster identified from the validation set is then compared with the arsenic response curve of the training set.
Though the general trend between training and validation is nominally reproducible, it is insufficient for the envisaged applications. There also appears to be good reproducibility but poor discrimination at low vigilance, whereas good discrimination goes along with poor reproducibility at higher vigilances. Reclustering, targeted reclustering, and repicking are nested approaches (winnowing techniques) assuming that a small subpopulation gives the optimal response. Reclustering involves clustering at a lower vigilance, θ1, to obtain an initial cluster large in number and multidimensional extent followed by reclustering within the cluster at higher vigilance θ2. Targeted reclustering assumes that reclustering needs guidance for reproducible cluster stabilization. Hence, at the reclustering stage the final average vector from the first clustering operation is preset as the weight for the first vector in the reclustering operation (θ2). Repicking tries to reduce the effects of outliers in the initial stages of cluster stabilization by reducing the dimensional extent of the initial cluster. We first cluster at a lower vigilance (θ1), then use the resulting vector to set a fixed cluster weight, and then recluster at a higher vigilance (θ2). The second step of the process is thus repicking points from the initial cluster on the basis of defined proximity from the average vector. Figure 4C shows results from targeted reclustering that reflect a trend in all three approaches; for an initial vigilance θ1, sensitivity and linearity are gradually improved with increasing secondary vigilance in the second step

(θ2). However, the reproducibility of these approaches is low (Figure 4D) with the poor performance of both targeted reclustering and repicking indicating low cluster stability. Cluster merging (Figure 4E) results in a final cluster that is the same size or larger than the initial size. To correct for assumed initial fragmentation of natural clusters, data are first clustered at a lower vigilance (θ1), and then the final cluster vectors for each cluster in a given sample are compared. Cluster vectors that match the cluster of interest to within a second higher vigilance (θ2) are then merged with the initial cluster. The improvement of both discrimination and reproducibility with increased vigilance is promising for quantitative purposes. Clustering at θ1 = 0.85 followed by merging at θ2 = 0.9 resulted in nearly superimposable training and validation runs (linear calibration from 0 to 0.67 µM arsenic for validation data gives relative errors for slope and intercept of 1.1% and 2.5%, respectively, relative to training data; R² values for training and validation are 0.995 and 0.999, respectively; Figure 4E). The best validation outcome is thus achieved with a technique that increases final cluster size. This is consistent with our analysis of the EFM single-cell response for a different bioreporter where larger subsamples gave superior precision.14 It is thus beneficial to exclude the non- or less-responsive and the very sensitive but highly variable subpopulations. Although cluster merging showed the greatest promise, it utilized a very large amount of data considering the decreasing returns in learning rate with each vector clustered. To use available data more efficiently, we divided each training set of 24 000 points into ensembles, clustered each, and averaged the results for the cluster of interest. For the FCM response data we found an optimum ensemble number n of 1000. Figure 5A shows the results for three independent experiments.
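The cluster merging and ensemble averaging steps just described can be sketched as follows. This is a hedged illustration, not the authors' Matlab implementation. The names `merge_clusters` and `ensemble_average` are hypothetical, the merged vector is assumed to be the size-weighted mean of the matching cluster vectors (the text does not state how the merged vector is formed), eq 2 is again read as a ratio of Euclidean norms, and `analyze` stands for any caller-supplied routine that clusters one ensemble and returns the vector of the cluster of interest.

```python
import math


def similarity(a, b):
    """Distance function s of eq 2 (assumed form): 1 - |a - b| / (|a| + |b|)."""
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    denom = math.sqrt(sum(x * x for x in a)) + math.sqrt(sum(y * y for y in b))
    return 1.0 if denom == 0.0 else 1.0 - diff / denom


def merge_clusters(weights, counts, target, theta2):
    """Merge every cluster whose mean vector matches cluster `target` with s >= theta2.

    Returns the size-weighted mean vector and the total number of cells merged.
    The target cluster always matches itself (s = 1), so it is always included.
    """
    total = 0
    summed = [0.0] * len(weights[target])
    for w, n in zip(weights, counts):
        if similarity(w, weights[target]) >= theta2:
            total += n
            summed = [acc + n * x for acc, x in zip(summed, w)]
    return [acc / total for acc in summed], total


def ensemble_average(vectors, ensemble_size, analyze):
    """Split the data into consecutive ensembles, analyze each, and average the results.

    A trailing chunk shorter than `ensemble_size` is ignored.
    """
    results = []
    for start in range(0, len(vectors) - ensemble_size + 1, ensemble_size):
        results.append(analyze(vectors[start:start + ensemble_size]))
    dim = len(results[0])
    return [sum(r[k] for r in results) / len(results) for k in range(dim)]
```

With 24 000 training points and an ensemble size of 1000, `ensemble_average` would call `analyze` 24 times; in the workflow of the text, `analyze` would cluster at θ1, pick the cluster of interest, and merge at θ2.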
The general trend is a linear response range up to ∼0.7-1 µM arsenic with excellent precision that worsens for higher arsenic concentrations. This is comparable to what we have previously observed for this strain using other techniques (Figure 5B). Using these data, we calculated method detection limits (MDLs) according to

$$\mathrm{MDL}_x = \frac{t_{0.05}\, s_{LR}}{m} \qquad (3)$$

where $\mathrm{MDL}_x$ is the detection limit expressed as arsenic concentration, $t_{0.05}$ is the t value of the Student's t distribution for 95% confidence, $s_{LR}$ is the standard error of the intercept from linear regression, and $m$ is the slope from linear regression. Figure 5C compares MDLs. Finally, we examined parameters other than eGFP. These do not vary substantially for the cluster of interest over the tested arsenic concentration range (Figure 6).

Comparison of the FCM/SLP-ANN results to those obtained using fluorescence correlation spectroscopy, EFM, and steady-state fluorimetry (Figure 5B) reveals that (1) though it mimics the general trend up to the concentration yielding maximum response ($c_{max}$), the sensitivity of FCM (slope of linear response) is inferior and (2) FCM does not show the diminution observed above $c_{max}$ for all other techniques. The latter is useful as optimum curves are a source of ambiguity because two distinct concentrations above and below $c_{max}$ could lead to the same response value. The lower sensitivity of FCM results is well offset by its superior precision obtained via the final validated process. The average R²

Figure 4. Results from SLP-ANN training and validation exercises: (A) dimensionality reduction + feature development compared to basic clustering (vigilance = 0.8 and 24 000 training points for both), (B) comparison of training and validation at different vigilances using simple calibration for 24 000 points, (C) example of general trend for winnowing techniques represented by targeted reclustering: comparison of training and validation with an initial clustering vigilance of 0.6 and subsequent reclustering vigilances of 0.7, 0.8, and 0.9 (24 000 points), (D) comparison of training and validation for winnowing techniques using an initial vigilance of 0.6 followed by a postprocess vigilance of 0.9 (24 000 points), (E) comparison of training and validation results from clustering at vigilances of 0.55, 0.7, and 0.85 followed by cluster merging with a vigilance of 0.9 (24 000 points).


Figure 6. Representative plot of all FCM parameters as a function of arsenic concentration (selected data associated with Figure 5A).

Figure 7. Plot of Str2-HBP GFP response to HBP, measured by FCM and analyzed by SLP-ANN using the same procedure as for Str1-As results in Figure 5A (24 000 training points with an initial vigilance of 0.85, cluster merging at 0.9, and an ensemble size of 1000 points). Also shown are data for SSC and FSC (average of training and validation data, superimposable results on the scale of this plot).

Figure 5. (A) Final validation using ensemble averaging of merged clusters: comparison of training and validation for three independent replicate experiments (results for 24 000 training points with an initial vigilance of 0.85, cluster merging at 0.9, and an ensemble size of 1000 points). (B) Comparison of representative FCM response data for Str1-As from the present study (panel A) with other instrumental/algorithmic approaches including fluorescence correlation spectroscopy, EFM, and steady-state fluorimetry. (C) MDLs for arsenic using the methods represented in panel B.
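The MDL calculation of eq 3, which underlies panel C, can be illustrated with a short worked sketch. This is not the authors' code. It assumes an ordinary least-squares fit of response against concentration over the linear range, and the names `linear_fit` and `method_detection_limit` are hypothetical. The t value is supplied by the caller (looked up for the chosen confidence level and n − 2 degrees of freedom) rather than computed here.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares line; returns (slope, intercept, standard error of intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual standard deviation with n - 2 degrees of freedom.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(ss_res / (n - 2))
    se_intercept = s_yx * math.sqrt(1.0 / n + mean_x ** 2 / sxx)
    return slope, intercept, se_intercept


def method_detection_limit(conc, response, t_value):
    """Eq 3: MDL_x = t * s_LR / m, in the same units as `conc`."""
    slope, _, se_intercept = linear_fit(conc, response)
    return t_value * se_intercept / slope
```

A perfectly linear calibration gives a standard error of zero and hence an MDL of zero; scatter about the line raises the standard error of the intercept and therefore the MDL, while a steeper slope lowers it.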

value for the linear regressions in Figure 5A is 0.996 (not apparent on the plot due to the log scale). The absolute relative error in slope for the linear region between training and validation averages 1.4%, and the respective absolute relative error in intercept is 3.8%. This compares to the average relative standard errors from regression itself of 3.5% and 13% for slope and intercept, respectively. As better precision increases sensitivity, superior discrimination of concentrations is possible, in particular of the MDL from a blank. The described FCM-ANN method exhibits the lowest MDL of the instrumental approaches to measurement of this bioreporter response that we have examined to date.

Aside from eGFP, for the other parameters measured, between the training and validation tests the average absolute relative error for all concentrations over independent replicate experiments is 2.5%, the variability of the UV laser signature being the greatest with 4.6% for all trials. For the data in Figure 6, we tested the similarity of the final vector (cluster of interest) for each concentration to those for all of the other concentrations using the distance function, but deleting the eGFP response from each vector, and we find that response is identical across the entire concentration range, usually to within s ∼ 0.98. The greatest deviation (s ∼ 0.95) is always that for the response at 6.7 µM arsenic, the concentration where the transition from increasing response begins. Transcriptome or proteome analysis may be a way to find an explanation for this. We also see a trend among clusters exhibiting higher eGFP signals of higher average values for the parameters PW, SSC, FSC, and UV. However, our experience with single-cell

responses indicates higher uncertainty in data from the most-responsive subpopulation, which may also result in problems of reliable cluster identification. The cluster finally chosen here is not the most responsive but is reproducibly identified across the arsenic concentrations for all experimental trials.

The primary goal for this work was to develop and validate the SLP-ANN for the particular application of arsenic biosensing. However, the validation results in Figure 5 encouraged us to apply the method to another bacterial bioreporter responsive to HBP (strain Str2-HBP). These trivariate (SSC, FSC, GFP) FCM results were obtained in a different cytometry lab on different equipment. We processed results exactly as described for the Str1-As final validation work, i.e., thresholds and ensemble size may not be optimized for the HBP data set. Results from such a nonoptimized trial (Figure 7) demonstrate the expected qualitative trend and, hence, promise for the transferability of the FCM/SLP-ANN approach.

FCM is an important technique in diagnostic biotechnology, and ongoing development of bioreporter technology will expand its scope even further. Recent advances in microfluidics suggest a future for this technique in high-throughput screening applications. The approach here is automated and computationally efficient enough to be employed in real time. Postprocessing procedures such as cluster merging and ensemble averaging negligibly influence the computation time for the work reported here. Thus, the simplicity and rapidity of the approach we describe here open the door for more complex experiments involving additional measurement parameters such as additional laser signals from targeted staining and/or multicolor reporter molecules.

ACKNOWLEDGMENT

We thank Dan Eckery and the technical support staff at Dako for fruitful discussions concerning instrument performance and data architecture, the Saxonian State Ministry for Environment and Agriculture for funding part of this work, and the UFZ Helmholtz Center for Environmental Research for hosting personnel from Tennessee Technological University.

Received for review June 26, 2007. Accepted September 12, 2007.

AC0713508