Environ. Sci. Technol. 2003, 37, 4554-4560
Statistical Evaluation of Bacterial Source Tracking Data Obtained by rep-PCR DNA Fingerprinting of Escherichia coli
JOHN M. ALBERT,† JUNKO MUNAKATA-MARR,*,† LUIS TENORIO,‡ AND ROBERT L. SIEGRIST†
Environmental Science & Engineering Division and Department of Mathematical & Computer Sciences, Colorado School of Mines, Golden, Colorado 80401
Pattern recognition has been applied to environmental systems for identification of numerous pollution sources including aerosolized lead and petroleum hydrocarbons. In recent years, DNA fingerprinting has gained widespread application as a means to characterize genetic variations for such purposes as microbial source tracking. This approach, however, is strongly dependent on the statistical and image analyses applied. Several statistical analyses of rep-PCR DNA fingerprints were assessed as a means to differentiate between potential sources of fecal contamination. GelCompar II and methods based on penalized discriminant analysis (PDA) and k-nearest neighbors (KNN) classification procedures were used to differentiate between 10 source groups within a library containing DNA fingerprints of 548 Escherichia coli isolates from known human and nonhuman sources. KNN performed significantly better than PDA in a jackknife analysis, though the library was not large enough to detect significant differences between GelCompar II and the other two methods. GelCompar II and KNN both attained ≥90% correct classification in a holdout procedure. In addition, interpoint distance analyses indicated coherency within source groups, while library randomization demonstrated that KNN does not create artificial groupings. This investigation stresses the need to understand limitations of statistical analyses used in pattern recognition of DNA fingerprints.
* Corresponding author phone: (303)273-3421; fax: (303)273-3413; e-mail: [email protected]. † Environmental Science & Engineering Division. ‡ Department of Mathematical & Computer Sciences.
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 37, NO. 20, 2003
Introduction
Fecal contamination is a widespread problem throughout the United States that adversely affects surface and groundwaters and raises health concerns about their quality as drinking water sources. In a draft report, the U.S. EPA stated that 21 000 water bodies within the United States were impaired due to fecal contamination from both animal and human sources (1). In addition, the latest CDC waterborne disease outbreak report indicated 17.9% and 87% increases in outbreaks associated with drinking surface and groundwaters, respectively, since the previous report (2). In the past 10 years, identifying sources of fecal contamination using bacterial source tracking (BST) techniques
has become a large field of study, as evidenced by recent reviews (3, 4). The goal of all BST methods is to differentiate between sources of fecal contamination in order to effectively direct mitigation efforts. In general, BST methods rely on unique biomarkers from indicator organisms, biomarkers that are, ideally, specific to host population groups. Many BST methods generate phenotypic or genotypic profiles to characterize bacteria from known source groups. These profiles are then subjected to pattern recognition protocols, analogous to other pollution source-tracking applications (5-10). This approach, however, is strongly dependent on the statistical and, in the case of DNA fingerprinting, the image analyses applied (11).
Analytical methods used for the purposes of BST include antibiotic resistance assays, host-specific molecular markers, ribotyping, and repetitive element polymerase chain reaction (rep-PCR) (12-19). rep-PCR, a DNA amplification procedure that targets repetitive units within bacterial genomic DNA (20), has been used in BST because it is able to differentiate between organisms at the strain level (21-24). As an analytical tool, rep-PCR reproducibility has been demonstrated within single-researcher studies (25, 26). However, slight changes in protocol have been shown to affect the DNA fingerprints produced, making it difficult to reproduce results in different laboratories (27). DNA fingerprints can be analyzed using either banding patterns (band-based) or densitometric curves (curve-based). The latter method is preferred because manual band-based scoring introduces bias (28); curve-based analysis should provide less biased analyses of DNA fingerprints (26, 28, 29).
Library-based BST methods involve the collection of indicator organisms from known host groups. Biomarkers obtained from the indicator organisms are stored in a database according to the host group of origin.
Various classification methods such as discriminant analysis, cluster analysis, and principal component analysis have been used in library-based BST (13, 14, 16-18, 30, 31). The goals of these techniques are first, to describe the observations within the library and second, to assign new observations to the correct host. Many of these techniques are performed with commercial software packages that offer convenient templates. However, the limitations of the classification algorithms used in these closed-source software packages are not readily apparent. Such limitations include algorithm assumptions, library size, classification rate uncertainties, and input data range. Using the aforementioned techniques and/or commercial software, a wide range of average rates of correct classification (ARCC), from 67 to 87%, has been reported based on libraries ranging in size from 154 to several thousand indicator organisms (13, 14, 17-19).
Identification of contaminant sources using BST relies on the associations between indicator organisms and host groups. Evidence suggests that enteric bacteria may exhibit ecological structure due in part to host adaptation and geographic location (32). Some studies indicate that this ecological structure may be maintained in natural populations of E. coli (15, 33-35). Previous attempts to explore indicator-host correlations (i.e., source group coherency) include analysis of variance, randomization procedures, and G statistics (32, 33, 36, 37). New methods of exploring source group distributions are necessary to more directly assess specificity and identify strains of unknown origin. One such method is interpoint distances, a nonparametric procedure that compares multivariate probability distributions.
BST methods are promising, but several factors currently limit their effective application. One is the lack of both a uniform statistical framework across BST studies and a thorough understanding of the statistical approaches used in BST and their limitations (4). Understanding statistical limitations will increase the reliability of the associations made between biomarkers from indicator organisms and host groups. The predictive values of targeted biomarkers must be well understood prior to environmental application. Finally, reproducibility of analytical procedures within and between laboratories must be demonstrated, as it is essential if BST is to be broadly and effectively implemented. Sufficient reproducibility of genotypic techniques has been established within a single laboratory, but comparing results of the same techniques between laboratories has proved difficult (25, 27, 28, 38-42).
In this investigation, Escherichia coli was isolated from fecal specimens collected from various source groups. rep-PCR was performed on the E. coli isolates, and the resulting densitometric curves were stored in a database according to the source group of origin. GelCompar II, penalized discriminant analysis (PDA), and k-nearest neighbors (KNN) classification procedures were evaluated to determine if they could separate host groups and correctly classify new observations. Interpoint distances were used to investigate host group coherency based on rep-PCR fingerprints, and library randomization was used to check for bias in the classifiers and to illustrate the effects of library size.
10.1021/es034211q CCC: $25.00 © 2003 American Chemical Society. Published on Web 09/18/2003
Materials and Methods
Sample Collection. Fecal samples from indigenous animals (bear, elk, goose, mule deer) as well as domestic animals (cow, horse, dog) were collected from a watershed located in a subalpine region of central Colorado. Additionally, elk samples were collected from an elk ranch in southwestern Colorado. Samples from humans, domestic animals (cow, dog, horse, sheep, swine), and some wild animals (deer, elk, goose) were collected from the eastern foothills of central Colorado. Dry samples (10 g) were homogenized with 40-60 mL of autoclaved distilled water prior to isolation. Human samples were collected from both human volunteers (rectal swabbing) and effluent from household on-site wastewater systems. Rectal swabs were stored at 4 °C and processed within 2 h of swabbing. Samples collected from on-site wastewater effluent were filtered using membrane filtration as described in Standard Methods for the Examination of Water and Wastewater (43).
Isolation of E. coli from Raw Fecal Samples. The isolation method used in this investigation was adapted from Dombeck et al. (13). Isolates had to exhibit typical E. coli appearance on mFC, MacConkey, ChromAgar, Indole, Methyl Red, Simmons Citrate, and EC-MUG media to be considered E. coli. All samples were incubated at temperatures recommended by Standard Methods (43). Positive and negative controls, E. coli (ATCC 11775) and Klebsiella pneumoniae (ATCC 27736), respectively, were used during each sample run. Positive isolates were stored in a 50% glycerol solution at -80 °C. A maximum of five E. coli strains were isolated from each sample collected, with the exception of the two on-site wastewater effluent samples, from which a total of 43 strains were isolated. The total number of samples and isolates collected is summarized in Table 1.
PCR and Electrophoresis Conditions. Isolates were streaked onto plate count agar (43) prior to rep-PCR analysis. Whole cell suspensions of E. 
coli isolates underwent a mild alkaline lysis prior to PCR (26). The PCR reaction mixture of Rademaker and de Bruijn (26) was modified to a final concentration of 0.3 µg/mL BOXA1R primer (5′-CTACGGCAAGGCGACGCTGACG-3′). The temperature program of Dombeck et al. (13) was used. PCR products were stored at -20 °C until use. Twelve microliters of PCR product (including dye) was loaded into wells of a 1.5% agarose gel prestained with 0.5 µg/mL ethidium bromide. The first, middle, and last lanes of the gel were loaded with a 1Kb+ ladder (Invitrogen, Carlsbad, CA). Electrophoresis was conducted at 4 °C at 60 V for 13-15 h. A peristaltic pump was used to recirculate the running buffer during electrophoresis.

TABLE 1. Rates of Correct Classification Using GelCompar II, PDA, and KNN with 1 Nearest Neighbor (1-NN) with the Large Library

                                                                 rate of correct classification (%)
source group   no. of fecal   no. of E. coli   no. of fingerprint   GelCompar II   PDA    1-NN
               samples        isolates         patterns
bear           6              26               5                    92.3           88.5   96.2
deer           6              21               5                    90.5           76.2   85.7
elk            22             79               24                   86.1           86.1   92.4
goose          15             42               13                   97.6           73.8   97.6
cow            32             109              62                   74.3           75.2   83.5
dog            20             69               20                   86.8           82.6   85.5
horse          16             53               21                   94.3           77.4   92.5
sheep          11             29               9                    86.2           72.4   86.2
swine          13             34               13                   85.3           94.1   91.2
human          8              86               22                   90.7           82.6   95.3
ECC                                                                 86.7           80.7   90.1
nonhuman                      462              180                  99.6           98.6   99.8
human                         86               28                   90.7           88.2   95.3
ECC                                                                 98.2           98.0   99.1

Image Analysis. Electrophoresed gels were imaged in a Chemigenius darkroom (Syngene, Frederick, MD) using the Genesnap v3.0 application. 8-bit TIFF images of the gels were imported into GelCompar II v3.0 (Applied Maths, Sint-Martens-Latem, Belgium). All gel images were normalized by using fragments of the 1Kb+ ladder between 100 bp and 2000 bp as reference positions, as described in the GelCompar II manual. A 50% signal-to-noise ratio cutoff was used to determine which gels could be used for further analysis. Resulting densitometric curves underwent background subtraction to reduce the amount of artificial trend in each curve as described in the GelCompar II manual.
Assessment of Fingerprint Variability. To determine the reproducibility of the DNA fingerprints (densitometric curves), experiments were conducted to quantify densitometric curve variations due to gel normalization, PCR reaction, DNA loading, and thermocycler differences. Cluster analysis (see classification methods) was performed to assess the similarity of curves generated from the same isolate. Gel normalization: Two images of the same gel, containing DNA from 23 different swine isolates, were taken on the same day and were normalized once using the 1Kb+ ladder on two different days by the same researcher. PCR reaction: Multiple PCR reactions using two isolates (goose and human) as templates were electrophoresed on a single gel. Nine wells were filled with PCR product from nine PCR reaction tubes containing the same goose isolate. Seven wells were prepared similarly using seven PCR products from seven PCR reaction tubes containing the same human isolate. 
DNA loading: Volumes of 5, 10, 15, and 20 µL from multiple PCR reactions using two isolates (human and deer) as templates were electrophoresed. This sample volume array was performed in duplicate on the same gel for a total of eight wells for each of the two isolates. Thermocycler variation: Using a Techne Touchgene Gradient (Princeton, NJ) and a Perkin-Elmer GeneAmp 2400 (Norwalk, CT), four replicate PCR reactions were run with the same horse template. A similar experiment was run using the Applied Biosystems GeneAmp 9700 (Norwalk, CT) and GeneAmp 2400 thermocyclers using two different horse isolates as templates for PCR reactions. Variable amounts of
PCR product from each reaction were electrophoresed in this experiment.
Classification Methods. Details of the classification methods used in this study can be found in the Supporting Information. All of the classification methods considered in this investigation use densitometric curves as input data. This type of data is high-dimensional, as each curve is a sequence of observations (i.e., points) corresponding to 494 different relative optical densities. Classification of such high-dimensional data is difficult not only because of computational cost but also because it is much more difficult to estimate and describe high-dimensional distributions. Ideally one should find lower dimensional representations that capture as much of the information provided by the data as possible. Using wavelet transforms (implemented using WaveLab v. 802, www-stat.stanford.edu/~wavelab), each densitometric curve is represented as the sum of a smooth baseline and detail components at finer scales. Wavelet transforms are similar to Fourier transforms but are better for capturing transient or localized features in signals (44). Wavelet representations of densitometric curves help denoise the data, provide lower dimensional representations, and also help reduce the correlation structure in the variables that may introduce preferential directions in feature space. The versions of penalized discriminant analysis (PDA) and k-nearest neighbors (KNN) used in this study work on the wavelet coefficients of the densitometric curves and capture smoothness information of the curves, which increases classification power. PDA is a roughness penalty approach to linear discriminant analysis that enforces smoothness in the discriminants (45). Linear discriminant analysis (LDA) is a particular case (no smoothness imposed) of PDA. 
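The wavelet-based dimensionality reduction described in this subsection can be illustrated with a minimal sketch. It assumes an orthonormal Haar wavelet and zero-padding of each 494-point curve to 512 points; the text states only that WaveLab was used, so the wavelet family, the padding, and the function names `haar_transform` and `compress_curve` are all hypothetical choices made for illustration.

```python
import math


def haar_transform(signal):
    """Orthonormal Haar wavelet transform of a sequence whose length is a power of two.

    Coefficients are returned coarse to fine: the single smooth (baseline)
    coefficient first, then detail coefficients at progressively finer scales.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    root2 = math.sqrt(2.0)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Prepend so that coarser detail levels end up ahead of finer ones.
        details = [(a - b) / root2 for a, b in pairs] + details
        approx = [(a + b) / root2 for a, b in pairs]
    return approx + details


def compress_curve(curve, padded_length=512, keep=64):
    """Zero-pad a densitometric curve (ASSUMPTION) and keep its `keep` coarsest coefficients."""
    padded = list(curve) + [0.0] * (padded_length - len(curve))
    return haar_transform(padded)[:keep]


# Synthetic 494-point "curve" with two Gaussian-shaped bands (made-up data).
curve = [
    math.exp(-((i - 120) / 6.0) ** 2) + 0.5 * math.exp(-((i - 300) / 10.0) ** 2)
    for i in range(494)
]
features = compress_curve(curve)  # 64 numbers instead of 494
```

Keeping only the coarsest coefficients discards fine-scale detail, which is one simple way to obtain the kind of 64- or 128-dimensional representation the classifiers operate on; other coefficient-selection rules are equally plausible.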
PDA is more applicable than LDA to small libraries because it can be used even when the number of curves is smaller than the number of observations per curve, but both require that the different source groups share the same covariance matrix and work best when the probability distributions are Gaussian. On the other hand, KNN does not make distributional assumptions; it assigns a new observation to the group corresponding to the membership of the majority of the k-nearest neighbors. For example, 3-NN looks at the membership of the three nearest neighbors; if two of the three are goose, the new observation will be classified as goose. In the absence of a majority, new observations are classified by randomizing the assignment. Neighbors are defined by a measure of distance, in this case defined in the space of wavelet coefficients. Our procedures were implemented in Matlab (v. 6.5, The MathWorks, Inc., Natick, MA). To compare our procedures to those commonly used in practice, the cluster analysis and library modules of GelCompar II were used for cluster analysis and classification, respectively. Isolates were classified using similarity coefficients produced by a clustering procedure with Pearson product-moment correlation coefficients and unweighted pair group method with arithmetic mean (UPGMA) clustering.
A library containing 548 total isolates representing 10 source groups was analyzed to assess the classification methods. Because the number of observations varies significantly between source groups in the large library, the total fraction of correct classifications was used as an overall performance measure of the classification methods; it is an average of source group jackknife correct classifications weighted by the group sizes. This estimate of correct classification (ECC) is different from the unweighted average of correct classifications (average rate of correct classification; ARCC) that other researchers have used (13, 14, 17, 18). 
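The KNN rule and the difference between ECC and ARCC can be sketched as follows. This is an illustration rather than the authors' Matlab implementation: it assumes Euclidean distance, a leave-one-out jackknife, and random tie-breaking among the groups with the most votes (the text says only that the assignment is randomized when there is no majority). The function names are hypothetical; the per-group rates and isolate counts are the 1-NN values from Table 1.

```python
import math
import random
from collections import Counter


def knn_classify(query, library, k=1, rng=random):
    """Classify `query` by vote among its k nearest library entries.

    `library` is a list of (feature_vector, group_label) pairs.
    ASSUMPTION: ties for the most votes are broken uniformly at random.
    """
    neighbors = sorted(library, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    best = max(votes.values())
    tied = sorted(label for label, count in votes.items() if count == best)
    return tied[0] if len(tied) == 1 else rng.choice(tied)


def jackknife_rcc(library, k=1):
    """Leave-one-out rate of correct classification (%) per group, plus group sizes."""
    correct, total = Counter(), Counter()
    for i, (vector, label) in enumerate(library):
        rest = library[:i] + library[i + 1:]
        total[label] += 1
        if knn_classify(vector, rest, k) == label:
            correct[label] += 1
    return {g: 100.0 * correct[g] / total[g] for g in total}, dict(total)


def ecc_and_arcc(rcc, sizes):
    """ECC weights each group's rate by its size; ARCC is the unweighted mean."""
    n = sum(sizes.values())
    ecc = sum(rcc[g] * sizes[g] for g in rcc) / n
    arcc = sum(rcc.values()) / len(rcc)
    return ecc, arcc


# 1-NN rates of correct classification (%) and isolate counts from Table 1.
rcc_1nn = {"bear": 96.2, "deer": 85.7, "elk": 92.4, "goose": 97.6, "cow": 83.5,
           "dog": 85.5, "horse": 92.5, "sheep": 86.2, "swine": 91.2, "human": 95.3}
n_isolates = {"bear": 26, "deer": 21, "elk": 79, "goose": 42, "cow": 109,
              "dog": 69, "horse": 53, "sheep": 29, "swine": 34, "human": 86}
ecc, arcc = ecc_and_arcc(rcc_1nn, n_isolates)
```

With the Table 1 values, the size-weighted average reproduces the reported 1-NN ECC of 90.1%, while the unweighted mean (ARCC) is slightly higher because the large cow group, which has the lowest rate, counts no more than any other group.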
In addition to the jackknife analysis, a holdout test was performed to assess classification accuracy. After removing 30 isolates at
random from the library, the remaining library was used to select classifier parameters, and the holdouts were then classified against the new library using KNN. Classification of the same holdouts in GelCompar II was done using the Library module set to maximum similarity.
Library and Source Group Assessment. A library should (i) be representative of the true population distribution of isolates, (ii) consist of well-separated groups, which is determined by the specificity of the fingerprints, and (iii) be large enough to lead to small classification uncertainties. There are many ways to check for group specificity; two used in this study were library randomization and interpoint distances. If the fingerprints are group-specific, a partial randomization should decrease classification rates. If no bias is introduced by the classification methods, a full randomization should decrease the classification rates to those expected under random group assignments (i.e., by chance). The library was fully randomized by randomly assigning every fingerprint to one of the 10 source groups, keeping the same number of observations in each source group as in the unrandomized library. The library was also partially randomized according to fingerprint pattern (i.e., groups of >90% similar fingerprint patterns within each source group were randomly assigned to one of the 10 source groups). In addition, subsets of the library were selectively generated to create small libraries containing 80 total isolates, evenly distributed among five source groups, to assess the effect of library size and fingerprint heterogeneity on the performance of the classification methods. Though geographical location was not considered in creating the small libraries, the source groups in the two libraries described here had geographic distributions similar to that of the large library. 
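The two randomization schemes can be sketched as label operations on the library. This is a minimal illustration with hypothetical function names. Full randomization is written as a permutation of the labels, which keeps every group at its original size as described above. For partial randomization the sketch assumes that each within-group cluster of similar fingerprints is moved as a unit to a uniformly chosen group and that group sizes are not preserved; the text does not specify either detail. The chance rate is taken as one over the number of groups.

```python
import random
from collections import Counter


def fully_randomize(labels, rng):
    """Permute group labels across fingerprints; every group keeps its original size."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def partially_randomize(labels, pattern_ids, rng):
    """Move each within-group cluster of similar fingerprints to a random group.

    `pattern_ids[i]` identifies the fingerprint pattern of isolate i within its
    source group. ASSUMPTION: group sizes are not preserved by this step.
    """
    groups = sorted(set(labels))
    new_group = {}
    randomized = []
    for label, pattern in zip(labels, pattern_ids):
        key = (label, pattern)
        if key not in new_group:
            new_group[key] = rng.choice(groups)
        randomized.append(new_group[key])
    return randomized


def chance_rate(labels):
    """Expected correct-classification rate if groups were assigned uniformly at random."""
    return 1.0 / len(set(labels))


# Group sizes (number of E. coli isolates) from Table 1.
sizes = {"bear": 26, "deer": 21, "elk": 79, "goose": 42, "cow": 109,
         "dog": 69, "horse": 53, "sheep": 29, "swine": 34, "human": 86}
labels = [group for group, n in sizes.items() for _ in range(n)]
shuffled = fully_randomize(labels, random.Random(0))
```

Rerunning the jackknife on `shuffled` and comparing the resulting rate with `chance_rate(labels)` is the check for classifier bias: a classifier that does not create artificial groupings should score near 1/10 on the full library and near 1/5 on a five-group small library.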
Group specificity should be reflected in overall classification rates and should also lead to different group probability distributions. To compare groups, one can use interpoint distances, a simple nonparametric procedure to compare multivariate probability distributions. The distribution of a chosen distance measure between two observations is considered for three different cases: (1) both observations come from group 1; (2) one observation is from group 1 and the other from group 2; and (3) both observations come from group 2. A result from probability theory states that the distributions of the two groups are the same if and only if the three interpoint distance distributions are equal (46). To assess the significance of observed differences in the distributions, a Kolmogorov-Smirnov test was used with a Bonferroni correction to compensate for multiple comparisons (see Supporting Information for more details). Inequalities on the probabilities and expected values of errors can be used as guides for sufficient library size, as discussed in the Supporting Information. Evaluating whether a library is representative of the population is difficult; the number of different fingerprints for each group is unknown, and even an upper bound is unavailable. However, we should be able to determine if a new observation is represented in the library. Because a classifier will always assign a new observation to one of the available groups whether it is truly represented in the group or not, interpoint distances can be used to determine if the new observation is atypical with respect to the library, by comparing its distance to the distribution of distances of the selected group. To achieve the same apparent objective, GelCompar II uses “quality factors” to describe the certainty with which a new observation is classified, though the basis for these factors is not well described.
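The interpoint distance comparison described above can be sketched as follows. The sketch assumes Euclidean distance and uses a textbook asymptotic approximation for the two-sample Kolmogorov-Smirnov p-value; because pairwise distances within a group are not independent, that p-value is only a rough guide, and the procedure actually used is the one described in the Supporting Information. All function names are hypothetical.

```python
import math
from itertools import combinations, product


def interpoint_distances(group1, group2):
    """Distances within group 1, between the groups, and within group 2."""
    within1 = [math.dist(a, b) for a, b in combinations(group1, 2)]
    between = [math.dist(a, b) for a, b in product(group1, group2)]
    within2 = [math.dist(a, b) for a, b in combinations(group2, 2)]
    return within1, between, within2


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    x, y = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(x) and j < len(y):
        t = min(x[i], y[j])
        while i < len(x) and x[i] <= t:
            i += 1
        while j < len(y) and y[j] <= t:
            j += 1
        d = max(d, abs(i / len(x) - j / len(y)))
    return d


def ks_pvalue(d, n1, n2, terms=100):
    """Asymptotic p-value approximation (ASSUMPTION: independent samples)."""
    ne = n1 * n2 / (n1 + n2)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:  # the series is numerically 1 here and converges slowly
        return 1.0
    s = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
            for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def compare_groups(group1, group2, alpha=0.01):
    """Run the three KS comparisons with a Bonferroni-corrected threshold."""
    w1, b, w2 = interpoint_distances(group1, group2)
    pairs = [(w1, b), (w2, b), (w1, w2)]
    threshold = alpha / len(pairs)
    pvalues = [ks_pvalue(ks_statistic(u, v), len(u), len(v)) for u, v in pairs]
    return min(pvalues) < threshold, pvalues, threshold


# Two made-up, well-separated groups in a 2-D feature space.
g1 = [(float(x), float(y)) for x in range(3) for y in range(2)]
g2 = [(x + 10.0, y + 10.0) for x, y in g1]
differ, pvalues, threshold = compare_groups(g1, g2)
```

The two groups are judged to have different distributions if any of the three comparisons falls below the corrected threshold, since equality of the group distributions requires all three interpoint distance distributions to agree. With three comparisons at the 1% level the threshold is 0.01/3, roughly 0.003.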
Results
rep-PCR DNA Fingerprinting. Although single-researcher studies have established the reproducibility of the rep-PCR technique (47), the measures used to compare the fingerprints have not been clearly defined. Results of the fingerprint reproducibility experiments indicate that the different procedures used in this investigation generally generated fingerprints with a high degree of similarity (≥90%) when applied to the same sample in the same thermocycler. However, the same sample run in the Techne Touchgene and the GeneAmp 2400 thermocyclers had a 77% minimum similarity, while samples run in the GeneAmp 2400 and 9700 thermocyclers were at least 90% similar; thus only the GeneAmp thermocyclers were used to build the database. A similarity of 90% was used as the cutoff to differentiate different fingerprint patterns, i.e., fingerprints had to be at least 10% different to be considered distinct, the same cutoff used to differentiate between ribotype profiles in another BST study (15).

TABLE 2. Rates of Correct Classification Using GelCompar II and 1-NN with Two Randomly Generated Small Libraries with 16 Isolates in Each Source Group

                                                 rate of correct classification (%)
library     source group   no. of fingerprints   GelCompar II   1-NN
Library 1   bear           4                     93.8           100
            goose          9                     93.8           100
            cow            12                    81.2           100
            dog            9                     87.5           87.5
            human          8                     87.5           100
            ECC                                  88.8           97.5
Library 2   bear           4                     87.5           100
            goose          12                    56.2           87.5
            cow            13                    50.0           68.8
            dog            8                     68.8           56.2
            human          9                     81.2           81.2
            ECC                                  68.8           78.8

Classification Rates. Table 1 summarizes the rates of correct classification (RCC) for each group and the overall ECC for each of the three classification methods using the large library. In general, all three methods had RCC above 80%, with PDA having the lowest ECC and KNN the highest; for comparison, the probability of a random selection from the largest group is 20%. All methods had a total performance above 98% ECC when the library was condensed into human and nonhuman source groups. No clear trends characterize the different performances of the three statistical methods, though both GelCompar II and KNN had the highest and lowest RCC in the goose and cow groups, respectively. For this library, an upper bound on the error for the jackknife estimate of the classification rate is 6.6% (see Supporting Information). Therefore, the ECC obtained by PDA is significantly lower than that of KNN but not GelCompar II, while KNN and GelCompar II performed comparably. 
Though no significant difference between GelCompar II and KNN was observed, it is important to note that the results of the KNN classifier are based on 64 or 128 coefficients instead of the 494 observations that represent the curve, which means that KNN efficiently compresses the information into lower dimensional representations. We thus expect better, more stable predictions with future observations. The large library is relatively small compared to some libraries reported in the literature and may therefore contain more homogeneous fingerprint patterns within source groups, potentially leading to unrealistically high ECC values. Therefore, two small libraries were generated from the large library to evaluate the effect of fingerprint homogeneity on classification rates. Table 2 shows the results obtained from two small libraries. Both libraries contain 80 total fingerprints; Library 1 contains 40 different fingerprint patterns, while Library 2 contains 46 fingerprint patterns. For these libraries, the upper bound on the error for the jackknife estimate of the classification rate is 19%. Therefore, the differences in ECC between the two libraries are significant or nearly so, but the differences between KNN and GelCompar II are not statistically significant. GelCompar II and KNN classification methods generated comparable ECCs using the jackknife analysis, and their performances in the holdout test were nearly identical. Of the 30 holdouts, GelCompar II correctly classified 27, while 1-NN and 3-NN correctly classified 27 and 28, respectively, using only 64-dimensional data vectors.

FIGURE 1. Estimated interpoint distance distributions of observations between bear and all other source groups. Probability distributions are shown within the bear source group (blue), within comparative group (green), and between bear and comparative group (red). Labels at the top of each plot represent the comparative source group. The x-axis represents normalized distance, and the y-axis represents the probability density.

Library and Source Group Assessment. Host-indicator organism correlations are currently inconclusive; however, some preliminary evidence suggests that communities of E. coli may have some ecological structure, with host and geographic area being important variables (33, 35, 36, 48-50). These apparent relationships were investigated further using interpoint distances. Figure 1 shows estimates of the interpoint distributions for the bear group compared to all the other source groups in the library. If the bear and the comparative group have the same distribution, then the three curves should be very similar as they are estimates of the same underlying distribution. As evident in Figure 1, the distribution for the bear source group is different from all other source groups. The largest p-value of the Kolmogorov-Smirnov test for the three interpoint distance comparisons was 1.6 × 10⁻¹⁸. Such differences were also observed for all other groups (data not shown). 
Therefore, the predefined groups, based on densitometric curves, appear to contain host-group specific characteristics. Because of the group specificity, a partial randomization should lower the classification rates. Randomization by fingerprint type resulted in ECC of 76.8%, 48.1%, and 76% for GelCompar II, PDA, and KNN, respectively. However, these results are difficult to assess because the ECCs depend on the degree of randomization and the classification procedure used. But if the classifiers do not introduce any bias (i.e., create artificial groupings), a full randomization should lower the rates to those of chance assignments (18). KNN classification on the fully randomized versions of our library as well as on the two small sublibraries resulted in classification rates (12% for the full library, 21% for the small libraries) near the probability by chance (10% for the full library, 20% for the small libraries), indicating that KNN classifiers do not introduce artificial groupings. Note, however, that the ECCs of the two unrandomized small libraries are significantly different, suggesting that randomization does not provide enough information to assess the adequacy of the library. Figure 2 shows the interpoint distance distributions generated with the partially randomized library. Although the library was rearranged by clusters of fingerprint patterns and not by individual observations (i.e., clonal rather than individual fingerprints being assigned together to randomly chosen source groups), the generated distributions are very similar. The largest p-value for the Kolmogorov-Smirnov test was 0.002 (for reference, the 1% significance level with the Bonferroni correction is 0.003), which is 15 orders of magnitude larger than that obtained without the randomization. In other words, even partial randomization destroyed differences between group distributions, i.e., group coherency.

FIGURE 2. Estimated interpoint distance distributions of observations between bear and all other source groups using the partially randomized library. Probability distributions are shown within the bear source group (blue), within comparative group (green), and between bear and comparative group (red). Labels at the top of each plot represent the comparative source group. The x-axis represents normalized distance, and the y-axis represents the probability density.

A cluster analysis of the full database indicates that observations cluster according to host and date of collection (data not shown). To explore geographic relationships, densitometric curves obtained from isolates collected from source groups (cow, dog, goose) located in two geographically separated areas were analyzed by cluster analysis and jackknife. The two sampling locations for each source group were at least 70 miles apart. 
In addition, to explore temporal relationships within the database, isolates from cow and horse samples collected at different times at the same location were analyzed by cluster analysis and jackknife. Our results suggest that both geographic and temporal factors have some influence on fingerprint patterns, as fingerprints within each source group could be separated by these criteria with >70% RCC. However, when these source groups were separated by these criteria within the large database, the RCC for the geographically and/or temporally divided groups dropped by an average of 10%, while RCC for the remaining groups did not change.
Discussion
As a tool, BST may be used as a part of a management plan to mitigate both point and nonpoint sources of contamination. This study demonstrates that the statistical framework
of library-based BST efforts is important to consider prior to application. Each of the classification procedures used attained high ECCs; however, they fail in different respects. PDA classification had consistently lower RCC than the other two classification methods, with the exception of the bear source group for which its RCC was equal to that of KNN. PDA requires that the different groups share the same covariance matrix and works best when the conditional probability densities are multivariate Gaussian. These assumptions are hard to meet when there is more than one fingerprint type per group; in this case the group probability density is a mixture of distributions corresponding to different fingerprints. It thus becomes more important to model the different strain groups within each source group. Another drawback of PDA is computational. The matrix operations used to estimate classification errors may be significantly time-consuming for high-dimensional data such as discretized curves. Overall, KNN achieved the highest ECC of the three classification methods using only a fraction of the possible variables. KNN does not make assumptions about the probability density of groups and does not impose linear boundaries; therefore, it does not fail in the same respects as PDA. A majority composition in the neighborhood of a new observation is an estimate of the optimal classification rule; the more compact the neighborhood, the better the estimate. This illustrates the problem that KNN has with high-dimensional data; the nearest neighbors tend to be relatively far in a high-dimensional space. In addition, KNN does not adapt well to directional structure in the data, though several more adaptive versions of KNN can be used (e.g. refs 51 and 52 and references therein). We have not yet implemented an adaptive version of KNN, but the decorrelating property of wavelet transforms should reduce directional preferences caused by correlation structure. 
The GelCompar II software performed well during these trials, is readily available, and is straightforward to use. However, its closed-source nature limits our understanding of how the software used the densitometric curves to perform cross-validation. With KNN, we have the freedom to tune the parameters to maximize the rates of a particular group, to choose the amount of dimensionality reduction to lessen the so-called “curse of dimensionality” (51), to select the number of neighbors to control the bias and variance of the classifier, and to select the number of holdouts in cross-validation to estimate classification rates. This freedom should translate into better predictions and library updates, though the library was too small to verify this expectation. We also have substantial background regarding the limitations of the KNN approach. The limitations underlying the GelCompar II analysis are, however, difficult to discern without additional information regarding the classification procedures implemented in the software.
Comparisons between the small and large libraries showed that numerous factors affect ECC. Increasing library size typically increases the number of different fingerprint patterns per group and thus may reduce group specificity due to dispersion and group overlap. However, greater size does not necessarily lead to lower classification rates, as evidenced by the poor performance of Library 2 relative to the large library. Greater diversity of fingerprint patterns also increases overlap, regardless of library size. The better-performing small library (Library 1) had a higher average ratio of isolates to fingerprints (1.9) than the lower-performing library (Library 2; ratio = 1.7). A higher ratio indicates greater clonality, which should lead to higher classification rates.
As a specific example, the goose group in Library 1 contained 9 distinct fingerprint patterns shared by 13 of the 16 isolates (≥90% similarity), while the goose group in Library 2 contained 12 patterns with only 7 isolates sharing patterns.
Again, however, this simple measure does not fully account for classification performance: the dog group in Library 2 contained 8 fingerprint patterns with 13 isolates having ≥90% similarity but was correctly classified at significantly reduced rates relative to the dog group in Library 1, which contained one more pattern and two fewer isolates with shared patterns. The reasons for this performance are apparent upon examination of the similarity matrices for the libraries. While Library 2 contains more isolates having clonal fingerprints, it also has greater group overlap: five of the isolates, while ≥90% similar to other dog isolates, had even higher similarity to isolates from other source groups (and were thus misclassified), while the same is true for only two of the dog isolates in Library 1.
As others have noted (14, 18), both library size and diversity are clearly important factors affecting ECC, though the small libraries demonstrate that simple rules defining sufficient size and diversity may be difficult to establish. Currently, the size of library required to be representative of sources of fecal contamination yet still allow meaningful classification is undefined (4). A literature review of library-based BST suggests that libraries having small numbers of samples and sampling sites per source group tend to have higher ARCC (17, 30, 53). The small libraries in this study demonstrate that classification rates are related to the number, frequency, and source-specific similarity of fingerprint patterns in each data set, though the number of observations within a library may not be as important as the heterogeneity of the observations. The number of fingerprint patterns needed to characterize a source within a study area is unknown, as the continuum of patterns may be continually changing due to temporal, geographic, and environmental influences.
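The two library measures discussed above (the ratio of isolates to fingerprint patterns, and overlap between source groups) can be sketched from a similarity matrix as follows. This is a minimal Python illustration under stated assumptions: isolates are grouped into one pattern when they are linked by ≥90% similarity (single linkage, which is our assumption and not necessarily how patterns were counted in this study), and the 5 × 5 similarity matrix and labels are invented toy values.

```python
def count_patterns(sim, threshold=0.90):
    """Count fingerprint patterns, treating isolates linked by
    similarity >= threshold as one pattern (single linkage; an assumption)."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative of i's pattern.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})


def clonality_ratio(sim, labels, group, threshold=0.90):
    """Isolates per fingerprint pattern within one source group."""
    idx = [i for i, lab in enumerate(labels) if lab == group]
    sub = [[sim[i][j] for j in idx] for i in idx]
    return len(idx) / count_patterns(sub, threshold)


def overlapping_isolates(sim, labels):
    """Indices of isolates whose most similar other isolate is from another group."""
    n = len(sim)
    flagged = []
    for i in range(n):
        best = max((j for j in range(n) if j != i), key=lambda j: sim[i][j])
        if labels[best] != labels[i]:
            flagged.append(i)
    return flagged


# Hypothetical similarity matrix (fractions, not percentages) for five isolates.
labels = ["dog", "dog", "dog", "goose", "goose"]
sim = [
    [1.00, 0.95, 0.60, 0.50, 0.40],
    [0.95, 1.00, 0.55, 0.45, 0.42],
    [0.60, 0.55, 1.00, 0.93, 0.50],
    [0.50, 0.45, 0.93, 1.00, 0.91],
    [0.40, 0.42, 0.50, 0.91, 1.00],
]
print(clonality_ratio(sim, labels, "dog"))    # 3 isolates / 2 patterns
print(clonality_ratio(sim, labels, "goose"))  # 2 isolates / 1 pattern
print(overlapping_isolates(sim, labels))      # isolates closest to another group
```

In the toy data, one dog isolate and one goose isolate are each most similar to an isolate from the other group, which is the kind of overlap that a per-group ratio alone does not reveal.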
Though it is possible that fingerprint-based BST methods will eventually encounter a plateau in the number of patterns, studies with libraries containing thousands of fingerprint patterns or antibiotic resistance profiles have not demonstrated an endpoint to the dispersion.
Various studies suggest that the genetic diversity of fecal bacteria may not be randomly distributed but rather that host taxonomic group and even individual hosts may account for some of the differences observed in population distribution (32, 33, 36). Numerous studies have demonstrated individual-specific differences in enteric community composition (54-56). Whether such differences are significant at the strain level has not been established, though these and several studies of E. coli diversity (33, 36, 48) suggest that host-dependent strain-level differences are possible. Observations from this study also suggest a relationship between indicator community structure and host: high rates of correct classification were achieved, and the interpoint distance analysis demonstrated that each source group has a distinct probability distribution. However, like other BST measures, DNA fingerprinting methods are affected by shifts in community dynamics when hosts are exposed to antibiotics or other stresses (57). Consumption of antibiotics as well as changes in diet have been shown to affect the intestinal populations of microbiota (58, 59). Pressures affecting the microbiota within the intestine as well as pressures in the environment represent a limiting factor for all BST methods. Further study of the dynamics of indicator organism populations is needed to develop guidelines on the representativeness of a library.
Effects of geographic location on enteric bacterial populations are very difficult to discern because of intraspecies and intralocation variability, though numerous studies have attempted to assess the impact of these variables (15, 33, 35, 60).
In general, geographic location appears to play a statistically significant role in the genetic variation of E. coli, though the fraction of the variability that may be directly attributed to location is relatively small. Similarly, the effect of time on community structure has proved variable. Some
studies have found significant changes in E. coli populations over time (48, 61), while others have suggested stable enteric community structure (49, 55, 56, 62). Indeed, in one study examining the electrophoretic types (ETs) of E. coli isolated from a single human over an 11-month period, Caugant et al. (63) found that most ETs were transient, appearing in only one sample and being represented by only one or a few clones, yet three ETs were detected over extended time periods, though at different and temporally variable frequencies.
While our study was not designed to study geographic and temporal variability, our sampling plan allowed for some investigation of these variables. Though we were able to successfully separate isolates within each source group by sampling location and time, important source-specific information was lost when the geographic areas were separated in the large library, as evidenced by the drop in RCCs. This indicates that the groups as a whole contained underlying source-specific information that allowed better classification than when divided.
As a tool, interpoint distance analysis aids in evaluating the library by assessing the underlying specificity of each source group. Based on the differences in the interpoint distance distributions shown in Figures 1 and 2 for the original and the partially randomized databases, respectively, important coherent information about each source group is clearly lost through the randomization. In other words, the fingerprints in the original groupings provided characteristics that help discriminate between groups. Though p-values were estimated for the tests, one must interpret them carefully because statistical significance does not necessarily imply practical significance: any small effect will be statistically significant given a large enough library.
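The idea of comparing interpoint distances in the original library with those in a randomized library can be sketched in Python as follows. This is only an illustration under simplifying assumptions: it summarizes each library by its mean within-group distance and shuffles all labels, whereas the analysis reported here compared full distance distributions for a partially randomized database. The function names and the synthetic two-group data are hypothetical.

```python
import math
import random
from itertools import combinations


def within_group_distances(points, labels):
    """All pairwise Euclidean distances between points that share a label."""
    return [
        math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
        if labels[i] == labels[j]
    ]


def mean(values):
    return sum(values) / len(values)


def randomization_check(points, labels, n_perm=500, seed=0):
    """Compare the observed mean within-group distance with label-shuffled libraries."""
    rng = random.Random(seed)
    observed = mean(within_group_distances(points, labels))
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any link between fingerprints and sources
        null.append(mean(within_group_distances(points, shuffled)))
    # Fraction of shuffled libraries at least as compact as the original one.
    p_value = (1 + sum(v <= observed for v in null)) / (n_perm + 1)
    return observed, mean(null), p_value


# Synthetic library: two compact, well-separated groups of ten points each.
rng = random.Random(1)
points = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(10)]
points += [[rng.gauss(1, 0.1), rng.gauss(1, 0.1)] for _ in range(10)]
labels = ["A"] * 10 + ["B"] * 10

observed, null_mean, p_value = randomization_check(points, labels)
# Report the size of the difference alongside the p-value.
print(f"observed={observed:.3f} shuffled={null_mean:.3f} p={p_value:.4f}")
```

Printing the observed and shuffled means next to the p-value keeps the size of the effect in view, so that a small p-value obtained from a large library is not mistaken for a large practical difference.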
A simple visual examination may reveal differences in distributions without calculating p-values, or show that “statistically significant” differences in fact correspond to very similar distributions that are not different in practice. The shape of the interpoint distance distribution also provides qualitative information about the relative distance of the observations and can be used to determine whether the assignment of a new observation to a group is meaningful or instead indicates a new isolate not represented in the existing library.
The premise of all BST methods is that a correlation exists between specific indicator organisms and their hosts. To understand the strength of these correlations, statistical analyses are employed. The importance of testing the limits of these statistical analyses cannot be overstated: statistical limitations must be understood in order to generate meaningful classification rates for BST purposes. Previous BST studies have used a wide array of statistical methods that generated a broad range of values for correct classification rates, but the parameters used and the limitations of their statistical analyses have not been fully addressed. Many researchers rely on convenient, prepackaged statistical software, but it often does not allow for complete understanding or manipulation of statistical parameters. Objective evaluation of statistical limitations, regardless of the specific analysis used, will allow for more confidence in interpreting BST classification results.
To advance BST techniques, study parameters and statistical assumptions must be clearly stated and thoroughly tested. Libraries and new observations have to be subjected to careful statistical analyses. Any new observation may challenge the representativeness of an existing library. The creation of a useful library of fingerprints should be a dynamic process where new fingerprints and groups are periodically included and the library is reanalyzed.
Eventually one may discover that the specificity of the fingerprints is not high enough to separate source groups. One then must look for new features (geographic/temporal separation, other organisms or type of data) to improve specificity. The analyses presented here indicate significant source specificity in the library studied, though its representativeness and specificity will continue to be tested as the library is enlarged. In summary, this work provides a needed platform to evaluate the use of microbial fingerprinting and classification methods, particularly those used in BST.
Acknowledgments This work was supported by a grant from the Colorado School of Mines Foundation. Special thanks go to the Summit County Department of Environmental Health and Kathryn Lowe for their help in coordinating sampling efforts, to Gary Liang Jang Wang for his help in processing samples, and to Drs. Mike Colagrosso and Javier Rojo for insightful discussions.
Supporting Information Available Additional details regarding the rep-PCR classification problem of assigning densitometric curves to source groups. This material is available free of charge via the Internet at http://pubs.acs.org.
Literature Cited
(1) U.S. EPA. Developing Strategy for Waterborne Microbial Disease; U.S. Environmental Protection Agency, Office of Water: 2001.
(2) CDC. Surveillance for Waterborne-Disease Outbreaks United States, 1999-2000; 2002.
(3) Scott, T. M.; Rose, J. B.; Jenkins, T. M.; Farrah, S. R.; Lukasik, J. Appl. Environ. Microbiol. 2002, 68, 5796-5803.
(4) Simpson, J. M.; Domingo, J. W. S.; Reasoner, D. J. Environ. Sci. Technol. 2002, 36, 5279-5288.
(5) Johnson, G. W.; Jarman, W. M.; Bacon, C. E.; Davis, J. A.; Ehrlich, R.; Risebrough, R. W. Environ. Sci. Technol. 2000, 34, 552-559.
(6) Lavine, B. K.; Mayfield, H.; Kromann, P. R.; Faruque, A. Anal. Chem. 1995, 67, 3846-3852.
(7) Lavine, B. K.; Ritter, J.; Moores, A. J.; Wilson, M.; Faruque, A.; Mayfield, H. T. Anal. Chem. 2000, 72, 423-431.
(8) Näf, C.; Broman, D.; Pettersen, H.; Rolff, C.; Zebühr, Y. Environ. Sci. Technol. 1992, 26, 1444-1457.
(9) Requejo, A. G.; West, R. H.; Hatcher, P. G.; McGillivary, P. A. Environ. Sci. Technol. 1979, 13, 931-935.
(10) Wang, J.; Guo, P.; Li, X.; Zhu, J.; Reinert, T.; Heitmann, J.; Spemann, D.; Vogt, J.; Flagmeyer, R.-H.; Butz, T. Environ. Sci. Technol. 2000, 34, 1900-1905.
(11) Kingsley, M. T.; Straub, T. M.; Call, D. R.; Daly, D. S.; Wunschel, S. C.; Chandler, D. P. Appl. Environ. Microbiol. 2002, 68, 6361-6370.
(12) Carson, C. A.; Shear, B. L.; Ellersieck, M. R.; Asfaw, A. Appl. Environ. Microbiol. 2001, 67, 1503-1507.
(13) Dombek, P. E.; Johnson, L. K.; Zimmerley, S. T.; Sadowsky, M. J. Appl. Environ. Microbiol. 2000, 66, 2572-2577.
(14) Hagedorn, C.; Robinson, S. L.; Filtz, J. R.; Grubbs, S. M.; Angier, T. A.; Reneau, R. B., Jr. Appl. Environ. Microbiol. 1999, 65, 5522-5531.
(15) Hartel, P. G.; Summer, J. D.; Hill, J. L.; Collins, J. V.; Entry, J. A.; Segars, W. I. J. Environ. Qual. 2002, 31, 1273-1278.
(16) Parveen, S.; Portier, K. M.; Robinson, K.; Edmiston, L.; Tamplin, M. L. Appl. Environ. Microbiol. 1999, 65, 3142-3147.
(17) Wiggins, B. A.; Andrews, R. W.; Conway, R. A.; Corr, C. L.; Dobratz, E. J.; Dougherty, D. P.; Eppard, J. R.; Knupp, S. R.; Limjoco, M. C.; Mettenburg, J. M.; Rinhardt, J. M.; Sonsino, J.; Torrijos, R. L.; Zimmerman, M. E. Appl. Environ. Microbiol. 1999, 65, 3483-3486.
(18) Whitlock, J. E.; Jones, D. T.; Harwood, V. J. Water Res. 2002, 36, 4273-4282.
(19) Seurinck, S.; Verstraete, W.; Siciliano, S. D. Appl. Environ. Microbiol. 2003, 69, 4942-4950.
(20) Sadowsky, M. J.; Hur, H.-G. In Bacterial Genomes: Physical Structure and Analysis; de Bruijn, F. J., Lupski, J. R., Weinstock, G. M., Eds.; Chapman & Hall: New York, 1998.
(21) Versalovic, J.; Schneider, M.; de Bruijn, F. J.; Lupski, J. R. Methods Mol. Cell. Biol. 1994, 5, 25-40.
(22) Versalovic, J.; Lupski, J. R. Methods Mol. Cell. Biol. 1995, 5, 96-104.
(23) Louws, F. J.; Fulbright, D. W.; Stephens, C. T.; de Bruijn, F. J. Appl. Environ. Microbiol. 1994, 60, 2286-2295.
(24) Louws, F. J.; Schneider, M.; de Bruijn, F. J. In Nucleic Acid Amplification Methods for the Analysis of Environmental Samples; Toranzos, G., Ed.; Technomic Publishing: 1996; pp 63-94.
(25) Versalovic, J.; Lupski, J. R. PCR Methods Appl. 1993, 2, 341-345.
(26) Rademaker, J.; de Bruijn, F. J. In DNA Markers: Protocols, Applications, and Overviews; Gustavo Caetano-Anolles, P. M. G., Ed.; Wiley-Liss: New York, 1997; pp 151-171.
(27) Tyler, K.; Wang, G.; Tyler, S.; Johnson, W. J. Clin. Microbiol. 1997, 35, 339-346.
(28) Burr, M. D.; Pepper, I. L. J. Microbiol. Methods 1997, 29, 63-68.
(29) de Bruijn, F. J.; Lupski, J. R.; Weinstock, G. M. Bacterial Genomes: Physical Structure and Analysis; Chapman & Hall: New York, 1998.
(30) Wiggins, B. A. Appl. Environ. Microbiol. 1996, 62, 3997-4002.
(31) Parveen, S.; Hodge, N. C.; Stall, R. E.; Farrah, S. R.; Tamplin, M. L. Water Res. 2001, 35, 379-386.
(32) Gordon, D. M.; FitzGibbon, F. Microbiology 1999, 145, 2663-2671.
(33) Souza, V.; Rocha, M.; Valera, A.; Eguiarte, L. E. Appl. Environ. Microbiol. 1999, 65, 3373-3385.
(34) Okada, S.; Gordon, D. M. Mol. Ecol. 2001, 10, 2499-2513.
(35) Gordon, D. M. Microbiology 1997, 143, 2039-2046.
(36) Gordon, D. M.; Lee, J. Microbiology 1999, 145, 2673-2682.
(37) Okada, S.; Gordon, D. M. Mol. Ecol. 2001, 10, 2499-2513.
(38) He, Q.; Viljanen, M. K.; Mertsola, J. Methods Mol. Cell. Biol. 1994, 8, 155-160.
(39) MacPherson, J. M.; Eckstein, P. E.; Scoles, G. J.; Gajadhar, A. A. Mol. Cell. Probes 1993, 7, 293-299.
(40) Marcinek, H.; Wirth, R.; Muscholl-Silberhorn, A.; Gauer, M. Appl. Environ. Microbiol. 1998, 64, 626-632.
(41) Meunier, J. R.; Grimont, P. A. D. Res. Microbiol. 1993, 144, 373-379.
(42) Woegerbauer, M.; Jenni, B.; Thalhammer, F.; Graninger, W.; Burgmann, H. Appl. Environ. Microbiol. 2002, 68, 440-443.
(43) American Public Health Association; American Water Works Association; Water Environment Federation. Standard Methods for the Examination of Water and Wastewater, 20th ed.; American Public Health Association, American Water Works Association, Water Environment Federation: Washington, DC, 1998.
(44) Mallat, S. A Wavelet Tour of Signal Processing, 2nd ed.; Academic Press: New York, 1999.
(45) Hastie, T.; Buja, A.; Tibshirani, R. Annals Stat. 1995, 23, 73-102.
(46) Maa, J.-F.; Pearl, D. K.; Bartoszynski, R. Annals Stat. 1996, 24, 1069-1074.
(47) Rademaker, J.; Louws, F.; de Bruijn, F. J. Mol. Microbial Ecol. 1998, 3.4.3, 1-27.
(48) Gordon, D. M.; Bauer, S.; Johnson, J. Microbiology 2002, 148, 1513-1522.
(49) McLellan, S. L.; Daniels, A. D.; Salmore, A. K. Appl. Environ. Microbiol. 2003, 69, 2587-2594.
(50) Whittam, T. S.; Ochman, H.; Selander, R. K. Proc. Natl. Acad. Sci. 1983, 80, 1751-1755.
(51) Hastie, T.; Tibshirani, R.; Friedman, J. H. Elements of Statistical Learning: Data Mining, Inference and Prediction; Springer-Verlag: 2001.
(52) Ricci, F.; Avesani, P. IEEE Trans. Pattern Anal. Mach. Intell. 1999, 21, 380-384.
(53) Harwood, V. J.; Whitlock, J.; Withington, V. Appl. Environ. Microbiol. 2000, 66, 3698-3704.
(54) Zoetendal, E. G.; Akkermans, A. D. L.; Akkermans-van Vliet, W. M.; de Visser, A. G. M.; de Vos, W. M. Microb. Ecol. Health Dis. 2001, 13, 129-134.
(55) Simpson, J. M.; Martineau, B.; Jones, W. E.; Ballam, J. M.; Mackie, R. I. Microb. Ecol. 2002, 44, 186-197.
(56) Simpson, J. M.; McCracken, V. J.; Gaskins, H. R.; Mackie, R. I. Appl. Environ. Microbiol. 2000, 66, 4705-4714.
(57) Kruse, H.; Sorum, H. Appl. Environ. Microbiol. 1994, 60, 4015-4021.
(58) Apajalahti, J. H.; Kettunen, A.; Bedford, M. R.; Holben, W. E. Appl. Environ. Microbiol. 2001, 67, 5656-5667.
(59) Russell, J. B.; Diez-Gonzalez, F.; Jarvis, G. N. J. Dairy Sci. 2000, 83, 863-873.
(60) Caugant, D. A.; Levin, B. R.; Selander, R. K. J. Hyg., Camb. 1984, 92, 377-384.
(61) Pupo, G. M.; Richardson, B. J. Microbiology 1995, 141, 1037-1044.
(62) Zoetendal, E. G.; Akkermans, A. D. L.; de Vos, W. M. Appl. Environ. Microbiol. 1998, 64, 3854-3859.
(63) Caugant, D. A.; Levin, B. R.; Selander, R. K. Genetics 1981, 98, 467-490.
Received for review March 11, 2003. Revised manuscript received August 8, 2003. Accepted August 14, 2003. ES034211Q