Environmental Measurements Methods
Application of SourceTracker for Accurate Identification of Fecal Pollution in Recreational Freshwater: A Double-Blinded Study
Christopher Staley, Thomas Kaiser, Aldo Lobos, Warish Ahmed, Valerie J. Harwood, Clairessa M. Brown, and Michael J. Sadowsky
Environ. Sci. Technol., Just Accepted Manuscript • DOI: 10.1021/acs.est.7b05401 • Publication Date (Web): 05 Mar 2018
Environmental Science & Technology
Revised es-2017-05401y
Application of SourceTracker for Accurate Identification of Fecal Pollution in Recreational Freshwater: A Double-Blinded Study
Christopher Staley¹, Thomas Kaiser¹, Aldo Lobos², Warish Ahmed³, Valerie J. Harwood², Clairessa M. Brown¹, and Michael J. Sadowsky¹,⁴,*
¹BioTechnology Institute, University of Minnesota, 1479 Gortner Ave, St. Paul, MN 55108; ²Department of Integrative Biology, SCA 110, University of South Florida, 4202 East Fowler Ave, Tampa, Florida 33620; ³CSIRO Land and Water, Ecosciences Precinct, 41 Boggo Road, Qld 4102, Australia; ⁴Department of Soil, Water, and Climate, University of Minnesota, 1991 Upper Buford Cir, St. Paul, MN 55108
*Corresponding Author: Michael J. Sadowsky, BioTechnology Institute, University of Minnesota, 140 Gortner Lab, 1479 Gortner Ave, Saint Paul, MN 55108; Phone: (612)624-2706, Email: [email protected]
Running title: SourceTracker validation blinded study
Keywords: microbial community / microbial source tracking / next-generation sequencing / SourceTracker / water quality
ABSTRACT
The efficacy of SourceTracker software to attribute contamination from a variety of fecal sources spiked into ambient freshwater samples was investigated. Double-blinded samples spiked with ≤ 5 different sources (0.025-10% vol/vol) were evaluated against fecal taxon libraries characterized by next-generation amplicon sequencing. Three libraries, including an initial library (17 non-local sources), a blinded source library (5 local sources), and a composite library (local and non-local sources), were used with SourceTracker. SourceTracker’s predictions of fecal compositions in samples were made, in part, based on distributions of taxa within abundant genera identified as discriminatory by discriminant analyses, but also using a large percentage of low abundance taxa. The initial library showed poor ability to characterize blinded samples, but, using local sources, SourceTracker showed 91% accuracy (31/34) at identifying the presence of source contamination, with two false positives for sewage and one for horse. Furthermore, sink predictions of source contamination were positively correlated (Spearman’s ρ ≥ 0.88, P < 0.001) with spiked source volumes. Using the composite library did not significantly affect sink predictions (P ≥ 0.787) compared to those made using the local sources alone. Results of this study indicate that geographically associated fecal samples are required for SourceTracker to assign host sources accurately.
1. INTRODUCTION
Fecal pollution of water is a significant global health issue due to the likely presence of waterborne pathogens. Therefore, identification of the source(s) of fecal pollution is critical for implementing appropriate remediation strategies and mitigating human health risks associated with water use and reuse. Fecal pollution of environmental waters has been historically assessed by enumerating fecal indicator bacteria (FIB), such as Escherichia coli, Enterococcus spp., and Clostridium perfringens, using culture-based methods¹. However, monitoring FIB in environmental waters does not provide information on the source of pollution, e.g., human or animal feces², or naturalized FIB in the environment³,⁴, necessitating the use of microbial source tracking (MST) methodologies. Early MST tools were library-dependent and required isolation and typing of hundreds-to-thousands of FIB from human and animal feces to generate source-associated libraries⁵⁻⁷. Conversely, library-independent methods target a gene fragment from a taxonomic group that typically co-evolved, or is otherwise associated (e.g., by infection), with a specific host, providing a host-associated marker typically enumerated by quantitative PCR (qPCR)².
Either type of MST method requires extensive validation prior to its application in environmental studies⁸. In a previous multi-laboratory study, 22 laboratories used 12 different library-dependent and -independent methods to determine sources in blinded water samples spiked with one to three fecal sources⁹. Results indicated that library-dependent methods were prone to false positive detection, while library-independent methods tended to produce false negative results. Despite drawbacks of both types of methods, library-independent methods have been predominantly used over the last decade², due to decreased labor and supply costs as well as reduced spatiotemporal variability compared to that associated with library-dependent methods⁸. More recently, several studies have suggested the use of next-generation sequencing (NGS) to characterize bacterial contamination from multiple sources including recreational waters, hospital environments, and other ecosystems¹⁰⁻¹³. In a similar strategy to that used in previous MST method comparisons, a blinded study was performed in which three community-based methods were evaluated, including terminal restriction fragment length polymorphism, phylogenetic microarray, and Illumina NGS using community dissimilarity indices¹⁴. Sixty-four blinded samples, spiked with single or dual sources, were generated from 12 host groups. While all three methods were able to correctly identify the dominant sources for 95% of the blinded samples, detection of the second, minor sources was less accurate.
To accurately quantify multiple sources present at low abundances in sink communities (communities impacted by contamination from a source, e.g., recreational waters receiving fecal contamination), the Bayesian algorithm SourceTracker was proposed¹⁰. The allowance of unknown sources using this method was hypothesized to improve accuracy in overall source assignments. The SourceTracker program has been evaluated in field studies and validated against hydrodynamic modeling of source contamination¹⁵ as well as in vitro constructed source mixtures¹⁶. The program performed with high accuracy, sensitivity, and specificity using default parameters. However, studies have also noted that source assignments with high relative standard deviations (i.e., greater variability in quantitation across technical replicates) had lower confidence¹⁶,¹⁷, and these more variable sources were detected at very low abundances (≤ 10%). Moreover, inconsistencies were noted when results were compared to those obtained using qPCR assays for the well-established, human-associated HF183 marker¹⁸, as well as markers for avian and cattle fecal contamination¹⁹.
Results of some validation studies tended to support the use of SourceTracker to characterize multiple sources of fecal pollution, at least in a toolbox approach. However, the extent to which factors that typically encumber library-dependent methods, such as library size and spatiotemporal variability, affect SourceTracker has yet to be assessed in a systematic way. Moreover, while NGS offers the promise of characterizing members of the rare biosphere²⁰, a variety of technical limitations and causes of error exist²¹. Furthermore, the discrepancy between results of qPCR assays, which target taxa typically abundant in source communities, e.g., Bacteroidales², and SourceTracker results suggests a disconnect between more traditional microbiological interrogation of source communities and the more highly technical machine learning and Bayesian approaches applied to evaluate NGS datasets.
The primary aim of this study was to clarify the inconsistent results obtained between traditional statistical and Bayesian approaches to determine differences in community composition between source categories, as well as to address the feasibility of using non-local source libraries to identify local fecal source contamination. Source communities from previously published amplicon-based NGS datasets were characterized using non-parametric, multivariate statistics (principal coordinate analysis and linear discriminant analysis of effect sizes) to identify which genera in these communities were presumed to be the most informative. These genera were then compared against operational taxonomic units (OTUs) selected by the SourceTracker algorithm, which assumes a Dirichlet distribution among source (i.e., host animal bacterial communities) and sink (i.e., spiked freshwater) communities²² and employs a Bayesian machine learning approach¹⁰. The inter-laboratory transferability of the source library, using SourceTracker, was then challenged against blinded freshwater samples spiked with source fecal material from a geographic region not represented in the initial library. Finally, an amplicon-based library was constructed using the local, spiked sources to determine which combination of sources and community features (genera) afforded the best prediction of the true composition of the blinded samples. Results of this study provide a basis for interpreting the more highly technical results obtained from these machine learning, Bayesian analyses, as well as insight as to the most biologically robust features to include and consider when building NGS libraries for community-based MST.
2. METHODS
2.1. Initial Taxon Library Assembly
The initial fecal taxon library comprised previously published amplicon-based source communities characterized using Illumina NGS of the V5+V6 hypervariable regions of the 16S rRNA gene. Bacterial communities in primary-treated influent from wastewater treatment plants (WWTPs) in various cities throughout Australia¹⁹,²³ were included, along with previously unpublished data from California, USA. Fecal samples from birds (including plover, wood duckling, noisy miner, Pacific black duckling, blue-faced honeyeater, magpie, crow, ibis, seagull, and topknot pigeon), cats, dogs, horses, kangaroos, and possums were also included and were obtained from throughout Queensland, Australia¹⁹,²⁴. Fecal samples from beavers, Canada geese, cats, cattle (beef and dairy, as separate sources), chickens, deer, dogs, gulls, rabbits, swine, and turkeys were obtained throughout Minnesota, USA¹⁷. Previously unpublished data from dogs and gulls collected in California were also included. In total, the initial library comprised samples from beavers (n = 19), beef cattle (10), birds (13), Canada geese (25), cats (27), chickens (15), dairy cattle (21), deer (19), dogs (42), gulls (28), horses (14), kangaroos (14), possums (18), rabbits (18), swine (18), turkeys (18), and influent from WWTPs (79). Sources used to create blinded samples for validation (described below) were not included in the initial library.
2.2. Sample Preparation and DNA Extraction
For the double-blinded study, 20 L of freshwater was collected from the Hillsborough River (28.0549° N, 82.3635° W, Tampa, FL, USA). Ten individual fresh fecal samples were collected from each of four animal hosts (cow, horse, cat, and dog). Animal fecal samples were obtained around the Tampa Bay area. In addition, approximately 1 g of feces (wet weight) from each animal fecal sample was measured and mixed to form a single-host-species composite fecal sample (10 individual animals represented per composite) to spike into blinded samples (see below) and to serve as positive controls for identification of blinded sources. A primary-treated wastewater sample was collected from a WWTP in Tampa, and, in triplicate, 10 mL of sewage was filtered through 0.45 µm mixed cellulose ester filter membranes (Thermo Fisher Scientific, Waltham, MA, USA). The DNeasy PowerSoil DNA extraction kit (QIAGEN, Hilden, Germany) was used to extract DNA from 250 mg (wet weight) of feces from all individual animals and composites, and from sewage filters, with a holding time of no more than 6 hours prior to extraction. Forty-three fecal source samples and four composite source samples were generated for NGS.
Composite animal fecal samples were diluted with 300 mL of phosphate buffered saline (pH 8.0) to make fecal slurries for each animal host. Ambient river water samples (300 mL) were spiked with fecal slurries and sewage in various combinations (ranging from 0.025 to 10% vol/vol per source; Table 1). Spiked river water samples were filtered through 0.45 µm filter membranes and the PowerSoil kit was used to extract DNA directly from the membranes. All DNA samples were stored at −80 °C and shipped to the analytical laboratory, blinded, on dry ice.
2.3. PCR Amplification and Sequencing
The V5+V6 regions of the 16S rRNA gene were amplified using the BSF1064/784 primer set, described previously²⁵. Amplification and sequencing were performed using the dual index method by the University of Minnesota Genomics Center (UMGC, Minneapolis, MN, USA)²⁶. Paired-end sequencing was done on the Illumina MiSeq platform (Illumina, Inc., San Diego, CA) at a read length of 300 nucleotides (nt). Previously unpublished sequencing data are available under BioProject accession number SRP118701 in the Sequence Read Archive at the National Center for Biotechnology Information.
2.4. Bioinformatics and SourceTracker Analyses
Sequence processing and analysis were done using mothur ver. 1.35.1²⁷. Sequences were trimmed to 150 nt, to remove low quality regions at the 3’ ends while still allowing for an overlap of approximately 20 nt, and paired-end joined using fastq-join²⁸. Quality trimming was performed as described previously²⁹. Sequences were aligned against the SILVA database ver. 123³⁰ and subjected to a 2% pre-clustering step³¹. Chimeras were identified and removed using UCHIME ver. 4.2.40³². Operational taxonomic units were assigned at 97% similarity using complete-linkage clustering, and taxonomic classification was performed using the Ribosomal Database Project release 14³³. Samples in the source libraries were rarefied to 10,000 sequence reads per sample for comparison³⁴. Blinded sink samples were similarly normalized to 10,000 sequence reads, where possible, for consistency. Blinded sink samples SW06, SW07, SW12, SW14, and SW25 had fewer sequence reads (5616, 8599, 6414, 1769, and 271 reads, respectively) but were included in the SourceTracker analyses since lower numbers of reads among sink samples would not influence source assignments.
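The rarefaction step described above can be illustrated with a minimal Python sketch. It is not the mothur routine used in the study; it only shows the general idea of subsampling reads without replacement to a fixed depth, under the assumption that samples below the target depth are kept unchanged (as was done for the shallow sink samples). The function name `rarefy`, the OTU labels, and the counts are hypothetical; only the depth of 10,000 reads and the 271-read total for SW25 come from the text.

```python
import random
from collections import Counter


def rarefy(otu_counts, depth=10_000, seed=0):
    """Subsample a sample's reads to `depth` without replacement.

    `otu_counts` maps an OTU label to its read count. Samples that hold
    `depth` reads or fewer are returned unchanged (an assumption of this
    sketch, mirroring the shallow sink samples kept in the study).
    """
    total = sum(otu_counts.values())
    if total <= depth:
        return dict(otu_counts)
    # Expand the counts into one label per read, then draw `depth` reads.
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(reads, depth)))


# Hypothetical source sample with 15,000 reads.
sample = {"Otu001": 9000, "Otu002": 4000, "Otu003": 2000}
rarefied = rarefy(sample)

# Hypothetical OTU breakdown for a 271-read sink sample (total as for SW25).
shallow = rarefy({"Otu001": 200, "Otu002": 71})
```

A fixed seed is used here only so that the sketch is reproducible; the seed and the list-expansion approach are illustrative choices rather than details taken from the study.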
SourceTracker analysis was performed using SourceTracker ver. 0.9.8¹⁰ and default parameters. To determine the extent to which SourceTracker could differentiate individual sources from geographic regions among the initial library, samples were binned by broad host category, which included domesticated animals (cats and dogs), livestock (cattle, horses, and swine), avians (birds, Canada geese, chickens, gulls, and turkeys), wildlife (beavers, deer, kangaroos, possums, and rabbits), and WWTP samples. Due to compositional similarity in communities of some members of the avian group (i.e., the general bird group from Queensland and gulls from California) with WWTP communities, these avian groups were included in an additional grouping with WWTP samples. Among WWTP samples, geographic location was specified as accurately as possible based on prior publications¹⁹,²³. The Queensland WWTP samples reflect samples collected from a broader geographic region and at a different sampling time than Brisbane WWTP samples, which were collected from a single WWTP.
Approximately 50% of the individuals in each source category (host and geographic location) were randomly assigned to SourceTracker source or sink categories in order to determine the library’s ability to classify sink samples not included in the library. Taxonomic fingerprints for each source category were determined by genus-level classification of OTUs that contributed to the sink assignment of that category, normalized to 100% of the total sink prediction. To evaluate blinded samples, three source libraries were used: 1) the initial library without sources from Tampa, FL, 2) the FL sources alone, and 3) all sources (initial + FL). Geographic origin was not included in source designations when evaluating blinded samples, and false positives were defined as samples in which SourceTracker identified ≥ 1.0% of a source that did not correspond to a spike. To assess potential mischaracterization of samples when a source was not included in the library, the FL library alone was used to analyze samples with all samples related to a single spiked source designated as SourceTracker sinks (five separate runs, one for each spiked source). All results reflect those of one SourceTracker run per library configuration using default parameters.
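The false-positive rule stated above (a source predicted at ≥ 1.0% that does not correspond to a spike) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' scoring script: the function names, the sample values, and the decision to ignore an "Unknown" category are hypothetical, while the 1.0% threshold and the 31-of-34 tally reported in the Abstract come from the text.

```python
FALSE_POSITIVE_THRESHOLD = 1.0  # percent of the sink community, per the text


def called_sources(predicted_pct, threshold=FALSE_POSITIVE_THRESHOLD):
    """Sources whose predicted share meets the threshold.

    The "Unknown" category is skipped (an assumption of this sketch).
    """
    return {
        source
        for source, pct in predicted_pct.items()
        if source != "Unknown" and pct >= threshold
    }


def false_positives(predicted_pct, spiked):
    """Called sources that were not actually spiked into the sample."""
    return sorted(called_sources(predicted_pct) - set(spiked))


# Hypothetical sink prediction (percent) for a sample spiked with cow only.
sample_prediction = {"cow": 35.0, "horse": 6.7, "dog": 0.3, "Unknown": 58.0}
fp = false_positives(sample_prediction, spiked={"cow"})

# Presence/absence accuracy for the tally reported in the Abstract (31 of 34).
accuracy_pct = 100 * 31 / 34
```

In this hypothetical sample the dog signal falls below the threshold and is not called, whereas the horse signal is flagged because horse was not among the spiked sources.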
2.5. Statistical Analyses
Differences between bacterial communities (beta diversity) were determined using analysis of similarity (ANOSIM)³⁵ calculated using Bray-Curtis dissimilarity matrices³⁶. Similarly, ordination was performed by principal coordinate analysis (PCoA)³⁷ using Bray-Curtis matrices. Significance of sample clustering on ordination plots was evaluated by analysis of molecular variance (AMOVA)³⁸. To determine which genera were significantly correlated with ordination position, OTUs were classified to genera and total genus abundances were related to ordination axes by Spearman correlation using the corr.axes command in mothur. For clarity, only the five most abundant genera among the broader grouping were plotted. Linear discriminant analysis (LDA) of effect sizes (LEfSe)³⁹ was used to identify highly differential, source-associated OTUs (LDA score ≥ 4.0), which were then classified to genera. Spearman correlations relating SourceTracker sink predictions to source material spiked (% volume) and ANOVA comparing sink predictions between the FL blinded and full source libraries were calculated using XLSTAT ver. 17.06 (Addinsoft, Belmont, MA, USA). All statistics were evaluated at α = 0.05, with Bonferroni correction for multiple comparisons.
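Two of the quantities named above, Bray-Curtis dissimilarity and the Bonferroni-corrected significance threshold, are simple enough to sketch directly. The Python below is a minimal illustration and not the mothur or XLSTAT implementation; the genus counts are hypothetical, and the figure of 250 comparisons is an assumption chosen only because 0.05 / 250 equals 0.0002 (the number of comparisons actually made is not stated in the text).

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count profiles.

    `a` and `b` map a taxon label to its count; taxa missing from one
    profile are treated as zero. Returns 0.0 for identical profiles and
    1.0 for profiles that share no taxa.
    """
    taxa = set(a) | set(b)
    numerator = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    denominator = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return numerator / denominator if denominator else 0.0


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical genus-level counts for two communities.
community_a = {"Bacteroides": 10, "Blautia": 5}
community_b = {"Bacteroides": 5, "Blautia": 5, "Rikenella": 5}
dissimilarity = bray_curtis(community_a, community_b)

# Assumed number of pairwise comparisons, for illustration only.
corrected_alpha = bonferroni_alpha(0.05, 250)
```

For the two hypothetical communities the absolute differences sum to 10 and the total counts sum to 30, giving a dissimilarity of one third.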
3. RESULTS
3.1. Community Composition of the Initial Library
Among feces from all hosts and sewage represented in the initial library, the majority of communities were composed predominantly of members of the families Lachnospiraceae and Ruminococcaceae, within the Firmicutes phylum, and Bacteroidaceae and Prevotellaceae, within the Bacteroidetes phylum (Supplementary Figure S1). Lower relative abundances of the less abundant families were observed among domesticated animals than among the other host source groups. Moreover, communities within the avian group tended to harbor greater relative abundances of Enterobacteriaceae and Pseudomonadaceae (phylum Proteobacteria), and Lactobacillaceae (Firmicutes).
Among all sources, communities generally differed significantly from each other by both host species and, within a host species, geography, as evaluated by ANOSIM (r = 0.879, P < 0.001; Figure 1). Some similarity was observed among the Brisbane, Perth, and Queensland WWTP communities, as well as between Hobart and Melbourne, and these communities did not differ significantly at a Bonferroni-corrected α = 0.0002. Queensland bird communities were also not significantly different from the Hobart and Perth WWTPs (r = 0.365 and 0.263, P = 0.001 and 0.061, respectively). When grouped into broader categories (e.g., domesticated animals, as described in Methods), samples also clustered independently following ordination by PCoA (AMOVA P < 0.001; Figure 1), although communities found to be similar by ANOSIM remained close on ordination and were not significantly separated by AMOVA.
The OTUs that were identified as highly discriminant among source categories by LEfSe (Figure 2) were typically classified among the genera that were significantly associated with ordination position by Spearman correlation (P < 0.05; Figure 1). Furthermore, these genera tended to belong to the most abundant families found in each host group (Figure S1). For example, the genus Bacteroides was correlated with ordination position for Queensland cats and dogs in the “domestic” source category (Figure 1), and was included among the discriminatory genera in LEfSe analysis (Figure 2). However, among livestock sources, the most abundant genera that correlated with ordination position were not discretely associated with specific source categories, as noted by positioning of most of these away from source communities, with the exception of Xylanibacter (Figure 1). Correspondingly, LEfSe identified OTUs primarily within less abundant genera as source-associated but did not identify OTUs within Xylanibacter to differentiate livestock, although Rikenella, Alistipes, and Bacteroides were identified (Figure 2).
3.2. SourceTracker Results for the Initial Library
Overlap of bacterial communities among host sources was evaluated by assigning all samples in a single host group in the library, irrespective of geography, as a source and all remaining samples in the initial library as sinks (Supplementary Table S1). A moderate-to-high (26.2–99.5%) predicted community similarity was observed among host samples within the broader avian group. Cat and dog communities also showed a high degree of overlap, with mean similarities of 77.5 and 89.4%. Livestock and wildlife showed much less overlap in community composition (5.2–23.6% and 2.1–44.2% predicted similarity, respectively). In contrast, beef and dairy cattle showed much greater overlap, as 77.3 and 89.9% of the communities were predicted to be in common. Inter-group similarity was generally much less than intra-group similarity, with the exception that a moderate-to-high degree of similarity was predicted between avian sources and those from WWTPs (mean 43.1 ± 25.6% community similarity).
To assess the accuracy of SourceTracker software to discriminate among closely related sources and different geographic regions, and to classify OTUs from fecal samples not in the library, the initial library was divided approximately in half (i.e., half of the individuals in a host category were assigned as source and half were sinks) and evaluated amongst the broader host categories. Overall, sink predictions were accurate, with ≥ 80% of sink community taxa identified as the corresponding source category, with specificity to geography (Figure 3). Sinks within the avian category were identified less accurately, with approximately 60% of chicken communities correctly assigned, while bird communities from Queensland were poorly identified, being assigned as a mixture of MN Canada geese and CA gulls in addition to the correct source. Bacterial communities from WWTPs were also less accurately identified compared to animal sources, with overlap in assignments among the more highly similar WWTPs. Furthermore, the bacterial communities in WWTP influent that were poorly assigned were not well represented in the source library (n < 5).
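The split-library evaluation described above amounts to a random, per-category partition of samples into sources and sinks. A minimal Python sketch of such a partition is given here; it is an illustration only, and the function name, sample identifiers, and seed are hypothetical. The category sizes (14 horses, 19 beavers) are taken from the library description in Section 2.1.

```python
import random


def split_source_sink(samples_by_category, seed=0):
    """Randomly assign about half of each category to sources, the rest to sinks."""
    rng = random.Random(seed)
    sources, sinks = {}, {}
    for category, samples in samples_by_category.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2  # odd-sized categories give the extra sample to sinks
        sources[category] = shuffled[:half]
        sinks[category] = shuffled[half:]
    return sources, sinks


# Hypothetical sample identifiers; only the group sizes come from the text.
library = {
    "horse": [f"horse_{i:02d}" for i in range(14)],
    "beaver": [f"beaver_{i:02d}" for i in range(19)],
}
sources, sinks = split_source_sink(library)
```

Sending the extra sample of an odd-sized category to the sink set is an arbitrary choice of this sketch; the text specifies only that the split was approximately 50%.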
Informative OTUs utilized by the SourceTracker software to perform source assignments in sink samples varied among specific host groups within broader source categories (Figure 4). Many of the OTUs were classified within abundant genera that were associated with specific source categories, e.g., a large proportion of the SourceTracker fingerprint for Queensland dogs was attributed to OTUs within the genus Xylanibacter (Figure 4A), which was highly correlated with their ordination position (Figure 1A). Livestock and wildlife sources were predominantly defined by OTUs within less abundant genera (Figures 4B and 4D), while genera at greater abundances were poorly associated with hosts (Figures 1B and 1D). The avian sources, which had more distinct community compositions than did other categories (Figure S1), could be identified using OTUs within fewer genera than most of the other broader categories, particularly the California birds in which Pseudomonas was a dominant driver of classification, and less abundant genera comprised less than 5% of the taxonomic fingerprint (Figure 4C). Furthermore, profiles for communities represented at lower abundances in the source libraries (e.g., Hobart and Melbourne WWTPs) were defined almost entirely by OTUs within less prominent genera (Figure 4E-F).
3.3. Evaluation of Blind Samples Using the Initial Library
Source 05 was identified as the FL WWTP source based on the number of samples (n = 3), since only triplicates of this source were processed. The remaining blinded source samples were evaluated against the initial library using SourceTracker, but could not be unambiguously assigned to specific sources by this method (mean sink prediction ≤ 6.9% to any source, irrespective of geography). Furthermore, community compositions in all blinded sources were significantly different from all other sources represented in the initial library (ANOSIM P < 0.001). Therefore, results from PCoA and LEfSe analyses among the initial library sources were used to narrow down possible sources based on predominant taxonomic composition (Supplementary Figure S2). Sources 01 and 02 were determined to belong to livestock or wildlife groups based on the abundance of Rikenella (Figures 1B and 1D). Sources 03 and 04 were preliminarily identified as domestic animal sources based on abundances of Blautia (Figure 1A) and Catenibacterium (Figure 2A). Sources were further interrogated by PCoA within these broader host source categories (Supplementary Figure S3). On the basis of these observations, Sources 01 to 04 were identified as coming from cows, horses, cats, and dogs, respectively, which was then confirmed to be correct by those who prepared the double-blinded samples. Composite source samples were evaluated as positive controls against the FL library alone and were correctly classified by SourceTracker at >86% community similarity.
Blinded samples were interrogated against the unblinded FL sources (Table 1), which comprised the local library. Source identifications in samples, as determined by SourceTracker, were 91% (31/34) accurate based on presence/absence sample composition. In two samples, SW25 and SW32, SourceTracker identified a weak sewage signature representing 1.2 and 2.3% of the community, respectively, which was not consistent with a source spike. Similarly, in sample SW31, which was spiked with only cow fecal material, SourceTracker also identified a low signature for horse (6.7% of the community). When data from single FL source categories (e.g., all cattle) were removed from the library, individual fecal samples were incorrectly assigned, generally as the most closely related host. Cow samples were misidentified as horse (mean 13.6 ± 2.4% of the community), horse samples were misidentified as cow (5.3 ± 1.4%), cat samples were misidentified as a mixture of dog and sewage (61.3 ± 21.7% and 15.8 ± 22.3%, respectively), dog samples were misidentified as cat (90.5 ± 9.4%), and WWTP samples were misidentified as cat (5.5 ± 1.8%). With allowance for these misclassifications (e.g., ignoring classification of a cattle spike as horse when cattle samples were excluded from the source library), the presence/absence results among blinded samples did not change when a single source was omitted from the library.
To evaluate the relative quantitative accuracy of SourceTracker, Spearman
correlations were performed relating sink predictions (as %) with sample volumes
spiked (0.025 – 10%). Strong and significant positive correlations were observed
between SourceTracker sink predictions and volumes spiked (ρ = 0.974, 0.924, 0.887,
0.884, and 0.953 for cow, horse, cat, dog, and sewage sources, respectively;
P < 0.0001 for all sources).
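
As a minimal sketch of the rank correlation used here, the snippet below computes Spearman's ρ as the Pearson correlation of average ranks. The spike volumes and predicted percentages are hypothetical illustration values, not data from the study, and the function names are invented for this example; the authors' actual statistical software is not stated in this section.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical values (NOT study data): spiked volume (%) vs. predicted share (%).
spiked_pct = [0.025, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0]
predicted_pct = [0.4, 0.3, 1.1, 2.5, 6.0, 20.0, 45.0]

rho = spearman_rho(spiked_pct, predicted_pct)
print(f"Spearman rho = {rho:.3f}")
```

Because the statistic depends only on ranks, it tests whether predictions increase monotonically with spike volume rather than whether they match the spiked percentages, which is consistent with the text describing the accuracy as relative.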
When evaluated against the initial library alone, sources were poorly identified in
blinded sink samples (sink predictions ≤ 12.8% to specific hosts, irrespective of
geography; Supplementary Table S2), similar to the blinded source samples.
Furthermore, combining the initial and FL source libraries did not significantly affect sink
predictions (ANOVA P = 0.787 – 0.997 for each source, individually; Table S2) from
those observed when using the local FL library alone.
4. DISCUSSION
Results of this study indicate that identification of fecal source contamination in
recreational freshwater using SourceTracker depends on the inclusion of
geographically associated source samples in the source library. Using an initial
library with geographically divergent sources, but no representation from local sources,
blinded source samples could not be unambiguously defined, with mean similarities of
blinded source communities < 7% to sources in the initial fecal library. Furthermore,
despite considerable overlap in community composition among certain host species, e.g.,
avian species, the algorithm was generally able to assign >80% of the sink community
to the correct source and geography. These results suggest that, despite taxonomic
similarity in the fecal microbial community among closely related sources40, individuals
vary by geographic region and specific species compositions (here assessed as OTUs),
as has been well documented among humans41,42.
We specifically sought to investigate potential discrepancies between source
identification using SourceTracker and amplicon sequencing data from conventional
qPCR assays. Multivariate statistical analyses generally identified more highly abundant
genera and OTUs within these genera as potentially discriminatory (Figure 2), and in
most cases, these genera contributed, in part, to sink predictions as determined by
SourceTracker (Figure 4). However, in many cases, OTUs within less abundant genera
accounted for ≥ 50% of the source fingerprints, and we suggest this is due to the
over-dispersed nature of bacterial communities22. Thus, while qPCR assays target specific
taxa that are shared within most individuals of a host species2, SourceTracker can
use some of these taxa and also capitalize on lower-abundance species to more
accurately discriminate among closely related hosts. This may indicate a trade-off in
methodologies, where qPCR has greater sensitivity to detect low levels of source
contamination, while SourceTracker offers highly specific source identification.
However, it is important to note that the SourceTracker algorithm correctly identified as
little as 0.025% of spiked source, by volume (Table 1). Although this result does
not represent a truly quantitative measurement of bacterial contamination (i.e., numbers
of cells), it strongly suggests that the algorithm has appropriate sensitivity to detect
biologically relevant contamination events. Future work will be necessary to provide a
more accurate quantitative assessment of this observation and place it in the context of
current water quality monitoring standards.
Library size has also historically been an important consideration for
library-dependent MST methods8. We previously reported that a minimum library size of 13
individuals was necessary for an adequately powered analysis of statistically significant
differences in community composition17, but previous studies evaluating SourceTracker
have sometimes relied on only one or two individuals and achieved results that
corresponded with the expected source composition of their samples16,43. Here, 10
individuals were sufficient for accurate and, perhaps, relatively quantitative identification
of fecal contamination among blinded spikes. Furthermore, analysis of the initial library
suggests that, in practice,