Application of SourceTracker for Accurate Identification of Fecal

Mar 5, 2018 - The efficacy of SourceTracker software to attribute contamination from a variety of fecal sources spiked into ambient freshwater samples...

Environmental Measurements Methods

Application of SourceTracker for Accurate Identification of Fecal Pollution in Recreational Freshwater: A Double-Blinded Study

Christopher Staley, Thomas Kaiser, Aldo Lobos, Warish Ahmed, Valerie J. Harwood, Clairessa M. Brown, and Michael J. Sadowsky

Environ. Sci. Technol., Just Accepted Manuscript • DOI: 10.1021/acs.est.7b05401 • Publication Date (Web): 05 Mar 2018

Downloaded from http://pubs.acs.org on March 6, 2018


Environmental Science & Technology

Revised es-2017-05401y

Application of SourceTracker for Accurate Identification of Fecal Pollution in Recreational Freshwater: A Double-Blinded Study

Christopher Staley1, Thomas Kaiser1, Aldo Lobos2, Warish Ahmed3, Valerie J. Harwood2, Clairessa M. Brown1, and Michael J. Sadowsky1,4,*

1BioTechnology Institute, University of Minnesota, 1479 Gortner Ave, St. Paul, MN 55108; 2Department of Integrative Biology, SCA 110, University of South Florida, 4202 East Fowler Ave, Tampa, Florida 33620; 3CSIRO Land and Water, Ecosciences Precinct, 41 Boggo Road, Qld 4102, Australia; 4Department of Soil, Water, and Climate, University of Minnesota, 1991 Upper Buford Cir, St. Paul, MN 55108

*Corresponding Author: Michael J. Sadowsky, BioTechnology Institute, University of Minnesota, 140 Gortner Lab, 1479 Gortner Ave, Saint Paul, MN 55108; Phone: (612)624-2706, Email: [email protected]

Running title: SourceTracker validation blinded study

Keywords: microbial community / microbial source tracking / next-generation sequencing / SourceTracker / water quality


ABSTRACT

The efficacy of SourceTracker software to attribute contamination from a variety of fecal sources spiked into ambient freshwater samples was investigated. Double-blinded samples spiked with ≤ 5 different sources (0.025–10% vol/vol) were evaluated against fecal taxon libraries characterized by next-generation amplicon sequencing. Three libraries, including an initial library (17 non-local sources), a blinded source library (5 local sources), and a composite library (local and non-local sources), were used with SourceTracker. SourceTracker’s predictions of fecal compositions in samples were made, in part, based on distributions of taxa within abundant genera identified as discriminatory by discriminant analyses, but also using a large percentage of low abundance taxa. The initial library showed poor ability to characterize blinded samples, but, using local sources, SourceTracker showed 91% accuracy (31/34) at identifying the presence of source contamination, with two false positives for sewage and one for horse. Furthermore, sink predictions of source contamination were positively correlated (Spearman’s ρ ≥ 0.88, P < 0.001) with spiked source volumes. Using the composite library did not significantly affect sink predictions (P > 0.79) compared to those made using the local sources alone. Results of this study indicate that geographically associated fecal samples are required for SourceTracker to assign host sources accurately.


1. INTRODUCTION

Fecal pollution of water is a significant global health issue due to the likely presence of waterborne pathogens. Therefore, identification of the source(s) of fecal pollution is critical for implementing appropriate remediation strategies and mitigating human health risks associated with water use and reuse. Fecal pollution of environmental waters has been historically assessed by enumerating fecal indicator bacteria (FIB), such as Escherichia coli, Enterococcus spp., and Clostridium perfringens, using culture-based methods [1]. However, monitoring FIB in environmental waters does not provide information on the source of pollution, e.g., human or animal feces [2], or naturalized FIB in the environment [3,4], necessitating the use of microbial source tracking (MST) methodologies. Early MST tools were library-dependent and required isolation and typing of hundreds to thousands of FIB from human and animal feces to generate source-associated libraries [5–7]. Conversely, library-independent methods target a gene fragment from a taxonomic group that typically co-evolved, or is otherwise associated (e.g., by infection), with a specific host, providing a host-associated marker typically enumerated by quantitative PCR (qPCR) [2].

Either type of MST method requires extensive validation prior to its application in environmental studies [8]. In a previous multi-laboratory study, 22 laboratories used 12 different library-dependent and -independent methods to determine sources in blinded water samples spiked with one to three fecal sources [9]. Results indicated that library-dependent methods were prone to false positive detection, while library-independent methods tended to produce false negative results. Despite drawbacks of both types of methods, library-independent methods have been predominantly used over the last decade [2], due to decreased labor and supply costs as well as reduced spatiotemporal variability compared to that associated with library-dependent methods [8]. More recently, several studies have suggested the use of next-generation sequencing (NGS) to characterize bacterial contamination from multiple sources including recreational waters, hospital environments, and other ecosystems [10–13]. In a similar strategy to that used in previous MST method comparisons, a blinded study was performed in which three community-based methods were evaluated, including terminal restriction fragment length polymorphism, phylogenetic microarray, and Illumina NGS using community dissimilarity indices [14]. Sixty-four blinded samples, spiked with single or dual sources, were generated from 12 host groups. While all three methods were able to correctly identify the dominant sources for 95% of the blinded samples, detection of the second, minor sources was less accurate.

To accurately quantify multiple sources present at low abundances in sink communities (communities impacted by contamination from a source, e.g., recreational waters receiving fecal contamination), the Bayesian algorithm SourceTracker was proposed [10]. The allowance of unknown sources using this method was hypothesized to improve accuracy in overall source assignments. The SourceTracker program has been evaluated in field studies and validated against hydrodynamic modeling of source contamination [15] as well as in vitro constructed source mixtures [16]. The program performed with high accuracy, sensitivity, and specificity using default parameters. However, studies have also noted that source assignments with high relative standard deviations (i.e., greater variability in quantitation across technical replicates) had lower confidence [16,17], and these more variable sources were detected at very low abundances (≤ 10%). Moreover, inconsistencies were noted when results were compared to those obtained using qPCR assays for the well-established, human-associated HF183 marker [18], as well as markers for avian and cattle fecal contamination [19].

Results of some validation studies tended to support the use of SourceTracker to characterize multiple sources of fecal pollution, at least in a toolbox approach. However, the extent to which factors that typically encumber library-dependent methods, such as library size and spatiotemporal variability, affect SourceTracker has yet to be assessed in a systematic way. Moreover, while NGS offers the promise of characterizing members of the rare biosphere [20], a variety of technical limitations and causes of error exist [21]. Furthermore, the discrepancy between results of qPCR assays, which target taxa typically abundant in source communities, e.g., Bacteroidales [2], and SourceTracker results suggests a disconnect between more traditional microbiological interrogation of source communities and the more highly technical machine learning and Bayesian approaches applied to evaluate NGS datasets.

The primary aim of this study was to clarify the inconsistent results obtained between traditional statistical and Bayesian approaches to determine differences in community composition between source categories, as well as to address the feasibility of using non-local source libraries to identify local fecal source contamination. Source communities from previously published amplicon-based NGS datasets were characterized using non-parametric, multivariate statistics (principal coordinate analysis and linear discriminant analysis of effect sizes) to identify which genera in these communities were presumed to be the most informative. These genera were then compared against operational taxonomic units (OTUs) selected by the SourceTracker algorithm, which assumes a Dirichlet distribution among source (i.e., host animal bacterial communities) and sink (i.e., spiked freshwater) communities [22] and employs a Bayesian machine learning approach [10]. The inter-laboratory transmissibility of the source library, using SourceTracker, was then challenged against blinded freshwater samples spiked with source fecal material from a geographic region not represented in the initial library. Finally, an amplicon-based library was constructed using the local, spiked sources to determine which combination of sources and community features (genera) afforded the best prediction of the true composition of the blinded samples. Results of this study provide a basis for interpreting the more highly technical results obtained from these machine learning, Bayesian analyses as well as provide insight as to the most biologically robust features to include and consider when building NGS libraries for community-based MST.

2. METHODS

2.1. Initial Taxon Library Assembly

The initial fecal taxon library comprised previously published amplicon-based source communities characterized using Illumina NGS of the V5+V6 hypervariable regions of the 16S rRNA gene. Bacterial communities in primary-treated influent from wastewater treatment plants (WWTPs) were obtained from various cities throughout Australia [19,23] and from previously unpublished data from California, USA. Fecal samples from birds (including plover, wood duckling, noisy miner, Pacific black duckling, blue-faced honeyeater, magpie, crow, ibis, seagull, and topknot pigeon), cats, dogs, horses, kangaroos, and possums were also included and were obtained from throughout Queensland, Australia [19,24]. Fecal samples from beavers, Canada geese, cats, cattle (beef and dairy, as separate sources), chickens, deer, dogs, gulls, rabbits, swine, and turkeys were obtained throughout Minnesota, USA [17]. Previously unpublished data from dogs and gulls collected in California were also included. In total, the initial library comprised samples from beavers (n = 19), beef cattle (10), birds (13), Canada geese (25), cats (27), chickens (15), dairy cattle (21), deer (19), dogs (42), gulls (28), horses (14), kangaroos (14), possums (18), rabbits (18), swine (18), turkeys (18), and influent from WWTPs (79). Sources used to create blinded samples for validation (described below) were not included in the initial library.

2.2. Sample Preparation and DNA Extraction

For the double-blinded study, 20 L of freshwater was collected from the Hillsborough River (28.0549° N, 82.3635° W, Tampa, FL, USA). Ten individual fresh fecal samples were collected from each of four animal hosts (cow, horse, cat, and dog). Animal fecal samples were obtained around the Tampa Bay area. In addition, approximately 1 g of feces (wet weight) from each animal fecal sample was measured and mixed to form a single-host-species composite fecal sample (10 individual animals represented per composite) to spike into blinded samples (see below) and to serve as positive controls for identification of blinded sources. A primary-treated wastewater sample was collected from a WWTP in Tampa, and, in triplicate, 10 mL of sewage was filtered through 0.45 µm mixed cellulose ester filter membranes (Thermo Fisher Scientific, Waltham, MA, USA). The DNeasy PowerSoil DNA extraction kit (QIAGEN, Hilden, Germany) was used to extract DNA from 250 mg (wet weight) of feces from each individual animal and composite, and from the sewage filters, with a holding time of no more than 6 hours prior to extraction. Forty-three fecal source samples and four composite source samples were generated for NGS.

Composite animal fecal samples were diluted with 300 mL of phosphate-buffered saline (pH 8.0) to make fecal slurries for each animal host. Ambient river water samples (300 mL) were spiked with fecal slurries and sewage in various combinations (ranging from 0.025 to 10% vol/vol per source; Table 1). Spiked river water samples were filtered through 0.45 µm filter membranes and the PowerSoil kit was used to extract DNA directly from the membrane. All DNA samples were stored at −80 °C and shipped to the analytical laboratory, blinded, on dry ice.

2.3. PCR Amplification and Sequencing

The V5+V6 regions of the 16S rRNA gene were amplified using the BSF1064/784 primer set, described previously [25]. Amplification and sequencing were done using the dual index method by the University of Minnesota Genomics Center (UMGC, Minneapolis, MN, USA) [26]. Paired-end sequencing was done on the Illumina MiSeq platform (Illumina, Inc., San Diego, CA) at a read length of 300 nucleotides (nt). Previously unpublished sequencing data are available under BioProject accession number SRP118701 in the Sequence Read Archive at the National Center for Biotechnology Information.

2.4. Bioinformatics and SourceTracker Analyses

Sequence processing and analysis were done using mothur ver. 1.35.1 [27]. Sequences were trimmed to 150 nt, to remove low quality regions at the 3′ ends while still allowing for an overlap of approximately 20 nt, and paired-end joined using fastq-join [28]. Quality trimming was performed as described previously [29]. Samples were aligned against the SILVA database ver. 123 [30] and subjected to a 2% pre-clustering step [31]. Chimeras were identified and removed using UCHIME ver. 4.2.40 [32]. Operational taxonomic units were assigned at 97% similarity using complete-linkage clustering and taxonomic classification was performed using the Ribosomal Database Project release ver. 14 [33]. Samples in source libraries were rarefied to 10,000 sequence reads per sample for comparison [34]. Blinded sink samples were similarly normalized to 10,000 sequence reads, where possible, for consistency. Blinded sink samples SW06, SW07, SW12, SW14, and SW25 had fewer sequence reads (5616, 8599, 6414, 1769, and 271 reads, respectively) but were included in the SourceTracker analyses since lower numbers of reads among sink samples would not influence source assignments.
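The rarefaction step described above can be illustrated with a minimal Python sketch. This is not the mothur routine the authors ran; it only shows the idea of subsampling reads without replacement to a fixed depth. The function name `rarefy`, the OTU labels, and the read counts are hypothetical, and returning `None` for shallow samples is an assumption that mirrors the decision to keep low-read sinks unrarefied.

```python
import random
from collections import Counter

def rarefy(otu_counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a dict of OTU -> read count.

    Returns None when the sample holds fewer than `depth` reads, so the caller
    can decide whether to keep the sample unrarefied.
    """
    total = sum(otu_counts.values())
    if total < depth:
        return None
    # Expand the counts into one label per read, then draw without replacement.
    reads = [otu for otu, n in otu_counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(reads, depth)))

# Hypothetical source sample with 12,000 reads, rarefied to 10,000.
sample = {"OTU1": 7000, "OTU2": 4000, "OTU3": 1000}
rarefied = rarefy(sample, 10_000)

# Hypothetical sink with 271 reads (the depth reported for SW25): too shallow.
shallow = rarefy({"OTU1": 200, "OTU2": 71}, 10_000)
```

Drawing without replacement guarantees that no OTU ends up with more reads than it had before subsampling, which is the property that makes rarefied samples comparable at a common depth.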

SourceTracker analysis was performed using SourceTracker ver. 0.9.8 [10] and default parameters. To determine the extent to which SourceTracker could differentiate individual sources from geographic regions among the preliminary dataset, samples were binned by broad host category, which included domesticated animals (cats and dogs), livestock (cattle, horses, and swine), avians (birds, Canada geese, chickens, gulls, and turkeys), wildlife (beavers, deer, kangaroos, possums, and rabbits), and WWTP samples. Due to compositional similarity in communities of some members of the avian group (i.e., the general bird group from Queensland and gulls from California) with WWTP communities, these avian groups were included in an additional grouping with WWTP samples. Among WWTP samples, geographic location was specified as accurately as possible based on prior publications [19,23]. The Queensland WWTP samples reflect samples collected from a broader geographic region and at a different sampling time than Brisbane WWTP samples, which were collected from a single WWTP.

Approximately 50% of the individuals in each source category (host and geographic location) were randomly assigned to SourceTracker source or sink categories in order to determine the library’s ability to classify sink samples not included in the library. Taxonomic fingerprints for each source category were determined by genus-level classification of OTUs that contributed to the sink assignment of that category, normalized to 100% of the total sink prediction. To evaluate blinded samples, three source libraries were used: 1) the initial library without sources from Tampa, FL, 2) the FL sources alone, and 3) all sources (initial + FL). Geographic origin was not included in source designations when evaluating blinded samples, and false positives were defined as samples in which SourceTracker identified ≥ 1.0% of a source that did not correspond to a spike. To assess potential mischaracterization of samples when a source was not included in the library, the FL library alone was used to analyze samples, with all samples related to a single spiked source designated as SourceTracker sinks (five separate runs, one for each spiked source). All results reflect those of one SourceTracker run per library configuration using default parameters.
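As a rough illustration of the ≥ 1.0% false-positive rule, the following Python sketch scores one sink sample by presence/absence. The function name `score_sample` and the example values are hypothetical, and applying the same threshold to flag false negatives is an assumption, since the text defines only the false-positive criterion.

```python
THRESHOLD = 1.0  # percent of the sink community, the cutoff stated in the text

def score_sample(predicted_pct, spiked_sources, threshold=THRESHOLD):
    """Compare sources detected at or above the threshold with those actually spiked."""
    detected = {s for s, pct in predicted_pct.items() if pct >= threshold}
    spiked = set(spiked_sources)
    return {
        "false_positives": sorted(detected - spiked),  # detected but never spiked
        "false_negatives": sorted(spiked - detected),  # spiked but missed (assumed rule)
        "correct": detected == spiked,
    }

# Hypothetical sink spiked with cow only, with a spurious horse signal.
result = score_sample({"cow": 40.0, "horse": 6.7, "dog": 0.3}, ["cow"])

# Hypothetical sink where only sub-threshold noise accompanies the true source.
clean = score_sample({"cow": 40.0, "dog": 0.3}, ["cow"])
```

Under this reading, a sample counts as correct only when the set of detected sources matches the set of spiked sources exactly; how the authors tallied the overall accuracy figure is not spelled out, so this scoring should be treated as one plausible interpretation.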

2.5. Statistical Analyses

Differences between bacterial communities (beta diversity) were determined using analysis of similarity (ANOSIM) [35] calculated using Bray-Curtis dissimilarity matrices [36]. Similarly, ordination was performed by principal coordinate analysis (PCoA) [37] using Bray-Curtis matrices. Significance of sample clustering on ordination plots was evaluated by analysis of molecular variance (AMOVA) [38]. To determine which genera were significantly correlated with ordination position, OTUs were classified to genera and total genera abundances were related using corr.axes analysis for Spearman correlations in mothur. For clarity, only the five most abundant genera among the broader grouping were plotted. Linear discriminant analysis (LDA) of effect sizes (LEfSe) [39] was used to identify highly differential, source-associated OTUs (LDA score ≥ 4.0), which were then classified to genera. Spearman correlations relating SourceTracker sink predictions to source material spiked (% volume) and ANOVA comparing sink predictions between the FL blinded and full source libraries were calculated using XLSTAT ver. 17.06 (Addinsoft, Belmont, MA, USA). All statistics were evaluated at α = 0.05, with Bonferroni correction for multiple comparisons.
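For readers unfamiliar with the two quantities named above, a minimal Python sketch of the Bray-Curtis dissimilarity and the Bonferroni-adjusted significance threshold follows. The analyses in the study were run in mothur and XLSTAT; the function names and abundance vectors here are hypothetical and serve only to show the arithmetic.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two equal-length abundance vectors.

    0.0 means identical composition; 1.0 means no shared taxa.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return sum(abs(x - y) for x, y in zip(a, b)) / total

def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons

# Hypothetical OTU count vectors for pairs of communities.
d_same = bray_curtis([10, 0, 5], [10, 0, 5])
d_disjoint = bray_curtis([10, 0, 0], [0, 5, 5])
d_partial = bray_curtis([6, 4, 0], [2, 4, 4])

# Hypothetical example: ten pairwise tests at a family-wise alpha of 0.05.
alpha_adjusted = bonferroni_alpha(0.05, 10)
```

The number of comparisons used for the correction is not stated in the text, so the value of ten above is purely illustrative.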

3. RESULTS

3.1. Community Composition of the Initial Library

Among feces from all hosts and sewage represented in the initial library, the majority of communities were composed predominantly of members of the families Lachnospiraceae and Ruminococcaceae, within the Firmicutes phylum, and Bacteroidaceae and Prevotellaceae, within the Bacteroidetes phylum (Supplementary Figure S1). Less abundant families made up a smaller share of communities among domesticated animals than among the other host source groups. Moreover, communities within the avian group tended to harbor greater relative abundances of Enterobacteriaceae and Pseudomonadaceae (phylum Proteobacteria), and Lactobacillaceae (Firmicutes).

Among all sources, communities generally differed significantly from each other by both host species and, within a host species, geography, as evaluated by ANOSIM (r = 0.879, P < 0.001; Figure 1). Some similarity was observed among the Brisbane, Perth, and Queensland WWTP communities, as well as between Hobart and Melbourne, and these communities did not differ significantly at a Bonferroni-corrected α = 0.0002. Queensland bird communities were also not significantly different from the Hobart and Perth WWTPs (r = 0.365 and 0.263, P = 0.001 and 0.061, respectively). When grouped into broader categories (e.g., domesticated animals, as described in Methods), samples also clustered independently following ordination by PCoA (AMOVA P < 0.001; Figure 1), although similarities observed by ANOSIM were maintained on ordination, and those communities were not significantly separated by AMOVA.

The OTUs that were identified as highly discriminant among source categories by LEfSe (Figure 2) were typically classified among the genera that were significantly associated with ordination position by Spearman correlation (P < 0.05; Figure 1). Furthermore, these genera tended to belong to the most abundant families found in each host group (Figure S1). For example, the genus Bacteroides was correlated with ordination position for Queensland cats and dogs in the “domestic” source category (Figure 1), and was included among the discriminatory genera in LEfSe analysis (Figure 2). However, among livestock sources, the most abundant genera that correlated with ordination position were not discretely associated with specific source categories, as noted by positioning of most of these away from source communities, with the exception of Xylanibacter (Figure 1). Correspondingly, LEfSe identified OTUs primarily within less abundant genera as source-associated but did not identify OTUs within Xylanibacter to differentiate livestock, although Rikenella, Alistipes, and Bacteroides were identified (Figure 2).

3.2. SourceTracker Results for the Initial Library

Overlap of bacterial communities among host sources was evaluated by assigning all samples in a single host group in the library, irrespective of geography, as a source and all remaining samples in the initial library as sinks (Supplementary Table S1). A moderate-to-high (26.2–99.5%) predicted community similarity was observed among host samples within the broader avian group. Cat and dog communities also showed a greater degree of overlap, with mean similarities of 77.5 and 89.4%. Livestock and wildlife showed much less overlap in community composition (5.2–23.6% and 2.1–44.2% predicted similarity, respectively). In contrast, beef and dairy cattle showed much greater overlap, as 77.3 and 89.9% of the communities were predicted to be in common. Inter-group similarity was generally much less than intra-group similarity, with the exception that a moderate-to-high degree of similarity was predicted between avian sources and those from WWTPs (mean 43.1 ± 25.6% community similarity).

To assess the accuracy of SourceTracker software to discriminate among closely related sources and different geographic regions, and to classify OTUs from fecal samples not in the library, the initial library was divided approximately in half (i.e., half of the individuals in a host category were assigned as source and half were sinks) and evaluated amongst the broader host categories. Overall, sink predictions were accurate, with ≥ 80% of sink community taxa identified as the corresponding source category, with specificity to geography (Figure 3). Sinks within the avian category were identified less accurately, with approximately 60% of chicken communities correctly assigned, while bird communities from Queensland were poorly identified as a mixture of MN Canada geese and CA gulls, in addition to the correct source assignment. Bacterial communities from WWTPs were also less accurately identified compared to animal sources, with overlap in assignments among the more highly similar WWTPs. Furthermore, the bacterial communities in WWTP influent that were poorly assigned were not well represented in the source library (n < 5).
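The half-and-half division of the library can be sketched in Python as a per-category random split. The exact randomization procedure is not described in the text, so the function `split_source_sink`, the category labels, the sample identifiers, and the choice to give the sink group the extra sample in odd-sized categories are all assumptions made for illustration.

```python
import random

def split_source_sink(samples_by_category, seed=0):
    """Randomly label about half of each category 'source' and the rest 'sink'."""
    rng = random.Random(seed)
    assignment = {}
    for category, sample_ids in samples_by_category.items():
        shuffled = list(sample_ids)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2  # odd-sized categories give the extra sample to sinks
        for sid in shuffled[:half]:
            assignment[sid] = "source"
        for sid in shuffled[half:]:
            assignment[sid] = "sink"
    return assignment

# Hypothetical identifiers; group sizes follow the library counts (n = 19 each).
library = {
    "MN_beaver": [f"beaver{i}" for i in range(19)],
    "MN_deer": [f"deer{i}" for i in range(19)],
}
assignment = split_source_sink(library)
n_beaver_sources = sum(
    1 for sid, role in assignment.items() if sid.startswith("beaver") and role == "source"
)
```

Splitting within each host and geography category, rather than across the whole library at once, keeps every category represented on both the source and the sink side, which is what allows each sink to be checked against its own source category.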

Informative OTUs utilized by the SourceTracker software to perform source assignments in sink samples varied among specific host groups within broader source categories (Figure 4). Many of the OTUs were classified within abundant genera that were associated with specific source categories, e.g., a large proportion of the SourceTracker fingerprint for Queensland dogs was attributed to OTUs within the genus Xylanibacter (Figure 4A), which was highly correlated with their ordination position (Figure 1A). Livestock and wildlife sources were predominantly defined by OTUs within less abundant genera (Figures 4B and 4D), while genera at greater abundances were poorly associated with hosts (Figures 1B and 1D). The avian sources, which had more distinct community compositions than did other categories (Figure S1), could be identified using OTUs within fewer genera than most of the other broader categories, particularly the California birds in which Pseudomonas was a dominant driver of classification, and less abundant genera comprised less than 5% of the taxonomic fingerprint (Figure 4C). Furthermore, profiles for communities represented at lower abundances in the source libraries (e.g., Hobart and Melbourne WWTPs) were defined almost entirely by OTUs within less prominent genera (Figure 4E-F).

3.3. Evaluation of Blind Samples Using the Initial Library

Source 05 was identified as the FL WWTP source based on the number of samples (n = 3), since only triplicates of this source were processed. The remaining blinded source samples were evaluated against the initial library using SourceTracker, but could not be unambiguously assigned to specific sources by this method (mean sink prediction ≤ 6.9% to any source, irrespective of geography). Furthermore, community compositions in all blinded sources were significantly different from all other sources represented in the preliminary library (ANOSIM P < 0.001). Therefore, results from PCoA and LEfSe analyses among the initial library sources were used to narrow down possible sources based on predominant taxonomic composition (Supplementary Figure S2). Sources 01 and 02 were determined to belong to livestock or wildlife groups based on the abundance of Rikenella (Figures 1B and 1D). Sources 03 and 04 were preliminarily identified as domestic animal sources based on abundances of Blautia (Figure 1A) and Catenibacterium (Figure 2A). Sources were further interrogated by PCoA within these broader host source categories (Supplementary Figure S3). On the basis of these observations, Sources 01 through 04 were identified as coming from cows, horses, cats, and dogs, respectively, which was then confirmed (with those who prepared the double-blinded samples) to be correct. Composite source samples were evaluated as positive controls against the FL library alone and were correctly classified by SourceTracker at > 86% community similarity.

Blinded samples were interrogated against the unblinded FL sources (Table 1), which comprised the local library. Source identifications in samples, as determined by SourceTracker, were 91% (31/34) accurate based on presence/absence sample composition. In two samples, SW25 and SW32, SourceTracker identified a weak sewage signature representing 1.2% and 2.3% of the community, respectively, which was not consistent with a source spike. Similarly, in sample SW31, which was spiked with only cow fecal material, SourceTracker also identified a low signature for horse (6.7% of the community). When data from single FL source categories (e.g., all cattle) were removed from the library, individual fecal samples were incorrectly assigned, generally to the most closely related host. Cow samples were misidentified as horse (mean 13.6 ± 2.4% of the community), horse samples as cow (5.3 ± 1.4%), cat samples as a mixture of dog and sewage (61.3 ± 21.7% and 15.8 ± 22.3%, respectively), dog samples as cat (90.5 ± 9.4%), and WWTP samples as cat (5.5 ± 1.8%). With allowance for these misclassifications (e.g., ignoring classification of a cattle spike as horse when cattle samples were excluded from the source library), the presence/absence results among blinded samples did not change when a single source was omitted from the library.
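The presence/absence scoring described above can be illustrated with a minimal sketch. It assumes that a sample counts as correct only when the set of sources detected by the classifier exactly matches the set of sources spiked; the sample names, spiked sources, predicted percentages, and the zero detection threshold below are hypothetical and are not taken from Table 1.

```python
# Minimal sketch of presence/absence scoring (assumed rule: a sample is
# correct only if detected sources exactly match the spiked sources).
# All sample data below are invented for illustration.

def detected(predictions, threshold=0.0):
    """Return the set of sources whose predicted share (%) exceeds threshold."""
    return {source for source, pct in predictions.items() if pct > threshold}

def presence_absence_accuracy(samples, threshold=0.0):
    """Fraction of samples whose detected source set equals the spiked set."""
    correct = sum(
        1
        for spiked, predictions in samples.values()
        if detected(predictions, threshold) == set(spiked)
    )
    return correct / len(samples)

# sample name -> (spiked sources, predicted % of community per source)
samples = {
    "S1": ({"cow"}, {"cow": 12.0}),
    "S2": ({"cow"}, {"cow": 9.5, "horse": 6.7}),  # spurious horse signal
    "S3": ({"dog", "sewage"}, {"dog": 4.0, "sewage": 1.1}),
    "S4": (set(), {}),  # unspiked control with no detections
}

accuracy = presence_absence_accuracy(samples)

# The figure reported in the text: 31 of 34 samples correct.
reported_accuracy = 31 / 34
print(f"hypothetical: {accuracy:.0%}, reported: {reported_accuracy:.0%}")
```

Under this rule a single spurious low-level signal, such as the horse signature in the hypothetical sample S2, makes the whole sample count as incorrect. This is consistent with three samples (SW25, SW32, and SW31) accounting for the three misses out of 34.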

To evaluate the relative quantitative accuracy of SourceTracker, Spearman correlations were performed relating sink predictions (as %) with sample volumes spiked (0.025–10%). Strong and significant positive correlations were observed between SourceTracker sink predictions and volumes spiked (ρ = 0.974, 0.924, 0.887, 0.884, and 0.953 for cow, horse, cat, dog, and sewage sources, respectively; P < 0.0001 for all sources).
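As an illustration of the rank-based comparison described above, the following sketch computes Spearman's ρ as the Pearson correlation of average ranks. The spiked volumes and sink predictions are hypothetical values chosen within the 0.025–10% range and are not the study data; the function names are likewise illustrative.

```python
# Minimal sketch: Spearman's rho computed as Pearson correlation of ranks.
# The data are hypothetical, not the values from Table 1.

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

def spearman_rho(x, y):
    """Spearman rank correlation between x and y."""
    return pearson(average_ranks(x), average_ranks(y))

spiked_pct = [0.025, 0.1, 0.5, 1.0, 5.0, 10.0]    # % by volume (hypothetical)
predicted_pct = [0.3, 0.2, 2.1, 4.8, 19.5, 41.0]  # % of sink community (hypothetical)

rho = spearman_rho(spiked_pct, predicted_pct)
print(f"rho = {rho:.3f}")
```

Because the statistic depends only on ranks, a high ρ indicates that predictions increase monotonically with spiked volume. It does not indicate that the predicted percentages are proportional to the volumes, which is why the text describes the accuracy as relative.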

When evaluated against the initial library alone, sources were poorly identified in blinded sink samples (sink predictions ≤ 12.8% to specific hosts, irrespective of geography; Supplementary Table S2), similar to the blinded source samples. Furthermore, combining the initial and FL source libraries did not significantly affect sink predictions (ANOVA P = 0.787–0.997 for each source, individually; Supplementary Table S2) relative to those observed when using the local FL library alone.
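A comparison of this kind can be sketched as a one-way ANOVA on the sink predictions for a single source obtained with each library. The grouping, the function name, and the values below are assumptions made for illustration; the exact ANOVA design used in the study is not stated here. The sketch computes only the F statistic and its degrees of freedom.

```python
# Minimal sketch of a one-way ANOVA F statistic (assumed design: one group
# of sink predictions per source library). Values are hypothetical.

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical sink predictions (%) for one source under two libraries.
local_library = [10.0, 12.0, 14.0]
combined_library = [11.0, 13.0, 15.0]

f_stat, df_b, df_w = one_way_anova_f([local_library, combined_library])
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

An F value well below 1, as in this hypothetical case, means the difference between libraries is small relative to the variation within each library. Obtaining a P value additionally requires the F distribution; `scipy.stats.f_oneway` returns both the statistic and the P value if SciPy is available.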

4. DISCUSSION

Results of this study indicate that identification of fecal source contamination in recreational freshwater using SourceTracker depends on the inclusion of geographically associated source samples in the source library. Using an initial library with geographically divergent sources, but no representation from local sources, blinded source samples could not be unambiguously defined, with mean similarities of blinded source communities < 7% to sources in the initial fecal library. Furthermore, despite a great overlap in community composition among certain host species, e.g., avian species, the algorithm was generally able to assign >80% of the sink community to the correct source and geography. These results suggest that, despite taxonomic similarity in the fecal microbial community among closely related sources40, individuals vary by geographic region and specific species compositions (here assessed as OTUs), as has been well documented among humans41,42.

We specifically sought to investigate potential discrepancies between source identification using SourceTracker with amplicon sequencing data and that using conventional qPCR assays. Multivariate statistical analyses generally identified more highly abundant genera, and OTUs within these genera, as potentially discriminatory (Figure 2), and in most cases, these genera contributed, in part, to sink predictions as determined by SourceTracker (Figure 4). However, in many cases, OTUs within less abundant genera accounted for ≥ 50% of the source fingerprints, and we suggest this is due to the over-dispersed nature of bacterial communities22. Thus, while qPCR assays target specific taxa that are shared within most individuals of a host species2, SourceTracker is able to use some of these as well as to capitalize on lower abundance species to more accurately discriminate among closely related hosts. This may indicate a trade-off in methodologies, where qPCR has greater sensitivity to detect low levels of source contamination, while SourceTracker offers highly specific source identification. However, it is important to note that the SourceTracker algorithm correctly identified as little as 0.025% of spiked source, by volume (Table 1), and, although this result does not represent a truly quantitative measurement of bacterial contamination (i.e., numbers of cells), it is highly suggestive that the algorithm has appropriate sensitivity to detect biologically relevant contamination events. Future work will be necessary to provide a more accurate quantitative assessment of this observation and place it in the context of current water quality monitoring standards.

Library size has also historically been an important consideration for library-dependent MST methods8. We previously reported that a minimum library size of 13 individuals was necessary to inform a powered analysis of statistically significant differences in community composition17, but previous studies evaluating SourceTracker have sometimes relied on only one or two individuals and achieved results that corresponded with the expected source composition of their samples16,43. Here, 10 individuals were sufficient for accurate and, perhaps, relatively quantitative identification of fecal contamination among blinded spikes. Furthermore, analysis of the initial library suggests that, in practice,