
Article

Sequencing Human Mitochondrial Hypervariable Region II as a Molecular Fingerprint for Environmental Waters Vikram Kapoor, Ronald W. DeBry, Dominic Boccelli, and David Wendell Environ. Sci. Technol., Just Accepted Manuscript • DOI: 10.1021/es503189g • Publication Date (Web): 25 Aug 2014 Downloaded from http://pubs.acs.org on August 26, 2014


Environmental Science & Technology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Sequencing Human Mitochondrial Hypervariable Region II as a Molecular Fingerprint for Environmental Waters

Vikram Kapoor1, Ronald W. DeBry2, Dominic L. Boccelli1 and David Wendell1*

1 Department of Biomedical, Chemical, and Environmental Engineering, University of Cincinnati, Cincinnati, Ohio 45221, USA

2 Department of Biological Sciences, University of Cincinnati, Cincinnati, Ohio 45221, USA

Keywords. Fecal source tracking, human mitochondrial DNA, high-throughput sequencing, genetic barcode, cluster analysis, population diversity

ABSTRACT

To protect environmental water from human fecal contamination, authorities must be able to unambiguously identify the source of the contamination. Current identification methods focus on tracking fecal bacteria associated with the human gut, but many of these bacterial indicators also thrive in the environment and in other mammalian hosts. Mitochondrial DNA could solve this problem by serving as a human-specific marker for fecal contamination. Here we show that the human mitochondrial hypervariable region II can function as a molecular fingerprint for human contamination in an urban watershed impacted by combined sewer overflows. We present high-throughput sequencing analysis of hypervariable region II for spatial resolution of the contaminated sites and assessment of the population diversity of the impacting regions. We propose that human mitochondrial DNA from public waste streams may serve as a tool for identifying waste sources definitively, analyzing population diversity, and conducting other anthropological investigations.

INTRODUCTION

For over a century, the standard indicator for environmental water contamination has been fecal bacteria and associated microbial metagenomes. However, many recent investigations1-3 have recognized the limitations of microbial-based source tracking because bacteria survive in alternative hosts4 and the environment5, making the spatial and temporal components ambiguous. Chemical signatures associated with human waste, such as caffeine, sterols, detergents and personal care products, have also been used in source tracking studies6, 7; however, issues with persistence and detection sensitivity limit their use as reliable identification tools7. Thus, due to the uncertainty of present microbial/chemical source identification markers, there remains a need for a sensitive and unambiguous indicator of human waste in environmental water. Mitochondrial DNA (mtDNA), which contains species-specific sequences, can be used as a direct marker of human contamination by identifying human waste directly through its own discharged eukaryotic cells1-3, 8, 9.

Human mtDNA has become a useful tool in a variety of scientific disciplines including criminal forensics9, 10, paleoanthropology11-16, population genetics17, 18, and more recently as part of cancer and degenerative disease investigations19, 20. Global sequencing efforts associated with distinct populations have provided improved phylogenetic resolution of the human mtDNA hypervariable regions as a genetic anthropological barcode. Additionally, human mtDNA has been recently used as an identifying mechanism in contaminated environmental waters through species-specific variation in mitochondrial NADH dehydrogenase, discerning human fecal waste from animal sources1, 2. In this investigation we apply the specificity afforded by the hypervariable region of the human mitochondrial genome to evaluate anthropogenic inputs to environmental water in a manner similar to molecular rRNA-based speciation used in microbial source tracking, microbial diversity investigations and bacterial metagenomic fingerprinting21, 22.

Numerous sources of human contamination can be found in environmental waters. Major fecal sources include combined sewer overflows (CSO), sanitary sewer overflows (SSO), household sewage treatment systems, and agriculture/urban runoff6, 23. Human fecal waste contains a large amount of exfoliated epithelial cells24, 25, which in turn contain thousands of mitochondrial genomic copies26, making mtDNA a robust molecular target for impacted environmental water. Additional sources of human mtDNA in the environment can include sloughed skin and hair26, 27 found in waters used for swimming, canoeing and other recreational activities8, 28.

We track and characterize human waste in environmental waters by targeting the human mitochondrial hypervariable region II (HVRII). As in previous human evolution studies12-16, the large number of single-nucleotide polymorphisms (SNPs) present in HVRII can be used as an identifying mechanism since these SNPs have varying allelic frequencies among populations14, 29, 30. We have applied this allelic frequency specificity as a unique “barcode” to identify impacted water sample sites, since the waste impact is related to the humans contributing fecal waste near the sites (CSOs/SSOs; overland runoff) and persistence of human mtDNA from upstream sources. To determine the HVRII sequence diversity on a large scale, we have used high-throughput sequencing technology for characterizing human mtDNA HVRII variation found in water samples taken from an urban creek system (Duck Creek Watershed, Cincinnati OH) impacted by municipal CSOs and other human activities. We next used the HVRII sequences to extract haplotypes and assign mitochondrial haplogroups based on the Phylotree database31. Furthermore, we compared the population diversity obtained through HVRII-derived haplogrouping to the U.S. federal census data (by race) for the neighborhoods bordering the watershed.

MATERIALS AND METHODS

Study area and sampling sites. The Duck Creek Watershed feeds a National and State Scenic River (Little Miami River) but has limited aquatic and riparian habitat32 and, as a result of CSOs, has been shown to significantly impact the Little Miami River with human fecal contamination2. Initially, ten sampling points were selected within the Duck Creek watershed. The sites were chosen based on proximity to CSOs, traditional municipal sampling sites and potential impact from human fecal pollution from sewage overflow and watershed runoff. The sampling sites were identified and assessed for the presence of human mitochondrial DNA through PCR-based detection as described in previous work2. Out of the 10 sites, five (Sites 1, 3, 8, 9 and 10; Figure 1 and Table S1) were chosen for further analysis based on the consistent abundance of human mitochondrial DNA throughout the sampling period (Oct 2011 - Jul 2012).

Sample processing, PCR and sequencing. Five river sites were selected within the Duck Creek Watershed representing different degrees of anthropogenic influence. Water sample collection and DNA extraction were performed as described earlier2. A PCR assay targeting the mitochondrial hypervariable region II (422 bp) was carried out using the primers33 HVRII-F (5'-GGTCTATCACCCTATTAACCAC-3') and HVRII-R (5'-CTGTTAAAAGTGCATACCGCC-3') linked to the site-specific barcodes (Table S2). For all PCR assays, water DNA extracts (5 µL) were used as templates in a final reaction volume of 50 µL using the OneTaq master mix (New England Biolabs, Ipswich, MA) with 200 nM each of the forward and reverse primer in a GeneAmp® PCR System 9700 thermal cycler (Applied Biosystems, Green Island, NY) under the following cycling conditions: initial denaturation of 30 s at 94 °C, followed by 35 cycles of 15 s at 94 °C, 20 s at 56 °C and 30 s at 68 °C, and a final extension step of 5 min at 68 °C. All PCR products were purified using the MinElute PCR Purification kit (Qiagen, Valencia, CA) and quantified using a NanoDrop 1000 Spectrophotometer (Thermo Scientific, Wilmington, DE). Controls containing no template DNA were used to check for cross contamination. Additionally, PCR inhibition was tested in water DNA extracts by using 10-fold dilutions of each DNA extract. The amplification products from the same sampling event were pooled in an equimolar ratio to conduct multiplexed sequencing using the Ion Torrent Personal Genome Machine (PGM) system (Life Technologies, San Francisco, CA). Sequencing of each pooled library was performed on the PGM system using a 314 chip v2 with the Ion PGM Template OT2 400 kit and Ion PGM Sequencing 400 kit according to the manufacturer's protocol. The HVRII sequence of the operator was also determined through Sanger sequencing, and we confirmed that it did not contribute to the experimental data.
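The equimolar pooling step can be illustrated with a short calculation. The sketch below is not the authors' procedure: it assumes the common approximation of 660 g/mol per base pair of double-stranded DNA, ignores the extra length added by the barcodes, and uses hypothetical NanoDrop readings; the names `ng_per_ul_to_nm`, `pooling_volumes` and `target_fmol` are illustrative.

```python
# Sketch of equimolar pooling of purified amplicons (all inputs hypothetical).
AVG_BP_MASS = 660.0  # approximate g/mol per base pair of double-stranded DNA


def ng_per_ul_to_nm(conc_ng_per_ul: float, length_bp: int) -> float:
    """Convert a dsDNA mass concentration (ng/uL) to molarity (nM)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def pooling_volumes(concs_ng_per_ul: dict, length_bp: int, target_fmol: float) -> dict:
    """Volume (uL) of each amplicon that contributes target_fmol to the pool."""
    volumes = {}
    for site, conc in concs_ng_per_ul.items():
        molarity_nm = ng_per_ul_to_nm(conc, length_bp)  # 1 nM equals 1 fmol/uL
        volumes[site] = target_fmol / molarity_nm
    return volumes


# Hypothetical concentrations for the five sites of one sampling event.
example_concs = {"site1": 12.0, "site3": 8.5, "site8": 20.0, "site9": 15.2, "site10": 9.8}
example_volumes = pooling_volumes(example_concs, length_bp=422, target_fmol=50.0)
for site, vol in example_volumes.items():
    print(f"{site}: {vol:.2f} uL")
```

Each site then contributes the same number of amplicon molecules to the pooled library, so differences in read counts after sequencing are not simply an artifact of unequal loading.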

Bioinformatics analyses. All PGM sequences were sorted according to barcodes and grouped under their respective sites. To compensate for potential sequencing errors, sequences having an average quality under 2010 (as derived from the automated analysis carried out by the Torrent Suite Software version 3.6), having unidentified bases (Ns), or being shorter than 300 bp were discarded. The quality-filtered sequences were then aligned to the revised Cambridge Reference Sequence (rCRS)34 for human mitochondrial DNA (NC_012920.1, Homo sapiens mitochondrion, complete genome) and analyzed with the Torrent Suite Software version 3.6 (Life Technologies) using the plug-in VariantCallerForMtDNA version 3.0. The output of the variant caller is presented in tabular format, as a list of variations relative to the rCRS along with variant frequency values (Tables S3-S5). Additionally, the FASTQ files provided via the Ion Torrent server were exported to CLC Genomics Workbench Version 6.5 (CLC Bio, Cambridge, MA) and aligned to the rCRS, after which Quality-based Variant Detection was run to detect insertions and deletions (indels) as well as SNPs with reference to the rCRS. The CLC Workbench variant caller analysis parameters used in this study are given in Tables S6 and S7. A control DNA sample of HVRII sequence with known variants, which had been previously determined by conventional Sanger sequencing, was included during the analysis. The mitochondrial genome databases MITOMAP35, mtDB36, EMPOP37 and Phylotree31 were consulted to validate the occurrence of detected variants. Sequences were submitted to MITOMASTER version Beta 138 to extract haplotypes and assign mitochondrial haplogroups according to the sequence motifs present in HVRII. MITOMASTER performs variant calling relative to the rCRS, haplotyping based on Phylotree, and variant annotation based on Mitomap.
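The read-level filter described above (mean quality, no Ns, minimum length) can be sketched in a few lines. This is an illustration rather than the Torrent Suite implementation: the quality threshold of 20 is an assumed reading of the cutoff printed in the text, the Phred+33 encoding is an assumption about the FASTQ files, and all function names are hypothetical.

```python
# Sketch of the read-level filter (threshold values are assumptions, see lead-in).
from typing import Iterator, List, Tuple

MIN_MEAN_QUALITY = 20   # assumed reading of the quality cutoff
MIN_LENGTH_BP = 300     # minimum read length stated in the text
PHRED_OFFSET = 33       # assumes Phred+33 FASTQ encoding


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        yield lines[i][1:].strip(), lines[i + 1].strip(), lines[i + 3].strip()


def mean_phred(quality: str) -> float:
    """Average Phred score of a quality string."""
    return sum(ord(ch) - PHRED_OFFSET for ch in quality) / len(quality)


def passes_filter(seq: str, quality: str) -> bool:
    """Keep reads that are long enough, free of Ns and of adequate mean quality."""
    if len(seq) < MIN_LENGTH_BP:
        return False
    if "N" in seq.upper():
        return False
    return mean_phred(quality) >= MIN_MEAN_QUALITY


# Hypothetical records: one good read, one short read, one low-quality read.
records = [
    "@good", "ACGT" * 80, "+", "I" * 320,
    "@short", "ACGT" * 10, "+", "I" * 40,
    "@lowq", "ACGT" * 80, "+", "+" * 320,
]
kept = [rid for rid, seq, qual in parse_fastq(records) if passes_filter(seq, qual)]
print(kept)
```

Checking length and N content before computing the mean quality keeps the cheap tests first and avoids dividing by zero on empty records.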

Spatial analysis. Hierarchical cluster analysis (HCA) was used to classify the five sampling sites (1, 3, 8, 9 and 10) into spatial associations using the frequency distribution of SNPs obtained for three events: October 2011 (Set A; wet weather), March 2012 (Set B; dry weather) and July 2012 (Set C; dry weather) (Tables S3-S5), for a total of 15 data sets labeled as 1A, 3A, etc., where 1 is the site and A refers to the event. HCA is an exploratory pattern detection method that partitions all cases into unique groups. Prior to HCA, the normality of the SNP frequency distribution (sorted by sampling location; frequency cutoff = 5%) was verified by analyzing the histograms and by applying the Shapiro–Wilk test. The combination of Euclidean distances as a similarity-dissimilarity measure and Ward's method as a linkage algorithm was then applied to obtain the case clusters. The data matrix used for classification has the dimension of 15 (sampling points) × 25 (SNP frequencies), resulting in a total of 375 data points. Using this approach, it was possible to reduce the large number of HVRII sequences to 15 site-specific data sets.
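The clustering step can be sketched as follows. This is a minimal pure-Python illustration of agglomerative clustering with Euclidean distances and Ward's criterion, not the software the authors used (which the text does not name). The four three-SNP frequency profiles are hypothetical stand-ins for the 15 × 25 matrix, and the merge cost reported is the increase in within-cluster sum of squares, which some packages rescale before plotting a dendrogram.

```python
# Sketch of hierarchical clustering with Ward's criterion (hypothetical data).
import math
from typing import Dict, List, Tuple

Vector = List[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two SNP frequency profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def ward_clustering(profiles: Dict[str, Vector]) -> List[Tuple[tuple, tuple, float]]:
    """Return the merge history as (members_a, members_b, cost) tuples."""
    # Each cluster is stored as (member labels, centroid, size).
    clusters = [((label,), list(vec), 1) for label, vec in profiles.items()]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (_, ci, ni), (_, cj, nj) = clusters[i], clusters[j]
                # Ward cost: increase in within-cluster sum of squares.
                cost = ni * nj / (ni + nj) * euclidean(ci, cj) ** 2
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        (mi, ci, ni), (mj, cj, nj) = clusters[i], clusters[j]
        centroid = [(ni * x + nj * y) / (ni + nj) for x, y in zip(ci, cj)]
        merges.append((mi, mj, cost))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((mi + mj, centroid, ni + nj))
    return merges


# Hypothetical frequency profiles for two sites sampled at two events.
example_profiles = {
    "1A": [0.90, 0.10, 0.00],
    "1B": [0.85, 0.15, 0.00],
    "3A": [0.20, 0.70, 0.30],
    "3B": [0.25, 0.65, 0.35],
}
example_merges = ward_clustering(example_profiles)
for members_a, members_b, cost in example_merges:
    print(members_a, "+", members_b, f"cost={cost:.4f}")
```

In this toy input the two events of each site merge first, which mirrors the self-similarity of sites that a dendrogram such as Figure 3 is meant to reveal.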

RESULTS

Sequence and variant detection of HVRII for study sites. The Ion Torrent PGM system was used to examine the sequence diversity of HVRII amplicons as recently reported39-41. In total, more than 10,000,000 sequence reads were retrieved with a mean output exceeding 200,000 per pooled set, which were then filtered and grouped according to their respective sites. The absolute number of HVRII sequence reads per site was consistent with our previous human mtDNA qPCR results for the watershed2. SNPs were detected using the Torrent Suite plug-in VariantCallerForMtDNA version 3.0, which applies a TMAP Smith–Waterman alignment optimization42 and outputs the variant allele frequency (%). Concurrently, sequences were analyzed using CLC Genomics Workbench Version 6.5, which employs the Neighborhood Quality Standard (NQS) algorithm43, to detect insertions and deletions (indels) and validate the SNPs detected via the variant caller. Indeed, the SNP and indel analysis produced no significant differences from the variant caller data, supporting the reproducibility of the results by alternative computational methods. The HVRII sequence of the operator was compared to the sample sequence databases to check for cross contamination and produced no exact matches against the database reads.

HVRII DNA from five sampling events was sequenced and screened producing an average read length of approximately 300 bp and a mean output of 20,000 sequences per site for a particular sampling event. Of this, a 270 bp portion from base position 51 to 320 was used for variant detection since these SNPs have been well documented35. A total of 31 distinct SNPs were detected, of which 30 are present in MITOMAP - database of mtDNA Control Region Sequence Variants35. The relative distribution of SNPs (with frequency > 5%) for each site over the annual sampling period is presented in Figure 2. We observed some SNPs that were common to all sites with varying frequencies, while other SNPs were site-specific, allowing each sampling location to have a unique human mtDNA signature in the form of SNP allelic frequencies. The variation in site-specific SNP frequencies could be the result of several factors including limited sample size, daily population changes related to employment (for a comparison of site populations see Table S8), changes in sampling time and storm runoff volumes during wet weather events, or a combination of these variables. Variants 73G and 263G were detected in all samples with high frequency (> 40%). Variants 143A and 236C were detected only at site 10 during the entire sampling period. This is expected since site 10 is on a separate tributary of Duck Creek and is not influenced by influx of water from any other site. Site 1 is downstream of site 3; however, some SNPs were detected at site 1 but not at site 3 (151T, 182T, 185T, 235G, 239C). This may be due to the influx of water between the two sites (confluence with Little Duck Creek; see Figure 1) as well as additional inputs from runoff and other CSOs. Interestingly, some of the variants found at site 1 but not at site 3 (151T, 182T, 185T) are present at sites 8, 9 and 10 as well.
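The comparison of variants between sites reduces to thresholding and set operations. The sketch below applies the 5% frequency cutoff mentioned in the text to hypothetical frequency tables; the numbers are invented for illustration and are not taken from Tables S3-S5, and the function name is an assumption.

```python
# Sketch: apply the frequency cutoff, then compare variant sets between sites.
FREQ_CUTOFF = 5.0  # percent, the cutoff stated in the text


def variants_above_cutoff(freqs: dict, cutoff: float = FREQ_CUTOFF) -> set:
    """Return the variants whose allele frequency (%) exceeds the cutoff."""
    return {variant for variant, freq in freqs.items() if freq > cutoff}


# Hypothetical variant frequencies (%) for two sites.
site1_freqs = {"73G": 62.0, "263G": 88.0, "151T": 9.5, "182T": 7.1, "150T": 3.0}
site3_freqs = {"73G": 55.0, "263G": 91.0, "151T": 2.0}

site1_variants = variants_above_cutoff(site1_freqs)
site3_variants = variants_above_cutoff(site3_freqs)

only_site1 = sorted(site1_variants - site3_variants)
shared = sorted(site1_variants & site3_variants)
print("site 1 only:", only_site1)
print("shared:", shared)
```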

The rCRS reference sequence contains a tract of seven cytosines from positions 303 to 30934. Length heteroplasmy has been known to occur in this stretch due to C insertions that can create C-stretches of eight or more Cs44, 45. An additional C was found in most of the samples, while sampling site 9 presented two additional Cs in this region. Many samples were also found to have an additional cytosine with respect to the rCRS in the cytosine tract 311–315. The frequency of one additional C in the 311–315 region ranges from 20% to 70% within the sample sets, while the frequency of two additional Cs is between 7% and 60%. Sequences with three additional Cs in the 311–315 tract were the rarest and were found only at site 9 with less than 5% frequency.
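Length variation in the two cytosine tracts can be tallied by measuring the C runs on either side of the thymine at position 310. The sketch below assumes each read has already been aligned and trimmed to the region around positions 303-315 (a real pipeline would rely on the aligner for this), and the example fragments are hypothetical.

```python
# Sketch: count extra cytosines in the two poly-C tracts of trimmed fragments.
import re
from collections import Counter

REF_FIRST_TRACT = 7   # cytosines at rCRS positions 303-309, as stated in the text
REF_SECOND_TRACT = 5  # cytosines at rCRS positions 311-315

# Two C runs separated by the single T at position 310.
TRACT_PATTERN = re.compile(r"(C+)T(C+)")


def extra_cytosines(fragment: str):
    """Return (extra Cs in 303-309, extra Cs in 311-315), or None if not found."""
    match = TRACT_PATTERN.search(fragment.upper())
    if match is None:
        return None
    return (len(match.group(1)) - REF_FIRST_TRACT,
            len(match.group(2)) - REF_SECOND_TRACT)


# Hypothetical trimmed fragments from one sample.
fragments = [
    "CCCCCCCTCCCCC",      # matches the reference tract lengths
    "CCCCCCCCTCCCCCC",    # one extra C in each tract
    "CCCCCCCCTCCCCCC",
    "CCCCCCCCCTCCCCCCC",  # two extra Cs in each tract
]
tally = Counter(extra_cytosines(f) for f in fragments)
for (first_extra, second_extra), count in sorted(tally.items()):
    print(f"+{first_extra}C / +{second_extra}C: {100 * count / len(fragments):.0f}%")
```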

Spatial variability of sites using SNP allelic frequencies. Because the sampling locations are impacted by upstream conditions (e.g., CSOs) and immediate surroundings, the frequencies of the HVRII SNP alleles at each sampling location should provide a “fingerprint” specific to each location. Using the allelic frequencies generated for each site, we applied cluster analysis to distinguish environmental water obtained from individual sample locations within the watershed. The dendrogram of the location pattern resulting from the HCA of HVRII sequence SNP data from the period of Oct 2011–Jul 2012 is presented in Figure 3, illustrating distinct site clusters. The sampling sites were grouped into three main clusters based on their specific HVRII SNP signature. Cluster 1 was formed by site 3; cluster 2 by sites 8, 9 and 10; and cluster 3 by site 1. It can be seen that cluster 1 is characterized by the highest linkage distance to the other clusters. Clusters 2 and 3 are linked at a shorter distance and are together linked to cluster 1 at a higher distance. Note that cluster 1 corresponds to the middle catchment of the Duck Creek Watershed; cluster 3 is the lower catchment; while cluster 2, formed by sites 8, 9 and 10, is located in the upper catchment. Sites 8 and 9 are directly linked to each other since they are on the same section of river without influence of water influx from other CSO sources. Interestingly, site 1 clusters most closely with itself and with sites 8, 9 and 10, despite site 3 lying between them. It is also interesting to note that the sites are most self-similar despite the time between sampling events and the differences in weather (wet weather for set A and dry weather for sets B and C). These results support the applicability of HVRII sequence analysis as a metagenomics tool for human contamination sources in environmental water and provide a mechanism for spatial classification of sites based on human mitochondrial variable region SNP allelic frequencies, or ‘HVR fingerprint’.

We further compared the Euclidean distances between the five different sites based upon the observed SNP frequency “fingerprints” (Figure S1). For all five locations, the smallest Euclidean distance generally occurred when performing a self-comparison of the SNP frequency fingerprints. Similar to the clustering results (Figure 3), sites 8, 9 and 10 tended to be most similar to each other. Additionally, when looking at the results from sites 1 and 3, the SNP frequencies are more closely related to sites 8, 9 and 10 than to each other, even though site 1 is just downstream of site 3. The expectation is that sites immediately up/down-stream of each other would be most similar in terms of SNP frequencies, even with possible degradation of the mtDNA. To investigate the observed differences between sites 1 and 3, an existing Storm Water Management Model (SWMM) of the combined sanitary/storm water system (provided by the Metropolitan Sewer District of Greater Cincinnati) was used to identify the local regions that contribute to the combined systems that could impact the receiving streams (see Figure S2). Site 9 is heavily influenced by the population in Kennedy Heights, while Site 10 is influenced by Kennedy Heights, Pleasant Ridge and parts of Oakley. Site 8 is directly downstream of Site 9 and further influenced by Madisonville. These three sites all have commonality with Kennedy Heights and the demographics within that region. Site 3, located downstream of the confluence of Sites 8 and 10, is additionally impacted by the Linwood region, which increases the differences between Site 3 and Sites 8, 9 and 10. Finally, Site 1, located further downstream than Site 3 after the confluence of two additional tributaries, continues to show additional differences from Site 3 (as expected) but was observed to be more similar to Sites 8, 9 and 10, which was not expected. An assessment of the regional impacts shows that the tributary that merges with the flow passing Site 3 is additionally impacted by the Madisonville area, which would strengthen the SNP frequency contribution from the similar demographics of the Madisonville and Kennedy Heights regions and thereby increase the similarity with the sites upstream of Site 3. These different flow paths may also explain why SNPs 151T, 182T and 185T were observed at Site 1 (as well as Sites 8, 9 and 10) but not Site 3.

Population diversity assessment using haplogroup classification. Distinct mitochondrial haplogroups have arisen from mutation during human evolution and largely follow the migration of Homo sapiens from specific geographical regions36, 46, 47. These paleoanthropological haplogroups can also be assigned to race based on the frequency of observation as a means of investigating population diversity46, 47. Consequently, we sought to use our human mtDNA sequences to extract haplotypes and classify them into haplogroups by comparing them to the Phylotree database31. The mitochondrial sequences from sets A (wet weather) and B (dry weather) were compared and assigned to haplogroups based on the differences in HVRII sequence mutations with respect to the rCRS. We observed abundant diversity from haplogroup data of HVRII amplicons at all sites, which is consistent with the clear indication of human contamination in the creeks2. Although the sequences were obtained from an equimolar pool of the HVRII amplicons, the relative composition of the haplogroups varied considerably across the different sampling sites (Figure 4). The most salient feature of the haplogroup distribution in the clustered sequences was the relatively high frequency of haplogroup H (30-50%). Haplogroup H includes the rCRS and is typically characterized by variant 73A in HVRII; most of the other haplogroups are characterized by 73G. Haplogroup L was also relatively abundant, but with large sample-to-sample variation.

To further explore the applicability of our HVRII-derived haplogroup data to local population diversity as defined by 2010 U.S. census data48, several mitochondrial databases and studies were consulted to assign haplogroups to the general population groups35-37, 46, 47. From these, we used the Wallace47 haplogroup classification according to which L0, L1, L2, L3, L4, L5, L6 were assigned as 'African American'; H, HV, J, K, P, S, T, U, V as 'White'; B, D, E, F, G, M, R, W as 'Asian'; and A, C, X represented 'American Indian'; while all other, less numerous haplogroups were assigned to an 'other' category. Figure 5 presents the comparative analysis of the population data obtained through the two strategies: HVRII-derived population groups vis-à-vis the census data for population (by race). Classification of the mtDNA haplogroups showed 20% African mtDNA, 59% European mtDNA, and 12% Asian/American Indian mtDNA. According to census data, 62% self-declared as White, 32% as African American and 2% as Asian. There was a strong correlation between the federal census data and the mitochondrial haplogroups as an indicator of population composition (Pearson product-moment correlation coefficient, r = 0.97), demonstrating the suitability of human mitochondrial sequences to infer the population structure of the neighborhoods impacting the watershed. One important deviation from the census data was the significantly larger percentage of Asian/American Indian mtDNA detected (Figure 5). This discrepancy could be the result of several factors: coarse haplogroup assignments, proximity to the creek CSO input or underrepresentation in the census data. The last of these three possibilities suggests that HVR sequencing directly from waste streams or impacted water may provide a more accurate means of deducing population diversity, since human waste disposal is a personal necessity while census response is not. We suggest that future epidemiological studies employing HVR sequencing of waste streams may provide complementary population diversity information as well as additional insight unavailable through voluntary census response data collection.
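The haplogroup-to-category assignment and the correlation with census data can be sketched as follows. The mapping reproduces the lists given in the text; matching a full haplogroup label to a category by its leading characters is an assumption made here for illustration, as are the example haplogroup labels. The percentages passed to `pearson_r` are the aggregate values quoted above, with the Asian/American Indian mtDNA share paired against the census 'Asian' share, which is only an approximate match.

```python
# Sketch: map haplogroups to population categories and compute Pearson's r.
import math
from collections import Counter

# Assignment as listed in the text (Wallace classification).
CATEGORY_BY_HAPLOGROUP = {}
for hg in ("L0", "L1", "L2", "L3", "L4", "L5", "L6"):
    CATEGORY_BY_HAPLOGROUP[hg] = "African American"
for hg in ("H", "HV", "J", "K", "P", "S", "T", "U", "V"):
    CATEGORY_BY_HAPLOGROUP[hg] = "White"
for hg in ("B", "D", "E", "F", "G", "M", "R", "W"):
    CATEGORY_BY_HAPLOGROUP[hg] = "Asian"
for hg in ("A", "C", "X"):
    CATEGORY_BY_HAPLOGROUP[hg] = "American Indian"


def categorize(haplogroup: str) -> str:
    """Map a label such as 'H1a' or 'L2b' by its longest matching prefix (assumption)."""
    for length in (2, 1):
        category = CATEGORY_BY_HAPLOGROUP.get(haplogroup[:length])
        if category is not None:
            return category
    return "other"


def pearson_r(xs, ys) -> float:
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical haplogroup calls for a handful of reads.
calls = ["H1a", "H2", "L2b", "M7", "HV0", "N1", "A2"]
print(Counter(categorize(c) for c in calls))

# Aggregate percentages quoted in the text: European/White, African, Asian.
mtdna_pct = [59, 20, 12]
census_pct = [62, 32, 2]
print(f"r = {pearson_r(mtdna_pct, census_pct):.2f}")
```

The reported r = 0.97 may rest on a different breakdown of the data than the three aggregate pairs used here, so the value printed by this sketch should not be expected to match it.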

DISCUSSION

Several ribo-typing investigations have attempted to associate the human intestinal microflora with bacterial metagenomes found in environmental waters21, 41; however, there is significant variation in microbial species composition within and between individuals. Moreover, the microbial communities might replicate after discharge in water, making it difficult to differentiate bacteria associated with fecal contamination events. Conversely, mitochondrial sequences represent a direct marker of human waste since they are derived from the host cells, which in turn enables the mtDNA HVRs to define inter-individual variation and population dynamics contributing to the impacted water.

We investigated the occurrence of HVRII allelic frequencies of human mtDNA derived from water samples taken within an impacted urban creek system. We used SNPs within the human HVRII region to form site-specific genetic barcodes (HVR fingerprint) for evaluating anthropogenic watershed inputs. Human mtDNA is readily available in public waste streams and impacted environmental waters, allowing this approach to be more broadly applied as a metagenomics tool for studying human population diversity, waste source tracking and other anthropological investigations.

Water samples taken from the impacted watershed contained mitochondrial genome copy equivalents ranging from a few hundred to several hundred thousand human mtDNA copies2. As a result of the large amount of mtDNA, and abundant diversity, it was impractical to isolate and sequence full mitochondrial genomes from our environmental water samples. However, the analysis of small mtDNA regions that have maximal discriminative power has proven useful in past anthropological studies14-17; Krings et al.15 determined 340 bp of the mtDNA HVRII from the Neandertal type specimen to better estimate the relationship of the Neandertal mtDNA to the contemporary human mtDNA gene pool, an approach adapted in this investigation.

Altogether, the HCA approach (Figure 3) combined with the site-specific frequency distribution of SNPs (Figure 2) represents a unique classification for environmental waters that was specific to both location and human community. The HVR fingerprint specificity was surprising considering the mixed nature of municipal sewage, the variation in CSO discharge with weather, and the temporal separation of sampling events. The molecular fingerprinting strategy described here may be further adapted to analyze additional human mtDNA genes, either through impacted environmental water or directly through public wastewater. This may provide a significant resource for local community mtDNA genetics, and could be used to examine the association of human disease and aging with mtDNA genes⁴⁹⁻⁵¹. To this end, it has been reported that mitochondrial gene mutations might predispose individuals to diseases like diabetes, Alzheimer’s and Parkinson’s⁴⁷, ⁴⁹, ⁵⁰; however, the true impact of these mutations on human health remains to be determined. Studies involving the correlative analysis of mtDNA variation in human mitochondrial sequences found in human-impacted environments may provide a direct route to examine the prevalence of these diseases in the local population.
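The clustering step referred to above can be illustrated with a minimal sketch. It assumes each sample (labeled by site and sampling time, e.g. 1A) is summarized as a vector of SNP frequencies and that samples are compared by Euclidean distance, as in the Supporting Information. The average-linkage rule, the function names and the toy frequency values are assumptions made for illustration only; they are not the settings or data of this study.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hca(profiles, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[label] for label in profiles]

    def average_distance(c1, c2):
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(euclidean(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Hypothetical SNP frequency vectors (three SNPs per sample), not study data.
profiles = {
    "1A": [0.90, 0.10, 0.00],
    "1B": [0.85, 0.15, 0.00],
    "3A": [0.10, 0.80, 0.10],
    "3B": [0.15, 0.75, 0.10],
    "8A": [0.10, 0.10, 0.90],
    "9A": [0.10, 0.15, 0.85],
}
clusters = hca(profiles, n_clusters=3)
print(clusters)
```

In this toy input, samples from the same site have nearly identical vectors, so they merge before any cross-site merge occurs. A different linkage rule could give a different tree on real data.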

With respect to the mtDNA analysis, we applied a high-throughput sequencing strategy to analyze human mitochondrial HVRII DNA obtained from an urban creek system at different time points, accounting for the spatial-temporal resolution of human contamination in an urban watershed. The use of barcoded primers allowed multiplexed sample sequencing and enabled the identification of collective HVRII SNP frequencies that were site specific. Although our study was confined to analysis of the HVRII region of human mtDNA in a limited number of geographic locations, a wealth of information can be obtained through other mtDNA genomic targets, particularly regions associated with aging and cancer⁴⁷, ⁵⁰, ⁵¹.
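As a rough illustration of how barcoded reads lead to site-specific SNP frequencies, the sketch below groups reads by barcode and applies the variant frequency definition from the Figure 2 legend (reads carrying a SNP divided by total reads in the sample, reported only above 5%). The barcode sequences, the function name and the read counts are hypothetical. The sketch also assumes SNPs have already been called per read against the rCRS, so it is not the CLC Genomics Workbench workflow used in the study.

```python
from collections import Counter, defaultdict

# Hypothetical barcode-to-sample table; the real barcodes are listed in the
# Supporting Information.
BARCODE_TO_SAMPLE = {"ACGTAC": "1A", "TGCATG": "3A"}


def snp_frequencies_by_sample(reads, barcode_to_sample, min_frequency=0.05):
    """Return {sample: {snp: frequency}} for SNPs above min_frequency.

    reads is an iterable of (barcode, snps) pairs, where snps is the set of
    SNP labels (e.g. '73G') already called for that read.
    """
    total = Counter()                # reads per sample
    carrying = defaultdict(Counter)  # sample -> SNP -> reads carrying it
    for barcode, snps in reads:
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            continue  # unrecognized barcode: the read is dropped
        total[sample] += 1
        carrying[sample].update(set(snps))
    return {
        sample: {
            snp: n / total[sample]
            for snp, n in carrying[sample].items()
            if n / total[sample] > min_frequency
        }
        for sample in total
    }


# Toy input (not study data): 20 reads for 1A, 10 for 3A, one stray read.
reads = (
    [("ACGTAC", {"73G", "263G"})] * 12
    + [("ACGTAC", {"263G"})] * 7
    + [("ACGTAC", {"150T"})] * 1
    + [("TGCATG", {"73G"})] * 5
    + [("TGCATG", set())] * 5
    + [("GGGGGG", {"73G"})]
)
profiles = snp_frequencies_by_sample(reads, BARCODE_TO_SAMPLE)
print(profiles)
```

In the toy data, 150T is carried by exactly 1 of 20 reads (5%), so the strict "greater than 5%" rule excludes it.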

ASSOCIATED CONTENT

Supporting Information. Tables of the sampling sites, molecular barcodes used for multiplexed sequencing, HVRII SNP frequency data and bioinformatics analysis parameters for CLC Genomics Workbench; and figures for Euclidean distance between sites and GIS map of Duck Creek Watershed showing combined sewer lines. This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]

ACKNOWLEDGMENTS

We thank R. Ravi, E. Wurtzler and N. Punuru for assistance in the laboratory and C. Smith for help in sample collection. This research was supported by the Metropolitan Sewer District of Greater Cincinnati and a URC Graduate Research Fellowship from the University of Cincinnati (Cincinnati, OH).

REFERENCES

1. Caldwell, J. M.; Raley, M. E.; Levine, J. F. Mitochondrial multiplex real-time PCR as a source tracking method in fecal-contaminated effluents. Environ. Sci. Technol. 2007, 41 (9), 3277-3283.
2. Kapoor, V.; Smith, C.; Santo Domingo, J. W.; Lu, T.; Wendell, D. Correlative Assessment of Fecal Indicators using Human Mitochondrial DNA as a Direct Marker. Environ. Sci. Technol. 2013, 47 (18), 10485-10493.
3. Vuong, N.-M.; et al. Fecal source tracking in water using a mitochondrial DNA microarray. Water Res. 2012, 47 (1), 16-30.
4. Gordon, D. M. Geographical structure and host specificity in bacteria and the implications for tracing the source of coliform contamination. Microbiology 2001, 147 (5), 1079-1085.
5. Anderson, K. L.; Whitlock, J. E.; Harwood, V. J. Persistence and differential survival of fecal indicator bacteria in subtropical waters and sediments. Appl. Environ. Microbiol. 2005, 71 (6), 3041-3048.
6. Glassmeyer, S. T.; et al. Transport of chemical and microbial compounds from known wastewater discharges: potential for use as indicators of human fecal contamination. Environ. Sci. Technol. 2005, 39 (14), 5157-5169.
7. Hagedorn, C.; Weisberg, S. B. Chemical-based fecal source tracking methods: current status and guidelines for evaluation. Rev. Environ. Sci. Biotechnol. 2009, 8 (3), 275-287.
8. Martellini, A.; Payment, P.; Villemur, R. Use of eukaryotic mitochondrial DNA to differentiate human, bovine, porcine and ovine sources in fecally contaminated surface water. Water Res. 2005, 39 (4), 541-548.
9. Budowle, B.; Allard, M. W.; Wilson, M. R.; Chakraborty, R. Forensics and Mitochondrial DNA: Applications, Debates, and Foundations. Annu. Rev. Genomics Hum. Genet. 2003, 4 (1), 119-141.
10. Wilson, M. R.; DiZinno, J. A.; Polanskey, D.; Replogle, J.; Budowle, B. Validation of mitochondrial DNA sequencing for forensic casework analysis. Int. J. Legal Med. 1995, 108 (2), 68-74.
11. Gill, P.; et al. Identification of the remains of the Romanov family by DNA analysis. Nat. Genet. 1994, 6 (2), 130-135.
12. Ingman, M.; Kaessmann, H.; Pääbo, S.; Gyllensten, U. Mitochondrial genome variation and the origin of modern humans. Nature 2000, 408 (6813), 708-713.
13. Wallace, D. C. Mitochondrial DNA sequence variation in human evolution and disease. Proc. Natl. Acad. Sci. U.S.A. 1994, 91 (19), 8739-8746.
14. Salas, A.; Lareu, V.; Calafell, F.; Bertranpetit, J.; Carracedo, A. mtDNA hypervariable region II (HVII) sequences in human evolution studies. Eur. J. Human Genet. 2000, 8 (12), 964-974.
15. Krings, M.; Geisert, H.; Schmitz, R. W.; Krainitzki, H.; Pääbo, S. DNA sequence of the mitochondrial hypervariable region II from the Neandertal type specimen. Proc. Natl. Acad. Sci. U.S.A. 1999, 96 (10), 5581-5585.
16. Ovchinnikov, I. V.; et al. Molecular analysis of Neanderthal DNA from the northern Caucasus. Nature 2000, 404 (6777), 490-493.
17. Schlebusch, C. M.; Lombard, M.; Soodyall, H. MtDNA control region variation affirms diversity and deep sub-structure in populations from southern Africa. BMC Evol. Biol. 2013, 13 (1), 56.
18. Byrne, E. M.; et al. The use of common mitochondrial variants to detect and characterise population structure in the Australian population: implications for genome-wide association studies. Eur. J. Human Genet. 2008, 16 (11), 1396-1403.
19. Burgess, D. J. Disease genetics: Double danger from mitochondrial mutations. Nature Rev. Genet. 2013, 14 (10), 678-679.
20. Wallace, D. C. Mitochondria and cancer. Nature Rev. Cancer 2012, 12 (10), 685-698.
21. Langille, M. G.; et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature Biotechnol. 2013, 31 (9), 814-821.
22. Yergeau, E.; et al. Next-generation sequencing of microbial communities in the Athabasca River and its tributaries in relation to oil sands mining activities. Appl. Environ. Microbiol. 2012, 78 (21), 7626-7637.
23. Marsalek, J.; Rochfort, Q. Urban wet-weather flows: sources of fecal contamination impacting on recreational waters and threatening drinking-water sources. J. Toxicol. Environ. Health, Part A 2004, 67 (20-22), 1765-1777.
24. Kamra, A.; et al. Exfoliated colonic epithelial cells: surrogate targets for evaluation of bioactive food components in cancer prevention. J. Nutr. 2005, 135 (11), 2719-2722.
25. Albaugh, G. P.; et al. Isolation of exfoliated colonic epithelial cells, a novel, non-invasive approach to the study of cellular markers. Int. J. Cancer 1992, 52 (3), 347-350.
26. Andreasson, H.; Gyllensten, U.; Allen, M. Real-time DNA quantification of nuclear and mitochondrial DNA in forensic analysis. Biotechniques 2002, 33 (2), 402-411.
27. Higuchi, R.; von Beroldingen, C. H.; Sensabaugh, G. F.; Erlich, H. A. DNA typing from single hairs. Nature 1988, 332 (6164), 543-546.
28. Soller, J. A.; Schoen, M. E.; Bartrand, T.; Ravenscroft, J. E.; Ashbolt, N. J. Estimated human health risks from exposure to recreational waters impacted by human and nonhuman sources of faecal contamination. Water Res. 2010, 44 (16), 4674-4691.
29. Parsons, T. J.; et al. A high observed substitution rate in the human mitochondrial DNA control region. Nature Genet. 1997, 15, 363-368.
30. Meyer, S.; Weiss, G.; von Haeseler, A. Pattern of nucleotide substitution and rate heterogeneity in the hypervariable regions I and II of human mtDNA. Genetics 1999, 152 (3), 1103-1110.
31. van Oven, M.; Kayser, M. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum. Mutat. 2009, 30 (2), E386-E394.
32. U. S. Environmental Protection Agency (2009) Biological and Water Quality Study of the Lower Little Miami River and Selected Tributaries. OHIO EPA Technical Report EAS/2009-10-06.
33. Hutter, G.; et al. Use of polymorphisms in the noncoding region of the human mitochondrial genome to identify potential contamination of human leukemia-lymphoma cell lines. Hematol. J. 2004, 5 (1), 61-68.
34. Andrews, R. M.; et al. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nature Genet. 1999, 23 (2), 147-147.
35. Brandon, M. C.; et al. MITOMAP: a human mitochondrial genome database—2004 update. Nucleic Acids Res. 2005, 33 (suppl 1), D611-D613.
36. Ingman, M.; Gyllensten, U. mtDB: Human Mitochondrial Genome Database, a resource for population genetics and medical sciences. Nucleic Acids Res. 2006, 34 (suppl 1), D749-D751.
37. Parson, W.; Dür, A. EMPOP—a forensic mtDNA database. Forensic Sci. Int. Genet. 2007, 1 (2), 88-92.
38. Brandon, M. C.; et al. MITOMASTER: a bioinformatics tool for the analysis of mitochondrial DNA sequences. Hum. Mutat. 2009, 30 (1), 1-6.
39. Parson, W.; et al. Evaluation of next generation mtGenome sequencing using the Ion Torrent Personal Genome Machine (PGM). Forensic Sci. Int. Genet. 2013, 7 (5), 543-549.
40. Seo, S. B.; et al. Single nucleotide polymorphism typing with massively parallel sequencing for human identification. Int. J. Legal Med. 2013, 127 (6), 1079-1086.
41. Whiteley, A. S.; et al. Microbial 16S rRNA Ion Tag and community metagenome sequencing using the Ion Torrent (PGM) Platform. J. Microbiol. Methods 2012, 91 (1), 80-88.
42. Li, H.; Homer, N. A survey of sequence alignment algorithms for next-generation sequencing. Brief. Bioinform. 2010, 11 (5), 473-483.
43. Altshuler, D.; et al. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 2000, 407 (6803), 513-516.
44. Li, M.; et al. Detecting heteroplasmy from high-throughput sequencing of complete human mitochondrial DNA genomes. Am. J. Hum. Genet. 2010, 87 (2), 237-249.
45. Stewart, J.; et al. Length variation in HV2 of the human mitochondrial DNA control region. J. Forensic Sci. 2001, 46 (4), 862-870.
46. Lee, C.; Măndoiu, I. I.; Nelson, C. E. Inferring ethnicity from mitochondrial DNA sequence. BMC Proceedings 2011, 5 (2), S11.
47. Wallace, D. C.; Brown, M. D.; Lott, M. T. Mitochondrial DNA variation in human evolution and disease. Gene 1999, 238 (1), 211-230.
48. Cincinnati Census Data Website; http://www.cincinnati-oh.gov/planning/reportsdata/census-demographics.
49. Taylor, R. W.; Turnbull, D. M. Mitochondrial DNA mutations in human disease. Nature Rev. Genet. 2005, 6 (5), 389-402.
50. Wallace, D. C. Mitochondrial genetics: a paradigm for aging and degenerative diseases? Science 1992, 256 (5057), 628-632.
51. Shen, E. Z.; et al. Mitoflash frequency in early adulthood predicts lifespan in Caenorhabditis elegans. Nature 2014, 508 (7494), 128-132.

Figure Legends

Figure 1. Locations of the sampling sites in the Duck Creek Watershed in Cincinnati, Ohio. The sites are marked as green circles in the map, while the red circles represent CSO locations. The boundary of the watershed is shown in the inset map of the state of Ohio. Sites 1, 3, 8, 9 and 10 were used for HVRII sequence analysis due to the consistent abundance of human mitochondrial DNA at these sites. Sites 1 and 3 are located on Duck Creek at river miles 2.0 and 3.4, respectively. Sites 8 and 9 are located on Deerfield Creek, near CSO 556, since this overflow had the highest number of annual overflow events and the largest-ever volumetric contribution to the CSO total overflow. Site 10 is located on Upper Duck Creek close to CSO 68, which had the second highest contribution to the CSO total overflow. Site 1 is downstream of site 3, with the additional influx of water coming from Little Duck Creek. Site 8 is directly downstream of site 9. Site 10 is on a separate section of the creek and is not influenced by influx of water from any other site, while site 3 is influenced by water coming from sites 8, 9 and 10.

Figure 2. Heat map demonstrating the occurrence and variant frequency for SNPs detected in the human mitochondrial HVRII region (positions 51-320 bp relative to rCRS) for all sampling sites at three distinct times (A = October 2011; B = March 2012; C = July 2012). Variant frequency is defined as the number of reads having a SNP divided by the total reads in the sample. All variants with a frequency greater than 5% are reported. Variants 73G and 263G occurred at all sites with a frequency greater than 40%.

Figure 3. Dendrogram (left) from the HCA of SNP frequency data for the study sites (right) obtained from the period of Oct 2011-Jul 2012. The site-specific datasets are grouped into three clusters. Cluster 1 (orange background) is formed by 3C, 3B, 3A; cluster 2 (purple background) is formed by 10A, 10C, 9A, 9B, 8B, 9C, 10B, 8C, 8A; and cluster 3 (grey background) is formed by 1C, 1B, 1A. The study sites are marked as green circles in the map, while the red circles represent CSO locations.

Figure 4. Bar charts showing the haplogroup distribution of Sets A and B for all sequences longer than 300 bp derived from Ion Torrent sequencing of HVRII amplicons. The sequences were annotated using MITOMASTER version Beta 1, which performs variant calling relative to the rCRS, haplotyping based on Phylotree, and variant annotation based on Mitomap. The relative composition of the haplogroups varied considerably across the different sampling sites for both sets. However, haplogroup H was the most abundant at all sites, followed by haplogroup L.

Figure 5. Pie charts demonstrating the population racial diversity in the Duck Creek Watershed obtained through (a) annotation of HVRII sequences (October 2011) into haplogroups, and (b) 2010 population census data (by race). (c) Comparison of the site-specific distribution of population by race obtained through HVRII annotation and 2010 census data, respectively. Census data were obtained for the Cincinnati neighborhood approximations of the Duck Creek Watershed region, which included the Linwood, Oakley, Madisonville and Pleasant Ridge census tracts. Haplogroups were divided into races according to the mitochondrial database Phylotree (White = H, HV, J, K, P, S, T, U, V; African American = L0, L1, L2, L3, L4, L5, L6; American Indian = A, C, X; Asian = B, D, E, F, G, R, W; and Others).
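The grouping listed in the Figure 5 legend can be written out as a small lookup, shown here as a hedged sketch. The group lists are copied from the legend. The longest-prefix matching rule, the function names and the example haplogroup calls are assumptions for illustration and may differ from how MITOMASTER output was actually binned in the study.

```python
from collections import Counter

# Top-level haplogroups per census category, as listed in the Figure 5 legend.
GROUPS = {
    "White": ["H", "HV", "J", "K", "P", "S", "T", "U", "V"],
    "African American": ["L0", "L1", "L2", "L3", "L4", "L5", "L6"],
    "American Indian": ["A", "C", "X"],
    "Asian": ["B", "D", "E", "F", "G", "R", "W"],
}
PREFIX_TO_GROUP = {
    prefix: group for group, prefixes in GROUPS.items() for prefix in prefixes
}


def classify(haplogroup):
    """Map a haplogroup label (e.g. 'L2a1') to a category by longest prefix.

    Assumption: sub-haplogroup names start with their top-level haplogroup.
    Labels that match no listed prefix fall into 'Others'.
    """
    for prefix in sorted(PREFIX_TO_GROUP, key=len, reverse=True):
        if haplogroup.startswith(prefix):
            return PREFIX_TO_GROUP[prefix]
    return "Others"


def composition_percent(haplogroups):
    """Percentage of sequences in each category."""
    counts = Counter(classify(h) for h in haplogroups)
    total = sum(counts.values())
    return {group: 100 * n / total for group, n in counts.items()}


# Hypothetical haplogroup calls for ten sequences, not study data.
calls = ["H1a", "H2", "HV0", "U5", "T2", "L2a1", "L3e", "A2", "B4", "M7"]
composition = composition_percent(calls)
print(composition)
```

Under this rule, a label with no listed prefix (M7 in the toy input) is counted as Others, which corresponds to the "Others" category in the legend.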

Figure 1


Figure 2


Figure 3


Figure 4 (panels A and B: stacked bar charts; y-axis: Percent sequences, 0% to 100%; x-axis: Site 1, Site 3, Site 8, Site 9, Site 10; legend: haplogroups H, L, M, T, J, K, N, U, R, HV and Others)


Figure 5


TOC Graphic
