Article
Sequencing Human Mitochondrial Hypervariable Region II as a Molecular Fingerprint for Environmental Waters Vikram Kapoor, Ronald W. DeBry, Dominic Boccelli, and David Wendell Environ. Sci. Technol., Just Accepted Manuscript • DOI: 10.1021/es503189g • Publication Date (Web): 25 Aug 2014 Downloaded from http://pubs.acs.org on August 26, 2014
Environmental Science & Technology
Sequencing Human Mitochondrial Hypervariable Region II as a Molecular Fingerprint for Environmental Waters
Vikram Kapoor1, Ronald W. DeBry2, Dominic L. Boccelli1 and David Wendell1*
1 Department of Biomedical, Chemical, and Environmental Engineering, University of Cincinnati, Cincinnati, Ohio 45221, USA
2 Department of Biological Sciences, University of Cincinnati, Cincinnati, Ohio 45221, USA
Keywords. Fecal source tracking, human mitochondrial DNA, high-throughput sequencing, genetic barcode, cluster analysis, population diversity
ACS Paragon Plus Environment
ABSTRACT
To protect environmental water from human fecal contamination, authorities must be able to unambiguously identify the source of the contamination. Current identification methods focus on tracking fecal bacteria associated with the human gut, but many of these bacterial indicators also thrive in the environment and in other mammalian hosts. Mitochondrial DNA could solve this problem by serving as a human-specific marker for fecal contamination. Here we show that the human mitochondrial hypervariable region II can function as a molecular fingerprint for human contamination in an urban watershed impacted by combined sewer overflows. We present high-throughput sequencing analysis of hypervariable region II for spatial resolution of the contaminated sites and assessment of the population diversity of the impacting regions. We propose that human mitochondrial DNA from public waste streams may serve as a tool for identifying waste sources definitively, analyzing population diversity, and conducting other anthropological investigations.
INTRODUCTION
For over a century, the standard indicator for environmental water contamination has been fecal bacteria and associated microbial metagenomes. However, many recent investigations [1-3] have recognized the limitations of microbial-based source tracking because bacteria survive in alternative hosts [4] and the environment [5], making the spatial and temporal components ambiguous. Chemical signatures associated with human waste, such as caffeine, sterols, detergents and personal care products, have also been used in source tracking studies [6, 7]; however, issues with persistence and detection sensitivity limit their use as reliable identification tools [7]. Thus, due to the uncertainty of present microbial/chemical source identification markers, there remains a need for a sensitive and unambiguous indicator of human waste in environmental water. Mitochondrial DNA (mtDNA), which contains species-specific sequences, can serve as a direct marker of human contamination, identifying human waste through its own discharged eukaryotic cells [1-3, 8, 9].
Human mtDNA has become a useful tool in a variety of scientific disciplines including criminal forensics [9, 10], paleoanthropology [11-16], population genetics [17, 18], and more recently as part of cancer and degenerative disease investigations [19, 20]. Global sequencing efforts associated with distinct populations have provided improved phylogenetic resolution of the human mtDNA hypervariable regions as a genetic anthropological barcode. Additionally, human mtDNA has been recently used as an identifying mechanism in contaminated environmental waters through species-specific variation in mitochondrial NADH dehydrogenase, discerning human fecal waste from animal sources [1, 2]. In this investigation we apply the specificity afforded by the hypervariable region of the human mitochondrial genome to evaluate anthropogenic inputs to environmental water in a manner similar to molecular rRNA-based speciation used in microbial source tracking, microbial diversity investigations and bacterial metagenomic fingerprinting [21, 22].
Numerous sources of human contamination can be found in environmental waters. Major fecal sources include combined sewer overflows (CSO), sanitary sewer overflows (SSO), household sewage treatment systems, and agriculture/urban runoff [6, 23]. Human fecal waste contains a large number of exfoliated epithelial cells [24, 25], which in turn contain thousands of mitochondrial genomic copies [26], making mtDNA a robust molecular target for impacted environmental water. Additional sources of human mtDNA in the environment can include sloughed skin and hair [26, 27] found in waters used for swimming, canoeing and other recreational activities [8, 28].
We track and characterize human waste in environmental waters by targeting the human mitochondrial hypervariable region II (HVRII). As in previous human evolution studies [12-16], the large number of single-nucleotide polymorphisms (SNPs) present in HVRII can be used as an identifying mechanism since these SNPs have varying allelic frequencies among populations [14, 29, 30]. We have applied this allelic frequency specificity as a unique “barcode” to identify impacted water sample sites, since the waste impact is related to the humans contributing fecal waste near the sites (CSOs/SSOs; overland runoff) and persistence of human mtDNA from upstream sources. To determine the HVRII sequence diversity on a large scale, we have used high-throughput sequencing technology for characterizing human mtDNA HVRII variation found in water samples taken from an urban creek system (Duck Creek Watershed, Cincinnati OH) impacted by municipal CSOs and other human activities. We next used the HVRII sequences to extract haplotypes and assign mitochondrial haplogroups based on the Phylotree database [31]. Furthermore, we compared the population diversity obtained through HVRII-derived haplogrouping to the U.S. federal census data (by race) for the neighborhoods bordering the watershed.
MATERIALS AND METHODS
Study area and sampling sites. The Duck Creek Watershed feeds a National and State Scenic River (Little Miami River) but has limited aquatic and riparian habitat [32] and, as a result of CSOs, has been shown to significantly impact the Little Miami River with human fecal contamination [2]. Initially, ten sampling points were selected within the Duck Creek watershed. The sites were chosen based on proximity to CSOs, traditional municipal sampling sites and potential impact from human fecal pollution from sewage overflow and watershed runoff. The sampling sites were identified and assessed for the presence of human mitochondrial DNA through PCR-based detection as described in previous work [2]. Of the 10 sites, five (Sites 1, 3, 8, 9 and 10; Figure 1 and Table S1) were chosen for further analysis based on the consistent abundance of human mitochondrial DNA throughout the sampling period (Oct 2011 - Jul 2012).
Sample processing, PCR and sequencing. Five river sites were selected within the Duck Creek Watershed representing different degrees of anthropogenic influence. Water sample collection and DNA extraction were performed as described earlier [2]. A PCR assay targeting the mitochondrial hypervariable region II (422 bp) was carried out using the primers [33] HVRII-F (5'-GGTCTATCACCCTATTAACCAC-3') and HVRII-R (5'-CTGTTAAAAGTGCATACCGCC-3') linked to the site-specific barcodes (Table S2). For all PCR assays, water DNA extracts (5 µL) were used as templates in a final reaction volume of 50 µL using the OneTaq master mix (New England Biolabs, Ipswich, MA) with 200 nM each of the forward and reverse primer in a GeneAmp® PCR System 9700 thermal cycler (Applied Biosystems, Green Island, NY) under the following cycling conditions: initial denaturation of 30 s at 94 °C, followed by 35 cycles of 15 s at 94 °C, 20 s at 56 °C and 30 s at 68 °C, and a final extension step of 5 min at 68 °C. All PCR products were purified using the MinElute PCR Purification kit (Qiagen, Valencia, CA) and quantified using a NanoDrop 1000 Spectrophotometer (Thermo Scientific, Wilmington, DE). Controls containing no template DNA were used to check for cross contamination. Additionally, PCR inhibition was tested in water DNA extracts by using 10-fold dilutions of each DNA extract. The amplification products from the same sampling event were pooled in an equimolar ratio to conduct multiplexed sequencing using the Ion Torrent Personal Genome Machine (PGM) system (Life Technologies, San Francisco, CA). Sequencing of each pooled library was performed on the PGM system using a 314 chip v2 with the Ion PGM Template OT2 400 kit and Ion PGM Sequencing 400 kit according to the manufacturer's protocol. The HVRII sequence of the operator was also determined through Sanger sequencing, and it was confirmed that it did not contribute to the experimental data.
Bioinformatics analyses. All PGM sequences were sorted according to barcodes and grouped under their respective sites. To compensate for potential sequencing errors, sequences having an average quality under 20 [10] (as derived from the automated analysis carried out by the Torrent Suite Software version 3.6), having unidentified bases (Ns), or being shorter than 300 bp were discarded. The quality-filtered sequences were then aligned to the revised Cambridge Reference Sequence (rCRS) [34] for human mitochondrial DNA (NC_012920.1, Homo sapiens mitochondrion, complete genome) and analyzed with the Torrent Suite Software version 3.6 (Life Technologies) using the plug-in VariantCallerForMtDNA version 3.0. The output of the variant caller is presented in tabular format, as a list of variations to the rCRS along with variant frequency values (Tables S3-S5). Additionally, the FASTQ files provided via the Ion Torrent server were exported to CLC Genomics Workbench Version 6.5 (CLC Bio, Cambridge, MA) and aligned to the rCRS, after which Quality-based Variant Detection was run to detect insertions and deletions (indels) as well as SNPs with reference to the rCRS. CLC Workbench variant caller analysis parameters used in this study are given in Tables S6 and S7. A control DNA sample of HVRII sequence with known variants, which had been previously determined by conventional Sanger sequencing, was included during the analysis. The mitochondrial genome databases, including MITOMAP [35], mtDB [36], EMPOP [37] and Phylotree [31], were consulted to validate the occurrence of detected variants. Sequences were submitted to MITOMASTER version Beta 1 [38] to extract haplotypes and assign mitochondrial haplogroups according to the sequence motifs present in HVRII. MITOMASTER performs variant calling relative to the rCRS, haplotyping based on Phylotree, and variant annotation based on MITOMAP.
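The read-level filter described above can be sketched in a few lines of Python. This is only an illustration of the three stated criteria and is not the Torrent Suite implementation: the function names, the Phred+33 quality encoding and the tuple layout of a read are assumptions, and the quality cutoff is passed in by the caller.

```python
from statistics import mean

def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in quality_string]

def passes_filter(sequence, quality_string, min_mean_quality, min_length=300):
    """Return True if a read survives the three criteria described in the text."""
    if not sequence or not quality_string:
        return False
    if len(sequence) < min_length:      # shorter than the length cutoff
        return False
    if "N" in sequence.upper():         # contains unidentified bases
        return False
    # Reads whose average quality falls under the cutoff are discarded.
    return mean(phred_scores(quality_string)) >= min_mean_quality

def filter_reads(records, min_mean_quality, min_length=300):
    """Keep (read_id, sequence, quality_string) tuples that pass every criterion."""
    return [rec for rec in records
            if passes_filter(rec[1], rec[2], min_mean_quality, min_length)]
```

Keeping the three checks in one predicate makes it easy to count how many reads each criterion removes, which is often useful when comparing sites with very different read depths.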
Spatial analysis. Hierarchical cluster analysis (HCA) was used to classify the five sampling sites (1, 3, 8, 9 and 10) into spatial associations using the frequency distribution of SNPs obtained for three events, October 2011 (Set A; wet weather), March 2012 (Set B; dry weather) and July 2012 (Set C; dry weather) (Tables S3-S5), for a total of 15 data sets labeled as 1A, 3A, etc., where 1 is the site and A refers to the event. HCA is an exploratory pattern detection method that partitions all cases into unique groups. Prior to HCA, the normality of the SNP frequency distribution (sorted by sampling location; frequency cutoff = 5%) was verified by analyzing the histograms and by applying the Shapiro–Wilk test. The combination of Euclidean distances as a similarity-dissimilarity measure and Ward's method as a linkage algorithm was then applied to obtain the case clusters. The data matrix used for classification has the dimension of 15 (sampling points) × 25 (SNP frequencies), resulting in a total of 375 data points. Using this approach, it was possible to reduce the large number of HVRII sequences to 15 site-specific data sets.
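As a rough illustration of the clustering step, the sketch below applies Ward's criterion to labelled frequency vectors, merging at each step the pair of clusters whose union least increases the within-cluster sum of squared Euclidean distances. The software used for the published analysis is not named here, so this pure-Python version, its function names and the toy three-SNP vectors are assumptions; it returns flat clusters rather than a dendrogram and omits the Shapiro–Wilk check.

```python
from itertools import combinations

def centroid(rows):
    """Column-wise mean of a list of equal-length frequency vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def ward_cost(rows_a, rows_b):
    """Increase in within-cluster sum of squares if the two clusters merge."""
    ca, cb = centroid(rows_a), centroid(rows_b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    na, nb = len(rows_a), len(rows_b)
    return na * nb / (na + nb) * sq_dist

def ward_clusters(fingerprints, n_clusters):
    """Agglomerate labelled SNP-frequency vectors until n_clusters remain."""
    clusters = [[label] for label in fingerprints]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters that is cheapest to merge under Ward's criterion.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: ward_cost(
                [fingerprints[lab] for lab in pair[0]],
                [fingerprints[lab] for lab in pair[1]],
            ),
        )
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return [sorted(c) for c in clusters]

# Hypothetical fingerprints: label = site + event, values = SNP frequencies.
toy_fingerprints = {
    "1A": [0.9, 0.1, 0.0],
    "1B": [0.8, 0.2, 0.0],
    "3A": [0.1, 0.9, 0.5],
    "3B": [0.2, 0.8, 0.6],
}
toy_clusters = ward_clusters(toy_fingerprints, n_clusters=2)
```

With the study's data, `fingerprints` would hold 15 labels, each mapped to a vector of 25 SNP frequencies.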
RESULTS
Sequence and variant detection of HVRII for study sites. The Ion Torrent PGM system was used to examine the sequence diversity of HVRII amplicons as recently reported [39-41]. In total, more than 10,000,000 sequence reads were retrieved with a mean output exceeding 200,000 per pooled set, which were then filtered and grouped according to their respective sites. The absolute number of HVRII sequence reads per site was consistent with our previous human mtDNA qPCR results for the watershed [2]. SNPs were detected using the Torrent Suite plug-in VariantCallerForMtDNA version 3.0, which applies a TMAP Smith–Waterman alignment optimization [42] and outputs the variant allele frequency (%). Concurrently, sequences were analyzed using CLC Genomics Workbench Version 6.5, which employed the Neighborhood Quality Standard (NQS) algorithm [43] to detect insertions and deletions (indels) and validate the SNPs detected via the variant caller. Indeed, the SNP and indel analysis produced no significant differences from the variant caller data, supporting the reproducibility of the results by alternative computational methods. The HVRII sequence of the operator was compared to the sample sequence databases to check for cross contamination and produced no exact matches against the database reads.
HVRII DNA from five sampling events was sequenced and screened, producing an average read length of approximately 300 bp and a mean output of 20,000 sequences per site for a particular sampling event. Of this, a 270 bp portion from base position 51 to 320 was used for variant detection since these SNPs have been well documented [35]. A total of 31 distinct SNPs were detected, of which 30 are present in the MITOMAP database of mtDNA Control Region Sequence Variants [35]. The relative distribution of SNPs (with frequency > 5%) for each site over the annual sampling period is presented in Figure 2. We observed some SNPs that were common to all sites with varying frequencies, while other SNPs were site-specific, allowing each sampling location to have a unique human mtDNA signature in the form of SNP allelic frequencies. The variation in site-specific SNP frequencies could be the result of several factors including limited sample size, daily population changes related to employment (for a comparison of site populations see Table S8), changes in sampling time and storm runoff volumes during wet weather events, or a combination of these variables. Variants 73G and 263G were detected in all samples with high frequency (> 40%). Variants 143A and 236C were detected only at site 10 during the entire sampling period. This is expected since site 10 is on a separate tributary of Duck Creek and is not influenced by influx of water from any other site. Site 1 is downstream of site 3; however, some SNPs were detected at site 1 but not site 3 (151T, 182T, 185T, 235G, 239C). This may be due to the influx of water between the two sites (confluence with Little Duck Creek; see Figure 1) as well as additional inputs from runoff and other CSOs. Interestingly, some of the variants found at site 1 but not site 3 (151T, 182T, 185T) are present at sites 8, 9 and 10 as well.
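One way to picture how a per-site SNP frequency table with a 5% cutoff could be derived is sketched below. It assumes gap-free reads already aligned to the same reference window, which is a strong simplification of what the variant callers do; the function name and the toy three-base reference are hypothetical and are not taken from the rCRS.

```python
from collections import Counter

def variant_frequencies(reference, reads, start=1, cutoff=0.05):
    """Return {"<position><base>": frequency} for non-reference bases above the cutoff.

    reference : reference bases for the analysed window
    reads     : gap-free reads aligned to that window (same length as reference)
    start     : 1-based coordinate of the first reference base
    cutoff    : minimum allele frequency to report (0.05 corresponds to 5%)
    """
    result = {}
    for offset, ref_base in enumerate(reference):
        counts = Counter(read[offset] for read in reads)
        depth = sum(counts.values())
        for base, n in counts.items():
            freq = n / depth
            if base != ref_base and freq > cutoff:
                result[f"{start + offset}{base}"] = freq
    return result

# Toy example: six of ten reads carry a G at the second position.
toy_reads = ["CGT"] * 6 + ["CAT"] * 4
toy_freqs = variant_frequencies("CAT", toy_reads)
```

Each site and sampling event would yield one such dictionary, and stacking them over a common list of SNPs gives the frequency matrix used for clustering.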
The rCRS reference sequence contains a tract of seven cytosines from positions 303 to 309 [34]. Length heteroplasmy has been known to occur in this stretch due to C insertions that can create C-stretches of eight or more Cs [44, 45]. An additional C was found in most of the samples, while sampling site 9 presented two additional Cs in this region. Many samples were also found to have an additional cytosine with respect to the rCRS in the cytosine tract 311–315. The frequency of one additional C in the 311–315 region ranges from 20% to 70% within the sample sets, while the frequency of two additional Cs is between 7% and 60%. Sequences with three additional Cs in the 311–315 tract were the rarest and were found only at site 9 with less than 5% frequency.
Spatial variability of sites using SNP allelic frequencies. Because the sampling locations are impacted by upstream conditions (e.g., CSOs) and immediate surroundings, the frequencies of the HVRII SNP alleles at each sampling location should provide a “fingerprint” specific to each location. Using the allelic frequencies generated for each site, we applied cluster analysis to distinguish environmental water obtained from individual sample locations within the watershed. The dendrogram of the location pattern resulting from the HCA of HVRII sequence SNP data from the period of Oct 2011–Jul 2012 is presented in Figure 3, illustrating distinct site clusters. The sampling sites were grouped into three main clusters based on their specific HVRII SNP signature. Cluster 1 was formed by site 3; cluster 2 by sites 8, 9 and 10; and cluster 3 by site 1. Cluster 1 is characterized by the highest linkage distance to the other clusters. Clusters 2 and 3 are linked at a shorter distance and are together linked to cluster 1 at a higher distance. Note that cluster 1 corresponds to the middle catchment of the Duck Creek Watershed; cluster 3 is the lower catchment; while cluster 2, formed by sites 8, 9 and 10, is located in the upper catchment. Sites 8 and 9 are directly linked to each other since they are on the same section of river without influence of water influx from other CSO sources. Interestingly, site 1 clusters most closely with itself and with sites 8, 9 and 10, despite site 3 lying between them. It is also interesting to note that the sites are most self-similar despite the time between sampling events and the differences in weather (wet weather for set A and dry weather for sets B and C). These results support the applicability of HVRII sequence analysis as a metagenomics tool for human contamination sources in environmental water and provide a mechanism for spatial classification of sites based on human mitochondrial variable region SNP allelic frequencies, or ‘HVR fingerprint’.
We further compared the Euclidean distances between the five different sites based upon the observed SNP frequency “fingerprints” (Figure S1). For all five locations, the smallest Euclidean distance generally occurred when performing a self-comparison of the SNP frequency fingerprints. Similar to the clustering results (Figure 3), sites 8, 9 and 10 tended to be most similar to each other. Additionally, when looking at the results from sites 1 and 3, the SNP frequencies are more closely related to sites 8, 9 and 10 than to each other, even though site 1 is just downstream of site 3. The expectation is that sites immediately up/down-stream of each other would be most similar in terms of SNP frequencies, even with possible degradation of the mtDNA. To investigate the observed differences between sites 1 and 3, an existing Storm Water Management Model (SWMM) of the combined sanitary/storm water system (provided by the Metropolitan Sewer District of Greater Cincinnati) was used to identify the local regions that contribute to the combined systems that could impact the receiving streams (see Figure S2). Site 9 is heavily influenced by the population in Kennedy Heights, while Site 10 is influenced by Kennedy Heights, Pleasant Ridge and parts of Oakley. Site 8 is directly downstream of Site 9 and further influenced by Madisonville. These three sites all have commonality with Kennedy Heights and the demographics within that region. Site 3, located downstream of the confluence of Sites 8 and 10, is additionally impacted by the Linwood region, which increases the differences between Site 3 and Sites 8, 9 and 10. Finally, Site 1, located further downstream than Site 3 after the confluence of two additional tributaries, continues to show additional differences from Site 3 (as expected) but was observed to be more similar to Sites 8, 9 and 10, which was not expected. When assessing the regional impacts, the tributary that merges with the flow passing Site 3 is additionally impacted by the Madisonville area, which would strengthen the SNP frequency impact from the similar demographics in the Madisonville and Kennedy Heights region, thereby strengthening the similarity with the sites upstream of Site 3. These different flow paths may also explain why SNPs 151T, 182T and 185T were observed at Site 1 (as well as Sites 8, 9 and 10) but not Site 3.
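The self-comparison described above can be illustrated with a nearest-neighbour check on labelled fingerprints. The labels follow the site-plus-event scheme used in the text (for example 1A), while the vectors, the function names and the use of `math.dist` (Python 3.8 or later) are assumptions made for this sketch.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def nearest_fingerprint(label, fingerprints):
    """Return the label of the other fingerprint closest in Euclidean distance."""
    others = (k for k in fingerprints if k != label)
    return min(others, key=lambda k: dist(fingerprints[label], fingerprints[k]))

def site_of(label):
    """Strip the trailing event letter, e.g. '10A' -> '10'."""
    return label[:-1]

def self_similar_fraction(fingerprints):
    """Fraction of fingerprints whose nearest neighbour comes from the same site."""
    hits = sum(
        site_of(nearest_fingerprint(k, fingerprints)) == site_of(k)
        for k in fingerprints
    )
    return hits / len(fingerprints)

# Hypothetical SNP-frequency vectors for two sites sampled at two events.
toy_fingerprints = {
    "1A": [0.9, 0.1, 0.0],
    "1B": [0.8, 0.2, 0.0],
    "3A": [0.1, 0.9, 0.5],
    "3B": [0.2, 0.8, 0.6],
}
toy_fraction = self_similar_fraction(toy_fingerprints)
```

A fraction close to 1 would correspond to the observation that each site is most similar to itself across sampling events.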
Population diversity assessment using haplogroup classification. Distinct mitochondrial haplogroups have arisen from mutation during human evolution and largely follow the migration of Homo sapiens from specific geographical regions [36, 46, 47]. These paleoanthropological haplogroups can also be assigned to race based on the frequency of observation as a means of investigating population diversity [46, 47]. Consequently, we sought to use our human mtDNA sequences to extract haplotypes and classify them into haplogroups by comparing them to the Phylotree database [31]. The mitochondrial sequences from sets A (wet weather) and B (dry weather) were compared and assigned to haplogroups based on the differences in HVRII sequence mutations with respect to the rCRS. We observed abundant diversity from haplogroup data of HVRII amplicons at all sites, which is consistent with the clear indication of human contamination in the creeks [2]. Although the sequences were obtained from an equimolar pool of the HVRII amplicons, the relative composition of the haplogroups varied considerably across the different sampling sites (Figure 4). The most salient feature of the haplogroup distribution in the clustered sequences was the relatively high frequency of haplogroup H (30-50%). Haplogroup H includes the rCRS and is typically characterized by variant 73A in HVRII; most of the other haplogroups are characterized by 73G. Haplogroup L was also relatively abundant, but with large sample-to-sample variation.
To further explore the applicability of our HVRII-derived haplogroup data to local population diversity as defined by 2010 U.S. census data [48], several mitochondrial databases and studies were consulted to assign haplogroups to the general population groups [35-37, 46, 47]. From these, we used the Wallace [47] haplogroup classification according to which L0, L1, L2, L3, L4, L5, L6 were assigned as 'African American'; H, HV, J, K, P, S, T, U, V as 'White'; B, D, E, F, G, M, R, W as 'Asian'; and A, C, X represented 'American Indian'; while all other, less numerous haplogroups were assigned to an 'other' category. Figure 5 presents the comparative analysis of the population data obtained through the two strategies: HVRII-derived population groups vis-à-vis the census data for population (by race). Classification of the mtDNA haplogroups showed 20% African mtDNA, 59% European mtDNA, and 12% Asian/American Indian mtDNA. According to census data, 62% self-declared as White, 32% as African American and 2% as Asian. There was a strong correlation between the federal census data and the mitochondrial haplogroups as an indicator of population composition (Pearson product-moment correlation coefficient, r = 0.97), demonstrating the suitability of human mitochondrial sequences to infer the population structure of the neighborhoods impacting the watershed. One important deviation from the census data was the significantly larger percentage of Asian/American Indian mtDNA detected (Figure 5). This discrepancy could be the result of several factors: coarse haplogroup assignments, proximity to the creek CSO input or underrepresentation in the census data. The last of these three possibilities raises the prospect that HVR sequencing directly from waste streams or impacted water may provide a more accurate means of deducing population diversity, since human waste disposal is a personal necessity while census response is not. We suggest future epidemiological studies that employ HVR sequencing methods from waste streams, which may provide complementary population diversity information as well as additional insight unavailable to voluntary census response data collection.
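The grouping and the correlation described above can be sketched as follows. The lookup table reproduces the haplogroup-to-group assignment quoted in the text; matching a full haplogroup label by its longest listed prefix is an assumption of this sketch, as are the function names. The Pearson example uses only the three summary percentages quoted in the text, whose categories are not strictly like-for-like, so its result need not match the reported r, which was presumably computed from the full data behind Figure 5.

```python
from math import sqrt

# Assignment as quoted in the text (attributed there to the Wallace classification).
GROUPS = {
    "African American": ["L0", "L1", "L2", "L3", "L4", "L5", "L6"],
    "White": ["H", "HV", "J", "K", "P", "S", "T", "U", "V"],
    "Asian": ["B", "D", "E", "F", "G", "M", "R", "W"],
    "American Indian": ["A", "C", "X"],
}

def population_group(haplogroup):
    """Map a label such as 'H1a' or 'L2b' to a group by its longest listed prefix."""
    best, best_len = "other", 0
    for group, prefixes in GROUPS.items():
        for prefix in prefixes:
            if haplogroup.startswith(prefix) and len(prefix) > best_len:
                best, best_len = group, len(prefix)
    return best

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

# Percentages quoted in the text, ordered African, European/White, Asian.
# The mtDNA "Asian" value also includes American Indian lineages.
mtdna_pct = [20, 59, 12]
census_pct = [32, 62, 2]
r_summary = pearson_r(mtdna_pct, census_pct)
```

Tallying `population_group` over all assigned haplogroups at a site gives the percentages that are then compared with the census categories.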
DISCUSSION
Several ribotyping investigations have attempted to associate the human intestinal microflora with bacterial metagenomes found in environmental waters [21, 41]; however, there is significant variation in microbial species composition within and between individuals. Moreover, the microbial communities might replicate after discharge in water, making it difficult to differentiate bacteria associated with fecal contamination events. Conversely, mitochondrial sequences represent a direct marker of human waste since they are derived from the host cells, which in turn enables the mtDNA HVRs to define inter-individual variation and population dynamics contributing to the impacted water.
We investigated the occurrence of HVRII allelic frequencies of human mtDNA derived from water samples taken within an impacted urban creek system. We used SNPs within the human HVRII region to form site-specific genetic barcodes (HVR fingerprint) for evaluating anthropogenic watershed inputs. Human mtDNA is readily available in public waste streams and impacted environmental waters, allowing this approach to be more broadly applied as a metagenomics tool for studying human population diversity, waste source tracking and other anthropological investigations.
Water samples taken from the impacted watershed contained mitochondrial genome copy equivalents ranging from a few hundred to several hundred thousand human mtDNA copies [2]. As a result of the large amount of mtDNA and its abundant diversity, it was impractical to isolate and sequence full mitochondrial genomes from our environmental water samples. However, the analysis of small mtDNA regions that have maximal discriminative power has proven useful in past anthropological studies [14-17]; Krings et al. [15] determined 340 bp of the mtDNA HVRII from the Neandertal type specimen to better estimate the relationship of the Neandertal mtDNA to the contemporary human mtDNA gene pool, an approach adapted in this investigation.
Altogether, the HCA approach (Figure 3) combined with site-specific frequency distribution of SNPs (Figure 2) represents a unique classification for environmental waters that was both location and human community specific. The HVR fingerprint specificity was surprising considering the mixed nature of municipal sewage, variation in CSO discharge with weather and temporal separation of sampling events. The molecular fingerprinting strategy described here may be further adapted to analyze additional human mtDNA genes either via impacted environmental water or directly through public wastewater. This may provide a significant resource for local community mtDNA genetics, and could be used to examine the association of human disease and aging with mtDNA genes [49-51]. To this end, it has been reported that mitochondrial gene mutations might predispose individuals to diseases like diabetes, Alzheimer’s and Parkinson’s [47, 49, 50]; however, the true impact of these mutations on human health remains to be determined. Studies involving the correlative analysis of mtDNA variation of human mitochondrial sequences found in human-impacted environments may provide a direct route to examine the prevalence of these diseases in the local population.
With respect to the mtDNA analysis, we applied a high-throughput sequencing strategy to analyze human mitochondrial HVRII DNA obtained from an urban creek system at different time points, accounting for spatial-temporal resolution of human contamination in an urban watershed. The use of barcoded primers allowed multiplexed sample sequencing and enabled the identification of collective HVRII SNP frequencies that were site specific. Although our study was confined to analysis of the HVRII region of human mtDNA in a limited number of geographic locations, a wealth of information can be obtained through other mtDNA genomic targets, particularly regions associated with aging and cancer [47, 50, 51].
ASSOCIATED CONTENT
Supporting Information. Tables of the sampling sites, molecular barcodes used for multiplexed sequencing, HVRII SNP frequency data and bioinformatics analysis parameters for CLC Genomics Workbench; and figures for Euclidean distance between sites and GIS map of Duck Creek Watershed showing combined sewer lines. This material is available free of charge via the Internet at http://pubs.acs.org.
344
AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected]
ACKNOWLEDGMENTS
We thank R. Ravi, E. Wurtzler and N. Punuru for assistance in the laboratory and C. Smith for help in sample collection. This research was supported by the Metropolitan Sewer District of Greater Cincinnati and a URC Graduate Research Fellowship from the University of Cincinnati (Cincinnati, OH).
REFERENCES
1. Caldwell, J. M.; Raley, M. E.; Levine, J. F. Mitochondrial multiplex real-time PCR as a source tracking method in fecal-contaminated effluents. Environ. Sci. Technol. 2007, 41 (9), 3277-3283.
2. Kapoor, V.; Smith, C.; Santo Domingo, J. W.; Lu, T.; Wendell, D. Correlative Assessment of Fecal Indicators using Human Mitochondrial DNA as a Direct Marker. Environ. Sci. Technol. 2013, 47 (18), 10485-10493.
3. Vuong, N.-M.; et al. Fecal source tracking in water using a mitochondrial DNA microarray. Water Res. 2012, 47 (1), 16-30.
4. Gordon, D. M. Geographical structure and host specificity in bacteria and the implications for tracing the source of coliform contamination. Microbiology 2001, 147 (5), 1079-1085.
5. Anderson, K. L.; Whitlock, J. E.; Harwood, V. J. Persistence and differential survival of fecal indicator bacteria in subtropical waters and sediments. Appl. Environ. Microbiol. 2005, 71 (6), 3041-3048.
6. Glassmeyer, S. T.; et al. Transport of chemical and microbial compounds from known wastewater discharges: potential for use as indicators of human fecal contamination. Environ. Sci. Technol. 2005, 39 (14), 5157-5169.
7. Hagedorn, C.; Weisberg, S. B. Chemical-based fecal source tracking methods: current status and guidelines for evaluation. Rev. Environ. Sci. Biotechnol. 2009, 8 (3), 275-287.
8. Martellini, A.; Payment, P.; Villemur, R. Use of eukaryotic mitochondrial DNA to differentiate human, bovine, porcine and ovine sources in fecally contaminated surface water. Water Res. 2005, 39 (4), 541-548.
9. Budowle, B.; Allard, M. W.; Wilson, M. R.; Chakraborty, R. Forensics and Mitochondrial DNA: Applications, Debates, and Foundations. Annu. Rev. Genomics Hum. Genet. 2003, 4 (1), 119-141.
10. Wilson, M. R.; DiZinno, J. A.; Polanskey, D.; Replogle, J.; Budowle, B. Validation of mitochondrial DNA sequencing for forensic casework analysis. Int. J. Legal Med. 1995, 108 (2), 68-74.
11. Gill, P.; et al. Identification of the remains of the Romanov family by DNA analysis. Nat. Genet. 1994, 6 (2), 130-135.
12. Ingman, M.; Kaessmann, H.; Pääbo, S.; Gyllensten, U. Mitochondrial genome variation and the origin of modern humans. Nature 2000, 408 (6813), 708-713.
13. Wallace, D. C. Mitochondrial DNA sequence variation in human evolution and disease. Proc. Natl. Acad. Sci. U.S.A. 1994, 91 (19), 8739-8746.
14. Salas, A.; Lareu, V.; Calafell, F.; Bertranpetit, J.; Carracedo, A. mtDNA hypervariable region II (HVII) sequences in human evolution studies. Eur. J. Human Genet. 2000, 8 (12), 964-974.
15. Krings, M.; Geisert, H.; Schmitz, R. W.; Krainitzki, H.; Pääbo, S. DNA sequence of the mitochondrial hypervariable region II from the Neandertal type specimen. Proc. Natl. Acad. Sci. U.S.A. 1999, 96 (10), 5581-5585.
16. Ovchinnikov, I. V.; et al. Molecular analysis of Neanderthal DNA from the northern Caucasus. Nature 2000, 404 (6777), 490-493.
17. Schlebusch, C. M.; Lombard, M.; Soodyall, H. MtDNA control region variation affirms diversity and deep sub-structure in populations from southern Africa. BMC Evol. Biol. 2013, 13 (1), 56.
18. Byrne, E. M.; et al. The use of common mitochondrial variants to detect and characterise population structure in the Australian population: implications for genome-wide association studies. Eur. J. Human Genet. 2008, 16 (11), 1396-1403.
19. Burgess, D. J. Disease genetics: Double danger from mitochondrial mutations. Nature Rev. Genet. 2013, 14 (10), 678-679.
20. Wallace, D. C. Mitochondria and cancer. Nature Rev. Cancer 2012, 12 (10), 685-698.
21. Langille, M. G.; et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nature Biotechnol. 2013, 31 (9), 814-821.
22. Yergeau, E.; et al. Next-generation sequencing of microbial communities in the Athabasca River and its tributaries in relation to oil sands mining activities. Appl. Environ. Microbiol. 2012, 78 (21), 7626-7637.
23. Marsalek, J.; Rochfort, Q. Urban wet-weather flows: sources of fecal contamination impacting on recreational waters and threatening drinking-water sources. J. Toxicol. Environ. Health, Part A 2004, 67 (20-22), 1765-1777.
24. Kamra, A.; et al. Exfoliated colonic epithelial cells: surrogate targets for evaluation of bioactive food components in cancer prevention. J. Nutr. 2005, 135 (11), 2719-2722.
25. Albaugh, G. P.; et al. Isolation of exfoliated colonic epithelial cells, a novel, non-invasive approach to the study of cellular markers. Int. J. Cancer 1992, 52 (3), 347-350.
26. Andreasson, H.; Gyllensten, U.; Allen, M. Real-time DNA quantification of nuclear and mitochondrial DNA in forensic analysis. Biotechniques 2002, 33 (2), 402-411.
27. Higuchi, R.; von Beroldingen, C. H.; Sensabaugh, G. F.; Erlich, H. A. DNA typing from single hairs. Nature 1988, 332 (6164), 543-546.
28. Soller, J. A.; Schoen, M. E.; Bartrand, T.; Ravenscroft, J. E.; Ashbolt, N. J. Estimated human health risks from exposure to recreational waters impacted by human and nonhuman sources of faecal contamination. Water Res. 2010, 44 (16), 4674-4691.
29. Parsons, T. J.; et al. A high observed substitution rate in the human mitochondrial DNA control region. Nature Genet. 1997, 15, 363-368.
30. Meyer, S.; Weiss, G.; von Haeseler, A. Pattern of nucleotide substitution and rate heterogeneity in the hypervariable regions I and II of human mtDNA. Genetics 1999, 152 (3), 1103-1110.
31. van Oven, M.; Kayser, M. Updated comprehensive phylogenetic tree of global human mitochondrial DNA variation. Hum. Mutat. 2009, 30 (2), E386-E394.
32. U. S. Environmental Protection Agency (2009) Biological and Water Quality Study of the Lower Little Miami River and Selected Tributaries. OHIO EPA Technical Report EAS/2009-10-06.
33. Hutter, G.; et al. Use of polymorphisms in the noncoding region of the human mitochondrial genome to identify potential contamination of human leukemia-lymphoma cell lines. Hematol. J. 2004, 5 (1), 61-68.
34. Andrews, R. M.; et al. Reanalysis and revision of the Cambridge reference sequence for human mitochondrial DNA. Nature Genet. 1999, 23 (2), 147-147.
35. Brandon, M. C.; et al. MITOMAP: a human mitochondrial genome database—2004 update. Nucleic Acids Res. 2005, 33 (suppl 1), D611-D613.
36. Ingman, M.; Gyllensten, U. mtDB: Human Mitochondrial Genome Database, a resource for population genetics and medical sciences. Nucleic Acids Res. 2006, 34 (suppl 1), D749-D751.
37. Parson, W.; Dür, A. EMPOP—a forensic mtDNA database. Forensic Sci. Int. Genet. 2007, 1 (2), 88-92.
38. Brandon, M. C.; et al. MITOMASTER: a bioinformatics tool for the analysis of mitochondrial DNA sequences. Hum. Mutat. 2009, 30 (1), 1-6.
39. Parson, W.; et al. Evaluation of next generation mtGenome sequencing using the Ion Torrent Personal Genome Machine (PGM). Forensic Sci. Int. Genet. 2013, 7 (5), 543-549.
40. Seo, S. B.; et al. Single nucleotide polymorphism typing with massively parallel sequencing for human identification. Int. J. Legal Med. 2013, 127 (6), 1079-1086.
41. Whiteley, A. S.; et al. Microbial 16S rRNA Ion Tag and community metagenome sequencing using the Ion Torrent (PGM) Platform. J. Microbiol. Methods 2012, 91 (1), 80-88.
42. Li, H.; Homer, N. A survey of sequence alignment algorithms for next-generation sequencing. Brief. Bioinform. 2010, 11 (5), 473-483.
43. Altshuler, D.; et al. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 2000, 407 (6803), 513-516.
44. Li, M.; et al. Detecting heteroplasmy from high-throughput sequencing of complete human mitochondrial DNA genomes. Am. J. Hum. Genet. 2010, 87 (2), 237-249.
45. Stewart, J.; et al. Length variation in HV2 of the human mitochondrial DNA control region. J. Forensic Sci. 2001, 46 (4), 862-870.
46. Lee, C.; Măndoiu, I. I.; Nelson, C. E. Inferring ethnicity from mitochondrial DNA sequence. BMC Proceedings. 2011, 5 (2), S11; BioMed Central Ltd.
47. Wallace, D. C.; Brown, M. D.; Lott, M. T. Mitochondrial DNA variation in human evolution and disease. Gene 1999, 238 (1), 211-230.
48. Cincinnati Census Data Website; http://www.cincinnati-oh.gov/planning/reportsdata/census-demographics.
49. Taylor, R. W.; Turnbull, D. M. Mitochondrial DNA mutations in human disease. Nature Rev. Genet. 2005, 6 (5), 389-402.
50. Wallace, D. C. Mitochondrial genetics: a paradigm for aging and degenerative diseases? Science 1992, 256 (5057), 628-632.
51. Shen, E. Z.; et al. Mitoflash frequency in early adulthood predicts lifespan in Caenorhabditis elegans. Nature 2014, 508 (7494), 128-132.
Figure Legends
Figure 1. Locations of the sampling sites in the Duck Creek Watershed in Cincinnati, Ohio. The sites are marked as green circles in the map while the red circles represent CSO locations. The boundary of the watershed is shown in the inset map of the state of Ohio. Sites 1, 3, 8, 9 and 10 were used for HVRII sequence analysis due to the consistent abundance of human mitochondrial DNA at these sites. Sites 1 and 3 are located on Duck Creek at river mile 2.0 and 3.4. Sites 8 and 9 are located on Deerfield Creek, near CSO 556, since this overflow had the highest number of annual overflow events and the largest-ever volumetric contribution to the CSO total overflow. Site 10 is located on Upper Duck Creek close to CSO 68, which had the second highest contribution to the CSO total overflow. Site 1 is downstream of site 3 with the additional influx of water coming from Little Duck Creek. Site 8 is directly downstream of site 9. Site 10 is on a separate section of the creek and is not influenced by influx of water from any other site, while site 3 is influenced by water coming from sites 8, 9 and 10.
Figure 2. Heat map demonstrating the occurrence and variant frequency for SNPs detected in the human mitochondrial HVRII region (positions 51-320 bp relative to rCRS) for all sampling sites at three distinct times (A = October 2011; B = March 2012; C = July 2012). Variant frequency is defined as the number of reads having a SNP divided by the total reads in the sample. All variants with a frequency greater than 5% are reported. Variants 73G and 263G occurred at all sites with a frequency greater than 40%.
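The variant frequency definition and the 5% reporting threshold used in Figure 2 can be written out as a short Python sketch. The read counts below are made-up illustration values, not data from the study, and `variant_frequencies` is a hypothetical helper rather than part of CLC Genomics Workbench.

```python
def variant_frequencies(snp_read_counts, total_reads, threshold=0.05):
    """Return {variant: frequency} for variants above the reporting threshold.

    Frequency = reads carrying the SNP / total reads in the sample.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    freqs = {snp: n / total_reads for snp, n in snp_read_counts.items()}
    # Keep only variants whose frequency exceeds the threshold (5% by default).
    return {snp: f for snp, f in freqs.items() if f > threshold}


# Illustrative counts only (not from the study).
counts = {"73G": 450, "263G": 900, "150T": 30}
reported = variant_frequencies(counts, total_reads=1000)
```

With these example counts, the hypothetical variant 150T (3% of reads) falls below the threshold and is dropped, while 73G and 263G are retained.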
Figure 3. Dendrogram (left) from the HCA of SNP frequency data for the study sites (right) obtained from the period of Oct 2011-Jul 2012. The site-specific datasets are grouped into three clusters. Cluster 1 (orange background) is formed by 3C, 3B, 3A; cluster 2 (purple background) is formed by 10A, 10C, 9A, 9B, 8B, 9C, 10B, 8C, 8A; and cluster 3 (grey background) is formed by 1C, 1B, 1A. The study sites are marked as green circles in the map while the red circles represent CSO locations.
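The kind of clustering summarized in Figure 3 can be sketched in plain Python as agglomerative clustering on Euclidean distances between SNP frequency profiles. The frequency vectors below are invented for illustration, and the average-linkage rule is an assumption; the linkage method actually used in the study is not stated in this section.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hca(profiles, n_clusters):
    """Agglomerative clustering (average linkage is an assumption here)."""
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        dists = [euclidean(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Merge the closest pair of clusters; combinations guarantees i < j.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Invented SNP frequency profiles (three hypothetical SNPs per dataset).
profiles = {
    "1A": [0.9, 0.1, 0.0], "1B": [0.8, 0.2, 0.0],
    "3A": [0.1, 0.9, 0.1], "3B": [0.2, 0.8, 0.1],
    "8A": [0.1, 0.1, 0.9], "9A": [0.0, 0.2, 0.8],
}
groups = hca(profiles, n_clusters=3)
```

In this toy input, datasets with similar profiles end up in the same group, which mirrors how datasets from the same site cluster together in the dendrogram.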
Figure 4. Bar charts showing the haplogroup distribution of Set A and B for all sequences longer than 300 bp derived from Ion Torrent sequencing of HVRII amplicons. The sequences were annotated using MITOMASTER version Beta 1, which performs variant calling relative to the rCRS, haplotyping based on Phylotree, and variant annotation based on Mitomap. The relative composition of the haplogroups varied considerably across the different sampling sites for both sets. However, haplogroup H was the most abundant at all sites, followed by haplogroup L.
524
Figure 5. Pie charts demonstrating the population racial diversity in Duck Creek Watershed
525
obtained through (a) annotation of HVRII sequences (October 2011) into haplogroups, and (b)
526
2010 population census data (by race). (c) Comparison of site-specific distribution of population
527
according to races obtained through HVRII annotation and census data 2010 respectively.
528
Census data was obtained for the Cincinnati neighborhood approximations of Duck Creek
529
Watershed region which included Linwood, Oakley, Madisonville and Pleasant Ridge census
ACS Paragon Plus Environment
Environmental Science & Technology
530
tracts. Haplogroups were divided into races according to mitochondrial databases Phylotree
531
(White = H, HV, J, K, P, S, T, U, V; African American = L0, L1, L2, L3, L4, L5, L6; American
532
Indian = A, C, X; Asian = B, D, E, F, G, R, W; and Others).
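The haplogroup grouping listed in the Figure 5 legend can be expressed as a small lookup, sketched here in Python. The grouping table is copied from the legend; the prefix-matching rule in `classify` and the example haplogroup labels are assumptions made for illustration (the study itself relied on MITOMASTER and Phylotree for haplotyping).

```python
from collections import Counter

# Grouping taken from the Figure 5 legend; anything else falls under "Others".
GROUPS = {
    "White": ["H", "HV", "J", "K", "P", "S", "T", "U", "V"],
    "African American": ["L0", "L1", "L2", "L3", "L4", "L5", "L6"],
    "American Indian": ["A", "C", "X"],
    "Asian": ["B", "D", "E", "F", "G", "R", "W"],
}
PREFIX_TO_GROUP = {hg: grp for grp, hgs in GROUPS.items() for hg in hgs}


def classify(haplogroup):
    """Map a haplogroup label to a legend category (assumed prefix matching)."""
    # Try the two-character prefix first (e.g. "L2", "HV"), then one character.
    for length in (2, 1):
        group = PREFIX_TO_GROUP.get(haplogroup[:length])
        if group:
            return group
    return "Others"


def composition(haplogroups):
    """Percentage of sequences assigned to each category."""
    counts = Counter(classify(hg) for hg in haplogroups)
    total = sum(counts.values())
    return {grp: 100.0 * n / total for grp, n in counts.items()}


# Hypothetical haplogroup calls for four sequences.
comp = composition(["H1a", "L2a1", "M7", "HV0"])
```

Percentages computed this way per site are what would be set against the census proportions in panel (c).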
[Figure 1]
[Figure 2]
[Figure 3]
[Figure 4. Panels A and B: stacked bar charts; y-axis: Percent sequences (0-100%); x-axis: Site 1, Site 3, Site 8, Site 9, Site 10; legend: Others, HV, R, U, N, K, J, T, M, L, H]
[Figure 5]
[TOC Graphic]