Subscriber access provided by Olson Library | Northern Michigan University
Article
Predicting the ecological quality status of marine environments from eDNA metabarcoding data using supervised machine learning Tristan Cordier, Philippe Esling, Franck Lejzerowicz, Joana Visco, Amine Ouadahi, Catarina Martins, Tomas Cedhagen, and Jan Pawlowski Environ. Sci. Technol., Just Accepted Manuscript • DOI: 10.1021/acs.est.7b01518 • Publication Date (Web): 30 Jun 2017 Downloaded from http://pubs.acs.org on July 1, 2017
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
Environmental Science & Technology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 33
Environmental Science & Technology
1 1
Predicting the ecological quality status of marine environments from eDNA
2
metabarcoding data using supervised machine learning
3
Tristan Cordier*1, Philippe Esling2, Franck Lejzerowicz1, Joana Visco3, Amine Ouadahi1,
4
Catarina Martins4, Tomas Cedhagen5, Jan Pawlowski1,3
5 6
1
7
Switzerland
8
2
9
3
10
4
11
5
12
1, DK-8000 Aarhus, Denmark
Department of Genetics and Evolution, University of Geneva, Boulevard d’Yvoy 4, CH 1205 Geneva,
IRCAM, UMR 9912, Université Pierre et Marie Curie, 4 place Jussieu, 75005 Paris, France
ID-Gene ecodiagnostics, Ltd, chemin des Aulx 14, 1228 Plan-les-Ouates, Switzerland Marine Harvest ASA, Sandviksboder 77AB, Bergen, 5035 Bergen, Norway Department of Bioscience, Section of Aquatic Biology, University of Aarhus, Building 1135, Ole Worms allé
13 14
*Corresponding author:
[email protected] 15 16 17 18 19 20 21 22 23 24 25
ACS Paragon Plus Environment
Environmental Science & Technology
Page 2 of 33
2 26 27
Abstract
28 29
Monitoring biodiversity is essential to assess the impacts of increasing anthropogenic
30
activities in marine environments. Traditionally, marine biomonitoring involves the sorting
31
and morphological identification of benthic macro-invertebrates, which is time-consuming
32
and taxonomic-expertise demanding. High-throughput amplicon sequencing of environmental
33
DNA (eDNA metabarcoding) represents a promising alternative for benthic monitoring.
34
However, an important fraction of eDNA sequences remains unassigned or belong to taxa of
35
unknown ecology, which prevent their use for assessing the ecological quality status. Here,
36
we show that supervised machine learning (SML) can be used to build robust predictive
37
models for benthic monitoring, regardless of the taxonomic assignment of eDNA sequences.
38
We tested three SML approaches to assess the environmental impact of marine aquaculture
39
using benthic foraminifera eDNA, a group of unicellular eukaryotes known to be good
40
bioindicators, as features to infer macro-invertebrates based biotic indices. We found similar
41
ecological status as obtained from macro-invertebrates inventories. We argue that SML
42
approaches could overcome and even bypass the cost and time-demanding morpho-taxonomic
43
approaches in future biomonitoring.
44 45
Key-words: Biomonitoring, biotic indices, environmental DNA, supervised machine learning,
46
predictive models, foraminifera.
47 48 49 50
ACS Paragon Plus Environment
Page 3 of 33
Environmental Science & Technology
3 51 52
Introduction
53
Human activities are impacting the marine ecosystem functioning through climate change1,
54
environmental pollution2, or human industry driven eutrophication3, and these pressures are
55
likely to increase with the projected demographic expansion. Such impacts are traditionally
56
monitored by surveying biodiversity, usually focusing on benthic macrofauna4,5. Biotic
57
Indices (BI) have been formalized in order to combine taxonomy and alpha-diversity
58
measures (e.g. species richness, Shannon index) to reduce the data dimensions into single
59
continuous values that are used to ascribe samples to an environmental quality status. Such
60
indices include for instance the AZTI Marine Biotic Index6 (AMBI) or more specific
61
Indicator Species Index7 (ISI), and the Norwegian Sensitivity and Quality Indices8 (NSI and
62
NQI1). The formulas of these indices include taxon-specific ecological weights or categories
63
of tolerance to disturbance defined from empirical and experimental data6. Indeed, all indices
64
currently applied in benthic monitoring are based on the sorting and identification of macro-
65
invertebrate specimens, which is extremely time consuming and taxonomic-expertise
66
demanding. The need for faster and cost-effective ways to conduct biodiversity surveys is of
67
prime importance to allow an effective management of marine resources.
68 69
High-throughput amplicon sequencing and the molecular identification of multiple species in
70
environmental DNA (eDNA metabarcoding) offer a fast and cost-effective way to describe
71
biological communities9. It gives the opportunity to overcome the limitations of morpho-
72
taxonomy based biomonitoring10-12. Recently, several attempts have been made to use the
73
eDNA metabarcoding for biomonitoring freshwater13-18 and marine ecosystems19-23.
74
Comparisons of BI inferred from eDNA metabarcoding and morphological data showed that
75
similar values could be obtained in the case of freshwater diatoms16,15 and marine
ACS Paragon Plus Environment
Environmental Science & Technology
Page 4 of 33
4 76
invertebrates24-25. However, all these studies were dependent of reference sequence databases
77
for taxonomic assignment, to retrieve the ecological weight associated with the assigned taxa
78
(but see 18). This strongly limited the number of sequences included in the analysis, because
79
a large proportion of sequences remained unassigned, or because they were assigned to taxa
80
of unknown ecology20,24,26.
81 82
To overcome these limitations, we propose to use supervised machine learning (SML) for the
83
development of predictive models to infer BI values from eDNA metabarcoding data without
84
relying on taxonomic assignments. In SML approach, the predictive models are based on the
85
knowledge extracted from a training dataset, which typically consist of a set of features and
86
associated labels (classification) or continuous values (regression). The aim of SML is to fit
87
the training data to some function that can be used to predict a label or a continuous value to
88
new input data27 (e.g. new samples). In the recent years, there has been a growing interest to
89
investigate the usefulness of SML for ecological28-30, genetic31, or microbiome analyses27,31-34.
90
However, up to our knowledge, only one study has successfully used SML to predict
91
pollution levels as well as the values of 26 geochemical features based on a training dataset
92
composed of bacterial 16S eDNA metabarcoding data35.
93 94
In this study, we investigated the possibility of using SML to build predictive models for the
95
inference of four biotic indices commonly used for the benthic monitoring of the fish farming
96
industry. We tested this approach using benthic foraminifera, a group of ubiquitous
97
unicellular eukaryotes that include sensitive bioindicators of pollution in marine
98
environments36-39. Foraminifera are responsive to organic enrichment associated with fish
99
farming activities, as shown by both morphological40-43 and eDNA investigations22,23,44. This
100
makes foraminifera particularly appropriate for SML approach, because of their small size,
ACS Paragon Plus Environment
Page 5 of 33
Environmental Science & Technology
5 101
making them readily captured in eDNA samples, and because of the lack of well-defined
102
indicator morphospecies.
103 104
Material and methods
105
Sampling
106
A total of 144 sediment samples were collected in June and October 2015 at 24 stations
107
distributed at the vicinity of five salmon farms in Norway (Table S1). Four farms were
108
sampled at a rate of five stations and the remaining farm at only four stations because of
109
pebbles at the fifth station. At each station, two van Veen grabs (1000 cm2 model 12.211, KC-
110
Denmark) were deployed. From each of the 48 grabs, three samples of c.a. 5 ml were
111
collected from the two first centimeters of the surface sediment using sterile spoons,
112
transferred into 6 ml of LifeGuard Soil Preservation Solution (MO BIO, Laboratories Inc.)
113
and frozen at -20°C until DNA extraction. The rest of the sediment in each sub-sampled grab
114
was sieved through 1mm mesh and fixed in formalin for morpho-taxonomic inventory of
115
macro-invertebrates. Macrofaunal data were used to calculate for each grab four BIs (AMBI,
116
ISI, NSI and NQI1) routinely used in benthic monitoring surveys in Norway.
117 118
DNA extraction, PCR amplification and sequencing
119
The 144 frozen sediments samples were thawed on ice, centrifuged at 2500 rpm for 5 minutes
120
and the overlying solution was discarded. The total eDNA content was extracted using the
121
PowerMax® Soil DNA Isolation Kit (MO BIO Laboratories Inc.) following manufacturer
122
instructions. The 37F hypervariable region of the SSU rRNA gene, commonly used as
123
foraminiferal DNA barcode45 was PCR amplified using the forward primer s14F1 (5' -
124
AAGGGCACCACAAGAACGC - 3') and the reverse primer s15 (5' -
125
CCACCTATCACAYAATCATG - 3'), generating amplicons of about 200-250 bp as
ACS Paragon Plus Environment
Environmental Science & Technology
Page 6 of 33
6 126
described previously22. The primers were tagged with 8-nt sequences appended at their 5’
127
ends in order to multiplex samples prior to sequencing library preparation. The tag sequences
128
have been designed with a pairwise minimum edit distance of 2 and an edit distance of 8 to
129
the corresponding positions of the conserved foraminiferal SSU rDNA sequence. We assigned
130
the tag primer combinations to samples following a Latin Square Design (Table S2) in order
131
to reduce mistagging events46. The PCR products were quantified by high-resolution capillary
132
electrophoresis (QIAxcel System, Qiagen) and pooled in equimolar concentration. The pool
133
was purified using the High Pure PCR Product Purification Kit (Roche), quantified using a
134
fluorometric method (QuBit HS dsDNA kit, Invitrogen) and used for library preparation using
135
the TruSeq® DNA PCR-Free Library Prep Kit (Illumina). After quantification using the
136
KAPA Library Quantification Kits (KAPA BIOSYSTEMS), the library was sequenced on an
137
Illumina MiSeq System using MiSeq Reagent Kit v2 and a standard 14-tiles flow cell for
138
2*251 cycles. The raw dataset is publicly available at the Sequence Read Archive under
139
BioProject PRJNA376130.
140 141
Bioinformatics
142
The paired-end raw reads were quality filtered and assembled into full-length sequences with
143
a pipeline written in C language for the fast processing of Illumina metabarcoding data
144
(https://github.com/esling/illumina-pipeline). Filtering and assembly parameters are reported
145
in Table S3. The pipeline included the demultiplexing of each sequence into its sample of
146
origin by matching the combination of 8-nt tags present at the 5’-end of each sequence read.
147
The tagged primers were trimmed from the sequences as well as the foraminifera-specific
148
conserved region of c.a. 70 nucleotides based on the detection of its GACAG motif after 60
149
positions. Every sequence lacking this motif was discarded. The dataset was then filtered for
150
potential chimeras using UCHIME47 version 4.2.40 implemented in the
ACS Paragon Plus Environment
Page 7 of 33
Environmental Science & Technology
7 151
identify_chimera_seq.py function of the Qiime48 1.9.1 toolkit. We used the default parameters
152
of the function, but the --split_by_sampleid option was used in order to restrict the de novo
153
search by sample (i.e. by PCR). The filtered dataset was then clustered into Operational
154
Taxonomic Units (OTUs) using swarm49 2.1.8 with the default resolution (d=1) and the
155
fastidious option. This clustering also included homologous 37F sequences generated by
156
previous fish farm sequencing surveys and obtained using identical bioinformatics processing,
157
including sequences from Scotland22, from New-Zealand23, from Norway44, as well as
158
unpublished data. The present dataset was augmented with these sequences in order to
159
increase the likelihood of the fastidious option implemented in swarm to form OTUs
160
supported by the multiple occurrence of sequences specific to the fish farm environment. The
161
representative sequences, i.e. the most abundant Individual Sequence Unit (ISU) of each
162
OTU, were used as input of the assign_taxonomy.py function of Qiime with default parameter
163
for taxonomic assignment (uclust method), using a curated foraminifera database
164
(http://forambarcoding.unige.ch). This database contains 1175 foraminiferal SSU rDNA
165
sequences representing 125 morphospecies spread into 35 families, and including as well 106
166
environmental sequences for which morphospecies are not been yet identified but that are
167
commonly found in eDNA samples. OTU-representative sequences that could not be assigned
168
were compared by BLAST50 search against GenBank. The OTU-to-sample matrix was
169
generated from the result of the clustering including all fish farms sequences with
170
make_otu_table.py function of Qiime and the data corresponding to the samples analyzed in
171
the present study was extracted from this table into the statistical R environment51 for
172
downstream statistical analysis.
173
ACS Paragon Plus Environment
Environmental Science & Technology
Page 8 of 33
8 174
Statistics
175
The relationship between diversity and distance from the cage as well as compositional
176
variation of the studied communities were investigated based on normalized versions of the
177
OTU-to-sample matrix. Because uneven sequencing depth across samples can introduce
178
biases in the statistical analysis, samples with less than 10,000 reads were discarded. To
179
investigate correlations between diversity and distance from the cage, 100 rarefied datasets
180
were generated by aiming at 10,000 reads per sample using the rrarefy function of the R
181
vegan v2.4-1 package52. Several alpha-diversity metrics, based on all OTUs (including
182
unassigned ones) and their abundance, were then computed for each rarefied dataset and
183
averaged values were used to fit non-linear polynomial models using the lm function in R.
184
These diversity metrics included the OTU richness, the Shannon diversity53, the Pielou
185
evenness54, the SN diversity used in the calculation of NQI18, the expected OTU richness
186
(ES100 and ES50 for respectively 100 and 50 reads) and the Chao richness estimator55. To
187
investigate compositional variation, the read counts of the OTU-to-sample matrix were
188
normalized according to the cumulative-sum scaling method (CSS) using the metagenomeSeq
189
v1.16.0 R package56 (see 57 and 58). The following beta-diversity analyses were performed
190
using functions of the R vegan package. The Bray-Curtis dissimilarity index56 was calculated
191
using vegdist and the resulting pairwise dissimilarity matrix served for non-metric multi-
192
dimensional scaling (NMDS) analysis using metaMDS with default settings. The values of the
193
BIs obtained from the morpho-taxonomic data and of distance to cages were fit to the NMDS
194
ordination using envfit, and ordisurf, respectively.
195 196
Biotic indices values were predicted from the foraminiferal eDNA metabarcoding data using
197
three supervised approaches. For reference, either the ecological weights associated with the
198
taxa of the morpho-taxonomic inventories (correlation screening approach) or the BI values
ACS Paragon Plus Environment
Page 9 of 33
Environmental Science & Technology
9 199
calculated from these inventories (supervised machine learning approaches) were considered.
200
The BI values were predicted independently for the samples of each farm (testing datasets)
201
based on the samples of the other four farms (training datasets) used for supervision in the
202
three approaches. For the correlation screening approach, each OTU was compared to each
203
morpho-taxon across the samples of the training set. The reads associated with each OTU in
204
the PCR replicates of each grab were summed in order to measure Spearman rank correlations
205
in terms of relative abundance across grabs. The ecological weight of the taxon that has the
206
highest rho correlation above 0.7 was associated to the OTU. Biotic indices were then
207
calculated for the testing dataset using OTUs that were assigned ecological weights.
208 209
For the supervised machine learning approaches, either diversity metrics (diversity learning)
210
or OTUs composition (composition learning) were used as features. Models were built for
211
each training dataset, and BI values were predicted for the samples composing each
212
corresponding testing dataset. For diversity learning, the predicted BI values were averaged
213
for each sample over the 100 rarefied datasets. For composition learning, the effect of keeping
214
rare OTUs was investigated by comparing the results obtained when the OTUs with less than
215
10 reads or less than 100 reads across the full dataset were discarded before CSS
216
normalization. Since only one reference BI value was available for each grab, the BI values
217
predicted using both learning methods were averaged for each grab and the standard
218
deviations were computed. For both supervised diversity and composition learning approach,
219
we compared two different algorithms to predict BI values. We compared the Random Forest
220
(RF) algorithm60 implemented in the ranger v.0.6.0 R package61 for multithreading and the
221
Self-Organizing Map (SOM) algorithm implemented in the Kohonen v2.0.19 R package62.
222
For the RF algorithm, we generated 300 trees and used the default “mtry” parameter for
223
regression task, which is 1/3 of the features randomly picked to split the tree at each node,
ACS Paragon Plus Environment
Environmental Science & Technology
Page 10 of 33
10 224
which usually give the best results63. RF models were measuring the importance of features in
225
the prediction of BI values. For the SOM algorithm, the network was trained with 100
226
iterations with the default learning rate (linear decline from 0.05 to 0.01 over the 100
227
iterations). The tuning of the parameters (xdim, ydim, xweight, topo) were randomly searched
228
over 100 parameters combinations (100 hyperparameter search) with the train function of the
229
caret v6.0-73 R package64, passing the “random” search option and a 10-fold cross validation
230
repeated 10 times to the trainControl function. For each training dataset (four farms), the set
231
of parameters giving the lowest average Root Mean Squared Error (RMSE) on hold-out
232
samples during cross-validation were used to build the model that predict BI values for the
233
testing dataset (the remaining farm).
234 235
To compare the accuracy of the supervised approaches, the relationships between the
236
reference and predicted BI values were modeled using the lm function in R. These BI values
237
were then converted into discrete ecological quality status, after averaging per grab in the case
238
of the predicted values. Their agreement was tested using the kappa2 function of the irr v0.84
239
R package65, with squared weight because the ecological status values are ordered from “very
240
poor” to “very good”. Agreement between the two classifications was considered as “poor
241
agreement”, e.g. Kappa value ranging from 0.01 to 0.2 to “almost perfect agreement”, e.g.
242
Kappa value ranging from 0.8 to 166. For each BI, the best model was the one associated with
243
the highest Kappa value. Its accuracy was investigated at the scale of the farm by testing for
244
difference between predicted and reference values using the Mann-Whitney test with the
245
wilcox.test in R.
246
ACS Paragon Plus Environment
Page 11 of 33
Environmental Science & Technology
11 247
Results
248
Macro-invertebrate morpho-taxonomic inventories and biotic indices
249
The morpho-taxonomic inventories of 75,431 macro-invertebrate specimens comprised 432
250
morphospecies, of which 357 have been ascribed to an ecological category in at least one of
251
the BIs (Table S4). All BIs showed that the five farms were impacted, with highest values
252
(AMBI) and lowest values (ISI, NSI and NQI1) within 200 meters from the fish cages (Figure
253
S1). The impact decreased quickly with increasing distance from the cages. According to the
254
results of non-linear models, the effect of the distance from the cage on the values of every BI
255
was highly significant (Table S5, Figure S1).
256 257
Foraminifera eDNA metabarcoding data
258
PCR products were obtained for 123 out of 144 samples and sequencing yielded 11,257,700
259
paired-end reads. The quality filtering (i.e. base call quality and contig assembly) discarded
260
24.9 % of the reads (Table S3) and de novo chimera filtering discarded a further 449
261
sequences. The resulting 8,449,933 sequences were clustered along with 45,588,115
262
homologous sequences generated in previous studies and 69,716 OTUs were delineated. An
263
OTU-to-sample matrix was formed from the 9,235 OTUs that occurred in at least one of the
264
123 samples corresponding to the five studied fish farms. The sequencing depth of the
265
samples was uneven. It ranged from 42 to 283,431 reads, with an average of 68,698 reads per
266
sample. Seven samples represented by less than 10,000 reads were discarded. The final OTU-
267
to-sample matrix was constituted of 116 sediment samples as rows, covering the 48 sediment
268
grabs from which the macro-invertebrates were inventoried, 9,170 OTUs as columns, and was
269
filled with the 8,414,122 reads (Table S6). Two additional OTU-to-sample matrices were then
270
composed from the OTUs represented by more than 10 reads and by more than 100 reads, and
271
contained 3,648 and 1,579 columns, respectively.
ACS Paragon Plus Environment
Environmental Science & Technology
Page 12 of 33
12 272 273
Non-linear models yielded significant correlations between alpha-diversity metrics and the
274
distance from the cage, except Pielou’s evenness (Figure S2). The NMDS analysis of the
275
beta-diversity matrix showed that the distance from the cage is strongly structuring
276
foraminifera communities, and that the reference BI values inferred from the macro-
277
invertebrate data were strongly correlated with the ordination (Figure S3).
278 279
Taxonomic composition
280
Among the 9,170 OTUs, 2,378 (representing 70 % of the reads) were assigned to a given
281
taxonomic rank based on the consensus of three maximum hits with a minimum of 90 %
282
similarity and coverage on reference sequences of our curated foraminifera SSU rDNA
283
database. Among these assigned OTUs, the majority belong to orders Rotaliida and
284
Textulariida and class “Monothalamea” (Table S7). Sixty-two OTUs (less than 0.1 % of the
285
reads) were assigned to one of the remaining orders (Miliolida, Globerinida, Spirillinida and
286
Robertinida) and 190 OTUs (4.3 % of the reads) matched uncultured foraminifera sequences
287
of GenBank with more than 90 % of similarity and coverage. The remaining 6,602 OTUs
288
(25.7 % of the reads) could not be assigned to any reference sequence in foraminiferal
289
database and matched no GenBank sequence. The five most abundant taxa include 3 rotaliids:
290
Bulimina marginata (37 OTUs, representing 10.2 % of the reads), Stainforthia fusiformis (21
291
OTUs, representing 7.7 % of the reads) and Cibicidoides lobatulus (39 OTUs, representing 7
292
% of the reads), a monothalamid: Bathysiphon argenteus (18 OTUs, representing 7.4 % of the
293
reads) and a textularid genus: Reophax sp. (105 OTUs, representing 5.7 % of the reads).
294
ACS Paragon Plus Environment
Page 13 of 33
Environmental Science & Technology
13 295
Predictions of BI values from eDNA metabarcoding data
296
The three supervised approaches yielded accurate BI values predictions (Table 1). Linear
297
models and Kappa tests showed significance for the correlation screening approach to predict
298
the BI values from ecological weights assignments. Yet, R2 and Kappa statistics were higher
299
for the diversity and composition learning approaches than for the correlation screening
300
approach, as measured for ISI, NSI and NQI1 (Table 1, Figure 1). For AMBI, the correlation
301
screening approach performed better than the diversity learning using SOM algorithm and the
302
composition learning using the RF algorithm.
303 304
The diversity learning approach using the RF algorithm yielded predictions that had the
305
highest Kappa for NSI and NQI1, with 33 and 37 correctly classified grabs and with 15 and
306
11 grabs classified within 1 category mismatch, respectively. The Kappa statistics were 0.907
307
for NSI and 0.88 for NQI1, indicating an almost perfect agreement between molecular and
308
morpho-taxonomic data. Rarefying the OTU-to-sample dataset had a small effect on the per-
309
sample variation of the inferred values for both NSI (Figure S4) and NQI1 (Figure S5). The
310
most important diversity metrics (as measured by the RF algorithm) to infer NSI and NQI1
311
were Chao and OTU richness, and to a lesser extent SN (Figure S6, S7).
312 313
The composition learning approach using the SOM algorithm yielded predictions that had the
314
highest Kappa for AMBI and ISI, with respectively 33 and 25 correctly classified grabs, 13
315
and 21 classified within 1 category mismatch, and 2 misclassified for both BIs. The Kappa
316
statistics were 0.711 for AMBI and 0.774 for ISI, indicating a substantial agreement between
317
molecular and morpho-taxonomic data. The RF algorithm measurements of features
318
importance showed that the OTUs that were most important for inferring AMBI and ISI were
319
not necessarily the most abundant in terms of reads counts (Figure S8 and S9). The most
ACS Paragon Plus Environment
Environmental Science & Technology
Page 14 of 33
14 320
useful OTU to infer AMBI values was unassigned and represented 0.7% of the reads (55,063
321
reads) and that to infer ISI values was an unidentified Monothalamid representing 0.2% of the
322
reads (19,123 reads). Discarding rare OTUs represented by less than 10 or 100 reads did not
323
significantly change the accuracy of BIs predictions using composition learning (Table S8).
324 325
Predicted values obtained with the best predictive models for each of the four BIs were
326
neither over nor under estimated (Figure 1). Grabs were evenly distributed around the correct
327
classification (Figure 1B). The median and the mean of the differences between the predicted
328
and reference BI values were close to zero and with moderate variances, which means that our
329
predictive models were not biased (Figure 1C). Mann-Whitney tests of difference between the
330
predicted and reference BI values were not significant for all fish farm and BI combinations,
331
which means that predictive models were accurately predicting BI values at the scale of the
332
fish farm (Figure S10).
333 334
Discussion
335
Our study demonstrates the usefulness of supervised machine learning (SML) approaches to
336
infer biotic indices from eDNA metabarcoding data. To our knowledge, this is the first time
337
that the SML approaches have been applied to eukaryotes eDNA-based biomonitoring
338
surveys (but see 35 for a bacteria-based survey using SML). Until now, only those groups of
339
eukaryotes for which DNA barcodes are available in reference sequence database and for
340
which an autecological value is known could be taken into account in eDNA-based
341
biomonitoring studies24,25. Provided that an appropriate training dataset is available, SML
342
approaches offer a workaround to the dependency on reference databases and thus allow
343
extending the range of potential genetic bioindicators to other taxonomic groups, especially to
344
the small-sized inconspicuous taxa, which typically dominate the eDNA samples24. With the
ACS Paragon Plus Environment
Page 15 of 33
Environmental Science & Technology
15 345
notable exception of diatoms67, most of these small-sized taxa remain poorly described in
346
terms of autecology (but see 68), which make them useless for computing biotic indices.
347
Using SML approaches, there is no need for prior knowledge on the ecological signal
348
conveyed by OTUs, because these signals are inferred during the statistical modeling.
349 350
Our results showed that accurate predictive models for BIs inference could be obtained with
351
diversity metrics or composition data derived from eDNA metabarcoding data as features in
352
SML approaches. These models led to similar bioassessments as the ones obtained using
353
traditional morphology-based macrofaunal surveys. The possibility of using diversity metrics
354
or composition data for making BIs predictions on new samples may give a practical
355
flexibility for biomonitoring surveys. On the one hand, using the diversity metrics in SML
356
reduces the dimensions of the data and therefore the computation time, and allows the
357
prediction of BIs for samples coming from various geographical regions, where the
358
taxonomic composition may be different. On the other hand, using composition data to
359
predict BIs seems more appropriate if a significant proportion of OTUs are present in both the
360
training and the testing dataset. Composition data should capture species interactions69,70 and
361
give more importance to key taxa consistently responding to specific environmental
362
variation71,72, which could yield better results for atypical samples or statistical outliers.
363 364
While analyzing eDNA metabarcoding data, various biological and technical issues need to
365
be considered for the interpretation of alpha-diversity and beta-diversity. The presence of
366
extracellular DNA necessarily blurs the ecological signal conveyed in eDNA datasets as
367
living, dead and inactive cells cannot be readily distinguished73,74. In our study, samples were
368
collected from surface sediments, where it is reasonable to expect that most of the DNA come
369
from living or recently dead organisms. Intra-genomic rRNA sequence variation75 and rRNA
ACS Paragon Plus Environment
Environmental Science & Technology
Page 16 of 33
16 370
gene copy-number variation76 brings qualitative and quantitative bias in metabarcoding data.
371
From a technical point of view, PCR, sequencing77 and mistagging errors46 could add further
372
noise in the data. However, this noise can be assumed to be somewhat constant across
373
samples, and therefore disentangled from the ecological signal by SML algorithms.
374 375
The occurrence of OTUs splitting, i.e. morphospecies represented by several OTUs, could
376
also be seen as a problem for the interpretation of eDNA data, because the biological meaning
377
of OTUs is unclear78-80 and because it increases the dimensionality of the data (see below).
378
However, there is also some advantages related to the high number of OTUs. For instance, it
379
has been shown that bacterial sequences diverging by a single nucleotide can display a
380
different abundance profile across environmentally distinct samples81, pointing out that sub-
381
species units can be ecologically informative. In our study, several foraminiferal
382
morphospecies were represented by multiple OTUs. The detailed analysis of these OTUs in
383
the case of the two dominant morphospecies: Bulimina marginata and Cibicidoides lobatulus,
384
showed that in the first case the read abundance profiles of the two most abundant OTUs were
385
highly similar across samples (spearman rho = 0,91, p < 0.001, table S9), while in the second
386
case the two most abundant OTUs displayed different profiles across samples (spearman rho
387
= 0,09, p = 0.32, table S9). This suggests that these OTUs could represent different
388
populations of the same species that exhibit different ecological preference or that they belong
389
to cryptic species complexes that our reference sequence database did not capture. The
390
temptation to merge OTUs that are assigned to the same morphospecies, in an effort of
391
matching the metabarcoding data with the observable morphospecies data, is questionable for
392
two reasons. First, it means grouping sequences based on our current morpho-taxonomic
393
knowledge, which is reflected in reference sequence databases, although morphospecies, from
394
which DNA barcodes are generated, are not necessarily well defined. Second, ecologically
ACS Paragon Plus Environment
Page 17 of 33
Environmental Science & Technology
17 395
distinct sub-species units could be grouped together, losing the ecological signal conveyed by
396
distinct OTUs into a single heterogeneous OTU. Because this signal is used for BI
397
calculation, we think that SML approaches applied to biomonitoring would be more accurate
398
in the case of OTUs splitting than in the case of OTUs merging. Furthermore, we think that
399
SML approaches would as well benefit from the automation of a workflow that is entirety
400
independent from reference taxonomic database, to be reproducible over time.
401 402
Building predictive models from eDNA composition data using SML approaches hold
403
challenges too. Metabarcoding data are usually characterized by a much higher number of
404
OTUs than samples8, and this can be further increased by the OTUs splitting, like in the
405
present study. In SML, this data property makes the predictive models prone to be affected by
406
“the curse of dimensionality”31. Indeed, the probability to observe, by chance, a perfect
407
correlation between the relative abundance of one OTU and the variation of a BI increases
408
with the number of OTUs. This could lead to the overfitting of the model, thus decreasing its
409
accuracy. Finding a trade-off between the OTU granularity (dimensionality) and the
410
separation of ecologically relevant units appears as a major challenge for SML approaches
411
applied to metabarcoding data. Although our dataset was highly dimensional, our predictive
412
models were likely not overfitted for two reasons. First, the overfitting is controlled through
413
the growing of a “forest” of regression trees, that are built on a random subset of the data in
414
the RF algorithm60,30 and we used a 10-fold cross-validation step for model selection using the
415
SOM algorithm, a common procedure to prevent such problem27,83. Second, we showed that
416
our models still accurately predicted BI values without 80 % of the OTUs removed based in
417
their low abundance, supporting that our models built on the full dataset do not tend to overfit
418
due to the presence of numerous rare OTUs.
419
ACS Paragon Plus Environment
Environmental Science & Technology
Page 18 of 33
18 420
Future efforts should investigate whether accurate SML predictions can be realized in broad
421
routine applications. Our results were achieved with a relatively small training dataset and
422
increasing the number of sampled farms to cover a wider sampling area will likely further
423
refine our predictive models. Given the ability to multiplex hundreds of samples and to
424
perform analyses rapidly, cost-effective biomonitoring solutions could be proposed in matters
425
of days instead of months84. Hence, an eDNA-based early warning system adapted for the
426
pro-active management of marine industrial activities could be envisioned, as it has been
427
already proposed for others ecosystems and type of bioindicators85,86.
428 429
Acknowledgements
430
This study was supported by the Swiss Network for International Studies, by the Swiss
431
National Science Foundation (grants 316030_150817 and 31003A_159709) and by the
432
Norwegian Seafood Research Fund (FHF project number 901092). We thank Vegard Aambø
433
Langvatn and Dagfinn Breivik Skomsø from Fiske Liv and Jeremiah Peder Ness from Aqua-
434
kompetanse for help in collecting samples. We also thank Are Andreassen Moe and Arne
435
Kvalvik from Marine Harvest for the organization of the sampling campaign. We also thank
436
the four anonymous reviewers for their helpful comments.
437 438
Supporting Information
439
Table S1: Sampling site coordinates and distribution of collected samples
440
Tables S2: Tag to sample reference table following a Latin Square Design and tagged primers
441
sequences for successful PCR amplified samples
442
Table S3: Quality filtering and assembly parameters
443
Table S5: Results of non-linear models testing the effect of farming site, the cages and their
444
interaction on four biotic indexes values
ACS Paragon Plus Environment
Page 19 of 33
Environmental Science & Technology
19 445
Table S6: OTU-to-sample matrix with taxonomic assignments
446
Table S7: Taxonomic composition of foraminiferal communities
447
Table S8: Accuracy of BI values predictions from composition data, as a function of
448
abundance filtering on the OTU-to-sample matrix.
449
Table S9: Pair-wise spearman correlation matrix of OTUs represented by more than 1000
450
reads and assigned to the same species.
451 452
Figure S1: Non-linear relationship between BI reference values and the distance to the cages
453
Figure S2: Correlation between alpha-diversity metrics and the distance from the cages
454
Figure S3: NMDS analysis of the Bray-Curtis beta-diversity matrix calculated from the CSS
455
normalized foraminifera eDNA dataset.
456
Figure S4: Effect of rarefying the OTU-to-sample matrix on the variation of inferred NSI
457
values per sample.
458
Figure S5: Effect of rarefying the OTU-to-sample dataset on the variation of inferred NQI1
459
values per sample.
460
Figure S6: Diversity metrics importance in the prediction of the NSI index from the Random
461
Forest models
462
Figure S7: Diversity metrics importance in the prediction of the NQI1 index from the Random
463
Forest models
464
Figure S8: OTU importance in the prediction of the AMBI index from the Random Forest
465
model
466
Figure S9: OTU importance in the prediction of the ISI index from the Random Forest model
467
Figure S10: Boxplot and Mann-Whitney tests for difference between predicted BI values with
468
the best predictive models and reference BI values for each combination of farm and BI
469
ACS Paragon Plus Environment
Environmental Science & Technology
Page 20 of 33
20 470
References
471 472
(1)
Doney, S. C.; Ruckelshaus, M.; Emmett Duffy, J.; Barry, J. P.; Chan, F.; English, C. A.;
473
Galindo, H. M.; Grebmeier, J. M.; Hollowed, A. B.; Knowlton, N.; et al. Climate Change
474
Impacts on Marine Ecosystems. Ann. Rev. Mar. Sci. 2012, 4 (1), 11–37.
475
(2)
Peterson, C. H.; Rice, S. D.; Short, J. W.; Esler, D.; Bodkin, J. L.; Ballachey, B. E.; Irons, D.
476
B. Long-Term Ecosystem Response to the Exxon Valdez Oil Spill. Science 2003, 302
477
(December), 2082–2086.
478
(3)
Smith, V. H.; Tilman, G. D.; Nekola, J. C. Eutrophication: Impacts of excess nutrient
479
inputs on freshwater, marine, and terrestrial ecosystems. In Environmental Pollution;
480
1999; Vol. 100, pp 179–196.
481
(4)
Borja, A.; Ranasinghe, A.; Weisberg, S. B. Assessing ecological integrity in marine
482
waters, using multiple indices and ecosystem components: Challenges for the future.
483
Mar. Pollut. Bull. 2009, 59 (1–3), 1–4.
484
(5)
Tavakoly Sany, S. B.; Hashim, R.; Rezayi, M.; Salleh, A.; Safari, O. A review of strategies
485
to monitor water and sediment quality for a sustainability assessment of marine
486
environment. Environ. Sci. Pollut. Res. Int. 2014, 21 (813), doi:10.1007/s11356-013-
487
2217-5.
488
(6)
Borja, A.; Franco, J.; Pérez, V. A Marine Biotic Index to Establish the Ecological Quality
489
of Soft-Bottom Benthos Within European Estuarine and Coastal Environments. Mar.
490
Pollut. Bull. 2000, 40 (12), 1100–1114.
491
(7)
waters of Norway; 2002.
492 493
Rygg, B. Indicator species index for assessing benthic ecological quality in marine
(8)
Rygg, B. Developing indices for quality status classification of marine soft-bottom
ACS Paragon Plus Environment
Page 21 of 33
Environmental Science & Technology
21 fauna in Norway.; 2006.
494 495
(9)
Taberlet, P.; Coissac, E.; Pompanon, F.; Brochmann, C.; Willerslev, E. Towards next-
496
generation biodiversity assessment using DNA metabarcoding. Mol. Ecol. 2012, 21 (8),
497
2045–2050.
498
(10)
Woodward, G.; Gray, C.; Baird, D. J. Biomonitoring for the 21st Century: new
499
perspectives in an age of globalisation and emerging environmental threats. Limnetica
500
2013, 32 (2), 159–174.
501
(11)
Aylagas, E.; Borja, Á.; Rodriguez-Ezpeleta, N. Environmental status assessment using
502
DNA metabarcoding: Towards a genetics based marine biotic index (gAMBI). PLoS ONE
503
9(3) e90529 2014, doi:10.1371/journal.pone.0090529.
504
(12)
Bohmann, K.; Evans, A.; Gilbert, M. T. P.; Carvalho, G. R.; Creer, S.; Knapp, M.; Yu, D.
505
W.; de Bruyn, M. Environmental DNA for wildlife biology and biodiversity monitoring.
506
Trends in Ecology and Evolution. 2014, pp 358–367.
507
(13)
Kermarrec, L.; Franc, A.; Rimet, F.; Chaumeil, P.; Humbert, J. F.; Bouchez, A. Next-
508
generation sequencing to inventory taxonomic diversity in eukaryotic communities: A
509
test for freshwater diatoms. Mol. Ecol. Resour. 2013, 13 (4), 607–619.
510
(14)
Kermarrec, L.; Franc, A.; Rimet, F.; Chaumeil, P.; Frigerio, J.-M.; Humbert, J.-F.;
511
Bouchez, A. A next-generation sequencing approach to river biomonitoring using
512
benthic diatoms. Freshw. Sci. 2014, 33 (1), 349–363.
513
(15)
Visco, J. A.; Apothéloz-Perret-Gentil, L.; Cordonier, A.; Esling, P.; Pillet, L.; Pawlowski, J.
514
Environmental Monitoring: Inferring the Diatom Index from Next-Generation
515
Sequencing Data. Environ. Sci. Technol. 2015, 49 (13), 7597–7605.
516 517
(16)
Zimmermann, J.; Abarca, N.; Enke, N.; Enk, N.; Skibbe, O.; Kusber, W. H.; Jahn, R. Taxonomic reference libraries for environmental barcoding: a best practice example
ACS Paragon Plus Environment
Environmental Science & Technology
Page 22 of 33
22 from diatom research. PLoS One 2014, 9 (9), e108793.
518 519
(17)
Zimmermann, J.; Glöckner, G.; Jahn, R.; Enke, N.; Gemeinholzer, B. Metabarcoding vs.
520
morphological identification to assess diatom diversity in environmental studies. Mol.
521
Ecol. Resour. 2015, 15 (3), 526–542.
522
(18)
Apothéloz-Perret-Gentil, L.; Cordonier, A.; Straub, F.; Iseli, J.; Esling, P.; Pawlowski, J.
523
Taxonomy-free molecular diatom index for high-throughput eDNA biomonitoring.
524
Mol. Ecol. Resour. 2017, in press. DOI: 10.1111/1755-0998.12668
525
(19)
Chariton, A. A.; Court, L. N.; Hartley, D. M.; Colloff, M. J.; Hardy, C. M. Ecological
526
assessment of estuarine sediments by pyrosequencing eukaryotic ribosomal DNA.
527
Front. Ecol. Environ. 2010, 8 (5), 233–238.
528
(20)
Chariton, A. A.; Stephenson, S.; Morgan, M. J.; Steven, A. D. L.; Colloff, M. J.; Court, L.
529
N.; Hardy, C. M. Metabarcoding of benthic eukaryote communities predicts the
530
ecological condition of estuaries. Environ. Pollut. 2015, 203, 165–174.
531
(21)
Bik, H. M.; Halanych, K. M.; Sharma, J.; Thomas, W. K. Dramatic shifts in benthic
532
microbial eukaryote communities following the deepwater horizon oil spill. PLoS One
533
2012, 7 (6).
534
(22)
Pawlowski, J.; Esling, P.; Lejzerowicz, F.; Cedhagen, T.; Wilding, T. A. Environmental
535
monitoring through protist next-generation sequencing metabarcoding: Assessing the
536
impact of fish farming on benthic foraminifera communities. Mol. Ecol. Resour. 2014,
537
14 (6), 1129–1140.
538
(23)
Pochon, X.; Wood, S. A.; Keeley, N. B.; Lejzerowicz, F.; Esling, P.; Drew, J.; Pawlowski, J.
539
Accurate assessment of the impact of salmon farming on benthic sediment
540
enrichment using foraminiferal metabarcoding. Marine Pollution Bulletin. 2015.
541
(24)
Lejzerowicz, F.; Esling, P.; Pillet, L. L.; Wilding, T. a.; Black, K. D.; Pawlowski, J. High-
ACS Paragon Plus Environment
Page 23 of 33
Environmental Science & Technology
23 542
throughput sequencing and morphology perform equally well for benthic monitoring
543
of marine ecosystems. Sci. Rep. 2015, 5 (April), 13932.
544
(25)
Aylagas, E.; Borja, Á.; Irigoien, X.; Rodríguez-Ezpeleta, N. Benchmarking DNA
545
Metabarcoding for Biodiversity-Based Monitoring and Assessment. Front. Mar. Sci.
546
2016, 3 (November), 96.
547
(26)
Lanzén, A.; Lekang, K.; Jonassen, I.; Thompson, E. M.; Troedsson, C. High-throughput
548
metabarcoding of eukaryotic diversity for environmental monitoring of offshore oil-
549
drilling activities. Mol. Ecol. 2016, 25 (17), 4392–4406.
550
(27)
FEMS Microbiology Reviews. 2011, pp 343–359.
551 552
(28)
Olden, J. D.; Lawler, J. J.; Poff, N. L. Machine learning methods without tears: a primer for ecologists. Q. Rev. Biol. 2008, 83 (2), 171–193.
553 554
Knights, D.; Costello, E. K.; Knight, R. Supervised classification of human microbiota.
(29)
Philibert, A.; Desprez-Loustau, M. L.; Fabre, B.; Frey, P.; Halkett, F.; Husson, C.; Lung-
555
Escarmant, B.; Marçais, B.; Robin, C.; Vacher, C.; et al. Predicting invasion success of
556
forest pathogenic fungi from species traits. J. Appl. Ecol. 2011, 48 (6), 1381–1390.
557
(30)
and their applications to ecological data. Ecol. Modell. 2012, 240, 113–122.
558 559
(31)
Libbrecht, M. W.; Noble, W. S. Machine learning applications in genetics and genomics. Nat Rev Genet 2015, 16 (6), 321–332.
560 561
Crisci, C.; Ghattas, B.; Perera, G. A review of supervised machine learning algorithms
(32)
Statnikov, A.; Henaff, M.; Narendra, V.; Konganti, K.; Li, Z.; Yang, L.; Pei, Z.; Blaser, M.
562
J.; Aliferis, C. F.; Alekseyenko, A. V. A comprehensive evaluation of multicategory
563
classification methods for microbiomic data. Microbiome 2013, 1 (1), 11.
564 565
(33)
Beck, D.; Foster, J. A. Machine learning techniques accurately classify microbial communities by bacterial vaginosis characteristics. PLoS One 2014, 9 (2).
ACS Paragon Plus Environment
Environmental Science & Technology
Page 24 of 33
24 566
(34)
Martínez-García, P. M.; López-Solanilla, E.; Ramos, C.; Rodríguez-Palenzuela, P.
567
Prediction of bacterial associations with plants using a supervised machine-learning
568
approach. Environmental Microbiology. 2016.
569
(35)
Smith, M. B.; Rocha, A. M.; Smillie, C. S.; Olesen, S. W.; Paradis, C.; Wu, L.; Campbell, J.
570
H.; Fortney, J. L.; Mehlhorn, T. L.; Lowe, K. A.; et al. Natural bacterial communities
571
serve as quantitative geochemical biosensors. MBio 2015, 6 (3), 1–13.
572
(36)
Retrospect, perspect and prospect. Environ. Int. 2006, 32 (2), 273–283.
573 574
Nigam, R.; Saraswat, R.; Panchang, R. Application of foraminifers in ecotoxicology:
(37)
Frontalini, F.; Coccioni, R. Benthic foraminifera as bioindicators of pollution: A review
575
of Italian research over the last three decades. Revue de Micropaleontologie. 2011, pp
576
115–127.
577
(38)
Schönfeld, J.; Alve, E.; Geslin, E.; Jorissen, F.; Korsun, S.; Spezzaferri, S.; Abramovich,
578
S.; Almogi-Labin, A.; du Chatelet, E. A.; Barras, C.; et al. The FOBIMO (FOraminiferal
579
BIo-MOnitoring) initiative-Towards a standardised protocol for soft-bottom benthic
580
foraminiferal monitoring studies. Mar. Micropaleontol. 2012, 94–95, 1–13.
581
(39)
Alve, E.; Korsun, S.; Schönfeld, J.; Dijkstra, N.; Golikova, E.; Hess, S.; Husum, K.; Panieri,
582
G. Foram-AMBI: A sensitivity index based on benthic foraminiferal faunas from North-
583
East Atlantic and Arctic fjords, continental shelves and slopes. Mar. Micropaleontol.
584
2016, 122, 1–12.
585
(40)
Scott, D. B.; Schafer, C. T.; Honig, C.; Younger, D. C. Temporal variations of benthic
586
foraminiferal assemblages under or near aquaculture operations; documentation of
587
impact history. J. Foraminifer. Res. 1995, 25 (3), 224–235.
588 589
(41)
Angel, D. L. Impact of a Net Cage Fish Farm on the Distribution of Benthic Foraminifera in the Northern Gulf of Eilat (Aqaba, Red Sea). J. Foraminifer. Res. 2000,
ACS Paragon Plus Environment
Page 25 of 33
Environmental Science & Technology
25 30 (1), 54–65.
590 591
(42)
Vidović, J.; Cosovic, V.; Jurasic, M.; Petricioli, D. Impact of fish farming on foraminiferal
592
community, Drvenik Veliki Island, Adriatic Sea, Croatia. Mar. Pollut. Bull. 2009, 58 (9),
593
1297–1309.
594
(43)
Vidović, J.; Dolenec, M.; Dolenec, T.; Karamarko, V.; Žvab Rožič, P. Benthic
595
foraminifera assemblages as elemental pollution bioindicator in marine sediments
596
around fish farm (Vrgada Island, Central Adriatic, Croatia). Mar. Pollut. Bull. 2014, 83
597
(1), 198–213.
598
(44)
Pawlowski, J.; Esling, P.; Lejzerowicz, F.; Cordier, T.; Visco, J. A.; Martins, C. I. M.;
599
Kvalvik, A.; Staven, K.; Cedhagen, T. Benthic monitoring of salmon farms in Norway
600
using foraminiferal metabarcoding. Aquac. Environ. Interact. 2016, 8, 371–386.
601
(45)
surveys of foraminifera: Preparing the future. Biological Bulletin. 2014, pp 93–106.
602 603
(46)
(47)
Edgar, R. C.; Haas, B. J.; Clemente, J. C.; Quince, C.; Knight, R. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 2011, 27 (16), 2194–2200.
606 607
Esling, P.; Lejzerowicz, F.; Pawlowski, J. Accurate multiplexing and filtering for highthroughput amplicon-sequencing. Nucleic Acids Res. 2015, 1–12.
604 605
Pawlowski, J.; Lejzerowicz, F.; Esling, P. Next-generation environmental diversity
(48)
Caporaso, J. G.; Kuczynski, J.; Stombaugh, J.; Bittinger, K.; Bushman, F. D.; Costello, E.
608
K.; Fierer, N.; Pena, A. G.; Goodrich, J. K.; Gordon, J. I.; et al. QIIME allows analysis of
609
high-throughput community sequencing data. Nat. Methods 2010, 7 (5), 335–336.
610
(49)
scalable and high-resolution amplicon clustering. PeerJ 2015, 3, e1420.
611 612 613
Mahé, F.; Rognes, T.; Quince, C.; De Vargas, C.; Dunthorn, M. Swarm v2: highly-
(50)
Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic Local Alignment Search Tool. J. Mol. Biol. 1990, No. 215, 403–410.
ACS Paragon Plus Environment
Environmental Science & Technology
Page 26 of 33
26 614
(51)
Comput. Vienna, Austria 2016, URL https://www.R-project.org/.
615 616
R Core Team. R: A language and environment for statistical computing. R Found. Stat.
(52)
Oksanen, J.; Blanchet, F. G.; Kindt, R.; Legendre, P.; Minchin, P. R.; O’Hara, R. B.;
617
Simpson, G. L.; Solymos, P.; Stevens, M. H. H.; Wagner, H. vegan: Community Ecology
618
Package. R package version 2.4-1. 2016.
619
(53)
Shannon, C. E.; Weaver, W. The mathematical theory of information; 1949; Vol. 97.
620
(54)
Pielou, E. C. The measurement of diversity in different types of biological collections. J. Theor. Biol. 1966, 13, 131–144.
621 622
(55)
there are unseen species in sample. Environ. Ecol. Stat. 2003, 10 (4), 429–443.
623 624
(56)
(57)
McMurdie, P. J.; Holmes, S. Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Comput. Biol. 2014, 10 (4).
627 628
Paulson, J. N.; Talukder, H.; Pop, M.; Bravo, H. C. Robust methods for differential abundance analysis in marker gene surveys. Nat. Methods 2013, 10 (12), 1200–1202.
625 626
Chao, A.; Shen, T. J. Nonparametric estimation of Shannon’s index of diversity when
(58)
Weiss, S. J.; Xu, Z.; Amir, A.; Peddada, S.; Bittinger, K.; Gonzalez, A.; Lozupone, C.;
629
Zaneveld, J. R.; Vazquez-Baeza, Y.; Birmingham, A.; et al. Effects of library size
630
variance, sparsity, and compositionality on the analysis of microbiome data. PeerJ
631
Prepr. 2015, 3, e1408. DOI: 10.7287/peerj.preprints.1157v1
632
(59)
Bray, J. R.; Curtis, J. T. An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecol. Monogr. 1957, 27 (4), 325–349.
633 634
(60)
Breiman, L. Random Forests. Mach. Learn. 2001, 45 (1), 5–32.
635
(61)
Wright, M. N.; Ziegler, A. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. arXiv.org 2015, stat.ML.
636 637
(62)
Wehrens, R.; Buydens, L. M. C. Self- and super-organizing maps in R: The kohonen
ACS Paragon Plus Environment
Page 27 of 33
Environmental Science & Technology
27 package. J. Stat. Softw. 2007, 21 (5), 1–19.
638 639
(63)
(December), 18–22.
640 641
Liaw, a; Wiener, M. Classification and Regression by randomForest. R news 2002, 2
(64)
Kuhn, M. caret: Classification and Regression Training. R package version 6.0-73.
642
Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan
643
Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael
644
Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and
645
Tyler Hunt. 2016.
646
(65)
and Agreement.
647 648
(66)
(67)
Stevenson, R. J.; Pan, Y.; VanDam, H. Assessing environmental conditions in rivers and streams with diatoms. Earth Sci. 2006, 57–85.
651 652
Landis, J. R.; Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 1977, 33 (1), 159–174.
649 650
Gamer, M.; Lemon, J.; Fellows, I.; Singh, P. Various Coefficients of Interrater Reliability
(68)
Grego, M.; De Troch, M.; Forte, J.; Malej, A. Main meiofauna taxa as an indicator for
653
assessing the spatial and seasonal impact of fish farming. Mar. Pollut. Bull. 2009, 58
654
(8), 1178–1186.
655
(69)
Lima-Mendez, G.; Faust, K.; Henry, N.; Decelle, J.; Colin, S.; Carcillo, F.; Chaffron, S.;
656
Ignacio-Espinosa, J. C.; Roux, S.; Vincent, F.; et al. Determinants of community
657
structure in the global plankton interactome. Science (80-. ). 2015, 348 (6237),
658
1262073_1-1262073_9.
659
(70)
Vacher, C.; Tamaddoni-Nezhad, A.; Kamenova, S.; Peyrard, N.; Moalic, Y.; Sabbadin, R.;
660
Schwaller, L.; Chiquet, J.; Smith, M. A.; Vallance, J.; et al. Learning Ecological Networks
661
from Next-Generation Sequencing Data. In Advances in Ecological Research; 2016;
ACS Paragon Plus Environment
Environmental Science & Technology
Page 28 of 33
28 Vol. 54, pp 1–39.
662 663
(71)
Fortunato, C. S.; Eiler, A.; Herfort, L.; Needoba, J. A.; Peterson, T. D.; Crump, B. C.; CS,
664
F.; Eiler, A.; Herfort, L.; JA, N.; et al. Determining indicator taxa across spatial and
665
seasonal gradients in the Columbia River coastal margin. Isme J 2013, 7 (10), 1899–
666
1911.
667
(72)
Dell’Anno, A.; Corinaldesi, C. Degradation and turnover of extracellular DNA in marine
668
sediments: Ecological and methodological considerations. Appl. Environ. Microbiol.
669
2004, 70 (7), 4384–4386.
670
(73)
Corinaldesi, C.; Danovaro, R.; Dell’Anno, A. Simultaneous recovery of extracellular and
671
intracellular DNA suitable for molecular studies from marine sediments. Appl. Environ.
672
Microbiol. 2005, 71 (1), 46–50.
673
(74)
Pillet, L.; Fontaine, D.; Pawlowski, J. Intra-genomic ribosomal RNA polymorphism and
674
morphological variation in elphidium macellum suggests inter-specific hybridization in
675
foraminifera. PLoS One 2012, 7(2) : e32, doi:10.1371/journal.pone.0032373.
676
(75)
Data: A Case Study of Foraminifera. PLoS One 2013, 8 (2).
677 678
Weber, A. A. T.; Pawlowski, J. Can Abundance of Protists Be Inferred from Sequence
(76)
Schirmer, M.; Ijaz, U. Z.; D’Amore, R.; Hall, N.; Sloan, W. T.; Quince, C. Insight into
679
biases and sequencing errors for amplicon sequencing with the Illumina MiSeq
680
platform. Nucleic Acids Res. 2015, 43 (6).
681
(77)
throughput amplicon-sequencing. Nucleic Acids Res. 2015, 43 (5), 2513–2524.
682 683
Esling, P.; Lejzerowicz, F.; Pawlowski, J. Accurate multiplexing and filtering for high-
(78)
Schloss, P. D.; Westcott, S. L. Assessing and improving methods used in OTU-based
684
approaches for 16S rRNA gene sequence analysis. Appl. Environ. Microbiol. 2011, 77
685
(10), 3219–3226.
ACS Paragon Plus Environment
Page 29 of 33
Environmental Science & Technology
29 686
(79)
Chen, W.; Zhang, C. K.; Cheng, Y.; Zhang, S.; Zhao, H. A Comparison of Methods for
687
Clustering 16S rRNA Sequences into OTUs. PLoS One 2013, 8(8) (PLoS ONE 8(8):
688
e70837.), doi:10.1371/journal.pone.0070837.
689
(80)
He, Y.; Caporaso, J. G.; Jiang, X.-T.; Sheng, H.-F.; Huse, S. M.; Rideout, J. R.; Edgar, R. C.;
690
Kopylova, E.; Walters, W. A.; Knight, R.; et al. Stability of operational taxonomic units:
691
an important but neglected property for analyzing microbial diversity. Microbiome
692
2015, 3 (1), 20.
693
(81)
Eren, A. M.; Maignien, L.; Sul, W. J.; Murphy, L. G.; Grim, S. L.; Morrison, H. G.; Sogin,
694
M. L. Oligotyping: Differentiating between closely related microbial taxa using 16S
695
rRNA gene data. Methods Ecol. Evol. 2013, 4 (12), 1111–1119.
696
(82)
fundamentals, tools, and challenges. Ann. Epidemiol. 2016, 26 (5), 330–335.
697 698
(83)
Domingos, P. A few useful things to know about machine learning. Commun. ACM 2012, 55 (10), 78.
699 700
Tsilimigras, M. C. B.; Fodor, A. A. Compositional data analysis of the microbiome:
(84)
Quinn, R. A.; Navas-molina, J. A.; Hyde, E. R.; Song, J.; Vázquez-baeza, Y.; Humphrey,
701
G.; Gaffney, J.; Minich, J. J.; Melnik, A. V; Herschend, J.; et al. From sample to multi-
702
omics conclusions in under 48 hours. 2016, 1 (2).
703
(85)
Wepener, V.; van Vuren, J. H. J.; Chatiza, F. P.; Mbizi, Z.; Slabbert, L.; Masola, B. Active
704
biomonitoring in freshwater environments: Early warning signals from biomarkers in
705
assessing biological effects of diffuse sources of pollutants. Phys. Chem. Earth 2005,
706
30 (11–16 SPEC. ISS.), 751–761.
707
(86)
Maradona, A.; Marshall, G.; Mehrvar, M.; Pushchak, R.; Laursen, A. E.; McCarthy, L. H.;
708
Bostan, V.; Gilbride, K. A. Utilization of multiple organisms in a proposed early-
709
warning biomonitoring system for real-time detection of contaminants: Preliminary
ACS Paragon Plus Environment
Environmental Science & Technology
Page 30 of 33
30 710
results and modeling. J. Hazard. Mater. 2012, 219–220, 95–102.
711
ACS Paragon Plus Environment
Page 31 of 33
Environmental Science & Technology
Tables and figures Table 1: Accuracy of biotic indices predictions obtained from correlation screening, diversity learning and composition learning, with the Random Forest and Self-Organizing Map algorithms. Best predictive models are in bold (i.e. Highest Kappa statistics). Biotic Index Approach Supervised learning algorithm R2 Kappa AMBI Correlation screening - 0.568*** 0.624*** Diversity learning Random Forest 0.641*** 0.69*** Self-Organizing Map 0.492*** 0.618*** Composition learning Random Forest 0.662*** 0.555*** Self-Organizing Map 0.669*** 0.711*** ISI Correlation screening - 0.65*** 0.53*** Diversity learning Random Forest 0.505*** 0.626*** Self-Organizing Map 0.449*** 0.61*** Composition learning Random Forest 0.56*** 0.631*** Self-Organizing Map 0.615*** 0.774*** NSI Correlation screening - 0.508*** 0.607*** Diversity learning Random Forest 0.83*** 0.907*** Self-Organizing Map 0.83*** 0.88*** Composition learning Random Forest 0.827*** 0.832*** Self-Organizing Map 0.794*** 0.871*** NQI1 Correlation screening - 0.76*** 0.8*** Diversity learning Random Forest 0.834*** 0.88*** Self-Organizing Map 0.805*** 0.846*** Composition learning Random Forest 0.81*** 0.856*** Self-Organizing Map 0.803*** 0.873*** *:P < 0.05; **:P < 0.01; ***:P < 0.001
ACS Paragon Plus Environment
Environmental Science & Technology
Figure 1: A) Correlation between the BI values computed from the morpho-taxonomic inventories (morphologic data) and the one predicted from supervised machine learning from the foraminifera SSU 37F data (molecular data). The plots are obtained from the predictive models giving the best accuracy (i.e. highest Kappa statistic, see Table 1). AMBI and ISI predictions were done using the SOM algorithm on composition data. NSI and NQI1 predictions were done using the RF algorithm on diversity metrics. Each dot and error bars represent the average and standard deviation of the predicted BI values for the sediment sample of a grab. B) Barplots indicate the amount (number over bars) and percentage (y axis) of correct classifications (0 on x axis) and misclassifications. C) Boxplots indicates the median and quartile values (black numbers) and the mean value (red number) of the difference between the predicted and the reference BI values.
A
AMBI prediction SOM algorithm on composition data 80
n=6
Molecular
8
n=1
1
6
2
0
1
2
Status mismatch
1.68
1.5
4
2
Percentage
60
10
Percentage
40
n=7 n=1
0
4 3
Molecular
Aukrasanden Beitveitnes Bjørnsvik Nedre Kvarv Storvika
n = 33
20
5
Aukrasanden Beitveitnes Bjørnsvik Nedre Kvarv Storvika
12
100
6
ISI prediction SOM algorithm on composition data
0.5
0.6 0.07
0.5
2
1
0.01
1
2
3
4
5
0
0
0
0.44
1.5
R =0.669*** Kappa =0.711***
R =0.615*** Kappa =0.774***
1.5
0
6
2
4
6
8
10
12
Morphology
Morphology
NQI1 prediction RF algorithm on diversity metrics
0
0.8
NSI prediction RF algorithm on diversity metrics
0
0
Kappa =0.88*** 0.0
0
Kappa =0.907*** 0
0.0
0.8
ISI prediction NSI prediction NQI1 prediction SOM algorithm on composition data RF algorithm on diversity metrics RF algorithm on diversity metrics NSI prediction AMBI prediction ISI prediction 100 60
80
80
1
40
0
1
0
1
2
0
20
2
0
3
0
Status mismatch
1.68 2
1.5
0
n = 10 n=1 n=1
Status mismatch
1.5
10
0.54 0
0.16
2
0.96
0.012
2.79 4
R =0.615*** Kappa =0.774*** 0.0
0.5
8
0.44
0.00
0.07
0
0
0.5
1.49
0.6 0.01
1.5
predicted — known value
n=6 n=1
1
C
6
Percentage
0.8
40
n=7 n=1
2
=0.669*** Kappa =0.711***
NQI1 prediction
n = 25
n = 11
0
80 60
n = 33
20 0
Percentage
% of samples
100
B
0.8
ACS Paragon Plus Environment
Page 32 of 33
Biotic index
New eDNA samples PageEnvironmental 33Macro-invertebrates of 33 Science & Technology Sieving
Foraminifera eDNA
eDNA extraction Sampling
ACS Paragon Plus Environment Data generation
Model training
Data collection and model building
Making predictions Biotic indice predictions