Predicting the Ecological Quality Status of Marine Environments

(19-23) Comparisons of BI inferred from eDNA metabarcoding and morphological ..... Finding a trade-off between the OTU granularity (dimensionality) an...
2 downloads 0 Views 4MB Size
Subscriber access provided by Olson Library | Northern Michigan University

Article

Predicting the ecological quality status of marine environments from eDNA metabarcoding data using supervised machine learning Tristan Cordier, Philippe Esling, Franck Lejzerowicz, Joana Visco, Amine Ouadahi, Catarina Martins, Tomas Cedhagen, and Jan Pawlowski Environ. Sci. Technol., Just Accepted Manuscript • DOI: 10.1021/acs.est.7b01518 • Publication Date (Web): 30 Jun 2017 Downloaded from http://pubs.acs.org on July 1, 2017

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Environmental Science & Technology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 33

Environmental Science & Technology

1 1

Predicting the ecological quality status of marine environments from eDNA

2

metabarcoding data using supervised machine learning

3

Tristan Cordier*1, Philippe Esling2, Franck Lejzerowicz1, Joana Visco3, Amine Ouadahi1,

4

Catarina Martins4, Tomas Cedhagen5, Jan Pawlowski1,3

5 6

1

7

Switzerland


8

2

9

3

10

4

11

5

12

1, DK-8000 Aarhus, Denmark

Department of Genetics and Evolution, University of Geneva, Boulevard d’Yvoy 4, CH 1205 Geneva,

IRCAM, UMR 9912, Université Pierre et Marie Curie, 4 place Jussieu, 75005 Paris, France
 ID-Gene ecodiagnostics, Ltd, chemin des Aulx 14, 1228 Plan-les-Ouates, Switzerland Marine Harvest ASA, Sandviksboder 77AB, Bergen, 5035 Bergen, Norway Department of Bioscience, Section of Aquatic Biology, University of Aarhus, Building 1135, Ole Worms allé

13 14

*Corresponding author: [email protected]

15 16 17 18 19 20 21 22 23 24 25

ACS Paragon Plus Environment

Environmental Science & Technology

Page 2 of 33

2 26 27

Abstract

28 29

Monitoring biodiversity is essential to assess the impacts of increasing anthropogenic

30

activities in marine environments. Traditionally, marine biomonitoring involves the sorting

31

and morphological identification of benthic macro-invertebrates, which is time-consuming

32

and taxonomic-expertise demanding. High-throughput amplicon sequencing of environmental

33

DNA (eDNA metabarcoding) represents a promising alternative for benthic monitoring.

34

However, an important fraction of eDNA sequences remains unassigned or belong to taxa of

35

unknown ecology, which prevent their use for assessing the ecological quality status. Here,

36

we show that supervised machine learning (SML) can be used to build robust predictive

37

models for benthic monitoring, regardless of the taxonomic assignment of eDNA sequences.

38

We tested three SML approaches to assess the environmental impact of marine aquaculture

39

using benthic foraminifera eDNA, a group of unicellular eukaryotes known to be good

40

bioindicators, as features to infer macro-invertebrates based biotic indices. We found similar

41

ecological status as obtained from macro-invertebrates inventories. We argue that SML

42

approaches could overcome and even bypass the cost and time-demanding morpho-taxonomic

43

approaches in future biomonitoring.

44 45

Key-words: Biomonitoring, biotic indices, environmental DNA, supervised machine learning,

46

predictive models, foraminifera.

47 48 49 50

ACS Paragon Plus Environment

Page 3 of 33

Environmental Science & Technology

3 51 52

Introduction

53

Human activities are impacting the marine ecosystem functioning through climate change1,

54

environmental pollution2, or human industry driven eutrophication3, and these pressures are

55

likely to increase with the projected demographic expansion. Such impacts are traditionally

56

monitored by surveying biodiversity, usually focusing on benthic macrofauna4,5. Biotic

57

Indices (BI) have been formalized in order to combine taxonomy and alpha-diversity

58

measures (e.g. species richness, Shannon index) to reduce the data dimensions into single

59

continuous values that are used to ascribe samples to an environmental quality status. Such

60

indices include for instance the AZTI Marine Biotic Index6 (AMBI) or more specific

61

Indicator Species Index7 (ISI), and the Norwegian Sensitivity and Quality Indices8 (NSI and

62

NQI1). The formulas of these indices include taxon-specific ecological weights or categories

63

of tolerance to disturbance defined from empirical and experimental data6. Indeed, all indices

64

currently applied in benthic monitoring are based on the sorting and identification of macro-

65

invertebrate specimens, which is extremely time consuming and taxonomic-expertise

66

demanding. The need for faster and cost-effective ways to conduct biodiversity surveys is of

67

prime importance to allow an effective management of marine resources.

68 69

High-throughput amplicon sequencing and the molecular identification of multiple species in

70

environmental DNA (eDNA metabarcoding) offer a fast and cost-effective way to describe

71

biological communities9. It gives the opportunity to overcome the limitations of morpho-

72

taxonomy based biomonitoring10-12. Recently, several attempts have been made to use the

73

eDNA metabarcoding for biomonitoring freshwater13-18 and marine ecosystems19-23.

74

Comparisons of BI inferred from eDNA metabarcoding and morphological data showed that

75

similar values could be obtained in the case of freshwater diatoms16,15 and marine

ACS Paragon Plus Environment

Environmental Science & Technology

Page 4 of 33

4 76

invertebrates24-25. However, all these studies were dependent of reference sequence databases

77

for taxonomic assignment, to retrieve the ecological weight associated with the assigned taxa

78

(but see 18). This strongly limited the number of sequences included in the analysis, because

79

a large proportion of sequences remained unassigned, or because they were assigned to taxa

80

of unknown ecology20,24,26.

81 82

To overcome these limitations, we propose to use supervised machine learning (SML) for the

83

development of predictive models to infer BI values from eDNA metabarcoding data without

84

relying on taxonomic assignments. In SML approach, the predictive models are based on the

85

knowledge extracted from a training dataset, which typically consist of a set of features and

86

associated labels (classification) or continuous values (regression). The aim of SML is to fit

87

the training data to some function that can be used to predict a label or a continuous value to

88

new input data27 (e.g. new samples). In the recent years, there has been a growing interest to

89

investigate the usefulness of SML for ecological28-30, genetic31, or microbiome analyses27,31-34.

90

However, up to our knowledge, only one study has successfully used SML to predict

91

pollution levels as well as the values of 26 geochemical features based on a training dataset

92

composed of bacterial 16S eDNA metabarcoding data35.

93 94

In this study, we investigated the possibility of using SML to build predictive models for the

95

inference of four biotic indices commonly used for the benthic monitoring of the fish farming

96

industry. We tested this approach using benthic foraminifera, a group of ubiquitous

97

unicellular eukaryotes that include sensitive bioindicators of pollution in marine

98

environments36-39. Foraminifera are responsive to organic enrichment associated with fish

99

farming activities, as shown by both morphological40-43 and eDNA investigations22,23,44. This

100

makes foraminifera particularly appropriate for SML approach, because of their small size,

ACS Paragon Plus Environment

Page 5 of 33

Environmental Science & Technology

5 101

making them readily captured in eDNA samples, and because of the lack of well-defined

102

indicator morphospecies.

103 104

Material and methods

105

Sampling

106

A total of 144 sediment samples were collected in June and October 2015 at 24 stations

107

distributed at the vicinity of five salmon farms in Norway (Table S1). Four farms were

108

sampled at a rate of five stations and the remaining farm at only four stations because of

109

pebbles at the fifth station. At each station, two van Veen grabs (1000 cm2 model 12.211, KC-

110

Denmark) were deployed. From each of the 48 grabs, three samples of c.a. 5 ml were

111

collected from the two first centimeters of the surface sediment using sterile spoons,

112

transferred into 6 ml of LifeGuard Soil Preservation Solution (MO BIO, Laboratories Inc.)

113

and frozen at -20°C until DNA extraction. The rest of the sediment in each sub-sampled grab

114

was sieved through 1mm mesh and fixed in formalin for morpho-taxonomic inventory of

115

macro-invertebrates. Macrofaunal data were used to calculate for each grab four BIs (AMBI,

116

ISI, NSI and NQI1) routinely used in benthic monitoring surveys in Norway.

117 118

DNA extraction, PCR amplification and sequencing

119

The 144 frozen sediments samples were thawed on ice, centrifuged at 2500 rpm for 5 minutes

120

and the overlying solution was discarded. The total eDNA content was extracted using the

121

PowerMax® Soil DNA Isolation Kit (MO BIO Laboratories Inc.) following manufacturer

122

instructions. The 37F hypervariable region of the SSU rRNA gene, commonly used as

123

foraminiferal DNA barcode45 was PCR amplified using the forward primer s14F1 (5' -

124

AAGGGCACCACAAGAACGC - 3') and the reverse primer s15 (5' -

125

CCACCTATCACAYAATCATG - 3'), generating amplicons of about 200-250 bp as

ACS Paragon Plus Environment

Environmental Science & Technology

Page 6 of 33

6 126

described previously22. The primers were tagged with 8-nt sequences appended at their 5’

127

ends in order to multiplex samples prior to sequencing library preparation. The tag sequences

128

have been designed with a pairwise minimum edit distance of 2 and an edit distance of 8 to

129

the corresponding positions of the conserved foraminiferal SSU rDNA sequence. We assigned

130

the tag primer combinations to samples following a Latin Square Design (Table S2) in order

131

to reduce mistagging events46. The PCR products were quantified by high-resolution capillary

132

electrophoresis (QIAxcel System, Qiagen) and pooled in equimolar concentration. The pool

133

was purified using the High Pure PCR Product Purification Kit (Roche), quantified using a

134

fluorometric method (QuBit HS dsDNA kit, Invitrogen) and used for library preparation using

135

the TruSeq® DNA PCR-Free Library Prep Kit (Illumina). After quantification using the

136

KAPA Library Quantification Kits (KAPA BIOSYSTEMS), the library was sequenced on an

137

Illumina MiSeq System using MiSeq Reagent Kit v2 and a standard 14-tiles flow cell for

138

2*251 cycles. The raw dataset is publicly available at the Sequence Read Archive under

139

BioProject PRJNA376130.

140 141

Bioinformatics

142

The paired-end raw reads were quality filtered and assembled into full-length sequences with

143

a pipeline written in C language for the fast processing of Illumina metabarcoding data

144

(https://github.com/esling/illumina-pipeline). Filtering and assembly parameters are reported

145

in Table S3. The pipeline included the demultiplexing of each sequence into its sample of

146

origin by matching the combination of 8-nt tags present at the 5’-end of each sequence read.

147

The tagged primers were trimmed from the sequences as well as the foraminifera-specific

148

conserved region of c.a. 70 nucleotides based on the detection of its GACAG motif after 60

149

positions. Every sequence lacking this motif was discarded. The dataset was then filtered for

150

potential chimeras using UCHIME47 version 4.2.40 implemented in the

ACS Paragon Plus Environment

Page 7 of 33

Environmental Science & Technology

7 151

identify_chimera_seq.py function of the Qiime48 1.9.1 toolkit. We used the default parameters

152

of the function, but the --split_by_sampleid option was used in order to restrict the de novo

153

search by sample (i.e. by PCR). The filtered dataset was then clustered into Operational

154

Taxonomic Units (OTUs) using swarm49 2.1.8 with the default resolution (d=1) and the

155

fastidious option. This clustering also included homologous 37F sequences generated by

156

previous fish farm sequencing surveys and obtained using identical bioinformatics processing,

157

including sequences from Scotland22, from New-Zealand23, from Norway44, as well as

158

unpublished data. The present dataset was augmented with these sequences in order to

159

increase the likelihood of the fastidious option implemented in swarm to form OTUs

160

supported by the multiple occurrence of sequences specific to the fish farm environment. The

161

representative sequences, i.e. the most abundant Individual Sequence Unit (ISU) of each

162

OTU, were used as input of the assign_taxonomy.py function of Qiime with default parameter

163

for taxonomic assignment (uclust method), using a curated foraminifera database

164

(http://forambarcoding.unige.ch). This database contains 1175 foraminiferal SSU rDNA

165

sequences representing 125 morphospecies spread into 35 families, and including as well 106

166

environmental sequences for which morphospecies are not been yet identified but that are

167

commonly found in eDNA samples. OTU-representative sequences that could not be assigned

168

were compared by BLAST50 search against GenBank. The OTU-to-sample matrix was

169

generated from the result of the clustering including all fish farms sequences with

170

make_otu_table.py function of Qiime and the data corresponding to the samples analyzed in

171

the present study was extracted from this table into the statistical R environment51 for

172

downstream statistical analysis.

173

ACS Paragon Plus Environment

Environmental Science & Technology

Page 8 of 33

8 174

Statistics

175

The relationship between diversity and distance from the cage as well as compositional

176

variation of the studied communities were investigated based on normalized versions of the

177

OTU-to-sample matrix. Because uneven sequencing depth across samples can introduce

178

biases in the statistical analysis, samples with less than 10,000 reads were discarded. To

179

investigate correlations between diversity and distance from the cage, 100 rarefied datasets

180

were generated by aiming at 10,000 reads per sample using the rrarefy function of the R

181

vegan v2.4-1 package52. Several alpha-diversity metrics, based on all OTUs (including

182

unassigned ones) and their abundance, were then computed for each rarefied dataset and

183

averaged values were used to fit non-linear polynomial models using the lm function in R.

184

These diversity metrics included the OTU richness, the Shannon diversity53, the Pielou

185

evenness54, the SN diversity used in the calculation of NQI18, the expected OTU richness

186

(ES100 and ES50 for respectively 100 and 50 reads) and the Chao richness estimator55. To

187

investigate compositional variation, the read counts of the OTU-to-sample matrix were

188

normalized according to the cumulative-sum scaling method (CSS) using the metagenomeSeq

189

v1.16.0 R package56 (see 57 and 58). The following beta-diversity analyses were performed

190

using functions of the R vegan package. The Bray-Curtis dissimilarity index56 was calculated

191

using vegdist and the resulting pairwise dissimilarity matrix served for non-metric multi-

192

dimensional scaling (NMDS) analysis using metaMDS with default settings. The values of the

193

BIs obtained from the morpho-taxonomic data and of distance to cages were fit to the NMDS

194

ordination using envfit, and ordisurf, respectively.

195 196

Biotic indices values were predicted from the foraminiferal eDNA metabarcoding data using

197

three supervised approaches. For reference, either the ecological weights associated with the

198

taxa of the morpho-taxonomic inventories (correlation screening approach) or the BI values

ACS Paragon Plus Environment

Page 9 of 33

Environmental Science & Technology

9 199

calculated from these inventories (supervised machine learning approaches) were considered.

200

The BI values were predicted independently for the samples of each farm (testing datasets)

201

based on the samples of the other four farms (training datasets) used for supervision in the

202

three approaches. For the correlation screening approach, each OTU was compared to each

203

morpho-taxon across the samples of the training set. The reads associated with each OTU in

204

the PCR replicates of each grab were summed in order to measure Spearman rank correlations

205

in terms of relative abundance across grabs. The ecological weight of the taxon that has the

206

highest rho correlation above 0.7 was associated to the OTU. Biotic indices were then

207

calculated for the testing dataset using OTUs that were assigned ecological weights.

208 209

For the supervised machine learning approaches, either diversity metrics (diversity learning)

210

or OTUs composition (composition learning) were used as features. Models were built for

211

each training dataset, and BI values were predicted for the samples composing each

212

corresponding testing dataset. For diversity learning, the predicted BI values were averaged

213

for each sample over the 100 rarefied datasets. For composition learning, the effect of keeping

214

rare OTUs was investigated by comparing the results obtained when the OTUs with less than

215

10 reads or less than 100 reads across the full dataset were discarded before CSS

216

normalization. Since only one reference BI value was available for each grab, the BI values

217

predicted using both learning methods were averaged for each grab and the standard

218

deviations were computed. For both supervised diversity and composition learning approach,

219

we compared two different algorithms to predict BI values. We compared the Random Forest

220

(RF) algorithm60 implemented in the ranger v.0.6.0 R package61 for multithreading and the

221

Self-Organizing Map (SOM) algorithm implemented in the Kohonen v2.0.19 R package62.

222

For the RF algorithm, we generated 300 trees and used the default “mtry” parameter for

223

regression task, which is 1/3 of the features randomly picked to split the tree at each node,

ACS Paragon Plus Environment

Environmental Science & Technology

Page 10 of 33

10 224

which usually give the best results63. RF models were measuring the importance of features in

225

the prediction of BI values. For the SOM algorithm, the network was trained with 100

226

iterations with the default learning rate (linear decline from 0.05 to 0.01 over the 100

227

iterations). The tuning of the parameters (xdim, ydim, xweight, topo) were randomly searched

228

over 100 parameters combinations (100 hyperparameter search) with the train function of the

229

caret v6.0-73 R package64, passing the “random” search option and a 10-fold cross validation

230

repeated 10 times to the trainControl function. For each training dataset (four farms), the set

231

of parameters giving the lowest average Root Mean Squared Error (RMSE) on hold-out

232

samples during cross-validation were used to build the model that predict BI values for the

233

testing dataset (the remaining farm).

234 235

To compare the accuracy of the supervised approaches, the relationships between the

236

reference and predicted BI values were modeled using the lm function in R. These BI values

237

were then converted into discrete ecological quality status, after averaging per grab in the case

238

of the predicted values. Their agreement was tested using the kappa2 function of the irr v0.84

239

R package65, with squared weight because the ecological status values are ordered from “very

240

poor” to “very good”. Agreement between the two classifications was considered as “poor

241

agreement”, e.g. Kappa value ranging from 0.01 to 0.2 to “almost perfect agreement”, e.g.

242

Kappa value ranging from 0.8 to 166. For each BI, the best model was the one associated with

243

the highest Kappa value. Its accuracy was investigated at the scale of the farm by testing for

244

difference between predicted and reference values using the Mann-Whitney test with the

245

wilcox.test in R.

246

ACS Paragon Plus Environment

Page 11 of 33

Environmental Science & Technology

11 247

Results

248

Macro-invertebrate morpho-taxonomic inventories and biotic indices

249

The morpho-taxonomic inventories of 75,431 macro-invertebrate specimens comprised 432

250

morphospecies, of which 357 have been ascribed to an ecological category in at least one of

251

the BIs (Table S4). All BIs showed that the five farms were impacted, with highest values

252

(AMBI) and lowest values (ISI, NSI and NQI1) within 200 meters from the fish cages (Figure

253

S1). The impact decreased quickly with increasing distance from the cages. According to the

254

results of non-linear models, the effect of the distance from the cage on the values of every BI

255

was highly significant (Table S5, Figure S1).

256 257

Foraminifera eDNA metabarcoding data

258

PCR products were obtained for 123 out of 144 samples and sequencing yielded 11,257,700

259

paired-end reads. The quality filtering (i.e. base call quality and contig assembly) discarded

260

24.9 % of the reads (Table S3) and de novo chimera filtering discarded a further 449

261

sequences. The resulting 8,449,933 sequences were clustered along with 45,588,115

262

homologous sequences generated in previous studies and 69,716 OTUs were delineated. An

263

OTU-to-sample matrix was formed from the 9,235 OTUs that occurred in at least one of the

264

123 samples corresponding to the five studied fish farms. The sequencing depth of the

265

samples was uneven. It ranged from 42 to 283,431 reads, with an average of 68,698 reads per

266

sample. Seven samples represented by less than 10,000 reads were discarded. The final OTU-

267

to-sample matrix was constituted of 116 sediment samples as rows, covering the 48 sediment

268

grabs from which the macro-invertebrates were inventoried, 9,170 OTUs as columns, and was

269

filled with the 8,414,122 reads (Table S6). Two additional OTU-to-sample matrices were then

270

composed from the OTUs represented by more than 10 reads and by more than 100 reads, and

271

contained 3,648 and 1,579 columns, respectively.

ACS Paragon Plus Environment

Environmental Science & Technology

Page 12 of 33

12 272 273

Non-linear models yielded significant correlations between alpha-diversity metrics and the

274

distance from the cage, except Pielou’s evenness (Figure S2). The NMDS analysis of the

275

beta-diversity matrix showed that the distance from the cage is strongly structuring

276

foraminifera communities, and that the reference BI values inferred from the macro-

277

invertebrate data were strongly correlated with the ordination (Figure S3).

278 279

Taxonomic composition

280

Among the 9,170 OTUs, 2,378 (representing 70 % of the reads) were assigned to a given

281

taxonomic rank based on the consensus of three maximum hits with a minimum of 90 %

282

similarity and coverage on reference sequences of our curated foraminifera SSU rDNA

283

database. Among these assigned OTUs, the majority belong to orders Rotaliida and

284

Textulariida and class “Monothalamea” (Table S7). Sixty-two OTUs (less than 0.1 % of the

285

reads) were assigned to one of the remaining orders (Miliolida, Globerinida, Spirillinida and

286

Robertinida) and 190 OTUs (4.3 % of the reads) matched uncultured foraminifera sequences

287

of GenBank with more than 90 % of similarity and coverage. The remaining 6,602 OTUs

288

(25.7 % of the reads) could not be assigned to any reference sequence in foraminiferal

289

database and matched no GenBank sequence. The five most abundant taxa include 3 rotaliids:

290

Bulimina marginata (37 OTUs, representing 10.2 % of the reads), Stainforthia fusiformis (21

291

OTUs, representing 7.7 % of the reads) and Cibicidoides lobatulus (39 OTUs, representing 7

292

% of the reads), a monothalamid: Bathysiphon argenteus (18 OTUs, representing 7.4 % of the

293

reads) and a textularid genus: Reophax sp. (105 OTUs, representing 5.7 % of the reads).

294

ACS Paragon Plus Environment

Page 13 of 33

Environmental Science & Technology

13 295

Predictions of BI values from eDNA metabarcoding data

296

The three supervised approaches yielded accurate BI values predictions (Table 1). Linear

297

models and Kappa tests showed significance for the correlation screening approach to predict

298

the BI values from ecological weights assignments. Yet, R2 and Kappa statistics were higher

299

for the diversity and composition learning approaches than for the correlation screening

300

approach, as measured for ISI, NSI and NQI1 (Table 1, Figure 1). For AMBI, the correlation

301

screening approach performed better than the diversity learning using SOM algorithm and the

302

composition learning using the RF algorithm.

303 304

The diversity learning approach using the RF algorithm yielded predictions that had the

305

highest Kappa for NSI and NQI1, with 33 and 37 correctly classified grabs and with 15 and

306

11 grabs classified within 1 category mismatch, respectively. The Kappa statistics were 0.907

307

for NSI and 0.88 for NQI1, indicating an almost perfect agreement between molecular and

308

morpho-taxonomic data. Rarefying the OTU-to-sample dataset had a small effect on the per-

309

sample variation of the inferred values for both NSI (Figure S4) and NQI1 (Figure S5). The

310

most important diversity metrics (as measured by the RF algorithm) to infer NSI and NQI1

311

were Chao and OTU richness, and to a lesser extent SN (Figure S6, S7).

312 313

The composition learning approach using the SOM algorithm yielded predictions that had the

314

highest Kappa for AMBI and ISI, with respectively 33 and 25 correctly classified grabs, 13

315

and 21 classified within 1 category mismatch, and 2 misclassified for both BIs. The Kappa

316

statistics were 0.711 for AMBI and 0.774 for ISI, indicating a substantial agreement between

317

molecular and morpho-taxonomic data. The RF algorithm measurements of features

318

importance showed that the OTUs that were most important for inferring AMBI and ISI were

319

not necessarily the most abundant in terms of reads counts (Figure S8 and S9). The most

ACS Paragon Plus Environment

Environmental Science & Technology

Page 14 of 33

14 320

useful OTU to infer AMBI values was unassigned and represented 0.7% of the reads (55,063

321

reads) and that to infer ISI values was an unidentified Monothalamid representing 0.2% of the

322

reads (19,123 reads). Discarding rare OTUs represented by less than 10 or 100 reads did not

323

significantly change the accuracy of BIs predictions using composition learning (Table S8).

324 325

Predicted values obtained with the best predictive models for each of the four BIs were

326

neither over nor under estimated (Figure 1). Grabs were evenly distributed around the correct

327

classification (Figure 1B). The median and the mean of the differences between the predicted

328

and reference BI values were close to zero and with moderate variances, which means that our

329

predictive models were not biased (Figure 1C). Mann-Whitney tests of difference between the

330

predicted and reference BI values were not significant for all fish farm and BI combinations,

331

which means that predictive models were accurately predicting BI values at the scale of the

332

fish farm (Figure S10).

333 334

Discussion

335

Our study demonstrates the usefulness of supervised machine learning (SML) approaches to

336

infer biotic indices from eDNA metabarcoding data. To our knowledge, this is the first time

337

that the SML approaches have been applied to eukaryotes eDNA-based biomonitoring

338

surveys (but see 35 for a bacteria-based survey using SML). Until now, only those groups of

339

eukaryotes for which DNA barcodes are available in reference sequence database and for

340

which an autecological value is known could be taken into account in eDNA-based

341

biomonitoring studies24,25. Provided that an appropriate training dataset is available, SML

342

approaches offer a workaround to the dependency on reference databases and thus allow

343

extending the range of potential genetic bioindicators to other taxonomic groups, especially to

344

the small-sized inconspicuous taxa, which typically dominate the eDNA samples24. With the

ACS Paragon Plus Environment

Page 15 of 33

Environmental Science & Technology

15 345

notable exception of diatoms67, most of these small-sized taxa remain poorly described in

346

terms of autecology (but see 68), which make them useless for computing biotic indices.

347

Using SML approaches, there is no need for prior knowledge on the ecological signal

348

conveyed by OTUs, because these signals are inferred during the statistical modeling.

349 350

Our results showed that accurate predictive models for BIs inference could be obtained with

351

diversity metrics or composition data derived from eDNA metabarcoding data as features in

352

SML approaches. These models led to similar bioassessments as the ones obtained using

353

traditional morphology-based macrofaunal surveys. The possibility of using diversity metrics

354

or composition data for making BIs predictions on new samples may give a practical

355

flexibility for biomonitoring surveys. On the one hand, using the diversity metrics in SML

356

reduces the dimensions of the data and therefore the computation time, and allows the

357

prediction of BIs for samples coming from various geographical regions, where the

358

taxonomic composition may be different. On the other hand, using composition data to

359

predict BIs seems more appropriate if a significant proportion of OTUs are present in both the

360

training and the testing dataset. Composition data should capture species interactions69,70 and

361

give more importance to key taxa consistently responding to specific environmental

362

variation71,72, which could yield better results for atypical samples or statistical outliers.

363 364

While analyzing eDNA metabarcoding data, various biological and technical issues need to

365

be considered for the interpretation of alpha-diversity and beta-diversity. The presence of

366

extracellular DNA necessarily blurs the ecological signal conveyed in eDNA datasets as

367

living, dead and inactive cells cannot be readily distinguished73,74. In our study, samples were

368

collected from surface sediments, where it is reasonable to expect that most of the DNA come

369

from living or recently dead organisms. Intra-genomic rRNA sequence variation75 and rRNA

ACS Paragon Plus Environment

Environmental Science & Technology

Page 16 of 33

16 370

gene copy-number variation76 brings qualitative and quantitative bias in metabarcoding data.

371

From a technical point of view, PCR, sequencing77 and mistagging errors46 could add further

372

noise in the data. However, this noise can be assumed to be somewhat constant across

373

samples, and therefore disentangled from the ecological signal by SML algorithms.

374 375

The occurrence of OTUs splitting, i.e. morphospecies represented by several OTUs, could

376

also be seen as a problem for the interpretation of eDNA data, because the biological meaning

377

of OTUs is unclear78-80 and because it increases the dimensionality of the data (see below).

378

However, there is also some advantages related to the high number of OTUs. For instance, it

379

has been shown that bacterial sequences diverging by a single nucleotide can display a

380

different abundance profile across environmentally distinct samples81, pointing out that sub-

381

species units can be ecologically informative. In our study, several foraminiferal

382

morphospecies were represented by multiple OTUs. The detailed analysis of these OTUs in

383

the case of the two dominant morphospecies: Bulimina marginata and Cibicidoides lobatulus,

384

showed that in the first case the read abundance profiles of the two most abundant OTUs were

385

highly similar across samples (spearman rho = 0,91, p < 0.001, table S9), while in the second

386

case the two most abundant OTUs displayed different profiles across samples (spearman rho

387

= 0,09, p = 0.32, table S9). This suggests that these OTUs could represent different

388

populations of the same species that exhibit different ecological preference or that they belong

389

to cryptic species complexes that our reference sequence database did not capture. The

390

temptation to merge OTUs that are assigned to the same morphospecies, in an effort of

391

matching the metabarcoding data with the observable morphospecies data, is questionable for

392

two reasons. First, it means grouping sequences based on our current morpho-taxonomic

393

knowledge, which is reflected in reference sequence databases, although morphospecies, from

394

which DNA barcodes are generated, are not necessarily well defined. Second, ecologically

ACS Paragon Plus Environment

Page 17 of 33

Environmental Science & Technology

17 395

distinct sub-species units could be grouped together, losing the ecological signal conveyed by

396

distinct OTUs into a single heterogeneous OTU. Because this signal is used for BI

397

calculation, we think that SML approaches applied to biomonitoring would be more accurate

398

in the case of OTUs splitting than in the case of OTUs merging. Furthermore, we think that

399

SML approaches would as well benefit from the automation of a workflow that is entirety

400

independent from reference taxonomic database, to be reproducible over time.

401 402

Building predictive models from eDNA composition data using SML approaches hold

403

challenges too. Metabarcoding data are usually characterized by a much higher number of

404

OTUs than samples8, and this can be further increased by the OTUs splitting, like in the

405

present study. In SML, this data property makes the predictive models prone to be affected by

406

“the curse of dimensionality”31. Indeed, the probability to observe, by chance, a perfect

407

correlation between the relative abundance of one OTU and the variation of a BI increases

408

with the number of OTUs. This could lead to the overfitting of the model, thus decreasing its

409

accuracy. Finding a trade-off between the OTU granularity (dimensionality) and the

410

separation of ecologically relevant units appears as a major challenge for SML approaches

411

applied to metabarcoding data. Although our dataset was highly dimensional, our predictive

412

models were likely not overfitted for two reasons. First, the overfitting is controlled through

413

the growing of a “forest” of regression trees, that are built on a random subset of the data in

414

the RF algorithm60,30 and we used a 10-fold cross-validation step for model selection using the

415

SOM algorithm, a common procedure to prevent such problem27,83. Second, we showed that

416

our models still accurately predicted BI values without 80 % of the OTUs removed based in

417

their low abundance, supporting that our models built on the full dataset do not tend to overfit

418

due to the presence of numerous rare OTUs.

419

ACS Paragon Plus Environment

Environmental Science & Technology

Page 18 of 33

18 420

Future efforts should investigate whether accurate SML predictions can be realized in broad

421

routine applications. Our results were achieved with a relatively small training dataset and

422

increasing the number of sampled farms to cover a wider sampling area will likely further

423

refine our predictive models. Given the ability to multiplex hundreds of samples and to

424

perform analyses rapidly, cost-effective biomonitoring solutions could be proposed in matters

425

of days instead of months84. Hence, an eDNA-based early warning system adapted for the

426

pro-active management of marine industrial activities could be envisioned, as it has been

427

already proposed for others ecosystems and type of bioindicators85,86.

428 429

Acknowledgements

430

This study was supported by the Swiss Network for International Studies, by the Swiss

431

National Science Foundation (grants 316030_150817 and 31003A_159709) and by the

432

Norwegian Seafood Research Fund (FHF project number 901092). We thank Vegard Aambø

433

Langvatn and Dagfinn Breivik Skomsø from Fiske Liv and Jeremiah Peder Ness from Aqua-

434

kompetanse for help in collecting samples. We also thank Are Andreassen Moe and Arne

435

Kvalvik from Marine Harvest for the organization of the sampling campaign. We also thank

436

the four anonymous reviewers for their helpful comments.

437 438

Supporting Information

439

Table S1: Sampling site coordinates and distribution of collected samples

440

Tables S2: Tag to sample reference table following a Latin Square Design and tagged primers

441

sequences for successful PCR amplified samples

442

Table S3: Quality filtering and assembly parameters

443

Table S5: Results of non-linear models testing the effect of farming site, the cages and their

444

interaction on four biotic indexes values

ACS Paragon Plus Environment

Page 19 of 33

Environmental Science & Technology

19 445

Table S6: OTU-to-sample matrix with taxonomic assignments

446

Table S7: Taxonomic composition of foraminiferal communities

447

Table S8: Accuracy of BI values predictions from composition data, as a function of

448

abundance filtering on the OTU-to-sample matrix.

449

Table S9: Pair-wise spearman correlation matrix of OTUs represented by more than 1000

450

reads and assigned to the same species.

451 452

Figure S1: Non-linear relationship between BI reference values and the distance to the cages

453

Figure S2: Correlation between alpha-diversity metrics and the distance from the cages

454

Figure S3: NMDS analysis of the Bray-Curtis beta-diversity matrix calculated from the CSS

455

normalized foraminifera eDNA dataset.

456

Figure S4: Effect of rarefying the OTU-to-sample matrix on the variation of inferred NSI

457

values per sample.

458

Figure S5: Effect of rarefying the OTU-to-sample dataset on the variation of inferred NQI1

459

values per sample.

460

Figure S6: Diversity metrics importance in the prediction of the NSI index from the Random

461

Forest models

462

Figure S7: Diversity metrics importance in the prediction of the NQI1 index from the Random

463

Forest models

464

Figure S8: OTU importance in the prediction of the AMBI index from the Random Forest

465

model

466

Figure S9: OTU importance in the prediction of the ISI index from the Random Forest model

467

Figure S10: Boxplot and Mann-Whitney tests for difference between predicted BI values with

468

the best predictive models and reference BI values for each combination of farm and BI

469

ACS Paragon Plus Environment

Environmental Science & Technology

Page 20 of 33

20 470

References

471 472

(1)

Doney, S. C.; Ruckelshaus, M.; Emmett Duffy, J.; Barry, J. P.; Chan, F.; English, C. A.;

473

Galindo, H. M.; Grebmeier, J. M.; Hollowed, A. B.; Knowlton, N.; et al. Climate Change

474

Impacts on Marine Ecosystems. Ann. Rev. Mar. Sci. 2012, 4 (1), 11–37.

475

(2)

Peterson, C. H.; Rice, S. D.; Short, J. W.; Esler, D.; Bodkin, J. L.; Ballachey, B. E.; Irons, D.

476

B. Long-Term Ecosystem Response to the Exxon Valdez Oil Spill. Science 2003, 302

477

(December), 2082–2086.

478

(3)

Smith, V. H.; Tilman, G. D.; Nekola, J. C. Eutrophication: Impacts of excess nutrient

479

inputs on freshwater, marine, and terrestrial ecosystems. In Environmental Pollution;

480

1999; Vol. 100, pp 179–196.

481

(4)

Borja, A.; Ranasinghe, A.; Weisberg, S. B. Assessing ecological integrity in marine

482

waters, using multiple indices and ecosystem components: Challenges for the future.

483

Mar. Pollut. Bull. 2009, 59 (1–3), 1–4.

484

(5)

Tavakoly Sany, S. B.; Hashim, R.; Rezayi, M.; Salleh, A.; Safari, O. A review of strategies

485

to monitor water and sediment quality for a sustainability assessment of marine

486

environment. Environ. Sci. Pollut. Res. Int. 2014, 21 (813), doi:10.1007/s11356-013-

487

2217-5.

488

(6)

Borja, A.; Franco, J.; Pérez, V. A Marine Biotic Index to Establish the Ecological Quality

489

of Soft-Bottom Benthos Within European Estuarine and Coastal Environments. Mar.

490

Pollut. Bull. 2000, 40 (12), 1100–1114.

491

(7)

waters of Norway; 2002.

492 493

Rygg, B. Indicator species index for assessing benthic ecological quality in marine

(8)

Rygg, B. Developing indices for quality status classification of marine soft-bottom

ACS Paragon Plus Environment

Page 21 of 33

Environmental Science & Technology

21 fauna in Norway.; 2006.

494 495

(9)

Taberlet, P.; Coissac, E.; Pompanon, F.; Brochmann, C.; Willerslev, E. Towards next-

496

generation biodiversity assessment using DNA metabarcoding. Mol. Ecol. 2012, 21 (8),

497

2045–2050.

498

(10)

Woodward, G.; Gray, C.; Baird, D. J. Biomonitoring for the 21st Century: new

499

perspectives in an age of globalisation and emerging environmental threats. Limnetica

500

2013, 32 (2), 159–174.

501

(11)

Aylagas, E.; Borja, Á.; Rodriguez-Ezpeleta, N. Environmental status assessment using

502

DNA metabarcoding: Towards a genetics based marine biotic index (gAMBI). PLoS ONE

503

9(3) e90529 2014, doi:10.1371/journal.pone.0090529.

504

(12)

Bohmann, K.; Evans, A.; Gilbert, M. T. P.; Carvalho, G. R.; Creer, S.; Knapp, M.; Yu, D.

505

W.; de Bruyn, M. Environmental DNA for wildlife biology and biodiversity monitoring.

506

Trends in Ecology and Evolution. 2014, pp 358–367.

507

(13)

Kermarrec, L.; Franc, A.; Rimet, F.; Chaumeil, P.; Humbert, J. F.; Bouchez, A. Next-

508

generation sequencing to inventory taxonomic diversity in eukaryotic communities: A

509

test for freshwater diatoms. Mol. Ecol. Resour. 2013, 13 (4), 607–619.

510

(14)

Kermarrec, L.; Franc, A.; Rimet, F.; Chaumeil, P.; Frigerio, J.-M.; Humbert, J.-F.;

511

Bouchez, A. A next-generation sequencing approach to river biomonitoring using

512

benthic diatoms. Freshw. Sci. 2014, 33 (1), 349–363.

513

(15)

Visco, J. A.; Apothéloz-Perret-Gentil, L.; Cordonier, A.; Esling, P.; Pillet, L.; Pawlowski, J.

514

Environmental Monitoring: Inferring the Diatom Index from Next-Generation

515

Sequencing Data. Environ. Sci. Technol. 2015, 49 (13), 7597–7605.

516 517

(16)

Zimmermann, J.; Abarca, N.; Enke, N.; Enk, N.; Skibbe, O.; Kusber, W. H.; Jahn, R. Taxonomic reference libraries for environmental barcoding: a best practice example

ACS Paragon Plus Environment

Environmental Science & Technology

Page 22 of 33

22 from diatom research. PLoS One 2014, 9 (9), e108793.

518 519

(17)

Zimmermann, J.; Glöckner, G.; Jahn, R.; Enke, N.; Gemeinholzer, B. Metabarcoding vs.

520

morphological identification to assess diatom diversity in environmental studies. Mol.

521

Ecol. Resour. 2015, 15 (3), 526–542.

522

(18)

Apothéloz-Perret-Gentil, L.; Cordonier, A.; Straub, F.; Iseli, J.; Esling, P.; Pawlowski, J.

523

Taxonomy-free molecular diatom index for high-throughput eDNA biomonitoring.

524

Mol. Ecol. Resour. 2017, in press. DOI: 10.1111/1755-0998.12668

525

(19)

Chariton, A. A.; Court, L. N.; Hartley, D. M.; Colloff, M. J.; Hardy, C. M. Ecological

526

assessment of estuarine sediments by pyrosequencing eukaryotic ribosomal DNA.

527

Front. Ecol. Environ. 2010, 8 (5), 233–238.

528

(20)

Chariton, A. A.; Stephenson, S.; Morgan, M. J.; Steven, A. D. L.; Colloff, M. J.; Court, L.

529

N.; Hardy, C. M. Metabarcoding of benthic eukaryote communities predicts the

530

ecological condition of estuaries. Environ. Pollut. 2015, 203, 165–174.

531

(21)

Bik, H. M.; Halanych, K. M.; Sharma, J.; Thomas, W. K. Dramatic shifts in benthic

532

microbial eukaryote communities following the deepwater horizon oil spill. PLoS One

533

2012, 7 (6).

534

(22)

Pawlowski, J.; Esling, P.; Lejzerowicz, F.; Cedhagen, T.; Wilding, T. A. Environmental

535

monitoring through protist next-generation sequencing metabarcoding: Assessing the

536

impact of fish farming on benthic foraminifera communities. Mol. Ecol. Resour. 2014,

537

14 (6), 1129–1140.

538

(23)

Pochon, X.; Wood, S. A.; Keeley, N. B.; Lejzerowicz, F.; Esling, P.; Drew, J.; Pawlowski, J.

539

Accurate assessment of the impact of salmon farming on benthic sediment

540

enrichment using foraminiferal metabarcoding. Marine Pollution Bulletin. 2015.

541

(24)

Lejzerowicz, F.; Esling, P.; Pillet, L. L.; Wilding, T. a.; Black, K. D.; Pawlowski, J. High-

ACS Paragon Plus Environment

Page 23 of 33

Environmental Science & Technology

23 542

throughput sequencing and morphology perform equally well for benthic monitoring

543

of marine ecosystems. Sci. Rep. 2015, 5 (April), 13932.

544

(25)

Aylagas, E.; Borja, Á.; Irigoien, X.; Rodríguez-Ezpeleta, N. Benchmarking DNA

545

Metabarcoding for Biodiversity-Based Monitoring and Assessment. Front. Mar. Sci.

546

2016, 3 (November), 96.

547

(26)

Lanzén, A.; Lekang, K.; Jonassen, I.; Thompson, E. M.; Troedsson, C. High-throughput

548

metabarcoding of eukaryotic diversity for environmental monitoring of offshore oil-

549

drilling activities. Mol. Ecol. 2016, 25 (17), 4392–4406.

550

(27)

FEMS Microbiology Reviews. 2011, pp 343–359.

551 552

(28)

Olden, J. D.; Lawler, J. J.; Poff, N. L. Machine learning methods without tears: a primer for ecologists. Q. Rev. Biol. 2008, 83 (2), 171–193.

553 554

Knights, D.; Costello, E. K.; Knight, R. Supervised classification of human microbiota.

(29)

Philibert, A.; Desprez-Loustau, M. L.; Fabre, B.; Frey, P.; Halkett, F.; Husson, C.; Lung-

555

Escarmant, B.; Marçais, B.; Robin, C.; Vacher, C.; et al. Predicting invasion success of

556

forest pathogenic fungi from species traits. J. Appl. Ecol. 2011, 48 (6), 1381–1390.

557

(30)

and their applications to ecological data. Ecol. Modell. 2012, 240, 113–122.

558 559

(31)

Libbrecht, M. W.; Noble, W. S. Machine learning applications in genetics and genomics. Nat Rev Genet 2015, 16 (6), 321–332.

560 561

Crisci, C.; Ghattas, B.; Perera, G. A review of supervised machine learning algorithms

(32)

Statnikov, A.; Henaff, M.; Narendra, V.; Konganti, K.; Li, Z.; Yang, L.; Pei, Z.; Blaser, M.

562

J.; Aliferis, C. F.; Alekseyenko, A. V. A comprehensive evaluation of multicategory

563

classification methods for microbiomic data. Microbiome 2013, 1 (1), 11.

564 565

(33)

Beck, D.; Foster, J. A. Machine learning techniques accurately classify microbial communities by bacterial vaginosis characteristics. PLoS One 2014, 9 (2).

ACS Paragon Plus Environment

Environmental Science & Technology

Page 24 of 33

24 566

(34)

Martínez-García, P. M.; López-Solanilla, E.; Ramos, C.; Rodríguez-Palenzuela, P.

567

Prediction of bacterial associations with plants using a supervised machine-learning

568

approach. Environmental Microbiology. 2016.

569

(35)

Smith, M. B.; Rocha, A. M.; Smillie, C. S.; Olesen, S. W.; Paradis, C.; Wu, L.; Campbell, J.

570

H.; Fortney, J. L.; Mehlhorn, T. L.; Lowe, K. A.; et al. Natural bacterial communities

571

serve as quantitative geochemical biosensors. MBio 2015, 6 (3), 1–13.

572

(36)

Retrospect, perspect and prospect. Environ. Int. 2006, 32 (2), 273–283.

573 574

Nigam, R.; Saraswat, R.; Panchang, R. Application of foraminifers in ecotoxicology:

(37)

Frontalini, F.; Coccioni, R. Benthic foraminifera as bioindicators of pollution: A review

575

of Italian research over the last three decades. Revue de Micropaleontologie. 2011, pp

576

115–127.

577

(38)

Schönfeld, J.; Alve, E.; Geslin, E.; Jorissen, F.; Korsun, S.; Spezzaferri, S.; Abramovich,

578

S.; Almogi-Labin, A.; du Chatelet, E. A.; Barras, C.; et al. The FOBIMO (FOraminiferal

579

BIo-MOnitoring) initiative-Towards a standardised protocol for soft-bottom benthic

580

foraminiferal monitoring studies. Mar. Micropaleontol. 2012, 94–95, 1–13.

581

(39)

Alve, E.; Korsun, S.; Schönfeld, J.; Dijkstra, N.; Golikova, E.; Hess, S.; Husum, K.; Panieri,

582

G. Foram-AMBI: A sensitivity index based on benthic foraminiferal faunas from North-

583

East Atlantic and Arctic fjords, continental shelves and slopes. Mar. Micropaleontol.

584

2016, 122, 1–12.

585

(40)

Scott, D. B.; Schafer, C. T.; Honig, C.; Younger, D. C. Temporal variations of benthic

586

foraminiferal assemblages under or near aquaculture operations; documentation of

587

impact history. J. Foraminifer. Res. 1995, 25 (3), 224–235.

588 589

(41)

Angel, D. L. Impact of a Net Cage Fish Farm on the Distribution of Benthic Foraminifera in the Northern Gulf of Eilat (Aqaba, Red Sea). J. Foraminifer. Res. 2000,

ACS Paragon Plus Environment

Page 25 of 33

Environmental Science & Technology

25 30 (1), 54–65.

590 591

(42)

Vidović, J.; Cosovic, V.; Jurasic, M.; Petricioli, D. Impact of fish farming on foraminiferal

592

community, Drvenik Veliki Island, Adriatic Sea, Croatia. Mar. Pollut. Bull. 2009, 58 (9),

593

1297–1309.

594

(43)

Vidović, J.; Dolenec, M.; Dolenec, T.; Karamarko, V.; Žvab Rožič, P. Benthic

595

foraminifera assemblages as elemental pollution bioindicator in marine sediments

596

around fish farm (Vrgada Island, Central Adriatic, Croatia). Mar. Pollut. Bull. 2014, 83

597

(1), 198–213.

598

(44)

Pawlowski, J.; Esling, P.; Lejzerowicz, F.; Cordier, T.; Visco, J. A.; Martins, C. I. M.;

599

Kvalvik, A.; Staven, K.; Cedhagen, T. Benthic monitoring of salmon farms in Norway

600

using foraminiferal metabarcoding. Aquac. Environ. Interact. 2016, 8, 371–386.

601

(45)

surveys of foraminifera: Preparing the future. Biological Bulletin. 2014, pp 93–106.

602 603

(46)

(47)

Edgar, R. C.; Haas, B. J.; Clemente, J. C.; Quince, C.; Knight, R. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 2011, 27 (16), 2194–2200.

606 607

Esling, P.; Lejzerowicz, F.; Pawlowski, J. Accurate multiplexing and filtering for highthroughput amplicon-sequencing. Nucleic Acids Res. 2015, 1–12.

604 605

Pawlowski, J.; Lejzerowicz, F.; Esling, P. Next-generation environmental diversity

(48)

Caporaso, J. G.; Kuczynski, J.; Stombaugh, J.; Bittinger, K.; Bushman, F. D.; Costello, E.

608

K.; Fierer, N.; Pena, A. G.; Goodrich, J. K.; Gordon, J. I.; et al. QIIME allows analysis of

609

high-throughput community sequencing data. Nat. Methods 2010, 7 (5), 335–336.

610

(49)

scalable and high-resolution amplicon clustering. PeerJ 2015, 3, e1420.

611 612 613

Mahé, F.; Rognes, T.; Quince, C.; De Vargas, C.; Dunthorn, M. Swarm v2: highly-

(50)

Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic Local Alignment Search Tool. J. Mol. Biol. 1990, No. 215, 403–410.

ACS Paragon Plus Environment

Environmental Science & Technology

Page 26 of 33

26 614

(51)

Comput. Vienna, Austria 2016, URL https://www.R-project.org/.

615 616

R Core Team. R: A language and environment for statistical computing. R Found. Stat.

(52)

Oksanen, J.; Blanchet, F. G.; Kindt, R.; Legendre, P.; Minchin, P. R.; O’Hara, R. B.;

617

Simpson, G. L.; Solymos, P.; Stevens, M. H. H.; Wagner, H. vegan: Community Ecology

618

Package. R package version 2.4-1. 2016.

619

(53)

Shannon, C. E.; Weaver, W. The mathematical theory of information; 1949; Vol. 97.

620

(54)

Pielou, E. C. The measurement of diversity in different types of biological collections. J. Theor. Biol. 1966, 13, 131–144.

621 622

(55)

there are unseen species in sample. Environ. Ecol. Stat. 2003, 10 (4), 429–443.

623 624

(56)

(57)

McMurdie, P. J.; Holmes, S. Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Comput. Biol. 2014, 10 (4).

627 628

Paulson, J. N.; Talukder, H.; Pop, M.; Bravo, H. C. Robust methods for differential abundance analysis in marker gene surveys. Nat. Methods 2013, 10 (12), 1200–1202.

625 626

Chao, A.; Shen, T. J. Nonparametric estimation of Shannon’s index of diversity when

(58)

Weiss, S. J.; Xu, Z.; Amir, A.; Peddada, S.; Bittinger, K.; Gonzalez, A.; Lozupone, C.;

629

Zaneveld, J. R.; Vazquez-Baeza, Y.; Birmingham, A.; et al. Effects of library size

630

variance, sparsity, and compositionality on the analysis of microbiome data. PeerJ

631

Prepr. 2015, 3, e1408. DOI: 10.7287/peerj.preprints.1157v1

632

(59)

Bray, J. R.; Curtis, J. T. An Ordination of the Upland Forest Communities of Southern Wisconsin. Ecol. Monogr. 1957, 27 (4), 325–349.

633 634

(60)

Breiman, L. Random Forests. Mach. Learn. 2001, 45 (1), 5–32.

635

(61)

Wright, M. N.; Ziegler, A. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. arXiv.org 2015, stat.ML.

636 637

(62)

Wehrens, R.; Buydens, L. M. C. Self- and super-organizing maps in R: The kohonen

ACS Paragon Plus Environment

Page 27 of 33

Environmental Science & Technology

27 package. J. Stat. Softw. 2007, 21 (5), 1–19.

638 639

(63)

(December), 18–22.

640 641

Liaw, a; Wiener, M. Classification and Regression by randomForest. R news 2002, 2

(64)

Kuhn, M. caret: Classification and Regression Training. R package version 6.0-73.

642

Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer, Allan

643

Engelhardt, Tony Cooper, Zachary Mayer, Brenton Kenkel, the R Core Team, Michael

644

Benesty, Reynald Lescarbeau, Andrew Ziem, Luca Scrucca, Yuan Tang, Can Candan and

645

Tyler Hunt. 2016.

646

(65)

and Agreement.

647 648

(66)

(67)

Stevenson, R. J.; Pan, Y.; VanDam, H. Assessing environmental conditions in rivers and streams with diatoms. Earth Sci. 2006, 57–85.

651 652

Landis, J. R.; Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 1977, 33 (1), 159–174.

649 650

Gamer, M.; Lemon, J.; Fellows, I.; Singh, P. Various Coefficients of Interrater Reliability

(68)

Grego, M.; De Troch, M.; Forte, J.; Malej, A. Main meiofauna taxa as an indicator for

653

assessing the spatial and seasonal impact of fish farming. Mar. Pollut. Bull. 2009, 58

654

(8), 1178–1186.

655

(69)

Lima-Mendez, G.; Faust, K.; Henry, N.; Decelle, J.; Colin, S.; Carcillo, F.; Chaffron, S.;

656

Ignacio-Espinosa, J. C.; Roux, S.; Vincent, F.; et al. Determinants of community

657

structure in the global plankton interactome. Science (80-. ). 2015, 348 (6237),

658

1262073_1-1262073_9.

659

(70)

Vacher, C.; Tamaddoni-Nezhad, A.; Kamenova, S.; Peyrard, N.; Moalic, Y.; Sabbadin, R.;

660

Schwaller, L.; Chiquet, J.; Smith, M. A.; Vallance, J.; et al. Learning Ecological Networks

661

from Next-Generation Sequencing Data. In Advances in Ecological Research; 2016;

ACS Paragon Plus Environment

Environmental Science & Technology

Page 28 of 33

28 Vol. 54, pp 1–39.

662 663

(71)

Fortunato, C. S.; Eiler, A.; Herfort, L.; Needoba, J. A.; Peterson, T. D.; Crump, B. C.; CS,

664

F.; Eiler, A.; Herfort, L.; JA, N.; et al. Determining indicator taxa across spatial and

665

seasonal gradients in the Columbia River coastal margin. Isme J 2013, 7 (10), 1899–

666

1911.

667

(72)

Dell’Anno, A.; Corinaldesi, C. Degradation and turnover of extracellular DNA in marine

668

sediments: Ecological and methodological considerations. Appl. Environ. Microbiol.

669

2004, 70 (7), 4384–4386.

670

(73)

Corinaldesi, C.; Danovaro, R.; Dell’Anno, A. Simultaneous recovery of extracellular and

671

intracellular DNA suitable for molecular studies from marine sediments. Appl. Environ.

672

Microbiol. 2005, 71 (1), 46–50.

673

(74)

Pillet, L.; Fontaine, D.; Pawlowski, J. Intra-genomic ribosomal RNA polymorphism and

674

morphological variation in elphidium macellum suggests inter-specific hybridization in

675

foraminifera. PLoS One 2012, 7(2) : e32, doi:10.1371/journal.pone.0032373.

676

(75)

Data: A Case Study of Foraminifera. PLoS One 2013, 8 (2).

677 678

Weber, A. A. T.; Pawlowski, J. Can Abundance of Protists Be Inferred from Sequence

(76)

Schirmer, M.; Ijaz, U. Z.; D’Amore, R.; Hall, N.; Sloan, W. T.; Quince, C. Insight into

679

biases and sequencing errors for amplicon sequencing with the Illumina MiSeq

680

platform. Nucleic Acids Res. 2015, 43 (6).

681

(77)

throughput amplicon-sequencing. Nucleic Acids Res. 2015, 43 (5), 2513–2524.

682 683

Esling, P.; Lejzerowicz, F.; Pawlowski, J. Accurate multiplexing and filtering for high-

(78)

Schloss, P. D.; Westcott, S. L. Assessing and improving methods used in OTU-based

684

approaches for 16S rRNA gene sequence analysis. Appl. Environ. Microbiol. 2011, 77

685

(10), 3219–3226.

ACS Paragon Plus Environment

Page 29 of 33

Environmental Science & Technology

29 686

(79)

Chen, W.; Zhang, C. K.; Cheng, Y.; Zhang, S.; Zhao, H. A Comparison of Methods for

687

Clustering 16S rRNA Sequences into OTUs. PLoS One 2013, 8(8) (PLoS ONE 8(8):

688

e70837.), doi:10.1371/journal.pone.0070837.

689

(80)

He, Y.; Caporaso, J. G.; Jiang, X.-T.; Sheng, H.-F.; Huse, S. M.; Rideout, J. R.; Edgar, R. C.;

690

Kopylova, E.; Walters, W. A.; Knight, R.; et al. Stability of operational taxonomic units:

691

an important but neglected property for analyzing microbial diversity. Microbiome

692

2015, 3 (1), 20.

693

(81)

Eren, A. M.; Maignien, L.; Sul, W. J.; Murphy, L. G.; Grim, S. L.; Morrison, H. G.; Sogin,

694

M. L. Oligotyping: Differentiating between closely related microbial taxa using 16S

695

rRNA gene data. Methods Ecol. Evol. 2013, 4 (12), 1111–1119.

696

(82)

fundamentals, tools, and challenges. Ann. Epidemiol. 2016, 26 (5), 330–335.

697 698

(83)

Domingos, P. A few useful things to know about machine learning. Commun. ACM 2012, 55 (10), 78.

699 700

Tsilimigras, M. C. B.; Fodor, A. A. Compositional data analysis of the microbiome:

(84)

Quinn, R. A.; Navas-molina, J. A.; Hyde, E. R.; Song, J.; Vázquez-baeza, Y.; Humphrey,

701

G.; Gaffney, J.; Minich, J. J.; Melnik, A. V; Herschend, J.; et al. From sample to multi-

702

omics conclusions in under 48 hours. 2016, 1 (2).

703

(85)

Wepener, V.; van Vuren, J. H. J.; Chatiza, F. P.; Mbizi, Z.; Slabbert, L.; Masola, B. Active

704

biomonitoring in freshwater environments: Early warning signals from biomarkers in

705

assessing biological effects of diffuse sources of pollutants. Phys. Chem. Earth 2005,

706

30 (11–16 SPEC. ISS.), 751–761.

707

(86)

Maradona, A.; Marshall, G.; Mehrvar, M.; Pushchak, R.; Laursen, A. E.; McCarthy, L. H.;

708

Bostan, V.; Gilbride, K. A. Utilization of multiple organisms in a proposed early-

709

warning biomonitoring system for real-time detection of contaminants: Preliminary

ACS Paragon Plus Environment

Environmental Science & Technology

Page 30 of 33

30 710

results and modeling. J. Hazard. Mater. 2012, 219–220, 95–102.

711

ACS Paragon Plus Environment

Page 31 of 33

Environmental Science & Technology

Tables and figures Table 1: Accuracy of biotic indices predictions obtained from correlation screening, diversity learning and composition learning, with the Random Forest and Self-Organizing Map algorithms. Best predictive models are in bold (i.e. Highest Kappa statistics). Biotic Index Approach Supervised learning algorithm R2 Kappa AMBI Correlation screening - 0.568*** 0.624*** Diversity learning Random Forest 0.641*** 0.69*** Self-Organizing Map 0.492*** 0.618*** Composition learning Random Forest 0.662*** 0.555*** Self-Organizing Map 0.669*** 0.711*** ISI Correlation screening - 0.65*** 0.53*** Diversity learning Random Forest 0.505*** 0.626*** Self-Organizing Map 0.449*** 0.61*** Composition learning Random Forest 0.56*** 0.631*** Self-Organizing Map 0.615*** 0.774*** NSI Correlation screening - 0.508*** 0.607*** Diversity learning Random Forest 0.83*** 0.907*** Self-Organizing Map 0.83*** 0.88*** Composition learning Random Forest 0.827*** 0.832*** Self-Organizing Map 0.794*** 0.871*** NQI1 Correlation screening - 0.76*** 0.8*** Diversity learning Random Forest 0.834*** 0.88*** Self-Organizing Map 0.805*** 0.846*** Composition learning Random Forest 0.81*** 0.856*** Self-Organizing Map 0.803*** 0.873*** *:P < 0.05; **:P < 0.01; ***:P < 0.001

ACS Paragon Plus Environment

Environmental Science & Technology

Figure 1: A) Correlation between the BI values computed from the morpho-taxonomic inventories (morphologic data) and the one predicted from supervised machine learning from the foraminifera SSU 37F data (molecular data). The plots are obtained from the predictive models giving the best accuracy (i.e. highest Kappa statistic, see Table 1). AMBI and ISI predictions were done using the SOM algorithm on composition data. NSI and NQI1 predictions were done using the RF algorithm on diversity metrics. Each dot and error bars represent the average and standard deviation of the predicted BI values for the sediment sample of a grab. B) Barplots indicate the amount (number over bars) and percentage (y axis) of correct classifications (0 on x axis) and misclassifications. C) Boxplots indicates the median and quartile values (black numbers) and the mean value (red number) of the difference between the predicted and the reference BI values.

A

AMBI prediction SOM algorithm on composition data 80

n=6

Molecular

8

n=1

1

6

2

0

1

2

Status mismatch

1.68

1.5

4

2

Percentage

60

10

Percentage

40

n=7 n=1

0

4 3

Molecular

Aukrasanden Beitveitnes Bjørnsvik Nedre Kvarv Storvika

n = 33

20

5

Aukrasanden Beitveitnes Bjørnsvik Nedre Kvarv Storvika

12

100

6

ISI prediction SOM algorithm on composition data

0.5

0.6 0.07

0.5

2

1

0.01

1

2

3

4

5

0

0

0

0.44

1.5

R =0.669*** Kappa =0.711***

R =0.615*** Kappa =0.774***

1.5

0

6

2

4

6

8

10

12

Morphology

Morphology

NQI1 prediction RF algorithm on diversity metrics

0

0.8

NSI prediction RF algorithm on diversity metrics

0

0

Kappa =0.88*** 0.0

0

Kappa =0.907*** 0

0.0

0.8

ISI prediction NSI prediction NQI1 prediction SOM algorithm on composition data RF algorithm on diversity metrics RF algorithm on diversity metrics NSI prediction AMBI prediction ISI prediction 100 60

80

80

1

40

0

1

0

1

2

0

20

2

0

3

0

Status mismatch

1.68 2

1.5

0

n = 10 n=1 n=1

Status mismatch

1.5

10

0.54 0

0.16

2

0.96

0.012

2.79 4

R =0.615*** Kappa =0.774*** 0.0

0.5

8

0.44

0.00

0.07

0

0

0.5

1.49

0.6 0.01

1.5

predicted — known value

n=6 n=1

1

C

6

Percentage

0.8

40

n=7 n=1

2

=0.669*** Kappa =0.711***

NQI1 prediction

n = 25

n = 11

0

80 60

n = 33

20 0

Percentage

% of samples

100

B

0.8



ACS Paragon Plus Environment

Page 32 of 33

Biotic index

New eDNA samples PageEnvironmental 33Macro-invertebrates of 33 Science & Technology Sieving

Foraminifera eDNA

eDNA extraction Sampling

ACS Paragon Plus Environment Data generation

Model training

Data collection and model building

Making predictions Biotic indice predictions