Combined Transcriptomics Analysis for ... - ACS Publications

Nov 2, 2015 - Combined Transcriptomics Analysis for Classification of Adverse. Effects As a Potential End Point in Effect Based Screening. Tjalf E. de...
0 downloads 0 Views 564KB Size
Article pubs.acs.org/est

Combined Transcriptomics Analysis for Classification of Adverse Effects As a Potential End Point in Effect Based Screening Tjalf E. de Boer,*,†,∥ Thierry K. S. Janssens,‡ Juliette Legler,§ Nico M. van Straalen,∥ and Dick Roelofs∥ †

Amsterdam Global Change Institute, VU University Amsterdam, De Boelelaan 1085, 1081 HV Amsterdam, The Netherlands MicroLife Solutions, Science Park 406, 1098 XH Amsterdam, The Netherlands § Institute for Environmental Studies, Faculty of Earth and Life Sciences, VU University Amsterdam, De Boelelaan 1085, 1081 HV Amsterdam, The Netherlands ∥ Department of Ecological Science, Faculty of Earth and Life Sciences, VU University Amsterdam, De Boelelaan 1085, 1081 HV Amsterdam, The Netherlands ‡

S Supporting Information *

ABSTRACT: Environmental risk assessment relies on the use of bioassays to assess the environmental impact of chemicals. Gene expression is gaining acceptance as a valuable mechanistic end point in bioassays and effect-based screening. Data analysis and its results, however, are complex and often not directly applicable in risk assessment. Classifier analysis is a promising method to turn complex gene expression analysis results into answers suitable for risk assessment. We have assembled a large gene expression data set assembled from multiple studies and experiments in the springtail Folsomia candida, with the aim of selecting a set of genes that can be trained to classify general toxic stress. By performing differential expression analysis prior to classifier training, we were able to select a set of 135 genes which was enriched in stress related processes. Classifier models from this set were used to classify two test sets comprised of chemical spiked, polluted, and clean soils and compared to another, more traditional classifier feature selection. The gene set presented here outperformed the more traditionally selected gene set. This gene set has the potential to be used as a biomarker to test for adverse effects caused by chemicals in springtails to provide end points in environmental risk assessment.



INTRODUCTION Ecological or environmental risk assessment (ERA) is defined as the procedure by which the likely or actual adverse effects of pollutants and other anthropogenic activities on ecosystems and their components are estimated with a known degree of certainty using scientific methodologies.1 To quantify this, ecologically relevant test organisms such as fish, daphnids, and algae (for the aquatic environment), or earthworms, springtails, and plants (for the terrestrial environment) are used as proxies for ecosystem components; adverse effects of chemicals are measured in bioassays that use these organisms.2 Usually, bioassays focus on reproduction, survival or growth rate as end points and therefore these tests often take at least several weeks if not longer to perform. Therefore, such tests are less suitable for large-scale screening projects of chemicals and/or sites with suspected contamination. Recently, an alternative and more recent developed end point in ecological testing is the use of gene expression.3 Gene expression marks the start of a sequence of responses in organisms that cross multiple levels of organization and ultimately can lead to effects on population size and ecosystem function.4 Gene expression patterns can be used as a biomarker, providing an early warning of possible adverse effects that might be manifested later on.5 Often genome-wide approaches © XXXX American Chemical Society

such as microarrays or RNA sequencing (RNaseq) are applied. High-throughput gene expression data, however, are complex and the use of different platforms and analysis methods can be problematic due to limited reproducibility and cross-platform technical variation?6 For screening purposes, analysis of smaller gene sets that that show a highly significant correlation with specific adverse effects seem more favorable and practical. The challenge, however, is to reliably select gene sets that can be used as diagnostic tool to quantify adverse effects reliably. A method to test if a selected gene set is able to detect and predict adverse effects is supervised learning or classifier analysis.7 In classifier analysis a class separation model is trained that can differentiate between two or more classes based on a data set with known classes. This model is subsequently used to assign one of the classes to an unknown data set. The class separation is based on a number of characteristics, also called features, from the data set that are different between the classes. Gene expression can be a feature for class separation and has been used successfully for class prediction of unknown Received: August 4, 2015 Revised: October 23, 2015 Accepted: November 2, 2015

A

DOI: 10.1021/acs.est.5b03443 Environ. Sci. Technol. XXXX, XXX, XXX−XXX

Article

Environmental Science & Technology samples. For instance, Wang et al.8 performed binary classifier analyses such as support vector machines (SVM) and k-nearest neighbor (KNN) on gene expression data of zebrafish exposed to endocrine disrupting chemicals and found that selecting a subset of genes (transcription factor networks) yielded a more consistent classifier performance as compared to using whole transcriptome data. In another study, Nota et al.9 applied multiclass classifier analysis on gene expression data in the collembolan Folsomia candida to diagnose specific metal exposures. Using cross validation methods they showed that it is possible to use a subset of genes in this collembolan to successfully predict exposure to a specific metal regardless of the concentration. Both of these studies use a gene selection method to filter out genes prior to training the classifier. This feature selection is often done to improve model performance10 and to reduce the number of features. Within Classifier analysis, several different methods of feature selection can be applied depending on the type of data. A review by Saeys and colleagues10 gives a good overview of the different feature selection approaches. The two relevant for this study are recursive feature elimination (RFE)11 and a parametric test that uses a moderated t-statistic to select genes with a high variance between classes.12 Here we present a study that combines gene expression data from multiple studies generated in the soil ecotoxicological indicator species Folsomia candida and to use this combined data set for gene set selection in order to train classifier models. The goal is to define a classifier to test for general stress caused by chemicals or environmental adverse exposure. Ultimately, this set of genes should be relevant for environmental screening purposes. We combined microarray data from nine different gene expression studies, which determine chemical impact on the transcriptome. The data set includes data from acute exposures to 15 different chemical stressors tested at concentrations corresponding to the EC10 and/or EC50 for reduced reproduction. The overall use of two ecological relevant effect concentrations in the selected studies allowed us to combine data into three groups: control, EC10 equiv and EC50 equiv. For these groups we trained SVM classifiers and tested its potential to predict reproductive effects with cross validation as well as its power to predict these effects with two unrelated test sets; a set consisting of two chemicals in increasing concentrations and a set that consisted of clean and polluted natural soils. The resulting classifiers and their gene sets presented here can act as a potential biomarker to screen for the presence of possible adverse effects and may be used to augment existing soil ecotoxicological tests.13

Table 1. Overview of the Data from Different Folsomia candida Microarray Studies Used in the Study Presented Herea first author Janssens39 Chen40 van Ommen Kloeke41 Nota17 Nota9 Roelofs31 Nota16 de Boer30 Nota42 Roelofs

exposure chlorinated annilines A diclofenac A isothiocyanate

A

phenanthrene A 6 metals A landfill soil A/B cadmium A clean soils B heat shock B soil from metal smelters B

microarrays

GEO platform

GEO link

16

GPL7150

GSE19929

8 8

GPL7150 GPL7150

GSE59589 GSE26698

8 31 16 12 56 4 16

GPL7150 GPL7150 GPL7150 GPL6381 GPL7150 GPL6381 GPL6381

GSE14207 GSE19353 GSE37154 GSE11122 GSE21213 GSE17024 GSE59588

a

A: Data used for the training set. B: data used for test set 2. A/B: data from these studies has been used in both training and test sets but individual samples were only used in one of the sets.

Test Set 1 Microarray Experiment. To generate a separate test set to test for classifier performance, we conducted microarray experiments on Collembolans (springtails) exposed to six concentrations of both cadmium and phenanthrene. The six concentrations were centered around the reported cadmium and phenanthrene EC50s on reproduction.15,16 Folsomia candida (Berlin strain) was cultured under standard conditions as described by de Boer et al.17 Both stocks and exposures of springtails were kept at 20 °C, at 75% relative humidity and at a 12−12 h light/dark ratio and stocks were fed baker’s yeast (Dr. Oetker). For the cadmium and phenanthrene exposures, we spiked LUFA2.2 soil as described previously15,16 at the following nominal concentrations, cadmium: 0, 5.8, 14.5, 28.9, 57.9 (EC50) and 115.8 mg CdCl2 per kg soil and phenanthrene: 0, 4.6, 11.4, 22.9, 45.8 (EC50) and 91.6 mg/kg soil. Exposures lasted for 2 days after which the collembolans were extracted from the soil by floatation and snap frozen in liquid nitrogen. Four biological replicates, each containing 30 individual collembolans, were used per concentration. Total RNA was isolated from each replicate using the SV total RNA kit (Promega, Madison,WI). RNA concentration and quality were checked on a Nanodrop spectrophotometer (Thermo Fischer, Waltham, MA) and Bioanalyzer (Agilent Technologies, Santa Clara, CA). RNA was labeled using the Low Input Linear Amplification kit (Agilent Technologies) according to the manufacturer’s guidelines and as was described previously16 with a modification where the transcription reactions were done in half volume. Samples were hybridized to custom Agilent 8 × 15 microarrays (GEO platform GPL7150). Hybridization, washing and scanning were all done according to the manufacturer’s instructions. Microarray intensity data was extracted with Agilent Feature Extraction software version 9.5.1 (Agilent Technologies). The data was uploaded to the NCBI GEO database in record GSE70877. Data Normalization. The data from the selected studies originated from two Agilent custom microarray platforms (GEO platforms: GPL638115 and GPL715016) which are both based on an F. candida transcriptome sequencing study published by Timmermans et al.18 Although both platforms use the same set of gene transcripts, they use a different microarray design: platform GPL6381 uses the Agilent 2 × 11 k



MATERIALS AND METHODS Data Collection. In this study we make use of gene expression data obtained for a soil-living microarthropod, Folsomia candida (Hexapoda: Collembola: Isotomidae). The data were used in two ways: to train classifier models and to test the trained classifiers. To circumvent bias in either training or testing, individual data were allocated to either one group or the other but not used for both training and testing. For the training data we focused on previously published gene expression data while the test data consisted of a mix of newly generated data and existing data. Previously published gene expression microarray data for Folsomia candida were collected from the Gene Expression Omnibus database (GEO) at NCBI (http://www.ncbi.nlm.nih.gov/geo/).14 See Table 1 for information on the selected studies. B

DOI: 10.1021/acs.est.5b03443 Environ. Sci. Technol. XXXX, XXX, XXX−XXX

Article

Environmental Science & Technology

separately and consists of exposures to five increasing concentrations of both cadmium and phenanthrene. The second group mainly consisted of natural soils, both clean and polluted, to test the predictive power in natural soils that contain mixtures of contaminants. Samples in this second group were normalized together with the training group samples. Finally, both test sets combined were tested for their predictive power by calculating ROC curves and determining the area under the curve using the pROC package for R.23 Annotation and Gene Ontology Analysis. To obtain information on the optimal gene sets determined by the feature selection we reannotated all transcripts on the F. candida microarray against the UniProt/Swiss-Prot curated protein database and assigning Gene Ontology (GO) terms accordingly using the Blast2GO software.24 Gene Ontology (GO) enrichment analysis for Biological Process and Molecular Function was performed for both feature selections using the TopGO package for R.25 Cellular Component was not determined as this is considered to be less informative. Also, significant GO terms that only contained a single gene were removed from the analysis results.

format and contains two technical replicate probes per transcript, while platform GPL7150 uses the Agilent 8 × 15 k format and contains three technical replicate probes per transcript. The 60-mer probe sequences per transcript are equal. In order to be able to normalize all data together, we removed one technical replicate per transcript randomly from the GPL7150 platform data so that prior to normalization all microarray data contained the same set of transcripts with two technical probe replicates per transcript. Data normalization consisted of a background subtraction according to the Edwards method with a minimal offset of 30.19 After background subtraction, the data were normalized both within the microarrays (global LOESS) as well as between the microarrays (Aquantile) using Limma for R/Bioconductor.12 A large portion of these studies make use of exposure concentrations equal to the EC10 and/or EC50 on reproduction as determined by the standard ISO 28 day chronic test.13 As exposure times in the gene expression tests are however markedly shorter (2 days), we cannot equate them to EC10 and EC50 concentrations and will therefore designate them as control, low and high exposure groups. The data were first divided into two groups; all data from studies that use EC10 and EC50 equiv concentrations were designated as the training group and remaining data that did not make use of these concentration equivalents (data from natural soils for example) were used for the test group. Support Vector Machine Classifiers. After normalization samples from the training group were divided into three groups: control (72 samples), low (35 samples) and high (80 samples) according to the exposure concentration. For the SVM classifier analysis we used the e1071 package for R which offers an interface for the C++ implementation of LibSVM.20 Since support vector machines are binary classifiers we performed the training twice: once between the control and low-dose groups and once between the control and high-dose groups. We performed two different feature selections prior to training the classifier: for the first one we used Recursive Feature Elimination (RFE) to rank the genes based on their weight within the model.21 After the RFE ranking, we calculated the optimal number of genes to be used by determining which number of genes had the lowest prediction error using 10-fold cross validation. For the second method of feature selection we performed a differential gene expression analysis for each contrast on the combined data using Limma in R/Bioconductor.12,17 Limma is especially suitable for these types of combined analyses because shrinkage of the estimated sample variances caused by the empirical Bayes method guarantees that the outcome of the analysis is not biased by an individual study.22 The optimal set of genes used to train the classifier consisted of those genes which were significantly differentially expressed in both contrasts. Both feature selection methods produced an optimal number of genes to train the classifier with. Different feature selection calculations, and thus optimal gene sets, were performed for each contrast. SVMs that made use of the optimal gene sets were tuned using grid search to determine the optimal SVM kernel parameter (gamma) and the optimal cost of miss classification (C). After tuning, the SVMs were trained and training success was determined using leave one out cross validation. To determine the predictive power of the newly established SVM classifiers we tested them on an independent test set. This test set was composed of two groups. The first group used samples from an independent experiment that was normalized



RESULTS AND DISCUSSION Data Normalization Efficiency. In this study we use class prediction algorithms such as SVM on a combined gene expression data set assembled from multiple studies in order to establish sets of genes. The selected sets can predict if test organisms, springtails in this case, are experiencing general stress due to exposure to unknown chemicals or potentially polluted soils. For this we make use of existing Folsomia candida gene expression data from two different microarray platforms which may contain systematic, in between platform derived variation which needs to be removed. For this we normalized all microarray data together. In Supporting Information (SI) Figure 1 we show microarray intensity results prior to normalization (SI Figure 1A), after LOESS normalization within microarrays (SI Figure 1B) and after a-quantile normalization between microarrays (SI Figure 1C). Systematic variation is greatly reduced after normalization between microarrays (SI Figure 1C). In order to normalize all the data together we had to remove one of the three technical probe replicates from 84% of the microarrays (all microarrays from platform GPL7150). To calculate the impact from this replicate removal, we calculated the overall duplicate correlation coefficient before and after. The correlation coefficient dropped from 0.918 to 0.911 with the removal of one replicate. This remaining high correlation coefficient reinforced our confidence that the removal of one technical replicate did not have a large impact on the analysis. Feature Selection. To select optimal sets of genes for the classifier analysis we performed two types of feature selection. Classification with a selected subset of features should work, in theory, better than making use of the full set of features.11 Also, for future practical applications, a smaller gene set should offer better applicability than the full set. The first feature selection method we applied was a parametric test that first determines differential expression between the different exposure levels or classes using a linear model that makes use of a moderated tstatistic. This use of statistical methods to filter the gene set prior to classification has been used successfully before.26,27 Differential gene expression analysis between the control and low groups yielded 171 differentially expressed genes (84 up and 87 down), whereas 910 genes (446 up and 464 down) C

DOI: 10.1021/acs.est.5b03443 Environ. Sci. Technol. XXXX, XXX, XXX−XXX

Article

Environmental Science & Technology Table 2. Results of the Different Feature Selection Methods Used for Classifier Training and Subsequent Testinga features

LOO−CV success

LM low classifier

135

89.1% (83.2−95.1)

testset 1 accuracy and F-ratio cadmium phenanthrene overall F-ratio

75% 55% 66% 0.67

testset 2 accuracy and F-ratio clean soil polluted soil overall F-ratio

100% 66% 89% 0.82

0.81 (0.73−0.89)

ROC AUC

LM high classifier

135

90.4% (85.7−95.1)

cadmium phenanthrene overall F-ratio

50% 60% 57% 0.51

clean soil polluted soil overall F-ratio

100% 87% 97% 0.96

0.88 (0.82−0.93)

RFE low classifier

211

82.5% (75.3−89.8)

cadmium phenanthrene overall F-ratio

65% 70% 61% 0.72

clean soil polluted soil overall F-ratio

100% 7% 69% 0.34

0.83 (0.7−0.90)

RFE high classifier

184

79.6% (73.8−85.4)

cadmium phenanthrene overall F-ratio

85% 70% 80% 0.67

clean soil polluted soil overall F-ratio

81% 47% 73% 0.60

0.79 (0.71−0.87)

a

The number of features and the prediction success of the different classifiers for both leave one out cross validation (LOO−CV) and for both test sets are depicted. For both test sets the overall and specific prediction accuracy as well as the F-ratio is depicted. The upper half of the table contains the results for the LM classifiers and the lower part contains the RFE classifier results. Bracketed values denote 95% confidence intervals.

prediction success was 89.1% (low) and 90.4% (high) respectively. LOO−CV performances for the RFE sets were 82.5% for the RFE-low set and 79.6% for the RFE-high set (see Table 2 for the 95% confidence intervals). The LM selected set performed better than the other two sets, especially considering the lower number of genes it contains. To support this we performed the LOO−CV on increasing numbers of ranked genes and in general larger groups of genes tended to perform better in this analysis (data not shown). We found no genes that were shared among all three sets. The two RFE selected sets for low and high exposure levels shared 10 genes, whereas the RFE-low set shared 9 genes with the LM selected set and the RFE-high set shared 3 genes with the LM selected set (see Figure 1). The minimal overlap

were found to be significantly different between the control and high exposure groups. SI Figure S2 shows a volcano plot of both the control−low and control−high comparisons. The high exposure dose group induced a greater effect on gene expression than the low exposure group (both in number of significant genes as well as in magnitude of effect). This is to be expected as higher exposure concentrations generally cause a greater effect.28 A total of 135 genes were significant in both groups and we decided to use these 135 genes as features for our classifier analysis between both the control and low and the control and high groups. The other feature selection method we used is recursive feature elimination (RFE) which is more commonly used in classifier analysis and serves here mainly to test the success of the parametric feature selection method described above. RFE recursively runs the SVM model and eliminates the feature (gene) that contributes the least to class separation. This process is reiterated until only one feature remains. This ranks the features after which the optimal feature set can be determined by repeatedly using cross validation in on multiple sets of ranked genes and choosing the set that shows the lowest prediction error. As RFE selection makes use of binary SVM itself it will yield a separate gene set per class contrast: one for data between the control and low groups and one for data between the control and high groups. The optimal gene set between control and low consisted of 211 genes and the optimal gene set between control and high exposure level groups consisted of 184 genes. See SI Table 1 for details on the gene sets. In total we had three different feature selection sets: The single linear model generated selection set (LM, 135 genes) applicable to both the control−low and control−high contrasts and the two separate RFE selected sets (RFE-low, 211 genes and RFE-high, 184 genes). All three gene sets were subjected to a leave one out cross validation (LOO−CV) to determine their predictive success. LOO−CV for the LM selected set was performed on both the low and high training sets and

Figure 1. Venn diagram that shows the overlap of the three gene sets selected by feature selection and the classifier developed by Nota et al. RFE low: RFE selection for the low to control comparison. RFE high: RFE selection for the high to control contrast. LM set: gene set selected by linear modeling. D

DOI: 10.1021/acs.est.5b03443 Environ. Sci. Technol. XXXX, XXX, XXX−XXX

Article

Environmental Science & Technology

accuracy or prediction success. . For the second set, the contaminated soils were determined as being overall polluted above EC10 (low) and EC50 (high) level and the clean soils as controls. For the low exposure level classifiers (low vs control trained) (Table 2, upper half) we found a rather low prediction success for the cad/phe test set. Both classifiers predicted 61% (RFE set) and 66% (LM set) of the samples correctly. The RFE-low classifier predicted 100% of the control samples as low, whereas the LM-low selected classifier misclassified one out of four control samples as low. Both classifiers classified a large portion of the cadmium samples with concentrations below the EC10 as exposed to low, while they should be classified as control samples. However, it seems that the cadmium-induced effect on gene expression at low concentrations is already more similar to gene expression patterns for which the classifier was trained. Consequently, these Cd concentration samples will be classified as low exposure samples. For the phenanthrene prediction there was less concordance between the two classifiers. The LM-low classifier predicted the lower concentrations correctly as control but only predicted a quarter of the higher concentrations as exposed. The RFE-low classifier did better on the higher concentrations (83% success) but misclassified 50% of the lower phenanthrene concentration samples. The misclassification of the lower concentrations samples is probably related to the fact that phenanthrene caused a less severe and more canalized response in gene expression in F. candida than cadmium does.16 For the second test set that included the natural soils the predictive success of the classifiers was higher. Both classifiers recognized 100% of the clean soils as control. The LM classifier was more successful in the prediction of the polluted soils (66%) than the RFE-low classifier (7%) as well as in predicting the 4 and 7 days cadmium exposure and the heat shock samples (83% vs 58%) For the high exposure level classifier we observed a different outcome (Table 2, lower half). The RFE-high selected classifier was better able to predict the first test set than the LM-high classifier (80% vs 57%). Just like the low classifier validation, the LM classifier identified lower cadmium concentrations as highly exposed. Moreover, two out of four controls were misclassified as high exposure level samples. The RFE-high classifier predicted 88% of the cadmium samples correctly but identified two out of eight higher concentration samples as being control samples. For phenanthrene the overall prediction success was 75% for the RFE-high classifier and 63% for the LM classifier. Both were successful on the concentrations below the EC50 (only a single misclassification for the RFE-high classifier) but performed less well for the concentrations at or above the EC50 (38% for the RFE and 25% for the LM classifier). For the second test set the prediction results were reversed. The LM classifier had an overall success rate of 97% compared to 73% of the RFE. The LM classifier only misclassified two polluted soils as control. Both soils were sampled from a former landfill site and diluted with clean LUFA2.2 soil to the NOEC concentration for reproduction so the final concentrations of chemicals in these soils may have been lower than in other polluted soils.32 Despite their dilution level these soils still elicited a significant response in gene expression and where therefore used in the test set. The actual set of genes that responded to this treatment is different than at the higher effect concentrations which may explain the misclassification. The RFE classifier successfully classified 78% of the clean soils, 47% of the polluted soils. We also used 4 and 7 day EC50 Cd exposure data from Nota et al. (2008) as well as

between the three gene sets underlies the different principles on which the feature selection methods are based. Annotation of the overlapping genes did not show a general pattern except for the comparison between the RFE-low gene set and the LM set; two of the nine genes were CYP450 genes which are known to be involved in detoxification of chemicals. We performed gene enrichment (GO) analysis on the three selected gene sets to see if certain biological processes (BP) or molecular functions (MF) were enriched within the sets. Neither of the RFE selected sets (low and high exposure levels) showed a clear general stress pattern as compared to the Linear Model selected set (see SI Table 2). The RFE-high selected set showed a low number of significant GO terms in general (11 in both BP and MF) and there was no clear discernible pattern. The RFE-low selected set showed more significant GO terms (42 in total) but these were more general terms (higher up in the hierarchical gene ontology tree) such as coagulation (GO:0050817), protein domain specific bin din g (GO:0019904) and DNA binding (GO:0003677). Compared to the other two groups GO analysis applied to the linear model selected group showed a clear stress response pattern with 52 significant GO terms. Among the significant GO term we found exogenous drug catabolic process (GO:0042738) which is involved in the degradation of xenobiotic compounds.29,30 Other significant GO terms in biological process such as chlorinated hydrocarbon metabolic process (GO:0042196), alkaloid catabolic process (GO:0009822) and monoterpenoid metabolic process (GO:0016098) are also involved in xenobiotic degradation and clearly show that the genes present in these GO terms can be markers for effects induced by these chemicals. The three selected gene sets were used to train four classifier models: the LM selected set was trained separately on the low vs control and the low vs high training sets yielding two classifiers. The RFE selected sets used the appropriate set for the appropriate contrast (RFE - low for the low vs control training set and RFE-high for the high vs control training set). Classifier Prediction Success. To test the predictive ability of the classifiers we assembled two test sets from data that were not used for training of the classifiers. The first set consisted of a separate gene expression experiment where springtails were exposed to five concentrations of cadmium and phenanthrene. By using the data from this experiment as a test set for the classifiers we can determine if the classifiers are able to distinguish between different chemical concentrations. The second test set was generated from existing data and consisted of natural soils (both polluted and clean). We randomly picked springtail expression data from exposures to six clean soils (three from clay and three from sandy soil) from de Boer et al.31 and combined this with data from polluted soils9,32 to test if the classifiers are able to separate clean soils from polluted soils. Data from only six soils (out of 27) was selected as to keep the polluted soil to clean soil ratio balanced. The second test set also contained cadmium samples with longer exposure times than the standard 2 days (4 and 7 days) which were derived from a study by Nota et al.15 These samples were added to investigate if exposure time influences classifier performance. Both test sets were classified and validated using the four newly trained SVM models and overall prediction results are depicted in Table 2. The first test set contains six concentrations of Cd and Phe, in a range that includes EC10 and EC50 effect level.As the classifiers were trained to predict EC10 or EC50, we were able to use this test set to determine E

DOI: 10.1021/acs.est.5b03443 Environ. Sci. Technol. XXXX, XXX, XXX−XXX

Article

Environmental Science & Technology

Figure 2. ROC curves based on the combination of both test sets for the four classifiers. The low exposure classifiers are on the left and high exposure classifiers on the right. The diagonal line represents the performance of a random classifier and the more the classifiers deviate from this line the better they perform.

expression patterns of gene sets may have for ecological risk assessment. We have demonstrated that, by combining gene expression data from multiple studies with test animals exposed to a wide array of chemicals, we can use a data set to train classifiers and predict general stress and thus if exposure to a chemical or polluted soil will lead to detrimental effects in the long term. The use of gene expression in combination with class prediction is not new.7,9,35 Nota et al.9 used an unshrunken centroid multiclass classifier on Folsomia candida gene expression data to select a set of 188 genes that was able to differentiate between six metals. In this study we have used SVM classifiers to classify unknown samples based on exposure levels that are linked to physiological relevant effect concentrations rather than on individual exposure types. Although a multiclass approach is better able to pinpoint the actual exposure type, it would be less suitable to predict unknown chemicals that were not present in the training set and therefore it will have less predictive power for future applications. Also, the reason for selecting SVM is that there were only three classes and two interesting contrasts (the contrast between low and high exposure groups is not of interest here) and, according to an extensive evaluation of multiple types of classifiers by Statnikov et al.,36 SVMs outperform other classifiers in predictive power when using gene expression data. By selecting those features (genes) that show similar effects at similar exposure levels and applying them to a classification based on exposure level in combination with a wide variety of exposure chemicals we can build a robust classifier that can also predict effects of unknown chemicals and determine relative toxicity levels in soils. Feature selection is often necessary in to optimize the classifier performance because irrelevant featuresthose features that do not contribute to class separationwithin the training set lead to classifier performance degradation.37 For gene set selection we used two different methods of feature selection. The RFE method is based on gene weights during the selection regardless of the gene function. This means that any combination of genes can yield the optimal class separator without taking gene function or genomic organization such as gene pathways into account and could thus potentially counter the selection of redundant

heat shock transcriptional profiling data (Nota 20??) as test samples. and the classifier successfully predicted exposure level in 92% of these longer cadmium exposures and heat shock exposures. We calculated the F-score for both test sets (see Table 2, higher scores depict better classifier performance), which is a measure determined by the weighted harmonic mean of precision and recall. Both precision and recall are determined by the true positive rate (TP), the false positive rate (FP), and the false negative rate (FN) where precision is calculated by TP/(TP+FP) and recall by TP/(TP+FN).33 The F-score gives a more complete overview of the classifier performance as both false positive and false negative rates are taken into consideration separately. The F-score, however, does not take the true negative rate (clean samples classified as clean) into account which makes it a less interesting value in this study as the classification of clean soils is equally important as the classification of polluted soils. The F-score values where mostly in line with the accuracy results and indicated overall better performance of the LM gene set selected classifier. For an overall prediction of test set performance both test sets were combined and Receiver Operating Characteristic (ROC) curves were calculated for each classifier. ROC curves plot the true positive rate against the false positive rate at multiple discrimination thresholds and are often used to test binary classifiers. Figure 2 shows the ROC curves for the low classifiers (left) and the high classifiers (right). The more the ROC curve deviates from the diagonal line the better the classifier performs. The exposure low classifiers performed equally well but for the high exposure classifiers the LM classifier performed better than the RFE classifier. From the ROC curve the area under the curve (AUC) can be calculated which can be statistically tested against other AUC’s using the DeLong test34 to determine classifier performance. AUC values for the 4 classifiers used in this study are depicted in Table 2. In general AUC performance was in concordance with the other results presented here. The LM classifier for the high exposure set performed the best and significantly outperformed the RFE high classifier (p = 0.041). There was no significant difference between the two low exposure classifiers (p = 0.688). In this study we have used class prediction and machine learning as a tool to evaluate the predictive power that F

DOI: 10.1021/acs.est.5b03443 Environ. Sci. Technol. XXXX, XXX, XXX−XXX

Article

Environmental Science & Technology

(2) Keddy, C. J.; Greene, J. C.; Bonnell, M. A. Review of wholeorganism bioassays: soil, freshwater sediment, and freshwater assessment in Canada. Ecotoxicol. Environ. Saf. 1995, 30 (3), 221−251. (3) Van Straalen, N.; Roelofs, D. Genomics technology for assessing soil pollution. J. Biol. 2008, 7 (6), 19. (4) Ankley, G. T.; Bennett, R. S.; Erickson, R. J.; Hoff, D. J.; Hornung, M. W.; Johnson, R. D.; Mount, D. R.; Nichols, J. W.; Russom, C. L.; Schmieder, P. K.; Serrrano, J. A.; Tietge, J. E.; Villeneuve, D. L. Adverse outcome pathways: A conceptual framework to support ecotoxicology research and risk assessment. Environ. Toxicol. Chem. 2010, 29 (3), 730−741. (5) Wang, R. L.; Bencic, D.; Biales, A.; Lattier, D.; Kostich, M.; Villeneuve, D.; Ankley, G. T.; Lazorchak, J.; Toth, G. DNA Microarray based ecotoxicological biomarker discovery in a small fish model species. Environ. Toxicol. Chem. 2008, 27 (3), 664−675. (6) Chen, J. J.; Hsueh, H.-M.; Delongchamp, R. R.; Lin, C.-J.; Tsai, C.-A. Reproducibility of microarray data: a further analysis of microarray quality control (MAQC) data. BMC Bioinf. 2007, 8, 1−14. (7) Wei, X.; Ai, J.; Deng, Y.; Guan, X.; Johnson, D. R.; Ang, C. Y.; Zhang, C.; Perkins, E. J., Identification of biomarkers that distinguish chemical contaminants based on gene expression profiles. BMC Genomics 2014, 15.24810.1186/1471-2164-15-248 (8) Wang, R.-L.; Bencic, D.; Biales, A.; Flick, R.; Lazorchak, J.; Villeneuve, D.; Ankley, G. T., Discovery and validation of gene classifiers for endocrine-disrupting chemicals in zebrafish (danio rerio). BMC Genomics 2012, 13.35810.1186/1471-2164-13-358 (9) Nota, B.; Verweij, R. A.; Molenaar, D.; Ylstra, B.; van Straalen, N. M.; Roelofs, D. Gene Expression Analysis Reveals a Gene Set Discriminatory to Different Metals in Soil. Toxicol. Sci. 2010, 115 (1), 34−40. (10) Saeys, Y.; Inza, I.; Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23 (19), 2507− 2517. (11) Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 2002, 46 (1−3), 389−422. (12) Smyth, G. K., Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 2004, 3, Article3.110.2202/1544-6115.1027 (13) ISO. Soil Quality. Inhibition of Reproduction of Collembola (Folsomia candida), ISO Guideline 11267, 1999; 16. (14) Edgar, R.; Domrachev, M.; Lash, A. E. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30 (1), 207−210. (15) Nota, B.; Timmermans, M.; Franken, O.; Montagne-Wajer, K.; Marien, J.; De Boer, M. E.; De Boer, T. E.; Ylstra, B.; Van Straalen, N. M.; Roelofs, D. Gene Expression Analysis of Collembola in Cadmium Containing Soil. Environ. Sci. Technol. 2008, 42 (21), 8152−8157. (16) Nota, B.; Bosse, M.; Ylstra, B.; van Straalen, N. M.; Roelofs, D., Transcriptomics reveals extensive inducible biotransformation in the soil-dwelling invertebrate Folsomia candida exposed to phenanthrene. BMC Genomics 2009, 10, (236).23610.1186/1471-2164-10-236 (17) de Boer, T. E.; Tas, N.; Braster, M.; Temminghoff, E. J. M.; Roling, W. F. M.; Roelofs, D. The Influence of Long-Term Copper Contaminated Agricultural Soil at Different pH Levels on Microbial Communities and Springtail Transcriptional Regulation. Environ. Sci. Technol. 2012, 46 (1), 60−68. (18) Timmermans, M.; de Boer, M.; Nota, B.; de Boer, T.; Marien, J.; Klein-Lankhorst, R.; van Straalen, N.; Roelofs, D. Collembase: a repository for springtail genomics and soil quality assessment. BMC Genomics 2007, 8 (1), 341. (19) Edwards, D. Non-linear normalization and background correction in one-channel cDNA microarray studies. Bioinformatics 2003, 19 (7), 825−833. (20) Chang, C.-C.; Lin, C.-J. LIBSVM: A Library for Support Vector Machines. Acm Transactions on Intelligent Systems and Technology 2011, 2, (3). (21) Lin, X.; Yang, F.; Zhou, L.; Yin, P.; Kong, H.; Xing, W.; Lu, X.; Jia, L.; Wang, Q.; Xu, G. A support vector machine-recursive feature

features that are coregulated. This is different from the linear model method which is not iterative and selects the gene set based upon expression variation between the different conditions. The linear model selection does allow for coregulated genes to persist in the gene set as was indicated by the GO analysis in which a more canalized and clear gene pattern was observed. This pattern observed in the linear method selection, however, does conform better to the principle set out by Ankley et al.38 that gene function in relation to the chemical mode of action allows for better predictors of adverse effects. Of the two gene selection methods, the classifiers trained with the gene set selected by the linear model method performed better in both the cross validation as well as classification of the test sets. Only in the case of the high classifier the RFE test performed better for the cadmium/ phenanthrene test set. The main cause for this was the LM classifier, which was too sensitive in the lower cadmium and phenanthrene samples. Because genes in the LM selected set showed a clear response to stress pattern as well as a better overall prediction performance this classifier would be better suited as a biomarker set. The set of 135 genes presented here can act as a biomarker for exposure and would be especially suitable for large screening efforts as it does not classify between specific stressors but differentiates, in the great majority of cases correctly, between exposed and nonexposed conditions. The use of a smaller set of genes such as presented here to test the impact of chemicals or pollutants would also allow for other platforms, such as high-throughput quantitative PCR, to be used instead of whole transcriptome methods such as microarrays or RNA-seq This would lower the costs considerably and would also allow for easier to interpret data analysis.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.est.5b03443. Captions for tables and figures (PDF) Table 1 (XLSX) Table 2 (XLSX) Figure 1 (PDF) Figure 2(PDF)



AUTHOR INFORMATION

Corresponding Author

*Phone: +31 20 5987217; fax: +31 20 5987123; e-mail: TE.de. [email protected]. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS We thank Benjamin Nota, Elaine van Ommen-Kloeke and Guangquan Chen for valuable insight into their data. T.E.d.B. was funded by the Amsterdam Global Change Institute and BEBASIC project dRISK F08.002.1. D.R. received additional support from EU FP7 Large scale integration project NMP4LA-2013-604305.



REFERENCES

(1) Depledge, M.; Fossi, M. The role of biomarkers in environmental assessment (2). Invertebrates. Ecotoxicology 1994, 3 (3), 161−172. G

DOI: 10.1021/acs.est.5b03443 Environ. Sci. Technol. XXXX, XXX, XXX−XXX

Article

Environmental Science & Technology

Versteeg, D. Toxicogenomics in regulatory ecotoxicology. Environ. Sci. Technol. 2006, 40, 4055−4065. (39) Janssens, T. K. S.; Giesen, D.; Mariën, J.; van Straalen, N. M.; van Gestel, C. A. M.; Roelofs, D. Narcotic mechanisms of acute toxicity of chlorinated anilines in Folsomia candida (Collembola) revealed by gene expression analysis. Environ. Int. 2011, 37 (5), 929− 939. (40) Chen, G.; den Braver, M. W.; van Gestel, C. A. M.; van Straalen, N. M.; Roelofs, D. Ecotoxicogenomic assessment of diclofenac toxicity in soil. Environ. Pollut. 2015, 199 (4), 253−260. (41) van Ommen Kloeke, E. A. E.; van Gestel, C. A. M.; Styrishave, B.; Hansen, M.; Ellers, J.; Roelofs, D. Molecular and life-history effects of a natural toxin on herbivorous and non-target soil arthropods. Ecotoxicology 2012, 21 (4), 1084−1093. (42) Nota, B.; Van Straalen, N. M.; Ylstra, B.; Roelofs, D. Gene expression microarray analysis of heat stress in the soil invertebrate Folsomia candida. Insect Mol. Biol. 2010, 19 (3), 315−322.

elimination feature selection method based on artificial contrast variables and mutual information. J. Chromatogr. B: Anal. Technol. Biomed. Life Sci. 2012, 910 (0), 149−155. (22) Marot, G.; Foulley, J.-L.; Mayer, C.-D.; Jaffrézic, F. Moderated effect size and P-value combinations for microarray meta-analyses. Bioinformatics 2009, 25 (20), 2692−2699. (23) Robin, X.; Turck, N.; Hainard, A.; Tiberti, N.; Lisacek, F.; Sanchez, J. C.; Muller, M., pROC: an open-source package for R and S plus to analyze and compare ROC curves. BMC Bioinf. 2011, 12.7710.1186/1471-2105-12-77 (24) Conesa, A.; Götz, S.; García-Gómez, J. M.; Terol, J.; Talón, M.; Robles, M. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 2005, 21 (18), 3674−3676. (25) Alexa, A.; Rahnenfuhrer, J.; Lengauer, T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 2006, 22 (13), 1600−1607. (26) Model, F.; Adorján, P.; Olek, A.; Piepenbrock, C. Feature selection for DNA methylation based cancer classification. Bioinformatics 2001, 17 (suppl 1), S157−S164. (27) Ding, C. H. Analysis of Gene Expression Profiles: Class Discovery and Leaf Ordering. In Proceedings of the Sixth Annual International Conference on Computational Biology; ACM, 2002; pp 127−136. (28) Poynton, H. C.; Loguinov, A. V.; Varshavsky, J. R.; Chan, S.; Perkins, E. J.; Vulpe, C. D. Gene Expression Profiling in Daphnia magna Part I: Concentration-Dependent Profiles Provide Support for the No Observed Transcriptional Effect Level. Environ. Sci. Technol. 2008, 42 (16), 6250−6256. (29) Jones, L. M.; Rayson, S. J.; Flemming, A. J.; Urwin, P. E., Adaptive and Specialised Transcriptional Responses to Xenobiotic Stress in Caenorhabditis elegans Are Regulated by Nuclear Hormone Receptors. PLoS One 2013, 8, (7).e6995610.1371/journal.pone.0069956 (30) Misra, J. R.; Horner, M. A.; Lam, G.; Thummel, C. S. Transcriptional regulation of xenobiotic detoxification in Drosophila. Genes Dev. 2011, 25 (17), 1796−1806. (31) de Boer, T. E.; Birlutiu, A.; Bochdanovits, Z.; Timmermans, M. J. T. N.; Dijkstra, T. M. H.; Van Straalen, N. M.; Ylstra, B.; Roelofs, D. Transcriptional plasticity of a soil arthropod across different ecological conditions. Mol. Ecol. 2011, 20 (6), 1144−1154. (32) Roelofs, D.; de Boer, M.; Agamennone, V.; Bouchier, P.; Legler, J.; van Straalen, N., Functional environmental genomics of a municipal landfill soil. Front. Genet. 2012, 3.10.3389/fgene.2012.00085 (33) Goutte, C.; Gaussier, E. A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. In Advances in Information Retrieval; Losada, D., Fernández-Luna, J., Eds.; Springer: Berlin Heidelberg, 2005; Vol. 3408, pp 345−359. (34) Delong, E. R.; Delong, D. M.; Clarkepearson, D. I. COMPARING THE AREAS UNDER 2 OR MORE CORRELATED RECEIVER OPERATING CHARACTERISTIC CURVES - A NONPARAMETRIC APPROACH. Biometrics 1988, 44 (3), 837− 845. (35) van ’t Veer, L. J.; Dai, H.; van de Vijver, M. J.; He, Y. D.; Hart, A. A. M.; Mao, M.; Peterse, H. L.; van der Kooy, K.; Marton, M. J.; Witteveen, A. T.; Schreiber, G. J.; Kerkhoven, R. M.; Roberts, C.; Linsley, P. S.; Bernards, R.; Friend, S. H. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415 (6871), 530−536. (36) Statnikov, A.; Aliferis, C. F.; Tsamardinos, I.; Hardin, D.; Levy, S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 2005, 21 (5), 631−643. (37) Judson, R.; Elloumi, F.; Setzer, R. W.; Li, Z.; Shah, I., A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model. BMC Bioinf. 2008, 9.24110.1186/1471-2105-9-241 (38) Ankley, G.; Daston, G.; Degitz, S.; Denslow, N.; Hoke, R.; Kennedy, S.; Miracle, A.; Perkins, E.; Snape, J.; Tillit, D.; Tyler, C.; H

DOI: 10.1021/acs.est.5b03443 Environ. Sci. Technol. XXXX, XXX, XXX−XXX