Novel Computational Identification of Highly Selective Biomarkers of

May 4, 2011 - The use of in vivo biosensors to acquire environmental pollution data is an emerging and promising paradigm. One major challenge is the ...
0 downloads 11 Views 2MB Size
ARTICLE pubs.acs.org/est

Novel Computational Identification of Highly Selective Biomarkers of Pollutant Exposure David Weisman, Hong Liu,† Jessica Redfern, Liya Zhu, and Adan Colon-Carmona* Department of Biology, University of Massachusetts Boston, 100 Morrissey Boulevard, Boston, Massachusetts 02125, United States

bS Supporting Information ABSTRACT: The use of in vivo biosensors to acquire environmental pollution data is an emerging and promising paradigm. One major challenge is the identification of highly specific biomarkers that selectively report exposure to a target pollutant, while remaining quiescent under a diverse set of other, often unknown, environmental conditions. This study hypothesized that a microarray data mining approach can identify highly specific biomarkers, and, that the robustness property can generalize to unforeseen environmental conditions. Starting with Arabidopsis thaliana microarray data measuring responses to a variety of treatments, the study used the top scoring pair (TSP) algorithm to identify mRNA transcripts that respond uniquely to phenanthrene, a model polycyclic aromatic hydrocarbon. Subsequent in silico analysis with a larger set of microarray data indicated that the biomarkers remained robust under new conditions. Finally, in vivo experiments were performed with unforeseen conditions that mimic phenanthrene stress, and the biomarkers were assayed using qRT-PCR. In these experiments, the biomarkers always responded positively to phenanthrene, and never responded to the unforeseen conditions, thereby supporting the hypotheses. This data mining approach requires only microarray or next-generation RNA-seq data, and, in principle, can be applied to arbitrary biomonitoring organisms and chemical exposures.

’ INTRODUCTION Environmental pollution data is the primary input used to develop regulatory policies, perform environmental impact assessments, analyze human risks, and plan remediation projects. One approach of acquiring this data is through biomonitoring, broadly defined as use of in vivo systems that detect and report pollution conditions. Biomonitoring can be attractive for its high sensitivity, relevance to human disease, robustness over wide parameter ranges, low cost, and in situ deployment at high spatial density. A wide variety of biomonitoring organisms has been investigated including bacteria, fungi, plants, invertebrates, fish, and mammals.15 Biomonitors typically convey signals through changes in measurable markers such as mRNA or protein expression levels.6,7 Ideally, a highly selective biosensor responds exclusively to a target pollutant, and does not erroneously report exposure to other compounds or environmental conditions. However, the existence of unknown chemicals, mixtures, and conditions encountered in a field setting raises significant challenges to achieving the idealized behavior.8 Chemical mixtures can cause complex synergistic and antagonistic interactions between responses,8,9 obfuscating the known signature of a single-agent response. Another challenge is that biomarkers can exhibit nonmonotonic dose-response curves, rising at low doses, but then falling at higher doses.10 Furthermore, distinct biological pathways are perturbed over the range of treatment levels, creating a collection of dose-responsive effects.5 As a final challenge, r 2011 American Chemical Society

unrelated stresses can elicit similar biomarker responses. For example, in Arabidopsis thaliana, we have previously shown that the microarray fingerprint of a particular organic pollutant resembles that of a pathogenic fungal attack.11 Such resemblance is not surprising given the crosstalk and convergence between signaling pathways, resulting in common end points such as programmed cell death. Unfortunately, because of such openended complexity, biomonitor research typically measures responses to a narrower, more carefully constrained set of variables. Consequently, there is a significant knowledge gap between conventional dose-response biomarker studies and the information needed for robust biomonitor development and field deployment. As a novel step toward addressing this knowledge gap, we focused on identifying highly robust and specific biomarkers that respond to a single agent, while not responding to a spectrum of other chemicals, stresses, and growth conditions. We hypothesized that a microarray data mining approach could identify a small set of mRNA biomarkers with these robust properties. We further hypothesized that the classifier could generalize beyond the initial biological conditions of the training microarray data set. Received: January 10, 2011 Accepted: April 25, 2011 Revised: April 7, 2011 Published: May 04, 2011 5132

dx.doi.org/10.1021/es200065f | Environ. Sci. Technol. 2011, 45, 5132–5138

Environmental Science & Technology

ARTICLE

Table 1. Microarray Data Sets Used As Input to k-TSPa Code

Condition

ID

Reps

aph

aphids (stage 3.90, 8 h; rosette)

101

3,3

bot

Botrytis cinerea (4 w, 48 h)

167

3,3

cls

cold, 4 °C (18 d, 24 h)

138

2,2

css

cesium-137 (3 w, 2 w; shoot)

324

3,3

drs

drought (18 d, 24 h)

141

2,2

gts

genotoxicity (18 d, 24 h; shoot)

142

2,2

hts

heat, 38 °C (18 d, 3 h)

146

2,2

o3 oss

ozone (2 w, 6 h) osmotic stress (18 d, 24 h; shoot)

26 139

2,2 2,2

oxs

oxidative stress (18 d, 24 h; shoot)

143

2,2

phe

phenanthrene (21 d; 21d)

pra

Pieris rapae (5 w, 24 h)

330

1,1

pst

P. syringae patovar (5 w, 24 h)

330

1,1

sen

senescence (stage 6.0; leaves)

52

1,2

sls

salt stress (18 d, 24 h)

140

2,2

thr tnt

F. occidentalis (thrip) (5w, 24 h) trinitrotoluene (TNT) (2 w, 6 h)

330 472

1,1 3,3

uvs

UV radiation (2 w, 24 h; shoot)

144

2,2

wds

wounding (18 d, 24 h)

145

2,2

2,2

a

In parentheses are plant age, treatment duration, and tissue sampled. Time units are weeks (w), days (d), or development stage.19 Whole plant tissue was analyzed unless otherwise indicated. ID is the Nottingham Arabidopsis Stock Centre NASCArrays experiment number.18 Reps indicates the number of biological replicate microarrays for control and treated conditions, respectively.

To perform that data mining, we build on the rich body of research into microarray-based classification of cancer tissue types.12,13 A relatively recent development is the top-scoring pair (TSP) algorithm,14 which evaluates microarray data sets labeled with a binary biological class, and identifies a single pair of mRNA transcripts that accurately discriminates between the two biological classes. The discrimination is based on a simple test of relative expression reversal,15 which compares the relative levels of both transcripts, and predicts the biological class based on that comparison. For example, if the level of cyclin-mRNA > p53-mRNA, then the sample is predicted as cancerous; otherwise the sample is predicted as normal. Here we utilize k-TSP, an enhancement of TSP that identifies several gene pairs, to uniquely identify plant biomarkers of a toxicant exposure.16 To test the hypotheses, we performed initial data mining on a small set of Arabidopsis microarray data, with the goal of constructing a classifier that uniquely responds to phenanthrene, a member of the widely distributed carcinogenic family of polycyclic aromatic hydrocarbons (PAH).17 To investigate whether the results generalize to new conditions, we performed an in silico validation with a large collection of published Arabidopsis microarray data, and tested whether the biomarkers report false-positive responses to a wider spectrum of growth and treatment conditions. To further investigate the properties of the in vivo classifier, we performed qRT-PCR experiments to further validate the microarray gene expression changes seen with the phenanthrene stress response, and to examine the responses to new experimental conditions that were not in the training microarray data. Our results found that highly specific biomarkers can be identified using a computational approach, and suggest that the technique has generality beyond plant systems and PAHs.

’ MATERIALS AND METHODS Data Mining and External Validation of Biomarker Transcripts. Using custom scripts written for this study, computa-

tions were performed in R version 2.12.026 and Bioconductor version 2.7,27 running on Debian Linux version 5.0 (http://www. debian.org). Additional utility software was written in Haskell and Python. To prepare a training data set for the TSP algorithm (Table 1), nonphenanthrene Affymetrix ATH1 .cel files were obtained from the NASCArrays service of the European Arabidopsis Stock Centre.18 These were concatenated with the phenanthrene series of ATH1 microarray .cel files (ref 11, NCBI GEO: GSE20009), and normalized using the Bioconductor27 just.gcrma algorithm with default parameters.28 In the normalized data set, each experiment was labeled with a binary class code indicating phenanthrene treatment (“PHE”) versus a nonphenanthrene condition (“other”). Additional biological details and statistical tests of the phenanthrene microarray data set were previously described.11 Before running k-TSP, the data set was filtered to remove the ambiguous microarray probes that represent more than one mRNA (regexp: [09]þ_s_at), as these multipletranscript measurements are not conducive to qRT-PCR analysis.29 Additional prefiltering experiments were performed to remove probes below a noise floor of log2 100. The resulting training data was generated in the .csv format specified by k-TSP. The k-TSP executable code and k-TSP.pl script were obtained from https://jshare.johnshopkins.edu/atan6/public_html/ KTSP/ and run using k-TSP.pl -t training_file, which performs training and leave-one-out cross-validation.16 Additional experiments were performed with cross-validation disabled. From kTSP output, probe pairs that were amenable to qRT-PCR primer design using AtRTPrimer30 or NCBI PrimerBlast (http://www. ncbi.nlm.nih.gov/tools/primer-blast/), and had microarray readings above background noise levels, were selected. To perform in silico tests of generalization, the online database CressExpress version 3.123 was queried using the “ReST-Style” API at http://www.cressexpress.org/cgi-bin/getExpVals.py. From that query result, microarrays with Kolmogorov-Smirnov goodness-of-fit scores less than 0.15 were retained. To partition these microarrays into the classes described in Figure 3, the following regular expression strings were evaluated over the CressExpress tissue and description fields; Developmental: embryo, seedling, silique, hypocotyl, cotyledon; cell culture: culture, cell suspension; floral: inflorescence, flower, pollen; mutant: mutant, mutation, overexpression, cir1, species, halleri, knockout, Agrobacterium tumefaciens; seed: seed. Additionally, NASC experiments 14 and 26 were denoted as embryo/seedling, and NASC experiment 30 was denoted as cell culture. All other experiments were denoted as whole plant or leaf tissue. Plant Growth, Treatments, and qRT-PCR. Sterile halfstrength Murashige-Skoog (MS) medium with sucrose was prepared and adjusted to pH 5.7. Inorganic treatments were mixed into the medium prior to autoclaving. Organic treatment stocks were prepared in water or methanol, filter sterilized, and mixed into autoclaved medium below 50 °C as previously described.31 Treatments were: phenanthrene, 0.25 mM (Sigma P-2528, 100 mM stock in methanol); CdCl2, 50 uM (Sigma 20899); NaCl, 100 mM (OmniPur 7760); atrazine, 1.0 uM (Sigma 49085, 3 mM stock in methanol); glyphosate, 5 uM (Sigma 45521); paraquat, 20 nM (Sigma 856177). The phenanthrene concentration was chosen to be comparable to significantly 5133

dx.doi.org/10.1021/es200065f |Environ. Sci. Technol. 2011, 45, 5132–5138

Environmental Science & Technology polluted sites;32 other treatment concentrations were chosen based on observed changes in root or hypocotyl length (data not shown). Seeds were sterilized and plated onto media. Plants were grown vertically for 21 days in long day light (16 h light, 8 h dark) at 21 °C. Following growth, plants were synchronously harvested at 10 h into the daylight phase, pooled, and stored at 80 °C until mRNA extraction. RNA was extracted from 100 mg pooled whole plant tissue using the Qiagen RNeasy Plant Mini Kit per manufacturer’s instructions. RNA was treated with DNase I (Invitrogen Amplification grade), and cDNA was synthesized from oligo-dT primers using SuperScript III per manufacturer’s instructions. Quantitative RT-PCR was performed using the Finnzymes F-410 DyNAmo HS Sybr Green qPCR Kit. Each 20 ul reaction contained 0.25 uM of each primer and approximately 75135 ng of template DNA. The primers used in this study were At2g213 30-fwd, gccaaggacttgacggtttagcc; At2g21330-rev, aaagcagagggtccatttggaatgc; At5g66690-fwd, ttcgggagtggtggttgtctatcgg; At5g66 690-rev, tccaccaccgttagccgagacatac; At2g43590-fwd, tgctcatttcact cacgagaccgga; At2g43590-rev, tgtgtgttgctgctctggcagt; At5g24930fwd, ggacggttaggattccgacggtgaa; At5g24930-rev, atctggcaccactcccacttccatc; At5g37690-fwd, tgtcgaactcttgacctcaaccct; At5g37690rev, cggttcaggcacatgcgtgtct; At5g65730-fwd, tctcacatccgtcaaatggaagacg; At5g65730-rev, ggcagagtctccgggaatgagttt; Actin2-fwd, atccaagctgttctctccttgtac; Actin2-rev, tgtgagacacaccatcaccagaat; PP2A-fwd, taacgtggccaaaatgatgc; PP2A-rev, gttctccacaaccgcttggt. The qRT-PCR program of 95 °C for 15 min, and 40 cycles of 95 °C for 10 s, 60 °C for 60 s was run on an Applied Biosystems model 7000. Dissociation curves for all reactions were inspected for amplification specificity. Data was exported from the qPCR machine and converted to .csv format for processing in R. At least two PCR technical replicates were run for each biological sample. Technical replicates greater than two cycle thresholds counts (Ct) from the median were discarded, and the median Ct was recomputed without the outliers. In Figure 5, delta-Ct is defined as the difference between the final median Ct values of both transcripts.

ARTICLE

Figure 1. Biomarker gene pair identified by k-TSP analysis of the microarray data sets in Table 1. Rows with cont and trmt represent control and treatment experiments, respectively. atge_cont is a shared control for the AtGenExpress cls, drs, gts, hts, oss, oxs, sls, uvs, and wds experiments.21 Point shapes represent biological replicates. Relative expression comparisons are made within each biological replicate. The expected relative expression reversal occurs only in the phenanthrene treatment. Expression units are log2-transformed microarray intensity values.

’ RESULTS Data Mining for Biomarker Gene Pairs. From Arabidopsis microarray data archived at NASCArrays,18 we manually selected a small set of experiments that represent exposure to natural stressors and environmental toxicants. Biological replicate experiments were included to represent the variance in response and measurement. In addition, we included a variety of nontreated control microarray data to represent transcriptional variation under nonstressed plant conditions. These experiments were combined with microarray data measuring 21 days wholeplant responses to 0.25 mM phenanthrene treatment.11 Table 1 describes the complete k-TSP input data set. The resulting matrix contains one row for each microarray probe, and one column for each microarray. Additionally, we labeled each column with a binary class indicating phenanthrene-treated versus nonphenanthrene condition. From these inputs, k-TSP identified several gene pairs that discriminate the phenanthrene class from the nonphenanthrene class. From k-TSP outputs, we chose three pairs of putative biomarker genes for further analysis. Figure 1 shows one of these pairs; others are shown in Supporting Information (SI) Figures 1 and 2. For these three pairs, over all biological replicates, the expected relative expression reversal occurred only between the

Figure 2. Expression levels of the reference genes actin2 (At3g18780) and pp2a (At1g13320), from the data sets in Table 1. Graphical conventions are identical to those in Figure 1.

phenanthrene and nonphenanthrene classes, indicating that k-TSP found biomarker candidates that accurately fit the training data. In addition, to serve as negative controls, we analyzed two stable reference genes, actin2 (At3g18780) and pp2a (At1g13320),20 and these results are shown in Figure 2. As expected, the reference genes did not exhibit relative expression reversal. The full set of biomarker gene pairs is described in SI Table 1. In silico Tests of Generalization. During feature selection, k-TSP performs leave-one-out cross-validation, and the resulting 5134

dx.doi.org/10.1021/es200065f |Environ. Sci. Technol. 2011, 45, 5132–5138

Environmental Science & Technology

Figure 3. Generalization of gene pair results by in silico comparison with external microarray data sets. The blue line is y = x, showing the threshold for relative expression reversal of the transcript pair. Each point indicates the transcript levels of the two genes on a particular microarray. Large circles with þ and  correspond to phenanthrene treatment and control microarrays, respectively. Small circles correspond to microarray data sets obtained from CressExpress.23 Color indicates the Arabidopsis sample class: black is whole-plant or leaf (n = 598); brown: floral (n = 115); green: embryo, seedling, silique, hypocotyl, or cotyledon (n = 105); gray: seed (n = 432); orange: cell culture (n = 33) red: mutant or A. halleri (n = 147). Expression units are log2-transformed microarray intensity values.

error rates were consistently below 3% (data not shown); however, due to the class size imbalance we did not consider this statistic highly informative.22 To better understand classifier performance with nonphenanthrene conditions, we tested the biomarkers for generalization across a wider set of negative conditions. For the transcripts in SI Table 1, we queried the online CressExpress database23 version 3.1, which contains 1771 Arabidopsis Affymetrix ATH1 microarrays. Training microarrays that also were present in the CressExpress data were removed from the validation set. To remove microarrays with likely technical imperfections, we eliminated the chips with Kolmogorov Smirnov deleted-residual test scores above 0.15, indicating high variance within the experimental group.23 The phenanthrene microarray experiment was performed on 21d wild-type (WT) whole plants.11 To compare this data in a biologically meaningful way with other microarray data sets, we machine- and manually classified the CressExpress data into several categories based on its tissue and description fields. As the whole-plant phenanthrene microarray data primarily measures above-ground tissue,24 we removed the root-only experiments from the data set. The resulting classes are described in Figure 3 and the Materials and Methods section. For the three biomarker gene pairs (Figure 3, SI Figures 3 and 4), the phenanthrene points exhibit relative expression reversal from the large majority of experiments, indicating that the biomarkers correctly distinguished phenanthrene from other conditions. In the relatively few false-positive cases, the points represent seed, embryonic, seedling, and cell culture samples which are biologically distinct from the mature wholeplant measurements.25 Of critical importance, the biologically comparable whole-plant and aerial tissue samples (small black

ARTICLE

Figure 4. Analysis of stable reference genes actin2 and pp2a, comparing the phenanthrene microarray data with external microarray data sets. Graphical conventions are identical to Figure 3.

circles, n = 598) consistently indicate reversed relative expression levels with the phenanthrene points. The reference gene pair shown in Figure 4 serves as a negative control by illustrating two stable transcripts that are not expected to show relative expression reversal. These in silico results strongly support our hypothesis that biomarkers identified through data mining can generalize far beyond the initial training set. In Vivo Tests of Generalization. To further test generalization of the biomarker genes, we treated Arabidopsis with new chemicals that mimic phenanthrene stress and measured expression of the biomarker genes using qRT-PCR. By design, these new treatments were neither in the training data set (Table 1) nor in the in silico generalization tests. At least three biological replicates were performed for each condition, and at least two qPCR technical replicate measurements were performed on each biological sample. As internal controls and for validation of the microarray data, we performed new experiments under phenanthrene and nontreated conditions. As additional controls, we measured the two stable reference genes in all treatments. Figure 5 shows the resulting qRT-PCR data. In all cases, the biomarker gene pairs produced the anticipated expression reversals, further supporting the generalization hypothesis.

’ DISCUSSION Data Mining for Robust Biomarkers. The k-TSP algorithm is a supervised learning classifier, a member of the diverse family that includes support vector machines, artificial neural networks, linear discriminators, K-nearest neighbors, and random forests.33 Two properties make k-TSP novel and highly relevant for identifying pollution biomarkers in highthroughput gene expression data sets. First, because k-TSP compares within-microarray ranks rather than absolute levels between microarrays, there is no requirement for coherent meta-study microarray normalization, thereby obviating a number of statistical difficulties.15,3436 In addition, the expression ranking approach can integrate other types of transcriptomic data, most importantly, next-generation mRNA sequencing data.37 We anticipate that such integration will become increasingly important 5135

dx.doi.org/10.1021/es200065f |Environ. Sci. Technol. 2011, 45, 5132–5138

Environmental Science & Technology

ARTICLE

Figure 5. Quantitative RT-PCR results of biomarker gene pairs. Plants were grown for 21 days in long day light and harvested at 10 h into the daylight phase. Treatment codes: 0.25 phe, 0.25 mM phenanthrene; Atra, 1.0 uM atrazine; CdCl2, 50 uM cadmium chloride; Gly, 5 uM glyphosate; MS, untreated medium; NaCl, 100 mM NaCl; Para, 20 nM paraquat. The leftmost panel represents the stable reference genes; the other panels depict the responses of the biomarker gene pairs. Delta Ct represents the difference in qRT-PCR cycle threshold (Ct) values between the two transcripts. All conditions were tested with at least n = 3 biological replicate experiments.

as the emerging body of next-generation RNA-seq data can be merged with the large collection of legacy microarray data. Second, k-TSP is a feature selection algorithm that chooses extremely parsimonious marker sets rather than identifying tens or hundreds of informative genes on a microarray. While we evaluated three pairs of transcripts (Figure 5), the consistent results suggest that a single pair would suffice for a practical biosensor. Such parsimony contributes toward making a cost-effective RT-PCR-based assay for routine environmental measurements, and contrasts with traditional feature selection algorithms used in ecotoxicology classification. 4 From a machine learning perspective, identifying a small set of biomarkers is tantamount to inferring a low complexity model, which reduces the likelihood of overfitting to noisy biological data and increases the accuracy of predictions. Other researchers have successfully identified 13 cancer biomarker genes using traditional linear decision boundary methods, but required coherent meta-study microarray normalization.38,39 The training data set is highly imbalanced, with more nonphenanthrene points than phenanthrene points. Class imbalance is challenging for machine learning and occurs routinely in settings such as facial recognition, rare disease classification, and credit card fraud detection.40 The relatively large set of nonphenanthrene training data cannot represent all conditions that a biomonitor might encounter in the field. However, in silico testing with a larger nonphenanthrene data set, and in vivo testing with unforeseen treatments, suggest that k-TSP identified accurate biomarkers having low false-positive rates. These results are not biologically surprising given that a microarray with

! n n = 23  10 probes has = (n(n1)/2) ∼ 3  108 pairs of 2 probes, and, from this large set, several were found that respond uniquely to phenanthrene. The small volume of phenanthrene training data might represent a subset of the physiological conditions experienced under this treatment, thereby causing model overfitting. Comparing control versus phenanthrene treatment, all p-values are significant (SI Table 1), and qRT-PCR externally validates the microarray data (Figure 5), indicating that the phenanthrene training data is valid. In addition, the phenanthrene microarray experiments were performed with pooled samples of at least 20 individual plants thereby representing a larger effective sample size,41 and had low internal noise as indicated by Pearson correlation coefficients of at least 0.99 between all biological and technical replicates.11 Moreover, additional qRT-PCR experiments found that the biomarkers responded to phenanthrene treatments at concentrations above and below the 0.25 mM level described here (data not shown). Taken together, under these conditions, the classifier exhibits high true-positive and high truenegative rates. Subsequent testing under a broad spectrum of field conditions would be appropriate for the candidate biomarkers. The qRT-PCR results shown in Figure 5 support the hypothesis that the biosensor generalizes beyond the initial training set, as the atrazine, CdCl2, and glyphosate conditions are not part of the training or validation microarray data. This set of treatments was chosen for its inhibition of photosynthesis, induction of oxidative stress, and upregulation of detoxification proteins, which are the fundamental hallmarks of phenanthrene toxicity in 3

5136

dx.doi.org/10.1021/es200065f |Environ. Sci. Technol. 2011, 45, 5132–5138

Environmental Science & Technology Arabidopsis.3,11,31,32,42 Despite this mimicry of phenanthrene stress, the biomarkers correctly discriminated between phenanthrene and the other chemical agents. Further supporting the generalization hypothesis, the NaCl and paraquat experiments differed from the corresponding training microarray conditions in treatment time, chemical molarity, treatment method, and plant age (Table 1 and ref 21). Our data mining approach can be extended in several novel ways. We focused on detecting a single pollutant; however, in principle, the technique can also recognize multiple pollutants. In a previous work, k-TSP performed multiclass prediction, identifying up to 14 types of cancer by measuring the relative expression reversals of multiple pairs of genes.16 A similar paradigm is applicable in biosensing, where microarray experiments are performed for several pollutants, and the resulting data, combined with nonpollutant microarrays, forms the input to k-TSP. An attractive property of this model is that a single organism could recognize a set of pollutants, contributing toward costeffective environmental monitoring. Further, it will be interesting to explore whether a multiclass predictor can deconvolve the components of a chemical mixture.43 The biosensor evaluated here does not attempt to report the pollutant concentration, as relative expression reversal provides only a binary indication of exposure. Doseresponse biomonitoring faces the additional challenge of hormesis, the frequent observation that biological actions can be nonmonotonic, with inverted “U”- or “J”-shaped doseresponse curves.10 Moreover, an organism may activate unrelated signaling pathways at different xenobiotic dose levels. The concept of multiclass prediction suggests a novel solution to these problems: By collecting microarray data points over a range of treatment levels, and by identifying unique biomarkers for each level, a multiclass biosensor could then infer the pollutant concentration. In that model, a system with k gene pairs could theoretically resolve up to 2k different pollutant levels. As a final extension to the approach, the biosensor confidence level can be augmented by increasing the number of gene pairs. In this phenanthrene study, the three gene pairs always discriminated correctly (Figure 5); however, other conditions could cause equivocal responses. In that situation, an ensemble of k biomarker pairs can be used in a majority voting scheme to better predict the sample class. While the present study benefited from the tremendous volume of published Arabidopsis microarray data, the approach generalizes to any species having a representative set of global gene expression data. For example, zebrafish (Danio rerio) is under active investigation for use in ecotoxicology monitoring, and microarrays have been used to measure its responses to dioxin and arsenic.4,5 Within the foreseeable future, next-generation sequencing will presumably supplant microarrays as the primary source of massively high throughput transcriptome measurement. This wealth of data will undoubtedly cover numerous species and environmental conditions, creating enormous opportunities for biosensor development through large-scale data mining.

’ ASSOCIATED CONTENT

bS

Supporting Information. Figures S1 and S2 illustrate microarray responses of two additional gene pairs identified by k-TSP. Figures S3 and S4 show the results of in silico analysis of these additional two gene pairs. Table S1 describes the names

ARTICLE

and functions of Arabidopsis phenanthrene biomarker genes identified by k-TSP. Scripts written for this project are available from the authors. This material is available free of charge via the Internet at http://pubs.acs.org.

’ AUTHOR INFORMATION Corresponding Author

*E-mail: [email protected]. Present Addresses †

College of Resources and Environmental Study, Fujian Agriculture and Forestry University, Fuzhou, 350002, China

’ ACKNOWLEDGMENT We thank Helen Poynton for critically reading the manuscript and providing valuable suggestions, and Kristophe Diaz for performing plant photography. This work was supported by a University of Massachusetts Boston Joseph P. Healey Grant to DW and ACC, and by a Joint Interagency Program on Phytoremediation Research grant from the National Science Foundation (Grant No. IBN-0343856) to ACC. Further support was from the National Nature Science Foundation of China (Grant No. 30970532), and Key Program for the Construction of the Economic Zone on the Western Side of the Taiwan Straits, Fujian Province (Grant No. 0b08b005), both to H.L. ’ REFERENCES (1) Menzel, R.; Swain, S. C.; Hoess, S.; Claus, E.; Menzel, S.; Steinberg, C. E.; Reifferscheid, G.; St€urzenbaum, S. R. Gene expression profiling to characterize sediment toxicity—a pilot study using Caenorhabditis elegans whole genome microarrays. BMC Genomics 2009, 10, 160. (2) Alnafisi, A.; Hughes, J.; Wang, G.; Miller, C. A. Evaluating polycyclic aromatic hydrocarbons using a yeast bioassay. Environ. Toxicol. Chem. 2007, 26, 1333–1339. (3) Brain, R. A.; Cedergreen, N. Biomarkers in aquatic plants: selection and utility. Rev. Environ. Contam. Toxicol. 2009, 198, 49–109. (4) Wang, R.-L.; Bencic, D.; Biales, A.; Lattier, D.; Kostich, M.; Villeneuve, D.; Ankley, G. T.; Lazorchak, J.; Toth, G. DNA microarraybased ecotoxicological biomarker discovery in a small fish model species. Environ. Toxicol. Chem. 2008, 27, 664–675. (5) Yang, L.; Kemadjou, J. R.; Zinsmeister, C.; Bauer, M.; Legradi, J.; M€uller, F.; Pankratz, M.; J€akel, J.; Str€ahle, U. Transcriptional profiling reveals barcode-like toxicogenomic responses in the zebrafish embryo. Genome Biol 2007, 8, R227. (6) Aggelen, G. V.; et al. Integrating omic technologies into aquatic ecological risk assessment and environmental monitoring: hurdles, achievements, and future outlook. Environ Health Perspect 2010, 118, 1–5. (7) Schirmer, K.; Fischer, B. B.; Madureira, D. J.; Pillai, S. Transcriptomics in ecotoxicology. Anal Bioanal Chem 2010, 397, 917–923. (8) Forbes, V. E.; Palmqvist, A.; Bach, L. The use and misuse of biomarkers in ecotoxicology. Environ. Toxicol. Chem. 2006, 25, 272–280. (9) Filby, A. L.; Santos, E. M.; Thorpe, K. L.; Maack, G.; Tyler, C. R. Gene expression profiling for understanding chemical causation of biological effects for complex mixtures: a case study on estrogens. Environ. Sci. Technol. 2007, 41, 8187–8194. (10) Calabrese, E. J. Paradigm lost, paradigm found: the re-emergence of hormesis as a fundamental dose response model in the toxicological sciences. Environ. Pollut. 2005, 138, 379–411. (11) Weisman, D.; Alkio, M.; Colon-Carmona, A. Transcriptional responses to polycyclic aromatic hydrocarbon induced stress in Arabidopsis thaliana reveal the involvement of hormone and defense signaling pathways. BMC Plant Biol. 2010, 10, 59. 5137

dx.doi.org/10.1021/es200065f |Environ. Sci. Technol. 2011, 45, 5132–5138

Environmental Science & Technology (12) Golub, T. R.; Slonim, D. K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J. P.; Coller, H.; Loh, M. L.; Downing, J. R.; Caligiuri, M. A.; Bloomfield, C. D.; Lander, E. S. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286, 531–537. (13) MAQC Consortium; et al. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat. Biotechnol. 2010, 28, 827838 (14) Geman, D.; d’Avignon, C.; Naiman, D. Q.; Winslow, R. L., Classifying gene expression profiles from pairwise mRNA comparisons. Stat. Appl. Genet. Mol. Biol. 2004, 3, Article19. (15) Eddy, J. A.; Sung, J.; Geman, D.; Price, N. D. Relative expression analysis for molecular cancer diagnosis and prognosis. Technol. Cancer Res. Treat. 2010, 9, 149–159. (16) Tan, A. C.; Naiman, D. Q.; Xu, L.; Winslow, R. L.; Geman, D. Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics 2005, 21, 3896–3904. (17) Baird, W. M.; Hooven, L. A.; Mahadevan, B. Carcinogenic polycyclic aromatic hydrocarbon-DNA adducts and mechanism of action. Environ. Mol. Mutagen. 2005, 45, 106–114. (18) Craigon, D. J.; James, N.; Okyere, J.; Higgins, J.; Jotham, J.; May, S. NASCArrays: a repository for microarray data generated by NASC’s transcriptomics service. Nucleic Acids Res. 2004, 32, D575–D577. (19) Boyes, D. C.; Zayed, A. M.; Ascenzi, R.; McCaskill, A. J.; Hoffman, N. E.; Davis, K. R.; G€orlach, J. Growth stage-based phenotypic analysis of Arabidopsis: a model for high throughput functional genomics in plants. Plant Cell 2001, 13, 1499–1510. (20) Czechowski, T.; Stitt, M.; Altmann, T.; Udvardi, M. K.; Scheible, W.-R. Genome-wide identification and testing of superior reference genes for transcript normalization in Arabidopsis. Plant Physiol. 2005, 139, 5–17. (21) Kilian, J.; Whitehead, D.; Horak, J.; Wanke, D.; Weinl, S.; Batistic, O.; D’Angelo, C.; Bornberg-Bauer, E.; Kudla, J.; Harter, K. The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. Plant J. 2007, 50, 347–363. (22) Visa, S. Proceedings of the Sixteen Midwest Artificial Intelligence and Cognitive Science Conference, 2005; pp 6773. (23) Srinivasasainagendra, V.; Page, G. P.; Mehta, T.; Coulibaly, I.; Loraine, A. E. CressExpress: a tool for large-scale mining of expression data from Arabidopsis. Plant Physiol. 2008, 147, 1004–1016. (24) Hermans, C.; Hammond, J. P.; White, P. J.; Verbruggen, N. How do plants respond to nutrient shortage by biomass allocation? Trends Plant Sci. 2006, 11, 610–617. (25) Schmid, M.; Davison, T. S.; Henz, S. R.; Pape, U. J.; Demar, M.; Vingron, M.; Sch€olkopf, B.; Weigel, D.; Lohmann, J. U. A gene expression map of Arabidopsis thaliana development. Nat. Genet. 2005, 37, 501–506. (26) R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2009; ISBN: 3-900051-07-0. (27) Gentleman, R. C.; et al. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 2004, 5, R80. (28) Wu, Z.; Irizarry, R. A.; Gentleman, R.; Martinez-Murillo, F.; Spencer, F. A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. J. Am. Stat. Assoc. 2004, 99, 909–917. (29) Affymetrix. GeneChip Expression Analysis Data Analysis Fundamentals, Part No. 701190 Rev. 4th ed.; Affymetrix, Inc.: Santa Clara, CA, 2004. (30) Han, S.; Kim, D. AtRTPrimer: database for Arabidopsis genome-wide homogeneous and specific RT-PCR primer-pairs. BMC Bioinf. 2006, 7, 179. (31) Alkio, M.; Tabuchi, T. M.; Wang, X.; Colon-Carmona, A. Stress responses to polycyclic aromatic hydrocarbons in Arabidopsis include growth inhibition and hypersensitive response-like symptoms. J. Exp. Bot. 2005, 56, 2983–2994.

ARTICLE

(32) Liu, H.; Weisman, D.; Ye, Y.; Cui, B.; Huang, Y.; ColonCarmona, A.; Wang, Z. An oxidative stress response to polycyclic aromatic hydrocarbon exposure is rapid and complex in Arabidopsis thaliana. Plant Sci. 2009, 176, 375–382. (33) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, 2nd ed.; Springer: New York, 2009. (34) Larsson, O.; Wennmalm, K.; Sandberg, R. Comparative microarray analysis. OMICS 2006, 10, 381–397. (35) Quackenbush, J. Microarray analysis and tumor classification. N. Engl. J. Med. 2006, 354, 2463–2472. (36) Ramasamy, A.; Mondry, A.; Holmes, C. C.; Altman, D. G. Key issues in conducting a meta-analysis of gene expression microarray datasets. PLoS Med. 2008, 5, e184. (37) Morozova, O.; Hirst, M.; Marra, M. A. Applications of new sequencing technologies for transcriptome analysis. Annu. Rev. Genomics Hum. Genet. 2009, 10, 135–151. (38) Bø, T.; Jonassen, I. New feature subset selection procedures for classification of expression profiles. Genome Biol. 2002, 3, 1–11. (39) Grate, L. R. Many accurate small-discriminatory feature subsets exist in microarray transcript data: biomarker discovery. BMC Bioinf. 2005, 6, 97. (40) He, H.; Garcia, E. A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. (41) Allison, D. B.; Cui, X.; Page, G. P.; Sabripour, M. Microarray data analysis: From disarray to consolidation and consensus. Nat. Rev. Genet. 2006, 7, 55–65. (42) Ramel, F.; Sulmon, C.; Cabello-Hurtado, F.; Taconnat, L.; Martin-Magniette, M.-L.; Renou, J.-P.; Amrani, A. E.; Couee, I.; Gouesbet, G. Genome-wide interacting effects of sucrose and herbicide-mediated stress in Arabidopsis thaliana: novel insights into atrazine toxicity and sucrose-induced tolerance. BMC Genomics 2007, 8, 450. (43) Spurgeon, D. J.; Jones, O. A. H.; Dorne, J.-L. C. M.; Svendsen, C.; Swain, S.; St€urzenbaum, S. R. Systems toxicology approaches for understanding the joint effects of environmental chemical mixtures. Sci. Total Environ. 2010, 408, 3725–3734.

5138

dx.doi.org/10.1021/es200065f |Environ. Sci. Technol. 2011, 45, 5132–5138