Predicting Hepatotoxicity Using ToxCast in Vitro

Article pubs.acs.org/crt

Predicting Hepatotoxicity Using ToxCast in Vitro Bioactivity and Chemical Structure

Jie Liu,†,‡,§ Kamel Mansouri,†,§ Richard S. Judson,† Matthew T. Martin,† Huixiao Hong,∥ Minjun Chen,∥ Xiaowei Xu,‡,∥ Russell S. Thomas,† and Imran Shah*,†

† National Center for Computational Toxicology, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, United States
‡ Department of Information Science, University of Arkansas at Little Rock, Arkansas 72204, United States
§ Oak Ridge Institute for Science and Education, Oak Ridge, Tennessee 37831, United States
∥ Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, Arkansas 72079, United States

Supporting Information

ABSTRACT: The U.S. Tox21 and EPA ToxCast programs screen thousands of environmental chemicals for bioactivity using hundreds of high-throughput in vitro assays to build predictive models of toxicity. We represented chemicals based on bioactivity and chemical structure descriptors, then used supervised machine learning to predict in vivo hepatotoxic effects. A set of 677 chemicals was represented by 711 in vitro bioactivity descriptors (from ToxCast assays), 4,376 chemical structure descriptors (from QikProp, OpenBabel, PaDEL, and PubChem), and three hepatotoxicity categories (from animal studies). Hepatotoxicants were defined by rat liver histopathology observed after chronic chemical testing and grouped into hypertrophy (161), injury (101), and proliferative lesions (99). Classifiers were built using six machine learning algorithms: linear discriminant analysis (LDA), Naïve Bayes (NB), support vector machines (SVM), classification and regression trees (CART), k-nearest neighbors (KNN), and an ensemble of these classifiers (ENSMB). Classifiers of hepatotoxicity were built using chemical structure descriptors, ToxCast bioactivity descriptors, and hybrid descriptors. Predictive performance was evaluated using 10-fold cross-validation testing and in-loop, filter-based, feature subset selection. Hybrid classifiers had the best balanced accuracy for predicting hypertrophy (0.84 ± 0.08), injury (0.80 ± 0.09), and proliferative lesions (0.80 ± 0.10). Though chemical and bioactivity classifiers had a similar balanced accuracy, the former were more sensitive, and the latter were more specific. CART, ENSMB, and SVM classifiers performed the best, and nuclear receptor activation and mitochondrial functions were frequently found in highly predictive classifiers of hepatotoxicity.
ToxCast and ToxRefDB provide the largest and richest publicly available data sets for mining linkages between the in vitro bioactivity of environmental chemicals and their adverse histopathological outcomes. Our findings demonstrate the utility of high-throughput assays for characterizing rodent hepatotoxicants, the benefit of using hybrid representations that integrate bioactivity and chemical structure, and the need for objective evaluation of classification performance.



Received: December 5, 2014
DOI: 10.1021/tx500501h Chem. Res. Toxicol. XXXX, XXX, XXX−XXX
Chemical Research in Toxicology

INTRODUCTION

Tens of thousands of chemicals are in commercial use, and hundreds more are introduced each year. Current methods to estimate health risks of chemicals require guideline animal testing studies, which are time-consuming, resource intensive, and impractical for evaluating thousands of chemicals.1,2 As a result, only a small fraction of chemicals have adequate data for assessing potential hazards. This highlights the urgent need to develop more efficient and informative toxicity determination tools to characterize the bioactivity profiles of environmental chemicals.3 In vitro high throughput screening (HTS) assays combined with computational models could provide an alternative to traditional animal testing studies.4

The U.S. EPA’s Toxicity Forecaster (ToxCast) program uses hundreds of HTS assays to screen environmental chemicals for bioactivity.5,6 ToxCast Phase I was completed in 2009 and produced bioactivity data on 309 chemicals (primarily pesticides) using 627 HTS assay endpoints including biochemical assays, cell-based assays, cell-free assays, and multiplexed transcription reporter assays. ToxCast Phase II was completed in 2013 and screened an additional 750 chemicals, including food additives, industrial and consumer products, pharmaceuticals, and nanomaterials using 711 HTS assay endpoints. In all, ToxCast Phases I and II have produced data on 1,057 chemicals and more than 800 HTS assay endpoints.5,7,8 Many of these chemicals have been evaluated in short-term and long-term animal studies, and their adverse effects have been curated and stored in the Toxicity Reference Database (ToxRefDB).9,10 Together, ToxCast and ToxRefDB provide a valuable resource for mining relationships between in vitro bioactivity and in vivo outcomes (in animal studies). Here, we apply machine learning methods to build predictive models of chemically induced hepatotoxicity using these data.

Data mining and machine learning methods have been used to investigate empirical relationships between chemicals using ToxCast Phase I in vitro data and ToxRefDB in vivo effects. In hypothesis-driven studies, biological knowledge about adverse outcome pathways was employed to select chemicals or to prioritize and aggregate assays into new descriptors outside of the cross-validation loop. Machine learning algorithms were then used to build and evaluate biologically motivated classifiers, including liver cancer,6,11−13 reproductive toxicity,10,14 and developmental toxicity.15,16 The balanced accuracy (BA) of classifiers constructed based on domain knowledge was 0.65 (±0.05 SD) for rodent chronic cancer, 0.70 (±0.09 SD) for rat developmental toxicity, and 0.74 (±0.05 SD) for rat reproductive toxicity. However, using completely automated machine learning methods to classify all possible toxicological effects of ToxCast Phase I chemicals by algorithmic assay selection and aggregation (e.g., by genes, biological pathways, etc.) generally produced lower predictive performance with a median BA of 0.56 for rat liver lesions, 0.50 for rat developmental toxicity, and 0.52 for rat reproductive toxicity.17 Machine learning methods can identify reproducible associations in large-scale data sets; however, for complex domains, it can be difficult to guarantee that such associations are meaningful. Although the ToxCast data and ToxRefDB were the largest publicly available in vitro and in vivo data sets of environmental chemicals, it is plausible that a larger number of chemicals and a broader set of assays could produce more accurate and meaningful classifiers of toxicity.

Machine learning approaches have also been used extensively to mine associations between chemical structure and toxicity.18,19 Quantitative structure-activity relationships (QSAR) are classification and regression models based on empirical mappings between the molecular structural features of chemicals and their physical, chemical, or biological properties. According to the chemical similarity principle, compounds with similar structure are likely to exhibit similar activities. Using QSAR-based methods to classify ToxCast Phase I chemicals according to adverse outcomes in ToxRefDB has yielded relatively low predictive performance.17 More recently, it has been suggested that integrating chemical structure descriptors with bioactivity data could improve classification performance.20,21

One of the many issues in machine learning with toxicological data sets is that the prevalence of different outcomes can be highly variable.22 If there are an unequal number of positive and negative instances, classifiers can form generalizations based on the majority class while ignoring the minority. Using BA gives an objective assessment of performance, but it does not overcome this issue. Because class imbalance is an increasing issue for real-world problems, a number of solutions have been proposed. We conducted an initial analysis of the class imbalance problem for ToxCast data by comparing the classification performance of balanced and imbalanced data sets.

Our objective is to utilize all data from ToxCast Phases I and II, including 1,057 chemicals and 711 assay endpoints, to predict chemical-induced toxicity. In this work, we focused on hepatotoxicity because the liver plays an important role in transforming and removing chemicals from the body and is one of the most prevalent sites of their adverse effects. In addition, the liver is one of the most common target organs for environmental chemicals in long-term testing studies.23 Our approach for selecting chemical structure and bioactivity descriptors was unbiased. That is, we did not use any chemical or biological insight to select or aggregate descriptors, and the descriptors were iteratively (and objectively) selected in the cross-validation loop. We used six classification algorithms that construct decision boundaries in high-dimensional space using different techniques. Performance was assessed using 10-fold cross-validation. Because of the variable prevalence of toxic effects across ToxCast chemicals, we also analyzed the relationship between classification accuracy and balanced data sets, which have an equal number of positives and negatives. Our machine learning strategy is aimed at systematically evaluating the accuracy with which chronic hepatic outcomes for ToxCast chemicals can be predicted using in vitro bioactivity descriptors, chemical structure descriptors, and a combination of both descriptors (hybrid representation). Predictive models of this kind, if successful, can help prioritize the many thousands of chemicals in commerce for further evaluation.
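To make the class imbalance concern concrete, the following minimal sketch compares overall accuracy with balanced accuracy for a hypothetical classifier that labels every chemical as negative. The class counts (99 positives, 463 negatives) are those reported for proliferative lesions in this study; the function names and the trivial classifier are illustrative assumptions, not part of the published analysis.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, TN, FP) for labels coded 1 (toxic) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def accuracy(y_true, y_pred):
    """Fraction of all chemicals that are classified correctly."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fn + tn + fp)


def balanced_accuracy(y_true, y_pred):
    """Average of sensitivity (true positive rate) and specificity (true negative rate)."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Class counts reported for proliferative lesions: 99 positives, 463 negatives.
y_true = [1] * 99 + [0] * 463
# Hypothetical trivial classifier: every chemical is called negative.
y_majority = [0] * len(y_true)

acc = accuracy(y_true, y_majority)
ba = balanced_accuracy(y_true, y_majority)
print(round(acc, 3), ba)
```

The trivial classifier reaches an accuracy of about 0.82 while its BA is 0.5, which illustrates why accuracy alone can be misleading when negatives greatly outnumber positives.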



MATERIALS AND METHODS

Data Sources. Hepatotoxicity Data. All hepatotoxicity data were obtained from ToxRefDB,10 which contains the results of guideline animal studies. ToxRefDB lists hundreds of in vivo endpoints observed across different tissues in guideline animal testing studies. ToxRefDB is a dynamic resource that is publicly available (http://www.epa.gov/ncct/toxcast/data.html). In this analysis, we used 145 liver histopathology endpoints observed in rats after chronic oral administration of the test chemicals. Hepatic histopathologic effects were divided into three broad categories (hypertrophy, injury, and proliferative lesions) based on the terminology defined by domain experts.24 A complete list of the hepatic effects in ToxRefDB and the corresponding three hepatotoxicity categories is provided as Supporting Information, Table S1.

Bioactivity Data. The in vitro assay data were generated from the high-throughput screening of the 1,057 ToxCast Phase I and II compounds and are publicly available (http://www.epa.gov/ncct/toxcast/data.html, version Dec2013). The data collection was based on more than 700 HTS assay endpoints from ToxCast and Tox21 projects: ACEA Biosciences (ACEA); Apredica (APR); Attagene, Inc. (ATG); NovaScreen panel (NVS); Odyssey Thera (OT); and the Tox21 partnership. Each assay datum is reported as the chemical concentration (micromolar) at half maximal efficacy (AC50) or lowest effective concentration (LEC). For comparison across the assays, the AC50 values were set to 1,000,000 μM for the inactive chemicals and transformed using the following formula:

$$\mathrm{AC50}' = 6 - \log_{10}(\mathrm{AC50})$$

In this manner, the transformed AC50 values represent potency on a continuous ascending scale, with inactive chemicals mapped to 0.

Chemical Structural Data. The chemical structure descriptors were obtained from previously published data on 903 ToxCast chemicals.25 The chemical structure descriptors include 51 molecular descriptors generated using the QikProp software (Schrödinger, version 3.2)26 and 4,325 substructural fingerprints generated using publicly available SMARTS sets: FP3 (45), FP4 (121), and MACCS (145) from OpenBabel,27 PaDEL (3,243),28 and PubChem (771).29 The structural fingerprints were represented by binary strings, indicating the presence or absence of certain features.

Data Set Reduction. Of the 1,057 ToxCast chemicals, the 677 with both bioactivity data and chemical structure data were used for the analysis. We had positive findings for in vivo rat chronic hepatic effects for 214 out of 677 chemicals, which were grouped into three categories, including hypertrophy (161), injury (101), and proliferative lesions (99). Finding negative evidence for in vivo hepatic effects proved to be difficult. Hence, we selected 463 out of 677 chemicals with bioactivity data and chemical structure data as negatives for this analysis. Three types of descriptors were used to classify each hepatotoxicity category: ToxCast bioactivity, chemical structure, and a combination of chemical structure and bioactivity (hybrid representation). In order to reduce the number of potentially irrelevant features, we removed descriptors that were not evaluated in all 677 chemicals or were active in fewer than 5% of all 677 chemicals. This reduced the total number of chemical descriptors from 4,376 to 726 and bioactivity descriptors from 711 to 125. The input data file for machine learning combined scaled data of bioactivity descriptors and chemical structure descriptors, including 677 chemicals, 726 chemical descriptors, and 125 bioactivity descriptors (provided as Supporting Information). Table 1 summarizes the data sets used for supervised machine learning.

Table 1. Data Sets Used for Classification Analysis^a

| data sets | total chemicals | hypertrophy | injury | proliferative lesions | negative set | descriptors |
| --- | --- | --- | --- | --- | --- | --- |
| bioactivity | 677 | 161 | 101 | 99 | 463 | 125 ToxCast HTS assay endpoints |
| chemical | 677 | 161 | 101 | 99 | 463 | 726 chemical structure descriptors |
| bioactivity and chemical | 677 | 161 | 101 | 99 | 463 | 125 ToxCast HTS assay endpoints and 726 chemical structure descriptors |

^a The table summarizes the data sets for classification, including three types of descriptors: ToxCast bioactivity (125), chemical structure (726), and a combination of chemical structure and bioactivity (hybrid representation) (851), and four categories of liver toxicity: hypertrophy (161), injury (101), proliferative lesions (99), and negative set (463).

Supervised Machine Learning.
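As a preliminary to the classification steps, the data preparation described above (the AC50 transformation and the removal of sparse or incompletely measured descriptors) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy assay columns, the use of `None` for a missing measurement, and the rule that a transformed value greater than 0 counts as "active" are all assumptions made for the example.

```python
import math

INACTIVE_AC50_UM = 1e6  # AC50 assigned to inactive chemicals (micromolar)


def transform_ac50(ac50_um):
    """Map an AC50 in micromolar to the potency scale AC50' = 6 - log10(AC50)."""
    return 6.0 - math.log10(ac50_um)


def keep_descriptor(column, min_active_fraction=0.05):
    """Keep a descriptor only if it was evaluated for every chemical
    (no None entries) and is active in at least 5% of the chemicals.
    Assumption: a value greater than 0 means 'active'."""
    if any(value is None for value in column):
        return False
    n_active = sum(1 for value in column if value > 0)
    return n_active / len(column) >= min_active_fraction


# Inactive chemicals land at 0; more potent chemicals get larger values.
print(transform_ac50(INACTIVE_AC50_UM), transform_ac50(10.0), transform_ac50(1.0))

# Toy transformed descriptor columns for 40 hypothetical chemicals.
descriptors = {
    "assay_a": [0.0] * 39 + [2.5],       # active in 1 of 40 chemicals (2.5%)
    "assay_b": [0.0] * 30 + [1.2] * 10,  # active in 10 of 40 chemicals (25%)
    "assay_c": [None] + [3.0] * 39,      # not evaluated for one chemical
}
kept = [name for name, column in descriptors.items() if keep_descriptor(column)]
print(kept)
```

Only `assay_b` survives in this toy case: `assay_a` is too sparse and `assay_c` was not evaluated for every chemical. The same presence test applies naturally to binary fingerprint bits, where a set bit plays the role of an active call.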
Supervised machine learning (ML) was conducted using six different classification algorithms: linear discriminant analysis (LDA), Naïve Bayes (NB), support vector machines (SVM), classification and regression trees (CART), k-nearest neighbors (KNN), and an ensemble of these classifiers (ENSMB). LDA is a linear classification method based on the assumption that the data are normally distributed.14,15 The NB algorithm is a probabilistic method based on Bayes theorem that assumes that all features are independent.30 SVM approaches find decision boundaries that maximize the margin between the positive and negative classes of compounds. SVMs were trained using linear and radial basis function kernels (the corresponding classifiers are denoted as SVCL and SVCR, respectively).31,32 CART-based methods partition the feature space into a set of rectangles and then fit a simple model in each one.33,34 We trained the CART algorithm by pruning the decision trees to a maximum depth of 10. The KNN method assigns an observation the label chosen by a vote among its nearest neighbors.35 The KNN classifiers were tested using k = 3 nearest neighbors. The classifier ensemble (ENSMB) was based on a voting scheme in which the class of each example was determined by the majority vote across LDA, NB, SVCL, SVCR, CART, and KNN.

Cross-Validation Testing. Accuracy was objectively evaluated using 10-fold cross-validation testing. The data were randomly partitioned into 10 subsets of equal size: nine subsets were used for training, while the remaining subset was used for testing, and this was repeated 100 times. For each step in the cross-validation loop, the subset of best descriptors (i.e., those with the lowest p-values) was filtered using a t test to measure the univariate association between each descriptor and hepatotoxicity class. Classifiers using the six classification algorithms were built using the best 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, and 60 descriptors, and their accuracy for each testing data set was stored for subsequent statistical analysis. This allowed us to evaluate the relationship between classification performance and number of descriptors. We also recorded the frequency with which each descriptor was used for building classifiers. The 10-fold cross-validation testing was conducted for each classification algorithm and descriptor type for 10 to 60 descriptors.

Creating Balanced Data Sets. The ratio of positive chemicals to negative chemicals was considerably skewed: 1:2.88 (161:463) for hypertrophy, 1:4.58 (101:463) for injury, and 1:4.68 (99:463) for proliferative lesions (see Table 1). The full data set for each hepatotoxicity category contains an unequal number of positive and negative examples (referred to as imbalanced data). In order to reduce the bias in our results introduced by our choice of an unequal number of positive and negative instances, we sampled subsets of instances containing an equal number of positives and negatives. As a result, we generated balanced data sets for 10-fold cross-validation testing in addition to the imbalanced data sets. This produced the following undersampled balanced data sets: (a) 160 positives and negatives for hypertrophy, (b) 100 positives and negatives for injury, and (c) 90 positives and negatives for proliferative lesions. The undersampled balanced data sets were analyzed using the same approach.

Evaluating Performance. Classification performance was evaluated on each iteration of 10-fold cross-validation testing. A commonly used measure for evaluating the performance of classification models is the overall accuracy. However, accuracy provides a biased measure of performance for imbalanced data. We used sensitivity, specificity, and balanced accuracy (BA) to evaluate classification performance. Sensitivity is the true positive rate or the percentage of positive chemicals correctly predicted. Specificity is the true negative rate or the percentage of negative chemicals correctly predicted. BA is the average of sensitivity and specificity. A flowchart depicting the workflow for the whole classification process is provided in Figure 1.

Software. Data processing and analysis were conducted in the Python programming language (version 2.7) using the scikit-learn package (version 0.14.1)31,36 for machine learning and the matplotlib package (version 0.99.1.2)37,38 for visualization. The source code is available upon request.
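The cross-validation loop with in-loop descriptor selection can be illustrated with a minimal, dependency-free sketch. It is not the authors' code (the study used scikit-learn): the function names, the synthetic data, the Welch t statistic used as the ranking score, the stratified fold assignment, and the use of a single 3-nearest-neighbor classifier are all assumptions chosen for illustration. The point it shows is that descriptors are re-ranked on the training folds only, so the held-out fold never influences which descriptors are selected.

```python
import math
import random
import statistics


def t_statistic(pos, neg):
    """Absolute Welch t statistic for one descriptor (assumed ranking score)."""
    se = math.sqrt(statistics.variance(pos) / len(pos)
                   + statistics.variance(neg) / len(neg))
    return abs(statistics.mean(pos) - statistics.mean(neg)) / se if se > 0 else 0.0


def select_top_k(X, y, k):
    """Rank descriptors by |t| on the given (training) rows; return k column indices."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((t_statistic(pos, neg), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def knn_predict(X_train, y_train, x, cols, k=3):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    dists = sorted((sum((row[j] - x[j]) ** 2 for j in cols), label)
                   for row, label in zip(X_train, y_train))
    votes = sum(label for _, label in dists[:k])
    return 1 if votes * 2 > k else 0


def balanced_accuracy(y_true, y_pred):
    """Average of sensitivity and specificity."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    return (sum(pos) / len(pos) + sum(1 - p for p in neg) / len(neg)) / 2


def cross_validate(X, y, n_descriptors, n_folds=10, seed=0):
    """One round of 10-fold CV with descriptor selection repeated inside every fold."""
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    rng.shuffle(pos_idx)
    rng.shuffle(neg_idx)
    # Assumption: folds are stratified so each one holds both classes.
    folds = [pos_idx[f::n_folds] + neg_idx[f::n_folds] for f in range(n_folds)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        cols = select_top_k(X_tr, y_tr, n_descriptors)  # training rows only
        y_hat = [knn_predict(X_tr, y_tr, X[i], cols) for i in test_idx]
        scores.append(balanced_accuracy([y[i] for i in test_idx], y_hat))
    return statistics.mean(scores), statistics.stdev(scores)


# Synthetic stand-in data: 200 "chemicals", 20 descriptors, the first 2 informative.
data_rng = random.Random(1)
y = [1] * 50 + [0] * 150
X = [[data_rng.gauss(4.0 * label if j < 2 else 0.0, 1.0) for j in range(20)]
     for label in y]
mean_ba, sd_ba = cross_validate(X, y, n_descriptors=5)
print(round(mean_ba, 2), round(sd_ba, 2))
```

Calling `cross_validate` with different seeds and with `n_descriptors` ranging from 10 to 60 would mirror the repeated trials and descriptor counts described in the text, with the other classifiers substituted for the nearest-neighbor vote.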

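The construction of undersampled balanced data sets can be sketched in a similar way. The class sizes and balanced-set sizes below are the ones reported in the text; random sampling without replacement, the function name, and the placeholder chemical identifiers are assumptions, since the text does not state how the subsets were drawn.

```python
import random


def undersample(positives, negatives, n_per_class, seed=0):
    """Draw n_per_class items from each class without replacement (assumed scheme)."""
    rng = random.Random(seed)
    return rng.sample(positives, n_per_class), rng.sample(negatives, n_per_class)


# (number of positives, size of each class in the balanced set), as reported.
reported = {
    "hypertrophy": (161, 160),
    "injury": (101, 100),
    "proliferative lesions": (99, 90),
}
n_negative = 463

ratios = {}
balanced = {}
for category, (n_positive, n_balanced) in reported.items():
    pos_ids = ["pos_%d" % i for i in range(n_positive)]  # placeholder identifiers
    neg_ids = ["neg_%d" % i for i in range(n_negative)]
    balanced[category] = undersample(pos_ids, neg_ids, n_balanced)
    ratios[category] = round(n_negative / n_positive, 2)
    print(category, "imbalance 1:%.2f" % ratios[category],
          "balanced set:", n_balanced, "positives and", n_balanced, "negatives")
```

The printed ratios reproduce the 1:2.88, 1:4.58, and 1:4.68 imbalance figures quoted for the three categories.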


RESULTS We report on the ML analysis for three hepatotoxicity categories using three representations: (i) ToxCast HTS bioactivity descriptors, (ii) chemical structure descriptors, and (iii) combined bioactivity and chemical structure descriptors. This analysis was conducted for (i) imbalanced data containing all chemicals for the hepatotoxicity categories and (ii) balanced data containing undersampled subsets of positive and negative C

DOI: 10.1021/tx500501h Chem. Res. Toxicol. XXXX, XXX, XXX−XXX

Article

Chemical Research in Toxicology

Figure 1. Workflow for the whole classification process. The figure shows the schematic overview of the data preprocessing, classifier development, and validation procedure performed in this study. Subset ToxCast (BIO) and chemical structure data (CHM) were used for the following data analysis. 214 chemicals with rat chronic liver effects from ToxRefDB (TOX) were grouped to three lesion categories. For each of the three liver lesion categories, hypertrophy, injury, and proliferative lesions, models were built using imbalanced data sets and undersampled balanced data sets for ToxCast bioactivity descriptors (BIO), chemical structure descriptors (CHM), and hybrid chemical and bioactivity descriptors (BC), respectively. Ten-fold cross-validation testing was used for evaluation and repeated 100 times across six machine learning algorithms, including linear discriminant ̈ Bayes (NB), classification and regression trees (CART0), k-nearest neighbor analysis (LDA), support vector machines (SVCL0, SVCR0), Naive (KNN1), and an ensemble of all classifiers (ENSMB). For each step in the cross-validation loop, the subset of best descriptors was filtered using a t test (i.e., with the lowest p-value) to measure the univariate association between hepatotoxicity class and the descriptor. Cross-validation performance results were recorded by mean balanced accuracy (BA), sensitivity, and specificity. Most frequently selected descriptors for model building were also stored for the analysis.

Summary of Results for Imbalanced and Balanced Data Sets. The relationship between BA and the descriptors for each of the six classification methods for the nine imbalanced data sets is shown in Figure 2. We found BA generally increased with the number of descriptors, and the increase was more pronounced for chemical descriptors than for bioactivity descriptors. This trend was observed in both the balanced and the imbalanced data sets (results not shown). The performance of most classifiers was improved when more

examples. There were nine groups of data sets based on three hepatotoxicity categories and three representations. For each data set, we conducted 10-fold cross-validation testing with inloop descriptor selection to evaluate the performance of six classification algorithms, including LDA, SVM (SVCR, SVCL), NB, CART, KNN, and ENSMB. The ML strategy used these classifiers to systematically evaluate predictive associations between the descriptors and liver toxicity. Our evaluation of these results is reported in the following sections. D

DOI: 10.1021/tx500501h Chem. Res. Toxicol. XXXX, XXX, XXX−XXX

Article

Chemical Research in Toxicology

Figure 2. Cross-validation performance results of classifiers for imbalanced data sets. The figure shows the 10-fold cross-validation performance results for the three data sets (rows) including hypertrophy (Hyp), injury (Inj), and proliferative lesions (Pro), and three types of descriptors (columns) including chemical structure descriptors (chm), ToxCast bioactivity descriptors (bio), and hybrid chemical and bioactivity descriptors (bc). The x-axis shows the number of descriptors (Nd), and the y-axis shows the mean balanced accuracy (Bal Acc). Each curve shows the relationship between the mean balanced accuracy and number of descriptors for different classification algorithms including linear discriminant ̈ Bayes analysis (LDA), support vector classifier with linear kernel (SVCL0), support vector classifier with radial basis function kernel (SVCR0), Naive (NB), classification and regression trees (CART0), k-nearest neighbor (KNN1), and an ensemble of all classifiers (ENSMB). The number of observations (Nobs) used for each data set are shown in the title.

descriptors were used; however, additional bioactivity descriptors did not increase the performance of NB and KNN. For the three representations, the predictive accuracy increased in the order proliferative lesions, injury, and hypertrophy, with a mean BA across all classifiers of 0.72, 0.73, and 0.77, respectively. The BA of all classifiers across the balanced and imbalanced data sets (Figure 3) shows that CART, SVM, and ENSMB produced the most accurate classifiers across all data sets. The performance of classifiers built using chemical and bioactivity descriptors was generally similar; however, classifiers using hybrid descriptors produced a modest increase in predictive accuracy. The detailed cross-validation testing results for the nine groups of imbalanced data (three different representations and three different toxicity categories) are shown in Table 2. Each row in Table 2 summarizes the predictive accuracy of a classification method for a toxicity category using a given number of descriptors in terms of BA, sensitivity, and specificity for each of the three representations. In addition to the mean BA, mean sensitivity, and mean specificity, Table 2 also shows

the standard deviation of the different performance measures across 10-fold cross-validation testing trials. Predicting Chronic Rat Liver Hypertrophy. Using the imbalanced data, the best BA for predicting hypertrophy across all classifiers and representations was 0.84 ± 0.08. Hybrid classifiers (BA 0.77 ± 0.08) consistently, albeit marginally, outperformed classifiers built using chemical descriptors (BA 0.74 ± 0.08) or bioactivity descriptors (BA 0.75 ± 0.07) (see Table 2). On average, bioactivity classifiers outperformed chemical classifiers 4/7 times (57%), both classifiers were tied 1/7 times (14%), and chemical classifiers outperformed bioactivity classifiers 2/7 times (29%) (see Figure 3). The imbalanced data sets produced classifiers that were more specific than they were sensitive, which was not surprising given the much larger number of negative examples. Hybrid classifiers had greater sensitivity (0.62 ± 0.16) compared to chemical classifiers (0.58 ± 0.16) or bioactivity classifiers (0.58 ± 0.15). Specificity of hybrid classifiers (specificity 0.94 ± 0.07) were similar to bioactivity classifiers (specificity 0.94 ± 0.08), and both were greater than chemical classifiers (specificity 0.92 ± E

DOI: 10.1021/tx500501h Chem. Res. Toxicol. XXXX, XXX, XXX−XXX

Article

Chemical Research in Toxicology

Figure 3. Cross-validation performance results of classifiers for imbalanced and balanced data sets. The figure shows the 10-fold cross-validation performance results for the three data sets (rows) including hypertrophy (Hyp), injury (Inj), and proliferative lesions (Pro), and two types of data sample imbalanced and balanced data. The x-axis shows the seven classification algorithms, including linear discriminant analysis (LDA), support ̈ Bayes (NB), classification and vector classifier with linear kernel (SVCL0), support vector classifier with radial basis function kernel (SVCR0), Naive regression trees (CART0), k-nearest neighbor (KNN1), and an ensemble of all classifiers (ENSMB). The y-axis shows the mean balanced accuracy (MBA). Each curve shows the relationship between the mean balanced accuracy and different classification algorithms for three types of descriptors, including chemical structure descriptors (chm), ToxCast bioactivity descriptors (bio), and hybrid chemical and bioactivity descriptors (bc).

the differences in performance were small. Hybrid classifiers had the greater sensitivity (0.53 ± 0.19) compared to chemical classifiers (0.52 ± 0.22) or bioactivity classifiers (0.51 ± 0.17). Bioactivity classifiers had a greater specificity (0.96 ± 0.07) than hybrid classifiers (specificity 0.95 ± 0.06) or chemical classifiers (specificity 0.94 ± 0.07). Overall, NB produced the most sensitive classifiers, while SVCR produced the most specific classifiers, and CART produced classifiers with the greatest BA. The predictive performance results for injury from the balanced data showed similar overall trends between the different representations, higher mean values for BA and sensitivity, but also higher variability when compared to the imbalanced data (see Supporting Information, Table S2). Predicting Chronic Rat Liver Proliferative Lesions. Predictions of chronic liver proliferative lesions across all representations and classifiers for the imbalanced data had a maximum BA of 0.80 ± 0.10, which was lower than that of hypertrophy (BA 0.84 ± 0.08) but similar to that of injury (BA 0.81 ± 0.11). On average, hybrid classifiers (BA 0.72 ± 0.09) showed a marginal gain in performance over chemical classifiers (BA 0.70 ± 0.09) and bioactivity classifiers (BA 0.70 ± 0.09)

0.07). Hybrid CART classifiers with 60 descriptors (denoted as, BC/CART0/60) had the maximum BA of 0.84 ± 0.08, whereas chemical SVCL classifiers had the lowest BA 0.69 ± 0.07 (CHM/SVCL0/60). The overall trends in BA were similar between balanced and imbalanced data sets for predicting hypertrophy, but classifiers from balanced data sets had greater variability and sensitivity was generally higher (see Supporting Information, Table S2). Predicting Chronic Rat Liver Injury. Predictions of chronic liver injury across all representations and classifiers for the imbalanced data had a maximum BA of 0.81 ± 0.11, which was lower than hypertrophy (BA 0.84 ± 0.08). On average, hybrid classifiers (BA 0.73 ± 0.09) showed a marginal gain in performance over chemical classifiers (BA 0.72 ± 0.10) and bioactivity classifiers (BA 0.70 ± 0.08) (see Table 2). The chemical descriptors produced the most accurate classifier with a BA of 0.81 ± 0.11 (CHM/CART0/60) and the least accurate classifier with a BA of 0.65 ± 0.09 (CHM/SVCL0/60). Comparing the maximum BA across the seven classifiers for the three representations, the hybrid descriptors outperformed bioactivity and chemical descriptors 4/7 times (57%), though F

DOI: 10.1021/tx500501h Chem. Res. Toxicol. XXXX, XXX, XXX−XXX

Article

Chemical Research in Toxicology Table 2. Maximum Predictive Performance of Different Classification Methods for the Imbalanced Dataa

a

The table summarizes the results of 10-fold cross-validation testing for the three imbalanced data sets (toxicity) including hypertrophy (Hyp), injury (Inj), and proliferative lesions (Pro). The results are broken down by different classification algorithms (classifier) including linear discriminant analysis (LDA), support vector classifier with linear kernel (SVCL0), support vector classifier with radial basis function kernel ̈ Bayes (NB), classification and regression trees (CART0), k-nearest neighbor (KNN1), and an ensemble of all classifiers (ENSMB). (SVCR0), Naive The results for each data set and classification method are summarized by the mean balanced accuracy (BA), the number of descriptors that produced the best BA, mean sensitivity, and mean specificity. The standard deviation is given in parentheses. The performance results for three types of descriptors are shown: chemical structure descriptors (CHM), ToxCast bioactivity descriptors (BIO), and hybrid chemical and bioactivity descriptors (BC). For instance, the first row shows the 10-fold cross-validation testing performance for predicting rat liver hypertrophy by CART0 using the imbalanced data sets. The maximum BA of CART0 for predicting hypertrophy was 0.84, 0.79, and 0.82 using hybrid descriptors, 50 bioactivity descriptors, and 55 chemical descriptors, respectively. Furthermore, the mean sensitivity/specificity for the hybrid, bioactivity, and chemical by CART0 was 0.74/0.94, 0.64/0.95, and 0.71/0.93, respectively.

(see Table 2). The hybrid descriptors produced the most accurate classifier with a BA of 0.80 ± 0.10 (BC/CART0/60), and the least accurate classifier was produced by chemical descriptors with a BA of 0.62 ± 0.08 (CHM/SVCL0/60). Comparing the maximum BA across the seven classifiers for the three representations, the hybrid descriptors outperformed bioactivity and chemical descriptors 3/7 times (43%), but the differences in performance were small. Hybrid classifiers had the greater sensitivity (0.51 ± 0.19) compared to that of chemical classifiers (0.47 ± 0.19) or bioactivity classifiers (0.50 ± 0.18). Bioactivity classifiers had a greater specificity (0.96 ± 0.07) than hybrid classifiers (specificity 0.95 ± 0.07) or chemical classifiers (specificity 0.95 ± 0.06). Overall, NB produced the most sensitive classifiers, while SVCR produced the most specific classifiers, and CART produced classifiers with the greatest BA. The performance results for the balanced data showed higher mean values and higher variability when compared to the imbalanced data, but there were similar overall trends between the different representations (see Supporting Information, Table S2). Only CART-based chemical classifiers were more accurate than bioactivity classifiers for predicting hypertrophy and injury

but not proliferative lesions. The BA of chemical and bioactivity classifiers for predicting hypertrophy was 0.82 ± 0.09 (CHM/CART0/55) and 0.79 ± 0.07 (BIO/CART0/50), respectively. The mean BA for predicting hypertrophy, injury, and proliferative lesions for imbalanced data and hybrid descriptors was 0.84 ± 0.08 (BC/CART0/60), 0.80 ± 0.09 (BC/CART0/60), and 0.80 ± 0.10 (BC/CART0/60), respectively. The hybrid classifiers generally outperformed the bioactivity and chemical classifiers (based on BA) for predicting liver toxicity; however, the gains in performance were quite modest. Out of the seven hybrid classifiers, gains in predictive performance were observed for hypertrophy (86%), injury (57%), and proliferative lesions (43%). There was rarely a loss in predictive performance of classifiers when chemical and bioactivity descriptors were combined.

Bioactivity Descriptors Frequently Used in the Classifiers. The 36 bioactivity descriptors most frequently used in (at least 30% of) classifiers are visualized in Figure 4. The heatmap summarizes the mean values of the bioactivity descriptors (given in columns) for negative chemicals and for chemicals that produced hypertrophy, injury, or proliferative lesions (given in rows). Each row of the heatmap can be interpreted as

DOI: 10.1021/tx500501h Chem. Res. Toxicol. XXXX, XXX, XXX−XXX


Figure 4. Visualizing bioactivity descriptors most frequently selected in classifying hepatotoxicity using imbalanced data. The heatmap summarizes the relationship between toxicity categories (rows), including negative for liver toxicity (Neg), hypertrophy (Hyp), injury (Inj), and proliferative lesions (Pro), and bioactivity descriptors (columns). The standardized values of the descriptors (and colors) are interpreted as follows: close to the mean (yellow), greater than the mean (reds), or less than the mean (blues). The hierarchical clustering further organizes the descriptors into groups (labeled, A, B, C, and D).

a bioactivity “signature” of chemicals across different liver toxicity categories. Column-wise, the heatmap shows the (standardized) mean values of the descriptors, highlighting differences across the liver toxicity categories. The hierarchical clustering (using Euclidean distance as the similarity metric and complete linkage agglomerative clustering) further organizes the descriptors into groups (labeled, A, B, C, and D). The colors in the heatmap signify the distribution of mean values of bioactivity descriptors for chemicals across the four categories of liver effects. The standardized values of the descriptors (and colors) are interpreted as follows: close to the mean (yellow), greater than the mean (reds), or less than the mean (blues). For instance, the mean value of the 13th bioactivity descriptor from the left (labeled, “Tox21 Aromatase Inhibition”) was much higher across chemicals that produced proliferative lesions than those that produced no liver toxicity. Further details about the 36 bioactivity descriptors most frequently used in the ML analysis for classifying liver effects are given in Supporting Information, Table S3. The hierarchical clustering was used to visually divide the bioactivity signatures into four main groups (labeled A−D in Figure 4). Group A included 12 descriptors whose mean values were generally greater for chemicals that produced any liver toxicity (than those chemicals that did not produce any liver effects). Four out of the 12 descriptors in group A measured chemical interactions with proteins (NovaScreen panel, NVS)8,15,39 including human pregnane X receptor (PXR), cytochrome P450 2C19 (CYP2C19) and 1A2 (CYP1A2), and rat translocator protein Tspo. Another 4/12 descriptors assessed transcription factor activities in human hepatoma

HepG2 cells (Attagene panel, ATG) by measuring the expression of reporters associated with specific regulatory elements (cis-acting elements)40 including PXR activity, vitamin D receptor (VDR) activity, nuclear factor (erythroid-derived 2)-like 2 (NRF2) activity, and peroxisome proliferator activated receptor activity (PPAR). One of the descriptors measured the recruitment of steroid receptor coactivator-1 (SRC) by the ligand-dependent activation of the farnesoid X receptor (FXR).41 Finally, three of the 12 descriptors measured changes in mitochondrial function and cell loss in HepG2 cells using high-content imaging (Apredica panel, APR).42,43 Descriptors in group A were selected by the ML analysis to predict liver toxicity, and the associated assays measure functions of proteins involved in xenobiotic sensing (PXR, FXR, and VDR), xenobiotic metabolism (CYP2C19 and CYP1A2), and chemical-induced stress response pathways (NRF2). Bioactivity group B included five descriptors whose mean values were generally greater for chemicals that produced injury and, in some cases, proliferative lesions. Three out of the five descriptors in group B included protein activity assays (NVS) for human cytochrome P450 2B6 (CYP2B6), mitochondrial benzodiazepine receptor (MBR; also known as translocator protein, TSPO), and norepinephrine transporter (NET), which is also known as solute carrier protein 6A2 (SLC6A2). The remaining two descriptors in group B measured transcription factor activities in HepG2 cells (using cis-acting elements, ATG) including bone morphogenetic protein (BMP) and octamer binding protein (OCT1/POU2F1). Group B contained a diverse set of proteins involved in xenobiotic


Figure 5. Visualizing chemical structure descriptors most frequently selected in classifying hepatotoxicity. The heatmap summarizes the relationship between toxicity categories (rows), including negative for liver toxicity (Neg), hypertrophy (Hyp), injury (Inj), and proliferative lesions (Pro), and chemical structure descriptors (columns). The colors in the heatmap signify the distribution of mean values of chemical structure descriptors for chemicals across the four categories. The standardized values of the descriptors (and colors) are interpreted as follows: close to the mean (yellow), greater than the mean (reds), or less than the mean (blues). The hierarchical clustering was used to visually divide the chemical signature across the four categories into five main groups (labeled, A−E).

based on high-content imaging of nuclei in HepG2 cells. A majority of the descriptors in group D measured the activation of nuclear receptor proteins that are involved in the endocrine system. Chemical Descriptors Frequently Used in the Classifiers. The 55 chemical descriptors most frequently used in (at least 30% of) classifiers are visualized in Figure 5 (similar to Figure 4). The heatmap summarizes the mean values of the chemical structure descriptors (given in columns) for chemicals across different liver toxicity categories (given in rows). The colors in the heatmap signify the distribution of mean values of chemical structure descriptors for chemicals across the four categories of liver effects. The standardized values of the descriptors (and colors) are interpreted as follows: close to the mean (yellow), greater than the mean (reds), or less than the mean (blues). For instance, the mean value of the second chemical descriptor from the left (labeled as “MACCS 138”) was much higher across chemicals that produced no liver toxicity than those that produced liver toxicity. Further details about the 55 chemical descriptors most frequently used in the ML analysis for classifying liver injury are given in Supporting Information, Table S4. The hierarchical clustering was used to visually divide the chemical signature across the four liver toxicity categories into five main groups (labeled A−E in Figure 5). Group A included eight descriptors whose mean values were generally greater for chemicals that produced any liver toxicity (than those chemicals that did not produce any liver effects). Group A included the following eight descriptors: PUBCHEM descriptors 17, 285, and 351; MACCS descriptors 141, 149, 160, and 164; and QplogKp. 
Group B contained nine descriptors whose mean values were generally greater for chemicals that produced proliferative liver lesions and liver injury, including PubChem descriptors 340, 376, 416, 464, 524, 556, 640 and 660, and PaDEL descriptor KR4080. Group C contained 21 descriptors whose mean values were generally greater for chemicals that

metabolism (CYP2B6), transporters (TSPO and SLC6A2), and transcription factors involved in regulating diverse pathways (BMP and POU2F1). Bioactivity group C contained 10 descriptors whose mean values were generally higher for chemicals that produced proliferative liver lesions. Two out of the 10 descriptors in this group included protein activity assays for human androgen receptor (AR) and G-protein coupled opiate receptor mu (OPRM1). Six out of the 10 descriptors measure transcription factor activities including human retinoic acid receptor alpha (RARA) activity in HepG2 cells, thyroid hormone receptor beta (THRB) antagonism in rat pituitary GH3 cells, estrogen receptor alpha (ESR1) antagonism in human embryonic kidney (HEK293T) cells, peroxisome proliferator-activated receptor gamma (PPARG) activity in HEK293T cells, aromatase (CYP19A1) inhibition in human MCF7 cells, and AR antagonism in HEK293T cells. One descriptor measured mitochondrial toxicity in HepG2 cells. The last of the 10 descriptors measured recruitment of steroid receptor coactivator-1 (SRC) by ligand-dependent activation of androgen receptor (AR). Most of the descriptors in group C measured the inhibition of proteins involved in the endocrine system, including receptors for thyroid hormone, estrogen, and testosterone. Finally, Group D contained eight descriptors whose mean values were generally higher for chemicals that either produced no liver injury or elicited proliferative lesions in some cases. Two of the eight descriptors in group D measured chemical interactions with the human and mouse estrogen receptor alpha. Four out of the eight descriptors measured transcription activities including: human retinoid X receptor, beta (RXRβ) activity in HepG2 cells, estrogen receptor alpha (ESR1) activity in HepG2 cells, and human androgen receptor (AR) activity in HEK293T cells. Another descriptor in group D measured ERα activity based on a nuclear receptor dimerization assay. 
The last of the eight descriptors measured an increase in mitotic arrest


al.20 However, our results are in sharp contrast with a recent study by Thomas et al. in which the mean predictive accuracy of classifying in vivo outcomes using ToxCast data was found to be approximately 0.5.17 In their analysis, Thomas et al. only used the ToxCast Phase I data set (containing 309 chemicals), whereas we have used both Phase I and II data (containing 1,057 chemicals). Upon conducting the complete ML analysis using just the 309 Phase I chemicals, we also found mean BA values for the imbalanced data close to 0.53, 0.61, and 0.60 for hypertrophy, injury, and proliferative lesions, respectively (see Supporting Information, Table S5 and Figure S1 for additional details). Hence, we believe the variation in performance could be explained by differences in the data used for the ML analysis. Although we have used an objective and unbiased approach to construct and evaluate predictive models (i.e., classifiers) of hepatotoxicity, it is also important to interpret the biological and chemical relevance of these models. Finding biologically relevant mappings between in vitro bioactivity and long-term in vivo outcomes is extremely challenging. This is because toxicity involves a complex sequence of dynamic molecular, cellular, and tissue changes along pathways from the initial biological targets to adverse outcomes.44 The adverse outcome pathway (AOP) framework provides a construct to accumulate available information and a unique approach to critically review predictive models for which information is available on the molecular initiating events as well as adverse outcomes, such as histopathological lesions. Supervised ML systematically searched for and identified higher order relationships between bioactivity descriptors and adverse hepatic outcomes. The ToxCast descriptors most frequently selected for classifying liver toxicity outcomes are summarized as a bioactivity signature (shown in Figure 4).
This bioactivity signature hierarchically clusters chemical bioactivity values across liver toxicity categories to organize them into four main groups (labeled, A, B, C, and D). Group A contained descriptors that showed some activity of chemicals across all types of liver toxicity. For example, CYP2C1945 is one of the key enzymes responsible for metabolizing drugs and other foreign chemicals in humans. Groups B and C contained descriptors that were more specific for liver injury and proliferative lesions, respectively. Finally, group D descriptors were more specific for chemicals that generally did not produce hepatotoxicity. This hierarchical clustering of the ToxCast assay endpoints recapitulates the four liver toxicity categories that were used for classification. It is important to point out that these ToxCast assay endpoints were selected because they were most predictive and reproducible across the chemicals in this data set. Targets of ToxCast assay endpoint liver toxicity signatures (i.e., groups A to D) provided useful insight into hepatic AOPs. ToxCast assay endpoints associated with group A measure chemical-induced changes in xenobiotic-sensing nuclear receptor (NR) proteins (PXR, FXR, and VDR); chemical interactions with xenobiotic metabolism proteins (CYP2C19 and CYP1A2), which are transcriptionally regulated by NR; chemical-induced stress pathway activation (NRF2); and downstream changes in mitochondrial function and cell viability. Activation of NR is postulated to be a molecular initiating event in many pathways leading to liver toxicity and cancer.46 The predictive models we identified in this analysis may be related to AOPs initiated by PXR, FXR, and VDR, which can lead to a range of adverse hepatic effects. Once

produced any type of liver toxicity. Group D contained nine descriptors whose values were generally greater for chemicals that were either nonhepatotoxic or that produced liver injury. Finally, Group E contained eight descriptors that were generally greater for chemicals that produced no liver toxicity.
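The heatmap construction described above (standardizing each descriptor's category means, then grouping descriptors by complete-linkage agglomerative clustering with Euclidean distance) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy values, and the choice to cut the tree at a fixed number of groups are all assumptions.

```python
import math
from statistics import mean, pstdev


def standardize_columns(rows):
    """Z-score each column (descriptor) across the rows (toxicity categories)."""
    profiles = []
    for col in zip(*rows):
        mu, sigma = mean(col), pstdev(col)
        profiles.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in col])
    return profiles  # one standardized profile per descriptor


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def complete_linkage(profiles, n_groups):
    """Merge the two closest clusters until n_groups remain.

    Cluster-to-cluster distance is the maximum pairwise distance
    between their members (complete linkage).
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_groups:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical mean descriptor values; rows are Neg, Hyp, Inj, Pro,
# columns are four made-up descriptors (not values from Figure 4 or 5).
category_means = [
    [0.10, 0.12, 0.50, 0.48],
    [0.40, 0.38, 0.20, 0.22],
    [0.42, 0.41, 0.15, 0.18],
    [0.45, 0.44, 0.10, 0.12],
]
profiles = standardize_columns(category_means)
groups = complete_linkage(profiles, n_groups=2)
print(groups)
```

In this toy input the first two descriptors are elevated for every toxic category and the last two are elevated for negatives, so the two recovered groups play the role of the "any liver toxicity" and "no liver toxicity" groups described in the text.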



DISCUSSION

We used in vitro high-throughput data to predict in vivo effects as a proof of principle for prioritizing the many thousands of chemicals in commerce for further evaluation. The ToxCast project is the largest collection of chemical bioactivity data that can be used in combination with legacy in vivo animal toxicology data to predict the hazard of new environmental chemicals. In addition to in vitro bioactivity, we also used the molecular structure of chemicals to mine associations with adverse in vivo effects. We conducted a systematic and objective machine learning analysis of a subset of ToxCast Phase I and II data (defined by 677 chemicals, 125 ToxCast bioactivity descriptors, and 726 chemical structure descriptors) to predict chronic hepatotoxicity in rats. Our findings show that (1) some long-term rodent hepatic in vivo effects of chemicals may be predicted with varying accuracy using in vitro bioactivity assays, (2) imbalanced data sets, with an unequal number of positive and negative examples, can bias the predictive accuracy of classifiers, and (3) combining in vitro bioactivity with chemical structure descriptors modestly improves predictive accuracy.

The predictive accuracy of classifiers using either ToxCast bioactivity or chemical structure descriptors was similar, but a combination of descriptors generally improved performance for both imbalanced data sets (Table 2) and balanced data sets (Supporting Information, Table S2). However, the gains in performance for hybrid classifiers over chemical and bioactivity classifiers were generally modest. Using balanced data from random undersampling improved the sensitivity (true positive rate) but reduced the specificity (true negative rate) of classifiers compared with those of imbalanced data. In general, bioactivity classifiers were more specific than chemical classifiers, but hybrid classifiers had the greatest specificity.
Furthermore, chemical classifiers were more sensitive than bioactivity classifiers, but hybrid classifiers had the greatest sensitivity. ToxCast bioactivity classifiers of hepatotoxicity performed marginally better than chemical classifiers for the same number of descriptors for imbalanced data (data not shown). When we used balanced data sets, these trends were less evident for injury and for proliferative lesions. Using just imbalanced data sets for this analysis may lead us to conclude a greater relevance of bioactivity descriptors for predicting hepatotoxicity.22 However, the results of the balanced data sets, which may offer a more objective assessment of performance, suggest that such claims may only be limited to hypertrophy. Additional work on comparing balanced and imbalanced data sets is needed to more accurately evaluate predictive models of hepatotoxicity. Hybrid classifiers performed better than bioactivity or chemical structure classifiers alone. These findings were consistent across balanced or imbalanced data sets. Hence, it is plausible that using a diverse representation for chemicals that leverages the synergies between chemical structure features, molecular mechanisms, and cellular functions is more relevant for building predictive models of toxicity. The notion of combining chemical and bioactivity descriptors for predicting toxicity has also been recently reported by Low et


descriptors in group C provide such putative markers of proliferative lesions (covering preneoplastic and neoplastic hepatic pathologies). The descriptors in group C are associated with perturbations of NR including: RARA, AR, ESR1, THRB, and PPARG. Persistent stimulation of NR is believed to be a molecular initiating event in many pathways leading to liver cancer.46 Previously, we used ToxCast Phase I data to show that NR activities of chemicals are associated with liver histopathological effects, including hypertrophy, injury, and proliferative lesions.12 Our analysis of the much larger set of ToxCast Phase I and II chemicals using a completely unbiased approach reinforces these findings and further highlights the biological relevance of predictive models identified by mining in vitro to in vivo associations. Bioactivity descriptors can be readily interpreted in an AOP context by mapping their target genes to molecular or cellular mechanisms. However, it can be difficult to evaluate the physicochemical relevance of chemical structure descriptors (Figure 5). Descriptors in group C have mean values generally greater for chemicals that produced any type of liver toxicity. Most of the 21 descriptors in group C correspond to chlorinated hydrocarbon substructures. Chemicals with chlorinated hydrocarbons have been tested and verified for their liver toxicity.63 For example, 2-chlorophenol induced all three categories of liver lesions according to the animal studies stored in ToxRefDB and has the substructure of fingerprint PaDEL_KR3869, which is included in group C.64,65 Developing predictive models of toxicity depends on appropriately labeled positive and negative examples of chemicals. Since the liver is the primary organ involved in metabolizing xenobiotics, most chemicals produce some hepatic effects in long-term animal testing studies. So it is easy to identify positive examples of liver toxicity.
However, when chemicals do not elicit hepatic effects after chronic exposure they cannot easily be assumed to be negative. This is because a lack of hepatic effects can be due to insufficient dosing or other pharmacokinetic factors. When negative examples are scarce, unlabeled examples have been used to build predictive models.66 Faced with a similar problem, we considered the untested chemicals (i.e., chemicals with bioactivity and structure data but no in vivo testing results) as negative examples. Next, we randomly undersampled balanced subsets of data to rigorously evaluate the performance of different classification methods by cross-validation testing. Our analysis suggests that predictive models of chronic rat hepatotoxicity can be built using bioactivity and structure data, but additional negative examples may be necessary for improving accuracy. While ToxCast and ToxRefDB provide the largest publicly available data sets for in vitro and in vivo toxicity data, they still only represent a small fraction of all the environmental chemicals. For example, the 677 chemicals used in this study are mostly pesticidal and industrial chemicals and thus might represent a subset of chemical space. In addition, the 711 biological assay endpoints likely only provide a limited view of the important molecular and cellular events that are part of hepatotoxic AOPs. Furthermore, we only focused on three categories of hepatic histopathological lesions, which may not adequately capture the diverse range and complexity of chemical-induced liver alterations.
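The random undersampling used to build balanced subsets can be illustrated with a short sketch. It assumes binary labels (1 for a positive example of a lesion category, 0 for a negative) and draws equal numbers of each class; the function name, the seed, and the label vector are hypothetical, with the counts chosen only to echo the 677 chemicals and 161 hypertrophy positives reported in the abstract.

```python
import random


def undersample(labels, seed=0):
    """Return indices of a class-balanced subset via random undersampling.

    The larger class is sampled down to the size of the smaller class.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(positives), len(negatives))
    chosen = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(chosen)  # avoid a positives-first ordering before cross-validation
    return chosen


# Hypothetical label vector: 161 positives among 677 chemicals.
# Treating all remaining chemicals as negatives is an assumption of this sketch.
labels = [1] * 161 + [0] * (677 - 161)
subset = undersample(labels, seed=42)
n_pos = sum(labels[i] for i in subset)
print(len(subset), n_pos)
```

Repeating the draw with different seeds and cross-validating on each balanced subset is one way to average out the information discarded from the majority class, which may be part of why the balanced results show higher variability.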

activated, NRs translocate to the nucleus and cause the induction of xenobiotic metabolizing enzymes (e.g., CYP2C19 and CYP1A2). This is an adaptive response, which is also observed histopathologically as liver hypertrophy and enlargement.47 Short-term activation of NR causes significant increases in xenobiotic metabolism and in reactive oxygen species, which produce oxidative stress. NRF2, a well-known transcription factor, is activated by chemical-induced stress and is responsible for regulating cytoprotective pathways.48 Chemicals in group A not only increased the activity of NR but also directly interacted with xenobiotic metabolizing enzymes and inhibited their activities. Inhibiting xenobiotic metabolizing enzymes reduces the rate at which chemicals are eliminated from the body and could potentially accentuate their adverse effects. Finally, mitochondrial disruption is one of the early events that precede cellular injury that can result in apoptotic or necrotic cell death. Thus, descriptors in group A are mechanistically related to important events that lead to hepatic hypertrophy and stress. When chemical-induced stress overwhelms the adaptive capacity of liver cells, they undergo phenotypic changes that are manifest histopathologically as organ injury, including steatosis, hydropic swelling, hepatocellular apoptosis or necrosis, inflammation, and regeneration. Bioactivity descriptors in group B were generally higher for chemicals that produced liver injury. These chemicals also induced changes in proteins involved in xenobiotic metabolism (CYP2B6), transporters (TSPO and SLC6A2), and transcription factors involved in regulating diverse pathways (SMAD1 and POU2F1). CYP2B6 is transcriptionally regulated by PXR, and it metabolizes xenobiotic chemicals. 
Xenobiotic metabolism generates reactive oxygen species (ROS) that have been implicated in liver injury.49,50 As a cholesterol and drug-binding protein that is found in the outer mitochondrial membrane, altered TSPO expression has been linked to a number of disease states.51 TSPO is believed to cause mitochondrial dysfunction by interacting with the mitochondrial membrane permeability transition pore;51 however, a recent study on TSPO knockout mice did not support this.52 Recently, TSPO has also been studied as a marker of hepatic injury due to its role in the progress of inflammation.53,54 SMAD1 is an intracellular effector of cytokines involved in cell differentiation especially in bone morphogenesis. These cytokines, referred to as bone morphogenetic proteins (BMPs), belong to the TGFβ superfamily of ligands that are involved in a broad range of cellular processes. In chronic liver injury, TGFβ/SMAD signaling is involved in maintaining the immune response during repair and negatively regulating proliferation of hepatic progenitor cells in hepatocarcinoma, among other functions.55 POU2F1 is a ubiquitous transcription factor that has been shown to be involved in cell cycle progression56 and in inflammation.57 The descriptors in group B are involved in hepatic injury, inflammation, and repair. Chronic exposure to levels of NR activating chemicals causes persistent cell death and ongoing regenerative proliferation, which is a homeostatic process necessary for liver recovery. 
Chronic regenerative proliferation can also lead to liver cancer, which is observed histopathologically as proliferative lesions including chronic hyperplasia, preneoplastic foci, and neoplastic effects (e.g., adenoma, carcinoma, and hemangioma).58,59 Because of the gradual progression of lesions in chemical-induced rodent hepatocarcinogenesis from hypertrophy, injury/repair, and regenerative proliferation to neoplasia, it is difficult to identify predictive markers.60−62 The ToxCast bioactivity




CONCLUSIONS

ToxCast and ToxRefDB provide the largest and richest publicly available data sets for mining linkages between the in vitro bioactivity of environmental chemicals and their adverse histopathological outcomes. Our findings demonstrate the utility of high throughput assays for characterizing the hepatotoxic outcomes in rodents and the benefit of using hybrid representations that integrate bioactivity and chemical structure. By using a rigorous approach to find and evaluate predictive models of hepatotoxicity, we have provided a compelling interpretation of the biological relevance of these models in terms of AOPs. Although further experimental evaluation of these models is still needed, these models have the potential to be used for prioritizing the hepatotoxic hazard of environmental chemicals.



ASSOCIATED CONTENT



AUTHOR INFORMATION

factor (erythroid-derived 2)-like 2; PPAR, peroxisome proliferator activated receptor; SRC, steroid receptor coactivator; FXR, farnesoid X receptor; MBR, mitochondrial benzodiazepine receptor; NET, norepinephrine transporter; SLC, solute carrier; BMP, bone morphogenetic protein; OCT/POU2F, octamer binding protein/POU domain, class 2, transcription factor; AR, androgen receptor; OPRM1, opioid receptor, mu 1; RAR, retinoic acid receptor; THR, thyroid hormone receptor; ESR1, estrogen receptor 1; RXR, retinoid X receptor; ER, estrogen receptor; AOP, adverse outcome pathway; NR, nuclear receptor; ROS, reactive oxygen species; TGF, transforming growth factor.



REFERENCES

(1) Anastas, P., Teichman, K., and Cohen-Hubal, E. A. (2010) Ensuring the safety of chemicals. J. Expo. Sci. Environ. Epidemiol. 20, 395−396.
(2) National Research Council (1984) Toxicity Testing: Strategies to Determine Needs and Priorities, The National Academies Press, Washington, DC.
(3) National Research Council (2007) Toxicity Testing in the 21st Century: A Vision and a Strategy, The National Academies Press, Washington, DC.
(4) Kavlock, R., and Dix, D. J. (2010) Computational toxicology as implemented by the U.S. EPA: providing high throughput decision support tools for screening and assessing chemical exposure, hazard and risk. J. Toxicol. Environ. Health, Part B 13, 197−217.
(5) Kavlock, R. J., Chandler, K., Houck, K. A., Hunter, S., Judson, R. S., Kleinstreuer, N., Knudsen, T. B., Martin, M. T., Padilla, S., Reif, D. M., Richard, A. M., Rotroff, D. M., Sipes, N., and Dix, D. J. (2012) Update on EPA’s ToxCast program: providing high throughput decision support tools for chemical risk management. Chem. Res. Toxicol. 25, 1287−1302.
(6) Judson, R. S., Houck, K. A., Kavlock, R. J., Knudsen, T. B., Martin, M. T., Mortensen, H. M., Reif, D. M., Rotroff, D. M., Shah, I., Richard, A. M., and Dix, D. J. (2010) In vitro screening of environmental chemicals for targeted testing prioritization: the ToxCast project. Environ. Health Perspect. 118, 485−492.
(7) Knudsen, T., Martin, M., Chandler, K., Kleinstreuer, N., Judson, R., and Sipes, N. (2013) Predictive models and computational toxicology. Methods Mol. Biol. 947, 343−374.
(8) Sipes, N. S., Martin, M. T., Kothiya, P., Reif, D. M., Judson, R. S., Richard, A. M., Houck, K. A., Dix, D. J., Kavlock, R. J., and Knudsen, T. B. (2013) Profiling 976 ToxCast chemicals across 331 enzymatic and receptor signaling assays. Chem. Res. Toxicol. 26, 878−895.
(9) Martin, M. T., Judson, R. S., Reif, D. M., Kavlock, R. J., and Dix, D. J. (2009) Profiling chemicals based on chronic toxicity results from the U.S. EPA ToxRef database. Environ. Health Perspect. 117, 392−399.
(10) Martin, M. T., Mendez, E., Corum, D. G., Judson, R. S., Kavlock, R. J., Rotroff, D. M., and Dix, D. J. (2009) Profiling the reproductive toxicity of chemicals from multigeneration studies in the toxicity reference database. Toxicol. Sci. 110, 181−190.
(11) Kleinstreuer, N. C., Dix, D. J., Houck, K. A., Kavlock, R. J., Knudsen, T. B., Martin, M. T., Paul, K. B., Reif, D. M., Crofton, K. M., Hamilton, K., Hunter, R., Shah, I., and Judson, R. S. (2013) In vitro perturbations of targets in cancer hallmark processes predict rodent chemical carcinogenesis. Toxicol. Sci. 131, 40−55.
(12) Shah, I., Houck, K., Judson, R. S., Kavlock, R. J., Martin, M. T., Reif, D. M., Wambaugh, J., and Dix, D. J. (2011) Using nuclear receptor activity to stratify hepatocarcinogens. PLoS One 6, e14584.
(13) Judson, R. S., Elloumi, F., Setzer, R. W., Li, Z., and Shah, I. (2008) A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model. BMC Bioinf. 9, 241.
(14) Martin, M. T., Knudsen, T. B., Reif, D. M., Houck, K. A., Judson, R. S., Kavlock, R. J., and Dix, D. J. (2011) Predictive model of rat reproductive toxicity from ToxCast high throughput screening. Biol. Reprod. 85, 327−339.

Supporting Information

Supporting information file 1 includes the liver effects and curated corresponding lesion categories (Table S1), detailed information for 36 bioactivity descriptors (Table S3) and 55 chemical structure descriptors (Table S4), and performance results for balanced data (Table S2) and ToxCast Phase 1 data (Table S5 and Figure S1). Supporting information file 2 includes the data matrix used for classification analysis. This material is available free of charge via the Internet at http://pubs.acs.org.

Corresponding Author

*E-mail: [email protected].

Funding

This project was supported in part by an appointment to the Research Participation Program at the Office of Research and Development, U.S. Environmental Protection Agency, administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the U.S. Department of Energy and EPA.

Notes

The views expressed in this article are those of the authors and do not necessarily reflect the views or policies of the U.S. Environmental Protection Agency and the US Food and Drug Administration. Mention of trade names or commercial products does not constitute endorsement or recommendation for use. The authors declare no competing financial interest.



ABBREVIATIONS

LDA, linear discriminant analysis; NB, Naïve Bayes; SVM, support vector machines; CART, classification and regression trees; KNN, k-nearest neighbors; ENSMB, ensemble classifier; HTS, high-throughput screening; BA, balanced accuracy; SD, standard deviation; QSAR, quantitative structure−activity relationship; ACEA, ACEA Biosciences; APR, Apredica panel; ATG, Attagene panel; NVS, NovaScreen panel; OT, Odyssey Thera; AC50, half maximal efficacy concentration; LEC, lowest effective concentration; μM, micromolar; ML, machine learning; SVCL, SVM with linear kernel; SVCR, SVM with radial basis kernel; BIO, bioactivity descriptors; CHM, chemical descriptors; BC, hybrid bioactivity and chemical descriptors; PXR, pregnane X receptor; CYP, cytochrome P450; TSPO, translocator protein; VDR, vitamin D receptor; NRF2, nuclear

DOI: 10.1021/tx500501h Chem. Res. Toxicol. XXXX, XXX, XXX−XXX


(15) Sipes, N. S., Martin, M. T., Reif, D. M., Kleinstreuer, N. C., Judson, R. S., Singh, A. V., Chandler, K. J., Dix, D. J., Kavlock, R. J., and Knudsen, T. B. (2011) Predictive models of prenatal developmental toxicity from ToxCast high-throughput screening data. Toxicol. Sci. 124, 109−127.
(16) Dix, D. J., Houck, K. A., Judson, R. S., Kleinstreuer, N. C., Knudsen, T. B., Martin, M. T., Reif, D. M., Richard, A. M., Shah, I., Sipes, N. S., and Kavlock, R. J. (2012) Incorporating biological, chemical, and toxicological knowledge into predictive models of toxicity. Toxicol. Sci. 130, 440−441.
(17) Thomas, R. S., Black, M. B., Li, L., Healy, E., Chu, T.-M., Bao, W., Anderson, M. E., and Wolfinger, R. D. (2012) A comprehensive statistical analysis of predicting in vivo hazard using high-throughput in vitro screening. Toxicol. Sci. 128, 398−417.
(18) Tropsha, A. (2010) Best practices for QSAR model development, validation, and exploitation. Mol. Inf. 29, 476−488.
(19) Cherkasov, A., Muratov, E. N., Fourches, D., Varnek, A., Baskin, I. I., Cronin, M., Dearden, M., Gramatica, P., Martin, Y. C., Todeschini, R., Consonni, V., Cramer, R., Benigni, R., Yang, C., Rathman, J., Terfloth, L., Gasteiger, J., and Tropsha, A. (2013) QSAR modeling: where have you been? Where are you going to? J. Med. Chem. 57, 4977−5010.
(20) Low, Y., Sedykh, A., Fourches, D., Golbraikh, A., Whelan, M., Rusyn, I., and Tropsha, A. (2013) Integrative chemical−biological read-across approach for chemical hazard classification. Chem. Res. Toxicol. 26, 1199−1208.
(21) Sedykh, A., Zhu, H., Tang, H., Zhang, L., Richard, A., Rusyn, I., and Tropsha, A. (2011) Use of in vitro HTS-derived concentration-response data as biological descriptors improves the accuracy of QSAR models of in vivo toxicity. Environ. Health Perspect. 119, 364−370.
(22) Lee, P. H. (2014) Resampling methods improve the predictive power of modeling in class-imbalanced datasets. Int. J. Environ. Res. Public Health 11, 9776−9789.
(23) Zimmerman, H. J., Ed. (1999) Hepatotoxicity: The Adverse Effects of Drugs and Other Chemicals on the Liver, 2nd ed., Lippincott Williams & Wilkins, Philadelphia, PA (http://www.wjgnet.com/1007-9327/14/6774.asp, ref 109).
(24) Thoolen, B., Maronpot, R. R., Harada, T., Nyska, A., Rousseaux, C., Nolte, T., Malarkey, D. E., Kaufmann, W., Kuttler, K., Deschi, U., Nakae, D., Vinlove, M. P., Brix, A. E., Singh, B., Belpoggi, F., and Ward, J. M. (2010) Proliferative and nonproliferative lesions of the rat and mouse hepatobiliary system. Toxicol. Pathol. 38, 5S−81S.
(25) Zang, Q., Rotroff, D. M., and Judson, R. S. (2013) Binary classification of a large collection of environmental chemicals from estrogen receptor assays by quantitative structure−activity relationship and machine learning methods. J. Chem. Inf. Model. 53, 3244−3261.
(26) QikProp, version 3.2, Schrödinger, New York, 2011.
(27) O'Boyle, N. M., Banck, M., James, C. A., Morley, C., Vandermeersch, T., and Hutchison, G. R. (2011) Open Babel: an open chemical toolbox. J. Cheminf. 3, 33.
(28) Yap, C. W. (2011) PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 32, 1466−1474.
(29) PubChem. https://pubchem.ncbi.nlm.nih.gov/ (accessed Aug 8, 2012).
(30) Bender, A. (2011) Bayesian methods in virtual screening and chemical biology. Methods Mol. Biol. 672, 175−196.
(31) Scikit-learn, http://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html#using-kernels.
(32) Ekins, S. (2014) Progress in computational toxicology. J. Pharmacol. Toxicol. Methods 69, 115−140.
(33) Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984) Classification and Regression Trees, CRC Press, Boca Raton, FL.
(34) Loh, W. Y. (2011) Classification and regression trees. Wiley Interdiscip. Rev.: Data Min. Knowl. Discovery 1, 14−23.
(35) Altman, N. S. (1992) An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 46, 175−185.
(36) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011) Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825−2830.
(37) Hunter, J. D. (2007) Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90−95.
(38) Matplotlib, http://matplotlib.org/.
(39) Knudsen, T. B., Houck, K. A., Sipes, N. S., Singh, A. V., Judson, R. S., Martin, M. T., Weissman, A., Kleinstreuer, N. C., Mortensen, H. M., Reif, D. M., Rabinowitz, J. R., Setzer, R. W., Richard, A. M., Dix, D. J., and Kavlock, R. J. (2011) Activity profiles of 309 ToxCast chemicals evaluated across 292 biochemical targets. Toxicology 282, 1−15.
(40) Martin, M. T., Dix, D. J., Judson, R. S., Kavlock, R. J., Reif, D. M., Richard, A. M., Rotroff, D. M., Romanov, S., Medvedev, A., Poltoratskaya, N., Gambarian, M., Moeser, M., Makarov, S. S., and Houck, K. A. (2010) Impact of environmental chemicals on key transcription regulators and correlation to toxicity end points within EPA’s ToxCast program. Chem. Res. Toxicol. 23, 578−590.
(41) MacDonald, M. L., Lamerdin, J., Owens, S., Keon, B. H., Bilter, G. K., Shang, Z., Huang, Z., Yu, H., Dias, J., Minami, T., Michnick, S. W., and Westwick, J. K. (2006) Identifying off-target effects and hidden phenotypes of drugs in human cells. Nat. Chem. Biol. 2, 329−337.
(42) Abraham, V. C., Taylor, D. L., and Haskins, J. R. (2004) High content screening applied to large-scale cell biology. Trends Biotechnol. 22, 15−22.
(43) Apredica panel, http://www.cyprotex.com/toxicology/multiparametric/cytotoxicity-screening-panel.
(44) Ankley, G. T., Bennett, R. S., Erickson, R. J., Hoff, D. J., Hornung, M. W., Johnson, R. D., Mount, D. R., Nichols, J. W., Russom, C. L., Schmieder, P. K., Serrano, J. A., Tietge, J. E., and Villeneuve, D. L. (2010) Adverse outcome pathways: a conceptual framework to support ecotoxicology research and risk assessment. Environ. Toxicol. Chem. 29, 730−741.
(45) Kawahigashi, H., Hirose, S., Ohkawa, H., and Ohkawa, Y. (2006) Phytoremediation of the herbicides atrazine and metolachlor by transgenic rice plants expressing human CYP1A1, CYP2B6, and CYP2C19. J. Agric. Food Chem. 54, 2985−2991.
(46) Klaunig, J. E., Babich, M. A., Baetcke, K. P., Cook, J. C., Corton, J. C., David, R. M., Deluca, J. G., Lai, D. Y., Mckee, R. H., Peters, J. M., Roberts, R. A., and Fenner-Crisp, P. A. (2003) PPARα agonist-induced rodent tumors: modes of action and human relevance. Crit. Rev. Toxicol. 33, 655−780.
(47) Maronpot, R. R., Yoshizawa, K., Nyska, A., Harada, T., Flake, G., Mueller, G., Singh, B., and Ward, J. M. (2010) Hepatic enzyme induction: histopathology. Toxicol. Pathol. 38, 776−795.
(48) Kensler, T. W., Wakabayashi, N., and Biswal, S. (2007) Cell survival responses to environmental stresses via the Keap1-Nrf2-ARE pathway. Annu. Rev. Pharmacol. Toxicol. 47, 89−116.
(49) Waris, G., and Ahsan, H. (2006) Reactive oxygen species: role in the development of cancer and various chronic conditions. J. Carcinog. 5, 14.
(50) Apel, K., and Hirt, H. (2004) Reactive oxygen species: metabolism, oxidative stress, and signal transduction. Annu. Rev. Plant Biol. 55, 373−399.
(51) Batarseh, A., and Papadopoulos, V. (2010) Regulation of translocator protein 18 kDa (TSPO) expression in health and disease states. Mol. Cell. Endocrinol. 327, 1−12.
(52) Šileikytė, J., Blachly-Dyson, E., Sewell, R., Carpi, A., Menabò, R., Lisa, F. D., Ricchelli, F., Bernardi, P., and Forte, M. (2014) Regulation of the mitochondrial permeability transition pore by the outer membrane does not involve the peripheral benzodiazepine receptor (translocator protein of 18 kDa (TSPO)). J. Biol. Chem. 289, 13769−13781.
(53) Hatori, A., Yui, J., Xie, L., Yamasaki, T., Kumata, K., Fujinaga, M., Wakizaka, H., Ogawa, M., Nengaki, N., Kawamura, K., and Zhang, M.-R. (2014) Visualization of acute liver damage induced by cycloheximide in rats using PET with [18F]FEDAC, a radiotracer for translocator protein (18 kDa). PLoS One 9, e86625.


(54) Papadopoulos, V., Baraldi, M., Guilarte, T. R., Knudsen, T. B., Lacapère, J.-J., Lindemann, P., Norenberg, M. D., Nutt, D., Weizman, A., Zhang, M.-R., and Gavish, M. (2006) Translocator protein (18 kDa): new nomenclature for the peripheral-type benzodiazepine receptor based on its structure and molecular function. Trends Pharmacol. Sci. 27, 402−409.
(55) Xu, L., Tian, D., and Zheng, Y. (2013) Pleiotropic roles of TGFβ/Smad signaling in the progression of chronic liver disease. Crit. Rev. Eukaryot. Gene Expr. 23, 237−255.
(56) Roberts, S. B., Segil, N., and Heintz, N. (1991) Differential phosphorylation of the transcription factor Oct1 during the cell cycle. Science 253, 1022−1026.
(57) Duncliffe, K. N., Bert, A. G., Vadas, M. A., and Cockerill, P. N. (1997) A T cell−specific enhancer in the Interleukin-3 locus is activated cooperatively by Oct and NFAT elements within a DNase I−hypersensitive site. Immunity 6, 175−185.
(58) Weber, A., Boege, Y., Reisinger, F., and Heikenwälder, M. (2011) Chronic liver inflammation and hepatocellular carcinoma: persistence matters. Swiss Med. Wkly. 141, w13197.
(59) Wagner, M., Zollner, G., and Trauner, M. (2011) Nuclear receptors in liver disease. Hepatology 53, 1023−1034.
(60) Hong, H., and Tong, W. (2014) Emerging efforts for discovering new biomarkers of liver disease and hepatotoxicity. Biomark. Med. 8, 143−146.
(61) Chen, M., Bisgin, H., Tong, L., Hong, H., Fang, H., Borlak, J., and Tong, W. (2014) Toward predictive models for drug-induced liver injury in humans: are we there yet? Biomark. Med. 8, 201−213.
(62) Hanahan, D., and Weinberg, R. A. (2000) The hallmarks of cancer. Cell 100, 57−70.
(63) Traiger, G. J., and Plaa, G. L. (1974) Chlorinated hydrocarbon toxicity. Arch. Environ. Health 28, 276−278.
(64) Hasegawa, R., Hirata-Koizumi, M., Takahashi, M., Kamata, E., and Ema, M. (2005) Comparative susceptibility of newborn and young rats to six industrial chemicals. Congenital Anomalies 45, 137−145.
(65) 2-Chlorophenol, (https://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=7245).
(66) Elkan, C., and Noto, K. (2008) Learning Classifiers from Only Positive and Unlabeled Data, Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008): Las Vegas, NV, 213−220.
