Development and Validation of Decision Forest Model for Estrogen

Nov 2, 2015 - Hence, we have developed a decision forest classification model to predict chemical binding to ERα using a large training data set of 3...
0 downloads 5 Views 1MB Size
Article pubs.acs.org/crt

Development and Validation of Decision Forest Model for Estrogen Receptor Binding Prediction of Chemicals Using Large Data Sets Hui Wen Ng,† Stephen W. Doughty,‡ Heng Luo,† Hao Ye,† Weigong Ge,† Weida Tong,† and Huixiao Hong*,† †

Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, 3900 NCTR Road, Jefferson, Arkansas 72079, United States ‡ School of Pharmacy, University of Nottingham Malaysia Campus, Jalan Broga, 43500 Semenyih, Selangor, Malaysia S Supporting Information *

ABSTRACT: Some chemicals in the environment possess the potential to interact with the endocrine system in the human body. Multiple receptors are involved in the endocrine system; estrogen receptor α (ERα) plays very important roles in endocrine activity and is the most studied receptor. Understanding and predicting estrogenic activity of chemicals facilitates the evaluation of their endocrine activity. Hence, we have developed a decision forest classification model to predict chemical binding to ERα using a large training data set of 3308 chemicals obtained from the U.S. Food and Drug Administration’s Estrogenic Activity Database. We tested the model using cross validations and external data sets of 1641 chemicals obtained from the U.S. Environmental Protection Agency’s ToxCast project. The model showed good performance in both internal (92% accuracy) and external validations (∼70−89% relative balanced accuracies), where the latter involved the validations of the model across different ER pathway-related assays in ToxCast. The important features that contribute to the prediction ability of the model were identified through informative descriptor analysis and were related to current knowledge of ER binding. Prediction confidence analysis revealed that the model had both high prediction confidence and accuracy for most predicted chemicals. The results demonstrated that the model constructed based on the large training data set is more accurate and robust for predicting ER binding of chemicals than the published models that have been developed using much smaller data sets. The model could be useful for the evaluation of ERα-mediated endocrine activity potential of environmental chemicals.



INTRODUCTION The endocrine disrupting chemicals (EDCs) and their potential to mimic endogenous hormones and interfere with the endocrine systems in the body have been a long-standing issue in both public and environmental health. Among the receptors in the endocrine system, the estrogen receptors (ERs) are the most extensively investigated. This is due in no small part to the tremendous role played by the ERs, not only in a myriad of physiological processes, e.g., regulation of lipids, inflammatory markers, and hemostasis, but also in the female reproductive system.1 Interference of the ER-mediated responses by the EDCs is linked to various diseases, such as stroke and infertility, as well as adverse effects that may spill over to future generations as exemplified by the “DES daughters” tragedy where daughters who were exposed to diethylstilbesterol (DES) in utero suffered from higher risks of adenocarcinoma and cervical and vaginal anomalies, among others.1−3 The high level of public concern and media attention on the adverse effects of EDCs has led to a large number of research efforts by the scientific community that encompasses academia, industry, and governments from various parts of the world. In © 2015 American Chemical Society

the U.S., the government has initiated mega-projects, such as the Endocrine Disruptor Screening Project (http://www.epa. gov/endo/) and ToxCast (http://www.epa.gov/ncct/toxcast/ ),4−7 to screen tens of thousands of chemicals for their potential bioactivity using various methods, e.g., in vivo, in vitro, and high throughput assays. Chemical databases and knowledge bases, such as the U.S. Food and Drug Administration (FDA)’s Endocrine Disruptor Knowledge Base (EDKB)8−19 and the Estrogenic Activity Database (EADB),20−24 which collect estrogenic activity data from various experimental sources and serve as a resource for scientists in this field, have also burgeoned as a result. The data generated from these research efforts have not only formed a valuable bedrock of knowledge for toxicologists, they have also propelled the development of various in silico models for the prediction of chemical activities/toxicities to new heights.16,25−34 Most notably, the quantitative structure− activity relationship (QSAR), an approach founded on mathematical models or knowledge-based rules and functions Received: August 27, 2015 Published: November 2, 2015 2343

DOI: 10.1021/acs.chemrestox.5b00358 Chem. Res. Toxicol. 2015, 28, 2343−2351

Article

Chemical Research in Toxicology to predict chemical activities through the structural features of chemicals, is widely employed for the latter. Over the years, numerous specialized QSAR-based tools have been and are still being actively developed by government agencies, such as the U.S. Environmental Protection Agency and U.S. Food and Drug Administration.23,25,27,30,35−38 Here, we describe the development of a QSAR model for predicting chemical binding (i.e., one of the major mechanisms of toxicity) to the most important group of ERs, the α subtype, based on the Decision Forest (DF) classification method37,39 using large training (3308 chemicals) and test (1641 chemicals) sets. Previous published models were mostly developed using data sets of tens to a few hundreds of chemicals, such as the rat ER binding data of 232 chemicals.25,38,39 Through this work, we seek not only to evaluate the performance of our model that was built with a far larger data set but also to apply the model and assess its prediction performance across a range of assays in external data sets (13 ER pathway assays in ToxCast). Using the 162 compounds found in both EADB and ToxCast, we analyzed the agreement between the ER activity data from EADB and ToxCast to generate a “benchmark” against which the performance of our model can be compared. Findings from this study will provide an idea of the ability and applicability of a model built from the binding data of ER-binding compounds in the prediction of ER activity obtained from different assays. Furthermore, an analysis of the descriptors used to build the model will also provide useful knowledge on which and how the key chemical features found in the molecules of the training set contribute to the model prediction ability.



MATERIALS AND METHODS

Figure 1. Overall study design. EADB data was used as a training set for model training. Five-fold cross-validations (repeated 1000 times), permutation tests, and confidence and descriptor analyses were carried out to assess the model performance. The correlation between the data in EADB and ToxCast was investigated through the 162 compounds found in both data sources. Seven of the 13 ToxCast assays had good correlation with the EADB data and were used in external validations.

Study Design. Figure 1 shows the overall study design. The binding data (logRBA) for the ERα subtype of 3308 chemicals in the EADB were output and used as the training data set. The data for the 1641 chemicals tested by the 13 ToxCast ER pathway-related assays40 were used as the test set. The molecular descriptors for these chemicals were calculated using Mold2 software.38 After preprocessing the data in the training and test sets to discard the descriptors with no or little information, the DF method37,39 was used to construct models. The models built from the training set were first subjected to internal validation through 5-fold cross-validation and permutation tests. Of the 1803 chemicals tested by the 13 ER pathway-related assays in ToxCast, 162 were contained in the training data set and were thus excluded from the test set. The remaining 1641 chemicals were used in external validations. The 162 chemicals were used to identify the ToxCast assays that were correlated with the EADB data; these assays were then used as external validation data sets. Finally, a model was built from the whole training data set to predict the activity of the chemicals in external validation data sets. The confidence of the predictions made by the models during the internal validation process was analyzed. Analysis of the descriptors used to build the models were also carried out to identify the informative molecular descriptors. Preparation and Preprocessing of Training and External Data Sets. The training data set contained 3308 chemicals and was obtained from FDA’s EADB.23 There were 4161 ER binding data points for these 3308 chemicals. The binding data were categorized as active (i.e., binding activity was detected in an assay) and inactive (i.e., no binding activity was detected in the assay). For a chemical with more than one binding data, it was assigned as inactive if less than half of the assays showed binding activity and vice versa for the active. Of the 3308 chemicals, 2656 were determined to be active and 652 to be inactive (for more details on the binding data, see Table S1). The 1858 chemicals tested by the 13 ER pathway-related assays in ToxCast40 (see Table S2 for more information about the assays) and

their activity data were obtained from http://epa.gov/ncct/toxcast/ data.html (accessed April 2014). Among the 1858 compounds, 55 compounds have no CAS numbers, and we had difficulty finding their structures using PubChem or Chemspider. Therefore, they were removed from the external data set. For the remaining 1803 chemicals, 162 were found in the training data set and were thus also excluded from the external data set. This resulted in a final external data set of 1641 chemicals. For this set of data, all of the chemicals with a recorded AC50 value of 1 M in ToxCast were treated as inactive,27 whereas the rest, unless not tested, were active. The total number of chemicals tested, actives, and inactives are different for the 13 assays (see Table 1). Unlike the training data set, the external data set contains far more inactive than active chemicals. Descriptors Generation. With the structures of both training and external data sets saved as SDF files, the 777 molecular descriptors for the chemicals were calculated using Mold2 software, which can be downloaded at http://www.fda.gov/ScienceResearch/ BioinformaticsTools/Mold2/. The descriptors generated were onedimensional (1D) (e.g., molecular weight) and two-dimensional (2D) (e.g., bond information) descriptors, where the former was calculated based on molecular formula only and the latter on the 2D structure of the chemicals. After identifying and removing the less informative descriptors (i.e., same values for more than 3000 chemicals in the training set), 467 descriptors out of the original 777 were used to build the DF models. Cross-Validation. The performance of the classification models was evaluated using 5-fold cross-validations. In a 5-fold crossvalidation, the data set was first randomly partitioned into five more2344

DOI: 10.1021/acs.chemrestox.5b00358 Chem. Res. Toxicol. 2015, 28, 2343−2351

Article

Chemical Research in Toxicology

further investigated, i.e., the mean and standard deviation for these features for the active and inactive chemicals as well as their associated p-value were calculated. Comparison of Data between EADB and ToxCast. The 13 ER pathway-related assays in ToxCast measure different aspects of the estrogenic activity; however, the data obtained from these assays were not well-correlated. The data in EADB and ToxCast for the 162 chemicals found in both data sets were compared to assess their correlation. This served to (1) evaluate which of these 13 ER pathwayrelated assays could be used as an external validation data set to assess our model, and (2) set the benchmark to which the performance of the model in predicting the test set(s) could be compared. For each of the 13 ER pathway-related assays in ToxCast, the balanced accuracy (see eq 5) relative to EADB, i.e., EADB data were considered as “truth”, was used as the primary metric to measure the activity data agreement between EADB and ToxCast assays. In addition, other metrics, such as accuracy, sensitivity, and specificity were also calculated and presented in Table S3. The assays with a balanced accuracy of lower than 0.7 were deemed less correlated with the activity of the training data set and excluded from the external data set.

Table 1. Number of Compounds Tested in the 13 ToxCast ER Pathway-Related Assays, the Number of Compounds Tested as Active, and the Number of Compounds Tested as Inactive

ACEA_T47D_80 h_Positive ATG_ERE_CIS ATG_ERRa_TRANS NVS_NR_bER NVS_NR_hER NVS_NR_mERa OT_ER_ERaERa_1440 OT_ER_ERaERb_1440 OT_ER_ERbERb_1440 Tox21_ERa_BLA_Agonist_ratio Tox21_ERa_BLA_Antagonist_ratio Tox21_ERa_LUC_BG1_Agonist Tox21_ERa_LUC_BG1_Antagonist

number of compounds tested

active

inactive

1608 1640 1640 1641 1641 1641 1634 1633 1633 1641 1641 1641 1641

211 214 18 46 94 82 71 122 120 169 121 218 26

1397 1426 1622 1595 1547 1559 1563 1511 1513 1472 1520 1423 1615

accuracy = number of chemicals classified in the same category in both data sources 162

or-less equal portions. Each of the five portions was then left out to challenge the model constructed using the remaining four portions. DF integrates variable selection and model construction together. When constructing models, DF automatically selects descriptors from the 467. The prediction accuracy, sensitivity, and specificity were calculated based on the results of the five models. This process was repeated 1000 times (see Figure 1) to make the validations statistically stable. Permutation Test. For estimating the predictive power of the models, permutation tests were carried out to measure the chance correlation in the training set. In a permutation test, the ER binders and nonbinders from the training data set were first randomly scrambled. Then, a 5-fold cross validation as described was conducted, and the resultant prediction accuracy, sensitivity, and specificity were used to measure the chance correlation. The permutation test was repeated 1000 times to ensure that the estimation of chance correlation is statistically stable. Prediction Confidence Analysis. A prediction from a DF model not only classifies a chemical as an ER binder or nonbinder but also estimates the probability of the chemical being an ER binder or nonbinder, i.e., the prediction confidence. Therefore, the performance of the models can be measured not only from the overall prediction accuracy but also through the relationship between the prediction confidence and accuracy. Prediction confidence is a quantity between 0 and 1 (least and most confident, respectively) that is calculated based on the ER binder probability output from the DF model using eq 1. In this equation, P represents the probability of a chemical being an ER binder (0.5 ≤ P ≤ 1) or nonbinder (0 < P < 0.5). confidence =

|P − 0.5| 0.5

(2) sensitivity =

number of chemicals classified as active in both data sources number of active chemicals in EADB

(3) specificity =

number of chemicals classified as inactive in both data sources number of inactive chemicals in EADB

(4) balanced accuracy =

sensitivity + specificity 2

(5)

External Validation. The predictive power of the classification model constructed from the whole training data set was evaluated with the external data set, which contains 1641 chemicals tested in the 7 ER pathway-related assays in ToxCast.40 A DF model was first constructed using the whole training data set. The model was then used to predict the 1641 chemicals in the external data set from ToxCast. The performance of the DF model was evaluated using sensitivity, specificity, accuracy, and balanced accuracy. For a comparison with the prediction performance using the experimental data, the relative performance metrics (relative sensitivity, specificity, accuracy, and balanced accuracy) were calculated using eq 6 to provide assessment of the DF model performance relative to the experiment. The performance metrics between EADB and ToxCast data were obtained from the comparison of the activity data using the 162 common chemicals.

relative performance metrics performance of external validation = performance metrics between EADB and ToxCast data

(1)

Next, to explore the relationship between prediction confidence and model performance (accuracy, sensitivity, and specificity), the confidence range (0 to 1) was first divided into 10 even bins. For each prediction in the 1000 5-fold cross validations, its prediction confidence was calculated and put in the corresponding bin. The actual and predicted activities (ER binder or nonbinder) were also recorded. Finally, the number of predictions that fell into each bin were counted, and the accuracy, sensitivity, and specificity in each bin were calculated. Informative Descriptors Identification. To investigate what molecular descriptors are informative to the DF models and relevant to ER binding, we further examined the frequency distribution of the descriptors used in the DF models in the 5-fold cross validations. For each Mold2 descriptor, we calculated its frequency in the models from the 5-fold cross validations. The descriptors were then ranked by their frequency values for identification of the informative descriptors. The descriptors that were used in more than 90% of the models were



(6)

RESULTS Cross-Validation. The 5-fold cross-validations carried out in this study served as an internal validation for the prediction models. Figure 2 shows that the models were able to produce high prediction accuracy (91.98 (mean) ± 0.32% (standard deviation), blue solid line) and sensitivity (95.75 ± 0.26%, red solid line) as well as satisfactory specificity (76.59 ± 1.26%, black solid line). Inspecting the width of each peak, a narrow peak was observed for both the prediction accuracy and sensitivity. This indicates that the models performed consistently in the overall prediction as well as the prediction for 2345

DOI: 10.1021/acs.chemrestox.5b00358 Chem. Res. Toxicol. 2015, 28, 2343−2351

Article

Chemical Research in Toxicology

were found at the confidence range of 0−0.9, whereas an overwhelmingly large number of the predictions (∼14 × 105) had a prediction confidence of >0.9. The corresponding accuracy (red line), sensitivity (blue line), and specificity (cyan line) for the 10 bins also showed an uphill trend as the prediction confidence increased. Informative Descriptors. The training data set that contains 3308 chemical and 467 descriptors was utilized in the 5-fold cross-validations. The frequency of the 467 descriptors that were used in the models varied, i.e., some descriptors were used rarely and others were used almost invariably (see Figure S1 for the frequency of use of these descriptors). Out of the 467 descriptors, 18 were found to be used most often (employed in >9000 models). The definitions and frequency of use of the top 18 descriptors are given in Table 2. Everything considered, these descriptors could be classified by 4 main characteristics: (1) molecular weight/van der Waals volume, (2) electrostatic property/polarizability, (3) aromatic ring(s), and (4) structural/bond information. Among these, the van der Waals volume was found to be the most important feature, as it was used in 9887 out of the 10000 models constructed. It was apparent that features belonging to (1) and (2) dominated, whereas those belonging to (3) and (4) contributed to a smaller extent. Table 2 also shows the mean and standard deviation of the top 18 descriptors for the active and inactive chemicals in the training data set. The p-values for the comparisons of the descriptors between the actives and inactives are given in the last column of Table 2. All but Moran autocorrelation lag 6 (weighted by atomic polarizabilities) showed a significant difference between the active and inactive compounds. Comparison of Data between EADB and ToxCast. The 162 compounds were used to estimate the extent of data agreement between EADB and ToxCast. Figure 4a shows that the balanced accuracy ranged between ∼0.5 and 0.8 across the 13 ER pathway-related assays (see Table S3 for calculated sensitivity, specificity, accuracy, and balanced accuracy). Seven assays had a balanced accuracy of >0.7 and were deemed to be correlated with the experimental data from EADB. Therefore, only the data from these 7 assays were used in the external data set for validation of the DF model constructed from the training data set. External Validation. Figure 4b shows the balanced accuracy of predictions on the data from the 7 ToxCast ER pathway-related assays of the 1641 chemicals in the external data set by the DF model constructed from the whole training data set (see Table S4 for the total number of compounds tested, number of compounds tested as active and inactive, and the corresponding prediction accuracy, sensitivity, specificity, and area under the curve (AUC)). Consistently high prediction accuracy (0.7−0.8) and specificity (∼0.8) were achieved when compared against the experimental findings from all 7 assays. The prediction sensitivity (0.24−0.49), however, was poor, making the balanced accuracy (0.52−0.65) moderate. Unsurprisingly, the model performed best (0.65) when predicting results from the binding assay. To provide an assessment of the DF model performance relative to EADB experimental data on the 7 ER pathwayrelated assays, Figure 4c shows the performance of the DF model expressed in relative balanced accuracy. The overall trend remained essentially similar to that observed in Figure 4b. The relative balanced accuracy was improved to 0.67−0.89,

Figure 2. Results of 5-fold cross-validations (solid lines) and permutation tests (dotted lines) presented as accuracy (blue), sensitivity (red), and specificity (black).

the active compounds. The broader peak for the prediction specificity indicated that the models performed less consistently when predicting the inactive compounds. Permutation Tests. In clear contrast to the results of the internal validation, the permutation tests are worse with a prediction accuracy of 74.05 ± 0.58% (see Figure 2, indicated by blue dotted line), sensitivity of 89.69 ± 0.59% (red dotted line), and specificity of 10.36 ± 1.41% (black dotted line).The clear separation between the solid line peaks and their corresponding dotted line peaks also indicated the differences in the model performance. The broader peaks from the permutation tests compared to the 5-fold cross-validation, in particular for sensitivity, reflect the increased difficulty in producing consistent predictions (most noticeably for the active compounds) post data-scrambling. Confidence Analysis. Figure 3 shows the results of the confidence analysis carried out for the predictions in the 5-fold cross-validations. The majority of the predictions had high confidence (black line): only a small number of the predictions

Figure 3. Confidence analysis results. Prediction accuracy (red), sensitivity (blue), specificity (cyan), and number of predictions (black) were plotted along the corresponding prediction confidence on the xaxis. 2346

DOI: 10.1021/acs.chemrestox.5b00358 Chem. Res. Toxicol. 2015, 28, 2343−2351

Article

Chemical Research in Toxicology

Table 2. Top 18 Descriptors Used >90% of the Time for Model Building with Their Respective Definitions, Number of Times Used, as well as Mean and Standard Deviation (SD) Values for the Active and Inactive Chemicals active descriptor ID D487 D534 D549 D508 D198 D296 D463 D488 D230 D474 D196 D551 D560 D675 D283 D480 D469 D195

definitions Moran autocorrelation lag 1 (weighted by atomic van der Waals volumes) lowest eigenvalue n. 3 of Burden matrix (weighted by atomic masses) lowest eigenvalue n. 2 of Burden matrix (weighted by atomic Sanderson electronegativities) Moran autocorrelation lag 6 (weighted by atomic polarizabilities) sum of Kier−Hall electrotopological states structural information content (neighborhood symmetry of 5-order) Geary autocorrelation - lag 1/weighted by atomic Sanderson electronegativities Moran autocorrelation lag 2 (weighted by atomic van der Waals volumes) Kier benzene-likeliness index Geary autocorrelation lag 4 (weighted by atomic polarizabilities) molecular electrotopological variation lowest eigenvalue n. 4 of Burden matrix (weighted by atomic Sanderson electronegativities) lowest eigenvalue n. 5 of Burden matrix (weighted by atomic polarizabilities) number of phenols bond information content (neighborhood symmetry of 2-order) Moran autocorrelation lag 2 (weighted by atomic masses) Geary autocorrelation lag 7 (weighted by atomic Sanderson electronegativities) maximal electrotopological positive variation

inactive

number of times used

mean

SD

mean

SD

p-value

9887

0.2931

0.0911

0.3490

0.1709