Insights into the Molecular Basis of the Acute ... - ACS Publications

Nov 21, 2017 - Since the honey bee (Apis mellifera) is probably among the most exposed species to the polluting chemicals, we focused on the in silico...
0 downloads 0 Views 5MB Size
Article pubs.acs.org/jcim

Cite This: J. Chem. Inf. Model. 2017, 57, 2948−2957

Insights into the Molecular Basis of the Acute Contact Toxicity of Diverse Organic Chemicals in the Honey Bee Xiao Li,*,†,‡ Yuan Zhang,‡ Hongna Chen,§ Huanhuan Li,‡ and Yong Zhao*,†,‡ †

Beijing Computing Center, Beijing Academy of Science and Technology, 7 Fengxian road, Beijing 100094, China Beijing Beike Deyuan Bio-Pharm Technology Co. Ltd., 7 Fengxian road, Beijing 100094, China § Tigermed Consulting Co., Ltd., 20 Chaowai Street, Beijing 100020, China ‡

S Supporting Information *

ABSTRACT: Use of chemical pollutants, including pesticides and other industrial chemicals, has resulted in significant risks to the whole ecosystem. Therefore, ecological risk assessment of chemicals is vital and necessary. Since the honey bee (Apis mellifera) is probably among the most exposed species to the polluting chemicals, we focused on the in silico estimation of honey bee toxicity (HBT) of chemicals and the analysis of the relevance of chemical HBT and several key physical−chemical properties and structural characteristics. A total of 40 classification models were developed by combination of five machine learning methods along with seven kinds of fingerprints and a set of molecular descriptors. After 5-fold cross validation and external validation, several models showed good predictive power. The relevance of 12 key physical−chemical properties and chemical HBT was also investigated. Five properties, including AlogP, logD, molecular weight (MW), molecular surface area (MSA), and the number of rotatable bonds (nRTB), indicated positive correlation coefficients with HBT, while molecular solubility (logS) and the number of hydrogen bond donors (nHBD) indicated negative correlation coefficients. Finally, seven privileged substructures responsible for chemical HBT were identified from KRFP and SubFP fingerprints. The results of this study should provide critical information and useful tools for chemical HBT estimation in environmental risk assessment.



INTRODUCTION

structure−activity relationships (QSARs) have been widely used to reduce the cost of the environmental risk assessment.7,8 QSAR models have already been used for the estimation of ecotoxicological end points in various ecologically significant species including T. pyriformis,7,9 quail,10 and fathead minnow11−13 in the past decades. Besides, there are also several QSAR models for HBT prediction. In 1991, Vighi et al.14 developed a QSAR model for HBT prediction with organophosphorus pesticides. Later, Celli et al.15 also proposed QSAR models for the prediction of HBT with 100 organophosphorus pesticides (89 in training set and 11 in test set) using a multilayer feedforward neural network method. The root-mean-square residual value for the training set was 0.43, and that for test set was 0. 386. Since Vighi and Celli’s models were only generated to simulate the HBT of organophosphorus pesticides, the usefulness of these models is very limited. In 2014, Singh et al. established global models for qualitative and quantitative HBT prediction with more than 200 structurally diverse pesticides. The models were developed to predict the qualitative and quantitative HBT of new organic pesticides for regulatory purposes.5 More recently, Como et al. also proposed

Chemicals are an integral part of modern life, and they are used in a wide variety of products. However, misuse and improper disposal of chemicals has also resulted in significant risks to the environment. Hazardous chemicals, especially pesticides and other toxic industrial chemicals, continue to affect all aspects of natural resources, including the water, soil, atmosphere, and wildlife. The honey bee (Apis mellifera) is among the most important insects because they transfer pollen between plants and are regularly exposed to chemical pollutants when visiting plants.1 In the past decade, people have seen a dramatic drop in honeybee numbers due to chemical toxicity in various countries.2−6 Therefore, special attention should be paid to assess the potential risk of chemicals in honey bee. The toxicity of chemicals in the honey bee should always be assessed with 24 or 48 h acute toxicity experimental bioassays to determine the median lethal concentration that induces 50% death (LC50) or lethal dose that induces 50% death (LD50). Since a full evaluation of the honey bee toxicity (HBT) of a large number of organic chemicals by experimental test is costly, time-consuming, and poses an ethical problem; there is a very urgent need to develop alternative methods and tools for HBT estimation. In silico techniques, such as quantitative © 2017 American Chemical Society

Received: August 9, 2017 Published: November 21, 2017 2948

DOI: 10.1021/acs.jcim.7b00476 J. Chem. Inf. Model. 2017, 57, 2948−2957

Article

Journal of Chemical Information and Modeling

of molecular descriptors and were used as feature reduction for model building. All the descriptors were calculated by the open source software package PaDEL-Descriptor.21 Molecular fingerprints have been also widely used in chemical toxicity prediction,22−26 because fingerprints are generated directly from chemical structures and could be easily translated into 2D fragments.21,27,28 Herein, seven kinds of commonly used fingerprints were also evaluated by PaDEL Descriptor, including the Estate fingerprint (Estate, 79 bits), CDK fingerprint (FP, 1024 bits), CDK extended fingerprint (Extended, 1024 bits), Klekota−Roth fingerprint (KRFP, 4860 bits), MACCS keys (MACCS, 166 bits), PubChem fingerprint (PubChem, 881 bits), and Substructure fingerprint (SubFP, 307 bits). Construction of Classification Models. A series of binary classification models were developed based on molecular descriptors and fingerprints, respectively. Among a multitude of available modeling algorisms, we applied five machine learning methods, including support vector machine (SVM),29 k-nearest neighbor (kNN),30 naive Bayes (NB),31 C4.5 decision tree (DT),32 and random forest (RF).33 SVM is an excellent kernel-based tool for binary data classification and regression introduced by Vapnik and co-workers, which can deal with high dimensional space problems with the input of vectors from low dimensional space by introducing kernel function. The kNN algorism is a type of instance-based learning, or lazy learning where the function is only approximated locally and all computation is deferred until classification. An object could be classified by a majority vote of its neighbors, with the object being assigned to the class most ̈ Bayes based on common among its k nearest neighbors. Naive the Bayes rule is a simple probabilistic classifier for the conditional probability. It allows the user to categorize instances in a data set based on the equal and independent contributions of their attributes. C4.5 DT is an algorithm used to generate a decision tree developed by Ross Quinlan, which is an extension of Quinlan’s earlier iterative dichotomiser 3 (ID3) algorithm. At each node of the tree, C4.5 DT chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. RF is an ensemble learning method for classification and regression that operated by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by individual trees. These algorisms are highly effective and robust and have been extensively used in toxicity research.22−25,34−36 In this study, the SVM algorithm was implemented in the open source LIBSVM (LIBSVM 3.16 package),37 and the other four algorisms were performed in open source tool Orange (version 2.7, freely available at https://orange.biolab.si/orange2/). Assessment of Model Performance. All models were validated by 5-fold cross validation, and then the topperforming models were further validated by external validation. The models were evaluated based on the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The sensitivity (SE) and specificity (SP) were two statistical parameters applicable to binary classifications, which denoted the ratios of positive instances and negative instances correctly identified, respectively. The accuracy (Q) was the total prediction accuracy of samples, and the Matthews correlation coefficient (MCC) was generally regarded as a balanced measure which can be used even if the classes are of very different sizes. MCC returned a

a series of global model for HBT prediction of pesticides using k-Nearest Neighbor (kNN) algorism based on 256 pesticides with acute contact toxicity data collected from different sources.16 The models were validated with good prediction (accuracy of 70% for all compounds and of 65% for highly toxic compounds), which suggested that these models could be used to reliably predict the HBT of structurally diverse pesticides. However, the physical−chemical properties and structural characteristics of toxic and nontoxic compounds were not analyzed in these studies, and the usefulness of most published models was restricted because of poor availability. In this study, we focused on the following: (1) the development of chemical acute contact HBT models combining different machine learning algorithms with substructure pattern recognition methods based on structurally diverse organic chemicals; (2) the analysis of the difference of key physical−chemical properties and structural characteristics between the chemicals with high HBT and those without HBT.



MATERIALS AND METHODS Data Preparation. In this study, we addressed acute contact toxicity of organic chemicals in honey bees. The training data was extracted from Como’s work.16 Como carried out an outstanding work to collect chemical HBT data from different sources, including the DEMETRA project,2 the Terrestrial US EPA ECOTOX database present in the OECD QSAR Toolbox V3.3 (www.qsartoolbox.org), and the European Food Safety Authority (EFSA, http://www.efsa.europa.eu/it/). The compounds with LD50 values presented in terms of micrograms per bee resulting from the mortality of honey bees recorded after 48 h of contact exposure were proposed in Como’s work. Besides, an external validation set was extracted from Singh’s work. Both the data sets were carefully prepared by the following steps: (1) removing mixtures and inorganic and organometallic compounds; (2) salts were converted to their parent forms. Then, duplicate substances in the external set were removed from the training set. The U.S. EPA has established HBT categories on the basis of LD50, chemicals with LD50 < 2 μg/bee were categorized as highly toxic, 2 μg/bee ≤ LD50 < 11 μg/bee as moderately toxic, and chemicals with LD50 ≥11 μg/bee were classified as practically nontoxic.17 Considering that the U.S. EPA criteria was a little liberal for HBT risk assessment, 100 μg/bee was also used as a threshold for the HBT model building in the previous studies.5,11,16 Herein, we grouped the compounds into three classes with LD50 thresholds 11 and 100 μg/bee, compounds with LD50 < 11 μg/bee were labeled as high HBT, compounds with 11 μg/bee ≤ LD50 < 100 μg/bee were labeled as low HBT, and compounds with LD50 ≥ 100 μg/bee were labeled as no HBT. The extreme parts (high and no HBT) were used for model building and analysis of physical−chemical properties and structural characteristics. Calculation of Molecular Descriptors and Fingerprints. In this study, 12 key physical−chemical properties which were widely adopted in chemical toxicity prediction18−20 were calculated and evaluated. They are molecular weight (MW), Ghose-Crippen LogKow (AlogP), molecular solubility (logS), the number of hydrogen bond acceptors (nHBA), the number of hydrogen bond donors (nHBD), logD, molecular surface area (MSA), molecular polar surface area (MPSA), molecular fractional polar surface area (MFPSA), the number of rotatable bonds (nRTB), the number of rings (nR), and the number of aromatic rings (nAR). These properties formed a set 2949

DOI: 10.1021/acs.jcim.7b00476 J. Chem. Inf. Model. 2017, 57, 2948−2957

Article

Journal of Chemical Information and Modeling value between −1 and 1. A coefficient of 1 represented a perfect prediction, 0 meant no better than random prediction, and −1 indicated total disagreement between prediction and observation. These parameters were calculated with eqs 1−4. TP + TN TP + TN + FP + FN

Q=

the training set and external validation set contained 208 and 43 compounds, respectively. The CAS numbers and LD50 values of chemicals can be found in Table S1 of the Supporting Information. It is common knowledge that the chemical diversity of the data set is a key issue for a global model. Herein, we applied the radar chart for chemical space investigation of the entire data set. As shown in Figure 1, the values of MW ranged from 84.08 to 731.96, logS ranged from −10.92 to 0.63, nHBA ranged from 0 to 10, nHBD ranged from 0 to 4, and AlogP ranged from −4.02 to 7.29. These data illustrated that the 251 compounds in our data set covered a sufficiently large chemical space.40 Furthermore, the Tanimoto similarity index41 was also calculated based on ECFC-4 fingerprint, which has been widely used to evaluate similarities among chemicals. We plotted the heat map of Tanimoto similarity index of all the compounds, as shown in Figure 2. The average Tanimoto similarity index was 0.237, indicating that the chemicals in our data set were evidently diverse.

(1)

SE =

TP TP + FN

(2)

SP =

TN TN + FP

(3)

MCC = TPTN − FPFN (TP + FFP)(TP + FN)(TN + FP)(TN + FN) (4)

In addition, the value of area under receiver operating characteristic (ROC) curve (AUC) was also computed for the models. AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one, which shows the separation ability of a binary classifier iteratively setting the positive classifier threshold. The principle is that if the plot has a surface area (AUC) of 1, the classifier is perfect, and if the area equals 0.5, the classifier is a useless random one. Identification of Privileged Substructures or Structural Alerts. Privileged substructure, or called structural alert (SA), was defined as molecular functional group which can make chemicals toxic.38 The presence of SA can alert investigators to the potential toxicity of test chemicals. SAs were important for toxicity research because they were derived directly from mechanistic knowledge.39 In the present study, we identified the privileged substructures responsible for HBT by calculating the frequency of each substructure from KRFP and SubFP fingerprints appearing in HBT positive compounds (high HBT) and HBT negative compounds (none HBT). If a substructure presented more frequently in chemicals with high HBT than those with none HBT, this substructure would be regarded as a privileged substructure. The frequency of a substructure was defined as below: F=



RESULTS OF MODEL BUILDING A total of 40 classification models were developed by combination of five machine learning methods along with seven kinds of fingerprint and a set of molecular descriptors. The performance of SVM models depends on the combination of several key parameters. In this study, the Gaussian radial basis function (RBF) kernel was used. The parameters C and γ for RBF kernel were optimized through grid search method and shown in Table S2 of the Supporting Information. For the kNN algorithm, the nearness was measured by Hamming distance metrics, and the parameter of k = 5 was used. The parameters of the C4.5 DT, RF, and NB algorithms were set by default in the Orange toolbox. All the models were validated with 5-fold cross validation, and the performances of models are summarized in Table 2. According to the values of Q and MCC, ten models (MACCS_SVM, Extend_SVM, KRFP_SVM, FP_SVM, Pubchem_SVM, MACCS_kNN, Estate_SVM, Pubchem_RF, MACCS_RF, and SubFP_SVM) provided the best predictive results on 5-fold cross validation (with Q > 84.11% and MCC > 0.6). Considering the apparent imbalance between positive and negative data, we paid special attention to MCC values, as MCC is a single performance measure less influenced by imbalanced data.42 The model developed by the SVM algorithm with MACCS keys gave the best performance with MCC over 0.75, and the values of Q, SE, SP, and AUC were 89.90%, 73.02%, 97.24%, and 0.88, respectively. Performance of Models on External Validation. To further assess the robust and predictive ability of the ten models performed best on 5-fold cross validation, an external set containing 43 compounds was used. The performances on external validation can demonstrate the predictive capability of models objectively, since the data in the external set were completely independent of the training set and not used for model building. The results of models on external validation were shown in Table 3. We can find that all the models performed well on external validation. The Q values of models ranged from 81.40% to 90.70%, and the MCC values ranged from 0.51 to 0.77. The MACCS_kNN model, which performed best on 5-fold cross validation, achieved a prediction accuracy of 83.72% on external validation, and the values of SE, SP, AUC, and MCC were 50.00%, 96.77%, 0.91, and 0.57,

Nfragment class Nclass

(5)

Where Nfragment class is the number of compounds containing the substructure in each class and Nclass is the number of compounds in each class.



RESULTS AND DISCUSSION Data set analysis. After careful data preparation, 251 organic compounds with observed HBT data in LD50 values were extracted from previous work.5,16 As shown in Table 1, Table 1. Statistics of Chemicals in the Training Set and External Validation Set data sets

high HBT

none HBT

total

training set external validation set total

63 12 75

145 31 176

208 43 251 2950

DOI: 10.1021/acs.jcim.7b00476 J. Chem. Inf. Model. 2017, 57, 2948−2957

Article

Journal of Chemical Information and Modeling

Figure 1. Radar chart of five simple descriptors. Molecular solubility (logS), Alog P, molecular weight (MW), the number of hydrogen bond acceptors (nHBA), and the number of hydrogen bond donors (nHBD) for the entire data set of 251 compounds are presented. Each color line represents a compound.

of top-performing models, seven models were developed by SVM. In the others, three models performed well on 5-fold cross validation, two models were developed by RF, and one was developed by kNN. The SVM algorithm has been wellknown as a powerful tool for solving small sample, nonlinear, and high dimensional pattern problems. It can provide a good out-of-sample generalization if the parameters C and r (in the case of a Gaussian kernel) are appropriately chosen; thus, it can be robust by choosing an appropriate generalization grade even when the training sample has some bias. Actually, each machine learning method may fit the data better by selecting its own descriptors. In this study, we mainly used the fingerprints as descriptors. All the features in the fingerprints were kept to avoid the loss of information. These higher dimensional samples also resulted in SVM showing better performance than other machine learning methods for model building in our study. SVM seems to be something of a “gold standard” from the perspective of predictive accuracy,25,43 and these results are in agreement with our previously published work that the SVM algorithm is a good category method for chemical toxicity prediction.22,23,35,36 Seven kinds of fingerprints and a set of molecular descriptors contained 12 key physical−chemical properties were used for molecular characterization when developing the models. The fingerprints performed a litter better than molecular descriptors. In particular, the MACCS fingerprints accounted for 3 of the 10 models with best performance. MACCS fingerprints contained much structural information, as this fingerprint package is generated based on the well-defined structural fragments dictionary. In consideration of this result, the MACCS fingerprints package was recommended as the preferential attribute for chemical HBT model building.

Figure 2. Heat map of molecular similarity plotted by the Tanimoto similarity index using the ECFC-4 fingerprint of all molecules. The average Tanimoto similarity index is 0.237. The x- and y-axes represent the number of molecules, respectively.

respectively. The model developed by SubFP and SVM gave best result on external validation, which was amazing at the highest MCC value of 0.77. The Q and SE value of the SubFP_SVM model were much better than the other models’. These two models should be useful for HBT prediction of organic chemicals. Effects of Machine Learning Algorithms and Fingerprints Used in Model Building. Five different machine learning methods were employed for chemical HBT model building in this study. From the performance of the models on 5-fold cross validation and external validation, the SVM algorithm clearly dominated and provided the largest number 2951

DOI: 10.1021/acs.jcim.7b00476 J. Chem. Inf. Model. 2017, 57, 2948−2957

Article

Journal of Chemical Information and Modeling

information for discriminating between compounds with high HBT and those with no HBT. Herein, we systematically examined the relationships between the 12 key physical− chemical properties and chemical HBT. The distributions of these molecular properties were quite different between the high HBT and no HBT compounds, as shown in Figure 3, which suggested that these molecular properties could play important roles in distinguishing high HBT compounds from no HBT compounds. In addition, the linear correlations of these molecular properties and lgLD50 values of the whole data set were presented in Figure 4. The Pearson correlation coefficient (R) values were also present in the figure. We can find that five properties (AlogP, logD, MSA, MW, and nRTB) indicated a negative linear correlation with lgLD50, while logS and nHBD were positively correlated with lgLD50. This result indicated that these physical−chemical properties would possess significant relationships with chemical HBT. AlogP and logD represent the lipophilicity of a compound. In our data set, AlogP is distributed between −4.02 and 7.29, with a mean of 3.09. LogD is distributed between −5.29 and 7.29, with a mean of 2.93. The distributions of AlogP and logD of high and none HBT classes were presented in Figure 3. The mean values of AlogP were 3.37 and 2.66 for high HBT compounds and none HBT compounds, respectively. The mean values of logD are 3.30 and 2.37 for high and none HBT compounds, respectively. The data indicated that HBT was associated with chemical lipophilicity. As a complementary test, the linear correlation analysis of AlogP, LogD, and lgLD50 values were given in Figure 4. Both AlogP and logD indicated a negative linear correlation with lgLD50 (R values were −0.263 and −0.285). MW and MSA regarded as simple estimations of molecular size and complexity, which were generally thought to have an impact on chemical absorption. As shown in Figure 3, MW was distributed from 84.08 to 731.96, with a mean of 311.99. The mean values of MW were 324.59 and 293.34 for high and none HBT compounds, respectively. MSA was distributed from 92.56 to 722.37, with a mean of 293.68. The mean values of MSA were 329.59 and 279.02 for high and no HBT compounds, respectively. The data indicated that MW and MSA were also determinants of HBT of chemicals. This conclusion was further supported by the negative linear correlations (R values were −0.241 and −0.236) between MW, MSA, and lgLD50 in Figure 4. There is also a negative linear correlation (R = −0.247) between nRTB and lgLD50 suggested in Figure 4. As shown in Figure 3, nRTB was distributed from 0 to 13, with a mean of 5.02. The mean values of nRTB were 5.70 and 4.15 for high and no HBT compounds, respectively, which also indicated that nRTB was a determinant of HBT of chemicals. Hydrogen bonding ability was commonly represented by nHBA and nHBD. As shown in Figure 3, the mean values of nHBA were 4.48 and 4.04 for high and no HBT compounds, respectively. The mean values of nHBD were 0.29 and 0.90 for high and no HBT compounds, respectively. The distributions of nHBD were more different between the high and no HBT compounds than that of nHBA. And, the linear correlations of −0.12 and 0.392 were also observed between nHBA, nHBD, and lgLD50, respectively. The data indicated that there was no obvious relationship between nHBA and HBT, while there is a positive linear correlation between nHBD and lgLD50.

Table 2. Performances of Classification Models on 5-fold Cross Validation models

Q

SE

SP

AUC

MCC

Estate_kNN Estate_SVM Estate_RF Estate_NB Estate_CT Extend_NN Extend_SVM Extend_RF Extend_NB Extend_CT FP_kNN FP_SVM FP_RF FP_NB FP_CT MACCS_kNN MACCS_SVM MACCS_RF MACCS_NB MACCS_CT Pubchem_kNN Pubchem_SVM Pubchem_RF Pubchem_NB Pubchem_CT SubFP_kNN SubFP_SVM SubFP_RF SubFP_NB SubFP_CT KRFP_kNN KRFP_SVM KRFP_RF KRFP_NB KRFP_CT MD_kNN MD_SVM MD_RF MD_NB MD_CT

0.8271 0.8556 0.8271 0.8022 0.7688 0.8262 0.8893 0.8075 0.7836 0.7401 0.8268 0.8843 0.8173 0.8073 0.7834 0.8557 0.8990 0.8411 0.7927 0.8028 0.8072 0.8650 0.8511 0.7689 0.7882 0.8121 0.8411 0.7980 0.7734 0.7883 0.7931 0.8847 0.7597 0.7641 0.7643 0.7497 0.6925 0.8124 0.6875 0.6826

0.6508 0.6190 0.5079 0.6190 0.5714 0.7460 0.7143 0.4444 0.6349 0.5873 0.7460 0.7143 0.4603 0.6032 0.6032 0.7619 0.7302 0.5714 0.7302 0.6984 0.6984 0.6825 0.5556 0.6349 0.6667 0.6825 0.6032 0.4127 0.6032 0.6190 0.7778 0.6984 0.2381 0.5556 0.6508 0.5873 0.1270 0.4127 0.6190 0.5238

0.9034 0.9586 0.9655 0.8828 0.8552 0.8621 0.9655 0.9655 0.8483 0.8069 0.8621 0.9586 0.9724 0.8966 0.8621 0.8966 0.9724 0.9586 0.8207 0.8483 0.8552 0.9448 0.9793 0.8276 0.8414 0.8690 0.9448 0.9655 0.8483 0.8621 0.8000 0.9655 0.9862 0.8552 0.8138 0.8207 0.9379 0.9862 0.7172 0.7517

0.8363 0.8120 0.8450 0.7692 0.7165 0.8470 0.8762 0.8368 0.7228 0.7077 0.8522 0.8630 0.8747 0.7470 0.7048 0.8899 0.8812 0.8813 0.8097 0.7681 0.8177 0.8497 0.8668 0.7357 0.7535 0.8366 0.8229 0.8420 0.7434 0.7880 0.8502 0.8772 0.8372 0.7252 0.7444 0.7576 0.6328 0.7897 0.7489 0.6186

0.5775 0.6447 0.5689 0.5198 0.4395 0.5980 0.7310 0.5156 0.4854 0.3908 0.5980 0.7187 0.5442 0.5270 0.4767 0.6585 0.7555 0.6067 0.5320 0.5397 0.5488 0.6701 0.6360 0.4585 0.5037 0.5540 0.6067 0.4880 0.4579 0.4904 0.5484 0.7189 0.3762 0.4255 0.4551 0.4080 0.1089 0.5370 0.3176 0.2689

Table 3. Performances of the Ten Best Classification Models on External Validation models

Q

SE

SP

AUC

MCC

MACCS_SVM Extend_SVM KRFP_SVM FP_SVM Pubchem_SVM MACCS_kNN Estate_SVM Pubchem_RF MACCS_RF SubFP_SVM

0.8372 0.8605 0.8837 0.8605 0.8372 0.8140 0.8372 0.8140 0.8372 0.9070

0.5000 0.5000 0.5833 0.5000 0.5000 0.7500 0.5833 0.3333 0.5000 0.8333

0.9677 1.0000 1.0000 1.0000 0.9677 0.8387 0.9355 1.0000 0.9677 0.9355

0.9059 0.8871 0.9328 0.9059 0.8817 0.8306 0.8952 0.8414 0.7500 0.9167

0.5683 0.6472 0.7087 0.6472 0.5683 0.5635 0.5720 0.5147 0.5683 0.7688

Relevance of Molecular Properties to with fingerprints, the models developed descriptors did not perform well enough. physical−chemical properties still present

HBT. Compared with molecular However, these a lot of useful 2952

DOI: 10.1021/acs.jcim.7b00476 J. Chem. Inf. Model. 2017, 57, 2948−2957

Article

Journal of Chemical Information and Modeling

Figure 3. Distributions of 12 key molecular properties, including molecular weight (MW), Ghose−Crippen LogKow (AlogP), molecular solubility (logS), the number of hydrogen bond acceptors (nHBA), the number of hydrogen bond donors (nHBD), logD, molecular surface area (MSA), molecular polar surface area (MPSA), molecular fractional polar surface area (MFPSA), the number of rotatable bonds (nRTB), the number of rings (nR), and the number of aromatic rings (nAR) for the high HBT chemicals and no HBT chemicals, respectively.

amount and structure of the compounds but also on the environmental conditions into which the chemicals were released. It is quite difficult to explain the mechanism or estimate chemical HBT using individual or several simple chemical descriptors. Analysis of Structural Alerts. In order to investigate structural features between high HBT and no HBT chemicals, frequency analysis was applied to identify privileged substructures based on KRFP and SubFP fingerprints. Only the fragments presented in 10 or more compounds were analyzed. Three (dimethylamine, 1-methoxy-3-methylbenzene, and cyclopropane carboxylic ester) and four (chloroalkene, carbodithioic

LogS is regarded as an estimation of molecular solubility in water. In our data set, logS was distributed from −10.92 to −0.63, with a mean of −4.54. The mean values of MW were −4.65 and −3.97 for high and no HBT compounds, respectively. It also indicated a positive linear correlation with lgLD50 (R = 0.227). Since the most toxic compounds are characterized by low LD50, AlogP, logD, MW, MSA, and nRTB indicated positive correlation coefficients with HBT, whereas nHBD and LogS indicated negative correlation coefficients with HBT. Actually, toxicity, as a complex chemical and biological process consisting of many steps, did not only depend on the 2953

DOI: 10.1021/acs.jcim.7b00476 J. Chem. Inf. Model. 2017, 57, 2948−2957

Article

Journal of Chemical Information and Modeling

Figure 4. Correlation of 12 key molecular properties, including molecular weight (MW), Ghose−Crippen LogKow (AlogP), molecular solubility (logS), the number of hydrogen bond acceptors (nHBA), the number of hydrogen bond donors (nHBD), logD, molecular surface area (MSA), molecular polar surface area (MPSA), molecular fractional polar surface area (MFPSA), the number of rotatable bonds (nRTB), the number of rings (nR), and the number of aromatic rings (nAR) versus chemical HBT (lgLD50).

ester, sulfenic derivative, and phosphoric acid derivative) privileged substructures were identified from KRFP and SubFP fingerprints, respectively. Details of frequencies of each substructure presented in the positive and negative classes were shown in Table 4. These fragments presented far more frequently in high HBT compounds than in no HBT compounds. It is worth mentioning that substructure KRFP3467 (cyclopropane

carboxylic ester) only presented in the high HBT compounds class. Cyclopropane carboxylic ester is a basic composition unit of type I pyrethroids, which represents a broad class of pesticides.44 KRFP810 (dimethylamine) and KRFP2703 (1methoxy-3-methylbenzene) are also common components of many pesticides. Phosphonic acid derivative is a phosphorus fragment, which exists in a large class of pesticides as organophosphates. 2954

DOI: 10.1021/acs.jcim.7b00476 J. Chem. Inf. Model. 2017, 57, 2948−2957

Article

Journal of Chemical Information and Modeling Table 4. Privileged Substructures from KRFP and SubFP Fingerprints Responsible for HBT

a

The number of compounds containing the substructure in the high HBT class. bThe appearance frequencies of the substructure among the high HBT class. cThe number of compounds containing the substructure in the no HBT class. dThe appearance frequencies of the substructure among the no HBT class.

molecular descriptors containing 12 physical−chemical properties) were systemically used to develop the best combinatorial classification probability models for chemical toxicity in honey bee based on a diverse data set contained high quality experimental data. The model developed with MACCS and SubFP fingerprints using SVM algorithms performed best on 5fold cross validation and external validation, respectively. (ii) The relevance of 12 key physical−chemical properties and chemical HBT was investigated. AlogP, logD, MW, MSA, and nRTB indicated positive correlation coefficients with HBT, whereas nHBD and LogS indicated negative correlation coefficients with HBT. (iii) The substructure frequency analysis based on KRFP and SubFP fingerprints was applied to identify the privileged substructures for high HBT and no HBT chemicals. Seven substructures presented far more frequently in high HBT compounds than in no HBT compounds. These structural alerts could improve the interpretation of models and provide a useful visual alert function for environmental assessment experts. The models developed in this study and the analysis of relations between chemical HBT and the structural features will

Chemicals containing this fragments can inhibit the activity of cholinesterase and result in the accumulation of acetylcholine, which is a neurotransmitter of the cholinergic receptor. It also can take direct effect on the cholinergic receptor, which can lead the next neuron or effector to excessive excitement or inhibition.22,45 The fragments chloroalkene, carbodithioic ester, and sulfenic derivatives have been also identified as structural alerts responsible for acute oral toxicity in rats.22 The substructure analysis can be used to identify several privileged substructures or structure alerts for high HBT and no HBT chemicals, but they cannot characterize the spatial arrangement of these privileged fragments or how to characterize them if multiple privileged fragments or structural alerts are found simultaneously in the same chemical.46 Nonetheless, these meaningful substructures identified in this study could potentially provide some useful visual alert function in the field of environmental hazard assessment.



CONCLUSIONS In summary, we obtained 3 vital perspectives in this study: (i) Five different machine learning methods and 8 different feature reduction methods (7 kinds of fingerprints and a set of 2955

DOI: 10.1021/acs.jcim.7b00476 J. Chem. Inf. Model. 2017, 57, 2948−2957

Article

Journal of Chemical Information and Modeling

(11) Cheng, F.; Shen, J.; Li, W.; Lee, P. W.; Yun, T. In Silico Prediction of Terrestrial and Aquatic Toxicities for Organic Chemicals. Chin. J. Pestic. Sci. 2010, 12, 477−488. (12) Colombo, A.; Benfenati, E.; Karelson, M.; Maran, U. The Proposal of Architecture for Chemical Splitting to Optimize Qsar Models for Aquatic Toxicity. Chemosphere 2008, 72, 772−780. (13) Sun, L.; Zhang, C.; Chen, Y.; Li, X.; Zhuang, S.; Li, W.; Liu, G.; Lee, P. W.; Tang, Y. In Silico Prediction of Chemical Aquatic Toxicity with Chemical Category Approaches and Substructural Alerts. Toxicol. Res. 2015, 4, 452−463. (14) Vighi, M.; Garlanda, M. M.; Calamari, D. QSARs for Toxicity of Organophosphorous Pesticides to Daphnia and Honeybees. Sci. Total Environ. 1991, 109, 605−622. (15) Celli, G.; Maccagnani, B. Honey Bees as Bioindicators of Environmental Pollution. Bull. Insectol. 2003, 56, 137−139. (16) Como, F.; Carnesecchi, E.; Volani, S.; Dorne, J.; Richardson, J.; Bassan, A.; Pavan, M.; Benfenati, E. Predicting Acute Contact Toxicity of Pesticides in Honeybees (Apis Mellifera) through a k-Nearest Neighbor Model. Chemosphere 2017, 166, 438−444. (17) Blacquière, T.; Koomen, I.; Roessink, I.; van der H, V. Pesticide Risks to Wild Pollinators Workshop Report, 30 October−1 November; Wageningen University & Research Centre: Netherlands, 2011. (18) Hou, T.; Li, Y.; Zhang, W.; Wang, J. Recent Developments of in Silico Predictions of Intestinal Absorption and Oral Bioavailability. Comb. Chem. High Throughput Screening 2009, 12, 497−506. (19) Hou, T.; Wang, J. Structure−Adme Relationship: Still a Long Way to Go? Expert Opin. Drug Metab. Toxicol. 2008, 4, 759−770. (20) Zhang, C.; Zhou, Y.; Gu, S.; Wu, Z.; Wu, W.; Liu, C.; Wang, K.; Liu, G.; Li, W.; Lee, P. W.; Tang, Y. In Silico Prediction of Herg Potassium Channel Blockage by Chemical Category Approaches. Toxicol. Res. 2016, 5, 570−582. (21) Yap, C. W. Padel-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints. J. Comput. Chem. 2011, 32, 1466−1474. (22) Li, X.; Chen, L.; Cheng, F.; Wu, Z.; Bian, H.; Xu, C.; Li, W.; Liu, G.; Shen, X.; Tang, Y. In Silico Prediction of Chemical Acute Oral Toxicity Using Multi-Classification Methods. J. Chem. Inf. Model. 2014, 54, 1061−1069. (23) Li, X.; Du, Z.; Wang, J.; Wu, Z.; Li, W.; Liu, G.; Shen, X.; Tang, Y. In Silico Estimation of Chemical Carcinogenicity with Binary and Ternary Classification Methods. Mol. Inf. 2015, 34, 228−235. (24) Wang, Q.; Li, X.; Yang, H.; Cai, Y.; Wang, Y.; Wang, Z.; Li, W.; Tang, Y.; Liu, G. In Silico Prediction of Serious Eye Irritation or Corrosion Potential of Chemicals. RSC Adv. 2017, 7, 6697−6703. (25) Xu, C.; Cheng, F.; Chen, L.; Du, Z.; Li, W.; Liu, G.; Lee, P. W.; Tang, Y. In Silico Prediction of Chemical Ames Mutagenicity. J. Chem. Inf. Model. 2012, 52, 2840−2847. (26) Yang, H.; Li, X.; Cai, Y.; Wang, Q.; Li, W.; Liu, G.; Tang, Y. In Silico Prediction of Chemical Subcellular Localization Via MultiClassification Methods. MedChemComm 2017, 8, 1225−1234. (27) Cheng, F.; Yu, Y.; Zhou, Y.; Shen, Z.; Xiao, W.; Liu, G.; Li, W.; Lee, P. W.; Tang, Y. Insights into Molecular Basis of Cytochrome P450 Inhibitory Promiscuity of Compounds. J. Chem. Inf. Model. 2011, 51, 2482−2495. (28) Klekota, J.; Roth, F. P. Chemical Substructures That Enrich for Biological Activity. Bioinformatics 2008, 24, 2518−2525. (29) Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273−297. (30) Cover, T.; Hart, P. Nearest Neighbor Pattern Classification. IEEE Trans. Inf. Theory 1967, 13, 21−27. (31) Watson, P. Naive Bayes Classification Using 2d Pharmacophore Feature Triplet Vectors. J. Chem. Inf. Model. 2008, 48, 166−178. (32) Quinlan, J. R., C4.5: Programming for Machine Learning; Morgan Kauffmann, 1993, Vol. 38. (33) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5−32. (34) Lei, T.; Li, Y.; Song, Y.; Li, D.; Sun, H.; Hou, T. Admet Evaluation in Drug Discovery: 15. Accurate Prediction of Rat Oral Acute Toxicity Using Relevance Vector Machine and Consensus Modeling. J. Cheminf. 2016, 8, 6.

provide critical information and useful tools for predicting HBT of new chemicals in environmental hazard assessment.



ASSOCIATED CONTENT

* Supporting Information S

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00476. Additional details on materials and Tables S1 and S2 (PDF)



AUTHOR INFORMATION

Corresponding Authors

*Phone: +86-10-5934-1764. Fax: +86-10-5934-1855. E-mail: [email protected] (Y.Z.). *Phone: +86-10-5934-1890. E-mail: [email protected] or [email protected] (X.L.). ORCID

Xiao Li: 0000-0002-1148-9898 Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS This work was supported by Beijing Postdoctoral Research Foundation (Grant 2017-ZZ-074) and the Special Promotion For Scientific Small and Medium Enterprise of Beijing (Grant Z16010101204).



REFERENCES

(1) Klein, A. M.; Vaissiere, B. E.; Cane, J. H.; Steffan-Dewenter, I.; Cunningham, S. A.; Kremen, C.; Tscharntke, T. Importance of Pollinators in Changing Landscapes for World Crops. Proc. R. Soc. London, Ser. B 2007, 274, 303−313. (2) Benfenati, E. Quantitative Structure-Activity Relationships (QSAR) for Pesticide Regulatory Purposes; Elsevier: Amsterdam, 2011. (3) Devillers, J.; Pham-Delegue, M.; Decourtye, A.; Budzinski, H.; Cluzeau, S.; Maurin, G. Structure-Toxicity Modeling of Pesticides to Honey Bees. SAR QSAR Environ. Res. 2002, 13, 641−648. (4) Pettis, J. S.; Lichtenberg, E. M.; Andree, M.; Stitzinger, J.; Rose, R.; vanEngelsdorp, D. Crop Pollination Exposes Honey Bees to Pesticides Which Alters Their Susceptibility to the Gut Pathogen Nosema Ceranae. PLoS One 2013, 8, e70182. (5) Singh, K. P.; Gupta, S.; Basant, N.; Mohan, D. QSTR Modeling for Qualitative and Quantitative Toxicity Predictions of Diverse Chemical Pesticides in Honey Bee for Regulatory Purposes. Chem. Res. Toxicol. 2014, 27, 1504−1515. (6) Toropov, A. A.; Benfenati, E. Smiles as an Alternative to the Graph in Qsar Modelling of Bee Toxicity. Comput. Biol. Chem. 2007, 31, 57−60. (7) Cheng, F.; Shen, J.; Yu, Y.; Li, W.; Liu, G.; Lee, P. W.; Tang, Y. In Silico Prediction of Tetrahymena Pyriformis Toxicity for Diverse Industrial Chemicals with Substructure Pattern Recognition and Machine Learning Methods. Chemosphere 2011, 82, 1636−1643. (8) Hengstler, J.; Foth, H.; Kahl, R.; Kramer, P.; Lilienblum, W.; Schulz, T.; Schweinfurth, H. The Reach Concept and Its Impact on Toxicological Sciences. Toxicology 2006, 220, 232−239. (9) Xue, Y.; Li, H.; Ung, C.; Yap, C.; Chen, Y. Classification of a Diverse Set of Tetrahymena Pyriformis Toxicity Chemical Compounds from Molecular Descriptors by Statistical Learning Methods. Chem. Res. Toxicol. 2006, 19, 1030−1039. (10) Zhang, C.; Cheng, F.; Sun, L.; Zhuang, S.; Li, W.; Liu, G.; Lee, P. W.; Tang, Y. In Silico Prediction of Chemical Toxicity on Avian Species Using Chemical Category Approaches. Chemosphere 2015, 122, 280−287. 2956

DOI: 10.1021/acs.jcim.7b00476 J. Chem. Inf. Model. 2017, 57, 2948−2957

Article

Journal of Chemical Information and Modeling (35) Li, X.; Zhang, Y.; Chen, H.; Li, H.; Zhao, Y. In Silico Prediction of Chronic Toxicity with Chemical Category Approaches. RSC Adv. 2017, 7, 41330−41338. (36) Li, X.; Zhang, Y.; Li, H.; Zhao, Y. Modeling of the Herg K+ Channel Blockage Using Online Chemical Database and Modeling Environment (OCHEM). Mol. Inf. 2017, DOI: 10.1002/ minf.201700074. (37) Chang, C. C.; Lin, C. J. Libsvm: A Library for Support Vector Machines. ACM Trans. intell. Syst. Technol. (TIST) 2011, 2, 27. (38) Ashby, J. Fundamental Structural Alerts to Potential Carcinogenicity or Noncarcinogenicity. Environ. Mutagen. 1985, 7, 919−921. (39) Benigni, R.; Bossa, C. Structure Alerts for Carcinogenicity, and the Salmonella Assay System: A Novel Insight through the Chemical Relational Databases Technology. Mutat. Res., Rev. Mutat. Res. 2008, 659, 248−261. (40) Netzeva, T. I.; Saliner, A. G.; Worth, A. P. Comparison of the Applicability Domain of a Quantitative Structure-Activity Relationship for Estrogenicity with a Large Chemical Inventory. Environ. Toxicol. Chem. 2006, 25, 1223−1230. (41) Butina, D. Unsupervised Data Base Clustering Based on Daylight’s Fingerprint and Tanimoto Similarity: A Fast and Automated Way to Cluster Small and Large Data Sets. J. Chem. Inf. Comput. Sci. 1999, 39, 747−750. (42) Bekkar, M.; Djemaa, H. K.; Alitouche, T. A. Evaluation Measures for Models Assessment over Imbalanced Data Sets. J. Inf. Eng. Appl. 2013, 3, 27−38. (43) Chen, B.; Sheridan, R. P.; Hornak, V.; Voigt, J. H. Comparison of Random Forest and Pipeline Pilot Naive Bayes in Prospective Qsar Predictions. J. Chem. Inf. Model. 2012, 52, 792−803. (44) Bradberry, S. M.; Cage, S. A.; Proudfoot, A. T.; Vale, J. A. Poisoning Due to Pyrethroids. Toxicol. Rev. 2005, 24, 93−106. (45) Casida, J. E.; Quistad, G. B. Organophosphate Toxicology: Safety Aspects of Nonacetylcholinesterase Secondary Targets. Chem. Res. Toxicol. 2004, 17, 983−998. (46) Cheng, F.; Ikenaga, Y.; Zhou, Y.; Yu, Y.; Li, W.; Shen, J.; Du, Z.; Chen, L.; Xu, C.; Liu, G.; et al. In Silico Assessment of Chemical Biodegradability. J. Chem. Inf. Model. 2012, 52, 655−669.

2957

DOI: 10.1021/acs.jcim.7b00476 J. Chem. Inf. Model. 2017, 57, 2948−2957