Carcinogenicity Prediction of Noncongeneric Chemicals by a Support

Mar 27, 2013 - Both the SVM model (Model A1) and the best model from the 10-fold ... extended connectivity fingerprints (ECFPs) and the Toxtree softwa...
0 downloads 0 Views 932KB Size
Article pubs.acs.org/crt

Carcinogenicity Prediction of Noncongeneric Chemicals by a Support Vector Machine Min Zhong, Xianglei Nie, Aixia Yan,* and Qipeng Yuan State Key Laboratory of Chemical Resource Engineering, Department of Pharmaceutical Engineering, P.O. Box 53, Beijing University of Chemical Technology, 15 BeiSanHuan East Road, Beijing 100029, P. R. China S Supporting Information *

ABSTRACT: The ability to identify carcinogenic compounds is of fundamental importance to the safe application of chemicals. In this study, we generated an array of in silico models allowing the classification of compounds into carcinogenic and noncarcinogenic agents based on a data set of 852 noncongeneric chemicals collected from the Carcinogenic Potency Database (CPDBAS). Twenty-four molecular descriptors were selected by Pearson correlation, F-score, and stepwise regression analysis. These descriptors cover a range of physicochemical properties, including electrophilicity, geometry, molecular weight, size, and solubility. The descriptor mutagenic showed the highest correlation coefficient with carcinogenicity. On the basis of these descriptors, a support vector machine-based (SVM) classification model was developed and fine-tuned by a 10-fold cross-validation approach. Both the SVM model (Model A1) and the best model from the 10-fold cross-validation (Model B3) runs gave good results on the test set with prediction accuracy over 80%, sensitivity over 76%, and specificity over 82%. In addition, extended connectivity fingerprints (ECFPs) and the Toxtree software were used to analyze the functional groups and substructures linked to carcinogenicity. It was found that the results of both methods are in good agreement.

1. INTRODUCTION Cancer is the leading cause of death in the developed world and an enormous burden to public health. Any substance able to induce tumors, increase the incidence of tumors, or shorten the time to tumor occurrence is a carcinogen.1,2 The public is becoming more aware about implications of lifestyle, such as diet and smoking, pollution, and workplace exposure on cancer formation, but the regulators often do not have the adequate tools and resources available for the safety assessment of chemicals. One of the reasons is that bioassays are highly resource and time demanding. Testing in animal models is still often required for the rigorous evaluation of substances. Therefore, in silico approaches for the accurate safety profiling of chemicals has become a research focus in recent years. In silico models can be classified into (a) qualitative structure−activity relationship (SAR) models, i.e., classification models, (b) quantitative structure−activity relationship (QSAR) models, and (c) expert systems. SAR and QSAR models have been proven useful in drug and pesticide design and chemical risk assessment since they are able to correlate chemical activity with structural properties, represented by numerical molecular descriptors. A large body of SAR models has been developed for predicting carcinogenicity, and most of these models are applied to well-defined congeneric classes of © 2013 American Chemical Society

carcinogens. It was reported that in order to obtain reliable models, a SAR analysis should focus on a well-defined set of congeneric chemicals, i.e., structurally related chemicals that have the same mechanism of action.3,4 Nevertheless, SAR models for noncongeneric chemicals are of great interest for regulatory use as they allow the assessment of diverse classes of chemicals. In recent years, several investigators developed SAR models for predicting the carcinogenicity of a wide variety of chemicals. For example, Tanabe et al.5 collected experimental carcinogenicity data on ∼1,500 chemicals from six sources, including the IARC (International Agency for Research on Cancer), EU (European Union), EPA (US Environmental Protection Agency), NTP, ACGIH (American Conference of Governmental Industrial Hygienists), and JSOH (Japan Society for Occupational Health). Nine hundred eleven organic chemicals were selected for SAR modeling, and an overall accuracy level of about 70% was obtained. In addition, 20 mutually overlapping subgroups were used to classify the whole data set according to contained substructures, and an overall accuracy of approximately 80% was achieved. Fjodorova et Received: January 10, 2013 Published: March 27, 2013 741

dx.doi.org/10.1021/tx4000182 | Chem. Res. Toxicol. 2013, 26, 741−749

Chemical Research in Toxicology

Article

al.1 compiled a rat-derived rodent carcinogenic potency data set containing 805 chemicals. The Counter Propagation Artificial Neural Network (CP ANN) technique was applied for building qualitative and quantitative models, using MDL molecular descriptors. The quantitative models showed poor prediction success rates, but the qualitative models gave the accuracy of 92% on the training set and 68% on the test set. Toxic properties or carcinogenicity of chemicals can often be related to their substructures or functional groups, which are generally identified as toxicophores or structural alerts (SA).6 A number of SAs have already been identified in the literature. The research on SAs is of great value for understanding the action mechanisms of carcinogens and assessing the potential risk of chemicals. The objective of this study is to build classification models to discriminate noncongeneric chemicals as carcinogenic active and inactive, and to identify the substructures or functional groups causing carcinogenic effects. A rat carcinogenicity data set (449 active and 403 inactive) compiled from the Carcinogenic Potency Database (CPDBAS, version 5d)7 was constructed and used in this study. Support vector machines (SVMs) were used for building models, and the approach was evaluated employing 10-fold cross-validation. An analysis using extended connectivity fingerprints (ECFPs) and the toxic hazard estimation program ToxTree8 revealed several toxicophores of compounds with carcinogenic activity. ToxTree encodes the Benigni/Bossa set of rules for carcinogenicity and mutagenicity.9

random number, uniformly distributed in the range (0, 1). All 85 or 86 compounds were clustered into one subset. Therefore, 10 models were built with selected MOE descriptors14 to discriminate a compound to be a carcinogenic active or inactive compound. 2.2. Diversity of the Data Set. The data set contains 852 noncongeneric compounds. The diversity of the data set was characterized by the pairwise dissimilarity (the inverse of similarity) of the molecules in the data set. Here, the diversity of the 852 compounds in the data set was investigated. A molecule A can be described by means of a vector XA of n attributes such that

XA = {x1A ; x 2A ; x3A ; ...; xjA ; ...; xnA}

(1)

th

where xjA is the value of j attribute of molecule A. The attributes might be the calculated physicochemical properties or the on/off state of the n bits in the fingerprint representing the molecule. On the basis of the different molecular structural representations, the Tanimoto coefficient was used for quantifying the degree of similarity of two molecules.15 For the dichotomous variables (binary vectors), the Tanimoto coefficient SA,B between the molecules A and B is defined as the eq 2:

SA, B =

c a+b−c

(2)

where a is the number of bits “on” in A, b is the number of bits “on” in B, and c is the number of bits “on” in both A and B. SA,B is in the range of (0, 1). Thus, the diversity coefficient (DA,B) of the two molecules A and B can be defined as eq 3. The higher the diversity coefficient, the more dissimilar are the two molecules.

DA,B = 1 − SA,B

(3)

MACCS fingerprints were calculated for all molecules in this data set for diversity analysis. In the studied data set, 449 chemicals were classified as carcinogens and the remaining 403 molecules as noncarcinogens. The diversity coefficients of any two molecules in the groups of carcinogens (or noncarcinogens) were calculated. Pairwise diversity coefficients of carcinogenic molecules, 100,576 (449*448/2), and pairwise diversity coefficients of noncarcinogenic molecules, 81,003 (403*402/2), were obtained as illustrated in Figure 1A and B. For the carcinogenic molecules, most of the pairwise diversity coefficients (about 99.12% of the pairwise diversity coefficients, 99,687 out of 100,576) are higher than 0.3, and 88.62% of the pairwise diversity coefficients (89,130 out of 100,576) are higher than 0.6. The coefficient ranges of (0.8, 0.9) and (0.9, 1) have reached 27.40% (27,562) and 29.10% (29,267). The high rates of diversity coefficient in the range of more than 0.6 revealed that the structures of carcinogenic molecules were diverse, and this conclusion applies to the noncarcinogenic molecules as well. 2.3. Structure Representation. In this study, MACCS fingerprints and 334 MOE descriptors14 were calculated for each molecular structure. The MACCS fingerprints were used to split the training/test set and to analyze the diversity of the data set, while the MOE descriptors were selected for building the classification models. Among all of the 334 MOE descriptors, 186 were 2D descriptors, including physical properties, subdivided surface areas, atom counts and bond counts, adjacency and distance matrix descriptors, Kier & Hall Connectivity and Kappa Shape indices, pharmacophore feature descriptors, and partial charge descriptors. The remaining 148 descriptors were 3D descriptors, including potential energy descriptors, MOPAC descriptors, surface area, volume and shape descriptors, and conformation dependent charge descriptors. Before the computation of MOE descriptors, CORINA16 was employed for 3D structure generation. 2.4. Descriptor Selection Methods. Pearson correlation analysis and stepwise regression method based on the whole data set were used to find the best combination of descriptors for modeling. Pearson correlation analysis can eliminate descriptors that are not significantly correlated with activity and are highly correlated with one another.

2. MATERIALS AND METHODS 2.1. Data Set. Rats and mice are widely used experimental models for investigating chemicals because of their relatively short life span, limited cost of maintenance, widespread use in pharmacological and toxicological studies, susceptibility to tumor induction, and the availability of inbred or sufficiently characterized strains.10,11 In this study, we only considered rat data derived from the Carcinogenic Potency Database (CPDBAS, version 5d).7 Eight hundred fifty-two noncongeneric compounds with well-defined chemical structures were selected, including 449 active carcinogens (labeled “1”) and 403 inactive carcinogens (labeled “0”). The Kohonen’s Self-Organizing Map (SOM) algorithm available in the software packaged SONNIA12 was used for generating a training and a test set, by clustering the 852 compounds by their MACCS fingerprints.13 A rectangle Kohonen map (size 29 × 19) was created to cluster all compounds in the whole data set using MACCS fingerprints of each compound as input and its activity level (active or inactive) as output. The selection was based on the rule that if two or three compounds with the same activity level were clustered together, one was assigned into the test set, and the others were put into the training set; if more than three compounds with the same activity level were clustered together, about half of the compounds were put into the test set, and the rest were assigned into the training set. In cases where neurons had only one compound or neurons had a unique compound whose activity level was different from those of others, this “orphan” compound was assigned to the training set. This procedure ensured that the training set is of a larger chemical space than that of the test set. On the basis of the splitting rules described above, the data set was finally divided into a training set of 651 compounds (331 active carcinogenic compounds and 320 inactive carcinogenic compounds) and a test set of 201 compounds (118 active carcinogenic compounds and 83 inactive carcinogenic compounds). The training set was used for building classification models, and the test set was used for evaluation. In addition, the data set was randomly divided into ten subsets for 10-fold cross-validation. The determining criterion was grounded on a 742

dx.doi.org/10.1021/tx4000182 | Chem. Res. Toxicol. 2013, 26, 741−749

Chemical Research in Toxicology

Article

positive and negative instances referred to carcinogens and noncarcinogens, respectively.

F(i) =

(xi(+) − xi)2 + (xi(−) − xi)2 n

1 n+ − 1

∑k+= 1 (xk , i(+) − xi(+))2 +

1 n− − 1

n

∑k−= 1 (xk , i(−) − xi(−))2 (5)

2.6. Evaluation of Model Performance. Several parameters were employed to evaluate the performance of models, including accuracy, sensitivity (SE), specificity (SP), and the Matthew’s correlation coefficient (MCC),17 which are listed in eqs 6−9. True positives (TP) represent the number of carcinogens, which were predicted as carcinogens, true negatives (TN) represent the number of noncarcinogens, which were predicted as noncarcinogens, false positives (FP) represent the number of noncarcinogens, which were predicted as carcinogens, and false negatives (FN) represent the number of carcinogens, which were predicted as noncarcinogens. SE stands for the prediction accuracy for carcinogens, and SP stands for the prediction accuracy for noncarcinogens.

Accuracy =

TP + TN × 100% TP + TN + FP + FN

(6)

SE =

TP × 100% TP + FN

(7)

SP =

TN × 100% FP + TN

(8)

MCC =

TP × TN − FP × FN (TP + FP)(TP + FN)(TN + FP)(TN + FN) (9)

From eq 9, the higher MCC value is, the better the model performs. 2.7. Support Vector Machine. SVM was chosen to generate the classification models. SVM originated as an implementation of Vapnik’s Structural Minimization (SRM) principle from statistical learning theory. The special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin. Thus, SVM is also known as a maximum margin classifier. In this study, the LIBSVM software developed by Chang and Lin was used for SVM analysis.18 There are four kernels in LIBSVM. The radial basis function (RBF), which was suggested as a reasonable first choice, was used in this work.18 2.8. ECFP_4 Fingerprints. Extended connectivity fingerprints (ECFP) are circular fingerprints derived using a variant of the Morgan algorithm.19 Advantages of the circular fingerprints are that they can be calculated rapidly, have a high information content (commonly with respect to bioactivity20), are able to consider stereochemical information, and can also be interpreted as chemical substructures. In this study, the ECFP_4 fingerprints, were used to analyze the structure features of compounds. 2.9. Structural Alerts. On the basis of the electrophilic theory of chemical carcinogenesis developed by James and Elizabeth Miller,21,22 it was postulated that the carcinogens had the unifying feature that they were either electrophiles per se or could be activated to electrophilic reactive intermediates. Guided by this hypothesis, several chemical functional groups and substructures (structural alerts, SA) for carcinogens had been identified, which were related to the carcinogenicity of chemicals. When one or more SAs embedded in a chemical structure were recognized, the chemical would be flagged as a potential carcinogen.

Figure 1. Diversity analysis for the 852 compounds in the data set. (A) The frequency of the diversity coefficient of compound pairs on the basis of the MACCS fingerprint of 449 carcinogens. (B) The frequency of the diversity coefficient of compound pairs on the basis of the MACCS fingerprint of 403 noncarcinogens. Among all of the 334 MOE descriptors, 58 descriptors were not considered as they were found to have no variance or value in this particular case. The remaining 276 descriptors were used to select the best set. In this study, if the correlation coefficient between any two descriptors was higher than 0.9, the descriptor that had a lower correlation coefficient with the activity was removed. The remaining descriptors were subsequently selected using a stepwise linear regression variable selection method, and finally, 24 descriptors were chosen for modeling. These 24 descriptors and their Pearson correlation coefficient are shown in Table 1. Before training, the input data (the selected descriptors) were scaled to a [0.1, 0.9] range via the following equation: x − xmin xi* = i × 0.8 + 0.1 xmax − xmin (4) where xi was the original value, and xi* was the scaled value. xmin and xmax were the corresponding minimum and maximum values of the descriptor variable, respectively. 2.5. F-Score for Evaluating the Importance of Selected Descriptors. The F-score, which measures the discrimination of two sets of real numbers, is used to calculate the importance of individual descriptors in this study. In the following eq 5, xi, xi(+), and xi(−) stand for the average of the ith descriptor of the whole, positive, and negative (−) data sets; x(+) k,i is the ith descriptor of the kth positive instance, and xk,i is the ith feature of the kth negative instance; and n+ and n− are the number of positive and negative instances, respectively. The higher the F-score value, the more important is the descriptor. In this study,

3. RESULTS AND DISCUSSION 3.1. SVM Classification Models. An SVM classification model (Model A1) was built based on the 24 selected descriptors in this study. The training set containing 651 743

dx.doi.org/10.1021/tx4000182 | Chem. Res. Toxicol. 2013, 26, 741−749

Chemical Research in Toxicology

Article

Table 1. Correlation Coefficient (R) to Carcinogenicity and the F-Score of Selected MOE Descriptors (Sorted as Descending FScore) no.

descriptor

R

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

mutagenic SMR_VSA3 a_don lip_violation SlogP_VSA3 vsurf_A PEOE_VSA-5 vsurf_DW13 Kier3 a_nS pmi1 vsurf_EWmin1 vsurf_IW7 vsurf_W3 BCUT_SMR_1 BCUT_PEOE_2 vsa_pol vsurf_IW8 E_sol a_nI glob a_ICM npr2 PEOE_VSA-4

0.26 −0.139 −0.136 −0.136 0.125 −0.123 −0.116 −0.111 −0.111 −0.108 −0.102 0.097 −0.08 −0.078 0.069 −0.054 −0.048 0.046 0.044 −0.046 −0.041 0.031 0.007 −0.002

F-score 7.32 1.92 1.87 1.85 1.60 1.53 1.35 1.24 1.24 1.15 1.02 9.48 6.50 6.16 4.75 2.98 2.34 2.12 2.06 1.99 1.68 9.54 5.09 5.58

× × × × × × × × × × × × × × × × × × × × × × × ×

10−02 10−02 10−02 10−02 10−02 10−02 10−02 10−02 10−02 10−02 10−02 10−03 10−03 10−03 10−03 10−03 10−03 10−03 10−03 10−03 10−03 10−04 10−05 10−06

origin

class

physical properties subdivided surface areas pharmacophore feature descriptors atom counts and bond counts subdivided surface areas surface area, volume and shape descriptors partial charge descriptors surface area, volume and shape descriptors Kier & Hall connectivity and kappa shape indices atom counts and bond counts surface area, volume and shape descriptors surface area, volume and shape descriptors surface area, volume and shape descriptors surface area, volume and shape descriptors adjacency and distance matrix descriptors adjacency and distance matrix descriptors pharmacophore feature descriptors surface area, volume and shape descriptors potential energy descriptors atom counts and bond counts surface area, volume and shape descriptors atom counts and bond counts surface area, volume and shape descriptors partial charge descriptors

1 4 5 4/5 5 5 2 5 3 6 3 5 5 5 6 6 4 5 5 6 3 6 3 2

Table 2. Prediction Performance of the SVM Model and 10-Fold Cross-Validationa parameter

training set

test set

model

c

γ

number

accuracy (%)

number

accuracy (%)

SEb (%)

SPc (%)

MCCd

A1 A2 A3 B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 averagee

0.9 2.4 1 8.2 512 1 0.5 3.7 32 6.6 4.3 7.8 512

8.9 4 9.2 1.5 0.9 12.3 7.7 2 0.3 1.38 2 0.8 0.03125

331/320 291/378 405/298 406/361 411/356 402/365 408/359 394/373 409/358 414/353 398/369 400/366 399/367

84.95 84.45 87.77 81.88 95.83 91.13 81.23 82.66 75.36 80.44 82.01 78.59 73.50 82.26

118/83 158/25 44/105 43/42 38/47 47/38 41/44 55/30 40/45 35/50 51/34 49/37 50/36

80.10 76.50 75.17 76.47 70.59 81.18 69.41 74.12 72.94 68.24 72.94 73.26 60.47 71.96

77.11 88.00 68.57 74.41 63.16 76.60 73.17 80.00 67.50 82.86 70.59 73.47 62.00 72.38

82.20 74.68 90.91 78.57 76.60 86.84 62.91 63.33 77.78 59.00 76.47 72.97 58.33 71.28

0.59 0.45 0.54 0.52 0.39 0.59 0.41 0.43 0.44 0.49 0.42 0.44 0.19 0.43

a

Number of the carcinogens/noncarcinogens in the training set and test set are shown. bSE (sensitivity) represents the prediction accuracy of carcinogens. cSP (specificity) represents the prediction accuracy of noncarcinogens. dMCC represents Matthew’s correlation coefficient on the test set. eAverage represents the average values of B1−B10 .

investigate the predictive power of Model A1. In Model A1, the active/inactive ratio is about 331/320, and in Models A2 and A3, the ratios are 291/378 and 405/298, respectively. Compared with the results of the original model (Model A1), the performance of Models A2 and A3 as shown in Table 2 was slightly lower than that of Model A1. This shows the good stability of Model A1. In order to reduce the possibility of chance correlation, a Yscrambling test was performed to validate the classification model. The activity data of compounds in the training set is shuffled randomly to change its true order, at the same time ensuring that the descriptor data remain unchanged, which

compounds was used to train an SVM model. The two parameters of the SVM (c, γ) were selected using the autosearching program “grid” through a 5-fold cross-validation in LIBSVM. Afterward, manual selection was done, and the optimum parameters of c = 0.9 and γ = 8.9 were selected to build an SVM model. For the training set, the prediction accuracy of 84.95% was achieved, as shown in Table 2. The test set containing 201 compounds was used to test the prediction ability of the SVM model. For the test set, a prediction accuracy of 80.10% and an MCC value of 0.59 were achieved. In addition, the ratio between active and inactive carcinogens in the training set had been changed in Models A2 and A3 to 744

dx.doi.org/10.1021/tx4000182 | Chem. Res. Toxicol. 2013, 26, 741−749

Chemical Research in Toxicology

Article

which means that the original model is not the same as models that were learned by chance, and is not a result of a chance correlation. Using a random number, the whole data set was split equally into 10 subsets. Every 9 out of these 10 subsets were used as a training set, and the other subset was used as a test set to perform 10-fold cross-validation. For every model, a group of best c and γ was obtained. The best c and γ, the accuracy, sensitive, specificity, and MCC value of each model are shown in Table 2. For the test sets of 10 models (Model B1−B10), the accuracies were from 60.47% to 81.18%, and the MCC values were from 0.19 to 0.59. The average accuracy is 71.96%, and the average MCC value is 0.43. The results, shown in Table 2, indicated that the models we built had good prediction ability to discriminate carcinogenic active and inactive compounds. 3.2. ECFP_4 Analysis. To better understand the structure− activity relationship of the chosen data set, 5214 ECFP_4 fingerprints were calculated for the 852 compounds based on their molecular structures. The frequency of each ECFP_4 substructure appearing in carcinogenic active compounds (carcinogens) and carcinogenic inactive compounds (noncarcinogens) was calculated as well in order to locate substructures responsible for observed carcinogenicity. On the basis of the analysis, nine representative substructures stood out and are shown in Table 3. It was found that there was no common substructure shown in the carcinogenic active compounds or in the carcinogenic inactive compounds. However, there were some substructures that appeared more often in the carcinogenic active compounds than in the inactive ones. It is worth mentioning that substructures 5−9 only relate to carcinogenic active compounds and that some ECFP_4 substructures were similar to each other, such as substructures containing N-nitroso groups

means the independent variables and dependent variable lose their original relationship. On the basis of such permuted data, new SVM models with a sufficient number of iterations were built to predict the original test set. Values obtained in the above models are compared with that of the original model by means of a scatter plot. In this study, the Y-scrambling test was investigated to validate Model A1 to prove it was not the outcome of chance correlation. The resulting MCC and prediction accuracy (Q) for the test set are shown in Figure 2. The true value lies outside of the values of scrambled models,

Figure 2. Y-scrambling result of Model A1. The value of Model A1 is shown as a black triangle, while the values of the scrambled models are shown as white triangles.

Table 3. Selected ECFP_4 Substructures and Their Appearance Frequencies among Active and Inactive Carcinogenic Compounds

745

dx.doi.org/10.1021/tx4000182 | Chem. Res. Toxicol. 2013, 26, 741−749

Chemical Research in Toxicology

Article

Table 4. Molecular Structures of the Four Subgroups

Table 5. Performance of Four SVM Models Based on Different Subgroups parameters model Model Model Model Model

(SA_8) (SA_21) (SA_27) (SA_28)

training set

test set

c

g

numbera

5-fold

10-fold

LOOb

Qc (%)

numbera

SEd (%)

SPe (%)

MCCf

Qc (%)

1024 1024 1024 128

4 256 1 32

18/18 66/13 35/18 28/14

75.00 86.08 88.68 69.05

75.00 83.54 86.79 71.53

77.78 83.54 88.68 71.43

100.00 92.41 96.23 95.24

7/8 30/6 16/9 13/6

87.50 66.67 77.78 66.67

85.71 90.00 93.75 92.31

0.73 0.53 0.74 0.62

86.67 86.11 88.00 84.21

a

Number represents the number of the carcinogens/noncarcinogens in each subgroup. bLOO represents leave-one-out cross-validation on the training set. cQ represents the prediction accuracy of the data set. dSE (sensitivity) represents the prediction accuracy of carcinogens. eSP (specificity) represents the prediction accuracy of noncarcinogens. fMCC represents Matthew’s correlation coefficient on the test set.

ment, and drug design and on the other hand provided a new way of finding critical substructures of carcinogens by applying ECFP_4 fingerprint analysis, which, to our knowledge, has not been reported so far. 3.3. Subgrouping of Noncongeneric Chemicals by SA. On the basis of the four outstanding SAs (Table 4) obtained by the ToxTree software, we compiled four subgroups (Table 5) for further analysis. SA_8 represented aliphatic halogens, which contained bromine, chlorine, and iodine. SA_21 represented alkyl and aryl N-nitroso groups. SA_27 represented nitroaromatic. SA_28 represented primary aromatic amine, hydroxyl amine, and its derived esters. Classification models were built based on each subgroup by SVM. The molecular descriptors were selected via the methods mentioned in Section 2.4 and are shown in Table S1, Supporting Information. For each subgroup, a reasonable SVM model was constructed with the selected parameters c and γ. The results are shown in Table 5. For the training sets of four models, the prediction accuracies were above 92%. Five-, 10-fold, and leaveone-out (LOO) cross-validation on the training set were also performed, and the accuracies were more than 70%. In addition, each model had a prediction performance with an accuracy of over 84% and MCC from 0.53 to 0.74 on the test set. Compared with Model A1 and Model B1 to B10, the prediction results of subgroups showed relatively better performance. This proved that classification models built on

(numbers 4, 7, and 9 in Table 3) and furan structures (numbers 5 and 6 in Table 3). Furthermore, some ECFP_4 substructures were part of other substructures. For example, the number 6 substructure contained the number 5 substructure. In addition, the ToxTree software8 was used to test the whole data set and to mark the SAs of each chemical. The results were compared to the outcome of the ECFP_4 fingerprint analysis. As these two methods were applied to deal with molecular functional groups/substructures, the comparison could help us to better analyze the key substructures inducing carcinogenicity and to better understand the related reaction mechanism. We obtained 35 unique SAs by a Toxtree scan, which covered the whole set of SAs compiled by Benigni/Bossa rulebase for carcinogenicity and mutagenicity (Appendix 1 of ref 9). Comparing the nine ECFP_4 representative substructures shown in Table 3 with the 35 SAs, it was found that there was obvious matching between them. The No. 1 ECFP_4 substructure contained hydroxyl amine, which was part of the SA_28 (primary aromatic amine, hydroxyl amine, and its derived esters). The number 3 ECFP_4 substructure (a nitroso group) was shown in the SA_25 (aromatic nitroso group), and the number 4 ECFP_4 substructure (N-nitroso group) was exactly the same as the SA_21 (alkyl and aryl N-nitroso groups). The good matching found in this two-set analysis on the one hand proved that substructure of carcinogens did play an important role in carcinogen screening, chemical risk assess746

dx.doi.org/10.1021/tx4000182 | Chem. Res. Toxicol. 2013, 26, 741−749

Chemical Research in Toxicology

Article

shape of molecules and has been used for exploring the role of certain shape attributes in the mechanism of toxicity as well as other factors. The principal moments of inertia descriptors pmi1 and npr229 have previously been used to assess molecular properties such as shape, geometry, and conformational parameters. In conclusion, these four shape descriptors confirmed the importance of molecular geometric factors in affecting its ability to reach target macromolecules and its chance to be metabolically activated or detoxified. 3.4.4. Class 4: One Atom Count and Bond Count Descriptor (lip_violation), One Subdivided Surface Area Descriptor (SMR_VSA3), and One Pharmacophore Feature Descriptor (vsa_pol). This class had some relationship with the molecular weight and size of chemicals. The molecular weight of chemicals is another important factor that effects the potential carcinogenicity of a chemical.27 The higher the molecular weight of a chemical, the less chance it has to be absorbed in significant amounts. The descriptor lip_violation represents the number of violations of Lipinski’s Rule of Five.30 It is found that in the carcinogens the number of compounds whose molecular weights are greater than 500 Da is 9, while in the noncarcinogens the number of violation is 18. In general, the chemicals with very high molecular weight and size do not pose any substantial carcinogenic risk. The descriptor SMR_VSA3 represents the molar refractivity (MR) of small molecules calculated as the sum of the contributions of each of the atoms in the molecules.31 It is defined as the atom-based calculation of molar refractivity (MR) based on an approximate accessible van der Waals surface area. The descriptor vsa_pol represents approximations to the sum of van der Waals surface areas of polar atoms (atoms that are both hydrogen bond donors and acceptors). Both SMR_VSA3 and vsa_pol relate to the molecular size and polarizability of chemicals. Class 5: One Atom Count and One Bond Count Descriptor (lip_violation), One Potential Energy Descriptor (E_sol), One Pharmacophore Feature Descriptor (a_don), One Subdivided Surface Area Descriptor (SlogP_VSA3), and Six Surface Area, Volume, and Shape Descriptors (vsurf_A, vsurf_DW13, vsurf_IW8, vsurf_IW7, vsurf_EWmin1, and vsurf_W3). This class consists of solubility based descriptors. The descriptor lip_violation also relates to the chemical solubility. The octanol−water partition coefficient log P not greater than 5 is part of Lipinski’s Rule of Five. By examining the chemical data in the data set, it is found that in the carcinogens, the log P of 19 out of 449 (4.23%) is greater than 5; while in the noncarcinogens, the number of violation is 47 out of 403 (11.66%). The partition coefficient is the measure of how hydrophilic or hydrophobic a chemical is. Woo reported that the highly hydrophilic compounds were poorly absorbed and not readily excreted.27 Thus, the introduction of hydrophilic groups (e.g., sulfonyl and carboxyl) into an otherwise carcinogenic compound usually mitigates and sometimes altogether abolishes its carcinogenic activity. There are several other descriptors related to solubility, such as the descriptors SlogP_VSA3, E_sol, a_don, and some VolSurf descriptors (vsurf_A, vsurf_DW13, vsurf_IW8, vsurf_IW7, vsurf_EWmin1, and vsurf_W3). The descriptor E_sol means the solvation energy of a chemical, which shows up as the change in Gibbs energy when an ion or molecule was transferred to a solvent.32 Different noncovalent interactions with the solvent such as H-bonding, dipole−dipole interactions, van der Waals interactions, etc. can affect the stabilization of the

chemicals of congeneric class could often achieve better prediction results. 3.4. Analysis of MOE Descriptors. For each of the selected descriptors, the F-score was calculated. In Table 1, the descriptors were sorted by descending F-score, and the data showed that a descriptor with a high F-score generally had a high correlation coefficient (R) with activity. For example, descriptors with an F-score value of more than 0.010 had corresponding absolute correlation coefficients of more than 0.100. In order to investigate the role of 24 MOE descriptors, we classified them into six classes. 3.4.1. Class 1: One Physical Property Descriptor (mutagenic). The descriptor mutagenic is the indicator of the presence of a potentially toxic group. A nonzero value indicates that the molecule contains a mutagenic group. The table of mutagenic groups is based on the Kazius set.23 Mutagenicity is the ability of a compound to cause mutations in DNA, and often mutation is one of the first steps in the development of cancer. From the point of view of the mechanism of mutagenic action, a compound can either react directly with DNA to distort the DNA structure by causing DNA adducts or base deletions, or be converted into DNA-reactive metabolites through enzyme-catalyzed metabolic activation.24 Table 1 shows the descriptor mutagenic had the highest correlation coefficient with the carcinogenicity of compounds. 3.4.2. Class 2: Two Partial Charge Descriptors (PEOE_VSA5 and PEOE_VSA-4). In the 1960s, Elizabeth and James Miller developed the electrophilicity theory.22 They suggested “that most, if not all, chemical carcinogens either are, or are converted in vivo to, reactive electrophilic derivatives which combine with nucleophilic groups in crucial tissue components, such as nucleic acids and proteins.” The partial equalization of orbital electronegativities (PEOE) descriptors25 calculate atomic partial charges, which are characterized by their orbital electronegativities, the electronegativity of a specific orbital in a given valence state. Their values are obtained from ground state ionization potentials and electron affinities and from valence state promotion energies deduced from spectroscopic data.26 The two descriptors (PEOE_VSA-5 and PEOE_VSA-4) represent the sum of van der Waals surface area calculation for each atom where its partial charge is in the range (−0.25, −0.20) and (−0.30, −0.25), respectively. The negative correlation between carcinogenicity and PEOE descriptors means that the larger orbital electronegativities contribute more to the electron affinity, and the bigger the electrophilicity of compounds influences. 3.4.3. Class 3: Three Surface Area, Volume, and Shape Descriptors (glob, pmi1, and npr2) and One Kier & Hall Connectivity and Kappa Shape Indices Descriptor (Kier3). This class was related to the geometry of chemicals. Besides electrophilic theory of chemical carcinogenesis, there are also general factors that may influence the potential reactivity of a chemical.27 One critical factor is the geometry of the chemical compounds. The planar shape of potent carcinogens is helpful for them to intercalate into DNA.27 The descriptor glob represents the globularity of the chemical compound. The value of 1 indicates a perfect sphere, while a value of 0 indicates a two- or one-dimensional object. The correlation coefficient for the activity of glob shown in Table 1 is −0.041. In addition, the descriptor Kier3, which means the third kappa shape index,28 is also a descriptor capturing different aspects of molecular shape. In essence, it encodes information about the nonspherical 747

dx.doi.org/10.1021/tx4000182 | Chem. Res. Toxicol. 2013, 26, 741−749

Chemical Research in Toxicology

Article

reactant or product. The descriptor a_don represents the number of hydrogen bond donor atoms. The increase of hydrogen bond donor atoms synchronously elevates the solubility of compound and may reduce the carcinogenicity of compound. The descriptor SlogP_VSA3 represents the partition coefficient (log P) of small molecules calculated as the sum of the contributions of each of the atoms in the molecules.31 Since it is defined as an atom-based calculation of partition coefficient (log P) based on an approximate accessible van der Waals surface area, it basically accounts for the solubility of a chemical. Six vsurf descriptors are hydrophilic based descriptors with a close relationship with chemical solubility. They are calculated based on structure connectivity and conformation (dimensions are measured in Å),33 which can be used to predict pharmacokinetic and physicochemical properties. 3.4.6. Class 6: Three Atom Counts and Bond Count Descriptors (a_nS, a_ICM, and a_nI) and Two Adjacency and Distance Matrix Descriptors (BCUT_SMR_1 and BCUT_PEOE_2). This class represented a group of descriptors that reflected the atom information of chemicals. In this class, there are five descriptors associated with the atom information of chemicals. The topological substructure information is related to functional groups or substructures of carcinogens.

groups to cancer, which in future will allow us to generate more powerful, fine-tuned models.



ASSOCIATED CONTENT

S Supporting Information *

Compounds for building Models A1−A3, Models B1−B10, and four subgroup models. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*Phone: +86-10-64421335. Fax: +86-10-64416428. E-mail: [email protected]. Funding

This work was supported by the National Natural Science Foundation of China (20605003 and 20975011) and “Chemical Grid Project” of Beijing University of Chemical Technology. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS We thank Molecular Networks GmbH, Erlangen, Germany, for making the programs CORINA and SONNIA available for our scientific work.

4. CONCLUSIONS The objective of this study was to build classification models for predicting the carcinogenicity of noncongeneric chemicals. The models were developed using a data set of 852 chemicals and a SVM. Twenty-four MOE descriptors from six distinct classes were selected as a basis for the model. The descriptor mutagenic (class 1) showed the highest correlation coefficient with carcinogenicity, which was rational and consistent with the mechanism of carcinogenic and mutagenic action. Other classes of descriptors were closely linked to electrophilicity, geometry, molecular weight and size, solubility, and atom information of chemicals, which had been proven as important factors influencing the potential carcinogenicity of chemicals. The models showed good prediction accuracy on noncongeneric chemicals. The SVM model (Model A1) obtained an accuracy of 80.10%, sensitivity of 77.11%, and specificity of 82.20% on the test set. The best model (Model B3), obtained from 10-fold cross-validation, obtained an accuracy of 81.18%, sensitivity of 76.60%, and specificity of 86.84%. The Matthew’s correlation coefficients (MCC) of these two models were identical (0.59). ECFPs and ToxTree were used to analyze the functional groups or substructures related to the carcinogenicity of chemicals, and the results from both methods are in good agreement. This not only shows the importance of the molecular functional groups/substructures in the research of the mechanism of carcinogenicity but also provides a new way of finding critical substructures of carcinogens by applying ECFP_4 fingerprint analysis. By further subgrouping the chemicals on SAs, the SVM models on four subgroups chemicals showed relatively better performance in predicting carcinogenicity. In summary, this study demonstrated the potential of SVMs in combination with a specific set of molecular descriptors to identify carcinogenic compounds from a noncongeneric set of chemicals. The research helped us to gain a better understanding of the relevance of specific substructures or functional



ABBREVIATIONS SVM, support vector machine; SOM, Kohonen’s self-organizing map; ECFP, extended connectivity fingerprints; SA, structural alert; CPDBAS, Carcinogenic Potency Database; SAR, structure−activity relationship; QSAR, quantitative structure− activity relationship; CP ANN, counter propagation artificial neural network; SE, sensitivity; SP, specificity; MCC, Matthew’s correlation coefficient; TP, true positives; TN, true negatives; FP, false positives; FN, false negatives; RBF, radial basis function; Q, accuracy; LOO, leave-one-out; PEOE, partial equalization of orbital electronegativities; MR, molar refractivity



REFERENCES

(1) Fjodorova, N., Vracko, M., Tusar, M., Jezierska, A., Novic, M., Kuhne, R., and Schuurmann, G. (2010) Quantitative and qualitative models for carcinogenicity prediction for non-congeneric chemicals using CP ANN method for regulatory uses. Mol. Diversity 14, 581− 594. (2) Fjodorova, N., Vracko, M., Novic, M., Roncaglioni, A., and Benfenati, E. (2010) New public QSAR model for carcinogenicity. Chem. Cent. J 4, S3. (3) Rainer, F., and Andreas, G. (2003) General Introduction to QSAR, in Quantitative Structure-Activity Relationship (QSAR) Models of Mutagens and Carcinogens, pp 1−40, CRC Press, Boca Raton, FL. (4) Hansch, C., Hoekman, D., Leo, A., Weininger, D., and Selassie, C. D. (2002) Chem-bioinformatics: comparative QSAR at the interface between chemistry and biology. Chem. Rev. 102, 783−812. (5) Tanabe, K., Lucic, B., Amic, D., Kurita, T., Kaihara, M., Onodera, N., and Suzuki, T. (2010) Prediction of carcinogenicity for diverse chemicals based on substructure grouping and SVM modeling. Mol. Diversity 14, 789−802. (6) Benigni, R., and Bossa, C. (2006) Structural alerts of mutagens and carcinogens. Curr. Comput.-Aided Drug Des. 2, 169−176. (7) CPDBAS: Carcinogenic Potency Database Summary Tables, http://www.epa.gov/ncct/dsstox/sdf_cpdbas.html. (8) Joint Research Centre (JRC) Website, http://ihcp.jrc.ec.europa. eu/our_labs/predictive_toxicology/qsar_tools/.

748

dx.doi.org/10.1021/tx4000182 | Chem. Res. Toxicol. 2013, 26, 741−749

Chemical Research in Toxicology

Article

(9) Romualdo, B., Cecilia, B., Nina, J., Tatiana, N., and Andrew, W. (2008) The Benigni/Bossa Rulebase for Mutagenicity and Carcinogenicity: A Module of ToxTree, pp 1−70, IdeaConsult, Ltd., Sofia, Bulgaria. (10) Huff, J., Haseman, J., and Rall, D. (1991) Scientific concepts, value, and significance of chemical carcinogenesis studies. Annu. Rev. Pharmacol. Toxicol 31, 621−652. (11) Fung, V. A., Barrett, J. C., and Huff, J. (1995) The carcinogenesis bioassay in perspective: application in identifying human cancer hazards. Environ. Health. Perspect 103, 680−683. (12) SONNIA, version 4.2, Molecular Networks GmbH, Erlangen, Germany, http://www.molecular-networks.com (accessed Jan, 2013). (13) MDL Information Systems, Inc., San Leandro, CA, http://www. mdli.com/. (14) MOE (The Molecular Operating Environment), version 2010.10, Chemical Computing Group Inc., Montreal, Canada. (15) Willett, P. (2006) Similarity-based virtual screening using 2D fingerprints. Drug. Discovery Today 11, 1046−1053. (16) CORINA, Molecular Networks GmbH, Erlangen, Germany, http://www.molecular-networks.com (accessed Jan, 2013). (17) Matthews, B. W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442−451. (18) Chang, C.-C., and Lin, C.-J. (2011) LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1−27. (19) Morgan, H. L. (1965) The generation of a unique machine description for chemical structures-a technique developed at Chemical Abstracts Service. J. Chem. Doc. 5, 107−113. (20) Bender, A., Jenkins, J. L., Scheiber, J., Sukuru, S. C., Glick, M., and Davies, J. W. (2009) How similar are similarity searching methods? A principal component analysis of molecular descriptor space. J. Chem. Inf. Model. 49, 108−119. (21) Hiatt, H. H., Watson, J. D., Winsten, J. A., and Laboratory, C. S. H. (1977) Origins of Human Cancer, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York. (22) Miller, E. C., and Miller, J. A. (1981) Searches for ultimate chemical carcinogens and their reactions with cellular macromolecules. Cancer 47, 2327−2345. (23) Kazius, J., McGuire, R., and Bursi, R. (2005) Derivation and validation of toxicophores for mutagenicity prediction. J. Med. Chem. 48, 312−320. (24) Benigni, R., and Bossa, C. (2011) Mechanisms of chemical carcinogenicity and mutagenicity: a review with implications for predictive toxicology. Chem. Rev. 111, 2507−2536. (25) Gasteiger, J., and Marsili, M. (1980) Iterative partial equalization of orbital electronegativitya rapid access to atomic charges. Tetrahedron 36, 3219−3228. (26) Hinze, J., and Jaffe, H. H. (1962) Electronegativity. I. Orbital electronegativity of neutral atoms. J. Am. Chem. Soc. 84, 540−546. (27) Woo, Y., and Arcos Joseph, C. (1989) Role of Structure-Activity Relationship Analysis in Evaluation of Pesticides for Potential Carcinogenicity, in Carcinogenicity and Pesticides, pp 175−200, American Chemical Society, Washington, DC. (28) Hall, L. H., and Kier, L. B. (2007) The Molecular Connectivity Chi Indexes and Kappa Shape Indexes in Structure-Property Modeling, in Reviews in Computational Chemistry, pp 367−422, John Wiley & Sons, Inc., New York. (29) Sauer, W. H. B., and Schwarz, M. K. (2003) Molecular shape diversity of combinatorial libraries: a prerequisite for broad bioactivity. J. Chem. Inf. Comput. Sci. 43, 987−1003. (30) Lipinski, C. A., Lombardo, F., Dominy, B. W., and Feeney, P. J. (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Delivery Rev. 23, 3−25. (31) Wildman, S. A., and Crippen, G. M. (1999) Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Comput. Sci. 39, 868−873. (32) Nic, M. J. J., and Kosata, B. (2006) IUPAC Compendium of Chemical Terminology (Online ed.), IUPAC, Research Triangle Park, NC

(33) Cruciani, G., Crivori, P., Carrupt, P. A., and Testa, B. (2000) Molecular fields in quantitative structure−permeation relationships: the VolSurf approach. J. Mol. Struct. 503, 17−30.

749

dx.doi.org/10.1021/tx4000182 | Chem. Res. Toxicol. 2013, 26, 741−749