Chem. Res. Toxicol. 2006, 19, 1030-1039
Classification of a Diverse Set of Tetrahymena pyriformis Toxicity Chemical Compounds from Molecular Descriptors by Statistical Learning Methods
Y. Xue,†,§,# H. Li,†,# C. Y. Ung,†,‡ C. W. Yap,† and Y. Z. Chen*,†
Bioinformatics and Drug Design Group, Departments of Pharmacy and Computational Science, National University of Singapore, Blk S16, Level 8, 3 Science Drive 2, Singapore 117543, Department of Biochemistry, The Yong Loo Lin School of Medicine, National University of Singapore, Blk MD7, #02-03, 8 Medical Drive, Singapore 117597, and College of Chemistry, Sichuan University, Chengdu, 610064, People's Republic of China
Received March 12, 2006
Toxicity of various compounds has been measured in many studies by their toxic effects against Tetrahymena pyriformis. Efforts have also been made to use computational quantitative structure-activity relationship (QSAR) and statistical learning methods (SLMs) for predicting Tetrahymena pyriformis toxicity (TPT) with impressive accuracy. Because of the diversity of compounds and toxicity mechanisms, it is desirable to explore additional methods and to examine whether these methods are applicable to more diverse sets of compounds. We tested several SLMs (logistic regression, C4.5 decision tree, k-nearest neighbor, probabilistic neural network, support vector machines) for their capability in predicting TPT by using 1129 compounds (841 TPT and 288 non-TPT agents), which are more diverse than those in other studies. A feature selection method was used for improving prediction performance and selecting molecular descriptors responsible for distinguishing TPT and non-TPT agents. The prediction accuracies are 86.9%∼94.2% for TPT and 71.2%∼87.5% for non-TPT agents based on 5-fold cross-validation studies, which are comparable to those of some earlier studies despite the use of more diverse sets of compounds. The selected molecular descriptors are consistent with those used in other studies and with experimental findings. These results suggest that SLMs are useful for predicting the TPT potential of diverse sets of compounds and for characterizing the molecular descriptors associated with TPT.
Introduction
A variety of compounds are toxic to humans. Investigational drugs need to be screened to eliminate unsafe candidates, and adverse drug reactions have led to the withdrawal of some marketed drugs (1, 2). Harmful effects can be produced by some chemical byproducts that escape into the environment (3), pesticides (4), household products (5), and other chemical materials (6).
Toxicological tests and safety evaluations are needed and have been used for assessing the toxic potential of chemical compounds and investigational drugs, many of which have been tested with Tetrahymena pyriformis (7-11). Computational methods have been developed as potential tools for facilitating fast and efficient prediction of T. pyriformis toxicity (TPT) (12-21). Structure-toxicity relationship (STR) (12, 22-25) and quantitative structure-activity relationship (QSAR) (14-16, 26-28) models have been developed for predicting TPT of several groups of structurally related chemicals including benzenes, phenols, aromatic compounds, aliphatic compounds, and cyanoacetic acids. However, STR or QSAR models of a majority of chemical groups are yet to be determined. In an effort to develop computational methods applicable to diverse sets of compounds, statistical learning methods (SLMs) have also been explored for developing TPT prediction systems for both individual chemical groups (13, 17, 18, 21, 29, 30) and multiple chemical groups (19, 20). These methods achieve impressive predictive performance, but their practical application potential may not be fully realized because of the limited diversity of the training data, the quality of the molecular descriptors, and the efficiency of the statistical learning algorithms. So far, three classes of SLMs have been employed for predicting TPT agents. These include linear discriminant analysis (13), various regression methods (13, 17, 18, 21), partial least squares (29), and neural networks (17-20, 30). Because of the diversity of compounds and mechanisms of toxic action, it is desirable to explore additional SLMs such as support vector machines (SVM) (31, 32), k-nearest neighbor (k-NN) (33), and decision tree (DT) (34) that have shown promising potential in predicting toxicological and pharmacological properties of diverse groups of compounds (35-40). Instead of focusing on specific structural features or a particular group of related molecules, SLMs generally classify diverse sets of compounds into TPT and non-TPT agents based on their general structural and physicochemical properties, irrespective of their structural and chemical affiliation. These published works are based on the study of 200∼270 (13, 18, 21, 30), 450∼480 (17, 29), and 730∼830 (19, 20) TPT and non-TPT agents, respectively. The sets of agents used in these studies are substantially smaller in number and diversity than the 1129 known TPT and non-TPT agents found from our comprehensive search of the literature. Therefore, the performance of both the already-explored and yet-to-be-explored SLMs needs to be tested by using the more diverse set of compounds.
* Corresponding author. Tel.: 65-6874-6877. Fax: 65-6774-6756. Email: [email protected].
† Departments of Pharmacy and Computational Science, National University of Singapore.
§ Sichuan University.
# These authors contributed equally to this work.
‡ The Yong Loo Lin School of Medicine, National University of Singapore.
10.1021/tx0600550 CCC: $33.50 © 2006 American Chemical Society Published on Web 07/18/2006
Classification of T. pyriformis Toxicity Chemical Compounds
In this work, we tested several SLMs by using all of the 1129 known TPT and non-TPT agents we collected. The tested methods include logistic regression (LR) (41), DT (34), k-NN (33), probabilistic neural networks (PNN) (42), and SVM (31, 32). In particular, SVM was explored because of its good performance in the prediction of genotoxicity (38), chromosome aberrations (43), torsade-causing potential of drugs (44), blood-brain barrier-penetrating agents (45), P-glycoprotein substrates (36), structure-activity relationship of enzyme inhibition (46), and QSAR of antihistamines and antibacterials (47). Most of these studies have consistently demonstrated that SVM gives, to various degrees, better prediction accuracy than other supervised statistical learning methods. A feature selection method, recursive feature elimination (RFE), was used in this work for improving the prediction performance of SLMs and for selecting the molecular descriptors relevant to the classification of TPT and non-TPT agents. RFE has recently gained popularity due to its effectiveness in discovering informative molecular descriptors responsible for distinguishing activities (48) and toxicological (38) and pharmacological properties (37) of various compounds. To adequately assess the prediction accuracy of the methods used in this work, 5-fold cross-validation, a popular method for evaluating drug prediction systems (49, 50), was used for all the SLMs.
Materials and Methods
Selection of TPT and Non-TPT Agents. A total of 1129 TPT and non-TPT agents with known IGC50 (mol/L) values (the 50% growth inhibition concentration in a 2-day T. pyriformis assay) were selected from a number of published papers (12-17, 20-28). The 2D structure of each compound was generated by using ChemDraw (51) and DS ViewerPro 5.0 (52) and was subsequently converted into a 3D structure by using CONCORD (53), followed by geometry optimization using the clean structure module of DS ViewerPro. All the generated geometries were fully optimized without symmetry restrictions. The 3D structure of each compound was manually inspected to ensure that the chirality of each chiral agent was properly generated. Schultz (54) proposed a short-term, static protocol to calculate the chemical toxicity values from a population growth impairment test using the common freshwater ciliate T. pyriformis (strain GLC). The 50% impairment growth concentration (IGC50) is the recorded endpoint. The acute toxicity toward T. pyriformis is defined as log(1/IGC50) (mm/L), where the logarithm is taken to contract the data set to a computationally efficient range (-1.64 to 3.36 log units) (17). Toxicity level generally increases with increasing value of log(1/IGC50), and compounds with positive values are generally considered to be toxic or weakly toxic. However, because of differences in experimental conditions and other factors, some variation is expected in the measured log(1/IGC50) values. Careful study of the available TPT and non-TPT agents is needed to derive a statistically more meaningful cutoff value. After careful analysis of the toxic characteristics of the TPT and non-TPT compounds in our dataset, a more conservative cutoff value of log(1/IGC50) = -0.5 was chosen for separating TPT and non-TPT agents.
Some toxic compounds such as phenol were found to have a log(1/IGC50) value as low as -0.21; hence, the log(1/IGC50) cutoff value needed to be lower than -0.21. On the other hand, some safe compounds such as salicylic acid, which is used as a food preservative and in toothpaste, have a log(1/IGC50) value as high as -0.51. Therefore, the log(1/IGC50) cutoff value was tentatively set at the lowest possible value of -0.5 that optimally excludes the known toxic agents from the non-TPT dataset and the known safe agents from the TPT dataset. Under this definition, a total of 841 TPT and 288 non-TPT agents were found; they are listed in Table S1 of the Supporting Information.
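The cutoff rule above can be illustrated with a minimal sketch. The function name `label_agent` is hypothetical, and the treatment of a value exactly equal to the cutoff (counted here as TPT) is an assumption, since the text does not say how ties are handled.

```python
def label_agent(log_inv_igc50, cutoff=-0.5):
    """Assign a class from a log(1/IGC50) value.

    Assumption: a value equal to the cutoff is labeled TPT; the paper
    does not state which side the boundary value falls on.
    """
    return "TPT" if log_inv_igc50 >= cutoff else "non-TPT"


# The two compounds quoted in the text as bracketing the cutoff.
examples = {"phenol": -0.21, "salicylic acid": -0.51}
labels = {name: label_agent(value) for name, value in examples.items()}
print(labels)
```

With the two values quoted in the text, phenol falls on the TPT side of the cutoff and salicylic acid on the non-TPT side.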
Chem. Res. Toxicol., Vol. 19, No. 8, 2006 1031
These compounds were further separated into training and testing sets by n-fold cross-validation, which is widely used for model validation (55). For 5-fold cross-validation, the compounds are randomly divided into five subsets of approximately equal size. Four subsets are selected as the training set and the fifth as the testing set. This process is repeated five times so that every subset is selected as a testing set once.
Molecular Descriptors. Molecular descriptors have routinely been used for representing structural and physicochemical properties of compounds in the study of toxicological (12-21, 38, 40, 44) and pharmacological (56-59) properties of compounds. In this work, a set of 199 molecular descriptors was selected from the more than 1000 descriptors described in the literature after eliminating those descriptors that are obviously redundant or unrelated to the toxicological properties. These descriptors, given in Table 1 and described in our earlier publications (36-38, 40), include 143 topological, 31 quantum chemical, and 25 geometrical descriptors. They are computed from the 3D structure of each compound by using a molecular descriptor computing program developed by us, which is available online at http://jing.cz3.nus.edu.sg/cgi-bin/model/model.cgi. Other programs such as DRAGON (http://www.talete.mi.it/main_exp.htm), ALMOND, and VOLSURF (http://www.moldiscovery.com/soft_almond.php) are also available for computing molecular descriptors. When computing the quantum chemical descriptors and molecular surfaces, the semiempirical AM1 method, which is widely used in QSAR and SLM models of compounds (13, 16, 17, 23, 60-64), was used for preprocessing structural optimization. The remaining redundant and unrelated descriptors among these 199 descriptors were further removed by using a feature selection method (36, 48).
Feature Selection Method.
Feature selection methods have been used for improving the performance of SLMs and for selecting descriptors relevant to the discrimination of two datasets (48, 50, 65-67). The RFE method has become a popular choice for feature selection in a variety of problems (48). Our studies of SVM prediction of genotoxicity compounds (38), torsade-causing potential of drugs (37), P-glycoprotein substrates (36), and human intestinal absorption (37) also showed that this method is useful for selecting features relevant to the study of toxicological and pharmacokinetic properties. Thus, the RFE method was used for feature selection in this work. The feature selection procedure can be demonstrated by the following illustrative example of the development of an SVM classification system trained by using a Gaussian kernel function with an adjustable parameter σ. Sequential variation of σ is conducted against the whole training set of agents to find a value that gives the best prediction accuracy. This prediction accuracy is evaluated by 5-fold cross-validation. In the first step, for a fixed σ, the SVM classifier is trained by using the complete set of features (molecular descriptors). The second step involves the computation of a scoring function to rank the contribution of the features in the current set. In the third step, the m lowest ranked features are removed. In the fourth step, the SVM classification system is retrained by using the remaining set of features, and the corresponding prediction accuracy is computed by means of 5-fold cross-validation. The first to fourth steps are then repeated for other values of σ. After the completion of these procedures, the set of features and the parameter σ that give the best prediction accuracy are selected. The choice of the parameter m affects the performance of SVM as well as the speed of feature selection. Although it is desirable to remove one feature at a time (m = 1), this is often difficult due to the high computational cost.
In some cases, removal of several features at a time (m > 1) significantly improves computational efficiency without losing too much accuracy (48). Our studies on a subset of randomly selected TPT and non-TPT agents in this work and on compounds of different toxicological (38) and pharmacokinetic properties in our earlier studies (36, 37) suggested that the performance of an SVM system with m = 5 is only a few percent lower than that with m = 1, which is consistent with the findings from other studies (65, 68). Thus, for computational efficiency, m = 5 is used in this study.
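The train, rank, remove, and re-evaluate loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' program: a nearest-centroid classifier stands in for the Gaussian-kernel SVM, features are ranked by the squared centroid difference (the squared weight of that linear classifier), and the outer loop over σ is omitted because the stand-in has no kernel parameter. All function names and the synthetic data are hypothetical.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Randomly split indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def centroids(X, y, feats):
    """Per-feature class means (stand-in model, NOT an SVM)."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    mp = {j: sum(x[j] for x in pos) / len(pos) for j in feats}
    mn = {j: sum(x[j] for x in neg) / len(neg) for j in feats}
    return mp, mn


def predict(x, mp, mn, feats):
    """Assign the class whose centroid is closer in squared distance."""
    dp = sum((x[j] - mp[j]) ** 2 for j in feats)
    dn = sum((x[j] - mn[j]) ** 2 for j in feats)
    return 1 if dp <= dn else 0


def cv_accuracy(X, y, feats, k=5, seed=0):
    """Fraction of compounds predicted correctly over k held-out folds."""
    correct = 0
    for test in k_fold_indices(len(X), k, seed):
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        mp, mn = centroids([X[i] for i in train], [y[i] for i in train], feats)
        correct += sum(predict(X[i], mp, mn, feats) == y[i] for i in test)
    return correct / len(X)


def rfe(X, y, m=5, k=5, seed=0):
    """Drop the m lowest-ranked features per round; keep the best subset."""
    feats = list(range(len(X[0])))
    best_feats, best_acc = list(feats), cv_accuracy(X, y, feats, k, seed)
    while len(feats) > m:
        mp, mn = centroids(X, y, feats)
        ranked = sorted(feats, key=lambda j: (mp[j] - mn[j]) ** 2)
        feats = ranked[m:]  # remove the m lowest-ranked features
        acc = cv_accuracy(X, y, feats, k, seed)
        if acc >= best_acc:  # ties favor the smaller feature set
            best_feats, best_acc = list(feats), acc
    return sorted(best_feats), best_acc


# Synthetic data: features 0 and 1 are informative, the other 10 are noise.
rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[6.0 * t + rng.gauss(0, 1), 6.0 * t + rng.gauss(0, 1)]
     + [rng.gauss(0, 1) for _ in range(10)] for t in y]
selected, acc = rfe(X, y, m=5)
print(selected, acc)
```

In this toy setting the two informative features always survive the elimination rounds. With a kernel SVM, the ranking criterion and the σ scan would replace the centroid-based pieces.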
Xue et al.
Table 1. Molecular Descriptors Used in This Work

| descriptor class | number of descriptors in class | descriptors |
|---|---|---|
| Simple molecular properties | 18 | molecular weight, numbers of rings, rotatable bonds, H-bond donors, and H-bond acceptors, element counts |
| Molecular connectivity and shape | 28 | molecular connectivity indices, valence molecular connectivity indices, molecular shape Kappa indices, Kappa alpha indices, flexibility index |
| Electrotopological state | 97 | electrotopological state indices, and atom type electrotopological state indices, Wiener index, centric index, Altenburg index, Balaban index, Harary number, Schultz index, PetitJohn R2 index, PetitJohn D2 index, mean distance index, PetitJohn I2 index, information Wiener, Balaban rmsd index, graph distance index |
| Quantum chemical properties | 31 | polarizability index, hydrogen bond acceptor basicity (covalent HBAB), hydrogen bond donor acidity (covalent HBDA), molecular dipole moment, absolute hardness, softness, ionization potential, electron affinity, chemical potential, electronegativity index, electrophilicity index, most positive charge on H, C, N, O atoms, most negative charge on H, C, N, O atoms, most positive and negative charge in a molecule, sum of squares of charges on H, C, N, O and all atoms, mean of positive charges, mean of negative charges, mean absolute charge, relative positive charge, relative negative charge |
| Geometrical properties | 25 | length vectors (longest distance, longest third atom, 4th atom), molecular van der Waals volume, solvent accessible surface area, molecular surface area, van der Waals surface area, polar molecular surface area, sum of solvent accessible surface areas of positively charged atoms, sum of solvent accessible surface areas of negatively charged atoms, sum of charge weighted solvent accessible surface areas of positively charged atoms, sum of charge weighted solvent accessible surface areas of negatively charged atoms, sum of van der Waals surface areas of positively charged atoms, sum of van der Waals surface areas of negatively charged atoms, sum of charge weighted van der Waals surface areas of positively charged atoms, sum of charge weighted van der Waals surface areas of negatively charged atoms, molecular rugosity, molecular globularity, hydrophilic region, hydrophobic region, capacity factor, hydrophilic-hydrophobic balance, hydrophilic intery moment, hydrophobic intery moment, amphiphilic moment |

Statistical Learning Methods. We used our own programs for generating SLM models for predicting TPT agents. A number of downloadable SLM software packages are available from various sources. Examples of these sources are PHAKISO (http://www.phakiso.com/index.htm) and WEKA (http://www.cs.waikato.ac.nz/~ml/weka/) for a collection of SLM software, NeuNet (http://www.cormactech.com/neunet/index.html) for neural networks, and SVM-Light (http://svmlight.joachims.org/) for SVM software. The methodologies of each of the SLMs used in this work are briefly described below:
(1) Logistic Regression Method. LR (41) is based on the assumption that a logistic relationship exists between the probability of class membership and one or more descriptors,

$$Y = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k)}} \tag{1}$$
where Y is the probability that vector x belongs to the positive class, β0 is the regression model constant, and β1-βk are the coefficients corresponding to the descriptors X1-Xk. A Y value of 0.5 or greater indicates that the vector x belongs to the positive class, while a value below 0.5 classifies vector x as negative.
(2) C4.5 DT. C4.5 DT is a branch-test-based classifier (34). A branch in a decision tree corresponds to a group of classes, and a leaf represents a specific class. A decision node specifies a test to be conducted on a single attribute value, with one branch and its subsequent classes as possible outcomes of the test. C4.5 DT uses recursive partitioning to examine every attribute of the data and ranks them according to their ability to partition the remaining data, thereby constructing a decision tree. A vector x is classified by starting at the root of the tree and moving through the tree until a leaf is encountered. At each nonleaf decision node, a test is conducted, and the classification process proceeds to the branch selected by the test. Upon reaching the destination leaf, the class of the vector x is predicted to be that associated with the leaf. The algorithm is a recursive greedy heuristic that selects descriptors for membership within the tree. Whether a descriptor is included within the tree is based on the value of its information gain. As a statistical property, information gain measures how well the descriptor separates training cases into subsets in which the class is homogeneous. Given that the descriptors in this study were all continuous variables, a threshold value had to be established within each descriptor so that it could partition the training cases into subsets. These threshold values were established by rank-ordering the values within each descriptor from lowest to highest and repeatedly calculating the information gain using the arithmetical midpoint between all successive values within the rank order. The midpoint value with the highest information gain was selected as the threshold value for the descriptor. The descriptor with the highest information gain (information being the most useful for classification) was then selected for inclusion in the DT. The algorithm continued to build the tree in this manner until it accounted for all training cases. Ties between descriptors that were equal in terms of information gain were broken randomly (69).
(3) k-NN. k-NN measures the Euclidean distance between a to-be-classified vector x and each individual vector xi in the training set (33). The Euclidean distances for the vector pairs are calculated using the following formula:

$$D = \sqrt{\lVert x - x_i \rVert^2} \tag{3}$$
The k vectors nearest to the vector x are used to determine its class, f(x):

$$\hat{f}(x) \leftarrow \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i)) \tag{4}$$

where δ(a,b) = 1 if a = b and δ(a,b) = 0 if a ≠ b, arg max returns the argument v that maximizes the function, V is a finite set of vectors {v1,...,vs}, and f̂(x) is an estimate of f(x). Here the estimate refers to the class of the majority of the k nearest neighbors.
(4) PNN. As illustrated in Figure 1, PNN is a form of neural network designed for classification through the use of Bayes' optimal decision rule (42),

$$h_i c_i f_i(x) > h_j c_j f_j(x) \tag{5}$$
where hi and hj are the prior probabilities, ci and cj are the costs of misclassification, and fi(x) and fj(x) are the probability density functions for classes i and j, respectively. An unknown vector x is classified into population i if the product of all three terms is greater for class i than for any other class j (not equal to i). In most applications, the prior probabilities and costs of misclassification are treated as being equal. The probability density function for each class for a univariate case can be estimated by using Parzen's nonparametric estimator (70),

$$g(x) = \frac{1}{n\sigma} \sum_{i=1}^{n} W\!\left(\frac{x - x_i}{\sigma}\right) \tag{6}$$
where n is the sample size, σ is a scaling parameter which defines the width of the bell curve that surrounds each sample point, W(d) is a weight function which has its largest value at d = 0, and (x - xi) is the distance between the unknown vector and a vector in the training set. Parzen's nonparametric estimator was later expanded by Cacoullos (71) for the multivariate case,

$$g(x_1, \ldots, x_p) = \frac{1}{n\sigma_1 \cdots \sigma_p} \sum_{i=1}^{n} W\!\left(\frac{x_1 - x_{1,i}}{\sigma_1}, \ldots, \frac{x_p - x_{p,i}}{\sigma_p}\right) \tag{7}$$
The Gaussian function is frequently used as the weight function because it is well-behaved, easily calculated, and satisfies the conditions required by Parzen's estimator. Thus, the probability density function for the multivariate case becomes

$$g(x) = \frac{1}{n} \sum_{i=1}^{n} \exp\!\left(-\sum_{j=1}^{p} \left(\frac{x_j - x_{ij}}{\sigma_j}\right)^{2}\right) \tag{8}$$
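As a rough illustration of how the Gaussian density estimate of eq 8 feeds the decision rule of eq 5, the sketch below scores an unknown vector against each class and picks the class with the larger estimated density. It assumes equal priors and equal misclassification costs, as the text says is common. The function names, the toy two-descriptor training vectors, and the σ values are all hypothetical.

```python
import math


def class_density(x, samples, sigmas):
    """Gaussian Parzen estimate in the form of eq 8 for one class.

    Averages exp(-sum_j ((x_j - x_ij) / sigma_j)^2) over the
    training vectors of the class.
    """
    total = 0.0
    for s in samples:
        exponent = sum(((xj - sj) / sg) ** 2
                       for xj, sj, sg in zip(x, s, sigmas))
        total += math.exp(-exponent)
    return total / len(samples)


def pnn_classify(x, training, sigmas):
    """Pick the class with the highest estimated density.

    Assumption: equal priors and costs, so eq 5 reduces to
    comparing the class densities directly.
    """
    densities = {label: class_density(x, vecs, sigmas)
                 for label, vecs in training.items()}
    return max(densities, key=densities.get), densities


# Hypothetical toy training set with two descriptors per compound.
training = {
    "TPT": [[1.0, 1.0], [1.2, 0.8]],
    "non-TPT": [[-1.0, -1.0], [-0.8, -1.2]],
}
label, densities = pnn_classify([0.9, 1.1], training, sigmas=[1.0, 1.0])
print(label, densities)
```

Each training vector plays the role of one pattern neuron and each call to `class_density` plays the role of one summation neuron, which mirrors the layer structure described next.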
The network architectures of PNN are determined by the number of compounds and descriptors in the training set. There are four layers in a PNN. The input layer provides input values to all neurons in the pattern layer and has as many neurons as the number of descriptors in the training set. The number of pattern neurons is determined by the total number of compounds in the training set. Each pattern neuron computes a distance measure between the input and the training case represented by that neuron and then subjects the distance measure to Parzen's nonparametric estimator. The summation layer has a neuron for each class, and the neurons sum all the pattern neurons' outputs corresponding to members of that summation neuron's class to obtain the estimated probability density function for that class. The single neuron in the output layer then estimates the class of the unknown vector x by comparing all the probability density functions from the summation neurons and choosing the class with the highest probability density function.

Figure 1. Schematic diagram illustrating the process of the prediction of chemical agents with a particular pharmacodynamic, pharmacokinetic, or toxicological property from its structure by using a statistical learning method, probabilistic neural networks (PNN). A and B, feature vectors of agents with the property; E and F, feature vectors of agents without the property; feature vector (hj, pj, vj, ...) represents such structural and physicochemical properties as hydrophobicity, volume, polarizability, etc.

(5) SVM. There are two types of SVM algorithms, linear and nonlinear SVM. Nonlinear SVM, which is illustrated in Figure 2, is more useful for chemical agents of diverse structures and thus more extensively used (35, 39, 48, 49). Linear SVM constructs a hyperplane separating two different classes of feature vectors with a maximum margin (72). This hyperplane is constructed by finding a vector w and a parameter b that minimize ||w||^2 subject to the following conditions: w·xi + b ≥ +1 for yi = +1 (positive class) and w·xi + b ≤ -1 for yi = -1 (negative class). Here xi is a feature vector, yi is the group index, w is a vector normal to the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane to the origin, and ||w||^2 is the squared Euclidean norm of w. Nonlinear SVM projects feature vectors into a high-dimensional feature space by using a kernel function such as K(xi, xj) = exp(-||xj - xi||^2 / 2σ^2). The linear SVM procedure is then applied to the feature vectors in this feature space. After the determination of w and b, a given vector x can be classified by using sign[(w·x) + b]; a positive or negative value indicates that the vector x belongs to the positive or negative class, respectively.
Performance Evaluation. As in the case of all discriminative methods (73), the performance of SLMs can be measured by the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), which are the numbers of TPT agents predicted as TPT, non-TPT agents predicted as non-TPT, non-TPT agents predicted as TPT, and TPT agents predicted as non-TPT, respectively. There are several accuracy functions for measuring prediction performance, which include sensitivity SE = TP/(TP + FN) × 100 (prediction accuracy for TPT agents) and specificity SP = TN/(TN + FP) × 100 (prediction accuracy for non-TPT agents). The overall prediction accuracy (Q) and Matthews correlation coefficient (C) (74) are given by

$$Q = \frac{TP + TN}{TP + TN + FP + FN} \tag{16}$$

$$C = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}} \tag{17}$$
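The four measures can be computed directly from the confusion counts, as in the minimal sketch below. The function name `performance` is hypothetical, Q is scaled to percent here on the assumption that this matches how the tables report it, and the zero-denominator guard for C is an added convention rather than something stated in the text. The example counts are those of the first SVM cross-validation fold in Table 3.

```python
import math


def performance(tp, tn, fp, fn):
    """Return (SE, SP, Q, C) from confusion-matrix counts.

    SE, SP, and Q are in percent; C is the Matthews correlation
    coefficient of eq 17. Returning C = 0.0 when the denominator
    is zero is an assumed convention.
    """
    se = 100.0 * tp / (tp + fn)                    # accuracy on TPT agents
    sp = 100.0 * tn / (tn + fp)                    # accuracy on non-TPT agents
    q = 100.0 * (tp + tn) / (tp + tn + fp + fn)    # overall accuracy
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    c = (tp * tn - fn * fp) / denom if denom else 0.0
    return se, sp, q, c


# Counts from the first SVM cross-validation fold in Table 3.
se, sp, q, c = performance(tp=155, tn=41, fp=14, fn=12)
print(round(se, 1), round(sp, 1), round(q, 1), round(c, 3))
```

For these counts, the computed SE, Q, and C round to the values reported for that fold, which gives a quick consistency check on the formulas.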
Figure 2. Schematic diagram illustrating the process of the prediction of chemical agents with a particular pharmacodynamic, pharmacokinetic, or toxicological property from its structure by using a statistical learning method, support vector machines. A,B, E, F, and (hj, pj, vj,...) are the same as those in Figure 1.
Results and Discussion
Overall Prediction Accuracies. SVM, PNN, k-NN, and LR computations were conducted by using our own software, and the C4.5 DT computation was performed by using the code from Quinlan (34). Table 2 gives the TPT and non-TPT prediction accuracies of each of these methods by using the full set of molecular descriptors and the RFE-selected descriptors. These prediction accuracies were estimated from 5-fold cross-validation studies. The TPT accuracies are in the range of 73.7%∼94.4% when the full set of descriptors was used and in the range of 86.9%∼94.2% when the set of RFE-selected descriptors was used. The non-TPT accuracies are in the range of 59.5%∼82.0% when the full set of descriptors was used and in the range of 71.2%∼87.5% when the set of RFE-selected descriptors was used. The use of RFE-selected descriptors significantly improved the performance of LR, PNN, and SVM from the non-TPT accuracy level of 59.5%∼82.0% to that of 71.2%∼87.5%. The use of RFE-selected descriptors significantly improved the performance of LR from an overall accuracy level of 70.1% to that of 87.1%. The overall accuracy of the other methods is already at a high level of 85.8%∼88.9% when the full set of descriptors was used, and the use of RFE-selected descriptors only slightly improved the overall accuracy of SVM. These results showed that feature selection by using RFE is useful in improving the SLMs' prediction capability for TPT agents, especially for improving the prediction accuracy of non-TPT agents. Because of the random splitting, the estimates may vary with different random number seed parameters; therefore, the whole 5-fold cross-validation procedure was repeated five more times. Similar prediction accuracies were also found from the five additional 5-fold cross-validation studies conducted by using training-testing sets separately generated from different random number seed parameters. On the basis of the computed overall accuracies and Matthews correlation coefficients, the performance of SVM appears to be slightly better than that of the other methods. Detailed prediction results of SVM without RFE and SVM with RFE (SVM + RFE) by using 5-fold cross-validation are given in Table 3. The accuracies of SVM + RFE are 93.5% for TPT agents and 82.0% for non-TPT agents; the former is slightly lower and the latter is substantially better than the values of 94.4% for TPT agents and 72.9% for non-TPT agents derived from SVM without RFE. The overall prediction accuracy is slightly improved, which indicates the usefulness of RFE in selecting the proper set of features for the prediction of TPT and especially of non-TPT agents. The performance of the LR method appears to be more significantly improved than that of other methods when the 49 RFE-selected descriptors were used instead of the full 199-descriptor set. LR generally requires a sufficient number of samples and an appropriate number of descriptors to develop an accurate classification system. Given a fixed number of samples, an excessive number of descriptors may cause the model to be overfitted (37, 75, 76). Moreover, the introduction of an irrelevant molecular descriptor may also affect the performance of the LR model to some extent.
It is likely that the use of the 49 RFE-selected descriptors helps to improve the performance of the LR model by reducing the level of overfitting and that of the noise generated by irrelevant descriptors. It has been shown that chance correlations may occur during descriptor selection, especially if the number of descriptors available for selection is large (77). Y-randomization has been frequently used to determine the probability of chance correlation during descriptor selection processes (78). In y-randomization, a portion of the TPT agents in the data set was randomly selected and converted to non-TPT agents. Another portion of non-TPT agents was also randomly selected and converted to TPT agents. The ratio of TPT agents to non-TPT agents was kept unchanged during y-randomization. The "scrambled" data set was then used for the descriptor selection process. The scrambling of the data set and the descriptor selection process were repeated 10 times. This y-randomization analysis was conducted on the SVM model, which consistently gives the better classification accuracies in this and other studies (46, 47). Unless overfitting is found in the SVM model, no further analysis on other models is to be conducted. The average Matthews correlation coefficient of these scrambled SVM classification systems derived by using the 5-fold cross-validation sets was found to be 0.430, which is significantly lower than that of the original SVM classification system, which is 0.751. This suggests that the original SVM classification system is relevant and unlikely to arise as a result of chance correlation. Apart from the use of a randomized cross-validation method, an external independent validation test set has frequently been used for testing the robustness of a model (13, 21, 30). We were not able to find new T. pyriformis toxic compounds in the recent publications. Hence, representative training and
Table 2. Comparison of the Prediction Accuracy of Tetrahymena pyriformis Toxic (TPT) and Nontoxic (non-TPT) Agents by Different Statistical Learning Methods^a

| method | TPT accuracy (%), all descriptors | TPT accuracy (%), RFE-selected descriptors | non-TPT accuracy (%), all descriptors | non-TPT accuracy (%), RFE-selected descriptors | overall accuracy Q (%), all descriptors | overall accuracy Q (%), RFE-selected descriptors | Matthews correlation coefficient C, all descriptors | Matthews correlation coefficient C, RFE-selected descriptors |
|---|---|---|---|---|---|---|---|---|
| LR | 73.7 | 92.5 | 59.5 | 71.2 | 70.1 | 87.1 | 0.346 | 0.654 |
| C4.5 DT | 91.2 | 89.5 | 70.1 | 72.9 | 85.8 | 85.3 | 0.623 | 0.618 |
| k-NN | 93.2 | 94.2 | 76.4 | 71.5 | 88.9 | 88.4 | 0.706 | 0.690 |
| PNN | 89.2 | 86.9 | 82.0 | 87.5 | 87.3 | 87.1 | 0.684 | 0.696 |
| SVM | 94.4 | 93.5 | 72.9 | 82.0 | 88.9 | 90.4 | 0.701 | 0.751 |

^a The accuracy of each method was estimated from a 5-fold cross-validation by using 841 TPT and 288 non-TPT agents. The methods used are logistic regression (LR), C4.5 decision tree (C4.5 DT), k-nearest neighbor (k-NN), probabilistic neural network (PNN), and support vector machines (SVM). Two sets of molecular descriptors were used: the full set of molecular descriptors and a subset of 49 molecular descriptors selected by using the recursive feature elimination (RFE) method.
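The y-randomization check described in the text can be sketched as follows. This is a minimal illustration and not the authors' code: the number of labels swapped per class (`n_swap`) is not reported in the paper and is an assumption here, and `evaluate` is a hypothetical callback standing in for the full descriptor selection plus 5-fold cross-validation pipeline, returning a Matthews correlation coefficient.

```python
import random


def scramble_labels(labels, n_swap, seed=None):
    """Relabel n_swap toxic (1) compounds as 0 and n_swap nontoxic (0) compounds as 1.

    Swapping the same number of labels in each direction keeps the
    TPT : non-TPT ratio unchanged, as the text requires.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if n_swap > min(len(pos), len(neg)):
        raise ValueError("n_swap exceeds the size of the smaller class")
    scrambled = list(labels)
    for i in rng.sample(pos, n_swap):
        scrambled[i] = 0
    for i in rng.sample(neg, n_swap):
        scrambled[i] = 1
    return scrambled


def y_randomization(labels, evaluate, n_swap, n_rounds=10, seed=0):
    """Average the score of `evaluate` over n_rounds scrambled label sets.

    `evaluate` is a hypothetical callback: it should rerun descriptor
    selection and cross-validation on the scrambled labels and return the MCC.
    """
    scores = [evaluate(scramble_labels(labels, n_swap, seed + r))
              for r in range(n_rounds)]
    return sum(scores) / len(scores)


# Class sizes from the paper: 841 TPT and 288 non-TPT agents.
labels = [1] * 841 + [0] * 288
scrambled = scramble_labels(labels, n_swap=100, seed=1)  # n_swap is an assumed value
```

A model whose score on the scrambled labels stays close to its score on the true labels would be suspected of fitting chance correlations; the paper reports 0.430 versus 0.751 for its SVM system.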
Table 3. Performance of Support Vector Machines (SVM) and Support Vector Machines with Recursive Feature Elimination (SVM + RFE) for Predicting Tetrahymena pyriformis Toxic (TPT) and Nontoxic (non-TPT) Agents as Evaluated by 5-Fold Cross-Validation^a

| method | cross-validation | TPT: TP | TPT: FN | TPT: SE (%) | non-TPT: TN | non-TPT: FP | non-TPT: SP (%) | Q (%) | C |
|---|---|---|---|---|---|---|---|---|---|
| SVM | 1 | 155 | 12 | 92.8 | 41 | 14 | 74.6 | 88.3 | 0.682 |
| SVM | 2 | 168 | 10 | 94.4 | 44 | 18 | 71.0 | 88.3 | 0.685 |
| SVM | 3 | 156 | 6 | 96.3 | 40 | 20 | 66.7 | 88.3 | 0.690 |
| SVM | 4 | 162 | 5 | 97.0 | 36 | 10 | 78.3 | 93.0 | 0.786 |
| SVM | 5 | 153 | 14 | 91.6 | 48 | 17 | 73.9 | 86.7 | 0.664 |
| SVM | average | | | 94.4 | | | 72.9 | 88.9 | 0.701 |
| SVM | SD | | | ±2.27 | | | ±4.33 | ±2.38 | ±0.05 |
| SVM | SE | | | ±1.02 | | | ±1.94 | ±1.06 | ±0.02 |
| SVM + RFE | 1 | 155 | 12 | 92.8 | 44 | 11 | 80.0 | 89.6 | 0.724 |
| SVM + RFE | 2 | 167 | 11 | 93.8 | 46 | 16 | 74.2 | 88.8 | 0.700 |
| SVM + RFE | 3 | 154 | 8 | 95.1 | 45 | 15 | 75.0 | 89.6 | 0.730 |
| SVM + RFE | 4 | 158 | 9 | 94.6 | 45 | 1 | 97.8 | 95.3 | 0.874 |
| SVM + RFE | 5 | 152 | 15 | 91.0 | 54 | 11 | 83.1 | 88.8 | 0.728 |
| SVM + RFE | average | | | 93.5 | | | 82.0 | 90.4 | 0.751 |
| SVM + RFE | SD | | | ±1.61 | | | ±9.56 | ±2.76 | ±0.07 |
| SVM + RFE | SE | | | ±0.72 | | | ±4.28 | ±1.24 | ±0.03 |

^a Predicted results are given in TP (true positive), FN (false negative), TN (true negative), FP (false positive), SE (sensitivity), which is the prediction accuracy for TPT, SP (specificity), which is the prediction accuracy for non-TPT, Q (overall prediction accuracy), and C (Matthews correlation coefficient). Statistical significance is indicated by SD (standard deviation) and SE (standard error) in the rows below each average.
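The quantities in the Table 3 footnote follow from the four confusion-matrix counts. The sketch below uses the standard definitions of sensitivity, specificity, overall accuracy, and the Matthews correlation coefficient; the function name is illustrative and not taken from the paper.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Return SE, SP, Q (all in %) and the Matthews correlation coefficient C."""
    se = 100.0 * tp / (tp + fn)                   # accuracy on TPT agents
    sp = 100.0 * tn / (tn + fp)                   # accuracy on non-TPT agents
    q = 100.0 * (tp + tn) / (tp + fn + tn + fp)   # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    c = (tp * tn - fp * fn) / denom if denom else 0.0
    return se, sp, q, c


# Fold 2 of the SVM model without RFE in Table 3: TP=168, FN=10, TN=44, FP=18.
se, sp, q, c = classification_metrics(168, 10, 44, 18)
print(f"SE={se:.1f}  SP={sp:.1f}  Q={q:.1f}  C={c:.3f}")
```

For this fold the function gives SE = 94.4, SP = 71.0, Q = 88.3, and C = 0.685, matching the corresponding row of Table 3. Because the data set holds roughly three TPT agents for every non-TPT agent, C is a more informative summary than Q alone.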
Table 4. Comparison of the Prediction Accuracy of Tetrahymena pyriformis Toxic (TPT) and Nontoxic (non-TPT) Agents by Different Statistical Learning Methods^a

| method | parameter | TP | FN | TN | FP | TPT accuracy (%) | non-TPT accuracy (%) | overall accuracy (%) | Matthews correlation coefficient C |
|---|---|---|---|---|---|---|---|---|---|
| LR | | 277 | 4 | 67 | 29 | 98.6 | 69.8 | 91.2 | 0.762 |
| C4.5 DT | | 266 | 15 | 75 | 21 | 94.7 | 78.1 | 90.5 | 0.744 |
| k-NN | 3 | 280 | 1 | 70 | 26 | 99.6 | 72.9 | 92.8 | 0.859 |
| PNN | 0.2 | 273 | 8 | 89 | 7 | 97.2 | 92.7 | 96.0 | 0.896 |
| SVM | 0.4 | 280 | 1 | 85 | 11 | 99.6 | 88.5 | 96.8 | 0.916 |

^a The accuracy of each method was estimated from an independent validation set by using 281 TPT and 96 non-TPT agents. Predicted results are given in TP (true positive), FN (false negative), TN (true negative), FP (false positive), SE (sensitivity), which is the prediction accuracy for TPT, SP (specificity), which is the prediction accuracy for non-TPT, Q (overall prediction accuracy), and C (Matthews correlation coefficient). LR (logistic regression), LDA (linear discriminant analysis), C4.5 DT (C4.5 decision tree), k-NN (k-nearest neighbor), PNN (probabilistic neural network), SVM (support vector machine).
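The independent validation set behind Table 4 was built by pairing compounds that lie close together in descriptor space and distributing each pair across the training and validation sets. A minimal sketch of such a split is given below. The paper does not state the distance measure, the closeness cutoff, or whether the classes were paired separately, so the Euclidean distance, the `max_distance` parameter, and the greedy pairing order are all assumptions made for illustration.

```python
import math
from itertools import combinations


def split_by_chemical_space(descriptors, max_distance):
    """Greedy closest-pair split into training and validation indices.

    descriptors:  list of equal-length numeric vectors, one per compound.
    max_distance: hypothetical cutoff; pairs farther apart than this are
                  not treated as close counterparts.
    """
    # Rank all compound pairs by their distance in descriptor space.
    pairs = sorted(
        (math.dist(descriptors[i], descriptors[j]), i, j)
        for i, j in combinations(range(len(descriptors)), 2)
    )
    used, training, validation = set(), [], []
    for d, i, j in pairs:
        if d > max_distance:
            break  # all remaining pairs are farther apart
        if i in used or j in used:
            continue
        used.update((i, j))
        training.append(i)     # one member of each close pair per set
        validation.append(j)
    # Compounds without a close counterpart go to the training set only.
    training.extend(k for k in range(len(descriptors)) if k not in used)
    return sorted(training), sorted(validation)


# Toy 2-D descriptor vectors: two tight pairs and one isolated compound.
toy = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (20.0, 20.0)]
train_idx, valid_idx = split_by_chemical_space(toy, max_distance=1.0)
print(train_idx, valid_idx)
```

On the toy data, one member of each tight pair lands in each set and the isolated compound is kept for training, so every validation compound has a near neighbor in the training set.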
validation sets were constructed from our existing data sets according to their distribution in the chemical space, with the training set used for model construction and the validation set for independent testing. A chemical space is defined by the structural and chemical descriptors of the compounds, and each compound occupies a particular location in this space. All possible pairs of compounds were generated, and a distance score was computed for each pair. The pairs were then ranked by their distance scores, and the compounds in the closest pairs were evenly assigned to the two separate data sets. Compounds without close-distance counterparts were assigned exclusively to the training set. This procedure gives a training set with 560 TPT and 192 non-TPT agents, and a validation set with 281 TPT and 96 non-TPT agents. The prediction results for each of the SLM prediction systems with the selected descriptors can be found in Table 4. A frequently used method for checking whether a prediction system is overfitted is to compare the prediction accuracies determined by cross-validation with those determined on independent validation sets (79). Because descriptor selection was performed by using the cross-validation sets as the model testing sets, an overfitted classification system is expected to show much higher prediction accuracy on the cross-validation sets than on the independent validation sets. As shown in Table 4, the prediction accuracies
of the SLM systems based on the 5-fold cross-validation method and those based on independent validation sets are similar. This shows that the SLM classification systems in this work are unlikely to be overfitted.

Relevance of Selected Molecular Descriptors to T. pyriformis Toxicity Prediction. Apart from the quality of the data sets used, selection of the molecular descriptors most relevant to the prediction of TPT agents is important for optimizing the prediction models and for elucidating the molecular factors contributing to TPT. QSAR models generally rely on a group of descriptors designed specifically to represent the studied TPT compounds, which share similar structural groups or structural alerts. A total of 49 molecular descriptors were selected by the RFE method in this work, and all of them match or partially match descriptors used in published TPT QSAR models and other statistical learning models. Table 5 lists the molecular descriptors selected by the RFE method for classification of TPT and non-TPT agents. The first, second, and third columns give the RFE-selected descriptors, their descriptions, and the matched or partially matched molecular descriptors used in previously published TPT models, respectively. T. pyriformis toxicity involves multiple mechanisms and chemical features that may not be fully described by a few descriptors, particularly for a larger number of compounds than those covered in previous studies. Gunatilleka and Poole reported, based on their study of compounds that are believed to interact only through a narcosis-type mechanism, that the main factors contributing to nonspecific toxicity are solute size and lone-pair electron interactions, with solute size and hydrogen bond basicity being the most important factors for toxicity (80). They found that small compounds of high hydrogen bond basicity tend to show higher toxicity. 
The descriptor for hydrogen bond acceptor basicity (b) is selected by RFE in this study. Rugty (molecular rugosity), Gloty (molecular globularity), dis1 (length vector longest distance), dis2 (length vector longest 3rd atom), and dis3 (length vector longest 4th atom) are the RFE-selected VolSurf and other descriptors related to the size and shape of a compound. Thus, our study is consistent with the conclusion of Gunatilleka and Poole. Apart from these descriptors, several studies have found polar surface area descriptors to be significant for toxicity predictions (13, 14, 26, 27). The RFE-selected descriptors that cover polar surface properties are Sapc (sum of solvent-accessible surface areas of positively charged atoms) and Sanc (sum of solvent-accessible surface areas of negatively charged atoms). Narcosis is often associated with hydrophobicity. Uptake, transport, and distribution of organic toxicants via passive diffusion are best modeled by hydrophobicity, which is usually represented by the 1-octanol/water partition coefficient (logKow) (25-28, 30). Shpb (hydrophobic region) and Hiwpb (hydrophobic integy moment) are the RFE-selected descriptors correlated with the hydrophobic descriptors used in previous QSAR studies (25-28). Reactive toxicants are known to be electrophiles associated with irreversible interactions with biomolecules (16). It has been reported that both global and local electrophilicity can capture the toxicity properties of a large variety of aliphatic compounds (15). Electrophilicity is related to the flow of electrons between a donor and an acceptor and is associated with the electronic chemical potential and the chemical hardness of the ground state of atoms or molecules (15). Both chemical potential (µcp) and absolute hardness (η) are selected by the RFE method in this study. In addition, both absolute hardness (η) and the chemical potential (µcp) have high
correlation with hydrogen bond donor acidity as reported by Mignon and co-workers (81). Steric properties of a compound can affect membrane transport and specific interactions at reactive sites, with implications for its toxicological profile. For instance, the isobutyl and isoamyl acrylates are less toxic than the linear analogues (25). The steric properties of a molecule can be described by topological descriptors such as the topological shape index (29). The RFE-selected topological descriptors in this study include four molecular connectivity descriptors: 3χP (simple molecular connectivity Chi indices for path order 3), 3χC (simple molecular connectivity Chi indices for cluster), 4χPC (simple molecular connectivity Chi indices for path/cluster), and 1χv (valence molecular connectivity Chi indices for path order 1). Moreover, other topological descriptors such as Tcent (centric index), Tbala (Balaban index), Tradi (PetitJohn R2 index), Tpeti (PetitJohn I2 index), and Tiwie (information Weiner) are also selected by RFE. Weak acid respiratory uncouplers, which are generally bulky and electronegative compounds, produce their toxic effect by disrupting ATP synthesis. They disrupt the hydrogen ion gradient in the inner mitochondrial membrane (12). Increased acidity of phenols promotes their ability to uncouple the respiratory chain from oxidative phosphorylation (13). Moreover, the toxicity of α,β-unsaturated aldehydes such as cinnamaldehyde has been attributed to Schiff-base formation (25, 28). These two toxic events are related to the ionization ability and unequal charge distribution profile of a compound. Both acidity and Schiff-base formation are represented by the three RFE-selected descriptors QC,Max (most positive charge on C atoms), QH,Max (most positive charge on H atoms), and IP (ionization potential). Genotoxicity is one of the causes of the observed T. 
pyriformis toxicity, which involves either oxidative damage to DNA by free radicals formed by one-electron oxidation (23) or DNA methylation via intercalation. The four RFE-selected descriptors absolute hardness (η), chemical potential (µcp), electronegativity index (χen), and electrophilicity index (ω) are correlated with the formation of free radicals. DNA intercalation is primarily mediated by aromatic rings via π-π interactions with DNA base pairs. The RFE-selected electrotopological state descriptors S(9) and S(13), which describe the atom-type H estate sums for =CH- (sp2) and CHn (unsaturated), represent the ability of a compound to form π-π interactions with DNA base pairs. The other RFE-selected descriptor S(16), which describes the atom-type estate sum for -CH3, can be used to represent the potential for DNA methylation.

Performance Evaluation. To assess the performance of SLMs for TPT agent prediction on the more diverse set of molecules, it is useful to examine whether the accuracy of these methods is at a similar level to that obtained with a significantly smaller number of agents. A direct comparison with results from previous studies is inappropriate because of the differences in the data sets and molecular descriptors used. Although desirable, it is impossible to conduct a separate comparison using results taken directly from other studies without full information about the molecular descriptor algorithms and classification methods used in each study. Nonetheless, a tentative comparison may provide a crude estimate of the approximate level of accuracy of the TPT agent prediction systems studied in this work. From Table 2, the overall accuracies of all of our tested SLMs are in the range of 85.3%∼90.4%, with SVM, k-NN, and PNN giving better performance. These accuracies are comparable to the
Table 5. Molecular Descriptors Selected from the Recursive Feature Elimination (RFE) Method for Support Vector Machine (SVM) Classification of Tetrahymena pyriformis Toxic (TPT) and Nontoxic (non-TPT) Agents

| molecular descriptor | description | matched or partially matched molecular descriptors used in published TPT models |
|---|---|---|
| noxy | count of O atoms | total number of oxygen atoms (26) |
| 3χP | simple molecular connectivity Chi indices for path order 3 | total number of paths, χ value of path chains of length 7, no. of path chains of length 6, eccentric connectivity index (17); Kier simple and valence molecular connectivity indices (27) |
| 3χC | simple molecular connectivity Chi indices for cluster | total number of paths, χ value of path chains of length 7, no. of path chains of length 6, eccentric connectivity index (17); Kier simple and valence molecular connectivity indices (27) |
| 4χPC | simple molecular connectivity Chi indices for path/cluster | total number of paths, χ value of path chains of length 7, no. of path chains of length 6, eccentric connectivity index (17); Kier simple and valence molecular connectivity indices (27) |
| 1χv | valence molecular connectivity Chi indices for path order 1 | total number of paths, χ value of path chains of length 7, no. of path chains of length 6, eccentric connectivity index (17); Kier simple and valence molecular connectivity indices (27) |
| S(9) | atom-type H estate sum for =CH- (sp2) | no. of sp2 hybridized carbons bonded to 3 other carbons (17), sum of estate indices (27) |
| S(12) | atom-type H estate sum for CHn (saturated) | sum of estate indices (27) |
| S(13) | atom-type H estate sum for CHn (unsaturated) | no. of sp2 hybridized carbons bonded to 3 other carbons (17), sum of estate indices (27) |
| S(16) | atom-type estate sum for -CH3 | sum of estate indices (27) |
| S(17) | atom-type estate sum for =CH2 | no. of sp2 hybridized carbons bonded to 3 other carbons (17), sum of estate indices (27) |
| S(18) | atom-type estate sum for >CH2 | sum of estate indices (27) |
| S(20) | atom-type estate sum for =CH- | no. of sp2 hybridized carbons bonded to 3 other carbons (17), sum of estate indices (27) |
| S(22) | atom-type estate sum for >CH- | sum of estate indices (27) |
| S(25) | atom-type estate sum for ≡C- | sum of estate indices (27) |
| S(41) | atom-type estate sum for -O- | estate index for the hydroxy group (83) |
| Tcent | centric index | eccentric connectivity index (17) |
| Tbala | Balaban index | Balaban, Weiner topological indices (27) |
| Tradi | PetitJohn R2 index | Balaban, Weiner topological indices (27) |
| Tpeti | PetitJohn I2 index | Balaban, Weiner topological indices (27) |
| Tiwie | information Weiner | Balaban, Weiner topological indices (27) |
| b | hydrogen bond acceptor basicity (covalent HBAB) | total number of hydrogen bond acceptors (26) |
| μ | molecular dipole moment | dipole moment (24, 26) |
| η | absolute hardness | absolute hardness (23, 25) |
| IP | ionization potential (IP) | ionization potential (15, 27); corrected for ionization (16), energy value of HOMO (84) |
| µcp | chemical potential | chemical potential (15) |
| χen | electronegativity index | HOMO energy (13), σ and π electronegativity (18) |
| ω | electrophilicity index | soft electrophilicity index (84) |
| QH,Max | most positive charge on H atoms | the largest positive charge on a hydrogen atom (14, 25), acidity constant (30) |
| QC,Max | most positive charge on C atoms | partial charge of C (25) |
| QH,Min | most negative charge on H atoms | maximum partial charge of H atoms (25) |
| QC,Min | most negative charge on C atoms | partial charge of C (25) |
| AQ,min | most negative charge in a molecule | largest negative charge value (27) |
| QC,SS | sum of squares of charges on C and all atoms | partial charge of C (25) |
| Mnc | mean of negative charges | average atomic negative charges (13) |
| Mac | mean absolute charge | sum of absolute charge on N and O atoms (83) |
| dis1 | length vectors longest distance | molecular surface areas (SA) (13) |
| dis2 | length vectors longest third atom | molecular surface areas (SA) (13) |
| dis3 | length vectors longest 4th atom | molecular surface areas (SA) (13) |
| Sapc | sum of solvent accessible surface areas of positively charged atoms | molecular surface area (26, 27) |
| Sanc | sum of solvent accessible surface areas of negatively charged atoms | the negatively charged molecular surface area (14) |
| Sancw | sum of charge weighted solvent accessible surface areas of negatively charged atoms | the negatively charged molecular surface area (14) |
| Svncw | sum of charge weighted van der Waals surface areas of negatively charged atoms | the negatively charged molecular surface area (14) |
| Rugty | molecular rugosity | charged partial surface area (13) |
| Gloty | molecular globularity | charged partial surface area (13), normalized molecular area projected onto xz-plane (17) |
| Shpb | hydrophobic region | hydrophobicity values (12, 16, 21, 24); octanol/water partition coefficient (25-28, 30) |
| Capty | capacity factor | charged partial surface area (13) |
| Hiwpl | hydrophilic integy moment | electrophilicity (15, 21), dipole moment (26) |
| Hiwpb | hydrophobic integy moment | hydrophobicity values (12, 16, 21, 24), octanol/water partition coefficient (25-28, 30) |
| Hiwpa | amphiphilic moment | electrophilicity (15, 21), super-delocalizability (12, 28) |
accuracies of 90%∼92% derived from the linear regression and neural network models of 220∼448 compounds (17, 18), but somewhat lower than the accuracies of 97%∼99% derived from the linear discriminant analysis and binary logistic regression models of 220 compounds (13). Overall, our study suggests that statistical learning methods, particularly SVM, k-NN, and PNN, are useful for toxicity prediction of a broad range of molecules. The prediction accuracy of these methods is at a similar level to that of some earlier studies that were tested on a much smaller number of molecules. Another advantage of these methods is that they do not require knowledge about the molecular mechanism or structure-activity relationship of a particular drug property before a prediction model can be developed. However, these methods are limited in providing detailed information about the mechanism of the predicted properties, even though some knowledge may be extracted from the molecular descriptors selected by feature selection methods. Development of regression-based SLM models may enable a more quantitative description of the mechanism of predicted properties (16, 21, 25). Moreover, the classification speed of these methods is generally fast. For instance, the number of compounds that can be classified per second by using the SVM, k-NN, PNN, and DT methods is approximately 4000, 3000, 2000, and 62 000, respectively, on a P4 3.6 GHz machine. SVM typically uses a portion of the training set as support vectors for classification. In contrast, k-NN and PNN use the whole training set for classification. The number of support vectors of SVM is in the range of 45-75% of the training set. Thus, the classification speed of SVM is usually 25-55% faster than that of k-NN and PNN. On the other hand, the classification speed of SVM is slower than that of decision tree methods, which use a set of rules to reach a decision leaf.
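The 25-55% figure follows from the support vector fraction if one assumes that the cost of classifying a compound is proportional to the number of stored training vectors it must be compared with. The short calculation below makes that arithmetic explicit. The proportional-cost assumption and the use of the 752-compound training set (560 TPT plus 192 non-TPT agents) as the example size are both illustrative choices, not statements from the paper.

```python
def stored_vectors_compared(n_train, sv_fraction):
    """Return (number of support vectors, fraction of per-query comparisons saved).

    Assumes k-NN and PNN compare a query with all n_train compounds,
    whereas SVM compares it with the support vectors only.
    """
    n_sv = round(n_train * sv_fraction)
    return n_sv, 1.0 - n_sv / n_train


# 752 = 560 TPT + 192 non-TPT training compounds (illustrative size).
for fraction in (0.45, 0.75):
    n_sv, saved = stored_vectors_compared(752, fraction)
    print(f"{fraction:.0%} support vectors -> {n_sv} comparisons, "
          f"{saved:.0%} fewer than 752")
```

With 45% and 75% of the training set retained as support vectors, the comparisons per query drop by about 55% and 25%, respectively, which reproduces the quoted range.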
Conclusion

This study shows that statistical learning methods, particularly SVM, k-NN, and PNN, are useful for predicting the toxic potential of a diverse set of molecules to T. pyriformis without requiring knowledge of mechanisms or a choice of specific molecular descriptors. The current SLMs are limited in their ability to facilitate the study of the mechanism of predicted properties. This weakness may be partially overcome by developing regression-based SLM models that describe the mechanism of predicted properties quantitatively. The prediction accuracy for some toxic agents is known to be affected by the inadequate coverage of the currently available molecular descriptors (38). The introduction of additional molecular descriptors is useful for further improving the prediction accuracy of SLMs. Current efforts are directed at improving the efficiency and speed of feature selection methods (65), which can further help to select molecular descriptors optimally and enable the development of more accurate and efficient computational tools for toxicity prediction. Moreover, recent work on the introduction of weighting functions into SVM descriptors (82) may also be helpful in developing SVM into a practical tool for predicting the toxicological properties of chemical agents.

Supporting Information Available: List of the 841 TPT and 288 non-TPT agents found. This material is available free of charge via the Internet at http://pubs.acs.org.
References

(1) Johnson, D., and Wolfgang, G. (2000) Predicting human safety: Screening and computational approaches. Drug Discovery Today 5, 445-454.
(2) van de Waterbeemd, H., and Gifford, E. (2003) ADMET in silico modelling: Towards prediction paradise? Nat. Rev. Drug Discovery 2, 192-204.
(3) Safe, S. (1990) Polychlorinated biphenyls (PCBs), dibenzo-p-dioxins (PCDDs), dibenzofurans (PCDFs), and related compounds: Environmental and mechanistic considerations which support the development of toxic equivalency factors (TEFs). Crit. Rev. Toxicol. 21, 51-88.
(4) McDuffie, H. (2005) Host factors and genetic susceptibility: a paradigm of the conundrum of pesticide exposure and cancer associations. Rev. Environ. Health 20, 77-101.
(5) Kaufman, M., Smolinske, S., and Keswick, D. (2005) Assessing poisoning risks related to storage of household hazardous materials: Using a focus group to improve a survey questionnaire. Environ. Health 4, 16.
(6) Needham, L., Barr, D., Caudill, S., Pirkle, J., Turner, W., Osterloh, J., Jones, R., and Sampson, E. (2005) Concentrations of environmental chemicals associated with neurodevelopmental effects in U.S. population. Neurotoxicology 26, 531-545.
(7) Sauvant, M., Pepin, D., and Piccinni, E. (1999) Tetrahymena pyriformis: A tool for toxicological studies. A review. Chemosphere 38, 1631-1669.
(8) Sauvant, M., Pepin, D., Bohatier, J., and Groliere, C. (1995) Microplate technique for screening and assessing cytotoxicity of xenobiotics with Tetrahymena pyriformis. Ecotoxicol. Environ. Saf. 32, 159-165.
(9) Wu, C., Fry, C., and Henry, J. (1997) Membrane toxicity of opioids measured by protozoan motility. Toxicology 117, 35-44.
(10) Darcy, P., Kelly, J., Leonard, B., and Henry, J. (2002) The effect of lofepramine and other related agents on the motility of Tetrahymena pyriformis. Toxicol. Lett. 128, 207-214.
(11) Bonnet, J., Dusser, M., Bohatier, J., and Laffosse, J. (2003) Cytotoxicity assessment of three therapeutic agents, cyclosporin-A, cisplatin and doxorubicin, with the ciliated protozoan Tetrahymena pyriformis. Res. Microbiol. 154, 375-385.
(12) Schultz, T. (1999) Structure-toxicity relationships for benzenes evaluated with Tetrahymena pyriformis. Chem. Res. Toxicol. 12, 1262-1267.
(13) Schuurmann, G., Aptula, A., Kuhne, R., and Ebert, R. (2003) Stepwise discrimination between four modes of toxic action of phenols in the Tetrahymena pyriformis assay. Chem. Res. Toxicol. 16, 974-987.
(14) Devillers, J. (2004) Linear versus nonlinear QSAR modeling of the toxicity of phenol derivatives to Tetrahymena pyriformis. SAR QSAR Environ. Res. 15, 237-249.
(15) Roy, D., Parthasarathi, R., Maiti, B., Subramanian, V., and Chattaraj, P. (2005) Electrophilicity as a possible descriptor for toxicity prediction. Bioorg. Med. Chem. 13, 3405-3412.
(16) Aptula, A., Roberts, D., Cronin, M., and Schultz, T. (2005) Chemistry-toxicity relationships for the effects of di- and trihydroxybenzenes to Tetrahymena pyriformis. Chem. Res. Toxicol. 18, 844-854.
(17) Serra, J., Jurs, P., and Kaiser, K. (2001) Linear regression and computational neural network prediction of tetrahymena acute toxicity for aromatic compounds from molecular structure. Chem. Res. Toxicol. 14, 1535-1545.
(18) Spycher, S., Pellegrini, E., and Gasteiger, J. (2005) Use of structure descriptors to discriminate between modes of toxic action of phenols. J. Chem. Inf. Model. 45, 200-208.
(19) Niculescu, S., Kaiser, K., and Schultz, T. (2000) Modeling the toxicity of chemicals to Tetrahymena pyriformis using molecular fragment descriptors and probabilistic neural networks. Arch. Environ. Contam. Toxicol. 39, 289-298.
(20) Gini, G., Craciun, M., Konig, C., and Benfenati, E. (2004) Combining unsupervised and supervised artificial neural networks to predict aquatic toxicity. J. Chem. Inf. Comput. Sci. 44, 1897-1902.
(21) Ren, S. (2003) Modeling the toxicity of aromatic compounds to Tetrahymena pyriformis: The response surface methodology with nonlinear methods. J. Chem. Inf. Comput. Sci. 43, 1679-1687.
(22) Cottrell, M., and Schultz, T. (2003) Structure-toxicity relationships for methyl esters of cyanoacetic acids to Tetrahymena pyriformis. Bull. Environ. Contam. Toxicol. 70, 549-556.
(23) Netzeva, T., Aptula, A., Chaudary, S., Duffy, J., Schultz, T., Schüürmann, G., and Cronin, M. (2003) Structure-activity relationships for the toxicity of substituted poly-hydroxylated benzenes to Tetrahymena pyriformis: Influence of free radical formation. QSAR Comb. Sci. 22, 575-582.
(24) Gonzalez, M., Diaz, H., Cabrera, M., and Ruiz, R. (2004) A novel approach to predict a toxicological property of aromatic compounds in the Tetrahymena pyriformis. Bioorg. Med. Chem. 12, 735-744.
(25) Schultz, T., Netzeva, T., Roberts, D., and Cronin, M. (2005) Structure-toxicity relationships for the effects to Tetrahymena pyriformis of aliphatic, carbonyl-containing, α,β-unsaturated chemicals. Chem. Res. Toxicol. 18, 330-341.
(26) Cronin, M., and Schultz, T. (2001) Development of quantitative structure-activity relationships for the toxicity of aromatic compounds to Tetrahymena pyriformis: comparative assessment of the methodologies. Chem. Res. Toxicol. 14, 1284-1295.
(27) Cronin, M., Aptula, A., Duffy, J., Netzeva, T., Rowe, P., Valkova, I., and Schultz, T. (2002) Comparative assessment of methods to develop QSARs for the prediction of the toxicity of phenols to Tetrahymena pyriformis. Chemosphere 49, 1201-1221.
(28) Netzeva, T., and Schultz, T. (2005) QSARs for the aquatic toxicity of aromatic aldehydes from Tetrahymena data. Chemosphere 61, 1632-1643.
(29) Netzeva, T., Schultz, T., Aptula, A., and Cronin, M. (2003) Partial least squares modelling of the acute toxicity of aliphatic compounds to Tetrahymena pyriformis. SAR QSAR Environ. Res. 14, 265-283.
(30) Melagraki, G., Afantitis, A., Makridima, K., Sarimveis, H., and Igglessi-Markopoulou, O. (2005) Prediction of toxicity using a novel RBF neural network training methodology. J. Mol. Model. 8, 1-9.
(31) Cristianini, N., and Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, New York.
(32) Burges, C. (1998) A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discovery 2, 127-167.
(33) Johnson, R., and Wichern, D. (1982) Applied Multivariate Statistical Analysis, Prentice Hall, Englewood Cliffs, NJ.
(34) Quinlan, J. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA.
(35) Li, H., Yap, C., Ung, C., Xue, Y., Cao, Z., and Chen, Y. (2005) Effect of selection of molecular descriptors on the prediction of blood-brain barrier penetrating and nonpenetrating agents by statistical learning methods. J. Chem. Inf. Model. 45, 1376-1384.
(36) Xue, Y., Yap, C., Sun, L., Cao, Z., Wang, J., and Chen, Y. (2004) Prediction of p-glycoprotein substrates by support vector machine approach. J. Chem. Inf. Comput. Sci. 44, 1497-1505.
(37) Xue, Y., Li, Z., Yap, C., Sun, L., Chen, X., and Chen, Y. (2004) Effect of molecular descriptor feature selection in support vector machine classification of pharmacokinetic and toxicological properties of chemical agents. J. Chem. Inf. Comput. Sci. 44, 1630-1638.
(38) Li, H., Ung, C., Yap, C., Xue, Y., Li, Z., Cao, Z., and Chen, Y. (2005) Prediction of genotoxicity of chemical compounds by statistical learning methods. Chem. Res. Toxicol. 18, 1071-1080.
(39) Yap, C., and Chen, Y. (2005) Prediction of cytochrome P450 3A4, 2D6, and 2C9 inhibitors and substrates by using support vector machines. J. Chem. Inf. Model. 45, 982-992.
(40) Li, H., Yap, C., Xue, Y., Li, Z., Ung, C., Han, L., and Chen, Y. (2006) Statistical learning approach for predicting specific pharmacodynamic, pharmacokinetic or toxicological properties of pharmaceutical agents. Drug Dev. Res. 66, 1-15.
(41) Hosmer, D., and Lemeshow, S. (1989) Applied Logistic Regression, Wiley, New York.
(42) Specht, D. (1990) Probabilistic neural networks. Neural Netw. 3, 109-118.
(43) Serra, J., Thompson, E., and Jurs, P. (2003) Development of binary classification of structural chromosome aberrations for a diverse set of organic compounds from molecular structure. Chem. Res. Toxicol. 16, 153-163.
(44) Yap, C., Cai, C., Xue, Y., and Chen, Y. (2004) Prediction of torsade-causing potential of drugs by support vector machine approach. Toxicol. Sci. 79, 170-177.
(45) Trotter, M., Buxton, B., and Holden, S. (2001) Support vector machines in combinatorial chemistry. Meas. Control 34, 235-239.
(46) Burbidge, R., Trotter, M., Buxton, B., and Holden, S. (2001) Drug design by machine learning: Support vector machines for pharmaceutical data analysis. Comput. Chem. 26, 5-14.
(47) Czerminski, R., Yasri, A., and Hartsough, D. (2001) Use of support vector machine in pattern classification: Application to QSAR studies. Quant. Struct.-Act. Relat. 20, 227-240.
(48) Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. (2002) Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389-422.
(49) Trotter, M., and Holden, S. (2003) Support vector machines for ADME property classification. QSAR Comb. Sci. 22, 533-548.
(50) Furey, T., Cristianini, N., Duffy, N., Bednarski, D., Schummer, M., and Haussler, D. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 16, 906-914.
(51) CambridgeSoft Corporation. (2002) ChemDraw, Cambridge, MA 02140 USA.
(52) Accelrys. DS ViewerPro, California, USA.
(53) Pearlman, R. CONCORD user’s manual, Tripos, St. Louis, MO.
(54) Schultz, T. (1997) Tetrahymena pyriformis population growth impairment endpoint: A surrogate for fish lethality. Toxicol. Methods 7, 289-309.
(55) Hastie, T., Tibshirani, R., and Friedman, J. (2001) The Elements of Statistical Learning, Springer, New York.
Chem. Res. Toxicol., Vol. 19, No. 8, 2006 1039 (56) Todeschini, R., and Consonni, V. (2000) Handbook of Molecular Descriptors, Wiley-VCH, Weinheim, Germany. (57) Katritzky, A., and Gordeeva, E. (1993) Traditional topological indices vs. electronic, geometrical and combined molecular descriptors in QSAR/QSPR research. J. Chem. Inf. Comput. Sci. 33, 835-857. (58) Cruciani, G., Pastor, M., and Guba, W. (2000) VolSurf: A new tool for the pharmacokinetic optimization of lead compounds. Eur. J. Pharm. Sci. 11, S29-S39. (59) Kier, L., and Hall, L. (1999) Molecular Structure Description: The Electrotopological State, Academic Press, San Diego, CA. (60) Nicolotti, O., and Carotti, A. (2006) QSAR and QSPR studies of a highly structured physicochemical domain. J. Chem. Inf. Comput. Sci. 46, 264-276. (61) Dewar, M., Zoebisch, E., Healy, E., and Steward, J. (1985) AM1: A new general purpose quantum mechanical molecular model. J. Am. Chem. Soc. 107, 3902-3909. (62) Xue, C., Zhang, R., Liu, H., Liu, M., Hu, Z., and Fan, B. (2004) Support vector machines-based quantitative structure-property relationship for the prediction of heat capacity. J. Chem. Inf. Comput. Sci. 44, 1267-1274. (63) Roncaglioni, A., Novic, M., Vracko, M., and Benfenati, E. (2004) Classification of potential endocrine disrupters on the basis of molecular structure using a nonlinear modeling method. J. Chem. Inf. Comput. Sci. 44, 300-309. (64) Modarresi, H., Dearden, J., and Modarress, H. (2006) QSPR Correlation of melting point for drug compounds based on different sources of molecular descriptors. J. Chem. Inf. Comput. Sci. 46, 930-936. (65) Furlanello, C., Serafini, M., Merler, S., and Jurman, G. (2003) An accelerated procedure for recursive feature ranking on microarray data. Neural Netw. 16, 641-648. (66) Degroeve, S., De Baets, B., Van de Peer, Y., and Rouze´, P. (2002) Feature subset selection for splice site prediction. Bioinformatics 18, S75-S83. (67) Bayada, D., Hamersma, H., and van Geerestein, V. 
(1999) Molecular diversity and representativity in chemical databases. J. Chem. Inf. Comput. Sci. 39, 1-10. (68) Yu, H., Yang, J., Wang, W., and Han, J. (2003) Discovering compact and highly discriminative features or feature combinations of drug activities using support vector machines. Proc. IEEE Bioinf. Conf., 2nd, 2003, 220-228. (69) Carnahan, B., Meyer, G., and Kuntz, L. (2003) Comparing statistical and machine learning classifiers: Alternatives for predictive modeling in human factors research. Hum. Factors 45, 408-423. (70) Parzen, E. (1962) On estimation of a probability density function and mode. Ann. Math. Stat. 33, 1065-1076. (71) Cacoullos, T. (1966) Estimation of a multivariate density. Ann. Inst. Stat. Math. 18, 179-189. (72) Vapnik, V. (1995) The Nature of Statistical Learning Theory, Springer, New York. (73) Baldi, P., Brunak, S., Chauvin, Y., Andersen, C., and Nielsen, H. (2000) Assessing the accuracy of prediction algorithms for classification: An overview. Bioinformatics 16, 412-424. (74) Matthews, B. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442-451. (75) Kohavi, R., and John, G. H. (1997) Wrappers for feature subset selection. Artif. Intell. 97, 273-324. (76) Sutter, J., and Kalivas, J. (1993) Comparison of forward selection, backward elimination, and generalized simulated annealing for variable selection. Microchem. J. 47, 60-66. (77) Jouan-Rimbaud, D., Massart, D., and de Noord, O. (1996) Random correlation in variable selection for multivariate calibration with a genetic algorithm. Chemometr. Intell. Lab 35, 213-220. (78) Manly, B. (1997) Randomization, Bootstrap and Monte Carlo Methods in Biology, 2nd ed., Chapman and Hall, London. (79) Hawkins, D. (2004) The problem of overfitting. J. Chem. Inf. Comput. Sci. 44, 1-12. (80) Gunatilleka, A., and Poole, C. (2000) Models for estimating the nonspecific toxicity of organic compounds in short-term bioassays.
Analyst 125, 127-132. (81) Mignon, P., Loverix, S., Steyaert, J., and Geerlings, P. (2005) Influence of the pi-pi interaction on the hydrogen bonding capacity of stacked DNA/RNA bases. Nucleic Acids Res. 33, 1779-1789. (82) Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. (2002) Choosing multiple parameters for support vector machines. Mach. Learn. 46, 131-159. (83) Ren, S. (2003) Ecotoxicity prediction using mechanism- and nonmechanism-based QSARs: A preliminary study. Chemosphere 53, 1053-1065. (84) Ren, S., and Schultz, T. (2002) Identifying the mechanism of aquatic toxicity of selected compounds by hydrophobicity and electrophilicity descriptors. Toxicol. Lett. 129, 151-160.