Ind. Eng. Chem. Res. 2007, 46, 4921-4929
4921
Classification of 1,4-Dihydropyridine Calcium Channel Antagonists Using the Hyperbox Approach Pinar Kahraman and Metin Turkay* College of Engineering and the Center for Computational Biology and Bioinformatics, Koc UniVersity, Rumelifeneri Yolu, Sariyer, Istanbul, 34450 Turkey
The early prediction of activity-related characteristics of drug candidates is an important problem in drug design. Activity levels of drug candidates are classified as low or high depending on their IC50 values. Because the experimental determination of IC50 values for a vast number of molecules is both time-consuming and expensive, computational approaches are employed. In this work, we present a novel approach to classify the activities of drug molecules. We use the hyperbox classification method in combination with partial leastsquares regression to determine the most relevant molecular descriptors of the drug molecules for an efficient classification. The effectiveness of the approach is illustrated on DHP derivatives. The results indicate that the proposed approach outperforms other approaches reported in the literature. 1. Introduction The early prediction of activity-related characteristics of drug candidates is an important problem in drug design. During drug development, a large ratio of the capital expended is spent on unsuccessful candidate drugs.1 Effects such as toxicity and adverse pharmacokinetic features are mostly discovered during clinical tests, thus wasting the time and resources expended in drug design and development. Therefore, eliminating molecules with undesired properties beforehand has been one of the central research subjects in structure-based drug design.2 Since the number of possible drug candidates is often on the order of millions, computerized methods are used for the prediction of activities. One approach is to study the chemical structures of the candidate molecules and to predict the activity levels of drug candidates based on these structures. As a result, the efficiency of the drug design process can be improved through elimination of unnecessary experiments. Data-driven methods are based on analyzing objects with different characteristics. These methods derive prediction models directly from experimental data.3 One of the most frequently used techniques is QSAR (quantitative structure-activity relationship) analysis, which is the quantitative effort to understand correlations between the chemical structure of a molecule and its biological and chemical activities such as biotransformation ability, reaction ability, solubility, and target activity.4 The main assumption in a QSAR analysis is that structurally similar molecules tend to have similar activities. On the basis of this principle, molecules with unknown properties can be compared to structures with known properties.5 QSAR analysis has been widely used in drug design, where the purpose is to recognize structures that selectively inhibit specific targets and that have low levels of toxicity. Target specificity is strongly correlated with the 3D structure of the molecule. The geometry of the binding site can be used to find many candidate molecules that will fit into this site, which can be constructed using a variety of methods.6-9 The problem of early prediction of the properties of drug candidates becomes a machine-learning problem when there are a number of structurally similar molecules of known activities that fit into * Corresponding author. E-mail:
[email protected]. Tel.: +90212-338-1586. Fax: +90-212-338-1548.
the binding site. Different machine-learning methods have been utilized for the classification of candidate drugs, mostly to predict their levels of toxicity. Activities of molecules are usually classified into two groups, namely, high-activity or low-activity, based on their toxicity levels. The reason for this binary classification is that the numerical values of biological activities are not available in most cases.10 In this article, we consider a subgroup of a class of drugs called calcium channel blockers that inhibit the Ca2+ flux into the cell. Calcium channel antagonists affect many excitable cells, such as heart muscle cells, vein muscle cells, and neuron cells. The special group of antagonists on which we focus in this work is the 1,4-dihydropyridine calcium channel antagonists, also called DHP derivatives. These antagonists are mostly used for treatment of cardiovascular diseases, such as hypertension and exertional angina.10 One of the first reports on the structural analysis of calcium channel blockers summarized a conformational analysis on a set of 45 2,6-dimethyl-3,5-dicarbomethoxy-4-X-phenyl-1,4dihydropyridine derivatives.11 That work grouped the drugs in terms of their biological activities and derived a correlation between the activity and the conformation of the two rings of the molecules by utilizing multiple linear regression (MLR). QSAR analysis of 1,4-dihydropyridine calcium channel antagonists was presented by Takahata et al.12 The main purpose of their study was to compare principle component analysis (PCA) with MLR. They applied PCA to the same data set as in the previous study, by dividing it into four interrelated sets and adding one additional set of selected drugs from the data. The variables were taken from Gaudio et al.,11 and the most important variables were calculated to determine the associated weight of each variable. Each group of molecules was then classified into low- or high-activity classes by using these weights in PCA, and the results indicated that PCA is a good method for use in QSAR analyses. Viswanadhan et al.13 developed a QSAR algorithm called PCANN. The algorithm suggests steps to be followed to predict drug behavior using the 45 DHP derivatives plus one additional molecule as the data set.13 In that study, PCA was used for descriptor selection, and a back-propagation neural network trained by cross-validation was used for behavior prediction. This algorithm was compared to MLR and a hologram QSAR
10.1021/ie0614327 CCC: $37.00 © 2007 American Chemical Society Published on Web 06/09/2007
4922
Ind. Eng. Chem. Res., Vol. 46, No. 14, 2007
model, and it was shown that this approach provides better prediction results. Another work by Takahata et al. compared QSAR results obtained from neural networks (NNs) with the results found by principal component analysis (PCA).10 Both classical descriptors and calculated descriptors were taken into account for comparison purposes. The same 45 DHP derivatives were classified into two groups once using classical descriptors and once using calculated descriptors. The accuracy results of two types of NN calculations called RECALL and LONE were compared to the results obtained by the PCA method. The drugs were also classified into three and four categories, to compare RECALL and LONE. As a result, this approach was shown to produce results comparable to those obtained with PCA. Moreover, the calculated descriptors were observed to be as efficient as the classical descriptors. Shamsipur et al. applied a genetic algorithm to regression analysis on a different data set of 1,4-DHP calcium channel antagonists.4 They conducted a QSAR study by building linear models of molecular descriptors and then selecting the most relevant ones. For this purpose, the MLR and partial leastsquares (PLS) methods were utilized and compared. A genetic algorithm was used in each analysis to construct many regression models and prune the ones that were genetically eliminated. PLS was selected as the method to be used in QSAR studies. A study by Schleifer et al. presented a comparison of three 3D QSAR methods for toxicity prediction.14 The prediction models built according to these three methods were based on probe-ligand interaction energies and were applied to the same data set. The methods, namely, CoMFA (comperative molecular field analysis), CoMSIA (comparative molecular similarity indices analysis), and GRID/GOLPE (generation of optimal linear PLS estimations), predicted the behavior of the drugs with good R2 values, revealing the important regions of the molecules and their effects on drug behavior. In another classification study, the same class of drugs was examined using the same data set.15 In that work, a sevendescriptor model was built, and the least-square support vector machines (LSSVM) method was used to both obtain the model and classify the molecules according to the model. It was shown that LSSVM is a promising tool for regression and classification. The results obtained from the two analyses were compared with those from a number of methods and were found to provide the best results for both correlation and classification. To our knowledge, the latest QSAR work on the same data set was published by Si et al. in 2006.16 They implemented gene expression programming (GEP) to build a model of molecular descriptors that described the log(1/IC50) values of the drugs. Descriptor calculations were done in the software CODESSA,17 and GEP was compared to the heuristic method embedded in CODESSA. The result was a satisfactory six-descriptor model, which indicates that GEP is a good tool for constructing QSAR/ QSPR models. One of the most important issues in data-driven methods is the selection, from among a huge number of properties, of those chemical features that most significantly affect drug behavior. Therefore, the methods of both feature generation and selection have major importance. Another central concern is the evaluation technique that predicts the activity of drugs on the basis of these features.3 This article presents a new methodology for the early prediction of drug behavior. We use a sequence of methods for characterizing the activity levels of drug candidates. For feature generation, we employ COSESSA,17 a program used widely in
Table 1. 45 DHP Antagonists and Their log(1/IC50) Values antagonist
X
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
3′-Br 2′-CF3 2′-Cl 3′-NO2 2′-CHdCH2 2′-NO2 2′-Me 2′-Et 2′-Br 2′-CN 3′-Cl 3′-F H 3′-CN 3′-I 2′-F 2′-I 2′-OCH3 3′-CF3 3′-CH3 2′-OC2H5 3′-OCH3
log (1/IC50) antagonist 8.89 8.82 8.66 8.40 8.35 8.29 8.22 8.19 8.12 7.80 7.80 7.68 7.68 7.46 7.38 7.37 7.33 7.24 7.13 6.96 6.96 6.72
23* 24* 25* 26* 27* 28* 29 30* 31* 32* 33* 34* 35* 36 37 38 39 40 41 42 43 44* 45*
X
log (1/IC50)
3′-NMe2 3′-OH 3′-NH2 3′-OAc 3′-OCOPh 2′-NH2 4′-F 4′-Br 4′-I 4′-NO2 4′-NMe2 4′-CN 4′-Cl 2′,6′-Cl2 F5 2′-F,6′-Cl 2′,3′-Cl2 2′-Cl,5′-NO2 3′,5′-Cl2 2′-OH,5′-NO2 2′,5′-Me2 2′,4′-Cl2 2′,4′,5-(OCH3)3
6.05 6.00 5.70 5.22 5.20 4.40 6.89 5.40 4.64 5.50 4.00 5.46 5.09 8.72 8.36 6.12 7.72 7.52 7.03 7.00 7.00 6.40 3.00
QSAR studies. Feature selection is performed by the PLS18 method combined with significance analyses. A novel classification method based on mixed-integer programming, the hyperbox method, is presented as the last step of the methodology. The hyperbox method is a very reliable classifier that can capture nonseparable data and minimize considerably misclassifications.19 We apply this approach to the data set taken from literature for comparison purposes. The data set consists of the 45 1,4-dihydropyridine calcium channel blockers presented in Table 1. After conducting regression analyses, selecting the relevant features, building the models, and making preliminary classification studies, we perform significance tests on the descriptors selected by the regression method to increase the classification accuracies. Replacing the less significant descriptors with more significant ones improves the classification results, which shows that a significance study can make a major contribution in a classification study for activity prediction purposes. We compared the results obtained by our approach to those available in the literature10,13-15 with respect to two criteria. The first is the results of the regression analyses. Following the selection of molecular descriptors, we conducted regression analyses for three models built with 7, 10, and 15 descriptors. The results of these analyses for the models were either comparable to or better than the regression results of the methods presented in the literature.13-15 The second criterion is the results of the activity level classification of drugs. After significance analyses, we conducted classification studies for the three models and found that the classification accuracy of the proposed approach is better than those of all of the classification methods available in the literature.10,13,15 2. Strategies, Models, and Methods In a QSAR analysis, the method used in each step has major importance for the success of the study. In addition to the classifier, the process used to determine the numeric molecular attributes (i.e., the descriptors) and the regression method that selects the most relevant ones among these contribute to the efficiency of the analysis. For descriptor calculations, this work uses the program CODESSA,17 which is reliable software that is widely used in QSAR analyses.
Ind. Eng. Chem. Res., Vol. 46, No. 14, 2007 4923
Figure 2. Illustration of the difference between regression and classification.
Figure 1. Summary of the analysis approach.
In this QSAR study, the approach consists of six steps, some of which are repeated in the case that the significance tests give unsatisfactory results. The approach that we employed in this study is summarized in Figure 1. The first step is to reveal the chemical structures of the molecules. For this purpose, molecular structures of drug candidates are constructed and then optimized by energy minimization using HyperChem.20 The optimized molecular structures are then processed in CODESSA17 to generate important descriptors for each molecule. CODESSA calculates molecular descriptors, which are parameters that describe the characterization of many aspects of the molecules. The program calculates eight different types of descriptors: constitutional, topological, geometrical, electrostatic, CPSA (charged partial surface area), MO- (molecular orbital-) related, quantum chemical, and thermodynamic. The objective of the next step is to determine a model that can describe the activity in terms of the descriptors. These properties affect the behavior of the drugs, but can have different degrees of contribution. To build the model and to choose the most relevant descriptors, different machine-learning methods have been used in the literature, as discussed in the Introduction. In this work, we used PLS, which is basically an MLR method closely related to principal component regression. PLS can predict a set of dependent variables Y based on a set of independent variables X. Y is represented as a linear combination of X that minimizes the distances of the actual Y values from the function values. PLS is especially efficient when the number of instances is much smaller than the number of descriptors. In such cases and in cases where the set of independent variables is very large compared to the set of dependent variables, PLS is capable of building efficient models. These are the reasons for us to pick this method for the purposes of this study. The method can also handle problems such as noise or missing data and multicollinearity between independent variables.18 We used the statistics software MINITAB21 for PLS runs, each of which provided a linear model of the dependent variable. The output consists of the coefficients of the independent variables and the residual that represents the variability of the real values, together building the regression model. The variables
that have coefficients of 0 are concluded to have no relationship with the independent variable. Nevertheless, variables that mostly contribute to the output value do not always have the largest positive (or smallest negative) coefficients, because the coefficients are dependent on their magnitude. Thus, standardized coefficients are considered for indicating the variables that describe the dependent variable the most. Once the model is built and the most relevant descriptors are identified, the next step is classification of the drugs according to the values of the descriptors. However, a regression coefficient, which tells the correlation between the corresponding variable and the dependent variable, might not explain the level of effectiveness of the descriptor in a classification analysis. This problem is explained through the following example. Figure 2 illustrates linear regression of some two-dimensional data. The slope of the black line is the coefficient of the independent variable. There are two clearly distinguishable classes, and the dashed line has a different slope than the regression line and separates these classes. This can also be true for higher dimensions. The objective that we pursue is to obtain a linear model that describes the dependent variable, which is the level of effectiveness of a drug in our case. Moreover, we want to build this model using a few descriptors from among many. Then, we aim to use this model as a reference point for our classification efforts. Descriptors for the linear model are chosen based on their standardized coefficients obtained from regression analysis; however, the hyperplane that separates two classes might have different coefficients. In this case, some descriptors other than the selected ones might be more effective in classifying the data. However, these should also contribute to the dependent variable, so we do not expect descriptors having small standardized coefficients to be significant in terms of classification. Then, as the number of preferred descriptors increases, the possibility that such descriptors are selected also increases. We describe our approach to address this problem at the end of this section. The classification of drugs was carried out on the basis of the selected descriptors using the hyperbox method, which is a mixed-integer-programming-based model.19 The hyperbox model encloses the multidimensional data in hyperboxes by solving an MILP problem. The model includes binary variables to indicate how many hyperboxes are used, which hyperbox is assigned to which class, and the assignments of data to the hyperboxes. The strength of the method comes from the ability to enclose the same class in multiple hyperboxes, which provides increased accuracy by minimizing overfitting. In this work, the method is used to classify the 45 DHP derivatives in low-activity and high-activity classes, once using the initial selection of
4924
Ind. Eng. Chem. Res., Vol. 46, No. 14, 2007
descriptors and then once utilizing the descriptors chosen after the significance analysis. Evaluation of the hyperbox model as a classifier is done by comparing the method with 51 different classification methods available in WEKA,22 a well-known data-mining tool. The problem of overestimating the effect of descriptors on classification mentioned above is addressed by making significance tests for the selected descriptors after the preliminary classification runs. The procedure followed in significance testing is as follows: Let Z be the set of drugs under study. After a binary classification, Z is separated into two sets: A and B. If classification is successful, the variances of descriptor values are expected to be smaller for the two classes A and B than the variance is for the whole population, Z. Let D be the set of relevant descriptors selected for classification. The sample variance of values for descriptor i (element) for drug set j is represented by Sij2. Then, the following distribution is obtained
Sij2/σi2 2
Sk /σi
2
) Sij2/Sik2 ) fνη
(1)
The distribution given in eq 1 exhibits the F distribution with degrees of freedom ν ) n - 1 and η ) m - 1, where n is the number of values descriptor i takes in drug set j and m is the number of values descriptor i takes in the drug set k. Our null hypothesis is that the variance of the whole set of drugs is equal to the variance of the subset of drugs that constitutes a class. We expect the former to be larger than the latter, which becomes the alternative hypothesis. Analytically, the null hypothesis is Sij2 ) Sik2, and we test for Ha ) Sij2 > Sik2, where j represents the whole data set and k is one of the classes. We expect the p value of fνη to be smaller than 0.1 to accept the alternative hypothesis. More than one model was constructed through regression by selecting different numbers of descriptors as representative variables for each model. For each set consisting of a different number of descriptors, corresponding significance levels of descriptor variances were calculated. Models having fewer descriptors were analyzed to determine whether any descriptors were insignificant. Such descriptors were then replaced by the ones that are significant in the model that have the highest number of descriptors. This adjustment was successful and increased the accuracy of the classification process. 3. Application to DHP Derivatives The data classification approach presented in the previous section was applied on 45 variants of 1,4-dihydropyridine calcium channel antagonists (DHP). We present the illustration of the approach and the results in this section. 3.1. Data Set. The quantitative structure-activity relationship study in this work was applied to 45 drug molecules that were constructed from a template molecule.15 The template of the molecules is shown in Figure 3. The data set was constructed by attaching various fragments to the upper ring X of the template as indicated in Table 1. This data set consisting of 45 DHP derivatives was taken from the literature. In addition to the fragments, experimental values for log(1/IC50) are also presented in the table. IC50 corresponds to the concentration of an inhibitor necessary for 50% inhibition of the target in vitro. This quantity is used as a measure of drug effectiveness. The lower the effectiveness of the drug, the larger the IC50 value, whereas the opposite is true for the log(1/IC50) values.23 Drugs having log(1/IC50) values lower than 6.72 are
Figure 3. Template of 1,4-dihydropyridine calcium channel antagonists.
classified as having low-activity, indicated by asterisks in the table, and the others as having high-activity,15 according to the cutoff value observed in laboratory runs and used in the literature.10,13-15 The molecular structures were generated and optimized in HyperChem, and the molecular descriptors were calculated in CODESSA. Consequently, values of 172 molecular descriptors for each of the 45 drugs were obtained. Before the regression study, 45 of the 172 descriptors were eliminated, because they had the same value for all of the molecules, reducing the number of descriptors to 127. Because the number of instances was much smaller than the number of attributes, PLS was selected for regression. MINITAB was used for PLS runs. 3.2. Results. The aim of the regression analysis is to have a model with a small number of attributes that explains the -log(IC50) values. Later, these relevant attributes can be used as starting descriptors for the classification of the drug molecules. Three models were formed as an output of partial least-squares analysis: 7-, 10-, and 15-attribute models. The reason for constructing several models was to increase the accuracy by having different descriptors and models at hand, allowing us to replace insignificant descriptors of the 7- and 10-attribute models with the significant ones selected from the 15-attribute model. The regression study was handled by the program MINITAB, which does not provide the linear model and the selected components explicitly. Rather, every parameter such as the loadings, the fits, and the coefficients are listed as outputs. Thus, the relevant variables were chosen according to the absolute values of the standardized coefficients, as described in the Strategies, Models, and Methods section. The most relevant descriptors for the 7-, 10-, and 15-attribute models are listed with their absolute standardized coefficients and incremental contributions to the R2 value in Table 2. Models having fewer than seven variables had poor R2 values and thus were not considered as descriptive models. After the regression analysis had been applied and the descriptors selected, the hyperbox data classification method was solved for the binary classification of drugs as high- or low-activity. In this step, 66% of the data, i.e., 29 instances, were randomly selected and included in the training set, and the remaining 16 instances were included in the test set. Table 3 reports the results of the initial classification runs. The 10-attribute model reached 100% accuracy with the hyperbox method. The 7-attribute model, however, needed to be improved. In the corresponding classification, the hyperbox method placed one molecule in both classes, which is the reason for the “half placements” observed in Table 3. It can be deduced from the results that, as the number of descriptors used increases, the accuracy of the classification process increases. This is
Ind. Eng. Chem. Res., Vol. 46, No. 14, 2007 4925 Table 2. Descriptors Used in the Initial Classification Runs descriptor 7-Attribute Model moment of inertia c zx shadow/zx rectangle yz shadow moment of inertia b relative number of double bonds min partial charge (Qmin) xy shadow/xy rectangle
abs std coeff
R2
0.200 06 0.190 382 0.160 966 0.141 894 0.129 344 0.109 508 0.108 719
0.3208 0.5327 0.7060 0.7887 0.8222 0.8436 0.8545
0.362 224 0.237 551 0.222 727 0.194 466 0.191 894 0.178 006 0.174 063 0.155 643 0.146 396 0.138 678
0.3208 0.5327 0.7060 0.7887 0.8222 0.8437 0.8545 0.8663 0.8832 0.8997
0.690 157 0.328 803 0.276 340 0.271 366 0.262 456 0.244 745 0.238 495 0.235 894 0.223 088 0.194 655 0.169 296 0.164 614 0.163 997 0.163 665 0.150 060
0.3208 0.5327 0.7060 0.7887 0.8222 0.8437 0.8545 0.8663 0.8832 0.8997 0.9052 0.9126 0.9212 0.9273 0.9319
class
10-Attribute Model yz shadow min partial charge (Qmin) relative number of rings gravitation index (all bonds) moment of inertia C avg information content (first order) zx shadow/zx rectangle moment of inertia B xy shadow/xy rectangle avg structural information (first order) 15-Attribute Model yz shadow relative number of rings relative number of double bonds min partial charge (Qmin) gravitation index (all bonds) topographic electronic index (first order) xy shadow/xy rectangle moment of inertia C avg information content (first order) Kier and Hall index (third order) relative number of O atoms avg structural information (first order) number of O atoms zx shadow RPCS (relative positive charges)
Table 3. Confusion Matrix for the Initial Results with the Hyperbox Classification Method 7-attribute model high low accuracy
10-attribute model
high
low
8 3.5 71.875%
1 3.5
Table 4. Significance Test Results for the Initial Classification Run for Seven Descriptors
high low accuracy
high
low
9 0 100%
0 7
expected, because more of the dependent variable is explained as the number of descriptors increases. The result of the first classification run with seven descriptors is relatively low, indicating the possible existence of some descriptors with low significance levels in terms of classifying the drugs in lowactivity and high-activity levels. After these preliminary classification runs, a significance test was conducted on the class variances as explained in section 2. Corresponding p values of the 7-descriptor model are provided in Table 4, where a p value below a certain R value means that the null hypothesis (that the variance of the corresponding descriptor values for all 45 drugs is equal to the variance of the values for the indicated class) can be rejected with 1 - R confidence against the alternative hypothesis that the variance of the whole set is larger than the variance of the classified set. It can be seen that different descriptors are significant for different classes. “Moment of inertia c” and “xy shadow/xy rectangle” have p values smaller than 0.2 for the low-activity class, which means that the molecules in the low-activity class have very similar values for these descriptors. For “minimum partial charge”, the p value for the high-activity class is significantly low, which indicates that this descriptor is significant for the high-activity class. Nevertheless, the p values for “relative number of double bonds” and “zx shadow/zx rectangle” are quite
sample var
p value
all data high-activity class low-activity class
Moment of Inertia c 1.864 33 × 10-12 2.453 64 × 10-12 2.141 77 × 10-13
0.695 53 0.024 75
all data high-activity class low-activity class
zx Shadow/zx Rectangle 4.780 54 × 10-09 5.430 46 × 10-9 4.977 06 × 10-9
0.599 65 0.581 32
all data high-activity class low-activity class all data high-activity class low-activity class
yz Shadow 3.317 75 × 10-5 1.706 21 × 10-5 7.548 47 × 10-5 Moment of Inertia b 3.069 89 × 10-12 3.663 41 × 10-12 1.487 66 × 10-12
Relative Number of Double Bonds all data 7.137 36 × 10-10 high-activity class 6.566 76 × 10-10 low-activity class 1.429 46 × 10-9
0.134 79 0.890 63
0.632 73 0.253 11
0.453 77 0.854 19
all data high-activity class low-activity class
Minimum Partial Charge 3.458 4 × 10-12 1.729 2 × 10-13 1.037 52 × 10-11
0.000 01 0.947 24
all data high-activity class low-activity class
xy Shadow/xy Rectangle 3.796 68 × 10-9 4.221 08 × 10-9 1.450 06 × 10-9
0.584 99 0.181 89
large for both the low- and high-activity classes. To investigate whether this situation affects the classification accuracy, significance tests of the other two classification runs were also made. Tables 5 and 6 report the significance results of these runs. It can be observed from Tables 4-6 that, as the number of descriptors used in classification increases, either the significance for the high-activity class, the significance for the low-activity class, or both improve, i.e., the corresponding p values decrease, for the descriptors still surviving in the larger models, thus increasing the accuracy of the classification analysis. Significance values of the 15-attribute model were calculated to replace the less significant descriptors of the 7-descriptor model. It can be seen in Table 4 that, in the 7-descriptor case, the p values for “zx shadow/zx rectangle” remain above 0.5 for both classes, the p value of the low-activity class for “relative number of double bonds” is considerably high, and the p value of the high-activity class for the same descriptor remains above 0.4, which means that the significance of these two descriptors is low in terms of contributing to the classification process. These two attributes will be replaced by two low-p-valued descriptors selected from Table 6. In light of these results, the effect of exchanging the less meaningful descriptors for significant ones on the success of classification was studied. For the next runs of classification with the participation of seven descriptors, the two less significant descriptors were replaced by the two descriptors selected from the 15-attribute model, namely, “zx shadow” and “RPCS”, which took respective p values of 0.08732 (for the high-activity class) and 0.10254 (for the low-activity class). The descriptor with the minimum p value, “minimum partial charge”, seen in Table 6 was already included in the 7-attribute model. The descriptors used for the second round of classification runs for the 7-attribute model are presented in Table 7. The hyperbox model was once again implemented on the 45 molecules. Again, 29 drugs were selected randomly and taken as the training set, and the test set was the remaining
4926
Ind. Eng. Chem. Res., Vol. 46, No. 14, 2007
Table 5. Significance Test Results for the Initial Classification Run for 10 Descriptors class all data high-activity class low-activity class
sample var Moment of Inertia c 1.864 33 × 10-12 1.165 97 × 10-12 1.351 35 × 10-12
all data high-activity class low-activity class
zx Shadow/zx Rectangle 4.780 54 × 10-9 6.017 71 × 10-9 3.426 24 × 10-9
all data high-activity class low-activity class
yz Shadow 3.317 75 × 10-5 2.110 57 × 10-5 5.472 44 × 10-5
all data high-activity class low-activity class
Moment of Inertia b 3.069 89 × 10-12 1.421 73 × 10-12 4.342 83 × 10-12
Gravitation Index (All Bonds) all data 52 852.916 67 high-activity class 76 877.777 78 low-activity class 26 295.238 1 all data high-activity class low-activity class
Minimum Partial Charge 3.446 5 × 10-12 2.177 78 × 10-13 7.405 71 × 10-12
all data high-activity class low-activity class
xy Shadow/xy Rectangle 3.796 68 × 10-9 3.994 92 × 10-9 1.932 39 × 10-9
Avg Information Content (First Order) all data 0.020 193 333 high-activity class 0.026 119 444 low-activity class 0.012 628 571 Avg Structural Information (First Order) all data 0.001 135 933 high-activity class 0.001 102 111 low-activity class 0.001 158 476 all data high-activity class low-activity class
Relative Number of Rings 2.249 39 × 10-11 1.120 04 × 10-11 3.790 41 × 10-11
p value
Table 6. Significance Test Results for the Initial Classification Run of Values for 15 Descriptors class
sample var Moment of Inertia c 1.122 16 × 10-13 7.023 39 × 10-14 8.131 17 × 10-14
p value
0.255 35 0.363 48
all data high-activity class low-activity class
0.666 72 0.357 75
all data high-activity class low-activity class
zx Shadow 2.724 83 × 10-6 1.046 85 × 10-6 5.030 54 × 10-6
0.087 32 0.842 96
0.263 58 0.798 29
all data high-activity class low-activity class
yz Shadow 1.996 08 × 10-6 1.269 80 × 10-6 3.292 42 × 10-6
0.263 58 0.798 29
0.136 63 0.727 43
all data high-activity class low-activity class
0.746 96 0.199 43
Relative Number of Double Bonds all data 4.293 94 × 10-11 high-activity class 4.762 03 × 10-11 low-activity class 4.172 39 × 10-11
0.000 25 0.892 43
all data high-activity class low-activity class
Minimum Partial Charge 2.081 29 × 10-13 1.384 54 × 10-14 4.460 41 × 10-13
0.000 31 0.891 67
0.557 3 0.207 62
all data high-activity class low-activity class
xy Shadow/xy Rectangle 2.284 24 × 10-10 2.403 50 × 10-10 1.162 61 × 10-10
0.557 30 0.207 63
0.682 5 0.292 18
all data high-activity class low-activity class
Number of O Atoms 1.643 77 × 10-7 1.595 22 × 10-7 1.961 83 × 10-7
0.506 50 0.638 43
0.506 35 0.550 53
Gravitation Index (All Bonds) all data 0.006 598 345 high-activity class 0.009 597 693 low-activity class 0.003 282 790
0.160 64 0.807 27
all data high-activity class low-activity class
molecules.16 The accuracy, false positives, false negatives, and true classifications of the final classification run can be seen in Table 8. From these values, it can be deduced that such a significance analysis and an adjustment in descriptor selection pays off in terms of higher accuracy levels. 3.3. Comparisons of the Results. We used a well-known data-mining tool, WEKA,22 to compare our results with those obtained using available classification methods. In this section, we provide a brief overview of some of the data classification methods available in WEKA. A Bayesian network24 is a directed acyclic graph, with nodes representing the variables and with probabilities attached to the edges. A Bayesian network is specified by the training run and is then used to perform inference maximizing the likelihood. A naive Bayes classifier25 is based on an application of Bayes’ theorem and the naive assumption that the variables are independent of each other. The classifier is trained and parameter estimation of the probability model, i.e., the independent features probability model, is performed using the maximum likelihood method. Naive Bayes simple and naive Bayes updatable26 are modified versions of the naive Bayes method. A logistic classifier27 builds and uses a two-class logistic regression model. As a statistical regression model, logistic regression assumes that the logarithm of the likelihood ratio of class distributions is linear in the observations. The model is a generalized model
Relative Number of O Atoms 6.358 61 × 10-11 7.843 48 × 10-11 5.428 29 × 10-11
Relative Number of Rings 1.353 71 × 10-12 6.739 92 × 10-13 2.281 21 × 10-12
Avg Information Content (First Order) all data 2.521 00 × 10-9 high-activity class 3.260 83 × 10-9 low-activity class 1.576 60 × 10-9 all data high-activity class low-activity class
RPCS 2.161 52 × 10-8 3.307 26 × 10-8 7.619 59 × 10-9
Kier and Hall Index (Yhird Order) all data 6.583 40 × 10-9 high-activity class 6.120 45 × 10-9 low-activity class 8.255 09 × 10-9
0.255 70 0.363 31
0.654 76 0.450 61
0.590 09 0.523 15
0.746 96 0.199 43
0.160 61 0.807 28
0.682 50 0.292 19
0.772 58 0.102 54
0.479 61 0.665 17
Avg Structural Information Content (First Order) all data 1.418 14 × 10-10 high-activity class 1.375 90 × 10-10 0.506 34 low-activity class 1.446 29 × 10-10 0.550 53 Topographic Electronic Index (First Order) all data 6.238 83 × 10-9 high-activity class 4.815 14 × 10-9 0.367 04 0.742 75 low-activity class 9.107 02 × 10-9
of the ordinary least-squares regression model. A multilayer perceptron28 is a network of perceptrons, i.e., simple processing elements. The perceptron computes a single output, explained as a nonlinear activation function of a linear combination of multiple inputs, whose parameters are learned through the training phase. SMO (sequential minimal optimization)29 is a method to train a support vector classifier using polynomial
Ind. Eng. Chem. Res., Vol. 46, No. 14, 2007 4927 Table 7. Descriptors Used for the Second Round of Classification Runs descriptor moment of inertia c zx shadow yz shadow moment of inertia b RPCS min partial charge (Qmin) xy shadow/xy rectangle
method MLR13
Table 8. Final Classification Results of the Hyperbox Method (7-Attribute Model)
high low accuracy
Table 10. Comparison of PLS Regression with Other Methods
high
low
9 0 100%
0 7
Table 9. Accuracy Results of the Classification Runs Made with the Software WEKA accuracy (%) classification method
7-attribute
10-attribute
Bayes network naive Bayes naive Bayes simple naive Bayes updatable logistic multilayer perceptron SMO K-star LWR logit boost multiclass classifier threshold selector decision stump oneR
62.5 43.75 56.25 43.75 68.75 50 62.5 62.5 56.25 56.25 68.75 37.5 62.5 43.75
56.25 50 56.25 50 68.75 62.5 62.5 75 68.75 68.75 68.75 37.5 75 75
kernels by breaking a large quadratic programming (QP) optimization problem into smaller QP optimization problems. K-star30 and LWR (locally weighted regression)31 are instancebased classifiers. In contrast to neural networks or decision trees, which use training to build global representations of their target functions, instance-based learning constructs query-specific local models, fitting the training instances in a neighborhood around the query point. K-star is an improved version of the k nearest neighborhoods method, and LWR assigns instance-based weights to be used in linear regression. Another method called logit boost (additive logistic regression)32 is for boosting any of the classifiers that manage weighted data. It uses the logistic regression system for the learning process. A multiclass classifier33 can handle multiclass data through two-class distribution classifiers. A threshold-based classifier26 puts an upper limit on a probability output through a distribution classifier such that the misclassification error is minimized. A decision stump34 classifies the data according to a threshold value obtained by maximizing a likelihood function and usually utilizes a boosting algorithm. OneR (one rule)26 is a very simple but accurate method. It builds a one-level decision tree, learns a rule from each attribute, and selects the rule having the smallest error rate as the one rule. The results obtained from the hyperbox method were compared to the results of all of the classification methods available in the software WEKA. In all of these runs, the data set was again separated such that a randomly selected 66% of the data were treated as the training set, whereas the remaining data were used as the test set. Table 9 shows the accuracy results for the most frequently used classification methods in the literature as well as the classifiers that reported the best results. The other methods not presented in the table provided very poor
PCANN13 CoMFA14 CoMSIA14 GRID/GOLPE14 HM15 LSSVM15 PLS PLS PLS
number of variables
R2
8 8 8 8 8 7 7 7 10 15
0.550 0.730 0.872 0.908 0.821 0.870 0.870 0.854 0.899 0.932
Table 11. Comparison of Hyperbox Classification with Reference Methods method
accuracy (%)
PCA10 BPNN13 LDA15 LSSVM15 hyperbox, 7-attribute hyperbox, 10-attribute
82.2 88.9 86.7 91.1 100.0 100.0
accuracy levels. The attributes used in these methods are the same descriptors that were obtained by the significance tests and used in the final classification runs of the hyperbox model. Default values for tuning parameters of the classification methods were used in the WEKA runs. Classification results reached via some of these methods reported in the literature,13-15 however, are the results obtained by methods used with their optimally tuned parameters. A comparison of the results with these methods can be seen in Table 11. As shown in Table 9, the hyperbox method outperforms all of the classification methods available in WEKA. The results with the best accuracy are generally obtained from instancebased classifiers and the ones that utilize distribution classifiers. The best accuracy for the 7-attribute classification is 68.75%, whereas the hyperbox method has an accuracy of 100%. Similarly, the best accuracy for the 10-attribute classification is 75%, whereas the hyperbox method has 100% accuracy. The methodology developed in this article was also compared with the latest QSAR studies conducted using the same 45 DHP derivatives as the data set.10,13-15 Table 10 summarizes the regression results, which indicate the descriptive abilities of the linear models that explain the drug behavior. Classification results are compared in Table 11. We considered the R2 values of the regression analyses in evaluating the linear model built for describing the effectiveness of the drugs. It must be noted that regression studies are employed in the preliminary phase of the methodology to select the descriptive variables among the 127 descriptors and to build a model. Therefore, the results that present the prediction ability of our method are not the regression results, but the accuracy levels of our classification runs. In Table 10, the R2 values indicate that the best regression result is obtained for the 15descriptor model by PLS. However, this high rate is expected because the number of variables in this model is greater than the numbers in the other models used. The 7-descriptor model seems to be competitive with the other methods, performing better than three and slightly worse than four of the compared regression models (The values for the selected seven descriptors for each molecule are given in Table 12). To evaluate the prediction efforts, i.e., the proposed methodology combining regression, significance testing, and the hyperbox method, we compared the classification accuracy levels we obtained and the accuracy levels obtained in reference
4928
Ind. Eng. Chem. Res., Vol. 46, No. 14, 2007
Table 12. Values of the Seven Variables Constituting the Final Linear Model descriptors molecule
moment of inertia c
zx shadow/ zx rectangle
yz shadow
moment of inertia b
relative number of double bonds
min partial charge (Qmin)
xy shadow/ xy rectangle
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
0.004 78 0.006 57 0.007 10 0.005 54 0.007 05 0.006 66 0.007 11 0.007 05 0.007 05 0.007 11 0.005 93 0.006 58 0.007 23 0.006 16 0.004 00 0.007 15 0.006 67 0.007 13 0.004 85 0.006 72 0.007 10 0.005 84 0.005 52 0.006 66 0.006 68 0.005 22 0.002 71 0.007 10 0.006 02 0.003 91 0.003 17 0.004 70 0.005 29 0.005 40 0.005 15 0.006 76 0.005 24 0.006 93 0.005 81 0.005 05 0.005 25 0.005 12 0.006 32 0.005 11 0.004 63
0.629 0.573 0.634 0.646 0.606 0.600 0.614 0.607 0.622 0.604 0.617 0.606 0.638 0.594 0.604 0.649 0.535 0.588 0.640 0.640 0.562 0.628 0.634 0.617 0.633 0.577 0.598 0.618 0.670 0.701 0.696 0.749 0.714 0.696 0.714 0.591 0.746 0.661 0.638 0.656 0.759 0.675 0.650 0.701 0.614
65.6 66.0 61.4 64.3 64.2 64.0 61.8 63.7 62.5 62.7 60.9 60.2 57.9 64.9 67.1 59.1 64.0 65.3 65.8 62.6 69.1 64.2 68.3 60.9 62.4 60.8 67.2 60.9 57.8 62.0 62.4 60.6 60.7 59.3 59.1 65.2 64.5 57.3 63.3 67.0 65.9 64.8 63.6 62.7 74.5
0.006 34 0.007 33 0.008 46 0.007 50 0.008 47 0.008 07 0.008 92 0.008 46 0.007 55 0.008 56 0.007 89 0.008 63 0.009 25 0.008 43 0.005 04 0.008 88 0.007 09 0.008 24 0.006 44 0.008 71 0.007 45 0.008 09 0.007 48 0.008 73 0.008 70 0.006 59 0.003 18 0.009 15 0.008 67 0.005 19 0.003 98 0.006 57 0.007 71 0.008 03 0.007 51 0.007 58 0.006 47 0.008 48 0.007 66 0.006 62 0.006 23 0.007 00 0.008 51 0.007 02 0.005 69
0.0488 0.0455 0.0488 0.0930 0.0667 0.0930 0.0455 0.0667 0.0488 0.0476 0.0488 0.0488 0.0488 0.0476 0.0488 0.0488 0.0488 0.0444 0.0455 0.0455 0.0652 0.0444 0.0408 0.0476 0.0465 0.0455 0.0893 0.0465 0.0488 0.0488 0.0488 0.0930 0.0435 0.0476 0.0488 0.0488 0.0488 0.0488 0.0488 0.0930 0.0488 0.0909 0.0426 0.0488 0.0377
-0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.181 -0.173 -0.173 -0.175 -0.181 -0.173 -0.173 -0.173 -0.173 -0.199 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.172 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.173 -0.178
0.564 0.589 0.530 0.621 0.539 0.569 0.546 0.548 0.548 0.528 0.646 0.572 0.550 0.554 0.549 0.540 0.586 0.542 0.583 0.590 0.542 0.622 0.641 0.574 0.579 0.562 0.468 0.523 0.648 0.517 0.507 0.546 0.575 0.574 0.565 0.538 0.592 0.567 0.659 0.522 0.623 0.541 0.593 0.542 0.500
works. Although the best regression model is the 15-descriptor model, it need not be used in classification, because even the 7-descriptor model provides an accuracy of 100%, although the corresponding regression result is relatively low. Table 11 indicates that this classification result is the best result, and it seems that the hyperbox model is currently the most accurate method for classifying this particular data set into high- and low-activity classes.
to all of the classifiers available in WEKA and the results reported in the literature.
4. Conclusions
(1) Gaviraghi, G.; Barnaby, R. J.; Pellegatti, M. Pharmacokinetic Challenges in Lead Optimization. In Pharmacokinetic Optimization in Drug Research: Biological, Physicochemical, and Computational Strategies; Testa, B., van de Waterbeemd, H., Folkers, G., Guy, R., Eds.; Wiley-Verlag Helvetica Chimica Acta: Zu¨rich, Switzerland, 2001; pp 3-14. (2) Bugrim, A.; Nikolskaya, T.; Nikolsky, Y. Early prediction of drug metabolism and toxicity: Systems biology approach and modeling. Drug DiscoV. Today 2004, 9 (3), 127-135. (3) Helma, C. In silico predictive toxicology: The state of the art and strategies to predict human health effects. Curr. Opin. Drug DiscoV. Des. 2005, 8, 27-31. (4) Hemmateenejad, B.; Miri, R.; Akhond, M.; Shamsipur, M. QSAR study of the calcium channel antagonist activity of some recently synthesized dihydropyridine derivatives. An application of genetic algorithm for variable selection in MLR and PLS methods. Chemom. Intell. Lab. Syst. 2002, 64, 91-99.
In this article, a novel approach for the early prediction of drug behavior is presented. The molecules chosen for this study were 45 calcium channel antagonists from the literature that have been studied widely in prediction analyses. The steps that constitute the method are compared to those that are available in the literature using the same data set. It is seen that, with the presented approach, a 7-attribute model is sufficient to reach 100% accuracy in classifying the data set into high-activity and low-activity, and the proposed sequence of methods seems to provide the best results among the studies that have been published so far. Moreover, a novel classifier, the MILP-based hyperbox method, was found to be highly accurate and superior
Acknowledgment The authors thank Fadime Uney Yuksektepe for helping in the implementation of the hyperbox method. Literature Cited
Ind. Eng. Chem. Res., Vol. 46, No. 14, 2007 4929 (5) Ghose, A. K.; Crippen, G. M. Atomic physicochemical parameters for three dimensional structure directed quantitative structure activity relationships. 2. Modeling dispersive and hydrophobic interactions. J. Chem. Inf. Comput. Sci. 1987, 27, 21-35. (6) Anderson, A. C. The process of structure-based drug design. Chem. Biol. 2003, 10, 787-797. (7) Blundel, T. L. Structure-based drug design. Nature 1996, 384, 2326. (8) Henry, C. M. Structure-based drug design. Chem. Eng. News 2001, 79 (23), 69-74. (9) Tu¨rkay, M. Structural synthesis of small molecule drug candidates. In Proceedings of FOCAPD 2004; Floudas, C. A., Agrawal, R., Eds.; CACHE Corporation, Austin, TX, 2004; pp 395-398. (10) Takahata, Y.; Costa, M. C. A.; Gaudio, A. C. Comparison between neural networks (NN) and principle component analysis (PCA): Structureactivity relationships of 1,4-dihydropyridine calcium channel antagonists (nifedipine analogues). J. Chem. Inf. Comput. Sci. 2003, 43, 540-544. (11) Gaudio, A. C.; Korolkovas, A.; Takahata, Y. Conformational analysis of the 1,4-dihydropyridines linking the structural aspects to the biological binding event: A study of the receptor-site conformation. J. Mol. Struct. (THEOCHEM) 1994, 303, 255-263. (12) Costaa, M. C. A.; Gaudio, A. C.; Takahata, Y. A comparative study of principal component and linear multiple regression analysis in SAR and QSAR applied to 1,4-dihydropyridine calcium channel antagonists (nifedipine analogues). J. Mol. Struct. (THEOCHEM) 1997, 394, 291-300. (13) Viswanadhan, V. N.; Mueller, G. A.; Basak, S. C.; Weinstein, J. N. Comparison of a neural net based QSAR algorithm (PCANN) with hologram and multiple linear regression-based QSAR approaches: Application to 1,4-dihydropyridine-based calcium channel antagonists. J. Chem. Inf. Comput. Sci. 2001, 41, 505-511. (14) Schleifer, K.-J.; Tot, E. CoMFA, CoMSIA and GRID/GOLPE studies on calcium entry blocking 1,4-dihydropyridines. Quant. Struct.Act. Relat. 2002, 21, 239-248. (15) Yao, X.; Liu, H.; Zhang, R.; Liu, M.; Hu, Z.; Panaye, A.; Doucet, J. P.; Fan, B. QSAR and classification study of 1,4-dihydropyridine calcium channel antagonists based on least squares support vector machines. Mol. Pharm. 2005, 2 (5), 348-356. (16) Si, H. Z.; Wang, T.; Zhang, K. J.; Hu, Z. D.; Fan, B. QSAR study of 1,4-dihydropyridine calcium channel antagonists based on gene expression programming. Bioorg. Med. Chem. 2006, 14, 4834-4841. (17) Katritzky, A. R.; Lobanov, V. S.; Karelson, M. ComprehensiVe Descriptors for Structural and Statistical Analysis, Reference Manual, versions 2.0 and 2.13; University of Florida: Gainsville, FL, 1997. (18) Garthwaite, P. H. An interpretation of partial least squares. J. Am. Stat. Assoc. 1994, 89 (425), 122-127. (19) U ¨ ney, F.; Tu¨rkay, M. A mixed-integer programming approach to multiclass data classification problem. Eur. J. Oper. Res. 2006, 173 (3), 910-920. (20) HyperChem. 7.5, Hypercube, 2003.
(21) MINITAB Statistical Software, release 14 for Windows; MINITAB Inc.: State College, PA, 2003. (22) WEKA 3: Data Mining Software in JaVa; The University of Waikato: Hamilton, New Zealand, 2005. (23) Patankar, S. J.; Jurs, P. C. Prediction of IC50 values for ACAT inhibitors from molecular structure. J. Chem. Inf. Comput. Sci. 2000, 40, 706-723. (24) Heckerman, D. A Tutorial on Learning with Bayesian Network; Technical Report; Microsoft Research Advanced Technology Division, Microsoft Corporation: Redmond, WA, 1996. (25) Rish, I. An Empirical Study of the NaiVe Bayes Classifier; IBM Research Report; IBM Research Division, Thomas J. Watson Research Center: Yorktown Heights, NY, 2001. (26) Witten, I. H.; Frank, E. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed.; Morgan Kaufmann: San Francisco, 2005. (27) Scott, A. J.; Symons, M. J. Clustering methods based on likelihood ratio criteria. Biometrics 1991, 27, 387-397. (28) Widrow, B.; Lehr, M. A. 30 years of adaptive neural networks: Perceptron, madaline and backpropagation. Proc. IEEE 1990, 78 (9), 14151442. (29) Platt, J. Fast Training of Support Vector Machines using Sequential Minimal Optimization. In AdVances in Kernel MethodssSupport Vector Learning; Scho¨lkopf, B., Burges, C., Smola, A., Eds.; MIT Press: Cambridge, MA, 1999; pp 185-208. (30) John, G. C.; Leonard, E. T. K*: An instance-based learner using an entropic distance measure. In Proceedings of the 12th International Conference on Machine Learning; Morgan Kauffman: San Francisco, 2001; pp 108-114. (31) Cleveland, W. S.; Delvin, S. J. Locally weighted regression: An approach to regression analysis by local fitting. J. Am. Stat. Assoc. 1988, 83 (403), 596-610. (32) Friedman, J.; Hestie, T.; Tibshirani, R. Additive logistic regression: A statistical view of boosting. Ann. Stat. 2000, 28 (2), 337, 407. (33) Tax, D. M. J.; Duin, R. P. W. Using two-class classifiers for multiclass classification. In 16th International Conference on Pattern Recognition (ICPR’02); International Association for Pattern Recognition (IAPR): Durham, NC, 2002; Vol. 2, p 20124. (34) Qu, Y.; Adam, B. L.; Yasui, Y.; Ward, M. D.; Cazares, L. H.; Schellhammer, P. F.; Semmes, O. J. Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clin. Chem. 2002, 48, 1835-1843.
ReceiVed for reView November 8, 2006 ReVised manuscript receiVed February 27, 2007 Accepted May 4, 2007 IE0614327