Ind. Eng. Chem. Res. 2009, 48, 9708–9712
Quantitative Structure-Property Relationship (QSPR) Prediction of Liquid Viscosities of Pure Organic Compounds Employing Random Forest Regression

Remya Rajappan,† Prashant D. Shingade,‡ Ramanathan Natarajan,†,§ and Valadi K. Jayaraman*,‡

†Centre for Mathematical Sciences Pala Campus, Arunapuram, Kerala, India 686 574; ‡Chemical Engineering and Process Development Division, National Chemical Laboratory, Pune, India 411 008; §Department of Chemical Engineering, Lakehead University, 955 Oliver Road, Thunder Bay, ON, Canada P7B 5E1

*To whom correspondence should be addressed.
A quantitative structure-property relationship (QSPR) approach was used to develop a predictive model for the viscosities of pure organic liquids, using a set of 403 compounds that belong to diverse classes of organic chemicals. A pool of 116 descriptors that encode topostructural, topochemical, electrotopological, geometrical, and quantum chemical properties of the organic compounds was used to develop QSPR models based on the robust Random Forest (RF) regression algorithm. The performance of the algorithm, in terms of correlation coefficients and mean-square errors, was determined to be good. The capability of the algorithm to build models and select the most-informative features simultaneously is very useful for several quantitative structure-activity/property relationship tasks. The eight most-dominant features selected by the RF regression algorithm primarily comprised predictors that encode the characteristics of atoms and groups that form hydrogen bonds, as well as factors involving molecular shape and size.

1. Introduction

Liquid viscosity is one of the most important transport properties of organic liquids; it is therefore a property of interest to industries such as oil refining and paint manufacturing, and to other industries that involve separations using solvent extraction. The prediction of liquid viscosities is also important for environmental chemists, with regard to handling oil spills and studying the dispersion of organic waste discharged into water sources, because viscosity governs liquid flow and the miscibility of phases.

There have been several studies on the viscosity prediction of pure organic liquids, and most of the earlier methods used physicochemical properties for the predictions.1-7 Many of these prediction models suffer from limitations in application, because they were applicable only to homologous series of compounds and the predicted viscosities had large errors. Luckas and Lucas,6 as well as Joback and Reid,7 independently devised group contribution methods (GCMs); however, the GCMs also suffered from similar limitations, because they did not include contribution parameters for several groups that contain heteroatoms. Nashawi et al.8 reviewed the viscosity prediction models that used physicochemical properties and discussed the applicability and limitations of the different approaches. The quantitative structure-property relationship (QSPR) approach of Suzuki et al.,9 using multiple linear regression (MLR) and partial least-squares (PLS) regression, gave good predictions of liquid viscosities. Later, they increased the dataset to 361 compounds by adding 124 chemicals to the initial set of 237; for the larger dataset, the predictions were made using an artificial neural network (ANN) technique, in addition to MLR.10 The MLR model was a nine-descriptor model, whereas the ANN models contained indicator variables along with physicochemical properties. The approach was further extended11 to model the temperature dependence of viscosity, using 1229 data points from 440 organic compounds.
The MLR and ANN models developed by Suzuki and co-workers9-11 to predict viscosities were based on physicochemical properties (quantitative property-property prediction), which placed a major limitation on the approach, because experimental data for physicochemical properties may not be available for all chemicals in a dataset. Computable molecular descriptors are preferred to physicochemical properties, because they can be calculated rapidly and require no information other than the molecular structure of a chemical. Therefore, a QSPR approach can be used to predict the properties of all chemicals, including those that are yet to be synthesized. Most of the papers12-14 that appeared after 1997 on viscosity prediction followed a QSPR approach, using molecular descriptors calculated with the CODESSA software,15 and the regression models were either MLR or PLS regression. Kauffman and Jurs16 developed predictive models not only for viscosity but also for surface tension and thermal conductivity, because the three are related properties. The viscosity dataset that they used comprised 212 organic liquids, and an eight-descriptor model (MLR and NN) was developed from an initial pool of 239 descriptors that belonged to three major categories, namely, topological, geometrical, and quantum chemical (electronic).

With the advancement of chemical graph theory, a large pool of computable molecular descriptors is available for developing predictive models. In addition to the topological indices, several descriptors based on quantum chemical calculations at different levels of theory have been suggested.17-22 Therefore, QSAR modelers are overwhelmed with descriptors, and, in several instances, there are many more descriptors than chemicals (observations). Hence, variable selection methods such as stepwise regression,23 pairwise correlation,24-26 and subset selection27 are used in regression analyses. However, overambitious stepwise regression results in overfitted models.28-30 Although methods such as partial least-squares (PLS) analysis31,32 and ridge regression (RR)33,34 can be used even with collinear descriptors, model interpretability becomes a problem, because there are many predictors in the regression models developed using PLS and RR.
Table 1. Composition of Viscosity Data

chemical class             No. of compounds    chemical class      No. of compounds
saturated hydrocarbons     79                  nitriles            10
cycloalkenes               2                   thiols              3
aromatic hydrocarbons      30                  sulfides            5
halogenated compounds      54                  heterocycles        28
aliphatic amine            23                  ethers              23
aliphatic amide            9                   aldehydes           10
carboxylic acids           17                  ketones             17
esters                     47                  anhydrides          2
alcohols                   39                  nitro compounds     5
Table 2. Results of Regression Analysis Using the Random Forest Approach for Log Viscosity with Different Numbers of Important Descriptors

No. of features    MTRY    correlation coefficient (training set)    correlation coefficient (test set)    mean-square error (MSE) on test data
116                68      0.879                                     0.9                                   0.043
58                 31      0.985                                     0.986                                 0.007
29                 16      0.985                                     0.985                                 0.007
15                 10      0.983                                     0.983                                 0.008
8                  5       0.979                                     0.98                                  0.00
The elastic net35 and the LASSO36 (Least Absolute Shrinkage and Selection Operator) are more recent methods that have been developed specifically to handle collinear descriptors. Gram-Schmidt orthogonalization,37 which is normally used to numerically stabilize the calculation of regression coefficients in ordinary least-squares regression, can be modified and used for predictor selection.38,39 It is very important to include the variable selection within the validation step; otherwise, the cross-validated correlation coefficient, Rcv2 (q2), suffers an upward bias. Kraker et al.40 discussed the proper application of cross-validation when descriptor selection is performed. Unlike several other descriptor selection methods, the Random Forest (RF) algorithm41-43 prioritizes the descriptors, which eliminates extreme descriptor thinning and the resultant overfitting.

The present paper describes the prediction of the viscosities (η) of 403 organic liquids, belonging to different chemical classes, using the RF algorithm. From a set of 116 descriptors, informative subsets were selected for quantitative structure-property (viscosity) relationship modeling using the RF algorithm, and the regression models were also built with the same algorithm.

1.1. Algorithm Descriptions: Random Forest

It is well-known that the decision tree is endowed with several desirable features for performing classification and regression tasks. Apart from handling high-dimensional data, which is highly desirable for estimating quantitative structure-activity/property relationships, it has the ability to ignore irrelevant descriptors, and the results produced by a decision tree can be rigorously interpreted. The main drawback of a single decision tree, however, is its relatively poor predictive performance. The most successful attempt to improve the performance of the decision tree has been the ensemble-tree formulation of the RF method.41-43 Unlike other ensemble methods, the RF method offers some unique features that make it suitable for handling quantitative structure-activity/property relationship tasks. These include a built-in estimate of prediction accuracy, measures of descriptor importance, and a measure of similarity between molecules.

The RF approach is an ensemble of randomly constructed, independent (and unpruned, i.e., fully grown) decision trees.41-43 It uses bootstrap sampling and can be regarded as an improved version of bagging. The method is better than bagging and comparable to boosting in terms of accuracy, but it is computationally much faster and more robust with respect to overfitting noise than boosting and other tree-ensemble techniques.41
It generally exhibits a substantial performance improvement over single-tree methodologies such as Classification and Regression Trees (CART) and C4.5. Like other popular methods such as support vector machines and ANNs, the RF approach has an edge over MLR techniques, because it can capture nonlinear correlations between the descriptors and the activity. The RF method also possesses attractive features that the other models lack: it has fewer tunable parameters and is computationally much faster. In addition, it has an internal mechanism to compute a measure of variable importance, lending insight into the particular system being modeled.

1.2. Training Procedure

Given a set of n-dimensional input features corresponding to each of the M organic compounds, along with the experimentally determined viscosity values (expressed as logarithmic viscosities), the RF approach builds a regression model by applying the following steps. For each tree, a bootstrap sample (with replacement) is drawn from the original training dataset; that is, a sample is taken from the training dataset and then replaced in the dataset before the next sample is drawn. In this way, m samples are drawn to form the "In Bag" data for a particular tree. The main advantage of bootstrap sampling is that it helps avoid overfitting the training data. In each bootstrap training set, approximately one-third of the instances, on average, are not used in the In Bag data; these are called the "Out Of Bag" (OOB) data for that particular tree. A tree is induced from the In Bag data using the CART algorithm.42 The key difference is that, at each node, the best split is chosen from a randomly selected subset of MTRY features. The tree is grown until no further splits are possible. Here, MTRY is essentially the only tuning parameter in the algorithm. Pruning is not necessary in the RF method, because bootstrap sampling takes care of the overfitting problem; this further reduces the computational load of the RF algorithm. A large number of such trees are grown. Because each tree employs only a subset of MTRY features, and because no expensive pruning is performed, the algorithm is computationally very fast, even for problems with a large number of descriptors.

In the RF method, performance evaluation can be done in parallel with the training procedure, using the OOB samples.41 Because every tree is grown using the bootstrapping technique, on average two-thirds of the training instances are used to grow each tree, leaving behind one-third of the instances. These omitted examples can be used for performance evaluation, instead of the conventional 5- or 10-fold cross-validation procedures.

1.3. Feature Selection

Apart from being a robust regression methodology, the RF approach can also be used to select the most-informative features. When each tree is grown, the OOB error is estimated as explained in the previous paragraph. Subsequently, each feature in the OOB data is randomly permuted, one at a time, and the errors for the modified dataset are also computed. The measure of importance of each feature is calculated from the difference in squared errors between the original and the modified data. Using this methodology, detailed simulations can be made to capture the most-informative features:

(1) Employ a 5-fold cross-validation procedure to estimate the errors, keeping all features.

(2) Employ the procedure described previously to calculate importance measures of the features and rank them.
(3) Use the feature ranking to remove the least-important half of the features, and build a model with the remaining features to estimate the errors.

(4) Repeat the procedure, progressively removing half of the current set of features, until only a small number of features remain (a minimal sketch of this procedure is given below).
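The training and feature-halving scheme in steps 1-4 can be illustrated with a short script. The sketch below is only an approximation of the procedure described here: it uses scikit-learn's RandomForestRegressor instead of the R randomForest package employed in this work, substitutes the built-in impurity-based importance measure for the OOB permutation importance described above, and assumes a hypothetical descriptor file and column names.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    # Hypothetical input: one row per compound, descriptor columns plus a
    # "viscosity" column (file and column names are placeholders).
    data = pd.read_csv("viscosity_descriptors.csv")
    y = np.log10(data.pop("viscosity").to_numpy())   # model log10(viscosity)
    names = np.array(data.columns)
    X = data.to_numpy()

    def fit_rf(X, y, mtry):
        """Grow 500 unpruned trees; each node split considers a random subset
        of `mtry` descriptors (the MTRY parameter); OOB samples give a
        built-in estimate of the prediction error."""
        rf = RandomForestRegressor(
            n_estimators=500,
            max_features=mtry,   # analogue of MTRY
            oob_score=True,      # evaluate on the out-of-bag samples
            random_state=0,
        )
        rf.fit(X, y)
        return rf

    # Recursive halving (steps 1-4, with the OOB error standing in for the
    # 5-fold cross-validation of step 1): rank the descriptors, keep the
    # more important half, refit, and repeat.
    keep = np.arange(X.shape[1])
    while len(keep) >= 8:
        rf = fit_rf(X[:, keep], y, mtry=max(1, len(keep) // 2))
        oob_mse = np.mean((y - rf.oob_prediction_) ** 2)
        print(f"{len(keep):3d} descriptors  OOB MSE = {oob_mse:.4f}")
        order = np.argsort(rf.feature_importances_)[::-1]  # most important first
        keep = keep[order[: max(1, len(order) // 2)]]      # drop the bottom half

    print("remaining descriptors:", list(names[keep]))

Scikit-learn's permutation_importance helper could be substituted for feature_importances_ to mimic the permutation-based importance measure more closely.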
Figure 1. Random Forest (RF) regression on log viscosity for the 8 best features selected: (a) log viscosity predicted versus log viscosity observed and (b) log viscosity predicted versus residuals.
Figure 2. Random Forest (RF) regression on log viscosity for the 15 best features selected: (a) log viscosity predicted versus log viscosity observed and (b) log viscosity predicted versus residuals.
2. Experimental Section

The viscosity (η) values of 423 organic compounds, belonging to different classes, were collected from four papers.9,10,12,16 The composition of the dataset is provided in Table 1. For the modeling, the η values were converted to log viscosity values. Nineteen chemicals with log10 η > 1.5 were not used for the modeling, based on the normal distribution of the data.

A large number of structural descriptors were calculated for the 403 chemicals, using molecular structures represented by the SMILES code44,45 as the input. INDCAL, an in-house software program,46 was used to calculate topological descriptors, including the Wiener number;47 the molecular connectivity indices, based on the formulations of Randić48 and of Kier and Hall;49 the frequencies of paths of varying length; the Bonchev and Trinajstić50 information indices, based on the distance matrices of molecular graphs; the neighborhood complexity indices defined by Basak et al.51 for hydrogen-filled graphs; and the Balaban J index.52 Additional topological descriptors, along with a large set of electrotopological (E-state) indices,53 were calculated using E-Calc, a demo software program available with the book by Kier and Hall.53 Geometrical descriptors, such as the solvent-accessible volume and the molecular surface area, and quantum chemical descriptors (namely, the energy of the highest occupied molecular orbital (EHOMO), the energy of the lowest unoccupied molecular orbital (ELUMO), and the HOMO-LUMO gap (ELUMO - EHOMO)) were calculated using the Chem3D Ultra software.54
In addition, the polar surface area (PSA), the number of rotatable bonds, and the shape attribute (SA) were calculated using the Chem3D Ultra 2008 software.55 The overall path connectivity and the mean path connectivity indices, based on simple connectivity, bond order, and valence connectivity, were also computed. The overall path connectivity index is calculated according to the formula

OPC = ∑h χh ph

where χh is the Randić connectivity index of order h and ph is the path count of length h; h varies from zero to the maximum possible order. The overall path connectivity can be calculated based on simple connectivity (OPC), bond-order connectivity (bOPC), and valence connectivity (VOPC). The mean path connectivity is defined as

MPC = (∑h χh ph) / (∑h ph)

MPC can also be computed based on the simple, valence, and bond-order connectivities.
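As a small numerical illustration of the two formulas above, the following sketch computes OPC and MPC from per-order connectivity indices and path counts; the numbers are made-up placeholders, not values from the dataset.

    # chi[h]: connectivity index of order h; p[h]: path count of length h.
    # Illustrative placeholder values only.
    chi = [4.28, 2.91, 1.71, 0.96]   # orders h = 0, 1, 2, 3
    p = [7, 6, 5, 4]

    opc = sum(c * n for c, n in zip(chi, p))  # OPC = sum over h of chi_h * p_h
    mpc = opc / sum(p)                        # MPC = OPC / (sum over h of p_h)
    print(f"OPC = {opc:.2f}, MPC = {mpc:.3f}")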
There were 150 descriptors in the initial pool, and they were diverse in their characteristics and in the underlying principles of their calculation. They belonged to five major classes, namely, topostructural, topochemical, electrotopological, geometrical, and quantum chemical. Perfectly correlated descriptors (R = 1) were identified, and only one descriptor of each such pair was retained. In addition, any descriptor that had a constant value for all compounds in the dataset was omitted, and descriptors with a value of zero for more than 90% of the observations were also dropped, because a descriptor with such sparse data might act as an indicator variable. The initial pool of 150 descriptors was thereby reduced to a final set of 116 descriptors that encoded the diverse molecular characteristics of the 403 organic chemicals.
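These descriptor-thinning steps translate directly into a few lines of code. The sketch below assumes the raw descriptor table is available as a CSV file (the file name is a placeholder) and uses pandas, which is not mentioned by the authors; the tolerance used to detect perfectly correlated pairs is likewise an assumption.

    import pandas as pd

    desc = pd.read_csv("raw_descriptors.csv")   # 403 compounds x 150 descriptors (hypothetical file)

    # 1. Drop descriptors that are constant across the dataset.
    desc = desc.loc[:, desc.nunique() > 1]

    # 2. Drop descriptors that are zero for more than 90% of the compounds.
    zero_fraction = (desc == 0).mean()
    desc = desc.loc[:, zero_fraction <= 0.90]

    # 3. For every perfectly correlated pair (|R| = 1, detected here with a
    #    small numerical tolerance), keep only one member.
    corr = desc.corr().abs()
    to_drop = set()
    cols = list(corr.columns)
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if cols[j] not in to_drop and corr.iloc[i, j] >= 0.9999:
                to_drop.add(cols[j])
    desc = desc.drop(columns=sorted(to_drop))

    print(f"{desc.shape[1]} descriptors retained")   # 116 in this work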
3. Results and Discussion

For all the simulations, the Random Forest (RF) package based on the original FORTRAN code of Breiman and Cutler,41 as adapted for the R language for statistical computing by Andy Liaw,57 was used. The algorithm has essentially two parameters, viz., MTRY and the number of trees. MTRY is the size of the randomly selected subset of descriptors considered each time a tree node is split. Although the default value recommended for regression is one-third of the total number of features, we determined that this parameter required tuning to optimize performance. Our simulations indicated that keeping the number of trees at 500 provided good accuracy, and further increases in the number of trees did not improve performance. To tune the algorithm parameters, the data were randomly split into training and test sets in a ratio of 70:30, five different times. For each value of MTRY, the average cross-validation correlation coefficient and mean-square error were determined; the constructed models were also tested on the respective test sets, and the average performance on the test sets was determined. The results are shown in Table 2. It can be seen from these results (the performance, in terms of both the correlation coefficients and the mean-square errors) that the model with only the eight most-informative features (predictors) is quite satisfactory, and the inclusion of additional predictors did not improve the predictive ability significantly. The residual plots for 8 and 15 descriptors are shown in Figures 1 and 2, respectively. These figures also support the claim that models with a few dominant features can be used for accurate viscosity predictions.
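The evaluation protocol behind Table 2 (five random 70:30 splits, 500 trees, and a tuned MTRY value for each descriptor subset) can be sketched as follows. The helper below is an illustration built on scikit-learn rather than the R package used in this work, and the variable names in the example call are hypothetical.

    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    def evaluate_subset(X, y, mtry, n_splits=5):
        """Average the training/test correlation coefficients and the test
        MSE over five random 70:30 splits for one descriptor subset X."""
        r_train, r_test, mse_test = [], [], []
        for seed in range(n_splits):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.30, random_state=seed)
            rf = RandomForestRegressor(
                n_estimators=500, max_features=mtry, random_state=seed)
            rf.fit(X_tr, y_tr)
            r_train.append(pearsonr(y_tr, rf.predict(X_tr))[0])
            r_test.append(pearsonr(y_te, rf.predict(X_te))[0])
            mse_test.append(mean_squared_error(y_te, rf.predict(X_te)))
        return np.mean(r_train), np.mean(r_test), np.mean(mse_test)

    # For example, for the eight-descriptor subset with MTRY = 5 (X8 and y
    # are assumed to hold the selected descriptor columns and the log
    # viscosities):
    # r_tr, r_te, mse = evaluate_subset(X8, y, mtry=5)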
The eight dominant predictors that were ranked by the feature selection algorithm are given below, with brief descriptions:

(1) Hmax, the maximum hydrogen E-state (classified as hydrogen-bond donor/acceptor E-states).53

(2) TIC1, the total information content of order 1, calculated using neighborhood similarity.50

(3) SsOH, the sum of the E-states of the O atoms of the type OH.53

(4) HsOH, which represents the number of H atoms of the type OH.53

(5) IC0, the information content of order 0, calculated based on neighborhood similarity.50

(6) VOPC, the overall path connectivity based on valence connectivity.

(7) H, the Harary index, a topological index derived from the reciprocal distance matrix of a molecular graph.56

(8) vχ6, the valence connectivity index of order 6.48

The eight most-dominant predictors belonged to three of the five classes into which the pool of 116 descriptors was grouped; no quantum chemical or geometrical descriptor appears in the list. The eight descriptors comprise three electrotopological indices (Hmax, SsOH, HsOH), three topochemical descriptors (IC0, TIC1, and VOPC), and two topostructural descriptors (H, vχ6). The electrotopological state indices are numerical values computed for each atom in a molecule, and they encode information about both the topological environment of that atom and the electronic perturbations to all other atoms in the molecule. These three descriptors mainly characterize the atoms and groups that are involved in hydrogen-bond interactions. The viscosity of a liquid is affected by intermolecular forces, and hydrogen-bonding interactions are the most dominant type of such forces. Hence, the selection of electrotopological descriptors that encode hydrogen-bonding characteristics as the dominant predictors supports the effectiveness of the RF algorithm. The topological descriptor known as the overall path connectivity (VOPC) encodes the branching and the size of the molecules, whereas IC0 and TIC1 are information-theoretic indices. Although the topostructural descriptor vχ6 and the Harary index (H) also encode branching and molecular size, VOPC is nondegenerate and has a unique value for each of the chemicals in the database.

It was surprising to note that quantum chemical descriptors were not picked up, even among the first 15 prominent features (predictors) selected by the RF algorithm. Among the first 15 dominant predictors, the polar surface area (PSA), an important property that characterizes the molecular surface, was selected, in addition to some more topostructural descriptors. The PSA is the sum of the surface contributions of the polar atoms, such as oxygen, nitrogen, and their attached hydrogens, in a molecule. The identification of PSA as one of the dominant predictors from a diverse pool of descriptors by the RF algorithm is consistent with the viscosity model reported by Kauffman and Jurs,16 in which a charged partial surface area (CPSA) descriptor was used. Hence, the RF algorithm appears to be capable of selecting appropriate predictors that affect the property under consideration.

Acknowledgment

Regression models, along with detailed usage directions, are available on request from the authors. One of the authors (R.N.) acknowledges the Department of Science and Technology, New Delhi, India for financial assistance under Project No. SR/S4/MS: 479/07. This contribution is paper No. 54 from the Centre for Mathematical Sciences Pala Campus, Kerala, India.

Literature Cited

(1) Horvath, A. L. Liquid viscosities of halogenated hydrocarbons. Chem. Eng. 1976, 83, 121.
(2) Pachaiyappan, V.; Ibrahim, S. H.; Kuloor, N. R. Simple correlation for determining viscosity of organic liquids. Chem. Eng. 1967, 74, 193.
(3) Przezdziecki, J. W.; Sridhar, T. Prediction of liquid viscosity. AIChE J. 1985, 31, 333.
(4) Reid, R. C.; Prausnitz, J. M.; Poling, B. E. The Properties of Gases and Liquids, 4th Edition; McGraw-Hill: New York, 1987.
(5) Lyman, W. J.; Reehl, W. F.; Rosenblatt, D. H. Handbook of Chemical Property Estimation Methods; McGraw-Hill: New York, 1982.
(6) Luckas, M.; Lucas, K. Viscosity of liquids: An equation with parameters correlating with structural groups. AIChE J. 1986, 32, 139.
(7) Joback, K. G.; Reid, R. C. Estimation of pure-component properties from group contribution. Chem. Eng. 1987, 57, 233.
(8) Nashawi, I. S.; Elgibaly, A. A. Prediction of liquid viscosity of pure organic compounds via artificial neural networks. Pet. Sci. Technol. 1999, 17, 1107.
(9) Suzuki, T.; Ohtaguchi, K.; Koide, K. Computer-assisted approach to develop a new prediction method of liquid viscosity of organic compounds. Comput. Chem. Eng. 1996, 20, 161.
(10) Suzuki, T.; Ebert, R.; Schüürmann, G. Development of both linear and nonlinear methods to predict the liquid viscosity at 20 °C of organic compounds. J. Chem. Inf. Comput. Sci. 1997, 37, 1122.
(11) Suzuki, T.; Ebert, R.; Schüürmann, G. Application of neural networks to modeling and estimating temperature-dependent liquid viscosity of organic compounds. J. Chem. Inf. Comput. Sci. 2001, 41, 776.
(12) Ivanciuc, O.; Ivanciuc, T.; Filip, P.; Cabrol-Bass, D. Estimation of the liquid viscosity of organic compounds with a quantitative structure-property model. J. Chem. Inf. Comput. Sci. 1999, 39, 515.
(13) Katritzky, A. R.; Chen, K.; Wang, Y.; Karelson, M.; Lucic, B.; Trinajstic, N.; Suzuki, T.; Schüürmann, G. Prediction of liquid viscosity for organic compounds by a quantitative structure-property relationship. J. Phys. Org. Chem. 2000, 13, 80.
(14) Cocchi, M.; De Benedetti, P. G.; Seeber, R.; Tassi, L.; Ulrici, A. Development of quantitative structure-property relationships using calculated descriptors for the prediction of the physicochemical properties of a series of organic solvents. J. Chem. Inf. Comput. Sci. 1999, 39, 1190.
(15) Katritzky, A. R.; Lobanov, V. S.; Karelson, M. CODESSA Version 2.0 Reference Manual; University of Florida: Gainesville, FL, 1994.
(16) Kauffman, G. W.; Jurs, P. C. Prediction of surface tension, viscosity, and thermal conductivity for common organic solvents using quantitative structure-property relationships. J. Chem. Inf. Comput. Sci. 2001, 41, 408.
(17) Karelson, M.; Lobanov, V. S.; Katritzky, A. R. Quantum-chemical descriptors in QSAR/QSPR studies. Chem. Rev. 1996, 96, 1027.
(18) Parr, R. G.; Szentpaly, L.; Liu, S. Electrophilicity index. J. Am. Chem. Soc. 1999, 121, 1922.
(19) Chattaraj, P. K.; Roy, D. R. Update 1 of Electrophilicity index. Chem. Rev. 2007, 107, R46.
(20) Chattaraj, P. K.; Sarkar, U.; Roy, D. R. Electrophilicity index. Chem. Rev. 2006, 106, 2065.
(21) Netzeva, T. I.; Aptula, A. O.; Benfenati, E.; Cronin, M. T. D.; Gini, G.; Lessigiarska, I.; Maran, U.; Vračko, M.; Schüürmann, G. Description of the electronic structure of organic chemicals using semiempirical and ab initio methods for development of toxicological QSARs. J. Chem. Inf. Model. 2005, 45, 106.
(22) Schüürmann, G. Quantum chemical descriptors in structure-activity relationships - Calculation, interpretation and comparison of methods (Chapter 6). In Predicting Chemical Toxicity and Fate; Cronin, M. T. D., Livingstone, D. J., Eds.; CRC Press: Boca Raton, FL, 2004; pp 85-149.
(23) Draper, N. R.; Smith, H. Applied Regression Analysis, 2nd Edition; John Wiley & Sons: New York, 1981; pp 294-379.
(24) Héberger, K.; Rajkó, R. Generalization of pair-correlation method (PCM) for nonparametric variable selection. J. Chemom. 2002, 16, 436.
(25) Héberger, K.; Andrade, J. M. Procrustes rotation and pairwise correlation: A parametric and a nonparametric method for variable selection. Croat. Chem. Acta 2004, 77, 117.
(26) Héberger, K.; Rajkó, R. Variable selection using pair-correlation method. Environmental applications. SAR QSAR Environ. Res. 2002, 13, 541.
(27) Miller, A. J. Subset Selection in Regression; Chapman and Hall: London, 1990; pp 43-82.
(28) Rencher, A.; Punn, F. Inflation of R2 in best subset regression. Technometrics 1980, 22, 49.
(29) Hawkins, D. M.; Basak, S. C.; Shi, X. QSAR with few compounds and many features. J. Chem. Inf. Comput. Sci. 2001, 41, 663.
(30) Hawkins, D. M. The problem of overfitting. J. Chem. Inf. Comput. Sci. 2004, 44, 1.
(31) Wold, S. Discussion: PLS in chemical practice. Technometrics 1993, 35, 136.
(32) Frank, I. E.; Friedman, J. H. A statistical view of some chemometrics regression tools. Technometrics 1993, 35, 109.
(33) Hoerl, A. E.; Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 1970, 12, 55.
(34) Hoerl, A. E.; Kennard, R. W. Ridge regression: Applications to nonorthogonal problems. Technometrics 1970, 12, 69.
(35) Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 2005, 67, 301.
(36) Tibshirani, R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B 1996, 58, 267.
(37) Thisted, R. A. Elements of Statistical Computing; Chapman and Hall: New York, 1988.
(38) Basak, S. C.; Natarajan, R.; Mills, D.; Hawkins, D. M.; Kraker, J. J. Quantitative structure-activity relationship modeling of insect juvenile hormone activity of 2,4-dienoates using computed molecular descriptors. SAR QSAR Environ. Res. 2005, 16, 581.
(39) Basak, S. C.; Natarajan, R.; Mills, D.; Hawkins, D. M.; Kraker, J. J. Quantitative structure-activity relationship modeling of juvenile hormone mimetic compounds for Culex pipiens larvae, with discussion of descriptor thinning methods. J. Chem. Inf. Model. 2006, 46, 65.
(40) Kraker, J. J.; Hawkins, D. M.; Basak, S. C.; Natarajan, R.; Mills, D. Quantitative structure-activity relationship (QSAR) modeling of juvenile hormone activity: Comparison of validation procedures. Chemom. Intell. Lab. Syst. 2007, 87, 33.
(41) Breiman, L.; Cutler, A. Random Forests. Available as freeware at http://www.stat.berkeley.edu/users/breiman/RandomForests/.
(42) Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. Classification and Regression Trees; Chapman & Hall: New York, 1984.
(43) Breiman, L. Random forests. Mach. Learn. 2001, 45, 5.
(44) Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31.
(45) Weininger, D.; Weininger, A.; Weininger, J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 1989, 29, 97.
(46) Natarajan, R.; Nirdosh, I.; Anbazhagan, T. M.; Murali, T. C. Topological index calculator (INDCAL): A computer program to calculate topographical and topostructural indices. In Proceedings of the International Seminar on Mineral Processing Technology, MPT-2002, January 3-5, 2002; Subramanian, S., Natarajan, K. A., Rao, B. S., Rao, T. R. R., Eds.; Vol. 1, pp 301-306.
(47) Wiener, H. Structural determination of paraffin boiling points. J. Am. Chem. Soc. 1947, 69, 17.
(48) Randić, M. On characterization of molecular branching. J. Am. Chem. Soc. 1975, 97, 6609.
(49) Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis; Research Studies Press: Letchworth, Hertfordshire, U.K., 1986.
(50) Bonchev, D.; Trinajstić, N. Information theory, distance matrix and molecular branching. J. Chem. Phys. 1977, 67, 4517.
(51) Basak, S. C. Information theoretic indices of neighborhood complexity and their applications. In Topological Indices and Related Descriptors in QSAR and QSPR; Devillers, J., Balaban, A. T., Eds.; Gordon and Breach Science Publishers: Amsterdam, The Netherlands, 1999; pp 563-593.
(52) Balaban, A. T. Highly discriminating distance-based topological indices. Chem. Phys. Lett. 1982, 89, 399.
(53) Kier, L. B.; Hall, L. H. Molecular Structure Description: The Electrotopological State; Academic Press: San Diego, CA, 1999.
(54) Chem3D Ultra, version 8; Cambridge Soft Corporation: Cambridge, MA, 2004.
(55) Chem3D Ultra, version 11; Cambridge Soft Corporation: Cambridge, MA, 2008.
(56) Plavšić, D.; Nikolić, S.; Trinajstić, N.; Mihalić, Z. On the Harary index for the characterization of chemical graphs. J. Math. Chem. 1993, 12, 235.
(57) Liaw, A.; Wiener, M. Classification and regression by randomForest. R News 2002, 2 (3), 18.
Received for review December 1, 2008
Revised manuscript received February 9, 2009
Accepted February 10, 2009

IE8018406