
Descriptor Selection Methods in Quantitative Structure−Activity Relationship Studies: A Review Study

Mohsen Shahlaei*

Department of Medicinal Chemistry and Novel Drug Delivery Research Center, School of Pharmacy, Kermanshah University of Medical Sciences, Kermanshah 81746-73461, Iran

Received: November 10, 2012

CONTENTS

1. Introduction
2. The Classical Methods
2.1. Forward Selection (FS) Method
2.2. Backward Elimination (BE) Method
2.3. Stepwise Selection (SS) Method
2.4. Variable Selection and Modeling Method Based on the Prediction (VSMP)
2.5. Leaps-and-Bounds Regression
3. Artificial Intelligence-Based Methods
3.1. Genetic Algorithm (GA) Method
3.2. Artificial Neural Network (ANN) Method
3.3. Simulated Annealing (SA) Method
3.4. Particle Swarm Optimization (PSO) Method
3.5. Automatic Relevance Determination (ARD) Method
3.6. Ant Colony System (ACS) Method
4. Miscellaneous Methods
4.1. Replacement Method (RM)
4.2. k Nearest Neighbors (kNN) Method
4.3. Successive Projections Algorithm (SPA)
4.4. Uninformative Variable Elimination-Partial Least Square (UVE-PLS)
5. Conclusion
Author Information
Corresponding Author
Notes
Biography
References

1. INTRODUCTION

Quantitative structure−activity relationship (QSAR) modeling, an important area of drug design and discovery, searches for information relating chemical structure to biological and pharmaceutical activities. Nowadays, one cannot talk about drug design and discovery without mentioning QSAR.1 QSAR approaches have been applied to guide lead optimization and to study the mechanisms of chemical−biological interaction in modern drug discovery.2 QSAR attempts to find reliable relationships between the variations in the values of calculated descriptors and the biological activity for a series of compounds, so that these "rules" can be employed to assess new chemical entities. Like other data mining techniques, QSAR is carried out in successive steps, including data set preparation, descriptor calculation, descriptor selection, model building, and validation. The success of a QSAR study depends deeply on how each of these steps is performed.1

Chemical structure, in QSAR studies, is encoded by a variety of descriptors: topological, constitutional, thermodynamic, quantum mechanical, functional group, geometrical, and shape descriptors, among others.

Because of the rising computational power of hardware and software and the decreasing cost of computing and collecting various molecular descriptors, QSAR modeling nowadays depends on the correct analysis and selection of the computed descriptors used as independent variables in model formation. It must be noted that usually only a small subset of the calculated descriptors carries the information necessary for building a mathematical model of the system of interest. If the number of calculated descriptors is denoted by n, the variable selection procedure is regularly defined as selecting the m < n descriptors that allow the building of the best QSAR model. It is possible to construct models including all of the calculated descriptors, but there are several reasons for selecting only a subset of them: (i) the prediction accuracy of the model might be improved through the exclusion of redundant and irrelevant descriptors; (ii) a QSAR model built on fewer input descriptors is often simpler and potentially faster, and the interpretability of the relationship between the descriptors and the observed activity may be increased; (iii) if the number of input descriptors is large compared to the number of molecules of interest, the effective number of degrees of freedom may be too large for calculating reliable estimates of the QSAR model's parameters; and (iv) most machine learning methods have a time complexity that grows faster than linearly in the number of molecules and/or descriptors, which prohibits the analysis of data sets with several hundred descriptors.3

Descriptor selection is aimed at discarding those calculated descriptors that are redundant, noisy, or irrelevant for the model building tasks envisaged, in such a way that the dimensionality of the input space can be reduced without loss of important information. Selecting appropriate descriptors for QSAR analyses is a difficult task to accomplish, as there are no absolute rules that govern this selection. However, it is well-known, in both the chemical and the statistical fields, that the accuracy of regression methods is not monotonic with respect to the number of descriptors employed by the model. Thus, depending on the nature of the regression method, the presence of irrelevant or redundant descriptors can cause the system to focus attention on the idiosyncrasies of individual molecules and lose sight of the broad picture that is necessary for generalization beyond the training set in a QSAR study (Figure 1).

Figure 1. Classification of the descriptor selection methods studied.

There are two broad strategies that can be employed for the reduction of descriptors: wrapper methods and filter methods.4 In a typical QSAR, descriptors are usually selected using a heuristic to maximize some score with respect to a single classifier or regression algorithm. This is an example of wrapper-based descriptor selection.5 A wrapper approach essentially consists of two components: the objective function, which may be a linear or nonlinear regression model, and an optimization (selection) method that selects descriptors for the objective function. Examples of the optimization component include genetic algorithms and simulated annealing. The performance of the regression algorithm is employed to guide the optimization procedure in the selection of descriptors. As a result, the selection procedure is closely tied to the regression algorithm that is used. Thus, for instance, one may obtain one set of descriptors when employing a linear model (such as multiple linear regression) and a different set when using a nonlinear technique (such as an artificial neural network, ANN). Clearly, this strategy aims to determine the best descriptor subset for the particular regression method being employed. Filter techniques are also common in QSAR modeling. The difference between filter and wrapper techniques is that a filter technique does not use any specific regression algorithm to select descriptor subsets. Instead, it considers only the characteristics of the calculated descriptors to perform the selection.6 The standard procedure of descriptor reduction, whereby low-variance and highly correlated descriptors are removed from an initial, large pool of calculated descriptors to give a smaller, more information-rich pool, is an example of a filter-type descriptor selection technique.7 Other examples include mutual information-based methods8 and χ2 methods.9

In this Review, the focus is on wrapper-type methods. The descriptor selection methods studied here can be divided into three categories: (1) classical methods, such as forward selection and the stepwise procedure; (2) artificial intelligence-based methods, such as the genetic algorithm; and (3) miscellaneous methods, for example, the replacement method. When choosing a descriptor selection method, several factors must be considered, including simplicity and efficiency, the likelihood of convergence to a global optimum on the hypersurface defined by the descriptors, and the speed of the method.

To demonstrate that the resulting models predict the activity of the studied compounds well, several different methods have been used. To assess the predictive ability and to check the statistical significance of the developed models, the proposed models are applied to the prediction of pIC50 values for an external set of compounds not used in model building. Cross-validation is a technique used to explore the reliability of statistical models. The root-mean-square error of cross-validation (RMSECV), a standard index of the accuracy of a modeling method based on the cross-validation technique, and R²LOO, another criterion of the predictivity of developed models, are commonly applied. Tropsha and co-workers suggested a set of criteria; if these criteria are satisfied, the model can be considered predictive.10 These criteria include:

R²LOO > 0.5

R² > 0.6

(R² − R₀²)/R² < 0.1, with 0.85 < k < 1.15

or

(R² − R₀′²)/R² < 0.1, with 0.85 < k′ < 1.15

R² is the correlation coefficient of the regression. Definitions of the other parameters, such as R₀², R₀′², k, and k′, are given explicitly in the literature and are not repeated here for brevity.10 For the evaluation of the predictive ability of a multivariate calibration model, two further parameters, the root-mean-square error (RMSE) and the percent relative standard error of prediction (RSEP (%)), can also be used.

To avoid chance correlations, which are possible because of the large number of generated columns (independent variables), and to test the robustness of the developed models, the Y-randomization test is applied to the models. The dependent variable vector is randomly permuted, and a new QSAR model is constructed using the original independent variable matrix. The new models are expected to have low R² values. To confirm this, the procedure is iterated several times. If the results show high R², it implies that an acceptable QSAR model cannot be obtained.
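As a concrete illustration of the Y-randomization test just described, the following minimal sketch (Python with NumPy) permutes the activity vector and refits the model repeatedly. The use of an ordinary least-squares model and the function names are illustrative assumptions, not part of any particular QSAR package:

    import numpy as np

    def r2_score(y, y_hat):
        """Squared-error-based R2 of a fitted regression."""
        ss_res = np.sum((y - y_hat) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        return 1.0 - ss_res / ss_tot

    def y_randomization(X, y, n_rounds=50, seed=0):
        """Refit the model on permuted activities; consistently high R2
        values signal that the original model may rest on chance
        correlation."""
        rng = np.random.default_rng(seed)
        Xb = np.column_stack([np.ones(len(y)), X])   # add intercept column
        scores = []
        for _ in range(n_rounds):
            y_perm = rng.permutation(y)
            beta, *_ = np.linalg.lstsq(Xb, y_perm, rcond=None)
            scores.append(r2_score(y_perm, Xb @ beta))
        return np.array(scores)                      # expected to stay low

If the R² values obtained on the permuted activities approach that of the original model, the original correlation is suspect.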


The domain of application of a QSAR model11 must be defined, and only predictions of activity for those compounds that fall into this domain may be considered reliable. Such QSAR models can then be used for screening new compounds. One simple method of defining the applicability domain is to determine the extent of extrapolation,10 which is obtained by calculating the leverage hi of each compound.12

In the following sections, each of the three categories is discussed in more detail. Section 2 reviews the classical methods. Section 3 focuses on intelligent methods for selecting the most predictive descriptors from a large calculated initial set. Section 4 describes the miscellaneous methods. For each category, we provide an overview of the method followed by examples of its applications in QSAR studies.
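A minimal sketch of the leverage calculation is given below (Python with NumPy). The warning threshold quoted in the comment, h* = 3(d + 1)/n, is the commonly used convention and is stated here as an assumption rather than a prescription of ref 12:

    import numpy as np

    def leverages(X_train, X_query=None):
        """Leverage h_i = x_i (X'X)^-1 x_i' of each compound, measured
        against the training descriptor matrix (an intercept column is
        added here)."""
        Xt = np.column_stack([np.ones(len(X_train)), X_train])
        Xq = Xt if X_query is None else np.column_stack(
            [np.ones(len(X_query)), X_query])
        core = np.linalg.pinv(Xt.T @ Xt)   # pseudo-inverse for stability
        return np.einsum("ij,jk,ik->i", Xq, core, Xq)

    # Compounds whose leverage exceeds the customary warning value
    # h* = 3(d + 1)/n fall outside the applicability domain defined by
    # the training set, and their predictions should not be trusted.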

2. THE CLASSICAL METHODS

2.1. Forward Selection (FS) Method

The forward selection method adds descriptors to the QSAR model one at a time. The first descriptor included in the regression is the one that gives the best value of the fitness function (for example, the highest correlation with the biological activity or the minimum residual sum of squares). The descriptor selected first is forced into all further QSAR models. New descriptors are progressively added to the regression, each descriptor being selected because it gives the best fitness function value when added to those already chosen. Various rules can be employed as stopping criteria.13 The main disadvantage of FS is that if several descriptors collectively are good predictors but each alone is a poor predictor, none of them may be selected. FS has been used in several QSAR studies, such as refs 3 and 14. (A minimal sketch of the procedure is given at the end of this section.)

2.2. Backward Elimination (BE) Method

In contrast to forward selection, BE starts with all descriptors and then checks them one by one for deletion. A descriptor is selected for deletion on the basis of its contribution to the reduction of an error criterion such as the error sum of squares. The BE procedure is terminated when all of the descriptors included in the model are significant or when all but one descriptor have been deleted. To date, only a limited number of studies have used BE as the variable selection method.15

2.3. Stepwise Selection (SS) Method

One of the well-known variable selection approaches that has long been employed in QSAR studies is the stepwise descriptor selection method. This approach is a "step by step" procedure. The selection phase of model formation begins without any descriptor in the regression equation. In each step, the method introduces the descriptor that gives the best value of the applied fitness function (for example, the correlation coefficient with the biological activity), but at the same time it analyzes the significance of the descriptors included previously in the regression QSAR model. If a previously included descriptor has lost its significance, it is removed. The procedure stops when no descriptor left in the pool satisfies the selection criterion. SS is a simple but powerful technique for obtaining a subset of significant descriptors, but, unlike methods such as artificial neural networks, it does not account for nonlinear relationships. Also, one of its more significant disadvantages is that it appears to be appropriate only for a small descriptor pool.

2.4. Variable Selection and Modeling Method Based on the Prediction (VSMP)

Liu and co-workers16 have developed a novel variable selection and modeling method based on the prediction, called VSMP. In this technique, two statistics, the interrelation coefficient between pairs of descriptors (called Rint) and the correlation coefficient (q²) calculated using the leave-one-out (LOO) cross-validation technique, are introduced into the subset search to improve its performance. This method differs from other QSAR-related procedures in two main characteristics: (1) the search for the optimal subsets is controlled by the statistic q², or by the root-mean-square error of prediction (RMSEP) in the LOO cross-validation step, rather than by the correlation coefficient obtained in the model formation step (R²), and (2) the search over all optimal subsets is expedited by the statistic Rint together with q².

2.5. Leaps-and-Bounds Regression

In some QSAR studies, so-called leaps-and-bounds regression has been used to select descriptors.17 Leaps-and-bounds regression can quickly identify the best subsets of descriptors of different sizes without checking all possible subsets. The technique is based on the following fundamental inequality:18

RSS(A) ≤ RSS(Ai)

where RSS is the residual sum of squares, A is any set of descriptors, and Ai is a subset of A. The number of subsets evaluated in a search for the best subset regression can be limited by the application of this inequality. For example, suppose set A1 contains three descriptors with an RSS of 596, and set A2 contains four descriptors with an RSS of 605. All of the subsets of A2 can then be ignored, because those subsets have an RSS greater than that of A2, and hence greater than that of A1.

All of the above methods, being essentially linear, suffer from the disadvantage that they may not be effective where the relationship between descriptors and activity is nonlinear, and they may produce chance correlations when many variables, or combinations of variables, are screened for inclusion in the model. To reduce the risk of overfitting due to retaining too many descriptors, a procedure based on leave-one-out cross-validation followed by a randomization test can be applied to examine different sets of descriptors for significant differences in prediction. A number of novel methods for descriptor selection have therefore been introduced; these are studied in more detail in the following sections.
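As promised in section 2.1, here is a minimal sketch of forward selection (Python with NumPy). The use of the LOO PRESS statistic as the fitness function is one possible choice among those mentioned above, assumed here for illustration:

    import numpy as np

    def press_loo(X, y):
        """Leave-one-out predictive residual sum of squares, computed
        from the hat matrix of the linear model."""
        Xb = np.column_stack([np.ones(len(y)), X])
        H = Xb @ np.linalg.pinv(Xb.T @ Xb) @ Xb.T
        resid = y - H @ y
        return np.sum((resid / (1.0 - np.diag(H))) ** 2)

    def forward_selection(X, y, max_terms=5):
        """Greedy FS: at each step add the descriptor that most reduces
        the LOO PRESS of the growing linear model."""
        selected, remaining = [], list(range(X.shape[1]))
        best_score = np.inf
        while remaining and len(selected) < max_terms:
            scores = {j: press_loo(X[:, selected + [j]], y)
                      for j in remaining}
            j_best = min(scores, key=scores.get)
            if scores[j_best] >= best_score:    # simple stopping rule
                break
            best_score = scores[j_best]
            selected.append(j_best)
            remaining.remove(j_best)
        return selected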

3. ARTIFICIAL INTELLIGENCE-BASED METHODS

Artificial intelligence-based methods are able to carry out nonlinear mapping of the various calculated descriptors to the corresponding biological activity implicitly, and they can overcome some limitations of the classic descriptor selection methods. Moreover, the increasing number of theoretical descriptors available in digital form has prompted the development of novel descriptor selection methodologies that can process and interpret a large number of descriptors faster and with greater reliability. The artificial intelligence-based methods discussed below have been used extensively for this purpose and are widely applied in QSAR.

3.1. Genetic Algorithm (GA) Method

The GA is an optimization technique that mimics the selection phenomenon in nature. The essence of natural selection is that, under certain environmental conditions, species of high


fitness can prevail in the next generation, and the best species may be regenerated by crossover together with random mutations of the chromosomes of surviving species. Rogers and Hopfinger19 first used this technique in a QSAR study and showed that the GA is a very efficient tool with many merits compared to other variable selection techniques. GAs are governed by biological evolution rules.20 They are stochastic optimization methods inspired by evolutionary principles. The distinctive aspect of a GA is that it investigates many possible solutions simultaneously, each of which explores a different region of the vector space defined by the calculated descriptors. The first step in a typical GA procedure is to create a population of N individuals, each of which encodes the same number of randomly chosen descriptors. The fitness of each individual in this generation is determined. In the second step, a fraction of the children of the next generation is produced by crossover (crossover children) and the rest by mutation (mutation children) from the parents, on the basis of their scaled fitness scores. Each new offspring contains characteristics from one or two of its parents. The method also includes elitism, which protects the fittest individual in any given generation from crossover or mutation during reproduction; the genetic content of this individual simply moves on to the next generation intact. These selection, crossover, and mutation processes are repeated until all of the N parents in the population are replaced by their children. The fitness score of each member of the new generation is again evaluated, and the reproductive cycle is continued until 80% of the generations show the same target fitness score.

Because of its simplicity, flexibility, easy operation, minimal requirements, and global perspective, the GA has been successfully employed in many QSAR studies.21 Some of the QSAR studies carried out by our research group also verify the success of the GA in the selection of descriptors.21b,22

It is well-known that the GA technique sometimes settles in a local minimum of the space defined by the pool of descriptors and so misses the best subset. To address this issue, the GA can be run from different initial populations (that is, from various starting points in the space spanned by the pool of descriptors), at the cost of prolonged computing time. The GA has several advantages over other optimization algorithms: it has the ability to escape from local optima on the response surface, it requires no knowledge of, or gradient information about, the response surface, and it can be employed for a wide variety of optimization problems.29 Its major drawbacks are that there can be difficulties in finding the exact global optimum, that a large number of response (fitness) function evaluations is required, and that configuring the problem is not straightforward.
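A minimal sketch of a GA for descriptor subset selection follows (Python with NumPy). The population size, mutation rate, and the use of a plain R² fitness are illustrative assumptions; a cross-validated q² is the safer fitness in practice:

    import numpy as np

    def fitness(mask, X, y):
        """Fitness of a descriptor subset: R2 of an OLS fit."""
        if not mask.any():
            return -np.inf
        Xb = np.column_stack([np.ones(len(y)), X[:, mask]])
        beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        resid = y - Xb @ beta
        return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

    def ga_select(X, y, pop=30, gens=100, p_mut=0.05, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[1]
        population = rng.random((pop, n)) < 0.2   # random initial subsets
        for _ in range(gens):
            scores = np.array([fitness(ind, X, y) for ind in population])
            order = np.argsort(scores)[::-1]
            elite = population[order[0]].copy()   # elitism
            parents = population[order[: pop // 2]]
            children = []
            while len(children) < pop - 1:
                a, b = parents[rng.integers(len(parents), size=2)]
                cut = rng.integers(1, n)          # one-point crossover
                child = np.concatenate([a[:cut], b[cut:]])
                child ^= rng.random(n) < p_mut    # bit-flip mutation
                children.append(child)
            population = np.vstack([elite] + children)
        scores = np.array([fitness(ind, X, y) for ind in population])
        return population[np.argmax(scores)]      # boolean descriptor mask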

3.2. Artificial Neural Network (ANN) Method

Recently, there has been growing interest in the application of ANNs in the field of QSAR. ANNs are often used as the regression method in conjunction with optimization techniques for feature selection, ranging from simple greedy approaches such as forward selection or backward elimination to more elaborate methodologies such as simulated annealing and genetic algorithms. The ANN has also been used directly to identify the descriptors most relevant to the biological activity.23 Wikel and Dow reported initial results on the use of neural networks for this purpose.23 After the neural network (all networks were of the back-propagation type) identified the more important descriptors, a classical regression analysis was used to verify their importance.23

ANN techniques are part of an evolving computational technology in which a computer program is designed to learn from data in a manner emulating the learning pattern of the brain.24 An ANN is typically employed when the problem is not understood well enough to use a typical descriptor selection method and, at the same time, the number of descriptors is large. As indicated in Figure 2, a typical ANN for descriptor selection is a three-layer, fully connected, feed-forward neural network.

Figure 2. Schematic diagram of a feed-forward artificial neural network with three layers.

The input layer accepts the calculated descriptor values, which numerically encode the features of each molecule of interest. The input signals are weighted as they are transmitted to the nodes of the second layer, the hidden layer. The hidden layer neurons process the data and send a signal to the neurons of the output layer. The output layer provides the predicted value, which in the field of QSAR is the activity of each compound. A neural network is trained to relate certain descriptor values to target outputs (activities). To accomplish this, a variety of neural network learning algorithms can be used, such as back-propagation. The ANN "learns" by repeatedly passing through the input descriptors and adjusting its connection weights to minimize the error between the predicted and the experimental biological activity. An ANN is thus a mathematical model describing a nonlinear hypersurface.

For descriptor selection by ANN, an approach has been proposed that uses a neural network model as the tool to determine which descriptors are to be discarded.25 The method performs a backward selection by successively removing input nodes (neurons) from a network trained with the complete set of descriptors as inputs. Input nodes are removed, along with their connections, and the remaining weights are adjusted in such a way that the overall input−output behavior learned by the network is kept approximately unchanged. A simple criterion for selecting the input nodes to be removed, such as the error in the prediction of the biological activity, can be used.25

Unlike most descriptor selection approaches, which remove all useless features in one step, this ANN approach removes features iteratively, thus enabling a systematic evaluation of the reduced network models produced during the progressive elimination of descriptors. Therefore, the number of input nodes (i.e., the final number of descriptors) is determined purely according to the performance required of the network, without making a priori assumptions or evaluations about the importance of the input descriptors.25 This gives more flexibility to the descriptor selection algorithm, which can be iterated until either a predetermined number of descriptors has been eliminated or the performance of the current reduced network falls below specified requirements. Moreover, the method does not depend on the learning procedure, because it removes input nodes after the training phase.25

The ANN has some disadvantages; for example, it has the potential to overfit, or memorize, the data. In other words, if training is carried on for too long, the ANN will eventually overfit, meaning that it becomes fitted precisely to the training data and loses generalization. It is therefore a good idea to examine how well an ANN performs on data it has not seen before (a testing set). Testing with unseen data can be done during training to determine how much training is required to perform well without overfitting. The testing can be done by hand, or an automatic test can be used that stops the training when a criterion such as the mean square error on the test data stops improving. Another disadvantage is that descriptor selection using an ANN is an enormous task: investigating all possible descriptor combinations is impractical and very time-consuming.
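A minimal sketch of this backward input-pruning idea is given below (Python with scikit-learn). As a simplifying assumption, removing an input node is approximated here by mean-imputing its column after training, rather than by the weight-adjustment scheme of ref 25; the function name is illustrative:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def ann_backward_pruning(X, y, n_keep=5, seed=0):
        """Iteratively drop the input whose removal (approximated by
        mean-imputing its column) least degrades the network's fit."""
        keep = list(range(X.shape[1]))
        while len(keep) > n_keep:
            net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000,
                               random_state=seed).fit(X[:, keep], y)
            base = np.mean((y - net.predict(X[:, keep])) ** 2)
            damage = []
            for pos in range(len(keep)):
                X_abl = X[:, keep].copy()
                X_abl[:, pos] = X_abl[:, pos].mean()  # "switch off" input
                damage.append(
                    np.mean((y - net.predict(X_abl)) ** 2) - base)
            keep.pop(int(np.argmin(damage)))  # least damaging input goes
        return keep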

3.3. Simulated Annealing (SA) Method

SA has been used in several QSAR studies to solve the descriptor selection problem, such as in ref 26. SA is a global, multivariate optimization method based on the Metropolis Monte Carlo search algorithm.27 The SA optimization algorithm is based on the physical process of annealing. The positions of the atoms in an annealing solid represent the parameters being optimized, and the energy of the solid represents the cost function being optimized. As in annealing, the lowest cost function (energy configuration) is obtained by lowering the temperature slowly. The technique starts from an initial random state and proceeds stepwise through the search space associated with the problem of interest by making a series of small, stochastic steps. As with particle swarms, a cost function maps each state into a value that measures its cost or fitness. After each iteration (e.g., after a variable has been removed), the value of the cost function for the new step is compared to that of the previous step. If the new solution is better than the old one, the removal of the variable is confirmed. If the new solution is worse than the old one, there is still a probability, p, for the removal of the variable to be accepted; this offers the algorithm the possibility of jumping out of a local optimum.28 Otherwise, the removal of the variable is discarded, and the previous step becomes the starting point for the next attempt to eliminate a variable. While downhill transitions are always accepted, uphill transitions are accepted with a probability that decreases with the energy difference between the two states. This probability is computed using the Metropolis acceptance criterion:

p = e^(−ΔE/KT)    (1)

where K is a constant used for scaling purposes and T is an artificial temperature factor that controls the ability of the system to overcome energy barriers. The temperature is systematically adjusted during the simulation in a manner that gradually reduces the probability of high-energy transitions. To circumvent the problem of assigning an appropriate value to K, and to ensure that the transition probability is properly controlled, an adaptive approach can be used. In this approach, K is not a true constant but is continuously adjusted during the course of the simulation on the basis of a running estimate of the mean transition energy.29 In particular, at the end of each transition, the mean transition energy is updated, and the value of K is adjusted so that the acceptance probability for a mean uphill transition at the final temperature is 0.1%. In general, schedules that involve more extensive sampling at lower temperatures seem to perform best, although it is also important that sufficient time be spent at higher temperatures so that the algorithm does not get trapped in local minima.

In a typical QSAR problem, a state represents the set of descriptor weights used in the regression model, and the objective is to minimize the regression error on the training set. Two types of steps have been evaluated: (1) a descriptor is randomly chosen and assigned a new random weight in the interval [0, 1], and (2) all descriptor weights are adjusted by a random value in the interval [−0.25, 0.25], followed by truncation if they exceed the specified boundaries [0, 1].30
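A minimal sketch of SA over binary descriptor subsets follows (Python with NumPy). The single bit-flip move and the geometric cooling schedule are illustrative assumptions; ref 30 instead works with continuous descriptor weights, as described above. Any error function, for example the press_loo helper from the forward selection sketch, can be passed as cost:

    import numpy as np

    def sa_select(X, y, cost, n_steps=2000, t0=1.0, t_min=1e-3, seed=0):
        """Simulated annealing over binary descriptor masks using the
        Metropolis criterion; `cost` maps (X[:, mask], y) to an error."""
        rng = np.random.default_rng(seed)
        mask = rng.random(X.shape[1]) < 0.2
        e = cost(X[:, mask], y)
        cooling = (t_min / t0) ** (1.0 / n_steps)  # geometric schedule
        t = t0
        best_mask, best_e = mask.copy(), e
        for _ in range(n_steps):
            cand = mask.copy()
            cand[rng.integers(len(cand))] ^= True  # flip one descriptor
            if not cand.any():
                continue
            e_new = cost(X[:, cand], y)
            # Metropolis acceptance (eq 1): downhill always, uphill with
            # probability exp(-(e_new - e)/t).
            if e_new < e or rng.random() < np.exp(-(e_new - e) / t):
                mask, e = cand, e_new
                if e < best_e:
                    best_mask, best_e = mask.copy(), e
            t *= cooling
        return best_mask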


3.4. Particle Swarm Optimization (PSO) Method

The PSO algorithm was introduced as an optimization method by Eberhart and Kennedy.31 Similar to the GA, PSO is a population-based optimization algorithm. The optimization procedure is initialized with a population of random solutions, and it searches for optima by updating generations. Unlike the GA, PSO has no evolution operators such as crossover and mutation. In PSO, the potential solutions, called particles, are "flown" through the problem space by following the current optimum particles. Compared to the GA, the advantages of PSO are that it is easy to implement and that there are few parameters to adjust.

The method simulates the behavior of bird flocking, involving the scenario of a group of birds randomly looking for food in an area. Not all of the birds know where the food is located, except the individual that is nearest to it; the effective strategy for the birds is therefore to follow the bird that is nearest to the food. PSO is motivated by this scenario and is used to solve optimization problems. In PSO, each single solution is a particle in the search space. The algorithm models the exploration of a problem space by a population of individuals, or particles. All of the particles have fitness values that are evaluated by a fitness function to be optimized. As in other evolutionary computation techniques, the population of individuals is updated by applying operators according to the fitness information, so that the individuals of the population can be expected to move toward better solution areas. Instead of crossover and mutation operators, each individual in PSO flies through the search space with a velocity that directs its motion.

PSO is initialized with a group of random particles. Each particle is treated as a point in the D-dimensional space defined by the original matrix of descriptors. The ith particle is represented as xi = (xi1, xi2, ..., xiD). The best previous position of the ith particle, that is, the position giving its best fitness value, is represented as pi = (pi1, pi2, ..., piD). The best particle among all of the particles in the population is represented by pg = (pg1, pg2, ..., pgD). The velocity, that is, the rate of position change, of particle i is represented as vi = (vi1, vi2, ..., viD). In every iteration, each particle is updated by following these two best values. After finding the two best values, the particle updates its velocity and position according to the following equations:

vid = vid + c1 × r1 × (pid − xid) + c2 × r2 × (pgd − xid)    (2)

xid = xid + vid    (3)

where c1 and c2 are two positive constants called learning factors, and r1 and r2 are random numbers in the range (0, 1). Equation 2 calculates the particle's new velocity from its previous velocity and the distances of its current position from its own best position and from the group's best position. The particle then flies toward a new position according to eq 3. Such an adjustment of the particle's movement through the space causes it to search around the two best positions. The algorithm terminates when the minimum error criterion is attained or when the number of cycles reaches a user-defined limit. PSO is simple to implement, and there are few parameters to adjust. PSO has been used for selecting descriptors in several QSAR studies.32
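A minimal sketch of PSO-based descriptor selection follows (Python with NumPy). Thresholding the continuous particle coordinates at 0.5 to obtain a descriptor mask is one common encoding, assumed here for illustration:

    import numpy as np

    def pso_select(X, y, cost, n_particles=20, n_iter=100,
                   c1=2.0, c2=2.0, seed=0):
        """PSO over continuous positions in [0, 1]^D; a descriptor is
        "selected" when its coordinate exceeds 0.5."""
        rng = np.random.default_rng(seed)
        D = X.shape[1]
        x = rng.random((n_particles, D))   # positions
        v = np.zeros((n_particles, D))     # velocities

        def f(pos):
            mask = pos > 0.5
            return cost(X[:, mask], y) if mask.any() else np.inf

        p_best = x.copy()
        p_val = np.array([f(xi) for xi in x])
        g_best = p_best[np.argmin(p_val)].copy()
        for _ in range(n_iter):
            r1, r2 = rng.random((2, n_particles, D))
            v = v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)  # eq 2
            x = np.clip(x + v, 0.0, 1.0)                              # eq 3
            vals = np.array([f(xi) for xi in x])
            improved = vals < p_val
            p_best[improved], p_val[improved] = x[improved], vals[improved]
            g_best = p_best[np.argmin(p_val)].copy()
        return g_best > 0.5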

3.5. Automatic Relevance Determination (ARD) Method

Burden and co-workers described the use of Bayesian regularized artificial neural networks as a regression method coupled with automatic relevance determination in the development of QSAR models.33 They explained that the ARD method ensures that irrelevant or highly correlated indices used in the modeling are neglected, as well as showing which variables are the most important in modeling the activity data. In the usual implementation of an ANN as a regression method, a single rate of weight decay α is assumed for all of the network weights, but the scaling properties of networks suggest that weights in different network layers should employ different regularization coefficients. By separating the weights into different classes, MacKay34 developed a technique for soft network pruning called automatic relevance determination. In ARD, the weights are divided into one class for each input (containing all of the weights from that input to the hidden layer), one class for the hidden layer biases, and one class for each output (containing all of the weights from the hidden layer to that output). Inputs with large decay rates have small weights; so, in problems with many input variables, some of which may be irrelevant to the prediction of the output, ARD allows the network to "estimate" the importance of each input, effectively turning off those that are not relevant. This allows all descriptors, including those that have little impact on the output, to be included in the QSAR model without ill effect, as irrelevant descriptors will have their weights reduced automatically. On the other hand, in problems with very large numbers of inputs, it may be more efficient to remove the descriptors with large decay rates and train a new network on the reduced set of inputs, especially if the trained network is to be used to screen a very large virtual database. On the basis of Burden's reports, ARD has two main advantages over other descriptor selection methods: it is firmly based on probability theory, and it is performed automatically.33,35
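The references above implement ARD inside Bayesian regularized neural networks. As a simpler linear illustration of the same principle (a separate precision hyperparameter per input, with large precisions switching descriptors off), scikit-learn's ARDRegression can be used; this is a stand-in for, not an implementation of, the neural network method of refs 33−35:

    import numpy as np
    from sklearn.linear_model import ARDRegression

    def ard_relevance(X, y, top=10):
        """Rank descriptors by the ARD precision on each coefficient."""
        ard = ARDRegression().fit(X, y)
        # Small lambda_ (precision) -> broad prior -> relevant descriptor;
        # large lambda_ pins the coefficient near zero.
        order = np.argsort(ard.lambda_)
        return order[:top], ard.coef_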

3.6. Ant Colony System (ACS) Method

Ant colony optimization (ACO) has been described for descriptor selection in QSAR studies.36 ACO algorithms are stochastic search techniques inspired by the behavior of real ants.37 In nature, real ants are capable of finding the shortest path between a food source and their nest without the use of visual information, and hence without any global world model, while adapting to changes in the environment. The deposition of pheromone is the key feature enabling real ants to find the shortest paths over a period of time. Each ant probabilistically prefers to follow a direction rich in this chemical. The pheromone decays over time, leaving much less pheromone on less popular paths. Because, over time, the shortest path has the highest rate of ant traversal, this path is reinforced and the others diminished until all ants follow the same, shortest path (the "system" has converged to a single solution). On the basis of this idea, artificial ants can be deployed to solve complex optimization problems via artificial pheromone deposition.

ACO is particularly attractive for descriptor selection because there seems to be no heuristic that can guide the search to the optimal minimal subset every time. Additionally, the ants may discover the best descriptor combinations as they proceed through the search space. Descriptors are identified with the dimensions of the space defining the available paths followed by the ants, with permitted coordinates of 1 or 0 (selected and unselected descriptors, respectively, as in the GA). In this way, a given path corresponds to a set of selected descriptors, which in turn corresponds to a given prediction error. In each generation, ants deposit an amount of pheromone that increases with decreasing values of the objective function defined by each path. Two ACO algorithms already described in the literature for descriptor selection are discussed next.

The first, basic algorithm was introduced by Yu and co-workers.36c For a given descriptor selection problem expressed in binary notation, an ant moves in an N-dimensional search space of N variables; its motion is restricted to 0 or 1 in each dimension. State "1" represents the selection of the variable, and state "0" represents the reverse. In the binary variable selection problem, the motion of the ants is determined by the probability of moving to 0 or 1. The pheromone levels are kept on each dimension (variable) rather than on a path and are divided into two kinds, τi1 and τi0, which represent the pheromone of dimension i taking the values 1 and 0, respectively. The pheromone levels corresponding to a dimension taking the value 0 or 1 are updated according to the updating rules:

τi0(new) = ρτi0(old) + Δτi0    (4)

Δτi0 = Σ_{k=1}^{m} Δτi0^(k)    (5)

τi1(new) = ρτi1(old) + Δτi1    (6)

Δτi1 = Σ_{k=1}^{m} Δτi1^(k)    (7)

where Δτi0 and Δτi1 are the increments of pheromone corresponding to dimension i taking the value 0 or 1 in the current cycle, and Δτi0^(k) and Δτi1^(k) are the amounts of pheromone that ant k leaves on variable i in the current cycle.


For each dimension, the intensity of pheromone at time 0 (τi0 and τi1) is set to 0. The pheromone increments are assigned as follows:

Δτi1^(k) = F + FH if the kth ant selected variable i both in the current iteration and in its historical global best solution

Δτi1^(k) = F if the kth ant selected variable i only in the current iteration

Δτi1^(k) = FH if the kth ant selected variable i only in its historical global best solution

Δτi0^(k) = F + FH if variable i was selected by ant k neither in the current iteration nor in its historical global best solution

Δτi0^(k) = F if variable i was not selected by ant k in the current iteration

Δτi0^(k) = FH if variable i was not selected by ant k in its historical global best solution

where F and FH are defined by the fitness function. To improve the convergence velocity, the information FH, which corresponds to the historical global best result of the kth ant, is introduced into the pheromone increments (Δτi0^(k) and Δτi1^(k)). Ant k makes its decisions concerning the variable selection according to the pheromone amounts. The probability of moving to 1 is

pi^(k) = τi1/(τi1 + τi0)    (8)

In the modified ACO, m ants select variables from all N variables according to the probability defined by eq 8. After each selection, the amount of pheromone is updated according to the above equations. This process is iterated until the minimum error criterion is attained or the number of iterations reaches a user-defined limit. In the modified ACO, the pheromone levels are updated not only with the current individual's information but also with each ant's previous, or historical, global best performance, so the positive feedback of information in the modified ACO is quite different from that in the conventional ACO. Using each ant's previous best information, the modified ACO converges quickly toward the optimal position, with satisfactory convergence characteristics. Details of the modified ACO have been described elsewhere.38 In the modified ACO, the increment of pheromone left on a given variable is measured according to a predefined fitness function. The following objective function is applied to descriptor selection in the modified ACO:

F = −lg(RSSp/σ̂²PLS + 2p)    (9)

Here, p is the number of variables included in the model, RSSp is the residual sum of squares of the p-variable model, and σ̂²PLS is defined as the value of RSS corresponding to the minimum number of descriptors of the original data set beyond which a further increase in the number of descriptors does not cause a significant reduction in RSS. The smaller the residual sum of squares of the model and the fewer the descriptors involved in the model, the larger is the fitness function and the higher is the probability that the model is selected.

Shamsipur et al. also reported a novel approach employing an external memory in the ant colony system (ACS), one of the most efficient ant colony optimization algorithms, for solving the descriptor selection problem in QSAR/QSPR studies.36f They named the resulting algorithm the memorized ACS. In this approach, several ACS algorithms are run to build an external memory, which contains a number of elite ants, each of which comes from an independent ACS run. After that, all of the elite ants in the external memory are allowed to update the pheromones. The external memory is then emptied, and the updated pheromones are used again by several ACS algorithms to build a new external memory. These steps are run iteratively for a certain number of iterations. From the results of the case studies carried out in that work, it can be concluded that the performance of the proposed memorized ACS algorithm is greatly improved over the plain ACS algorithm in terms of convergence speed and solution quality.36f This is because the information incorporated from previous iterations leads to solutions of higher quality in shorter time periods; in this way, not only the exploitation of previous information but also the exploration of the search space is enhanced. Another merit of this algorithm is that, at the end of each memorized ACS run, several combinations of appropriate descriptors exhibiting good regression statistics are available, which facilitates an accurate interpretation of the structure−activity/property relationship.36f

4. MISCELLANEOUS METHODS

4.1. Replacement Method (RM)

As discussed in the Introduction, the goal of the descriptor selection process is to search a large set of D descriptors for an optimal subset of d descriptors that minimizes an error function. In the RM, the error function is the standard deviation S, defined as:

S = sqrt( Σ_{i=1}^{N} res_i² / (N − d − 1) )    (10)

where N is the number of molecules in the training set and resi is the residual for molecule i (the difference between the observed and estimated activities). More precisely, we want to obtain the global minimum of S(d), where d ranges over a space of D!/[d!(D − d)!] possible subsets. A full search for the optimal variables therefore requires D!/[d!(D − d)!] linear regressions. The SS method, as mentioned above, consists of a step-by-step addition of descriptors to the regression model, initially without any descriptor present, until there is no variable left outside the equation that would reduce S; SS thus sacrifices accuracy for a much smaller number of linear regressions than a full search.

Duchowicz et al. proposed the RM approach,39 which generates linear regression models quite close to those of the full search with much less computational effort. This method approaches the minimum of S by judiciously considering the relative errors of the coefficients of the least-squares model given by a set of d descriptors d = (X1, X2, ..., Xd). In this algorithm, a set d is first selected at random and used for a linear regression. Next, one of the descriptors of this set, called Xi, is selected and replaced with each of the D descriptors


of the pool D = (X1, X2, ..., XD), D ≫ d (except itself), keeping the best resulting set (i.e., the one with the smallest S). Because one can start by replacing any of the d descriptors in the initial model, there are d possible paths. Next, the variable in the resulting model with the greatest relative error in its coefficient (omitting the one replaced in the previous step) is chosen and replaced with each of the D descriptors (except itself), again keeping the best set. All of the remaining variables are replaced in the same way, omitting those replaced in previous steps. When this is finished, the process starts again with the descriptor having the greatest relative error in its coefficient, and the whole procedure is repeated as many times as necessary until the set of descriptors remains unchanged. At the end, the best model for path i is obtained. Proceeding in exactly the same way for all possible paths i = 1, ..., d, the resulting models are compared and the best one is kept. This calculation can be performed for d = 1, 2, 3, ... to find the overall best model. The algorithm has been used in several QSAR studies.40
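A minimal single-path sketch of the RM follows (Python with NumPy). The full algorithm of ref 39 explores all d starting paths and repeats the sweeps until the subset is unchanged; this sketch, as a simplifying assumption, follows one path for a fixed number of sweeps:

    import numpy as np

    def std_s(X_sub, y):
        """Standard deviation S of eq 10 for a candidate descriptor set."""
        N, d = X_sub.shape
        Xb = np.column_stack([np.ones(N), X_sub])
        beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        res = y - Xb @ beta
        return np.sqrt(res @ res / (N - d - 1))

    def rel_coef_errors(X_sub, y):
        """Relative standard errors of the least-squares coefficients."""
        N, d = X_sub.shape
        Xb = np.column_stack([np.ones(N), X_sub])
        beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        res = y - Xb @ beta
        s2 = res @ res / (N - d - 1)
        cov = s2 * np.linalg.pinv(Xb.T @ Xb)
        se = np.sqrt(np.diag(cov))[1:]        # skip the intercept
        return np.abs(se / beta[1:])

    def replacement_method(X, y, d=4, n_sweeps=10, seed=0):
        rng = np.random.default_rng(seed)
        subset = list(rng.choice(X.shape[1], size=d, replace=False))
        for _ in range(n_sweeps):
            changed = False
            # Replace the descriptor whose coefficient is least reliable.
            pos = int(np.argmax(rel_coef_errors(X[:, subset], y)))
            for cand in range(X.shape[1]):    # try every pool descriptor
                if cand in subset:
                    continue
                trial = subset.copy()
                trial[pos] = cand
                if std_s(X[:, trial], y) < std_s(X[:, subset], y):
                    subset, changed = trial, True
            if not changed:
                break
        return subset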

4.2. k Nearest Neighbors (kNN) Method

kNN as a tool for variable selection in QSAR studies was developed by Tropsha and co-workers,41 who used the technique to build QSAR models for various data sets.41 Briefly, a subset of nvar (number of selected variables) descriptors is selected randomly as a hypothetical descriptor pharmacophore, and nvar is set to different values to obtain the best possible LOO q². In other words, the kNN method optimizes the number of descriptors to achieve a QSAR regression equation with the highest LOO q² as the fitness function, as follows. To initiate the procedure, the following input must be provided: (1) the number of descriptors (d) to be selected from the pool of descriptors for the final best model; (2) the maximum number k of nearest neighbors; (3) the number of descriptors M to be changed at each step of the stochastic descriptor sampling procedure, which utilizes simulated annealing; (4) the starting value Tmax and ending value Tmin of the simulated annealing "temperature" parameter T, together with the factor d < 1 used to decrease T at each step (Tnext = dTprevious); and (5) the number of times N the calculations must be performed before lowering T if q² is not improved.

In the LOO cross-validation procedure, every molecule is eliminated from the data set once, and its activity is then predicted as a weighted average of the activities of its nearest neighbors using the following formula:

ŷ = Σ_{nearest neighbors} yi exp(−di) / Σ_{nearest neighbors} exp(−di)    (11)

where the di are the distances between the compound and its k nearest neighbors. A method of simulated annealing with Metropolis-like acceptance criteria is used to optimize the variable selection. Further details of the kNN method implementation, including a description of the simulated annealing procedure used for stochastic sampling of the descriptor space, are presented by Zheng and Tropsha.41

The original kNN method was enhanced in this study by the use of weighted molecular similarity. In the original method, the activity of each compound was predicted as the algebraic average of the activities of its k nearest neighbor compounds in the training set. In general, however, the Euclidean distances in the descriptor space between a compound and each of its k nearest neighbors are not the same. Thus, a neighbor at a smaller distance from the compound is given a higher weight in calculating the predicted activity, as follows:

wi = exp(−di) / Σ_{k nearest neighbors} exp(−di)    (12)

ŷ = Σ wi yi    (13)

where di is the Euclidean distance between the compound and its ith nearest neighbor, wi is the weight for each individual nearest neighbor, yi is the actual activity value of nearest neighbor i, and ŷ is the predicted activity value. The LOO q² is calculated according to the following expression:

q² = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)²    (14)

where the yi are the experimental activities, the ŷi are defined by eq 13, and ȳ is the average activity; the summations in eq 14 are performed over all molecules of interest.

In summary, the kNN QSAR algorithm generates both an optimum k value and an optimal nvar subset of descriptors, which afford a QSAR model with the highest value of q². Figure 3 shows the overall flowchart of the current implementation of the kNN method.

Figure 3. Flowchart of the kNN method.
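A minimal sketch of the scoring core of this method, eqs 12−14, follows (Python with NumPy). The surrounding simulated annealing loop over descriptor subsets, described above, is omitted for brevity:

    import numpy as np

    def knn_predict(X_train, y_train, X_query, k=3):
        """Distance-weighted kNN activity prediction (eqs 12 and 13)."""
        pred = np.empty(len(X_query))
        for i, x in enumerate(X_query):
            d = np.linalg.norm(X_train - x, axis=1)
            nn = np.argsort(d)[:k]
            w = np.exp(-d[nn])
            w /= w.sum()                       # eq 12
            pred[i] = w @ y_train[nn]          # eq 13
        return pred

    def q2_loo(X, y, k=3):
        """LOO q2 (eq 14) for a given descriptor subset."""
        idx = np.arange(len(y))
        y_hat = np.array([
            knn_predict(X[idx != i], y[idx != i], X[i:i + 1], k)[0]
            for i in idx])
        return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)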


4.3. Successive Projections Algorithm (SPA)

SPA is a forward selection method that starts with one descriptor and incorporates a new one at each iteration until a specified number N of descriptors is reached.42 The purpose of this procedure is to solve collinearity problems by selecting descriptors with minimal redundancy of information content. In QSAR studies, SPA uses a training set and a test set consisting of descriptor data (X) and activities measured experimentally (y). The core of SPA consists of projection operations performed on the calibration matrix Xcal (Kc × J), whose rows and columns correspond to the Kc training molecules and the J descriptors, respectively. A detailed explanation of the projection operations is given elsewhere.42,43 Starting from each of the J descriptors (columns of Xcal) available for selection, SPA builds an ordered chain of Kc descriptors. In building this chain, each element is selected so as to display the least collinearity with the previous ones.

In this context, the collinearity between descriptors is assessed by the correlation between the respective column vectors of Xcal. It is worth noting that, according to this selection criterion, no more than Kc descriptors can be included in the chain.42,43 (In fact, because the columns of Xcal lie in a Kc-dimensional space, it is not possible to choose Kc + 1 columns that are not linearly dependent.) From each of the J chains constructed as above, it is possible to extract Kc subsets of descriptors by using from one up to Kc elements in the order in which they were selected. Thus, a total of J × Kc subsets of variables can be formed. To choose the most appropriate subset, regression models, such as multiple linear regression models, are built and compared in terms of the root-mean-square error of prediction in the validation set (RMSEV), which is calculated as

RMSEV = sqrt( Σ_{k=1}^{Kv} (yvk − ŷvk)² / Kv )    (15)

where yvk and ŷvk are the experimental and predicted values of the activity for the kth test molecule and Kv is the number of test molecules. Such a comparison of the subsets of variables can also be regarded as an optimization of the initial variable (the first element of the chain) and of the number of variables to be selected by SPA (the number of elements extracted from the chain). Because of the simplicity of the algebraic operations involved, the entire procedure (including the building and validation of the regression models) can be carried out in an acceptable time frame for most applications.
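A minimal sketch of the chain-building core of SPA follows (Python with NumPy). It implements the usual description of the projection step, in which every remaining column is projected onto the subspace orthogonal to the most recently selected column; the subsequent RMSEV-based comparison of the J × Kc candidate subsets (eq 15) is omitted:

    import numpy as np

    def spa_chain(Xcal, start, n_select):
        """Build one SPA chain: starting from column `start`, repeatedly
        add the column with the largest norm after projection onto the
        orthogonal complement of the last chosen column (i.e., the least
        collinear candidate)."""
        X = Xcal.astype(float).copy()
        chain = [start]
        for _ in range(n_select - 1):
            ref = X[:, chain[-1]]
            # Project all columns onto the complement of ref.
            proj = X - np.outer(ref, ref @ X) / (ref @ ref)
            proj[:, chain] = 0.0               # exclude chosen columns
            nxt = int(np.argmax(np.linalg.norm(proj, axis=0)))
            chain.append(nxt)
            X = proj
        return chain

    # Usage: build a chain from each starting column j = 0..J-1, extract
    # its prefixes, and keep the subset with the smallest RMSEV (eq 15).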

4.4. Uninformative Variable Elimination-Partial Least Square (UVE-PLS)

In the manual approach, uninformative descriptors are subjectively removed on the basis of either high noise or low effect on the biological activity. To address such uninformative variables objectively, an elimination method (UVE-PLS) was developed to remove uninformative descriptors before regression on the pool of descriptors.44 Artificial random descriptors are added to the data as a reference, so that descriptors that play a less important role in the model than the random variables can be eliminated. A PLS calibration model can be much improved by excluding uninformative variables that have high variance but small covariance with the biological activity y. Model improvement here means a decrease in complexity and/or a decrease in the root-mean-square error of prediction in the cross-validation procedure, RMSECV (an increase in predictive ability).

The UVE-PLS approach uses a cutoff value for the PLS coefficients that is determined by adding irrelevant descriptors to the original data and evaluating the corresponding PLS coefficients. The data matrix X (m × n) is augmented with a matrix N (m × k) containing random numbers of very small magnitude (of the order of 10−10). The number of new descriptors, k, should be higher than 300. These new random variables do not influence the PLS model. The m vectors of regression coefficients, b, are calculated with leave-one-out cross-validation and saved in the matrix B (m × (n + k)). Its first n columns are the regression coefficients related to the experimental variables, and the k remaining columns are related to the uninformative variables (see Figure 4).

Figure 4. Graphical representation of the UVE-PLS model.

The stability of the regression coefficient for the jth descriptor is then defined as:

sj = mean B(:,j) / std B(:,j)    (16)

where mean B(:,j) is the mean value of the m elements of the jth column of B, and std B(:,j) is the standard deviation of the m elements of the jth column of B. The k noisy variables are irrelevant to modeling y; thus, to discriminate stable from unstable regression coefficients, a cutoff value is defined as

cutoff = max(s(n + 1 : n + k))

which is the maximal value of the vector s(n + 1 : n + k) containing the stabilities of the regression coefficients associated with the k noise variables. All of the experimental variables whose regression coefficient stability lies below the cutoff value are considered irrelevant to modeling y and are eliminated from the original data set, because their information content is no higher than that of the random descriptors.
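A minimal sketch of UVE-PLS follows (Python with scikit-learn). Taking absolute values of the stabilities before applying the cutoff is a common implementation detail, assumed here; the small k_noise default is only to keep the example fast, whereas the text above recommends k > 300:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import LeaveOneOut

    def uve_pls(X, y, n_components=2, k_noise=300, scale=1e-10, seed=0):
        """UVE-PLS: augment X with k tiny random columns, collect the
        LOO PLS coefficients, and keep only descriptors whose stability
        exceeds the largest stability among the noise columns."""
        rng = np.random.default_rng(seed)
        m, n = X.shape
        Xa = np.hstack([X, scale * rng.random((m, k_noise))])
        B = []                                   # m rows of coefficients
        for train, _ in LeaveOneOut().split(Xa):
            pls = PLSRegression(n_components=n_components)
            pls.fit(Xa[train], y[train])
            B.append(pls.coef_.ravel())
        B = np.array(B)
        s = B.mean(axis=0) / B.std(axis=0)       # stabilities, eq 16
        cutoff = np.abs(s[n:]).max()             # cutoff from noise columns
        return np.flatnonzero(np.abs(s[:n]) > cutoff)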

5. CONCLUSION

Selecting a small subset from a large pool of descriptors to construct a predictive and reliable QSAR model is an important step in the QSAR modeling procedure. It is known that increasing the number of descriptors in a typical QSAR model will improve the fit to a training set of data, but the inclusion of too many descriptors will often cause a substantial reduction in the predictivity of the generated QSAR model. In general, descriptor selection is very hard to solve, even approximately, with guaranteed performance bounds. During the past decades, the focus has been on improving descriptor selection techniques by enlarging the repertoire of algorithms used and by generating more statistics for validation. As a usual rule of thumb, the ratio n/m should be greater than or equal to 5, where n is the number of molecules of interest and m is the number of descriptors included in the QSAR model.17b

This Review has focused specifically on the common computational methods currently used for descriptor selection in QSAR studies. Several descriptor selection methods were comparatively explained and discussed to provide a better understanding of the fundamental principles of the descriptor selection procedure. Because of space limitations, many promising algorithms and their applications remain beyond the scope of this Review; the methods described therefore highlight the role that descriptor selection algorithms currently play in QSAR studies. Our group believes this area will probably expand in the future due to the increasing variety of descriptors being introduced.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.

Biography

Mohsen Shahlaei was born in Kermanshah in October 1980. He received his M.Sc. degree in Chemistry from Razi University (Kermanshah, Iran) in 2006 and his Ph.D. degree in Medicinal Chemistry from the Isfahan University of Medical Sciences (Isfahan, Iran). He accepted an appointment as Assistant Professor of Medicinal Chemistry at Kermanshah University of Medical Sciences in 2012 and rose through the ranks. Mohsen's research interests are primarily directed toward understanding the structure of, and the important phenomena between, protein receptors and their ligands through the application of molecular dynamics simulation. His research interests also include the foundations of QSAR and the development of the corresponding computer software, as well as the use of mathematical and statistical methods in QSAR studies. He has published over 40 papers in the areas of theoretical, computational, and experimental chemistry.

REFERENCES

(1) Yasri, A.; Hartsough, D. J. Chem. Inf. Comput. Sci. 2001, 41, 1218.
(2) Hansch, C.; Hoekman, D.; Gao, H. Chem. Rev. 1996, 96, 1045.
(3) Merkwirth, C.; Mauser, H.; Schulz-Gasch, T.; Roche, O.; Stahl, M.; Lengauer, T. J. Chem. Inf. Comput. Sci. 2004, 44, 1971.
(4) Dutta, D.; Guha, R.; Wild, D.; Chen, T. J. Chem. Inf. Model. 2007, 47, 989.
(5) Kohavi, R.; John, G. H. Artif. Intell. 1997, 97, 273.
(6) Duch, W. In Feature Extraction: Foundations and Applications; Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L., Eds.; Springer: Berlin, Germany, 2006; Vol. 207.
(7) (a) Goll, E. S.; Jurs, P. J. Chem. Inf. Comput. Sci. 1999, 39, 974. (b) Guha, R.; Jurs, P. C. J. Chem. Inf. Comput. Sci. 2004, 44, 2179.
(8) Tarca, L. A.; Grandjean, B. P. A.; Larachi, F. Ind. Eng. Chem. Res. 2005, 44, 1073.
(9) Liu, Y. J. Chem. Inf. Comput. Sci. 2004, 44, 1823.
(10) Tropsha, A.; Gramatica, P.; Gombar, V. QSAR Comb. Sci. 2003, 22, 69.
(11) (a) Golbraikh, A.; Tropsha, A. J. Mol. Graph. Model. 2002, 20, 269. (b) Saghaie, L.; Shahlaei, M.; Fassihi, A.; Madadkar-Sobhani, A.; Gholivand, M.; Pourhossein, A. Chem. Biol. Drug Des. 2011, 77, 75.
(12) Atkinson, A. Plots, Transformations and Regression; Clarendon Press: Oxford, UK, 1985.
(13) Belsley, D. A.; Kuh, E.; Welsch, R. E. In Applied Linear Regression; Weisberg, S., Ed.; John Wiley & Sons: New York, 1980.
(14) (a) Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Mach. Learn. 2002, 46, 389. (b) Wegner, J. K.; Fröhlich, H.; Zell, A. J. Chem. Inf. Comput. Sci. 2004, 44, 921.
(15) Yang, S. P.; Song, S. T.; Tang, Z. M.; Song, H. F. Acta Pharmacol. Sin. 2003, 24, 897.
(16) Liu, S. S.; Liu, H. L.; Yin, C. S.; Wang, L. S. J. Chem. Inf. Comput. Sci. 2003, 43, 964.
(17) (a) Xu, L.; Zhang, W. J. Anal. Chim. Acta 2001, 446, 475. (b) Zhou, Y.; Xu, L.; Wu, Y.; Liu, B. Chemom. Intell. Lab. 1999, 45, 95.
(18) Furnival, G. M.; Wilson, R. W. Technometrics 1974, 16, 499.
(19) Rogers, D.; Hopfinger, A. J. J. Chem. Inf. Comput. Sci. 1994, 34, 854.
(20) Holland, J. Adaption in Natural and Artificial Systems; The University of Michigan Press: Ann Arbor, MI, 1975.
(21) (a) Hoffman, B. T.; Kopajtic, T.; Katz, J. L.; Newman, A. H. J. Med. Chem. 2000, 43, 4151. (b) Shahlaei, M.; Madadkar-Sobhani, A.; Fassihi, A.; Saghaie, L.; Shamshirian, D.; Sakhi, H. Med. Chem. Res. 2012, 21, 100. (c) Hemmateenejad, B.; Akhond, M.; Miri, R.; Shamsipur, M. J. Chem. Inf. Comput. Sci. 2003, 43, 1328. (d) Saghaie, L.; Shahlaei, M.; Fassihi, A.; Madadkar-Sobhani, A.; Gholivand, M. B.; Pourhossein, A. Chem. Biol. Drug Des. 2011, 77, 75. (e) Fan, Y.; Shi, L. M.; Kohn, K. W.; Pommier, Y.; Weinstein, J. N. J. Med. Chem. 2001, 44, 3254.
(22) (a) Shahlaei, M.; Fassihi, A.; Saghaie, L.; Arkan, E.; Madadkar-Sobhani, A.; Pourhossein, A. J. Enzyme Inhib. Med. Chem. 2013, 28, 16. (b) Shahlaei, M.; Madadkar-Sobhani, A.; Saghaie, L.; Fassihi, A. Expert Syst. Appl. 2012, 39, 6182. (c) Shahlaie, M.; Fassihi, A.; Pourhossein, A.; Arkan, E. Med. Chem. Res. 2013, 22, 1399.
(23) Wikel, J. H.; Dow, E. R. Bioorg. Med. Chem. Lett. 1993, 3, 645.
(24) Rumelhart, D. In Parallel Distributed Processing; Feldman, J., Hayes, P., Rumelhart, D., Eds.; The MIT Press: London, 1982; Vol. 1.
(25) Castellano, G.; Fanelli, A. M. Neurocomputing 2000, 31, 1.
(26) (a) Jung, M.; Tak, J.; Lee, Y.; Jung, Y. Bioorg. Med. Chem. Lett. 2007, 17, 1082. (b) Sutter, J. M.; Dixon, S. L.; Jurs, P. C. J. Chem. Inf. Comput. Sci. 1995, 35, 77. (c) Sutter, J. M.; Kalivas, J. H. Microchem. J. 1993, 47, 60. (d) Ghosh, P.; Bagchi, M. Curr. Med. Chem. 2009, 16, 4032.
(27) Kirkpatrick, S.; Vecchi, M. Science 1983, 220, 671.
(28) (a) Liao, G. C.; Tsao, T. P. Electr. Power Syst. Res. 2004, 70, 237. (b) Nolle, L.; Armstrong, D.; Hopgood, A. A.; Ware, J. Int. J. Know. Intell. Eng. Syst. 2002, 6, 104.
(29) Agrafiotis, D. K. J. Chem. Inf. Comput. Sci. 1997, 37, 841.
(30) Cedeño, W.; Agrafiotis, D. K. J. Comput.-Aided Mol. Des. 2003, 17, 255.
(31) Kennedy, J.; Eberhart, R. Proc. IEEE Int. Conf. Neural Networks, 1995; p 1942.
(32) (a) Shen, Q.; Jiang, J. H.; Jiao, C. X.; Shen, G.; Yu, R. Q. Eur. J. Pharm. Sci. 2004, 22, 145. (b) Lin, W. Q.; Jiang, J. H.; Shen, Q.; Shen, G. L.; Yu, R. Q. J. Chem. Inf. Model. 2005, 45, 486. (c) Shen, Q.; Jiang, J. H.; Jiao, C. X.; Huan, S. Y.; Shen, G.; Yu, R. Q. J. Chem. Inf. Comput. Sci. 2004, 44, 2027. (d) Lü, J. X.; Shen, Q.; Jiang, J. H.; Shen, G. L.; Yu, R. Q. J. Pharm. Biomed. 2004, 35, 679.
(33) Burden, F. R.; Ford, M. G.; Whitley, D. C.; Winkler, D. A. J. Chem. Inf. Comput. Sci. 2000, 40, 1423.
(34) MacKay, D. In Models of Neural Networks III; Domany, E., van Hemmen, J., Schulten, K., Eds.; Springer: New York, 1994.
(35) (a) Burden, F.; Winkler, D. Methods Mol. Biol. 2008, 458, 25. (b) Winkler, D. A.; Burden, F. R. J. Mol. Graph. Model. 2004, 22, 499.
(36) (a) Izrailev, S.; Agrafiotis, D. J. Chem. Inf. Comput. Sci. 2001, 41, 176. (b) Izrailev, S.; Agrafiotis, D. SAR QSAR Environ. Res. 2002, 13, 417. (c) Shen, Q.; Jiang, J. H.; Tao, J.; Shen, G.; Yu, R. Q. J. Chem. Inf. Model. 2005, 45, 1024. (d) Shi, W.; Shen, Q.; Kong, W.; Ye, B. Eur. J. Med. Chem. 2007, 42, 81. (e) Goodarzi, M.; Freitas, M. P.; Jensen, R. Chemom. Intell. Lab. 2009, 98, 123. (f) Shamsipur, M.; Zare-Shahabadi, V.; Hemmateenejad, B.; Akhond, M. Anal. Chim. Acta 2009, 646, 39.
(37) Dorigo, M.; Stützle, T. Ant Colony Optimization; The MIT Press: Cambridge, MA, 2004.


(38) Kubinyi, H. J. Chemom. 1996, 10, 119.
(39) (a) Duchowicz, P. R.; Castro, E. A.; Fernández, F. M.; Gonzalez, M. P. Chem. Phys. Lett. 2005, 412, 376. (b) Duchowicz, P. R.; Fernández, M.; Caballero, J.; Castro, E. A.; Fernández, F. M. Bioorg. Med. Chem. 2006, 14, 5876.
(40) (a) Duchowicz, P. R.; Goodarzi, M.; Ocsachoque, M. A.; Romanelli, G. P.; Ortiz, E. V.; Autino, J. C.; Bennardi, D. O.; Ruiz, D. M.; Castro, E. A. Sci. Total Environ. 2009, 408, 277. (b) Morales, A. H.; Duchowicz, P. R.; Pérez, M. Á. C.; Castro, E. A.; Cordeiro, M. N. D. S.; González, M. P. Chemom. Intell. Lab. 2006, 81, 180. (c) Mercader, A. G.; Duchowicz, P. R.; Fernández, F. M.; Castro, E. A. Chemom. Intell. Lab. 2008, 92, 138. (d) Duchowicz, P. R.; Mercader, A. G.; Fernández, F. M.; Castro, E. A. Chemom. Intell. Lab. 2008, 90, 97. (e) Mercader, A. G.; Duchowicz, P. R.; Fernández, F. M.; Castro, E. A.; Bennardi, D. O.; Autino, J. C.; Romanelli, G. P. Bioorg. Med. Chem. 2008, 16, 7470.
(41) Zheng, W.; Tropsha, A. J. Chem. Inf. Comput. Sci. 2000, 40, 185.
(42) Araújo, M. C. U.; Saldanha, T. C. B.; Galvão, R. K. H.; Yoneyama, T.; Chame, H. C.; Visani, V. Chemom. Intell. Lab. 2001, 57, 65.
(43) Galvão, R. K. H.; Pimentel, M. F.; Araújo, M. C. U.; Yoneyama, T.; Visani, V. Anal. Chim. Acta 2001, 443, 107.
(44) Daszykowski, M.; Stanimirova, I.; Walczak, B.; Daeyaert, F.; De Jonge, M.; Heeres, J.; Koymans, L.; Lewi, P.; Vinkers, H.; Janssen, P. Talanta 2005, 68, 54.
