PVLOO-Based Training Set Selection Improves the External Predictability of QSAR/QSPR Models

Ying Dong,*,† Bingren Xiang,‡ and Ding Du*,†

† Department of Organic Chemistry, College of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, P. R. China
‡ Institute of Pharmaceutical Science, China Pharmaceutical University, 24 Tongjiaxiang, Nanjing 210009, P. R. China
ABSTRACT: In QSAR/QSPR modeling, statistical external validation is the indispensable step for establishing the predictability of a model. A division algorithm is commonly used to select training sets from chemical compound libraries or collections prior to external validation. In this study, a division method based on the posterior variance of leave-one-out cross-validation (PVLOO) of the Gaussian process (GP) has been developed with the goal of producing more predictive models. Four structurally diverse data sets of good quality were collected from the literature and then redeveloped and validated on the basis of different training set selection methods: four types of PVLOO-based training set selection with three covariance functions (squared exponential, rational quadratic, and neural network), the Kennard−Stone algorithm, and random division. The root mean squared error (RMSE) of external validation reported for each model serves as the basis for the final comparison. The results of this study indicate that training sets with higher PVLOO values have statistically better external predictability than training sets generated by the other division methods discussed here. These findings could be explained by proposing that the PVLOO value of GP indicates the mechanism diversity of a specific compound in QSAR/QSPR data sets.
■ INTRODUCTION

The most attractive feature of an ideal QSAR/QSPR model is that it can be used to predict activities or properties of new, unknown compounds. Currently, it is accepted that only models that have been validated externally, after their internal validation, can be considered reliable and applicable for both theoretical and practical purposes.1 Generally, the overall set for QSAR/QSPR studies contains two independent data sets with no intersection between them: a modeling set and an external evaluation set. The modeling set is divided into a training set and a test set. In the modeling process, the model is built with the compounds of the training set and validated internally by predicting the activities or properties of the compounds in the test set. External validation is then performed by predicting the activities or properties of the compounds in the external evaluation set, which were not taken into account during the calibration of the model. This idea is precisely the core concept of "double cross-validation", which has been demonstrated to be associated with enhanced external predictability.2,3

There is no doubt that a rational training set is extremely important for building a "predictive" model, because models developed on different chemicals give different predictions. Commonly, the training set is generated by random division of the modeling set. The main drawback of random division is that it is not directed by any rationale while selecting training set chemicals, so the predictability of models based on random division must be assumed to be uncertain. For this reason, many studies have sought to develop rational division methods that could lead to more predictive models than random division. The most common rational division algorithms include the Kennard−Stone algorithm, Kohonen self-organizing maps, D-optimal design, variations of the sphere exclusion algorithm, etc.4−8 It has been reported that models based on some rational division approaches are superior to models based on simple random division when predicting the test set.9−12 However, Martin et al.13 pointed out that the predictabilities of both types of models are comparable in external validation, even though models based on rational division methods generate better statistical results than models based on random division when predicting the test sets. These observations prompted us to explore a possible new rational division method that could produce a reliable and applicable training set and avoid the proposal of overoptimistic, erroneously called "predictive" QSAR/QSPR models.

Gaussian process (GP) is a probabilistic strategy based on Bayesian inference. For a GP-based QSAR/QSPR model, it can be inferred that the calculated posterior variance indicates the mechanism diversity of a specific compound. It is widely agreed that diversity should be considered a major criterion for
training set selection.1,11,13,14 On the basis of this understanding, we employed the posterior variance of GP as a filter to separate the compounds of insufficient diversity from the compounds that have a higher level of diversity. In this study, four chemically diverse QSAR/QSPR data sets were explored extensively in order to test the hypothesis that training sets with higher values of posterior variance have statistically better external predictability than training sets generated by the commonly used division methods when the focus is on external validation.
■ MATERIALS AND METHODS

Data Preparation. Melting Points Data Set. The melting points data set is a large and structurally diverse data set compiled by Karthikeyan and co-workers.15 It includes 4173 compounds (mtp_mi) extracted from the Molecular Diversity Preservation International (MDPI) database and 277 drugs (mtp_ext) extracted from the Merck Index by Bergström et al.16 A total of 203 2D and alignment-independent 3D descriptors were calculated from the optimized structures with the MOE software package (version 2004.03). All structures were projected onto principal component space by using the 2D and 3D descriptors. In our work, the first 30 principal components retrieved from the literature15 served for further analysis. The overall data points were normalized to have a mean of zero and variance of one. In addition, mtp_mi was randomly divided into a modeling set (mtp_mod, n = 2087), internal evaluation set 1 (mtp_int1, n = 1043), and internal evaluation set 2 (mtp_int2, n = 1043). The mtp_ext set, which was not included in the MDPI data sets, was used for external validation.

DHFR Inhibitors Data Set. A set of structurally diverse dihydrofolate reductase (DHFR) inhibitors was assembled by Sutherland17 from the work of Queener et al., with pIC50 values for the rat liver enzyme ranging from 3.3 to 9.8. Compounds reported with indeterminate activities were removed. A total of 70 2.5D QSAR descriptors (i.e., 2D descriptors and 3D descriptors calculated from CORINA structures and Gasteiger−Marsili charges) were calculated using Cerius2 (version 4.8.1). All of the 2.5D data points were normalized to zero mean and unit variance before subsequent handling. In this research, 124 compounds, which had been selected by "cherry picking" with a maximum dissimilarity algorithm,17,18 were assigned to the external evaluation set (dhfr_ext), and the remaining compounds were assigned to the modeling set (dhfr_mod, n = 237). Dimensional reduction was applied to dhfr_mod in order to take subsets of the molecular descriptors and obtain new modeling sets of appropriate size. Four built-in approaches in the Waikato Environment for Knowledge Analysis (Weka 3.6.0),19−22 (a) correlation-based feature selection (CRFS), (b) consistency-based feature selection (CSFS), (c) wrapper, and (d) filtered subset evaluator, combined with five kinds of searching strategies, (1) best-first search, (2) scatter search, (3) genetic search, (4) stepwise greedy search, and (5) linear forward search, were employed. The combination "c2" was not carried out because of a functional limitation of Weka. The continuous pIC50 values were discretized into three classes by using the package "weka.filters.unsupervised.attribute.Discretize" before dimensional reduction in order to fit the needs of (b), (c1), and (d). The conjunctive rule learner was implemented in the wrapper. Default values were accepted for the remaining parameters of Weka. After dimensional reduction, a total of 19 new modeling sets (dhfr_mod_r1−dhfr_mod_r19, n = 237) were provided for the subsequent computation.

COX2 Inhibitors Data Set. A set of structurally diverse cyclooxygenase-2 (COX2) inhibitors was also assembled by Sutherland17 from the work of Seibert et al., with pIC50 values ranging from 4.0 to 9.0. Compounds reported with indeterminate activities were removed. A total of 74 2.5D QSAR descriptors were calculated using Cerius2 (version 4.8.1). The raw QSAR data points were normalized to have a mean of zero and variance of one. A total of 94 compounds were assigned to the external evaluation set (cox2_ext) according to the literature, and the remaining 188 compounds were assigned to the modeling set (cox2_mod). Dimensional reduction was applied to cox2_mod as described for the DHFR inhibitors data set. A total of 19 new modeling sets (cox2_mod_r1−cox2_mod_r19, n = 188) were employed in the following experiments.

Fub Data Set. The fraction of chemical unbound by plasma proteins (Fub) data set was collected by Ingle et al.23 Ingle's Fub data set is also structurally diverse and consists of four independent parts: 1045 drugs (fub_1), 200 drugs (fub_2), 238 environmental contaminants from ToxCast I (fub_3), and 168 environmental contaminants from ToxCast II (fub_4). The experimental Fub values were converted into pseudoequilibrium constants (ln Ka) for model construction, and the resulting predictions were converted back to Fub for assessment of model accuracy.23 Descriptors for these compounds were calculated with the MOE software. Because of the relatively poor predictions for basic and zwitterionic chemicals reported by the authors,23 fub_1 and fub_2, which contain larger proportions of bases and zwitterions, were each used as an external evaluation set in this study, and fub_mod, the union of the two ToxCast sets (fub_3 and fub_4), was selected as the modeling set. The overall data points were normalized. Three subsets of molecular descriptors generated by the authors23 were involved in the modeling process, and the corresponding three new modeling sets (fub_mod_r1−fub_mod_r3) were built. All of the QSAR/QSPR data sets used in this research are provided in the Supporting Information.
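As a concrete illustration of the normalization step shared by all four data sets, the following is a minimal NumPy sketch. The function name is ours, and the choice to reuse the modeling-set statistics for the evaluation sets is an assumption; the paper states only that the data points were normalized to zero mean and unit variance.

```python
import numpy as np

def autoscale(X_mod, X_eval):
    """Scale descriptors to zero mean and unit variance.

    Assumption: statistics are taken from the modeling set only, so the
    evaluation sets stay strictly external; the paper may instead have
    normalized the pooled data points.
    """
    mu = X_mod.mean(axis=0)
    sd = X_mod.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant descriptors
    return (X_mod - mu) / sd, (X_eval - mu) / sd
```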
Training Set Selection. In this section, different division strategies were applied to the modeling sets of the above-mentioned QSAR/QSPR data sets (i.e., mtp_mod, dhfr_mod_r1−dhfr_mod_r19, cox2_mod_r1−cox2_mod_r19, and fub_mod_r1−fub_mod_r3). For each modeling set, 80% of the compounds were collected to construct a training set, and the remaining compounds were drawn together to act as a test set.

Gaussian Process (GP). In 2006, Rasmussen and Williams elaborated the theory of GP in their book.24 Here, a brief description of GP is provided. The aim of the computation is to construct a nonlinear multiple-input, single-output system. For a set Q of n compounds, Q = {S, a}, the matrix S = {s^(i)}_{i=1}^{n} is a set of molecular descriptors and the vector a = {a^(i)}_{i=1}^{n} contains the corresponding target values. The GP generates an approximating function f(s) which preserves the relationship between structure s^(i) and activity (property) a^(i) in the data set Q. According to Bayesian inference, the posterior P(f(s)|Q) can be written as P(f(s)|Q) ∝ P(a|f(s), S) P(f(s)), where P(a|f(s), S) is the likelihood of observing a given f(s) and S, and the prior P(f(s)) is the probability of f(s) before a is observed.

Assuming that the latent function f(s) is a random variable and that a Gaussian probability distribution governs the function values f(s^(1)), f(s^(2)), ..., f(s^(n)), a GP is fully specified by a mean function (e.g., a mean of zero) and an n × n covariance matrix R: f(s) ∼ N(0, R), where the notation N denotes a normalized Gaussian distribution. Supposing that the observed values a differ from the function values f(s) by additive noise ε with an independent and identical Gaussian distribution of zero mean and variance σ_n², ε ∼ N(0, σ_n²), the prior distribution on the observed values a becomes

    a \sim N(0, C)    (1)

where C denotes the covariance matrix for the noisy observed activities a; C = R + σ_n² I describes the similarity between compounds in the input space, and I denotes an identity matrix. The matrix element of C can be represented as Cov(a^(i), a^(j)) (i, j ∈ 1, 2, ..., n), the covariance of a pair of compounds i and j. For a new input s*, its output a* follows a joint Gaussian distribution

    \begin{pmatrix} a \\ a^{*} \end{pmatrix} \sim N(0, C^{*})    (2)

where C^{*} = \begin{bmatrix} C & k \\ k^{T} & \kappa \end{bmatrix}, k is the vector of covariances between the compounds S and the new input s*, and κ is the autocovariance of s*. The predictive distribution of a* can be obtained by dividing the joint distribution (eq 2) by the distribution (eq 1),25 P(a*) = P(a, a*)/P(a), where a* has a Gaussian distribution a* ∼ N(E(a*), V(a*)). Here, E(a*) denotes the expectation of the predictive distribution of a*, whose most probable value is output as the prediction result of GP, and V(a*) denotes the variance of the predictive distribution of a*, indicating the distance from the new compound s* to the compounds in the data set Q. They can be calculated as E(a*) = k^T C^{-1} a and V(a*) = κ − k^T C^{-1} k.

The last step of GP is to solve the covariance matrix C. The matrix elements of C can be calculated by a kernel function K(s^(i), s^(j)) in input space based on Mercer's theorem.26 Here, three of the most widely used kernel functions (i.e., covariance functions), the squared exponential (SE), rational quadratic (RQ), and neural network (NN) covariance functions, are employed, respectively. The SE covariance function has the parametrized form

    \mathrm{Cov}(a^{(i)}, a^{(j)}) = K(s^{(i)}, s^{(j)}) = \sigma_f^2 \exp\left[-\frac{1}{2}\sum_{m=1}^{M}\left(\frac{s_m^{(i)} - s_m^{(j)}}{l_m}\right)^2\right] + \sigma_n^2\,\delta_{ij}    (3)

where M is the number of molecular descriptors, s_m^(i) is the mth descriptor of the ith compound, and δ_ij is the Kronecker delta function. In general, three free parameters, the length-scales {l_m}_{m=1}^{M}, the signal variance σ_f², and the noise variance σ_n², are defined as the hyperparameters Θ. The hyperparameters can be optimized by maximizing the log marginal likelihood

    \log P(a|S, \Theta) = -\frac{1}{2} a^{T} C^{-1} a - \frac{1}{2}\log|C| - \frac{n}{2}\log 2\pi    (4)

where P(a|S, Θ) is the marginal likelihood with a Gaussian distribution, −(1/2) a^T C^{-1} a is the data-fit term, −(1/2) log|C| is the complexity penalty term, and −(n/2) log 2π is the normalization constant term.

The general form of the RQ covariance function is

    \mathrm{Cov}(a^{(i)}, a^{(j)}) = K(s^{(i)}, s^{(j)}) = \sigma_f^2\left[1 + \frac{1}{2\alpha}\sum_{m=1}^{M}\left(\frac{s_m^{(i)} - s_m^{(j)}}{l_m}\right)^2\right]^{-\alpha} + \sigma_n^2\,\delta_{ij}    (5)

It is equivalent to an infinite scale mixture of SE covariance functions with different characteristic length-scales. The values of α, l, σ_f², and σ_n² can also be optimized by maximizing the log marginal likelihood (eq 4).

The NN covariance function can be written as

    \mathrm{Cov}(a^{(i)}, a^{(j)}) = K(s^{(i)}, s^{(j)}) = \frac{2}{\pi}\sin^{-1}\left[\frac{2\,\tilde{s}^{(i)T}\Sigma\,\tilde{s}^{(j)}}{\sqrt{(1 + 2\,\tilde{s}^{(i)T}\Sigma\,\tilde{s}^{(i)})(1 + 2\,\tilde{s}^{(j)T}\Sigma\,\tilde{s}^{(j)})}}\right] + \sigma_n^2\,\delta_{ij}    (6)

where s̃ = (1, s_1, s_2, ..., s_M)^T is an input vector augmented with one. Here, Σ = diag(σ_0², σ_1², σ_2², ..., σ_M²) is a diagonal weight prior, where σ_0² is the variance of the bias parameter controlling the function's offset from the origin, and σ_1², σ_2², ..., σ_M² are the variances of the weight parameters. The values of the parameters in eq 6 can also be optimized by maximizing the log marginal likelihood (eq 4).

From the above, it can be inferred that a larger value of V(a*) suggests a lower similarity to the compounds in Q and consequently higher heterogeneity (diversity). Since the calculated V(a*) embodies information on both the chemical structures and the target values, this "diversity" should be considered mechanism diversity rather than structural diversity.

Rational Division Based on Posterior Variance of GP. GP models are developed using the modeling sets, and the following steps are required: create structures that define the likelihood and covariance function, define priors for the parameters, create a GP structure, and optimize the parameters of the GP structure. Note that since GP is a nonparametric model, there is no need to worry about whether it is possible for the model to fit the data. The GP algorithms used in this research are mainly based on the source code written by Jarno Vanhatalo et al. (GPstuff, version 4.4).27 The calculations, which are detailed in the Supporting Information, were implemented in MATLAB (MathWorks, Natick, MA), and the SE, RQ, and NN covariance functions were applied, respectively. The leave-one-out cross-validation (LOO−CV) prediction can be computed by gp_loopred in GPstuff, which returns the posterior variance of LOO−CV (PVLOO) for each compound in the modeling set. In order to examine whether training sets with higher values of PVLOO have statistically better external predictability, four types of division methods based on PVLOO were applied (Table 1). Three kinds of covariance functions (SE, RQ, and NN) were employed respectively in GP; so, for each modeling set, the PVLOO-based division method provided 12 different training sets and corresponding test sets.
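The calculations above were run with GPstuff in MATLAB; the following NumPy sketch is not that implementation but illustrates the same quantities for the SE covariance (eq 3): the predictive mean E(a*), the predictive variance V(a*), and the closed-form LOO variance used as PVLOO (Rasmussen and Williams, eq 5.12). Hyperparameter optimization via eq 4 is omitted, all function names are hypothetical, and the assumption that κ includes the noise variance follows the σ_n²δ_ij term in eq 3.

```python
import numpy as np

def se_cov(X1, X2, ell, sf2):
    """SE covariance (eq 3) without the noise term; ell holds one length-scale per descriptor."""
    diff = (X1[:, None, :] - X2[None, :, :]) / ell
    return sf2 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def gp_predict(X, a, X_star, ell, sf2, sn2):
    """Predictive mean E(a*) = k^T C^-1 a and variance V(a*) = kappa - k^T C^-1 k."""
    C = se_cov(X, X, ell, sf2) + sn2 * np.eye(len(X))  # C = R + sigma_n^2 I
    k = se_cov(X, X_star, ell, sf2)                    # covariances to the new inputs
    kappa = sf2 + sn2                                  # autocovariance of s* (noisy output; assumption)
    mean = k.T @ np.linalg.solve(C, a)
    var = kappa - np.sum(k * np.linalg.solve(C, k), axis=0)
    return mean, var

def pvloo(X, ell, sf2, sn2):
    """LOO predictive variance without n refits: sigma_i^2 = 1 / [C^-1]_ii."""
    C = se_cov(X, X, ell, sf2) + sn2 * np.eye(len(X))
    return 1.0 / np.diag(np.linalg.inv(C))
```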
Table 1. Division Methods Based on PVLOO

Type (i):   Step 1: sort all the compounds in a modeling set by PVLOO value. Step 2: no binning. Step 3: select the 80% of the compounds with higher values of PVLOO into the training set. Step 4: put the remaining compounds into the test set.
Type (ii):  Step 1: sort all the compounds by PVLOO value. Step 2: no binning. Step 3: select the 80% of the compounds with lower values of PVLOO into the training set. Step 4: put the remaining compounds into the test set.
Type (iii): Step 1: sort all the compounds by target value (melting point, pIC50, or ln Ka). Step 2: place the compounds in bins of five molecules according to their rank. Step 3: in each bin, select the 80% of the compounds with higher values of PVLOO into the training set. Step 4: put the remaining compounds into the test set.
Type (iv):  Step 1: sort all the compounds by target value (melting point, pIC50, or ln Ka). Step 2: place the compounds in bins of five molecules according to their rank. Step 3: in each bin, select the 80% of the compounds with lower values of PVLOO into the training set. Step 4: put the remaining compounds into the test set.
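To make Table 1 concrete, the following is a minimal sketch of the four division types, assuming the PVLOO values and target values are already available as NumPy arrays (the function and argument names are ours):

```python
import numpy as np

def pvloo_split(pvloo, target, div_type, frac=0.8, bin_size=5):
    """Return (train, test) index arrays for the four division types of Table 1."""
    n = len(pvloo)
    if div_type in ("i", "ii"):                  # global sort by PVLOO
        order = np.argsort(pvloo)                # ascending
        n_train = int(round(frac * n))
        train = order[-n_train:] if div_type == "i" else order[:n_train]
    else:                                        # types (iii)/(iv): bin by target rank
        order = np.argsort(target)               # melting point, pIC50, or ln Ka
        train = []
        for start in range(0, n, bin_size):      # bins of five molecules
            b = order[start:start + bin_size]
            b = b[np.argsort(pvloo[b])]          # sort each bin by PVLOO
            keep = int(round(frac * len(b)))     # 80% of a full bin = 4 of 5
            train.extend(b[-keep:] if div_type == "iii" else b[:keep])
        train = np.asarray(train)
    test = np.setdiff1d(np.arange(n), train)
    return np.sort(train), test
```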
Rational Division Based on the Kennard−Stone Algorithm. The Kennard−Stone (KS) algorithm4 is a popular rational division method when no standard experimental guide can be followed. This algorithm allows the compounds to be divided evenly throughout the descriptor space of the original modeling set; a detailed description has been given by Martin et al.13 (a minimal re-implementation of the selection rule is sketched after this section). In this study, the KS algorithm was carried out using "KennardStone.jar",4 which is packed as a component of "Dataset Division 1.2" and available on the web site of the Drug Theoretics and Cheminformatics Laboratory (DTC Lab).28 For each modeling set, the KS algorithm provides one training set and the corresponding test set.

Random Division. As mentioned above, Martin et al.13 found that the predictabilities of models derived from rational and random division methods are comparable in external validation. For that reason, no further rational division method was applied; instead, each modeling set was randomly divided 100 times, resulting in 100 different training sets and corresponding test sets. This procedure was intended to simulate an exhaustive search over all possible training sets so that the results of the proposed method can be compared with the randomly distributed results.

Assessment of External Evaluation Sets. The assessment of the external evaluation sets is very important for avoiding possible bias, which may lead to unreliable conclusions. Compared to the modeling set, an ideal external evaluation set should be collected from different sources. In QSAR studies, for example, new molecular entities and their activities reported in the literature could constitute an external set. Therefore, data in the external set usually represent a relatively high level of structural or mechanism diversity (larger values of V(a*), from the viewpoint of GP), which is why external predictions are often challenging. As described above, PVLOO can be calculated for each compound in the modeling set. Using the GP model built with the modeling set, V(a*), the variance of the predictive distribution of a*, can be calculated by gp_pred in GPstuff for each compound in the external evaluation set. A two-sample t test (or Wilcoxon rank sum test) was conducted to compare the means (or medians) of the two groups of calculated variances at the 0.05 level of significance. The external evaluation set is considered desirable if the mean (or median) of V(a*) is greater than the mean (or median) of PVLOO. For the detailed procedure, see the Supporting Information.

QSAR/QSPR Method. QSAR/QSPR models were developed with the training sets. For comparison, the modeling sets (i.e., mtp_mod, dhfr_mod_r1−dhfr_mod_r19, cox2_mod_r1−cox2_mod_r19, and fub_mod_r1−fub_mod_r3) were also employed in the modeling process (Figure 1). Seven kinds of built-in procedures in Weka 3.6.0 were applied for modeling and prediction (see the books29,30 by Witten et al. for more details on these algorithms). To simplify the computation, some recommended parameter values from the literature were used.
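The sketch below restates the KS selection rule in its usual formulation (start from the two most distant compounds, then repeatedly add the compound farthest from the already-selected set); it is a plain re-implementation, not the "KennardStone.jar" tool used in this work.

```python
import numpy as np

def kennard_stone(X, n_train):
    """Select n_train compounds spread evenly through descriptor space."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    i, j = np.unravel_index(np.argmax(D), D.shape)              # two most distant points
    selected = [i, j]
    remaining = [r for r in range(len(X)) if r not in (i, j)]
    while len(selected) < n_train:
        # for each candidate, distance to its nearest already-selected compound
        dmin = D[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining.pop(int(np.argmax(dmin)))
        selected.append(pick)
    return np.array(selected)
```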
Figure 1. Overview of workflow. The mtp_mod does not go through dimensional reduction.
(1) SMO31,32 implements Alex Smola and Bernhard Schölkopf's sequential minimal optimization algorithm for training a support vector regression model. Two kernels were used, respectively. SMO(1), the Pearson VII function-based universal kernel (puk):

    K(x_i, x_j) = 1 \Big/ \left[1 + \left(\frac{2\,\|x_i - x_j\|\sqrt{2^{1/\omega} - 1}}{\sigma}\right)^2\right]^{\omega}

SMO(2), the NormalizedPolyKernel, i.e., the polynomial kernel scaled by its self-similarities:

    K(x_i, x_j) = \frac{(x_i^T x_j + 1)^p}{\sqrt{(x_i^T x_i + 1)^p\,(x_j^T x_j + 1)^p}}

(2) SVM31,32 implements the support vector machine for regression. This algorithm, which was used only for the Fub data set, is selected by setting the RegOptimizer; the default RegOptimizer (RegSMOImproved) was used. Parameter values were set according to the literature:23 complexity parameter C = 50 and kernel = RBFKernel, K(x_i, x_j) = exp(−γ‖x_i − x_j‖²). (NumPy transcriptions of these kernel formulas follow the list below.)
(3) MultilayerPerceptron (MLP) is a classifier that uses a feedforward artificial neural network to classify instances. All nodes in this network are sigmoid except for the output node, which is an unthresholded linear unit. Two groups of values were used. MLP(1): HiddenLayers = 10, LearningRate = 0.01, Momentum = 0.3; MLP(2): HiddenLayers = (number of molecular descriptors + 1)/2, LearningRate = 0.3, Momentum = 0.2.
Table 2. Data Set and Relevant Information

Data set         Numbers of reduced modeling sets   Numbers of training sets   Numbers of test sets
Melting points   0a                                 113                        113
DHFR             19                                 2147                       2147
COX2             19                                 2147                       2147
Fub              3                                  339                        339

a The mtp_mod does not go through dimensional reduction.

Table 3. Results of the Two-Sample t Test (or Wilcoxon Rank Sum Test) for the Melting Points Data Set

Covariance function   mtp_ext   mtp_int1   mtp_int2
NN                    1a        2b         0c
RQ                    1         0          0
SE                    1         0          0

a 1 indicates that the mean (or median) of V(a*) is greater than the mean (or median) of PVLOO. b 2 indicates that the mean (or median) of V(a*) is lower than the mean (or median) of PVLOO. c 0 indicates that the mean (or median) of V(a*) is equal to the mean (or median) of PVLOO.
(4) IBk33 is a k-nearest-neighbors classifier. In this study, it implements a brute-force search algorithm for nearest-neighbor search, with the Euclidean distance function employed to find the neighbors. For the Fub data set, the number of neighbors was set to 10 according to the literature;23 for the other data sets, default parameter values were used.
(5) KStar34 is an instance-based classifier. An entropy-based distance function is specified to determine the similarity between instances.
(6) Bagging35 is an ensemble classifier that reduces variance; here it was built on the fast decision tree learner. The size of each bag was set to 100% of the training set size.
(7) M5P implements the base routines for generating M5 model trees and rules. The original M5 algorithm was invented by Quinlan,36 and improvements were made by Wang.37
A 10-fold cross-validation (10-fold CV) was carried out in all cases. Models derived from these approaches were used to make new predictions for the compounds of the test set (if any), the internal evaluation sets (if any), and the external evaluation set.
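For reference, the kernels quoted above for SMO(1), SMO(2), and the SVM translate directly into code. This NumPy sketch uses placeholder parameter values (ω, σ, p, γ) that are not claimed to be the Weka defaults or the settings used in this work.

```python
import numpy as np

def puk(xi, xj, omega=1.0, sigma=1.0):
    """Pearson VII function-based universal kernel (SMO(1))."""
    r = np.linalg.norm(xi - xj)
    return 1.0 / (1.0 + (2.0 * r * np.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma) ** 2) ** omega

def normalized_poly(xi, xj, p=2):
    """NormalizedPolyKernel (SMO(2)): polynomial kernel scaled by its self-similarities."""
    poly = lambda u, v: (u @ v + 1.0) ** p
    return poly(xi, xj) / np.sqrt(poly(xi, xi) * poly(xj, xj))

def rbf(xi, xj, gamma=0.01):
    """RBFKernel used with the SVM on the Fub data set."""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))
```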
Figure 2. Visualization of bivariate relationships between PVLOO values of NN (neural network), RQ (rational quadratic), and SE (squared exponential) covariance function-based GP models (melting points data set, n = 2087). The histograms displayed on the diagonal show the distribution of each variable. The results are derived from the models built with mtp_mod.
Figure 3. RMSE of test sets (n = 418) for different training set division methods (melting points data set). The results are derived from the SMO(2) models built with 113 training sets produced in training set selection. NN (neural network), RQ (rational quadratic), and SE (squared exponential) refer to the covariance functions used in the PVLOO-based division method (see Gaussian Process). Here, (i), (ii), (iii), and (iv) refer to the types of PVLOO-based division methods (Table 1). For example, NN(ii) indicates the type (ii) PVLOO-based division method with neural network covariance function. Type (i) uses a training set with higher values of PVLOO. Type (iii) also uses a training set with higher values of PVLOO, and it explores the impact of the target value (melting point, pIC50, or ln Ka) as well. On the contrary, type (ii) and type (iv) use a training set with lower values of PVLOO. Different colors are used to represent them for easy understanding: red for type (i) and (iii), light yellow for type (ii) and (iv), purple for KS algorithm, and green for random division.
Figure 4. RMSE of mtp_int1 (n = 1043) for different training set division methods (melting points data set). The results are derived from the SMO(2) models built with mtp_mod and 113 training sets produced in training set selection. The y-axis is log scale. “Na” (blue) means no data set division approach was applied; so the model is built directly from modeling set mtp_mod. Remaining color scheme explained in Figure 3.
■ RESULTS AND DISCUSSION

Training Set Selection. Table 2 shows the numbers of reduced modeling sets, training sets, and test sets of the four QSAR/QSPR data sets. For each reduced modeling set, 113 training sets were produced (100 by random division, 1 by the KS algorithm, and 12 by PVLOO-based training set selection). For the PVLOO-based training set selection, the PVLOO value of each compound in every reduced modeling set was computed. Figure 2 and Figures S1−S3 (Supporting Information) visualize the bivariate relationships between the NN, RQ, and SE covariance function-based PVLOO values. Almost all of the histograms are positively skewed, which indicates that the majority of compounds have relatively low PVLOO values and that their underlying mechanisms are quite similar and common to most of the compounds in the reduced modeling set. This accords with chemical intuition and supports the proposed relationship between the PVLOO value and mechanism diversity; it also opens the possibility for researchers to understand the common mechanism more fully. On the other hand, the minority of compounds with relatively high PVLOO values might have distinct mechanisms of action. As described above, diversity (especially mechanism diversity) is a key factor in achieving good predictability. It can therefore be inferred that a training set which includes the compounds with relatively high PVLOO values might have better predictability. Furthermore, the NN, RQ, and SE covariance function-based PVLOO values are positively related to each other, which indicates that the results from these three kinds of covariance functions corroborate one another.

Assessment of External Evaluation Sets. For the melting points data set, GP models were built with mtp_mod, and the three kinds of covariance functions were employed. Table 3 summarizes the results of the two-sample t test (or Wilcoxon rank sum test). The results suggest that mtp_ext is more diverse than mtp_mod and is a good option for external evaluation. As described above, mtp_mod and mtp_ext were extracted from two independent databases, whereas mtp_mod, mtp_int1, and mtp_int2 were generated from the same database by random division. So, as shown in Table 3, mtp_int1 and mtp_int2 are not more diverse than mtp_mod and might not be suitable options for external evaluation. The results of the assessment for the other three data sets are available in the Supporting Information; they support that dhfr_ext, cox2_ext, fub_1, and fub_2 are all acceptable for external evaluation.

Modeling and Prediction. Melting Points Data Set. Five kinds of models (Bagging, M5P, MLP(1), SMO(1), and SMO(2)) perform better than the others when the RMSE (root mean squared error) is considered as a measure of model fit (Supporting Information, Figures S4−S8). The results of SMO(2) are presented here as an illustration; other results are included in the Supporting Information.
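A sketch of the assessment described above, comparing V(a*) of an external evaluation set with PVLOO of the modeling set, could look like the following, assuming SciPy is available. The paper does not specify how the choice between the t test and the rank sum test was made, so the normality check here is our own assumption.

```python
import numpy as np
from scipy import stats

def assess_external_set(pvloo_mod, v_ext, alpha=0.05):
    """Return True if the external set looks more diverse than the modeling set."""
    # assumed rule: use Welch's t test when both samples look normal,
    # otherwise the Wilcoxon rank sum test on the medians
    normal = all(stats.shapiro(x)[1] > alpha for x in (pvloo_mod, v_ext))
    if normal:
        _, p = stats.ttest_ind(v_ext, pvloo_mod, equal_var=False)
        higher = np.mean(v_ext) > np.mean(pvloo_mod)
    else:
        _, p = stats.ranksums(v_ext, pvloo_mod)
        higher = np.median(v_ext) > np.median(pvloo_mod)
    return bool(higher and p < alpha)  # desirable external evaluation set
```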
Figure 5. RMSE of mtp_int2 (n = 1043) for different training set division methods (melting points data set). The results are derived from the SMO(2) models built with mtp_mod and 113 training sets produced in training set selection. “Na” means the model is built directly from modeling set mtp_mod. The y-axis is log scale. Color scheme explained in Figures 3 and 4.
Figure 6. RMSE of mtp_ext (n = 277) for different training set division methods (melting points data set). The results are derived from the SMO(2) models built with mtp_mod and 113 training sets produced in training set selection. “Na” means the model is built directly from modeling set mtp_mod. The y-axis is log scale. Color scheme explained in Figures 3 and 4.
The RMSEs of training (Supporting Information, Figure S9) and 10-fold CV (Supporting Information, Figure S10) show striking regularities depending on the type of PVLOO-based division method (Table 1). The RMSE of the type (ii) PVLOO-based division method (NN(ii), RQ(ii), and SE(ii)) is always smaller than that of the type (i) methods (NN(i), RQ(i), and SE(i)) for any given kernel function, and the same pattern is observed between the type (iv) and type (iii) methods. This means that the training sets with lower values of PVLOO perform better in training and 10-fold CV. When predicting the test set (Figure 3 and Figure S11), the pattern reverses; i.e., for any given kernel function, the RMSE of the type (ii/iv) PVLOO-based division methods is always higher than that of the type (i/iii) methods. That is, the training sets with higher values of PVLOO are predictive when evaluating the melting points of the compounds with lower values of PVLOO, whereas the training sets with lower values of PVLOO are less predictive when evaluating the melting points of the compounds with higher values of PVLOO.

When predicting mtp_int1 (Figure 4 and Figure S12), mtp_int2 (Figure 5 and Figure S13), and mtp_ext (Figure 6 and Figure S14), the pattern is still clear-cut for M5P, MLP(1), and SMO(2) but somewhat blurred for Bagging (Figures S12−S14) and SMO(1) (Figure S14). In most cases (mtp_int1: 90%, mtp_int2: 83%, mtp_ext: 80%), the training sets with higher values of PVLOO are more predictive than the training sets with lower values of PVLOO.

In order to compare the proposed method with the random division method, Student's t test was conducted (Table 4 and Table S4). For each modeling algorithm, the difference between the RMSEs of the 100 random divisions and that of a PVLOO-based division method was determined by a right-tailed or two-tailed t test at the 0.05 level of significance. The null hypothesis is that the RMSEs of the 100 random divisions (X) come from a normal distribution with a mean equal to the RMSE of one specific PVLOO-based division method (m) and unknown variance. We also distinguished the performance of the proposed method from that of the KS algorithm and "Na" (meaning that no data set division approach was applied) by simply comparing their RMSEs (Table 4 and Table S4). In Table 4 and Table S4, the results of all eight algorithms (IBk, KStar, Bagging, M5P, SMO(1), SMO(2), MLP(1), and MLP(2)) are summarized. As shown in Table 4, when predicting the corresponding test set and external evaluation set, the proposed method (i.e., the type (i) PVLOO-based division method: NN(i), RQ(i), or SE(i)) performs better than random division, the KS algorithm, or "Na". For the external prediction (mtp_ext), the best result shows that 88% of the training sets generated by RQ(i) have lower RMSEs than random division and 0% have comparable RMSEs; the worst result shows 63% and 13%, meaning that only 24% (1 − 0.63 − 0.13 = 0.24) of the training sets generated by NN(i) have higher RMSEs than random division. The performance of type (iii) is not better than that of type (i), which suggests that sorting by the target value might not help external prediction. When predicting the internal evaluation sets, no overwhelming superiority could be found for the type (i) PVLOO-based division method over random division (or the KS algorithm), but the performance of type (iii) seems to be better than that of type (i). In addition, the performances of the type (i) and type (iii) PVLOO-based division methods are better than "Na" when predicting the external evaluation set.
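In SciPy (version 1.6 or later for the `alternative` keyword), the two MATLAB calls quoted in the footnote of Table 4 below have direct one-sample counterparts; a minimal sketch:

```python
from scipy import stats

def compare_to_random(rmse_random, rmse_method, alpha=0.05):
    """rmse_random: RMSEs of the 100 random divisions (X);
    rmse_method: RMSE of one PVLOO-based division (m)."""
    right = stats.ttest_1samp(rmse_random, popmean=rmse_method, alternative="greater")
    two_sided = stats.ttest_1samp(rmse_random, popmean=rmse_method)
    return {
        "random_worse": right.pvalue < alpha,     # H0 rejected: random RMSEs exceed m
        "comparable": two_sided.pvalue >= alpha,  # H0 not rejected: m within random spread
    }
```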
Table 4. Summary of Training Set Selection (Melting Points Data Set, Type (i) and Type (iii) PVLOO-Based Division Methods)

vs random divisiona (% rejected / % not rejected):
           Test set (n = 418)   mtp_int1 (n = 1043)   mtp_int2 (n = 1043)   mtp_ext (n = 277)
NN(i)      63 / 0               38 / 0                38 / 0                63 / 13
RQ(i)      100 / 0              13 / 25               50 / 25               88 / 0
SE(i)      75 / 13              25 / 13               38 / 25               75 / 0
NN(iii)    13 / 0               50 / 38               75 / 13               75 / 13
RQ(iii)    88 / 0               63 / 13               50 / 25               75 / 13
SE(iii)    75 / 13              63 / 13               38 / 25               50 / 13

vs KS algorithmb (%):
NN(i)      63                   13                    25                    63
RQ(i)      100                  13                    13                    88
SE(i)      88                   13                    13                    75
NN(iii)    50                   25                    38                    75
RQ(iii)    88                   13                    13                    63
SE(iii)    75                   25                    13                    63

vs "Na"c (%):
NN(i)      n/a                  25                    38                    88
RQ(i)      n/a                  25                    63                    75
SE(i)      n/a                  38                    38                    75
NN(iii)    n/a                  38                    88                    75
RQ(iii)    n/a                  63                    75                    63
SE(iii)    n/a                  75                    63                    50

a In each cell, the former is the percentage of those whose null hypothesis is rejected, i.e., H = ttest(X, m, 'tail', 'right') returns 1, while the latter is the percentage of those whose null hypothesis cannot be rejected, i.e., H = ttest(X, m) returns 0. b Percentage of those whose RMSE is lower than that of the KS algorithm. c Percentage of those whose RMSE is lower than that of "Na" ("Na" builds no test set, hence n/a in the test-set column).
Figure 7. RMSE of test sets (n = 48) for different training set division methods (DHFR inhibitors data set). The results are derived from the SMO(1) models built with 113 training sets produced in training set selection on dhfr_mod_r7. Color scheme explained in Figure 3.
Figure 8. RMSE of dhfr_ext (n = 124) for different training set division methods (DHFR inhibitors data set). The results are derived from the SMO(1) models built with dhfr_mod_r7 and 113 training sets produced in training set selection. "Na" means the model is built from modeling set dhfr_mod_r7. Color scheme explained in Figures 3 and 4.
On the contrary, the performances of the type (ii/iv) PVLOO-based division methods (Table S4) are far worse than those of the type (i/iii) methods for each prediction domain. It can be assumed that the compounds selected by the proposed method represent the whole modeling set and might have remarkable external predictability.

DHFR Inhibitors Data Set. The SMO(1) algorithm, which has lower RMSEs than the others, was selected for further analysis (Supporting Information, Figures S15−S17). For conciseness, only the results derived from the model built with dhfr_mod_r7 and its 113 corresponding training sets are presented here as an illustration; other results are included in the Supporting Information.
Table 5. Summary of Training Set Selection (DHFR Inhibitors Data Set, Type (i) and Type (iii) PVLOO-Based Division Methods)

vs random divisiona (% rejected / % not rejected):
           Test set (n = 48)   dhfr_ext (n = 124)
NN(i)      71 / 11             68 / 12
RQ(i)      67 / 14             55 / 18
SE(i)      76 / 13             57 / 14
NN(iii)    37 / 13             54 / 17
RQ(iii)    57 / 12             47 / 20
SE(iii)    53 / 17             53 / 16

vs KS algorithmb (%):
NN(i)      61                  61
RQ(i)      51                  58
SE(i)      57                  55
NN(iii)    37                  57
RQ(iii)    51                  45
SE(iii)    48                  53

vs "Na"c (%):
NN(i)      n/a                 37
RQ(i)      n/a                 32
SE(i)      n/a                 32
NN(iii)    n/a                 34
RQ(iii)    n/a                 22
SE(iii)    n/a                 27

a In each cell, the former is the percentage of those whose null hypothesis is rejected, i.e., H = ttest(X, m, 'tail', 'right') returns 1, while the latter is the percentage of those whose null hypothesis cannot be rejected, i.e., H = ttest(X, m) returns 0. b Percentage of those whose RMSE is lower than that of the KS algorithm. c Percentage of those whose RMSE is lower than that of "Na".
Figure 10. RMSE of cox2_ext (n = 94) for different training set division methods (COX2 inhibitors data set). The results are derived from the SMO(2) models built with cox2_mod_r8 and 113 training sets produced in training set selection. "Na" means the model is built from modeling set cox2_mod_r8. Color scheme explained in Figures 3 and 4.
Table 6. Summary of Training Set Selection (COX2 Inhibitors Data Set, Type (i) and Type (iii) PVLOO-Based Division Methods)

vs random divisiona (% rejected / % not rejected):
           Test set (n = 38)   cox2_ext (n = 94)
NN(i)      30 / 15             71 / 11
RQ(i)      51 / 11             69 / 7
SE(i)      44 / 8              72 / 13
NN(iii)    27 / 20             64 / 8
RQ(iii)    43 / 13             64 / 12
SE(iii)    38 / 18             66 / 16

vs KS algorithmb (%):
NN(i)      38                  62
RQ(i)      48                  64
SE(i)      43                  64
NN(iii)    41                  60
RQ(iii)    46                  59
SE(iii)    40                  63

vs "Na"c (%):
NN(i)      n/a                 59
RQ(i)      n/a                 61
SE(i)      n/a                 64
NN(iii)    n/a                 49
RQ(iii)    n/a                 54
SE(iii)    n/a                 62

a In each cell, the former is the percentage of those whose null hypothesis is rejected, i.e., H = ttest(X, m, 'tail', 'right') returns 1, while the latter is the percentage of those whose null hypothesis cannot be rejected, i.e., H = ttest(X, m) returns 0. b Percentage of those whose RMSE is lower than that of the KS algorithm. c Percentage of those whose RMSE is lower than that of "Na".
Figure 9. RMSE of test sets (n = 38) for different training set division methods (COX2 inhibitors data set). The results are derived from the SMO(2) models built with 113 training sets produced in training set selection on cox2_mod_r8. Color scheme explained in Figure 3.
For training (Supporting Information, Figure S18) and 10-fold CV (Supporting Information, Figure S19), the RMSE of the type (ii) PVLOO-based division method is always smaller than the RMSE of the type (i) method for any given kernel function, and no consistent pattern could be found between the type (iii) and type (iv) PVLOO-based division methods. When predicting the test set (Figure 7), the RMSE of the type (i/iii) PVLOO-based division methods is always lower than that of the type (ii/iv) methods for any given kernel function. When predicting dhfr_ext (Figure 8), the same pattern is observed.

Table 5 and Table S5 summarize the performance of the different training set selection approaches; the statistical procedure was the same as described for Table 4.
Figure 11. RMSE of test sets (n = 82) for different training set division methods (Fub data set). The results are derived from the models built with 113 training sets produced in training set selection on fub_mod_r3. Color scheme explained in Figure 3.
In Table 5 and Table S5, the results of each modeling set (dhfr_mod_r1−dhfr_mod_r19) and each algorithm (IBk, KStar, Bagging, M5P, SMO(1), SMO(2), MLP(1), and MLP(2)) are summarized. They show that the type (i) PVLOO-based division method performs better than random division or the KS algorithm when predicting the test set or the external evaluation set. For the external prediction (dhfr_ext), the best result shows that 68% of the training sets generated by NN(i) have lower RMSEs than random division and 12% have comparable RMSEs; the worst result shows 55% and 18%, meaning that only 27% (1 − 0.55 − 0.18 = 0.27) of the training sets generated by RQ(i) have higher RMSEs than random division. The performance of type (iii) is not better than that of type (i). All of the PVLOO-based division methods perform worse than "Na", indicating that these training sets could not represent the whole modeling set. The performance of the type (ii/iv) PVLOO-based division methods (Table S5) is still worse than that of the type (i/iii) methods for each prediction domain.

COX2 Inhibitors Data Set. Five algorithms, Bagging, M5P, MLP(1), SMO(1), and SMO(2), which have lower RMSEs than the others, were selected for further analysis (Supporting Information, Figures S20−S22). Only the results derived from the SMO(2) model built with cox2_mod_r8 and its 113 corresponding training sets are presented here as an illustration; other results are included in the Supporting Information.

For training (Supporting Information, Figure S23) and 10-fold CV (Supporting Information, Figure S24), no consistent pattern could be found between the type (i/iii) and type (ii/iv) PVLOO-based division methods. When predicting the test set (Figure 9, Figure S25), the RMSE of the type (i/iii) PVLOO-based division methods is smaller than that of the type (ii/iv) methods in the majority of cases (93%). When predicting cox2_ext (Figure 10, Figure S26), with few exceptions (7%), the RMSE of the type (i/iii) PVLOO-based division methods is lower than that of the type (ii/iv) methods for any given kernel function.

Table 6 and Table S6 summarize the performance of the different training set selection approaches; the statistical procedure was the same as described for Table 4. In Table 6 and Table S6, the results of each modeling set (cox2_mod_r1−cox2_mod_r19) and each algorithm (IBk, KStar, Bagging, M5P, SMO(1), SMO(2), MLP(1), and MLP(2)) are summarized. They show that the type (i) PVLOO-based division method performs better than random division, the KS algorithm, or "Na" when predicting the external evaluation set. For the external prediction (cox2_ext), the best result shows that 72% of the training sets generated by SE(i) have lower RMSEs than random division and 13% have comparable RMSEs; the worst result shows 69% and 7%, meaning that only 24% (1 − 0.69 − 0.07 = 0.24) of the training sets generated by RQ(i) have higher RMSEs than random division. The performance of type (iii) is not better than that of type (i). When predicting the corresponding test set, no overwhelming superiority could be found for the type (i) PVLOO-based division method over random division (or the KS algorithm). The performance of the type (ii/iv) PVLOO-based division methods (Table S6) is still worse than that of the type (i/iii) methods for each prediction domain, especially when predicting the external evaluation set.

Fub Data Set. SMO(2) and SVM, which have lower RMSEs than the others (Supporting Information, Figures S27−S30), were selected for further analysis. Only the results derived from the model built with fub_mod_r3 and its 113 corresponding training sets are presented here as an illustration; other results are included in the Supporting Information.

For training (Supporting Information, Figure S31) and 10-fold CV (Supporting Information, Figure S32), the RMSE of the type (ii) PVLOO-based division method is always smaller than that of the type (i) method for any given kernel function, and the same holds between the type (iv) and type (iii) methods. When predicting the test set (Figure 11), the RMSE of the type (i) PVLOO-based division method is always lower than that of the type (ii) method for any given kernel function. When predicting the external evaluation sets (Figures 12 and 13), the same pattern is observed. On the other hand, as shown in Figures 11−13, no consistent pattern could be found between the type (iii) and type (iv) PVLOO-based division methods.
Figure 12. RMSE of fub_1 (n = 1045) for different training set division methods (Fub data set). The results are derived from the models built with fub_mod_r3 and 113 training sets produced in training set selection. "Na" means the model is built from modeling set fub_mod_r3. Color scheme explained in Figures 3 and 4.

Figure 13. RMSE of fub_2 (n = 200) for different training set division methods (Fub data set). The results are derived from the models built with fub_mod_r3 and 113 training sets produced in training set selection. "Na" means the model is built from modeling set fub_mod_r3. Color scheme explained in Figures 3 and 4.
Table 7 and Table S7 summarize the performance of the different training set selection approaches; the statistical procedure was the same as described for Table 4. In Table 7 and Table S7, the results of each modeling set (fub_mod_r1−fub_mod_r3) and each algorithm (IBk, KStar, Bagging, M5P, SMO(1), SMO(2), MLP(1), MLP(2), and SVM) are summarized. They show that the type (i) PVLOO-based division method performs better than random division or the KS algorithm when predicting the test set or the external evaluation sets. For the external predictions (fub_1 and fub_2), the best result shows that 70% of the training sets generated by NN(i) have lower RMSEs than random division and 19% have comparable RMSEs; even the worst result shows 48% and 26%, meaning that only 26% (1 − 0.48 − 0.26 = 0.26) of the training sets generated by RQ(i) have higher RMSEs than random division. Generally speaking, the performance of type (iii) is not better than that of type (i) except when predicting fub_2. Almost all of the PVLOO-based division methods perform worse than "Na", indicating that these training sets could not represent the whole modeling set. The performance of the type (ii) PVLOO-based division method (Table S7) is still far worse than that of the type (i) method for each prediction domain. For the type (iv) and type (iii) PVLOO-based division methods, a similar pattern is observed except when predicting fub_1.
Table 7. Summary of Training Set Selection (Fub Data Set, Type (i) and Type (iii) PVLOO-Based Division Methods)

vs random divisiona (% rejected / % not rejected):
           Test set (n = 82)   fub_1 (n = 1045)   fub_2 (n = 200)
NN(i)      93 / 7              70 / 19            56 / 15
RQ(i)      78 / 7              48 / 26            56 / 22
SE(i)      70 / 4              59 / 22            52 / 22
NN(iii)    26 / 11             44 / 30            63 / 19
RQ(iii)    37 / 7              52 / 22            63 / 22
SE(iii)    30 / 22             59 / 19            70 / 15

vs KS algorithmb (%):
NN(i)      70                  78                 67
RQ(i)      59                  59                 59
SE(i)      48                  70                 63
NN(iii)    15                  63                 74
RQ(iii)    15                  44                 67
SE(iii)    26                  56                 74

vs "Na"c (%):
NN(i)      n/a                 44                 33
RQ(i)      n/a                 33                 22
SE(i)      n/a                 44                 33
NN(iii)    n/a                 30                 44
RQ(iii)    n/a                 33                 48
SE(iii)    n/a                 48                 56

a In each cell, the former is the percentage of those whose null hypothesis is rejected, i.e., H = ttest(X, m, 'tail', 'right') returns 1, while the latter is the percentage of those whose null hypothesis cannot be rejected, i.e., H = ttest(X, m) returns 0. b Percentage of those whose RMSE is lower than that of the KS algorithm. c Percentage of those whose RMSE is lower than that of "Na".

■ CONCLUSIONS

The use of the type (i) PVLOO-based division method yields a superior estimate of external predictability. For this method, all of the compounds in a modeling set are sorted by PVLOO, and the 80% of compounds with higher PVLOO values are selected into the training set. According to the results, it performs better than the KS algorithm, random division, and the other types of PVLOO-based division methods. The main reason may be that the training set produced by the type (i) PVLOO-based division method preserves more mechanism diversity. In QSAR/QSPR studies, random division has been widely used for a long time. This study indicates that random division can give a good prediction, but the breadth of its distribution is indicative of the instability of this approach. In addition, based on the GP model, a useful approach was introduced for assessing the external evaluation sets; we found that it helps to develop an evolving understanding of the external data set.

Although all of the results strongly support the hypothesis advocated here, this does not mean the logic is universally true. For example, if the information from the molecular descriptors is not sufficient to represent the SAR/SPR, no reliable results can be reached. This can perhaps be summarized in the slogan "Data is Everything". Another interesting result is that the performance of the training set provided by the type (i) PVLOO-based division method is statistically better than that of the modeling set (mtp_mod) when predicting the external evaluation set (mtp_ext) (Table 4). This implies that the size of the training set can impact the predictability of QSAR/QSPR models; sometimes it is not true that "the more compounds in the training set the better". Studies38,39 have attempted to determine the relationship between training set size and the quality of prediction, and it was concluded that the size of a training set should be optimized if there is a trend of imbalance or redundancy. The work reported here suggests a possible application of the proposed method as an undersampling approach in the modeling of an imbalanced large-scale data set.
■ ASSOCIATED CONTENT

Supporting Information. The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00029. Detailed information on GP (PDF). Data sets used in this study (ZIP). Statistical metrics for modeling: r = Pearson's correlation coefficient; MAE = mean absolute error; RMSE values for training, 10-fold CV, test, and internal and external validation (XLSX). Assessment of the external evaluation sets: Tables S1−S3 (PDF). Figures S1−S32 and Tables S4−S7 (PDF).
■ AUTHOR INFORMATION

Corresponding Authors
*Y. Dong. E-mail: [email protected].
*D. Du. E-mail: [email protected].

ORCID
Ying Dong: 0000-0002-4397-4695
Ding Du: 0000-0002-4615-5433
Notes
The authors declare no competing financial interest.
■ ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (No. 21572270), the Qing Lan Project of Jiangsu Province, and the Priority Academic Program Development of Jiangsu Higher Education Institutions. The authors acknowledge the strong encouragement of Prof. Tao Lu, Prof. Zunjian Zhang, and Prof. Wenbin Shen during the implementation of the project. We thank Prof. Alexander Tropsha (University of North Carolina, Chapel Hill) for many thought-provoking discussions. Y.D. thanks Dr. Qingfa Zhou for his support of this study.
■ ABBREVIATIONS

10-fold CV, 10-fold cross-validation; COX2, cyclooxygenase-2; CRFS, correlation-based feature selection; CSFS, consistency-based feature selection; DHFR, dihydrofolate reductase; Fub, fraction of chemical unbound by plasma proteins; GP, Gaussian process; KS, Kennard−Stone algorithm; LOO−CV, leave-one-out cross-validation; MDPI, Molecular Diversity Preservation International database; MLP, MultilayerPerceptron; NN, neural network covariance function; PVLOO, posterior variance of leave-one-out cross-validation; RMSE, root mean squared error; RQ, rational quadratic covariance function; SE, squared exponential covariance function; Weka, Waikato Environment for Knowledge Analysis
■ REFERENCES
(1) Gramatica, P. Principles of QSAR Models Validation: Internal and External. QSAR Comb. Sci. 2007, 26, 694−701. (2) Baumann, D.; Baumann, K. Reliable Estimation of Prediction Errors for QSAR Models Under Model Uncertainty Using Double Cross-Validation. J. Cheminf. 2014, 6, 47−65. (3) Roy, K.; Ambure, P. The “Double Cross-Validation” Software Tool for MLR QSAR Model Development. Chemom. Intell. Lab. Syst. 2016, 159, 108−126.
1066
DOI: 10.1021/acs.jcim.7b00029 J. Chem. Inf. Model. 2017, 57, 1055−1067
Article
Journal of Chemical Information and Modeling (4) Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11, 137−148. (5) Zupan, J.; Novič, M.; Ruisánchez, I. Kohonen and Counterpropagation Artificial Neural Networks in Analytical Chemistry. Chemom. Intell. Lab. Syst. 1997, 38, 1−23. (6) Mitchell, T. J. An Algorithm for the Construction of “D-Optimal” Experimental Designs. Technometrics 1974, 16, 203−210. (7) Snarey, M.; Terrett, N. K.; Willett, P.; Wilton, D. J. Comparison of Algorithms for Dissimilarity-Based Compound Selection. J. Mol. Graphics Modell. 1997, 15, 372−385. (8) Gobbi, A.; Lee, M. L. DISE: Directed Sphere Exclusion. J. Chem. Inf. Comput. Sci. 2003, 43, 317−323. (9) Leonard, J. T.; Roy, K. On Selection of Training and Test Sets for the Development of Predictive QSAR Models. QSAR Comb. Sci. 2006, 25, 235−251. (10) Golbraikh, A.; Shen, M.; Xiao, Z. Y.; Xiao, Y. D.; Lee, K. H.; Tropsha, A. Rational Selection of Training and Test Sets for the Development of Validated QSAR Models. J. Comput.-Aided Mol. Des. 2003, 17, 241−253. (11) Golbraikh, A.; Tropsha, A. Predictive QSAR Modeling Based on Diversity Sampling of Experimental Datasets for the Training and Test Set Selection. Mol. Diversity 2000, 5, 231−243. (12) Wu, W.; Walczak, B.; Massart, D. L.; Heuerding, S.; Erni, F.; Last, I. R.; Prebble, K. A. Artificial Neural Networks in Classification of NIR Spectral Data: Design of the Training Set. Chemom. Intell. Lab. Syst. 1996, 33, 35−46. (13) Martin, T. M.; Harten, P.; Young, D. M.; Muratov, E. N.; Golbraikh, A.; Zhu, H.; Tropsha, A. Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling? J. Chem. Inf. Model. 2012, 52, 2570−2578. (14) Puzyn, T.; Mostrag-Szlichtyng, A.; Gajewicz, A.; Skrzyński, M.; Worth, A. P. Investigating the Influence of Data Splitting on the Predictive Ability of QSAR/QSPR Models. Struct. Chem. 2011, 22, 795−804. (15) Karthikeyan, M.; Glen, R. C.; Bender, A. General Melting Point Prediction Based on a Diverse Compound Data Set and Artificial Neural Networks. J. Chem. Inf. Model. 2005, 45, 581−590. (16) Bergström, C. A. S.; Norinder, U.; Luthman, K.; Artursson, P. Molecular Descriptors Influencing Melting Point and Their Role in Classification of Solid Drugs. J. Chem. Inf. Comput. Sci. 2003, 43, 1177− 1185. (17) Sutherland, J. J.; O’Brien, L. A.; Weaver, D. F. A Comparison of Methods for Modeling Quantitative Structure−Activity Relationships. J. Med. Chem. 2004, 47, 5541−5554. (18) Hassan, M.; Bielawski, J. P.; Hempel, J. C.; Waldman, M. Optimization and Visualization of Molecular Diversity of Combinatorial Libraries. Mol. Diversity 1996, 2, 64−74. (19) Frank, E.; Hall, M.; Trigg, L.; Holmes, G.; Witten, I. H. Data Mining in Bioinformatics Using Weka. Bioinformatics 2004, 20, 2479− 2481. (20) Hall, M. A. Correlation-Based Feature Selection for Discrete and Numeric Class Machine Learning. In Proceedings of the 17th International Conference on Machine Learning; Stanford, CA, USA, June 29−July 2, 2000; Morgan Kaufmann: CA, pp 359−366. (21) Dash, M.; Liu, H. Consistency-Based Search in Feature Selection. Artif. Intell. 2003, 151, 155−176. (22) Kohavi, R.; John, G. H. Wrappers for Feature Subset Selection. Artif. Intell. 1997, 97, 273−324. (23) Ingle, B. L.; Veber, B. C.; Nichols, J. W.; Tornero-Velez, R. Informing the Human Plasma Protein Binding of Environmental Chemicals by Machine Learning in the Pharmaceutical Space: Applicability Domain and Limits of Predictability. J. Chem. Inf. Model. 
2016, 56, 2243−2252. (24) Rasmussen, C. E.; Williams, C. K. I. Gaussian Processes for Machine Learning. In Adaptive Computation and Machine Learning; Dietterich, T., Ed.; MIT Press: Cambridge, 2006; pp 7−22. (25) Azman, K.; Kocijan, J. Application of Gaussian Processes for Black-Box Modelling of Biosystems. ISA Trans. 2007, 46, 443−457.
(26) Schölkopf, B.; Mika, S.; Burges, C. C.; Knirsch, P.; Muller, K. R.; Rätsch, G.; Smola, A. J. Input Space Versus Feature Space in KernelBased Methods. IEEE Trans. Neural. Netw. 1999, 10, 1000−1017. (27) Vanhatalo, J.; Hartikainen, J.; Tolvanen, V.; Vehtari, A.; Kegl, B. GPstuff: Bayesian Modeling with Gaussian Processes. J. Mach. Learn. Res. 2013, 14, 1175−1179. (28) DTC Lab Software Tools. http://teqip.jdvu.ac.in/QSAR_Tools/ (accessed March 2, 2017). (29) Witten, I. H.; Frank, E.; Hall, M. A. Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed.; Morgan Kaufmann: Burlington, 2011. (30) Witten, I. H.; Frank, E.; Hall, M. A.; Pal, C. J. Data Mining: Practical Machine Learning Tools and Techniques, 4th ed.; Elsevier: Boston, 2016. (31) Shevade, S. K.; Keerthi, S. S.; Bhattacharyya, C.; Murthy, K. R. K. Improvements to the SMO Algorithm for SVM Regression. IEEE Trans. Neural. Netw. 2000, 11, 1188−1193. (32) Smola, A. J.; Schölkopf, B. A Tutorial on Support Vector Regression. Stat. Comput. 2004, 14, 199−222. (33) Aha, D. W.; Kibler, D.; Albert, M. K. Instance-Based Learning Algorithms. Mach. Learn. 1991, 6, 37−66. (34) Cleary, J. G.; Trigg, L. E. K*: An Instance-Based Learner Using An Entropic Distance Measure. In Proceedings of the 12th International Conference on Machine Learning; Tahoe City, CA, USA, July 9−12, 1995; Morgan Kaufmann: CA, pp 108−114. (35) Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123−140. (36) Quinlan, R. J. Learning with Continuous Classes. In Proceedings of the 5th Australian Joint Conference on Artificial Intelligence; Singapore, 1992; World Scientific: Singapore, pp 343−348. (37) Wang, Y.; Witten, I. H. Induction of Model Trees for Predicting Continuous Classes, 1996. Department of Computer Science, University of Waikato. http://researchcommons.waikato.ac.nz/ handle/10289/1183 (accessed March 2, 2017). (38) Roy, P. P.; Leonard, J. T.; Roy, K. Exploring the Impact of Size of Training Sets for the Development of Predictive QSAR Models. Chemom. Intell. Lab. Syst. 2008, 90, 31−42. (39) Zakharov, A. V.; Peach, M. L.; Sitzmann, M.; Nicklaus, M. C. QSAR Modeling of Imbalanced High-Throughput Screening Data in PubChem. J. Chem. Inf. Model. 2014, 54, 705−712.