
Discussion on Regression Methods Based on Ensemble Learning and Applicability Domains of Linear Submodels

Hiromasa Kaneko*

Department of Applied Chemistry, School of Science and Technology, Meiji University, 1-1-1 Higashi-Mita, Tama-ku, Kawasaki, Kanagawa 214-8571, Japan

Received: November 7, 2017. Published: February 9, 2018.

ABSTRACT: To develop a new ensemble learning method and construct highly predictive regression models in chemoinformatics and chemometrics, applicability domains (ADs) are introduced into the prediction process of ensemble learning. When estimating the values of an objective variable using subregression models, only the submodels whose ADs cover a query sample, i.e., those for which the sample is inside the model's AD, are used. By constructing submodels while changing the list of selected explanatory variables, the union of the submodels' ADs, which defines the overall AD, becomes large, and the prediction performance is enhanced for diverse compounds. By analyzing a quantitative structure−activity relationship data set and a quantitative structure−property relationship data set, it is confirmed that the ADs can be enlarged and that the estimation performance of the regression models is improved compared with traditional methods.



INTRODUCTION

In quantitative structure−activity relationships (QSARs)1 and quantitative structure−property relationships (QSPRs),2 regression models are constructed between molecular descriptors (X) and activities/properties (y). Using regression models, it is possible to estimate the values of activities and properties for virtual chemical structures that have not yet been synthesized. Based on these estimated activities and properties, it is also possible to design chemical structures with desirable activities and properties. Linear regression analysis methods include ordinary least-squares regression or multiple linear regression, principal component regression,3 partial least-squares (PLS) regression,4 ridge regression, the least absolute shrinkage and selection operator, the elastic net,5 and linear support vector regression (SVR),6 whereas nonlinear methods such as artificial neural networks,7 deep learning,8 nonlinear SVR,6 and random forests (RF)9 have been developed in various research fields, including QSAR and QSPR. Both linear and nonlinear approaches require attention to prevent underfitting and overfitting,10 which degrade the estimation performance of regression models.

Ensemble learning11,12 is used to estimate the values of properties and activities accurately in regression analysis. In ensemble learning, training samples and X-variables are sampled from a training data set; multiple subdata sets are prepared; and a subregression model is constructed for each subdata set. For example, RF involves ensemble learning to construct a number of decision trees by changing the X-variables and samples. Of course, ensemble learning can be combined with any regression method.

In terms of prediction, a stable estimation result can be obtained by setting the average or the median of the estimated y-values of all subregression models as the final estimated value for each query sample. As the y-value is estimated using many submodels, the influence of underfitting and overfitting in each submodel can be reduced. The method of sampling with duplication is called bootstrap aggregating (bagging),13 and the method of sampling without duplication is called jackknife aggregating (jagging).14 Regression analysis with ensemble learning has been used for QSAR15,16 and spectrum analysis,17 and RF has been widely employed for QSAR.18−22 Ensemble learning can decrease the variation of the estimation errors of regression models, but it does not by itself yield appropriate relationships between X and y. A likely cause is the use of subregression models outside their applicability domains (ADs) in ensemble learning. It is important to consider ADs23,24 in QSARs and QSPRs. ADs are data domains in which models can exert their original performance, that is, their performance on a training data set. It is preferable to discuss estimation results only for samples that are inside the AD of a regression model, because the estimation results for samples outside the AD are unreliable. In fact, samples that are outside of ADs have high estimation errors in terms of y. Nevertheless, it is also possible to slightly change chemical structures that exist outside of ADs to the inside of ADs.25



The degree of "slightly" depends on the descriptors; see ref 25 for more details. ADs can also be considered in predictions with ensemble learning: for a query sample, when the variance of the y-values estimated by the subregression models is high, the sample is judged to be outside of the AD. However, the AD of each submodel, and the relationship between the size of the ADs and the predictive performance of the submodels, have not been taken into account in ensemble learning methods such as bootstrap aggregating and jackknife aggregating, although the ADs of individual models have been considered in some kinds of consensus models26,27 in ensemble modeling. As each submodel has different samples and X-variables, its AD is also different. When the estimated y-values include values from submodels used outside their ADs, the final estimated y-value is affected by these unreliable values. In particular, when X-variables are sampled in ensemble learning, the descriptor space changes for each submodel, and so the AD of each submodel is expected to change significantly. Therefore, ensemble learning methods that consider the ADs of submodels are discussed here. ADs are set for all submodels, and jackknife aggregation (jagging) is applied in the variable direction; that is, each submodel is constructed by changing the list of selected explanatory variables. In estimating the value of y for a query sample, only the submodels for which the sample is within the AD are used. When multiple submodels are selected, the final estimated value is given as the center of the distribution of the values estimated by those models. By considering the ADs of the submodels and employing only those for which a query sample lies within the AD, the estimation performance of ensemble learning is improved. Furthermore, local models have good local predictive performance, although they carry the possibility of large prediction errors compared with a global model, especially when the training data are sparse,28 and different ADs can be set by changing the X-variables. By operating all submodels appropriately, the estimation performance in this enlarged AD improves, and samples existing in large data regions can be estimated with high accuracy. To verify the effectiveness of the proposed method, compounds with aqueous solubility are analyzed as a QSPR data set and compounds with toxicity are analyzed as a QSAR data set.

For sampling the X-variables in ensemble learning, random sampling and genetic algorithm-based PLS (GAPLS) are discussed. By comparison with the traditional regression methods PLS and k-nearest neighbor regression and with traditional ensemble learning methods, the superiority of the proposed method is demonstrated.



METHOD

After describing ensemble learning, ADs, and AD operation methods, a method that considers the ADs of subregression models in ensemble learning is proposed. The proposal does not limit the choice of regression analysis method used in ensemble learning.

Ensemble Learning. Ensemble learning is a method in which multiple subdata sets are formed by sampling training samples and X-variables from a training data set; a submodel is constructed for each subdata set, and the final y-value of a new sample is estimated by combining these submodels. The method of sampling with duplication is called bagging, and that without duplication is called jagging. In classification analysis, the final estimation result is determined by a majority vote of the estimation results of the submodels. In regression analysis, the average or the median of the estimation results of the submodels is used as the final estimation result. The reliability of the final estimated value can be quantified by checking the variation in the estimation results of the submodels.14,29,30 In this study, the X-variables are selected without duplication to prepare multiple subdata sets, analogous to jackknifing in the sample direction, and the median is used as the final estimation result. This method of selecting X-variables without duplication is called variagging in this paper (a minimal code sketch is given below).

Applicability Domain (AD). ADs are data domains in which a regression or classification model exhibits its original performance. If the test samples to be estimated are within these ADs, the y-errors are expected to conform to the y-errors of the training samples. However, if the test samples are outside the ADs, there is a high probability that the y-errors will be larger than those of the training samples. Methods of determining ADs include ensemble learning, ranges,31 the distance from the center of the training samples,32,33 and the data density around new samples.34,35 The data density is used in this research, since it is known as one of the AD measures that deliver good performance.
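As a concrete illustration of variagging, the following sketch builds PLS submodels on variable subsets sampled without duplication and aggregates their predictions by the median. It is a minimal sketch, assuming scikit-learn's PLSRegression; the function names and the choice of half of the X-variables per subset are illustrative, not the paper's exact implementation.

```python
# Sketch of variagging: each submodel sees a different subset of the
# X-variables, sampled without duplication within a subset, and the
# median of the submodel estimates is the final prediction.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def fit_variagging_pls(X, y, n_submodels=100, n_components=2, seed=0):
    """Fit PLS submodels, each on a random half of the X-variables."""
    rng = np.random.default_rng(seed)
    n_vars = X.shape[1]
    submodels = []
    for _ in range(n_submodels):
        # variables sampled without duplication within each subset
        var_idx = rng.choice(n_vars, size=n_vars // 2, replace=False)
        model = PLSRegression(n_components=n_components).fit(X[:, var_idx], y)
        submodels.append((var_idx, model))
    return submodels

def predict_variagging(submodels, X_query):
    """Median of the submodel estimates for each query sample."""
    preds = np.column_stack([model.predict(X_query[:, idx]).ravel()
                             for idx, model in submodels])
    return np.median(preds, axis=1)
```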

Figure 1. Basic concept of the proposed ECADS method. n is the number of submodels.




A query sample is judged to be inside the AD when there are many training samples around it. Estimated prediction reliability (EPR), which is calculated in X-space, is an index of the reliability of estimated y-values and provides a quantitative metric for the suitability of the AD. The higher the data density, the higher the EPR, and an estimated y-value with a higher EPR can be considered more reliable. In this paper, the data density is calculated by the k-nearest neighbor (kNN) algorithm.36 The k training samples that are most similar to the query sample are selected, and the average similarity to these k samples is taken as the data density. The higher the average, the closer a query sample is to the model constructed with those training samples. In this study, the reciprocal of the Euclidean distance after normalization or autoscaling of the X-variables is used to measure the similarity, and the sum of the reciprocals of the nearest k Euclidean distances after normalization of the X-variables defines the EPR. A comparison of the Euclidean distance, the Mahalanobis distance, and the one-class support vector machine as AD setting methods is given in ref 35. Weights can be considered for the X-variables in the calculation of distances.2 The Jaccard distance or the Tanimoto similarity, which can also be defined for continuous variables, can be used when the descriptors are fingerprints.

Coverage. Coverage was developed to analyze the relationship between the size of the AD and the predictive performance of a model.33 This metric is the ratio of samples that are within a certain data domain. For example, when the AD is calculated with the data density, samples can be rearranged in descending order of data density; the higher the data density, the higher the EPR. The coverage is accumulated in descending order of EPR, and the coverage of the i samples with the highest EPR is given as follows:

$\mathrm{coverage}_i = \frac{i}{m}$  (1)

where m is the total number of samples and i is an index that increases from 1 to m in steps of 1. By analyzing the relationship between the coverage and the mean absolute error (MAE) calculated with only the i samples in the coverage, the performance of regression models can be checked for each EPR. MAE_i, the MAE calculated with only the i samples with the highest EPR, is given by

$\mathrm{MAE}_i = \frac{1}{i}\sum_{k=1}^{i} \left| y^{(k)} - y_{\mathrm{EST}}^{(k)} \right|$  (2)

where $y^{(k)}$ is a measured y-value and $y_{\mathrm{EST}}^{(k)}$ is the corresponding estimated y-value.
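A minimal sketch of how the EPR and the coverage−MAE relationship of eqs 1 and 2 can be computed, assuming the X matrices are already autoscaled; the value of k and the small guard against zero distances are illustrative choices.

```python
# Sketch of the kNN-based EPR (sum of reciprocals of the k nearest
# Euclidean distances) and the coverage-MAE curve of eqs 1 and 2.
import numpy as np
from scipy.spatial.distance import cdist

def epr(X_train, X_query, k=10):
    """EPR of each query sample: higher = denser = more reliable."""
    d = cdist(X_query, X_train)              # (n_query, n_train) distances
    d = np.maximum(d, 1e-12)                 # guard against zero distances
    d.sort(axis=1)                           # nearest neighbors first
    return (1.0 / d[:, :k]).sum(axis=1)

def coverage_mae_curve(y_true, y_est, reliability):
    """coverage_i = i/m (eq 1) and MAE_i over the i samples with the
    highest EPR (eq 2), for i = 1, ..., m."""
    order = np.argsort(reliability)[::-1]    # descending EPR
    abs_err = np.abs(y_true[order] - y_est[order])
    i = np.arange(1, len(y_true) + 1)
    return i / len(y_true), np.cumsum(abs_err) / i
```

The AD of a model can then be fixed by, for example, choosing the EPR value at which the validation MAE_i first meets a target threshold.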

Figure 2. Number of estimated compounds in test samples with respect to MAE threshold for log S data set.

Figure 4. Measured and estimated y-values of the test compounds inside the AD for the models (a) PLS, (b) variagging PLS, (c) ECADS(rand), and (d) ECADS(GAPLS) when the MAE threshold is 0.45 for log S data set.

Figure 3. Measured and estimated y-values of the test compounds inside the AD for the models (a) ECADS(rand) and (b) ECADS(GAPLS) when the MAE threshold is 0.3 for the log S data set.



In terms of a good AD, a lower coverage value should be associated with a lower MAE. The performance of regression models considering ADs is important. If the relationship between coverage and MAE is obtained using a validation data set, it is possible to discuss the magnitude of MAE at a certain coverage value and, conversely, the magnitude of coverage at a certain MAE value. For a constant coverage, the estimation performance is higher for regression models with a lower MAE. When an error threshold for MAE is given as the predictive performance target of a regression model, the higher the coverage, the more samples can be estimated with the desired accuracy, which means that the performance of the regression model is high. In this study, the MAE calculated with test samples is denoted MAEP.

Ensemble Learning Based on Submodels and Their ADs. Although ADs must be taken into consideration when using regression models (that is, when estimating the y-values of query samples with regression models), ADs have not, to the best of our knowledge, been considered for subregression models in ensemble learning. Of course, in ensemble learning, the variation or standard deviation of the estimation results of the submodels is calculated, and samples with high standard deviations are considered to lie outside of the AD.14 However, in ensemble learning, submodels for which a query sample is outside the AD (meaning their estimation results are unreliable) are still used to derive estimations.

The estimation performance will be improved if the ADs of the submodels are considered, i.e., if only the submodels for which the sample is inside the AD are used for each query sample. Furthermore, because submodels have high subestimation performance compared with global models, the proper operation of all submodels will improve the estimation performance inside the ADs, and the overall AD will be enlarged to the union of the ADs of all submodels. The proposed method is called an ensemble learning method considering the AD of each submodel (ECADS). Figure 1 shows the basic concept of ECADS, where n is the number of submodels. First, n subregression models are constructed with variagging, i.e., n subdata sets are prepared from a training data set by selecting different X-variables, and a subregression model is constructed for each subdata set. The methods used to select the X-variables for the submodels are random selection and GAPLS.37 In parallel with submodel construction, an AD is set for each submodel. The data density is calculated with kNN, and the EPR is defined as the sum of the reciprocals of the nearest k Euclidean distances after normalization of the X-variables. The relationship between coverage and MAE is obtained using the EPR. In the prediction process, the y-value of a query sample is estimated using only the submodels for which the query sample lies inside the AD. When multiple submodels are selected, the average of the estimated y-values is taken as the final estimated y-value. Samples existing in large data domains can be accurately estimated using the proposed method.
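The prediction step can be sketched as follows; epr() is the helper sketched earlier, and the per-submodel EPR cutoffs (ad_thresholds) are assumed to be given, since the text fixes the EPR of each submodel rather than integrating the ADs. The helper names are illustrative.

```python
# Sketch of ECADS prediction: for each query sample, use only the
# submodels whose AD covers it, and average their estimates.
import numpy as np

def predict_ecads(submodels, ad_thresholds, X_train, X_query, k=10):
    """submodels: list of (var_idx, fitted model) as in the variagging
    sketch; ad_thresholds: one EPR cutoff per submodel."""
    votes = [[] for _ in range(X_query.shape[0])]
    for (var_idx, model), cutoff in zip(submodels, ad_thresholds):
        # AD check in this submodel's own descriptor subspace
        inside = epr(X_train[:, var_idx], X_query[:, var_idx], k) >= cutoff
        pred = model.predict(X_query[:, var_idx]).ravel()
        for q in np.flatnonzero(inside):
            votes[q].append(pred[q])
    # NaN marks samples outside the union of the submodel ADs
    return np.array([np.mean(v) if v else np.nan for v in votes])
```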

Table 1. Examples of Estimated Compounds inside AD and Training Compounds That Are the Most Similar to Their Compounds for PLS Using the log S Data Set




The current regression method of ensemble learning based on submodels and their ADs has a limitation. An internal validation procedure is necessary to construct the relationship between the ADs and the predictive ability of the final regression model and to predict the y-value of a new query. However, to perform an internal validation procedure, the ADs of the submodels must be integrated, which has so far proved difficult. Although the reciprocal of the Euclidean distance after normalization or autoscaling of the X-variables was used as the AD setting method in this study, the distances of different submodels cannot be compared because the X-variables of each submodel are different. For the other AD setting methods as well, ADs cannot be compared and integrated when the X-variables are different. If the ADs could be integrated, it would be possible to estimate the y-value of a compound according to its EPR value, since the relationship between coverage and MAE according to the EPR of the whole model could be obtained; for example, only the local models in which the EPR value of a query sample is greater than or equal to the EPR value that satisfies the target MAE would be used. In this study, however, the numbers of compounds inside the ADs are considered after fixing the EPR of each submodel, as shown in the Results and Discussion section.

RESULTS AND DISCUSSION

To verify the effectiveness of the proposed method, QSPR data for water solubility and QSAR data for toxicity were analyzed. PLS was used as the regression analysis method. The analysis compares the PLS model, an ensemble model based on PLS and variagging (variagging PLS), and the proposed ECADS model. In the proposed method, two variable selection methods are compared for submodel construction: random selection of half of the X-variables, and variable selection based on GAPLS with r² after 5-fold cross-validation as the evaluation function. These are denoted as ECADS(rand) and ECADS(GAPLS), respectively.
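A rough sketch of GAPLS-style variable selection: chromosomes are Boolean masks over the descriptors, and the fitness is r² under 5-fold cross-validation, matching the evaluation function stated above. The population size, mutation rate, and one-point crossover below are illustrative assumptions, not the settings of ref 37.

```python
# Sketch of a genetic algorithm for PLS variable selection (GAPLS-style).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def r2_cv(mask, X, y, n_components=2):
    """Fitness: r^2 of 5-fold cross-validated predictions on the masked variables."""
    if mask.sum() < n_components:
        return -np.inf                            # too few variables for PLS
    y_cv = cross_val_predict(PLSRegression(n_components=n_components),
                             X[:, mask], y, cv=5).ravel()
    return 1.0 - ((y - y_cv) ** 2).sum() / ((y - y.mean()) ** 2).sum()

def gapls_select(X, y, n_generations=30, pop_size=20, p_mut=0.02, seed=0):
    rng = np.random.default_rng(seed)
    n_vars = X.shape[1]
    pop = rng.random((pop_size, n_vars)) < 0.5    # random initial chromosomes
    for _ in range(n_generations):
        scores = np.array([r2_cv(m, X, y) for m in pop])
        keep = np.argsort(scores)[-(pop_size // 2):]   # keep the best half
        parents = pop[keep]
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            cut = int(rng.integers(1, n_vars))    # one-point crossover
            child = np.concatenate([parents[i, :cut], parents[j, cut:]])
            child = child ^ (rng.random(n_vars) < p_mut)   # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents] + children)
    scores = np.array([r2_cv(m, X, y) for m in pop])
    return pop[int(scores.argmax())]              # best variable mask found
```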



The number of submodels n is 100 for variagging PLS, ECADS(rand), and ECADS(GAPLS). The number of components in PLS modeling was determined so as to maximize rCV², i.e., r² under 5-fold cross-validation. The number of samples that could be estimated below certain MAE thresholds in the test samples was analyzed to examine the performance of each model considering its AD. In the preparation of the subdata sets, X-variables are randomly selected, and initial chromosomes are randomly generated in GAPLS modeling, which means that the proposed method has a random component. However, since the number of submodels is 100, which is sufficiently high, the results change only slightly when the random seed is changed. For each data set, 196 molecular structure descriptors38 were calculated for the compounds using RDKit.39 Descriptors for which the proportion of samples with the same value exceeded 0.64 (0.8 × 0.8, reflecting 5-fold double cross-validation) were removed. Autoscaling or normalization was conducted as preprocessing. This preprocessing and exclusion of descriptors was based on the training data sets only (a minimal sketch is given below). Important descriptors could be discussed using a hierarchical QSAR approach.40 Marvin View,41 a software package developed by ChemAxon, was used to visualize the chemical structures.

QSPR Study on Aqueous Solubility. The data set used contains 1290 compounds and their aqueous solubility, expressed as log S, where S [mol/L] is the solubility at a temperature of 20−25 °C in moles per liter.42 Sixteen compounds were removed because of duplication. It should be noted that some log S values may be biased, i.e., systematically in error: for example, they were miscibilities, they were reported as solubility in moles per kilogram rather than moles per liter, or the structures were misdrawn.
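The descriptor preparation just described can be sketched as follows. Descriptors.descList is RDKit's list of (name, function) pairs (its exact length is version-dependent), the 0.64 cutoff follows the text, and the helper names are illustrative.

```python
# Sketch of the descriptor preparation: compute RDKit descriptors, drop
# near-constant ones, and autoscale with training-set statistics only.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def descriptor_matrix(smiles_list):
    """All RDKit descriptors per molecule (invalid SMILES need filtering)."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    return np.array([[fn(m) for _, fn in Descriptors.descList] for m in mols])

def drop_near_constant(X_train, max_mode_fraction=0.64):
    """Indices of columns whose most frequent value covers <= the cutoff."""
    n = X_train.shape[0]
    keep = [j for j in range(X_train.shape[1])
            if np.unique(X_train[:, j], return_counts=True)[1].max() / n
            <= max_mode_fraction]
    return np.array(keep)

def autoscale(X_train, X_other):
    """Autoscale both matrices using training-set mean and standard deviation."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0                     # guard against constant columns
    return (X_train - mu) / sd, (X_other - mu) / sd
```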

Table 2. Examples of Estimated Compounds inside AD and Training Compounds That Are the Most Similar to Their Compounds for ECADS(GAPLS) Using the log S Data Set^a

^a The similarity is calculated in X-space using all descriptors.




One hundred compounds were randomly selected as training compounds to estimate the log S values of the remaining 1174 compounds with high accuracy. Each subdata set in ensemble learning consisted of 100 compounds. The MAEP value over all the test samples was 0.644 in PLS modeling, and that in kNN regression (k = 10) was 1.319, which was much worse than that of simple PLS.

Figure 2 shows the number of estimated compounds in the test samples for various MAE thresholds, i.e., the number of test compounds inside the AD of each model at each MAE threshold. Here, MAEP denotes the MAE of prediction for the test samples. The specific values are given in Table S1 in the Supporting Information. For each model, the MAE threshold was changed to adjust the size of the AD, and the number of compounds estimated at or below each MAEP was counted. ECADS(rand) and ECADS(GAPLS) were able to estimate many more compounds at the same MAE thresholds than PLS and variagging PLS. For example, when MAEP = 0.3, no compounds were estimated by the traditional methods PLS and variagging PLS, whereas ECADS(rand) and ECADS(GAPLS) estimated 34 and 59 compounds, respectively. When MAEP = 0.35, PLS and variagging PLS estimated only 23 and 22 compounds, respectively, whereas ECADS(rand) and ECADS(GAPLS) estimated about five times as many, 110 and 128, respectively. These results confirm that the number of compounds that can be estimated at a certain level of predictive accuracy increases, and that the AD, i.e., the data domain in which a regression model can estimate y-values with high accuracy, is expanded by the proposed method compared with the traditional methods PLS and variagging PLS, thereby improving the estimation performance for each compound.

Figure 5. Number of estimated compounds in test samples with respect to MAE threshold for toxicity data set.

Figure 6. Measured and estimated y-values of the test compounds inside the AD for ECADS(GAPLS) when MAE threshold is 0.2 for toxicity data set. MAEP(inside AD) = 0.21, MAEP(outside AD) = 0.40.

Figure 8. Measured and estimated y-values of the test compounds inside the AD for the models (a) PLS, (b) variagging PLS, (c) ECADS(rand), and (d) ECADS(GAPLS) when the MAE threshold is 0.3 for the toxicity data set.

Figure 7. Measured and estimated y-values of the test compounds inside the AD for the models (a) ECADS(rand) and (b) ECADS(GAPLS) when the MAE threshold is 0.25 for the toxicity data set.



In selecting the X-variables used for the submodels in ensemble learning, it was possible to estimate more compounds with variable selection based on GAPLS than with random selection when MAEP was low. By selecting X-variables using GAPLS, submodels with improved subestimation performance were constructed. As a result, the estimation performance as a whole was improved by integrating the submodels and their ADs. Although this improvement in the estimation performance of the submodels increases the computation time, it is preferable to construct submodels by GAPLS. Figures 3 and 4 show plots of the measured and estimated y-values of the test compounds inside the AD for each model at MAE thresholds of 0.3 and 0.45, respectively. Figure 3 does not include results for PLS and variagging PLS, because they did not estimate any compounds at this MAEP value. As shown in Figure 4, compared with PLS and variagging PLS, many compounds could be estimated with high predictive accuracy by the proposed method. Figure 3 exhibits a tighter distribution of compounds along the diagonal than Figure 4, which means that the ADs worked properly and the proposed models achieved higher estimation accuracy by setting appropriate ADs. These results confirm that the AD was enlarged and that y-values could be estimated with high accuracy using the proposed method. Tables 1 and 2 present examples of combinations of estimated compounds inside the AD and the training compounds that are the most similar to them, for PLS and ECADS(GAPLS), respectively. The similarity was calculated based on the Euclidean distance after normalization in the descriptor space. The test and training compounds have slightly different substituents and are very similar for PLS, as shown in Table 1, whereas for ECADS(GAPLS), the skeletons of the chemical structures are completely different between the test and training compounds. Compared with that in PLS, the similarity between the test and training compounds was low in ECADS(GAPLS), which means that novel structures could be estimated with high accuracy using the proposed method. By selecting descriptors with GAPLS to construct submodels, the chemical structures were abstracted, and even those that did not appear similar to the training compounds were found to be inside the AD and could be estimated. This confirms that the ADs can be enlarged and that chemical structures that are not similar to each other can be appropriately estimated using the proposed method.
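For reference, a short sketch of the nearest-neighbor lookup behind Tables 1 and 2, under the stated assumption that similarity is based on the Euclidean distance over all autoscaled descriptors; the function name is illustrative.

```python
# Sketch: find, for each test compound, the most similar training compound
# in the autoscaled descriptor space (smallest Euclidean distance).
import numpy as np
from scipy.spatial.distance import cdist

def nearest_training_compound(X_train_scaled, X_test_scaled):
    d = cdist(X_test_scaled, X_train_scaled)   # pairwise Euclidean distances
    idx = d.argmin(axis=1)                     # index of the closest trainer
    return idx, d[np.arange(d.shape[0]), idx]  # indices and distances
```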

QSAR Study Using pIGC50. This data set was downloaded from the Environmental Toxicity Prediction Challenge 2009 website.43 This is an online challenge that invites researchers to predict the toxicity of molecules against T. pyriformis, expressed as the logarithm of the 50% growth inhibitory concentration in milligrams per liter (pIGC50). The data set consists of 1093 compounds. Eight compounds were removed because of duplication. One hundred compounds were randomly selected as training samples to estimate the pIGC50 of the remaining 985 compounds with high accuracy. Each subdata set in ensemble learning consisted of 100 compounds.

Table 3. Examples of Estimated Compounds inside AD and Training Compounds That Are the Most Similar to Their Compounds for PLS Using the Toxicity Data Set





Table 4. Examples of Estimated Compounds inside AD and Training Compounds That Are the Most Similar to Their Compounds for ECADS(GAPLS) Using the Toxicity Data Set^a

^a The similarity is calculated in X-space using all descriptors.

The MAEP value over all the test samples was 0.482 in PLS modeling, and that in kNN regression (k = 10) was 0.721, which was worse than that of simple PLS. The number of estimated compounds for each MAE threshold and each model is shown in Figure 5. The specific values are given in Table S2 in the Supporting Information. Compared with PLS and variagging PLS, many more compounds could be estimated at the same MAE thresholds by the proposed method. For example, when MAEP = 0.25, the traditional PLS and variagging PLS models could not estimate any compounds, whereas the proposed models estimated 142 and 239 compounds, respectively. When MAEP = 0.30, PLS found 233 and variagging PLS found 235 compounds, whereas ECADS(rand) and ECADS(GAPLS) could estimate more than twice as many, with 527 and 562, respectively. The size of the AD was expanded and the estimation performance was improved using the proposed method. With the same MAE threshold, more compounds could be estimated using ECADS(GAPLS) than ECADS(rand). As ECADS(GAPLS) selects X-variables to improve the subestimation performance of the regression models, the ADs of the submodels will be narrowed; however, models with high subestimation performance can be obtained. By considering the AD of each submodel and using only those for which samples are within the AD, the overall estimation performance improves compared with the random selection of X-variables. Figures 6−8 show plots of the measured and estimated y-values for test compounds at MAE thresholds of 0.2, 0.25, and 0.3, respectively. Figure 6 does not include results for PLS, variagging PLS, and ECADS(rand) because they could not estimate any compounds at this MAE threshold; similarly, Figure 7 does not include any results for PLS and variagging PLS. Compared with the traditional methods PLS and variagging PLS, the proposed method estimates a large number of compounds. These are distributed around the diagonal line, which indicates high prediction accuracy, as shown in Figure 8.

Comparing Figures 6−8, as the MAE threshold decreases, the distribution of compounds around the diagonal becomes tighter and the subprediction accuracy increases because of the arrangement of the ADs. In the plots, no outlier samples arising from activity cliffs,44 i.e., dramatic activity changes caused by small differences in chemical structure, appeared to exist. This is probably because the molecular descriptors and modeling methods were suitable for estimating the activity. It has thus been confirmed that the proposed method refines the AD of the ensemble model, that the AD of the whole model can be enlarged, and that the estimation performance in this AD is improved. Examples of combinations of estimated compounds inside the AD and the training compounds that are most similar to them, for PLS and ECADS(GAPLS), are presented in Tables 3 and 4, respectively. The similarity calculation is the same as for the QSPR data analysis. In PLS, the substituents are slightly different, whereas in ECADS(GAPLS), the size of the chemical structures and the number of substituents are completely different. As in the QSPR data analysis, the similarity between the test and training compounds was lower for ECADS(GAPLS) than for PLS. Therefore, novel structures could be estimated with the proposed method. If a key to the toxicity is the presence of alcohols and metabolic potential, the shared methyl group features probably link the esters with the branched-chain alcohols in this case. The proposed method could extract structural information on toxicity and could accurately estimate the toxicity of compounds that are dissimilar in terms of chemical structure but similar in terms of toxicity. Submodels were constructed with the descriptors selected by GAPLS, and the degree of abstraction of the chemical structures increased, with even chemical structures that did not appear to be similar being predicted inside the AD.


Thus, it has been confirmed that using the proposed method enlarges the AD and enables novel chemical structures to be appropriately estimated.



CONCLUSION

In this research, a regression analysis method based on ensemble learning that considers the ADs of subregression models has been discussed. By setting the AD of each submodel and using only the submodels for which an input sample is within the AD, the overall estimation performance and the reliability of the estimated y-values are improved. In addition, because diverse submodels with different ADs are constructed by changing the X-variables, the whole AD becomes larger than that of a global model. In case studies using QSPR and QSAR data sets, it was confirmed that the AD is enlarged and the estimation performance is improved compared with the traditional regression analysis methods PLS and kNN regression and the traditional ensemble learning method variagging PLS.

As described in Ensemble Learning Based on Submodels and Their ADs, the current regression method of ensemble learning based on submodels and their ADs has a limitation: an internal validation procedure is necessary to construct the relationship between the ADs and the predictive ability of the final regression model and to predict the y-value of a new query, but to perform such a procedure, the ADs of the submodels must be integrated. The predictive performance of submodels should be evaluated while considering their ADs.45 Therefore, future work will be to compare and integrate ADs even when the X-variables are different, and thereby to establish an ensemble learning-based regression method that fully considers the ADs of submodels.

The method proposed in this study has been developed for a linear regression analysis method, namely PLS. By using nonlinear methods such as SVR, further improvement in estimation accuracy is expected, although attention must be paid to the setting of the AD. In addition, innovations in variable selection and in data set construction methods will contribute to predictive accuracy in ensemble modeling. QSPR and QSAR will be further developed using the proposed method.

ASSOCIATED CONTENT

Supporting Information. The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00649. Table S1: estimated number of compounds in the test samples with respect to the MAEP threshold for the log S data set; Table S2: estimated number of compounds in the test samples with respect to the MAEP threshold for the toxicity data set; Figure S1: histogram of log S for the log S data set; Figure S2: histogram of pIGC50 for the toxicity data set (PDF).

AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected].

ORCID
Hiromasa Kaneko: 0000-0001-8367-6476

Notes
The author declares no competing financial interest.

ACKNOWLEDGMENTS

We thank Stuart Jenkinson, Ph.D., from Edanz Group (www.edanzediting.com/ac) for editing a draft of this manuscript.

ABBREVIATIONS

AD, applicability domain; QSAR, quantitative structure−activity relationship; QSPR, quantitative structure−property relationship; PLS, partial least-squares; SVR, support vector regression; RF, random forests; bagging, bootstrap aggregating; jagging, jackknife aggregating; variagging, variable selection aggregating; GAPLS, genetic algorithm-based partial least-squares; EPR, estimated prediction reliability; kNN, k-nearest neighbor; MAE, mean absolute error; ECADS, ensemble learning method considering the applicability domain of each submodel

REFERENCES

(1) Tropsha, A. Best Practices for QSAR Model Development, Validation, and Exploitation. Mol. Inf. 2010, 29, 476−488.
(2) Sahoo, S.; Adhikari, C.; Kuanar, M.; Mishra, B. K. A Short Review of the Generation of Molecular Descriptors and Their Applications in Quantitative Structure Property/Activity Relationships. Curr. Comput.-Aided Drug Des. 2016, 12, 181−205.
(3) Wold, S.; et al. Principal Component Analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37−52.
(4) Wold, S.; Sjöström, M.; Eriksson, L. PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109−130.
(5) Li, Z. T.; Sillanpaa, M. J. Overview of LASSO-related penalized regression methods for quantitative trait mapping and genomic selection. Theor. Appl. Genet. 2012, 125, 419−435.
(6) Bishop, C. M. Pattern Recognition and Machine Learning; Springer: New York, 2006.
(7) Marini, F.; Roncaglioni, A.; Novic, M. Variable Selection and Interpretation in Structure−Affinity Correlation Modeling of Estrogen Receptor Binders. J. Chem. Inf. Model. 2005, 45, 1507−1519.
(8) LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436−444.
(9) Palmer, D. S.; O'Boyle, N. M.; Glen, R. C.; Mitchell, J. B. O. Random forest models to predict aqueous solubility. J. Chem. Inf. Model. 2007, 47, 150−158.
(10) Babyak, M. A. What you see may not be what you get: A brief, nontechnical introduction to overfitting in regression-type models. Psychosom. Med. 2004, 66, 411−421.
(11) Svetnik, V.; Wang, T.; Tong, C.; Liaw, A.; Sheridan, R. P.; Song, Q. Boosting: An ensemble learning tool for compound classification and QSAR modeling. J. Chem. Inf. Model. 2005, 45, 786−799.
(12) Novotarskyi, S.; Sushko, I.; Korner, R.; Pandey, A. K.; Tetko, I. V. A comparison of different QSAR approaches to modeling CYP450 1A2 inhibition. J. Chem. Inf. Model. 2011, 51, 1271.
(13) Dietterich, T. G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Mach. Learn. 2000, 40, 139−157.
(14) Kaneko, H.; Funatsu, K. Applicability Domain Based on Ensemble Learning in Classification and Regression Analyses. J. Chem. Inf. Model. 2014, 54, 2469−2482.
(15) Tavakoli, H.; Ghasemi, J. B. An improved ensemble learning machine for biological activity prediction of tyrosine kinase inhibitors. J. Chemom. 2015, 29, 213−223.
(16) Xia, J.; Hsieh, J. H.; Hu, H.; Wu, S.; Wang, X. S. The Development of Target-Specific Pose Filter Ensembles To Boost Ligand Enrichment for Structure-Based Virtual Screening. J. Chem. Inf. Model. 2017, 57, 1414−1425.
(17) Shinzawa, H.; Jiang, J. H.; Ritthiruangdej, P.; Ozaki, Y. Investigations of bagged kernel partial least squares (KPLS) and boosting KPLS with applications to near-infrared (NIR) spectra. J. Chemom. 2006, 20, 436−444.
(18) Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958.
(19) Palmer, D. S.; O'Boyle, N. M.; Glen, R. C.; Mitchell, J. B. O. Random Forest Models To Predict Aqueous Solubility. J. Chem. Inf. Model. 2007, 47, 150−158.
(20) Polishchuk, P. G.; Muratov, E. N.; Artemenko, A. G.; Kolumbin, O. G.; Muratov, N. N.; Kuz'min, V. E. Application of Random Forest Approach to QSAR Prediction of Aquatic Toxicity. J. Chem. Inf. Model. 2009, 49, 2481−2488.
(21) Doucet, J. P.; Panaye, A. Three Dimensional QSAR: Applications in Pharmacology and Toxicology; CRC Press, 2010.
(22) Eguchi, A.; Hanazato, M.; Suzuki, N.; Matsuno, Y.; Todaka, E.; Mori, C. Maternal−fetal transfer rates of PCBs, OCPs, PBDEs, and dioxin-like compounds predicted through quantitative structure−activity relationship modeling. Environ. Sci. Pollut. Res. 2015, 1−11.
(23) Tetko, I. V.; Sushko, I.; Pandey, A. K.; Zhu, H.; Tropsha, A.; Papa, E.; Oberg, T.; Todeschini, R.; Fourches, D.; Varnek, A. Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: Focusing on applicability domain and overfitting by variable selection. J. Chem. Inf. Model. 2008, 48, 1733−1746.
(24) Dragos, H.; Marcou, G.; Varnek, A. Predicting the predictability: A unified approach to the applicability domain problem of QSAR models. J. Chem. Inf. Model. 2009, 49, 1762−1776.
(25) Kaneko, H.; Funatsu, K. Strategy of Structure Generation within Applicability Domains with One-Class Support Vector Machine. Bull. Chem. Soc. Jpn. 2015, 88, 981−988.
(26) Shen, M.; Béguin, C.; Golbraikh, A.; Stables, J. P.; Kohn, H.; Tropsha, A. Application of Predictive QSAR Models to Database Mining: Identification and Experimental Validation of Novel Anticonvulsant Compounds. J. Med. Chem. 2004, 47, 2356−2364.
(27) Zhang, L.; Fourches, D.; Sedykh, A.; Zhu, H.; Golbraikh, A.; Ekins, S.; Clark, J.; Connelly, M. C.; Sigal, M.; Hodges, D.; Guiguemde, A.; Guy, R. K.; Tropsha, A. Discovery of Novel Antimalarial Compounds Enabled by QSAR-Based Virtual Screening. J. Chem. Inf. Model. 2013, 53, 475−492.
(28) Guha, R.; Dutta, D.; Jurs, P. C.; Chen, T. Local Lazy Regression: Making Use of the Neighborhood to Improve QSAR Predictions. J. Chem. Inf. Model. 2006, 46, 1836−1847.
(29) Shao, W.; Tian, X. Adaptive soft sensor for quality prediction of chemical processes based on selective ensemble of local partial least squares models. Chem. Eng. Res. Des. 2015, 95, 113−132.
(30) Wang, L.; Jin, H.; Chen, X.; Dai, J.; Yang, K.; Zhang, D. Soft Sensor Development Based on the Hierarchical Ensemble of Gaussian Process Regression Models for Nonlinear and Non-Gaussian Chemical Processes. Ind. Eng. Chem. Res. 2016, 55, 7704−7719.
(31) Kaneko, H.; Arakawa, M.; Funatsu, K. Novel soft sensor method for detecting completion of transition in industrial polymer processes. Comput. Chem. Eng. 2011, 35, 1135−1142.
(32) Dimitrov, S.; Dimitrova, G.; Pavlov, T.; Dimitrova, N.; Patlewicz, G.; Niemela, J.; Mekenyan, O. A stepwise approach for defining the applicability domain of SAR and QSAR models. J. Chem. Inf. Model. 2005, 45, 839−849.
(33) Sushko, I.; Novotarskyi, S.; Korner, R.; Pandey, A. K.; Cherkasov, A.; Li, J. Z.; Gramatica, P.; Hansen, K.; Schroeter, T.; Muller, K. R.; Xi, L. L.; Liu, H. X.; Yao, X. J.; Oberg, T.; Hormozdiari, F.; Dao, P. H.; Sahinalp, C.; Todeschini, R.; Polishchuk, P.; Artemenko, A.; Kuz'min, V.; Martin, T. M.; Young, D. M.; Fourches, D.; Muratov, E.; Tropsha, A.; Baskin, I.; Horvath, D.; Marcou, G.; Muller, C.; Varnek, A.; Prokopenko, V. V.; Tetko, I. V. Applicability domains for classification problems: Benchmarking of distance to models for Ames mutagenicity set. J. Chem. Inf. Model. 2010, 50, 2094−2111.
(34) Baskin, I. I.; Kireeva, N.; Varnek, A. The one-class classification approach to data description and to models applicability domain. Mol. Inf. 2010, 29, 581−587.
(35) Kaneko, H.; Funatsu, K. Estimation of predictive accuracy of soft sensor models based on data density. Chemom. Intell. Lab. Syst. 2013, 128, 111−117.
(36) Ajmani, S.; Jadhav, K.; Kulkarni, S. A. Three-dimensional QSAR using the k-nearest neighbor method and its interpretation. J. Chem. Inf. Model. 2006, 46, 24−31.
(37) Ghasemi, J.; Niazi, A.; Leardi, R. Genetic-algorithm-based wavelength selection in multicomponent spectrophotometric determination by PLS: application on copper and zinc mixture. Talanta 2003, 59, 311−317.
(38) http://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors (accessed November 2, 2017).
(39) http://www.rdkit.org/ (accessed November 2, 2017).
(40) Basak, S. C.; Mills, D. R.; Balaban, A. T.; Gute, B. D. Prediction of Mutagenicity of Aromatic and Heteroaromatic Amines from Structure: A Hierarchical QSAR Approach. J. Chem. Inf. Comput. Sci. 2001, 41, 671−678.
(41) https://www.chemaxon.com/ (accessed November 2, 2017).
(42) Hou, T. J.; Xia, K.; Zhang, W.; Xu, X. J. ADME Evaluation in Drug Discovery. 4. Prediction of Aqueous Solubility Based on Atom Contribution Approach. J. Chem. Inf. Comput. Sci. 2004, 44, 266−275.
(43) http://www.cadaster.eu/node/65.html (accessed November 2, 2017).
(44) Hu, Y.; Bajorath, J. Extending the Activity Cliff Concept: Structural Categorization of Activity Cliffs and Systematic Identification of Different Types of Cliffs in the ChEMBL Database. J. Chem. Inf. Model. 2012, 52, 1806−1811.
(45) Kaneko, H. A New Measure of Regression Model Accuracy that Considers Applicability Domains. Chemom. Intell. Lab. Syst. 2017, 171, 1−8.