ADME Properties Evaluation in Drug Discovery: Prediction of Caco-2

Article pubs.acs.org/jcim

ADME Properties Evaluation in Drug Discovery: Prediction of Caco‑2 Cell Permeability Using a Combination of NSGA-II and Boosting

Ning-Ning Wang,†,⊥ Jie Dong,†,⊥ Yin-Hua Deng,† Min-Feng Zhu,‡ Ming Wen,§ Zhi-Jiang Yao,†,§ Ai-Ping Lu,∥ Jian-Bing Wang,§ and Dong-Sheng Cao*,†,∥

† School of Pharmaceutical Sciences, Central South University, Changsha 410013, P. R. China
‡ School of Mathematics and Statistics, Central South University, Changsha 410083, P. R. China
§ College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, P. R. China
∥ Institute for Advancing Translational Medicine in Bone & Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, P. R. China

Supporting Information

ABSTRACT: The Caco-2 cell monolayer model is a popular in vitro surrogate for predicting the human intestinal permeability of a drug because of its morphological and functional similarity to human enterocytes. A quantitative structure−property relationship (QSPR) study was carried out to predict the Caco-2 cell permeability of a large data set of 1272 compounds. Four methods, namely multivariate linear regression (MLR), partial least-squares (PLS), support vector machine (SVM) regression and Boosting, were employed to build prediction models with 30 molecular descriptors selected by the nondominated sorting genetic algorithm-II (NSGA-II). The best model, obtained with Boosting, gave R² = 0.97, RMSEF = 0.12, Q² = 0.83, RMSECV = 0.31 for the training set and RT² = 0.81, RMSET = 0.31 for the test set. A series of validation methods were used to assess the robustness and predictive ability of our model according to the OECD principles and to define its applicability domain. Compared with previously reported QSAR/QSPR models of Caco-2 cell permeability, our model shows some advantage in data set size and prediction accuracy. Finally, we found that the polar volume, hydrogen bond donors, the surface area and some other descriptors influence Caco-2 permeability. These results suggest that the proposed model is a good tool for predicting the permeability of drug candidates and for performing virtual screening in the early stage of drug development.



Received: October 23, 2015. Published: March 26, 2016. © 2016 American Chemical Society. DOI: 10.1021/acs.jcim.5b00642. J. Chem. Inf. Model. 2016, 56, 763−773.

INTRODUCTION

Oral administration is the preferred route of drug delivery and a major goal in the development of new drugs because of its ease and good patient compliance. Before an oral drug reaches the systemic circulation, it must pass through intestinal cell membranes via passive diffusion, carrier-mediated uptake or active transport processes. A key strategy of the pharmaceutical industry for overcoming its productivity crisis in drug discovery is to focus on the molecular properties that govern absorption, distribution, metabolism and excretion (ADME).1 Bioavailability, which reflects the proportion of a drug that reaches the circulatory system, is a significant index of drug efficacy. Screening for absorption ability is one of the most important parts of assessing oral bioavailability; therefore, it is crucial in ADME profiling.2 Studies demonstrate an apparent correlation between the human intestinal absorption of a drug and its intestinal permeability.3 As a consequence, the permeability value of a compound may serve as an index of its human intestinal absorption. Among the in vitro models for drug permeability, such as the Parallel Artificial Membrane Permeability Assay (PAMPA), the human colon adenocarcinoma (Caco-2) cell line, the Madin-Darby canine kidney (MDCK) cell line and the porcine kidney epithelial cell line (LLC-PK1), the most widely used is the Caco-2 cell line.4−7 The Caco-2 cell line is a popular surrogate for the human intestinal epithelium in estimating in vivo drug permeability because of its morphological and functional similarities to human enterocytes.8 Moreover, the expression of transporters in Caco-2 cells, such as P-glycoprotein (P-gp) and the proton-coupled oligopeptide transporter (PEPT1), offers great advantages over simplified models.9 In the evaluation of in vitro pharmacokinetic properties, the apparent permeability coefficient of the Caco-2 cell monolayer (Papp) is usually used to evaluate human intestinal permeability (HIP). Nevertheless, Caco-2 cell models have several disadvantages; the principal practical shortcoming is the long culture period (21−24 days) and the consequently high cost. This makes high throughput screening (HTS) of drugs difficult, not to mention virtual screening in the early stage of drug discovery.10,11

To solve these problems, computational prediction modeling provides an inexpensive and fast way to assess the intestinal permeability potential of candidate drugs. Two types of computational approaches exist: those based on rules-of-thumb and structural alerts, such as the Rule-of-Five, and those based on data modeling, such as Quantitative Structure−Activity/Property Relationships (QSAR/QSPR).12−16 The latter is generally considered more appropriate for predicting molecular properties because of the complex disposition processes of oral drugs.17 Many QSAR/QSPR studies of in vitro permeability have been published.18−23 Some were based on PAMPA and MDCK cell lines, which are also important to the pharmaceutical industry.7,23 Most of the remaining models were built on Caco-2 cell line data.18−21 In these studies, various types of descriptors, such as topological descriptors and BCUT descriptors, have been applied. Several statistical methodologies have also been introduced to predict ADMET properties; multiple linear and nonlinear regression (MLR and MNLR, respectively) are two examples. They require stepwise regression, mathematical independence of the x-variables and, as a rule-of-thumb, at least 5 to 10 times as many observations as variables.24 Partial least-squares (PLS) and support vector machine (SVM) algorithms are more advanced treatments for data sets with few observations and many variables.25 Moreover, novel approaches have been applied to QSAR/QSPR studies; membrane-interaction QSAR (MI-QSAR) and holographic QSAR (HQSAR) analyses are two popular examples.26−28 In all cases, the obtained models predicted Caco-2 permeability with a reasonable degree of accuracy.

However, these models might be of little practical value because of their small permeability data sets or lower accuracy. When data are limited, statistical models often fail because of overfitting and a narrow applicability domain, which limits their use in spite of their high accuracy.29 For example, in 2006, Seo Jeong Jung et al. reported a model with an accuracy of 0.998, but it was not practical because the data set consisted of only 20 molecules.30 On the other hand, recently published models built on large data sets were less accurate, and most of these studies failed to strictly follow the Organization for Economic Co-operation and Development (OECD) principles for QSAR/QSPR studies: (1) a defined end point; (2) an unambiguous algorithm; (3) a defined domain of application (AD); (4) appropriate measures of goodness-of-fit, robustness and predictive ability; (5) a mechanistic interpretation, if possible.31−34

Taking these issues into consideration, the main objective of this study is to obtain a practical and efficient model to predict Caco-2 permeability. First, we collected a relatively large and structurally diverse Caco-2 permeability data set, built different QSPR models by MLR, PLS, SVM and Boosting, and chose the best one for further analysis. Subsequently, several model validation methods were used to assess the robustness and predictive ability of our model according to the OECD principles and to define its applicability domain. Compared with other QSAR/QSPR models published in the literature, our new QSPR model can compensate for the existing disadvantages to some extent and predict the permeability values of new compounds quickly and reliably. Finally, we identified some important descriptors by analyzing the correlation between each descriptor and the Papp value, and we tried to interpret these descriptors as reasonably as possible. The results indicated that our proposed model could give a reliable and robust prediction of Papp values.



MATERIALS AND METHODS

Data Collection. In our study, the experimental values of Caco-2 permeability were collected from 23 published articles, the ChEMBL database (https://www.ebi.ac.uk/chembl) and an online chemical database (https://ochem.eu/).2,3,18,26−31,35−48 This yielded an updated and heterogeneous data set of 1561 molecules. However, the permeability data were inconsistent because of the different sources. To improve the quality and reliability of the data, we processed them as follows. Only compounds with exact values were included; molecules with empty or indeterminate values were removed. If there were two or more entries for one molecule, the arithmetic mean of these values was adopted to reduce random error. Compounds with permeability values greater than 10⁻⁴ cm/s or less than 10⁻⁸ cm/s were removed because of their potential unreliability. Only one record was retained for conformational or optical isomers, because stereochemistry was not considered in the subsequent analysis. Solvent molecules or salt ions attached to the molecules were removed automatically with Open Babel (http://openbabel.org/wiki/Get_Open_Babel). The SMILES structures of these molecules were checked one by one to ensure their correctness. After this series of pretreatments, 1272 compounds and their permeability values were retained for further analysis. Their SMILES structures and permeability values can be found in SI1.

Descriptor Calculation and Pruning. The Molecular Operating Environment software (version 2011.10) was used to calculate two-dimensional descriptors of the 1272 molecules, giving 188 descriptors in total.
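The curation rules listed under Data Collection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(smiles, papp)`, the function name `curate`, and the choice to apply the range filter after averaging are hypothetical, and the SMILES keys are assumed to be already standardized (salts stripped, stereochemistry ignored).

```python
from collections import defaultdict
from statistics import mean


def curate(records, low=1e-8, high=1e-4):
    """Return {smiles: mean Papp in cm/s} after three filtering rules.

    `records` is an iterable of (smiles, papp) pairs; papp is None when
    the source gives no exact value (hypothetical layout).
    """
    grouped = defaultdict(list)
    for smiles, papp in records:
        if papp is None:            # rule 1: drop empty or indeterminate values
            continue
        grouped[smiles].append(float(papp))

    curated = {}
    for smiles, values in grouped.items():
        papp = mean(values)         # rule 2: average replicate entries
        if low <= papp <= high:     # rule 3: keep 1e-8 <= Papp <= 1e-4 cm/s
            curated[smiles] = papp
    return curated


# Toy records, not taken from the paper's data set.
records = [("CCO", 1e-5), ("CCO", 3e-5), ("CCN", None),
           ("c1ccccc1", 1e-3), ("CC", 1e-9)]
cleaned = curate(records)
```

In this toy input only the duplicated entry survives, with its two measurements averaged.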
After that, the two-dimensional structures of the molecules were optimized into three-dimensional structures with the Merck Molecular Force Field 94 (MMFF94), with the gradient threshold of the potential energy set to 0.001 kcal−1·mol−1.49 Subsequently, all three-dimensional descriptors were calculated, resulting in 138 descriptors. In total, there were 326 descriptors, and all of them were checked to ensure that a value was available for each molecule. Two pretreatments were performed to delete uninformative descriptors before further selection: (1) descriptors whose variance is 0 or close to 0 were deleted; (2) if the correlation coefficient between two descriptors was higher than 0.95, only one was retained. Finally, 193 descriptors were used for further variable selection and QSPR modeling. According to the OECD principles, both internal and external validation are needed to verify the reliability and predictive ability of models. In this study, all molecules were divided into a training set and a test set by the sample set partitioning based on joint x−y distances (SPXY) method, to guarantee that the test samples map the measured region of the input variable space completely.50 Thus, we obtained a training set of 1017 molecules (80% of the data set) and a test set of 255 molecules (20% of the data set). The training set was used to construct the prediction model and the test set was used to assess its predictive ability. Additionally, to further validate the predictive ability of our model, we collected 298 compounds with Caco-2 permeability values and 220


used in NSGA-II are as follows: population size = 50, iterations = 500, number of targets = 2, largest number of principal components = 40. The importance of each descriptor was evaluated by the number of times it appeared, and 50 Pareto optimal solutions were finally determined. The result is shown in Figure 1. The influence of the number of

compounds with MDCK permeability values from the ChEMBL database and other literature sources, respectively. MDCK permeability is not the same end point as Caco-2 permeability, but it is very similar and gives an idea of the predictive performance on compounds from different chemical series.

Methods and Performance Evaluation. In this paper, we constructed models with four statistical methodologies to find the optimal one. Multiple linear regression (MLR) is a popular method that models the relationship between two or more explanatory variables and a response variable by fitting a linear equation to the observed data. Partial least-squares (PLS) is a suitable choice for data sets with few observations and many variables; it can use correlated variables and data matrices with missing values in model development.25 The support vector machine (SVM) is an algorithm based on the structural risk minimization principle from statistical learning theory. Although developed for classification problems, SVM can also be applied to regression.51,52 Boosting, which originated in the machine learning field, is a rapidly developing learning method that can improve model accuracy and help avoid overfitting.53−55 Mathematically, boosting is an iterative reweighting procedure that sequentially applies a base learner to reweighted versions of the training data, where the current weights are modified according to how accurately the previous learners predict these samples. All four approaches were used to build predictive models, and a series of statistical indexes were calculated.
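To make the idea of sequentially applied base learners concrete, the following is a minimal sketch of boosting for regression. It is not the authors' implementation: as an assumption, it uses least-squares gradient boosting, where each round fits a one-split regression stump to the current residuals instead of reweighting samples, and all function names and parameter values are illustrative.

```python
def fit_stump(X, residuals):
    """Find the single-descriptor threshold split with the lowest squared error."""
    n, best = len(X), None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [residuals[i] for i in range(n) if X[i][j] <= t]
            right = [residuals[i] for i in range(n) if X[i][j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:                      # every descriptor is constant
        m = sum(residuals) / n
        return (0, float("inf"), m, m)
    return best[1:]                       # (feature, threshold, left mean, right mean)


def stump_predict(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr


def fit_boosting(X, y, n_rounds=50, learning_rate=0.1):
    """Add one stump per round, each fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + learning_rate * stump_predict(stump, row)
                for p, row in zip(pred, X)]
    return base, learning_rate, stumps


def predict_boosting(model, X):
    base, lr, stumps = model
    return [base + lr * sum(stump_predict(s, row) for s in stumps) for row in X]


# Toy data: a step function of one descriptor.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 0.0, 1.0, 1.0]
model = fit_boosting(X, y, n_rounds=200, learning_rate=0.1)
fitted = predict_boosting(model, X)
```

The small learning rate means each stump corrects only a fraction of the remaining error, which is one common way to limit overfitting in this family of methods.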
For the best predictive model, further study was performed to verify its robustness and predictive ability. The general process was as follows: cross validation to evaluate model performance, Y-randomization to validate the reliability of the model, a Williams plot to evaluate the applicability domain, comparison with published models and, finally, interpretation of the most important descriptors. To ensure that the model derived from the training set has good generalization ability, 5-fold cross validation and external test sets were used for validation. For 5-fold cross validation, the training set was first split into five roughly equal-sized parts. The model was then built with four parts of the data, and the prediction error on the remaining part was calculated. The process was repeated five times so that every part was used once as a validation set. Six commonly used parameters in regression problems were employed to evaluate model performance: the squared correlation coefficient of cross validation (Q²), the root mean squared error of cross validation (RMSECV), the squared correlation coefficient of fitting (R²), the root mean squared error of fitting (RMSEF), the squared correlation coefficient for the test set (RT²), and the root mean squared error for the test set (RMSET).
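The 5-fold procedure and the paired statistics above can be sketched as follows. The exact formulas used in the paper are not given in this section, so this sketch assumes Q² = 1 − PRESS/SS_tot computed from the pooled out-of-fold predictions; the one-descriptor line fit is only a stand-in model, and all names are illustrative.

```python
import math
import random


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot (assumed definition for R2, Q2 and RT2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def cross_validate(X, y, fit, predict, k=5, seed=0):
    """Return (Q2, RMSECV) from k-fold out-of-fold predictions."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k roughly equal parts
    y_cv = [None] * len(y)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(fold, predict(model, [X[i] for i in fold])):
            y_cv[i] = p
    return r_squared(y, y_cv), rmse(y, y_cv)


# Stand-in model for illustration: least-squares line on one descriptor.
def fit_line(X, y):
    x = [row[0] for row in X]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx


def predict_line(model, X):
    slope, intercept = model
    return [slope * row[0] + intercept for row in X]


X = [[float(i)] for i in range(10)]
y = [2.0 * row[0] + 1.0 for row in X]
q2, rmsecv = cross_validate(X, y, fit_line, predict_line)
```

Calling `r_squared` and `rmse` on fitted values of the training set or on predictions for the test set gives the fitting and test-set statistics in the same way.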

Figure 1. Relationship between −Q2 and the number of descriptors.

descriptors on the performance of the models (Q²) can be seen from the figure. −Q² generally decreases as the number of descriptors grows, meaning that more descriptors give a better model. From the perspective of practical application, however, we prefer a simple predictive model. There is therefore a conflict: on the one hand we pursue good predictive ability, but on the other hand we want fewer descriptors. In this study, we selected 30 descriptors to keep a good balance between performance and complexity. The selected descriptors and their coefficients in the model can be found in SI2 and SI3.

Model Building and External Validation. The 30 descriptors selected by NSGA-II were used to generate QSPR models by MLR, PLS, SVM and Boosting. To ensure that the model derived from the training set has good generalization ability, 5-fold cross validation and external test sets were used for validation.58−60 The four predictive models and their validation indexes are shown in Table 1. From the

Table 1. Statistical Results of the Fitting, 5-fold Cross Validation



RESULTS AND DISCUSSION

Descriptor Selection. The Nondominated Sorting Genetic Algorithm (NSGA), proposed by Srinivas and Deb in 1994, is a genetic algorithm based on dominance and nondominance relationships between individuals. In 2002, the algorithm was improved into NSGA-II, which reduced the complexity of the algorithm and maintained the diversity of the population.56,57 In this study, we used NSGA-II combined with PLS to find Pareto optimal solutions that balance the predictive ability of the model against the number of descriptors. The final result was 50 mutually independent optimal solutions. The parameters

method     Q²     RMSECV   R²     RMSEF   RT²     RMSET
MLR        0.79   0.34     0.81   0.33    0.75    0.36
PLS        0.79   0.34     0.81   0.33    0.75    0.36
SVM        0.81   0.32     0.91   0.22    0.80    0.32
Boosting   0.83   0.31     0.97   0.12    0.812   0.31

statistical results of the fitting and 5-fold cross validation, we can see that all four predictive models have good statistical results on the whole, which indicates that the selected descriptors can predict Caco-2 permeability values effectively. In the table, the two models built by MLR and PLS have the same results because all 30 descriptors carry information and were all selected in the PLS model. The SVM model performs better than the MLR and PLS models: for the training set, R² = 0.91, RMSEF = 0.22, and for the cross


validation, Q² = 0.81, RMSECV = 0.32. When the model was applied to the test set, RT² = 0.80, RMSET = 0.32. Among all the regression models, the Boosting model is the best one. For the training set, R² = 0.97, RMSEF = 0.12, and for the cross validation, Q² = 0.83, RMSECV = 0.31. When the Boosting model was applied to the test set, RT² = 0.81, RMSET = 0.31. On the whole, the SVM and Boosting models are better than the MLR and PLS models, so we speculate that there may be a nonlinear relationship between the selected descriptors and the Caco-2 permeability values.25,52,54,61 For the Boosting model, Q² is slightly lower than R², indicating that the model is reliable and can avoid overfitting. When the model is applied to the test set, RT² is also slightly lower than Q². As mentioned earlier, the test set was not used to establish the Boosting model; therefore, our final model may be generally applicable. The relationship between experimental values and predicted values is shown in Figure 2. Additionally, a stricter evaluation was applied to the

$$0.85 \le k = 0.996 \le 1.15, \quad 0.85 \le k' = 1.001 \le 1.15$$

In conclusion, the Boosting model is suitable for predicting Caco-2 permeability with the selected molecular descriptors. We then used the Boosting model to predict the permeability values of the external Caco-2 and MDCK permeability data sets (see SI5 and SI6). For Caco-2 permeability, RT² = 0.75, RMSET = 0.36. For MDCK permeability, RT² = 0.69, RMSET = 0.38. These two external validations, together with cross validation, demonstrate that the Boosting model can predict permeability values with reasonable accuracy.

Y-Randomization Test. When significant descriptors are selected by an optimization algorithm, some descriptors may appear important purely by chance. To guard against having learned such chance models, a Y-randomization test was used to validate the reliability of our QSPR model.65−67 In the Y-randomization test, the Caco-2 permeability values were randomly shuffled to change their true order. Thus, although the Caco-2 permeability values (and their statistical distribution) stayed the same, their assignment to compounds and descriptors was altered, destroying any meaningful relation that may have existed between the independent variables and the response values. These shuffled data were used to construct a series of prediction models (here, 1000), giving the distribution of metrics such as Q². These metrics can be compared with those of the true model to reveal chance correlation. The distribution of the Q² values of the 1000 randomized models and of the true model is shown in Figure 3. From the figure, we can clearly

Figure 2. Plot of predicted log Papp versus experimental log Papp for the training set (blue) and the test set (red).

Boosting predictive model. According to the suggestions from Tropsha et al., a QSAR/QSPR model can be seen as a practical predictive model when it satisfies the following criteria:62−64

$$Q^2 > 0.5$$

$$R^2 > 0.6$$

$$\frac{R^2 - R_0^2}{R^2} < 0.1 \quad \text{or} \quad \frac{R^2 - R_0^{\prime 2}}{R^2} < 0.1$$

$$0.85 \le k \le 1.15 \quad \text{or} \quad 0.85 \le k' \le 1.15$$

Figure 3. Distribution of Q² of randomized models compared with the true model in the Y-randomization test. (The red vertical line on the right represents the Q² of the true model, and the distribution on the left side represents the distribution of Q² values of randomized models.)

see that the Q² values of the randomly shuffled models lie in the range from −0.50 to −0.19. Compared with the Q² value of the real model (Q² = 0.83), the difference is statistically significant (p value < 1.6 × 10⁻¹²). The much lower Q² values of the shuffled models suggest that our model reflects a true relationship between the selected molecular descriptors and Caco-2 permeability values rather than chance correlation.

Applicability Domain Evaluation. The applicability domain (AD) evaluation helps ensure that a QSAR/QSPR model predicts unknown compounds accurately and reasonably. This is consistent with the AD evaluation

where Q² and R² are the squared correlation coefficients for cross validation and the external test set, respectively; R0² and R0′² are the determination coefficients of predicted versus experimental values and experimental versus predicted values, respectively; and k and k′ are the slopes of the corresponding regression lines through the origin. According to Table 1,

$$Q^2 = 0.83 > 0.5, \quad R^2 = 0.81 > 0.6$$

$$\frac{R^2 - R_0^2}{R^2} < 0.1, \quad \frac{R^2 - R_0^{\prime 2}}{R^2} < 0.1$$
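The criteria above can be checked numerically from observed and predicted values. The sketch below follows one common formulation of the through-origin quantities and is an assumption, since the text does not spell out the formulas: k is the slope of observed on predicted, k′ the slope of predicted on observed, and each R0² is computed from the residuals of the corresponding through-origin line. The function name and the example numbers are illustrative only.

```python
def tropsha_check(y_obs, y_pred, q2):
    """Evaluate the four acceptance criteria for one set of predictions."""
    n = len(y_obs)
    mean_o, mean_p = sum(y_obs) / n, sum(y_pred) / n
    s_op = sum(o * p for o, p in zip(y_obs, y_pred))

    k = s_op / sum(p * p for p in y_pred)        # y_obs  ~ k  * y_pred
    k_prime = s_op / sum(o * o for o in y_obs)   # y_pred ~ k' * y_obs

    ss_o = sum((o - mean_o) ** 2 for o in y_obs)
    ss_p = sum((p - mean_p) ** 2 for p in y_pred)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(y_obs, y_pred))
    r2 = cov * cov / (ss_o * ss_p)               # squared Pearson correlation

    # Determination coefficients for the two regressions through the origin.
    r0_sq = 1 - sum((o - k * p) ** 2 for o, p in zip(y_obs, y_pred)) / ss_o
    r0p_sq = 1 - sum((p - k_prime * o) ** 2 for o, p in zip(y_obs, y_pred)) / ss_p

    passes = (q2 > 0.5 and r2 > 0.6
              and ((r2 - r0_sq) / r2 < 0.1 or (r2 - r0p_sq) / r2 < 0.1)
              and (0.85 <= k <= 1.15 or 0.85 <= k_prime <= 1.15))
    return {"r2": r2, "r0_sq": r0_sq, "r0p_sq": r0p_sq,
            "k": k, "k_prime": k_prime, "passes": passes}


# Toy values; q2 is supplied separately from cross validation.
result = tropsha_check([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], q2=0.83)
```

Because the literature contains more than one convention for R0², results from this sketch should be compared with published values only after confirming the convention used.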


are likely to be unreliable. Additionally, some compounds with higher leverage values have lower prediction errors, whereas some compounds with higher prediction errors have lower leverage values. This could be partly explained by the fact that the defined AD only considers interpolation, simply excluding all samples in the extremities and including all those surrounded by training samples. Compounds outside the area formed by the three gray lines in Figure 4 are identified as outliers. Compounds with larger prediction errors that lie outside the area bounded by the two vertical lines are diagnosed as y-direction outliers. There are two structurally unusual molecules among these 12 outliers (see SI4). One is sulfasalazine, a sulfonamide that is commonly used in the treatment of rheumatoid arthritis and ulcerative colitis.73 Sulfasalazine consists of sulfapyridine and 5-aminosalicylic acid linked by an azo group. The azo group may not be represented in the training set, which would make the compound an outlier for our final model. The other is sparfloxacin, a third-generation quinolone antibiotic; it is a broad-spectrum oral drug that is highly active against common respiratory pathogens, including multiresistant strains.74,75 Its structure contains a cyclopropane ring, which may be the reason for its unusual behavior. The chemical structures of the two outliers are shown in Figure 5.

criterion in the OECD principles (the third principle: a defined domain of application).68−72 Published predictive models have seldom evaluated their ADs, so it is risky to apply these QSAR/QSPR models to new compounds. The prediction for a molecule is most likely to be reliable if the molecule falls within the AD; otherwise, the prediction is likely to be unreliable. In this study, we used the Williams plot to evaluate the AD of our QSPR model. The Williams plot, a common method for AD evaluation, plots leverage values against prediction errors. The leverage value (h) measures the distance from the centroid of the training set and can be calculated for a given data set X from the leverage matrix (H) as follows:68,71

$$H = X(X^{T}X)^{-1}X^{T}$$

where X is the descriptor matrix, $X^{T}$ is its transpose, and $(X^{T}X)^{-1}$ is the inverse of $X^{T}X$. The diagonal elements of H are the leverage values (h) of the molecules in the data set. The warning leverage, h*, was fixed at 3p/n (h* = 0.0885) in this study, where p is the number of descriptors and n is the number of training samples. A query molecule with leverage higher than h* may be associated with an unreliable prediction. Such molecules lie outside the descriptor space and are therefore considered outside the AD. Figure 4 is the Williams plot based on the leverage values and

Figure 5. Structures of the two compounds diagnosed as y-direction outliers. (Left, sulfasalazine; right, sparfloxacin.)

After the removal of these outliers, Q² increased by 0.02 and RMSECV decreased by 0.02. For the external Caco-2 and MDCK permeability data sets, after the removal of eight and six outliers, RT² increased by 0.05 and 0.051, respectively, and the corresponding RMSET decreased by 0.04 and 0.05, respectively (see SI7 and SI8). Furthermore, some compounds with higher leverage values have low prediction errors; we defined them as x-direction outliers. These x-direction outliers are far from the main body of the training set. However, they do not have large prediction errors and therefore do not degrade the prediction performance. In summary, the AD defined by the Williams plot is reasonable, and we can use it to evaluate the reliability of the Boosting model for the prediction of new compounds.

Comparison with Other QSAR/QSPR Models. To further evaluate the predictive ability of our model, we compared its prediction results with those of previously published models. The first QSAR/QSPR model for predicting Caco-2 permeability values was presented by Ulf Norinder in 1997, and since then several regression and classification models have been published.3 The models known to us are collected in Table 2. From the table, we can see that the data sets used in the regression models vary widely in size: the smallest includes 17 molecules and the largest includes 15791 molecules. Various methods, such as PLS, MLR, NN and ANN, are used to construct QSAR/QSPR models; moreover, the accuracies and descriptors also differ considerably between different

Figure 4. Williams plot of leverages versus prediction errors. The horizontal line represents the warning leverage value (h* = 3p/n ≈ 0.0885), and the vertical lines mark ±3 standard deviation units.
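The leverage calculation behind a Williams plot can be sketched without external libraries. The sketch assumes the descriptor matrix is used exactly as supplied (no centering or intercept column, which the text does not specify) and that X has full column rank; the helper names are illustrative. Each leverage is computed as h_i = x_i (XᵀX)⁻¹ x_iᵀ by solving a small linear system instead of forming the full H matrix.

```python
def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]   # fails if A is singular
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def leverages(X):
    """Diagonal of H = X (X^T X)^-1 X^T, one value per molecule (row)."""
    p = len(X[0])
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(p)]
           for a in range(p)]
    return [sum(xi * zi for xi, zi in zip(row, solve(xtx, row))) for row in X]


def warning_leverage(p, n):
    return 3 * p / n


def inside_domain(h, standardized_error, h_star):
    """Inside the AD when leverage and standardized error are both in range."""
    return h <= h_star and abs(standardized_error) <= 3


# Toy descriptor matrix with three molecules and two descriptors.
h = leverages([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h_star = warning_leverage(30, 1017)   # p = 30 descriptors, n = 1017 training molecules
```

With p = 30 and n = 1017 the warning leverage evaluates to about 0.0885, the value quoted in the text, and the leverages of any full-rank matrix sum to the number of descriptors.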

prediction errors. In the figure, the prediction errors and leverage values are represented by the horizontal and vertical axes, respectively. The horizontal line divides the leverage axis into two parts, and points above the line have a leverage greater than h* (i.e., 3p/n = 0.0885). The two vertical lines divide the prediction error axis into three parts and reveal compounds in the training set with prediction errors beyond three standard deviations (±3σ, i.e., ±0.921). We defined the AD of our final model in Figure 4. From the figure, we can see that the vast majority of compounds in the training and test sets fall within the AD. Compounds in the domain have lower leverage values and prediction errors, indicating that they can be well predicted by the Boosting model. However, some compounds fall outside the AD, indicating that predictions of these compounds


Table 2. QSAR/QSPR Regression/Classification Models for Predicting Papp Values

serial number

year

1

19973

17

9

2 3 4 5

200227 200235 200236 200228

38 87 73 81

6 5 24 4

6 7

200437 200438

8 9 10 11

200439 200542 200543 20062

51 33 16 100 32 57 100

70 6 2 4 2 4 4

12 13 14 15 16

200630 200744 200845 200846 200947

20 207 41 100 81

17

201048

296

4 9 5 5 6 5 12

18

201226 6 201234 200541 200818 201129 201331

60

6

15791 712 157 674 1289

8 5 6 9 9

19 20 21 22 23

number of molecules

number of descriptors

method

accuracy

PLSa GFAb ANNc GA-PLSd iterative calculation method PLS MLRe MLR MLR stepwise method ANN/MLR GAf SVMg MLR PLS GA-NN MLR MLR ANN ANN GFA GFA RFh KNNi LDAj LDA DTk

R2 = 0.91, Rcv2 = 0.85, Npc = 2, F = 70.27, RMSEtr = 0.305, RMSEcvtr = 0.39 R2 = 0.86, Q2 = 0.77 R = 0.79, RMSE = 0.51, Rcv = 0.71, RMSEcv = 0.58 R = 0.89, Rcv = 0.83 R = 0.85 R2 = 0.79, Q2 = 0.65, N = 46, RMSEp = 0.45 AP-BL, R = 0.84, S = 6.64 BL-AP, R = 0.89, S = 6.84 R = 0.82, Q = 0.79, S = 0.45 R2 = 0.56, S = 0.58, F = 18.1 R2 = 0.95/R2 = 0.40 R = 0.88, RTest = 0.85 R = 0.71, RTest = 0.78 R2 = 0.998, Q2 = 0.79 R2 = 0.86, Q2 = 0.86, R2p = 0.79, Q2 (14%) = 0.73 R2 = 0.95, Q2 = 0.92, LSE = 0.02 nonstochastic: R2 = 0.72, Q2 = 0.68, stochastic: R2 = 0.71, Q2 = 0.63 R2 = 0.72, Rcv2 = 0.68, F = 32.39 Rtr2 = 0.82, Rtest2 = 0.75 R = 0.84, RMSEF = 0.55, Q = 0.70, RMSEcv = 0.79, RT = 0.77, RMSET = 0.60 MI-QSAR: R2 = 0.81, Q2 = 0.70 HQSAR: R2 = 0.92, Q2 = 0.54 R2 = 0.52, RMSE = 0.20 external validation: 85% MCC = 0.81, 90.58%(tr); 84.21%(te) MCC = 0.62, 81.56%(tr); 83.94%(te) MCC = 0.67(H), 0.52(M), 0.65(L) 78.4/76.1/79.1%(tr); 78.6/71.1/77.6%(te)

a PLS, partial least-squares. b GFA, genetic function approximation. c ANN, artificial neural network. d GA-PLS, genetic algorithm-partial least-squares. e MLR, multiple linear regression. f GA, genetic algorithm. g SVM, support vector machine. h RF, random forest. i KNN, K-nearest neighbor algorithm. j LDA, linear discriminant analysis. k DT, decision trees.

models. At present, there are fewer classification models than regression models for predicting Caco-2 cell permeability. The latest classification model, built with 1289 compounds, was reported in 2013; it correctly predicted 78.4/76.1/79.1% of H/M/L compounds in the training set and 78.6/71.1/77.6% in the test set. For comparison, we also evaluated the classification ability of the Boosting model. All compounds were divided into two classes according to a Caco-2 permeability cutoff value. Following biopharmaceutical classification system (BCS) permeability benchmarking, the β-blocker metoprolol (Papp = 20 × 10⁻⁶ cm/s), with a human intestinal absorption (HIA) of 96%, was used to define the high permeability class boundary.76 When the Boosting model was applied to classify the 1272 compounds into high and low classes, it correctly classified 87.7% and 82.4% of the training and test sets, respectively. In the same way, it correctly classified 83.2% and 75.5% of the external Caco-2 and MDCK test sets, respectively. These results are listed in Table 3, which shows that the Boosting model is useful for permeability classification. It is easier to build a model that classifies new compounds as high, moderate or low permeability than one that predicts their permeability values exactly. However, classification models cannot completely separate the three classes because numerous compounds are sparsely distributed in the overlapping areas between moderate-high and moderate-low

Table 3. Classification Results by Boosting Model

data set        H class   L class   accuracy (%)   MCCᵃ
training set    363       654       87.7           0.73
test setᵇ       92        163       82.4           0.61
test setᶜ       112       186       83.2           0.65
test setᵈ       109       111       75.5           0.51

ᵃMCC, Matthews correlation coefficient. ᵇThe Caco-2 test set. ᶜThe external Caco-2 test set. ᵈThe external MDCK test set.
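Table 3 reports accuracy and the Matthews correlation coefficient (MCC) for the two-class split at the metoprolol cutoff. As a minimal sketch of how such figures can be computed, the Python below labels compounds H or L using Papp = 20 × 10⁻⁶ cm/s and derives accuracy and MCC from the resulting confusion matrix. The function names, the choice to count a value exactly at the cutoff as H, and the eight Papp values are illustrative assumptions, not data or code from this study.

```python
import math

# Cutoff quoted in the text: metoprolol, Papp = 20 x 10^-6 cm/s.
PAPP_CUTOFF_CM_S = 20e-6


def to_class(papp_cm_s, cutoff=PAPP_CUTOFF_CM_S):
    """Label a compound 'H' (high) or 'L' (low) permeability.

    Treating a value exactly at the cutoff as 'H' is an assumption.
    """
    return "H" if papp_cm_s >= cutoff else "L"


def accuracy_and_mcc(observed, predicted, positive="H"):
    """Return (accuracy, MCC) for two equally long lists of class labels."""
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if pred == positive and obs == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif obs == positive:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a row or column of the matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical experimental and predicted Papp values in cm/s (toy numbers).
observed_papp = [35e-6, 22e-6, 50e-6, 1.5e-6, 8e-6, 0.4e-6, 19e-6, 25e-6]
predicted_papp = [30e-6, 15e-6, 41e-6, 2.0e-6, 21e-6, 0.9e-6, 12e-6, 28e-6]

observed_cls = [to_class(v) for v in observed_papp]
predicted_cls = [to_class(v) for v in predicted_papp]
acc, mcc = accuracy_and_mcc(observed_cls, predicted_cls)
print(f"accuracy = {acc:.3f}, MCC = {mcc:.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is a helpful companion figure when the H and L classes are unevenly populated, as they are in the training set of Table 3.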

permeability.31 From this viewpoint, a regression model seems more appropriate than a classification model for the same data set. For regression models, the more molecules a data set includes, the more difficult it generally is to improve accuracy and the more descriptors are needed to build the model. Additionally, systematic model validation and applicability domain (AD) evaluation were carried out strictly according to the OECD principles for QSAR/QSPR predictive models, which had not been reported in the literature until now. As for the number of compounds, our model has some advantage thanks to its relatively large data set of 1272 compounds. The model of Edward C. Sherer reported in 2012 shares some similarities with ours.34 First, MOE_2D descriptors are applied in both models. Sherer’s model includes 2D descriptors from MOE and two types of atom pairs; ours includes 2D and 3D descriptors from MOE. Second, the two

DOI: 10.1021/acs.jcim.5b00642 J. Chem. Inf. Model. 2016, 56, 763−773

Article

Journal of Chemical Information and Modeling

Figure 6. Variable importance of 30 molecular descriptors based on Boosting.

models are both validated with a new cell line permeability data set. Sherer’s model is validated with 313 Caco-2 permeability values, and ours is validated with 298 new Caco-2 permeability values and 220 MDCK permeability values. Third, the most important parameters for the Papp value in the two models are log D/log P. Even so, there are differences. The two models are constructed from different cell lines. Sherer’s model is built mainly with 15 791 LLC-PK1 cell line permeability values (rather than permeability values from Caco-2 cell lines), and its predictive accuracy appears limited: R2 = 0.52, RMSE = 0.20. By comparison, our model is built from 1272 Caco-2 permeability values and obtains a relatively better predictive accuracy: R2 = 0.97, RMSEF = 0.12; Q2 = 0.83, RMSECV = 0.31; RT2 = 0.81, RMSET = 0.31. Moreover, the prediction results for the MDCK permeability data using our model indicate that passive permeability in different cell lines can be predicted with similar molecular properties and descriptors. This observation is consistent with the conclusion drawn from Edward C. Sherer’s model reported in 2012.

Model Interpretation. In a model, many different combinations of descriptors may give similar prediction performance; therefore, it is difficult to establish a rich and interpretable model. However, the model can still provide some hints about the mechanisms related to Papp. To compare and interpret the molecular descriptors effectively, the variable importance of each descriptor was computed based on Boosting. In each step, one descriptor was removed, the remaining descriptors were used to establish a new model, and its Q2 was calculated. The difference between the new Q2 and the preceding one can be seen as a measure of the variable importance of the removed descriptor. The process was repeated 30 times, and the importance of every descriptor was thereby obtained. The result is shown in Figure 6.
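The leave-one-descriptor-out procedure described above can be sketched as follows. This is a hedged illustration, not the authors' implementation: a small k-nearest-neighbour regressor stands in for Boosting to keep the example dependency-free, Q2 is taken as 1 − PRESS/TSS from 5-fold cross validation, the "preceding" Q2 is assumed to be that of the model with all descriptors, and the three-descriptor data set is synthetic.

```python
import random


def q2(y_true, y_pred):
    """Q2 = 1 - PRESS / TSS, computed from cross-validated predictions."""
    mean_y = sum(y_true) / len(y_true)
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - press / tss


def knn_predict(train_X, train_y, x, k=3):
    """Stand-in learner (NOT Boosting): mean response of the k nearest rows."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), target)
        for row, target in zip(train_X, train_y)
    )[:k]
    return sum(target for _, target in nearest) / len(nearest)


def cross_validated_q2(X, y, n_folds=5):
    """Predict each sample from a model fitted on the other folds, then score."""
    preds = [0.0] * len(y)
    for fold in range(n_folds):
        train_idx = [i for i in range(len(y)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in range(fold, len(y), n_folds):  # samples held out in this fold
            preds[i] = knn_predict(train_X, train_y, X[i])
    return q2(y, preds)


def drop_one_importance(X, y, names):
    """Importance of a descriptor = Q2 with all descriptors - Q2 without it."""
    full_q2 = cross_validated_q2(X, y)
    importance = {}
    for j, name in enumerate(names):
        reduced_X = [row[:j] + row[j + 1:] for row in X]  # remove column j
        importance[name] = full_q2 - cross_validated_q2(reduced_X, y)
    return full_q2, importance


# Synthetic toy data: only the first descriptor drives the response.
rng = random.Random(0)
names = ["signal", "noise_1", "noise_2"]
X = [[rng.uniform(-1, 1) for _ in names] for _ in range(120)]
y = [-2.0 * row[0] + rng.gauss(0, 0.05) for row in X]

full_q2, importance = drop_one_importance(X, y, names)
for name, value in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name:8s} {value:+.3f}")
```

A large positive value means that removing the descriptor lowers Q2 substantially; values near zero or below suggest the descriptor is redundant with the others for this learner. Because the ranking depends on the learner and on the fold assignment, the numbers from a stand-in model like this one should not be compared with Figure 6.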
From Figure 6, the descriptors of greatest importance in predicting Caco-2 permeability were selected and are shown in Table 4. In general, these descriptors carry information about hydrogen bonding, polarity, the surface area of the molecule, and so on.

Table 4. Most Important Descriptors in Predicting Papp Values

serial number   name of descriptor   description
1               vsurf_Wp2            polar volume
2               a_don                number of hydrogen bond donors
3               a_nF                 number of fluorine atoms
4               b_double             number of double bonds
5               PEOE_VSA_FPNEG       fractional negative polar van der Waals surface area
6               PEOE_VSA_HYD         total hydrophobic van der Waals surface area
7               SlogP_VSA0           log P for atoms as calculated in the SlogP descriptor (L) ≤ −0.4
8               SMR_VSA3             sum of v such that R is in (0.35,0.39)
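The interpretation of these descriptors relies on a correlation coefficient between each descriptor and log Papp. As a minimal sketch, assuming those coefficients are ordinary Pearson correlations (the text does not name the statistic), the function below computes one for a single descriptor column. The a_don counts and log Papp values are made-up toy numbers chosen only to show a negative trend; they are not taken from the data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical hydrogen bond donor counts and log Papp values (toy numbers).
a_don = [0, 1, 1, 2, 3, 4, 5, 6]
log_papp = [-4.3, -4.5, -4.4, -4.9, -5.2, -5.6, -5.9, -6.3]

r = pearson_r(a_don, log_papp)
print(f"correlation between a_don and log Papp: {r:.2f}")
```

A coefficient close to −1 indicates that permeability falls almost linearly as the descriptor increases. A single-descriptor correlation says nothing about interactions between descriptors, which the nonlinear Boosting model can exploit.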

To explore further the relationship between Papp values and these descriptors, correlograms between each descriptor and the permeability values were obtained and are shown in Figure 7. The permeation of a compound is a complex process influenced by different kinds of interactions. Generally, a suitably shaped compound with proper polarity and lipophilicity is considered to have an appropriate permeability value and can permeate the cell membrane effectively. Figure 7 shows eight correlograms between permeability values and the important descriptors that are closely related to log Papp values. The descriptor vsurf_Wp2 (polar volume, correlation coefficient −0.52), which reflects the polarizability of a molecule, has a negative correlation with log Papp values.35 The greater the polarity of a molecule, the weaker its lipophilicity and permeability. The descriptor a_don, the number of hydrogen bond donors (correlation coefficient −0.75), shows the most apparent negative correlation with log Papp values. It contains information about the hydrogen-bonding capacity of the molecules. Hydrogen bonding has been identified as an important parameter for describing drug permeability in several studies, which is in agreement with our findings.77,78 When the number of hydrogen bonds of a compound increases, its lipophilicity becomes weaker, which is detrimental to the compound crossing the cell membrane by passive diffusion. Another tentative explanation may be that having many, but weaker, hydrogen bond interactions is more detrimental to the transport of the compound across the cell membrane. A potentially useful hint from our study is that hydrogen bond acceptor descriptors were less important for the prediction of human intestinal permeability. The descriptor a_nF, the number of fluorine atoms, is a typical representative of the element information. Halogen substitution in the molecules may play an important role in increasing the polarization and thus decreasing the water solubility and absorption of the molecule.26 The descriptor b_double gives the number of double bonds; it has a weaker negative correlation with log Papp values in Figure 7. This descriptor may have a composite effect in which the hydrogen-bonding property increases and the permeability values then decrease.

From Table 4, we can see that the descriptors PEOE_VSA_FPNEG (fractional negative polar van der Waals surface area, correlation coefficient −0.37), PEOE_VSA_HYD (total hydrophobic van der Waals surface area), SlogP_VSA0 [log P for atoms as calculated in the SlogP descriptor (L) ≤ −0.4, correlation coefficient −0.55], and SlogP_VSA3 (sum of v such that L is in 0−0.1, correlation coefficient −0.40) disclose information about the surface area of the molecule, and a negative correlation can be seen in Figure 7. In recent years, molecular surface properties, in particular the polar surface area (PSA), have been found to be a useful computational filter for membrane permeability in the early stage of drug discovery.79−81 It is generally affirmed that this composite descriptor is closely related to hydrogen bonding; moreover, a study focused on the nature of PSA was published in 2001 by Patric Stenberg.82 On the other hand, these descriptors also contain information about log P. Because of the presence of the phospholipid bilayer, high lipophilicity (log P) of a compound is favorable for efficient permeation. Even so, the calculated octanol/water partition coefficient (CLOGP) should be less than 5 according to Lipinski’s “rule-of-five”, which is used widely to identify several critical properties that should be considered for compounds intended for oral delivery. Besides the descriptors mentioned, the hydrophobicity and the connection information of the molecule can also influence the Caco-2 cell permeability to some extent.

Figure 7. Correlograms between each important descriptor and permeability values.

CONCLUSION

As a necessary prerequisite for the action of oral drugs, the HIA of a candidate drug is of high importance. In recent years, drug researchers have largely reached a consensus that evaluating the pharmacokinetic properties of candidate drugs in the early stage of drug discovery can effectively reduce the upfront investment in toxic compounds and thereby reduce the costs of drug discovery. The Caco-2 cell monolayer has been regarded as the best in vitro model for HIA studies because of the clear relationship between Caco-2 permeability and HIA values. Obtaining the Papp value of a new drug is therefore one of the most significant steps in its discovery process. In this paper, we constructed a QSPR model to predict Papp values reliably with a relatively large and structurally diverse data set. The comparison between Boosting and the other methods showed that the nonlinear model derived by Boosting gives better results with the same descriptors. Furthermore, a series of evaluation steps, such as cross validation, an external test, a Y-randomization test, and the applicability domain definition, ensure the robustness and reliability of our model, following the OECD principles. The literature survey in Table 2 demonstrates that, compared with other published models, our model has advantages to a certain extent in data set size, model accuracy, model validation, and applicability domain evaluation. These results indicate that the model built by Boosting is reliable and has good predictive ability. Our proposed QSPR model could provide a fast, convenient, and accurate way to predict Papp values of large numbers of compounds, even virtual compounds. All in all, this QSPR model is necessary and useful in the early stage of drug discovery and can save large amounts of human and financial resources.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.5b00642. SI1 lists the data set drugs, their basic structural information, and experimental permeability values; it also includes the permeability values predicted by our four QSAR models. SI2 and SI3 list the training set and test set drugs, respectively, and their descriptors after pruning. SI4 lists the structures of the 12 outliers in the Caco-2 permeability data set for the Boosting model. SI5 and SI6 list the basic structural information, experimental permeability values, and predicted values of the additional Caco-2 test set and the MDCK test set, respectively. SI7 and SI8 list the structures of the outliers in the external Caco-2 and MDCK permeability data sets, respectively (XLSX).



AUTHOR INFORMATION

Corresponding Author

*D. S. Cao. Email: [email protected].

Author Contributions

⊥The first two authors contributed to the paper equally.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

We thank three anonymous referees and the editor for their constructive comments, which greatly helped improve upon the original version of the paper. This work is financially supported by grants from the Project of Innovation-driven Plan in Central South University, the National Natural Science Foundation of China (Grant No. 81402853), and the Postdoctoral Science Foundation of Central South University, the Chinese Postdoctoral Science Foundation (2014T70794, 2014M562142). The studies meet with the approval of the university’s review board.

REFERENCES

(1) Lin, J.; Sahakian, D. C.; De Morais, S.; Xu, J. J.; Polzer, R. J.; Winter, S. M. The role of absorption, distribution, metabolism, excretion and toxicity in drug discovery. Curr. Top. Med. Chem. 2003, 3, 1125−1154.
(2) Guangli, M.; Yiyu, C. Predicting Caco-2 permeability using support vector machine and chemistry development kit. J. Pharm. Pharm. Sci. 2006, 9, 210−221.
(3) Norinder, U.; Österberg, T.; Artursson, P. Theoretical calculation and prediction of Caco-2 cell permeability using MolSurf parametrization and PLS statistics. Pharm. Res. 1997, 14, 1786−1791.
(4) Bohets, H.; Annaert, P.; Mannens, G.; Van Beijsterveldt, L.; Anciaux, K.; Verboven, P.; Meuldermans, W.; Lavrijsen, K. Strategies for absorption screening in drug discovery and development. Curr. Top. Med. Chem. 2001, 1, 367−383.
(5) Avdeef, A.; Bendels, S.; Di, L.; Faller, B.; Kansy, M.; Sugano, K.; Yamauchi, Y. PAMPA–critical factors for better predictions of absorption. J. Pharm. Sci. 2007, 96, 2893−909.
(6) Artursson, P.; Palm, K.; Luthman, K. Caco-2 monolayers in experimental and theoretical predictions of drug transport. Adv. Drug Delivery Rev. 2001, 46, 27−43.
(7) Irvine, J. D.; Takahashi, L.; Lockhart, K.; Cheong, J.; Tolan, J. W.; Selick, H. E.; Grove, J. R. MDCK (Madin-Darby canine kidney) cells: A tool for membrane permeability screening. J. Pharm. Sci. 1999, 88, 28−33.
(8) Maeda, K.; Suzuki, H.; Sugiyama, Y. Hepatic Transport. In Drug Bioavailability; Wiley-VCH Verlag GmbH & Co. KGaA: Weinheim, 2009; pp 277−332.
(9) Sun, H.; Pang, K. S. Permeability, transport, and metabolism of solutes in Caco-2 cell monolayers: a theoretical study. Drug Metab. Dispos. 2008, 36, 102−123.
(10) Delie, F.; Rubas, W. A human colonic cell line sharing similarities with enterocytes as a model to examine oral absorption: advantages and limitations of the Caco-2 model. Crit. Rev. Ther. Drug Carrier Syst. 1997, 14, 66−86.
(11) Anderle, P.; Niederer, E.; Rubas, W.; Hilgendorf, C.; Spahn-Langguth, H.; Wunderli-Allenspach, H.; Merkle, H. P.; Langguth, P. P-Glycoprotein (P-gp) mediated efflux in Caco-2 cell monolayers: the influence of culturing conditions and drug exposure on P-gp expression levels. J. Pharm. Sci. 1998, 87, 757−62.
(12) Hou, T.; Wang, J.; Zhang, W.; Wang, W.; Xu, X. Recent advances in computational prediction of drug absorption and permeability in drug discovery. Curr. Med. Chem. 2006, 13, 2653−67.
(13) Hou, T.; Wang, J. Structure-ADME relationship: still a long way to go? Expert Opin. Drug Metab. Toxicol. 2008, 4, 759−70.
(14) Hou, T.; Li, Y.; Zhang, W.; Wang, J. Recent developments of in silico predictions of intestinal absorption and oral bioavailability. Comb. Chem. High Throughput Screening 2009, 12, 497−506.
(15) Li, Y.; Xiao, H.; McClements, D. Encapsulation and Delivery of Crystalline Hydrophobic Nutraceuticals using Nanoemulsions: Factors Affecting Polymethoxyflavone Solubility. Food Biophys. 2012, 7, 341−353.
(16) Pelkonen, O.; Turpeinen, M.; Raunio, H. In Vivo-In Vitro-In Silico Pharmacokinetic Modelling in Drug Development. Clin. Pharmacokinet. 2011, 50, 483−491.
(17) Cabrera-Perez, M. A.; Pham-The, H.; Bermejo, M.; Alvarez, I. G.; Alvarez, M. G.; Garrigues, T. M. QSPR in oral bioavailability: specificity or integrality? Mini-Rev. Med. Chem. 2012, 12, 534−550.
(18) Castillo-Garit, J. A.; Marrero-Ponce, Y.; Torrens, F.; Garcia-Domenech, R. Estimation of ADME properties in drug discovery: predicting Caco-2 cell permeability using atom-based stochastic and non-stochastic linear indices. J. Pharm. Sci. 2008, 97, 1946−76.
(19) Palm, K.; Luthman, K.; Unge, A. L.; Strandlund, G.; Artursson, P. Correlation of drug absorption with molecular surface properties. J. Pharm. Sci. 1996, 85, 32−39.
(20) van De Waterbeemd, H.; Camenisch, G.; Folkers, G.; Raevsky, O. A. Estimation of Caco-2 Cell Permeability using Calculated Molecular Descriptors. Quant. Struct.-Act. Relat. 1996, 15, 480−490.
(21) Avdeef, A.; Artursson, P.; Neuhoff, S.; Lazorova, L.; Gråsjö, J.; Tavelin, S. Caco-2 permeability of weakly basic drugs predicted with the Double-Sink PAMPA method. Eur. J. Pharm. Sci. 2005, 24, 333−349.
(22) Irvine, J. D.; Takahashi, L.; Lockhart, K.; Cheong, J.; Tolan, J. W.; Selick, H.; Grove, J. R. MDCK (Madin−Darby canine kidney) cells: a tool for membrane permeability screening. J. Pharm. Sci. 1999, 88, 28−33.
(23) Di, L.; Whitney-Pickett, C.; Umland, J. P.; Zhang, H.; Zhang, X.; Gebhard, D. F.; Lai, Y.; Federico, J. J.; Davidson, R. E.; Smith, R. Development of a new permeability assay using low-efflux MDCKII cells. J. Pharm. Sci. 2011, 100, 4974−4985.
(24) Østerlind, K. Stat. Med. 1998, 17, 2804−2805.
(25) Höskuldsson, A. PLS regression methods. J. Chemom. 1988, 2, 211−228.
(26) Shinde, R. N.; Srikanth, K.; Sobhia, M. E. Insights into the permeability of drugs and drug-like molecules from MI-QSAR and HQSAR studies. J. Mol. Model. 2012, 18, 947−962.
(27) Kulkarni, A.; Han, Y.; Hopfinger, A. J. Predicting Caco-2 cell permeation coefficients of organic molecules using membrane-interaction QSAR analysis. J. Chem. Inf. Model. 2002, 42, 331−342.
(28) Yamashita, F.; Fujiwara, S.-i.; Hashida, M. The “Latent membrane permeability” concept: QSPR analysis of inter/intralaboratorically variable Caco-2 permeability. J. Chem. Inf. Model. 2002, 42, 408−413.
(29) Pham The, H.; González-Álvarez, I.; Bermejo, M.; Mangas Sanjuan, V.; Centelles, I.; Garrigues, T. M.; Cabrera-Pérez, M. Á. In Silico Prediction of Caco-2 Cell Permeability by a Classification QSAR Approach. Mol. Inf. 2011, 30, 376−385.
(30) Jung, S. J.; Choi, S. O.; Um, S. Y.; Kim, J. I.; Choo, H. Y. P.; Choi, S. Y.; Chung, S. Y. Prediction of the permeability of drugs through study on quantitative structure−permeability relationship. J. Pharm. Biomed. Anal. 2006, 41, 469−475.
(31) Pham-The, H.; González-Álvarez, I.; Bermejo, M.; Garrigues, T.; Le-Thi-Thu, H.; Cabrera-Pérez, M. Á. The Use of Rule-Based and QSPR Approaches in ADME Profiling: A Case Study on Caco-2 Permeability. Mol. Inf. 2013, 32, 459−479.
(32) OECD. Report from the expert group on (quantitative) structure-activity relationships [(Q) SARs] on the principles for the validation of (Q) SARs; No. 49. ENV/JM/MONO(2004)24; OECD: Paris, 2004.
(33) OECD. Guidance document on the validation of (quantitative) structure-activity relationships [(Q) SAR] models; OECD Series on Testing and Assessment; No. 69. ENV/JM/MONO(2007)2; OECD: Paris, 2007.
(34) Sherer, E. C.; Verras, A.; Madeira, M.; Hagmann, W. K.; Sheridan, R. P.; Roberts, D.; Bleasby, K.; Cornell, W. D. QSAR Prediction of Passive Permeability in the LLC-PK1 Cell Line: Trends in Molecular Properties and Cross-Prediction of Caco-2 Permeabilities. Mol. Inf. 2012, 31, 231−2.
(35) Fujiwara, S.-i.; Yamashita, F.; Hashida, M. Prediction of Caco-2 cell permeability using a combination of MO-calculation and neural network. Int. J. Pharm. 2002, 237, 95−105.
(36) Yamashita, F.; Wanchana, S.; Hashida, M. Quantitative structure/property relationship analysis of Caco-2 permeability using a genetic algorithm-based partial least squares method. J. Pharm. Sci. 2002, 91, 2230−2239.


(37) Nordqvist, A.; Nilsson, J.; Lindmark, T.; Eriksson, A.; Garberg, P.; Kihlén, M. A General Model for Prediction of Caco-2 Cell Permeability. QSAR Comb. Sci. 2004, 23, 303−310.
(38) Ponce, Y. M.; Perez, M. C.; Zaldivar, V. R.; Diaz, H. G.; Torrens, F. A new topological descriptors based model for predicting intestinal epithelial transport of drugs in Caco-2 cell culture. J. Pharm. Pharm. Sci. 2004, 7, 186−199.
(39) Hou, T.; Zhang, W.; Xia, K.; Qiao, X.; Xu, X. ADME evaluation in drug discovery. 5. Correlation of Caco-2 permeation with simple molecular properties. J. Chem. Inf. Model. 2004, 44, 1585−1600.
(40) Bergström, C. A. In silico Predictions of Drug Solubility and Permeability: Two Rate-limiting Barriers to Oral Drug Absorption. Basic Clin. Pharmacol. Toxicol. 2005, 96, 156−161.
(41) Refsgaard, H. H.; Jensen, B. F.; Brockhoff, P. B.; Padkjær, S. B.; Guldbrandt, M.; Christensen, M. S. In silico prediction of membrane permeability from calculated molecular parameters. J. Med. Chem. 2005, 48, 805−811.
(42) Chan, E.; Tan, W.; Ho, P.; Fang, L. Modeling Caco-2 permeability of drugs using immobilized artificial membrane chromatography and physicochemical descriptors. J. Chromatogr. A 2005, 1072, 159−168.
(43) Değim, Z. Prediction of permeability coefficients of compounds through Caco-2 cell monolayer using artificial neural network analysis. Drug Dev. Ind. Pharm. 2005, 31, 935−942.
(44) Di Fenza, A.; Alagona, G.; Ghio, C.; Leonardi, R.; Giolitti, A.; Madami, A. Caco-2 cell permeability modelling: a neural network coupled genetic algorithm approach. J. Comput.-Aided Mol. Des. 2007, 21, 207−221.
(45) Santos-Filho, O. A.; Hopfinger, A. J. Combined 4D-fingerprint and clustering based membrane-interaction QSAR analyses for constructing consensus Caco-2 cell permeation virtual screens. J. Pharm. Sci. 2008, 97, 566−583.
(46) Thomas, S.; Brightman, F.; Gill, H.; Lee, S.; Pufong, B. Simulation modelling of human intestinal absorption using Caco-2 permeability and kinetic solubility data for early drug discovery. J. Pharm. Sci. 2008, 97, 4557−4574.
(47) Karelson, M.; Karelson, G.; Tamm, T.; Tulp, I.; Jänes, J.; Tämm, K.; Lomaka, A.; Savchenko, D.; Dobchev, D. QSAR study of pharmacological permeabilities. ARKIVOC 2009, 2, 218−238.
(48) Paixão, P.; Gouveia, L. F.; Morais, J. A. Prediction of the in vitro permeability determined in Caco-2 cells by using artificial neural networks. Eur. J. Pharm. Sci. 2010, 41, 107−117.
(49) Halgren, T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 1996, 17, 490−519.
(50) Galvao, R. K. H.; Araujo, M. C. U.; Jose, G. E.; Pontes, M. J. C.; Silva, E. C.; Saldanha, T. C. B. A method for calibration and validation subset partitioning. Talanta 2005, 67, 736−740.
(51) Wang, J. B.; Cao, D. S.; Zhu, M. F.; Yun, Y. H.; Xiao, N.; Liang, Y. Z. In silico evaluation of logD7.4 and comparison with other prediction methods. J. Chemom. 2015, 29, 389−398.
(52) Cao, D.-S.; Liang, Y.-Z.; Xu, Q.-S.; Hu, Q.-N.; Zhang, L.-X.; Fu, G.-H. Exploring nonlinear relationships in chemical data using kernel-based methods. Chemom. Intell. Lab. Syst. 2011, 107, 106−115.
(53) Freund, Y.; Schapire, R. E. Experiments with a new boosting algorithm. In Proceedings of ICML Bari, Italy, July 3−6, 1996; Vol. 96, pp 148−156.
(54) Cao, D.-S.; Xu, Q.-S.; Liang, Y.-Z.; Zhang, L.-X.; Li, H.-D. The boosting: A new idea of building models. Chemom. Intell. Lab. Syst. 2010, 100, 1−11.
(55) Cao, D.-S.; Huang, J.-H.; Liang, Y.-Z.; Xu, Q.-S.; Zhang, L.-X. Tree-based ensemble methods and their applications in analytical chemistry. TrAC, Trends Anal. Chem. 2012, 40, 158−167.
(56) Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. Evolutionary Computation, IEEE Transactions on 2002, 6, 182−197.
(57) Srinivas, N.; Deb, K. Muiltiobjective optimization using nondominated sorting in genetic algorithms. Evol. Comput. 1994, 2, 221−248.

(58) Clark, R. D.; Fox, P. C. Statistical variation in progressive scrambling. J. Comput.-Aided Mol. Des. 2004, 18, 563−576.
(59) Baumann, K. Cross-validation as the objective function for variable-selection techniques. TrAC, Trends Anal. Chem. 2003, 22, 395−406.
(60) Kiralj, R.; Ferreira, M. Basic validation procedures for regression models in QSAR and QSPR studies: theory and application. J. Braz. Chem. Soc. 2009, 20, 770−787.
(61) Yan, J.; Huang, J. H.; He, M.; Lu, H. B.; Yang, R.; Kong, B.; Xu, Q. S.; Liang, Y. Z. Prediction of retention indices for frequently reported compounds of plant essential oils using multiple linear regression, partial least squares, and support vector machine. J. Sep. Sci. 2013, 36, 2464−2471.
(62) Golbraikh, A.; Tropsha, A. Beware of Q2. J. Mol. Graphics Modell. 2002, 20, 269−276.
(63) Tropsha, A.; Gramatica, P.; Gombar, V. K. The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb. Sci. 2003, 22, 69−77.
(64) Shahlaei, M. Descriptor selection methods in quantitative structure−activity relationship studies: a review study. Chem. Rev. 2013, 113, 8093−8103.
(65) Rücker, C.; Rücker, G.; Meringer, M. y-Randomization and its variants in QSPR/QSAR. J. Chem. Inf. Model. 2007, 47, 2345−2357.
(66) Fox, J.-P. Randomized item response theory models. Journal of Educational and Behavioral statistics 2005, 30, 189−212.
(67) Cao, D.-S.; Liu, S.; Fan, L.; Liang, Y.-Z. QSAR analysis of the effects of OATP1B1 transporter by structurally diverse natural products using a particle swarm optimization-combined multiple linear regression approach. Chemom. Intell. Lab. Syst. 2014, 130, 84−90.
(68) Netzeva, T. I.; Worth, A. P.; Aldenberg, T.; Benigni, R.; Roberts, D. W.; Schultz, T. W.; Stanton, D. T.; van de Sandt, J. J.; Tong, W.; Veith, G. Current status of methods for defining the applicability domain of (quantitative) structure−activity relationships. In Report and recommendations of ECVAM Workshop, 2005; Vol. 52.
(69) Weaver, S.; Gleeson, M. P. The importance of the domain of applicability in QSAR modeling. J. Mol. Graphics Modell. 2008, 26, 1315−1326.
(70) Sahlin, U.; Jeliazkova, N.; Öberg, T. Applicability domain dependent predictive uncertainty in QSAR regressions. Mol. Inf. 2014, 33, 26−35.
(71) Sahigara, F.; Mansouri, K.; Ballabio, D.; Mauri, A.; Consonni, V.; Todeschini, R. Comparison of different approaches to define the applicability domain of QSAR models. Molecules 2012, 17, 4791−4810.
(72) Cao, D. S.; Liang, Y. Z.; Xu, Q. S.; Li, H. D.; Chen, X. A new strategy of outlier detection for QSAR/QSPR. J. Comput. Chem. 2010, 31, 592−602.
(73) Dick, A.; Grayson, M.; Carpenter, R.; Petrie, A. Controlled trial of sulphasalazine in the treatment of ulcerative colitis. Gut 1964, 5, 437−442.
(74) Cohen, M. A.; Yoder, S. L.; Talbot, G. H. Sparfloxacin worldwide in vitro literature: isolate data available through 1994. Diagn. Microbiol. Infect. Dis. 1996, 25, 53−64.
(75) Ping, Z. Z. Progress in research of antibacterial agents. Chin. J. Antibiot. 2002, 27, 67−79.
(76) Regårdh, C. G.; Borg, K. O.; Johansson, R.; Johnsson, G.; Palmer, L. Pharmacokinetic studies on the selective β1-receptor antagonist metoprolol in man. J. Pharmacokinet. Biopharm. 1974, 2, 347−364.
(77) Raevsky, O. A.; Fetisov, V. I.; Trepalina, E. P.; McFarland, J. W.; Schaper, K. J. Quantitative Estimation of Drug Absorption in Humans for Passively Transported Compounds on the Basis of Their Physicochemical Parameters. Quant. Struct.-Act. Relat. 2000, 19, 366−374.
(78) Agatonovic-Kustrin, S.; Beresford, R.; Yusof, A. P. M. Theoretically-derived molecular descriptors important in human intestinal absorption. J. Pharm. Biomed. Anal. 2001, 25, 227−237.


(79) Palm, K.; Luthman, K.; Unge, A. L.; Strandlund, G.; Artursson, P. Correlation of drug absorption with molecular surface properties. J. Pharm. Sci. 1996, 85, 32−39.
(80) Kelder, J.; Grootenhuis, P. D.; Bayada, D. M.; Delbressine, L. P.; Ploemen, J.-P. Polar molecular surface as a dominating determinant for oral absorption and brain penetration of drugs. Pharm. Res. 1999, 16, 1514−1519.
(81) Pickett, S. D.; McLay, I. M.; Clark, D. E. Enhancing the hit-to-lead properties of lead optimization libraries. J. Chem. Inf. Model. 2000, 40, 263−272.
(82) Stenberg, P.; Norinder, U.; Luthman, K.; Artursson, P. Experimental and computational screening models for the prediction of intestinal drug absorption. J. Med. Chem. 2001, 44, 1927−1937.
