Computational Model To Predict the Fraction of Unbound Drug in the

Jul 1, 2019 - Read OnlinePDF (3 MB) ... The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jci...
0 downloads 0 Views 3MB Size
Article pubs.acs.org/jcim

Cite This: J. Chem. Inf. Model. 2019, 59, 3251−3261

Computational Model To Predict the Fraction of Unbound Drug in the Brain Tsuyoshi Esaki,*,† Rikiya Ohashi,†,‡ Reiko Watanabe,† Yayoi Natsume-Kitatani,†,§ Hitoshi Kawashima,† Chioko Nagao,†,§ and Kenji Mizuguchi*,†,§

Downloaded via UNIV OF SOUTHERN INDIANA on July 23, 2019 at 03:45:19 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.



Laboratory of Bioinformatics, National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito-Asagi, Osaka, Ibaraki 567-0085, Japan ‡ Discovery Technology Laboratories, Mitsubishi Tanabe Pharma Corporation, 2-2-50 Kawagishi, Toda, Saitama 335-8505, Japan § Laboratory of In-silico Drug Design, Center of Drug Design Research, National Institutes of Biomedical Innovation, Health and Nutrition, 7-6-8 Saito-Asagi, Osaka, Ibaraki 567-0085, Japan S Supporting Information *

ABSTRACT: Knowing the value of the unbound drug fraction in the brain (f u,brain) is essential in estimating its effects and toxicity on the central nervous system (CNS); however, no model to predict f u,brain without experimental procedures is publicly available. In this study, we collected 253 measurements from the literature and an open database and built in silico models to predict f u,brain using only freely available software. By selecting appropriate descriptors, training, and evaluation, our model showed an acceptable performance on a test data set (R2 = 0.630, percentage of compounds predicted within a 3-fold error: 69.4%) using chemical structure alone. Our model is available at https:// drumap.nibiohn.go.jp/fubrain/, and all of our data sets can be obtained from the Supporting Information.



INTRODUCTION Simulations of the time profile of drugs in the brain have been performed in many areas including toxicology1,2 and drug development.3,4 Estimating a drug’s ability to penetrate the central nervous system (CNS) in the early stages of drug development can reduce the time required to develop drugs that selectively target the CNS and help to identify those that should be discarded rapidly. The free drug hypothesis states that only drugs that do not bind proteins, lipids, or other tissue components can exhibit their pharmacological effects.5 To predict these effects or the toxicity of drugs, estimating the concentration of unbound drug in the brain, Cu,brain, is crucial.6−8 However, the blood−brain barrier and the blood−cerebrospinal fluid (CSF) barrier are located between the blood vessels and the brain and the CSF and the blood, respectively. Since these barriers play a pivotal role in controlling drug passage from the blood to the brain and to the CSF they make it difficult to estimate the Cu,brain using the concentration of unbound drug in the plasma (Cu,plasma).7,9 Three experimental methods can be used to investigate a drug’s distribution throughout the brain: the microdialysis (MD) method, the brain slice method, and the brain homogenate method. The MD method has been used to determine the unbound brain-to-plasma partition coefficient (Kp,uu,brain) and the unbound volume of distribution (Vu,brain, in mL/g of brain tissue). These measurements are obtained by determining the drug concentration in the interstitial fluid © 2019 American Chemical Society

(ISF) with the assumption that the unbound drug is in equilibrium between the brain and the ISF.10−12 This method is the most reliable for measuring drug passage to the brain as it allows for continuous and direct acquisition of samples. However, use of the MD method poses several issues. First, highly lipophilic compounds, which often appear in the early stages of drug development, are unsuited for use in this method due to their high adsorption by the MD instrument and their poor recovery.13,14 Second, to obtain data using this method, a highly efficient and effective surgery skill is required. Furthermore, the MD method is limited by the statements for animal protection as only a single sample can be taken from each animal for each measurement. Thus, use of MD for highthroughput screening has proven to be difficult.9,15 The brain slice method has been used in in vitro experiments to estimate Vu,brain. By employing this method, the amount of drug present in a slice of the brain can be measured when the drug is incubated in a buffer. The results obtained using this method tend to reflect the in vivo situations, as the brain slice maintains the basic structure of the blood and brain tissue. The brain slice method is, therefore, effective in providing information such as nonspecific binding sites and lysosome trapping. However, the method is not Received: February 28, 2019 Published: July 1, 2019 3251

DOI: 10.1021/acs.jcim.9b00180 J. Chem. Inf. Model. 2019, 59, 3251−3261

Article

Journal of Chemical Information and Modeling

Figure 1. Distribution of the data collected in this study based on f u,brain (A) and log f u,brain (B). Samples were randomly divided into training and test sets and used to build predictive models.

screening in the early stages of drug development. The f u,brain is, therefore, useful in determining Kp,uu,brain and Vu,brain, and the ability to predict this value from chemical structure alone would be an important step toward estimating Cu,brain. f u,brain is, therefore, necessary in determining Kp,uu,brain, and fortunately, an acceptable amount of experimental f u,brain data could be found. Two in silico models were proposed for predicting f u,brain from chemical structures.24,25 However, one was trained on a data set of in-house measurements for 70 compounds, and the molecular descriptors were calculated using in-house software. Neither the details of the data set and descriptors nor the model itself are available to the public. The other study used commercial software to calculate the descriptors and did not publish all of the compounds used for training; this was despite the study employing 470 compounds and 55 validation compounds from the literature. Two other studies proposed models for estimating f u,brain using experimental data. One used the capacity factor of 36 compounds measured by mass spectrometry equipped with microemulsion electrokinetic chromatography (MEEKC).26 They showed the high correlations between f u,brain and the lipophilicity property, LogP; however, almost all of the evaluated compounds were basic and neutral compounds. Hence, this could only be used for nonspecific binding of neutral compounds and electric affinity of basic compounds with phospholipids. In drug discovery, acidic and zwitterions are also synthesized as lead compounds. Therefore, generating compounds that cover the wide chemical space during the early stage is difficult using the model constructed with information for base and neutral compounds only. The other model using experimental data employed the unbound fraction of 46 drugs measured using human embryonic kidney cells (HEK293)15 without the use of a tissue section from the brain. In addition to the requirement of experimental data, the number of compounds used in these studies (from 36 to 70) is probably too low to cover a wide chemical space. In the early stages of drug discovery, a freely available model, applicable to a wide range of chemicals and independent from experimental data, is desirable. In the present study, we aimed to develop an in silico model to predict f u,brain using only the chemical structure of the compounds and to make this model freely available. As the

readily repeatable in laboratories because a highly specialized instrument is required for slicing the raw tissue. The method of brain homogenate provides the fraction of unbound drug in the homogenized tissue (f u,brain) with a reasonably high-throughput screening. The brain homogenate contains tissues with their structures ruptured, and the resulting f u,brain is underestimated when compared to the values obtained using the brain slice method. This underestimation happens because the collapsed tissues provide additional binding sites for the drug. Predicting Kp,uu,brain is the most important parameter when investigating a drug’s ability to penetrate into the brain. However, because of the aforementioned issues of the MD method and the statements for animal protection, collecting large amounts of experimental data for Kp,uu,brain is generally difficult. Several laboratories collected experimental data from the literature or in-house measurements (115,16 246 (73 publicly available compounds),17 34618 unpublished, 133 from the literature,19 81 (40 publicly available compounds),20 1121 (91 publicly available compounds)21) and tried to construct regression models. However, performance was found to be insufficient to directly construct QSAR models.22 Furthermore, a low amount of published data has made it difficult to construct a new in silico model. Hence, we consider that directly constructing a prediction model for Kp,uu,brain is difficult and that it is more realistic to estimate Kp,uu,brain using Kp,brain and f u,brain or Vu,brain. The relationship between Kp,uu,brain, Vu,brain, and f u,brain has been reported. Kp,uu,brain is calculated using f u,brain and other parameters as follows K uu,brain =

Cu,brain Cu,plasma

=

fu,brain × C brain fu,plasma × Cplasma

where f u,plasma, Cplasma, and Cbrain are the fraction of unbound drug in the plasma, the concentration of drug in the plasma, and the concentration of drug in the brain, respectively. From this equation, Kp,uu,brain requires f u,brain. Friden et al.23 proposed a model for predicting Vu,brain using observed values of f u,brain and pKa and showed that their model provided predictions within a 2.2-fold error range for 95% of the drugs evaluated. This performance is acceptable for 3252

DOI: 10.1021/acs.jcim.9b00180 J. Chem. Inf. Model. 2019, 59, 3251−3261

Article

Journal of Chemical Information and Modeling f u,brain of several mammals have been shown to be strongly correlated,8 we collected and pooled the data of f u,brain for various mammals (including mouse, rat, monkey, dog, pig, and humans). The collected data were rigorously curated by referring to the original literature, resulting in a large data set of 253 compounds categorized as follows: basic, 72; acid, 21; neutral, 141; zwitterion, 13; and 6 with no data in ChEMBL. The molecular descriptors of the collected data were generated using free software. To the best of our knowledge, this is the first study to report an in silico predictive model of f u,brain that has made the model and the associated data sets freely available. The results of this study will be valuable to researchers interested in CNS drug development in pharmaceutical companies and academia.

with near-zero variance and a high correlation with others (R > 0.95) were removed using the “nearZeroVar” and “findCorrelation” (in cutoff = 0.95) functions in the caret package28 in R to select a suitable number of descriptors. A total of 868 descriptors was retained. The Boruta package29 with a p value = 0.01 was then used to select the most important descriptors automatically. Boruta is a wrapper algorithm built around the random forest classification algorithm. To verify the importance of its features, Z-scores for the shuffled copies of the features and the original ones are compared. If the performance of the latter was better than that of the former, the descriptor was considered to be an important descriptor. External tests 1 and 2 were used to compare our model to those of Wan and Mateus, respectively. Finally, a total of 53 descriptors was used to train a model with each machine learning algorithm. The 20 most important descriptors are presented in Table 2 (the full list of 53 descriptors is shown in the Supporting Information, Table S2). Distribution of Collected Data Sets. To compare the distribution of our data sets against a data set of approved drugs we performed a principal component analysis (PCA) using the “prcomp” function in stats package (ver. 3.5.2), which is a part of R for the 53 scaled features. In addition, the four properties defined in Lipinski’s rule of five30 (molecular



RESULTS Distribution of Training and Test Sets. The distribution of the collected compounds is shown in Figure 1. From the 253 collected compounds, we first extracted 73 used for training in previous studies and used them as the external test set. We further divided these 73 compounds to serve as external test 1 (36 compounds used in Wan’s study26) and external test 2 (46 compounds used in Mateus’ study15) for comparison to the other studies. From the 180 remaining compounds, 144 (80.00%) were randomly selected to serve as a training set and the remaining 36 compounds (20.00%) were used as the test set. The class of all compounds used in this study are shown in Supporting Information (Table S1). The number of samples, range of values, mean, and standard deviation of each data set for logarithm of f u,brain are summarized in Table 1. No statistically significant difference

Table 2. List of the 20 Descriptors with Higher Importance among the 53 Featuresa descriptor SLogPc SMR_VSA7c AETA_alphac khs.aaCHb

Table 1. Summary of the Logarithm of f u,brain Dataset Characterizationa no. of samples data set training set test set external test 1c external test 2d

range of logtransformed values

AATSC0pc mean

median

S.D.b

253 144

−3.312 to 0.000 −3.312 to 0.000

−1.272 −1.205

−1.301 −1.235

0.791 0.830

36 36

−3.000 to −0.040 −3.078 to −0.265

−1.251 −1.470

−1.359 −1.519

0.782 0.698

46

−3.078 to −0.020

−1.421

−1.365

0.732

APC2D6_C_Xd APC2D7_C_Nd AATS 5pc ATSC0mc FPSA3c

a

Compounds in external tests 1 and 2 were separated from the data set to evaluate the prediction capability of our model and others, and 37 compounds were included in both external test sets. bStandard deviation of each group. cCompounds used for model building in Wan’s study.26 dCompounds used for model building in Mateus’ study.15

tpsaEfficiencyb SubFPC274d ZMIC1c GATS1ic AATS 1pc

in means was detected (p value = 0.6407 from Wilcoxon rank sum test) between the training and the test sets, confirming that the data sets were randomly selected without bias. Descriptor Selection. While the expressive power of predictive models depends on the number of descriptors, an excessive number can result in overfitting. Thus, before selecting important descriptors, we retained SLogP and removed XLogP, ALogP, ALogP2, and MLogP for the parameter of lipophilicity (due to the performance and good agreement27 with other LogP prediction methods) and retained “count of aromatic substructure” for number of aromatic ring. After removing several similar descriptors, those

AATS2ic c

AATSC0v ETA_psi_1c GATS 1pc n6aRingc

detail

meanImpb

Wildman−Crippen LogP MOE MR VSA Descriptor 7 (3.05 ≤ x < 3.63) averaged ETA core count counts the number of occurrences of the E-state fragments averaged and centered moreau-broto autocorrelation of lag 0 weighted by polarizability count of C−X at topological distance 6 count of C−N at topological distance 7 averaged moreau-broto autocorrelation of lag 5 weighted by polarizability centered moreau-broto autocorrelation of lag 0 weighted by mass fractional charged partial positive surface area (version 3) pPolar surface area expressed as a ratio to molecular size count of aromatic substructure 1-ordered Z-modified information content geary coefficient of lag 1 weighted by ionization potential averaged moreau-broto autocorrelation of lag 1 weighted by polarizability averaged moreau-broto autocorrelation of lag 2 weighted by ionization potential count of C−Cl at topological distance 5 ETA psi geary coefficient of lag 1 weighted by polarizability 6-membered aromatic ring count

9.758 6.482 4.875 4.665 4.547 4.249 4.086 4.078 3.953 3.934 3.891 3.872 3.674 3.599 3.581 3.546 3.539 3.524 3.496 3.488

a Features used for training f u,brain predictive models. bThe mean importance of descriptors (meanImp) was generated using Boruta. c These descriptors were generated using CDK. dThese descriptors were generated using Mordred. eThese descriptors were generated using PaDEL-Descriptor.

3253

DOI: 10.1021/acs.jcim.9b00180 J. Chem. Inf. Model. 2019, 59, 3251−3261

Article

Journal of Chemical Information and Modeling

the root-mean-square-error (RMSE), the Gradient Boosting (GB) model had the highest performance followed by the Support Vector Machine with radial basis function kernel (Radial SVM) and Random Forest (RF) (Table 3). This result suggests that bagging, boosting, and nonlinear machine learning methods have a higher predictive capability than linear methods for this type of data analysis. An external validation using the test set was performed for each model. The statistical scores obtained with the test set are shown in the bottom half of Table 3. RF, GB, and radial SVM had higher scores than Support Vector Machine with linear kernel (linear SVM) and partial least squares (PLS), confirming the results from the cross-validation. The R2 and RMSE scores of the GB model from the crossvalidation were the highest, closely followed by the RF and radial SVM models. In the test set the GB model had the highest performance followed by the RF and radial SVM models. The GB model predicted values within a 2- and 3-fold error of the observed values for 38.89% and 69.44% of the compounds in the test set, respectively (Figure 4). Hence, the GB model is the best selection based on the results of both the training and the test sets.

weight (MW) < 500, SLogP < 5, number of hydrogen-bond acceptors (HBA) < 10, and number of hydrogen-bond donors (HBD) < 5) were used to compare the diversity of these data sets. Figure 2 shows the two-dimensional PCA plot of approved drugs, the training set, and the test set. Over 37% of the



DISCUSSION Contribution of the Compound’s Polarity. Traditionally, highly lipophilic compounds have been considered good CNS drug candidates due to the good correlation between their octanol/water partition coefficient and the brain-toplasma partition coefficient (Kp,brain)33 (unless the compounds are substrates of P-glycoprotein).34 However, a weak relationship between the Kp,brain (at a concentration considered effective) and the amount of unbound drug in the brain has been observed more recently based on an estimate using CSF.35 To confirm the traditional notion of the correlation between lipophilicity and Kp,brain, SLogP (as a measure of lipophilicity) was calculated for the 253 compounds using Mordred. The Kp,brain of those compounds was predicted using the Lukacova model36 in GastroPlus (ver 9.5, SimulationsPlus Ltd., Lancaster, CA); an American male (body weight = 70 kg) was used for the analysis (the Lukacova model combines the two equations of the Rodgers Kp calculation model).37,38 A positive correlation was observed for all of the compounds (R = 0.552) with neutral compounds exhibiting an even higher positive correlation (R = 0.737) (Figure 5A). More specifically, the basic and neutral compounds and the zwitterions showed a tendency to higher Kp,brain than acids within a similar SLogP range. The Kp,uu,brain of each compound was then calculated using the following equation

Figure 2. Score plot of the principal component analysis (PCA) to compare the distribution of the training (144 compounds), test (36 compounds), and External test (73 compounds) sets to that of approved drugs (5845 compounds) collected from KEGG DRUG. Full black, blue, green, and pink circles represent approved drugs, training, External test, and the test sets, respectively.

variance is accounted for by the first two principal components, and the compounds of our data sets were centered in the middle of the cloud for approved drugs. The cumulative contribution of PC1 and PC2 (total 37.7%) was low. Hence, we considered that the accuracy of the analysis with a low cumulative contribution rate might be insufficient. This might be a limitation of the calculable descriptors with the freely available software. In previous studies, the data set collected KEGG DRUG as approved drugs, and other data sets were prepared by PCA using several descriptors calculated by the freely available software31 and a commercial one.32 The sum of the cumulative contributions of PC1 (34.96%) and PC2 (20.64%) from the commercial software (total 55.60%) was higher than that obtained from the freely available software (PC1, 24.2%; PC2, 9.6%; total, 33.8%). From these results we consider that the low cumulative contribution in PCA may be due to the descriptor calculation software. Nonetheless, this value appears as an acceptable result using the freely available software only. The PCA results indicate that both the training and the test sets have an acceptable distribution for the construction of f u,brain predictive models and for evaluation of the predictive ability of drug-like compounds, respectively. The distributions of approved drugs and our data sets against the properties of Lipinski’s rule are shown as histograms in Figure 3. The distribution of our data sets is both similar and close to that of the approved drugs. Predictive Performance of Cross-Validation and External Validation. The predictive models trained using the 53 descriptors selected for the 144 compounds in the training set were cross-validated (see Experimental Section for details). Regarding the squared correlation coefficient (R2) and

K p,uu,brain = K p,brain ×

fu,brain fu,plasma

We used Watanabe’s predictive model to calculate f u,plasma.32 Only a low correlation (R = 0.113) was observed in the scatter plot of Kp,uu,brain and SLogP (Figure 5B). Within a similar SLogP range, the values of Kp,uu,brain of bases and acids were slightly lower than those of neutral compounds. These results suggest that the polarity, in addition to the lipophilicity, of a drug candidate is essential in estimating the distribution of unbound drug in the brain. The brain is enriched with phospholipids,39 and the concentrations of phosphatidylserine (PS), phosphatidyletha3254

DOI: 10.1021/acs.jcim.9b00180 J. Chem. Inf. Model. 2019, 59, 3251−3261

Article

Journal of Chemical Information and Modeling

Figure 3. Distribution of the physicochemical properties of approved drugs and collected compounds based on the four properties in Lipinski’s rule of five (MW, molecular weight; SLogP, calculated by Mordred; HBA, number of hydrogen bond acceptors; HBD, number of hydrogen bond donors). Y axis of each histogram represents the percentage of the number of samples and X axis the total range for each data set. Full black, blue, pink, and green bars represent approved drugs, the training, test, and External test sets, respectively.

Table 3. Results of Cross-Validation and Evaluation algorithm

R2

RMSE

0.619 0.643 0.549 0.452 0.545

0.510 0.482 0.561 0.828 0.592

0.537 0.630 0.485 0.326 0.252

0.546 0.477 0.565 0.743 0.745

a

cross validation RF GB radial SVM linear SVM PLS test setb RF GB radial SVM linear SVM PLS a

Cross-validation was performed between 144 training compounds for the gradient boosting (GB), radial support vector machine (SVM), linear SVM, and partial least-squares (PLS) models, and out-of-bag was used for the random forest (RF) model instead of crossvalidation. The results of each resampling (CV number, R2 and RMSE) are shown in the Supporting Information (Table S3). b Thirty-six compounds in the test set were used in each model.

Figure 4. Relationship between the observed and the predicted f u,brain using the gradient boosting (GB) model. Each test compound is represented as a red circle. Solid, dashed, and dot lines represent perfect agreement and 2-fold and 3-fold error, respectively.

3255

DOI: 10.1021/acs.jcim.9b00180 J. Chem. Inf. Model. 2019, 59, 3251−3261

Article

Journal of Chemical Information and Modeling

Figure 5. Relationship between SLogP and Kp,brain (A) or Kp,uu,brain (B). Blue, yellow, red, green, and gray circles represent basic, neutral, acidic, zwitterion, and no classification available compounds from ChEMBL, respectively.

Figure 6. Relationship between observed f u,brain and SLogP (A) and calculated logD at pH 7.4 (B). Blue, yellow, red, green, and gray circles represent basic, neutral, acidic, zwitterion, and no classification available compounds from ChEMBL, respectively.

charge interactions, while neutral compounds only bind to phospholipids through nonspecific interactions. The relationship between f u,brain and SLogP was plotted based on the polarity of the compounds as shown in Figure 6A. A negative correlation (R = −0.662) was observed between f u,brain and SLogP; however, we observed no apparent differences between the bases, acids, neutral compounds, or zwitterions. For the measurement of f u,brain, phosphate-buffered saline was used to maintain the sample’s pH at around 7.4. Lipophilicity at pH = 7.4, therefore, is the most suitable property to examine the distribution of drug compounds based on their molecular polarity. For this purpose, logD at pH 7.4 of the collected compounds was calculated using ADMET

nolamine (PE), and phosphatidylcholine (PC) in the brain are higher than those of other tissues (heart, kidney, and liver).40 In a previous study, quinidine, a basic compound, was found to bind to acidic phospholipids, with the highest binding occurring with PS as a result of its large number of binding sites. This binding was considered a consequence of electrostatic and hydrophobic interactions.41 Similarly, acidic compounds can bind to positively charged sites of phospholipids, such as PE and PC. Thus, it is implied that the electric charge of a compound, in addition to its lipophilicity, plays an essential role in determining f u,brain. Polar compounds tend to bind to phospholipids by nonspecific hydrophobic interactions and through specific 3256

DOI: 10.1021/acs.jcim.9b00180 J. Chem. Inf. Model. 2019, 59, 3251−3261

Article

Journal of Chemical Information and Modeling Predictor (ver. 8.1, SimulationsPlus Ltd., Lancaster, CA) and the relationship between f u,brain and logD at pH 7.4 was plotted similarly to SLogP (Figure 6B). A negative correlation (R = −0.676) was observed, similar to the plot with SLogP; however, the LogD plot revealed a different trend. This plot showed that f u,brain of polar compounds tends to be lower than those of neutral compounds that are within the same range of logD. Thus, polar compounds, with their capability of both nonspecific binding and charge interactions, may bind to the brain homogenate more strongly than neutral compounds with similar lipophilicity. Analyzing the relationship between the charge of the compounds and the profiles of the phospholipids in the brain may provide further insights into drug binding affinity with brain homogenates. Two strong outliers can be seen in Figure 6A, and they represent Mitoxantrone and Doxorubicin. Generally, these two compounds are orally administered as hydrochloride salts, which imply that they may possess low water solubility and high lipophilicity. We considered that these are mispredictions generated due to the software’s limitation. Generally, almost all of the current models to predict LogP use the substructures and its count of the compounds. Thus, the compounds with many hydroxy groups are predicted as low LogP compounds despite their high lipophilicity. When we removed these two compounds, the recalculated correlations improved from R = −0.545 to −0.742 for basic compounds. In a previous study,27 SlogP had the highest accuracy in the freely available software. This misprediction of the previous two compounds might therefore be a limitation of the free software. Thus, we consider that our results reflect the capability of the free software. The lipophilicity parameters strongly affected the prediction of f u,brain; hence, we anticipate more effective algorithms to predict LogP. Comparison to Other Predictive Models. Two of the four previous studies15,24,26 described in the Introduction provided all of the data used for model building (36 and 46 compounds defined as external tests 1 and 2, respectively; these compounds are shown in the Supporting Information, Table S1), as well as the equations for predicting f u,brain.15,26 To compare our GB model to the two models of these previous studies, we first prepared two new test sets corresponding to the 36 and 46 compounds used in the previous studies. Although these compounds constitute a subset of our 253 compounds, the f u,brain values used in our study differ slightly from the values used in the previous studies, as we adopted average values over the relevant entries in ChEMBL. We thus reassigned the updated f u,brain values to the 36 and 46 compound test sets, predicted the f u,brain values using the published equations, and calculated the following statistical scores for each model: R2, RMSE, and the percentages of compounds predicted within 2- and 3-fold error. Similarly, our GB model was evaluated based on the two new test sets, and its performance was compared to that of the previous models. The results (Table 4) indicated that our model had lower scores than the previous models but could predict values within a 3-fold error of the observed values for over 70% of compounds in both external test 1 and 2. Due to the similar trend in external test 1 and 2, our model had difficulty in predicting low values, especially f u,brain < 0.01, indicating that the training data set was constructed with higher f u,brain compounds (mean of f u,brain in training set: 0.220) compared to the external test sets (external test 1, 0.091; external test 2, 0.121).

Table 4. Evaluation of the Predictive Ability of Our Model and Two Previously Published Modelsa Using Two Test Sets model 36 compoundsb capacity factor of MEEKC26 in silico GB model (this study) 46 compoundsc binding data to HEK293 cell lines15 in silico GB model (this study)

R2

RMSE

2-fold (%)

3-fold (%)

0.784 0.585

0.322 0.447

58.3 63.9

83.3 77.8

0.860

0.320

71.7

87.0

0.381

0.588

45.7

71.7

a

Results of statistical score recalculation; several scores of the previous models were different from the values in the original literature, as we replaced the f u,brain values of some compounds with those used in this study (values averaged over the relevant entries in ChEMBL). bResults predicted using the MEEKC model and our GB model against the 36 compounds in Wan’s study.26 cResults of the HEK293 model and our GB model against the 46 compounds in Mateus’ study.15

In the range of f u,brain from 0.01 to 1.0, the scores of our model increased and approached that of previous models for both external test 1 (R2 = 0.554, RMSE = 0.376, % within 2fold = 66.7%, 3-fold = 81.5%) and 2 (R2 = 0.519, RMSE = 0.485, % within 2-fold = 55.9%, 3-fold = 79.4%) sets. The role of f u,brain is to estimate the concentration of unbound drug in the brain. Thus, our model, which can predict f u,brain with acceptable accuracy for the high value, is useful in the early stage of drug discovery. Furthermore, we should emphasize that our model requires only chemical structures, while the previous models needed experimental measurements such as the capacity factor of MEEKC or drug binding to the homogenate of HEK293 cell lines in a dialysis apparatus. Nonetheless, a few compounds were better predicted by the previous models than by our model (Figure 7). This observation implies that some critical information lies within the experimental data. This information is difficult to obtain using our current approach with 2D descriptors and fingerprints.42 To address this issue, as an example, the graph convolution method may be employed as a cutting-edge technique to generate molecular descriptors.43 Graph convolution generates descriptors using features of graphs such as atoms, bonds, and distances. However, to effectively demonstrate the power of the graph convolution algorithm, a large number of measurements is required (e.g., over 1000 samples in a previous study43 ), but graph convolutions do not present an entirely great quality for any situation. They may, however, represent a new paradigm for future improvement. Thus, to obtain information that critically expresses the binding of compounds to the brain homogenate, it is important to collect a larger number of f u,brain measurements.



CONCLUSION In this study, we presented an in silico model to predict f u,brain using only the chemical structure of drug compounds. All f u,brain data used in this study were collected from an open database and have been made available as Supporting Information. As the f u,brain values of several mammals are strongly correlated,8 our predictive model, trained on the pooled data of several animals, should be applicable to multiple species including mouse, rabbit, dog, monkey, and humans. 3257

DOI: 10.1021/acs.jcim.9b00180 J. Chem. Inf. Model. 2019, 59, 3251−3261

Article

Journal of Chemical Information and Modeling

Figure 7. Relationship between observed and predicted f u,brain using Wan’s19 (A) and Mateus’15 test compounds (B). Black circles represent the predicted values using the previous models (A,26 B15), and red circles represent the predictions using our GB model trained with the 144 compounds. Solid and dashed lines represent perfect agreement and a 3-fold error, respectively.

important descriptors were selected using the Boruta package (a wrapper algorithm built around the random forest classification algorithm that is used to verify the importance of features; the Z-score of the shuffled copies of the features and the original ones were compared) with a parameter of p value = 0.01;29 in the end, we retrieved 53 descriptors (more details are provided in the Results). Machine Learning Algorithms. The constructed regression models and statistical analysis were performed using the statistical software, R (ver. 3.5.1).50 To optimize the parameters and construct the predictive models using the selected descriptors, Random Forest (RF), Gradient Boosting (GB), Support Vector Machine with radial functional kernel (radial SVM), linear kernel (linear SVM), and partial least square (PLS) were used in caret, with a 10-fold cross validation (for GB, radial SVM, linear SVM, and PLS); out-of-bag validation was used instead of cross-validation for RF. RF is a machine learning method based on bagging.51 In RF training, descriptors are automatically selected through random repetitions; however, we used the descriptors selected by Boruta in RF because they improved the performance of the RF model (the compared effects of descriptor selection between Boruta and RF are discussed in the Supporting Information, Tables S4 and S5). One parameter required optimization, the number of max depth (mtry). Using out-ofbag validation instead of cross-validation for RF, the best value for mtry was determined as 36. GB is an algorithm based on the boosting model using decision trees52 and works with four tuning parameters: (1) number of iterations (n.tree), (2) complexity of the tree (interaction.depth), (3) learning rate (shrinkage), and (4) the minimum number of training set samples in a node to commence splitting (n.minosinnode). The chosen parameters for compounds in the f u,brain training data set were n.tree = 100 and interaction_depth = 6 using the 10-fold cross-validation when shrinkage = 0.1 and n.minobsinnode = 10 in the caret package.

Our study is fully reproducible, and the model is available at https://drumap.nibiohn.go.jp/fubrain/. The use of our model can spring forth the opportunity for researchers to evaluate the behavior of drugs that target the brain in the early stages of drug development.



EXPERIMENTAL SECTION Collecting Data Sets. The experimental f u,brain data collected were based on previous studies.15,24,26 All data were obtained by searching the open database, ChEMBL (ver. 23),44 followed by extraction of data relevant to this study. ChEMBL was selected because it contains a large stock of experimental data related to drug development. A rigorous curation was performed to filter the extracted data, resulting in 253 compounds that served as the data set for f u,brain prediction and evaluation. The structure data files (SDFs) for each data set were also obtained from ChEMBL using the ChEMBL ID of each compound. To assess the diversity of the collected compounds in our data set, we collected the data of approved drugs from KEGG DRUG45 using filters such as “molecular weight is less than 1,000” and “no any inner salt.” These filters retrieved 5845 drugs as the approved drug data set to represent the drug-like compounds. Similar to the f u,brain data set, SDF of approved drugs were collected from the ChEMBL database. Descriptor Calculation. To obtain the physicochemical properties of the collected compounds, three freely available software packages, CDK Descriptor Calculator GUI (ver. 1.4.8),46 Mordred (ver. 1.0.0),47 and PaDEL-Descriptor (ver. 2.21),48 were employed. In total, 286 and 1612 2D descriptors were calculated by the CDK and Mordred software, respectively. The PaDEL-Descriptor software was used to obtain 6010 descriptors concerning the following substructure information: (1) a substructure fingerprint count of 370, (2) a Klekatol−Roth fingerprint count of 4860,49 and (3) a 2D atom pair count of 780. After removing the similar descriptors, those having a near-zero variance or a high correlation with other descriptors were removed using the caret package.28 The 3258

DOI: 10.1021/acs.jcim.9b00180 J. Chem. Inf. Model. 2019, 59, 3251−3261

Article

Journal of Chemical Information and Modeling SVM is a method based on statistical theory.53 In this study, two patterns of kernel functions, radial basis and linear kernel, were used as the most common and simple kernels, respectively. The kernel function, κ, of radial SVM is expressed as follows: κ = −|u − v|2/2σ2. Here, u and v are vectors that represent the observed and predicted values, respectively, and σ is the function of the bandwidth. Optimization was performed for two parameters: (1) insensitive loss function (sigma) and (2) cost of constrains violation (C). On the basis of the results of 10-fold cross-validation, sigma = 0.01715347 and C = 2 were selected as suitable parameters. The radial SVM was trained based on 133 support vectors. The kernel of linear SVM is described as κ = ⟨u, v⟩. Similar to radial SVM, u and v also represent the observed and predicted values, respectively. The parameter of C is required to optimize linear SVM; C = 1 was employed for f u,brain prediction. A total of 125 support vectors was generated to construct the linear SVM model. PLS is a linear method combining the supervised PCA and multiple recognition.54 The parameter for optimization was the number of PLS components (ncomp) with ncomp = 1 selected as a suitable value for this training set. Performance Evaluation. Two statistical scores were employed to compare the predictive ability between the algorithms: the squared correlation coefficient (R2) and the root-mean-square-error (RMSE). These equations are defined as n

R2 = 1 −

Chioko Nagao: 0000-0002-7721-0642 Kenji Mizuguchi: 0000-0003-3021-7078 Author Contributions

T.E. performed all computational analyses. T.E., R.O., R.W., and Y.N.-K. discussed the relationship between the compounds and in vivo data and analyzed the results. All authors contributed to the writing of the manuscript and have approved its final version. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS This work was conducted as part of the “Development of Drug Discovery Informatics System” supported by the Japan Agency for Medical Research and Development (AMED). We acknowledge Mrs. Toshiyuki Oda and Daisuke Sato at Lifematics Inc. for assisting us with data collection, rigorous curation, and creation of the web interface. Finally, we thank Editage (www.editage.jp) for their assistance in editing and enhancing the language of the manuscript.

i=1

i=1

RMSE =

1 n

∑ (yipred − yiobs )2

ypred i ,

yobs i ,

(1)

n i=1

(2)

mean yobs, i

where and are predicted, observed, and average of observed f u,brain values of a compound i, respectively.



ASSOCIATED CONTENT

* Supporting Information S

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.9b00180. Details of the removal of compounds because of the unambiguous class assignments; confusion matrix and equations to calculate each statistical score (PDF) Information on the compounds used in this study; ChEMBL ID, smiles, compound names, data set (training, test, and External test 1 or 2 sets), and observed values (XLSX)



ABBREVIATIONS



REFERENCES

CNS, central nervous system; Cbrain, concentration of drug in the brain; Cplasma, concentration of drug in the plasma; Cu,brain, concentration of unbound drug in the brain; Cu,plasma, concentration of unbound drug in the plasma; f u,brain, fraction of unbound drug in the brain; f u,plasma, fraction of unbound drug in the plasma; GB, gradient boosting; HBA, hydrogenbond acceptor; HBD, hydrogen-bond donor; HEK293, human embryonic cells; ISF, interstitial fluid; Kp,uu,brain, ratio of drug concentration from brain to the plasma; Kp,uu,brain, ratio of unbound drug concentration from brain to the plasma; MEEKC, microemulsion electrokinetic chromatography; MW, molecular weight; PCA, principal component analysis; PLS, partial least squares; RF, random forest; SVM, support vector machine; Vu,brain, unbound drug volume in the brain

n

∑ (yipred − yiobs )2 /∑ (yiobs mean − yiobs )2



(1) Järnberg, J.; Johanson, G. Physiologically Based Modeling of 1,2,4-Trimethylbenzene Inhalation Toxicokinetics. Toxicol. Appl. Pharmacol. 1999, 155, 203−214. (2) Levitt, D. G. PKQuest: A General Physiologically Based Pharmacokinetic Model. Introduction and Application to Propranolol. BMC Clin. Pharmacol. 2002, 2, 5. (3) Theil, F.-P.; Guentert, T. W.; Haddad, S.; Poulin, P. Utility of Physiologically Based Pharmacokinetic Models to Drug Development and Rational Drug Discovery Candidate Selection. Toxicol. Lett. 2003, 138, 29−49. (4) Parrott, N.; Paquereau, N.; Coassolo, P.; Lave, T. An Evaluation of the Utility of Physiologically Based Models of Pharmacokinetics in Early Drug Discovery. J. Pharm. Sci. 2005, 94, 2327−2343. (5) Smith, D. A.; Di, L.; Kerns, E. H. The Effect of Plasma Protein Binding on in Vivo Efficacy: Misconceptions in Drug Discovery. Nat. Rev. Drug Discovery 2010, 9, 929−939. (6) Hammarlund-Udenaes, M.; Friden, M.; Syvanen, S.; Gupta, A. On The Rate and Extent of Drug Delivery to the Brain. Pharm. Res. 2008, 25, 1737−1750. (7) Liu, X.; Smith, B. J.; Chen, C.; Callegari, E.; Becker, S. L.; Chen, X.; Cianfrogna, J.; Doran, A. C.; Doran, S. D.; Gibbs, J. P.; Hosea, N.; Liu, J.; Nelson, F. R.; Szewc, M. A.; Van Deusen, J. Evaluation of Cerebrospinal Fluid Concentration and Plasma Free Concentration As a Surrogate Measurement for Brain Free Concentration. Drug Metab. Dispos. 2006, 34, 1443−1447.

AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected]. Phone: (+81)72-6397010. Fax: (+81)72-641-9881. *E-mail: [email protected]. Phone: (+81)72-641-9890. Fax: (+81)72-641-9881. ORCID

Tsuyoshi Esaki: 0000-0001-8780-6346 Rikiya Ohashi: 0000-0002-6919-5665 Reiko Watanabe: 0000-0001-9359-8731 Yayoi Natsume-Kitatani: 0000-0003-1749-3318 Hitoshi Kawashima: 0000-0003-3612-4294 3259

DOI: 10.1021/acs.jcim.9b00180 J. Chem. Inf. Model. 2019, 59, 3251−3261

Article

Journal of Chemical Information and Modeling (8) Di, L.; Umland, J. P.; Chang, G.; Huang, Y.; Lin, Z.; Scott, D. O.; Troutman, M. D.; Liston, T. E. Species Independence in Brain Tissue Binding Using Brain Homogenates. Drug Metab. Dispos. 2011, 39, 1270−1277. (9) Liu, X.; Van Natta, K.; Yeo, H.; Vilenski, O.; Weller, P. E.; Worboys, P. D.; Monshouwer, M. Unbound Drug Concentration in Brain Homogenate and Cerebral Spinal Fluid at Steady State as a Surrogate for Unbound Concentration in Brain Interstitial Fluid. Drug Metab. Dispos. 2009, 37, 787−793. (10) Chaurasia, C. S.; Muller, M.; Bashaw, E. D.; Benfeldt, E.; Bolinder, J.; Bullock, R.; Bungay, P. M.; DeLange, E. C.; Derendorf, H.; Elmquist, W. F.; Hammarlund-Udenaes, M.; Joukhadar, C.; Kellogg, D. L., Jr.; Lunte, C. E.; Nordstrom, C. H.; Rollema, H.; Sawchuk, R. J.; Cheung, B. W.; Shah, V. P.; Stahle, L.; Ungerstedt, U.; Welty, D. F.; Yeo, H. AAPS-FDA Workshop White Paper: Microdialysis Principles, Application and Regulatory Perspectives. Pharm. Res. 2007, 24, 1014−1025. (11) Boschi, G.; Scherrmann, J.-M. Microdialysis in Mice for Drug Delivery Research. Adv. Drug Delivery Rev. 2000, 45, 271−281. (12) Joukhadar, C.; Muller, M. Microdialysis Current Applications in Clinical Pharmacokinetic Studies and its Potential Role in the Future. Clin. Pharmacokinet. 2005, 44 (9), 895−913. (13) Lindberger, M.; Tomson, T.; Ståhle, L. Microdialysis Sampling of Carbamazepine, Phenytoin and Phenobarbital in Subcutaneous Extracellular Fluid and Subdural Cerebrospinal Fluid in Humans: An in vitro and in vivo Study of Adsorption to the Sampling Device. Pharmacol. Toxicol. 2002, 91, 158−165. (14) Khramov, A. N.; Stenken, J. A. Enhanced Microdialysis Extraction Efficiency of Ibuprofen in Vitro by Facilitated Transport with â-Cyclodextrin. Anal. Chem. 1999, 71, 1257−1264. (15) Mateus, A.; Matsson, P.; Artursson, P. A High-Throughput Cell-Based Method to Predict the Unbound Drug Fraction in the Brain. J. Med. Chem. 2014, 57, 3005−3010. (16) Friden, M.; Winiwarter, S.; Jerndal, G.; Bengtsson, O.; Wan, H.; Bredberg, U.; Hammarlund-Udenaes, M.; Antonsson, M. Structurebrain exposure relationships in rat and human using a novel data set of unbound drug concentrations in brain interstitial and cerebrospinal fluids. J. Med. Chem. 2009, 52, 6233−6243. (17) Chen, H.; Winiwarter, S.; Friden, M.; Antonsson, M.; Engkvist, O. In silico prediction of unbound brain-to-plasma concentration ratio using machine learning algorithms. J. Mol. Graphics Modell. 2011, 29, 985−995. (18) Varadharajan, S.; Winiwarter, S.; Carlsson, L.; Engkvist, O.; Anantha, A.; Kogej, T.; Friden, M.; Stalring, J.; Chen, H. Exploring in silico prediction of the unbound brain-to-plasma drug concentration ratio: model validation, renewal, and interpretation. J. Pharm. Sci. 2015, 104, 1197−1206. (19) Spreafico, M.; Jacobson, M. P. In silico prediction of brain exposure: drug free fraction, unbound brain to plasma concentration ratio and equilibrium half-life. Curr. Top. Med. Chem. 2013, 13, 813− 820. (20) Loryan, I.; Sinha, V.; Mackie, C.; Van Peer, A.; Drinkenburg, W. H.; Vermeulen, A.; Heald, D.; Hammarlund-Udenaes, M.; Wassvik, C. M. Molecular properties determining unbound intracellular and extracellular brain exposure of CNS drug candidates. Mol. Pharmaceutics 2015, 12, 520−532. (21) Dolgikh, E.; Watson, I. A.; Desai, P. V.; Sawada, G. A.; Morton, S.; Jones, T. M.; Raub, T. J. QSAR Model of Unbound Brain-toPlasma Partition Coefficient, Kp,uu,brain: Incorporating P-glycoprotein Efflux as a Variable. J. Chem. Inf. Model. 2016, 56, 2225−2233. (22) Liu, H.; Dong, K.; Zhang, W.; Summerfield, S. G.; Terstappen, G. C. Prediction of brain:blood unbound concentration ratios in CNS drug discovery employing in silico and in vitro model systems. Drug Discovery Today 2018, 23, 1357−1372. (23) Friden, M.; Bergstrom, F.; Wan, H.; Rehngren, M.; Ahlin, G.; Hammarlund-Udenaes, M.; Bredberg, U. Measurement of Unbound Drug Exposure in Brain: Modeling of pH Partitioning Explains Diverging Results between the Brain Slice and Brain Homogenate Methods. Drug. Metab. Dispos. 2011, 39, 353−362.

(24) Wan, H.; Rehngren, M.; Giordanetto, F.; Bergstrom, F.; Tunek, A. High-Throughput Screening of Drug-Brain Tissue Binding and in Silico Prediction for Assessment of Central Nervous System Drug Delivery. J. Med. Chem. 2007, 50, 4606−4615. (25) Lanevskij, K.; Dapkunas, J.; Juska, L.; Japertas, P.; Didziapetris, R. QSAR analysis of blood-brain distribution: the influence of plasma and brain tissue binding. J. Pharm. Sci. 2011, 100, 2147−2160. (26) Wan, H.; Åhman, M.; Holmén, A. G. Relationship between Brain Tissue Partitioning and Microemulsion Retention Factors of CNS Drugs. J. Med. Chem. 2009, 52, 1693−1700. (27) Plante, J.; Werner, S. JPlogP: an improved logP predictor trained using predicted data. J. Cheminform 2018, 10, 61. (28) Kuhn, M. Building Predictive Models in R Using the caret Package. J. Stat. Softw. 2008, 28, 5. (29) Kursa, M. B.; Rudnicki, W. R. Feature Selection with the Boruta Package. J. Stat. Softw. 2010, 36, 11. (30) Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug Delivery Rev. 2001, 46, 3−26. (31) Esaki, T.; Watanabe, R.; Kawashima, H.; Ohashi, R.; NatsumeKitatani, Y.; Nagao, C.; Mizuguchi, K. Data Curation can Improve the Prediction Accuracy of Metabolic Intrinsic Clearance. Mol. Inf. 2019, 38, 1800086. (32) Watanabe, R.; Esaki, T.; Kawashima, H.; Natsume-Kitatani, Y.; Nagao, C.; Ohashi, R.; Mizuguchi, K. Predicting Fraction Unbound in Human Plasma from Chemical Structure: Improved Accuracy in the Low Value Ranges. Mol. Pharmaceutics 2018, 15, 5302−5311. (33) Levin, V. A. Relationship of Octanol/Water Partition Coefficient and Molecular Weight to Rat Brain Capillary Permeability. J. Med. Chem. 1980, 23, 682−684. (34) Begley, D. J. The Blood-brain Barrier: Principles for Targeting Peptides and Drugs to the Central Nervous System. J. Pharm. Pharmacol. 1996, 48, 136−146. (35) Kalvass, J. C.; Olson, E. R.; Cassidy, M. P.; Selley, D. E.; Pollack, G. M. Pharmacokinetics and Pharmacodynamics of Seven Opioids in P-Glycoprotein-Competent Mice: Assessment of Unbound Brain EC50,u and Correlation of in Vitro, Preclinical, and Clinical Data. J. Pharmacol. Exp. Ther. 2007, 323, 346−355. (36) Lukacova, P. N.; Lave, T.; Fraczkiewicz, G.; Bolger, M. B.; Woltosz, W. S. General Approach to Calculation of Tissue:Plasma Partition Coefficients for Physiologically Based Pharmacokinetic (PBPK) Modeling. AAPS Nationl Annual Meeing and Exposition, Atlanta, GA, Nov 16−20, 2008; AAPS, 2008. (37) Rodgers, T.; Leahy, D.; Rowland, M. Physiologically Based Pharmacokinetic Modeling 1: Predicting the Tissue Distribution of Moderate-to-Strong Bases. J. Pharm. Sci. 2005, 94, 1259−1276. (38) Rodgers, T.; Rowland, M. Physiologically Based Pharmacokinetic Modelling 2: Predicting the Tissue Distribution of Acids, Very Weak Bases, Neutrals and Zwitterions. J. Pharm. Sci. 2006, 95, 1238− 1257. (39) Purdon, A. D.; Rosenberger, T. A.; Shetty, H. U.; Rapoport, S. I. Energy Consumption by Phospholipid Metablism in Mammalian Brain. Neurochem. Res. 2002, 27, 1641−1647. (40) Choi, J.; Yin, T.; Shinozaki, K.; Lampe, J. W.; Stevens, J. F.; Becker, L. B.; Kim, J. Comprehensive Analysis of Phospholipids in the Brain, Heart, Kidney, and Liver: Brain Phospholipids are Least Enriched with Polyunsaturated Fatty Acids. Mol. Cell. Biochem. 2018, 442, 187−201. (41) Nishiura, A.; Murakami, T.; Higashi, Y.; Yata, N. Role of Acidic Phospholipids in Tissue Distribution of Quinidine in Rats. J. Pharmacobio.-Dyn. 1987, 10, 135−141. (42) Kearnes, S.; McCloskey, K.; Berndl, M.; Pande, V.; Riley, P. Molecular Graph Convolutions: Moving Beyond Fingerprints. J. Comput.-Aided Mol. Des. 2016, 30, 595−608. (43) Coley, C. W.; Barzilay, R.; Green, W. H.; Jaakkola, T. S.; Jensen, K. F. Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction. J. Chem. Inf. Model. 2017, 57, 1757−1772. 3260

DOI: 10.1021/acs.jcim.9b00180 J. Chem. Inf. Model. 2019, 59, 3251−3261

Article

Journal of Chemical Information and Modeling (44) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. ChEMBL: A Large-scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40, D1100−D1107. (45) Kanehisa, M.; Goto, S.; Hattori, M.; Aoki-Kinoshita, K. F.; Itoh, M.; Kawashima, S.; Katayama, T.; Araki, M.; Hirakawa, M. From Genomics to Chemical Genomics: New Developments in KEGG. Nucleic Acids Res. 2006, 34, D354−D357. (46) Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. The Chemistry Development Kit (CDK): An OpenSource Java Library for Chemo- and Bioinformatics. J. Chem. Inf. Comput. Sci. 2003, 43, 493−500. (47) Moriwaki, H.; Tian, Y. S.; Kawashita, N.; Takagi, T. Mordred: A Molecular Descriptor Calculator. J. Cheminform. 2018, 10, 4. (48) Yap, C. W. PaDEL-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints. J. Comput. Chem. 2011, 32, 1466−1474. (49) Klekota, J.; Roth, F. P. Chemical Substructures that Enrich for Biological Activity. Bioinformatics 2008, 24, 2518−2525. (50) R Core Team R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2018; https://www.R-project.org/ (accessed June 2018 to February 2019). (51) Breiman, L. Random Forest. Mach. Learn. 2001, 45, 5−32. (52) Friedman, J. H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189−1232. (53) Vapnik, V. N. The Nature of Statistical Learning Theory; Springer, New York, 1995. (54) Hoskuldsson, A. PLS Regression Methods. J. Chemom. 1988, 2, 211−228.

3261

DOI: 10.1021/acs.jcim.9b00180 J. Chem. Inf. Model. 2019, 59, 3251−3261