Iterative Screening Methods for Identification of Chemical Compounds

May 6, 2019 - Department of Chemical System Engineering, School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku , Tokyo 113-8656 , ...
Article pubs.acs.org/jcim

Cite This: J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Iterative Screening Methods for Identification of Chemical Compounds with Specific Values of Various Properties

Tomoyuki Miyao† and Kimito Funatsu*,†,‡

†Data Science Center and Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
‡Department of Chemical System Engineering, School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan

*S Supporting Information

ABSTRACT: Identification of chemical compounds having desirable properties is a central goal of screening campaigns. Iterative screening is a means of surveying a set of compounds, during which their property values are determined and used as feedback for regression models. Quantitative models that assess the relationships between chemical structures and property/activity are repeatedly updated through this type of cycle, and the efficient sampling of compounds for the subsequent test is a key factor in the early identification of target compounds. Nevertheless, methodological approaches to comparisons and to establishing the degree of extrapolation of sampled compounds, including the effects of applicability domains, are still required. In the present study, we conducted a series of virtual experiments to assess the characteristics of different iterative screening methods. Genetic algorithm-based partial least-squares regression, support vector regression, Bayesian optimization with Gaussian Process (GP), and batch-based Bayesian optimization with GP (GP_batch) were all compared, based on the analysis of one million compounds extracted from the ZINC database. Our results show that, irrespective of the diversity of the initial set of compounds, it was possible to identify a compound having the desired property value using the appropriate screening method. However, overall, the GP_batch method was found to be preferable when evaluating properties that either are difficult to predict or have a key factor present in the set of molecular descriptors.



INTRODUCTION

Identification of chemical compounds with specific properties is an important aspect of various research processes in the field of chemistry. Surveys of such compounds can be based on our current knowledge of the relationship between chemical structures and desired properties, and quantitative structure−property relationships (QSPRs) and quantitative structure−activity relationships (QSARs) play pivotal roles in such searches.1 These statistical models can be utilized to predict certain characteristics so as to determine which compounds should be included in the subsequent experimental work. For example, multitasking models for quantitative structure−biological effect relationships were proposed for identifying drug candidate compounds not only active against a single target but also showing a desirable safety profile.2 By combining this method with a fragment-based approach, de novo compounds can also be produced.3 After promising compounds are identified in this manner, they are actually synthesized and tested, and the experimental results are fed back to update the models. This procedure can be repeated until the compound or compounds with the desirable properties are obtained. This general procedure falls under the umbrella term of iterative screening.4 Iterative screening was originally used in the field of drug

discovery to reduce the number of compounds that had to be examined experimentally and to identify a greater number of active compounds compared to high-throughput screening.4,5 This process can also be applied during the lead optimization phase to find more potent compounds or compounds with desirable properties by predicting activity and/or property values in conjunction with feedback cycles.6,7 The goal of iterative screening can range from maximizing the number of active compounds identified during a given iteration time span8,9 to minimizing the number of iterations required to find optimal compounds or at least finding compounds that are superior to those in a current data set. In the latter case, active learning processes have been previously investigated in conjunction with various data sets,10 including those based on chemogenomics.11 Because the majority of screening campaigns are aimed at finding compounds having improved properties, the extrapolation of models is important, and iterative screening has been used to identify highly potent compounds, starting from a set of inactive or weakly potent compounds.12 Linear regression has

Received: January 29, 2019


DOI: 10.1021/acs.jcim.9b00093 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

experiments, molecular weights and the number of hydrogen bond donors and acceptors were not used. All descriptors employed in each calculation are provided in the Supporting Information. Extended connectivity fingerprints29 with a diameter of four (ECFP4) were used in the form of 1024-bit vectors, folded via a modulo operation. To indicate the relationship between independent and objective variables, the individual descriptors showing the strongest Pearson correlations with the target property values were the number of benzene rings for log P (0.67), the BertzCT value for QED (−0.72), and the number of aliphatic rings for SA_score (0.62). Screening Methods. Four machine learning methods were considered for compound selection. These were partial least-squares regression30 with variable selection by genetic algorithm (GAPLS),31 Bayesian optimization (BO)14 with a Gaussian Process (GP),32 batch-based BO with a GP (GP_batch),33 and Support Vector Regression (SVR).34 Applicability domains (ADs)35 were also employed in conjunction with the GAPLS and SVR methods, based on identifying 5-nearest neighbor (5-NN) compounds in a training data set.36 Partial Least-Squares Regression with Variable Selection by Genetic Algorithm. PLS is a linear regression method in which both independent and objective variables are projected onto latent spaces and are used to construct a linear regression model. This projection is conducted such that the covariance of the latent variables for both independent and objective variables is maximized.37 This allows the identification of important factors related to the independent variables that are highly correlated with those of the projected objective variables. In this study, a single objective variable was considered (that is, one property was selected as the target), and the projected independent variables determined the objective property.
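The modulo folding of fingerprint features into 1024 bits mentioned above can be illustrated with a minimal sketch. It is not the in-house program used in the study: the function names `fold_features` and `tanimoto` and the integer feature identifiers are hypothetical, and real ECFP4 identifiers would come from a cheminformatics toolkit. The Tanimoto similarity is included because it is the usual comparison measure for such bit vectors.

```python
def fold_features(feature_ids, n_bits=1024):
    """Map arbitrary integer feature identifiers onto bit positions by modulo."""
    return {fid % n_bits for fid in feature_ids}


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit positions."""
    if not bits_a and not bits_b:
        return 1.0  # convention assumed here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Hypothetical feature identifiers, not real ECFP4 hashes.
# 5 and 1029 collide after folding (1029 % 1024 == 5), and 2048 maps to bit 0.
mol_a = fold_features([5, 1029, 2048, 77])
mol_b = fold_features([5, 77, 300])
similarity = tanimoto(mol_a, mol_b)
```

Folding makes the vector length fixed at the cost of occasional bit collisions, as the example with identifiers 5 and 1029 shows.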
The GAPLS method is a combination of PLS and variable selection via GA, and the addition of GA has often been reported to improve the predictability of PLS.38−40 In this study, the GA settings included a population size of 100 and 30 generations, together with the use of the 3-fold cross-validated R² value as the evaluation score. The latter value was also used to identify the optimal number of PLS components. Compounds were ranked according to the predicted objective variable values. When attempting to find compounds with a specific property value or a range of property values, compounds were ranked based on the closeness to the desired value or the mean of the desired range. Support Vector Regression. SVR is one of the most widely used QSAR/QSPR methods as it employs an epsilon tube-based loss function and is largely unaffected by small errors. As a result of using kernel functions, SVR can also be extended to perform as a nonlinear regression model. In this study, an SVR approach including the ν parameter (ν-SVR) was employed in combination with the radial basis function (RBF) kernel (for numerical descriptors) and the Tanimoto kernel41 (for ECFP4 fingerprints). The SVR hyper-parameter set {C, ν, γ} for the numerical descriptors was optimized by 5-fold cross validation on the basis of the mean absolute errors (MAEs), while the hyper-parameter set {C, ν} for the fingerprints was optimized based on the same criterion. Compounds were ranked according to the values of predicted objective variables in the same manner as in the GAPLS method. Bayesian Optimization with the Gaussian Process. BO has been extensively used to obtain globally optimized solutions of black box functions with a limited number of iterations. In this

been found to give superior extrapolation results compared with the random forest13 or a similarity search using a set of compounds compiled from the ChEMBL database. Recently, Bayesian optimization (BO),14,15 which is intended to find globally optimized solutions of black-box functions by sequential experiments, has been utilized to establish the optimal experimental conditions for materials design16,17 and chemical reactions.18,19 BO can also be applied to compound searches via iterative screening.20 There are several key aspects associated with identifying compounds with specific property values during iterative screening. These include determining the most suitable method for rapid identification of the compound, assessing whether or not QSPR models can capture key factors related to the properties, and the extent to which the compounds to be identified can be extrapolated from an initial data set. In the present work, we conducted a series of virtual experiments to assess these issues, based on careful selection of experimental settings. We considered various practical scenarios, starting from a small number of compounds. During each iteration only a small subset of compounds was sampled from a large pool of compounds (such as a library containing many compounds). Three properties were taken into account in this study: calculated log P (the logarithm of the partition coefficient), quantitatively estimated drug-likeness (QED),21 and estimated synthetic accessibility score (SA_score).22 These properties were chosen because they can be analytically calculated, leading to understanding characteristics of different iterative screening methods. As an application, iterative screening was also conducted for identifying highly potent compounds starting from a data set consisting of weakly potent compounds.



MATERIALS AND METHODS

Compound Data Sets for Property Optimization. In the initial step, one million compounds were randomly extracted from the ZINC15 database.23 Only compounds that were commercially available, within the “clean” reactivity category, and currently in stock were employed in the subsequent work. The calculated log P values for these compounds already provided in the ZINC15 database were employed in the analyses together with QED and SA_score values calculated using functions in the RDKit software package.24 Compound Data Sets for Potency Maximization. Five sets of active compounds with high-confidence potency data and ZINC decoy compounds were taken from a previous study on virtual screening.25 The target activity sets, originally used for activity cliff analysis,26 were extracted from the ChEMBL database (version 23)27 and showed large potency variation. Compound potency was specified in the form of equilibrium constants (Ki). The number of ZINC decoy compounds was around 50 times larger than that of the corresponding active compounds. All the compound data can be downloaded from a publicly accessible repository.28 Compounds for which descriptor calculation failed were removed, and active compounds were filtered so that their molecular weights were inside the range of the corresponding decoy compound data set. Molecular Representations. The molecular descriptors employed in this study varied depending on the target properties. Specifically, this work used 2D descriptors available in the RDKit modules24 that were not explicitly involved in the calculation of the target properties. In addition, van der Waals surface area (VSA)-type descriptors were omitted to simplify the virtual experiments. As an example, in the case of the QED


Table 1. Target Properties

property | range | mean (std) | start (goal) threshold, max | start (goal) threshold, min | no. of descriptors
QED | (0.01, 0.95) | 0.65 (0.20) | 0.09 (0.95) | 0.94 (0.02) | 128
SA_score | (1.00, 7.97) | 2.74 (0.60) | 1.51 (7.84) | 5.90 (1.14) | 139
log P | (−13.09, 19.02) | 3.30 (1.56) | −1.93 (15.44) | 8.93 (−9.59) | 141

Table 2. Activity Data Set Profile

data set | activity class | CPDs, active (goal) | CPDs, ZINC | start (goal) potency thresholds | MW range, active | MW range, ZINC
CHEMBL237 | kappa opioid receptor ligands | 1849 (20) | 95 862 | 5.8 (10.3) | 219−939 | 82−997
CHEMBL244 | coagulation factor X inhibitors | 1608 (17) | 81 997 | 5.2 (10.7) | 251−941 | 62−995
CHEMBL245 | muscarinic acetylcholine receptor M3 ligands | 650 (7) | 33 249 | 5.7 (10.2) | 137−900 | 62−977
CHEMBL4860 | apoptosis regulator Bcl-2 inhibitors | 576 (6) | 32 126 | 5.2 (10.9) | 178−963 | 82−963
CHEMBL1862 | tyrosine-protein kinase ABL inhibitors | 553 (6) | 28 248 | 6.2 (10.6) | 235−603 | 75−940

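Table 3. Number of Trials That Successfully Identified the Target Compound for Various Properties and Screening Approaches

method | QED max, cluster | QED max, rand | QED min, cluster | QED min, rand | SA_score max, cluster | SA_score max, rand | SA_score min, cluster | SA_score min, rand | log P max, cluster | log P max, rand | log P min, cluster | log P min, rand | sum
GAPLS | 0 | 0 | 3 | 5 | 3 | 5 | 3 | 5 | 3 | 5 | 3 | 5 | 40
GAPLS (AD) | 0 | 0 | 0 | 0 | 1 | 2 | 3 | 5 | 3 | 5 | 3 | 5 | 27
GP | 0 | 0 | 1 | 5 | 0 | 4 | 2 | 4 | 3 | 5 | 3 | 5 | 32
GP_batch | 0 | 0 | 3 | 5 | 3 | 5 | 3 | 4 | 3 | 5 | 3 | 5 | 39
SVR | 0 | 0 | 0 | 1 | 3 | 4 | 2 | 0 | 3 | 5 | 3 | 5 | 26
SVR (AD) | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 5 | 3 | 5 | 17
SVR (ECFP4) | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 5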

Applicability Domain Considerations. One of the most interesting aspects of iterative screening is the manner in which ADs determine the compounds that are selected. ADs can be regarded as interpolated regions in the compound space spanned by the training compounds, and therefore, it would appear to be impossible to search for extrapolated compounds inside an AD. In this work, compounds were placed inside an AD if their average 5-NN distance with respect to training compounds was less than a threshold value. This value was set to 1.5 × (Q3(d) − Q1(d)) + Q3(d), where Q3(d) and Q1(d) were the 75th and 25th percentiles of the average 5-NN pairwise distances for the training compounds.43 The AD approach was only applied in conjunction with the GAPLS and SVR methods with numerical descriptors. Compounds outside ADs were not considered when creating rankings. Iterative Screening Scenarios. Two types of optimization scenarios were assessed. One was to find compounds that possessed either maximum or minimum values of specific properties, while the other was to find compounds for which the property values were within specific ranges. Finding Compounds with Maximum or Minimum Property Values. The first series of experiments attempted to find compounds having maximum or minimum property values, starting from a limited data set. An initial subset was compiled by randomly selecting 100 compounds from the 0.1% of the one million ZINC compounds (that is, from 1000 compounds) having values of the specific property being evaluated as far as possible from the target value. The goal was to identify those compounds representing the top 0.001% of all ZINC compounds (that is, a total of 10 compounds). Iterative screening was conducted starting from the 100 compounds, sampling 10 compounds at each iteration until finding one of the goal

study, the GP method in conjunction with the RBF kernel was used as a surrogate model, while expected improvements (EI) were employed as an acquisition function for maximization or minimization scenarios.14 In those scenarios in which the goal was to identify compounds having a specific property value (or range of values), the probability of reaching the range was calculated.42 One of the greatest challenges associated with the ordinary BO approach is the myopic nature of this method when identifying those compounds to be included in a subsequent iteration, that is, selecting a small set of compounds for subsequent testing based on an acquisition function might result in sampling similar compounds. Therefore, batch sampling approaches have been developed. The basic concept of batch sampling is that subsequent compounds are sampled based on previously sampled compounds, although the property values of these newer compounds are not assessed. This is expected to reduce the number of total iterations required to reach the global optimum. González et al. previously proposed an efficient batch BO process for use with the GP method based on including local penalties in the acquisition function values for the neighborhoods of compounds that have already been selected.33 Intuitively, one would expect a penalty function to be added to already sampled points in a batch sampling procedure after taking into account the likelihood of reaching the optimal property value. This penalty function is differentiable, meaning that there is no need to modify the optimization algorithms from normal acquisition function optimization procedures. In the present work, an approach employing batch BO with the GP method (termed GP_batch) was also applied as a screening method. In the case of the GP approach, hyper-parameters were optimized by maximizing the marginal likelihood function of the training data with different initial values of the hyper-parameters randomly selected 10 times.
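The two acquisition functions named above can be sketched for a single candidate whose GP prediction is Gaussian with mean `mu` and standard deviation `sigma`. This is a minimal illustration using the standard closed forms, not the authors' implementation; the exact expressions used in the study follow refs 14 and 42 and are not reproduced in the text, and all function names here are assumptions. The local penalization used by GP_batch is not included.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best, maximize=True):
    """Closed-form EI of a prediction N(mu, sigma^2) over the best observed value."""
    diff = (mu - best) if maximize else (best - mu)
    if sigma <= 0.0:
        return max(diff, 0.0)  # no uncertainty: improvement is deterministic
    z = diff / sigma
    return diff * norm_cdf(z) + sigma * norm_pdf(z)


def probability_in_range(mu, sigma, low, high):
    """Probability that the predicted property falls inside [low, high]."""
    if sigma <= 0.0:
        return 1.0 if low <= mu <= high else 0.0
    return norm_cdf((high - mu) / sigma) - norm_cdf((low - mu) / sigma)


# Hypothetical predictions for two candidates; the current best observed value is 3.0.
ei_certain = expected_improvement(mu=3.2, sigma=0.0, best=3.0)
ei_uncertain = expected_improvement(mu=3.0, sigma=1.0, best=3.0)
p_range = probability_in_range(mu=0.0, sigma=1.0, low=-1.0, high=1.0)
```

A candidate whose mean equals the current best still has positive EI when its predictive uncertainty is nonzero, which is what lets the search move away from already explored regions.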


Table 4. Average Numbers of Trials Required To Identify a Target Compound (a)

Columns: QED, SA_score, and log P, each with max and min scenarios and with cluster and rand starting sets.
Rows: GAPLS, GAPLS (AD), GP, GP_batch, SVR, SVR (AD), SVR (ECFP4).

1.7 (0.6) 10.3 (3.2) 8.3 (4.0) 4.7 (4.6) 4.7 (1.5) 10.0 (6.9) 25.5 (14.8) 1.0 (0.0) 4.4 (1.1) 2.8 (0.4) 2.8 (0.4) 3.4 (3.4) 4.0 (1.4) 1.0 (−)

1.3 (0.6) 11.0 (11.4) 6.3 (5.9) 2.7 (1.2) 7.3 (10.1) 5.5 (2.1) 9.0 (4.9) 7.2 (3.8) 20.8 (8.2) 8.8 (3.4) 5.3 (4.2) 16.3 (7.0) 26.5 (10.6) 15.3 (19.7) 15.5 (19.1) 1.4 (0.5) 12.0 (0.0) 13.3 (3.8) 8.2 (5.6) 12.8 (8.0) 21.5 (0.7) 25.0 (−) 26.6 (4.8) 3.2 (1.3) 35.0 (−) 4.0 (2.6)

26.0 (−)

3.3 (1.2) 16.3 (6.0)

3.6 (1.7) 4.3 (3.5)

2.3 (1.2) 7.0 (−)

(a) Standard deviations are provided in parentheses.

compounds. The models were updated following each iteration using all of the compounds sampled to that point. The maximum iteration number was set to 40, meaning that at most 400 compounds were evaluated during each trial. Table 1 provides the properties that were assessed in addition to the associated range, mean and standard deviation values of the curated ZINC data set, and number of descriptors employed for each property. The ZINC data sets can be found in the Supporting Information. The threshold values for each property for the starting and goal compound subsets are given, which vary depending on the search strategy. As an example, when attempting to identify those compounds having the 10 highest log P (lipophilic) values, 100 compounds with log P values less than or equal to −1.93 (that is, being hydrophilic) were selected as starting compounds. Subsequently, 10 compounds were selected from the rest of the ZINC compounds per iteration until one of the goal compounds, having a log P value greater than or equal to 15.44, was identified. Various features of the sampled compounds were also monitored during this iterative screening to establish the diversity of sampled compounds and predictability of the current model. Five trials were conducted for each scenario by randomly setting different initial compound subsets. Starting with Clustered Compounds. In addition to the random selection of starting compounds, sets of clustered compounds were also used as the initial subsets to monitor the progression of compound diversity during iterative screening. K-means clustering was used based on ECFP4-based Tanimoto distances. Three clusters having more than 40 compounds and with the highest pairwise Tanimoto similarity were selected as the initial compound sets. The pairwise Tanimoto similarity in these clusters ranged from 0.23 (in one of the three trials for the QED minimization scenario) to 0.65 (in one of the three trials for the QED maximization scenario). 
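The maximization protocol described above (a fixed starting subset, 10 compounds sampled per iteration, models refit on everything sampled so far, at most 40 iterations) can be sketched as a short loop. This is an illustrative toy under stated assumptions: the one-descriptor least-squares model `linear_fit_predict`, the function `iterative_screen`, and the synthetic noise-free pool are all hypothetical stand-ins and do not correspond to GAPLS, SVR, or GP.

```python
def linear_fit_predict(train_x, train_y, cand_x):
    """Toy surrogate: least-squares line on a single descriptor (hypothetical stand-in)."""
    n = len(train_x)
    mean_x = sum(train_x) / n
    mean_y = sum(train_y) / n
    sxx = sum((x - mean_x) ** 2 for x in train_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return [mean_y + slope * (x - mean_x) for x in cand_x]


def iterative_screen(pool, start_idx, is_goal, fit_predict, batch=10, max_iter=40):
    """Return the iteration at which a goal compound is first sampled, else None.

    pool is a list of (descriptor, property) pairs; a property value is only
    "measured" (used for training) once its compound has been sampled.
    """
    sampled = list(start_idx)
    for iteration in range(1, max_iter + 1):
        taken = set(sampled)
        rest = [i for i in range(len(pool)) if i not in taken]
        if not rest:
            return None
        preds = fit_predict([pool[i][0] for i in sampled],
                            [pool[i][1] for i in sampled],
                            [pool[i][0] for i in rest])
        # Maximization scenario: take the candidates with the highest predictions.
        ranked = sorted(zip(preds, rest), reverse=True)
        picks = [i for _, i in ranked[:batch]]
        sampled.extend(picks)  # the model is refit on all sampled compounds next round
        if any(is_goal(pool[i][1]) for i in picks):
            return iteration
    return None


# Synthetic pool: the property is an exact linear function of the descriptor.
pool = [(float(i), 2.0 * i + 1.0) for i in range(1000)]
start = list(range(10))                 # start from the lowest-valued compounds
goal_value = max(y for _, y in pool)    # single goal compound in this toy pool
found_at = iterative_screen(pool, start, lambda y: y >= goal_value, linear_fit_predict)
# found_at is 1 here because the toy model extrapolates the noise-free trend exactly.
```

In this toy case the linear model reaches the goal in the first iteration, which loosely mirrors the situation where a key descriptor is available to a linear model.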
Finding Compounds with a Specific Property Value. The second series of experiments involved finding compounds with property values within specific ranges. Initially, 100 compounds were randomly selected from between the 44.95 and the 45.05 percentiles of the ZINC compounds (representing 1000 compounds) for two target properties (log P and QED) with the goal of identifying compounds between the 49.9995 and the 50.0005 percentiles. There were a total of 10 target compounds in the case of the QED trials and 271 for the log P trials due to multiple compounds having the same log P values. QED values for the initial compounds were between 0.6577 and 0.6583 (and between 0.68704 and 0.68705 for the goal compounds), while log P values for the initial compounds were between 3.051 and 3.055 (with a value of 3.238 for the goal compounds). Employing these slightly extrapolated goals assisted in understanding the precision of the iterative screening methods as well as ensuring that the methods did not merely propose compounds with extreme values or structures compared to those of the already sampled compounds. Monitored Features during Iteration. During iterative screening experiments, various features for both the training compounds (all of the compounds sampled at a given point in addition to the initial compounds) and the compounds to be sampled were monitored to establish the model accuracy as reflected by the MAE and R² values for training compounds, predicted property values, compound-to-scaffold ratio, pairwise Tanimoto similarity for training compounds, and MAE values for compounds to be sampled. Compound-to-scaffold ratios were calculated using the Bemis and Murcko scaffold definition,44

2.2 (0.8) 8.6 (1.8) 7.4 (1.9) 3.6 (0.5) 9.4 (9.3) 12.0 (8.7)


Figure 1. Best compound property values obtained using random starting sets. In the case that a target compound having the desired property value was identified prior to completion of the maximum 40 iterations, this value is used to assist in visual analysis of the results.

diversity. In this case, only data sets for kappa opioid receptor ligands (CHEMBL237) and coagulation factor X inhibitors (CHEMBL244) were considered due to the number of active compounds forming a cluster. Implementation. GPy,45 a GP framework written in Python, was used for Gaussian process modeling, while acquisition functions and batch processes for the GP method were implemented using the Python language. ECFP4 calculations and scaffold detection were performed using an in-house program built on the OEChem Toolkit.46 Scikit-learn modules47 were employed for the PLS, SVR, k-NN, and k-means calculations. GA calculations were conducted using the DEAP library.48
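The 5-NN applicability-domain criterion described in the Applicability Domain Considerations paragraph (a compound is inside the AD if its average 5-NN distance to the training compounds is below Q3(d) + 1.5 × (Q3(d) − Q1(d))) can be sketched in pure Python. Several choices below are assumptions not stated in the text: Euclidean distance, the "inclusive" quantile method, and the function names. The study itself used scikit-learn for the nearest-neighbor calculations.

```python
import math
import statistics


def mean_knn_distance(point, others, k=5):
    """Average Euclidean distance from point to its k nearest neighbors in others."""
    nearest = sorted(math.dist(point, o) for o in others)[:k]
    return sum(nearest) / len(nearest)


def ad_threshold(train, k=5):
    """Q3 + 1.5 * (Q3 - Q1) of the training compounds' average k-NN distances."""
    avg = [mean_knn_distance(p, train[:i] + train[i + 1:], k)
           for i, p in enumerate(train)]
    q1, _, q3 = statistics.quantiles(avg, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)


def inside_ad(query, train, threshold, k=5):
    """A query compound is inside the AD if its average k-NN distance is below the threshold."""
    return mean_knn_distance(query, train, k) < threshold


# Hypothetical two-descriptor training set laid out on a 5 x 5 grid.
train = [(float(x), float(y)) for x in range(5) for y in range(5)]
threshold = ad_threshold(train)
near = inside_ad((2.5, 2.5), train, threshold)    # surrounded by training points
far = inside_ad((10.0, 10.0), train, threshold)   # well outside the grid
```

Because the threshold is derived from the training set's own neighbor distances, it widens automatically as more diverse compounds are sampled and added to the training data.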

while ECFP4 values were used to establish pairwise Tanimoto similarity. Demonstration of Iterative Screening for Identifying Highly Potent Compounds. As a demonstrative application, we attempted to identify highly potent compounds starting from a set of weakly potent compounds for five bioactive targets. An initial subset was compiled by randomly selecting 40 compounds from the worst 10% of the active compounds in terms of pKi. The goal was to find at least one of the compounds in the top 1% of the active compounds. For each iteration, 10 compounds were chosen based on the model's decision, and the corresponding potency values were provided. Potency values for all decoy compounds were set to zero. This procedure was repeated for up to 40 iterations or until at least one of the goal compounds was found. For each target set, the ChEMBL ID, activity class, numbers of active and decoy (ZINC) compounds, and potency thresholds for the start and goal compounds are presented in Table 2. For each method, iterative screening was repeated five times with a randomly selected initial subset. Clustered compounds were also used as an initial subset to monitor the progression of compound



RESULTS AND DISCUSSION

Finding Compounds with Maximum and Minimum Property Values. The numbers of trials that successfully identified at least one compound with the desired property value within 40 iterations, employing various screening methods and properties, are summarized in Table 3. These numbers are


Figure 2. Best compound property values obtained using clustered starting sets. In the case that a target compound having the desired property value was identified prior to completion of the maximum 40 iterations, this value is used to assist in visual analysis of the results.

categorized based on target properties, scenarios (maximization and minimization), type of starting compounds (either randomly chosen or chosen within a cluster), and search method. For randomly chosen starting compounds, five trials were conducted, and for clustered starting compounds, three clusters with the highest intrasimilarity were chosen. The total number of successful trials is reported in the sum column, and the maximum value(s) in each column is in bold. On the basis of these data, it is evident that log P was the property that was most readily optimized. All of the screening methods, with the exception of the SVR (ECFP4) and SVR (AD) methods, found the target compound within 40 iterations. The second easiest target property was the SA_score, followed by the QED value. In the case of the QED maximization scenario, none of the screening methods identified a target compound irrespective of the initial compound set categories. In addition, clustering of the initial compounds did not affect the screening results, except when employing the GP method. Comparing the screening methods, the GAPLS approach found the target compound most frequently, followed by the GP_batch method while the performance of the SVR (ECFP4) method was poor.

Incorporating ADs did not assist in the identification of the target compound within 40 iterations. The average numbers of iterations required to find the target compound are provided in Table 4, along with standard deviations where available. These numbers are categorized as in Table 3. For each combination of target property, scenario, and starting compound set, the minimum average number of iterations is given in bold. Starting with similar compounds required more iterations than using an initial set consisting of random compounds. The GAPLS method required the fewest average iterations for 7 out of the 12 categories, although the GP_batch method required a similar average number of iterations. Interestingly, the GAPLS approach immediately found target compounds when searching for log P values irrespective of the scenario or type of initial compound set. This result suggests that the GAPLS method incorporates a key factor required to calculate log P, which is not unexpected because log P is a calculated property based on atomic contributions. Although screening methods incorporating ADs identified highly extrapolated compounds for some targets and scenarios, the performance of the GAPLS method was reduced when considering ADs. This may be explained by the fact that


Figure 3. Diversity transitions of sampled compounds.

5-NN-based ADs prohibited the sampling of compounds dissimilar to those in the training data set. The exception was in the case of the SA_score minimization scenario starting from a randomly chosen compound set. The SVR approach showed the same trend, and incorporation of ADs was evidently not required to find compounds that were highly extrapolated relative to the initial training data set. Using a total of six scenarios (representing maximization and minimization of log P, QED, and SA_score), the best property values among the sampled compounds in each trial were tracked. The best property values for both the maximization and the minimization scenarios are summarized in Figures 1 and 2 for random and clustered initial compounds, respectively. In these figures, each line represents the mean value with the standard deviation shown as a shaded region. In the log P maximization and minimization scenarios, the GAPLS method provided better property values than the other methods. Even though there was little information about lipophilicity in the initial data sets, extreme compounds could be immediately identified by the linear model. The SVR approach using ECFP4 fingerprints struggled with early identification of compounds having better property values but slowly achieved improved values as the trials proceeded. In the QED maximization scenario, for which none of the methods was able to identify a target compound, the GP, GP_batch, and SVR approaches were superior to the GAPLS method in terms of the property values of the best compound that was selected. The property QED differed from the other two target properties in

that it appeared to be more difficult to predict, because all of the descriptors associated with the calculation of this variable were explicitly omitted. In the case of log P, no descriptors were eliminated, while molecular weights and heavy atom count descriptors were eliminated for the SA_score. Iterative screening starting from a set of similar compounds did not prevent identification of a target compound, although the number of iterations required was increased for most of the screening methods (Table 4). Monitoring Transitions in the Diversity of Sampled Compounds. When attempting to identify compounds with new chemotypes while improving a model such that it functions reliably with a wide range of compounds (i.e., when expanding ADs), the sampling should include a wide variety of compounds. In this work, the different screening strategies showed different diversity characteristics. Figure 3 summarizes the variation in the diversity of the sampled compounds, starting from clustered initial compound sets. In the case of the QED maximization scenario, the average pairwise Tanimoto similarity for all sampled compounds, including the initial compound set, decreased as the experiments proceeded (Figure 3A). At the same time, the compound-to-scaffold ratio decreased (Figure 3B), meaning that the diversity of the sampled compounds increased monotonically except when employing the SVR and SVR (ECFP4) methods. Comparing the results from these screening methods demonstrates that the GAPLS and GP_batch approaches sampled a wider range of compounds than the GP, SVR, and SVR (ECFP4) methods. In contrast, when attempting to minimize


Figure 4. Variations in the mean absolute errors for QED maximization scenarios (A, random; B, clustered) and SA_score minimization scenarios (C, random; D, clustered). Average values are represented as lines with standard deviations as shaded regions.

the QED value, although the average pairwise Tanimoto similarity values all exhibited the same trend (Figure 3C), the compound-to-scaffold ratio increased as the experiments proceeded for all methods other than the GP_batch method. These results indicate that without forcing compounds to become diverse relative to one another, the sampled compounds could be analogues of one another because of the similarity principle in QSAR models.49 During the QED minimization scenario, the local penalty strategy incorporated in the GP_batch method worked well in terms of allowing the sampling of diverse compounds while simultaneously identifying the target compound as early as possible.

Monitoring the Reliability of Property Predictions. The reliability of the predicted values was monitored during these trials. The MAE values associated with sampling 10 compounds in each iteration were determined for the QED maximization and SA_score minimization scenarios as representative cases, as shown in Figure 4. For these scenarios, the SVR and SVR (ECFP4) methods showed smaller MAE values than the GAPLS, GP, and GP_batch approaches, partly because the latter methods sampled more diverse compounds than the former. During the early stage of the trials, the GAPLS method occasionally showed relatively high MAE values. This result is attributed to the nature of linear regression models, which generate large extrapolated values for compounds not included in the ADs. In the GP_batch method, the range of sampled

compounds was comparable to that for the GAPLS method, although large MAE values were not observed due to the methodological nature of the former approach, which employs the RBF as a kernel function. Although the GAPLS method sometimes produced very large MAE values during the early stage of the trials, these values eventually dropped to the same level as those generated by the GP and GP_batch methods. Introducing ADs into the GAPLS method suppressed this phenomenon. Figure 5 presents the property values of the best candidate compounds and the mean absolute errors of the sampled compound set at each iteration for two regression methods with and without ADs. Incorporating ADs into the SVR method did not significantly improve the MAE values because the SVR method already showed sufficient predictability for the sampled compounds. When performing the SA_score minimization scenario, introducing ADs did not greatly modify the best property values of the sampled compounds (Figure 5A). In addition, for the log P maximization scenario and other scenarios, introducing ADs did not promote early detection of the target compounds, as shown in Table 4. It should be noted, however, that the AD criterion in the present study did not prevent the models from proposing extrapolated compounds.

Comparison of Sampled Compounds among Methods. Each screening method imparted distinct characteristics to the sampled compounds, and Figure 6 presents the top three


Figure 5. Effects of applicability domains on iterative screening for SA_score minimization (A, B) and log P minimization scenarios (C, D), employing randomly chosen starting compound sets. Average values are represented as lines with standard deviations as shaded regions.

compounds based on evaluation functions at 1, 10, and 40 iterations for the QED maximization scenario, employing the GAPLS, GP_batch, and SVR methods. Figure 7 shows the top three compounds following one, two, and three iterations for the log P minimization scenario using the same three methods. When employing the GAPLS approach, the QSPR models evidently attempted to identify a key factor for the target property. In the case of log P, this key factor seems to have been the number of hydroxyl groups, even though the experiment started with a set of compounds that were extremely lipophilic. When optimizing the QED during the first iteration, the number of fused rings without any heteroatoms appears to have been a key factor in the model. However, this key-factor assumption worked when searching for log P but not for QED. The GP_batch method proposed a diverse range of compounds based on the trade-off between explorative and exploitive screening due to the EI acquisition function. During the first iteration of the QED screening, the top compound was similar to the initial compounds, while the second and third best were different, as a result of the local penalization associated with the batch sampling (Figure 6). The GP_batch approach successfully identified compounds with high QED values (0.9) following iteration 10. This sampling strategy sometimes chose compounds that were dissimilar to any compounds in the already sampled compound pool (representing explorative screening) to try to identify a compound whose property value was higher than those of any

sampled compounds. This process produced the top three candidates at iteration 40 (Figure 6). For these compounds, the surrogate GP model predicted QED values of 0.63, 0.63, and 0.62 from top to bottom, and the predicted standard deviations for the corresponding compounds were all 0.30. The evaluation function ranked these compounds as the best ones because they had the potential to surpass the current best compound in the data set (QED 0.947 found at iteration 12). The SVR method showed more exploitive behavior. In the QED maximization scenario, analogous compounds were sampled in the first iteration, and therefore, there was only a small improvement in the property value, while compounds with better QED values were proposed following subsequent iterations. These selected structures tended to include substructures found in the initial compounds. Although the SVR approach demonstrated exploitive features when sampling compounds, it succeeded in proposing compounds with small log P values by capturing the key factor for that property.

Precision of Screening Methods. Regression models always contain errors in prediction. Although the resolution of the QSPR model is normally limited, proposing compounds with a specific property value or a value within a small range is sometimes necessary. Thus, a second series of virtual experiments was conducted to evaluate this aspect of the screening process for log P and QED, which were the easiest and the most difficult properties in the previous virtual experiments, respectively.
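Returning briefly to the EI-based ranking of the iteration-40 candidates described above, the following minimal sketch evaluates the standard expected improvement for maximization under a Gaussian predictive distribution. It uses the predicted mean (0.63), predicted standard deviation (0.30), and current best QED (0.947) quoted in the text. The function name and the second, confidently predicted candidate (mean 0.93, standard deviation 0.01) are hypothetical, and the exact EI variant used in the study is not specified here, so this is an illustration under stated assumptions rather than the authors' implementation.

```python
import math


def expected_improvement(mu, sigma, best):
    """Expected improvement for maximization, assuming a Gaussian prediction.

    mu and sigma are the surrogate model's predicted mean and standard
    deviation; best is the highest property value observed so far.
    """
    if sigma <= 0.0:
        # No predictive uncertainty: improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mu - best) * cdf + sigma * pdf


best_qed = 0.947  # best QED in the data set, as quoted in the text

# Candidate from the text: modest predicted mean, large uncertainty.
ei_uncertain = expected_improvement(0.63, 0.30, best_qed)

# Hypothetical candidate: prediction close to the best, but very certain.
ei_confident = expected_improvement(0.93, 0.01, best_qed)

print(f"EI (mean 0.63, SD 0.30): {ei_uncertain:.4f}")
print(f"EI (mean 0.93, SD 0.01): {ei_confident:.6f}")
```

Under these assumptions, the uncertain candidate receives the larger EI even though its predicted mean is well below the current best, which is consistent with the explorative selections reported for the GP_batch method.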


Figure 6. Examples of iteratively screened compounds for the QED maximization scenario using the GAPLS, GP_batch, and SVR methods.

Table 5 provides the number of trials that successfully found at least one of the target compounds within 40 iterations. Here, the goals were values of 3.238 and 0.6870 for log P and QED, respectively. The successful trial numbers are categorized depending on the target properties, starting compounds (either randomly chosen or chosen within a cluster), and search method. For each screening method, the total number of successful trials is reported in the sum column, and the maximum number in the column is reported in bold. Unlike previous experiments, the exploitative approach was more important than the explorative version in these trials. In the log P experiments, the GP method found suitable compounds in the greatest number of trials, while the GAPLS (AD) and GP_batch methods were unable to identify appropriate compounds when starting from a random initial set of compounds. In the case of QED optimization, none of the screening methods could identify one of the goal compounds within 40 iterations, suggesting that these QSPR models and the associated screening procedures evidently did not have sufficient resolution. The mean absolute deviations of the objective property values between the sampled compounds at each iteration and the target compounds were monitored. These data show how close the proposed compounds were to the target compounds for each screening method (Figure 8). At each iteration, the diversities of the sampled compounds in terms of pairwise Tanimoto similarity were also monitored. These data showed that the GAPLS model sampled the most diverse compounds, followed by the SVR (ECFP4) model, with the SVR, GP, and GP_batch methods having the lowest diversities (Figure 8E and 8F). The variations in the MAE values for the compounds at each iteration confirmed that the SVR (ECFP4) model exhibited the lowest predictability for its sampled compounds, while the GP and GP_batch models showed high predictability for sampled compounds at each iteration (Figure 8C and 8D). In addition, the SVR (ECFP4) model was found to sample compounds whose property values were not close to the target values. Overall, the GP and GP_batch approaches sampled more suitable compounds than the other methods (Figure 8A and 8B), in agreement with the model precision rankings and considering the minimal explorative characteristics of this experiment.

Summarizing these results, GAPLS is a better screening method than SVR (ECFP4) in terms of the trade-off between employing explorative and exploitive screening strategies. The GP and GP_batch methods sampled compounds that were sometimes quite similar to the compounds in the already sampled pools but proposed the best compounds in terms of closeness to the goal. Interestingly, the GP_batch method tended to sample diverse compounds at the first iteration but immediately adjusted the model to sample more similar compounds (Figure 8). Therefore, when a small degree of extrapolation is required in terms of property values, the GP and GP_batch methods may be good choices for this purpose.

Figure 7. Examples of iteratively screened compounds for the log P minimization scenario using the GAPLS, GP_batch, and SVR methods.

Table 5. Number of Successful Trials for Narrow Range Goals

              log P              QED
method        cluster   rand     cluster   rand     sum
GAPLS         1         1        0         0        2
GAPLS (AD)    1         0        0         0        1
GP            2         2        0         0        4
GP_Batch      1         0        0         0        1
SVR           0         2        0         0        2
SVR (AD)      0         2        0         0        2
SVR (ECFP4)   2         1        0         0        3


Figure 8. Precision of various models, as represented by the mean absolute deviation of the y values of a set of sampled compounds from the goal value (A and B), mean absolute error for the sampled compounds (C and D), and average pairwise Tanimoto similarity of all sampled compounds including initial compounds (E and F).
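The three quantities plotted in Figure 8 can be sketched in a few lines of code. The example below is a minimal illustration and not the authors' implementation: fingerprints are represented as plain sets of on-bit indices instead of real ECFP4 vectors, all function names are hypothetical, and the measured values, predicted values, and fingerprints are invented toy data. Only the log P goal value (3.238) is taken from the text.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def mean_pairwise_tanimoto(fps):
    """Average Tanimoto similarity over all unordered pairs of fingerprints."""
    pairs = list(combinations(fps, 2))
    if not pairs:
        raise ValueError("at least two fingerprints are required")
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def mean_abs_deviation_from_goal(y_measured, goal):
    """Mean absolute deviation of measured property values from the goal value."""
    return sum(abs(y - goal) for y in y_measured) / len(y_measured)


def mean_absolute_error(y_measured, y_predicted):
    """Mean absolute error between measured and predicted property values."""
    if len(y_measured) != len(y_predicted):
        raise ValueError("inputs must have the same length")
    return sum(abs(t - p) for t, p in zip(y_measured, y_predicted)) / len(y_measured)


goal = 3.238                    # log P goal value quoted in the text
measured = [3.0, 3.5, 2.9]      # toy measured values (hypothetical)
predicted = [3.1, 3.3, 3.0]     # toy model predictions (hypothetical)
fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 3, 4}]  # toy fingerprints

dev = mean_abs_deviation_from_goal(measured, goal)
mae = mean_absolute_error(measured, predicted)
sim = mean_pairwise_tanimoto(fps)
print(f"deviation from goal: {dev:.3f}, MAE: {mae:.3f}, mean similarity: {sim:.3f}")
```

A lower deviation from the goal indicates sampled compounds closer to the target value, a lower MAE indicates more reliable predictions for the sampled compounds, and a lower mean similarity indicates a more diverse sampled set.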

Table 6. Number of Successful Trials for Highly Potent Compound Screening

              CHEMBL237          CHEMBL245          CHEMBL244   CHEMBL4860   CHEMBL1862
method        cluster   random   cluster   random   random      random       random       sum
GAPLS         3         5        1         4        5           5            0            23
GP            3         5        3         5        5           5            4            30
GP_Batch      3         5        2         4        5           5            5            29
SVR           3         5        3         5        5           5            3            29
SVR (ECFP4)   3         5        3         5        5           5            5            31
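The sum column of Table 6 can be cross-checked with a short script. The sketch below transcribes the seven per-condition columns in the order in which they appear in the table (variable names are illustrative) and confirms that each row total matches the reported sum. The maximum attainable total of 31 is an inference from the trial counts given in the accompanying text (three clusters for each of the two clustered data sets and five random trials for each of the five data sets), not a figure stated by the authors.

```python
methods = ["GAPLS", "GP", "GP_Batch", "SVR", "SVR (ECFP4)"]

# Per-condition success counts, one list per table column, in table order.
# Within each list, entries follow the order of `methods`.
columns = [
    [3, 3, 3, 3, 3],  # clustered starting sets
    [5, 5, 5, 5, 5],  # random starting sets
    [1, 3, 2, 3, 3],  # clustered starting sets
    [4, 5, 4, 5, 5],  # random starting sets
    [5, 5, 5, 5, 5],  # random starting sets
    [5, 5, 5, 5, 5],  # random starting sets
    [0, 4, 5, 3, 5],  # random starting sets (CHEMBL1862)
]
reported_sum = [23, 30, 29, 29, 31]

# Row totals recomputed from the individual columns.
totals = [sum(col[i] for col in columns) for i in range(len(methods))]

# Two clustered conditions with 3 trials each, five random conditions with 5 each.
max_possible = 2 * 3 + 5 * 5

for name, total, reported in zip(methods, totals, reported_sum):
    status = "ok" if total == reported else "mismatch"
    print(f"{name:12s} {total:2d}/{max_possible} ({status})")
```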

Iterative Screening for Identifying Highly Potent Compounds. The numbers of trials that successfully identified at least one goal compound within 40 iterations, employing various screening methods and targets, are summarized in Table 6. These numbers are categorized based on target data sets and type of starting compounds (either randomly chosen or chosen within a cluster for CHEMBL237 and 245) and search method. For randomly chosen starting compounds, five trials were conducted, and for clustered starting compounds, three clusters with the highest intrasimilarity were chosen. The total number of successful trials is reported in the sum column (Table 6). Overall, the GAPLS method seemed inferior to the GP, GP_batch, SVR, and SVR (ECFP4) methods in terms of the number of successful trials. For tyrosine-protein kinase ABL inhibitors (CHEMBL1862), the GAPLS method did not succeed in even a single trial. Figure 9 clearly shows that the GAPLS method struggled to identify highly potent compounds for this target (Figure 9E). On the basis of the assumption derived from property analysis studies in the previous sections, the relationship between potency and descriptors seemed highly nonlinear for this target. Overall, GP_batch required a smaller number of iterations to identify highly potent compounds than other methods, whereas SVR (ECFP4) gradually identified more highly potent compounds. When an initial data set consisted of similar compounds, iterative screening could also identify a goal compound in most of the trials. For the GAPLS and the GP_batch methods, the diversity of the sampled compounds increased, meaning that these methods proposed to sample more extrapolated compounds than other methods (Figure 10C and 10D). On the other hand, SVR (ECFP4) tended to sample less diverse compounds, and the best compounds’ potency increased at the slowest pace among the methods employed in this study (Figure 10A and 10B).

Figure 9. Best compound potency values obtained using random starting sets. In the case that a goal compound was identified prior to completion of the maximum 40 iterations, the potency value for the compound is used to assist in visual analysis of the results: (A) CHEMBL237, (B) CHEMBL244, (C) CHEMBL245, (D) CHEMBL4860, and (E) CHEMBL1862.

Figure 10. Best compound potency values and diversity transitions of sampled compounds using clustered data sets in the case of the potency maximization scenario. Best compound potency values for data sets CHEMBL237 (A) and 245 (B) were monitored. Also, the average pairwise Tanimoto similarity of all sampled compounds including initial compounds is reported for CHEMBL237 (C) and 245 (D).

CONCLUSIONS

In this work, we investigated the potential applications of iterative screening, focusing on the degree of explorative ability when employing careful experimental settings. These experiments showed that the GAPLS approach was superior to the other methods when a linear method was able to capture the relationship between molecular descriptors and a target property (log P in the present study). In the GAPLS method, a diverse range of compounds was sampled to enhance the likelihood of finding the target compound. The SVR, GP, and GP_batch methods were superior to the GAPLS and SVR (ECFP4) methods in the case of the QED maximization scenario in terms of the early detection of compounds with better property values. This result is partly attributed to the greater difficulty in predicting QED values using an intentionally reduced descriptor set and because extrapolation may not be as straightforward for this property compared to log P. The GAPLS and GP_batch methods sampled diverse compounds in terms of pairwise Tanimoto distances and compound-to-scaffold ratios, and the GAPLS approach sometimes proposed compounds whose predicted property values were far from accurate. This phenomenon was suppressed by introducing ADs when sampling the compounds, although this modification typically increased the number of iterations required to reach the target compound. The GP_batch method exhibited consistently stable performance for all of the target properties and scenarios. Identifying compounds with slightly extrapolated property values and monitoring the precision of the screening methods demonstrated that the GP approach selected the desired compound in four out of eight trials. In addition, when the degree of extrapolation was relatively small, the GP and GP_batch methods sampled compounds similar to those already contained in the training data set to obtain the target compounds with more precise models. In summary, the GAPLS method seems superior in the case that a key factor for the target property is represented by the molecular descriptors employed during the iterative screening. Otherwise, the GP_batch, SVR, and GP methods tend to identify the most appropriate compound with fewer iterations. The GP model proposed compounds with the highest resolution among the methods assessed in this study. Overall, the GP_batch approach might be a good starting method, considering that it permits the early detection of distinct as well as similar compounds. In addition, the key factor is usually not known prior to the iterative screening. Furthermore, even when starting from a set of similar compounds, the target compound could be identified, although it was necessary to consider a greater number of compounds to do so. In addition, the iterative sampling of compounds allowed localization of sampled compounds to be relaxed, and the degree of relaxation was greatest for the GAPLS and GP_batch methods and smallest for the SVR (ECFP4) method.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.9b00093.

Molecular descriptors for each of the target properties (PDF)
One million ZINC compounds annotated with the three properties used in this study (ZIP)



AUTHOR INFORMATION

Corresponding Author

*Tel: +81-3-5841-7751. Fax: +81-3-5841-7771. E-mail: [email protected].

ORCID

Tomoyuki Miyao: 0000-0002-8769-2702
Kimito Funatsu: 0000-0002-9368-0302

Author Contributions

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.


Funding

The authors declare no financial competing interest.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

We are grateful to Jürgen Bajorath at the University of Bonn for fruitful discussion on iterative screening and active searches. We also thank OpenEye Scientific Software, Inc., for providing a free academic license for the OpenEye toolkits.

REFERENCES

(1) Tropsha, A.; Golbraikh, A. Predictive QSAR Modeling Workflow, Model Applicability Domains, and Virtual Screening. Curr. Pharm. Des. 2007, 13, 3494−3504.
(2) Speck-Planche, A.; Cordeiro, M. N. D. S. Multitasking Models for Quantitative Structure−biological Effect Relationships: Current Status and Future Perspectives to Speed up Drug Discovery. Expert Opin. Drug Discovery 2015, 10, 245−256.
(3) Speck-Planche, A.; Dias Soeiro Cordeiro, M. N. Speeding up Early Drug Discovery in Antiviral Research: A Fragment-Based in Silico Approach for the Design of Virtual Anti-Hepatitis C Leads. ACS Comb. Sci. 2017, 19, 501−512.
(4) Bajorath, J. Integration of Virtual and High-Throughput Screening. Nat. Rev. Drug Discovery 2002, 1, 882−894.
(5) Stahura, F.; Bajorath, J. Virtual Screening Methods That Complement HTS. Comb. Chem. High Throughput Screening 2004, 7, 259−269.
(6) Varela, R.; Walters, W. P.; Goldman, B. B.; Jain, A. N. Iterative Refinement of a Binding Pocket Model: Active Computational Steering of Lead Optimization. J. Med. Chem. 2012, 55, 8926−8942.
(7) Borrotti, M.; De March, D.; Slanzi, D.; Poli, I. Designing Lead Optimisation of MMP-12 Inhibitors. Computational and Mathematical Methods in Medicine 2014, 2014, 258627.
(8) Warmuth, M. K.; Liao, J.; Rätsch, G.; Mathieson, M.; Putta, S.; Lemmen, C. Active Learning with Support Vector Machines in the Drug Discovery Process. J. Chem. Inf. Comput. Sci. 2003, 43, 667−673.
(9) Garnett, R.; Gärtner, T.; Vogt, M.; Bajorath, J. Introducing the ‘Active Search’ Method for Iterative Virtual Screening. J. Comput.-Aided Mol. Des. 2015, 29, 305−314.
(10) Reker, D.; Schneider, G. Active-Learning Strategies in Computer-Assisted Drug Discovery. Drug Discovery Today 2015, 20, 458−465.
(11) Reker, D.; Schneider, P.; Schneider, G.; Brown, J. Active Learning for Computational Chemogenomics. Future Med. Chem. 2017, 9, 381−402.
(12) Cortés-Ciriano, I.; Firth, N. C.; Bender, A.; Watson, O. Discovering Highly Potent Molecules from an Initial Set of Inactives Using Iterative Screening. J. Chem. Inf. Model. 2018, 58, 2000−2014.
(13) Ho, T. K. Random Decision Forests. Proceedings of 3rd International Conference on Document Analysis and Recognition; IEEE Computer Society Press, 1995; Vol. 1, pp 278−282.
(14) Jones, D. R.; Schonlau, M.; Welch, W. J. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization 1998, 13, 455−492.
(15) Snoek, J.; Larochelle, H.; Adams, R. P. Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems 25; Pereira, F., Burges, C. J. C., Bottou, L., Weinberger, K. Q., Eds.; Curran Associates, Inc., 2012; pp 2951−2959.
(16) Ueno, T.; Rhone, T. D.; Hou, Z.; Mizoguchi, T.; Tsuda, K. COMBO: An Efficient Bayesian Optimization Library for Materials Science. Materials Discovery 2016, 4, 18−21.
(17) Ju, S.; Shiga, T.; Feng, L.; Hou, Z.; Tsuda, K.; Shiomi, J. Designing Nanostructures for Phonon Transport via Bayesian Optimization. Phys. Rev. X 2017, 7, 021024.
(18) Häse, F.; Roch, L. M.; Kreisbeck, C.; Aspuru-Guzik, A. Phoenics: A Bayesian Optimizer for Chemistry. ACS Cent. Sci. 2018, 4, 1134−1145.
(19) Schweidtmann, A. M.; Clayton, A. D.; Holmes, N.; Bradford, E.; Bourne, R. A.; Lapkin, A. A. Machine Learning Meets Continuous Flow Chemistry: Automated Optimization towards the Pareto Front of Multiple Objectives. Chem. Eng. J. 2018, 352, 277−282.
(20) Ahmadi, M.; Vogt, M.; Iyer, P.; Bajorath, J.; Fröhlich, H. Predicting Potent Compounds via Model-Based Global Optimization. J. Chem. Inf. Model. 2013, 53, 553−559.
(21) Bickerton, G. R.; Paolini, G. V.; Besnard, J.; Muresan, S.; Hopkins, A. L. Quantifying the Chemical Beauty of Drugs. Nat. Chem. 2012, 4, 90−98.
(22) Ertl, P.; Schuffenhauer, A. Estimation of Synthetic Accessibility Score of Drug-like Molecules Based on Molecular Complexity and Fragment Contributions. J. Cheminf. 2009, 1, 8.
(23) Sterling, T.; Irwin, J. J. ZINC 15 − Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55, 2324−2337.
(24) Landrum, G. RDKit: Open-source cheminformatics; http://www.rdkit.org (accessed Jan 15, 2019).
(25) Miyao, T.; Funatsu, K.; Bajorath, J. Exploring Alternative Strategies for the Identification of Potent Compounds Using Support Vector Machine and Regression Modeling. J. Chem. Inf. Model. 2019, 59, 983−992.
(26) Hu, H.; Stumpfe, D.; Bajorath, J. Rationalizing the Formation of Activity Cliffs in Different Compound Data Sets. ACS Omega 2018, 3, 7736−7744.
(27) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40, D1100−D1107.
(28) Miyao, T.; Funatsu, K.; Bajorath, J. Compound data sets for support vector machine and regression modeling. Zenodo 2019, 1453935.
(29) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754.
(30) Wold, S.; Sjöström, M.; Eriksson, L. PLS-Regression: A Basic Tool of Chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109−130.
(31) Hasegawa, K.; Miyashita, Y.; Funatsu, K. GA Strategy for Variable Selection in QSAR Studies: GA-Based PLS Analysis of Calcium Channel Antagonists. J. Chem. Inf. Comput. Sci. 1997, 37, 306−310.
(32) Rasmussen, C. E.; Williams, C. K. I. Gaussian Processes for Machine Learning; MIT Press, 2006.
(33) González, J.; Dai, Z.; Hennig, P.; Lawrence, N. Batch Bayesian Optimization via Local Penalization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR Workshop and Conference Proceedings; JMLR, 2016; Vol. 51, pp 648−657.
(34) Smola, A. J.; Schölkopf, B. A Tutorial on Support Vector Regression. Statistics and Computing 2004, 14, 199−222.
(35) Jaworska, J.; Nikolova-Jeliazkova, N.; Aldenberg, T. QSAR Applicabilty Domain Estimation by Projection of the Training Set Descriptor Space: A Review. ATLA, Altern. Lab. Anim. 2005, 33, 445−459.
(36) Mathea, M.; Klingspohn, W.; Baumann, K. Chemoinformatic Classification Methods and Their Applicability Domain. Mol. Inf. 2016, 35, 160−180.
(37) de Jong, S. SIMPLS: An Alternative Approach to Partial Least Squares Regression. Chemom. Intell. Lab. Syst. 1993, 18, 251−263.
(38) Xie, H.; Zhao, J.; Wang, Q.; Sui, Y.; Wang, J.; Yang, X.; Zhang, X.; Liang, C. Soil Type Recognition as Improved by Genetic Algorithm-Based Variable Selection Using near Infrared Spectroscopy and Partial Least Squares Discriminant Analysis. Sci. Rep. 2015, 5, 10930.
(39) Yamashita, F.; Wanchana, S.; Hashida, M. Quantitative Structure/Property Relationship Analysis of Caco-2 Permeability Using a Genetic Algorithm-based Partial Least Squares Method. J. Pharm. Sci. 2002, 91, 2230−2239.
(40) Zolgharnein, J.; Asanjarani, N.; Azimi, G.; Ghasemi, J. Simultaneous Spectrophotometric Determination of Ga(III) and Tl(III) by Using Genetic Algorithm Based on Wavelength Selection-Partial Least Squares Regression. J. Anal. Chem. 2015, 70, 148−153.
(41) Ralaivola, L.; Swamidass, S. J.; Saigo, H.; Baldi, P. Graph Kernels for Compound Informatics. Neural Networks 2005, 18, 1093−1110.
(42) Kishio, T.; Kaneko, H.; Funatsu, K. Strategic Parameter Search Method Based on Prediction Errors and Data Density for Efficient Product Design. Chemom. Intell. Lab. Syst. 2013, 127, 70−79.
(43) Sahigara, F.; Ballabio, D.; Todeschini, R.; Consonni, V. Defining a Novel K-Nearest Neighbours Approach to Assess the Applicability Domain of a QSAR Model for Reliable Predictions. J. Cheminf. 2013, 5, 27.
(44) Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887−2893.
(45) GPy. GPy: A Gaussian Process Framework in Python; http://github.com/SheffieldML/GPy (accessed Jan 15, 2019).
(46) OEChem TK, version 2.1.5; OpenEye Scientific Software, 2018.
(47) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, É. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825−2830.
(48) Fortin, F.-A.; De Rainville, F.-M.; Gardner, M.-A.; Parizeau, M.; Gagné, C. DEAP: Evolutionary Algorithms Made Easy. J. Mach. Learn. Res. 2012, 13, 2171−2175.
(49) Concepts and Applications of Molecular Similarity; Johnson, M. A., Maggiora, G. M., Eds.; Wiley: New York, 1990.
