Article Cite This: J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
pubs.acs.org/jcim
Exploring Alternative Strategies for the Identification of Potent Compounds Using Support Vector Machine and Regression Modeling Tomoyuki Miyao,† Kimito Funatsu,†,‡ and Jürgen Bajorath*,§ †
J. Chem. Inf. Model. Downloaded from pubs.acs.org by YORK UNIV on 12/17/18. For personal use only.
Data Science Center and Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan ‡ Department of Chemical System Engineering, School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan § Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, D-53115 Bonn, Germany S Supporting Information *
ABSTRACT: Support vector regression (SVR) is a premier approach for the prediction of compound potency. Given the conceptual link between support vector machine (SVM) and SVR modeling, SVR is capable of accounting for continuous and discontinuous structure−activity relationships (SARs) in potency prediction, which further extends the classical quantitative SAR (QSAR) paradigm. In the context of virtual compound screening, compound potency prediction can be applied to identify the most potent compounds that are available or enrich database selection sets with potent compounds. To these ends, we have evaluated new potency prediction strategies. Conventional (direct) potency prediction using SVR was compared to two-stage SVM-SVR modeling and potency prediction using SVR models trained in the presence of active and inactive compounds, a previously unconsidered approach. The latter models were found to maximize the recall of potent compounds but were least accurate in predicting high potency values. For this purpose, direct SVR predictions were preferred. However, the best balance between accurate potency predictions and enrichment of potent compounds in database selection sets was achieved by combined SVM-SVR modeling. Taken together, our findings further extend current approaches for compound potency prediction in virtual compound screening.
■
INTRODUCTION For virtual compound screening, a variety of computational methodologies have been developed or adapted.1 Classical approaches include various similarity search1,2 and quantitative structure− activity relationship (QSAR) methods.3 Over the past decade, a variety of machine learning algorithms have become increasingly popular for virtual screening.1,4 Widely investigated machine learning approaches include support vector machine (SVM)5−7 and support vector regression (SVR),8−11 random forest,12,13 gradient boosting,14,15 Bayesian classification,16,17 artificial neutral networks,18,19 and deep neural networks.20−22 Virtual compound screening generally aims to identify new chemical entities with a desired biological activity, but most computational methods that are appliedincluding similarity search and compound classification algorithmsdo not include compound potency as a search parameter. Instead, a qualitative measure of activity, i.e., active vs inactive, is typically applied. In recent years, however, increasing attention has been paid to direct screening calculations toward the detection of potent compounds,4 in the spirit of QSAR.3 For this purpose, SVR has become a method of choice.8,9 SVR is conceptually related to yet © XXXX American Chemical Society
distinct from the SVM formalism and used for prediction of numerical property values.8 As such, SVR is capable of accounting for a variety SARs in potency prediction, which explains the strong interest in SVR in the QSAR and chemoinformatics fields.23 In the context of virtual compound screening, potency prediction serves a dual purpose. In addition to predicting the potency level of test compounds as accurately as possible, and hence focus computational screening on highly potent compounds, one also aims to enrich potent compounds in database selection sets. Meeting these overlapping yet distinct goals requires careful consideration of prediction schemes. Herein, we have, in the context of virtual compound screening, evaluated previously unconsidered SVR strategies for their ability to predict most potent compounds in different data sets, enrich potent compounds in selection sets, and balance these Special Issue: Machine Learning in Drug Discovery Received: August 30, 2018
A
DOI: 10.1021/acs.jcim.8b00584 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Article
Journal of Chemical Information and Modeling Table 2. Training and Test Setsa
objectives. While others have focused on analyzing the applicability domain for SVR-based compound ranking,10 we have investigated new strategies to predict compound potency via SVR including combined SVM-SVR predictions, termed two-stage SVR, and SVR modeling on the basis of both active and inactive training compounds.
no. training CPDs
ZINC
■
MATERIALS AND METHODS Compound Data Sets. Ten different sets of active compounds with significant intraclass potency variations were taken from a previous investigation focusing on activity cliffs.24 These compound sets were selected from ChEMBL version 2325 on the basis of high-confidence activity data24 and are reported in Table 1. As potency measurements, equilibrium constant (pKi values) Table 1. Compound Sets from ChEMBLa potency (pKi) CHEMBL ID 237 244 245 4860 1862 1983 2954 3798 220 249
activity kappa opioid receptor ligands coagulation factor X inhibitors muscarinic acetylcholine receptor M3 ligands apoptosis regulator Bcl-2 inhibitors tyrosine-protein kinase ABL inhibitors serotonin 1d (5-HT1d) receptor ligands cathepsin S inhibitors calcitonin gene-related peptide type 1 receptor ligands acetylcholinesterase inhibitors neurokinin 1 receptor ligands
no. CPDs
max
min
1879 1608 652
11.5 11.4 10.5
4.1 3.6 4.1
630 553 478
11.3 10.7 10.7
3.3 4.2 4.4
435 416
9.9 11.3
3.5 4.7
380 350
11.8 12.0
1.0 3.8
no. test CPDs
data set
active CPDs
237 244 245 4860 1862 1983 2954 3798 220 249
939 804 326 315 276 239 217 208 190 175
ZINC
active CPDs (highly potent)
MACCS
ECFP4
potency threshold (pKi)
1878 1608 652 630 552 478 434 416 380 350
940 (94) 804 (80) 326 (32) 315 (31) 277 (27) 239 (23) 218 (21) 208 (20) 190 (19) 175 (17)
94000 80400 32600 31500 27700 23900 21800 20800 19,000 17500
470000 402000 163000 157500 138500 119500 109000 104000 95,000 87500
9.43 9.72 9.70 10.30 9.70 9.70 8.42 9.96 9.01 9.99
a
For each target set, the composition of training and test sets is reported. Compounds randomly selected from ZINC were used as inactive training instances. The numbers of ZINC compounds in test sets varied for different molecular representations. Potency thresholds are reported for defining highly potent compounds. ChEMBL IDs are used to designate the data sets.
were selected for performance evaluation. For evaluating the predictive ability of regression models, the mean absolute error (MAE) for measured and predicted potency values and the coefficient of determination (R2) were calculated for all active test compounds. Compound Classification. For classification of active versus inactive compounds (class label prediction), SVM models were derived. SVM is a supervised learning algorithm aiming to identify a hyperplane in descriptor space that best separate two classes of compounds while maximizing the margin of the hyperplane.30 SVM can be applied to nonlinear classification tasks through the use of kernels.31 The application of kernel functions makes it possible to map samples from the original descriptor space into a higher dimensional space without an explicit mapping function. Herein, the Tanimoto kernel32 was used in combination with ν-SVM that employs a ν parameter representing an upper bound on the fraction of permitted training errors. This hyperparameter was optimized by 3-fold cross validation based on area under the curve (AUC) values for receiver operating characteristic (ROC) curves. Class weights were introduced to penalize errors more in the minority class than in the majority class. The weights were determined by the inverse ratio of the number of samples in a class relative to the number of total samples (definition: no. all samples/ (2 × no. samples in a class)). For SVM classification models, compound rankings were generated on the basis of the signed distance from the hyperplane in direction of positive/active to negative/inactive and the models were evaluated on the basis of TP rates, Matthews correlation coefficients (MCC), AUC-ROC values, and AUC values for the precision recall curve (PRC), i.e., AUC-PRC values (with precision P = TP/(TP + FP)). Potency Prediction. For prediction of potency values, SVR8 including the ν parameter (ν-SVR) was used in combination with the Tanimoto kernel. SVR is a regression model variant of SVM. Rather than constructing a separating hyperplane like SVM, SVR derives a function to predict numerical values. SVR also projects data characterized by nonlinear SARs into higher dimensional space representations in which a linear
a
For each set, the CHEMBL ID, activity, number of active compounds (CPDs), and maximum and minimum potency values are reported.
were used. Inactive training examples were randomly selected from a subset of 1 000 000 compounds obtained from ZINC version 15.26 Inactive compounds should structurally be more diverse than compounds sharing the same activity, which was achieved by randomly selecting ZINC samples, as further discussed below. Molecular Representation. The extended connectivity fingerprints of bond diameter 4 (ECFP4)27 and MACCS28 were employed as molecular representations. MACCS consisted of 166 structural keys. Feature sets generated by ECFP4 were folded into a 1024-bit vector via modulo operation. Both fingerprints were generated using in-house Python scripts with the aid of the OEChem toolkit.29 Potency Criteria. For potency prediction and the identification of most potent compounds, “highly potent” compounds were defined as the top 10% most potent test compounds per data set (Table 2). As an alternative, “potent” compounds were defined as the subset of test compounds per set with at least median potency of all active test compounds. Threshold values are reported in Table 2. Performance Measures. Prediction accuracy was evaluated on the basis of true positive (TP) highly potent (or potent) compounds and false positive (FP) ZINC compounds selected from compound rankings. The size of compound selection sets was determined on a per data set basis. For example, if the top 10% most potent test compounds from a data set included 94 compounds, then the top 94 compounds from a given ranking B
DOI: 10.1021/acs.jcim.8b00584 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Article
Journal of Chemical Information and Modeling
assigned the lowest possible score in each case. First, range score was defined as the potency range among NNs. Second, mean score was defined as the average NN potency. Third, SARI score was calculated for NNs as the raw structure−activity relationship (SAR) discontinuity score component of the SAR Index33 to quantify potency variations among NNs. For SARI score calculations, the Tanimoto similarity threshold for NNs was set to 0.6 and 0.4 for MACCS and ECFP4, respectively.34 ν-SVR models were derived on the basis of active training compounds, which represents a conventional modeling strategy, and also on the basis of active compounds and ZINC compounds. Therefore, given a set of n active training compounds, 2n ZINC compounds were randomly selected and assigned a potency value of 0. Additional combined training sets were also generated by decreasing the ratio of inactive to active training compounds in a stepwise manner from 2.0 to 1.5, 1.0, 0.5, 0.1, 0.01, and 0.0 (no inactive training instances). SVR models derived from all training sets were evaluated on the basis of MAE and R2 as well as recall performance calculated from compound rankings. SVR models generated on the basis of active compounds were designated “SVR”, and models for active compounds in the presence of ZINC compounds were designated “SVR_ZINC”. All SVM and SVR calculations were carried out using scikitlearn (version 0.19.1).35
Figure 1. Virtual screening workflow. Path A: Two-stage virtual screening protocol (red lines) combines SVM and SVR modeling. Path B: Direct potency predictions using SVR (blue lines).
Table 3. First-Stage Compound Classification Using MACCSa data set
FN
FP
TN
TP
TP rate
MCC
AUCPRC
AUCROC
237 244 245 4860 1862 1983 2954 3798 220 249
31 14 19 4 7 12 3 1 25 19
1786 1403 1181 522 481 902 213 352 527 423
92214 78997 31419 30978 27219 22998 21587 20448 18473 17077
909 790 307 311 270 227 215 207 165 156
0.97 0.98 0.94 0.99 0.97 0.95 0.99 1.00 0.87 0.89
0.57 0.59 0.43 0.60 0.59 0.43 0.70 0.60 0.45 0.48
0.88 0.90 0.60 0.96 0.92 0.69 0.94 0.92 0.77 0.82
1.00 1.00 0.99 1.00 1.00 0.99 1.00 1.00 0.98 0.98
■
RESULTS AND DISCUSSION Prediction Strategy. The calculation protocol for two-stage and direct compound potency prediction is summarized in Figure 1. Two-stage predictions began with an SVM classification task (stage 1) to separate active from inactive compounds, followed by SVR-based regression task (step 2) for potency prediction of compounds classified as active in stage 1. Conventional direct potency predictions using SVR were carried out for all test compounds. An additional previously unexplored methodological aspect was that SVR models were not only trained on the basis of exclusively active compounds, which represents the conventional way of model generation, but also in the presence of active compounds and ZINC compounds assumed to be inactive (i.e., with assigned zero potency). Comparison of the resulting models made it possible to investigate the influence of inactive training data on SVR-based potency predictions. SVM and SVR models were generated for 10 compound data sets from medicinal chemistry, which contained between 350 and 1879 compounds and covered wide potency ranges (Table 1). The composition of training and test sets including highly potent compounds is reported in Table 2. ChEMBL IDs are used to designate the data sets. Highly potent compounds
Reported are statistics of first-stage virtual screening using MACCS as a molecular representation.
a
regression function might be derived. For SVR modeling, a hyperparameter set {ν, C} was optimized by 3-fold cross validation on the basis of the MAE for predicted and experimental potency values. Compounds rankings were generated based on predicted potency values and three additional measures considering nearest neighbor (NN) information. For each data set, the similarity threshold for identifying NNs was set to the 95 percentile of pairwise Tanimoto fingerprint similarity of training compounds. Three NN-based scores were calculated for test compounds. If a compound did not have at least three active training NNs it was Table 4. First-Stage Compound Classification Using ECFP4a data set
FN
FP
TN
TP
TP rate
MCC
AUC-PRC
AUC-ROC
237 244 245 4860 1862 1983 2954 3798 220 249
27 8 22 4 5 11 4 1 24 12
2507 399 3039 399 233 782 130 19 1003 193
467493 401601 159961 157101 138267 118718 108870 103981 93997 87307
913 796 304 311 272 228 214 207 166 163
0.97 0.99 0.93 0.99 0.98 0.95 0.98 1.00 0.87 0.93
0.51 0.81 0.29 0.66 0.73 0.46 0.78 0.95 0.35 0.65
0.93 0.99 0.69 0.98 0.98 0.89 0.98 1.00 0.80 0.93
1.00 1.00 0.99 1.00 1.00 1.00 1.00 1.00 0.98 0.99
Reported are statistics of first-stage virtual screening using ECFP4 as a molecular representation.
a
C
DOI: 10.1021/acs.jcim.8b00584 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Article
Journal of Chemical Information and Modeling Table 5. SVR Predictions Using MACCSa R2
mean absolute error data set 237 244 245 4860 1862 1983 2954 3798 220 249
passed 1st stage 0.66 0.64 0.75 0.61 0.61 0.61 0.62 0.73 0.86 0.77
First-Stage Classification. Results for first-stage classification of active and inactive compounds using SVM models are reported in Table 3 for MACCS and Table 4 for ECFP4. Prediction accuracy was generally high with TP rates ranging from 0.87 to 1.00 for both molecular representations. Thus, most active compounds were correctly classified and only few highly potent compounds were eliminated. For example, when using MACCS, only one highly potent compound was discarded for three target sets (244, 245, and 249) and when using ECFP4, only one highly potent compound was eliminated for four sets (237, 245, 2954, and 220). However, varying numbers of FP ZINC compounds per set also passed first-stage classification, ranging from 213 to 1786 for MACCS and from 19 to 3039 for ECFP4. Numbers of FP were frequently comparable to numbers of TP or only higher by a factor of 2−3, which corresponded to a strong enrichment of active compounds in first-stage selection sets. In addition, AUC-PRC and AUC-ROC values were only slightly higher for ECFP4 compared to MACCS. Although SVM classification was prone to FPs, in both cases, first-stage classification led to a strong enrichment of active versus inactive compounds in second-stage test sets. Comparison of Second-Stage and Direct Potency Predictions. Next, second-stage potency predictions using SVR models were compared to direct potency predictions. In both instances, the potency values of available active compounds were predicted. The mean absolute error and independently calculated R2 values for test set predictions are reported in Tables 5 (MACCS) and 6 (ECFP4). SVR models achieved slightly better performance when applying the models only to active compounds that passed first-stage classification. However, performance differences between two-stage and direct predictions were mostly marginal, due to small numbers of FN compounds in first-stage classification. Thus, as one would expect, the presence of a few active compounds that were incorrectly predicted and removed in first-stage classification did not significantly alter the performance of SVR models. In Tables 7 (MACCS) and 8 (ECFP4), virtual screening results using alternative prediction strategies are reported. For two stage-predictions, the range score accounts for the maximum and minimum potency of nearest neighbors of a test compound, mean selects the mean potency value of nearest neighbors, and SARI calculates mean SAR discontinuity score for nearest neighbors. Two-stage SVR achieved higher recall of highly
passed 1st stage 0.60 0.77 0.59 0.83 0.70 0.60 0.61 0.60 0.63 0.58
0.65 0.64 0.71 0.60 0.59 0.60 0.63 0.73 0.82 0.70
0.61 0.77 0.62 0.83 0.71 0.61 0.61 0.60 0.68 0.65
a
SVR predictions are summarized for each data set including all test compounds and the subset of test compounds predicted to be active in the first stage (passed 1st stage). Best predictions are reported in bold.
Table 6. SVR Predictions Using ECFP4a R2
mean absolute error data set 237 244 245 4860 1862 1983 2954 3798 220 249
passed 1st stage 0.55 0.55 0.71 0.53 0.54 0.51 0.51 0.64 0.76 0.63
passed 1st stage 0.71 0.83 0.63 0.87 0.75 0.73 0.74 0.69 0.70 0.66
0.54 0.55 0.69 0.51 0.53 0.50 0.50 0.64 0.68 0.57
0.72 0.83 0.65 0.88 0.74 0.74 0.75 0.69 0.78 0.70
a
SVR predictions are summarized for each data set including all test compounds and the subset of test compounds predicted to be active in the first stage (passed 1st stage). Best predictions are reported in bold.
represented the top 10% most potent compounds per test set. For two different molecular representations including the MACCS and ECFP4 fingerprints, different numbers of inactive test compounds were used. For ECFP4, a larger number of ZINC compounds was used for comparing two-stage SVR and direct SVR, given the large size and high resolution of this fingerprint. Table 7. Virtual Screening Using MACCSa recall of highly potent CPDs two-stage
no. ZINC CPDs predicted to be highly potent direct
two-stage
direct
data set
range
mean
SARI
SVR
SVR
SVR_ZINC
range
mean
SARI
SVR
SVR
SVR_ZINC
237 244 245 4860 1862 1983 2954 3798 220 249
0.27 0.03 0.28 0.19 0.19 0.04 0.10 0.15 0.26 0.18
0.10 0.24 0.47 0.19 0.26 0.17 0.33 0.65 0.26 0.29
0.02 0.01 0.00 0.13 0.26 0.00 0.10 0.15 0.00 0.00
0.56 0.53 0.50 0.32 0.30 0.39 0.43 0.25 0.53 0.53
0.52 0.25 0.50 0.32 0.30 0.39 0.14 0.00 0.26 0.35
0.60 0.48 0.53 0.29 0.26 0.17 0.62 0.50 0.74 0.35
14 33 0 0 1 9 4 0 6 0
58 40 0 0 4 18 6 0 14 5
77 70 1 0 0 20 11 1 10 7
5 16 0 0 0 5 9 10 6 0
23 52 0 0 0 5 18 20 14 3
0 0 0 0 0 0 0 0 0 0
Reported are the results of two-stage virtual screening and direct SVR predictions (without first-stage classification). For each data set, highly potent compounds represented the top 10% most potent compounds. Range score (range) covers the maximum and minimum potency of nearest neighbors of a test compound; Mean uses the mean potency value of nearest neighbors; SARI refers to the mean SARI discontinuity score for nearest neighbors. Best predictions are reported in bold. a
D
DOI: 10.1021/acs.jcim.8b00584 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Article
Journal of Chemical Information and Modeling Table 8. Virtual Screening Using ECFP4a recall of highly potent CPDs two-stage
no. ZINC CPDs predicted to be highly potent direct
two-stage
direct
data set
range
mean
SARI
SVR
SVR
SVR_ ZINC
range
mean
SARI
SVR
SVR
SVR_ZINC
237 244 245 4860 1862 1983 2954 3798 220 249
0.20 0.10 0.53 0.06 0.15 0.04 0.38 0.15 0.26 0.35
0.43 0.29 0.50 0.19 0.22 0.57 0.38 0.55 0.37 0.29
0.02 0.01 0.00 0.10 0.07 0.00 0.14 0.00 0.11 0.24
0.67 0.60 0.44 0.29 0.41 0.57 0.62 0.50 0.79 0.47
0.67 0.60 0.44 0.29 0.41 0.57 0.62 0.50 0.68 0.47
0.60 0.59 0.47 0.32 0.41 0.30 0.62 0.55 0.79 0.47
3 18 0 0 0 0 4 0 0 0
15 24 0 0 0 1 3 0 10 1
51 13 1 0 0 1 4 0 0 0
1 0 1 0 0 0 1 0 4 0
1 0 1 0 0 0 2 0 6 0
2 0 0 0 0 0 0 0 0 0
a Reported are the results of two-stage virtual screening and direct SVR predictions (without first-stage classification). For each data set, highly potent compounds represented top 10% most potent compounds. Range score (Range) covers the maximum and minimum potency of nearest neighbors of a test compound: Mean uses the mean potency value of nearest neighbors; SARI refers to the mean SARI discontinuity score for nearest neighbors. Best predictions are reported in bold.
Figure 2. Effect of inactive training compounds on potency predictions. Distributions of potency values predicted by alternative SVR models are reported in box plots for active and inactive test compounds. (A) Molecular representation: ECFP4; only active training compounds were used. (B) ECFP4; active and inactive training compounds were used. For each data set, different subsets of test compounds were used including active compounds predicted to be active in first-stage SVM classification (green), active compounds predicted to be inactive, ZINC compounds predicted to be active (blue), and ZINC compounds predicted to be inactive (pink).
Information (MACCS). These figures show the distribution of potency values predicted by SVR models for different categories of test compounds. There were no significant differences in potency distributions between TNs and FPs. Hence, first-stage SVM classification removed potential FPs prior to SVR potency prediction. However, calculations based on ECFP4 also accurately predicted the potency of active compounds directly. Among alternative scoring strategies reported in Tables 7 and 8,
potent compounds than direct SVR for six target sets using MACCS but only one set using ECFP4. Furthermore, two-stage SVR predicted smaller numbers of ZINC compounds to be highly potent than direct SVR but for ECFP4, the differences were again marginal, due to the overall small number of highly ranked ZINC compounds. Observed differences between two-stage and direct SVR prediction models can be rationalized on the basis of Figure 2A (ECFP4) and Figure S1A of the Supporting E
DOI: 10.1021/acs.jcim.8b00584 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Article
Journal of Chemical Information and Modeling
Figure 3. Mean absolute error for active test compounds using different models. Reported is the mean absolute error for active text compounds using SVR models trained in the presence of increasing numbers of ZINC compounds. The ratio of inactive to active training compounds ranged from 0.0 (no ZINC compounds) to 2.0. For each ratio (except 0.0 and 2.0), five different models were built using randomly selected subsets of ZINC compounds. ECFP4 was used as a molecular representation.
Figure 4. Recall of test compounds using different models. Reported is the recall of highly potent compounds according to Table 2 using SVR_ZINC models trained in presence of increasing numbers of ZINC compounds (top). The ratio of inactive to active training compounds ranged from 0.0 (no ZINC compounds were used) to 2.0. For each ratio (except 0.0 and 2.0), five different models were built using randomly selected subsets of ZINC compounds. The representation is according to Figure 3. In addition, the recall of false positive ZINC compounds is reported (bottom). ECFP4 was used as a molecular representation.
SVR achieved the overall best recall performance of highly potent compounds and minimized the number of ZINC compounds incorrectly predicted to be highly potent. SVR Modeling in the Presence of Inactive Training Compounds. SVR models were also derived in the presence of
twice the number of ZINC compounds compared to active training compounds. ZINC compounds were assigned a potency value of zero. Results of potency predictions using these models are shown in Figures 2B (ECFP4) and S1B of the Supporting Information (MACCS), which can be directly compared to F
DOI: 10.1021/acs.jcim.8b00584 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Article
Journal of Chemical Information and Modeling
Table S1 of the Supporting Information reports chemical information for all set of active compounds including the compound-to-scaffold ratio as a measure of intraset diversity. The results show that active compounds in each data set were structurally diverse. For 9 of 10 data sets, the ratio was approximately two compounds per scaffold and in one set it was approximately three compounds per scaffold. The structural diversity of ZINC compounds was slightly higher than of active compounds, with on average 1.5 compounds per scaffold. Given the comparable degree of diversity among active compounds, differences in model performance were not attributable to differences in structural diversity. However, results of virtual screening and activity predictions typically display notable compound class dependence, as also observed herein. This is assumed to be a consequence of different structural characteristics that are accounted for to varying extents by alternative molecular representations. Difficult compound classes usually yield low virtual screening performance, regardless of the methods that are used. Alternative Potency Prediction Strategies. The results in Tables 7 (MACCS) and 8 (ECFP4) indicated that two-stage SVR achieved higher recall of highly potent compounds than direct SVR and lower recall of false positive ZINC compounds, although by small margins. We further explored these trends by lowering the threshold value for highly potent compounds, setting it to the median potency of all active compounds. Compounds exceeding this threshold were classified as potent compounds. Table 9 reports the comparison of two-stage SVR, direct SVR, and SVR_ZINC when applying this softer threshold. As a molecular representation, ECFP4 was used, for which differences between alternative screening strategies were smaller than for MACCS. Under these conditions, differences between alternative predictions strategies became much more pronounced, as revealed in Table 9. Two-stage-SVR generally achieved higher recall of potent compounds than direct SVR. Moreover, SVR_ZINC models achieved higher recall of potent compounds than two-stage SVR (with one exception) and consistently lower recall of FP ZINC compounds than the alternative prediction strategies. However, these models also displayed the overall highest mean absolute error in predicted potency values.
Figures 2A and S1A of the Supporting Information, respectively (reporting corresponding results for SVR models exclusively based on active compounds). SVR training in the presence of inactive training compounds resulted in the prediction of consistently lower potency values, with the exception of active compounds that were correctly classified by first-stage SVM. Interestingly, for these compounds, potency predictions using models trained in the presence and absence of inactive training instances were stable. Increasing the proportion of inactive versus active training compounds resulted in decreasing prediction accuracy of SVR models, as shown in Figures 3 (ECFP4) and S2 of the Supporting Information (MACCS). Furthermore, the recall of highly potent compounds using these models and the recall of FP ZINC compounds among are reported in Figures 4 (ECFP4)and S3 of the Supporting Information (MACCS). The recall of highly potent compounds was overall less affected by increasing proportions of inactive training compounds than prediction accuracy of the corresponding SVR models. For several data sets including 2954, 3798, and 220 a significant increase in the recall of highly potent compounds using MACCS was observed, which was a direct consequence of reducing the number of FP ZINC compounds that were detected. Although SVR_ZINC models frequently achieved further increased recall of highly potent compounds compared to models trained exclusively on active compounds, the potency of test compounds was often underpredicted. Examples of such compounds are shown in Figure 5. For the corresponding data sets,
■
CONCLUDING REMARKS SVR has become a popular approach for QSAR, given its ability to account for a variety of SARs. Herein, we have compared standard SVR, termed direct SVR, with two previously unconsidered strategies to predict compound potency using the SVR formalism, including combined SVM-SVR predictions, termed two-stage SVR, and SVR modeling on the basis of both active and inactive (zero potency) training compounds. We reasoned that combined SVM-SVR modeling might bear an advantage over direct SVR, if it were possible to successfully distinguish between active and inactive compounds during stage one (SVM classification) and thus focus stage two SVR potency predictions predominantly on active compounds. Moreover, we derived, to our knowledge for the first time, SVR models trained on active compounds as well as inactive compounds. Side-by-side comparison of these alternative potency prediction strategies revealed a number of interesting findings. SVM classification was indeed highly accurate in distinguishing between active and inactive compound on a variety of data sets. However, subsequent second-stage SVR potency prediction only yielded a marginaland molecular representation dependentadvantage over direct SVR when a stringent threshold for highly potent
Figure 5. Compounds with underpredicted potency. For three data sets (244, 245, and 3798), exemplary compounds are shown for which the too low potency was predicted SVR_ZINC. For each compound, the data set, measured potency, and potency predicted by SVR and SVR_ZINC are reported.
correlation plots are shown in Figure 6, which identify these exemplary compounds. G
DOI: 10.1021/acs.jcim.8b00584 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Article
Journal of Chemical Information and Modeling
Figure 6. Correlation plots for exemplary data sets. Shown are correlation plots for SVR and SVR_ZINC calculations on three data sets (244, 245, and 3798) using ECFP4 as a molecular representation. Blue circles represent training data, red crosses, test data, and black squares, test data predicted to be inactive by first-stage SVM models.
Table 9. Relative Performance of Two-Stage and Direct Compound Potency Predictionsa no. ZINC CPDs predicted to be potent
recall of potent CPDs two-stage
direct
two-stage
direct
mean absolute error two-stage
direct
data set
no. potent CPDs
median potency (pKi)
SVR
SVR
SVR_ ZINC
SVR
SVR
SVR_ ZINC
SVR
SVR
SVR_ ZINC
237 244 245 4860 1862 1983 2954 3798 220 249
470 402 163 157 138 119 109 104 95 87
7.28 7.70 7.64 8.46 8.00 8.30 6.74 7.69 6.93 8.60
0.77 0.87 0.69 0.90 0.82 0.84 0.71 0.78 0.44 0.90
0.67 0.66 0.65 0.90 0.82 0.84 0.33 0.37 0.22 0.89
0.80 0.88 0.74 0.91 0.83 0.84 0.86 0.78 0.75 0.84
56 28 32 0 0 7 30 8 51 1
129 131 43 0 0 8 73 64 74 3
4 1 11 0 0 0 0 0 5 0
0.54 0.55 0.69 0.51 0.53 0.50 0.50 0.64 0.68 0.57
0.55 0.55 0.71 0.53 0.54 0.51 0.51 0.64 0.76 0.63
0.79 0.73 1.18 0.63 0.68 1.03 0.63 0.72 1.29 0.96
a Reported are the results of two-stage and direct SVR predictions (without first-stage classification) using ECFP4. For each data set, potent test compounds represented the 50% most potent compounds. The same number of the test compounds was selected based on potency predictions. Mean absolute errors for two-stage SVR result from predictions on the subset of test compounds predicted to be active in the first stage. Best results are reported in bold.
compounds was applied, perhaps less so than anticipated. Furthermore, SVR training including inactive training examples
yielded predictive models that met or exceeded the recall performance of the alternative modeling strategies. In potency H
DOI: 10.1021/acs.jcim.8b00584 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Article
Journal of Chemical Information and Modeling
(4) Vogt, M.; Bajorath, J. Chemoinformatics: A View of the Field and Current Trends in Method Development. Bioorg. Med. Chem. 2012, 20, 5317−5323. (5) Jorissen, R. N.; Gilson, M. K. Virtual Screening of Molecular Databases Using a Support Vector Machine. J. Chem. Inf. Model. 2005, 45, 549−561. (6) Geppert, H.; Horváth, T.; Gärtner, T.; Wrobel, S.; Bajorath, J. Support-Vector-Machine-Based Ranking Significantly Improves the Effectiveness of Similarity Searching Using 2D Fingerprints and Multiple Reference Compounds. J. Chem. Inf. Model. 2008, 48, 742− 746. (7) Liew, C. Y.; Ma, X. H.; Liu, X.; Yap, C. W. SVM Model for Virtual Screening of Lck Inhibitors. J. Chem. Inf. Model. 2009, 49, 877−885. (8) Smola, A. J.; Schölkopf, B. A Tutorial on Support Vector Regression. Stat. Comput. 2004, 14, 199−222. (9) Demiriz, A.; Bennett, K. P.; Breneman, C. M.; Embrechts, M. J. Support Vector Machine Regression in Chemometrics. Proceedings of the 33rd Symposium on the Interface of Computing Science and Statistics, Costa Mesa, CA, 2001. (10) Fechner, N.; Jahn, A.; Hinselmann, G.; Zell, A. Estimation of the Applicability Domain of Kernel-Based Machine Learning Models for Virtual Screening. J. Cheminf. 2010, 2, No. 2. (11) Li, L.; Wang, B.; Meroueh, S. O. Support Vector Regression Scoring of Receptor−Ligand Complexes for Rank-Ordering and Virtual Screening of Chemical Libraries. J. Chem. Inf. Model. 2011, 51, 2132− 2138. (12) Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958. (13) Schierz, A. C. Virtual Screening of Bioassay Data. J. Cheminf. 2009, 1, No. 21. (14) Babajide Mustapha, I.; Saeed, F. Bioactive Molecule Prediction Using Extreme Gradient Boosting. Molecules 2016, 21, No. 983. (15) Wu, Z.; Ramsundar, B.; Feinberg, E. N.; Gomes, J.; Geniesse, C.; Pappu, A. S.; Leswing, K.; Pande, V. MoleculeNet: A Benchmark for Molecular Machine Learning. Chem. Sci. 2018, 9, 513−530. (16) Abdo, A.; Salim, N. Similarity-Based Virtual Screening with a Bayesian Inference Network. ChemMedChem 2009, 4, 210−218. (17) Abdo, A.; Chen, B.; Mueller, C.; Salim, N.; Willett, P. LigandBased Virtual Screening Using Bayesian Networks. J. Chem. Inf. Model. 2010, 50, 1012−1020. (18) Afantitis, A.; Melagraki, G.; Koutentis, P. A.; Sarimveis, H.; Kollias, G. Ligand - Based Virtual Screening Procedure for the Prediction and the Identification of Novel β-Amyloid Aggregation Inhibitors Using Kohonen Maps and Counterpropagation Artificial Neural Networks. Eur. J. Med. Chem. 2011, 46, 497−508. (19) Selzer, P.; Ertl, P. Applications of Self-Organizing Neural Networks in Virtual Screening and Diversity Selection. J. Chem. Inf. Model. 2006, 46, 2319−2323. (20) Gawehn, E.; Hiss, J. A.; Schneider, G. Deep Learning in Drug Discovery. Mol. Inf. 2016, 35, 3−14. (21) Kearnes, S.; McCloskey, K.; Berndl, M.; Pande, V.; Riley, P. Molecular Graph Convolutions: Moving beyond Fingerprints. J. Comput.-Aided Mol. Des. 2016, 30, 595−608. (22) Lenselink, E. B.; ten Dijke, N.; Bongers, B.; Papadatos, G.; van Vlijmen, H. W. T.; Kowalczyk, W.; IJzerman, A. P.; van Westen, G. J. P. Beyond the Hype: Deep Neural Networks Outperform Established Methods Using a ChEMBL Bioactivity Benchmark Set. J. Cheminf. 2017, 9, No. 45. (23) Hasegawa, K.; Funatsu, K. Non-Linear Modeling and Chemical Interpretation with Aid of Support Vector Machine and Regression. Curr. Comput.-Aided Drug Des. 2010, 6, 24−36. (24) Hu, H.; Stumpfe, D.; Bajorath, J. Rationalizing the Formation of Activity Cliffs in Different Compound Data Sets. ACS Omega 2018, 3, 7736−7744. (25) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.;
prediction, SVR was clearly superior to alternative measures including, among others, SAR discontinuity scoring. Moreover, when threshold levels for highly potent compounds were lowered, prediction trends became much more apparent, clearly distinguishing between different prediction strategies and performance criteria. Specifically, under these conditions, recall of most potent compounds increased in the order of SVR, twostage SVR, and SVR_ZINC; an interesting finding. Thus, the latter newly designed SVR models were most effective in identifying the most potent compounds in different data sets. By contrast, these models displayed the larges absolute error in potency prediction and often underpredicted potency values of the most potent compounds. Hence, taken together, our findings reveal a trade-off, which has implication for practical virtual screening: If the primary goal of a screening campaign is maximizing the identification of potent compounds (measured by recall performance in benchmark settings), SVR models trained on known active compounds and inactive compounds should be used. However, the proportion of active and inactive training instances should be balanced, as also indicated by our results. By contrast, for accurate potency prediction of test compounds, direct SVR is a method of choice. In addition, to achieve the best compromise between the identification of potent compounds and accurate potency prediction, two-stage SVR should be used. All data sets used in this study have been made freely available in an open access deposition on the ZENODO platform.36
■
ASSOCIATED CONTENT
* Supporting Information S
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.8b00584. Supplementary Table S1 provides chemical information for compound data sets. Supplementary Figures S1−S3 report prediction results for MACCS (PDF)
■
AUTHOR INFORMATION
Corresponding Author
*Tel.: +49-228-2699-306. Fax: +49-228-2699-341. E-mail:
[email protected]. ORCID
Kimito Funatsu: 0000-0002-9368-0302 Jürgen Bajorath: 0000-0002-0557-5714 Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS We are grateful to Dagmar Stumpfe at the University of Bonn for providing active compound data sets. We also thank OpenEye Scientific Software, Inc., for providing a free academic license of the OpenEye toolkit.
■
REFERENCES
(1) Geppert, H.; Vogt, M.; Bajorath, J. Current Trends in LigandBased Virtual Screening: Molecular Representations, Data Mining Methods, New Application Areas, and Performance Evaluation. J. Chem. Inf. Model. 2010, 50, 205−218. (2) Willett, P. Similarity-Based Virtual Screening Using 2D Fingerprints. Drug Discovery Today 2006, 11, 1046−1053. (3) Tropsha, A.; Golbraikh, A. Predictive QSAR Modeling Workflow, Model Applicability Domains, and Virtual Screening. Curr. Pharm. Des. 2007, 13, 3494−3504. I
DOI: 10.1021/acs.jcim.8b00584 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Article
Journal of Chemical Information and Modeling Overington, J. P. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40, D1100−D1107. (26) Sterling, T.; Irwin, J. J. ZINC 15 − Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55, 2324−2337. (27) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754. (28) MACCS Structural Keys; Accelrys: San Diego, CA, 2011. (29) OEChem. TK, version 2.1.4; OpenEye Scientific Software, Santa Fe, NM, 2018. (30) Vapnik, V. N. The Nature of Statistical Learning Theory, 2nd ed.; Springer: New York, 2000. (31) Boser, B. E.; Guyon, I. M.; Vapnik, V. N. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the fifth annual workshop on Computational learning theory−COLT ’92; ACM Press: New York, 1992; pp 144−152. (32) Ralaivola, L.; Swamidass, S. J.; Saigo, H.; Baldi, P. Graph Kernels for Chemical Informatics. Neural Networks 2005, 18, 1093−1110. (33) Peltason, L.; Bajorath, J. SAR Index: Quantifying the Nature of Structure−Activity Relationships. J. Med. Chem. 2007, 50, 5571−5578. (34) de la Vega de León, A.; Bajorath, J. Formation of Activity Cliffs Is Accompanied by Systematic Increases in Ligand Efficiency from Lowly to Highly Potent Compounds. AAPS J. 2014, 16, 335−341. (35) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, É . Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825−2830. (36) Miyao, T.; Funatsu, K.; Bajorath, J. Compound data sets for support vector machine and regression modeling. http://dx.doi. org/10.5281/zenodo.1453935 (accessed October 21, 2018).
J
DOI: 10.1021/acs.jcim.8b00584 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX