Profile-QSAR 2.0: Kinase Virtual Screening Accuracy Comparable to

Huang , Y.; Huang , Y.; Moodie , Z.; Li , S.; Self , S. Comparing and combining .... Albert J. Kooistra , Márton Vass , Ross McGuire , Rob Leurs , Iw...
0 downloads 0 Views 4MB Size
Article pubs.acs.org/jcim

Profile-QSAR 2.0: Kinase Virtual Screening Accuracy Comparable to Four-Concentration IC50s for Realistically Novel Compounds Eric J. Martin,* Valery R. Polyakov, Li Tian,† and Rolando C. Perez‡ Novartis Institutes for Biomedical Research, 5300 Chiron Way, Emeryville, California 94608-2916, United States S Supporting Information *

ABSTRACT: While conventional random forest regression (RFR) virtual screening models appear to have excellent accuracy on random held-out test sets, they prove lacking in actual practice. Analysis of 18 historical virtual screens showed that random test sets are far more similar to their training sets than are the compounds project teams actually order. A new, cluster-based “realistic” training/test set split, which mirrors the chemical novelty of real-life virtual screens, recapitulates the poor predictive power of RFR models in real projects. The original Profile-QSAR (pQSAR) method greatly broadened the domain of applicability over conventional models by using as independent variables a profile of activity predictions from all historical assays in a large protein family. However, the accuracy still fell short of experiment on realistic test sets. The improved “pQSAR 2.0” method ̈ Bayes categorical models at several thresholds with predicted IC50s from RFR models. replaces probabilities of activity from naive Unexpectedly, the high accuracy also requires removing the RFR model for the actual assay of interest from the independent variable profile. With these improvements, pQSAR 2.0 activity predictions are now statistically comparable to mediumthroughput four-concentration IC50 measurements even on the realistic test set. Beyond the yes/no activity predictions from a typical high-throughput screen (HTS) or conventional virtual screen, these semiquantitative IC50 predictions allow for predicted potency, ligand efficiency, lipophilic efficiency, and selectivity against antitargets, greatly facilitating hitlist triaging and enabling virtual screening panels such as toxicity panels and overall promiscuity predictions.



INTRODUCTION The majority of chemical starting points for drug discovery come from the experimental high-throughput screening of millions of compounds against each new biological target. It is an amazing and essential technology. However, it can take 6−9 months and the fully loaded cost can reach a million dollars. Hence, it has long been an ambition of computational chemists to replace them with fast, inexpensive “virtual-screens” on computers. Conventional random forest regression (RFR) virtual screening models appear to have excellent accuracy on random held-out test sets. Unfortunately, the hit rates in actual practice are much lower than the random test set results would suggest. Several groups have developed kinomewide virtual screening models.1−5 First generation Profile-QSAR (pQSAR) 1.06 was developed to improve virtual screening for both biochemical and cellular assays for targets in large protein families: initially kinases, but subsequently expanded to proteases, G-protein coupled receptors, Cytochrome P450s, and many nonkinase adenosine-binding proteins. pQSAR was one of a suite of methods developed for “Protein Family Virtual Screening”, which leverages similarity among targets in large protein families.6−13 pQSAR extends the effective training set by modeling each new kinase as a “hybrid,” specifically a linear combination, of activities against previous kinase assays.6 The © 2017 American Chemical Society

experimental IC50s cannot be used directly as independent variables, because the data table of IC50s was 99% missing values. In addition, for predictions on new or virtual compounds there are not yet any experimental activities. Instead, an intermediate step is added as shown in Figure 1a. ̈ Bayes QSAR models were first trained on Pipeline Pilot14 naive each historical assay (column) using the available experimental data. These were then used to compute a full profile of predicted activities for every compound on all of the historical ̈ Bayes models are yes/no categorical assays. Since these naive models, the activity profiles were represented as probabilities of activity at up to nine concentration thresholds, depending on the range of activities. A partial least-squares (PLS) regression model was then trained on at least 200 IC50s from the new kinase assay, using the predicted activity profiles from the historical assays as the independent compound variables. The result was much greater accuracy and a broader domain of applicability than the individual QSARs on which it was based. Selectivity is essential in drug discovery, and selectivity predictions were also excellent, even better than expected by propagation of random errors from the original models, indicating the cancelation of some systematic errors among Received: March 22, 2017 Published: June 26, 2017 2077

DOI: 10.1021/acs.jcim.7b00166 J. Chem. Inf. Model. 2017, 57, 2077−2088

Article

Journal of Chemical Information and Modeling

Figure 1. Progression of pQSAR generations: (a) pQSAR 1.0, (b) pQSAR 1.5, (c) pQSAR 2.0. C1, C2, ..., C105 are compunds; A1, A2, ..., A103 are historical kinase assays; and Anew is a new assay to be modeled.

the assay was probably used for research and is therefore of reasonable quality. The assays were not otherwise curated. In addition, two public collections of kinase assay data were tested: a set of 171 kinase assay from PubChem,15 downloaded May 2015, with 165 biochemical and 6 cellular assays, and a set of 159 kinase assays from ChEMBL,16 downloaded on July 6 2015, composed of 4 biochemical and 155 functional assays. The criteria for assay selection were the same as above. The specific assay identifiers are listed in Supplemental Table S1. Note that with so many assays, often cryptic annotations, and largely automated processing, some mistakes may have been made, but we doubt they affect the generally conspicuous results that follow. Many of the experimental IC50s were reported as above or below the highest or lowest concentration tested. In order to include chemical information from these very highly or poorly active compounds, they were offset by 10-fold and treated as ̈ Bayes and RFR additional quantitative values in the naive models. Random Training/Test Set Splits. Compounds were split into random test (25%) and training sets (75%) using the “Partitioning” KNIME node by selecting “Relative[%]” set at 25%, and the “Draw randomly” options in the configuration menu. “Realistic” Novelty Training/Test Set Splits. To make the “realistic” training/test set splits, compounds from each assay were clustered with MolSoft ICM-pro 3.8-5, with an average cluster size of 10, via a KNIME node specifically built for this purpose. Tests showed that Rext2 was insensitive to cluster size. The clusters were ordered in decreasing size and training set compounds were selected starting from the largest cluster until at least 75% were accumulated. The remaining roughly 25% of singletons and small clusters became the heldout test set. For some of the smallest data sets the cluster size had to be reduced to get ∼25% in small clusters. ̈ Bayes Categorical Models. Naive ̈ Bayes categorical Naive models were built using default parameters with radius 2 Morgan fingerprints (similar to ECFP4) and five physicochemical properties (ALogP, MW, number of H-bond acceptors,

the models. For additional details, the Supporting Information includes an outline of the pQSAR 2.0 workflow, which differs from pQSAR 1.0 only in the use of random forest regression ̈ Bayes, the prediction of actual IC50s models rather than Naive rather than probabilities of activity at up to nine concentration thresholds, and the exclusion of the assay of interest in the profile (see below). pQSAR 1.0 has now been applied to many dozens of internal projects with good success. Nevertheless, the use of Pipeline ̈ Bayes models has several limitations: Pilot naive ̈ Bayes QSARs assume linearity and independence (1) Naive of the descriptors. A nonlinear method might be more accurate. (2) Predicting probability of activity at several thresholds rather than IC50 increases the number of independent variables on average by 3.5-fold. This in turn requires either correspondingly more training IC50s for the new target or using fewer thresholds for small data sets. (3) Pipeline Pilot is expensive, and the licenses are limited. Open source machine learning software would allow running across large clusters. This article describes a novel cluster-based training/test set splitting algorithm that recapitulates the extreme chemical novelty of compounds actually ordered by real virtual screening projects. It reports on the substantial improvements in speed, accuracy, and domain of applicability on these realistic test sets ̈ Bayes models achieved by replacing the Pipeline Pilot naive with KNIME RFR models, a surprising improvement from deleting the assay of interest from the profile, and other modifications that impacted performance.



METHODS Assays. The primary data set is comprised of 728 internal Novartis (NVS) kinase IC50 assays. These included 583 biochemical and 105 cellular assays downloaded October 2012. All included assays contained at least 200 pIC50s with a standard deviation greater than 0.5. Besides ensuring enough data to train a model, the threshold of 200 IC50s indicates that 2078

DOI: 10.1021/acs.jcim.7b00166 J. Chem. Inf. Model. 2017, 57, 2077−2088

Article

Journal of Chemical Information and Modeling

Figure 2. Correlations between predictions and experiments for sets of conventional individual-assay RFR models from 728 kinase assays (both biochemical and cellular) using random or realistic training/test set splits.

Figure 3. Box plots summarizing novelty histograms. The X-axis is Tanimoto similarity (Tc) of the test set or actually ordered compounds to the nearest members of the model’s training sets. Y-axes are percentages of the test/ordered set compounds: (a) random held-out test sets from 728 kinase assays, (b) compounds ordered by actual projects, and (c) “realistic” training/test set splits from the same 728 kinase assays.

Analytics Platform 3.1.217 using pIC50 as a target column and Morgan fingerprints as the attribute. The rest of the node’s parameters were at their default settings. Here, 1024 bit, Radius 2 Morgan fingerprints were calculated using the RDKit18 Fingerprint KNIME node. The addition of five physicochemical properties had not improved the models and were dropped. One hundred trees were used, as correlations with experiment were comparable to using 500. ̈ Bayes and RFR Models. PLS PLS Regression on Naive regression models were built on the predictions from the ̈ Bayes or RFR QSAR above-mentioned conventional naive models using the PLS package of R version 3.3.119 as previously

number of donors, and number of rotatable bonds) using Pipeline Pilot Version 8.5.14 For each assay, separate models were trained using either 5, 6, or 9 binary activity thresholds depending on the number of compounds tested in that assay: 414 assays with 1142+ compounds used 9 thresholds, 94 assays with 720−1141 compounds used 6, 84 assays with 478−719 compounds used 5, and 136 assays with 200−477 compounds used 3, for a total of 728 assays. A further requirement was to have at least 25 compounds in the “active” category. The overall average was 3.5 thresholds per assay. Random Forest Regression. RFR models were built with the Tree Ensemble Learner (Regression) node of KNIME 2079

DOI: 10.1021/acs.jcim.7b00166 J. Chem. Inf. Model. 2017, 57, 2077−2088

Article

Journal of Chemical Information and Modeling

Figure 4. (a) Clustered data set with a random 75% depicted as gray circles and remaining 25% as black stars. (b) 75% from the largest clusters shown as gray circles with the 25% remaining singletons and small clusters depicted as black stars.

described.6 For pQSAR 1.0 the independent variables were ̈ probabilities of activity at various thresholds from the naive Bayes categorical models. For pQSAR 1.5, the independent variables were predicted IC50s from the RFR models for each of the assays including the RFR model built on the training set of the dependent variable assay. pQSAR 2.0 is identical to pQSAR 1.5, except the RFR model for the dependent variable assay is deleted from the independent variables. Scripts. The profile-QSAR procedure is scripted in a combination of KNIME, C#, and shell scripts. An outline of the procedure is included in the Supporting Information.

Tanimoto similarity (Tc) of the predicted compounds to the nearest members of the model’s training sets. The vertical axes are percentage of the prediction set compounds. The top histogram, Figure 3a, summarizes histograms for the random training/test set splits from the 728 kinase assays. Most random test set compounds are very similar to the training set, with Tc greater than 0.8 on a scale from zero to one. Figure 3b is from the compounds actually ordered by 18 real projects. Almost nothing was ordered from the high similarity bins. Most were from the very dissimilar bins around 0.3. Real project teams are simply not interested in testing more compounds similar to the ones for which they already have data and know the answer. They want to find novel active compounds. Predictions with Realistic Training/Test Set Splits. Accurately assessing the actual expected performance of virtual screening models requires a training/test set split that mirrors the novelty of compounds actually selected by project teams. Figure 4a illustrates the problem with a literature 2D clustered data set20 with a random 25% of the points represented as black stars. As expected, every star is near a gray circlein fact only a few are not actually touching a gray circle. The “realistically novel” training/test set split was devised to maximize the difference between test and training sets. The training set compounds are selected from the largest clusters, with the remaining roughly 25% of singletons and small clusters becoming the held-out test set. Figure 4b depicts this 25% as black stars, showing that most now have no nearby gray circle. Returning to the real data, the box plots in Figure 3c summarize the novelty histograms for the same 728 kinase assays using this new novelty test/training set splitting. The distribution of these test set similarities almost exactly mirrors that of the compounds actually ordered by projects. Performance of models using this realistic test/training set split should accurately predict the performance on projects. Unlike leaveclass-out training/test set splits, the realistic split has the intuitive appeal that the training set is more clustered and less diverse than the test set, as in real virtual screens. Note that our method is specifically designed to simulate a HTS of an existing compound collection, not of compounds yet to be synthesized for which time-split test sets are appropriate. While Sheridan



RESULTS AND DISCUSSION Conventional Random Forest Regression Models. The upper curve in Figure 2 shows the coefficient of determination, R2, between prediction and experiment for conventional, individual-assay RFR models trained and tested using a 75%/ 25% random split for 728 biochemical and cellular kinase assays. They look extremely good. The median correlation is R2 = 0.59 and over 80% of the models exceed R2 = 0.3, our recommended threshold for a useful virtual screen. For comparison, the correlation between 4-concentration highthroughput versions and more careful 8- to 12-concentration lead optimization versions of 5 internal kinase assays ranged from R2 = 0.48 to 0.60, with an average of R2 = 0.54. Thus, the RFR correlations compare favorably with the four-concentration IC50 experiments. Recognizing that a typical HTS only provides percent inhibition at a single concentration and does not even produce an IC50, one wonders why anyone would ever bother to run the expensive, time-consuming experimental screen? Clearly, predictions on random split data are much more accurate than actual virtual screening results and are overly optimiztic. Random Test Sets Vs Actual Virtual Screens. We hypothesized that the culprit was a too-limited domain of applicability. We compared the similarity of the random test sets to their training sets, and the similarity of compounds actually ordered by project teams, to the data used to train the virtual screening models. The box plots in Figure 3 summarize “novelty histograms.” The X-axes of the histograms are 2080

DOI: 10.1021/acs.jcim.7b00166 J. Chem. Inf. Model. 2017, 57, 2077−2088

Article

Journal of Chemical Information and Modeling

Figure 5. Correlations (R2) between prediction and experiment for 4 sets of models from 728 kinase assays: RFR, pQSAR 1.0, pQSAR 1.5, and pQSAR 2.0. Realistic training/test set splits used in all cases.

method. An intermediate step was therefore added. In pQSAR 1.0, naive Bayes QSAR models, based on radius 2 Morgan substructural fingerprints and five physicochemical properties, were first trained on whatever experimental IC50s are available in the individual columns of the profile matrix. These QSARs were then used to populate a new complete table of predicted activities. The PLS model is then trained for a new kinase using the rows of this full predicted profile matrix as the “compound descriptors,” as illustrated in Figure 1a. For predictions, new compounds are first run down the columns through the Bayes models to produce a new activity row in the profile matrix. This activity profile is then run through the PLS model to predict the new compound’s activity on the new kinase. By this two-step procedure, first down the columns and then across the profile row, the activity prediction of each new compound is informed by every one of 5 million IC50s from 728 assays covering a half million compounds. This sharply contrasts with the regular individual-assay models, where the activity prediction is only informed by experimental data measured on the assay for which the model is created. An important detail is that the Pipeline Pilot naive Bayes models predict probability for binary yes/no activity at a given threshold concentration, not pIC50. Bayes models were therefore trained at up to 9 concentration thresholds to capture the relative activity strengths. The number of thresholds actually used for each assay depended on the number of training compounds and distribution of activities. Unfortunately, this meant about 3.5-fold more independent variables than assays, which in turn requires correspondingly more training data for the new assay. Figure 5 shows the correlations R2 between prediction and experiment for RFR and pQSAR 1.0 models trained and tested using realistic training/test set splits on the 728 kinase assays. The pQSAR 1.0 models do indeed extrapolate much better to novel chemistry than the conventional RFR models, explaining about twice the variance on average, and succeeding in about 40% rather than 20% of the cases. We have used pQSAR 1.0 for many years on many dozens of internal projects, with overall average hit rates of over 50%, and hit rates of 30% to 60% even for novel compounds very unlike the training sets.12 But with a

dismisses his neighbor-split method, which has some similarities to our method, as too “pessimistic” for estimating future performance in lead optimization, it might nevertheless prove appropriate for virtual screens of the historical archive.21 The lower curve in Figure 2, presenting the performance of conventional RFR models built using this realistic training/test set split, shows that the models generally fail on these realistically novel test sets. (Note that the 728 assays are sorted by decreasing R2 independently for the two splitting methods, so the order of assays on the X-axis is different for the two curves.) The median correlation between prediction and experiment for the realistic split is only R2 = 0.12, virtually none, and now 80% of the models fail to achieve our threshold for usefulness of R2 = 0.3. When scientists test and publish empirical virtual screening models, we recommend they use this realistic test/training set split. Random test sets are extremely misleading. pQSAR 1.0. Apparently, the problem with the conventional RFR models was the limited domain of applicability. Available training data for any single given assay samples only a small fraction of the chemical diversity in the full compound archive, and project teams are mainly interested in those unsampled portions. For pQSAR 1.0, we hypothesized that we could indirectly extend the training set to cover the entire archive if the models could somehow include appropriately weighted activity information from all historical kinase assays. Due to similarities in kinase ATP binding sites, we imagined that a new kinase might be modeled as a “hybrid,” a linear combination, of the activities from the 728 previous kinase assays. The IC50s for a new kinase assay could be used to train a model by a robust regression method, such as partial least-squares (PLS), using the profile of pIC50s for each compound on the 728 historical kinase assays as the independent variables. Hence the name pQSAR. Experimental IC50s cannot be used directly. While we have an experimental compound by assay “profile matrix” of 5 million pIC50s covering 728 kinase assays measured against a total of a half million compounds, the matrix is 99% missing values. Furthermore, for new compounds we would not have any experimental pIC50s. We want a purely computational 2081

DOI: 10.1021/acs.jcim.7b00166 J. Chem. Inf. Model. 2017, 57, 2077−2088

Article

Journal of Chemical Information and Modeling

Figure 6. (X-axis) Correlation between prediction and experiment for 728 kinase pQSAR 2.0 models using random training/test set splits. (Y-axis) Correlation between prediction and experiment for 728 kinase pQSAR 2.0 models using the realistic training/test set splits.

adjusting the types or lengths of fingerprints or the number of trees, did not further improve performance. pQSAR 2.0. Achieving predictive power comparable to fourpoint IC50s required one further, rather subtle innovation. By default, the profile matrix included a column from the RFR QSAR based on the training data for the new assay of interest. Those data are available in the real use case, and a predicted activity for the assay of interest itself should be the most relevant of all the independent variables (profile) in the final PLS model. A subtle detail, however, is that the RFR model had been trained on the very same IC50s on which the final PLS model is trained. Therefore, this basis set column is contributing fitted pIC50s to the PLS regression, rather than predicted pIC50s. Nonlinear RFR models fit data extremely well. That caused the PLS coefficient for this particular RFR model to be artificially high, and prevented the other RFR models in the basis set from contributing as much as they should. This model dominated the prediction, so the final PLS model was really a slightly perturbed RFR model. Deleting this column (Figure 1c) in pQSAR 2.0, which had seemed intuitively like the most relevant possible descriptor, dramatically improved the results as shown in Figure 5. Now all of the profile models, and thus all 5 million IC50s covering 500 K compounds, could contribute to the predictions, greatly expanding the domain of applicability. The median correlation with experiment is now R2 = 0.55, a level only reached by 4 of the 728 conventional RF models. The virtual screen with pQSAR 2.0 is now statically comparable to the four-point IC50 experiments even for realistically novel compounds. Note that pQSAR is agnostic with respect to the single-assay models in the profile. We chose RFR models on Morgan fingerprints because they are reasonably fast, widely familiar and available, and considered among the better methods. Other chemical descriptors or machine learning methods such as a support vector machine with a Tanimoto kernel could be used. More interestingly, predicted IC50 profiles from “orthogonal”

median correlation of 0.24 on realistic test sets, they still are far from on par with experimental four-point IC50s on the realistic training/test set splits. ̈ Bayes pQSAR 1.5. In pQSAR 1.5 the Pipeline Pilot naive QSARs were replaced with RFR QSARs as shown in Figure 1b. This was hoped to improve the original implementation in four ways: a smaller descriptor set, lower cost, faster calculation, and greater accuracy. First, RFR models directly predict the IC50, not the probability of activity at up to nine thresholds. This reduced the number of independent variables in the predicted activity profile 3.5-fold, reducing the required number of training compounds for the final PLS model of the new assay. Second, Pipeline Pilot is expensive, but it seems to have some “secret sauce” that works particularly well. We tried several ̈ Bayes modeling packages, but the results open source naive were never as good. We hoped that open source RFR models would perform well while saving money. Third, because of the expense, we had limited Pipeline Pilot licenses. We could not run it across the entire cluster. Predicting the initial profiles of 5 million NVS compounds across 728 kinase assays was the slow step in building the pQSAR platform. Calculating this matrix of activity probabilities took months, so the models could not be retrained and updated frequently. Although each individual RFR model takes longer to compute than each naive Bayes model, the open source software could be distributed across the cluster, dramatically speeding up the overall calculation. Finally, we hoped the nonlinear RFR models would give more accurate ̈ Bayes models, activity profiles than one-pass, linear naive producing a PLS model with better final predictions. All four improvements were realized by pQSAR 1.5. Figure 5 shows that the accuracy of the better models substantially improved, and that 45% rather than 40% now achieved the threshold of R2 = 0.3. However, the lower half of models were not improved, and the median correlation remained at R2 = 0.24, performance still well below experimental four-point IC50s. Attempts to optimize the RFR model parameters, such as 2082

DOI: 10.1021/acs.jcim.7b00166 J. Chem. Inf. Model. 2017, 57, 2077−2088

Article

Journal of Chemical Information and Modeling

Figure 7. For kinase assays in both ChEMBL and PubChem, correlations between prediction and experiment for four sets of models: pQSAR 2.0 with both random and realistic splits and RFR with both random and realistic splits.

Table 1. Properties of NVS, PubChem, and ChEMBL Data Setsa

a

assay source

assays

cpds (× 103)

IC50s (× 106)

IC50s/cpd

RFR rnda

RFR rlstca

pQSAR rnda

pQSAR rlstca

NVS PubChem ChEMBL

728 171 159

377 233 13

3.2 0.45 0.11

8.4 1.9 8.7

0.59 0.16 0.40

0.12 0.01 0.02

0.73 0.21 0.65

0.55 0.10 0.48

Values are R2 on the random or realistic test sets.

Public Domain Data. Novartis and other large pharmaceutical companies benefit from an enormous wealth of internal historical assay data specifically for their compound collections. Smaller organizations and academics must rely on publically available compounds and data. The pQSAR method was therefore tested using publically available ChEMBL and PubChem kinase IC50 data. The correlations between prediction and experiment for pQSAR 2.0 models trained and tested entirely on ChEMBL kinase assay data are shown in Figure 7a and for PubChem in Figure 7b. The PubChem kinase pQSAR 2.0 was disappointing, with median correlation only R2 = 0.2 on the realistic training/test set splits and only 25% of assays achieving R2 = 0.3, yet it still performed far better than the virtually useless RFR models. The ChEMBL kinase pQSAR 2.0 models did very well, with median R2 = 0.48 and 80% of models reaching R2 = 0.3, albeit not quite on par with R2 = 0.55 for NVS pQSAR 2.0 models. Tables S2 and S3 contain the ChEMBL random and realistic training set compound ID, SMILES, assay IDs, experimental pIC50s, and the pQSAR 2.0 fitted pIC50s. Tables S4 and S5 contain the same fields but with pQSAR 2.0 predicted pIC50s. We invite others to train and test their methods on these same training/test set splits. Table 1 looks at several statistics about the data and models that might account for the poor performance of PubChem kinase pQSAR models compared to ChEMBL and NVS. It is not the number of assays, because the PubChem profile included slightly more assays than ChEMBL. PubChem also had nearly 20 times as many compounds and 4 times as many IC50s as ChEMBL. Furthermore, it is not the fraction of missing values in the experimental profile matrix, because although ChEMBL is only 95% missing values, NVS and PubChem are both 99%. Thus, the basic amount of data favors PubChem, which failed, over ChEMBL, which succeeded. The ChEMBL set contained mostly cellular assays, and PubChem contained

3D methods such as pharmacophore models, 3D similarity methods, or fast docking methods like Surrogate AutoShim8 can be substituted for, or combined with, the profiles from the 2D QSAR models. Such studies are currently underway. Performance on Random Test Does Not Predict Realistic Performance. Figure 6 compares the correlations between experiment and prediction for the 728 kinase assays trained and tested on random vs realistic splits. The points form a triangular scatter. The empty upper triangle confirms that realistic splits almost always perform worse than random splits. The uniform filling of the lower triangle shows that results on random splits cannot even be used as a relative indicator of realistic model performance. That is, compounds with R2 > 0.8 on random splits are spread pretty uniformly between R2 = 0.05 and 0.95 using the realistic splits. pQSAR 2.0 Using RFR Vs PLS in a Second Step. Just as ̈ Bayes QSARs in nonlinear RFR improved over linear naive building the profile matrix, we thought replacing the oldfashioned, parametric, linear PLS regression models in the second step with new-fangled, nonlinear RFR models might improve the final predictions. Figure S1 shows the correlations R2 between prediction and experiment for pQSAR 2.0 models trained and tested using both random and realistic training/test set splits, using either PLS regression or RFR for the second modeling step. Somewhat surprisingly, PLS not only extrapolated better to novel chemistry, with median R2 = 0.55 vs 0.46 for the realistic test set, predictions were also better for the random test set with median R2 = 0.73 vs 0.68. The linear method was more stable, extrapolating better over the large “distances” of the realistic test set, and predictions were also significantly better even for the very modest distances to the random test sets. Linear, parametric methods are not obsolete! 2083

DOI: 10.1021/acs.jcim.7b00166 J. Chem. Inf. Model. 2017, 57, 2077−2088

Article

Journal of Chemical Information and Modeling

Figure 8. Correlations between prediction and experiment for ChEMBL and NVS assays modeled using the NVS or ChEMBL profiles, all using realistic training/test set splits.

Figure 9. Histograms of similarity (Tc) of each compound in the predicted collection to the nearest compound in the profile: (a) ChEMBL to NVS, (b) NVS to ChEMBL, (c) PubChem to NVS, and (d) NVS to PubChem. The blue portion represents compounds in the predicted collection that share a Bemis−Murcko scaffold with a compound in the profile collection. The pink portion represents compounds with scaffolds in the predicted collection not found in the profile.

correlation between prediction and experiment of R2 = 0.16, compared to R2 = 0.40 for ChEMBL and R2 = 0.59 for NVS. This suggests that the PubChem data do not follow the basic similarity principle on which QSAR depends. In addition, each PubChem compound was tested on only 1.9 assays on average. The average for ChEMBL was 8.7 assays, and for NVS, it was

mostly biochemical assays, but the NVS set was also predominantly biochemical. Two differences which might disfavor PubChem are the amount of overlap between assays and the quality of the RFR models with random held-out test sets. Even with the simple random training/test set splits the PubChem RFR models did very poorly, with a median 2084

DOI: 10.1021/acs.jcim.7b00166 J. Chem. Inf. Model. 2017, 57, 2077−2088

Article

Journal of Chemical Information and Modeling

models do not overfit, and comparing fit to test sets for the random splits would support that claim in this case, where median R2 for the random test set is actually slightly higher than the model fit. pQSAR by this criterion would appear overfit, with fitted median R2 = 0.79 vs 0.73 on the random test ̈ set. One might naively assume that pQSAR would therefore not extrapolate as well. However, the difference between fit and test set median R2 for pQSAR for the realistic split is far less than for RFR. pQSAR median R2 degrades steadily but slowly going from fit to random split to realistic split. Median R2 for RFR does not degrade between fitting and the random split but fails dramatically with median R2 = 0.12, virtually no correlation, for the realistic split. This confounds the expected relationship between overfitting (as typically detected by random test sets) and domain of applicability. Predicting Cross-Reactivity between Assays. In kinase drug discovery it is important to identify off-target kinases that will be cross-reactive with the target of interest. We tested the ability of pQSAR to predict such “SAR similarity”. For the realistic test set compounds, the correlation R2 was computed between the experimental pIC50s for shared compounds between each pair of assays. High R2 implies cross-reactive assay pairs. Low R2 implies pairs with low SAR similarity. The same calculation was repeated using the pQSAR predicted pIC50s. The overall ability of pQSAR to estimate degree of cross-reactivity was evaluated by computing the correlation R2 between the experimental and predicted vectors of R2 values for all assay pairs with at least 15 shared compounds. Naturally, the results depend on the quality of the pQSAR models for each pair, so the comparison was performed including only assay pairs where models for both had Rext2 >0.3, >0.35, ..., >0.8. The results, shown in Table 3, demonstrate that pQSAR does well

8.4. The overlap between assays might be important for the algorithm to establish the relationships across the profiles. Predictions between NVS and ChEMBL. It would be nice if pQSAR models trained on NVS internal IC50s also performed very well modeling the ChEMBL IC50 data and vice versa. Unfortunately, Figure 8a shows that despite its enormous training set, the ChEMBL assays are poorly predicted by pQSAR models built with the NVS profiles. While they did still work better than conventional RFR models, very few reach the R2 = 0.3 threshold. This shows that even pQSAR has a limited domain of applicability. For Novartis kinase assays, it appears to be the entire Novartis collection, but not all the compounds in ChEMBL. For a typical virtual screen of the Novartis collection, where compounds are simply plated from the archive, pQSAR models work as well as experiment. For other collections, such as vendor catalogs, Figure 8a suggests that while they might still work better than conventional RFR, most might not reach our suggested threshold of R2 = 0.3. Figure 8b shows that pQSAR models built with the ChEMBL profile do not accurately predict NVS assays. Furthermore, they do not work even as well as conventional RFR. This reversal might indicate that the much larger Novartis archive samples the ChEMBL compounds better than the small collection of ChEMBL compounds samples the Novartis archive. This suggestion is supported by Figure 9a vs b, where the similarity histograms from ChEMBL to NVS are much fatter on the right and the proportion of shared scaffolds is greater. The PubChem case shown in Figure 9c and d is similar but less dramatic. However, note that the RFR models themselves for NVS assays already work much better than RFR models for ChEMBL, making it a tougher standard to beat. When making predictions outside the collection of compounds on which the profile was built, one should compare the statistics for a pQSAR to the corresponding RFR model, and use the one that gives better predictions. Combined NVS and ChEMBL Profiles. Figure S2 shows that the quality of predictions using the combined NVS and ChEMBL profiles are virtually identical to those from the individual NVS and ChEMBL pQSARs. This suggests that adding IC50 data from other group’s compound collections will not improve predictions on your own collection. However, the combined profile produces a single model that works well on both and would presumably work better on collections that have a mixture of compounds similar to both collections. Fitting, Overfitting, and Domain of Applicability. Table 2 shows that fits of pQSAR and RFR models are largely

Table 3. Ability of pQSAR 2.0 to Predict Degree of CrossReactivity between Assay Pairsa NVS pQSAR min 0.30 0.35 0.40 0.40 0.50 0.55 0.60 0.65 0.70 0.75 0.80

Table 2. Median Correlation R2 between Prediction and Experiment for Fitting and Testing pQSAR and RFR Models Built from Random and Realistic Training/Test Set Splits pQSAR RFR

fit test fit test

random

realistic

0.79 0.73 0.58 0.59

0.80 0.55 0.62 0.12

Rext2

ChEMBL

2

assay pairs R pred vs exp 52568 50072 46384 40752 33934 28250 21098 14284 8122 3958 42

0.56 0.58 0.60 0.62 0.65 0.68 0.71 0.76 0.80 0.84 0.95

b

pairs

R2 pred vs expb

13664 12804 9744 6586 3874 2260 1026 216 78

0.52 0.53 0.56 0.59 0.63 0.69 0.77 0.8 0.87

a

15 or more shared compounds in each pair. bR2 predicted vs R2 experimental for vectors of pairwise R2 values (see text).

at reproducing the experimental SAR similarity between pairs of assays for both Novartis and ChEMBL data even with a threshold of only R2 > 0.3. Attempts to perform similar calculations with single-assay random forest regression models were frustrated by the small number of successful model pairs sharing at least 15 common compounds. Myth of the Activity of a Compound against a Target. An important feature of pQSAR is that it keeps the assays separate. Thus, pQSAR provides, at least for large protein families, a solution to the “mix and match” assay aggregation problem of how and when to combine data from different

insensitive to whether the models are trained on the random or realistic training/test set splits, perhaps improving very slightly for the more homogeneous clusters of compounds in the realistic splits. pQSAR fits the data better, with R2 ∼ 0.80 vs 0.60 for RFR. pQSAR outperforms RFR on the random test set by R2 = 0.73 to 0.59. One frequently reads that random forest 2085

DOI: 10.1021/acs.jcim.7b00166 J. Chem. Inf. Model. 2017, 57, 2077−2088

Article

Journal of Chemical Information and Modeling

Figure 10. Histograms of correlations between pairs of (a) biochemical serine/threonine kinase assays and (b) cellular kinase assays, for the same target. Average correlations are R2 = 0.64 and 0.38, respectively. All pairs share at least 100 common compounds.

This raises uncomfortable questions for drug discovery, such as which assay to use, or does it even make sense to carefully optimize a series against just one fairly arbitrary assay? Presumably the most relevant biochemical assay is the one most like the cellular assay. But Figure 10b shows that cellular assays vary even more, with average correlation of R2 = 0.38. Because it combines assay information while keeping the assays separate, pQSAR models individual kinase assays, not a “kinase target.” Whenever possible, pQSAR models are trained on the specific assay or assays which the team is using. Another lesson from the differences among assays for the same targets is that conventional physics-based modeling does not even try to include most of the assay variables listed above. This suggests that any successful activity model will need an empirical component to capture these additional complexities which physics-based models would still neglect even if all the usual difficult physics problems, such as solvation, entropy, induced fit, ligand conformational distribution in solution, etc., were perfectly solved. Importance of IC50 Prediction. Predicting semiquantitative IC50s, comparable to four-concentration experiments, rather than simple active/inactive classes, has many important advantages for drug discovery. There often are as many as 50 000 hits in a virtual or experimental full-deck screen. If the assay biologists are willing to order and test say 2000 compounds for confirmation, the other 96% of the hits will likely never get another look. It is important to choose that 4% very intelligently! Criteria for selecting good leads include good potency, good ligand efficiency, good lipophilic efficiency and good selectivity against specific antitargets as well as the kinase family generally. All of these criteria require at least a semiquantitative estimate of the compounds activity, not just active/inactive categories. Selectivity requires predicting the IC50 not only of the target of interest but also specific antitargets or even of a representative sampling of the entire family. These off-target data would rarely be available

assays for the same protein target to assemble a sufficient training set.22,23 This is an issue not only for conventional single-assay models, but also implicitly for multitarget proteochemometric models, because there the descriptors are identical for each compound-target pair irrespective of the assay. With pQSAR, a separate model is built for each assay (rather than each target). All the activity data from all the assays inform each prediction, yet the calculation does not know which assays share a target. Some other multitarget machine learning methods such as multitask deep neural networks and matrix factorization share this virtue with pQSAR. Comparisons with these methods are currently underway and will be the subject of a future report. Profile-QSAR thus dispenses with the very notion that there is a single, well-defined activity of a compound for a target. Medicinal chemists speak about the activity of a compound against a drug target as if it were a thermodynamic property like log P or pKa. In fact, the protein is very sensitive to the specifics of its environment. The histogram in Figure 10a shows the correlations among pairs of NVS biochemical assays for the same serine/threonine kinase targets, each pair sharing at least 100 compounds in common. The average is only R2 = 0.64, and some pairs show almost no correlation. This does not reflect poor assays. Rather it is the result of many real biochemical differences among good assays: protein construct, monomer vs multimer vs heteromeric complex, post-translational modifications such as phosphorylation pattern, temperature, pH, buffer salts and concentrations, detergents, cofactor concentrations such as ATP or cyclins, the peptide substrate, incubation time, order of adding reagents, label type and location, assay format, and others. The target in the context of a cellular assay includes even more variables and, correspondingly, more variability, as shown in Figure10b. Besides many of the factors above, and cell specific factors such as growth medium, feedback pathways, and cell-cycle phase, cells are dynamic systems generally not in chemical equilibrium. 2086

DOI: 10.1021/acs.jcim.7b00166 J. Chem. Inf. Model. 2017, 57, 2077−2088

Article

Journal of Chemical Information and Modeling

available kinase assays, the original pQSAR method substantially improved the predictions on novel compounds. By allowing all assays to contribute to the model while keeping them separate, this also bypassed the “mix and match” assay aggregation problem of how and when to combine data from different assays. However, the accuracy was still not comparable to experiment. Next generation “pQSAR 2.0” IC50 predictions on the realistically novel test sets achieved a median correlation with experiment of R2 = 0.55, comparable to experimental fourconcentation IC50s. By contrast, only 4 of the very 728 RFR models on which the pQSAR 2.0 was based achieved this level of accuracy. The striking improvement over pQSAR 1.0 was ̈ Bayes models at achieved by replacing multiple yes/no naive several thresholds with single RFR models and, surprisingly, by eliminating the assay of interest from the activity profile.

experimentally, but a broad selection of pQSAR 2.0 models allows them to be estimated using, e.g. the thermodynamicsbased “partition” selectivity index.24 Thus, even if a singleconcentration HTS has been run, it is useful to also do pQSAR virtual screens to aid in hit triaging. pQSAR has been successfully applied to over 70 Novartis projects. In addition to virtual screening for hit-finding and hitlist triaging, it has also been used for bridging to new assays, HTS “rescue,” as an orthosteric counterscreen to identify allosteric inhibitors, predicting cellular penetration from predicted ratios of biochemical to cellular activity, to help determine modes of action from phenotypic screens, toxicity profiling, and many other applications. Most of these diverse applications benefit from or even require the semiquantitative IC50s. Extension to Other Families. While this paper has focused on kinases, pQSAR models have also now been developed for G-protein coupled receptors, proteases, cytochrome P450s, and a variety of nonkinase adenosinebinding targets from many diverse families. Together these cover nearly half of the estimated druggable genome.25 Over 10 billion IC50s predicted for over 5 million compounds covering over 2000 assays have been added to a wealth of additional experimental and computed data for use in the MoA Central database26 to help identify off-target toxicities and modes of action from phenotypic screens. Assuming a cost of only $0.05 per well, if one even could measure 10 billion 4-point IC50s experimentally, the fully loaded cost would reach billions of dollars. “Shoulders of Giants”. pQSAR works by repurposing the careful experimental work of thousands of chemists, biochemists, and biologists who have synthesized hundreds of thousands of compounds, developed thousands of assays, and measured millions of IC50s. This experimental investment, undoubtedly costing billions of dollars, is the real work behind these predictions. pQSAR is strictly machine learning, with no underlying mechanistic model beyond the assumption similar compounds have similar activity and that all members of a protein family share some unspecified common binding features. pQSAR simply organizes and applies this enormous wealth of accumulated experimental data to new, closely related problems.



ASSOCIATED CONTENT

* Supporting Information S

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00166. Table S1. ChEMBL and PubChem Assay IDs. Table S2. Random training set for ChEMBL kinase assays. Table S3. Realistic training set for ChEMBL kinase assays. Table S4. Random test set for ChEMBL kinase assays. Outline of Profile-QSAR procedure (TXT) Figure S1. Correlations between prediction and experiment for 4 sets of models from 728 kinase assays: pQSAR 2.0 (PLS in second step) and pQSAR 2.0 except using RFR for both steps, both methods using both random and realistic splits. Figure S2. Predictions on ChEMBL and NVS assays using a single combined profile vs separate ChEMBL and NVS profiles (PDF)



AUTHOR INFORMATION

Corresponding Author

*Tel.: 510-879-9657. E-mail: [email protected]. ORCID

Eric J. Martin: 0000-0001-7040-5108 Present Addresses †

L.T.: China Novartis Institutes for BioMedical Research Co., Ltd., 2F, Building 4, Novartis Campus, No. 4218 Jinke Road, Zhangjiang, Pudong, Shanghai 201203, China. ‡ R.C.P.: Shriram Center, 443 Via Ortega, Stanford CA, 94305.



CONCLUSIONS When evaluated with random held-out test sets, 728 RFR kinase QSAR models appeared to give excellent predictions, comparable to experimental 4-concentation IC50s. However, most members of the random test sets were highly similar to the training set, whereas compounds actually ordered by 18 real projects were very unlike the training sets. In order to better anticipate model performance on realistically novel compounds, a new training/test set splitting algorithm was devised. The compounds are first clustered, and 75% of the compounds from the largest clusters are used for training, testing the models on the remaining 25% of singletons and small clusters. The median correlation between prediction and experiment for the RFR models using these “realistic” training/test set splits was only 0.12, and 80% of the 728 models failed to achieve even a correlation of R2 = 0.3 with experiment. This limited domain of applicability of conventional machine learning models stems from the limited diversity of compounds with available IC50 data from any individual assay. By combining chemical and biological information from all

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS The authors express their gratitude to the thousands of medicinal chemists and assay biologists that worked on kinases, GPCRs, proteases, CYP P450s, and nonkinase adenosine binding proteins over many decades that represent the real effort behind the success of pQSAR models.



ABBREVIATIONS HTS, high-throughput screen; RFR, random forest regression; pQSAR, Profile-QSAR; R2, coefficient of determination; PLS, partial least-squares; NVS, Novartis; Tc, Tanimoto similarity



REFERENCES

(1) Christmann-Franck, S.; van Westen, G. J. P.; Papadatos, G.; Beltran Escudie, F.; Roberts, A.; Overington, J. P.; Domine, D.

2087

DOI: 10.1021/acs.jcim.7b00166 J. Chem. Inf. Model. 2017, 57, 2077−2088

Article

Journal of Chemical Information and Modeling

(23) Tarasova, O. A.; Urusova, A. F.; Filimonov, D. A.; Nicklaus, M. C.; Zakharov, A. V.; Poroikov, V. V. QSAR Modeling Using LargeScale Databases: Case Study for HIV-1 Reverse Transcriptase Inhibitors. J. Chem. Inf. Model. 2015, 55, 1388−99. (24) Cheng, A. C.; Eksterowicz, J.; Geuns-Meyer, S.; Sun, Y. Analysis of kinase inhibitor selectivity using a thermodynamics-based partition index. J. Med. Chem. 2010, 53, 4502−10. (25) Hopkins, A. L.; Groom, C. R. The druggable genome. Nat. Rev. Drug Discovery 2002, 1, 727−730. (26) Selinger, D. W.; Shivashankar, V.; Ghodssi, A.; Larbaoui, M.; Mendelev, I.; Steeves, M.; Wilbert, M. L.; Litster, S.; Jenkins, J. L.; Marc, P. MoA Central: A Massively Orthogonal Search Engine for Mechanism of Action & Toxicity Studies. Paper presented at the American Chemical Society Annual Meeting, Boston, August 2015.

Unprecedently Large-Scale Kinase Inhibitor Set Enabling the Accurate Prediction of Compound−Kinase Activities: A Way toward Selective Promiscuity by Design? J. Chem. Inf. Model. 2016, 56, 1654−1675. (2) Gao, C.; Cahya, S.; Nicolaou, C. A.; Wang, J.; Watson, I. A.; Cummins, D. J.; Iversen, P. W.; Vieth, M. Selectivity data: assessment, predictions, concordance, and implications. J. Med. Chem. 2013, 56, 6991−7002. (3) Merget, B.; Turk, S.; Eid, S.; Rippmann, F.; Fulle, S. Profiling Prediction of Kinase Inhibitors: Toward the Virtual Assay. J. Med. Chem. 2017, 60, 474−485. (4) Schurer, S. C.; Muskal, S. M. Kinome-wide activity modeling from diverse public high-quality data sets. J. Chem. Inf. Model. 2013, 53, 27−38. (5) Subramanian, V.; Prusis, P.; Xhaard, H.; Wohlfahrt, G. Predictive proteochemometric models for kinases derived from 3D protein fieldbased descriptors. MedChemComm 2016, 7, 1007−1015. (6) Martin, E.; Mukherjee, P.; Sullivan, D.; Jansen, J. Profile-QSAR: a novel meta-QSAR method that combines activities across the kinase family to accurately predict affinity, selectivity, and cellular activity. J. Chem. Inf. Model. 2011, 51, 1942−56. (7) Martin, E.; Mukherjee, P. Kinase-Kernel Models: Accurate In silico Screening of 4 Million Compounds Across the Entire Human Kinome. J. Chem. Inf. Model. 2012, 52, 156−170. (8) Martin, E. J.; Sullivan, D. C. Surrogate AutoShim: Predocking into a Universal Ensemble Kinase Receptor for Three Dimensional Activity Prediction, Very Quickly, without a Crystal Structure. J. Chem. Inf. Model. 2008, 48, 873−881. (9) Martin, E. J.; Sullivan, D. C. AutoShim: empirically corrected scoring functions for quantitative docking with a crystal structure and IC50 training data. J. Chem. Inf. Model. 2008, 48, 861−872. (10) Mukherjee, P.; Martin, E. Development of a Minimal Kinase Ensemble Receptor (MKER) for Surrogate AutoShim. J. Chem. Inf. Model. 2011, 51, 2697−2705. (11) Mukherjee, P.; Martin, E. Profile-QSAR and Surrogate AutoShim Protein-Family Modeling of Proteases. J. Chem. Inf. Model. 2012, 52, 2430−2440. (12) Mukherjee, P.; Martin, E. Chemogenomic protein-family methods in drug discovery: profile-QSAR and Kinase-Kernel; Pan Stanford Publishing Pte. Ltd., 2014; pp 65−116. (13) Tian, L.; Mukherjee, P.; Martin, E. Kinase-Kernel Models: Accurate Chemogenomic Method for the Entire Human Kinome. Mol. Inf. 2013, 32, 922−928. (14) Biovia. Pipeline Pilot. http://accelrys.com/products/ collaborative-science/biovia-pipeline-pilot/ (accessed January 2017). (15) Wang, Y.; Suzek, T.; Zhang, J.; Wang, J.; He, S.; Cheng, T.; Shoemaker, B. A.; Gindulyte, A.; Bryant, S. H. PubChem BioAssay: 2014 update. Nucleic Acids Res. 2014, 42, D1075−82. (16) Bento, A. P.; Gaulton, A.; Hersey, A.; Bellis, L. J.; Chambers, J.; Davies, M.; Kruger, F. A.; Light, Y.; Mak, L.; McGlinchey, S.; Nowotka, M.; Papadatos, G.; Santos, R.; Overington, J. P. The ChEMBL bioactivity database: an update. Nucleic Acids Res. 2014, 42, D1083−90. (17) Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. KNIME: The Konstanz Information Miner. In Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007); Springer, 2007. (18) Landrum, G., RDKit: Open-source cheminformatics. http:// www.rdkit.org. (19) R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing, 2008. (20) Jain, A. K. Data clustering: 50 years beyond K-means. Pattern recognition letters 2010, 31, 651−666. (21) Sheridan, R. P. Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. J. Chem. Inf. Model. 2013, 53, 783−790. (22) Huang, Y.; Huang, Y.; Moodie, Z.; Li, S.; Self, S. Comparing and combining data across multiple sources via integration of pairedsample data to correct for measurement error. Stat Med. 2012, 31, 3748−59. 2088

DOI: 10.1021/acs.jcim.7b00166 J. Chem. Inf. Model. 2017, 57, 2077−2088