
Selectivity Data: Assessment, Predictions, Concordance, and Implications

Cen Gao,† Suntara Cahya,‡ Christos A. Nicolaou,† Jibo Wang,† Ian A. Watson,§ David J. Cummins,‡ Philip W. Iversen,‡ and Michal Vieth*,†

†Discovery Chemistry, ‡Discovery Statistics, §Advanced Analytics, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States

ABSTRACT: Could high-quality in silico predictions in drug discovery eventually replace part or most of experimental testing? To evaluate the agreement of selectivity data from different experimental or predictive sources, we introduce the new metric concordance minimum significant ratio (cMSR). Empowered by cMSR, we find the overall level of agreement between predicted and experimental data to be comparable to that found between experimental results from different sources. However, for molecules that are either highly selective or potent, the concordance between different experimental sources is significantly higher than the concordance between experimental and predicted values. We also show that computational models built from one data set are less predictive for other data sources and highlight the importance of bias correction for assessing selectivity data. Finally, we show that small-molecule target space relationships derived from different data sources and predictive models share overall similarity but can differ significantly in their details.

1. INTRODUCTION

Understanding compound selectivity is one of the most critical components of decision making at all stages of targeted and phenotypic drug discovery. Early in the process, during target validation, a potent, selective chemical tool compound is often deemed critical for confirming the target's role in the disease model.1 In phenotypic discovery, multiple efforts have focused on understanding the target and mechanism of action with both experimental2 and predictive profiling methods.3,4 In the lead optimization stage, animal toxicity data on optimized compounds are often linked to off-target activity,5 which can be identified from profiling data. In some instances, an understanding of adverse effects in humans is aided by broad selectivity profiling, for example, using CEREP's Bioprint approach.6 Consequently, the informatics and computational communities, which historically focused on virtual screening aimed at single-target activity prediction, have increasingly devoted attention to polypharmacology prediction.7 One critical aspect of predicting and quantifying polypharmacology is the ability to understand the quality and applicability of predictions at various stages of decision making. In recent years, predictive modelers have striven to improve the quality of in silico predictions.8 Of note is the elegant Profile-QSAR method9,10 that, in principle, can extend selectivity and activity prediction to a set of unknown but related targets. These approaches have been evaluated with quantitative metrics borrowed from predictive modeling of single-end-point problems, metrics that are not necessarily directly applicable to selectivity predictions.

All machine learning and predictive modeling methods are fundamentally data driven, and as such the quality of the model produced, be it a single-target or a selectivity model, depends in part on the quality of the data used to develop it. Although data from multiple sources, such as different laboratories, may be combined to increase the size of the training and test data sets and thereby improve QSAR models, care must be taken to evaluate the concordance of the different data sources. In our previous work, we compared and contrasted kinase selectivity data on overlapping sets of compounds generated from four independent profiling sources.11 We found a relatively low level of concordance between the different profiling results that, in turn, affected the general kinase and inhibitor space conclusions drawn from the data. Here, we extend this work to evaluate the concordance of computational models and experimental data, and we compare it to the concordance between experimental results from different sources. To better quantify concordance in terms of single-experiment variability, we introduce a new statistical metric, the concordance minimum significant ratio (cMSR), inspired by a similar metric used in assay validation (Theory and Methods, section 2.2). We illustrate the use of cMSR to evaluate different sources of experimental data as well as to compare experimental and predicted values on the same assay panel. We show that the overall concordance between predicted and experimental data from the same source is comparable to the concordance between experimental results from different sources. In contrast, predictive models show a lower level of agreement when used to predict experimental outcomes from other sources. We also note that the presence of bias (e.g., potency shifts between experimental data sources), which in our case varies for each target, needs to be accounted for. Although we provide a way to address bias correction, we realize that this adjustment might not be universally adequate and needs to be considered case by case. Finally, the identification of a compound's primary target (i.e., the target at which the compound is most potent) is of interest in the analysis of phenotypic data. We found that two experimental sources identify the same primary target for a compound about 50% of the time, whereas computational models agree with experimental results only about 20% of the time. The implications of our results span multiple distinct areas of drug discovery, including the comparison of assays from different sources, the utilization of computational models for selectivity assessment, and the limitations of predictive models built from literature data.

Received: May 29, 2013. Published: August 12, 2013.




2. THEORY AND METHODS

2.1. Background and Limitation of Using R². The concordance of different data sets and the performance of predictive models are typically measured with statistical metrics such as Pearson's R², which measures the degree of linear association between two sets of values (e.g., predicted and observed). To better capture the predictive power of a model, the cross-validated R², often denoted Q², is used. However, correlation-type metrics like these are affected by the range of the data: a larger range can artificially inflate the value of R². Figure S1 illustrates this effect, where two random variates generated to have a correlation of 0.71 give a correlation of only 0.56 for a randomly selected subset with a smaller range. A metric that directly compares the accuracy of the prediction in the relevant units of measurement is preferable for quantifying both data concordance and model performance. The root-mean-square error (RMSE) is a useful step toward that end.12
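The range effect described above is easy to reproduce. The following sketch (our own illustration, not the authors' Figure S1) generates two correlated random variates with a correlation near 0.71 and shows how restricting the range of one variable shrinks the sample correlation; all values and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# y = x + noise with unit variances gives corr(x, y) = 1/sqrt(2) ~ 0.71.
x = rng.normal(size=n)
y = x + rng.normal(size=n)

r_full = np.corrcoef(x, y)[0, 1]

# Restrict to a narrow slice of x, as when a subset of the data spans
# fewer log units of activity.
mask = np.abs(x) < 0.8
r_narrow = np.corrcoef(x[mask], y[mask])[0, 1]

print(f"r (full range)   = {r_full:.2f}")   # ~0.71
print(f"r (narrow range) = {r_narrow:.2f}") # noticeably smaller
```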

2.2. Concordance MSR (cMSR). A common way to characterize the variability of assays during development is the MSR statistic, where MSR stands for minimum significant ratio.13 The MSR is the smallest ratio of potencies between two compounds that is statistically significant. For example, the results for two different compounds tested in an assay with an MSR of 3 would need to differ by more than 3-fold for the compounds to be deemed significantly different (at the chosen confidence level; alpha is usually 0.05). In this study, we adapt the concept of the MSR to assess the concordance between two sets of values originating from experimental measurements or in silico predictions. The new metric may be applied to measure the concordance between screening results on the same target coming from different laboratories or companies, or between experimental values and their corresponding predicted values. We refer to this statistic as the cMSR, where c denotes concordance. The formula to compute the cMSR is as follows:

$$\mathrm{cMSR} = 10^{\,|\mathrm{ave}| + k\,\mathrm{sd}}$$

where $\mathrm{ave} = \tfrac{1}{n}\sum_{i=1}^{n}\log(r_i)$, $r_i$ is the ratio of the two values (in concentration units, e.g., IC50) associated with the $i$th compound, $\mathrm{sd} = \left[\sum_{i=1}^{n}(\log(r_i) - \mathrm{ave})^2/(n-1)\right]^{1/2}$, and the multiplier $k = 1.645 + 0.325\,e^{-4.134\,(|\mathrm{ave}|/\mathrm{sd})}$. The constant k is tuned for the 0.05 alpha level. Note that ave and sd are the sample average and standard deviation, respectively, of the log ratios of the two sets of values (in concentration units); they represent the average log difference between the two sets (i.e., the bias) and the standard deviation around this bias (i.e., the variability). The concentration values are assumed to follow a typical log-normal distribution. If the results are already expressed on the logarithmic scale (e.g., $-\log_{10}(\mathrm{IC}_{50})$, often denoted pIC50, pKi, etc.), then $\mathrm{ave} = \tfrac{1}{n}\sum_{i=1}^{n}\Delta\mathrm{pIC50}_i$ and $\mathrm{sd} = \left[\sum_{i=1}^{n}(\Delta\mathrm{pIC50}_i - \mathrm{ave})^2/(n-1)\right]^{1/2}$. Note that, although related, the cMSR is not equivalent to the root-mean-square error (RMSE); further discussion of the difference between cMSR and RMSE can be found in the Supporting Information.

As an extension of the MSR calculation for measuring the variability of a single assay, the cMSR measures both the variability and the bias between two sources of data (lab 1 vs lab 2, predicted vs experimental). In effect, the cMSR quantifies the magnitude of the variability and also penalizes any bias away from the ideal 1:1 relationship. When measuring the performance of individual assays with the MSR statistic, bias does not come into the picture because we are quantifying the reproducibility of an assay within itself. However, when comparing values between two different sources of data (different laboratories, predicted vs experimental, etc.), the bias becomes an integral part of the concordance measure in addition to the variability. Figure 1 illustrates how cMSR and R² behave under three different bias scenarios between predictions and the corresponding experimentally observed values: no bias, constant bias, and nonconstant bias. As shown, the cMSR accounts for the potential bias between the data pair, unlike the ubiquitous R² statistic. Additionally, the cMSR is expressed on the same fold-ratio scale as the assay MSR, so concordance can be compared directly with experimental reproducibility.
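A minimal numerical sketch of the cMSR calculation defined above, assuming our reading of the bias term as an absolute value (so that bias in either direction increases the cMSR); the function name and the synthetic example are our own, not the authors' reference code.

```python
import numpy as np

def cmsr(delta_plog):
    """Concordance MSR from paired log-scale differences, e.g.
    pIC50(source 1) - pIC50(source 2) for the same compounds."""
    d = np.asarray(delta_plog, dtype=float)
    ave = d.mean()        # bias: mean log difference between sources
    sd = d.std(ddof=1)    # variability around that bias
    # k is tuned for the 0.05 alpha level: ~1.97 with no bias,
    # decaying toward 1.645 as |bias| grows relative to sd.
    k = 1.645 + 0.325 * np.exp(-4.134 * abs(ave) / sd)
    return 10.0 ** (abs(ave) + k * sd)

# Example: two hypothetical sources with a 0.3-log-unit potency shift.
rng = np.random.default_rng(1)
p1 = rng.uniform(5, 9, size=200)
p2 = p1 + 0.3 + rng.normal(scale=0.4, size=200)
print(f"cMSR = {cmsr(p1 - p2):.1f}")
```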

Figure 1. cMSR and R² values in the presence of varying bias between a set of predicted and observed values. The green line is the identity line; the red line is the best-fit line. (a) No bias; (b) constant bias (i.e., predictions are consistently under- or overestimated across the range of predicted values); (c) nonconstant bias (i.e., overestimation at the lower end of the prediction range and underestimation at the higher end). In all cases, the R² value is consistent at around 0.9. However, from a concordance standpoint, panel a is the most desirable outcome. In contrast to R², the cMSR reflects the potential bias in the data more accurately.
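The behavior shown in Figure 1 can be reproduced with synthetic data. The sketch below (reusing the cmsr function from the previous block, which must be in scope) simulates the three bias scenarios; R² stays near 0.9 in all three while the cMSR grows with bias. All values are synthetic.

```python
import numpy as np

# Assumes cmsr() from the previous sketch is defined.
rng = np.random.default_rng(2)
obs = rng.uniform(5, 9, size=500)            # "observed" pIC50
noise = rng.normal(scale=0.35, size=500)

scenarios = {
    "no bias":          obs + noise,
    "constant bias":    obs + 0.7 + noise,
    # Tilted best-fit line: over-predicts low values, under-predicts high.
    "nonconstant bias": obs + 0.5 * (obs - 7.0) + noise,
}

for name, pred in scenarios.items():
    r2 = np.corrcoef(obs, pred)[0, 1] ** 2
    print(f"{name:18s}  R2 = {r2:.2f}   cMSR = {cmsr(pred - obs):5.1f}")
```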




Table 1. Concordance of All 1466 Kinase-Inhibitor Pairs between Different Sources Measured by Four Metrics^a

| source 1  | source 2   | R², all data | mean fold difference, all data^b | cMSR, all data | percent within 3-fold, all data |
|-----------|------------|--------------|----------------------------------|----------------|---------------------------------|
| Exp(LLY)  | Exp(Metz)  | 0.58         | 1.0^c                            | 12.6^c         | 73^d                            |
| Exp(LLY)  | Pred(LLY)  | 0.46         | −1.0                             | 16.7           | 68                              |
| Exp(Metz) | Pred(Metz) | 0.34         | 1.3                              | 16.4           | 62                              |
| Exp(Metz) | Pred(LLY)  | 0.28         | 1.0                              | 18.8           | 65                              |
| Exp(LLY)  | Pred(Metz) | 0.29         | −1.3                             | 26.1           | 51                              |

^a Only numerical values are used in the comparisons after prefixes are eliminated. ^b Mean fold difference was computed from a common data comparison. A negative sign indicates that source 1 has lower values than source 2. ^c The numbers correspond to Lilly-Metz bias-corrected measured results from the Metz publication (values from Table S5); see Table S2 for bias-uncorrected results. ^d To be consistent with the other metrics in the table, the reported percent of results within 3-fold compares the numerical results at face value (e.g., an activity of >1 μM is considered not within 3-fold of an activity of >5 μM, because 1 μM and 5 μM differ by more than 3-fold). If the prefixes are not eliminated before the comparison (e.g., >1 μM is treated as equivalent to >5 μM and therefore within 3-fold), this value becomes 77, which was used in our previous work.11

We estimated the bias for each source and target and noted significant target-dependent differences between the Lilly and Metz sources for the training set data (Table S1). These biases, if not corrected, would lead to large discrepancies when comparing results between sources (e.g., Lilly vs Metz data; see Table S2). In our experiments, we use the compounds common to Lilly and Metz as a test set; as a result, neither training set contains any of these common compounds. We therefore adjust for the bias differences using a simple target-dependent mean correction of the Metz experimental data (see Table S1). Although we realize that this is an oversimplified way of correcting source bias, in our case the training set ("historical") biases are correlated with the test set ("future data") biases (R² = 0.51; last two columns of Table S1) and thus inform them well. The overlapping test data between the two sets consist of 32 compounds across 69 kinase targets, for a total of 1466 kinase-inhibitor pairs. Raw data for these 1466 inhibitor-target pairs are given in Table S5. For this set of overlapping data, we report data concordance using the cMSR metric as well as the other metrics used in our previous communication: R², mean fold difference, and percentage of data within 3-fold.11

For each of the 69 targets with overlapping data in both experimental sources, a predictive model was built using the SVMFP approach. As described in our previous communications,15,16 we used support vector machines (SVM)17 to model continuous pIC50 values, which were scaled to the −1 to 1 range for computational convenience. Support vector machine learning is a supervised method for classification and regression.18 SVM classification models simultaneously minimize the classification error while maximizing the geometric margin between classes; SVM regression models apply analogous concepts to a "tube" region around the observed values. We used the svm_lite package17 and a Tanimoto kernel function with circular fingerprints of maximum radius three bonds (ECC3 fingerprint). Atom typing was a function of four atomic attributes: hydrogen count, π electrons, aromaticity, and the number of heavy-atom connections. To estimate model performance, we performed 10 random 50/50 train/test cross-validation runs. As pointed out in our previous work,15,16 this and other training set cross-validations tend not to be predictive of test set performance for most metrics, including R² (per-target cross-validation statistics are reported in Figure S3 and compared to the test set results).
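The target-dependent mean correction described above can be sketched with pandas; the frame layout, column names, and values are hypothetical, and the per-target bias is estimated from each source's training data separately (the training sets share no compounds).

```python
import pandas as pd

# Hypothetical long-format training data, one pIC50 measurement per row.
lly  = pd.DataFrame({"target": ["ABL1", "ABL1", "KDR"], "pic50": [7.2, 6.5, 8.1]})
metz = pd.DataFrame({"target": ["ABL1", "KDR", "KDR"],  "pic50": [7.9, 8.6, 8.3]})

# Per-target mean bias of Metz relative to Lilly, from training data only.
bias = metz.groupby("target")["pic50"].mean() - lly.groupby("target")["pic50"].mean()

# Subtract the historical bias from Metz test-set values before comparison.
test = pd.DataFrame({"target": ["ABL1", "KDR"], "pic50_metz": [6.4, 7.7]})
test["pic50_corrected"] = test["pic50_metz"] - test["target"].map(bias)
print(test)
```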
Interestingly, in contrast to R², the cross-validation cMSR appears to be concordant with the test set cMSR: more than 50% of the cross-validated models with cMSR < 10 also show test set cMSR values below 10.
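The SVMFP setup itself (svm_lite with Lilly's internal ECC3 fingerprints) is not publicly available. The sketch below reproduces the idea with scikit-learn's SVR on a precomputed Tanimoto kernel over RDKit Morgan fingerprints of radius 3, which are an analogous assumption rather than the authors' exact descriptors; the SMILES and pIC50 labels are toy data.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.svm import SVR

def tanimoto_kernel(fps_rows, fps_cols):
    """Matrix of Tanimoto similarities between two lists of fingerprints."""
    return np.array(
        [DataStructs.BulkTanimotoSimilarity(fp, list(fps_cols)) for fp in fps_rows]
    )

smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N", "CC(=O)O", "CCCN"]
pic50 = np.array([5.1, 5.4, 6.8, 7.0, 4.9, 5.6])   # made-up labels

mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=3, nBits=2048) for m in mols]

train, test = slice(0, 4), slice(4, 6)

# Scale pIC50 to [-1, 1], fitting the scaling on the training split only.
lo, hi = pic50[train].min(), pic50[train].max()
y_train = 2 * (pic50[train] - lo) / (hi - lo) - 1

model = SVR(kernel="precomputed")
model.fit(tanimoto_kernel(fps[train], fps[train]), y_train)

# Predict held-out compounds and map back to pIC50 units.
y_pred = model.predict(tanimoto_kernel(fps[test], fps[train]))
print((y_pred + 1) / 2 * (hi - lo) + lo)
```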

As with the comparison of the two experimental data sets, the concordance between the predicted activities and the experimental data was assessed for all 1466 collated kinase-inhibitor pairs common to the Lilly and Metz data sets. It is worth noting that certain compound activities carry qualifiers (i.e., experimental results reported as below or above threshold values). The Lilly data set has an activity range of 1 nM to 20 μM; activity outside this range is reported with the qualifiers <1 nM or >20 μM. To better assess the concordance of activity levels, we also examined the collated data without qualifiers (i.e., nonqualified results) separately. Of the 1466 kinase-inhibitor pairs, 452 fall into this category. Finally, in addition to the collated data, we computed cMSR on a per-compound basis (32 compound bioprint groups with up to 69 data points per group) and on a per-target basis (69 target bioprint groups with up to 32 data points per group). Target-based bioprint clustering was performed using Ward's method19 in R.20 Sixty-eight targets with more than 10 compound data points were used to generate the continuous Tanimoto distance matrix. The clustering result was laid out using an internally implemented equal-angle algorithm and displayed using Cytoscape21 and Spotfire.22
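Qualified results (e.g., ">20000" nM) must be separated from plain numeric values before concordance metrics are computed. A small parsing sketch follows; the string format and function name are assumptions for illustration.

```python
import re

def parse_activity(value):
    """Split a reported IC50 such as '>20000' or '3.2' (nM) into
    (qualifier, numeric value); qualifier is '' when absent."""
    m = re.match(r"\s*([<>]?)\s*([0-9.]+)", str(value))
    if not m:
        return None
    return m.group(1), float(m.group(2))

reported = ["3.2", ">20000", "<1", "450"]
parsed = [parse_activity(v) for v in reported]

# Nonqualified subset, analogous to the 452-pair comparison.
nonqualified = [v for q, v in parsed if q == ""]
print(parsed)
print(nonqualified)
```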

3. RESULTS

3.1. Concordance of Collated Target-Compound Data. Table 1 provides the pairwise concordance metrics between the experimental and predicted data across the Lilly and Metz sources for 1466 kinase-inhibitor pairs. Scatter plots with activity distributions are shown in Figures S3 and S4. The cMSR between Exp(LLY) and Exp(Metz) is 12.6 for all data and increases to 23.3 (Table 2) for nonqualified data. Both cMSR values indicate a relatively low overall concordance compared to biological assays developed internally at Lilly, which typically have an MSR of around 3. This low concordance is consistent with our previously reported finding11 obtained with traditional metrics (R² and the percent of results within 3-fold). In addition, the mean fold difference shows an overall systematic bias between the two data sets: without bias correction, Exp(Metz) is, on average, 3.5-fold (0.54 log units) more potent than Exp(LLY) across all kinase-inhibitor pairs. This bias partially contributes to the large value of the cMSR. R² has a moderate value of 0.58 for Exp(LLY) vs Exp(Metz), but it must be used with caution because the Lilly pIC50 distribution deviates from normality (see the activity distributions in Figure S4a,b). The SVMFP-based predictive models built from the experimental data (models built from the Lilly training data compared to the Lilly experimental data, and models built from the Metz data compared to the Metz experimental data) show comparable cMSR (16.7 and 16.4), mean fold difference (−1.0 and 1.3), and percentage of results within 3-fold (68 and 62). Although the cMSR of these predictions is relatively high compared to the target assay MSR of 3, it is only slightly higher than the variability seen between the Lilly and Metz experimental data (cMSR of 12.6). This finding indicates that the overall level of agreement between predicted and experimental data is comparable to that found between experimental results from different sources.

However, the prediction models fare worse on all metrics when applied to data from another source (i.e., predicting Metz experimental data using a model built on Lilly data and vice versa), with cMSRs of 18.8 and 26.1, respectively. Note that the SVMFP methodology is just one of many ways to build predictive models; although we use it here to exemplify the performance of predictive models, some generalizations might be specific to the method itself or to the data sets we used. We also examined the performance of a random forest (RF) model with molecular descriptors and summarized the results in Table S3. On the basis of those results and our past experience,15 the SVMFP method is either comparable to or outperforms the other models we have examined.

Table 2 shows the concordance values for the 452 kinase-inhibitor pairs with nonqualified data. With minor exceptions, all metrics show that the nonqualified data have significantly lower concordance. Thus, we find less agreement when cross-comparing or predicting the activity of compounds than their inactivity, because most qualified data come from compounds with low activity (i.e., IC50 beyond the assay detection limit). The concordance trends between the data sources are similar to those for all data; however, the agreement between the two experimental sources deteriorates much less than it does for the predictive models. It is also notable that the predictive models, on average, underestimate the experimental potency of active compounds by 2.3- to 3.6-fold. Both result sets suggest that the ability to predict even one's own experimental outcome is not yet satisfying, especially for active compounds. The problem is compounded when one tries to increase the training set size by combining multiple data sources without proper normalization. The cMSR metric provides a more direct performance measure because it can be compared directly with the performance of the underlying assays as measured by MSR. The cMSR accounts for both the variability and the bias between sources of data. As discussed earlier, bias comes in different varieties (constant vs nonconstant, linear vs nonlinear). The simple between-source bias correction attempted here worked to some extent, but a more thorough normalization procedure should be performed in practice before combining data from multiple sources. In some cases, it may well be decided that combining the data is not feasible because of uncertainty in the normalization process.

3.2. Concordance of Per-Compound Selectivity Profiles. Large screening panels are typically used to understand compound selectivity profiles and off-target activities. Heat maps are a common way to visually analyze such selectivity data.15,23,24 They allow a visual comparison of the similarity between two selectivity profiles of the same compound from two profiling sources versus the profile of a different compound from the same panel.




Figure 2. (a) Heat maps for 29 of the 32 compounds tested in at least 25 targets. Empty values are shown in gray. Targets (columns) are arranged in alphabetical order. Pred(LLY), Exp(LLY), and Exp(Metz) are shown in the left, middle, and right panels, respectively. (b) Difference maps between Exp(LLY) and Pred(LLY) (left) and between Exp(LLY) and Exp(Metz) (right). Empty values are shown in gray. Per-compound cMSR values are listed on the map. Deeper purple indicates a larger difference in activity between two sources; lighter purple indicates a smaller difference. In both panels, compounds are sorted by the per-compound cMSR between Exp(LLY) and Pred(LLY). (c) Scatter plot showing the log10 fold difference in activity of Pred(LLY)−Exp(LLY) versus Exp(Metz)−Exp(LLY). Kinase-inhibitor pairs that are within 3-fold between data sources are colored accordingly: red for Pred(LLY)−Exp(LLY) and blue for Exp(Metz)−Exp(LLY). Of the 1466 Exp(LLY) data points, 777 (53%) are within 3-fold of both Exp(Metz) and Pred(LLY); these are colored magenta.

Moreover, quantification of the per-compound selectivity profile allows one to better examine the differences from other compounds tested in the same panel. Figure 2a shows the bioprint heat maps for Pred(LLY), Exp(LLY), and Exp(Metz), with rows sorted by the per-compound cMSR between Exp(LLY) and Pred(LLY) (R² sorting of the same panel is shown in Figure S6). The corresponding difference maps, colored by the log10 activity difference of Pred(LLY)−Exp(LLY) and Exp(LLY)−Exp(Metz), are shown in Figure 2b. Intuitively, the top portions of both profiles (left and middle panels of Figure 2a) should be more similar than the bottom portions. Indeed, the top 10 compounds in the left panel of Figure 2b show smaller differences (light purple), indicating more similar profiles than the compounds at the bottom.

Although global metrics (cMSR or R²) can assess overall bioprint similarity, they may not represent well the concordance of the outliers (e.g., the most potent, selective compounds). Figure 2a shows that Exp(LLY) and Exp(Metz) (middle and right panels, respectively) are more similar in detecting potent, selective compounds (e.g., SGCV121 or SGCV97), even though Pred(LLY) is built from the same source as Exp(LLY). However, Pred(LLY) appears to predict better in the inactive space (i.e., the blue regions of the heat map). This is likely because the distribution of results in the experimental data sets trends toward the inactive space, and such trends are captured during model training. To quantify this observation, we determined the most potent activity and its associated target (i.e., the most likely primary target) for the 29 compounds with more than 20 assay values in each profiling result. As shown in Table 3, 11 of the 29 compounds have identical "most likely targets" in the Lilly and Metz experimental data sets.




Table 3. Most Potent Activity and the Corresponding Target of the 29 Compounds^a











^a Compound ID is followed by its common synonym or PubChem ID. Compound activity is shown as pIC50. Where known, primary targets for 24 compounds are listed on the basis of Davis's report25 or literature searches. For each compound, data sources suggesting the same most potent target are shown in bold. Green cells indicate agreement between the data source and the reported primary target.

In contrast, the Pred(LLY) data shares only three most likely targets with Exp(LLY) and four with Exp(Metz). Similarly, the Pred(Metz) data identifies the same target as Exp(Metz) in three cases and as Exp(LLY) in five cases. In summary, both models significantly underperform compared with the agreement between the experimental sources.
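Identifying each compound's "most likely target" amounts to an argmax over its activity profile. A sketch with a hypothetical wide-format frame (compounds × targets of pIC50 values) follows; all names and numbers are illustrative.

```python
import pandas as pd

# Hypothetical bioprints: rows = compounds, columns = kinase targets.
exp_lly = pd.DataFrame(
    {"ABL1": [7.1, 5.2], "KDR": [6.0, 8.3], "EGFR": [5.5, 6.1]},
    index=["SGCV97", "SGCV121"],
)
pred_lly = pd.DataFrame(
    {"ABL1": [6.4, 5.0], "KDR": [6.6, 7.9], "EGFR": [5.2, 6.3]},
    index=["SGCV97", "SGCV121"],
)

# Most potent target per compound: column of the row-wise maximum pIC50.
top_exp = exp_lly.idxmax(axis=1)
top_pred = pred_lly.idxmax(axis=1)

agreement = (top_exp == top_pred).mean()
print(top_exp.to_dict(), top_pred.to_dict(), f"agreement = {agreement:.0%}")
```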




Table 3 also shows the literature-reported primary targets for 24 compounds.25 The Lilly and Metz experimental profiles correctly identified the primary target for 11 and 13 of these compounds, respectively, whereas the predictive models built from either data set did so only 5 and 4 times. These results indicate that, unlike the case for the entire data sets, the in silico results for individual compounds are significantly less informative and accurate than the results from other sources of experimental profiling data. In addition, computational models that use structural descriptors tend to predict similar selectivity profiles for closely related compounds, whereas experimental profiles for such related compounds can in some instances differ significantly (see Figure S8 for bioprint-based dendrograms of compounds derived from experimental and predicted profiles). The selectivity ratio is another popular way to characterize the selectivity profile of compounds; we therefore investigated the in silico prediction of the selectivity ratio (computed from the individual value predictions) for each compound in each target-pair combination. The concordance of the selectivity ratios shows a trend similar to that seen for the absolute values (Table S4). In addition, the visualization of compound clusters driven by their bioprint profiles suggests that the clustering result is also highly dependent on the exact data source used (Figure S8).

3.3. Concordance of the Per-Target Bioprint, Outsourcing Decisions. The distributions of the per-target cMSRs generated between the different data sources are shown in Figure 3. They show overall trends consistent with the concordance of all data. The geometric mean of the cMSRs is lowest between the two experimental sources (10.2), followed by predictions compared with experimental data from the same source: 13.8 for Exp(LLY) vs Pred(LLY) and 15.1 for Exp(Metz) vs Pred(Metz). Per-target in silico predictions compared with experimental data from a different source are less concordant (mean of 16.2 for predicting the Metz data with the Lilly-based models and 21.9 for predicting the Lilly data with the Metz-based models). A number of targets have a cMSR value of 10 or less: 37 between the Lilly and Metz experimental data; 28 and 21 between the in-source predicted models and the experimental data for Lilly and Metz, respectively; and 13 for Lilly data predicted from Metz-based models. The data presented in Figure 3 suggest that data concordance varies across the Sugen kinase groups/subfamilies. For example, results for AGC and CAMK kinases are more concordant between the two experimental data sets than those for tyrosine kinases (TK). A similar trend is observed when comparing the experimental data with the predictive models. The lower concordance for TK kinases could be related to larger variability in assay protocols and a larger diversity of inhibition modalities, both of which may contribute to the poorer performance of the predictive models. For targets where the computational model achieves a cMSR value of 10 or less, the model might give predictions reliable enough to stand in for the experimental outcome. A practical question is whether such potentially promising models could be identified a priori during training. To answer this question, we investigated the relationship between the cMSR metrics of the training set cross-validations and of the test set.
In our data set, we observed good concordance between the cross-validation cMSR and the test set cMSR for both the Lilly (Figure 4a) and Metz (Figure 4b) data sets. For the Lilly data, 28, or 61%, of the models with a cross-validation cMSR of 10 or less also showed a test set cMSR of 10 or less.
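The a priori triage suggested above amounts to thresholding the cross-validation cMSR. A minimal sketch follows; the per-target cMSR values and the threshold of 10 (taken from the discussion above) are illustrative.

```python
import pandas as pd

# Hypothetical per-target cMSR values from cross-validation and test.
stats = pd.DataFrame({
    "target":    ["ABL1", "KDR", "EGFR", "SRC"],
    "cv_cmsr":   [6.5, 9.1, 14.0, 22.3],
    "test_cmsr": [8.0, 12.5, 11.2, 30.1],
})

THRESHOLD = 10.0  # cMSR at or below this is treated as "reliable"

promising = stats[stats["cv_cmsr"] <= THRESHOLD]          # flagged a priori
confirmed = promising[promising["test_cmsr"] <= THRESHOLD]  # held up on test
print(f"{len(confirmed)}/{len(promising)} promising models confirmed on the test set")
```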