Article pubs.acs.org/jcim
Developing Collaborative QSAR Models Without Sharing Structures Peter Gedeck,*,† Suzanne Skolnik,‡ and Stephane Rodde¶ †
Peter Gedeck LLC, 2309 Grove Avenue, Falls Church, Virginia 22046, United States Novartis Institute for Biomedical Research, 250 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States ¶ Novartis Institute for Biomedical Research, Postfach, CH-4002 Basel, Switzerland ‡
S Supporting Information *
ABSTRACT: It is widely understood that QSAR models greatly improve if more data are used. However, irrespective of model quality, once chemical structures diverge too far from the initial data set, the predictive performance of a model degrades quickly. To increase the applicability domain we need to increase the diversity of the training set. This can be achieved by combining data from diverse sources. Public data can be easily included; however, proprietary data may be more difficult to add due to intellectual property concerns. In this contribution, we will present a method for the collaborative development of linear regression models that addresses this problem. The method differs from other past approaches, because data are only shared in an aggregated form. This prohibits access to individual data points and therefore avoids the disclosure of confidential structural information. The final models are equivalent to models that were built with combined data sets.
1. INTRODUCTION
more collaboration, we need ways of sharing information without sharing specific structures. In recent years, a variety of approaches were proposed to facilitate data exchange for model building. One way is to use descriptors that can easily be shared. Fingerprints have been suggested as a way of encoding structural information. While it is in principle possible to reverse engineer the actual structure from fingerprints to some degree,12 the large structural space makes it very difficult. As an alternative, molecular matched pairs move the focus from whole structures and their properties to differences between structures and their properties.13 As long as the structural changes are kept small, there is only disclosure of parts of proprietary structures. This approach lead recently to a very interesting collaboration between Roche, Genentech, and AstraZeneca. An alternative is to share individual models. Clark et al.14−16 proposed the use of naïve Bayesian models based on structural information. In this type of model individual structures are not revealed. We recently used a similar approach to combine several naïve Bayesian models for malaria built by eight commercial and noncommercial institutions with the aim to develop a publicly accessible consensus model. The consensus model outperformed the individual models. This shows that sharing can ultimately benefit everyone participating in such an exercise.17 The two approaches either share data in a form that makes it difficult to derive actual structures or share individual models and use aggregation of model predictions in a consensus model.
It is a well know and often confirmed fact that larger data sets lead to improved QSAR models.1 However, data set size is not the only factor that determines the ability of a model to generalize to novel structures. As the similarity of a novel structure to training set compounds is in general a good indicator of prediction accuracy,2 we expect a larger applicability domain for data sets that cover a wider structural space. Unfortunately, data sets are often not diverse enough. This has been demonstrated in depth for several log P models by Tetko et al.3,4 Most of the available models were developed using the Bioloom logPstar data set.5 For example, the widely used clogP model6−9 has a RMSE value of 0.5 for this data set (MAE = 0.28). This good performance is misleading, because the logPstar data set formed the basis for the model. Indeed, when Tetko et al. applied this and other models on larger, proprietary data sets from pharmaceutical companies, the RMSE values increased more than 1 log unit. This shows that the Bioloom logPstar data set is not representative enough of structures typically explored in pharmaceutical companies. Ideally, we want to include as much structural diversity as possible to build a model. This could easily be achieved, if experimental results are combined from as many sources as possible. However, in general there is little willingness to disclose proprietary structural information between competitors. This is unfortunate, as everyone would benefit from better computational models for physicochemical and ADME properties. There are a few noteworthy exceptions where screening results for malaria were made publicly available and shared through bioactivity databases such as PubChem or ChEMBL.10,11 To enable © 2017 American Chemical Society
Received: May 29, 2017 Published: July 19, 2017 1847
DOI: 10.1021/acs.jcim.7b00315 J. Chem. Inf. Model. 2017, 57, 1847−1858
Article
Journal of Chemical Information and Modeling
included in the Bioloom logPstar data set. For this study, we removed the duplicates, leaving 3423 data points with log P values ranging from −12 to 9.3; most data points were in the range −1 to 5. We referred to this data set as EPAadd in this study. The Martel data set was designed to reflect pharmaceutically relevant compounds. It was cleaned up by keeping only data points marked as “Validated”. All structures could be converted into a molecule object in RDKit. This gave a data set with 707 data points in a log P range from 0.3 to 7. Figure 1 compares the distributions of log P and number of heavy atoms in the data sets. The Bioloom compounds are more polar with a mean log P value of 2.0 and smaller with an average of 15 heavy atoms. The distribution of the EPAadd data set is comparable. The compounds tend to be slightly larger with an average of 18 heavy atoms. In contrast, the Martel data set are more lipophilic (average log P 4.2) and larger (average number of heavy atoms 27). 2.3. Descriptors. We used the following descriptor sets in this study: alogP This is the RDKit implementation of the Ghose− Crippen atom types.26−28 The RDKit implementation differs in a few details for efficiency reasons. The original alogP method defines 72 atom types and gives one or more SMARTS patterns for each of them. For example, the atom type C1 defines SMARTS patterns for methane, primary, and secondary amines. This is split in RDKit into three different types. Due to this, we get a total of 110 different atom types instead of the original 72. For alogP predictions, RDKit maps the 110 atom types to the original 72. morgan-1 The morgan-1 descriptor uses the RDKit implementation29,30 of the popular ECFC_231 fragmentation scheme. Circular fragments are generated around each atom. The radius defines the size of these circular fragments. A radius of zero reduces the fragment to the atom. With a radius of one, the connected neighbors are included. Neighbors of neighbors are included with a radius of two, and so on. The Morgan algorithm32 is used
Here, we suggest an alternative approach. Instead of sharing the data or the final model, we share intermediate results of the model building algorithm to derive a collaborative model. We demonstrate this approach for linear ridge regression models. As we will see, the collaborative model, while never having access to all data at once, is practically identical to a model built on the full data set. The intermediate results are a reduced representation of the data set that will make it very difficult, if not impossible, to reconstruct individual data points. We will demonstrate this approach using the logPstar data set. This data set is sufficiently large to allow simulating different situations.
2. METHOD 2.1. Computational Details. The implementation of the algorithm uses Python (version 3.5.2).18 Structure handling and descriptor calculation were performed with the most recent version of the cheminformatics tookit RDKit.19 Scikit-learn (version 0.18.1)20 was used to build models. We identified a minor mistake in the Bayesian ridge regression implementation that leads to slightly incorrect λ values. A fix was developed and will be available in version 0.19 of the software. The visualizations in this publication were created using ggplot221 in R (version 3.3.2).22 2.2. Data Sets. We used the Bioloom logPstar,5 the Mansouri,23 and the Martel24 data sets in this study. The logPstar data set was cleaned up by removing data points for compounds with more than one fragment and cases where the SMILES string could not be converted into a molecule object in RDKit. After this cleanup, the data set contained 11 837 data points with log P values ranging from −5.1 to 11.3 with the majority of data points in the range −1 to 5. The Mansouri23 data set consists of a cleaned up version of the publicly available PHYSPROP physicochemical properties. The data set is available from the Chemistry Dashboard of the US Environmental Protection Agency (EPA).25 Structures were removed if the SMILES string could not be converted into a molecule object in RDKit. The remaining data set contained 15 165 entries. Of these, 10 742 compounds were already
Figure 1. (a) Distribution of experimental log P values for the data sets included in this study. For the Bioloom data set (black) the experimental log P values are in the range −5.1 to 11.3, with the majority of data points in the range −1 to 5. The Martel data set (blue) has a log P range from 0.3 to 7 with the majority of data points between 2.5 and 6. The EPAadd data set (red) has a log P range from −12 to 9.3 with the majority of data points between −1 and 5. (b) Distribution of number of atoms in the three data sets. The Bioloom data set (black) contains mostly smaller compound structures, while EPAadd (red) contains slightly larger compounds. The Martel data set (blue) contains predominantly compounds with more than 20 atoms. 1848
DOI: 10.1021/acs.jcim.7b00315 J. Chem. Inf. Model. 2017, 57, 1847−1858
Article
Journal of Chemical Information and Modeling
λ is also called the ridge parameter. Adding the λI matrix to XTX leads to a shift of all its eigenvalues by λ. This only changes the Σ matrix in the singular value decomposition. We therefore get
to assign a number to each fragment; we subsequently map the number to an integer value between 0 and 4095. The integer values are used as a descriptor of the chemical structure. They can either be combined in a bit vector when the occurrence of a fragment is required or as a count vector when the number of times a fragment occurs in a structure is relevant. In this study we use the count vector representation as the descriptor. We limit the size of the fragments to a maximum radius of one. 2.4. Ordinary Linear and Ridge Regression. The ordinary linear regression problem is defined by the following equation.
Xb + ε = y
XTX + λ I = Vdiag(ϵ1 + λ , ..., ϵp + λ)VT
and consequently for the inverse ⎛ 1 1 (XTX + λ I)−1 = Vdiag⎜⎜ , ..., ϵ + λ ϵ ⎝ 1 p +
(1)
y1
X1
b = X−1y
X=
This can be rewritten in a form that is more suitable for calculation.
⋮
(7)
m
XTX =
∑ X iTX i
(8)
i=1
X = UΣV
m
XTy =
Here, U (n × n) and V (p × p) are both unitary matrices. Σ is an n × p matrix with Σij = δij ϵi . The ϵi are the eigenvalues ϵ1, ..., ϵp of the matrix and δij is the Kronecker delta which is 1 for i = j and 0 otherwise. Due to the properties of unitary matrices, the product XTX can be written as follows.
∑ X iTyi
(9)
i=1
If we want to make an ordinary linear regression, it is therefore not necessary to know the full matrices Xi and yi. We only need to have the products XTi Xi (p × p) and XTi yi (p × 1). This information is sufficient to make a linear regression. As we now have no access to individual data points, the Bayesian ridge regression approach is no longer possible. However, we can determine optimal λi values for the individual data sets. We tried the following approaches to combine these values.
XTX = (VΣTUT)(UΣVT) = VΣTΣVT = VΣ′2 VT (3)
This corresponds to the singular value decomposition of the matrix XTX. V is a p × p matrix. Σ′2 is the p × p diagonal matrix with the eigenvalues of X along the diagonal. The inverse (XTX)−1 is now given by
λmean = λgeo =
(4)
⎛1 1⎞ (Σ′2 )−1 = diag⎜⎜ , ..., ⎟⎟ ϵp ⎠ ⎝ ϵ1
m
1 m
m
∑ λi
mean of values
i=1
λ1·...·λm
λmin = min(λ1 , ..., λm)
geometric mean of values
minimum of values
Using a number of simulations, we found that λgeo is most comparable to the optimal λ value of the full data set model. 2.6. Computational Details. While the derivation of the aggregated regression model seems straightforward, the implementation needs to take the centering of the descriptor vectors into account. This is done by subtracting the average feature vector o from each row of the data matrix D. The average feature vector of the full data set is
Where (Σ′2)−1 is the p × p diagonal matrix with the inverse of the eigenvalues along the diagonal. In ridge regression, the term XTX in eq 2 is modified to include a term λI (p × p diagonal matrix with λ along the diagonal) for L2 regularization.
b = (XTX + λ I)−1XTy
⋮ ym
The matrix products XTX and XTy from eq 2 are then simply sums of the equivalent source specific matrix products.
(2)
T
(XTX)−1 = V(Σ′2 )−1VT
y=
Xm
The matrix product XTX has the shape p × p and the product XTy a vector of length p. We can use the singular value decomposition of the matrix X to further simplify the calculation.
Σ′2 = diag(ϵ1, ..., ϵp)
(6)
A common approach to determine the ridge parameter λ is cross-validation. Here we used the Bayesian ridge regression implementation in scikit-learn which iteratively derives a λ value based on the degrees of freedom and sum of squared error of the intermediate models. The method is outlined in the Appendix. 2.5. Extending the Linear Regression Approach to Combined Data Sets. If data are available from m different sources, the matrices X and y are composed of the individual matrices by appending rows.
For n data points and p features, the matrix X contains the descriptors for all data points and has shape n × p. The vector y contains the experimental values and is of length n. Finally, b are the regression coefficients (length p). ε is a scalar random variable to account for error. Under the assumption of minimizing the residual sum of squares, the OLS solution for b is
b = (XTX)−1XTy
⎞ ⎟VT λ ⎟⎠
(5) 1849
DOI: 10.1021/acs.jcim.7b00315 J. Chem. Inf. Model. 2017, 57, 1847−1858
Article
Journal of Chemical Information and Modeling
done using stratified sampling to give each data set a similar distribution of log P values.
calculated from the average feature vectors of the individual data sets.
o=
1 n
m
randomstratified
Full ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯→ {A , B , C , D , E}
∑ ni ·oi
(10)
i=1
The data set is sorted by log P and split into blocks of five data points. Each data point in these blocks is then randomly assigned to one of the subsets. To build models, we split each individual data set into a training set and into a test set. Again, we use stratified sampling to keep a similar distribution of log P values.
m
n=
∑ ni
(11)
i=1
Once the vector o for the full data set is determined, we can apply this to m data sets Di to get the combined matrix X. To transform the products DTi Di and DTi yi instead, we can use the following equations. Their derivation is given in the Appendix. X iTX i = DiTDi − ni ooT
(12)
X iTyi = DiTyi − niyi ̅ o
(13)
randomstratified
A ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯→ {SA(training 80%), TA(test 20%)} ⋮ randomstratified
E ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯→ {SE(training 80%), TE(test 20%)}
Each of the training sets {SA, ..., SE} can now be used to derive models with their associated λ parameters {(MA, λA), ..., (ME, λE)}.
To summarize, we require the following information for each of the individual data sets (Di, yi): • ni number of data points • yi̅ the average of the dependent variables • oi the average feature vector of the independent variable (p × 1) • DTi Di product of the independent variable (p × p) • DTi yi product of the independent and dependent variable (p × 1) • λi the best ridge parameter derived for the data set Multiple data sets are now combined as follows: m
n=
∑ ni
o=
i=1
λ=
m
1 n
m
∑ ni ·oi
y̅ =
i=1
XX=
∑
Bayesianridgeregression
SA ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯→ (MA , λA) ⋮ Bayesianridgeregression
SE ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯→ (ME , λE)
We also combine the five training and test sets to derive a model for the full data set. {SA , ..., SE} → SFull
m
∑ ni ·yi ̅
{TA , ..., TE} → TFull
i=1
Bayesianridgeregression
λ1·...·λm
SFull ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯→ (MFull , λFull)
(14)
m T
1 n
The data set of about 12 000 data points is split into five subsets of about 2400 data points each. With a 4:1 split, this gives a training set size of 1900 and a test set size of 480. With a repetition of 100 times and 5 subsets, we get 500 individual results. For each result, we look at the performance of the subset model on the subset test set and a combined test set of all the other test sets. For comparison, we also check the performance of a model trained on the full data set on the subset test set. The results for the alogP descriptors are summarized in the left panel of Figure 2a. The performance of the models on the different data sets is practically independent of the model and test set. The average MAE is about 0.55 log units. This is the situation we would like to have for all of our models. The model is stable and we will benefit only slightly from adding more data (compare boxplots for Subset and Full). We can also apply the model to new data and will get a similar performance (compare box plots for Subset and External). 3.2. Simulating the Real World. Compared to the ideal situation of the previous section, individual data sets are often a lot less homogeneous. In the second simulation, we cluster the data sets in descriptor space to create structurally different subsets using k-means clustering in the descriptor space.
m
DiTDi
− noo
T
T
Xy=
i=1
∑ i=1
DiTyi
− ny ̅ o
(15)
The next step is to derive the singular value decomposition of XTX and use it to determine the coefficients and the intercept of the ridge regression. XTX = Vdiag(ϵ1, ..., ϵp)VT
⎛ 1 1 b = Vdiag⎜⎜ , ..., ϵp + ⎝ ϵ1 + λ y0 = y ̅ − obT
(16)
⎞ ⎟VTXTy λ ⎟⎠
(17) (18)
The following equation is used to predict a new descriptor vector d. y = dbT + y0
(20)
(19)
2.7. Statistics. For the analysis, we determined the residuals (log Pexp − log Ppred) for the compounds in the test set and calculated the mean absolute error (MAE).
3. RESULTS AND DISCUSSION 3.1. Simulating an Ideal World. In an ideal world, a model training set is representative of the full chemical space. We simulated this situation using the following approach. The full data set F was split into five equally sized data sets A, ..., E to simulate different smaller data sets. The splitting is
k ‐meansclustering
F ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯→ {A , B , C , D , E}
(21)
This simulates the situation of pharmaceutical companies with their different compound collections. With this approach, the individual data sets reflect the real world situation more. The clusters vary in size from 20 to 5400 data points. Repeated 1850
DOI: 10.1021/acs.jcim.7b00315 J. Chem. Inf. Model. 2017, 57, 1847−1858
Article
Journal of Chemical Information and Modeling
Figure 2. Distribution of MAE values for different models and test sets. Models were trained on the full data set and on subsets. The subsets were either created using random sampling (left: ideal world) or from clustering (right: real world). The performance of these models on the subset test set and the combined test set of all the other subsests was measured using MAE. The individual boxplots are for the following combinations. Subset: performance of subset model on subset test set (gray). Full: performance of full data set model on subset test set (red). External: performance of subset model on remaining test sets (blue). Results are given for the (a) alogP and (b) morgan-1 descriptors.
clustering using the k-means clustering implementation in scikit-learn will also lead to different outcomes each time. This allows us to study the effect of training set size. The clustering approach leads to very different results. Figure 2a summarizes the results for the alogP descriptors. It shows that the MAE values vary a lot more compared to the ideal world simulation. This is mainly due to the variation of model performance with training set size (see Figure 3a). For small training sets, the model performance is very bad and slowly decreases to a MAE below 0.5 log units for training sets with more than 2000 data points. A similar observation for the correlation of data set size and performance was made by Gedeck et al. in the past.1 We also see that the performance of the full data set model is now about 0.05 log units worse. This indicates that the alogP descriptor set may not be sufficient to capture the existing structural diversity. The drop in performance is even more striking for the external test set. Especially for the large training
set sizes (larger clusters), predictions on the external test set become very bad. This indicates a larger structural difference to the training set. 3.3. Performance of Morgan-1 Descriptors. We get similar results for the morgan-1 descriptors (Figure 2b). We see a large drop in performance on the external test set for the real world simulation; the drop is 0.26 log units for the morgan-1 descriptors compared to 0.13 for alogP descriptors. The predictive performance of the models on test sets is in general better compared to the results for alogP (see Table 1). The average MAE in the ideal world simulation is 0.39 log units for the morgan-1 descriptors compared to 0.56 for the alogP descriptors. In the real world simulation, the average MAE is 0.35 compared to 0.56. The size dependence is also comparable to the alogP descriptors (Figure 3b). With the morgan-1 descriptors, the average model performance within subsets is below 0.4 for training set sizes of 500 and then slowly goes down to 0.3. It is 1851
DOI: 10.1021/acs.jcim.7b00315 J. Chem. Inf. Model. 2017, 57, 1847−1858
Article
Journal of Chemical Information and Modeling
Figure 3. Model performance as a function of training set size in the real world simulation for the (a) alogP and (b) morgan-1 descriptors. The solid lines are determined using locally weighted regression (LOESS33) smoothing. The LOESS lines are repeated in part c for comparison; dotted lines alogP, solid lines morgan-1. Color scheme: Subset black, Full red, External blue.
Table 1. Median MAE for the Different Model and Test Set Combinations from the Ideal and Real World Simulations descriptor alogP
simulation
model
test set
median MAE
ideal
subset full subset subset full subset subset full subset subset full subset
subset subset external subset subset external subset subset external subset subset external
0.56 0.54 0.55 0.56 0.59 0.69 0.39 0.33 0.39 0.35 0.35 0.61
real
morgan-1
ideal
real
Table 2. Mean MAE for Full and Collaborative Models on Full Test Set for Different Descriptors and Ideal and Real World Simulations descriptor alogP morgan-1
simulation
MAE full model
MAE collaborative model
difference
ideal real ideal real
0.542 0.542 0.331 0.329
0.542 0.542 0.331 0.329
0.0003 0.0003 0.0004 0.0004
noteworthy, that the performance of the full model is comparable to the subset models over the whole size range. This is in contrast to the alogP models where we see a slightly lower performance of the full model. This shows the effect of having a larger descriptor space that is able to adapt better to structural variations. 1852
DOI: 10.1021/acs.jcim.7b00315 J. Chem. Inf. Model. 2017, 57, 1847−1858
Article
Journal of Chemical Information and Modeling
Figure 4. Comparison of model coefficients from 40 simulations of full and collaborative models.
Figure 5. Impact of replacing a subset model with a collaborative model. The MAE of predicting a subset test set with either a subset model (black points) or the collaborative model (red points) is shown as a function of training set size. The solid lines are the LOESS regression for the different results (black subset model, red collaborative model). The results are for 40 real world simulations.
3.4. Creating a Collaborative Model. Using the methodology derived above, we can now simulate the effect of combining the regression information on the various subsets to build a collaborative model. For each of the descriptors and sampling methods, we build models using either the full training set or the combined information on the subsets. To initialize the λ value in the ridge regression, we use the geometric mean of the subset Bayesian ridge models. The full test set is then used to assess the performance of both models
measuring the MAE value. The simulation was repeated 40 times. The results are summarized in Table 2. The performance of the full and collaborative model is indistinguishable; the mean difference in the MAE values is insignificant (≈0.0004). We can also compare the final regression coefficients of the full and collaborative models. As can be seen in Figure 4, the coefficients are nearly unchanged between the collaborative and the full models. The median absolute difference between the coefficients is 0.005. There are 1853
DOI: 10.1021/acs.jcim.7b00315 J. Chem. Inf. Model. 2017, 57, 1847−1858
Article
Journal of Chemical Information and Modeling a few outliers for the alogP descriptors; however, in particular for the morgan-1 descriptors, the correlation is very high. 3.5. To share or Not To Share? In the previous section, we have seen that sharing only basic regression information allows creation of a collaborative model without sharing structural information. This collaborative model is equivalent
to a model built using all data points. We have also seen that individual log P models have only limited capability to extrapolate to more diverse structures. It is therefore desirable to use the general model instead of a subset model to have more accurate predictions of novel structures. We also wish to understand the impact of this replacement on predictions of the actual subset. In Figure 5, the change of the performance of subset and collaborative models on the respective subset test set is displayed as a scatterplot of MAE over size of training set. For the alogP descriptor models, the graph shows that for all data set sizes, we see a slight decrease in performance. The median drop in MAE when going to a collaborative model is 0.04. The decrease is larger for small data sets but relatively stable for larger data sets. The morgan-1 descriptor models show the opposite behavior. For small data sets, we see on average a small improvement of the MAE and for larger data sets the models performance is practically the same. This result shows that, with the exception of alogP descriptor models with small data sets, replacing the subset model with a collaborative model causes no penalty compared to individual models. 3.6. Case Studies. We use the Martel24 and EPAadd23 data sets as “real life” case studies. The EPAadd compounds have a
Table 3. Models Built Using the Bioloom Dataset with the alogP and Morgan-1 Descriptorsa descriptor
data set
MAE
RMSE
fragments
alogP
Bioloom EPAadd Martel Bioloom EPAadd Martel
0.54 0.80 0.96 0.27 0.69 1.07
0.71 1.21 1.24 0.37 1.13 1.34
100 89 78 2207 1728 1048
morgan-1
new fragments 0 0 199 90
a
Both models were then applied on the three datasets (Bioloom, EPAadd, and Martel). The performance was measured using MAE and root mean squared error (RMSE). The column “fragments” contains the number of unique fragments in each dataset and “new fragments” gives the number of new fragments that were found in either EPAadd or Martel, but not in Bioloom.
Figure 6. Data sets were split into training and test sets using a random 80−20 split. Three models were created, and their performances were measured as MAE using the test set. The first model is trained using the full Bioloom data set (blue line, model 1), the second is a Bayesian ridge regression model using the training set (black line, model 2), and the third model is a combination of model of 1 and 2 (red line, model 3). The lines show the distribution of the MAE values for the three models on the test set for 300 repeats of this process. Results are shown separately for (a) the EPAadd and (b) the Martel data sets. 1854
DOI: 10.1021/acs.jcim.7b00315 J. Chem. Inf. Model. 2017, 57, 1847−1858
Article
Journal of Chemical Information and Modeling
Figure 7. Three models were created using full data sets. The first model is trained using the full Bioloom data set (blue), the second model uses the EPAadd or Martel data set (internal data set, black), and the third model is a combined model of 1 and 2 (red). The experimental logP values were binned and the distribution of the absolute residuals for all three models are shown as individual boxplots in each logP bin. The graph on the right gives the median of these distributions. (a) Results for EPAadd data set; (b) results for Martel data set.
similar distribution of log P as the Bioloom data set and are only slightly larger. The structures in the Martel data set are more lipophilic and larger (Figure 1). As such the design of the data set is more representative to structures from pharmaceutical companies. In Table 3, we summarize results for models that were trained using the full Bioloom data set and then applied on the other two data sets. We can see a drop in performance for both data sets. The EPAadd data set drops 0.26 from a MAE of 0.54 to 0.80 for the alogP model and 0.42 from 0.27 to 0.69 for the morgan-1 model. For the Martel data sets, we observe larger drops to almost 1 log unit for the alogP model and to 1.1 for the morgan-1 model. The larger drop in the performance of the morgan-1 model is probably due to the fact that the morgan-1 descriptors set has additional descriptors (fragments) that are not found in the Bioloom data set. The EPAadd data set has 199 additional fragments and the Martel data set 90 additional fragments. We can now make the following experiment. The full Bioloom data set is used to train one model (model 1). Next we repeatedly split the other data sets into training and test sets using a random 80−20 split. Using the training set, we build a data set specific, “internal” model (model 2) and combine it with the Bioloom model 1 into a collaborative model (model 3). The performance of the three models is measured using MAE on the test set. The result of this experiment is summarized in Figure 6 for 300 repeats. Independent of the data set and the
descriptors, we see that the individual models perform better than the Bioloom model. The performance of the collaborative EPAadd model is very similar to the internal model. It drops slightly for the alogP descriptors and increases for the morgan-1 descriptors (Figure 6a). The collaborative Martel model on the other hand shows a significant drop in performance relative to the internal model. It is less for morgan-1, but still large. We assume that this is due to the larger structural diversity and the smaller training size of the Martel data set which leads to a larger influence of the structural features covered in the Bioloom data set. In both cases, the performance of the collaborative morgan-1 models is better compared to the alogP models. The larger number of structural fragments used by the morgan-1 descriptors gives the collaborative model more flexibility to adjust to the data sets. It is also interesting to look at the log P dependence of the prediction performance. Figure 7 summarizes the results for the two data sets and the morgan-1 descriptors. The results for alogP descriptors are given in the Supporting Information. For the EPAadd data set, the performance is fairly constant over the full log P range. While the Bioloom model drops for lower log P values, the collaborative model gives practically similar performance to the internal model. The behavior of the Martel data set is similar. Here the performance of the Bioloom model drops for larger log P values in the range where the data set has fewer examples. Combining the model with the internal model gives a consistent performance over the full log P range. 1855
DOI: 10.1021/acs.jcim.7b00315 J. Chem. Inf. Model. 2017, 57, 1847−1858
Article
Journal of Chemical Information and Modeling
We can now express the matrix products XTX (p × p) and XTy (p × 1) as follows.
4. CONCLUSIONS AND OUTLOOK The results of this study confirmed the fact that models are only as general as the chemical space covered in the training set. This means that external models may have limited predictive power and internal models may not extrapolate well into novel chemical space. Increasing the training set size and in consequence the covered chemical space, leads to improved, more robust models. It is therefore desirable to include data from as many sources as possible. The introduced methodology allows the development of collaborative models for multiple, separate data sets that are practically identical to a model based on a merged data set. Each of the separate data sets only requires aggregation of the descriptors which makes reverse engineering of individual data points and consequently structures very difficult and for larger data sets practically impossible. This makes the method suitable for models based on multiple, proprietary data sets. It is a well know fact that the screening collections at large pharmaceutical companies overlap only little,34 so available data would complement each other well and collaborations would lead to stronger models with larger applicability domains. We believe that log P would be a good real life test case for collaborative model development. This study showed that a Bayesian ridge regression model using morgan-1 descriptors has sufficient performance to justify the effort. In addition, the fragments encoded by the morgan-1 descriptors are small enough to avoid disclosing any proprietary scaffold. This reduces intellectual property concerns and makes the approach useful for the development of collaborative models with a large number of proprietary data sets. This contribution covers only one aspect of collaborative model development. There are other issues that need to be considered when combining data from various sources. Are experimental protocols comparable or are there systematic differences between the sources? If individual data points are not available, it is not possible to identify and evaluate differences between duplicate measurements. However, external standards are often used during assay development and while assays are running. This information can be shared and allows assessing comparability. It is also important that preprocessing of structures is consistent between the collaboration partners. Examples include handling of salt forms, chirality, and standardization of tautomers. A final component of model development that requires specific attention is how to define and use applicability domain for the final model. While many approaches have been developed over the years,35,36 the question of which to apply in the collaborative context needs to be addressed. This issue however is more general and goes beyond the scope of this contribution.
■
n
(XTX)ij =
k=1
=
n k=1
= (D D)ij − noioj − nojoi + noioj = (DTD)ij − noioj n
(XTy)i =
n
∑ xkiyk
=
k=1
n
∑ (dki − oi)yk k=1
=
∑ dkiyk k=1
n
−
∑ oiyk k=1
T
= (D y)i − ny ̅ oi
Here y ̅ is the average of the y vector. The term oioj is the outer product of the vector o with itself (p × p). We therefore get the following results
XTX = DTD − nooT
(24)
XTy = DTy − ny ̅ o
(25)
A.2. Bayesian Ridge Regression
The Bayesian ridge regression method in scikit-learn is based on the methodology described by MacKay.37 The aim of ridge regression is the minimization of the following expression. n
p
∑ (yi − xiTbi)2 + λ ∑ bi2 i=1
j=1
This is equivalent to minimizing the equivalent expression with β λ = α. n
p
α ∑ (yi − xiTbi)2 + β ∑ bi2 = αE D + βE W i=1
j=1
The first term describes the fit of the data (ED is sum of squared residuals), while the second term controls the size of the regression coefficients (EW is the regularising function). According to MacKay, the Bayesian approach states that α and β are best expressed in terms of the effective number of parameters. For ridge regression this corresponds to the degree of freedom which can be calculated from the eigenvalues in eq 3. γ=
The offset o of a matrix D = (dij) with 1 ≤ i ≤ n and 1 ≤ j ≤ p is the column average of D. The elements of the vector o = (oj) with 1 ≤ j ≤ p are
∑
ϵi ϵi + λ
With this, MacKay derives the following expressions for determining Bayesian optimal values for α and β. γ γ = p 2 α= 2E W ∑j = 1 bi
n
(22)
β=
In linear regression the offset is often used to shift the descriptors so that the column average of the offset matrix X = (xij) becomes the null vector. The elements of the offset matrix are
xij = dij − oj
k=1
T
i=1
i=1
n
= (DTD)ij − oi ∑ dkj − oj ∑ dki + noioj
A.1. Offset
∑ dij
∑ (dkidkj − oidkj − dkioj + oioj) k=1
p
1 n
k=1
n
APPENDIX
oj =
n
∑ xkixkj = ∑ (dki − oi)(dkj − oj)
N−γ N−γ = n 2E D 2 ∑i = 1 (yi − xiTbi)2
These values can now be used to derive a new λ and consequently, new regression coefficients. This iterative process is repeated until the regression coefficients converge. In the
(23) 1856
DOI: 10.1021/acs.jcim.7b00315 J. Chem. Inf. Model. 2017, 57, 1847−1858
Article
Journal of Chemical Information and Modeling
(15) Clark, A. M.; Ekins, S. Open Source Bayesian Models. 2. Mining a “Big Dataset” To Create and Validate Models with ChEMBL. J. Chem. Inf. Model. 2015, 55, 1246−1260. (16) Clark, A. M.; Dole, K.; Ekins, S. Open Source Bayesian Models. 3. Composite Models for Prediction of Binned Responses. J. Chem. Inf. Model. 2016, 56, 275−285. (17) Verras, A.; Waller, C. L.; Gedeck, P.; Green, D. V. S.; Kogej, T.; Raichurkar, A.; Panda, M.; Shelat, A. A.; Clark, J.; Guy, R. K.; Papadatos, G.; Burrows, J. Shared Consensus Machine Learning Models for Predicting Blood Stage Malaria Inhibition. J. Chem. Inf. Model. 2017, 57, 445−453. (18) Python Software Foundation. Python Language Reference, version 3.5. https://www.python.org/ (accessed date June 26, 2017). (19) RDKit. Open-Source Cheminformatics. https://github.com/ rdkit/rdkit/ (accessed date June 26, 2017). (20) Pedregosa, F.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825−2830. (21) Wickham, H. Ggplot2−Elegant Graphics for Data Analysis; Use R!; Springer, 2009. (22) R Core Team. R: A Language and Environment for Statistical Computing. http://www.R-project.org/ (accessed date June 26, 2017). (23) Mansouri, K.; Grulke, C. M.; Richard, A. M.; Judson, R. S.; Williams, A. J. An Automated Curation Procedure for Addressing Chemical Errors and Inconsistencies in Public Datasets Used in QSAR Modelling. SAR QSAR Environ. Res. 2016, 27, 911−937. (24) Martel, S.; Gillerat, F.; Carosati, E.; Maiarelli, D.; Tetko, I. V.; Mannhold, R.; Carrupt, P.-A. Large, Chemically Diverse Dataset of Log P Measurements for Benchmarking Studies. Eur. J. Pharm. Sci. 2013, 48, 21−29. (25) CompTox Dashboard. https://comptox.epa.gov/ (accessed date June 26, 2017). (26) Ghose, A. K.; Crippen, G. M. Atomic Physicochemical Parameters for Three-Dimensional Structure-Directed Quantitative Structure-Activity Relationships I. Partition Coefficients as a Measure of Hydrophobicity. J. Comput. Chem. 1986, 7, 565−577. (27) Ghose, A. K.; Crippen, G. M. Atomic Physicochemical Parameters for Three-Dimensional-Structure-Directed Quantitative Structure-Activity Relationships. 2. Modeling Dispersive and Hydrophobic Interactions. J. Chem. Inf. Model. 1987, 27, 21−35. (28) Ghose, A. K.; Pritchett, A.; Crippen, G. M. Atomic Physicochemical Parameters for Three Dimensional Structure Directed Quantitative Structure-activity Relationships III: Modeling Hydrophobic Interactions. J. Comput. Chem. 1988, 9, 80−90. (29) Landrum, G.; Lewis, R.; Palmer, A.; Stiefl, N.; Vulpetti, A. Making Sure Thereas a ”Give” Associated with the ”Take”: Producing and Using Open-Source Software in Big Pharma. J. Cheminf. 2011, 3, O3. (30) Riniker, S.; Fechner, N.; Landrum, G. A. Heterogeneous Classifier Fusion for Ligand-Based Virtual Screening: Or, How Decision Making by Committee Can Be a Good Thing. J. Chem. Inf. Model. 2013, 53, 2829−2836. (31) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754. (32) Morgan, H. L. The Generation of a Unique Machine Description for Chemical Structures-A Technique Developed at Chemical Abstracts Service. J. Chem. Doc. 1965, 5, 107−113. (33) Cleveland, W. S.; Devlin, S. J. Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. J. Am. Stat. Assoc. 1988, 83, 596−610. (34) Kogej, T.; Blomberg, N.; Greasley, P. J.; Mundt, S.; Vainio, M. J.; Schamberger, J.; Schmidt, G.; Hüser, J. Big Pharma Screening Collections: More of the Same or Unique Libraries? The AstraZeneca−Bayer Pharma AG Case. Drug Discovery Today 2013, 18, 1014−1024. (35) Netzeva, T. I.; et al. Current Status of Methods for Defining the Applicability Domain of (Quantitative) Structure-Activity Relationships. The Report and Recommendations of ECVAM Workshop 52. Altern. Lab Anim. 2005, 33, 155−173.
actual implementation of the scikit-learn Bayesian ridge regression, a very small term is added to the numerator and denominator of the fractions to improves the stability of the method.
■
ASSOCIATED CONTENT
S Supporting Information *
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00315. Information about data set similarities and case study results for the alogP descriptor models (PDF)
■
AUTHOR INFORMATION
Corresponding Author
*E-mail:
[email protected]. ORCID
Peter Gedeck: 0000-0002-4111-3162 Notes
The authors declare no competing financial interest.
■
REFERENCES
(1) Gedeck, P.; Rohde, B.; Bartels, C. QSAR - How Good Is It in Practice? Comparison of Descriptor Sets on an Unbiased Cross Section of Corporate Data Sets. J. Chem. Inf. Model. 2006, 46, 1924− 1936. (2) Sheridan, R. P.; Feuston, B. P.; Maiorov, V. N.; Kearsley, S. K. Similarity to Molecules in the Training Set Is a Good Discriminator for Prediction Accuracy in QSAR. J. Chem. Inf. Comput. Sci. 2004, 44, 1912−1928. (3) Tetko, I. V.; Poda, G. I. Application of ALOGPS 2.1 to Predict Log D Distribution Coefficient for Pfizer Proprietary Compounds. J. Med. Chem. 2004, 47, 5601−5604. (4) Tetko, I.; Poda, G.; Ostermann, C.; Mannhold, R. Large-Scale Evaluation of Log P Predictors: Local Corrections May Compensate Insufficient Accuracy and Need of Experimentally Testing Every Other Compound. Chem. Biodiversity 2009, 6, 1837−1844. (5) BioLoom, version 2004; BioByte: Claremont, CA, USA, 2004. (6) Leo, A. J.; Jow, P. Y. C.; Silipo, C.; Hansch, C. Calculation of Hydrophobic Constant (Log P) from. Pi. and f Constants. J. Med. Chem. 1975, 18, 865−868. (7) Leo, A. J. Some Advantages of Calculating Octanol−water Partition Coefficients. J. Pharm. Sci. 1987, 76, 166−168. (8) Leo, A. J. Calculating Log Poct from Structures. Chem. Rev. 1993, 93, 1281−1306. (9) Leo, A. J.; Hoekman, D. Calculating Log P(Oct) with No Missing Fragments;The Problem of Estimating New Interaction Parameters. Perspect. Drug Discovery Des. 2000, 18, 19−38. (10) Gamo, F.-J.; Sanz, L. M.; Vidal, J.; de Cozar, C.; Alvarez, E.; Lavandera, J.-L.; Vanderwall, D. E.; Green, D. V. S.; Kumar, V.; Hasan, S.; Brown, J. R.; Peishoff, C. E.; Cardon, L. R.; Garcia-Bustos, J. F. Thousands of Chemical Starting Points for Antimalarial Lead Identification. Nature 2010, 465, 305−310. (11) Guiguemde, W. A.; et al. Chemical Genetics of Plasmodium Falciparum. Nature 2010, 465, 311−315. (12) Faulon, J.-L.; Brown, W. M.; Martin, S. Reverse Engineering Chemical Structures from Molecular Descriptors: How Many Solutions? J. Comput.-Aided Mol. Des. 2005, 19, 637−650. (13) Dossetter, A. G.; Griffen, E. J.; Leach, A. G. Matched Molecular Pair Analysis in Drug Discovery. Drug Discovery Today 2013, 18, 724− 731. (14) Clark, A. M.; Dole, K.; Coulon-Spektor, A.; McNutt, A.; Grass, G.; Freundlich, J. S.; Reynolds, R. C.; Ekins, S. Open Source Bayesian Models. 1. Application to ADME/Tox and Drug Discovery Datasets. J. Chem. Inf. Model. 2015, 55, 1231−1245. 1857
DOI: 10.1021/acs.jcim.7b00315 J. Chem. Inf. Model. 2017, 57, 1847−1858
Article
Journal of Chemical Information and Modeling (36) Mathea, M.; Klingspohn, W.; Baumann, K. Chemoinformatic Classification Methods and Their Applicability Domain. Mol. Inf. 2016, 35, 160−180. (37) MacKay, D. J. C. Bayesian Interpolation. Neural Computation 1992, 4, 415−447.
1858
DOI: 10.1021/acs.jcim.7b00315 J. Chem. Inf. Model. 2017, 57, 1847−1858