
Improved Chemical Structure−Activity Modeling Through Data Augmentation

Isidro Cortes-Ciriano*,† and Andreas Bender‡

†Département de Biologie Structurale et Chimie, Institut Pasteur, Unité de Bioinformatique Structurale, CNRS UMR 3825, 25 rue du Dr Roux, 75015 Paris, France
‡Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom

ABSTRACT: Extending the original training data with simulated unobserved data points has proven powerful for increasing both the generalization ability of predictive models and their robustness against changes in the structure of the data (e.g., systematic drifts in the response variable) in areas as diverse as the analysis of spectroscopic data and the detection of conserved domains in protein sequences. In this contribution, we explore the effect of data augmentation on the predictive power of QSAR models, quantified by the RMSE values on the test set. We collected 8 diverse data sets from the literature and ChEMBL version 19 reporting compound activity as pIC50 values. The original training data were replicated (i.e., augmented) N times (N ∈ {0, 1, 2, 4, 6, 8, 10}), and these replications were perturbed with Gaussian noise (μ = 0, σ = σnoise) on (i) the pIC50 values, (ii) the compound descriptors, (iii) both the compound descriptors and the pIC50 values, or (iv) none of them. The effect of data augmentation was evaluated across three different algorithms (RF, GBM, and SVM radial) and two descriptor types (Morgan fingerprints and physicochemical-property-based descriptors). The influence of all factor levels was analyzed with a balanced fixed-effect full-factorial experiment. Overall, data augmentation consistently increased the predictive power on the test set by 10−15%. Injecting noise on (i) the compound descriptors or on (ii) both the compound descriptors and the pIC50 values led to the largest drop in RMSEtest values (from 0.67−0.72 to 0.60−0.63 pIC50 units). The maximum increase in predictive power provided by data augmentation is already reached when the training data is replicated one time. Therefore, extending the original training data with one perturbed repetition thereof represents a reasonable trade-off between the increased performance of the models and the computational cost of data augmentation, namely the increase of (i) model complexity due to the need for optimizing σnoise and (ii) the number of training examples.



INTRODUCTION

In supervised learning, the goal is to find a model (function) from a set of training examples that is able to predict an output variable (e.g., compound activities), following a given inductive bias,1 for yet unobserved data (i.e., to generalize across the input space) that could potentially be observed in the future. In bioactivity or chemical structure−activity modeling, unobserved data correspond to compounds whose activities have not yet been determined experimentally. Machine learning models are widely used at early stages of the drug discovery process by both academic and private laboratories.2 However, building statistically robust models displaying correct generalization properties across the chemical space is not always possible due to (among others) (i) the lack of sufficient compound activities for a target of interest and (ii) the limited generalization ability of the models (overfitting) if the structure of future data changes through systematic or random fluctuations in the response variable not accounted for by the training examples. Examples include variations of pIC50 data over time due to the degradation of assay reagents,3 or due to a change in

the composition of microtiter plates in the case of cell line sensitivity data.4 In several scientific disciplines, such as the analysis of spectroscopic data5−12 or the detection of conserved domains in protein sequences,13 the introduction of simulated unobserved data has proved an effective approach to increase (i) the predictive power (generalization ability) of models trained on a limited number of examples and (ii) the robustness of these models against systematic variations of the data. The exploitation of unobserved data in the training of predictive models is termed data augmentation.14 Given that structurally similar compounds are likely to exhibit similar bioactivities (i.e., the similarity principle15), the hypothesis in data augmentation is that neighboring regions in descriptor space are likely to be associated with small changes in bioactivity (provided compounds are encoded with descriptors that reflect compound structural similarity in descriptor space).


Figure 1. Illustration of data augmentation. (A) The left panel (A.1) illustrates the data augmentation scenario where noise is injected on the response variable (e.g., pIC50 values). A.2 illustrates the scenario where noise is added solely to the compound descriptors, whereas A.3 depicts the scenario where noise is injected on both the compound descriptors and the response variable. (B) Schematic representation of the training data in data augmentation. The original training data, namely, Doriginal = {Xoriginal; yoriginal}, is replicated N times (3 in this example). Random noise sampled from Gaussian distributions with zero mean and standard deviation σnoise is injected on the replications. This noise can be added to the compound descriptors (Xperturbed), to the response variable (yperturbed), or to both. The value of σnoise was determined with grid search and cross-validation.

Table 1. Data Sets Modelled in This Studya

dataset label  biological end point                         data points  activity range (pIC50)  bioactivity standard deviation (pIC50)  data source         ref
DHFR rat       rat dihydrofolate reductase                  759          4.00−9.50               1.25                                     ChEMBL 16           49
DHFR homo      human dihydrofolate reductase                743          4.00−9.72               1.32                                     ChEMBL 16           49
F7             human factor-7                               344          4.04−8.18               0.97                                     US20050043313       50
IL4            human interleukin-4                          599          4.67−8.30               0.57                                     WO 2006/133426 A2   50
MMP2           human matrix metallopeptidase-2              443          5.10−10.30              0.99                                     WO2005042521        50
P20309         human muscarinic acetylcholine receptor M3   779          4.00−10.50              1.70                                     ChEMBL 19           18
P25929         human neuropeptide Y receptor type 1         501          4.11−10.70              1.40                                     ChEMBL 19           18
P49146         human neuropeptide Y receptor type 2         561          4.00−10.15              1.27                                     ChEMBL 19           18

a These data sets can be found in the references indicated in the last two columns and are provided in the Supporting Information. The first column (dataset label) reports the dataset labels used throughout the manuscript. The diversity of the data sets was assessed with respect to the number of data points, the target class, and the standard deviation of the distribution of the pIC50 values.

The generation of augmented data is generally performed by adding a small amount of noise to replications of the original data. This injected noise is expected to mimic future changes in the structure of the data and, therefore, to lead to a better coverage of the future sample population. In QSAR, this corresponds to replicating the original training data, namely, Doriginal = {Xoriginal; yoriginal}, N times and adding noise to these replications (Figure 1A,B) on either (i) the bioactivities (Figure 1A.1), (ii) the compound descriptors (Figure 1A.2), or (iii) both (Figure 1A.3). Therefore, the generalization properties and the robustness against future changes in the structure of the data of models trained on both the original and the augmented training data are expected to improve with respect to models trained exclusively on the original training data.
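For illustration, the replicate-and-perturb scheme of Figure 1B can be sketched in Python/NumPy as follows. This is an illustrative sketch, not the implementation used in this study (which was carried out in R); the function name and the sigma_x/sigma_y arguments are hypothetical stand-ins for σnoise, whose value is optimized by grid search and cross-validation.

```python
import numpy as np

def augment(X, y, n_replicates=1, sigma_x=0.1, sigma_y=0.1, seed=0):
    """Extend the training set (X, y) with n_replicates perturbed copies.

    Gaussian noise (mean 0, standard deviation sigma_x / sigma_y) is
    injected into the descriptors and/or the response; setting a sigma
    to 0.0 reproduces the scenarios where only one of them (or neither)
    is perturbed.
    """
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [X], [y]
    for _ in range(n_replicates):
        X_parts.append(X + rng.normal(0.0, sigma_x, size=X.shape))
        y_parts.append(y + rng.normal(0.0, sigma_y, size=y.shape))
    return np.vstack(X_parts), np.concatenate(y_parts)

# Example: original data plus one perturbed replication (the N = 1 setting
# identified in this study as a reasonable trade-off), with placeholder data.
X_aug, y_aug = augment(np.random.rand(100, 256), np.random.rand(100) + 5.0)
```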

The increase in model robustness against random changes and systematic drifts of the response variable provided by data augmentation on spectroscopic data has been extensively reported in the literature.5−11 For instance, Saiz-Abajo et al.11 modeled near-infrared (NIR) spectroscopy data with ensemble partial least squares regression (PLS)16 and data augmentation. Six types of simulated noise, reflecting data perturbations likely to affect NIR data, were explored, namely: (i) additive noise, (ii) multiplicative noise, (iii) intensity-dependent noise, (iv) local shift, (v) instrumental noise, and (vi) a combination of the preceding noise types. Overall, it was found that ensemble-PLS models trained on augmented data led to calibration models exhibiting higher accuracy and robustness on the test set against the six types of noise than ensemble-PLS models built solely on the original data. In addition, the authors observed that higher predictive ability was achieved when the training data comprised both the original and the perturbed data instead of the original data alone.


In the arena of protein motif finding from multiple sequence alignments, data augmentation was explored in combination with two deterministic and two stochastic algorithms.13 Data augmentation has also proved versatile when coupled to sampling and optimization14 and to deep learning algorithms.17 Nevertheless, the effect of data augmentation in QSAR has not been assessed yet. In this contribution, we explore the effect of data augmentation on the predictive power of QSAR models for molecules not included in the training phase. To this aim, we collected 8 diverse data sets from the literature and ChEMBL 1918 reporting compound activity as pIC50 values (Table 1) and explored four data augmentation scenarios (Figure 1). In all these scenarios, the original training data was replicated N times (N ∈ {0, 1, 2, 4, 6, 8, 10}) and perturbed with Gaussian noise, whose standard deviation, σnoise, was optimized with grid search and cross-validation. Therefore, the data on which the models were trained comprised the original training data and N replications thereof. The four data augmentation scenarios are defined by the type of perturbation applied to the replicated data, namely: (i) the replications were not perturbed, (ii) the pIC50 values of the replications were perturbed, (iii) the compound descriptors of the replications were perturbed, and (iv) both the pIC50 values and the descriptors were perturbed. To ensure that the conclusions do not depend on a given algorithm or compound description, we repeated all calculations using Random Forest (RF), Support Vector Machines (SVM), or Gradient Boosting Machines (GBM) as the learning algorithm, and either Morgan fingerprints or 1D and 2D physicochemical descriptors to encode the compounds. The performance of the models on the test set, quantified by the RMSE values for the observed against the predicted pIC50 values, served as a proxy to evaluate the effect of these four data augmentation strategies on the predictive ability of the models for new molecules.

METHODS

Data Sets. We gathered a total of 8 QSAR data sets from the literature (references given in Table 1) and from the ChEMBL database version 19.18 All data sets report compound potency as IC50 values. These values were modeled on a logarithmic scale (pIC50 = −log10 IC50). The size of the data sets ranges from 344 data points (human factor-7) to 779 (human muscarinic acetylcholine receptor M3, P20309). All data sets are provided in the Supporting Information.

Molecular Representation. The function StandardiseMolecules from the R package camb19 was used to normalize all chemical structures using the default options. This normalization step is crucial for the generation of compound descriptors, as the values of most of them (except for, e.g., heavy atom counts) depend on a consistent representation of molecular properties such as the aromaticity of ring systems, tautomer representation, or protonation states. Inorganic molecules were removed, whereas only the largest fragment of each molecule was kept in order to filter out counterions.

Compound Descriptors. Compounds were encoded with circular Morgan fingerprints20,21 calculated using RDKit (release version 2013.03.02).19,22 Morgan fingerprints encode compound structures by considering radial atom neighborhoods.20 The choice of Morgan fingerprints as compound descriptors was motivated by the high retrieval rates obtained with these fingerprints in benchmarking studies of compound descriptors.15,23 To calculate the fingerprints for a given compound, each substructure in that compound, with a maximal diameter of four bonds, was assigned an unambiguous identifier. Subsequently, these substructures were mapped into a 256-bit-long hashed array of counts. The position in the array to which each substructure was mapped was given by the modulo of the division of the substructure identifier by the fingerprint size. In addition to Morgan fingerprints, continuous compound descriptors were computed: a total of 188 1D and 2D physicochemical descriptors, calculated with RDKit (release version 2013.03.02).22
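The hashed count fingerprint described above can be illustrated with RDKit's Python API (a sketch, not the camb/R pipeline used in this study; the function name is hypothetical). A Morgan radius of 2 bonds corresponds to the maximal diameter of four bonds:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_count_fp(smiles, n_bits=256, radius=2):
    """Hashed Morgan count fingerprint; radius 2 = 4-bond maximal diameter."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetHashedMorganFingerprint(mol, radius, nBits=n_bits)
    counts = np.zeros(n_bits, dtype=int)
    # RDKit folds each substructure identifier into the array modulo n_bits.
    for bit, count in fp.GetNonzeroElements().items():
        counts[bit] = count
    return counts

print(morgan_count_fp("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```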

Data Preprocessing. The function RemoveNearZeroVarianceFeatures from the R package camb was used to remove those descriptors displaying near-constant values across all compounds (i.e., near-zero variance descriptors), using a cutoff value equal to 30/1.19,24,25 Subsequently, the remaining descriptors were centered to zero mean and scaled to unit variance (z-scores) with the function PreProcess from the R package camb.

Model Training. Grid search with 5-fold cross-validation (CV) was used to optimize the model parameters.26 The whole dataset was split into 6 folds of the same size by stratified sampling of the pIC50 values. In stratified sampling, the pIC50 values are split into groups, and random sampling is applied within these groups.25 Thus, the distributions of pIC50 values across folds are comparable. The main assumption here is that the distribution of pIC50 values in the dataset is likely to match the distribution of pIC50 values in the population. One fold (1/6 of the data) was withheld as the test set and served to assess the predictive power of the models. The remaining folds (5/6 of the data) constituted the training set and were used to optimize the values of the parameters in the following way. For each combination of parameter values in the grid, a model was trained on four folds from the training set, and the values for the remaining fold were then predicted. This procedure was repeated five times, each time holding out a different fold. The parameter values exhibiting the lowest average RMSE across these five repetitions were considered optimal. Subsequently, a model was trained on the whole training set using the optimized parameter values. The predictive power of this model was assessed on the test set by calculating the RMSE value for the observed against the predicted bioactivities. We repeated the grid search and 5-fold CV process five times (i.e., replicates) for all combinations of factor levels. The composition of the six folds indicated above was different in each replicate. In order to make the results comparable on a given dataset for a given replicate across algorithms, descriptor types, and data augmentation levels, all models were trained using the same fold composition.

Machine Learning Algorithms. Given a dataset D = {X; y}, where X = {x_i}_{i=1}^{n} is the set of compound descriptors and y = {y_i}_{i=1}^{n} is the vector of observed bioactivities (here, pIC50 values), the aim of supervised learning is to find the function (model) underlying D, which can then be used to predict the bioactivity of new data points, x_new. In the following subsections we briefly summarize the theory behind the algorithms explored in this study.
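The preprocessing and model-training protocol above can be sketched with scikit-learn (an illustrative Python analogue of the R/camb workflow described in the text, with placeholder data; the parameter grid shown is hypothetical). Binning the pIC50 values before splitting approximates the stratified sampling of a continuous response:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical arrays standing in for the descriptor matrix and pIC50 vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 188))
y = rng.uniform(4.0, 10.0, size=600)

# Approximate stratified sampling of a continuous response by binning the
# pIC50 values into quintiles before the split; 1/6 of the data is held out.
bins = np.digitize(y, np.quantile(y, [0.2, 0.4, 0.6, 0.8]))
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=1 / 6, stratify=bins, random_state=0)

# Near-zero-variance filtering and z-scoring, followed by the learner.
model = Pipeline([
    ("nzv", VarianceThreshold()),   # drops constant descriptors
    ("scale", StandardScaler()),    # center to zero mean, unit variance
    ("svm", SVR(kernel="rbf")),
])

# Grid search with 5-fold CV on the training set only; the best parameters
# are then automatically refit on the whole training set (refit=True).
grid = GridSearchCV(
    model,
    param_grid={"svm__C": [0.1, 1, 10, 100], "svm__gamma": [1e-3, 1e-2, 1e-1]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_tr, y_tr)

# Predictive power on the withheld test fold: RMSE of observed vs predicted.
rmse_test = np.sqrt(np.mean((grid.predict(X_te) - y_te) ** 2))
```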



Support Vector Machines. Kernel functions, statistical covariances,27 or simply kernels permit the computation of the dot product between two vectors, x, x′ ∈ X, in a higher-dimensional space, F (potentially of infinite dimension), without explicitly computing the coordinates of these vectors in F:

\langle \phi(x), \phi(x') \rangle = K(x, x')    (1)

where ϕ(x) is a mapping function from X to F, ϕ: X → F. This is known as the kernel trick.28 Thus, the value of the kernel K applied to the input vectors x and x′ is equivalent to their dot product in F. In practice, this permits the application of linear methods based on dot products, e.g. SVM, in F while using X in the calculations (thus not requiring the coordinates of the input data in F to be computed). This is computationally less expensive than the explicit calculation of the coordinates of X in F, which in some cases might not even be possible. The linear relationships learnt in F are generally nonlinear in the input space. Moreover, kernel methods are extremely versatile, as the same linear method, e.g. SVM, can learn diverse complex nonlinear functions in the input space because the functional form is controlled by the kernel, which in turn can be adapted to the data by the user.

SVM29,30 fit a linear model in a higher dimensional dot product feature space, F, of the form:

f(x|w) = \langle w^T \phi(x) \rangle + \alpha_0    (2)

which can equivalently be written as a kernel expansion over the training examples:

f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + \alpha_0 = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + \alpha_0    (3)

The optimization of the α_i values is usually performed by applying Lagrangian multipliers (dual formulation):

\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \alpha_i \alpha_j K(x_i, x_j)    (4)

subject to \sum_{i=1}^{n} \alpha_i y_i = 0 and 0 \le \alpha_i \le C. C is a regularization parameter that penalizes incorrect predictions during training. Thus, the larger the value of C, the larger this penalty. The formula for the radial kernel is:

K(x, x') = e^{-\|x - x'\|^2 / 2l^2}    (5)

where \|x - x'\|^2 is the squared Euclidean distance between x and x′ and l is the kernel length scale.
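For illustration, eq 5 can be written directly in Python/NumPy (a minimal sketch; scikit-learn's rbf_kernel provides the same computation under the parameterization γ = 1/(2l²)):

```python
import numpy as np

def radial_kernel(X1, X2, length_scale=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 l^2)), evaluated for all pairs."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

# Gram matrix of a training set with itself; diagonal entries equal 1.
X = np.random.rand(5, 3)
K = radial_kernel(X, X)
```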

Ensemble Methods. Ensemble methods use multiple simple models (base learners) to obtain a meta-model with higher predictive power than would be possible with any of the constituent models individually. Thus, building a model ensemble consists of (i) training individual models on (subsets of) the training examples and (ii) integrating them to generate a combined prediction. Although it is possible to build model ensembles using different machine learning algorithms (i.e., heteroensembles) as base learners, e.g. model stacking,31,32 homoensembles such as decision-tree-based ensembles are predominant in the literature. The following subsections briefly present the two ensemble methods used in this study.

Boosting: Gradient Boosting Machines (GBM). In boosting,31,33,34 the base learners, here and generally regression trees,31 are trained and combined sequentially. At each iteration, a new base learner is trained on the bth bootstrap sample, G_b, and added to the ensemble, trying to minimize the loss function associated with the whole ensemble. The loss function, Ψ, can be e.g. the squared error loss, i.e. the average of the squared training residuals:

\Psi(y, f) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2

The final model is given by:

f(x) = \sum_{b=1}^{B} w_b G_b(x)    (6)

where G_b(x) is the base learner trained on the bth bootstrap sample, w_b its weight, and B the total number of iterations and trees. The weight of a base learner, w_b, is thus proportional to its prediction accuracy. The update rule for the model can be written as:

f_b(x) = f_{b-1}(x) + \nu w_b G_b(x)    (7)

w_b = \arg\min_{w} \sum_{i=1}^{n} \Psi(y_i, f_{b-1}(x_i) + w G_b(x_i))    (8)

To minimize the risk of overfitting of GBM, several procedures have been proposed. The first one consists of training the individual base learners on bootstrap samples of smaller size than the training set. The size of these samples relative to that of the training set is controlled by the parameter bag fraction (η). The value of η was set to 0.1, as this value has proved suitable to reduce the noise sensitivity of GBM in QSAR.35 A second procedure is to reduce the impact of the last tree added to the ensemble on the minimization of the loss function by adding a regularization parameter, the shrinkage or learning rate (ν). The underlying rationale is that sequential improvement by small steps is better than improving the performance substantially in a single step. Likewise, the effect of an inaccurate learner on the whole ensemble is thus reduced. Another way to reduce overfitting is to control the complexity of the trees by setting the maximum number of nodes of the base learners with the parameter tree complexity (tc). Finally, the number of iterations, i.e. the number of trees in the ensemble (ntrees), also needs to be controlled: the training error decreases with the number of trees, but a high number of iterations might lead to overfitting.34
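The four parameters described above (η, ν, tc, and ntrees) can be made concrete with a minimal boosting loop for the squared error loss (an illustrative Python sketch, not the GBM implementation used in this study; the function names are hypothetical, and for squared loss the weight w_b of eq 8 is absorbed into the leaf values of the fitted tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_trees=500, nu=0.01, eta=0.1, tc=8, seed=0):
    """Gradient boosting for squared error: each tree fits the current
    residuals on a bootstrap sample of relative size eta (bag fraction);
    nu is the shrinkage and tc caps the terminal nodes per tree."""
    rng = np.random.default_rng(seed)
    f0 = y.mean()                      # initial constant model
    f = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(y), size=max(1, int(eta * len(y))), replace=True)
        residuals = y - f
        tree = DecisionTreeRegressor(max_leaf_nodes=tc)
        tree.fit(X[idx], residuals[idx])
        f = f + nu * tree.predict(X)   # small, shrunken improvement step
        trees.append(tree)
    return f0, trees

def gbm_predict(f0, trees, X, nu=0.01):
    return f0 + nu * sum(tree.predict(X) for tree in trees)
```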

Random Forest (RF). Like in bagging, RF36 builds an ensemble (forest) of regression trees and averages their predictions: