Benchmarking the Predictive Power of Ligand ... - ACS Publications

Jul 11, 2016 - The present study aims at filling the shortage of a thorough assessment of the predictive power of ligand efficiency indices in QSAR. T...
0 downloads 12 Views 974KB Size
Article Journal of Chemical Benchmarking the Information and Modeling is published by the American predictive power Chemical Society. 1155 Sixteenth Street N.W., of ligand efficiency Washington, DC 20036 Published by American indices in QSAR Chemical Society. Copyright © American Chemical

Isidro Cortes-Ciriano Society. However, no

J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.6b00136 • Journal of Chemical Information Publication Date (Web): 11 and Jul Modeling 2016 is published by the American

DownloadedChemical from http:// Society. 1155 Sixteenth Street N.W., pubs.acs.org on July 12, 2016

Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no

Just Accepted

Journal of Chemical “Just Accepted” manuscripts have been peer-re Information and Modeling is online prior to technical editing, formatting for pu published by the American Chemical Society. 1155 as a free Society provides “Just Accepted” Sixteenth Street N.W., dissemination of scientific material Washington, DC 20036as soon as p American appear in full in PDFPublished formatbyaccompanied by an Society. Copyright fully peer reviewed,Chemical but should not be considered © American Chemical readers and citableSociety. by theHowever, DigitalnoObject Identifie

to authors. Therefore, the “Just Accepted” Web in the journal. After a manuscript is technically Chemical as an ASAP Accepted” Web siteJournal and of published Information and Modeling is changes to the manuscript and/or graphic published by text the American Chemical Society. 1155 and ethical guidelines that apply to the journa Sixteenth Street N.W., or consequences arising from the Washington, DC 20036use of inform Published by American Chemical Society. Copyright © American Chemical Society. However, no

Activity inhibition(%)

100

R2test

NH2

11 composite 1.0 Page Journal 1 of of 41 Chemical Information and Modeling O

1 2 3 4 5

OH

3

50

parameters

f(IC50)

CH

0.5

IC50

data sets 0.0 ACS Paragon Plus29Environment 4 algorithms 2 descriptor types

0

3

4

5

6

log [Compound] (nM)

−0.5

pIC5 0 LELP NSEI

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Benchmarking the Predictive Power of Ligand Efficiency Indices in QSAR

Isidro Cortes-Ciriano1∗ (1) Institut Pasteur, Unité de Bioinformatique Structurale; CNRS UMR 3825; Département de Biologie Structurale et Chimie; 25, rue du Dr Roux, 75015 Paris, France. * To whom correspondance should be addressed: [email protected] Keywords : QSAR, ligand efficiency, lead optimization, bioactivity modelling.

∗ To

whom correspondence should be addressed

1 ACS Paragon Plus Environment

Page 2 of 41

Page 3 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Abstract Compound physicochemical properties favouring in vitro potency are not always correlated to desirable pharmacokinetic profiles. Therefore, using potency (i.e. IC50 ) as the main criterion to prioritize candidate drugs at early stage drug discovery campaigns has been questioned. Yet, the vast majority of the virtual screening models reported in the medicinal chemistry literature predict the biological activity of compounds by regressing in vitro potency on topological or physicochemical descriptors. Two studies published in this journal showed that higher predictive power on external molecules can be achieved by using ligand efficiency indices as the dependent variable instead of a metric of potency (IC50 ) or binding affinity (Ki ). The present study aims at filling the shortage of a thorough assessment of the predictive power of ligand efficiency indices in QSAR. To this aim, the predictive power of 11 ligand efficiency indices has been benchmarked across 4 algorithms (Gradient Boosting Machines, Partial Least Squares, Random Forest, and Support Vector Machines), 2 descriptor types (Morgan fingerprints, and physicochemical descriptors), and 29 data sets collected from the literature and ChEMBL database. Ligand efficiency metrics led to the highest predictive power on external molecules irrespective of the descriptor type or algorithm used, with an R2 test difference of

∼0.3 units, being this difference ∼0.4 units when modelling small data sets, and a normalized RMSE decrease of >0.1 units in some cases. Polarity indices, such as SEI and NSEI, led to higher predictive power than metrics based on molecular size, i.e. BEI, NBEI and LE. LELP, which comprises a polarity factor (cLogP) and a size parameter (LE) constantly led to the most predictive models, suggesting that these two properties convey complementary predictive signal. Overall, this study suggests that using ligand efficiency indices as the dependent variable might be an efficient strategy to model compound activity.

2 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Introduction The availability of large-scale bioactivity databases has fostered the development and customary usage of computational models to predict the biological activity of chemical substances at early stages of the drug discovery process. These databases store structural information of large sets of compounds and their in vitro activities, usually quantified with metrics of potency (IC50 ) or binding affinity (Ki ), on biological systems of diverse complexity, ranging from purified proteins to cell lines. The assumption that high in vitro potency towards an actionable target is likely to result in therapeutically active drugs at low dose regimens with little undesirable off-target activity, has made potency the main criterion to prioritize candidate drugs at early stage drug discovery campaigns. Nevertheless, this strategy has been questioned, as compound physicochemical properties favouring in vitro potency are not always correlated to desirable pharmacokinetic profiles. 1 For instance, it has been observed that the structural modifications required to increase the potency of hit compounds generally imply an increase in molecular weight. Given that some properties, such as solubility, depend on molecular weight, these modifications generally decrease compound druggability. 2

Therefore, finding the structural modifications required to transform a hit compound into a marketed drug represents a multiparameter optimization process where both compound physicochemical properties and potency need to be considered. Diverse quantitative metrics that relate compound potency to structural features (e.g. number of polar atoms), known as ligand efficiency indices or composite parameters, have been proposed in the medicinal chemistry literature. 1,3–6 Ligand efficiency indices normalize compound activity against physicochemical properties known to determine compound

3 ACS Paragon Plus Environment

Page 4 of 41

Page 5 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

druggability. Although there has been intense debate on the robustness of their underlying mathematical foundations, 7–13 their usefulness in lead optimization has been widely recognized. 2,8 For instance, Ligand Efficiency (LE), defined as LE =

∆Gbinding NHEA ,

was proposed to monitor the relationship between compound binding affinity and the number of heavy atoms (NHEA). Given that e.g. permeability and absorption deteriorate as a function of molecular size, LE can be used as a proxy to balance the potency and physicochemical properties of candidate drugs. Ligand efficiency metrics, their limitations and their application in drug discovery have been extensively reviewed in 1,3–6 , to which the interested reader is referred.

Quantitative Structure-Activity Relationship (QSAR) 14 modelling comprises those techniques that relate structural features of a set of compounds to their activity on a biological system of interest using machine learning algorithms. The ultimate objective is to apply the trained algorithms to large chemolibraries in order to find (generally structurally novel) compounds capable to modulate a biological system of interest (e.g. inhibit the activity of an enzyme). In practice, this is accomplished by regressing compound activity (i.e dependent variable) on compound descriptors (i.e independent variables or covariates), which can run the gamut from physicochemical properties (e.g. octanol-water partition coefficient) to binary fingerprints recording the presence or absence of a predefined set of chemical fragments. Although it is paramount to optimize both the potency and the physicochemical properties of lead compounds at all phases of the drug discovery process, in order to reduce the risk of failure during clinical trials, 3 most QSAR models reported in the medicinal chemistry literature use either IC50 or Ki /Kd values as the dependent variable.

4 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

In two recent studies published in this journal, 15,16 Sugaya explored the predictive power of ligand efficiency metrics in the context of bioactivity modelling. In a first study, 15 Sugaya collected two GPCR data sets from GPCRSARfari, 17 which comprised 40,826 IC50 (22,882 distinct compounds and 142 GPCRs) and 75,614 Ki values (33,541 distinct compounds and 136 GPCRs), respectively. A third data set was assembled from KinaseSARfari 17 by collecting IC50 data for 28,835 distinct compounds and 374 kinases, being the total number of datapoints 74,204. Support Vector Machine (SVM) classifiers were trained on the three data sets using compound and protein descriptors simultaneously, which is generally termed proteochemometric modelling (PCM). 18 For each data set, two classifiers were trained, using either IC50 / Ki values, or Binding Efficiency Index (BEI) 6,7,19 values as the dependent variable. Overall, models trained on BEI constantly displayed higher specificity on external compound-protein pairs irrespective of the compound and protein descriptors used. In a subsequent paper, 16 Sugaya tested the predictive power of BEI, Ligand Lipophilicity Efficiency (LLE), and Surface Efficiency Index (SEI) on three IC50 data sets, comprising GPCRs (78,114 datapoints), protein kinases (70,423) and ion channels (28,352), using Support Vector Regression (SVR) PCM models. As in the classification case, the highest predictive power on external molecules was obtained with (in this order) SEI, BEI, LLE and pIC50 values. Moreover, SVR models built on ligand efficiency metrics displayed higher predictive power on compound-protein systems displaying a dynamic range of pIC50 values.

Notwithstanding the new avenues opened by Sugaya’s results to better model compound activity, there are multiple ligand efficiency indices reported in the literature whose predictive power has not been thoroughly assessed yet in the arena of QSAR. 5 ACS Paragon Plus Environment

Page 6 of 41

Page 7 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

The present study aims at filling this shortage. In this contribution, the predictive power of 11 ligand efficiency indices has been benchmarked across 4 algorithms (Gradient Boosting Machines, Partial Least Squares, Random Forest, and SVR), 2 descriptor types (Morgan fingerprints and 1-D and 2-D physicochemical descriptors), and 29 data sets collected from the literature and ChEMBL database version 19. 17 A model was trained for all combinations of factor levels, using R2 values on the test set as a proxy to monitor the performance on new molecules. A factorial design was implemented to dissect the contribution of each factor level to the variability in predictive power obtained with models trained using either ligand efficiency indices or pIC50 values as the dependent variable. In line with previous studies, ligand efficiency metrics led to the most predictive models on external molecules irrespective of the descriptor type or algorithm used, with an R2 test increase of ∼0.3, and a normalized RMSE (NRMSE) decrease of

>0.1 units in some cases.

6 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Methods Data sets A total of 29 QSAR data sets were gathered from the literature (references given in Table 1) and from ChEMBL database version 19. 17 All data sets report compound potency as IC50 values. These values were converted to a base-10 logarithmic scale (pIC50 = −log10 IC50 ). To ensure the reliability of the bioactivity values extracted from ChEMBL, only IC50 values corresponding to small molecules and satisfying the following criteria were kept: (i) assay score confidence ≥ 8, and (ii) activity unit equal to ’nM’. The average pIC50 value was calculated when multiple IC50 values were annotated on the same compound. The size of the data sets range from 137 (P188089: Alpha-2B adrenergic receptor) to 4,662 (P35968: Vascular endothelial growth factor receptor 2) datapoints. All data sets are provided in the Supporting Information.

7 ACS Paragon Plus Environment

Page 8 of 41

Page 9 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Table 1: Data sets modelled in this study. These data sets can be found in the references indicated in the last two columns and are provided in the Supporting Information. The first column (Data set label) reports the data set labels used throughout the manuscript. The diversity of the data sets was assessed with respect to the number of data points, target class, and to the standard deviation of the distribution of the pIC50 values. Data source

Data set label

Biological endpoint

Datapoints

Activity range (pIC50 )

F7

Human factor-7

344

4.04 - 8.18

Bioactivity standard deviation (pIC50 ) 0.97

IL4

Human interleukin-4

599

4.67 - 8.30

0.57

MMP2

443

5.10 - 10.30

0.99

869

3.84 - 9.78

1.21

ChEMBL 19

1,651

4.00 - 10.00

1.31

ChEMBL 19

P03372

Human matrix metallopeptidase-2 Human tyrosine-protein kinase JAK2 Human serine/threonine-protein kinase Aurora-A Human estrogen receptor α

WO 2006/133426 A2 WO2005042521

908

4.00 - 9.70

1.32

ChEMBL 19

P04150

Human glucocorticoid receptor

1,182

4.00 - 10.40

1.15

ChEMBL 19

P06401

Human progesterone receptor

1,233

4.08 - 10.18

1.15

ChEMBL 19

P11229

Human muscarinic acetylcholine receptor M1 Human tyrosine-protein kinase SRC SRC Human selectin E

843

4.00 - 10.50

1.33

ChEMBL 19

2,719

2.30 - 9.92

1.38

ChEMBL 19

184

4.00 - 9.15

1.31

ChEMBL 19

580

4.00 - 9.40

1.23

ChEMBL 19

137

4.49 - 9.29

1.23

ChEMBL 19

1,018

4.00 - 10.68

1.15

ChEMBL 19

P21554

Rat 5-hydroxytryptamine receptor 1A Human cannabinoid receptor 1

1,392

4.05 - 10.03

1.49

ChEMBL 19

P24530

Human endothelin B receptor

1,030

4.00 - 10.52

1.32

ChEMBL 19

P25929

Human neuropeptide Y receptor type 1 Human 5-hydroxytryptamine receptor 2C Human MAP kinase ERK2

501

4.11 - 10.70

1.40

ChEMBL 19

926

4.00 - 9.30

1.08

ChEMBL 19

322

4.00 - 9.69

0.93

ChEMBL 19

Human vascular endothelial growth factor receptor 2 Human metabotropic glutamate receptor 5 Human FK506 binding protein 12 (mTOR) Human glucagon receptor

4,662

4.00 - 9.80

1.29

ChEMBL 19

1,381

4.00 - 9.40

1.14

ChEMBL 19

1,337

4.00 - 10.10

1.60

ChEMBL 19

944

4.00 - 9.30

1.04

ChEMBL 19

561

4.00 - 10.15

1.27

ChEMBL 19

P61169

Human neuropeptide Y receptor type 2 Rat D(2) dopamine receptor

1,968

4.00 - 10.22

1.02

ChEMBL 19

Q05397

Human focal adhesion kinase 1

416

4.00 - 9.30

1.02

ChEMBL 19

Q16602

Human calcitonin gene-related peptide type 1 receptor Human cyclin-dependent kinase 2 (CDK2) Human estrogen receptor β

660

4.00 - 11.00

1.48

ChEMBL 19

1,130

3.46 - 10.00

1.38

ChEMBL 19

799

4.00 - 9.46

1.14

ChEMBL 19

O60674 O14965

P12931 P16581 P17252 P18089 P19327

P28335 P28482 P35968 P41594 P42345 P47871 P49146

P24941 Q92731

Human protein kinase C (PKC) α Human α2B adrenergic receptor

8 ACS Paragon Plus Environment

US20050043313

Reference

20 20 20 17 17

17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 41

Molecular Representation The function StandardiseMolecules from the R package camb 21 was used to normalize all chemical structures using the default options. This normalization step is crucial for the generation of compound descriptors, as the value of most of them (except for e.g. heavy atom counts) depend on a consistent representation of molecular properties such as aromacity of ring systems, tautomer representation or protonation states. Inorganic molecules were removed, whereas the largest fragment was kept in order to filter out counterions.

Compound Descriptors Compounds were encoded with circular Morgan fingerprints 22,23 calculated using RDkit (release version 2014.09.2). 21,24 Morgan fingerprints encode compound structures by considering radial atom neighborhoods. 22 The choice of Morgan fingerprints as compound descriptors was motivated by the high retrieval rates obtained with these fingerprints in benchmarking studies of compound descriptors. 25,26 To calculate the fingerprints for a given compound, each substructure in that compound, with a maximal diameter of 4 bonds, was assigned to an unambiguous identifier. Subsequently, these substructures were mapped into a 256 bit-long hashed array of counts. The position in the array where each substructure was mapped was given by the modulo of the division of the substructure identifier by the fingerprint size.

In addition to Morgan fingerprints, continuous compound descriptors were computed. A total of 188 1-D and 2-D physicochemical descriptors were calculated with RDkit (release version 2014.09.2). 24 The list of physicochemical descriptors and the RDkit

9 ACS Paragon Plus Environment

Page 11 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

functions used for their computation are given in Table S1.

Data Preprocessing The function RemoveNearZeroVarianceFeatures from the R package camb was used to remove those descriptors displaying constant values across all compounds (i.e. nearzero variance descriptors) using a cut-off value equal to 30/1. 21,27,28 Subsequently, the remaining descriptors were centered to zero mean and scaled to unit variance (z-scores) with the function PreProcess from the R package camb.

Model Training Grid search with 5-fold cross validation (CV) was used to optimize the model parameters. 29 The whole data set was split into 6 folds of the same size by stratified sampling of the dependent variable on which the models were trained. This means that the composition of the folds changed for each of the 12 dependent variables used (eleven ligand efficiency indices and compound pIC50 values). In stratified sampling, the dependent variable is split into groups and random sampling is applied within these groups. 28 Thus, the distribution of the dependent variable across folds is comparable. The main assumption here is that the distribution of the dependent variable from the data set is likely to match the distribution of the population. One fold, 1/6, was withheld as the test set and served to assess the predictive power of the models, whereas the remaining folds, 5/6, constituted the training set. The training set was used to optimize the values of the parameters in the following way. For each combination of parameter values in the grid, a model was trained on 4 folds

10 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 41

from the training set, and the values for the remaining fold were then predicted. This procedure was repeated 5 times, each time holding out a different fold. The values of the parameters exhibiting the lowest average RMSE value across these 5 repetitions were considered as optimal. Subsequently, a model was trained on the whole training set, using the optimized values for the parameters. The predictive power of this model was assessed on the test set by calculating the R2 value for the observed against the predicted values for the dependent variable, which were either pIC50 or ligand efficiency metric values. The grid search and 5-fold cross validation process was repeated five times (i.e. replicates) for all combinations of factor levels. The composition of the six folds indicated above was different in each replicate.

The following definition of R2 was chosen, 30 which is suitable for any predictive model -linear and non linear-: 31,32

R2 = 1 −

∑iN=1 (yi − yei )2 ∑iN=1 (yi − y¯ )2

(1)

where ye and y¯ correspond to the predicted values and the mean value of the dependent variable for the molecules in the test set, respectively, and N to the number of molecules in this set.

Machine Learning Algorithms Given a data set D = {X; y} where X = {xi }in=1 is the set of compound descriptors, and y = {yi }in=1 is the vector of observed bioactivities (here pIC50 values or ligand efficiency indices), the aim of supervised learning is to find the function (model) underlying D, which can be then used to predict the bioactivity for new datapoints, xnew . The theory

11 ACS Paragon Plus Environment

Page 13 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

behind the algorithms explored in this study is briefly summarized in the following subsections.

Partial Least Squares Regression Partial least squares or projection to latent structures, 33 is a multivariate modeling technique capable to extract quantitative relationships from data sets where the number of descriptors, P, is much larger than the number of training examples, N. Multiple linear regression fails to model this type of data sets since for small (N/P) ratios, X is not a full rank matrix and its columns are likely to be collinear (the "small N large P problem"). 34 In Principal Components Regression (PCR), the principal components of X are taken as predictors, thus reducing P (dimensionality reduction) and the problem of multicollinearity. PLS extends this idea by simultaneously projecting both X and y to latent variables, with the constraint of maximizing the covariance of the projections of X and y. The reader is referred to Abdi and Williams 35 for further details on PLS.

Support Vector Machines Kernel functions, statistic covariances, 36 or simply kernels permit the computation of the dot product between two vectors, x, x′ ∈ X, in a higher dimensional space, F (potentially of infinite dimension), without explicitly computing the coordinates of these vectors in F:

hφ(x), φ(x’)i = K (x, x’)

(2)

where φ(x) is a mapping function from X to F, φ : X → F. This is known as the kernel trick. 37 Thus the value of the kernel K applied on the input vectors x and x’ is equivalent

12 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 41

to their dot product in F. In practice, this permits to apply linear methods based on dot products, e.g. SVM, in F while using X in the calculations (thus, not requiring to compute the coordinates of the input data in F). This is computationally less expensive than the explicit calculation of the coordinates of X in F, which in some cases might not even be possible. The linear relationships learnt in F are generally non-linear in the input space. Moreover, kernel methods are extremely versatile, as the same linear method, e.g. SVM, can learn diverse complex non-linear functions in the input space because the functional form is controlled by the kernel, which in turn can be adapted to the data by the user.

SVM 38,39 fit a linear model in a higher dimensional dot product feature space, F, of the form: f (x|w) = hwT φ(x)i + α0

(3)

where w is a vector of weights ∈ F. The kernel trick can be applied if w can be expressed as a linear combination of the training examples, namely w = ∑in=1 αi φ(xi ). Given the definition of kernel given above, Eq. 3 can be rewritten as: n

f (x) =

n

∑ αi yi hφ(xi ), φ(x)i + α0 =

∑ α i y i K ( xi , x )

i =1

i =1

+ α0

(4)

The optimization of the αi values is usually performed by applying Lagrangian multipliers (dual formulation): n

max ∑ αi − α

i =1

1 n n ∑ y i y j α i α j K ( xi , x j ) 2 i∑ =1 j =1

(5)

subject to ∑in=1 αi yi = 0 and 0 ≤ αi ≤ C. C is a regularization parameter that penalizes 13 ACS Paragon Plus Environment

Page 15 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

for incorrect predictions during training. Thus, the larger the value of C, the larger this penalty. The formula for the radial kernel is: ′

Radial Kernel: K (x, x ) = e

′ 2

− ||x−x2 || 2l

(6)

where ||x − x′ ||2 is the squared Euclidean distance, and xT the transpose of x.

Ensemble Methods Ensemble methods use multiple weak simple models (base learners) to obtain a metamodel with higher predictive power than it would be possible to obtain with any of the models used individually. Thus, building a model ensemble consists of (i) training individual models on (subsets) of the training examples, and (ii) integrating them to generate a combined prediction. Although it is possible to build model ensembles using different machine learning algorithms (i.e. heteroensembles) as base learners, e.g. model stacking, 40,41 homoensembles such as decision tree-based ensembles are predominant in the literature. The following subsection briefly presents the two ensemble methods used in this study. Boosting: Gradient Boosting Machines (GBM) In boosting, 40,42,43 the base learners, here and generally regression trees, 40 are trained and combined sequentially. At each iteration, a new base-learner is trained on the bth bootstrap sample, Gb , and added to the ensemble trying to minimize the loss function associated to the whole ensemble. The loss function, Ψ, can be e.g. the squared error loss, i.e. the average of the square of

14 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 41

the training residuals: Ψ(y, f ) = n1 ∑in=1 (yi − f (xi ))2 . The final model is given by: B

f (x) =

∑ wb Gb (x)

(7)

b =i

where Gb (x) is the base learner trained on the bth bootstrap sample, wb its weight, and B the total number of iterations and trees. The weight for a base learner, wb , is, thus, proportional to its prediction accuracy. The update rule for the model can be written as:

f b (x) = f b−1 (x) + ν wb Gb (x); 0 < ν ≤ 1

(8)

where f b (x) corresponds to the ensemble at iteration b, and ν to the learning rate (see below). Steepest gradient descent is applied at each iteration to optimize the weight wb for the new base learner as follows: n

wb = min ∑ Ψ(yi , f b−1 (x) + Gb (x)) w

(9)

i =1

To minimize the risk of overfitting of GBM, several procedures have been proposed. The first one consists of training the individual base learners on bootstrap samples of smaller size than the training set. The relative size of these samples with respect to that of the training set is controlled by the parameter bag fraction or η. The value of η was set to 0.1, as this value has proved a suitable to reduce the noise sensitivity of GBM in QSAR. 44 A second procedure is to reduce the impact of the last tree added to the ensemble on the minimization of the loss function by adding a regularization parameter, shrinkage or learning rate (ν). The underlying rationale is that sequential improvement by small steps is better than improving the performance substantially in a single step. 15 ACS Paragon Plus Environment

Page 17 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Likewise, the effect of an inaccurate learner on the whole ensemble is thus reduced. Another way to reduce overfitting is to control the complexity of the trees by setting the maximum number of nodes of the base learners with the parameter tree complexity (tc ). Finally, the number of iterations, i.e. number of trees in the ensemble (ntrees ), also needs to be controlled. The training error decreases with the number of trees, although a high number of iterations might lead to overfitting. 43

Random Forest (RF)

Like in bagging, RF 45 build an ensemble (forest) of regression

tress and average their predictions:

f (x) =

1 B TFb (x) B b∑ =i

(10)

where TFb (x) corresponds to the tree base learner in the forest trained on the bth bootstrap sample. The difference with bagging is that the node splitting is performed using only a subset of descriptors randomly chosen. This additional level of randomization decorrelates the trees in the forest leading, in practice, to a predictive power comparable to boosting. 40 In QSAR, RF have been shown to be robust with respect to the parameter values. In practice, a suitable choice of the number of trees (ntrees ) was shown to be 100, as higher values do not generally lead to significantly higher predictive power. 46–48

Machine Learning Implementation Machine learning models were built in R using the wrapper packages caret 27 and camb. 21 The following R packages were used to train the machine learning algorithms considered here: (i) pls 49 for Partial Least Squares (PLS), 33 (ii) kernlab 50 for Support

16 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 41

Vector Machines (SVM), 39 (iii) gbm 51 for Gradient Boosting Machines (GBM), 52 and (iv) randomForest 53 for Random Forest (RF). 45 The values of the parameters used in CV are reported in Table 2. Table 2: Algorithms benchmarked in this study. The third column indicates the parameters that were tunned using grid search and cross-validation (CV). The default values were used for those parameters not indicated therein. Learning Paradigm Linear Kernel Ensemble: boosting

Algorithm Partial Least Squares (PLS) Support Vector Machines Radial Kernel (SVM Radial) Gradient Boosting Machines (GBM)

Ensemble: bagging

Random Forest (RF)

Parameters and values used in CV σ ∈ {2−10 , 2−4 ..22 , 24 }; C ∈ {2−10 , 2−4 ..22 , 24 , 10, 100} Learning rate (ν) ∈ {0.04, 0.08, 0.12, 0.16}; ntrees : 500; tree complexity (tc ): 25; bag fraction (η): 0.1 ntrees : 500

ref 33

38 43,52

45

Ligand efficiency metrics or ligand efficiency indices The formulae for the ligand efficiency indices explored in this study are given below. Further details can be found in the references adjacent to the name of each metric. • Half maximal inhibitory concentration (IC50 ) % Activity = Ein f +

E0 − Ein f 1+(

Comp. conc. Slope ) IC50

(11)

where Ein f and E0 are the lower and upper activity plateaus, Comp. conc. is the compound concentration tested, and Slope corresponds to the slope parameter than controls the steepness of the linear part of the curve. A value of 0 for Ein f indicates 0% enzymatic activity after incubation with a given test compound. The parameter IC50 (half maximal inhibitory concentration) represents the compound concentration required to reach a response midway between the lower (E0 ) and 17 ACS Paragon Plus Environment

Page 19 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

upper (Ein f ) plateaus. IC50 is generally defined as the compound concentration required to inhibit the activity of an in vitro biological system by 50% (e.g. enzymatic activity or cell proliferation). Thus, IC50 values are used to quantify the activity of antagonist compounds. • Ligand Efficiency (LE) 5,7 LE =

(1.4 ∗ pIC50 ) NHEA

(12)

where NHEA corresponds to the number of non-hydrogen atoms. • Binding efficiency index (BEI) 6,7,19

BEI =

pIC50 MW (KDa)

(13)

where MW corresponds to the molecular weight. • Surface binding efficiency index (SEI) 7,19

SEI =

pIC50 TPSA 1002

(14)

where TPSA corresponds to the Topological Surface Area. • NSEI 6,7 NSEI =

pIC50 NPOL( N,O)

(15)

where NPOL corresponds to the number of Nitrogen and Oxygen atoms. • NBEI 6,7 NBEI =

pIC50 NHEA

18 ACS Paragon Plus Environment

(16)

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 41

• nBEI 6,7 IC50 NHEA

(17)

IC50 MW (KDa)

(18)

nBEI = −log10 • mBEI 6,7 mBEI = −log10 • Lipophilic efficiency index (LipE) 7

LipE = pIC50 − cLogP

(19)

where cLogP corresponds to computed octanol-water partition coefficient (LogP) values. • Ligand efficiency dependent lipophilicity (LELP) 7

LELP =

cLogP LE

(20)

• Ligand lipophilicity index (LLEAT ) 4,7 LLE AT = 0.11 −

ln(10) ∗ RT ∗ (clogP − pIC50 ) NHEA

(21)

• Fit quality (FQ) 4,7 FQ =

LE LEscale

where LEscale is the fit quality scaled ligand efficiency parameter. 4

19 ACS Paragon Plus Environment

(22)

Page 21 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Experimental Design To benchmark the predictive power of the set of ligand efficiency metrics described above, a balanced fixed-effect full-factorial experiment with replications was designed. 54 The following five factors were considered, namely: (i) Data set: 29 (Table 1), (ii) Algorithm: GBM, PLS, RF and SVM Radial, (iii) Descriptor type: Morgan fingerprints and physicochemical descriptors, (iv) Dependent variable: 11 ligand efficiency metrics and pIC50 values. This factorial design was studied with the following linear model: R2i,j,k,l,m,n test = Data seti + Descriptor type j + Algorithmk +

(23)

Dependent variablel + ( Algorithm ∗ Dependent variable)kl +

( Data set ∗ Dependent variable)il + ( Data set ∗ Descriptor type)ij + ( Data set ∗ Algorithm)ik + µ0 + ǫi,j,k,l,m

(i ∈ {1, ..., NData sets = 29}; j ∈ {1, ..., NDescriptor type = 2}; k ∈ {1, ..., NAlgorithm = 4}; l ∈ {1, ..., NDependent variable = 12}; m ∈ {1, ..., Nresamples = 100}) where the response variable, R2i,j,k,l,m,n test , corresponds to the R2 values on the test set. Data seti , Descriptor type j , Algorithmk , and Dependent variablel are the main effects. The terms Algorithm ∗ Dependent variable, Data set ∗ Dependent variable, Data set ∗ Descriptor type, and Data set ∗ Algorithm correspond to the interaction terms. The factor levels ’GBM’ (Algorithm), ’F7’ (Data set), ’Morgan fingerprints’ (Descriptor type), and ’pIC50 ’ (Dependent variable), were used as reference factor levels to calculate the in20 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 41

tercept term of the linear model, µ0 , which corresponds to the mean R2 test value for this combination of factor levels. Therefore, this combination of factor levels corresponds to training a GBM QSAR model on the data set F7, where compounds are encoded with Morgan fingerprints and their activity quantified with pIC50 values. The coefficients (slopes) for the other combinations of factor levels, correspond to the difference between their mean R2 test value and the intercept. The error term, ǫi,j,k,l,m , corresponds to the random error of each R2 test value, which are defined as: ǫi,j,k,l,m = R2i,j,k,l,m test − R2i,j,k,l . These errors are assumed to (i) be mutually independent, (ii) have zero expectation, and (iii) have constant variance.

The interaction terms were introduced to assess the interaction effects between two factors, e.g. Algorithm and Dependent variable. There is no interaction between two factors when the difference of the mean values of the dependent variable in the linear model (in this case R2 ) across the levels of a factor (e.g. Algorithm) does not change across the levels of a second factor (e.g. Dependent variable). This can be depicted in an interaction plot by plotting the levels of the factor Algorithm against the R2 values across the levels of the other factor, i.e. Dependent variable. In the absence of interaction, the lines connecting the points corresponding to the R2 values for the levels of the factor Dependent variable along the levels of the factor Algorithm would be parallel. By contrast, in the presence of interaction, the difference in R2 between the levels of the factor Algorithm would vary across the levels of the factor Dependent variable. Therefore, the performance of a given algorithm would depend on the type of dependent variable used, and the lines in the interaction plot would not be parallel. Consequently, it would not be possible to conclude about the effect of a single independent variable (e.g. factor Algorithm) on the R2 (termed main effects), as the R2 values would depend on the level 21 ACS Paragon Plus Environment

Page 23 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

of other factors (e.g. Dependent variable).

A model was trained for all combinations of factor levels, giving rise to 13,920 models (29 data sets x 4 algorithms x 2 descriptor types x 12 dependent variables x 5 replications). The predictive power of the models was assessed on the test set and quantified by the R2 value, R2 i,j,k,l,m=1 test , for the observed against the predicted values of compound activity (i.e. pIC50 or ligand efficiency metric). Bootstrapping 55 was used to generate 100 resamples (Nreplications ) for these R2 i,j,k,l,m,n test values (i.e. R2 i,j,k,l,m,n=2:100 test ), thus ensuring a balanced experimental design. Therefore, the total number of observations considered in the linear model was 1,392,000 (13,920 trained models x 100 resamples each). The normality and homoscedasticity assumptions of the linear model were respectively assessed with (i) quantile-quantile (Q-Q) plots, and (ii) by visual inspection of the R2 test distributions, and by plotting the fitted Rtest 2 values against the residuals. 54 Homoscedasticity means that the residuals are evenly distributed (i.e. equally dispersed) across the range of the dependent variable used in the linear model; in this case Rtest 2 values. It is necessary to test this condition to guarantee that the modelling errors (i.e. residuals) and the dependent variable are not correlated. If the residuals showed a systematic bias, this would imply that the residuals would not be consistent with random error. In other words, it would mean that the errors are not random and that they contain predictive information that should be included in the model. 56

22 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 41

Results and Discussion The fitted linear model displayed an R2 value adjusted by the complexity of the model, i.e. number of parameters, of 0.60, and a standard error for the residuals of 0.09. A significant interaction was found between the factors (i) Dependent and Algorithm, (ii) Dataset and Dependent, (iii) Dataset and Descriptor type, and (iv) Dataset and Algorithm (Pvalue < 0.01) (Figure 1). These results indicate that the predictive power of the models varies across factor levels (i.e. the predictive power of a given ligand efficiency metric varies depending on the data set, algorithm and descriptor type used). Figure 2A shows the distributions of R2 test values for the 4 algorithms considered across all data sets, descriptor types, and ligand efficiency metrics. Figure 2B reports the distributions of R2 test values for the 29 data sets across all descriptor types, algorithms, and ligand efficiency indices, whereas Figure 2C shows the R2 test distributions for the 12 ligand efficiency indices explored across all data sets, descriptor types and algorithms. The estimated values for the coefficients of the linear model, namely slopes and intercept, and their P-values are reported in Table S2, whereas Figure 3 shows the verification of the model assumptions. Figures S1 to S5 report the distribution of the ligand efficiency indices across the 29 data sets used in this study, whereas Figures S6 to S34 show the correlation between the pIC50 values and each of the 11 ligand efficiency metrics for the 29 data sets. R2 test (± 1 SD), RMSEtest (± 1 SD), and normalized RMSEtest (± 1 SD) values by the range of the dependent variable (i.e.

RMSEtest ) max (y)−min(y)

for all models

are given in Table 3 and in the Supporting Information.

GBM, RF and SVM models displayed comparable performance across the 12 depen-

23 ACS Paragon Plus Environment

Page 25 of 41

1.0

A Pearson Cor. Coeff. (R2)

0.9

0.8

0.7

0.6

0.15

0.10

0.05

F7 IL M 4 M O P2 14 O 965 60 6 P0 74 33 P0 72 41 P0 50 64 P1 01 12 P1 29 29 P1 31 65 P1 81 72 P1 52 80 P1 89 93 P2 27 15 P2 54 45 P2 30 49 P2 41 59 P2 29 83 P2 35 84 P3 82 59 P4 68 15 P4 94 23 P4 45 78 P4 71 91 P6 46 1 Q 169 05 Q 397 16 Q 602 92 73 1

Normalized RMSE (NRMSE)

B

pIC50

BEI

NSEI

nBEI

LipE

LLE_AT

LE

SEI

NBEI

mBEI

LELP

FQ

Dependent

0.9

C

0.8

0.7

FQ

AT E_ LL

LP LE

pE Li

I BE m

EI nB

I BE N

SE

I

SVM_radial

N

RF

SE I

LE

pI C

BE I

Algorithm GBM PLS

0.6

50

Pearson Cor. Coeff. (R2)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 1: Interactions plot. (A) Mean R2 test (A) and NRMSE (B) values for the 29 data sets across all types of dependent variables, algorithms and descriptors. (C) Mean R2 test values for the 12 dependent variables used (eleven ligand efficiency metrics and pIC50 values) across the 4 levels of the factor Algorithm.

24 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

Pearson Cor. Coeff. (R2)

A 1.0 0.5

0.0

−0.5

GBM

Pearson Cor. Coeff. (R2)

B

PLS

RF

SVM_radial

1.0

0.5

0.0

−0.5

F7 IL4 P2 965 674 372 150 401 229 931 581 252 089 327 554 530 941 929 335 482 968 594 345 871 146 169 397 602 731 M M O14 O60 P03 P04 P06 P11 P12 P16 P17 P18 P19 P21 P24 P24 P25 P28 P28 P35 P41 P42 P47 P49 P61 Q05 Q16 Q92

C 1.0 Pearson Cor. Coeff. (R2)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 41

0.5

0.0

−0.5

50

pIC

LE

I

BE

I

SE

EI

NS

EI

NB

EI

nB

EI

mB

E

Lip

LP E_AT LL

LE

FQ

Figure 2: Violin plots. A. R2 test values for the 4 algorithms considered in this study across all data sets, descriptor types and dependent variables (ligand efficiency metrics and pIC50 ). B. R2 test values for the 29 data sets considered across all descriptor types, algorithms, and dependent variables. C. R2 test values for the 12 dependent variables across all data sets, descriptor types, and algorithms. Blue points indicate the median and the interquartile range (25th-75th percentile), whereas red points indicate the mean R2 test value. The predictive power of GBM, RF and SVMradial does not vary significantly, with mean R2 test values close to 0.9, whereas the mean R2 test value obtained with PLS models dropped to 0.75 (A.). Mean R2 test values for the 29 data sets are in the 0.75-0.90 range. It can be seen that, overall, using ligand efficiency metrics as the dependent variable leads to higher predictive power in comparison to models trained using pIC50 values, with R2 test values in the 0.75-0.95 range, except for models trained using the ligand efficiency metrics nBEI and mBEI,25 which led to mean R2 test values of -0.75.

ACS Paragon Plus Environment

Page 27 of 41

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 41

Table 3: Mean normalized RMSE (NRMSE) and R2 values on the test. In the case of NRSME, only values lower than 1 were considered, as some resamples displayed non-representative values due to bootstrapping. R2 test (± 1 SD), RMSEtest (± 1 SD), and NRMSEtest (± 1 SD) values for all models are given in the Supporting Information. P values calculated using the Student’s t-test (two-tailed) between the distributions of NRMSE and R2 test values corresponding to models trained: (∗ ) on the ligand efficiency metric indicated in each row and pIC50 values, (∗∗ ) on morgan fingerprints and physicochemical descriptors, and († ) using the algorihtm indicated in each row and Random Forest. Factor Dependent∗

Descriptor type∗∗ Algorithm†

Factor level LELP SEI NSEI LLEAT LipE FQ BEI LE NBEI mBEI nBEI pIC50 Morgan fingerprints Physicochemical descriptors Random Forest SVM Radial Gradient Boosting Machines Partial Least Squares

mean NRMSEtest (P value) 0.05 (2.9x10−51 ) 0.07 (1.2x10−64 ) 0.07 (2.4x10−93 ) 0.08 (2.7x10−80 ) 0.08 (1.6x10−43 ) 0.09 (1.1x10−21 ) 0.09 (3.1x10−29 ) 0.09 (2.3x10−33 ) 0.09 (9.2x10−29 ) 0.13 (3.3x10−2 ) 0.13 (3.8x10−1 ) 0.13 (-) 0.09 (0.05) 0.09 (0.05) 0.08 (-) 0.09 (0.60) 0.08 (0.93) 0.12 (2.5x10−4 )

SD NRMSEtest 0.07 0.04 0.03 0.02 0.04 0.07 0.05 0.05 0.05 0.03 0.04 0.03 0.04 0.06 0.03 0.04 0.03 0.08

mean R2test (P value) 0.91 (3.8x10−57 ) 0.85 (2.2x10−24 ) 0.88 (5.6x10−40 ) 0.84 (2.3x10−23 ) 0.83 (1.3x10−22 ) 0.85 (1.3x10−27 ) 0.84 (4.8x10−24 ) 0.83 (1.1x10−21 ) 0.82 (4.5x10−18 ) 0.74 (1.7x10−1 ) 0.74 (4.0x10−1 ) 0.73 (-) 0.80 (1.0x10−22 ) 0.84 (1.0x10−22 ) 0.85 (-) 0.83 (2.1x10−5 ) 0.86 (0.17) 0.73 (3.8x10−78 )

SD R2test 0.06 0.11 0.08 0.08 0.08 0.08 0.08 0.08 0.09 0.12 0.12 0.13 0.11 0.10 0.08 0.09 0.08 0.13

dent variables considered (Figure 2C), with mean R2 test values in the 0.75-0.95 range, followed by PLS, with mean R2 test values in the 0.58-0.88 range. Models trained on physicochemical descriptors exhibited slightly higher predictive power than models trained on Morgan fingerprints (Table 3). Model performance did not vary significantly as a function of the size of the Morgan fingerprints (Figures S35 to S38), thus supporting the array length used here, namely 256. Overall, using ligand efficiency metrics as the dependent variable constantly led to models with higher predictive power with respect to models trained on pIC50 values, with an R2 test increase of ∼0.3 units and an NRMSE decrease of >0.1 units in some cases, e.g. P19327 and QO5397 (Figure 1A,B). The ligand efficiency metric leading to the highest predictive power varied across

27 ACS Paragon Plus Environment

Page 29 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

data sets, as evidenced by the significant interaction between the factors Data set and Dependent variable (Table S2) and the non-parallel lines in Figure 1A,B. Overall, these results indicate that using ligand efficiency metrics as the dependent variable constantly leads to increased predictive power on new molecules.

The lowest mean R2 test values were obtained with models trained on nBEI, mBEI and pIC50 values (Figure 2C and Table 3). This was expected, as in the data sets used here pIC50 , nBEI and mBEI values are highly correlated (see panels E and H in Figures S6 to S34). Models trained on LELP, SEI, NSEI, LLEAT , FQ, BEI and LE values led, in this order, to the highest mean R2 test values. This trend was observed for small and large data sets (Figure 1A,B), and when modelling increasingly larger fractions of the data set P35968 (Figure 4). Interestingly, the R2 test difference between LELP, NSEI and SEI, with respect to pIC50 values, is higher for small data sets. The highest mean R2 test values were obtained with models trained on LELP for 21 out of 29 data sets (dark green squares in Figure 1A,B), which are clearly anticorrelated to pIC50 values in some cases, namely F7 (Figure S6) and P19327 (Figure S11). Polarity indices, such as SEI and NSEI, carry higher predictive signal than ligand efficiency metrics based on molecular size, i.e. BEI, NBEI and LE (Figure 1A,B), hinting at the importance of polarity factors in driving the optimization of medicinal chemistry towards oral bioavailability. Notably, LELP comprises a polarity factor (cLogP) and a size parameter (LE), suggesting that these two properties convey complementary predictive signal. Transforming IC50 values into ligand efficiency metrics represents changing the landscape of the dependent variable, which implies that the distances between compounds in bioactivity space are altered. For instance, it can be seen in the panel J of Figures S6 to S34 that NSEI stratifies compounds displaying the same IC50 value into planes that comprise 28 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment

Page 30 of 41

Page 31 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(RSS). RMSE values, generally used to compare models trained on the same dependent q RSS variable, correspond to the standard deviation of the residuals, namely: N . The denominator in Eq. 1, i.e. the total sum of squares (TSS), quantifies the variation of the observations, e.g. pIC50 values. Therefore, two models with the same predictive power (i.e. with the same variation in the residuals, and thus leading to comparable RMSE values on external data) can lead to different R2 values if the range of the dependent variable is changed, as this would imply a change of the ratio RSS / TSS. 31 This issue has been reported theoretically 31 and in practice. 48 In order to assert that higher predictive power can be attained using ligand efficiency metrics instead of pIC50 values as the dependent variable, there must be a decrease of the RSS for a comparable TSS value. Notably, the predictive power of models trained using the set of ligand efficiency metrics considered here, while keeping all other factor levels constant (i.e. algorithm, descriptor type and data set), was not correlated to the range of the dependent variable. For instance, R2 test values obtained with models trained on LE, NBEI and FQ values, which are in the 0 - 1 range (panels A, B and I in Figures S6 to S34 ), are higher than those obtained with models trained on pIC50 values, which are in the 4 - 10 range (Table 3). This result indicates that the increase in predictive power does not arise from reducing the TSS of squares while keeping the variance of the residuals constant (see Figure 1 in ref.

31 ).

This is also supported by the fact that the lowest mean NRMSEtest values

are obtained with models trained using LELP, SEI and NSEI (Table 3). It is paramount to note that comparable results where obtained when monitoring model performance with R2 test (correlation-based metric) and with NRMSE (error-based metric) values. 56

To test whether the higher predictive power of models trained on ligand efficiency 30 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 32 of 41

indices with respect to those trained on pIC50 values depends on the fold composition used in CV, an alternative modeling pipeline was designed. In this case, the fold composition was defined by stratified sampling of the pIC50 values. Therefore, the parameter values for all models corresponding to each combination of the factor levels Algorithm, Decriptor type and Data set were optimized through CV using the same folds, irrespective of the dependent variable used. That is, the folds defined using stratified sampling on the pIC50 values were also used when using ligand efficiency metrics as the dependent variable. Five models (replications) were trained for all combinations of factor levels. The same experimental design was followed as described in Subsection Experimental Design. The fitted linear model displayed an adjusted R2 value of 0.81, and a standard error for the residuals of 0.24 (Figure S39). Collectively, the performance of the models trained on ligand efficiency indices and on pIC50 values was comparable across all data sets, descriptor types and algorithms (Figures S40 and S41). Therefore, these results indicate that the fold composition needs to be defined based on the dependent variable used to trained the models in order to fully exploit the higher predictive power of ligand efficiency indices. These results come as no surprise, as small samples (i.e. folds used to optimize the values of the parameters in CV) drawn at random from the medium-sized data sets used here might not reflect the distribution of the population (i.e. compound activities comprised in the whole data set) as accurately as samples drawn using stratified sampling (see Subsection Model Training).

Conclusions In this study, the predictive power of QSAR models trained on 11 ligand efficiency indices was compared to that of models trained on pIC50 values across 29 data sets, 2 de31 ACS Paragon Plus Environment

Page 33 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

scriptor types (Morgan fingerprints and physicochemical-property-based descriptors), and 4 machine learning algorithms (GBM, PLS, RF and, SVM radial) using a multifactorial experiment. R2 test values were used as a proxy to quantify the predictive power of the models. Overall, using ligand efficiency indices as the dependent variable led to models displaying higher predictive power on new molecules than models trained using pIC50 values, with an R2 test increase of ∼ 0.3 and and an NRMSE decrease of

>0.1 units in some cases. LELP, which comprises a polarity factor (cLogP) and a size parameter (LE) constantly led to the most predictive models, suggesting that these two properties convey complementary predictive signal. However, the performance of ligand efficiency metrics significantly varied across data sets, thus suggesting that several indices should be explored to achieve the highest improvement brought about by using ligand efficiency indices as the dependent variable.

Notably, this increase in predictive power did not stem from increasing the range of the dependent variable (TSS) while keeping the residuals constant (RSS), as models trained on ligand efficiency indices displaying a narrow range of values (e.g. NBEI, LE or FQ) outperformed (i.e. lower NRMSEtest and higher R2 test values) models trained on pIC50 values. Therefore, the distributions of ligand efficiency indices seem to facilitate the learning process, as previously pointed out by Sugaya. 16 Collectively, the results of this study suggest that using ligand efficiency indices as the dependent variable might be an efficient strategy to improve the predictive power on new molecules of chemical-structure activity models, while also permitting to monitor the relationship between the potency and the physicochemical properties of candidate drugs.

32 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 34 of 41

Acknowledgments ICC thanks the Institut Pasteur for funding.

Supporting Information The Supporting Information, including the data sets used in this study and the statistical metrics (R2 test , RMSEtest and NRMSEtest values) for all models (2,784 = 12 dependent variables x 4 algorithms x 29 data sets x 2 descriptor types) is available free of charge on the ACS Publications website.

References (1) Gleeson, M. P.; Hersey, A.; Montanari, D.; Overington, J. Probing the Links Between in Vitro Potency, ADMET and Physicochemical Parameters. Nat. Rev. Drug Discov. 2011, 10, 197–208. (2) Perola, E. An Analysis of the Binding Efficiencies of Drugs and Their Leads in Successful Drug Discovery Programs. J. Med. Chem. 2010, 53, 2986–2997. (3) Hopkins, A. L.; Keseru, G. M.; Leeson, P. D.; Rees, D. C.; Reynolds, C. H. the Role of Ligand Efficiency Metrics in Drug Discovery. Nat. Rev. Drug Discov. 2014, 13, 105–121. (4) Reynolds, C. H.; Tounge, B. A.; Bembenek, S. D. Ligand Binding Efficiency: Trends, Physical Basis, and Implications. J. Med. Chem. 2008, 51, 2432–2438. (5) Hopkins, A. L.; Groom, C. R.; Alex, A. Ligand Efficiency: A Useful Metric for Lead Selection. Drug Discov. Today 2004, 9, 430 – 431. 33 ACS Paragon Plus Environment

Page 35 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(6) Abad-Zapatero, C.; Perisic, O.; Wass, J.; Bento, A. P.; Overington, J.; Al-Lazikani, B.; Johnson, M. E. Ligand Efficiency Indices for an Effective Mapping of ChemicoBiological Space: The Concept of an Atlas-like Representation. Drug Discov. Today 2010, 15, 804 – 811. (7) Shultz, M. D. Setting Expectations in Molecular Optimizations: Strengths and Limitations of Commonly Used Composite Parameters. Bioorg. Med. Chem. Lett. 2013, 23, 5980 – 5991. (8) Reynolds, C. H. Ligand Efficiency Metrics: Why All the Fuss? Future Med. Chem. 2015, 7, 1363–1365. (9) Shultz, M. D. Improving the Plausibility of Success with Inefficient Metrics. ACS Med. Chem. Lett. 2014, 5, 2–5. (10) Kenny, P. W.; Leitão, A.; Montanari, C. A. Ligand Efficiency Metrics Considered Harmful. J. Comput. Aided Mol. Des. 2014, 28, 699–710. (11) Oprea, T. I. Current Trends in Lead Discovery: Are We Looking for the Appropriate Properties? J. Comput. Aided Mol. Des. 2002, 16, 325–334. (12) Abad-Zapatero, C.; Metz, J. T. Ligand Efficiency Indices As Guideposts for Drug Discovery. Drug Discov. Today 2005, 10, 464 – 469. (13) Kenny, P. W.; Leitão, A.; Montanari, C. A. Ligand efficiency metrics considered harmful. J. Comput. Aided Mol. Des. 2014, 28, 699–710. (14) Roy, K., Kar, S., Das, R. N., Eds. Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment; Academic Press: Boston, 2015.

34 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 36 of 41

(15) Sugaya, N. Training Based on Ligand Efficiency Improves Prediction of Bioactivities of Ligands and Drug Target Proteins in a Machine Learning Approach. J. Chem. Inf. Model. 2013, 53, 2525–2537. (16) Sugaya, N. Ligand Efficiency-Based Support Vector Regression Models for Predicting Bioactivities of Ligands to Drug Target Proteins. J. Chem. Inf. Model. 2014, 54, 2751–2763. (17) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. ChEMBL: a Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2011, 40, D1100–D1107. (18) Cortes-Ciriano, I.; Ain, Q. U.; Subramanian, V.; Lenselink, E. B.; Mendez-Lucio, O.; IJzerman, A. P.; Wohlfahrt, G.; Prusis, P.; Malliavin, T.; van Westen, G. J. P.; Bender, A. Polypharmacology Modelling Using Proteochemometrics: Recent Developments and Future Prospects. Med. Chem. Comm. 2015, 6, 24. (19) Nissink, J. W. M. Simple Size-Independent Measure of Ligand Efficiency. J. Chem. Inf. Model. 2009, 49, 1617–1622. (20) Chen, H.; Carlsson, L.; Eriksson, M.; Varkonyi, P.; Norinder, U.; Nilsson, I. Beyond the Scope of Free-Wilson Analysis: Building Interpretable QSAR Models with Machine Learning Algorithms. J. Chem. Inf. Model. 2013, 53, 1324–1336. (21) Murrell, D. S.; Cortes-Ciriano, I.; van Westen, G. J. P.; Stott, I. P.; Bender, A.; Malliavin, T. E.; Glen, R. C. Chemically Aware Model Builder (camb): An R Package for Property and Bioactivity Modelling of Small Molecules. J. Cheminf. 2015, 7, 45.

35 ACS Paragon Plus Environment

Page 37 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(22) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. (23) Glen, R. C.; Bender, A.; Arnby, C. H.; Carlsson, L.; Boyer, S.; Smith, J. Circular Fingerprints: Flexible Molecular Descriptors with Applications from Physical Chemistry to ADME. IDrugs 2006, 9, 199–204. (24) Landrum, G. RDKit: Open-Source Cheminformatics [Online]; Version 2014.09.2; Http://rdkit.org/ (accessed December 2014). (25) Bender, A.; Jenkins, J. L.; Scheiber, J.; Sukuru, S. C. K.; Glick, M.; Davies, J. W. How Similar Are Similarity Searching Methods? a Principal Component Analysis of Molecular Descriptor Space. J. Chem. Inf. Model. 2009, 49, 108–119. (26) Koutsoukas, A.; Paricharak, S.; Galloway, W. R. J. D.; Spring, D. R.; IJzerman, A. P.; Glen, R. C.; Marcus, D.; Bender, A. How Diverse Are Diversity Assessment Methods? a Comparative Analysis and Benchmarking of Molecular Descriptor Space. J. Chem. Inf. Model. 2014, 54, 230–242. (27) Kuhn, M. Building Predictive Models in R Using the Caret Package. J. Stat. Soft. 2008, 28, 1–26. (28) Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer New York: New York, NY, 2013. (29) Hawkins, D. M.; Basak, S. C.; Mills, D. Assessing Model Fit by Cross-Validation. J. Chem. Inf. Model. 2003, 43, 579–586. (30) Schuurmann, G.; Ebert, R.-U.; Chen, J.; Wang, B.; Kuhne, R. External Validation and Prediction Employing the Predictive Squared Correlation Coefficient - Test 36 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 38 of 41

Set Activity Mean vs Training Set Activity Mean. J. Chem. Inf. Model 2008, 48, 2140–2145. (31) Alexander, D. L. J.; Tropsha, A.; Winkler, D. A. Beware of R2: Simple, Unambiguous Assessment of the Prediction Accuracy of QSAR and QSPR Models. J. Chem. Inf. Model. 2015, 55, 1316–1322. (32) Kvalseth, T. O. Cautionary Note About R2. Am. Stat. 1985, 39, 279–285. (33) Wold, S.; Sjostom, M.; Eriksson, L. PLS-Regression: a Basic Tool of Chemometrics. Chemometr. Intell. Lab. 2001, 58, 109–130. (34) Abdi, H. Partial Least Squares Regression and Projection on Latent Structure Regression (PLS Regression). Wiley Interdisciplinary Reviews: Computational Statistics 2010, 2, 97–106. (35) Abdi, H.; Williams, L. J. Wiley Interdisciplinary Reviews: Computational Statistics. Volume 2010, 2, 433–459. (36) Genton, M. G. Classes of Kernels for Machine Learning: A Statistics Perspective. J. Mach. Learn. Res. 2002, 2, 299–312. (37) Schölkopf, B.; Tsuda, K.; Vert, J. Kernel Methods in Computational Biology; MIT Press: Cambridge, MA, 2004. (38) Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. (39) Ben-Hur, A.; Ong, C. S.; Sonnenburg, S.; Sholkopf, B.; Ratsch, G. Support Vector Machines and Kernels for Computational Biology. PLoS Comput. Biol. 2008, 4, e1000173.

37 ACS Paragon Plus Environment

Page 39 of 41

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

(40) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer Series in Statistics; Springer New York Inc.: New York, NY, USA, 2001. (41) Cortes-Ciriano, I.; Murrell, D. S.; van Westen, G.; Bender, A.; Malliavin, T. Prediction of the Potency of Mammalian Cyclooxygenase Inhibitors with Ensemble Proteochemometric Modeling. J. Cheminf. 2014, 7, 1. (42) Breiman, L. Arcing Classifier (with Discussion and a Rejoinder by the Author). Ann. Statist. 1998, 26, 801–849. (43) Natekin, A.; Knoll, A. Gradient Boosting Machines, a Tutorial. Front. Neurorobot. 2013, 7, 21. (44) Cortes-Ciriano, I.; Bender, A.; Malliavin, T. E. Comparing the Influence of Simulated Experimental Errors on 12 Machine Learning Algorithms in Bioactivity Modeling Using 12 Diverse Data Sets. J. Chem. Inf. Model. 2015, 55, 1413–1425. (45) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. (46) Sheridan, R. P. Using Random Forest to Model the Domain Applicability of Another Random Forest Model. J. Chem. Inf. Model. 2013, 53, 2837–2850. (47) Sheridan, R. P. Three Useful Dimensions for Domain Applicability in QSAR Models Using Random Forest. J. Chem. Inf. Model. 2012, 52, 814–823. (48) Cortés-Ciriano, I.; van Westen, G. J. P.; Bouvier, G.; Nilges, M.; Overington, J. P.; Bender, A.; Malliavin, T. Improved Large-Scale Prediction of Growth Inhibition Patterns on the NCI60 Cancer Cell-Line Panel. Bioinformatics 2016, 32, 85–95. (49) Mevik, B.-H.; Wehrens, R.; Liland, K. H. Pls: Partial Least Squares and Principal Component Regression. 2013; R package version 2.4-3. 38 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 40 of 41

(50) Karatzoglou, A.; Smola, A.; Hornik, K.; Zeileis, A. kernlab – an S4 Package for Kernel Methods in R. J. Stat. Soft. 2004, 11, 1–20. (51) Ridgeway, G. Gbm: Generalized Boosted Regression Models. 2013; R package version 2.1. (52) Friedman, J. H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2000, 29, 1189–1232. (53) Liaw, A.; Wiener, M. Classification and Regression by RandomForest. R News 2002, 2, 18–22. (54) Winer, B.; Brown, D.; Michels, K. Statistical Principles in Experimental Design; McGraw-Hill series in psychology; 3rd ed. McGraw-Hill, New York, 1991. (55) Efron, B.; Tibshirani, R. An Introduction to the Bootstrap; New York : Chapman & Hall, 1993. (56) Roy, K.; Das, R. N.; Ambure, P.; Aher, R. B. Be aware of error measures. Further studies on validation of predictive QSAR models. Chemometr. Intell. Lab. 2016, 152, 18 – 33.

39 ACS Paragon Plus Environment

Page 41 of 41

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 ACS Paragon Plus Environment