
Extreme Gradient Boosting as a Method for Quantitative Structure-Activity Relationships

Robert P. Sheridan 1, Wei Min Wang 2, Andy Liaw 3, Junshui Ma 3, Eric M. Gifford 4

1. Modeling and Informatics, MRL, Merck & Co. Inc., Rahway, NJ, U.S.A. 07065
2. Data Science Department, MSD Global Innovation Network, MSD International, Singapore
3. Biometrics Research Department, Merck & Co. Inc., Rahway, NJ, U.S.A. 07065
4. Bioinformatics Department, MSD International GmbH, Singapore

[email protected]


ABSTRACT

In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that generate the most accurate predictions without being overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions, on the average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.


INTRODUCTION

Quantitative Structure-Activity Relationships (QSAR) is a very commonly used technique in the pharmaceutical industry for predicting on-target and off-target activities. Such predictions help prioritize experiments during the drug discovery process. Higher prediction accuracy is always desirable, but there are practical constraints: in an industrial environment there may be a large number (dozens) of models trained on a very large number (>100,000) of compounds and a large number (several thousand) of substructure descriptors, and these models may need to be updated frequently (say, every few months). Our general rule of thumb is that a QSAR method should be able to build a model from 300,000 molecules with 10,000 descriptors within 12 hours elapsed time, without manual intervention. QSAR methods that are particularly compute-intensive, or that require the adjustment of many sensitive parameters to achieve good prediction for an individual QSAR data set, are less desirable.

Only a small number of the many new machine learning algorithms are suitable for routine application in drug discovery because of the above constraints. Currently, the most commonly used methods are variations on Random Forest (RF)1,2 and Support Vector Machine (SVM)3, which are among the most predictive and have few adjustable parameters. In particular, RF has been very popular since it was introduced as a QSAR method by Svetnik et al.2 Due to its high prediction accuracy, ease of use, and robustness to adjustable parameters, RF has been something of a “gold standard” to which other QSAR methods are compared4. This is also true for non-QSAR types of machine learning5. RF has been our workhorse QSAR method for many years.


In 2012, Merck sponsored a Kaggle competition (www.kaggle.com) to examine how well state-of-the-art machine learning methods can perform on QSAR problems. From that exercise we became aware that neural nets had undergone a renaissance, and that deep neural nets (DNN) had disruptively improved machine learning. We6 and other laboratories (reviewed in Gawehn et al.7) subsequently showed that DNNs are a practical method to apply to large QSAR problems and that DNNs make somewhat better predictions on the average than RF. One interesting point is that, although DNNs have many adjustable parameters, one can find a standard set of parameters suitable for most QSAR problems.

In the past this laboratory8 experimented with alternative recursive partitioning methods such as gradient boosting. Whereas RF builds an ensemble of independent recursive partitioning trees of unlimited depth, gradient boosting builds a sequential series of smaller trees, where each tree corrects for the residuals in the predictions made by all the previous trees. At the time we concluded that, while the gbm9 implementation of gradient boosting [https://cran.r-project.org/web/packages/gbm/gbm.pdf] provided prediction accuracy similar to that of RF, its large number of adjustable parameters made it harder to use.
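To make the contrast with RF concrete, the following minimal sketch (illustrative only, not the gbm or XGBoost implementation) shows the residual-fitting loop that defines gradient boosting under squared-error loss; the function names and the use of scikit-learn trees are our own assumptions.

```python
# Minimal sketch of gradient boosting with squared-error loss: each small tree
# is fit to the residuals left by the ensemble built so far.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_fit(X, y, n_trees=500, max_depth=6, eta=0.1):
    base = y.mean()                      # start from the mean activity
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred              # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += eta * tree.predict(X)    # shrink each tree's contribution
        trees.append(tree)
    return base, trees

def boost_predict(base, trees, X, eta=0.1):
    return base + eta * sum(t.predict(X) for t in trees)
```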

Analogous to the situation with neural nets, the field of gradient boosting has progressed, and a new implementation by Chen and Guestrin called eXtreme Gradient Boosting (XGBoost)10 is readily available [https://github.com/dmlc/xgboost]. XGBoost builds on previous ideas in gradient boosting. The training is done using an “additive strategy”: given a molecule i with a vector of descriptors x_i, a tree ensemble model uses K additive functions to predict the output:

(1)    \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}

Here \mathcal{F} is the set of all possible regression trees. The function f_k at each of the K steps maps the descriptor values in x_i to an output; it is the function we need to learn, comprising the structure of the tree and the leaf scores. XGBoost also introduces improvements to the objective, in particular a regularization term, which turn out to be helpful in practice. Specifically, XGBoost tries to minimize the regularized objective

(2)    \mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert \omega \rVert^2

where l is a differentiable convex loss function that measures the difference between the prediction \hat{y}_i and the target y_i (in our case of QSAR, l is simply the mean-square error). The second term penalizes the complexity of the model in terms of the number of leaves in the tree, T, and the vector of scores on the leaves, \omega. It helps to smooth the final learned weights and avoid over-fitting. We expect the regularized objective to select models built from simple and predictive functions.


Two further techniques aimed at reducing overfitting are shrinkage and column subsampling. After each boosting step, the algorithm scales the newly added weights by a factor η. This reduces the influence of each individual tree and makes the model learn slowly and (hopefully) better. Column (feature) subsampling considers only a random subset of descriptors when building a given tree. This also speeds up the training process by reducing the number of descriptors to consider.

XGBoost also uses a sparsity-aware split-finding approach to train more efficiently on sparse data (the computational complexity is approximately linear in the number of non-missing entries in the input). This is potentially very helpful for QSAR because, if one uses substructure-type chemical descriptors, the compound/descriptor table is very sparse (most entries are zero), and XGBoost is therefore potentially very fast even on the largest QSAR data sets.
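As a concrete illustration of why this matters for substructure descriptors, the hedged sketch below feeds a sparse molecule-by-descriptor count matrix to XGBoost through its Python interface; the variable names (descriptor_counts, activities) are placeholders of our own, and the study itself used the C++ command-line build rather than this interface.

```python
# Sketch: training XGBoost directly on a sparse descriptor table.
import scipy.sparse as sp
import xgboost as xgb

X = sp.csr_matrix(descriptor_counts)        # mostly-zero substructure counts (assumed)
dtrain = xgb.DMatrix(X, label=activities)   # sparse entries are handled natively

params = {
    "eta": 0.05,                # shrinkage factor discussed above
    "max_depth": 6,
    "colsample_bytree": 0.5,    # column subsampling per tree
    "objective": "reg:squarederror",  # squared-error loss ("reg:linear" in older builds)
}
booster = xgb.train(params, dtrain, num_boost_round=500)
```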

The impact of XGBoost has been widely recognized in a number of machine learning and data mining challenges. Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 solutions used XGBoost. However, XGBoost has not yet been used extensively for QSAR, and this study applies XGBoost to 30 in-house QSAR data sets of a practical size. While XGBoost has a very large number of adjustable parameters, we can show that, for our problems, these parameters are not particularly sensitive and one can pick a standard set of parameters with which most QSAR problems can achieve a level of prediction better than RF and almost as good as DNN. Most importantly, XGBoost takes much less compute effort than either of those two methods and produces much smaller models.


METHODS

Data sets

Table 1 shows the in-house data sets used in this study, which are the same as in Ma et al.6 These data sets represent a mixture relevant to pharmaceutical research: on-target and off-target activities, easy and hard to predict activities, and large and small data sets. The data sets are available in Supporting Information with the descriptors and molecule names disguised so the original compounds cannot be reverse-engineered from the descriptors. The 15 data sets labeled “Kaggle” were used for the Kaggle competition. As before, we use in-house data sets because:
1. We wanted data sets which are realistically large, whose compound activity measurements have a realistic amount of experimental uncertainty, and which include a non-negligible amount of qualified data.
2. Time-split validation (see below), which we consider more realistic than any random cross-validation, requires dates of testing, and these are almost impossible to find in public domain data sets.

A number of these data sets contain significant amounts of qualified data. For example, one might know only that the measured IC50 is greater than 30 µM because 30 µM was the highest concentration titrated. For the purposes of model-building those activities are treated as fixed numbers, because most off-the-shelf QSAR methods handle only fixed numbers. For example, IC50 > 30 µM is set to IC50 = 30 × 10⁻⁶ M, or −log(IC50) = 4.5. Our experience is that it is best to keep such qualified data in the QSAR training sets; otherwise less active compounds are often predicted to be more active than they really are.
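The conversion is just the −log10 transform applied to the censoring limit, as in this small sketch (the function name is ours):

```python
# Convert a qualified IC50 (e.g. ">30 uM") to the fixed value used in training.
import math

def qualified_ic50_to_pic50(limit_molar):
    """e.g. IC50 > 30 uM  ->  -log10(30e-6) = 4.52, reported as 4.5"""
    return -math.log10(limit_molar)

print(qualified_ic50_to_pic50(30e-6))   # 4.52...
```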


Table 1. Data sets for prospective prediction.

| Data set | Type | Description | Number of molecules (training+test) | Number of unique AP,DP descriptors | Mean ± stdev activity |
| 2C8 | ADME | CYP P450 2C8 inhibition, -log(IC50) M | 29958 | 8217 | 4.88 ± 0.66 |
| 2C9BIG | ADME | CYP P450 2C9 inhibition, -log(IC50) M | 189670 | 11730 | 4.77 ± 0.59 |
| 2D6 | ADME | CYP P450 2D6 inhibition, -log(IC50) M | 50000 | 9729 | 4.50 ± 0.46 |
| 3A4* | ADME | CYP P450 3A4 inhibition, -log(IC50) M | 50000 | 9491 | 4.69 ± 0.65 |
| A-II | Target | Binding to Angiotensin-II receptor, -log(IC50) M | 2763 | 5242 | 8.70 ± 2.72 |
| BACE | Target | Inhibition of beta-secretase, -log(IC50) M | 17469 | 6200 | 6.07 ± 1.40 |
| CAV | ADME | Inhibition of Cav1.2 ion channel | 50000 | 8959 | 4.93 ± 0.45 |
| CB1* | Target | Binding to cannabinoid receptor 1, -log(IC50) M | 11640 | 5877 | 7.13 ± 1.21 |
| CLINT | ADME | Clearance by human microsome, log(clearance) µL/min/mg | 23292 | 6782 | 1.93 ± 0.58 |
| DPP4* | Target | Inhibition of dipeptidyl peptidase 4, -log(IC50) M | 8327 | 5203 | 6.28 ± 1.23 |
| ERK2 | Target | Inhibition of ERK2 kinase, -log(IC50) M | 12843 | 6596 | 4.38 ± 0.68 |
| FACTORXIA | Target | Inhibition of factor XIa, -log(IC50) M | 9536 | 6136 | 5.49 ± 0.97 |
| FASSIF | ADME | Solubility in simulated gut conditions, log(solubility) mol/l | 89531 | 9541 | -4.04 ± 0.39 |
| HERG | ADME | Inhibition of hERG channel, -log(IC50) M | 50000 | 9388 | 5.21 ± 0.78 |
| HERGBIG | ADME | Inhibition of hERG ion channel, -log(IC50) M | 318795 | 12508 | 5.07 ± 0.75 |
| HIVINT* | Target | Inhibition of HIV integrase in a cell based assay, -log(IC50) M | 2421 | 4306 | 6.32 ± 0.56 |
| HIVPROT* | Target | Inhibition of HIV protease, -log(IC50) M | 4311 | 6274 | 7.30 ± 1.71 |
| LOGD* | ADME | logD measured by HPLC method | 50000 | 8921 | 2.70 ± 1.17 |
| METAB* | ADME | Percent remaining after 30 min microsomal incubation | 2092 | 4595 | 23.2 ± 33.9 |
| NAV | ADME | Inhibition of Nav1.5 ion channel, -log(IC50) M | 50000 | 8302 | 4.77 ± 0.40 |
| NK1* | Target | Inhibition of neurokinin1 (substance P) receptor binding, -log(IC50) M | 13482 | 5803 | 8.28 ± 1.21 |
| OX1* | Target | Inhibition of orexin 1 receptor, -log(Ki) M | 7135 | 4730 | 6.16 ± 1.22 |
| OX2* | Target | Inhibition of orexin 2 receptor, -log(Ki) M | 14875 | 5790 | 7.25 ± 1.46 |
| PAPP | ADME | Apparent passive permeability in PK1 cells, log(permeability) cm/sec | 30938 | 7713 | 1.35 ± 0.39 |
| PGP* | ADME | Transport by p-glycoprotein, log(BA/AB) | 8603 | 5135 | 0.27 ± 0.53 |
| PPB* | ADME | Human plasma protein binding, log(bound/unbound) | 11622 | 5470 | 1.51 ± 0.89 |
| PXR | ADME | Induction of 3A4 by pregnane X receptor; percentage relative to rifampicin | 50000 | 9282 | 42.5 ± 42.1 |
| RAT_F* | ADME | log(rat bioavailability) at 2 mg/kg | 7821 | 5698 | 1.43 ± 0.76 |
| TDI* | ADME | Time-dependent 3A4 inhibition, log(IC50 without NADPH / IC50 with NADPH) | 5559 | 5945 | 0.37 ± 0.48 |
| THROMBIN* | Target | Human thrombin inhibition, -log(IC50) M | 6924 | 5552 | 6.67 ± 2.02 |

* Kaggle data sets

In order to compare the predictive ability of QSAR methods, each of these data sets was split into two non-overlapping subsets: a training set and a test set. Although a usual way of making the split is by random selection, i.e. “split at random,” in actual practice in a pharmaceutical environment, QSAR models are applied "prospectively". That is, predictions are made for compounds not yet tested in the appropriate assay, and these compounds may or may not have analogs in the training set. The best way of simulating this is to generate training and test sets by "time-split". For each data set, the first 75% of the molecules assayed for the particular activity form the training set, while the remaining 25% of the compounds assayed later form the test set. We have found that, for regressions, R2 from time-split validation better estimates the R2 for true prospective prediction than R2 from any "split at random" scheme11. Since training and test sets are not selected from the same pool of compounds, the descriptor distributions in these two subsets are frequently not the same, or even similar to, each other.
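A minimal sketch of the time-split just described, assuming an array of assay dates aligned with the molecules (this is not the authors' in-house tooling):

```python
# Time-split: earliest 75% of assayed molecules -> training, latest 25% -> test.
import numpy as np

def time_split(assay_dates, frac_train=0.75):
    order = np.argsort(assay_dates)          # oldest assay results first
    n_train = int(len(assay_dates) * frac_train)
    return order[:n_train], order[n_train:]  # row indices for train and test
```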

QSAR Descriptors


Each molecule is represented by a list of features, i.e. “descriptors” in QSAR nomenclature. In this paper, we use a set of descriptors that is the union of AP, the original “atom pair” descriptor from Carhart et al.12, and DP descriptors (“Donor–acceptor Pair”), called “BP” in Kearsley et al.13 Both descriptors are of the form:

Atom type i – (distance in bonds) – Atom type j

For AP, the atom type includes the element, number of nonhydrogen neighbors, and number of pi electrons; it is very specific. For DP, the atom type is one of seven (cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, and other); it gives a more generic description of the chemistry.
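To illustrate the descriptor form, the toy code below counts pair keys of this shape; it assumes atom type labels and topological bond distances have already been computed and is ours for illustration only, not the descriptor software used in the paper.

```python
# Toy illustration of "atom type i - (bonds) - atom type j" descriptor keys.
from collections import Counter

def atom_pair_keys(atom_types, bond_distance):
    """atom_types: one label per atom; bond_distance[i][j]: shortest path in bonds."""
    keys = Counter()
    for i in range(len(atom_types)):
        for j in range(i + 1, len(atom_types)):
            a, b = sorted((atom_types[i], atom_types[j]))
            keys[f"{a}-{bond_distance[i][j]}-{b}"] += 1
    return keys   # sparse counts; the union of AP and DP keys forms the descriptor set
```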

Random Forest

RF is an ensemble recursive partitioning method where each recursive partitioning “tree” is generated from a bootstrapped sample of compounds, and a random subset of descriptors is used at the branching of each node in the tree. RF naturally handles correlation between descriptors and does not need a separate descriptor selection procedure to obtain good performance. Importantly, while there are a handful of adjustable parameters (e.g. number of trees, fraction of descriptors used at each branching, node size, etc.), the quality of predictions is generally insensitive to changes in these parameters.

The version of RF we are using is a modification of the original FORTRAN code from Breiman1, which is built for regressions. It has been parallelized to run one tree per processor on a cluster. Such parallelization is necessary to run some of our larger data sets in a reasonable time. For all RF models, we generate 100 trees with m/3 descriptors used at each branch-point, where m is the number of unique descriptors in the training set. Tree nodes with 5 or fewer molecules are not split further. We apply these parameters to every data set.
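For readers who want roughly equivalent settings with an off-the-shelf library, a scikit-learn analogue might look like the sketch below; the authors' parallelized FORTRAN code is not public, so this mapping of parameters is our assumption.

```python
# Approximate scikit-learn analogue of the RF settings described above.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,        # 100 trees
    max_features=1.0 / 3.0,  # ~m/3 descriptors tried at each branch-point
    min_samples_split=6,     # nodes with 5 or fewer molecules are not split
    n_jobs=-1,               # grow trees in parallel
)
# rf.fit(X_train, y_train)   # X_train: molecule x descriptor matrix (assumed names)
```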

Deep Neural Nets

Our Python-based DNN code is the one obtained through the Kaggle competition from George Dahl14, then at the University of Toronto, and modified by us. The DNN results we present are for single-task regression models using the “standard parameters” in Ma et al.6, which are applied to all data sets (a sketch of this architecture follows this subsection):
1. Four hidden layers with 4000, 2000, 1000, and 1000 neurons.
2. Dropout rates of 0% in the input layer, 25% in the first three hidden layers, and 10% in the fourth hidden layer.
3. Logarithmic transformation of the descriptors and ReLU activation.
4. Minibatch size of 125 and 300 training epochs.
5. No unsupervised pre-training.
For timing purposes we have also implemented a simplified (“quick”) version of the DNN, which achieves almost identical prediction accuracy to the standard parameters but uses a smaller neural net:
1. Two intermediate layers of 1000 and 500 neurons with 25% dropout.
2. 75 training epochs.
It is otherwise identical to the standard parameters.
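The following PyTorch sketch shows what the “standard” architecture looks like; it is our rendering for illustration only, not the modified Dahl code actually used, and training details (optimizer, learning-rate schedule) are omitted.

```python
# Sketch of the "standard" single-task DNN architecture (illustrative only).
import torch.nn as nn

def standard_dnn(n_descriptors):
    return nn.Sequential(
        nn.Linear(n_descriptors, 4000), nn.ReLU(), nn.Dropout(0.25),
        nn.Linear(4000, 2000), nn.ReLU(), nn.Dropout(0.25),
        nn.Linear(2000, 1000), nn.ReLU(), nn.Dropout(0.25),
        nn.Linear(1000, 1000), nn.ReLU(), nn.Dropout(0.10),
        nn.Linear(1000, 1),   # single regression output
    )
# Inputs are log-transformed descriptor counts; training uses minibatches of 125
# for 300 epochs (the "quick" variant uses 1000/500 hidden units and 75 epochs).
```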


XGBoost

The implementation of XGBoost is the C++ version runnable on Linux. There are several dozen adjustable parameters; the full set is described at https://github.com/dmlc/xgboost/blob/master/doc/parameter.md. The ones we examined in detail are those whose optimal values might deviate from the defaults for QSAR problems:
ETA (step size shrinkage). This controls the weights of subsequent trees. Default = 0.3.
MAX_DEPTH (maximum depth of a tree). Default = 6.
NUM_ROUND (maximum number of trees). Default = 10.
COLSAMPLE_BYTREE (the fraction of descriptors examined for each tree). Default = 1.

Metrics

In this paper, the metric used to evaluate prediction performance is R2, the squared Pearson correlation coefficient between predicted and observed activities in the test set. The same metric was used in the Kaggle competition and in our previous work. R2 measures the degree of concordance between the predictions and the corresponding observations. This value is especially relevant when the whole range of activities is included in the test set. R2 is an attractive measurement for model comparison across many data sets because it is unitless and ranges from 0 to 1 for all data sets. The relative predictivity of the three methods we examine does not change if we use alternative metrics such as Q2 or RMSE.
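For reference, the metric is simply the following (sketch; variable names assumed):

```python
# Squared Pearson correlation between observed and predicted test-set activities.
import numpy as np

def r2_pearson(y_observed, y_predicted):
    r = np.corrcoef(y_observed, y_predicted)[0, 1]
    return r * r
```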


Workflow for XGBoost

The goal is to identify the few most important parameters and/or a set of parameter values that would be useful for most QSAR data sets. We found early on that a small value of ETA (0.05) was useful to keep the models growing slowly, and therefore steadily, so we kept ETA constant. A smaller ETA reduces the influence of each individual tree and leaves room for future trees to improve the model. The use of slow growth in boosting was suggested by Friedman15. We constructed sets of parameters in the following ways (a sketch of the TRAINOPT procedure follows this list):
1. TESTOPT: Find the values of MAX_DEPTH, NUM_ROUND, and COLSAMPLE_BYTREE that, used in the model generated from the training set, give the highest R2 for the test set. This is done by a grid search over the values MAX_DEPTH = (4, 5, 6, 7, 8, 9, 10), NUM_ROUND = (250, 500, 750, 1000), and COLSAMPLE_BYTREE = (0.50, 0.75, 1.0). There are more efficient search methods, but a grid search turned out to be the most straightforward way to find an optimum quickly, thanks to XGBoost’s speed. It should be noted that TESTOPT does not reflect the real-life situation; in reality, we would not know the activity values of the test set in advance. However, it gives us an upper limit for the R2 on the test set we can expect by optimizing these parameters, and it is interesting to know what parameter values we should use if we had prior knowledge. TESTOPT finds a different set of optimum parameters for each data set.
2. TRAINOPT: Find the optimal values of MAX_DEPTH, NUM_ROUND, and COLSAMPLE_BYTREE, again by grid search, by cross-validation on each training set. That is, split the training set in half randomly, make a model from the first half using the parameters, and then predict the remaining half. The set of parameters that gives the highest R2 for the prediction of the second half of the training set is used to generate a model using the entire training set, and this model is used to predict the test set. This is more representative of the real-life situation because we are optimizing only on the training set. However, since the training set is not necessarily like the test set, optimizing on the training set may not be useful for getting the maximum R2 on the test set. TRAINOPT finds a different set of parameters for each data set.
3. STANDARD: The goal is to find a common value of MAX_DEPTH, NUM_ROUND, and COLSAMPLE_BYTREE to be used for all data sets. The most straightforward way of generating such a standard set is to take the mean optimum values of MAX_DEPTH, NUM_ROUND, and COLSAMPLE_BYTREE over all data sets from either TESTOPT, TRAINOPT, or both.
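The TRAINOPT loop might be sketched as follows with the Python interface; the actual study used the command-line build, and the helper names here are ours.

```python
# Sketch of the TRAINOPT grid search: score each parameter combination on a
# random half of the training set, then refit on the full training set.
import itertools
import numpy as np
import xgboost as xgb

def pearson_r2(y, p):
    r = np.corrcoef(y, p)[0, 1]
    return r * r

def trainopt(X, y, eta=0.05, seed=0):
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    d_fit = xgb.DMatrix(X[idx[:half]], label=y[idx[:half]])
    d_val = xgb.DMatrix(X[idx[half:]], label=y[idx[half:]])
    best, best_r2 = None, -1.0
    for depth, rounds, colsample in itertools.product(
            (4, 5, 6, 7, 8, 9, 10), (250, 500, 750, 1000), (0.50, 0.75, 1.0)):
        params = {"eta": eta, "max_depth": depth, "colsample_bytree": colsample}
        booster = xgb.train(params, d_fit, num_boost_round=rounds)
        r2 = pearson_r2(y[idx[half:]], booster.predict(d_val))
        if r2 > best_r2:
            best, best_r2 = (depth, rounds, colsample), r2
    depth, rounds, colsample = best
    d_full = xgb.DMatrix(X, label=y)
    return xgb.train({"eta": eta, "max_depth": depth,
                      "colsample_bytree": colsample}, d_full, num_boost_round=rounds)
```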

Timing

The three methods we are comparing (RF, DNN, and XGBoost) work on different machine architectures and/or modes in our environment:
1. RF runs as 100 jobs (one for each tree) in parallel on a cluster. The cluster nodes are HP ProLiant BL460c Gen8 server blades, each equipped with two 8-core Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60 GHz processors and 256 GB of random access memory (RAM).
2. XGBoost runs on a single node of the above cluster with multiple threads (NTHREAD=4).
3. DNN runs on a single NVIDIA Tesla C2070 GPU.
It is hard to precisely compare the total computational expense of each method in our mixed environment, but some estimation is possible. By running the DNN on a cluster CPU vs. a GPU, we know that the GPU is 10-fold faster than a CPU. We also know the total time for an RF model is 100 times the time to generate one tree on one cluster node. Perhaps most relevant to users of these methods, however, is the wall-clock time elapsed between when a model calculation is initiated and when the model is written to disk and available to predict new compounds. Estimating the wall-clock time for RF is problematic because the 100 parallel jobs may not finish simultaneously. Sometimes one or two may be delayed unpredictably, and the time between the launch of the jobs and the completion of all of them is too dependent on the system load. A more robust, albeit slightly optimistic, measure is the time at which the first 95 jobs (i.e. trees) are completed. Included in the wall-clock time is the time to read through the descriptor file and create the molecule/descriptor table.

Model size

Another interesting parameter is the size of the model file, in that the speed of prediction is sometimes limited in practice by the time taken to read the model into memory or to copy the model from an archive to the computer doing the predicting. Here we note the size of the (binary) files comprising the model. In the case of RF, we multiply the size of a single tree by 100. In practice the size of a model file can vary depending on the particular implementation of the QSAR method, but looking at the size gives us a rough idea of the relative complexity of models from the respective methods.

RESULTS

Optimizations and standard parameters for XGBoost

The optimum set of parameters for individual data sets and the R2 for each type of optimization are in Supporting Information. The difference between the best and worst R2 in the grid search is 0.07 ± 0.03 over all 30 data sets, and this is true for both TESTOPT and TRAINOPT. That is, prediction accuracy is not particularly sensitive to the parameters we examined.

Mean optimum parameter values are shown in Table 2. While the optimum values of the parameters do not correlate well between TESTOPT and TRAINOPT for individual data sets (not shown), the mean optimum parameter values for TESTOPT and TRAINOPT are not far apart relative to the overall range of each parameter. There is also not much difference between the Kaggle and non-Kaggle subsets. These observations suggest it is possible to construct a set of standard parameters usable for most data sets. We averaged over both TESTOPT and TRAINOPT to obtain the standard parameters MAX_DEPTH=7 (nearest integer to 7.4), NUM_ROUND=745, and COLSAMPLE_BYTREE=0.56. Predictions using these values are called XGBoost_STANDARD.
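Expressed for the Python interface, the standard settings amount to the parameter set below (a sketch in our notation; the study used the command-line build):

```python
# The XGBoost_STANDARD settings arrived at above.
standard_params = {
    "eta": 0.05,               # fixed small shrinkage (see Workflow)
    "max_depth": 7,
    "colsample_bytree": 0.56,
    "objective": "reg:squarederror",  # squared-error loss ("reg:linear" in older builds)
}
num_round = 745
# booster = xgb.train(standard_params, dtrain, num_boost_round=num_round)
```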

Table 2. Mean optimal values for three parameters

|                       | MAX_DEPTH | NUM_ROUND | COLSAMPLE_BYTREE |
| TESTOPT Kaggle        | 6.1 | 700 | 0.60 |
| TESTOPT non-Kaggle    | 8.1 | 783 | 0.60 |
| TESTOPT ALL           | 7.1 | 741 | 0.60 |
| TRAINOPT Kaggle       | 6.8 | 683 | 0.55 |
| TRAINOPT non-Kaggle   | 8.5 | 816 | 0.52 |
| TRAINOPT ALL          | 7.7 | 750 | 0.53 |
| TESTOPT+TRAINOPT ALL  | 7.4 | 745 | 0.56 |


In TRAINOPT and TESTOPT there is a weak relationship between the optimum MAX_DEPTH and the optimum NUM_ROUND vs. Ntraining, the number of molecules in the training set: smaller data sets tend to prefer smaller MAX_DEPTH and smaller NUM_ROUND. Therefore it might be possible to guess a good MAX_DEPTH and NUM_ROUND for individual data sets based only on Ntraining. However, we will show below that the STANDARD parameters already give as good predictions as the TRAINOPT grid-search, so this type of refinement is not likely to be helpful overall.

Accuracy of prediction for the QSAR methods

Figure 1 (Top) shows the R2 for prediction of the test set for standard DNN, XGBoost_TRAINOPT, and XGBoost_STANDARD vs. the R2 for RF, which we take as the baseline method. On the average, the XGBoost results are better than those of RF (i.e. the red and green best-fit lines are above the dashed diagonal line) and nearly as good as those of DNN (the blue best-fit line). The improvement of XGBoost over RF tends to become smaller at higher R2. These trends are made more visible by subtracting out the RF value, as shown in Figure 1 (Bottom). That there is no systematic difference between XGBoost_TRAINOPT (green) and XGBoost_STANDARD (red) is seen in the superposition of the green and red lines in Figure 1. That is, optimizing the parameters for each individual training set in TRAINOPT does not result in better predictions on the test set than are obtained by simply using the STANDARD set of parameters for all data sets.


Figure 1. Prediction accuracy on the test set for XGBoost and deep neural nets vs. the prediction accuracy of random forest. Two different types of XGBoost parameters are shown, one with the parameters optimized for individual training sets (green), and one using a standard set of parameters for all data sets (red). (Top) The absolute R2. (Bottom) The R2 minus the R2 for random forest.


The visual impression in Figure 1 is consistent with the mean R2 over the 30 data sets: RF = 0.39, DNN = 0.43, XGBoost_TRAINOPT = 0.42, XGBoost_STANDARD = 0.42. One should note these are only averages; any of the three methods may do best on a particular data set. We do not show XGBoost_TESTOPT in Figure 1 because it is, as previously mentioned, expected to be unrealistically optimistic; however, we note that XGBoost_TESTOPT achieves only slightly better R2 than XGBoost_TRAINOPT (mean R2 = 0.44).

Timing

Total compute effort and wall-clock time for the different methods are shown in Figure 2 as a function of Ntraining. The log-log plot is the one in which all methods show approximately linear trends and the range of timings can be appreciated. The total compute effort for DNN and XGBoost is roughly linear in Ntraining. As expected, the DNN using fewer neurons (“quick”) requires less computation than the standard DNN. The total compute effort for RF rises roughly as the square of Ntraining. Clearly, XGBoost requires the least total compute effort, by an order of magnitude (Top), and is somewhat faster in wall-clock time (Bottom), at least for the larger data sets. The QSAR methods tend to converge in wall-clock time for small data sets; for these, reading the descriptor files and forming a molecule/descriptor table, which is shared by all methods, becomes the rate-limiting step.


Figure 2. Total computational effort (Top) and wall-clock time (Bottom) for random forest, deep neural nets and XGBoost. The computational effort is expressed in units of hours on a single cluster node. For this plot, all XGBoost models were made with NUM_ROUND=1000, which takes the maximum compute time.


Model size

Total model file size (in megabytes) is shown in Figure 3 as a function of Ntraining. The log-log plot is the one in which all methods show approximately linear trends and the range of model file sizes can be appreciated. The size of DNN models is expected to depend on the total number of neurons. The number of neurons in the lowest layer depends on the number of descriptors, which varies roughly as log(Ntraining), and the number of neurons in the intermediate layers depends on the number of intermediate layers and the number of neurons per layer set by the user. Effectively, the dependence of size is approximately log(Ntraining). We would expect networks with fewer layers and fewer neurons per layer (the “quick” DNN) to produce smaller models than the original standard DNN, and they do. The size of XGBoost models might be expected to be almost constant with Ntraining because there is a fixed number of trees with a small maximum depth, both set by the user. However, there is a small dependence of size roughly with log(Ntraining), which probably reflects the fact that larger data sets have more trees closer to the maximum depth. In contrast, the number of nodes in an unpruned recursive partitioning tree should be linear in Ntraining, and we see this for RF. The model sizes of XGBoost are much smaller than those of the other methods for a given value of Ntraining, except RF at very small values of Ntraining.


Figure 3. Total model file size for the QSAR methods. Here the XGBoost models are made with the standard parameters.

DISCUSSION

XGBoost appears to be a very effective and efficient machine-learning method, and here we have demonstrated this is true in the realm of QSAR. It is effective in the sense that it provides prospective predictions at least as good on the average as RF, a gold-standard method, and nearly as good as single-task DNN. It is efficient because it achieves these results with much less computational effort than either of those methods and produces much smaller models. Having XGBoost means we can potentially handle many more and larger data sets, and/or update them more frequently than we have previously, given our current compute environment.


The potential difficulty with XGBoost having multiple adjustable parameters turns out, in practice, to not be a real issue for QSAR because we can identify standard values of at least some parameters. We have shown that these standard parameters can be used effectively with a large number of QSAR data sets; it is not necessary to optimize the parameters for each individual data set. Since we did not examine every parameter, it is possible that we could get results even better than we show here. We hasten to point out that our standard parameters apply only to QSAR problems in our environment. We can say nothing about non-QSAR applications of XGBoost.

We should point out, apropos of the above paragraph, that our work brings out an issue with any kind of QSAR model-building, which we have noted before16 in another context. It is a general assumption in QSAR, and in machine learning in general, that the molecules to be predicted (in the test set) are similar enough to the training set that maximizing the cross-validated predictions of the training set (by using different descriptors, tweaking adjustable parameters, etc.) is equivalent to maximizing predictivity on the test set. In practice, the training and test sets may be different enough that this is not true.

Despite the excellent performance of XGBoost, there are some issues that should be examined further. Recursive partitioning methods make predictions based on the average observed activities of molecules at their terminal nodes. This has the effect of compressing the range of predictions relative to the observed activities. For random forest we routinely do “prediction rescaling”16, where the self-fit predicted activities in the training set of a particular model are linearly scaled to match the observed activities, and this scaling is applied to further predictions from that model. This does not affect the R2 of prediction, but it does help the numerical match of predicted and observed activities at the highest and lowest ranges of activity. We have found that XGBoost also benefits from prediction rescaling.
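A minimal sketch of such a rescaling, assuming arrays of observed training activities and self-fit predictions are at hand (names are ours; this is not the authors' production code):

```python
# Fit a linear map from self-fit predictions to observed training activities,
# then apply the same map to predictions for new compounds.
import numpy as np

slope, intercept = np.polyfit(yhat_selffit_train, y_observed_train, 1)

def rescale(yhat_new):
    return slope * yhat_new + intercept
```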

One disadvantage of XGBoost (and DNN) is that it does not produce the domain-applicability metric TREE_SD (the variation in prediction across the trees of an ensemble), which is very useful for estimating the error bar on a QSAR prediction17 and which comes “for free” with RF.
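For illustration, a TREE_SD-like quantity can be computed from any per-tree ensemble; the sketch below uses scikit-learn's random forest and assumed variable names, not the authors' RF code.

```python
# Spread of per-tree predictions as a rough domain-applicability signal.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
per_tree = np.stack([t.predict(X_test) for t in rf.estimators_])  # (n_trees, n_test)
tree_sd = per_tree.std(axis=0)   # one value per test molecule
```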

Another potential issue is that boosting methods like gbm9 appear much more sensitive to artificial noise added to the activity data18 than other methods, in the sense that the R2 of prediction of a test set goes down more quickly with a given amount of noise added to the training set. We can confirm that XGBoost is more sensitive to artificial noise than RF, but about equivalent in sensitivity to DNN. However, it is not clear how relevant sensitivity to added noise is to the usefulness of a QSAR method. Clearly XGBoost is better than RF in the limit of no added noise.

ACKNOWLEDGEMENTS

WMW and EMG would like to thank Roy Goh and Ivan Clement, Data Sciences MSD Singapore, for their discussions and support of this study.

CONFLICT OF INTEREST

The authors declare no financial conflict of interest.


ASSOCIATED CONTENT

Supporting Information Available:
1. Table of R2 results for RF, DNN, and XGBoost; similarly for timing and model size.
2. Training and test data for all data sets (with disguised molecule names and descriptors). This data set is suitable for comparing the relative goodness of different QSAR methods given the AP,DP descriptors.

REFERENCES

1. Breiman, L. Random Forests. Machine Learning 2001, 45, 3-32.

2. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J.C.; Sheridan, R.P.; Feuston, B.P. Random Forest: a Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947-1958.

3. Cortes, C.; Vapnik, V. N. Support-Vector Networks. Machine Learning 1995, 20, 273-297.

4. Bruce, C.L.; Melville, J.L.; Pickett, S.D.; Hirst, J.D. Contemporary QSAR Classifiers Compared. J. Chem. Inf. Model. 2007, 47, 219-227.


5. Fernandez-Delgado, M.; Cernadas, E.; Barro, S.; Amorim, D.A. Do We Need Hundreds of Classifiers to Solve Real World Problems? J. Mach. Learn. Res. 2014, 15, 3133-3181.

6. Ma, J.; Sheridan, R.P.; Liaw, A.; Dahl, G.E.; Svetnik, V. Deep Neural Nets as a Method for Quantitative Structure-Activity Relationships. J. Chem. Inf. Model. 2015, 55, 263-274.

7. Gawehn, E.; Hiss, J.A.; Schneider, G. Deep Learning in Drug Discovery. Mol. Inf. 2016, 35, 3-14.

8. Svetnik, V.; Wang, T.; Tong, C.; Liaw, A.; Sheridan, R.P.; Song, Q. Boosting: an Ensemble Learning Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2005, 45, 786-799.

9. Friedman, J.H. Stochastic Gradient Boosting. Computational Statistics and Data Analysis 2002, 38, 367-378.

10. Chen, T.; Guestrin, C. XGBoost: a Scalable Tree Boosting System. arXiv:1603.02754v3 2016.

11. Sheridan, R.P. Time-Split Cross-Validation as a Method For Estimating the Goodness of Prospective Prediction. J. Chem. Inf. Model. 2013, 53, 783-790.


12. Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Application. J. Chem. Inf. Comput. Sci. 1985, 25, 64-73.

13. Kearsley, S.K.; Sallamack, S.; Fluder, E.M.; Andose, J.D.; Mosley, R.T.; Sheridan, R.P. Chemical Similarity Using Physiochemical Property Descriptors. J. Chem. Inf. Comput. Sci. 1996, 36, 118-127.

14. Dahl, G.E.; Jaitly, N.; Salakhutdinov, R. Multi-Task Neural Networks for QSAR Predictions. 2014, arXiv:1406.1231 [stat.ML]. http://arxiv.org/abs/1406.1231

15. Friedman, J. Greedy Function Approximation: a Gradient Boosting Machine. Annals of Statistics 2001, 29, 1189–1232.

16. Sheridan, R.P. Global Quantitative Structure-Activity Relationship Models vs Selected Local Models as Predictors of Off-Target Activities For Project Compounds. J. Chem. Inf. Model. 2014, 54, 1083-1092.

17. Sheridan, R.P. Using Random Forest to Model the Domain Applicability of Another Random Forest Model. J. Chem. Inf. Model. 2013, 53, 2837-2850


18. Cortes-Ciriano, I.; Bender, A.; Malliavin, T. E. Comparing the Influence of Simulated Errors on 12 Machine Learning Algorithms in Bioactivity Modeling Using 12 Diverse Data Sets. J. Chem. Inf. Model. 2015, 55, 1413-1425.


TOC Graphic
