
Article

Exploiting Multiple Descriptor Sets in QSAR Studies
Jabed H. Tomal, William J. Welch, and Ruben H. Zamar
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.5b00663 • Publication Date (Web): February 23, 2016




Exploiting Multiple Descriptor Sets in QSAR Studies

Jabed H. Tomal,† William J. Welch,∗,‡ and Ruben H. Zamar‡

†Department of Computer and Mathematical Sciences, University of Toronto Scarborough, Toronto, Ontario, Canada M1C 1A4
‡Department of Statistics, The University of British Columbia, Vancouver, British Columbia, Canada V6T 1Z4

E-mail: [email protected]

Phone: +1-604-822-3339. Fax: +1-604-822-6960

Abstract

A quantitative structure-activity relationship (QSAR) is a model relating a specific biological response to the chemical structures of compounds. There are many descriptor sets available to characterize chemical structure, raising the question of how to choose among them or how to use all of them for training a QSAR model. Making efficient use of all sets of descriptors is particularly problematic when active compounds are rare among the assay response data. We consider various strategies to make use of the richness of multiple descriptor sets when assay data are poor in active compounds. Comparisons are made using data from four bioassays, each with five sets of molecular descriptors. The recommended method takes all available descriptors from all sets and uses an algorithm to partition them into groups called phalanxes. Distinct statistical models are trained, each based on only the descriptors in one phalanx, and the models are then averaged in an ensemble of models. By giving the descriptors a chance to contribute in different models, the recommended method uses more of the descriptors in model averaging. This results in better ranking of active compounds to identify a shortlist of drug candidates for development.

∗To whom correspondence should be addressed
†University of Toronto Scarborough
‡The University of British Columbia


Introduction

A quantitative structure-activity relationship (QSAR) model relates assay measurements of a biological response such as activity, toxicity, or mutagenicity of chemical compounds to their molecular descriptors. The descriptors are numeric variables that characterize a molecule's structure, i.e., its geometric shape, charge distribution, etc. The underlying premise is that chemical structure determines biological activity against a given biological target. The QSAR model allows smart screening 1 or in silico screening, i.e., selecting a focused set of molecules for further screening based on predicted activity. Chemists have been creative in generating many, qualitatively different, descriptor sets for QSAR modeling. 2,3 For example, the US Environmental Protection Agency's 2008 Molecular Descriptors Guide 4 listed 13 descriptor sets, and the Dragon software 5 produces 29 sets. Faced with this overwhelming choice, it is not clear what a modeler should do in practice. Thus, this article investigates how best to utilize multiple descriptor sets, perhaps totalling thousands of variables, for accurate QSAR modeling. Furthermore, the plethora of variables should be used effectively even when limited information is present in the assay data. We focus on classification of active versus inactive molecules in drug discovery, though the principles are easily extended to regression problems with a continuous biological response.

Given the active/inactive assay results for a number of compounds and S descriptor sets, each with a moderate to large number of variables, three strategies are obvious for using the sets in QSAR modeling. The modeler could fit S classification models, one for each of the S descriptor sets, and choose the descriptor set performing best, 6 using say cross-validation (CV). 7 Or the outputs from the S classifiers could be averaged in a model ensemble to produce a single prediction model. 8 Finally, a single classification model (which might be a model ensemble) could be trained using the union of all the variables in the S descriptor sets. 9

Key to the best use of multiple descriptor sets is the classification method used for QSAR modeling. Tomal et al. (2015) proposed an ensemble of classifiers, a model averaging method, where the constituent models of the ensemble use distinct subsets of variables called phalanxes. 10


In this article we extend the ensemble of phalanxes approach to multiple descriptor sets, which turns out to be even more effective. Prediction accuracy is substantially improved. Furthermore, the method is able to make explicit use of thousands of descriptor variables, in the sense that nearly all appear in the ensemble, even in the presence of scant information in the bioassay response data. The rest of the article is organized as follows. Section 2 outlines the QSAR modeling methods used, including the method of ensembles of phalanxes, and how they are implemented with multiple descriptor sets. Section 3 describes the data sets used for comparisons: the bioassay data and the descriptor sets. Section 4 outlines the evaluation metrics for the results in Section 5. Section 6 provides insight into how the most competitive ensemble methods achieve good performance, and finally Section 7 concludes with some summary remarks and discussion.

2. QSAR Models and Multiple Descriptor Sets

Many classifiers, such as neural networks, recursive partitioning, k-nearest neighbours (KNN), and support vector machines (SVMs), have been applied to QSAR studies. 11–14 The most successful methods for binary classification problems in QSAR studies, however, have tended to be ensemble methods, which average several models. Merkwirth et al. (2004) built ensembles by averaging KNN or SVM models constructed on random subspaces of descriptors. 15 Bruce et al. (2007) compared several machine learning tools for mining drug data, 16 including SVMs and various ensembles of classification trees: boosting, 17 bagging, 18 and random forests (RFs). 19 These authors found that an RF model, which is an average of many classification trees, consistently improved classification predictive accuracy relative to a single tree over several QSAR applications. Hughes-Oliver et al. (2011) carried out a comprehensive comparison of 11 classifiers and RFs. 20 The methods were applied to data from four highly unbalanced two-class (active/inactive) assays, each with five descriptor sets, and performance was assessed in terms of ranking the rare active compounds ahead of the inactive compounds. Repeatedly, for each assay, an RF ensemble with one of the five sets of descriptors appeared among the top performing models, along with the


more computationally demanding KNN. In summary, we can fairly say that RF is the current state of the art for applications in this context. On the other hand, there was no clear winner among the descriptor sets. For the AID 362 assay, for instance, an RF using Burden numbers as the descriptor set performed best and atom-pairs descriptors were fourth-best. Conversely, for assay AID 364 the relative performances of these two descriptor sets were reversed.

In this article we use the same five descriptor sets as Hughes-Oliver et al.: Atom Pairs (AP), Burden numbers (BN), 21,22 Carhart Atom Pairs (CAP), 23 Fragment Pairs (FP), and Pharmacophore fingerprints (PH). They are computed using PowerMV. 24 The Burden numbers are continuous descriptors, and the other four are bit-string descriptors where each bit is set to 1 when a certain feature is present and 0 otherwise. PowerMV can generate up to 546, 24, 4662, 735 and 147 descriptor variables for AP, BN, CAP, FP and PH, respectively. Thus, the binary descriptor sets like CAP are rich in terms of the number of variables relative to the only 48–278 actives observed for the assays of Section 3. The continuous set BN has the smallest number of variables (24) but is also rich in the sense that a continuous variable can be split more than once in a tree.

For any assay and any underlying classification method such as RFs, we compare seven possible classifiers utilizing all the descriptor sets. Five classifiers are built based on one set at a time, from which the best performer can be picked. Another classifier is formed by averaging the predictions from the five one-set classifiers; we call this an ensemble of descriptor sets (EDS). The final classifier is trained using all the variables in the union of the descriptor sets and is called all descriptors (AD). We next describe the two underlying classifiers, RFs and ensemble of phalanxes, and give a few more details about how they are used with multiple descriptor sets.

Random forests

An RF is an ensemble of classification trees due to Breiman. 19 It has two randomization devices to grow different trees from the same data. First, each tree is grown on a bootstrap sample of the training data. Second, at each node of a tree, the best split is chosen among mtry descriptors


randomly sampled from those available. The trees are grown to maximal depth and aggregation is done by majority vote. Breiman showed that a tighter upper bound on prediction error can be achieved when the average correlation between predictions from the constituent trees is small and the average prediction strength of the trees is large. The randomization devices in a forest of trees lower average correlation, and the construction of maximal-depth trees increases average strength. Even though growing trees to maximal depth increases prediction variance, aggregation over many trees reduces the overall variance and thus controls overfitting. To construct an RF in this article we use the R package randomForest 25 with default settings: we grow 500 trees and mtry = √D, where D is the number of descriptors.

For any bioassay, an RF is trained for each descriptor set separately, giving RF-AP, RF-BN, RF-CAP, RF-FP and RF-PH models. Each classifier gives separate estimates of the probabilities of activity for compounds in a test set (using CV). We denote these estimated probabilities by P̂_AP for AP, etc. The elements of P̂_AP will ideally be larger for active compounds than inactive compounds, hence ranking the actives first in a shortlist. The metrics in Section 4 are used to quantify the ranking performance, and the set with best performance can be chosen. Alternatively, the RF-EDS method ensembles the five classifiers by averaging the probabilities of activity across the five descriptor sets:

P̂_EDS = (P̂_AP + P̂_BN + P̂_CAP + P̂_FP + P̂_PH) / 5,    (1)

and the RF-AD method simply trains one RF classifier based on all descriptors from the union of the five sets. The performances of P̂_EDS from RF-EDS and P̂_AD from RF-AD are also assessed using the metrics in Section 4.
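As a concrete illustration of the averaging in equation (1), the sketch below combines per-descriptor-set probability vectors in Python. This is not the paper's implementation (the authors used R's randomForest); the probability values are invented for illustration.

```python
# Sketch of the EDS ensemble of equation (1): average the per-descriptor-set
# probabilities of activity, element by element. In practice each vector
# would come from an RF trained on one descriptor set; these are made up.
p_hat = {
    "AP":  [0.10, 0.80, 0.30],
    "BN":  [0.20, 0.70, 0.10],
    "CAP": [0.15, 0.90, 0.20],
    "FP":  [0.05, 0.60, 0.40],
    "PH":  [0.10, 0.75, 0.25],
}
n_sets = len(p_hat)
n_compounds = len(p_hat["AP"])
p_eds = [sum(p[i] for p in p_hat.values()) / n_sets for i in range(n_compounds)]
print([round(x, 3) for x in p_eds])  # [0.12, 0.75, 0.25]
```

The averaged vector P̂_EDS is then ranked exactly like any single-set P̂ when computing the metrics of Section 4.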

Ensemble of phalanxes

Tomal et al. (2015) proposed a method to partition the variables in a descriptor set into groups or phalanxes. 10 Predictions from several classification models, each trained only using the variables


in one phalanx, are averaged to form an ensemble of phalanxes (EPX) classifier. The heart of the algorithm is to start with individual descriptors or small sets and recursively amalgamate them into larger groups. Amalgamation and stopping are based on criteria measuring whether any two groups predict better when they are not amalgamated, in which case their separate predictions would be averaged, versus when the two groups are combined and a single model is trained. There are other details, such as how to form initial groups and how to screen out weak initial groups and weak candidate phalanxes when amalgamation stops, but the key idea is that variables end up together in a phalanx because they complement each other in one model, while variables in different phalanxes work better as an ensemble. At every stage, predictions are made for a candidate group of descriptors by training an RF using only that group's descriptors. The RF predictions from the final p phalanxes are averaged in an ensemble to give the EPX predictions. Thus, even with just one descriptor set, EPX is an ensemble of predictions from several RF models, each of which is itself an ensemble. As with using RF directly, EPX can be applied to each descriptor set individually to give EPX-AP from the AP descriptors, etc. Analogous to equation (1), we can also ensemble the predictions from the five descriptor sets to define the EPX-EDS method, or we can use the union of all descriptor sets in EPX, i.e., EPX-AD.
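The amalgamation step can be caricatured as a greedy merge loop. The following is only a minimal sketch of the idea, not the actual EPX algorithm of Tomal et al.; the score functions here are stand-ins for the CV ranking criterion (e.g., the AHR of an RF trained on a group's variables), and the "complements" toy data are invented.

```python
def amalgamate(groups, score_joint, score_avg):
    """Greedily merge the pair of variable groups that gains most by being
    modeled jointly rather than having their predictions averaged; stop
    when no pair improves. Surviving groups are the candidate phalanxes."""
    groups = [frozenset(g) for g in groups]
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                gain = score_joint(groups[i] | groups[j]) - score_avg(groups[i], groups[j])
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, i, j)
        if best is None:  # every remaining pair predicts better as an ensemble
            break
        _, i, j = best
        merged = groups[i] | groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups

# Toy criterion: variables 'a' and 'b' complement each other in one model,
# as do 'c' and 'd'; any other combination is better left as an ensemble.
complements = [{"a", "b"}, {"c", "d"}]
score_joint = lambda g: 1.0 if any(g <= c for c in complements) else 0.4
score_avg = lambda g1, g2: 0.5

phalanxes = amalgamate([{"a"}, {"b"}, {"c"}, {"d"}], score_joint, score_avg)
print(sorted(sorted(p) for p in phalanxes))  # [['a', 'b'], ['c', 'd']]
```

The loop stops with two phalanxes, reflecting the EPX rationale: complementary variables are pooled into one model, and non-complementary groups are kept apart so their predictions can be averaged.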

3. Data

All methods were applied to four assay data sets made available by the Molecular Libraries Screening Center Network. 26 These assays typically resulted in continuous responses (the bioassay activity score) from a primary screen and binary responses (the bioassay activity outcome, active versus inactive) from a secondary screen. In this article, we focus on the binary responses. Table 1 presents the biological target, number of compounds, number of active compounds, and proportion of actives for each of the four assay datasets: AID 348, AID 362, AID 364 and AID 371. These assays relate to Gaucher's disease, tissue-damaging leukocytes, cytotoxicity, and lung tumor


cells, respectively. A summary of the AID 348 bioassay can be found at PubChem, 27 for instance. The proportions of active compounds are 0.0097, 0.0140, 0.0151 and 0.0839, respectively. Thus, active compounds are rare relative to the number of inactive compounds, and any method will be limited by the assay data when building a model to detect active compounds in a test set.

Table 1: Four Assays from the Molecular Libraries Screening Center Network.

    Assay      Biological target            # Compounds
                                            Total   Active (proportion)
    AID 348    Gaucher's disease            4946     48 (0.0097)
    AID 362    Tissue-damaging leukocytes   4279     60 (0.0140)
    AID 364    Cytotoxicity                 3311     50 (0.0151)
    AID 371    Lung tumor cells             3312    278 (0.0839)

Conversely, any descriptor set is rich in terms of the number of potentially useful predictor variables, even more so when there are multiple descriptor sets available corresponding to a bioassay. Table 2 shows by assay the numbers of non-constant variables for each descriptor set and in the all-descriptors (AD) set. For example, CAP has 1319–1795 non-constant variables available for any one assay. Combining descriptor sets gives as many as 2878 descriptors, for the AID 348 assay.

Table 2: Five Descriptor Sets Generated by PowerMV and the Number of Non-Constant Variables for Each of the Four Assays. The last row (all descriptors) is obtained by pooling all five descriptor sets together.

    Descriptor set             # Potential   # Non-constant variables for assay
                               variables     AID 348   AID 362   AID 364   AID 371
    Atom pairs (AP)                 546          367       360       380       382
    Burden numbers (BN)              24           24        24        24        24
    Carhart atom pairs (CAP)       4662         1795      1319      1585      1498
    Fragment pairs (FP)             735          570       563       580       580
    Pharmacophores (PH)             147          122       112       120       119
    All descriptors (AD)           6114         2878      2378      2689      2603


4. Evaluation of Methods

Our goal is to find an ensemble that ranks test compounds of unknown activity status well, in the sense that active compounds appear ahead of inactives in a shortlist of leads for optimization. But we have assay data where the classes of all of the compounds are known, i.e., there is no test set. Hence, to evaluate the predictive ranking performances of the ensembles we use balanced 10-fold CV.

Balanced 10-fold cross-validation

Since the assay data contain few active compounds, a particular fold in a regular 10-fold CV might have an exceptionally small number of actives and thus make the training and test splits unrepresentative of each other. Hence, we have adopted a CV method where the dataset is randomly partitioned into 10 folds such that the resulting folds are approximately of equal size and have approximately equal numbers of actives. Treating one of these folds as a test set, the remaining 10 − 1 = 9 folds are combined to form a training set in order to train a model. This model is then applied to the test set to obtain predictions. The process is repeated, holding out each of the 10 folds for testing in turn. Following Hawkins et al. (2003), 7 we call this method balanced 10-fold CV. Each compound appears in exactly one fold and hence once in a test fold. Thus, the 10-fold CV yields test probabilities of activity for all of the compounds in an assay. They are used to rank the compounds from most likely to be active to least likely, and the quality of this ranking is then assessed graphically via a hit curve, or quantitatively using the average hit rate or initial enhancement.
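A minimal sketch of such a balanced fold assignment is given below (illustrative Python; the paper's experiments were run in R). Dealing the shuffled compounds of each class round-robin across folds guarantees nearly equal fold sizes and nearly equal numbers of actives per fold.

```python
import random
from collections import defaultdict

def balanced_folds(labels, k=10, seed=0):
    """Assign each compound to one of k folds so that folds are of
    approximately equal size with approximately equal numbers of actives.
    `labels` is a list of 0/1 activity outcomes; returns one fold index
    per compound."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [None] * len(labels)
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):   # deal round-robin across the k folds
            fold_of[i] = j % k
    return fold_of

# toy example: 100 compounds, 10 of them active
labels = [1] * 10 + [0] * 90
folds = balanced_folds(labels, k=10)
actives_per_fold = [sum(1 for i, f in enumerate(folds) if f == fold and labels[i] == 1)
                    for fold in range(10)]
print(actives_per_fold)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Here each of the 10 folds receives exactly one of the 10 actives and 9 of the 90 inactives, mirroring the balance the paper requires.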

Hit curve

A hit curve summarizes the performance of a classifier when the objective is ranking rare-class objects ahead of the majority-class objects. Let n be the total number of compounds in a test set and a be the number of active compounds. Let 0 ≤ a_t ≤ a be the number of actives or


hits in a shortlist of ranked compounds of size t ∈ [1, n]. The hit curve is a plot of a_t versus t, or equivalently a plot of a_t/a versus t/n. The hit curve evaluates a ranking procedure at all possible shortlist cutoff points t ∈ [1, n]. Classifier 1 with hit curve H1(t) is uniformly superior to classifier 2 with hit curve H2(t) if H1(t) ≥ H2(t) at every t, with strict inequality for at least one t.
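The quantities a_t can be read off directly from the ranked list. The following small Python sketch (with invented labels and scores) computes the full hit curve:

```python
def hit_curve(y_true, scores):
    """Return the sequence a_t (number of actives among the top-t ranked
    compounds) for t = 1..n, ranking by decreasing predicted probability."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, a_t = 0, []
    for i in order:
        hits += y_true[i]
        a_t.append(hits)
    return a_t

# toy ranking: three actives among eight compounds, mostly scored high
y = [1, 0, 1, 0, 0, 1, 0, 0]
p = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6, 0.1, 0.05]
print(hit_curve(y, p))  # [1, 1, 2, 3, 3, 3, 3, 3]
```

Plotting a_t against t (or a_t/a against t/n) reproduces curves like those in Figure 1.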

Figure 1: AID 348 assay: Hit curves from the four ensembles RF-BN, RF-AD, EPX-CAP, and EPX-AD with AHRs 0.103, 0.209, 0.247 and 0.330, respectively. Panel (a) is for the first 300 shortlisted compounds with IEs 7.15, 10.30, 8.93 and 11.51, respectively. Panel (b) is for the first 50 shortlisted compounds with IEs 20.61, 26.79, 37.10 and 41.22, respectively.

Figure 1 shows four illustrative hit curves generated by four classifiers for the AID 348 assay: RF-BN, RF-AD, EPX-CAP, and EPX-AD. Panel (a) shows the hit curves for a shortlist of 300 compounds, and panel (b) zooms in on the first 50 compounds selected. Clearly EPX-AD, i.e., EPX applied once to the variables from all descriptor sets, is best or nearly best throughout the curve. But the hit curves for RF-AD and EPX-CAP criss-cross substantially at several points: RF-AD performs better at 300 compounds selected, for example, whereas EPX-CAP performs well at 50 compounds selected. In such a situation, a single-number summary of a hit curve might be helpful to evaluate classifiers and to facilitate comparisons. A single-number summary of a hit curve is also necessary to guide EPX construction.


Initial enhancement

Initial enhancement (IE) was proposed by Kearsley et al. (1996). 28 Having ranked the compounds using the estimated probabilities of activity, we shortlist the top t ≤ n compounds most likely to be active; the value of t depends on the resources available for a follow-up study. Initial enhancement for a shortlist of length t is the hit rate at t divided by the proportion of actives in the entire collection of compounds:

IE = (a_t / t) / (a / n) ∈ [0, n/a].    (2)

It is a relative measure of the hit rate improvement offered by a classifier beyond what can be obtained under random ranking, and values much larger than 1 are desired. Panels (a) and (b) of Figure 1 show hit curves for the four classifiers RF-BN, RF-AD, EPX-CAP, and EPX-AD. The four classifiers have IEs 7.15, 10.30, 8.93 and 11.51, respectively, for the first 300 shortlisted compounds. For the first 50 shortlisted compounds the IEs are 20.61, 26.79, 37.10 and 41.22, respectively.
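Equation (2) translates directly into code. This illustrative Python sketch uses the same invented toy labels and scores as above:

```python
def initial_enhancement(y_true, scores, t):
    """IE at shortlist length t (Eq. 2): the hit rate among the top t ranked
    compounds divided by the overall proportion of actives."""
    n, a = len(y_true), sum(y_true)
    order = sorted(range(n), key=lambda i: -scores[i])
    a_t = sum(y_true[i] for i in order[:t])
    return (a_t / t) / (a / n)

y = [1, 0, 1, 0, 0, 1, 0, 0]   # 3 actives among 8 compounds
p = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6, 0.1, 0.05]
print(initial_enhancement(y, p, t=4))  # (3/4)/(3/8) = 2.0
```

An IE of 2.0 means the shortlist of length 4 contains twice as many actives as a random shortlist of the same length would on average.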

Average hit rate

The average hit rate (AHR) is a single number summary for the entire hit curve. Suppose we shortlist the top t ≤ n compounds and a_t of them are active. Then

h(t) = a_t / t ∈ [0, 1]    (3)

is the hit rate at t. Naturally, we want h(t) to be as large as possible throughout the curve. Let 1 ≤ t_1 < t_2 < · · · < t_a ≤ n be the positions of the active compounds in the ranked list. AHR, sometimes called average precision, is defined as

AHR = (1/a) [h(t_1) + h(t_2) + · · · + h(t_a)].    (4)


AHR averages the "hit rates" at the positions of the active compounds, and a larger AHR corresponds to a hit curve with a rapid early increase. AHR attains the maximum value of 1 if all of the active compounds are ranked before all of the inactive compounds. When there are tied probabilities of activity, and hence compounds with tied ranks, AHR is an expectation over random orderings of the compounds in a tied group; further details may be found in Chapter 3 of Wang's thesis. 29 The average hit rates for the four classifiers RF-BN, RF-AD, EPX-CAP, and EPX-AD presented in Figure 1 are 0.103, 0.209, 0.247 and 0.330, respectively. EPX-AD, whose hit curve is highest over most of the range, has the largest AHR of 0.330. RF-BN, with the lowest hit curve, has the smallest AHR of 0.103. EPX-CAP ranks the active compounds at the start of the list somewhat better than RF-AD and hence has a slightly larger AHR. Ideally we will find a classifier with large AHR and IE, i.e., one that provides more hits early in the list as well as more actives in a shortlist of chosen length. Otherwise, we will choose a classifier by maximizing AHR alone, provided that its IE is close to the maximum.
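The AHR computation of equation (4) can be sketched as follows (illustrative Python with invented scores; ties are ignored here, whereas the paper averages over random orderings within tied groups):

```python
def average_hit_rate(y_true, scores):
    """AHR (Eq. 4): average of the hit rates h(t_i) = a_{t_i}/t_i evaluated
    at the ranked positions t_i of the active compounds. Assumes no tied
    scores; tie handling is an expectation over random orderings."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits

y = [1, 0, 1, 0, 0, 1, 0, 0]
p = [0.9, 0.8, 0.7, 0.3, 0.2, 0.6, 0.1, 0.05]
# actives land at ranks 1, 3 and 4, so AHR = (1/1 + 2/3 + 3/4)/3 ≈ 0.806
print(round(average_hit_rate(y, p), 3))
```

This is the quantity also known as average precision in the information retrieval literature, as the text notes.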

5. Results

This section contains the results of the 14 classifiers applied to each of the four bioassays. Balanced 10-fold CV is used to evaluate the 14 ensembles, with one CV run providing one AHR value and one IE value for a specific classifier. A shortlist of 300 compounds is used to compute IE for the assays AID 348, AID 362 and AID 364. For the other assay, AID 371, a shortlist of 600 compounds is used to compute IE, as AID 371 has a much larger number of active compounds to be found. The balanced 10-fold CV procedure is repeated 16 times, using parallel computing, to give 16 AHR and 16 IE values for a specific classifier. Table 3 shows AHR and IE – averaged over 16 replications of the balanced 10-fold CV – for RFs applied separately to each of the five descriptor sets AP, BN, CAP, FP and PH and for RFs using all of the descriptor sets (EDS and AD). Results are given for all four assays. Standard errors


Table 3: AHR and IE of Seven RFs Trained Using One of the Five Descriptor Sets (AP, BN, CAP, FP and PH) or Their Combinations (EDS and AD), Averaged Over 16 Replications of Balanced 10-fold CV, with Standard Errors of Means in Parentheses. Results are given for four assays. Light-gray highlighting shows, for a given metric, the largest observed mean among the five RFs applied to one of the five descriptor sets and any other means from a single descriptor set where the difference in means relative to the largest is not statistically significant at the 5% level by a Bonferroni-adjusted paired t-test. Dark-gray highlighting shows the largest mean among all seven RFs, and any other means from the seven RFs where the difference in means relative to the largest is not statistically significant at the 5% level by a Bonferroni-adjusted paired t-test.

    Descriptor                              AHR | IE
    set            AID 348             AID 362             AID 364             AID 371
    AP           0.063 | 5.19        0.280 | 8.66        0.265 | 6.46        0.315 | 3.06
               (0.002) | (0.14)    (0.006) | (0.16)    (0.008) | (0.07)    (0.003) | (0.02)
    BN           0.090 | 6.62        0.242 | 9.47        0.327 | 5.47        0.335 | 3.12
               (0.003) | (0.14)    (0.004) | (0.10)    (0.006) | (0.11)    (0.002) | (0.02)
    CAP          0.068 | 7.16        0.267 | 8.53        0.334 | 6.42        0.347 | 3.04
               (0.002) | (0.15)    (0.006) | (0.13)    (0.007) | (0.10)    (0.003) | (0.01)
    FP           0.077 | 7.07        0.266 | 8.84        0.305 | 5.53        0.362 | 3.19
               (0.002) | (0.23)    (0.004) | (0.15)    (0.006) | (0.09)    (0.003) | (0.03)
    PH           0.070 | 5.53        0.216 | 6.61        0.275 | 4.44        0.277 | 2.59
               (0.003) | (0.11)    (0.004) | (0.09)    (0.004) | (0.05)    (0.002) | (0.01)
    EDS          0.128 | 8.39        0.328 | 9.74        0.367 | 6.48        0.396 | 3.44
               (0.004) | (0.13)    (0.005) | (0.14)    (0.006) | (0.09)    (0.002) | (0.01)
    AD           0.179 | 9.63        0.305 | 9.48        0.357 | 5.88        0.389 | 3.38
               (0.005) | (0.14)    (0.005) | (0.10)    (0.006) | (0.07)    (0.003) | (0.01)

of means are also provided. For a given assay and metric, the largest observed mean among the five RFs applied to a single descriptor set is indicated by light-gray highlighting in Table 3. Also shown by light gray is any other mean from a single descriptor set where the difference in means relative to the largest is not statistically significant. Significance of the differences in means is assessed via two-sided matched-pairs tests, because the same CV partitions are used for all RFs. For example, for AID 348 and IE, the RF using the CAP descriptors has the largest mean, 7.16. Descriptor set FP gives a mean of 7.07 for a mean difference between RF-CAP and RF-FP of 0.09. The standard error of the mean of the 16 paired differences is 0.26, and the observed mean difference is not significant at the 5% level when compared with the t distribution with 15 degrees of freedom. Thus, FP is


also highlighted. In general, these comparisons among the five descriptor sets have their p-values adjusted by a Bonferroni correction for the 10 possible tests among five methods. For AID 348 and IE, the difference between the CAP and BN means is also not significant. Inspection of the results from using a single descriptor set shows substantial and mostly statistically significant differences between the mean performances for a given assay and metric. Moreover, the set providing the largest average performance depends on the assay and the performance metric. There is no clear guidance in choosing an overall best descriptor set. Similarly, dark-gray highlighting in Table 3 shows the largest mean among all seven RFs, and any other means from the seven RFs where the difference in means relative to the largest is not statistically significant by a Bonferroni-adjusted paired t-test. Here, p-values are adjusted for 21 tests comparing all seven methods. The last four rows of the table show that the observed average metric is always largest for either RF-EDS or RF-AD. For AID 362 and IE, the mean for RF-EDS is not significantly different from the RF-BN mean or the RF-AD mean, however, and similarly for AID 364 and IE, the RF-AP and RF-CAP means are not significantly different from the RF-EDS mean. In all other columns, the gains in performance from either RF-EDS or RF-AD are statistically significant and substantial.

We now demonstrate that performance is improved further by applying the EPX method. Table 4 shows the number of variables, initial groups, screened groups, candidate phalanxes, and screened phalanxes when EPX is run. The first six rows, for instance, show the application of the algorithm to the descriptor sets AP, BN, CAP, FP, PH and AD for the AID 348 assay. The sixth row shows that there are 2878 variables in AD from the union of the five descriptor sets. The 2878 variables are then partitioned into 676 initial groups.
The initial groups are formed based on the names of the variables. 10 The 674 initial groups that survived screening are amalgamated into 11 candidate phalanxes. All of the 11 candidate phalanxes survived the second stage of screening. Thus, the final EPX predictions are an ensemble of 11 models. EPX applied to the binary descriptor sets – AP, CAP, FP and PH – shows heavy filtering of the initial groups and moderate filtering of the candidate phalanxes. But the application of the algorithm to AD shows little screening of


the initial groups and no screening of the candidate phalanxes. This suggests that the variables in one descriptor set can be useful either on their own or in combination with variables from other descriptor sets.

Table 4: Number of Variables, Initial Groups, Screened Groups, Candidate Phalanxes, and Screened Phalanxes When EPX is Run with One Descriptor Set (DS) or AD.

Assay     DS    Variables    Groups                Phalanxes
                             Initial   Screened    Candidate   Screened
AID 348   AP       367          75        25           3           2
          BN        24          24        24           6           6
          CAP     1795         455       120          13          10
          FP       570         101        27           3           3
          PH       122          21         5           1           1
          AD      2878         676       674          11          11
AID 362   AP       360          77        54           9           2
          BN        24          24        24           6           6
          CAP     1319         325       286          17          16
          FP       563         102        99           7           7
          PH       112          21        18           2           1
          AD      2378         549       544          19          19
AID 364   AP       380          78        67           6           5
          BN        24          24        24           5           5
          CAP     1585         394       368          10          10
          FP       580         104        96           6           6
          PH       120          21        19           5           3
          AD      2689         621       621          14          14
AID 371   AP       382          78        63           3           3
          BN        24          24        24           4           4
          CAP     1498         362       220          11          11
          FP       580         104        85           3           3
          PH       119          21        16           4           4
          AD      2603         589       574          16          16
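The name-based formation of initial groups can be sketched as follows. This is an illustrative toy, not the authors' implementation: the exact stemming rule is an assumption here (we group variables by the alphabetic prefix of their names), and the variable names are invented.

```python
from itertools import groupby

def initial_groups(variable_names):
    """Partition variables into initial groups of related descriptors.
    Assumed rule: variables sharing the same alphabetic name stem
    (digits stripped) fall in one group."""
    stem = lambda v: ''.join(ch for ch in v if not ch.isdigit())
    ordered = sorted(variable_names, key=stem)  # groupby needs sorted input
    return [list(g) for _, g in groupby(ordered, key=stem)]

# Toy variable names in the style of a binary descriptor set
names = ["NEG01", "NEG02", "NEG03", "POS01", "POS02", "RING1"]
groups = initial_groups(names)
print(groups)  # three groups: NEG*, POS*, RING*
```

In EPX the groups would then be screened, amalgamated into candidate phalanxes, and screened again, as Table 4 tabulates.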

Next we describe how variables are drawn from the pooled variables in AD to form phalanxes, using AID 348 for illustration. Table 5 shows the descriptor-set memberships of the variables in each of the 11 phalanxes. The top-performing phalanx contains a total of 123 variables, of which 19, 3, 58, 19 and 24 are from AP, BN, CAP, FP and PH, respectively. Similarly, the other 10 phalanxes all contain variables from more than one descriptor set. The implication is that


variables from different descriptor sets work well together in the model trained for a phalanx. If we applied our algorithm to a single descriptor set, we might lose variables that are weak within that set but useful in combination with variables from other descriptor sets.

Table 5: Numbers of Variables from the Five Descriptor Sets in the 11 Phalanxes When EPX-AD is Applied to the AID 348 Assay. The phalanxes are ordered according to their individual AHRs.

                       # Variables
Phalanx     AP    BN    CAP    FP    PH    Total     AHR
   1        19     3     58    19    24     123     0.248
   2        68     4    261   105    17     455     0.200
   3         0     1      9     5     5      20     0.175
   4         6     4      5     6     0      21     0.170
   5         0     4      8     7     0      19     0.152
   6       248     6   1376   416    76    2122     0.151
   7         6     0     14     6     0      26     0.102
   8         0     1     18     1     0      20     0.097
   9        10     0     12     0     0      22     0.063
  10         6     0     11     0     0      17     0.058
  11         4     1     17     0     0      22     0.038
Total      367    24   1789   565   122    2867       –
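The AHR values used to order the phalanxes in Table 5 reward ranking actives early. As a point of reference, a common formulation of average hit rate is average precision over the actives; we assume this matches the paper's AHR up to tie handling, so treat the function below as a sketch, not the authors' definition.

```python
def average_hit_rate(ranked_labels):
    """Average hit rate of a ranked list (1 = active, 0 = inactive).
    For each active, take the fraction of actives among all compounds
    ranked at or above it; average over the actives. Ties ignored."""
    hits, total = 0, 0.0
    for position, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            total += hits / position
    return total / hits if hits else 0.0

print(average_hit_rate([1, 1, 0, 0]))  # perfect ranking: 1.0
print(average_hit_rate([0, 1, 0, 1]))  # (1/2 + 2/4) / 2 = 0.5
```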

Next we emphasize that EPX uses a larger proportion of the variables in the combined AD descriptor set than in the individual descriptor sets. Table 6 gives the number (proportion) of variables used by EPX for each assay. The first column, for AID 348, shows that except for BN, EPX uses fewer than 35% of the variables in any single descriptor set; in the EPX-EDS ensemble, EPX uses 28.8% of all the variables in the five descriptor sets. In stark contrast, when applied to AD, EPX uses 99.6% of all the variables in the five descriptor sets. Thus, even though many more variables are available in the AD set, a greater proportion of them turn out to be selected.

We next look at the predictive performance of EPX. Table 7 shows AHR and IE, averaged over 16 replications of the balanced 10-fold CV, for EPX applied to the five descriptor sets individually and collectively (EDS and AD). Statistical comparisons of the five EPX methods based on a single descriptor set, and of all seven EPX classifiers, follow the methods used for Table 3. The top-performing individual descriptor set in terms of both AHR and IE is now always


Table 6: Number (Proportion) of Variables Used by EPX.

Descriptor       AID 348         AID 362         AID 364         AID 371
set
AP             124 (0.338)     220 (0.611)     328 (0.863)     312 (0.817)
BN              24 (1.000)      24 (1.000)      24 (1.000)      24 (1.000)
CAP            486 (0.271)    1012 (0.767)    1470 (0.928)     978 (0.653)
FP             166 (0.291)     552 (0.981)     538 (0.928)     482 (0.831)
PH              29 (0.238)      89 (0.795)      96 (0.800)      89 (0.748)
EDS            829 (0.288)    1897 (0.798)    2456 (0.913)    1885 (0.724)
AD            2867 (0.996)    2361 (0.993)    2689 (1.000)    2524 (0.970)

CAP across the four assays, with EPX-BN not significantly different in two instances. Recall that CAP is the largest of the five descriptor sets. EPX copes with the high dimensionality of CAP by partitioning descriptors among distinct models; indeed, it exploits the high dimensionality to give a powerful classifier. Furthermore, for every assay EPX-CAP outperforms the top performance of RF with a single descriptor set and is about the same as or better than RF-EDS and RF-AD (compare with Table 3).

The last four rows of Table 7 present results for EPX-EDS and EPX-AD. Comparing all seven EPX classifiers, we see that the top performer is EPX-AD followed by EPX-EDS, with two exceptions where EPX-AD shares first place with EPX-EDS because the observed differences are not statistically significant. Moreover, EPX-AD clearly outperforms EPX-CAP, the best single descriptor set, every time.

Finally, EPX-AD is compared with the most competitive RF classifiers, RF-EDS and RF-AD, at the level of individual CV runs. Table 8 counts the number of CV runs where EPX-AD outperforms a competing method. The top half of the table compares AHR values. EPX-AD is uniformly superior to RF-EDS and RF-AD across all assays. EPX-AD having the larger AHR for every one of the 16 CV runs leads to a two-sided p-value from a paired sign test of 2 × 2^−16 ≈ 0.00003. The bottom half of the table, relating to IE, follows a similar pattern. The improvement from EPX-AD relative to RF-EDS and RF-AD is always statistically significant.
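The sign-test arithmetic above can be verified directly. The sketch below computes the exact two-sided binomial tail; the function name is ours.

```python
from math import comb

def sign_test_p(wins, n):
    """Two-sided p-value of a sign test: probability, under a fair coin,
    of a result at least as extreme as `wins` successes out of `n`
    paired comparisons (no ties assumed)."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(round(sign_test_p(16, 16), 5))  # 3e-05, i.e. 2 * 2**-16
print(round(sign_test_p(15, 16), 5))  # 0.00052
```

These values match the parenthesized p-values reported in Table 8.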


Table 7: AHR and IE for EPX Trained Using One of the Five Descriptor Sets (AP, BN, CAP, FP and PH) or Their Combinations (EDS and AD), Averaged Over 16 Replications of Balanced 10-fold CV, with Standard Errors of Means in Parentheses. Results are given for four assays. Light-gray highlighting shows, for a given metric, the largest observed mean among the five EPX classifiers applied to one of the five descriptor sets and any other means from a single descriptor set where the difference in means relative to the largest is not statistically significant at the 5% level by a Bonferroni-adjusted paired t-test. Dark-gray highlighting shows the largest mean among all seven EPX classifiers, and any other means from the seven classifiers where the difference in means relative to the largest is not statistically significant at the 5% level by a Bonferroni-adjusted paired t-test.

                                            AHR | IE
Descriptor   AID 348                  AID 362                   AID 364                  AID 371
set
AP      0.128 (0.003) |  6.47 (0.08)   0.309 (0.006) |  8.24 (0.09)   0.308 (0.008) | 6.50 (0.08)   0.330 (0.003) | 3.02 (0.01)
BN      0.124 (0.003) |  8.07 (0.13)   0.278 (0.006) |  9.25 (0.11)   0.391 (0.007) | 6.77 (0.10)   0.360 (0.002) | 3.24 (0.02)
CAP     0.230 (0.007) |  8.67 (0.14)   0.373 (0.006) |  9.62 (0.07)   0.397 (0.006) | 7.49 (0.06)   0.397 (0.003) | 3.29 (0.02)
FP      0.143 (0.006) |  8.02 (0.14)   0.311 (0.006) |  8.93 (0.11)   0.332 (0.006) | 6.39 (0.07)   0.365 (0.002) | 3.15 (0.02)
PH      0.109 (0.003) |  6.60 (0.08)   0.215 (0.004) |  6.53 (0.09)   0.277 (0.004) | 4.72 (0.06)   0.274 (0.002) | 2.49 (0.02)
EDS     0.229 (0.006) | 11.51 (0.12)   0.366 (0.005) |  9.23 (0.13)   0.436 (0.006) | 7.68 (0.09)   0.428 (0.002) | 3.66 (0.01)
AD      0.296 (0.006) | 11.24 (0.21)   0.391 (0.006) | 10.14 (0.10)   0.430 (0.007) | 7.80 (0.09)   0.432 (0.003) | 3.71 (0.01)
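The Bonferroni-adjusted paired t-tests behind the highlighting in Tables 3 and 7 can be sketched as follows. The numbers are illustrative and the function names are ours; this is not the authors' code.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic for a paired t-test on per-replication metric values,
    e.g., AHR from each of the 16 CV replications for two classifiers."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

def bonferroni_adjust(p_values):
    """Bonferroni correction: multiply each raw p-value by the number of
    tests, capping at 1. Five descriptor sets give C(5,2) = 10 pairwise
    comparisons; seven classifiers give C(7,2) = 21."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Illustrative raw p-values from three pairwise comparisons (made up):
raw = [0.004, 0.030, 0.250]
adjusted = bonferroni_adjust(raw)
print(adjusted)
```

An adjusted p-value below 0.05 then corresponds to significance at the 5% level after accounting for the multiple comparisons.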

6. Strength and Diversity of Ensembles

In Section 2 we noted that Breiman19 argued that an effective ensemble method will ideally average classifiers with two desirable properties: the constituent classifiers should be individually strong, and their predictions should be weakly correlated. We now examine these properties to explain the results in Section 5.

First, we examine why RF-EDS is outperformed by EPX-EDS, taking AID 348 and CV run 1 for illustration. Recall that both methods average over classifiers from the five descriptor sets. Panels (a) and (b) of Figure 2 show the strengths, as measured by AHR or IE, of the five constituent classifiers along the x axes. For example, the label AP: 0.062 / 5.84 in panel (a) indicates that RF,


Table 8: Pairwise Comparisons of EPX-AD With RF-EDS and RF-AD at the Level of Individual CV Runs. The frequencies reported are the number of times out of 16 CV runs that EPX-AD has an AHR value (top half of the table) or IE value (bottom half of the table) at least as large as that of the competing method. The numbers in parentheses are p-values of two-sided sign tests comparing EPX-AD with each reference method.

                          # CV runs with EPX-AD metric ≥ competitor's metric
Metric   EPX-AD versus    AID 348        AID 362        AID 364        AID 371
AHR      RF-EDS           16 (0.00003)   16 (0.00003)   16 (0.00003)   16 (0.00003)
         RF-AD            16 (0.00003)   16 (0.00003)   16 (0.00003)   16 (0.00003)
IE       RF-EDS           16 (0.00003)   15 (0.00052)   16 (0.00003)   16 (0.00003)
         RF-AD            16 (0.00003)   16 (0.00003)   16 (0.00003)   16 (0.00003)

when applied to descriptor set AP, yields an AHR of 0.062 and an IE of 5.84. EPX-EDS has superior AHR and IE for every descriptor set, i.e., the constituent classifiers in its ensemble are uniformly stronger. The other ensemble property, correlation between predictions from the constituent classifiers, is shown by the gray-scale diversity maps in the plots. To make the plot in panel (a), relating to RF-EDS, the 48 active compounds are labeled 1–48 (right-hand y axis) in the order of their RF-EDS ranks. The last column in the plot indicates by gray scale (left-hand y axis) the ranks assigned by RF-EDS to the 48 actives, with black signifying rank 1 and white the last rank. Actives near the top of the RF-EDS shortlist appear in dark gray and those ranked later in light gray, so a preponderance of darker grays indicates an overall better ordering. The same gray scale is used for the rankings of the constituent classifiers. Thus, in the plot, low correlation among the predictions, i.e., diversity in ranks across the constituents, shows up as different patterns of dark versus lighter gray. It is seen that there is considerable variation in ranks across the RF-AP, RF-BN, RF-CAP, RF-FP and RF-PH constituent classifiers. That is why the metrics for the RF-EDS ensemble are better than for any constituent. Turning to EPX-EDS in panel (b), we again see much variation in ranking among EPX-AP, EPX-BN, EPX-CAP, EPX-FP and EPX-PH, but the contrast in colors across classifiers is even greater, i.e., more


variation. With better individual strengths of the underlying classifiers and greater diversity among them, EPX-EDS substantially outperforms RF-EDS.

[Figure 2 appears here. The recoverable AHR / IE labels of the constituent classifiers are: panel (a) RF-EDS: AP 0.062 / 5.84, BN 0.103 / 7.15, CAP 0.072 / 7.21, FP 0.083 / 7.56, PH 0.071 / 5.68, EDS 0.137 / 8.59; panel (b) EPX-EDS: AP 0.139 / 6.53, BN 0.135 / 8.24, CAP 0.247 / 8.93, FP 0.172 / 8.84, PH 0.115 / 7.22, EDS 0.252 / 11.68; panel (c) EPX-AD: PX1 0.248 / 7.97, PX2 0.200 / 9.27, PX3 0.175 / 5.93, PX4 0.170 / 7.21, PX5 0.152 / 6.31, PX6 0.151 / 8.24, PX7 0.102 / 3.28, PX8 0.097 / 4.63, PX9 0.063 / 2.62, PX10 0.058 / 1.80, PX11 0.038 / 2.13, EPX 0.330 / 11.51.]

Figure 2: Diversity maps of (a) RF-EDS, (b) EPX-EDS, and (c) EPX-AD for AID 348 and the first CV run. The 48 active compounds are ordered on the right-side y axis using each method's ranking. The gray scale (left-hand y axis) shows the ranks assigned to the 48 actives by RF-EDS, EPX-EDS, EPX-AD, and their constituent classifiers, with black signifying rank 1 and white the last rank.

Similarly, panels (b) and (c) of Figure 2 compare EPX-EDS and EPX-AD. These methods use different underlying classifiers, based on descriptor sets versus phalanxes of descriptors, and so the strengths of the classifiers in the two ensembles cannot be compared one to one. Nonetheless, it is seen that EPX-AD's constituent AHR values of 0.248, 0.200, 0.175, etc. beat the 0.247, 0.172, 0.135 values for EPX-EDS. EPX-AD also shows more diversity among the orderings from the phalanxes than EPX-EDS does among its descriptor sets. Hence the overall AHR of EPX-AD, at 0.330, is superior to 0.252 for EPX-EDS. The relative strengths in terms of constituent IE values are less clear, and the overall IE values of the two methods are similar.

To investigate the impact on the structural diversity of the compounds found, we now show some of the active compounds identified by EPX-AD for the AID 364 assay and compare them with structures found using just one descriptor set.10 Table 9 gives four compounds ranked highly by EPX-AD for AID 364 but poorly by at least one EPX (or RF) classifier based on a single descriptor set. For instance, EPX-BN, one of the best single descriptor sets for AID 364, assigns ranks 485.5 and 3129 to compounds CID 5790 and


[Figure 3 appears here, with panels (a) CID 219770, (b) CID 663143, (c) CID 5790, and (d) CID 6364669.]

Figure 3: Structures of four active compounds from the AID 364 assay ranked well by EPX-AD but not by at least one of EPX-AP, EPX-BN, EPX-CAP, EPX-FP, and EPX-PH.

CID 6364669, respectively. EPX-AD is able to assign high ranks to these four compounds by exploiting a range of pharmacological features captured by the different sets.

Table 9: Four Active Compounds from the AID 364 Assay Ranked Well by EPX-AD But Not by at Least One of EPX-AP, EPX-BN, EPX-CAP, EPX-FP, and EPX-PH.

                                    Rank from
CID         EPX-AP    EPX-BN    EPX-CAP    EPX-FP    EPX-PH    EPX-AD
219770       112.5      14.0       29.0     2907.5    2094.5      36.0
663143       168.0      39.0       64.0       47.5    2094.5      55.0
5790        2007.5     485.5      145.0      214.5      67.5      81.0
6364669     2772.0    3129.0      414.5      192.0     118.5      91.0

These four compounds, with structures shown in Figure 3, can be compared with those displayed in Figure 6 of Tomal et al. (2015).10 The superior ranking performance of EPX-AD here has added to the diversity of compounds for lead optimization.
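Diversity among constituent rankings, the property displayed by the diversity maps in Figure 2, can be quantified with a rank correlation. The following is a minimal stdlib sketch on toy rankings, not the paper's data; lower correlation means more diversity.

```python
def spearman(r1, r2):
    """Spearman correlation between two rankings of the same items,
    computed as the Pearson correlation of the rank vectors. Valid when
    both are permutations of 1..n (no ties)."""
    n = len(r1)
    m = (n + 1) / 2                      # mean rank
    num = sum((a - m) * (b - m) for a, b in zip(r1, r2))
    den = sum((a - m) ** 2 for a in r1)  # same for any permutation of 1..n
    return num / den

# Toy rankings of five actives by three constituent classifiers:
rank_a = [1, 2, 3, 4, 5]
rank_b = [2, 1, 3, 5, 4]   # close to rank_a: high correlation, little diversity
rank_c = [5, 4, 3, 2, 1]   # reversed: maximal diversity
print(spearman(rank_a, rank_b))  # 0.8
print(spearman(rank_a, rank_c))  # -1.0
```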

7. Discussion and Conclusions

We have introduced two possible ways of using multiple descriptor sets together to classify and rank rare active compounds ahead of the majority inactive class in QSAR studies. Ensembling across the descriptor sets and pooling the descriptor sets both outperform picking the best single descriptor set, sometimes substantially. Furthermore, building classifiers using EPX provided better ranking than RF. For the four assays analyzed here, the overall most effective strategy was to


pool all descriptors together and run EPX, i.e., the EPX-AD method. Even though there are as many as 2879 descriptors in total and as few as 48 actives, EPX-AD can make use of the descriptors by putting them in separate models. Most remarkably, EPX-AD always used at least 97% of the descriptors across the models from the various phalanxes. The implication is that every descriptor set has some potentially useful extra information if it can be exploited by the statistical modeling strategy. Another advantage of our proposed approach is that if knowledge of useful pharmacological features suggests the use of a particular descriptor set, it can easily be included in the pool of variables presented to EPX-AD.

A straightforward extension of the ensemble methods described here would be to regression, where the assay response is continuous, e.g., the activity score from a primary screen. Random forests of regression trees would be used as constituent models, including those based on phalanxes, rather than forests of classification trees. The performance metric could be root mean squared error of prediction or, to retain the aim of ranking, the methods described by Wang et al.30 could be used.

R software to form phalanxes for classification is available from the first author.
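The regression extension amounts to averaging constituent regression models and scoring by RMSE. A toy illustration follows, with simple stand-in models in place of random forests of regression trees; all names and values are ours, invented for the sketch.

```python
from math import sqrt

def ensemble_predict(models, x):
    """Average the predictions of constituent models (one per phalanx),
    just as the classification version averages probabilities."""
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

def rmse(y_true, y_pred):
    """Root mean squared error of prediction."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Two toy 'phalanx' regressors standing in for random forests of regression trees
models = [lambda x: 2.0 * x, lambda x: 2.0 * x + 1.0]
xs = [0.0, 1.0, 2.0]
ys = [0.5, 2.5, 4.5]                     # toy 'true' activity scores
preds = [ensemble_predict(models, x) for x in xs]
print(rmse(ys, preds))  # 0.0: the ensemble average 2x + 0.5 fits exactly
```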

Acknowledgement We thank the Editor, Associate Editor, and the Reviewers for their insightful comments, which substantially strengthened the arguments presented here. We also thank S. Stanley Young for his help generating the descriptor sets. This research was funded by Natural Sciences and Engineering Research Council of Canada grants RGPIN-2014-04962 and RGPIN-2014-05227. The WestGrid computational facilities of Compute Canada are also gratefully acknowledged.

Supporting Information Available This material is available free of charge via the Internet at http://pubs.acs.org/.


References

(1) Engels, M. F.; Venkatarangan, P. Smart Screening: Approaches to Efficient HTS. Curr. Opin. Drug Discovery Dev. 2001, 4, 275–283.

(2) Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors; Wiley-VCH: Weinheim, Germany, 2000.

(3) Leach, A. R.; Gillet, V. J. An Introduction to Chemoinformatics; Springer: Dordrecht, The Netherlands, 2007; Chapter 3.

(4) Molecular Descriptors Guide. http://www.epa.gov/sites/production/files/2015-05/documents/moleculardescriptorsguide-v102.pdf, accessed January 15, 2016.

(5) Dragon Molecular Descriptors. http://www.talete.mi.it/products/dragon_molecular_descriptors.htm, accessed January 15, 2016.

(6) Alves, V. M.; Muratov, E.; Fourches, D.; Strickland, J.; Kleinstreuer, N.; Andrade, C. H.; Tropsha, A. Predicting Chemically-Induced Skin Reactions. Part I: QSAR Models of Skin Sensitization and Their Application to Identify Potentially Hazardous Compounds. Toxicol. Appl. Pharmacol. 2015, 284, 262–272.

(7) Hawkins, D. M.; Basak, S. C.; Mills, D. Assessing Model Fit by Cross-Validation. J. Chem. Inf. Comput. Sci. 2003, 43, 579–586.

(8) Hong, H.; Xie, Q.; Ge, W.; Qian, F.; Fang, H.; Shi, L.; Su, Z.; Perkins, R.; Tong, W. Mold2, Molecular Descriptors from 2D Structures for Chemoinformatics and Toxicoinformatics. J. Chem. Inf. Model. 2008, 48, 1337–1344.

(9) Tseng, Y. J.; Hopfinger, A. J.; Esposito, E. X. The Great Descriptor Melting Pot: Mixing Descriptors for the Common Good of QSAR Models. J. Comput.-Aided Mol. Des. 2012, 26, 39–43.


(10) Tomal, J. H.; Welch, W. J.; Zamar, R. H. Ensembling Classification Models Based on Phalanxes of Variables with Applications in Drug Discovery. Ann. Appl. Stat. 2015, 9, 69–93.

(11) Aoyama, T.; Suzuki, Y.; Ichikawa, H. Neural Networks Applied to Quantitative Structure-Activity Relationships Analysis. J. Med. Chem. 1990, 33, 2583–2590.

(12) Rusinko, A.; Farmen, M. W.; Lambert, C. G.; Brown, P. L.; Young, S. S. Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning. J. Chem. Inf. Comput. Sci. 1999, 39, 1017–1026.

(13) Kauffman, G. W.; Jurs, P. C. QSAR and K-Nearest Neighbor Classification Analysis of Selective Cyclooxygenase-2 Inhibitors Using Topologically-Based Numerical Descriptors. J. Chem. Inf. Comput. Sci. 2001, 41, 1553–1560.

(14) Doniger, S.; Hofmann, T.; Yeh, J. Predicting CNS Permeability of Drug Molecules: Comparison of Neural Network and Support Vector Machine Algorithms. J. Comput. Biol. 2004, 9, 849–864.

(15) Merkwirth, C.; Mauser, H.; Schulz-Gasch, T.; Roche, O.; Stahl, M.; Lengauer, T. Ensemble Methods for Classification in Cheminformatics. J. Chem. Inf. Comput. Sci. 2004, 44, 1971–1978.

(16) Bruce, C. L.; Melville, J. L.; Pickett, S. D.; Hirst, J. D. Contemporary QSAR Classifiers Compared. J. Chem. Inf. Model. 2007, 47, 219–227.

(17) Freund, Y.; Schapire, R. E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.

(18) Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140.

(19) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.

(20) Hughes-Oliver, J. M.; Brooks, A. D.; Welch, W. J.; Khaledi, M. G.; Hawkins, D.; Young, S. S.; Patil, K.; Howell, G. W.; Ng, R. T.; Chu, M. T. ChemModLab: A Web-Based Cheminformatics Modeling Laboratory. In Silico Biol. 2011, 11, 61–81.

(21) Burden, F. R. Molecular Identification Number for Substructure Searches. J. Chem. Inf. Comput. Sci. 1989, 29, 225–227.

(22) Pearlman, R. S.; Smith, K. M. Metric Validation and the Receptor-Relevant Subspace Concept. J. Chem. Inf. Comput. Sci. 1999, 39, 28–35.

(23) Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications. J. Chem. Inf. Comput. Sci. 1985, 25, 64–73.

(24) Liu, K.; Feng, J.; Young, S. S. PowerMV: A Software Environment for Molecular Viewing, Descriptor Generation, Data Analysis and Hit Evaluation. J. Chem. Inf. Model. 2005, 45, 515–522.

(25) Breiman, L.; Cutler, A.; Liaw, A.; Wiener, M. Breiman and Cutler's Random Forests for Classification and Regression. 2015; https://cran.r-project.org/web/packages/randomForest/randomForest.pdf, Version 4.6-12, accessed January 15, 2016.

(26) Molecular Libraries Screening Centers Network. http://mli.nih.gov/mli/secondary-menu/mlscn/, accessed January 15, 2016.

(27) BioAssay Record for AID 348. https://pubchem.ncbi.nlm.nih.gov/bioassay/348, accessed January 15, 2016.

(28) Kearsley, S. K.; Sallamack, S.; Fluder, E. M.; Andose, J. D.; Mosley, R. T.; Sheridan, R. P. Chemical Similarity Using Physiochemical Property Descriptors. J. Chem. Inf. Comput. Sci. 1996, 36, 118–127.


(29) Wang, Y. M. Statistical Methods for High Throughput Screening Drug Discovery Data. Ph.D. Thesis, University of Waterloo, Waterloo, Canada, 2005.

(30) Wang, X. S.; Salloum, G. A.; Chipman, H. A.; Welch, W. J.; Young, S. S. Exploration of Cluster Structure-Activity Relationship Analysis in Efficient High-Throughput Screening. J. Chem. Inf. Model. 2007, 47, 1206–1214.
