Interpretation of QSAR Models By Coloring Atoms According to Changes in Predicted Activity: How Robust Is It? Robert P. Sheridan J. Chem. Inf. Model., Just Accepted Manuscript • Publication Date (Web): 19 Feb 2019 Downloaded from http://pubs.acs.org on February 19, 2019

Journal of Chemical Information and Modeling

Interpretation of QSAR Models By Coloring Atoms According to Changes in Predicted Activity: How Robust Is It? Robert P. Sheridan Modeling and Informatics, Merck & Co. Inc., Kenilworth, NJ 07065 [email protected]

ABSTRACT

Most chemists would agree that the ability to interpret a Quantitative Structure Activity Relationship (QSAR) model is as important as the ability of the model to make accurate predictions. One type of interpretation is coloration of atoms in molecules according to the contribution of the atom to the predicted activity, as in “heat maps”. The ability to determine which parts of a molecule increase the activity in question and which decrease it should be useful to chemists who want to modify the molecule. For that type of application, we would hope the coloration not to be particularly sensitive to the details of model-building. In this paper we examine a number of aspects of coloration against 20 combinations of descriptors and QSAR methods. We demonstrate that atom-level coloration is much less robust to descriptor/method combinations than cross-validated predictions. Even in ideal cases where the contributions of individual atoms are known, we cannot always recover the important atoms for some descriptor/method combinations. Thus, model interpretation by atom coloration may not be as simple as it first appeared.

INTRODUCTION

Quantitative Structure-Activity Relationships (QSAR) is a very commonly used technique in the pharmaceutical industry for predicting on-target and off-target activities. As the branch of machine learning applied to chemistry, QSAR uses a statistical method to predict biological activities or physical properties from chemical descriptors. The set of rules to do the prediction can be called a “model.” Such predictions help prioritize the experiments during the drug discovery process. While higher prediction accuracy is always desirable, the ability to interpret a QSAR model is also important. Chemists would like to know what chemical features a model is
capturing and how to use that information to modify molecules to optimize one or more activities. The most recent comprehensive review on the topic of QSAR model interpretability is from Polishchuk 1. It is generally held that QSAR methods and descriptors that give rise to the most accurate predictions are not necessarily the ones that lend themselves most easily to interpretation. In particular, deep neural nets (DNNs) are becoming more popular as a QSAR method, but these models are generally considered very hard to interpret 2.

There are two major realms of interpretability for QSAR. The first is “descriptor importance” (sometimes called “feature importance” in the field of machine learning). One would like to query a QSAR model as to which chemical descriptors make the most difference in activity. Some QSAR methods produce descriptor importances as a part of constructing the model (e.g. PLS, linear SVM, simple recursive partitioning, etc.), while others (non-linear SVMs, random_forest, DNNs) do not. However, one can generate descriptor importances through some type of “sensitivity analysis” on almost any QSAR model, i.e. perturb an individual descriptor (for example by permuting the values randomly among molecules) and see how the overall accuracy of prediction changes.

Another realm of interpretability is coloring molecules according to a QSAR model 3-8. The colored molecule is sometimes called a “heat map”. One advantage of a heat map is that the chemist need not interpret individual descriptor types but merely inspect a 2D structure. Heat maps can be generated only from substructure-type descriptors, i.e. those that indicate the presence or frequency of a particular chemical group in a molecule (e.g. atom pairs, circular fingerprints, MACCS keys, etc.), and not from property-type descriptors that apply to the entire molecule (LOGP, number of rotatable bonds, etc.). Robinson et al.7 have a detailed
discussion about what makes heat maps better, and compared several methods of generating them.

There are two major approaches to producing heat maps. One is to map descriptor importances from a QSAR model onto atoms, such that the color of each atom reflects the sum of the importances for all descriptors using that atom. The other approach is to remove the chemical descriptors associated with an atom or group of atoms and monitor how the predicted activity for that molecule changes. The color applied to an atom reflects the difference in the prediction. The second approach is preferable for a number of reasons, some of which are:
1. It can apply to any descriptor/QSAR method combination, regardless of whether the method natively produces descriptor importances.
2. No explicit mapping of descriptors to atoms is necessary; one need only generate descriptors from the structure, which is the usual direction in QSAR.
3. There is no necessity of knowing the sign of a descriptor importance; some methods like random forest produce no signs.
4. The change we are making (i.e. neutralizing the descriptors of one atom) is exactly what we want to monitor if we are coloring single atoms.
5. The scale of the color for any descriptor/QSAR method combination is the same: the range of activity for the training set of the model.

There are more sophisticated ways of finding atom contributions than removing descriptors associated with an atom, but this approach has the advantage that it requires only an out-of-the-box set of capabilities for QSAR infrastructure, namely descriptor generation and prediction.
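The second approach can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not the implementation used in this work: the toy atom-pair-like descriptor (`atom_pair_counts`), the hand-written linear weights standing in for a trained QSAR model, and the masking of an atom with a placeholder element ("Na") that the model has never seen, so that descriptors involving that atom contribute nothing. The color of atom k is taken here as the original prediction minus the prediction with atom k masked.

```python
from collections import Counter, deque


def atom_pair_counts(elements, bonds):
    """Count (element, bond distance, element) triples over all atom pairs."""
    n = len(elements)
    neighbors = {i: [] for i in range(n)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    counts = Counter()
    for start in range(n):
        # Breadth-first search gives the topological distance from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in neighbors[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        for other, d in dist.items():
            if other > start:  # count each unordered pair once
                first, second = sorted((elements[start], elements[other]))
                counts[(first, d, second)] += 1
    return counts


def predict(weights, descriptors):
    """Toy linear model; descriptors unknown to the model contribute 0."""
    return sum(weights.get(key, 0.0) * n for key, n in descriptors.items())


def atom_colors(elements, bonds, weights, mask="Na"):
    """Color of atom k = prediction(original) - prediction(atom k masked)."""
    base = predict(weights, atom_pair_counts(elements, bonds))
    colors = []
    for k in range(len(elements)):
        masked = list(elements)
        masked[k] = mask  # hypothetical placeholder element
        colors.append(base - predict(weights, atom_pair_counts(masked, bonds)))
    return colors


# Heavy atoms of a C-C-O fragment with hypothetical model weights.
elements = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
weights = {("C", 1, "O"): 1.0, ("C", 2, "O"): 0.5}
colors = atom_colors(elements, bonds, weights)
print(colors)
```

In a real application the descriptor generator and the trained model of the descriptor/method combination under study would replace `atom_pair_counts` and `predict`; only those two capabilities are needed.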

The ability to determine which parts of a molecule help to increase the activity in question and which help to decrease it should be useful to chemists who want to modify the molecule. For example, if the chemist wanted to modify a molecule to increase the activity, he or she would leave alone the atoms that were associated with increased activity and change the atoms associated with lower activity. For that type of application, we would hope the coloration would be robust, i.e. which atoms are strongly associated with activity and inactivity should not be particularly sensitive to the descriptor used in the model (e.g. atom pairs vs. ECFP4) or the QSAR method (e.g. random forest vs. SVM).

Here we undertake a study to determine the sensitivity of atom coloration and cross-validated prediction to 4 types of chemical descriptors and 5 QSAR methods. We show that atom coloring is much more sensitive to descriptors and method than cross-validated prediction. Further, in ideal cases where we know which atoms are important for activity, not all descriptor/method combinations can recover the correct atoms. This would imply that atom coloring may not be as straightforwardly interpretable as hoped.

METHODS

For the purposes of a more compact notation we will use “T” to refer to different datasets, “D” to refer to different descriptors and “Q” to refer to different QSAR methods.

Data sets (T)

Twenty-four data sets used in this study are in Table 1. These are originally from Cortes-Ciriano9 and represent on-target datasets, most of which are from ChEMBL10. The names in Table 1 correspond to the file names in the download. Where the names of the files differ from the names in Table 1 of Cortes-Ciriano, the alternative names are shown in parentheses. The activities are in terms of -log(IC50), -log(EC50), or -log(Ki). If the same compound is listed more than once, we use the mean activity over all instances. The SMILES for the compounds and their activities are in Supporting
Information. The diversity of each dataset is measured by the mean pairwise similarity of molecules using the ECFP4 descriptor and Tanimoto similarity index. The lower this number, the more diverse the dataset. As with most public domain datasets assembled from the literature, they are not fully diverse. For all datasets, the mean pairwise similarity is greater than 0.14, which would be the mean pairwise similarity of random druglike compounds. Two of the datasets (IL4 and MM2) have a mean pairwise similarity consistent with sets of close analogs.

As well as the “real” datasets discussed above, it is useful to have idealized datasets where 1) the activity is perfectly predictable from the structure alone and 2) the chemical groups responsible for the activity are unambiguously known. For this reason we modified the P47871 dataset by making the “activity” the number of negative charges at pH 7.4. This is called the Anion dataset. We also modified the P47871 dataset such that the activity is the count of nonhydrogen atoms. This is the Atomcount dataset.

QSAR Descriptors (D)

Chemical descriptors we use in this study are listed below. All descriptors are used in frequency form, i.e. we use the number of occurrences in a molecule and not just the presence or absence.
1. APDP. This is the union of AP, the original "atom pair" descriptor from Carhart et al. 11, and DP descriptors ("Donor acceptor Pair"), called "BP" in Kearsley et al. 12. We use these in most of our QSAR studies and in our production QSAR. Both descriptors are of the form:
Atom type i – (distance in bonds) – Atom type j
For AP, atom type includes the element, number of nonhydrogen neighbors, and number of pi electrons; it is very specific. For DP, atom type is one of seven (cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, and other); it contains a more generic description of chemistry.
2. TTDT. This is the union of the topological torsion descriptor of Nilakantan et al. 13 and the DT descriptors (called “BT” in Kearsley et al.12). These use the same atom typing as AP and DP descriptors but represent four consecutively bonded atoms.
3. ECFP4. ECFP4 is the circular fingerprint with a diameter of 4 bonds described by Rogers and Hahn14. It is a literature standard as a chemical descriptor.
4. DRUGBITS. These represent ~300 common groups found in drugs (indole, hydroxide, amide, etc.). The exact substructures were published as Supporting Information in Svetnik et al.15. This is included because it is analogous to some of the fragment descriptors used by other workers for atom coloration.

ECFP4 and TTDT can be considered “local descriptors” in that they include atoms near each other (within 3 or 4 bonds). APDP descriptors, in contrast, can include atoms that are far apart. DRUGBITS can be considered a kind of local descriptor, although as we will see, it behaves somewhat differently from ECFP4 and TTDT.

QSAR methods (Q)

All methods are used in regression mode, i.e. both input activities and predictions are floating-point numbers. All appropriate descriptors are used in the models, i.e. no feature selection is done.
1. random_forest. We use the R module randomForest (https://cran.r-project.org/web/packages/randomForest/index.html), which encodes the original method of Breiman16. Random forest was first applied to QSAR by Svetnik et al. 17. The defaults are 100 trees, nodesize=5, mtry=M/3 where M is the number of unique descriptors.
2. partial least squares (pls). We use the R module pls (https://cran.r-project.org/web/packages/pls/vignettes/pls-manual.pdf), which encodes the implementation described by
Mevik and Wehrens 18. The number of components (up to 10) is optimized by cross-validation on the training set.
3. liblinear. This is an efficient implementation of linear kernel SVM by Fan et al.19 (https://www.csie.ntu.edu.tw/~cjlin/liblinear/). The regularization parameter C is adjusted during model building such that cross-validated R2 within the training set is maximized.
4. deep neural networks (DNN). We use Python-based code obtained from the Kaggle contest and described in Ma et al.20. We use parameters slightly different from the “standard set” described in that paper: two intermediate layers of 1000 and 500 neurons with 25% dropout rate, and 75 training epochs. This change is made for the purposes of more time-efficient calculation. The accuracy of prediction is very similar to that from the standard set.
5. xgboost. This is the Extreme Gradient Boosting method published by Chen and Guestrin 21. We are using a set of standard parameters from our own QSAR study 22.

Two of the methods, pls and liblinear, can be considered linear methods, and the rest nonlinear.

Workflow for activity prediction

Since the methods vary in how well they self-fit the data, it makes sense to look at random cross-validated predictions. Random cross-validation is very optimistic in terms of true predictivity 23, but it is a valid way of comparing QSAR methods. The idea here is to see how robust cross-validated predictions are to D and Q. The workflow is shown in Scheme 1. Scheme 1 uses N×2 cross-validation, which produces the same number of predictions for each molecule. In this case, 10 models and 10 sets of predictions are produced, but each molecule is predicted 5 times. The mean prediction for molecule i for D and Q over the 5 predictions of i is called the
“consensus prediction” A(i,D,Q). We can calculate the Pearson R2 between the A(i,D,Q) for each pair of D/Q combinations, and do this for each T. For example, we would compare the consensus predictions from the APDP/random_forest model of estrogen_alpha with the consensus predictions from the TTDT/liblinear model of estrogen_alpha.

Workflow for coloration

Our purpose here is to see how sensitive the coloration of atoms in a molecule is to D and Q. The workflow is in Scheme 2. A sample of 200 molecules is taken to cover the chemical space of T but reduce the computational time. In our implementation we do not really delete descriptors associated with atom k. Instead, we replace the element type of atom k with “Na” (sodium) and recalculate the descriptors. Since there are no molecules with a covalently bound Na in any of the datasets, descriptors associated with Na atoms do not exist in the training set of the QSAR model. This is equivalent to making the descriptors associated with atom k vanish for the purposes of prediction, without keeping a list of which descriptors map to which atoms. Note that C(i,k,D,Q) is strictly the “contribution” of atom k to the activity of molecule i, but as a type of shorthand we are calling it the “color” of k since that is the ultimate use of the contribution. True “colors” in a heat map would be generated by mapping C(i,k,D,Q) to a color spectrum. We are deemphasizing heat maps in this paper.

RESULTS

We address the specific questions:
1. Which D/Q combinations are the most predictive by cross-validation for real datasets?
2. Does the overall color of molecules track with the observed activities, and how does that differ with D/Q combination?
3. Does the most extreme color per molecule vary with D/Q combination?
4. Are important atoms distinguishable from other atoms in a molecule and how does this vary with D/Q combination?
5. Do different D/Q combinations make similar cross-validated predictions for molecules?
6. Do different D/Q combinations put similar colors on the same set of atoms?
7. Do all D/Q combinations identify the atoms important to activity for an idealized problem?

Self-consistency of descriptor/method combinations

The R2 of consensus predictions vs. observed from the cross-validation for all combinations of T/D/Q is in Supporting Information. This is a measure of the self-consistency of the models, and is not meant to be representative of true prospective prediction. The R2 ranges from 0.81 (mtor/APDP/xgboost) to 0.12 (P61169/DRUGBITS/liblinear). No cross-validations could be done with the kinase_src/APDP/DNN combination due to convergence issues with DNN, so these are omitted. The mean R2 over all T for each D/Q is in Table 2. This result demonstrates that one can get reasonably self-consistent models for the datasets using a variety of D/Q combinations, although clearly some combinations are better than others. We used a two-tailed Student’s t-test to estimate the statistical significance of the difference between D/Q combinations. Since the standard deviations are roughly the same and the number of samples is identical (24), the significance level depends mostly on the difference between the means. D/Q combinations that differ in their mean R2 by at least 0.09 are significantly different at the p < 0.01 level. Random_forest and xgboost seem among the best methods. This is not surprising given our previous comparisons.22 Previously we observed that DNN did better than random_forest on large (>10,000 compounds) datasets20, but here the datasets are small.
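The consensus cross-validation of Scheme 1 and the R2 of consensus predictions against observed activities can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: a one-descriptor least-squares line (`fit_line`) stands in for a real D/Q combination, the data are synthetic, and five random half/half splits are used so that 10 models are built and each molecule is predicted 5 times.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Least-squares line y = a + b*x (hypothetical stand-in for a QSAR method)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def consensus_predictions(xs, ys, n_trials=5, seed=0):
    """N x 2 cross-validation: each molecule is predicted once per trial."""
    rng = random.Random(seed)
    n = len(xs)
    preds = [[] for _ in range(n)]
    for _ in range(n_trials):
        order = list(range(n))
        rng.shuffle(order)
        half_a, half_b = order[: n // 2], order[n // 2:]
        # Train on one half and predict the other, then swap the roles.
        for train, test in ((half_a, half_b), (half_b, half_a)):
            model = fit_line([xs[i] for i in train], [ys[i] for i in train])
            for i in test:
                preds[i].append(model(xs[i]))
    # The consensus prediction is the mean over the trials.
    return [mean(p) for p in preds]


def pearson_r2(u, v):
    """Squared Pearson correlation between two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov ** 2 / (var_u * var_v)


# Synthetic data: one descriptor with a small alternating perturbation.
xs = list(range(20))
ys = [0.5 * x + 3.0 + (0.3 if x % 2 else -0.3) for x in xs]
consensus = consensus_predictions(xs, ys)
r2 = pearson_r2(consensus, ys)
print(round(r2, 3))
```

The same `pearson_r2` helper could be applied to the consensus predictions of two different D/Q combinations to compare them with each other.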
Clearly, DRUGBITS combined with a
linear method (pls or liblinear) leads to poorer models. Normalized root-mean-square error is an alternative metric to R2 for measuring the agreement of predicted and observed. It gives a very similar order for the cross-validated predictivity of D/Q combinations.

Overall molecule color vs. observed activity

If we sum C(i,k,D,Q) over all atoms k for each molecule i, that should produce a “summed color” for molecule i. One expectation about atom coloring is that molecules with higher observed activities should, generally speaking, have higher summed colors. The R2 of the summed color vs. the observed activity is shown in Supporting Information for all T/D/Q combinations. That R2 averaged over all datasets is shown in Table 3. D/Q pairs that differ in their mean R2 by at least 0.06 are significantly different at the p 100. Linear methods have no problem extrapolating from the majority of molecules to these few outliers during cross-validation, while non-linear methods apparently do. Non-linear methods appear much more predictive when the outliers are removed, although still not as good as the linear methods (R2 ~0.7 vs. > 0.9).

Ideally, removing any atom would reduce the atom count by exactly 1, and we should never see a color < 0, i.e. where removing an atom would increase the atom count of a molecule. Figure 5 shows boxplots of colors. In some D/Q combinations there are (rare) outliers with colors as high as 36 and as low as -23. Which molecules contain the outliers varies with D/Q. These outliers are not due to the artifact mentioned above since we see this for methods other than liblinear, and Atomcount is not one of the non-diverse datasets. For APDP/pls, which is the second best combination in Table 9, the mean color
is 0.96, which is close to the expected value, and the range of colors is the narrowest, but there are a few outliers with colors < 0. For the most predictive combination, APDP/liblinear, the mean color is 2.1, although there are no colors < 0. The combination with the worst cross-validated prediction in Table 9, DRUGBITS/DNN, has a mean color of -0.7. For the purposes of a heat map, the mean color is not as important as there being a narrow range of non-negative values, which indicates that all atoms contribute equally and positively to the atom count. The D/Q combinations where D=DRUGBITS are the worst with respect to having the largest range. The Atomcount dataset is an example where the activity of molecules can be predicted very well for many D/Q combinations, but the contribution of individual atoms is not what is expected.

DISCUSSION

Our paper concentrates on using changes in predicted activity on a QSAR model as a way of generating atom colors as in a heat map. The purpose of atom coloration is to let a chemist determine which portions of a molecule could be changed to improve activity, and what matters most is whether chemists can interpret the colors correctly. However, that is very subjective, and therefore in this paper we have deemphasized interpretation of heat maps and tried to monitor more objective criteria as a function of descriptors (D) and QSAR method (Q), where we felt the criteria would be helpful for interpretation. Previous studies of atom coloration have used at most a few D/Q combinations applied to a handful of molecules and could not fully address this issue.

We have uncovered three issues with coloration by prediction that make it not as straightforwardly interpretable as might be hoped:
1. Coloration is subject to artifacts under some circumstances, in our case with the liblinear method and with datasets that are not diverse. The colors make no physical sense because the prediction of the
activity when some atoms are nullified is outside the range of the observed activities used to build the QSAR model. Fortunately, one can easily check for the presence of colors outside the range.
2. Coloration is very sensitive to D and Q, much more sensitive than cross-validated predictions of activity. This is true even for datasets where the cross-validated predictions are very good. We have made the assumption, in calculating the correlation of atom colors, that all atoms count equally, and that there is no reason to separate, say, terminal atoms from scaffold atoms.
3. In ideal circumstances where the activity should be predictable from the chemical structure, and we know what the colors should be, one has to have a very high cross-validated predictivity to recover those expected colors, and not all D/Q combinations are suitable.

In regard to the second point, why are coloration and predicted activity sensitive to D and Q? There are two major reasons. The first is that different descriptors do not capture the same features of molecules in a different “language”; they really capture different aspects of molecules. Also, different QSAR methods handle different descriptors differently. That is, linear methods (pls and liblinear) will tend to select descriptors that are more linearly associated with the activity, while nonlinear methods (random_forest, xgboost, DNN) will favor descriptors that track with activity but are not necessarily monotonic. Why is coloration more sensitive than predicted activity to D and Q? A reasonable speculation is that any atom-by-atom view of activity is too granular, but predictions effectively average over many atoms.

While the method of coloring atoms by changes in predicted activity is “universal” in the sense that it can be applied to any D/Q combination, our work shows that the interpretation one would derive from colors is strongly dependent on how the model was built.
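The range check mentioned in the first point above can be sketched in a few lines of Python. This is an illustrative helper and not code from this work: the function name, its arguments, and the convention that a color equals the original prediction minus the prediction with the atom nullified are all assumptions.

```python
def out_of_range_atoms(base_prediction, colors, train_activities):
    """Return indices of atoms whose nullified prediction leaves the training range."""
    low, high = min(train_activities), max(train_activities)
    flagged = []
    for k, color in enumerate(colors):
        # Assumed convention: color = base - nullified, so nullified = base - color.
        nullified = base_prediction - color
        if not (low <= nullified <= high):
            flagged.append(k)
    return flagged


# Hypothetical numbers: training activities span 4.0 to 9.0.
flagged = out_of_range_atoms(
    base_prediction=6.0,
    colors=[0.5, -4.0, 2.5],
    train_activities=[4.0, 5.5, 7.2, 9.0],
)
print(flagged)
```

Atoms flagged this way would be candidates for exclusion from a heat map, since their colors rest on predictions extrapolated beyond the activities the model was trained on.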
One clear observation is that colors seem better behaved when the cross-validated predictions are higher, and it appears prudent to determine which D/Q combinations give the highest cross-validated predictions for a given T before generating a
heat map. At present, however, we cannot establish a minimum cross-validated R2 for which heat maps will “work” for real datasets. There does not seem to be a universally good D/Q combination, although APDP descriptors appear to give the best results for both the real and contrived datasets. It should be noted that even when we have good cross-validated predictions, different D/Q combinations can still give different results, and chemists might have to inspect heat maps from different D/Q combinations to get a more complete picture.

Finally, we note that we have tested only one simple approach to assigning atom contributions (how the prediction changes as descriptors associated with single atoms are eliminated). Other approaches may do better.

ACKNOWLEDGMENTS

The author thanks Andy Liaw, Matt Tudor, and Yuting Xu for helpful comments.

CONFLICT OF INTEREST

The author declares no financial conflict of interest.

SUPPORTING INFORMATION

SMILES_activity.txt contains the SMILES for the original compounds in the datasets and their observed activities. colors.txt contains the colors for individual atoms for all T/D/Q combinations. correlations.txt contains the correlations of colors, of consensus predictions, and correlations of consensus predictions with observed. summed_colors.txt contains the correlations for the sum of colors in a molecule vs. the observed activity.

REFERENCES

1. Polishchuk, P. Interpretation Of Quantitative Structure-Activity Relationship Models: Past, Present, And Future. J. Chem. Inf. Model. 2017, 57, 2618-2639.
2. Gawehn, E.; Hiss, J.A.; Schneider, G. Deep Learning In Drug Discovery. Mol. Inf. 2016, 35, 3-14.
3. Rosenbaum, L.; Hinselmann, G.; Jahn, A.; Zell, A. Interpreting Linear Support Vector Machine Models With Heat Map Molecule Coloring. J. Cheminf. 2011, 3:11.
4. Marcou, G.; Horvath, D.; Solov’ev, V.; Arrault, A.; Vayer, P.; Varnek, A. Interpretability Of SAR/QSAR Models Of Any Complexity By Atomic Contributions. Mol. Inf. 2012, 31, 639-642.
5. Polishchuk, P.G.; Kuz’min, V.E.; Artemenko, A.G.; Muratov, E.N. Universal Approach For Structural Interpretation Of QSAR/QSPR Models. Mol. Inf. 2013, 32, 843-853.
6. Riniker, S.; Landrum, G.A. Similarity Maps - A Visualization Strategy For Molecular Fingerprints And Machine-Learning Methods. J. Cheminf. 2013, 5:43.
7. Robinson, R.L.M.; Palczewska, A.; Palczewski, J.; Kidley, N. Comparison Of The Predictive Performance And Interpretability Of Random Forest And Linear Models On Benchmark Datasets. J. Chem. Inf. Model. 2017, 57, 1773-1792.
8. An, Y.; Sherman, W.; Dixon, S.L. Kernel-Based Partial-Least Squares: Application To Fingerprint-Based QSAR With Model Visualization. J. Chem. Inf. Model. 2013, 53, 2312-2321.
9. Cortes-Ciriano, I. Benchmarking The Predictive Power Of Ligand Efficiency Indices In QSAR. J. Chem. Inf. Model. 2016, 56, 1576-1587.
10. Gaulton, A.; Bellis, L.J.; Bento, A.P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J.P. ChEMBL: A Large-Scale
Bioactivity Database For Drug Discovery. Nucleic Acids Research 2012, 40, D1100-D1107.
11. Carhart, R.E.; Smith, D.H.; Venkataraghavan, R. Atom Pairs As Molecular Features In Structure-Activity Studies: Definition And Application. J. Chem. Inf. Comput. Sci. 1985, 25, 64-73.
12. Kearsley, S.K.; Sallamack, S.; Fluder, E.M.; Andose, J.D.; Mosley, R.T.; Sheridan, R.P. Chemical Similarity Using Physiochemical Property Descriptors. J. Chem. Inform. Comp. Sci. 1996, 36, 118-127.
13. Nilakantan, R.; Bauman, N.; Dixon, J.S.; Venkataraghavan, R. Topological Torsions: A New Molecular Descriptor For SAR Applications. Comparison With Other Descriptors. J. Chem. Inf. Comput. Sci. 1987, 27, 82-85.
14. Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742-754.
15. Svetnik, V.; Wang, T.; Tong, C.; Liaw, A.; Sheridan, R.P.; Song, Q. Boosting: An Ensemble Learning Tool For Compound Classification And QSAR Modeling. J. Chem. Inf. Comput. Sci. 2005, 45, 786-799.
16. Breiman, L. Random Forests. Machine Learning 2001, 45, 3-32.
17. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J.C.; Sheridan, R.P.; Feuston, B.P. Random Forest: A Classification And Regression Tool For Compound Classification And QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947-1958.
18. Mevik, B.H.; Wehrens, R. The pls package: Principal Component And Partial Least Squares Regression. Journal of Statistical Software 2007, 18, 1-24.
22 ACS Paragon Plus Environment

Page 22 of 41

Page 23 of 41 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

19. Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; Lin, C.-J. LIBLINEAR: A Library For Large Linear Classification. J. of Machine Learning Res. 2008, 9, 1871-1874. 20. Ma, J.; Sheridan, R.P.; Liaw, A.; Dahl, G.E.; Svetnik, V. Deep Neural Nets As A Method For Quantitative-Structure-Activity Relationships. J. Chem. Inf. Model. 2015, 55, 263-274. 21. Chen, T.; Guestren C. Xgboost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016, 785794. 22. Sheridan, R.P.; Wang, W.M.; Liaw, A.; Ma, J.; Gifford, E.M. Extreme Gradient Boosting As A Method For Quantitative Structure-Activity Relationships. . J. Chem. Inf. Model. 2016, 56, 2353-2360. 23. Sheridan, R.P. Time-Split Cross-Validation As A Method For Estimating The Goodness Of Prospective Prediction. J. Chem. Inf. Model. 2013, 53, 783-790.

Scheme 1. Workflow for calculating consensus cross-validated predictions.

For each dataset T:
    For each D/Q combination:
        For 10 trials:
            Randomly take half of T as the training set; the remainder is the test set.
            Make a model using D/Q from the training set and predict the activity of the test set.
            The test set becomes the new training set, and the training set becomes the new test set.
            Repeat the model building and prediction using the new training and test sets.
        End loop over trials.
        The consensus prediction, which we can call A(i,D,Q) for each molecule i, is the mean prediction over the trials.
    End loop over D/Q combinations.
End loop over datasets.

Scheme 2. Workflow for calculating colors.

For each dataset T:
    Make a QSAR model from all molecules in T for all D/Q combinations.
    200 molecules are randomly selected from T to be colored.
    For each D/Q combination:
        For each molecule i:
            Calculate the predicted activity of molecule i on the D/Q model. Call this P(i,0,D,Q), where the “0” indicates the whole molecule.
            For each atom k in molecule i:
                Remove the descriptors associated with atom k.
                Calculate the predicted activity of molecule i on the T/D/Q model. Call this P(i,k,D,Q).
                The “color” of atom k is C(i,k,D,Q) = P(i,0,D,Q) - P(i,k,D,Q).
            End loop over atoms.
        End loop over molecules.
    End loop over D/Q combinations.
End loop over datasets.
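The inner loop of Scheme 2 can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the descriptor occurrences, the `WEIGHTS` and `BIAS` values, and `toy_predict` are a toy linear stand-in for a trained D/Q model, not any of the models used in the study. The only part taken from the scheme is the rule C = P(whole molecule) - P(molecule with the descriptors of atom k removed).

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# One descriptor occurrence: (descriptor name, atoms that generate it).
Occurrence = Tuple[str, FrozenSet[int]]


def count_descriptors(occurrences: List[Occurrence]) -> Dict[str, int]:
    """Collapse descriptor occurrences into a name -> count vector."""
    counts: Dict[str, int] = {}
    for name, _atoms in occurrences:
        counts[name] = counts.get(name, 0) + 1
    return counts


def atom_colors(
    occurrences: List[Occurrence],
    n_atoms: int,
    predict: Callable[[Dict[str, int]], float],
) -> List[float]:
    """Return C(k) = P(whole) - P(without atom k) for each atom k."""
    p_whole = predict(count_descriptors(occurrences))
    colors = []
    for k in range(n_atoms):
        # Drop every descriptor occurrence that involves atom k.
        kept = [occ for occ in occurrences if k not in occ[1]]
        colors.append(p_whole - predict(count_descriptors(kept)))
    return colors


# Assumption: a toy linear model on descriptor counts stands in for a D/Q model.
WEIGHTS = {"C-O": 0.5, "C-C": 0.1}
BIAS = 5.0


def toy_predict(counts: Dict[str, int]) -> float:
    return BIAS + sum(WEIGHTS.get(name, 0.0) * n for name, n in counts.items())


# Three-atom chain C0-C1-O2 described by two pair-like descriptors.
mol = [("C-C", frozenset({0, 1})), ("C-O", frozenset({1, 2}))]
colors = atom_colors(mol, 3, toy_predict)
```

In this toy case the middle atom participates in both descriptors, so removing it changes the prediction the most. The same bookkeeping applies to any descriptor type for which the contributing atoms of each occurrence are known.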


Table 1. Datasets used in this study

| Name (accession) | Description | N | Range of pIC50 in molar | Mean pairwise similarity |
| --- | --- | --- | --- | --- |
| aurora_A (O14965) | human protein kinase Aurora-A | 1651 | 4.0-10.0 | 0.20 |
| cdk2 | human cyclin-dependent kinase 2 | 1130 | 3.5-10.0 | 0.19 |
| estrogen_alpha (P03372) | human estrogen receptor α | 908 | 4.0-9.7 | 0.23 |
| estrogen_beta (Q29731) | human estrogen receptor β | 799 | 4.0-9.5 | 0.25 |
| glucocorticoid (P04150) | human glucocorticoid receptor | 1182 | 4.0-10.4 | 0.19 |
| IL4 | human interleukin-4 | 665 | 4.7-8.3 | 0.50 |
| jak2 (O60674) | human protein kinase jak2 | 869 | 3.8-9.8 | 0.20 |
| kinase_src (P12931) | human protein kinase src | 2719 | 2.3-9.9 | 0.18 |
| map_vasc_endotel (P35968) | human vascular endothelial growth factor receptor 2 | 4662 | 4.0-9.8 | 0.20 |
| MMP2 | human matrix metallopeptidase-2 | 549 | 5.1-10.3 | 0.43 |
| mtor (P42345) | human FK506 binding protein 12 | 1337 | 4.0-10.0 | 0.21 |
| P11229 | human muscarinic M1 receptor | 718 | 4.0-10.0 | 0.17 |
| P19327 | rat 5-HT1A receptor | 919 | 4.0-10.0 | 0.22 |
| P21554 | human cannabinoid receptor | 1240 | 4.0-9.7 | 0.21 |
| P24530 | human endothelin receptor | 966 | 4.0-10.3 | 0.22 |
| P25929 | human neuropeptide Y receptor type 1 | 483 | 4.1-10.7 | 0.16 |
| P28335 | human 5-HT2C receptor | 752 | 4.0-9.3 | 0.21 |
| P41594 | human metabotropic glutamate receptor 5 | 1318 | 4.0-9.4 | 0.19 |
| P47871 | human glucagon receptor | 615 | 4.0-8.6 | 0.25 |
| P49146 | human neuropeptide Y receptor type 2 | 494 | 4.0-10.1 | 0.21 |
| P61169 | rat D2 receptor | 1756 | 4.0-10.2 | 0.20 |
| pkc_alpha (P17252) | human protein kinase C alpha | 580 | 4.0-9.4 | 0.21 |
| progesterone (P06401) | human progesterone receptor | 1233 | 4.1-10.2 | 0.19 |
| Q16602 | human calcitonin gene-related peptide type 1 receptor | 463 | 4.0-11.0 | 0.25 |
| Anion (modified from P47871) | Number of negative charges per atom | 615 | 0-3 | 0.25 |
| Atomcount (modified from P47871) | Number of nonhydrogen atoms | 615 | 18-248 | 0.25 |

Table 2. Correlation of consensus prediction vs. observed activity for each D/Q combination averaged over 24 datasets.

| Descriptor | QSAR method | Mean ± stdev R² |
| --- | --- | --- |
| APDP | random_forest | 0.66 ± 0.09 |
| APDP | xgboost | 0.66 ± 0.10 |
| ECFP4 | random_forest | 0.65 ± 0.10 |
| TTDT | random_forest | 0.65 ± 0.10 |
| ECFP4 | xgboost | 0.65 ± 0.10 |
| TTDT | xgboost | 0.64 ± 0.10 |
| APDP | DNN | 0.64 ± 0.10 |
| ECFP4 | DNN | 0.62 ± 0.11 |
| ECFP4 | pls | 0.61 ± 0.11 |
| TTDT | DNN | 0.60 ± 0.11 |
| DRUGBITS | random_forest | 0.58 ± 0.13 |
| DRUGBITS | xgboost | 0.56 ± 0.13 |
| APDP | pls | 0.56 ± 0.14 |
| ECFP4 | liblinear | 0.55 ± 0.12 |
| TTDT | pls | 0.55 ± 0.12 |
| APDP | liblinear | 0.53 ± 0.13 |
| DRUGBITS | DNN | 0.50 ± 0.13 |
| TTDT | liblinear | 0.49 ± 0.13 |
| DRUGBITS | pls | 0.42 ± 0.14 |
| DRUGBITS | liblinear | 0.35 ± 0.16 |
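The consensus predictions behind Table 2 follow Scheme 1: ten random half/half splits, each used in both directions, with the per-molecule predictions averaged over trials. A minimal sketch under stated assumptions is given below. `fit_line` is a hypothetical one-descriptor least-squares model standing in for a real QSAR method, the data are synthetic, and R² is assumed here to be the squared Pearson correlation, which this section does not confirm.

```python
import random
from statistics import mean


def fit_line(train_x, train_y):
    """Hypothetical stand-in for a QSAR method: a 1-D least-squares line."""
    mx, my = mean(train_x), mean(train_y)
    sxx = sum((x - mx) ** 2 for x in train_x)
    slope = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y)) / sxx
    return lambda x: my + slope * (x - mx)


def consensus_predictions(xs, ys, fit, n_trials=10, seed=0):
    """Mean test-set prediction per molecule over repeated 2-fold splits."""
    rng = random.Random(seed)
    n = len(xs)
    sums = [0.0] * n
    for _ in range(n_trials):
        idx = list(range(n))
        rng.shuffle(idx)
        half_a, half_b = idx[: n // 2], idx[n // 2:]
        # Train on one half and predict the other, then swap roles.
        for train, test in ((half_a, half_b), (half_b, half_a)):
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            for i in test:
                sums[i] += model(xs[i])
    # Each molecule is in the test set exactly once per trial.
    return [s / n_trials for s in sums]


def r_squared(a, b):
    """Squared Pearson correlation (assumed definition of R2)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov * cov / (var_a * var_b)


# Synthetic, exactly linear data, so the toy model should recover it.
xs = [float(x) for x in range(20)]
ys = [4.0 + 0.25 * x for x in xs]
consensus = consensus_predictions(xs, ys, fit_line)
r2 = r_squared(consensus, ys)
```

Because every molecule lands in the test half once per trial, dividing the accumulated predictions by the number of trials gives the mean prediction A(i,D,Q) of Scheme 1.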

Table 3. Correlation of summed color vs. observed activity for each D/Q combination averaged over 24 datasets.

| Descriptor | QSAR method | Mean ± stdev R² |
| --- | --- | --- |
| ECFP4 | pls | 0.89 ± 0.06 |
| ECFP4 | DNN | 0.86 ± 0.05 |
| TTDT | DNN | 0.85 ± 0.05 |
| TTDT | pls | 0.79 ± 0.09 |
| TTDT | random_forest | 0.79 ± 0.07 |
| TTDT | xgboost | 0.78 ± 0.07 |
| TTDT | liblinear | 0.78 ± 0.09 |
| ECFP4 | random_forest | 0.77 ± 0.06 |
| APDP | pls | 0.77 ± 0.09 |
| APDP | liblinear | 0.77 ± 0.09 |
| APDP | DNN | 0.76 ± 0.08 |
| ECFP4 | xgboost | 0.74 ± 0.10 |
| ECFP4 | liblinear | 0.73 ± 0.11 |
| DRUGBITS | random_forest | 0.72 ± 0.07 |
| APDP | random_forest | 0.71 ± 0.10 |
| DRUGBITS | DNN | 0.68 ± 0.09 |
| DRUGBITS | xgboost | 0.68 ± 0.11 |
| APDP | xgboost | 0.68 ± 0.10 |
| DRUGBITS | pls | 0.41 ± 0.12 |
| DRUGBITS | liblinear | 0.26 ± 0.14 |
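The quantity in Table 3 can be read as follows: for each colored molecule, add up the colors of its atoms, then correlate those sums with the observed activities. A small sketch of that bookkeeping is shown below. The molecule names, colors, and activities are invented for illustration, and R² is assumed to be the squared Pearson correlation.

```python
from statistics import mean


def r_squared(a, b):
    """Squared Pearson correlation (assumed definition of R2)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov * cov / (var_a * var_b)


def summed_colors(colors_by_molecule):
    """Sum the atom colors of each molecule."""
    return {mol: sum(colors) for mol, colors in colors_by_molecule.items()}


# Hypothetical atom colors for three molecules under one D/Q combination.
colors_by_molecule = {
    "m1": [0.5, 0.2, -0.1],
    "m2": [1.0, 0.4],
    "m3": [0.1, 0.1, 0.1],
}
# Hypothetical observed activities for the same molecules.
observed = {"m1": 6.1, "m2": 7.4, "m3": 5.2}

sums = summed_colors(colors_by_molecule)
mols = sorted(sums)
r2 = r_squared([sums[m] for m in mols], [observed[m] for m in mols])
```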

Table 4. Most extreme color per molecule for each D/Q combination averaged over all atoms in all training sets.

| Descriptor | QSAR method | Mean ± stdev |
| --- | --- | --- |
| DRUGBITS | liblinear | 2.84 ± 1.53 |
| TTDT | liblinear | 2.53 ± 1.25 |
| ECFP4 | liblinear | 2.02 ± 1.02 |
| APDP | liblinear | 1.24 ± 0.39 |
| DRUGBITS | xgboost | 1.17 ± 0.68 |
| DRUGBITS | pls | 1.02 ± 0.56 |
| TTDT | xgboost | 1.01 ± 0.59 |
| DRUGBITS | random_forest | 0.98 ± 0.65 |
| DRUGBITS | DNN | 0.96 ± 0.53 |
| TTDT | random_forest | 0.88 ± 0.61 |
| TTDT | DNN | 0.85 ± 0.44 |
| ECFP4 | xgboost | 0.82 ± 0.53 |
| ECFP4 | DNN | 0.78 ± 0.41 |
| TTDT | pls | 0.75 ± 0.39 |
| ECFP4 | random_forest | 0.75 ± 0.58 |
| APDP | xgboost | 0.73 ± 0.47 |
| ECFP4 | pls | 0.66 ± 0.40 |
| APDP | random_forest | 0.64 ± 0.49 |
| APDP | pls | 0.53 ± 0.32 |
| APDP | DNN | 0.53 ± 0.32 |
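One plausible reading of the statistic in Table 4 is sketched below: take the color with the largest absolute value in each molecule, then report the mean and standard deviation of that value across molecules. Treating "most extreme" as largest absolute value, and using the sample standard deviation, are both assumptions; the colors are invented.

```python
from statistics import mean, stdev


def most_extreme_color(colors):
    """Largest absolute atom color within one molecule (assumed meaning)."""
    return max(abs(c) for c in colors)


# Hypothetical atom colors for three molecules under one D/Q combination.
colors_by_molecule = {
    "m1": [0.3, -1.2, 0.5],
    "m2": [0.8, 0.1],
    "m3": [-0.2, 0.4, 0.1],
}

extremes = [most_extreme_color(c) for c in colors_by_molecule.values()]
extreme_mean = mean(extremes)
extreme_stdev = stdev(extremes)  # sample standard deviation
```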


Table 5. Largest z-score averaged over all atoms in all training sets for each D/Q combination.

| Descriptor | QSAR method | Mean ± stdev z-score |
| --- | --- | --- |
| APDP | pls | 3.19 ± 0.85 |
| APDP | DNN | 2.92 ± 0.67 |
| APDP | random_forest | 2.89 ± 0.81 |
| APDP | xgboost | 2.83 ± 0.74 |
| APDP | liblinear | 2.72 ± 0.61 |
| TTDT | pls | 2.61 ± 0.59 |
| TTDT | DNN | 2.57 ± 0.56 |
| TTDT | xgboost | 2.52 ± 0.61 |
| TTDT | random_forest | 2.52 ± 0.62 |
| ECFP4 | random_forest | 2.51 ± 0.70 |
| TTDT | liblinear | 2.50 ± 0.56 |
| ECFP4 | xgboost | 2.50 ± 0.65 |
| ECFP4 | liblinear | 2.34 ± 0.44 |
| ECFP4 | DNN | 2.33 ± 0.45 |
| ECFP4 | pls | 2.28 ± 0.47 |
| DRUGBITS | liblinear | 2.22 ± 0.65 |
| DRUGBITS | random_forest | 2.14 ± 0.63 |
| DRUGBITS | DNN | 2.11 ± 0.52 |
| DRUGBITS | xgboost | 2.10 ± 0.57 |
| DRUGBITS | pls | 2.09 ± 0.55 |
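This section does not restate how the z-score is defined, so the sketch below rests on an assumption: the colors of one molecule are standardized against the mean and population standard deviation of that same molecule's atom colors, and the largest absolute z-score is kept. It is meant only to show the kind of calculation that would flag one atom as standing out from the rest.

```python
from statistics import mean, pstdev


def largest_zscore(colors):
    """Largest |z| among atom colors, standardized within one molecule.

    Assumption: z = (color - mean) / population stdev over the molecule's atoms.
    Returns 0.0 when all colors are identical (no atom stands out).
    """
    mu = mean(colors)
    sigma = pstdev(colors)
    if sigma == 0:
        return 0.0
    return max(abs(c - mu) / sigma for c in colors)


# Hypothetical molecule in which a single atom carries all of the color.
z_one_hot = largest_zscore([0.0, 0.0, 0.0, 2.0])
# Hypothetical molecule with perfectly uniform colors.
z_flat = largest_zscore([0.5, 0.5, 0.5])
```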

Table 6. Correlation of consensus predictions for pairs of D/Q combinations averaged over 24 datasets.

| Descriptor 1 | Method 1 | Descriptor 2 | Method 2 | Mean ± stdev R² |
| --- | --- | --- | --- | --- |
| APDP | random_forest | APDP | xgboost | 0.96 ± 0.02 |
| ECFP4 | random_forest | ECFP4 | xgboost | 0.96 ± 0.01 |
| TTDT | random_forest | TTDT | xgboost | 0.96 ± 0.02 |
| ECFP4 | random_forest | TTDT | random_forest | 0.95 ± 0.03 |
| DRUGBITS | random_forest | DRUGBITS | xgboost | 0.94 ± 0.02 |
| ECFP4 | DNN | TTDT | DNN | 0.94 ± 0.03 |
| ECFP4 | xgboost | TTDT | xgboost | 0.93 ± 0.03 |
| APDP | DNN | APDP | random_forest | 0.93 ± 0.02 |
| APDP | random_forest | TTDT | random_forest | 0.93 ± 0.03 |
| ECFP4 | DNN | ECFP4 | random_forest | 0.93 ± 0.03 |
| ECFP4 | xgboost | TTDT | random_forest | 0.92 ± 0.03 |
| APDP | random_forest | ECFP4 | random_forest | 0.92 ± 0.03 |
| ECFP4 | random_forest | TTDT | xgboost | 0.92 ± 0.03 |
| TTDT | DNN | TTDT | random_forest | 0.92 ± 0.04 |
| APDP | DNN | APDP | xgboost | 0.92 ± 0.04 |
| APDP | xgboost | TTDT | random_forest | 0.91 ± 0.03 |
| APDP | xgboost | ECFP4 | xgboost | 0.91 ± 0.04 |
| APDP | xgboost | TTDT | xgboost | 0.91 ± 0.04 |
| APDP | xgboost | ECFP4 | random_forest | 0.91 ± 0.04 |
| ECFP4 | DNN | TTDT | random_forest | 0.91 ± 0.04 |
| etc. | | | | |
| DRUGBITS | liblinear | ECFP4 | DNN | 0.49 ± 0.17 |
| DRUGBITS | liblinear | ECFP4 | random_forest | 0.49 ± 0.17 |
| DRUGBITS | liblinear | TTDT | random_forest | 0.49 ± 0.16 |
| APDP | pls | DRUGBITS | liblinear | 0.49 ± 0.18 |
| DRUGBITS | liblinear | TTDT | DNN | 0.49 ± 0.17 |
| DRUGBITS | liblinear | ECFP4 | xgboost | 0.48 ± 0.18 |
| DRUGBITS | liblinear | TTDT | pls | 0.47 ± 0.18 |
| APDP | xgboost | DRUGBITS | liblinear | 0.47 ± 0.17 |
| DRUGBITS | liblinear | ECFP4 | pls | 0.47 ± 0.18 |
| DRUGBITS | liblinear | TTDT | xgboost | 0.47 ± 0.17 |

Table 7. Correlations of colors for pairs of D/Q combinations averaged over 24 datasets.

| Descriptor 1 | Method 1 | Descriptor 2 | Method 2 | Mean ± stdev R² |
| --- | --- | --- | --- | --- |
| DRUGBITS | random_forest | DRUGBITS | xgboost | 0.77 ± 0.06 |
| TTDT | random_forest | TTDT | xgboost | 0.74 ± 0.05 |
| ECFP4 | random_forest | ECFP4 | xgboost | 0.70 ± 0.05 |
| APDP | random_forest | APDP | xgboost | 0.70 ± 0.07 |
| DRUGBITS | DNN | DRUGBITS | random_forest | 0.62 ± 0.11 |
| ECFP4 | random_forest | TTDT | random_forest | 0.62 ± 0.09 |
| ECFP4 | DNN | TTDT | DNN | 0.59 ± 0.06 |
| DRUGBITS | DNN | DRUGBITS | xgboost | 0.58 ± 0.12 |
| ECFP4 | DNN | ECFP4 | random_forest | 0.57 ± 0.10 |
| ECFP4 | DNN | ECFP4 | pls | 0.55 ± 0.07 |
| ECFP4 | DNN | ECFP4 | xgboost | 0.54 ± 0.11 |
| ECFP4 | liblinear | TTDT | liblinear | 0.52 ± 0.19 |
| ECFP4 | DNN | TTDT | random_forest | 0.52 ± 0.09 |
| ECFP4 | xgboost | TTDT | random_forest | 0.50 ± 0.09 |
| ECFP4 | random_forest | TTDT | xgboost | 0.50 ± 0.09 |
| TTDT | DNN | TTDT | random_forest | 0.50 ± 0.09 |
| ECFP4 | DNN | TTDT | xgboost | 0.49 ± 0.09 |
| ECFP4 | xgboost | TTDT | xgboost | 0.49 ± 0.09 |
| TTDT | DNN | TTDT | xgboost | 0.47 ± 0.08 |
| ECFP4 | pls | TTDT | pls | 0.47 ± 0.13 |
| etc. | | | | |
| DRUGBITS | liblinear | TTDT | DNN | 0.05 ± 0.03 |
| APDP | liblinear | ECFP4 | pls | 0.05 ± 0.03 |
| APDP | DNN | DRUGBITS | pls | 0.04 ± 0.03 |
| DRUGBITS | liblinear | ECFP4 | xgboost | 0.04 ± 0.03 |
| APDP | liblinear | DRUGBITS | pls | 0.04 ± 0.04 |
| APDP | random_forest | DRUGBITS | liblinear | 0.04 ± 0.03 |
| DRUGBITS | liblinear | ECFP4 | pls | 0.04 ± 0.04 |
| APDP | pls | DRUGBITS | liblinear | 0.04 ± 0.03 |
| APDP | xgboost | DRUGBITS | liblinear | 0.03 ± 0.03 |
| APDP | DNN | DRUGBITS | liblinear | 0.02 ± 0.02 |
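Tables 6 and 7 compare every pair of D/Q combinations. With 4 descriptor types and 5 QSAR methods there are 20 combinations and therefore 190 distinct pairs. The sketch below shows one way such a pairwise comparison could be organized. The color values are invented, and R² is assumed to be the squared Pearson correlation over the same atoms listed in the same order.

```python
from itertools import combinations
from statistics import mean


def r_squared(a, b):
    """Squared Pearson correlation (assumed definition of R2)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov * cov / (var_a * var_b)


def pairwise_r2(values_by_combo):
    """R2 between the value lists of every pair of (descriptor, method) keys."""
    return {
        (a, b): r_squared(values_by_combo[a], values_by_combo[b])
        for a, b in combinations(sorted(values_by_combo), 2)
    }


# Hypothetical colors for the same four atoms under three D/Q combinations.
colors = {
    ("APDP", "random_forest"): [0.1, 0.5, -0.2, 0.9],
    ("APDP", "xgboost"): [0.2, 1.0, -0.4, 1.8],
    ("APDP", "liblinear"): [0.9, -0.1, 0.3, 0.2],
}
pair_r2 = pairwise_r2(colors)

# Number of distinct pairs among 20 D/Q combinations.
n_pairs_20 = len(list(combinations(range(20), 2)))
```

Averaging each pair's R² over the datasets would then give one row of a table laid out like Table 6 (using consensus predictions) or Table 7 (using colors).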

Table 8. Consensus prediction vs. observed activity for each D/Q combination for the Anion dataset

| Descriptor | QSAR method | R² |
| --- | --- | --- |
| DRUGBITS | liblinear | 0.97 |
| APDP | random_forest | 0.97 |
| DRUGBITS | pls | 0.96 |
| APDP | xgboost | 0.96 |
| DRUGBITS | xgboost | 0.95 |
| ECFP4 | liblinear | 0.95 |
| ECFP4 | random_forest | 0.93 |
| ECFP4 | xgboost | 0.93 |
| APDP | liblinear | 0.93 |
| DRUGBITS | random_forest | 0.93 |
| TTDT | random_forest | 0.93 |
| APDP | DNN | 0.93 |
| APDP | pls | 0.91 |
| TTDT | xgboost | 0.91 |
| ECFP4 | pls | 0.90 |
| TTDT | liblinear | 0.88 |
| TTDT | DNN | 0.86 |
| ECFP4 | DNN | 0.86 |
| DRUGBITS | DNN | 0.78 |
| TTDT | pls | 0.77 |


Table 9. Consensus prediction vs. observed activity for each D/Q combination for the Atomcount dataset

| Descriptor | QSAR method | R² |
| --- | --- | --- |
| APDP | liblinear | 0.99 |
| APDP | pls | 0.99 |
| ECFP4 | liblinear | 0.98 |
| DRUGBITS | liblinear | 0.95 |
| TTDT | liblinear | 0.94 |
| TTDT | pls | 0.92 |
| APDP | DNN | 0.89 |
| DRUGBITS | pls | 0.83 |
| APDP | random_forest | 0.72 |
| TTDT | DNN | 0.69 |
| APDP | xgboost | 0.69 |
| TTDT | random_forest | 0.63 |
| ECFP4 | pls | 0.62 |
| DRUGBITS | xgboost | 0.61 |
| TTDT | xgboost | 0.59 |
| ECFP4 | xgboost | 0.51 |
| DRUGBITS | random_forest | 0.49 |
| ECFP4 | random_forest | 0.47 |
| ECFP4 | DNN | 0.41 |
| DRUGBITS | DNN | 0.38 |


[Figure 1 image: three depictions of the same molecule with a numeric color on each atom. Panels: APDP/random_forest, APDP/liblinear, TTDT/random_forest.]

Figure 1. CHEMBL187915 with atom colors using estrogen_alpha models. Three different D/Q combinations are shown. The number on each atom indicates the change in predicted activity that occurs when the descriptors for the atom are removed. For example, “1.53” means that an atom contributes 1.53 to the predicted pIC50 of that molecule.

Figure 2. Correlation of the colors of 200 estrogen_alpha molecules shown for different pairs of D/Q combinations. (Above) APDP/random_forest vs. APDP/xgboost, R² = 0.72. (Below) APDP/random_forest vs. APDP/liblinear, R² = 0.16. Each circle represents an atom. Different molecules are distinguished by the color of the circle.


Figure 3. The correlation of colors vs. the correlations of consensus predictions for pairs of D/Q combinations averaged over 24 datasets. Each symbol represents a pair of descriptor/method combinations. Descriptor pairs are depicted as colors and method pairs are depicted as shapes.


Figure 4. Boxplot of the colors of atoms in the Anion dataset, with the median represented by a star symbol. Atoms are put into three categories: carboxylate (CARB), tetrazole (TET), and OTHER. Atoms in all 200 colored molecules are included.


Figure 5. Boxplot of the colors of atoms in the Atomcount dataset, with the median represented by a star symbol. All atoms in all 200 colored molecules are included. The solid horizontal line is at 1, which is the expected color for all atoms; the dashed horizontal line is at 0.
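The expected color of 1 follows from how Atomcount is defined in Table 1: the "activity" is the number of nonhydrogen atoms, so an ideal model should predict one unit less when any single atom is taken away. The sketch below makes that arithmetic explicit with an idealized predictor that simply counts atoms. It is an illustration only and is not one of the D/Q models in the study, which work from descriptors rather than from atoms directly.

```python
def ideal_atomcount_colors(n_heavy_atoms):
    """Colors under an idealized model whose prediction is the atom count."""

    def predict(atoms_present):
        # Idealized Atomcount "activity": number of nonhydrogen atoms present.
        return float(len(atoms_present))

    whole = set(range(n_heavy_atoms))
    p_whole = predict(whole)
    # Color of atom k = P(whole molecule) - P(molecule without atom k).
    return [p_whole - predict(whole - {k}) for k in range(n_heavy_atoms)]


# 18 is the smallest atom count listed for the Atomcount dataset in Table 1.
colors = ideal_atomcount_colors(18)
```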
