
Novel Methods for Prioritizing “Close-In” Analogs from SAR Matrices Liying Zhang, Kjell Johnson, Jeremy T. Starr, Jared B.J. Milbank, Andrew M. Kuhn, Christopher S Poss, Robert V. Stanton, and VEERABAHU SHANMUGASUNDARAM J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00055 • Publication Date (Web): 28 Jun 2017 Downloaded from http://pubs.acs.org on June 29, 2017


Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Novel Methods for Prioritizing “Close-In” Analogs from SAR Matrices

Liying Zhang1, Kjell Johnson2, Jeremy Starr3, Jared Milbank3,a, Andrew M. Kuhn3,b, Christopher Poss3, Robert V. Stanton1, and Veerabahu Shanmugasundaram3,*

1 Pfizer Global Research and Development, 610 Main Street, Cambridge, MA 02139

2 Arbor Analytics, LLC, 4079 Ramsgate Court, Ann Arbor, MI 48103

3 Pfizer Global Research and Development, 300 Eastern Point Rd, Groton, CT 06340

a Current affiliation is FORMA Pharmaceutics, 500 Arsenal St, Suite 100, Watertown, MA 02472

b Current affiliation is RStudio, Inc., 17 Paula Lane, Waterford, CT 06385

* Corresponding author email: [email protected]


Abstract

Here we describe the development of novel methods for compound evaluation and prioritization based on the Structure Activity Relationship Matrix (SARM) framework. The SARM data structure allows automatic and exhaustive extraction of SAR patterns from data sets and their organization into a chemically intuitive scaffold/functional-group format. While SARMs have been used in the retrospective analysis of SAR discontinuity and in identifying underexplored regions of chemistry space, there have been only a few attempts to apply SARMs prospectively in the prioritization of “close-in” analogs. In this work, three new ways of prioritizing virtual compounds based on SARMs are described: 1) matrix pattern-based prioritization, 2) similarity-weighted, matrix pattern-based prioritization, and 3) analysis of variance-based prioritization (ANV). All of these methods yielded high predictive power for six benchmark data sets (prediction accuracy R2 ranging from 0.63 to 0.82), giving confidence in their application to new design ideas. In particular, the ANV method outperformed the previously reported SARM-based method for five of the six data sets tested. The impact of the various SARM parameters investigated and the reasons why SARM-based compound prioritization methods provide higher predictive power are discussed.


Introduction

Lead optimization in medicinal chemistry is largely driven by hypothesis testing and depends on the ingenuity, experience, and intuition of medicinal chemists, focusing on the key question “which compound to make next?”1 Typically, project teams start by making a set of modifications around a lead compound at various substituent positions, varying the vectors, evaluating the SAR, and formulating subsequent hypotheses. Many of these initial modifications are based on synthetic feasibility and overall property criteria applied to a set of virtual compounds. After the initial exploration, if a positive SAR signal is noted, that chemistry space is explored further through rapid close-in analog evaluations. While each design idea is clearly important in answering a question, the experimental investment can be quite costly. Hence, taking into account the totality of the emerging SAR information in a series can be extremely helpful in accelerating project progression. The evaluation can range from something as specific as R-group placements at a particular position on a lead series to an assessment of the lead series as a whole. By designing and synthesizing multiple compounds quickly, a picture of the local structure-activity relationship (SAR) can emerge, allowing go/no-go decisions regarding a series. However, the design of initial hit expansion libraries can be non-trivial and is often biased by preconceived notions or an incomplete understanding of the chemistry space. The common hypothesis in close-in analog design is that structurally similar compounds have similar biological activity2. However, many chemical series have been identified that exhibit activity cliffs and non-additive SAR3,4.

SAR exploration and iterative compound design are essential components of lead optimization. Numerous approaches have been developed that aim to identify key structural features correlating with biological activity and to design new analogs. Standard R-group tables dominated SAR analysis for many years, focusing on a chemical scaffold with R-group variations at different substitution sites5. The use of standard SAR tables relies strongly on a chemist’s experience, as well as on synthetic feasibility, to help design novel compounds. Quantitative SAR (QSAR) analyses enable the use of large datasets and the identification of trends through statistical modeling that may not be possible by visual inspection6. However, the strength of QSAR methods in mathematically identifying trends can also be a limitation, as chemical interpretation can be challenging. More recently, SAR visualization methods such as the SAR matrix (SARM)7 have been developed, extending the concepts of scaffolds and matched molecular pairs (MMPs). While informative as data-mining tools and for getting quickly up to speed on the SAR of an unfamiliar dataset, the newer SAR visualizations generally lack the statistical and quantitative elements that make QSAR models attractive.

The SARM methodology was introduced based on the concepts of MMPs and matched molecular series (MMS)7. An MMP is generally defined as a pair of compounds that differ by only one structural change at a single site8. The MMP formalism enables the direct association between a structural transformation and a change in a molecular property, e.g., biological activity. An MMS is an extension of the MMP concept, defined as a series of compounds with structural changes at a single site7. MMS allow the study of structural transformations across compound series and their corresponding changes in biological properties. SARM further extends the concepts of MMP and MMS by arranging structurally related analog series into a matrix resembling an R-group table (Figure 1A). In a typical SARM, each cell represents an individual compound; compounds in the same row share a key (core), and compounds in the same column share a value (R group).
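The key/value layout described above can be illustrated with a small sketch. It is not the SARM generation code itself (which fragments molecules exhaustively); it only assumes that fragmentation has already produced hypothetical `(core, r_group, pKi)` records, and shows how real compounds and missing combinations fall out of a matrix-like mapping. All names and numbers are illustrative.

```python
def build_sarm(compounds):
    """Arrange (key, value, activity) records into a matrix-like mapping.

    `compounds` is a hypothetical list of (core, r_group, pKi) tuples;
    pKi may be None for a real compound without measured activity.
    """
    keys = sorted({k for k, _, _ in compounds})      # matrix rows
    values = sorted({v for _, v, _ in compounds})    # matrix columns
    cells = {(k, v): a for k, v, a in compounds}     # one cell per real compound
    return keys, values, cells


def virtual_compounds(keys, values, cells):
    """Return key/value combinations for which no real compound exists."""
    return [(k, v) for k in keys for v in values if (k, v) not in cells]


# Illustrative records only; these are not data from the paper.
records = [
    ("core1", "Me", 6.1),
    ("core1", "Cl", 6.9),
    ("core2", "Me", 7.0),
    ("core3", "Cl", None),  # real compound, activity not measured
]
keys, values, cells = build_sarm(records)
holes = virtual_compounds(keys, values, cells)
```

Here `holes` lists the unfilled core/R-group combinations, which correspond to the virtual compounds of a SARM.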


Figure 1. A) An SAR Matrix example. Compounds are arranged in a matrix format based on their cores and R-groups. Color represents experimental activity or compound properties; grey indicates existing compounds without any experimental activity; white cells are virtual compounds with new combinations of cores and R-groups. B) Squarewise additivity. Three real compounds and one virtual compound form a square, in which every two adjacent compounds are a matched molecular pair. The four compounds are referred to as a “chemical double-mutant cycle”. Squarewise additivity is assumed in this work to predict the activity of the virtual compound in this square.


SARMs display all structural relationships within a compound data set in the format of a two-dimensional (2D) matrix. While an SAR matrix is very similar to an R-group table, it has several key differences. It fragments compounds in an exhaustive fashion and identifies all possible matrices with core structures and R groups. In contrast, a traditional R-group table only covers certain pre-defined core structures. A SARM simplifies the interpretation of the information through color-coding of compound properties and relationships. Additionally, missing compounds are highlighted as holes, which are colored white. These missing, or virtual, compounds can be prioritized for synthesis depending on the activity of compounds in their neighborhood (row/column). However, the interpretation of the activity of a neighborhood can be highly subjective and depends on the ordering/clustering of the rows and columns of the SARM. Depending on the size and diversity of a dataset, a large number of overlapping SARMs may be generated. Consequently, a large number of virtual compounds are also generated, representing unexplored chemistry space. Visually exploring thousands of SARMs to identify the most promising virtual compounds for synthetic follow-up is not practical, so computational methods that help prioritize compounds are necessary. Methods used thus far in conjunction with SARMs9,10 predicted the activity of virtual compounds based on an evaluation of matrix neighborhood and conditional probabilities. Here we examine additional methods for compound prioritization in SARMs using three techniques: 1) matrix pattern-based prioritization, 2) similarity-weighted, matrix pattern-based prioritization, and 3) analysis of variance-based prioritization. The results are compared across six benchmark datasets of various compositions to establish the consistency of the relative performance of the prediction methods.
We also compare our methods against a set of null models to assess the importance of squares, neighborhoods, and SARMs, and conclude that the more SARM information is used in evaluating test compounds, the better the prediction accuracy.


Methods

SAR Matrices

The methodology used to create SARMs is described fully elsewhere7. In brief, SARMs are generated by exhaustively fragmenting compounds and organizing all possible combinations of core structures and R groups. We use no special handling or treatment of stereoisomers, enantiomers, and tautomers during the SARM generation process. Typically, three types of matrices are considered: 1) single-cut matrices, which cut molecules at one exocyclic single bond to form core and R-groups; 2) double-cut matrices, which cut molecules at two exocyclic bonds; and 3) triple-cut matrices, which cut at three exocyclic bonds. The algorithm iterates through all possible cuts, creating hundreds to thousands of matrices for a data set of average size. Each matrix is assigned a unique index number as its Matrix ID. The code to run the SARM methodology was obtained through a collaboration with Prof. Jürgen Bajorath, University of Bonn, Germany.

Data sets

Data sets used in this work were published earlier9. These include six benchmark datasets of various GPCR antagonists (dopamine D2 receptor, adenosine A1 receptor, adenosine A2a receptor, adenosine A3 receptor, melanocortin 4 receptor, and histamine H3 receptor) from ChEMBL version 1511. These datasets are referenced by their target identifier (TID; see Gupta et al., Table 19) in the discussion section. Biological data are given as pKi values that range from 3 to 11. The number of compounds in each target dataset ranged from 1103 to 1850, resulting in 655 to 1109 matrices. The datasets are summarized in Table 1.

Table 1. Data sets used in this study.

| Target Name | TID | # Compounds | # Matrices | # Single/Double/Triple | pKi Range | # Test Compounds |
|---|---|---|---|---|---|---|
| Dopamine D2 | 217 | 1419 | 701 | 291/349/61 | 3.0~10.2 | 473 |
| Adenosine A1 | 226 | 1825 | 1104 | 350/535/219 | 4.2~10.5 | 608 |
| Adenosine A2a | 251 | 1850 | 957 | 288/451/218 | 4.0~11.0 | 616 |
| Adenosine A3 | 256 | 1547 | 1109 | 361/525/223 | 4.1~11.0 | 515 |
| Melanocortin 4 | 259 | 1103 | 669 | 155/364/150 | 3.9~9.4 | 367 |
| Histamine H3 | 264 | 1718 | 655 | 257/354/44 | 4.4~10.5 | 572 |

In order to validate the results of the prediction algorithms described in this work, about one third of the “real compounds” for each dataset were randomly selected as a test set, as reported earlier in Gupta et al9.

“Close-in” Analog Prioritization

Matrix Pattern-based (MP) evaluation

In a typical SARM, a set of three real compounds can form a square with a virtual compound, where two orthogonal changes are made (Figure 1B). This set of molecules has been referred to as a “chemical double-mutant cycle”12, by analogy with physical organic chemistry experiments used to quantify specific molecular interactions13,14. Here we refer to this compound relationship as a “squarewise” relationship, by analogy with a “pairwise” relationship. For a single square, the activity (e.g., pKi) estimate for the virtual compound, under the assumption of additivity, is the sum of the activities of the two neighboring real compounds minus the activity of the fourth compound at the diagonally opposite corner. This corresponds to determining the activity differences in moving from the diagonally opposite corner to each of the neighboring corners and applying both of those differences to determine the unknown activity. Although some papers have questioned the assumption of additivity in many or all cases2,15, it has also been noted that “the assumption of additivity is often the only reasonable approach when trying to rationally optimize inhibitor binding”12.
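The squarewise estimate described above can be written out explicitly. The labels are chosen here for illustration: A is the virtual compound, B and C are the two real compounds adjacent to it in the square, and D is the real compound at the diagonally opposite corner. The numerical values are invented for the worked example and are not taken from the benchmark data.

```latex
\begin{align*}
\widehat{pK_i}(A) &= pK_i(B) + pK_i(C) - pK_i(D) \\
                  &= pK_i(D) + \bigl[pK_i(B) - pK_i(D)\bigr] + \bigl[pK_i(C) - pK_i(D)\bigr] \\[4pt]
\text{e.g. } pK_i(B) = 7.2,\; pK_i(C) = 6.8,\; pK_i(D) = 6.5
  \;&\Rightarrow\; \widehat{pK_i}(A) = 7.2 + 6.8 - 6.5 = 7.5
\end{align*}
```

The second line makes the additivity assumption visible: each of the two structural changes is taken to contribute the same activity difference regardless of whether the other change is present.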


The Matrix Pattern (MP) method is based on the concept of squarewise additivity. Figure 2A illustrates how this method works to predict virtual compound A. This method first identifies all squares with virtual compound A, no matter how the rows and columns are arranged in a SARM. For each square, the predicted activity of virtual compound A is calculated based on the squarewise additivity. There could be a number of such squares across many SARMs; therefore the consensus prediction of compound A is the average of all individual squarewise predictions. The standard deviation for the individual predictions is also calculated.
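A minimal sketch of the MP consensus step, under stated assumptions: each SARM is represented as a hypothetical dictionary mapping `(key, value)` to a measured activity, and a square is any pair of a different key and a different value for which all three real corners have measured activities. The function names and data are illustrative and are not the authors' R or Pipeline Pilot implementation.

```python
from statistics import mean, stdev


def find_squares(cells, target):
    """Yield (row_nbr, col_nbr, diag) activities for every square containing `target`.

    `cells` maps (key, value) -> measured activity (None if unmeasured).
    `target` is the (key, value) pair of the virtual compound.
    """
    t_key, t_val = target
    keys = {k for k, _ in cells}
    vals = {v for _, v in cells}
    for k in keys - {t_key}:
        for v in vals - {t_val}:
            row_nbr = cells.get((t_key, v))  # same key as target, other value
            col_nbr = cells.get((k, t_val))  # other key, same value as target
            diag = cells.get((k, v))         # diagonally opposite corner
            if None not in (row_nbr, col_nbr, diag):
                yield row_nbr, col_nbr, diag


def mp_predict(matrices, target):
    """Average the squarewise additive estimates over all squares in all matrices."""
    preds = [b + c - d
             for cells in matrices
             for b, c, d in find_squares(cells, target)]
    if not preds:
        return None  # no square available: outside the method's coverage
    spread = stdev(preds) if len(preds) > 1 else 0.0
    return mean(preds), spread


# Illustrative matrix; the virtual compound is ("k2", "v2").
example = {
    ("k1", "v1"): 6.0, ("k1", "v2"): 7.0,
    ("k2", "v1"): 6.5,
    ("k3", "v1"): 5.0, ("k3", "v2"): 6.2,
}
consensus, spread = mp_predict([example], ("k2", "v2"))
```

In this example two squares exist (via `k1` and via `k3`), giving individual estimates of about 7.5 and 7.7. Returning `None` when no square is found reflects the coverage limitation of square-based methods.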


Figure 2. A) MP compound prediction method. Highlighted boxes are squares with virtual compound A. The color of SARM cells indicates the potency of real compounds. Although only four squares are shown, there could be more squares for virtual compound A, and thus the consensus prediction is the average of S1~Sn squarewise predictions; B) MP-SIM compound prediction method, and the consensus prediction is the weighted average of S1~Sn squarewise predictions; C) ANV compound prediction method. For virtual compound in this illustrated matrix, M1, the i and j indices are both 2, since this compound is the combination of key 2 (the second row) and value 2 (the second column). The consensus prediction is the average of individual ANOVA predictions of M1~Mn; D) Null Model 1; E) Null Model 2.

Similarity Weighted Matrix Pattern-based (MP-SIM) evaluation

A limitation of the MP method is that it assumes all squares contribute equally to the consensus prediction of the virtual compound. However, it is possible that for one square the structural transformation is quite dramatic compared with others that involve a more conservative transformation. The prediction from such a square is less likely to be accurate and might bias the consensus prediction. Therefore, a structural similarity-weighting factor was introduced when calculating the consensus prediction. For square n with virtual compound A and its neighbors B, C, and D, we define the weight of the square as:

$$ W_n = \frac{\sum_{Y \in \{B, C, D\}} Tc_{AY}}{3} $$

where $Tc_{AY}$ is the Tanimoto similarity between the virtual compound A and its neighbor Y, based on ECFP4 fingerprints16. The consensus prediction of virtual compound A is then the weighted average of all individual predictions (Figure 2B).
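The weighting can be sketched as follows. The Tanimoto similarities are assumed to be precomputed elsewhere (the source uses ECFP4 fingerprints), and normalizing by the sum of the weights is an assumption about what "weighted average" means here. Names and numbers are illustrative.

```python
def square_weight(tc_ab, tc_ac, tc_ad):
    """Mean Tanimoto similarity of the virtual compound A to its three square neighbors."""
    return (tc_ab + tc_ac + tc_ad) / 3.0


def mp_sim_predict(squares):
    """Similarity-weighted consensus over squarewise predictions.

    `squares` is a hypothetical list of (prediction, (tc_ab, tc_ac, tc_ad)) tuples,
    where each prediction is a squarewise additive estimate for the same compound.
    """
    weights = [square_weight(*tcs) for _, tcs in squares]
    total = sum(weights)
    if total == 0:
        return None  # no square, or all similarities are zero
    # Assumption: weights are normalized by their sum.
    return sum(w * pred for w, (pred, _) in zip(weights, squares)) / total


# Two squares: a conservative one (high similarity) and a more drastic one.
squares = [
    (7.5, (0.9, 0.8, 0.7)),  # weight 0.8
    (7.7, (0.5, 0.4, 0.3)),  # weight 0.4
]
weighted = mp_sim_predict(squares)
```

With these inputs the conservative square counts twice as much as the drastic one, so the consensus is pulled toward 7.5 rather than sitting at the unweighted mean of 7.6.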

ANOVA-based (ANV) evaluation


For each data set, the test set was removed and retained as an evaluation set for the methodology. ANOVA models were then built on the remaining training set compounds.

First, the eligibility of each unique Matrix Type (single, double, or triple-cut matrix) and Matrix ID (sequential matrix index) is determined. For a Type/ID combination, the number of unique Key (e.g., core structure) and Value (e.g., R-group) levels are counted, along with the total number of molecules, which have measured activity values. If the total number of molecules with measured activity values is greater than the sum of the number of unique Key and Value levels, then the Type/ID combination can be analyzed using an ANOVA model17.

When there is a sufficient amount of data within a matrix, a statistical interaction term between the factors can be estimated. However, because almost all matrices are sparse (few real compounds relative to the number of possible compounds) the ANOVA model assumes that the effects of the Keys and Values are additive. This means that we assume that the activity can be explained by the effect due to a specific Key, plus the effect due to a specific Value, plus unexplainable random error as follows (Figure 2C):

$$ \mathrm{Activity}_{ij} = \mathrm{Key}_i + \mathrm{Value}_j + \varepsilon_{ij}, $$

where i = 1, 2, …, n (n is the total number of Keys) and j = 1, 2, …, p (p is the total number of Values). Activityij represents the activity of the compound composed of the ith Key and the jth Value. The average effect of each Key and Value is then computed. This approach enables the statistical analysis of the overall effects of Key and Value, which can help identify the matrices with potentially interesting compounds. Furthermore, this approach allows the models to (1) predict activity levels for unmeasured compounds, and (2) generate prediction intervals17.
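A minimal sketch of the eligibility rule and the additive Key + Value fit for a single matrix follows. It is not the authors' R code: it fits the model by alternating averages (backfitting), which converges to a least-squares solution of the additive model, and it carries an explicit overall mean `mu` that is only a reparameterization of Key_i + Value_j. It assumes the matrix is connected, so that the missing cell is estimable. Data and names are illustrative.

```python
def is_eligible(cells):
    """Source rule: measured compounds must outnumber unique Keys plus unique Values."""
    keys = {k for k, _ in cells}
    vals = {v for _, v in cells}
    return len(cells) > len(keys) + len(vals)


def fit_additive(cells, n_iter=500):
    """Fit activity ~ mu + key effect + value effect by alternating averages.

    `cells` maps (key, value) -> measured activity (measured compounds only).
    """
    mu = sum(cells.values()) / len(cells)
    key_eff = {k: 0.0 for k, _ in cells}
    val_eff = {v: 0.0 for _, v in cells}
    for _ in range(n_iter):
        for k in key_eff:  # update each Key effect from its residuals
            res = [a - mu - val_eff[v] for (kk, v), a in cells.items() if kk == k]
            key_eff[k] = sum(res) / len(res)
        for v in val_eff:  # update each Value effect from its residuals
            res = [a - mu - key_eff[k] for (k, vv), a in cells.items() if vv == v]
            val_eff[v] = sum(res) / len(res)
    return mu, key_eff, val_eff


def predict(model, key, value):
    """Predicted activity for a Key/Value combination, measured or virtual."""
    mu, key_eff, val_eff = model
    return mu + key_eff[key] + val_eff[value]


# Illustrative, exactly additive 3x3 matrix with one hole at ("k2", "v2").
cells = {
    ("k1", "v1"): 6.0, ("k1", "v2"): 7.0, ("k1", "v3"): 5.5,
    ("k2", "v1"): 6.5,                    ("k2", "v3"): 6.0,
    ("k3", "v1"): 5.0, ("k3", "v2"): 6.0, ("k3", "v3"): 4.5,
}
eligible = is_eligible(cells)  # 8 measured > 3 Keys + 3 Values
model = fit_additive(cells)
hole_prediction = predict(model, "k2", "v2")
```

Because the example data are exactly additive, the fitted model reproduces the measured cells and fills the hole with the additive value. Prediction intervals would additionally require the residual variance and the design matrix, which this sketch does not compute.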


It is important to note the subtle difference between a confidence interval and a prediction interval. To illustrate the difference in these intervals, let us suppose a specific Key and Value combination is selected for a desired SAR matrix. The model derived from this SAR matrix provides estimates for the individual Key and Value. Using these estimates we can generate the corresponding expected (or average) activity value and using the estimated model error, we can compute a confidence interval about the expected activity value. Notice that this interval describes the bounds about the expected activity, but not the bounds about an individual new molecule based on this Key and Value combination. To get the interval about an individual Key and Value prediction, we must also include an estimate of variation for the prediction of an individual molecule. Therefore a prediction interval is necessarily wider than a confidence interval, and is the interval we would expect if we were to create a new molecule. The activity predictions and corresponding prediction bounds can then be used to help prioritize compounds for follow-up.
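The distinction can be made concrete with the standard linear-model formulas. This is textbook notation, not taken from the manuscript: X is the full-rank design matrix encoding the Key and Value levels of the measured compounds, x_0 is the row encoding the new Key/Value combination, s is the estimated residual standard deviation, and ν is the residual degrees of freedom.

```latex
\begin{align*}
\text{Confidence interval (expected activity):}\quad
  & \hat{y}_0 \pm t_{1-\alpha/2,\,\nu}\; s\,\sqrt{x_0^{\top}(X^{\top}X)^{-1}x_0} \\[4pt]
\text{Prediction interval (one new molecule):}\quad
  & \hat{y}_0 \pm t_{1-\alpha/2,\,\nu}\; s\,\sqrt{1 + x_0^{\top}(X^{\top}X)^{-1}x_0}
\end{align*}
```

The extra 1 under the square root is the variance contribution of an individual new observation, which is why the prediction interval is always wider than the confidence interval.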

Null Hypotheses & Models

Three Null Models were developed and compared with the SARM-based evaluation methods described above, in order to test whether the compound prioritization methods developed here genuinely benefit from the SARM information they use.

1) Null Model 1 - The importance of squares: The Null hypothesis of this model is that the squarewise additivity is not critical in predicting a virtual compound. Therefore, no squarewise additivity is used in this Null Model development. In this model, for each square, the predicted activity of a virtual compound is simply the average of all the three real compounds in the same square. The consensus prediction of this virtual compound is the average of all individual square predictions across all matrices (Figure 2D).


2) Null Model 2 - The importance of neighborhoods: The Null hypothesis of this model is that the neighborhood information (how the real compounds are arranged in the same matrix with the virtual compound) is not critical in virtual compound prediction. Therefore, in this model, for each matrix, the predicted activity of the virtual compound is simply the average of all real compounds in the same matrix. The consensus prediction of this virtual compound is the average of individual predictions from all matrices that contain this virtual compound (Figure 2E).

3) Null Model 3 - The importance of SARMs: The Null hypothesis of this model is that the SARM information is not useful in virtual compound prediction. Therefore, for each dataset, a kNN (k=10) model based on FCFP6 fingerprints16 was developed using all the real compounds in this dataset. The model then predicted each virtual compound.
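Null Models 1 and 2 reduce to plain averages, which the following sketch makes explicit. The input layouts are assumptions made for illustration: squares are given as triples of real-compound activities, and each matrix as a list of the measured activities it contains. Null Model 3 (kNN on fingerprints) is not sketched because it requires a cheminformatics toolkit.

```python
from statistics import mean


def null_model_1(squares):
    """Average of the three real compounds per square, then averaged over squares.

    `squares` is a hypothetical list of (b, c, d) activity triples, one per
    square containing the virtual compound. No additivity is applied.
    """
    if not squares:
        return None
    return mean(mean(sq) for sq in squares)


def null_model_2(matrices):
    """Average of all real compounds per matrix, then averaged over matrices.

    `matrices` is a hypothetical list of activity lists, one per matrix that
    contains the virtual compound. Matrix arrangement is ignored.
    """
    per_matrix = [mean(acts) for acts in matrices if acts]
    return mean(per_matrix) if per_matrix else None


# Illustrative inputs only.
nm1 = null_model_1([(6.5, 7.0, 6.0), (6.5, 6.2, 5.0)])
nm2 = null_model_2([[6.0, 7.0], [5.0, 6.0, 7.0]])
```

For the same two squares, the additive MP estimates would be about 7.5 and 7.7, whereas Null Model 1 yields 6.2, which shows how discarding the direction of the activity differences changes the prediction.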

All methods (MP, MP-SIM, ANV, and Null Models) were developed in R (version 3.1.0) and Pipeline Pilot (version 9.0.2.1).

Activity Cliffs (ACs)

For each dataset, activity cliffs were calculated and are given in Supplemental Table S1. Activity cliffs were defined as compound pairs with a Tanimoto similarity (calculated based on ECFP4 fingerprints) higher than 0.56 and an activity (Ki) difference greater than 2 log units, as reported earlier18,19. Activity cliffs that were formed by a test compound and a training compound were retained separately. Not every test compound forms activity cliffs with training compounds; therefore, test compounds that form activity cliffs with training compounds (AC-forming test compounds) are reported as a separate column in this table.

In order to compare how different methods handle and predict activity cliffs (i.e., test compounds that form activity cliffs), test compounds were split into two groups: activity cliff-forming compounds (AC-forming) and compounds that don’t form activity cliffs with training compounds (No AC). Since different methods have different coverage, the numbers of test set compounds that are predicted can vary from method to method. Prediction errors were then compared between these two groups for each method and results reported. Mean absolute errors were calculated, instead of root mean square errors, to minimize the impact of various group sizes.
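The activity cliff criterion and the per-group error comparison can be sketched as below. Similarities are assumed to be precomputed (ECFP4 Tanimoto in the source), activities are pKi values, and all identifiers and numbers are hypothetical.

```python
def is_activity_cliff(similarity, pki_a, pki_b, sim_cutoff=0.56, delta_cutoff=2.0):
    """Source definition: similarity above 0.56 and activity difference above 2 log units."""
    return similarity > sim_cutoff and abs(pki_a - pki_b) > delta_cutoff


def split_by_cliff(test_ids, cliff_pairs):
    """Split test compounds into AC-forming and No-AC groups.

    `cliff_pairs` is an iterable of (test_id, training_id) pairs that form cliffs.
    """
    ac_forming = {t for t, _ in cliff_pairs}
    with_ac = [t for t in test_ids if t in ac_forming]
    without_ac = [t for t in test_ids if t not in ac_forming]
    return with_ac, without_ac


def mean_absolute_error(observed, predicted, ids):
    """MAE over the ids a method could actually predict (coverage varies by method)."""
    covered = [i for i in ids if i in predicted]
    if not covered:
        return None
    return sum(abs(observed[i] - predicted[i]) for i in covered) / len(covered)


# Illustrative data: t3 is outside this method's coverage.
observed = {"t1": 8.0, "t2": 6.0, "t3": 7.0}
predicted = {"t1": 6.5, "t2": 6.2}
ac_ids, no_ac_ids = split_by_cliff(observed, [("t1", "tr9")])
mae_ac = mean_absolute_error(observed, predicted, ac_ids)
mae_no_ac = mean_absolute_error(observed, predicted, no_ac_ids)
```

Restricting each average to the covered compounds mirrors the point made above that the number of predicted test compounds differs from method to method.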

Close-in Analog evaluation workflow using SARM

A generalized workflow was developed (as shown in Figure 3) that uses all of the SARM-based compound evaluation methods. For a given set of compounds, the first step is to generate all SARMs based on the existing dataset. Generally, a set of hundreds of compounds can result in hundreds or thousands of SARMs. Each virtual compound is then evaluated using all of the above methods, and its consensus prediction is obtained by averaging the predictions from the individual methods. All virtual compounds are then prioritized by their consensus predictions, and only the top virtual compounds are proposed for synthesis and testing. Often virtual compounds are prioritized based on multiple parameters (potency, selectivity, etc.), which can be easily accommodated in the SARM framework in a couple of ways: 1) each parameter is predicted separately for the virtual compounds, which are then prioritized based on individual criteria (for example, potency < 100 nM and selectivity > 5-fold); or 2) a multi-parameter optimization (MPO) score is calculated, and virtual compounds are prioritized based on their predicted MPO scores.
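The consensus and ranking steps of the workflow can be sketched as follows. How the workflow treats a virtual compound that only some methods can predict is not stated in the text; averaging over the available predictions is an assumption made here. Names and values are illustrative.

```python
def consensus(preds_by_method):
    """Average the predictions of the methods that could predict this compound.

    `preds_by_method` maps a method name to a prediction, or None if the
    compound is outside that method's coverage (assumed handling).
    """
    available = [p for p in preds_by_method.values() if p is not None]
    return sum(available) / len(available) if available else None


def prioritize(virtuals, top_n):
    """Rank virtual compounds by consensus prediction, highest first."""
    scored = [(name, consensus(preds)) for name, preds in virtuals.items()]
    scored = [(name, score) for name, score in scored if score is not None]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Illustrative predicted pKi values for three virtual compounds.
virtuals = {
    "V1": {"MP": 7.0, "MP-SIM": 7.2, "ANV": None},
    "V2": {"MP": 8.0, "MP-SIM": 7.8, "ANV": 8.2},
    "V3": {"MP": None, "MP-SIM": None, "ANV": None},
}
ranked = prioritize(virtuals, top_n=2)
```

Compounds that no method can predict (here `V3`) drop out of the ranking. A multi-parameter variant would apply the same ranking to a predicted MPO score instead of a single activity.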


Figure 3. SARM-based library design workflow. The color of SARM cells indicates the potency of real compounds. Grey arrow in the bottom right indicates a second round of design.

Results and Discussion

All test set compounds from each of the six data sets were predicted by three SARM-based methods and three Null Models that are described above, i.e., MP, MP-SIM, ANV, Null Model 1, Null Model 2, and Null Model 3. The prediction results from these methods are compared in Table 2 and Supplemental Table S2.

Compound Prediction Statistics

Table 2. Comparison of the compound prediction statistics (top: accuracy; bottom: coverage) across different methodologies. NBH is the prediction results from a previously published SARM compound prediction method9.

Prediction Accuracy (R2)

| Target Name | TID | # Test Compounds | Null Model 1 | Null Model 2 | Null Model 3 | MP | MP-SIM | ANV | NBH |
|---|---|---|---|---|---|---|---|---|---|
| Dopamine D2 | 217 | 473 | 0.657 | 0.480 | 0.610 | 0.636 | 0.629 | 0.729 | 0.632 |
| Adenosine A1 | 226 | 608 | 0.527 | 0.300 | 0.527 | 0.665 | 0.659 | 0.743 | 0.666 |
| Adenosine A2a | 251 | 616 | 0.747 | 0.608 | 0.688 | 0.754 | 0.755 | 0.780 | 0.749 |
| Adenosine A3 | 256 | 515 | 0.569 | 0.283 | 0.636 | 0.644 | 0.638 | 0.671 | 0.655 |
| Melanocortin 4 | 259 | 367 | 0.582 | 0.407 | 0.726 | 0.779 | 0.774 | 0.823 | 0.828 |
| Histamine H3 | 264 | 572 | 0.582 | 0.422 | 0.611 | 0.656 | 0.643 | 0.790 | 0.669 |

Prediction Coverage (# Predictable Compounds)

| Target Name | TID | # Test Compounds | Null Model 1 | Null Model 2 | Null Model 3 | MP | MP-SIM | ANV | NBH |
|---|---|---|---|---|---|---|---|---|---|
| Dopamine D2 | 217 | 473 | 224 | 473 | 473 | 224 | 224 | 126 | 200 |
| Adenosine A1 | 226 | 608 | 282 | 608 | 608 | 282 | 282 | 202 | 267 |
| Adenosine A2a | 251 | 616 | 240 | 616 | 616 | 240 | 240 | 161 | 226 |
| Adenosine A3 | 256 | 515 | 256 | 515 | 515 | 256 | 256 | 185 | 248 |
| Melanocortin 4 | 259 | 367 | 138 | 367 | 367 | 138 | 138 | 109 | 125 |
| Histamine H3 | 264 | 572 | 267 | 572 | 572 | 267 | 267 | 143 | 252 |

For each of the six data sets, the prediction accuracy is measured by the squared Pearson correlation coefficient (R2) between reported and predicted activity values for all test set compounds that are within the applicability domain of each method (Table 2). As one would expect, different methods have different applicability domains. For example, if no squares are available for a test compound, methods like MP, MP-SIM, or Null Model 1 cannot predict it. Therefore, to capture the applicability domain, the prediction coverage of the test sets for each method is reported in Table 2 as well.
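The two statistics can be computed as in the following sketch, which assumes hypothetical dictionaries keyed by compound identifier: `observed` for the whole test set and `predicted` for only those compounds a method could predict. The numbers are illustrative.

```python
def squared_pearson(observed, predicted):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(observed)
    if n < 2:
        return None
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    sxy = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sxx = sum((o - mean_o) ** 2 for o in observed)
    syy = sum((p - mean_p) ** 2 for p in predicted)
    if sxx == 0 or syy == 0:
        return None  # correlation undefined when either series is constant
    return sxy * sxy / (sxx * syy)


def evaluate(observed, predicted):
    """Return (R2, coverage) restricted to the compounds a method could predict."""
    covered = [cid for cid in observed if cid in predicted]
    r2 = squared_pearson([observed[c] for c in covered],
                         [predicted[c] for c in covered])
    return r2, len(covered)


# Illustrative test set of four compounds; the method covers three of them.
observed = {"a": 5.0, "b": 6.0, "c": 7.0, "d": 8.0}
predicted = {"a": 5.5, "b": 6.5, "c": 7.5}
r2, coverage = evaluate(observed, predicted)
```

Note that a squared Pearson correlation is insensitive to a constant offset: the predictions above are all 0.5 units too high yet still correlate perfectly, which is one reason to read R2 alongside an error measure.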


The MP method is similar to the previously reported SARM neighborhood-based method (NBH)9, which utilizes neighborhood information (analogous to the “squares” described in this work). Therefore, as expected, the prediction accuracies of the MP models are quite similar to those of NBH. MP-SIM did not outperform MP or NBH, indicating that structural similarity weighting does not improve compound prediction here. This suggests that the datasets used in the study cover relatively smooth SAR regions, where structural similarity generally correlates with activity, so that similarity weighting did not alter the predictions much.

The ANV method provides the best prediction accuracy for 5 of the 6 datasets. This method utilizes not only the squarewise information but also the matrix neighborhood information. Furthermore, the ANV method uses information from all real compounds in the same matrix to predict a virtual compound.

We also studied coverage as another factor in evaluating the performance of these methods (Table 2 and Supplemental Table S2). No obvious correlation was found between prediction coverage and prediction accuracy. In general, NBH has slightly lower coverage than MP because the two methods process the prediction results differently. For example, the NBH method requires a virtual compound to have at least three qualifying neighborhoods before it can be predicted, whereas MP places no such limit on the individual squarewise predictions. MP-SIM has the same coverage as MP, because both use the same squarewise information for individual predictions and differ only in how the consensus prediction is calculated. The ANV method has the lowest prediction coverage, because not every virtual compound has enough matrix key/value information to calculate a prediction. To compare the methods on the same set of test compounds, we restricted the MP and MP-SIM predictions to the test compounds within the applicability domain of the ANV method and found that the prediction accuracies of MP and MP-SIM are still poorer than those of ANV (data not shown).
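The like-for-like comparison described above amounts to intersecting the applicability domains before scoring. The sketch below shows one way to do this in Python; the function names, the data layout, and the toy values are hypothetical, and the paper's actual implementation is not shown in the text.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation (assumes non-constant inputs)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


def compare_on_common_domain(reported, predictions_by_method):
    """Score every method on the compounds that all methods can predict.

    predictions_by_method: {method name: {compound_id: predicted activity}};
    a compound missing from a method's dict is outside that method's domain.
    Returns ({method name: R^2 on the shared compounds}, shared compound ids).
    """
    common = set(reported)
    for preds in predictions_by_method.values():
        common &= set(preds)
    ids = sorted(common)
    truth = [reported[c] for c in ids]
    scores = {name: r_squared(truth, [preds[c] for c in ids])
              for name, preds in predictions_by_method.items()}
    return scores, ids


# Toy values (hypothetical): the second method cannot predict compound "c5".
reported = {"c1": 5.0, "c2": 6.0, "c3": 7.0, "c4": 8.0, "c5": 6.5}
methods = {
    "wide_coverage": {"c1": 5.6, "c2": 5.8, "c3": 7.4, "c4": 7.5, "c5": 6.0},
    "narrow_coverage": {"c1": 5.1, "c2": 6.2, "c3": 6.9, "c4": 8.1},
}
scores, shared = compare_on_common_domain(reported, methods)
print(shared, scores)
```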


Null Models were created to answer three questions about SARM-based compound prediction methods: 1) Is squarewise additivity useful for virtual compound predictions? 2) Is matrix neighborhood information important for virtual compound predictions? 3) Does the SARM formalism help virtual compound predictions? Comparing the prediction statistics in Table 2 and Supplemental Table S2, we noted that the Null Models perform worse than all of the SARM-based methods (MP, MP-SIM, or ANV). Specifically, Null Model 2 uses the least information from the dataset (no squarewise additivity information, no QSAR information, and only partial information about the matrix neighborhood). It is therefore not surprising that Null Model 2 has the lowest prediction accuracy. On the other hand, because Null Model 2 requires the least information for a prediction, it can predict the entire test set and thus has the highest prediction coverage. Null Model 1 does not use squarewise additivity information, and as expected, its predictions are less accurate than those of the MP method for 5 of the 6 datasets. The SARM formalism was found to be helpful in virtual compound prioritization in two key areas: 1) the SARM methodology assists in the identification and enumeration of all potential analogs; 2) the prediction accuracies are improved with the use of SARMs. In particular, the prediction accuracies of ANV, which utilizes the SARM matrices, are significantly higher for all six datasets than those of Null Model 3, which does not use any SARM information (Table 2; p = 0.003, derived from a one-tailed t-test on the prediction accuracies of ANV vs. Null Model 3).
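To make the coverage differences easier to compare across targets, the counts in Table 2 can be expressed as fractions of the test set. The sketch below uses the counts as read from Table 2 and, following the statement above that Null Model 2 predicts the entire test set, takes the Null Model 2 count as the test set size. That normalization and all names in the code are assumptions made for illustration.

```python
# Coverage counts as read from Table 2: target -> number of predictable compounds.
COVERAGE = {
    "Dopamine D2":    {"Null 1": 224, "Null 2": 473, "Null 3": 473, "MP": 224, "MP-SIM": 224, "ANV": 126, "NBH": 200},
    "Adenosine A1":   {"Null 1": 282, "Null 2": 608, "Null 3": 608, "MP": 282, "MP-SIM": 282, "ANV": 202, "NBH": 267},
    "Adenosine A2a":  {"Null 1": 240, "Null 2": 616, "Null 3": 616, "MP": 240, "MP-SIM": 240, "ANV": 161, "NBH": 226},
    "Adenosine A3":   {"Null 1": 256, "Null 2": 515, "Null 3": 515, "MP": 256, "MP-SIM": 256, "ANV": 185, "NBH": 248},
    "Melanocortin 4": {"Null 1": 138, "Null 2": 367, "Null 3": 367, "MP": 138, "MP-SIM": 138, "ANV": 109, "NBH": 125},
    "Histamine H3":   {"Null 1": 267, "Null 2": 572, "Null 3": 572, "MP": 267, "MP-SIM": 267, "ANV": 143, "NBH": 252},
}


def coverage_fractions(counts, reference="Null 2"):
    """Divide each method's count by the reference method's count.

    Using Null Model 2 as the reference is an assumption based on the
    statement that it can predict the entire test set.
    """
    total = counts[reference]
    return {method: n / total for method, n in counts.items()}


for target, counts in COVERAGE.items():
    frac = coverage_fractions(counts)
    print(f"{target:14s}  MP {frac['MP']:.0%}  NBH {frac['NBH']:.0%}  ANV {frac['ANV']:.0%}")
```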

Impact of parameters

To better understand the performance of the different methods, the prediction errors for the test set compounds were compared with a number of parameters that characterize the squares and neighborhoods of SARMs (such as the number of squares, structural similarity within squares, variation of square predictions, matrix discontinuity scores, and matrix types).


Figure 4 and Supplemental Figure S1 show the effect of the variation of square predictions (quantified as the prediction standard deviation) on prediction errors. The prediction standard deviation was discussed in the previous SAR Matrix compound prediction paper,⁹ which considers both low and high prediction variations. Gupta-Ostermann et al.⁹ showed that compounds with low prediction variation across squares usually have lower prediction errors. In contrast, high prediction variation indicates discontinuous SAR regions surrounding the virtual compounds, which are worth further investigation to understand the SAR in these regions. This is consistent with our observations. Generally, regardless of the method used, the higher the standard deviation of the individual square predictions, the higher the consensus prediction error of the test compound (except for the Null 1 model, which has too few squares at high standard deviations). A higher prediction error is also seen when the standard deviation is very low (i.e., close to 0). This occurs when a prediction is based on only a few squares.
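The two quantities plotted against each other in Figure 4 can be sketched per compound as follows. The consensus is taken here as the plain mean of the individual square predictions and the spread as the population standard deviation; both choices, like the function name and the toy values, are assumptions made for illustration rather than the definitions used in the paper.

```python
from statistics import mean, pstdev


def summarize_compound(square_predictions, measured_activity):
    """Return (consensus, spread, abs_error, n_squares) for one test compound.

    consensus: mean of the individual square predictions (assumed rule)
    spread:    population standard deviation of those predictions,
               which is 0.0 when only one square is available
    abs_error: absolute difference between consensus and measured activity
    """
    consensus = mean(square_predictions)
    spread = pstdev(square_predictions)
    abs_error = abs(consensus - measured_activity)
    return consensus, spread, abs_error, len(square_predictions)


# Toy values (hypothetical), in log units of activity.
examples = {
    "consistent squares": ([7.0, 7.1, 6.9, 7.0], 7.0),
    "discordant squares": ([5.0, 8.0, 6.0, 9.0], 6.0),
    "single square": ([8.2], 6.0),
}
for label, (preds, measured) in examples.items():
    consensus, spread, err, n = summarize_compound(preds, measured)
    print(f"{label:20s} n={n}  sd={spread:.2f}  |error|={err:.2f}")
```

The single-square case illustrates the last point above: with one square the standard deviation is zero by construction, so a low spread says nothing about how reliable the prediction is.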


Figure 4. The effect of square prediction standard deviations on prediction errors. The top numbers represent the TID of each dataset (exemplified by datasets 217 and 264, see Supplemental Figure S1 for the full figure with all six datasets). The x-axis is the standard deviation of individual square predictions for the compound, and the y-axis is the difference between its consensus prediction and real activity (absolute consensus prediction error). The red dashed line indicates the prediction error of 0.5. Three methods are compared in this figure: MP, MP-SIM, and Null 1, all of which utilize squarewise information in their prediction.


Figure 5 and Supplemental Figure S2 show the impact of square counts on prediction errors. Not surprisingly, the more squares that support the prediction of a test compound, the lower the prediction error. For compounds with only tens of squares, the prediction errors can be as high as 2.5 log units, while for most compounds with hundreds of squares, the error drops below 0.5 log units (with some outliers in the datasets with TIDs 259 and 264; see Supplemental Figure S2). Among the six prediction methods discussed in this work (MP, MP-SIM, ANV, Null 1, Null 2, Null 3), three use squarewise information when predicting compounds: MP, MP-SIM, and Null 1. For the Null 1 model, even compounds with hundreds of squares have prediction errors higher than 0.5, indicating a general loss of predictive power. This is expected, since the Null 1 method simply averages the activities of the three real compounds in a square. Based on Figure 5 and Supplemental Figure S2, we observe that virtual compounds with at least 50 squares usually have relatively low prediction errors across the six benchmark datasets.
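A short sketch of the Null 1 rule and of the square-count observation follows. The per-square averaging is as described above; combining the per-square estimates by a plain mean, using 50 as a hard cutoff, and all names in the code are assumptions made for illustration.

```python
from statistics import mean

MIN_SQUARES = 50  # taken from the observation above; used here as a simple flag


def null1_square_prediction(real_activities):
    """Null 1 estimate from one square: mean activity of its three real compounds."""
    if len(real_activities) != 3:
        raise ValueError("a square contributes exactly three real compounds")
    return mean(real_activities)


def null1_consensus(squares):
    """Average the per-square Null 1 estimates (the consensus rule is assumed)."""
    return mean([null1_square_prediction(sq) for sq in squares])


def has_enough_squares(squares, minimum=MIN_SQUARES):
    """Flag whether a virtual compound is supported by at least `minimum` squares."""
    return len(squares) >= minimum


# Toy values (hypothetical): two squares, each listing its three real compounds.
squares = [[6.0, 7.0, 8.0], [5.0, 5.0, 5.0]]
print(null1_consensus(squares), has_enough_squares(squares))
```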


Figure 5. The effect of square counts on prediction errors. The top numbers represent the TID of each dataset (exemplified by datasets 226 and 251, see Supplemental Figure S2 for the full figure with all six datasets). The x-axis is the number of squares for the compound, and the y-axis is the difference between its consensus prediction and real activity (absolute consensus prediction error). The red dashed line indicates a prediction error of 0.5. Three methods are compared in this figure: MP, MP-SIM, and Null 1, all of which utilize squarewise information in their predictions.

Figure 6 and Supplemental Figure S3 show the impact of square similarity on prediction errors. Looking at the MP predictions for the six datasets (Figure 6, top row), some datasets show a decrease in prediction error as square similarity increases (e.g., dataset 256), while other datasets show the reverse trend (e.g., dataset 259) or no obvious correlation between square similarity and prediction error (e.g., dataset 264). This could be due to differences in the SAR landscapes of these datasets. Intuitively, the more structurally similar the four compounds within a square are, the smaller the differences in their activities. In a smooth SAR space, the change in potency caused by a small structural change is easier to predict. For example, it is usually easier to predict the potency change for an F→Cl transformation than for an F→pyridine transformation. In this case, an increase in square similarity could result in lower prediction error. On the other hand, if a dataset has a rough SAR landscape, the four compounds in a square may have highly different activities even though they are structurally similar. Especially for squares with a 2D similarity of 1, which indicates that the four compounds might be stereoisomers, the range of prediction errors can be large in rough SAR regions (e.g., in dataset 259 at similarity = 1, the errors range from 0 to 2.5). Also, when comparing the two methods MP and Null 1, the prediction errors are similar when the square similarity is high, i.e., > 0.8. In contrast, when the square similarity is low, i.e.,