Target-Specific Prediction of Ligand Affinity with Structure-Based

1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40...
0 downloads 0 Views 1MB Size
Subscriber access provided by JAMES COOK UNIVERSITY LIBRARY

Chemical Information

Target-Specific Prediction of Ligand Affinity with Structure-Based Interaction Fingerprints Florian Leidner, Nese Kurt Yilmaz, and Celia A. Schiffer J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.9b00457 • Publication Date (Web): 05 Aug 2019 Downloaded from pubs.acs.org on August 5, 2019

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Target-Specific Prediction of Ligand Affinity with Structure-Based Interaction Fingerprints

Florian Leidner, Nese Kurt Yilmaz, Celia A. Schiffer*

Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, MA 01605, USA

*Corresponding author: [email protected]

ACS Paragon Plus Environment

1

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 33

Abstract Discovery and optimization of small molecule inhibitors as therapeutic drugs have immensely benefited from rational structure-based drug design. With recent advances in high-resolution structure determination, computational power, and machine learning methodology, it’s becoming more tractable to elucidate the structural basis of drug potency. However, the applicability of machine learning models to drug design is limited by interpretability of resulting models in terms of feature importance. Here, we take advantage of the large number of available inhibitor–bound HIV-1 protease structures and associated potencies to evaluate inhibitor diversity and machine learning models to predict ligand affinity. First, using a hierarchical clustering approach we grouped HIV-1 protease inhibitors and identified distinct core structures. Explicit features including protein–ligand interactions were extracted from high-resolution co-crystal structures as 3D-based fingerprints. We found that a gradient boosting machine learning model with this explicit feature attribution can predict binding affinity with high accuracy. Finally, Shapley values were derived to explain local feature importance. We found specific vdW interactions of key protein residues are pivotal for the predicted potency. Protein-specific and interpretable prediction models can guide the optimization of many small molecule drugs for improved potency.

ACS Paragon Plus Environment

2

Page 3 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Introduction Over the past decades structure based drug design (SBDD) has become an integral part of therapeutic development. 1-3 This can be attributed in equal parts to improvements in experimental xray-crystallography and NMR, and development of computational tools that facilitate structural analysis and molecular design. 4-6 The main premise of SBDD is that analysis of protein–ligand interactions and 3D structure of the target protein can guide molecular design. Using crystal structures or homology models, computational methods such as high throughput docking can identify potential binders from chemical libraries. In-silico screening campaigns are able to process large chemical libraries that would be infeasible to evaluate experimentally. One of the most important metrics in computational high throughput screening and SBDD is the binding affinity of a putative inhibitor toward a protein target. In particular in the early stages of lead discovery and optimization, binding affinity of small molecules can quickly be predicted by in-silico methods due to low cost and high accessibility of computational approaches. In theory, the binding free energy of a protein–ligand complex can be calculated directly from molecular dynamics simulations. The accuracy of the calculation however depends on the accuracy of the forcefield and convergence of the simulation. Simulations that follow a small molecule from free to bound state would be extremely slow to converge for most biological systems, and sampling enough states to simulate equilibrium conditions would be prohibitive. Instead free energy calculations commonly estimate the relative binding free energy of a small molecule of interest with respect to a highly similar reference molecule. The relative binding free energy is calculated by transforming the reference molecule into the target molecule over a series of “alchemical perturbation” steps in free energy perturbation (FEP) and thermodynamic integration methods. 7 These calculations are comparatively very slow because each perturbation step for each new compound requires additional simulations. Therefore, alchemical free energy calculations are useful primarily for evaluating specific modifications when an initial binder (hit or confirmed lead compound) has been identified. For high throughput in-silico screening campaigns, when routinely millions of compounds need to be evaluated, much simpler scoring functions are used. Scoring functions prioritize the rapid evaluation of putative binders over accurate prediction of binding free energy. The necessary increase in speed is achieved primarily by ignoring protein–ligand dynamics and explicit solvation. The free energy of binding is instead estimated based on a linear combination of drug–target interactions. Depending on the scoring function such interactions can include energy terms such as vdW energy and hydrogen-bonds, solvent accessible surface area, atompairwise potential derived from experimental data, or a combination of the above. 8-10The

ACS Paragon Plus Environment

3

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 33

weights of the interaction and potential terms used in scoring functions are generally fitted using a multi-linear regression model. 11 In recent years scoring functions based on complex machine learning algorithms have seen a surge in popularity. 12-16 While machine learning has been used in quantitative structure activity relationship (SAR) studies for a considerable amount of time, 17 its application to structure-based predictive modeling has so far been limited by the availability of protein–ligand co-crystal structures. Contrary to their traditional counterparts, machine learning based scoring functions (1) do not assume a linear relationship between the physicochemical properties of a protein–ligand complex and the binding free energy, (2) do not require input features to be independent, and (3) benefit more from large training sets. These advantages enabled nonlinear machine learning based scoring functions to achieve high accuracy on benchmark datasets. The distinguishing factor among various machine learning based scoring functions has been feature engineering and training process. For feature generation, protein-ligand structures are decomposed into a set of informative descriptors with the goal of thoroughly describing drug–target interactions while minimizing redundancy. Chosen features range in level of complexity and granularity from (atom-) pairwise distances over graph-based topological maps to protein–ligand interaction fingerprints. 12, 13, 15 After feature engineering, machine learning algorithms are trained on a set of protein–ligand structures with known binding affinity. For training, a number of benchmark sets for affinity prediction and docking are available. These datasets include a diverse set of protein families, ligand chemical structures, and binding affinities. 18, 19 Most models are trained and evaluated using the full set of available protein– ligand complexes. Although machine learning models can achieve good accuracy, application to structure based prediction of protein–ligand binding free energy suffer from certain drawbacks. Two main concerns are the potential bias of such models, and the opaqueness of the decision-making process. While a well-designed model training and evaluation process can alleviate some of these concerns, model bias is also in part due to the inherent bias in the sampling of chemical space in the training set. 20 Structure-based predictive modeling further propagates this issue by sampling only from the limited set of small molecules for which structural data is available. The opaqueness of the decision-making progress, also referred to as the black box behavior, is a well-known concern in machine learning. 21 Nevertheless only limited effort has been dedicated to developing better explanatory models, where the focus has been solely on performance and feature importance has been included only on the level of the overall model instead of individual

ACS Paragon Plus Environment

4

Page 5 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

predictions. Ultimately the two major concerns about machine learning and its application to structure-based binding predictions are closely linked: Exposing the decision-making process of a machine learning model could also alleviate concerns regarding the model bias. Toward addressing the concerns of bias and opaqueness in predictive models of ligand binding potency, we developed a gradient boosting (GB) model with explicit feature attribution. Contrary to the common practice of developing a general model, the target-specific model focused on inhibitors of HIV protease. HIV-1 protease is an important drug target in antiretroviral therapy and has been a success story of SBDD with numerous inhibitors and 9 FDA-approved drugs. The high rate of mutation in conjunction with the continuous development of novel inhibitors, and relative ease of protein crystallization resulted in a wealth of structural data, making the system an ideal test set for the development of structure-based, target-specific binding affinity prediction. Furthermore, a number of machine learning models trained on a non-specific set of protein–ligand complexes showed significant loss in accuracy when predicting the affinity HIV-1 protease inhibitors, a fact commonly attributed to the relatively large size of protease inhibitors.22 We first evaluated the chemical diversity of HIV protease inhibitors with available highresolution crystal structures and corresponding binding affinities. Using hierarchical clustering with subsequent substructure filtering, we show that clusters of inhibitors are based on distinct core structures. Subsequently a gradient boosting model was trained on a set of common protein–ligand interaction features to predict binding affinity. The clustering of inhibitors was used in leave-cluster-out cross-validation of the model in comparison to 3 baseline models. The resulting gradient boosting model achieved state-of-the-art accuracy on the HIV protease dataset and relatively low loss in accuracy when tasked to predict new compounds when compared to other models. Finally, Shapley values were derived to explain local feature importance for individual predictions. We show that specific vdW interactions are pivotal for predicted potency. We demonstrate the utility of explanatory modeling by discussing differences in binding of analogous inhibitors to wildtype and drug resistant variants of HIV protease.

Methods

Dataset Dataset Curation

ACS Paragon Plus Environment

5

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 33

We retrieved all unique sub-2.0 Å resolution structures of HIV protease bound to competitive active site inhibitors from the PDB for which either a Ki or Kd value was reported. We chose to include structures of both HIV-1 and HIV-2 protease that fit the same selection criteria, due to the high sequence and structure similarity of the two viral enzymes. Inactive variants with a D25N mutation were removed from the dataset since Ki values were determined against an active variant and the D25N mutation can have a significant impact on binding. 23 We also removed structures with missing residues or sidechains within 6 Å of the active site, structures that were not crystallized in the closed conformation and structures with ambiguous inhibitor binding sites. The final set included 282 structures of HIV protease in complex with competitive active site inhibitors. The binding affinity values ranged from 0.2 pM to 41 µM (respective PDB IDs: 3o9f and 2avq). The binding free energy (ΔG) was calculated from the reported inhibition and dissociation constants assuming a temperature of 300 K. Structure Preparation Structure optimization was performed using Schrodinger's Protein Preparation Wizard (Schrodinger version 2018-3). Missing residues were modeled using Prime. Small molecule protonation states were estimated using Epik. Protein protonation states were estimated using Propka. In a subsequent step protonation of the two catalytic aspartate residues, which exist primarily in the mono-protonated form, was examined and if necessary adjusted, using the pKa values predicted by Propka to determine which of the two aspartic acids to protonate. 24, 25 Finally, the structures were energy minimized using Impref with a convergence RMSE of 0.5.

Unsupervised Learning Ligand Similarity Small molecule topology was encoded in a 32-bit vector using radial fingerprints implemented in Canvas. 26, 27 Atoms were parameterized using simple functional types: H, C, (F,Cl), (Br,I), (N,0), S and others. Bonds were parameterized by hybridization. Similarity between radial fingerprints |𝐴 ∩ 𝐵|

was calculated using Tanimoto similarity coefficient: 𝑇(𝐴,𝐵) = |𝐴| + |𝐵| ― |𝐴 ∩ 𝐵| . Here A corresponds to the on bits in the first fingerprint, B to the on bits in the second fingerprint and A⋂B are the common on bits. 28 Minimum Common Substructure Tree Cut

ACS Paragon Plus Environment

6

Page 7 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Hierarchical clustering of the 2D fingerprints was performed using complete linkage on the Tanimoto dissimilarity matrix. To extract interpretable clusters from the resulting dendrogram we developed a tree cutting algorithm based on minimum common substructure. Starting from the root node, the algorithm matched the chemical substructure common across all members. A node formed a cluster if the minimum substructure that was shared among all of its members contained at least 10 heavy atoms. The exception to this rule was structures containing a phenyl ring; here the minimum common substructure had to contain at least 15 heavy atoms. To avoid small and singleton clusters we further restricted clusters to contain at least 10 members. This strategy identified the common core structure within ligand clusters.

Protein–Ligand Interaction Fingerprints To develop predictive models of small molecule binding affinity that both capture basic interactions and are interpretable, we chose structure-based protein–ligand interaction fingerprints. This approach maps a set of labels and pairwise interactions onto the protein sequence, producing an 𝑁 ∗ 𝑀 feature vector, where N is the number of residues and M is the number of features. This approach was originally proposed to profile interactions in small molecule docking, and has been applied in other computational techniques. 29-31 While structure interaction fingerprints were originally proposed as a binary encoding of categorical features, here we adapted them to continuous features. The following features were included in the interaction fingerprints: ●

Contact vdW potential between each protein residue and the bound ligand was calculated using Opls2005 force field parameters. 32 The values were constrained to be within the range of [ε, 0] where ε is the pairwise vdW potential at the energy minimum. This constraint was applied since positive, unfavorable interactions are likely to arise due to crystallization artifacts.



Protein–ligand hydrogen bonds were calculated using the following geometric criteria: Maximum distance 2.8 Å, donor minimum angle 120°, acceptor minimum angle 90°.33 If a protein residue was forming a hydrogen bond with the bound inhibitor, the distance between acceptor atom and donor hydrogen was reported. If multiple hydrogen bonds were formed only the shortest distance was reported. In addition to the hydrogen bond distance we calculated the number of times a residue was acting as either a hydrogen bond donor or a hydrogen bond acceptor. In addition to common hydrogen bond donors we also considered hydrogen atoms on aromatic ring systems.

ACS Paragon Plus Environment

7

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60



Page 8 of 33

Protein–ligand halogen bonds were calculated similarly to hydrogen bonds but with different geometric criteria: The minimum distance between an oxygen and a halogen bond donor 3.5 Å, donor minimum angle 140°, acceptor minimum angle 90°. 34 Both minimum distance and donor/acceptor count were calculated.



Salt bridges between protein and ligand ions were calculated using a simple distance criterion: opposite-charged side chains were considered to form a salt bridge if they were within 5 Å. The distance cutoff is more lenient than the commonly used 4 Å cutoff but better captures the second salt bridge peak observed at protein-protein interfaces. 35 36



Two types of π-interactions were included in the fingerprint. π-stacking interactions between aromatic ring systems were included both in a face-to-face or edge-to-face manner. The former was defined by a maximum distance cutoff o 4.4 Å and a maximum planar angle of 30°. The latter was defined by a maximum distance of 5.5 Å and a minimum angle of 60°. 35



π -cation interactions were included, defined by a maximum distance cutoff of 6.6 Å and maximum angle of 30°. 37

Thus, the feature set included frequently observed nonbonded interactions. 35 Features can however readily be expanded to include additional pairwise interactions, such as carbonyl-X interactions or categorical features. 38, 39

Supervised Learning Gradient Boosting Decision Tree Model The general objective of a machine earning model, given a set of explanatory variables X = (x1, … xN) and labels Y = (y1 … yN), is to define a function F(X) that minimizes the loss function L(y; F(X)) . Here Y represents the binding affinity and X represents the protein–ligand interaction fingerprints calculated for a combination of diverse chemical structures and target sequences. A predictive model based on these features needed to capture complex non-linear relationships, thus we chose gradient boosting decision tree approach. Conceptually, gradient boosting combines “boosting”, the observation that a series of weak predictors can be combined for improved accuracy, with a stepwise gradient descent. 40, 41 Gradient boosting decision trees have been implemented in a number of libraries. 42-44 We chose the method implemented in CatBoost due to its handling of decision tree updating and gradient estimates that lead to less bias.

ACS Paragon Plus Environment

8

Page 9 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

The general formula of gradient boosting is: 𝐹𝑀(𝑋) = 𝐹𝑀 ― 1(𝑋) + 𝛽𝑀ℎ(𝑋;𝑎𝑀) where 𝑀―1

𝐹𝑀 ― 1(𝑋) =

∑ 𝛽 ℎ(𝑋;𝑎 ) 𝑖

𝑖

𝑖=1

𝐹𝑀 is the function after M gradient descent steps, ℎ(𝑋;𝑎𝑖) is the base estimator at step i with the parameter set 𝑎𝑖 and weight 𝛽𝑖. At every step parameters of the base predictor are chosen by minimizing the expected loss (EL) 𝑎𝑟𝑔𝑚𝑖𝑛(𝑎) 𝐸𝐿(𝑦;𝐹𝑖 ― 1(𝑋) + 𝛽𝑖ℎ(𝑋;𝑎𝑖)) The loss function chosen in this case was the root-mean-squared error: 𝐿(𝑦;𝐹(𝑥)) =

∑ (𝑦𝑖 ― 𝐹(𝑥𝑖)2 𝑖

The hyperparameters of the gradient boosting model (see Table S1) were optimized using Treestructured Parzen Estimators implemented in Hyperopt. 45 Hyperparameter search was performed for 500 trials. Baseline Models The performance of the gradient boosting model was evaluated against models derived using three other methods: elastic net linear regression, support vector machine (SVM) regression and random forest. 46, 47 The Scikit Learn 20.2 implementation of all three algorithms was used for evaluation. 48 The resulting models were evaluated using the same cross validation strategy outlined below. Hyperparameters of the elastic net model were chosen using cross validation. For gradient boosting, SVM and random forest models, we used the same approach that was chosen for the gradient boosting model. Hyperparameters are provided in Table S1. Model Evaluation Model performance was evaluated using two strategies. In the first approach, we used 10-fold nested cross validation. The dataset was randomly split into test, and training datasets. 10% of the structures were assigned to the test set, the remaining structures were used for model training. Hyperparameters were optimized using the training datasets. Thereafter the model was evaluated on the test set. In the second approach, the dataset was grouped according to chemical similarity of the inhibitor. This was done using the minimum common substructure tree cutting approach described above. One group was then used in the test set while the model was

ACS Paragon Plus Environment

9

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 33

trained on the remaining groups. This approach is similar to other leave-cluster-out cross validation strategies. 15, 49 Performance was evaluated using 4 metrics: 𝑅𝑀𝑆𝐸 = Pearson Correlation: 𝜌𝑃 =

1 𝑁

𝑁

∑( 𝑦 ― 𝑓(𝑥 )) 𝑖

𝑖

2

𝑖=1

𝑐𝑜𝑣(𝑓(𝑥), 𝑦) 𝜎𝑓(𝑥)𝜎𝑦

Spearman Correlation: 𝜌𝑠 =

𝑐𝑜𝑣(𝑟𝑓(𝑥), 𝑟𝑦) 𝜎𝑟𝑓(𝑥)𝜎𝑟𝑦

2

Coefficient of Determination: 𝑅 = 1 ―

∑ (𝑓(𝑥𝑖) ― 𝑦𝑖)2 𝑖

∑ (𝑦𝑖 ― 𝑦) 𝑖

Both the gradient boosting and random forest implementation utilize random subsampling of the data thus in both cases model evaluations were repeated 10 times with different random initializations. This bootstrap aggregation procedure can reduce the generalization error by averaging results across the separate models.

Feature Importance Interpreting factors underlying a prediction is a key aspect of machine learning. Interpretable features enable evaluation and validation of the predictions made by the model with existing domain knowledge. In the field of structure based drug design, an interpretable predictor can also inform the design of novel compounds. Complex non-linear machine learning models have a reputation of being a black box that does not reveal factors driving a prediction. In this work we emphasized model interpretability by first choosing non-bonded interactions as features, which have physical meaning and are commonly used by chemists and biochemists. The impact of the presence or absence of a particular contact or hydrogen bond on the model can readily be translated into new design strategies. To evaluate the impact of a set of features on the prediction of a machine learning model f(x), we used Shapley Additive Explanation (SHAP) values implemented in the SHAP python library. 50

The method was chosen for its accuracy, consistency, and easy implementation in decision

tree based models. Here we give a brief explanation of the SHAP method, which belongs to the

ACS Paragon Plus Environment

10

Page 11 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

class of additive feature explanation methods. The principal feature of these methods is that the original model f(x) is represented by an explanation model g(x’). 𝑀

𝑔(𝑥’) = 𝜙0 + ∑𝑖 = 1𝜙𝑖𝑥𝑖′ x’ is a simplified binary feature vector where 1 represents observed and 0 represents missing features. M is the number of input features and Φ are feature attribution values. The relative impact of each feature on a given prediction is given by the value of Φ. A solution to obtain the Φ values under the assumption of local accuracy where f(x) = g(x’) is to construct all possible combinations of features x’ and fit the marginal contributions of each feature. This approach is taken from game theory, where the weights Φ are known as Shapley values. Exact calculation of the expected output of a model conditioned on the subset of observed features because calculation time grows exponentially with the number of features. However, for Decision tree models a faster solution can be established by tracking subsets across the decision tree nodes. 51

Results Diversity of the Dataset Clustering Inhibitors with Unsupervised Learning To investigate the structural diversity of small molecules bound to HIV-1 protease we clustered inhibitors according to chemical similarity. Following hierarchical clustering inhibitors with common substructure were extracted using a minimum common substructure tree cutting approach (See Methods). The effect of the clustering parameters was investigated by recursively clustering the dataset changing the threshold parameters (Fig. S1). We found that the minimum atom cutoff had little effect on clustering results. Only the inhibitors grouped in clusters 6 and 8 (Fig. 1A&B) were found to be ambiguously assigned to either one combined or two separate clusters. Changing the minimum cluster size significantly affected the total number of clusters identified (Fig. S1). Ultimately a minimum atom cutoff of 10 and a minimum cluster size of 10 were chosen. We identified 8 clusters defined by distinct core structures. Depending on the cluster, the conserved core included between 15 and 37 atoms (Table 1). Approximately 70% of the inhibitor structures in the dataset were assigned to one of 8 clusters (Fig. 1). For the remaining

ACS Paragon Plus Environment

11

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 33

84 inhibitors, no common core structure could be identified and these were evaluated as the unassigned group (Fig. 1A). Cluster 1 included 92 inhibitor structures, which comprise more than 30% of all structures in the dataset. A closer inspection of this cluster (Fig. 1C) showed that the common substructure of cluster 1 roughly corresponds to the FDA approved inhibitors amprenavir (APV) and darunavir (DRV) and their analogs. Cluster 2 formed the second largest cluster with 26 structures. The core of cluster 2 originated from the FDA approved HIV-1 protease inhibitor atazanavir (ATV) and included 8 distinct HIV-1 protease complexes bound to ATV. All remaining clusters had less than 20 structures. With the exception of cluster 4, the common substructure of all clusters contained significantly more than the minimum number of required atoms. Ligand affinity assessed by binding free energy varied among and within inhibitor clusters (Fig. 1C). The overall highest affinity belonged to cluster 8 with mean binding free energy of –14.28 kcal/mol. Overall inhibitors forming this cluster were similar to DRV/APV and analogs, however contained 2 major distinguishing modifications relative to cluster 1: All inhibitors had a methoxyphenyl group and none of them had the bis-tetrahydrofuran (bis-THF) group, which is characteristic for cluster 1 52, 53. Most clusters had an average binding free energy of ~–13 kcal/mol, with the exception of clusters 2 and 3 with higher mean binding energy (–11.7 and –11.0 kcal/mol respectively). Inhibitor structures in cluster 2 were based on the atazanavir (ATZ) core, whereas cluster 3 included nelfinavir (NFV) and saquinavir (SQV) structures. The other cluster with comparatively low reported binding affinities was cluster 7, which included inhibitors based on the cyclic urea core. Inhibitors based on these cores are on average weaker binders and thus chemical space surrounding these cores has not been explored as exhaustively as the core of cluster 1. Nevertheless, the inhibitors in the selected dataset represent diverse chemical space with widely varying affinity. Protease Amino Acid Sequences One of the key challenges in the field of HIV-1 protease inhibition is the high rate of viral mutation. This has led to the rapid emergence of drug resistance and has been a driving factor in the development of HIV therapeutics with increased potency. The high plasticity and mutation rate of HIV protease is captured in amino acid sequence analysis of the dataset. 282 structures had one or more mutations compared to the NL3-4 wild type. 121 structures had 1–3 mutations, another 139 had 4–6, and 9 variants had more than 20 mutations. The latter include HIV-2 protease and a number of highly mutated clinical isolates. Across all structures the highest variability was observed for residues with known polymorphism, such as residue 63. Engineered

ACS Paragon Plus Environment

12

Page 13 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

mutations such as the two cysteine to alanine mutations at positions 67 and 95, and the ubiquitous Q7K to avoid auto-proteolysis also resulted in significant variability at these positions. The most variable drug resistance site was residue 84 (Sup Fig. 2).

Protein–Ligand Interaction Fingerprints Molecular similarity as represented solely by small molecule fingerprints was utilized to cluster the inhibitors, however only ligand-based features are not sufficient for the prediction of affinity, as this approach neglects information about the receptor. This is apparent for DRV: the dataset contains 21 structures of DRV bound to HIV protease with measured inhibition constants ranging from 2.7 pM to 41.5 nM. To obtain representation of the enzyme–inhibitor complex appropriate for binding inference, interactions between the small molecule and target have to be included in the features explicitly. Here we utilize a protein–ligand interaction fingerprinting approach to model nonbonded interactions between inhibitor and enzyme (see Methods). This approach has the advantage of being readily interpretable and can thus inform future design efforts. Protein–ligand interaction fingerprints were generated for the crystal structures of 282 complexes. Since HIV is a homodimeric protein, this calculation was carried out on both chains yielding a 25742 feature matrix for each protein–inhibitor pair. As the majority of HIV protease inhibitors are asymmetric, a global alignment of the feature matrices was performed. The matrices were then each collapsed into a single feature vector. Finally, all zero-variance features were removed reducing the length of the vector to 91 features. VdW contacts made up about half (59/91) of the final features. Hydrogen bond distances and hydrogen bond donor/acceptor counts formed the second largest group of features (51/91). The remaining 16 features consisted largely of ionic interactions. The resulting set of interaction-based features was used for training machine learning models to predict the binding free energy of the 282 ligand–protease complexes in the dataset.

ACS Paragon Plus Environment

13

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 33

Model Performance To determine which regression model was most appropriate to predict ligand affinities in the dataset, several well-established baseline methods were evaluated. We compared the performance of elastic net model, support vector machine (SVM) and random forest regression to the gradient boosting model on the HIV protease inhibitor dataset. Models were trained using the cross-validation strategies described in detail in the Methods section above. Model hyperparameters are provided in the supporting information. (Table S1) Elastic net model showed poorest performance with an RMSE of 1.69 kcal/mol or 1.23 log10(Ki/KD) while gradient boosting achieved the lowest RMSE of 1.48 kcal/mol or 1.08 log10(Ki/KD), where Ki and KD are the binding and disassociation constants (Fig. 2A). The performance of random forest regression was similar to gradient boosting, however the gradient boosting model was better able to capture the variance in the dataset (Fig. 2C). Both random forest and elastic net model showed worse performance when predicting either extremely strong or weak binders (Fig. 2A). The SVM and gradient boosting model was less prone to overfitting at the extremities of lowest and highest values. The ability of the models to predict previously unseen features was examined using leave-cluster-out cross validation of inhibitor clusters. In this approach one of the clusters with common chemical core identified using our unsupervised clustering approach was held back as a test set while the models were trained on the remaining data. A drawback of this approach is the significant variance in the size of clusters. Two clusters included only 11 members whereas the unassigned cluster included 82 structures and cluster 1 included 92 structures. Accordingly, predictive accuracy decreased when either cluster 1 or the unassigned group was used for evaluation. These two clusters contained a significant portion of the dataset, 82 unassigned structures and 92 structures in cluster 1 (APV/DRV). The decrease in performance is partially due to the reduction in size of the training set. Conversely predictions for clusters 2 (ATZ core) and 3 (SQV/NLV core) did not deviate significantly from the baseline performance of the models. The relatively stable performance against these scaffolds suggests that their interactions with the protease are not vastly distinct from other inhibitors. All models showed significant decrease in accuracy when structures from cluster 4 were used for model evaluation. This loss in accuracy cannot be explained by a reduction in size of the training-validation set, since cluster 4 constituted only 13 structures. Structures in cluster 4 are non-cleavable peptides

ACS Paragon Plus Environment

14

Page 15 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

whose large sizes diverge significantly from the average size of HIV-1 protease inhibitors. These long linear compounds make interactions not observed in any of the other inhibitors, thus a decrease in predictive accuracy was expected. The elastic net model failed to come up with realistic predictions for both clusters 4 and 8. For the former the average predicted binding free energy was –130 kcal/mol for the later it was –400 kcal/mol; in both cases the poor prediction was traced back to outliers. Overall, in both k-fold cross validation and leave-cluster-out crossvalidation using inhibitor clusters, gradient boosting model had a very good performance and high accuracy.

Feature Importance There is a rising awareness that interpretability of the prediction process is a desirable component of machine learning models. We utilized the recently proposed TreeSHAP method to build an explanatory model for our gradient boosting model. TreeSHAP uses an explanatory model with additive feature attribution: 𝑀

𝑔(𝑧′) = 𝜙0 + ∑𝑖 𝜙𝑖𝑧𝑖 The input features are represented by a simplified binary feature vector z accounting for the presence or absence of a feature in a prediction. Each feature is assigned a weight Φi that represents the contribution of that feature to the given predicted binding energy. Feature attribution scores can be interpreted intuitively: features with negative weights correlate with improved predicted affinity (lower binding free energy) whereas features with positive weights are associated with lower predicted affinity. Overall feature importance was determined by calculating the absolute Shapley value for each feature. To ensure convergence of ranking, feature importance was evaluated across 10 gradient boosting models trained with different random seeds. The ranking showed that a small subset of features was significantly more important for the predicted binding affinities (Supp. Fig. 3). Subsequently absolute shapley values were averaged across the 10 models. The 20 most important features in total contributed 43% of the overall feature importance score. These features included 19 vdW interactions and one hydrogen bond distance. To confirm the importance of those features we compared the performance of the gradient boosting model trained on the full feature set with the performance of a second model trained only on the 20 most informative features, and a third model trained on 20 randomly selected features. Predictions were made using 10-fold cross validation, and over 10 replicates with different

ACS Paragon Plus Environment

15

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 33

random seeds where a new set of random features was selected for each replicate. Using the full feature set, the RMSE of the gradient boosting machine model was 1.48 ± 0.02 kcal/mol. When the model was trained on 20 randomly selected features accuracy decreased by 14% to 1.69 ± 0.05 kcal/mol. When the 20 most informative features were used only a minor decrease in accuracy was observed with an overall RMSE of 1.51 ± 0.02 kcal/mol. This validated the use of TreeSHAP strategy and indicated that approximately the same accuracy can be obtained with only the most informative features. We subsequently evaluated the effect of the 20 most informative features on individual predictions (Fig. 3). The median feature effect revealed that most features imposed either minor penalty or gain on most predictions. However, for a subset of the predictions these features showed either a strong positive or negative effect as indicated by the long, one-sided tail of the violin plot. A few features, such as vdW contacts with residues 28B/80A and hydrogen bonds with residue 30 deviated from this trend and showed a near bimodal distribution in their effect on predictions. Feature Attribution of Predictions and Key Residues The most important vdW contacts are made between the bound inhibitor and residue 84 in both chains. This residue is in the active site, making extensive contacts with most HIV-1 protease inhibitors (Fig. 4). VdW contacts of –3.5 and –4 kcal/mol with residue 84 were found to be optimal for the predicted ligand affinity. More extensive contacts with residue 84 in either monomer decreased the positive effect while weaker contacts were heavily penalized by the model (Fig. 4, side panel). VdW contacts with residue 76A also had a significant effect on the predicted outcome. Unlike residue 84, this residue is located distal from the active site and makes only limited contacts with the bound inhibitor (Fig. 4). We examined structures with significant ligand interactions with residue 76A (10th percentile of vdW contacts). The mean binding affinity for those structures was only slightly higher than the mean binding affinity of the full dataset (–13.28 and –12.44 kcal/mol, respectively). However, the inhibitors that made strong interactions with residue 76A included a number of extremely potent inhibitors with reported binding affinity bellow 10 pM (–15.09 kcal/mol). Those inhibitors were based on the darunavir scaffold and contained modifications that allowed them to make additional contacts with this distal residue. The opposite trend was observed for vdW contacts made with residue 83B. While this residue is close to the active site its side chain is oriented away from the active site, limiting the contact potential with a bound inhibitor. In the gradient boosting model extensive contacts with residue 83B were associated with weaker predicted binding affinity (Figure 4). We found

ACS Paragon Plus Environment

16

Page 17 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

that inhibitors that make stronger contacts with residue 83B (10th percentile) are in fact slightly weaker binders, with a mean binding free energy of –11.12 kcal/mol. The FDA approved inhibitor saquinavir makes relatively extensive contacts with residue 83B and we found 9 of the 14 saquinavir-bound structures in the 10th percentile. This is significantly more than the 1.44 SQV structures that would be expected based on a random distribution. To evaluate the effect of features on predictions we compared three high quality crystal structures (Table 2): DRV bound to wildtype HIV-1 protease, DRV bound to a highly resistant variant of HIV-1 protease, and a DRV analog (GRL5010A) that showed improved binding to the drug resistant variant 54-56 GRL5010A distinguished itself from the parent compound DRV through the addition of a difluoride moiety to the bis-THF. This minor modification improved the binding affinity against a drug resistant variant by -1.9 kcal/mol, which equates to an approximately 20-fold difference in Ki. Predicted binding affinities were comparable to experimental measurements for these three complexes. (Table 2) The prediction closely recapitulated the measured binding affinity for DRV while predictions for GRL5010A diverged by 0.87 kcal/mol. The only difference between DRV and GRL5010A are two fluorine atoms that make additional interactions with the protease. The effect of features on the predicted binding affinity deviated significantly between the wild-type and the drug resistant variants, however was similar for the two inhibitors bound to the same drug resistant variant (Fig. 5A). We expected similar feature effects for the inhibitors as they share high chemical similarity. Thus, feature effects that diverged significantly between wild-type and drug resistant protease were analyzed in detail. Changes in vdW interactions with residue R8A had a significant effect on the predicted binding affinity of DRV and GRL5010A to the drug resistant protease variant. In both structures, the arginine side chain was found primarily in a “flipped out” conformation, facing away from the inhibitor, whereas in wildtype R8 was in a “tucked in” orientation facing toward DRV (Figure 5B). While vdW interactions between R8 and DRV in wild-type had a minor positive effect on predicted binding free energy the weaker interactions in the drug resistant variant were heavily penalized in the model (Figure 5C). An alternative conformation of R8A was observed in both drug resistant variants that resembled the “tucked in” conformation of the wildtype, however due to the lower occupancy it was discarded in the modeling process55. Interactions with R8 are specific to one monomer, since DRV and GRL5010A make no significant interactions with the R8B sidechain. In the GRL5010A bound structure R8 was closer to the inhibitor, consequently its interactions with R8 imposed less of a penalty on the predicted binding free energy.

ACS Paragon Plus Environment

17

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 18 of 33

In wildtype protease, the strongest positive effect on predicted binding free energy came from vdW interactions between the inhibitor and residue 28B. This residue is located close to the catalytic dyad and makes significant interactions with the bis-THF group of DRV and its analog (Figure 5B). The feature effects show an abrupt sign inversion at approximately 5 kcal/mol: stronger interactions improved the predicted binding free energy, whereas weaker contacts had no significant effect on the outcome of the prediction. Due to its location and high selectivity, this feature likely rewards optimal positioning of bis-THF like groups. In the drug resistant variants binding of the bis-THF group typically deviates slightly from the wildtype, but these minor structural rearrangements are enough to turn off the beneficial effect of this interaction. VdW interactions with residue 32A reduced the predicted binding affinity in all structures analyzed. The residue is located in the active site with the side chain facing the inhibitor (Figure 5B). The strength of the interaction is anti-correlated with the predicted binding affinity, thus stronger (more negative) interactions were associated with weaker binding and an increase in predicted binding free energy. Mutations at residue 32 are known to confer resistance to FDA-approved inhibitors. The most common drug resistance mutation at residue 32 is a change from valine to isoleucine. V32I mutation is also present in the drug resistant variant investigated here. The V32I mutation does not directly reduce binding interactions; in fact, both DRV and GR5010A make stronger contacts with the larger isoleucine residue, however the predictive model indicates stronger interactions result in a larger penalty on the binding free energy. This is an example where the model learns sequence features from molecular interactions. Changes in vdW interactions between residue 84 and the bound inhibitor had a large impact on the predicted outcome for all 3 cases examined. In the wild type background DRV makes what the model perceived as ideal contacts. In the drug resistant variant interactions decrease significantly imposing a large penalty on the predicted binding affinity (Fig. 5B&C). Loss of vdW interactions is partially due to a mutation of residue 84 itself, changing from isoleucine to the smaller valine. Although the inhibitor atoms that are in contact with residue 84 are invariant across DRV and GRL5010A, the latter is able to maintain slightly stronger interactions with residue 84A, resulting in a weaker penalty to the predicted binding affinity.

Conclusion and Discussion In this work, we developed a target specific machine learning model for predictive analysis of small molecule inhibitor binding affinity using 3D structural information. Model interpretability

ACS Paragon Plus Environment

18

Page 19 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

was prioritized by training the model on intuitive structure interaction fingerprints and developing a framework that enables evaluating feature importance for individual model predictions. Explicit feature attribution was demonstrated to be useful in interpreting individual model predictions and discerning key interactions underlying binding affinity. Discerning these interactions can have a broad impact on computer aided drug design. We chose a target specific approach focusing on a set of closely related proteins, which has been previously shown to achieve higher accuracy compared to a general model based on multiple distinct protein families. 57 An apparent limitation of the target specific approach is that the resulting model is restricted to closely related proteins and a new model has to be trained for dissimilar proteins. However, for most known drug targets, multiple crystal structures or structures of close homologs are often available, rendering target specific approaches viable for structure based drug design 58. Due to the available rich data, HIV protease was chosen as a model system; and this system is known to be a challenging target for structure-based free energy predictions 13, 59. As the input dataset used to train any machine learning model ultimately determines the predictions, the structure of training data must be evaluated. Unsupervised analysis of chemical diversity among HIV protease inhibitors showed that 32% of the inhibitors co-crystallized with HIV protease were derived from the FDA approved inhibitors DRV and APV. DRV is the most recent FDA approved HIV-protease inhibitor and exhibits high potency against both wild type and drug resistant variants. This explains the prevalence of inhibitors derived from the core chemical structure it shares with APV. In addition to the DRV/APV scaffold, we identified 7 other commonly used core structures that were represented in the dataset with at least 10 or more cocrystal structures. Four commonly used machine learning algorithms trained on structure interaction fingerprints were benchmarked. Of the four, the gradient boosting decision trees showed the best overall correlation with experimental binding affinity, however accuracy of all non-linear models (gradient boosting, random forest and support vector machine) was comparable. Despite the overrepresentation of DRV/APV in the dataset all non-linear models still had good performance when this inhibitor cluster was removed from the training set. In comparison with non-linear models, the elastic net model, which is a variant of penalized linear regression, showed worse correlation with experimental binding affinities. This highlights the advantage of non-linear machine learning models for structure based binding free energy prediction. Binding free energy is an amalgam of enthalpic, entropic, and solvent terms. Structure based descriptors can generally only account for enthalpic contributions, while the other terms may be

ACS Paragon Plus Environment

19

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 33

implicitly included. Nonlinear machine learning models can likely learn the interdependency between these terms, which is not necessarily linear and can be better captured with more complex models. To understand which features contribute the most to the predicted binding affinity, an explanatory modeling approach for structure based free energy predictions based on Shapely values was introduced. The underlying methodology was recently applied to healthcare to explain the risk factors in real time during general anaesthesia. 60. In contrast to global models of feature importance which completely miss selective effects, Shapley values enable attributing feature effects for each prediction separately. We demonstrate that this approach is preferable in interpreting predictive models of ligand affinity. Our model can more accurately describe the key molecular interactions that underlie specific ligand potency. While shapely values present a powerful tool to interpret the effect of a feature on the predicted binding free energy, they do not necessitate a causal relationship between interaction and inhibitor binding. Non-parametric machine learning models are agnostic of the causal physical relationship that drive binding and can only recapitulate the signal observed in the dataset. This can be used to prioritize physical interactions that are advantageous, but the model can also be used to determine if an interaction is outside of the applicability domain. Hydrogen bonds did not have a significant effect on the predicted outcome in our model in contrast to previous reports 55, 56. The vdW interactions outweigh other intermolecular interactions between the ligand and target in terms of feature importance. A similar observation was recently made by Yasuo et al., who developed a general classifier to distinguish active from inactive compounds using protein–ligand interaction energies 61. They found that the most important features were enriched in Coulomb and vdW terms. Electrostatic interactions are likely more sensitive to the noise in pairwise distances required to identify hydrogen bonds, and often conserved between a series of similar inhibitors. Hydrogen bonds are compensated by water-mediated hydrogen bonds. Our model did not account for water-mediated interactions, while solvation is critical in binding, neither our model nor any other structure based machine learning model incorporates the effects of protein–ligand solvation. Although solvent terms could potentially be learned implicitly from the structure of protein–ligand complex, this is a key challenge for future structure based predictive machine learning. VdW contact with hydrophobic residues in the active site 84A/B had a large effect on the predicted binding affinity. These residues similarly vdW contacts with putative hydrogen partners, such as 25A/B, 30A/B and 29B had a strong influence on the predicted binding free energy although only hydrogen bonds with 30B were among the most important features (Fig. 3). These findings align closely comparative

ACS Paragon Plus Environment

20

Page 21 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

molecular field analysis performed on a set HIV-1 protease inhibitors based on the cyclic urea core.62 Although the study only considered ligand structure, the analysis found that the field of favorable steric interaction overlay with residues 32, 82, 48, 50 and 84. Which, with the exception of residue 32 also ranked among the most important features identified in this study. We demonstrated the insights that can be gained from model predictions for a test case of DRV and an analog against a drug resistant variant of HIV-1 protease. Feature attribution changed considerably when wild-type protease was compared with the drug resistant variant. In contrast, feature importance was similar between DRV and its analog GRL5010A due to the high similarity between the two inhibitors. Our model found that GRL5010A was able to maintain key interactions over DRV with the drug resistant variant, in agreement with its better potency. The model did not assign considerable weight to the improved interactions between the GRL5010A fluoro substitutions and the protease, possibly because these interactions are underrepresented in our dataset. The model also distinguished resistant from wild-type protease via changes in vdW interactions that were a direct result of drug resistance mutations. While the predicted binding affinity alone would have been able to identify the improved binder, descriptive modeling can improve the confidence in the prediction by identifying the underlying features and give critical clues for molecular design efforts. In conclusion, the computational framework developed in this work enables predictive and explanatory modeling of ligand binding affinities. Knowledge of the key physical interactions underlying a prediction can be leveraged to both improve lead compounds and identify potential weaknesses in the machine-learning model. This will ultimately result in improved confidence in in-silico predictions to aid structure-based drug design.

ACS Paragon Plus Environment

21

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 33

Figures

Figure 1: Common core structures of HIV protease inhibitors. A) Hierarchical clustering of HIV protease inhibitors by chemical similarity. Clusters with conserved chemical structure are highlighted on the dendrogram by color. B) Violin plot of the inhibitor binding affinities for clusters. Horizontal lines indicate maximum, mean and minimum binding affinity. Violin width indicates inhibitor density estimated by a gaussian kernel density estimator. C) Exemplary structures for each cluster of inhibitors. The minimum common substructure is highlighted.

ACS Paragon Plus Environment

22

Page 23 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 2: Performance of machine learning models on the HIV protease dataset. A) Correlation between predicted and measured binding free energy based on 10-fold cross validation. Solid line indicates perfect match, dashed lines indicate an RMSE of 1 kcal/mol. B) Root mean squared error when a distinct group of inhibitors were used for testing only. C) Table showing the performance of the tested machine learning functions on the HIV protease dataset. Leave Cluster Out RMSE was not determined for elastic net model due to large outliers when clusters 4/8 were used for testing.

ACS Paragon Plus Environment

23

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 33

Figure 3: Effect of the 20 most important features on binding affinity predictions. Violin plots of the distribution of feature effects (shapley values) for the 20 most informative features.

ACS Paragon Plus Environment

24

Page 25 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 4: Key residues with effect of vdW contacts on predictions. In the center panel: Structure of HIV-1 protease where catalytic aspartic acids in the center of the active site are shown in yellow. Key residues 76A, 83B and 84A/B are highlighted in purple. Side panels show the effect of changes in vdW contacts of those residues on the predicted binding free energy. Density distribution, computed by gaussian kernel density estimate, is displayed next to scatter plot.

ACS Paragon Plus Environment

25

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 33

Figure 5: Differential effect of vdW interactions on binding to wildtype and drug resistant protease. Three structures were investigated in detail: DRV bound to wild type HIV-1 protease, (teal), DRV bound to a drug resistant variant (purple) and GRL-5010A bound to the same drug resistant variant (orange). A) Effect (shapley values) of structure interaction fingerprints on predicted binding free energy. Feature effects were ordered according to the effect on the predicted binding free energy of DRV bound to wild type HIV-1 protease. B) Features with significant change in effect mapped to structure (PDB IDs: See Table 2) C) Feature effect for the three structures (colored data points) relative to all predictions in the dataset.

ACS Paragon Plus Environment

26

Page 27 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Tables Table 1. HIV-1 protease inhibitor clusters. Cluster

Number of Structures

Atoms in Common Substructure

Mean Affinity (kcal/mol)

Inhibitor Similarity

1

92

31

-13.72 +/- 1.97

APV / DRV

2

26

23

-11.74 +/- 1.16

ATV

3

19

25

-10.97 +/- 1.12

NFV/SQV

4

13

15

-12.98 +/- 2.24

RTV

5

11

23

-13.36 +/- 1.56

Allophenylnorstati ne 63

6

15

23

-12.86 +/- 2.13

-

7

11

37

-12.17 +/- 1.69

Cyclic Urea 64

8

12

30

-14.28 +/- 1.36

-

Unassigned

83

NA

-11.03 +/- 2.40

-

Table 2. HIV-1 protease–inhibitor structures compared. Name

PDB ID

ΔG (experimental)

ΔG (predicted)

DRVWT

1T3R

-14.89 kcal/mol

-14.88 kcal/mol

DRVMUT

3UCB

-10.13 kcal/mol

-10.03 kcal/mol

GRL-5010AMUT

4YHQ

-12.03 kcal/mol

-11.16 kcal/mol

ACS Paragon Plus Environment

27

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 33

Acknowledgments This research was supported by NIH P01 GM109767

Supporting Information Available: Figure S1: Survey of the impact of clustering parameters on the results of minimum common substructure clustering. Figure S2: Hierarchical clustering of the HIV-1 Protease dataset by sequence identity and sequence variation measured by Shannon entropy. Figure S3: Rank ordered mean absolute shapley values averaged across 10 gradient boosting models. Red area indicates average feature importance ± standard deviation. Table S1: Hyperparameters used for machine learning models.

ACS Paragon Plus Environment

28

Page 29 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

References 1. Kuhn, B.; Guba, W.; Hert, J. r. m.; Banner, D.; Bissantz, C.; Ceccarelli, S.; Haap, W.; Körner, M.; Kuglstatter, A.; Lerner, C., A Real-World Perspective on Molecular Design: Miniperspective. J. Med. Chem. 2016, 59, 4087-4102. 2. de Ruyck, J.; Brysbaert, G.; Blossey, R.; Lensink, M. F., Molecular docking as a popular tool in drug design, an in silico travel. Adv. Appl. Bioinform. Chem. 2016, 9, 1. 3. Moroy, G.; Martiny, V. Y.; Vayer, P.; Villoutreix, B. O.; Miteva, M. A., Toward in silico structure-based ADMET prediction in drug discovery. Drug Discovery Today 2012, 17, 44-55. 4. Blundell, T. L.; Jhoti, H.; Abell, C., High-throughput crystallography for lead discovery in drug design. Nat. Rev. Drug Discovery 2002, 1, 45. 5. Blundell, T. L.; Sibanda, B. L.; Montalvão, R. W.; Brewerton, S.; Chelliah, V.; Worth, C. L.; Harmer, N. J.; Davies, O.; Burke, D., Structural biology and bioinformatics in drug design: opportunities and challenges for target identification and lead discovery. Philos. Trans. R. Soc., B 2006, 361, 413-423. 6. Pellecchia, M.; Bertini, I.; Cowburn, D.; Dalvit, C.; Giralt, E.; Jahnke, W.; James, T. L.; Homans, S. W.; Kessler, H.; Luchinat, C., Perspectives on NMR in drug discovery: a technique comes of age. Nat. Rev. Drug Discovery 2008, 7, 738. 7. Cournia, Z.; Allen, B.; Sherman, W., Relative binding free energy calculations in drug discovery: recent advances and practical considerations. J. Chem. Inf. Model. 2017, 57, 29112937. 8. Zou, X.; Sun, Y.; Kuntz, I. D., Inclusion of solvation in ligand binding free energy calculations using the generalized-born model. J. Am. Chem. Soc. 1999, 121, 8033-8043. 9. Muegge, I.; Martin, Y. C., A general and fast scoring function for protein− ligand interactions: a simplified potential approach. J. Med. Chem. 1999, 42, 791-804. 10. Wang, R.; Lai, L.; Wang, S., Further development and validation of empirical scoring functions for structure-based binding affinity prediction. J. Comput. Aided Mol. Des. 2002, 16, 11-26. 11. Eldridge, M. D.; Murray, C. W.; Auton, T. R.; Paolini, G. V.; Mee, R. P., Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes. J. Comput. Aided Mol. Des. 1997, 11, 425-445. 12. Ballester, P. J.; Mitchell, J. B., A machine learning approach to predicting protein–ligand binding affinity with applications to molecular docking. Bioinformatics 2010, 26, 1169-1175. 13. Sotriffer, C. A.; Sanschagrin, P.; Matter, H.; Klebe, G., SFCscore: scoring functions for affinity prediction of protein–ligand complexes. Proteins: Struct., Funct., Bioinf. 2008, 73, 395419. 14. Jiménez, J.; Skalic, M.; Martinez-Rosell, G.; De Fabritiis, G., K DEEP: Protein–Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks. J. Chem. Inf. Model. 2018, 58, 287-296. 15. Feinberg, E. N.; Sur, D.; Wu, Z.; Husic, B. E.; Mai, H.; Li, Y.; Sun, S.; Yang, J.; Ramsundar, B.; Pande, V. S., PotentialNet for Molecular Property Prediction. ACS Cent. Sci. 2018, 4, 1520-1530. 16. Nguyen, D. D.; Cang, Z.; Wu, K.; Wang, M.; Cao, Y.; Wei, G.-W., Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges. J. Comput. Aided Mol. Des. 2019, 33, 71-82. 17. Free, S. M.; Wilson, J. W., A mathematical contribution to structure-activity studies. J. Med. Chem. 1964, 7, 395-399. 18. Cheng, T.; Li, X.; Li, Y.; Liu, Z.; Wang, R., Comparative assessment of scoring functions on a diverse test set. J. Chem. Inf. Model. 2009, 49, 1079-1093.

ACS Paragon Plus Environment

29

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 30 of 33

19. Su, M.; Yang, Q.; Du, Y.; Feng, G.; Liu, Z.; Li, Y.; Wang, R., Comparative Assessment of Scoring Functions: The CASF-2016 Update. J. Chem. Inf. Model. 2018, 59, 895-913. 20. Ruddigkeit, L.; Van Deursen, R.; Blum, L. C.; Reymond, J.-L., Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864-2875. 21. Lapuschkin, S.; Wäldchen, S.; Binder, A.; Montavon, G.; Samek, W.; Müller, K.-R., Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun. 2019, 10, 1096. 22. Meagher, K. L.; Carlson, H. A., Incorporating protein flexibility in structure-based drug discovery: using HIV-1 protease as a test case. J. Am. Chem. Soc. 2004, 126, 13276-13281. 23. Sayer, J. M.; Liu, F.; Ishima, R.; Weber, I. T.; Louis, J. M., Effect of the active site D25N mutation on the structure, stability, and ligand binding of the mature HIV-1 protease. J. Biol. Chem. 2008, 283, 13459-13470. 24. Piana, S.; Sebastiani, D.; Carloni, P.; Parrinello, M., Ab initio molecular dynamics-based assignment of the protonation state of pepstatin A/HIV-1 protease cleavage site. J. Am. Chem. Soc. 2001, 123, 8730-8737. 25. McGee Jr, T. D.; Edwards, J.; Roitberg, A. E., pH-REMD simulations indicate that the catalytic aspartates of HIV-1 protease exist primarily in a monoprotonated state. J. Phys. Chem. B 2014, 118, 12577-12585. 26. Rogers, D.; Hahn, M., Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742-754. 27. Duan, J.; Dixon, S. L.; Lowrie, J. F.; Sherman, W., Analysis and comparison of 2D fingerprints: insights into database screening performance using eight fingerprint methods. J. Mol. Graphics Modell. 2010, 29, 157-170. 28. Willett, P.; Barnard, J. M.; Downs, G. M., Chemical similarity searching. J. Chem. Inf. Model. 1998, 38, 983-996. 29. Deng, Z.; Chuaqui, C.; Singh, J., Structural interaction fingerprint (SIFt): a novel method for analyzing three-dimensional protein− ligand binding interactions. J. Med. Chem. 2004, 47, 337-344. 30. Da, C.; Kireev, D., Structural protein–ligand interaction fingerprints (SPLIF) for structurebased virtual screening: method and benchmark study. J. Chem. Inf. Model. 2014, 54, 25552561. 31. Kooistra, A. J.; Kanev, G. K.; van Linden, O. P.; Leurs, R.; de Esch, I. J.; de Graaf, C., KLIFS: a structural kinase-ligand interaction database. Nucleic Acids Res. 2015, 44, D365D371. 32. Banks, J. L.; Beard, H. S.; Cao, Y.; Cho, A. E.; Damm, W.; Farid, R.; Felts, A. K.; Halgren, T. A.; Mainz, D. T.; Maple, J. R., Integrated modeling program, applied chemical theory (IMPACT). J. Comput. Chem. 2005, 26, 1752-1780. 33. Steiner, T., The hydrogen bond in the solid state. Angew. Chem. Int. Ed. 2002, 41, 4876. 34. Auffinger, P.; Hays, F. A.; Westhof, E.; Ho, P. S., Halogen bonds in biological molecules. Proceedings of the National Academy of Sciences 2004, 101, 16789-16794. 35. de Freitas, R. F.; Schapira, M., A systematic analysis of atomic protein–ligand interactions in the PDB. Medchemcomm 2017, 8, 1970-1981. 36. Xu, D.; Tsai, C.-J.; Nussinov, R., Hydrogen bonds and salt bridges across proteinprotein interfaces. Protein Engineering 1997, 10, 999-1012. 37. Durrant, J. D.; McCammon, J. A., BINANA: a novel algorithm for ligand-binding characterization. J. Mol. Graphics Modell. 2011, 29, 888-893. 38. Choudhary, A.; Gandla, D.; Krow, G. R.; Raines, R. T., Nature of amide carbonyl− carbonyl interactions in proteins. J. Am. Chem. Soc. 2009, 131, 7244-7246.

ACS Paragon Plus Environment

30

Page 31 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

39. Perras, F. A.; Marion, D.; Boisbouvier, J.; Bryce, D. L.; Plevin, M. J., Observation of CH⋅⋅⋅ π Interactions between Methyl and Carbonyl Groups in Proteins. Angew. Chem. 2017, 129, 7672-7675. 40. Freund, Y.; Schapire, R.; Abe, N., A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence 1999, 14, 1612. 41. Friedman, J. H., Greedy function approximation: a gradient boosting machine. Annals of statistics 2001, 1189-1232. 42. Chen, T.; Guestrin, C. In Xgboost: A scalable tree boosting system, Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, ACM: 2016; pp 785-794. 43. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. In Lightgbm: A highly efficient gradient boosting decision tree, Advances in Neural Information Processing Systems, 2017; pp 3146-3154. 44. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A. V.; Gulin, A. In CatBoost: unbiased boosting with categorical features, Advances in Neural Information Processing Systems, 2018; pp 6638-6648. 45. Bergstra, J.; Yamins, D.; Cox, D. D., Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. 2013. 46. Zou, H.; Hastie, T., Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. Ser. B. (Stat. Method.) 2005, 67, 301-320. 47. Cortes, C.; Vapnik, V., Support-vector networks. Machine learning 1995, 20, 273-297. 48. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V., Scikit-learn: Machine learning in Python. Journal of machine learning research 2011, 12, 2825-2830. 49. Kramer, C.; Gedeck, P., Leave-cluster-out cross-validation is appropriate for scoring functions derived from diverse protein data sets. J. Chem. Inf. Model. 2010, 50, 1961-1969. 50. Lundberg, S. M.; Lee, S.-I. In A unified approach to interpreting model predictions, Advances in Neural Information Processing Systems, 2017; pp 4765-4774. 51. Lundberg, S. M.; Erion, G. G.; Lee, S.-I., Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888 2018. 52. Ghosh, A. K.; Takayama, J.; Kassekert, L. A.; Ella-Menye, J.-R.; Yashchuk, S.; Agniswamy, J.; Wang, Y.-F.; Aoki, M.; Amano, M.; Weber, I. T., Structure-based design, synthesis, X-ray studies, and biological evaluation of novel HIV-1 protease inhibitors containing isophthalamide-derived P2-ligands. Bioorg. Med. Chem. Lett. 2015, 25, 4903-4909. 53. Nalam, M. N.; Ali, A.; Altman, M. D.; Reddy, G. K. K.; Chellappan, S.; Kairys, V.; Özen, A.; Cao, H.; Gilson, M. K.; Tidor, B., Evaluating the substrate-envelope hypothesis: structural analysis of novel HIV-1 protease inhibitors designed to be robust against drug resistance. J. Virol. 2010, 84, 5368-5378. 54. Surleraux, D. L.; Tahri, A.; Verschueren, W. G.; Pille, G. M.; De Kock, H. A.; Jonckers, T. H.; Peeters, A.; De Meyer, S.; Azijn, H.; Pauwels, R., Discovery and selection of TMC114, a next generation HIV-1 protease inhibitor. J. Med. Chem. 2005, 48, 1813-1822. 55. Agniswamy, J.; Shen, C.-H.; Aniana, A.; Sayer, J. M.; Louis, J. M.; Weber, I. T., HIV-1 protease with 20 mutations exhibits extreme resistance to clinical inhibitors through coordinated structural rearrangements. Biochemistry 2012, 51, 2819-2828. 56. Agniswamy, J.; Louis, J. M.; Shen, C.-H.; Yashchuk, S.; Ghosh, A. K.; Weber, I. T., Substituted bis-THF protease inhibitors with improved potency against highly resistant mature HIV-1 protease PR20. J. Med. Chem. 2015, 58, 5088-5095. 57. Ross, G. A.; Morris, G. M.; Biggin, P. C., One size does not fit all: the limits of structurebased models in drug discovery. J. Chem. Theory Comput. 2013, 9, 4266-4274. 58. Overington, J. P.; Al-Lazikani, B.; Hopkins, A. L., How many drug targets are there? Nat. Rev. Drug Discovery 2006, 5, 993.

ACS Paragon Plus Environment

31

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 32 of 33

59. Wang, Y.; Guo, Y.; Kuang, Q.; Pu, X.; Ji, Y.; Zhang, Z.; Li, M., A comparative study of family-specific protein–ligand complex affinity prediction based on random forest approach. J. Comput. Aided Mol. Des. 2015, 29, 349-360. 60. Lundberg, S. M.; Nair, B.; Vavilala, M. S.; Horibe, M.; Eisses, M. J.; Adams, T.; Liston, D. E.; Low, D. K.-W.; Newman, S.-F.; Kim, J., Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nat. Biomed. Eng. 2018, 2, 749. 61. Yasuo, N.; Sekijima, M., An Improved Method of Structure-based Virtual Screening via Interaction-energy-based Learning. J. Chem. Inf. Model. 2019. 62. Khedkar, V. M.; Ambre, P. K.; Verma, J.; Shaikh, M. S.; Pissurlenkar, R. R.; Coutinho, E. C., Molecular docking and 3D-QSAR studies of HIV-1 protease inhibitors. J. Mol. Model. 2010, 16, 1251-1268. 63. Baldwin, E. T.; Bhat, T. N.; Gulnik, S.; Liu, B.; Topol, I. A.; Kiso, Y.; Mimoto, T.; Mitsuya, H.; Erickson, J. W., Structure of HIV-1 protease with KNI-272, a tight-binding transition-state analog containing allophenylnorstatine. Structure 1995, 3, 581-590. 64. Lam, P.; Jadhav, P. K.; Eyermann, C. J.; Hodge, C. N.; Ru, Y.; Bacheler, L. T.; Meek, J. L.; Otto, M. J.; Rayner, M. M.; Wong, Y. N., Rational design of potent, bioavailable, nonpeptide cyclic ureas as HIV protease inhibitors. Science 1994, 263, 380-384.

ACS Paragon Plus Environment

32

Page 33 of 33 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

For Table of Contents use only

ACS Paragon Plus Environment

33