
Task-Specific Scoring Functions for Predicting Ligand Binding Poses and Affinity and for Screening Enrichment

Hossam M. Ashtawy* and Nihar R. Mahapatra*
Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824-1226, USA
E-mail: [email protected]; [email protected]
*To whom correspondence should be addressed.

Abstract

Molecular docking, scoring, and virtual screening play an increasingly important role in computer-aided drug discovery. Scoring functions (SFs) are typically employed to predict the binding conformation (docking task), binding affinity (scoring task), and binary activity level (screening task) of ligands against a critical protein target in a disease's pathway. In most molecular docking software packages available today, a generic binding-affinity-based (BA-based) SF is invoked for all three tasks to solve three different, but related, prediction problems. The limited predictive accuracies of such SFs in these three tasks have been a major roadblock toward cost-effective drug discovery. Therefore, in this work, we develop BT-Score, an ensemble machine-learning (ML) SF of boosted decision trees and thousands of predictive descriptors to estimate BA. BT-Score reproduced BA of out-of-sample test complexes with a correlation of 0.825. Even with this high accuracy in the scoring task, we demonstrate that the docking and


screening performance of BT-Score and other BA-based SFs is far from ideal. This has motivated us to build two task-specific ML SFs for the docking and screening problems. We propose BT-Dock, a boosted-tree ensemble model trained on a large number of native and computer-generated ligand conformations and optimized to predict binding poses explicitly. The model has shown an average improvement of 25% over its BA-based counterparts in different ligand pose prediction scenarios. Similar improvement has also been obtained by our screening-based SF, BT-Screen, which directly models the ligand activity labels as a classification problem. BT-Screen is trained on thousands of active and inactive protein-ligand complexes to optimize it for finding real actives from databases of ligands not seen in its training set. In addition to the three task-specific SFs, we propose a novel multi-task deep neural network (MT-Net) that is trained on data from the three tasks to simultaneously predict binding poses, affinities, and activity levels. We show that the performance of MT-Net is superior to conventional SFs and on a par with or better than models based on single-task neural networks.

Introduction

Developing a new drug is a costly and time-consuming endeavor. Its early stages involve screening millions of drug-like molecules, known as ligands, against a target protein to alter its function. In vitro screening on such a scale is prohibitively expensive and laborious even with the leverage of automation via high-throughput screening (HTS)—refer to Table 1 for a complete list of abbreviations. To mitigate the costs, HTS is often preceded by virtual (also known as in silico) screening to filter the large compound libraries down to a manageable size.1-5 It is critical that the early computational and physical screening campaigns identify promising compounds with a high probability of success, since failing late in the drug-development pipeline can be very costly for a pharmaceutical company.6 Such requirements are well recognized, and this is evident from the number of computational modeling approaches that have been developed in the past two decades to improve the accuracy of virtual screening. These


Table 1: List of abbreviations and acronyms.

    Acronym   Description
    -------   -----------
    BA        Binding affinity
    Cr        PDBbind core test set
    CV        Cross validation
    E.F.x%    Screening enrichment factor by considering the top x% ranked ligands
    HTS       High-throughput screening
    Kd        Dissociation constant
    Ki        Inhibition constant
    LLCO      Leave ligand clusters out
    LTCO      Leave target clusters out
    ML        Machine learning
    NRMSD     Normalized root-mean-square deviation
    PDB       Protein Data Bank
    PL        Protein-ligand
    PLC       Protein-ligand complex
    Pr        PDBbind primary training set
    QSAR      Quantitative structure-activity relationship
    Rp        Pearson correlation coefficient
    Rs        Spearman correlation coefficient
    RMSD      Root-mean-square deviation
    SBDD      Structure-based drug design
    SD        Standard deviation
    SF        Scoring function
    SN%       Docking success rate by considering the top N scored ligand poses
    VS        Virtual screening

approaches can be categorized into two main groups. The first is ligand-based, in which known active compounds for the drug target are used to guide the search for similar molecules in the compound library. In these approaches, quantitative structure-activity relationship (QSAR) models are widely applied to rank ligands based on their similarity to the training data in terms of their physicochemical characteristics (descriptors). Structure-based drug design (SBDD) is the other category of molecular modeling approaches, in which information about the 3D structure of the target and its bound ligand is utilized to predict other potential actives. The SBDD approach is more realistic and has a higher potential for improved accuracy since it uses information about both molecules and, unlike ligand-based approaches, can still predict the activity of new ligands against the target protein even if there are no known active ligands against it. The work presented here focuses on SBDD.



Figure 1: The molecular docking workflow of ranking poses, determining binary activity labels of ligands, and scoring them using predicted binding affinities.

Molecular docking is a popular computational modeling tool for SBDD to simulate how a ligand interacts with its target. In docking, large ensembles of ligand poses are sampled and placed in the active site of the target. Since the native conformation of the new ligand is unknown, the candidate poses are scored based on their physicochemical and geometric complementarity to the target's binding site. Typically, the pose with the highest score is assumed to be the conformation the ligand would take when it binds to the target. After predicting its conformation, the ligand's bio-activity (or lack thereof) against the target protein is estimated. If it is predicted to be active, a binding affinity score is predicted to quantify how strongly the protein and ligand molecules would bind. The predictive model that scores poses, classifies the ligand into active or inactive categories, and estimates its binding affinity is known as a scoring function (SF). Traditionally, the ligand docking, screening, and scoring tasks, depicted in Figure 1, are all performed using the same SF built into the docking tool.7-10 Such functions predict binding affinity (BA) that is in turn used to rank-order hundreds of generated poses for thousands to millions of different ligands for a certain target protein. In other words, the score that is used to rank poses is also employed to prioritize ligands against each other and to classify them as active or inactive molecules.


Motivation

To investigate the effectiveness of using a generic, BA-based SF for the three tasks, we collected scoring, docking, and screening performance data of 27 SFs used in academic and commercial molecular docking packages.11-13 This set of SFs was evaluated on a popular benchmark, viz. the PDBbind core test set (Cr),14 which is derived from the PDB and comprises 65 three-complex clusters corresponding to diverse protein families and drug-like molecules. Table 2 lists the scoring performance of the SFs in terms of the Pearson correlation coefficient between the predicted and true values of BA. Their standard deviation (SD) values from experimental BA are listed in the fourth column. In the next columns, we show docking power in terms of S1, S2, and S3 success rates. The screening power statistics are reported in the last three columns in terms of the top 1%, 5%, and 10% enrichment factors. These scoring, docking, and screening statistics, as well as details of the SFs, protein-ligand complexes, and descriptors used, are discussed in more detail in the following sections, where we will see that the higher these values are (except SD), the more accurate an SF is in terms of scoring, docking, and screening performance. In this section, the focus will be on the relationship between the scoring, docking, and screening accuracies of the scoring functions. The results collected from our experiments11 and the literature12,13 indicate that the ensemble ML SFs BRT and RF-Score achieve the best scoring accuracy. However, due to uncertainties associated with BA prediction, the relatively higher scoring power of these SFs does not necessarily translate to corresponding improvements in docking and screening predictions. This is evident from Table 2. The top-performing ML SF based on boosted regression trees, BRT, has the best BA prediction accuracy (Rp = 0.777), but fails to identify poses that are within 2 Å of the native one for (100 - 58.46 =) 41.54% of the proteins. A similarly mediocre accuracy was also obtained when its screening performance was evaluated. The table shows that BRT achieved an enrichment factor of 6.15 compared to an ideal SF that would yield an enrichment factor of up to 65 for the database on which BRT was tested.
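For concreteness, the two task-specific metrics used in Table 2 can be computed as in the following minimal Python sketch. The function names are ours, and the sketch assumes that higher predicted scores are better and that each target's poses are already sorted by predicted score:

    import numpy as np

    def enrichment_factor(scores, is_active, top_frac=0.01):
        # E.F.x%: active rate among the top-scored x% of the database,
        # divided by the active rate in the whole database.
        order = np.argsort(scores)[::-1]            # best scores first
        n_top = max(1, int(round(top_frac * len(scores))))
        top_hits = np.asarray(is_active)[order][:n_top].sum()
        total_hits = np.sum(is_active)
        return (top_hits / n_top) / (total_hits / len(scores))

    def docking_success_rate(pose_rmsds_per_target, n=1, cutoff=2.0):
        # S_N: percentage of targets for which at least one of the N
        # top-scored poses lies within `cutoff` angstroms of the native pose.
        hits = [min(rmsds[:n]) <= cutoff for rmsds in pose_rmsds_per_target]
        return 100.0 * np.mean(hits)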


Table 2: Performance of conventional scoring functions in scoring, docking, and screening the ligands of 65 different protein families in the PDBbind core set versions 2013, 2014, and 2015. Scoring power is reported as Rp and SD; docking power as the S1, S2, and S3 success rates (in %); and screening power as enrichment factors (E.F.) over the top 1%, 5%, and 10% of ranked ligands.

    Scoring Function(1)  Type(2)   Rp       SD     S1     S2     S3     E.F. 1%  E.F. 5%  E.F. 10%
    ----------------------------------------------------------------------------------------------
    BRT                  ML        0.777    1.42   58.46  74.87  82.56   6.15     2.87     2.31
    RF-Score             ML        0.723    1.56   21.03  33.85  45.13   1.54     1.85     1.44
    X-Score              Em        0.614(3) 1.78   62.35  77.50  79.11   2.31     2.14     1.41
    ∆SAS                 Em        0.606    1.79   25.88  37.64  48.83   1.41     1.28     1.12
    ChemScore@SYBYL      Em        0.592    1.82   60.88  72.94  80.00   5.26     2.38     2.18
    NN-Score             ML        0.584    1.83   78.46  84.62  88.72  11.28     4.00     2.67
    SMINA                Em        0.580    1.84   78.46  84.10  87.69   1.54     2.87     2.36
    ChemPLP@GOLD         Em        0.579    1.84   82.05  87.06  90.03  14.28     5.88     4.31
    PLP1@DS              Em        0.568    1.86   77.94  84.12  86.18   6.92     4.28     3.04
    LigScore             Em        0.563    1.87   81.54  91.28  94.36  18.90     6.27     4.07
    G-Score@SYBYL        FF        0.558    1.87   47.35  64.41  75.00   1.92     1.26     1.44
    ASP@GOLD             KB        0.556    1.88   72.94  82.06  87.65  12.36     6.23     3.79
    ASE@MOE              Em        0.544    1.89   53.24  61.76  66.47   4.36     2.35     1.59
    ChemScore@GOLD       Em        0.536    1.90   80.88  84.12  88.53  18.90     6.83     4.08
    AffiScore@SLIDE      Em        0.529    1.92   58.97  72.82  82.05   4.615    3.28     2.77
    D-Score@SYBYL        FF        0.526    1.92   20.88  32.35  46.35   2.31     1.79     1.46
    Alpha-HB@MOE         Em        0.511    1.94   75.29  85.17  88.53   4.87     3.23     1.32
    LUDI3@DS             Em        0.487    1.97   67.94  79.12  85.00  12.53     4.28     2.80
    GoldScore@GOLD       FF        0.483    1.97   72.35  84.12  88.53   7.95     4.52     3.16
    Affinity-dG@MOE      Em        0.482    1.98   64.41  75.58  81.65   8.21     4.15     3.19
    LigScore2@DS         Em        0.456    2.02   75.53  84.71  87.06  15.90     6.23     3.51
    GlideScore-SP        Em        0.452    2.03   78.97  86.17  88.53  19.54     6.27     4.14
    Jain@DS              Em        0.408    2.05   50.58  64.71  72.94   5.90     2.51     1.80
    PMF@DS               KB        0.364    2.11   52.71  62.94  66.47   4.87     2.87     2.63
    GlideScore-XP        Em        0.277    2.18   75.59  84.12  86.18  16.81     6.02     4.07
    London-dG@MOE        Em        0.242    2.19   62.35  77.05  80.00   8.08     3.36     2.51
    PMF@SYBYL            KB        0.221    2.20   52.29  61.18  66.47   5.38     2.21     1.90

    (1) The top-performing scoring function under each of the three headline performance metrics is: BRT for scoring power (Rp), ChemPLP@GOLD for docking power (S1), and GlideScore-SP for screening power (E.F. 1%).
    (2) Abbreviations for the type of scoring function: Em, empirical; KB, knowledge-based; FF, force-field; ML, machine-learning.
    (3) Our version of X-Score yielded a scoring power of Rp = 0.627.

The best-performing SF in terms of correctly classifying ligands as active or inactive, GlideScore-SP, with an enrichment factor of 19.54, has mediocre scoring performance (Rp = 0.452). Similarly, the most accurate SF in terms of docking power, ChemPLP@GOLD with a pose prediction success rate of 82%, performs poorly in the scoring and screening tasks (Rp = 0.579 and screening E.F. = 14.28, respectively). SF design for solving the ligand scoring, docking, and screening problems has been investigated for over two decades now,13,15 but the best empirical SF


technique has a scoring power of only 0.614 (in terms of Rp) on this benchmark, whereas our best proposed ML SF, BT-Score, attains a much higher scoring power of 0.82, which is a significant result as we will see in the Results section. In terms of docking and screening powers, however, BT-Score lags behind conventional SFs. Thus, as we will see later, using BA-based ML SFs has clearly worked well for scoring in both relative (to conventional SFs) and absolute terms, but has not worked well, in either absolute or relative terms, for docking or screening. Therefore, BA-based SF design is effective for ligand scoring but is clearly not sufficient to solve the ligand docking and screening problems. Note that in SBDD all three of the problems, ligand scoring, docking, and screening, are critical to solve well. Solving ligand scoring well, but not the docking and screening problems, leads to ineffective virtual screening (VS)—this is what we observed in the results above. Solving the ligand docking and screening problems well leads to effective relative screening of ligands, but if ligand scoring is not done properly, there will be uncertainty in the binding affinity (and hence stability) of the identified ligand(s). The proposed methods seek to address the ligand docking, screening, and scoring problems using novel task-specific optimization of SFs.

Key contributions

In this work, we contribute to the solution of the ligand scoring, docking, and screening problems using the following novel models: 1) BT-Score: an ensemble ML SF based on boosted decision trees and thousands of descriptors to predict BA. BT-Score's predictions for out-of-sample test complexes have a 0.825 correlation with experimental BA. This represents more than 14% and 31% improvement over the performance of the ML SF RF-Score and the empirical model X-Score, whose Pearson correlation coefficients are 0.725 and 0.627, respectively. Despite its high accuracy in the scoring task, we find the docking and screening performance of BT-Score is far from ideal, which motivated us to develop the following task-specific


SFs. 2) BT-Dock: a boosted-tree docking model fitted to a large number of native and computer-generated ligand conformations and then optimized to predict binding poses explicitly. The model has shown an average of 25% improvement over its BA-based counterparts in different pose prediction scenarios. 3) BT-Screen: a screening SF that directly models ligand activity as a classification problem. BT-Screen is fitted to thousands of active and inactive protein-ligand complexes to optimize it for finding real actives from databases of ligands not seen in its training set. Our results suggest that BT-Screen can be 73% more accurate than the best conventional SF in enriching datasets of active and inactive ligands for many protein targets. 4) MT-Net: a novel multi-task deep neural network trained on data from the three tasks to simultaneously predict ligand binding poses, affinities, and activity labels. Our network is composed of shared hidden layers for the three tasks to learn common features, task-specific hidden layers for high-level feature representation, and three task-specific output layers for the three tasks. We show that the performance of MT-Net is better than those of conventional SFs and on par with single-task neural networks, and its accuracy can be further improved as more training data becomes available. The remainder of the paper is organized as follows. The next section presents the protein-ligand complex database and the formation of the training and test datasets used to build and assess the proposed SFs, the descriptors extracted to characterize the complexes, and the machine-learning algorithms of boosted decision trees and multi-task neural networks that we employ. Then, we present results comparing the scoring, docking, and screening powers of the proposed task-specific SFs and conventional approaches on known and novel protein targets. We also compare the performance of the multi-task deep neural network SF


to its single-task counterparts and conventional models. Finally, we summarize our findings and conclude our work.

Methods

Molecular data

We use the 2014 release of PDBbind to build and evaluate the proposed SFs in this work.14,16 The protein-ligand complexes (PLCs) in PDBbind are obtained from the Protein Data Bank (PDB) database. The curators of PDBbind systematically selected high-quality three-dimensional structures from the PDB and collected reliable experimental binding affinity data from the literature. Some of the selection criteria imposed to ensure a high-quality PLC database include: (i) the complexes must have crystal structures of 2.5 Å resolution or better, (ii) the proteins must be complete with no missing side chains, (iii) the binding between proteins and their ligands must be non-covalent, (iv) binding affinity data must be expressed in terms of dissociation (Kd) or association (Ka) constants, (v) complexes must be between one protein and one ligand only, (vi) the molecular weight of the ligand must not exceed 1000, and (vii) the ligand must not include atoms other than carbon, nitrogen, oxygen, phosphorus, sulfur, halogen, and hydrogen. For the 2014 release of PDBbind, 3,446 complexes were found to meet the criteria; these were called the refined set. In addition to quality, another objective of PDBbind was to maximize the diversity of its complexes. The refined set was clustered based on pairwise sequence similarity using BLAST with a 90% similarity cut-off and was found to contain 1,135 unique protein families. In the majority of cases, proteins are bound to unique ligands, as judged by the Euclidean distance between ligands and by Tanimoto similarity—we encoded the ligands using PaDEL descriptors17 of 741 physicochemical properties and ECFP4 binary fingerprints18 of 1,024 bits before calculating Euclidean distances and Tanimoto similarities, respectively.
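As an illustration, the two ligand-similarity measures just mentioned can be computed with RDKit and NumPy roughly as follows. This is a sketch: the helper names are ours, and we assume ligands are available as SMILES strings and as precomputed PaDEL descriptor vectors:

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def ecfp4_tanimoto(smiles_a, smiles_b, n_bits=1024):
        # Tanimoto similarity between two ligands encoded as 1,024-bit
        # ECFP4 fingerprints (Morgan fingerprints of radius 2).
        fp_a, fp_b = (AllChem.GetMorganFingerprintAsBitVect(
                          Chem.MolFromSmiles(s), 2, nBits=n_bits)
                      for s in (smiles_a, smiles_b))
        return DataStructs.TanimotoSimilarity(fp_a, fp_b)

    def padel_euclidean(desc_a, desc_b):
        # Euclidean distance between two 741-element PaDEL descriptor vectors.
        return float(np.linalg.norm(np.asarray(desc_a) - np.asarray(desc_b)))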


Training and test data sets

An important goal in this work is to objectively estimate the generalization accuracy of the proposed SFs based on their familiarity with the test proteins and ligands. More specifically, our objective is to investigate how SFs perform on test protein targets that are: (i) already present in the SFs' training sets (but bound to different ligands) and (ii) partially or fully unrepresented in their training samples. We also investigate their effectiveness for ligands not seen during training. Given sufficiently large and diverse training data, ML SFs are expected to generalize well to novel proteins and ligands not present in their training complexes if they are able to learn the protein-ligand interactions at the fragment and atomic level. On the other hand, the proposed SFs would produce inaccurate predictions for novel targets and ligands if they are learning the relationship between activities and the interacting molecules at the macro level, since they would fail to find similar proteins from their training (or rather, "memorizing") experience. Accordingly, we designed three different training-test sampling strategies from the 3,446-complex refined set of PDBbind to evaluate our SFs in real-world modeling scenarios based on the degree of overlap between the training and test proteins and ligands.

PDBbind core test set: novel complexes with known protein targets and ligands

From the 3,446 PLCs in the refined set of PDBbind 2014, the curators of the database built a test set with maximum diversity in terms of protein families and binding affinities. The test set was constructed as follows. (i) The refined set was first clustered based on 90% primary-structure similarity of its protein families. (ii) Clusters with five or more proteins were then examined based on their ligands' binding affinity values. From each of these PLC clusters, three complexes were selected: the PLC whose ligand has the lowest BA, the one with the highest BA, and the one closest to the mean BA of that cluster. (iii) The three selected PLCs were added to the test set if the complexes with the highest and lowest BA values are at least two orders of magnitude apart in terms of their dissociation (Kd) or inhibition (Ki) constants.


This resulted in the selection of 65 different protein families, where each protein family binds to three different ligands with varying BA values, to form a set of 195 unique PLCs. This is called the core test set, and it has been a popular benchmarking test set for many recent docking and scoring systems.12,13,15,19,20 Due to its popularity, the 2015 core set has been kept identical to its predecessor in PDBbind 2014 so that the results of different studies can be compared. The corresponding training set for Cr, referred to as the primary training set and denoted by Pr, was formed by removing all Cr complexes from the total 3,446 complexes in the refined set. As a result, Pr contains (3,446 - 195 =) 3,251 complexes that are disjoint from Cr complexes. In our experiments, we fixed the base sizes of the training and test sets to 3,000 and 195 PLCs, respectively. The 3,000 PLCs are randomly sampled from the 3,251-complex set to help evaluate the variance of ML SFs. The core test set complexes are considered targets that are "known" to scoring functions due to the overlap between their training and test proteins. More specifically, for each protein in the 65 protein clusters of Cr, there is at least one identical protein present in the primary Pr training data, albeit bound to a different ligand. In addition, every ligand in the core test set has at least one training ligand that shares the same ligand cluster. We found that 40% of PLCs in Cr have some degree of similarity to training complexes in Pr in terms of both the protein and ligand present. We performed the comparison by considering both protein and ligand cluster memberships for each complex in both datasets. We generated 100 ligand clusters based on k-means clustering of the PDBbind 2014 refined set ligands, as discussed in more detail later in the Leave Clusters Out subsection. Therefore, ligands of the core test set are also considered "known" to scoring functions that are trained on the 3,000 PLCs sampled from the primary set Pr.
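The cluster-wise triplet selection described at the beginning of this subsection can be sketched as follows. Function and variable names are ours; a spread of 2 pKd/pKi units corresponds to the required two orders of magnitude in Kd/Ki:

    import numpy as np

    def select_core_triplet(cluster_ba, min_spread=2.0):
        # cluster_ba maps the PDB codes of one protein-family cluster
        # (>= 5 complexes) to their pKd/pKi values.
        if len(cluster_ba) < 5:
            return None
        codes, ba = zip(*cluster_ba.items())
        ba = np.asarray(ba, dtype=float)
        if ba.max() - ba.min() < min_spread:
            return None                      # highest/lowest BA too close
        lowest = codes[int(ba.argmin())]
        highest = codes[int(ba.argmax())]
        middle = codes[int(np.abs(ba - ba.mean()).argmin())]
        return lowest, middle, highest       # the three test-set PLCs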


Multi-fold cross-validation: novel complexes with known and novel protein targets and ligands

As mentioned in the previous section, each of the three PLCs in a cluster in Cr was sampled from a larger protein-family cluster (with >= 5 PLCs per cluster) in the refined set, and the remaining two or more PLCs were set aside in Pr. More stringent test sets were also developed in terms of the protein-family overlap with training sets. One testing technique is based on 10-fold cross-validation (CV), where the refined set of PDBbind 2014 is shuffled randomly and then partitioned into 10 non-overlapping sets of equal size. In each round, one of the ten sets is used for testing and the remaining nine are combined for training. Once the training and test round completes for one fold, the same process is repeated for the other nine folds, one at a time. In a typical 10-fold CV experiment on a dataset of 3,446 PLCs, the training and test sizes of the 10 folds are (3,446 × 9/10 ≈) 3,101 and (3,446 × 1/10 ≈) 345 complexes, respectively. In order to prevent dataset size from having an effect on our results, and to match the sizes used for Pr and Cr, we randomly sample (without replacement) 3,000 of the 3,101 training complexes and 195 PLCs from the corresponding 345 test complexes in each fold, as sketched below. The performance results on the 10 test folds are then averaged and reported in the Results section. Unlike the overlap in Cr, due to the randomness of CV, every protein and ligand family is not necessarily present in both the training and test sets across the ten folds. Therefore, some test proteins and ligands may actually be "novel" to the scoring function, while others may be present in the training data. If a protein is present, it would be bound to a different ligand in the training set. It should be noted that in this work, we simulate how scoring functions would perform on novel complexes, targets, and/or ligands. What is considered novel here for SFs is based upon whether or not a similar entity is present in the training set for each entity present in the test set.
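A sketch of this fold construction and size matching using scikit-learn (the fixed random seed is our own choice, added only for reproducibility):

    import numpy as np
    from sklearn.model_selection import KFold

    rng = np.random.RandomState(0)
    complexes = np.arange(3446)              # indices into the refined set

    for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                     random_state=0).split(complexes):
        # Sub-sample each fold to the fixed base sizes used in this work:
        # 3,000 training PLCs and 195 test PLCs.
        train = rng.choice(train_idx, size=3000, replace=False)
        test = rng.choice(test_idx, size=195, replace=False)
        # ... fit the SF on `train` and evaluate it on `test` ...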


Leave clusters out: novel complexes with novel protein targets and ligands

Leave-target-clusters-out (LTCO) and leave-ligand-clusters-out (LLCO) represent our third approach to constructing training and test datasets. It is based on the novel-targets experiment of Ashtawy and Mahapatra11 and the leave-cluster-out method of Kramer et al.21 The objective of this stringent approach to sampling training and test datasets is to evaluate scoring functions on protein targets or ligands dissimilar from those encountered during training. In LTCO, the refined set complexes are partitioned based on protein primary-structure similarity such that the training and test partitions have zero overlap in protein families. More specifically, the datasets were constructed as follows. The 3,446 PLCs of the refined set were clustered based on the 90% similarity cut-off of the proteins' primary structures (smaller similarity cut-off values of 30% through 90% are also considered, as we show towards the end of the paper). A total of 1,135 clusters were found, of which the largest five are: HIV protease (262 PLCs), carbonic anhydrase (170 PLCs), trypsin (98 PLCs), thrombin (79 PLCs), and factor XA (56 PLCs). There were 9 clusters of size 30 PLCs or larger, including the aforementioned five, 43 of size 10 or larger, and 1,092 clusters of size less than 10 PLCs, the majority of which (644 clusters) consist of a single PLC. We created 20 different training-test dataset pairs based on this distribution of protein families. For the nine most frequent protein families (with >= 30 PLCs each), we created nine corresponding test sets, each consisting of a single family. For each family-specific test set of these nine, we sampled 3,000 PLCs from the refined set, none of which contains the protein family under test. From the less populous clusters (i.e., those with < 30 PLCs), we randomly sampled eleven 195-PLC test datasets. These test sets could contain complexes from (1,135 - 9 =) 1,126 different protein families, mostly singletons (i.e., proteins that occur only once in the dataset). For each multi-family test set of these eleven, we again sampled another 3,000 PLCs from the refined set so that none of them contained the protein families under test. The justification for selecting this particular number of datasets and these sizes is two-fold. First, we wanted about half of the mean performance on these test sets to result from the single-family test sets and


the other half from the multi-family sets. Second, taking the average over a larger number of test sets of reasonable sizes reduces the possibility of skewing the final performance due to potential outliers that could be obtained on a few small test sets. Test proteins are considered novel here due to the constraint imposed on the sequence similarity between them and the training proteins, which must be less than 90%. Therefore, all test proteins in a test cluster are considered novel for scoring functions fitted to training proteins sampled using the LTCO strategy. We also test the performance of scoring functions on novel ligands using leave-ligand-clusters-out (LLCO). In LLCO, clusters of ligands are created according to their Euclidean distances from each other. All-against-all Euclidean distances were computed from the 741-element PaDEL descriptor vectors characterizing each ligand. Before calculating the distances, extremely small and large descriptor values (outliers) were replaced with their 1st and 99th percentiles, respectively, and then all descriptor values were normalized. We conducted experiments with different numbers of ligand clusters using k-means, as sketched below. Increasing the number of clusters increases the average similarity between ligands in each cluster, and vice versa. We found that varying the number of clusters between 40 and 100 does not change our results significantly. This suggests that the vast majority of ligands are unique, as was confirmed by visual inspection of the ligand clusters and their quality metrics (such as the Silhouette Coefficient). The results we report in the paper are for 100 ligand clusters. Out of these 100 clusters, 23 have a minimum of 50 ligands and a maximum of 98. In our experiments, we leave these 23 clusters out (one or more at a time) of the training complexes for validation. Therefore, SFs in LLCO are tested on complexes with ligands that share no cluster membership with any ligands in their training complexes. In other words, the training and test complexes do not overlap based on their ligand clusters, and for this reason we consider the test ligands "novel" to the scoring function under assessment.
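The ligand-clustering procedure above can be sketched as follows. The text does not spell out the normalization step, so z-scoring is assumed here as one common choice:

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_ligands(padel_matrix, n_clusters=100, seed=0):
        # padel_matrix: one row of 741 PaDEL descriptors per ligand.
        X = np.asarray(padel_matrix, dtype=float)
        lo, hi = np.nanpercentile(X, [1, 99], axis=0)
        X = np.clip(X, lo, hi)                              # replace outliers
        X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)   # normalize
        return KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X)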


Multi-perspective modeling of protein-ligand interactions

Protein-ligand interactions are governed by a large number of complex factors. Examples of the intermolecular factors include hydrogen bonding, hydrophobic forces, electrostatic effects, van der Waals interactions, and a variety of other enthalpic and entropic contributions. Modeling such forces properly is a non-trivial task, and it has been an active area of research for many decades. Over the years, numerous hypotheses for each of these factors have been proposed and utilized in empirical and machine-learning scoring functions as descriptors or features.7,22-27 Although many of the SFs overlap in the factors included in their models, they typically have different modeling perspectives on them. Also, the vast majority of conventional SFs model only a small subset (typically fewer than 10) of the likely larger number of factors contributing to protein-ligand binding.7,22-25 In another concurrent effort (not within the scope of this work), we built a multi-perspective library of descriptors for a large number of factors and found them effective when incorporated in scoring functions. Utilizing a large number of descriptors from different SFs and related molecular modeling tools provides a multi-perspective, complementary, and more comprehensive view of the complex intermolecular factors that would otherwise be impossible to model using only a handful of single-perspective hypotheses. Therefore, in this work, we use over 2,700 multi-perspective descriptors from the Descriptor Data Bank (www.descriptordb.com) that characterize proteins and ligands as individual molecules and as complexes together, drawn from 16 different sources.7,9,12,17,18,22-30 We list the sources of these descriptors and the entities they characterize (i.e., ligand, protein, or complex) in Table 3. These descriptors will be extracted for each PLC in our datasets and will be utilized as features for the ML algorithms we employ in this work.

Molecular modeling using Extreme Gradient Boosting

Decision trees are popular approaches for modeling heterogeneous data such as the molecular descriptors and activities that we collect from different sources. Compared to other nonlinear machine-learning techniques, they can be less robust, especially when tackling


Table 3: The 16 descriptor types and scoring functions and the molecular docking software in which they are implemented.

    Descriptor set/SF   Molecule(s)(1)  Software        Type of SF        Reference
    --------------------------------------------------------------------------------
    AffiScore           PLC             SLIDE           Empirical         7
    AutoDock Vina       PLC             AutoDock Vina   Empirical         22
    ChemGauss           PLC             OpenEye         Gaussian          25
    Cyscore             PLC             Standalone      Empirical         24
    DPOCKET             PLC             FPOCKET         Descriptors       28
    DSX                 Preds.          Standalone      Knowledge based   29
    ECFP                L               RDkit           Descriptors       18
    GOLD                PLC             GOLD            Descriptors       9
    LigScore            Preds.          IMP             Knowledge based   30
    NNScore             PLC             Standalone      Machine learning  27
    PaDEL               L               Standalone      Descriptors       17
    REPAST              P               Standalone      Descriptors       descriptordb.com
    RETEST              P               Standalone      Descriptors       descriptordb.com
    RF-Score            PLC             Standalone      Machine learning  12
    SMINA               PLC             Standalone      Empirical         26
    X-Score             PLC             Standalone      Empirical         23

    (1) The entity characterized by the descriptors: P for protein, L for ligand, and PLC for protein-ligand complex. When access to the descriptors of SFs is not available, we use their predictions (denoted by Preds. in the table) as a surrogate for them.

high-dimensional data for a relatively small number of training examples. Ensemble learning via boosting is an effective strategy pioneered by Freund et al.31 and others32 to improve the robustness of decision trees by simultaneously decreasing their variance and bias errors. The boosting model consists of an ensemble of T trees whose individual predictions are combined to produce the final predicted output ŷ according to the equation:

\hat{y} = F(X) = \sum_{t=1}^{T} f_t(X),

where X ∈ R^m is a PLC characterized by m molecular descriptors, and f_t is the t-th tree in the ensemble. The trees are grown sequentially, one after the other, in a greedy fashion. In each tree-fitting step, the training algorithm minimizes the overall loss by boosting the examples with the highest prediction error so far. There are several variants of boosting algorithms proposed in the literature that differ in the treatment of the most erroneous predictions, the


employed loss function, and the regularization strategy used to avoid overfitting. In this work, we use the Python package XGBoost,33 in which each tree in the ensemble is fitted to the residuals made by its predecessor trees. The package implements a regularized stochastic gradient boosting strategy proposed by Chen et al.34 that builds upon earlier work by Friedman32,35 and Freund et al.31 The t-th tree in the ensemble is constructed to minimize the regularized loss L^{(t)} as follows:

L^{(t)} = \sum_{i=1}^{N} \psi\left(y_i, \hat{y}_i^{(t-1)} + f_t(X_i)\right) + \Omega(f_t),

where \psi is the loss function that accounts for the error between the true output y_i for the complex X_i and its value estimated by the current tree (f_t) and the past trees (\hat{y}_i^{(t-1)}) in the ensemble.

The regularization term Ω uses the size of the trees in terms of the number of their nodes N and the values of their leaves θ to increase the loss as follows:

\Omega(f_t) = \gamma N + 0.5\,\lambda \|\theta\|^2 .

We use the XGBoost package with its default parameter values of γ = 0 and λ = 1 to build three models for the docking, screening, and scoring tasks. The remaining parameters, such as the number of trees, their maximum depth, and the sub-sample ratios of the training instances and descriptors, were optimized using grid search for each of the three tasks. The optimal values of these parameters are shown in Table 4. Each of the XGBoost-based SFs will be trained on its respective task-specific data sets, as will be discussed in more detail in the following sections.
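For reference, fitting a BT-Score-style regressor with the Table 4 hyperparameters might look as follows. This is a sketch using the scikit-learn-style XGBoost API; X_train and y_train are placeholders for the descriptor matrix and the pKd/pKi labels:

    import xgboost as xgb

    bt_score = xgb.XGBRegressor(
        n_estimators=4000,       # number of trees (scoring task, Table 4)
        max_depth=10,            # maximum tree depth
        learning_rate=0.02,
        subsample=0.60,          # data sub-sampling per tree
        colsample_bytree=0.75,   # descriptor sub-sampling per tree
        gamma=0.0,               # XGBoost default, as stated above
        reg_lambda=1.0,          # XGBoost default, as stated above
        objective="reg:squarederror",  # called "reg:linear" in older releases
    )
    bt_score.fit(X_train, y_train)     # ~2,700 descriptors per complex
    y_pred = bt_score.predict(X_test)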


Multi-task deep neural networks for simultaneous docking, screening, and scoring of protein-ligand complexes

Protein-ligand complexes fed to the proposed docking (BT-Dock), scoring (BT-Score), and screening (BT-Screen) scoring functions for prediction are typically characterized by the same descriptors, and they are formed with the same protein targets. The training datasets of these task-specific scoring functions may differ only in their sizes and in the type of the output (dependent) variable. When fitting a boosted tree or any other ML model to the dataset of one task, the model does not take advantage of the datasets available for the other tasks to improve its performance. Rather, the three scoring functions are trained and applied in isolation despite the commonalities between the three problems they are modeling. In this section, we propose MT-Net, a novel multi-task deep neural network that can be effectively trained and applied to the three datasets simultaneously. MT-Net leverages information sharing across the three molecular modeling problems and uses abundant data for one problem to improve the performance on tasks with limited data availability. In a sense, optimizing the network for multiple tasks simultaneously self-regularizes it against overfitting the tasks with limited data, compared to learning them separately. Multi-task learning using deep neural networks has been successfully applied in recent studies related to ligand-based drug design.36-38 Unlike the three tasks defined in this work for screening, scoring, and docking, each task in those studies is a predefined protein system for which compounds are classified as active or inactive molecules. As is the case for other QSAR models, such a multi-target neural network must be re-trained from scratch in order to be used for targets different from the training proteins. In contrast, the multi-task SFs presented in this work are applicable to any protein system without the need for extra training. During their construction, our task-specific and multi-task scoring functions take into account information about the ligand and protein explicitly, instead of implicitly via multi-target learning as in some QSAR models.36-38 Therefore, the proposed scoring functions in this work are applicable to any target, as we will show in the


following sections.

Network architecture. The neural network we propose here consists of an input layer for the raw descriptors x^s_1, L^s hidden layers shared (s) among the three tasks for extracting low-level representations of the molecular interactions, L^t task-specific hidden layers for each task t ∈ {dock, screen, score} to generate high-level representations specific to each task, and corresponding output layers Y^t, as depicted in Figure 2.

Figure 2: The architecture of the multi-task deep neural network SF MT-Net.

During prediction, lower-level features universally applicable to the three tasks are first extracted from the shared hidden layers as follows:

x^s_{l+1} = H(W^s_l x^s_l + b^s_l),   l = 1, \ldots, L^s

where W^s_l and b^s_l are, respectively, the weight matrix and the bias vector of the l-th shared hidden layer, and H is the activation function associated with it, which is chosen to be a rectified linear unit (ReLU) in this work. The output of the final shared hidden layer is then


fed as an input (x^t_1 = x^s_{L^s+1}) to the task-specific layers to extract higher-level representations specialized for each task using the following transformations:

x^t_{l+1} = H(W^t_l x^t_l + b^t_l),   l = 1, \ldots, L^t

Here, each task-specific layer is parameterized by a weight matrix W^t_l and a bias vector b^t_l. The final outputs for the three tasks are simply the following transformations of the features x^t_{L^t+1} generated by the high-level hidden layers: Y^t = O^t(W^t_o x^t_{L^t+1} + b^t_o), where O^t is the output function for the t-th task, which is logistic for the screening problem and linear for docking and scoring. The number and sizes (numbers of hidden units) of the shared and task-specific hidden layers were tuned using cross-validation. We tested multiple pyramidal configurations for the network and found the configuration of three hidden layers with sizes {2500, 1500, 1000} to be optimal for the shared layers, {200, 200} for the docking- and scoring-specific layers, and {200, 100} for the screening-specific layers, as shown in Figure 2 and Table 4.
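A compact Keras sketch of this architecture is given below. Only the layer sizes, activations, and output types come from the text; the layer names, the optimizer, and the exact dropout placement (see Network regularization below) are our assumptions:

    from tensorflow import keras
    from tensorflow.keras import layers

    n_desc = 2714                                # descriptor vector length

    inputs = keras.Input(shape=(n_desc,))
    x = inputs
    for width in (2500, 1500, 1000):             # shared trunk
        x = layers.Dense(width, activation="relu")(x)
        x = layers.Dropout(0.2)(x)               # 20% random dropout

    def head(h, widths, name, out_activation):
        for i, w in enumerate(widths):           # task-specific layers
            h = layers.Dense(w, activation="relu", name=f"{name}_h{i}")(h)
        return layers.Dense(1, activation=out_activation, name=name)(h)

    outputs = [
        head(x, (200, 200), "dock", "linear"),    # pose RMSD (regression)
        head(x, (200, 100), "screen", "sigmoid"), # active/inactive label
        head(x, (200, 200), "score", "linear"),   # binding affinity
    ]
    mt_net = keras.Model(inputs, outputs)
    mt_net.compile(optimizer="adam",
                   loss={"dock": "mse",
                         "screen": "binary_crossentropy",
                         "score": "mse"})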

Stochastic learning from imbalanced data. The three molecular modeling tasks benefit from training sets of different sizes. For the scoring task, we have a total of 3,000 training complexes with valid binding affinity labels. The screening scoring function is trained with more than 15,000 PLCs with binary labels indicating whether the complexes are active or not. The docking task benefits from the largest training size of 300,000 unique protein-ligand conformations. For the network to perform well across the three tasks, it should utilize all available training data while guarding its weights against overfitting the tasks with the larger training sizes. We achieve this by performing stochastic gradient descent with minibatches sampled randomly from the training data of the three tasks.


More specifically, we sample 10 PLCs from each task for a total minibatch size of 30 examples during every training (weight-update) step.
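This per-task sampling can be sketched as follows (function names are ours; task_data maps each task to its descriptor and label arrays):

    import numpy as np

    def sample_minibatch(task_data, per_task=10, rng=np.random):
        # Draw 10 examples from each task's training pool for a combined
        # minibatch of 30 per weight update.
        batch = {}
        for task, (X, y) in task_data.items():
            idx = rng.choice(len(X), size=per_task, replace=False)
            batch[task] = (X[idx], y[idx])
        return batch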

Gradient computing for unavailable labels. Despite having common training protein targets across the three molecular modeling tasks, the majority of proteins form unique complexes across tasks. Complexes with the same protein differ from each other in the bound ligands and/or the conformations of those ligands. Complexes available for one task may not be present for the other two tasks, and consequently they do not share the same labels. The task-specific labels we use here are the binding affinity, the RMSD of a pose from its respective native conformation, and an active/inactive binary value. This poses a challenge during training while computing the gradient to update the network's weights, as explained next. Gradients are calculated at the output layer of each task based on the difference between the actual values and the labels predicted by those output layers. We use gradients of zero for the tasks with unlabeled examples so that the task-specific weights associated with missing labels do not get updated, while weights in the shared layers are updated with gradients calculated from labeled examples only. It should be noted that each of the 30 training examples in any random minibatch has at least one available label and at most two missing labels.
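One way to realize this zero-gradient rule is to mask the per-example loss at each task's output, e.g. (a sketch; encoding missing labels with a sentinel value is our assumption):

    import tensorflow as tf

    def masked_mse(y_true, y_pred, missing=-999.0):
        # Contributes a zero gradient for examples whose label for this
        # task is unavailable (encoded with the `missing` sentinel).
        mask = tf.cast(tf.not_equal(y_true, missing), y_pred.dtype)
        sq_err = tf.square(y_true - y_pred) * mask
        # Average over labeled examples only; guard against all-missing batches.
        return tf.reduce_sum(sq_err) / tf.maximum(tf.reduce_sum(mask), 1.0)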

Network regularization. Networks with wide and deep hidden layers and multiple output layers, such as the one we build here, have millions of parameters (weights and biases), which outnumber our training examples many times over. Therefore, it is necessary to regularize the network to avoid overfitting. MT-Net has a powerful built-in regularization mechanism enabled by the simultaneous optimization of the shared layers with different training data for each task, which helps prevent the shared weights from overfitting individual tasks. To further guard against overfitting, we use the dropout technique on hidden units by randomly selecting 20% of the network's activations and deliberately switching them off to zero for each protein-ligand complex in the given training minibatch. Random dropout was introduced by Hinton et al.


to avoid overfitting by preventing co-adaptations among the hidden units of each layer.39 A network trained with dropout can effectively be viewed as an average of an ensemble of multiple thinned networks. The cost of using dropout is longer training time, since parameter updates are noisy due to the random replacement of activations with zeros.

Computational requirements. Due to its large number of parameters, MT-Net requires a powerful computer system to train and predict. We use an Amazon AWS instance with an NVIDIA Kepler GK210 GPU featuring 2,496 parallel processing cores and 12 Gigabytes of GPU-dedicated memory. Training the network on this system takes five hours of wall clock time. Predicting a dataset of 1,000 protein-ligand complexes can be achieved in less than 5 milliseconds on the same machine. We use the deep learning Python libraries TensorFlow and Keras to build our multi-task neural network scoring functions.40,41

Results and discussion

Scoring task: binding affinity prediction

Estimating the binding affinity (BA) between proteins and drug-like compounds is very important in SBDD applications such as ligand ranking and lead optimization. Scoring is typically applied after identifying the ligand's pose (docking) and determining whether it has biological activity with respect to the target protein (screening). The estimated BA provides an indication of how strongly the molecules bind, and it is expressed in terms of the negative log of the dissociation (pKd) or inhibition (pKi) constants. In this work, we train SFs for binding affinity prediction by learning from thousands of high-quality PLCs with experimentally determined BA data. More specifically, our ML SF, which we refer to as BT-Score, is an ensemble of 4,000 boosted regression trees fitted to more than 2,700 descriptors characterizing 3,000 training complexes whose BA values range from 2 to 11.96 in pKd/pKi units. Four versions of BT-Score models were constructed and evaluated to investigate its generalization capacity on known and new targets and ligands (using the Cr, CV, LTCO, and LLCO datasets), as described in the Methods section.


Table 4: Optimal parameters for the XGBoost, Random Forest (RF), and multi-task deep neural network scoring functions.

    Model        Parameter                                   Scoring      Docking      Screening
    ---------------------------------------------------------------------------------------------
    XGBoost(33)  Number of trees                             4000         5000         4000
                 Maximum tree depth                          10           7            7
                 Learning rate                               0.020        0.030        0.030
                 Data sub-sampling                           60%          26%          60%
                 Descriptor sub-sampling                     75%          75%          75%
    RF(42)       Number of trees                             4000(1) (same model for all tasks)
                 Maximum tree depth                          Full(2)
                 Data sub-sampling                           60%
                 Descriptor sub-sampling                     100%
    MT-Net       Number of shared layers                     3 (shared across the three tasks)
                 Number of neurons in shared layers          {2500, 1500, 1000}
                 Activation functions                        Rectified linear units (ReLUs)(3)
                 Number of task-specific layers              2            2            2
                 Number of neurons in task-specific layers   {200, 200}   {200, 200}   {200, 100}

    (1) Random Forests is used to build the binding-affinity-based scoring function RF-Score. RF-Score is therefore the same scoring function applied to all three tasks, which is why only one set of parameter values is listed.
    (2) The maximum depth of the tree.
    (3) ReLU: output = max(0, input).

23

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

Cr (known targets)

CV (known and novel targets & ligands)

1.00 0.75

0.540

0.543

0.726

0.729

0.819

0.826

0.642

0.627

0.728

0.725

0.818

0.25

0.827

0.50

Scoring accuracy

0.00

Accuracy Rp

LTCO (novel targets)

LLCO (novel ligands)

Rs

1.00

NRMSE

0.75

0.297

0.313

0.391

0.388

0.555

0.590

0.502

0.534

0.484

0.532

0.581

0.25

0.637

0.50

e or X− Sc

F− R

BT

−S c

or

Sc or e

e

e or X− Sc

co F− S R

−S c

or

re

e

0.00

BT

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 46

Scoring function

Figure 3: The scoring accuracy of BT-Score (proposed), RF-Score, and X-Score when evaluated on test complexes with proteins that are either fully represented (Cr), partially but randomly represented (CV), or not represented (LTCO) in the SFs' training data. We also test them on complexes with novel ligands (LLCO) that are not represented in their training data.

It has been found that more than 90% of recent drug targets are known, with complexes already deposited in PDB, the database on which our SFs are trained. 43 This means that BT-Score will be applied to known targets in the vast majority of real-world drug development applications. To test BT-Score and the other two SFs on novel targets, we trained and evaluated them on datasets that are completely disjoint at the target level using the leave-target-clusters-out (LTCO) strategy based on a 90% BLAST similarity cut-off (smaller cut-off values were also considered and are presented in Figure 7). The performance bars in the left panel show that the accuracy of BT-Score drops to around 0.64 in terms of the average correlation between predicted scores and true BAs across 20 out-of-sample target clusters. The accuracy, however, varies from one target to another, as can be seen from the large error whiskers associated with the average bars. This variance can be attributed to the level of similarity between the active sites of the training and test complexes. Novel targets whose binding sites have drastically different physicochemical and structural characteristics from the training proteins can pose a challenge for ML and empirical SFs. On the other hand, BAs of targets with less dissimilar binding cavities appear to be more predictable despite the lack of homologous analogs in the training set. Similar to their inferior accuracy with respect to BT-Score on known targets, RF-Score and X-Score perform even more poorly on novel targets, with correlations of less than 0.55.

Our experiments and those of others 13 show that there is an association between binding affinity and ligand properties such as molecular weight. For example, by considering 741 diverse ligand descriptors extracted using PaDEL, 17 we found that the number of carbon atoms (nC) in the ligand has the highest correlation with binding affinity, at 0.447. The sum of atomic polarizabilities (ALOP) of the ligand has the second highest correlation, at 0.413. These Pearson correlation coefficients were computed between each of the 741 descriptors and the binding affinities of the 3,446 complexes in the PDBbind 2014 refined set. For the PDBbind 2014 core test set, nC and ALOP correlated slightly higher (0.491 and 0.509, respectively) with the binding affinities of the 195 complexes in this set.
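A brute-force version of this descriptor screen can be written in a few lines; the sketch below assumes a precomputed matrix padel of shape (n_complexes, 741) of PaDEL descriptor values, a parallel list of descriptor names, and a vector ba of binding affinities (all names are illustrative):

```python
import numpy as np

def top_correlated_descriptors(padel, names, ba, k=3):
    """Rank descriptors by |Pearson correlation| with binding affinity."""
    scores = []
    for j, name in enumerate(names):
        col = padel[:, j]
        if np.std(col) == 0:        # constant descriptor: correlation undefined
            continue
        scores.append((name, np.corrcoef(col, ba)[0, 1]))
    return sorted(scores, key=lambda t: abs(t[1]), reverse=True)[:k]
```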

The graph-based descriptor Self-Returning Walk Count of order 4 (SRW4), available in PaDEL, achieved the highest correlation with the core test set binding affinities, at 0.526. It should be noted that BT-Score achieved a correlation of 0.826 on the refined set (using 10-fold cross-validation) and 0.827 on the out-of-sample core test set, 57% higher than that of the SRW4 descriptor. However, a significant part of BT-Score's predictive power could be attributed to properties of the ligand alone. Indeed, by training XGBoost on PaDEL descriptors alone, without any protein-related features, we obtained correlations of 0.717, 0.706, and 0.511 on Cr, CV, and LTCO, respectively. Although these results are still lower than those obtained by BT-Score with the full set of descriptors (0.827, 0.826, and 0.637, respectively), it is worth investigating how BT-Score performs when tested on ligands very different from its training ligands. To this end, we used the leave-ligand-clusters-out (LLCO) test strategy described in the Methods section. We trained and tested 23 versions of BT-Score, RF-Score, and X-Score on random datasets sampled from the 3,446-complex refined set of PDBbind 2014, where the 3,000 training complexes in each sample do not include any ligands in the corresponding test sample, whose size ranges from 50 to 98. The average performances as well as the standard errors of the SFs are depicted in the lower-right corner of Figure 3. The bar plots show that the scoring power of all three SFs drops sharply on novel ligands. Predictions of BT-Score, for example, correlated at 0.590 with true binding affinities, as opposed to 0.826 when it is allowed to learn from similar ligands bound to different proteins. However, its predictions are still more accurate than those of RF-Score (0.388) and X-Score (0.318). We repeated this experiment by fitting XGBoost models to the training ligands characterized by PaDEL descriptors only and testing them on the independent ligand clusters that were left out for validation. This ligand-based SF, which does not use protein information to predict BA, obtained an average correlation of 0.344 with the true BAs of the test ligands. It is interesting that this ligand-based approach outperforms X-Score, which uses protein-ligand interaction terms to calculate BA. BT-Score, with its 2,714 descriptors characterizing both the protein and the ligand, is 71% more accurate (0.590 vs. 0.344) than the ligand-based approach. Even when tested on novel

targets bound to novel ligands (i.e., combined LTCO & LLCO), BT-Score still surpasses RF-Score and X-Score, with a correlation of 0.518 vs. 0.302 and 0.251, respectively. In light of these results, it is evident that BT-Score is consistently more accurate than RF-Score, X-Score, and the naive ligand-based approach across all training and test scenarios as well as all performance summary metrics. Similarly, superior scoring performance has been obtained using an updated version of our boosted neural network SF, BsN-Score. 19 The correlation coefficients between BsN-Score's predictions and actual BA values ranged from 0.637 on novel targets (LTCO) to 0.844 on known targets (Cr).

Docking task: binding pose prediction

Molecular docking is an interplay between sampling the conformational space of the ligand and evaluating the sampled poses inside the target's binding site. The quality of the sampling depends on the algorithm that generates the structures and on the predictive model that rescores them. Typically, thousands of rounds of pose generation and refinement using predicted scores are conducted to intelligently and efficiently explore the vast conformational space. The scoring step is therefore essential to successfully steer the search toward native or near-native ligand poses. Moreover, accurately estimating the ligand's activity label (screening) and binding affinity (scoring) depends on how close the top-scoring conformation is to the native pose. Accurate scoring of the generated poses thus has a bearing on the quality of docking as well as on the accuracy of the subsequent screening and binding affinity prediction steps. Traditional knowledge-based, empirical, and ML SFs are typically trained on crystal complexes with native poses. Moreover, they are trained to generate scores that resemble the free energy of binding, which is in turn used to rank poses. In other words, traditional SFs indirectly use the predicted free energy scores to identify native or near-native poses from a large number of decoy conformations that are not represented in their training datasets. In this section, we present BT-Dock, a docking-specific SF trained on native, near-native, and decoy poses to directly predict the binding conformation of the ligand.

Unlike BT-Score, which is fitted to the native PDBbind PLCs with known BA values, BT-Dock is trained on the same PLCs, but for each complex there are 100 poses (native, near-native, and decoy) instead of just the one (native) conformation found in the complex's crystal structure. The response variable for the ML SF BT-Dock is the root-mean-square distance (RMSD), in angstroms, between each computationally generated pose and the native conformation.
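For reference, the plain form of this response variable is shown below; the labels we actually use are symmetry-corrected, and that correction is omitted here for brevity:

```python
import numpy as np

def pose_rmsd(pose_coords, native_coords):
    """RMSD between two (n_atoms, 3) coordinate arrays with matching atom order."""
    diff = np.asarray(pose_coords) - np.asarray(native_coords)
    return np.sqrt((diff ** 2).sum(axis=1).mean())
```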

Figure 4: Strategy used to generate complexes for training and testing SFs in the (a) docking and (b) screening tasks.

Figure 4(a) summarizes the steps we followed to generate the training poses and calculate their distances from their respective native conformations. First, we randomly translated and rotated the native ligand pose and then docked it into the binding site of the target, which is marked by the original location of the co-crystallized ligand. We use AutoDock Vina 22 to generate 100 binding modes with its exhaustiveness search argument set to 25.
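The random re-orientation step can be sketched as a random rigid-body transform of the ligand coordinates; the translation range below is an illustrative assumption rather than a value from our protocol:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def random_reorientation(coords, max_shift=4.0):
    """Apply a uniform random rotation about the centroid plus a random translation.

    coords: (n_atoms, 3) array of ligand atom positions in angstroms.
    """
    center = coords.mean(axis=0)
    R = Rotation.random().as_matrix()                  # uniformly sampled 3-D rotation
    shift = np.random.uniform(-max_shift, max_shift, size=3)
    return (coords - center) @ R.T + center + shift
```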

We repeated this random re-orientation and docking process 20 times to obtain a total of (100 × 20 =) 2,000 diverse poses around the native conformation. We then calculated the symmetry-corrected root-mean-square distance of each computationally generated pose from the native conformation. All poses with RMSD greater than 10 Å from the native conformation were discarded. The remaining poses were then clustered into 10 one-angstrom-wide bins, where each bin contains 10 representative poses uniformly spaced from each other. Therefore, for every native PLC we obtained a total of 100 poses spanning the [0, 10 Å] range, where each pose is approximately 0.1 Å from its two closest neighboring conformations. The pose at 0 Å is simply the native conformation. The above pose-generation strategy was applied to each PLC in the training and test datasets described in the Methods section. The multi-perspective descriptors were then extracted for each native, near-native, and decoy complex. An ensemble of 5,000 boosted regression trees is fitted to the data using XGBoost to construct the docking-specific SF BT-Dock. We create and test four versions of BT-Dock to evaluate its generalization performance on known targets and known and novel ligands (Cr), novel targets (LTCO), novel ligands (LLCO), and a combination of the three (CV). The docking performance of the SF is expressed as the percentage of complexes for which it is able to find the native or a near-native pose among the 100 conformations generated for each complex. More specifically, BT-Dock's predicted RMSD scores are used to rank the 100 poses in ascending order, and the actual RMSD values of the top-ranking N poses are then examined. A complex is considered successfully docked if at least one of the top-ranking N poses is within 2 Å of the native pose. We examine the top-ranking one, two, and three poses in our experiments and report the corresponding success rates S1%, S2%, and S3%, as shown in Figure 5.
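The SN% metric just described amounts to a few lines of code; the argument names in this sketch are illustrative:

```python
import numpy as np

def success_rate(predicted_rmsds, true_rmsds, n=1, cutoff=2.0):
    """SN%: percentage of complexes whose top-N poses (ranked by predicted RMSD)
    include at least one pose within `cutoff` angstroms of the native pose."""
    hits = 0
    for pred, true in zip(predicted_rmsds, true_rmsds):   # one entry per complex
        order = np.argsort(pred)                          # ascending predicted RMSD
        if np.min(np.asarray(true)[order[:n]]) <= cutoff:
            hits += 1
    return 100.0 * hits / len(predicted_rmsds)
```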

We compare the docking accuracy of BT-Dock to those of BT-Score and the conventional SFs RF-Score and X-Score, which select the top-ranking poses based on their predicted BA values instead of explicit docking scores. The bar plots show that BT-Dock achieves a docking success rate of about 94% in terms of the most stringent docking metric, S1%. The binding-affinity-based SFs BT-Score, RF-Score, and X-Score have substantially lower success rates, the best of which is about 74%, achieved by BT-Score. Similar trends can also be observed for the S2% and S3% success rates across the four training and test scenarios Cr, CV, LTCO, and LLCO. BT-Dock also appears to rank ligand poses correctly and consistently across different PLCs, as can be observed from its error bars, which are always shorter than those of BT-Score, RF-Score, and X-Score, especially on novel protein targets (LTCO) and ligands (LLCO). These results indicate the utility of the proposed approach in designing docking-specific SFs. We believe the boost in accuracy is attributable to the explicit modeling of the objective function and to the size and nature of the data available to BT-Dock. Moreover, BT-Dock is fitted to RMSD values of computationally generated poses, which are much more precise than the noisy experimental BA values to which conventional SFs are fitted.

Figure 5: The docking accuracy of BT-Dock (proposed), BT-Score, RF-Score, and X-Score when evaluated on test complexes with proteins that are either fully represented (Cr), partially represented (CV), or not represented (LTCO) in the SFs' training data. The four scoring functions are also tested on novel ligands (LLCO) not represented in their training complexes.

Screening task: compound database enrichment

Database enrichment refers to the process of filtering inactive compounds out of molecular libraries while retaining the active molecules. Virtual screening is a popular computational approach for filtering compounds en masse using scoring functions. As in the docking task, the majority of conventional SFs order compound databases based on predicted BA values and regard the top-ranked x% of molecules as actives. This would be an appropriate strategy had the SFs been fitted to experimental BA data from both active and inactive protein-ligand pairs. However, such models are typically trained only on active compounds with varying affinity values. This incurs a bias toward active molecules, which is suitable for ranking active compounds with respect to each other but inadequate when inactive molecules are also present in the database. To address this deficiency, we propose BT-Screen, a screening-specific SF that predicts whether a molecule is active or inactive given the target protein. BT-Screen is a large ensemble of boosted classification trees trained on thousands of active and inactive protein-ligand pairs. The ensemble is explicitly optimized to reduce the expected misclassification rate instead of minimizing the mean-squared error of BA. The proposed model takes advantage of the larger training set that is constructed by combining active and inactive molecules, vis-à-vis only binders in the case of other SFs. For the same reason, BT-Screen's training set is also more diverse, which makes it more reliable for accurately classifying a wide variety of molecules. The screening-specific ML models are also resilient to uncertainty in the experimentally collected BA data because they use binary labels instead of the measured BA values themselves, as traditional SFs do. As the next set of plots shows, these factors result in substantial improvements in screening accuracy compared to those achieved by conventional models.
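As an illustration, a BT-Screen-style classifier can be assembled with XGBoost roughly as follows. The tree hyperparameters echo the screening column of the parameter table shown earlier; the synthetic data and the logistic (log-loss) objective, a standard differentiable surrogate for the misclassification rate, are stand-ins of ours:

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-ins for the multi-perspective descriptors and binary labels.
X = np.random.rand(1000, 50)
y = np.random.randint(0, 2, size=1000)        # 1 = active, 0 = inactive

clf = xgb.XGBClassifier(n_estimators=4000, max_depth=7, learning_rate=0.03,
                        subsample=0.60, colsample_bytree=0.75,
                        objective="binary:logistic")
clf.fit(X, y)
p_active = clf.predict_proba(X[:5])[:, 1]     # rank a library by predicted P(active)
```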

From the 2007 and 2014 PDBbind core sets Cr, we derive three benchmarks for assessing the screening accuracy of the proposed and conventional SFs. As explained in the Methods

section, each core set consists of 65 protein families, each of which is bound to three different ligands with known and diverse binding affinity values. These are the active compounds that an ideal SF would identify. We assume the inactive molecules for each protein target are the crystallized ligands of the remaining 64 receptors in the test set. Although this bioactivity assumption requires further investigation, it seems reasonable since all 65 proteins have a pairwise sequence similarity of less than 90%. Furthermore, all the assumed (65 × (64 × 3) =) 12,480 inactive protein-ligand pairs were also checked against the ChEMBL database for possible activity and cross-binding. 44 Forty of these protein-ligand pairs were found to be active, and therefore the final test set consists of a total of (195 + 40 =) 235 binders and (12,480 − 40 =) 12,440 inactive molecules. This translates to a total of (235 + 12,440 =) 12,675 test protein-ligand pairs, and the active-to-inactive ratio is approximately 1:64 for each target. A similar approach was followed by Li et al. 13 to construct a test set for screening purposes. Their test set was derived from the 2013 (same as 2014) version of the PDBbind refined set and also consists of 65 proteins, each binding to 3 different ligands. This set overlaps with the 2007 PDBbind core set in only 10 protein families. We use the remaining 55 protein families of the 2007 PDBbind test set, with their active and inactive molecules, as part of the training set corresponding to the Cr test set. The other part of the training set is the non-overlapping complexes in the 2014 refined set, which only includes active ligands. We train BT-Screen using the 2007 version of Cr with its active and inactive PLCs as well as the remaining (i.e., after removing Cr 2014) active complexes of the refined set of PDBbind 2014, and test it on the active and inactive PLCs of Cr 2014. BT-Score, RF-Score, and X-Score are trained only on the active complexes but tested on the same active and inactive PLCs of Cr 2014. We also use the (55 + 65 =) 120 active-inactive clusters of Cr 2007 and 2014 as well as the (3,446 − 120 =) 3,326 active PLCs in the 2014 refined set as the bases for the cross-validation (CV), leave-target-clusters-out (LTCO), and leave-ligand-clusters-out (LLCO) testing strategies for screening performance. Similar to the procedure we followed on the

known targets case (Cr), we here also fit BT-Screen to the active and inactive complexes in the training folds (CV) or clusters (LTCO & LLCO). The SFs BT-Score, RF-Score, and X-Score are trained only on the active PLCs in those same training folds/clusters. We then test the ability of all of these SFs to discriminate between active and inactive PLCs in the corresponding out-of-sample test folds/clusters.
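The cross-decoy construction described above, where each target's actives are its own crystal ligands and its presumed inactives are the ligands of the remaining targets, can be sketched as follows (the ChEMBL-based relabeling of cross-binders is omitted):

```python
def build_screening_pairs(targets):
    """targets: dict mapping a protein id to the list of its active ligand ids."""
    pairs = []                                    # (protein, ligand, label) triples
    for prot, actives in targets.items():
        for lig in actives:
            pairs.append((prot, lig, 1))          # active pair
        for other, ligs in targets.items():
            if other != prot:
                for lig in ligs:
                    pairs.append((prot, lig, 0))  # presumed inactive (cross-decoy)
    return pairs

# With 65 targets and 3 ligands each, this yields 195 active and
# 65 * 64 * 3 = 12,480 presumed inactive protein-ligand pairs.
```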

Figure 6: The screening enrichment of BT-Screen (proposed), BT-Score, RF-Score, and X-Score when evaluated on test complexes with protein targets that are either fully represented (Cr), partially represented (CV), or not represented (LTCO) in the SFs' training data. The four scoring functions are also tested on novel ligands (LLCO) not represented in their training complexes.

The screening power of SFs is assessed based on their enrichment performance on the screening test set. The enrichment factor is a measure that accounts for the number of true binders among the top-ranked x% of molecules and is calculated as EFx% = NTBx% / (NTBtotal × x%), where NTBx% is the number of actives in the top-ranked x% of molecules and NTBtotal is the total number of binders for the given target protein, which is typically three.
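In code, the enrichment factor is straightforward; this sketch assumes ranked_labels holds the 1/0 activity labels of one target's library sorted by the SF's score, best first:

```python
def enrichment_factor(ranked_labels, frac):
    """EF at the given fraction, e.g. frac=0.01 for EF1%."""
    n_top = max(1, int(round(frac * len(ranked_labels))))
    ntb_top = sum(ranked_labels[:n_top])       # true binders in the top x%
    ntb_total = sum(ranked_labels)             # all binders for this target
    return ntb_top / (ntb_total * frac)

# With 3 binders in a library, recovering all of them in the top 1%
# gives the maximum EF1% of 3 / (3 * 0.01) = 100.
```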

In Figure 6, we report the screening powers of four SFs in terms of their (x =) 1, 5, and 10 percent enrichment factors. The SFs are the screening-specific model BT-Screen and the BA-based approaches BT-Score, RF-Score, and X-Score. As in the scoring and docking cases, the performance of the models is judged based on their familiarity with the test targets and ligands, as depicted in the four panels for the Cr, CV, LTCO, and LLCO training-test sampling strategies. We notice from the four plots that BT-Screen consistently outperforms the three BA-based SFs regardless of the considered enrichment factor or the level of overlap between the training and test proteins and ligands. BT-Screen achieves an enrichment factor of 33.9 for the top 1% of molecules for Cr, which is equivalent to about a 34-fold enhancement over random ranking. Compare that with the performance of BT-Score, the BA-based analog of BT-Screen, whose EF1% is 11.29. BT-Screen is also substantially more accurate (by about 73%) than the best-performing conventional SF, GlideScore-SP, whose EF1% value is 19.54. This boost in performance can be attributed primarily to the task-specific learning approach, since BT-Screen and BT-Score are both trained on the same targets and ligands characterized by the same descriptor sets. We believe that further improvement in screening accuracy is achievable as more experimental data are collected and used.

We also investigated whether ligand properties alone can predict the binary activity label (active/inactive) of the ligand against the target protein without using any protein information. We used the 741 PaDEL descriptors to characterize ligands and collected their activity labels against target proteins. We fitted the data using XGBoost models and evaluated the resulting ligand-based SFs using our four training-test sampling strategies: Cr, CV, LTCO, and LLCO. The top EF1% across the four testing scenarios did not exceed 2.5, which is substantially lower than the performance of BT-Screen and GlideScore-SP.

Multi-task deep neural network: simultaneous prediction of ligand binding pose, affinity, and activity label

In this section, we investigate the benefits of multi-task learning for simultaneously predicting a ligand's binding pose and binding affinity and whether it is active against a given protein target. We build the scoring function MT-Net, which is based on the multi-task neural network architecture described earlier (in the Methods section), and compare its accuracy with that of conventional approaches and of its single-task counterparts on the three tasks. The SFs based on the multi-task (MT-Net) and single-task neural networks are trained on the primary training set of PDBbind 2014 and tested on the out-of-sample core test set Cr. The performance results of these scoring functions and of the best conventional approaches are listed in Table 5.

Table 5: The docking, screening, and scoring performance of multi-task and single-task SFs on the PDBbind 2014 core test set (Cr).

SF type               Docking (S2%)      Screening (EF1%)        Scoring (Rp)
Multi-task (MT-Net)   90.25%             29.17                   0.804
Single-task           90.77%             22.87                   0.803
Best conventional¹    82.05% (ChemPLP)   19.54 (GlideScore-SP)   0.627 (X-Score)

¹ We use the docking and screening accuracies for ChemPLP and GlideScore-SP from the recent comparative study by Li et al. 13 The scoring performance of X-Score is calculated by us since we have access to its software.
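For illustration, an MT-Net-style architecture with the layer sizes reported earlier can be expressed with the Keras functional API as in the sketch below. The input width (2,714 descriptors) is borrowed from BT-Score's descriptor count, and the optimizer, loss choices, and output activations are our assumptions rather than a specification of the trained model:

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(2714,))               # multi-perspective descriptor vector
x = inputs
for units in (2500, 1500, 1000):                  # shared trunk common to all tasks
    x = layers.Dense(units, activation="relu")(x)

def task_head(t, sizes, out_activation, name):
    for units in sizes:                           # task-specific hidden layers
        t = layers.Dense(units, activation="relu")(t)
    return layers.Dense(1, activation=out_activation, name=name)(t)

scoring = task_head(x, (200, 200), None, "scoring")           # binding affinity
docking = task_head(x, (200, 200), None, "docking")           # pose RMSD
screening = task_head(x, (200, 100), "sigmoid", "screening")  # P(active)

model = keras.Model(inputs, [scoring, docking, screening])
model.compile(optimizer="adam",
              loss={"scoring": "mse", "docking": "mse",
                    "screening": "binary_crossentropy"})
```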

The screening and scoring performances of MT-Net are substantially higher than those of GlideScore-SP and X-Score, the best-performing conventional approaches on these two tasks. MT-Net's binding pose prediction is also more accurate than that of the top-performing conventional scoring function in the docking task, ChemPLP. MT-Net and ChemPLP correctly identified the binding modes of 90% and 82%, respectively, of the protein-ligand complexes in the core test set of PDBbind 2014. The binding mode is considered correctly identified if the top-scoring pose of the ligand lies within 2 Å RMSD of the known native conformation. MT-Net achieves an 80.51% success rate on the more stringent test where the top-scoring pose must lie within 1 Å RMSD (instead of 2 Å RMSD) from

the known binding mode to be considered native or near-native. We do not report this more stringent success rate for ChemPLP because it was not available in the original publication 13 and we do not have access to the scoring function's predictions for the test complexes to compute it ourselves.

The screening performance of the multi-task SF is substantially higher than that obtained by the single-task neural network. The gap in performance between the two functions is much narrower for the docking and scoring tasks, with a slight lead for the traditional single-task approach. The finding that docking performance does not benefit from the multi-task approach might be attributed to the fact that the docking data set is already large, so the additional screening and scoring data are relatively too small to make a difference. The reason behind the small difference in scoring performance is less clear and warrants further investigation. It might be due to the fact that the docking and screening data sets are generated from the scoring complexes, so the derived complexes may not contribute new information. We believe that the gains in performance would be much greater if the scoring, screening, and docking data were more diverse and larger in size. To our knowledge, such an architecture for scoring functions has not been attempted before, and we believe it could serve as a blueprint for future scoring functions as more data become available.

When to trust machine-learning scoring functions

We train ML SFs on a large number of diverse PLCs characterized by thousands of descriptors, aiming to learn the complex functions that govern the interactions between the two molecules at the atomic level. For a new test complex, the ML SF relies on these submolecular and atomic descriptors extracted from the complex, together with its learning experience, to produce the desired predictions. The larger and more diverse the training data, the higher the number of similar micro-interactions it can recognize between the test and training complexes. This ultimately should enable ML SFs to generalize well even beyond their training proteins, since

interactions are recognized at the atomic level as opposed to a macro-level view. Theoretically, optimal ML SFs that work well for novel proteins can be built given (i) high-quality training complexes, (ii) abundant in diversity and scale, (iii) characterized using comprehensive and accurate descriptors, and (iv) fitted by an adept learning algorithm. Since each of these ingredients has its own limitations, it is of practical value to empirically evaluate the generalization accuracy of ML SFs on novel targets given their imperfect training data and learning algorithms. We partially addressed this issue using our leave-target-clusters-out (LTCO) test sets by restricting the similarity between training and test complexes to 90% or less using BLAST. In this section, we present the results of a more systematic experiment in which we report the generalization performance for several cut-off values to simulate the degree of novelty of the test proteins.

Figure 7: The scoring, docking, and screening accuracy of BT-Score, BT-Dock, and BT-Screen when evaluated on datasets in which the test proteins have varying degrees of maximum BLAST-based sequence similarity to those in the SFs' training data. (Panels: scoring, correlation Rp; docking, success rate S1%; screening, EF top 1%; x-axis: similarity cutoff.)

The results for the scoring, docking, and screening tasks are shown in Figure 7 for the SFs BT-Score, BT-Dock, and BT-Screen, respectively. None of the training proteins of these SFs has a primary-sequence similarity of more than S% with the test targets, where S ∈ {30, 50, 70, 90, 100}. The plots indicate that ML SFs are more accurate on targets that are at least 90% similar to their training proteins. Novel test proteins with smaller sequence similarity, however, are more challenging for the three SFs, especially for BT-Score. The performance spikes abruptly upon relaxing the cut-off constraint from 70%

to 100%, where complete overlap between the training and test proteins is allowed (i.e., on known targets but still different complexes). The gaps in performance are smoother and narrower for the three tasks when the similarity cut-off value is decreased gradually from 70% down to 30%. Such a trend indicates that SFs effectively treat proteins with a similarity of 90% and above as homologous and those with S < 90% as non-homologous. Even though the ML SFs work very well when trained and tested on homologous proteins, they are still very predictive on novel targets, particularly in the docking and screening tasks. We expect the gap in performance to shrink even further as more data become available in the future. As for the proposed ML SFs, these performance results imply that they have the ability to learn protein-ligand interactions at the atomic level, since they produce reasonably good predictions for novel targets and better accuracies for known proteins.

In addition to novel protein targets, we also tested ML SFs on novel ligands (a sketch of this splitting scheme follows below). The similarity of training and test ligands is controlled via the number of ligand clusters we produce using k-means, as discussed in the Leave Clusters Out subsection. As the number of clusters increases, the average similarity between ligands in the same cluster increases, and vice versa. Increasing the number of clusters also increases the probability of training and testing on similar ligands, even though k-means may group the training and test complexes into different clusters. Conversely, decreasing the number of clusters lowers this probability, but decreasing it too much leaves very few test samples (clusters) with which to reliably judge the performance of SFs. Therefore, in our experiments we created x ligand clusters, where x ∈ {10, 20, 40, 60, 80, 100}. We then trained and tested task-specific ML SFs on non-overlapping ligand clusters for every value of x. The performance of these SFs across x on the three tasks was not very different from their accuracy under the LLCO approach, where we grouped ligands into 100 clusters. With 10 clusters, the SFs lost only 10% or less of the scoring, docking, and screening accuracy they obtained with 100 clusters, and the accuracy returned to the 100-cluster level at 40 clusters. This experiment indicates that most of the ligands in PDBbind are in fact different from one another, and ML SFs remain predictive even when the test ligands

are not represented in their training data.
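A minimal sketch of this clustered splitting scheme follows; the k-means grouping uses scikit-learn, and the function name, seed, and default cluster count are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def llco_splits(ligand_descriptors, n_clusters=100, seed=0):
    """Yield (train_idx, test_idx) pairs that never share a ligand cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(ligand_descriptors)
    for held_out in range(n_clusters):
        yield (np.where(labels != held_out)[0],   # train on all other clusters
               np.where(labels == held_out)[0])   # test on the held-out cluster
```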

Conclusion

In this work, we developed BT-Score, an ML SF based on boosted trees and thousands of descriptors to predict binding affinity (BA). BT-Score improved scoring accuracy by more than 14% and 31% compared with other machine-learning (ML) and conventional SFs, respectively. Even with this superior accuracy in the scoring task, we observed the relative inability of BT-Score and other BA-based SFs to correctly rank ligand poses (docking) and classify ligands as active or inactive (screening). This motivated us to build two novel ML SFs for the docking and screening tasks. We proposed BT-Dock, a boosted-tree model fitted to hundreds of thousands of native and decoy ligand conformations characterized by a large number of descriptors and optimized to predict binding poses explicitly. The model showed an average improvement of 25% over its BA-based counterparts in different pose prediction scenarios. Similar improvement was also obtained by our screening-specific SF, BT-Screen, which directly models ligand activity as a classification problem. BT-Screen is fitted to thousands of active and inactive complexes to optimize its ability to identify truly active compounds in databases of new ligands not seen in its training set. In addition to the three individual task-specific SFs, we also presented a novel multi-task deep neural network (MT-Net) that is trained on data from the three tasks to simultaneously predict binding poses, affinities, and activity labels. MT-Net is composed of hidden layers shared across the three tasks to learn common features, task-specific hidden layers for higher feature representation, and three output layers for the scoring, docking, and screening tasks. We demonstrated that the performance of MT-Net is superior to that of conventional SFs and on par with (in the scoring and docking tasks), or better than (in the screening task), single-task neural networks, and that its accuracy can be improved further as more training data become available.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 1117900. This work was also supported by Amazon's AWS Cloud Credits for Research program. The authors would also like to thank the anonymous reviewers for helpful comments and suggestions.

Conflict of interest

The authors declare no conflict of interest.

ORCID iDs

• Hossam M. Ashtawy: orcid.org/0000-0002-5589-9274
• Nihar R. Mahapatra: orcid.org/0000-0002-3821-3330

References

(1) Sliwoski, G.; Kothiwale, S.; Meiler, J.; Lowe, E. W. Computational methods in drug discovery. Pharmacological Reviews 2014, 66, 334–395.
(2) Tanrikulu, Y.; Krüger, B.; Proschak, E. The holistic integration of virtual screening in drug discovery. Drug Discovery Today 2013, 18, 358–364.
(3) Hoelder, S.; Clarke, P. A.; Workman, P. Discovery of small molecule cancer drugs: successes, challenges and opportunities. Molecular Oncology 2012, 6, 155–176.
(4) Bajorath, J. Integration of virtual and high-throughput screening. Nature Reviews Drug Discovery 2002, 1, 882.

(5) Engels, M.; Venkatarangan, P. Smart screening: approaches to efficient HTS. Current Opinion in Drug Discovery & Development 2001, 4, 275–283.
(6) Zineh, I.; Abernethy, D.; Hop, C. E.; Bello, A.; Mcclellan, M. B.; Daniel, G. W.; Romine, M. H. Improving the tools of clinical pharmacology: goals for 2017 and beyond. Clinical Pharmacology & Therapeutics 2017, 101, 22–24.
(7) Jain, A. N. Scoring noncovalent protein-ligand interactions: A continuous differentiable function tuned to compute binding affinities. J. Comput.-Aided Mol. Des. 1996, 10, 427–440.
(8) Accelrys Software Inc. The Discovery Studio Software, version 2.0; 2001.
(9) Verdonk, M. L.; Cole, J. C.; Hartshorn, M. J.; Murray, C. W.; Taylor, R. D. Improved protein–ligand docking using GOLD. Proteins: Struct., Funct., Bioinf. 2003, 52, 609–623.
(10) MOE; Chemical Computing Group: Montreal, Quebec, Canada, 2006.
(11) Ashtawy, H. M.; Mahapatra, N. R. A comparative assessment of predictive accuracies of conventional and machine learning scoring functions for protein-ligand binding affinity prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 2015, 12, 335–347.
(12) Ballester, P.; Mitchell, J. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics 2010, 26, 1169.
(13) Li, Y.; Han, L.; Liu, Z.; Wang, R. Comparative Assessment of Scoring Functions on an Updated Benchmark: 2. Evaluation Methods and General Results. J. Chem. Inf. Model. 2014.
(14) Wang, R.; Fang, X.; Lu, Y.; Wang, S. The PDBbind Database: Collection of Binding Affinities for Protein-Ligand Complexes with Known Three-Dimensional Structures. J. Med. Chem. 2004, 47, 2977–2980.

(15) Cheng, T.; Li, X.; Li, Y.; Liu, Z.; Wang, R. Comparative assessment of scoring functions on a diverse test set. J. Chem. Inf. Model. 2009, 49, 1079–1093.
(16) Li, Y.; Liu, Z.; Li, J.; Han, L.; Liu, J.; Zhao, Z.-X.; Wang, R. Comparative Assessment of Scoring Functions on an Updated Benchmark: I. Compilation of the Test Set. J. Chem. Inf. Model. 2014, 54, 1700–1716.
(17) Yap, C. W. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011, 32, 1466–1474.
(18) Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.
(19) Ashtawy, H. M.; Mahapatra, N. R. BgN-Score and BsN-Score: Bagging and boosting based ensemble neural networks scoring functions for accurate binding affinity prediction of protein-ligand complexes. BMC Bioinformatics 2015, 16, S8.
(20) Ashtawy, H. M.; Mahapatra, N. R. Machine-learning scoring functions for identifying native poses of ligands docked to known and novel proteins. BMC Bioinformatics 2015, 16, S3.
(21) Kramer, C.; Gedeck, P. Leave-cluster-out cross-validation is appropriate for scoring functions derived from diverse protein data sets. J. Chem. Inf. Model. 2010, 50, 1961–1969.
(22) Trott, O.; Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 2010, 31, 455–461.

(23) Wang, R.; Lai, L.; Wang, S. Further development and validation of empirical scoring functions for structure-based binding affinity prediction. J. Comput.-Aided Mol. Des. 2002, 16, 11–26.
(24) Cao, Y.; Li, L. Improved protein–ligand binding affinity prediction by using a curvature-dependent surface-area model. Bioinformatics 2014, btu104.
(25) McGann, M. FRED pose prediction and virtual screening accuracy. J. Chem. Inf. Model. 2011, 51, 578–596.
(26) Koes, D. R.; Baumgartner, M. P.; Camacho, C. J. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J. Chem. Inf. Model. 2013, 53, 1893–1904.
(27) Durrant, J. D.; McCammon, J. A. NNScore: A neural-network-based scoring function for the characterization of protein-ligand complexes. J. Chem. Inf. Model. 2010, 50, 1865–1871.
(28) Schmidtke, P.; Le Guilloux, V.; Maupetit, J.; Tufféry, P. Fpocket: online tools for protein ensemble pocket detection and tracking. Nucleic Acids Res. 2010, 38, W582–W589.
(29) Neudert, G.; Klebe, G. DSX: a knowledge-based scoring function for the assessment of protein–ligand complexes. J. Chem. Inf. Model. 2011, 51, 2731–2745.
(30) Krammer, A.; Kirchhoff, P. D.; Jiang, X.; Venkatachalam, C.; Waldman, M. LigScore: A novel scoring function for predicting binding affinities. J. Mol. Graphics Modell. 2005, 23, 395–407.
(31) Freund, Y.; Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. 1995.

(32) Friedman, J. Greedy function approximation: a gradient boosting machine. Annals of Statistics 2001, 1189–1232.
(33) Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. CoRR 2016, abs/1603.02754.
(34) Chen, T.; He, T. xgboost: eXtreme Gradient Boosting. R package version 0.4-2, 2015.
(35) Friedman, J.; Hastie, T.; Tibshirani, R. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Annals of Statistics 2000, 28, 337–407.
(36) Ramsundar, B.; Kearnes, S.; Riley, P.; Webster, D.; Konerding, D.; Pande, V. Massively multitask networks for drug discovery. arXiv preprint arXiv:1502.02072, 2015. https://arxiv.org/abs/1502.02072 (accessed Oct 14, 2017).
(37) Dahl, G. E.; Jaitly, N.; Salakhutdinov, R. Multi-task neural networks for QSAR predictions. arXiv preprint arXiv:1406.1231, 2014. https://arxiv.org/abs/1406.1231 (accessed Oct 14, 2017).
(38) Unterthiner, T.; Mayr, A.; Klambauer, G.; Steijaert, M.; Wegner, J. K.; Ceulemans, H.; Hochreiter, S. Deep learning as an opportunity in virtual screening. Proceedings of the Deep Learning Workshop at NIPS, 2014.
(39) Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 2014, 15, 1929–1958.
(40) Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M. TensorFlow: A System for Large-Scale Machine Learning. OSDI, 2016; pp 265–283.
(41) Chollet, F. Keras. 2015.

(42) Breiman, L. Random forests. Machine Learning 2001, 45, 5–32.
(43) Overington, J.; Al-Lazikani, B.; Hopkins, A. How many drug targets are there? Nat. Rev. Drug Discovery 2006, 5, 993–996.
(44) Gaulton, A.; Bellis, L.; Bento, A.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40, D1100–D1107.
