
Article

Descriptor Data Bank (DDB): A Cloud Platform for MultiPerspective Modeling of Protein-Ligand Interactions Hossam Ashtawy, and Nihar Ranjan Mahapatra J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00310 • Publication Date (Web): 29 Nov 2017 Downloaded from http://pubs.acs.org on December 4, 2017


Journal of Chemical Information and Modeling is published by the American Chemical Society, 1155 Sixteenth Street N.W., Washington, DC 20036. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.



Descriptor Data Bank (DDB): A Cloud Platform for Multi-Perspective Modeling of Protein-Ligand Interactions

Hossam M. Ashtawy* and Nihar R. Mahapatra*
Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824-1226, USA
E-mail: [email protected]; [email protected]



To whom correspondence should be addressed




Abstract

Protein-ligand (PL) interactions play a key role in many life processes such as molecular recognition, molecular binding, signal transmission, and cell metabolism. Examples of interaction forces include hydrogen bonding, hydrophobic effects, steric clashes, electrostatic contacts, and van der Waals attractions. Currently, a large number of hypotheses and perspectives for modeling these interaction forces are scattered throughout the literature and largely forgotten; had they been assembled and utilized collectively, they could have substantially improved the accuracy of predicting the binding affinity of protein-ligand complexes. In this work, we present Descriptor Data Bank (DDB), a data-driven cloud platform for facilitating multi-perspective modeling of PL interactions. DDB is an open-access hub for depositing, hosting, executing, and sharing descriptor extraction tools and data for a large number of interaction modeling hypotheses. The platform also implements a machine-learning (ML) toolbox for automatic descriptor filtering and analysis and for scoring function (SF) fitting and prediction. The descriptor filtering module removes irrelevant and/or noisy descriptors to produce a compact subset of all available features. We seed DDB with 16 diverse descriptor extraction tools developed in-house and collected from the literature. Together, the tools generate over 2,700 descriptors that characterize (i) proteins, (ii) ligands, and (iii) protein-ligand complexes. The in-house descriptors we extract are protein-specific; they are based on pairwise primary and tertiary alignment of protein structures followed by clustering and trilateration. We built and used DDB's ML library to fit SFs to the in-house descriptors and those collected from the literature, and we evaluated them on several data sets constructed to reflect real-world drug screening scenarios. We found that multi-perspective SFs, constructed from a large number of diverse DDB descriptors that capture various PL interactions in different ways, outperformed their single-perspective counterparts in all evaluation scenarios, with an average improvement of more than 15%. We also found that our proposed protein-specific descriptors improve the accuracy of SFs.






Table 1: List of abbreviations and acronyms

BA: Binding affinity
CLI: Command-line interface
Cr: PDBbind core test set
CSAR: Community structure-activity resource
CV: Cross validation
DDB: Descriptor Data Bank
E.F.x%: Screening enrichment factor obtained by considering the top x% ranked ligands
FSS: Feature subset selection
HTS: High-throughput screening
Kd: Dissociation constant
Ki: Inhibition constant
LCO: Leave clusters out
ML: Machine learning
MP: Multi-perspective
NRMSE: Normalized root-mean-square error
OOB: Out-of-bag data points in random forest models
PDB: Protein Data Bank
PL: Protein-ligand
PLC: Protein-ligand complex
Pr: PDBbind primary training set
QSAR: Quantitative structure-activity relationship
REPAST: Receptor primary structure similarity via trilateration
RETEST: Receptor tertiary structure similarity via trilateration
Rp: Pearson correlation coefficient
Rs: Spearman correlation coefficient
RMSD: Root-mean-square deviation
RMSE: Root-mean-square error
SBDD: Structure-based drug design
SD: Standard deviation
SF: Scoring function
SN%: Docking success rate obtained by considering the top N scored ligand poses
VS: Virtual screening

Introduction

Background

In the field of structure-based drug design, the success of molecular docking and virtual screening campaigns is highly dependent on the accuracy and throughput of the scoring functions (SFs) employed. Dozens of SFs have been developed in the past 20 years to



improve predictive accuracy and speed. Current SFs employed in commercial and free molecular docking software fall into four main categories: force-field-based, 1 knowledge-based, 2 empirical, 3 and machine-learning SFs. 4–7 Knowledge-based SFs estimate the free energy of binding using distance-dependent pairwise contacts or potentials between the protein and its ligand; these potentials are derived from a database of protein-ligand complexes with solved structures. Force-field-based SFs rely on force fields (or energy potentials) to calculate the free energy change during protein-ligand binding directly, using a predefined set of parameters and equations that include van der Waals, electrostatic, and solvation energy terms. Empirical SFs are composed of energy terms similar to those employed in force-field SFs; the main difference between the two families is that empirical SFs rely on multivariate linear regression and experimental data to calculate the contribution of each energy term to the final binding affinity of the complex. Machine-learning approaches extend empirical SFs by using more sophisticated statistical fitting algorithms and by taking a larger number of energy terms or descriptors into account to predict binding affinity. Most of the knowledge-based, force-field, and, to some extent, empirical SFs in use today attempt to reduce a very complex system of intermolecular forces to a small number of terms combined in a simple additive manner to calculate the free energy of binding. The empirical approach improves upon its knowledge-based and force-field counterparts by relying on experimental data and linear regression to express binding affinity as a weighted sum of energetic terms. Recent studies have shown that empirical SFs are, on average, more accurate at predicting binding affinity than knowledge-based and force-field techniques.
8,9 However, the empirical linear model is still too rigid and simple to reliably model the free energy of such a complex system; this rigidity prevents it from fully exploiting new experimental data, which continues to grow. A new family of SFs based on machine learning has been introduced in the past few years to address the shortcomings of the simple empirical approaches. 4,6,10,11 These SFs are typically based on






highly nonlinear predictive models, such as ensembles of decision trees and neural networks, fitted to protein-ligand complexes with known experimental binding data. The complexes are typically characterized by a larger number of descriptors than those used by the other three categories of SFs. 4,6,8,10,11 Recent machine-learning (ML) SFs include, in one way or another, almost all the energy terms and potentials that knowledge-based (KB), force-field (FF), and empirical (Em) SFs use. 4,6,10,11 In fact, all types and forms of energy terms that have been developed in the literature for empirical SFs can, in principle, be directly included as individual descriptors in ML SFs given enough training data. In this respect, ML SFs can be regarded as a superset of KB, FF, and Em SFs. As opposed to the linear regression models used in empirical SFs, some ML approaches have higher learning capacity and can better handle high-dimensional data due to their superior bias-variance trade-off. The more interactions they model and the more experimental data they are trained on, the more potential ML SFs have for making accurate predictions. Nowadays, there is an abundance of public biochemical data, be it drug-like molecular libraries, 12,13 solved protein structures, 14–19 or high-throughput screening assays, 20,21 all of which continue to grow in size, quality, and diversity. However, despite such advances in data availability, we still do not know how to comprehensively characterize the raw proteins, ligands, and protein-ligand complexes in these databases in terms of descriptors (features) suitable for virtual screening applications such as developing accurate SFs. Developers of empirical and ML SFs typically build in-house tools to extract descriptors, which is the most time-consuming stage of SF design. In this work, we present a cloud platform that provides SF designers with the tools necessary to easily and efficiently calculate a large number of diverse descriptors.
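As a minimal illustration of the empirical approach described above (binding affinity expressed as a weighted sum of energy terms, with weights obtained by multivariate linear regression against experimental data), the sketch below fits such a model with ordinary least squares. All term values and affinities are invented for illustration; this is not any published SF.

```python
import numpy as np

# Hypothetical energy terms for four complexes (columns could stand for
# H-bond, hydrophobic, and van der Waals terms). Values are synthetic.
X = np.array([
    [2.0, 5.1, 0.8],
    [1.1, 3.0, 1.9],
    [3.2, 6.5, 0.4],
    [0.5, 2.2, 2.5],
])
y = np.array([6.3, 4.8, 7.9, 3.6])  # synthetic experimental affinities (pKd)

# Add an intercept column and solve the least-squares problem:
# y ~ w0 + w1*term1 + w2*term2 + w3*term3
A = np.hstack([np.ones((X.shape[0], 1)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def empirical_sf(terms: np.ndarray) -> float:
    """Score a complex as the fitted weighted sum of its energy terms."""
    return float(w[0] + terms @ w[1:])
```

The rigidity discussed in the text is visible here: no matter how much data is added, the model can only reweight the same fixed, additive terms.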

The need for a library of descriptor extraction tools

Almost all databases of small compounds, proteins, or protein-ligand complexes host molecular data in raw form. These molecules must be characterized comprehensively by



informative descriptors before fitting scoring functions. Descriptors are simple models 1 for quantifying or characterizing different molecular properties and protein-ligand interactions such as hydrogen bonding, hydrophobic forces, steric clashes, and electrostatic contacts. Such interactions are too complex to represent precisely, and there is no consensus on a standard model for them. As a result, a large number of hypotheses have been proposed in the literature to characterize the same types of intermolecular forces. A hypothesis is essentially a single perspective of the interaction it attempts to model as a descriptor. Developers of empirical SFs generally rely on their experience, intuition, and analysis of data sets to derive a hypothetical functional form for each descriptor. Despite the relatively large number of descriptors that could be computed efficiently from proteins, ligands, and protein-ligand complexes, most existing SFs utilize only a very small set of orthogonal descriptors characterizing the various interaction factors. For example, among popular commercial and academia-developed SFs, D-Score 1 uses two descriptors; LigScore, 22 PLP, 23 and G-Score 24 use three; LUDI 25 and ChemScore-2 26 four; Jain, 27 F-Score, 25 and AffiScore 28 five; X-Score 3 six; GoldScore 29 seven; ChemScore-1 29 nine; and GlideScore 30 ten. Furthermore, it is uncommon for an empirical SF to include more than one hypothetical model for each type of interaction it attempts to take into account. The types of interactions incorporated in current SFs also vary from one to another. For example, the empirical SF ChemPLP 31 accounts for a protein's solvation and flexibility characteristics, among other factors; X-Score, on the other hand, models neither of these interactions but instead includes a hydrophobicity energy term that is not present in ChemPLP. In addition to accounting for different interactions, two SFs that account for the same type of interaction may model it differently. For example, hydrogen bonding and hydrophobic contacts are modeled differently by X-Score, 3 AffiScore, 28 and RF-Score. 6 Such variation in the characterization of the same interactions is usually due to differences in the theoretical model of the interactions, the parameters of the model, or the level of detail or granularity of the modeling

1 Descriptor models should not be confused with predictive models such as machine-learning scoring functions.






approach, etc. The lack of a sufficient understanding of the forces that bring proteins and ligands together, and the need to keep the individual interaction models simple enough to enable high-throughput screening, are the main reasons behind such discrepancies. Although there is overlap in the types of interactions that the features in various SFs attempt to capture, they still represent different ways or perspectives of capturing protein-ligand binding. Current empirical SFs, such as X-Score, ChemPLP, and AffiScore, treat the hypothetical models of interactions as competing, mutually exclusive perspectives instead of utilizing them collectively as cooperative descriptors. This is evident from the small number of descriptors normally considered by traditional SFs. An important reason existing SFs use very few features is to avoid overfitting in the simple, often linear, models employed, especially when trained on relatively small datasets. In addition, the interpretation of linear SFs becomes harder as more factors are added, due to high dimensionality and potential multicollinearity, especially when the factors account for similar interactions. More importantly, implementing interaction terms from the literature is very challenging unless the modeling software is easily accessible or the interaction model is simple and well described. Even when the software is available, users sometimes find it difficult to locate, obtain, install, and use packages developed by others. To solve these problems, in this work we introduce a descriptor platform that enables easy access, execution, and sharing of descriptor extraction tools via simple portals.

Methods

The main design goal of our descriptor library is to extract descriptors for three molecular entities: ligands, proteins, and protein-ligand complexes (PLCs). Based on our experience working with empirical and ML SFs, we have observed that the vast majority of them include descriptors specific to ligands and PLCs. Protein-specific descriptors are underrepresented in the feature space, and there is a need to fix this imbalance. Therefore, in this section,




we present our novel approach for deriving similarity descriptors based on the primary and tertiary structures of proteins. We then describe the proposed Descriptor Data Bank (DDB) platform that facilitates integration of descriptors of different types and from various sources via multi-perspective modeling.

Receptor-specific descriptors based on structure similarity and trilateration

The well-known similar property principle in cheminformatics states that “similar compounds have similar properties,” i.e., molecules or compounds that are similar in descriptor space (which may be based on functional groups, structure, etc.) are likely to have similar biological activity. 32 This principle underpins the similarity-based virtual screening approach employed in ligand-based design, wherein compounds similar to a query compound (a known active ligand) are retrieved from a compound database for enrichment. 33 We propose target-specific descriptors based on the primary and tertiary structure similarities of receptors.

Figure 1: The workflow for generating the receptor-based similarity descriptors REPAST and RETEST. (1) All-against-all similarity of PDB proteins yields a similarity matrix; (2) the proteins are clustered and centroid proteins are selected; (3) one-against-all similarity of a test protein to the centroids yields the similarity features.

Homologous proteins may have similar functions and hence may be triggered by related




or the same biological signals. Proteins that are structurally and functionally similar are more likely to be expressed in similar cells. Cellular environmental factors, such as the presence and concentration of different molecules and charged ions, could have a bearing on the binding process. Such factors are typically not captured when the structures of proteins or protein-ligand complexes are solved and their binding affinity determined. We believe that extracting protein-similarity-based features could implicitly model environmental factors and intelligently encode the location of the target receptor in the protein family space. Distantly homologous proteins that have low sequence similarity may have similar tertiary structures and, as a result, might bind analogous ligands. To capture such information, we extract two types of structure-specific descriptors: one based on receptor primary structure similarity via trilateration (REPAST) and a second based on receptor tertiary structure similarity via trilateration (RETEST). Figure 1 illustrates the strategy we follow to extract REPAST and RETEST descriptors. First, a set of diverse protein families from PDBbind 2007 is prepared for all-against-all pairwise alignment. We use BLAST 34 for primary sequence alignment and TM-Align 35 to compute tertiary structural similarity between proteins. The similarity scores associated with the pairwise alignments are recorded in a similarity matrix, which is then clustered into a predefined number of groups. We use k-means to group the training proteins into 80 clusters; this number was chosen by a grid search using 10-fold cross validation on the training data. The centroid proteins of the clusters are chosen as cluster prototypes or representatives. Finally, for a test PLC, the similarity of its protein to each cluster prototype is computed using BLAST and TM-Align to yield the similarity descriptors. The distances (inverses of the BLAST and TM-Align similarity scores) of the target protein from the 80 protein prototypes determine its location in the protein family space; the family of the test protein is thus determined indirectly via trilateration. Two structurally or functionally homologous proteins are likely to have similar REPAST and/or RETEST descriptor vectors and hence to be separated by a small Euclidean distance. Representing




large molecules, such as proteins, as descriptor vectors is an efficient and useful technique for ML SFs. In addition to characterizing the protein as a whole molecule, this approach can also be applied to generating submolecular descriptors. In the future, we plan to extend this strategy to better characterize the sequence of the receptor's binding site and the secondary structure of the domains near the cavity region.
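The clustering-and-trilateration workflow above can be sketched as follows. The code assumes a precomputed all-against-all similarity matrix (e.g., from BLAST or TM-Align scores) and uses a minimal k-means with deterministic farthest-point seeding; the function names are ours, not part of DDB, and cluster representatives are chosen as medoids as a simplification.

```python
import numpy as np

def select_prototypes(sim: np.ndarray, k: int, n_iter: int = 50) -> list:
    """Cluster proteins by their similarity profiles (rows of the
    all-against-all matrix) and return one medoid index per cluster."""
    n = sim.shape[0]
    # Deterministic farthest-point seeding of k cluster centers.
    idx = [0]
    for _ in range(k - 1):
        d = np.min(((sim[:, None, :] - sim[idx][None]) ** 2).sum(-1), axis=1)
        idx.append(int(np.argmax(d)))
    centers = sim[idx].astype(float)
    for _ in range(n_iter):
        # Assign each protein to the nearest center in profile space.
        labels = np.argmin(((sim[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                centers[c] = sim[members].mean(axis=0)
    prototypes = []
    for c in range(k):
        members = np.where(labels == c)[0]
        if len(members):
            # Medoid: the member closest to the cluster mean profile.
            best = members[np.argmin(((sim[members] - centers[c]) ** 2).sum(axis=1))]
            prototypes.append(int(best))
    return prototypes

def similarity_descriptors(test_sims: np.ndarray, prototypes: list) -> list:
    """One-against-all step: the test protein's similarity to each
    prototype becomes one REPAST/RETEST-style descriptor."""
    return [float(test_sims[p]) for p in prototypes]
```

In DDB the matrix would hold BLAST or TM-Align scores and k would be 80; here any symmetric similarity matrix and small k will do.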

Multi-perspective modeling of protein-ligand interactions

To date, SF design has primarily focused on hypothesis-driven modeling of the binding mechanism using a set of descriptors specific to each SF. Consequently, a wide variety of descriptor sets representing diverse and overlapping hypotheses or perspectives have been developed over time. These perspectives represent valuable knowledge about the binding mechanism that has never before been brought together in one place and systematically analyzed. Given the high impact of PLC datasets such as RCSB PDB 36 and related ones like PDBbind 19 and CSAR 37 on drug discovery research and benchmarking, there is a clear need for a comparable resource for molecular descriptors representing multiple perspectives. We propose Descriptor Data Bank (DDB), a resource that facilitates multi-perspective modeling of protein-ligand interactions via crowd-sourcing.
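In practice, multi-perspective modeling amounts to concatenating the descriptor blocks produced by different tools into one feature matrix before fitting an ML SF. A minimal sketch (the tool symbols follow Table 2; the array shapes are invented, and `assemble` is our illustrative name):

```python
import numpy as np

# Per-tool descriptor blocks for the same five complexes
# (rows = complexes, columns = that tool's descriptors). Shapes invented.
perspectives = {
    "x": np.zeros((5, 6)),   # e.g., X-Score terms
    "a": np.zeros((5, 5)),   # e.g., AffiScore terms
    "b": np.zeros((5, 80)),  # e.g., REPAST similarity features
}

def assemble(perspectives: dict) -> tuple:
    """Stack descriptor blocks column-wise and record each column's
    source tool so that filtered subsets remain traceable."""
    blocks, sources = [], []
    for symbol, block in sorted(perspectives.items()):
        blocks.append(block)
        sources.extend([symbol] * block.shape[1])
    return np.hstack(blocks), sources

X, sources = assemble(perspectives)
# X now holds one multi-perspective row per complex.
```

Keeping the per-column source labels is what lets downstream filtering report which perspectives survived selection.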

Descriptor Data Bank (DDB)

Descriptor Data Bank (DDB) is designed to host and share descriptor calculation tools and data to develop diverse and comprehensive multi-perspective models of protein-ligand interactions, to the extent permissible by licensing restrictions. A machine-learning toolbox is also implemented in DDB to provide in-place automatic descriptor filtering and scoring function development. Its multi-modular architecture, illustrated in Figure 2, provides a framework for meeting DDB's design goals in terms of versatile user interfaces, services, user programs and data, and system platforms. DDB is accessible online at www.descriptordb.com and can also be installed locally as a standalone PyMOL plug-in and command-line programs.

Figure 2: The system architecture of Descriptor Data Bank. The front-end offers web, PyMOL, and CLI interfaces; the back-end hosts molecule datasets, descriptor extractors, descriptors, and the ML toolbox for descriptor filtering and SF construction on PC, HPC, and cloud platforms.

These three separate front-end channels act on user commands and raw data and deliver processed data and results. Data processing, storage, and retrieval take place in the back-end subsystem, where databases of molecules, descriptor extraction programs, descriptors, ML algorithms, and results are maintained. DDB's front-end and back-end subsystems communicate through a unified interface with two Python-based programs and system-dependent libraries and tools. The two Python programs are responsible for (i) executing descriptor extraction code donated by DDB users via well-defined interfaces and (ii) facilitating descriptor filtering, responding to SF development requests, and generating results. We discuss the main features of DDB and its design in the following sections.

Descriptor tools and data: deposition, extraction, and sharing

DDB's online platform provides an upload portal that enables users to easily deposit their descriptor extraction programs and data. The descriptor extraction programs can be written in any programming language. To ensure consistency and quality, the donated programs and




data must adhere to DDB's packaging and naming conventions, which are provided on the website. Once a user deposits a descriptor extraction program, it is inspected for security and computational resource requirements. If it passes these tests, the program becomes available for use and download. It will then appear on the descriptor extraction page (see Figure 3) as an additional available descriptor set that can be calculated for a given molecular dataset. Users can use sample molecule datasets hosted on DDB or upload their own. Instructions regarding data formatting, naming conventions, and the procedure for uploading molecule datasets are provided on the website. Users can also download the deposited descriptor extraction programs, molecule datasets, or raw descriptors via the download page. Figure 2 illustrates the flow, storage, and handling of molecule datasets, descriptor extraction programs, and raw descriptors.

Figure 3: The web interface for Descriptor Data Bank.

Figure 4: The graphical (left) and command-line (right) user interfaces of Descriptor Data Bank.

The standalone PyMOL plug-in and command-line programs are also capable of descriptor extraction, as shown in Figure 4. Users can download descriptor extraction programs from DDB's website to extend the functionality of a local DDB installation and keep it up to date. The PyMOL plug-in and command-line programs include, by default, some of the descriptor extraction tools available in DDB. We seed DDB with a rich collection of diverse descriptors and descriptor extraction tools of all three types (those that depend on the protein only, the ligand only, and the protein-ligand complex) from three different sources that together more fully capture the complex nature of protein-ligand interaction: (a) descriptors used in existing SFs; for some of these (e.g., the AutoDock Vina, 38 NNScore, 10 and Cyscore 39 descriptors) we have access to descriptor extraction tools, and others we implement ourselves based on descriptions in the literature; (b) the new molecular (and, in the future, submolecular) similarity descriptors such as REPAST and RETEST discussed earlier; and (c) molecular descriptors and physicochemical properties that have been employed in ligand-based screening and QSAR, such as PaDEL 40 and ECFP. 41

Descriptor filtering and analysis

Because of DDB's open-access nature, some of the descriptors it hosts may be noisy or irrelevant for drug discovery and design applications. Improper





Table 2: The 16 descriptor types and scoring functions that DDB is seeded with, and the molecular docking software in which they are implemented

Descriptor set/SF | Symbol 1 | Molecule(s) 2 | Software | Type of SF | Reference
AffiScore | a | PLC | SLIDE | Empirical | 27
AutoDock Vina | u | PLC | AutoDock Vina | Empirical | 38
ChemGauss | h | PLC | OpenEye | Gaussian | 42
Cyscore | c | PLC | Standalone | Empirical | 39
DPOCKET | f | PLC | FPOCKET | Descriptors | 43
DSX | d | Preds. | Standalone | Knowledge based | 44
ECFP | e | L | RDKit | Descriptors | 41
GOLD | g | PLC | GOLD | Descriptors | 45
LigScore | l | Preds. | IMP | Knowledge based | 22
NNScore | n | PLC | Standalone | Machine-learning | 10
PaDEL | p | L | Standalone | Descriptors | 40
REPAST | b | P | Standalone | Descriptors | descriptordb.com
RETEST | t | P | Standalone | Descriptors | descriptordb.com
RF-Score | r | PLC | Standalone | Machine-learning | 6
SMINA | s | PLC | Standalone | Empirical | 46
X-Score | x | PLC | Standalone | Empirical | 3

1 The symbol of the SF or extraction tool is used to identify the source of descriptors in Figure 7. 2 The entity characterized by the descriptors: P for protein, L for ligand, and PLC for protein-ligand complex. When access to the descriptors of an SF is not available, we use the SF's final predictions (denoted Preds.) as a surrogate representation for the missing descriptors.

modeling of some interactions, bugs in a donated descriptor extraction program, or other artifacts can introduce noise into the computed descriptors. To guard against noisy data and ensure the robustness of DDB's ML SFs, DDB provides an automatic descriptor filtering portal. Users have the option to experiment with DDB's raw descriptors or to filter them before fitting an ML SF. Descriptor filters can be stored in DDB for later use or shared with other developers for SF development. DDB can build ensemble-learning-based SFs that are less prone to overfitting than their empirical counterparts. Ensemble models are more forgiving when training complexes are characterized by a large number of descriptors and/or when some of the features are noisy or not very relevant to the property being predicted. Nevertheless, the predictive performance of ensemble ML SFs can be enhanced by utilizing only the most relevant and informative descriptors. In addition to the accuracy gain, the other benefit of employing a compact set of features is computational




Journal of Chemical Information and Modeling

efficiency. Speed is of utmost importance in molecular docking and virtual screening, where interaction modeling is perhaps the most time-consuming phase of the process. Substantial computational savings can therefore be achieved by using fewer features, i.e., by ignoring those redundant descriptors that are slower to extract. The savings could be spent exploring the conformational space more thoroughly during docking, or employed to screen more ligands. The descriptor filtering algorithm implemented in DDB is based on embedded feature-subset selection, in which the search for an optimal subset of descriptors is built into the ML algorithm. The algorithm relies on Random Forest's automatic variable-importance calculation, 47 in conjunction with in-house iterative backward elimination, to remove redundant or noisy descriptors and retain only the descriptor subset most relevant to the given prediction task. Our algorithm employs fitness functions that allow task-specific descriptor-subset selection. The selected subset of descriptors can be optimized for any or all of the following drug-discovery tasks: identifying the native or near-native binding pose during docking, predicting binding affinity, and enriching compound databases with active compounds. After filtering, DDB outputs results in the form of summary tables and graphs. These reports shed light on the effect of various interaction forces and other factors on the bioactivity of interest. Descriptor filtering and analyses can also be conducted using the PyMOL plug-in and command-line interface (CLI), as shown in Figure 4. We plan to extend the descriptor filtering algorithm so that, when descriptors are equally accurate but redundant, it prefers those that are faster to calculate.
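As an illustration, the iterative backward-elimination idea can be sketched in a few lines of Python (a simplified stand-in using scikit-learn's RandomForestRegressor; the retained-subset size, drop fraction, and toy data are our assumptions, not DDB's actual settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def backward_eliminate(X, y, keep=5, drop_frac=0.25, seed=0):
    """Iteratively drop the least-important descriptors until `keep` remain.

    Returns the column indices of the retained descriptors.
    """
    idx = np.arange(X.shape[1])
    while len(idx) > keep:
        rf = RandomForestRegressor(n_estimators=50, random_state=seed)
        rf.fit(X[:, idx], y)
        order = np.argsort(rf.feature_importances_)      # ascending importance
        n_drop = min(max(1, int(len(idx) * drop_frac)),  # drop a fraction per round,
                     len(idx) - keep)                    # but never go below `keep`
        idx = np.delete(idx, order[:n_drop])
    return idx

# Toy data: the response depends only on the first two of 20 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=300)
selected = backward_eliminate(X, y, keep=5)
```

The two informative descriptors survive every elimination round because their permutation-style importances dominate the noise columns.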

Machine-learning scoring functions in DDB We provide several machine-learning algorithms in DDB to develop SFs using descriptors hosted on the platform. The ML algorithms available include extreme gradient boosting trees


(XGBoost), 48 gradient boosting trees (GBT), 49 random forest (RF), 47 support vector machines (SVM), 50 multivariate linear regression (MLR), and deep neural networks (DNN). 11 These algorithms are implemented in the Python libraries XGBoost, 48 Scikit-learn, 51 and TensorFlow. 52 DDB provides a unified and simple interface for constructing ML SFs that hides the details of the underlying Python ML libraries. Users can select the desired learning algorithm, the PLC training and test datasets, and the sets of descriptors used to characterize these complexes. Users can also choose a precalculated descriptor filter so that the ML algorithm is fitted to only the most important descriptors. Upon completion of the fitting process, predictions are calculated for the training and test complexes, and summary performance and processing-time reports are produced. The results are saved in DDB for later use and can be downloaded and shared with other DDB users. ML SF fitting and prediction can also be performed locally on the user's PC via the PyMOL plug-in and/or CLI, as shown in Figure 4.
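A local approximation of this fitting-and-reporting workflow might look as follows (a hedged sketch using scikit-learn's gradient-boosted trees on synthetic descriptors; `fit_scoring_sf` and the toy data are hypothetical and not part of DDB's API):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def fit_scoring_sf(X, y, seed=0):
    """Fit a gradient-boosted-tree scoring function; report held-out Pearson Rp."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    sf = GradientBoostingRegressor(n_estimators=200, random_state=seed)
    sf.fit(X_tr, y_tr)
    rp = np.corrcoef(sf.predict(X_te), y_te)[0, 1]  # Pearson Rp on held-out complexes
    return sf, rp

# Synthetic descriptor matrix standing in for concatenated DDB descriptor sets.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 30))
y = X[:, :5].sum(axis=1) + 0.2 * rng.normal(size=500)
sf, rp = fit_scoring_sf(X, y)
```

Swapping `GradientBoostingRegressor` for `xgboost.XGBRegressor`, `RandomForestRegressor`, or a neural network leaves the rest of the workflow unchanged, which is essentially the uniformity the DDB interface provides.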

Main training and test protein-ligand complexes We use the 2014 version of the molecular database PDBbind 15,53 for building and testing task-specific scoring functions on protein-ligand complexes characterized by the proposed multi-perspective descriptors from DDB. PDBbind is a selective compilation of the Protein Data Bank (PDB). 14 Both databases are publicly accessible and regularly updated. PDB is periodically mined, and only complexes that are suitable for drug discovery are filtered into PDBbind. The refined set of PDBbind 2014 contains 3,446 high-quality protein-ligand complexes with experimental binding-affinity data collected from the literature. An important goal in this work is to objectively estimate the generalization accuracy of the proposed SFs based on their degree of familiarity with the test proteins and protein-ligand complexes. Given sufficiently large and diverse training data, ML SFs are expected to generalize well to novel proteins not present in their training set if they are able to learn


the protein-ligand interactions at the fragment and atomic levels. On the other hand, the proposed SFs would produce inaccurate predictions for novel targets if they learn the relationship between activities and the interacting molecules only at the macro level, since they would then fail to find similar proteins in their training (or rather, "memorizing") experience. Accordingly, we designed three different training-test sampling strategies from the 3,446-complex refined set of PDBbind (version 2014) to evaluate our SFs in real-world modeling scenarios based on the degree of overlap between the training and test proteins.

PDBbind core test set: novel complexes with known protein targets From the 3,446 PLCs in the refined set of PDBbind 2014, the curators of the database built a test set with maximum diversity in terms of protein families and binding affinities. The test set was constructed as follows. (i) The refined set was first clustered based on 90% primary-structure similarity into protein families. (ii) Clusters with five or more proteins were then examined based on their ligands' binding-affinity values. From each of these PLC clusters, three complexes were selected: the PLC whose ligand has the lowest binding affinity (BA), the one with the highest BA, and the one closest to the mean BA of that cluster. (iii) The three PLCs were added to the test set if the complexes with the highest and lowest BA values are at least two orders of magnitude apart in terms of dissociation (Kd) or inhibition (Ki) constants. This resulted in the selection of 65 different protein families, where each protein family binds three different ligands with varying BA values, forming a set of 195 unique PLCs. This set is called the core test set (Cr) and has been a popular benchmark for many recent docking and scoring systems. 6–9,11 Due to its popularity, the 2015 core set has been kept identical to its predecessor in PDBbind 2014 so that results of different studies can be compared. We form the training set corresponding to Cr, referred to as the primary training set and denoted by Pr, by removing all Cr complexes from the 3,446 complexes of the refined set. As a result, Pr contains (3,446 − 195 =) 3,251 complexes that are disjoint from Cr. In our experiments, we fixed the base sizes of training and test sets at 3,000


and 195 PLCs, respectively. The 3,000 PLCs are randomly sampled from the 3,251 PLC set to help evaluate the variance of ML SFs. The core test set complexes are considered targets that are “known” to scoring functions trained on Pr due to the overlap between training and test proteins. More specifically, for each protein in the 65 protein clusters of Cr, there is at least one identical protein present in the primary Pr training data, albeit bound to a different ligand.
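The three-step core-set construction described above can be sketched as follows (a simplified illustration; the `build_core_set` helper, the cluster dictionary, and the toy BA values are hypothetical):

```python
def build_core_set(clusters, min_size=5, min_spread=2.0):
    """Select three PLCs (lowest, highest, nearest-to-mean BA) per large cluster.

    `clusters` maps a protein-family id to a list of (plc_id, ba) pairs, with
    ba in -log Kd/Ki units, so a spread of 2.0 is two orders of magnitude.
    """
    core = []
    for fam, plcs in clusters.items():
        if len(plcs) < min_size:          # step (ii): only clusters of >= 5
            continue
        ranked = sorted(plcs, key=lambda p: p[1])
        lo, hi = ranked[0], ranked[-1]
        if hi[1] - lo[1] < min_spread:    # step (iii): require BA spread
            continue
        mean_ba = sum(ba for _, ba in plcs) / len(plcs)
        mid = min(plcs, key=lambda p: abs(p[1] - mean_ba))
        core.extend([lo[0], mid[0], hi[0]])
    return core

clusters = {
    "kinase": [("p1", 3.0), ("p2", 5.0), ("p3", 6.1), ("p4", 7.9), ("p5", 9.0)],
    "protease": [("q1", 4.0), ("q2", 4.5), ("q3", 4.9), ("q4", 5.1), ("q5", 5.3)],
    "small": [("r1", 2.0), ("r2", 8.0)],
}
core = build_core_set(clusters)  # the protease cluster fails the spread test,
                                 # the small cluster fails the size test
```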

Multi-fold cross-validation: novel complexes with known and novel protein targets Although the training set Pr and test set Cr are disjoint at the protein-ligand complex level, they overlap at the protein-family level. As mentioned above, each three-PLC cluster in Cr was sampled from a larger (≥ 5-member) protein-family cluster in the refined set, and the remaining two or more PLCs are part of Pr. We also developed test sets that are more stringent in terms of protein-family overlap between training and test sets. One testing technique is based on 10-fold cross-validation (CV), where the refined set of PDBbind 2014 is shuffled randomly and then partitioned into 10 non-overlapping sets of equal size. During training and validation, one of the ten sets is used for testing and the remaining nine are combined for training. Once training and testing complete for this fold, the same process is repeated for the other nine folds, one at a time. In a typical 10-fold CV experiment on a dataset of 3,446 PLCs, the training and test sizes of the 10 folds are (3,446 × 9/10 ≈) 3,101 and (3,446 × 1/10 ≈) 345 complexes, respectively. To prevent dataset size from affecting our results and to match the sizes of Pr and Cr, we sample 3,000 of the 3,101 training complexes and 195 of the corresponding 345 test complexes in each fold. The performance results on the 10 test folds are then averaged and reported in the Results section. Unlike the overlap in Cr, due to the randomness of CV, not every protein family is necessarily present in both the training and test sets across the ten folds. Therefore, some


test proteins may actually be novel for the scoring function while others may be present in the training data. If it is present, the protein will be bound to a different ligand in the training set.
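The 10-fold CV sampling with fixed 3,000/195 train/test sizes can be sketched as follows (a minimal illustration; `cv_splits` and its defaults are our naming, not the paper's code):

```python
import random

def cv_splits(plc_ids, n_folds=10, n_train=3000, n_test=195, seed=0):
    """Shuffle, partition into folds, then subsample fixed-size train/test sets."""
    rng = random.Random(seed)
    ids = list(plc_ids)
    rng.shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]   # 10 disjoint partitions
    for k in range(n_folds):
        test_pool = folds[k]
        train_pool = [p for j, f in enumerate(folds) if j != k for p in f]
        yield (rng.sample(train_pool, min(n_train, len(train_pool))),
               rng.sample(test_pool, min(n_test, len(test_pool))))

splits = list(cv_splits(range(3446)))  # one (train, test) pair per fold
```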

Leave clusters out: novel complexes with novel protein targets Leave-clusters-out (LCO) is our third approach to constructing the training and test datasets; it is based on the novel-targets experiment of Ashtawy and Mahapatra 5 and the leave-one-cluster-out (LOCO) method of Kramer et al. 54 In this approach, the refined-set complexes are partitioned based on the proteins' primary-structure similarity such that the training and test partitions have zero overlap in protein families. More specifically, the datasets were constructed as follows. The 3,446 PLCs of the refined set were clustered based on a 90% similarity cut-off of the proteins' primary structures. A total of 1,135 clusters were found, of which the largest five are: HIV protease (262 PLCs), carbonic anhydrase (170 PLCs), trypsin (98 PLCs), thrombin (79 PLCs), and factor XA (56 PLCs). There are 9 clusters with 30 PLCs or more, including the aforementioned five, 43 of size 10 or larger, and 1,092 clusters with fewer than 10 PLCs, the majority of which (644 clusters) consist of a single PLC. We created 20 different training-test dataset pairs based on this distribution of protein families. For the nine most frequent protein families (with ≥ 30 PLCs each), we created nine corresponding test sets, each consisting of a single family. For each of these nine family-specific test sets, we randomly sampled 3,000 PLCs from the refined set such that none of them contained the protein family under test. From the less populous clusters (i.e., those with < 30 PLCs each), we randomly sampled eleven 195-PLC test datasets. These test sets could contain complexes from (1,135 − 9 =) 1,126 different protein families, mostly singletons (i.e., proteins that occur only once in the dataset). For each of these eleven multi-family test sets, we again sampled another 3,000 PLCs from the refined set such that none of them contained the protein families under test. The justification for selecting these particular numbers and sizes of datasets is twofold. First, we wanted about half of the mean


performance on these test sets to result from the single-family test sets and the other half from the multi-family sets. Second, averaging over a larger number of reasonably sized test sets reduces the possibility of skewing the final performance due to potential outliers on a few small test sets. Test proteins are considered novel here because of the constraint imposed on the sequence similarity between them and the training proteins, which must be less than 90%. Therefore, all proteins in a test cluster are novel to scoring functions fitted to training proteins sampled using the LCO strategy.
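A rough sketch of the LCO split for the nine single-family test sets follows (the `lco_splits` helper and toy clusters are hypothetical; the real experiment additionally builds eleven multi-family test sets, which this sketch omits):

```python
import random

def lco_splits(plc_clusters, big=30, n_train=3000, seed=0):
    """Yield (family, train, test) triples with zero protein-family overlap.

    `plc_clusters` maps a family id to a list of PLC ids. One test set is made
    per family with >= `big` members; training PLCs come from all other families.
    """
    rng = random.Random(seed)
    for fam, members in plc_clusters.items():
        if len(members) < big:
            continue
        rest = [p for f, ms in plc_clusters.items() if f != fam for p in ms]
        yield fam, rng.sample(rest, min(n_train, len(rest))), list(members)

clusters = {"hiv": [f"h{i}" for i in range(40)],
            "ca2": [f"c{i}" for i in range(35)],
            "misc": [f"m{i}" for i in range(20)]}
splits = list(lco_splits(clusters))  # "misc" is too small to be a test family
```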

Scoring accuracy: datasets and metrics We assess the scoring accuracy of SFs based on their ability to reproduce experimental BA values for complexes they have not encountered in their training data. The aforementioned training- and test-set sampling strategies are used for this purpose. Training and test dataset pairs generated through Pr-Cr partitioning or through CV and LCO sampling were all derived from the refined-set PLCs, whose BA data are expressed in −log Kd or −log Ki units and range from 2 to 11.96. The scoring accuracy of SFs will be expressed in terms of Pearson (Rp) and Spearman (Rs) correlation coefficients as well as the normalized root-mean-square error (NRMSE 2) between their predictions and true BA values.
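The three scoring metrics can be computed as follows (a minimal numpy sketch; the rank-based Spearman here ignores ties for brevity, and the BA range 2 to 11.96 is taken from the footnote's NRMSE normalization):

```python
import numpy as np

def _ranks(a):
    """Rank transform (ties ignored for brevity)."""
    order = np.argsort(a)
    ranks = np.empty(len(a))
    ranks[order] = np.arange(len(a))
    return ranks

def scoring_metrics(y_true, y_pred, ba_min=2.0, ba_max=11.96):
    """Pearson Rp, Spearman Rs, and NRMSE = RMSE / (BAmax - BAmin)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rp = np.corrcoef(y_true, y_pred)[0, 1]
    rs = np.corrcoef(_ranks(y_true), _ranks(y_pred))[0, 1]
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return rp, rs, rmse / (ba_max - ba_min)

rp, rs, nrmse = scoring_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
```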

Screening accuracy: datasets and metrics The screening accuracy of the proposed SFs will be estimated based on their ability to discriminate between active and inactive ligands with respect to the test protein. We derive our training and test sets for screening SFs from the 2014 refined set and the 2007 core set of PDBbind. As explained above, the 2014/2015 core set consists of 65 protein families, each of which is bound to three different ligands with known and diverse binding-affinity values. These are the active compounds that an ideal SF would identify. The inactive molecules for

2 NRMSE = RMSE/(BAmax − BAmin), where BAmax = 11.96 and BAmin = 2.00 are the largest and smallest BA values of the training complexes, respectively.


each protein target are assumed to be the crystallized ligands of the other (65 − 1 =) 64 receptors in the test set. Although this assumption requires further investigation, it seems reasonable since all 65 proteins have a pairwise sequence similarity of less than 90%. Furthermore, all (65 × (64 × 3) =) 12,480 assumed-inactive protein-ligand pairs were also checked in the ChEMBL database for possible activity and cross-binding. 9,20 Forty of these protein-ligand pairs were found to be active and were accordingly recategorized. Therefore, the final test set consists of a total of (195 + 40 =) 235 binders and (12,480 − 40 =) 12,440 inactive molecules. This translates to a total of (235 + 12,440 =) 12,675 test protein-ligand pairs, and the active-to-inactive ratio is approximately 1:64 for each target. A similar strategy was applied to the 2007 core set, which also consists of 65 protein families, 10 of which overlap with the 2014/2015 core set. After removing this overlap, the remaining 55 protein families of the 2007 PDBbind core set yield (55 × 3 =) 165 active and (55 × (54 × 3) =) 8,910 inactive complexes. Another set of active complexes we use for screening are the (3,446 − 195 =) 3,251 PLCs in the 2014 primary set Pr. In our tests, we consider the 2014/2015 core (Cr) PLCs to be one of the main test sets, whose corresponding training sample is the union of the 2014 primary dataset Pr and the 2007 core set. The training-test dataset pairs for the CV and LCO experiments are derived from the (65 + 55 =) 120 protein clusters of active/inactive PLCs. The screening power of SFs will be assessed based on their enrichment performance in the Cr, CV, and LCO screening scenarios. The enrichment factor is a measure that accounts for the number of true binders among the top-ranked x% of molecules and is calculated as follows: EFx% = NTBx% / (NTBtotal × x%). Here, NTBx% refers to the number of actives in the top-ranked x% of molecules, and NTBtotal is the total number of binders for the given target protein, which is typically three. In our experiments, we consider the top-ranked (x =) 1% of compounds to calculate the screening enrichment factor, which we denote by EF1%.
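The enrichment-factor formula can be sketched directly from its definition (a minimal illustration; `enrichment_factor` and the toy 195-ligand screen are hypothetical):

```python
def enrichment_factor(scores, labels, x=0.01):
    """EF_x = NTB_x / (NTB_total * x): actives found in the top fraction x,
    relative to the number expected under random ranking."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_top = max(1, int(round(len(ranked) * x)))
    ntb_top = sum(lab for _, lab in ranked[:n_top])   # actives in the top x%
    ntb_total = sum(labels)                            # all actives for the target
    return ntb_top / (ntb_total * x)

# One target screened against 195 ligands at a ~1:64 active:inactive ratio.
scores = [0.99, 0.98, 0.1] + [0.5] * 192   # SF output; two actives ranked on top
labels = [1, 1, 1] + [0] * 192             # three true binders
ef1 = enrichment_factor(scores, labels, x=0.01)
```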


Docking accuracy: datasets and metrics For building and testing our docking SFs, we use native and computationally generated PLC conformations obtained from the 3,446-complex refined set of PDBbind 2014. A large pool of decoys was generated for each PLC in this refined set using AutoDock Vina. 38 A diverse set of binding poses was produced by controlling the docking parameters to maximize pose diversity and coverage of the binding site, yielding a total of ∼300 poses for each protein-ligand complex. Binding poses that are more than 10 Å away from the native pose, in terms of RMSD (root-mean-square deviation), were discarded. The remaining poses were then grouped into ten 1-Å bins based on their RMSD values from the native binding pose, and the poses within each bin were further grouped into ten clusters based on their RMSD similarities. From each such cluster, a random pose was selected as a representative and the remaining poses in that cluster were discarded. At the end of this process, a decoy set consisting of (10 bins × 10 representatives =) 100 diverse poses was thus generated for each protein-ligand complex, scaling the refined set to (3,446 × 100 =) 344,600 native and computationally generated PLC conformations. The primary (Pr) and core (Cr) sets contain 300,000 and 19,500 PLC poses, respectively. Similarly, upon constructing training-test dataset pairs for CV and LCO, the original PLC complexes are used as the bases for sampling. In this work, we build docking-specific ML SFs that are trained on PLC poses labeled by their RMSDs from their respective native conformations. The docking ability of these SFs will be expressed in terms of a success-rate statistic, S%, defined as the percentage of complexes for which the top-ranked pose, according to predicted RMSD values, has a true RMSD within 1 Å of the native conformation.
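The success-rate statistic can be sketched as follows (a minimal illustration; `docking_success_rate` and the toy pose sets are hypothetical, but the ranking rule — top pose by predicted RMSD, success if its true RMSD is within 1 Å — follows the definition above):

```python
def docking_success_rate(pose_sets):
    """S%: percentage of complexes whose top pose (lowest predicted RMSD)
    has a true RMSD within 1 Angstrom of the native conformation."""
    hits = 0
    for poses in pose_sets:                  # each pose: (predicted_rmsd, true_rmsd)
        top = min(poses, key=lambda p: p[0]) # rank by predicted RMSD only
        hits += top[1] <= 1.0
    return 100.0 * hits / len(pose_sets)

pose_sets = [
    [(0.4, 0.8), (2.0, 3.1), (5.0, 6.2)],  # success: top pose is truly at 0.8 A
    [(0.9, 4.5), (1.2, 0.6)],              # failure: model ranked a 4.5 A pose first
]
s = docking_success_rate(pose_sets)
```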


Multi-perspective descriptors extracted for PDBbind complexes For each protein-ligand complex in the aforementioned training and test sets, we extracted 16 descriptor sets covering a wide range of perspectives and hypothetical interaction models. The total number of descriptors we extracted using DDB is 2,714, comprising 320 receptor-specific descriptors from the proposed REPAST and RETEST techniques, 1,765 ligand-specific descriptors from PaDEL and ECFP (specifically, ECFP4), and 629 complex-specific descriptors from the remaining 12 SFs listed in Table 2. All the scoring, screening, and docking PLC datasets described above, with the 16 descriptor types, are available on DDB's website (www.descriptordb.com) for use and download.

Results and discussion Single vs. multi-perspective modeling of protein-ligand interactions In this section, we systematically investigate the effect of multi-perspective modeling of protein-ligand interactions on the performance of scoring functions in different applications. We use the training and test datasets described above to build and evaluate task-specific ML SFs for the scoring, screening, and docking tasks. The scoring-specific SF is trained on PLCs with experimental BA data to predict BA for test complexes. The docking-specific SF is fitted to PLCs with native and decoy ligand poses to predict the RMSD of ligand poses for a test protein. The screening-specific SF is trained on active and inactive protein-ligand complexes to predict whether a ligand is active with respect to a given test protein. Our proposed data-driven, multi-perspective (MP) modeling of PL interactions is compared against the conventional approach of PLC characterization, which depends on single-perspective (SP) modeling. For MP modeling, we built two sets of SFs to evaluate the effectiveness of the proposed protein-based descriptors. In the first, we combined all 2,714 descriptors in the 16 descriptor sets listed in Table 2 for each training dataset to build XG-


Table 3: Comparison of the scoring, docking, and screening powers of single- and multi-perspective scoring functions using the core test set (Cr), leave-clusters-out (LCO), and cross-validation (CV) datasets.

Task (metric)       | Best SP (NNScore descriptors) | MP (without REPAST & RETEST) | MP (with REPAST & RETEST)
                    |   Cr      CV      LCO         |   Cr      CV      LCO        |   Cr      CV      LCO
Scoring (Rp)        |  0.779   0.773   0.571        |  0.801   0.815   0.612       |  0.827   0.825   0.637
Docking (S%)        | 69.23   68.86   64.24         | 80.05   81.80   75.48        | 82.05   81.78   77.89
Screening (EF1%)    | 28.00   32.92   24.38         | 33.50   35.82   28.64        | 34.00   37.33   29.79

Boost SFs specific to the scoring, docking, and screening tasks. We tested the resulting task-specific SFs on the corresponding test PLCs, which are characterized by the same 2,714 descriptors. The second set of MP SFs is trained using (16 − 2 =) 14 descriptor types, where the 320 protein-based descriptors (REPAST and RETEST) are excluded from the 16 types used to train the full MP version. For the SP approach, we considered each of the (16 − 4 =) 12 descriptor sets individually to characterize the training PLCs and then built XGBoost SFs for each descriptor set and each of the three tasks. We excluded the receptor-specific (REPAST & RETEST) and ligand-specific (ECFP & PaDEL) descriptor sets from the SP SFs since they characterize only the protein or the ligand molecules, not the protein-ligand complex as a whole. We then evaluated all SP-based SFs on out-of-sample validation PLCs whose interactions were modeled using the corresponding descriptor set, and selected the descriptor set associated with the best overall scoring, docking, and screening accuracies on these validation complexes to build a single-perspective (SP) SF. Our results indicate that the 218 descriptors used by the neural-network SF NNScore represent the best SP feature set for the three tasks. Table 3 shows the scoring, docking, and screening powers of single- and multi-perspective SFs on known targets (using the core test set (Cr) and cross-validation (CV)) and on novel proteins via leave-clusters-out (LCO) validation. The three rows of the table are for the three


different tasks, the three major columns are for the different PLC characterization techniques, and the three sub-columns list the performance under the different training-test sampling strategies. The SP XGBoost SF that obtained the best overall accuracy on the validation datasets used the 218 NNScore descriptors. The performance statistics in the table show that both sets of MP XGBoost SFs are significantly and consistently better than the SP SF for the three tasks, whether on known (Cr and CV) or novel (LCO) test protein targets. The average improvement of the full MP approach over SP in scoring accuracy (across the Cr, CV, and LCO test sets) is more than 8%, and more than 19% for the screening and docking tasks. Also, by comparing the summary statistics in the two right-most major columns, we observe that MP SFs trained with the REPAST and RETEST descriptors are consistently more accurate than MP SFs trained without them. This highlights the utility of the proposed receptor-based descriptors in improving overall accuracy, especially for the scoring task. It should be noted that all three SP and MP SFs were based on the same ML algorithm (XGBoost) with the same parameters, trained and tested on the same complexes, and evaluated using the same metrics. The only difference between the three sets of SFs is the descriptors they employ to characterize their training and test PLCs; all other factors that may affect the performance results are fixed. Given the substantial improvement reported in Table 3 between SP and the two versions of MP, it is evident that multi-perspective modeling of protein-ligand interactions is a very promising technique for enhancing the accuracy of ML SFs. In addition to XGBoost, we observed similar levels of improvement for other ML algorithms such as random forest, boosted and bagged ensemble neural networks, 11 and support vector machines.

Number of perspectives vs. number of training protein-ligand complexes Public biochemical databases such as PDB, 36 PubChem, 21 PDBbind, 15 ZINC, 12 and others have had a tremendous positive impact on advancing virtual screening and other molecu-


lar modeling applications. We believe our database will be an important complement in this space of public biochemical data and will significantly contribute to the advancement of machine-learning SFs. In this section, we quantitatively show how SFs benefit from training on increasing numbers of public complexes (such as those available from the aforementioned databases) characterized by an increasing number of descriptor sets (as available from DDB). In other words, we demonstrate how the performance of SFs improves as new PLCs and new perspectives of their interactions are added to public databases over time. To mimic the improvement of ML SFs in response to the release of new PLCs, we randomly sample training subsets of increasing size from the 3,251-complex primary training set Pr. We randomly select x base PLCs from the 3,251 complexes in Pr, where x ∈ {100, 680, 1260, 1840, 2420, 3000}, and use the selected PLCs to train XGBoost SFs for the three tasks. We then test them on the disjoint core set Cr for the scoring, docking, and screening tasks. This process is repeated 100 times to obtain robust average Rp, S%, and EF1% values, which are plotted in Figure 5(a). In all three tasks, we observe that the performance of the XGBoost SFs improves as the size of the training dataset increases. The slope of the three curves indicates that these ML SFs have potential for further improvement as more training data become available in the future. To investigate the effect of an increasing number of descriptor sets, we randomly select up to 100 of the (16 choose x) possible combinations of x descriptor sets from the 16 types listed in Table 2, where x ∈ {1, 4, 7, 10, 13, 16}, and use them to characterize the training-set complexes, which we then use to train XGBoost SFs. These models are subsequently tested on the Cr dataset characterized by the same descriptors.
For each number, x, of descriptor sets, the performance of the 100 SFs is averaged to obtain robust overall Rp, S%, and EF1% statistics, which are plotted in Figure 5(b). As in the case of increasing training-data size in terms of the number of new PLCs, the performance of the ML SFs steadily improves with the number of protein-ligand interaction sets in all three tasks.

[Figure 5: The scoring (Rp), docking (S%), and screening (EF1%) accuracies of multi-perspective scoring functions trained on (a) varying numbers of training targets and (b) varying numbers of descriptor sets (perspectives). Plot panels not recoverable from the extracted text.]

It is clear from the right and left panels that the number of different models of protein-ligand interactions is as important as the number of training complexes in improving the quality of prediction. Based on these results, we believe that the public database of protein-ligand interactions we are proposing in this work will be of high value to ML SFs and just as important to their accuracy as resources of raw structural and experimental data such as PDB and PDBbind.
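The descriptor-set-combination sampling used for the Figure 5(b) experiment can be sketched as follows (a simplified illustration; enumerating all combinations before sampling is feasible here only because C(16, x) is at most 12,870):

```python
import itertools
import random

def sample_combinations(n_sets=16, sizes=(1, 4, 7, 10, 13, 16), cap=100, seed=0):
    """For each subset size x, draw up to `cap` distinct x-combinations
    of the descriptor-set indices 0..n_sets-1."""
    rng = random.Random(seed)
    out = {}
    for x in sizes:
        combos = list(itertools.combinations(range(n_sets), x))
        rng.shuffle(combos)
        out[x] = combos[:cap]   # at most `cap` combinations per size
    return out

combos = sample_combinations()
```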

Perspective filtering using automatic feature subset selection In addition to hosting descriptor extraction tools, DDB also offers a descriptor filtering and analysis toolbox. The toolbox is developed to solve the following problems: (i) automatically guard against noisy and irrelevant descriptors, (ii) provide insight about the effect of different intermolecular forces on the bioactivity of interest, and (iii) distill the information contained in all descriptors into a compact subset of accurate features for applications where high throughput is required. Descriptor filtering is essentially a problem of feature subset selection (FSS) in machine learning. The naïve approach to performing FSS is to search the feature space for the most relevant descriptor subset using brute force. Eliminating the least relevant


features from a pool of thousands of descriptors using exhaustive enumeration is impractical because of the combinatorial explosion of the number of all possible subsets (2P − 1, where P is the pool size). Numerous alternative approaches have been proposed in the literature that attempt to find accurate subset of features in a reasonable time. These approaches can be grouped into three major categories. First are filter-based approaches in which one descriptor is considered at a time to examine its relevance score to the property we wish to predict. The descriptors that have the lowest scores (e.g., the lowest correlation with the dependent variable) are filtered out. Although filter techniques are fast, they ignore descriptor interactions with each other since every feature is evaluated separately. The second category of FSS techniques are called wrapper methods in which a large number of descriptor subsets with varying sizes and combinations are constructed and examined according to some search heuristic. A classification or regression ML model is typically fitted to each descriptor subset to examine its accuracy on an out-of-sample test set. After evaluating each subset, the wrapper search technique, typically an evolutionary algorithm, explores the descriptor space in the vicinity of the most accurate subsets by generating new combinations of descriptors derived from the best subsets so far. The search continues until a predefined number of search rounds has been performed or a target fitness value has been obtained or a plateau has been reached. In any case, wrapper search algorithms such as genetic algorithms (GA) 55 or particle swarm optimization (PSO) 56 require a large number of descriptor subsets to explore and search rounds to arrive at an optimal descriptor subset. Based on our experiments using GA, we found that it is computationally very expensive to reliably search for the best descriptor subset when the number of descriptors exceed few hundreds. 
The third FSS technique is based on the embedded paradigm. As the name implies, the search for an optimal subset of descriptors is embedded into (built into) the ML algorithm itself, which makes it more computationally efficient than the wrapper approach. It is also more accurate than filter methods since descriptor interactions are considered by the


ML model. Due to these properties, we choose the embedded method as the basis of our FSS algorithm. The algorithm is based on random forest's automatic variable importance calculation in conjunction with an in-house iterative backward elimination procedure that removes redundant, noisy, and irrelevant descriptors. Automatic variable importance calculation is a very useful byproduct of ensemble models: in addition to its use for filtering, it helps identify the most critical interactions that contribute to the predictive accuracy of the model. Random forest utilizes out-of-bag (OOB) PLCs (complexes not sampled from the training set when the bootstrap set for an individual tree is drawn; on average, about 37%, roughly 1/e, of the training complexes are left out of each bootstrap set) to calculate the effect of each descriptor on the predicted bioactivity of interest. This is achieved by monitoring the change in the model's accuracy while permuting (shuffling) input descriptors randomly, one at a time. With each random permutation of a descriptor's values across the training complexes (i.e., descriptor noising), the corresponding increase in mean-square error (MSE) on the OOB examples is measured. By comparing the intact OOB MSE (i.e., without descriptor noising) to the OOB MSE computed after permutation, the average increase in MSE is evaluated. For each tree t in the ensemble, the standard OOB MSE is

    OOBMSE^(t) = (1/Ñ) Σ_{i=1}^{Ñ} (y_i − ŷ_i)²,

where Ñ is the number of OOB examples of tree t. The OOB MSE for the same tree when the input variable X_j is permuted is

    OOBMSE_j^(t) = (1/Ñ) Σ_{i=1}^{Ñ} (y_i − ŷ_{i,j})².

For each descriptor X_j in every tree t in the ensemble, the increase in MSE is

    ΔOOBMSE_j^(t) = OOBMSE_j^(t) − OOBMSE^(t).

To calculate the importance I_j of descriptor j, the resulting ΔOOBMSE_j^(t) values are averaged over all trees in the ensemble and normalized by their standard deviation:

    I_j = μ_{ΔOOBMSE_j} / σ_{ΔOOBMSE_j}.

The overall increase in MSE, I_j, for an input descriptor X_j can be considered its influence on predicting the final binding affinity or any other dependent variable.

Based on variable importance and backward elimination, we implement the descriptor filtering approach depicted in Algorithm 1. (i) First, the number of descriptor subsets
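The per-tree permutation procedure defined by these equations can be sketched with a hand-rolled bootstrap ensemble (a toy illustration with made-up data, not DDB's implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy data: only descriptors 0 and 1 carry signal; 2-4 are irrelevant.
N, P = 400, 5
X = rng.normal(size=(N, P))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=N)

T = 50                                   # number of trees in the forest
delta = np.zeros((T, P))                 # Delta-OOBMSE_j^(t) per tree/descriptor

for t in range(T):
    boot = rng.integers(0, N, size=N)            # bootstrap sample for tree t
    oob = np.setdiff1d(np.arange(N), boot)       # its out-of-bag examples
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=t)
    tree.fit(X[boot], y[boot])
    base = np.mean((y[oob] - tree.predict(X[oob])) ** 2)      # OOBMSE^(t)
    for j in range(P):
        Xp = X[oob].copy()
        Xp[:, j] = rng.permutation(Xp[:, j])                  # noise descriptor j
        permuted = np.mean((y[oob] - tree.predict(Xp)) ** 2)  # OOBMSE_j^(t)
        delta[t, j] = permuted - base

# I_j: average increase in OOB MSE, normalized by its standard deviation.
importance = delta.mean(axis=0) / delta.std(axis=0)
print(np.argsort(importance)[-2:])   # the two informative descriptors rank highest
```

Permuting an informative descriptor inflates the OOB error sharply, while permuting an irrelevant one barely moves it, so I_j cleanly separates the two groups.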


Algorithm 1: Descriptor filtering algorithm

 1: Input 1: Training data D = {X_i, y_i}_{i=1}^N
 2: Input 2: The number of descriptor subset sizes to test, M
 3: Subset sizes M = Sequence(from = P, down to = 3, decrement by ≈ (P − 3)/(M − 1))
 4: Set the initial variable importance values of all P descriptors to some arbitrary value (e.g., 1): I^(0) = {I_j = 1 | 1 ≤ j ≤ P}
 5: for p = 1 to M do
 6:     V^(p) = the M(p) descriptors with the largest variable importance values according to I^(p−1)
 7:     Characterize the PLCs with descriptor set V^(p): D_p = {X_i^(V^(p)), y_i}_{i=1}^N
 8:     Train model F_p on data D_p: F_p = fit(D_p)
 9:     Compute the variable importance of the set V^(p): I^(p) = VarImp(F_p)
10:     Calculate the loss of model F_p: L_p = OOBMSE(F_p)
11: end for
12: Find the optimal subset of descriptors V^(optimal) = V^(p*), where p* = argmin_p L_p
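A compact Python sketch of Algorithm 1 follows (assuming scikit-learn's RandomForestRegressor; note that its feature_importances_ attribute is impurity-based rather than the permutation-based importance described above, and the toy data are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def descriptor_filter(X, y, n_sizes=8, min_size=3, seed=0):
    """Iterative backward elimination guided by RF variable importance.
    Subset sizes step from the full dimension P down to min_size; the
    subset with the lowest OOB MSE is returned."""
    P = X.shape[1]
    sizes = np.linspace(P, min_size, n_sizes).round().astype(int)
    keep = np.arange(P)            # V^(1): start from all descriptors
    importance = np.ones(P)        # I^(0): arbitrary initial importances
    best_loss, best_subset = np.inf, keep
    for m in sizes:
        # Retain the m descriptors with the largest importance values.
        keep = keep[np.argsort(importance)[::-1][:m]]
        rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                                   random_state=seed).fit(X[:, keep], y)
        importance = rf.feature_importances_            # I^(p)
        loss = np.mean((y - rf.oob_prediction_) ** 2)   # L_p = OOBMSE(F_p)
        if loss < best_loss:
            best_loss, best_subset = loss, keep
    return best_subset

# Toy check: descriptors 0 and 1 carry the signal and should be retained.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=300)
selected = descriptor_filter(X, y)
print(sorted(selected))
```

Each pass both re-ranks the surviving descriptors and records the OOB loss, so the returned subset is the one that actually minimized out-of-sample error rather than simply the smallest one.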

(M) to test is defined by the user. This parameter allows the user to trade off the time required to find the best feature subset against the granularity of subset sizes and hence the quality of the result. The sizes of the M feature subsets range from the full dimension P down to 3 in fixed steps of size ≈ (P − 3)/(M − 1); we arbitrarily chose three as the smallest number of descriptors to test. (ii) Starting from the largest descriptor subset (of size M(1) = P), an RF model is built and its relative variable importance I^(1) as well as its out-of-sample loss L_1 are recorded. (iii) We then use the variable importance scores to select the M(2) most influential descriptors V^(2) and eliminate the remaining, less important M(1) − M(2) features (V^(1) − V^(2)). Next, we build a new RF SF and record its out-of-sample loss L_2 as well as its importance estimates I^(2) for the descriptor subset of size M(2). (iv) We repeat this cycle of model construction, out-of-sample loss calculation, variable importance estimation, and backward elimination until all M descriptor subsets (down to M(M) = 3) have been tested. Upon completion, the most informative descriptors are those of the subset associated with the lowest out-of-sample loss. The variable importance values of the best subset provide a good estimate of the contributions of different molecular forces to the system's binding affinity. We conducted two types of experiments to show the effectiveness of DDB's descriptor


Table 4: The docking, screening, and scoring accuracies (A) of multi-perspective scoring functions constructed using raw (R) and raw-plus-noisy (R+N) descriptors. The number of descriptors (P) and accuracies (A) are reported when all features are used (Full) and after conducting feature subset selection (FSS).

                  |        Raw descriptors (R)           |   Raw & noisy descriptors (R+N)
Task (metric)     | P_R^Full  A_R^Full  P_R^FSS  A_R^FSS | P_R+N^Full  A_R+N^Full  P_R+N^FSS  A_R+N^FSS
Scoring (Rp)      |   2714     0.827      216     0.821  |    5428       0.819        378       0.823
Docking (S%)      |   2714     82.05      270     81.95  |    5428       79.49        216       81.54
Screening (EF1%)  |   2714     34.00      216     34.00  |    5428       32.00        541       33.50

filtering capability. In the first, we applied the iterative descriptor importance estimation and backward elimination algorithm to DDB's 2,714 original (or raw, denoted R) descriptors for the scoring, docking, and screening tasks. The goodness-of-fit metric for the scoring and docking tasks is the correlation (Rp) between predicted and true labels of OOB PLCs; the true labels are the experimental BA values for scoring and the RMSD values of PLC poses for docking. The objective function for the screening task is the average enrichment factor of active ligands over ten out-of-sample test sets. Upon calculating the optimal filtered subset for each task, we used it to build XGBoost SFs whose accuracy statistics are listed in Table 4. The second major column of Table 4 shows (from left to right) the number (P_R^Full) of the original raw (R) descriptors, the accuracy (A_R^Full) of XGBoost SFs trained on the raw descriptors, the number (P_R^FSS) of descriptors in the best feature subset, and the accuracy (A_R^FSS) of XGBoost SFs trained on this subset of filtered descriptors. We observe that the accuracy of the SFs did not change significantly (A_R^Full vs. A_R^FSS) despite the use of only (270/2714 ≈) 10% of the total number of descriptors (P_R^Full vs. P_R^FSS). We further found that the best descriptors selected for each task include subsets from the 16 descriptor types. Note that the size of the best descriptor subset is comparable to that of NNScore, whose 218 descriptors were the best performing when single perspectives were evaluated individually, as shown in Table 3. However, the performance of the two descriptor sets (as listed in Tables 3 and 4) is quite different, which implies the effectiveness of multi-perspective modeling of interactions and of the feature filtering algorithm.

Figure 6: The scoring, docking (left y-axis), and screening (right y-axis) accuracy during the filtering of (a) raw and (b) noisy descriptors. Panel (a) spans subset sizes of up to 2,704 raw descriptors; panel (b) spans up to 2,704 raw and 2,704 decoy descriptors. Out-of-bag (OOB) protein-ligand complexes were used to validate the selected feature subset for the scoring and docking tasks, whereas cross-validation (CV) was applied for screening. The marks (+, o, and △) denote the number of descriptors in the best subset found by the filtering algorithm (x-coordinate) and the corresponding scoring, screening, and docking performance (y-coordinate).

Figure 6(a) shows the search path for the best subset of descriptors for the three prediction tasks. The x-axis gives the descriptor subset size and the y-axis the accuracy of that subset on out-of-sample PLCs. The tick marks (the small square, circle, and triangle) on the plot correspond to the sizes of the best subsets and their performance. The results in the plot agree with those in the table in that the performance of large descriptor subsets is not very different from that of smaller subsets. The gap in performance for the screening task could be due to the difference in the sizes of the training and test datasets for the ML models used for filtering (RF in the plot) and for the final SF construction (XGBoost in the table). The results of the first experiment clearly indicate that valuable computational resources can be saved by extracting only the most important descriptors without affecting performance. This was one of the objectives of implementing descriptor filtering as part of DDB.

The approach of the second experiment is very similar to the first, but the objectives are


different. The main objective is to simulate the possibility of depositing noisy descriptors to DDB. Noise could be introduced as a result of improper modeling of some interactions, bugs in a donated descriptor extraction program, or other artifacts. Therefore, in this experiment, we evaluate the ability of the descriptor filtering algorithm to remove noise added in the form of artificial descriptors. We first added 2,714 random noise (N) variables to the 2,714 raw (R) descriptors. We then fitted two sets of SFs. The first set of XGBoost SFs was trained directly on the 5,428 mixed (R+N) descriptors. For the second, we first applied the descriptor filtering algorithm to the 5,428 mixed (R+N) descriptors and then fitted another set of XGBoost SFs on the resulting filtered descriptors. The results of both experiments are shown in the right-most major column of Table 4. Although the difference is not very significant, we notice a drop in the performance of the SF trained on the noisy descriptors compared to its counterpart constructed on the original features (i.e., A_R^Full vs. A_R+N^Full). This indicates the resilience of XGBoost SFs to even such a large amount of noise (100%) in the data. After applying DDB filters, we notice a rebound in the performance of the ML SF back to normal. We examined the final three subsets of descriptors and found that none of the 378, 216, and 541 selected descriptors belongs to the noise set we intentionally added. This clearly shows the effectiveness of the embedded FSS algorithm in finding and removing noise.

Drug designers and medicinal chemists are often interested in understanding the influence of various intermolecular forces on the ligand's binding affinity and conformation. DDB's filtering and ML toolbox serves this need through its descriptor importance plots, such as the ones depicted in Figure 7.
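The noise-injection experiment above can be mimicked on toy data (illustrative only; a generic RF importance ranking stands in for DDB's full filtering pipeline, and all names and sizes are made up):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)

# Toy "raw" descriptors: 10 features, of which the first 3 carry all the signal.
N, P = 500, 10
X_raw = rng.normal(size=(N, P))
y = X_raw[:, 0] + 2 * X_raw[:, 1] - X_raw[:, 2] + 0.1 * rng.normal(size=N)

# Inject as many pure-noise decoy descriptors as raw ones (100% noise).
X_noise = rng.normal(size=(N, P))
X_mixed = np.hstack([X_raw, X_noise])        # columns 10-19 are decoys

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_mixed, y)
ranked = np.argsort(rf.feature_importances_)[::-1]

# The informative raw descriptors should outrank every injected decoy.
print(sorted(ranked[:3]))                    # prints [0, 1, 2]
```

In this miniature version, every decoy column receives a lower importance score than every informative column, which is the mechanism that lets the filter discard all of the injected noise.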
Out of the 2,714 descriptors in DDB, the bar plots illustrate the relative importance of the top 15 descriptors in predicting binding affinities and poses of PLCs in the core test set Cr. For binding affinity, RF's variable importance plot suggests that the descriptors with the most predictive power are related to hydrophobic contacts, followed by steric effects, hydrogen bonding, and ligand solvation, whereas descriptors that characterize the ligand's steric effects and clashes with the protein were found to play the biggest role in estimating its binding pose.

Figure 7: The relative importance of the top 15 (out of 2,714) descriptors in predicting the ligand's binding affinity (left panel) and poses (right panel) for the core test set complexes. The x-axis shows abbreviated descriptor names in which the first letter denotes the symbolic identifier of the SF or tool generating the descriptor (see Table 2), the following ".X#" is the original SF's unique index of the descriptor, and the suffix is the descriptor's short name as produced by the original SF or extraction tool.

The descriptor based on NNScore's term for Vina's (SF) affinity score is among the top predictors of binding affinity and conformation. The volume of the pocket calculated by DPOCKET43 appears to affect binding pose prediction but does not seem to have the same effect on BA estimation. On the other hand, the numbers of carbon-carbon atom pairs between the protein and ligand within 8 and 12 Å are very relevant for scoring but not for docking. The descriptors based on these simple atomic contacts are calculated by the SF RF-Score (descriptor codes with the prefix "r"). The other top 15 scoring and docking descriptors shown in Figure 7 are generated by the following diverse set of SFs and characterization tools: ChemGauss (prefix h), AffiScore (a), NNScore (n), PaDEL (p), RETEST (t), DPOCKET (f), AutoDock Vina (U), Cyscore (c), and X-Score


(x). In some cases, these tools overlap in features that describe the same interactions but model them with different approaches. For example, hydrophobic contacts are modeled by RF-Score's CC_8, CC_12, and CC_16 contacts, Cyscore's hydrophobic term, and AffiScore's hydrophobic score. Hydrogen-bonding-related descriptors are likewise generated by both X-Score and ChemGauss from two different perspectives. Although the top 15 features capture a diverse collection of intermolecular forces, some of them characterize the same interaction type, which highlights the benefits of the proposed multi-perspective modeling paradigm.

Conclusion

In this work, we presented Descriptor Data Bank (DDB), a cloud platform for facilitating multi-perspective modeling of PL interactions. DDB is an open-access hub for depositing, hosting, executing, and sharing tools and data arising from a diverse set of interaction modeling hypotheses. In addition to the descriptor extraction tools, the platform implements a machine-learning (ML) toolbox that drives DDB's automatic descriptor filtering and evaluation utilities as well as its scoring-function fitting and prediction functionalities. To enable local access to many of DDB's utilities, command-line programs and a PyMOL plug-in have also been developed and can be freely downloaded for offline multi-perspective modeling. We seed DDB with 16 diverse descriptor extraction tools developed in-house and collected from the literature. Combined, the tools generate over 2,700 descriptors that characterize (i) proteins, (ii) ligands, and (iii) protein-ligand complexes. The in-house descriptors we propose here, REPAST & RETEST, are target-specific and based on pairwise sequence and structural alignment of proteins followed by target clustering and trilateration. We built and used the fit/predict service in DDB to fit ML SFs to the in-house descriptors and those collected from the literature. We then evaluated them on several data sets constructed to reflect real-world drug screening scenarios. We found that multi-perspective SFs constructed using a large number of diverse DDB interaction models outperformed


their single-perspective counterparts in all evaluation scenarios, with an average improvement of 15%. We also found that our proposed target-specific descriptors help improve the accuracy of SFs. In addition, DDB’s filtering module was able to exclude noisy and irrelevant descriptors when artificial noise was injected as new features. We also observed that the tree-based ensemble ML SFs implemented in DDB are robust even in the presence of noisy and decoy descriptors.

Acknowledgement

This material is based upon work supported by the National Science Foundation under Grant No. 1117900. This work was also supported by Amazon's AWS Cloud Credits for Research program. The authors would also like to thank the anonymous reviewers for helpful comments and suggestions.

Conflict of interest

The authors declare no conflict of interest.

ORCID iDs

• Hossam M. Ashtawy: orcid.org/0000-0002-5589-9274
• Nihar R. Mahapatra: orcid.org/0000-0002-3821-3330

References

(1) Ewing, T.; Makino, S.; Skillman, A.; Kuntz, I. DOCK 4.0: Search strategies for automated molecular docking of flexible molecule databases. J. Comput.-Aided Mol. Des. 2001, 15, 411–428.


(2) Gohlke, H.; Hendlich, M.; Klebe, G. Knowledge-based scoring function to predict protein-ligand interactions. J. Mol. Biol. 2000, 295, 337–356. (3) Wang, R.; Lai, L.; Wang, S. Further development and validation of empirical scoring functions for structure-based binding affinity prediction. J. Comput.-Aided Mol. Des. 2002, 16, 11–26. DOI: 10.1023/A:1016357811882. (4) Ashtawy, H. M.; Mahapatra, N. R. A Comparative Assessment of Conventional and Machine-Learning-Based Scoring Functions in Predicting Binding Affinities of Protein-Ligand Complexes. Bioinformatics and Biomedicine (BIBM), 2011 IEEE International Conference on. 2011; pp 627–630. (5) Ashtawy, H. M.; Mahapatra, N. R. A Comparative Assessment of Predictive Accuracies of Conventional and Machine Learning Scoring Functions for Protein-Ligand Binding Affinity Prediction. Computational Biology and Bioinformatics, IEEE/ACM Transactions on 2014, PP, 1–1. (6) Ballester, P.; Mitchell, J. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics 2010, 26, 1169. (7) Ashtawy, H. M.; Mahapatra, N. R. Machine-learning scoring functions for identifying native poses of ligands docked to known and novel proteins. BMC Bioinformatics 2015, 16, S3. (8) Cheng, T.; Li, X.; Li, Y.; Liu, Z.; Wang, R. Comparative assessment of scoring functions on a diverse test set. J. Chem. Inf. Model. 2009, 49, 1079–1093. (9) Li, Y.; Han, L.; Liu, Z.; Wang, R. Comparative Assessment of Scoring Functions on an Updated Benchmark: 2. Evaluation Methods and General Results. J. Chem. Inf. Model. 2014, 54, 1717–1736.


(10) Durrant, J. D.; McCammon, J. A. NNScore: A neural-network-based scoring function for the characterization of protein-ligand complexes. J. Chem. Inf. Model. 2010, 50, 1865–1871. (11) Ashtawy, H. M.; Mahapatra, N. R. BgN-Score and BsN-Score: Bagging and boosting based ensemble neural networks scoring functions for accurate binding affinity prediction of protein-ligand complexes. BMC Bioinformatics 2015, 16, S8. (12) Sterling, T.; Irwin, J. J. ZINC 15: Ligand Discovery for Everyone. 2015. (13) Knox, C.; Law, V.; Jewison, T.; Liu, P.; Ly, S.; Frolkis, A.; Pon, A.; Banco, K.; Mak, C.; Neveu, V. DrugBank 3.0: a comprehensive resource for ’omics’ research on drugs. Nucleic Acids Res. 2011, 39, D1035–D1041. (14) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235–242. (15) Li, Y.; Liu, Z.; Li, J.; Han, L.; Liu, J.; Zhao, Z.; Wang, R. Comparative Assessment of Scoring Functions on an Updated Benchmark: I. Compilation of the Test Set. J. Chem. Inf. Model. 2014, 54, 1700–1716. (16) Damm-Ganamet, K. L.; Smith, R. D.; Dunbar Jr, J. B.; Stuckey, J. A.; Carlson, H. A. CSAR benchmark exercise 2011–2012: evaluation of results from docking and relative ranking of blinded congeneric series. J. Chem. Inf. Model. 2013, 53, 1853–1870. (17) Benson, M. L.; Smith, R. D.; Khazanov, N. A.; Dimcheff, B.; Beaver, J.; Dresslar, P.; Nerothin, J.; Carlson, H. A. Binding MOAD, a high-quality protein-ligand database. Nucleic Acids Res. 2008, 36, D674–D678. (18) Liu, T.; Lin, Y.; Wen, X.; Jorissen, R.; Gilson, M. BindingDB: a web-accessible


database of experimentally determined protein–ligand binding affinities. Nucleic Acids Res. 2007, 35, D198–D201. (19) Wang, R.; Fang, X.; Lu, Y.; Wang, S. The PDBbind Database: Collection of Binding Affinities for Protein-Ligand Complexes with Known Three-Dimensional Structures. J. Med. Chem. 2004, 47, 2977–2980, PMID: 15163179. (20) Gaulton, A.; Bellis, L.; Bento, A.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40, D1100–D1107. (21) Wang, Y.; Xiao, J.; Suzek, T.; Zhang, J.; Wang, J.; Zhou, Z.; Han, L.; Karapetyan, K.; Dracheva, S.; Shoemaker, B. PubChem’s bioAssay database. Nucleic Acids Res. 2012, 40, D400–D412. (22) Krammer, A.; Kirchhoff, P. D.; Jiang, X.; Venkatachalam, C.; Waldman, M. LigScore: A novel scoring function for predicting binding affinities. J. Mol. Graphics Modell. 2005, 23, 395–407. (23) Gehlhaar, D. K.; Verkhivker, G. M.; Rejto, P. A.; Sherman, C. J.; Fogel, D. R.; Fogel, L. J.; Freer, S. T. Molecular recognition of the inhibitor AG-1343 by HIV-1 protease: Conformationally flexible docking by evolutionary programming. Chemistry & Biology 1995, 2, 317–324. (24) Tripos Inc., The SYBYL Software. 2006; version 7.2. (25) Bohm, H. The development of a simple empirical scoring function to estimate the binding constant for a protein-ligand complex of known three-dimensional structure. J. Comput.-Aided Mol. Des. 1994, 8, 243–256. (26) Eldridge, M. D.; Murray, C. W.; Auton, T. R.; Paolini, G. V.; Mee, R. P. Empirical scoring functions: I. The development of a fast empirical scoring function to estimate


the binding affinity of ligands in receptor complexes. J. Comput.-Aided Mol. Des. 1997, 11, 425–445. DOI: 10.1023/A:1007996124545. (27) Jain, A. N. Scoring noncovalent protein-ligand interactions: A continuous differentiable function tuned to compute binding affinities. J. Comput.-Aided Mol. Des. 1996, 10, 427–440. (28) Zavodszky, M. I.; Sanschagrin, P. C.; Kuhn, L. A.; Korde, R. S. Distilling the essential features of a protein surface for improving protein-ligand docking, scoring, and virtual screening. J. Comput.-Aided Mol. Des. 2002, 16, 883–902. (29) Jones, G.; Willett, P.; Glen, R.; Leach, A.; Taylor, R. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol. 1997, 267, 727–748. (30) Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.; Shenkin, P. S. Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy. J. Med. Chem. 2004, 47, 1739–1749, PMID: 15027865. (31) Korb, O.; Stutzle, T.; Exner, T. Empirical Scoring Functions for Advanced Protein-Ligand Docking with PLANTS. J. Chem. Inf. Model. 2009, 49, 84–96. (32) Johnson, M. A.; Maggiora, G. M. Concepts and applications of molecular similarity. 1990. (33) Leach, A. R.; Gillet, V. J. An introduction to chemoinformatics; Springer Science & Business Media, 2007. (34) Madden, T. The BLAST sequence analysis tool. The NCBI Handbook. National Library of Medicine (US), National Center for Biotechnology Information, Bethesda, MD 2002.


(35) Zhang, Y.; Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005, 33, 2302–2309. (36) Berman, H.; Henrick, K.; Nakamura, H. Announcing the worldwide protein data bank. Nature Structural & Molecular Biology 2003, 10, 980–980. (37) Dunbar Jr, J.; Smith, R.; Yang, C.; Ung, P.; Lexa, K.; Khazanov, N.; Stuckey, J.; Wang, S.; Carlson, H. CSAR Benchmark Exercise of 2010: Selection of the Protein–Ligand Complexes. J. Chem. Inf. Model. 2011, 51, 2036–2046. (38) Trott, O.; Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 2010, 31, 455–461. (39) Cao, Y.; Li, L. Improved protein–ligand binding affinity prediction by using a curvature-dependent surface-area model. Bioinformatics 2014, btu104. (40) Yap, C. W. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011, 32, 1466–1474. (41) Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. (42) McGann, M. FRED pose prediction and virtual screening accuracy. J. Chem. Inf. Model. 2011, 51, 578–596. (43) Schmidtke, P.; Le Guilloux, V.; Maupetit, J.; Tufféry, P. Fpocket: online tools for protein ensemble pocket detection and tracking. Nucleic Acids Res. 2010, 38, W582–W589. (44) Neudert, G.; Klebe, G. DSX: a knowledge-based scoring function for the assessment of protein–ligand complexes. J. Chem. Inf. Model. 2011, 51, 2731–2745.


(45) Verdonk, M. L.; Cole, J. C.; Hartshorn, M. J.; Murray, C. W.; Taylor, R. D. Improved protein–ligand docking using GOLD. Proteins: Struct., Funct., Bioinf. 2003, 52, 609–623. (46) Koes, D. R.; Baumgartner, M. P.; Camacho, C. J. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J. Chem. Inf. Model. 2013, 53, 1893–1904. (47) Breiman, L. Random forests. Machine Learning 2001, 45, 5–32. (48) Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. CoRR 2016, abs/1603.02754. (49) Friedman, J. H. Stochastic gradient boosting. Computational Statistics & Data Analysis 2002, 38, 367–378. (50) Smola, A. J.; Schölkopf, B. A tutorial on support vector regression. Statistics and computing 2004, 14, 199–222. (51) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 2011, 12, 2825–2830. (52) Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M. TensorFlow: A System for Large-Scale Machine Learning. OSDI. 2016; pp 265–283. (53) Wang, R.; Fang, X.; Lu, Y.; Yang, C.-Y.; Wang, S. The PDBbind Database: Methodologies and Updates. J. Med. Chem. 2005, 48, 4111–4119, PMID: 15943484. (54) Kramer, C.; Gedeck, P. Leave-cluster-out cross-validation is appropriate for scoring functions derived from diverse protein data sets. J. Chem. Inf. Model. 2010, 50, 1961–1969.


(55) Soufan, O.; Kleftogiannis, D.; Kalnis, P.; Bajic, V. B. DWFS: a wrapper feature selection tool based on a parallel genetic algorithm. PLoS ONE 2015, 10, e0117988. (56) Sahu, B.; Mishra, D. A novel feature selection algorithm using particle swarm optimization for cancer microarray data. Procedia Engineering 2012, 38, 27–31.


Graphical TOC Entry

www.DescriptorDB.com: a protein, a ligand, and 1000's of descriptors feed predictions of binding pose, binding affinity, and activity (Active?).