Improved Method of Structure-Based Virtual Screening via Interaction

It aims to enrich potentially active compounds from a large chemical library for further ...... The Supporting Information is available free of charge...
0 downloads 0 Views 3MB Size
Subscriber access provided by WEBSTER UNIV

Chemical Information

An Improved Method of Structure-based Virtual Screening via Interaction-energy-based Learning Nobuaki Yasuo, and Masakazu Sekijima J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00673 • Publication Date (Web): 26 Feb 2019 Downloaded from http://pubs.acs.org on February 27, 2019

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

An Improved Method of Structure-based Virtual Screening via Interaction-energy-based Learning Nobuaki Yasuo†,‡ and Masakazu Sekijima∗,†,¶ †Department of Computer Science, Tokyo Institute of Technology, 4259–J3–23, Nagatsuta-cho, Midori-ku, Yokohama, Japan ‡Research Fellow of the Japan Society for the Promotion of Science DC1, 4259–J3–23, Nagatsuta-cho, Midori-ku, Yokohama, Japan ¶Advanced Computational Drug Discovery Unit, Tokyo Institute of Technology, 4259–J3–23, Nagatsuta-cho, Midori-ku, Yokohama, Japan E-mail: [email protected] Abstract Virtual screening is a promising method for obtaining novel hit compounds in drug discovery. It aims to enrich potentially active compounds from large chemical library for further biological experiments. However, the accuracy of current virtual screening methods is insufficient. In this study, we develop a new virtual screening method named Similarity of Interaction Energy VEctor Score (SIEVE-Score), in which proteinligand interaction energies are extracted to represent of docking poses for machine learning. SIEVE-Score offers substantial improvements compared to other state-of-theart virtual screening methods, namely, other machine-learning-based scoring functions, interaction fingerprints, and docking software, for the enrichment factor 1% results on the Directory of Useful Decoys, Enhanced (DUD-E). The screening results are also

1

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

human-interpretable in the form of important interactions for distinguishing between active and inactive compounds. The source code is available at: https://github.com/ sekijima-lab/SIEVE-Score.

Introduction The demand for computer-aided drug discovery has been rapidly increasing since the average cost per new drug developed reached 2.6 billion dollars. 1 To find active compounds for target proteins, high-throughput screening has been used for the screening of chemical libraries. 2 However, the cost of such compound screening can be drastically reduced if the compound libraries can be effectively filtered. Virtual screening is a method of key importance for decreasing the cost and increasing the hit rate of screening. 3,4 In virtual screening, compounds are computationally filtered out by their predicted activities, and the compounds identified as promising candidates are designated for evaluation through biological experiments. Recently, competitive virtual screening contests have been also held, such as in silico drug discovery contest, to evaluate various virtual screening methods for specific target molecules in practical situations. 5,6 Methods of virtual screening can be categorized into two types, namely, structure-based virtual screening (SBVS) and ligand-based virtual screening (LBVS). The structures of target proteins are used in SBVS, 7,8 while information on known inhibitors is used in LBVS. 9 One widely used method in SBVS is protein-ligand docking, which simulates proteinligand binding based on the interaction energy. Many applications exist for protein-ligand docking, such as DOCK, 10,11 Glide, 12 GOLD, 13 and AutoDock Vina. 14 In each of these applications, the algorithm starts by generating ligand conformations and subsequently optimizes their orientations and angles to minimize their binding energies with the target protein using a scoring function. The results are the coordinates of the atoms in the ligand and the corresponding interaction energy as the “docking score”. In basic docking-based virtual screening, compounds are ranked by their docking scores. 2

ACS Paragon Plus Environment

Page 2 of 35

Page 3 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

One advantage of SBVS is that novel compounds are more likely to be obtained than they are in LBVS. 5 This is because SBVS is based on physical interactions, while LBVS is based on the similarity/dissimilarity of known active/inactive compounds. Another advantage is the ability to perform interaction analysis using the docked structures, which provides knowledge of protein-ligand binding 15 to understand the affinity 16 and selectivity 17,18 of the compounds. These analyses can also be used to discover new chemical compounds, by generating derivatives systematically 19 and evaluating these compounds with structure-based scoring. By contrast, the scoring function in typical LBVS is based on the similarity of the chemical structures. 20 The accuracy of LBVS has greatly improved with machine-learning-based methods, which can effectively discriminate between active and inactive compounds. 21 Drug-target interaction prediction can also be applied to virtual screening. 22,23 However, two-dimensional structure-based LBVS suffers from the critical limitations that it cannot find new active compounds with dissimilar chemical structures. To avoid this problem, pharamacophore-based methods 24 and three-dimensional shape-based methods 25–27 have also been developed to obtain more novel compounds with less computational resource. Many attempts have been made to improve the hit rates of SBVS methods. 28–30 One of the problems lies in the scoring functions used in docking. These scoring functions of docking can be categorized into three types: force-field-based, 31 knowledge-based, 32 and empirical 12 functions. They basically consists of a set of simple functions of the distances between atoms due to computational cost limitations. Because of their nature, these scoring functions cannot fully consider weak interactions such as the solvation, or the induced fit of the proteins. There are various approach to improve the scoring functions. New machine-learningbased scoring functions for docking have also been developed as alternatives to the classical scoring functions. 33,34 These new scoring functions rely on the implicit learning of the relationships among protein-ligand complexes instead of being explicit scoring functions.

3

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Machine-learning-based scoring functions have drastically improved the prediction accuracy for protein-ligand binding by means of nonlinear regression models such as support vector machines (SVMs) 35,36 and random forests. 37 Interaction fingerprints can also serve as the basis of potential virtual screening methods in which the similarities among protein-ligand interactions can be quantitatively calculated based on the presence or absence of interactions. The first interaction fingerprints to be proposed were structural interaction fingerprints (SIFt), 38 which encode protein-ligand interactions into binary vectors. Each vector consists of seven bits for each ligand binding site residue, which represents seven types of interactions: any interaction, interaction with the main chain, interaction with the side chain, polar interaction, nonpolar interaction, donation for hydrogen bonding, and acceptance for hydrogen bonding. Similar interaction fingerprints, such as atom-pairs-based interaction fingerprints (APIF), 39 and similar methods have been used for virtual screening. 40 Although these SIFt-like methods have proven to be useful, they have some limitations, such as difficulty in detecting the interaction type and the misclassification of interaction types that are not considered in SIFt, namely, π − π and cation−π interactions. More recent interaction fingerprint methods, such as structural protein-ligand interaction fingerprints (SPLIF) 41 use circular fingerprints of the interactions of atoms with their neighbors. This method avoids potential difficulties in interaction type detection because it does not depend on particular interaction types. Pharm-IF 42 uses atom types and the distances between interacting atom pairs and applies machine learning methods to improve hit enrichment. However, it is still not possible to evaluate the strength of interactions directly. In this study, we propose a new virtual screening method named Similarity of the Interaction Energy VEctor Score (SIEVE-Score), which is inspired by interaction fingerprint methods. In this method, the known inhibitor information is learned on the basis of proteinligand interaction energies extracted from protein-ligand docking, as in interaction fingerprint methods. Whereas classic scoring functions merely sum up the interaction energy terms, our

4

ACS Paragon Plus Environment

Page 4 of 35

Page 5 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

method is able to recognize important interactions through training. Our method is also different from basic interaction fingerprint methods in that the interaction energy vector is an energy-based, real-valued representation of protein-ligand interactions. The interaction energy vector can also include information on repulsive interaction terms such as van der Waals or Coulomb repulsion. In addition, the proposed method provides an interpretation of the predicted model in the form of important interaction energies for distinguishing between the active and inactive compounds in chemical libraries. We have evaluated the performance of SIEVE-Score in comparison with that of the state-of-the-art docking software Glide, with that of the recent machine-learning-based scoring function RF-Score-VS, and with that of the recent interaction fingerprint method SPLIF.

Methods Workflow Figure 1 describes the workflow of the proposed method. First, training compounds, whose affinities are already known, are obtained from databases. Second, these compounds are docked to the target protein, and interaction energy vectors are extracted from the docking results. Third, a random forest model is trained using the interaction energy vectors of the training compounds. Finally, the compounds to be screened are docked in the same way as the training compounds, and they are evaluated using the trained random forest model.

Protein-ligand docking The interaction energy vectors of active and inactive compounds are obtained using Glide (SP mode) 12 version 65013, which is a state-of-the-art docking program developed by Schrödinger, Inc. Protein Preparation Wizard 43 is used for hydrogen addition to and structural optimization of the target proteins. LigPrep is used to generate stereoisomers and determine the protonation forms of the docked compounds at pH 7.4. The docking grid is constructed such 5

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 1: Overview of SIEVE-Score. First, the bound structures of all compounds are generated using protein-ligand docking. Interaction energy vectors are extracted from the docking results, and a random forest model is subsequently trained using the interaction energy vectors of the compounds and their corresponding biological activities. New compounds are then screened by the trained model. that the center of the grid is the ligand of the cocrystal structure. The best-scored ligand pose is then used for further analysis. For calculating interaction energy vector, van der Waals, Coulomb, and hydrogen bonding (hbond) energy terms are used. In GlideScore 2.5, the van der Waals energy is calculated with the 12-6 Lennard-Jones potential:

EvdW

   σ 12  σ 6 − = 4 r r

where  and σ are the parameters and r is the distance between two atoms. The Coulomb energy is calculated with the Coulomb potential:

ECoul =

1 QQ0 4π0 r

where 0 is a constant and Q, Q0 are the charge of given two atoms. The hydrogen bonding energy is calculated with the following formula:

Ehbond = Chbond

X

6

g(∆r)h(∆α)

ACS Paragon Plus Environment

Page 6 of 35

Page 7 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

where Chbond is a constant parameter and g and h are functions that output a value of 1.00 for distances or angles within the preferred range and values of 0.00 – 1.00 for distances or angles that lie outside those limits. According to Friesner et al, the value of g(∆r) is 1.00 if the H...X hydrogen bond distance is 0.25 – 1.85 Å and decreases to zero in a linear fashion for distances in 2.10 – 2.50 Å. Similarly, the value of h(∆α) is 1.00 if the Z–H...X angle is 210◦ – 150◦ and decreases to zero in the ranges of 150◦ – 120◦ and 210◦ – 240◦ .

Interaction Energy Vector The features used in this study are interaction energy vectors, each of which consists of a set of protein-ligand interaction energy values. Some examples of interaction energy vectors are shown in Fig. 2. The interaction energy vector of a given ligand is defined as the vector of interaction energy terms between ligand and each amino acid residues within 12 Åfrom the center of the docking grid. The interaction energy terms of a given amino acid residue consist of three types of interactions: van der Waals, Coulomb, and hydrogen bonding interactions. Thus, the interaction energy vector has a number of terms equal to 3 (=vdW, Coulomb, hbond) × (the number of residues within 12 Åfrom the center of the docking grid). In this study, the interaction energy values are extracted from the per-residue interaction scores in the Glide docking results. The energy terms between the docked molecule and a given residue is calculated by the summation of all atom pairs between the docked molecule and the given residue.

Machine Learning The final score in SIEVE-Score is defined as the probability that a compound will be classified as active by a random forest classifier, which is a machine learning method that can be described as an ensemble of decision trees. 44 The training datasets are subsampled via the bootstrap sampling method for each decision tree, and subsequently, these decision trees are trained independently. In bootstrap sampling, samples are drawn from the original data 7

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 2: Brief description of the interaction energy vector. An interaction energy vector consists of hydrogen bond, van der Waals, and Coulomb interaction energy terms for each residue in the grid of docking. The dimension of the interaction energy vector is 3 × the number of residues involved in docking. such that duplicates are allowed. The final classifier model is the average of the trained trees. The final model is proven to have stochastically higher accuracy than that of a single decision tree because the average model has lower variance than the single models do. The important features for discriminating between active and inactive compounds can be extracted from the importance of each tree in the random forest model. The importance of each feature is calculated using the Gini importance, which is calculated as the decrease in the impurity of the data through the node corresponding to that feature averaged over all P decision trees in the forest. The impurity is calculated as i pi (1 − pi ), where i denotes the class labels and pi denotes the probability of an item with label i being chosen in the given node of each tree. The number of trees was set to 1000 to achieve sufficient robustness of prediction in 5fold cross-validation (CV). The number of features in each tree was set to 6 based on CV. A randomly sampled subset of the AKT1 data in the DUD-E dataset was used for parameter tuning. Scikit-learn 0.18 45 was used for implementation.

8

ACS Paragon Plus Environment

Page 8 of 35

Page 9 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Dataset The Directory of Useful Decoys, Enhanced (DUD-E) dataset 46 was used to evaluate SIEVEScore. This dataset contains target proteins, active compounds and inactive (decoy) compounds. Cocrystal structures with ligands are available for all target proteins. The active and inactive compounds were curated from the ChEMBL 47 and ZINC 48 databases, respectively. Structurally similar active compounds have already been filtered out by means of a clustering analysis. The inactive compounds are selected to have similar physicochemical properties and dissimilar chemical structures compared with the active compounds. In short, it is ensured that there are no identical or excessively similar data for each target. There are 102 target proteins in DUD-E, including 38 of those in the original DUD, which contains 40 targets. DUD-E contains 26 kinases, 15 proteases, 11 nuclear hormone receptors, 5 G-protein-coupled receptors (GPCRs), and 45 other proteins. There are 224 active and 13,869 inactive compounds for each target on average. The mean number of inactive compounds per active compound is 34.9. The mean number of residues used to train SIEVE-Score is 57.2. The diverse subset of DUD-E was used for detailed and unbiased comparisons between SIEVE-Score and Glide docking. Detailed information on this subset, including the names of the targets, the PDB IDs of the structures used for docking, the numbers of active and inactive compounds, and the number of residues used to calculate SIEVE-Score is shown in Table 1. The same detailed information for all target proteins in DUD-E is shown in Table S2.

Evaluation The evaluation was conducted by means of per-target 5-fold CV to assess the generalization ability of the method by avoiding the overlap between the training and test data. The compounds for each target protein were divided into five equal-sized groups. A machine learning model was trained using four out of the five groups and tested using the remaining 9

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Table 1: The description of the diverse subset of DUD-E. For each target, the name of the target protein, the PDB ID of the crystal structure, the description of the target, the numbers of active and inactive compounds, and the number of residues used to calculate SIEVE-Score are shown. HIV: Human immunodeficiency virus. PDB Description Active Inactive Residues Name AKT1 3cqw Serine/threonine-protein kinase Akt-1 293 16450 68 Beta-lactamase 48 2850 58 AMPC 1l2s CP3A4 3nxu Cytochrome P450 3A4 170 11800 3 40 3406 59 CXCR4 3odu C-X-C chemokine receptor type 4 GCR 3bqd Glucocorticoid receptor 258 15000 68 536 35750 62 HIVPR 1xl2 HIV type 1 protease HIVRT 3lan HIV type 1 reverse transcriptase 338 18891 54 3cjo Kinesin-like protein 1 116 6850 45 KIF11 group. This process was repeated 5 times, each with a different test group. The result for each target protein was obtained as the mean of the results of the five independently trained models with different groups as the test data. We used stratified CV, which means that the ratio between the numbers of active and inactive compounds was preserved in the training and test data. No information on other target proteins was used for training because per-residue representation is used in SIEVE-Score. We compared our method with several other methods. RF-Score-VS 37 is one of the latest state-of-the-art machine-learning-based scoring function. This scoring function also involves training a random forest model but is based on different features. The features are based on the thirty-six interaction count terms with particular protein–ligand atom type pair within a certain range. The interaction count terms represent the numbers of protein-ligand contacts in pairs of common heavy atoms in PDB, namely, C, N, O, F, P, S, Cl, Br, and I. RF-Score v1, 49 v2, 50 v3 51 use the same distance cutoff (12 Å), but different bins. Versions v1 and v3 have only one bin, but v2 has six bins with the size of 2Å. Version v3 also has six energy terms in AutoDock Vina 14 in addition to interaction count terms. The energy terms include hydrophobic, hydrogen bonding, and repulsion energy terms. Note that these terms are not per-residue features; instead, only one value is considered for

10

ACS Paragon Plus Environment

Page 10 of 35

Page 11 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

each feature type. Several CV methods are provided; per-target 5-fold CV was used here for fair comparison. The latest GitHub implementation provided by the authors of RF-Score-VS (the commit on May 4, 2017) 52 was used. We also compared our method to PLIF, SPLIF,, 41 and ligand-based similarity. PLIF is the classical interaction fingerprint method implemented in the MOE suite. 53 In PLIF, the interactions for each residue are categorized into surface contacts, ionic interactions, and hydrogen bonds. The occurrence of each type of interaction is encoded as one bit of the interaction fingerprint. The score for PLIF-based virtual screening is calculated as the Tanimoto coefficient between two interaction fingerprints. SPLIF is one of the latest state-of-the-art interaction-fingerprint-based virtual screening methods. Briefly, SPLIF encodes the three-dimensional coordinates of interacting proteinligand fragments as units of the interaction fingerprint. Each fragment consists of a contacting protein-ligand atom pair and its neighboring atoms. The SPLIF-based similarity between two ligands is calculated as follows. First, matching interacting protein-ligand fragments are detected based on the following criterion: root-mean-square deviation (RMSD) < 1 Å. Second, all atom lists of all matching bits are deduplicated to obtain unique matching ligand atoms (UMLA) and unique matching protein atoms (UMPA). Finally, the SPLIFbased similarity score is defined as follows: r SPLIFSim =

NU M LA NU M P A NU LA NU P A

where NU M LA is the number of unique matching ligand atoms, NU LA is the number of unique ligand atoms involved in at least one protein-ligand interaction, NU M P A is the number of unique matching protein atoms, and NU P A is the number of unique protein atoms involved in at least one protein-ligand interaction. The ligand similarity is the Tanimoto coefficient of functional connectivity fingerprints up to the second closest neighbor (FCFP4) in Pipeline Pilot. 54 The possible atomic labels in

11

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

FCFP are as follows: hydrogen bonding acceptor, hydrogen bonding donor, positively ionized or positively ionizable, negatively ionized or negatively ionizable, aromatic, and halogen. These methods were applied to a subset of DUD-E dataset containing 10 target proteins. The screening accuracies for SPLIF, PLIF, and ligand similarity methods were obtained from Da et al. 41 To assess the screening accuracy, DEKOIS 2.0 data sets 55 was used for independent test. The active compound of DEKOIS was curated from ChEMBL database. The decoy compounds were generated from ZINC database, regarding high physicochemical similarity between actives and decoys and avoidance of potentially active compounds. In this evaluation, compounds of DUD-E and DEKOIS datasets were used for training and test, respectively. Eight proteins were used, namely ADRB2(ADRB2), AKT1(AKT1), FA10(FXA), FKB1A(FKBP1A), HS90A(HSP90), IGF1R(IGF1R), INHA(INHA), and ROCK1(ROCK1). The former names and the latter names in the brackets were used in DUD-E and DEKOIS dataset, respectively. There were 40 active compounds and 1200 decoy compounds in each protein in DEKOIS dataset. Structurally similar compounds between training data and test data were excluded from training to avoid overfitting. To find structurally similar compounds, ECFP4 56 in RDKit version 2017.03.3 was used. In the training data, 40 active compounds and 51 decoy compounds were filtered out with >= 0.8 similarity to the test data. The final number of compounds used for the training were 383 for active and 16525 for decoy. The protein structure and procedure for DEKOIS were the same as that for DUD-E.

Evaluation Metrics The enrichment factor (EF) 1% values, the EF 10% values, and the mean areas under the curves (AUCs) of the receiver operating characteristic (ROC) curves were used for evaluation. EF values are commonly used in virtual screening evaluation as accuracy metrics. The EF x% value is defined as the ratio between the predicted hit rate and the random hit rate when 12

ACS Paragon Plus Environment

Page 12 of 35

Page 13 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

the top x% ranked compounds are selected as active. This value is calculated as follows:

EF x% =

total actives numberoftrue actives at x% ÷ number of compounds at x% total compounds

. The expected EF value of random prediction is 1. The ROC curve represents the relationship between the true and false positive rates, false actives true actives and , respectively. which are defined as predicted actives predicted actives The AUC is commonly used in machine learning studies as an accuracy metric and is defined as the area under the ROC curve. The ROC curve was plotted for each compound in prediction order. The possible values of the AUC range from 0 to 1, and the AUC values corresponding to ideal and random prediction are 1 and 0.5, respectively.

Results and Discussion We compared SIEVE-Score with the Glide docking software using DUD-E dataset. Stratified per-target 5-fold CV was used. The AUCs of the ROC curves and EF values were used as the metrics for evaluation. See the Method section for more details. Figure 3 and Table 2 present the EF and AUC results for the diverse subset of DUD-E. This subset contains 8 target proteins out of the 102 total targets in DUD-E. The ROC curves and the EF and AUC results for all target proteins in the DUD-E dataset are also presented in Fig. S1 and Table S1, respectively. With SIEVE-Score, the EF 1% results are improved for all eight target proteins. Similarly, the EF 10% results are improved for six targets, excluding CP3A4 (CYP3A4) and AMPC (β-lactamase), and the AUCs are improved for seven of the eight targets, excluding CP3A4. On average, the EF 1% value for SIEVE-Score is 2.7 times higher than that for Glide, indicating that 2.7 times more active compounds are found by SIEVE-Score than by Glide on average when the top 1% ranked compounds are biologically assayed for these target proteins. For AMPC and CP3A4, the deviations among different CV models are high. The results for these targets are further discussed in 13

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

the discussion section.

14

ACS Paragon Plus Environment

Page 14 of 35

Page 15 of 35 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 3: ROC curves for the diverse subset of DUD-E. Each plot corresponds to one target. The five colored lines represent the results of the five models considered in the CV evaluation of SIEVE-Score. The black dotted line represents the mean of these five models. The red dotted line represents the results of Glide docking (SP mode). The gray dotted line represents the expected results of random prediction. akt1: serine/threonine protein kinase Akt-1; ampc: β-lactamase; cp3a4: cytochrome P450 3A4; cxcr4: C-X-C chemokine receptor type 4; gcr: glucocorticoid receptor; hivpr: human immunodeficiency virus (HIV) type 1 protease; hivrt: HIV type 1 reverse transcriptase; kif11: kinesin-like protein 1.

15

ACS Paragon Plus Environment

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Table 2: Comparison of EF 1%, EF 10% and AUC results between SIEVE-Score and Glide docking (SP mode) on the diverse subset of DUD-E. EF 1% Protein SIEVE-Score AKT1 42.1 30.7 AMPC CP3A4 6.7 61.1 CXCR4 GCR 33.3 38.3 HIVPR HIVRT 39.8 53.4 KIF11

Glide 10.3 6.1 6.0 14.7 20.9 17.8 20.4 44.8

EF 10% SIEVE-Score 7.3 5.6 2.3 8.7 6.8 7.5 6.9 8.2

Glide 2.7 5.8 3.1 5.7 4.3 5.0 5.7 7.5

AUC SIEVE-Score Glide 0.900 0.640 0.854 0.847 0.644 0.661 0.946 0.879 0.885 0.767 0.905 0.795 0.881 0.807 0.936 0.881

Figure 4 presents a scatter plot of the EF 1% results for SIEVE-Score vs. Glide. Each point represents a target in the DUD-E dataset. SIEVE-Score achieves better predictions for 97 of the 102 DUD-E targets and is tied with Glide for the remaining five targets. Similarly, SIEVE-Score achieves better predictions for 95 targets and is tied with Glide for 2 targets in the EF 10% case. With regard to the AUC, SIEVE-Score achieves better predictions for 95 targets, and there are no tied targets. The overall EF 1% value of SIEVE-Score for all 102 targets in DUD-E 1% is significantly higher than that of Glide (mean EF 1%: SIEVE-Score=43.913, Glide=21.253), with p