
CompScore: boosting structure-based virtual screening performance by incorporating docking scoring functions components into consensus scoring

Yunierkis Perez-Castillo, Stellamaris Sotomayor, Karina Jimenes-Vargas, Mario Gonzalez-Rodriguez, Maykel Cruz-Monteagudo, Vinicio Danilo Armijos-Jaramillo, M. Natália D.S. Cordeiro, Fernanda Borges, Aminael Sanchez-Rodríguez, and Eduardo Puente Tejera

J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.9b00343 • Publication Date (Web): 26 Aug 2019


Journal of Chemical Information and Modeling

CompScore: Boosting Structure-Based Virtual Screening Performance By Incorporating Docking Scoring Functions Components Into Consensus Scoring

Yunierkis Perez-Castillo (1,*), Stellamaris Sotomayor-Burneo (2), Karina Jimenes-Vargas (3), Mario Gonzalez-Rodriguez (4), Maykel Cruz-Monteagudo (5,7,8), Vinicio Armijos-Jaramillo (6), M. Natália D. S. Cordeiro (7), Fernanda Borges (8), Aminael Sánchez-Rodríguez (2), Eduardo Tejera (6)

(1) Bio-Cheminformatics Research Group and Escuela de Ciencias Físicas y Matemáticas, Universidad de Las Americas. Quito 170504, Ecuador

(2) Departamento de Ciencias Biológicas, Universidad Técnica Particular de Loja. Loja 110107, Ecuador

(3) Dirección General de Investigación, Universidad de Las Americas. Quito 170504, Ecuador

(4) SI2 Lab, Universidad de Las Americas, Quito 170504, Ecuador

(5) Department of General Education, West Coast University, Miami Campus. Doral, Florida 33178, USA

(6) Bio-Cheminformatics Research Group and Facultad de Ingeniería y Ciencias Aplicadas, Universidad de Las Américas. Quito 170504, Ecuador

(7) REQUIMTE/Departamento de Química e Bioquímica, Faculdade de Ciências, Universidade do Porto. Porto 4169-007, Portugal

(8) CIQUP/Departamento de Química e Bioquímica, Faculdade de Ciências, Universidade do Porto. Porto 4169-007, Portugal

Abstract

Consensus scoring has become a commonly used strategy within structure-based virtual screening (VS) workflows, with improved performance compared to those based on a single scoring function. However, no research has been devoted to analyzing the value of docking scoring function components in consensus scoring. We implemented and tested a method that incorporates docking scoring function components into the setup of high-performance VS workflows. This method uses genetic algorithms to find the combination of scoring components that maximizes the VS enrichment for any target. Our methodology was validated using a dataset including ligands and decoys for 102 targets that have been widely used in VS validation studies. Results show that our approach outperforms other methods for all targets. It also boosts the initial enrichment performance of the traditional use of whole scoring functions in consensus scoring by an average of 45%. Our methodology proved to be highly predictive when challenged to rescore external (previously unseen) data. Remarkably, CompScore not only retained its performance after re-docking with different software, but the enrichment obtained was also shown not to be artificial. CompScore is freely available at: http://bioquimio.udla.edu.ec/compscore/.

1. Introduction

Structure-Based Drug Discovery (SBDD) uses the 3D structures of proteins (targets) and compounds (ligands) to predict their potential interaction. Among SBDD applications, virtual screening (VS) aims to rank the compounds in a large chemical database from highest to lowest probability of binding to a target of interest [1]. One of the most widely used tools in SBDD is molecular docking, which has been particularly useful in VS [2]. Applications of VS to the discovery of new molecular scaffolds to feed the drug development pipeline have been reported elsewhere [3–5]. Any docking method has two main components: the conformational exploration of the receptor-ligand complex and the evaluation (scoring) of the predicted interactions in the complex. Of the two, scoring functions remain the weakest component. Their deficiencies can be explained by the complexity of estimating the binding energy between the protein and the ligand [6].

Each scoring function uses different physicochemical descriptions and parameters to estimate the binding affinity of the complex. However, no individual scoring function can account for all the physicochemical events involved in protein-ligand interactions, and the computational estimation of the binding energy is only an approximation of reality [7,8]. Sacrificing precision for the sake of calculation speed is a requirement that any scoring function used for docking must fulfil. For example, one scoring function might be very good at treating solvation effects but not at taking shape complementarity into account. To overcome these deficiencies, consensus scoring (CS) has emerged as a strategy that has been shown to outperform single


scoring functions since it combines information from a variety of them and compensates for their individual weaknesses [9–12]. CS studies differ in the target, the combination of scoring functions used and the method used to aggregate them. There are many reports of approaches of variable complexity to fuse scoring functions. Some of them employ classical aggregation methods such as majority voting, rank, maximum, intersection and minimum [13,14]. Other methods include multilinear regression, non-linear regression and multivariate analysis [15]. In this sense, the use of machine learning (ML) approaches for the design of CS strategies has shown promising results and proved to be an innovative and effective way to overcome difficulties in structure-based CS [16]. Consequently, it leads to more accurate and general outcomes [17]. In a recent study, an unsupervised machine learning algorithm for structure-based VS (Gradient Boosting) was proposed [18]. This method was tested on 21 targets from the DUD-E database [19], and the ML-CS strategies showed better performance than traditional CS and the individual scoring methods. A supervised CS strategy using Random Forest to successfully predict protein-ligand binding poses has also been proposed [20].

All the CS strategies listed above consist of combining the scoring functions from one or more software packages using different aggregation methods, and none of them has focused on the potential of scoring function components for CS. Furthermore, many of these studies develop general models for any problem and have been validated on a limited number of case studies. Within this panorama one question arises: can the combination of individual scoring function components be more effective than combining whole scoring functions? A contribution toward addressing this question was the study by Koes et al. in the context of the SMINA project [21]. In this research, a custom empirical scoring function was developed from


the interaction terms available in AutoDock Vina [22], with the custom scoring function achieving better performance than the original one. The proposed scoring function consisted of a linear combination of five terms out of the 58 scoring components available in AutoDock Vina. This research showed that a limited subset of scoring components from a single docking program can outperform its whole scoring function.

Here, we propose the CompScore algorithm as a universal tool for CS that exploits the information provided by the components of scoring functions. In CompScore the scoring functions are decomposed into their components and a genetic algorithm is used to find the combination of them that maximizes the VS enrichment in each case study. This approach leads to tailored CS schemes for every target. Our methodology is extensively validated and found to be superior to all other tested scoring approaches. In addition, general CS models relying on scoring function components are proposed for cases in which no data is available for validating the VS workflows. CompScore is freely available through a Web Service at: http://bioquimio.udla.edu.ec/compscore/.

2. Computational Methods

2.1 Datasets and preparation

Validation datasets were downloaded from the Directory of Useful Decoys, Enhanced (DUD-E) [19]. All 102 targets from the DUD-E database were selected for the validation of our methodology. We employed the same receptor structures and docking boxes as in the DUD-E. Receptor preparation included the addition of hydrogen atoms and partial atomic charges and conversion to MOL2 format. Receptors were prepared with UCSF Chimera [23]. In the DUD-E validation, docking calculations were performed with Dock 3.6 [24]. The DUD-E database contains more than one scored conformation per compound. Given that in a regular large-scale VS campaign only one conformation per compound is analyzed, we filtered the provided


conformers to keep the top-scored one of each compound. Ligands were converted to MOL2 format using UCSF Chimera. Atomic partial charges for compounds provided in the DUD-E were preserved. A summary of the composition of the dataset for each target is provided as Supporting Information in Table S1.

2.2 Rescoring

Conformers were rescored with Dock 6.8 [25], OEDocking [26] and Gold [27], using the default parameters of each scoring function. Python scripts were created to rescore molecules using the OEDocking Toolkits and to produce summary scoring tables without structure output. A total of 15 scoring functions were computed with these programs; they are listed in Table 1. The Dock 3.6 scoring values provided in the DUD-E were included in our calculations, for a total of 16 scoring functions for CS analyses.

Table 1. Scoring functions computed per docking program

Docking Program    Scoring Functions
Dock 6.8           Grid, Contact, Continuous, Hawkins, PBSA, SASA
Gold               PLP, GoldScore, ChemScore, ASP
OEDocking          Shapegauss, ChemScore, ChemGauss 3, ChemGauss 4, PLP
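The conformer filtering described in section 2.1 (keeping only the top-scored conformation of each compound before rescoring) can be sketched as follows. This is a minimal illustration and not the authors' script: the function name `keep_top_scored`, the record layout, the compound and pose identifiers, and the score values are all hypothetical, and the sketch assumes that lower (more negative) scores are better, which can be switched off through `lower_is_better`.

```python
from typing import Dict, Iterable, Tuple


def keep_top_scored(
    poses: Iterable[Tuple[str, str, float]],
    lower_is_better: bool = True,
) -> Dict[str, Tuple[str, float]]:
    """Return {compound_id: (pose_id, score)} with the best pose per compound.

    `poses` holds (compound_id, pose_id, score) records (hypothetical layout).
    """
    best: Dict[str, Tuple[str, float]] = {}
    for compound_id, pose_id, score in poses:
        current = best.get(compound_id)
        if current is None:
            # First pose seen for this compound.
            best[compound_id] = (pose_id, score)
            continue
        better = score < current[1] if lower_is_better else score > current[1]
        if better:
            best[compound_id] = (pose_id, score)
    return best


# Toy input: two compounds with two scored conformers each (invented values).
poses = [
    ("cmpd_A", "A_1", -32.5),
    ("cmpd_A", "A_2", -41.0),
    ("cmpd_B", "B_1", -18.2),
    ("cmpd_B", "B_2", -12.7),
]
top = keep_top_scored(poses)
```

With these toy records, `top` keeps pose `A_2` for `cmpd_A` and pose `B_1` for `cmpd_B`, so each compound contributes a single conformer to the rescoring step.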

2.3 The CompScore algorithm

The CompScore algorithm searches for the combination of scoring function components that maximizes a pre-selected VS enrichment metric. CS is based on the aggregation of the rankings derived from the individual scoring functions and components using their arithmetic mean. For a subset of scoring functions or components, compounds are first ranked according to each of them to transform the set of scores into relative rankings. Then, the aggregated relative rank of a compound is computed as the arithmetic mean of its ranks across all the selected scoring functions and components. Finally, the VS enrichment metric is evaluated over the aggregated relative ranking.

In this study, 16 scoring functions obtained from four docking programs were computed for each compound in the DUD-E. With this number of scoring functions, the exhaustive search of all possible combinations of scoring functions (of size 1 to 16, 2^16 − 1 = 65535 combinations) can be completed in a short time. However, these 16 scoring functions account for 87 scoring components in total. This leads to a combinatorial explosion (on the order of 2^87, roughly 1.5×10^26, subsets) that prevents the evaluation of the VS quality of all their possible combinations. For this reason, the CompScore algorithm implements a genetic algorithm (GA) to find the combination of scoring components that maximizes the desired VS enrichment metric. Any feature selection method that can be implemented as a wrapper maximizing the desired enrichment metric should provide models with close to optimum performance. GA was selected due to its ease of implementation, flexibility and proven high performance in solving feature selection problems in cheminformatics [28–30]. Although beyond the scope of the current research, other feature selection algorithms can be tested and added to CompScore in the future.

2.3.1 Scoring components pre-processing

The first step of the CompScore algorithm is to convert the scores to relative rankings. For this, samples are ranked according to each scoring function and the relative ranks are computed as the ranks divided by the number of compounds in the dataset. The descending or ascending order for ranking molecules was determined from the definition of the scoring functions within the docking programs. For each molecule, the corresponding weighted scores were computed as the docking scores divided by the number of heavy atoms that it contains.

To avoid redundancy in the input features, the correlation between the rankings produced by the scoring function components is analyzed. Only one ranking among those having a correlation above a certain threshold is kept in the dataset. From a set of correlated rankings, the one appearing first in the dataset is removed. In our validations, the maximum correlation allowed between relative rankings was set to 0.95.

For practical reasons, it is also convenient to set the minimum number of score levels that a scoring function component must provide. For example, a scoring component with the same value for all samples will produce a meaningless (constant) ranking of the compounds. The same applies to scoring function components providing a limited number of unique values (levels). The CompScore algorithm provides the option of excluding such impractical scoring function components. For our calculations, scoring components providing fewer than four different score levels were excluded.

2.3.2 Virtual Screening enrichment metrics

VS protocol validation aims at obtaining the highest enrichment of ligand molecules at the beginning of the ranked list. This is achieved through the maximization of an enrichment metric,


which is selected depending on the objective of the VS campaign. In the CompScore algorithm, either the Enrichment Factor (EF) or the BEDROC metric can be maximized. These metrics have been extensively used to estimate the enrichment capacity of VS workflows; see [31–34] for definitions and applications. Each metric provides specific enrichment information. While EF measures the enrichment of ligands at a specific fraction of the ranked list, BEDROC accounts for early enrichment. For the maximization of EF, the fraction of screened data at which this metric is to be maximized must be provided. On the other hand, if BEDROC is to be maximized, the value of the α parameter for the BEDROC calculation must be provided as input.

2.3.3 GA-guided consensus scoring

At this point, scoring components have been converted to relative rankings and filtered as described in section 2.3.1. Also, an enrichment metric has been selected for maximization (section 2.3.2). Now, the CompScore algorithm searches for the combination of scoring function components maximizing the enrichment. This was implemented as a GA-guided search using the DEAP framework in Python [35]. Individuals in the population for GA evolution were coded as binary vectors of length equal to the number of input scoring function components. Bits set to 1 are considered for CS, while those set to 0 are excluded. In our validations we used a population of 100 individuals that evolved for 1000 generations. The initial population was randomly created. The selection operator was set to a tournament of size 2. The two-point crossover operator was used for crossover, and mutation proceeded according to the bit-flipping operator. The objective function of the GA was set to either the EF or the BEDROC metric. To obtain the ranked list of compounds for enrichment computation, a subset of relative rankings was aggregated


using the arithmetic mean and sorted in ascending order. Then, the chosen enrichment metric was computed for the aggregated ranking of compounds. After finishing the GA evolution, the best solution was chosen as the individual in the population with the highest fitness (VS enrichment metric). A bootstrap cross-validation was performed on the best solution to evaluate its robustness. For this, 1000 bootstrap samples containing the same number of ligands and decoys as the whole target dataset were generated.

All calculations were performed using a single core of a computer equipped with two Intel Xeon E5-2690 CPUs and 128 GB of RAM. The program was developed with Python 3.6 installed within the Anaconda distribution and package management system.

2.4 Alternative scoring strategies

For comparison purposes, three additional scoring strategies were implemented. The simplest consisted of identifying the best performing individual scoring function or component (BISC). For this, the VS enrichment of each scoring function and its components was evaluated and the best performing one was returned. Another strategy consisted of the aggregation of all non-correlated and non-close-to-constant scoring components (ASC) following the same fusion scheme employed in CompScore (see section 2.3). The latter can be interpreted as a CompScore model in which all scoring functions and their components are selected for aggregation. The last scoring strategy (Exhaustive Search, ES) consisted of evaluating the VS performance of all possible combinations of the whole scoring functions listed in Table 1 plus the Dock 3.6 scoring function, leading to the evaluation of 65535 combinations of scoring functions of size 1 to 16. This experiment provided the subset of the 16 whole scoring functions that must be aggregated to maximize the VS enrichment and produced a classical CS model


composed of whole scoring functions. The VS performance of any combination of scoring functions was evaluated following the same procedure employed in CompScore.

2.5 External validation

For external validation, the targets' datasets were randomly split into training and external validation sets containing 80% and 20% of the data, respectively. To maintain the same proportion of ligands and decoys in these sets as in the whole datasets, ligands and decoys were split separately. That is, the training sets contained 80% of the ligands and 80% of the decoys, and the external validation sets contained the remaining compounds. Then, we used the training data to obtain the set of scoring function components maximizing the initial enrichment of actives by employing BEDROC with α=160.9 as the fitness function. Afterward, the resulting CompScore model was used to rescore the compounds in the external dataset and initial enrichment was measured over this rescored data. This external validation procedure was repeated 100 times for each target.

2.6 Re-docking experiments

All DUD-E ligands and decoys were re-docked to their respective targets employing Dock 6.8. Compounds were considered flexible, with 500 orientations generated prior to pruning (pruning_max_orients=500) and a maximum of 500 orientations explored (max_orientations=500). The bump filter was set to allow a maximum of two bumps between a compound and the receptor (max_bumps_anchor and max_bumps_growth set to 2). Bumps were defined as pairs of atoms overlapping by more than 75% of their van der Waals radii. The grid-based scoring function of Dock 6.8 was used for re-docking. The remaining parameters of Dock 6.8 were kept at their default values. The best scored conformer of each compound was saved and then rescored following the same procedure described in section 2.2 for the original DUD-E poses. The re-docked and rescored compounds were used as input to the CompScore algorithm and external


validation experiments were performed as described in section 2.5 over 50 train/external data splits.

2.7 CompScore general models

To derive general CompScore models, all DUD-E targets were used as input to CompScore and the fitness function of the GA was set up to maximize the geometric mean of the relative enrichment across all targets. Since an individual in the GA population encodes a set of scoring function components to aggregate, this global fitness function searches for a set of them that maximizes, to the same extent, the relative enrichment across all the DUD-E targets. The relative enrichment for a specific target was measured as the ratio between the enrichment achieved by the individual and a reference enrichment value for that target. This reference enrichment was defined as that achieved when all compounds of the target are used to train a CompScore model (section 2.3). General CompScore models were trained for primary docking with Dock 3.6 (from the DUD-E poses) and with Dock 6.8 (from the re-docking poses). The search for these global models was repeated for 50 random splits of each target into training and validation sets, leading to 50 possible global CompScore models. As in previous experiments, 80% and 20% of the compounds were selected for the training and external validation sets, respectively. Scoring function components present in at least half (25) of the trained global models were selected to form a general CompScore model.

2.8 Evaluation of artificial enrichment

To evaluate whether CompScore provides a real boost in VS performance or merely learns the bias present in the DUD-E, two experiments were performed. First, the unbiasing molecular properties used in the DUD-E were computed with RDKit [36]. The datasets of molecular properties of all targets were split into 50 train/external partitions and the training data was used to train


CompScore-like models. Dataset splitting was performed so that each partition had the same composition as that generated during CompScore validation. Since only 6 molecular properties are used as unbiasing ones in the DUD-E, all their possible combinations were explored and the one providing the highest value of BEDROC for each target partition was saved. This best model was then used to predict the corresponding external validation set.

The second experiment to evaluate possible artificial enrichment consisted of removing the target information from the DUD-E and training CompScore models with molecular descriptors encoding the structure of the compounds. CompScore models are obtained through the arithmetic mean of the relative rankings associated with the scoring components considered by the model. Given its ranking-based nature, only continuous molecular descriptors capable of providing a ranking of molecules can be used as input to CompScore. For this reason, the 2D Autocorrelation and RDF molecular descriptors were selected to encode 2D and 3D molecular structures, respectively. The 3D structures of the DUD-E molecules were generated with RDKit [36] and all molecular descriptors were calculated with the same library using a Python script.

DUD-E was constructed imposing the condition that ligands and decoys are topologically dissimilar. Consequently, structural descriptors should be capable of separating the ligands and decoys associated with a target even with a simple model. Taking this into account, the performance of CompScore when scoring function components and molecular descriptors are used as inputs was evaluated in cross-target predictions. To perform these cross-target evaluations in a setup favorable to ligand-based models, a target model is only employed to predict the targets most similar to it in the ligand space. To compute the similarity between two targets A and B, first the similarity of each ligand i in target A to every ligand j in B, $S_{i,j}$, is measured as the Tanimoto coefficient between their respective


Morgan fingerprints of radius 2. Then, the similarity of compound i to target B is defined as the maximum similarity between it and any ligand of B, $S_{i,B} = \max_{j \in B} S_{i,j}$. Finally, the similarity between the targets A and B in the ligand space is computed as $S_{A,B} = \min_{i \in A} S_{i,B}$. This target similarity metric ensures that, for a pair of similar targets A and B, at least one compound in B is similar, above a certain cutoff, to any ligand in A and vice versa. These calculations were performed with RDKit [36] and two targets were considered to be similar if they share a similarity higher than 40%. The enrichment of a target model in cross-target validations was measured as the average enrichment achieved across the targets similar to it. Additionally, for scoring function components, it is known whether they must be ranked in ascending or descending order. However, this is not the case for physicochemical properties or molecular descriptors. Thus, when transforming the latter to relative rankings, both the ascending and descending rankings were considered for training the CompScore models. The only constraint imposed on the models was that the two rankings associated with one descriptor were not allowed in the same model.

3. Results and discussion

3.1 CompScore performance

We performed calculations with the aim of establishing the VS performance of the CompScore algorithm and testing how it compares to the other scoring schemes described in section 2.4. All performance comparisons are made between CompScore and the best performing approach among the ASC, BISC and ES scoring schemes. As previously mentioned, the CompScore algorithm can be used to find the combination of scoring function components that maximizes either EF or BEDROC and, unlike EF, BEDROC is able to account for early enrichment. Thus, we performed experiments with both metrics. For the sake of simplicity, from


here on, all analyses will focus on BEDROC computed with the α parameter set to 160.9 and on EF computed for the first 1% of screened data. In contrast to BEDROC, which is bounded between 0 and 1, EF is an unbounded metric, which makes it difficult to use for comparing the VS performance of scoring methods across different datasets. For this reason, the maximum possible EF (for a perfect ranking) was computed for each dataset and comparative analyses were carried out with the fraction of this optimal EF achieved by the VS scoring methods.

The results obtained for all the DUD-E targets employing the four scoring strategies previously described are provided as Supporting Information. Table S2 contains the enrichment metrics obtained with the different scoring approaches when BEDROC with α=160.9 is used as the model selection criterion. Table S3 includes the equivalent information when EF at the first 1% of screened data is employed as the criterion for the selection of the best VS strategy. The results obtained for the early enrichment performance of the four CS strategies explored are summarized in Fig 1. In addition to the best solution per scoring strategy, we also summarize the best performance obtained per target with any of the ASC, BISC and ES strategies (cyan).
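The normalization of EF by its maximum attainable value, described above, can be illustrated with a short sketch. It is not the CompScore implementation: the function names, the rounding rule used to size the top fraction, and the toy activity labels are assumptions made for illustration only. For a dataset with N compounds, A actives and a top fraction holding n compounds, a perfect ranking places min(A, n) actives in the top fraction, which fixes the maximum possible EF.

```python
from typing import Sequence


def _top_size(n_total: int, fraction: float) -> int:
    # Assumed convention: round to the nearest integer, at least one compound.
    return max(1, round(fraction * n_total))


def enrichment_factor(is_active_ranked: Sequence[bool], fraction: float) -> float:
    """EF at `fraction` for activity labels sorted from best to worst rank."""
    n_total = len(is_active_ranked)
    n_actives = sum(is_active_ranked)
    if n_actives == 0:
        raise ValueError("EF is undefined without actives")
    n_top = _top_size(n_total, fraction)
    hits = sum(is_active_ranked[:n_top])
    return (hits / n_top) / (n_actives / n_total)


def max_enrichment_factor(n_actives: int, n_total: int, fraction: float) -> float:
    """EF of a perfect ranking, i.e. all actives placed at the top of the list."""
    n_top = _top_size(n_total, fraction)
    return (min(n_actives, n_top) / n_top) / (n_actives / n_total)


def fraction_of_max_ef(is_active_ranked: Sequence[bool], fraction: float) -> float:
    """EF expressed as a fraction of the maximum EF attainable on this dataset."""
    ef = enrichment_factor(is_active_ranked, fraction)
    ef_max = max_enrichment_factor(
        sum(is_active_ranked), len(is_active_ranked), fraction
    )
    return ef / ef_max


# Toy ranked list (invented): 200 compounds, 10 actives, one active in the top 2.
labels = [True, False] + [False] * 3 + [True] * 9 + [False] * 186
ef = enrichment_factor(labels, 0.01)
ef_max = max_enrichment_factor(10, 200, 0.01)
relative_ef = fraction_of_max_ef(labels, 0.01)
```

In this toy case the top 1% of 200 compounds holds 2 compounds, one of which is active, so EF is 10, the maximum possible EF is 20 and the normalized value is 0.5. Unlike the raw EF, the normalized value lies between 0 and 1 regardless of the ligand-to-decoy ratio of the dataset.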

Figure 1. Violin plot of the initial enrichment performance of the evaluated scoring strategies. Initial enrichment is measured as BEDROC for α=160.9. Black lines within the boxes indicate the median. The plot is built with the BEDROC values for all DUD-E targets provided in Supporting Information Table S2.

Fig 1 shows that the CompScore methodology (blue) outperforms the ASC, BISC and ES strategies and that it provides the highest minimum early enrichment. The improvement is statistically significant (p = 8∙10⁻⁸) according to a two-tailed unpaired t-test. While the methodologies used for comparison (cyan) yield minimum BEDROC values around 0.1, the minimum value obtained with the CompScore methodology is 0.23. Furthermore, the top 50% of the BEDROC values obtained with CompScore are higher than the worst 75% of the values provided by any of the other methodologies. More importantly, when a target by target
analysis is performed, the proposed methodology outperforms all the other methods for every target (see Supporting Information Table S2).

The worst initial enrichment is obtained when all scoring components are aggregated (ASC). This result is expected since no selection of scoring components is performed before aggregation. It is well established that any consensus decision-making system must contain a diverse subset of decision makers that are meaningful for the problem under investigation 37. The latter is addressed by the CompScore algorithm through the GA search.

Another interesting result is that the performances of the BISC and ES approaches are highly similar. This suggests that a single scoring component can achieve a VS performance similar to that of the best combination of whole scoring functions. A detailed analysis of the data presented in Supporting Information Table S2 shows that, for 12 targets, a single scoring function component (BISC approach) produces BEDROC values more than 5% higher than the best combination of whole scoring functions derived from the ES approach. This improvement reaches 73.5% for the mmp13 target. This finding supports our hypothesis that the components of docking scoring functions provide meaningful information for implementing VS workflows and that, in some cases, they can be more relevant than whole scoring functions.

The execution time analysis shows that, on average, the CompScore algorithm requires 356 seconds, while 150 seconds are required to complete the exhaustive exploration of the combinations of the 16 scoring functions. These execution times refer to the time required to find a CS model once docking and rescoring have been performed. As shown in section 3.5, if the time required to perform the docking and rescoring calculations is considered, then it takes much longer to obtain a CS model. Using as reference the run time of the ES approach with 16 scoring functions, the estimated time required for the exhaustive exploration of all the possible combinations
of 18 scoring functions is almost double the time needed by CompScore. Moreover, the bootstrap cross-validation of our methodology shows that it remains stable under changes in the dataset composition (see Supporting Information Table S2).

We also evaluated the performance of CompScore when the VS campaigns focus on maximizing EF. When the maximum value of EF achieved by each method is analyzed, the results are similar to those obtained with BEDROC, as shown in Fig 2. For EF, the proposed algorithm also outperforms the rest of the tested scoring schemes, and the performance improvement is also statistically significant (p = 1.5∙10⁻⁷). As with BEDROC, more than 50% of the best solutions found by the CompScore methodology outperform more than 75% of the worst solutions found by any of the other methods. Considering the equivalence of these results, from here on our discussion will mainly focus on BEDROC.

Figure 2. Violin plot of the EF performance of the evaluated scoring strategies. EF is measured for a fraction of screened data equal to 0.01. Black lines within the boxes indicate the value of the
median. The plot is built with the EF values for all DUD-E targets provided in Supporting Information Table S3.

3.2 Detailed enrichment improvement relative to other scoring schemes

We also analyzed the BEDROC increase achieved by our methodology across all targets relative to the other tested scoring methodologies. This analysis reveals that CompScore improves the performance of any other tested method by 45% on average. However, this improvement varies widely among the DUD-E targets, as shown in Fig 3. According to Fig 3, the CompScore methodology provides a BEDROC improvement higher than 33.87% for more than 50% of the targets analyzed. This improvement almost doubles, to 64.29%, for 25% of the targets.

The target by target analysis of the achieved fraction of the maximum possible EF reveals that, for five targets, our methodology is unable to improve this metric relative to the other approaches. Also, there is a 3.94% decrease in performance for the mmp13 target relative to the best performing alternative scoring scheme. Even so, for 50% of the targets, the EF improvement is higher than 39.84% when compared to any of the other methodologies, and it exceeds 70% for 25% of the targets.

Figure 3. Percent improvement in BEDROC of the CompScore method relative to the best performing VS models obtained with any of the ASC, BISC and ES approaches.

Of the 18 targets with BEDROC lower than 0.25 according to all of the ASC, BISC and ES approaches, only two increase their BEDROC by less than 50% when CompScore is employed for scoring (see Fig 3). For 10 of them, BEDROC increases by more than 100%. That is, the BEDROC values provided by our methodology are more than twice the maximum value obtained with any of the other tested approaches. On the other hand, for those targets for which a BEDROC higher than 0.75 could be obtained with other methods, the improvement observed is lower than 25%.

The largest improvement in BEDROC is observed for the ampc target, with 266.67% (over 3.5 times). The best performing alternative scoring scheme for this target is ES, which achieves a BEDROC of 0.18 through the combination of four scoring functions: Dock 3.6, Dock 6.8 Pbsa, Gold Goldscore and OEDocking ChemGauss 4. In contrast, the CompScore
algorithm achieves a BEDROC value of 0.66 by aggregating two scoring functions (Dock 3.6 and OEDocking ShapeGauss) and 12 scoring function components. These components include four from Dock 6.8, two from Gold and six from OEDocking.

The fact that the largest BEDROC improvement is obtained for one of the targets with the lowest number of ligands in the DUD-E led us to investigate whether there could be a relationship between this improvement and the number of available ligands. This analysis is summarized in Fig 4, which shows that there is no relationship between these two variables. Regardless of the number of ligands associated with a target, the improvement in BEDROC provided by CompScore relative to the other scoring schemes can be either high or low. For the targets with fewer ligands than ampc (cxcr4, comt, inha and fabp4), the improvement in BEDROC ranges from 5.32% to 65.12%. On the other hand, for fnta, which is the target with the largest number of ligands (592), the increase in BEDROC is 109.09%.
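The percentages of improvement reported in this section are relative changes with respect to the best alternative scheme. As a minimal check, the ampc figures quoted above (BEDROC of 0.18 with ES and 0.66 with CompScore) can be reproduced as follows; the function name is hypothetical.

```python
def percent_improvement(reference, new):
    """Relative change of `new` with respect to `reference`, in percent."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return 100.0 * (new - reference) / reference

# ampc target: best alternative scheme (ES) vs CompScore, values from the text.
ampc_gain = percent_improvement(0.18, 0.66)  # about 266.67 %
ampc_fold = 0.66 / 0.18                      # about 3.67 times the ES value
```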

Figure 4. Percent improvement in BEDROC of the CompScore method relative to the number of active compounds.

All the solutions found by CompScore include scoring components from at least two different docking programs. Specifically, 60 solutions contain scoring function components from the four
docking programs, 37 from three programs and 5 from only two. Also, for nine targets (cp3a4, def, fak1, hdac8, kit, kpcb, pgh1, reni and rxra), the best solution found with our algorithm contains no whole scoring functions but only their components. Regarding the representation of each docking program in the solutions found by CompScore, scoring functions from Dock 3.6, Dock 6.8, OEDocking and Gold appear in the best combination for 65, 101, 96 and 99 targets, respectively. In addition, 17 of the 87 scoring function components were not included in any CompScore solution because they were either constant or correlated in more than 90% of the targets under investigation. The log files containing the list of the scoring functions included in each target's best solution are provided at the CompScore web site.

3.3 Inclusion of weighted scores

We explored whether adding the docking scoring components weighted by the number of heavy atoms to the GA search could improve the previously presented results. The ES approach consists of evaluating the enrichment provided by all possible combinations of size 1 to N of N scoring functions. If weighted scores are considered, then the number of input variables to the ES CS scheme rises from the 16 initial whole scoring functions to 32. This increases the number of possible score combinations from 65535 to 4.29∙10⁹, leading to a combinatorial explosion. Considering the average runtime of 150 seconds per target for completing the ES rescoring with 16 scoring functions, completing the ES rescoring on a single target using 32 input variables would take more than 100 days on average. For this reason, the ES approach was excluded from these analyses. The results of including the weighted scores in CompScore are presented as Supporting Information in Tables S4 and S5 for BEDROC and EF, respectively, and summarized in Fig 5.
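The combinatorial argument above can be verified with a few lines of code. The sketch assumes that the ES runtime scales linearly with the number of combinations evaluated, which is our simplification for this illustration; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
ES_SECONDS_16 = 150.0  # average ES runtime per target reported for 16 scoring functions

def n_combinations(n_inputs):
    """Number of non-empty subsets (sizes 1 to N) of N scoring inputs."""
    return 2 ** n_inputs - 1

def estimated_es_seconds(n_inputs, ref_inputs=16, ref_seconds=ES_SECONDS_16):
    # Assumption: runtime is proportional to the number of combinations explored.
    return ref_seconds * n_combinations(n_inputs) / n_combinations(ref_inputs)

combos_16 = n_combinations(16)                           # 65535
combos_32 = n_combinations(32)                           # 4294967295, i.e. about 4.29e9
days_32 = estimated_es_seconds(32) / SECONDS_PER_DAY     # roughly 114 days per target
```

Under the same assumption, `estimated_es_seconds(18)` gives about 600 seconds, in line with the earlier statement that an exhaustive search over 18 scoring functions would take almost twice the 356 seconds needed by CompScore.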

Figure 5. Performance comparison when the weighted scores are added to the CompScore algorithm. EF is represented as the fraction of its maximum possible value for each target.

Fig 5 shows that the inclusion of the weighted scores does not significantly increase, on average, the performance of the CompScore algorithm when either BEDROC or EF is used as the VS selection metric. However, a closer look at the results of this experiment shows that the inclusion of the weighted scoring components can increase by 20% or more the initial enrichment of the VS models of the targets dhi1 (20%), hivint (25%), mcr (25%) and hivpr (50%). Likewise, when weighted scoring components are considered in the algorithm, the EF for targets aldr, mmp13, hivint and hivpr increases by 22%, 22%, 24% and 27%, respectively. In DUD-E, ligands and decoys have similar molecular weights. Thus, we speculate that, in databases with an uneven distribution of molecular sizes, the inclusion of the weighted scoring components can translate into improved VS workflows.

3.4 External validation

Hitherto, all the evidence indicates that decomposing the docking scoring functions into their components and aggregating a subset of them outperforms traditional CS based on the aggregation of whole scoring functions. However, the generalization capability of CompScore to new data can only be evaluated through its performance on a data subset not used for model training. These external validation experiments were performed following the procedure described in section 2.5.

External validation results are provided as Supporting Information in Table S6 and summarized in Fig 6. Results are presented as the average BEDROC over the 100 training/external validation splits performed for each target. First, it can be seen that the initial enrichment obtained for the training data does not differ from that of the previous experiments, in which all data was used to train the CompScore models (cyan and dark blue violins, respectively). This result is in agreement with the robustness shown by CompScore in the bootstrap cross-validation experiments.
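The repeated training/external validation scheme can be sketched as follows. The text states that 100 splits were made per target and that 20% of the data is held out; drawing the external set separately from ligands and decoys (stratification), the fixed random seed and the function names are assumptions of this sketch, and the actual procedure is the one described in section 2.5.

```python
import random

def stratified_split(labels, external_fraction=0.2, rng=None):
    """Return (train_idx, external_idx); the same fraction of ligands (1)
    and decoys (0) is held out. Stratification is an assumption of this sketch."""
    rng = rng or random.Random()
    train, external = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_ext = round(external_fraction * len(idx))
        external.extend(idx[:n_ext])
        train.extend(idx[n_ext:])
    return sorted(train), sorted(external)

def repeated_splits(labels, n_splits=100, external_fraction=0.2, seed=0):
    """Generate independent random splits from one seeded generator."""
    rng = random.Random(seed)
    return [stratified_split(labels, external_fraction, rng) for _ in range(n_splits)]

# Toy target with 50 ligands and 950 decoys.
toy_labels = [1] * 50 + [0] * 950
splits = repeated_splits(toy_labels)
```

A model would then be trained on each training subset and its BEDROC evaluated on the corresponding external subset, with the per-target result reported as the average over all splits.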

Figure 6. External validation results. CompScore enrichment when all data is used for training is presented as a dark blue violin. Cyan and red violins correspond to the training and external data predictions, respectively.

Likewise, the enrichment values observed for the external validation sets (red violin in Fig 6) are close to those observed for the training sets, which shows that CompScore provides predictive models. The data presented in Table S6 show that the difference in average BEDROC between the training and external validation sets is lower than 0.1 for 85 out of 102 targets. That is, the BEDROC difference between training and prediction is lower than 10% of its theoretical maximum for more than 83% of the DUD-E targets. In addition, BEDROC differences higher than 0.2 are only observed for the targets mk01, cxcr4 and inha. These targets are among the least represented in the DUD-E database, with 78, 40 and 44 ligands, respectively. The larger differences in BEDROC between the training and external validation sets for these targets could be a consequence of the small amount of data available for training when 20% of it is removed. We consider that these reductions in
predictability for these few targets do not represent a loss of generalization in CompScore. In summary, the results of the external validation show that the proposed methodology not only provides high initial enrichment for the training data, but also replicates such enrichment on previously unseen data.

3.5 Performance on re-docking experiments

Re-docking experiments were performed as described in section 2.6. The performance of CompScore after these re-docking calculations is summarized in Fig 7 and presented in detail in Supporting Information Table S7. In this experiment, the average performance of CompScore was slightly lower than that obtained for the original DUD-E (9% lower, see Fig 1 and Table S1). Still, CompScore outperformed all the other tested scoring schemes for all targets and achieved a 62% average improvement in BEDROC relative to them. This improvement relative to the best VS model provided by any of the ASC, BISC and ES approaches is statistically significant (p = 4.1∙10⁻⁷). Interestingly, the average number of scoring function components in the CompScore models across all targets is 14.46, which is similar to the value obtained for the original DUD-E (14.3).

Figure 7. Violin plot of the initial enrichment performance of the evaluated scoring strategies after re-docking with Dock 6.8. Initial enrichment is measured as BEDROC for α=160.9. Black lines within the boxes indicate the median. The plot is built with the BEDROC values for all DUD-E targets provided in Supporting Information Table S7.

The re-docking experiments also show that the increase in BEDROC provided by CompScore relative to the rest of the scoring methods is higher than 40% and 75% for 50% and 25% of the targets, respectively. These increments exceed those obtained for the original DUD-E. This behavior could be explained by a combination of factors. Among them, the re-docking experiments were set up with conformational exploration parameters below the default values in Dock 6.8 in order to complete the more than 10⁶ compound-target predictions in a reasonable time (~38 seconds per compound). Also, the grid-based scoring function used for re-docking is a very simple
one, taking into account only the van der Waals and electrostatic interactions, modeled as 6-12 Lennard-Jones and Coulomb potentials, respectively. On the other hand, the Dock 3.6 scoring function is more complex, considering electrostatic potentials obtained by solving the Poisson-Boltzmann equation and adding a ligand desolvation term. Consequently, within the current modeling setup, higher quality poses are expected from Dock 3.6. From these results, we conclude that CompScore can compensate for the simplicity of scoring functions and for poor conformational exploration to provide high performance VS models.

The performance of CompScore in external validation experiments after re-docking was also investigated. As shown in Fig 8, it follows the same pattern observed for the original DUD-E docked conformations. The enrichments obtained for the training and the external validation sets are similar, indicating that the CompScore models obtained during training generalize to unseen data. In addition, removing 20% of the data from each target does not affect the quality of the trained models relative to using all data for training. The data used in Fig 8 is provided in Supporting Information Table S8.

Figure 8. Violin plot of the performance of CompScore on external validations after re-docking with Dock 6.8. CompScore enrichment when all data is used for training is presented as a dark blue violin. Cyan and red violins correspond to the training and external data predictions, respectively.

3.6 General CompScore models

General CompScore models were trained for the consensus rescoring of docking poses obtained with Dock 3.6 and Dock 6.8, as described in section 2.7. The average BEDROC values obtained with the general CompScore models over the 50 external set partitions are provided in Supporting Information Table S9 and summarized in Fig 9. For comparison purposes, the performances of the target customized models and of the primary scoring functions used for docking are also included.

Figure 9. Violin plots of the performance of the CompScore general models when Dock 3.6 (a) and Dock 6.8 (b) are used as primary docking software. BEDROC on the external validation data is represented for the target customized model (blue), the general CompScore model (red) and the primary docking scoring function (cyan).

Overall, the general CompScore models show lower performance than the target customized ones, with an average decrease in BEDROC of 36% and 50% for Dock 3.6 and Dock 6.8, respectively. However, on average, they outperform the scoring functions used by the programs for primary docking, and these improvements are statistically significant for both programs. When Dock 3.6 is used as the primary docking tool, the corresponding general CompScore model provides higher enrichment than the program's own scoring function for 78 targets. The median increase in BEDROC for these targets is 76%, while the median decrease in BEDROC for the remaining targets is 42%. Similarly, the general CompScore model derived for Dock 6.8 as the primary docking tool improves the VS initial enrichment for 80 targets. Notably, the median increase in BEDROC for these targets is 222%, while the median decrease for the other 22 targets is 38%.

Regarding their composition, the general CompScore models for Dock 3.6 and Dock 6.8 are composed of 18 and 21 scoring components, respectively, with nine of these components common to both models. The two models contain scoring components belonging to the three programs listed in Table 1. The Dock 3.6 general CompScore model contains three whole scoring functions (Dock 3.6 score, Gold ASP and OEDocking Chemgauss 3), while two (Dock6 PBSA and Gold GoldScore) are present in that for Dock 6.8. The fact that these general CompScore models contain few whole scoring functions highlights the utility of breaking them down into their components for CS. The detailed list of the scoring components belonging to the general CompScore models can be found as Supporting Information in Table S10. These general models have been implemented in the CompScore server.

Our results show, as expected, a decreased VS performance of the general CompScore models relative to the target customized ones. In addition, the use of the Dock 6.8 general model and of the one composed of the scoring components common to both general models in the CS of the DUD-E
original poses did not improve the enrichment obtained with the Dock 3.6 scoring function. This suggests that the CompScore general models are dependent on the initially employed docking and scoring algorithms. Thus, whenever data for validating the VS workflow is available, we recommend training a target specific CompScore model. Unfortunately, such data is not always available, and VS schemes then usually proceed through the use of a single scoring function or a CS strategy whose selection is often guided by expert criteria. Taking into account the above results, the general CompScore models can be employed as primary VS tools capable of improving the performance of single scoring functions when either Dock 3.6 or Dock 6.8 is used as the primary docking program. Future research efforts will be devoted to increasing the number of general CompScore models for different initial docking protocols.

3.7 Artificial enrichment analyses

It has recently been shown that scoring functions based on complex ML methods, specifically deep neural networks, can abstract higher dimensional features from low dimensional ones such as topological information 38,39. The study of Sieg et al. demonstrated that using the physicochemical properties employed as unbiasing features in the DUD-E as input to simple ML algorithms can provide high performance VS models 38. Moreover, by employing ligand-based versions of the structural descriptors as input to the scoring models, they noted that the performance of the evaluated deep learning methods is due to noncausal bias. In their case study, models learned the noncausal bias because of the presence of low dimensional discriminative features (molecular topology in the DUD-E) in a complex, non-transparent and difficult to interpret black box 3D descriptor.

In contrast to the newly introduced deep learning models for scoring, in CompScore there is no learning from low dimensional features such as molecular properties and topology. Although our
method relies on the use of a GA, this optimization algorithm is not employed for fitting an ML model for scoring but for feature selection, in order to address the combinatorial explosion arising from the increase in the number of possible combinations of scoring function components. For a set of features (scoring function components) selected by the GA, instead of training an ML model, CompScore aggregates the relative rankings derived from them using the arithmetic mean. This type of CS strategy, in contrast to complex ML algorithms, is very simple and can be straightforwardly interpreted as a set of rules and steps to follow for conventional ranking-based CS.

Considering the guidelines provided by Sieg et al. 38, to evaluate possible artificial enrichment in CompScore we followed the procedure described in section 2.8. These evaluations are summarized in Fig 10, and the detailed results are provided as Supporting Information in Tables S11 and S12 for the physicochemical properties and molecular descriptors, respectively. Additionally, the list of similar targets predicted by each CompScore model in the cross-target validations is presented in Table S13.

Fig 10(a) shows that there is a baseline enrichment when physicochemical descriptors are used as input to CompScore, but it is lower than that obtained with the target customized CompScore models for the original DUD-E and the re-docked database. In addition, these differences in enrichment are statistically significant. It must be considered that the CompScore models trained with the unbiasing features employed in the DUD-E are evaluated in intra-target validation experiments, which achieved higher baseline enrichments than cross-target modeling in the study of Sieg et al. 38. Thus, the observed enrichments correspond to the maximum BEDROC possible using the physicochemical properties of the molecules as input. Our results show that, although there is a baseline enrichment with the unbiasing features, the performance achieved by CompScore is not due to the learning of these physicochemical properties from the datasets. This result is consistent with the fact that our
algorithm does not include the training of any ML model capable of abstracting high level features from simple physicochemical properties.
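The ranking-based aggregation discussed above, in which the relative rankings of the selected components are combined through the arithmetic mean, can be sketched as follows. The definition of a relative ranking as the rank divided by the number of compounds, the averaging of tied ranks and all names are assumptions made for this illustration; the selection of components by the GA is not shown.

```python
def relative_ranks(values, ascending=True):
    """Relative rank of each compound for one component (best compound closest to 0).
    ascending=True means that lower values are better (e.g. energies).
    Assumptions: ties share their average rank; ranks are divided by N."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=not ascending)
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank / n
        pos = end + 1
    return ranks

def consensus_scores(components, ascending_flags):
    """Arithmetic mean of the relative ranks over the selected components."""
    per_component = [relative_ranks(v, a) for v, a in zip(components, ascending_flags)]
    n = len(components[0])
    return [sum(r[i] for r in per_component) / len(per_component) for i in range(n)]

# Toy example with four compounds and two hypothetical components.
energy_term = [-9.0, -7.5, -8.0, -6.0]   # lower is better
shape_term = [0.5, 0.2, 0.9, 0.7]        # higher is better
scores = consensus_scores([energy_term, shape_term], [True, False])
ranking = sorted(range(len(scores)), key=lambda i: scores[i])  # best compound first
```

In this toy case the third compound obtains the lowest mean relative rank and therefore tops the consensus list, even though it is not the best compound according to the first component alone.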

Figure 10. Violin plots of the performance of CompScore when the DUD-E unbiasing features (a) and molecular descriptors (b) are used as input.

Our grouping approach identified 100 targets with at least one other target similar to them in the ligand space (see Table S13). As for the DUD-E unbiasing features, the cross-target evaluation of the custom CompScore models trained with the 2D Autocorrelation and RDF descriptors reveals lower performance than that obtained with scoring data. The differences in performance between the CompScore models trained from Dock 3.6 and Dock 6.8 and those trained with the molecular descriptors are statistically significant. It must be considered that the average performance of the models trained with no docking information in cross-target validations is higher across similar receptors than across all targets (data not shown). Hence, the results represented in Fig 10(b) correspond to a favorable scenario for the predictions made with molecular descriptors. Still, the observed performance for the molecular descriptors is far from that obtained with molecular scoring data.

In summary, the results obtained with CompScore when either the DUD-E unbiasing features or the structural information of the compounds is used as input confirm that baseline enrichment values are obtained. This baseline enrichment comes from the DUD-E database itself, but in all cases it is lower than the enrichment achieved by CompScore when scoring data is used as input. Our results show that the enrichment achieved with CompScore across the DUD-E database in CS is not artificial. The latter is supported by the fact that using physicochemical or structural information of the compounds alone leads to CompScore models with very low performance.

3.8 CompScore implementation and availability

A helper Python script was developed to summarize the rescoring results in a data table. This script extracts the scoring information of the molecules, including the values of the scoring function components, from Dock 6.8 scoring files, OEDocking scoring tables and Gold log files. It currently supports the scoring functions listed in Table 1 and is freely available for download from the
CompScore server. In addition to scoring information, the scoring data table contains an ID, the number of heavy atoms and a classification as either ligand (1) or decoy (0) for each molecule. Extreme enough score values are assigned to compounds for which any scoring function failed, so that they were ranked at the end of the function’s ranked list. The output of the algorithm is a log file containing information on the scoring functions and components that must be combined to maximize the desired VS enrichment metric and enrichment values. A second output file with a ranked list of the input compounds with aggregated scores is provided as output. If used for rescoring purposes, the user must provide a data table containing the scoring values and a CompScore log. If rescoring is going to be performed with any of the general CompScore models, instead of providing a log file, the user must select which model to use. For more information on CompScore usage and examples, see the help pages available at http://bioquimio.udla.edu.ec/compscore-help/. 4. Conclusions Here, we introduce CompScore, a simple, fast, interpretable and universal algorithm that incorporates for the first time the idea of decomposing docking scoring functions into their components for consensus scoring in VS. The problem of the combinatorial explosion due the large pool of features that can be extracted from docking scoring functions, is addressed by using a GA as feature selection tool that searches for a subset of them maximizing either the BEDROC or EF metrics. We evaluated our method using the whole DUD-E database and compared its performance with that obtained with all the 65535 combinations of 16 diverse scoring functions from 3 docking software. In addition, we conducted comparisons with the best performing scoring component and with the aggregation of all of them.
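Two details stated above can be illustrated with a short sketch: the rule that compounds with a failed scoring run are placed at the end of that function's ranked list, and the count of 65535 non-empty combinations of 16 scoring functions. The function names, the use of `None` to mark a failure and the use of infinity as the "extreme" value are assumptions made for illustration only; this is not the CompScore implementation.

```python
from itertools import combinations


def rank_with_failures(scores, lower_is_better=True):
    """Return 1-based ranks; a score of None (failed run) is ranked last.

    Hypothetical helper: the paper only says that "extreme enough" values
    are assigned to failures; using +/- infinity is an assumption.
    """
    worst = float("inf") if lower_is_better else float("-inf")
    filled = [worst if s is None else s for s in scores]
    # Sort compound indices from best to worst score.
    order = sorted(range(len(filled)), key=lambda i: filled[i],
                   reverse=not lower_is_better)
    ranks = [0] * len(filled)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks


def n_nonempty_subsets(n_items):
    """Number of non-empty subsets of n_items elements: 2**n - 1."""
    return 2 ** n_items - 1


# Energy-like scores (lower is better); the second compound failed.
energies = [-9.1, None, -7.4, -10.2]
print(rank_with_failures(energies))

# Size of an exhaustive search over 16 whole scoring functions.
print(n_nonempty_subsets(16))

# Cross-check the closed form by explicit enumeration for a small pool.
small_pool = range(4)
enumerated = sum(1 for k in range(1, 5) for _ in combinations(small_pool, k))
print(enumerated == n_nonempty_subsets(4))
```

Because the count doubles with every feature added to the pool, enumerating subsets of individual scoring components quickly becomes impractical, which is consistent with the use of a GA for feature selection described above.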

In terms of initial enrichment, for all DUD-E targets, CompScore outperformed the rest of the methods that we tested, including the exhaustive exploration of the 16 scoring functions. Our method was also found to greatly improve, by more than 100%, the performance for targets for which the rest of the tested methods provided very low initial enrichment. The results also highlight the importance of using a diverse set of scoring functions for consensus scoring, since all solutions found included scoring components from at least two docking programs. Similar conclusions can be drawn from the VS experiments performed using EF as objective function. It must be highlighted that this exceptional CS performance was obtained in less than 8 minutes for more than 75% of the targets. We also showed that the inclusion of the weighted scores can, in some cases, improve the VS performance of CS strategies. Furthermore, CompScore was shown to achieve a VS performance on unseen data similar to that observed for the training data. This demonstrates that the models proposed by CompScore are able to provide rankings highly enriched with ligands when new collections of compounds are predicted. CompScore was also shown to outperform different scoring strategies, including the exhaustive exploration of combinations of whole scoring functions, after re-docking of the DUD-E with Dock 6.8. It was demonstrated that the proposed methodology provides non-artificial enrichment by analyzing the algorithm’s performance when, instead of scoring data, the unbiasing features of the DUD-E and compound structural information are used as inputs. The best performance of CompScore is obtained in the customization of a CS function for a particular target. However, we propose general models that can be applied to increase the VS performance in modeling studies where Dock 3.6 and Dock 6.8 are used as primary docking engines. In future research, new general models for the rescoring of docking poses derived with different docking programs and primary scoring functions will be investigated.

Altogether, our results show that scoring function components are more effective than whole functions for setting up high-performance VS protocols. In terms of docking calculation time, no extra effort compared to conventional CS approaches is necessary to apply the CompScore algorithm, since all the information that it requires is included in the docking output files provided by the software. Finally, we propose that the breakdown of docking scoring functions into their components should become a routine task in the development of CS workflows. CompScore can be seen as a “take the best of each world” (docking software and scoring functions) approach. We are aware that the claim that the use of scoring components for CS outperforms current CS methods based on whole scoring functions can be controversial. However, we expect that the availability of CompScore, along with the good performance that it achieved, will attract the attention of researchers to further corroborate or reject our hypothesis.

Supporting Information. The following files are available free of charge. Tables S1 to S13: composition of the datasets employed in our studies and VS performances of the CompScore, BISC, ES and ASC approaches. In addition, the VS performance of CompScore when weighted scores are considered and the results of the external validation experiments are provided. The results of the re-docking experiments, the performance of the general CompScore models and their definition, as well as the detailed results of the analyses of artificial enrichment along with the target grouping for cross-target validations, are included in the Supporting Information (docx).

Corresponding Author

*To whom correspondence should be addressed: [email protected]

Author Contributions

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

Funding Sources

Maykel Cruz-Monteagudo was supported by the Foundation for Science and Technology (FCT) and FEDER/COMPETE (grant SFRH/BPD/90673/2012).

Abbreviations

VS, Virtual screening; CS, Consensus scoring; CCR5, SBDD Structure-based drug discovery; ML, Machine learning; GA, Genetic algorithm; EF, Enrichment factor; ES, Exhaustive search; BISC, Best Individual Scoring Component; ASC, All Scoring Components.

Figure 1. Violin plot of the initial enrichment performance of the evaluated scoring strategies. Initial enrichment is measured as BEDROC for α=160.9. Black lines within the boxes indicate the value of the median. The plot is built with the BEDROC values for all DUD-E targets as provided in Supporting Information Table S2.
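As a reading aid for the BEDROC values reported in the figures, the following is a minimal sketch of how BEDROC can be computed from a ranked list of ligand/decoy labels, using the common formulation in which the robust initial enhancement (RIE) is rescaled between its minimum and maximum attainable values. The function name `bedroc` and its input layout (a list of 1/0 labels ordered from best- to worst-ranked) are assumptions for illustration; the code is not taken from CompScore.

```python
import math


def bedroc(labels_ranked, alpha=160.9):
    """BEDROC for 0/1 labels ordered from best- to worst-ranked compound.

    Illustrative sketch only: computes RIE and rescales it to [0, 1]
    with the minimum and maximum RIE attainable for this dataset.
    """
    n_total = len(labels_ranked)
    n_actives = sum(labels_ranked)
    if n_actives == 0 or n_actives == n_total:
        raise ValueError("need at least one active and one decoy")
    ratio = n_actives / n_total

    # Sum of exponential weights at the 1-based ranks of the actives.
    weight_sum = sum(
        math.exp(-alpha * rank / n_total)
        for rank, label in enumerate(labels_ranked, start=1)
        if label == 1
    )
    # Mean weight of a single active placed uniformly at random.
    random_mean = ((1.0 / n_total) * (1.0 - math.exp(-alpha))
                   / (math.exp(alpha / n_total) - 1.0))
    rie = weight_sum / (n_actives * random_mean)

    # RIE when all actives are ranked first (max) or last (min).
    rie_max = (1.0 - math.exp(-alpha * ratio)) / (ratio * (1.0 - math.exp(-alpha)))
    rie_min = (1.0 - math.exp(alpha * ratio)) / (ratio * (1.0 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


# Toy ranked lists with 5 actives among 100 compounds.
perfect = [1] * 5 + [0] * 95
worst = [0] * 95 + [1] * 5
mixed = [1, 0, 1, 0, 0, 1] + [0] * 92 + [1, 1]
print(bedroc(perfect), bedroc(worst), bedroc(mixed))
```

With α=160.9 the exponential weights decay very quickly, so the score is dominated by the actives found in the first few percent of the ranked list; this is why it is used here as a measure of initial enrichment.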

Figure 2. Violin plot of the EF performance of the evaluated scoring strategies. EF is measured for a fraction of screened data equal to 0.01. Black lines within the boxes indicate the value of the median. The plot is built with the EF values for all DUD-E targets as provided in Supporting Information Table S3.
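The enrichment factor at a screened fraction of 0.01, and its expression as a fraction of the maximum attainable value, can be sketched as follows. The function names, the 0/1 label layout and the rounding used to size the top fraction are assumptions for illustration and may differ from the conventions used by CompScore.

```python
def top_size(n_total, fraction=0.01):
    """Number of compounds in the screened fraction (assumed: rounded, >= 1)."""
    return max(1, round(fraction * n_total))


def enrichment_factor(labels_ranked, fraction=0.01):
    """EF for 0/1 labels ordered from best- to worst-ranked compound."""
    n_total = len(labels_ranked)
    n_actives = sum(labels_ranked)
    n_top = top_size(n_total, fraction)
    hits = sum(labels_ranked[:n_top])
    # Hit rate in the top fraction divided by the hit rate in the whole set.
    return (hits / n_top) / (n_actives / n_total)


def max_enrichment_factor(n_actives, n_total, fraction=0.01):
    """Largest EF attainable: the top fraction is filled with actives first."""
    n_top = top_size(n_total, fraction)
    return (min(n_actives, n_top) / n_top) / (n_actives / n_total)


# Toy example: 1000 compounds, 20 actives, 5 of them in the top 10.
ranked = [1] * 5 + [0] * 5 + [1] * 15 + [0] * 975
ef = enrichment_factor(ranked)
ef_max = max_enrichment_factor(sum(ranked), len(ranked))
print(ef, ef_max, ef / ef_max)
```

In this toy case half of the top 1% are actives, so the EF is half of its maximum. Normalizing by the maximum makes targets with different ligand-to-decoy ratios comparable on a common 0 to 1 scale.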

Figure 3. Percent improvement in BEDROC of the CompScore method relative to the best performing VS models obtained with the ASC, BISC or ES approaches.

Figure 4. Percent improvement in BEDROC of the CompScore method relative to the number of active compounds.

Figure 5. Performance comparison when the weighted scores are added to the CompScore algorithm. EF is represented as the fraction of its maximum possible value for each target.

Figure 6. External validation results. CompScore enrichment when all data is used for training is presented as a dark blue violin. Cyan and red violins correspond to the training and external data predictions, respectively.

Figure 7. Violin plot of the initial enrichment performance of the evaluated scoring strategies after re-docking with Dock 6.8. Initial enrichment is measured as BEDROC for α=160.9. Black lines within the boxes indicate the value of the median. The plot is built with the BEDROC values for all DUD-E targets as provided in Supporting Information Table S7.

Figure 8. Violin plot of the performance of CompScore on external validations after re-docking with Dock 6.8. CompScore enrichment when all data is used for training is presented as a dark blue violin. Cyan and red violins correspond to the training and external data predictions, respectively.

Figure 9. Violin plots of the performance of the CompScore general models when Dock 3.6 (a) and Dock 6.8 (b) are used as primary docking software. BEDROC on the external validation data is represented for the target-customized model (blue), the general CompScore model (red) and the primary docking scoring function (cyan).

Figure 10. Violin plots of the performance of CompScore when the DUD-E unbiasing features (a) and molecular descriptors (b) are used as input.
