Pharmaceutical Modeling
Accurate Hit Estimation for Iterative Screening Using Venn-ABERS Predictors
Ruben Buendia, Thierry Kogej, Ola Engkvist, Lars Carlsson, Henrik Linusson, Ulf Johansson, Paolo Toccaceli, and Ernst Ahlberg
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00724 • Publication Date (Web): 06 Feb 2019
Journal of Chemical Information and Modeling
Accurate Hit Estimation for Iterative Screening Using Venn-ABERS Predictors
Ruben Buendia*†, Thierry Kogej‡, Ola Engkvist‡, Lars Carlsson‡§, Henrik Linusson†, Ulf Johansson†, Paolo Toccaceli§, Ernst Ahlberg┴
† Dept. of Information Technology, University of Borås, Sweden
‡ Discovery Sciences, AstraZeneca IMED Biotech Unit, Gothenburg, Sweden
§ Department of Computer Science, Royal Holloway, University of London, Egham Hill, Egham, Surrey, United Kingdom
┴ Data Science and AI, Drug Safety & Metabolism, AstraZeneca IMED Biotech Unit, Gothenburg, Sweden
ABSTRACT. Iterative screening has emerged as a promising approach to increase the efficiency of high-throughput screening (HTS) campaigns in drug discovery. By learning from a subset of the compound library, predictive models can infer which compounds to screen next. One of the challenges of iterative screening is to decide how many iterations to perform. This is mainly related to difficulties in estimating the prospective hit rate in any given iteration. In this article a novel method based on Venn-ABERS predictors is proposed. The method provides accurate estimates of the number of hits retrieved in any given iteration during an HTS campaign. The estimates provide the information needed to decide how many iterations are required to maximize the screening outcome. Thus, this method offers a prospective screening strategy for early-stage drug discovery.
INTRODUCTION
High-throughput screening (HTS) has become a well-established paradigm in early-stage drug discovery1. The introduction of HTS enabled screening campaigns on very large compound collections, i.e. on the scale of millions of compounds, with the aim of maximizing the chances of finding promising hits2,3. Running HTS campaigns is expensive, which warrants alternatives to screening the full collection4. The main driver of the high cost of HTS campaigns is the large amount of resources required relative to the number of hits retrieved5. This has generated considerable interest in methods that reduce costs and speed up early-stage hit finding in drug discovery projects. One approach is 3D-based virtual screening, which relies on computational tools to mine compound libraries and identify molecules that are likely to exhibit activity towards protein targets of known structure6. 3D-based virtual screening has been shown to be efficient in selecting small subsets of the available compound collection (usually ranging from a hundred to several thousand compounds in the pharmaceutical industry) that are significantly enriched in hits compared with random samples of the same size7,8. If the 3D structure of the target is unknown, ligand-based methods are available. Ligand-based machine learning methods can yield compound activity predictions or classifications that may be relevant for virtual screening9. In drug discovery this approach is referred to as Quantitative Structure-Activity Relationship (QSAR) modelling, where the compound structures are used to derive the predictor variables or features, and the compound activities are the target variables or labels. Recently, iterative screening (IS) has been shown to increase screening efficiency10. It consists of screening a library in several iterations.
At each iteration a subset of the collection is screened, and the results are used to determine the ‘best’ compound subset to screen at the next iteration.
Thus, this approach is an iterative process of model building and subsequent prediction of new compounds. The IS setup is well suited to machine learning (ML) methods because the compounds selected in each iteration are labeled and can be used to retrain the QSAR model; each additional batch of labeled examples increases the size and coverage of the training set. In ML, active learning (AL) is an umbrella term for methods that select data points for testing and feed the results back into the model11. AL was introduced to drug discovery, and in particular to hit discovery, by Warmuth et al.12 Active learning strategies may be exploitive, when compounds are selected to maximize hit retrieval; explorative, when they are selected to improve the model; or a combination of both13. For hit discovery, IS and AL refer to the same concept. Nevertheless, whereas the term AL is often used when explorative approaches are proposed, IS often refers to exploitive approaches. Maciejewski et al.14 suggested an experimental design strategy that depends on assay throughput and objective. For systems allowing high throughput, an exploitive approach was preferred. By contrast, a weak reinforcement approach, which combines exploitation and exploration, was proposed for IS using smaller compound sets. Since high throughput is possible in our experiments, an exploitive strategy was adopted for the present work. Examples of explorative, exploitive, and balanced strategies can be found elsewhere10–18. However, to the best knowledge of the authors, the only study reporting an IS approach in a high-throughput setting is Maciejewski et al.14 That work was a retrospective study performed on large in-house HTS assays (> 1 million compounds) with iterations of 50 000 compounds. One of the challenges of iterative screening is choosing the portion of the compound library to be screened at each iteration, as well as deciding when to stop screening. Suppose that compounds are selected for testing from a list of compounds ordered by the predicted probability of activity.
Obviously, the larger the number of compounds chosen for testing, the larger the number of hits identified in each iteration. However, it should also be observed that, as the set of selected compounds grows, each compound added to the set has, by construction, a smaller predicted probability of activity than the compounds already in the set. This means that the fraction of actives in the next set to test decreases with its size. There is therefore a trade-off between hit enrichment and number of hits: the former decreases with larger sets, whereas the latter increases. Designing an IS campaign needs to take three factors into account simultaneously: i) the cost of screening a compound, ii) the cost and the logistics associated with the size and the number of screening sets, iii) the number of hits to be retrieved. Ideally, these factors should be balanced to decrease the time and cost needed to find the number of hits required to pursue a given drug discovery program. For a single HTS campaign, balancing these factors might lead to a variable number of compounds screened in different iterations. However, in large pharmaceutical companies, HTS campaigns are regularly run for different targets. Thus, logistic costs might decrease by setting a fixed number of compounds per iteration that is preserved across iterations and campaigns. The optimal fixed number would vary between companies and should be updated as screening technology advances. In this study, we selected 50 000. The same number was used by Maciejewski et al.14, suggesting that 50 000 might be a reasonable number in this context. Given a fixed number of compounds per iteration, the remaining question is when to stop screening. An accurate estimate of the number of hits to be retrieved is the information needed to decide whether to perform another iteration.
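The trade-off described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the predicted probabilities of activity can be summed to give an expected hit count; the function name `tradeoff_curve` and the toy probabilities are hypothetical and are not taken from the study.

```python
from itertools import accumulate


def tradeoff_curve(probabilities, step):
    """Expected hits and expected hit rate for top-k selections of growing size.

    Compounds are ranked by predicted probability of activity (highest first),
    and the selection size k grows in increments of `step`.
    """
    ranked = sorted(probabilities, reverse=True)
    cumulative = list(accumulate(ranked))  # running sum of ranked probabilities
    curve = []
    for k in range(step, len(ranked) + 1, step):
        expected_hits = cumulative[k - 1]
        curve.append((k, expected_hits, expected_hits / k))
    return curve


# Toy predicted probabilities of activity (made-up values).
toy_probs = [0.9, 0.8, 0.6, 0.3, 0.2, 0.1, 0.05, 0.05]

for size, hits, rate in tradeoff_curve(toy_probs, step=2):
    print(f"top {size}: expected hits = {hits:.2f}, expected hit rate = {rate:.3f}")
```

As the selection grows, the expected number of hits can only increase while the expected hit rate can only decrease, which is the tension that a fixed batch size and a stopping rule have to resolve.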
Traditional ML methods do not allow accurate estimation of the number of hits that would be retrieved for a given number of compounds screened. The only published method proposed to solve this problem is conformal prediction19, as suggested by Svensson et al.20 Conformal prediction provides a framework for generating confidence predictions with a fixed error rate19. To achieve this, conformal predictors use a different notion of prediction from conventional methods: while conventional methods output a point value, the output of a conformal predictor can be multivalued, i.e. a set or an interval. In fact, it can also be empty. Simply put, for binary classification, the labels (e.g. active, inactive) are assigned to the compound being predicted in a way that results in four possible outputs: active, inactive, both labels simultaneously, or none of the labels. This can severely limit the use of conformal prediction, as the two latter outputs (i.e. both or none of the labels) are of little practical use. Further, conformal classifiers provide no guarantees regarding the error rate of singleton predictions; this is because double predictions are always correct and empty predictions are always erroneous. Thus, when double predictions outnumber empty ones, the error rate of singleton predictions can exceed the required error rate. This was demonstrated by Linusson et al.21, where the error rate of singleton predictions was found to be substantially greater than the fixed error rate in several public data sets. In Svensson et al.20 a confidence level that optimizes a gain-cost function on the training set was estimated, in the hope that the results for the training set would be reproduced on the test set. Even though relatively good results were obtained, the solution is sub-optimal and might not extrapolate to other use cases. Further, the same solution could have been approached by traditional ML.
Some weaknesses of using conformal prediction for classification have been pointed out21,22. By contrast, conformal regression has proven valuable for QSAR models in Svensson et al.23
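The four possible outputs of a binary conformal classifier, and the effect of double predictions on the singleton error rate, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of the cited works: the conformal p-values are made up, the function names are hypothetical, and `singleton_error_rate` assumes that the overall error rate equals the chosen significance level exactly.

```python
def prediction_set(p_values, significance):
    """Labels whose conformal p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > significance}


def singleton_error_rate(n_total, n_double, n_empty, significance):
    """Error rate among singleton predictions.

    Assumes the overall error rate equals `significance` exactly.
    Double predictions are never wrong and empty predictions always are,
    so all remaining errors must fall on the singletons.
    """
    total_errors = significance * n_total
    singleton_errors = total_errors - n_empty
    n_singleton = n_total - n_double - n_empty
    return singleton_errors / n_singleton


# Hypothetical p-values for four compounds at significance level 0.2.
examples = [
    {"active": 0.70, "inactive": 0.05},  # singleton: active
    {"active": 0.03, "inactive": 0.60},  # singleton: inactive
    {"active": 0.45, "inactive": 0.30},  # both labels
    {"active": 0.10, "inactive": 0.08},  # empty set
]
for p_values in examples:
    print(sorted(prediction_set(p_values, significance=0.2)))

# 1000 predictions at a 10% error rate, 300 of them double and none empty:
# the 100 errors are concentrated on the 700 singletons.
print(singleton_error_rate(1000, 300, 0, 0.1))
```

With 300 double predictions and no empty ones, the singleton error rate in this toy case is 100/700, noticeably above the nominal 10%.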
Here we propose probabilistic prediction to approach this problem and overcome the limitations of conformal prediction. We hypothesize that an accurate estimate of the number of hits in a set of untested compounds can be obtained by summing the predicted probabilities of activity over the compounds in the set. For this, the probability estimates need to be calibrated, i.e. they must reflect long-term relative frequencies. Note that, since conformal prediction does not provide probabilities, a head-to-head comparison against probabilistic prediction methods is not possible. Three probabilistic prediction approaches are available: i) Venn-ABERS predictors24, ii) Zadrozny and Elkan's method25 based on isotonic regression, iii) Platt scaling26 based on logistic regression. All three methods were theoretically compared in Vovk et al.27 That comparison demonstrated the tendency of Zadrozny and Elkan's method to overfit, which was first reported by Platt26. Venn-ABERS predictors can be considered a regularized version of Zadrozny and Elkan's method27 that reduces its tendency to overfit. Venn-ABERS predictors inherit the properties of Venn predictors, and are therefore perfectly calibrated, i.e. probabilities are matched by observed frequencies. Whereas Platt scaling is less prone to overfitting than Zadrozny and Elkan's method, it assumes that the relationship between scores and probabilities can be expressed with a logistic function. Thus Venn-ABERS predictors can improve on Platt scaling, particularly when that relationship departs substantially from a sigmoid28, which is a common scenario27. Further, the three methods have also been empirically compared across several diverse datasets24,27,29,30, and Venn-ABERS predictors have consistently provided more accurate calibrations. More relevant to this study, Venn-ABERS predictors were reported to provide more accurate calibrations for ligand-target prediction in Mervin et al.31 and Toccaceli et al.28
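The hypothesis above can be written compactly. In the sketch below, the symbols are introduced only for illustration: S is the set of untested compounds, Y_i is 1 if compound i is a hit and 0 otherwise, and p_i is its calibrated probability of activity. The first identity is linearity of expectation; the variance expression additionally assumes that the outcomes are independent, which is an assumption of this sketch and not a claim of the study.

```latex
\mathbb{E}\Big[\sum_{i \in S} Y_i\Big]
  = \sum_{i \in S} \mathbb{E}[Y_i]
  = \sum_{i \in S} p_i ,
\qquad
\operatorname{Var}\Big[\sum_{i \in S} Y_i\Big]
  = \sum_{i \in S} p_i\,(1 - p_i)
  \quad \text{(if the } Y_i \text{ are independent)} .
```

The estimate is only as good as the calibration: if the p_i are systematically too high or too low, the sum inherits that bias, which is why calibrated probabilities are required.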
In this work, a QSAR model consisting of an SVM classifier was retrospectively validated on a large compound library in an iterative fashion. Venn-ABERS predictors were used on top of the
classifier, and the calibration set was preserved throughout all iterations. The objective of the study is to evaluate the ability of the proposed method to provide accurate estimates of the number of hits that would be retrieved at each iteration. Secondarily, the enrichment offered by the method in terms of hits and their diversity was evaluated against random selection. Diversity of hits refers to the number of scaffolds, as represented by molecular frameworks (MF)32 and topological frameworks (TF)33, shared by at least one hit. The experiments were performed on six different drug targets that represent typical cases with high, moderate and low proportions of hits.
METHODS
HTS Data. AstraZeneca in-house HTS data were utilized in this work. The primary activity data from six full HTS campaigns, against six different biological targets, were used. The use of in-house data allows the evaluation of the method on large HTS datasets, which are currently not accessible in the public domain. Table 1 describes the six data sets used in this study. The data sets were chosen to represent typical cases with high, moderate and low proportions of hits (two targets for each case), denoted HHRT (High Hit Rate Target), MHRT (Moderate Hit Rate Target), and LHRT (Low Hit Rate Target), respectively. The compounds were described by signature descriptors34 derived from their chemical structure. Diversity of hits was represented by MF and TF. Both MF and TF offer a simple and consistent way to define a chemistry scaffold. Numerous studies have used MF or TF to provide information about collection novelty or as tools to map chemical series sharing a similar biological profile. MF are obtained by removing all atoms or side chains that are not part of rings or of linkers between rings. In our implementation, only single-bonded entities are removed (e.g. the oxygen of a carbonyl or the nitrogen of an imine attached to a ring or a linker is preserved).
To obtain the TF, a further abstraction is performed in which all atoms of the MF are replaced by saturated carbon atoms.
Machine Learning and Venn-ABERS Predictors. Venn Predictors35 are a class of predictors built on generic ML methods that offer a guarantee of calibration. We use a special form of Venn Predictors, called Venn-ABERS Predictors24 (VAP), which can use any scoring binary classifier to produce calibrated probabilities. Given the scale of the data sets, VAP would require significant computational resources. However, there is a form of VAP, called Inductive VAP (IVAP), which admits a fast, scalable implementation. The reduced computational requirement comes at the cost of sacrificing part of the training data to be used as a calibration set. The scoring classifier is trained on the remaining part of the training data (which is then called the proper training set), and the IVAP algorithm operates on the scores produced by the ML algorithm on the calibration set and the corresponding labels. The only requirement of IVAP is that calibration and test sets are exchangeable, which holds in particular when they are independent and identically distributed (i.i.d.). Note that, for the sake of simplicity, we do not discuss here all the technicalities surrounding the theory of probabilistic prediction but refer to Vovk et al.19 It is essential to note that Venn Predictors are multi-probabilistic predictors. In the case of binary classification, for each test object, they output two probability estimates, p0 and p1, and not one as conventional estimators do. These p0 and p1 can be interpreted as lower and upper estimates of the probability of activity, and their difference is related to the uncertainty in the probabilistic prediction itself27. A single probability estimate can be obtained by combining these two estimates. One way is to minimize the regret, i.e. the additional loss incurred by following the prediction instead of the real outcome according to a certain loss function.
In the case of a log loss function, the probability p can be calculated as p = p1 / (1 − p0 + p1), which can be considered a good compromise between p0 and p1. Finally, it is important to observe that Venn Predictors are distribution-free in the sense
that they do not assume that the probability distribution has a given mathematical form. The only assumption is that test and calibration data should be exchangeable.
The ability of Venn-ABERS predictors to accurately estimate the number of hits that would be found for any number of compounds screened was shown by Buendia et al.36 However, this was not tested iteratively, and the initial calibration sets consisted of 50,000 compounds selected randomly from the compound library. Therefore, calibration and test sets were exchangeable, and the probabilities output by Venn-ABERS predictors were theoretically valid. In an iterative setting, by contrast, the proportion of hits increases in the training set and decreases in the test set at each iteration, i.e. the distribution of hits in training and test sets changes as a function of iteration, test set size, and enrichment factor. Therefore, after the initial iteration, the exchangeability assumption is violated. To keep theoretical validity, a new calibration set would have to be generated at each iteration by chemically testing a randomly selected subset of compounds from the test set. Unfortunately, the cost of generating those calibration sets would make the use of Venn-ABERS predictors infeasible. We hypothesize that keeping the initial calibration set would provide an accurate estimate of the number of hits retrieved through several iterations, that is, that the number of hits retrieved would remain close to the sum of the predicted probabilities output by Venn-ABERS predictors. By keeping the original calibration set, although the proportion of hits decreases in the test set with each iteration, it remains unchanged in the calibration set. In this way the exchangeability violation is partially mitigated.
Summary of Iterative Screening Set-Up
For each target, the dataset was split into an initial proper training set of 50,000 compounds, a calibration set of 50,000 compounds, and a test set containing the rest of the compounds.
All models consisted of a Support Vector Machine (SVM) and were implemented using LibLinear37. The penalty parameter, i.e. cost, was automatically
optimized by a new functionality in LibLinear version 2.20 using the proper training data. For each assay an SVM was then trained, also using the proper training data. The resulting model was used to output scores for each compound of the calibration and test sets. The absolute value of the distance to the hyperplane was used as scoring function. The calibration set scores and labels, together with the test scores, were then used as input to the Python/NumPy function VennABERS.py28, which outputs calibrated probabilities of being a hit for each compound in the test set. This function is an implementation of the algorithm proposed by Vovk et al.27, which uses the properties of inductive Venn-ABERS predictors to greatly reduce computational time. Finally, the regularization p = p1 / (1 − p0 + p1) was used to obtain single probabilistic predictions that facilitate the ranking of compounds. The method was implemented in the form of a single Python script. This script is available on GitHub, and the link can be found in the notes. When calibrating predictions, the number of instances of the minority class (hits in this case) determines the resolution, i.e. the number of steps between 0 and 1 in the output probabilities. The HTS assay of LHRT2 was the one presenting the highest imbalance, with 0.34% of compounds being hits. This motivated the choice of 50 000 compounds for the calibration set, which would contain approximately 150 hits on average. In this way, 150 probability steps were considered the minimum needed to rank 50 000 compounds (in trials, calibration sets of up to 25 000 compounds performed poorly for both LHRT assays). For the other targets, especially the HHRT ones, a smaller calibration set would certainly be enough. Nevertheless, we preferred to select an equal number of compounds across assays for consistency, which is convenient for presenting results and is also of value when performing HTS campaigns.
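The calibration step described above can be sketched in plain Python. This is a deliberately naive illustration of an inductive Venn-ABERS predictor, which refits an isotonic regression twice per test compound; it is not the fast algorithm of Vovk et al. and not the VennABERS.py function used in the study. All function names, scores and labels below are hypothetical.

```python
def isotonic_fit(scores, labels):
    """Non-decreasing least-squares fit via pool-adjacent-violators (PAVA).

    Returns blocks [label_sum, weight, largest_score] ordered by score;
    every score in a block shares the fitted value label_sum / weight.
    """
    # Pool tied scores first so that equal scores get equal fitted values.
    sums, counts = {}, {}
    for s, y in zip(scores, labels):
        sums[s] = sums.get(s, 0.0) + y
        counts[s] = counts.get(s, 0) + 1
    blocks = []
    for s in sorted(sums):
        blocks.append([sums[s], counts[s], s])
        # Merge backwards while the previous block's mean is not smaller.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            label_sum, weight, upper = blocks.pop()
            blocks[-1][0] += label_sum
            blocks[-1][1] += weight
            blocks[-1][2] = upper
    return blocks


def fitted_value(blocks, score):
    """Fitted value at a score that was part of the fitted data."""
    for label_sum, weight, upper in blocks:
        if score <= upper:
            return label_sum / weight
    raise ValueError("score lies above every fitted block")


def ivap(cal_scores, cal_labels, test_score):
    """Return (p0, p1): isotonic fits with the test object labeled 0, then 1."""
    scores = list(cal_scores) + [test_score]
    p0 = fitted_value(isotonic_fit(scores, list(cal_labels) + [0]), test_score)
    p1 = fitted_value(isotonic_fit(scores, list(cal_labels) + [1]), test_score)
    return p0, p1


def merge(p0, p1):
    """Single probability for log loss: p = p1 / (1 - p0 + p1)."""
    return p1 / (1.0 - p0 + p1)


def rank_compounds(cal_scores, cal_labels, test_scores):
    """Indices of test compounds ordered by decreasing merged probability."""
    probs = [merge(*ivap(cal_scores, cal_labels, s)) for s in test_scores]
    return sorted(range(len(test_scores)), key=lambda i: probs[i], reverse=True)


# Toy calibration set: classifier scores with hit (1) / non-hit (0) labels.
cal_scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
cal_labels = [0, 0, 0, 1, 0, 1, 1, 1]

p0, p1 = ivap(cal_scores, cal_labels, 0.25)
print(p0, p1, merge(p0, p1))
print(rank_compounds(cal_scores, cal_labels, [-1.75, 0.25, 1.25]))
```

Each call to `ivap` costs two full isotonic fits, so this version scales poorly; it is meant only to make the roles of p0, p1 and the merged p explicit on a small example.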
In total, ten iterations were run, amounting to 600,000 compounds screened for each target, which corresponds to around a third of the full HTS collection and was therefore set as the limit for our IS strategy. The in-silico experiments were repeated with ten seeds for statistical significance.
Calibration and Enrichment Evaluation. To evaluate the accuracy of the calibration provided by the proposed method, at each iteration and for each seed and target, the cumulative sums of labels, p, p1 and p0 of the selected compounds (the 50,000 compounds with the highest p) were calculated. For each target, violin plots summarizing the results of the different seeds were plotted for each iteration. The cumulative sums of p0 and p1 of the selected compounds represent the minimum and maximum number of expected hits. The cumulative sum of p represents the number of expected hits, whereas the actual number of hits retrieved is the cumulative sum of labels. Enrichment of the selected sets of compounds was evaluated in terms of the number of hits retrieved and their diversity, as represented by the number of hits with different TF and MF. At each iteration and for each seed and target, the cumulative sums of the percentages of hits, TF and MF retrieved from the test set were calculated and visualized against random selection. Again, for each target, violin plots summarizing the results of the different seeds were plotted for each iteration.
RESULTS AND DISCUSSION
Validity of the method. Violin plots representing the cumulative sums of labels, p, p1 and p0 of the selected compounds are shown to evaluate the calibration accuracy of the predictions for six different targets and ten seeds per target (Figure 1). The range between the cumulative sums of p0 and p1 represents the minimum and maximum limits of expected hits, which provides information that other probabilistic prediction methods, e.g. Platt scaling or logistic regression, could not.
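The cumulative sums used in this evaluation can be sketched as follows. The sketch reads "cumulative" as accumulated over iterations, which is an interpretation, and it omits the retraining and re-scoring that the study performs between iterations. The function names, the toy pool and the batch size of 2 are hypothetical.

```python
def select_top(pool, batch_size):
    """Split the pool into the batch_size compounds with highest p and the rest."""
    ranked = sorted(pool, key=lambda c: c["p"], reverse=True)
    return ranked[:batch_size], ranked[batch_size:]


def summarize(selected):
    """Observed hits and expected-hit estimates for one selected batch."""
    return {
        "hits": sum(c["label"] for c in selected),
        "sum_p": sum(c["p"] for c in selected),
        "sum_p0": sum(c["p0"] for c in selected),
        "sum_p1": sum(c["p1"] for c in selected),
    }


def iterate(pool, batch_size, n_iterations):
    """Cumulative sums of labels, p, p0 and p1 after each iteration."""
    totals = {"hits": 0, "sum_p": 0.0, "sum_p0": 0.0, "sum_p1": 0.0}
    history = []
    for _ in range(n_iterations):
        selected, pool = select_top(pool, batch_size)
        for key, value in summarize(selected).items():
            totals[key] += value
        history.append(dict(totals))
        # In the study, the classifier is retrained here on the enlarged
        # training set and the remaining pool is re-scored and re-calibrated.
        # That step is omitted in this sketch.
    return history


# Toy pool of (p0, p1, label) triples with made-up values.
raw = [(0.8, 0.9, 1), (0.6, 0.7, 1), (0.4, 0.5, 0),
       (0.2, 0.3, 1), (0.1, 0.2, 0), (0.0, 0.1, 0)]
pool = [{"p0": a, "p1": b, "p": b / (1.0 - a + b), "label": y} for a, b, y in raw]

history = iterate(pool, batch_size=2, n_iterations=3)
for i, row in enumerate(history, start=1):
    print(i, row["hits"], round(row["sum_p0"], 3),
          round(row["sum_p"], 3), round(row["sum_p1"], 3))
```

Comparing `hits` with the interval from `sum_p0` to `sum_p1` at each iteration is the check that the violin plots summarize across seeds.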
The targets were chosen to represent typical cases with high, moderate and low proportion of hits. Two targets for each case were used. Results regarding random selection as well
as selection of compounds with highest p are shown. Perfect calibration is only guaranteed when calibration and test sets are exchangeable. Thus, for the proposed method, i.e. selection of the highest ranked compounds, perfect calibration is only guaranteed in the first iteration. By contrast, perfect calibration is guaranteed in all iterations when selecting randomly. In this context, perfect calibration means that probabilities are matched by observed frequencies in the long run. Therefore, the cumulative sum of actual hits retrieved should lie between the cumulative sums of p0 and p1; small deviations are possible because a finite number of compounds, i.e. 50,000, was selected at each iteration. Moreover, with a reasonably good model and enough instances in the calibration and training sets, p0 and p1 are expected to be close, and the cumulative sum of retrieved hits is expected to be close to the cumulative sum of p. Following the proposed method of IS and retrieval of the compounds exhibiting the highest probability of being hits, the proportion of hits in the test set decreases with each iteration. Thus, calibration and test sets move further away from being exchangeable at each iteration, and the accuracy of the calibration could degrade as the hit rate in the test set decreases. Nevertheless, during an HTS campaign, what is needed to support a decision on the portion of the compound library to be screened at the next iteration is that the estimated number of hits to be retrieved is close to the actual number of hits retrieved, that is, that the cumulative sum of p is close to the cumulative sum of labels. For every target and iteration, the actual number of hits retrieved was close to the cumulative sum of the p of the selected compounds, and therefore the calibration remains accurate for all iterations and all six targets (Figure 1). Nevertheless, the median number of hits retrieved often falls slightly outside the limits of expected hits (Figure 1).
This does not happen at the first iteration or when compounds are selected randomly, i.e. when perfect calibration is guaranteed. Moreover, in those
cases, actual and expected hits are even closer. This indicates that the calibration accuracy degrades after the first iteration. Nevertheless, it remains accurate from a practical point of view.
Enrichment Evaluation. The cumulative sums of the percentages of hits, MF and TF retrieved in each iteration and for each target using the Venn-ABERS approach are shown to evaluate enrichment in terms of hits and their diversity (Figure 2). Results using random selection are also shown as a reference. Although enrichment is generally very high for hits, TF and MF, the ratio of hits to TF is higher than with random selection, indicating that hits are more diverse when selecting randomly. This could be expected, as a QSAR model selects compounds that are similar to known hits. It is also worth pointing out that the enrichment of selected subsets over random selection can differ substantially between targets, including targets with similar hit rates. This shows that QSAR models are better suited to some targets than to others. The ratio of hits to MF exhibits a smaller difference between methods. That is likely because a greater proportion of MF than TF are singletons.
CONCLUSIONS
One of the challenges of iterative screening is the choice of the portion of the compound library to be screened. This challenge cannot be addressed by traditional ML methods. To solve this problem, a method that provides accurate estimates of the number of hits that would be retrieved in each iteration during an IS campaign is proposed. The method applies Venn-ABERS predictors on top of a scoring classifier (an SVM in the present work) and was evaluated on six HTS campaigns against different biological targets. The proposed method was able to provide accurate estimates of the number of hits that would be retrieved in each iteration. Therefore, it can be used in IS campaigns as an efficient control on the number of iterations needed and the resulting
ACS Paragon Plus Environment
13
Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 26 of 41
expected enrichment. Having an accurate estimation of the hits that would be found in the next iteration would help to decide whether to continue screening or not. Enrichment offered by the method in terms of hits and their diversity was evaluated against random selection. The proposed method exhibited good ability in identifying an enriched subset in terms of hits and their diversity for all six targets. The price for the value that the present method might add to IS campaigns is the cost of screening the compounds in the calibration set (which can have a significant economic cost). However, in a recently accepted paper by Johansson et al.38, it was shown that Venn-ABERS predictors can be estimated using out of bag calibration samples. That method might be applied to the case of study of the present work. That way, the information would be obtained without any additional efforts. However, only ML methods that allow bagging can be used; which excludes SVM. Quantifying uncertainty is a critical aspect in the drug discovery as well as in medical decision making39. Adding calibrated probabilities to predictions output by chemical models might be the best way to quantify the uncertainty of those predictions. For this reason, Venn-ABERS predictors might have a great potential in drug discovery research. However, to the knowledge of the authors, Venn-ABERS predictors have so far only been used in three articles 36,40,41.
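As an illustration of the calibration step summarized in the conclusions, the following minimal Python sketch computes p0, p1 and a merged probability p for one test score with an inductive Venn-ABERS predictor. It is not the authors' script: the function names, the toy calibration data and the merging rule p = p1 / (1 - p0 + p1) (a common choice in the Venn-ABERS literature) are assumptions made for this example, and the naive refit for each test object ignores the efficiency improvements used on large libraries.

```python
def isotonic_value_at(scores, labels, s):
    """Fit a non-decreasing (isotonic) regression of labels on scores
    with the pool-adjacent-violators algorithm and return the fitted
    value at score s. The score s must occur in `scores`."""
    # Group tied scores so that each distinct score is one weighted point.
    groups = {}
    for x, y in zip(scores, labels):
        total, count = groups.get(x, (0.0, 0))
        groups[x] = (total + y, count + 1)

    # Each block holds [label_sum, count, contains_s].
    blocks = []
    for x in sorted(groups):
        total, count = groups[x]
        blocks.append([total, count, x == s])
        # Merge backwards while the previous block mean exceeds the last one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total2, count2, has_s = blocks.pop()
            blocks[-1][0] += total2
            blocks[-1][1] += count2
            blocks[-1][2] = blocks[-1][2] or has_s

    for total, count, has_s in blocks:
        if has_s:
            return total / count
    raise ValueError("score s not found among the fitted points")


def venn_abers(cal_scores, cal_labels, s):
    """Return (p0, p1, p) for a test object with classifier score s.

    p0: fitted value at s when the test object is tentatively labeled 0.
    p1: fitted value at s when the test object is tentatively labeled 1.
    p:  merged single probability (assumed rule, see lead-in).
    """
    p0 = isotonic_value_at(list(cal_scores) + [s], list(cal_labels) + [0], s)
    p1 = isotonic_value_at(list(cal_scores) + [s], list(cal_labels) + [1], s)
    p = p1 / (1.0 - p0 + p1)
    return p0, p1, p


# Hypothetical calibration set: classifier scores with active (1) / inactive (0) labels.
cal_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
cal_labels = [0, 0, 1, 0, 1]
p0, p1, p = venn_abers(cal_scores, cal_labels, 0.5)
print(p0, p1, p)
```

The width p1 - p0 gives a per-compound indication of how uncertain the calibrated probability is; with a larger calibration set the two values typically move closer together.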
Figure 1. This figure serves to evaluate the accuracy of the calibration provided by the proposed method. For each target, violin plots summarize the results of 10 seeds, for the proposed method as well as for random selection. The vertical axis represents the number of hits and the horizontal axis represents iterations, 10 in total. The cumulative sums of p0 and p1 of the selected compounds represent the minimum and maximum number of expected hits. The cumulative sum of p represents the number of expected hits, whereas the actual number of hits retrieved is the cumulative sum of labels. Note that results are not cumulative from one iteration to the next.
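The quantities plotted in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published script: the function names `select_top_k` and `hit_estimates`, the batch size and the toy probabilities are hypothetical, and compounds are assumed to be selected by descending p.

```python
def select_top_k(p, k):
    """Indices of the k compounds with the highest calibrated probability p."""
    return sorted(range(len(p)), key=lambda i: p[i], reverse=True)[:k]


def hit_estimates(p0, p1, p, labels, selected):
    """Sum p0, p1, p and the true labels over the selected compounds.

    The sums of p0 and p1 bound the expected number of hits from below
    and above, the sum of p is the expected number of hits, and the sum
    of labels is the number of hits actually retrieved.
    """
    return {
        "min_expected": sum(p0[i] for i in selected),
        "max_expected": sum(p1[i] for i in selected),
        "expected": sum(p[i] for i in selected),
        "actual": sum(labels[i] for i in selected),
    }


# Hypothetical Venn-ABERS outputs for four unscreened compounds.
p0 = [0.10, 0.60, 0.30, 0.05]
p1 = [0.20, 0.80, 0.50, 0.10]
p = [0.15, 0.70, 0.40, 0.08]
labels = [0, 1, 1, 0]  # only known after the batch has been screened

selected = select_top_k(p, k=2)
estimates = hit_estimates(p0, p1, p, labels, selected)
print(selected, estimates)
```

In an iterative campaign, only the first three values are available before screening; comparing them with the cost of the next batch is one way to decide whether another iteration is worthwhile.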
Figure 2. This figure serves for evaluation of the enrichment provided by the method. For each target, violin plots summarize the results of 10 seeds. Cumulative sums of hits and topological frameworks are shown for the proposed method (selection by p) as well as for random selection. The vertical axis represents the percentage of hits or topological frameworks in the test set, and the horizontal axis represents iterations.
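The enrichment curves of Figure 2 can be sketched as follows. The sketch assumes that a framework counts as retrieved once at least one hit carrying it has been screened, which is an interpretation of the caption and of Table 1; the function name `cumulative_enrichment` and the toy batches are hypothetical.

```python
def cumulative_enrichment(iterations, total_hits, total_frameworks):
    """Cumulative percentage of hits and of hit-bearing frameworks retrieved.

    iterations: list of batches; each batch is a list of
                (is_hit, framework_id) pairs for the screened compounds.
    total_hits: number of hits in the whole test set.
    total_frameworks: number of distinct frameworks shared by at least
                      one hit in the whole test set.
    Returns one (percent_hits, percent_frameworks) pair per iteration.
    """
    hits_so_far = 0
    frameworks_seen = set()
    curve = []
    for batch in iterations:
        for is_hit, framework_id in batch:
            if is_hit:
                hits_so_far += 1
                frameworks_seen.add(framework_id)
        curve.append((100.0 * hits_so_far / total_hits,
                      100.0 * len(frameworks_seen) / total_frameworks))
    return curve


# Two hypothetical iterations over a test set with 4 hits and 3 hit-bearing frameworks.
iterations = [
    [(True, "A"), (True, "A"), (False, "B")],
    [(True, "C"), (False, "A")],
]
curve = cumulative_enrichment(iterations, total_hits=4, total_frameworks=3)
print(curve)
```

In the toy data, the first batch retrieves half of the hits but only one of three frameworks, the pattern of high hit enrichment with lower diversity discussed in the enrichment evaluation.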
Table 1. Characteristics of the Datasets

Target   Active   Inactive   TF shared by at least one Hit   MF shared by at least one Hit   % of Active
HHRT1    47946    1643931    6379                            20028                           2.83%
HHRT2    42029    1917751    7444                            21463                           2.14%
MHRT1    23525    1702534    6660                            15472                           1.38%
MHRT2    16410    1964086    4180                            9845                            0.83%
LHRT1    6954     2010785    1667                            4218                            0.35%
LHRT2    6626     1954785    1959                            4322                            0.34%
AUTHOR INFORMATION

Corresponding Author
[email protected] ORCID Ruben Buendia Lopez: 0000-0001-8126-9922
[email protected] ORCID Ernst Ahlberg-Helgee: 0000-0003-2050-9069

Present Addresses
‡ Lars Carlsson's current address is only Department of Computer Science, Royal Holloway, University of London, Egham Hill, Egham, Surrey, United Kingdom.

Notes
The authors declare no competing financial interest. The method was implemented as a single Python script, which is available for download at https://github.com/Ruben2233/Supporting_Information_JCIM_ci-2018-00724s
ACKNOWLEDGMENT

This work was supported by the Swedish Knowledge Foundation through the project Data Analytics for Research and Development (20150185). We thank our colleague Oliver Laufkötter for thorough discussions and for improving the writing of the manuscript.

ABBREVIATIONS

TF, topological framework; MF, molecular framework; VAP, Venn-ABERS Predictors; IVAP, Inductive Venn-ABERS Predictors; SVM, support vector machine; ML, machine learning.

REFERENCES

(1) Macarron, R. Critical review of the role of HTS in drug discovery. Drug Discovery Today 2006, 11, 277–279.
(2) Macarron, R.; Banks, M. N.; Bojanic, D.; Burns, D. J.; Cirovic, D. A.; Garyantes, T.; Green, D. V. S.; Hertzberg, R. P.; Janzen, W. P.; Paslay, J. W.; Schopfer, U.; Sittampalam, G. S. Impact of high-throughput screening in biomedical research. Nat. Rev. Drug Discovery 2011, 10, 188–195.
(3) Mayr, L. L.; Fuerst, P. The future of high-throughput screening. J. Biomol. Screening 2008, 13, 443–448.
(4) Bajorath, J. Integration of Virtual and High-Throughput Screening. Nat. Rev. Drug Discovery 2002, 1, 882–894.
(5) Phatak, S. S.; Stephan, C. C.; Cavasotto, C. N. High-throughput and in silico screenings in drug discovery. Expert Opin. Drug Discovery 2009, 4, 947–959.
(6) Abdo, A.; Chen, B.; Mueller, C.; Salim, N.; Willett, P. Ligand-Based Virtual Screening Using Bayesian Networks. J. Chem. Inf. Model. 2010, 50, 1012–1020.
(7) Kiss, R.; Kiss, B.; Könczöl, A.; Szalai, F.; Jelinek, I.; László, V.; Noszál, B.; Falus, A.; Keseru, G. M. Discovery of novel human histamine H4 receptor ligands by large-scale structure-based virtual screening. J. Med. Chem. 2008, 51, 3145–3153.
(8) Evers, A.; Hessler, G.; Matter, H.; Klabunde, T. Virtual Screening of Biogenic Amine-Binding G-Protein Coupled Receptors: Comparative Evaluation of Protein- and Ligand-Based Virtual Screening Protocols. J. Med. Chem. 2010, 53, 7521–7531.
(9) Dong, G.; Sheng, C.; Wang, S.; Miao, Z.; Yao, J.; Zhang, W. Selection of Evodiamine as a Novel Topoisomerase I Inhibitor by Structure-Based Virtual Screening and Hit Optimization of Evodiamine Derivatives as Antitumor Agents. J. Med. Chem. 2010, 53, 7521–7531.
(10) Paricharak, S.; IJzerman, A. P.; Bender, A.; Nigsch, F. Analysis of iterative screening with stepwise compound selection based on Novartis in-house HTS data. ACS Chem. Biol. 2016, 11, 1255–1264.
(11) Settles, B. Active Learning Literature Survey. University of Wisconsin, 2010.
(12) Warmuth, M. K.; Rätsch, G.; Mathieson, M.; Liao, J.; Lemmen, C. Active learning in the drug discovery process. In Adv. in Neural Inf. Proc. Sys. 14; Dietterich, T. G., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, 2002; pp 1449–1456.
(13) Reker, D.; Schneider, G. Active-learning strategies in computer-assisted drug discovery. Drug Discovery Today 2015, 20, 458–465.
(14) Maciejewski, M.; Wassermann, A. M.; Glick, M.; Lounkine, E. An Experimental Design Strategy: Weak Reinforcement Leads to Increased Hit Rates and Enhanced Chemical Diversity. J. Chem. Inf. Model. 2015, 55, 956–962.
(15) Warmuth, M. K.; Liao, J.; Rätsch, G.; Mathieson, M.; Putta, S.; Lemmen, C. Active Learning with Support Vector Machines in the Drug Discovery Process. J. Chem. Inf. Comput. Sci. 2003, 43, 667–673.
(16) Reker, D.; Schneider, P.; Schneider, G. Multi-objective active machine learning rapidly improves structure–activity models and reveals new protein–protein interaction inhibitors. Chem. Sci. 2016, 7, 3919–3927.
(17) Paricharak, S.; IJzerman, A. P.; Jenkins, J. L.; Bender, A.; Nigsch, F. Data-Driven Derivation of an "Informer Compound Set" for Improved Selection of Active Compounds in High-Throughput Screening. J. Chem. Inf. Model. 2016, 56, 1622–1630.
(18) Reker, D.; Schneider, P.; Schneider, G.; Brown, J. B. Active learning for computational chemogenomics. Future Med. Chem. 2017, 9, 381–402.
(19) Vovk, V.; Gammerman, A.; Shafer, G. Algorithmic Learning in a Random World; Springer: New York, 2005.
(20) Svensson, F.; Afzal, A. M.; Norinder, U.; Bender, A. Maximizing gain in high-throughput screening using conformal prediction. J. Cheminf. 2017, 10.
(21) Linusson, H.; Johansson, U.; Boström, H.; Löfström, T. Reliable Confidence Predictions Using Conformal Prediction. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 20th Pacific-Asia Conference, PAKDD, Auckland, New Zealand, April 19-22, 2016; Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J. Z., Wang, R., Eds.; Springer.
(22) Linusson, H.; Johansson, U.; Boström, H.; Löfström, T. Classification with Reject Option Using Conformal Prediction. In Advances in Knowledge Discovery and Data Mining, PAKDD 2018; Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L., Eds.; Lecture Notes in Computer Science, Vol. 10937; Springer.
(23) Svensson, F.; Aniceto, N.; Norinder, U.; Cortes-Ciriano, I.; Spjuth, O.; Carlsson, L.; Bender, A. Conformal Regression for Quantitative Structure–Activity Relationship Modeling: Quantifying Prediction Uncertainty. J. Chem. Inf. Model. 2018, 58, 1132–1140.
(24) Vovk, V.; Petej, I. Venn-Abers predictors. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014.
(25) Zadrozny, B.; Elkan, C. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; ACM Press: New York, 2002; pp 694–699.
(26) Platt, J. C. Probabilities for SV machines. In Advances in Large Margin Classifiers; Bartlett, P. J., Schölkopf, B., Schuurmans, D., Smola, A. J., Eds.; MIT Press: Cambridge, MA, 2000; pp 61–74.
(27) Vovk, V.; Petej, I.; Fedorova, V. Large-scale probabilistic predictors with and without guarantees of validity. In Proceedings of the 28th International Conference on Neural Information Processing Systems; MIT Press, 2015; pp 892–900.
(28) Toccaceli, P.; Nouretdinov, I.; Luo, Z.; Vovk, V.; Carlsson, L.; Gammerman, A. Excape project, wp1, probabilistic prediction. Technical report, Royal Holloway, University of London and AstraZeneca, 2016.
(29) Lambrou, A.; Nouretdinov, I.; Papadopoulos, H. Inductive Venn prediction. Ann. Math. Artif. Intell. 2015, 74 (1-2), 181–201.
(30) Lambrou, A.; Papadopoulos, H.; Nouretdinov, I.; Gammerman, A. Reliable Probability Estimates Based on Support Vector Machines for Large Multiclass Datasets. In Artificial Intelligence Applications and Innovations, AIAI 2012; Iliadis, L., Maglogiannis, I., Papadopoulos, H., Karatzas, K., Sioutas, S., Eds.; IFIP Advances in Information and Communication Technology, Vol. 382; Springer: Berlin, Heidelberg, 2012.
(31) Mervin, L.; Toccaceli, P.; Engkvist, O.; Bender, A. Comparing probability scaling methods for ligand-target prediction highlights Venn-Abers predictors generate optimal calibrated probability estimates. Under review for J. Chem. Inf. Model. Manuscript ID: ci-2018-00160k.
(32) Bemis, G. W.; Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 1996, 39, 2887–2893.
(33) Schuffenhauer, A.; Ertl, P.; Roggo, S.; Wetzel, S.; Koch, M. A.; Waldmann, H. The Scaffold Tree - Visualization of the Scaffold Universe by Hierarchical Scaffold Classification. J. Chem. Inf. Model. 2007, 47, 47–58.
(34) Faulon, J. L.; Visco, J. D. P.; Pophale, R. S. The signature molecular descriptor. 1. Using extended valence sequences in QSAR and QSPR studies. J. Chem. Inf. Comput. Sci. 2003, 43, 707–720.
(35) Vovk, V.; Nouretdinov, I.; Shafer, G. Self-calibrating probability forecasting. In Proceedings of the Advances in Neural Information Processing Systems 16, 2003; pp 1133–1140.
(36) Buendia, R.; Engkvist, O.; Carlsson, L.; Kogej, T.; Ahlberg, E. Venn-Abers Predictors for Improved Compound Iterative Screening in Drug Discovery. In Proceedings of Machine Learning Research 2018, 91, 1–19.
(37) Fan, R. E.; Chang, K. W.; Hsieh, C. J.; Wang, X. R.; Lin, C. J. Liblinear: A library for large linear classification. J. Mach. Learn. Res. 2008, 9, 1871–1874.
(38) Johansson, U.; Löfström, T.; Linusson, H.; Boström, H. Efficient Venn Predictors using Random Forests. Machine Learning 2018.
(39) Begoli, E.; Bhattacharya, T.; Kusnezov, D. The need for uncertainty quantification in machine-assisted medical decision making. Nature Machine Intelligence 2019, 1, 20–23.
(40) Arvidsson, S.; Spjuth, O.; Carlsson, L.; Toccaceli, P. Prediction of Metabolic Transformations using Cross Venn-ABERS Predictors. In Proceedings of Machine Learning Research 2017, 60, 118–131.
(41) Ahlberg, E.; Buendia, R.; Carlsson, L. Using Venn-Abers Predictors to assess CardioVascular Risk. In Proceedings of Machine Learning Research 2018, 91, 1–15.