
Article

Shared Consensus Machine Learning Models for Predicting Blood Stage Malaria Inhibition Andreas Verras, Christopher Lee Waller, Peter Gedeck, Darren Green, Thierry Kogej, Anandkumar V. Raichurkar, Manoranjan Panda, Anang A Shelat, Julie A Clark, R. Kiplin Guy, George Papadatos, and Jeremy N. Burrows J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.6b00572 • Publication Date (Web): 03 Mar 2017 Downloaded from http://pubs.acs.org on March 10, 2017




Shared Consensus Machine Learning Models for Predicting Blood Stage Malaria Inhibition

Andreas Verras1, Chris L. Waller2, Peter Gedeck3, Darren Green4, Thierry Kogej5, Anandkumar Raichurkar6, Manoranjan Panda6, Anang Shelat7, Julie Clark7, Kip Guy7, George Papadatos8, Jeremy Burrows9

Corresponding Author Email: [email protected]

Author Affiliations
1. Merck & Co., Inc., Kenilworth, NJ, 07033, USA
2. Merck & Co., Inc., Boston, MA, 02210, USA
3. Novartis, Singapore, 117439, Singapore
4. GlaxoSmithKline, Stevenage, SG1 2NY, UK
5. AstraZeneca, Kungsbacka, 434 00, Sweden
6. AstraZeneca, Bangalore, 560045, India
7. St. Jude Children's Research Hospital, Chemical Biology and Therapeutics Department, Memphis, TN, 38105, USA
8. European Bioinformatics Institute, Cambridge, CB10 1SD, UK
9. Medicines for Malaria Venture Discovery, Geneva, 1215, Switzerland



Abstract
The development of new antimalarial therapies is essential, and lowering the barrier to entry for the screening and discovery of new lead compound classes can spur drug development at organizations that lack large compound screening libraries or the resources to conduct high throughput screens. Machine learning models have long been established to be more robust, with a larger domain of applicability, when built from larger training sets. Screens over multiple data sets to find compounds with potential malaria blood stage inhibitory activity have been used to generate multiple Bayesian models. Here we describe a method by which Bayesian QSAR models, which contain information on thousands to millions of proprietary compounds, can be shared between collaborators at both for-profit and not-for-profit institutions. This model-sharing paradigm allows for the development of consensus models that have increased predictive power over any single model, yet does not reveal the identity of any compound in the training sets.

Introduction
Malaria is a devastating disease with an annual morbidity estimated at 300-600 million cases worldwide. Approximately 3.3 billion people, half the world's population, are at risk of contracting the disease. Nearly 90% of all malaria deaths occur in sub-Saharan Africa, and the mortality rate is greatest in children under the age of five. Because the disease is most prevalent in poor countries with limited access to health care infrastructure, a single-dose treatment is highly desirable. Additionally, recent resistance to known therapies has exacerbated the need for new pharmaceutical treatments that are novel, cheap, and have appropriate physical properties for easy storage and short administration regimens1.

The participation of academic and small non-profit organizations in antimalarial screening has been limited, in part, by the large resource requirements of maintaining compound libraries and screening capabilities. To aid the discovery of new antimalarials, we have generated multiple Quantitative Structure-Activity Relationship (QSAR) models trained on the ability of compounds to inhibit blood stage malaria. The goal of this work is to expand the applicability and performance of the models by sharing SAR information between organizations without compromising training set identity. The models can be used to identify libraries enriched with malaria-active compounds, so that subsets of commercially available libraries can be acquired and screened according to the capacity of any given organization.

It has long been recognized that, generally, QSAR models perform better when trained on larger data sets2. QSAR models have frequently been trained to predict activity in enzymatic as well as phenotypic screens3-5. We report the generation of multiple models from phenotypic screens at four institutions and validate the consensus model on the phenotypic screens of a fifth institution. Furthermore, we detail a mechanism that allowed us to exchange models between organizations without compromising the identity of the training sets.

The models were constructed using the Naïve Bayesian method implemented in Pipeline Pilot6. The computational time for this method scales linearly with data set size, allowing us to train models on millions of data points. Its performance has been shown to be comparable with more time-intensive methods such as random forest7. Importantly, the Naïve Bayesian method has also been shown to handle extremely noisy data appropriately8.
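The linear scaling follows from the form of the method: each feature's weight is computed once from simple counts over the training set, and scoring a molecule is a sum over its features. Pipeline Pilot's learner is proprietary; the sketch below uses a commonly described Laplacian-corrected estimate, and the function names and exact correction are illustrative assumptions, not the vendor implementation.

```python
import math

def bayesian_feature_weights(feature_counts, total_actives, total_records):
    """Laplacian-corrected Naive Bayesian feature weights (a sketch).

    feature_counts maps feature -> (n_active_with_feature, n_total_with_feature).
    Weights are positive for features enriched in actives and negative for
    features enriched in inactives, relative to the baseline active rate.
    """
    p_base = total_actives / total_records  # baseline P(active)
    weights = {}
    for feat, (n_active, n_total) in feature_counts.items():
        # Laplacian-corrected estimate normalized by the baseline rate:
        # log[(A_f + 1) / (T_f * P_base + 1)]
        weights[feat] = math.log((n_active + 1.0) / (n_total * p_base + 1.0))
    return weights

def score_molecule(fingerprints, weights):
    """A molecule's score is the sum of its feature weights; features
    absent from the model contribute nothing."""
    return sum(weights.get(f, 0.0) for f in fingerprints)
```

Training therefore requires only a single counting pass over the data, which is what makes million-compound training sets tractable.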

Efforts have been made in the past to allow institutions to share data from screening efforts on proprietary compounds. In some cases, subsets of the screen are made public and shared via bioactivity databases such as ChEMBL or PubChem9,10. However, this data frequently includes only a subset of active compounds, while information on inactive compounds is not shared. A promising collaboration between Roche and AstraZeneca (AZ) allows the two companies to share compound information via matched molecular pair analysis, retaining information on transformations rather than molecule identity11. Our approach allows organizations to exchange target-specific information, in some cases informed by their entire compound collections, without revealing the identity of those compounds. This approach was simultaneously and independently developed by Clark et al., who describe an open source framework for sharing Naïve Bayesian models and executed a proof-of-concept with publicly available data12,13. In this communication we describe a similar approach and extend the proof-of-concept to proprietary data.

Model Sharing
Each collaborator generated a model on their internal data sets using Pipeline Pilot, and the models were shared as xml files between collaborators. Data set summaries and model details are provided in Table 1. The xml model files contain summary information on the data set, including the total number of data records as well as the number of "good" and "bad" data records, i.e. active and inactive compounds according to the screening results. When physical properties are included as descriptors, the model reveals the absolute ranges for these properties as well as their mean values and standard deviations over the data set. When topological fingerprints are used as descriptors, one can extract from the model all the fingerprints included, as well as their frequencies and weights. This is essential to allow the model to be used predictively on novel data. Collaborators were comfortable sharing this data for two reasons:
1. Fingerprint information is presented only for the training set in toto and not for specific molecules. While it is possible to deconvolute an individual molecule's identity from its fingerprints14, the total chemical space the model fingerprints represent is substantially larger than the chemical space of the training set.
2. Low information bins are removed, so not all fingerprints present in the training set are included. Fingerprints or physical property ranges that do not occur in predominantly active or inactive molecules frequently enough to contribute significantly to the overall Bayesian prediction are omitted from the model. If the normalized bin estimate for a fingerprint is between -0.05 and 0.05, it is not included in the final xml file.
This pruning results in a large reduction in fingerprints. For example, MMV model 1 is trained on 339 actives and 229,090 inactives, yielding a total of 175,155 FCFP_6 fingerprints for the model. Of those, 154,145 are considered non-informative and only 21,010 are included in the final model. Because the Naïve Bayesian method weights "good" features, the ratio of actives to inactives significantly affects the number of omitted fingerprints. The AZ models are trained on 3,272 actives out of 11,574 compounds. This large proportion of actives results in significantly more fingerprint retention during model generation than for any other model. AZ model 1 has a total of 55,011 FCFP_6 descriptors, only 1,706 of which are removed as low information bins.
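The pruning step can be sketched as a simple filter on the normalized bin estimates. The ±0.05 window is taken from the text; representing the model as a fingerprint-to-estimate map, and treating the window as exclusive of its endpoints, are assumptions.

```python
def prune_low_information(bin_estimates, threshold=0.05):
    """Drop fingerprints whose normalized bin estimate falls strictly
    inside (-threshold, +threshold); only informative bins are written
    to the shared xml model file."""
    return {fp: est for fp, est in bin_estimates.items()
            if abs(est) >= threshold}
```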


While identifying the exact members of any training set is not possible, the nature of the Bayesian prediction and the interpolative aspect of the models does allow some "fishing" for training-set likeness, as illustrated in Figure 1. The Evotec data set was split randomly in half. One half was used to train a Bayesian model as described in the methods section; the other half was used as a test set and predicted by the model. For each predicted molecule, the nearest neighbor in the training set was found and the Tanimoto similarity to this neighbor in FCFP_6 space is reported. Compounds predicted with a high score to be active or inactive tend to be closer to the training set. However, there are many molecules identical to training set members (similarity score of 1.0) that are not predicted with high confidence. The identity of these similar but low-confidence molecules cannot be "fished" from model predictions.
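The nearest-neighbor analysis above rests on Tanimoto similarity between fingerprints. A minimal sketch, with fingerprints represented as sets of integer feature identifiers (an assumption; binary fingerprints give the same formula):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of
    feature identifiers (1.0 = identical feature sets)."""
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def nearest_neighbor_similarity(query_fp, training_fps):
    """Similarity of a query molecule to its closest training-set neighbor,
    as used for the 'fishing' analysis in Figure 1."""
    return max(tanimoto(query_fp, fp) for fp in training_fps)
```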

Model Performance
Contributor Medicines for Malaria Venture (MMV) was unique in having multiple data sets from disparate sources from which to generate models. The Evotec, Johns Hopkins, MMV - St. Jude, and MRCT data sets all represent unique screens done over the course of many years. Initially an effort was made to combine this data, aggregating all the actives and inactives, as defined for each assay, into a single training set to build one model. We define combined models as models where data is aggregated into a single large training set, and consensus models as models built individually on each data set whose predictions are then averaged over test set molecules. By this definition, combined models would require sharing of training set molecule structures and activities, while consensus models require only the final models and thus obscure the training sets. The ability of the combined model to return true positives was compared with the consensus method using the GNF Novartis data set, a set of 4,004 blood stage malaria active compounds made public by Novartis9,15. In the consensus model, a compound was predicted active if it was active in any single model. Table 2 compares the performance of the MMV models 1-4 consensus with the combined model. While not a robust evaluation of combined vs. consensus models, the data suggest that consensus models may allow a lower false negative rate if consensus scoring and domain applicability are used to guide predictions. The authors decided to proceed with four models for several reasons: a) the experimental protocols vary significantly between assays; b) a more robust consensus method could be evaluated later; c) models could be omitted entirely if they contributed little to a consensus prediction because of internally inconsistent data.
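The OR-rule used for the MMV 1-4 consensus, alongside the simple averaging used for the later consensus models, can be sketched as follows; treating each model's activity call as "score at or above its cutoff" is an assumption (Pipeline Pilot determines a best cutoff per model during training):

```python
def any_model_active(predictions, cutoffs):
    """OR-rule consensus: call a compound active if any single model
    scores it at or above that model's activity cutoff.  predictions
    and cutoffs are parallel lists, one entry per model."""
    return any(p >= c for p, c in zip(predictions, cutoffs))

def mean_consensus(predictions):
    """Averaging consensus: the mean Bayesian prediction across models."""
    return sum(predictions) / len(predictions)
```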

After each organization generated models, the xml model files were circulated among contributors and used to predict all training sets. Figure 2 shows the predictions for all models, sorted by model training set size, represented as a box plot over all training sets. The box represents the upper and lower quartiles of ROC AUC predictions across all training sets, the white dash indicates the median value, and the bars indicate the maximum and minimum values in the adjacent quartiles. Self-fit predictions are omitted, but all models have self-fit ROC AUCs between 0.87 and 0.97 (data not shown). As expected, we see a trend toward improved ROC AUCs with increasing training set size.

St. Jude contributed two validation sets: a 220K internal screening collection (St. Jude Screening Set) and a 540K collection acquired from vendors (St. Jude Vendor Library). The structures of compounds in both validation sets were not revealed to the collaborators.

Model Performance – Consensus Models


To evaluate whether consensus models can outperform individual models, a simple average was taken of the Bayesian prediction from each model. We generated all 255 leave-x-models-out combinations, where x varies from 0 to 7, and took the mean of the predictions for each set.
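The 255 consensus sets are simply every non-empty subset of the eight models. A sketch, using illustrative model names:

```python
from itertools import combinations
from statistics import mean

def consensus_subsets(model_names):
    """Enumerate every non-empty subset of models (leave-x-models-out,
    x = 0 .. n-1).  For 8 models this yields 2**8 - 1 = 255 sets."""
    subsets = []
    for size in range(1, len(model_names) + 1):
        subsets.extend(combinations(model_names, size))
    return subsets

def consensus_score(per_model_scores, subset):
    """Simple consensus: the mean of the selected models' predictions."""
    return mean(per_model_scores[m] for m in subset)
```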

Predictions for the two validation sets are shown in Figure 3. For both data sets we observe an improvement in ROC AUC for consensus models over single models; some of the worst performing models on these validation sets are single models. We find that even simply averaging the predictions from multiple models can yield an improvement over single models. The enrichment rate for the Vendor Library (Fig. 3b) is significantly higher than the enrichment in the St. Jude Screening Set (Fig. 3a). This may be because vendor libraries are commercially available and many molecules from the training sets may be represented in the compound collections of other collaborators; however, because we cannot identify the training sets or the contents of the Vendor Library set, we cannot determine this with certainty.

In both validation sets the best performing single model is the GSK model and the second best is the Novartis model. This is not surprising given that these models are built on the largest training sets. Despite their good performance, they are outperformed by the consensus models, likely because the effective training set is larger for the consensus.

The dynamic range of prediction scores from Naïve Bayesian models as implemented in Pipeline Pilot depends on the number of active and inactive features, which in turn frequently relates to training set size. Models trained on larger numbers of compounds have significantly larger prediction score ranges. The GSK model, trained on 2M compounds, ranges from -202 to 197 in its predictions for the St. Jude Screening Collection. In contrast, the predictions of the AZ models, trained on 11K compounds, range from -36 to 25. To normalize the scores relative to each other, we generated predictions for all models over 1.4M compounds from ChEMBL 1916. We treat this set as a surrogate for drug-like space and use the standard deviation of each model's predictions over it to generate a z-score normalized score for all 255 consensus models against the St. Jude Screening Collection and St. Jude Vendor Library validation sets, according to the equation:

NormalizedScore = (ModelPrediction - ModelCutoff) / ModelStdDev

where ModelPrediction is the prediction for a given molecule, ModelCutoff is determined during model training as the best cutoff to distinguish actives from inactives, and ModelStdDev is the previously described standard deviation over ChEMBL 19. When ROC AUCs are calculated from this mean normalized score for consensus models, we see no improvement over the raw mean scores for either validation set (Fig. 4a and b). The magnitude of a Naïve Bayesian prediction for a given molecule reflects the presence of highly weighted fingerprints contributing to activity or inactivity. Models built on larger training sets tend to have more of these features, and a larger magnitude prediction reflects confidence. While normalization has been shown to be an effective tool for combining heterogeneous classification data17,18, we find that normalization does not improve the predictivity of the consensus models. We attribute this to the fact that we use a homogeneous classification method, Naïve Bayesian in all cases, and the absolute prediction represents a confidence that is in turn treated as a weighting factor in the raw mean scores.

The best performing consensus model in validation against the St. Jude Screening Set was a combination of MMV Model 2, MMV Model 4, AZ Model 1, and Nov Model 1 (ROC AUC 0.816). The best performing consensus model in validation against the St. Jude Vendor Library set was MMV Model 1, MMV Model 4, AZ Model 1, GSK Model 1, and Nov Model 1 (ROC AUC 0.854). The top ranked consensus models against both validation sets included a mix of many of the models, and the performance ranking is ambiguous. In future work we intend to use a combination of all models.

Domain of Applicability
The variation in model performance between the two validation sets highlights an important aspect of QSAR that has received significant attention in recent years. Domain applicability is the practice of generating reliability metrics for QSAR models19-21. It is predicated on the idea that not all molecules will be equally well predicted by a given model. Domain applicability can be used to give a measure of confidence for model predictions or to highlight the chemical space where a model can or cannot be reliably used. In consensus modeling it is particularly useful for indicating which model or set of models to employ for a specific prediction.
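The normalization equation can be computed directly. Using a model's predictions over a large reference set (ChEMBL in the text) as the spread estimate follows the description above; the choice of the population standard deviation is an assumption.

```python
from statistics import pstdev

def normalized_score(model_prediction, model_cutoff, reference_predictions):
    """NormalizedScore = (ModelPrediction - ModelCutoff) / ModelStdDev,
    where ModelStdDev is the standard deviation of the model's scores
    over a large surrogate of drug-like space.  Puts models with very
    different dynamic ranges on a common scale."""
    return (model_prediction - model_cutoff) / pstdev(reference_predictions)
```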

Early work in quantifying domain applicability used the similarity of a compound to be predicted to its nearest neighbor or neighbors in the model training set22, an approach that cannot be implemented in this sharing paradigm: because our training sets are obscured, we cannot compare validation and training sets. In anticipation of this problem we developed a protocol that allowed the collaborator contributing validation sets to submit their molecules to the models. The protocol generated fingerprints for each molecule appropriate to the models and compared the fingerprint coverage of each validation set molecule against the entire fingerprint collection of each model. Thus, in addition to returning eight predictions (one for each model) for each molecule in the validation set, we also returned eight fingerprint coverage percentages. Structural information about the validation set molecules was then discarded, and the predictions, fingerprint coverage, and a hashed index for all validation molecules were shared between collaborators.


To evaluate whether fingerprint coverage is a good measure of domain applicability, we sorted the data into quartiles based on the mean fingerprint coverage for each consensus model, such that quartile 1 had the highest fingerprint coverage and quartile 4 the lowest, and calculated a ROC AUC for each quartile. Figure 5a shows the relationship between mean fingerprint coverage and ROC AUC for the top five consensus sets against the St. Jude Screening library. Model performance increases with higher fingerprint coverage. However, when the same analysis is applied to the Vendor Library data set (Fig. 5b), we see a much weaker trend for the first and second quartiles, followed by a significant enrichment in the fourth quartile, which comprises the molecules with the lowest mean fingerprint coverage.
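The quartile analysis can be sketched with a rank-based ROC AUC (the pairwise Mann-Whitney form); the record layout below is an illustrative assumption.

```python
def roc_auc(scores, labels):
    """Rank-based ROC AUC: the probability that a randomly chosen active
    outscores a randomly chosen inactive; ties count one half."""
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))

def auc_by_coverage_quartile(records):
    """records: list of (mean_fingerprint_coverage, score, label) tuples.
    Sort by coverage (highest first), split into four quartiles, and
    report one AUC per quartile to test coverage as a reliability metric."""
    ordered = sorted(records, key=lambda r: -r[0])
    n = len(ordered)
    quartiles = [ordered[i * n // 4:(i + 1) * n // 4] for i in range(4)]
    return [roc_auc([s for _, s, _ in q], [y for _, _, y in q])
            for q in quartiles]
```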

The Vendor Library data set is commercially available and, while we cannot determine unambiguously whether it was included in any training sets, Figure 5c shows the fingerprint coverage by each of the five models comprising the top consensus model for the fourth quartile of the Vendor Library data when ranked by average fingerprint coverage. The significantly higher fingerprint coverage of quartile 4 by the GSK model suggests that either these molecules, or molecules with the same fingerprint features, were included in the GSK training set. The mean fingerprint coverage is skewed low despite the consensus model's good performance for this quartile.

We find that fingerprint coverage is a useful indicator of domain of applicability and, importantly, one that is applicable in this approach with obscured data sets.


Conclusions
Compound libraries and associated screening data are some of the most valuable data in early drug discovery; thus, sharing this information has the potential to reveal early leads or compromise future projects that rely on a subset of chemical space. For neglected diseases, several organizations have shared hit lists of active compounds arising from screens of proprietary compounds, but this omits valuable SAR that can be used to deprioritize or more quickly optimize a series. To allow SAR sharing without revealing compound set identities, we have implemented an information sharing approach at the model level rather than the compound level. This allows the prioritization of screening sets, should aid SAR development, and will hopefully seed early malaria blood stage inhibitor discovery. We hope that the shared models might help discover new chemical matter contributing to superior antimalarial lead series. We would also highlight that the model sharing paradigm can be applied to any in vitro assay, phenotypic screen, or physical property measurement, and that it can be implemented between partners with the likely result of an improved model, even when both partners have very large data sets. This work represents a first attempt at this sharing approach and was done retrospectively, using data sets that were not standardized in terms of experimental method, activity cutoffs, or model descriptors. If this approach is pursued prospectively, a more rigorous evaluation of these parameters, as well as of domain applicability metrics, could improve it further. Ideally these models can be safely shared publicly, particularly when housed with a third, "honest broker" party, to improve the predictive capabilities of all participants and the community alike.


Experimental Details

Model Training Data Sets
Evotec. The Evotec set includes 229,429 compounds assayed in a medium throughput screen using erythrocytes and the 3D7 PF strain with DAPI stain, with an activity threshold of EC50 ≤ 4.0 µM. Of these, 339 compounds were classified as actives and the remaining 229,090 as inactives. The data is proprietary to MMV and was used to construct MMV model 1.

Johns Hopkins. The Johns Hopkins set includes 2,524 compounds assayed in a medium throughput screen using erythrocytes and the 3D7 and Dd2 PF strains at 10 µM, at both 48 and 96 hours, using YOYO red and SYBR dyes, with an activity threshold of pINH ≥ 50% at the high dose at either time point. Of these, 247 compounds were classified as actives and the remaining 2,277 as inactives. The assay was conducted by MMV and the data is public. It was used to construct MMV model 2.

MRCT. The Medical Research Council Technology (MRCT) set includes 41,689 compounds assayed in a medium throughput screen using erythrocytes and the 3D7 and Dd2 PF strains with DAPI stain, with an activity cutoff of pINH ≥ 40% at 1.96 µM concentration. Of these, 190 compounds were classified as actives and the remaining 41,499 as inactives. The data is proprietary to MMV and was used to construct MMV model 3.


MMV – St. Jude. The St. Jude set includes 305,810 compounds assayed in a medium throughput screen using erythrocytes and the 3D7 PF strain with YOYO red and SYBR dyes, with actives defined as pINH ≥ 80% at 7 µM compound concentration. Of these, 2,507 compounds were classified as actives and the remaining 303,303 as inactives. The assay was conducted by MMV and St. Jude. The data is public and was used to train MMV model 4.

AZ. The AstraZeneca data set contains 11,574 compounds, a subset of a larger screening set. The initial screen was a 72 hour incubation with the NF54 and K1 PF strains, as previously described23,24. With an activity cutoff of EC50 ≤ 2 µM, 3,272 compounds were classified as active and 8,302 as inactive. Because it was selected from a larger screen, this set contains the largest proportion of actives of any data set. The data is proprietary to AZ and was used to construct AZ models 1 and 2.

GSK. The GlaxoSmithKline data set contains 2,006,390 compounds assayed in a high throughput screen. Compounds were tested at 2 µM against the 3D7 PF strain incubated for 7 hours. A cutoff of pINH ≥ 80% was used to define active compounds. In total, 13,535 compounds were classified as active and 1,992,855 as inactive. The data is proprietary to GSK and was used to build GSK model 1.

Novartis. The Novartis data set contains 2,523,921 compounds assayed in a high throughput screen using erythrocytes and the 3D7 strain, with an activity cutoff of pINH ≥ 50% at either 1.25 or 12.5 µM compound concentration. Of these, 4,309 compounds were classified as active and 2,519,612 as inactive. The data is proprietary to Novartis and was used to build Nov model 1.

Model Validation Data Sets

GNF HTS Actives. The set of compounds from the Genomics Institute of the Novartis Research Foundation includes 4,004 actives assayed in a high throughput screen using erythrocytes and the 3D7 strain, with an activity cutoff of pINH ≥ 50% at either 1.25 or 12.5 µM compound concentration. This data set was initially used only by contributors at MMV to assess the performance of their in-house models in predicting a set of actives from an external test set25.

St. Jude Vendor Library Screen. In collaboration with MMV, St. Jude screened 541,403 compounds in a high throughput screen. Compounds were incubated for 72 hours at 7 µM against a 3D7 strain, with an activity cutoff of pINH ≥ 80%. Of these, 2,026 were classified as active and 539,377 as inactive. The compound collection was purchased from various vendors and the data is proprietary to MMV and St. Jude.

St. Jude Sample Collection Screen. The sample collection screen consists of 220,691 compounds screened in a high throughput assay. Compounds were incubated for 72 hours at 7 µM against a 3D7 strain, with an activity cutoff of pINH ≥ 80%. Of these, 9,081 were classified as active and 211,610 as inactive. The data is proprietary to St. Jude.


Machine Learning Models

All machine learning models were trained on categorical (active/inactive) data using the Naïve Bayesian classifier implemented in Pipeline Pilot. Uninformative bins that contribute less than 0.05 to the overall model score were removed. Continuous variables used as descriptors were segregated into 10 bins. A combination of topological26 and physical27 property descriptors was used.
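The Pipeline Pilot implementation itself is proprietary, but the scheme described above can be sketched with the standard Laplacian-corrected Naïve Bayesian classifier that such tools use for sparse categorical features. The function names, the exact correction formula, and the toy feature sets below are our own illustrative assumptions; only the 0.05 pruning threshold comes from the text.

```python
import math
from collections import Counter

def train_bayes(samples, min_weight=0.05):
    """Train a Laplacian-corrected Naive Bayes scorer on categorical data.

    samples: list of (features, is_active) pairs, where features is a set of
    fingerprint bits or descriptor bins. Returns {feature: log-weight}, with
    uninformative features (|weight| < min_weight) pruned, mirroring the
    0.05 contribution cutoff described in the text.
    """
    n = len(samples)
    n_active = sum(1 for _, active in samples if active)
    p_base = n_active / n                       # baseline P(active)
    totals, actives = Counter(), Counter()
    for features, active in samples:
        for f in features:
            totals[f] += 1
            if active:
                actives[f] += 1
    weights = {}
    for f, t in totals.items():
        # Laplacian-corrected relative likelihood: log((A_f + 1) / (T_f * p + 1)),
        # positive when a feature is enriched among actives, negative otherwise.
        w = math.log((actives[f] + 1) / (t * p_base + 1))
        if abs(w) >= min_weight:
            weights[f] = w
    return weights

def score(weights, features):
    """Model score for a molecule = sum of weights of its present features."""
    return sum(weights.get(f, 0.0) for f in features)
```

A feature seen only in actives gets a positive weight, one seen only in inactives a negative weight, and features with no class signal fall below the pruning threshold and drop out of the model.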

MMV models 1, 2, 3, and 4. The MMV models trained on the four data sets were generated using FCFP_6, ALogP, molecular weight, number of hydrogen bond acceptors and donors, and number of rotatable bonds as descriptors.

AZ model 1. AstraZeneca model 1 was trained on the AZ data set and generated using FCFP_6, ALogP, molecular weight, number of hydrogen bond acceptors and donors, and number of rotatable bonds as descriptors.

AZ model 2. AstraZeneca model 2 was trained on the AZ data set and generated using ECFP_6 descriptors.


GSK model 1. GSK model 1 was trained on the GSK data set and generated using FCFP_6, ECFP_6, ALogP, number of hydrogen bond acceptors and donors, and number of atoms as descriptors.

Nov model 1. Nov model 1 was trained on the Novartis data set and generated using FCFP_6, ALogP, molecular weight, number of hydrogen bond acceptors and donors, and number of rotatable bonds as descriptors.

A summary of all data set and model information is given in Table 1.

Calculation of Fingerprint Coverage. For every compound predicted in the validation sets, a percent coverage of fingerprints was calculated. Fingerprints were generated for each molecule in the test sets and compared to all fingerprints in each model:

percent coverage = (number of molecule fingerprints found in the model / total number of fingerprints in the molecule) × 100

The appropriate fingerprints were used for each comparison, i.e., FCFP_6, ECFP_6, or both, depending on the model. Coverage was calculated only against fingerprints included in the model; low-information bins removed from the models were excluded from the coverage calculation.
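Reading the coverage formula as the fraction of a molecule's fingerprint features that also occur in the model's retained fingerprint dictionary, the calculation reduces to a few lines (the function name and the set representation of fingerprints are our own):

```python
def percent_coverage(mol_fps, model_fps):
    """Percent of a molecule's fingerprint features that also appear in the
    model's (post-pruning) fingerprint dictionary.

    mol_fps:   set of fingerprint features generated for one molecule
    model_fps: set of fingerprint features retained in the trained model
    """
    if not mol_fps:
        return 0.0
    hits = sum(1 for fp in mol_fps if fp in model_fps)
    return 100.0 * hits / len(mol_fps)
```

A molecule whose features are all known to the model scores 100%, and coverage falls as the molecule moves away from the chemical space the model was trained on.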

Metrics

There are multiple metrics for evaluating model performance, including metrics specific to categorical data such as Cohen's kappa. For our purposes we use receiver operating characteristic (ROC) curves almost exclusively, because other metrics such as accuracy depend on data set size and the active/inactive distribution. ROC curves evaluate ranking methods by plotting the true positive fraction on the y-axis against the false positive fraction on the x-axis; ROC AUC is the area under this curve. A perfect method has a ROC AUC of 1, and a random method has an AUC of 0.5.
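ROC AUC can be computed without building the curve explicitly, via the rank-sum (Mann-Whitney) identity: it equals the probability that a randomly chosen active is scored above a randomly chosen inactive, which is also why it is insensitive to the active/inactive ratio. A brute-force sketch (our own; a sort-based version would be used at HTS scale):

```python
def roc_auc(scores, labels):
    """ROC AUC via the Mann-Whitney identity: the probability that a randomly
    chosen active outranks a randomly chosen inactive, ties counting 1/2.
    O(n_act * n_inact) pairwise loop; fine for an illustration only."""
    act = [s for s, y in zip(scores, labels) if y]
    inact = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if a > i else 0.5 if a == i else 0.0
               for a in act for i in inact)
    return wins / (len(act) * len(inact))
```

Duplicating every inactive compound leaves this value unchanged, whereas accuracy at a fixed threshold would shift with the class balance.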

Because we are trying to estimate the impact the models could have on enriching future screening campaigns, ROC AUC is an appropriate metric by which to evaluate them. Virtual screening studies often report early enrichment metrics, such as enrichment in the top 5% of the ranking, or continuous alternatives such as BEDROC28. However, such metrics can show significant deviation depending on the cutoff chosen for evaluation, or depend on extrinsic parameters, unlike ROC AUC29. For these reasons we present overall ROC AUC when evaluating model performance.

Acknowledgements

The authors would like to acknowledge Alex Alexander and Mark Gardner for discussions on information sharing, Bob Sheridan and Dana Honeycutt for advice on machine learning models, and Michael Palmer for assistance in organizing collaborators across five time zones.

Funding Sources

Funds for Andreas Verras were provided through the Richard T. Clark Fellowship for Global Health by Merck & Co.


Contributions

These authors contributed to the modeling work: A.V., C.W., P.G., D.G., T.K., A.R., A.S., and G.P. Project design by A.V., C.W., and J.B. A.V. wrote the manuscript.

Supporting Information

Training set molecules are released for the data sets used to generate MMV Model 2 and MMV Model 4, with structures recorded as SMILES strings and compound activity classified in the Library field as Active or Inactive. With these data it is possible to reconstruct Models 2 and 4; the fingerprint and property bin descriptors from the Naïve Bayesian models have been included so that researchers may compare results.


Tables

Table 1. Data Set and Model Details.

| Data Set | Public | Data Set Size | Number of Actives | Number of Inactives | Model | Descriptors |
| --- | --- | --- | --- | --- | --- | --- |
| Evotec | N | 229,429 | 339 | 229,090 | MMV Model 1 | FCFP_6, ALogP, MW, NumHAcc, NumHDon, NumRotBonds |
| Johns Hopkins | Y | 2,524 | 247 | 2,277 | MMV Model 2 | FCFP_6, ALogP, MW, NumHAcc, NumHDon, NumRotBonds |
| MRCT | N | 41,689 | 190 | 41,499 | MMV Model 3 | FCFP_6, ALogP, MW, NumHAcc, NumHDon, NumRotBonds |
| MMV-St. Jude | Y | 305,810 | 2,507 | 303,303 | MMV Model 4 | FCFP_6, ALogP, MW, NumHAcc, NumHDon, NumRotBonds |
| AZ | N | 11,574 | 3,272 | 8,302 | AZ Model 1 | FCFP_6, ALogP, MW, NumHAcc, NumHDon, NumRotBonds |
| AZ | N | 11,574 | 3,272 | 8,302 | AZ Model 2 | ECFP_6 |
| GSK | N | 2,006,390 | 13,535 | 1,992,855 | GSK Model 1 | FCFP_6, ECFP_6, ALogP, NumHAcc, NumHDon, NumAtoms |
| Novartis | N | 2,523,921 | 4,309 | 2,519,612 | Nov Model 1 | FCFP_6, ALogP, MW, NumHAcc, NumHDon, NumRotBonds |
| St. Jude Vendor Library | N | 541,403 | 2,026 | 539,377 | Validation set | — |
| St. Jude Screening Set | N | 220,691 | 9,081 | 211,610 | Validation set | — |


Table 2. Combined vs. Consensus Model. Compounds are designated as active in the consensus model if they are active in any individual model.

| | Combined Model | Consensus Model |
| --- | --- | --- |
| True Positives | 2219 | 2801 |
| False Negatives | 1785 | 1203 |
| False Positives | 0 | 0 |
| True Negatives | 0 | 0 |
| Correct Predictions | 2219 | 2801 |
| Incorrect Predictions | 1785 | 1203 |
| Total Scored Cases | 4004 | 4004 |
| Overall Error Rate | 0.45 | 0.30 |
| Overall Accuracy Rate | 0.55 | 0.70 |
| True Positive Rate | 0.55 | 0.70 |
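The any-model consensus rule from Table 2, applied to an actives-only validation set such as the 4004 GNF actives (where only true positives and false negatives are possible), can be sketched as follows; the function names and toy call lists are our own:

```python
def consensus_active(calls_per_model):
    """Consensus rule from Table 2: a compound is predicted active if ANY
    individual model calls it active.

    calls_per_model: list of per-model boolean call lists, one entry per
    compound, all the same length."""
    return [any(calls) for calls in zip(*calls_per_model)]

def tpr_on_actives(predicted_active):
    """On an actives-only validation set, every compound is truly active, so
    accuracy, true positive rate, and fraction-predicted-active coincide."""
    return sum(predicted_active) / len(predicted_active)
```

The rule can only add true positives relative to any single model, which is why the consensus column of Table 2 recovers more actives (2801/4004, a true positive rate of about 0.70) than the combined model (2219/4004, about 0.55).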


References

1. WHO, World Health Organization World malaria report 2013. WHO, Geneva, 2013. http://www.who.int/malaria/publications/world_malaria_report_2013/en/ (accessed Oct 27, 2016).

2. Gedeck, P.; Rohde, B.; Bartels, C. QSAR – How Good Is It in Practice? Comparison of Descriptor Sets on an Unbiased Cross Section of Corporate Data Sets. J. Chem. Inf. Model. 2006, 46, 1924–1936.

3. Ekins, S.; Reynolds, R. C.; Franzblau, S. G.; Wan, B.; Freundlich, J. S.; Bunin, B. A. Enhancing Hit Identification in Mycobacterium Tuberculosis Drug Discovery Using Validated Dual-Event Bayesian Models. PLOS ONE 2013, 8, e63240.

4. Lee, J. H.; Lee, S.; Choi, S. In Silico Classification of Adenosine Receptor Antagonists Using Laplacian-Modified Naïve Bayesian, Support Vector Machine, and Recursive Partitioning. J. Mol. Graph. Model. 2010, 28, 883–890.

5. Sun, H. An Accurate and Interpretable Bayesian Classification Model for Prediction of HERG Liability. ChemMedChem 2006, 1, 315–322.

6. BIOVIA Pipeline Pilot, v. 8.5; BIOVIA.

7. Chen, B.; Sheridan, R. P.; Hornak, V.; Voigt, J. H. Comparison of Random Forest and Pipeline Pilot Naïve Bayes in Prospective QSAR Predictions. J. Chem. Inf. Model. 2012, 52, 792–803.

8. Glick, M.; Klon, A. E.; Acklin, P.; Davies, J. W. Enrichment of Extremely Noisy High-Throughput Screening Data Using a Naïve Bayes Classifier. J. Biomol. Screen. 2004, 9, 32–36.


9. Gamo, F.-J.; Sanz, L. M.; Vidal, J.; de Cozar, C.; Alvarez, E.; Lavandera, J.-L.; Vanderwall, D. E.; Green, D. V. S.; Kumar, V.; Hasan, S.; Brown, J. R.; Peishoff, C. E.; Cardon, L. R.; Garcia-Bustos, J. F. Thousands of Chemical Starting Points for Antimalarial Lead Identification. Nature 2010, 465, 305–310.

10. Guiguemde, W. A.; Shelat, A. A.; Bouck, D.; Duffy, S.; Crowther, G. J.; Davis, P. H.; Smithson, D. C.; Connelly, M.; Clark, J.; Zhu, F.; Jiménez-Díaz, M. B.; Martinez, M. S.; Wilson, E. B.; Tripathi, A. K.; Gut, J.; Sharlow, E. R.; Bathurst, I.; Mazouni, F. E.; Fowble, J. W.; Forquer, I.; McGinley, P. L.; Castro, S.; AnguloBarturen, I.; Ferrer, S.; Rosenthal, P. J.; DeRisi, J. L.; Sullivan, D. J.; Lazo, J. S.; Roos, D. S.; Riscoe, M. K.; Phillips, M. A.; Rathod, P. K.; Van Voorhis, W. C.; Avery, V. M.; Guy, R. K. Chemical Genetics of Plasmodium Falciparum. Nature 2010, 465, 311–315.

11. Dossetter, A. G.; Griffen, E. J.; Leach, A. G. Matched Molecular Pair Analysis in Drug Discovery. Drug Discovery Today 2013, 18, 724–731.

12. Clark, A. M.; Dole, K.; Coulon-Spektor, A.; McNutt, A.; Grass, G.; Freundlich, J. S.; Reynolds, R. C.; Ekins, S. Open Source Bayesian Models. 1. Application to ADME/Tox and Drug Discovery Datasets. J. Chem. Inf. Model. 2015, 55, 1231–1245.

13. Clark, A. M.; Ekins, S. Open Source Bayesian Models. 2. Mining a “Big Dataset” To Create and Validate Models with ChEMBL. J. Chem. Inf. Model. 2015, 55, 1246–1260.

14. Faulon, J.-L.; Brown, W. M.; Martin, S. Reverse Engineering Chemical Structures from Molecular Descriptors: How Many Solutions? J. Comput.-Aided Mol. Des. 2005, 19, 637–650.

15. ChEMBL-NTD. https://www.ebi.ac.uk/chemblntd/, (accessed Oct 27, 2016).


16. Bento, A. P.; Gaulton, A.; Hersey, A.; Bellis, L. J.; Chambers, J.; Davies, M.; Krüger, F. A.; Light, Y.; Mak, L.; McGlinchey, S.; Nowotka, M.; Papadatos, G.; Santos, R.; Overington, J. P. The ChEMBL Bioactivity Database: An Update. Nucleic Acids Res. 2014, 42, D1083–D1090.

17. Martínez-Jiménez, F.; Papadatos, G.; Yang, L.; Wallace, I. M.; Kumar, V.; Pieper, U.; Sali, A.; Brown, J. R.; Overington, J. P.; Marti-Renom, M. A. Target Prediction for an Open Access Set of Compounds Active against Mycobacterium Tuberculosis. PLOS Comput Biol 2013, 9, e1003253.

18. Riniker, S.; Fechner, N.; Landrum, G. A. Heterogeneous Classifier Fusion for Ligand-Based Virtual Screening: Or, How Decision Making by Committee Can Be a Good Thing. J. Chem. Inf. Model. 2013, 53, 2829–2836.

19. He, L.; Jurs, P. C. Assessing the Reliability of a QSAR Model's Predictions. J. Mol. Graph. Model. 2005, 23, 503–523.

20. Weaver, S.; Gleeson, M. P. The Importance of the Domain of Applicability in QSAR Modeling. J. Mol. Graph. Model. 2008, 26, 1315–1326.

21. Dragos, H.; Gilles, M.; Alexandre, V. Predicting the Predictability: A Unified Approach to the Applicability Domain Problem of QSAR Models. J. Chem. Inf. Model. 2009, 49, 1762–1776.

22. Sheridan, R. P.; Feuston, B. P.; Maiorov, V. N.; Kearsley, S. K. Similarity to Molecules in the Training Set Is a Good Discriminator for Prediction Accuracy in QSAR. J. Chem. Inf. Comput. Sci. 2004, 44, 1912–1928.

23. Hameed P, S.; Chinnapattu, M.; Shanbag, G.; Manjrekar, P.; Koushik, K.; Raichurkar, A.; Patil, V.; Jatheendranath, S.; Rudrapatna, S. S.; Barde, S. P.; Rautela, N.; Awasthy, D.; Morayya, S.; Narayan, C.; Kavanagh, S.; Saralaya, R.; Bharath, S.; Viswanath, P.; Mukherjee, K.; Bandodkar, B.; Srivastava, A.; Panduga, V.; Reddy, J.; Prabhakar, K. R.; Sinha, A.; Jiménez-Díaz, M. B.; Martínez, M. S.; Angulo-Barturen, I.; Ferrer, S.; Sanz, L. M.; Gamo, F. J.; Duffy, S.; Avery, V. M.; Magistrado, P. A.; Lukens, A. K.; Wirth, D. F.; Waterson, D.; Balasubramanian, V.; Iyer, P. S.; Narayanan, S.; Hosagrahara, V.; Sambandamurthy, V. K.; Ramachandran, S. Aminoazabenzimidazoles, a Novel Class of Orally Active Antimalarial Agents. J. Med. Chem. 2014, 57, 5702–5713.

24. Smilkstein, M.; Sriwilaijaroen, N.; Kelly, J. X.; Wilairat, P.; Riscoe, M. Simple and Inexpensive Fluorescence-Based Technique for High-Throughput Antimalarial Drug Screening. Antimicrob. Agents Chemother. 2004, 48, 1803–1806.

25. Gagaring, K.; Borboa, R.; Francek, C.; Chen, Z.; Buenviaje, J.; Plouffe, D.; Winzeler, E.; Brinker, A.; Diagana, T.; Taylor, J.; Glynne, R.; Chatterjee, A.; Kuhen, K. Novartis-GNF Malaria Box. Genomics Institute of the Novartis Research Foundation (GNF), 10675 John Jay Hopkins Drive, San Diego CA 92121, USA and Novartis Institute for Tropical Disease, 10 Biopolis Road, Chromos # 05-01, 138 670 Singapore. https://www.ebi.ac.uk/chemblntd, (accessed Oct 27, 2016).

26. Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.

27. Ghose, A. K.; Crippen, G. M. Atomic Physicochemical Parameters for Three-Dimensional-Structure-Directed Quantitative Structure–Activity Relationships. 2. Modeling Dispersive and Hydrophobic Interactions. J. Chem. Inf. Comput. Sci. 1987, 27, 21–35.

28. Truchon, J.-F.; Bayly, C. I. Evaluating Virtual Screening Methods: Good and Bad Metrics for the "Early Recognition" Problem. J. Chem. Inf. Model. 2007, 47, 488–508.

29. Nicholls, A. What Do We Know and When Do We Know It? J. Comput.-Aided Mol. Des. 2008, 22, 239–255.


Figure 1. Predictions vs. Similarity to Training Set. Predictions on the 50-50 split test set portion of the Evotec data set. Compounds with high absolute prediction scores tend to be closer to the training set molecules.


Figure 2. Model Predictions vs. Training Sets. Models shared among all collaborators were used to predict all training sets. Box plot of each model's ROC AUC over all training sets, showing overall model performance. The horizontal line indicates the median value, the boxes above and below are the upper and lower quartiles, respectively, and the vertical lines indicate the upper and lower adjacent values. The x-axis is sorted by total size of the training sets.


Figure 3. Consensus Model Performance. All combinations of consensus models were generated, and each ROC AUC was based on the arithmetic average of the individual model predictions. A) Histogram of all consensus model combinations and their ROC AUCs for the St. Jude Screening Collection validation set. Single-model combinations are shown in black. B) Histogram of all consensus model combinations and their ROC AUCs for the St. Jude Vendor Library validation set. Single-model combinations are shown in bold.

Figure 4. Normalized vs. Raw Mean Scores. All combinations of consensus models were generated, and the ROC AUC based on the raw mean score was plotted against that based on the ChEMBL 19 normalized mean score. The x = y line is shown for comparison. A) Plot of St. Jude Screening Set normalized vs. raw mean scores. B) Plot of St. Jude Vendor Library normalized vs. raw mean scores.


Figure 5. Domain Applicability. Mean fingerprint coverage as a metric of domain applicability was applied to all consensus models. A) St. Jude Screening Library ROC AUC when splitting the data set into quartiles by raw mean fingerprint coverage; quartile one has the highest and quartile four the lowest mean fingerprint coverage. B) St. Jude Vendor Library ROC AUC when splitting the data set into quartiles by raw mean fingerprint coverage; quartile one has the highest and quartile four the lowest mean fingerprint coverage. C) Box plot of the mean fingerprint coverage for St. Jude Vendor Library quartile four. The horizontal line indicates the median value, the boxes above and below are the upper and lower quartiles, respectively, the vertical lines indicate the upper and lower adjacent values, and the points represent values outside the adjacent values. The GSK model has significantly higher coverage of these compounds.


TOC graphic
