Subscriber access provided by READING UNIV
Article
Coupling Matched Molecular Pairs with Machine Learning for Virtual Compound Optimization Simone Fulle, Samo Turk, Benjamin Merget, and Friedrich Rippmann J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00298 • Publication Date (Web): 13 Nov 2017 Downloaded from http://pubs.acs.org on November 13, 2017
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 30
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Coupling Matched Molecular Pairs with Machine Learning for Virtual Compound Optimization Samo Turk,1 Benjamin Merget,1 Friedrich Rippmann,2 Simone Fulle1* 1
BioMed X Innovation Center, Im Neuenheimer Feld 515, 69120 Heidelberg, Germany
2
Global Computational Chemistry, Merck KGaA, Frankfurter Strasse 250, 64293 Darmstadt,
Germany * Corresponding author:
[email protected] ACS Paragon Plus Environment
1
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 2 of 30
ABSTRACT: Matched molecular pair (MMP) analyses are widely used in compound optimization projects to gain insights into structure–activity relationships (SAR). The analysis is traditionally done via statistical methods but can also be employed together with machine learning (ML) approaches to extrapolate to novel compounds. The here introduced MMP/ML method combines a fragment-based MMP implementation with different machine learning methods to obtain automated SAR decomposition and prediction. To test the prediction capabilities and model transferability, two different compound optimization scenarios were designed: (1) ‘new fragments’ that occurs when exploring new fragments for a defined compound series, and (2) ‘new static core and transformations’ that resembles for instance the identification of a new compound series. Very good results were achieved by all employed machine learning methods especially for the ‘new fragments’ case, but overall deep neural network models performed best, allowing reliable predictions also for the ‘new static core and transformations’ scenario, where comprehensive SAR knowledge of the compound series is missing. Furthermore, we show that models trained on all available data have a higher generalizability compared to models trained on focused series and can extend beyond chemical space covered in the training data. Thus, coupling MMP with deep neural networks provides a promising approach to make high quality predictions on various data sets and in different compound optimization scenarios.
KEYWORDS: Machine Learning, Random Forest, Gradient Boosting Machines, Deep Neural Networks, Deep Learning, Matched Molecular Pairs, SAR
ACS Paragon Plus Environment
2
Page 3 of 30
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Introduction Matched Molecular Pair (MMP) is a popular concept to study changes of compound properties that are associated with a single localized structural change (Figure 1).1,2 If the structural change is correlated with a big change in activity, MMPs are called activity cliffs,3,4 and if no significant change is present such pairs are referred to as bioisosteres.5 Thus, extracting MMPs has high practical value for different compound optimization scenarios and previous applications include the optimization of biological activity,6–9 ADME properties,10,11 solubility,12,13 logD,9 and hERG inhibition.14
Figure 1: Different ways of representing a Matched Molecular Pair. A) As a pair of chemical structures; B) as static core shared between the structures plus the transformation described by two fragments F1 and F2 that differ between the structures.
The popularity of the MMP concept for compound optimization tasks resulted in a variety of implementations that can be split in supervised and unsupervised approaches. In supervised approaches, chemical transformations that generate MMPs are predefined,12 whereas unsupervised approaches find all possible pairs by using for instance the maximum common substructure algorithm or fragmentation rules,2,15 such as cleaving compounds at all rotatable single bonds16 or via retrosynthetic rules.17 Analyzing the influence of certain substituents on compound properties via statistical methods has been used extensively to
ACS Paragon Plus Environment
3
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 4 of 30
retrieve knowledge about the average as well as maximal potential effect of a particular structural change.7,12 Similar analysis can also be employed on matched molecular series, which consist of a set of two or more compounds with the same core but differing R-groups at the same position.18,19 QSAR-by-MMPA20 was a first attempt of using MMPs together with the nearest neighbor approach to make prospective predictions. To achieve even greater generalizability,14 MMPs can also be combined with more advanced machine learning (ML) methods such as Random Forest21 and SVM.22 ML is going through an explosive growth and there is probably not one industry sector that is not employing this technology to extract maximum information from their data. Random Forest (RF), Gradient Boosting Machines (GBM) and Deep Neural Network (DNN) are very popular ML methods and reported to be of high practical value for drug discovery efforts such as the prediction of compound activity23–25 and toxicological effects.26 Selecting the most suitable ML approach is dependent on the underlying task and data. An extensive evaluation of 17 classifier families on 121 data sets27 as well as a recent study by us on kinase profiling prediction28 both strongly indicate that RF is very well suited for classification problems. GBM has grown in popularity with the XGBoost library29 and show promising results for QSAR calculations on a variety of targets.24 While RF and GBM implementations can easily run on central processing units (CPU) with sufficient performance, DNNs leverage the recent advances in the graphical processing units (GPU) technology and are very powerful at object recognition in images and natural language processing as well as the prediction of compound activities.23,25 Given the strengths of MMPs to extract SAR from compounds and ML approaches to extrapolate from complex data, we have developed a method –herein after called MMP/ML– that couples both concepts to predict the effect of structural changes on compound activity. It is based on an in-house MMP implementation that uses retrosynthetic rules for fragmenting
ACS Paragon Plus Environment
4
Page 5 of 30
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
compounds. We systematically evaluated three popular machine learning methods (i.e. RF, GBM, and DNN) on five kinases and one bromodomain-containing protein. Two compound optimization scenarios were designed: (1) ‘new fragments’ scenario that occurs when exploring new fragments for a defined compound series with one or several scaffolds, and (2) ‘new static core and transformations’ scenario where novel scaffolds as well as either novel fragments or novel combinations of fragments are explored. This thorough validation can be expected to provide realistic expectations about the prediction performance in different compound optimization contexts.
Materials and methods
Data sets. Bioactivity data for each target was extracted from ChEMBL 2230 and only IC50, Kd, and Ki values from binding assays (assay type B) were kept which had a target confidence score of 8 or higher. In the case of compounds with multiple measurements on the same target, the median value was used. All bioactivity values were converted to pIC50 and compound structures were sanitized using RDKit.31 Three different MMP sets per target were prepared for evaluation. ‘Set 1 – all’ contains all MMPs, ‘set 2 – top 10 clusters’ contains MMPs from the 10 biggest clusters of compounds, and ‘set 3 – hinge binder’ is a subset of set 2 and contains only MMPs where the hinge binding part is changed (Table 1). Clustering of the compounds was done using the Butina clustering approach32 using atom pair fingerprints,33,34 and applying a Tanimoto similarity cutoff of 0.3. Following a procedure in ref.34 the initial clusters were extended by compounds with the same ChEMBL assay IDs. The employed data sets can be obtained from https://github.com/Team-SKI/Publications.
ACS Paragon Plus Environment
5
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 6 of 30
Table 1: Summary of test sets employed for evaluation. Set 1 – all
Set 2 – top 10 clusters
Set 3 – hinge binder
Target # compounds
# MMPs
# compounds
# MMPs
# MMPs
AurA
2,821
14,500
358
10,754
710
CDK2
1,464
4,274
190
1,712
356
KDR
6,494
40,422
511
15,876
1,930
p38a
3,859
24,892
309
8,526
530
TRKA
1,278
22,970
387
22,498
162
BRD4
731
3,650
225
2,746
NA
Fragment-based MMP implementation and feature vector encoding. MMPs were extracted by initially fragmenting all compounds via retrosynthetic rules (BRICS35) and storing all combinations of fragments and the respective remaining static core in a database. MMPs were finally identified by querying the database for all static parts that have at least two associated fragments. Subsequently, the static core shared between the two molecules forming an MMP and both fragments (F1 and F2) were individually encoded as hashed Morgan fingerprints36 with a radius of 2 and array length of 2048 bits (Figure 2). Please note that fragments F1 and F2 are the smallest component obtained from the fragmentation scenario and thus, can neither be broken into smaller fragments using the employed fragmenting scheme, nor do they consist of several BRICS fragments; in contrast, the static core is composed of one or more BRICS fragments. Regio-isomers were considered by keeping initially both variants of the static core (differing only by the position of the attachment point) and the respective remaining fragments. The two static cores were then merged to one fingerprint containing the shared bits and encoding the differing bits, which occur due to the transformation, to -1 in the case of disappearing bits and to 2 in the case of
ACS Paragon Plus Environment
6
Page 7 of 30
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
appearing bits (Figure S1 in the Supporting Information). Finally, the three remaining fingerprints of the MMP (i.e., for the static core and two fragments) were concatenated to one fingerprint with 6144 bits in length, starting with the fingerprint for the static core followed by the fingerprints of the fragments (Figure 2). Since the two fragments can be transformed into each other in both directions (i.e. F1 to F2 and F2 to F1) and each transformation has a different sign of the target value (∆pIC50), both fingerprint combinations were individually encoded (i.e. F1 followed by F2, and F2 followed by F1). Both fingerprint combinations were always assigned to the same fold of training and validation set in the subsequent validation scheme (see below). All steps were implemented in Python 2.7 using RDKit 2016_09_4.31
Figure 2. Encoding of MMP as feature vector. Static core, fragments F1 and F2 are each individually encoded as Morgan fingerprint, concatenated, and associated with respective target value. Machine learning and validation. Three different machine learning methods were used: Random Forest (RF), Gradient Boosting Machine (GBM), and Deep Neural Network (DNN). RF models were trained using scikit-learn 0.18.1, setting the number of estimators to 500 and the number of features per tree to the square root of the total number. GBM37 models were generated using XGBoost 0.6,29 setting the number of estimators to 1000, the max depth of trees to 3, and the learning rate (shrinkage) to 0.1. For DNN calculations, feed forward neural networks (multilayer perceptrons) were constructed using keras1.1.1 with TensorFlow 1.0 backend, consisting of three hidden layers of 512 neurons each and an output layer with 1 neuron. All layers had normal initialization and employed the rectified linear unit activation
ACS Paragon Plus Environment
7
Journal of Chemical Information and Modeling
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Page 8 of 30
function, except for the output layer which had sigmoid activation function. Furthermore, a dropout of 0.5 was used to avoid overfitting38 and Adam39 was used to optimize the mean squared error loss function. Evaluation of MMP/ML was performed using group 5-fold cross validation (CV) in two different validation scenarios, where (1) all fragments (and thus, also transformations) in the validation sets were not part of the training set (‘new fragments’), and (2) all static cores and transformations were not part of the training set (‘new static core and transformations’). Thus, both scenarios ensure that no compound is present in the training and validation set at the same time. Group 5-fold CV ensures that samples from the same group (i.e. transformations, fragments, and static cores and transformations) are always assigned to the same fold.
ACS Paragon Plus Environment
8
Page 9 of 30
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
Results and discussions When evaluating a prediction model, it is crucial to have a precise understanding about the underlying data and the potential prediction capabilities. Thus, we will first provide an overview of the data sets used to build and validate the prediction models. This is followed by some explanatory notes about the employed design of the model validation to test thoroughly the prediction capabilities and model transferability in different compound optimization scenarios. Based on these considerations, the validation results are finally presented and discussed. Summary of the data sets. Median compound pIC50 value in all data sets was 6.6, with TrkA having in average the most active (median pIC50 = 7.13) and BRD4 the least active compounds (median pIC50 = 5.90). In line with previous observations,7 the maximal impact of a single group on bioactivity is limited, has a median value of 0.59 but can in extreme cases amount to ∆pIC50 values of about 4 (Figure S2). The value for the median change is higher than observed by Hajduk and Sauer7 but can probably be attributed to the bigger size of fragments employed in the present study (Table 2, Figure S3). Effects of changes that groups introduce are centered around 0 but do not form a normal distribution (p