Geometric Deep Learning Autonomously Learns Chemical Features That Outperform Those Engineered by Domain Experts

Patrick Hop, Brandon Allgood, and Jessen Yu
Numerate Inc, San Francisco, USA
E-mail: [email protected]; [email protected]; [email protected]

Abstract

Artificial intelligence has advanced at an unprecedented pace, backing recent breakthroughs in natural language processing, speech recognition, and computer vision: domains where the data is Euclidean in nature. More recently, considerable progress has been made in engineering deep-learning architectures that can accept non-Euclidean data such as graphs and manifolds: geometric deep learning. This progress is of considerable interest to the drug discovery community, as molecules can naturally be represented as graphs, where atoms are nodes and bonds are edges. In this work, we explore the performance of geometric deep learning methods in the context of drug discovery, comparing machine-learned features against the domain-expert-engineered features that are mainstream in the pharmaceutical industry.
Introduction

Deep Learning

Deep neural networks (DNNs) are not an entirely new concept and have existed for about 20 years,1 only recently entering the spotlight due to an abundance of storage and compute, as well as advances in optimization.
Today, deep learning backs the core technology in many applications, such as self-driving cars,2 speech synthesis,3 and machine translation.4 Perhaps the most important property of DNNs is their ability to automatically learn embeddings (features) tabula rasa from the underlying data, aided by vast amounts of compute and more data than any one human domain expert can understand. Naturally, there is interest in expanding the domain of applicability of these methods to non-Euclidean data such as graphs or manifolds,5 which arise in domains such as 3D models in computer graphics, represented as Riemannian manifolds, or molecular graphs in machine learning. Understanding data of this structure has been elusive for classical architectures because non-Euclidean domains lack a well-defined coordinate system and vector-space structure. Even operations as simple as addition often have no natural construction; for example, the sum of two atoms or two molecules has no meaning. Geometric deep learning aims to solve this by defining primitives that can operate on these unwieldy data structures, primarily by constructing spatial and spectral interpretations of existing architectures,6 such as convolutional neural networks (CNNs). Recasting CNNs into this domain is of particular interest in drug discovery because, like nearby pixels, nearby atoms are highly related and interact with each other, whereas distant atoms usually do not.
Drug Discovery

Development of a novel therapeutic for a human disease is a process that can easily consume a decade of research and development, as well as billions of dollars in capital.7 Long before anything reaches the clinic for validation, a potential disease-modulating biological target is discovered and characterized. Then the search for the right therapeutic compound is kicked off: a process akin to finding the perfect chemical key for a tough-to-crack biological lock, conducted through a vast chemical space containing more molecules than there are atoms in the universe. Even restricting the search to molecules with a molecular weight of ≤ 500 daltons yields a search space of at least $10^{50}$ molecules, virtually all of which have never been synthesized.
To make it to the clinic, drug discovery practitioners need to optimize a wide range of molecular properties, from physical properties such as aqueous solubility to complex biochemical properties such as blood-brain-barrier penetration. This long, laborious search has historically been guided by the intuition of skilled medicinal chemists and biologists, but over the past few decades heuristics and machine learning have played an increasingly important role in guiding the process. The first heuristic to see widespread use was Lipinski's rule of five (RO5), invented at Pfizer in 1997.8 RO5 places limits on the number of hydrogen-bond donors and acceptors, on molecular weight, and on lipophilicity measures, and has been shown to filter out compounds that are likely to exhibit poor ADME properties. In practice, even today, RO5 is often still used to evaluate emerging pre-clinical molecules; a minimal filter in this style is sketched below.
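As a hedged illustration (the function name and the tolerate-one-violation convention are our assumptions, not the authors' code), an RO5-style filter can be written in a few lines with RDKit:

```python
# Minimal sketch of a Lipinski rule-of-five filter using RDKit; not the
# authors' code. Tolerating at most one violation is a common convention
# and is an assumption here.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def passes_ro5(smiles: str) -> bool:
    """Return True if the molecule violates at most one RO5 criterion."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    violations = sum([
        Descriptors.MolWt(mol) > 500,       # molecular weight
        Descriptors.MolLogP(mol) > 5,       # lipophilicity (cLogP)
        Lipinski.NumHDonors(mol) > 5,       # hydrogen-bond donors
        Lipinski.NumHAcceptors(mol) > 10,   # hydrogen-bond acceptors
    ])
    return violations <= 1

print(passes_ro5("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin passes
```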
Over the past two decades, machine learning models have begun to emerge in industry as a more advanced filter, or virtual screen. Researchers in industry have shown that expert-engineered features and support vector machines can be used to predict stability in human liver microsomes effectively,9,10 among other endpoints. Multi-task, fully-connected neural networks on these same inputs have been shown to outperform more traditional models on average,11,12 including XGBoost,13 with performance scaling monotonically with the number of tasks into the thousands.14 Progress in learning from small amounts of data has been achieved using variants of matching networks.15 More recently, 3D convolutional neural networks have shown considerable promise in predicting protein-ligand binding energy16 (drug potency), and ranking models have made considerable progress in drug repurposing.17

Progress has also been made in generating novel molecules in silico, unlocking the possibility of screening molecules that have been designed by machines instead of humans. This may allow exploration of far-reaching regions of chemical space beyond those covered by existing human-engineered screens in industry. Success with these approaches was first demonstrated using Adversarial Autoencoders, which were shown to be able to hallucinate (generate) chemical fingerprints that matched a variety of patented anticancer assets.18 Variational Autoencoders have also been used in this area19 and have been shown to hallucinate molecules with exceptional solubility and low similarity to the training set. Segler et al.20 relied on LSTMs trained on chemical language representations to achieve a similar result for potency endpoints. Recently, further progress has been made using Generative Adversarial Network models, first on 2D representations18 and later on 3D representations.21

As these prediction systems improve, the average quality of molecules selected for synthesis in drug programs improves significantly,22 resulting in programs that reach the clinic faster and with lower capital requirements; this is significant in light of pipeline attrition rates. Of drugs in phase I, excluding portfolio re-balancing, about 40% fail due to toxicity and about 15% fail due to poor pharmacokinetics, both of which could be caught by these prediction systems long before the clinic.23

In this work, the state of the art of drug discovery feature engineering is compared against the state of the art of geometric deep learning in a rigorous manner. We will show that geometric deep learning can autonomously learn representations that outperform those designed by domain experts on four out of five of the datasets tested.
Chemical Embeddings

The first challenge in machine learning is selecting a numerical representation that correctly captures the underlying dynamics of the training data, also known as features or an embedding, terms we will use interchangeably in this work. A fixed-shape representation is typically required simply because the mathematics of learning algorithms requires that their inputs be the same shape. Selecting an embedding that respects the underlying structure of the data cannot be overlooked, because mathematical assumptions that apply to some datasets need not apply to others: reversing an English sentence destroys its meaning, whereas reversing an image generally would not. In natural language processing, a pernicious problem is that sentences need not be the same length, and that locality must be respected because words are highly related to their neighbors.
Bag-of-words embeddings resolve this by mapping sentences into bit vectors that indicate the presence or absence of adjacent words in the document. This convenient, fixed-length bit vector can later be used to train a classifier for any natural language processing task.

Figure 1: Bag of Fragments (left); Bag of Words (right).

In molecular machine learning, engineering a good embedding is a considerable challenge because molecules are unwieldy undirected multigraphs, with atoms as nodes and bonds as edges. A good chemical embedding must be able to model graphs with differing numbers of nodes and edge configurations while preserving locality, because atoms that are close to each other generally exhibit more pronounced interactions than atoms that are distant. More formally, for a molecule represented by an adjacency matrix $A \in \{0,1\}^{n \times n}$ and an atom-feature matrix $X \in \mathbb{R}^{n \times d}$, we want to construct some function $f$ with (optionally) learnable parameters $\theta \in \mathbb{R}^w$ such that

$$f(A, X; \theta) \mapsto x \in \mathbb{R}^d$$

where $x$ is a fixed-shape representation that captures the essence of the underlying graph. This vector is then passed to a learning algorithm of the scientist's choice, such as random forests or a fully-connected neural network.
Naive Embeddings

A standard chemistry embedding is the extended-connectivity fingerprint (ECFP4).24 These fingerprints generate features for every radius $r \le 4$, where the radius is the maximum distance explored on the graph from the starting vertex. For a specific $r$ and a specific vertex in the graph, ECFP4 takes the neighborhood features from the previous radius, concatenates them, and applies a hash function whose range corresponds to indices in a hash table. After iterating over all vertices and radius values, this bag-of-fragments approach to graph embedding yields a task-agnostic representation that can easily be passed on to a learning algorithm.
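As a rough sketch (not the authors' pipeline), such fingerprints can be generated with RDKit; note that RDKit's radius=2 corresponds to ECFP4 under the diameter naming convention, and the example molecule is arbitrary:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("c1ccccc1O")  # phenol, an arbitrary example
# 384 bits matches the fingerprint width used for the models in this work.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=384)
x = np.array(fp)   # fixed-length bit vector, ready for any learner
print(x.shape)     # (384,)
```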
Expert Embeddings

Cheminformatics experts in drug discovery have, over decades, engineered numerous domain-specific, physiologically relevant features, also known as descriptors. For example, polar surface area (PSA) is calculated as the sum of the surface-area contributions of the polar atoms in a molecule, and is well known in industry to correlate negatively with membrane permeability. 101 of these publicly available, expert-engineered features (Table 3) are readily available in the open-source RDKit package.
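For illustration, a few of these descriptors can be computed directly with RDKit (a hedged sketch; the full 101-descriptor set used here is enumerated in Table 3):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")  # paracetamol, as an example
features = {
    "TPSA": Descriptors.TPSA(mol),                  # topological polar surface area
    "MolWt": Descriptors.MolWt(mol),                # molecular weight
    "FractionCSP3": Descriptors.FractionCSP3(mol),  # sp3 carbon fraction
    "NumHAcceptors": Descriptors.NumHAcceptors(mol),
}
# The full catalogue of RDKit descriptors is enumerable programmatically:
all_features = {name: fn(mol) for name, fn in Descriptors.descList}
print(features)
```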
Learnable Embeddings

One criticism of the naive embeddings is that they are not optimized for the task at hand; the ideal features for predicting drug solubility are likely to differ considerably from those for predicting photovoltaic efficiency. The solution is to let the model engineer its own problem-specific, optimized embedding, in essence combining the learner with the embedding. This is achieved by allowing gradients to flow back from the learner into the embedding function, so that the embedding is optimized in tandem with the learner. Neural Fingerprints25 demonstrated that ECFP could be considerably improved in this manner by introducing learnable weights, while Weave26 demonstrated further improvements by mixing bond and atom features. Later, it was shown that both of these graph embedding methods are special cases of message-passing algorithms.27
Methods

Learning Algorithms

Random Forests

Random forests are a common ensemble learning algorithm in industry due to their training speed, high performance, and ease of use. In this work, random forest models (sklearn's RandomForestRegressor) are trained on the concatenation of 101 RDKit descriptors and 384-bit ECFP4 fingerprints, using 30 trees and a maximum tree depth of 4. This hyperparameter configuration is the result of tuning on the validation set by hand, with the aim of maximizing absolute performance while minimizing the spread of performance. A maximum tree depth of 10 was used on the lipophilicity dataset, due to its size.
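A minimal sketch of this setup, assuming a pre-built feature matrix (the random data below is a placeholder for the descriptor/fingerprint pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# X would be the concatenation of 101 RDKit descriptors and the 384-bit
# ECFP4 fingerprints (485 columns); random data stands in here.
X = np.random.rand(200, 485)
y = np.random.rand(200)

model = RandomForestRegressor(n_estimators=30, max_depth=4, random_state=0)
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data
```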
FC-DNN

Fully-connected neural networks operate on a fixed-shape input by passing information through multiple non-linear transformations, i.e., layers. FC-DNN models were implemented in PyTorch28 and trained on the same inputs as the random forest models, with an added normalization preprocessing stage. After extensive hyperparameter tuning on the validation set, a neural network with two hidden layers of sizes 48 and 32 was found to perform well. ReLU activations and batch normalization were used on both hidden layers. Optimization was performed using the Adam optimizer.29 A static learning rate of 5e-4 and L2 weight decay of 8e-3 were used. All FC-DNN models were trained through epoch 11, after which they would begin overfitting.
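A hedged PyTorch sketch of this architecture (only the layer sizes, optimizer, and hyperparameters come from the text; the layer ordering and training loop are assumptions):

```python
import torch
import torch.nn as nn

# Two hidden layers of sizes 48 and 32, with batch-norm and ReLU on each.
model = nn.Sequential(
    nn.Linear(485, 48), nn.BatchNorm1d(48), nn.ReLU(),
    nn.Linear(48, 32), nn.BatchNorm1d(32), nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=8e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 485)   # placeholder batch of normalized features
y = torch.randn(64, 1)     # placeholder regression targets
for epoch in range(11):    # trained through epoch 11 per the text
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```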
GC-DNN

Graph convolutional networks are a geometric deep-learning method distinct from the previous methods in that they are trained exclusively on the molecular graph, an unwieldy input that can vary in the number of vertices as well as in connectivity. This graph is initialized using a variety of atom features, ranging from atomic number to covalent radius.
The DeepChem TensorFlow30 implementation of the graph-convolution, graph-pooling, and graph-gather primitives was used to construct single-task networks. This implementation is unique in that it reserves a parameter matrix for each node degree, unlike other approaches.6 For these experiments, a three-layer network was used, with ReLU activations, batch normalization, and a static learning rate of 1e-3 with no weight decay. Once again, optimization was performed using the Adam optimizer. A formal mathematical construction of the graph-convolutional primitives is presented in the appendix.
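A sketch of how such a single-task model can be assembled with DeepChem (the API names assume a recent DeepChem release and may differ from the version used in this work):

```python
import numpy as np
import deepchem as dc

featurizer = dc.feat.ConvMolFeaturizer()   # molecular graph + atom features
smiles = ["CCO", "c1ccccc1", "CC(=O)O"]    # placeholder molecules
targets = np.array([[0.1], [0.5], [0.3]])  # placeholder regression targets

X = featurizer.featurize(smiles)
dataset = dc.data.NumpyDataset(X=X, y=targets)

model = dc.models.GraphConvModel(n_tasks=1, mode="regression",
                                 learning_rate=1e-3)
model.fit(dataset, nb_epoch=10)
```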
Data Preparation

Setting up valid machine learning experiments in molecular machine learning is considerably more challenging than in other domains. Datasets are autocorrelated because they are not collected by sampling from chemical space uniformly at random. Rather, datasets comprise many chemical series of interest, with each series consisting of molecules that differ by only subtle topology changes. This underlying structure can be visualized using t-SNE,31 a non-linear embedding algorithm that excels at accurately visualizing high-dimensional data such as molecules. In essence, t-SNE aims to produce a 2D embedding such that points that are close together in high dimensions remain close together in the 2D embedding; likewise, it aims to keep points that are far apart in high dimensions far apart in the 2D embedding. The resulting t-SNE scatterplot for the lipophilicity dataset (Figure 2) reveals this clear clustering.
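A hedged sketch of producing such an embedding with scikit-learn (the fingerprint matrix and the t-SNE settings below are placeholders, not the authors' values):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(500, 384)   # placeholder fingerprint matrix
coords = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(coords.shape)            # (500, 2), ready for a scatterplot
```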
It follows from this structure that randomly splitting datasets of this style produces significant redundancies between the training and validation sets. It can be shown that benchmarks of this style significantly reward solutions that overfit, rather than solutions that generalize to molecules significantly different from the training set.32 To control for this, we split each dataset into Murcko clusters,33 placing the largest clusters in the training set and the smallest ones in the validation set, targeting 80% of the data in the training set, 10% in the validation set, and 10% in the test set. This method results in the majority of the chemical diversity being held outside of the training set, not unlike the data the system will encounter when deployed. Both split and unsplit datasets have been open-sourced in a repository under the Numerate GitHub organization.
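A sketch of the clustering step with RDKit (the greedy assignment of the largest clusters to the training set is paraphrased here, not the authors' code):

```python
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def murcko_clusters(smiles_list):
    """Group molecules by their Murcko scaffold SMILES."""
    clusters = defaultdict(list)
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol)
        clusters[scaffold].append(smi)
    return clusters

clusters = murcko_clusters(["CCOc1ccccc1", "CCNc1ccccc1", "CCCC"])
# Largest clusters go to the training set, smallest to validation/test.
ordered = sorted(clusters.values(), key=len, reverse=True)
```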
Figure 2: 2D embedding of the 4200-molecule lipophilicity dataset using t-SNE. Notice the heavy clustering that is characteristic of a drug discovery dataset.
Capturing Uncertainty

Small datasets, along with algorithms that rely on randomness during training, introduce considerable noise into performance results, making it difficult to tease apart genuine advancements from luck.28 Moreover, the performance of molecular machine learning systems is highly dependent on the choice of training set, making it difficult to assess how a system would perform on significantly novel chemical matter. Since there is no closed-form solution for uncertainty estimates of the metric we are interested in, $R^2$, bootstrapping with replacement of the training set is used to capture uncertainty. Models are trained on 25 bootstrap re-samples, and 25 $R^2$ values are recorded.
The result is not a single score but a distribution of scores, described by a sample mean and sample variance. Variations in mean performance among learning algorithms can then be tested for statistical significance using the Welch t-test, an adaptation of the t-test that is more reliable for two samples with unequal variances.
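A sketch of this bootstrap procedure (the random forest here is a stand-in; any of the three learners slots in):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

def bootstrap_r2(X_tr, y_tr, X_te, y_te, n_boot=25, seed=0):
    """Train on bootstrap re-samples of the training set; collect R^2."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X_tr), size=len(X_tr))  # with replacement
        model = RandomForestRegressor(n_estimators=30, max_depth=4)
        model.fit(X_tr[idx], y_tr[idx])
        scores.append(r2_score(y_te, model.predict(X_te)))
    return np.array(scores)  # a distribution of scores, not a single number
```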
Experiments

Regression models are tested against a variety of physicochemical and ADME endpoints of interest to the pharmaceutical industry. We restrict our choice of datasets to those released by AstraZeneca into ChEMBL, a publicly available database,34 with the expectation that they were subject to strict internal quality-control standards, contain considerable chemical diversity, and are representative of datasets held internally in industry.
Datasets

pKa-A1 is the acid-base dissociation constant for the most acidic proton, an important factor in understanding the ionizability of a potential drug, with a strong influence over multiple properties of interest, including permeability, partitioning, and binding.35 This is the smallest dataset of the five, with only 204 examples.
Human Intrinsic Clearance is the rate at which the human body removes circulating, unbound drug from the blood. It is one of the key in-vitro parameters used to predict drug residency time in the patient.36 In drug discovery, this property is assessed by measuring the metabolic stability of drugs in either human liver microsomes or hepatocytes. This dataset includes 1102 examples of intrinsic clearance measured in human liver microsomes (µL/min/mg protein) following incubation at 37 °C.
Human Plasma Protein Binding assays measure the proportion of drug that is bound reversibly to proteins such as albumin and alpha-acid glycoprotein in the plasma. Knowing the amount that is unbound is critical because only that amount can diffuse into tissue or be cleared by the liver.36
1640 compounds are measured, and regression targets are transformed using log(1 − bound), a more representative measure for scientists.
Thermodynamic Solubility measures the solubility of a solid starting material in pH 7.4 buffer. Solubility influences a wide range of properties for drugs, especially those administered orally. This dataset contains 1763 examples.
Lipophilicity is a compound's affinity for a lipophilic solvent versus a polar solvent. More formally, we use logD (pH 7.4), captured experimentally as the octanol/water distribution coefficient measured by the shake-flask method. This is an important measure for potential drugs, as lipophilicity is a key contributor to membrane permeability.36 Conversely, highly lipophilic compounds are usually encumbered by low solubility, high clearance, and high plasma protein binding. Indeed, most drug discovery projects have a target range for lipophilicity.36 This dataset is the largest of the five, at 4200 compounds.
Results

Graph convolutional neural networks lead the three learning algorithms on four out of five datasets, the exception being human plasma protein binding. All five differences between GC-DNNs and the industry-standard RFs were found to be statistically significant using a Welch t-test A/B comparison (Table 2). Fully-connected neural networks generally underperformed their counterparts, despite requiring considerably more hyperparameter tuning.
Discussion

In part due to their autonomously learned features, graph convolutional neural networks outperformed methods trained on expert-engineered features on four out of five datasets, the exception being plasma protein binding. This is a surprising result given that GC-DNNs are blind to the domain of drug discovery and could trivially be re-purposed to solve orthogonal problems such as detecting fraud in banking transaction networks.
Figure 3: Bootstrapped performance histograms and kernel density estimates for Random Forests, Graph Convolutional Neural Networks, and Fully Connected Neural Networks over five datasets: (a) pKa, (b) Clearance, (c) PPB, (d) Thermosol, and (e) Lipophilicity test-set R².
Table 1: Test Set Performance (R²)

| Dataset       | Model  | Mean | Std  | Range         |
|---------------|--------|------|------|---------------|
| pKa-A1        | RF     | .319 | .179 | [-.260, .673] |
| pKa-A1        | FC-DNN | .191 | .072 | [.091, .377]  |
| pKa-A1        | GC-DNN | .437 | .105 | [.204, .689]  |
| Clearance     | RF     | .155 | .047 | [.054, .253]  |
| Clearance     | FC-DNN | .136 | .025 | [.088, .192]  |
| Clearance     | GC-DNN | .217 | .048 | [.117, .333]  |
| HPPB          | RF     | .287 | .029 | [.215, .342]  |
| HPPB          | FC-DNN | .203 | .024 | [.158, .265]  |
| HPPB          | GC-DNN | .208 | .039 | [.126, .309]  |
| ThermoSol     | RF     | .187 | .021 | [.137, .224]  |
| ThermoSol     | FC-DNN | .256 | .039 | [.224, .377]  |
| ThermoSol     | GC-DNN | .294 | .043 | [.215, .377]  |
| Lipophilicity | RF     | .424 | .022 | [.371, .473]  |
| Lipophilicity | FC-DNN | .345 | .025 | [.302, .402]  |
| Lipophilicity | GC-DNN | .484 | .023 | [.436, .515]  |
Table 2: A/B test for Random Forests vs. Graph Convolutions using the Welch t-test

| Dataset   | p-value |
|-----------|---------|
| pKa-A1    | 7.2e-3  |
| Clearance | 3.2e-5  |
| HPPB      | 3.7e-10 |
| Thermosol | 4.6e-13 |
| Lipo      | 1.6e-12 |

Geometric deep learning approaches like this unlock the possibility of learning from non-Euclidean graphs (molecules) and manifolds, providing the pharmaceutical industry with the ability to learn from and exploit knowledge of its historical successes and failures, resulting in significantly improved quality of research candidates and accelerated timelines.

However, for these applications to take off in industry, there needs to be significant certainty that the system will remain performant on novel chemical matter. As part of this work, our analysis of uncertainty has revealed concerns in the methodology of learning-algorithm comparisons in this field.
pKa-A1 in particular exhibits so much uncertainty that individual trials have little to no meaning. While it is clear from the p-values that GC-DNNs do indeed outperform, the width of the uncertainty intervals makes it entirely unclear whether the resulting predictor will turn out to be useful. Even the random forests trained on the 1102-example clearance dataset exhibit significant variability in performance, ranging from almost zero correlation to a correlation high enough to be useful, and everything in between. This is alarming considering that 1102 examples is considered a large dataset in this field, one that could easily have cost in excess of half a million dollars to generate.

Beyond this, there is still a significant amount of progress to be made. The publicly available approaches tested in this work still significantly lag the accuracy of the underlying assays they are trying to model. Thermodynamic solubility, in particular, has an assay limit upwards of 0.8 $R^2$, while all the presented models are under 0.3 $R^2$, a gap that more data is unlikely to close. What is missing? Our internal research shows that in most cases the answer is 3D representations. Medicinal molecules interact with the human body in three dimensions, while in solution, and these structures are not static: they can take the form of a wide range of conformers. Building machine learning systems that are more aware of the true underlying physics can yield significantly more performant models, which will be the focus of our upcoming follow-up paper.
Acknowledgments

Patrick thanks Bharath Ramsundar at Stanford and Jeff Blaney at Genentech for a long history of helpful discussions. As a group, we would like to thank the DeepChem open-source community for their end-to-end implementation of graph convolutional neural networks in TensorFlow.
Appendix

Graph Convolutional Neural Network

Let $G(V, E)$ be a molecular graph with vertices $V \subseteq \mathbb{R}^d$ (atoms) and edges (bonds) $E \subseteq \{0, 1\}$. Before constructing the adjacency matrices, we define a constant $a$, to which we pad the dimensionality of all of the adjacency matrices, and a constant $m$, the maximum node degree we will consider in the molecular graph. We can then construct $m$ pairs of adjacency matrices $A_i \in \{0,1\}^{a \times a}$ and feature matrices $X_i \in \{0,1\}^{a \times d}$ for $i \in \{0, \ldots, m\}$, where each row in $X_i$ is a vertex $v \in V$ of degree $i$. Finally, we define a parameter matrix $W_i \in \mathbb{R}^{d \times d}$. We can then construct a graph convolution over nodes of degree $i$ as

$$f_i(A_i, X_i, W_i) = \sigma(A_i X_i W_i)$$

The concatenation of all $f_i$ simply results in a graph convolution $f$ over all nodes with degree $\le m$. If we let $\sigma$ be the ReLU nonlinearity, we can then define a graph convolutional layer as

$$Z = \sigma(f(A, X, W))$$

This embedding function can be shown to be sub-differentiable everywhere, allowing gradient signals to be passed down from the layers above using backpropagation, not unlike traditional convolutional or sequence-embedding layers.
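A minimal numerical sketch of one such convolution, $\sigma(A_i X_i W_i)$, for a single node degree (the shapes below are illustrative only):

```python
import numpy as np

a, d = 8, 4                            # padded graph size, feature width
A = np.random.randint(0, 2, (a, a))    # adjacency matrix (0/1)
X = np.random.rand(a, d)               # atom-feature matrix
W = np.random.rand(d, d)               # learnable parameters for this degree

relu = lambda z: np.maximum(z, 0)      # sigma = ReLU
Z = relu(A @ X @ W)                    # one graph-convolution step
print(Z.shape)                         # (8, 4)
```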
Welch t-test

Let $\bar{X}_1$ and $\bar{X}_0$ be the two sample means, let $s_1^2$ and $s_0^2$ be the respective sample variances, let $v_1$ and $v_0$ be the degrees of freedom of the respective sample variances, and let $N_1$ and $N_0$ be the respective sample sizes. We can now construct the Welch t-test statistic $t$ and degrees of freedom $v$ as follows:

$$t = \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_0^2}{N_0}}}$$

$$v \approx \frac{\left(\dfrac{s_1^2}{N_1} + \dfrac{s_0^2}{N_0}\right)^{2}}{\dfrac{s_1^4}{N_1^2 v_1} + \dfrac{s_0^4}{N_0^2 v_0}}$$
These values can then be passed through the Student's t cumulative distribution function (CDF) to obtain the final p-value.
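In practice the same statistic is available off the shelf; a sketch using SciPy on two hypothetical bootstrapped score samples:

```python
import numpy as np
from scipy import stats

r2_rf = np.random.normal(0.42, 0.02, size=25)  # placeholder RF scores
r2_gc = np.random.normal(0.48, 0.02, size=25)  # placeholder GC-DNN scores
# equal_var=False selects the Welch variant of the t-test.
t, p = stats.ttest_ind(r2_gc, r2_rf, equal_var=False)
print(t, p)
```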
References

(1) LeCun, Y.; Bengio, Y. Convolutional Networks for Images, Speech, and Time Series. 1995.
(2) Chi, L.; Mu, Y. Deep Steering: Learning End-to-End Driving Model from Spatial and Temporal Visual Cues. arXiv 2017.
(3) Arık, S.; Diamos, G.; Gibiansky, A. Multi-Speaker Neural Text-to-Speech. arXiv 2017.
(4) Artetxe, M.; Labaka, G.; Agirre, E. Unsupervised Neural Machine Translation. arXiv 2017.
(5) Bronstein, M.; Bruna, J. Geometric Deep Learning: Going Beyond Euclidean Data. arXiv 2017.
(6) Kipf, T.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016.
(7) Hop, C.; Cole, M.; Duigan, D. High Throughput ADME Screening: Practical Considerations, Impact on the Portfolio and Enabler of In Silico ADME Models. Current Drug Metabolism 2008, 847–853.
(8) Lipinski, C.; Lombardo, F.; Dominy, B. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Advanced Drug Delivery Reviews 2001, 3–26.
(9) Ortwine, D.; Aliagas, I. Physicochemical and DMPK In Silico Models: Facilitating Their Use by Medicinal Chemists. Molecular Pharmaceutics 2013, 1153–1161.
(10) Aliagas, I.; Gobbi, A.; Heffron, T. A Probabilistic Method to Report Predictions from a Human Liver Microsomes Stability QSAR Model: A Practical Tool for Drug Discovery. Journal of Computer-Aided Molecular Design 2015, 327–338.
(11) Ma, J.; Sheridan, R.; Liaw, A. Deep Neural Nets as a Method for Quantitative Structure-Activity Relationships. Journal of Chemical Information and Modeling 2015, 263–274.
(12) Kearnes, S.; Goldman, B.; Pande, V. Modeling Industrial ADMET Data with Multitask Networks. arXiv 2017.
(13) Sheridan, R.; Wang, W.; Liaw, A. Extreme Gradient Boosting as a Method for Quantitative Structure-Activity Relationships. Journal of Chemical Information and Modeling 2016, 2353–2360.
(14) Ramsundar, B.; Kearnes, S.; Riley, P. Massively Multitask Networks for Drug Discovery. arXiv 2016.
(15) Altae-Tran, H.; Ramsundar, B.; Pappu, A. Low Data Drug Discovery with One-Shot Learning. arXiv 2016.
(16) Wallach, I.; Dzamba, M.; Heifets, A. AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-Based Drug Discovery. arXiv 2015.
(17) Aliper, A.; Plis, S. Deep Learning Applications for Predicting Pharmacological Properties of Drugs and Drug Repurposing Using Transcriptomic Data. Molecular Pharmaceutics 2016.
(18) Kadurin, A.; Nikolenko, S.; Khrabrov, K.; Zhavoronkov, A. druGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico. Molecular Pharmaceutics 2017.
(19) Gomez-Bombarelli, R.; Wei, J.; Duvenaud, D. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science 2018.
(20) Segler, M.; Kogej, T.; Tyrchan, C. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Central Science 2017.
(21) Kuzminykh, D.; Polykovskiy, D.; Kadurin, A.; Zhavoronkov, A. 3D Molecular Representations Based on the Wave Transform for Convolutional Neural Networks. Molecular Pharmaceutics 2018.
(22) Lombardo, F.; Desai, P.; Arimoto, R. In Silico Absorption, Distribution, Metabolism, Excretion, and Pharmacokinetics (ADME-PK): Utility and Best Practices. An Industry Perspective from the International Consortium for Innovation through Quality in Pharmaceutical Development. Journal of Medicinal Chemistry 2017, 9097–9113.
(23) Waring, M.; Arrowsmith, J.; Leach, A. An Analysis of the Attrition of Drug Candidates from Four Major Pharmaceutical Companies. Nature Reviews Drug Discovery 2015, 475–496.
(24) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling 2010, 742–754.
(25) Duvenaud, D.; Maclaurin, D.; Iparraguirre, J. Convolutional Networks on Graphs for Learning Molecular Fingerprints. arXiv 2015.
(26) Kearnes, S.; McCloskey, K.; Berndl, M. Molecular Graph Convolutions: Moving Beyond Fingerprints. arXiv 2016.
(27) Gilmer, J.; Schoenholz, S.; Riley, P. Neural Message Passing for Quantum Chemistry. arXiv 2017.
(28) Paszke, A.; Gross, S.; Chintala, S. Automatic Differentiation in PyTorch. NIPS 2017.
(29) Kingma, D.; Lei Ba, J. Adam: A Method for Stochastic Optimization. ICLR 2015.
(30) Abadi, M.; Agarwal, A.; Barham, P. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2015.
(31) Van der Maaten, L. Barnes-Hut-SNE. arXiv 2013.
(32) Wallach, I.; Heifets, A. Most Ligand-Based Benchmarks Measure Overfitting Rather than Accuracy. arXiv 2017.
(33) Bemis, G.; Murcko, M. The Properties of Known Drugs. 1. Molecular Frameworks. Journal of Medicinal Chemistry 1996.
(34) Bento, A.; Gaulton, A.; Overington, J. The ChEMBL Bioactivity Database: An Update. Nucleic Acids Research 2014, 1083–1090.
(35) Manallack, D. The pKa Distribution of Drugs: Application to Drug Discovery.
(36) Khojasteh, K.; Wong, H.; Hop, C. Drug Metabolism and Pharmacokinetics Quick Guide; 2011.
Table 3: Expert Engineered Features (per-feature mean and variance)

| Feature             | Mean   | Variance | Feature      | Mean  | Variance |
|---------------------|--------|----------|--------------|-------|----------|
| MaxAbsPartialCharge | 0.43   | 0.07     | PEOE-VSA10   | 9.64  | 7.99     |
| MinPartialCharge    | -0.42  | 0.07     | PEOE-VSA11   | 4.26  | 6.43     |
| MinAbsPartialCharge | 0.0    | 0.0      | PEOE-VSA12   | 3.61  | 5.10     |
| HeavyAtomMolWt      | 0.26   | 0.08     | PEOE-VSA13   | 3.19  | 4.38     |
| MaxAbsEStateIndex   | 0.16   | 0.19     | PEOE-VSA14   | 2.38  | 4.05     |
| NumRadicalElectrons | 0.0    | 0.0      | PEOE-VSA2    | 7.14  | 6.03     |
| NumValenceElectrons | 141.25 | 40.29    | PEOE-VSA3    | 6.97  | 6.48     |
| MinAbsEStateIndex   | 0.16   | 0.19     | PEOE-VSA4    | 2.89  | 5.13     |
| MaxEStateIndex      | 11.65  | 2.53     | PEOE-VSA5    | 1.75  | 4.23     |
| MaxPartialCharge    | 0.27   | 0.08     | PEOE-VSA6    | 24.64 | 18.86    |
| MinEStateIndex      | -1.11  | 1.59     | PEOE-VSA7    | 40.05 | 19.59    |
| ExactMolWt          | 382.69 | 106.85   | PEOE-VSA8    | 24.58 | 15.41    |
| BalabanJ            | 1.80   | 0.43     | PEOE-VSA9    | 14.78 | 10.27    |
| BertzCT             | 944.81 | 330.45   | SMR-VSA1     | 13.01 | 8.90     |
| Chi0                | 19.20  | 5.32     | SMR-VSA10    | 23.76 | 12.57    |
| Chi0n               | 15.16  | 4.36     | SMR-VSA2     | 0.45  | 1.51     |
| Chi0v               | 15.47  | 4.46     | SMR-VSA3     | 12.30 | 7.77     |
| Chi1                | 13.01  | 3.58     | SMR-VSA4     | 2.74  | 5.10     |
| Chi1n               | 8.87   | 2.66     | SMR-VSA5     | 24.53 | 19.55    |
| Chi1v               | 9.47   | 2.85     | SMR-VSA6     | 19.45 | 16.61    |
| Chi2n               | 6.69   | 2.18     | SMR-VSA7     | 55.98 | 21.91    |
| Chi2v               | 7.38   | 2.45     | SMR-VSA8     | 0.0   | 0.0      |
| Chi3n               | 4.71   | 1.68     | SMR-VSA9     | 7.67  | 7.71     |
| Chi3v               | 5.26   | 1.88     | SlogP-VSA1   | 9.88  | 6.52     |
| Chi4n               | 3.27   | 1.30     | SlogP-VSA10  | 7.40  | 7.78     |
| Chi4v               | 3.71   | 1.44     | SlogP-VSA11  | 3.62  | 5.40     |
| HallKierAlpha       | -2.67  | 0.89     | SlogP-VSA12  | 6.36  | 8.89     |
| Ipc                 | 0.0    | 0.0      | SlogP-VSA2   | 37.98 | 20.79    |
| Kappa1              | 18.54  | 5.75     | SlogP-VSA3   | 9.19  | 8.33     |
| Kappa2              | 7.84   | 2.77     | SlogP-VSA4   | 5.73  | 7.08     |
| Kappa3              | 4.13   | 1.79     | SlogP-VSA5   | 24.76 | 17.81    |
| LabuteASA           | 159.92 | 43.61    | SlogP-VSA6   | 44.86 | 18.86    |
| PEOE-VSA1           | 13.85  | 7.86     | SlogP-VSA7   | 1.52  | 3.03     |
| VSA-EState1         | 0.0    | 0.0      | SlogP-VSA8   | 8.62  | 9.03     |
| VSA-EState10        | 1.55   | 3.95     | SlogP-VSA9   | 0.0   | 0.0      |
| VSA-EState2         | 0.0    | 0.0      | TPSA         | 78.84 | 32.07    |
| VSA-EState3         | 0.0    | 0.0      | EState-VSA1  | 8.73  | 12.23    |
| VSA-EState4         | 0.0    | 0.0      | EState-VSA10 | 10.28 | 8.04     |
| VSA-EState5         | 0.0    | 0.0      | EState-VSA11 | 0.02  | 0.35     |
| VSA-EState6         | 0.0    | 0.0      | EState-VSA2  | 13.47 | 10.48    |
| VSA-EState7         | 0.0    | 0.0      | EState-VSA3  | 20.41 | 14.42    |
| VSA-EState8         | 10.71  | 16.66    | EState-VSA4  | 25.33 | 18.15    |
| VSA-EState9         | 52.19  | 17.21    | EState-VSA5  | 12.42 | 12.41    |
| FractionCSP3        | 0.30   | 0.18     | EState-VSA6  | 16.85 | 14.04    |
| HeavyAtomCount      | 27.04  | 7.46     | EState-VSA7  | 23.08 | 18.96    |
| NOCount             | 6.03   | 2.37     | EState-VSA8  | 19.82 | 15.80    |
| NumHAcceptors       | 5.14   | 2.16     | EState-VSA9  | 9.51  | 8.59     |