This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.

Article Cite This: Mol. Pharmaceutics XXXX, XXX, XXX−XXX

Geometric Deep Learning Autonomously Learns Chemical Features That Outperform Those Engineered by Domain Experts

Patrick Hop,* Brandon Allgood,* and Jessen Yu*


Numerate Inc., San Francisco, California 94107, United States

ABSTRACT: Artificial intelligence has advanced at an unprecedented pace, backing recent breakthroughs in natural language processing, speech recognition, and computer vision: domains where the data is Euclidean in nature. More recently, considerable progress has been made in engineering deep-learning architectures that can accept non-Euclidean data such as graphs and manifolds: geometric deep learning. This progress is of considerable interest to the drug discovery community, as molecules can naturally be represented as graphs, where atoms are nodes and bonds are edges. In this work, we explore the performance of geometric deep-learning methods in the context of drug discovery, comparing machine-learned features against the domain-expert engineered features that are mainstream in the pharmaceutical industry.

KEYWORDS: artificial intelligence, geometric deep learning, drug discovery, pharmaceutics



INTRODUCTION

Deep Learning. Deep neural networks (DNNs) are not an entirely new concept, as they have existed for ∼20 years,1 only recently entering the spotlight due to an abundance of storage and compute as well as advances in optimization. Today, deep learning backs the core technology in many applications, such as self-driving cars,2 speech synthesis,3 and machine translation.4 Perhaps the most important property of DNNs is their ability to automatically learn embeddings (features) tabula rasa from the underlying data, aided by vast amounts of compute and more data than any one human domain expert can absorb. Naturally, there is interest in expanding the domain of applicability of these methods to non-Euclidean data such as graphs or manifolds,5 which arise in domains such as 3D models in computer graphics, represented as Riemannian manifolds, or graphs in molecular machine learning. Understanding data of this structure has been elusive for classical architectures because non-Euclidean domains lack a well-defined coordinate system and vector-space structure. Even operations as simple as addition often have no natural construction; for example, the sum of two atoms or two molecules has no meaning. Geometric deep learning aims to solve this by defining primitives that can operate on these unwieldy data structures, primarily by constructing spatial and spectral interpretations of existing architectures6 such as convolutional neural networks (CNNs). Recasting CNNs into this domain is of particular interest in drug discovery because, like nearby pixels, nearby atoms are highly related and interact with each other, whereas distant atoms usually do not.

Drug Discovery. Development of a novel therapeutic for a human disease is a process that can easily consume a decade of research and development, as well as billions of dollars in capital.7 Long before anything reaches the clinic for validation, a

potential disease-modulating biological target is discovered and characterized. Then the search for the right therapeutic compound is kicked off, a process akin to finding the perfect chemical key for a tough-to-crack biological lock, conducted through a vast chemical space containing more molecules than there are atoms in the universe. Even restricting the search to molecules with a molecular weight of ≤500 Da yields a search space of at least 10⁵⁰ molecules, virtually all of which have never been synthesized. To make it to the clinic, drug discovery practitioners need to optimize a wide range of molecular properties, from physical properties, such as aqueous solubility, to complex biochemical properties, such as blood−brain barrier penetration. This long, laborious search has historically been guided by the intuition of skilled medicinal chemists and biologists, but over the past few decades, heuristics and machine learning have played an increasingly important role in guiding the process. The first widely used heuristic is Lipinski's rule of five (RO5), invented at Pfizer in 1997.8 RO5 places limits on the number of hydrogen bond donors and acceptors, molecular weight, and lipophilicity measures and has been shown to filter out compounds that are likely to exhibit poor ADME properties. In practice, RO5 is often still used today to evaluate emerging preclinical molecules.

Special Issue: Deep Learning for Drug Discovery and Biomarker Development
Received: December 18, 2017. Revised: May 31, 2018. Accepted: June 4, 2018. Published: June 4, 2018.
DOI: 10.1021/acs.molpharmaceut.7b01144 Mol. Pharmaceutics XXXX, XXX, XXX−XXX

Over the past two decades, machine learning models have begun to emerge in the industry as a more advanced filter or


virtual screen. Researchers in industry have shown that expert-engineered features and support vector machines can be used to predict stability in human liver microsomes9,10 effectively, among other end points. Multitask, fully connected neural networks on these same inputs have been shown to outperform more traditional models on average,11,12 including XGBoost,13 with performance scaling monotonically with the number of tasks into the thousands.14 Progress in learning from small amounts of data has been achieved using variants of matching networks.15 More recently, the use of 3D convolutional neural networks has shown considerable promise in predicting protein−ligand binding energy16 (drug potency), and ranking models have made considerable progress in drug repurposing.17

More recently, progress has been made in generating novel molecules in silico, unlocking the possibility of screening molecules that have been designed by machines instead of humans. This may allow exploration of far-reaching regions of chemical space beyond those covered by existing human-engineered screens in industry. Success with these approaches was first demonstrated using adversarial autoencoders, which were shown to be able to hallucinate (generate) chemical fingerprints that matched a variety of patented anticancer assets.18 Variational autoencoders have also been used in this area19 and have been shown to be able to hallucinate molecules that have exceptional solubility and low similarity to the training set. Segler et al.20 relied on LSTMs trained on chemical language representations to achieve a similar result for potency end points.
Recently, more progress has been made using generative adversarial network models, first on 2D representations18 and later on 3D representations.21 As these prediction systems improve, the average quality of molecules selected for synthesis in drug programs improves significantly,22 resulting in programs that reach the clinic faster and with lower capital requirements, which is significant in light of pipeline attrition rates. For drugs in phase I, excluding portfolio rebalancing, ∼40% fail due to toxicity and ∼15% fail due to poor pharmacokinetics, both of which can potentially be caught by these prediction systems long before the clinic.23

In this work, the state of the art of drug discovery feature engineering is compared against the state of the art of geometric deep learning in a rigorous manner. We will show that geometric deep learning can autonomously learn representations that outperform those designed by domain experts on four of the five data sets tested.
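The RO5 heuristic discussed above amounts to counting threshold violations. A minimal, hypothetical sketch in Python follows; the property names and the common "at most one violation" allowance are illustrative choices, and real descriptor values would come from a cheminformatics toolkit such as RDKit rather than a hand-built dict:

```python
def passes_ro5(mol_props):
    """Return True if a molecule violates at most one RO5 criterion.

    mol_props is a plain dict of precomputed descriptors (illustrative;
    in practice these would be calculated from the molecular structure).
    """
    violations = 0
    violations += mol_props["mol_weight"] > 500       # molecular weight (Da)
    violations += mol_props["logp"] > 5               # lipophilicity (logP)
    violations += mol_props["h_bond_donors"] > 5      # hydrogen bond donors
    violations += mol_props["h_bond_acceptors"] > 10  # hydrogen bond acceptors
    return violations <= 1

# An aspirin-like small molecule easily passes the filter.
aspirin_like = {"mol_weight": 180.2, "logp": 1.2,
                "h_bond_donors": 1, "h_bond_acceptors": 4}
print(passes_ro5(aspirin_like))  # True
```

Booleans count as 0/1 in Python, so each comparison adds one violation when its threshold is exceeded.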

Figure 1. Bag of fragments (left); bag of words (right).

CHEMICAL EMBEDDINGS

The first challenge in machine learning is selecting a numerical representation that correctly captures the underlying dynamics of the training data, also known as features or an embedding, terms we will use interchangeably in this work. A fixed-shape representation is typically required simply because the mathematics of learning algorithms require that their inputs be the same shape. Selecting an embedding that respects the underlying structure of the data cannot be overlooked, because mathematical assumptions that apply to some data sets need not apply to others: reversing an English sentence destroys its meaning, whereas reversing an image generally does not. In natural language processing, a pernicious problem is that sentences need not be the same length and that locality must be respected, because words are highly related to their neighbors. Bag-of-words embeddings resolve this by mapping sentences into bit vectors that indicate the presence or absence of words in the document [Figure 1]. This convenient, fixed-length bit vector can later be used to train a classifier for any natural language processing task.

In molecular machine learning, engineering good embeddings/features is a considerable challenge because molecules are unwieldy, undirected multigraphs, with atoms being nodes and bonds being edges. A good chemical embedding would be able to model graphs with a differing number of nodes and edge configurations while preserving locality, because it is understood that atoms that are close to each other generally exhibit more pronounced interactions than atoms that are distant. More formally, for a molecule represented with an adjacency matrix A ∈ {0, 1}^(n×n) and atom-feature matrix X ∈ ℝ^(n×d), we want to construct some function f with (optionally) learnable parameters θ ∈ ℝ^w such that

f: (A, X; θ) → x ∈ ℝ^d

where x is a fixed-shape representation that captures the essence of the underlying graph. This vector is then passed to a learning algorithm of the scientist's choice, such as random forests or a fully connected neural network.

Naive Embeddings. A standard chemistry embedding is the extended-connectivity fingerprint (ECFP4).24 These fingerprints generate features for every radius r ≤ 4, which is the maximum distance explored on the graph from the starting vertex. For a specific r and a specific vertex in the graph, ECFP4 takes the neighborhood features from the previous radius, concatenates them, and applies a hash function, the range of which corresponds to indices in a hash table. After iterating over all vertices and radius values, this bag-of-fragments approach to graph embedding results in a task-agnostic representation that can easily be passed on to a learning algorithm.

Expert Embeddings. Cheminformatics experts in drug discovery have, over decades, engineered numerous domain-specific, physiologically relevant features, also known as descriptors. For example, polar surface area (PSA) is a feature calculated as the sum of the surface-area contributions of the polar atoms in a molecule, well known in industry to correlate negatively with membrane permeability. There are 101 of these publicly available, expert-engineered features [Table 3] that are easily accessible in the open-source RDKit package.

Learnable Embeddings. One criticism of the naive embeddings is that they are not optimized for the task at hand. The ideal features for predicting drug solubility are likely to be considerably different from the features used to predict photovoltaic efficiency. The solution is to allow the model to engineer its own problem-specific, optimized embedding, in essence combining the learner with the embedding. This is achieved by allowing gradients to flow back from the learner into the embedding function, allowing the embedding to be optimized in tandem with the learner.
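The hash-based, bag-of-fragments scheme described under Naive Embeddings can be made concrete with a toy sketch. This is a hypothetical, simplified stand-in for ECFP, not the actual algorithm (real ECFP hashes much richer atom invariants and handles duplicated fragments); the 384-bit width matches the fingerprint size used later in this work:

```python
import hashlib

def fragment_fingerprint(adjacency, atom_labels, n_bits=384, max_radius=2):
    """Toy bag-of-fragments fingerprint.

    For each atom and each radius, hash a canonical string for the
    neighborhood reachable within that radius and set the corresponding
    bit in a fixed-length vector.
    """
    n = len(atom_labels)
    bits = [0] * n_bits
    for start in range(n):
        frontier = {start}
        seen = {start}
        for radius in range(max_radius + 1):
            # Canonical fragment string: sorted labels of atoms seen so far.
            fragment = ",".join(sorted(atom_labels[i] for i in seen)) + f"|r{radius}"
            h = int(hashlib.md5(fragment.encode()).hexdigest(), 16)
            bits[h % n_bits] = 1
            # Expand the neighborhood by one bond.
            frontier = {j for i in frontier
                        for j, bonded in enumerate(adjacency[i]) if bonded} - seen
            seen |= frontier
    return bits

# Ethanol-like toy graph: C-C-O.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
fp = fragment_fingerprint(adj, ["C", "C", "O"])
print(len(fp), sum(fp))  # fingerprint width and number of set bits
```

Using a cryptographic hash (md5) rather than Python's built-in `hash` keeps the fingerprint deterministic across runs, which matters when fingerprints are cached or compared between machines.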



Neural fingerprints25 demonstrated that ECFP could be considerably improved in this manner by introducing learnable weights, and Weave26 demonstrated further improvements by mixing bond and atom features. Later, it was shown that both of these graph embedding methods are special cases of message-passing algorithms.27

METHODS

Learning Algorithms. Random Forests. Random forests are a common ensemble learning algorithm used in industry due to their training speed, high performance, and ease of use. In this work, random forest models (sklearn's RandomForestRegressor) are trained on the concatenation of the 101 RDKit descriptors and the 384-bit ECFP4 fingerprints using 30 trees and a maximum tree depth of 4. This hyperparameter configuration is the result of tuning on the validation set by hand, with the aim of maximizing absolute performance while minimizing the spread of performance. A maximum tree depth of 10 was used on the lipophilicity dataset due to its size.

FC-DNN. Fully connected neural networks operate on a fixed-shape input by passing information through multiple nonlinear transformations, i.e., layers. FC-DNN models were implemented in PyTorch28 and are trained on the same inputs as the random forest models with an added normalization preprocessing stage. After extensive hyperparameter tuning on the validation set, a neural network with two hidden layers of size 48 and 32 was found to perform well. ReLU activations and batch normalization were used on both hidden layers. Optimization was performed using the ADAM optimizer.29 For hyperparameters, a static learning rate of 5e−4 and L2 weight decay of 8e−3 were used. All FC-DNN models were trained through training epoch 11, after which the models would begin overfitting.

GC-DNN. Graph convolutional networks are a geometric deep-learning method distinct from the previous methods in that they are trained exclusively from the molecular graph, an unwieldy input that can vary in the number of vertices as well as connectivity. This graph is initialized using a variety of atom features ranging from atomic number to covalent radius. The DeepChem TensorFlow30 implementation of the graph-convolution, graph-pooling, and graph-gather primitives was used to construct single-task networks. This implementation is unique in that it reserves a parameter matrix for each node degree, unlike other approaches.6 For these experiments, a 3-layer network was used with ReLU activations, batch normalization, and a static learning rate of 1e−3 with no weight decay. Once again, optimization was performed using the ADAM optimizer. A formal mathematical construction of the graph convolutional primitives is presented in the Appendix.

Data Preparation. Setting up valid machine learning experiments in molecular machine learning is considerably more challenging than in other domains. Datasets are autocorrelated because they are not collected by sampling from chemical space uniformly at random. Rather, datasets comprise many chemical series of interest, with each series consisting of molecules that differ by only subtle topology changes. This underlying structure can be visualized using t-SNE,31 a nonlinear embedding algorithm that excels at accurately visualizing high-dimensional data such as molecules. In essence, t-SNE aims to produce a 2D embedding such that points that are close together in high dimensions remain close together in the 2D embedding; likewise, it aims to keep points that are far apart in high dimensions far apart in the 2D embedding. The resulting t-SNE scatterplot [Figure 2] for the lipophilicity data set reveals this clear clustering.
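A minimal sketch of producing such a 2D projection with scikit-learn's TSNE on synthetic clustered data; the two tight "chemical series" clusters, their dimensions, and the perplexity value are arbitrary illustrative choices standing in for real fingerprints:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two synthetic "chemical series": tight clusters in a 64-dim feature
# space, standing in for fingerprints of a drug-discovery dataset.
series_a = rng.normal(loc=0.0, size=(50, 64))
series_b = rng.normal(loc=5.0, size=(50, 64))
X = np.vstack([series_a, series_b])

# Project to two dimensions; perplexity must be smaller than n_samples.
embedding = TSNE(n_components=2, perplexity=10, init="random",
                 random_state=0).fit_transform(X)
print(embedding.shape)  # (100, 2)
```

Plotting `embedding` with a scatterplot would show the two series as separated islands, the same qualitative picture as Figure 2.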

Figure 2. 2D embedding of the 4200 molecule lipophilicity dataset using t-SNE. Notice the heavy clustering that is characteristic of a drug-discovery dataset.
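Returning to the learning algorithms, the random forest baseline described under Methods can be sketched as follows. The feature matrix here is a synthetic stand-in for the 101 RDKit descriptors concatenated with the 384 ECFP4 bits (485 columns per molecule), and the target is a toy regression signal; only the hyperparameters (30 trees, maximum depth 4) come from the text:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for 101 continuous descriptors + 384 fingerprint bits.
X = np.hstack([rng.normal(size=(200, 101)),
               rng.integers(0, 2, size=(200, 384)).astype(float)])
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200)  # toy target

# Hyperparameters reported in the text: 30 trees, maximum depth 4.
model = RandomForestRegressor(n_estimators=30, max_depth=4, random_state=0)
model.fit(X[:150], y[:150])
print(round(model.score(X[150:], y[150:]), 3))  # held-out R^2
```

In the actual experiments the held-out split would be the scaffold-based split described under Data Preparation rather than a slice of rows.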

It follows from this structure that randomly splitting data sets of this style results in significant redundancies between the training and validation sets. It can be shown that benchmarks of this style significantly reward solutions that overfit rather than solutions that generalize to molecules significantly different from the training set.32 To control for this, we split each dataset into Murcko clusters33 and place the largest clusters in the training set and the smallest ones in the validation set, targeting 80% of the data in the training set, 10% in the validation set, and 10% in the test set. This method results in the majority of the chemical diversity being held outside of the training set, not unlike the data the system will encounter when deployed. Both split and unsplit datasets have been open sourced in a repository under the Numerate GitHub organization.

Capturing Uncertainty. Small datasets, along with algorithms that rely on randomness during training, introduce considerable noise into the performance results. This makes it difficult to tease apart genuine advancements from luck. Moreover, the performance of molecular machine learning systems is highly dependent on the choice of training set, making it difficult to assess how the system would perform on significantly novel chemical matter. Because there is no closed-form solution for uncertainty estimates for the metric we are interested in, R2, bootstrapping with replacement of the training set is used to capture uncertainty [Figure 3]. Models are trained on 25 bootstrap resamples, and 25 R2 values are recorded [Table 1]. The result is not a single score but rather a distribution of scores defined by a sample mean and sample variance.
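The bootstrapping procedure just described can be sketched in a few lines. In this hypothetical stand-in, a one-parameter least-squares fit plays the role of the trained model; only the resampling logic (25 draws of the training set with replacement, each scored by R2 on a fixed test set) mirrors the procedure above:

```python
import numpy as np

rng = np.random.default_rng(0)

def r_squared(y_true, y_pred):
    """Coefficient of determination R^2."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Fixed test set and full training set for a noisy linear toy problem.
x_test = rng.normal(size=200)
y_test = 0.7 * x_test + rng.normal(scale=0.5, size=200)
x_train = rng.normal(size=500)
y_train = 0.7 * x_train + rng.normal(scale=0.5, size=500)

scores = []
for _ in range(25):  # 25 bootstrap resamples of the training set
    idx = rng.integers(0, 500, size=500)  # sample with replacement
    slope, intercept = np.polyfit(x_train[idx], y_train[idx], 1)
    scores.append(r_squared(y_test, slope * x_test + intercept))

scores = np.asarray(scores)
print(round(scores.mean(), 3), round(scores.std(), 3))  # mean and spread
```

The resulting 25-score distribution is what a histogram such as those in Figure 3 summarizes.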
Variations in mean performance among learning algorithms can then be tested for statistical significance using the Welch t-test, an adaptation of the t-test that is more reliable for two samples that have unequal variances [Table 2].




Figure 3. Bootstrapped performance histograms and kernel density estimates for Random Forests, Graph Convolutional Neural Networks, and Fully Connected Neural Networks over five data sets.

Table 1. Test Set Performance (R2)

data set       model    mean   std    range
pKa-A1         RF       0.319  0.179  [−0.260, 0.673]
pKa-A1         FC-DNN   0.191  0.072  [0.091, 0.377]
pKa-A1         GC-DNN   0.437  0.105  [0.204, 0.689]
Clearance      RF       0.155  0.047  [0.054, 0.253]
Clearance      FC-DNN   0.136  0.025  [0.088, 0.192]
Clearance      GC-DNN   0.217  0.048  [0.117, 0.333]
HPPB           RF       0.287  0.029  [0.215, 0.342]
HPPB           FC-DNN   0.203  0.024  [0.158, 0.265]
HPPB           GC-DNN   0.208  0.039  [0.126, 0.309]
ThermoSol      RF       0.187  0.021  [0.137, 0.224]
ThermoSol      FC-DNN   0.256  0.039  [0.224, 0.377]
ThermoSol      GC-DNN   0.294  0.043  [0.215, 0.377]
Lipophilicity  RF       0.424  0.022  [0.371, 0.473]
Lipophilicity  FC-DNN   0.345  0.025  [0.302, 0.402]
Lipophilicity  GC-DNN   0.484  0.023  [0.436, 0.515]

Table 2. A/B Test for Random Forests and Graph Convolutions Using the Welch t-Test

data set    p-value
pKa-A1      7.2e−3
Clearance   3.2e−5
HPPB        3.7e−10
Thermosol   4.6e−13
Lipo        1.6e−12

EXPERIMENTS

Regression models are tested against a variety of physicochemical and ADME end points that are of interest to the pharmaceutical industry. We restrict our choice of data sets to those released by AstraZeneca into ChEMBL, a publicly available database,34 with the expectation that they were subject to AstraZeneca's strict internal quality-control standards, contain considerable chemical diversity, and are representative of data sets held internally in industry.

Data Sets. pKa-A1. This is the acid−base dissociation constant for the most acidic proton, an important factor in understanding the ionizability of a potential drug, with a strong influence over multiple properties of interest, including permeability, partitioning, binding, and so forth.35 This is the smallest data set of the five, with only 204 examples.

Human Intrinsic Clearance. This is the rate at which the human body removes circulating, unbound drug from the blood. This is one of the key in vitro parameters used to



Table 3. Expert Engineered Features

feature              mean    variance   feature       mean    variance
MaxAbsPartialCharge  0.43    0.07       PEOE-VSA10    9.64    7.99
MinPartialCharge     −0.42   0.07       PEOE-VSA11    4.26    6.43
MinAbsPartialCharge  0.0     0.0        PEOE-VSA12    3.61    5.10
HeavyAtomMolWt       0.26    0.08       PEOE-VSA13    3.19    4.38
MaxAbsEStateIndex    0.16    0.19       PEOE-VSA14    2.38    4.05
NumRadicalElectrons  0.0     0.0        PEOE-VSA2     7.14    6.03
NumValenceElectrons  141.25  40.29      PEOE-VSA3     6.97    6.48
MinAbsEStateIndex    0.16    0.19       PEOE-VSA4     2.89    5.13
MaxEStateIndex       11.65   2.53       PEOE-VSA5     1.75    4.23
MaxPartialCharge     0.27    0.08       PEOE-VSA6     24.64   18.86
MinEStateIndex       −1.11   1.59       PEOE-VSA7     40.05   19.59
ExactMolWt           382.69  106.85     PEOE-VSA8     24.58   15.41
BalabanJ             1.80    0.43       PEOE-VSA9     14.78   10.27
BertzCT              944.81  330.45     SMR-VSA1      13.01   8.90
Chi0                 19.20   5.32       SMR-VSA10     23.76   12.57
Chi0n                15.16   4.36       SMR-VSA2      0.45    1.51
Chi0v                15.47   4.46       SMR-VSA3      12.30   7.77
Chi1                 13.01   3.58       SMR-VSA4      2.74    5.10
Chi1n                8.87    2.66       SMR-VSA5      24.53   19.55
Chi1v                9.47    2.85       SMR-VSA6      19.45   16.61
Chi2n                6.69    2.18       SMR-VSA7      55.98   21.91
Chi2v                7.38    2.45       SMR-VSA8      0.0     0.0
Chi3n                4.71    1.68       SMR-VSA9      7.67    7.71
Chi3v                5.26    1.88       SlogP-VSA1    9.88    6.52
Chi4n                3.27    1.30       SlogP-VSA10   7.40    7.78
Chi4v                3.71    1.44       SlogP-VSA11   3.62    5.40
HallKierAlpha        −2.67   0.89       SlogP-VSA12   6.36    8.89
Ipc                  0.0     0.0        SlogP-VSA2    37.98   20.79
Kappa1               18.54   5.75       SlogP-VSA3    9.19    8.33
Kappa2               7.84    2.77       SlogP-VSA4    5.73    7.08
Kappa3               4.13    1.79       SlogP-VSA5    24.76   17.81
LabuteASA            159.92  43.61      SlogP-VSA6    44.86   18.86
PEOE-VSA1            13.85   7.86       SlogP-VSA7    1.52    3.03
VSA-EState1          0.0     0.0        SlogP-VSA8    8.62    9.03
VSA-EState10         1.55    3.95       SlogP-VSA9    0.0     0.0
VSA-EState2          0.0     0.0        TPSA          78.84   32.07
VSA-EState3          0.0     0.0        EState-VSA1   8.73    12.23
VSA-EState4          0.0     0.0        EState-VSA10  10.28   8.04
VSA-EState5          0.0     0.0        EState-VSA11  0.02    0.35
VSA-EState6          0.0     0.0        EState-VSA2   13.47   10.48
VSA-EState7          0.0     0.0        EState-VSA3   20.41   14.42
VSA-EState8          10.71   16.66      EState-VSA4   25.33   18.15
VSA-EState9          52.19   17.21      EState-VSA5   12.42   12.41
FractionCSP3         0.30    0.18       EState-VSA6   16.85   14.04
HeavyAtomCount       27.04   7.46       EState-VSA7   23.08   18.96
NOCount              6.03    2.37       EState-VSA8   19.82   15.80
NumHAcceptors        5.14    2.16       EState-VSA9   9.51    8.59
NumHeteroatoms       7.17    2.79       MolLogP       3.28    1.32
NumRotatableBonds    5.22    2.91       MolMR         104.00  28.25
NHOHCount            1.84    1.37       -             -       -

predict drug residency time in the patient.36 In drug discovery, this property is assessed by measuring the metabolic stability of drugs in either human liver microsomes or hepatocytes. This data set includes 1102 examples of intrinsic clearance measured in human liver microsomes (μL min−1 mg−1 protein) following incubation at 37 °C.

Human Plasma Protein Binding. This assay measures the proportion of drug that is bound reversibly to proteins such as albumin and α-acid glycoprotein in the plasma. Knowing the amount that is unbound is critical because it is that amount that can diffuse into tissue or be cleared by the liver.36 Here, 1640 compounds are measured, and regression targets are transformed using log(1 − bound), a more representative measure for scientists.

Thermodynamic Solubility. This measures the solubility of a solid starting material in pH 7.4 buffer. Solubility influences a wide range of properties for drugs, especially ones that are administered orally. This data set contains 1763 examples.

Lipophilicity. This is a compound's affinity for a lipophilic solvent vs a polar solvent. More formally, we use logD (pH 7.4), which is captured experimentally as the octanol/buffer distribution coefficient measured by the shake-flask method.





This is an important measure for potential drugs, as lipophilicity is a key contributor to membrane permeability.36 Conversely, highly lipophilic compounds are usually encumbered by low solubility, high clearance, high plasma protein binding, and so forth. Indeed, most drug discovery projects have a target range for lipophilicity.36 This data set is the largest of the five at 4200 compounds.

Results. Graph convolutional neural networks lead the three learning algorithms on four out of five data sets, the exception being human plasma protein binding. All five differences between GC-DNNs and the industry-standard RFs were found to be statistically significant using the Welch t-test A/B test. Fully connected neural networks generally underperformed their counterparts despite requiring considerably more hyperparameter tuning.


DISCUSSION

In part due to its autonomously learned features, graph convolutional neural networks outperformed methods trained on expert-engineered features on four out of five data sets, the exception being plasma protein binding [Table 1]. This is a surprising result given that GC-DNNs are blind to the domain of drug discovery and could trivially be repurposed to solve orthogonal problems such as detecting fraud in banking transaction networks. Geometric deep-learning approaches like this unlock the possibility of learning from non-Euclidean graphs (molecules) and manifolds, providing the pharmaceutical industry with the ability to learn from and exploit knowledge from its historical successes and failures, resulting in significantly improved quality of research candidates and accelerated timelines. However, for these applications to take off in industry, there needs to be significant certainty that the system will remain performant on novel chemical matter.

As part of this work, our analysis of uncertainty has revealed concerns about the methodology of learning-algorithm comparisons in this field. pKa-A1 in particular exhibits so much uncertainty that individual trials have little to no meaning. Although it is clear from the p-values that GC-DNNs do indeed outperform, the width of the uncertainty intervals indicates that it is completely unclear whether the resulting predictor will turn out to be useful. Even the random forests trained on the 1102-example clearance dataset exhibit significant variability in performance, ranging from almost zero correlation to a correlation high enough to be useful and everything in between. This is alarming considering that 1102 examples is considered a large dataset in this field and could easily have cost in excess of half a million dollars to generate.

Beyond this, there is still a significant amount of progress to be made. The publicly available approaches tested in this work still significantly lag the accuracy of the underlying assays they are trying to model. Thermodynamic solubility, in particular, has an assay limit upward of 0.8 R2, whereas all of the presented models are under 0.3 R2, a gap that more data alone is unlikely to close. What's missing? Our internal research shows that in most cases the answer is 3D representations. Medicinal molecules interact with the human body in three dimensions while in solution, and these molecular structures are not static: they can take the form of a wide range of conformers. Building machine learning systems that are more aware of the true underlying physics can result in significantly more performant models, which will be the focus of our upcoming follow-up paper.

APPENDIX

Graph Convolutional Neural Network

Let G(V, E) be a molecular graph with vertices V ⊆ ℝ^d (atoms) and edges (bonds) E. Before constructing the adjacency matrices, we need to define a constant a, to which we pad the dimensionality of all of the adjacency matrices, and a constant m, which is the maximum node degree that we will consider in the molecular graph. Now we can construct pairs of adjacency matrices Ai ∈ {0, 1}^(a×a) and feature matrices Xi ∈ {0, 1}^(a×d) for i ∈ {0, ..., m}, where each row in Xi is a vertex v ∈ V of degree i. Finally, we define a parameter matrix Wi ∈ ℝ^(d×d). We can then construct a graph convolution over nodes of degree i as

fi(Ai, Xi, Wi) = σ(Ai Xi Wi)

The concatenation of all fi simply results in a graph convolution f over all nodes with degree ≤ m. If we let the function σ be the ReLU nonlinearity, we can then define a graph convolutional layer as

Z = σ(f(A, X, W))
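A minimal NumPy sketch of a single graph-convolution step of the form σ(A X W) follows. Note that this simplified version uses one shared weight matrix; the degree-partitioned construction above (a separate Wi per node degree, as in the DeepChem implementation) is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def graph_conv_layer(A, X, W):
    """One graph-convolution step: each node aggregates its neighbors'
    features through the adjacency matrix A, mixes them with learnable
    weights W, and applies a ReLU nonlinearity: Z = relu(A @ X @ W)."""
    return relu(A @ X @ W)

# Toy molecular graph: 4 atoms with d = 8 input features each.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 8))

Z = graph_conv_layer(A, X, W)
print(Z.shape)  # one output row of features per atom: (4, 8)
```

Because the output has one row per node, stacking such layers and finishing with a pooling (graph-gather) step over rows yields the fixed-shape vector required by the learner, regardless of the number of atoms.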

This embedding function can be shown to be subdifferentiable everywhere, allowing gradient signals to be passed down from the layers above using backpropagation, not unlike traditional convolutional or sequence embedding layers.

Welch t-Test

Let X̄1 and X̄0 be the two sample means, let s1² and s0² be the respective sample variances, let v1 and v0 be the degrees of freedom of the respective sample variances, and let N1 and N0 be the respective sample sizes. We can then construct the Welch t-test statistic t and degrees of freedom v as

t = (X̄1 − X̄0) / sqrt(s1²/N1 + s0²/N0)

v ≈ (s1²/N1 + s0²/N0)² / (s1⁴/(N1²·v1) + s0⁴/(N0²·v0))

These values can then be passed through the Student's t cumulative distribution function (CDF) to obtain the final p-value.
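The statistic and degrees of freedom above can be computed directly in a few lines of pure Python. The two sample lists below are made-up illustrative values standing in for bootstrapped R2 scores of two models:

```python
import math

def welch(sample1, sample0):
    """Welch t statistic and approximate degrees of freedom for two
    samples with (possibly) unequal variances."""
    n1, n0 = len(sample1), len(sample0)
    m1 = sum(sample1) / n1
    m0 = sum(sample0) / n0
    # Unbiased sample variances s1^2 and s0^2 (v_i = n_i - 1 dof each).
    var1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)
    var0 = sum((x - m0) ** 2 for x in sample0) / (n0 - 1)
    se2 = var1 / n1 + var0 / n0
    t = (m1 - m0) / math.sqrt(se2)
    dof = se2 ** 2 / ((var1 / n1) ** 2 / (n1 - 1)
                      + (var0 / n0) ** 2 / (n0 - 1))
    return t, dof

# Illustrative bootstrapped R^2 scores for two hypothetical models.
gc_scores = [0.44, 0.41, 0.47, 0.43, 0.45]
rf_scores = [0.31, 0.35, 0.29, 0.33, 0.30]
t, dof = welch(gc_scores, rf_scores)
print(round(t, 2), round(dof, 1))
```

Passing t through the Student's t survival function with dof degrees of freedom (e.g., scipy.stats.t.sf) would yield the p-value; that step is omitted here to keep the sketch dependency-free.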



AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected].
*E-mail: [email protected].
*E-mail: [email protected].

ORCID

Patrick Hop: 0000-0002-6641-4228

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

Patrick thanks Bharath Ramsundar at Stanford and Jeff Blaney at Genentech for a long history of helpful discussions. As a group, we thank the DeepChem open-source community for their end-to-end implementation of graph convolutional neural networks in TensorFlow.

REFERENCES

(1) LeCun, Y.; Bengio, Y. Convolutional Networks for Images, Speech, and Time Series. Unpublished, 1995.
(2) Chi, L.; Mu, Y. Deep Steering: Learning End-to-End Driving Model from Spatial and Temporal Visual Cues. ArXiv, 2017.
(3) Arık, S.; Diamos, G.; Gibiansky, A. Multi-Speaker Neural Text-to-Speech. ArXiv, 2017.
(4) Artetxe, M.; Labaka, G.; Agirre, E. Unsupervised Neural Machine Translation. ArXiv, 2017.
(5) Bronstein, M.; Bruna, J. Geometric Deep Learning: Going beyond Euclidean Data. ArXiv, 2017.
(6) Kipf, T.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. ArXiv, 2016.
(7) Hop, C.; Cole, M.; Duigan, D. High throughput ADME screening: practical considerations, impact on the portfolio and enabler of in silico ADME models. Current Drug Metabolism 2008, 847−853.
(8) Lipinski, C.; Lombardo, F.; Dominy, B.; Feeney, P. J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Delivery Rev. 2001, 46, 3−26.
(9) Ortwine, D.; Aliagas, I. Physicochemical and DMPK In Silico Models: Facilitating Their Use by Medicinal Chemists. Mol. Pharmaceutics 2013, 10, 1153−1161.
(10) Aliagas, I.; Gobbi, A.; Heffron, T.; Lee, M.-L.; Ortwine, D. F.; Zak, M.; Khojasteh, S. C. A probabilistic method to report predictions from a human liver microsomes stability QSAR model: a practical tool for drug discovery. J. Comput.-Aided Mol. Des. 2015, 29, 327−338.
(11) Ma, J.; Sheridan, R.; Liaw, A.; Dahl, G. E.; Svetnik, V. Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships. J. Chem. Inf. Model. 2015, 55, 263−274.
(12) Kearnes, S.; Goldman, B.; Pande, V. Modeling Industrial ADMET Data with Multitask Networks. ArXiv, 2017.
(13) Sheridan, R.; Wang, W.; Liaw, A.; Ma, J.; Gifford, E. M. Extreme Gradient Boosting as a Method for Quantitative Structure−Activity Relationships. J. Chem. Inf. Model. 2016, 56, 2353−2360.
(14) Ramsundar, B.; Kearnes, S.; Riley, P. Massively Multitask Networks for Drug Discovery. ArXiv, 2016.
(15) Altae-Tran, H.; Ramsundar, B.; Pappu, A. Low Data Drug Discovery with One-Shot Learning. ArXiv, 2016.
(16) Wallach, I.; Dzamba, M.; Heifets, A. AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-Based Drug Discovery. ArXiv, 2015.
(17) Aliper, A.; Plis, S.; Artemov, A.; Ulloa, A.; Mamoshina, P.; Zhavoronkov, A. Deep Learning Applications for Predicting Pharmacological Properties of Drugs and Drug Repurposing Using Transcriptomic Data. Mol. Pharmaceutics 2016, 13, 2524.
(18) Kadurin, A.; Nikolenko, S.; Khrabrov, K.; Aliper, A.; Zhavoronkov, A. druGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico. Mol. Pharmaceutics 2017, 14, 3098.
(19) Gomez-Bombarelli, R.; Wei, J.; Duvenaud, D. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science 2018, 4 (2), 268−276.
(20) Segler, M.; Kogej, T.; Tyrchan, C.; Waller, M. P. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Cent. Sci. 2018, 4, 120.
(21) Kuzminykh, D.; Polykovskiy, D.; Kadurin, A.; Zhebrak, A.; Baskov, I.; Nikolenko, S.; Shayakhmetov, R.; Zhavoronkov, A. 3D Molecular Representations Based on the Wave Transform for Convolutional Neural Networks. Mol. Pharmaceutics 2018, in press.
(22) Lombardo, F.; Desai, P.; Arimoto, R.; Desino, K. E.; Fischer, H.; Keefer, C. E.; Petersson, C.; Winiwarter, S.; Broccatelli, F. In Silico Absorption, Distribution, Metabolism, Excretion, and Pharmacokinetics (ADME-PK): Utility and Best Practices. An Industry Perspective from the International Consortium for Innovation through Quality in Pharmaceutical Development. J. Med. Chem. 2017, 60, 9097−9113.
(23) Waring, M.; Arrowsmith, J.; Leach, A. An analysis of the attrition of drug candidates from four major pharmaceutical companies. Nat. Rev. Drug Discovery 2015, 475−496.
(24) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754.
(25) Duvenaud, D.; Maclaurin, D.; Iparraguirre, J. Convolutional Networks on Graphs for Learning Molecular Fingerprints. ArXiv, 2015.
(26) Kearnes, S.; McCloskey, K.; Berndl, M. Molecular Graph Convolutions: Moving Beyond Fingerprints. ArXiv, 2016.
(27) Gilmer, J.; Schoenholz, S.; Riley, P. Neural Message Passing for Quantum Chemistry. ArXiv, 2017.
(28) Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. NIPS, 2017.
(29) Kingma, D.; Lei Ba, J. Adam: A Method for Stochastic Optimization. ICLR, 2015.
(30) Abadi, M.; Agarwal, A.; Barham, P. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. ArXiv, 2015.
(31) van der Maaten, L. Barnes-Hut-SNE. ArXiv, 2013.
(32) Wallach, I.; Heifets, A. Most Ligand-Based Benchmarks Measure Overfitting Rather than Accuracy. ArXiv, 2017.
(33) Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887.
(34) Bento, A.; Gaulton, A.; Overington, J. The ChEMBL bioactivity database: an update. Nucleic Acids Res. 2014, 1083−1090.
(35) Manallack, D. The pKa Distribution of Drugs: Application to Drug Discovery. Perspect. Medicin. Chem. 2007, 25−38.
(36) Khojasteh, K.; Wong, H.; Hop, C. Drug Metabolism and Pharmacokinetics Quick Guide; 2011. DOI: 10.1007/978-1-4419-5629-3.
