Computational Chemistry

Imputation of Assay Bioactivity Data using Deep Learning
Thomas Whitehead, Ben Irwin, Peter A. Hunt, Matthew Segall, and Gareth Conduit
J. Chem. Inf. Model., Just Accepted Manuscript. DOI: 10.1021/acs.jcim.8b00768. Publication Date (Web): 12 Feb 2019. Downloaded from http://pubs.acs.org on February 14, 2019.
Journal of Chemical Information and Modeling
Imputation of Assay Bioactivity Data using Deep Learning

T.M. Whitehead,*,† B.W.J. Irwin,‡ P. Hunt,‡ M.D. Segall,‡ and G.J. Conduit¶

†Intellegens, Eagle Labs, Chesterton Road, Cambridge, CB4 3AZ, United Kingdom
‡Optibrium, F5-6 Blenheim House, Cambridge Innovation Park, Denny End Road, Cambridge, CB25 9PB, United Kingdom
¶Cavendish Laboratory, University of Cambridge, J.J. Thomson Avenue, Cambridge, CB3 0HE, United Kingdom
E-mail: [email protected]

Abstract
We describe a novel deep learning neural network method and its application to impute assay pIC50 values. Unlike conventional machine learning approaches, this method is trained on sparse bioactivity data as input, typical of that found in public and commercial databases, enabling it to learn directly from correlations between activities measured in different assays. In two case studies on public domain data sets we show that the neural network method outperforms traditional quantitative structure-activity relationship (QSAR) models and other leading approaches. Furthermore, by focussing on only the most confident predictions the accuracy is increased to R² > 0.9 using our method, as compared to R² = 0.43 using the leading profile-QSAR approach.

1 Introduction

Accurate compound bioactivity and property data are the foundations of decisions on the selection of hits as the starting point for discovery projects, or the progression of compounds through hit to lead and lead optimisation to candidate selection. However, in practice, the experimental data available on potential compounds of interest are sparse. High-throughput screens may be run on large screening collections, but these are costly, and thus applied infrequently, and the throughput of an assay usually comes with a trade-off against the quality of the measured data. As discovery projects progress and new compounds are synthesised, the increasing cost of generating high-quality data means that only the most promising compounds are advanced to these late-stage studies.

If one considers all of the compounds in a large pharmaceutical company's corporate collection and the assay endpoints that have been measured, only a small fraction of the possible compound-assay combinations have been measured in practice. Public domain databases [1,2] are also sparsely populated; for example, the ChEMBL data set is just 0.05% complete. The implication of this is that a vast trove of information would be revealed if only a small fraction of these missing data could be filled in with high-quality results in a cost-effective way. New hits for projects targeting existing biological targets of interest could be found, and high-quality compounds, overlooked during optimisation projects, could be identified. Furthermore, compounds with results from early assays could be selected for progression with greater confidence if downstream results could be accurately predicted.

A common approach for prediction of compound bioactivities is the development of quantitative structure-activity relationship (QSAR) models. [3]
These are generated using existing data to identify correlations between easily calculated characteristics of compound structures, known as descriptors, and their biological activities or properties. The resulting models can then be applied to new compounds that have not yet been experimentally tested, to predict the outcome of the corresponding assays. A wide range of statistical methods have been applied to build QSAR models, from simple linear regression methods such as partial least squares [4] to more sophisticated machine learning approaches such as random forests (RF), [5-9] support vector machines [10] and Gaussian processes. [11] Another approach is the profile-QSAR (pQSAR) method proposed by Martin et al., [12,13] which uses a hierarchical approach to build a model of a bioactivity by using as inputs the predictions from QSAR models of multiple bioactivities that may be correlated. Recently, the application of advances in deep learning has been explored for the generation of QSAR models; [14] while small improvements in the accuracy of predictions have been found, these methods have not generally resulted in a qualitative step forward for activity predictions. [15-17] One advantage of deep learning methods is the ability to train models against multiple endpoints simultaneously, so-called multi-target prediction. This enables the model to 'learn' where a descriptor correlates with multiple endpoints and hence improve the accuracy for all of the corresponding endpoints.

However, the sparse experimental data could reveal more information regarding the correlations between the endpoints of interest, if these could be used as inputs to a predictive model. Conventional machine learning methods cannot use this information as inputs because the bioactivity data are often incomplete, and so cannot be relied on as input. In this paper we present a novel deep learning framework, previously applied to materials discovery, [18-20] that can learn from and exploit information that is sometimes missing, unlike other contemporary machine learning methods. A further benefit of the proposed method is that it can estimate the uncertainty in each individual prediction, allowing it to improve the quality of predictions by focussing on only the most confident results.

We will compare the performance of our method to impute bioactivities with a RF, a commonly applied and robust QSAR machine learning method; a modern multi-target deep learning method; a leading matrix factorisation approach; and the second-generation pQSAR 2.0 technique. [13]

In Section 2 we present the underlying deep learning methodology to handle missing data and estimate uncertainty, along with details of the data sets used in this study, the accuracy metric, and other machine learning methods applied for comparison. Then in Section 3 we present two examples to assess the performance of the algorithm against current methods. Finally, in Section 4 we discuss our findings and potential applications of the results.

2 Methodology

The goal for the neural network tool is to predict and impute assay bioactivity values, by learning both the correlations between chemical descriptors and assay bioactivity values and also the correlations between the assay bioactivities. In Subsection 2.1 we introduce the data sets used to validate the approach, before turning in the following subsections to the description of the neural network method itself.

2.1 Data sets

Two data sets were used to train and validate the models: a set containing activities derived from five adrenergic receptor assays (hereafter described as the "Adrenergic set") and a data set comprised of results from 159 kinase assays proposed by Martin et al. [13] as a challenging benchmark for machine learning methods (the "Kinase set"). These data sets are summarised in Table 1. All of the data were sourced from binding assays reported in the ChEMBL database [1,2] and the assay data represented as pIC50 values (the negative log of the IC50 in molar units). In the case of the Adrenergic set, measurements from different assays were combined for each target activity and, where multiple values were available for the same compound, the highest pIC50 value was used, representing a 'worst case' scenario for selectivity.
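The pIC50 convention and the 'worst case' aggregation used for the Adrenergic set can be illustrated with a short sketch (hypothetical helper functions, not code from this work):

```python
import math
from collections import defaultdict

def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 in nanomolar units to pIC50 = -log10(IC50 in molar)."""
    return -math.log10(ic50_nm * 1e-9)

def aggregate_worst_case(measurements):
    """Combine repeated measurements per (compound, target), keeping the
    highest pIC50 (the most potent value, a 'worst case' for selectivity)."""
    best = defaultdict(lambda: float("-inf"))
    for compound, target, pic50 in measurements:
        best[(compound, target)] = max(best[(compound, target)], pic50)
    return dict(best)

# An IC50 of 100 nM corresponds to pIC50 = 7.
assert abs(pic50_from_ic50_nm(100.0) - 7.0) < 1e-9

data = [("cmpd1", "ADRA1A", 6.2), ("cmpd1", "ADRA1A", 7.1),
        ("cmpd2", "ADRA1A", 5.0)]
assert aggregate_worst_case(data)[("cmpd1", "ADRA1A")] == 7.1
```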
In the case of the Kinase data set, each activity was derived from a single assay, as defined in ChEMBL.

Table 1: A summary of the data sets used in the examples presented herein. The table shows the data set, the number of compounds and assays each contains, and the proportion of the compound-assay values that are filled.

Data set   | Compounds | Assays | Filled
Adrenergic | 1731      | 5      | 37.5%
Kinase     | 13998     | 159    | 6.3%

320 molecular descriptors were used to characterise the compounds in the data sets. These comprised whole-molecule properties, such as the calculated octanol:water partition coefficient (logP), [21] molecular weight, topological polar surface area [22] and McGowan volume, [23] as well as counts of substructural fragments represented as SMARTS patterns.

In the case of the Adrenergic set, we employed a five-fold cross-validation approach for building models and assessing their resulting accuracy. The compounds in the data set were randomly split into five disjoint subsets of equal size, the models were trained using four of the subsets, and then their accuracy evaluated on the remaining subset. We repeated this process using each of the subsets for testing, so that each compound was used as a test case for the tool. The Adrenergic data set is provided with the supporting information for this paper.

The Kinase set was provided in the supporting information of the paper by Martin et al. [13] as a challenging benchmark for machine learning methods. In this case, the data set was split by Martin et al. into independent training and test sets. The data were initially clustered for each assay and the members of the clusters used as the training set, leaving the outliers from this clustering procedure as the data against which the resulting models were tested. This procedure means that the test data is not representative of the data used to train the models, making this a difficult test of a machine learning method's ability to extrapolate outside of the chemical space on which it was trained. Martin et al. described this as a 'realistic' test set, designed to be more representative of real working practices in an active chemistry optimisation project, where new compounds are continuously proposed that extend beyond the chemical space that has previously been explored. Because the clustering was carried out on a per-assay basis, some compounds appear in both the train and test sets, but the assay data for each compound is split between the sets, so that none of the same assay/compound pairs appear in both the train and test set and the validation is against a robust, disjoint test case. The Kinase data set is provided with the supporting information for this paper.

2.2 Performance Metric

To assess the performance of the models we use the coefficient of determination R² for each assay in the test set:

R² = 1 − Σ_i (y_i^pred − y_i^obs)² / Σ_i (y_i^obs − ȳ^obs)²,

where y_i^obs is the i-th observed assay value and y_i^pred is the corresponding prediction. This is a more stringent test than the commonly used squared Pearson correlation coefficient, which is a measure of the fit to the best fit line between the predicted and observed values, while the coefficient of determination is a measure of the fit to the perfect identity line y_i^pred = y_i^obs. By definition, the coefficient of determination is less than or equal to the squared Pearson correlation coefficient.

For each of the methods, we report the mean of the R² across all of the assays in the test set to give an overall value.
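The metric above, and its relationship to the squared Pearson correlation, can be checked numerically; this is a minimal sketch of the two statistics, not the authors' code:

```python
def coefficient_of_determination(y_obs, y_pred):
    """R^2 = 1 - sum((pred - obs)^2) / sum((obs - mean(obs))^2):
    a measure of fit to the identity line y_pred = y_obs."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((p - o) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot

def pearson_r_squared(y_obs, y_pred):
    """Squared Pearson correlation: a measure of fit to the best straight line."""
    n = len(y_obs)
    mo, mp = sum(y_obs) / n, sum(y_pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(y_obs, y_pred))
    vo = sum((o - mo) ** 2 for o in y_obs)
    vp = sum((p - mp) ** 2 for p in y_pred)
    return cov * cov / (vo * vp)

# A systematically offset prediction is perfectly correlated (Pearson^2 = 1)
# but penalised by the coefficient of determination.
obs = [5.0, 6.0, 7.0, 8.0]
pred = [o + 1.0 for o in obs]
assert abs(pearson_r_squared(obs, pred) - 1.0) < 1e-12
assert coefficient_of_determination(obs, pred) < pearson_r_squared(obs, pred)
```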
2.3 Neural network formalism

We now turn to the neural network formalism. This algorithm is able to automatically identify the link between assay bioactivity values, and use the bioactivity data of other compounds to guide the extrapolation of the model, as well as using molecular descriptors as design variables. Furthermore, the method can estimate uncertainties in its predictions. The neural network builds on the formalism used to design nickel-base superalloys, molybdenum alloys, and identify erroneous entries in materials databases. [18-20] We describe here the core neural network and the first novel aspect, the ability to estimate the uncertainty in the predictions, before Section 2.4 details the second novel part of the algorithm: how to handle missing data, necessary to capture bioactivity-bioactivity correlations.

Each input vector x = (x_1, ..., x_{A+D}) to the neural network contains values for D = 320 molecular descriptors and A = 5 (for the Adrenergic data set) or A = 159 (for the Kinase data set) bioactivity values. The ordering of the elements of the input is the same for each compound, but otherwise unimportant. The output (y_1, ..., y_{A+D}) of the neural network consists of the original descriptors and the predicted bioactivities: only the elements (y_1, ..., y_A) corresponding to predicted bioactivities are used for evaluating the network accuracy.

The neural network itself is a linear superposition of hyperbolic tangents

f : (x_1, ..., x_i, ..., x_{A+D}) ↦ (y_1, ..., y_j, ..., y_{A+D})

with

y_j = Σ_{h=1}^{H} C_{hj} η_{hj} + D_j   and   η_{hj} = tanh( Σ_{i=1}^{I} A_{ihj} x_i + B_{hj} ),

with parameters {A_{ihj}, B_{hj}, C_{hj}, D_j}, as shown in Figure 1. This neural network has a single layer of hidden nodes η_{hj}. Each property y_j for 1 ≤ j ≤ A is predicted separately. We set A_{jhj} = 0 so the network will predict y_j without knowledge of x_j. Typically around five hidden nodes η_{hj} per output variable gives the best-fitting neural network. We use hyperbolic tangent activation functions to constrain the magnitude of η_{hj}, giving the weights C_{hj} sole responsibility for the amplitude of the output response.

Figure 1: The neural network. The graphs show how the outputs for y_1 (top) and y_2 (bottom) are computed from all the inputs; similar graphs can be drawn for all other predicted properties. A linear combination (gray lines) of the given properties (red) is taken by the hidden nodes (blue), a non-linear tanh function is applied, and a linear combination (gray lines) gives the predicted property (green).
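The formalism above can be transcribed directly; the sketch below predicts a single property y_j with tanh hidden nodes and the self-input masked out, using illustrative (untrained) parameter values:

```python
import math

def predict_property(x, j, A, B, C, D):
    """y_j = sum_h C[h] * tanh(sum_i A[h][i] * x[i] + B[h]) + D,
    with the self-input term forced to zero (A_{jhj} = 0), so that
    y_j is predicted without knowledge of x[j]."""
    y = D
    for h in range(len(C)):
        s = B[h]
        for i, xi in enumerate(x):
            if i == j:          # mask the self-input: A_{jhj} = 0
                continue
            s += A[h][i] * xi
        y += C[h] * math.tanh(s)
    return y

# Tiny example: 3 inputs, H = 2 hidden nodes, predicting property j = 0.
x = [0.7, -0.2, 1.3]
A = [[9.9, 0.5, -0.3], [9.9, 0.1, 0.8]]  # first column is ignored by masking
B = [0.1, -0.2]
C = [1.5, -0.4]
D = 0.05
y0 = predict_property(x, 0, A, B, C, D)
# tanh bounds the hidden nodes, so |y0| <= |D| + sum |C[h]| < 2.
assert -2.0 < y0 < 2.0
# Masking: changing x[0] cannot affect the prediction of property 0.
assert predict_property([100.0, -0.2, 1.3], 0, A, B, C, D) == y0
```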
Twelve separate networks were trained on the data with different weights [18-20] and their variance taken to indicate the uncertainty in the predictions, accounting for both experimental uncertainty in the underlying data and the uncertainty in the extrapolation of the training data. [24,25] This is conceptually similar to the approach taken to uncertainty estimation in ensemble models, although here the underlying model is a deep neural network and the uncertainty estimates generated accurately represent the observed errors in the predictions, including uncertainty due to extrapolation that is poorly captured by random forest (see also Section 3.2).

2.4 Handling incomplete data

Experimental data are often incomplete: bioactivity values are not known for every compound and assay, and moreover the set of missing bioactivities is different for each compound. However, there is information embedded within bioactivity-bioactivity correlations. A typical neural network formalism requires that each property is either an input or output of the network, and all inputs must be provided to obtain a valid output. In contrast, we treat both the molecular descriptors and also the assay bioactivities as both inputs and outputs of the neural network and adopt an expectation-maximization algorithm, [26] where we first provide an estimate for the missing data, and use the neural network to iteratively improve that initial estimate.

The algorithm is shown in Figure 2. For any unknown bioactivities we first set missing values to the average of the bioactivity values present in the data set for that assay. With estimates for all values of the neural network we can then iteratively compute

x^{n+1} = (x^n + f(x^n)) / 2.

The final predictions (y_1, ..., y_A) are then the elements of this converged algorithm corresponding to the assay bioactivity predictions. The softening of the results by combining them with the existing predictions serves to prevent oscillations of the predictions, similar to the use of shortcut connections in ResNet. [27] Typically up to 5 iteration cycles were used to impute missing bioactivity values, using the same function f (as defined in Section 2.3) in every cycle. After 5 cycles the coefficient of determination R² in training improved by less than 0.01, comparable to the accuracy of the approach, confirming that we had used sufficient iteration cycles to reach convergence.

Figure 2: The data imputation algorithm for the vector x of the molecular descriptors and bioactivity values that has missing entries. We set x^0 = x, replacing all missing entries by averages across each assay, and then iteratively compute x^{n+1} as a function of x^n and f(x^n) until we reach convergence after n iterations.

The parameters {A_{ihj}, B_{hj}, C_{hj}, D_j} in the function f are then trained using simulated annealing [28] to minimize the least-square error of the predicted bioactivities (y_1, ..., y_A) against the training data. At least 10^5 training rounds were used to reach convergence.

Hyperparameters, in particular the number of hidden nodes per output, the number of iteration cycles, and the number of training rounds, were selected using random holdout validation on each training data set, without reference to the corresponding test set.
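The imputation loop of Figure 2 can be sketched as follows; `toy_f` is an arbitrary stand-in for the trained network f, used only to exercise the softened update x^{n+1} = (x^n + f(x^n))/2:

```python
def impute(x, f, column_means, cycles=5):
    """Iterative imputation as in Figure 2: replace missing entries (None)
    with per-assay averages, then repeatedly soften towards the network
    output, x_{n+1} = (x_n + f(x_n)) / 2, to avoid oscillations."""
    xn = [column_means[i] if v is None else v for i, v in enumerate(x)]
    for _ in range(cycles):
        fx = f(xn)
        xn = [(a + b) / 2.0 for a, b in zip(xn, fx)]
    return xn

def toy_f(x):
    """Arbitrary stand-in for the trained network: pulls entries towards 6.0.
    (Illustrative only; the real f is the trained neural network.)"""
    return [6.0 for _ in x]

means = [5.0, 5.5, 6.5]
result = impute([None, 7.0, None], toy_f, means, cycles=5)
# The missing first entry starts at its column mean 5.0 and halves its
# distance to the fixed point 6.0 on every cycle: 5.0 -> ... -> 5.96875.
assert abs(result[0] - 5.96875) < 1e-12
```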
2.5 Other machine learning methods

We compare our neural network algorithm with a variety of other popular machine learning approaches from the literature. RF methods [5-9] are a popular method of QSAR analysis, building an ensemble of decision trees to predict individual assay results. Because decision trees require all their input data to be present when they are trained, it is not possible to build RF models using sparse bioactivity data as input, and RF must rely purely on chemical descriptors. We used the scikit-learn [29] implementation of the regression RF method.

For a comparison with a modern deep learning approach, we also built a conventional multi-target deep neural network (DNN) model [30] using TensorFlow. [31] The model took linear combinations of descriptors as inputs, with eight fully connected hidden layers of 512 hidden nodes, and output nodes that gave the predicted assay results. The ELU activation function was used for all layers, and the network was trained using Adam backpropagation with Nesterov momentum [32] and a masked loss function to handle missing values. A principal component analysis (PCA) was performed on the descriptors to select the subset of linear combinations of descriptors that captured 90% of the variance across the full descriptor set, to avoid overfitting of the DNN through the use of too many descriptors.

A popular method of analysing sparse databases is matrix factorisation, [33] where the matrix of compound-assay bioactivity values is approximately factorised into two lower-rank matrices that are then used to predict bioactivity values for new compounds. Matrix factorisation was popularised through its inclusion in the winning entry of the 2009 Netflix Prize. [34] We used the modern Collective Matrix Factorisation (CMF) [35,36] implementation of matrix factorisation, which makes effective use of the available chemical descriptors as well as bioactivity data, with separate latent features specialising in handling the descriptors.

We also compare to the profile-QSAR 2.0 method of Martin et al., [13] which builds a linear partial least squares (PLS) model of assay bioactivities from the predictions of random forest models for each assay individually. In the 2.0 version of the profile-QSAR method the RF predictions for an assay are not used as input to the PLS model for that assay.

3 Imputing assay bioactivities

We present two tests of the performance of the deep learning formalism to impute assay bioactivity values. In each case we use disjoint training and validation data to obtain a true statistical measure, the coefficient of determination, for the quality of the trained models.

3.1 Adrenergic receptors

We first present a case study using the Adrenergic data set described in Section 2.1. We train two classes of model: the first uses complete compound descriptor information to predict the bioactivity values, and the second class uses both the chemical descriptors and also the bioactivity-bioactivity correlations.

We first train a neural network to take only chemical descriptors and predict assay bioactivities. This approach is similar to traditional QSAR approaches, although it offers the advantage of being able to indirectly learn the relationships between assay bioactivities through the iterative cycle described in Figure 2. We train the neural network providing as input the N descriptors with the highest average absolute Pearson correlation against the five targets, with N varying between 0 and the full set of 320 descriptors. The grey line in Figure 3 shows that the neural network, predicting based purely on descriptors, achieves a peak R² = 0.60 ± 0.03 against the assays when using 50 descriptors: fewer descriptors do not provide a sufficient basis set, whereas more descriptors over-fit the data.
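The descriptor-ranking step, selecting the N descriptors with the highest average absolute Pearson correlation against the targets, can be sketched as follows (missing-value handling omitted for brevity; not the authors' code):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def top_descriptors(descriptors, activities, n):
    """Rank descriptor columns by mean |Pearson r| across all activity
    columns and return the indices of the n strongest."""
    scores = []
    for d, column in enumerate(descriptors):
        mean_abs_r = sum(abs(pearson(column, a)) for a in activities) / len(activities)
        scores.append((mean_abs_r, d))
    scores.sort(reverse=True)
    return [d for _, d in scores[:n]]

# Descriptor 0 tracks the activity; descriptor 1 is unrelated noise.
desc = [[1.0, 2.0, 3.0, 4.0], [0.3, -1.2, 0.9, 0.1]]
act = [[2.1, 3.9, 6.2, 8.0]]
assert top_descriptors(desc, act, 1) == [0]
```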
Because the neural network does not require the full set of chemical descriptors to provide a high-quality fit, we can focus attention on the key descriptors, and hence chemical features, that influence bioactivity against these targets. [37] We compare the neural network result to traditional random forest, using the same descriptor sets, which achieves a similar value of R² = 0.59 ± 0.02 using 100 descriptors.

We next train a fresh neural network, but include the possibility of bioactivity-bioactivity correlations. With a total of 5 assays, this allows up to 4 additional input values per target, as bioactivity values for every other assay are used as input when present (although in the majority of cases they are missing). It is not possible to use this assay bioactivity data as input to a RF approach, because the data is sparse and RF methods require complete input information. However, in Figure 3 we see that the neural network's peak accuracy increases to R² = 0.71 ± 0.03 with 50 descriptors. We now achieve a significantly better quality of fit than RF (with one-tailed Welch's t-test p = 3 × 10⁻⁴) due to the strong bioactivity-bioactivity correlations present in the data. The neural network is able to successfully identify these stronger bioactivity-bioactivity relations, without them being swamped by the numerous but weaker descriptor-bioactivity correlations.

We see the value of the bioactivity-bioactivity correlations particularly with zero descriptors, where the neural network achieves R² = 0.35 ± 0.03 due solely to bioactivity-bioactivity correlations. Random forest is not able to make predictions at all without any descriptors being present, as it cannot take the sparse bioactivity data as input, and so R² = 0. The ability to fit the data better than a leading QSAR method provides a solid platform for use of this neural network to impute assay bioactivity values.

Figure 3: The coefficient of determination for predicting the activity of the adrenergic receptors with number of chemical descriptors. The magenta line is when the neural network is trained with both the activities and descriptors present, the grey line with just the descriptors, and the cyan line is for random forest. Error bars represent the standard error in the mean R² over five-fold cross-validation.

3.2 Kinase data set

We now present a case study on the Kinase data set proposed as an exemplar for benchmarking predictive imputation methods, [13] as described in Section 2.1. In this data set the validation data comprised the outliers from a clustering procedure, realistically representing the exploration of new chemical space. The best coefficient of determination achieved by a method in the literature is R² = 0.434 ± 0.009 by the profile-QSAR 2.0 method, [13] which we re-implemented for this comparison. The DNN multi-target model discussed in Section 2.5 achieved R² = 0.11 ± 0.01, the CMF method achieved R² = −0.11 ± 0.01, and a conventional RF QSAR approach achieved only R² = −0.19 ± 0.01, a result which is worse than random due to the extrapolation in chemical space required to reach the test set points.

Using our deep neural network we predict the assay bioactivity values and also the uncertainties in the predictions. With 100% of the predictions accepted, irrespective of the reported confidence, the neural network attains R² = 0.445 ± 0.007, a significant improvement over the DNN, CMF, and RF approaches and similar to the profile-QSAR 2.0 method result.

However, access to the uncertainties in the predictions gives us more knowledge about the neural network results. In particular, we can discard predictions carrying large uncertainty, and trust only those with smaller uncertainty. This lets us focus on the most confident predictions only, at the expense of reporting fewer total predictions.
When this is done, the quality of the remaining neural network predictions increases, as shown in Figure 4, demonstrating that the neural network is able to accurately and truthfully inform us about the uncertainties in its predictions; the confidence of predictions is correlated with their accuracy. The coefficient of determination reaches values of R² > 0.9, demonstrating effectively perfect predictions, when we complete only the most confident 1% of the data. We note that this focus on the most confident predictions, and corresponding increase in accuracy, is post-processing: only one model is trained, and the desired level of confidence can be specified and used to return only sufficiently accurate results.

The neural network is significantly more accurate than the DNN, CMF, and RF methods even when 100% of the predictions are accepted (with p-values 3 × 10⁻⁶⁶, 2 × 10⁻¹⁰², and 2 × 10⁻¹⁰⁷ respectively), and is significantly more accurate than pQSAR 2.0 when only the least confident 3% of predictions are discarded (p = 3 × 10⁻⁴). As shown in Figure 4, the accuracy improvement over the other methods increases substantially as a smaller fraction of the predictions is accepted.

The achieved R² > 0.9 exceeds the level of R² = 0.7 that is often taken as indicating accurate, reliable predictions in the presence of experimental uncertainty. In fact, the most confident 50% of the neural network's predictions all have R² > 0.7, permitting a nine-fold increase in the number of accurate predictions that can be used for further analysis, relative to the original sparse experimental measurements.

This high accuracy is achieved after approximately 120 core hours of training. The time to validate the data set is 0.1 ms per compound for the neural network, versus 10 ms per compound, 100 times longer, for the traditional random forest approach. This acceleration in generating predictions further enhances the real-world applicability of the neural network approach.

Figure 4: The coefficient of determination for predicting the activity of the clustered Kinase data set with percentage of data predicted. The cyan point is for the random forest approach, the blue point is the collective matrix factorisation (CMF) method, the dark green point is the deep neural network (DNN) approach, the orange point is the profile-QSAR 2.0 method, and the magenta line is the neural network proposed in this work. The magenta line shows that the accuracy of the neural network predictions increases when focussing on the most confident predictions, at the expense of imputing only a proportion of the missing data. This confirms that the reported confidences in the predictions correlate strongly with their accuracy. Error bars represent the standard error in the mean R² value over all 159 assays, and where not visible are smaller than the size of the points.
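The confidence filtering behind Figure 4 is, in essence, a sort on the reported uncertainties. The sketch below (not the authors' code) shows how discarding the least confident predictions raises the R² of what remains when uncertainties are well calibrated:

```python
def r_squared(y_obs, y_pred):
    """Coefficient of determination, as defined in Section 2.2."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((p - o) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot

def r_squared_most_confident(y_obs, y_pred, uncertainty, fraction):
    """Keep only the given fraction of predictions with the smallest
    reported uncertainty, then score what remains."""
    order = sorted(range(len(y_obs)), key=lambda i: uncertainty[i])
    keep = order[: max(2, int(fraction * len(order)))]
    return r_squared([y_obs[i] for i in keep], [y_pred[i] for i in keep])

# Well-calibrated toy data: small uncertainty goes with small error.
obs  = [5.0, 6.0, 7.0, 8.0, 9.0, 4.0]
pred = [5.1, 5.9, 7.1, 6.5, 10.8, 5.9]
unc  = [0.1, 0.1, 0.1, 1.0, 1.0, 1.0]
assert r_squared_most_confident(obs, pred, unc, 0.5) > r_squared(obs, pred)
```

Note that, as in the paper, this is pure post-processing: only one set of predictions is produced, and the confidence threshold is chosen afterwards.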
Page 9 of 14 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Journal of Chemical Information and Modeling
3.2.1 Analysis

It is informative to analyse the results that our neural network approach is able to calculate accurately, and compare this to preconceptions of how the algorithm functions. For example, as a data-driven approach, it might be assumed that the assays with the most training data would be most accurately predicted by the neural network. However, as shown in Figure 5, this is not the case; although the assay with least training data is that predicted least accurately, there is in general no correlation between the accuracy of the neural network's predictions and the amount of training data available to the algorithm. In particular, the two assays with most training data are relatively poorly captured by the neural network, with R² < 0.2 in both cases.

Figure 5: The coefficient of determination measured for each of the 159 kinase assays, plotted against the percentage of the data for that assay present in the training set.
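The comparison underlying Figure 5 amounts to checking for a correlation between each assay's training-set coverage and its test-set R². A minimal sketch, with invented numbers standing in for the 159 kinase assays:

```python
import numpy as np

# Hypothetical per-assay summary: the fraction of each assay's data that
# fell in the training set, and the R^2 achieved on that assay's test data.
# (Illustrative values only, not the paper's results.)
rng = np.random.default_rng(1)
train_fraction = rng.uniform(0.05, 0.95, size=159)
r2_per_assay = np.clip(rng.normal(0.45, 0.2, size=159), -0.2, 0.95)

# Correlation between data availability and predictive accuracy; here the
# two are drawn independently, so rho is near zero, mirroring the paper's
# finding that more training data does not imply a better-predicted assay.
rho = np.corrcoef(train_fraction, r2_per_assay)[0, 1]
print(f"correlation(training fraction, per-assay R2) = {rho:.2f}")
```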
Likewise, the most confident predictions are not for compounds `closest' to those in the training set. The degree of separation can be measured in terms of the Euclidean distance between the points in the multi-dimensional space of descriptors used in the model. A representative example assay's data (ChEMBL assay 688660) is shown in Figure 6, where the training points (grey crosses) and test points (coloured points) are depicted in a 2-dimensional t-distributed stochastic neighbour embedding (t-SNE) generated using the StarDrop software package.38 The levels of predictive confidence are fairly uniform with distance from the training data, confirming the algorithm's ability to confidently predict test points that are relatively far from the clusters of training points. In addition to this analysis, the Euclidean distance between every test point and its nearest neighbour training point was taken for all assays. This measure showed no correlation with the network's uncertainty or error, indicating that the neural network is operating beyond a nearest-neighbour approach in descriptor space, by exploiting assay-assay correlations that are carried across into assay-descriptor space.

Figure 6: A 2-dimensional t-SNE embedding of the input descriptor space for ChEMBL assay 688660. The grey crosses show the training data and the coloured points show the test data, with colour indicating the uncertainty estimate of the network in its predictions, where red indicates zero uncertainty and yellow a high uncertainty of 1 log unit.
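The nearest-neighbour-distance check described above can be sketched as follows. The descriptor matrices and uncertainty values here are invented stand-ins for a single assay, not the paper's data; a 2-D picture in the style of Figure 6 could then be drawn from the same matrices with, e.g., sklearn.manifold.TSNE.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical descriptor matrices for one assay (330 descriptors used
# as a stand-in dimensionality) and per-prediction uncertainties.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 330))
X_test = rng.normal(size=(100, 330))
uncertainty = rng.uniform(0.1, 1.0, size=100)   # model's sigma per test point

# Euclidean distance from each test compound to its nearest training compound.
nn = NearestNeighbors(n_neighbors=1, metric="euclidean").fit(X_train)
dists, _ = nn.kneighbors(X_test)
nearest_dist = dists[:, 0]

# The paper's observation is that this distance does NOT correlate with the
# network's uncertainty; with independent stand-in data, corr is near zero.
corr = np.corrcoef(nearest_dist, uncertainty)[0, 1]
print(f"correlation(nearest-neighbour distance, uncertainty) = {corr:.2f}")
```

A correlation near zero here is the signature of a model whose confidence is driven by more than proximity in descriptor space; a plain nearest-neighbour regressor would instead show a strong positive trend.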
3.2.2 Summary

We have shown that the neural network presented delivers similar quality predictions for assay bioactivity to the profile-QSAR 2.0 method when considering the full test set, and that these methods outperform QSAR methods, including modern DNNs, as well as matrix factorisation. In addition, a key advantage is that the neural network gives accurate uncertainties on its output, allowing us to prioritise only well-predicted assay activities, enabling an increase in the coefficient of determination for the predictions of the realistic data set from R² = 0.445 up to R² > 0.9 for a subset of the data. The ability to tune accuracy with the amount of data predicted is an invaluable tool for scientists, fueling confidence in results and permitting a focus on only high-quality predictions. These most confident predictions are also not for the most complete assays or the test points most similar to the training data, showing that the neural network approach is able to learn more complex and powerful representations of the assay bioactivity data.

4 Conclusions

We have presented a new neural network imputation technique for predicting bioactivity, which can learn from incomplete bioactivity data to improve the quality of predictions by using correlations both between different bioactivity assays and between molecular descriptors and bioactivities. This results in a significant improvement in the accuracy of prediction over conventional QSAR models, even those using modern deep learning methods, particularly for challenging data sets representing an extrapolation to new compounds that are not well represented by the set used to train the model. This is representative of many chemistry optimisation projects which, by definition, explore new chemical space as the project proceeds. The method presented can also accurately estimate the confidence in each individual prediction, enabling attention to be focussed on only the most accurate results. It is important to base decisions in a discovery project on reliable results, to avoid wasted effort pursuing incorrectly selected compounds or missing opportunities by inappropriately discarding potentially valuable compounds.39 On the Kinase example data set, we demonstrated that 50% of the missing data could be filled in with R² > 0.7, which is considered to represent a high level of fidelity between prediction and experiment.

The ability to make simultaneous, accurate predictions across multiple assays will lend itself well to the problem of selectivity across multiple targets.40,41 The method is general, so it can apply beyond the binding assay data used in this analysis, for example to direct or downstream functional assays; the method can even make accurate predictions beyond pIC50 values, including physicochemical, absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties. Therefore, it has broad application for the identification of additional active compounds within a database, recognition of the most influential chemical properties, prediction of selectivity profiles, and the selection of compounds for progression to downstream ADMET assays.

Conflict of Interest Statement

Matthew Segall, Peter Hunt and Ben Irwin are employees of Optibrium Limited, which develops the StarDrop™ software used herein for analysis of the results. Thomas Whitehead and Gareth Conduit are employees of Intellegens Limited, which develops the Alchemite™ software used in this work.

Acknowledgement

Gareth Conduit acknowledges the financial support of the Royal Society and Gonville & Caius College. There is Open Access to this paper and data available at https://www.openaccess.cam.ac.uk.

Supporting Information Available

The following files are available free of charge:

• Adrenergic_dataset.csv: Dataset used in Section 3.1

• Kinase_training_w_descriptors.csv: Training dataset used in Section 3.2

• Kinase_test_w_descriptors.csv: Test dataset used in Section 3.2
This information is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Gaulton, A.; Bellis, L.; Bento, A.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Akhtar, R.; Michalovich, D.; Al-Lazikani, B.; Overington, J. ChEMBL: A Large-scale Bioactivity Database for Chemical Biology and Drug Discovery. Nucleic Acids Res. 2012, 40, 1100.

(2) Bento, A.; Gaulton, A.; Hersey, A.; Bellis, L.; Chambers, J.; Davies, M.; Krüger, F.; Light, Y.; Mak, L.; McGlinchey, S.; Nowotka, M.; Papadatos, G.; Santos, R.; Overington, J. The ChEMBL Bioactivity Database: An Update. Nucleic Acids Res. 2014, 42, 1083.

(3) Cherkasov, A.; Muratov, E.; Fourches, D.; Varnek, A.; Baskin, I. I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y. C.; Todeschini, R.; Consonni, V.; Kuz'min, V. E.; Cramer, R.; Benigni, R.; Yang, C.; Rathman, J.; Terfloth, L.; Gasteiger, J.; Richard, A.; Tropsha, A. QSAR Modeling: Where Have You Been? Where Are You Going To? J. Med. Chem. 2014, 57, 4977–5010.

(4) Wold, S.; Sjöström, M.; Eriksson, L. In The Encyclopedia of Computational Chemistry; Schleyer, P., Allinger, N., Clark, T., Gasteiger, J., Kollman, P., Schaefer III, H., Schreiner, P., Eds.; John Wiley and Sons: Chichester, UK, 1999; pp 1–16.

(5) Gao, C.; Cahya, S.; Nicolaou, C.; Wang, J.; Watson, I.; Cummins, D.; Iversen, P.; Vieth, M. Selectivity Data: Assessment, Predictions, Concordance, and Implications. J. Med. Chem. 2013, 56, 6991.

(6) Schürer, S. C.; Muskal, S. M. Kinome-wide Activity Modeling from Diverse Public High-quality Data Sets. J. Chem. Inf. Model. 2013, 53, 27.

(7) Christmann-Franck, S.; van Westen, G.; Papadatos, G.; Beltran Escudie, F.; Roberts, A.; Overington, J.; Domine, D. Unprecedently Large-Scale Kinase Inhibitor Set Enabling the Accurate Prediction of Compound-Kinase Activities: A Way toward Selective Promiscuity by Design? J. Chem. Inf. Model. 2016, 56, 1654.

(8) Subramanian, V.; Prusis, P.; Xhaard, H.; Wohlfahrt, G. Predictive Proteochemometric Models for Kinases Derived from 3D Protein Field Based Descriptors. MedChemComm 2016, 7, 1007.

(9) Merget, B.; Turk, S.; Eid, S.; Rippmann, F.; Fulle, S. Profiling Prediction of Kinase Inhibitors: Toward the Virtual Assay. J. Med. Chem. 2017, 60, 474.

(10) Doucet, J.; Xia, H.; Panaye, A.; Fan, B. Nonlinear SVM Approaches to QSPR/QSAR Studies and Drug Design. Curr. Comput.-Aided Drug Des. 2007, 3, 263–289.

(11) Obrezanova, O.; Csányi, G.; Gola, J.; Segall, M. Gaussian Processes: A Method for Automatic QSAR Modeling of ADME Properties. J. Chem. Inf. Model. 2007, 47, 1847–1857.

(12) Martin, E.; Mukherjee, P.; Sullivan, D.; Jansen, J. Profile-QSAR: A Novel Meta-QSAR Method that Combines Activities across the Kinase Family to Accurately Predict Affinity, Selectivity, and Cellular Activity. J. Chem. Inf. Model. 2011, 51, 1942.

(13) Martin, E.; Polyakov, V. R.; Tian, L.; Perez, R. Profile-QSAR 2.0: Kinase Virtual Screening Accuracy Comparable to Four-Concentration IC50s for Realistically Novel Compounds. J. Chem. Inf. Model. 2017, 57, 2077.

(14) Lo, Y.-C.; Rensi, S. E.; Torng, W.; Altman, R. B. Machine Learning in Chemoinformatics and Drug Discovery. Drug Discovery Today 2018, 23, 1538–1546.

(15) Chen, H.; Engkvist, O.; Wang, Y.; Olivecrona, M.; Blaschke, T. The Rise of Deep Learning in Drug Discovery. Drug Discovery Today 2018, 23, 1241–1250.

(16) Mayr, A.; Klambauer, G.; Unterthiner, T.; Steijaert, M.; Wegner, J. K.; Ceulemans, H.; Clevert, D.-A.; Hochreiter, S. Large-scale Comparison of Machine Learning Methods for Drug Target Prediction on ChEMBL. Chemical Science 2018, 9, 5441.

(17) Popova, M.; Isayev, O.; Tropsha, A. Deep Reinforcement Learning for De Novo Drug Design. Science Advances 2018, 4, eaap7885.

(18) Conduit, B.; Jones, N.; Stone, H.; Conduit, G. Design of a Nickel-base Superalloy with a Neural Network. Materials and Design 2017, 131, 358.

(19) Conduit, B.; Jones, N.; Stone, H.; Conduit, G. Probabilistic Design of a Molybdenum-base Alloy using a Neural Network. Scripta Materialia 2018, 146, 82.

(20) Verpoort, P.; MacDonald, P.; Conduit, G. Materials Data Validation and Imputation with an Artificial Neural Network. Computational Materials Science 2018, 147, 176.

(21) Ertl, P.; Rohde, B.; Selzer, P. Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and Its Application to the Prediction of Drug Transport Properties. J. Med. Chem. 2000, 43, 3714–3717.

(22) Abraham, M.; McGowan, J. The Use of Characteristic Volumes to Measure Cavity Terms in Reversed-phase Liquid Chromatography. Chromatographia 1987, 23, 243–246.

(23) Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.

(24) Heskes, T. Practical Confidence and Prediction Intervals. Advances in Neural Information Processing Systems 9. 1997; pp 176–182.

(25) Papadopoulos, G.; Edwards, P.; Murray, A. Confidence Estimation Methods for Neural Networks: A Practical Comparison. IEEE Transactions on Neural Networks 2001, 12, 1278.

(26) Krishnan, T.; McLachlan, G. The EM Algorithm and Extensions; Wiley, 2008.

(27) He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016; pp 770–778.

(28) Mahfoud, S.; Goldberg, D. Parallel Recombinative Simulated Annealing: A Genetic Algorithm. Parallel Computing 1995, 21, 1.

(29) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 2011, 12, 2825–2830.

(30) Xu, Y.; Ma, J.; Liaw, A.; Sheridan, R. P.; Svetnik, V. Demystifying Multitask Deep Neural Networks for Quantitative Structure-Activity Relationships. J. Chem. Inf. Model. 2017, 57, 2490–2504.

(31) Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; Zheng, X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015; Software available from https://www.tensorflow.org/.

(32) Nesterov, Y. A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k^2). Soviet Mathematics Doklady 1983, 372–376.

(33) Mehta, R.; Rana, K. A Review on Matrix Factorization Techniques in Recommender Systems. 2017 2nd International Conference on Communication Systems, Computing and IT Applications (CSCITA). 2017; pp 269–274.

(34) Töscher, A.; Jahrer, M.; Bell, R. M. The BigChaos Solution to the Netflix Grand Prize. 2009.

(35) Singh, A. P.; Gordon, G. J. Relational Learning via Collective Matrix Factorization. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA, 2008; pp 650–658.

(36) Cortes, D. Cold-start Recommendations in Collective Matrix Factorization. CoRR 2018, abs/1809.00366.

(37) Žuvela, P.; Macur, K.; Liu, J. J.; Bączek, T. Molecular Descriptor Subset Selection in Theoretical Peptide Quantitative Structure-Retention Relationship Model Development Using Nature-Inspired Optimization Algorithms. Anal. Chem. 2015, 87, 9876–9883.

(38) StarDrop. https://www.optibrium.com/stardrop/, Accessed: 2018-10-26.

(39) Segall, M.; Champness, E. The Challenges of Making Decisions Using Uncertain Data. J. Comput.-Aided Mol. Des. 2015, 29, 809–816.

(40) Sciabola, S.; Stanton, R. V.; Wittkopp, S.; Wildman, S.; Moshinsky, D.; Potluri, S.; Xi, H. Predicting Kinase Selectivity Profiles Using Free-Wilson QSAR Analysis. J. Chem. Inf. Model. 2008, 48, 1851–1867.

(41) Kothiwale, S.; Borza, C.; Pozzi, A.; Meiler, J. Quantitative Structure-Activity Relationship Modeling of Kinase Selectivity Profiles. Molecules 2017, 22.