Imputation of Assay Bioactivity Data Using Deep Learning


Computational Chemistry

Imputation of Assay Bioactivity Data using Deep Learning Thomas Whitehead, Ben Irwin, Peter A. Hunt, Matthew Segall, and Gareth Conduit J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00768 • Publication Date (Web): 12 Feb 2019 Downloaded from http://pubs.acs.org on February 14, 2019


Journal of Chemical Information and Modeling

Imputation of Assay Bioactivity Data using Deep Learning

T.M. Whitehead,∗,† B.W.J. Irwin, P. Hunt, M.D. Segall, and G.J. Conduit

†Intellegens, Eagle Labs, Chesterton Road, Cambridge, CB4 3AZ, United Kingdom
‡Optibrium, F5-6 Blenheim House, Cambridge Innovation Park, Denny End Road, Cambridge, CB25 9PB, United Kingdom
¶Cavendish Laboratory, University of Cambridge, J.J. Thomson Avenue, Cambridge, CB3 0HE, United Kingdom

E-mail: [email protected]

Abstract

We describe a novel deep learning neural network method and its application to impute assay pIC50 values. Unlike conventional machine learning approaches, this method is trained on sparse bioactivity data as input, typical of that found in public and commercial databases, enabling it to learn directly from correlations between activities measured in different assays. In two case studies on public domain data sets we show that the neural network method outperforms traditional quantitative structure-activity relationship (QSAR) models and other leading approaches. Furthermore, by focussing on only the most confident predictions the accuracy is increased to $R^2 > 0.9$ using our method, as compared to $R^2 = 0.43$ using the leading profile-QSAR approach.

1 Introduction

Accurate compound bioactivity and property data are the foundations of decisions on the selection of hits as the starting point for discovery projects, or the progression of compounds through hit to lead and lead optimisation to candidate selection. However, in practice, the experimental data available on potential compounds of interest are sparse. High-throughput screens may be run on large screening collections, but these are costly, and thus applied infrequently, and the throughput of an assay usually comes with a trade-off against the quality of the measured data. As discovery projects progress and new compounds are synthesised, the increasing cost of generating high-quality data means that only the most promising compounds are advanced to these late-stage studies.

If one considers all of the compounds in a large pharmaceutical company's corporate collection and the assay endpoints that have been measured, only a small fraction of the possible compound-assay combinations have been measured in practice. Public domain databases are also sparsely populated; for example, the ChEMBL [1,2] data set is just 0.05% complete. The implication of this is that a vast trove of information would be revealed if only a small fraction of these missing data could be filled in with high-quality results in a cost-effective way. New hits for projects targeting existing biological targets of interest and high-quality compounds, overlooked during optimisation projects, could be identified. Furthermore, compounds with results from early assays could be selected for progression with greater confidence if downstream results could be accurately predicted.

A common approach for prediction of compound bioactivities is the development of quantitative structure-activity relationship (QSAR)
models [3]. These are generated using existing data to identify correlations between easily calculated characteristics of compound structures, known as descriptors, and their biological activities or properties. The resulting models can then be applied to new compounds that have not yet been experimentally tested, to predict the outcome of the corresponding assays. A wide range of statistical methods have been applied to build QSAR models, from simple linear regression methods such as partial least squares [4] to more sophisticated machine learning approaches such as random forests (RF) [5–9], support vector machines [10] and Gaussian processes [11]. Another approach is the profile-QSAR (pQSAR) method proposed by Martin et al. [12,13], which uses a hierarchical approach to build a model of a bioactivity by using as inputs the predictions from QSAR models of multiple bioactivities that may be correlated. Recently, the application of advances in deep learning has been explored for generation of QSAR models [14]; while small improvements in the accuracy of predictions have been found, these methods have not generally resulted in a qualitative step forward for activity predictions [15–17]. One advantage of deep learning methods is the ability to train models against multiple endpoints simultaneously, so-called multi-target prediction. This enables the model to 'learn' where a descriptor correlates with multiple endpoints and hence improve the accuracy for all of the corresponding endpoints.

However, the sparse experimental data could reveal more information regarding the correlations between the endpoints of interest, if these could be used as inputs to a predictive model. Conventional machine learning methods cannot use this information as inputs because the bioactivity data are often incomplete, and so cannot be relied on as input. In this paper we present a novel deep learning framework, previously applied to materials discovery [18–20], that can learn from and exploit information that is sometimes missing, unlike other contemporary machine learning methods. A further benefit of the proposed method is that it can estimate the uncertainty in each individual prediction, allowing it to improve the quality of predictions by focussing on only the most confident results. We will compare the performance of our method to impute bioactivities with a RF, a commonly applied and robust QSAR machine learning method, a modern multi-target deep learning method, a leading matrix factorisation approach, and the second-generation pQSAR 2.0 technique [13].

In Section 2 we present the underlying deep learning methodology to handle missing data and estimate uncertainty, along with details of the data sets used in this study, the accuracy metric, and other machine learning methods applied for comparison. Then in Section 3 we present two examples to assess the performance of the algorithm against current methods. Finally, in Section 4 we discuss our findings and potential applications of the results.

2 Methodology

The goal for the neural network tool is to predict and impute assay bioactivity values, by learning both the correlations between chemical descriptors and assay bioactivity values and also the correlations between the assay bioactivities. In Subsection 2.1 we introduce the data sets used to validate the approach, before turning in the following subsections to the description of the neural network method itself.

2.1 Data sets

Two data sets were used to train and validate the models: a set containing activities derived from five adrenergic receptor assays (hereafter described as the "Adrenergic set") and a data set comprised of results from 159 kinase assays proposed by Martin et al. [13] as a challenging benchmark for machine learning methods (the "Kinase set"). These data sets are summarised in Table 1. All of the data were sourced from binding assays reported in the ChEMBL database [1,2] and the assay data represented as pIC50 values (the negative log of the IC50 in molar units). In the case of the Adrenergic set, measurements from different assays were combined for each target activity and, where multiple values were available for the same compound, the highest pIC50 value was used, representing a 'worst case' scenario for selectivity. In the case of the Kinase data set, each activity was derived from a single assay, as defined in ChEMBL.

Table 1: A summary of the data sets used in the examples presented herein. The table shows the data set, the number of compounds and assays each contains, and the proportion of the compound-assay values that are filled.

Data set      Compounds   Assays   Filled
Adrenergic    1731        5        37.5%
Kinase        13998       159      6.3%

320 molecular descriptors were used to characterise the compounds in the data sets. These comprised whole-molecule properties, such as the calculated octanol:water partition coefficient (logP), molecular weight, topological polar surface area [21] and McGowan volume [22], as well as counts of substructural fragments represented as SMARTS patterns [23].

In the case of the Adrenergic set, we employed a five-fold cross-validation approach for building models and assessing their resulting accuracy. The compounds in the data set were randomly split into five disjoint subsets of equal size, the models were trained using four of the subsets, and then their accuracy evaluated on the remaining subset. We repeated this process using each of the subsets for testing, so that each compound was used as a test case for the tool. The Adrenergic data set is provided with the supporting information for this paper.

The Kinase set was provided in the supporting information of the paper by Martin et al. [13] as a challenging benchmark for machine learning methods. In this case, the data set was split by Martin et al. into independent training and test sets. The data were initially clustered for each assay and the members of the clusters used as the training set, leaving the outliers from this clustering procedure as the data against which the resulting models were tested. This procedure means that the test data is not representative of the data used to train the models, making this a difficult test of a machine learning method's ability to extrapolate outside of the chemical space on which it was trained. Martin et al. described this as a 'realistic' test set, designed to be more representative of real working practices in an active chemistry optimisation project, where new compounds are continuously proposed that extend beyond the chemical space that has previously been explored. Because the clustering was carried out on a per-assay basis, some compounds appear in both the train and test sets, but the assay data for each compound is split between the sets, so that none of the same assay/compound pairs appear in both the train and test set and the validation is against a robust, disjoint test case. The Kinase data set is provided with the supporting information for this paper.

2.2 Performance Metric

To assess the performance of the models we use the coefficient of determination $R^2$ for each assay in the test set:

$$R^2 = 1 - \frac{\sum_i \left(y_i^{\mathrm{pred}} - y_i^{\mathrm{obs}}\right)^2}{\sum_i \left(y_i^{\mathrm{obs}} - \bar{y}^{\mathrm{obs}}\right)^2},$$

where $y_i^{\mathrm{obs}}$ is the $i$th observed assay value and $y_i^{\mathrm{pred}}$ is the corresponding prediction. This is a more stringent test than the commonly used squared Pearson correlation coefficient, which is a measure of the fit to the best fit line between the predicted and observed values, while the coefficient of determination is a measure of the fit to the perfect identity line $y_i^{\mathrm{pred}} = y_i^{\mathrm{obs}}$. By definition, the coefficient of determination is less than or equal to the squared Pearson correlation coefficient.

For each of the methods, we report the mean of the $R^2$ across all of the assays in the test set to give an overall value.
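
The difference between the coefficient of determination of Section 2.2 and the squared Pearson correlation coefficient can be made concrete with a short numerical sketch. The code below is illustrative only and is not the authors' implementation: the function names and the toy values are assumptions, and NaN is assumed to mark unmeasured compound-assay pairs.

```python
import numpy as np

def coefficient_of_determination(y_obs, y_pred):
    """R^2 = 1 - sum((pred - obs)^2) / sum((obs - mean(obs))^2)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = np.sum((y_pred - y_obs) ** 2)
    total = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - residual / total

def squared_pearson(y_obs, y_pred):
    """Squared Pearson correlation: fit to the best-fit line, not the identity line."""
    return np.corrcoef(y_obs, y_pred)[0, 1] ** 2

def mean_r2_over_assays(obs, pred):
    """Mean per-assay R^2; obs and pred are (compounds x assays), NaN = unmeasured."""
    scores = []
    for a in range(obs.shape[1]):
        measured = ~np.isnan(obs[:, a])
        scores.append(coefficient_of_determination(obs[measured, a], pred[measured, a]))
    return float(np.mean(scores))

# Toy values (not data from the paper): predictions offset by one log unit.
y_obs = np.array([5.0, 6.0, 7.0, 8.0])
y_pred = y_obs + 1.0
r2 = coefficient_of_determination(y_obs, y_pred)   # penalises the constant offset
r2_pearson = squared_pearson(y_obs, y_pred)        # blind to the constant offset
```

In this toy case the predictions are perfectly correlated with the observations, so the squared Pearson coefficient is 1, while the systematic offset from the identity line lowers the coefficient of determination to 0.2.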

2.3 Neural network formalism

We now turn to the neural network formalism. This algorithm is able to automatically identify the link between assay bioactivity values, and use the bioactivity data of other compounds to guide the extrapolation of the model, as well as using molecular descriptors as design variables. Furthermore, the method can estimate uncertainties in its predictions. The neural network builds on the formalism used to design nickel-base superalloys, molybdenum alloys, and identify erroneous entries in materials databases [18–20]. We describe here the core neural network and the first novel aspect, the ability to estimate the uncertainty in the predictions, before Section 2.4 details the second novel part of the algorithm: how to handle missing data, necessary to capture bioactivity-bioactivity correlations.

Each input vector $x = (x_1, \ldots, x_{A+D})$ to the neural network contains values for $D = 320$ molecular descriptors and $A = 5$ (for the Adrenergic data set) or $A = 159$ (for the Kinase data set) bioactivity values. The ordering of the elements of the input is the same for each compound, but otherwise unimportant. The output $(y_1, \ldots, y_{A+D})$ of the neural network consists of the original descriptors and the predicted bioactivities: only the elements $(y_1, \ldots, y_A)$ corresponding to predicted bioactivities are used for evaluating the network accuracy.

The neural network itself is a linear superposition of hyperbolic tangents

$$f : (x_1, \ldots, x_i, \ldots, x_{A+D}) \mapsto (y_1, \ldots, y_j, \ldots, y_{A+D})$$

with

$$y_j = \sum_{h=1}^{H} C_{hj}\,\eta_{hj} + D_j, \quad \text{and} \quad \eta_{hj} = \tanh\left(\sum_{i=1}^{I} A_{ihj}\,x_i + B_{hj}\right).$$

This neural network has a single layer of hidden nodes $\eta_{hj}$ with parameters $\{A_{ihj}, B_{hj}, C_{hj}, D_j\}$ as shown in Figure 1. Each property $y_j$ for $1 \le j \le A$ is predicted separately. We set $A_{jhj} = 0$ so the network will predict $y_j$ without knowledge of $x_j$. Typically around five hidden nodes $\eta_{hj}$ per output variable gives the best-fitting neural network. We use hyperbolic tangent activation functions to constrain the magnitude of $\eta_{hj}$, giving the weights $C_{hj}$ sole responsibility for the amplitude of the output response.

Figure 1: The neural network. The graphs show how the outputs for $y_1$ (top) and $y_2$ (bottom) are computed from all the inputs; similar graphs can be drawn for all other $y_j$ to compute all the predicted properties. A linear combination (gray lines) of the given properties (red) are taken by the hidden nodes (blue), a non-linear tanh function, and a linear combination (gray lines) gives the predicted property (green).
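
The single-hidden-layer form of Section 2.3 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the array layout (`A[i, h, j]`, `B[h, j]`, `C[h, j]`, `D[j]`), the function names and the random parameters are all hypothetical, and the bioactivities are assumed to occupy the first entries of the input vector. The point of the sketch is the constraint $A_{jhj} = 0$, which stops the prediction of $y_j$ from depending on $x_j$.

```python
import numpy as np

def predict_property(x, j, A, B, C, D):
    """y_j = sum_h C[h, j] * tanh(sum_i A[i, h, j] * x[i] + B[h, j]) + D[j]."""
    weights = A[:, :, j].copy()            # shape (inputs, hidden nodes)
    weights[j, :] = 0.0                    # enforce A_jhj = 0: y_j never sees x_j
    eta = np.tanh(x @ weights + B[:, j])   # hidden nodes eta_hj, shape (hidden nodes,)
    return C[:, j] @ eta + D[j]

def network(x, A, B, C, D):
    """Predict each output property separately, as described in the text."""
    n_outputs = A.shape[2]
    return np.array([predict_property(x, j, A, B, C, D) for j in range(n_outputs)])

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n_inputs, n_hidden, n_assays = 6, 5, 3
A = rng.normal(size=(n_inputs, n_hidden, n_assays))
B = rng.normal(size=(n_hidden, n_assays))
C = rng.normal(size=(n_hidden, n_assays))
D = rng.normal(size=n_assays)

x = rng.normal(size=n_inputs)
x_changed = x.copy()
x_changed[2] += 10.0                       # perturb only the third input

y = network(x, A, B, C, D)
y_changed = network(x_changed, A, B, C, D)
```

Changing `x[2]` leaves `y[2]` untouched while the other outputs move, which is the behaviour the zeroed weights are meant to guarantee.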

Twelve separate networks were trained on the data with different weights [18–20] and their variance taken to indicate the uncertainty in the predictions accounting for both experimental uncertainty in the underlying data and the uncertainty in the extrapolation of the training data [24,25]. This is conceptually similar to the approach taken to uncertainty estimation in ensemble models, although here the underlying model is a deep neural network and the uncertainty estimates generated accurately represent the observed errors in the predictions, including uncertainty due to extrapolation that is poorly captured by random forest (see also Section 3.2).

2.4 Handling incomplete data

Experimental data are often incomplete: bioactivity values are not known for every compound and assay, and moreover the set of missing bioactivities is different for each compound. However, there is information embedded within bioactivity-bioactivity correlations. A typical neural network formalism requires that each property is either an input or output of the network, and all inputs must be provided to obtain a valid output. In contrast, we treat both the molecular descriptors and also the assay bioactivities as both inputs and outputs of the neural network and adopt an expectation-maximization algorithm [26], where we first provide an estimate for the missing data, and use the neural network to iteratively improve that initial estimate.

Figure 2: The data imputation algorithm for the vector $x$ of the molecular descriptors and bioactivity values that has missing entries. We set $x^0 = x$, replacing all missing entries by averages across each assay, and then iteratively compute $x^{n+1}$ as a function of $x^n$ and $f(x^n)$ until we reach convergence after $n$ iterations.

The algorithm is shown in Figure 2. For any unknown bioactivities we first set missing values to the average of the bioactivity values present in the data set for that assay. With estimates for all values of the neural network we can then iteratively compute

$$x^{n+1} = \frac{x^n + f(x^n)}{2}.$$

The final predictions $(y_1, \ldots, y_A)$ are then the elements of this converged algorithm corresponding to the assay bioactivity predictions. The softening of the results by combining them with the existing predictions serves to prevent oscillations of the predictions, similar to the use of shortcut connections in ResNet [27]. Typically up to 5 iteration cycles were used to impute missing bioactivity values, using the same function $f$ (as defined in Section 2.3) in every cycle. After 5 cycles the coefficient of determination $R^2$ in training improved by less than 0.01, comparable to the accuracy of the approach, confirming that we had used sufficient iteration cycles to reach convergence.

The parameters $\{A_{ihj}, B_{hj}, C_{hj}, D_j\}$ in the function $f$ are then trained using simulated annealing [28] to minimize the least-square error of the predicted bioactivities $(y_1, \ldots, y_A)$ against the training data. At least $10^5$ training rounds were used to reach convergence.

Hyperparameters, in particular the number of hidden nodes per output, the number of iteration cycles, and the number of training rounds, were selected using random holdout validation on each training data set, without reference to the corresponding test set.
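
The imputation cycle of Section 2.4 can be sketched in a few lines. The code is a minimal illustration, not the authors' implementation: `impute` and `toy_network` are hypothetical names, NaN is assumed to mark missing entries, and the update $x^{n+1} = (x^n + f(x^n))/2$ is applied to every entry exactly as the formula is written, since the text does not say whether measured entries are treated differently. The stand-in for $f$ returns its "descriptor" column unchanged and predicts the "bioactivity" column from it, mimicking a network whose output contains the original descriptors and the predicted bioactivities.

```python
import numpy as np

def impute(x, f, n_cycles=5):
    """Start from per-column means, then iterate x^{n+1} = (x^n + f(x^n)) / 2.

    x: (compounds x properties) array, NaN for missing entries.
    f: callable mapping a complete array to predictions of the same shape.
    Returns the softened state x^n and the final predictions f(x^n).
    """
    column_means = np.nanmean(x, axis=0)
    x_n = np.where(np.isnan(x), column_means, x)   # x^0: missing entries -> averages
    for _ in range(n_cycles):
        x_n = 0.5 * (x_n + f(x_n))                 # softened update
    return x_n, f(x_n)

def toy_network(x):
    """Hypothetical stand-in for f: column 0 is passed through, column 1 is predicted."""
    return np.column_stack([x[:, 0], x[:, 0]])

# Toy data (not from the paper): the last compound has no measured bioactivity.
data = np.array([[1.0, 1.0],
                 [2.0, 2.0],
                 [3.0, np.nan]])
x_final, predictions = impute(data, toy_network)
```

The missing entry starts at the column average 1.5 and is pulled halfway towards the network prediction in each cycle, reaching 2.953125 after five cycles, while the returned prediction for it is 3.0.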

2.5 Other machine learning methods

We compare our neural network algorithm with a variety of other popular machine learning approaches from the literature. RF methods [5–9] are a popular method of QSAR analysis, building an ensemble of decision trees to predict individual assay results. Because decision trees require all their input data to be present when they are trained, it is not possible to build RF models using sparse bioactivity data as input, and RF must rely purely on chemical descriptors. We used the scikit-learn [29] implementation of the regression RF method.

For a comparison with a modern deep learning approach, we also built a conventional multi-target deep neural network (DNN) model [30] using TensorFlow [31]. The model took linear combinations of descriptors as inputs, with eight fully connected hidden layers with 512 hidden nodes, and output nodes that gave the predicted assay results. The ELU activation function was used for all layers, and the network was trained using Adam backpropagation with Nesterov momentum [32] and a masked loss function to handle missing values. A principal component analysis (PCA) was performed on the descriptors to select the subset of linear combinations of descriptors that captured 90% of the variance across the full descriptor set to avoid overfitting of the DNN through the use of too many descriptors.

A popular method of analysing sparse databases is matrix factorisation [33], where the matrix of compound-assay bioactivity values is approximately factorised into two lower-rank matrices that are then used to predict bioactivity values for new compounds. Matrix factorisation was popularised through its inclusion in the winning entry of the 2009 Netflix Prize [34]. We used the modern Collective Matrix Factorisation (CMF) [35,36] implementation of matrix factorisation, which makes effective use of the available chemical descriptors as well as bioactivity data, with separate latent features specialising in handling the descriptors.

We also compare to the profile-QSAR 2.0 method of Martin et al. [13], which builds a linear partial least squares (PLS) model of assay bioactivities from the predictions of random forest models for each assay individually. In the 2.0 version of the profile-QSAR method the RF predictions for an assay are not used as input to the PLS model for that assay.

3 Imputing assay bioactivities

We present two tests of the performance of the deep learning formalism to impute assay bioactivity values. In each case we use disjoint training and validation data to obtain a true statistical measure, the coefficient of determination, for the quality of the trained models.

3.1 Adrenergic receptors

We first present a case study using the Adrenergic data set described in Section 2.1. We train two classes of model: the first uses complete compound descriptor information to predict the bioactivity values, and the second class uses both the chemical descriptors and also the bioactivity-bioactivity correlations.

We first train a neural network to take only chemical descriptors and predict assay bioactivities. This approach is similar to traditional QSAR approaches, although it offers the advantage of being able to indirectly learn the relationships between assay bioactivities through the iterative cycle described in Figure 2. We train the neural network providing as input the $N$ descriptors with the highest average absolute Pearson correlation against the five targets, with $N$ varying between 0 and the full set of 320 descriptors. The grey line in Figure 3 shows that the neural network, predicting based purely on descriptors, achieves a peak $R^2 = 0.60 \pm 0.03$ against the assays when using 50 descriptors: fewer descriptors do not provide a sufficient basis set, whereas more descriptors over-fit the data.
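
The descriptor ranking used in Section 3.1, by average absolute Pearson correlation against the targets, might be computed along the following lines. This is a sketch under assumptions rather than the authors' procedure: the function name is hypothetical, each correlation is computed over only those compounds with a measured value for the assay in question (the text does not say how missing values were treated), and descriptor columns are assumed to be non-constant. The data are synthetic.

```python
import numpy as np

def rank_descriptors(descriptors, activities):
    """Order descriptor columns by mean |Pearson r| against the assay columns.

    descriptors: (compounds x D) complete array.
    activities:  (compounds x A) array, NaN for unmeasured values.
    Returns descriptor indices, most strongly correlated first.
    """
    n_desc, n_assays = descriptors.shape[1], activities.shape[1]
    abs_r = np.zeros((n_desc, n_assays))
    for a in range(n_assays):
        measured = ~np.isnan(activities[:, a])      # assumption: use measured pairs only
        for d in range(n_desc):
            r = np.corrcoef(descriptors[measured, d], activities[measured, a])[0, 1]
            abs_r[d, a] = abs(r)
    return np.argsort(-abs_r.mean(axis=1))

# Synthetic example: both assays depend on descriptor 2 only.
rng = np.random.default_rng(1)
descriptors = rng.normal(size=(60, 4))
activities = np.column_stack([
    2.0 * descriptors[:, 2] + 0.1 * rng.normal(size=60),
    -1.0 * descriptors[:, 2] + 0.1 * rng.normal(size=60),
])
activities[::5, 0] = np.nan                          # make the first assay sparse

ranking = rank_descriptors(descriptors, activities)
top_n = ranking[:2]                                  # the N = 2 highest-ranked descriptors
```

Taking the first $N$ entries of `ranking` gives the descriptor subset for a given $N$; in the synthetic example the informative descriptor is ranked first.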

The neural network not requiring the full set of chemical descriptors to provide a high-quality fit enables us to focus attention on the key descriptors, and hence chemical features, that influence bioactivity against these targets [37]. We compare the neural network result to traditional random forest, using the same descriptor sets, which achieves a similar value of $R^2 = 0.59 \pm 0.02$ using 100 descriptors.

Figure 3: The coefficient of determination for predicting the activity of the adrenergic receptors with number of chemical descriptors. The magenta line is when the neural network is trained with both the activities and descriptors present, the grey line with just the descriptors, and the cyan line is for random forest. Error bars represent the standard error in the mean $R^2$ over five-fold cross-validation.

We next train a fresh neural network but include the possibility of bioactivity-bioactivity correlations. With a total of 5 assays, this allows up to 4 additional input values per target as bioactivity values for every other assay are used as input when present (although in the majority of cases they are missing). It is not possible to use this assay bioactivity data as input to a RF approach, because the data is sparse and RF methods require complete input information. However, in Figure 3 we see that the neural network's peak accuracy increases to $R^2 = 0.71 \pm 0.03$ with 50 descriptors. We now achieve a significantly better quality of fit than RF (with one-tailed Welch's t-test $p = 3 \times 10^{-4}$) due to the strong bioactivity-bioactivity correlations present in the data. The neural network is able to successfully identify these stronger bioactivity-bioactivity relations, without them being swamped by the numerous but weaker descriptor-bioactivity correlations. We particularly see the value of the bioactivity-bioactivity correlations with zero descriptors, where the neural network achieves $R^2 = 0.35 \pm 0.03$ due solely to bioactivity-bioactivity correlations. Random forest is not able to make predictions at all without any descriptors being present, as it cannot take the sparse bioactivity data as input, and so $R^2 = 0$. The ability to fit the data better than a leading QSAR method provides a solid platform for use of this neural network to impute assay bioactivity values.

3.2 Kinase data set

We now present a case study on the Kinase data set proposed as an exemplar for benchmarking predictive imputation methods [13], as described in Section 2.1. In this data set the validation data comprised the outliers from a clustering procedure, realistically representing the exploration of new chemical space. The best achieved coefficient of determination by a method in the literature is $R^2 = 0.434 \pm 0.009$ by the profile-QSAR 2.0 method [13], which we re-implemented for this comparison. The DNN multi-target model discussed in Section 2.5 achieved $R^2 = 0.11 \pm 0.01$, the CMF method achieved $R^2 = -0.11 \pm 0.01$, and a conventional RF QSAR approach achieved only $R^2 = -0.19 \pm 0.01$, a result which is worse than random due to the extrapolation in chemical space required to reach the test set points.

Using our deep neural network we predict the assay bioactivity values and also the uncertainties in the predictions. With 100% of the predictions accepted, irrespective of the reported confidence, the neural network attains $R^2 = 0.445 \pm 0.007$, a significant improvement over the DNN, CMF, and RF approaches and similar to the profile-QSAR 2.0 method result. However, access to the uncertainties in the predictions gives us more knowledge about the neural network results. In particular, we can discard predictions carrying large uncertainty, and trust only those with smaller uncertainty.

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 14

lets us focus on the most confident predictions only, at the expense of reporting fewer total predictions. When this is done, the quality of the remaining neural network predictions increases, as shown in Figure 4, demonstrating that the neural network is able to accurately and truthfully inform us about the uncertainties in its predictions; the confidence of predictions is correlated with their accuracy. The coefficient of determination reaches values of R² > 0.9, demonstrating effectively perfect predictions, when we complete only the most confident 1% of the data. We note that this focus on the most confident predictions, and corresponding increase in accuracy, is post-processing: only one model is trained, and the desired level of confidence can be specified and used to return only sufficiently accurate results.

The neural network is significantly more accurate than the DNN, CMF, and RF methods even when 100% of the predictions are accepted (with p-values 3 × 10^-66, 2 × 10^-102, and 2 × 10^-107 respectively), and is significantly more accurate than pQSAR 2.0 when only the least confident 3% of predictions are discarded (p = 3 × 10^-4). As shown in Figure 4, the accuracy improvement over the other methods increases substantially as a smaller fraction of the predictions are accepted.

The achieved R² > 0.9 exceeds the level of R² = 0.7 that is often taken as indicating accurate, reliable predictions in the presence of experimental uncertainty. In fact, the most confident 50% of the neural network's predictions all have R² > 0.7, permitting a nine-fold increase in the number of accurate predictions that can be used for further analysis, relative to the original sparse experimental measurements.

This high accuracy is achieved after approximately 120 core hours of training. The time to validate the data set is 0.1 ms per compound for the neural network, versus 10 ms per compound, 100 times longer, for the traditional random forest approach. This acceleration in generating predictions further enhances the real-world applicability of the neural network approach.

Figure 4: The coefficient of determination for predicting the activity of the clustered Kinase data set with percentage of data predicted. The cyan point is for the random forest approach, the blue point is the collective matrix factorisation (CMF) method, the dark green point is the deep neural network (DNN) approach, the orange point is the profile-QSAR 2.0 method, and the magenta line is the neural network proposed in this work. The magenta line shows that the accuracy of the neural network predictions increases when focussing on the most confident predictions, at the expense of imputing only a proportion of the missing data. This confirms that the reported confidences in the predictions correlate strongly with their accuracy. Error bars represent the standard error in the mean R² value over all 159 assays, and where not visible are smaller than the size of the points.
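The confidence-based post-processing behind Figure 4 can be illustrated with a minimal Python sketch. It is not the code used in this work: the function names are hypothetical, the standard definition R² = 1 - SS_res/SS_tot is assumed, and the synthetic data are constructed purely so that the error grows with the reported uncertainty.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination, R^2 = 1 - SS_res / SS_tot (assumed definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def r_squared_most_confident(y_true, y_pred, sigma, fraction):
    """R^2 over the `fraction` of predictions with the smallest uncertainty `sigma`."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Keep at least two points so that SS_tot can be non-zero.
    n_keep = min(len(sigma), max(2, int(round(fraction * len(sigma)))))
    keep = np.argsort(sigma)[:n_keep]  # most confident predictions first
    return r_squared(y_true[keep], y_pred[keep])


def mean_and_standard_error(values):
    """Mean and standard error of the mean, e.g. of per-assay R^2 values."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=1) / np.sqrt(len(values))


# Synthetic illustration only: the prediction error is proportional to sigma.
idx = np.arange(100)
y_true = ((idx * 37) % 100).astype(float)   # scrambled target values
sigma = np.linspace(0.0, 30.0, 100)         # hypothetical reported uncertainties
sign = np.where(idx % 2 == 0, 1.0, -1.0)
y_pred = y_true + sign * sigma

for fraction in (1.0, 0.5, 0.1):
    print(fraction, r_squared_most_confident(y_true, y_pred, sigma, fraction))
```

Only one set of predictions is needed; changing `fraction` simply changes how many of them are returned, which mirrors the post-processing described above. Applying `mean_and_standard_error` to a list of per-assay R² values gives the kind of error bar described in the Figure 4 caption.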


3.2.1 Analysis

It is informative to analyse the results that our neural network approach is able to calculate accurately, and compare this to preconceptions of how the algorithm functions. For example, as a data-driven approach, it might be assumed that the assays with the most training data would be most accurately predicted by the neural network. However, as shown in Figure 5, this is not the case; although the assay with least training data is that predicted least accurately, there is in general no correlation between the accuracy of the neural network's predictions and the amount of training data available to the algorithm. In particular, the two assays with most training data are relatively poorly captured by the neural network, with R² < 0.2 in both cases.

Figure 5: The coefficient of determination measured for each of the 159 kinase assays, plotted against the percentage of the data for that assay present in the training set.

Likewise, the most confident predictions are not for compounds 'closest' to those in the training set. The degree of separation can be measured in terms of the Euclidean distance between the points in the multi-dimensional space of descriptors used in the model. A representative example assay's data (ChEMBL assay 688660) is shown in Figure 6, where the training points (grey crosses) and test points (coloured points) are depicted in a 2-dimensional t-distributed stochastic neighbour embedding (t-SNE) generated using the StarDrop software package (38). The levels of predictive confidence are fairly uniform with distance from the training data, confirming the algorithm's ability to confidently predict test points that are relatively far from the clusters of training points. In addition to this analysis, the Euclidean distance between every test point and its nearest neighbour training point was taken for all assays. This measure showed no correlation with the network's uncertainty or error, indicating that the neural network is operating beyond a nearest-neighbour approach in descriptor space, by exploiting assay-assay correlations that are carried across into assay-descriptor space.

Figure 6: A 2-dimensional t-SNE embedding of the input descriptor space for ChEMBL assay 688660. The grey crosses show the training data and the coloured points show the test data with colour indicating the uncertainty estimate of the network in its predictions, where red indicates zero uncertainty and yellow a high uncertainty of 1 log unit.

3.2.2 Summary

We have shown that the neural network presented delivers similar quality predictions for assay bioactivity to the profile-QSAR 2.0 method when considering the full test set and that these methods outperform QSAR methods, including modern DNNs, and also outperform matrix factorisation. In addition, a key advantage is that the neural network gives accurate uncertainties on its output, allowing us to prioritise only well-predicted assay activities, enabling an increase in the coefficient of determination for the predictions of the realistic data set from R² = 0.445 up to R² > 0.9 for a subset of the data. The ability to tune accuracy with amount of data predicted is an invaluable tool for scientists, fueling confidence in results and permitting a focus on only high-quality predictions. These most confident predictions are also not for the most complete assays or the most similar test points to the training data, showing that the neural network approach is able to learn more complex and powerful representations of the assay bioactivity data.

4 Conclusions

We have presented a new neural network imputation technique for predicting bioactivity, which can learn from incomplete bioactivity data to improve the quality of predictions by using correlations both between different bioactivity assays and between molecular descriptors and bioactivities. This results in a significant improvement in the accuracy of prediction over conventional QSAR models, even those using modern deep learning methods, particularly for challenging data sets representing an extrapolation to new compounds that are not well represented by the set used to train the model. This is representative of many chemistry optimisation projects which, by definition, explore new chemical space as the project proceeds. The method presented can also accurately estimate the confidence in each individual prediction, enabling attention to be focussed on only the most accurate results. It is important to base decisions in a discovery project on reliable results to avoid wasted effort pursuing incorrectly selected compounds or missing opportunities by inappropriately discarding potentially valuable compounds (39). On the Kinase example data set, we demonstrated that 50% of the missing data could be filled in with R² > 0.7, which is considered to represent a high level of fidelity between prediction and experiment.

The ability to make simultaneous, accurate predictions across multiple assays will lend itself well to the problem of selectivity across multiple targets (40,41). The method is general, so it can apply beyond the binding assay data used in this analysis, for example to direct or downstream functional assays; and the method can even make accurate predictions beyond pIC50 values, including physicochemical, absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties. Therefore, it has a broad application for identification of additional active compounds within a database, recognition of the most influential chemical properties, prediction of selectivity profiles, and the selection of compounds for progression to downstream ADMET assays.

Conflict of Interest Statement

Matthew Segall, Peter Hunt and Ben Irwin are employees of Optibrium Limited, which develops the StarDrop™ software used herein for analysis of the results. Thomas Whitehead and Gareth Conduit are employees of Intellegens Limited, which develops the Alchemite™ software used in this work.

Acknowledgement

Gareth Conduit acknowledges the financial support of the Royal Society and Gonville & Caius College. There is Open Access to this paper and data available at https://www.openaccess.cam.ac.uk.

Supporting Information Available

The following files are available free of charge:

• Adrenergic_dataset.csv: Dataset used in Section 3.1

• Kinase_training_w_descriptors.csv: Training dataset used in Section 3.2

• Kinase_test_w_descriptors.csv: Test dataset used in Section 3.2
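As a rough guide to repeating the nearest-neighbour check of Section 3.2.1 on the training and test files listed above, the following Python sketch computes the Euclidean distance from each test point to its nearest training point and correlates it with a per-compound uncertainty. The function names and the toy arrays are hypothetical; the column layout of the CSV files is not described here, so extracting descriptor arrays from them is left open, and this is not the code used in this work.

```python
import numpy as np


def nearest_training_distance(test_x, train_x):
    """Euclidean distance from each test point to its nearest training point."""
    test_x = np.asarray(test_x, dtype=float)
    train_x = np.asarray(train_x, dtype=float)
    # Pairwise differences via broadcasting: shape (n_test, n_train, n_features).
    diff = test_x[:, None, :] - train_x[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=2))
    return dist.min(axis=1)


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])


# Toy descriptor vectors (assumed values, two descriptors per compound).
train_x = [[0.0, 0.0], [10.0, 0.0]]
test_x = [[1.0, 0.0], [0.0, 3.0], [10.0, 4.0]]
sigma = [0.4, 0.1, 0.3]  # hypothetical predicted uncertainties for the test points

distances = nearest_training_distance(test_x, train_x)
print("nearest-neighbour distances:", distances)
print("correlation with uncertainty:", pearson_r(distances, sigma))
```

A correlation close to zero between `distances` and the uncertainty (or the absolute error) would correspond to the behaviour reported in Section 3.2.1. The broadcasting step holds all pairwise differences in memory at once, so for large data sets it would need to be applied in chunks. The same `pearson_r` helper could be used to compare per-assay R² with the fraction of training data per assay, as in Figure 5.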


This information is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Gaulton, A.; Bellis, L.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Akhtar, R.; Bento, A.; Al-Lazikani, B.; Michalovich, D.; Overington, J. ChEMBL: A Large-scale Bioactivity Database For Chemical Biology and Drug Discovery. Nucleic Acids Res. 2012, 40, 1100.

(2) Bento, A.; Gaulton, A.; Hersey, A.; Bellis, L.; Chambers, J.; Davies, M.; Krüger, F.; Light, Y.; Mak, L.; McGlinchey, S.; Nowotka, M.; Papadatos, G.; Santos, R.; Overington, J. The ChEMBL Bioactivity Database: an Update. Nucleic Acids Res. 2014, 42, 1083.

(3) Cherkasov, A.; Muratov, E. N.; Fourches, D.; Varnek, A.; Baskin, I. I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y. C.; Todeschini, R.; Consonni, V.; Kuz'min, V. E.; Cramer, R.; Benigni, R.; Yang, C.; Rathman, J.; Terfloth, L.; Gasteiger, J.; Richard, A.; Tropsha, A. QSAR Modeling: Where Have You Been? Where Are You Going To? J. Med. Chem. 2014, 57, 4977–5010.

(4) Wold, S.; Sjostrom, M.; Eriksson, L. In The Encyclopedia of Computational Chemistry; Schleyer, P., Allinger, N., Clark, T., Gasteiger, J., Kollman, P., Schaefer III, H., P., S., Eds.; Chichester, UK: John Wiley and Sons, 1999; pp 1–16.

(5) Gao, C.; Cahya, S.; Nicolaou, C.; Wang, J.; Watson, I.; Cummins, D.; Iversen, P.; Vieth, M. Selectivity Data: Assessment, Predictions, Concordance, and Implications. J. Med. Chem. 2013, 56, 6991.

(6) Schurer, S. C.; Muskal, S. M. Kinome-wide Activity Modeling from Diverse Public High-quality Data Sets. J. Chem. Inf. Model. 2013, 53, 27.

(7) Christmann-Franck, S.; van Westen, G.; Papadatos, G.; Beltran Escudie, F.; Roberts, A.; Overington, J.; Domine, D. Unprecedently Large-Scale Kinase Inhibitor Set Enabling the Accurate Prediction of Compound Kinase Activities: A Way toward Selective Promiscuity by Design? J. Chem. Inf. Model. 2016, 56, 1654.

(8) Subramanian, V.; Prusis, P.; Xhaard, H.; Wohlfahrt, G. Predictive Proteochemometric Models for Kinases Derived from 3D Protein Field Based Descriptors. MedChemComm 2016, 7, 1007.

(9) Merget, B.; Turk, S.; Eid, S.; Rippmann, F.; Fulle, S. Profiling Prediction of Kinase Inhibitors: Toward the Virtual Assay. J. Med. Chem. 2017, 60, 474.

(10) Doucet, J.; Xia, H.; Panaye, A.; Fan, B. Nonlinear SVM Approaches to QSPR/QSAR Studies and Drug Design. Curr. Comput.-Aided Drug Des. 2007, 3, 263–289.

(11) Obrezanova, O.; Csanyi, G.; Gola, J.; Segall, M. Gaussian Processes: a Method for Automatic QSAR Modeling of ADME Properties. J. Chem. Inf. Model. 2007, 47, 1847–1857.

(12) Martin, E.; Mukherjee, P.; Sullivan, D.; Jansen, J. Profile-QSAR: a Novel Meta-QSAR Method that Combines Activities Across the Kinase Family to Accurately Predict Affinity, Selectivity, and Cellular Activity. J. Chem. Inf. Model. 2011, 51, 1942.

(13) Martin, E.; Polyakov, V. R.; Tian, L.; Perez, R. Profile-QSAR 2.0: Kinase Virtual Screening Accuracy Comparable to Four-Concentration IC50s for Realistically Novel Compounds. J. Chem. Inf. Model. 2017, 57, 2077.


(14) Lo, Y.-C.; Rensi, S. E.; Torng, W.; Altman, R. B. Machine Learning in Chemoinformatics and Drug Discovery. Drug Discovery Today 2018, 23, 1538–1546.

(15) Chen, H.; Engkvist, O.; Wang, Y.; Olivecrona, M.; Blaschke, T. The Rise of Deep Learning in Drug Discovery. Drug Discovery Today 2018, 23, 1241–1250.

(16) Mayr, A.; Klambauer, G.; Unterthiner, T.; Steijaert, M.; Wegner, J. K.; Ceulemans, H.; Clevert, D.-A.; Hochreiter, S. Large-scale Comparison of Machine Learning Methods for Drug Target Prediction on ChEMBL. Chemical Science 2018, 9, 5441.

(17) Popova, M.; Isayev, O.; Tropsha, A. Deep reinforcement learning for de novo drug design. Science Advances 2018, 4, eaap7885.

(18) Conduit, B.; Jones, N.; Stone, H.; Conduit, G. Design of a Nickel-base Superalloy with a Neural Network. Materials and Design 2017, 131, 358.

(19) Conduit, B.; Jones, N.; Stone, H.; Conduit, G. Probabilistic Design of a Molybdenum-base Alloy using a Neural Network. Scripta Materialia 2018, 146, 82.

(20) Verpoort, P.; MacDonald, P.; Conduit, G. Materials Data Validation and Imputation with an Artificial Neural Network. Computational Materials Science 2018, 147, 176.

(21) Ertl, P.; Rhodes, B.; Selzer, P. Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and Its Application to the Prediction of Drug Transport. J. Med. Chem. 2000, 43, 3714–3717.

(22) Abraham, M.; McGowan, J. The Use of Characteristic Volumes to Measure Cavity Terms in Reversed-phase Liquid-chromatography. Chromatographia 1987, 23, 243–246.

(23) Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.

(24) Heskes, T. Practical Confidence and Prediction Intervals. Advances in Neural Information Processing Systems 9. 1997; pp 176–182.

(25) Papadopoulos, G.; Edwards, P.; Murray, A. Confidence Estimation Methods for Neural Networks: a Practical Comparison. IEEE Transactions on Neural Networks 2001, 12, 1278.

(26) Krishnan, T.; McLachlan, G. The EM Algorithm and Extensions; Wiley, 2008.

(27) He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016; pp 770–778.

(28) Mahfoud, S.; Goldberg, D. Parallel Recombinative Simulated Annealing: A Genetic Algorithm. Parallel Computing 1995, 21, 1.

(29) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 2011, 12, 2825–2830.

(30) Xu, Y.; Ma, J.; Liaw, A.; Sheridan, R. P.; Svetnik, V. Demystifying Multitask Deep Neural Networks for Quantitative Structure-Activity Relationships. J. Chem. Inf. Model. 2017, 57, 2490–2504.

(31) Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; Zheng, X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015; https://www.tensorflow.org/, Software available from tensorflow.org.

(32) Nesterov, Y. A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k²). Soviet Mathematics Doklady 1983, 372–376.

(33) Mehta, R.; Rana, K. A review on matrix factorization techniques in recommender systems. 2017 2nd International Conference on Communication Systems, Computing and IT Applications (CSCITA). 2017; pp 269–274.

(34) Töscher, A.; Jahrer, M.; Bell, R. M. The BigChaos Solution to the Netflix Grand Prize. 2009.

(35) Singh, A. P.; Gordon, G. J. Relational Learning via Collective Matrix Factorization. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA, 2008; pp 650–658.

(36) Cortes, D. Cold-start recommendations in Collective Matrix Factorization. CoRR 2018, abs/1809.00366.

(37) Žuvela, P.; Liu, J. J.; Macur, K.; Bączek, T. Molecular Descriptor Subset Selection in Theoretical Peptide Quantitative Structure-Retention Relationship Model Development Using Nature-Inspired Optimization Algorithms. Anal. Chem. 2015, 87, 9876–9883.

(38) StarDrop. https://www.optibrium.com/stardrop/, Accessed: 2018-10-26.

(39) Segall, M.; Champness, E. The Challenges of Making Decisions using Uncertain Data. J. Comput.-Aided Mol. Des. 2015, 29, 809–816.

(40) Sciabola, S.; Stanton, R. V.; Wittkopp, S.; Wildman, S.; Moshinsky, D.; Potluri, S.; Xi, H. Predicting Kinase Selectivity Profiles Using Free-Wilson QSAR Analysis. Journal of Chemical Information and Modeling 2008, 48, 1851–1867, PMID: 18717582.

(41) Kothiwale, S.; Borza, C.; Pozzi, A.; Meiler, J. Quantitative Structure–Activity Relationship Modeling of Kinase Selectivity Profiles. Molecules 2017, 22.
