

Computational Chemistry

PathwayMap: Molecular pathway association with self-normalizing neural networks
José Jiménez Luna, Davide Sabbadin, Alberto Cuzzolin, Gerard Martínez-Rosell, Jacob Gora, John Manchester, Jose S. Duca, and Gianni De Fabritiis
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00711 • Publication Date (Web): 26 Dec 2018
Downloaded from http://pubs.acs.org on January 1, 2019




Journal of Chemical Information and Modeling

PathwayMap: Molecular Pathway Association with Self-Normalizing Neural Networks

José Jiménez,†,⊥ Davide Sabbadin,†,⊥ Alberto Cuzzolin,‡ Gerard Martínez-Rosell,‡ Jacob Gora,¶,§ John Manchester,¶ Jose Duca,¶ and Gianni De Fabritiis∗,†,‡,‖

†Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), Carrer del Dr. Aiguader 88, 08003, Barcelona, Spain.
‡Acellera, Barcelona Biomedical Research Park (PRBB), Carrer del Dr. Aiguader 88, 08003, Barcelona, Spain.
¶Global Discovery Chemistry, Novartis Institutes for Biomedical Research, 250 Massachusetts Ave., Cambridge, MA 02139, USA
§Freie Universität Berlin, Kaiserswerther Str. 16-18, 14195 Berlin, Germany
‖Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig Lluis Companys 23, 08010 Barcelona, Spain
⊥Contributed equally to this work

E-mail: [email protected]

Abstract

Drug discovery suffers from high attrition, as compounds initially deemed promising can later prove ineffective or toxic because their activity profile is poorly understood. In this work we describe a deep self-normalizing neural network model for the prediction of molecular pathway association and evaluate its performance, showing an AUC ranging from 0.69 to 0.91 on a set of compounds extracted from ChEMBL and from 0.81 to 0.83 on an external dataset provided by Novartis. We finally discuss the applicability of the proposed model in the domain of lead discovery. A usable application is available via PlayMolecule.org.

Introduction

A drug discovery project might take more than ten years to successfully find a molecule suitable for clinical use. Drug candidates must be potent against the intended target(s) and have appropriate pharmacokinetic properties to reach their site at a sufficiently high concentration. Furthermore, for the compound to be safely administered, it must avoid side-effects (off-target activity), drug-drug interactions and non-specific or idiosyncratic toxicities at the therapeutic dose. 1 A similar narrative can be found in agrochemical discovery, where an organism (fungus, insect or weed) needs to be selectively targeted and human toxicity avoided. 2 In both disciplines, the early phases of the discovery process include the identification of a suitable molecule for subsequent optimization. As a rule of thumb, the wider the explored chemical space (i.e. synthesized and tested in relevant biological assays), the higher the chances of finding a compound displaying high affinity for the target protein of interest.

The identification of a chemical which specifically binds with high affinity to a target protein is not a trivial task, given that the chemical space is estimated to be composed of around $10^{60}$ different molecules. 3 In the context of high-throughput and phenotypic screening for lead discovery, pharmacologically active compounds for a desired target (hits) are regularly obscured by a large number of molecules acting through unspecific or undesired mechanisms of action. The latter compounds may also have an identical assay readout to the desired target modulators, for instance through sharing a common signaling pathway with the desired target or with a target regulating an essential biological system function. Interpretation and deconvolution of the effect of a compound in a cell-based or more complex phenotypic assay is a time-consuming process, and it may be hard to differentiate a pharmacologically relevant target-mediated effect from that of simply toxic compounds.

Computational methods that enable the classification of the biological target or pathway that molecules are likely to hit would facilitate assay result analysis and, most importantly, fast off-target prediction for de-prioritization of compounds with known and unwanted mechanisms of action. In recent years, large-scale data integration (protein-protein interaction data-sets) from various pathway resources has taken a big step forward, 4 and so has research on modelling drug-pathway interactions using machine-learning approaches. Cai et al. 5 and Lu et al. 6 proposed a nearest neighbour algorithm and an AdaBoost model, 7 respectively, with functional groups as descriptors, to classify 2764 compounds into 11 pathway classes. Macchiarulo et al. 8 used a random forest (RF) model 9 and 32 physicochemical descriptors to classify 681 ligands into 52 individual and 7 class pathways. A limitation of the previously cited methods is that they did not take into account compound interference with several biological pathways (i.e. polypharmacology). Hamdalla et al. 10 use a substructure matching and nearest neighbour approach against a reference scaffold database to map 3190 ligands into their corresponding individual and class pathways, and take into account multifunction compounds. However, their implementation requires the predicted number of pathways per compound to be pre-specified by the user and is currently very computationally expensive, rendering it impractical in large-scale drug discovery applications.

Many other reported methods in the literature exploit different sources of information. Hu et al. 11 and Gao et al. 12 use chemical-chemical, chemical-protein and protein-protein annotations extracted from the STITCH 13 and STRING 14 databases in order to map 4366 and 3348 different compounds, respectively, into 11 pathway classes, the latter focusing particularly on yeast enzymes. A clear disadvantage of these methods is, therefore, their dependency on interaction information. Ma et al., 15,16 Song et al. 17 and Chen et al. 18 on the other hand, use transcriptome data to reflect pathway interference through gene expression levels. Similarly, these approaches only work when expression data is available. All of the methods reported so far, which leverage data from comprehensive pathway databases such as KEGG 19 and Reactome, 20 share a focus on a small subset of pathways and a very limited selection of training ligands, and lack an accessible and computationally efficient implementation of the final model, which in turn reduces their applicability in drug discovery programs at scale.

The availability of information-rich pathway databases, along with the information available in Uniprot 21 and ChEMBL, 22 offers the possibility to map assay-tested molecules to their corresponding interfering pathways, making them ideal for building large-scale predictive machine-learning (ML) models. Such models have several applications in the discovery of biologically active molecules:

• identification of chemicals that have a high probability of targeting one or a selected set of (disease relevant) biological pathways. We believe this feature could find an application in the virtual screening of large databases of chemicals to discover compounds that may modulate specific disease-impaired pathways. The same information can also be used to discard (or de-prioritize testing of) compounds impairing pathways linked to the occurrence of toxic effects in clinical trials 23 or undesired modes of action.

• identification of chemicals that are not active in multiple biological pathways. Such molecules have been categorized as dark chemical matter (DCM) by Novartis. 24 Assuming that the physico-chemical properties of a set of compounds are suitable to reach cellular targets, these may possess a unique activity in a yet to be discovered protein (or assay to test) and a cleaner safety profile, thus making them valuable starting points in lead optimization efforts.

• target deconvolution for compounds with a demonstrated phenotypic activity but unknown mechanism of action. The tool can identify the most likely pathways, and therefore targets, that the molecule could be interfering with.

In this work we have developed and made available for public use a deep-learning model, based on self-normalizing neural networks, 25 for pathway identification of molecular compounds using the KEGG and Reactome databases. To the best of our knowledge, this is the most extensive approach to date for this task, both in terms of training samples and number of pathways predicted and evaluated.

Materials

We use two different biological pathway databases, namely the Kyoto Encyclopedia of Genes and Genomes (KEGG) release 87.0 and Reactome v.65, while ligand data comes from the ChEMBL database v.24 and an external dataset consisting of approximately 1.25 million assay response measurements supplied by Novartis. Since the goal of this work is to enable the classification of chemicals that are likely to interact with biological pathways, we queried the associated genes for each of the 328 individual curated Homo sapiens pathways reported in KEGG. For each of these genes, the Uniprot identifiers of the expressed proteins were associated, leading to 8482 individual proteins. Ligands for these Uniprot accession numbers were then queried in the ChEMBL database, selecting those with an activity of at least 5.5 pK units (referred to here as ChEMBL active ligands). Pathways for which no ligands were found were removed, leading to a total of 299 individual pathways and 200035 associated active ligands. For the Reactome database, we used the provided mapping system and applied a similar approach, retrieving the Uniprot identifiers from the lowest pathway hierarchy levels. This procedure led to the identification of 1797 pathways and 11077 associated Uniprot identifiers. As before, ligands were extracted from ChEMBL using the same threshold, and pathways for which no associated ligands were found were dropped, leading to a total of 219408 ligands and 1155 associated pathways. Similarly, subsets of Kd as well as IC50 measurements from the Novartis dataset were discretized. Two sets of approximately 300,000 compounds each were retrieved, containing only compounds acting in at least one pathway of the corresponding pathway database. An exhaustive list of pathways for both databases is provided in Tables S1 and S2.

The number of pathways a single ligand intervenes in and the number of ligands a single pathway affects (ligand and pathway cardinalities, respectively) are provided in Figure 1. The average ligand cardinality in ChEMBL for KEGG is 11.43, while for Reactome it is 8.95, due to the higher number of unique pathways that are reported in the latter. In the Novartis dataset, however, the average ligand cardinality for KEGG is 26.39 and for Reactome is 22.44. Ligand diversity, defined as the number of unique label sets (i.e. pathway combinations) that appear in the ChEMBL extracted data, is 39692 and 46787 for KEGG and Reactome, respectively.
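The label construction and the cardinality statistics described in the Materials section can be illustrated with a minimal sketch. The activity records, protein identifiers and pathway names below are invented toy values, and `build_labels` is a hypothetical helper rather than the authors' pipeline; only the 5.5 pK threshold and the definitions of ligand cardinality, pathway cardinality and ligand diversity are taken from the text.

```python
from collections import Counter, defaultdict

PK_THRESHOLD = 5.5  # activity cut-off in pK units, as stated in the text

# Hypothetical toy records: (ligand_id, uniprot_id, pK). Not real ChEMBL data.
activities = [
    ("lig1", "P1", 6.2),
    ("lig1", "P2", 7.0),
    ("lig2", "P1", 5.0),  # below the threshold, so this record is ignored
    ("lig2", "P3", 5.5),
    ("lig3", "P2", 8.1),
]

# Hypothetical protein -> pathway mapping (stands in for KEGG or Reactome).
protein_pathways = {
    "P1": {"pathA", "pathB"},
    "P2": {"pathB"},
    "P3": {"pathC"},
    "P4": {"pathD"},  # no active ligand, so pathD never appears as a label
}


def build_labels(activities, protein_pathways, threshold=PK_THRESHOLD):
    """Map each active ligand to the pathways of the proteins it is active on."""
    labels = defaultdict(set)
    for ligand, protein, pk in activities:
        if pk >= threshold:
            labels[ligand] |= protein_pathways.get(protein, set())
    # Keep only ligands acting in at least one pathway.
    return {lig: frozenset(paths) for lig, paths in labels.items() if paths}


labels = build_labels(activities, protein_pathways)

# Ligand cardinality: number of pathways per ligand.
ligand_cardinality = {lig: len(paths) for lig, paths in labels.items()}
# Pathway cardinality: number of ligands per pathway.
pathway_cardinality = Counter(p for paths in labels.values() for p in paths)
# Ligand diversity: number of distinct label sets (pathway combinations).
diversity = len(set(labels.values()))
mean_ligand_cardinality = sum(ligand_cardinality.values()) / len(ligand_cardinality)
```

Pathways without any active ligand drop out automatically because they never enter a label set, which mirrors the removal of ligand-free pathways described above.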

ChEMBL active ligand-protein associations revealed that cancer and G protein signalling related pathways are among the most impacted ones in the training set (Tables 1 and 2). This is not surprising, given that significant chemical and pharmaceutical resources have been spent on cancer research and that G protein-coupled receptors (GPCRs) represent an important share of current drug targets. 26 This inherent experimental bias could later translate into model performance, as significantly more data is available for some pathways than for others.

Figure 1: Ligand and pathway cardinalities in the KEGG and Reactome databases. (Four histogram panels: pathway cardinality and ligand cardinality for KEGG and for Reactome; vertical axes show frequency.)

Methods

In this section we first detail how we adapt the task of identifying pathways for a ligand to a multilabel classification problem, and then discuss details of the chosen network architecture. Finally, we discuss a testing procedure for the model and comment on technical training details.

Multilabel classification

Traditional binary (or, by extension, multiclass) classification assumes a single target label per instance: if $X = \mathbb{R}^d$ is the feature space and $Y = \{0, 1\}$ the target space, the goal is to learn a function $f : X \to Y$ from a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in X$ and $y_i \in Y$. Since a single ligand may play a role in several pathways, the presented problem instead falls naturally into the multilabel classification machine-learning framework. 27 In the latter, the target space becomes $Y = \{y_1, y_2, \ldots, y_q\}$, where $q$ is the number of possible labels, and the task becomes to learn a function $h : X \to 2^Y$. In our case, $h$ is fully parametrized by a deep self-normalizing neural network that we describe in its corresponding subsection.

Multilabel learning (also known as multitask learning of $q$ binary classification tasks) has a rich history in the fields of text and multimedia categorization, 28–30 but applications in bioinformatics are also becoming increasingly common. 31,32

Featurization

For ligand representation we use standard Extended Connectivity Fingerprints (ECFP4) as available in the rdkit software, 33 with a size of 1024 bits and a radius of 2 bonds. We note that more modern ligand representations for machine-learning tasks are nowadays available, namely those based on graph convolution. 34 These, in our experiments, were harder to calibrate and significantly more computationally expensive to train than standard fingerprints. For the sake of simplicity and prediction speed we decided to use the latter in the final model. We also acknowledge that while fingerprints are widely used in many computational chemistry tasks, their interpretation when used in machine-learning based predictors


Table 1: Most and least common ChEMBL active-pathway associations in the KEGG training set extracted from ChEMBL v.24

Most common
Description                                               Count
Hepatocellular carcinoma                                  31216
Endocrine resistance                                      29275
Pancreatic cancer                                         28876
Chronic myeloid leukemia                                  26206
Apoptosis                                                 25857
Colorectal cancer                                         25723
AGE-RAGE signaling pathway in diabetic complications      24976
Central carbon metabolism in cancer                       24428
Thyroid hormone signaling pathway                         23641
Prostate cancer                                           23289

Least common
Description                                               Count
Collecting duct acid secretion                            10
Glycosaminoglycan degradation                             9
Nucleotide excision repair                                9
Pentose and glucuronate interconversions                  9
Steroid biosynthesis                                      8
Spliceosome                                               8
Cell adhesion molecules (CAMs)                            4
Mannose type O-glycan biosynthesis                        3
Other types of O-glycan biosynthesis                      3
Selenocompound metabolism                                 2

Table 2: Most and least common ChEMBL active-pathway associations in the Reactome training set extracted from ChEMBL v.23

Most common
Description                                                             Count
G alpha (s) signalling events                                           25883
Interleukin-4 and Interleukin-13 signaling                              20308
Ub-specific processing proteases                                        19317
Neutrophil degranulation                                                19004
Signaling by BRAF and RAF fusions                                       17253
G alpha (i) signalling events                                           15844
Regulation of TP53 Activity through Methylation                         14119
Activation of the pre-replicative complex                               13812
CDT1 association with the CDC6:ORC:origin complex                       13500
Regulation of TP53 Activity through Phosphorylation                     13206

Least common
Description                                                             Count
Defective HK1 causes hexokinase deficiency (HK deficiency)              1
COX reactions                                                           1
NOSTRIN mediated eNOS trafficking                                       1
NOSIP mediated eNOS trafficking                                         1
Defective SLC22A5 causes systemic primary carnitine deficiency (CDSP)   1
IRS activation                                                          1
Insulin receptor signalling cascade                                     1
Signaling by Insulin receptor                                           1
Insulin receptor recycling                                              1
Proton/oligopeptide cotransporters                                      1


is harder to deconvolute than that of other descriptors, such as simple physicochemical properties. In this particular case, as we are using neural-network based models, interpreting the output conditional on the input becomes arduous, since the learnable weights move in a high-dimensional space.
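To make the shape of the data concrete, the following minimal sketch expands the set bits of a fingerprint into a 1024-dimensional binary input and encodes a ligand's pathway set as a q-dimensional binary target, in line with the multilabel formulation of this section. The bit indices and pathway names are invented for illustration, and `dense_fingerprint`, `multi_hot` and `decode` are hypothetical helpers; in practice the set bits would be produced by a fingerprint library such as rdkit.

```python
N_BITS = 1024  # fingerprint length used in the text


def dense_fingerprint(on_bits, n_bits=N_BITS):
    """Expand a collection of set-bit indices into a 0/1 list of length n_bits."""
    vec = [0] * n_bits
    for bit in on_bits:
        vec[bit] = 1
    return vec


def multi_hot(label_set, all_labels):
    """Encode a set of pathway names as a binary vector ordered like all_labels."""
    return [1 if name in label_set else 0 for name in all_labels]


def decode(vector, all_labels):
    """Recover the set of pathway names from a binary label vector."""
    return {name for name, flag in zip(all_labels, vector) if flag}


# Hypothetical example values (q = 3 pathways).
pathways = ["pathA", "pathB", "pathC"]
x = dense_fingerprint({3, 17, 1023})        # one input row of the model
y = multi_hot({"pathA", "pathC"}, pathways)  # the matching target row
```

Each output position then corresponds to one pathway, so a per-pathway probability can be read off directly from the matching output unit.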

Self-normalizing neural networks

Deep learning 35,36 approaches have revolutionized computer vision through the use of convolutional neural networks (CNNs) 37 and natural language processing via recurrent neural networks (RNNs). 38 Success stories for tabular data with regular feed-forward neural networks (FNNs), however, are scarcer, with most researchers and engineers opting for other statistical models such as RFs and gradient boosting machines (GBMs). 39 In fact, in contrast with modern deep learning architectures, most successful FNNs tend to be relatively shallow. One of the reasons deeper architectures are more performant in CNN-based computer vision tasks is batch normalization, 40 which ensures zero-mean and unit-variance neuron activations. Self-normalizing neural networks (SNNs) tackle this problem by the use of the SELU activation function, which pushes neuron activations to zero mean and unit variance, enabling the efficient training of deeper fully connected architectures while avoiding exploding or vanishing gradients. The SELU activation function is given by:

$$\mathrm{SELU}(x) = \beta \left( \max\{0, x\} + \min\{0, \alpha (\exp(x) - 1)\} \right), \tag{1}$$

with $\alpha \approx 1.6733$ and $\beta \approx 1.0507$. Dropout 41 is a commonly used technique in deep learning applications, since it tends to provide better network generalization capabilities by randomly setting some of the activations to zero during training. In order to preserve the activation mean when using the SELU activation, an alpha-dropout technique is used, which randomly sets activations to $\alpha' = -\beta\alpha$ instead.

It should be noted that one important reason for the choice of neural-network based models in this work is scalability. Other popular machine-learning algorithms such as RFs can also tackle multilabel classification problems and can, in fact, achieve levels of performance similar to the models proposed here. However, they entail expensive memory and disk space requirements when the number of labels is large, as their implementation is based on the binary relevance method. 42 Neural networks, on the other hand, scale particularly well in these instances, providing small and well-performing models that can be deployed in production with relative ease.

Independent models were trained for each database, with the proposed feed-forward neural-network architecture featuring 4 hidden layers with SELU activation, 512 units each, after applying batch normalization to the input. An alpha dropout layer is applied before the final layer, with standard probability $\alpha'$. The final layer features as many nodes as labels present in the KEGG and Reactome sets, that is 299 and 1155 units respectively, with sigmoid activation, each one representing the probability of a pathway being active for a ligand. We did not see any significant experimental improvements by adding extra layers or modifying the number of neurons per layer.

Validation procedure and metrics

We validate our model using k-fold cross validation, with k = 10, using two different types of splits, the first random and the second using chemical scaffold information. This means that the training data is split into k parts, and the model is trained on k − 1 of them and evaluated on the remaining one, with results averaged over the k held-out sets. In the random split, ligands are randomly assigned to a split, while for the scaffold split the assignment is performed by a clustering procedure using the k-means algorithm on ECFP4 fingerprints. The latter type of split is also commonly known as clustered cross validation, 43 and it typically shows less overconfident performance estimations. Both splits represent different realities and applicability scenarios, as the results using the first would only be valid for test ligands that fall into the training chemical manifold, while the second one represents performance for ligands falling outside this space. For further evaluation, we apply a temporal split validation procedure 44 to the ligands provided by Novartis, where the first 90% of the ligands are used for training and the remaining 10% for testing. Although inductive and confirmation bias may generate similar molecules over time, we believe this procedure depicts another useful validation scenario. 45
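The three validation schemes above (random k-fold, clustered k-fold and temporal split) can be sketched as follows. This is an illustrative outline rather than the authors' code: the cluster labels are assumed to be precomputed (for example by k-means on fingerprints), one fold per cluster is an assumption about how clusters map to folds, and all function names are hypothetical.

```python
import random


def random_folds(n_items, k=10, seed=0):
    """Assign item indices to k folds at random, with near-equal fold sizes."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [[] for _ in range(k)]
    for pos, i in enumerate(idx):
        folds[pos % k].append(i)
    return folds


def clustered_folds(cluster_ids):
    """One fold per cluster: all members of a cluster are held out together."""
    by_cluster = {}
    for i, c in enumerate(cluster_ids):
        by_cluster.setdefault(c, []).append(i)
    return [by_cluster[c] for c in sorted(by_cluster)]


def train_test_pairs(folds):
    """Yield (train, test) index lists, holding out one fold at a time."""
    for held_out, test in enumerate(folds):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def temporal_split(items_sorted_by_date, train_fraction=0.9):
    """Use the earliest fraction for training and the remainder for testing."""
    cut = int(len(items_sorted_by_date) * train_fraction)
    return items_sorted_by_date[:cut], items_sorted_by_date[cut:]


# Toy usage with invented cluster labels for seven ligands.
cluster_ids = [0, 1, 0, 2, 1, 2, 2]
scaffold_folds = clustered_folds(cluster_ids)
early, late = temporal_split(list(range(20)))
```

Because whole clusters are held out together, a test ligand never shares a cluster with a training ligand, which is what makes the clustered estimate less optimistic than the random one.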

To evaluate the results reported in the next section, we use several common multilabel classification metrics, which for the most part are generalizations of their binary classification counterparts. An extensive explanation of these metrics is provided in the Supporting Information. We note that comparison of the proposed approach against existing methods for pathway association is difficult, as (i) all previous approaches consider a much lower number of pathway classes to be predicted, (ii) many do not take into account multifunction compounds, 5,6,8 (iii) others use external information, either provided by interaction databases 11,12 or transcriptomic data, 15,16 or (iv) they require the number of predicted pathways to be pre-specified by the user, do not provide a probability (or score) estimate from which the evaluation metrics used here can be computed, and are too computationally expensive to be applied in large-scale scenarios. 10

Training

We train for 20 epochs using the Adam optimizer 46 with a starting learning rate of $10^{-4}$, momentum parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, a batch size of 128 samples and the binary cross-entropy loss function, defined by:

$$L(y_i, h_i) = -\left[ y_i \log(h_i) + (1 - y_i) \log(1 - h_i) \right], \tag{2}$$

where $y_i$ and $h_i$ are a single binary label and its prediction for a single example, respectively. For the task of multilabel classification, the loss is averaged over both the set of possible labels and the batch to obtain a single estimate. In order to avoid overfitting while ensuring the best possible performance, a 10% validation set is extracted in each training split and evaluated after each epoch, reducing the optimizer's learning rate by an order of magnitude if no improvement is found after 5 consecutive epochs.

Results

Classification results

Multilabel classification performances (see Methods section for details) for the models built from ChEMBL data for KEGG and Reactome are presented in Table 3. We report several metrics to evaluate their performance, such as the area under the curve (AUC), defined as the probability of ranking a positive example higher than a negative one, ranging from 0.69 to 0.91; coverage, defined as the number of positions necessary in the ranked scores to cover all true labels, ranging from 41.31 to 157.1; and F1-scores, defined as the harmonic mean of recall and precision, ranging from 0.45 to 0.67. As expected, performance numbers in the scaffold split are less optimistic than in the random split, since it represents a scenario where test ligands are significantly different from those seen by the model during training. A similar narrative can be seen in the temporal split procedure applied to the set provided by Novartis (Table 4), with worse AUCs, coverages and F1-scores than the ones reported in the random split for the ChEMBL data, although on occasion better than those of the scaffold split procedure. Contrary to binary classification, where it is common to focus on a few classification metrics, such as AUC and F1-score, multilabel classification is harder to evaluate, and therefore additional metrics are provided to depict other aspects of model performance. Overall, results suggest that all trained models achieve satisfactory performance, although to different degrees depending on the scenario.
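The three metrics defined in the classification results can be computed for a single example as in the sketch below. It follows the verbal definitions given in the text (AUC as a ranking probability, coverage as the rank needed to reach every true label, F1 as the harmonic mean of precision and recall); how these values are averaged across labels and examples in the paper is not specified here, so the per-example form and the tie handling are assumptions, and the scores are invented.

```python
def auc(scores, labels):
    """Probability that a positive is scored above a negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def coverage(scores, labels):
    """1-based rank (best score first) of the lowest-ranked true label."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [rank for rank, j in enumerate(order, start=1) if labels[j] == 1]
    return max(ranks) if ranks else 0


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented scores for one ligand over four pathways, with two true labels.
scores = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 1]
example_auc = auc(scores, labels)            # 3 of 4 positive/negative pairs ordered correctly
example_coverage = coverage(scores, labels)  # the second true label sits at rank 3
```

Lower coverage is better, since it means fewer top-ranked pathways have to be inspected before all true ones are found, whereas higher AUC and F1 are better.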


Dark chemical matter classification

We next applied our ChEMBL trained models to profile 139352 dark chemical matter (DCM) compounds reported by Wassermann et al., 24 defined as those showing no activity in a selected set of at least 100 assays from the Novartis screening platform. Out of the 139352 initial compounds in this set, those that either could not be read by rdkit or were present in our training sets were discarded for evaluation, leaving a total of 122210 and 119964 filtered compounds for the KEGG and Reactome models respectively.

As previously reported, 24 a great number of unknown (or not publicly reported) phenotypic responses associated with biochemical pathways exist for a large number of species of agrochemical and pharmaceutical interest. By definition, DCM compounds are likely to also be inactive in a more comprehensive set of biochemical pathways. Despite the skepticism regarding the existence of truly inactive chemical matter, 75333 (61.6%) and 73030 (60.9%) of the DCM compounds were classified as inactive in all modelled pathways for the KEGG and Reactome models respectively, using a 0.5 probability threshold. Results from this classification show the tendency of DCM compounds to be inactive against multiple human targets and pathways, and not just in the single set of 100 assays reported in the original paper. In our view, the predicted data highlights that using DCM for a screening campaign on a newly discovered target involved in key biological processes might lead to a low hit rate.

The in-silico pipeline could also be an enabler to deorphanize such compounds. 47 We therefore checked whether the ligands with at least one predicted pathway in the DCM set were present in ChEMBL and active in any reported assay, using the same thresholds detailed in the Materials section. Out of the 46877 (KEGG model) and 46934 (Reactome model) predicted pathway modulator ligands, 13831 (29.5%) and 13463 (28.7%) were found to be active in experimental tests. Out of the same subset, 6175 (13.2%) and 5513 (11.8%) had a reported associated target protein. The remainder were hits from various phenotypic screens for which no mechanism of action is reported. Besides validating the fact that the model is able to classify active chemical matter, the analysis highlights that compounds defined as DCM within a defined set of assays might be found active in others, including those based on phenotypic screening.

Randomly picked examples of these true positives reported by the models are shown in Figure 2. For instance, ChEMBL374810 is known to target the EGFR tyrosine kinase, 48 and the Reactome model enables its classification as a kinase signaling cascade interfering agent. For compounds ChEMBL2141422, ChEMBL1887894, ChEMBL1481218, ChEMBL1879772 and ChEMBL1505678, very little information on biological activity is available in the literature, and therefore their assessment would be relevant in targets already present in the reported pathways. It has to be said that a visual inspection of the deorphanized dark chemical matter showed that some compounds might be active in phenotypic screening due to the enhanced chemical reactivity associated with specific moieties (e.g. Michael acceptors, activated thioethers) and not because they specifically bind to a target protein and exert a pharmacologically relevant effect 49 (i.e. ChEMBL1543346, ChEMBL1543346 and ChEMBL1543346).

Finally, in order to assess the capability of the model to identify molecules which are experimentally known to impair cell viability through specific or unspecific mechanisms, we compare the results obtained with three other datasets, the first being a small set of 160 compounds covering more than 40 different target kinases (Tocriscreen kinase inhibitor toolbox I & II), 50 the second being a subset of the Novartis-GNF malaria box dataset 51 (here named Novartis Phenotypic), consisting of 399 compounds


Table 3: Average classification based results (± 1 standard deviation) for built models in KEGG and Reactome via k-fold cross-validation and different split types. KEGG Hamming loss Zero-one loss Jaccard coef. One error Precision Recall F1 Avg. precision Coverage Ranking loss LRAP AUC

Reactome

Random split

Scaffold split

0.021 (±3 × 10−4 ) 0.57 (±0.003) 0.5 (±0.003) 0.6 (±0.002) 0.78 (±0.005) 0.6 (±0.007) 0.67 (±0.005) 0.7 (±0.005) 41.31 (±0.314) 0.06 (±0.005) 0.62 (±0.002) 0.91 (±0.002)

0.028 (±0.01) 0.004 (±4 × 10−5 ) 0.77 (±0.138) 0.56 (±0.004) 0.31 (±0.174) 0.5 (±0.004) 0.45 (±0.173) 0.6 (±0.004) 0.62 (±0.183) 0.7 (±0.005) 0.39 (±0.205) 0.54 (±0.003) 0.45 (±0.217) 0.6 (±0.003) 0.51 (±0.21) 0.61 (±0.003) 61.5 (±26.079) 61.17 (±0.827) 0.1 (±0.044) 0.03 (±4 × 10−4 ) 0.47 (±0.17) 0.61 (±0.003) 0.79 (±0.128) 0.91 (±0.001)

Table 4: Classification based results on the Novartis dataset using a 90-10 temporal split validation procedure

0.077 0.909 0.226 0.424 0.506 0.454 0.454 0.495 101.502 0.159 0.421 0.811

Scaffold split 0.006 (±0.001) 0.82 (±0.148) 0.2 (±0.175) 0.33 (±0.2) 0.48 (±0.253) 0.29 (±0.204) 0.32 (±0.216) 0.51 (±0.224) 157.1 (±133.713) 0.08 (±0.075) 0.37 (±0.161) 0.69 (±0.264)

of which are approved drugs 55 Compounds that could either not be read by rdkit or were present in the training set were removed. For the kinase inhibitor set 124 compounds remained, for both the KEGG and Reactome models, while for the Novartis Phenotypic and Pubchem sets 376 and 375 compounds, and 415 and 401 molecules remained in the presented model order. It would be expected that compounds known to modify relevant biological targets function be predicted to hit a significantly higher number of predicted pathways compared to the DCM data set.

KEGG Reactome Hamming loss Zero-one loss Jaccard coef One error Precision Recall F1 Avg. precision Coverage Ranking loss LRAP AUC

Random split

0.019 0.904 0.205 0.419 0.477 0.312 0.349 0.368 181.342 0.072 0.402 0.831

Results on this comparison can be seen in Figure 3 for the KEGG and Reactome databases respectively. It can be appreciated that the trained models have a tendency to assign fewer pathways in the DCM set than in the other three, as the latter distributions are heavier tailed. Using a significance level of α = 0.05, we perform a two-sample independent Mann Whitney’s U test to test whether it is equally likely that a randomly selected number of predicted pathways value from the DCM set will be less than its counterpart from the other reported sets, respectively. With p-values of 4.869 × 10−9 , 1.935 × 10−12 and 9.212 × 10−10

that are impairing Huh7 cell viability with activity threshold of at least 5.5 pIC50 units and the third being a set of 800 compounds from PubChem. 52 The latter is an intersection of compounds between DrugBank, 53 the therapeutic target database, 54 ChEMBL and Thomson Pharma with either a USAN or INN record, with molecular weight less than 100 and salt-stripped. It thus forms a core set of concordant manually curated structures, most

ACS Paragon Plus Environment

9

Journal of Chemical Information and Modeling 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 17

F

O

F

O

F

O

O

N

NH

NH

F

NH O O

N

N

N

ChEMBL374810 KEGG: Th17 cell differentiation Reactome: RAF/MAP kinase cascade

ChEMBL2141422 KEGG: Dopaminergic synapse Reactome: Glucagon signaling in metabolic regulation

ChEMBL1887894 KEGG: Fluid shear stress and atherosclerosis Reactome: -

O

NH

O

N

O O

N

S

O

O

NH

O

S

NH

NH

O

ChEMBL1481218 KEGG: Chronic myeloid leukemia Reactome: RUNX3 regulates CDKN1A transcription

O

ChEMBL1879772 KEGG: GnRH signaling pathway Reactome: Glucagon-like Peptide-1 (GLP1) regulates insulin secretion

ChEMBL1505678 KEGG: Regulation of lipolysis in adipocytes Reactome: -

NH2 O N NH

S

N

O

Br

S

O

NH N

N N

O

O

O

S NH2

O

ChEMBL1543346 KEGG: Chagas disease (American trypanosomiasis) Reactome: G2/M DNA damage checkpoint

ChEMBL1526691 KEGG: Reactome: Fructose catabolism

ChEMBL1347848 KEGG: Reactome: Transport of Ribonucleoproteins into the Host Nucleus

Figure 2: Randomly selected deorphanized set of molecules from the Dark Chemical Matter compound set 24 and their top predicted pathway with both models. On the last row, a set of potentially promiscuous compounds that furthermore are reported active in at least one ChEMBL assay. in the presented dataset order we accept all proposed alternative hypothesis in both respective comparisons using the KEGG trained model. For the Reactome set a similar conclusion is drawn, with p-values of 3.021 × 10−6 , 5.748 × 10−20 and 9.155 × 10−7 in each compar-

ison respectively.
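Tables 3 and 4 report several multilabel metrics; three of them can be illustrated with a minimal sketch using their common textbook definitions. The exact definitions used in this work are given in the Supporting Information, so the conventions below are assumptions: the Jaccard coefficient is averaged per compound and set to 1 when both label sets are empty. The label matrices are toy values.

```python
from typing import Sequence

Labels = Sequence[Sequence[int]]


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual compound-pathway labels that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        1
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
        if t != p
    )
    return wrong / total


def zero_one_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of compounds whose full label vector is not predicted exactly."""
    mismatched = sum(
        1 for row_t, row_p in zip(y_true, y_pred) if list(row_t) != list(row_p)
    )
    return mismatched / len(y_true)


def jaccard_coefficient(y_true: Labels, y_pred: Labels) -> float:
    """Mean per-compound ratio of shared labels to the union of labels."""
    scores = []
    for row_t, row_p in zip(y_true, y_pred):
        intersection = sum(1 for t, p in zip(row_t, row_p) if t and p)
        union = sum(1 for t, p in zip(row_t, row_p) if t or p)
        # Assumed convention: two empty label sets agree perfectly.
        scores.append(intersection / union if union else 1.0)
    return sum(scores) / len(scores)


# Toy data: 3 compounds x 4 pathways (1 = associated, 0 = not associated).
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 1]]

h_loss = hamming_loss(y_true, y_pred)
z_loss = zero_one_loss(y_true, y_pred)
jaccard = jaccard_coefficient(y_true, y_pred)
```

With many labels and few positives per compound, the Hamming loss stays small even when most compounds have at least one wrong label, which is consistent with the gap between the Hamming and zero-one losses in the tables.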


Figure 3: Violin plots for the number of predicted biological pathways in the Novartis DCM, Kinase, Novartis Phenotypic and PubChem drug sets, for both KEGG (above) and Reactome (below) databases. The horizontal axis shows the number of predicted pathways (ticks from 0 to 16).
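The distributions summarised in Figure 3 are compared in the text with a Mann–Whitney U test. The following is a minimal standard-library sketch of a one-sided version of that test, using the normal approximation with tie and continuity corrections. The choice of approximation, the function name and the toy counts are assumptions for illustration; the paper does not state which implementation was used.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u_less(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """One-sided Mann-Whitney U test; alternative: values in x tend to be smaller than in y.

    Returns the U statistic of x and an approximate p-value (normal
    approximation with tie correction and continuity correction).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and share their average rank.
        average_rank = (i + 1 + j) / 2.0
        group_size = j - i
        tie_term += group_size ** 3 - group_size
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0.0:
        return u_x, 1.0  # all values identical: no evidence either way
    z = (u_x - mean_u + 0.5) / math.sqrt(var_u)
    p_value = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return u_x, p_value


# Toy numbers of predicted pathways per compound (made-up values).
dcm_counts = [0, 0, 0, 1, 0, 1, 0, 2]
drug_counts = [1, 2, 3, 5, 2, 4, 8, 3]

u_stat, p_value = mann_whitney_u_less(dcm_counts, drug_counts)
reject_null = p_value < 0.05  # significance level used in the text
```

For the sample sizes reported in this section the normal approximation is the usual choice; for very small samples an exact test would be preferable.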

Implementation, performance and availability

An open implementation of the application is available via PlayMolecule.org, where users can freely submit their compounds for evaluation in either SMI or SDF format. Prediction speed may vary with the number of ligands and their complexity, but users can expect a throughput of several thousand molecules per second using an NVIDIA GTX 1080 Ti graphics card, and on the order of hundreds without GPU acceleration. The model presented here was developed purely in Python, using the PyTorch 56 framework (version 0.3.1) for tensor computations and neural network training.
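As an illustration of the SMI input format mentioned above, the sketch below reads SMILES records from text lines. It assumes the common layout of one SMILES string per line, optionally followed by whitespace and a compound name; the function name and the fallback naming scheme are hypothetical, no chemical validation is performed, and this is not the parser used by PlayMolecule.

```python
from typing import Iterable, List, Tuple


def parse_smi_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (smiles, name) pairs from SMI-style lines, skipping blank lines."""
    records = []
    for index, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comment lines
        parts = line.split(None, 1)  # SMILES first, optional name after whitespace
        smiles = parts[0]
        # Hypothetical fallback name built from the line number.
        name = parts[1].strip() if len(parts) > 1 else f"mol_{index}"
        records.append((smiles, name))
    return records


# Toy input: three molecules, one blank line, one record without a name.
example_lines = ["CCO ethanol\n", "\n", "c1ccccc1\tbenzene\n", "CC(=O)O\n"]
records = parse_smi_lines(example_lines)
```

In practice a cheminformatics toolkit such as RDKit would be used to check that each SMILES string can actually be read before submission.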

Discussion

In this work we have developed a pathway prediction method based on a deep learning architecture and evaluated its performance using different types of splits on both publicly available data extracted from ChEMBL and pharmaceutical data provided by Novartis. We have further exemplified multiple applications in discovery programs. While both KEGG and Reactome feature a relatively extensive catalogue of metabolic and signalling interactions, the model can potentially be expanded to work with other pathway databases such as WikiPathways 57 or the Small Molecule Pathway Database (SMPDB). 58 In fact, a promising direction for future work is to build a single multilabel model with information from a large set of pathway databases, so as to reduce ligand redundancy. Another idea is to use transfer learning techniques, 59 that is, using the trained weights of one network trained on a set of pathways (e.g., KEGG) as a prior for another (e.g., Reactome).
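The transfer learning idea is only proposed here, not implemented. One possible reading is sketched below in plain Python rather than PyTorch so that it stays self-contained: the hidden-layer weights of a source network are copied into a new network, and only the output layer is re-initialised for the new label set. The layer names, layer sizes and initialisation scheme are all hypothetical.

```python
import copy
import random
from typing import Dict, List

Matrix = List[List[float]]  # weights stored as (out_features, in_features)


def init_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a stand-in for a real initialisation scheme."""
    return [[rng.gauss(0.0, 0.01) for _ in range(cols)] for _ in range(rows)]


def transfer_hidden_layers(
    source: Dict[str, Matrix],
    n_target_labels: int,
    output_key: str = "output",
    seed: int = 0,
) -> Dict[str, Matrix]:
    """Copy every layer except the output layer, then add a fresh output layer."""
    rng = random.Random(seed)
    target = {
        name: copy.deepcopy(weights)
        for name, weights in source.items()
        if name != output_key
    }
    hidden_width = len(source[output_key][0])  # input size of the output layer
    target[output_key] = init_matrix(n_target_labels, hidden_width, rng)
    return target


# Toy "source" network with 3 output labels (sizes are made up).
rng = random.Random(42)
source_weights = {
    "hidden1": init_matrix(4, 6, rng),
    "hidden2": init_matrix(4, 4, rng),
    "output": init_matrix(3, 4, rng),
}

# Initialise a "target" network with 5 output labels from the source weights.
target_weights = transfer_hidden_layers(source_weights, n_target_labels=5)
```

The copied layers would then be fine-tuned on the target pathway annotations, so the source weights act as a starting point rather than a fixed feature extractor.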

After exemplifying the opportunities associated with the use of a large-scale pathway-centric approach to predict small molecule activity, we identified some of the challenges associated with the use of the described method. Model performance relies on the availability of high-quality in vitro assay data and up-to-date pathway annotation. Since both the KEGG and Reactome databases are actively maintained, model performance is likely to improve over time. We believe the tool to be particularly useful in annotating molecular-pathway associations in a discovery project (i.e., to decide in a timely manner whether to keep a compound in a screening set or drop it).

Furthermore, our production models were trained using only publicly available data. Pharmacologically active molecules from ChEMBL may be reactive (i.e., acting as false positives in the assay readout 49 ) or may not correspond to the depicted structure (lack of a sample quality control report). We consider the data associated with these molecules a source of inherent bias, and therefore of prediction error, that may be reduced after substantial effort in data curation. As we show in our collaboration with Novartis, training the model within drug or agrochemical discovery companies, which have at their disposal large sets of high-quality data not available to the general scientific community, has the potential to improve its performance. Finally, we believe the proposed model can help overcome the current challenge of predicting the biological effects of drugs before further investing resources in lead confirmation or optimization programs.

Acknowledgement

The authors thank Acellera Ltd. for funding. G.D.F. acknowledges support from MINECO (BIO2017-82628P) and FEDER. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 675451 (CompBioMed project).

Supporting Information Available

The following files are available free of charge.
• supp_info.pdf: Supporting information containing information about multilabel classification metrics, descriptive information on the training sets, the PyTorch network architecture and the query used to extract all ChEMBL ligands used in this work.

References

(1) Kola, I.; Landis, J. Can the Pharmaceutical Industry Reduce Attrition Rates? Nat. Rev. Drug Discovery 2004, 3, 711.
(2) Lamberth, C.; Jeanmart, S.; Luksch, T.; Plant, A. Current Challenges and Trends in the Discovery of Agrochemicals. Science 2013, 341, 742–746.

(3) Awale, M.; Visini, R.; Probst, D.; Arús-Pous, J.; Reymond, J.-L. Chemical Space: Big Data Challenge for Molecular Diversity. Chimia 2017, 71, 661–666.
(4) Chowdhury, S.; Sarkar, R. R. Comparison of Human Cell Signaling Pathway Databases-Evolution, Drawbacks and Challenges. Database 2015, 2015.
(5) Cai, Y.-D.; Qian, Z.; Lu, L.; Feng, K.-Y.; Meng, X.; Niu, B.; Zhao, G.-D.; Lu, W.-C. Prediction of Compounds’ Biological Function (Metabolic Pathways) Based on Functional Group Composition. Mol. Diversity 2008, 12, 131–137.
(6) Lu, J.; Niu, B.; Liu, L.; Lu, W.-C.; Cai, Y.-D. Prediction of Small Molecules’ Metabolic Pathways Based on Functional Group Composition. Protein and Peptide Letters 2009, 16, 969–976.
(7) Schapire, R. E. A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence 1999, 14, 771–780.
(8) Macchiarulo, A.; Thornton, J. M.; Nobeli, I. Mapping Human Metabolic Pathways in the Small Molecule Chemical Space. J. Chem. Inf. Model. 2009, 49, 2272–2289.
(9) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
(10) Hamdalla, M. A.; Rajasekaran, S.; Grant, D. F.; Mandoiu, I. I. Metabolic Pathway Predictions for Metabolomics: A Molecular Structure Matching Approach. J. Chem. Inf. Model. 2015, 55, 709–718.
(11) Hu, L.-L.; Chen, C.; Huang, T.; Cai, Y.-D.; Chou, K.-C. Predicting Biological Functions of Compounds Based on Chemical-Chemical Interactions. PLoS One 2011, 6, 1–9.
(12) Gao, Y. F.; Chen, L.; Cai, Y. D.; Feng, K. Y.; Huang, T.; Jiang, Y. Predicting Metabolic Pathways of Small Molecules and Enzymes Based on Interaction Information of Chemicals and Proteins. PLoS One 2012, 7, 1–9.
(13) Kuhn, M.; von Mering, C.; Campillos, M.; Jensen, L. J.; Bork, P. STITCH: Interaction Networks of Chemicals and Proteins. Nucleic Acids Res. 2008, 36.
(14) von Mering, C.; Huynen, M.; Jaeggi, D.; Schmidt, S.; Bork, P.; Snel, B. STRING: A Database of Predicted Functional Associations between Proteins. 2003.
(15) Ma, H.; Zhao, H. Ifad: An Integrative Factor Analysis Model for Drug-Pathway Association Inference. Bioinformatics 2012, 28, 1911–1918.
(16) Ma, H.; Zhao, H. FacPad: Bayesian Sparse Factor Modeling for the Inference of Pathways Responsive to Drug Treatment. Bioinformatics 2012, 28, 2662–2670.
(17) Song, M.; Yan, Y.; Jiang, Z. Drug-Pathway Interaction Prediction via Multiple Feature Fusion. Mol. BioSyst. 2014, 10, 2907–2913.
(18) Chen, F.-S.; Jiang, H.-Y.; Jiang, Z. Prediction of Drug-Pathway Interaction Pairs with a Disease-Combined LSA-PU-KNN Method. Mol. BioSyst. 2017, 13, 2583–2591.
(19) Ogata, H.; Goto, S.; Sato, K.; Fujibuchi, W.; Bono, H.; Kanehisa, M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999, 27, 29–34.
(20) Fabregat, A.; Sidiropoulos, K.; Garapati, P.; Gillespie, M.; Hausmann, K.; Haw, R.; Jassal, B.; Jupe, S.; Korninger, F.; McKay, S.; Matthews, L.; May, B.; Milacic, M.; Rothfels, K.; Shamovsky, V.; Webber, M.; Weiser, J.; Williams, M.; Wu, G.; Stein, L.; Hermjakob, H.; D’Eustachio, P. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 2016, 44, D481–D487.
(21) Apweiler, R.; Bairoch, A.; Wu, C. H.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Natale, D. A.; O’Donovan, C.; Redaschi, N.; Yeh, L.-S. L. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 2004, 32, D115–9.
(22) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B. ChEMBL: a Large-scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2011, 40, D1100–D1107.
(23) Hay, M.; Thomas, D. W.; Craighead, J. L.; Economides, C.; Rosenthal, J. Clinical Development Success Rates for Investigational Drugs. Nat. Biotechnol. 2014, 32, 40–51.
(24) Wassermann, A. M.; Lounkine, E.; Hoepfner, D.; Le Goff, G.; King, F. J.; Studer, C.; Peltier, J. M.; Grippo, M. L.; Prindle, V.; Tao, J.; Schuffenhauer, A.; Wallace, I. M.; Chen, S.; Krastel, P.; Cobos-Correa, A.; Parker, C. N.; Davies, J. W.; Glick, M. Dark Chemical Matter as a Promising Starting Point for Drug Lead Discovery. Nat. Chem. Biol. 2015, 11, 958–966.
(25) Klambauer, G.; Unterthiner, T.; Mayr, A.; Hochreiter, S. Self-Normalizing Neural Networks. Advances in Neural Information Processing Systems. 2017; pp 971–980.
(26) Hauser, A. S.; Attwood, M. M.; Rask-Andersen, M.; Schiöth, H. B.; Gloriam, D. E. Trends in GPCR Drug Discovery: New Agents, Targets and Indications. Nat. Rev. Drug Discovery 2017, 16, 829.
(27) Zhang, M.-L.; Zhou, Z.-H. A Review on Multi-Label Learning Algorithms. IEEE TKDE 2014, 26, 1819–1837.
(28) Schapire, R. E.; Singer, Y. BoosTexter: A Boosting-based System for Text Categorization. Mach. Learn. 2000, 39, 135–168.
(29) Tsoumakas, G.; Katakis, I. Multi-Label Classification: An Overview. Int. J. Data Warehous. Min. 2007, 3, 1–13.
(30) Abu-El-Haija, S.; Kothari, N.; Lee, J.; Natsev, P.; Toderici, G.; Varadarajan, B.; Vijayanarasimhan, S. YouTube-8M: A Large-Scale Video Classification Benchmark. ArXiv e-prints 2016.
(31) Barutcuoglu, Z.; Schapire, R. E.; Troyanskaya, O. G. Hierarchical Multi-Label Prediction of Gene Function. Bioinformatics 2006, 22, 830–836.
(32) Clare, A.; King, R. D. Knowledge Discovery in Multi-Label Phenotype Data. European Conference on Principles of Data Mining and Knowledge Discovery. 2001; pp 42–53.
(33) Landrum, G. RDKit: Open-source cheminformatics. Accessed Sept. 2018. (Online). http://www.rdkit.org. 2006, 3, 2012.
(34) Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; Dahl, G. E. Neural Message Passing for Quantum Chemistry. arXiv preprint arXiv:1704.01212 2017.
(35) LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436.
(36) Schmidhuber, J. Deep Learning in Neural Networks: An Overview. Neural Netw. 2015, 61, 85–117.
(37) Krizhevsky, A.; Sutskever, I.; Hinton, G. E. Imagenet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems. 2012; pp 1097–1105.
(38) Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. Advances in Neural Information Processing Systems. 2013; pp 3111–3119.
(39) Friedman, J. H. Greedy Function Approximation: a Gradient Boosting Machine. Ann. Stat. 2001, 1189–1232.
(40) Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167 2015.
(41) Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: a Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
(42) Read, J.; Pfahringer, B.; Holmes, G.; Frank, E. Classifier Chains for Multi-Label Classification. Mach. Learn. 2011, 85, 333–359.
(43) Kramer, C.; Gedeck, P. Leave-Cluster-Out Cross-Validation Is Appropriate for Scoring Functions Derived from Diverse Protein Data Sets. J. Chem. Inf. Model. 2010, 50, 1961–1969.
(44) Sheridan, R. P. Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. J. Chem. Inf. Model. 2013, 53, 783–790.
(45) Wallach, I.; Heifets, A. Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization. J. Chem. Inf. Model. 2018, 58, 916–932.
(46) Kingma, D. P.; Ba, J. Adam: a Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 2014.
(47) Wassermann, A. M.; Tudor, M.; Glick, M. Deorphanization Strategies for Dark Chemical Matter. Drug Discovery Today: Technol. 2017, 23, 69–74.
(48) VanBrocklin, H. F.; Lim, J. K.; Coffing, S. L.; Hom, D. L.; Negash, K.; Ono, M. Y.; Gilmore, J. L.; Bryant, I.; Riese, D. J. Anilinodialkoxyquinazolines: Screening Epidermal Growth Factor Receptor Tyrosine Kinase Inhibitors for Potential Tumor Imaging Probes. J. Med. Chem. 2005, 48, 7445–7456.
(49) Pouliot, M.; Jeanmart, S. Pan Assay Interference Compounds (PAINS) and Other Promiscuous Compounds in Antifungal Research: Miniperspective. J. Med. Chem. 2015, 59, 497–503.
(50) Tocriscreen Kinase Inhibitor Toolbox I & II. Accessed Sept. 2018. https://www.tocris.com/products/tocriscreen-kinase-inhibitortoolbox-1-and-2_6268.
(51) Pandey, R.; Kumar, R.; Gupta, P.; Mohmmed, A.; Tewari, R.; Malhotra, P.; Gupta, D. High Throughput in Silico Identification and Characterization of Plasmodium Falciparum PRL Phosphatase Inhibitors. J. Biomol. Struct. Dyn. 2017, 1–10.
(52) Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B. A. PubChem Substance and Compound Databases. Nucleic Acids Res. 2015, 44, D1202–D1213.
(53) Wishart, D. S.; Feunang, Y. D.; Guo, A. C.; Lo, E. J.; Marcu, A.; Grant, J. R.; Sajed, T.; Johnson, D.; Li, C.; Sayeeda, Z. DrugBank 5.0: a Major Update to the DrugBank Database for 2018. Nucleic Acids Res. 2017, 46, D1074–D1082.
(54) Li, Y. H.; Yu, C. Y.; Li, X. X.; Zhang, P.; Tang, J.; Yang, Q.; Fu, T.; Zhang, X.; Cui, X.; Tu, G.; Zhang, Y.; Li, S.; Yang, F.; Sun, Q.; Qin, C.; Zeng, X.; Chen, Z.; Chen, Y. Z.; Zhu, F. Therapeutic Target Database Update 2018: Enriched Resource for Facilitating Bench-to-Clinic Research of Targeted Therapeutics. Nucleic Acids Res. 2018, 46, D1121–D1127.
(55) PubChem Selected Compounds. Accessed Sept. 2018. http://www.ncbi.nlm.nih.gov/sites/myncbi/collections/public/1jy48mElsL23PiGUpF2n-dAh/.
(56) Paszke, A.; Chanan, G.; Lin, Z.; Gross, S.; Yang, E.; Antiga, L.; Devito, Z. Automatic Differentiation in PyTorch. Advances in Neural Information Processing Systems 30 2017, 1–4.
(57) Slenter, D. N.; Kutmon, M.; Hanspers, K.; Riutta, A.; Windsor, J.; Nunes, N.; Mélius, J.; Cirillo, E.; Coort, S. L.; Digles, D.; Ehrhart, F.; Giesbertz, P.; Kalafati, M.; Martens, M.; Miller, R.; Nishida, K.; Rieswijk, L.; Waagmeester, A.; Eijssen, L. M.; Evelo, C. T.; Pico, A. R.; Willighagen, E. L. WikiPathways: A Multifaceted Pathway Database Bridging Metabolomics to Other Omics Research. Nucleic Acids Res. 2018, 46, D661–D667.
(58) Jewison, T.; Su, Y.; Disfany, F. M.; Liang, Y.; Knox, C.; Maciejewski, A.; Poelzer, J.; Huynh, J.; Zhou, Y.; Arndt, D.; Djoumbou, Y.; Liu, Y.; Deng, L.; Guo, A. C.; Han, B.; Pon, A.; Wilson, M.; Rafatnia, S.; Liu, P.; Wishart, D. S. SMPDB 2.0: Big Improvements to the Small Molecule Pathway Database. Nucleic Acids Res. 2014, 42, D478–D484.
(59) Thrun, S.; Pratt, L. Learning to Learn; Springer Science & Business Media, 2012.


Graphical TOC Entry

[Graphic: compounds supplied as .sdf or .smi files, shown as 2D structures alongside SMILES strings, are submitted to PlayMolecule.org.]
