From Target to Drug: Generative Modeling for Multimodal Structure


Article

From Target to Drug: Generative Modeling for Multimodal Structure-Based Ligand Design Miha Skalic, Davide Sabbadin, Boris Sattarov, Simone Sciabola, and Gianni De Fabritiis Mol. Pharmaceutics, Just Accepted Manuscript • DOI: 10.1021/acs.molpharmaceut.9b00634 • Publication Date (Web): 22 Aug 2019 Downloaded from pubs.acs.org on August 23, 2019



Molecular Pharmaceutics

From Target to Drug: Generative Modeling for Multimodal Structure-Based Ligand Design

Miha Skalic,† Davide Sabbadin,† Boris Sattarov,† Simone Sciabola,‡ and Gianni De Fabritiis∗,†,¶,§

†Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), C Dr Aiguader 88, 08003 Barcelona, Spain.
‡Biogen Chemistry and Molecular Therapeutics, 115 Broadway Street, Cambridge, MA 02142, USA.
¶Acellera, Barcelona Biomedical Research Park (PRBB), C Dr. Aiguader 88, 08003, Barcelona, Spain.
§Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig Lluis Companys 23, 08010 Barcelona, Spain.

E-mail: [email protected]

Abstract

Chemical space is impractically large, and conventional structure-based virtual screening techniques cannot simply search through the entire space to discover effective bioactive molecules. To address this shortcoming, we propose a generative adversarial network that generates, rather than searches for, diverse three-dimensional ligand shapes complementary to the pocket. Furthermore, we show that the generated molecule shapes can be decoded by a shape-captioning network into sequences of SMILES, directly enabling structure-based de novo drug design. We evaluate the quality of the method with both structure-based (docking) and ligand-based (QSAR) virtual screening methods. For both evaluation approaches we observed enrichment compared to random sampling from the initial chemical space of ZINC drug-like compounds.

keywords: Structure Based Drug Design, Deep Learning, Generative modeling

1 Introduction

Structure-based drug design 1 approaches start with target identification, where a target, typically a protein, is involved in a biological process that might be modulated through the functional binding of a compound. Target structure determination usually follows, either by experimental methods or homology modeling, and suitable compound binding pockets are identified. Next, potential binders are docked, 2 i.e. positioned into the pocket based on their steric constraints and molecular interactions, then scored and ranked based on their fit to the defined protein pocket region. This approach has several potential disadvantages. 3 Notably, virtual screening using docking relies on a predefined library of compounds that is orders of magnitude smaller than the chemical space of small C-, O-, N- and S-containing molecules, estimated to exceed $10^{60}$ (ref 4). Virtual and tangible accessible chemistry spaces are destined to grow over the next years. 5 However, a bigger region of chemical space can be explored through de novo design of ligands that can be accommodated in the protein pocket without relying on a predefined library of compounds, 6 which pharmaceutical and agrochemical companies still widely use in their projects.

At the same time, deep learning is enhancing and transforming the drug discovery process at a rapid pace, 7,8 affecting fields such as retrosynthesis route construction, 9 chemical space exploration, 10 and compound property prediction, 11 to name a few. Three-dimensional representations and processing techniques are one of the cornerstones of deep learning applied to drug design. Examples of such applications include protein binding classification, 12–14 affinity prediction 15,16 and pocket classification 17 as well as similarity comparison. 18

In this work we present a novel method to generate focused virtual libraries of small molecules based on a protein structure using deep learning-based generative models. Structures of protein-ligand complexes obtained from ligand docking are used to train a generative adversarial model to generate compound structures that are complementary to the protein but also diverse among themselves. By generating compound structures for targets not in the training set, we show that the model can generate plausible ligand shapes for previously unseen targets. The presented method is the first application of generative adversarial modeling to structure-based drug design using shape-based information.

The main contribution of this work is twofold. Firstly, we propose a generative adversarial network for multimodal generation of ligand shapes, starting from a protein representation. Secondly, we show that these shapes can be decoded into grammatically correct SMILES strings corresponding to valid molecular structures. Therefore, novel potential ligands for structurally known protein targets can be designed.

1.1 Related Work

Structure-based de novo drug design. Typically, structure-based de novo drug design is performed by placing fragments into the protein binding pocket and growing the ligand, or by performing several iterations of fragment placement and linking the most favourably fitting ones. 19 Although ligand-based generative models have received a lot of attention recently, 20 structure-based methods are lagging behind. Recently, two deep learning methods were proposed: a graph-based machine learning method for ligand design 21 and LigVoxel, 22 a three-dimensional CNN approach. The former takes as input a graph representation of a protein pocket and encodes it into a latent vector, which is then used by further networks, including a compound generation network. In LigVoxel, on the other hand, a neural network is trained to generate ligand shapes starting from protein pocket voxelizations. One shortcoming of LigVoxel is that its outputs suffer from the same issues as the outputs of image-to-image translation networks trained with an L2 or binary cross-entropy loss, i.e. the output is blurred and unspecific, in addition to being unimodal. Hence, the output resembles molecular interaction fields, a relatively broad spatial map of favorable placements for certain atom or group types, rather than specific ligand shapes. Although LigVoxel offers some variability in the output by changing the number of ligand atoms in the input, one must select these values a priori. In the work presented here the networks do not require this information: the size of the ligand is inferred from other protein-ligand training examples.

Generative adversarial networks (GANs). GANs, 23 a subgroup of generative models, were initially proposed as a method for image generation. A generator (G) is trained to generate images that fool a discriminator (D), which in turn tries to distinguish between generated and real samples. Essentially, the objective is to find a Nash equilibrium of a value function V for a two-player min-max problem:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \tag{1}$$


where z is a latent variable, most commonly drawn from a low-dimensional Gaussian or uniform distribution. In recent years, multiple GAN improvements have been proposed, 24–28 further improving the quality of the generated outputs. GANs map values from an initial distribution to an output in a single feed-forward pass, enabling faster generation than sequential 29 output generation. GAN training has also been applied to image-to-image translation, 30 G(A) → B, to generate complementary images, e.g. photographs from edge-maps. GAN-style training allowed generation of images with less blur than a naive approach such as minimizing the Euclidean distance between predicted and ground truth labels. This framework was later extended to work on unpaired images 31 and, most importantly for this work, to multimodal image generation using a method called BicycleGAN. 32 With multimodal transformations one can map a single image to multiple complementary images, for example generating photographs with different colors from the same edge-map. Briefly, BicycleGAN tries to generate a diverse set of outputs $\hat{B}$ based on input A, where different modes of the distribution p(B|A) are taken into account. Typically, image-to-image GAN generators G, G(A, z) → $\hat{B}$, ignore the latent code z and generate the same output, dependent only on input A. This is known as mode collapse or unimodal output. BicycleGAN avoids this by introducing cycles of network transformations and by incentivizing reconstruction of the original z from the generated $\hat{B}$. In the work presented here protein shapes (A) are mapped to ligand shapes (B), and by varying the latent code z different ligand shapes are generated.
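To make the objective in Equation 1 concrete, the sketch below estimates V(D, G) from a batch of discriminator outputs. It is a minimal illustration rather than the authors' training code: the function name `gan_value` and the toy probabilities are assumptions introduced here, and the expectations are replaced by plain sample means.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from Equation 1.

    d_real: discriminator outputs D(x) for real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) for generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from generated samples yields a high value.
confident = gan_value([0.9, 0.9], [0.1, 0.1])
# A discriminator that cannot tell them apart outputs 0.5 everywhere.
fooled = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator is trained to increase this value and the generator to decrease it. When the discriminator outputs 0.5 for every sample, the estimate equals -log 4.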

2 Method

We propose a three-dimensional convolutional neural network that generates ligand representations complementary to the input protein pocket (Figure 1 top), where both input pockets and output ligands are represented as voxels. Subsequently, a captioning network decodes the shapes into SMILES strings (Figure 1 bottom). The model's networks and training regime are based on the BicycleGAN image model, adapted for input with three spatial dimensions (Figure 2A). To capture diverse multimodality we trained the model with active compounds docked to their targets. Docked poses might fail to represent the real poses, but docking gives us far more data to train on and keeps the same rigid protein structure paired with different ligands, which helps to generate diverse shapes complementary to the protein pocket. By generating diverse shapes we can capture diverse pharmacophoric models and compound scaffolds that can bind to a protein pocket.

Figure 1: Generative pipeline. (A) Firstly, a protein pocket is selected and voxelized. Secondly, the voxel arrays are fed into a generator network together with a latent code z to generate a voxel array representing a ligand, complementary to the pocket. By varying the latent code z we can modify the generated compounds. (B) The generated ligand shapes are then processed by a shape-captioning network, outputting a sequence of SMILES tokens.

Proteins from the DUD-E dataset 33 were used as inputs to train the LiGANN model. Known binders for each protein were used as the generation targets. The generative model was trained on 101 targets and a total of 11256 binding compounds. For each target, 39 to 592 known binders, with a per-target mean of 111, were docked using Smina 34 with a box size of 20 Å and default settings. The highest scoring pose for each compound was kept for training.

Featurization was done in the following manner. For ligands, five channels were used, one for each considered atom type: hydrophobic, aromatic, H-bond donors, H-bond acceptors and heavy atoms (occupancy). The atom types were computed using the RDKit software. Proteins had two additional channels: positive and negative Gasteiger partial charge. 35 Protein atom types were computed using AutoDock 4 atom typing. 36 Molecule atoms were then voxelized into a discretized 1 Å cubic grid of side size 24 Å. The value at each voxel is determined by the atom type and the distance r between neighboring atoms and the voxel center:

$$n(r) = 1 - \exp\left(-\left(\frac{r_{\mathrm{vdw}}}{r}\right)^{12}\right), \tag{2}$$

where $r_{\mathrm{vdw}}$ is the corresponding van der Waals radius of a particular atom. To avoid overfitting and for better generalization, the compound-receptor complexes were randomly rotated and translated (2 Å displacement of the voxelization center from the ligand center) before being transformed into the voxelized representation.

Network Architectures. We followed the model architectures of the original two-dimensional BicycleGAN implementation as closely as possible, adapting them for three-dimensional input and taking into account the memory constraints of a single Nvidia GTX 1080 Ti GPU. For the generator G, a three-dimensional version of U-Net 37 was used. For downsampling, eight convolutional layers were used, with 64, 128, 256, 512, 512, 512, 512 and 512 filters, respectively. The 2nd, 4th and 6th layers used strided convolutions (s = 2) to downsample the array size. Upsampling followed the same filter pattern in reverse, applying transposed convolutions. All convolutions used kernels of size 3. Leaky ReLUs (α = 0.2) after convolutions and ReLUs after transposed convolutions were used as the activation functions, except for the final output, which used a sigmoid activation. The sigmoid activation constrains the output array values to the (0, 1) interval. Between the convolution and the activation operations InstanceNorm 38 was applied, normalizing the values in each channel per sample. As discriminators we used two patchGAN sequential convolutional neural networks. With patchGAN the discriminator evaluates patches of the generated shape, rather than evaluating the representation in one piece. The networks have 64, 128, 256, 512, 512, 512 and 32, 64, 126, 256, 256, 256 filters, with InstanceNorm and Leaky ReLUs (α = 0.2) between each convolution. Strided convolution (s = 2) was used on the 2nd and 4th convolutions. The encoder E is also a sequential model, consisting of 5 convolutional layers, each instance-normalized and activated with a ReLU. The numbers of filters used were 64, 128, 128, 256 and 256. After the 2nd, 3rd, 4th and 5th layers the array dimensions were reduced by applying 3D average pooling (s = 2).

[Figure 2 panels: (A) BicycleGAN: latent code, encoder, protein pocket, generator, ligand shape, discriminator, generated or real. (B) Captioning network: ligand shape, variational autoencoder, ligand shape, captioning network, SMILES. (C) LiGANN pipeline: latent code, protein pocket, generator, ligand shape, captioning network, SMILES.]

Figure 2: Networks used in LiGANN. BicycleGAN (A) is trained to map protein pocket shapes to ligand shapes and the shape-captioning network (B) is trained to map ligand shapes to SMILES strings. Finally, elements from both models are combined into the LiGANN pipeline (C).

Captioning Network. In order to decode the generated shapes into SMILES strings, a variant of the ligand shape-captioning model 39 was employed with the same training data as the original work, namely compounds from the drug-like ZINC15 database. 40 Briefly, the shape-captioning network consists of three-dimensional convolutional and recurrent long short-term memory (LSTM) 41 networks. The LSTM outputs a sequence of SMILES tokens given the convolutional network output, much like image-captioning networks that output a sequence of words. Compared to the reference network, 39 here we do not use conditional input as the input to the variational autoencoder, since the shape generation network outputs only ligand shapes and not the pharmacophoric points (Figure 2B). Furthermore, although during training we use a combination of a variational autoencoder and a captioning network, for inference we used only the captioning network (Figure 2C). The autoencoder was used to ensure that captioning can work on perturbed shapes. Network architectures are available in the Supporting Information.

Training details. Like the original BicycleGAN, the model was based on the LSGAN 42 variant, which uses a least-squares objective function for optimization. Unlike the original BicycleGAN, the input to the discriminator was a concatenated representation of protein and ligand. A latent code dimension of |z| = 8 was used. The code was concatenated to all intermediate layers of the network G. Networks were optimized with the Adam 43 optimizer (α = 0.0002) and updated with gradients obtained from a single sample each time. Training ran for a total of 100 epochs: 50 with the default learning rate, followed by 50 with the learning rate decaying linearly towards 0.
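As an illustration of the featurization step, the occupancy function of Equation 2 and a toy voxelizer can be sketched as follows. This is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the names `occupancy` and `voxelize`, the placement of voxel centers, the early return close to the atom, and the use of a maximum to merge overlapping atoms in the same channel are all choices made here, since the text does not specify them.

```python
import math


def occupancy(r, r_vdw):
    """Equation 2: n(r) = 1 - exp(-(r_vdw / r)^12)."""
    # Very close to the atom the exponent is huge and n(r) is 1 to machine
    # precision; returning early also avoids division by zero and overflow.
    if r <= r_vdw / 10.0:
        return 1.0
    return 1.0 - math.exp(-((r_vdw / r) ** 12))


def voxelize(atoms, center, n_channels, size=24, resolution=1.0):
    """Fill an n_channels x size x size x size grid centered on `center`.

    atoms: iterable of (x, y, z, r_vdw, channel) tuples, coordinates in angstrom.
    """
    half = (size - 1) / 2.0
    # Voxel-center coordinates along each axis (assumed symmetric about center).
    axes = [[c + (i - half) * resolution for i in range(size)] for c in center]
    grid = [[[[0.0] * size for _ in range(size)] for _ in range(size)]
            for _ in range(n_channels)]
    for x, y, z, r_vdw, channel in atoms:
        for i, vx in enumerate(axes[0]):
            for j, vy in enumerate(axes[1]):
                for k, vz in enumerate(axes[2]):
                    r = math.dist((x, y, z), (vx, vy, vz))
                    value = occupancy(r, r_vdw)
                    # Assumption: overlapping atoms are merged with a maximum.
                    if value > grid[channel][i][j][k]:
                        grid[channel][i][j][k] = value
    return grid


# One hypothetical atom with r_vdw = 1.7 angstrom at the grid center, channel 0.
demo = voxelize([(0.0, 0.0, 0.0, 1.7, 0)], center=(0.0, 0.0, 0.0),
                n_channels=2, size=4)
```

With the default `size=24` and `resolution=1.0` the grid matches the 24 Å box with 1 Å voxels described above. The steep twelfth power makes the occupancy close to 1 inside the van der Waals radius and lets it fall off quickly outside it.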

3 Results

In this section we show shapes generated by the proposed generative modeling approach and compare them to the previously proposed nongenerative method LigVoxel. 22 Furthermore, we show that, using a shape-captioning network, the generated shapes can be decoded into grammatically correct SMILES corresponding to valid molecular structures, and we provide insight into the relationship between generated compounds and their parent shapes. It is difficult to assess the performance of the model from a computational point of view. We decided to compare how virtual screening tools such as QSAR and docking rank generated compounds relative to ones randomly selected from ZINC. Of course this approach depends on the accuracy of the QSAR and docking methods; nevertheless, we do find enrichment when applying the screening methods. Throughout this section three drug targets not included in the DUD-E dataset 33 were used to demonstrate the proposed model's performance in detail: delta opioid 7TM receptor (PDB ID 4N6H), 44 Serine/threonine-protein kinase CHK1 (PDB ID 1ZYS) 45 and Serine/threonine-protein kinase TNNI3K (PDB ID 4YFI). 46 These targets also possess a long history of drug research. In the last decade multiple chemotypes have been discovered to modulate or block the function of these proteins.

3.1 Shape Generation

Figure 3 shows an example of generated shapes (shapes 1-8) for the ATP binding pocket of Serine/threonine-protein kinase TNNI3K (PDB ID: 4YFI) using the LigVoxel and LiGANN methods. To get diverse shapes for LigVoxel we used different atom counts as the conditional input, while for LiGANN we randomly sampled z values from a standard normal distribution. There is noticeably little variability in the LigVoxel-generated shapes (shapes 1-4, Figure 3A). Mostly, the shapes for a property simply grow or shrink depending on the input atom count, with shapes 1, 2 and 4 almost indistinguishable. On the other hand, the method presented here is capable of generating more diverse shapes (shapes 5-8, Figure 3B). For example, H-bond acceptors (red isosurface) appear at different locations. Inspection of the different co-crystallized ligands at the same binding pocket (PDB ID: 4YFI, 4YFF, 6B5J) revealed the diversity of ligand features that the binding pocket is able to accommodate. The second observable advantage is that LiGANN, unlike LigVoxel, generates distinguishable shapes (not spread-out blobs), making determination of property locations less ambiguous. Like LigVoxel, the LiGANN model learns to avoid steric clashes with the protein and can generate feasible pharmacophoric models. For example, Figure 4 shows a generated shape (shape 6 from Figure 3) in the 4YFI pocket. For that shape two out of three H-bond acceptors are solvent exposed, while the third could form an H-bond with the protein's phenylalanine 607.

Figure 3: Ligand shapes produced by LigVoxel (A) and by the method presented herein, LiGANN (B), based on the 4YFI protein crystal structure. For reference, diverse co-crystallized ligands of Serine/threonine-protein kinase TNNI3K are shown (4YFF and 6B5J; C). Green, red and blue isosurfaces represent aromatic, H-bond acceptor and H-bond donor atom occupancy, respectively. Ligand shape (any atom occupancy) is displayed as a grid. The LigVoxel conditional input atom counts for aromatic carbons, H-bond donors, H-bond acceptors and heavy atoms were as follows. Shape 1: 12, 4, 3, 25; Shape 2: 16, 5, 4, 35; Shape 3: 6, 3, 2, 11; Shape 4: 14, 5, 2, 29.

Figure 4: Ligand shape generated by LiGANN for protein 4YFI. H-bond acceptor areas are depicted in red and aromatic areas are depicted in green. Ligand shape is displayed as a grid. The two H-bond acceptor areas on the right are solvent exposed, while the buried region on the left-hand side could form an H-bond with the protein.

3.2 Decoding Generated Shapes

As the ligand shape generation and captioning parts were trained separately and on different datasets, the captioning network might not decode the generated shapes efficiently. To evaluate whether this is the case, the following experiment was conducted. For each protein-ligand pair in the DUD-E database the protein was featurized at the ligand center and, together with a randomly sampled z, a complementary ligand shape was generated, followed by shape captioning. As a reference, the known binders in the docked pose were also featurized and directly decoded with the captioning network. Observing the results (Figure 5), we see that both the compounds generated directly from binder shapes and the compounds generated from the protein pocket follow a similar distribution of property counts of aromatic rings, H-bond donors, H-bond acceptors and heavy atoms. The pharmacophoric property counts and the sums of the corresponding voxel array values are positively correlated; therefore, if the shape generation network generates shapes with a filled aromatic channel, the captioning network will also decode the shape into a compound with more aromatic rings. Figure S1 shows results of using crystallized ligand shapes directly for captioning.

Next, we evaluate the shape generation on the test proteins. For each of the three test proteins LiGANN was used to generate 300 shapes by sampling z randomly, and shape captioning was then applied to map each shape to 10 different SMILES sequences. Probabilistic RNN sampling, where the next token is chosen proportionally to its predicted probability, was used to generate multiple SMILES sequences, and hence multiple compounds that could fit into the same shape. The decoding had a high success rate: 2815 (93.8%), 2595 (86.5%) and 2778 (92.6%) unique and at the same time grammatically correct SMILES corresponding to valid molecular structures were generated for proteins with PDB IDs 4N6H, 4YFI and 1ZYS, respectively. An example of decoding is displayed in Figure 7. All the decodings have multiple aromatic rings, which could fit well into the large aromatic area of the input shape. The captioning is also responsive to changes in the input. When the values of the aromatic channel are set to zero, the generated compounds no longer contain aromatic rings but rather aliphatic ring systems (Figure 7 bottom). We do, however, have to note that the captioning network's decodings are not always perfect. For example, while correctly introducing carbonyl, cyano and ester groups in H-bond acceptor regions, the network generated compounds with a hydroxyl group (Figure 7, D1) and an amine (Figure 7, D3), which could act as H-bond donors, while the shape does not contain any H-bond donors.

To determine whether the proposed structure-based method introduced any biases, we analyzed properties of compounds in the generated libraries (Figure 6). Properties such as LogP, 47 synthetic accessibility 48 and natural product-likeness 49 follow a similar distribution to the randomly sampled set of decoys. A bigger discrepancy is observed in the quantitative estimation of drug-likeness 50 and molecular weight for 4YFI-generated compounds. At the same time the generated libraries exhibit high diversity of 0.863, 0.865 and 0.871 for structures 4N6H, 4YFI and 1ZYS, respectively. This is comparable with the internal diversity of 0.856 of a randomly sampled set of decoys. Here, for a set S, the diversity IntDiv(S) is defined as:

$$\mathrm{IntDiv}(S) = 1 - \sqrt{\frac{1}{|S|^2} \sum_{m_1, m_2 \in S} T(m_1, m_2)}, \tag{3}$$

where T is the Tanimoto similarity and m is the Morgan fingerprint of a compound.

3.3 Docking

Generated compounds were docked into the protein pocket and scored for binding probability. As negative examples, the same number of compounds was drawn at random from the drug-like ZINC15 database, 40 the same database the captioning network was trained on. The selected compounds were docked using the Smina docking software, 34 following the same procedure as described in the Method section, and ranked using BindScope 14 predictions. AUC was chosen as the evaluation metric, treating the generated compounds as positive examples and the randomly sampled compounds as negative examples. The results are shown in Table 1. In all three tests the generated compounds scored significantly higher than the sampled baseline (Mann-Whitney U test p-value < 0.001). Receiver operating characteristic and precision-recall curves are displayed in Figure S2.

Table 1: AUC values calculated with BindScope scoring software for three protein targets (4N6H, 4YFI and 1ZYS). Generated compounds are considered as positives while randomly sampled compounds are considered as negative examples.

Target    BindScope AUC
4N6H      0.577
4YFI      0.791
1ZYS      0.674

3.4 QSAR evaluation

Complementary to the structure-based virtual screening methods, ligand-based virtual screening approaches such as Quantitative Structure-Activity Relationship (QSAR) modeling can evaluate the enrichment of generated compounds' activity over a random baseline from a different perspective, on a per-target basis, and could avoid potential biases of structure-based methods. 51,52 QSAR has been a well established and reliable tool used by medicinal chemists for decades. 53 In the scope of drug discovery, QSAR usually implies establishing a relationship between ligand structure

[Figure 5 plot: four panels (Aromatic, H-Bond Donor, H-Bond Acceptor, Occupancy), each showing Channel Sum (y-axis) against Property Count (x-axis) for the Binder and Generated groups.]

Figure 5: Number of Aromatic rings, H-bond donors, H-bond acceptors and heavy atoms of generated molecules (X-axis) from shapes with corresponding sum of channel values (Y-axis). Initial shapes were either voxelized docked poses (Binders) or shapes generated from protein pocket (Generated).

[Figure 6 plot: five distribution panels (logP, QED, SA, NP, weight), each comparing 4N6H, 4YFI, 1ZYS and Decoys.]

Figure 6: Property distributions of the generated compounds based on protein pocket (PDB IDs 4N6H, 4YFI, 1ZYS) and of decoys randomly selected from the ZINC15 training set. The following properties are displayed: lipophilicity (logP), quantitative estimation of drug-likeness (QED), synthetic accessibility (SA), natural product-likeness (NP) and molecular weight (weight).

Figure 7: Captioning network input shape (on the left side with superimposed 4YFI ligand for reference) and 6 different decodings from the same shape. In the last row are shown 3 examples of generated compounds when the aromatic channel values are set to 0 prior to captioning.


and its activity on a biological target of interest. One way to do this is to train a machine learning model 54 on the set of ligands with known activity for that target. Here, the LightGBM gradient boosting decision tree algorithm 55 (1000 trees, colsample_bytree of 0.8, and default values for all other parameters) was applied to train a separate QSAR regression model for each target using binding affinity data (pKi or pKd) available in ChEMBL24. 56 We obtained proteins with a crystal structure in the Protein Data Bank 57 and identified bound ligands. The ChEMBL compound selection query is available in the Supporting Information (Listing S1). After the compounds were obtained, they were standardized with MolVS (https://github.com/mcs07/MolVS) and salts were removed. Targets with binding affinity data for fewer than 100 compounds were filtered out.

The QSAR models were trained on features of the molecules. The feature vector of each molecule was composed of 199 molecular descriptors (the list is available in the Supporting Information), a Morgan circular count vector of size 1024 and radius 2, an Avalon fingerprint of size 1024, and the 166 public MACCS keys. Every descriptor and fingerprint was generated using RDKit. Models were trained with 5-fold scaffold-based 58 cross-validation. Only models with a cross-validation determination coefficient Q2 > 0.5 were used to evaluate LiGANN-generated potential binders and a set of random decoys from ZINC. For each target, 500 compound shapes were generated at the ligand center and each shape was decoded into 10 SMILES, followed by filtering of invalid compounds and duplicates, yielding a final set of generated compounds. The same number of compounds was also selected at random from the ZINC database and used as decoys.

LiGANN-generated compounds or random decoys from ZINC may end up too far away in chemical space from the ChEMBL24 training set of a target for the model to be reliably predictive. To address this, we adopted a distance-based definition of the model's applicability domain (AD) in QSAR. 59 We assume that a trained model is applicable to a particular molecule if its average distance $D_i$ to its K closest neighbors in the training set, measured in the multidimensional space of normalized features, does not exceed the threshold:

$$D_i < D_t + Z S_t, \qquad (4)$$

where $D_t$ is the mean of the average distances of each point in the training set to its K closest neighbors, $S_t$ is the standard deviation of these average distances, and Z is a parameter set to 1 in this study. Since the ChEMBL24 datasets for our targets come in a variety of sizes, we take the number of closest neighbors K as

$$K = \sqrt{L}, \qquad (5)$$

where L is the size of the training dataset. 60 In the applicability domain pipeline, features were normalized and 1000 components obtained by Principal Component Analysis (PCA) were used for the distance calculation in order to remove linearly correlated features. The number of PCA components was selected to account for at least 85% of the total explained variance for each dataset. QSAR models were therefore applied only to structures that passed the AD criterion. If the LiGANN-generated set or the ZINC random baseline set for a particular target contained fewer than 100 structures after AD filtering, the target was excluded from the QSAR evaluation. Moreover, LiGANN requires the binding pocket location in order to generate potential binders. We manually curated the list of identified pockets and discarded targets where a non-ligand compound or a modified residue was identified as the ligand. Lastly, to ensure there was no information leakage from the training set to the evaluation procedure, targets that belonged to the same 70% sequence identity cluster as any target from the DUD-E training set were also filtered out. In the end, 31 targets were left for QSAR evaluation. Figure 8 shows the distribution of predicted pKd grouped into deciles and averaged over all targets. Figures S3 and S4 show the distribution of values per target. Each predicted binding affinity is an average of the 5 models trained in the


cross-validation procedure. As seen in Figure 8, generated compounds are more likely to be found in the first decile or in the 8th and higher deciles. Enrichment in the first decile mostly comes from targets for which the predicted pKd of generated compounds is right-shifted compared to decoys. One such example is glutaminyl cyclase (PDB ID 2AFU). However, the biggest differences still come from the top three deciles. Breaking down the top decile further, the ratio of generated to decoy compounds rises from 1.5 to 1.9 when going from the upper 10% to the upper 1% (Table 2). Enrichment in the top percentiles is a desirable feature when only a small number of compounds can be selected for synthesis and testing.

Finally, Figure 9 presents some examples of generated compounds for the protein ADORA2a, generated from PDB structure 2YDO. The compounds were generated as described in this section, and a similar known binder is displayed next to each of them. As observed in these examples, the proposed generative model is capable of producing compounds that are similar in shape, with a few atom and fragment changes or relocations. However, although these differences can be small, they often lead to large differences in the potency of the compounds, known as activity cliffs. 61,62 Thus, in every example except number 4, the amino group, a crucial hydrogen-bond donor interacting with an asparagine of the ADORA2a binding pocket, is either absent or replaced with a non-H-bond donor in the generated structure. On the other hand, example 4 shows the generation of a compound that perfectly matches a fragment of the known binder. We believe that architectural changes, such as replacing pooling layers, which discard precise spatial information, with transformations that preserve finer features, 63,64 can be beneficial and help produce more precise pharmacophoric shapes and subsequent decoding into structures, while still allowing sufficient diversity. These experiments are left for future work.
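The applicability-domain filter of Eqs. 4 and 5 can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the PCA step is omitted, features are assumed to be already normalized, and the use of the population standard deviation for St is an assumption.

```python
import math

def knn_mean_distance(point, others, k):
    """Mean Euclidean distance from `point` to its k nearest points in `others`."""
    dists = sorted(math.dist(point, o) for o in others)
    return sum(dists[:k]) / k

def fit_applicability_domain(train, z=1.0):
    """Return (K, threshold), with K = sqrt(L) (Eq. 5) and threshold = Dt + Z*St (Eq. 4)."""
    n = len(train)
    k = max(1, round(math.sqrt(n)))
    # Average distance of every training point to its K closest *other* training points.
    per_point = [
        knn_mean_distance(p, train[:i] + train[i + 1:], k)
        for i, p in enumerate(train)
    ]
    d_t = sum(per_point) / n
    # Population standard deviation (assumption; the text does not specify which form is used).
    s_t = math.sqrt(sum((d - d_t) ** 2 for d in per_point) / n)
    return k, d_t + z * s_t

def in_domain(query, train, k, threshold):
    """True if the query's mean distance to its K nearest training points is below the threshold."""
    return knn_mean_distance(query, train, k) < threshold

# Toy training set: a 3 x 3 grid of already-normalized 2D feature vectors.
train = [(float(x), float(y)) for x in range(3) for y in range(3)]
k, threshold = fit_applicability_domain(train, z=1.0)
print(k, in_domain((0.5, 0.5), train, k, threshold), in_domain((10.0, 10.0), train, k, threshold))
```

In a real pipeline the brute-force distance loop would be replaced by a nearest-neighbor index, since the cost here grows quadratically with the training set size.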

Figure 8: Portion of Generated and Decoy compounds in each decile (x-axis: decile, with 10 the top decile; y-axis: fraction of compounds). Values are averaged over the 31 targets that passed the selection criteria.
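One plausible way to obtain per-decile fractions such as those in Figure 8 is sketched below. The sketch assumes that deciles are defined on the pooled predicted pKd values of generated and decoy compounds for a single target, and that counts are normalized by the size of each group. The function name and the toy scores are hypothetical.

```python
def decile_fractions(generated, decoys):
    """Fraction of each group's compounds in each pooled decile (index 0 = lowest, 9 = top)."""
    pooled = sorted(
        [(score, "generated") for score in generated]
        + [(score, "decoy") for score in decoys],
        key=lambda item: item[0],  # sort by predicted score only; ties keep input order
    )
    n = len(pooled)
    counts = {"generated": [0] * 10, "decoy": [0] * 10}
    for rank, (_, group) in enumerate(pooled):
        decile = min(rank * 10 // n, 9)  # map rank 0..n-1 to decile 0..9
        counts[group][decile] += 1
    return {
        "generated": [c / len(generated) for c in counts["generated"]],
        "decoy": [c / len(decoys) for c in counts["decoy"]],
    }

# Toy example: every generated compound scores higher than every decoy.
generated_scores = [float(v) for v in range(10, 20)]
decoy_scores = [float(v) for v in range(0, 10)]
fractions = decile_fractions(generated_scores, decoy_scores)
print(fractions["generated"][9], fractions["decoy"][9])
```

Averaging these per-target fractions over all targets would then give one bar per group and decile.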

3.5 Further evaluations

The LiGANN method was evaluated in the Biogen Inc. drug discovery pipeline to see whether it could discover novel small molecules that are similar to identified IRAK-4 kinase binders. For 5 publicly available crystal structures (5UIU, 4YO6, 5KG7, 2NRU and 2NRY), 50 thousand compounds were generated per crystal structure: 5000 shapes and 10 decodings per shape. An equal number of compounds was also sampled from ZINC. The two libraries underwent a 2-step filtering process and similarity screening proposed by Biogen. The first step was filtering by physicochemical properties to account for drug-likeness, which yielded 80k and 77k compounds for the generated and decoy sets, respectively. The second step was ROCS-based 65 scoring against 5 known PDB ligands, with subsequent ranking by the sum of reciprocal ranks derived from the per-ligand combo score ranks. The top 5k compounds were selected in each group and screened for similarity with proprietary active compounds. The screening results resemble those of the QSAR evaluation and demonstrate enrichment similar to that in the top decile. The LiGANN-generated set contained 4 and 6 compounds with Tanimoto similarity > 0.5 to real actives in the Biogen database with IC50 < 1 µM and IC50 < 10 µM, respectively, while the random decoy set contained 2 and 3. This is a level of enrichment very similar to what we showed in the QSAR tests.
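The sum-of-reciprocal-ranks fusion mentioned above can be sketched as follows. This is only an illustration of the general technique, not the Biogen workflow: the function name, the compound identifiers and the scores are hypothetical, and it is assumed that a higher per-ligand score is better.

```python
def sum_reciprocal_rank(scores_per_ligand):
    """Fuse per-ligand rankings: each compound receives the sum of 1/rank over all reference ligands.

    scores_per_ligand: list of dicts mapping compound id -> score (higher is better),
    one dict per reference ligand.
    Returns (compound ids ordered best first, dict of fused scores).
    """
    fused = {}
    for scores in scores_per_ligand:
        ranked = sorted(scores, key=scores.get, reverse=True)  # best score gets rank 1
        for rank, compound in enumerate(ranked, start=1):
            fused[compound] = fused.get(compound, 0.0) + 1.0 / rank
    order = sorted(fused, key=fused.get, reverse=True)
    return order, fused

# Toy example with two reference ligands and three candidate compounds.
scores_vs_ligand_1 = {"a": 1.5, "b": 1.2, "c": 0.9}
scores_vs_ligand_2 = {"a": 1.0, "b": 1.6, "c": 1.4}
order, fused = sum_reciprocal_rank([scores_vs_ligand_1, scores_vs_ligand_2])
print(order)
```

Compound "b" comes out on top because it ranks well against both references, whereas "a" is first against one ligand but last against the other.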


Table 2: Enrichment in upper quantiles according to QSAR models. The first two rows show the portion of generated and decoy compounds in the quantile, averaged over 31 targets, followed by the standard deviation. The count in each quantile is normalized by the number of compounds considered. Row 3 shows the p-value of the Wilcoxon signed-rank test and row 4 shows the ratio of generated versus decoy compounds in the respective quantile.

Quantile     0.9              0.95             0.99
Generated    0.122 ± 0.036    0.063 ± 0.020    0.013 ± 0.004
Decoys       0.081 ± 0.037    0.038 ± 0.020    0.007 ± 0.004
p-value      1.864e-03        6.319e-02        5.871e-07
Ratio        1.499            1.648            1.852

Figure 9: Six examples of generated compounds (left in pair) and similar known binders (right in pair) for adenosine receptor A2.


4 Discussion

In conclusion, we have proposed LiGANN, a novel approach for neural-based mapping of protein structures to ligand representations that gives distinguishable and decodable shapes. By leveraging the multimodal mapping of BicycleGAN, we can generate shapes with higher diversity than the previously proposed method LigVoxel. The generated shapes can be decoded into SMILES strings that can be used as input for large virtual screening campaigns. By combining the GAN model with shape captioning, we obtain a direct protein-to-ligand mapping pipeline carried out entirely by neural networks. We have shown that the method is capable of generating plausible shapes and decoding them with an enrichment of 1.8 in the top 1% based on the QSAR models, and with similar performance in the Biogen test. This shows that there is a reliable signal, although it is not very strong; carefully selecting compounds from the generated set is probably a valid strategy to identify novel binders. Structure-based generative modeling for drug design is just at the beginning, and hopefully better models will allow for substantial improvements. To enable the scientific community to better assess the method, a web-based application has been made available at https://www.playmolecule.org/LiGANN, as part of the www.playmolecule.org/ platform.

Acknowledgement

The authors thank Acellera for funding. G.D.F. acknowledges support from MINECO (Unidad de Excelencia María de Maeztu MDM-2014-0370 and BIO2017-82628-P) and FEDER. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 675451 (CompBioMed project). Special thanks go to Gerard Martínez-Rosell and Alberto Cuzzolin for making the method available as part of the playmolecule.org website.

Supporting Information Available

The following files are available:
• supporting_information.pdf: Details of the methods used and results for both docking and QSAR evaluation.

References

(1) Anderson, A. C. The process of structure-based drug design. Chem. Biol. 2003, 10, 787–797.

(2) Pagadala, N. S.; Syed, K.; Tuszynski, J. Software for molecular docking: a review. Biophys. Rev. 2017, 9, 91–102.

(3) Scior, T.; Bender, A.; Tresadern, G.; Medina-Franco, J. L.; Martínez-Mayorga, K.; Langer, T.; Cuanalo-Contreras, K.; Agrafiotis, D. K. Recognizing pitfalls in virtual screening: A critical review. J. Chem. Inf. Model. 2012, 52, 867–881.

(4) Bohacek, R. S.; McMartin, C.; Guida, W. C. The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 1996, 16, 3–50.

(5) Hoffmann, T.; Gastreich, M. The next level in chemical space navigation: going far beyond enumerable compound libraries. Drug Discov. Today 2019.

(6) Todorov, N.; Alberts, I.; Dean, P. De Novo Design; 2007; pp 283–305.

(7) Gawehn, E.; Hiss, J. A.; Schneider, G. Deep Learning in Drug Discovery. Mol. Inform. 2016, 35, 3–14.

(8) Chen, H.; Engkvist, O.; Wang, Y.; Olivecrona, M.; Blaschke, T. The rise of deep learning in drug discovery. Drug Discov. Today 2018, 23, 1241–1250.


(9) Segler, M. H.; Preuss, M.; Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 2018, 555, 604–610.

(10) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci. 2018, 4, 268–276.

(11) Wu, Z.; Ramsundar, B.; Feinberg, E. N.; Gomes, J.; Geniesse, C.; Pappu, A. S.; Leswing, K.; Pande, V. MoleculeNet: A benchmark for molecular machine learning. Chem. Sci. 2018, 9, 513–530.

(12) Wallach, I.; Dzamba, M.; Heifets, A. AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery. arXiv preprint arXiv:1510.02855 2015, https://arxiv.org/abs/1510.02855.

(13) Ragoza, M.; Hochuli, J.; Idrobo, E.; Sunseri, J.; Koes, D. R. Protein-Ligand Scoring with Convolutional Neural Networks. J. Chem. Inf. Model. 2017, 57, 942–957.

(14) Skalic, M.; Martínez-Rosell, G.; Jiménez, J.; De Fabritiis, G. PlayMolecule BindScope: Large scale CNN-based virtual screening on the web. Bioinformatics 2018, 35, 1237–1238.

(15) Jiménez, J.; Škalič, M.; Martínez-Rosell, G.; De Fabritiis, G. KDEEP: Protein-Ligand Absolute Binding Affinity Prediction via 3D-Convolutional Neural Networks. J. Chem. Inf. Model. 2018, 58, 287–296.

(16) Hochuli, J.; Helbling, A.; Skaist, T.; Ragoza, M.; Koes, D. R. Visualizing convolutional neural network protein-ligand scoring. J. Mol. Graphics Modell. 2018, 84, 96–108.

(17) Pu, L.; Govindaraj, R. G.; Lemoine, J. M.; Wu, H.-C.; Brylinski, M. DeepDrug3D: Classification of ligand-binding pockets in proteins with a convolutional neural network. PLoS Comput. Biol. 2019, 15, e1006718.

(18) Simonovsky, M.; Meyers, J. DeeplyTough: Learning Structural Comparison of Protein Binding Sites. BioRxiv 2019, https://doi.org/10.1101/600304.

(19) Sliwoski, G.; Kothiwale, S.; Meiler, J.; Lowe, E. W. Computational Methods in Drug Discovery. Pharmacol. Rev. 2014, 66, 334–395.

(20) Elton, D. C.; Boukouvalas, Z.; Fuge, M. D.; Chung, P. W. Deep learning for molecular generation and optimization - a review of the state of the art. arXiv preprint arXiv:1903.04388 2019, https://arxiv.org/abs/1903.04388.

(21) Aumentado-Armstrong, T. Latent Molecular Optimization for Targeted Therapeutic Design. arXiv preprint arXiv:1809.02032 2018, https://arxiv.org/abs/1809.02032.

(22) Skalic, M.; Varela-Rial, A.; Jiménez, J.; Martínez-Rosell, G.; De Fabritiis, G. LigVoxel: Inpainting binding pockets using 3D-convolutional neural networks. Bioinformatics 2018, 35, 243–250.

(23) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. Advances in Neural Information Processing Systems 27 2014, 2672–2680.

(24) Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein Generative Adversarial Networks. Proceedings of the 34th International Conference on Machine Learning 2017, 70, 214–223.


(25) Wei, X.; Liu, Z.; Wang, L.; Gong, B. Improving the Improved Training of Wasserstein GANs. International Conference on Learning Representations 2018, 5767–5777.

(26) Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. 2018.

(27) Miyato, T.; Kataoka, T.; Koyama, M.; Yoshida, Y. Spectral Normalization for Generative Adversarial Networks. International Conference on Learning Representations 2018.

(28) Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-Attention Generative Adversarial Networks. arXiv preprint arXiv:1805.08318 2018, https://arxiv.org/abs/1805.08318.

(29) Oord, A. v. d.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759 2016, https://arxiv.org/abs/1601.06759.

(30) Isola, P.; Zhu, J. Y.; Zhou, T.; Efros, A. A. Image-to-image translation with conditional adversarial networks. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2017, 5967–5976.

(31) Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A. A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. Computer Vision (ICCV), 2017 IEEE International Conference on Computer Vision 2017, 2223–2232.

(32) Zhu, J.-Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A. A.; Wang, O.; Shechtman, E. Toward Multimodal Image-to-Image Translation. Advances in Neural Information Processing Systems 30 2017, 465–476.

(33) Mysinger, M. M.; Carchia, M.; Irwin, J. J.; Shoichet, B. K. Directory of useful decoys, enhanced (DUD-E): Better ligands and decoys for better benchmarking. J. Med. Chem. 2012, 55, 6582–6594.

(34) Koes, D. R.; Baumgartner, M. P.; Camacho, C. J. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J. Chem. Inf. Model. 2013, 53, 1893–1904.

(35) Gasteiger, J.; Marsili, M. Iterative partial equalization of orbital electronegativity: a rapid access to atomic charges. Tetrahedron 1980, 36, 3219–3228.

(36) Morris, G.; Huey, R. AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. J. Comput. Chem. 2009, 30, 2785–2791.

(37) Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv preprint arXiv:1505.04597 2015, https://arxiv.org/abs/1505.04597.

(38) Ulyanov, D.; Vedaldi, A.; Lempitsky, V. S. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv preprint arXiv:1607.08022 2016, https://arxiv.org/abs/1607.08022.

(39) Skalic, M.; Jiménez Luna, J.; Sabbadin, D.; De Fabritiis, G. Shape-Based Generative Modeling for de-novo Drug Design. J. Chem. Inf. Model. 2019, 59, 1205–1214.

(40) Sterling, T.; Irwin, J. J. ZINC 15 - Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55, 2324–2337.

(41) Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.

(42) Mao, X.; Li, Q.; Xie, H.; Lau, R. Y. K.; Wang, Z. Multi-class Generative Adversarial Networks with the L2 Loss Function. arXiv preprint arXiv:1611.04076 2016, https://arxiv.org/abs/1611.04076.


(43) Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations 2014, 1–13.

(44) Spahn, V.; Stein, C. Targeting delta opioid receptors for pain treatment: drugs in phase I and II clinical development. Expert Opin. Investig. Drugs 2017, 26, 155–160.

(45) Rundle, S.; Bradbury, A.; Drew, Y.; Curtin, N. J. Targeting the ATR-CHK1 Axis in Cancer Therapy. Cancers 2017, 9.

(46) Lawhorn, B. G.; Philp, J.; Graves, A. P.; Shewchuk, L.; Holt, D. A.; Gatto Jr, G. J.; Kallander, L. S. GSK114: A selective inhibitor for elucidating the biological role of TNNI3K. Bioorg. Med. Chem. Lett. 2016, 26, 3355–3358.

(47) Wildman, S. A.; Crippen, G. M. Prediction of physicochemical parameters by atomic contributions. J. Chem. Inf. Model. 1999, 39, 868–873.

(48) Ertl, P.; Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminformatics 2009, 1.

(49) Ertl, P.; Roggo, S.; Schuffenhauer, A. Natural product-likeness score and its application for prioritization of compound libraries. J. Chem. Inf. Model. 2008, 48, 68–74.

(50) Bickerton, G. R.; Paolini, G. V.; Besnard, J.; Muresan, S.; Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 2012, 4, 90–98.

(51) Chen, L.; Cruz, A.; Ramsey, S.; Dickson, C.; Duca, J. S.; Hornak, V.; Koes, D. R.; Kurtzman, T. Hidden Bias in the DUD-E Dataset Leads to Misleading Performance of Deep Learning in Structure-Based Virtual Screening. chemrxiv preprint chemrxiv:7886165 2019.

(52) Smusz, S.; Kurczab, R.; Bojarski, A. J. The influence of the inactives subset generation on the performance of machine learning methods. J. Cheminf. 2013, 5, 17.

(53) Cherkasov, A.; Muratov, E. N.; Fourches, D.; Varnek, A.; Baskin, I. I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y. C.; Todeschini, R.; Consonni, V.; Kuz’min, V. E.; Cramer, R.; Benigni, R.; Yang, C.; Rathman, J.; Terfloth, L.; Gasteiger, J.; Richard, A.; Tropsha, A. QSAR modeling: where have you been? Where are you going to? J. Med. Chem. 2014, 57, 4977–5010.

(54) Mitchell, J. B. Machine learning methods in chemoinformatics. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2014, 4, 468–481.

(55) Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. In LightGBM: A Highly Efficient Gradient Boosting Decision Tree; Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc., 2017; pp 3146–3154.

(56) Gaulton, A.; Hersey, A.; Karlsson, A.; Mendez, D.; Cibrián-Uhalte, E.; Atkinson, F.; Papadatos, G.; Smit, I.; Overington, J. P.; Chambers, J.; Bellis, L. J.; Davies, M.; Nowotka, M.; Dedman, N.; Mutowo, P.; Leach, A. R.; Bento, A. P.; Magariños, M. P. The ChEMBL database in 2017. Nucleic Acids Res. 2016, 45, D945–D954.

(57) Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E. The protein data bank. Nucleic Acids Res. 2000, 28, 235–242.

(58) Bemis, G. W.; Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 1996, 39, 2887–2893.


(59) Tropsha, A. Best practices for QSAR model development, validation, and exploitation. Mol. Inform. 2010, 29, 476–488.

(60) Lall, U.; Sharma, A. A nearest neighbor bootstrap for resampling hydrologic time series. Water Resour. Res. 1996, 32, 679–693.

(61) Stumpfe, D.; Bajorath, J. Exploring activity cliffs in medicinal chemistry: miniperspective. J. Med. Chem. 2012, 55, 2932–2942.

(62) Maggiora, G. M. On outliers and activity cliffs - why QSAR often disappoints. 2006, 46, 1535–1535.

(63) Honari, S.; Yosinski, J.; Vincent, P.; Pal, C. Recombinator networks: Learning coarse-to-fine feature aggregation. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2016, 5743–5752.

(64) Islam, M. A.; Rochan, M.; Naha, S.; Bruce, N. D.; Wang, Y. Gated feedback refinement network for coarse-to-fine dense semantic image labeling. arXiv preprint arXiv:1806.11266 2018, https://arxiv.org/abs/1806.11266.

(65) Hawkins, P. C.; Skillman, A. G.; Nicholls, A. Comparison of shape-matching and docking as virtual screening tools. J. Med. Chem. 2007, 50, 74–82.


Graphical TOC Entry (figure; panel labels: Protein pocket, Protein-to-ligand shape generation, Ligand shapes, Shape captioning, Compounds)
