Article pubs.acs.org/jcim

Creating the New from the Old: Combinatorial Libraries Generation with Machine-Learning-Based Compound Structure Optimization

Sabina Podlewska,† Wojciech M. Czarnecki,‡,§ Rafał Kafel,† and Andrzej J. Bojarski*,†

† Department of Medicinal Chemistry, Institute of Pharmacology, Polish Academy of Sciences, Smętna 12, 31-343 Kraków, Poland
‡ Faculty of Mathematics and Computer Science, Jagiellonian University, 30-348 Kraków, Poland



Supporting Information

ABSTRACT: The growing computational abilities of various tools that are applied in the broadly understood field of computer-aided drug design have led to the extreme popularity of virtual screening in the search for new biologically active compounds. Most often, the source of such molecules consists of commercially available compound databases, but they can also be searched for within the libraries of structures generated in silico from existing ligands. Various computational combinatorial approaches are based solely on the chemical structure of compounds, using different types of substitutions to form new molecules. In this study, the starting point for combinatorial library generation was the fingerprint referring to the optimal substructural composition in terms of the activity toward a considered target, which was obtained using a machine learning-based optimization procedure. The systematic enumeration of all possible connections between preferred substructures resulted in the formation of target-focused libraries of new potential ligands. The compounds were initially assessed by machine learning methods using a hashed fingerprint to represent molecules; the distribution of their physicochemical properties was also investigated, as well as their synthetic accessibility. The examination of various fingerprints and machine learning algorithms indicated that the Klekota−Roth fingerprint and support vector machine were an optimal combination for such experiments. This study was performed for 8 protein targets, and the obtained compound sets and their characterization are publicly available at http://skandal.if-pan.krakow.pl/comb_lib/.



Received: July 24, 2016
Published: February 3, 2017
© 2017 American Chemical Society
Journal of Chemical Information and Modeling
DOI: 10.1021/acs.jcim.6b00426 J. Chem. Inf. Model. 2017, 57, 133−147

INTRODUCTION

Drug discovery campaigns are complex, difficult, and expensive processes that last approximately 15 years and consume more than $1 billion.1 Finding a new small molecule with desirable pharmacological properties that will be both active and safe for use is associated with a host of difficulties. To facilitate, accelerate, and reduce the costs of this process, various computational approaches are used, among which virtual screening plays a prominent role in terms of popularity. The methods included in these types of methodologies are traditionally classified into ligand-, structure- and fragment-based paths, and new potentially active compounds are searched in commercially available libraries of structures or within the set of virtual molecules generated in a computational manner.

The most popular ways of forming new molecules in silico are bioisosteric replacement, scaffold hopping and hybridization of ligands (Figure 1). The bioisosteric concept is based on the replacement of some chemical moieties with groups that are supposed to trigger a biological effect similar to that of the input compound.2 Scaffold hopping involves the replacement of the core structure of the molecule with other cores, leaving substituents unchanged,3 whereas the ligands hybridization involves the combination of fragments derived from input molecules, as it is done, for example, in the BREED algorithm.4 A separate path of combinatorial approaches involves the application of reaction-based schemes to the enumeration of new molecules by connecting building blocks using possible chemical transformations.5−8 A number of approaches are also used to generate target-focused libraries, which are based on information regarding the target protein, setting structural or spatial constraints on the basis of docking results, by imposing such requirements through the use of interaction fingerprints,9−11 or by incorporating detailed information about the structures of known ligands in the design of new connections.12

Figure 1. Examples of approaches to virtual combinatorial libraries generation: (a) bioisosteric replacement, (b) scaffold hopping, and (c) hybridization of known ligands.

The sets of compounds generated using combinatorial approaches can be huge and are, therefore, usually evaluated using virtual screening procedures. Among the broad range of computational methodologies used for the search of new active compounds, pharmacophore modeling,13,14 similarity searches,15−17 and docking18−22 are the most popular. The increasing amount of data used in virtual screening experiments and the limited computational resources have triggered the exploration and development of new in silico tools. The approaches that have gained extreme popularity in the field of cheminformatics and medicinal chemistry are machine learning (ML) methods.23−26 Their speed, effectiveness, and ability to handle high-dimensional data have led to the broad application of ML methods for both the prediction of biological activities27−33 and the molecular properties of compounds.34−40 The crucial step in all ML-based experiments is to provide the appropriate training data, in terms of amount, quality, and proper representation. Because of low computational expenses, a high speed of generation and the easy ability to perform comparisons between particular examples, the most commonly used method of compounds’ depiction for ML purposes is fingerprinting.41−44 This method adopts the form of a bit-string that falls into two major classes according to the algorithm underlying its calculation: hashed fingerprint and key-based fingerprint.15 The former group of representation provides information about compound structures by generating paths from a molecular graph and applying the hashing function to code the data. Because of the hashing procedure, the individual bit positions reflecting specific features are difficult to decode and it is problematic to provide any structural or chemical interpretation for them. In contrast, in key-based fingerprints, each position codes the presence or absence of a particular feature in the molecule.45 The most frequent keys are substructural ones, and the resulting fingerprints are typically assigned to the substructural group of representations.

In this study, we combined the advantages of ML and the generation of virtual compounds concepts by constructing a tool for combinatorial libraries enumeration with optimization of the compound structures using ML methods. The resulting sets of compounds were target-focused, and their size and composition varied depending on the compounds representation used for the ML-based optimization. The developed approach forms a bridge between ligand-25,46−49 and fragment-based50−57 approaches to computer-aided drug design (Figure 2). In contrast to the previously used combinatorial approaches, our methodology works on a different level, without the explicit use of the chemical structure. It uses the bit-string substructural representation of molecules and generates new molecules from optimal fingerprints provided by the modifications of ML algorithms (reformulation of the optimization problem was required). The resulting combinatorial libraries were extended because they encompass all possible connections between the moieties indicated by ML as important for a particular activity profile. The obtained compounds were initially evaluated using hashed fingerprints and with the use of the fingerprint applied for ML-based optimization, indicating the compounds’ high potential for biological activity. Moreover, new connections between particular substructures (parts of previously known active compounds) enabled the discovery of structurally new potential ligands of the considered receptors. The analysis of the distribution of physicochemical properties of the obtained molecules enables the fast selection of drug-like molecules in the generated libraries; examination of their presence in the ZINC database58 revealed an approximately 3−4% occurrence rate, indicating a possibility to perform large-scale screenings beyond the chemical space of commercially available compounds. The newly formed molecules also successfully passed the PAINS filter59 and were evaluated as being mostly of moderate difficulty for synthesis.

Figure 2. Comparison of ligand-, fragment-, and structure-based approaches to virtual screening in reference to the developed methodology of new compounds’ generation.



RESULTS AND DISCUSSION

Mathematical Perspective. We will start with a general overview of our goals and the language that will be employed. We want to find fingerprints of molecules with a high probability of being active according to some trained model (classifier). The basic notation will utilize $x \in [0,1]^d$ to denote a particular fingerprint, which is a d-dimensional vector of non-negative values. For simplicity, we will assume that these values are from the [0,1] interval, as even for fingerprints of the “count” type, rescaling can be performed to fit in the hypercube. Because d-dimensional spaces cannot be visualized, simple 2D plots will be used as examples, with particular points placed in the entire $\mathbb{R}^2$. These plots should be considered as a type of low-dimensional projection of our fingerprints, such as the one obtained by principal component analysis60 or the t-SNE method.61

Figure 3a and b shows a simple example of a set of active compounds represented by some predefined fingerprint. Using this type of visualization, one can think about current similarity-based methods, which basically rely on searches for compounds that are similar to previously known active molecules as a result of building a local density estimation (like the simple kernel density estimation shown in Figure 3a and b). For example, any simple substitution-based method effectively does exactly this: an active compound (a single point in our thought experiment) is taken, a part of it is removed (thus, some small number of bits of its fingerprint is set to zero), and different moieties are attached (and thus some bits of its fingerprint are set to 1). Consequently, the distance (in terms of, for example, the Euclidean or Tanimoto distance) of the generated compound fingerprint to the original one is very small. As a result, truly novel samples are not generated, but rather a local search within the space of the fingerprints is performed.
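The local-search argument above can be illustrated with a small numerical sketch. It is only an illustration under stated assumptions: the fingerprints are random binary vectors rather than real molecules, and the helper names `euclidean_distance` and `tanimoto_distance`, the fingerprint length, and the number of flipped bits are all hypothetical choices that do not come from the study.

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean distance between two fingerprints."""
    return float(np.linalg.norm(a - b))

def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for binary (0/1) fingerprints."""
    both = np.sum(a & b)
    either = np.sum(a | b)
    return 0.0 if either == 0 else float(1.0 - both / either)

rng = np.random.default_rng(0)
d = 1024  # hypothetical fingerprint length

# Hypothetical sparse binary fingerprint of a known active compound.
parent = (rng.random(d) < 0.1).astype(np.int64)

# Substitution-based modification: clear a few set bits ("remove a part")
# and set a few other bits ("attach a different moiety").
child = parent.copy()
child[np.flatnonzero(parent)[:3]] = 0
child[np.flatnonzero(parent == 0)[:3]] = 1

# An unrelated random fingerprint for comparison.
unrelated = (rng.random(d) < 0.1).astype(np.int64)

print("parent vs. child    :", tanimoto_distance(parent, child))
print("parent vs. unrelated:", tanimoto_distance(parent, unrelated))
```

In this toy setting the modified fingerprint stays far closer to its parent than an unrelated fingerprint does, which is the sense in which substitution performs a local search.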

Figure 3. Visualizations explaining the mathematical perspective of the developed methodology: (a) active compounds represented with some fingerprint, (b) the existing methods, (c) active compounds with the corresponding density estimations, (d) inactive compounds with the corresponding density estimations, (e) whole data set, (f) whole data set with joint probability estimation, (g) discriminative support function, and (h) discriminative support function with the application of sample stochastic optimization to maximize its value.

From a mathematical perspective, a system that employs the above reasoning is often called a generative model.62 In other words, we look for a probabilistic model

$$P(x \mid y = \mathrm{act})$$

where y is the compound label (activity). We have access to the training set of form $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{\mathrm{act}, \mathrm{inact}\}$, that can be used to construct such a distribution, which usually leads to some optimization problem that is focused on fitting a set of parameters θ of a particular model. One of the most popular approaches from this family is Naïve Bayes,63 which is often used in cheminformatics applications and simply fits a particular density function independently to each feature (fingerprint dimension), such as univariate Gaussians.64 Figure 3c and d shows an example of such a process resulting in a Gaussian model of P(x|y), which can be used to sample new compound fingerprints. Naïve Bayes is not the only solution; there are more complex models that can be used, such as Gaussian Mixture Models,65 Restricted Boltzmann Machines,66 and many more.
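A minimal sketch of the generative route described above, assuming one independent univariate Gaussian per fingerprint dimension fitted on active compounds only. The data are random placeholders, and the function names `fit_gaussian_naive_bayes` and `sample_fingerprints` are hypothetical; the sketch is not the procedure used in the study.

```python
import numpy as np

def fit_gaussian_naive_bayes(X):
    """Fit one univariate Gaussian per fingerprint dimension (feature independence)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-6  # small floor so constant features remain valid
    return mean, std

def sample_fingerprints(mean, std, n_samples, rng):
    """Draw samples from the factorized model P(x | y = act) and clip them to [0, 1]."""
    draws = rng.normal(mean, std, size=(n_samples, mean.shape[0]))
    return np.clip(draws, 0.0, 1.0)

rng = np.random.default_rng(42)
actives = rng.random((200, 16))  # hypothetical rescaled fingerprints of active compounds

mean, std = fit_gaussian_naive_bayes(actives)
samples = sample_fingerprints(mean, std, n_samples=5, rng=rng)
print(samples.shape)
```

Note that only the active class enters the fit, so the sampler has no information about where inactive compounds lie in the same space.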

Such an approach has one major flaw. Because of assumptions concerning the optimization procedure, it partially ignores the presence of negative samples (inactive compounds), which can (and often do) occupy the same part of the fingerprint space as the active samples. Consequently, we need to model both densities and to sample from the subspace of active compounds, thus ensuring that the reconstructed compound will be an active one (and not simply a popular compound). In other words, we are more interested in obtaining a fingerprint x such that P(y = act|x) is as high as possible. Figure 3e and f shows a representation of the considered setting. We want to take into account both active and inactive compounds and develop a joint probability distribution P(x,y) that can be used to identify fingerprints that maximize the probability of being active.

There are, unfortunately, challenges to using such an approach. Identification of fingerprints corresponding to compounds with the highest probability of being active is extremely difficult from a mathematical perspective, and consequently, strong assumptions/simplifications must be introduced to make the problem tractable. Naïve Bayes introduces the assumption of feature independence, which is almost never valid. Gaussian Mixture Models assume that each class lies in a small combination of multivariate Gaussians, which can be efficiently approximated based on the given data. Unfortunately, in practice, fingerprints are extremely sparse, high-dimensional data with a rather limited set of examples, and it is an established phenomenon in mathematics that learning a joint probability distribution requires large amounts of data. However, ML methods are quite successful in virtual screening and in many other branches of data analysis.

In many cases, the results are obtained via a specific simplification of the above problem: through the use of discriminative models62 instead of generative, probabilistic ones. In contrast to the previously described approach, discriminative models do not attempt to model the distribution of data. Instead, they focus solely on building a support function a, which is roughly connected to the probability of correct classification. The only purpose of the discriminative model is to assign a valid label while ignoring the exact probability estimate. It is usually achieved by learning, in addition to a, some threshold T, and assigning the labels according to

$$\mathrm{cl}(x) = \mathrm{act} \leftrightarrow a(x) \ge T$$

One such example might be a logistic regression, which models the support function as a sigmoid of the linear projection

$$\mathrm{lr}(x) = \mathrm{act} \leftrightarrow \frac{1}{1 + \exp(-(\langle \theta_{1:}, x \rangle + \theta_0))} \ge 0.5 \leftrightarrow -\exp(-(\langle \theta_{1:}, x \rangle + \theta_0)) \ge -1$$

Similarly, both linear and kernelized support vector machines (SVMs) assign a particular measure of support to each sample, which is later compared with the trained bias value to assess the final label. The natural advantages of such approaches are as follows:
• Tractability: discriminative models are much easier to train; in practice, they often have a single, global optimum.
• Simplicity: support function models are often a linear combination of very simple objects.

Unfortunately, these approaches no longer provide us with the actual probability distribution; thus, we cannot sample from their underlying concept of activity. Fortunately, we can (to some extent) simulate this process through the use of stochastic optimization procedures applied to the problem of maximization of the modeled support (activity concept). Figure 3g and h shows a discriminatively trained model (SVM with a radial basis function kernel) and a few trajectories taken by stochastic optimization method runs used to maximize the support. It is evident that optimization procedures culminate in local optima, and thus efficiently generate multiple samples (fingerprints) from our activity model, which have a high probability of representing active compounds (according to our trained model).

One may ask how this model differs from simply training a discriminative model followed by its application to every compound in some database represented as a given fingerprint and taking the ones that are considered active, as is usually done in ML-based virtual screening. There are three crucial differences:
• The proposed procedure can generate new fingerprints/compounds that are not listed in any existing database, because the methodology is not constrained to any finite set of compounds.
• As opposed to classification, we will not end up with compounds that would be classified as active but rather compounds that are most active in their local fingerprint neighborhood.
• Quite surprisingly, this procedure is actually much faster than classifying whole databases (such as ZINC),58 as we will demonstrate later in this section by showing that the optimization procedure can be performed using basic tools; its complexity derives only from the complexity of the modeled support function and not from the size of the chemical compound databases (which grow extremely fast).

There are, however, two important drawbacks to using such a surrogate for the probabilistic model. First, one can end up with a sample (fingerprint) that is very similar to one that already exists in the training set (or is even exactly the same), thus limiting the novelty of the generated compounds. Second, the opposite effect can occur: we can end up far from the generic chemical compound manifold, thus generating a point that has no reasonable chemical interpretation. While there are many possible approaches to solving these two issues, we will focus on the simple ones. To force diversity from the training set, we will add an optimization constraint that forbids selection of the training set. Furthermore, as our model is selected by internal validation to be able to generalize to the test set, the overfitting nature of the support function should not be exhibited. To ensure that we do not generate samples that are far away from the compounds manifold, we will start the optimization procedures close enough to this manifold, thus reducing the probability of divergence.

Given this introduction, let us provide a more rigorous statement of the above problem. We will show how it can be approached using a linear model (with an example of logistic regression) and a typical kernel model (using SVM as an example). The proposed scheme is more general, and in practice, it can also be applied to other models, such as Neural Networks (both shallow and deep)67 and Random Forests,68 among others. However, for the sake of simplicity, we will focus on the two previously mentioned examples as proofs of concept.

Optimization Setting. Let us assume that we are given a classifier, parametrized with parameters θ and trained on $\{(x_i, y_i)\}_{i=1}^{N} \subset \mathbb{R}^d \times \{-1, +1\}$, such that it assigns a numeric value $a: \mathbb{R}^d \to \mathbb{R}$, which indicates its concept of compound activity. For example, for the logistic regression it will be

$$a_{\mathrm{lr}}(x \mid \theta) = \frac{1}{1 + \exp(-(\langle \theta_{1:}, x \rangle + \theta_0))} \propto \langle \theta_{1:}, x \rangle$$

where ∝ is used loosely: the sigmoid is a monotonically increasing function of $\langle \theta_{1:}, x \rangle$, so both expressions are maximized by the same x.

We pose the following optimization problem, which we refer to as the fingerprint optimization problem

$$\underset{x}{\mathrm{maximize}}\ a(x \mid \theta) \quad \text{subject to } g_j(x), \quad j = 1, \ldots, M$$

where $g_j$ are constraints used to address previously outlined problems of overfitting to the training set and underfitting to the compounds manifold. In particular, we define a basic fingerprint optimization problem by putting $g_j(x) = [x \ne x_j]$ for j = 1, ..., N and $g_{N+k}(x) = [0 \le x_k \le \max_k]$ for k = 1, ..., d, where $x_k$ denotes the kth dimension of the fingerprint and $\max_k$ is the maximum value of the kth dimension in the fingerprint, which makes sense from a chemical perspective. Thus, the space of the considered solutions is constrained to the space of reasonable fingerprints minus the training set. It is easy to show that for logistic regression (or any other linear model) the solution of the above optimization problem is unique and easy to find with a closed-form equation.

Theorem 1. For any linear model using the activity model

$$a(x, \theta) = \langle x, \theta \rangle$$

with constraints of the form $g_j(x) = [a_j \le x_j \le b_j]$ for some bounds $a_j$, $b_j$, a global optimum of the basic fingerprint optimization problem is obtained for $x = [l(1) \ldots l(d)]$, where

$$l(j) = \begin{cases} b_j, & \text{if } \theta_j \ge 0 \\ a_j, & \text{otherwise} \end{cases}$$

if only $x \ne x_i$ for each i = 1, ..., N.

Proof. Let us assume that there is some $x' \ne x$, within constraints $g_j(x')$, such that $a(x' \mid \theta) \ne a(x \mid \theta)$ and $x'$ is the global optimum. There must exist i such that $x'_i \ne x_i$ and $\theta_i \ne 0$ (otherwise $a(x' \mid \theta) = a(x \mid \theta)$). There are two possibilities:
1. $\theta_i > 0$; then, by construction of x, $x'_i < x_i$ (as $x_i = b_i$); consequently, we can construct $x'' = [x'_1 \ldots b_i \ldots x'_d] \ne x'$ with $a(x'' \mid \theta) > a(x' \mid \theta)$; contradiction.
2. $\theta_i < 0$; then, by construction of x, $x'_i > x_i$ (as $x_i = a_i$); consequently, we can construct $x'' = [x'_1 \ldots a_i \ldots x'_d] \ne x'$ with $a(x'' \mid \theta) > a(x' \mid \theta)$; contradiction.

This simple observation shows that for linear activity concepts, the posed approach tends toward the saturation of all features in the compound that the model associates with activity. Interestingly, one can still obtain a sample from such a model using stochastic optimization with a fixed number of steps, such as simulated annealing69 or Metropolis−Hastings sampling.70

Let us now focus on a more interesting case, in which we use a kernel model, such as kernelized SVM, which employs the following model of activity

$$a_{\mathrm{svm}}(x \mid \theta) = \sum_{i=1}^{N} y_i \theta_i \kappa(x_i, x)$$

Although the SVM training procedure is convex and thus has a unique solution, the reversed optimization posed herein does not have this property. In fact, for nearly any choice of κ, the basic fingerprint optimization problem has multiple local maxima. In contrast to the typical ML setting, this is a favorable property of the developed method, because we want to sample multiple, diverse fingerprints; thus, the presence of local solutions is desirable. To obtain them, one can simply use any first-order optimization technique, which is formalized in the following theorem.

Theorem 2. For any kernel-based machine learning binary classifier minimizing the regularized empirical loss function on $\{(x_i, y_i)\}_{i=1}^{N}$, locally optimal fingerprints can be found using a gradient of the form

$$\sum_{i=1}^{N} \beta_i \nabla_x \kappa(x_i, x)$$

for some $\beta_i \in \mathbb{R}$, where κ is a differentiable kernel, such that $\kappa(x_i, x)$ is bounded for any x satisfying constraints $g_j(x)$.

Proof. According to the representer theorem, the solution of the minimization of every regularized empirical loss function can be represented as a functional

$$a(x) = \sum_{i=1}^{N} \beta_i \kappa(x_i, x)$$

for some $\beta_i \in \mathbb{R}$. Such a function is bounded by $\sum_{i=1}^{N} |\beta_i| \sup_{x: g_j(x)} \kappa(x_i, x)$, and the supremum exists and is finite. Thus, the use of first-order optimization on a convex set of $\mathbb{R}^d$, with

$$\nabla_x a(x) = \sum_{i=1}^{N} \beta_i \nabla_x \kappa(x_i, x)$$

converges to the local optima.

Table 1. Summary of Popular Kernel Functions with Their Gradients and Information Concerning Whether the Basic Fingerprint Optimization Problem Has Multiple Solutions

kernel      $K(x_i, x)$                           $\nabla_x K(x_i, x)$                          multiple solutions
linear      $\langle x_i, x\rangle$               $x_i$                                         no
polynomial  $(a\langle x_i, x\rangle + b)^p$      $ap(a\langle x_i, x\rangle + b)^{p-1} x_i$    yes, if p > 2
sigmoid     $\tanh(a\langle x_i, x\rangle + b)$   $a(1 - K^2(x_i, x))\,x_i$                     yes
RBF         $e^{-\gamma\|x_i - x\|^2}$            $2\gamma K(x_i, x)(x_i - x)$                  yes
Tanimoto    $\langle x_i, x\rangle / D$           $(x_i + K(x_i, x)(x_i - 2x)) / D$             yes

In the Tanimoto row, $D = \|x_i\|^2 + \|x\|^2 - \langle x_i, x\rangle$ denotes the shared denominator.

Consequently, because SVM is a regularized empirical loss-based model, the only remaining element needed to apply our reasoning is the derivation of the $\beta_i$ equations and $\nabla_x \kappa$ for popular kernel functions. It is easy to verify that $\beta_i = y_i \theta_i$ by differentiating $a_{\mathrm{svm}}$ term by term.
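The two theorems can be sketched numerically as follows. This is a hedged illustration, not the authors' implementation: the toy support vectors, the coefficients `beta` (standing for $y_i \theta_i$), the step size, and every function name are assumptions made for the example. The RBF gradient is the one listed in Table 1, the box constraints are enforced by clipping, and the constraint that forbids training-set points is not enforced here.

```python
import numpy as np

def linear_optimum(theta, lower, upper):
    """Theorem 1: take the upper bound where theta_j >= 0, otherwise the lower bound."""
    return np.where(theta >= 0, upper, lower)

def rbf_kernel(X, x, gamma):
    """kappa(x_i, x) = exp(-gamma * ||x_i - x||^2) for every row x_i of X."""
    return np.exp(-gamma * np.sum((X - x) ** 2, axis=1))

def svm_activity(x, X, beta, gamma):
    """a(x) = sum_i beta_i * kappa(x_i, x), with beta_i = y_i * theta_i."""
    return float(beta @ rbf_kernel(X, x, gamma))

def svm_activity_gradient(x, X, beta, gamma):
    """grad a(x) = sum_i beta_i * 2 * gamma * kappa(x_i, x) * (x_i - x)."""
    weights = 2.0 * gamma * beta * rbf_kernel(X, x, gamma)
    return weights @ (X - x)

def ascend(x0, X, beta, gamma, lower, upper, step=0.05, n_steps=200):
    """Projected gradient ascent; clipping keeps x inside the box constraints."""
    x = x0.copy()
    for _ in range(n_steps):
        x = np.clip(x + step * svm_activity_gradient(x, X, beta, gamma), lower, upper)
    return x

# Hypothetical toy model: two "active" and one "inactive" support vector in 2D.
X = np.array([[0.2, 0.8], [0.3, 0.7], [0.8, 0.2]])
beta = np.array([1.0, 1.0, -1.0])  # y_i * theta_i
lower, upper = np.zeros(2), np.ones(2)

x_start = np.array([0.5, 0.5])
x_opt = ascend(x_start, X, beta, gamma=2.0, lower=lower, upper=upper)

print("linear optimum:", linear_optimum(np.array([0.5, -1.0, 0.0]), np.zeros(3), np.ones(3)))
print("activity at start:", svm_activity(x_start, X, beta, 2.0))
print("activity at end  :", svm_activity(x_opt, X, beta, 2.0))
```

Starting the ascent from several different points would, under these assumptions, return different local maxima, which is the behavior the method relies on to obtain diverse fingerprints.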

$$\nabla_x a_{\mathrm{svm}}(x \mid \theta) = \nabla_x \sum_{i=1}^{N} y_i \theta_i \kappa(x_i, x) = \sum_{i=1}^{N} y_i \theta_i \nabla_x \kappa(x_i, x)$$

Let us now derive a gradient for one of the most popular SVM kernels: the radial basis function (RBF, also known as Gaussian kernel):

$$\nabla_x \kappa_{\mathrm{rbf}}(x_i, x \mid \gamma) = \nabla_x e^{-\gamma \|x_i - x\|^2} = e^{-\gamma \|x_i - x\|^2} \cdot \nabla_x(-\gamma \|x_i - x\|^2) = -\gamma \kappa_{\mathrm{rbf}}(x_i, x \mid \gamma)(2(x - x_i)) = 2\gamma \kappa_{\mathrm{rbf}}(x_i, x \mid \gamma)(x_i - x) \propto \kappa_{\mathrm{rbf}}(x_i, x \mid \gamma)(x_i - x)$$

and the whole gradient is proportional to

$$\nabla_x a_{\mathrm{svm}}(x \mid \theta, \kappa_{\mathrm{rbf}}, \gamma) \propto \sum_{i=1}^{N} y_i \theta_i \kappa_{\mathrm{rbf}}(x_i, x \mid \gamma)(x_i - x)$$

Consequently, the process of optimizing this activity concept can be seen as an iterative search for the optimal fingerprint, in which each active compound in the training set “attracts” a current solution to itself with a strength proportional to the kernel-induced similarity between these two objects multiplied by the SVM-assigned weight $\theta_i$. Symmetrically, each inactive compound “pushes away” our solution with analogous strength. Table 1 summarizes the gradients for other popular kernel functions, together with information concerning whether it leads to the existence of multiple local optima.

As previously mentioned, the proposed method can be applied not only to linear and kernel models, but also to any type of classifier that provides access to the internal model of activity. Similarly, one could derive equations for derivatives for the neural network models or weighted kNN.71 However, it may be a bit more complex for less mathematical models, such as Random Forests, which do not directly model the activity function but rather use rule-based splitting of the input space. Applications to this family of models will be addressed in future research.

Fingerprints Generation. The following targets were used as case studies: serotonin receptors 5-HT2A,72−74 5-HT2C,75−77 5-HT6,78−80 5-HT7,80−82 beta2 adrenergic receptor (beta2AR),83,84 cathepsin B (catB),85,86 dopamine receptor D2,87 and tyrosine-protein C-terminal SRC kinase.88,89 For each of the above-mentioned proteins, the sets of active (with a Ki below 100 nM or an IC50 below 200 nM) and inactive (with a Ki higher than 1000 nM or an IC50 higher than 2000 nM) compounds were fetched from the ChEMBL database.90 The number of compounds belonging to a particular activity class is shown in Table 2.

Table 2. Number of Compounds from Particular Activity Classes

target     number of active compounds   number of inactive compounds
5-HT2A     1835                         851
5-HT2C     1210                         926
5-HT6      1490                         341
5-HT7      704                          339
beta2AR    273                          347
catB       245                          943
D2         3342                         2873
SRC        942                          2173

All compounds were represented using the available sets of substructural features: MACCS fingerprint (MACCSFP),91 Klekota−Roth fingerprint (KlekFP; the PaDEL-Descriptor implementation was used),92,42 and two OpenBabel fingerprints encoding SMARTS Patterns for Functional Group Classification by Christian Laggner93 with the number of keys presented in Table 3 (some individual keys that produced errors at the level of fingerprint generation were not considered; the full list of keys taken into account for each fingerprint is provided in the Supporting Information).

Table 3. Number of Keys Provided by a Particular Fingerprint

fingerprint name    number of keys considered
MACCSFP             162
KlekFP              4860
OpenBabel1 (FP3)    55
OpenBabel2 (FP4)    305

The fingerprints were calculated by verifying the occurrence of a given substructure in the molecule. It was a frequency-based examination, and the formed fingerprints were of the “count” type (the number of particular occurrences was coded). For each active compound, the molecular fragments corresponding to particular substructural keys were extracted. Parts of molecules that were not covered by any of the keys from a given fingerprint formed the library of linkers (Figure 4). Therefore, a linker originating from a considered molecule is constituted by the set of atoms that were not “hit” by a particular fingerprint. To examine the extent to which the keys from particular fingerprints were covering the input structures, the numbers of linkers generated for particular targets/fingerprints were gathered and are presented in Table 4. Analogously, the substructural diversity of the compounds was analyzed by comparing the number of fragments (substructures) hit by the particular keys of the fingerprint (Table 5). The obtained substructures were then “cleaned” based on their inclusion in other fragments (to remove duplicates). Thus, if a given set of atoms was fully included in another moiety, only the largest substructure was taken into further consideration.

Figure 4. Generation of libraries of substructures and linkers based on the hits of the fingerprint keys.

An analysis of the number of linkers shows that MACCSFP had the highest coverage rate of various chemical compounds because all atoms of the fingerprinted molecules were coded by the MACCS keys. On the other hand, for KlekFP, despite its having the highest number of keys, a significant number of compounds contained atoms that did not belong to any of the substructures described by the keys from the fingerprint definition, which is expressed by the relatively large number of linkers. The highest rate of noncovered molecular fragments was clearly observed for the OpenBabel1 fingerprint, which exhibited statistically more than one linker per structure. However, the OpenBabel1 fingerprint displayed the lowest number of overlapping atoms in the substructures hit by its keys, as expressed by the percentage of fragments that remained after cleaning that exceeded 85% in the majority of cases; for some targets (such as serotonin receptors), it was even above 90%. However, this relatively high fraction of nonshared (in terms of atoms) fragments observed for the OpenBabel1 fingerprint could also have resulted from the relatively low coverage rate provided by this representation, because a large number of atoms were not described by any of the OpenBabel1 keys. In contrast, for MACCSFP and KlekFP, the fraction of fragments that were not rejected in the cleaning procedure fell within the range of 0.1−0.2.

Despite significant differences in the percentages of fragments that remained after cleaning for the two OpenBabel fingerprints, the initial number of substructures resulting from their calculation was similar. The highest number of fragments enumerated based on the fingerprint was obtained for MACCSFP (∼3 times higher than the OpenBabel fingerprints), whereas for KlekFP, the original number of fragments was approximately two times greater than for OpenBabel representations in most cases. At the end of this form of comparisons, the average number of fragments generated per ligand (the number of fragments after cleaning was considered) was calculated (Figure 5). The general dependencies of the number of fragments generated per input structure are the same for all examined targets, and the highest number was obtained for the OpenBabel1 fingerprint (on average, it was approximately 40 fragments/structure), which is approximately two times higher than the number of fragments

Table 4. Number of Linkers Returned for Each of the Fingerprints Used for the Considered Targets

| target/fingerprint | MACCSFP | KlekFP | OpenBabel1 | OpenBabel2 |
|---|---|---|---|---|
| 5-HT2A | 0 | 345 | 3296 | 83 |
| 5-HT2C | 0 | 173 | 2290 | 36 |
| 5-HT6 | 0 | 285 | 2211 | 21 |
| 5-HT7 | 0 | 30 | 1235 | 25 |
| beta2AR | 0 | 7 | 443 | 4 |
| catB | 0 | 19 | 741 | 5 |
| D2 | 0 | 405 | 5975 | 242 |
| SRC | 0 | 297 | 1725 | 15 |

Table 5. Number of Fragments Obtained for a Given Target and Fingerprint: Raw and after the Cleaning Procedure

| target | MACCSFP raw | MACCSFP after cleaning | MACCSFP left after cleaning (%) | KlekFP raw | KlekFP after cleaning | KlekFP left after cleaning (%) | OpenBabel1 raw | OpenBabel1 after cleaning | OpenBabel1 left after cleaning (%) | OpenBabel2 raw | OpenBabel2 after cleaning | OpenBabel2 left after cleaning (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5-HT2A | 247 424 | 32 063 | 12.96 | 136 507 | 20 753 | 15.20 | 77 343 | 69 902 | 90.38 | 77 763 | 28 524 | 36.68 |
| 5-HT2C | 152 150 | 21 521 | 14.14 | 84 396 | 13 129 | 15.56 | 48 237 | 43 450 | 90.08 | 47 257 | 17 995 | 38.08 |
| 5-HT6 | 203 116 | 24 287 | 11.96 | 90 280 | 14 159 | 15.68 | 62 193 | 57 437 | 92.35 | 61 236 | 22 891 | 37.38 |
| 5-HT7 | 100 500 | 10 518 | 10.47 | 59 037 | 7 571 | 12.82 | 29 499 | 27 099 | 91.86 | 29 700 | 10 948 | 36.86 |
| beta2AR | 40 396 | 6 652 | 16.47 | 24 550 | 3 517 | 14.33 | 13 985 | 11 133 | 79.61 | 12 294 | 4 612 | 37.51 |
| catB | 34 814 | 6 111 | 17.55 | 36 912 | 4 435 | 12.02 | 10 922 | 9 504 | 87.02 | 11 773 | 3 961 | 33.64 |
| D2 | 499 782 | 59 331 | 11.87 | 300 967 | 43 016 | 14.29 | 147 328 | 127 891 | 86.81 | 150 328 | 52 123 | 34.67 |
| SRC | 150 979 | 22 129 | 14.66 | 64 112 | 11 534 | 17.99 | 50 514 | 44 016 | 87.14 | 50 649 | 17 770 | 35.08 |


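Figure 5. Analysis of the number of fragments generated from the optimal fingerprint per input structure for particular representations and targets (the results after the application of the cleaning procedure are considered).

$$\mathrm{recall} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{precision} = \frac{TP}{TP + FP} \tag{2}$$

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{3}$$

which, after the respective transformations, gives

$$F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \tag{4}$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.

Table 6. Minimum Standard Deviation Values of the Euclidean Distance between Optimal Fingerprints Obtained after the ML-Based Optimization Procedure

| target/fingerprint | MACCSFP | KlekFP | OpenBabel1 | OpenBabel2 |
|---|---|---|---|---|
| 5-HT2A | 0.752 | 0.074 | 0.735 | 0.396 |
| 5-HT2C | 0.680 | 0.090 | 0.826 | 0.371 |
| 5-HT6 | 0.622 | 0.088 | 0.817 | 0.394 |
| 5-HT7 | 0.677 | 0.090 | 0.774 | 0.374 |
| beta2AR | 0.660 | 0.091 | 0.824 | 0.368 |
| catB | 0.686 | 0.081 | 0.842 | 0.397 |
| D2 | 0.714 | 0.077 | 0.884 | 0.382 |
| SRC | 0.818 | 0.082 | 0.810 | 0.401 |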
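Table 6 summarizes a selection step in which, out of the ten fingerprints optimized for a target/fingerprint pair, the one with the lowest standard deviation of Euclidean distances is kept. The text does not spell out the exact statistic, so the sketch below rests on one possible reading: for each candidate, the (population) standard deviation of its distances to all other candidates is computed, and the candidate with the smallest value is returned. The function names are hypothetical.

```python
import math
from statistics import pstdev


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_consistent(fingerprints):
    """Return (index, sd) of the fingerprint whose distances to the
    other fingerprints have the lowest standard deviation.

    Assumed reading of the selection criterion; needs >= 2 fingerprints.
    """
    best_index, best_sd = None, math.inf
    for i, fp in enumerate(fingerprints):
        # One row of the distance matrix, without the zero diagonal entry.
        row = [euclidean(fp, other)
               for j, other in enumerate(fingerprints) if j != i]
        sd = pstdev(row)
        if sd < best_sd:
            best_index, best_sd = i, sd
    return best_index, best_sd
```

Other readings (for example, the standard deviation over the whole distance matrix, reported once per target) are equally compatible with the description, so this block should be taken as an illustration of the idea rather than a reproduction of the procedure.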

regression were extremely sparse, only the SVM results were further considered. As already mentioned, the presented analysis clearly indicated that the most consistent results were obtained for KlekFP; therefore, only this fingerprint was taken into account in subsequent steps. Although MACCSFP provided the highest coverage rate of the compounds (almost no atoms were left for linker formation), the sets of substructures that were indicated as optimal for compound activity were rather different. In general, the most diverse outcome was obtained for the OpenBabel1 fingerprint; taking into account also the high number of linkers resulting from the generation of this representation, it can be considered the least useful for this type of optimization experiment.

The ML-based methodology of obtaining the set of substructures considered optimal for the biological activity of compounds was compared with the privileged fragments approach.99 The comparison was performed to assess whether simply scanning the frequency of occurrence of particular bits would be sufficient to obtain the same set of fragments for the formation of new molecules (Figure 6, Supporting Information). The cutoff for the fraction of compounds in which a particular substructure occurred (in the privileged fragments approach) was set so that the number of bits was comparable to the number of nonzero positions indicated by the SVM-based experiments. The comparison clearly indicated that the fragments selected by the privileged fragments approach are significantly different from those obtained via the ML-based optimization. Additional analyses of the positions indicated as desirable for biological activity also revealed that the sets of nonzero keys were not completely common and that a relatively large fraction of substructures was unique to particular targets. This information could be helpful, for example, in selectivity studies.
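The frequency-based side of this comparison can be sketched in a few lines. The code below ranks fingerprint keys by the fraction of active compounds in which they occur and keeps as many keys as requested, which mimics adjusting the frequency cutoff until the number of keys matches the ML-derived set. The representation of a fingerprint as a set of "on" key identifiers, the function names, and the tie-breaking by key identifier are assumptions of this sketch.

```python
from collections import Counter


def key_frequencies(fingerprints):
    """Fraction of compounds in which each key is set.

    `fingerprints` is a list of collections of "on" key identifiers.
    """
    n = len(fingerprints)
    counts = Counter(key for fp in fingerprints for key in set(fp))
    return {key: count / n for key, count in counts.items()}


def privileged_keys(active_fingerprints, n_keys):
    """Return the n_keys most frequent keys among actives and the
    frequency cutoff implied by that choice (None if nothing selected)."""
    freqs = key_frequencies(active_fingerprints)
    # Most frequent first; ties broken by key id for reproducibility.
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    selected = ranked[:n_keys]
    cutoff = selected[-1][1] if selected else None
    return [key for key, _ in selected], cutoff
```

Because only the active set is scanned, a key that is equally common among inactives would still be selected here, which is one way the two approaches can end up with different fragments.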
The relatively large differences between the machine learning-based optimization of compound composition and the privileged fragments-based approach stem from the manner in which the results are obtained. During the search for privileged fragments, the set of active ligands is scanned in terms of the frequency of occurrence of particular substructures, whereas for ML methods, what matters is not only how frequently a given substructure occurs in the set of known ligands but also its discriminative power in differentiating between actives and inactives. That is why, in ML-based models, fragments that occur frequently in both actives and inactives are generally not selected as optimal for the formation of new compounds.

Combinatorial Libraries Generation, Characterization, and Evaluation. Because the numbers of nonzero positions in the optimal fingerprints were relatively high, the subgroups of

generated for the remaining fingerprints, followed by MACCSFP (20) and KlekFP (below 15).

ML Experiments. For each protein/fingerprint set, the ML models were developed in 5-fold cross-validation mode (the fingerprints were normalized before the training procedure). The ML algorithms used were SVM94 with the radial basis function kernel and, as a representative example of linear models, logistic regression with stochastic gradient descent (SGD) training95 (hyperparameters were optimized by grid search). The predictive power of the constructed ML models was evaluated based on the F1-score, which is the harmonic mean of precision and recall (eqs 1−4).
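As a quick numerical check of the F1-score used for model selection, the sketch below computes it in two ways from confusion-matrix counts: through precision and recall, and through the closed form 2·TP/(2·TP + FP + FN). The function names are illustrative, and returning 0.0 when the score is undefined is a convention chosen for this sketch.

```python
def f1_from_precision_recall(tp: int, fp: int, fn: int) -> float:
    """F1 computed via precision and recall."""
    if tp == 0:
        return 0.0  # convention: no true positives gives a score of zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_closed_form(tp: int, fp: int, fn: int) -> float:
    """F1 computed directly from the counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

For example, with 80 true positives, 20 false positives, and 40 false negatives, precision is 0.8, recall is 2/3, and both functions give 160/220, that is, approximately 0.727.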

The parameters that provided the highest F1-score values were used to generate the final optimization models using all the data available for each target (the optimal parameters selected in each case are listed in the Supporting Information). For each target/fingerprint pair, the developed ML-based optimization procedure was applied with the maximum number of substructures of a particular type constrained to 5. Because the optimization process involves randomness, 10 fingerprints were generated for each case. The similarity between them was examined by calculating the Euclidean distance matrix,96,97 followed by analysis of the standard deviation.98 The string with the lowest standard deviation was taken for combinatorial library generation (Table 6; the strings obtained in the optimization procedure are provided in the Supporting Information). Table 6 also shows that the consistency of the obtained results was the highest for KlekFP (the lowest standard deviation values) and the lowest for MACCSFP and OpenBabel1 (the highest standard deviation values). Because the fingerprints obtained by logistic


Figure 6. Comparison of the machine learning-based selection of the optimal compound composition with the privileged fragments approach, using serotonin receptor ligands as an example; the cutoff for the frequency of occurrence in the privileged fragments-based methodology was adjusted to return a number of keys similar to that of the machine learning-based protocol.

keys with 3−5 representatives were enumerated. The SMILES strings corresponding to them were then picked and combined in all possible ways (because of the large number of combinations obtained from each subset of fragments, 50 connections were randomly selected for the formed library; Figure 7 and Table 7). To eliminate very large structures, an additional constraint on the maximum number of non-hydrogen atoms was set (structures with more than 35 non-hydrogen atoms were rejected). An example of the formation of new compounds is presented in Figure 8.
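The bookkeeping behind this enumeration step can be sketched as follows. The code counts and enumerates subsets of 3−5 keys, draws at most 50 random connections from a list of candidates, and applies the 35 heavy-atom limit. It does not perform any chemistry: the way fragments are actually joined is not modeled, the heavy-atom count of a product is approximated by the sum over its fragments, and all names are hypothetical.

```python
import random
from itertools import combinations
from math import comb

SUBSET_SIZES = (3, 4, 5)
CONNECTIONS_PER_SUBSET = 50
MAX_HEAVY_ATOMS = 35


def count_subsets(n_keys):
    """Number of key subsets of each allowed size."""
    return {k: comb(n_keys, k) for k in SUBSET_SIZES}


def enumerate_subsets(keys):
    """Yield every subset of 3, 4, and 5 keys as a tuple."""
    keys = list(keys)
    for k in SUBSET_SIZES:
        yield from combinations(keys, k)


def sample_connections(connections, seed=None):
    """Randomly keep at most 50 of the candidate connections."""
    rng = random.Random(seed)
    k = min(CONNECTIONS_PER_SUBSET, len(connections))
    return rng.sample(connections, k)


def within_size_limit(subset, heavy_atoms):
    """Approximate size check: sum of fragment heavy-atom counts.

    `heavy_atoms` maps each key to the heavy-atom count of its fragment;
    linker atoms are ignored, which is a simplification of this sketch.
    """
    return sum(heavy_atoms[key] for key in subset) <= MAX_HEAVY_ATOMS
```

The number of subsets grows quickly with the size of the key pool (a pool of 16 keys already gives 560, 1820, and 4368 subsets of 3, 4, and 5 keys), which illustrates why only a random sample of connections is retained.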

The number of compounds generated using this procedure is presented in Table 8, together with the results of the initial evaluation of their activity using the hashed fingerprint and KlekFP. Table 9 reports the presence of the generated compounds in the ZINC database and the outcome of their evaluation by the PAINS filter. The activity evaluation was first performed based on the hashed fingerprint (the extended fingerprint from the PaDEL-Descriptor42 package). All active and inactive compounds of a


Figure 7. Scheme of the developed protocol for optimal fingerprint-based combinatorial libraries generation.

Table 7. Number of Unique Combinations of Different Substructures Depending on the Set Size

| target / substructures set size | number of nonzero keys in optimal fingerprint | 3 elements | 4 elements | 5 elements | total |
|---|---|---|---|---|---|
| 5-HT2A | 27 | 560 | 1820 | 4368 | 6868 |
| 5-HT2C | 40 | 2300 | 12 650 | 53 130 | 68 380 |
| 5-HT6 | 38 | 2024 | 10 626 | 42 504 | 55 430 |
| 5-HT7 | 40 | 1540 | 7315 | 26 334 | 35 420 |
| beta2AR | 41 | 969 | 3876 | 11 628 | 16 644 |
| catB | 32 | 165 | 330 | 462 | 1012 |
| D2 | 29 | 1140 | 4845 | 15 504 | 21 679 |
| SRC | 33 | 286 | 715 | 1287 | 2366 |

Table 8. Sizes of the Obtained Fragment Libraries and the Results of Their Initial Biological Evaluation

| target | number of compounds left after duplicates filtering | number of compounds indicated as active in hashed fingerprint-based evaluation (% of initially generated unique compounds) | number of compounds indicated as active in KlekFP-based evaluation (% of initially generated unique compounds) |
|---|---|---|---|
| 5-HT2A | 141 625 | 84 295 (60%) | 105 938 (75%) |
| 5-HT2C | 683 639 | 209 370 (31%) | 305 122 (45%) |
| 5-HT6 | 240 910 | 119 995 (50%) | 98 786 (41%) |
| 5-HT7 | 230 346 | 122 618 (53%) | 170 091 (74%) |
| beta2AR | 67 856 | 8052 (12%) | 11 912 (18%) |
| catB | 8569 | 468 (5%) | 163 (2%) |
| D2 | 194 422 | 56 779 (29%) | 69 616 (36%) |
| SRC | 27 642 | 3510 (13%) | 3713 (13%) |

particular target and all molecules generated via the protocol developed within the study were translated into this bit string representation. An SVM-based predictive model was then constructed from all known actives and inactives, and the newly generated structures were evaluated in terms of their activity toward the considered target. A large fraction of the resulting molecules was indicated as potentially active: 31−60% of the compounds for the serotonin receptors (50% or more for all of them except 5-HT2C), 29% for dopamine receptor D2, 12% and 13% for beta2AR and SRC, respectively, and the lowest fraction, 5%, for catB (Table 8). In contrast, in the previously performed screening of commercial databases, only ∼1% of the compounds were assessed as such.100 An additional activity evaluation, performed analogously but using the fingerprint that constituted the basis for compound generation (KlekFP), gave similar fractions of compounds indicated as active: from 74% and 75% for 5-HT7 and 5-HT2A, respectively, through 36−45% for D2, 5-HT2C, and 5-HT6, to 2% for catB. Although the chemical composition of the generated compounds refers to the optimal

Table 9. Analysis of the Presence of Compounds in the ZINC Database and Results of Evaluation by the PAINS Filter

| target | number of compounds left after duplicates filtering | number of compounds found in ZINC database (% of initially generated unique compounds) | number of compounds that successfully passed the PAINS filter |
|---|---|---|---|
| 5-HT2A | 141 625 | 3648 (3%) | 139 901 (99%) |
| 5-HT2C | 683 639 | 10 638 (2%) | 653 117 (96%) |
| 5-HT6 | 240 910 | 6430 (3%) | 234 818 (97%) |
| 5-HT7 | 230 346 | 3064 (1%) | 228 884 (99%) |
| beta2AR | 67 856 | 2508 (4%) | 66 996 (99%) |
| catB | 8569 | 669 (8%) | 8467 (99%) |
| D2 | 194 422 | 7092 (4%) | 187 599 (96%) |
| SRC | 27 642 | 1374 (5%) | 27 639 (100%) |

Figure 8. Examples of new compound generation for the 5-HT6 receptor.

string of descriptors extracted according to the KlekFP representation, the fraction of molecules evaluated as active is not close to 100%. The reason is that the newly formed compounds were generated not from the whole set of optimal substructures but from relatively small subsets of them; therefore, the similarity of the respective fingerprints to the optimal string is rather low (Supporting Information). The need to generate subsets of substructures for the enumeration of new compounds is a consequence of the relatively high number of nonzero positions in the optimal string. This is a limitation of the developed methodology, and in further studies, constraints should be applied that prevent such a high number of substructures from being recommended for the most active compounds. Such constraints could limit the total number of nonzero positions allowed in the optimal fingerprint or the sum of the molecular weights of the substructures corresponding to positions indicated as important for the desired activity.

The analysis of the presence of the generated compounds in the ZINC database (Table 9) revealed that a relatively low fraction of the compounds is offered for sale (1−8%, usually 3−4%). This finding indicates that the combinatorial libraries formed can constitute a promising starting point for virtual screening campaigns oriented at the evaluation of molecules beyond the commercially available sets. Moreover, running the PAINS filter (which evaluates compounds in terms of their probability of causing assay interference) led to very optimistic results, as almost all the compounds passed this assessment (more than 95% for all the targets considered).

The obtained sets of compounds were characterized in terms of their physicochemical properties, and an initial evaluation of their potential biological activity was carried out as well.
The molecular descriptors taken into account were as follows: molecular weight (MW), octanol−water partition coefficient (logP), number of hydrogen bond acceptors (HBA), number of hydrogen bond donors (HBD), and number of rotatable bonds (rotB). The distribution of the properties of the molecules from the library obtained for 5-HT2AR is presented in Figure 9 (the characterization for the remaining targets is available in the Supporting Information).

The final evaluation assessed the synthetic accessibility of the generated compounds, that is, the possibility of obtaining the compounds physically and not only in silico. This analysis was performed using the SYLVIA software101 with default settings. The results showed that, for all of the targets considered, the great majority of compounds were assigned medium difficulty of synthesis (Figure 10a). The analogous assessment carried out for known ligands for comparison (Figure 10b) showed similar tendencies: medium difficulty of synthesis was also the most frequent outcome.
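Returning to the observation made earlier in this section that the generated compounds are not very similar to the optimal string: a back-of-the-envelope Tanimoto calculation shows why. The sketch below treats fingerprints as sets of "on" positions and considers an idealized compound that contains only k of the n optimal keys and no other keys; both the set representation and this idealization are assumptions, and real compounds would typically set additional bits and score even lower.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two collections of "on" bit positions.

    Returns 0.0 when both are empty (a convention of this sketch).
    """
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def idealized_similarity(n_optimal_keys, n_used_keys):
    """Similarity to the optimal string when a compound carries exactly
    n_used_keys of the optimal keys and nothing else."""
    optimal = set(range(n_optimal_keys))
    compound = set(range(n_used_keys))
    return tanimoto(compound, optimal)
```

With the 27 nonzero keys reported for 5-HT2A in Table 7, compounds built from 3, 4, or 5 keys would reach at most about 0.11, 0.15, and 0.19 under these assumptions.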



Figure 9. Histograms of the distribution of properties of the compound collection generated for 5-HT2AR.

CONCLUSIONS

The developed methodology enabled the generation of target-focused combinatorial libraries of new potentially active compounds. Unlike the available approaches to the formation of novel compounds, the generation of structures is performed on a different level: fingerprint representation and ML constitute the basis for this process. The analysis of various combinations of ML methods and compound representations indicated that, for the purpose of enumerating new libraries, KlekFP and SVM provided the most consistent results. Because fingerprint optimization is performed independently for each target, the formed libraries are target-focused. Characterization of the obtained sets enables easy selection of drug-like representatives and indicates moderate difficulty of synthesis and a low risk of causing interference problems (PAINS filter). The libraries are publicly available at http://skandal.if-pan.krakow.pl/comb_lib/ and can provide a starting point for virtual screening procedures.


AUTHOR INFORMATION

Corresponding Author

*Phone: +48 12 66 23 365. Fax: +48 12 637 45 00. E-mail: [email protected].

ORCID

Andrzej J. Bojarski: 0000-0003-1417-6333

Present Address

§W.M.C.: Google DeepMind, 5 New Street Square, London EC4A 3TW, U.K.

Author Contributions

S.P. designed the protocol and experiments, prepared data sets, and analyzed data; W.C. performed ML-based fingerprints optimization; and R.K. automated the procedure of compounds libraries enumeration. The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

Funding

The study was supported by the grant OPUS 2014/13/B/ST6/01792 and by the grant HARMONIA 2015/18/M/NZ7/00377 financed by the National Science Centre, Poland.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

The study was supported by the grant OPUS 2014/13/B/ST6/01792 and by the grant HARMONIA 2015/18/M/NZ7/00377 financed by the National Science Centre, Poland. S.P. and A.J.B. participate in the European Cooperation in Science and Technology (COST) Action CM1207: GPCR-Ligand Interactions, Structures, and Transmembrane Signalling: A European Research Network (GLISTEN).

Figure 10. Results of the evaluation of the synthetic accessibility of (a) the generated compounds and (b) known ligands; the assessment was performed with the use of the SYLVIA software.101





ABBREVIATIONS

beta2AR, beta2 adrenergic receptor; catB, cathepsin B; HBA, number of hydrogen bond acceptors; HBD, number of hydrogen bond donors; KlekFP, Klekota and Roth fingerprint; logP, octanol−water partition coefficient; ML, machine learning; MW, molecular weight; RBF, radial basis function; rotB, number of rotatable bonds; SGD, stochastic gradient descent; SVM, support vector machine

EXPERIMENTAL SECTION

The following parameters were optimized during the training of the ML methods, within the specified sets of values: C ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000} and γ ∈ {10^−10, 10^−9, 10^−8, 10^−7, 10^−6, 10^−5, 10^−4, 0.001, 0.01, 0.1, 1, 10, 100, 1000} for SVM, and alpha ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000} for logistic regression. The scikit-learn package102 was used as the source of ML tools; the values of the physicochemical properties for the evaluation of combinatorial libraries were calculated using the RDKit package.103
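To give a sense of the size of this search space, the sketch below builds the two parameter grids from the values listed above and enumerates every combination. It uses only the Python standard library; the scoring callback is a placeholder for whatever cross-validated F1-score an actual experiment would compute, and the names `SVM_GRID`, `LOGREG_GRID`, `iter_grid`, and `best_setting` are hypothetical.

```python
import math
from itertools import product

# Powers of ten covering the value sets listed in the text.
SVM_GRID = {
    "C": [10.0 ** e for e in range(-3, 6)],        # 0.001 ... 100000
    "gamma": [10.0 ** e for e in range(-10, 4)],   # 1e-10 ... 1000
}
LOGREG_GRID = {
    "alpha": [10.0 ** e for e in range(-5, 4)],    # 0.00001 ... 1000
}


def iter_grid(grid):
    """Yield one dict per combination of parameter values."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def best_setting(grid, score):
    """Return the parameter combination with the highest score.

    `score` is any callable mapping a parameter dict to a number, e.g. a
    mean cross-validated F1-score (not implemented in this sketch).
    """
    return max(iter_grid(grid), key=score)
```

The SVM grid contains 9 × 14 = 126 combinations and the logistic regression grid contains 9, so an exhaustive search with 5-fold cross-validation would train 630 and 45 models per target/fingerprint pair, respectively.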





ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.6b00426.

Comparison of the ML experiments, the results of the privileged fragments theory, histograms of the distribution of properties of the obtained sets of compounds, and comparison of the Tanimoto values between the optimal string indicated by ML methods and newly formed compounds (PDF)

List of keys used in the fingerprinting procedure, list of optimal parameters obtained in ML experiments, and list of optimal strings obtained in the ML-based optimization procedure (XLSX)

REFERENCES

(1) Hughes, J. P.; Rees, S. S.; Kalindjian, S. B.; Philpott, K. L. Principles of Early Drug Discovery. Br. J. Pharmacol. 2011, 162, 1239−1249.
(2) Meanwell, N. A. Synopsis of Some Recent Tactical Application of Bioisosteres in Drug Design. J. Med. Chem. 2011, 54, 2529−2591.
(3) Böhm, H. J.; Flohr, A.; Stahl, M. Scaffold Hopping. Drug Discovery Today: Technol. 2004, 1, 217−224.
(4) Pierce, A. C.; Rao, G.; Bemis, G. W. BREED: Generating Novel Inhibitors through Hybridization of Known Ligands. Application to CDK2, P38, and HIV Protease. J. Med. Chem. 2004, 47, 2768−2775.
(5) Chevillard, F.; Kolb, P. SCUBIDOO: A Large yet Screenable and Easily Searchable Database of Computationally Created Chemical Compounds Optimized toward High Likelihood of Synthetic Tractability. J. Chem. Inf. Model. 2015, 55, 1824−1835.
(6) Jamois, E. A.; Hassan, M.; Waldman, M. Evaluation of Reagent-Based and Product-Based Strategies in the Design of Combinatorial Library Subsets. J. Chem. Inf. Comput. Sci. 2000, 40, 63−70.
(7) Zheng, W.; Hung, S. T.; Saunders, J. T.; Seibel, G. L. PICCOLO: A Tool for Combinatorial Library Design via Multicriterion Optimization. Pacific Symp. Biocomput. 2000, 5, 588−599.
(8) Gillet, V. J. Reactant- and Product-Based Approaches to the Design of Combinatorial Libraries. J. Comput.-Aided Mol. Des. 2002, 16, 371−380.
(9) Orry, A. J. W.; Abagyan, R. A.; Cavasotto, C. N. Structure-Based Development of Target-Specific Compound Libraries. Drug Discovery Today 2006, 11, 261−266.
(10) Harris, C. J.; Hill, R. D.; Sheppard, D. W.; Slater, M. J.; Stouten, P. F. W. The Design and Application of Target-Focused Compound Libraries. Comb. Chem. High Throughput Screening 2011, 14, 521−531.
(11) Deng, Z.; Chuaqui, C.; Singh, J. Knowledge-Based Design of Target-Focused Libraries Using Protein-Ligand Interaction Constraints. J. Med. Chem. 2006, 49, 490−500.
(12) Fischer, J. R.; Lessel, U.; Rarey, M. LoFT: Similarity-Driven Multiobjective Focused Library Design. J. Chem. Inf. Model. 2010, 50, 1−21.
(13) Langer, T.; Wolber, G. Pharmacophore Definition and 3D Searches. Drug Discovery Today: Technol. 2004, 1, 203−207.
(14) Reddy, A. S.; Pati, S. P.; Kumar, P. P.; Pradeep, H. N.; Sastry, G. N. Virtual Screening in Drug Discovery − a Computational Perspective. Curr. Protein Pept. Sci. 2007, 8, 329−351.
(15) Stumpfe, D.; Bajorath, J. Similarity Searching. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2011, 1, 260−282.
(16) Willett, P.; Barnard, J. M.; Downs, G. M. Chemical Similarity Searching. J. Chem. Inf. Comput. Sci. 1998, 38, 983−996.
(17) Bender, A.; Jenkins, J. L.; Scheiber, J.; Sukuru, S. C. K.; Glick, M.; Davies, J. W. How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space. J. Chem. Inf. Model. 2009, 49, 108−119.
(18) Sousa, F.; Fernandes, P. A.; Ramos, M. J. Protein−Ligand Docking: Current Status and Future. Proteins: Struct., Funct., Genet. 2006, 65, 15−26.
(19) Bissantz, C.; Folkers, G.; Rognan, D. Protein-Based Virtual Screening of Chemical Databases. 1. Evaluation of Different Docking/Scoring Combinations. J. Med. Chem. 2000, 43, 4759−4767.
(20) Krovat, E. M.; Steindl, T.; Langer, T. Recent Advances in Docking and Scoring. Curr. Comput.-Aided Drug Des. 2005, 1, 93−102.
(21) Halperin, I.; Ma, B.; Wolfson, H.; Nussinov, R. Principles of Docking: An Overview of Search Algorithms and a Guide to Scoring Functions. Proteins: Struct., Funct., Genet. 2002, 47, 409−443.
(22) Dias, R.; de Azevedo, W. F., Jr. Molecular Docking Algorithms. Curr. Drug Targets 2008, 9, 1040−1047.
(23) Mitchell, J. B. O. Machine Learning Methods in Chemoinformatics. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2014, 4, 468−481.
(24) Melville, J. L.; Burke, E. K.; Hirst, J. D. Machine Learning in Virtual Screening. Comb. Chem. High Throughput Screening 2009, 12, 332−343.
(25) Ma, X. H.; Jia, J.; Zhu, F.; Xue, Y.; Li, Z. R.; Chen, Y. Z. Comparative Analysis of Machine Learning Methods in Ligand-Based Virtual Screening of Large Compound Libraries. Comb. Chem. High Throughput Screening 2009, 12, 344−357.
(26) Han, L. Y.; Ma, X. H.; Lin, H. H.; Jia, J.; Zhu, F.; Xue, Y.; Li, Z. R.; Cao, Z. W.; Ji, Z. L.; Chen, Y. Z. A Support Vector Machines Approach for Virtual Screening of Active Compounds of Single and Multiple Mechanisms from Large Libraries at an Improved Hit-Rate and Enrichment Factor. J. Mol. Graphics Modell. 2008, 26, 1276−1286.
(27) Lin, H. H.; Han, L. Y.; Yap, C. W.; Xue, Y.; Liu, X. H.; Zhu, F.; Chen, Y. Z. Prediction of Factor Xa Inhibitors by Machine Learning Methods. J. Mol. Graphics Modell. 2007, 26, 505−518.
(28) Lee, J. H.; Lee, S.; Choi, S. In Silico Classification of Adenosine Receptor Antagonists Using Laplacian-Modified Naïve Bayesian, Support Vector Machine, and Recursive Partitioning. J. Mol. Graphics Modell. 2010, 28, 883−890.
(29) Wang, M.; Yang, X.-G.; Xue, Y. Identifying hERG Potassium Channel Inhibitors by Machine Learning Methods. QSAR Comb. Sci. 2008, 27, 1028−1035.
(30) Hammann, F.; Gutmann, H.; Baumann, U.; Helma, C.; Drewe, J. Classification of Cytochrome P450 Activities Using Machine Learning Methods. Mol. Pharmaceutics 2009, 6, 1920−1926.
(31) Lin, H. H.; Han, L. Y.; Yap, C. W.; Xue, Y.; Liu, X. H.; Zhu, F.; Chen, Y. Z. Prediction of Factor Xa Inhibitors by Machine Learning Methods. J. Mol. Graphics Modell. 2007, 26, 505−518.
(32) Cong, Y.; Yang, X.-G.; Lv, W.; Xue, Y. Prediction of Novel and Selective TNF-Alpha Converting Enzyme (TACE) Inhibitors and Characterization of Correlative Molecular Descriptors by Machine Learning Approaches. J. Mol. Graphics Modell. 2009, 28, 236−244.
(33) Liu, X. H.; Song, H. Y.; Zhang, J. X.; Han, B. C.; Wei, X. N.; Ma, X. H.; Cui, W. K.; Chen, Y. Z. Identifying Novel Type ZBGs and Nonhydroxamate HDAC Inhibitors Through a SVM Based Virtual Screening Approach. Mol. Inf. 2010, 29, 407−420.
(34) Tao, L.; Zhang, P.; Qin, C.; Chen, S. Y.; Zhang, C.; Chen, Z.; Zhu, F.; Yang, S. Y.; Wei, Y. Q.; Chen, Y. Z. Recent Progresses in the Exploration of Machine Learning Methods as in-Silico ADME Prediction Tools. Adv. Drug Delivery Rev. 2015, 86, 83−100.
(35) Wang, J.; Hou, T. Chapter 5: Recent Advances on in Silico ADME Modeling. Annu. Rep. Comput. Chem. 2009, 5, 101−127.
(36) Van de Waterbeemd, H.; Gifford, E. ADMET in Silico Modelling: Towards Prediction Paradise? Nat. Rev. Drug Discovery 2003, 2, 192−204.
(37) Raunio, H. In Silico Toxicology Non-Testing Methods. Front. Pharmacol. 2011, DOI: 10.3389/fphar.2011.00033.
(38) Li, A. P.; Segall, M. Early ADME/Tox Studies and in Silico Screening. Drug Discovery Today 2002, 7, 25−27.
(39) Hartung, T.; Hoffmann, S. Food for Thought ... On in Silico Methods in Toxicology. ALTEX 2009, 26, 155−166.
(40) Castillo-Garit, J. A.; Marrero-Ponce, Y.; Torrens, F. Classification Models to Predict Caco-2 Cell Using Atom-Based Stochastic and Non-Stochastic Linear Indices. J. Pharm. Sci. 2014, 97, 129−138.
(41) Geppert, H.; Vogt, M.; Bajorath, J. Current Trends in Ligand-Based Virtual Screening: Molecular Representations, Data Mining Methods, New Application Areas, and Performance Evaluation. J. Chem. Inf. Model. 2010, 50, 205−216.
(42) Yap, C. W. PaDEL-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints. J. Comput. Chem. 2011, 32, 1466−1474.
(43) Duan, J.; Dixon, S.; Lowrie, J.; Sherman, W. Analysis and Comparison of 2D Fingerprints: Insights into Database Screening Performance Using Eight Fingerprint Methods. J. Mol. Graphics Modell. 2010, 29, 157−170.
(44) Nisius, B.; Bajorath, J. Molecular Fingerprint Recombination: Generating Hybrid Fingerprints for Similarity Searching from Different Fingerprint Types. ChemMedChem 2009, 4, 1859−1863.
(45) Clark, R. D.; Patterson, D. E.; Soltanshahi, F.; Blake, J. F.; Matthew, J. B. Visualizing Substructural Fingerprints. J. Mol. Graphics Modell. 2000, 18, 527−532.
(46) Ewing, T.; Baber, J. C.; Feher, M. Novel 2D Fingerprints for Ligand-Based Virtual Screening. J. Chem. Inf. Model. 2006, 46, 2423−2431.
(47) Chen, B.; Harrison, R. F.; Papadatos, G.; Willett, P.; Wood, D. J.; Lewell, X. Q.; Greenidge, P.; Stiefl, N. Evaluation of Machine-Learning Methods for Ligand-Based Virtual Screening. J. Comput.-Aided Mol. Des. 2007, 21, 53−62.
(48) Yang, S. Y. Pharmacophore Modeling and Applications in Drug Discovery: Challenges and Recent Advances. Drug Discovery Today 2010, 15, 444−450.
(49) Sliwoski, G.; Kothiwale, S.; Meiler, J.; Lowe, E. W. Computational Methods in Drug Discovery. Pharmacol. Rev. 2014, 66, 334−395.
(50) Keseru, G. M.; Erlanson, D. A.; Ferenczy, G. G.; Hann, M. M.; Murray, C. W.; Pickett, S. D. Design Principles for Fragment Libraries − Maximizing the Value of Learnings from Pharma Fragment Based Drug Discovery (FBDD) Programs for Use in Academia. J. Med. Chem. 2016, 59, 8189−8206.
(51) Feyfant, E.; Cross, J. B.; Paris, K.; Tsao, D. H. H. Fragment-Based Drug Design. Methods Mol. Biol. 2011, 685, 241−252.
(52) Salum, L. B.; Andricopulo, A. D. Fragment-Based QSAR: Perspectives in Drug Design. Mol. Diversity 2009, 13, 277−285.
(53) Erlanson, D. A.; McDowell, R. S.; O’Brien, T. Fragment-Based Drug Discovery. J. Med. Chem. 2004, 47, 3463−3482.
(54) Hajduk, P. J. Fragment-Based Drug Design: How Big Is Too Big? J. Med. Chem. 2006, 49, 6972−6976.
(55) Hajduk, P. J.; Greer, J. A Decade of Fragment-Based Drug Design: Strategic Advances and Lessons Learned. Nat. Rev. Drug Discovery 2007, 6, 211−219.
(56) Murray, C. W.; Blundell, T. L. Structural Biology in Fragment-Based Drug Design. Curr. Opin. Struct. Biol. 2010, 20, 497−507.
(57) De Graaf, C.; Vischer, H. F.; De Kloe, G. E.; Kooistra, A. J.; Nijmeijer, S.; Kuijer, M.; Verheij, M. H. P.; England, P. J.; Van Muijlwijk-Koezen, J. E.; Leurs, R.; et al. Small and Colorful Stones Make Beautiful Mosaics: Fragment-Based Chemogenomics. Drug Discovery Today 2013, 18, 323−330.
(58) Irwin, J. J.; Shoichet, B. K. ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. J. Chem. Inf. Model. 2005, 45, 177−182.
(59) Baell, J. B.; Holloway, G. A. New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for their Exclusion in Bioassays. J. Med. Chem. 2010, 53, 2719−2740.
(60) Pearson, K. On Lines and Planes of Closest Fit to Systems of Points in Space. Philos. Mag. 1901, 2, 559−572.
(61) Van der Maaten, L. J. P.; Hinton, G. E. Visualizing High-Dimensional Data Using T-SNE. J. Mach. Learn. Res. 2008, 9, 2579−2605.
(62) Jordan, A. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. Adv. Neural Inf. Process. Syst. 2002, 14, 841.
(63) Lewis, D. D. Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval. In Machine learning: ECML-98; Nédellec, C., Rouveirol, C., Eds.; Springer: Berlin, 1998; pp 4−15.
(64) Zhang, H. The Optimality of Naïve Bayes. Proc. 17th International FLAIRS Conf. 2004, 3.
(65) Hogg, R.; McKean, J.; Craig, A. Introduction to Mathematical Statistics; Pearson Education, Inc.: NJ, 2014.
(66) Smolensky, P. Information Processing in Dynamical Systems: Foundations of Harmony Theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations; Rumelhart, D. E., McClelland, J. L., Corporate PDP Research Group, Eds.; MIT Press: Cambridge, MA, 1986; pp 194−281.
(67) Lang, K.; Baum, E. Query Learning Can Work Poorly When a Human Oracle Is Used. In IEEE International Joint Conference on Neural Networks; IEEE Press, 1992; pp 335−340.
(68) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5−32.
(69) Černý, V. Thermodynamical Approach to the Traveling Salesman Problem: An Efficient Simulation Algorithm. J. Optim. Theory Appl. 1985, 45, 41−51.
(70) Hastings, W. K. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 1970, 57, 97−109.
(71) Altman, N. S. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. Am. Stat. 1992, 46, 175−185.
(72) Harvey, J. A. Role of the Serotonin 5-HT2A Receptor in Learning. Learn. Mem. 2003, 10, 355−362.
(73) Celada, P.; Puig, M. V.; Amargós-Bosch, M.; Adell, A.; Artigas, F. The Therapeutic Role of 5-HT1A and 5-HT2A Receptors in Depression. J. Psychiatry Neurosci. 2004, 29, 252−265.
(74) Raote, I.; Bhattacharya, A.; Panicker, M. M. Serotonin 2A (5-HT2A) Receptor Function: Ligand-Dependent Mechanisms and Pathways. In Serotonin Receptors in Neurobiology; Chattopadhyay, A., Ed.; CRC Press: Boca Raton, FL, 2007; pp 1−17.
(75) Giorgetti, M.; Tecott, L. H. Contributions of 5-HT2C Receptors to Multiple Actions of Central Serotonin Systems. Eur. J. Pharmacol. 2004, 488, 1−9.
(76) Millan, M. J. Serotonin 5-HT2C Receptors as a Target for the Treatment of Depressive and Anxious States: Focus on Novel Therapeutic Strategies. Therapie 2005, 60, 441−460.
(77) Wood, M. Role of the 5-HT2C Receptor in Atypical Antipsychotics: Hero or Villain? Curr. Med. Chem.: Cent. Nerv. Syst. Agents 2005, 5, 63−66.
(78) Mitchell, E. S.; Neumaier, J. F. 5-HT6 Receptors: A Novel Target for Cognitive Enhancement. Pharmacol. Ther. 2005, 108, 320−333.
(79) Wesołowska, A. Potential Role of the 5-HT6 Receptor in Depression and Anxiety: An Overview of Preclinical Data. Pharmacol. Rep. 2010, 62, 564−577.
(80) Glennon, R. A. Higher-End Serotonin Receptors: 5-HT5, 5-HT6, and 5-HT7. J. Med. Chem. 2003, 46, 2795−2812.
(81) Gellynck, E.; Heyninck, K.; Andressen, K. W.; Haegeman, G.; Levy, F. O.; Vanhoenacker, P.; Van Craenenbroeck, K. The Serotonin 5-HT7 Receptors: Two Decades of Research. Exp. Brain Res. 2013, 230, 555−568.
(82) Ciranna, L.; Catania, M. V. 5-HT7 Receptors as Modulators of Neuronal Excitability, Synaptic Transmission and Plasticity: Physiological Role and Possible Implications in Autism Spectrum Disorders. Front. Cell. Neurosci. 2014, 8, 250.
(83) Rosenbaum, D. M.; Cherezov, V.; Hanson, M. A.; Rasmussen, S. G. F.; Thian, F. S.; Kobilka, T. S.; Choi, H.-J.; Yao, X.-J.; Weis, W. I.; Stevens, R. C.; et al. GPCR Engineering Yields High-Resolution Structural Insights into beta2-Adrenergic Receptor Function. Science 2007, 318, 1266−1273.
(84) McGraw, D. W.; Liggett, S. B. Molecular Mechanisms of beta2-Adrenergic Receptor Function and Regulation. Proc. Am. Thorac. Soc. 2005, 2, 292−296.
(85) Gondi, C. S.; Rao, J. S. Cathepsin B as a Cancer Target. Expert Opin. Ther. Targets 2013, 17, 281−291.
(86) Mort, J. S.; Buttle, D. J. Cathepsin B. Int. J. Biochem. Cell Biol. 1997, 29, 715−720.
(87) Missale, C.; Nash, S. R.; Robinson, S. W.; Jaber, M.; Caron, M. G. Dopamine Receptors: From Structure to Function. Physiol. Rev. 1998, 78, 189−225.
(88) Ma, Y. C.; Huang, J.; Ali, S.; Lowry, W.; Huang, X. Y. Src Tyrosine Kinase Is a Novel Direct Effector of G Proteins. Cell 2000, 102, 635−646.
(89) Roskoski, R. Src Protein-Tyrosine Kinase Structure and Regulation. Biochem. Biophys. Res. Commun. 2004, 324, 1155−1164.
(90) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40, D1100−D1107.
(91) Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G. Reoptimization of MDL Keys for Use in Drug Discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273−1280.
(92) Klekota, J.; Roth, F. P. Chemical Substructures That Enrich for Biological Activity. Bioinformatics 2008, 24, 2518−2525.
(93) O’Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open Babel: An Open Chemical Toolbox. J. Cheminf. 2011, 3, 33.
(94) Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273−297.
(95) Zhang, T. Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms. Proc. 21st Int. Conf. Mach. Learn. 2004, 6, 116.
(96) Gower, J. C. Properties of Euclidean and Non-Euclidean Distance Matrices. Linear Algebra Appl. 1985, 67, 81−97.
(97) Gower, J. C. Euclidean Distance Geometry. Math. Sci. 1982, 7, 1−14.
(98) Joarder, A. H.; Latif, R. M. Standard Deviation for Small Samples. Teach. Stat. 2006, 28, 40−43.
(99) Evans, B.; Rittle, K.; Bock, M.; DiPardo, R.; Freidinger, R.; Whitter, W.; Lundell, G.; Veber, D.; Anderson, P. Methods for Drug Discovery: Development of Potent, Selective, Orally Effective Cholecystokinin Antagonists. J. Med. Chem. 1988, 31, 2235−2246.
(100) Smusz, S.; Kurczab, R.; Satała, G.; Bojarski, A. J. Fingerprint-Based Consensus Virtual Screening towards Structurally New 5-HT6R Ligands. Bioorg. Med. Chem. Lett. 2015, 25, 1827−1830.
(101) Boda, K.; Seidel, T.; Gasteiger, J. Structure and Reaction Based Evaluation of Synthetic Accessibility. J. Comput.-Aided Mol. Des. 2007, 21, 311−325.

DOI: 10.1021/acs.jcim.6b00426 J. Chem. Inf. Model. 2017, 57, 133−147


(102) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825−2830.
(103) RDKit: Cheminformatics and Machine Learning Software, 2013; http://www.rdkit.org.
