FSees: Customized Enumeration of Chemical ... - ACS Publications

Sep 12, 2016 - very large number of molecules, our algorithm uses file-based data structures instead of memory-based ones, thus overcoming the limitat...
1 downloads 10 Views 2MB Size
Article pubs.acs.org/jcim

FSees: Customized Enumeration of Chemical Subspaces with Limited Main Memory Consumption Florian Lauck and Matthias Rarey* ZBH - Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany S Supporting Information *

ABSTRACT: In the search for new marketable drugs, new ideas are required constantly. Particularly with regard to challenging targets and previously patented chemical space, designing novel molecules is crucial. This demands efficient and innovative computational tools to generate libraries of promising molecules. Here we present an efficient method to generate such libraries by systematically enumerating all molecules in a specific chemical space. This space is defined by a fragment space and a set of user-defined physicochemical properties (e.g., molecular weight, tPSA, number of H-bond donors and acceptors, or predicted logP). In order to enumerate a very large number of molecules, our algorithm uses file-based data structures instead of memory-based ones, thus overcoming the limitations of computer main memory. The resulting chemical library can be used as a starting point for computational leadfinding technologies, like similarity searching, pharmacophore mapping, docking, or virtual screening. We applied the algorithm in different scenarios, thus creating numerous target-specific libraries. Furthermore, we generated a fragment space from all approved drugs in DrugBank and enumerated it with lead-like constraints, thus generating 0.5 billion molecules in the molecular weight range 250−350.



than a set of similar molecules.6 These evolutionary methods carry out modification on either the atomic level7−9 or the fragment level.10−15 Because of their stochastic nature, these algorithms cannot exhaustively search chemical space. Another class of algorithms systematically and exhaustively builds molecules from different input data. One uses known inhibitors in the same binding pocket to determine overlapping bonds, which are then used for recombination.16 Others use only the elemental composition of a molecule as a constraint.17,18 An ambitious enterprise related to this category is the Chemical Space Project,19 since this method starts by enumerating graphs with molecular topologies. The primary constraint for this enumeration protocol is an upper bound for the number of atoms per molecule, i.e., 11, 13, or 17. These sets were published as GDB-11,20 GDB-13,21 and GDB-17,22 respectively. The result is an impressive set of 1.6 × 1011 molecules that contains mostly new structures as well as new scaffolds. However, using this set as is for virtual screening purposes is not feasible. Although the GDB may be exhaustive for small molecules, most molecules of interest contain more than 17 atoms. Of the 1713 unique molecules in the “Approved Drugs” set in DrugBank23 (version 4.3, filtered with MONA24,25), 75% (1286) have more than 17 heavy atoms. A general problem with some of the methods mentioned above, especially the ones with atom-level modifications, is that not every computationally generated combination of atoms

INTRODUCTION Because of the enormous extent of the chemical universe, mining for novel molecules with desirable properties proves difficult. An often-quoted number for the size of the interesting chemical space (i.e., up to 30 atoms) is 1060.1 Walters et al.2 conclude that the “virtual chemistry space” may even contain 10100 molecules. The number of molecules in the “known drug space”, i.e., those molecules that may actually be an active substance with a desirable effect, is estimated as only 1.1−2.0 × 106.3 Irrespective of the correctness of these estimates, it is obvious that trying to synthesize all possible molecules in the laboratory to test them and find the desired ones is not feasible. As a result of combinatorial explosion, even fast brute-force computation quickly leads to very high computing demands. To avoid this, the common strategy is usually to limit the search space by applying physicochemical constraints. A number of methods for generating libraries of molecules, thus enumerating chemical subspaces of interest, have been published in the past.4,5 They utilize a range of different strategies concerning the underlying algorithm, the building blocks, and synthetic feasibility. The methods can be roughly categorized into evolutionary algorithms that apply modifications to input molecules, deterministic algorithms that work on fragments or graphs, and approaches that utilize chemical reactions to derive molecular modifications. Most evolutionary algorithms use a structurally motivated fitness function to produce molecules closely related to an input query, although there is one method that uses a diversity function to produce a “representative universal library” rather © XXXX American Chemical Society

Received: March 2, 2016

A

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

artificial atoms with a defined type, called link atoms or linkers. The connection rules determine the compatibility of such atoms. When two fragments are connected, the link atoms are removed and a bond in accordance with the connection rule is introduced. Because of its combinatorial nature, a fragment space can represent a virtually infinite number of molecules constituted by its fragments. Methods utilizing fragment spaces include similarity searching,37 substructure searching,40,41 structure-based molecular design,42 and scaffold replacement.43,44 Fragment spaces are capable of modeling both reaction-based strategies described earlier. The fragmentation approach is straightforward since the RECAP29 and BRICS30 rules can be applied directly to a set of molecules, thus yielding a set of fragments. The synthetic reaction approach requires more sophisticated preprocessing in order to translate a compound library into a fragment space.44,45 Since our method works on the fragment-space data structure, it is impartial to the way the fragment space was created. In a previous publication by our group, an initial strategy for enumerating fragment spaces was introduced.31 However, the number of enumerated molecules was limited by the size of the main memory of a computer. As a result, very stringent constraints were used during enumeration runs. On the basis of the NAOMI library,46 we designed a novel algorithm operating with constant main memory requirements in order to enumerate very large fragment spaces. This is done by introducing file-based data structures. Since access to files is much slower than access to memory, additional strategies to reduce the number of queries were implemented. One strategy is to avoid redundant molecules, which leads to reductions in runtime and overall resource requirements. As described later, redundancy cannot be completely avoided, but it can be detected. In the following, we describe the overall enumeration strategy, how redundant molecules are avoided, and how filebased data structures are implemented and utilized. Then we apply our method to different scenarios.

describes a synthesizable molecule. An increased chance of synthetic tractability is achieved by applying modifications based on chemical reactions. Furthermore, this approach has the advantage that the molecule created is accompanied by a proposal for a synthetic route. One strategy is to apply synthetic reactions directly to a molecule in order to modify it. Both stochastic14 and deterministic26−28 methods exist. Another strategy is to use reactions indirectly, i.e., by retrosynthetic fragmentation of known molecules.29,30 The resulting fragments are then recombined and assembled into new molecules.10,31 In this context, when fragments have previously been exhibited as part of a larger molecule, they represent reliable building blocks and valid substructures. In recent years, fragment-based methods have gained popularity in drug discovery and have been successfully applied.32−34 Once initial lead fragments have been identified, the question arises how fragments can be extended to highaffinity binders. Usually, certain physicochemical constraints such as molecular weight, number of hydrogen-bonding groups, or calculated logP should be obeyed to ensure druglikeness.35,36 Furthermore, certain synthetic routes, i.e., reaction schemes and reactant lists, are usually envisioned. Here we present a novel method for constraint-based enumeration of the chemical (sub)space described by a fragment space.37 The algorithm exhaustively and deterministically generates all of the molecules exhibiting user-defined physicochemical properties, as defined in lead-likeness38,39 or drug-likeness35 filters. Once enumerated, every computational lead-finding technology, such as similarity searching, pharmacophore mapping, docking, or virtual screening, can be applied to further reduce the set to interesting candidates. A fragment space37 is a combinatorial chemical space consisting of molecular fragments and connection rules (see Figure 1). Each fragment has at least one reactive site, which corresponds to an open valence. These sites are modeled as



METHODS Fragment Spaces. The fragment spaces used in our experiments were generated via retrosynthetic fragmentation according to the BRICS fragmentation protocol.30 A set of molecules is cut into fragments while linkers are annotated at the cut bond. Since linker types are compatible with several other types, a fragment can be recombined with many other fragments. A subsequent postprocessing step removes fragments containing more than 16 heavy atoms, containing elementary rings with more than eight heavy atoms, or matching a reactive or toxic group.47 The ease of use of this approach allows us to quickly create many fragment spaces from varying molecule sets specific to the target class examined. Enumeration Algorithm. The enumeration procedure, called FSees (fragment space exhaustive enumeration system), systematically combines all compatible fragments up to a userdefined maximal number. In contrast to combinatorial libraries, implementing the naive procedure for fragment spaces would lead to a lot of redundant molecules, which in turn would have to be filtered out after the enumeration. Therefore, redundancy is avoided during the computation wherever possible, and expensive calculations are carried out only after all of the other filters have been passed. In the following we will first discuss how properties are efficiently computed, including the underlying data structure.

Figure 1. Example of how omeprazol may have been assembled during enumeration. (a, b) The components of a fragment space: (a) fragments with linkers shaded in green; (b) connection rules as a compatibility matrix of linker types. The numbers in the matrix denote the bond types (single, double, or triple), and empty entries indicate that these link types are incompatible. (c) Fragment tree defining the topology of the represented molecule. (d) Completely assembled omeprazol with newly created bonds in green. The blue arrows show how the compatibility matrix is involved when two fragments are connected. B

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

redundancy checking. A single fragment tree represents a distinct molecule. If two trees are identical, they represent the same molecule and thus lead to the same molecules upon extension. This property is used to avoid unnecessary computation, as described below in Detecting Redundancy. Nonmonotonously additive properties are computed on the basis of the NAOMI molecule data structure. The available properties are clogP (calculated according to Wildman and Crippen48), number of rotatable bonds, number of stereocenters, and topological polar surface area (tPSA). They have in common that they take neighboring atoms with varying depth into account, making the contribution of a fragment dependent on neighboring fragments. Furthermore, these properties cannot be computed easily on a fragment tree since fragments are not explicitly connected and linkers might occur in direct proximity. Table 1 lists the currently implemented properties. In principle, any molecular property may be added provided that it is implemented in NAOMI.46 Enumerating Fragment Trees. The enumeration algorithm uses every fragment in the fragment space as a starting point. For each fragment, a fragment tree is constructed with the current fragment as the root. Each initial tree is added to the tree queue individually and only after the previous starting fragment has been completely explored. This queue stores all of the fragment trees that are ready for further extension (see the black arrow in Figure 2). At first it contains only a single tree.

Then we will give a detailed description of the different phases of the algorithm, i.e., how fragments are systematically extended. In this context, several redundancy prevention techniques are applied. Finally, we will discuss the property and redundancy tests in detail. Property Calculation. In the context of fragment-based enumeration, numerical properties can be divided into two categories, i.e., whether they are monotonously additive or not. A property X is monotonously additive if X can be calculated for a molecule by adding up the individual values of X for each fragment of which the molecule consists and the sign of X is constantly either nonpositive or non-negative. Since many of the physicochemical properties are monotonously additive (see Table 1), they can be calculated for each fragment individually Table 1. List of Properties That Can Be Used for Enumeration Non-physicochemical Properties fragments (size constraint) Monotonously Additive Properties molecular weight atoms non-hydrogen atoms rings non-hydrogen bonds H-bond donors H-bond acceptors Lipinski-style H-bond donors Lipinski-style H-bond acceptors tVolume49 Nonmonotonously Additive Properties clogP tPSA rotatable bonds stereocenters

rather than on the whole molecule. Subsequently, it suffices to iterate over the fragments of which a molecule consists to calculate them. Monotonicity enables the enumeration to be aborted early as soon as the property reaches a predefined limit. To compute nonmonotonously additive properties, an instance of the molecule is built by copying and combining fragments. Since molecule construction is a relatively expensive computational procedure, this must be avoided for as long as possible. Thus, we use a tree data structure called a f ragment tree to represent a molecule implicitly: a node of the tree represents a specific fragment, and an edge represents a bond between two fragments. A node has as many edges as linkers on the corresponding fragment (see Figure 1a). The linkers on the fragment as well as the edges on the node are numbered and sorted, so that an unambiguous assignment is possible. When two fragments are connected, the parent node stores a pointer to the child node and vice versa. In this way, a topological representation of the molecule is created. For each open linker, a terminal f ragment is stored in the tree. This ensures that the resulting molecule does not have open linkers, especially with respect to the property computation. Terminal fragments are defined by the fragment space for each link type. The fragment tree is key to the enumeration algorithm. Is it used not only for efficient computation of monotonously additive molecular properties and as a template for the construction of the corresponding molecule but also for

Figure 2. Overview of the enumeration algorithm. Solid boxes show the three nested loops. A circle with a number represents a fragment, while a circle with a letter represents a fragment tree from a previous iteration. Green lines depict open and compatible linkers. Initial fragment trees are added to the tree queue via the black arrow and new extended fragment trees via the white arrow.

Then the algorithm exhaustively explores all possibilities for extension within the given physicochemical constraints. The extension protocol then generates more trees, which are added to the tree queue, and finishes when no trees are left in the queue. Only then is the tree associated with the next starting fragment added and the extension protocol carried out again. The algorithm terminates when every starting fragment has been processed by the extension protocol. During the enumeration, a very large number of fragment trees is encountered. Holding them all in memory is not feasible. Therefore, the tree queue is implemented as a filebased data structure employing an SQLite database. It contains a single table with a single column storing a serialized version of a fragment tree. For efficiency, the database uses a buffer stored C

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Although these rules prevent identical trees from occurring to a large degree, their generation is still possible in cases where symmetric fragments are involved. If a tree represents a fragment with symmetrically arranged linkers with respect to its topology, the same molecule is created when a certain fragment is attached to either linker. This results in the generation of many redundant molecules each time such a symmetric fragment is encountered. The FSees algorithm considers redundancy caused by symmetric fragments for the first time (see Figure 3, III). For symmetric fragments with up to three linkers, the solution is the following:

in main memory. The tree queue is mainly used by the extension protocol. The extension protocol consists of three loops (see Figure 2). (1) The first loop iterates through all trees from the tree queue. For each tree, the algorithm then determines all open linkers. (2) The second loop iterates the open linkers and determines all compatible fragments for a given linker. (3) The third loop iterates these compatible fragments and attaches each of them as a new tree node to a copy of the current fragment tree. At this point, property filters and redundancy tests are carried out, which are described in the next section. If a fragment tree passes all of the tests, it is added to the tree queue, and the corresponding molecule is written to the output file. Subsequently, the tree is subject to further rounds of attachment and property checks until no further extension is possible. In this manner, all possible compositions of fragments are generated. In the following sections we will discuss in greater detail those parts of the algorithm where strategies for avoidance of redundancy and exhaustive use of main memory are implemented. This is the computation of compatible fragments in loop (2) and the property and duplicate tests in loop (3). A first simple strategy to avoid unnecessary computation is implemented in loop (3). Here the upper bound for molecular mass is used. Since fragments are sorted in ascending order with respect to their mass, this loop is aborted as soon as the first fragment that violates a potential upper bound criterion is attached. Avoiding Redundancy. There are three types of redundancy that can be avoided during the construction of a fragment tree (see Figure 3, I−III) by applying three simple rules listed in the

3. the additional fragment is attached only to the f irst open linker per iteration. By exclusion of the other linker positions, a certain molecule is created only once. This is done in each iteration independently, and thus, only the first of the remaining open linkers is considered during subsequent iterations. There are three cases: (1) the symmetric fragment is at the root of the tree, and no fragments have been attached yet; (2) the symmetric fragment is at the root of the tree, and one fragment has been attached; (3) the symmetric fragment is not at the root of the tree. Cases 2 and 3 are essentially the same. For example, consider a symmetrical fragment A at the root of the tree with three linkers. Attaching a fragment B to any of the linkers results in the same molecule since A is symmetrical (case 1). Therefore, only the first linker is considered in the first iteration, thus creating the following tree A(B, −, −). The children of a node are listed in brackets, and “−” depicts an unused linker. In the second iteration (case 2), another instance of fragment B can be attached to the first open linker thus yielding the following tree: A(B, B, −). The third position is blocked. In the next iteration, fragments can be attached only to the third position, yielding A(B, B, B). In this way, all of the linker positions are considered for extension without generating redundant molecules such as A(−, B, −) or A(B, −, B). Symmetric fragments with more than three symmetric linkers are not considered since these fragments are very rare and the resolution of these cases is cumbersome for combinatorial reasons. This is also true for cases of fragments that are symmetric but do not have the same link type (see Figure 4 a). If all link types on a symmetric fragment are compatible with a specific other link type, identical fragments will be attached. The remaining cases of redundancy cannot be resolved by simple tree-topological considerations (see Figure 4a−d). These occur when (a) symmetric fragments with more than three linkers are involved; (b) compatibility rules allow for a fragment to be attached at symmetrically arranged link positions of different types; (c) the connection of fragments results in a symmetric substructure; and (d) the connection of two or more fragments leads to the same substructure as one or several connected fragment(s). These cases can be detected by comparing either the fragments trees (a) or molecules (b−d). A straightforward approach to handle these cases would be a filtering step applied after the enumeration has ended. However, it is better to detect redundancy as early as possible to avoid unnecessary computation by eliminating structures that have occurred before. Detecting Redundancy. If it is not possible to avoid redundancy during construction of fragment trees and molecules, it is best to detect duplicate entities to prevent the repeated exploration of previously explored search space. The

Figure 3. Overview of the resolved types of redundancy I−III. On the left, fragment trees are shown. The green background depicts the fragment tree that occurs first and red the identical tree that occurs later during enumeration. The latter is prevented by the rule on the right. * marks symmetric fragments.

following. Redundancy in cases I and II results from the same connection of fragments but in different order. Simple order rules described earlier are sufficient to avoid these types of redundancy:31 A fragment can be attached only if 1. its ID is greater than or equal to the ID of the root fragment; and 2. its ID is greater than or equal to the ID of any already existing sibling fragment. D

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 4. Overview of the types of redundancy that cannot be detected on the basis of fragment tree topoplogy. (a) Examples of symmetric fragments with more than three linkers. (b) Fragment with different link types that is symmetric because of the compatibility rules depicted in the compatibility matrix. (c) Example of two fragments that result in a symmetric product. (d) Example of two fragments that, upon connection, have the same substructure as a third fragment.

redundancy tests are combined with the property computation for efficiency. Since the two filter mechanisms are independent, they are applied in the order of increasing computational complexity. In summary, the following cascade of tests is carried out in loop (3), after a fragment tree has been extended. First, monotonously additive properties are calculated on the fragment tree, which is subsequently checked for redundancy. Then the molecule is built and checked for redundancy. Finally, the nonmonotonously additive properties are computed. In the following, this process is described in detail; a depiction is shown in Figure 5. The most efficient order of filters was empirically tested. Calculating the nonmonotonously additive properties after the molecule redundancy check proved to be faster in most scenarios. Although the computation of tPSA and rotatable bonds is very quick, the limiting factor is the costly computation of clogP. Because of its frequent use in many constraint sets, e.g., lead-like or drug-like (see Table 2), our algorithm performs the redundancy check prior to nonmonotonously additive property filtering. In order to reliably check for redundant structures, the algorithm keeps track of all fragment tree and molecule instances that occur during the enumeration in an SQLite database. The database already provides very efficient duplicate checks, can hold large numbers of entries, and is file-based and thus largely main-memory-independent. Two instances are used, one for fragment trees and one for molecules. The tree database and the molecule database each contain one table with a single column that holds a string representation of either a fragment tree or a molecule. For the latter, this can be any unique linear descriptor, e.g., unique SMILES50 or InChi;51 we use an internal descriptor specific to the NAOMI library. For fragment trees, a custom unique linear tree descriptor is used. It is generated by traversing the tree in a depth-first manner. The DFS algorithm uses the initial numbering and sorting of open linkers on the fragment, which is also mapped on the corresponding tree node and its edges.

Figure 5. Flowchart of the property and redundancy tests carried out in loop (3). The part of the algorithm that works on the fragment tree is shaded in green and the part that works on the molecule data structure in blue. X denotes the rejection of a tree or a molecule.

The database uses a number of strategies to make access more efficient. It has a preceding Bloom filter,52 i.e., a probabilistic data structure that allows efficient checking of whether an element already exists in the database. A Bloom filter consists of a bit array and a set of k hashing functions. When a new element is added, k hash values are calculated from the new element, and the corresponding k bits are set in the bit array. A lookup operation calculates the same hashes and checks whether the corresponding bits are set. If not, the E

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

thresholds. In contrast to the monotonously additive properties, nonmonotonously additive values are either in range or out of range. The aforementioned tolerance is applied here as well, and previous validations are taken into account. If all of the valuesup to k tolerated deviationsare in range, the molecule is written to the output file as a unique SMILES.50 Estimating the computational complexity of this algorithm is difficult because of the number of variables that must be taken into account. It is not directly dependent on the number of fragments or the constraints but is strongly affected by the properties of fragments and connection rules. For instance, the number and type of linkers as well as multiple compatibilities affect the number of trees generated. In addition, the computational cost of the individual algorithms for computing nonmonotonously additive properties influences the runtime significantly. Selection of Input Fragments. The default mode of the enumeration algorithm simply uses all of the fragments as starting points. Sometimes enumerating all of the molecules represented by a fragment space is neither necessary nor desired. By manually selecting the input fragments, the user can further restrict the search space. Hence, all of the resulting molecules are guaranteed to contain exactly one of these fragments. This approach is particularly useful in cases where certain fragments are desired as scaffolds or were identified as starting fragments before. In this mode, the algorithm uses the user-specified fragments as starting points, thus excluding all of them from being attached again during the enumeration. Therefore, the resulting molecules only contain one of the initial fragments. There is no explicit protocol for building several preselected starting fragments into a single molecule. If this should be desired, the algorithm can be run individually for each starting fragment. In this case, a subsequent filtering step of redundant molecules is necessary. Molecular Similarity. The similarity of molecules was calculated on the basis of extended-connectivity fingerprints (ECFPs) using the Tanimoto coefficient. ECFPs were implemented in NAOMI according to the original publication by Rogers and Hahn.53 We used a diameter of four bonds (ECFP_4) for all of the calculations.

Table 2. Physicochemical Constraints Used for Enumeration Experiments

lead-like drug-like serine/ threonineprotein kinase

molecular weight

logP

Lipinskistyle Hbond donors

250−350 ≤500 250−500

≤3.5 ≤5.0 1.0−5.0

− ≤5 1−4

Lipinskistyle Hbond acceptors

rotatable bonds

− ≤10 5−9

≤7 − −

element is new and is added to the database. If all of the bits are set, the element either exists or might be a false positive match. In either case, the element has to be looked up in the database table. In addition to the Bloom filter, the database uses a buffer for writing operations so that the costly operation of rebuilding the database index does not have to be carried out every time an element is added. After a tree has been extended, the monotonously additive properties, e.g., molecular mass and number of hydrogen-bond donors, are compared to user-defined thresholds. In the case that more than a user-defined number k of properties is above the threshold, the tree is discarded. If fewer than k property values are above the threshold, the tree is checked for redundancy by means of the tree database. Therefore, the tree database is queried with the unique tree descriptor of the fragment tree. If it is a novel tree, it is added to the tree database in its linear form and the tree queue as a fragment tree. If more than k property values are below the lower threshold, the tree does not represent a valid molecule. However, adding more fragments may result in a valid molecule since only monotonously additive properties are considered here. Therefore, the tree is added to the tree queue for further extension and the tree database to prevent repeated computation. A fragment tree is added to the queue only if the number of fragments is still less than a user-defined threshold. Since terminal fragments have properties of their own, they affect the overall properties of the resulting molecule. For example, if a terminal fragment with a hydrogen-bond donor is attached to an open linker, the molecule gains a hydrogen-bond donor. If the number of hydrogen-bond donors now exceeds the upper bound, the molecule must be excluded. If this is detected in the previous step, the corresponding fragment tree will be excluded, thus preventing the extension protocol from attaching a fragment without a hydrogen-bond donor in one of the next iterations. Therefore, terminal fragments were previously not considered in order to avoid the exclusion of precursor molecules. At this point, terminal fragments are included in the calculation of monotonously additive properties. Thus, only trees that represent molecules within the userdefined property range are considered further. These trees are translated into molecules by converting their nodes into their corresponding fragments. For each edge between two tree nodes, a bond between the two corresponding fragments is introduced (see Figure 1). Once a valid molecule instance is created, it is used to compute the unique linear descriptor of the current molecule. The descriptor is then used to check whether the molecule is redundant by querying the molecule database. If it is unique, its descriptor is added to the molecule database. Then the nonmonotonously additive properties, e.g., clogP and number of stereocenters, are calculated. Their values are compared with the user-defined



RESULTS AND DISCUSSION Several enumeration experiments with various sizes of fragment spaces are presented here. The overall setup is similar. For each class of molecules, a fragment space was generated by retrosynthetic fragmentation of known active molecules. Then each fragment space was enumerated with specific physicochemical properties. Finally, the resulting molecules were subjected to an analysis step. For some experiments, leadlike38,39 and drug-like35 property sets were used as constraints (see Table 2). The lead-like chemical space is of particular interest in early stages of the drug development pipeline. Since it describes relatively small molecules, there is much room for the optimization of a structure. Drug-like constraints may be more interesting at later stages, for instance to find molecules with the same physicochemical properties but different scaffolds or decorations. Since drug-likeness filters tolerate larger molecules, they may lead to very large numbers of molecules. In practice, it is better to use constraints specific to the target to allow for a more tailored library, a manageable number of enumerated molecules, and a reasonable runtime. F

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 6. Distributions of results from the DUD-E experiment as box plots (outliers omitted). From left to right: number of bioactive molecules in ChEMBL sets, number of enumerated molecules, average rate of individual enumerations, computing time of individual enumerations, and percent of recovered molecules used for the construction of a fragment space.

molecule are not in the enumerated range. Fourth, molecules that are composed of more than the enumerated number of fragments cannot be found. Lastly, unsaturated linkers are terminated with terminal fragments as defined by the BRICS rules,30 which may be different from the natively attached fragment, thus leading to a different structure. The percentage of recovered molecules was calculated with respect to the number of “recoverable” molecules, i.e., molecules that were fragmented and are in the enumerated property range. After successful enumeration, the resulting molecules were compared to known inhibitors from ChEMBL55 (version 20). To retrieve the set, each target Protein Data Bank (PDB) ID was mapped to a ChEMBL target ID, and then all of the corresponding bioactive compounds were downloaded. For five targets a ChEMBL entry could not be found. Two more targets were excluded since they produced less than 10 molecules because of a very low variance in the chemical property space of the known actives. For the remaining 88 targets, the “known actives”, i.e., molecules used to build the fragment space, were removed from the ChEMBL set to prevent a bias toward identical molecules. First, each ChEMBL compound was compared to each enumerated molecule. In order to get a background distribution of similarity values, the ChEMBL molecules were then compared to the “drug-like clean” subsets of the ZINC database,56 i.e., 1.5 × 107 molecules (accessed on Sept 15, 2015). All of the calculations are based on ECFP_4 fingerprints compared using the Tanimoto similarity measure. The resulting distributions were compared to determine how the enumerated libraries compare to a library of available molecules. Figure 7 shows these similarity distributions for two targets, MAP kinase-activated protein kinase 2 (MAPK2) and cytochrome P450 2C9 (CP2C9). In the case of MAPK2, the distribution shows that there are more enumerated molecules with a higher similarity to ChEMBL-MAPK2 than molecules from the ZINC set. CP2C9 shows the opposite behavior. To assess whether the enumerated set or the ZINC set is more similar to the respective ChEMBL subset, a receiver operating characteristic (ROC) curve was computed on the basis of these similarity distributions (see Figure 7). For each distribution enumerated versus ChEMBL (EC) and ZINC versus ChEMBL (ZC)the following computation was carried out: For each bin, the numbers of molecules in the range from the current bin i to N (where N is the number of bins) were computed and stored as ECi and ZCi, respectively. Then, the area under the curve (AUC) was calculated by means of the following formula:

In the following, we use both general-purpose constraints as well as target-specific ones derived from the properties of known active molecules. A large-scale experiment based on the “Database of Useful Decoys: Enhanced” (DUD-E) is described.54 This data set contains known inhibitors for a diverse set of 102 protein targets. The memory consumption of a typical enumeration process is discussed on the basis of one of the DUD-E experiments. Then two experiments are discussed in which the enumeration focuses on particular fragments of interest, thus demonstrating an alternative enumeration strategy. This strategy is applied to generate possible kinase inhibitors and microtubule-destabilizing anticancer agents. Finally, a fragment space derived from approved drug molecules is enumerated, thus creating 0.5 billion lead-like molecules. DUD-E Data Set. DUD-E54 was originally developed to benchmark docking algorithms but also constitutes a valuable resource for studies where well-characterized bioactive molecules are required. For each of the 102 targets, the “known actives” set of molecules was fragmented, and each resulting fragment space was enumerated. We use the same descriptors as in the drug-like rule set for constraining the enumeration, i.e., molecular weight, number of H-bond donors and acceptors, and cLogP. The values for the lower and upper bounds of each property are the values of the lower and upper quartiles of the property distributions of input molecules (see Table S1 in the Supporting Information). In addition to physicochemical properties, two more constraints were applied in order to avoid the generation of enormous amounts of molecules: the maximal number of fragments was set to four, and the runtime was limited to 7 days (single core). The latter resulted in eight computations being terminated. The number of generated molecules up to this point was between 4.1 × 107 and 1.1 × 108. An overview of the 94 successfully finished enumerations is depicted in Figure 6, and a detailed list is shown in Table S2. Enumerated molecules were compared to the initial molecules used to build the fragment space in order to determine the percentage of recovered molecules as shown in Figure 6 and Table S2. Not all of the initial molecules were generated during enumeration. There are several reasons for this. First, certain molecules are not included in the fragment space since they could not be fragmented with BRICS rules.30 Second, molecules that are composed of fragments that were excluded by BRICS because of large rings or toxic groups30 cannot be generated. Third, physicochemical properties of the G

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

427 to 533 was used for the enumeration, while only 11 of the 64 ChEMBL TYSY actives are in this range. Most ChEMBL molecules are smaller and as a result have fewer features in the ECFP. A distinct pattern found in most actives is a ring system with two aromatic rings and a six-membered ring with one oxygen atom. The fragment space does not contain this fragment and therefore cannot produce molecules with this specific pattern, which also affects ECFP similarity. The reason for the bad performance of CP2C9 is not as clear. It is probably due to the nature of cytochrome P450 enzymes since they bind a variety of molecules as substrates, activators, and inhibitors. The ChEMBL CP2C9 set contains more than 21 000 molecules. This set covers a greater chemical space and therefore prohibits a clear signal. For the similarity range from 0.5 to 1.0 (see Figure 7) the number of molecules from the enumerated set is often distinctly higher than the number of ZINC molecules (see Figure S3). This is implicitly reflected in the AUC but should not be overlooked. These molecules not only have desired physicochemical properties but also have a topological similarity as defined by the ECFP. This implies that the enumerated set contains numerous promising molecules, which makes it an interesting input for techniques that require molecular libraries, such as virtual screening. Target-Specific Sets. In this section we describe two experiments focusing on two targets in greater detail. The experiment is very similar to the previous one, but the enumeration is focused on specific f ragments of interest. First, a list of known bioactive molecules was retrieved from ChEMBL55 and filtered for duplicates. Second, the unique set was subjected to BRICS fragmentation. After enumeration, the resulting molecules were subjected to further analysis. The first target was serine/threonine-protein kinase. All of the bioactive compounds for three related proteins, i.e., serine/ threonine-protein kinase 32A, 32B, and 32C, were retrieved (CHEMBL6150, CHEMBL5912, and CHEMBL5405, respectively). The 131 unique molecules were fragmented into 242 fragments. This experiment demonstrates how the enumeration algorithm is used to exhaustively enumerate all molecules with the same central fragment, i.e., the same core. We selected fragments F196 and F199 (see Figure 8) since each has at least two linkers, is relatively large and inflexible, and therefore constitutes a good scaffold fragment. We chose enumeration constraints specific to the target class on the basis of the properties of the known ChEMBL actives. Fragment F196 yielded 2 701 282 molecules and fragment F199 2 859 468 molecules. A summary of the results, including computing times, can be found in Figure 8. For validation purposes, we verified that all of the resulting molecules contained the initial

Figure 7. Two examples of results from the analysis of the DUD-E experiment showing favorable (MAPK2, left column) and unfavorable (CP2C9, right column) cases. (a−d) Plots showing the distributions of similarity values for ranges 0.0 to 1.0 (a, b) and 0.5 to 1.0 (c, d). The black lines show the similarity distributions of the enumerated molecules vs the ChEMBL set and the gray areas the similarity distributions of ZINC vs ChEMBL. (e, f) Plots showing the corresponding ROC curves for the similarity distributions and the AUCs as numeric values. TP denotes the case where more enumerated molecules have a higher similarity to ChEMBL and FP the case where more molecules from ZINC have a higher similarity to ChEMBL.

AUC = ∑N−1 i=0 ECi × |ZCi+1 − ZCi|. The AUCs for MAPK2 and CP2C9 are 0.91 and 0.39, respectively. For AUCs around 0.5, no clear distinction can be made. Therefore, in the following classification, an AUC of 0.5 ± 0.1 is considered to describe sets that are equally similar to the corresponding ChEMBL subset. In 78 cases the enumerated sets had an AUC above 0.6, while eight cases had an AUC of 0.5 ± 0.1 and two cases had an AUC below 0.4, i.e., thymidylate synthase (TYSY) and CP2C9 (see Figure S3 in the Supporting Information). The two main reasons for the weak performance of TYSY are the choice of enumeration parameters and the composition of the fragment space. A molecular weight range of

Figure 8. Depiction of the fragments of interest and results of the enumeration experiments for the serine/threonine-protein kinase and tubulin. The enumeration parameters are given in Table 2. H

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

were not part of the input set and the corresponding bioactive molecules from ChEMBL. Hamburg Enumerated Lead-like Set. In the final experiment, we used the “Approved Drugs” set of DrugBank23 (version 4.2) to create a fragment space. Hence, all of the resulting fragments were directly derived from actual drug molecules. For 528 molecules none of the BRICS fragmentation rules could be applied. The fragmentation of the remaining 1009 of 1537 molecules yielded 1214 fragments. We enumerated this space with lead-like constraints to generate a library of molecules that could act as starting points for new drugs. For a fragment space of this size, an extremely long runtime had to be expected. Therefore, we split up the problem by performing one enumeration per fragment. An additional constraint was the selection of starting fragments. Only fragments containing at least two linkers and a ring of size five or more were considered. Thus, 183 individual enumerations were carried out. The resulting molecules were combined by merging their unique SMILES representations.50 The enumerations took an accumulated computing time of 1441 CPU hours, and 503 974 653 unique molecules were generated, among which 48 814 (0.0097%) were found in the “lead-like” set of ZINC56 comprising 7 096 164 molecules (accessed on Sept 15, 2015). The same analysis was done for 12 790 627 unique and valid structures from SureChEMBL58 (accessed on Jan 27, 2016), yielding 21 825 (0.0043%) molecules. The full collection of 5 × 108 compounds, named the Hamburg Enumerated Lead-like Set (HELLS), is split into smaller subsets for convenient access. Memory Consumption. The behavior of the algorithm regarding memory consumption and storage utilization during a typical enumeration is shown in Figure 12. In this case, the enumeration of target ADRB2 of the DUD-E data set (see above) was carried out on an Intel i5-4570 CPU at 3.20 GHz with 16 GB of RAM (see the Supporting Information for more examples). The memory requirements show a distinct progression. The usage rises steeply at the beginning of the process and levels off at around 5.5 GB. This is due to the parameters used for the database cache (i.e., 1 GB) and the size of the Bloom filter (i.e., 1.25 GB). Three databases and two Bloom filters are used, resulting in a memory requirement of 5.5 GB. The remaining memory is used by other parts of the process, e.g., the fragment space, fragments, current instances of product trees and molecules, database buffers, etc. The algorithm uses each fragment as a starting point and empties the tree database after a fragment has been processed. New trees cannot contain fragments previously used as starting points, and therefore, trees from previous starting points are discarded to save memory and speed up the tree database. This can be seen in the progression of the tree database fill state. The content of the database grows until the fragment has been processed, then it drops to zero. The tree queue shows the opposite behavior; it is filled quickly with each new fragment and decreases when no new trees are generated. The number of generated molecules increases steadily, while the slope of the curve decreases slightly. The rate at which molecules are generated behaves complementarily; it is quite high at first but then decreases slowly. This happens because the longer the enumeration runs, the more time the redundancy checks take because of the growing molecule database. The memory-related parameters can be chosen at the beginning of the enumeration. A database cache size of 1 GB was used for all of the database instances in all of the

fragment and that all of the molecules exhibited the properties used during the enumeration. Then the ECFP Tanimoto similarities between the enumerated molecules and the initial ChEMBL actives were calculated. Figure 9 shows the similarity

Figure 9. Similarity distributions of enumerated molecules compared to bioactive serine/threonine-protein kinase binders from ChEMBL for different starting fragments. The results for fragment F196 are shown on the left and those for fragment F199 on the right. The first row shows similarity values from 0.0 to 1.0, and the second row shows the range from 0.5 to 1.0.

distributions for fragments F196 and F199. For fragment F196, there were 2296 molecules with similarity values greater than 0.5, and for fragment F199 there were 2463. These molecules have the same physicochemical properties as many of the initial actives and additionally a high topological similarity. Figure 10 a,b shows the five molecules with the highest similarity values and the corresponding bioactive molecules from ChEMBL for fragments F196 and F199, respectively. The second target was tubulin, more precisely α- and βtubulin heterodimers (PDB ID 4O2A), which form microtubules, a component of the cytoskeleton. The cytoskeleton is responsible for the shape of a cell, transport of material within a cell, and coordination of cell division. Therefore, microtubules represent a promising anticancer target. La Regina et al.57 recently published a number of microtubule-destabilizing anticancer agents with a distinct trimethoxyphenyl moiety. This experiment demonstrates how the enumeration algorithm can be used to generate a diverse set of molecules with a distinct functional group, in this case the trimethoxyphenyl moiety. This strategy may be useful when an initial binder has been identified by fragment screening and needs to be developed into a lead. For target ID CHEMBL3394, 360 unique bioactive compounds were retrieved from ChEMBL. After fragmentation, the resulting fragment space contained 165 fragments, three of which contained the trimethoxyphenyl pattern. These three fragments (F58, F61, and F64) were used as starting points for enumeration with drug-like properties. The resulting 575 864 molecules were then compared to the initial ChEMBL actives via Tanimoto similarity of ECFP. The similarity distribution is shown in Figure 11. A total of 37 488 molecules with Tanimoto similarity greater than or equal to 0.5 were found (6624 ≥ 0.6, 995 ≥ 0.7, 136 ≥ 0.8, 67 ≥ 0.9). There were even 30 molecules with a similarity of 1.0, which were either identical to the initial ChEMBL inhibitors or composed of the same fragments as those. Figure 10c shows the five enumerated molecules with the highest similarity value that I

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

Figure 10. Five highest-scoring enumerated molecules that are not identical to input molecules for (a) serine/threonine-protein kinase fragment F196, (b) fragment F199, and (c) tubulin. The top row of each panel shows the enumerated molecules, and the bottom row shows the corresponding bioactive molecules from ChEMBL. The numbers indicate the Tanimoto similarities of ECFP_4 between the enumerated molecules and the CHEMBL molecules below. Molecule depictions were created with MONA.24,25

experiments. The size of the Bloom filter was set to 4 × 1010 bits (5 GB) for the DUD-E and HELLS enumerations and to 1010 bits (1.25 GB) for target-specific experiments.



SUMMARY AND CONCLUSION We have presented FSees, an algorithm for the systematic and exhaustive enumeration of fragment spaces. In contrast to combinatorial libraries, the enumeration of fragment space is challenging because of the handling of redundancy. Since fragment spaces describe an infinite number of molecules, the enumeration is limited to molecules of interest, i.e., molecules

Figure 11. Similarity distribution of enumerated molecules with a trimethoxyphenyl moiety compared to bioactive α- and β-tubulin binders from ChEMBL. The similarity range from 0.0 to 1.0 is shown on the left and the range from 0.5 to 1.0 on the right. J

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

4c). Furthermore, additional optional rules for the enumeration could be added, such as a rule that the same fragment may not be connected to itself or a rule that a molecule may not contain the same fragment more than n times (see ENUM299192 in Figure 10c, right). For the experimental validation, 105 fragment spaces were generated from active molecules. These spaces were then enumerated using different sets of constraints. In this way, a large scale experiment with 102 inhibitor classes was carried out. It was demonstrated that the enumeration yields molecules more closely related to known inhibitors than a generalpurpose database and therefore is a source of promising lead structures. An alternative enumeration mode was used in two experiments, demonstrating the usefulness of the algorithm in a fragment-based design context. First, a kinase inhibitor fragment space was enumerated with preselected cores, thus generating molecules with similar scaffolds. Second, a fragment space around an initial active trimethoxyphenyl fragment was enumerated. Finally, a fragment space built from approved drug molecules was enumerated, resulting in 0.5 billion lead-like molecules. This collection of mostly novel compounds is an interesting starting point for computational lead discovery projects.



ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.6b00117. List of enumeration parameters and results for all 102 DUD-E targets, similarity distributions and ROC curves for 88 DUD-E targets, and memory consumption graphs for two enumerations (PDF) The HELLS collection of 0.5 billion compounds created by enumerating the DrugBank space with lead-like constraints is available from http://www.zbh.uni-hamburg.de/hells. The software tool FSees is available from http://www.zbh.unihamburg.de/fsees. The tool is free for evaluation purposes and academic use.

Figure 12. Progressions of different resources during the enumeration process of target ADRB2 of the DUD-E data set. From top to bottom: memory consumption, tree database, tree queue, molecule database, and rate of generated molecules.

with a certain physicochemical profile. Reasonably set constraints reduce runtime significantly and circumvent a subsequent filtering step. In addition, efficient redundancy checks have been implemented to prevent identical molecules from occurring. Most importantly, the implementation uses filebased data structures so that the algorithm is not limited by main memory. The fragment spaces used in the experiments presented here were generated by fragmentation of known active molecules. This fragmentation approach allowed us to generate many tailored fragment spaces demonstrating the function of the algorithm. However, not all molecules retrieved from retrosynthetic fragment spaces may be synthesizable. A more promising course of action is the utilization of fragment spaces generated from synthetic libraries.45 The enumerated molecules from these spaces come with a suggestion for the synthetic route, since specific synthetic reactions are used in building such spaces. An automated approach for creating such fragment spaces does not exist, to the best of our knowledge. However, a manually curated fragment space with 82 reactions is available for download.45 The algorithm works well in the described scenarios, but it could be improved when handling symmetry in more complex (and rare) cases. The most significant contribution would be the consideration of fragments with more than three linkers (see Figure 4 a). This would prevent redundant fragment trees from occurring, making the tree database obsolete. As a next step, one could evaluate whether the detection of symmetry in intermediate products leads to a justifiable speedup (see Figure



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Phone: +49 (0)40 42838 7350. Fax: +49 (0)40 42838 7352. Notes

The authors declare the following competing financial interest(s): M.R. declares a potential financial interest in the event that the FSees software is licensed for a fee to nonacademic institutions in the future.



ACKNOWLEDGMENTS The authors thank Matthias Hilbig for the development of MONA and Eva Nittinger and Kai Sommer for their valuable input on the manuscript.



REFERENCES

(1) Bohacek, R. S.; McMartin, C.; Guida, W. C. The Art and Practice of Structure-Based Drug Design: A Molecular Modeling Perspective. Med. Res. Rev. 1996, 16, 3−50. (2) Walters, W. P.; Stahl, M. T.; Murcko, M. A. Virtual Screening-an Overview. Drug Discovery Today 1998, 3, 160−178.

K

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling (3) Drew, K. L. M.; Baiman, H.; Khwaounjoo, P.; Yu, B.; Reynisson, J. Size Estimation of Chemical Space: How big is It? J. Pharm. Pharmacol. 2012, 64, 490−495. (4) Lauck, F.; Rarey, M. De Novo Molecular Design; Wiley-VCH: Weinheim, Germany, 2013; pp 325−347. (5) Reymond, J.-L.; Ruddigkeit, L.; Blum, L.; van Deursen, R. The Enumeration of Chemical Space. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2012, 2, 717−733. (6) Virshup, A. M.; Contreras-García, J.; Wipf, P.; Yang, W.; Beratan, D. N. Stochastic Voyages into Uncharted Chemical Space Produce a Representative Library of All Possible Drug-Like Compounds. J. Am. Chem. Soc. 2013, 135, 7296−7303. (7) Lameijer, E.-W.; Kok, J. N.; Bäck, T.; Ijzerman, A. P. The Molecule Evoluator. an Interactive Evolutionary Algorithm for the Design of Drug-like Molecules. J. Chem. Inf. Model. 2006, 46, 545− 552. (8) Yu, M. J. Natural Product-like Virtual Libraries: Recursive Atombased Enumeration. J. Chem. Inf. Model. 2011, 51, 541−557. (9) Hu, X.; Beratan, D. N.; Yang, W. A Gradient-directed Monte Carlo Approach to Molecular Design. J. Chem. Phys. 2008, 129, 064102. (10) Schneider, G.; Lee, M.-L.; Stahl, M.; Schneider, P. De Novo Design of Molecular Architectures by Evolutionary Assembly of Drugderived Building Blocks. J. Comput.-Aided Mol. Des. 2000, 14, 487− 494. (11) Douguet, D.; Thoreau, E.; Grassy, G. A Genetic Algorithm for the Automated Generation of Small Organic Molecules: Drug Design Using an Evolutionary Algorithm. J. Comput.-Aided Mol. Des. 2000, 14, 449−466. (12) Pegg, S. C.-H.; Haresco, J. J.; Kuntz, I. D. A Genetic Algorithm for Structure-based de Novo Design. J. Comput.-Aided Mol. Des. 2001, 15, 911−933. (13) Gillet, V. J.; Willett, P.; Fleming, P. J.; Green, D. V. S. Designing Focused Libraries Using MoSELECT. J. Mol. Graphics Modell. 2002, 20, 491−498. (14) Vinkers, H. M.; de Jonge, M. R.; Daeyaert, F. F. D.; Heeres, J.; Koymans, L. M. H.; van Lenthe, J. H.; Lewi, P. J.; Timmerman, H.; Van Aken, K.; Janssen, P. A. J. SYNOPSIS: SYNthesize and OPtimize System in Silico. J. Med. Chem. 2003, 46, 2765−2773. (15) Fechner, U.; Schneider, G. Flux (1): A Virtual Synthesis Scheme for Fragment-Based de Novo Design. J. Chem. Inf. Model. 2006, 46, 699−707. (16) Pierce, A. C.; Rao, G.; Bemis, G. W. BREED: Generating Novel Inhibitors through Hybridization of Known Ligands. Application to CDK2, P38, and HIV Protease. J. Med. Chem. 2004, 47, 2768−2775. (17) Faulon, J. L.; Churchwell, C. J.; Visco, D. P. The Signature Molecular Descriptor. 2. Enumerating Molecules from Their Extended Valence Sequences. J. Chem. Inf. Model. 2003, 43, 721−734. (18) Peironcely, J. E.; Rojas-Chertó, M.; Fichera, D.; Reijmers, T.; Coulier, L.; Faulon, J.-L.; Hankemeier, T. OMG: Open Molecule Generator. J. Cheminf. 2012, 4, 21. (19) Reymond, J.-L. The Chemical Space Project. Acc. Chem. Res. 2015, 48, 722−730. (20) Fink, T.; Bruggesser, H.; Reymond, J.-L. Virtual Exploration of the Small-Molecule Chemical Universe below 160 Da. Angew. Chem., Int. Ed. 2005, 44, 1504−1508. (21) Blum, L. C.; Reymond, J.-L. 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. J. Am. Chem. Soc. 2009, 131, 8732−8733. (22) Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J.-L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864−2875. (23) Wishart, D. S.; Knox, C.; Guo, A. C.; Shrivastava, S.; Hassanali, M.; Stothard, P.; Chang, Z.; Woolsey, J. DrugBank: a Comprehensive Resource for in Silico Drug Discovery and Exploration. Nucleic Acids Res. 2006, 34, D668−72. (24) Hilbig, M.; Urbaczek, S.; Groth, I.; Heuser, S.; Rarey, M. MONA - Interactive Manipulation of Molecule Collections. J. Cheminf. 2013, 5, 38.

(25) Hilbig, M.; Rarey, M. MONA 2: A Light Cheminformatics Platform for Interactive Compound Library Processing. J. Chem. Inf. Model. 2015, 55, 2071−2078. (26) Patel, H.; Bodkin, M. J.; Chen, B.; Gillet, V. J. Knowledge-based Approach to de Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 2009, 49, 1163−1184. (27) Hartenfeller, M.; Zettl, H.; Walter, M.; Rupp, M.; Reisen, F.; Proschak, E.; Weggen, S.; Stark, H.; Schneider, G. DOGS: ReactionDriven de novo Design of Bioactive Compounds. PLoS Comput. Biol. 2012, 8, e1002380. (28) Durrant, J. D.; McCammon, J. A. AutoClickChem: Click Chemistry in Silico. PLoS Comput. Biol. 2012, 8, e1002397. (29) Lewell, X. Q.; Judd, D. B.; Watson, S. P.; Hann, M. M. RECAPRetrosynthetic Combinatorial Analysis Procedure: A Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry. J. Chem. Inf. Model. 1998, 38, 511−522. (30) Degen, J.; Wegscheid-Gerlach, C.; Zaliani, A.; Rarey, M. On the Art of Compiling and Using ’Drug-Like’ Chemical Fragment Spaces. ChemMedChem 2008, 3, 1503−1507. (31) Pärn, J.; Degen, J.; Rarey, M. Exploring Fragment Spaces Under Multiple Physicochemical Constraints. J. Comput.-Aided Mol. Des. 2007, 21, 327−340. (32) Schneider, G.; Fechner, U. Computer-Based De Novo Design of Drug-Like Molecules. Nat. Rev. Drug Discovery 2005, 4, 649−663. (33) Loving, K.; Alberts, I.; Sherman, W. Computational Approaches for Fragment-Based and De Novo Design. Curr. Top. Med. Chem. 2010, 10, 14−32. (34) Joseph-McCarthy, D.; Campbell, A. J.; Kern, G.; Moustakas, D. Fragment-Based Lead Discovery and Design. J. Chem. Inf. Model. 2014, 54, 693−704. (35) Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug Delivery Rev. 2001, 46, 3−26. (36) Young, R. J. Topics in Medicinal Chemistry; Springer: Berlin, 2014; pp 1−68. (37) Rarey, M.; Stahl, M. Similarity Searching in Large Combinatorial Chemistry Spaces. J. Comput.-Aided Mol. Des. 2001, 15, 497−520. (38) Teague, S. J.; Davis, A. M.; Leeson, P. D.; Oprea, T. The Design of Leadlike Combinatorial Libraries. Angew. Chem., Int. Ed. 1999, 38, 3743−3748. (39) Oprea, T. I.; Davis, A. M.; Teague, S. J.; Leeson, P. D. Is There a Difference between Leads and Drugs? A Historical Perspective. J. Chem. Inf. Model. 2001, 41, 1308−1315. (40) Ehrlich, H.-C.; Volkamer, A.; Rarey, M. Searching for Substructures in Fragment Spaces. J. Chem. Inf. Model. 2012, 52, 3181−3189. (41) Ehrlich, H.-C.; Henzler, A. M.; Rarey, M. Searching for Recursively Defined Generic Chemical Patterns in Nonenumerated Fragment Spaces. J. Chem. Inf. Model. 2013, 53, 1676−1688. (42) Degen, J.; Rarey, M. FlexNovo: Structure-Based Searching in Large Fragment Spaces. ChemMedChem 2006, 1, 854−868. (43) Maaß, P.; Schulz-Gasch, T.; Stahl, M.; Rarey, M. Recore: A Fast and Versatile Method for Scaffold Hopping Based on Small Molecule Crystal Structure Conformations. J. Chem. Inf. Model. 2007, 47, 390− 399. (44) Boehm, M.; Wu, T.-Y.; Claussen, H.; Lemmen, C. Similarity Searching and Scaffold Hopping in Synthetically Accessible Combinatorial Chemistry Spaces. J. Med. Chem. 2008, 51, 2468−2480. (45) BioSolve IT GmbH. FTrees-FS, KnowledgeSpace. http://www. biosolveit.de/FTrees-FS/ (accessed Jan 7, 2016). (46) Urbaczek, S.; Kolodzik, A.; Fischer, J. R.; Lippert, T.; Heuser, S.; Groth, I.; Schulz-Gasch, T.; Rarey, M. NAOMI: on the Almost Trivial Task of Reading Molecules from Different File Formats. J. Chem. Inf. Model. 2011, 51, 3199−3207. (47) Kazius, J.; McGuire, R.; Bursi, R. Derivation and Validation of Toxicophores for Mutagenicity Prediction. J. Med. Chem. 2005, 48, 312−320. L

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling (48) Wildman, S. A.; Crippen, G. M. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Model. 1999, 39, 868−873. (49) Rarey, M.; Dixon, J. S. Feature Trees: a new Molecular Similarity Measure Based on Tree Matching. J. Comput.-Aided Mol. Des. 1998, 12, 471−490. (50) Weininger, D.; Weininger, A.; Weininger, J. L. SMILES. 2. Algorithm for Generation of Unique SMILES Notation. J. Chem. Inf. Model. 1989, 29, 97−101. (51) Heller, S.; McNaught, A.; Stein, S.; Tchekhovskoi, D.; Pletnev, I. InChI - the Worldwide Chemical Structure Identifier Standard. J. Cheminf. 2013, 5, 7. (52) Bloom, B. H. Space/Time Trade-offs in Hash Coding With Allowable Errors. Commun. ACM 1970, 13, 422−426. (53) Rogers, D.; Hahn, M. Extended-connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754. (54) Mysinger, M. M.; Carchia, M.; Irwin, J.; Shoichet, B. K. Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem. 2012, 55, 6582−6594. (55) Bento, A. P.; Gaulton, A.; Hersey, A.; Bellis, L. J.; Chambers, J.; Davies, M.; Krüger, F. A.; Light, Y.; Mak, L.; McGlinchey, S.; Nowotka, M.; Papadatos, G.; Santos, R.; Overington, J. P. The ChEMBL Bioactivity Database: An Update. Nucleic Acids Res. 2014, 42, D1083−90. (56) Irwin, J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC: A Free Tool to Discover Chemistry for Biology. J. Chem. Inf. Model. 2012, 52, 1757−1768. (57) La Regina, G.; Bai, R.; Coluccia, A.; Famiglini, V.; Pelliccia, S.; Passacantilli, S.; Mazzoccoli, C.; Ruggieri, V.; Verrico, A.; Miele, A.; Monti, L.; Nalli, M.; Alfonsi, R.; Di Marcotullio, L.; Gulino, A.; Ricci, B.; Soriani, A.; Santoni, A.; Caraglia, M.; Porto, S.; Da Pozzo, E.; Martini, C.; Brancale, A.; Marinelli, L.; Novellino, E.; Vultaggio, S.; Varasi, M.; Mercurio, C.; Bigogno, C.; Dondio, G.; Hamel, E.; Lavia, P.; Silvestri, R. New Indole Tubulin Assembly Inhibitors Cause Stable Arrest of Mitotic Progression, Enhanced Stimulation of Natural Killer Cell Cytotoxic Activity, and Repression of Hedgehog-Dependent Cancer. J. Med. Chem. 2015, 58, 5789−5807. (58) Papadatos, G.; Davies, M.; Dedman, N.; Chambers, J.; Gaulton, A.; Siddle, J.; Koks, R.; Irvine, S. A.; Pettersson, J.; Goncharoff, N.; Hersey, A.; Overington, J. P. SureChEMBL: a Large-scale, Chemically Annotated Patent Document Database. Nucleic Acids Res. 2016, 44, D1220−8.

M

DOI: 10.1021/acs.jcim.6b00117 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX