Article pubs.acs.org/jcim

Mapping of Drug-like Chemical Universe with Reduced Complexity Molecular Frameworks

Aleksejs Kontijevskis*

Nuevolution A/S, Rønnegade 8, DK-2100 Copenhagen, Denmark

ABSTRACT: The emergence of the DNA-encoded chemical libraries (DEL) field in the past decade has attracted the attention of the pharmaceutical industry as a powerful mechanism for the discovery of novel drug-like hits for various biological targets. Nuevolution Chemetics technology enables DNA-encoded synthesis of billions of chemically diverse drug-like small molecule compounds, and the efficient screening and optimization of these, facilitating effective identification of drug candidates at an unprecedented speed and scale. Although many approaches have been developed by the cheminformatics community for the analysis and visualization of drug-like chemical space, most of them are restricted to the analysis of at most a few million compounds and cannot handle collections of 10^8−10^12 compounds typical for DELs. To address this big chemical data challenge, we developed the Reduced Complexity Molecular Frameworks (RCMF) methodology as an abstract and very general way of representing chemical structures. By further introducing RCMF descriptors, we constructed a global framework map of drug-like chemical space and demonstrated how chemical space occupied by multi-million-member drug-like Chemetics DNA-encoded libraries and virtual combinatorial libraries with >10^12 members could be analyzed and mapped without a need for library enumeration. We further validate the approach by performing RCMF-based searches in a drug-like chemical universe and mapping Chemetics library selection outputs for LSD1 targets on a global framework chemical space map.



Received: January 5, 2017. Published: March 28, 2017. © 2017 American Chemical Society

INTRODUCTION

Drug-like chemical space is the virtual space occupied by all chemically meaningful small drug-like molecules. Intelligent exploration, mapping, visualization, and navigation in this chemical universe are key steps for the discovery of novel drugs and tools for chemical biology. Drug-like chemical space is so vast that its complete coverage and analysis are beyond current technological capabilities, both practically (in terms of what could actually be synthesized) and computationally. Thus, the focus in industry has shifted from the chemical library size race to library quality. Using various cheminformatics methods, the industry is moving toward identifying areas where chemical space is not covered by library collections or is under-represented. Nowadays, there are millions of drug-like molecules recorded in public and corporate databases, and this number increases exponentially due to the introduction of parallel and combinatorial synthesis approaches and the emergence of DNA-encoded library technology.

To comprehend this huge array of chemical data, it needs to be represented in a human-understandable, yet information-rich, format. The growing amount of accumulated data in many other areas of science gets mapped in different ways. Universe and planet maps in astronomy and genome and protein maps of living organisms in biology, as well as GPS maps on smartphones, are just a few examples of everyday big data visualization systems. However, graphical depictions of the whole universe of drug-like chemically accessible small molecules are rare and largely incomplete.1,2 This is because chemical space is often defined by various sets of descriptors,3 leading to a major problem: the lack of space invariance.4,5 Diverse descriptor sets and different distance measures6 result in chemical spaces showing variable molecule distributions.7

In the descriptor-based chemical space, where each compound is represented by an N-dimensional descriptor set, the two most commonly used approaches for visualization are dimensionality reduction and similarity network graphs. Dimensionality reduction techniques allow one to reduce N-dimensional chemical space into a “compressed” latent space of 2 or 3 dimensions, e.g., by principal component analysis (PCA) and self-organizing Kohonen maps (SOM).8,9 In a graph-based chemical space, individual molecules, their scaffolds, or subscaffolds are shown as nodes. The nodes are then linked to each other by edges based either on structure decomposition rules or on an arbitrarily chosen similarity metric. This lets researchers construct various molecular or scaffold trees and networks in a hierarchical manner. An overview of published methods discussing these strategies is given in Table 1. The main disadvantages of most of the listed

DOI: 10.1021/acs.jcim.7b00006 J. Chem. Inf. Model. 2017, 57, 680−699

Table 1. Summary of Reports Describing Visualization and Analysis Techniques of Drug-Like Chemical Space

No. 1
Technique: Dimensionality reduction
Method: ChemGPS, ChemGPS-NP (PCA-based methods)
Short description: Chemical space maps built on a reference set of “satellite” compounds.
Pros: Invariant chemical space. Compound libraries could be intercompared.

No. 2
Technique: Dimensionality reduction
Method: Generative topographic mapping (GTM), extension of SOMs
Short description: Probabilistic extension of SOMs
Pros: Overcomes SOM limitations

No. 3
Technique: Dimensionality reduction
Method: PCA
Short description: GDB17. Exhaustive enumeration of all drug-like compounds with up to 17 heavy atoms. Mapping of GDB17 using 42 molecular quantum descriptors.
Pros: Huge chemical database of 166 billion small drug-like compounds. High chemical space coverage.

No. 4
Technique: Graph-based
Method: Scaffold trees, Scaffold Hunter
Short description: Hierarchical classification of Bemis−Murcko scaffolds (Scaffold trees). Scaffold Hunter is an interactive computer application for navigation of chemical space using scaffold trees, which annotates them by available bioactivity data.
Pros: Highly suitable for the analysis of small data sets with known biological data.

No. 5
Technique: Graph-based
Method: Chemical space networks
Short description: Coordinate-free representation of chemical space and SAR analysis
Pros: Clear view on SAR activity landscape for small data sets

No. 6
Technique: Scaffold maps
Method: Scaffold keys
Short description: The method uses simple topological parameters of scaffolds for intuitive ordering of scaffolds.
Pros: Scaffold similarity searches, scaffold bioisosteric replacements, scaffold hopping

No. 7
Technique: Feature Trees
Method: Feature Trees
Short description: Describe molecules by their major building blocks in a nonlinear fashion.
Pros: Searching in large combinatorial spaces

No. 8
Technique: Scaffold Topologies
Method: Reduction of molecular structures to connected graphs
Short description: Define scaffold topology as a connected graph with the minimal number of nodes and edges required to fully describe molecule ring structure.
Pros: Exhaustive enumeration of all scaffold topologies for compounds with up to eight rings

No. 9
Technique: Tree map of scaffolds
Method: Hierarchical classification of compounds based on their molecular scaffolds with various abstraction levels
Short description: The method describes additional levels in a hierarchy of topology-based scaffolds and visualizes them in a tree map, as implemented in the Scaffvis program, an interactive zoomable tree map which can analyze data sets with up to 10^5 molecules.

No. 10
Technique: Hierarchical topological trees
Method: Generation of virtual substructure templates
Short description: Automatically generates, analyses, groups, and visualizes all topologically unique chemical templates and their derivatives present in the molecular graphs by mapping specific topological classes and templates on the nodes of dynamic trees and typifying their substructures by a rule-based system for generating a hierarchically prioritized topological line code for templates.
Pros: ...ranging them in collections of dynamic trees

Cons listed for the previously published methods:
Enumeration of structures from any combinatorial libraries is a necessary requirement to use the approach. Not applicable for the analysis of ultralarge libraries (>10^6 compounds and beyond).
Enumeration of compounds from combinatorial libraries is required to analyze them. The method is useful for the analysis of data sets with 10^6 compounds.

No. 11
Technique: Reduced Complexity Molecular Frameworks
Method: RCMF types and properties for drug-like compounds, mono- and bi-functional reagents
Short description: Alternative topological way of representing chemical structures with new advanced features
Pros: Enumeration-free method suitable for fast analysis of DELs and large combinatorial libraries (>10^9 compounds and beyond). Covers entire drug-like and fragment chemical space. Allows fast comparison analysis and mapping of drug-like space occupied by DELs and combinatorial libraries varying in their setups. Enables efficient search in drug-like chemical universe covered by DELs. Offers intuitive clustering of scaffolds from medicinal chemists' point of view.
Cons: RCMF method might be too general when applied on a small set of very similar compounds.
Refs: This study


Figure 1. Nuevolution Chemetics platform. During the synthesis phase, a collection of drug-like molecules is synthesized as a mixture (hundreds of millions to billions and even up to trillions of diverse molecules). Each library molecule consists of a DNA sequence (code) and a linker, which allows the DNA code and the small molecule to be physically attached to each other. The DNA code serves as a “barcode” holding the information for the structure of the small molecule. During Nuevolution’s screening of a biological disease target, inactive compounds are eliminated and active compounds are isolated. The structures of the active compounds are then determined by sequencing of the DNA codes.

methods include the inability or very restricted capacity to analyze large chemical data sets (>10^6 compounds and beyond) and a need for enumeration of compounds present in real or virtual combinatorial libraries prior to the analysis.

The use of huge virtual combinatorial libraries has become routine in the pharmaceutical industry.34−36 Examples include the BICLAIM collection reported by Boehringer Ingelheim,34 PGVL from Pfizer,35 and the “Proximal Lilly Collection” from Eli Lilly.36 Although possible searching strategies within these very large virtual libraries are reported, the authors do not discuss the drug-likeness of theoretical molecules and do not attempt to map them.

To address the massiveness of the drug-like chemical universe and access its new uncharted areas in practical terms, a novel DNA-encoded library (DEL) technology has emerged and matured in recent years.37−49 DELs make it possible to access billions of compounds in a volume of less than 100 μL with only negligible protein consumption and a screening duration of 1−5 days. In a DEL, each structure is tagged with a DNA identification barcode. In the Chemetics platform developed by Nuevolution, small molecules are preformulated in a combinatorial manner on DNAs, and the final mixed small-molecule−DNA conjugates serve as the libraries ready for affinity screening.46,47,49 In our approach, DNAs are tagged on small molecules and serve as barcodes to record both the structural information on the small molecules and the library information (Figure 1). Repeated cycles via a well-established procedure, known as the “split-and-pool” synthesis strategy in combinatorial chemistry, ensure production of huge and diverse compound libraries. Simultaneous testing of thousands to millions of structurally related compounds (including stereoisomers and enantiomers) within each scaffold series provides “instant SAR databases” after each DEL selection campaign. This invaluable set of SAR information is then used for the design and improvement of hits by traditional medicinal chemistry optimization, design of focused DELs, and/or further rounds of DNA-encoded “affinity maturation”.

The productivity of any DEL eventually depends on library design (diversity and drug-likeness), synthesis quality, and how robust the screening process and the sequencing power are. The size of DELs could easily reach trillions of compounds by assembling building blocks (BBs) of four or more diversity positions; however, such a design would compromise the drug-likeness of library compounds (as each diversity position inevitably increases the overall average MW of the full-size molecules) and would be accompanied by an unfavorable decrease in library quality due to incomplete reactivity of BBs. Therefore, we put much emphasis on ensuring drug-likeness, chemical space diversity, and density covered by DELs, as well as the synthetic quality of DELs produced by Chemetics technology.

Analyses of chemical space coverage by various DELs have been reported in the scientific literature.50−52 Nevertheless, in the majority of the studies, the authors used rather small random subsets of DELs for enumeration and further diversity assessment, visualization, and mapping. Thus, an approach able to analyze, compare, and map the chemical space of billions of drug-like compounds and beyond (avoiding the enumeration step) in a time-efficient manner is clearly missing.

Here, we present a novel approach for the analysis and mapping of huge drug-like chemical space by introducing Reduced Complexity Molecular Frameworks (RCMF) as an abstract and very general way of representing chemical structures. Compared to previously published approaches (Table 1), RCMFs offer an alternative topological way of representing chemical structures with new advanced features. The development of the method was inspired by an unmet need to handle and analyze multi-million Chemetics DELs as well as selection outputs (typically 10^5−10^6 unique compounds) resulting from DEL screening experiments. We aimed to develop a method that would mimic a chemist’s way of thinking in terms of “common motifs” but would not be limited to scaffold recognition and would not lose a chemist’s understanding of large structural data sets in their manual analysis process. As we further demonstrate, the developed RCMF approach makes it possible, without enumeration, to quickly map libraries of 10^6−10^12 drug-like compounds at a desired resolution level and compare them on a global framework chemical space map, analyze DEL diversities in their design phase, perform searches, and analyze selection output data sets, to mention just a few applications.



MAIN IDEAS OF RCMF APPROACH

(1) Representation of DEL molecules in an abstract, but chemically meaningful, form by their reduced complexity molecular frameworks, including description of ring chemical types and sizes, sizes of the linkers connecting the rings, and angle information on how rings are interconnected.
(2) Exhaustive exploration of RCMF types for drug-like compounds with up to six rings.
(3) Introduction of RCMF types and descriptors for mono- and bi-functional reagents.
(4) Assessment of mono- and bi-functional reagent diversity on the RCMF level.
(5) Exhaustive exploration of all theoretically possible combinations of mono- and bi-functional reagents using their RCMF types for the construction of an efficient rule-based system which could determine RCMF type and RCMF descriptors for full-size compounds in DELs without a need for enumeration.
(6) Efficient pairwise comparison of RCMF descriptors for similarity assessment based on adjustable penalty score tables.
(7) Construction of invariant maps for drug-like chemical space using RCMFs and use of these maps for visualization and comparison of chemical space occupied by DELs and virtual combinatorial libraries.
(8) Efficient searching for query structure analogues in DNA-encoded and virtual combinatorial libraries based on RCMF descriptor identity or similarity.

CONCLUSIONS

A novel approach termed “reduced complexity molecular frameworks” was presented in this study. The main distinctive features of the developed method, including easy comparison of large chemical spaces occupied by DELs with different setups without a need for library enumeration, are helpful in the library design phase to optimize chemical space coverage not yet explored by already synthesized DELs. In addition, RCMFs could be used for efficient searching of large combinatorial spaces and analysis of selection outputs. Finally, the introduced RCMF descriptors represent a novel class of topology descriptors both for library BBs and full-size drug-like compounds. Automatic generation of RCMF descriptors for full-size drug-like compounds, which are not members of DELs or combinatorial libraries, is currently not available but is under active development. Another attractive feature of RCMFs is their generalization capacity, which may be useful when trying to avoid intellectual property issues by disclosing graphical depictions of RCMF types with their descriptors and not structures of individual compounds or their Murcko scaffolds. We hope that the application of the presented approach in various areas of cheminformatics will inspire efficient solutions to many current problems related to the analysis and handling of huge databases of chemical data and to searches in huge virtual chemical spaces in general.

Table 2. Multi-Trillion Virtual Combinatorial Dimer Libraries Constructed from Commercially Available ACD-Rich Reagent Groups

Library | Pos1 reagent group | Pos1 reagents | Pos1 core structures | Pos2 reagent group | Pos2 reagents | Pos2 core structures | Theoretical library size
Lib1 | Carboxylic acids | 1,037,815 | 1,048,555 | Primary and secondary amines | 3,805,469 | 3,955,564 | 4,147,626,410,020
Lib2 | Isocyanates | 7513 | 7513 | Primary and secondary amines | 3,805,469 | 3,955,564 | 29,718,152,332
Lib3 | Sulfonyl Cl/F | 33,909 | 33,909 | Primary and secondary amines | 3,805,469 | 3,955,564 | 134,129,219,676
Lib4 | Alkylating Cl/Br | 520,261 | 520,261 | Primary and secondary amines | 3,805,469 | 3,955,564 | 2,057,925,682,204
Lib5 | Electrophiles (NAS) | 134,847 | 142,022 | Primary and secondary amines | 3,805,469 | 3,955,564 | 561,777,110,408
Lib6 | Heteroaromatic ring as nucleophile | 293,359 | 296,346 | Electrophiles (NAS) | 134,847 | 142,024 | 42,088,244,304
Lib7 | Phenols | 286,425 | 303,320 | Electrophiles (NAS) | 134,847 | 142,024 | 43,078,719,680
Lib8 | Thiophenols | 34,141 | 34,208 | Electrophiles (NAS) | 134,847 | 142,024 | 4,858,356,992
Lib9 | Aldehydes | 169,160 | 169,160 | Primary amines | 1,888,353 | 1,905,088 | 322,264,686,080
Lib10 | Boronic acids | 103,835 | 103,835 | I/Br−Suzuki | 961,584 | 990,479 | 102,846,386,965
Total | | | | | | | 7,446,312,968,661

MATERIALS AND METHODS

ChEMBL Drug Set. The ChEMBL drug set (11,222 molecules) was downloaded from the ChEMBL database (version 16.10.15.51)53 and further filtered. Only drugs with 1−6 rings and MW < 800 Da were retained (7519 remained); these were further checked to contain only “drug-like” atoms such as C, O, N, S, H, F, Cl, or Br. Drugs with a 5- or 6-fused ring system, with very long and flexible side chains, or composed of only C and O atoms were also removed. Furthermore, sugar derivatives, drugs containing charged nitrogen in heterocycles, very “ugly” or “weird” drug molecules, complex natural product-like drugs, drugs with peroxy bridges, disulfide bonds, or multiple SO3 and SO4 groups, steroid-like drugs, small single-ring drugs with “weird” ring systems, drugs with alkylating reactive handles, and epoxide or aziridine ring-containing drugs were also filtered off. In addition, vitamin D analogues, prostaglandins, peptides, and macrocycle-containing drugs were removed. This resulted in a reduced prefiltered set of 5877 drugs with 1−6 rings, as shown in Supporting File S1.

PubChem Preprocessing. The PubChem database54 (ca. 89.1 million compounds) was downloaded and prefiltered as follows: only compounds with MW < 1000 Da (for the largest component) and fewer than 11 rings were retained. In addition, only compounds with C, N, O, S, P, H, and halogen atoms remained. Macrocycles, compounds with individual ring sizes greater than eight bonds, very complex bridged systems, and systems of five or more fused rings were further discarded, leaving 85.7 million PubChem compounds that were used in further analysis.

ACD Database Preprocessing and Virtual Dimer Libraries. The ACD database (May 2016 release) from BIOVIA55 was used as a source of commercially available


fragments for building virtual dimer libraries. The following properties were calculated for each ACD compound (for the largest component): MW, number of rings, and largest individual ring size, as well as a complete list of elements (for all components). In addition, a list of suppliers was extracted for each ACD database molecule. Reactive handles were examined in each fragment using an in-house developed “Tag generator” tool (see Supporting Methods). This tool checks each fragment for the presence of more than 50 different reactive handles and, if one is found, assigns a corresponding “reactive group tag”. The ACD database was further filtered based on criteria specified in the Supporting Methods. In this paper, we limited the number of reactive groups to the following 13 commercially available reagent-rich groups: carboxylic acids (1,037,815), isocyanates (7513), sulfonyl chlorides and sulfonyl fluorides (33,909), alkylating chlorides and bromides (520,261), electrophiles (134,847), heteroaromatic rings as nucleophiles (293,359), phenols (286,425), thiophenols (34,141), aldehydes (169,160), boronic acids (103,835), I/Br−Suzuki reagents (961,584), primary amines (1,888,353), and secondary amines (2,014,274). Finally, “core” structures were generated for all 13 ACD reagent groups using the in-house tool “Core structure generator” (see Supporting Methods for details), and these were further used to build 10 virtual multi-trillion dimer libraries termed Lib1−Lib10 (Table 2).

Nuevolution Chemetics DNA-Encoded Libraries. Four Chemetics DNA-encoded libraries with diverse property profiles (termed “Lia108”, “Lia122”, “Lia123”, and “Lia126”) have been used in this study to demonstrate the applicability of the RCMF approach in the DEL field. These DELs have been extensively screened in-house over time on a variety of targets, resulting in the discovery of multiple series of structurally diverse drug-like hits for the majority of the screened targets. Lia108, Lia123, and Lia126 DELs are dimer libraries, and Lia122 has a trimer library setup as shown in Figure 2.

Figure 2. Setup of Nuevolution Chemetics DNA-encoded libraries Lia108, Lia122, Lia123, and Lia126.

Reduced Complexity Molecular Frameworks. The Reduced Complexity Molecular Framework (RCMF) is defined as an abstract representation of a group of chemical structures sharing the same ring connectivity pattern, where different rings (type- and size-wise) and different linkers connecting the rings (also type- and size-wise) are abstracted as “circles” and single “lines”, respectively. In addition, ring connectivity angles are disregarded when visualizing RCMFs graphically and determining their types. The RCMF definition is somewhat similar to the Oprea scaffold topology30,31 definition; however, it is more general (see details and discussion further below).

Figure 3 demonstrates RCMFs for two drugs, Risperidone and Doxazosin. In this example, the two drugs have different Murcko scaffolds and different “carbon-atom-only” frameworks but the same RCMF. Information on individual ring sizes and types, linker lengths, and angles between the rings is not preserved when representing RCMFs graphically. In contrast to Oprea scaffold topologies,30,31 fused, spiro, and bridge-head ring systems are represented as two or more merged circles, and the order of rings is predefined by letter codes (from A to F) in each RCMF type. For example, rings A and B and rings D and E each represent a 2-fused ring system in both drugs (Figure 3). A complete list of graphical depictions of RCMF types with 1−6 rings considered in this study is provided in Supporting Figure S1.

RCMF Descriptors. In this study, we developed three main sets of RCMF descriptors applicable for each RCMF type, i.e., ring, linker, and angle descriptors, as well as a few special additional descriptors.

Ring Descriptors. All single ring types (ring size < 9) were grouped into 28 general ring groups, shown in Supporting Table S1. In this classification, a two-letter ring coding scheme was adopted to describe each ring group. According to this classification, Risperidone ring descriptors would be “5_type21;6AMM6AXX6B” or “5_type21;6BXX6AMM6A” depending on which ring is considered to be the first A ring. If the RCMF type is symmetrical, all possible symmetrical RCMF descriptor strings are generated. They are then sorted alphabetically, and the first on the list is used as a consensus RCMF descriptor string (all the other descriptor strings are kept as well). Because the Risperidone “5_type21;6AMM6AXX6B” string comes first in A−Z alphanumeric sorting, “6AMM” corresponds to the first 2-fused ring system, where the “6A” ring code stands for a “6-member aliphatic ring with heteroatoms” and “MM” codes for a “6-member aliphatic ring with 1 exocyclic double bond”. The middle “6A” corresponds to the piperidine ring C, whereas “XX6B” denotes the second, 1,2-benzoxazole fused ring system. In addition, all ring sizes are added to the RCMF descriptor string as a second field (“66656”): “5_type21;66656;6AMM6AXX6B”.

Linker Descriptors. Each linker connecting two rings is shown as a single line in the RCMF graphical representation. Linkers are described by their size, i.e., the number of bonds in them. Linker size is determined as it would appear in a carbon-atom-only framework, ignoring all side bonds outgoing from the main linker connecting two rings. RCMF linkers are labeled by specifying the rings which they connect. For example, one of the Risperidone linkers is abbreviated as “linkBC” and is three bonds long, which further extends the RCMF descriptor string: “5_type21;66656;6AMM6AXX6B;linkBC;3;linkCD;1”. If three arbitrary rings X, Y, and Z are connected to the same branching point (Br), linker sizes are determined to the branching point and abbreviated as “linkXtoBr”, “linkYtoBr”, and “linkZtoBr”, respectively. No linker descriptors are generated if four or more rings are connected to the same linker.

Angle Descriptors. RCMF angle descriptors in this study are extracted purely from the 2D structure representation of each compound and may not correspond directly to real angle values derived from a 3D compound model. The angle vectors are coded by specifying the three rings which form an angle. In the Risperidone example, angle vectors “ABC”, “BCD”, and “CDE” would be considered, where “ABC” codes for the angle between


Figure 3. Shown are chemical structures of Risperidone and Doxazocin drugs, their Murcko scaffolds, “carbon-atom-only”, and RCMFs.

1-ring compounds are only used for their classification based on ring type and size. The “7_rings” framework type includes all compounds with seven rings. These compounds can be compared and grouped together with 6-ring compounds if the “minus-one-ring” option is enabled (described in further sections). We do not consider compounds with eight or more rings to be “drug-like”, and thus, they are not assigned to any specific RCMF type and are not compared to 2−6 rings RCMF types. In addition, the “Adamantanes” RCMF type includes all compounds containing adamantane or “adamantane-like” motifs. This RCMF type is not compared to other RCMF types due to the very special 3D shape of their adamantane moiety. RCMF for Building Blocks. The concept of RCMFs, reflecting topology classes for full-size drug-like compounds, was further extended to cover library building blocks. In addition to the RCMF properties described above, RCMFs for BBs include extra information on positioning of their reactive handle(s) on the framework and thus are classified separately (Figure 5, Supporting Figures S3 and S4). Determination of RCMF type for BBs starts with analyzing their “core” structures (Figure 5). The term “core” structure is used throughout this paper to refer to fragments used in any DEL or combinatorial library, where each reacting handle is replaced by a “dummy” atom (see the Supporting Methods and Figure S5 for more details). In short, each “dummy” atom is converted to a 4-member ring with “dummy” atoms (Stage I). Next, all heavy atoms (except dummy atoms) are replaced with carbons and all bonds with single bonds followed by Murcko scaffold generation procedure (Stage II and III). “Dummy” rings are converted back to original “dummy” atoms (Stage IV). In case a “dummy” atom is a part of a ring, it gets “pushed” outside the ring by one bond. 
Finally, all 2+ bond linkers are reduced to single bond linkers, and all −CH2− groups in all rings are removed to keep minimal possible ring size (Stage V). At last, “dummy” atoms are

the plane of the fused ring system of A and B rings and an individual ring C. “CDE” encodes the angle between the plane of the fused ring system of D and E rings and an individual ring C, whereas “BCD” encodes the angle between individual rings B, C, and D. Rings forming the angles are always mentioned in alphabetical order. For the Risperidone angle, “ABC” will be set to 30°, angle “BCD” to 180° (para), and angle “CDE” to 72°. Thus, the Risperidone RCMF descriptor string becomes “5_type21;66656;6AMM6AXX6B;linkBC;3;linkCD;1;angleABC;30;angleBCD;180;angleCDE;72”. Some other examples demonstrating how angle descriptors are derived for individual rings and 2- and 3-fused ring systems are shown in Figure 4. No angle descriptors are generated for 4−6 fused ring systems. Angle values for bridged-head systems are assigned based on their presumed “3D” shape of the system. A special flag is added at the end of the RCMF descriptor string to indicate the presence of a spiro or bridge-head system. As Risperidone does not contain any of these systems, the “NONE” flag is added: “5_type21;66656;6AMM6AXX6B;linkBC;3;linkCD;1;angleABC;30;angleBCD;180; angleCDE;72;NONE”. All Chemetics compounds present in Nuevolution DELs are attached to a DNA tag via an oligo attachment linker. Information on an oligo-linker attachment point on a molecule could be very important to know when analyzing and comparing compounds appearing in selection outputs (especially for small 2−4 ring ligands). This is addressed by attaching an “Lr” atom to each graphical representation of RCMF type (Supporting Figure S2). However, this increases the number of possible “Lr”-handles containing RCMF types substantially, and thus, this option is currently enabled for 2−4 ring compounds only. Linker-free RCMF types only were further considered throughout this study. Special RCMF Types. Acyclic and 1-ring molecules are assigned to “Acyclic” and “1-ring” RCMF types, respectively. 
As they carry very little information on the RCMF level, they are not compared to any other RCMF types. RCMF descriptors for 686

DOI: 10.1021/acs.jcim.7b00006 J. Chem. Inf. Model. 2017, 57, 680−699

Article

Journal of Chemical Information and Modeling

Figure 4. Few examples demonstrating how angle descriptors are derived for individual rings and 2−3 ring fused systems.

KNIME workflow to determine linker sizes between the rings and reactive handle(s). Automatic fragment RCMF descriptor calculation protocols have been developed for all 0−3 ring capping and internal BB RCMF types as shown in Table 3 and Supporting Figures S3 and S4. For larger 4+ ring BBs, RCMF descriptors are extracted manually as the number of such fragments used in Nuevolution DELs is usually very small. In case the BB RCMF type is symmetrical, RCMF descriptor strings are generated for all symmetrical variants, and the top string on the sorted list is used as a consensus RCMF descriptor string for BB. In addition, “minus-one-ring” RCMF descriptors for fragments are generated for all BBs. This is done by virtually “opening” or “removing” one ring at a time from a BB followed by modification of its RCMF descriptor string accordingly. If the same ring could be “opened” in several ways, the largest possible part of the ring gets removed ensuring solidity of the remaining structure (Supporting Figure S7) unless equal

replaced with halogens (Stage VI), and InChIKeys are generated for this reduced structure (Stage VII). InChIKeys are then compared to the reference set of InChIKeys known for each specific fragment RCMF type (Stage VIII) as provided in Supporting File S2. The RCMF-type determination procedure for BBs is implemented in KNIME workflows.56−59 Next, RCMF descriptors for BBs are calculated for each determined BB RCMF type individually by a “split-and-analyse” principle, where each fragment is split into subfragments according to a specific set of rules. These rules include extraction of linkers and splitting fused ring systems into individual rings (preserving their aromaticity) to determine their type, etc. An example for determination of fragment RCMF descriptors is shown in Supporting Figure S6. In total, 12 KNIME workflows were developed to determine angles in various types of fused ring systems and individual rings, 15 KNIME workflows to split fragments into individual rings, one KNIME workflow to determine individual ring types, and one


Figure 5. Workflow for determination of RCMF types for building block “core” structures. See description of the workflow in the Materials and Methods section.

of possible BB RCMF combinations would reach ca. 19,500 in a linear trimer DEL setup for compounds with up to six rings. We took on the challenge of examining all these BB RCMF combinations and implemented them in an efficient rule-based algorithm written in Perl. The Perl code has a tree-like architecture to ensure fast access to any combination of BB RCMFs (see a flow diagram in Figure 6 and Supporting Figure S8). Therefore, determination of RCMF descriptors for full-size molecules in DELs and virtual combinatorial libraries is done based on known RCMF descriptors of library BBs. It takes less than 5 min to generate RCMF descriptors for a random set of one million DEL fragment combinations on an Intel Core i7 quad-core based desktop machine, and calculations could be easily parallelized and distributed over multiple processors for further performance increase. In addition, the algorithm generates “minus-one-ring” RCMF descriptor strings for each library compound using “minus-one-ring” RCMF descriptors for corresponding fragments. “Minus-one-ring” RCMF descriptor strings ensure that full-size compounds with various numbers of rings and RCMF types could be intercompared (Supporting Figure S6).

RCMF Analysis of Virtual Combinatorial Libraries. Core structures extracted from 13 ACD reagent groups were submitted for the RCMF descriptor generation procedure. Unique RCMF descriptor strings for ACD reagents were identified within each reagent group, and one representative reagent for each unique RCMF descriptor string was further used (Table 4). All combinations of representative reagents were generated and submitted for the RCMF descriptor generation procedure for full-size compounds from each virtual library Lib1−Lib10. For example, 188 thiophenols were selected for pos1 and 3511 NAS reagents for pos2 in Lib8 to form 660,068 dimers, which were subsequently analyzed to derive 646,666 unique RCMF descriptor strings (Table 4).

Table 3. Number of RCMF Types in Capping and Internal BBs^a

| No. of rings | No. of fragment RCMF types for mono-functional (capping) BBs | No. of fragment RCMF types for bi-functional (internal) BBs |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 1 | 2 |
| 2 | 4 | 14 |
| 3 | 19 | 52* |
| 4 | 53* | 94* |

^a All theoretical framework types are considered for 0−3 rings capping and 0−2 ring internal BBs. “*” indicates RCMF types for fragments where additional fragment RCMF types are possible. ACD database analysis suggests that the number of reagents in these “additional” fragment RCMF types will be very small.
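The counts in Table 3 can be combined to enumerate BB RCMF combinations for a given library setup. The Python sketch below is a minimal illustration, assuming that a dimer DEL compound is formed from one internal BB and one capping BB and that every pair whose total ring count does not exceed six is allowed. Under these assumptions it reproduces the 3190 combinations reported in the text for a dimer DEL setup; the function name `count_dimer_combinations` is hypothetical.

```python
# Number of fragment RCMF types per ring count, copied from Table 3.
CAPPING = {0: 1, 1: 1, 2: 4, 3: 19, 4: 53}
INTERNAL = {0: 1, 1: 2, 2: 14, 3: 52, 4: 94}


def count_dimer_combinations(internal, capping, max_rings=6):
    """Count internal x capping RCMF type pairs with at most max_rings rings."""
    return sum(
        n_int * n_cap
        for rings_int, n_int in internal.items()
        for rings_cap, n_cap in capping.items()
        if rings_int + rings_cap <= max_rings
    )


n_dimer_combinations = count_dimer_combinations(INTERNAL, CAPPING)
print(n_dimer_combinations)
```

The symmetric “cap−cap” and trimer cases depend on setup rules that are not spelled out in this section, so they are not attempted here.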

alternatives for the ring “opening” exist. In this scenario, both “opening” alternatives are employed. “Minus-one-ring” RCMF descriptor strings for fragments contain extra information regarding the removed ring, i.e., whether it was “fused” or “not-fused” and whether a ring was “opened” or “deleted”. Removal of any terminal ring is considered a “deletion” unless it has a reactive handle attached to it (therefore, Ring A is labeled as “opened” and not “deleted” as shown in Supporting Figure S7).

RCMF Descriptors for Full-Size Compounds. The number of theoretically possible RCMF types for capping and internal BBs with 0−3 rings is rather small (Table 3). Thus, all possible combinations of BB RCMFs could be considered for any arbitrary combinatorial library setup. For example, 3190 combinations of internal and capping BB RCMFs are possible for a dimer DEL setup if library compounds with 0−6 rings are considered (Figure 2 and Supporting Table S2). This number is further reduced to 649 BB RCMF combinations in a virtual “cap−cap” dimer library (due to setup symmetry). The number


Figure 6. RCMF approach flow diagram demonstrating key steps of the method.

These 646,666 unique RCMFs represent framework chemical space occupied by ca. 4.8 billion Lib8 dimers. RCMF Descriptors of ChEMBL Drugs and PubChem Compounds. RCMF types for ChEMBL drugs and PubChem compounds were calculated using a protocol outlined in Supporting Figure S9. In brief, Murcko scaffolds were generated for all compounds first. Next, all bonds in Murcko scaffolds were replaced with single bonds, and all heavy atoms were replaced with carbons. Murcko scaffolds were extracted again from “carbon-atom-only” compounds. All linkers connecting the rings were reduced to a single bond linker, and all −CH2−CH2− groups in the rings were reduced to a single −CH2− group following the InChIKey generation procedure. Finally, generated InChIKeys were compared to a predefined set of InChIKeys which corresponds to a specific

RCMF type (Supporting File S3). The protocol was implemented in the KNIME workflow.56−59 As the RCMF descriptor generation procedure for full-size compounds, which are not members of combinatorial libraries, is currently under development, RCMF descriptors were determined for ChEMBL drugs only (and only in a semiautomatic way). First, Murcko scaffolds were extracted from ChEMBL drugs, and 3432 unique Murcko scaffolds were identified. Scaffolds with 1−3 rings were converted to “core” structures by modifying a random scaffold hydrogen to a dummy “U” atom and were further treated as “reagents”. Their RCMF descriptors were generated using the same protocol for reagents as described above and converted to RCMF descriptors corresponding to full-size structures. Murcko scaffold InChIKeys for ChEMBL drugs with 4−6 rings were


Table 4. Estimation of Murcko Scaffolds and RCMFs in Virtual Dimer Libraries

| Library | No. of unique pos1 InChIKeys for their Murcko scaffolds with a “dummy” ring | No. of unique pos2 InChIKeys for their Murcko scaffolds with a “dummy” ring | Theoretical number of Murcko scaffolds in a library | No. of unique RCMF descriptor strings for pos1 fragments | No. of unique RCMF descriptor strings for pos2 fragments | Theoretical number of RCMF descriptor strings in a library | Real number of RCMF descriptor strings in a library |
|---|---|---|---|---|---|---|---|
| Lib1 | 99,960 | 244,904 | 24,480,603,840 | 24,161 | 68,361 | 1,651,670,121 | 1,350,614,414 |
| Lib2 | 1106 | 244,904 | 270,863,824 | 505 | 68,361 | 34,522,305 | 30,823,018 |
| Lib3 | 3098 | 244,904 | 758,712,592 | 1256 | 68,361 | 85,861,416 | 75,505,247 |
| Lib4 | 34,940 | 244,904 | 8,556,945,760 | 10,937 | 68,361 | 747,664,257 | 610,675,684 |
| Lib5 | 12,480 | 244,904 | 3,056,401,920 | 3509 | 68,361 | 239,878,749 | 233,295,552 |
| Lib6 | 29,915 | 12,480 | 373,339,200 | 5192 | 3511 | 18,229,112 | 17,958,020 |
| Lib7 | 18,091 | 12,480 | 225,775,680 | 5391 | 3511 | 18,927,801 | 18,227,950 |
| Lib8 | 385 | 12,480 | 4,804,800 | 188 | 3511 | 660,068 | 646,666 |
| Lib9 | 14,176 | 145,193 | 2,058,255,968 | 5407 | 41,520 | 224,498,640 | 199,026,312 |
| Lib10 | 5427 | 58,666 | 318,380,382 | 1571 | 14,430 | 22,669,530 | 21,555,378 |
| Total | | | 40,104,083,966 | | | 3,044,581,999 | |
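The theoretical columns of Table 4 follow from multiplying the number of unique entries in each library position, which is why no enumeration is needed. The short Python sketch below illustrates this with the library X example from the text (30 and 40 unique InChIKeys) and the Lib8 row of Table 4. The function name `theoretical_count` is hypothetical, and the final ratio simply compares the real and theoretical numbers of RCMF descriptor strings reported for Lib8.

```python
from math import prod


def theoretical_count(unique_per_position):
    """Upper bound on distinct products: product of unique entries per position."""
    return prod(unique_per_position)


# Library X example from the text: 30 pos1 and 40 pos2 unique InChIKeys.
library_x_scaffolds = theoretical_count([30, 40])

# Lib8 row of Table 4.
lib8_murcko_theoretical = theoretical_count([385, 12_480])
lib8_rcmf_theoretical = theoretical_count([188, 3_511])

# Fraction of the theoretical RCMF descriptor strings actually observed in Lib8.
lib8_rcmf_real = 646_666
lib8_fraction = lib8_rcmf_real / lib8_rcmf_theoretical
print(f"Lib8 real/theoretical RCMF strings: {lib8_fraction:.3f}")
```

The same function applies to trimer setups by passing three position counts.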

Table 5. Diversity and Library Setup for Nuevolution DELs Described in This Study

| Nuevolution DEL | Pos1 BBs | Pos2 BBs | Pos3 BBs | DEL size | Estimated number of Murcko scaffolds in DEL | Actual number of unique Murcko scaffolds in DEL | Unique RCMF descriptor strings in DEL | DEL compounds per Murcko scaffold | DEL compounds per RCMF descriptor string |
|---|---|---|---|---|---|---|---|---|---|
| Lia108 | 11,279 | 13,440 | − | 151,589,760 | 36,746,400 | 35,439,655 | 12,091,354 | 4.28 | 12.54 |
| Lia122 | 384 | 329 | 960 | 121,282,560 | 8,986,978 | 8,360,409 | 1,964,124 | 14.51 | 61.75 |
| Lia123 | 8807 | 14,496 | − | 127,666,272 | 13,888,200 | 12,956,245 | 3,798,315 | 9.85 | 33.61 |
| Lia126 | 5814 | 18,513 | − | 107,634,582 | 12,377,588 | 11,470,337 | 3,181,007 | 9.38 | 33.84 |
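The last two columns of Table 5 are simple ratios of DEL size to the number of unique Murcko scaffolds or RCMF descriptor strings, and the accuracy of the scaffold estimate is the ratio of actual to estimated scaffold counts. The following Python sketch recomputes these values from the table; the dictionary layout and the name `diversity_ratios` are illustrative choices only.

```python
# (DEL size, estimated scaffolds, actual scaffolds, unique RCMF strings), Table 5.
TABLE5 = {
    "Lia108": (151_589_760, 36_746_400, 35_439_655, 12_091_354),
    "Lia122": (121_282_560, 8_986_978, 8_360_409, 1_964_124),
    "Lia123": (127_666_272, 13_888_200, 12_956_245, 3_798_315),
    "Lia126": (107_634_582, 12_377_588, 11_470_337, 3_181_007),
}


def diversity_ratios(size, estimated, actual, rcmf_strings):
    """Return compounds per scaffold, compounds per RCMF string, estimate accuracy."""
    return size / actual, size / rcmf_strings, actual / estimated


ratios = {name: diversity_ratios(*row) for name, row in TABLE5.items()}
for name, (per_scaffold, per_rcmf, accuracy) in ratios.items():
    print(f"{name}: {per_scaffold:.2f} per scaffold, "
          f"{per_rcmf:.2f} per RCMF string, estimate accuracy {accuracy:.1%}")
```

A lower number of compounds per scaffold corresponds to higher structural diversity, which is the sense in which Lia108 is the most diverse of the four libraries.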

positions. For example, if there are 100 carboxylic acids in pos1 and 100 amines in pos2 in some library X, a theoretical library X size will be 10⁴ dimers. If there are 30 unique InChIKeys for a “dummy” ring containing scaffolds in pos1 and 40 unique InChIKeys for “dummy” ring containing scaffolds in pos2, a maximum theoretical number of Murcko scaffolds in library X will be 1200. However, identical scaffolds could often be formed due to library setup symmetry and other factors. In our experience the real number of Murcko scaffolds is usually in the range of 90−96% of the theoretical estimate in multi-million DELs (Table 5). The described scaffold estimation procedure is fast, does not require structure enumeration, provides an accurate estimate of Murcko scaffolds in a library, and is used in-house in the DEL design phase. The Murcko scaffold estimation protocol is implemented as a KNIME workflow.56−59

Comparison of RCMF Descriptor Strings. Comparison of compound calculated descriptors is a routine operation in cheminformatics, allowing one to assess similarity, diversity, drug-likeness, apply filtering criteria, etc. To extend the applicability of an RCMF approach, RCMF descriptor string comparison methods have been implemented in this study as well. In brief, a pair of RCMF descriptor strings sharing the same RCMF type is aligned, and RCMF descriptors are compared pairwise. In the case of a descriptor pair mismatch, a certain penalty score is assigned. To compare different RCMF types, the “minus-one-ring” option is used. All original and “minus-one-ring” RCMF descriptor strings are compared by RCMF type first. If types match, RCMF descriptor strings are compared pairwise and an additional penalty score is assigned for ring “removal”. Further details on the comparison protocol can be found in the Supporting Methods.

RCMF Descriptor-Based Clustering. 
Clustering of compounds based on the developed RCMF approach could be done in multiple ways using an overall weighted penalty score (described in Supporting Methods) as a distance or

compared to all Murcko scaffold InChIKeys extracted from four Nuevolution DELs for identity match, and RCMF descriptors were assigned accordingly. The remaining nonmatching drug scaffolds were inspected manually to derive their RCMF descriptors.

Drug-likeness of DELs and Virtual Dimer Libraries. Drug-likeness of Nuevolution DELs and nondesigned virtual dimer libraries was estimated based on rule-of-five (Ro5)60 compliance of library compounds and histograms of compound physicochemical properties such as MW, AlogP, HBA, HBD, RotB, and PSA. These properties were calculated based on a random subset of one million enumerated compounds from each library. Although the theoretical size of some virtual libraries exceeds trillions of compounds, a random subset of one million compounds is sufficient to estimate library drug-likeness with 99% confidence level and 0.2% confidence interval.37,61 The oligo-linker attachment motif was stripped off for all molecules in DELs before calculation of physicochemical properties. All properties were calculated in the Canvas program version 2.8 (Schrödinger).62 In addition, ring profiles for all libraries were estimated based on the number of rings in corresponding fragments in each library position.

Murcko Scaffold Analysis of DELs and Virtual Combinatorial Libraries. To estimate the number of Murcko scaffolds in each virtual library, Murcko scaffolds were extracted from the library fragments in a special way. First, reactive handles were converted to 4-member rings with “dummy” atoms, and Murcko scaffolds were generated. Next, InChIKeys were generated for all Murcko scaffolds with the “dummy” rings, and a subset of unique InChIKeys was identified. These unique InChIKeys correspond to all different substructures which reagents may “donate” to form different Murcko scaffolds in enumerated library molecules. 
Thus, a maximum theoretical number of Murcko scaffolds could be estimated by multiplying the number of unique InChIKeys in all library


or one Ro5 violation (mainly due to MW) as shown in Table 6 and Supporting Figures S10 and S11. Ro5 compliance is lower (ca. 63%) in nondesigned virtual dimer libraries as only MW and ring cutoffs were applied for ACD fragments (Table 6). The majority of dimers in virtual libraries have 1−5 rings as fragments with a maximum three rings were allowed in each position in virtual libraries setups. Similarly, greater than 98% of Lia123 and Lia126 compounds have 1−5 rings, whereas greater than 97% of Lia108 are composed of 1−6 rings dimers. Compounds with 2−6 rings make up greater than 94% of a trimer Lia122. Universe of RCMFs. Analysis of RCMF types in DELs, virtual combinatorial libraries, and PubChem led to a surprising finding that there is a relatively small number of RCMF types for “drug-like” compounds with up to six rings. We found that greater than 99% of all small “drug-like” molecules analyzed could be roughly grouped in only 452 RCMF types as graphically depicted in Supporting Figure S1. PubChem analysis revealed that ca. 80% of PubChem compounds belong to eight main RCMF classes, i.e., “2_type2” (27.1%), “1_type1” (15.5%), “3_type3” (12.5%), “3_type1” (10.7%), “2_type1” (4.7%), “4_type6” (3.8%), “4_type1” (3.5%), and acyclic (3.2%). Here, 95% of PubChem compounds can be represented by 27 RCMF types, whereas 99% of PubChem compounds could be assigned to just 80 RCMF types. This is consistent with the earlier PubChem analysis results using scaffold topologies as described by Oprea and co-workers.30,31 Further analysis of ChEMBL drugs (5069 drugs with 2−6 rings) demonstrated that 3391 unique Murcko scaffolds and 2559 unique RCMF descriptor strings could be extracted and grouped into 120 general RCMF types. We also found that there is a rather limited number of RCMF types for fragments, i.e., 25 RCMF types for capping BBs with up to three rings and 69 RCMF types for internal reagents with up to three rings (Supporting Figures S3 and S4). 
RCMF Chemical Space Maps. Diversity of DELs and combinatorial libraries could be assessed in many ways.37,50,52,65,66 Typically, a small fraction of the library is enumerated and used for assessment of overall library diversity. More seldom, an exhaustive enumeration of all library compounds is undertaken. In this work, we assessed diversity of multi-million DELs and multi-trillion virtual dimer libraries avoiding exhaustive library enumeration both on Murcko scaffold and RCMFs level. The more unique Murcko scaffolds are present in the library the higher is its structural diversity. Similarly, libraries with a high number of unique RCMF descriptor strings are more structurally diverse than libraries of the same size with lower number of unique RCMF descriptor strings. Table 5 demonstrates the estimated and real number of Murcko scaffolds in each Nuevolution DEL. A small ratio of DEL molecules vs the number of Murcko scaffolds in a DEL clearly demonstrates high structural diversity of Nuevolution libraries with Lia108 being the most structurally diverse. This ratio is higher in PubChem compounds with 1−6 rings (ca. 12.7). In contrast, the ratio of dimers vs their Murcko scaffolds in nondesigned virtual Lib1−Lib10 is 100−1000 suggesting high similarity of compounds and very dense coverage of the chemical space (Tables 2 and 4). RCMF diversity was assessed in DELs and virtual combinatorial libraries Lib1−Lib10. Compared to the number of Murcko scaffolds in each library, the number of unique RCMF descriptor strings was much lower demonstrating that RCMFs represent a higher level in structural classification

similarity measure between RCMF descriptor strings. In the current study, we implemented a “similarity network”-based clustering approach where all unique RCMF descriptor strings (and thus all compounds belonging to them) are considered as nodes. An edge is created between any two RCMF nodes if the overall weighted penalty score between them is less than a certain threshold. Once all edges are assigned to the pool of nodes, the resulting network is evaluated, and all individual subnetworks are found. Subnetwork connectivity patterns, size, and other parameters are evaluated according to preset rules (not disclosed). All qualifying subnetworks are ranked and assigned a cluster ID. Next, the overall weighted penalty score threshold is slightly increased, and the procedure is repeated. This gradual increase in threshold ensures a hierarchical tree-like clustering of RCMF clusters (and compounds belonging to them). This method groups together RCMF types which might not be comparable otherwise. For example, a node for a 4-ring RCMF may be connected to the nodes representing 3-ring and 5-ring RCMF types allowing them to belong to the same network at a certain threshold level. The RCMF clustering approach is routinely used at Nuevolution for the analysis of selection outputs resulting from DEL screens in the case of “flat” ranking (examples are not shown).

Mapping Chemical Space with RCMFs. Although the number of all possible theoretical RCMF descriptor strings might be huge, it is much lower than the estimated number 10⁶⁰ of drug-like molecules.63 For example, RCMF type “3_type1” corresponds to a 3-ring framework, where rings A and B are fused and connected to an individual ring C by a linker. The number of theoretical RCMF descriptor strings for this framework type (if we consider only 20 different ring types and only linkers of 1−10 bonds) would be 20 × 20 × 20 × 10 = 8 × 10⁴. 
This number could easily reach trillions for 6-ring RCMF types due to combinatorics and increasing topological complexity of 6-ring frameworks. Nevertheless, these estimates are much lower than 10⁶⁰, and the RCMF descriptor abstraction level could be adjusted. For example, decreasing the RCMF descriptor specification level might aid in keeping the number of possible RCMF clusters within a manageable range for human interpretation and visualization. This opens an attractive opportunity to arrange abstracted RCMF clusters on a predefined RCMF chemical space map. Libraries of compounds could be then projected on this map to assess their diversity, similarity, and space coverage. To achieve this, we adopted an RCMF descriptor complexity reduction scheme which is described in detail in the Supporting Methods. The constructed 2D RCMF map (306 × 100 cells) represents 30,585 abstracted RCMF clusters, with 27; 4158; 8398; 10,515; 7144; and 337 clusters for 1−6 ring compounds, respectively, plus six additional clusters (Supporting File S4). The developed 2D RCMF chemical space heatmaps were built using the ClustVis tool.64
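The descriptor comparison and “similarity network” clustering described in this section can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the penalty is an unweighted count of mismatching fields between two strings of the same RCMF type, strings of different types are treated as unconnected (the “minus-one-ring” comparison is not reproduced), and the undisclosed subnetwork ranking rules are omitted. The toy descriptor strings and the names `penalty`, `subnetworks`, and `hierarchical_clusters` are hypothetical.

```python
from itertools import combinations


def penalty(a: str, b: str, mismatch_cost: float = 1.0) -> float:
    """Field-by-field mismatch penalty for two strings of the same RCMF type."""
    fields_a, fields_b = a.split(";"), b.split(";")
    if fields_a[0] != fields_b[0] or len(fields_a) != len(fields_b):
        return float("inf")  # different types: no edge in this sketch
    return mismatch_cost * sum(x != y for x, y in zip(fields_a[1:], fields_b[1:]))


def subnetworks(nodes, threshold):
    """Connected components of the graph with an edge where penalty < threshold."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in combinations(nodes, 2):
        if penalty(a, b) < threshold:
            parent[find(a)] = find(b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=len, reverse=True)


def hierarchical_clusters(nodes, thresholds):
    """Repeat the subnetwork search while the threshold is gradually increased."""
    return {t: subnetworks(nodes, t) for t in sorted(thresholds)}


# Toy descriptor strings (placeholders, not real RCMF descriptors).
toy_nodes = [
    "2_type1;66;linkAB;1",
    "2_type1;66;linkAB;2",
    "2_type1;56;linkAB;4",
    "3_type1;666;linkBC;1",
]
levels = hierarchical_clusters(toy_nodes, thresholds=[1.5, 2.5])
```

Because components can only merge as the threshold grows, the successive levels form a tree, which matches the hierarchical behaviour described in the text.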



RESULTS

A novel RCMF approach for the analysis of small molecule “drug-like” chemical space is presented in this study (Figure 6). The developed RCMF approach was validated on four in-house DELs (>5 × 10⁸ compounds), 10 virtual combinatorial libraries (>7.4 × 10¹² virtual dimers), PubChem database (>8.9 × 10⁶ compounds), and ChEMBL drug set (ca. 6000 drugs). Drug-likeness assessment results demonstrated that greater than 83% of compounds in all three Chemetics dimer libraries are Ro5 compliant, whereas 86% of trimers in Lia122 have no


85.6% 11.9% 2.6%

83.9% 13.6% 2.5%

66.8% 26.1% 7.1%

63.3% 31.2% 5.5%

61.7% 29.8% 8.5%

58.8% 31.9% 9.3%

57.6% 32.1% 10.3%

69.3% 24.6% 6.2%

61.1% 29.5% 9.4%

47.9% 37.6% 14.6%

52.3% 37.3% 10.4%

66.3% 29.0% 4.7%

63.1% 28.8% 8.1%

hierarchy at their current most detailed description level. There are on average 2.9−4.3 Murcko scaffolds per RCMF descriptor string in DELs (Table 5) and 7−20 scaffolds per RCMF descriptor string in virtual Lib1−Lib10 (Tables 2 and 4). Nevertheless, the number of unique RCMF descriptor strings is still in the millions, which is too many to allow quick and interactive visualization of the libraries. Further, an RCMF descriptor generalization scheme was applied to build a uniform 2D RCMF chemical space map with 30,585 clusters. All libraries were further projected on this map, and each cluster was colored according to its occupancy by library compounds or Murcko scaffolds (Figure 7, Supporting Figures S12 and S13). RCMF maps clearly demonstrate the high structural and topological diversity of DEL compounds covering various areas of this global map (Figure 7 and Supporting Figure S12). In addition, created maps clearly show that different DELs occupy different clusters on the map highlighting high structural and topological diversity of compounds also across different DELs. The analysis also showed that trimer Lia122 occupies much smaller areas of RCMF space than dimer DELs (Supporting Figure S12, see panel D vs panels A, B, and C). Finally, we have combined all four DELs and mapped them all together. Figure 7 demonstrates that 18,661 clusters (>61%) are occupied by at least one compound from DELs used in this study. Absence of DEL compounds in ca. 2/3 of unoccupied clusters could be explained by the presence of 4−6 fused ring systems in their RCMF cluster definition, rare RCMF types where a linker is attached to an sp3 carbon connecting both rings in a fused ring system, as well as 4- and 5-ring framework types where all or all but one rings are nonaromatic. In addition, the majority of uncovered 2-ring framework clusters contain either 8-member or larger individual rings or rings with two exocyclic double bonds in ortho or para positions. 
Compounds belonging to these uncovered clusters are not of great interest from the drug discovery perspective. RCMF maps could also be used for quantitative diversity comparison across different libraries. This could be achieved by computing the ratio between the numbers of compounds from each library occupying each cluster. In this study, we compared the structural diversity of Lia108 to Lia126 (Figure 8) and trimer Lia122 to dimer Lia123 (Supporting Figure S14). The results indicate that Lia108 and Lia126 do occupy substantially different areas of RCMF maps, and their overlap is only partial. In contrast, Lia123 dominates RCMF map coverage, and Lia122 occupies only a few areas not addressed by Lia123. Similarly, the developed RCMF maps could be used to map Murcko scaffolds. To this end, we repeated the cartography process by projecting all extracted Murcko scaffolds for each DEL (Supporting Figure S15). Distribution of DEL Murcko scaffolds on the maps is very broad, and the most populated clusters on the maps are very different across libraries. Again, trimer Lia122 occupied a much narrower area compared to dimer DELs (5060 Lia122 clusters vs 14,000−14,500 dimer DEL clusters). This is primarily attributed to a smaller number of BBs used in the trimer library setup. Eventually, we also mapped virtual combinatorial Lib1−Lib10 libraries on a global RCMF map (Supporting Figure S13). Libraries Lib1 and Lib4 appear to cover almost the whole map (78.7% and 83.0% coverage, respectively), followed by Lib2, Lib3, Lib5, and Lib9. Virtual libraries Lib6, Lib7, Lib8, and Lib10 occupy only a few cluster areas on the RCMF map. Interestingly, there are a few cluster “lines” not occupied by any virtual library compounds. These clusters represent RCMF

83.9% 13.6% 2.5% 0 1 2

36.2% 50.5% 13.3%

0.39% 5.33% 24.02% 39.23% 25.27% 5.41% 0.35% 0.00% 0.07% 11.98% 42.11% 40.06% 5.66% 0.12% 0.09% 2.66% 19.69% 41.42% 29.72% 6.06% 0.36% 0.00% 0.00% 10.08% 41.34% 40.90% 7.67% 0.02% 0.00% 0.00% 15.75% 41.07% 33.58% 8.89% 0.71% 0.00% 0.00% 5.87% 30.48% 41.97% 19.11% 2.57% 0.00% 2.24% 18.78% 39.18% 30.47% 8.55% 0.78% 0.18% 0.31% 5.42% 5.01% 32.12% 24.76% 42.52% 40.05% 17.37% 24.63% 2.33% 4.96% 0.06% 0.28% “Rule-of-five” violations 0.38% 6.70% 32.74% 41.31% 16.57% 2.24% 0.06% 0.53% 6.35% 24.97% 38.53% 24.21% 5.09% 0.32% 0.22% 4.26% 22.39% 38.02% 26.38% 7.72% 0.95% 0.20% 3.21% 17.20% 35.90% 31.25% 10.59% 1.53% 0.07% 1.23% 7.27% 20.69% 31.29% 25.60% 10.95% 0 1 2 3 4 5 6

0.00% 0.08% 2.03% 11.99% 29.09% 32.89% 18.06%

Lib7 Lib6 Lib5 Lib4 Lib3 Lib2 Lib1 Lia126 Lia123 Lia122 Lia108 Rings

Table 6. Distribution of Rings and Ro5 Compliance of Compounds in Nuevolution DELs and 10 Virtual Dimer Libraries

Lib8

Lib9

Lib10

Lib1-Lib10


Figure 7. Projection of four Nuevolution DELs on 2D RCMF map. Cluster color indicates the number of individual DEL compounds according to spectrum. Red color indicates that there are more than 10,000 DEL compounds per cluster. Colorless clusters have no DEL representative compounds.

Figure 8. Comparing diversity of Nuevolution Lia108 and Lia126 DELs. RCMF clusters occupied mostly by Lia108 compounds are shown in red, whereas framework clusters occupied mainly by Lia126 compounds are shown in blue. Yellow cluster color indicates framework clusters occupied by both libraries in ca. equal amounts. Uncolored framework clusters indicate unoccupied clusters by both DELs.
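The cluster-wise comparison of two libraries shown in Figure 8 can be sketched in Python as follows. The text only states that a ratio between the numbers of compounds from each library in a cluster is computed, so the dominance cutoff used here (one library contributing at least twice as many compounds as the other) is an assumption chosen purely for illustration, as are the function name `classify_cluster` and the toy counts.

```python
from collections import Counter


def classify_cluster(n_a: int, n_b: int, dominance: float = 2.0) -> str:
    """Label a map cluster by which of two libraries dominates its occupancy."""
    if n_a == 0 and n_b == 0:
        return "unoccupied"   # uncolored in Figure 8
    if n_a >= dominance * n_b:
        return "mostly A"     # e.g. red, Lia108
    if n_b >= dominance * n_a:
        return "mostly B"     # e.g. blue, Lia126
    return "both"             # e.g. yellow, roughly equal amounts


# Toy per-cluster compound counts: cluster ID -> (library A, library B).
toy_map = {1: (12_000, 30), 2: (0, 4_500), 3: (900, 1_100), 4: (0, 0)}
labels = {cid: classify_cluster(a, b) for cid, (a, b) in toy_map.items()}
summary = Counter(labels.values())
```

Libraries of very different sizes would first need their counts normalized, for example to fractions of each library, before such a comparison is meaningful.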

found after PCR amplification and decoding. The higher the “count”, the higher the FC affinity for the target could be, although there is no direct linear relationship. FCs seen in both selections were considered further (807 FCs), where the highest cumulative count was 576 and the lowest was 7. The total sum of the counts for all 807 FCs was 23,403. RCMF descriptor strings were derived for all 807 FCs and further mapped on the RCMF map (Supporting Figure S17). Only the top 20 statistically significant clusters are shown on a 2D map (covering greater than 80% of the total count for all 807 FCs). Cluster significance was calculated based on the expected cluster count if 23,403 random Lia123 FCs with count 1 would be mapped on an RCMF map. The cluster was considered significant if the total count for all FCs belonging to the cluster was larger than the expected cluster count estimate by at least 2 standard deviations. Twenty FCs from eight different clusters were resynthesized in a free form and tested in LSD1 inhibition assays (Supporting Methods). Low nM inhibition activity toward LSD1 was observed for the hits in Clusters 1 and 12 (chemical structures are not disclosed, but their corresponding RCMF descriptor strings are shown in Supporting Figure S17, panel B).
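The cluster significance criterion above can be sketched in Python. The text does not state how the expected count and its standard deviation were obtained, so this example assumes a simple binomial model: each of the 23,403 random fragment combinations falls into a given cluster with a probability equal to the fraction of Lia123 that the cluster holds. The cluster fraction and observed counts below are invented for illustration, and the function names are hypothetical.

```python
import math


def expected_count_and_sd(total_count: int, cluster_fraction: float):
    """Mean and standard deviation of a cluster count under a binomial model."""
    mean = total_count * cluster_fraction
    sd = math.sqrt(total_count * cluster_fraction * (1.0 - cluster_fraction))
    return mean, sd


def is_significant(observed: int, total_count: int, cluster_fraction: float,
                   n_sd: float = 2.0) -> bool:
    """True if the observed count exceeds the expectation by at least n_sd SDs."""
    mean, sd = expected_count_and_sd(total_count, cluster_fraction)
    return observed - mean >= n_sd * sd


TOTAL_COUNT = 23_403    # sum of counts over the 807 FCs, from the text
toy_fraction = 0.01     # hypothetical: cluster holds 1% of Lia123 compounds
mean, sd = expected_count_and_sd(TOTAL_COUNT, toy_fraction)
enriched = is_significant(300, TOTAL_COUNT, toy_fraction)
not_enriched = is_significant(250, TOTAL_COUNT, toy_fraction)
```

A resampling estimate, in which random sets of fragment combinations are repeatedly mapped, would be an alternative way to obtain the same two quantities.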

types which cannot be accessed by dimer libraries by their design. For example, ring connectivity patterns for 5-ring compounds with RCMF types 2, 17−20, 33, 40, 49, 52, 83 and 6-ring compounds with RCMF types 4, 5, 18, 28, 36, 40, 41, 49, 51, 62, 63, 70, 92, 159, 160, 163, 189, 293, and 328 cannot be synthesized by combining BBs with a maximum of three rings in a dimer setup (Supporting Figure S1). Either 4-ring BBs should be allowed in the design or a trimer library setup should be used to address these RCMF types. This highlights a need to produce trimer DELs which offer access to new topological space which might not be fully accessed by dimer DELs. Finally, we mapped the set of ChEMBL drugs on an RCMF map as demonstrated in Supporting Figure S16. Here, 1709 clusters were occupied by at least one ChEMBL drug.

Analysis of DEL Selection Outputs Using RCMF Maps. The developed RCMF approach could also be used for the analysis of DEL selection outputs. To demonstrate this, we have used a joint data set of two Lia123 selection outputs on the Lysine Specific Demethylase-1 (LSD1)67 target. Fragment combinations (FCs) are usually ranked according to the so-called “count” (number of observations), which is an indicator for level of enrichment in each selection data set. This “count” parameter indicates the number of corresponding DNA tags


Figure 9. Example demonstrating searching techniques in large virtual combinatorial libraries of Lib1−Lib10 using RCMF descriptor strings. At the first stage, an RCMF descriptor string is generated for a query structure (anticancer drug Olaparib) as shown. At the second stage, an exact match of an Olaparib RCMF descriptor string is searched within Lib1−Lib10 RCMF descriptor strings. Only a small portion of the Lib1−Lib10 RCMF descriptor space needs to be processed as the exact string match is looked up only within “4_type6” RCMF-type descriptor strings. As the relationship between fragment combinations and their corresponding RCMF descriptor string is preserved for each virtual library Lib1−Lib10, ca. 263 K pos1 and pos2 building block (BB) combinations were retrieved and enumerated. Here, 415 Murcko scaffolds were extracted from ca. 263 K dimers, and 36 of them are shown in this figure demonstrating higher or lower similarity degree to the Olaparib scaffold.
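The two-stage lookup described in the Figure 9 caption, where the exact string match is looked up only within descriptor strings of the same RCMF type and the link to fragment combinations is preserved, can be sketched in Python with a nested dictionary. The descriptor strings and building block identifiers below are placeholders rather than real library data, and `build_index` and `exact_search` are hypothetical names.

```python
from collections import defaultdict


def rcmf_type(descriptor: str) -> str:
    """The RCMF type is taken to be the first semicolon-separated field."""
    return descriptor.split(";", 1)[0]


def build_index(records):
    """Index (descriptor, fragment combination) records by type, then by string."""
    index = defaultdict(lambda: defaultdict(list))
    for descriptor, combination in records:
        index[rcmf_type(descriptor)][descriptor].append(combination)
    return index


def exact_search(index, query: str):
    """Return fragment combinations whose descriptor string equals the query."""
    return index.get(rcmf_type(query), {}).get(query, [])


# Placeholder records: (descriptor string, (library, pos1 BB, pos2 BB)).
toy_records = [
    ("4_type6;6665;linkBC;2", ("Lib1", "bb_a", "bb_b")),
    ("4_type6;6665;linkBC;2", ("Lib5", "bb_c", "bb_d")),
    ("4_type6;6666;linkBC;2", ("Lib1", "bb_e", "bb_f")),
    ("3_type1;666;linkBC;1", ("Lib8", "bb_g", "bb_h")),
]
toy_index = build_index(toy_records)
hits = exact_search(toy_index, "4_type6;6665;linkBC;2")
misses = exact_search(toy_index, "5_type21;66656;linkBC;3")
```

Only the retrieved fragment combinations would then need to be enumerated and compared with fingerprint-based similarity measures.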


Figure 10. 2D RCMF heatmap based on Nuevolution Lia123 fragment RCMFs. The y-axis demonstrates clustering of 8803 Lia123 pos1 bifunctional fragments grouped according to their RCMF type. Pos1 fragments with one or two rings are further split into smaller RCMF classes based on aromaticity of their rings, which are colored according to abbreviations. Blue star indicates location of the reactive handle of pos1 BBs which further reacts with Lia123 pos2 fragments (capping BBs). Oligo-linker attachment point is shown only for 1-ring pos1 BBs. The x-axis demonstrates 14,496 Lia123 pos2 mono-functional BBs clustered according to fragment RCMF type. Pos2 1-ring BBs are further split into smaller groups based on their ring aromaticity and distance of the reactive handle to the ring. Pos2 2-ring BBs are split into subgroups according to their ring aromaticity type. Red star on the graphical representation of pos2 RCMFs corresponds to pos2 BB reactive handles which react with pos1 BBs (unspecified for 2-fused ring BBs). Graphical depictions of RCMF types formed for enumerated library compounds are shown in gray boxes.

Searching Chemical Space with RCMFs. Similarly to InChIKeys, RCMF descriptor strings could be used for chemical searches in databases, DELs, and virtual libraries. To exemplify this, RCMF descriptor strings of 5069 drugs from a ChEMBL drug set (drugs with 2−6 rings) were searched within multi-trillion virtual libraries Lib1−Lib10. RCMF descriptor string exact matches were found for 4393 drugs (ca. 87%). Here, 249 other drugs either contained a 4-fused ring system or were not covered by RCMF types present in Lib1−Lib10 and thus could not have been found in Lib1−Lib10. No RCMF descriptor string matches were returned for 427 ChEMBL drugs (ca. 8%). This indicates that Lib1−Lib10 cover the chemical space of drugs reasonably well. Next, Lib1−Lib10 compounds sharing identical RCMF descriptor strings with any found drug could be further enumerated and compared using traditional similarity assessment techniques, such as fingerprints. To illustrate this, we selected an anticancer drug Olaparib and searched for its analogues in Lib1−Lib10 (Figure 9). In total, 262,988 dimers sharing the same RCMF descriptor string were found (209,765 from Lib1, 36,654 from Lib5, 16,467 from Lib7, 98 from Lib8, and 4 from Lib10). Dimers were subsequently enumerated and compared to Olaparib using Morgan FPs (ECFP4-like FPs as implemented in RDKit in KNIME)56−59,68,69 and Tanimoto coefficient for similarity

assessment. Two dimers were found to be identical to Olaparib. The other 29 dimers were extremely similar to the original drug sharing greater than 0.8 Tanimoto similarity. For another 1914 dimers, Tanimoto was in a range of 0.7−0.8. In total, 54,234 dimers had Tanimoto similarity greater than or equal to 0.5 to Olaparib. Next, 415 unique Murcko scaffolds were extracted from 262,988 enumerated dimers and compared to Olaparib scaffold using the same Morgan FPs. Here, 33 scaffolds shared Tanimoto similarity greater than or equal to 0.5 toward the Olaparib Murcko scaffold. Figure 9 demonstrates a random pick of 36 out of 415 Murcko scaffolds, which share a larger or smaller degree of similarity to the Olaparib scaffold. This example demonstrates how one can easily access a specific portion of chemical space covered by multi-trillion compound libraries using RCMFs. The search space could be further expanded by including RCMFs descriptor strings with high similarity to the query RCMF descriptor string. Mapping DELs Using RCMFs of Fragments. Generation of RCMF descriptors for fragments is a primary step for the developed RCMF approach when analyzing DELs and combinatorial libraries (Figure 6). However, RCMF descriptors for fragments could be used alone to assess library diversity both on the individual BB and library level. The earlier could be achieved by calculating the distribution of reagents within each 695

DOI: 10.1021/acs.jcim.7b00006 J. Chem. Inf. Model. 2017, 57, 680−699

Article

Journal of Chemical Information and Modeling RCMF fragment type (Supporting Figures S3 and S4). “Library level” could be then addressed by placing RCMF types for fragments for each library position on its own coordinate. One such example of DEL visualization based on RCMF of fragments is shown in Figure 10. In this figure, Lia123 pos1 and pos2 BBs are grouped according to their RCMF types of fragments with few general ring and linker descriptors on x-axis and y-axis, respectively, and a map is further color-coded according to the number of library dimers in a corresponding 2D cluster cell (see abbreviations in Figure 10). RCMF types for full-size library compounds are shown graphically for few areas on this map. It should be noted that the same RCMF types are formed in different regions of this heatmap, and therefore, it will be significantly different from 2D RCMF maps with 30,585 clusters presented earlier. Mapping based on RCMF types of BBs is especially suitable for visualization of dimer libraries; however, it becomes a challenge for trimer libraries as it would require visualization in 3D. Mapping of dimer DELs using RCMF types of fragments provides a convenient way to visualize and assess DEL and BB structural diversity at the same time, especially in the library design phase.
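The fragment-based heatmap described above can be illustrated with a minimal Python sketch. It assumes a fully combinatorial dimer library in which every pos1 BB is combined with every pos2 BB, so that the number of dimers in a heatmap cell is simply the product of the BB counts for the two fragment RCMF types. The type labels and the function name `heatmap_cells` are hypothetical and are not taken from the published workflow.

```python
from collections import Counter


def heatmap_cells(pos1_types, pos2_types):
    """Return {(pos1_type, pos2_type): dimer_count} without enumerating dimers.

    Assumption: every pos1 BB is combined with every pos2 BB.
    """
    pos1_counts = Counter(pos1_types)  # distribution of pos1 BBs per fragment type
    pos2_counts = Counter(pos2_types)  # distribution of pos2 BBs per fragment type
    return {
        (t1, t2): n1 * n2
        for t1, n1 in pos1_counts.items()
        for t2, n2 in pos2_counts.items()
    }


# Toy input: one (hypothetical) fragment type label per building block.
pos1 = ["1-ring aromatic", "1-ring aromatic", "2-ring fused", "0-ring"]
pos2 = ["1-ring aliphatic", "1-ring aromatic", "1-ring aromatic"]

cells = heatmap_cells(pos1, pos2)
total_dimers = sum(cells.values())
print(cells[("1-ring aromatic", "1-ring aromatic")], total_dimers)
```

The per-position `Counter` objects correspond to the BB-level distributions, and the cell dictionary corresponds to the library-level map; only the number of fragment types, not the number of dimers, determines the cost of the calculation.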

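The two-stage search described in the section on searching chemical space with RCMFs (exact matching of descriptor strings followed by fingerprint similarity ranking) can be sketched as follows. This is an illustration only: the descriptor strings are hypothetical placeholders rather than real RCMF line notations, and fingerprints are represented as plain Python sets of on-bits standing in for the Morgan FPs used in the study.

```python
from collections import defaultdict


def build_index(library):
    """Group library member ids by descriptor string for exact-match lookup."""
    index = defaultdict(list)
    for member_id, descriptor in library:
        index[descriptor].append(member_id)
    return index


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Hypothetical library members: (id, descriptor string).
library = [("d1", "R3|A|B"), ("d2", "R3|A|B"), ("d3", "R2|A")]
# Hypothetical fingerprints (sets of on-bit indices).
fingerprints = {"d1": {1, 2, 3, 4}, "d2": {1, 2, 3, 9, 10}, "d3": {7, 8}}

query_descriptor = "R3|A|B"
query_fp = {1, 2, 3, 4}

# Stage 1: exact descriptor-string match narrows the search space.
hits = build_index(library).get(query_descriptor, [])
# Stage 2: only the hits are compared by fingerprint similarity.
ranked = sorted(
    ((tanimoto(query_fp, fingerprints[m]), m) for m in hits), reverse=True
)
similar = [m for score, m in ranked if score >= 0.5]
print(ranked, similar)
```

The design point is that the expensive similarity comparison is applied only to members that already share the query descriptor string, which is what makes the search tractable for very large libraries.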
RCMFs are also very different from Feature Trees (FTs).28,29 FTs describe molecules by their major building blocks in a nonlinear fashion and were developed for searching in large combinatorial spaces. Although they are efficient in finding analogues of a query compound within a certain similarity threshold, FTs, unlike RCMFs, do not describe molecular topology. In contrast, each RCMF descriptor string may represent either an individual full-size compound or a whole class of similar compounds sharing similar Murcko scaffolds, and thus similarity assessment across different RCMF clusters is also possible. In addition, the RCMF approach can be applied to the comparison and visualization of chemical spaces occupied by DELs and other large combinatorial libraries, whereas FTs cannot.

In this study, we presented a global RCMF map, which may be considered a “macro”-level map of the chemical universe for small drug-like compounds. The resolution of this map was chosen to allow easy visualization of framework chemical space on an image the width of a PC screen, and the map can be considered the “World map” in geographical terms. A further increase in map resolution and “zooming” could be used to explore the distribution and “landscape” of each individual RCMF cluster, either at a more detailed RCMF level or at the level of Murcko scaffolds or individual compounds. To our knowledge, the number of “pre-fixed” chemical space maps published to date is rather limited, as the majority of chemical space visualization techniques are compound driven. The use of scaffold topologies as described in earlier studies30−32 indeed provides a high level of abstraction in a universe of small organic molecules. Nevertheless, application of these techniques requires the availability of chemical structures (enumeration), does not allow direct comparison of different topological groups, and has no visualization capabilities for billions of compounds and beyond.

The presented RCMF approach provides a new alternative methodology, which is capable of mapping the chemical space occupied by trillions of compounds very efficiently. In addition, individual RCMF descriptor strings provide an intuitive and logical way of grouping similar scaffolds into chemically meaningful clusters that are easy for medicinal chemists to perceive. The structural hierarchy analysis presented in this paper could be roughly compared to the geographic hierarchy of the world, where the number of people living in the world today (7.5 × 10⁹) is roughly comparable to the number of individual compounds in a DEL (10⁸−10¹⁰), the total number of world cities, towns, and villages (ca. 2.5 × 10⁶) to the number of Murcko scaffolds in a DEL (10−30 million), and the number of countries (ca. 189−196) to the number of general RCMF types (452). The number of continents (6) could then be compared to the number of rings in drug-like compounds (0−6). As explained earlier, the addition of RCMF descriptors greatly increases the “resolution” level of each RCMF type but never makes it as detailed as individual Murcko scaffolds. Therefore, the RCMF approach “operates” one structure-topology level above Murcko scaffolds in the classification hierarchy and allows intuitive clustering of them. As distances between RCMF clusters can be calculated as described in the Materials and Methods section, a network-like representation of framework chemical space is also possible, where nodes stand for framework clusters and edges are formed if the similarity between two framework clusters is higher than a defined threshold.
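The network-like representation mentioned above (nodes for framework clusters, edges wherever the similarity exceeds a threshold) can be sketched in a few lines of Python. The similarity function used here, the fraction of matching positions between two descriptor tuples, is only a stand-in chosen for illustration; it is not the distance measure defined in the Materials and Methods section, and the cluster names and descriptor values are hypothetical.

```python
from itertools import combinations


def build_network(clusters, similarity, threshold):
    """Return an adjacency dict with an edge wherever similarity > threshold."""
    network = {name: set() for name in clusters}
    for a, b in combinations(clusters, 2):
        if similarity(clusters[a], clusters[b]) > threshold:
            network[a].add(b)
            network[b].add(a)
    return network


def matching_fraction(desc_a, desc_b):
    """Stand-in similarity: fraction of descriptor positions that are equal."""
    matches = sum(x == y for x, y in zip(desc_a, desc_b))
    return matches / max(len(desc_a), len(desc_b))


# Hypothetical framework clusters described by short descriptor tuples.
clusters = {"A": (3, 1, 0), "B": (3, 1, 1), "C": (5, 2, 2)}

network = build_network(clusters, matching_fraction, threshold=0.5)
print(network)
```

With these toy values only clusters A and B are linked, while C remains an isolated node; raising or lowering the threshold changes how densely the framework space is connected.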



DISCUSSION

A novel multi-purpose RCMF approach was presented in this study. The developed method was used to chart chemical space on a “global framework level” for ChEMBL drugs, PubChem, multi-million DELs, and virtual combinatorial libraries with trillions of library members. In addition, we demonstrated the utility of the approach to efficiently search desired areas of chemical space, map selection outputs resulting from DEL screening experiments, and efficiently group millions of Murcko scaffolds into larger RCMF clusters in a robust and logical way that mimics how medicinal chemists intuitively perceive molecules. Finally, we showed that nearly all small drug-like molecules with up to six rings considered in this study could be roughly grouped into just 452 RCMF types. We also found that the number of possible RCMF types for BBs is very limited, allowing us to construct an efficient rule-based system that addresses all combinations and accurately assesses the framework diversity of DELs and combinatorial libraries of virtually any size without enumeration of library compounds.

RCMFs differ from Oprea’s scaffold topologies published earlier.30,31 First, RCMFs are more general. For example, RCMF type “3_type5” shown in Supporting Figure S1 covers all 3-ring scaffold topologies (types 6, 10, 11, and 16) shown in Figure 2 of ref 31. Furthermore, there are significant differences between the number of scaffold topologies listed in Table 1 of ref 30 and the number of RCMF types shown in this study (especially for 4−6 ring compounds), i.e., 23 RCMF types vs 73 scaffold topologies for 4-ring compounds, 83 RCMF types vs 590 scaffold topologies for 5-ring compounds, and 337 RCMF types vs 6454 scaffold topologies for 6-ring compounds. Supporting Figure S1 provides graphical depictions of all RCMFs with up to six rings, whereas graphical depictions of scaffold topologies with a maximum of four rings are shown in ref 31. In addition, we introduced a “minus-one-ring” option for RCMFs. This allows comparison of RCMFs across different types and finding relationships between them if the number of rings differs by one. Furthermore, all rings in each RCMF type have a predefined, alphabetically coded order, allowing one to refer to and describe any part of any RCMF type. Finally, the introduction of RCMF types for mono- and bi-functional reagents is entirely unique to our study.
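To make the difference in generality concrete, the counts of RCMF types and scaffold topologies quoted above for 4-, 5-, and 6-ring compounds can be turned into an average number of scaffold topologies per RCMF type. The short Python sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
# (RCMF types, scaffold topologies) per ring count, as quoted in the text.
counts = {4: (23, 73), 5: (83, 590), 6: (337, 6454)}

# Average number of scaffold topologies covered by one RCMF type.
ratios = {rings: topologies / rcmf for rings, (rcmf, topologies) in counts.items()}

for rings, ratio in ratios.items():
    print(f"{rings}-ring compounds: {ratio:.1f} scaffold topologies per RCMF type")
```

The ratio grows from roughly 3 for 4-ring compounds to roughly 19 for 6-ring compounds, which shows that the reduction in complexity becomes more pronounced as the number of rings increases.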


The current implementation of the method does not consider compounds with seven or more rings, macrocycles, or compounds with the very complex bridgehead systems usually found in natural products. Although these classes of compounds could be integrated into the developed workflows in the future, it is appropriate to ask first whether this is needed: 7-ring compounds can be grouped with 6-ring compounds in the RCMF clustering process if the “minus-one-ring” option is enabled, and compounds with eight or more rings are usually heavy, not drug-like, and of little interest for medicinal chemistry optimization projects. Moreover, RCMF types with seven or more rings introduce an even higher level of complexity at the topology level, as demonstrated previously.30 It is worth mentioning that only a few approved drugs have seven or more rings, and their MW is usually well above 500 Da. In addition, less than 1% of PubChem compounds (MW < 1000 Da) have seven or more rings. A few additional theoretically possible RCMF types might exist (not covered by the depictions in Supporting Figures S1, S3, and S4); it was not the purpose of this work to find and exhaustively cover all of them but rather to concentrate on available RCMF types covering drug-like molecules in public databases, DELs, and combinatorial libraries. In addition, we introduced RCMF types for BBs, further extending the applicability of the approach. It should be further stressed that the presented RCMF types generally cover greater than 99% of all in-house reagents, DELs, ACD, and PubChem compounds. As the implementation of the algorithm is rather flexible, the addition of new RCMF types to the system is possible. Similarly, a new set of rules that considers ring-forming reactions could be added to the system, since the rule-based engine behind the generation of RCMFs for full-size compounds from RCMFs of BBs is currently capable of “linking” individual fragments based on “side-chain” reactions only.

Calculation of RCMF descriptors for full-size DEL compounds is based on the known RCMF descriptors of BBs. The rule-based algorithm is very fast and generates RCMF descriptors with the desired level of resolution for several million fragment combinations in a few minutes. An automatic RCMF descriptor generation algorithm for full-size compounds (which are not members of any combinatorial library) is under active development. In addition, we are considering extending the RCMF descriptor sets by introducing chemical types of linkers.
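The idea of deriving descriptors for full-size compounds from BB descriptors without enumeration can be illustrated with a toy Python sketch. It iterates over combinations of distinct fragment descriptors per library position and multiplies the BB counts, so the cost depends on the number of fragment types rather than on the number of compounds. The fragment descriptors (ring count plus a short label) and the linking rule in `combine` are hypothetical simplifications and do not reproduce the rule-based engine described in this work.

```python
from collections import Counter
from itertools import product
from math import prod


def combine(fragment_descriptors):
    """Hypothetical linking rule: sum ring counts and join labels in position order."""
    rings = sum(d[0] for d in fragment_descriptors)
    label = "/".join(d[1] for d in fragment_descriptors)
    return (rings, label)


def full_size_counts(positions):
    """Count full-size compounds per combined descriptor without enumeration.

    positions: one Counter per library position, mapping a fragment
    descriptor to the number of BBs that carry it.
    """
    totals = Counter()
    for combo in product(*(p.items() for p in positions)):
        descriptors = [d for d, _ in combo]
        totals[combine(descriptors)] += prod(n for _, n in combo)
    return totals


# Toy dimer library: 5 pos1 BBs and 5 pos2 BBs, i.e., 25 compounds.
positions = [
    Counter({(1, "Ar"): 3, (0, "-"): 2}),
    Counter({(1, "Al"): 4, (2, "ArAr"): 1}),
]

totals = full_size_counts(positions)
print(totals)
```

Here four descriptor combinations are evaluated instead of 25 compounds; adding a third `Counter` to `positions` would extend the same calculation to a trimer library.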



Notes

The author declares no competing financial interest.



ACKNOWLEDGMENTS

This work was fully funded by Nuevolution A/S. The author thanks his colleagues from Nuevolution, Dr. Alex Haahr Gouliaev, Dr. Thomas Franch, and Dr. Mads Nørregaard-Madsen, for critically reading the manuscript and for valuable comments, as well as Dr. Jan Legaard Andersson for providing the description of the experimental assay for testing LSD1 hits.



ASSOCIATED CONTENT

Supporting Information



The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.7b00006. File S1. (XLSX) File S2. (XLSX) File S3. (XLSX) File S4. (XLSX) Methods, Figures S1−S20, Tables S1−S3. (PDF)



DEFINITIONS AND ABBREVIATIONS

DEL = DNA-encoded library
BBs = building blocks
“Reagents”, “fragments”, and “building blocks” = Terms are used interchangeably throughout the manuscript and denote mono- or bi-functional reagents, which are included in each DEL position to cause a chemical reaction with another position reagent
RCMF = Reduced Complexity Molecular Framework
RCMF type = RCMF with a specific ring connectivity pattern
RCMF type for a full-size compound = RCMF type for DEL or virtual combinatorial library full-size compounds
Fragment RCMF type (RCMF type for a reagent) = RCMF type for BBs included in DEL by design, where fragment RCMF type is defined not only by BB ring connectivity pattern but also by positioning of a reactive handle on the framework
RCMF descriptors or RCMF descriptor strings = Indicate each ring, each linker, and each angle descriptor derived for a specific RCMF type. RCMF descriptors have a predefined order in a line notation string. These line notations are unique for each specific RCMF type. RCMF descriptors could be generated for full-size compounds and for reagents. RCMF descriptors for reagents also include information on linker size between each reactive handle and the closest ring, as well as angle descriptors indicating how the reactive handle is attached to a system of rings
RCMF chemical space map = 2D grid of individual RCMF clusters for full-size compounds. RCMF chemical space maps shown in this study include RCMF descriptors, which are generalized to a level where a theoretically possible number of descriptor combinations becomes manageable for visualization purposes

REFERENCES

(1) Fey, N. Lost In Chemical Space? Maps to Support Organometallic Catalysis. Chem. Cent. J. 2015, 9, 38.
(2) Osolodkin, D. I.; Radchenko, E. V.; Orlov, A. A.; Voronkov, A. E.; Palyulin, V. A.; Zefirov, N. S. Progress in Visual Representation of Chemical Space. Expert Opin. Drug Discovery 2015, 10, 959−973.
(3) Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors, Vol. 11; Wiley-VCH: Weinheim, Germany, 2002.
(4) Shanmugasundaram, V.; Maggiora, G. M.; Lajiness, M. S. Hit-directed Nearest-neighbor Searching. J. Med. Chem. 2005, 48, 240−248.
(5) Sheridan, R. P.; Kearsley, S. K. Why Do We Need so Many Chemical Similarity Search Methods? Drug Discovery Today 2002, 7, 903−911.
(6) Willett, P. Similarity-based Virtual Screening Using 2D Fingerprints. Drug Discovery Today 2006, 11, 1046−1053.

AUTHOR INFORMATION

Corresponding Author

*Phone: +4539130952. E-mails: [email protected], [email protected].

ORCID

Aleksejs Kontijevskis: 0000-0001-9600-0491

(7) Hoksza, D.; Škoda, P.; Voršilák, M.; Svozil, D. Molpher: A Software Framework for Systematic Chemical Space Exploration. J. Cheminf. 2014, 6, 7.
(8) Kohonen, T. The Self-organizing Map. Proc. IEEE 1990, 78, 1464−1480.
(9) Gaspar, H. A.; Baskin, I. I.; Varnek, A. Visualization of a Multidimensional Descriptor Space. In Frontiers in Molecular Design and Chemical Information Science - Herman Skolnik Award Symposium 2015: Jürgen Bajorath; Bienstock, R. J., Shanmugasundaram, V., Bajorath, J., Eds.; ACS Symposium Series 1222; American Chemical Society: Washington, DC, 2016; pp 243−267.
(10) Rosén, J.; Lövgren, A.; Kogej, T.; Muresan, S.; Gottfries, J.; Backlund, A. ChemGPS-NP(Web): Chemical Space Navigation Online. J. Comput.-Aided Mol. Des. 2009, 23, 253−259.
(11) Larsson, J.; Gottfries, J.; Muresan, S.; Backlund, A. ChemGPS-NP: Tuned for Navigation in Biologically Relevant Chemical Space. J. Nat. Prod. 2007, 70, 789−794.
(12) Larsson, J.; Gottfries, J.; Bohlin, L.; Backlund, A. Expanding the ChemGPS Chemical Space with Natural Products. J. Nat. Prod. 2005, 68, 985−991.
(13) Gaspar, H. A.; Baskin, I. I.; Marcou, G.; Horvath, D.; Varnek, A. Chemical Data Visualization and Analysis with Incremental Generative Topographic Mapping: Big Data Challenge. J. Chem. Inf. Model. 2015, 55, 84−94.
(14) Kireeva, N.; Baskin, I. I.; Gaspar, H. A.; Horvath, D.; Marcou, G.; Varnek, A. Generative Topographic Mapping (GTM): Universal Tool for Data Visualization, Structure-Activity Modeling and Dataset Comparison. Mol. Inf. 2012, 31, 301−312.
(15) Reymond, J. L. The Chemical Space Project. Acc. Chem. Res. 2015, 48, 722−730.
(16) Ruddigkeit, L.; Blum, L. C.; Reymond, J. L. Visualization and Virtual Screening of the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2013, 53, 56−65.
(17) Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J. L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864−2875.
(18) Ruddigkeit, L.; Awale, M.; Reymond, J. L. Expanding the Fragrance Chemical Space for Virtual Screening. J. Cheminf. 2014, 6, 27.
(19) Reymond, J. L.; Ruddigkeit, L.; Blum, L.; van Deursen, R. The Enumeration of Chemical Space. WIREs Comput. Mol. Sci. 2012, 2, 717−733.
(20) Reymond, J. L.; van Deursen, R.; Blum, L. C.; Ruddigkeit, L. Chemical Space as a Source for New Drugs. MedChemComm 2010, 1, 30−38.
(21) Reymond, J. L.; Awale, M. Exploring Chemical Space for Drug Discovery Using the Chemical Universe Database. ACS Chem. Neurosci. 2012, 3, 649−657.
(22) Virshup, A. M.; Contreras-García, J.; Wipf, P.; Yang, W.; Beratan, D. N. Stochastic Voyages into Uncharted Chemical Space Produce a Representative Library of All Possible Drug-like Compounds. J. Am. Chem. Soc. 2013, 135, 7296−7303.
(23) Schuffenhauer, A.; Ertl, P.; Roggo, S.; Wetzel, S.; Koch, M. A.; Waldmann, H. The Scaffold Tree - Visualization of the Scaffold Universe by Hierarchical Scaffold Classification. J. Chem. Inf. Model. 2007, 47, 47−58.
(24) Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887−2893.
(25) Wetzel, S.; Klein, K.; Renner, S.; Rauh, D.; Oprea, T. I.; Mutzel, P.; Waldmann, H. Interactive Exploration of Chemical Space with Scaffold Hunter. Nat. Chem. Biol. 2009, 5, 581−583.
(26) de la Vega de León, A.; Bajorath, J. Chemical Space Visualization: Transforming Multidimensional Chemical Spaces into Similarity-based Molecular Networks. Future Med. Chem. 2016, 8, 1769−1778.
(27) Ertl, P. Intuitive Ordering of Scaffolds and Scaffold Similarity Searching Using Scaffold Keys. J. Chem. Inf. Model. 2014, 54, 1617−1622.

(28) Rarey, M.; Stahl, M. Similarity Searching in Large Combinatorial Chemistry Spaces. J. Comput.-Aided Mol. Des. 2001, 15, 497−520.
(29) Rarey, M.; Dixon, J. S. Feature Trees: A New Molecular Similarity Measure Based on Tree Matching. J. Comput.-Aided Mol. Des. 1998, 12, 471−490.
(30) Pollock, S. N.; Coutsias, E. A.; Wester, M. J.; Oprea, T. I. Scaffold Topologies. 1. Exhaustive Enumeration up to Eight Rings. J. Chem. Inf. Model. 2008, 48, 1304−1310.
(31) Wester, M. J.; Pollock, S. N.; Coutsias, E. A.; Allu, T. K.; Muresan, S.; Oprea, T. I. Scaffold Topologies. 2. Analysis of Chemical Databases. J. Chem. Inf. Model. 2008, 48, 1311−1324.
(32) Velkoborsky, J.; Hoksza, D. Scaffold Analysis of PubChem Database as Background for Hierarchical Scaffold-based Visualization. J. Cheminf. 2016, 8, 74.
(33) Jensen, A.; Seidler, S. Method for Generating a Hierarchical Topologican Tree of 2D or 3D-structural Formulas of Chemical Compounds for Property Optimisation of Chemical Compounds. U.S. Patent US20040088118, 2004.
(34) Muegge, I.; Zhang, Q. 3D Virtual Screening of Large Combinatorial Space. Methods 2015, 71, 14−20.
(35) Peng, Z. Very Large Virtual Compound Spaces: Construction, Storage, and Utility in Drug Discovery. Drug Discovery Today: Technol. 2013, 10, e387−e394.
(36) Nicolaou, C. A.; Watson, I. A.; Hu, H.; Wang, J. The Proximal Lilly Collection: Mapping, Exploring, and Exploiting Feasible Chemical Space. J. Chem. Inf. Model. 2016, 56, 1253−1266.
(37) So, S.-S. Enumeration and Visualization of Large Combinatorial Chemical Libraries. In A Handbook for DNA-Encoded Chemistry: Theory and Applications for Exploring Chemical Space and Drug Discovery; Goodnow, R. A., Ed.; Wiley: Hoboken, NJ, 2014; Chapter 12, pp 247−279 (ISBN: 978-1-118-48768-6).
(38) Eidam, O.; Satz, A. L. Analysis of the Productivity of DNA-encoded Libraries. MedChemComm 2016, 7, 1323.
(39) Zimmermann, G.; Neri, D. DNA-encoded Chemical Libraries: Foundations and Applications in Lead Discovery. Drug Discovery Today 2016, 21, 1828−1834.
(40) Melkko, S.; Dumelin, Ch.E.; Scheuermann, J.; Neri, D. Lead Discovery by DNA-encoded Chemical Libraries. Drug Discovery Today 2007, 12, 465−471.
(41) Clark, M. A.; Acharya, R. A.; Arico-Muendel, C. C.; Belyanskaya, S. L.; Benjamin, D. R.; Carlson, N. R.; Centrella, P. A.; Chiu, C. H.; Creaser, S. P.; Cuozzo, J. W.; et al. Design, Synthesis and Selection of DNA-encoded Small-molecule Libraries. Nat. Chem. Biol. 2009, 5, 647−654.
(42) Litovchick, A.; Dumelin, Ch.E.; Habeshian, S.; Gikunju, D.; Guié, M.-A.; Centrella, P.; Zhang, Y.; Sigel, E. A.; Cuozzo, J. W.; Keefe, A. D.; Clark, M. A. Encoded Library Synthesis Using Chemical Ligation and the Discovery of sEH Inhibitors from a 334-million Member Library. Sci. Rep. 2015, 5, 10916.
(43) Mullard, A. DNA-encoded Drug Libraries Come of Age. Nat. Biotechnol. 2016, 34, 450−451.
(44) Mullard, A. DNA Tags Help the Hunt for Drugs. Nature 2016, 530, 367−369.
(45) Wan, J.; Dou, D.; Song, H.; Wu, X.-H.; Cheng, X.; Li, J. Lead Generation for Challenging Targets. In Lead Generation: Methods and Strategies; Holenz, J., Ed.; 2016; Vol. 67, pp 259−260 (ISBN: 978-3-527-33329-5).
(46) Nuevolution A/S. Method for the Synthesis of a Bifunctional Complex. World Patent WO 2004/039825, 2004.
(47) Nuevolution A/S. Enzymatic Encoding Methods for Efficient Synthesis of Large Libraries. World Patent WO 2007/062664, 2007.
(48) Goodnow, R. A., Jr.; Dumelin, C. E.; Keefe, A. D. DNA-encoded Chemistry: Enabling the Deeper Sampling of Chemical Space. Nat. Rev. Drug Discovery 2016, 16, 131−147.
(49) Ahn, S.; Kahsai, A. W.; Pani, B.; Wang, Q.-T.; Zhao, S.; Wall, A. L.; Strachan, R. T.; Staus, D. P.; Wingler, L. M.; Sun, L. D.; Sinnaeve, J.; et al. Allosteric “Beta-blocker” Isolated from a DNA-encoded Small Molecule Library. Proc. Natl. Acad. Sci. U. S. A. 2017, 114, 1708−1713.

(50) Franzini, R. M.; Randolph, C. Chemical Space of DNA-encoded Libraries. J. Med. Chem. 2016, 59, 6629−6644.
(51) Mannocci, L. DNA-Encoded Libraries. In Diversity-Oriented Synthesis: Basics and Applications in Organic Synthesis, Drug Discovery, and Chemical Biology; Trabocchi, A., Ed.; Wiley, 2013; Chapter 11 (ISBN: 978-1-118-14565-4).
(52) Satz, A. L. Foundations of a DNA-Encoded Library (DEL). In A Handbook for DNA-Encoded Chemistry: Theory and Applications for Exploring Chemical Space and Drug Discovery; Goodnow, R. A., Ed.; Wiley: Hoboken, NJ, 2014; Chapter 5, pp 99−122 (ISBN: 978-1-118-48768-6).
(53) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40, D1100−D1107.
(54) PubChem Database. https://pubchem.ncbi.nlm.nih.gov/ (accessed May 11, 2016).
(55) BIOVIA Available Chemicals Directory (ACD). http://accelrys.com/products/collaborative-science/databases/sourcing-databases/biovia-available-chemicals-directory.html (accessed May 31, 2016).
(56) Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. KNIME: The Konstanz Information Miner. In Data Analysis, Machine Learning and Applications; Preisach, C., Burkhardt, P. D. H., Schmidt-Thieme, P. D.L., Decker, P. D. R., Eds.; Springer: Berlin, Heidelberg, 2008; pp 319−326.
(57) All workflows were implemented in KNIME, version 3.2.1. http://knime.org (accessed March 2017).
(58) RDKit nodes 3.0.0. distributed as part of “Community Contributions”. http://tech.knime.org/community/ (accessed September 30, 2016).
(59) Indigo nodes 2.0.0 from Epam distributed as part of “Community contributions”. http://tech.knime.org/community/ (accessed September 30, 2016).
(60) Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug Delivery Rev. 2001, 46, 3−26.
(61) Creative Research Systems. http://www.surveysystem.com/sscalc.htm.
(62) Canvas v2.8; Schrodinger, Inc.: Portland, OR, 2016.
(63) Kirkpatrick, P.; Ellis, C. Chemical Space. Nature 2004, 432, 823.
(64) Metsalu, T.; Vilo, J. ClustVis: A Web Tool for Visualizing Clustering of Multivariate Data Using Principal Component Analysis and Heatmap. Nucleic Acids Res. 2015, 43, W566−W570.
(65) Ivanenkov, Y. A.; Savchuk, N. P.; Ekins, S.; Balakin, K. V. Computational Mapping Tools for Drug Discovery. Drug Discovery Today 2009, 14, 767−775.
(66) Paolini, G. V.; Shapland, R. H.; van Hoorn, W. P.; Mason, J. S.; Hopkins, A. L. Global Mapping of Pharmacological Space. Nat. Biotechnol. 2006, 24, 805−815.
(67) Arrowsmith, C. H.; Bountra, C.; Fish, P. V.; Lee, K.; Schapira, M. Epigenetic Protein Families: A New Frontier for Drug Discovery. Nat. Rev. Drug Discovery 2012, 11, 384−400.
(68) Landrum, G. RDKit: Open-Source Cheminformatics. http://www.rdkit.org (accessed September 30, 2016).
(69) Rogers, D.; Hahn, M. Extended-connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754.


DOI: 10.1021/acs.jcim.7b00006 J. Chem. Inf. Model. 2017, 57, 680−699