
Article

Mapping of drug-like chemical universe with reduced complexity molecular frameworks Aleksejs Kontijevskis J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00006 • Publication Date (Web): 28 Mar 2017 Downloaded from http://pubs.acs.org on March 30, 2017


Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Chemical Information and Modeling

Mapping of Drug-like Chemical Universe with Reduced Complexity Molecular Frameworks

Aleksejs Kontijevskis*

Nuevolution A/S, Rønnegade 8, DK-2100 Copenhagen, Denmark

Corresponding Author*: Phone: +4539130952 E-mail: [email protected], [email protected]


ABSTRACT

The emergence of the DNA-encoded chemical library (DEL) field in the past decade has attracted the attention of the pharmaceutical industry as a powerful mechanism for the discovery of novel drug-like hits for various biological targets. Nuevolution Chemetics technology enables DNA-encoded synthesis of billions of chemically diverse drug-like small-molecule compounds, and their efficient screening and optimization, facilitating effective identification of drug candidates at an unprecedented speed and scale. Although the cheminformatics community has developed many approaches for the analysis and visualization of drug-like chemical space, most of them are restricted to at most a few million compounds and cannot handle the collections of 10^8-10^12 compounds typical for DELs. To address this big chemical data challenge, we developed the Reduced Complexity Molecular Frameworks (RCMF) methodology as an abstract and very general way of representing chemical structures. By further introducing RCMF descriptors, we constructed a global framework map of drug-like chemical space and demonstrate how the chemical space occupied by multi-million-member drug-like Chemetics DNA-encoded libraries and virtual combinatorial libraries with >10^12 members can be analysed and mapped without a need for library enumeration. We further validate the approach by performing RCMF-based searches in the drug-like chemical universe and by mapping Chemetics library selection outputs for the LSD1 target on a global framework chemical space map.


INTRODUCTION

Drug-like chemical space is the virtual space occupied by all chemically meaningful small drug-like molecules. Intelligent exploration, mapping, visualization, and navigation in this chemical universe is a key step in the discovery of novel drugs and tools for chemical biology. Drug-like chemical space is so vast that its complete coverage and analysis is beyond current technological capabilities, both practically (in terms of what could actually be synthesized) and computationally. Thus, the focus in industry has shifted from the race for chemical library size to library quality. Using various cheminformatics methods, the industry is moving towards identifying areas of chemical space that are not covered by library collections or are under-represented. Nowadays, millions of drug-like molecules are recorded in public and corporate databases, and this number is increasing exponentially owing to the introduction of parallel and combinatorial synthesis approaches and the emergence of DNA-encoded library technology. To be comprehended, these huge arrays of chemical data need to be represented in a human-understandable, yet information-rich format. The growing amount of accumulated data in many other areas of science gets mapped in different ways. Universe and planet maps in astronomy, genome and protein maps of living organisms in biology, and GPS maps on smartphones are just a few examples of everyday big data visualization systems.

However, graphical depictions of the whole universe of drug-like chemically accessible small molecules are rare and largely incomplete.1,2 This is because chemical space is often defined by various sets of descriptors3 leading to a major problem: the lack of space invariance.4,5 Diverse descriptor sets and different distance measures6 result in chemical spaces showing variable molecule distributions.7 In descriptor-based chemical space, where each compound is represented by an N-dimensional descriptor set, the two most popular approaches for visualization are dimensionality reduction and similarity network graphs. Dimensionality reduction techniques reduce N-dimensional chemical space to a “compressed” latent space of 2-3 dimensions, e.g. by principal component analysis (PCA) and self-organizing Kohonen maps (SOM).8,9 In a graph-based chemical space, individual molecules, their scaffolds or subscaffolds are shown as nodes. The nodes are then linked to each other by edges based either on structure decomposition rules or on an arbitrarily chosen similarity metric. This lets researchers construct various molecular or scaffold trees and networks in a hierarchical manner. Published methods based on these strategies are summarized in Table 1. The main disadvantages of most of the listed methods are an inability, or a very restricted capacity, to analyse large chemical data sets (>10^6 compounds and beyond) and a need to enumerate the compounds present in real or virtual combinatorial libraries prior to the analysis. The use of huge virtual combinatorial libraries has become routine in the pharmaceutical industry.34-36 Examples include the BICLAIM collection reported by Boehringer Ingelheim,34 PGVL from Pfizer,35 and the “Proximal Lilly Collection” from Eli Lilly.36 Although possible search strategies within these very large virtual libraries have been reported, the authors do not discuss the drug-likeness of the theoretical molecules and do not attempt to map them. To address the massiveness of the drug-like chemical universe and access its new uncharted areas in practical terms, a novel DNA-encoded library (DEL) technology has emerged and matured in recent years.37-49 DELs make it possible to access billions of compounds in a volume of less than 100 µl with only negligible protein consumption and a screening duration of 1–5 days. In a DEL, each structure is tagged with a DNA identification barcode.
In the Chemetics platform developed by Nuevolution, small molecules are pre-formulated in a combinatorial manner on DNAs and the final mixed small-molecule–DNA conjugates serve as the libraries ready for affinity screening.46,47,49 In our approach, DNA tags attached to the small molecules serve as barcodes that record both the structural information of the small molecules and the library information (Figure 1). Repeated cycles of the well-established “split-and-pool” synthesis strategy from combinatorial chemistry ensure production of huge and diverse compound libraries. Simultaneous testing of thousands to millions of structurally related compounds (including stereoisomers and enantiomers) within each scaffold series provides “instant SAR databases” after each DEL selection campaign. This invaluable SAR information is then used for the design and improvement of hits by traditional medicinal chemistry optimization, design of focused DELs, and/or further rounds of DNA-encoded “affinity maturation”.

Table 1. Summary of reports describing visualization and analysis techniques of drug-like chemical space.

| Nr. | Technique | Method | Short description | Cons | References |
|---|---|---|---|---|---|
| 1 | Dimensionality reduction | ChemGPS, ChemGPS-NP (PCA-based methods) | Chemical space maps built on a reference set of “satellite” compounds | Requires enumeration (for combinatorial libraries). Not practical for very large libraries (>10^6 compounds). | 10-12 |
| 2 | Dimensionality reduction | Generative topographic mapping (GTM), extension of SOMs | | Requires enumeration of compounds. Slow for large compound libraries (2.2 million in 19 h). | 8,9,13,14 |
| 3 | Dimensionality reduction | PCA | Huge chemical database of 166 billion small drug-like compounds. High chemical space coverage. | Does not cover drug-like chemical space for compounds with >17 heavy atoms and MW range ca. 300-600 Da. In addition, Virshup et al. opposed a need for exhaustive enumeration of all possible chemicals.22 | 15-21 |
| 4 | Graph-based | Scaffold trees, Scaffold Hunter | Highly suitable for the analysis of small datasets with known biological data | Cannot be used with large compound datasets (millions and beyond). Ring removal prioritization rules are hard-coded. | 23-25 |
| 5 | Graph-based | Chemical space networks | The method works well on small data set with [...] | | |
| 6 | Scaffold maps | Scaffold keys | | | |
| 7 | Feature Trees | Feature Trees | | | |
| 8 | Scaffold Topologies | | | | 30,31 |
| 9 | Tree map of scaffolds | | | | 32 |
| 10 | Hierarchical topological trees | | | | |
| 11 | Reduced Complexity Molecular Frameworks | | [...] (10^9 compounds and beyond). Covers entire drug-like and fragment chemical space. Allows fast comparison analysis and mapping of drug-like space occupied by DELs and combinatorial libraries varying in their setups. Enables efficient search in drug-like chemical universe covered by DELs. Offers intuitive clustering of scaffolds from medicinal chemists’ point of view. | RCMF method might be too general when applied on a small set of very similar compounds | This study |

Figure 1. Nuevolution Chemetics platform. During the synthesis phase, a collection of drug-like molecules is synthesized as a mixture (hundreds of millions to billions and even up to trillions of diverse molecules). Each library molecule carries a DNA sequence (code) and a linker, which physically attaches the DNA code to the small molecule. The DNA code serves as a “barcode” holding the information on the structure of the small molecule. During Nuevolution’s screening against a biological disease target, inactive compounds are eliminated and active compounds are isolated. The structures of the active compounds are then determined by sequencing the DNA codes.


The productivity of any DEL ultimately depends on library design (diversity and drug-likeness), synthesis quality, the robustness of the screening process, and sequencing power. The size of DELs could easily reach trillions of compounds by assembling building blocks (BBs) at 4 or more diversity positions; however, such a design would compromise the drug-likeness of the library compounds (as each diversity position inevitably increases the average MW of the full-size molecules) and would be accompanied by an unfavourable decrease in library quality due to incomplete reactivity of BBs. Therefore, we put much emphasis on ensuring the drug-likeness, chemical space diversity and density covered by DELs, as well as the synthetic quality of DELs produced by Chemetics technology. Analyses of chemical space coverage by various DELs have been reported in the scientific literature.50-52 Nevertheless, in the majority of these studies the authors used rather small random subsets of DELs for enumeration and further diversity assessment, visualization and mapping. Thus, an approach able to analyse, compare and map the chemical space of billions of drug-like compounds and beyond in a time-efficient manner, while avoiding the enumeration step, is clearly missing. Here we present a novel approach for the analysis and mapping of huge drug-like chemical space by introducing Reduced Complexity Molecular Frameworks (RCMF) as an abstract and very general way of representing chemical structures. Compared to previously published approaches (Table 1), RCMFs offer an alternative topological way of representing chemical structures with new advanced features. The development of the method was inspired by an unmet need to handle and analyse multi-million Chemetics DELs as well as selection outputs (typically 10^5-10^6 unique compounds) resulting from DEL screening experiments.

We aimed to develop a method that would mimic a chemist’s way of thinking in terms of “common motifs” but would not be limited to scaffold recognition and would not lose the understanding of large structural datasets that chemists gain in their manual analysis process. As we further demonstrate, the developed RCMF approach makes it possible, without enumeration, to quickly map at a desired resolution level and compare libraries of 10^6-10^12 drug-like compounds on a global framework chemical space map, to analyse DEL diversities in their design phase, to perform searches, and to analyse selection output data sets, to mention just a few applications.

THE MAIN IDEAS OF THE RCM FRAMEWORKS APPROACH

1. Representation of DEL molecules in an abstract but chemically meaningful form by their reduced complexity molecular frameworks, including a description of ring chemical types and sizes, the sizes of the linkers connecting the rings, and angle information on how the rings are interconnected.
2. Exhaustive exploration of RCMF types for drug-like compounds with up to 6 rings.
3. Introduction of RCMF types and descriptors for mono- and bi-functional reagents.
4. Assessment of mono- and bi-functional reagent diversity at the RCMF level.
5. Exhaustive exploration of all theoretically possible combinations of mono- and bi-functional reagents using their RCMF types, for the construction of an efficient rule-based system that can determine the RCMF type and RCMF descriptors of full-size compounds in DELs without a need for enumeration.
6. Efficient pairwise comparison of RCMF descriptors for similarity assessment based on adjustable penalty score tables.
7. Construction of invariant maps of drug-like chemical space using RCMFs, and use of these maps for visualization and comparison of the chemical space occupied by DELs and virtual combinatorial libraries.
8. Efficient searching for query structure analogues in DNA-encoded and virtual combinatorial libraries based on RCMF descriptor identity or similarity.

CONCLUSIONS

A novel approach termed “reduced complexity molecular frameworks” was presented in this study. The main distinctive features of the method are easy comparison of large chemical spaces occupied by DELs with different setups without a need for library enumeration, and support in the library design phase for optimizing coverage of chemical space not yet explored by already synthesized DELs. In addition, RCMFs can be used for efficient searching of large combinatorial spaces and analysis of selection outputs. Finally, the introduced RCMF descriptors represent a novel class of topology descriptors both for library BBs and for full-size drug-like compounds. Automatic generation of RCMF descriptors for full-size drug-like compounds that are not members of DELs or combinatorial libraries is currently not available but is under active development. Another attractive feature of RCMFs is their generalization capacity, which may be useful for avoiding intellectual property issues: graphical depictions of RCMF types with their descriptors can be disclosed instead of the structures of individual compounds or their Murcko scaffolds. We hope that the application of the presented approach in various areas of cheminformatics will help to efficiently solve many current problems related to the analysis and handling of huge databases of chemical data, and to searches in huge virtual chemical spaces in general.

MATERIALS AND METHODS

ChEMBL drug set


The ChEMBL drug set (11222 molecules) was downloaded from the ChEMBL database (version 16.10.15.51)53 and further filtered. Only drugs with 1-6 rings and MW < 800 Da were retained (7519 remained); these were further checked to contain only “drug-like” atoms, i.e. C, O, N, S, H, F, Cl, or Br. Drugs with 5- or 6-fused ring systems, with very long and flexible side chains, or composed of only C and O atoms were also removed. Furthermore, sugar derivatives, drugs containing charged nitrogen in heterocycles, very “ugly” or “weird” drug molecules, complex natural product-like drugs, drugs with peroxy bridges, disulphide bonds or multiple SO3 and SO4 groups, steroid-like drugs, small single-ring drugs with “weird” ring systems, drugs with alkylating reactive handles, and drugs containing epoxide or aziridine rings were filtered off. In addition, vitamin D analogues, prostaglandins, peptides, and macrocycle-containing drugs were removed. This resulted in a reduced pre-filtered set of 5877 drugs with 1-6 rings, shown in Supporting File S1.
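The automatable part of this pre-filtering (ring count, MW, and allowed elements) can be illustrated with a minimal sketch. It assumes that ring count, molecular weight and element composition have already been computed by a cheminformatics toolkit; the `DrugRecord` layout, the function name and the example records are hypothetical, and the subjective filters (e.g. “ugly” or “weird” molecules) are not modelled. Treating “composed of only C and O atoms” as “heavy atoms are exactly C and O” is also an assumption.

```python
from dataclasses import dataclass

# Elements listed in the text as "drug-like" for the ChEMBL drug set.
ALLOWED_ELEMENTS = frozenset({"C", "O", "N", "S", "H", "F", "Cl", "Br"})


@dataclass(frozen=True)
class DrugRecord:
    """Precomputed properties for one drug (hypothetical record layout)."""
    name: str
    ring_count: int
    mol_weight: float      # in Da
    elements: frozenset    # element symbols present in the molecule


def passes_automatic_filters(drug):
    """Apply only the rule-based criteria quoted in the text."""
    if not 1 <= drug.ring_count <= 6:          # 1-6 rings
        return False
    if drug.mol_weight >= 800:                 # MW < 800 Da
        return False
    if not drug.elements <= ALLOWED_ELEMENTS:  # only "drug-like" atoms
        return False
    # Assumption: hydrogen is ignored when testing for "only C and O atoms".
    if drug.elements - {"H"} == {"C", "O"}:
        return False
    return True


# Made-up records used purely to exercise the filter.
examples = [
    DrugRecord("example_pass", 3, 410.5, frozenset({"C", "H", "N", "O", "F"})),
    DrugRecord("example_too_heavy", 4, 850.0, frozenset({"C", "H", "N", "O"})),
    DrugRecord("example_phosphorus", 2, 300.0, frozenset({"C", "H", "O", "P"})),
    DrugRecord("example_c_o_only", 2, 250.0, frozenset({"C", "H", "O"})),
    DrugRecord("example_acyclic", 0, 180.0, frozenset({"C", "H", "N", "O"})),
]
kept = [drug.name for drug in examples if passes_automatic_filters(drug)]
```

Keeping the rule-based criteria in one predicate makes it straightforward to report how many molecules each additional criterion removes.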

PubChem pre-processing

The PubChem database54 (ca. 89.1 million compounds) was downloaded and pre-filtered as follows: only compounds with MW < 1000 Da (for the largest component) and fewer than 11 rings were retained. In addition, only compounds consisting of C, N, O, S, P, H, and halogen atoms were kept. Macrocycles, compounds with individual ring sizes > 8 bonds, very complex bridged systems, and systems of 5 or more fused rings were further discarded. The remaining 85.7 million PubChem compounds were used in further analysis.

ACD database pre-processing and virtual dimer libraries

The ACD database (May 2016 release) from BIOVIA55 was used as a source of commercially available fragments for building virtual dimer libraries. The following properties were calculated for each ACD compound (for the largest component): MW, number of rings, and the largest individual ring size, as well as a complete list of elements (for all components). In addition, a list of suppliers was extracted for each ACD database molecule. Reactive handles were identified in each fragment using the in-house developed “Tag generator” tool (see Supporting Methods). This tool checks each fragment for the presence of >50 different reactive handles and, if one is found, assigns the corresponding “reactive group tag”. The ACD database was further filtered based on the criteria specified in Supporting Methods. In this paper we limited the number of reactive groups to the following 13 groups rich in commercially available reagents: carboxylic acids (1037815), isocyanates (7513), sulfonyl chlorides and sulfonyl fluorides (33909), alkylating chlorides and bromides (520261), electrophiles (134847), heteroaromatic rings as nucleophiles (293359), phenols (286425), thiophenols (34141), aldehydes (169160), boronic acids (103835), I/Br-Suzuki reagents (961584), primary amines (1888353), and secondary amines (2014274). Finally, “core” structures were generated for all 13 ACD reagent groups using the in-house tool “Core structure generator” (see Supporting Methods for details); these were further used to build 10 virtual multi-trillion dimer libraries termed Lib1-Lib10 (Table 2).


Table 2. Multi-trillion virtual combinatorial dimer libraries constructed from commercially available ACD rich reagent groups.

| Library | Pos1 reagent group | Pos1 reagents | Pos1 core structures | Pos2 reagent group | Pos2 reagents | Pos2 core structures | Theoretical library size |
|---|---|---|---|---|---|---|---|
| Lib1 | Carboxylic acids | 1,037,815 | 1,048,555 | Primary and secondary amines | 3,805,469 | 3,955,564 | 4,147,626,410,020 |
| Lib2 | Isocyanates | 7,513 | 7,513 | Primary and secondary amines | 3,805,469 | 3,955,564 | 29,718,152,332 |
| Lib3 | Sulfonyl Cl/F | 33,909 | 33,909 | Primary and secondary amines | 3,805,469 | 3,955,564 | 134,129,219,676 |
| Lib4 | Alkylating Cl/Br | 520,261 | 520,261 | Primary and secondary amines | 3,805,469 | 3,955,564 | 2,057,925,682,204 |
| Lib5 | Electrophiles (NAS) | 134,847 | 142,022 | Primary and secondary amines | 3,805,469 | 3,955,564 | 561,777,110,408 |
| Lib6 | Heteroaromatic ring as nucleophile | 293,359 | 296,346 | Electrophiles (NAS) | 134,847 | 142,024 | 42,088,244,304 |
| Lib7 | Phenols | 286,425 | 303,320 | Electrophiles (NAS) | 134,847 | 142,024 | 43,078,719,680 |
| Lib8 | Thiophenols | 34,141 | 34,208 | Electrophiles (NAS) | 134,847 | 142,024 | 4,858,356,992 |
| Lib9 | Aldehydes | 169,160 | 169,160 | Primary amines | 1,888,353 | 1,905,088 | 322,264,686,080 |
| Lib10 | Boronic acids | 103,835 | 103,835 | I/Br-Suzuki | 961,584 | 990,479 | 102,846,386,965 |
| Total | | | | | | | 7,446,312,968,661 |
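As a quick arithmetic check on Table 2, the sketch below recomputes each theoretical library size as the product of the Pos1 and Pos2 core structure counts. That the size is this simple product is an inference from the tabulated numbers rather than a rule stated in the text, and the dictionary and function names are illustrative only.

```python
# (Pos1 core structures, Pos2 core structures) per library, copied from Table 2.
CORE_COUNTS = {
    "Lib1": (1_048_555, 3_955_564),
    "Lib2": (7_513, 3_955_564),
    "Lib3": (33_909, 3_955_564),
    "Lib4": (520_261, 3_955_564),
    "Lib5": (142_022, 3_955_564),
    "Lib6": (296_346, 142_024),
    "Lib7": (303_320, 142_024),
    "Lib8": (34_208, 142_024),
    "Lib9": (169_160, 1_905_088),
    "Lib10": (103_835, 990_479),
}

# Theoretical library sizes as reported in Table 2.
REPORTED_SIZES = {
    "Lib1": 4_147_626_410_020,
    "Lib2": 29_718_152_332,
    "Lib3": 134_129_219_676,
    "Lib4": 2_057_925_682_204,
    "Lib5": 561_777_110_408,
    "Lib6": 42_088_244_304,
    "Lib7": 43_078_719_680,
    "Lib8": 4_858_356_992,
    "Lib9": 322_264_686_080,
    "Lib10": 102_846_386_965,
}


def theoretical_size(pos1_cores, pos2_cores):
    """Number of dimers if every Pos1 core is paired with every Pos2 core."""
    return pos1_cores * pos2_cores


sizes = {name: theoretical_size(*cores) for name, cores in CORE_COUNTS.items()}
mismatches = [name for name in sizes if sizes[name] != REPORTED_SIZES[name]]
total = sum(sizes.values())

print(f"mismatching libraries: {mismatches}")
print(f"total theoretical size: {total:,}")
```

Because the sizes follow directly from the per-position counts, the coverage of such a library can be reasoned about from its building blocks alone, which is the property the RCMF approach relies on to avoid enumeration.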


Nuevolution Chemetics DNA-encoded libraries

Four Chemetics DNA-encoded libraries (DELs) with diverse property profiles (termed “Lia108”, “Lia122”, “Lia123”, and “Lia126”) were used in this study to demonstrate the applicability of the RCMF approach in the DEL field. These DELs have been extensively screened in-house over time on a variety of targets, resulting in the discovery of multiple series of structurally diverse drug-like hits for the majority of the screened targets. Lia108, Lia123, and Lia126 are dimer libraries, whereas Lia122 has a trimer library setup, as shown in Figure 2.

Figure 2. Setup of Nuevolution Chemetics DNA-encoded libraries Lia108, Lia122, Lia123, and Lia126.

Reduced Complexity Molecular Frameworks

A Reduced Complexity Molecular Framework (RCMF) is defined as an abstract representation of a group of chemical structures sharing the same ring connectivity pattern, where different rings (type- and size-wise) and different linkers connecting the rings (also type- and size-wise) are abstracted as “circles” and single “lines”, respectively. In addition, ring connectivity angles are disregarded when visualizing RCMFs graphically and determining their types. The RCMF definition is somewhat similar to the Oprea scaffold topology30,31 definition; however, it is more general (see details and discussion further below). Figure 3 demonstrates the RCMFs for two drugs, Risperidone and Doxazosin. In this example, the two drugs have different Murcko scaffolds and different “carbon-atom-only” frameworks but the same RCMF. Information on individual ring sizes and types, linker lengths, and angles between the rings is not preserved when representing RCMFs graphically. In contrast to Oprea scaffold topologies,30,31 fused, spiro, and bridge-head ring systems are represented as two or more merged circles, and the order of rings is pre-defined by letter codes (from A to F) in each RCMF type. For example, rings A, B and rings D, E represent a pair of 2-fused ring systems in both drugs (Figure 3). A complete list of graphical depictions of the RCMF types with 1-6 rings considered in this study is provided in Supporting Figure S1.


Figure 3. Chemical structures of the drugs Risperidone and Doxazosin, their Murcko scaffolds, “carbon-atom-only” frameworks, and RCMFs.

RCMF descriptors

In this study, we developed 3 main sets of RCMF descriptors applicable to each RCMF type, i.e. ring, linker, and angle descriptors, as well as a few special additional descriptors.

Ring descriptors

All single ring types (ring size < 9) were grouped into 28 general ring groups, shown in Supporting Table S1. In this classification, a two-letter ring coding scheme was adopted to describe each ring group. According to this classification, the Risperidone ring descriptors would be “5_type21;6AMM6AXX6B” or “5_type21;6BXX6AMM6A”, depending on which ring is considered to be the first ring A. If the RCMF type is symmetrical, all possible symmetrical RCMF descriptor strings are generated. They are then sorted alphabetically and the first on the list is used as the consensus RCMF descriptor string (all the other descriptor strings are kept as well). In the case of Risperidone, the “5_type21;6AMM6AXX6B” string comes first in A-Z alphanumeric sorting. “6AMM” corresponds to the first 2-fused ring system, where the “6A” ring code stands for a “6-member aliphatic ring with heteroatoms” and “MM” codes for a “6-member aliphatic ring with 1 exocyclic double bond”. The middle “6A” corresponds to the piperidine ring C, whereas “XX6B” denotes the second fused ring system, 1,2-benzoxazole. In addition, all ring sizes are also added to the RCMF descriptor string as the field “66656”: “5_type21;66656;6AMM6AXX6B”.

Linker descriptors

Each linker connecting two rings is shown as a single line in the RCMF graphical representation. Linkers are described by their size, i.e. the number of bonds in them. Linker size is determined as it would appear in the carbon-atom-only framework, ignoring all side bonds branching off the main linker connecting two rings. RCMF linkers are labelled by specifying the rings they connect. For example, one of the Risperidone linkers is abbreviated as “linkBC” and is 3 bonds long, which further extends the RCMF descriptor string: “5_type21;66656;6AMM6AXX6B;linkBC;3;linkCD;1”. If three arbitrary rings X, Y, Z are connected to the same branching point (Br), linker sizes are determined up to the branching point and abbreviated as “linkXtoBr”, “linkYtoBr”, and “linkZtoBr”, respectively. No linker descriptors are generated if 4 or more rings are connected to the same linker.

Angle descriptors

RCMF angle descriptors in this study are extracted purely from the 2D structure representation of each compound and may not correspond directly to real angle values derived from a 3D compound model. The angle vectors are coded by specifying the 3 rings which form an angle. In the Risperidone example, the angle vectors “ABC”, “BCD”, and “CDE” are considered, where “ABC” encodes the angle between the plane of the fused ring system of rings A and B and the individual ring C, “CDE” encodes the angle between the plane of the fused ring system of rings D and E and the individual ring C, and “BCD” encodes the angle between the individual rings B, C, and D. Rings forming the angles are always listed in alphabetical order. For Risperidone, angle “ABC” is set to 30°, angle “BCD” to 180° (para), and angle “CDE” to 72°. Thus, the Risperidone RCMF descriptor string becomes: “5_type21;66656;6AMM6AXX6B;linkBC;3;linkCD;1;angleABC;30;angleBCD;180;angleCDE;72”. Further examples demonstrating how angle descriptors are derived for individual rings and 2- and 3-fused ring systems are shown in Figure 4. No angle descriptors are generated for 4-6 fused ring systems. Angle values for bridge-head systems are assigned based on the presumed “3D” shape of the system.
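The way the descriptor string grows field by field, and the rule for choosing a consensus string among symmetrical alternatives, can be sketched as follows. The field order and the `;` separator are inferred from the Risperidone example only, the function names are hypothetical, and Python's default string ordering is used as a stand-in for the “A-Z alphanumeric sorting” mentioned in the text, which is an assumption.

```python
def build_descriptor(rcmf_type, ring_sizes, ring_codes, linkers=(), angles=()):
    """Join descriptor fields with ';' in the order seen in the Risperidone example.

    linkers and angles are sequences of (label, value) pairs such as
    ("BC", 3) or ("ABC", 30). This layout is an inference, not a specification.
    """
    fields = [rcmf_type, "".join(str(size) for size in ring_sizes), ring_codes]
    for label, bonds in linkers:
        fields += [f"link{label}", str(bonds)]
    for label, degrees in angles:
        fields += [f"angle{label}", str(degrees)]
    return ";".join(fields)


def consensus_descriptor(candidates):
    """Return (consensus, others): the first string in sorted order and the rest."""
    ordered = sorted(set(candidates))
    if not ordered:
        raise ValueError("at least one descriptor string is required")
    return ordered[0], ordered[1:]


# Values taken from the Risperidone example in the text.
risperidone = build_descriptor(
    "5_type21",
    [6, 6, 6, 5, 6],
    "6AMM6AXX6B",
    linkers=[("BC", 3), ("CD", 1)],
    angles=[("ABC", 30), ("BCD", 180), ("CDE", 72)],
)

# The two ring-descriptor strings that arise from the symmetrical RCMF type.
consensus, alternatives = consensus_descriptor(
    ["5_type21;6BXX6AMM6A", "5_type21;6AMM6AXX6B"]
)
print(risperidone)
print(consensus)
```

Sorting gives every symmetrical framework a single canonical string, so identical frameworks can be matched by plain string equality, while the retained alternatives remain available for similarity comparisons.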
A special flag is added at the end of the RCMF descriptor string to indicate the presence of a spiro or bridge-head system. As Risperidone does not contain any of these systems, the “NONE” flag is added: “5_type21;66656;6AMM6AXX6B;linkBC;3;linkCD;1;angleABC;30;angleBCD;180;angleCDE;72;NONE”.
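The layout of the descriptor string described above can be illustrated with a small parser. This is only a sketch under stated assumptions: the names `RcmfDescriptor` and `parse_rcmf` are hypothetical, the second and third fields are kept as opaque strings, and everything between them and the final flag is assumed to be a sequence of `key;value` pairs starting with `link` or `angle`.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RcmfDescriptor:
    rcmf_type: str            # first field, e.g. "5_type21"
    ring_block_1: str         # second field, kept opaque here (e.g. "66656")
    ring_block_2: str         # third field, kept opaque here (e.g. "6AMM6AXX6B")
    linkers: Dict[str, int] = field(default_factory=dict)   # "BC" -> 3
    angles: Dict[str, int] = field(default_factory=dict)    # "ABC" -> 30
    flag: str = "NONE"        # last field (spiro / bridge-head flag)


def parse_rcmf(descriptor: str) -> RcmfDescriptor:
    """Split a semicolon-separated RCMF descriptor string into its parts."""
    fields = [f.strip() for f in descriptor.split(";")]
    if len(fields) < 4:
        raise ValueError("expected at least type, two ring blocks and a flag")
    pairs = fields[3:-1]
    if len(pairs) % 2 != 0:
        raise ValueError("link/angle entries must come as key;value pairs")
    result = RcmfDescriptor(fields[0], fields[1], fields[2], flag=fields[-1])
    for key, value in zip(pairs[0::2], pairs[1::2]):
        if key.startswith("link"):
            result.linkers[key[len("link"):]] = int(value)
        elif key.startswith("angle"):
            result.angles[key[len("angle"):]] = int(value)
        else:
            raise ValueError(f"unrecognised descriptor key: {key}")
    return result


risperidone = parse_rcmf(
    "5_type21;66656;6AMM6AXX6B;linkBC;3;linkCD;1;"
    "angleABC;30;angleBCD;180;angleCDE;72;NONE"
)
print(risperidone.linkers, risperidone.angles, risperidone.flag)
```

Keeping linkers and angles in dictionaries keyed by ring letters makes it straightforward to compare two strings of the same RCMF type field by field.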

Figure 4. Examples demonstrating how angle descriptors are derived for individual rings and 2-3 ring fused systems.

All Chemetics compounds present in Nuevolution DELs are attached to a DNA tag via an oligo attachment linker. Information on the oligo-linker attachment point on a molecule can be very important when analysing and comparing compounds that show up in selection outputs (especially for small 2-4 ring ligands). This is addressed by attaching an “Lr” atom to each graphical representation of an RCMF type (Supporting Figure S2). However, this substantially increases the number of possible “Lr”-handle-containing RCMF types, and thus this option is currently enabled for 2-4 ring compounds only. Only linker-free RCMF types were considered further in this study.

Special RCMF types

Acyclic and 1-ring molecules are assigned to the “Acyclic” and “1-ring” RCMF types respectively. As they carry very little information at the RCMF level, they are not compared to any other RCMF types. RCMF descriptors for 1-ring compounds are used only for their classification based on ring type and size. The “7_rings” framework type includes all compounds with 7 rings. These compounds can be compared and grouped together with 6-ring compounds if the “minus-one-ring” option is enabled (described in further sections). We do not consider compounds with 8 or more rings to be “drug-like”, and thus they are not assigned to any specific RCMF type and are not compared to 2-6 ring RCMF types. In addition, the “Adamantanes” RCMF type includes all compounds containing adamantane or “adamantane-like” motifs. This RCMF type is not compared to other RCMF types because of the very special 3D shape of the adamantane moiety.

RCMF for building blocks

The concept of RCMFs, reflecting topology classes for full-size drug-like compounds, was further extended to cover library building blocks. In addition to the RCMF properties described above, RCMFs for BBs include extra information on the positioning of their reactive handle(s) on the framework and are thus classified separately (see Figure 5, Supporting Figures S3 and S4). Determination of the RCMF type for BBs starts with analysing their “core” structures (Figure 5).

Figure 5. A workflow for determination of RCMF types for building block “core” structures. See description of the workflow in the Methods section.

The term “core” structure is used throughout this paper to refer to fragments used in any DEL or combinatorial library, where each reacting handle is replaced by a “dummy” atom (see Supporting Methods and Figure S5 for more details). In short, each “dummy” atom is converted to a 4-member ring with “dummy” atoms (Stage I). Next, all heavy atoms (except dummy atoms) are replaced by carbons and all bonds by single bonds, followed by the Murcko scaffold generation procedure (Stages II and III). “Dummy” rings are converted back to the original “dummy” atoms (Stage IV). If a “dummy” atom is part of a ring, it is “pushed” outside the ring by one bond. Next, all linkers of 2 or more bonds are reduced to single-bond linkers and all –CH2–CH2– groups in all rings are reduced to a single –CH2– group (Stage V). Finally, “dummy” atoms are replaced by halogens (Stage VI) and InChIKeys are generated for this reduced structure (Stage VII). The InChIKeys are then compared to the reference set of InChIKeys known for each specific fragment RCMF type (Stage VIII), as provided in Supporting File S2. The RCMF type determination procedure for BBs is implemented in KNIME workflows.56-59

Next, RCMF descriptors for BBs are calculated for each determined BB RCMF type individually, following a “split-and-analyse” principle in which each fragment is split into sub-fragments according to a specific set of rules. These rules include extraction of linkers, splitting of fused ring systems into individual rings (preserving their aromaticity) to determine their type, etc. An example of the determination of fragment RCMF descriptors is shown in Supporting Figure S6. In total, 12 KNIME workflows were developed to determine angles in various types of fused ring systems and individual rings, 15 KNIME workflows to split fragments into individual rings, one KNIME workflow to determine individual ring types, and one KNIME workflow to determine linker sizes between the rings and reactive handle(s). Automatic fragment RCMF descriptor calculation protocols have been developed for all 0-3 ring capping and internal BB RCMF types, as shown in Table 3 and Supporting Figures S3 and S4. For larger BBs with 4 or more rings, RCMF descriptors are extracted manually, as the number of such fragments used in Nuevolution DELs is usually very small.

If a BB RCMF type is symmetrical, RCMF descriptor strings are generated for all symmetrical variants and the top string of the sorted list is used as the consensus RCMF descriptor string for the BB.
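The consensus rule for symmetrical BBs can be sketched in a few lines. The function name `consensus_descriptor` and the example strings are hypothetical, and plain lexicographic sorting is an assumption, since the text does not state which sort order is used.

```python
def consensus_descriptor(variants):
    """Return the first string of the sorted list of symmetry-equivalent variants.

    Assumption: ordinary lexicographic string ordering.
    """
    variants = list(variants)
    if not variants:
        raise ValueError("at least one descriptor variant is required")
    return sorted(variants)[0]


# Hypothetical descriptor fragments for one BB, written with its two
# symmetry-equivalent ring labellings swapped.
variant_1 = "linkAtoBr;2;linkBtoBr;1"
variant_2 = "linkAtoBr;1;linkBtoBr;2"
print(consensus_descriptor([variant_1, variant_2]))
```

Because the result does not depend on the order in which the variants are generated, two BBs that differ only in ring labelling end up with the same consensus string.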


Table 3. The number of RCMF types in capping and internal BBs. All theoretical framework types are considered for 0-3 rings capping and 0-2 ring internal BBs. “*” indicates RCMF types for fragments where additional fragment RCMF types are possible. ACD database analysis suggests that the number of reagents in these “additional” fragment RCMF types will be very small.

| Nr. of rings | Nr. of fragment RCMF types for mono-functional (capping) BBs | Nr. of fragment RCMF types for bi-functional (internal) BBs |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 1 | 2 |
| 2 | 4 | 14 |
| 3 | 19 | 52* |
| 4 | 53* | 94* |

In addition, “minus-one-ring” RCMF descriptors for fragments are generated for all BBs. This is done by virtually “opening” or “removing” one ring at a time from a BB and modifying its RCMF descriptor string accordingly. If the same ring could be “opened” in several ways, the largest possible part of the ring is removed while preserving the integrity of the remaining structure (see Supporting Figure S7), unless equal alternatives for the ring “opening” exist; in that case, both “opening” alternatives are employed. “Minus-one-ring” RCMF descriptor strings for fragments contain extra information regarding the removed ring, i.e. whether it was “fused” or “not-fused” and whether the ring was “opened” or “deleted”. Removal of any terminal ring is considered a “deletion” unless the ring has a reactive handle attached to it (therefore Ring A is labelled as “opened” and not “deleted”, as shown in Supporting Figure S7).


RCMF descriptors for full-size compounds

The number of theoretically possible RCMF types for capping and internal BBs with 0-3 rings is rather small (Table 3). Thus, all possible combinations of BB RCMFs can be considered for any arbitrary combinatorial library setup. For example, 3190 combinations of internal and capping BB RCMFs are possible for a dimer DEL setup if library compounds with 0-6 rings are considered (Figure 2 and Supporting Table S2). This number is further reduced to 649 BB RCMF combinations in a virtual “cap-cap” dimer library (due to setup symmetry). The number of possible BB RCMF combinations reaches ca. 19500 in a linear trimer DEL setup for compounds with up to 6 rings. We examined all these BB RCMF combinations and implemented them in an efficient rule-based algorithm written in Perl. The Perl code has a tree-like architecture to ensure fast access to any combination of BB RCMFs (see the flow diagram in Figure 6 and Supporting Figure S8). Determination of RCMF descriptors for full-size molecules in DELs and virtual combinatorial libraries is therefore based on the known RCMF descriptors of the library BBs. It takes less than 5 minutes to generate RCMF descriptors for a random set of one million DEL fragment combinations on an Intel Core i7 quad-core desktop machine, and the calculations can easily be parallelized and distributed over multiple processors for a further performance increase. In addition, the algorithm generates “minus-one-ring” RCMF descriptor strings for each library compound using the “minus-one-ring” RCMF descriptors of the corresponding fragments. “Minus-one-ring” RCMF descriptor strings ensure that full-size compounds with different numbers of rings and RCMF types can be compared with each other (Supporting Figure S6).

[Figure 6: flow diagram. Box labels recoverable from the diagram: identification of reactive handles in fragments (ACD, ZINC, vendor collections) and generation of “tags” coding reactive handles (“Tag Generator”, in-house tool); replacement of reactive handles in fragments by “dummy” atoms based on the corresponding tags, i.e. generation of “core structures” (“Core Structure Generator”, in-house tool); determination of RCMF type and generation of RCMF descriptors for each fragment “core structure”, for mono- and bi-functional reagents (multiple in-house KNIME workflows); database of fragments with assigned RCMF types and descriptors; diversity analysis of available reagents on the RCMF level and exhaustive exploration of RCMF types for mono- and bi-functional fragments; design of DELs and virtual combinatorial libraries; efficient rule-based system for combining RCMFs of fragments, which converts any combination of fragment RCMF descriptors to RCMF descriptors of the full-size compounds (without chemical enumeration); database of RCMF descriptors for compounds in DELs or virtual combinatorial libraries, including information on all reagent combinations which, if combined, form a specific RCMF type and descriptors present in the library; DNA-encoded library, its selection output or a virtual combinatorial library; mapping of DELs or selection outputs onto the 2D RCMF chemical space map; clustering of selection data set compounds based on their RCMF descriptors and “counts” using various similarity thresholds; generation of RCMF descriptors for query structures (from MedChem projects, DELs, known actives from the literature); efficient searching for analogues of query structures in DELs and large virtual combinatorial libraries.]

Figure 6. RCMF approach flow diagram demonstrating key steps of the method.

RCMF analysis of virtual combinatorial libraries

Core structures extracted from 13 ACD reagent groups were submitted to the RCMF descriptor generation procedure. Unique RCMF descriptor strings for ACD reagents were identified within each reagent group, and one representative reagent for each unique RCMF descriptor string was used further (Table 4). All combinations of representative reagents were generated and submitted to the RCMF descriptor generation procedure for full-size compounds for each virtual library Lib1-Lib10. For example, 188 thiophenols were selected for pos1 and 3511 NAS reagents for pos2 in Lib8 to form 660068 dimers, which were subsequently analysed to derive 646666 unique RCMF descriptor strings (Table 4). These 646666 unique RCMFs represent the framework chemical space occupied by ca. 4.8 billion Lib8 dimers.

Table 4. Estimation of Murcko scaffolds and RCMFs in virtual dimer libraries.

| Library | Nr. of unique pos1 InChIKeys for their Murcko scaffolds with a “dummy” ring | Nr. of unique pos2 InChIKeys for their Murcko scaffolds with a “dummy” ring | Theoretical number of Murcko scaffolds in a library | Nr. of unique RCMF descriptor strings for pos1 fragments | Nr. of unique RCMF descriptor strings for pos2 fragments | Theoretical number of RCMF descriptor strings in a library | Real number of RCMF descriptor strings in a library |
|---|---|---|---|---|---|---|---|
| Lib1 | 99,960 | 244,904 | 24,480,603,840 | 24,161 | 68,361 | 1,651,670,121 | 1,350,614,414 |
| Lib2 | 1,106 | 244,904 | 270,863,824 | 505 | 68,361 | 34,522,305 | 30,823,018 |
| Lib3 | 3,098 | 244,904 | 758,712,592 | 1,256 | 68,361 | 85,861,416 | 75,505,247 |
| Lib4 | 34,940 | 244,904 | 8,556,945,760 | 10,937 | 68,361 | 747,664,257 | 610,675,684 |
| Lib5 | 12,480 | 244,904 | 3,056,401,920 | 3,509 | 68,361 | 239,878,749 | 233,295,552 |
| Lib6 | 29,915 | 12,480 | 373,339,200 | 5,192 | 3,511 | 18,229,112 | 17,958,020 |
| Lib7 | 18,091 | 12,480 | 225,775,680 | 5,391 | 3,511 | 18,927,801 | 18,227,950 |
| Lib8 | 385 | 12,480 | 4,804,800 | 188 | 3,511 | 660,068 | 646,666 |
| Lib9 | 14,176 | 145,193 | 2,058,255,968 | 5,407 | 41,520 | 224,498,640 | 199,026,312 |
| Lib10 | 5,427 | 58,666 | 318,380,382 | 1,571 | 14,430 | 22,669,530 | 21,555,378 |
| Total | | | 40,104,083,966 | | | 3,044,581,999 | |

RCMF descriptors of ChEMBL drugs and PubChem compounds

RCMF types for ChEMBL drugs and PubChem compounds were calculated using the protocol outlined in Supporting Figure S9. In brief, Murcko scaffolds were first generated for all compounds. Next, all bonds in the Murcko scaffolds were replaced by single bonds and all heavy atoms were replaced by carbons. Murcko scaffolds were then extracted again from the “carbon-atom-only” compounds. All linkers connecting the rings were reduced to a single-bond linker and all –CH2–CH2– groups in the rings were reduced to a single –CH2– group, followed by the InChIKey generation procedure. Finally, the generated InChIKeys were compared to a pre-defined set of InChIKeys, each corresponding to a specific RCMF type (Supporting File S3). The protocol was implemented in a KNIME workflow.56-59

As the RCMF descriptor generation procedure for full-size compounds which are not members of combinatorial libraries is currently under development, RCMF descriptors were determined for ChEMBL drugs only (and only in a semi-automatic way). First, Murcko scaffolds were extracted from ChEMBL drugs and 3432 unique Murcko scaffolds were identified. Scaffolds with 1-3 rings were converted to “core” structures by replacing a random scaffold hydrogen with a dummy “U” atom and were then treated as “reagents”. Their RCMF descriptors were generated using the same protocol for reagents as described above and converted to RCMF descriptors corresponding to full-size structures. Murcko scaffold InChIKeys for ChEMBL drugs with 4-6 rings were compared for identity to all Murcko scaffold InChIKeys extracted from the four Nuevolution DELs, and RCMF descriptors were assigned accordingly. The remaining non-matching drug scaffolds were inspected manually to derive their RCMF descriptors.
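The scaffold extraction step on a “carbon-atom-only” structure depends only on connectivity, so it can be illustrated on a plain graph without a cheminformatics toolkit. The sketch below is not the KNIME protocol used in the study: the function `prune_to_framework` and the atom numbering are hypothetical, and the linker and ring-size reduction steps are deliberately left out.

```python
def prune_to_framework(bonds):
    """Iteratively strip terminal atoms from an undirected molecular graph.

    `bonds` is an iterable of (atom, atom) pairs. What remains are the rings
    and the linkers between them; an acyclic input is reduced to nothing.
    """
    neighbours = {}
    for a, b in bonds:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    terminal = [atom for atom, nbrs in neighbours.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in neighbours:
            continue  # already removed in an earlier step
        for nbr in neighbours.pop(atom):
            neighbours[nbr].discard(atom)
            if len(neighbours[nbr]) <= 1:
                terminal.append(nbr)  # neighbour has become terminal
    return {frozenset((a, b)) for a, nbrs in neighbours.items() for b in nbrs}


# Hypothetical example: two 6-member rings joined by a 2-bond linker (0-6-7),
# an ethyl side chain on the linker atom (6-13-14) and a methyl on ring 1 (3-15).
ring_1 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
ring_2 = [(7, 8), (8, 9), (9, 10), (10, 11), (11, 12), (12, 7)]
example_bonds = ring_1 + ring_2 + [(0, 6), (6, 7), (6, 13), (13, 14), (3, 15)]
framework = prune_to_framework(example_bonds)
print(len(framework))
```

In the example the side chain and the ring substituent are removed, while both rings and the linker between them are kept.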

Drug-likeness of DELs and virtual dimer libraries

Drug-likeness of Nuevolution DELs and non-designed virtual dimer libraries was estimated based on Rule-of-5 (Ro5)60 compliance of library compounds and histograms of compound physicochemical properties such as MW, AlogP, HBA, HBD, RotB, and PSA. These properties were calculated for a random subset of 1 million enumerated compounds from each library. Although the theoretical size of some virtual libraries exceeds a trillion compounds, a random subset of one million compounds is sufficient to estimate library drug-likeness with a 99% confidence level and a 0.2% confidence interval.37,61 The oligo-linker attachment motif was stripped from all molecules in DELs before calculation of physicochemical properties. All properties were calculated in the Canvas program, version 2.8 (Schrödinger).62 In addition, ring profiles for all libraries were estimated based on the number of rings in the corresponding fragments in each library position.
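The sampling claim can be sanity-checked with the standard margin-of-error formula for a proportion. This is an illustrative calculation, not necessarily the one used in the cited references: it assumes simple random sampling, the normal approximation, the worst case p = 0.5, and a population large enough that no finite-population correction is needed. The function name `worst_case_margin` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def worst_case_margin(sample_size, confidence):
    """Half-width of the confidence interval for a proportion at p = 0.5."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided critical value
    return z * sqrt(0.25 / sample_size)


margin = worst_case_margin(1_000_000, 0.99)
print(f"half-width at 99% confidence: {margin:.3%}")
```

Under these assumptions the half-width is roughly 0.13%, which is within the quoted 0.2% if that figure is read as a plus-or-minus margin.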

Murcko scaffold analysis of DELs and virtual combinatorial libraries

To estimate the number of Murcko scaffolds in each virtual library, Murcko scaffolds were extracted from the library fragments in a special way. First, reactive handles were converted to 4-member rings with “dummy” atoms and Murcko scaffolds were generated. Next, InChIKeys were generated for all Murcko scaffolds with the “dummy” rings and the subset of unique InChIKeys was identified. These unique InChIKeys correspond to all the different substructures which reagents may “donate” to form different Murcko scaffolds in enumerated library molecules. Thus, the maximum theoretical number of Murcko scaffolds can be estimated by multiplying the numbers of unique InChIKeys in all library positions. For example, if there are 100 carboxylic acids in pos1 and 100 amines in pos2 in some library X, the theoretical size of library X is 10⁴ dimers. If there are 30 unique InChIKeys for “dummy”-ring-containing scaffolds in pos1 and 40 in pos2, the maximum theoretical number of Murcko scaffolds in library X is 1200. However, identical scaffolds are often formed due to library setup symmetry and other factors. In our experience, the real number of Murcko scaffolds is usually in the range of 90-96% of the theoretical estimate in multi-million DELs (Table 5). The described scaffold estimation procedure is fast, does not require structure enumeration, provides an accurate estimate of the number of Murcko scaffolds in a library, and is used in-house in the DEL design phase. The Murcko scaffold estimation protocol is implemented as a KNIME workflow.56-59
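The multiplication rule above is simple enough to state as code. The sketch reproduces the library X example and the Lib8 row of Table 4; the function name `theoretical_scaffold_count` is hypothetical.

```python
from math import prod


def theoretical_scaffold_count(unique_keys_per_position):
    """Upper bound on Murcko scaffolds: product of unique InChIKey counts."""
    return prod(unique_keys_per_position)


# Library X from the text: 30 unique keys in pos1, 40 in pos2.
library_x = theoretical_scaffold_count([30, 40])

# Lib8 from Table 4: 385 unique pos1 keys, 12,480 unique pos2 keys.
lib8_scaffolds = theoretical_scaffold_count([385, 12_480])

# The same rule applied to unique RCMF descriptor strings of Lib8 fragments.
lib8_rcmf_strings = theoretical_scaffold_count([188, 3_511])

print(library_x, lib8_scaffolds, lib8_rcmf_strings)
```

Because only per-position counts are multiplied, the estimate needs no structure enumeration, which is what makes it usable at the library design stage.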

Table 5. Diversity and library setup for Nuevolution DELs described in this study.

| Nuevolution DEL | Pos1 BBs | Pos2 BBs | Pos3 BBs | DEL size | Estimated number of Murcko scaffolds in DEL | Actual number of unique Murcko scaffolds in DEL | Unique RCMF descriptor strings in DEL | DEL compounds per Murcko scaffold | DEL compounds per RCMF descriptor string |
|---|---|---|---|---|---|---|---|---|---|
| Lia108 | 11,279 | 13,440 | - | 151,589,760 | 36,746,400 | 35,439,655 | 12,091,354 | 4.28 | 12.54 |
| Lia122 | 384 | 329 | 960 | 121,282,560 | 8,986,978 | 8,360,409 | 1,964,124 | 14.51 | 61.75 |
| Lia123 | 8,807 | 14,496 | - | 127,666,272 | 13,888,200 | 12,956,245 | 3,798,315 | 9.85 | 33.61 |
| Lia126 | 5,814 | 18,513 | - | 107,634,582 | 12,377,588 | 11,470,337 | 3,181,007 | 9.38 | 33.84 |
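The derived columns of Table 5 follow directly from the counts, which the short check below illustrates. The variable and function names are hypothetical; the numbers are copied from the table.

```python
from math import prod

# name: (BB counts per position, DEL size, estimated scaffolds,
#        actual scaffolds, unique RCMF descriptor strings)
TABLE_5 = {
    "Lia108": ((11_279, 13_440), 151_589_760, 36_746_400, 35_439_655, 12_091_354),
    "Lia122": ((384, 329, 960), 121_282_560, 8_986_978, 8_360_409, 1_964_124),
    "Lia123": ((8_807, 14_496), 127_666_272, 13_888_200, 12_956_245, 3_798_315),
    "Lia126": ((5_814, 18_513), 107_634_582, 12_377_588, 11_470_337, 3_181_007),
}


def derived_columns(row):
    """Recompute DEL size, compounds per scaffold / RCMF string, and the
    ratio of actual to estimated Murcko scaffolds for one table row."""
    bbs, size, estimated, actual, rcmf = row
    return {
        "size_from_bbs": prod(bbs),
        "compounds_per_scaffold": round(size / actual, 2),
        "compounds_per_rcmf": round(size / rcmf, 2),
        "actual_vs_estimated": actual / estimated,
    }


for name, row in TABLE_5.items():
    print(name, derived_columns(row))
```

The DEL size is the product of the BB counts, and the two “compounds per” columns are the DEL size divided by the number of unique scaffolds or unique RCMF descriptor strings.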

Comparison of RCMF descriptor strings


Comparison of calculated compound descriptors is a routine operation in cheminformatics, allowing one to assess similarity, diversity, and drug-likeness, apply filtering criteria, etc. To extend the applicability of the RCMF approach, RCMF descriptor string comparison methods have been implemented in this study as well. In brief, a pair of RCMF descriptor strings sharing the same RCMF type is aligned and the RCMF descriptors are compared pairwise. In case of a descriptor pair mismatch, a penalty score is assigned. To compare different RCMF types, the “minus-one-ring” option is used. All original and “minus-one-ring” RCMF descriptor strings are first compared by RCMF type. If the types match, the RCMF descriptor strings are compared pairwise and an additional penalty score is assigned for the ring “removal”. Further details on the comparison protocol can be found in Supporting Methods.
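A minimal sketch of the pairwise comparison idea is given below. It is a simplification, not the published protocol: the actual penalty weights are described in Supporting Methods, whereas here every mismatching field simply costs the same hypothetical `mismatch_penalty`, and the “minus-one-ring” handling is omitted. The second descriptor string is a made-up variant of the Risperidone string.

```python
def penalty_score(descriptor_a, descriptor_b, mismatch_penalty=1.0):
    """Compare two RCMF descriptor strings field by field.

    Returns None when the RCMF types (first field) differ, otherwise the
    summed penalty over all mismatching fields.
    """
    fields_a = descriptor_a.split(";")
    fields_b = descriptor_b.split(";")
    if fields_a[0] != fields_b[0]:
        return None  # different RCMF types are not directly comparable
    if len(fields_a) != len(fields_b):
        raise ValueError("strings of the same RCMF type should align")
    return sum(
        mismatch_penalty
        for a, b in zip(fields_a[1:], fields_b[1:])
        if a != b
    )


risperidone = ("5_type21;66656;6AMM6AXX6B;linkBC;3;linkCD;1;"
               "angleABC;30;angleBCD;180;angleCDE;72;NONE")
# Hypothetical analogue: shorter B-C linker and a different ABC angle.
analogue = ("5_type21;66656;6AMM6AXX6B;linkBC;2;linkCD;1;"
            "angleABC;60;angleBCD;180;angleCDE;72;NONE")
print(penalty_score(risperidone, analogue))
```

A weighted variant would replace the constant penalty with a per-descriptor weight, for example a larger cost for a ring-type mismatch than for a one-bond difference in linker length.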

RCMF descriptor based clustering

Clustering of compounds based on the developed RCMF approach can be done in multiple ways using the overall weighted penalty score (described in Supporting Methods) as a distance or similarity measure between RCMF descriptor strings. In the current study, we implemented a “similarity-network” based clustering approach in which all unique RCMF descriptor strings (and thus all compounds belonging to them) are considered as nodes. An edge is created between any two RCMF nodes if the overall weighted penalty score between them is below a certain threshold. Once all edges have been assigned, the resulting network is evaluated and all individual sub-networks are found. Sub-network connectivity patterns, size and other parameters are evaluated according to pre-set rules (not disclosed). All qualifying sub-networks are ranked and assigned a cluster ID. Next, the overall weighted penalty score threshold is slightly increased and the procedure is repeated. This gradual increase of the threshold ensures hierarchical tree-like clustering of RCMF clusters (and of the compounds belonging to them). This method groups together RCMF types which might not be comparable otherwise. For example, a node for a 4-ring RCMF may be connected to nodes representing 3-ring and 5-ring RCMF types, allowing them to belong to the same network at a certain threshold level. The RCMF clustering approach is routinely used at Nuevolution for the analysis of selection outputs resulting from DEL screens in the case of “flat” ranking (examples are not shown).
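The network construction and the threshold sweep can be sketched with a union-find structure. This is only an illustration: the undisclosed sub-network qualification and ranking rules are not modelled, and the node names and penalty values are hypothetical.

```python
from itertools import combinations


def sub_networks(nodes, penalty, threshold):
    """Connected components of the graph whose edges join node pairs with a
    penalty score below `threshold`. `penalty` maps frozenset pairs to scores;
    missing pairs are treated as not comparable."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in combinations(nodes, 2):
        score = penalty.get(frozenset((a, b)))
        if score is not None and score < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    # Largest sub-networks first; ties broken by member names.
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))


def hierarchical_levels(nodes, penalty, thresholds):
    """Repeat the component search for gradually increasing thresholds."""
    return [(t, sub_networks(nodes, penalty, t)) for t in sorted(thresholds)]


# Hypothetical nodes: 3-, 4- and 5-ring RCMF strings plus an unrelated one.
nodes = ["rcmf_3ring", "rcmf_4ring", "rcmf_5ring", "rcmf_other"]
penalty = {
    frozenset(("rcmf_3ring", "rcmf_4ring")): 1.0,
    frozenset(("rcmf_4ring", "rcmf_5ring")): 2.0,
    frozenset(("rcmf_3ring", "rcmf_5ring")): 5.0,
}
levels = hierarchical_levels(nodes, penalty, [1.5, 2.5])
for threshold, clusters in levels:
    print(threshold, clusters)
```

In the example the 3-ring and 5-ring nodes end up in the same sub-network at the higher threshold only because both are linked to the 4-ring node, which mirrors the bridging behaviour described in the text.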

Mapping chemical space with RCMFs

Although the number of all possible theoretical RCMF descriptor strings might be huge, it is much lower than the estimated number of 10⁶⁰ drug-like molecules.63 For example, RCMF type “3_type1” corresponds to a 3-ring framework in which rings A and B are fused and connected to an individual ring C by a linker. The number of theoretical RCMF descriptor strings for this framework type (if we consider only 20 different ring types and only linkers of 1-10 bonds) would be 20 × 20 × 20 × 10 = 8·10⁴. This number could easily reach trillions for 6-ring RCMF types due to combinatorics and the increasing topological complexity of 6-ring frameworks. Nevertheless, these estimates are much lower than 10⁶⁰, and the RCMF descriptor abstraction level can be adjusted. For example, decreasing the RCMF descriptor specification level might help keep the number of possible RCMF clusters within a manageable range for human interpretation and visualization. This opens an attractive opportunity to arrange abstracted RCMF clusters on a predefined RCMF chemical space map. Libraries of compounds can then be projected onto this map to assess their diversity, similarity and space coverage. To achieve this, we adopted an RCMF descriptor complexity reduction scheme, which is described in detail in Supporting Methods. The constructed 2D RCMF map (306 x 100 cells) represents 30585 abstracted RCMF clusters, with 27, 4158, 8398, 10515, 7144, and 337 clusters for 1-6 ring compounds respectively, plus 6 additional clusters (Supporting File S4). The 2D RCMF chemical space heatmaps were built using the ClustVis tool.64
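The counts quoted above can be reproduced with a few lines of arithmetic. The variable names are hypothetical; all numbers come from the text.

```python
# "3_type1": three rings with 20 possible ring types each, one linker of 1-10 bonds.
ring_types, linker_sizes = 20, 10
strings_3_type1 = ring_types ** 3 * linker_sizes  # 20 x 20 x 20 x 10

# Abstracted RCMF clusters on the 2D map, by number of rings.
clusters_by_ring_count = {1: 27, 2: 4158, 3: 8398, 4: 10515, 5: 7144, 6: 337}
additional_clusters = 6
total_clusters = sum(clusters_by_ring_count.values()) + additional_clusters

# Size of the 306 x 100 grid that holds them.
grid_cells = 306 * 100
unused_cells = grid_cells - total_clusters

print(strings_3_type1, total_clusters, grid_cells, unused_cells)
```

The per-ring-count figures add up to the 30585 clusters reported for the map, which fits into the 306 x 100 grid with a few cells to spare.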


RESULTS

A novel RCMF approach for the analysis of small-molecule “drug-like” chemical space is presented in this study (Figure 6). The developed RCMF approach was validated on four in-house DELs (>5·10⁸ compounds), 10 virtual combinatorial libraries (>7.4·10¹² virtual dimers), the PubChem database (>8.9·10⁶ compounds) and the ChEMBL drug set (ca. 6000 drugs). Drug-likeness assessment demonstrated that >83% of compounds in all three Chemetics dimer libraries are Ro5 compliant, whereas 86% of trimers in Lia122 have no or one Ro5 violation (mainly due to MW), as shown in Table 6 and Supporting Figures S10 and S11. Ro5 compliance is lower (ca. 63%) in non-designed virtual dimer libraries, as only MW and ring cut-offs were applied for ACD fragments (Table 6). The majority of dimers in virtual libraries have 1-5 rings, as fragments with a maximum of 3 rings were allowed in each position in the virtual library setups. Similarly, >98% of Lia123 and Lia126 compounds have 1-5 rings, whereas >97% of Lia108 is composed of 1-6 ring dimers. Compounds with 2-6 rings make up >94% of the trimer library Lia122.

Table 6. Distribution of rings and Ro5 compliance of compounds in Nuevolution DELs and 10 virtual dimer libraries.

| Rings | Lia108 | Lia122 | Lia123 | Lia126 | Lib1 | Lib2 | Lib3 | Lib4 | Lib5 | Lib6 | Lib7 | Lib8 | Lib9 | Lib10 | Lib1-Lib10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00% | 0.20% | 0.07% | 0.22% | 0.53% | 0.38% | 0.18% | 0.31% | 0.00% | 0.00% | 0.00% | 0.00% | 0.09% | 0.00% | 0.39% |
| 1 | 0.08% | 3.21% | 1.23% | 4.26% | 6.35% | 6.70% | 5.42% | 5.01% | 2.24% | 0.00% | 0.00% | 0.00% | 2.66% | 0.07% | 5.33% |
| 2 | 7.27% | 2.03% | 17.20% | 22.39% | 24.97% | 32.74% | 32.12% | 24.76% | 18.78% | 5.87% | 15.75% | 10.08% | 19.69% | 11.98% | 24.02% |
| 3 | 20.69% | 11.99% | 35.90% | 38.02% | 38.53% | 41.31% | 42.52% | 40.05% | 39.18% | 30.48% | 41.07% | 41.34% | 41.42% | 42.11% | 39.23% |
| 4 | 31.29% | 29.09% | 31.25% | 26.38% | 24.21% | 16.57% | 17.37% | 24.63% | 30.47% | 41.97% | 33.58% | 40.90% | 29.72% | 40.06% | 25.27% |
| 5 | 25.60% | 32.89% | 10.59% | 7.72% | 5.09% | 2.24% | 2.33% | 4.96% | 8.55% | 19.11% | 8.89% | 7.67% | 6.06% | 5.66% | 5.41% |
| 6 | 18.06% | 1.53% | 10.95% | 0.95% | 0.32% | 0.06% | 0.06% | 0.28% | 0.78% | 2.57% | 0.71% | 0.02% | 0.36% | 0.12% | 0.35% |

| "Rule-of-five" violations | Lia108 | Lia122 | Lia123 | Lia126 | Lib1 | Lib2 | Lib3 | Lib4 | Lib5 | Lib6 | Lib7 | Lib8 | Lib9 | Lib10 | Lib1-Lib10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 83.9% | 36.2% | 85.6% | 83.9% | 66.8% | 63.3% | 61.7% | 58.8% | 57.6% | 69.3% | 61.1% | 47.9% | 52.3% | 66.3% | 63.1% |
| 1 | 13.6% | 50.5% | 11.9% | 13.6% | 26.1% | 31.2% | 29.8% | 31.9% | 32.1% | 24.6% | 29.5% | 37.6% | 37.3% | 29.0% | 28.8% |
| 2 | 2.5% | 13.3% | 2.6% | 2.5% | 7.1% | 5.5% | 8.5% | 9.3% | 10.3% | 6.2% | 9.4% | 14.6% | 10.4% | 4.7% | 8.1% |

A universe of RCMFs

Analysis of RCMF types in DELs, virtual combinatorial libraries, and PubChem led to the surprising finding that there is a relatively small number of RCMF types for “drug-like” compounds with up to 6 rings. We found that >99% of all small “drug-like” molecules analyzed can be roughly grouped into only 452 RCMF types, as graphically depicted in Supporting Figure S1. PubChem analysis revealed that ca. 80% of PubChem compounds belong to 8 main RCMF classes, i.e. “2_type2” (27.1%), “1_type1” (15.5%), “3_type3” (12.5%), “3_type1” (10.7%), “2_type1” (4.7%), “4_type6” (3.8%), “4_type1” (3.5%) and acyclic (3.2%). 95% of PubChem compounds can be represented by 27 RCMF types, whereas 99% of PubChem compounds can be assigned to just 80 RCMF types. This is consistent with the earlier PubChem analysis results using scaffold topologies as described by Oprea and co-workers.30,31 Further analysis of ChEMBL drugs (5069 drugs with 2-6 rings) demonstrated that 3391 unique Murcko scaffolds and 2559 unique RCMF descriptor strings could be extracted and grouped into 120 general RCMF types. We also found that there is a rather limited number of RCMF types for fragments, i.e. 25 RCMF types for capping BBs with up to 3 rings and 69 RCMF types for internal reagents with up to 3 rings (Supporting Figures S3 and S4).

RCMF chemical space maps

The diversity of DELs and combinatorial libraries can be assessed in many ways.37,50,52,65,66 Typically, a small fraction of the library is enumerated and used to assess overall library diversity. Less often, an exhaustive enumeration of all library compounds is undertaken. In this work, we assessed the diversity of multi-million DELs and multi-trillion virtual dimer libraries at both the Murcko scaffold and the RCMF level while avoiding exhaustive library enumeration. The more unique Murcko scaffolds are present in a library, the higher its structural diversity. Similarly, libraries with a high number of unique RCMF descriptor strings are more structurally diverse than libraries of the same size with a lower number of unique RCMF descriptor strings. Table 5 shows the estimated and real number of Murcko scaffolds in each Nuevolution DEL. A small ratio of DEL molecules to Murcko scaffolds in a DEL clearly demonstrates the high structural diversity of Nuevolution libraries, with Lia108 being the most structurally diverse. This ratio is higher for PubChem compounds with 1-6 rings (ca. 12.7). In contrast, the ratio of dimers to their Murcko scaffolds in the non-designed virtual libraries Lib1-Lib10 is 100-1000, suggesting high similarity of compounds and very dense coverage of the chemical space (Tables 2 and 4).


RCMF diversity was assessed in DELs and in the virtual combinatorial libraries Lib1-Lib10. Compared to the number of Murcko scaffolds in each library, the number of unique RCMF descriptor strings was much lower, demonstrating that RCMFs, even at their current most detailed description level, represent a higher level in the structural classification hierarchy. There are on average 2.9-4.3 Murcko scaffolds per RCMF descriptor string in DELs (Table 5) and 7-20 scaffolds per RCMF descriptor string in virtual Lib1-Lib10 (Tables 2 and 4). Nevertheless, the number of unique RCMF descriptor strings is still in the millions, which is too many for quick and interactive visualization of the libraries. A further RCMF descriptor generalization scheme was therefore applied to build a uniform 2D RCMF chemical space map with 30585 clusters. All libraries were then projected on this map, and each cluster was coloured according to its occupancy by library compounds or Murcko scaffolds (Figure 7, Supporting Figures S12 and S13). The RCMF maps demonstrate the high structural and topological diversity of DEL compounds, which cover various areas of this global map (Figure 7 and Supporting Figure S12). In addition, the maps show that different DELs occupy different clusters, highlighting high structural and topological diversity of compounds across different DELs as well. The analysis also showed that the trimer library Lia122 occupies much smaller areas of RCMF space than the dimer DELs (Supporting Figure S12, see panel D vs. panels A, B, and C). Finally, we combined all four DELs and mapped them together. Figure 7 demonstrates that 18661 clusters (>61%) are occupied by at least one compound from the DELs used in this study. The absence of DEL compounds in ca. 2/3 of the unoccupied clusters could be explained by the presence of 4-6 fused ring systems in their RCMF cluster definition, by rare RCMF types where the linker is attached to an sp3 carbon connecting both rings in a fused ring system, and by 4- and 5-ring framework types where all or all but one of the rings are non-aromatic. In addition, the majority of uncovered 2-ring framework


clusters contain either 8-member or larger individual rings or rings with two exocyclic double bonds in ortho or


Figure 7. Projection of four Nuevolution DELs on 2D RCMF map. Cluster colour indicates the number of individual DEL compounds according to spectrum. Red colour indicates that there are more than 10,000 DEL compounds per cluster. Colourless clusters have no DEL representative compounds.


para positions. Compounds belonging to these uncovered clusters are not of great interest from the drug discovery perspective. RCMF maps could also be used for quantitative diversity comparison across different libraries. This could be achieved by computing, for each cluster, the ratio between the numbers of compounds from each library that occupy it. In this study, we compared the structural diversity of Lia108 to Lia126 (Figure 8), and of the trimer Lia122 to the dimer Lia123 (Supporting Figure S14). The results indicate that Lia108 and Lia126 occupy substantially different areas of the RCMF map and that their overlap is only partial. In contrast, Lia123 dominates RCMF map coverage, and Lia122 occupies only a few areas not addressed by Lia123. Similarly, the developed RCMF maps could be used to map Murcko scaffolds. To this end, we repeated the cartography process by projecting all extracted Murcko scaffolds for each DEL (Supporting Figure S15). The distribution of DEL Murcko scaffolds on the maps is very broad, and the most populated clusters are very different across libraries. Again, the trimer Lia122 occupied a much narrower area than the dimer DELs (5060 Lia122 clusters vs. 14000-14500 dimer DEL clusters). This is primarily attributed to the smaller number of BBs used in the trimer library setup. Lastly, we also mapped the virtual combinatorial libraries Lib1-Lib10 on the global RCMF map (Supporting Figure S13). Libraries Lib1 and Lib4 appear to cover almost the whole map (78.7% and 83.0% coverage, respectively), followed by Lib2, Lib3, Lib5, and Lib9. Virtual libraries Lib6, Lib7, Lib8, and Lib10 occupy only a few cluster areas on the RCMF map. Interestingly, there are a few cluster “lines” not occupied by any virtual library compounds. These clusters represent RCMF types which, by design, cannot be accessed by dimer libraries. For example, ring connectivity patterns for 5-ring compounds with RCMF types 2, 17-20, 33, 40, 49, 52, 83 and 6-ring compounds with RCMF types 4, 5, 18, 28, 36, 40, 41, 49, 51, 62, 63, 70, 92, 159, 160, 163, 189, 293, and 328 cannot be synthesized


by combining BBs with maximum 3 rings in a dimer setup (Supporting Figure S1). Either 4-ring BBs should be allowed


Figure 8. Comparing diversity of Nuevolution Lia108 and Lia126 DELs. RCMF clusters occupied mostly by Lia108 compounds are shown in red, whereas framework clusters occupied mainly by Lia126 compounds are shown in blue. Yellow indicates framework clusters occupied by both libraries in approximately equal amounts. Uncoloured framework clusters are not occupied by either DEL.


in the design, or a trimer library setup should be used, to address these RCMF types. This highlights the need to produce trimer DELs, which offer access to new topological space that might not be fully accessed by dimer DELs. Finally, we mapped the set of ChEMBL drugs on the RCMF map, as shown in Supporting Figure S16; 1709 clusters were occupied by at least one ChEMBL drug.
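The map coverage figures and the pairwise library comparison described in this section can be illustrated with a short sketch. It is not the implementation used in this work: the function names, the toy cluster indices, and the `balance` cutoff that decides when two libraries occupy a cluster in roughly equal amounts are all assumptions, and raw compound counts are compared without any normalization for library size.

```python
from collections import Counter


def coverage(occupancy, n_clusters):
    """Fraction of map clusters holding at least one compound."""
    occupied = sum(1 for count in occupancy.values() if count > 0)
    return occupied / n_clusters


def compare_libraries(occ_a, occ_b, balance=0.2):
    """Label each occupied cluster by the library that dominates it.

    The share of library A in a cluster is a / (a + b). Shares within
    `balance` of 0.5 are labelled "both" (an assumed cutoff).
    """
    labels = {}
    for cluster in set(occ_a) | set(occ_b):
        a = occ_a.get(cluster, 0)
        b = occ_b.get(cluster, 0)
        if a + b == 0:
            continue  # unoccupied by both libraries, left uncoloured
        share_a = a / (a + b)
        if abs(share_a - 0.5) <= balance:
            labels[cluster] = "both"
        elif share_a > 0.5:
            labels[cluster] = "A"
        else:
            labels[cluster] = "B"
    return labels


# Toy input: the map cluster index of each compound (hypothetical values).
lib_a = Counter([0, 0, 0, 1, 2, 2])
lib_b = Counter([2, 2, 3, 3, 3])

map_coverage = coverage(lib_a + lib_b, n_clusters=10)
labels = compare_libraries(lib_a, lib_b)
print(map_coverage)
print(sorted(labels.items()))
```

Applied to the numbers reported for Figure 7, the same coverage calculation gives 18661 / 30585, i.e. slightly above 61%.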

Analysis of DEL selection outputs using RCMF maps

The developed RCMF approach could also be used for the analysis of DEL selection outputs. To demonstrate this, we used a joint data set of two Lia123 selection outputs on the Lysine Specific Demethylase-1 (LSD1)67 target. Fragment combinations (FCs) are usually ranked according to the so-called “count” (number of observations), which is an indicator of the level of enrichment in each selection data set. This “count” parameter indicates the number of corresponding DNA tags found after PCR amplification and decoding. The higher the “count”, the higher the FC affinity for the target could be, although there is no direct linear relationship. Only FCs seen in both selections were considered further (807 FCs); the highest cumulative count was 576 and the lowest was 7. The total sum of counts for all 807 FCs was 23403. RCMF descriptor strings were derived for all 807 FCs and mapped on the RCMF map (Supporting Figure S17). Only the top 20 statistically significant clusters are shown on the 2D map (covering >80% of the total count for all 807 FCs). Cluster significance was calculated based on the expected cluster count if 23403 random Lia123 FCs with count 1 were mapped on the RCMF map. A cluster was considered significant if the total count for all FCs belonging to the cluster was larger than the expected cluster count estimate by at least 2 standard deviations. Twenty FCs from 8 different clusters were resynthesized in free form and tested in LSD1 inhibition assays (see Supporting Methods). Low nM inhibition activity towards LSD1 was observed for the hits in Clusters 1 and 12 (chemical


structures are not disclosed but their corresponding RCMF descriptor strings are shown in Supporting Figure S17, panel B).
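The cluster significance test described above can be sketched as follows. The text does not state how the standard deviation of the expected cluster count was obtained, so the sketch assumes a binomial model in which each of the random FCs lands in a cluster with probability equal to that cluster's share of the library; the function name, argument names, and toy values are hypothetical.

```python
import math


def significant_clusters(observed, library_fraction, total_count, n_sd=2.0):
    """Return clusters whose observed count exceeds expectation by n_sd SDs.

    observed:         cluster id -> summed count of selected FCs in the cluster
    library_fraction: cluster id -> fraction of all library FCs in the cluster
    total_count:      number of random FCs (count 1 each) placed on the map
    """
    hits = {}
    for cluster, count in observed.items():
        p = library_fraction[cluster]
        expected = total_count * p
        # Binomial standard deviation: an assumption of this sketch.
        sd = math.sqrt(total_count * p * (1.0 - p))
        if count >= expected + n_sd * sd:
            hits[cluster] = (count, expected, sd)
    return hits


# Toy example with hypothetical cluster names, fractions, and counts.
fractions = {"c1": 0.01, "c2": 0.5}
observed = {"c1": 40, "c2": 510}
hits = significant_clusters(observed, fractions, total_count=1000)
print(sorted(hits))
```

In this toy case only `c1` passes: its expected count is 10 with a standard deviation of about 3.1, whereas `c2` (expected 500, standard deviation about 15.8) stays within two standard deviations of its expectation.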

Searching chemical space with RCMFs

Similarly to InChIKeys, RCMF descriptor strings could be used for chemical searches in databases, DELs and virtual libraries. To exemplify this, RCMF descriptor strings of 5069 drugs from the ChEMBL drug set (drugs with 2-6 rings) were searched within the multi-trillion virtual libraries Lib1-Lib10. Exact RCMF descriptor string matches were found for 4393 drugs (ca. 87%). A further 249 drugs either contained a 4-fused ring system or were not covered by the RCMF types present in Lib1-Lib10 and thus could not have been found in Lib1-Lib10. No RCMF descriptor string matches were returned for 427 ChEMBL drugs (ca. 8%). This indicates that Lib1-Lib10 cover the chemical space of drugs reasonably well. Next, Lib1-Lib10 compounds sharing identical RCMF descriptor strings with any found drug could be enumerated and compared using traditional similarity assessment techniques, such as fingerprints. To illustrate this, we selected the anticancer drug Olaparib and searched for its analogues in Lib1-Lib10 (Figure 9). In total, 262988 dimers sharing the same RCMF descriptor string were found (209765 from Lib1, 36654 from Lib5, 16467 from Lib7, 98 from Lib8, and 4 from Lib10). The dimers were subsequently enumerated and compared to Olaparib using Morgan FPs (ECFP4-like FPs as implemented in RDKit in KNIME)56-59,68,69 and the Tanimoto coefficient for similarity assessment. Two dimers were found to be identical to Olaparib. Another 29 dimers were extremely similar to the original drug, with Tanimoto similarity >0.8. For another 1914 dimers, the Tanimoto similarity was in the range 0.7-0.8. In total, 54234 dimers had Tanimoto similarity ≥ 0.5 to Olaparib. Next, 415 unique Murcko scaffolds were extracted from the 262988 enumerated dimers and compared to the Olaparib scaffold using the same Morgan


FPs. 33 scaffolds shared Tanimoto similarity ≥ 0.5 with the Olaparib Murcko scaffold. Figure 9 shows a random pick of 36 out of the 415 Murcko scaffolds, which share a larger or smaller degree of similarity with the Olaparib scaffold. This example demonstrates how one can easily access a specific portion of the chemical space covered by multi-trillion compound libraries using RCMFs. The search space could be further expanded by including RCMF descriptor strings with high similarity to the query RCMF descriptor string.
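The two-stage search described above (exact descriptor string lookup, then fingerprint ranking of the retrieved fragment combinations) can be sketched as follows. This is only an illustration under stated assumptions: the descriptor strings `desc_A` and `desc_B`, the building block names, and the fingerprints are hypothetical placeholders, and fingerprints are represented as plain sets of on-bits rather than the RDKit Morgan fingerprints used in this work.

```python
from collections import defaultdict


def build_index(records):
    """Map each descriptor string to the BB combinations that produce it."""
    index = defaultdict(list)
    for descriptor, bb_pair in records:
        index[descriptor].append(bb_pair)
    return index


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def search(index, query_descriptor, query_fp, fingerprint_of, threshold=0.5):
    """Exact descriptor lookup, then ranking of the hits by Tanimoto."""
    hits = []
    for bb_pair in index.get(query_descriptor, []):
        score = tanimoto(query_fp, fingerprint_of(bb_pair))
        if score >= threshold:
            hits.append((score, bb_pair))
    return sorted(hits, reverse=True)


# Toy library: (descriptor string, BB pair) records and toy fingerprints.
records = [
    ("desc_A", ("bb1", "bb7")),
    ("desc_A", ("bb2", "bb7")),
    ("desc_B", ("bb3", "bb9")),
]
toy_fps = {
    ("bb1", "bb7"): {1, 2, 3, 4},
    ("bb2", "bb7"): {1, 9},
    ("bb3", "bb9"): {1, 2, 3, 4},
}
index = build_index(records)
hits = search(index, "desc_A", {1, 2, 3, 5}, toy_fps.__getitem__)
print(hits)
```

Only combinations stored under the query descriptor string are scored, so the fingerprint comparison touches a small part of the library; the `desc_B` entry is never examined even though its toy fingerprint would score well.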

Figure 9. An example of searching the large virtual combinatorial libraries Lib1-Lib10 using RCMF descriptor strings. At the first stage, an RCMF descriptor string is generated for a query structure (the anticancer drug Olaparib) as shown. At the second stage, an exact match of the Olaparib RCMF descriptor string is searched within the Lib1-Lib10 RCMF descriptor strings. Only a small portion of the Lib1-Lib10 RCMF descriptor space needs to be processed, as the exact string match is looked up only within “4_type6” RCMF type descriptor strings. As the relationship between fragment combinations and their corresponding RCMF descriptor string is preserved for each virtual library Lib1-Lib10, ca. 263K pos1 and pos2 building block (BB) combinations were retrieved and enumerated. 415 Murcko scaffolds were extracted from the ca. 263K dimers, and 36 of them are shown in this figure, demonstrating a higher or lower degree of similarity to the Olaparib scaffold.


Mapping DELs using RCMFs of fragments

Generation of RCMF descriptors for fragments is the primary step of the developed RCMF approach when analysing DELs and combinatorial libraries (Figure 6). However, RCMF descriptors for fragments could also be used alone to assess library diversity at both the individual BB and the library level. The former could be achieved by calculating the distribution of reagents within each RCMF fragment type (Supporting Figures S3 and S4). The “library level” could then be addressed by placing the RCMF types for fragments of each library position on its own coordinate. One such example of DEL visualization based on RCMFs of fragments is shown in Figure 10. In this figure, Lia123 pos1 and pos2 BBs are grouped according to their fragment RCMF types, with a few general ring and linker descriptors, on the Y- and X-axis respectively, and the map is colour-coded according to the number of library dimers in the corresponding 2D cluster cell (see abbreviations in the footnote of Figure 10). RCMF types for full-size library compounds are shown graphically for a few areas of this map. It should be noted that the same RCMF types are formed in different regions of this heatmap, and the heatmap therefore differs significantly from the 2D RCMF maps with 30585 clusters presented earlier. Mapping based on RCMF types of BBs is especially suitable for visualization of dimer libraries; however, it becomes a challenge for trimer libraries, as it would require visualization in 3D. Mapping of dimer DELs using RCMF types of fragments provides a convenient way to visualize and assess DEL and BB structural diversity at the same time, especially in the library design phase.
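The fragment-based heatmap described above can be computed without enumerating any dimers, as the following sketch illustrates. It assumes a fully combinatorial dimer library in which every pos1 BB is combined with every pos2 BB, so the number of dimers in a cell is simply the product of the two BB counts; the fragment type labels and the function name are hypothetical.

```python
from collections import Counter


def fragment_heatmap(pos1_types, pos2_types):
    """Number of dimers per (pos1 type, pos2 type) cell, without enumeration.

    Assumes every pos1 BB reacts with every pos2 BB, so a cell holds
    (pos1 BBs of that type) x (pos2 BBs of that type) dimers.
    """
    n1 = Counter(pos1_types)
    n2 = Counter(pos2_types)
    return {(t1, t2): c1 * c2 for t1, c1 in n1.items() for t2, c2 in n2.items()}


# Toy input: the fragment RCMF type of each BB (hypothetical labels).
pos1 = ["1ring_arom", "1ring_arom", "2ring_fused"]
pos2 = ["1ring_arom", "1ring_aliph", "1ring_aliph", "0ring"]

heatmap = fragment_heatmap(pos1, pos2)
for cell, n_dimers in sorted(heatmap.items()):
    print(cell, n_dimers)
```

The cell counts always sum to the product of the numbers of pos1 and pos2 BBs, which gives a quick consistency check for any real library.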


Figure 10. 2D RCMF heatmap based on Nuevolution Lia123 fragment RCMFs. The y-axis demonstrates clustering of 8803 Lia123 pos1 bifunctional fragments grouped according to their RCMF type. Pos1 fragments with 1-2 rings are further split into smaller RCMF classes based on aromaticity of their rings, which are coloured according to abbreviations. Blue star indicates location of the reactive handle of pos1 BBs which further reacts with Lia123 pos2 fragments (capping BBs). Oligo linker attachment point is shown only for 1-ring pos1 BBs. The x-axis demonstrates 14496 Lia123 pos2 monofunctional BBs clustered according to fragment RCMF type. Pos2 1-ring BBs are further split into smaller groups based on their ring aromaticity and distance of the reactive handle to the ring. Pos2 2-ring BBs are split into subgroups according to their ring aromaticity type. A red star on the graphical representation of pos2 RCMFs corresponds to pos2 BB reactive handles which react with pos1 BBs (unspecified for 2-fused ring BBs). Graphical depictions of RCMF types formed for enumerated library compounds are shown in grey boxes.


DISCUSSION

A novel multi-purpose RCMF approach was presented in this study. The developed method was used to chart chemical space on a “global framework level” for ChEMBL drugs, PubChem, multi-million DELs, and virtual combinatorial libraries with trillions of library members. In addition, we demonstrated the utility of the approach for efficiently searching desired areas of chemical space, mapping selection outputs from DEL screening experiments, and efficiently grouping millions of Murcko scaffolds into larger RCMF clusters in a robust and logical way that mimics how medicinal chemists intuitively perceive molecules. Finally, we showed that nearly all small drug-like molecules with up to 6 rings considered in this study could be roughly grouped into just 452 RCMF types. We also found that the number of possible RCMF types for BBs is very limited, allowing us to construct an efficient rule-based system that addresses all combinations and accurately assesses the framework diversity of DELs and combinatorial libraries of virtually any size without enumerating library compounds. RCMFs differ from Oprea’s scaffold topologies published earlier.30,31 Firstly, RCMFs are more general. For example, RCMF type “3_type5” shown in Supporting Figure S1 covers all 3-ring scaffold topologies (types 6, 10, 11 and 16) shown in Figure 2 of ref. 31. Furthermore, there are significant differences between the number of scaffold topologies listed in Table 1 of ref. 30 and the number of RCMF types shown in this study (especially for 4-6 ring compounds), i.e. 23 RCMF types vs. 73 scaffold topologies for 4-ring compounds, 83 RCMF types vs. 590 scaffold topologies for 5-ring compounds, and 337 RCMF types vs. 6454 scaffold topologies for 6-ring compounds. Supporting Figure S1 provides graphical depictions of all RCMFs with up to 6 rings, whereas graphical depictions of scaffold topologies with a maximum of 4 rings are shown in ref. 31. In addition, we introduced a “minus-one-ring” option for RCMFs. This allows comparison of RCMFs across different types and finding relationships between them if their numbers of rings differ by one.


Furthermore, all rings in each RCMF type have a pre-defined, alphabetically coded order, letting one refer to and describe any part of any RCMF type. Finally, the introduction of RCMF types for mono- and bi-functional reagents is entirely unique to our study. RCMFs are also very different from Feature Trees (FTs).28,29 FTs describe molecules by their major building blocks in a non-linear fashion and were developed for searching in large combinatorial spaces. Although they are efficient in finding query compound analogues within a certain similarity threshold, FTs, unlike RCMFs, do not describe molecular topology. In contrast, each RCMF descriptor string may represent either an individual full-size compound or a whole class of similar compounds sharing similar Murcko scaffolds, and thus similarity assessment across different RCMF clusters is also possible. In addition, the RCMF approach could be applied to the comparison and visualization of chemical spaces occupied by DELs and other large combinatorial libraries, whereas FTs cannot. In this study we presented a global RCMF map which may be considered a “macro” level map of the chemical universe for small drug-like compounds. The resolution of this map was chosen to allow easy visualization of framework chemical space on a PC screen-wide image, and the map can be considered the “World map” in geographical terms. A further increase in map resolution and “zooming” could be performed to explore the distribution and “landscape” of each individual RCMF cluster either on a more detailed RCMF level or on the Murcko scaffold or individual compound level. To our knowledge, the number of “pre-fixed” chemical space maps published to date is rather limited, as the majority of chemical space visualization techniques are compound-driven. The use of scaffold topologies as described in earlier studies30-32 indeed provides a high level of abstraction in the universe of small organic molecules. Nevertheless, application of these techniques requires availability of chemical structures (enumeration), does not allow direct comparison of different topological groups, and offers no visualization capabilities for billions of compounds and


beyond. The presented RCMF approach provides a new alternative methodology, which is capable of mapping chemical space occupied by trillions of compounds very efficiently. In addition, individual RCMF descriptor strings provide an intuitive and logical way of grouping similar scaffolds into chemically meaningful clusters that are easy for medicinal chemists to perceive. The structural hierarchy analysis presented in this paper could be roughly compared to the geographic hierarchy of the world, where the number of people living in the world today (7.5·10⁹) is roughly comparable to the number of individual compounds in a DEL (10⁸-10¹⁰), the total number of world cities, towns, and villages (ca. 2.5·10⁶) to the number of Murcko scaffolds in a DEL (10-30 million), and the number of countries (ca. 189-196) to the number of general RCMF types (452). The number of continents (6) could then be attributed to the number of rings in drug-like compounds (0-6). As explained earlier, the addition of RCMF descriptors greatly increases the “resolution” level of each RCMF type, but never makes them as detailed as individual Murcko scaffolds. Therefore, the RCMF approach “operates” on the next structure-topology level in the classification hierarchy above that occupied by Murcko scaffolds and allows intuitive clustering of them. As distances between RCMF clusters could be calculated as described in the Methods section, a network-like representation of framework chemical space is also possible, where nodes would stand for framework clusters and edges would be formed if the similarity between framework clusters is higher than a defined threshold. The current implementation of the method does not consider compounds with 7 or more rings, macrocycles, and compounds with very complex bridged-head systems usually found in natural products. Although these classes of compounds could be integrated into the developed workflows in the future, it is appropriate to ask beforehand whether this is needed, as 7-ring compounds could be grouped with 6-ring compounds in the RCMF clustering process if the “minus-one-ring” option is enabled, and 8+ ring compounds are usually heavy, not drug-like, and of little interest for medicinal


chemistry optimization projects. Moreover, RCMF types with 7+ rings introduce an even higher level of complexity on the topology level, as demonstrated previously.30 It is worth mentioning that only a few approved drugs have 7+ rings and that their MW is usually well above 500 Da. In addition, less than 1% of PubChem compounds (MW < 1000 Da) have 7 or more rings. A few additional theoretically possible RCMF types might exist (not covered by the depictions in Supporting Figures S1, S3 and S4); it was not the purpose of this work to find and exhaustively cover all of them, but rather to concentrate on the available RCMF types covering drug-like molecules in public databases, DELs and combinatorial libraries. In addition, we introduced RCMF types for BBs, further extending the applicability of the approach. It should be stressed that the presented RCMF types generally cover >99% of all in-house reagents, DELs, ACD and PubChem compounds. As the implementation of the algorithm is rather flexible, addition of new RCMF types to the system is possible. Similarly, a new set of rules that considers ring-forming reactions could be added, since the rule-based engine behind the generation of RCMFs for full-size compounds from RCMFs of BBs is currently capable of “linking” individual fragments based on “side-chain” reactions only. Calculation of RCMF descriptors for full-size DEL compounds is based on the known RCMF descriptors of BBs. The rule-based algorithm is very fast and generates RCMF descriptors with the desired resolution for several million fragment combinations in a few minutes. An automatic RCMF descriptor generation algorithm for full-size compounds (which are not members of any combinatorial library) is under active development. In addition, we are considering extending the RCMF descriptor sets by introducing chemical types of linkers.
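The network-like representation of framework chemical space mentioned in this Discussion could be built along the following lines. The sketch is an assumption-laden illustration: the distance measure from the Methods section is not reproduced here, so `shared_fraction` is a placeholder similarity (the fraction of descriptor positions with identical values), and the cluster names and descriptor tuples are hypothetical.

```python
from itertools import combinations


def shared_fraction(desc_a, desc_b):
    """Placeholder similarity: fraction of positions with equal descriptors."""
    if not desc_a or len(desc_a) != len(desc_b):
        return 0.0
    same = sum(x == y for x, y in zip(desc_a, desc_b))
    return same / len(desc_a)


def build_network(clusters, similarity, threshold):
    """Adjacency sets: an edge joins clusters whose similarity exceeds threshold."""
    graph = {name: set() for name in clusters}
    for a, b in combinations(clusters, 2):
        if similarity(clusters[a], clusters[b]) > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Toy clusters described by hypothetical descriptor tuples.
clusters = {
    "c1": ("A", "A", "1"),
    "c2": ("A", "A", "2"),
    "c3": ("N", "N", "5"),
}
graph = build_network(clusters, shared_fraction, threshold=0.5)
print(graph)
```

The all-pairs loop is quadratic in the number of clusters; for a map with 30585 clusters this is roughly 4.7·10⁸ comparisons, so a practical implementation would probably restrict comparisons, for example to clusters of the same RCMF type.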


ASSOCIATED CONTENT

Supporting Information

The Supporting Information (PDF) is available free of charge on the ACS Publications website at DOI: and includes Supporting Methods, Supporting Figures S1-S20, Supporting Tables S1-S3. Supporting Files S1-S4 (XLS).

AUTHOR INFORMATION

Corresponding Author

Phone: +45 39 13 09 52. E-mail: [email protected], [email protected]

Notes

This work was fully funded by Nuevolution A/S. The author declares no competing financial interest.

ACKNOWLEDGMENTS

The author thanks his colleagues from Nuevolution, Dr. Alex Haahr Gouliaev, Dr. Thomas Franch, and Dr. Mads Nørregaard-Madsen, for critical reading of the manuscript and valuable comments, as well as Dr. Jan Legaard Andersson for providing the description of the experimental assay for testing LSD1 hits.


DEFINITIONS AND ABBREVIATIONS

DEL = DNA-Encoded Library.

BBs = Building Blocks. The terms “reagents”, “fragments”, and “building blocks” are used interchangeably throughout the manuscript and denote mono- or bi-functional reagents which are included in each DEL position to react chemically with a reagent from another position.

RCMF = Reduced Complexity Molecular Framework.

RCMF type = RCMF with a specific ring connectivity pattern.

RCMF type for a full-size compound = RCMF type for DEL or virtual combinatorial library full-size compounds.

Fragment RCMF type (RCMF type for a reagent) = RCMF type for BBs included in a DEL by design, where the fragment RCMF type is defined not only by the BB ring connectivity pattern, but also by the position of a reactive handle on the framework.

RCMF descriptors or RCMF descriptor strings = the ring, linker, and angle descriptors derived for a specific RCMF type. RCMF descriptors have a pre-defined order in a line notation string. These line notations are unique for each specific RCMF type. RCMF descriptors could be generated for full-size compounds and for reagents. RCMF descriptors for reagents also include information on the linker size between each reactive handle and the closest ring, as well as angle descriptors indicating how the reactive handle is attached to a system of rings.

RCMF chemical space map = 2D grid of individual RCMF clusters for full-size compounds. The RCMF chemical space maps shown in this study include RCMF descriptors which are generalized to a level where the theoretically possible number of descriptor combinations becomes manageable for visualization purposes.


REFERENCES

(1) Fey, N. Lost In Chemical Space? Maps to Support Organometallic Catalysis. Chemistry Central Journal 2015, 9:38.
(2) Osolodkin, D.I.; Radchenko, E.V.; Orlov, A.A.; Voronkov, A.E.; Palyulin, V.A.; Zefirov, N.S. Progress in Visual Representation of Chemical Space. Expert Opin. Drug Discov. 2015, 10, 959-973.
(3) Todeschini, R.; Consonni, V. Handbook of Molecular Descriptors, vol. 11. Weinheim, Germany: Wiley-VCH; 2002.
(4) Shanmugasundaram, V.; Maggiora, G.M.; Lajiness, M.S. Hit-directed Nearest-neighbor Searching. J. Med. Chem. 2005, 48, 240–248.
(5) Sheridan, R.P.; Kearsley, S.K. Why Do We Need so Many Chemical Similarity Search Methods? Drug Discovery Today 2002, 7, 903–911.
(6) Willett, P. Similarity-based Virtual Screening Using 2D Fingerprints. Drug Discovery Today 2006, 11, 1046–1053.
(7) Hoksza, D.; Škoda, P.; Voršilák, M.; Svozil, D. Molpher: A Software Framework for Systematic Chemical Space Exploration. J. Cheminformatics 2014, 6:7.
(8) Kohonen, T. The Self-organizing Map. Proc. IEEE 1990, 78, 1464−1480.
(9) Gaspar, H.A.; Baskin, I.I.; Varnek, A. Visualization of a Multidimensional Descriptor Space. In Frontiers in Molecular Design and Chemical Information Science - Herman Skolnik Award Symposium 2015: Jürgen Bajorath (United States, 2016), vol. 1222 of ACS Symposium Series, United States, pp. 243–267.
(10) Rosén, J.; Lövgren, A.; Kogej, T.; Muresan, S.; Gottfries, J.; Backlund, A. ChemGPS-NP(Web): Chemical Space Navigation Online. J. Comput. Aided Mol. Des. 2009, 23, 253-259.


(11) Larsson, J.; Gottfries, J.; Muresan, S.; Backlund, A. ChemGPS-NP: Tuned for Navigation in Biologically Relevant Chemical Space. J. Nat. Prod. 2007, 70, 789-794.
(12) Larsson, J.; Gottfries, J.; Bohlin, L.; Backlund, A. Expanding the ChemGPS Chemical Space with Natural Products. J. Nat. Prod. 2005, 68, 985-991.
(13) Gaspar, H.A.; Baskin, I.I.; Marcou, G.; Horvath, D.; Varnek, A. Chemical Data Visualization and Analysis with Incremental Generative Topographic Mapping: Big Data Challenge. J. Chem. Inf. Model. 2015, 55, 84-94.
(14) Kireeva, N.; Baskin, I.I.; Gaspar, H.A.; Horvath, D.; Marcou, G.; Varnek, A. Generative Topographic Mapping (GTM): Universal Tool for Data Visualization, Structure-Activity Modeling and Dataset Comparison. Mol. Inform. 2012, 31, 301–312.
(15) Reymond, J.L. The Chemical Space Project. Acc. Chem. Res. 2015, 48, 722–730.
(16) Ruddigkeit, L.; Blum, L.C.; Reymond, J.L. Visualization and Virtual Screening of the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2013, 53, 56-65.
(17) Ruddigkeit, L.; Deursen, R.; Blum, L.C.; Reymond, J.L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864–2875.
(18) Ruddigkeit, L.; Awale, M.; Reymond, J.L. Expanding the Fragrance Chemical Space for Virtual Screening. J. Cheminformatics 2014, 6:27.
(19) Reymond, J.L.; Ruddigkeit, L.; Blum, L.; van Deursen, R. The Enumeration of Chemical Space. WIREs Comput Mol Sci. 2012, 2, 717–733.
(20) Reymond, J.L.; van Deursen, R.; Blum, L.C.; Ruddigkeit, L. Chemical Space as a Source for New Drugs. MedChemComm. 2010, 1, 30–38.
(21) Reymond, J.L.; Awale, M. Exploring Chemical Space for Drug Discovery Using the Chemical Universe Database. ACS Chem. Neurosci. 2012, 3, 649–657.


(22) Virshup, A.M.; Contreras-García, J.; Wipf, P.; Yang, W.; Beratan, D.N. Stochastic Voyages into Uncharted Chemical Space Produce a Representative Library of All Possible Drug-like Compounds. J. Am. Chem. Soc. 2013, 135, 7296-7303.
(23) Schuffenhauer, A.; Ertl, P.; Roggo, S.; Wetzel, S.; Koch, M.A.; Waldmann, H. The Scaffold Tree - Visualization of the Scaffold Universe by Hierarchical Scaffold Classification. J. Chem. Inf. Model. 2007, 47, 47-58.
(24) Bemis, G.W.; Murcko, M.A. The Properties of Known Drugs. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887–2893.
(25) Wetzel, S.; Klein, K.; Renner, S.; Rauh, D.; Oprea, T.I.; Mutzel, P.; Waldmann, H. Interactive Exploration of Chemical Space with Scaffold Hunter. Nat. Chem. Biol. 2009, 5, 581-583.
(26) Vega de León, A.; Bajorath, J. Chemical Space Visualization: Transforming Multidimensional Chemical Spaces into Similarity-based Molecular Networks. Future Med. Chem. 2016, 14, 1769-1778.
(27) Ertl, P. Intuitive Ordering of Scaffolds and Scaffold Similarity Searching Using Scaffold Keys. J. Chem. Inf. Model. 2014, 54, 1617-1622.
(28) Rarey, M.; Stahl, M. Similarity Searching in Large Combinatorial Chemistry Spaces. J. Comput. Aided Mol. Des. 2001, 15, 497−520.
(29) Rarey, M.; Dixon, J.S. Feature Trees: A New Molecular Similarity Measure Based on Tree Matching. J. Comput. Aided Mol. Des. 1998, 12, 471–490.
(30) Pollock, S.N.; Coutsias, E.A.; Wester, M.J.; Oprea, T.I. Scaffold Topologies. 1. Exhaustive Enumeration up to Eight Rings. J. Chem. Inf. Model. 2008, 48, 1304–1310.
(31) Wester, M.J.; Pollock, S.N.; Coutsias, E.A.; Allu, T.K.; Muresan, S.; Oprea, T.I. Scaffold Topologies. 2. Analysis of Chemical Databases. J. Chem. Inf. Model. 2008, 48, 1311–1324.


(32) Velkoborsky, J.; Hoksza, D. Scaffold Analysis of PubChem Database as Background for Hierarchical Scaffold-based Visualization. J. Cheminformatics 2016, 8:74.
(33) Jensen, A.; Seidler, S. Method for Generating a Hierarchical Topological Tree of 2D or 3D-structural Formulas of Chemical Compounds for Property Optimisation of Chemical Compounds. 2004. Patent US20040088118.
(34) Muegge, I.; Zhang, Q. 3D Virtual Screening of Large Combinatorial Space. Methods 2015, 71, 14-20.
(35) Peng, Z. Very Large Virtual Compound Spaces: Construction, Storage, and Utility in Drug Discovery. Drug Discovery Today Technol. 2013, 10, e387−e394.
(36) Nicolaou, C.A.; Watson, I.A.; Hu, H.; Wang, J. The Proximal Lilly Collection: Mapping, Exploring, and Exploiting Feasible Chemical Space. J. Chem. Inf. Model. 2016, 56, 1253-1266.
(37) Goodnow, R.A.Jr. A Handbook for DNA-encoded Chemistry: Theory and Applications for Exploring Chemical Space and Drug Discovery, ed. Goodnow, R.A.; Acharya, R.A. Wiley, New Jersey, 2014, 247-279. (ISBN: 978-1-118-48768-6)
(38) Eidam, O.; Satz, A.L. Analysis of the Productivity of DNA-encoded Libraries. Med. Chem. Commun. 2016, 7, 1323.
(39) Zimmermann, G.; Neri, D. DNA-encoded Chemical Libraries: Foundations and Applications in Lead Discovery. Drug Discovery Today 2016, 21, 1828-1834.
(40) Melkko, S.; Dumelin, Ch.E.; Scheuermann, J.; Neri, D. Lead Discovery by DNA-encoded Chemical Libraries. Drug Discovery Today 2007, 12, 465-471.
(41) Clark, M.A.; Acharya, R.A.; Arico-Muendel, C.C.; Belyanskaya, S.L.; Benjamin, D.R.; Carlson, N.R.; Centrella, P.A.; Chiu, C.H.; Creaser, S.P.; Cuozzo, J.W. et al. Design, Synthesis and Selection of DNA-encoded Small-molecule Libraries. Nature Chemical Biology 2009, 5, 647–654.


(42) Litovchick, A.; Dumelin, Ch.E.; Habeshian, S.; Gikunju, D.; Guié, M-A.; Centrella, P.; Zhang, Y.; Sigel, E.A.; Cuozzo, J.W.; Keefe, A.D.; Clark, M.A. Encoded Library Synthesis Using Chemical Ligation and the Discovery of sEH Inhibitors from a 334-million Member Library. Sci Rep. 2015, 5, 10916.
(43) Mullard, A. DNA-encoded Drug Libraries Come of Age. Nature Biotechnology 2016, 34, 450-451.
(44) Mullard, A. DNA Tags Help the Hunt for Drugs. Nature 2016, 530, 367–369.
(45) Wan, J.; Dou, D.; Song, H.; Wu, X-H.; Cheng, X.; Li, J. Lead Generation for Challenging Targets. In Lead Generation: Methods and Strategies, Ed. Holenz, J. 67, 259-260 (ISBN: 978-3-527-33329-5)
(46) Nuevolution A/S. Method for the Synthesis of a Bifunctional Complex. 2004, Patent WO 2004/039825; https://nuevolution.com
(47) Nuevolution A/S. Enzymatic Encoding Methods for Efficient Synthesis of Large Libraries. 2007, Patent WO 2007/062664; https://nuevolution.com
(48) Goodnow, R.A.Jr.; Dumelin, C.E.; Keefe, A.D. DNA-encoded Chemistry: Enabling the Deeper Sampling of Chemical Space. Nat. Rev. Drug Discov. 2017, 16, 131-147.
(49) Ahn, S.; Kahsai, A.W.; Pani, B.; Wang, Q-T.; Zhao, S.; Wall, A.L.; Strachan, R.T.; Staus, D.P.; Wingler, L.M.; Sun, L.D.; Sinnaeve, J. et al. Allosteric “Beta-blocker” Isolated from a DNA-encoded Small Molecule Library. PNAS 2017, 114, 1708-1713.
(50) Franzini, R.M.; Randolph, C. Chemical Space of DNA-encoded Libraries. J. Med. Chem. 2016, 59, 6629-6644.
(51) Trabocchi, A.; Schreiber, S.L. DNA Encoded Libraries (Chapter 11) in Diversity-oriented Synthesis: Basics and Applications in Organic Synthesis, Drug Discovery, and Chemical Biology. 2013. (ISBN: 978-1-118-14565-4)


(52) Satz, L. A Handbook for DNA-encoded Chemistry: Theory and Applications for Exploring Chemical Space and Drug Discovery, ed. Goodnow, R.A.; Acharya, R.A. Wiley, New Jersey, 2014, 99–122. (ISBN: 978-1-118-48768-6)
(53) Gaulton, A.; Bellis, L.J.; Bento, A.P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J.P. ChEMBL: A Large-scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40, D1100-1107. https://www.ebi.ac.uk/chembl/drugstore (accessed on March 29, 2016)
(54) PubChem database. https://pubchem.ncbi.nlm.nih.gov/ (accessed May 11, 2016).
(55) Available Chemicals Directory is available from Dassault Systemes Biovia K.K. http://accelrys.com/products/collaborative-science/databases/sourcing-databases/biovia-available-chemicals-directory.html (accessed May 31, 2016)
(56) Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. KNIME: The Konstanz Information Miner. In Data Analysis, Machine Learning and Applications; Preisach, C., Burkhardt, P. D. H., Schmidt-Thieme, P. D.L., Decker, P. D. R., Eds.; Springer: Berlin, Heidelberg, 2008; pp 319−326.
(57) All workflows were implemented in KNIME version 3.2.1 (available at http://knime.org).
(58) RDKit nodes 3.0.0 distributed as part of the ‘community contributions’, http://tech.knime.org/community/ (accessed September 30, 2016)
(59) Indigo nodes 2.0.0 from Epam distributed as part of the ‘community contributions’, http://tech.knime.org/community/ (accessed September 30, 2016)
(60) Lipinski, C.A.; Lombardo, F.; Dominy, B.W.; Feeney, P.J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug Deliv. Rev. 2001, 46, 3–26.
(61) http://www.surveysystem.com/sscalc.htm


(62) Canvas v2.8. Schrodinger, Inc.: Portland, 2016.
(63) Kirkpatrick, P.; Ellis, C. Chemical Space. Nature 2004, 432, 823.
(64) Metsalu, T.; Vilo, J. ClustVis: A Web Tool for Visualizing Clustering of Multivariate Data Using Principal Component Analysis and Heatmap. Nucleic Acids Res. 2015, 43, 566-570.
(65) Ivanenkov, Y.A.; Savchuk, N.P.; Ekins, S.; Balakin, K.V. Computational Mapping Tools for Drug Discovery. Drug Discovery Today 2009, 14, 767–775.
(66) Paolini, G.V.; Shapland, R.H.; van Hoorn, W.P.; Mason, J.S.; Hopkins, A.L. Global Mapping of Pharmacological Space. Nat. Biotechnol. 2006, 24, 805-815.
(67) Arrowsmith, C.H.; Bountra, C.; Fish, P.V.; Lee, K.; Schapira, M. Epigenetic Protein Families: A New Frontier for Drug Discovery. Nat Rev Drug Discov. 2012, 11, 384-400.
(68) Landrum, G. RDKit: Open-source Cheminformatics, http://www.rdkit.org (accessed September 30, 2016)
(69) Rogers, D.; Hahn, M. Extended-connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.


For Table of Contents use only

Mapping of Drug-like Chemical Universe with Reduced Complexity Molecular Frameworks

Aleksejs Kontijevskis
