Article pubs.acs.org/jcim

Fragment Database FDB-17

Ricardo Visini, Mahendra Awale, and Jean-Louis Reymond*

Department of Chemistry and Biochemistry, University of Bern, Freiestrasse 3, 3012 Berne, Switzerland

ABSTRACT: To better understand chemical space, we recently enumerated the database GDB-17 containing 166.4 billion possible molecules up to 17 atoms of C, N, O, S, and halogen following simple rules of chemical stability and synthetic feasibility. However, due to the combinatorial explosion caused by systematic enumeration, GDB-17 is strongly biased toward the largest, functionally and stereochemically most complex molecules and is far too large for most virtual screening tools. Herein we selected a much smaller subset of GDB-17, called the fragment database FDB-17, which contains 10 million fragmentlike molecules evenly covering a broad value range for molecular size, polarity, and stereochemical complexity. The database is available at www.gdb.unibe.ch for download and free use, together with an interactive visualization application and a Web-based nearest neighbor search tool to facilitate the selection of new fragment-sized molecules for chemical synthesis.



INTRODUCTION

Enriching screening libraries with chemically novel molecules is essential to enable the discovery of new chemical entities which might help in addressing unmet medical needs.1 The computational enumeration of possible chemical structures from first principles offers a fascinating approach to explore which molecules are possible prior to their synthesis. Initial counting exercises dating back to the invention of graph theory2−4 have produced estimates of 10^60 for the total number of possible druglike small molecules5,6 and 10^20−10^24 for all molecules up to 30 atoms.7,8 Furthermore, the development of cheminformatics, in particular the invention of SMILES as a compact line notation to write 2D-molecular information,9 and of 3D-generators to rapidly convert any 2D-structure to energetically favorable conformers of all possible stereoisomers, such as CORINA10 or OMEGA,11 has made it possible to assemble and explore very large databases of molecules in silico. Most computational enumeration approaches assemble new molecules by combining known building blocks with known reactions, a strategy which ensures easy synthetic access to the predicted molecules.12−16 Nevertheless, innovations at the level of the building blocks themselves, although more difficult, would be highly desirable, in particular in the context of fragment-based drug discovery, which is one of the most successful recent drug discovery methods.17,18 Toward this goal we have enumerated all possible molecules from first principles following simple rules of chemical stability and synthetic feasibility and obtained the Generated DataBases GDB-11, GDB-13, and GDB-17 listing 26.4 million, 977 million, and 166.4 billion molecules up to 11, 13, and 17 non-hydrogen atoms (C, N, O, S, and halogens).19−25 These databases were obtained starting with an exhaustive library of mathematical graphs provided by the program GENG26 by the sequential

steps of 1) selecting chemically meaningful hydrocarbon graphs, 2) introducing double and triple bonds in chemically meaningful ways to form skeletons, and 3) substituting heteroatoms for carbon atoms focusing on chemically relevant functional groups (Figure 1). Analyzing the GDBs using virtual screening and visualization tools specifically developed to address these unusually large databases27−34 shows that they differ from databases of known molecules such as DrugBank,35 ChEMBL,36 ZINC,37 or

Figure 1. Enumeration and selection workflow used to generate FDB-17. Steps 4) and 5) are discussed in the present publication.

Received: January 11, 2017
Published: April 4, 2017

DOI: 10.1021/acs.jcim.7b00020 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

classical fragment-likeness criterion for rotatable bonds (RBC ≤ 3) to enforce structural rigidity and limited functional group density by restricting the sum of nitrogen plus oxygen atoms and charges at neutral pH, as well as H-bond acceptor and H-bond donor atoms. We also eliminated notoriously unstable and reactive functional groups (FGs) and aromatic rings larger than 6-membered due to their exotic nature and allowed only up to one cyano group. Finally, all nonaromatic carbon−carbon unsaturations and halogens were removed because to a first approximation they resemble saturated carbon−carbon single bonds and methyl groups, respectively. This criterion eliminated many redundant molecules and helped reduce database size. The restrictive criteria in Table 1 reduced GDB-17 by 36-fold from 166.4 billion to 4.6 billion molecules, here referred to as the 4.6G fragment subset (Figure 1). Despite this size reduction and simplification, however, this 4.6G subset was still largely dominated by the largest, most functionalized, and stereochemically most complex molecules because the exhaustive enumeration produces many more possible molecules for the higher values of HAC (heavy atom count), heteroatoms, and stereocenters (blue line, Figure 2A-C). This dominance of the largest and most complex molecules was even more apparent when subdividing the 4.6G subset into 175 bins corresponding to value triplets (HAC, heteroatoms, stereocenters), which were very unevenly populated and contained between 3,359 and 446,322,188 molecules (Figure 2D). To allow synthetically more attractive, smaller, less functionalized, and simpler molecules to become apparent, we evenly sampled molecules from all these different bins, selecting either the entire bin content for the least populated bins or randomly selecting approximately 60,000 molecules per bin for the highly occupied bins.
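The binning and per-bin sampling can be illustrated with a short sketch. This is not the authors' Java code: the names `bin_key` and `sample_bin` are hypothetical, the thresholds (bins of up to 70,000 molecules taken completely, a target of 60,000 per larger bin, a jitter bound S of 3 or 20) are taken from the Methods, and reading the line-number rule as a jittered stride that always advances by at least one line is an assumption made here so that the loop is well defined.

```python
import random

def bin_key(hac, heteroatoms, stereocenters):
    """Return the (HAC, heteroatoms, stereocenters) value triplet of a molecule.

    HAC values of 11 or less share one bin, and heteroatom counts (N+O+S)
    are clamped so that <=1 and >=5 are open-ended bins.
    """
    return (max(hac, 11), min(max(heteroatoms, 1), 5), stereocenters)

def sample_bin(lines, target=60_000, take_all_max=70_000, rng=None):
    """Pick roughly `target` lines from a bin using a jittered stride."""
    rng = rng or random.Random()
    n = len(lines)
    if n <= take_all_max:
        return list(lines)            # small bins are taken completely
    stride = n / target
    jitter_max = 3 if n < 500_000 else 20
    picked = []
    position = 0.0
    while True:
        step = stride - rng.uniform(0, jitter_max)
        position += max(step, 1.0)    # assumption: advance at least one line
        index = int(position) - 1     # line numbers are 1-based
        if index >= n:
            break
        picked.append(lines[index])
    return picked
```

Because the random number is subtracted from the stride, the average step is slightly shorter than N/60,000, so a large bin yields somewhat more than the target number of molecules, in line with the "approximately 60,000" stated in the text.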
This selection resulted in a much smaller database of 10 million molecules evenly covering diversity in terms of molecule size and complexity, here called the fragment database FDB-17 (Figure 2, red lines).

Property Profiles. The fragment database FDB-17 was compared with its parent database GDB-17, its 4.6G fragment subset, and a database of 40,986 molecules up to 17 atoms collected from various suppliers and obeying Congreve’s rule of three criteria,43 here referred to as “commercial fragments”. Only 31% (12,847 molecules) of these commercial fragments also appeared in the 4.6G fragment subset because the others contained FGs not present in GDB-17, such as azides, thiols, and thioethers, or FGs eliminated when selecting the 4.6G fragment subset (criteria in Table 1, in particular halogens), and only 6.7% (2,740) appeared in FDB-17 due to the random sampling approach taken when composing FDB-17 from the 4.6G subset. Nevertheless, we considered that these commercial fragments formed a relevant reference to which FDB-17 should be compared since they represented molecules generally considered as fragments independent of the rules applied for FDB-17. While the complete GDB-17 and its 4.6G fragment subset peaked sharply at HAC = 17 and MW = 230 due to the combinatorial explosion as a function of HAC, the even molecular size distribution enforced on FDB-17 resulted in a size distribution comparable to that of commercial fragments (Figure 3A/B). GDB-derived fragments generally had fewer rotatable bonds than commercial fragments, reflecting the abundance of molecules with 2 and 3 rings (Figure 3C). In terms of compound categories, half of the fragments were heteroaromatic for both the GDB-derived and commercial fragments; however, the second half was mostly heterocyclic for

PubChem38 not only by the very large number but also by the types of molecules they contain. By contrast to databases of known molecules which mostly consist of achiral aromatic molecules with only a few functional groups, the GDBs contain an unusually large fraction of highly functionalized heterocyclic, 3D-shaped molecules rich in stereogenic and quaternary centers.25 Nevertheless during experimental syntheses of GDB molecules identified by virtual screening we always selected molecules of low complexity to ensure rapid synthetic success.39−42 These projects suggested to us that exploring the GDB could be greatly simplified by defining a much smaller subset of molecules with strongly limited structural, stereochemical, and functional group complexity, which would be much more likely to be selected as synthetic targets. Herein we report a subset of 10 million molecules selected from GDB-17 by first applying fragment-likeness criteria43 and further complexity reduction filters to obtain a subset of 4.6 billion fragmentlike molecules, followed by even sampling across molecular size, polarity, and stereochemical complexity to allow the smaller, less functionalized, and stereochemically simpler fragments to become apparent. Compared to the entire GDB-17, this subset of 10 million molecules, here called the fragment database FDB-17, is greatly enriched in relatively simple molecules representing the more realistic synthetic targets in GDB-17. 
The size of FDB-17 furthermore renders this smaller database better suited than the entire GDB-17 for complex virtual screening tools such as docking or 3D-shape based comparisons, which are limited to a few million molecules.44−49 Although several algorithms for generating molecules such as MOLGEN,50,51 which produces molecules fitting a given elemental formula, or various algorithms performing random bond and atom changes to gradually evolve a molecular structure52−57 could possibly also be used to generate sets of virtual fragments comparable to FDB-17, an even sampling of chemical space as demonstrated here might be difficult to achieve by such methods because they require user-defined starting structures as inputs and their molecule construction principles might contain hidden biases.



RESULTS AND DISCUSSION

Assembly of the Fragment Database FDB-17. To select a lower complexity subset of GDB-17 we defined criteria limiting the types and number of structural and functional elements allowed in a molecule (Table 1). Scaffold complexity was reduced by limiting the number of cycles, quaternary centers, and stereocenters. We furthermore implemented the

Table 1. Filtering Criteria To Reduce GDB-17 to Its 4.6G Fragment Subset

scaffolds: ≤3 rings; ≤2 small (3- or 4-membered) rings; ≤2 quaternary centers; ≤4 stereocenters; ≤3 rotatable bonds

FG density: ≤5 nitrogen + oxygen atoms; ≤1 positive charge at neutral pH; ≤1 negative charge at neutral pH; ≤3 H-bond acceptor atoms; ≤2 H-bond donor atoms

problematic/superfluous FGs: no aldehydes; no aromatic ring >6 atoms; no epoxides, aziridines; ≤1 C≡N (cyano group); no O−(C=O)−O (carbonate); no O−C=N (imidate); no NO2 (nitro); no nonaromatic C=C; no C≡C (triple bonds); no halogens


Figure 2. Frequency histograms for the 4.6G fragment subset (blue line) and the 10 million FDB-17 database (red line) across (A) molecular size (HAC: ≤11, 12, 13, 14, 15, 16, 17), (B) heteroatoms (N+O+S: ≤1, 2, 3, 4, ≥5), and (C) stereocenters (0, 1, 2, 3, 4). In D the frequency histogram is shown by individual triplet value bins (HAC, heteroatoms, stereocenters) sorted by decreasing occupancy in the 4.6G fragment subset (blue line) and in FDB-17 (red line).

Figure 3. Property histograms for GDB-17, its 4.6G fragment subset, FDB-17, and commercial fragments: a) heavy atom count, b) mass in Dalton, c) rotatable bond count, d) categories, e) hydrogen bond acceptor, f) hydrogen bond donor, g) clogP, h) stereocenters, i) O+N count, j) fsp3 ratio, and k) ring count. For d) each molecule is assigned to a single category as a function of its ring types in priority order heteroaromatic > aromatic > heterocyclic > carbocyclic > acyclic.


GDB versus aromatic for commercial fragments, reflecting the effect of combinatorial possibilities in GDB versus the dominance of aromatic chemistry in commercial fragments (Figure 3D). In terms of polarity, the even sampling method used to compose FDB-17 allowed the HBA and HBD histograms to be comparable to commercial fragments, while the complete 4.6G subset of GDB-17 peaked at the maximum allowed values (Figure 3E/F). Furthermore, the simplification of GDB-17 to the 4.6G fragment subset strongly reduced polarity as estimated by the calculated octanol:water partition coefficient clogP, reflecting the elimination of highly functionalized molecules, in particular multiply charged molecules; however, the even sampling across polarity values resulted in a broader coverage of the clogP scale by FDB-17 compared to the complete 4.6G subset (Figure 3G). In terms of molecular complexity as measured by the number of stereocenters, heteroatoms, and fraction of sp3 centers, selecting fragmentlike molecules from GDB-17 to form the 4.6G subset and the subsequent even sampling to FDB-17 significantly lowered the fraction of the most complex molecules. Nevertheless, FDB-17 still contained a much higher fraction of stereochemically complex, highly functionalized, and 3D-shaped molecules compared to commercial fragments (Figure 3H/I/J). On the other hand, the fragment subsets of GDB-17 contained relatively more molecules with 2 or 3 rings compared to the full GDB-17 and commercial fragments (Figure 3K). This effect is probably mostly a consequence of limiting the number of rotatable bonds when creating the fragment subset of GDB-17, which limits the number of monocyclic molecules more strongly than that of polycyclic molecules (Figure 3K).
We further compared the various databases in terms of molecular shape as measured by the PMI (principal moments of inertia) distinguishing rodlike, disclike, and spherelike molecules58 and observed a distribution in line with the property profiles discussed above. As for most databases of known molecules, the commercial fragmentlike molecules were predominantly planar. On the other hand, FDB-17 showed a molecular shape distribution very similar to that of GDB-17 and covered the entire shape triangle with a frequency peak at center left, showing that the complexity reduction steps taken to select fragmentlike molecules only marginally affected the overall shape distribution of the database (Figure 4).

Interactive Visualization and Nearest Neighbor Searches. To go beyond property histograms and gain a closer insight into our fragment databases, we formatted FDB-17 and the reference collection of 40,986 commercial fragments for interactive access using our recently described Mapplet application and Multifingerprint browser search tools,32,34,59 both accessible at www.gdb.unibe.ch. The mapplet features interactive color-coded maps representing the principal component plane (PC1, PC2) obtained by principal component analysis of FDB-17 represented in the multidimensional property spaces of MQN (Molecular Quantum Numbers: 42 descriptors counting atoms, bonds, polarity, and topology)27,28 and SMIfp (SMILES fingerprint: 34 descriptors counting characters appearing in the SMILES).31 As illustrated in Figure 5, the MQN-maps of FDB-17 separate molecules primarily by size and by number of rings, with a much denser coverage of the map compared to commercial fragments, in particular concerning the area covered by tricyclic molecules. On the other hand, the SMIfp-maps separate molecules by size and by the number of aromatic atoms. The coverage of FDB-17 and commercial fragments on the SMIfp-map is quite

Figure 4. Molecular shape analysis of GDB-17 versus fragments. The shape triangle is obtained by computing the PMI as described by Sauer et al.58 The color-code is the occupancy heatmap from low occupancy (blue) to high occupancy (magenta). The maximum number of compounds per pixel for each map is GDB-17: 4978, FDB-17: 2058, and commercial fragments: 115.

comparable, with the important difference that FDB-17 is more strongly populated in the lower left portion of the map grouping molecules with six or fewer aromatic atoms, which are molecules rich in nonaromatic atoms, while commercial fragments are evenly spread over the entire map. The FDB-mapplet displays the average molecule in each pixel of the color-coded maps in the side window on mouse over and provides access to the complete pixel contents by mouse right-click (Figure 6A/B). One can further select an individual molecule within the displayed list and access the multifingerprint browser Web page, where a search for nearest neighbors of that molecule can be performed in either FDB-17 or the commercial fragment set using a choice of six different fingerprints comprising the MQN27 and SMIfp31 used for the color-coded maps, as well as atom pair fingerprints APfp and Xfp perceiving shapes and pharmacophores33 and 1024-bit binary substructure (Sfp)60 and extended connectivity (ECfp4)61 fingerprints perceiving detailed substructures (Figure 6C). The nearest neighbor search is usually complete in less than 3 min, and the results are displayed as a molecule matrix and are available for download as a SMILES list (Figure 6D).

Virtual Screening. We have shown previously that virtual screening of the entire GDB-17 database by nearest neighbor searches in the MQN-property space delivers many high-scoring shape and pharmacophore analogs of known drugs as measured by the 3D-similarity function ROCS (rapid overlay of


Figure 5. MQN- and SMIfp-maps (PC1, PC2 plane) of FDB-17 and commercial fragments. The maps for the commercial fragment set were generated by projecting the molecules on to the (PC1, PC2) planes obtained using FDB-17. The color scale is blue (lowest values) → cyan → green → yellow → red → magenta (highest value). MQN-map variance covered: PC1 33%, PC2 23%. SMIfp-map variance covered: PC1 60%, PC2 14%.

chemical structures).30 To test if the much smaller database FDB-17 might similarly be used as a source for new analogs of known fragments, we searched for new analogs of fencamfamine, gabapentin, rimantadine, and levetiracetam, which are four fragmentlike drugs of 17 atoms or less (Figure 7A), in the 10 million FDB-17, the entire GDB-17, and its 4.6G fragment subset, as well as the 41,000 commercial fragments. We performed virtual screening by collecting the 10,000 nearest neighbors of each drug in MQN-space for each of the above four sets. We also collected the 10,000 nearest neighbors for FDB-17 in Xfp space, a pharmacophore similarity property space suitable to identify shape and pharmacophore analogs,33 to compare the relatively simple MQN-similarity search with a more sophisticated approach (Xfp nearest neighbors could not be computed in reasonable time for the 4.6G subset or the entire GDB-17) and used 10,000 random molecules from FDB-17 as control. Each group of 10,000 molecules was then scored for 3D-shape similarity to its parent drug using the ROCS

Tanimoto Combo score, and molecules with a score larger than 1.4 were considered as virtual hits.11 The number of virtual hits (ROCS > 1.4) found among the 10,000 molecules scored by ROCS was influenced by the drug used as reference, the source database from which the 10,000 molecules were taken, and the similarity search method used (Figure 7B and Table 2). Fencamfamine and rimantadine, which have only a single nitrogen atom as functional group, gave relatively high hit rates across all searches, probably because their simple pharmacophore is relatively easy to match. By contrast, gabapentin and levetiracetam gave much lower hit rates, probably because they both contain two functional groups using 4 to 6 atoms, resulting in a more complex pharmacophore for which good analogs are less frequent. In all four drug cases the nearest neighbor set for the MQN- and Xfp-similarity searches in FDB-17 and the MQN similarity search in the 4.6G fragment subset and the entire GDB-17 provided a substantial number of virtual hits, showing that the


Figure 6. FDB-17 mapplet and multifingerprint browser. a) View of the FDB-17 mapplet selecting a pixel on the MQN-map of FDB-17 color-coded by HAC. The selected pixel is tagged with a balloon. b) Molecules in the pixel selected in a); molecules no. 29 to 44 of the 67 compounds in that pixel are shown. c) Multifingerprint browser window. d) Search results for ECfp4-nearest neighbors of the molecule selected in c). The FDB-17 mapplet and the multifingerprint browser are accessible at www.gdb.unibe.ch.

fingerprint similarity searches could identify meaningful analogs in all three databases despite the fact that FDB-17 was much smaller than the 4.6G fragment subset or the entire GDB-17. For gabapentin and levetiracetam the number of hits was in fact even lower for the entire GDB-17 than for the fragment subsets due to the presence, among the 10,000 MQN nearest neighbors investigated in GDB-17, of lower scoring unsaturated molecules that were not present in the fragment subsets due to the selection criteria (Table 1). The random set of 10,000 molecules from FDB-17 and the 10,000 MQN neighbors selected from the relatively small set of only 41,000 commercial fragments also provided a much smaller number of hits compared to the similarity searches in FDB-17 and sometimes no hits at all. For the commercial fragments these low numbers of hits reflected the relatively small size of the database, while for the randomly selected subset of FDB-17 this showed that identifying high scoring analogs was not trivial and that the preselection by MQN or Xfp similarity was very effective in enriching for potential hits. All virtual hits (molecules with ROCS > 1.4) were collected, and their substructure similarity to the parent drug was

measured using the Tanimoto similarity coefficient of a 1024-bit binary Daylight-type fingerprint TSfp (Figure 7C).60 The hits were in part very close analogs of the starting drug (TSfp > 0.7) but also largely scaffold-hopping compounds with low substructure similarity to the parent drug (TSfp ≤ 0.7). The number of hits in each category varied according to the drug, the database used, and the similarity search method (Table 2). As for the hit rates discussed above, searches in FDB-17 returned close analogs and scaffold-hopping compounds comparably well to searches in the much larger 4.6G fragment subset or the entire GDB-17. For each of the four drugs we could readily identify, by visual inspection of the hit lists and cross-checking with literature databases (Scifinder), interesting and previously unknown molecules from FDB-17 with high ROCS similarity and either high or low substructure similarity to the parent drug (Figure 7D).
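The substructure similarity step can be illustrated with a small sketch of the Tanimoto coefficient on binary fingerprints, followed by the two-way classification of virtual hits. The function names and category labels are hypothetical, fingerprints are represented here as sets of on-bit indices (the actual Sfp bits come from JChem and are not computed here), and returning 0.0 for two empty fingerprints is a convention assumed for this example.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two binary fingerprints given as on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0                    # assumed convention for empty fingerprints
    return len(a & b) / union

def classify_hit(rocs_score, tsfp, rocs_cutoff=1.4, tsfp_cutoff=0.7):
    """Label a scored molecule using the cutoffs given in the text."""
    if rocs_score <= rocs_cutoff:
        return "not a hit"
    return "close analog" if tsfp > tsfp_cutoff else "scaffold hop"

# Two fingerprints sharing 2 of 4 distinct on-bits have a similarity of 0.5.
similarity = tanimoto({1, 2, 3}, {2, 3, 4})
label = classify_hit(1.52, similarity)    # high ROCS, low TSfp
```

With these cutoffs, a molecule scoring above 1.4 in ROCS but at or below 0.7 in TSfp is counted in the scaffold-hopping column of Table 2.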



CONCLUSION

A series of simplification criteria were applied to the 166.4 billion molecules in GDB-17 to extract a fragmentlike subset of only 4.6 billion molecules, representing a 36-fold reduction in


Figure 7. Virtual screening of FDB-17 and related databases for analogs of known drugs. (a) Structures of the four reference drugs. (b) Histogram of ROCS scores of the 10,000 nearest neighbors of each drug in the various databases and for the control set. (c) Histogram of substructure fingerprint Tanimoto similarity of the virtual hits (ROCS > 1.4) obtained from each set. (d) Examples of yet unknown, high ROCS analogs of the parent drugs identified in FDB-17 spanning a range of substructure fingerprint Tanimoto similarity values.

Table 2. Virtual Screening of FDB-17 and Related Fragment Databases for Drug Analogs

no. of molecules with ROCS > 1.4: all/TSfp > 0.7/TSfp ≤ 0.7

source of 10k molecules scored   fencamfamine (g)   gabapentin (g)   rimantadine (g)   levetiracetam (g)
GDB-17 MQN (a)                   1298/7/1291        169/28/141       3141/1020/2121    26/12/14
4.6G MQN (b)                     1535/339/1196      486/0/486        2611/1520/1091    265/61/204
FDB-17 MQN (c)                   1018/74/944        187/42/145       2044/1323/721     248/12/236
FDB-17 Xfp (d)                   1595/285/1310      110/52/58        2560/1290/1270    282/14/268
FDB-17 random (e)                177/2/175          4/0/4            117/28/89         2/0/2
commercial fragments (f)         103/2/101          13/4/9           276/17/259        0/0/0

(a) The 10,000 nearest neighbors of the indicated drug were retrieved as MQN neighbors from GDB-17. (b) The 10,000 nearest neighbors of the indicated drug were retrieved as MQN neighbors from the 4.6G fragment subset of GDB-17. (c) The 10,000 nearest neighbors of the indicated drug were retrieved as MQN neighbors from FDB-17. (d) The 10,000 nearest neighbors of the indicated drug were retrieved as Xfp neighbors from FDB-17. (e) 10,000 randomly selected molecules from FDB-17. (f) The 10,000 nearest neighbors of the indicated drug were retrieved as MQN neighbors from the 41k commercial fragment set. (g) Drug.
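The counts in Table 2 can be checked and converted to hit rates with a few lines of arithmetic. The sketch below assumes that the four blocks of numbers in the table correspond, in order, to fencamfamine, gabapentin, rimantadine, and levetiracetam, an assignment that agrees with the hit rates discussed in the text; the dictionary layout and the function name `hit_rate` are chosen for this example only.

```python
SOURCES = ["GDB-17 MQN", "4.6G MQN", "FDB-17 MQN",
           "FDB-17 Xfp", "FDB-17 random", "commercial fragments"]

# (all hits, TSfp > 0.7, TSfp <= 0.7) per source, in the order of SOURCES.
TABLE2 = {
    "fencamfamine": [(1298, 7, 1291), (1535, 339, 1196), (1018, 74, 944),
                     (1595, 285, 1310), (177, 2, 175), (103, 2, 101)],
    "gabapentin": [(169, 28, 141), (486, 0, 486), (187, 42, 145),
                   (110, 52, 58), (4, 0, 4), (13, 4, 9)],
    "rimantadine": [(3141, 1020, 2121), (2611, 1520, 1091), (2044, 1323, 721),
                    (2560, 1290, 1270), (117, 28, 89), (276, 17, 259)],
    "levetiracetam": [(26, 12, 14), (265, 61, 204), (248, 12, 236),
                      (282, 14, 268), (2, 0, 2), (0, 0, 0)],
}

def hit_rate(all_hits, scored=10_000):
    """Percentage of the scored molecules that are virtual hits."""
    return 100.0 * all_hits / scored

for drug, rows in TABLE2.items():
    for source, (total, close, hopping) in zip(SOURCES, rows):
        # Each total must be the sum of close analogs and scaffold hops.
        assert total == close + hopping
        print(f"{drug:14s} {source:22s} {hit_rate(total):6.2f}%")
```

Every row satisfies all = (TSfp > 0.7) + (TSfp ≤ 0.7), which supports the column assignment used above.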

three-dimensional in contrast to the mostly planar commercial fragments. Furthermore, virtual screening of FDB-17 for analogs of known drugs delivers results that are comparable to virtual screening using the entire 4.6G fragment subset of GDB-17. FDB-17 is available for download at www.gdb.unibe.ch together with an interactive MQN/SMIfp-mapplet application to visualize its contents and the corresponding MQN- and

size. Sampling molecules evenly across value triplets (heavy atoms, heteroatoms, stereocenters) allowed a further 460-fold reduction in size to form the fragment database FDB-17 containing 10 million molecules covering a broad range of molecular size, polarity, and complexity. The molecules in FDB-17 resemble commercial fragments in terms of typical fragment criteria; however, their molecular shape is highly


MQN). The HBDA plugin was used for determining the HBD and HBA properties of atoms (for 42D MQN and 55D Xfp). The computation of the SMIfp (34D) was performed by first generating the unique SMILES for a molecule using JChem, followed by counting of 34 different characters in the SMILES string using in-house written code. The two binary fingerprints were a) Sfp (a Daylight-type 1024-bit hashed fingerprint with a maximum path length of 7 bonds), computed using the ChemicalFingerprint class of the JChem library, and b) ECfp4 (a 1024-bit circular extended connectivity fingerprint with a bond diameter of 4), computed using the ECFP class of the JChem library.

Principal Component Analysis (PCA). For the FDB-17 MQN- and SMIfp-fingerprint data sets, PCA was performed using an in-house written Java program utilizing the JSci science library for computation of eigenvalues and eigenvectors.67 The Java source code is based on an online tutorial by Lindsay I. Smith.68

MQN- and SMIfp-Maps. For each molecule in the FDB-17 MQN-30 and SMIfp-data sets,31 PC-1 and PC-2 values were calculated using the first two eigenvectors obtained from the respective sets. This was followed by computation of the largest (PCmax) and smallest (PCmin) PC values appearing in PC-1 or PC-2 for the FDB-17 MQN- and SMIfp-data sets. Afterward, the value range ΔPC = PCmax − PCmin was defined for each of the two FDB-17 data sets and was used to set the binning scale as ΔPC/1000. The PC-1 and PC-2 values were binned onto 1000 × 1000 2D-grids using the same absolute bin size on the PC-1 and PC-2 axes. Each molecule was assigned to a point on this 2D-grid. The same procedure was followed for the commercial fragment MQN- and SMIfp-data sets, with two differences: the eigenvectors obtained with FDB-17 were used for projection to allow direct comparison of both data sets on the maps, and the binning scale was set to ΔPC/500, so that each molecule was assigned to a point on 500 × 500 2D-grids.
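The assignment of molecules to map points can be sketched as follows. This is a minimal illustration rather than the in-house Java code: the function name `bin_to_grid` is hypothetical, using PCmin as the shared origin of both axes is an interpretation of the text, and placing the single largest value into the last cell instead of a cell of its own is an assumption.

```python
def bin_to_grid(pc_points, grid_size=1000):
    """Assign (PC-1, PC-2) pairs to cells of a grid_size x grid_size grid.

    The same absolute bin size, (PCmax - PCmin) / grid_size, is used on both
    axes, with PCmax and PCmin taken over the PC-1 and PC-2 values together.
    Returns a dict mapping (x, y) cells to lists of molecule indices.
    """
    values = [v for point in pc_points for v in point]
    pc_min, pc_max = min(values), max(values)
    bin_size = (pc_max - pc_min) / grid_size
    grid = {}
    for index, (pc1, pc2) in enumerate(pc_points):
        if bin_size == 0:
            cell = (0, 0)             # degenerate case: all values identical
        else:
            x = min(int((pc1 - pc_min) / bin_size), grid_size - 1)
            y = min(int((pc2 - pc_min) / bin_size), grid_size - 1)
            cell = (x, y)
        grid.setdefault(cell, []).append(index)
    return grid

# Four molecules on a coarse 10 x 10 grid; molecules 1 and 2 share a pixel.
cells = bin_to_grid([(0.0, 0.0), (10.0, 5.0), (10.0, 5.0), (2.5, 9.99)],
                    grid_size=10)
```

Storing only occupied cells in a dictionary keeps memory use proportional to the number of populated pixels rather than to the full grid.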
The points in each map were color coded according to the average and standard deviation of the property values of the molecules at the given point in the map. The HSL color space (Hue-Saturation-Lightness) was used for color coding. The hue value was set according to the average value of the property, while the saturation value was set according to the standard deviation of the property at the given point in the map. The color changes from blue (lowest property value) to cyan to green to yellow to red to magenta (highest property value).

MQN- and SMIfp-Mapplet. The Java-based mapplet was designed to visualize the MQN- and SMIfp-maps of the FDB-17 and commercial fragment data sets. The functions of the mapplet are illustrated in detail in the respective publication from our group.32

Multi-Fingerprint Browser. The Web-based application was designed to perform the similarity search using any of the six fingerprints mentioned before (APfp, Xfp, MQN, SMIfp, Sfp, and ECfp4). The construction of this browser is based on the same principle we reported earlier for virtual screening in the GDB-17 and ZINC databases.69
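The color coding of map points described above can be illustrated with Python's standard `colorsys` module. The sketch assumes a linear hue ramp running from blue (240°) through cyan, green, yellow, and red to magenta (300°); the function name `property_to_rgb` is hypothetical, and because the text does not specify how the standard deviation is scaled to a saturation value, the saturation is simply passed in as a number between 0 and 1.

```python
import colorsys

def property_to_rgb(average, prop_min, prop_max, saturation=1.0, lightness=0.5):
    """Map an average property value to an RGB color on a blue-to-magenta ramp."""
    if prop_max == prop_min:
        t = 0.0
    else:
        t = (average - prop_min) / (prop_max - prop_min)
    t = min(max(t, 0.0), 1.0)                  # clamp to the color scale
    # Hue runs 240 (blue) -> 180 -> 120 -> 60 -> 0 (red) -> 300 (magenta).
    hue_degrees = (240.0 - 300.0 * t) % 360.0
    r, g, b = colorsys.hls_to_rgb(hue_degrees / 360.0, lightness, saturation)
    return (round(r * 255), round(g * 255), round(b * 255))

lowest = property_to_rgb(0, 0, 10)     # blue end of the scale
highest = property_to_rgb(10, 0, 10)   # magenta end of the scale
```

Note that `colorsys.hls_to_rgb` takes its arguments in the order hue, lightness, saturation, with all three expressed as fractions between 0 and 1.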

SMIfp-nearest neighbor search tools to perform ligand-based virtual screening on FDB-17. This resource should be useful for virtual screening and to inspire the synthesis of new fragments.



METHODS

Database Assembly. The programming code used for the filtering is written in Java and depends on the JChem libraries from ChemAxon.62 The calculation was done with a 500 node cluster in 10,000 CPU hours. To obtain the 4.6G fragment subset the filter rules (Table 1) were applied starting from the MQN annotated version of GDB-17.30 To select the FDB-17 set we binned the 4.6G fragment subset into 175 bins corresponding to all possible value triplet combinations (HAC, heteroatoms, stereocenters) with HAC = ≤11, 12, 13, 14, 15, 16, and 17 (7 values), heteroatoms = ≤1, 2, 3, 4, ≥5 (5 values), and stereocenters = 0, 1, 2, 3, 4 (5 values). Bins with a content N ≤ 70,000 molecules were taken completely. For bins with content N > 70,000 molecules we used the Java random number generator to produce random numbers RANi (i = 1, 2, 3, ...) in the interval 0 ≤ RANi ≤ S (S = 3 for N < 500,000, S = 20 for N > 500,000). We then sampled the bin file, containing one SMILES per line, at line number Li = (N/60,000) − RANi, L(i+1) = Li + (N/60,000) − RAN(i+1), etc. until the last line was reached, which produced an approximately random selection sampling across the entire bin content. To minimize calculation time all criteria which can be read directly out of the MQN values were applied first (the no halogens rule and almost all scaffolds and FG density rules). The more complex filters (all problematic/superfluous FGs + number of stereocenters) were applied in a second run. For the commercial fragments we collected 110,998 molecules up to 17 atoms from the online catalogs AnalytiCon, ChemBridge, Enamine, FRAGMENTA, BIONET, LifeChemical, Maybridge, and Vitas. The standard Congreve’s rule of three (mass ≤ 300, hba ≤ 3, hbd ≤ 3, logP ≤ 3, rbc ≤ 3, and psa ≤ 60) was then applied, and duplicates were removed, leaving 40,986 molecules.

PMI-Maps (Principal Moment of Inertia).
The shape analysis was carried out following the protocol of Sauer and Schwarz58 with in-house software written in Java. For each molecule a single low energy conformer for a single 3D-stereoisomer was generated using CORINA.

ROCS and Substructure Comparison. For each molecule a single low energy conformer was generated for each possible stereoisomer using OMEGA.63,64 ROCS was used to compare the resulting 3D structures against the corresponding query.65,66 Only the stereoisomer with the highest ROCS score was selected for further studies. The novelty of the example molecules shown in Figure 7 was verified by confirming that the structures are not known in Scifinder.

Calculation of Fingerprints. Molecules were processed in SMILES format using an in-house written Java program utilizing the JChem ChemAxon library as a starting point.62 Molecules were protonated at pH 7.4, counterions were removed, and valence errors were checked. All the fingerprints (APfp, Xfp, MQN, SMIfp, Sfp, and ECfp4) were calculated with the same program. The detailed procedure for computation of each fingerprint is described in the respective publication from our group.27,31,33,60 Briefly, the TopologyAnalyzer plugin was used for the calculation of the shortest topological path between a pair of atoms (for 21D APfp and 55D Xfp) and for determination of ring topology (for 42D

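The bin assignment and stride sampling described under Database Assembly can be sketched in a few lines. This is a minimal illustration, not the authors' Java code: the names `bin_key` and `sample_line_numbers` are hypothetical, Python's `random` module stands in for the Java random number generator, RANi is assumed to be a real number drawn uniformly from [0, S], S = 20 is assumed for the unspecified case N = 500,000, and steps shorter than one line are clamped to 1 so that the walk always advances.

```python
import random

FULL_BIN_LIMIT = 70_000   # bins up to this size are taken completely
STRIDE_DIVISOR = 60_000   # divisor in the stride N/60,000


def bin_key(hac, heteroatoms, stereocenters):
    """Map a molecule to one of the 7 x 5 x 5 = 175 bins."""
    hac_bin = max(hac, 11)                      # HAC <= 11 share one bin
    hetero_bin = min(max(heteroatoms, 1), 5)    # <= 1 and >= 5 are end bins
    stereo_bin = min(stereocenters, 4)          # assumption: source lists only 0-4
    return (hac_bin, hetero_bin, stereo_bin)


def sample_line_numbers(n_lines, rng):
    """Return the 1-based line numbers to keep from a bin file of n_lines."""
    if n_lines <= FULL_BIN_LIMIT:
        return list(range(1, n_lines + 1))
    s = 3 if n_lines < 500_000 else 20          # assumption for N = 500,000
    stride = n_lines / STRIDE_DIVISOR
    picked = []
    position = 0.0
    while True:
        step = stride - rng.uniform(0, s)       # L(i+1) = Li + N/60,000 - RAN(i+1)
        position += max(1.0, step)              # assumption: never step backwards
        if position > n_lines:
            break
        picked.append(int(position))
    return picked
```

Because each step is the stride N/60,000 shortened by a random offset, this sketch returns somewhat more than 60,000 lines for a large bin, for example `sample_line_numbers(6_000_000, random.Random(0))`.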

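The rule-of-three filter applied to the commercial fragments can be written as a simple predicate followed by duplicate removal. The sketch below assumes that the six descriptors have already been computed by a cheminformatics toolkit and that each SMILES is canonical, so that string equality identifies duplicates; the `Fragment` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    smiles: str    # assumed canonical, so equal strings mean equal molecules
    mass: float    # molecular mass
    hba: int       # hydrogen-bond acceptors
    hbd: int       # hydrogen-bond donors
    logp: float    # calculated logP
    rbc: int       # rotatable bond count
    psa: float     # polar surface area


def passes_rule_of_three(f):
    """Thresholds as stated in the text: mass <= 300, psa <= 60, all others <= 3."""
    return (f.mass <= 300 and f.hba <= 3 and f.hbd <= 3
            and f.logp <= 3 and f.rbc <= 3 and f.psa <= 60)


def filter_fragments(fragments):
    """Apply the rule of three, then drop duplicates, keeping first occurrences."""
    kept, seen = [], set()
    for f in fragments:
        if not passes_rule_of_three(f):
            continue
        if f.smiles in seen:
            continue
        seen.add(f.smiles)
        kept.append(f)
    return kept
```

Filtering before deduplication mirrors the order given in the text; the result is the same in either order.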

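For orientation, a PMI map places each conformer at the coordinates (I1/I3, I2/I3), where I1 ≤ I2 ≤ I3 are the principal moments of inertia; rods, discs, and spheres fall at the corners (0, 1), (0.5, 0.5), and (1, 1). The following sketch is not the in-house Java software mentioned above. It assumes unit atomic masses unless masses are supplied, uses the standard closed-form eigenvalue formula for a symmetric 3 × 3 matrix, and all function names are hypothetical.

```python
import math


def inertia_tensor(coords, masses=None):
    """Inertia tensor about the center of mass (unit masses by default)."""
    if masses is None:
        masses = [1.0] * len(coords)            # assumption: unweighted atoms
    total = sum(masses)
    cx = sum(m * c[0] for m, c in zip(masses, coords)) / total
    cy = sum(m * c[1] for m, c in zip(masses, coords)) / total
    cz = sum(m * c[2] for m, c in zip(masses, coords)) / total
    ixx = iyy = izz = ixy = ixz = iyz = 0.0
    for m, c in zip(masses, coords):
        x, y, z = c[0] - cx, c[1] - cy, c[2] - cz
        ixx += m * (y * y + z * z)
        iyy += m * (x * x + z * z)
        izz += m * (x * x + y * y)
        ixy -= m * x * y
        ixz -= m * x * z
        iyz -= m * y * z
    return [[ixx, ixy, ixz], [ixy, iyy, iyz], [ixz, iyz, izz]]


def symmetric_eigenvalues(a):
    """Eigenvalues of a symmetric 3x3 matrix in ascending order.

    Trigonometric closed form; accurate enough for plotting, although a
    library eigen-solver is the more robust choice.
    """
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:                               # already diagonal
        return sorted((a[0][0], a[1][1], a[2][2]))
    q = (a[0][0] + a[1][1] + a[2][2]) / 3
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2 * p1
    p = math.sqrt(p2 / 6)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    det = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
           - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
           + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det / 2))            # clamp rounding overshoot
    phi = math.acos(r) / 3
    e_max = q + 2 * p * math.cos(phi)
    e_min = q + 2 * p * math.cos(phi + 2 * math.pi / 3)
    e_mid = 3 * q - e_max - e_min
    return [e_min, e_mid, e_max]


def normalized_pmi(coords, masses=None):
    """Return (I1/I3, I2/I3) for one conformer."""
    i1, i2, i3 = symmetric_eigenvalues(inertia_tensor(coords, masses))
    if i3 == 0.0:
        raise ValueError("need at least two distinct atom positions")
    return i1 / i3, i2 / i3
```

Calling `normalized_pmi` on the heavy-atom coordinates of one CORINA conformer per molecule would give one point per molecule on such a map.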
AUTHOR INFORMATION

Corresponding Author

*Fax: +41 31 631 80 57. E-mail: [email protected]. ch.

ORCID

Jean-Louis Reymond: 0000-0003-2724-2942

DOI: 10.1021/acs.jcim.7b00020 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Author Contributions

R.V. realized the project and wrote the paper. M.A. computed the MQN- and SMIfp-search mapplets and wrote the paper. J.L.R. designed and supervised the project and wrote the paper.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS This work was supported financially by the University of Berne, the Swiss National Science Foundation, and the NCCR TransCure. We thank OpenEye Scientific Software and ChemAxon Pvt. Ltd. for free academic and web licenses for their products.

REFERENCES

(1) Bleicher, K. H.; Bohm, H. J.; Muller, K.; Alanine, A. I. Hit and Lead Generation: Beyond High-Throughput Screening. Nat. Rev. Drug Discovery 2003, 2, 369−378.
(2) Cayley, E. Ueber Die Analytischen Figuren, Welche in Der Mathematik Bäume Genannt Werden Und Ihre Anwendung Auf Die Theorie Chemischer Verbindungen. Ber. Dtsch. Chem. Ges. 1875, 8, 1056−1059.
(3) Schiff, H. Zur Statistik Chemischer Verbindungen. Ber. Dtsch. Chem. Ges. 1875, 8, 1542−1547.
(4) Henze, H. R.; Blair, C. M. The Number of Isomeric Hydrocarbons of the Methane Series. J. Am. Chem. Soc. 1931, 53, 3077−3085.
(5) Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings. Adv. Drug Delivery Rev. 1997, 23, 3−25.
(6) Bohacek, R. S.; McMartin, C.; Guida, W. C. The Art and Practice of Structure-Based Drug Design: A Molecular Modeling Perspective. Med. Res. Rev. 1996, 16, 3−50.
(7) Ertl, P. Cheminformatics Analysis of Organic Substituents: Identification of the Most Common Substituents, Calculation of Substituent Properties, and Automatic Identification of Drug-Like Bioisosteric Groups. J. Chem. Inf. Comput. Sci. 2003, 43, 374−380.
(8) Kirkpatrick, P.; Ellis, C. Chemical Space. Nature 2004, 432, 823−823.
(9) Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Model. 1988, 28, 31−36.
(10) Sadowski, J.; Gasteiger, J. From Atoms and Bonds to 3-Dimensional Atomic Coordinates - Automatic Model Builders. Chem. Rev. 1993, 93, 2567−2581.
(11) Nicholls, A.; McGaughey, G. B.; Sheridan, R. P.; Good, A. C.; Warren, G.; Mathieu, M.; Muchmore, S. W.; Brown, S. P.; Grant, J. A.; Haigh, J. A.; et al. Molecular Shape and Medicinal Chemistry: A Perspective. J. Med. Chem. 2010, 53, 3862−3886.
(12) Danziger, D. J.; Dean, P. M. Automated Site-Directed Drug Design: A General Algorithm for Knowledge Acquisition About Hydrogen-Bonding Regions at Protein Surfaces. Proc. R. Soc. London, Ser. B 1989, 236, 101−113.
(13) Lewell, X. Q.; Judd, D. B.; Watson, S. P.; Hann, M. M. RECAP-Retrosynthetic Combinatorial Analysis Procedure: A Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry. J. Chem. Inf. Comput. Sci. 1998, 38, 511−522.
(14) Leach, A. R.; Hann, M. M. The in Silico World of Virtual Libraries. Drug Discovery Today 2000, 5, 326−336.
(15) Patel, H.; Bodkin, M. J.; Chen, B.; Gillet, V. J. Knowledge-Based Approach to De Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 2009, 49, 1163−1184.
(16) Hu, Q.; Peng, Z.; Kostrowicki, J.; Kuki, A. Leap into the Pfizer Global Virtual Library (PGVL) Space: Creation of Readily Synthesizable Design Ideas Automatically. Methods Mol. Biol. 2011, 685, 253−276.
(17) Foloppe, N. The Benefits of Constructing Leads from Fragment Hits. Future Med. Chem. 2011, 3, 1111−1115.
(18) Murray, C. W.; Rees, D. C. The Rise of Fragment-Based Drug Discovery. Nat. Chem. 2009, 1, 187−192.
(19) Fink, T.; Bruggesser, H.; Reymond, J. L. Virtual Exploration of the Small-Molecule Chemical Universe Below 160 Da. Angew. Chem., Int. Ed. 2005, 44, 1504−1508.
(20) Fink, T.; Reymond, J. L. Virtual Exploration of the Chemical Universe up to 11 Atoms of C, N, O, F: Assembly of 26.4 Million Structures (110.9 Million Stereoisomers) and Analysis for New Ring Systems, Stereochemistry, Physicochemical Properties, Compound Classes, and Drug Discovery. J. Chem. Inf. Model. 2007, 47, 342−353.
(21) Reymond, J. L.; Van Deursen, R.; Blum, L. C.; Ruddigkeit, L. Chemical Space as a Source for New Drugs. MedChemComm 2010, 1, 30−38.
(22) Johansson, E. M. V.; Kadam, R. U.; Rispoli, G.; Crusz, S. A.; Bartels, K.-M.; Diggle, S. P.; Camara, M.; Williams, P.; Jaeger, K.-E.; Darbre, T.; et al. Inhibition of Pseudomonas Aeruginosa Biofilms with a Glycopeptide Dendrimer Containing D-Amino Acids. MedChemComm 2011, 2, 418−420.
(23) Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J. L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52, 2864−2875.
(24) Reymond, J. L.; Ruddigkeit, L.; Blum, L. C.; Van Deursen, R. The Enumeration of Chemical Space. WIREs Comput. Mol. Sci. 2012, 2, 717.
(25) Reymond, J. L. The Chemical Space Project. Acc. Chem. Res. 2015, 48, 722−730.
(26) McKay, B. D. Practical Graph Isomorphism. Congressus Numerantium 1981, 30, 45−87.
(27) Nguyen, K. T.; Blum, L. C.; van Deursen, R.; Reymond, J.-L. Classification of Organic Molecules by Molecular Quantum Numbers. ChemMedChem 2009, 4, 1803−1805.
(28) van Deursen, R.; Blum, L. C.; Reymond, J. L. A Searchable Map of PubChem. J. Chem. Inf. Model. 2010, 50, 1924−1934.
(29) Blum, L. C.; van Deursen, R.; Reymond, J. L. Visualisation and Subsets of the Chemical Universe Database GDB-13 for Virtual Screening. J. Comput.-Aided Mol. Des. 2011, 25, 637−647.
(30) Ruddigkeit, L.; Blum, L. C.; Reymond, J.-L. Visualization and Virtual Screening of the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2013, 53, 56−65.
(31) Schwartz, J.; Awale, M.; Reymond, J.-L. SMIfp (SMILES Fingerprint) Chemical Space for Virtual Screening and Visualization of Large Databases of Organic Molecules. J. Chem. Inf. Model. 2013, 53, 1979−1989.
(32) Awale, M.; van Deursen, R.; Reymond, J. L. MQN-Mapplet: Visualization of Chemical Space with Interactive Maps of DrugBank, ChEMBL, PubChem, GDB-11, and GDB-13. J. Chem. Inf. Model. 2013, 53, 509−518.
(33) Awale, M.; Reymond, J. L. Atom Pair 2D-Fingerprints Perceive 3D-Molecular Shape and Pharmacophores for Very Fast Virtual Screening of ZINC and GDB-17. J. Chem. Inf. Model. 2014, 54, 1892−1897.
(34) Ruddigkeit, L.; Awale, M.; Reymond, J. L. Expanding the Fragrance Chemical Space for Virtual Screening. J. Cheminf. 2014, 6, 27−39.
(35) Law, V.; Knox, C.; Djoumbou, Y.; Jewison, T.; Guo, A. C.; Liu, Y.; Maciejewski, A.; Arndt, D.; Wilson, M.; Neveu, V.; et al. DrugBank 4.0: Shedding New Light on Drug Metabolism. Nucleic Acids Res. 2014, 42, D1091−D1097.
(36) Bento, A. P.; Gaulton, A.; Hersey, A.; Bellis, L. J.; Chambers, J.; Davies, M.; Kruger, F. A.; Light, Y.; Mak, L.; McGlinchey, S.; et al. The ChEMBL Bioactivity Database: An Update. Nucleic Acids Res. 2014, 42, D1083−D1090.
(37) Sterling, T.; Irwin, J. J. ZINC 15–Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55, 2324−2337.
(38) Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B. A.; et al. PubChem Substance and Compound Databases. Nucleic Acids Res. 2016, 44, D1202−D1213.
(39) Nguyen, K. T.; Syed, S.; Urwyler, S.; Bertrand, S.; Bertrand, D.; Reymond, J. L. Discovery of NMDA Glycine Site Inhibitors from the Chemical Universe Database GDB. ChemMedChem 2008, 3, 1520−4.
(40) Luethi, E.; Nguyen, K. T.; Burzle, M.; Blum, L. C.; Suzuki, Y.; Hediger, M.; Reymond, J. L. Identification of Selective Norbornane-Type Aspartate Analogue Inhibitors of the Glutamate Transporter 1 (GLT-1) from the Chemical Universe Generated Database (GDB). J. Med. Chem. 2010, 53, 7236−7250.
(41) Garcia-Delgado, N.; Bertrand, S.; Nguyen, K. T.; van Deursen, R.; Bertrand, D.; Reymond, J.-L. Exploring Alpha 7-Nicotinic Receptor Ligand Diversity by Scaffold Enumeration from the Chemical Universe Database GDB. ACS Med. Chem. Lett. 2010, 1, 422−426.
(42) Brethous, L.; Garcia-Delgado, N.; Schwartz, J.; Bertrand, S.; Bertrand, D.; Reymond, J. L. Synthesis and Nicotinic Receptor Activity of Chemical Space Analogues of N-(3r)-1-Azabicyclo[2.2.2]Oct-3-Yl-4-Chlorobenzamide (PNU-282,987) and 1,4-Diazabicyclo[3.2.2]Nonane-4-Carboxylic Acid 4-Bromophenyl Ester (SSR180711). J. Med. Chem. 2012, 55, 4605−4618.
(43) Congreve, M.; Carr, R.; Murray, C.; Jhoti, H. A Rule of Three for Fragment-Based Lead Discovery? Drug Discovery Today 2003, 8, 876−877.
(44) Klebe, G. Virtual Ligand Screening: Strategies, Perspectives and Limitations. Drug Discovery Today 2006, 11, 580−594.
(45) Scior, T.; Bender, A.; Tresadern, G.; Medina-Franco, J. L.; Martinez-Mayorga, K.; Langer, T.; Cuanalo-Contreras, K.; Agrafiotis, D. K. Recognizing Pitfalls in Virtual Screening: A Critical Review. J. Chem. Inf. Model. 2012, 52, 867−881.
(46) Heikamp, K.; Bajorath, J. The Future of Virtual Compound Screening. Chem. Biol. Drug Des. 2013, 81, 33−40.
(47) Kolb, P.; Ferreira, R. S.; Irwin, J. J.; Shoichet, B. K. Docking and Chemoinformatic Screens for New Ligands and Targets. Curr. Opin. Biotechnol. 2009, 20, 429−36.
(48) Forino, M.; Jung, D.; Easton, J. B.; Houghton, P. J.; Pellecchia, M. Virtual Docking Approaches to Protein Kinase B Inhibition. J. Med. Chem. 2005, 48, 2278−2281.
(49) Geppert, H.; Vogt, M.; Bajorath, J. Current Trends in Ligand-Based Virtual Screening: Molecular Representations, Data Mining Methods, New Application Areas, and Performance Evaluation. J. Chem. Inf. Model. 2010, 50, 205−216.
(50) Wieland, T.; Kerber, A.; Laue, R. Principles of the Generation of Constitutional and Configurational Isomers. J. Chem. Inf. Comput. Sci. 1996, 36, 413−419.
(51) Buchanan, B. G.; Smith, D. H.; White, W. C.; Gritter, R. J.; Feigenbaum, E. A.; Lederberg, J.; Djerassi, C. Applications of Artificial Intelligence for Chemical Inference. 22. Automatic Rule Formation in Mass Spectrometry by Means of the Meta-Dendral Program. J. Am. Chem. Soc. 1976, 98, 6168−6178.
(52) Gillet, V. J.; Newell, W.; Mata, P.; Myatt, G.; Sike, S.; Zsoldos, Z.; Johnson, A. P. SPROUT: Recent Developments in the De Novo Design of Molecules. J. Chem. Inf. Model. 1994, 34, 207−217.
(53) Brown, N.; McKay, B.; Gilardoni, F.; Gasteiger, J. A Graph-Based Genetic Algorithm and Its Application to the Multiobjective Evolution of Median Molecules. J. Chem. Inf. Comput. Sci. 2004, 44, 1079−1087.
(54) Brown, N.; McKay, B.; Gasteiger, J. The De Novo Design of Median Molecules within a Property Range of Interest. J. Comput.-Aided Mol. Des. 2004, 18, 761−771.
(55) Lameijer, E. W.; Kok, J. N.; Back, T.; Ijzerman, A. P. The Molecule Evoluator. An Interactive Evolutionary Algorithm for the Design of Drug-Like Molecules. J. Chem. Inf. Model. 2006, 46, 545−552.
(56) van Deursen, R.; Reymond, J. L. Chemical Space Travel. ChemMedChem 2007, 2, 636−640.
(57) Virshup, A. M.; Contreras-Garcia, J.; Wipf, P.; Yang, W.; Beratan, D. N. Stochastic Voyages into Uncharted Chemical Space Produce a Representative Library of All Possible Drug-Like Compounds. J. Am. Chem. Soc. 2013, 135, 7296−7303.
(58) Sauer, W. H.; Schwarz, M. K. Molecular Shape Diversity of Combinatorial Libraries: A Prerequisite for Broad Bioactivity. J. Chem. Inf. Comput. Sci. 2003, 43, 987−1003.
(59) Awale, M.; Reymond, J. L. Similarity Mapplet: Interactive Visualization of the Directory of Useful Decoys and ChEMBL in High Dimensional Chemical Spaces. J. Chem. Inf. Model. 2015, 55, 1509−1516.
(60) Hagadone, T. R. Molecular Substructure Similarity Searching: Efficient Retrieval in Two-Dimensional Structure Databases. J. Chem. Inf. Model. 1992, 32, 515−521.
(61) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742−754.
(62) ChemAxon Ltd. http://www.chemaxon.com (accessed Feb 20, 2017).
(63) Hawkins, P. C.; Skillman, A. G.; Warren, G. L.; Ellingson, B. A.; Stahl, M. T. Conformer Generation with Omega: Algorithm and Validation Using High Quality Structures from the Protein Databank and Cambridge Structural Database. J. Chem. Inf. Model. 2010, 50, 572−584.
(64) Hawkins, P. C.; Nicholls, A. Conformer Generation with Omega: Learning from the Data Set and the Analysis of Failures. J. Chem. Inf. Model. 2012, 52, 2919−2936.
(65) Rush, T. S.; Grant, J. A.; Mosyak, A.; Nicholls, A. A Shape-Based 3-D Scaffold Hopping Method and Its Application to a Bacterial Protein−Protein Interaction. J. Med. Chem. 2005, 48, 1489−1495.
(66) Hawkins, P. C.; Skillman, A. G.; Nicholls, A. Comparison of Shape-Matching and Docking as Virtual Screening Tools. J. Med. Chem. 2007, 50, 74−82.
(67) The JSci Science Library. http://jsci.sourceforge.net/ (accessed Feb 20, 2017).
(68) University of Otago, New Zealand. http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf (accessed Feb 20, 2017).
(69) Awale, M.; Reymond, J. L. A Multi-Fingerprint Browser for the ZINC Database. Nucleic Acids Res. 2014, 42, W234−W239.