Perspective pubs.acs.org/jmc
Combining Molecular Scaffolds from FDA Approved Drugs: Application to Drug Discovery
Miniperspective
Richard D. Taylor,*,† Malcolm MacCoss,‡ and Alastair D. G. Lawson†
† UCB, 216 Bath Road, Slough, SL1 3WE, U.K.
‡ Bohicket Pharma Consulting LLC, 2556 Seabrook Island Road, Seabrook Island, South Carolina 29455, United States
ABSTRACT: We have enumerated all linear combinations of ring systems from FDA approved drugs, up to three rings in length and with linkers of up to four bonds, to give an in silico database of approximately 14 million molecules. This virtual library was compared with molecular databases of published and commercially available compounds to assess the prevalence of drug ring combinations in modern medicinal chemistry and to identify areas of under-represented, but clinically validated, chemical space. From the 10 trillion molecular comparisons, we found that less than 1% of the possible combinations of drug ring systems appear in commercially available libraries. This key observation highlights significant opportunities to design new fragment-like and lead-like libraries aimed at improving success rates and reducing risk in small molecule drug discovery, since, based on our previous analysis (Taylor et al. J. Med. Chem. 2014, 57, 5845−5849), approximately 70% of all new drugs are made up only of ring systems that have been used in existing drugs.
■
INTRODUCTION
The combined capitalized cost per New Medical Entity (NME) launch has been estimated to be $166 million for hit-to-lead and $414 million for lead optimization.1 With this in mind, and given the large attrition rates in drug discovery,2 it is vital to start with a high quality scaffold for a drug discovery project to be successful. To assist with the identification of hits from small molecule libraries and the subsequent compound progression, there have been many attempts to enumerate chemical space with varying degrees of success. This is a significant challenge, and a number of publications have estimated the size of, and subsequently enumerated, druglike space.3 Owing to the large numbers of molecules required for full enumeration, these libraries are typically restricted using parameters such as molecular weight or the number of heavy atoms.
The chemical space of druglike molecules has been estimated to be in excess of 10⁶⁰ molecules4 and up to 10²⁴ for all molecules up to a total of 30 atoms.5 It is interesting to note that both figures exceed the current photometry-based estimate of 6 × 10²² stars in the universe.6 However, with this number of structures, there is also the additional problem of triaging the compounds and having practical methods to reduce the numbers to a manageable subset that is likely to have biological activity in a biochemical assay.
The concept of islands of biologically useful chemical space and privileged scaffolds has been well documented.7−11 There have been many attempts to identify the biologically relevant space; for example, Ertl12 showed from a generated database of molecular structures that bioactivity was distributed in small bioactive islands. Zhao et al.13 analyzed kinase scaffolds and showed that a large fraction of kinase relevant chemical space is currently unexplored. Reker et al. have reviewed different techniques to identify areas of chemical space that have the greatest chance of success while considering structural novelty.14
We have decided to focus on enumeration by combining drug scaffolds rather than applying the more popular methods based on full enumeration within certain molecular property limits. On the basis of our previous analysis of drug discovery over the past 30 years, approximately 70% of new drugs coming onto the market are built only from ring systems already found in current drugs.15 Since a large effort in drug discovery is applied to increasing the chance of finding new hits, often described as enriching a molecular library, we believe that this group of focused molecules, with a proven pedigree for drug discovery, will be an extremely useful subset of chemical space. It is also interesting to remember, based on our previous work, that if a new drug contains a scaffold that has not been used before in any other drug, the remaining scaffolds are highly likely to come from existing drug rings. This observation is true for over 99% of all drugs in the past 30 years. Using these facts, we have attempted to address the following questions:
Received: September 14, 2016
Published: December 9, 2016
© 2016 American Chemical Society
DOI: 10.1021/acs.jmedchem.6b01367 J. Med. Chem. 2017, 60, 1638−1647
Journal of Medicinal Chemistry
1. What is the number of molecular ring frameworks or scaffolds possible from combining ring systems only found in FDA approved drugs?
2. Is this number small enough to be practically useful?
3. What is the overlap of currently available chemical space with this data set?
4. Can we use this data set to prioritize new areas of chemical space that are currently under-represented in compound collections but that have a proven track record in drug discovery?

Figure 1. Definitions of rings, ring systems, and frameworks for the drug bendroflumethiazide.

■
NUMBER OF COMBINATIONS POSSIBLE FROM COMBINING RING SYSTEMS FROM FDA APPROVED DRUGS
In this analysis we use the same nomenclature as described previously for rings, ring systems, and frameworks (see Figure 1), which builds on the seminal work by Bemis et al.16 On the basis of our previous analysis of rings, ring systems, and ring frameworks, the list of 351 drug ring systems in FDA-approved drugs was used as the starting molecular database. We have combined the ring systems to fully enumerate a database of ring frameworks or core scaffolds that can be used in drug discovery projects. Our definition of a ring is the smallest nonfused system with no acyclic (hydrocarbon and/or heteroatom containing) linkers or terminal groups. A ring system is defined as a complete ring or rings formed by removing all terminal and acyclic linking groups without breaking any ring bonds. A framework is defined as containing all the ring systems but also includes ring systems that are linked by nonterminal acyclic groups. In this analysis a distinction is made between ring systems and frameworks: a ring system can contain only ring bonds (including spiro groups) and no chain single bonds, whereas a framework can also contain nonterminal acyclic linking bonds. A full description of the algorithm and ring database can be found in our previous work.9
In order to create a database of molecules containing only rings from approved drugs, the initial questions are as follows: How many ring systems per scaffold (or framework) should be included, and how do we connect the ring systems? From our experience in drug discovery, a rationally designed chemotype typically needs up to three rings per molecule to show functional activity (with notable exceptions from natural products and steroid-like moieties). In particular it was noted by Congreve et al.17 that a high quality fragment library would contain fragments with three or fewer rings. Using this premise as a starting point for combining ring systems, we have summarized the number of ring systems and the total points of attachment available for medicinal chemistry growth across our database of ring systems for monocycle, bicycle, and tricycle ring systems (see Table 1). To make the analysis simpler, we do not specify the stereochemistry of the attachment of linkers to the rings where there is an sp³ center.

Table 1. Number of Monocycle and Bicycle Ring Systems from FDA Approved Drugs and the Total Number of Possible Attachment Points for Each Type of Ring System

type of ring system    number of unique ring systems found in drugs    single unique points of attachment across all rings
monocycle              95                                              269
bicycle                124                                             676
tricycle               58                                              440
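The ring system and framework definitions used in this section can be illustrated on a toy molecular graph. The following Python sketch is not the authors' algorithm (which is described in their previous work); it is a minimal illustration that assumes a molecule is given as a list of bonds between integer atom indices and that ignores atom types and bond orders. The function names and the toy molecules are hypothetical. Ring bonds are taken to be the bonds that are not bridges of the graph, ring systems are the connected groups of ring bonds (so fused and spiro rings fall into one ring system), and the framework is what remains after terminal atoms are stripped repeatedly.

```python
from collections import defaultdict


def ring_bonds(bonds):
    """Return the bonds that lie on at least one cycle (the non-bridge bonds)."""
    adjacency = defaultdict(list)
    for index, (a, b) in enumerate(bonds):
        adjacency[a].append((b, index))
        adjacency[b].append((a, index))

    discovered, low, bridges = {}, {}, set()

    def visit(atom, via_bond):
        discovered[atom] = low[atom] = len(discovered)
        for neighbour, bond in adjacency[atom]:
            if bond == via_bond:
                continue
            if neighbour in discovered:
                low[atom] = min(low[atom], discovered[neighbour])
            else:
                visit(neighbour, bond)
                low[atom] = min(low[atom], low[neighbour])
                if low[neighbour] > discovered[atom]:
                    bridges.add(bond)  # removing this bond disconnects the graph

    for atom in list(adjacency):
        if atom not in discovered:
            visit(atom, None)
    return [bond for index, bond in enumerate(bonds) if index not in bridges]


def ring_systems(bonds):
    """Group ring atoms into ring systems: atoms connected through ring bonds only."""
    parent = {}

    def find(atom):
        parent.setdefault(atom, atom)
        while parent[atom] != atom:
            parent[atom] = parent[parent[atom]]
            atom = parent[atom]
        return atom

    for a, b in ring_bonds(bonds):
        parent[find(a)] = find(b)

    systems = defaultdict(set)
    for atom in list(parent):
        systems[find(atom)].add(atom)
    return sorted(systems.values(), key=min)


def framework_atoms(bonds):
    """Strip terminal atoms repeatedly; ring systems and their linkers remain."""
    neighbours = defaultdict(set)
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    neighbours = dict(neighbours)
    terminal = [atom for atom in neighbours if len(neighbours[atom]) <= 1]
    while terminal:
        atom = terminal.pop()
        for other in neighbours.pop(atom, ()):
            neighbours[other].discard(atom)
            if len(neighbours[other]) <= 1:
                terminal.append(other)
    return set(neighbours)


def cycle(atoms):
    """Bonds of a simple ring through the given atoms."""
    return [(atoms[i], atoms[(i + 1) % len(atoms)]) for i in range(len(atoms))]


# Toy molecule: ring 0-5, one linker atom (6), ring 7-12, and a two-atom side chain.
toy_bonds = cycle(list(range(6))) + [(5, 6), (6, 7)] + cycle(list(range(7, 13))) + [(12, 13), (13, 14)]
# Toy fused bicycle: two six-membered rings sharing the bond between atoms 0 and 1.
fused_bonds = cycle([0, 1, 2, 3, 4, 5]) + [(1, 6), (6, 7), (7, 8), (8, 9), (9, 0)]
```

On the first toy molecule the sketch finds two monocycle ring systems joined by a one-atom linker, and the framework drops only the side chain; on the second it finds a single bicycle ring system.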
If we focus on combining ring systems with an upper limit of three rings, then we can just consider the monocycle and bicycle ring systems. To enumerate all combinations of current drug ring systems in which the total number of rings does not exceed three, these systems can be arranged in three formats: monocycle−monocycle, monocycle−bicycle, and monocycle−monocycle−monocycle (either linear or branched). To link these ring systems, we created a data set of aliphatic linkers up to four bonds in length, from FDA approved drugs, giving a total of 68 druglike linkers. For a preliminary analysis, if we consider linear unbranched combinations while checking for the correct atomic valence but initially not validating chemical feasibility or stability, the maximum combinations can be enumerated for monocycle and bicycle ring systems. These data are given in Table 2. The monocycle−monocycle−monocycle combination is approximated from the number of attachment points and the number of linkers rather than fully enumerated, since it is clear from the preliminary calculation that, as expected, the linear combinations of three monocycle systems are far more numerous than the equivalent single linker combinations of monocycle−monocycle and bicycle−monocycle. This start of a population explosion by adding relatively few atoms to a starting scaffold is well documented.18

Table 2. Maximum Ring Combinations for Three Different Drug Ring Combinations Using Only Drug Linkers without Including Chemical Stability Filters

ring system                                         possible combinations ignoring ring linkers    possible combinations with drug linkers
monocycle−monocycle                                 4.5 × 10³                                      2.4 × 10⁶
bicycle−monocycle                                   12 × 10³                                       12 × 10⁶
linear monocycle−monocycle−monocycle (estimated)    4 × 10¹¹                                       7 × 10¹²

It is clear from Table 2 that a single addition of a monocycle ring to a two-ring system from the database of drug rings, with the associated drug linker, increases the number of possible molecular frameworks or scaffolds by a factor of approximately 10⁶. This would be further increased if we were to include branched monocycles rather than just the linear monocycles. For this initial data set to be useful in practical terms, we have decided to focus on the monocycle−monocycle and monocycle−bicycle combinations, since we are looking at full enumeration of these sets and the subsequent analysis. It should also be remembered that the monocycle−monocycle combinations will be a subset of the three monocycle combinations (either the linear or branched forms).
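The two-ring entries of Table 2 can be approximated from the ring system counts and attachment point totals in Table 1 together with the 68 drug linkers. The Python sketch below is a back-of-the-envelope check rather than the authors' enumeration: it assumes that monocycle−monocycle pairs are counted without regard to order and with self-pairs allowed, that every linker can join every pair of attachment points, and that linker asymmetry is ignored. The function and variable names are hypothetical.

```python
N_LINKERS = 68                                           # drug linkers up to four bonds long
RING_SYSTEMS = {"monocycle": 95, "bicycle": 124}         # from Table 1
ATTACHMENT_POINTS = {"monocycle": 269, "bicycle": 676}   # from Table 1


def unordered_pairs(n):
    """Pairs drawn from one set of n items, order ignored, self-pairs allowed (assumption)."""
    return n * (n + 1) // 2


def two_ring_counts():
    """Approximate the monocycle-monocycle and bicycle-monocycle rows of Table 2."""
    mono, bi = RING_SYSTEMS["monocycle"], RING_SYSTEMS["bicycle"]
    mono_pts, bi_pts = ATTACHMENT_POINTS["monocycle"], ATTACHMENT_POINTS["bicycle"]
    return {
        "monocycle-monocycle": {
            "ignoring linkers": unordered_pairs(mono),
            "with drug linkers": unordered_pairs(mono_pts) * N_LINKERS,
        },
        "bicycle-monocycle": {
            "ignoring linkers": bi * mono,
            "with drug linkers": bi_pts * mono_pts * N_LINKERS,
        },
    }


counts = two_ring_counts()
total_frameworks = sum(row["with drug linkers"] for row in counts.values())

for ring_system, row in counts.items():
    print(ring_system, row)
print("total two-ring frameworks:", total_frameworks)
```

Under these assumptions each value falls within a few percent of the corresponding Table 2 entry, and the two linked totals sum to between 14 and 15 million, in line with the database size quoted in the abstract.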
■
OVERLAP OF CURRENT CHEMICAL SPACE WITH ENUMERATED DRUG FRAMEWORKS
To compare our database of enumerated drug frameworks with currently available chemical space, two different libraries of compounds were selected: ChEMBL (version 19)19 and a subset of eMolecules20 based on preferred suppliers that are widely used to source molecular analogues for medicinal chemistry programs, an approach often referred to as “SAR by catalogue”. The ChEMBL database is largely representative of synthesized molecules from the literature, and the eMolecules database contains compounds readily available for purchase. Clearly there are other chemical databases,21−23 but we have focused on downloadable data sets that are representative of synthesized molecules from current medicinal chemistry that either can be purchased or are reported in the literature. The workflow we used to generate and analyze these data sets is shown in Figure 2. The stages are as follows.
Stage 1: The database of drug ring systems was generated as previously described, recording the frequency of each ring system for both monocycles and bicycles.
Stage 2: All drug linkers up to four bonds in length were stored.
Stage 3: All drug linkers were combined with ring systems at all available vectors to give two data sets of monocycle−monocycle and monocycle−bicycle combinations.
Stage 4: A comparison of our virtual database of drug scaffolds with the ChEMBL and eMolecules data sets was performed using three levels of analysis.

Figure 2. Workflow to compare enumerated drug ring systems with commercially available compounds from eMolecules and literature molecules from ChEMBL.

The first analysis of the ChEMBL and eMolecules data sets was to determine which combinations of monocycle−monocycle and monocycle−bicycle rings are represented. We record whether the two ring systems have ever been used in the same molecule, regardless of how they are linked and of any additional groups that may decorate the molecule. This analysis gives an indication of how widespread the ring systems are in current medicinal chemistry space but does not provide information about the proximity or bond distance between the drug rings. It is possible that the drug ring systems are at distal regions of the molecule and unlikely to be associated with a single pharmacophore.
The second analysis is a substructure search for the two rings covalently bound by a linear drug linker up to four bonds in length, which gives an indication of the frequency of the ring framework or scaffold rather than the isolated ring systems. Clearly the matches from this second analysis are a subset of those from the first.
For the substructure analysis between our virtual drug ring system database and the real compounds in eMolecules and ChEMBL, the number of comparisons is extremely large for a brute force all versus all substructure comparison. A substructure comparison between all bicycle−monocycle drug rings and ChEMBL would require over 10 trillion substructure comparisons. To facilitate such a comparison, we use a ring fingerprint method (unpublished work) based on precomputed bitstring indices. It is similar to REOS fingerprints,24 except that the desired ring substructures are recorded rather than reactive and undesirable functional groups. The fingerprint screen is followed by the substructure calculation, which dramatically reduces the total number of actual substructure comparisons. Furthermore, the substructure searches are implemented so that we do not record a match if one ring is a partial match of another ring; e.g., benzene would not match with naphthalene.
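The idea of screening with precomputed bitstrings before running a substructure search can be sketched as follows. The authors' ring fingerprint method is unpublished, so this Python example is only a generic illustration of a bitstring prescreen and not their implementation: it assumes each molecule has already been reduced to the set of whole ring systems it contains, identified by hypothetical string names, and all function names are likewise hypothetical. Because one bit stands for one complete ring system, a molecule containing naphthalene does not set the benzene bit, which mirrors the partial match rule described above.

```python
def build_bit_index(ring_systems):
    """Assign one bit position to each drug ring system identifier."""
    return {ring: 1 << position for position, ring in enumerate(ring_systems)}


def fingerprint(rings_present, bit_index):
    """OR together the bits of every indexed ring system found in a molecule.

    Ring systems that are not in the index (non-drug rings) are ignored.
    """
    bits = 0
    for ring in rings_present:
        bits |= bit_index.get(ring, 0)
    return bits


def prescreen(query_rings, library, bit_index):
    """Return the molecules whose fingerprint contains every query ring bit.

    Only these survivors would need the expensive substructure comparison.
    """
    query = 0
    for ring in query_rings:
        query |= bit_index[ring]  # query rings must be indexed drug ring systems
    return [mol_id for mol_id, bits in library.items() if (bits & query) == query]


# Hypothetical index and library; ring names stand in for canonical ring identifiers.
bit_index = build_bit_index(["benzene", "pyridine", "piperidine", "naphthalene", "indole"])
library = {
    "mol_A": fingerprint(["benzene", "piperidine"], bit_index),
    "mol_B": fingerprint(["naphthalene", "piperidine"], bit_index),
    "mol_C": fingerprint(["benzene", "cyclopropane"], bit_index),
    "mol_D": fingerprint(["indole", "pyridine", "benzene"], bit_index),
}

candidates = prescreen(["benzene", "piperidine"], library, bit_index)
print(candidates)
```

The fingerprints are computed once per library molecule, so each query costs one bitwise AND per molecule, and the substructure calculation is reserved for the few molecules that pass.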
The third analysis comparing drug ring combinations records the exact match of the ring frameworks formed by combining the drug ring systems and drug linkers. This entire process is shown in Figure 2.

■
ALL VERSUS ALL HEAT MAP ANALYSIS
We have taken a stepwise approach to analyze the virtual library of ring frameworks; the first analysis uses both the monocycle−monocycle and bicycle−monocycle combinations for each drug ring system pair. We have chosen to analyze these data using heat maps in Figures 3−5 to show the coverage in the chemical databases for all possible formats of drug ring combinations. A red square indicates no matches between the drug ring combination and the commercially available compounds; yellow indicates some matches (approximately 500 matches for combinations and substructures and 50 matches for the exact compound); and green indicates the highest number of matches (greater than 1000 matches for combinations and substructures and greater than 100 matches for the exact compound). The exhaustive comparison takes three different formats as previously described: the combination match, substructure match, and exact match. The analysis is repeated for the monocycle−monocycle and bicycle−monocycle drug ring combinations for both the eMolecules and ChEMBL data sets. The heat maps and graphs in subsequent sections are given for the larger eMolecules set. The ChEMBL heat maps show a similar trend, and the totals for both ChEMBL and eMolecules are given in the following sections. Clearly the monocycle−monocycle combinations are symmetrical, so we have only populated one-half of the graph. In the heat map charts the drug ring systems are ordered by their frequency in drugs, and so, if the distribution follows that of the frequencies in drugs, the largest color coded region should be the top left quadrant (where green and yellow squares indicate examples of matched structures between the drug ring combinations and commercially available molecules, as previously stated). The large number of red squares in the heat maps, which indicate zero matches between drug ring combinations and commercially available compounds, makes it clear that a significant number of drug ring combinations have not been synthesized. The exact totals will be analyzed in the following sections but range from approximately 50% coverage with the combination match to less than 0.1% for the exact match.

Figure 3. Combination match heat map identifies the frequency of drug ring combinations in commercially available compounds for (a) monocycle−monocycle combinations and (b) bicycle−monocycle combinations, where the drug rings are present in the compound in any format.

■
BREAKDOWN OF MATCHES BY RING SYSTEMS
Following on from the individual all-by-all comparisons, the next focus of the work was to compare each ring system with any other drug ring system. To perform this analysis, we took the substructure heat maps for both the monocycle−monocycle (Figure 4a) and bicycle−monocycle (Figure 4b) combinations and assigned a one or zero depending on whether there was at least one example of this combination with a druglike linker. These distributions were then summed for each ring system to give the distributions for each individual ring system combined with any other drug ring system (see Figures 6−8). Again the ring systems are sorted by the frequency in drugs. Once the specific drug linkers are included, it can be seen from both the substructure (Figure 4) and exact matches (Figure 5) that the coverage dramatically decreases. When looking at the total substructure matches for each ring system, no ring system has full coverage; i.e., significantly less than 100% of the substructures are found in eMolecules (see Figures 6−8). Furthermore, from these distributions the monocycle−monocycle combinations have the highest coverage, and a bicycle combined with any monocycle has the lowest coverage. Although this is true in percentage terms, in absolute numbers the coverage is of the same order; however, the total available chemical space for bicycle combinations is significantly larger owing to the larger number of growth points available on bicycle systems compared with a monocycle. Since there are significant amounts of red in the heat maps and a significant number of poorly represented substructure combinations, we have attempted to quantify the overall coverage in the following section.
Figure 4. Substructure match heat map identifies substructure matches between commercially available compounds and all possible drug ring combinations with drug linkers for (a) monocycle−monocycle combinations and (b) bicycle−monocycle combinations.
Figure 5. Exact match heat map identifies exact matches between commercially available compounds and all possible drug ring combinations with drug linkers for (a) monocycle−monocycle combinations and (b) bicycle−monocycle combinations.
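The breakdown by ring system described in the text (assign a one or zero to each heat map cell, then sum for each ring system) can be written in a few lines. This Python sketch assumes the substructure heat map is available as a matrix of match counts with one row per ring system and one column per partner ring system; the matrix values and function names are hypothetical and are not data from the paper.

```python
def binarize(match_counts):
    """Replace each match count with 1 (at least one example) or 0 (no example)."""
    return [[1 if count > 0 else 0 for count in row] for row in match_counts]


def coverage_by_ring_system(match_counts):
    """Percentage of partner ring systems with at least one match, for each row."""
    percentages = []
    for row in binarize(match_counts):
        percentages.append(100.0 * sum(row) / len(row))
    return percentages


# Hypothetical 4 x 4 heat map; rows and columns are ordered by frequency in drugs.
toy_heat_map = [
    [1200, 40, 0, 0],
    [40, 0, 3, 0],
    [0, 3, 0, 0],
    [0, 0, 0, 0],
]

toy_coverage = coverage_by_ring_system(toy_heat_map)
print(toy_coverage)
```

In the toy matrix the most frequent ring systems have the highest coverage and the last ring system has none, which is the kind of per-ring-system distribution plotted in Figures 6−8.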
■
OVERALL TOTAL COVERAGE
Using this heat map assessment of the total bicycle and monocycle combinations, the total coverage of drug ring combinations, substructures, and exact matches is given in Tables 3 and 4 for comparison with ChEMBL and eMolecules. From the distribution totals, very few of the combinations are available in either ChEMBL or eMolecules. In fact, less than 1% of the substructures have been made, and this is true for both the ChEMBL and eMolecules sets, and less than 0.1% for the exact match. Examples of the most underused monocycle and bicycle systems are given in Table 5 and Table 6, which show less than 15% coverage in the heat map substructure analysis and less than 1% in the general substructure analysis. These ring systems are ranked by the frequency in drugs (most frequent first) and have generally been in at least two drugs. For this analysis we have ignored both diazepam and antibiotic type molecules because of the specificity associated with these chemotypes, although they are also generally under-represented.
One potential explanation for why these ring systems, particularly the monocycles, are under-represented in both ChEMBL and eMolecules is potential chemical instability; the isolated groups are also sometimes associated with metabolic liabilities and potential toxicity. However, these are often context-dependent, whereby chemical properties such as substitution patterns and local environment, along with clinical properties such as the therapeutic window, mechanism of action, and route of administration, all need to be considered. Moreover, any potential liabilities have been overcome to produce successful drugs, and we do not believe instability and ADMET properties can account fully for the lack of coverage in literature reported and commercially available compounds. We have also assessed the total number of ring combinations or frameworks that are available commercially by recording all monocycle−monocycle and bicycle−monocycle combinations with aliphatic linkers that were up to four bonds in length. The total number of unique commercial ring frameworks is approximately 1.6 million molecules. The overlap with the drug ring system combinations was only 57 556; i.e., 99.6% of drug ring frameworks are not
Figure 6. Percentage of substructures that are available commercially from the virtual library of molecules formed by combining a drug monocycle with any other drug monocycle and a drug linker.
Figure 7. Percentage of substructures that are available commercially from the virtual library of molecules formed by combining a drug monocycle with any other drug bicycle and a drug linker.
Figure 8. Percentage of substructures that are available commercially from the virtual library of molecules formed by combining a drug bicycle with any other drug monocycle and a drug linker.
Table 3. Virtual Library of Enumerated Monocycle−Monocycle Drug Ring Systems Compared with Molecules in ChEMBL and eMolecules Using Combination, Substructure, and Exact Match Comparisons

match type       enumerated total    matches in ChEMBL    matches in eMolecules
combinations     4.5 × 10³           2107 (47%)           2349 (52%)
substructures    2.4 × 10⁶           12193 (0.5%)         23247 (1%)
exact match      2.4 × 10⁶           468 (0.02%)          1818 (0.08%)
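The percentages in Table 3 follow directly from the match counts and the enumerated totals. The short Python sketch below recomputes them as a consistency check; it uses the rounded totals exactly as printed in the table, so small differences from the authors' unrounded figures are to be expected, and the names are hypothetical.

```python
# (enumerated total, matches in ChEMBL, matches in eMolecules), as printed in Table 3
TABLE_3 = {
    "combinations": (4.5e3, 2107, 2349),
    "substructures": (2.4e6, 12193, 23247),
    "exact match": (2.4e6, 468, 1818),
}


def percent(matches, total):
    """Coverage of the enumerated set, in percent."""
    return 100.0 * matches / total


def coverage(table):
    """Recompute the ChEMBL and eMolecules coverage for every match type."""
    return {
        match_type: {
            "ChEMBL": percent(chembl, total),
            "eMolecules": percent(emolecules, total),
        }
        for match_type, (total, chembl, emolecules) in table.items()
    }


table_3_coverage = coverage(TABLE_3)
for match_type, row in table_3_coverage.items():
    print(f"{match_type}: ChEMBL {row['ChEMBL']:.3g}%, eMolecules {row['eMolecules']:.3g}%")
```

The recomputed values agree with the printed percentages to the precision shown: roughly half of the ring pairs co-occur somewhere in a molecule, whereas about 1% or less are found joined by a drug linker.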
Table 4. Virtual Library of Enumerated Bicycle−Monocycle Drug Ring Systems Compared with Molecules in ChEMBL and eMolecules Using Combination, Substructure, and Exact Match Comparisons

match type       enumerated total    matches in ChEMBL    matches in eMolecules
combinations     12 × 10³            3559 (30%)           3874 (33%)
substructures    12 × 10⁶            23284 (0.2%)         34290 (0.3%)
exact match      12 × 10⁶            1441 (0.01%)         3577 (0.03%)

available commercially. Moreover, only 3% of commercially available ring combinations are drug ring combinations. We find this a surprising result, since approximately 70% of all
new drugs only contain ring systems from this drug ring set. The overlap between drug ring combinations and commercially available ring combinations is shown in Figure 9 as a Venn diagram drawn to scale.

Table 5. Examples of Monocycles (Ordered by the Frequency in Drugs) Which Are Frequently Used in Drugs but Under-Represented in eMolecules, Where Greater than 99% of the Substructures from Combining These Rings with All Other Possible Drug Monocycles or Bicycles Are Not Found in eMolecules

Table 6. Examples of Bicycles (Ordered by the Frequency in Drugs) Which Are Frequently Used in Drugs but Under-Represented in eMolecules, Where Greater than 99% of the Substructures from Combining These Rings with All Other Possible Drug Monocycles or Bicycles Are Not Found in eMolecules

Figure 9. Venn diagram, drawn to scale, comparing the number and overlap of drug ring combinations and commercially available ring combinations.

■
DISTRIBUTION OF DRUG RING SYSTEMS FROM COMMERCIAL SUPPLIERS
On the basis of the importance of mining known ring systems, it is useful to know if the available molecules formed from combining drug ring systems are evenly distributed across suppliers. For each of the molecules available in ChEMBL, we have recorded all of those molecules that match the different drug ring system combinations and stored the commercial vendors that supply those compounds. This analysis is given in Figure 10, and it can be seen that 90% of the drug ring combinations from our database are covered by only 10% of the commercial suppliers.

■
FRAGMENT SCAFFOLDS AND APPLICATION TO FRAGMENT LIBRARY DESIGN
We have seen for the complete enumerated sets that there is little commercial or literature coverage of drug ring combinations. We therefore reduced the set of molecules to an even smaller, more focused library to assess whether this observation is still true. A technique that is very much complementary to this analysis is fragment screening,25 and we can use this analysis to assess the size of a fragment deck needed to cover all of the known drug scaffolds that are designed from drug ring systems and drug linkers. For this analysis we have focused just on the two bond linkers and the heteroatom count from our previous analysis of drug ring frameworks, ignoring those scaffolds that have a high frequency in drugs but which only occur in a single therapeutic area or target class, e.g., diazepams and antibiotic derivatives. Clearly this could be extended to three bond linkers or further, which would include linkers such as amides, sulfonamides, etc., but here we are assessing the smallest useful scaffolds to see the current molecular coverage in modern medicinal chemistry, an analysis we intend to extend in the future. Furthermore, smaller molecules have a higher chance of binding to protein targets, albeit with weak activity.26 In our analysis there are a number of steps that are subjective, but from the rational design perspective these are molecules we would like to see as hits from a fragment screening campaign. For this to be practically useful, however, we have introduced filters related to typical fragment filters27 that we would use in a real world context, including crude chemical stability filters. We have applied a molecular weight cutoff of 280, alogp of ≤3, and donors and acceptors of ≤3.
Furthermore, we believe including terminal groups is useful in a fragment library, and so we have included the following important terminal groups in our enumeration: Me, F, NH₂, CO₂, and OH, using only one group per framework, placed in all chemically feasible positions. Table 7 shows the breakdown for a virtual fragment library database of 421 929 molecules which have been fully enumerated. From a philosophical standpoint, this number of scaffolds is within the capacity of standard virtual screening and biochemical screening, although
Figure 10. Histogram showing the number of molecules containing drug ring substructures from different commercial suppliers.
Table 7. Number of Enumerated Fragments Using Only Drug Rings, Up to Two Bond Drug Linkers, and Five Different Terminal Groups, Compared with Fragments Available in eMolecules and ChEMBL

1st ring     2nd ring     number of terminal vectors    number of enumerated molecules    eMolecules exact match    ChEMBL exact match
monocycle    (none)       1                             1106                              419 (38%)                 117 (10.6%)
monocycle    monocycle    0                             4610                              293 (6.4%)                103 (2.2%)
monocycle    monocycle    1                             63142                             1201 (1.9%)               249 (0.4%)
bicycle      (none)       1                             2653                              776 (29.2%)               192 (7.2%)
bicycle      monocycle    0                             20253                             244 (1.2%)                114 (0.6%)
bicycle      monocycle    1                             330165                            374 (0.1%)                198 (0.1%)
totals for all combinations                             421929                            3307 (0.8%)               973 (0.23%)
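The property filters applied when building the fragment library behind Table 7 (a molecular weight cutoff of 280, alogp of ≤3, and donors and acceptors of ≤3) amount to a simple predicate over precomputed properties. The Python sketch below is an illustration under stated assumptions, not the authors' filter set: it assumes the molecular weight cutoff is inclusive, that the donor and acceptor limits apply to each count separately, and that property values come from an external calculator. The crude chemical stability filters mentioned in the text are not modeled, and the candidate names and property values are hypothetical.

```python
from dataclasses import dataclass

MAX_MOL_WEIGHT = 280.0  # assumed inclusive
MAX_ALOGP = 3.0
MAX_DONORS = 3          # assumed to apply to donors and acceptors separately
MAX_ACCEPTORS = 3


@dataclass(frozen=True)
class Candidate:
    """An enumerated fragment with precomputed properties (values supplied externally)."""
    name: str
    mol_weight: float
    alogp: float
    h_donors: int
    h_acceptors: int


def passes_fragment_filters(candidate):
    """True if the candidate satisfies all four property limits."""
    return (
        candidate.mol_weight <= MAX_MOL_WEIGHT
        and candidate.alogp <= MAX_ALOGP
        and candidate.h_donors <= MAX_DONORS
        and candidate.h_acceptors <= MAX_ACCEPTORS
    )


# Hypothetical candidates with made-up property values, for illustration only.
candidates = [
    Candidate("frag_1", mol_weight=212.3, alogp=1.8, h_donors=1, h_acceptors=3),
    Candidate("frag_2", mol_weight=295.4, alogp=2.1, h_donors=1, h_acceptors=2),
    Candidate("frag_3", mol_weight=240.1, alogp=3.6, h_donors=0, h_acceptors=2),
    Candidate("frag_4", mol_weight=198.2, alogp=0.4, h_donors=4, h_acceptors=3),
]

kept = [c.name for c in candidates if passes_fragment_filters(c)]
print(kept)
```

Keeping the limits as named constants makes it straightforward to see how the size of the enumerated library would change if, for example, the molecular weight cutoff were relaxed.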
2. Is this number small enough to be practically useful?
3. What is the overlap of currently available chemical space with this data set?
4. Can we use this data set to prioritize new areas of chemical space that are currently under-represented in compound collections but that have a proven track record in drug discovery?
From this work we have answered these four questions with some surprising results. The number of frameworks or scaffolds possible from combining ring systems from FDA approved drugs is approximately 14 million compounds using known drug linkers. We have shown that this set is small enough to analyze computationally, and based on our previous analysis, these molecules will cover approximately 70% of the scaffolds found in the drugs of the future. Currently 50% of the drug ring combinations have no representation in either readily accessible molecules or synthesized literature molecules. In this context we define a drug ring combination as two ring systems (either monocycle−monocycle or bicycle−monocycle) being present in the same molecule irrespective of the linker type or linker length, so the molecular rings could be at distal parts of the molecule and not directly connected. When the drug ring systems are combined with a drug linker up to four bonds in length, and all accessible vectors of attachment are enumerated, less than 1% of the frameworks or molecular scaffolds are represented in either commercial libraries or literature compounds as a molecular substructure. This is true for all linear combinations of monocycle−monocycle and bicycle−monocycle ring systems. If we consider the exact match (rather than a substructure) of all enumerated drug ring frameworks, as previously described, with those compounds from the libraries of
in the past it has been considered to be too large for biophysical screening such as X-ray crystallography or NMR. We have demonstrated in-house capabilities to run our internal fragment deck of over 20 000 molecules in one month using surface-plasmon resonance (SPR) technology against 12 protein−protein interaction (PPI) targets (unpublished results). There is some redundancy in this virtual data set through close molecular similarity, and it would be possible to give broad molecular coverage by clustering the data set on different molecular properties, although the purpose of this exercise is to give an approximate upper limit, given certain constraints and approximations, as previously described, and the subsequent overlap with current medicinal chemistry. Of significance is the relatively modest size of a library, which is typically smaller than the high-throughput screening collections for most pharma companies, and the low coverage of these molecular designs. In fact given all of the approximations and constraints that make this set of molecules a conservative estimate of the size of a fragment library, we find it surprising that