Assessing the Growth of Bioactive Compounds and Scaffolds over

Feb 2, 2016 - The increase in compounds with activity against five major therapeutic target families has been quantified on a time scale and investiga...
1 downloads 0 Views 2MB Size
Article pubs.acs.org/jcim

Assessing the Growth of Bioactive Compounds and Scaffolds over Time: Implications for Lead Discovery and Scaffold Hopping Swarit Jasial, Ye Hu, and Jürgen Bajorath* Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany ABSTRACT: The increase in compounds with activity against five major therapeutic target families has been quantified on a time scale and investigated employing a compound−scaffold−cyclic skeleton (CSK) hierarchy. The analysis was designed to better understand possible reasons for target-dependent growth of bioactive compounds. There was strong correlation between compound and scaffold growth across all target families. Active compounds becoming available over time were mostly represented by new scaffolds. On the basis of scaffold-tocompound ratios, new active compounds were structurally diverse and, on the basis of CSK-to-scaffold ratios, often had previously unobserved topologies. In addition, novel targets emerged that complemented major families. The analysis revealed that compound growth is associated with increasing chemical diversity and that current pharmaceutical targets are capable of recognizing many structurally different compounds, which provides a rationale for the rapid increase in the number of bioactive compounds over the past decade. In light of these findings, it is likely that new chemical entities will be discovered for many small molecule targets including relatively unexplored ones as well as for popular and wellstudied therapeutic targets. Moreover, given the wealth of new “active scaffolds” that have been increasingly identified for many targets over time, computational scaffold-hopping exercises should generally have a high likelihood of success.



INTRODUCTION In pharmaceutical research, increasing volumes of compounds and activity data are becoming available. Not only data volumes but also complexity and heterogeneity are increasing, giving rise to the advent of big data phenomena in medicinal chemistry,1,2 similar to developments in biology and bioinformatics over the past decade,3 albeit still at lesser magnitude. Although large volumes of complex activity data are difficult to analyze, these data represent a valuable knowledge base for the large-scale exploration of structure−activity relationships and compound design.4 Analysis of activity data also helps to better understand ligand binding characteristics of therapeutic targets5 or promiscuity among bioactive compounds,6,7 which is defined as the ability of small molecules to specifically interact with multiple targets, a prerequisite for polypharmacological effects.8−10 Activity data can also be related to structural classification schemes. For example, the scaffold concept has been applied over the last two decades to define core structures of compounds in a consistent manner.11 Scaffolds are typically extracted from compounds by systematic removal of substituents.12 Accordingly, a series of analogs yields the same scaffold. The scaffold concept has provided a basis for the generation of data structures such as the scaffold tree13 to systematically organize compound collections and annotate them with activity information. Scaffold-based compound organization can be extended through the generation of carbon skeletons, also termed cyclic skeletons (CSKs),14 which © XXXX American Chemical Society

represent a further abstraction from chemical structures focusing on molecular topology and enable the implementation of compound−scaffold−CSK hierarchies for structural organization and data analysis.15 A CSK represents a set of scaffolds that share the same topology and are only differentiated by heteroatom replacements and/or bond order variations. Scaffold and CSK analysis is often carried out to assess the structural diversity of compound collection, which is from a chemical perspective more intuitive than the calculation of descriptor-based similarity values.16−18 The compound−scaffold−CSK hierarchy was previously employed by us to systematically explore structural relationships between scaffolds across bioactive compounds and study the potency range distribution of compounds sharing the same activity that were represented by different scaffolds.15 A major finding of this analysis was that many pairs of structurally distinct scaffolds represented highly potent compounds.15 The scaffold concept has also been applied to introduce “scaffold hopping”,19,20 which refers to computer-aided identification of compounds that share the same activity but differ in their core structures. Scaffold hopping through virtual compound screening is often regarded as one of the central tasks in computational medicinal chemistry. We have been interested in exploring the nature of compound data growth in relation to scaffold growth and Received: December 1, 2015

A

DOI: 10.1021/acs.jcim.5b00713 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Article

Journal of Chemical Information and Modeling

were retained. For example, if a potency value of 4 nM was reported for a given compound in 2010 and 3 nM in 2011 for the same target, the compound was selected and 2010 potency value (4 nM) was assigned to the target. Furthermore, compounds with a single qualifying potency measurement were also retained. Selected compounds and their activity data were assigned to individual years from 2000 to 2014 (all data reported prior to 2000 were assigned to year 2000). The final curation step yielded all target-based compound sets for the five major families. Table 1 reports the total number

diversity. How fast are volumes of bioactive compounds increasing and why are they increasing? Might the increase largely be due to extension of known compound series (perhaps reflecting a form of chemical “me-too-ism”)? Or is diversity generated among novel active compounds? Alternatively, might the increase be due to the emergence of novel targets for which new active compounds are identified? We have set out to explore these previously unaddressed questions. Therefore, the increase in bioactive compounds over time was quantified for five major target families, and the compound−scaffold−CSK hierarchy was employed to characterize increasing volumes of bioactive compounds and analyze compound-to-scaffold ratios. For the first time, the growth of bioactive compounds and scaffolds extracted from them was followed on a time course over 15 years. This made it possible to monitor compound-to-scaffold ratios during periods of largest compound and activity data growth and compare the progression to earlier years when compound and data volumes were limited. A major and rather unexpected finding of our analysis has been that target-based growth of active compounds was consistently paralleled by increases in scaffold diversity across all major target families, independent of compound and data volumes. This has several implications for small molecule discovery as also discussed herein.

Table 1. Target Family-Based Compound Setsa number of target family GPCRs kinases ion channels nuclear receptors proteases

all targets

all compounds

qualifying targets

qualifying compounds

165 276 80 27

46,905 21,756 10,748 5032

153 176 50 24

46,885 21,699 10,723 5026

143

17,534

104

17,492

a

For each family, the total number of targets and compounds available in ChEMBL and the number of targets and compounds qualifying for our analysis are reported.



MATERIALS AND METHODS Data Selection and Curation. Compounds and activity data were extracted from ChEMBL (release 20).21 Only compounds active against targets belonging to five major families were considered, including class A G protein-coupled receptors (GPCRs), ion channels, protein kinases, nuclear receptors, and proteases. These target families were organized following the UniProt22 and ChEMBL target classification schemes. To ensure high data confidence, several preselection criteria were applied as implemented in ChEMBL. Compounds were extracted for which direct interactions (i.e., assay relationship type “D”) with human single-protein targets at the highest confidence level (assay confidence score 9) were reported. The two parameters, “assay relationship type” and “assay confidence score”, qualify and quantify the level of confidence that a compound is tested against a given target in a relevant assay system, respectively. Relationship type “D” and confidence score 9 indicate the highest level of confidence for activity data from ChEMBL. Furthermore, two types of potency measurements were considered including (assay-independent) equilibrium constants (Ki) and (assay-dependent) IC50 values. Only explicitly specified Ki and IC50 values were taken into account, and all approximate measurements such as “>”, “