A Trend Analysis of Scaffolds in the Medicinal Chemistry Literature

Perspective Cite This: J. Med. Chem. XXXX, XXX, XXX−XXX

pubs.acs.org/jmc

The Rise and Fall of a Scaffold: A Trend Analysis of Scaffolds in the Medicinal Chemistry Literature

Barbara Zdrazil*,† and Rajarshi Guha*,‡

† Department of Pharmaceutical Chemistry, Division of Drug Design and Medicinal Chemistry, University of Vienna, Althanstraße 14, A-1090 Vienna, Austria
‡ Division of Preclinical Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), 9800 Medical Center Drive, Rockville, Maryland 20850, United States

Supporting Information

ABSTRACT: Scaffolds are a core concept in medicinal chemistry, and they can be the focus of multiple independent development efforts over an extended period. Thus, scaffold associated properties can vary over time, possibly showing consistently increasing or decreasing trends. We posit that such trends characterize the attention that the community pays to a scaffold. In this study, we employed data from ChEMBL20 to examine the evolution of scaffold features, such as enumerated compounds, biological activity, and liabilities, over 17 years. Our analysis highlights that certain properties such as enumerated compounds, but not liabilities, show statistically significant increasing trends for some scaffolds. We also attempt to explain why a scaffold receives more attention over time and highlight that obvious aspects such as synthetic feasibility do not explicitly drive attention. In summary, trend analyses of scaffold properties could support scaffold selection and prioritization in small molecule development projects.



INTRODUCTION

A chemical scaffold, defined as the structural core of a chemical compound,1 is the focus of many efforts in medicinal chemistry. The concept of a scaffold is used in many scenarios, ranging from the design of screening libraries2 to the characterization of structure−activity relationships3,4 and the rationalization of target-specific activities,5 leading to the notion of privileged scaffolds (molecular frameworks that are capable of serving as ligands for multiple targets and/or occur more often in drug-like molecule libraries).6−10 There is a rich literature on the use of scaffold-based approaches for retrospective analysis of compound collections or the medicinal chemistry literature.5,11−17 Importantly, most of these studies have considered scaffold collections at a fixed point in time. Exceptions include a review of temporal profiles for azanaphthalene scaffolds in drugs and natural products by Polanski et al.18 and a recent analysis of the growth of bioactive compounds and scaffolds over time by Bajorath and co-workers.19 In the latter study, the authors examined five major therapeutic targets employing a compound-scaffold-cyclic skeleton hierarchy. The notion of temporal trends in scaffold properties is attractive in a medicinal chemistry setting as well as from a project management point of view. Examining the change in properties of exemplified compounds over time could serve to

highlight classes of scaffolds that the community is exploring. Conversely, identifying scaffolds with relevant bioactivity but relatively infrequent occurrence in the literature could help to identify gaps in the research space of a target or condition that might be worth exploring to generate novel or patentable leads. Such insight may not be available from a static analysis of the entire scaffold collection of, say, the ChEMBL database.20,21 The type of analysis we present here can assist a researcher in deciding whether to expend resources on exploring a certain structural class of compounds synthetically, experimentally, and/or by means of in silico modeling, as it provides a temporal summary of past synthetic and testing efforts within the drug design community.

In this work, we perform a temporal analysis of scaffolds reported in the medicinal chemistry literature, represented by ChEMBL, by examining the trends in scaffold-derived properties over time. ChEMBL is a manually curated and open source repository of medicinal chemistry literature data, although it does not reflect the entirety of published bioactivities. Still, this valuable source is frequently updated to include recent data and to address data quality issues. Thus, it is unrivaled in these aspects, which makes it the input database of choice for this exemplary scaffold trend analysis. We initially consider target-independent properties to avoid the bias in the medicinal chemistry literature, which tends to be overenriched in activity data against certain druggable target classes (e.g., kinase and GPCR targets).22

We were especially interested in the evolution of the interest (or attention) that a scaffold received over time, which we denote as “scaffold popularity”. Parameters for measuring scaffold popularity include the number of enumerated compounds, the number of different bioassays its members are tested in, and the median journal impact factors (JIFs) of the journals in which compounds belonging to a scaffold series have been published. In addition to popularity, we examine trends in bioactivity and pan-assay interference compounds (PAINS) liabilities23 associated with scaffolds. We emphasize that being identified as a “popular” scaffold by one or more of these measures does not necessarily imply that this scaffold also deserves the interest or attention that it received over time. In other words, being popular does not imply that it is a “good” or worthwhile scaffold for a drug discovery campaign. Examples are the aryl−aryl and aryl−heteroaryl scaffolds generated via Suzuki and other Pd(0)-catalyzed coupling reactions.24 Because of the facile and high-yielding nature of these reactions, these types of scaffolds were easily generated and many are reported in the literature. But compounds based on these scaffolds are not necessarily soluble and have a tendency to be non-drug-like (though ibrutinib is an exception). Thus, these popular scaffolds are not always “good” for drug design purposes. Some other scaffolds that are highly ranked by one or more popularity measures have also been identified to cause liabilities. To the best of our knowledge, this is the most extensive trend analysis on scaffolds of this kind and the first to have quantified the notion of scaffold popularity in multiple dimensions.

Received: June 29, 2017
Published: December 13, 2017
DOI: 10.1021/acs.jmedchem.7b00954 J. Med. Chem. XXXX, XXX, XXX−XXX

Figure 1. Twenty scaffolds in ChEMBL20 with the highest cumulative counts of unique member compounds. Scaffolds are ordered by descending counts of unique compounds. Scaffold IDs, cumulative counts of unique compounds (lower left), and bioactivities (lower right) are indicated in the boxes.


Figure 2. Boxplots showing the distribution of trend line slopes for new compounds (A), assays (B), and targets (C).



RESULTS AND DISCUSSION

We focused on trends based on reported structures and activities and thus employed ChEMBL20 as a source of curated structure−activity data. A predefined search query initially retrieved 1.73 million data entries (considering human targets only). Further filtering for the bioactivity end points “Ki” and “IC50” and for data entries published between 1998 and 2014 led to 619671 nonunique compounds (282878 unique compounds). The structures were then decomposed using the Bemis−Murcko scheme,25 and data points were assigned to compound classes by grouping them by identical Bemis−Murcko scaffolds (103772 scaffolds). We then considered a “relevant” subset, defined as those scaffolds with at least 10 unique compounds, publications in at least five different years, and nongeneric ring systems (six-membered or smaller ring systems with at most one heteroatom in the ring were removed). This identified 764 distinct scaffolds (corresponding to 101622 bioactivity data entries and 36328 unique compounds), which served as the basis for all trend analyses reported here. A full list of all 101622 data entries corresponding to the 764 distinct scaffolds is given in the Supporting Information. On the basis of the set of scaffolds and their members, we examined trends for three properties: scaffold popularity, bioactivity and target trends, and scaffold liabilities. All trend line slopes and their p-values are given as Supporting Information.

1. Scaffold Popularity. The relative importance of a scaffold to the research community, which we term “popularity”, was inspected by analyzing three different parameters over time: counts of unique compounds belonging to a scaffold (enumerated compounds), counts of total bioassays in which members of a scaffold have been tested, and the importance (expressed by the median journal impact factors for a given year) of the journals where scaffold members have been published.
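The count-based part of the “relevant” subset filter described at the start of this section can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `relevant_scaffolds`, the record layout `(scaffold_key, compound_id, year)`, and the assumption that scaffold keys have already been computed are all hypothetical, and the removal of generic ring systems is not covered.

```python
from collections import defaultdict


def relevant_scaffolds(records, min_compounds=10, min_years=5):
    """Return scaffold keys that pass the count-based filters.

    records: iterable of (scaffold_key, compound_id, year) tuples
    (hypothetical layout; scaffold keys are assumed precomputed).
    """
    compounds = defaultdict(set)  # scaffold -> unique compound IDs
    years = defaultdict(set)      # scaffold -> distinct publication years
    for scaffold, compound, year in records:
        compounds[scaffold].add(compound)
        years[scaffold].add(year)
    # Keep scaffolds with enough unique compounds AND enough distinct years.
    return sorted(
        s for s in compounds
        if len(compounds[s]) >= min_compounds and len(years[s]) >= min_years
    )
```

The thresholds default to the values stated in the text (at least 10 unique compounds, publications in at least five different years) and are exposed as parameters so that other cutoffs can be tried.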
The use of three, potentially orthogonal, measures was motivated by a desire to detect potential interdependencies between these parameters and to see if the different measures identified different scaffolds as being popular. In particular, this exercise helps to define the usability of the

different measures for coming up with a global popularity measure.

1.1. Enumerated Compounds. An obvious indication of the popularity of a scaffold is the extent of its synthetic enumeration. Besides the cumulative compound counts associated with a scaffold over the whole observation period, the increase or decrease in the number of new compounds per scaffold and year can also indicate the popularity of a certain scaffold. We further conjecture that the steepness of the trend corresponds to increased popularity. First, we ordered the 764 scaffolds by their cumulative counts of enumerated compounds, labeling the scaffold with the highest number of total unique compounds as scaffold ID 1 (biphenyl). This nomenclature was retained for all other investigations to facilitate comparisons across different measures (tested bioassay, JIF, target, bioactivity, and liability trends). We emphasize that the numbering does not reflect a ranking for scaffold popularity. Scaffolds with the highest numbers of unique compounds (≥200) reported in ChEMBL20 from 1998 until 2014 are depicted in Figure 1. Approximately 75% of all scaffolds have member sets of 50 unique compounds or fewer, and 92% of scaffolds have fewer than 100 unique compounds. Just three scaffolds, namely biphenyl (scaffold ID 1), diphenyl ether (scaffold ID 2), and chalcone (scaffold ID 3), are represented by more than 500 unique compounds (see Figure 1). This is also reflected by the high numbers of bioactivity measures (ranging from 1241 to 2030) reported for these three scaffolds (see Figure 1). Overall, the number of unique compounds correlates moderately (R² = 0.68) with the number of reported bioactivity values for a given scaffold (see Supporting Information, Figure S1). The latter is dependent not only on the number of bioassays reported for this scaffold class but also on the number of targets against which a compound was tested.
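The idea of counting new compounds per scaffold and year and reading popularity from the steepness of the trend can be illustrated with a short sketch. Two assumptions are made here and should be read as such: a compound is treated as “new” only in the first year in which it appears, and the slope is estimated with the Theil−Sen estimator (median of pairwise slopes), which is used purely as a simple stand-in for a robust fit and is not claimed to be the regression method used in this study. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def new_compounds_per_year(records):
    """Count compounds by the first year in which they appear.

    records: iterable of (compound_id, year) pairs for ONE scaffold
    (assumed layout). Years without new compounds are simply absent.
    """
    first_year = {}
    for compound, year in records:
        if compound not in first_year or year < first_year[compound]:
            first_year[compound] = year
    counts = defaultdict(int)
    for year in first_year.values():
        counts[year] += 1
    return dict(sorted(counts.items()))


def theil_sen_slope(points):
    """Median of all pairwise slopes over (x, y) points.

    A stand-in robust slope estimate; pairs with equal x are skipped.
    """
    slopes = [
        (y2 - y1) / (x2 - x1)
        for i, (x1, y1) in enumerate(points)
        for (x2, y2) in points[i + 1:]
        if x2 != x1
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    return median(slopes)
```

A per-scaffold trend could then be obtained with `theil_sen_slope(list(new_compounds_per_year(records).items()))`. Whether years without any new compound should be zero-filled before fitting is a design choice that this sketch leaves open.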
To obtain an overview of the nature of compound diversity within the scaffold series occurring in our study, we computed, scaffoldwise, the median diversity score for the compounds in every publication year (based on CDK circular 1024-bit fingerprints), which is calculated as 1 minus the median


Figure 3. (left) Counts of new compounds per year for scaffold ID 3 (chalcones). (right) Counts of assays per year for scaffold ID 489 (daunorubicin derivatives).

pairwise Tanimoto compound similarity (for all compounds associated with the scaffold). Because ChEMBL tends to contain compound series generated in academic settings, these series tend to be congeneric (i.e., structurally similar). Thus, scaffolds with a larger number of member compounds tend to be made up of a larger number of congeneric series and therefore tend to be more diverse overall than scaffolds with only a few enumerated compounds. Across all scaffolds and years, the median diversity score is 0.27 (corresponding to a pairwise Tanimoto similarity score of 0.73) (a detailed list of all scaffoldwise median similarities per year is given in Supporting Information, Data File 3).

Next, we considered 372 scaffolds with at least one new compound in more than five different (not necessarily consecutive) years in order to investigate trends of new compounds per year. Just 39 of these exhibit statistically significant trends: 11 scaffolds exhibit a steep positive trend (slopes between 3.8 and 0.5), and 28 scaffolds are characterized by a moderately steep positive (22) or moderately steep negative trend (6) with slopes between 0.4 and −0.4. There are no steep negative trends in our analysis. A boxplot of the distribution of trend line slopes for new compounds is shown in Figure 2A. A closer inspection of the 11 scaffolds with steep positive slopes (between 3.8 and 0.5) reveals that most of them possess high total compound counts ranging from 505 to 162 compounds (ordered by descending slopes): scaffold IDs 3 (chalcones), 5 (erlotinib and derivatives), 8 (benzanilide derivatives), 12 (1,4-naphthoquinone derivatives), 11 (trans-stilbene derivatives), 16 (coumarin derivatives), 25 (pentacyclic triterpenoids), 27 (benzimidazole), 15 (N-benzylindoles), and 21 (diphenylurea). An exception is scaffold ID 107 (6(1,3-dioxobenzo[de]isoquinolin), scriptaid and derivatives), with just 75 compounds over the whole observation period but also a borderline steep slope.

Our analysis suggests that scaffolds with steep positive compound trend lines are quite likely also among those with high total compound counts, although the converse is not true. Most of the scaffolds that exhibit high total compound counts (see Figure 1) do not exhibit statistically significant positive compound trends. Thus, scaffolds that rank high by both compound count and steepness of the compound trend deserve special attention (scaffold IDs 3, 5, 8, 11, 12, and 16; trend line slopes depicted in Figure 3 (left) and Supporting Information, Figure S2). Scaffold ID 3 (chalcones; 505 unique compounds) possesses, by far, the steepest trend line of new compounds per year (slope: 3.8) with a continuous increase of new published compounds per year and peaks in single years (see Figure 3 (left)): starting with 43 unique compounds in 1998, additional unique compounds were published annually from year 2001 onward, showing a peak of 112 new compounds in 2009. A closer inspection of the number of different papers published in 2009 revealed that although there are in total 14 publications reported in 2009, only seven of these report more than a single new compound and the bulk of compounds (ca. 70%) was reported in only two of these publications.26,27 However, all 14 publications were authored by different research groups around the globe, reflecting a general increase in interest rather than just interest within a single research group. We also note that the trend line plot (Figure 3 (left)) shows a steep drop in 2014. However, evaluating the significance of this drop is hard because ChEMBL20 data was only curated until July/August 2014 (personal communication with A. Gaulton, EMBL-EBI). Because this condition applies to all trend analyses here, the reduction of the steepness of the overall trend will on average be similar for all scaffolds. Clearly, there is ongoing interest in the design of new chalcones over a period of 17 years. Interestingly, they have also been mentioned in the context of PAINS,23 as they are chemically quite reactive (see also Targetwise Bioactivity Trends: Specific Examples for Popular Scaffolds section).

1.2. Bioassays. Over time, the importance or popularity of a compound class (as characterized by a scaffold) may also be represented by its testing in an increasing number of bioassays. It should be noted that unique ChEMBL assay IDs are allocated to every distinct assay (i.e., assay protocol and assay conditions) for a set of compounds reported in a paper. If the same assay, run under a different condition, was reported in the same or a different paper, even for the same set of compounds, it receives a new assay ID. Thus, an analysis based on comparing distinct assay IDs reported for a scaffold does not truly reflect the number of distinct bioassays. However, as there are no other means to compare assays on a larger scale, unless manual data curation is performed for a selected subset of compounds or targets,28,29 we conducted the analysis using the bioassay IDs as provided by ChEMBL. The number of bioassays in which a scaffold has been tested is a measure of the effort of the authors of a paper to test their compound series in a number of different pharmacological experiments or targets due to a perceived importance of the compounds. Thus, we investigated the relationship of bioassay counts to counts of bioactivities, unique compounds, and


Figure 4. Twenty scaffolds in ChEMBL20 with the highest cumulative counts of bioassays; scaffolds are ordered by descending counts of assays. Scaffold IDs (upper left), cumulative counts of bioassays (upper right), cumulative counts of unique compounds (lower left), and cumulative target counts (lower right) are indicated in the boxes.

targets. The cumulative counts of total bioassays for a certain scaffold are moderately correlated to reported bioactivities for the 764 scaffolds (R² = 0.7; Supporting Information, Figure S3). However, only poor correlation with cumulative counts of unique compounds (R² = 0.31) or cumulative target counts (R² = 0.54) was observed (Supporting Information, Figure S3). The lack of correlation between cumulative bioassay and target counts might be a consequence of the degeneracy of bioassay identifiers, whereas targets are always treated as unique entities (and therefore can easily be mapped across different measurements and assays). Sorting scaffolds by their cumulative (static) bioassay counts, it becomes obvious that the scaffolds listed on top (Figure 4) exhibit a limited overlap with the ones with high cumulative compound counts (see Figure 1). Strikingly, the scaffold with the highest number of bioassays over the whole observation period (1998−2014) is a nongeneric five-ring system (scaffold ID 489). This typical daunorubicin-like scaffold has been measured in 1754 bioassays and on 163 different human targets although consisting of just 21 different unique member compounds (considering Ki and IC50 measurements only; Figure 4). This is obviously a very popular scaffold in the medicinal chemistry literature which would not have been highlighted based on mere compound counts. A similar example is the paclitaxel-like scaffold ID 87 (Figure 4), with a very complex chemical structure, a lower number of unique representatives (82), but high counts of unique assays (837) and targets (133). However, in both cases, we note that a single compound is responsible for these high assay counts. In the case of scaffold ID 489, doxorubicin has been tested in 1545 assays, and in the case of scaffold ID 87, paclitaxel has been tested in 835 assays (compared to 1754 and 837 assays in total for these scaffolds, respectively). These results are not surprising given that these compounds are commonly tested in oncology projects. Importantly, these assays correspond to 515 and 289 unique publications for scaffold IDs 489 and 87, respectively. This


Figure 5. Heat map displaying the number of publication years (“Count”) of scaffold IDs 1−20 in different medicinal chemistry journals.

indicates that the counts are not driven by a few publications describing large screening projects. For inspection of the temporal trends in bioassay counts over the observation period (1998−2014), we considered those scaffolds with tested assays in more than five different years (563 scaffolds), of which 102 exhibit statistically significant trends of tested assays per year. Thirty-nine of those scaffolds possess robust regression assay trend lines with steep positive slopes (between 14.7 and 0.7), while 63 exhibit moderately steep slopes between 0.7 and −0.3 (59 positive, 4 negative). Steep negative trends were not observed in this data set. A boxplot showing the distribution of assay trend line slopes is available as Figure 2B. Many scaffolds with steep positive tested assay trend lines also possess high cumulative assay counts (ordered by descending slopes): e.g., scaffold IDs 489 (daunorubicin derivatives), 87 (paclitaxel derivatives), 198 (1,3,4-thiadiazole derivatives), 38 (cis-stilbene derivatives), 59 (camptothecin derivatives), 251 (uracil derivatives), 6 (flavones), and 88 (benzanilide derivatives). Assay trend lines for the top five are shown in Figure 3 (right) (daunorubicin) and Supporting Information, Figure S4 (scaffold IDs 87, 198, 38, and 59). Hence, the daunorubicin-like scaffold (scaffold ID 489) exhibits the steepest positive trend line (slope: 14.7): starting with 19 bioassays in 1998, an almost steady increase in bioassay counts can be observed, with a peak of 261 bioassays reported in the literature in 2013 (Figure 3 (right)).

Interestingly, scaffold IDs 171 (curcumin and analogues) and 520 (sorafenib and analogues) with assay counts below 200 are also among the top 10 steepest trends. We would not have identified these two scaffolds as popular if we were constrained to using static assay counts. As is clear from their trend lines (Supporting Information, Figure S5), publications did not appear until 2004/2005, which explains their lower cumulative assay counts over the whole observation period (1998−2014). However, the steep trends of their assay counts over the last 10 years highlight their popularity within the medicinal chemistry community. Importantly, scaffold ID 171 (with 57 unique compounds, 391 bioactivities, 88 human targets, and 191 assays in total) and scaffold ID 520 (with 19 unique compounds, 224 bioactivities, 63 human targets, and 198 assays in total) owe their high occurrences in the medicinal chemistry literature to just one compound each: curcumin (belonging to scaffold ID 171) and sorafenib (belonging to scaffold ID 520) have been reported in 187 and 196 assays, respectively (with Ki or IC50 end points). Comparing the assay trend lines for curcumin alone with that of scaffold series 171 without curcumin (Supporting Information, Figure S6) as well as sorafenib alone with scaffold series 520 without sorafenib (Supporting Information, Figure S7), it becomes clear that interest in these scaffolds was mainly driven by single compounds (curcumin and sorafenib, respectively). This is, however, less pronounced for curcumin/scaffold ID 171 than for sorafenib/scaffold ID 520.

To be able to automatically flag scaffolds where one or a few compounds (assigned to these scaffolds) might be driving the assay trends, we filtered out scaffolds where at least one compound has measurements in more than 50% of the years in which the scaffold was tested. A detailed list of these compounds, assigned scaffolds, and the number of years with measurements is included in the Supporting Information, Data File 4. This analysis flagged 375 (49% of 764) scaffolds as potential scaffolds with assay trends being driven by only one or a few compounds. However, 389 scaffolds (51%) were not influenced by such singleton outliers. While our use of robust regression downplays the effects of such outliers, such a filter process is still useful to flag scaffolds for which this behavior is present.

Concluding this section on assay trends, we observed that the temporal analysis highlights certain scaffolds that have appeared later in the literature and can therefore not be detected by cumulative assay counts. Scaffold rankings for assay trends differ significantly from rankings for compound counts. Therefore, they offer an orthogonal measure to assess a scaffold's popularity.
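The flagging rule from this section on assay trends (at least one compound with measurements in more than 50% of the tested years) can be sketched as below. The reading that the 50% refers to the years in which the scaffold as a whole was tested is an assumption, as are the function name `dominant_compounds` and the `(compound_id, year)` record layout; the sketch is illustrative only.

```python
from collections import defaultdict


def dominant_compounds(records, threshold=0.5):
    """Return compounds measured in more than `threshold` of the
    scaffold's tested years.

    records: iterable of (compound_id, year) assay measurements for ONE
    scaffold (assumed layout). A non-empty result flags the scaffold as
    potentially driven by one or a few compounds.
    """
    scaffold_years = set()
    compound_years = defaultdict(set)
    for compound, year in records:
        scaffold_years.add(year)
        compound_years[compound].add(year)
    if not scaffold_years:
        return []
    n_years = len(scaffold_years)
    # Strictly "more than" the threshold, matching the wording in the text.
    return sorted(
        c for c, ys in compound_years.items() if len(ys) / n_years > threshold
    )
```

Returning the compounds rather than a bare boolean keeps the information needed to compare a scaffold's trend with and without the flagged compound, in the spirit of the curcumin and sorafenib comparisons.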


Figure 6. Target type composition (single protein targets vs cell line targets) for scaffolds with high cumulative target counts.

1.3. Journal Impact Factors. The third measure of popularity that we inspected is the importance of the journal in which a scaffold or its members were reported. There are many measures of a journal's importance and much debate on their correctness.30 We selected the Thomson Reuters Journal Impact Factor (JIF) for analysis. To investigate a potential relationship with other popularity measures, we inspected the median (per scaffold and year) JIF slopes of some example scaffolds that ranked at the top in the compound (scaffold IDs 3 and 5) and assay (scaffold IDs 87 and 489) popularity trend analyses. In general, a slight upward trend for median JIFs (slopes between 0.07 and 0.12) could be observed for these scaffolds (see Supporting Information, Figure S8). It is necessary to evaluate whether a steady increase actually corresponds to a series of compounds receiving more and more attention in the medicinal chemistry literature over time or whether the slope simply reflects the natural increase of the JIFs of certain journals over time. This can be done by analyzing whether scaffold series are mainly published in recurrent or diverse journals over the years. We investigated this by identifying the unique journals in which scaffold IDs 1−20 were reported (a static view). For each scaffold and journal, we counted the number of distinct years with publications reporting that scaffold over the 17 years. This is summarized in Figure 5 and suggests that compounds belonging to a certain scaffold tend to be reported in a small set of recurrent journals: J. Med. Chem., Bioorg. Med. Chem. Lett., Bioorg. Med. Chem., Eur. J. Med. Chem., and J. Nat. Prod. This is a small fraction of the total number of journals (49) that were referenced in ChEMBL during the study period. One reason for this bias is the limited number of available medicinal chemistry journals (in comparison to the broader availability of, e.g., pharmacology-related or medical journals).
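The two tabulations used in this section, the median JIF per scaffold and year and the number of distinct publication years per journal, can be sketched as follows. The record layout `(year, journal, jif)` with one entry per publication is an assumption, and both function names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_jif_per_year(pubs):
    """Median JIF per publication year for ONE scaffold.

    pubs: iterable of (year, journal, jif) tuples, one per publication
    (assumed layout).
    """
    by_year = defaultdict(list)
    for year, _journal, jif in pubs:
        by_year[year].append(jif)
    return {year: median(values) for year, values in sorted(by_year.items())}


def years_per_journal(pubs):
    """Number of distinct publication years per journal for ONE scaffold."""
    years = defaultdict(set)
    for year, journal, _jif in pubs:
        years[journal].add(year)
    return {journal: len(ys) for journal, ys in years.items()}
```

Applying `years_per_journal` to each of several scaffolds yields one row per scaffold of a scaffold-by-journal count matrix of the kind a heat map can display.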
These frequently recurring medicinal chemistry journals also exhibit a slight increase in their JIF over the years (see Supporting Information, Figure S9). It is likely that, overall, the steady increase in JIF trends that is observed for selected scaffolds is a consequence of a natural increase in JIFs over time. Thus, scaffold ranking according to the steepness of the slopes does not appear to be very informative in the case of median JIFs. It is likely that higher impact factor papers appear earlier in literature, when reporting first occurrence or study of a new compound or compound series. Subsequent publications in lower impact factor journals would in this case lead to a negative JIF slope. Conversely, a later high impact factor paper could lead to a favorable positive slope if the papers before

showed lower JIFs and a series of high impact factor papers with the same scaffold would lead to a flat JIF slope. For evaluating scaffold popularity, obviously the point in time at which (a) higher impact factor paper(s) appeared should not play a role. Therefore, we examined JIFs by computing the sum of all median (across a given year and scaffold) annual JIFs from 1998−2014 and denote this as sJIF. This measure not only accounts for the intermittent occurrence of high JIFs in certain years, but also the counts of JIF occurrences in multiple years. The scaffold series ranked on top, based on sJIF, is the one containing sunitinib and analogues (scaffold ID 44). While overall its trend line slope is rather flat (−0.1), it shows peaks of median JIFs in 2006 and 2008, which correspond to publications in Nat. Chem. Biol. (see Supporting Information, Figure S10). Another scaffold among the top 10 with high sJIF which was identified as popular beforehand is scaffold ID 5 (erlotinib and analogues). It shows no distinct peaks of median JIFs but an ongoing solid publication record (see Supporting Information, Figure S10). In summary, insights from JIF trends are hard to evaluate and require a thorough case by case inspection. However, in combination with compound counts and counts of bioassays, the sJIF might provide an additional measure of scaffold popularity. 2. Target and Bioactivity Trends. Another view of scaffold importance is represented by a scaffold series’ aggregated biological activities of its member compounds. We first examined static target composition and temporal target trends alone without considering bioactivity. Second, we examined targetwise bioactivity trends, showcasing the most relevant scaffolds from the perspective of their biological responses. 2.1. Target Composition and Target Trends. 
First, for a static overview, we computed the total number of unique targets for each scaffold in our data set over the whole observation period (1998−2014). Keeping all ChEMBL target types available in our analysis (single protein, cell line, protein complex, protein family, protein complex group, etc.) and ranking by total unique target counts, all top 10 scaffolds (possessing between 187 and 112 different targets) possess either high cumulative compound counts or high assay counts (ordered by descending total target count): scaffold IDs 1 (biphenyl derivatives), 6 (flavones), 7 (naphthalene derivatives), 489 (daunorubicin derivatives), 4 (serotonin and analogues), 87 (paclitaxel and analogues), 17 (diphenylmethane), 18 (quinoline), 2 (diphenyl ether), and 11 (trans-stilbene). On closer inspection of the target type composition of (the top 20) scaffolds with high total target counts, it is clear that in many cases more than half of the target types are cell lines (rather than single protein targets; see Figure 6). Whereas most of the scaffolds with low scaffold IDs (high counts of unique member compounds) from Figure 6 were measured on more single protein targets, those scaffolds with higher scaffold IDs (lower number of unique compounds) exhibit higher numbers of measurements on cell lines. An exception is, e.g., scaffold ID 3 (chalcone) with a high number of enumerated compounds (505), but a clear excess of cell lines versus single protein targets. The phenomenon could be explained in part by the rise in popularity of phenotypic screening in general31 as well as cell line profiling studies (especially in oncology).32

Figure 7. Trend of polypharmacological profile of single protein and cell line targets for scaffold series 8 (benzanilide derivatives) in binary heat map representation: abscissae, year; ordinate, target names; blue field, active ≤1 μM; black field, inactive >1 μM; gray field, no measurements.

Next, temporal target trends were examined, first by inspecting all targets, second by considering single protein targets and cell line targets separately. Counting all unique targets per year and ranking the 563 scaffolds with target counts in at least six years by the steepness of their target trend line slopes, the top five, scaffold IDs 6 (flavones), 38 (cis-stilbene), 87 (paclitaxel and analogues), 251 (uracil derivatives), and 489 (daunorubicin derivatives), also belong to the top 20 with the maximum number of cumulative target counts (Figure 6). The distribution of trend line slopes considering all target types is shown in Figure 2C. The situation changes when only single protein targets are considered for analyzing the steepness of the target trends (424 scaffolds with single protein target counts in at least six years): apart from scaffold IDs 2 (diphenyl ether derivatives), 6, and 10 (benzyl phenyl ether derivatives), which are among the scaffolds with highest cumulative compound and target counts

DOI: 10.1021/acs.jmedchem.7b00954 J. Med. Chem. XXXX, XXX, XXX−XXX

Journal of Medicinal Chemistry

Perspective

Figure 8. Trend of polypharmacological profile of single protein and cell line targets for scaffold series 3 (chalcones) in binary heat map representation: abscissae, year; ordinate, target names; blue field, active ≤1 μM; black field, inactive >1 μM; gray field, no measurements.

(see Figures 1 and 6), also scaffold ID 27 (benzimidazole) and 198 (1,3,4-thiadiazole derivatives) are found among the top five with steepest significant positive slopes. We next considered temporal trends for the case where cell lines are identified as the assay target. Considering exclusively the target type “cell line” and counting the number of those targets per year, the top five scaffolds with statistically significant steepest positive trend lines are also among the top 20, with the highest cumulative bioassay counts and/or steepest bioassay trend lines: scaffold ID 38 (cis-stilbene derivatives), 87 (paclitaxel and analogues), 88 (colchicine and analogues), 251 (uracil and derivatives), and 489 (daunorubicin and analogues). A closer look at the respective assay descriptions reveals that target type “cell line” usually indicates a cytotoxicity assay: assay descriptions usually contain the terms “cytotoxicity/cytotoxic”, “antiproliferative”, or “growth inhibition”. Therefore, a temporal shift from single protein

targets toward cell line targets (or vice versa) would indicate a shift in the focus of interest in these scaffold series (such as from a target specific focus to looking for broad spectrum activity). 2.2. Targetwise Bioactivity Trends: Specific Examples for Popular Scaffolds. Considering target and bioactivity trends together, bioactivity data can be aggregated by targets, allowing us to examine target profiles of scaffolds over time and show the distribution of actives versus inactives. In this section, we focus on some “popular” scaffolds from previous sections. In addition, scaffolds with a high number of measurements on single protein targets are compared to other representatives with a high number of cell lines as targets. For visualization purposes, only single protein as well as cell line targets with at least five unique compounds (within each scaffold series) were considered.


potential therapeutic effects for chalcones have been postulated (such as analgesic and anti-inflammatory effects,35,36 cardiovascular effects,37 antiangiogenic and hypolipidemic effects,38 and immunomodulation39) given their wide reactivity with many pharmacologically relevant targets. It is now clear that the effects on these targets are largely due to unspecific reactivity of the α,β-unsaturated ketone.23,40 This reactivity is however not very strong, so that a certain degree of selectivity for some targets might be observed, as is also obvious from the heat map in Figure 8. Thus, this group of compounds does not at first sight attract attention as potential PAINS substructures, exhibiting promiscuous cellular responses rather than real activity. As demonstrated by our polypharmacological trend heat map, the medicinal chemistry community’s interest in this special group of chemicals did not drop over the last 17 years, although no drug based on this scaffold has made it to the market. The reasons for this behavior within the community might be manifold. It is tempting to speculate that a lot of effort was driven by the high synthetic feasibility of chalcone derivatives.41 In addition, researchers in the field might not have been aware of the unspecific responses of chalcones due to lack of systematic literature surveys. 3. Trends in Scaffold Liabilities. Out of 764 unique scaffolds, only 33 scaffolds contain PAINS liabilities (and each one of the 33 contains a single PAINS substructure). The list of matching PAINS and the number of scaffolds containing them is summarized in Table 1. The table also reports the

Scaffold ID 8 (benzanilide derivatives) was selected as a representative scaffold with a high fraction of single protein targets (see Figure 6). It possesses a high cumulative compound count as well as a steep positive trend line of new compounds per year. Scaffold ID 3 (chalcones) was prioritized as representative for scaffolds with high numbers of cell line targets (see Figure 6). It shows a steep positive trend line slope for cell line targets (among top 10) and is among the ones with the steepest trend line of new targets per year. Moreover, scaffold ID 3 is among the top three scaffolds with highest cumulative compound counts (>500) and shows the steepest increase of new compounds per year. For scaffold IDs 8 and 3, heat maps reflecting their polypharmacological target profiles over time were generated such that each cell represents the median activity label from all compounds with measurements for a certain target (see Figure 7 and Figure 8). Thus, they provide a dynamic representation of the community’s testing efforts as well as the biological responses that are caused by members of certain scaffold series. For this investigation, targets were grouped by taxonomic relationships (target families). Drug annotations were included, showing the appearance of approved drugs in certain years. For scaffold ID 8, this has been visualized in a separate heat map (Supporting Information, Figure S11). These analyses allow us to address a number of questions related to polypharmacology. For example, which targets within a scaffold series have been tested more frequently than others? Over time, can we observe a shift from more cell line to single protein targets or vice versa? Can selectivity profiles/tendencies be deduced for whole scaffold series? Answers to these questions reveal the pharmacological testing behavior within the community over the years.
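The construction of the binary heat-map cells described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1 μM cutoff is taken from the captions of Figures 7 and 8, the record layout (target, year, activity in nM) and the function name are hypothetical, and the handling of an exact tie in the median label (counted as active here) is an assumption, since the text does not specify it.

```python
from collections import defaultdict
from statistics import median

ACTIVE_THRESHOLD_NM = 1000.0  # 1 uM cutoff, as stated in the figure captions


def activity_heatmap(records):
    """Collapse (target, year, activity_nM) records into one label per cell.

    A cell missing from the returned dict means "no measurements".
    """
    labels = defaultdict(list)
    for target, year, activity_nm in records:
        # Per-compound binary label: 1 = active, 0 = inactive.
        labels[(target, year)].append(1 if activity_nm <= ACTIVE_THRESHOLD_NM else 0)
    # Assumption: a median label of exactly 0.5 (a tie) is shown as active.
    return {
        cell: ("active" if median(values) >= 0.5 else "inactive")
        for cell, values in labels.items()
    }


# Hypothetical example records for one scaffold series.
example_records = [
    ("EGFR", 2005, 50.0),
    ("EGFR", 2005, 200.0),
    ("EGFR", 2005, 5000.0),
    ("HeLa", 2006, 20000.0),
]
example_cells = activity_heatmap(example_records)
```

Cells absent from the returned dictionary correspond to the gray "no measurements" fields of the heat maps.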
The heat maps highlight the extent to which the community was interested in a certain target or a target family over the years and if bioactivity was observed for those targets (rather than inactivity). This exercise also highlights the extent of hypothesis-driven research based on testing around well-known (or well-studied) targets33 or for cytotoxicity across all sorts of cell lines, often without taking the exact mechanism of action or underlying regulatory pathway into account. Early testing of benzanilide compounds (scaffold ID 8) on single protein targets shows activity on matrix metalloproteinases and carbonic anhydrases (Figure 7). Later (in 2006), positive cytotoxicity measurements led to testing on cell line targets. As seen from the heat map of approved drugs contained in this scaffold series (Supporting Information, Figure S11), the only marketed drug, niclosamide, an antihelminthic agent for therapeutic use against tapeworm infections, was tested against cell line targets almost exclusively. As the mechanism of action in this case is uncoupling of oxidative phosphorylation or stimulation of ATPase activity in the worms, multiple protein targets are involved in this therapeutic effect (information from www.drugbank.ca). Another interesting case is the chalcone series (scaffold ID 3; Figure 8), with much more data on cell lines than on single protein targets over the years. Starting with positive cytotoxicity measurements on HeLa and K562 cells in 1998, some unsystematic cytotoxicity measurements on other cell lines and some typical protein targets related to cancer (e.g., EGFR, ErbB2, TNF) show that, apparently, research was driven by the hypothesis that chalcones might represent potential anticancer agents.34 Over the years, diverse other

Table 1. Summary of the PAINS Substructures Contained within the 764 Unique Scaffolds, and the Number of Scaffolds That Contain a Given PAINS Substructure (“Count”)a

PAINS                                   year    count
quinone_A(370)                          1998    8
imine_one_A(321)|quinone_D(2)           2002    4
indol_3yl_alk(461)                      1998    4
azo_A(324)                              2006    3
ene_one_ene_A(57)                       2011    3
ene_rhod_A(235)                         2007    3
amino_acridine_A(46)                    2009    1
anil_di_alk_A(478)                      2012    1
ene_five_het_L(4)                       2013    1
het_thio_666_A(13)                      2013    1
imine_one_A(321)                        2008    1
imine_one_isatin(189)                   2004    1
keto_phenone_A(11)                      2012    1
quinone_A(370)|anthranil_one_A(38)      2011    1

a The “year” column indicates the year that this substructure was first observed in a scaffold.

earliest occurrence of any scaffold containing the given PAINS substructure. The full list of 33 scaffolds, the year of first occurrence, and their associated PAINS matches are provided in Supporting Information, Data File 5. It is encouraging that only two of the scaffolds identified as popular by our analysis contain a substructure recognized as PAINS by our filter, namely scaffold ID 12 (naphthoquinone) and scaffold ID 489 (daunorubicin). In both cases, it is the quinone substructure causing the alert, which is also the most frequent PAINS substructure contained in our data set (see Table 1). We identified two scaffolds containing PAINS substructures from 1998, but since then PAINS-containing scaffolds have appeared


Figure 9. Distribution of trends for scaffolds containing PAINS substructures (PAINful) and those without (PAINless). In both cases, the trend corresponds to the slope of robust regression model fitting a feature of a scaffold to year. (left) Trend computed for the number of unique compounds associated with a scaffold per year. (middle). Median (Z-scored) bioactivity for compounds associated with a scaffold (over all years). (right) Trend computed for the median Z-scored bioactivities associated with a scaffold per year.

in nearly every year (with five showing up for the first time in 2009). While 2009 does have the largest number of scaffolds with PAINS liabilities, there is no statistically significant trend in the number of PAINS-containing scaffolds per year. We first examined whether the size of scaffold member sets differed, over time, for those with PAINS liabilities versus those without. This could suggest differences in how regions of chemical space (using the scaffold as a proxy) are explored based on the presence of liabilities within those regions. As before, for each scaffold, we fitted a robust linear regression for count of unique compounds as a function of year and used the slope of the year variable as an indicator of an increasing, decreasing, or static trend in count versus year. Figure 9 (left) plots the distribution of slopes for the “PAINful” (i.e., scaffolds containing a PAINS substructure) versus “PAINless” (i.e., scaffolds without a PAINS substructure) scaffolds and suggests that there is a small difference in the trends between these two classes, with PAINful scaffolds showing a slightly steeper trend on average than PAINless scaffolds. However, although this difference is statistically significant (Wilcoxon rank sum test, p = 0.009), the effect size is very small. Interestingly, of the 33 PAINful scaffolds, two scaffolds (0.2%) exhibit slopes greater than 1, whereas 31 PAINless scaffolds (4%) exhibit slopes greater than 1. The two PAINful scaffolds with slopes greater than 1 (indicating an increase in the number of compounds associated with them over time) are scaffold IDs 224 (9-benzyl-1,2,3,4-tetrahydrocarbazole) and 12 (naphthoquinone). Even though both contain PAINS substructures, the biomedical literature shows that these scaffolds are being actively studied. Indeed, a search for “naphthoquinone” on PubMed42 shows a near monotonic increase in publications, from 1940 onward, referring to this term (Supporting Information, Figure S12).
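The per-scaffold slope used as a trend indicator can be illustrated with a small, dependency-free Python sketch of a robust line fit. The analysis itself was carried out in R with the rlm function, so what follows is not that code: it is an iteratively reweighted least-squares fit with Huber weights, assuming a tuning constant of 1.345 and a scale estimate of median(|residual|)/0.6745, which are believed to correspond to the rlm defaults. The function name and the example counts are hypothetical, and the significance test on the slope is not reproduced.

```python
from statistics import median


def huber_line_fit(x, y, k=1.345, max_iter=100, tol=1e-10):
    """Fit y ~ intercept + slope * x with Huber weights via IRLS.

    Returns (intercept, slope). The first pass is ordinary least squares.
    """
    weights = [1.0] * len(x)
    previous = None
    intercept = slope = 0.0
    for _ in range(max_iter):
        # Weighted least squares for a straight line (closed form).
        sw = sum(weights)
        x_bar = sum(w * xi for w, xi in zip(weights, x)) / sw
        y_bar = sum(w * yi for w, yi in zip(weights, y)) / sw
        sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
        if sxx == 0:
            raise ValueError("x values must not all be identical")
        sxy = sum(w * (xi - x_bar) * (yi - y_bar)
                  for w, xi, yi in zip(weights, x, y))
        slope = sxy / sxx
        intercept = y_bar - slope * x_bar
        if previous is not None and \
                abs(slope - previous[1]) < tol and \
                abs(intercept - previous[0]) < tol:
            break
        previous = (intercept, slope)
        # Re-estimate the residual scale and the Huber weights.
        residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
        scale = median(abs(r) for r in residuals) / 0.6745
        if scale == 0:  # at least half of the points are fitted exactly
            break
        cutoff = k * scale
        weights = [1.0 if abs(r) <= cutoff else cutoff / abs(r)
                   for r in residuals]
    return intercept, slope


# Hypothetical compound counts per year: a steady increase of two
# compounds per year, with one anomalous year (2003).
years = list(range(1998, 2008))
counts = [2 * (yr - 1998) + 1 for yr in years]
counts[5] = 100
ols_slope = huber_line_fit(years, counts, max_iter=1)[1]
robust_slope = huber_line_fit(years, counts)[1]
```

In this toy series the single anomalous year pulls the ordinary least-squares slope well above 2, whereas the robust slope stays close to the underlying increase of two compounds per year, which is the reason for preferring a robust fit when yearly counts are noisy.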
It is encouraging to see that our trend analysis is in line with our assumption that PAINless scaffolds will tend to be more extensively explored than PAINful scaffolds. We also investigated the distribution of bioactivities over time for scaffolds with PAINS liabilities versus those without. First, we compared the overall distribution of bioactivities of

scaffolds with and without PAINS liabilities. For this analysis, the Z-scored bioactivity for a scaffold was taken as the median of the Z-scored bioactivities of the individual compounds belonging to that scaffold. We did not aggregate replicate data or separate activity by target for individual compounds. Figure 9 (middle) suggests that there is no difference in the means of this distribution, and this is supported by a Wilcoxon rank sum test (p = 0.65). Next, we stratified the activity data by year and performed the same analysis, which indicated a single year (2007) where there was a statistically significant difference (Wilcoxon rank sum test, p = 0.04) in the scaffold bioactivity distributions between the PAINful and PAINless scaffolds. For every other year between 1998 and 2014 there was no difference. Considering 2007, we identified 24 and 416 PAINful and PAINless scaffolds, respectively. The median value of the scaffold activities (Z-scored) was −0.18 for the PAINful class and 0 for the PAINless class, indicating a slightly greater activity, on average, for compounds containing PAINless scaffolds. Lastly, we examined the bioactivity trends to see if there was a difference between the two classes of scaffolds. As with the compound count analysis described above, we fitted a robust regression model to the median scaffold Z-scored bioactivity (for a given scaffold) versus year and compared the distributions of slopes for the PAINful and PAINless scaffolds. We observed no difference between the slopes for the PAINful and PAINless scaffolds (Wilcoxon rank sum test, p = 0.40). As we showed previously, most scaffolds appear to exhibit static bioactivity trends over time, with a small fraction showing an appreciable increase or decrease.
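The scaffold-level Z-scored bioactivity used in these comparisons can be sketched as follows. The text states that robust Z-scores were computed per assay and that the scaffold value is the median over its member compounds; the exact robust Z-score formula is not given, so the median/MAD form below (with the conventional 1.4826 scaling) is an assumption, as are the record layout, the function names, and the choice to return zeros when the MAD is zero.

```python
from collections import defaultdict
from statistics import median

MAD_TO_SD = 1.4826  # makes the MAD comparable to a standard deviation


def robust_z_scores(values):
    """Robust Z-scores: (value - median) / (1.4826 * MAD)."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # Assumption: with no spread around the median, report zeros.
        return [0.0 for _ in values]
    return [(v - center) / (MAD_TO_SD * mad) for v in values]


def scaffold_median_z(records):
    """records: iterable of (assay_id, scaffold_id, activity_value).

    Z-scores are computed within each assay, then summarized per scaffold
    by the median over all of its data points (no replicate aggregation).
    """
    by_assay = defaultdict(list)
    for assay_id, scaffold_id, value in records:
        by_assay[assay_id].append((scaffold_id, value))
    by_scaffold = defaultdict(list)
    for rows in by_assay.values():
        z_values = robust_z_scores([value for _, value in rows])
        for (scaffold_id, _), z in zip(rows, z_values):
            by_scaffold[scaffold_id].append(z)
    return {scaffold: median(zs) for scaffold, zs in by_scaffold.items()}


# Hypothetical records: one assay, two scaffolds.
example_records = [
    ("A1", "s3", 1.0), ("A1", "s3", 2.0),
    ("A1", "s8", 3.0), ("A1", "s8", 4.0), ("A1", "s8", 100.0),
]
example_scores = scaffold_median_z(example_records)
```

Because both the centering and the scaling use medians, the extreme value of 100 in the example has little influence on the scores of the remaining data points.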
In Figure 9 (right), we note that the median slope for both classes of scaffolds is essentially zero, although the PAINless class has some outliers (defined as scaffolds with slopes greater than 0.25 or less than −0.25), of which there are 25 (3.4%). Given the preponderance of the quinone substructure within the PAINS-containing scaffolds, we examined the 13 quinone-containing scaffolds in more detail. The distributions of enumerated compounds, bioactivities, and slopes for these scaffolds were similar to the overall PAINS trends described


statistically significant trend fits) for three of the popularity measures we investigated. These plots are displayed in Supporting Information, Figure S13. In all three cases, the Pearson correlation between synthetic feasibility and slope is essentially 0, suggesting that ease of synthesis does not (explicitly) drive trends in these popularity measures. Note that synthetic feasibility (or equivalently, molecular complexity) is intuitively a time varying property; with advances in synthetic methodology, previously complex scaffolds may become easier to synthesize.48 The nature of assays in which scaffold members are tested may underlie specific trends. Specifically, certain types of assays such as cytotoxicity (e.g., tested on HepG2 cells) and Caco-2 permeability assays tend to be run for all hits in a screening campaign. To explore this further, we identified the cell lines for all the assays considered in our data set. Overall, the three most frequently occurring cell lines were CHO, HEK293, and MCF7, which are commonly used lines for viability studies. We then computed the frequency of occurrence of cell lines associated with each scaffold (i.e., by the assays in which the scaffold members were tested). The result is tabulated in Supporting Information, Data File 7. Interestingly, if we consider the most frequently occurring cell line for each scaffold, we observed just 14 scaffolds for which that cell line is HepG2 or Caco-2. Of these 14 scaffolds, only three had trend slopes (in any of the properties we considered) that were statistically significant. Furthermore, the slopes themselves were classified as flat or medium. The data thus suggests that testing in standard counterscreens is unlikely to drive the observed trends. Another factor that may drive trends is medicinal chemistry intuition. As noted in Shanmugasundaram et al.,49 lead optimization hypotheses depend on a chemist’s experience and intuition to a large degree.
However, the role and extent of intuition is difficult to quantify, although Brown et al.50 have quantitatively described this issue in the context of preferential para-substitutions of aromatic rings. More fundamentally, we hypothesize that a scaffold may receive attention due to attention it has received in the past. This may be especially true when prior studies have shown biological activity against interesting or relevant targets. This behavior is analogous to preferential attachment processes whereby wealth or credit is distributed among objects based on how much they already have.51

above. These scaffolds covered 741 unique compounds, tested in 2352 unique assays. 408 compounds were tested in multiple assays, with the top five most frequently (>50) tested compounds being doxorubicin, adriamycin, mitoxantrone, daunorubicin, and geldanamycin. Because many assays run as part of oncology projects will include well-known chemotherapeutics as controls and given the prevalence of PAINS in approved drugs (∼6%),43 it is not surprising to see these compounds in the PAINful category. On the other hand, 333 compounds containing the quinone substructure were tested a single time. Using the ChEMBL target type annotation, we identified assays whose targets were annotated with CELL_LINE or ORGANISM as being overenriched in a statistically significant manner (Fisher’s exact test, p = 0 and 1.36 × 10^−6, respectively). This is in contrast to the set of scaffolds not containing PAINS, for which assays with protein targets (SINGLE PROTEIN, PROTEIN FAMILY, and PROTEIN COMPLEX) were statistically significantly overenriched (Fisher’s exact test, p < 0.015). A priori, it is not obvious why PAINS-containing scaffolds should be preferentially tested in cell-based rather than biochemical assays. However, recent work by Weerapana et al.44 and Jöst et al.45 suggests that the in vivo behavior of reactive groups may not be as significant as believed, which could indicate that cell-based assays are more robust to such compounds than biochemical assays. Because PAINS and reactive groups in general are useful to avoid, the fact that approved drugs contain such moieties, coupled with in vivo behavior of reactive compounds, indicates that while reactive groups should make one wary, hard rules on their exclusion may not be beneficial.
Furthermore, while scaffolds may contain PAINS, the presence of a privileged substructure within the scaffold may override the liability of the PAINS moiety.46 Nonetheless, our analysis highlights the fact that the medicinal chemistry literature tends to avoid scaffolds containing PAINS, and even for the PAINS containing scaffolds, there is little difference in key property distributions. 4. Toward a Consensus Trend Score. Aiming to unify the observations based on taking different viewpoints of scaffold trends (numbers of compounds, assays, targets, and liabilities), we created a consensus trend score by applying the rank product (RP) method.47 Liability trends contributed inversely to the score. According to this score, the top 10 highly ranked scaffolds are (ordered by descending consensus score): scaffold IDs 3, 87, 489, 5, 38, 198, 171, 11, 59, and 12 (a full list of all ranked scaffolds is given in Supporting Information, Data File 6). We note that the method does not weight the different contributions and that therefore for each use case the ranking can be adapted. Nonetheless, it presents a convenient way to obtain a global overview of popular scaffolds in large historical data collections. 5. Why Is a Scaffold Popular? The preceding discussion has focused on identifying scaffolds that are receiving increasing or decreasing attention over time, but we have not discussed why a scaffold exhibits a specific trend. Explaining why a given scaffold is being enumerated or being tested in more assays than another scaffold is difficult, if not impossible, in the absence of detailed information on the project studying those scaffolds. However, we explored whether certain general factors could explain attention trends. One might expect that synthetically accessible scaffolds will tend to be explored more than complex scaffolds. We computed a synthetic feasibility score using MOE 2016 and plotted the score versus the slopes for scaffolds (with



SUMMARY AND CONCLUSIONS

A key motivation of this study was to explore how scaffolds and their associated properties could be analyzed in a temporal fashion. This is motivated by the fact that scaffolds represent a core conceptual element of a medicinal chemist’s thought process and the fact that scaffolds are associated with classes of molecules with specific types of activities. In this sense, some scaffolds receive more attention in the literature in the form of publications and thereby appear to be “popular”. We consider scaffolds to be popular if many unique compounds containing the same scaffold are reported in the medicinal chemistry literature and/or if they are showing an increasing trend in the number of these new member compounds over time. An example of such a scaffold series is the chalcones, which showed the steepest positive trend line slope of new published compounds per year (considering ChEMBL20). Another, orthogonal, measure of scaffold popularity is given by trends in the number of assays in which a certain scaffold was tested. As


increasing or decreasing trend at a certain point in time are easy to determine by looking into the respective publications. In addition, polypharmacological trend depictions per scaffold deliver a complete picture of the target-bioactivity profiles over time. This is very helpful in identifying a shift in interest concerning the testing behavior within a scaffold series over time. In a drug discovery setting, this helps to avoid investing effort in well-studied and/or useless (due to liabilities) compound series, especially if information about marketed drugs is included in the analyses. The basis of this study has been Bemis−Murcko scaffolds, but there is no restriction on the actual scaffolds that could be used. A particular fragmentation scheme (along with fragment selection rules) may result in some relevant scaffolds not being considered. An example is the oxetane scaffold, which has received increasing interest due to a combination of useful physicochemical properties and synthetic utility62 but is not considered in our analysis. In addition, our workflow did not perform any postprocessing after the Bemis−Murcko fragmentation. This results in some cases such as scaffold IDs 198 (1,3,4-thiadiazole) and 515 (2-amino-1,3,4-thiadiazole) which could be merged into a single scaffold. Whether this is the case is a subjective decision that depends on the fragmentation pipeline employed by the user. While alternative fragmentation methods are available (such as MEQI,63 matched molecular pairs, and so on), exhaustive fragmentation could identify scaffolds with more extreme trends. Furthermore, while Bemis−Murcko scaffolds are useful, hierarchical schemes64 would allow us to refine the temporal analysis of scaffold “families” and are the focus of future work. In particular, a hierarchical scheme coupled with appropriate postprocessing could exclude scaffold pairs such as IDs 18 (quinoline) and 20 (quinolin-4-yl-aminophenyl), where ID 18 is a substructure of ID 20.
We also note that the methodology described here could be easily applied to any arbitrary substructure. Finally, we note that while our focus has been exploring the community interest and popularity of a scaffold based on the (primarily academic) medicinal chemistry literature as captured by ChEMBL, we have not considered patents, which also represent a key measure of research relevance. Given that commercialization efforts involve scaffold hopping to identify patent holes, it might be expected that a given scaffold is not well represented across patents from different companies. At the same time, scaffold trends extracted from patents could provide insight into scaffolds and, more generally, regions of chemical space that are of rising commercial interest. Using resources such as SureChEMBL,65 we plan to explore how scaffold trends can be extracted from patent data. The workflows for curating and mining data in order to generate the summary data for the determination of scaffold trends are freely available from myExperiment (http://www.myexperiment.org/packs/725.html), and the code for calculating the trends and consensus ranking is available from a GitHub repository (https://spotlite.nih.gov/ncats/ScaffoldTrends). While the current analysis is based on a number of specific choices of fragmentation scheme and liability filter, the approach is quite general and can be adapted to specific needs. Although the analyses proposed here do not support robust conclusions regarding the causality of trends, we have highlighted a number of factors that may play a role. More rigorous explanations of causality will require access to project-level metadata. Nonetheless, we believe that the temporal

this measure is also related to testing on different targets, scaffolds which are showing an increasing trend in the number of assays are also likely to possess a high cumulative target count. A good example of this kind is the daunorubicin scaffold, with the steepest positive assay slope but also a steep positive trend line of tested targets per year (and here especially cell line targets). Furthermore, if we consider “popular” to mean that it is of interest to many people, then publication of a scaffold in a higher profile (or more general) journal allows us to characterize the scaffold as “popular”. Because higher JIFs correlate to higher profile journals, we believe that using JIF as a proxy for scaffold popularity is not unreasonable. However, it is hard to differentiate a real increasing trend in scaffold popularity from a natural increase in a journal’s impact factor over time and thus the sum of JIFs across all years might provide a better way to assess scaffold popularity. Although there has been much discussion on the pitfalls of JIFs and other simplistic metrics of a journal’s importance,52−54 it is nonetheless an orthogonal way to assess the community’s perceived interest in a scaffold. Indeed, given that publications (and more so, publications in high profile journals) drive many real-world decisions (e.g., promotion, tenure, awards, funding), assessing trends via some form of bibliometric analysis is not unfounded. Our exploration of trends in liabilities was motivated by the fact that, over time, accumulation of activity data and research reports would allow researchers to avoid exploration of scaffolds with liabilities. As recently outlined by Dahlin, Baell, and Walters,55,56 the “natural history” of compounds and their analogues can serve as an evidence-based guideline in early stages of drug discovery.
The authors suggest an extensive literature survey using resources and tools such as PubChem,57 ChEMBL, Scifinder,58 or Reaxys59 to be informed about potential interference compounds (possessing bioassay promiscuity including reactivity). While our analysis showed that most scaffolds do not exhibit PAINS substructures, when they are present, there is no distinct trend in the occurrence of these liabilities in scaffolds over time. Our analysis does suggest that scaffolds without PAINS substructures have attracted more attention, compared to scaffolds with PAINS, as evidenced by the increasing size of the enumerated compounds for these scaffolds over time. Of course, it has been noted that PAINS are not necessarily the best way to characterize liabilities43,55 given the method by which they are defined. We use it because it is well-known, but the analysis could be repeated with any other liability filter such as REOS60 or the Lilly MedChem filter.61 Taking all orthogonal scaffold popularity measures into account, we finally present a consensus trend score approach by calculating the rank product of the trend slopes of the numbers of new compounds per year, assays per year, targets per year, and liability trends. We propose to use this consensus ranking as the basis of a multidimensional view of a scaffold’s popularity in big data collections. The weighting of the popularity contributions however likely needs to be adjusted depending on the use case. As popularity does not imply being a worthwhile scaffold for future drug discovery efforts, scaffold analysis cannot be reduced to using a simple formula. The consensus popularity ranking may help to assess if a compound series has been studied heavily before, and a more detailed inspection of the contributions to this score and the trend lines themselves will show years of strong or little interest. Likely reasons for an


anything else as flat. In some cases, no correlation trend could be determined due to a large number of missing values or zeros. Consensus Ranking Using the Rank Product Method. We implemented the rank product (RP) method in R (3.3.2) as described in Breitling et al.47 Briefly, scaffolds were independently ranked using the slopes of the trend lines derived from four popularity measures: tested assays, unique compounds, liabilities, and new targets (including protein and cell line targets). When the fit was not statistically significant, the slope was set to “NA”, resulting in the scaffold being ranked last. Ties in ranks were resolved using the average rank. For each scaffold, the RP was computed by taking the product of the four independent ranks. The scaffolds were then reranked based on their RP values.
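The ranking steps listed above can be sketched compactly in Python. The original implementation was written in R, so this is an illustrative re-expression only. The function names are hypothetical, non-significant fits are represented as None, several missing slopes are assumed to share the average of the last ranks, and the liability measure is ranked in ascending order of slope as one possible reading of the statement that liability trends contribute inversely to the score.

```python
from math import prod


def average_ranks(values, descending=True):
    """Rank values so that 1 is best; None is ranked last; ties are averaged."""
    def sort_key(v):
        if v is None:               # non-significant fit ("NA")
            return (1, 0.0)
        return (0, -v if descending else v)

    order = sorted(range(len(values)), key=lambda i: sort_key(values[i]))
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block over all entries tied with the current one.
        while (end + 1 < len(order)
               and sort_key(values[order[end + 1]]) == sort_key(values[order[start]])):
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def rank_product(slopes_by_measure, inverse_measures=("liabilities",)):
    """Product of the per-measure ranks for each scaffold (lower is better)."""
    per_measure = [
        average_ranks(slopes, descending=(name not in inverse_measures))
        for name, slopes in slopes_by_measure.items()
    ]
    return [prod(ranks) for ranks in zip(*per_measure)]


# Hypothetical slopes for three scaffolds (None = fit not significant).
example_slopes = {
    "compounds":   [3.0, 1.0, None],
    "assays":      [2.0, 2.0, 0.5],
    "targets":     [1.0, 0.2, 0.1],
    "liabilities": [0.0, 0.3, None],
}
example_rp = rank_product(example_slopes)
example_final_ranks = average_ranks(example_rp, descending=False)
```

Because the product is unweighted, a single poor rank in one measure can dominate the consensus, which is consistent with the remark that the ranking may need to be adapted for each use case.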

analyses described here can provide useful information on the historical synthetic and testing behavior associated with regions of chemical space. This can be useful at the beginning of a medicinal chemistry project as well as when determining scaffolds that could be avoided due to extensive interest in favor of less studied scaffolds.



METHODS

Data Retrieval. Data was retrieved from ChEMBL20 via a PostgreSQL interface (pgAdmin3) by querying for compounds (ChEMBL compound IDs, preferred compound names, canonical SMILES) along with their bioactivities (value, pChEMBL value, activity end point), target information (ChEMBL target ID, preferred target name), assay information (ChEMBL assay ID), and document information (journal, publication year, PMID, DOI). In this initial query, filters for the target organism (only “Homo sapiens” was kept), activity unit (only “nanomolar” was kept), and activity relation sign (only ‘=’ was kept) were applied. No activity thresholds were applied to classify compounds as active or inactive. This led to an initial data set with 1.73 million data entries. The initial search query is given as supplement. We note that the assay data considered in this subset was originally obtained from three data sources, of which the “Scientific Literature” accounted for 99.4% of assays considered in this analysis. Next, the data file was imported into a KNIME66 (version 3.1) environment to perform the scaffold generation, grouping by scaffold, data filtering, and postprocessing. Scaffold Generation, Data Filtering, and Processing. Scaffolds for the 1.73 million compounds (nonunique) were retrieved by using the ‘RDKit Find Murcko Scaffolds’ node in KNIME. After filtering for activity end points “Ki” and “IC50” and for data entries which were published in 1998 and later, 619671 nonunique compounds (282878 unique compounds) with calculated Murcko scaffolds were retained. Grouping by identical scaffolds led to 103772 scaffold clusters. By retaining scaffolds with at least 10 unique compounds, publications in at least five different years, and excluding scaffolds with too generic ring systems (six-membered or smaller ring systems with just one or no heteroatom placed inside the ring), 764 distinct scaffolds (101622 data entries, 36328 unique compounds) were retrieved.
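The two count-based retention criteria described above (at least 10 unique compounds and publications in at least five different years) can be sketched in plain Python. The original workflow was built in KNIME, so this is only an illustration: the record layout and the function name are hypothetical, and the exclusion of overly generic ring systems is not reproduced because it requires structure handling.

```python
from collections import defaultdict


def retain_scaffolds(records, min_compounds=10, min_years=5):
    """records: iterable of (scaffold, compound_id, publication_year).

    Returns the sorted list of scaffolds meeting both count thresholds.
    """
    compounds = defaultdict(set)
    years = defaultdict(set)
    for scaffold, compound_id, year in records:
        compounds[scaffold].add(compound_id)
        years[scaffold].add(year)
    return sorted(
        scaffold
        for scaffold in compounds
        if len(compounds[scaffold]) >= min_compounds
        and len(years[scaffold]) >= min_years
    )


# Hypothetical records for three scaffolds.
example_records = (
    # "A": 10 compounds spread over 10 different years (retained)
    [("A", f"cpd_a{i}", 1998 + i) for i in range(10)]
    # "B": 10 compounds but only 2 publication years (dropped)
    + [("B", f"cpd_b{i}", 2000 + i % 2) for i in range(10)]
    # "C": 5 publication years but only 3 unique compounds (dropped)
    + [("C", f"cpd_c{i % 3}", 2001 + i) for i in range(5)]
)
example_retained = retain_scaffolds(example_records)
```

Using sets for both compound identifiers and years means that repeated measurements of the same compound, or several publications in the same year, are counted only once, in line with the "unique compounds" and "different years" wording of the criteria.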
Further, activity data points within these 764 scaffold clusters were converted into robust Z-scores on a per assay basis (by using ChEMBL assay IDs for the grouping), and Thomson Reuters journal impact factors (JIFs) were mapped accordingly (per journal and year). Filtering for PAINS Liabilities. We implemented a Python script using RDKit (2016.03.3) to match structures of scaffolds and their enumerated compounds to the set of PAINS filters described by Holloway et al.23 The script is available as Supporting Information. Trend Characterization. To characterize scaffold trends, for a given property X (number of unique compounds, number of bioassays, aggregated bioactivities, aggregated IFs, etc.), we fitted a robust linear regression model (X ∼ m × year + C). The slope, m, was used to characterize the nature of the trend. For example, in inspecting scaffold popularity, a steep increasing slope could indicate increasing importance, whereas a decreasing trend could indicate a scaffold falling into disfavor. A flat trend is also interesting, possibly indicating an ongoing interest in a scaffold. Trend analyses were performed in R 3.3.2 (https://www.r-project.org/), using the rlm function from the “MASS package”. Statistical significance of the trends was assessed by conducting a Wald test on the fitted model and extracting the p-value. Regression fits with p-values