The rise and fall of a scaffold: A trend analysis of ... - ACS Publications


Perspective

The rise and fall of a scaffold: A trend analysis of scaffolds in the medicinal chemistry literature Barbara Zdrazil, and Rajarshi Guha J. Med. Chem., Just Accepted Manuscript • Publication Date (Web): 13 Dec 2017 Downloaded from http://pubs.acs.org on December 13, 2017


Journal of Medicinal Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Medicinal Chemistry

The rise and fall of a scaffold: A trend analysis of scaffolds in the medicinal chemistry literature

Barbara Zdrazil*,†,# and Rajarshi Guha*,‡,#

† University of Vienna, Department of Pharmaceutical Chemistry, Pharmacoinformatics Research Group, Althanstraße 14, A-1090 Vienna, Austria.

‡ Division of Preclinical Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), 9800 Medical Center Drive, Rockville, MD 20850, USA.

KEYWORDS Medicinal chemistry literature, Trend analysis, Scaffold analysis, Bemis-Murcko scaffold, Scaffold series, Journal impact factor, Open Data, ChEMBL database, Pan-Assay INterference compoundS

ABSTRACT Scaffolds are a core concept in medicinal chemistry and they can be the focus of multiple independent development efforts, over an extended period. Thus, scaffold associated properties can vary over time, possibly showing consistently increasing or decreasing trends. We posit that such trends characterize the attention that the community pays to a scaffold. In this study, we


employed data from ChEMBL20 to examine the evolution of scaffold features, such as enumerated compounds, biological activity and liabilities, over 17 years. Our analysis highlights that certain properties such as enumerated compounds, but not liabilities, show statistically significant increasing trends for some scaffolds. We also attempt to explain why a scaffold receives more attention over time, and highlight that obvious aspects such as synthetic feasibility do not explicitly drive attention. In summary, trend analyses of scaffold properties could support scaffold selection and prioritization in small molecule development projects.

INTRODUCTION

A chemical scaffold, defined as the structural core of a chemical compound,1 is the focus of many efforts in medicinal chemistry. The concept of a scaffold is used in many scenarios, ranging from the design of screening libraries2 to the characterization of structure-activity relationships3,4 and the rationalization of target-specific activities,5 leading to the notion of privileged scaffolds (molecular frameworks that are capable of serving as ligands for multiple targets and/or occur more often in drug-like molecule libraries).6–10 There is a rich literature on the use of scaffold-based approaches for retrospective analysis of compound collections or the medicinal chemistry literature.5,11–17 Importantly, most of these studies have considered scaffold collections at a fixed point in time. Exceptions include a review of temporal profiles for azanaphthalene scaffolds in drugs and natural products by Polanski et al.,18 and a recent analysis of the growth of bioactive compounds and scaffolds over time by Bajorath and coworkers.19 In the latter study, the authors examined five major therapeutic targets employing a compound-scaffold-cyclic skeleton hierarchy.




The notion of temporal trends in scaffold properties is attractive in a medicinal chemistry setting, as well as from a project management point of view. Examining the change in properties of exemplified compounds over time could serve to highlight classes of scaffolds that the community is exploring. Conversely, identifying scaffolds with relevant bioactivity but relatively infrequent occurrence in the literature could help to identify gaps in the research space of a target or condition that might be worth exploring to generate novel or patentable leads. Such insight may not be available from a static analysis of the entire scaffold collection of, say, the ChEMBL database.20,21 The type of analysis we present here can assist a researcher in deciding whether to expend resources on exploring a certain structural class of compounds synthetically, experimentally and/or by means of in silico modeling, as it provides a temporal summary of past synthetic and testing efforts within the drug design community.

In this work, we perform a temporal analysis of scaffolds reported in the medicinal chemistry literature, as represented by ChEMBL, by examining the trends in scaffold-derived properties over time. ChEMBL is a manually curated, open-data repository of medicinal chemistry literature data, though it does not reflect the entirety of published bioactivities. Still, this valuable source is frequently updated to include recent data and to resolve data quality issues. It is unrivaled in these aspects, which makes it the input database of choice for this exemplary scaffold trend analysis. We initially consider target-independent properties to avoid the bias in the medicinal chemistry literature, which tends to be over-enriched in activity data against certain druggable target classes (e.g., kinase and GPCR targets).22 We were especially interested in the evolution of the interest (or attention) that a scaffold received over time, which we denote as ‘scaffold popularity’. Parameters for measuring scaffold popularity include the number of enumerated compounds, the



number of different bioassays its members are tested in, and the median journal impact factors (JIFs) of the journals in which compounds belonging to a scaffold series have been published. In addition to popularity, we examine trends in bioactivity and PAINS (Pan-Assay INterference compoundS) liabilities23 associated with scaffolds.

We emphasize that being identified as a ‘popular’ scaffold by one or more of these measures does not necessarily imply that this scaffold also deserves the interest or attention that it received over time. In other words, being popular does not imply that it is a ‘good’ or worthwhile scaffold for a drug discovery campaign. Examples are the aryl-aryl and aryl-heteroaryl scaffolds generated via Suzuki and other Pd(0)-catalysed coupling reactions.24 Because these reactions are facile and high-yielding, these types of scaffolds were easily generated and many are reported in the literature. However, compounds based on these scaffolds are not necessarily soluble and have a tendency to be non-drug-like (though ibrutinib is an exception). Thus, these popular scaffolds are not always ‘good’ for drug design purposes. Some other scaffolds that are highly ranked by one or several popularity measures have also been found to cause liabilities. To the best of our knowledge, this is the most extensive trend analysis on scaffolds of this kind, and the first to have quantified the notion of scaffold popularity in multiple dimensions.

RESULTS AND DISCUSSION

We focused on trends based on reported structures and activities and thus employed ChEMBL20 as a source of curated structure-activity data. A predefined search query initially retrieved 1.73 million data entries (considering human targets only). Further filtering for the bioactivity endpoints ‘Ki’ and ‘IC50’ and for data entries published between 1998 and 2014 led to 619 671




non-unique compounds (282 878 unique compounds). The structures were then decomposed using the Bemis-Murcko scheme,25 and data points were assigned to compound classes by grouping them by identical Bemis-Murcko scaffolds (103 772 scaffolds). We then considered a ‘relevant’ subset, defined as those scaffolds with at least 10 unique compounds, publications in at least 5 different years, and non-generic ring systems (6-ring systems or smaller with one or no heteroatom in the ring were removed). This yielded 764 distinct scaffolds (corresponding to 101 622 bioactivity data entries; 36 328 unique compounds), which served as the basis for all trend analyses reported here. A full list of all 101 622 data entries corresponding to the 764 distinct scaffolds is given in the Supplementary Materials section (Supplementary Data File 1). Based on the set of scaffolds and their members, we examined trends for three properties: scaffold popularity, bioactivity and target trends, and scaffold liabilities. All trend line slopes and their p-values are given in the Supplementary Materials (Supplementary Data File 2).

1. Scaffold Popularity

The relative importance of a scaffold to the research community, which we term ‘popularity’, was inspected by analyzing three different parameters over time: counts of unique compounds belonging to a scaffold (enumerated compounds), counts of total bioassays that members of a scaffold have been tested in, and the importance (expressed by the median journal impact factors for a given year) of the journals where scaffold members have been published. The use of three, potentially orthogonal, measures was motivated by a desire to detect potential interdependencies between these parameters and to see if the different measures identified different scaffolds as



being popular. In particular, this exercise helps to assess how useful the different measures are for deriving a global popularity measure.

1.1 Enumerated compounds

An obvious indication of the popularity of a scaffold is the extent of its synthetic enumeration. Besides the cumulative compound counts associated with scaffolds for the whole observation period, the increase or decrease in the number of new compounds per scaffold and year can also indicate the popularity of a certain scaffold. We further conjecture that the steepness of the trend corresponds to increased popularity.

First, we ordered the 764 scaffolds by their cumulative counts of enumerated compounds, labelling the scaffold with the highest number of total unique compounds as scaffold ID 1 (biphenyl). This nomenclature was retained for all other investigations to facilitate comparisons across different measures (tested bioassay, JIF, target, bioactivity, and liability trends). We emphasize that the numbering does not reflect a ranking for scaffold popularity. Scaffolds with the highest numbers of unique compounds (≥ 200) reported in ChEMBL20 from 1998 to 2014 are depicted in Figure 1. Approximately 75% of all scaffolds have member sets of 50 unique compounds or fewer, and 92% of scaffolds have fewer than 100 unique compounds. Just three scaffolds, namely biphenyl (scaffold ID 1), diphenyl ether (scaffold ID 2), and chalcone (scaffold ID 3), are represented by more than 500 unique compounds (see Figure 1). This is also reflected by the high numbers of bioactivity measures (ranging from 1241 to 2030) reported for these three scaffolds (see Figure 1).
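The ‘relevant’ subset filter and the ID assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout `(scaffold, compound_id, year)` and both function names are hypothetical, the scaffold key is assumed to be precomputed (for example as a Bemis-Murcko scaffold string), and the removal of generic ring systems is omitted.

```python
from collections import defaultdict


def relevant_scaffolds(records, min_compounds=10, min_years=5):
    """Return {scaffold: unique compound count} for scaffolds passing both cutoffs.

    `records` is an iterable of (scaffold, compound_id, year) tuples;
    this layout is an assumption made for illustration.
    """
    compounds = defaultdict(set)
    years = defaultdict(set)
    for scaffold, compound_id, year in records:
        compounds[scaffold].add(compound_id)
        years[scaffold].add(year)
    return {
        s: len(compounds[s])
        for s in compounds
        if len(compounds[s]) >= min_compounds and len(years[s]) >= min_years
    }


def assign_scaffold_ids(compound_counts):
    """Number scaffolds by descending unique-compound count (ID 1 = largest).

    Ties are broken alphabetically, which is an arbitrary choice of this sketch.
    """
    ranked = sorted(compound_counts, key=lambda s: (-compound_counts[s], s))
    return {scaffold: rank for rank, scaffold in enumerate(ranked, start=1)}
```

As in the text, the resulting IDs only encode the order of cumulative compound counts and should not be read as a popularity ranking.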




Figure 1: 20 scaffolds in ChEMBL20 with the highest cumulative counts of unique member compounds; scaffolds are ordered by descending counts of unique compounds; scaffold IDs, cumulative counts of unique compounds (lower left) and bioactivities (lower right) are indicated in the boxes.

Overall, the number of unique compounds correlates moderately (R² = 0.68) with the number of reported bioactivity values for a given scaffold (see Supplementary Figure S1). The latter depends not only on the number of bioassays reported for this scaffold class, but also on the number of targets against which a compound was tested.

To obtain an overview of the nature of compound diversity within the scaffold series occurring in our study, we computed, scaffold-wise, the median diversity scores for compounds in every publication year (based on CDK circular 1024-bit fingerprints), where the diversity score is calculated as 1 minus the median pairwise Tanimoto compound similarity (for all compounds associated with the scaffold). Since ChEMBL tends to contain compound series generated in academic settings, these series tend to be congeneric (i.e., structurally similar). Thus, scaffolds with a larger number of member compounds tend to be made up of a larger number of congeneric series and therefore tend to be more diverse overall than scaffolds with only a few enumerated compounds. Across all scaffolds and years, the median diversity score is 0.27 (corresponding to a pairwise Tanimoto similarity score of 0.73). (A detailed list of all scaffold-wise median similarities per year is given in Supplementary Data File 3.)

Next, we considered 372 scaffolds with at least one new compound in more than 5 different (not necessarily consecutive) years in order to investigate trends of new compounds per year. Just 39 of these exhibit statistically significant trends: 11 scaffolds exhibit a steep positive trend (slopes between 3.8 and 0.5), and 28 scaffolds are characterized by a moderately steep positive (22) or moderately steep negative trend (6) with slopes between 0.4 and -0.4. There are no steep




negative trends appearing in our analysis. A boxplot of the distribution of trend line slopes for new compounds is shown in Figure 2A.
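The diversity score defined earlier in this subsection (1 minus the median pairwise Tanimoto similarity) can be illustrated with a small sketch. It assumes that each fingerprint is represented as a Python set of on-bit indices, which is a convenience of this example rather than the CDK representation used in the study, and it covers a single set of compounds (for example, one scaffold in one publication year). The function names are hypothetical.

```python
from itertools import combinations
from statistics import median


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    # Two empty fingerprints are treated as identical (a choice of this sketch).
    return len(fp_a & fp_b) / union if union else 1.0


def diversity_score(fingerprints):
    """Return 1 minus the median pairwise Tanimoto similarity."""
    sims = [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]
    if not sims:
        raise ValueError("at least two fingerprints are required")
    return 1.0 - median(sims)
```

Under this definition, the reported median diversity score of 0.27 corresponds to a median pairwise similarity of 0.73, as stated in the text.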

Figure 2: Boxplots showing the distribution of trend line slopes for new compounds (A), assays (B), and targets (C).

A closer inspection of the eleven scaffolds with steep positive slopes (between 3.8 and 0.5) reveals that most of them possess high total compound counts ranging from 505 to 162 compounds (ordered by descending slopes): scaffold ID 3 (chalcones), 5 (erlotinib and derivatives), 8 (benzanilide derivatives), 12 (1,4-naphthoquinone derivatives), 11 (trans-stilbene derivatives), 16 (coumarin derivatives), 25 (pentacyclic triterpenoids), 27 (benzimidazole), 15 (N-benzylindoles), and 21 (diphenylurea). An exception is scaffold ID 107



(6-(1,3-dioxobenzo[de]isoquinolin); scriptaid and derivatives) with just 75 compounds over the whole observation period, but also a borderline steep slope. Our analysis suggests that scaffolds with steep positive compound trend lines are quite likely also among those with high total compound counts, though the converse is not true. Most of the scaffolds which exhibit high total compound counts (see Figure 1) do not exhibit statistically significant positive compound trends. Thus, scaffolds that rank high on both measures, a high compound count and a steep compound trend, deserve special attention (scaffold IDs 3, 5, 8, 11, 12, 16; trend line slopes depicted in Figure 3A and Supplementary Figure S2).
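The slope-based grouping of compound trends used in this subsection can be sketched as follows. The specific robust regression method is not named in this section, so the sketch substitutes the Theil-Sen estimator (median of pairwise slopes) as one common robust choice; this is an assumption, not the authors' stated method. The 0.5 cutoff between ‘steep’ and ‘moderately steep’ is read off the slope ranges reported above, and significance testing (p-values) is left out. Function names are hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(years, counts):
    """Median of all pairwise slopes; robust to single outlier years."""
    points = list(zip(years, counts))
    slopes = [
        (c2 - c1) / (y2 - y1)
        for (y1, c1), (y2, c2) in combinations(points, 2)
        if y2 != y1
    ]
    if not slopes:
        raise ValueError("at least two distinct years are required")
    return median(slopes)


def classify_trend(slope, steep=0.5):
    """Label a slope; the `steep` cutoff is an assumption of this sketch."""
    if slope >= steep:
        return "steep positive"
    if slope <= -steep:
        return "steep negative"
    if slope > 0:
        return "moderate positive"
    if slope < 0:
        return "moderate negative"
    return "flat"
```

A single peak year, such as one publication reporting many compounds at once, shifts the median of pairwise slopes far less than it would shift an ordinary least-squares fit, which is the motivation for a robust estimator in this setting.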

Figure 3A (left): Counts of new compounds per year for scaffold ID 3 (chalcone); Figure 3B (right): Counts of assays per year for scaffold ID 489 (daunorubicin derivatives).

Scaffold ID 3 (chalcone; 505 unique compounds) possesses, by far, the steepest trend line of new compounds per year (slope: 3.8), with a continuous increase in newly published compounds per year and peaks in single years (see Figure 3A): starting with 43 unique compounds in 1998, additional unique compounds were published annually from 2001 onwards, showing a peak




of 112 new compounds in 2009. A closer inspection of the papers published in 2009 revealed that although there are 14 publications in total for that year, only seven of these report more than a single new compound, and the bulk of compounds (ca. 70%) was reported in only two of these publications.26,27 However, all 14 publications were authored by different research groups around the globe, reflecting a general increase in interest rather than just interest within a single research group. We also note that the trend line plot (Figure 3A) shows a steep drop in 2014. However, evaluating the significance of this drop is difficult because ChEMBL20 data were only curated until July/August 2014 (personal communication with A. Gaulton, EMBL-EBI). Since this limitation applies to all trend analyses here, the reduction in the steepness of the overall trend will on average be similar for all scaffolds. Clearly, there has been ongoing interest in the design of new chalcones over a period of 17 years. Interestingly, they have also been mentioned in the context of PAINS,23 as they are chemically quite reactive (see also section ‘Target-wise bioactivity trends - specific examples for popular scaffolds’).

1.2 Bioassays

Over time, the importance or popularity of a compound class (as characterized by a scaffold) may also be represented by its testing in an increasing number of bioassays. It should be noted that unique ChEMBL assay IDs are allocated to every distinct assay (i.e., assay protocol and assay conditions) for a set of compounds reported in a paper. If the same assay, run under a different condition, was reported in the same or a different paper, even for the same set of compounds, it will receive a new assay ID. Thus, an analysis based on comparing distinct assay IDs reported for a scaffold does not truly reflect the number of distinct bioassays.



However, as there are no other means to compare assays on a larger scale, unless manual data curation is performed for a selected subset of compounds or targets,28,29 we conducted the analysis using the bioassay IDs as provided by ChEMBL.

The number of bioassays a scaffold has been tested in is a measure of the effort of the authors of a paper to test their compound series in a number of different pharmacological experiments or targets, due to a perceived importance of the compounds. Thus, we investigated the relationship of bioassay counts to counts of bioactivities, unique compounds and targets. The cumulative counts of total bioassays for a certain scaffold are moderately correlated with reported bioactivities for the 764 scaffolds (R² = 0.7; Supplementary Figure S3). However, poor or no correlation was observed with cumulative counts of unique compounds (R² = 0.31) or cumulative target counts (R² = 0.54) (Supplementary Figure S3). The lack of correlation between cumulative bioassay and target counts might be a consequence of the degeneracy of bioassay identifiers, whereas targets are always treated as unique entities (and therefore can easily be mapped across different measurements and assays).

Sorting scaffolds by their cumulative (static) bioassay counts, it becomes obvious that the scaffolds listed on top (Figure 4) exhibit only a limited overlap with those with high cumulative compound counts (see Figure 1). Strikingly, the scaffold with the highest number of bioassays over the whole observation period (1998 - 2014) is a non-generic 5-ring system (scaffold ID 489). This typical daunorubicin-like scaffold has been measured in 1754 bioassays and on 163 different human targets, although it comprises just 21 unique member compounds (considering Ki and IC50 measurements only; Figure 4). This is obviously a very popular scaffold in the medicinal chemistry literature which would not have been highlighted based on mere compound counts. A similar example is the paclitaxel-like scaffold ID 87 (Figure 4), with a very




complex chemical structure, a lower number of unique representatives (82) but high counts of unique assays (837) and targets (133). However, in both cases, we note that a single compound is responsible for these high assay counts. In the case of scaffold ID 489, doxorubicin has been tested in 1545 assays, and in the case of scaffold ID 87, paclitaxel has been tested in 835 assays (compared to 1754 and 837 assays in total for these scaffolds, respectively). These results are not surprising given that these compounds are commonly tested in oncology projects. Importantly, these assays correspond to 515 and 289 unique publications for scaffold IDs 489 and 87, respectively. This indicates that the counts are not driven by a few publications describing large screening projects.
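The degree to which a single compound accounts for a scaffold's assay count can be made explicit with a short calculation. The sketch below uses only the counts quoted above; the function name and the dictionary layout are hypothetical. Because several compounds can share an assay ID, per-compound counts need not sum to the scaffold total, so the share is computed against the reported number of distinct assays for the scaffold.

```python
def dominant_compound_share(assays_by_compound, total_scaffold_assays):
    """Return (compound, share) for the compound tested in the most assays.

    `share` is that compound's assay count divided by the number of
    distinct assays reported for the whole scaffold.
    """
    compound = max(assays_by_compound, key=assays_by_compound.get)
    return compound, assays_by_compound[compound] / total_scaffold_assays


# Counts as quoted in the text (Ki and IC50 endpoints only).
reported = {
    "scaffold ID 489": ({"doxorubicin": 1545}, 1754),
    "scaffold ID 87": ({"paclitaxel": 835}, 837),
}

for scaffold, (per_compound, total) in reported.items():
    name, share = dominant_compound_share(per_compound, total)
    print(f"{scaffold}: {name} appears in {share:.1%} of the scaffold's assays")
```

With the quoted numbers, doxorubicin appears in roughly 88% of the assays of scaffold ID 489 and paclitaxel in about 99.8% of those of scaffold ID 87.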



Figure 4: 20 scaffolds in ChEMBL20 with the highest cumulative counts of bioassays; scaffolds are ordered by descending counts of assays; scaffold IDs (upper left), cumulative counts of bioassays (upper right), cumulative counts of unique compounds (lower left) and cumulative target counts (lower right) are indicated in the boxes.




To inspect the temporal trends in bioassay counts over the observation period (1998 – 2014), we considered those scaffolds with tested assays in more than 5 different years (563 scaffolds), of which 102 exhibit statistically significant trends of tested assays per year. Of those, 39 scaffolds possess robust regression assay trend lines with steep positive slopes (between 14.7 and 0.7), while 63 exhibit moderately steep slopes between 0.7 and -0.3 (59 positive; 4 negative). Steep negative trends were not observed in this dataset. A boxplot showing the distribution of assay trend line slopes is available as Figure 2B.

Many scaffolds with steep positive tested assay trend lines also possess high cumulative assay counts (ordered by descending slopes): e.g. scaffold IDs 489 (daunorubicin derivatives), 87 (paclitaxel derivatives), 198 (1,3,4-thiadiazole derivatives), 38 (cis-stilbene derivatives), 59 (camptothecin derivatives), 251 (uracil derivatives), 6 (flavones), and 88 (benzanilide derivatives). Assay trend lines for the top five are shown in Figure 3B (daunorubicin) and Supplementary Figure S4 (scaffold IDs 87, 198, 38, and 59). The daunorubicin-like scaffold (scaffold ID 489) exhibits the steepest positive trend line (slope: 14.7): starting with 19 bioassays in 1998, an almost steady increase in bioassay counts can be observed, with a peak of 261 bioassays reported in the literature in 2013 (Figure 3B).

Interestingly, scaffold IDs 171 (curcumin and analogues) and 520 (sorafenib and analogues), with assay counts below 200, are also among the top ten steepest trends. We would not have identified these two scaffolds as popular if we were constrained to using static assay counts. As is clear from their trend lines (Supplementary Figure S5), publications did not appear until 2004/2005, which explains their lower cumulative assay counts over the whole observation period (1998 – 2014). However, the steep trends of their assay counts over the last ten years highlight their popularity within the medicinal chemistry community. Importantly, scaffold ID 171 (with 57



unique compounds, 391 bioactivities, 88 human targets, and 191 assays in total) and scaffold ID 520 (with 19 unique compounds, 224 bioactivities, 63 human targets, and 198 assays in total) owe their high occurrences in the medicinal chemistry literature to just one compound each: curcumin (belonging to scaffold ID 171) and sorafenib (belonging to scaffold ID 520) have been reported in 187 and 196 assays, respectively (with Ki or IC50 endpoints). Comparing the assay trend lines for curcumin alone with that of scaffold series 171 without curcumin (Supplementary Figure S6), as well as sorafenib alone with scaffold series 520 without sorafenib (Supplementary Figure S7), it becomes clear that interest in these scaffolds was mainly driven by single compounds (curcumin and sorafenib, respectively). This is, however, less pronounced for curcumin/scaffold ID 171 than for sorafenib/scaffold ID 520.

In order to automatically flag scaffolds where one or a few compounds (assigned to this scaffold) might be driving the assay trends, we identified scaffolds where at least one compound has measurements in more than fifty percent of the years in which the scaffold was tested. A detailed list of these compounds, assigned scaffolds, and the number of years with measurements is included in Supplementary Data File 4. This analysis flagged 375 (49% of 764) scaffolds as scaffolds whose assay trends are potentially driven by only one or a few compounds. However, 389 scaffolds (51%) were not influenced by such singleton outliers. While our use of robust regression downplays the effects of such outliers, such a filter process is still useful to flag scaffolds for which this behavior is present.

Concluding this section on assay trends, we observed that the temporal analysis highlights certain scaffolds that appeared later in the literature and therefore cannot be detected by cumulative assay counts. Scaffold rankings for assay trends differ significantly from rankings for compound counts. Therefore, they offer an orthogonal measure to assess a scaffold’s popularity.
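The flagging rule described above can be sketched as follows. The sketch follows one reading of the criterion, namely that a scaffold is flagged when any single member compound has measurements in more than half of the years in which the scaffold as a whole was tested; that reading, the record layout `(scaffold, compound, year)`, and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def flag_single_compound_scaffolds(records, threshold=0.5):
    """Return the set of scaffolds with at least one dominating compound.

    `records` is an iterable of (scaffold, compound, year) tuples, one per
    measurement; this layout is hypothetical.
    """
    scaffold_years = defaultdict(set)
    compound_years = defaultdict(set)
    for scaffold, compound, year in records:
        scaffold_years[scaffold].add(year)
        compound_years[(scaffold, compound)].add(year)

    flagged = set()
    for (scaffold, _compound), years in compound_years.items():
        # Fraction of the scaffold's tested years covered by this compound.
        if len(years) / len(scaffold_years[scaffold]) > threshold:
            flagged.add(scaffold)
    return flagged
```

A scaffold tested in only one or two years would be flagged trivially under this rule; in the dataset described here that case is excluded by the requirement of publications in at least 5 different years.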




1.3 Journal impact factors

The third measure of popularity that we inspected is the importance of the journal that a scaffold or its members were reported in. There are many measures of a journal’s importance, and much debate on their correctness.30 We selected the Thomson Reuters Journal Impact Factor (JIF) for analysis. To investigate a potential relationship with other popularity measures, we inspected the median (per scaffold and year) JIF slopes of some example scaffolds that were ranked at the top by the compound (scaffold IDs 3 and 5) and assay (scaffold IDs 87 and 489) popularity trend analyses above. In general, a slight upward trend for median JIFs (slopes between 0.07 and 0.12) could be observed for these scaffolds (see Supplementary Figure S8).

It is necessary to evaluate whether a steady increase actually corresponds to a series of compounds receiving more and more attention in the medicinal chemistry literature over time, or whether the slope simply reflects the natural increase of the JIFs of certain journals over time. This can be done by analyzing whether scaffold series are mainly published in recurrent or diverse journals over the years. We investigated this by identifying the unique journals that scaffold IDs 1-20 were reported in (a static view). For each scaffold and journal, we counted the number of distinct publication years over the 17-year period. This is summarized in Figure 5 and suggests that compounds belonging to a certain scaffold tend to be reported in a small set of recurrent journals: J. Med. Chem., Bioorg. Med. Chem. Lett., Bioorg. Med. Chem., Eur. J. Med. Chem., and J. Nat. Prod. This is a small fraction of the total number of journals (49) that are referenced in ChEMBL during the study period. One reason for this bias is the limited number of available medicinal chemistry journals (in comparison to the broader availability of, e.g., pharmacology-related or medical journals). These frequently recurring medicinal chemistry journals also exhibit a slight increase in their JIF over the years (see Supplementary Figure S9). It is likely that, overall, the steady increase in JIF trends that is observed for selected scaffolds is a consequence of a natural increase in JIFs over time.

Figure 5: Heatmap displaying the number of publication years (‘Count’) of scaffold IDs 1-20 in different medicinal chemistry journals.

Thus, scaffold ranking according to the steepness of the slopes does not appear to be very informative in the case of median JIFs. It is likely that higher impact factor papers appear earlier in the literature, when the first occurrence or study of a new compound or compound series is reported. Subsequent publications in lower impact factor journals would in this case lead to a negative JIF slope. Conversely, a later high impact factor paper could lead to a favorable positive slope if the




earlier papers showed lower JIFs, and a series of high impact factor papers with the same scaffold would lead to a flat JIF slope. For evaluating scaffold popularity, the point in time at which one or more higher impact factor papers appeared should obviously not play a role. Therefore, we examined JIFs by computing the sum of all median (across a given year and scaffold) annual JIFs from 1998 – 2014 and denote this as sJIF. This measure accounts not only for the intermittent occurrence of high JIFs in certain years, but also for the number of years in which JIFs occur.

The scaffold series ranked on top, based on sJIF, is the one containing sunitinib and analogs (scaffold ID 44). While overall its trend line slope is rather flat (-0.1), it shows peaks of median JIFs in 2006 and 2008, which correspond to publications in Nat. Chem. Biol. (see Supplementary Figure S10). Another scaffold among the top ten with high sJIF which was identified as popular beforehand is scaffold ID 5 (erlotinib and analogs). It shows no distinct peaks of median JIFs but an ongoing solid publication record (see Supplementary Figure S10).

In summary, insights from JIF trends are hard to evaluate and require a thorough case-by-case inspection. However, in combination with compound counts and counts of bioassays, the sJIF might provide an additional measure of scaffold popularity.

2. Target and Bioactivity Trends

Another view of scaffold importance is given by the aggregated biological activities of a scaffold series’ member compounds. We first examined static target composition and temporal target trends alone, without considering bioactivity. Second, we examined target-wise bioactivity trends, showcasing the most relevant scaffolds from the perspective of their biological responses.
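The sJIF measure introduced in Section 1.3 (the sum of the annual median JIFs of one scaffold) can be sketched as follows. The input layout, a list of `(year, jif)` pairs with one pair per publication reporting the scaffold, and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import median


def sjif(publications):
    """Sum of per-year median JIFs for one scaffold.

    `publications` is an iterable of (year, jif) pairs. Years without any
    publication contribute nothing to the sum.
    """
    by_year = defaultdict(list)
    for year, jif in publications:
        by_year[year].append(jif)
    return sum(median(values) for values in by_year.values())
```

Because every year with at least one publication adds its median JIF, the sum grows both with occasional high-JIF years and with the number of years in which the scaffold was published, independent of when those years occurred.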



2.1 Target composition and target trends

First, for a static overview, we counted the total number of unique targets for each scaffold in our dataset over the whole observation period (1998 – 2014). Keeping all ChEMBL target types available in our analysis (single protein, cell line, protein complex, protein family, protein complex group, etc.) and ranking by total unique target counts, all top ten scaffolds (possessing between 187 and 112 different targets) possess either high cumulative compound counts or high assay counts. Ordered by descending total target count, these are scaffold IDs 1 (biphenyl derivatives), 6 (flavones), 7 (naphthalene derivatives), 489 (daunorubicin derivatives), 4 (serotonin and analogs), 87 (paclitaxel and analogs), 17 (diphenylmethane), 18 (quinoline), 2 (diphenyl ether), and 11 (trans-stilbene). On closer inspection of the target type composition of the top twenty scaffolds with high total target counts, it is clear that in many cases more than half of the targets are cell lines rather than single protein targets (see Figure 6). Whereas most of the scaffolds with low scaffold IDs (high counts of unique member compounds) in Figure 6 were measured on more single protein targets, those with higher scaffold IDs (lower numbers of unique compounds) exhibit higher numbers of measurements on cell lines. An exception is, e.g., scaffold ID 3 (chalcone), with a high number of enumerated compounds (505) but a clear excess of cell lines versus single protein targets. The phenomenon could be explained in part by the rise in popularity of phenotypic screening in general31 as well as of cell line profiling studies (especially in oncology).32
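The static target counts described above amount to counting distinct targets per scaffold and splitting them by target type. The following is a minimal sketch of that bookkeeping, not the authors' workflow; the record layout, the function names, and the toy data are assumptions made purely for illustration.

```python
from collections import defaultdict


def unique_targets_by_type(records):
    """Count unique targets per scaffold, split by target type.

    records: iterable of (scaffold_id, target_id, target_type) tuples
    (a hypothetical layout). Returns {scaffold_id: {target_type: count}}.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for scaffold_id, target_id, target_type in records:
        seen[scaffold_id][target_type].add(target_id)
    return {
        scaffold: {ttype: len(ids) for ttype, ids in by_type.items()}
        for scaffold, by_type in seen.items()
    }


def cell_line_share(type_counts):
    """Fraction of a scaffold's unique targets that are cell lines."""
    total = sum(type_counts.values())
    return type_counts.get("cell line", 0) / total if total else 0.0


# Hypothetical toy records; a repeated target is counted only once.
records = [
    (3, "HeLa", "cell line"), (3, "K562", "cell line"), (3, "HeLa", "cell line"),
    (3, "EGFR", "single protein"),
    (8, "MMP-2", "single protein"), (8, "CA-II", "single protein"),
]
counts = unique_targets_by_type(records)
```

Restricting the records to one publication year before counting would give the per-year target counts on which the trend lines are based.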




Figure 6: Target type composition (single protein targets vs. cell line targets) for scaffolds with high cumulative target counts.

Next, temporal target trends were examined, first by inspecting all targets and second by considering single protein targets and cell line targets separately. Counting all unique targets per year and ranking the 563 scaffolds with target counts in at least six years by the steepness of their target trend line slopes, the top five, scaffold IDs 6 (flavones), 38 (cis-stilbene), 87 (paclitaxel and analogs), 251 (uracil derivatives), and 489 (daunorubicin derivatives), also belong to the top twenty scaffolds with the highest cumulative target counts (Figure 6). The distribution of trend line slopes considering all target types is shown in Figure 2C.

The situation changes when only single protein targets are considered for analyzing the steepness of the target trends (424 scaffolds with single protein target counts in at least six years): apart from scaffold IDs 2 (diphenyl ether derivatives), 6, and 10 (benzyl phenyl ether derivatives), which are among the scaffolds with the highest cumulative compound and target counts (see Figures 1 and 6), scaffold IDs 27 (benzimidazole) and 198 (1,3,4-thiadiazole derivatives) are also found among the top five with the steepest significant positive slopes.

We next considered temporal trends for the case where cell lines are identified as the assay target. Considering exclusively the target type ‘cell line’ and counting the number of those targets per year, the top five scaffolds with the steepest statistically significant positive trend lines are also among the top twenty with the highest cumulative bioassay counts and/or steepest bioassay trend lines: scaffold IDs 38 (cis-stilbene derivatives), 87 (paclitaxel and analogs), 88 (colchicine and analogs), 251 (uracil and derivatives), and 489 (daunorubicin and analogs). A closer look at the respective assay descriptions reveals that the target type ‘cell line’ usually indicates a cytotoxicity assay: assay descriptions usually contain the terms ‘cytotoxicity/cytotoxic’, ‘antiproliferative’, or ‘growth inhibition’. Therefore, a temporal shift from single protein targets towards cell line targets (or vice versa) would indicate a shift in the focus of interest in these scaffold series (such as from a target-specific focus to a search for broad-spectrum activity).

2.2 Target-wise bioactivity trends - specific examples for popular scaffolds

Considering target and bioactivity trends together, bioactivity data can be aggregated by target, allowing us to examine target profiles of scaffolds over time and show the distribution of actives versus inactives. In this section, we focus on some ‘popular’ scaffolds from previous sections. In addition, scaffolds with a high number of measurements on single protein targets are compared to other representatives with a high number of cell lines as targets. For visualization purposes, only single protein and cell line targets with at least 5 unique compounds (within each scaffold series) were considered.

Scaffold ID 8 (benzanilide derivatives) was selected as a representative scaffold with a high fraction of single protein targets (see Figure 6). It possesses a high cumulative compound count as well as a steep positive trend line of new compounds per year. Scaffold ID 3 (chalcones) was prioritized as a representative of scaffolds with high numbers of cell line targets (see Figure 6). It shows a steep positive trend line slope for cell line targets (among the top ten) and is among the scaffolds with the steepest trend lines of new targets per year. Moreover, scaffold ID 3 is among the top three scaffolds with the highest cumulative compound counts (> 500) and shows the steepest increase of new compounds per year.

For scaffold IDs 8 and 3, heat maps reflecting their polypharmacological target profiles over time were generated such that each cell represents the median activity label from all compounds with measurements for a certain target (see Figure 7 and Figure 8). Thus, they provide a dynamic representation of the community’s testing efforts as well as of the biological responses caused by members of certain scaffold series. For this investigation, targets were grouped by taxonomic relationships (target families). Drug annotations were included, showing the appearance of approved drugs in certain years. For scaffold ID 8, this has been visualized in a separate heat map (Supplementary Figure S11).
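The heat map cells described above can be illustrated with a short sketch. It assumes one possible reading of ‘median activity label’: the median activity (in nM) of all measurements for a target in a given year is compared against the 1 µM cutoff used in Figures 7 and 8. The record layout and function names are hypothetical, and taking the median of per-measurement binary labels instead would be an equally plausible reading.

```python
from collections import defaultdict
from statistics import median

ACTIVE_THRESHOLD_NM = 1000.0  # 1 µM, the cutoff given in the figure captions


def heatmap_cells(measurements):
    """Map (target, year) to 'active' or 'inactive'.

    measurements: iterable of (target, year, value_nm) tuples (assumed layout).
    A cell is 'active' when the median value is at or below the threshold.
    """
    grouped = defaultdict(list)
    for target, year, value_nm in measurements:
        grouped[(target, year)].append(value_nm)
    return {
        key: "active" if median(values) <= ACTIVE_THRESHOLD_NM else "inactive"
        for key, values in grouped.items()
    }


def cell_label(cells, target, year):
    """Look up one cell; combinations without data are 'no measurement'."""
    return cells.get((target, year), "no measurement")


# Hypothetical toy measurements for illustration only.
measurements = [
    ("EGFR", 2005, 200.0), ("EGFR", 2005, 5000.0), ("EGFR", 2005, 800.0),
    ("HeLa", 2006, 12000.0),
]
cells = heatmap_cells(measurements)
```

In this toy input the EGFR cell for 2005 is labeled active because the median of the three values (800 nM) lies below the cutoff, even though one measurement is clearly inactive.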



Figure 7: Trend of the polypharmacological profile of single protein and cell line targets for scaffold series 8 (benzanilide derivatives) in binary heat map representation. Abscissa: year; ordinate: target names; blue field: active (≤ 1 µM); black field: inactive (> 1 µM); grey field: no measurements.




Figure 8: Trend of the polypharmacological profile of single protein and cell line targets for scaffold series 3 (chalcones) in binary heat map representation. Abscissa: year; ordinate: target names; blue field: active (≤ 1 µM); black field: inactive (> 1 µM); grey field: no measurements.



These analyses allow us to address a number of questions related to polypharmacology. For example, which targets within a scaffold series have been tested more frequently than others? Over time, can we observe a shift from cell line to single protein targets or vice versa? Can selectivity profiles or tendencies be deduced for whole scaffold series? Answers to these questions reveal the pharmacological testing behavior within the community over the years. The heat maps highlight the extent to which the community was interested in a certain target or target family over the years and whether bioactivity (rather than inactivity) was observed for those targets. This exercise also highlights the extent of hypothesis-driven research, based on testing around well-known (or well-studied) targets33 or on testing for cytotoxicity across all sorts of cell lines, often without taking the exact mechanism of action or underlying regulatory pathway into account.

Early testing of benzanilide compounds (scaffold ID 8) on single protein targets shows activity on matrix metalloproteinases and carbonic anhydrases (Figure 7). Later (in 2006), positive cytotoxicity measurements led to testing on cell line targets. As seen from the heat map of approved drugs contained in this scaffold series (Supplementary Figure S11), the only marketed drug, niclosamide, an anthelmintic agent used therapeutically against tapeworm infections, was tested almost exclusively against cell line targets. As the mechanism of action in this case is uncoupling of oxidative phosphorylation or stimulation of ATPase activity in the worms, multiple protein targets are involved in this therapeutic effect (information from www.drugbank.ca).

Another interesting case is the chalcone series (scaffold ID 3; Figure 8), with much more data on cell lines than on single protein targets over the years.
Starting with positive cytotoxicity measurements on HeLa and K562 cells in 1998, the pattern of some unsystematic cytotoxicity measurements on other cell lines and on some typical cancer-related protein targets (e.g. EGFR, ErbB2, TNF) shows that research was apparently driven by the hypothesis that chalcones might represent potential anti-cancer agents.34 Over the years, diverse other potential therapeutic effects of chalcones have been postulated (such as analgesic and anti-inflammatory effects,35,36 cardiovascular effects,37 anti-angiogenic and hypolipidemic effects,38 and immunomodulation39), given their wide reactivity with many pharmacologically relevant targets. It is now clear that the effects on these targets are largely due to unspecific reactivity of the alpha,beta-unsaturated ketone.23,40 This reactivity is, however, not very strong, so that a certain degree of selectivity for some targets might be observed, as is also apparent from the heat map in Figure 8. Thus, this group of compounds does not at first sight attract attention as potential PAINS substructures, exhibiting promiscuous cellular responses rather than real activity. As demonstrated by our polypharmacological trend heat map, the medicinal chemistry community’s interest in this special group of chemicals did not drop over the last 17 years, although no drug based on this scaffold has made it to the market. The reasons for this behavior within the community might be manifold. It is tempting to speculate that a lot of effort was driven by the high synthetic feasibility of chalcone derivatives.41 In addition, researchers in the field might not have been aware of the unspecific responses of chalcones due to a lack of systematic literature surveys.

3. Trends in Scaffold Liabilities

Out of 764 unique scaffolds, only 33 contain PAINS liabilities (and each of the 33 contains a single PAINS substructure). The matching PAINS and the number of scaffolds containing them are summarized in Table 1. The table also reports the earliest occurrence of any scaffold containing the given PAINS substructure. The full list of 33 scaffolds, their years of first occurrence, and their associated PAINS matches are provided in Supplementary Data File 5. It is encouraging that only two of the scaffolds identified as popular by our analysis contain a substructure recognized as PAINS by our filter, namely scaffold ID 12 (naphthoquinone) and scaffold ID 489 (daunorubicin). In both cases, it is the quinone substructure that causes the alert, which is also the most frequent PAINS substructure in our dataset (see Table 1). We identified two scaffolds containing PAINS substructures from 1998, but since then PAINS-containing scaffolds have appeared in nearly every year (with 5 showing up for the first time in 2009). While 2009 does have the largest number of scaffolds with PAINS liabilities, there is no statistically significant trend in the number of PAINS-containing scaffolds per year.

Table 1. A summary of the PAINS substructures contained within the 764 unique scaffolds, and the number of scaffolds that contain a given PAINS substructure (‘Count’). The ‘Year’ column indicates the year in which this substructure was first observed in a scaffold.

PAINS                                 Year    Count
quinone_A(370)                        1998    8
imine_one_A(321)|quinone_D(2)         2002    4
indol_3yl_alk(461)                    1998    4
azo_A(324)                            2006    3
ene_one_ene_A(57)                     2011    3
ene_rhod_A(235)                       2007    3
amino_acridine_A(46)                  2009    1
anil_di_alk_A(478)                    2012    1
ene_five_het_L(4)                     2013    1
het_thio_666_A(13)                    2013    1
imine_one_A(321)                      2008    1
imine_one_isatin(189)                 2004    1
keto_phenone_A(11)                    2012    1
quinone_A(370)|anthranil_one_A(38)    2011    1
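A summary of the kind shown in Table 1 can be derived from a list of per-scaffold PAINS matches by grouping on the PAINS label. The sketch below shows only this tallying step; the input layout, the function name, and the scaffold IDs and years in the toy data are hypothetical, and the PAINS matching itself is assumed to have been done beforehand.

```python
def summarize_pains(matches):
    """Tally PAINS labels across scaffolds.

    matches: iterable of (scaffold_id, pains_label, first_year) tuples,
    one per PAINS-containing scaffold (assumed layout).
    Returns rows of (pains_label, earliest_year, scaffold_count), sorted by
    descending count and then by label.
    """
    summary = {}
    for _scaffold_id, label, year in matches:
        first_year, count = summary.get(label, (year, 0))
        summary[label] = (min(first_year, year), count + 1)
    rows = [(label, year, count) for label, (year, count) in summary.items()]
    return sorted(rows, key=lambda row: (-row[2], row[0]))


# Hypothetical toy input, not taken from Supplementary Data File 5.
matches = [
    (9001, "quinone_A", 2003),
    (9002, "quinone_A", 1998),
    (9003, "azo_A", 2006),
]
table = summarize_pains(matches)
```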

We first examined whether the size of scaffold member sets differed, over time, between scaffolds with PAINS liabilities and those without. Such a difference could suggest differences in how regions of chemical space (using the scaffold as a proxy) are explored, based on the presence of liabilities within those regions. As before, for each scaffold we fitted a robust linear regression of the count of unique compounds as a function of year and used the slope of the year variable as an indicator of an increasing, decreasing or static trend in count versus year. Figure 9A plots the distribution of slopes for the ‘PAINful’ (i.e., scaffolds containing a PAINS substructure) versus ‘PAINless’ (i.e., scaffolds without a PAINS substructure) scaffolds, and suggests that there is a small difference in the trends between these two classes, with PAINful scaffolds showing a slightly steeper trend on average than PAINless scaffolds. However, though this difference is statistically significant (Wilcoxon rank sum test, p = 0.009), the effect size is very small. Interestingly, of the 33 PAINful scaffolds, two (0.2%) exhibit slopes greater than 1, whereas 31 PAINless scaffolds (4%) exhibit slopes greater than 1. The two PAINful scaffolds with slopes greater than 1 (indicating an increase in the number of compounds associated with them over time) are scaffold IDs 224 (9-benzyl-1,2,3,4-tetrahydrocarbazole) and 12 (naphthoquinone). Even though both contain PAINS substructures, the biomedical literature shows that these scaffolds are being actively studied. Indeed, a search for ‘naphthoquinone’ on PubMed42 shows a near monotonic increase, from 1940 onwards, in publications referring to this term (Supplementary Figure S12). It is encouraging to see that our trend analysis is in line with our assumption that PAINless scaffolds will tend to be more extensively explored than PAINful scaffolds.

We also investigated the distribution of bioactivities over time for scaffolds with PAINS liabilities versus those without. First, we compared the overall distribution of bioactivities of scaffolds with and without PAINS liabilities. For this analysis, the Z-scored bioactivity for a scaffold was taken as the median of the Z-scored bioactivities of the individual compounds belonging to that scaffold. We did not aggregate replicate data or separate activity by target for individual compounds. Figure 9B suggests that there is no difference in the means of these distributions, and this is supported by a Wilcoxon rank sum test (p = 0.65).
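The per-scaffold trends used in this section are slopes from a robust regression of a yearly quantity on year. The exact robust estimator is not specified in this section, so the sketch below substitutes the Theil-Sen estimator (the median of all pairwise slopes) as one simple robust alternative; it is an illustration under that assumption, not a reproduction of the authors' fit, and the function name and toy data are hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(years, values):
    """Median of all pairwise slopes between (year, value) points.

    A single extreme year changes only a minority of the pairwise slopes,
    so the median stays close to the trend of the remaining years.
    """
    points = list(zip(years, values))
    slopes = [
        (v2 - v1) / (y2 - y1)
        for (y1, v1), (y2, v2) in combinations(points, 2)
        if y2 != y1
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct years")
    return median(slopes)


# Hypothetical compound counts per year with one outlying final year.
years = [2000, 2001, 2002, 2003, 2004, 2005]
compound_counts = [2, 4, 6, 8, 10, 100]
slope = theil_sen_slope(years, compound_counts)
```

For this toy series the estimate is 2 new compounds per year, the trend of the first five years, whereas an ordinary least-squares fit would be pulled strongly upward by the last point.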
Next, we stratified the activity data by year and performed the same analysis, which indicated a single year (2007) in which there was a statistically significant difference (Wilcoxon rank sum test, p = 0.04) in the scaffold bioactivity distributions between the PAINful and PAINless scaffolds. For every other year between 1998 and 2014 there was no difference. For 2007, we identified 24 PAINful and 416 PAINless scaffolds. The median value of the scaffold activities (Z-scored) was -0.18 for the PAINful class and 0 for the PAINless class, indicating a slightly greater activity, on average, for compounds containing PAINless scaffolds.

Lastly, we examined the bioactivity trends to see if there was a difference between the two classes of scaffolds. As with the compound count analysis described above, we fitted a robust regression model to the median scaffold Z-scored bioactivity (for a given scaffold) versus year and compared the distributions of slopes for the PAINful and PAINless scaffolds. We observed no difference between the slopes for the PAINful and PAINless scaffolds (Wilcoxon rank sum test, p = 0.40). As we showed previously, most scaffolds appear to exhibit static bioactivity trends over time, with a small fraction showing an appreciable increase or decrease. In Figure 9C we note that the median slope for both classes of scaffolds is essentially zero, though the PAINless class has some outliers (defined as scaffolds with slopes greater than 0.25 or less than -0.25), of which there are 25 (3.4%).

Given the preponderance of the quinone substructure within the PAINS-containing scaffolds, we examined the 13 quinone-containing scaffolds in more detail. The distributions of enumerated compounds, bioactivities and slopes for these scaffolds were similar to the overall PAINS trends described above. These scaffolds covered 741 unique compounds, tested in 2,352 unique assays. 408 compounds were tested in multiple assays, with the five most frequently (> 50 times) tested compounds being doxorubicin, adriamycin, mitoxantrone, daunorubicin and geldanamycin.
Since many assays run as part of oncology projects will include well-known chemotherapeutics as controls, and given the prevalence of PAINS in approved drugs (~ 6%),43 it is not surprising to see these compounds in the PAINful category. On the other hand, 333 compounds containing the quinone substructure were tested a single time. Using the ChEMBL target type annotation, we identified assays whose targets were annotated with CELL_LINE or ORGANISM as being over-enriched in a statistically significant manner (Fisher’s exact test, p = 0 and 1.36 × 10^-6, respectively). This is in contrast to the set of scaffolds not containing PAINS, for which assays with protein targets (SINGLE PROTEIN, PROTEIN FAMILY and PROTEIN COMPLEX) were statistically significantly over-enriched (Fisher’s exact test, p < 0.015). A priori it is not obvious why PAINS-containing scaffolds should be preferentially tested in cell-based rather than biochemical assays. However, recent work by Weerapana et al.44 and Jöst et al.45 suggests that the in vivo behavior of reactive groups may not be as significant as believed, which could indicate that cell-based assays are more robust to such compounds than biochemical assays. Although it is generally useful to avoid PAINS and reactive groups, the fact that approved drugs contain such moieties, coupled with the in vivo behavior of reactive compounds, indicates that, while reactive groups should make one wary, hard rules on their exclusion may not be beneficial. Furthermore, while scaffolds may contain PAINS, the presence of a privileged substructure within the scaffold may override the liability of the PAINS moiety.46 Nonetheless, our analysis highlights the fact that the medicinal chemistry literature tends to avoid scaffolds containing PAINS, and even for the PAINS-containing scaffolds, there is little difference in key property distributions.
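The target-type enrichment reported above rests on Fisher's exact test applied to 2×2 contingency tables (for example, quinone versus other scaffolds against cell-line versus other assays). As an illustration, a one-sided version of the test can be written directly from the hypergeometric distribution. Whether the authors used a one-sided or two-sided test is not stated, and the function name and the counts in the usage line are hypothetical.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher exact p-value for over-enrichment.

    Contingency table (all counts are non-negative integers):
                       in category   not in category
        group of interest    a              b
        all other            c              d
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # size of the group of interest
    col1 = a + c          # total number of items in the category
    n = a + b + c + d     # grand total
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)


# Hypothetical counts: 3 of 4 assays in the group fall in the category,
# compared with 1 of 4 assays outside the group.
p_value = fisher_enrichment_p(3, 1, 1, 3)
```

With these toy counts the p-value is 17/70 (about 0.24), so such a small table would not support a claim of enrichment.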




Figure 9: Distribution of trends for scaffolds containing PAINS substructures (PAINful) and those without (PAINless). In both cases, the trend corresponds to the slope of a robust regression model fitting a feature of a scaffold to year. A. Trend computed for the number of unique compounds associated with a scaffold per year. B. Median (Z-scored) bioactivity for compounds associated with a scaffold (over all years). C. Trend computed for the median Z-scored bioactivities associated with a scaffold per year.

4. Toward a consensus trend score

Aiming to unify the observations obtained from different viewpoints on scaffold trends (numbers of compounds, assays, targets, and liabilities), we created a consensus trend score by applying the rank product (RP) method.47 Liability trends contributed inversely to the score. According to this score, the ten highest-ranked scaffolds are (ordered by descending consensus score): scaffold IDs 3, 87, 489, 5, 38, 198, 171, 11, 59, and 12 (a full list of all ranked scaffolds is given in Supplementary Data File 6). We note that the method does not weight the different contributions and that the ranking can therefore be adapted for each use case. Nonetheless, it presents a convenient way to obtain a global overview of popular scaffolds in large historical data collections.

5. Why is a scaffold popular?

The preceding discussion has focused on identifying scaffolds that are receiving increasing or decreasing attention over time. However, we have not discussed why a scaffold exhibits a specific trend. Explaining why a given scaffold is being enumerated or tested in more assays than another scaffold is difficult, if not impossible, in the absence of detailed information on the projects studying those scaffolds. We nevertheless explored whether certain general factors could explain attention trends.

One might expect that synthetically accessible scaffolds will tend to be explored more than complex scaffolds. We computed a synthetic feasibility score using MOE 2016 and plotted the score versus the slopes for scaffolds (with statistically significant trend fits) for three of the popularity measures we investigated. These plots are displayed in Supplementary Figure S13. In all three cases the Pearson correlation between synthetic feasibility and slope is essentially 0, suggesting that ease of synthesis does not (explicitly) drive trends in these popularity measures. Note that synthetic feasibility (or equivalently, molecular complexity) is intuitively a time-varying property: with advances in synthetic methodology, previously complex scaffolds may become easier to synthesize.48

The nature of the assays that scaffold members are tested in may underlie specific trends. Specifically, certain types of assays, such as cytotoxicity assays (e.g. tested on HepG2 cells) and Caco-2 permeability assays, tend to be run for all hits in a screening campaign. To explore this further, we identified the cell lines for all the assays considered in our dataset. Overall, the three most frequently occurring cell lines were CHO, HEK293 and MCF7, which are commonly used lines for viability studies. We then computed the frequency of occurrence of cell lines associated with each scaffold (i.e., via the assays that the scaffold members were tested in). The result is tabulated in Supplementary Data File 7. Interestingly, if we consider the most frequently occurring cell line for each scaffold, we observe just 14 scaffolds for which that cell line is HepG2 or Caco-2. Of these 14 scaffolds, only three had trend slopes (in any of the properties we considered) that were statistically significant. Furthermore, the slopes themselves were classified as flat or medium. The data thus suggest that testing in standard counter-screens is unlikely to drive the observed trends.

Another factor that may drive trends is medicinal chemistry intuition. As noted by Shanmugasundaram et al.,49 lead optimization hypotheses depend to a large degree on a chemist’s experience and intuition. However, the role and extent of intuition are difficult to quantify, though Brown et al.50 have quantitatively described this issue in the context of preferential para-substitution of aromatic rings.

More fundamentally, we hypothesize that a scaffold may receive attention due to the attention it has received in the past. This may be especially true when prior studies have shown biological activity against interesting or relevant targets. This behavior is analogous to preferential attachment processes, whereby wealth or credit is distributed amongst objects based on how much they already have.51

SUMMARY AND CONCLUSIONS

A key motivation of this study was to explore how scaffolds and their associated properties could be analyzed in a temporal fashion. This is motivated by the fact that scaffolds represent a core conceptual element of a medicinal chemist’s thought process and the fact that scaffolds are associated with classes of molecules with specific types of activities. In this sense, some scaffolds receive more attention in the literature in the form of publications and thereby appear to be ‘popular’.

We consider scaffolds to be popular if many unique compounds containing the same scaffold are reported in the medicinal chemistry literature and/or if they show an increasing trend in the number of new member compounds over time. An example of such a scaffold series is the chalcones, which showed the steepest positive trend line slope of newly published compounds per year (considering ChEMBL20). Another, orthogonal, measure of scaffold popularity is given by trends in the number of assays in which a certain scaffold was tested. As this measure is also related to testing on different targets, scaffolds showing an increasing trend in the number of assays are also likely to possess a high cumulative target count. A good example of this kind is the daunorubicin scaffold, with the steepest positive assay slope but also a steep positive trend line of tested targets per year (here especially cell line targets).

Furthermore, if we consider ‘popular’ to mean of interest to many people, then publication of a scaffold in a higher profile (or more general) journal allows us to characterize the scaffold as ‘popular’. Since higher JIFs correlate with higher profile journals, we believe that using JIF as a proxy for scaffold popularity is not unreasonable.
However, it is hard to differentiate a real increasing trend in scaffold popularity from a natural increase in a journal’s impact factor over time, and thus the sum of JIFs across all years might provide a better way to assess scaffold popularity. Though there has been much discussion on the pitfalls of JIFs and other simplistic metrics of a journal’s importance,52–54 the JIF nonetheless offers an orthogonal way to assess the community’s perceived interest in a scaffold. Indeed, given that publications (and more so, publications in high profile journals) drive many real-world decisions (e.g., promotion, tenure, awards, funding), assessing trends via some form of bibliometric analysis is not unfounded.

Our exploration of trends in liabilities was motivated by the expectation that, over time, the accumulation of activity data and research reports would allow researchers to avoid exploring scaffolds with liabilities. As recently outlined by Dahlin, Baell and Walters,55,56 the ‘natural history’ of compounds and their analogs can serve as an evidence-based guideline in early stages of drug discovery. The authors suggest an extensive literature survey using resources and tools such as PubChem,57 ChEMBL, SciFinder,58 or Reaxys59 to become informed about potential interference compounds (possessing bioassay promiscuity, including reactivity). Our analysis showed that most scaffolds do not exhibit PAINS substructures and that, where such substructures are present, there is no distinct trend in the occurrence of these liabilities in scaffolds over time. Our analysis does suggest that scaffolds without PAINS substructures have attracted more attention than scaffolds with PAINS, as evidenced by the increasing number of enumerated compounds for these scaffolds over time. Of course, it has been noted that PAINS are not necessarily the best way to characterize liabilities,43,55 given the method by which they are defined. We use them since they are well known, but the analysis could be repeated with any other liability filter, such as REOS60 or the Lilly MedChem filter.61



Taking all orthogonal scaffold popularity measures into account, we finally present a consensus trend score, calculated as the rank product of the trend slopes of the numbers of new compounds per year, assays per year, and targets per year, together with the liability trends. We propose to use this consensus ranking as the basis of a multidimensional view of a scaffold’s popularity in big data collections. The weighting of the popularity contributions, however, likely needs to be adjusted depending on the use case. As popularity does not imply that a scaffold is worthwhile for future drug discovery efforts, scaffold analysis cannot be reduced to a simple formula. The consensus popularity ranking may help to assess whether a compound series has been studied heavily before, and a more detailed inspection of the contributions to this score and of the trend lines themselves will show years of strong or little interest. Likely reasons for an increasing or decreasing trend at a certain point in time can be determined by looking into the respective publications. In addition, polypharmacological trend depictions per scaffold deliver a complete picture of the target-bioactivity profiles over time. This is very helpful in identifying a shift in interest concerning the testing behavior within a scaffold series over time. In a drug discovery setting, this helps to avoid investing effort in well-studied and/or useless (due to liabilities) compound series, especially if information about marketed drugs is included in the analyses.

The basis of this study has been Bemis-Murcko scaffolds, but there is no restriction on the actual scaffolds that could be used. A particular fragmentation scheme (along with fragment selection rules) may result in some relevant scaffolds not being considered. An example is the oxetane scaffold, which has received increasing interest due to a combination of useful physicochemical properties and synthetic utility62 but is not considered in our analysis.
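The consensus ranking described earlier in this summary combines several trend slopes through a rank product. A minimal sketch of the idea follows, under stated assumptions: scaffolds are ranked within each measure (rank 1 = steepest positive slope), the liability measure is ranked in the opposite direction as one possible way of letting it ‘contribute inversely’, and the rank product is taken as the geometric mean of the ranks, so that a smaller value means a more popular scaffold. The function names, the tie handling and the toy slopes are illustrative only.

```python
from math import prod


def ranks(values, descending=True):
    """Rank values so that rank 1 is best; ties share the smaller rank."""
    ordered = sorted(values, reverse=descending)
    return [ordered.index(value) + 1 for value in values]


def rank_product(slopes_by_measure, inverse=()):
    """Geometric mean of per-measure ranks for each scaffold.

    slopes_by_measure: {measure_name: [slope for each scaffold]}, with the
    scaffolds in the same order in every list.
    inverse: measure names for which a lower slope should rank better.
    """
    per_measure = [
        ranks(slopes, descending=(name not in inverse))
        for name, slopes in slopes_by_measure.items()
    ]
    k = len(per_measure)
    return [prod(scaffold_ranks) ** (1.0 / k) for scaffold_ranks in zip(*per_measure)]


# Hypothetical slopes for three scaffolds (positions 0, 1, 2).
slopes = {
    "compounds": [3.0, 1.0, 2.0],
    "assays": [5.0, 1.0, 4.0],
    "liability": [0.0, 2.0, 1.0],
}
consensus = rank_product(slopes, inverse=("liability",))
```

Because every measure enters with the same exponent, this sketch is unweighted; a weighted variant would raise each rank to a measure-specific power before taking the product.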
In addition, our workflow did not perform any post-processing after the Bemis-Murcko fragmentation. This results in some


38

Page 39 of 54 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60


cases, such as scaffold IDs 198 (1,3,4-thiadiazole) and 515 (2-amino-1,3,4-thiadiazole), which could be merged into a single scaffold. Whether such a merge is appropriate is a subjective decision that depends on the fragmentation pipeline employed by the user. While alternative fragmentation methods are available (such as MEQI,63 matched molecular pairs and so on), exhaustive fragmentation could identify scaffolds with more extreme trends. Furthermore, while Bemis-Murcko scaffolds are useful, hierarchical schemes64 would allow us to refine the temporal analysis of scaffold ‘families’ and are the focus of future work. In particular, a hierarchical scheme coupled with appropriate post-processing could exclude scaffold pairs such as IDs 18 (quinoline) and 20 (quinolin-4-yl-aminophenyl), where ID 18 is a substructure of ID 20. We also note that the methodology described here could easily be applied to any arbitrary substructure.

Finally, we note that while our focus has been on exploring community interest in, and the popularity of, a scaffold based on the (primarily academic) medicinal chemistry literature as captured by ChEMBL, we have not considered patents, which also represent a key measure of research relevance. Given that commercialization efforts involve scaffold hopping to identify patent holes, it might be expected that a given scaffold is not well represented across patents from different companies. At the same time, scaffold trends extracted from patents could provide insight into scaffolds and, more generally, regions of chemical space that are of rising commercial interest. Using resources such as SureChEMBL,65 we plan to explore how scaffold trends can be extracted from patent data.

The workflows for curating and mining data in order to generate the summary data for the determination of scaffold trends are freely available from myExperiment (http://www.myexperiment.org/packs/725.html), and the code for calculating the trends and consensus ranking is available from a GitHub repository (https://spotlite.nih.gov/ncats/ScaffoldTrends). While the



current analysis is based on a number of specific choices of fragmentation scheme and liability filter, the approach is quite general and can be adapted to specific needs. Though the analyses proposed here do not support robust conclusions regarding the causality of trends, we have highlighted a number of factors that may play a role. More rigorous explanations of causality will require access to project-level metadata. Nonetheless, we believe that the temporal analyses described here can provide useful information on the historical synthetic and testing behavior associated with regions of chemical space. This can be useful at the beginning of a medicinal chemistry project as well as when identifying scaffolds that could be avoided due to extensive interest, in favor of less studied scaffolds.

METHODS

Data retrieval

Data were retrieved from ChEMBL20 via a PostgreSQL interface (pgAdmin3) by querying for compounds (ChEMBL compound IDs, preferred compound names, canonical SMILES) along with their bioactivities (value, pChEMBL value, activity endpoint), target information (ChEMBL target ID, preferred target name), assay information (ChEMBL assay ID) and document information (journal, publication year, PMID, DOI). In this initial query, filters were applied for the target organism (only ‘Homo sapiens’ was kept), activity unit (only ‘nanomolar’ was kept) and activity relation sign (only ‘=’ was kept). No activity thresholds were applied to classify compounds as active or inactive. This led to an initial dataset with 1.73 million data entries. The initial search query is given as a supplement. We note that the assay data considered in this subset were originally obtained from three data sources, of which the ‘Scientific Literature’ accounted for 99.4% of assays considered in this analysis.
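
The three filters applied in the initial query can be sketched in Python as follows. This is only an illustration under stated assumptions: the record layout and the field names (`organism`, `units`, `relation`) are hypothetical and do not reproduce the ChEMBL schema or the actual SQL query, and encoding nanomolar as the string "nM" is likewise an assumption.

```python
# Illustrative sketch only: field names and the "nM" encoding are assumptions,
# not the ChEMBL schema and not the SQL query used for the analysis.

def passes_initial_filters(record):
    """Return True if a bioactivity record survives the three initial filters."""
    return (
        record.get("organism") == "Homo sapiens"   # human targets only
        and record.get("units") == "nM"            # nanomolar activities only
        and record.get("relation") == "="          # exact measurements only
    )


def filter_records(records):
    """Keep records passing all three filters; no activity threshold is applied."""
    return [r for r in records if passes_initial_filters(r)]


# Hypothetical toy records, one per failure mode plus one that passes.
example = [
    {"compound": "A", "organism": "Homo sapiens", "units": "nM", "relation": "="},
    {"compound": "B", "organism": "Rattus norvegicus", "units": "nM", "relation": "="},
    {"compound": "C", "organism": "Homo sapiens", "units": "nM", "relation": ">"},
    {"compound": "D", "organism": "Homo sapiens", "units": "ug/mL", "relation": "="},
]
kept = filter_records(example)
```

In this toy input only compound A is retained; the other three records each fail exactly one of the filters.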




Next, the data file was imported into a KNIME66 (version 3.1) environment to perform the scaffold generation, grouping by scaffold, data filtering and post-processing.

Scaffold generation, data filtering and processing

Scaffolds for the 1.73 million compounds (non-unique) were generated using the ‘RDKit Find Murcko Scaffolds’ node in KNIME. After filtering for activity endpoints ‘Ki’ and ‘IC50’ and for data entries published in 1998 or later, 619 671 non-unique compounds (282 878 unique compounds) with calculated Murcko scaffolds were retained. Grouping by identical scaffolds led to 103 772 scaffold clusters. By retaining scaffolds with at least 10 unique compounds and publications in at least 5 different years, and excluding scaffolds with overly generic ring systems (6-membered or smaller ring systems with at most one heteroatom in the ring), 764 distinct scaffolds (101 622 data entries; 36 328 unique compounds) were obtained. Further, activity data points within these 764 scaffold clusters were converted into robust Z-scores on a per-assay basis (using ChEMBL assay IDs for the grouping), and Thomson Reuters journal impact factors (JIFs) were mapped accordingly (per journal and year).

Filtering for PAINS liabilities

We implemented a Python script using RDKit (2016.03.3) to match structures of scaffolds and their enumerated compounds to the set of PAINS filters described by Holloway et al.23 The script is available as Supplementary Information.

Trend characterization

To characterize scaffold trends, for a given property X (number of unique compounds, number of bioassays, aggregated bioactivities, aggregated IFs etc.), we fitted a robust linear regression



model (X ~ m*year + C). The slope, m, was used to characterize the nature of the trend. For example, when inspecting scaffold popularity, a steeply increasing slope could indicate increasing importance, whereas a decreasing trend could indicate a scaffold falling into disfavor. A flat trend is also interesting, possibly indicating ongoing interest in a scaffold. Trend analyses were performed in R 3.3.2 (https://www.r-project.org/), using the rlm function from the ‘MASS’ package. Statistical significance of the trends was assessed by conducting a Wald test on the fitted model and extracting the p-value. Regression fits with p-values < 0.05 were considered significant; all others were considered insignificant. For trend analyses of counts (such as the number of new compounds per year, the number of bioassays per year, and target counts), zeros were added for years in which no value could be retrieved. In the case of journal impact factors, empty fields (no JIF for that particular year) were treated as missing data points in the regression calculation and statistics. Slopes ≥90th percentile were considered steep trends; slopes
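
To illustrate the trend model X ~ m*year + C and the zero-filling of count data described above, the following is a minimal Python sketch. It is not the R code used in the analysis: it swaps in a simple Huber-type M-estimator fitted by iteratively reweighted least squares. The function names, the tuning constant 1.345 and the residual scale (median absolute residual divided by 0.6745) are assumptions made for illustration. The Wald test and the percentile-based classification of slopes are not reproduced here.

```python
from statistics import median


def fill_missing_years(counts_by_year, first_year, last_year):
    """Return (year, count) pairs, adding zeros for years without a value."""
    return [(y, counts_by_year.get(y, 0)) for y in range(first_year, last_year + 1)]


def huber_line_fit(years, values, k=1.345, max_iter=50, tol=1e-6):
    """Fit values ~ m*year + c with Huber weights (illustrative IRLS sketch).

    With max_iter=1 the result is the ordinary least-squares fit.
    """
    weights = [1.0] * len(years)
    m = c = 0.0
    for _ in range(max_iter):
        # Weighted least squares for a straight line (closed form).
        sw = sum(weights)
        mean_x = sum(w * x for w, x in zip(weights, years)) / sw
        mean_y = sum(w * y for w, y in zip(weights, values)) / sw
        sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, years))
        sxy = sum(w * (x - mean_x) * (y - mean_y)
                  for w, x, y in zip(weights, years, values))
        new_m = sxy / sxx
        new_c = mean_y - new_m * mean_x
        residuals = [y - (new_m * x + new_c) for x, y in zip(years, values)]
        # Robust residual scale (assumed: median absolute residual / 0.6745).
        scale = median(abs(r) for r in residuals) / 0.6745
        converged = abs(new_m - m) < tol and abs(new_c - c) < tol
        m, c = new_m, new_c
        if converged or scale == 0:
            break
        # Huber weights: full weight for small residuals, reduced weight otherwise.
        weights = [1.0 if abs(r) <= k * scale else k * scale / abs(r)
                   for r in residuals]
    return m, c


# Hypothetical counts: a steady rise of 2 per year with one outlying final year.
counts = {1998 + i: 2 * i + 1 for i in range(12)}
counts[2009] += 30
pairs = fill_missing_years(counts, 1998, 2009)
years = [y for y, _ in pairs]
values = [v for _, v in pairs]

robust_slope, robust_intercept = huber_line_fit(years, values)
ols_slope, ols_intercept = huber_line_fit(years, values, max_iter=1)
```

In this toy series the single outlying year pulls the least-squares slope well above the underlying rise of 2 per year, whereas the reweighted fit stays close to it, which is the motivation for using a robust regression when characterizing trends.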
