The Pharmacophore Network: a Computational ... - ACS Publications

laboratory. To evaluate the classifications, the Area Under the ROC Curve (AUROC) was used as a performance metric. Table 1 reports the average of the...
3 downloads 8 Views 3MB Size
Subscriber access provided by UNIV OF DURHAM

The Pharmacophore Network: a Computational Method for Exploring Structure-Activity Relationships from a Large Chemical Dataset Jean-Philippe Métivier, Bertrand Cuissart, Ronan Bureau, and Alban Lepailleur J. Med. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.jmedchem.7b01890 • Publication Date (Web): 12 Apr 2018 Downloaded from http://pubs.acs.org on April 12, 2018

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 40 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

The Pharmacophore Network: a Computational Method for Exploring Structure-Activity Relationships from a Large Chemical Dataset Jean-Philippe Métivier,‡,§ Bertrand Cuissart,§ Ronan Bureau,‡ and Alban Lepailleur‡,*



Centre d’Etudes et de Recherche sur le Médicament de Normandie, Normandie Univ,

UNICAEN, CERMN, 14000 Caen, France §

Groupe de Recherche en Informatique, Image, Automatique et Instrumentation de Caen,

Normandie Univ, UNICAEN, ENSICAEN, CNRS, GREYC, 14000 Caen, France

ABSTRACT Historically, structure-activity relationships (SAR) analysis has focused on small sets of molecules but in recent years, there has been increasing efforts to analyze the growing amount of data stored in public databases like ChEMBL. The pharmacophore network introduced herein is dedicated to the organization of a set of pharmacophores automatically discovered from a large dataset of molecules. The network navigation allows to derive essential tasks of a drug discovery process, including the study of the relations between different chemical series, the analysis of the influence of additional chemical features on the compounds’ activity, and the identification of diverse binding modes. This paper describes the method used to construct the pharmacophore network and a case study dealing with BCR-ABL exemplifies its usage for large-scale SAR analysis. Thanks to a benchmarking study, we also demonstrated that the selection of a subset of representative pharmacophores can be used to conduct classification tasks.

ACS Paragon Plus Environment

1

Journal of Medicinal Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

INTRODUCTION The investigation of structure-activity relationships (SAR) represents one of the most important tasks during the early stages of the drug discovery process. To this aim, pharmacophore modeling is nowadays very well accepted in the field of medicinal chemistry. Given a specific target, ligand-based pharmacophore elucidation requires the detection of the spatial arrangement of a combination of chemical features shared by several active molecules and responsible for favorable interactions with the active site.1 To discover these common anchoring features, the usual method starts with a careful selection of a small subset of ligands known for binding the same active site with the same binding mode. But over the last few years, a number of large chemical databases have become publicly available, such as ChEMBL2,3 or PubChem,4 providing an invaluable source of information for medicinal chemists and opening the way to the development of new computational methods for large-scale SAR analysis.5 The present work introduces a method that automatically computes pharmacophores from a large dataset of molecules without any prior supervised selection of a small subset of molecules. Since pharmacophores are intrinsically three dimensional, a key step of pharmacophore modeling is the conformational sampling of the molecules to be used for the alignment and the detection of the common features. Numerous approaches have been developed but it is well recognized that a broad conformational coverage requires an intensive computation when dealing with a large amount of molecules.6 To go beyond this limitation, we decided to approach the pharmacophore elucidation problem by considering 2D molecular structures. Therefore, the method described herein is based on the computation of the so called topological pharmacophores.7,8 Given a dataset of molecules, our method first enumerates all the recurring combinations of chemical features, then it computes their capability to discriminate between

ACS Paragon Plus Environment

2

Page 2 of 40

Page 3 of 40 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

active and inactive molecules. Such a discriminative capability is calculated using the notion of an emerging pattern (EP)9,10 that comes from computer science. When applied to a dataset partitioned into two classes (e.g. active versus inactive molecules), EP mining can identify the features that occur with higher frequency in one of the two classes. This notion has already been applied in chemoinformatics11–16 but this is the first time that EP are used to compute pharmacophores. Our work has led to the design of an intuitive and interpretable visualization tool to analyze the SAR information deriving from the mining of a chemical database like ChEMBL. Such a tool has become more and more important to initiate the discussion between computer scientists and medicinal chemists, and to guide hit-to-lead optimization programs. To graphically visualize the results of a large-scale SAR analysis, hierarchies and molecular networks are increasingly used because they provide an intuitive way to organize large amounts of chemical and biological data. Related works have been conducted by Birchall et al.17,18 and Wollenhaupt et al.,19 and for recent reviews on this topic the reader is referred to Stumpfe and Bajorath.20 Briefly, many of the methods developed in the recent years organize the molecular datasets on the basis of similarity relationships calculated from fingerprint-based representations of the molecules,21,22 i.e. from molecules depicted as Boolean vectors encoding the presence/absence of molecular descriptors. Even if it is fast to compute, the outcome of such fingerprint-based approach strongly depends on the chosen fingerprinting methodology, and it is often difficult for a medicinal chemist to give an interpretation of the results. To tackle these issues, the use of substructure-based similarity relying on the matched molecular pairs (MMP)23 or the maximum common substructure (MCS)24 concepts have been proposed.25,26 One of the main drawbacks of most of these methods is the lack of fuzziness of the detected common scaffolds – e.g. a chlorine and a bromine are

ACS Paragon Plus Environment

3

Journal of Medicinal Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

considered as two different atoms. With our methodology, this issue is addressed by the generalization capability of the pharmacophoric features – e.g. a chlorine and a bromine are both considered as a hydrophobic group. The SAR visualization tool we propose introduces the concept of pharmacophore network that takes advantage of the existing interconnections between the pharmacophores extracted by our algorithm. The objective of such an organization of the pharmacophores is to provide the medicinal chemists with a framework to study the relations between different chemical series, to study the influence of the addition or the deletion of chemical features on the compounds’ activity and particularly on the protein-ligand binding mode, and to interpret local SAR information. As a case study, the BCR-ABL tyrosine kinase has been chosen to give an example of how a pharmacophore network may be used to derive all these crucial drug discovery tasks. To complete the study, we implemented a rule-based classifier relying on a selection of representative pharmacophores to discriminate active from inactive molecules. The proposed classification method was evaluated on a panel of diverse datasets covering different protein classes.

RESULTS AND DISCUSSION BCR-ABL as a case study We computed the pharmacophoric network associated to a dataset collected from ChEMBL2,3 and dealing with BCR-ABL, a tyrosine kinase responsible for Chronic Myeloid Leukaemia (CML). In CML, a reciprocal translocation fuses the Abl and Bcr genes leading to a fusion gene encoding the constitutively active BCR-ABL.27 To expedite activity assays, ABL1 is used because it is easier to produce and to work with.28 Despite the continued clinical success of Imatinib,29 the development of resistance has led to the emergence of numerous drug discovery

ACS Paragon Plus Environment

4

Page 4 of 40

Page 5 of 40 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

programs28 that have contributed to the storage of thousands of data in chemical databases. Another reason why we chose BCR-ABL as a case study is that the active site of this kinase is highly flexible and can adjust to accommodate a variety of inhibitors, thus leading to different and well-defined binding modes. The mining of the ABL1 dataset has generated 87175 pharmacophores supported by at least 10 molecules and exhibiting up to 6 pharmacophoric features (user defined constraints). The starting points for the SAR analyses discussed in the following corresponded to pharmacophores selected by the MMRFS algorithm. From the 137 MMRFS pharmacophores, we navigated through the network thanks to the interactive web pages in order to retrieve the most interesting pharmacophores in terms of growth-rate and supporting molecules. The pharmacophore network – and particularly all the pharmacophores discussed below – can be accessed at https://chemoinfo.greyc.fr/2017_metivier. Study of the relations between different chemical series. The pharmacophore P1 (Figure 1) fits the greatest number of molecules and covers nearly thirteen times more active than inactive molecules. Among the 492 molecules in agreement with P1, 320 originate from a series of substituted 6,6-fused nitrogenous heterocyclic compounds (scaffold S1, Figure 1) patented in 2012.30 In addition to S1, the pharmacophore P1 represents other commonly used scaffolds (S2– S6) that occupy the adenine binding site and the hydrophobic back pocket adjacent to the ATP binding cleft. For instance S2 and S5 are respectively associated to ponatinib and dasatinib, two approved drugs for CML treatment. Unsurprisingly, the fused pharmacophoric feature AR1 generalizes the heterocyclic moieties that mimics one of the H-bonds that are formed by the adenine ring of ATP with the hinge region, and particularly with Met318. The methylphenyl (or methylpyridine) fragment HR2 occupies the hydrophobic back pocket close to the gatekeeper Thr315. It should be noticed that the H feature highlights the dramatic increased potency and

ACS Paragon Plus Environment

5

Journal of Medicinal Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

reduced promiscuity that have been observed with the introduction of a flag methyl. This methyl leads to a non-planar conformation of the ring with respect to the preceding parts of the compounds and gives additional hydrophobic interactions that increase the affinity and selectivity for the enzyme. The organization of the pharmacophores into a network makes it possible to analyze the transition from one pharmacophore to another. Such an information is of particular interest to study the relation between different chemical series. An example is provided on Figure 2. At a first glance, the pharmacophores P2 and P9 (Figure 2.A) – which correspond to pharmacophores selected by the MMRFS algorithm – are not obviously linked since they are specific of two different chemical series – the isoquinolines and the pyridopyrimidines, resp. – and they have only two R in common. The step by step analysis of the shortest path from P2 to P9 (Figure 2.B) shows that intermediate P4-P8 (Figure 2.A) are in agreement with some derivatives of the two chemical series, thereby suggesting the filling of a similar region of the BCR-ABL binding site – here the adenine binding site and the back pocket. Since no crystallographic data is available for the isoquinoline series, we performed docking simulations of compound 1 – representative of P2 – into the binding site of ABL1 bound to PD166326 – representative of P9. The results seem to confirm a similar binding mode for compound 1 and PD166326 (2, Figure 2.C). Particularly, the isoquinoline and its cyclopropylamide substituent in C-3 form two key hydrogen bonds to Met318 located in the Hinge region, and the 2-chloro-6-methylphenyl moiety fills the hydrophobic back pocket. Analysis of the influence of additional pharmacophoric features and detection of diverse binding modes. Starting from a pharmacophore of interest, the navigation through the pharmacophore network allows to analyze the influence of the nature and/or the position of additional features on

ACS Paragon Plus Environment

6

Page 6 of 40

Page 7 of 40 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

the compounds’ activity. Inversely, one can also analyze the influence of their removal. For instance, the Figure 3 shows the overall visualization of the pharmacophore network originating from P1 (Figure 3.A) and the distribution of its 63 nearest neighbors obtained following the addition of only one R, H, P, A, or D (Figure 3.B). It should be noticed that with the aim of analyzing a loss of activity caused by the addition of a pharmacophoric feature, we should drastically decrease the frequency threshold. Indeed, a harmful substitution is not likely to be over-represented in a dataset. In the present study, the minimal number of molecules that a pharmacophore has to fit in was fixed to 10. Therefore, our technique is best-suited to analyze the benefic effect of the recurring addition of a pharmacophoric feature. When we observe a decrease of the growth-rate following the addition of a pharmacophoric feature, it rather denotes the grouping of a higher proportion of inactive molecules but not directly the highlighting of the incriminated feature. In this case, the extension of the pharmacophore has to be analyzed in order to identify activity cliffs (see below). Concerning BCR-ABL, the analysis of the addition of pharmacophoric features is even more of interest due to the presence of inhibitors that adopt different binding modes. In brief, type I inhibitors bind the so-called active conformation of the kinase and are ATP-competitive.31 By contrast, Type II inhibitors preferentially bind to the inactive conformation of the kinase, thereby preventing its activation. Type II inhibitors also recognize the ATP-binding cleft but in addition they fill an adjacent allosteric pocket created when the activation loop move to its DFG-out conformation.32 Subsequent to the addition of a second hydrophobic group on R2, we obtain P10 which covers ligands exhibiting a di-orthosubstituted phenyl ring, a specific fragment of type I inhibitors. The agreement of dasatinib (3), a typical example of this binding mode, with P10 is shown on Figure 4.A. For comparison purpose, the crystal structure of ABL1 in complex with dasatinib is displayed on Figure 5.A.

ACS Paragon Plus Environment

7

Journal of Medicinal Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Some derivatives sharing a naphtyridine (or an isoquinoline), a pyridopyrimidine, or a vinyl purine scaffold, like compounds 4,30 5 (PD166326),33 or 6,34 resp., are in agreement with P10 (Figure 4.A). It should be noticed that compounds 3-6 all exhibit an H-bond donor interacting with the hinge region and mimicking a second interaction of the adenine moiety of ATP. To retrieve a pharmacophore describing these two hydrogen bonds, one can navigate through the network starting from P10. The corresponding pharmacophore is associated to 60 molecules, all of them being active as denoted by the infinite growth-rate. One of the really different structural elements of BCR-ABL that is exploited by type II inhibitors like ponatinib (7) and imatinib is described by P11 (Figure 4.B): the positively ionizable feature P situated several angstroms away from the binding site of ATP. As mentioned above, this interaction results from the opening of the allosteric site which is hydrophobic in nature but also enables a set of hydrogen bonds to be established. Particularly, in the case of ponatinib (2) the P feature interacts with the backbone carbonyl of Ile360 and His361 (Figure 5.B).35,36 Some compounds based on the isoquinoline (e.g. 8-9) and the imidazopyridazine (e.g. 10) scaffolds have been clearly designed to bind the DFG-out conformation (Figure 4.B). Apart from type I and type II inhibitors, ligands referred as type I1/2 inhibitors37 recognize the target kinase in its DFG-in conformation but establish type II interactions that are typical of a DFG-out conformation, namely the H-bonds taking place in the back cavity with Asp381 and/or Glu286. An interesting example of type I1/2 inhibitors is the purine-based compound AP24283 (11, Figure 5.C)38 which is covered by P12 (Figure 4.C), a pharmacophore describing the H-bond donor interacting with Glu286. According to their agreement with P12, it can be hypothesized that some isoquinoline (e.g. 12-13) and benzotriazine (e.g. 14) derivatives have been designed to fulfil the requirements of type I1/2 inhibition (Figure 4.C). Incidentally, the hydroxyl group of 14 has been shown in X-ray data to

ACS Paragon Plus Environment

8

Page 8 of 40

Page 9 of 40 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

establish H-bond with Glu286.39 It should be noticed that the most supported pharmacophores P13-P16 (Figure 3) are related to the over-represented compounds associated to the scaffold S1. The disclosed compounds are usually isoquinolines bearing a cyclopropyl carboxamide function on C-3 and therefore, P13-P16 respectively describe the second aromatic ring of the isoquinoline, the cyclopropyl moiety, and the CO and NH of the amide bond. Even if the grouping of derivatives belonging to a specific chemical series can be informative, the automated detection of pharmacophores representative of different protein-ligand binding modes is clearly the main strength of our approach. Interpretation of local SAR information. On Figure 3.A, densely connected nodes – also known as hubs – can be seen. In addition to the identification of common anchoring features across different chemical series (see above), these hubs can reflect the detection of specific scaffolds that have been intensively used in medicinal chemistry programs. As an example, P17 (Figure 6.A) covers dozen of active compounds deriving from a hit-to-lead optimization program that aimed at improving the selectivity and the solubility of 2 (PD166326).40 The analysis of the supporting molecules of P17 clearly shows that the para-phenylamino substituent tolerates various functional groups. It can be explained by the positioning of this substituent in a solvent accessible region at the back end of the ATP-binding site.41 Our method also identified the efforts around scaffold S1 to design inhibitors able to establish H-bonds in the back cavity (Figure 6.B). P18 groups molecules reflecting the bioisosteric H-bond donor features that have been introduced to interact with Glu286. Due to their relative small size in comparison with type II inhibitors, the molecules in agreement with P18 probably recognize the target kinase in its DFG-in conformation and can be considered as type I1/2 inhibitors. Contrary to P17 and P18 that exclusively group active molecules – as denoted by their infinite GR, P11 exemplifies the

ACS Paragon Plus Environment

9

Journal of Medicinal Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

identification of a discontinuous local SAR.20 Discontinuity is introduced when structurally similar molecules have very different potency. Such compounds form activity cliffs42 and an example dealing with analogues of ponatinib is shown on Figure 6.C. Compared to ponatinib, the inactivity of the imidazole derivative is likely due to significantly decreased hydrophobic contact between the monocyclic core and a hydrophobic pocket delimited by L248, Y253, V256, L370, and F382 (Figure 6.D). The improved potency of the dimethyl analogue seems to confirm the hypothesis. In summary, the pharmacophore network provides the medicinal chemists with a consistent framework for interpreting local SAR information. Benchmarking of the classification performances of the MMRFS pharmacophores Starting from several thousands of chemicals, the set of extracted pharmacophores can be very large, even under reasonable constraint settings of minimum frequency and minimum growthrate. Even if all these pharmacophores can be used for classification, not all of them are equally useful. Besides, it is important to reduce the number of pharmacophores in order to sustain a structural interpretation of the data. A subset of representative pharmacophores was selected using the MMRFS algorithm.43 In order to demonstrate the capabilities of the selected pharmacophores to discriminate active and inactive molecules, eight datasets showing differences in size, actives/inactives ratio (balancing), and structural heterogeneity were collected from ChEMBL (Table 1). The corresponding targets span a variety of protein classes, namely the carboxylesterases, GPCRs, ion channels, kinases, regulator proteins, and metalloproteases. The targets were selected because of their interest in the drug design efforts conducted in our laboratory. To evaluate the classifications, the Area Under the ROC Curve (AUROC) was used as a performance metric. Table 1 reports the average of the performances for the 5-fold crossvalidation procedures along with the number of pharmacophores selected by MMRFS

ACS Paragon Plus Environment

10

Page 10 of 40

Page 11 of 40 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

(#MMRFS) and the cardinality of the subset of pharmacophores used to maximize the F-measure (#subset). Classification performances are greater than 80% for 7 out of the 8 datasets, with the highest predictive power for ABL1 (AUROC of 0.94). The lowest performances are observed with the UR-2-R dataset but remained satisfying (AUROC of 0.74). It should be noticed the good recovery of the hERG inhibitors (AUROC of 0.84), a task known to be challenging to achieve. The results indicate that the performances do not vary across the protein families and seem not to be influenced by the size, the balancing, and the structural diversity of the datasets. The number of used pharmacophores ranges from 8.2 (MCL1) to 191 (hERG). It varies significantly as function of the number of active molecules in the dataset. The lowest values were obtained with datasets exhibiting the smallest number of actives. As an illustrative example, the web pages displaying the 137 MMRFS pharmacophores and the corresponding ROC curve for ABL1 are available at https://chemoinfo.greyc.fr/2017_metivier. The pharmacophores are ranked according to their order of selection during the iterative process of the MMRFS algorithm and we can see that the maximum F-measure (0.9223) is reached by using the first 98 MMRFS pharmacophores. To complete the study, we predicted the removed molecules of the ABL1 dataset – i.e. belonging to the range ]100 nM; 1 µM[ – by using the 98 pharmacophores that maximize the F-measure. As shown on Figure 7, the molecules with activities within the range ]100 nM; 700 nM] are essentially covered. For instance, the union of the 10 first pharmacophores covers 100% of the molecules within the range ]100 nM; 200 nM]. On the upper-right corner of the diagram are grouped the molecules which are in agreement with the MMRFS pharmacophores ranked beyond the 110th position and which exhibit activities greater than 700 nM. These results suggest that a 700 nM cutoff could have been used to discriminate the active from the inactive molecules. Finally, to test whether the use of the whole set of

ACS Paragon Plus Environment

11

Journal of Medicinal Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

pharmacophores – i.e. without the selection of a representative set using MMRFS – leads to better performances, an alternative benchmark study was carried out (data not shown). In this case, the pharmacophore applied by the rule-based classifier was selected according to a rank depending both on its growth-rate and support values. The results showed that the quality of the predictions are slightly decreased. Therefore, the set of pharmacophores selected by MMRFS are really efficient to summarize the active molecules of the datasets.

CONCLUSION In this paper we have described an original approach to organize and to analyze the invaluable SAR knowledge that can be extracted from a molecular database like ChEMBL. The first strength of the method is the automatic computation of pharmacophores from a large dataset of molecules annotated with a biological property. Contrary to the state-of-the-art methodologies, there is no need to carefully select a small subset of molecules prior to the search of their common pharmacophore. The fully unsupervised method we propose is able to compare up to several thousands of input molecules and to select all the encountered combination of pharmacophoric features that discriminate active from inactive molecules. The interconnections between the computed pharmacophores provide a hierarchical organization that corresponds to the second strength of the present work. This organization – that we have called the pharmacophore network – can be actively navigated thanks to interactive web pages generated by our algorithm. The navigation across the network allows to derive essential tasks of the drug discovery process, including: (i) the identification of common anchoring features across diverse chemical series of compounds, a task which corresponds to the very essence of the study of the ligand-receptor interactions; (ii) the analysis of the influence of the addition or the deletion of chemical features on the compounds’ activity; and (iii) the interpretation of local SAR

ACS Paragon Plus Environment

12

Page 12 of 40

Page 13 of 40 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

information by focusing on structural modifications that have been carried out around specific scaffolds. As a case study, we have mined a dataset of 1492 compounds extracted from ChEMBL and evaluated for their ABL1 activity. In addition to giving an exemplification for each drug discovery task listed above, we demonstrated that our method is also able to distinguish pharmacophores representative of different binding modes. Indeed, we have automatically generated pharmacophores segregating well-known inhibitors of type I, type II, and type I1/2. The analysis of the molecules in agreement with these pharmacophores suggests analogies with other inhibitors whose binding modes have not yet been reported. We are convinced that allosteric modulators – if included in the dataset – would have been also detected by our approach. However, we were limited by the lack of reported Ki and IC50 values on ABL1 for such inhibitors – e.g. GNF-2 that interferes with the myristate-binding site 44. It should be also noticed that the cross-analysis with pharmacophores extracted from a dataset of active/inactive inhibitors of the T315I mutated form of the kinase could be of great interest to design new molecules inhibiting the mutated enzyme. Finally, thanks to a benchmarking study we also demonstrated that the selection of a subset of representative pharmacophores using the MMRFS algorithm is very efficient to summarize the active molecules of a dataset and to be used for successful classification tasks. We believe that such a tool for large-scale SAR analysis will assist medicinal chemists by giving insights into favorable pharmacomodulations for their own hit-to-lead optimization programs, thus leading to an acceleration of the drug discovery process. In future work, we propose to tackle the polypharmacology paradigm by cross-referencing pharmacophores from different

ACS Paragon Plus Environment

13

Journal of Medicinal Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

datasets. Such a tool should provide an intuitive representation of the molecular features responsible for compounds’ multi-targeting activities.

METHODS AND DATASETS This section provides sufficient information to enable the reader to follow the logic of the procedures and results. A more detailed explanation of the algorithms used to compute the pharmacophores and to select a representative subset of pharmacophores for classification purposes are described in Supporting Information. Computation of the pharmacophores. As an input, the method considers a dataset of molecules (2D SDF format) annotated with a biological activity of interest. In a first step, every molecule is transformed into its reduced “pharmacophore graph”, a molecular representation which is strongly related to the notion of a reduced graph.45 In the pharmacophore graph of a molecule, a vertex denotes the occurrence of a pharmacophoric feature while an edge encodes the minimal number of bonds between the two vertices it links. The pharmacophoric features considered in the present work correspond to generalized functionalities involved in favorable interactions between ligands and targets: hydrogen bond acceptors (A) and donors (D), negatively (N) and positively (P) ionizable groups, hydrophobic areas (H), and aromatic rings (R). Figure 8 depicts the transformation of an example molecule M1 from its 2D molecular structure (a) into its reduced pharmacophore graph (b). In a second step, as a pharmacophore is of lesser interest when it relies on too few occurrences, only the combinations of pharmacophoric features satisfying a user-defined frequency threshold are retained. In a third step, for each frequent pharmacophore, one computes its capability to discriminate between active and inactive molecules. In the implementation of our algorithm, the notion of an emerging pattern (EP)9 quantifies this imbalance by using the growth rate (GR)

ACS Paragon Plus Environment

14

Page 14 of 40

Page 15 of 40 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

measure. Given a pharmacophore and a dataset partitioned in two classes – e.g. active and inactive molecules, the GR of a pharmacophore corresponds to the ratio between its frequency of fit within the active molecules and its frequency of fit within the inactive ones. To quantify how discriminative a pharmacophore is, we also computed its “confidence”. The confidence of a pharmacophore denotes the portion of active molecules in the set of all the molecules that contain the pharmacophore. The extraction of the pharmacophores uses two parameters: the minimal size of a support – i.e. the minimal number of molecules that a pharmacophore has to fit in – and the maximal number of features per pharmacophore. In the present study, the threshold were set to 10 and 6, respectively. These thresholds have been chosen because we assume that 10 is a relatively small number of molecules with respect to the size of the dataset and 6 is a classical number of anchoring features for small molecules. Visualization of the pharmacophore network. The extracted pharmacophores can be connected with each other through an ancestor-descendant relationship which records an inclusion relationship: starting from a pharmacophore, one can generate any of its descendants by adding pharmacophoric features together with the proper links. This inclusion relationship provides a partial order between the pharmacophores graphically rendered by a Hasse diagram.46,47 Figure 9 shows a prototypic Hasse diagram related to a partially ordered set of 6 pharmacophores where the ancestor-descendant relationship is depicted in a top-down manner. For instance, pharmacophore d is included in pharmacophore f and inversely pharmacophore f covers pharmacophore d. We have called this new organization of the pharmacophore the pharmacophore network. For an overall view, the pharmacophore network can be imported into Cytoscape.48 A prototypic pharmacophore network is shown on the left part of Figure 10. In this network, each node represents a pharmacophore and each edge links two pharmacophores that

ACS Paragon Plus Environment

15

Journal of Medicinal Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

only differ by a single feature. A node is color-coded according to the imbalance of the occurrences of the corresponding pharmacophore between active and inactive molecules – the growth-rate (GR) – and scaled in size according to its number of supporting molecules. In addition, interactive web pages are generated by our algorithm in order to actively navigate through the network and to analyze in detail the pharmacophores and their supporting molecules. Selection of a subset of pharmacophores. A selection of representative pharmacophores was carried out using MMRFS,43 a feature selection algorithm borrowed from the Maximal Marginal Relevance heuristic in information retrieval.49 MMRFS tends to output a subset of pharmacophores whose elements are discriminative, different and representative of the active molecules. The method is depicted as a pseudocode in Figure 11. MMRFS iteratively selects a pharmacophore which maximizes the quality function GR(p) - maxs∊S J(s,p). GR(p) denotes the growth-rate of a pharmacophore p which has been normalized to lie between 0 and 1. J(s,p) represents the Jaccard Index, a similarity measure between two pharmacophores s and p, that corresponds to the ratio between the number of molecules which contain both s and p and the number of molecules which contain either s or p. As MMRFS aims at selecting pharmacophores which represent the active molecules, a pharmacophore has to fulfill a further condition to be selected: it has to occur in at least n compounds in the set of the active molecules which do not already contain δ selected pharmacophores. In the present study, the parameters δ and n and have been respectively set to 2 and 5. We used MMRFS to select pharmacophores whose size lies between 4 and 6 features. Implementation of a rule-based classifier. To quantitatively assess the predictive potency of the MMRFS pharmacophores, a rule-based classifier has been implemented. First, the pharmacophores are ranked according to their order of selection during the iterative process of

ACS Paragon Plus Environment

16

Page 16 of 40

Page 17 of 40 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Medicinal Chemistry

the MMRFS algorithm. Second, the molecular dataset is redescribed using these pharmacophores: a molecule is now represented by the set of MMRFS pharmacophores it contains. Third, a classification is performed by applying a simple rule: if a molecule contains at least one MMRFS pharmacophore then it is predicted as active, otherwise it is predicted as inactive. The pharmacophore used for the classification of a molecule corresponds to the one associated to the lower rank in its redescription. The experiments were conducted into a five-fold cross-validation scheme50 and the Area Under the ROC Curve (AUROC)51 was used to quantitatively assess the performance of the classifier. The optimization of the classifier is seen here as finding the rank which leads to the subset of the MMRFS pharmacophores that maximizes the F-measure (Equation 1): ‫ ܨ‬− ݉݁ܽ‫= ݁ݎݑݏ‬

2 × ܲ‫݈݈ܴܽܿ݁ × ݊݋݅ݏ݅ܿ݁ݎ‬ ܲ‫ ݊݋݅ݏ݅ܿ݁ݎ‬+ ܴ݈݈݁ܿܽ

Equation 1. where the precision is the portion of molecules predicted as active which are really active; and the recall is the portion of active molecules which are predicted as active. Such a rule-based classifier predicts the activity of the molecules and gives an explanation by providing the pharmacophores implied in the decisions together with their fits in the molecules. Design of a workflow tool. A workflow tool has been developed with Qt, a cross-platform application framework available under open-source license. The tool, called Norns, can be downloaded free of charge from https://chemoinfo.greyc.fr/2017_metivier. The protocols used to compute the pharmacophore network and to generate the visualization web pages are provided along with the ABL1 dataset. Docking simulations. The protein crystallographic structure was loaded from the RCSB Protein Data Bank (PDB ID 1OPK) and prepared by using the Protein Preparation Wizard workflow in

ACS Paragon Plus Environment

17

Journal of Medicinal Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Maestro, version 10.2 (Schrödinger LLC, New York, NY, USA). In a first step, bond orders were assigned, hydrogen atoms were added, and water molecules were deleted beyond 5 Å from the ligand. In a second step, the PROPKA program52 was used to predict the protonation states for protein residues at pH 7 and the hydrogen bonding network was optimized by the minimization of sampled hydrogens. Finally a restrained minimization was carried out (convergence rmsd of 0.30 Å) using the OPLS_2005 force field. Compound 1 was docked into the active site of the target protein using the standard docking precision (SP) of Glide 6.7. The binding region was defined by a 10 Å × 10 Å × 10 Å grid box around the centroid of workspace ligand. The OPLS_2005 force field was used for grid generation and default settings were kept for all remaining parameters. Datasets. Eight diverse datasets covering six protein classes were selected for the benchmark study (Table 1). The datasets were collected from ChEMBL2,3 by following some restrictions: (i) only Ki and IC50 values expressed in nM were accepted as bioactivity data, (ii) measurements containing symbols as “>” or “