This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.

Article pubs.acs.org/jmc

The Fragment Network: A Chemistry Recommendation Engine Built Using a Graph Database

Richard J. Hall,* Christopher W. Murray, and Marcel L. Verdonk

Astex Pharmaceuticals, 436 Cambridge Science Park, Milton Road, Cambridge CB4 0QA, United Kingdom

Supporting Information

ABSTRACT: The hit validation stage of a fragment-based drug discovery campaign involves probing the SAR around one or more fragment hits. This often requires a search for similar compounds in a corporate collection or from commercial suppliers. The Fragment Network is a graph database that allows a user to efficiently search chemical space around a compound of interest. The result set is chemically intuitive, naturally grouped by substitution pattern and meaningfully sorted according to the number of observations of each transformation in medicinal chemistry databases. This paper describes the algorithms used to construct and search the Fragment Network and provides examples of how it may be used in a drug discovery context.



INTRODUCTION

We are living in the age of big data.1,2 The field of data science is expanding rapidly as we look to extract knowledge from the vast quantities of data that are now being routinely generated across many disciplines.3 As the amount of data grows, the way in which it is stored and queried must adapt. Relational databases organize data in tables consisting of columns and rows.4 The querying of complex data stored in such a format requires joining multiple tables together; when these tables become large, this becomes computationally expensive and may put strain on the database. A graph database, consisting of nodes and edges instead of rows and columns, can provide a useful alternative model for describing relationships between entities.5 For example, the Google PageRank algorithm6 uses the webgraph that describes the links between pages of the worldwide web, where each node represents a web page and the edges are the hyperlinks between pages. A social network can also be represented using a graph database.7 In this scenario, the nodes represent users and the edges represent friendships or business acquaintances. An online retailer may provide product recommendations based on a graph database containing users, products, and purchases.8 We have become used to online stores suggesting similar products based upon the currently displayed book or other item. In this paper, we describe a similar recommendation system for medicinal chemistry, which we use for exploring the chemical space around a hit obtained from a fragment screen.

A fragment-based drug discovery (FBDD) campaign will typically start with the screening of a library of compounds using methods such as X-ray crystallography, NMR, surface plasmon resonance (SPR), thermal shift (Tm), or other biophysical techniques.9 Having discovered one or more fragment hits, the hit validation stage of a project involves testing sensible close analogues of a fragment hit. The purpose of this phase is to understand the nature of binding more fully, to generate some interpretable structure-activity relationships (SAR) around the hit, and to optimize the fragment itself before it is grown toward a lead compound. If close analogues of a fragment are available in-house or commercially, this often allows for quicker and cheaper hypothesis testing than synthesis of new compounds. Hence “SAR by catalogue/collection” is a popular way to rapidly explore the chemical space around a fragment.10 Good search tools are required to mine the many millions of compounds that are currently available in fragment space. A chemist may need to construct many substructural search queries11 in order to find all relevant compounds in the chemical space around a fragment hit. Such queries may need to be complex in order to avoid being deluged by large numbers of uninteresting compounds, and in general the approach can be error prone and time-consuming.

Alternatively, a similarity search may be used to identify near neighbors of a fragment hit.12,13 This approach uses chemical fingerprints to assess the similarity of pairs of molecules. A chemical fingerprint can be represented as a list of zeroes and ones that indicate the presence or absence of features in a molecule. Compounds are considered similar when there is overlap between these features. A fragment fingerprint will have far fewer features than a druglike molecule. This can result in low similarity scores for pairs of related fragments, and so interesting analogues can get lost in the noise.

Other approaches to navigating chemical space include The Scaffold Tree,14 whereby compounds are classified according to their molecular framework. A matched molecular pair analysis allows compounds that differ by a single point of change to be retrieved.15,16 Ertl developed Scaffold Keys that allow one to search compounds based on scaffold similarity.17 Many other scaffold hopping methods have been developed that are capable of retrieving similar compounds; a recent volume edited by Brown provides an overview.18 Bajorath et al. have suggested that a chemical space network might be used to navigate between compounds that share similar properties.19 For example, compounds that form a matched pair or that have a similarity score above a particular threshold might be joined by an edge. This kind of network requires a relatively expensive “all versus all” comparison of the compounds in the graph database to be performed. Another way of building a graph database is the SmallWorld approach of Sayle et al.20 This approach requires the computation and storage of every subgraph of every molecule in the database. Searching the resulting graph database is faster than fingerprint-based similarity methods for large data sets.

The Fragment Network described here is a chemical space network that treats each compound as a set of rings, linkers, and substituents. The nodes and edges of the Fragment Network are generated by iterative removal of these groups from the parent molecule. The metadata stored along with each node and edge allows us to filter, group, and sort the neighbors of a compound of interest. In turn, the neighbors of these nodes will contain further interesting transformations of the parent compound. Although the graph database may contain hundreds of millions of nodes and edges, we need only search in the local area around the compound of interest, which means that the result set is generated quickly enough to create an interactive tool. The results obtained, via a simple, user-friendly web interface, are in line with the way a medicinal chemist thinks about SAR and the tool has proved popular with the chemists at Astex.

© 2017 American Chemical Society

Received: June 2, 2017 Published: July 15, 2017

DOI: 10.1021/acs.jmedchem.7b00809 J. Med. Chem. 2017, 60, 6440−6450

Journal of Medicinal Chemistry

Figure 1. Nodes and edges in the Fragment Network generated from 4-hydroxy-biphenyl.
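The Introduction notes that fingerprint similarity can understate the relatedness of small fragments because they have few features. The following is a minimal sketch of that effect, assuming a set-based Tanimoto coefficient and entirely hypothetical feature sets (the integers stand in for fingerprint bits and do not come from any real molecule). The same two-feature difference is applied to a small "fragment" and to a larger "druglike" pair.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical feature sets: integers stand in for fingerprint bits.
shared_fragment = set(range(4))            # 4 features common to both fragments
fragment_a = shared_fragment | {100, 101}  # 2 features unique to fragment A
fragment_b = shared_fragment | {200, 201}  # 2 features unique to fragment B

shared_druglike = set(range(38))           # 38 features common to both molecules
druglike_a = shared_druglike | {100, 101}
druglike_b = shared_druglike | {200, 201}

fragment_score = tanimoto(fragment_a, fragment_b)   # 4 shared / 8 total
druglike_score = tanimoto(druglike_a, druglike_b)   # 38 shared / 42 total
print(fragment_score, druglike_score)
```

In this toy case the identical change costs the fragment pair far more similarity than the druglike pair, which is one way to picture how related fragments can be ranked low in a similarity search.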



METHODS

Constructing the Fragment Network. To add a compound into the Fragment Network, we must first generate its set of nodes and edges. We use the compound 4-hydroxy-biphenyl to demonstrate the algorithm (see Figure 1). This compound is of historical interest to the fragment-based drug discovery community as it was found by Hajduk et al. as a second site binder against stromelysin in one of the first fragment screening campaigns to be reported in the literature.21 We refer to the phenyl rings as ring R1 and ring R2 (Figure 1, node 1). The first node added to the network is the compound itself (Figure 1, node 1). We then generate new nodes by identifying and removing each ring, linker, and substituent from the starting compound (see Supporting Information for more details). This process results in 4 new nodes: (i) a biphenyl node (removal of the hydroxyl; Figure 1, node 2), (ii) a phenol node (removal of ring R2; Figure 1, node 3), (iii) a disconnected node containing a phenyl ring and a hydroxyl group (removal of ring R1; Figure 1, node 4), and (iv) a disconnected node containing a phenyl ring and phenol (removal of the linker between ring R1 and ring R2; Figure 1, node 5). Each of these new nodes is connected to the parent node by an edge. The algorithm then continues recursively; for each of the newly created nodes (nodes 2−5), we identify and remove each ring, linker, and substituent to create new nodes that are joined to the parent by an edge. The algorithm results in a set of 8 nodes joined by 14 edges.

New compounds can be added in turn to the network, creating new nodes and edges as necessary. We have incorporated compounds from the Astex registry and from commercial vendors (eMolecules22 and Sigma-Aldrich23 catalogues) into our network. Compounds that can be bought from preferred suppliers (suppliers with a history of fulfilling orders in a timely fashion) are differentiated from other commercial compounds.
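The recursive construction described above can be sketched in a few lines. This is a deliberately simplified toy model, not the authors' implementation: a molecule is assumed to be just a set of group labels, with no chemistry, no linkers, and no disconnected fragments, so its node and edge counts are not expected to match the 8 nodes and 14 edges of Figure 1. The function name `build_network` and the group labels are hypothetical.

```python
def build_network(groups):
    """Return (nodes, edges) obtained by repeatedly removing one group.

    Each node is a frozenset of group labels; each edge is a
    (parent, child, removed_group) tuple.
    """
    start = frozenset(groups)
    nodes = {start}
    edges = set()
    stack = [start]
    while stack:
        parent = stack.pop()
        if len(parent) == 1:
            continue  # a single group is not fragmented further in this toy model
        for group in parent:
            child = parent - {group}
            edges.add((parent, child, group))
            if child not in nodes:  # reuse a node that already exists
                nodes.add(child)
                stack.append(child)
    return nodes, edges


# Hypothetical labels for the groups of 4-hydroxy-biphenyl.
nodes, edges = build_network({"ring R1", "ring R2", "hydroxyl"})
print(len(nodes), "nodes,", len(edges), "edges")
```

For three groups this toy model gives 7 nodes and 9 edges. The point it illustrates is that child nodes reached by different removal orders are merged, which is also what lets different parent compounds share nodes once many compounds are added.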
We have also added compounds from ChEMBL24 and ligands from the Protein Data Bank (PDB).25 Our aim has been to build a tool that will allow us to investigate the SAR around a fragment-sized molecule. Because our fragment library contains compounds containing no more than 16 heavy atoms, we felt that 24 atoms represented an appropriate heavy atom count (HAC) limit. The network currently contains around 5 million compounds with a HAC of 24 or fewer. The resulting network contains a total of 23 million nodes and 107 million edges.

Figure 2. Summary of Fragment Network results based on a query requesting “medium” changes to 4-hydroxy-biphenyl. Each box shows a group of results: for example, box G indicates that 19 compounds were returned which have an addition at the 4 position of ring R2. The text indicates the total number of compounds (hits) in each group and the nature of the path from the query to the results. Node numbers refer to nodes in Figure 1: for example, the compounds in box H are connected to 4-hydroxy-biphenyl by a path of length two that also includes node 3, Figure 1. Up to three compounds from each group are shown as examples. The full set of hits is provided in the Supporting Information.

The nodes and edges are stored in a Neo4j graph database.26 We store a number of key properties as attributes and labels of the nodes and edges of the network. The chemistry of each node is represented using a SMILES string.27,28 The number of heavy atoms and the number of ring atoms are also stored for each node. We label the edges between nodes with information about the bonds that are being made and broken. For example, in Figure 1, the edge between nodes 1 and 2 is labeled as removal of a hydroxyl group from the para position. Similarly, the edge between nodes 1 and 3 is labeled as removal of a phenyl ring from the para position. These attributes are used in order to filter, group, and sort the result set. Further details on the exact nature of the metadata are provided in the Supporting Information. Not every node in the network is an available compound.
We label those that are available compounds with a general identifier, as well as an identifier for each database in which the compound can be found. For example, in Figure 1, node 1 is labeled as an available compound, as well as being labeled with an Astex registry identifier, the ChEMBL identifier 73380, and the eMolecules identifier 480834. These labels allow us to provide hyperlinks to the source databases as part of our web-based interface.

Querying the Fragment Network. When searching the Fragment Network, the input query molecule is first converted to a nonisomeric canonical SMILES string and the matching node in the graph database is selected as the start node. If there is no matching start node, a set of nodes and edges for the query compound is created and merged into the network as described above. The default search query returns all available compounds from graph nodes that are zero, one, or two edges away from the start node. This is a very efficient operation in a graph database: for around 2000 compounds in the main Astex fragment library, the median time taken to return all paths of length 2 is 170 ms per compound, which allows us to use the tool in a very interactive fashion. We provide an option to restrict the results to compounds in specific databases; typically, one might first search for company registry compounds, expanding the search to commercially available compounds if suitable material is not available in-house. We can also place limits on the size of the compounds in the hit list by restricting the change in heavy atom count or ring atom count. Our user interface provides defaults for “small” (add or remove up to two atoms, no change in the number of ring atoms), “medium” (add or remove up to three atoms, change the number of ring atoms by up to one atom), and “large” (add or remove up to six atoms, change the number of ring atoms by up to six atoms) changes. The default option is to return “medium” sized changes.

Grouping the Results. Having obtained a set of compounds from nodes that are zero, one, or two edges away from the query compound node, we next categorize and group these compounds according to the types of transformations they represent. To demonstrate this categorization, a query requesting “medium” changes to 4-hydroxy-biphenyl was performed. Compounds were restricted to those available from preferred suppliers. The results obtained for this query are summarized in Figure 2. The path length between the result node and the query node, combined with the change in the number of atoms, as well as the metadata stored on the edges connecting the nodes, has been used to classify each transformation.
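The search and classification logic can be sketched as follows. This is a minimal in-memory illustration in plain Python, assuming a tiny hand-written graph rather than the Neo4j database used by the authors. The `Node` fields, the `search` and `classify` functions, and the example edges are hypothetical, and the classification uses only the heavy atom counts along each path (the edge-label matching that the full method requires for replacements is omitted).

```python
from collections import namedtuple

Node = namedtuple("Node", "name hac ring_atoms available")

# Hypothetical toy graph around 4-hydroxy-biphenyl.
NODES = {n.name: n for n in [
    Node("4-hydroxy-biphenyl", 13, 12, True),
    Node("biphenyl", 12, 12, True),
    Node("phenol", 7, 6, True),
    Node("4-methoxy-biphenyl", 14, 12, True),
    Node("4'-fluoro-4-hydroxy-biphenyl", 14, 12, True),
]}
EDGES = [
    ("4-hydroxy-biphenyl", "biphenyl"),
    ("4-hydroxy-biphenyl", "phenol"),
    ("4-methoxy-biphenyl", "biphenyl"),
    ("4'-fluoro-4-hydroxy-biphenyl", "4-hydroxy-biphenyl"),
]
ADJ = {name: [] for name in NODES}
for a, b in EDGES:
    ADJ[a].append(b)
    ADJ[b].append(a)


def classify(hacs):
    """Classify a path from the heavy atom counts of its nodes."""
    if len(hacs) == 1:
        return "exact match"
    if len(hacs) == 2:
        return "addition" if hacs[1] > hacs[0] else "deletion"
    first, middle, last = hacs
    if middle < first and middle < last:
        return "replacement"
    if first < middle < last:
        return "double addition"
    if first > middle > last:
        return "double deletion"
    return "other"


def search(start, max_hac_change=3, max_ring_change=1):
    """Return (compound, transformation) pairs within two edges of start."""
    query = NODES[start]
    paths = [[start]]
    for n1 in ADJ[start]:
        paths.append([start, n1])
        paths.extend([start, n1, n2] for n2 in ADJ[n1] if n2 != start)
    results = []
    for path in paths:
        end = NODES[path[-1]]
        if not end.available:
            continue  # only available compounds are returned
        if abs(end.hac - query.hac) > max_hac_change:
            continue
        if abs(end.ring_atoms - query.ring_atoms) > max_ring_change:
            continue
        results.append((end.name, classify([NODES[n].hac for n in path])))
    return results


medium = dict(search("4-hydroxy-biphenyl"))        # "medium" defaults
large = dict(search("4-hydroxy-biphenyl", 6, 6))   # "large" settings
print(medium)
```

With the "medium" thresholds phenol is filtered out because it differs by six ring atoms, and it only appears as a deletion once the "large" settings are used, mirroring the behaviour described for Figure 2.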


Exact Match (Figure 2A). A path of length zero indicates an exact match with the query compound. Because we have restricted the results to preferred suppliers, this means the query compound is available for purchase. A hyperlink to the Astex purchasing platform allows one-click ordering for any compound that is commercially available.

Deletions (Figure 2B). A path length of one, combined with a decrease in HAC, indicates a deletion. Searching the Fragment Network for “medium” changes to 4-hydroxy-biphenyl (Figure 2, A) returns a single example of a deletion, namely the removal of the hydroxyl group, resulting in the biphenyl compound (see Figure 2, B). Note that there is also a deletion edge that links the query compound to phenol (see Figure 1, edge between node 1 and node 3). However, this is categorized as a “large” change (removal of six ring atoms), and so phenol is only returned if we expand our search to include “large” changes. Our search algorithm only returns paths to those nodes that are marked as available compounds; this means that the paths of length one between node 1 and node 4 and between node 1 and node 5 in Figure 1 are discarded.

Additions (Figure 2C−G). A path of length one, combined with an increase in HAC, represents an addition. A search of our full network for “medium” changes to 4-hydroxy-biphenyl shows that there are 12 compounds available from preferred suppliers with an addition at the 2 position of ring R1 and two compounds with an addition at the 3 position of ring R1 (see Figure 2C,D). There are 11, 13, and 19 compounds available with a substituent at the 2, 3, and 4 positions of ring R2, respectively (see Figure 2E−G). It is interesting that Hajduk et al. explored many of these changes in their initial fragment elaboration work (i.e., hit validation) on stromelysin.21 The metadata stored on each edge allows us to group these sets of compounds in a meaningful way, i.e., 2, 3, and 4 addition, which is an attractive feature of the method. For example, if the crystal structure of a protein−fragment complex indicated that addition at a particular position might improve affinity (perhaps by filling a pocket or making a hydrogen bond) then a user could focus on a particular group of results in order to select follow up compounds.

Replacements (Figure 2H−L). A path of length two can represent a number of transformations. If the central node in the path has a lower HAC than the start and end nodes, this indicates a replacement (i.e., the removal of a group and addition of another group). Such paths are only categorized as a replacement if the edge labels also match; if we remove a group from an ortho position, we must add a group to the same ortho position to accept the result. In our 4-hydroxy-biphenyl search, we find 97 replacements for the hydroxyl group and 83 replacements for ring R2 (see Figure 2L,H). The ability to return rings of different size and aromaticity to the query molecule with a single query is an attractive feature of the method. The paths that describe the replacement of ring R1 can be subdivided into two sets. The most interesting changes are probably those that preserve the para arrangement of the phenyl and the hydroxyl. We have defined equivalencies between rings of different sizes to allow us to replace a 1,4 six-membered ring with a five- or seven-membered ring that retains an approximate match to the vectors in the query molecule. There are 22 commercially available compounds in this category (see Figure 2J). These molecules represent examples that would only be identified by multiple substructure queries. The second set of ring replacements does not preserve the arrangement of the hydroxyl and phenyl groups. These may still be of interest if one is interested in understanding how well changes in the position of a substituent are tolerated. The 46 examples of commercially available ring replacement in this category include 2-hydroxy-biphenyl and 3-hydroxy-biphenyl as interesting examples (see Figure 2I). Note that if we were to perform a search requesting “large” changes, we would be able to replace the phenyl with a fused ring system. The equivalency rules ensure the vectors of the new rings can be chosen as an approximate match to the vectors in the query molecule. Finally, there are 25 commercially available compounds that replace the (zero length) linker between the phenyl rings (Figure 2K). Introducing an additional atom between the rings will have a significant effect on the geometry of the fragment and may not be tolerated. However, if we were looking for ideas to replace a methylene linker, then an ether or keto replacement might be a useful transformation.

Double Additions and Double Deletions. There are additional transformations that involve a path length of two. If the HAC of each node in the path increases, this indicates a double addition (examples not shown in Figure 2). If the HAC of each node in the path decreases, this indicates a double deletion (there are no double deletions associated with the 4-hydroxy-biphenyl example when one searches with medium settings). Using the metadata from the nodes and edges, these double additions and deletions can be filtered and grouped in the same way as the paths of length one.

Sorting the Results. To display the most relevant matches at the beginning of each group set, we sort the compounds within each group based on the likelihood of each type of transformation. The likelihoods are derived from an analysis of databases containing compounds synthesized and tested in medicinal chemistry applications. Here we illustrate the process using the ChEMBL database (although for internal applications we use the internal Astex registry of compounds). The likelihoods are derived from all transformations in the Fragment Network that join pairs of ChEMBL compounds with path lengths of one or two. We believe these paths between pairs of compounds observed in the medicinal chemistry literature are highly relevant to the sorting of compounds for medicinal chemistry purposes. Table 1 lists the top 10 additions from a total of 0.5 million paths of length one between pairs of ChEMBL compounds in the Fragment Network.

Table 1. Ten Most Commonly Observed Substituent Additions in ChEMBL^a

^a The number of observations is used to sort observations within each group set.
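The grouping and frequency-based sorting can be sketched as below. This is an illustrative outline only: the observation counts are invented placeholders (they are not the values in Table 1, although their relative order follows the methyl, chloro, methoxy, fluoro ranking stated in the text), and the compound labels, position strings, and the function `group_and_sort` are hypothetical.

```python
from collections import defaultdict

# Hypothetical counts for illustration only; they are NOT the values in Table 1.
ADDITION_COUNTS = {"methyl": 500, "chloro": 400, "methoxy": 300, "fluoro": 200}


def group_and_sort(hits, counts):
    """Group hits by substitution position, then sort each group by count."""
    groups = defaultdict(list)
    for compound, position, substituent in hits:
        groups[position].append((compound, substituent))
    return {
        position: sorted(members, key=lambda m: counts.get(m[1], 0), reverse=True)
        for position, members in groups.items()
    }


# Hypothetical hits: (compound label, position of the addition, substituent added).
hits = [
    ("cpd-A", "ring R2, 4-position", "fluoro"),
    ("cpd-B", "ring R2, 4-position", "methyl"),
    ("cpd-C", "ring R1, 2-position", "methoxy"),
    ("cpd-D", "ring R2, 4-position", "chloro"),
]
ranked = group_and_sort(hits, ADDITION_COUNTS)
for position, members in ranked.items():
    print(position, members)
```

Substituents with no recorded observations default to a count of zero here and therefore sort to the end of their group; how the real tool treats unseen transformations is not stated in the text.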

The highest likelihood is given to the addition of a methyl group, followed by chloro, methoxy, fluoro, hydroxyl, etc. For substituent replacements, the sort order is dependent on both the initial substituent and the substituent it is replaced with, but is otherwise defined analogously to the sort order (likelihood) for additions. We analyzed 96 million paths of length two between ChEMBL compounds in order to derive a set of likelihoods for replacement of rings, linkers, and substituents. Table 2 lists the top 10 observations for replacement of a methyl and for a hydroxyl group. Again, this sorting method allows us to present the most commonly observed replacements at the beginning of the list of compounds displayed to the medicinal chemist. Feedback from users indicates that it is the grouping of results and the sorting within the groups that is a particularly attractive feature of the application. Retrieved results are in line with the way a medicinal chemist thinks about hit validation: they are grouped according to substitution position and, because common transformations are presented early in each group, most compounds of interest are easily identified. Additionally, because we only need to search two edges from a query compound, the results are rapidly retrieved, meaning that the tool can be incorporated into an interactive workflow. The network search scales well as we incorporate more compounds because we only ever consider those compounds that are within two edges of the search query.

Table 2. Ten Most Common Replacements for Methyl and Hydroxyl Substituents in ChEMBL

RESULTS AND DISCUSSION

To demonstrate the utility of the Fragment Network, we present two examples of how the tool might be used in a fragment-based drug discovery campaign.

Example 1: PKB. In 2007, we reported on the identification of novel inhibitors against protein kinase B (PKB or Akt), using fragment-based lead discovery.29 The initial fragment hit that gave rise to this series was compound 1 (see Figure 3). The pyrazole moiety of the fragment interacts with the hinge region of PKB via two strong hydrogen bonds, and the phenyl ring occupies a highly lipophilic area (see Figure 3). The fragment had an IC50 of 80 μM against PKB and was developed into low nM leads using structure-based design. Before the project team committed significant medicinal chemistry resource to optimize this fragment hit into a lead, a “hit validation” analysis was carried out on compound 1. The Fragment Network was not available at the time of this project, but we will show here how it could have been used during this hit validation stage. We only discuss compounds that are available from commercial vendors here, but additional compounds were tested that either existed in our compound registry or were synthesized specifically for the purpose of the hit validation process.

If the Fragment Network is searched for “medium” changes to compound 1, the first compound returned is the exact match for this compound, as it can itself be purchased from commercial vendors (see Figure 4). The next results section shows possible “deletions”, and in this case only contains one compound (compound 2), where the methyl group is removed from the pyrazole ring. Compound 2 was tested as part of the hit validation phase of the PKB project at Astex and was found to have IC50 = 135 μM.29 Although the methyl group does improve affinity in compound 1, its group efficiency (GE)30 is low compared to the ligand efficiency31 of compound 2 (GE(methyl) = 0.32 vs LE = 0.47 for compound 2). The methyl was not retained as the fragment was optimized into a lead.

The next results section generated when the Fragment Network is searched for “medium” changes to compound 1 involves additions to the pyrazole ring in the 5-position (compounds 3−5, see Figure 4). Two of these compounds were tested during the hit validation stages of this project: compound 3 had an IC50 of 72 μM, i.e., the additional methyl group provided little benefit, and compound 4 had an IC50 of 660 μM, indicating that the 5-amino group was detrimental to potency, possibly because the position of this group is suboptimal for forming an additional hydrogen bond with the hinge (see Figure 3). Five compounds are available that represent single additions to the phenyl ring, two in the 4-position, two in the 3-position, and one in the 2-position (see Figure 4). Some of these compounds were tested as part of the hit validation process. Although there appears to be space in the binding site to accommodate small substituents (see Figure 3),

Figure 3. X-ray structure of the initial fragment hit 1 against PKA−PKB “chimera”. The hydrogen bonds formed between the fragment and the hinge are shown, as well as the protein surface, which is colored according to the lipophilicity of the surrounding protein residues.


Figure 4. Exact match, deletions, and additions returned when the Fragment Network is searched for “medium” changes to compound 1.

Figure 5. Additions to the phenyl ring returned when the Fragment Network is searched for “medium” changes to compound 2. The first five compounds returned for the 4-position, the 3-position, and the 2-position are shown, respectively.

Figure 6. Pyrazole replacements returned when the Fragment Network is searched for “medium” changes to compound 2. The rank position of each compound in the list of 422 compounds is shown in parentheses.


Figure 7. Top five phenyl replacements returned when the Fragment Network is searched for “medium” changes to compound 2.

it was found that the affinity of compound 9 was significantly reduced compared to compound 1. This is probably due to the fact that the 2-substituent induces a conformational change in the fragment that is not tolerated.

The Fragment Network search from compound 1 produces additional suggestions for transformations, but at this stage a search starting from the less decorated fragment 2 results in a much wider range of options. The use of hyperlinks in the results pane means that a follow up search can be initiated by simply clicking any compound of interest. A search for medium changes to compound 2 returns 14 examples of additions to the phenyl ring in the 4-position, 10 examples of additions in the 3-position, and six examples of additions in the 2-position. Figure 5 shows the first five of these additions at the 4, 3, and 2 positions, respectively. Interestingly, none of the compounds shown in Figure 5 were tested against PKB at Astex. However, a number of other, noncommercially available compounds in this category from our registry (and that are found when the Fragment Network search includes in-house compounds) were tested to probe the effects of small substituents on the phenyl ring. None of these additions provided a significant benefit, and the corresponding features were therefore not retained as the fragment was further optimized; however, an sp3 carbon atom in the 4-position (e.g., compound 12) was incorporated in order to grow the fragment toward subpockets where additional potency was obtained (see Saxty et al.29).

The Fragment Network returns 422 compounds in the section that lists replacements for the pyrazole ring in compound 2 (see Figure 6). When a section contains a significant number of compounds, the sorting algorithm applied by the Fragment Network becomes important. The first row in Figure 6 shows the first five compounds presented to the medicinal chemist. Pyridine and morpholine are both known kinase hinge binding groups, and examples of both were tested against PKB at Astex. The morpholine (30) had no measurable affinity, but the pyridine (26) was weakly active and was developed into a subseries of PKB inhibitors.32 The second row in Figure 6 shows selected compounds from the top 50 of the 422 pyrazole replacements returned by the Fragment Network. All of these ring systems are known kinase hinge binders, some of which were tested as part of the hit validation phase and found to be active against PKB. However, none of the tested compounds was more ligand efficient than compound 2.

The Fragment Network also returns a section with replacements of the phenyl ring in compound 2. In total, 18 compounds were returned in this section, of which the top five are shown in Figure 7. These are all sensible suggestions, many of which were tested during the PKB project. An interesting example here is compound 37, which, apart from changing the electronics of the ring, also alters the conformational profile for the rotation around the bond between the rings. In the X-ray structure of compounds 1 and 2, the phenyl ring is almost coplanar with the pyrazole (see Figure 3) and the pyridine nitrogen in compound 37 reinforces this geometry. However, the affinity of this compound is relatively low (∼500 μM), indicating that changing the electronics of the ring in this very hydrophobic part of the protein (see Figure 3) is more detrimental to affinity than any benefit obtained from freezing out the bioactive conformation of the ligand.

The remaining sections returned by the Fragment Network for medium changes to compound 2 are (i) additions to the nitrogen atoms in the pyrazole ring; as these atoms form the key interactions with the PKB hinge (see Figure 3), compounds of this type were not tested and would be quickly dismissed by a medicinal chemist analyzing the search output. (ii) Replacements of the linker between the phenyl and the pyrazole; it was felt that this would alter the vectors so dramatically that none of these compounds were tested. (iii) Double additions to the pyrazole and phenyl rings (e.g., difluoro compounds); many of these were tested, but no additional insights were obtained.

The conclusion of this hit validation process was that, apart from the methyl group on the pyrazole, the initial hit was essentially optimal. Hence the phenyl-pyrazole moiety was retained as this fragment was optimized into a potent lead molecule. The Fragment Network search retrospectively identified all avenues explored laboriously by the project team at the time of the hit validation process of fragment hit 1. It is clear that the compounds shown in Figures 4 and 5 could also have been obtained by grouping the results of a substructure search using compound 2 as the query. However, for the ring replacements shown in Figures 6 and 7, this is not the case as many of the results involve changes in the size of the rings or changes in the nature of the ring in terms of hetero atoms, bond types, etc. A similarity search based on compound 2 is unlikely to return all of the compounds shown in Figures 6 and 7 near the top of a ranked list. For example, compound 2 has 11 atoms and with “medium” change settings, we allow compounds with ±3 heavy atoms in our result set. Using the RDKit33 implementation of Morgan fingerprints12 with a radius of 2 and ranking all compounds with 8−13 heavy atoms that are available from commercial vendors (i.e., approximately the same set of compounds searched by the Fragment Network for the analysis shown), only five of the 15 compounds 26−40 were found in the top 50 and nine of the 15 compounds were found in the top 500.

Example 2: HCV Protease−Helicase. More recently, Saalau-Bethell et al. reported on the discovery of inhibitors of the HCV NS3 protein, using fragment-based lead discovery.34 As reported in the same paper, these compounds inhibit the protease activity of this target via a newly discovered allosteric mechanism. The compounds bind to a so-called “tunnel” site between the helicase and protease domains and stabilize the autoinhibited (in terms of protease activity) form of the protein. One of the fragment hits obtained from the fragment

DOI: 10.1021/acs.jmedchem.7b00809 J. Med. Chem. 2017, 60, 6440−6450


Figure 8. X-ray structure of the initial fragment hit 41 against HCV protease−helicase. The weak hydrogen bond between the amine of the fragment and Asp79 is shown, as well as the protein surface, which is colored according to the lipophilicity of the surrounding protein residues.

Figure 9. Exact match, deletions, and first returned compound in addition at the 2, 3, and 4 positions of phenyl ring A section, respectively, when the Fragment Network is searched for “medium” changes to compound 41.

screen against this target was compound 41 (see Figure 8), which was found to have an IC50 of ∼500 μM in a protease activity assay involving the full-length protease−helicase protein.34 The recognition between fragment and protein is quite different from the previous example. Here, the main interactions toward the center of the pocket are lipophilic in nature and involve the two phenyl rings of the fragment. The amine of the fragment does form a hydrogen bond with the backbone carbonyl of Asp79, but it is quite long (3.3 Å) and solvent exposed. Again, we will illustrate how the Fragment Network may have been used during the hit validation phase of this project. For simplicity, we will refer to the two phenyl rings in compound 41 as ring A and ring B (see Figure 8).

When the Fragment Network was searched for medium changes to compound 41, the first hit returned was compound 41 itself (see Figure 9). The next section contains deletions and consists of a single compound where the amino-methylene has been removed from compound 41. This compound (42) was tested against HCV protease−helicase but was found to be inactive. This indicates that the polar contacts involving the amine are important for activity, although reduced solubility of 42 may also affect the results from the assay. Three sets of compounds were returned by the Fragment Network for additions to ring A: four compounds with substituents in the 2-position (e.g., compound 43), four compounds substituted in the 3-position (e.g., compound 44), and nine examples of compounds substituted in the 4-position (e.g., compound 45). A number of the 4-substituted compounds returned by the Fragment Network were tested against HCV during the hit validation phase of this project, including compound 45, but the SAR was difficult to interpret, possibly due to the fact that this position is partially solvent exposed. Interestingly, no 3-substituted analogues of compound 41 were tested during the hit validation phase of this project. However, several 3-substituted analogues of compound 51 were tested (see below).

In the structure of compound 41 (see Figure 8), the 2-position of ring A points directly into the protein surface, suggesting that substituents at this position may not be tolerated. However, compound 43 was tested and has an IC50 comparable to that of compound 41. Interestingly, when the X-ray structure of the HCV protease−helicase complex with compound 43 was determined, it became apparent that the compound adopts a flipped binding mode, swapping ring A and ring B and allowing the 2-chloro substituent to be presented to a small subpocket in the tunnel site (see Figure 8). This subpocket can be accessed more ideally from the 2-position on ring B (see Figure 8). However, when the Fragment Network is searched for medium changes to compound 41, it highlights the fact that no compounds are available from commercial vendors that have single small substituents on ring B. Hence, the project team synthesized a number of analogues to probe this region (see Figure 10). For example, compound 46 has an IC50 of 41 μM and appears to present the chloro substituent ideally to the subpocket. Also, the 2,5-difluoro analogue (compound 47) probes this region and has an IC50 of 34 μM. The two fluorine atoms flanking the ether linkage help to stabilize the T-shaped bound conformation.

Figure 10. Examples of analogues of compound 41 that were specifically synthesized for the hit validation phase of the HCV project at Astex. The Fragment Network highlighted the fact that no compounds with small substituents on ring B are available from commercial vendors, and compounds 46 and 47 were synthesized to address this. Compound 48 was the optimized fragment that formed the start point for hit-to-lead optimization.34

The Fragment Network returns seven compounds in the section that represents replacements of ring B, two where the position of the aminomethylene, relative to the phenoxy, is unchanged, and five where the aminomethylene, relative to the phenoxy, is moved to an ortho or meta arrangement (Figure 11). Some of these compounds were purchased and tested, and compound 51 was found to have higher affinity (130 μM) than the initial fragment hit 41. This 3-amino-methylene substituent was retained in the molecule during the subsequent hit-to-lead campaign.34 Another seven examples were returned by the Fragment Network in the section containing replacements for ring A in compound 41 (see Figure 11). None of these exact compounds were tested, but several of the ring replacements were tested in the context of the aminomethylene substituents being in the 3-position (see above). Finally, the Fragment Network returned 14 examples of compounds representing replacements of the ether linker in compound 41 (see Figure 12). Several of these compounds were tested during the hit validation phase of this project; the compound with the ketone linker (61) showed some promise (IC50 = 180 μM) but was initially down-prioritized with respect to compounds with the ether linker. However, the ketone-linked subseries was followed up later in the project and led to a separate patent.35

Figure 11. Replacements for ring A and ring B when the Fragment Network is searched for medium changes to compound 41. The top row shows the compounds in the section containing ring replacements for ring B, where the relative position of the two substituents is maintained. The second row shows the compounds in the section containing ring replacements for ring B, where the relative position of the substituents is different from that in compound 41. The bottom line shows the first five compounds in the section containing replacements for ring A.

Figure 12. First five replacements returned for the ether linker when the Fragment Network is searched for medium changes to compound 41.

The optimized fragment that resulted from the hit validation phase of the HCV protease−helicase project was compound 48 (see Figure 10, compound 3 in ref 34), which has an IC50 of 20 μM. It combines filling the subpocket in the tunnel site via a fluoro substituent in the 2-position of ring A (compound 47) with the beneficial effect obtained by moving the aminomethylene group from the 4-position to the 3-position on ring A (compound 51). Running the Fragment Network on the initial fragment hit 41 identified all the avenues explored by Astex scientists at the time of this project and highlighted the fact that no analogues of compound 41 are available from commercial vendors that have single small substituents on the B ring. Again, the compounds displayed in Figure 9 would have been found very straightforwardly with a substructure search from compound 41. However, the process would have been much more involved for many of the suggested ring replacements (Figure 11), for moving substituents around ring A (Figure 11), and for replacements of the ether linker (Figure 12), which all represent sensible areas of chemistry to explore.

CONCLUSIONS

We have described the Fragment Network, a novel approach for exploring the chemical space around a compound, using graph database technology. A search interface was designed as a tool for discovery of close analogues during the hit validation stage of an FBDD project. Searches are fast enough for the tool to be fully interactive, and the compounds returned are grouped and sorted in a way that is intuitive to a medicinal chemist. We have demonstrated how the Fragment Network may have been used on two FBDD programs described in the literature. The tool identified, essentially instantly, all initial avenues explored by chemists on both projects. Many of the compounds suggested by the Fragment Network would have been difficult to find using traditional substructure queries or using standard similarity searches.

The search algorithm we have implemented focuses on a subset of paths of lengths 0, 1, and 2 from the query compound. Hence the current approach may miss some compounds of interest that can only be found via a longer path, for example, a compound that adds three substituents to the query compound. In practice, one can find such compounds by initiating a new search from a fragment that adds one or two of these substituents. We believe that for hit validation this is an acceptable limitation; a more expansive search algorithm may be useful for other applications.

There are a number of additional features that would enhance the utility of the Fragment Network. One example is a better representation of tautomers: at present, each tautomer of a compound is represented as a unique node. A Fragment Network that groups tautomers under a common node could provide some advantages over the current method and is under investigation. Another limitation of the current implementation is that ring systems are only linked if they share a common substituent. A set of additional edges to join similar ring systems would be a useful addition to the network.

In the future, one can envisage enhancing the utility of the tool by adding other types of edges to provide links between related compounds. For example, one might add edges linking compounds that are active against a common protein target. A variant of the Fragment Network could also be used for SAR browsing during the hit-to-lead and lead optimization stages of a project. Finally, we are particularly interested in analyzing the nodes in the Fragment Network that represent compounds in our fragment library. A node with few chemically available neighbors may highlight a fragment that presents a synthetic challenge to further elaboration or it may indicate that a node is in a relatively unexplored region of chemical space. The outcome of this analysis will provide useful data for guiding future iterations of the fragment library.
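
The path-length limitation discussed in the Conclusions can be illustrated with a minimal Python sketch. This is not the search algorithm implemented in the Fragment Network, which follows only a subset of paths and runs in a graph database; the sketch simply enumerates every node within two edges of a query in a small in-memory graph. The function name `neighbours_within`, the node labels, and the edges are hypothetical and were chosen purely for illustration.

```python
from collections import deque


def neighbours_within(graph, query, max_hops=2):
    """Return {node: hop_count} for every node reachable from `query`
    in at most `max_hops` edges (breadth-first, shortest hop count)."""
    hops = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand nodes already at the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops


# Hypothetical toy network: each edge adds or removes one substituent.
# Labels are illustrative and do not correspond to compounds in the paper.
toy_graph = {
    "core": ["core+A", "core+B"],
    "core+A": ["core", "core+A+B"],
    "core+B": ["core", "core+A+B"],
    "core+A+B": ["core+A", "core+B", "core+A+B+C"],
    "core+A+B+C": ["core+A+B"],
}

from_core = neighbours_within(toy_graph, "core")
from_core_a = neighbours_within(toy_graph, "core+A")

# The triply substituted node is three edges from the core, so it is missed.
print("core+A+B+C" in from_core)
# Restarting the search from a singly substituted node brings it into range.
print("core+A+B+C" in from_core_a)
```

In this toy network the triply substituted node is not reached from the unsubstituted core, but it is reached when the search is restarted from a singly substituted node, which mirrors the workaround of initiating a new search from a fragment that already carries one or two of the substituents.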

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jmedchem.7b00809.

Detailed description of the algorithm used to generate nodes and edges; a full set of nodes, edges, and attributes for compounds listed in Figure 2; details on inserting nodes and edges into a Neo4j graph database and example network queries (PDF)
Figure 2 nodes data (TXT)
Figure 2 edges data (TXT)
Figure 2 attributes data (TXT)



AUTHOR INFORMATION

Corresponding Author

*Phone: +44 1223 435069. E-mail: [email protected].

ORCID

Richard J. Hall: 0000-0001-8578-9458
Marcel L. Verdonk: 0000-0002-6484-3328

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

We would like to thank the many scientists at Astex who have provided feedback on the Fragment Network, in particular Paul Mortenson, David Norton, and Alison Woolford, who helped design the user interface. We would also like to thank Andrew Woodhead for his comments on the manuscript.



ABBREVIATIONS USED

FBDD, fragment-based drug discovery; GE, group efficiency; HAC, heavy atom count; NS3, nonstructural protein 3; SPR, surface plasmon resonance; Tm, thermal shift



REFERENCES

(1) Big data; Wikipedia, 2017; https://en.wikipedia.org/wiki/Big_data (accessed March 31, 2017).
(2) Tetko, I. V.; Engkvist, O.; Koch, U.; Reymond, J. L.; Chen, H. BIGCHEM: Challenges and Opportunities for Big Data Analysis in Chemistry. Mol. Inf. 2016, 35 (11−12), 615−621.
(3) Data science; Wikipedia, 2017; https://en.wikipedia.org/wiki/Data_science (accessed March 31, 2017).
(4) Codd, E. F. A Relational Model of Data for Large Shared Data Banks. Commun. ACM 1970, 13 (6), 377−387.
(5) Angles, R.; Gutierrez, C. Survey of Graph Database Models. Assoc. Comput. Mach., Comput. Surv. 2008, 40 (1), 1−39.
(6) Page, L.; Brin, S.; Motwani, R.; Winograd, T., The PageRank Citation Ranking: Bringing Order to the Web. In Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, 1998, 1998; pp 161−172.
(7) Hanneman, R. A.; Riddle, M. Introduction to Social Network Methods; University of California, Riverside: Riverside, CA, 2005.
(8) Raeder, T.; Chawla, N. V. Modeling a Store’s Product Space as a Social Network. Advances in Social Network Analysis and Mining 2009, 164−169.
(9) Murray, C. W.; Rees, D. C. The Rise of Fragment-Based Drug Discovery. Nat. Chem. 2009, 1 (3), 187−192.
(10) Schulz, M. N.; Landstrom, J.; Bright, K.; Hubbard, R. E. Design of a Fragment Library that Maximally Represents Available Chemical Space. J. Comput.-Aided Mol. Des. 2011, 25 (7), 611−620.
(11) SMARTS - A Language for Describing Molecular Patterns; Daylight Chemical Information Systems, Inc.: Laguna Niguel, CA, 2008; http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html (accessed March 31, 2017).
(12) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50 (5), 742−754.
(13) Willett, P. Similarity-Based Virtual Screening Using 2D Fingerprints. Drug Discovery Today 2006, 11 (23−24), 1046−1053.
(14) Ertl, P.; Schuffenhauer, A.; Renner, S. The Scaffold Tree: an Efficient Navigation in the Scaffold Universe. Methods Mol. Biol. (N. Y., NY, U. S.) 2010, 672, 245−260.
(15) Griffen, E.; Leach, A. G.; Robb, G. R.; Warner, D. J. Matched Molecular Pairs as a Medicinal Chemistry Tool. J. Med. Chem. 2011, 54 (22), 7739−7750.
(16) Hussain, J.; Rea, C. Computationally Efficient Algorithm to Identify Matched Molecular Pairs (MMPs) in Large Data Sets. J. Chem. Inf. Model. 2010, 50 (3), 339−348.
(17) Ertl, P. Intuitive Ordering of Scaffolds and Scaffold Similarity Searching Using Scaffold Keys. J. Chem. Inf. Model. 2014, 54 (6), 1617−1622.
(18) Scaffold Hopping in Medicinal Chemistry; Brown, N., Ed.; Wiley-VCH Verlag GmbH & Co. KGaA: Weinheim, Germany, 2013; Vol. 58.
(19) Maggiora, G. M.; Bajorath, J. Chemical Space Networks: a Powerful New Paradigm for the Description of Chemical Space. J. Comput.-Aided Mol. Des. 2014, 28 (8), 795−802.
(20) Sayle, R. A.; Batista, J. C.; Grant, J. A., SmallWorld: Efficient Maximum Common Subgraph Searching of Large Databases. In Abstracts of Papers, 244th National Meeting of the American Chemical Society, Philadelphia, PA, Aug 19−23, 2012; American Chemical Society: Washington, DC, 2012; CINF123.
(21) Hajduk, P. J.; Sheppard, G.; Nettesheim, D. G.; Olejniczak, E. T.; Shuker, S. B.; Meadows, R. P.; Steinman, D. H.; Carrera, G. M.; Marcotte, P. A.; Severin, J.; Walter, K.; Smith, H.; Gubbins, E.; Simmer, R.; Holzman, T. F.; Morgan, D. W.; Davidsen, S. K.; Summers, J. B.; Fesik, S. W. Discovery of Potent Nonpeptide Inhibitors of Stromelysin Using SAR by NMR. J. Am. Chem. Soc. 1997, 119 (25), 5818−5827.
(22) eMolecules; eMolecules: La Jolla, CA, 2017; https://www.emolecules.com/ (accessed March 31, 2017).
(23) Sigma-Aldrich; Sigma-Aldrich Co. LLC, 2017; https://www.sigmaaldrich.com/united-kingdom.html (accessed March 31, 2017).
(24) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. ChEMBL: a Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40 (D1), D1100−D1107.
(25) Berman, H. M. The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1), 235−242.
(26) Neo4j; Neo Technology, Inc.: San Mateo, CA, 2017; https://neo4j.com/ (accessed March 31, 2017).
(27) Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Model. 1988, 28 (1), 31−36.
(28) Weininger, D.; Weininger, A.; Weininger, J. L. SMILES. 2. Algorithm for Generation of Unique SMILES notation. J. Chem. Inf. Model. 1989, 29 (2), 97−101.
(29) Saxty, G.; Woodhead, S. J.; Berdini, V.; Davies, T. G.; Verdonk, M. L.; Wyatt, P. G.; Boyle, R. G.; Barford, D.; Downham, R.; Garrett, M. D.; Carr, R. A. Identification of Inhibitors of Protein Kinase B using Fragment-Based Lead Discovery. J. Med. Chem. 2007, 50 (10), 2293−2296.
(30) Verdonk, M. L.; Rees, D. C. Group Efficiency: A Guideline for Hits-to-Leads Chemistry. ChemMedChem 2008, 3 (8), 1179−1180.
(31) Kuntz, I. D.; Chen, K.; Sharp, K. A.; Kollman, P. A. The Maximal Affinity of Ligands. Proc. Natl. Acad. Sci. U. S. A. 1999, 96 (18), 9997−10002.
(32) Boyle, R. G.; Saxty, G.; Verdonk, M. L.; Taylor, R. D.; Hamlett, C.; Sore, H. F. Heterocyclic Containing Amines as Kinase B Inhibitors. WO/2006/136823, 2006.
(33) Landrum, G. RDKit: Open-Source Cheminformatics Software, 2017; http://www.rdkit.org (accessed March 31, 2017).
(34) Saalau-Bethell, S. M.; Woodhead, A. J.; Chessari, G.; Carr, M. G.; Coyle, J.; Graham, B.; Hiscock, S. D.; Murray, C. W.; Pathuri, P.; Rich, S. J.; Richardson, C. J.; Williams, P. A.; Jhoti, H. Discovery of an Allosteric Mechanism for the Regulation of HCV NS3 Protein Function. Nat. Chem. Biol. 2012, 8, 920−925.
(35) Woodhead, A. J.; Hamlett, C. C. F. H.; Besong, G. E.; Chessari, G.; Carr, M. G.; Millemaggi, A.; Norton, D.; Saalau-Bethell, S. M.; Willems, H. M. G.; Thompson, N. T.; Hiscock, S. D. Pharmaceutical compounds. WO/2013/064543, 2013.
