Article pubs.acs.org/jmc
AnalogExplorer: A New Method for Graphical Analysis of Analog Series and Associated Structure−Activity Relationship Information
Bijun Zhang,† Ye Hu,† and Jürgen Bajorath*
Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstrasse 2, D-53113 Bonn, Germany
ABSTRACT: In recent years, several attempts have been made to develop graphical methods for the analysis of structure−activity relationships (SARs) in increasingly large and heterogeneous compound data sets. Among others, these approaches include extensions of conventional R-group tables and graph representations for the analysis of active analogs. Herein, we introduce AnalogExplorer as a new method for the graphical exploration of analog series. AnalogExplorer consists of three graphical components and is methodologically distinct from previous SAR visualization techniques. It is designed to deconvolute large series of analogs and systematically analyze and compare analog series contained in structurally heterogeneous data sets. In addition, analog subsets forming activity cliffs and R-groups responsible for cliff formation are easily identified in AnalogExplorer graphs. The design of AnalogExplorer is described in detail, and exemplary applications are discussed. In addition, the implementation of AnalogExplorer is made freely available.
1. INTRODUCTION
The transformation of hits into leads and further optimization of leads are central tasks in medicinal chemistry.1 These processes involve the generation of series of analogs of preselected active compounds to increase their potency and improve other optimization-relevant features such as solubility or metabolic stability. In the practice of medicinal chemistry, analogs are typically designed and generated with the help of R-group tables that list core structures of individual series, R-groups at predefined substitution sites, and potency values of analogs. R-group tables represent a conventional data format for the exploration (and exploitation) of structure−activity relationships (SARs) during chemical optimization efforts. However, R-group tables are difficult to use when compound series become large, multiple series are compared, or structurally heterogeneous compound sets are analyzed. As rapidly increasing amounts of SAR data become available, computational approaches have been introduced to access SAR information contained in compound data sets and visualize SARs.2,3 These include methods for the design of global activity landscape representations of compound data sets, which provide integrated views of similarity and activity relationships,4 as well as methods for the graphical analysis of local activity landscape features and analog series.2−4 A few approaches have been specifically introduced to further extend the R-group table format or provide alternative ways to graphically analyze analog series and their SARs. For example, so-called SAR maps have been developed for visualizing single analog series in a rectangular matrix format reminiscent of R-group tables.5 Compounds forming a series share the same core, which is
represented by the maximum common substructure (MCS), and contain R-groups at two predefined substitution sites. In the matrix, rows contain all unique R-groups attached to a given site of the MCS and columns contain all R-groups attached to another site. Accordingly, each cell in the matrix is associated with a compound consisting of corresponding substituents. Cells are colored according to compound potency values or computed selectivity scores.5 Two derivatives of these SAR maps have subsequently been introduced, including a map that displays the biological activity profiles of compounds in a series across a panel of assays or targets6 and a map accounting for single R-group polymorphism by analyzing compounds that only differ at a single substitution site.7 The SAR map and its derivatives provide compound-centric views for analyzing analogs with only one or two substitution sites. Other MCS-based visualization methods that conceptually depart from R-group table formats have also been introduced. For example, the combinatorial analog graph (CAG) evaluates the contributions of individual substitution sites and site combinations to compound activity within series and identifies substitution patterns responsible for SAR discontinuity.8 In this case, all compounds sharing the same MCS are compared in a pairwise manner and analogs with structural variations at one, two, or at most three sites are identified. In the graph, each node represents one or more pairs of compounds that differ at corresponding substitution site(s). Since compounds are compared in different substitution site contexts, a compound
Received: September 9, 2014
dx.doi.org/10.1021/jm501391g | J. Med. Chem. XXXX, XXX, XXX−XXX
Journal of Medicinal Chemistry
Article
Figure 1. Maximum common substructure. For a series of six analogs (androgen receptor antagonists), the MCS is colored red and shown with all variable substitution sites found in this series (bottom).
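The SAR map layout described in the Introduction (rows for the unique R-groups at one substitution site of the MCS, columns for the R-groups at a second site, one cell per compound) can be illustrated with a minimal Python sketch. The function name `build_sar_map`, the tuple layout, and the R-groups and potency values in the usage note are hypothetical and serve illustration only; this is not the published SAR map implementation.

```python
def build_sar_map(analogs):
    """Arrange analogs sharing one MCS into an R-group matrix.

    analogs: iterable of (r_site1, r_site2, potency) tuples (assumed layout).
    Returns (row_labels, column_labels, matrix); a cell is None when no
    compound with that R-group combination exists in the series.
    """
    analogs = list(analogs)
    # Rows: unique R-groups at the first site; columns: at the second site.
    rows = sorted({r1 for r1, _, _ in analogs})
    cols = sorted({r2 for _, r2, _ in analogs})
    # Each (R1, R2) combination identifies one compound and its potency.
    cell = {(r1, r2): potency for r1, r2, potency in analogs}
    matrix = [[cell.get((r, c)) for c in cols] for r in rows]
    return rows, cols, matrix
```

For example, calling `build_sar_map([("H", "CH3", 6.2), ("Cl", "CH3", 7.1), ("Cl", "F", 8.0)])` on these invented values yields a 2 × 2 matrix with one empty cell; in a rendered map the filled cells would be colored by potency.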
can appear multiple times in different nodes, which are colored according to SAR discontinuity scores.8 Hence, the CAG is a compound pair-based approach for the identification of key substitution site(s) with significant contribution to SAR discontinuity. Another MCS-based graphical method termed directed R-group combination graph (DRCG) focuses on R-group combinations in analog series.9 Nodes in the hierarchical DRCG represent all R-group combinations derived from an analog series, subsets of these R-group combinations, and the corresponding compounds. The graph reveals subset relationships between R-group combinations and potency changes associated with the addition or removal of substituents. Its primary focal point is the consideration of R-group combinations rather than compounds. An alternative approach, the similarity−potency tree (SPT), was introduced to mine SAR information for any given (reference) compound and its chemical neighborhood in a compound data set.10 All compounds that exceed a predefined structural similarity threshold when compared to the reference compound are identified and organized in a tree structure with decreasing similarity to the reference. Compounds in an SPT are colored according to their potency values. The SPT reveals analogs of the reference compound in the form of horizontal tree patterns. From horizontal and vertical patterns, SAR information can be extracted. Another reference compound-based graphical representation termed chemical neighborhood graph (CNG) is designed to identify compounds in data sets that have many structural neighbors with significant potency differences.11 Both the SPT and CNG representations can be rationalized as chemical neighborhood graphs that identify subsets of analogs from which SAR information can be extracted. In addition to graphical methods for the organization and display of analog series and chemical neighborhood graphs, the matched molecular pair (MMP) formalism12 has also been adapted for the systematic study of analogs. An MMP is defined as a pair of compounds that only differ by a structural change at a single site.12 The exchange of a pair of substructures that converts MMP compounds into each other is referred to as a chemical transformation.13 The SAR matrix data structure is designed to identify and systematically organize analog series with core structures forming MMP relationships and reveal associated SAR information.14 By application of a two-step MMP fragmentation scheme, compounds with analogous core structures and conserved or varying substitutions are arranged in an R-group table-like matrix format. Cells in a matrix are colored according to compound potency values, and the matrix captures all possible combinations of structurally analogous cores and available substituents, many of which might still be virtual, hence providing compound design suggestions.14
Herein, we report a new computational methodology for the graphical representation and analysis of analog series that is distinct from currently available approaches and data structures, as discussed above. The method termed AnalogExplorer is compound-based (rather than compound pair- or substituent-based) and systematically explores substitution sites or site combinations in analog series, regardless of the number of substitution sites they might contain or the structural diversity
Figure 2. Prototypic AnalogExplorer graph. (a) Shown is the complete graph for a series of 59 analogs (androgen receptor antagonists). At the top, the MCS is shown with all five variable substitution sites found in this series (see also Figure 1). In the graph, nodes represent substitution sites or site combinations that are scaled in size according to the number of analogs each corresponding subset contains (using five canonical node sizes). In addition, nodes are colored on the basis of the mean analog potency (mean pKi value) in each subset using a continuous color spectrum from red (minimal potency value in the series) to yellow (medium potency) to green (maximal potency value in the series). Empty nodes represent site(s) for which no corresponding analog subsets exist. Furthermore, the node border thickness reflects the potency range covered by analogs in the corresponding subset. If the potency of all analogs falls within the same order of magnitude (OoM), no scaling is applied. If the potency range of analogs spans a difference of 10-fold to less than 100-fold or at least 100-fold, thick black borders and thick red borders are used, respectively. (b) Shown is the reduced graph obtained from (a). The number of analogs represented by individual nodes (substitution sites and their combinations) is reported.
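The node coloring and border rules given in the caption of Figure 2 can be made concrete with a short Python sketch. It assumes that potencies are supplied as pKi values, so that a 10-fold range corresponds to 1 log unit and a 100-fold range to 2 log units, and that the color spectrum is a piecewise-linear RGB interpolation with yellow at the midpoint of the series range. The function names and the returned dictionary are hypothetical and are not taken from the AnalogExplorer implementation. Node size scaling is omitted because the bin boundaries of the five canonical sizes are not stated in the caption.

```python
from statistics import mean


def potency_color(value, series_min, series_max):
    """Map a potency onto red (min) -> yellow (mid) -> green (max) as RGB."""
    if series_max == series_min:
        return (255, 255, 0)  # assumption: flat series shown as yellow
    t = (value - series_min) / (series_max - series_min)
    t = min(max(t, 0.0), 1.0)  # clamp to the series range
    if t <= 0.5:
        # red (255, 0, 0) -> yellow (255, 255, 0)
        return (255, round(255 * t * 2), 0)
    # yellow (255, 255, 0) -> green (0, 255, 0)
    return (round(255 * (1 - t) * 2), 255, 0)


def border_style(subset_pki):
    """Classify the potency range of a subset in log units (pKi)."""
    span = max(subset_pki) - min(subset_pki)
    if span >= 2.0:   # at least 100-fold
        return "thick red"
    if span >= 1.0:   # 10-fold to less than 100-fold
        return "thick black"
    return "default"  # same order of magnitude: no scaling


def node_style(subset_pki, series_min, series_max):
    """Return display attributes for one substitution site (combination)."""
    if not subset_pki:
        return {"empty": True}  # no analogs exist for this site(s)
    avg = mean(subset_pki)
    return {
        "empty": False,
        "mean_pki": avg,
        "fill_rgb": potency_color(avg, series_min, series_max),
        "border": border_style(subset_pki),
    }
```

Under these assumptions, a subset with pKi values of 6.0 and 8.0 in a series spanning 5.0 to 9.0 would receive a yellow fill (mean 7.0) and a thick red border (100-fold range).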
2.3. Selection of Analog Series. On the basis of scaffold analysis, a total of 247 analog series with activity against 67 different targets were selected from ChEMBL (release 18).19 These series had to meet the following selection criteria. Only compounds with direct interactions (i.e., assay relationship type “D”) with human targets at the highest confidence level (i.e., assay confidence score 9) were collected. As potency measurements, assay-independent equilibrium constants (Ki values) and assay-dependent IC50 values were separately considered. Only compounds with explicitly defined Ki or IC50 values were selected. Approximate measurements such as “>”, “