Chapter 5
Downloaded by UNIV OF CALIFORNIA SAN DIEGO on December 12, 2016 | http://pubs.acs.org Publication Date (Web): October 5, 2016 | doi: 10.1021/bk-2016-1222.ch005
Going Beyond R-Group Tables
Veerabahu Shanmugasundaram,* Liying Zhang,1 Christopher Poss,2 Jared Milbank,3 and Jeremy Starr
Center of Chemistry Innovation and Excellence, Worldwide Medicinal Chemistry, Pfizer, 300 Eastern Point Road, Groton, Connecticut 06340
*E-mail: [email protected]
1Current Address: Computational Sciences CoE, Worldwide Medicinal Chemistry, Pfizer, 610 Main Street, Cambridge, Massachusetts 02139
2Current Address: Predictive Informatics, R&D Business Technologies, Pfizer, 300 Eastern Point Road, Groton, Connecticut 06340
3Current Address: Cheminformatics, Forma Therapeutics, 500 Arsenal St, Watertown, Massachusetts 02472
Early stage drug discovery in biomedical research is enabled by a wide range of data visualization and analysis methodologies. In medicinal chemistry, the exploration of structure-activity relationships (SARs) plays a critically important role. SAR is typically explored for individual compound series on a case-by-case basis. A new data-structure developed by Prof. Jürgen Bajorath and coworkers called SAR matrices (SARMs) automatically extracts SAR patterns from data sets and organizes the exhaustive information contained in a project dataset in an easy and interpretable fashion. We have applied SAR matrices to various research problems of interest within Pfizer and have enabled an interactive custom SAR mining and visualization platform within TIBCO/Spotfire that significantly enhances the SARM interpretation and analysis by medicinal chemistry project teams. The study of SAR is one of the central themes in medicinal chemistry and the concept of visual SAR analysis that enables organization of large compound data sets on the basis of intuitive structural relationships is a very powerful tool for medicinal chemists.
© 2016 American Chemical Society Bienstock et al.; Frontiers in Molecular Design and Chemical Information Science - Herman Skolnik Award Symposium 2015: ... ACS Symposium Series; American Chemical Society: Washington, DC, 2016.
Introduction
SAR matrices (SARMs), developed by Prof. Jürgen Bajorath and coworkers, provide a novel visualization framework where SAR patterns can be automatically extracted from data sets and presented in the more familiar scaffold/functional-group SAR table view (1). The computational approach uses a matched-molecular pair-like algorithm (2) to identify and automatically extract groups of structurally related compounds exhaustively and displays the resultant information in a chemically intuitive and interpretable fashion (3). This is different from the commonly used R-group table, which is based on a medicinal chemist’s pre-defined structural definitions describing in detail the bond cuts and thereby the R-group substituents, one scaffold at a time. The information contained in the SAR matrix can be color-coded based on any property value of interest to the project team and can be easily exchanged for any other property value (e.g., potency, selectivity, permeability, metabolism, desirability scores (4)). Core scaffolds and substituent functional groups are organized as rows and columns. Several levels of bond cuts (single, double and triple cuts) are used to develop core scaffold-functional group information. The matrix exemplified in Figure 1 is derived from a single bond cut procedure. A typical dataset affords many matrices depending on the number of scaffolds, functional groups and number of bond cuts determined by the SARM algorithm. Therefore, an individual compound can be present in many matrices depending on the core scaffold/R-group combination (Figure 2).
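The row/column organization described above can be illustrated with a minimal Python sketch. It assumes that the fragmentation into cores and substituents has already been done upstream and that the cores shown belong to one matrix; the compound identifiers, core labels, property names and values are hypothetical and are not taken from any Pfizer dataset or from the published SARM implementation.

```python
# Hypothetical, pre-fragmented input: one record per compound in a single matrix.
# Each record carries several properties so the displayed property can be swapped.
records = [
    {"id": "cpd1", "core": "core_A", "sub": "F",   "pIC50": 6.1, "logD": 2.0},
    {"id": "cpd2", "core": "core_A", "sub": "Cl",  "pIC50": 7.4, "logD": 2.6},
    {"id": "cpd3", "core": "core_B", "sub": "F",   "pIC50": 5.8, "logD": 1.7},
    {"id": "cpd4", "core": "core_B", "sub": "OMe", "pIC50": 8.0, "logD": 1.9},
    {"id": "cpd5", "core": "core_C", "sub": "Cl",  "pIC50": 6.9, "logD": 3.1},
]

def build_matrix(records):
    """Index compounds by (core, substituent); cores become rows, substituents columns."""
    cells = {(r["core"], r["sub"]): r for r in records}
    cores = sorted({r["core"] for r in records})
    subs = sorted({r["sub"] for r in records})
    return cores, subs, cells

def render(cores, subs, cells, prop):
    """Return the matrix as a list of rows for one property; None marks an empty cell."""
    return [
        [cells[(core, sub)][prop] if (core, sub) in cells else None for sub in subs]
        for core in cores
    ]

cores, subs, cells = build_matrix(records)
potency_grid = render(cores, subs, cells, "pIC50")  # view annotated by potency
logd_grid = render(cores, subs, cells, "logD")      # same layout, different property

for core, row in zip(cores, potency_grid):
    print(core, row)
```

Keeping the layout (cores, substituents) separate from the displayed property is one simple way to exchange the annotation without rebuilding the matrix.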
Figure 1. An example SAR matrix.
Figure 2. Example set of SAR matrices for a particular dataset. Highlighted on top is a single matrix from the set of matrices. In these matrices, each cell represents one compound. Colored cells are real (already synthesized) compounds. Colors indicate a property value colored by its favorability according to a standard stoplight color scheme. Blank cells are virtual compounds. Cells marked by “?” are suggested virtual compounds for further study.
Information in a matrix can be used to capture SAR discontinuity (5), identify areas which will require more exploration, or suggest virtual compounds based on neighborhood information for synthesis (vide infra). SARMs can be used to interrogate the existing wealth of information contained in a project team (Figure 3), such as: (1) What are the over- and underexplored scaffolds? (2) What are the privileged R-groups in the dataset? (3) Where are the activity cliffs (6)? (4) Which combinations of core and R-groups should be evaluated further? (5) What are the SAR trends over time for a chemical series in a project and can that be used in go/no-go decisions in the project? (6) What is the probability that an area of chemistry space has high potential of meeting project goals?
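Questions (1), (2) and (4) above lend themselves to simple bookkeeping over a single matrix. The following Python sketch is an illustration only: the matrix content is hypothetical, and the measures used (row coverage as a proxy for scaffold exploration, mean potency per R-group as a proxy for a privileged R-group) are assumptions made for this example rather than definitions from the SARM literature.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical matrix content: (core, R-group) -> pIC50 of a real compound.
cells = {
    ("core_A", "F"): 6.1,
    ("core_A", "Cl"): 7.4,
    ("core_A", "OMe"): 7.9,
    ("core_B", "F"): 5.8,
    ("core_B", "OMe"): 8.0,
    ("core_C", "Cl"): 6.9,
}

def axes(cells):
    """Sorted row (core) and column (R-group) labels of the matrix."""
    return sorted({c for c, _ in cells}), sorted({r for _, r in cells})

def core_coverage(cells):
    """Fraction of R-group columns filled per core; low values flag underexplored cores."""
    cores, rgroups = axes(cells)
    return {c: sum((c, r) in cells for r in rgroups) / len(rgroups) for c in cores}

def rgroup_mean_potency(cells):
    """Mean potency per R-group across all cores that carry it."""
    by_rgroup = defaultdict(list)
    for (_, rgroup), value in cells.items():
        by_rgroup[rgroup].append(value)
    return {r: mean(values) for r, values in by_rgroup.items()}

def empty_cells(cells):
    """Unexplored core/R-group combinations, i.e. candidate virtual compounds."""
    cores, rgroups = axes(cells)
    return [(c, r) for c in cores for r in rgroups if (c, r) not in cells]

coverage = core_coverage(cells)
rgroup_means = rgroup_mean_potency(cells)
virtual = empty_cells(cells)
print(coverage)
print(rgroup_means)
print(virtual)
```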
Figure 3. Patterns from SARMs could provide a wealth of information that can be used in SAR mining and analysis.
As each dataset results in a collection of several matrices, a conservative ranking scheme (beyond the one developed by our collaborators at the University of Bonn) was developed to prioritize the matrix set for visual examination and analysis. This scheme promotes information-rich matrices and distinguishes them from information-poor matrices (Figure 4). The prioritization is based on a rank computed for each matrix that takes into consideration SAR patterns, property variance, and the size and dimension of the matrix. For instance, a matrix with a large activity range ranks higher than a matrix with all-active compounds, since a large range indicates a discontinuous SAR chemistry space that usually stores more SAR information. Likewise, a large matrix with hundreds of compounds ranks lower than a matrix with fewer compounds, since the larger matrix is more difficult to analyze visually. Figure 4 illustrates an example Pipeline Pilot protocol that was used for matrix prioritization purposes. From the 3000 matrices generated from a single dataset containing hundreds of close-in analogs, only about one-fourth of the matrices were prioritized as starting points for visual SAR analysis.
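The two ranking criteria stated above (reward a wide activity range, penalize very large matrices) can be sketched as a toy scoring function in Python. This is not the Pipeline Pilot protocol of Figure 4: the functional form, the `max_compounds` threshold and the example matrices are assumptions chosen only to make the idea concrete.

```python
def matrix_score(potencies, max_compounds=100):
    """Toy information score for one matrix (hypothetical form).

    potencies     : potency values (e.g. pIC50) of the real compounds in the matrix
    max_compounds : assumed size above which visual analysis becomes harder
    """
    if len(potencies) < 2:
        return 0.0
    activity_range = max(potencies) - min(potencies)        # wide range: more SAR information
    size_penalty = min(1.0, max_compounds / len(potencies))  # oversized matrices are down-weighted
    return activity_range * size_penalty

# Hypothetical matrices: small with wide range, small all-active, and very large.
matrices = {
    "wide_range": [5.0, 6.5, 8.5],
    "all_active": [8.0, 8.1, 8.2],
    "very_large": [5.0 + 3.5 * i / 399 for i in range(400)],
}

scores = {name: matrix_score(values) for name, values in matrices.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A production scheme would also need terms for SAR patterns and matrix dimensions, which the text mentions but does not specify.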
Figure 4. Matrix Prioritization: An example Pipeline Pilot protocol that sorts through the set of matrices and rank orders them based on SAR information content.
Close-In Analog Prioritization Using Neighborhood Information
The SARM data structure was originally designed to organize compound sets on the basis of core scaffolds and substituents. The set of matrices provides an exhaustive view of all possible structural relationships between cores and substituents. A characteristic feature of SARMs is that empty cells imply virtual compounds (close-in analogs) representing unexplored scaffold/R-group combinations and hence offer suggestions for synthesis and biological evaluation. However, the SARM data structure does not enable a direct prioritization of such virtual compounds. Rather, visual analysis of SARMs is required to analyze SAR and suggest virtual compounds. Further, as a single virtual compound can be present in multiple matrices, a thorough examination of all combinations of these matrices is required for accurate prioritization, which can be a time-consuming exercise for a medicinal chemist. Therefore, we developed a close-in analog prioritization technique using a neighborhood-based (NBH) analysis method. For each virtual compound, NBHs consisting of known active compounds were defined. Virtual compounds were then ranked according to the number of such NBHs, and a Free-Wilson-like additivity principle was applied to individual neighborhoods (7). This leads to the prediction of the potency of a virtual compound on the basis of differential core and substituent contributions from active neighbors (Figures 5-7). A distinguishing feature of the NBH-based prediction approach is that predictions over multiple NBHs are prioritized. Therefore, one can assign confidence to consistent predictions, that is, those resulting in low standard deviation (SD) values.
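The additivity idea can be sketched in Python under explicit assumptions. Here a neighborhood of a virtual compound at (core, R-group) is taken to be three real compounds that share its core, share its R-group, and close the rectangle, so that the predicted potency is p(core, r2) + p(c2, R-group) - p(c2, r2). The sketch does not require the neighbors to be adjacent in the matrix and does not filter for activity, both of which are simplifications; the data are hypothetical.

```python
from statistics import mean, pstdev

def predict_virtual(cells, core, rgroup):
    """Predict the potency of the empty cell (core, rgroup) from all qualifying neighborhoods.

    cells maps (core, R-group) -> potency of a real compound.
    Returns (mean prediction, SD across neighborhoods, number of neighborhoods),
    or None when no neighborhood qualifies.
    """
    cores = {c for c, _ in cells}
    rgroups = {r for _, r in cells}
    predictions = []
    for c2 in sorted(cores - {core}):
        for r2 in sorted(rgroups - {rgroup}):
            same_core, same_rgroup, diagonal = (core, r2), (c2, rgroup), (c2, r2)
            if same_core in cells and same_rgroup in cells and diagonal in cells:
                # Additivity assumption: core and substituent contributions are independent.
                predictions.append(cells[same_core] + cells[same_rgroup] - cells[diagonal])
    if not predictions:
        return None
    return mean(predictions), pstdev(predictions), len(predictions)

# Hypothetical matrix with pKi-like values; ("A", "OMe") is a virtual compound.
cells = {
    ("A", "F"): 6.0, ("A", "Cl"): 7.0,
    ("B", "F"): 6.5, ("B", "Cl"): 7.5, ("B", "OMe"): 8.0,
    ("C", "F"): 5.0, ("C", "OMe"): 6.4,
}

result = predict_virtual(cells, "A", "OMe")
print(result)
```

A low SD across several neighborhoods indicates consistent predictions; a high SD suggests that the additivity assumption does not hold in that region of the matrix.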
Figure 5. Virtual Compound Activity Prediction.
Figure 6. Illustration of the Neighborhood-Based Activity Prediction Method. For an example virtual compound X, the set of all NBHs (outlined in blue) in an SAR matrix is identified and four qualifying neighborhoods (NBHs 1 to 4) for prioritizations are determined.
A study to predict potencies was conducted across six data sets collected from ChEMBL, with prediction accuracy increasing with the number of qualifying NBHs (8). Depending on the composition of NBHs, virtual compounds with higher potency than known active neighbors can be predicted, and these predictions can then be easily prioritized. Predictions yielding high SD values are indicative of discontinuous SAR regions in which structurally analogous neighbors might have very different potencies. Although these regions usually fall outside the applicability domain of potency predictions employing an additivity principle, they are nonetheless interesting for compound design because they are likely to contain outlier and activity cliff information. In summary, a neighborhood-based SARM analysis was developed and potency predictions were enabled for prioritizing virtual compounds for close-in analog synthesis. This enhancement significantly increased the attractiveness and utility of the SARM data structure for medicinal chemistry project applications. Various extensions to the prediction schema are currently in development.
Figure 7. SAR matrix. Three model series (A, B, and C) containing three compounds each are shown with their respective pKi values (red). Compounds in a series share a common core structure and differ by substitutions at a single site (highlighted in blue). The three series contain structurally related cores (bottom left; substructure differences between cores are highlighted in red).
Monitoring SAR Project Progression
Lead optimization in project teams is largely driven by hypothesis-based, multi-parameter optimization of potency, selectivity and ADMET properties. It still requires the ingenuity, experience, and intuition of medicinal chemists focusing on the key question “which compound to make next?” Accordingly, it is essentially impossible to predict whether or not a project might ultimately be successful. It is also very difficult to estimate when sufficient numbers of compounds have been evaluated to judge the odds of a project being successful. Given the subjective nature of lead optimization decisions and the optimism of project teams, only very few attempts have been made to systematically evaluate project progression (9). Using SARMs, a computational framework to follow the evolution of structure-activity relationship (SAR) information content over a time course was recently developed (10). The approach uses the SAR matrix data structure as a diagnostic tool to evaluate SAR redundancy: it measures the SAR information content within a chemical series over time and represents SAR progress graphically. Newly synthesized compounds (shown on a white background in Figure 8A) are added in time intervals to evolving lead optimization sets (gray background), and SARMs are systematically calculated at each time point. SARMs calculated at each time point are retained and compared to newly derived matrices. Distributions of SARMs are monitored in scatterplots of median potency vs SARM discontinuity in which each SARM is represented as a color-coded dot. Dots with a black border correspond to SARMs shown above the scatterplots. For temporal analysis, three categories of SARMs are distinguished: existing (colored gray), expanded (cyan), and new SARMs (magenta). Existing (old) matrices are not modified through the addition of newly synthesized compounds. Expanded SARMs evolve from existing matrices through the addition of analogues that further extend currently available matched molecular series (MMSs). New SARMs contain new MMSs and capture previously unobserved structural relationships due to the addition of novel structures.
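The three temporal categories can be illustrated with a small Python sketch. As a simplifying assumption, each SARM is represented only by the set of compound identifiers it contains: a matrix identical to an earlier one is labeled existing, a matrix that strictly contains an earlier one is labeled expanded, and anything else is labeled new. The published method compares matched molecular series rather than bare compound sets, so this is an approximation, and all identifiers are hypothetical.

```python
def classify_sarms(previous, current):
    """Label each current SARM as 'existing', 'expanded', or 'new'.

    previous, current: dicts mapping a matrix name to a frozenset of compound IDs
    (a simplified stand-in for the matched molecular series in a matrix).
    """
    labels = {}
    old_sets = list(previous.values())
    for name, compounds in current.items():
        if any(compounds == old for old in old_sets):
            labels[name] = "existing"   # unchanged by the newly added compounds
        elif any(old < compounds for old in old_sets):
            labels[name] = "expanded"   # an earlier matrix grown by further analogues
        else:
            labels[name] = "new"        # no earlier matrix is contained in it
    return labels

# Hypothetical matrices at two consecutive time points.
t1 = {
    "M1": frozenset({"c1", "c2", "c3"}),
    "M2": frozenset({"c4", "c5"}),
}
t2 = {
    "M1": frozenset({"c1", "c2", "c3"}),
    "M2": frozenset({"c4", "c5", "c6"}),
    "M3": frozenset({"c7", "c8"}),
}

labels = classify_sarms(t1, t2)
print(labels)
```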
Figure 8. Schematic representation of the concept of monitoring SAR progression over time using SARMs.
Figure 8B depicts two sets of SARM scatterplots. Comparison of SARM scatterplots makes it possible to follow SAR progression on a time course and judge the success of lead optimization (LO) efforts. For example, a desirable LO profile (top; positive SAR progression) would display a shift of matrix distributions over time toward the upper right quadrant of the scatterplot (high median potency and high SARM discontinuity), with an enrichment of new SARMs. By contrast, the scatterplots at the bottom display negative progression of SAR over time because the matrix distribution shifts toward the bottom left quadrant (low median potency and low SARM discontinuity). On the right, trend plots from fitting average potency and SARM discontinuity scores of new matrices (magenta) for each time period to linear functions are shown. Trend lines monitor the development of SARM discontinuity and potency for an indicator SARM category over time. These investigations indicate that SARM ensembles are capable of detecting differences in SAR progression in compound sets of distinct composition and can be used as a diagnostic tool to distinguish SAR progression from redundancy (Figures 9-10). Application of the approach to datasets from drug discovery projects revealed SAR trends over time for chemical series that were ultimately successful or unsuccessful. Such insights are valuable in project decisions and merit further investigation in LO assessment. Since the SARM data structure can be easily annotated with different molecular properties, multiple parameters can be monitored.
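The trend-line step can be sketched in Python as an ordinary least-squares fit of per-period values against the time index. The per-period potency and discontinuity values below are hypothetical, the discontinuity score itself is taken as a given input (its definition is in ref 10 and is not reproduced here), and the rule that turns two slopes into a label is an assumption made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def progression_label(potency_slope, discontinuity_slope):
    """Hypothetical rule: both trends rising is positive, both falling is negative."""
    if potency_slope > 0 and discontinuity_slope > 0:
        return "positive"
    if potency_slope < 0 and discontinuity_slope < 0:
        return "negative"
    return "flat or mixed"

# Hypothetical per-period averages for one indicator SARM category (e.g. new SARMs).
periods = [1, 2, 3, 4]
potency = [6.0, 6.5, 7.0, 7.5]
discontinuity = [0.2, 0.3, 0.4, 0.5]

potency_slope, potency_intercept = fit_line(periods, potency)
disc_slope, disc_intercept = fit_line(periods, discontinuity)
label = progression_label(potency_slope, disc_slope)
print(potency_slope, disc_slope, label)
```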
Figure 9. Indicator SARM distributions over a time course for two Pfizer data sets. (A) neurodegenerative target, series 1, (B) neurodegenerative target, series 2, (C) inflammation target, series 1, and (D) inflammation target, series 2. Series 1 in (A) and (C) represented successful project progressions from which compounds were nominated as candidates for preclinical studies. By contrast, series 2 in (B) and (D) represented unsuccessful project progressions from which no compounds were nominated.
Figure 10. Trend plots for two Pfizer data sets showing expanded and new indicator SARMs: (A) neurodegenerative target, series 1, (B) neurodegenerative target, series 2, (C) inflammation target, series 1, and (D) inflammation target, series 2. Trend lines separately monitor the development of median potency and SARM discontinuity scores over time for a given category of indicator SARMs. Series 1 in (A) and (C) represented successful chemical series and displayed positive SAR progression with an increase in both median potency and SARM discontinuity scores. Series 2 in (B) and (D) represented unsuccessful chemical series, which displayed negative SAR progression for expanded SARMs with a decrease in median potency and SARM discontinuity scores and essentially flat SARs for new SARMs.
Visualization Using TIBCO/Spotfire DXP Platform
In order to enable easy access to and use of SAR matrix information, we have developed a custom visualization for SARMs within Spotfire DXP (11). The DXP-based SAR matrix (DXP/SAR matrix) visualization features a number of convenient and useful functions that take advantage of features in the DXP platform and those implemented within Pfizer.
Figure 11 depicts an example SAR matrix as implemented in Spotfire DXP. Cells with a colored square are real compounds and cells with a colored circle are virtual compounds for which a predicted value has been generated. The inlays are confidence values for virtual compounds and predicted values for real compounds. The use of the inlays also enables multiple properties to be visualized in the same cell.
Figure 11. SAR matrix visualization in Spotfire DXP.
The DXP/SAR matrix platform provides extensive customization for property-based coloring; quick sorting, filtering and marking of matrices; the ability to subset matrices based on single, double or triple cuts; and all the compound- or property-based filtering abilities within the Spotfire application (Figures 12-15). Furthermore, the DXP/SAR matrix implementation allows sorting of core scaffolds and R-groups based on any property column associated with the set of compounds (size, lipophilicity, etc.), aligns core scaffolds (as drawn by the renderer) in a standard fashion, and provides connections to computational models as well as retrieval of other information contained in the DXP file (such as project data and connections to several Pfizer databases (12)). In addition, methods for providing virtual compound predictions based on the SAR patterns in the matrix and for visualizing a confidence metric are also enabled.
Figure 12. DXP/SAR Matrix. Set of matrices from the SAR matrix algorithm. All matrices that are present are loaded into the root node. This can be split in a variety of ways based on attributes in the file. Shown here is a split by Matrix-ID wherein each split-node indicates a matrix that can be compressed or expanded. All the filtering and data-mining features of Spotfire DXP are also enabled within the Pfizer environment.
Figure 13. SARM implementation within TIBCO/Spotfire DXP provides dynamic visualization capabilities and connections to Pfizer databases.
Figure 14. Subsetting SARMs – Filtering by compound ID: Here the use case is typically directing SAR visualization and matrix filtering around a key compound.
Figure 15. Subsetting SARMs – Filtering by matrix type/bond cuts: Here the use case is typically directing SAR visualization and matrix filtering around bond disconnections or sets of R-groups.
In summary, SAR matrices coupled with TIBCO/Spotfire DXP data views provide novel SAR and design analyses that enable unique ways of evaluating and prioritizing virtual compounds. Furthermore, the ability to analyze project SAR based on compound series over time provides novel ways to use this information in project decision making. These extensions and the ability to visualize and access SARMs in TIBCO/Spotfire DXP enabled project teams to bring together multiple data tables and conceptual design frameworks into one environment. Merging virtual with real compound SAR data provides a powerful way of analyzing target molecules and related information in the context of existing chemistry space.
Acknowledgments
The authors would like to thank Disha Gupta-Ostermann, Shilva Kayastha, Antonio de la Vega de León, Dilyana Dimova and Jürgen Bajorath (University of Bonn, Germany) for their collaborative work, and Robert Stanton, Mark Noe and Tony Wood (Pfizer) for helpful discussions and support.
References
1. Wassermann, A. M.; Haebel, P.; Weskamp, N.; Bajorath, J. SAR Matrices: Automated Extraction of Information-Rich SAR Tables from Large Compound Data Sets. J. Chem. Inf. Model. 2012, 52, 1769–1776.
2. Hussain, J.; Rea, C. Computationally Efficient Algorithm to Identify Matched Molecular Pairs (MMPs) in Large Data Sets. J. Chem. Inf. Model. 2010, 50, 339–348.
3. Gupta-Ostermann, D.; Bajorath, J. The ‘SAR Matrix’ Method and its Extensions for Applications in Medicinal Chemistry and Chemogenomics [v2; ref status: indexed, http://f1000r.es/3rg]. F1000Res. 2014, 3, 113.
4. Wager, T.; Chandrasekaran, R.; Hou, X.; Troutman, M.; Verhoest, P.; Villalobos, A.; Will, Y. Defining Desirable Central Nervous System Drug Space through the Alignment of Molecular Properties, in Vitro ADME, and Safety Attributes. ACS Chem. Neurosci. 2010, 1, 420–434.
5. Peltason, L.; Bajorath, J. SAR Index: Quantifying the Nature of Structure−Activity Relationships. J. Med. Chem. 2007, 50, 5571–5578.
6. Van Drie, J. H.; Lajiness, M. S. Approaches to virtual library design. Drug Discovery Today 1998, 3, 274–283.
7. Kubinyi, H. Free-Wilson analysis. Theory, applications and its relationship to Hansch analysis. QSAR 1988, 7, 121–133.
8. Gupta-Ostermann, D.; Shanmugasundaram, V.; Bajorath, J. Neighborhood-Based Prediction of Novel Active Compounds from SAR Matrices. J. Chem. Inf. Model. 2014, 54, 801–809.
9. Maynard, A. T.; Roberts, C. D. Quantifying, Visualizing, and Monitoring Lead Optimization. J. Med. Chem. 2016, 59, 4189–4201.
10. Shanmugasundaram, V.; Zhang, L.; Kayastha, S.; de León, A.; Dimova, D.; Bajorath, J. Monitoring the Progression of Structure–Activity Relationship Information during Lead Optimization. J. Med. Chem. 2016, 59, 4235–4244.
11. Spotfire DXP; TIBCO Software Inc.: Palo Alto, CA.
12. Brodney, M. D.; Brosius, A. D.; Gregory, T.; Heck, S. D.; Klug-McLeod, J. L.; Poss, C. S. Project-Focused Activity and Knowledge Tracker: A Unified Data Analysis, Collaboration, and Workflow Tool for Medicinal Chemistry Project Teams. J. Chem. Inf. Model. 2009, 49, 2639–2649.