J. Chem. Inf. Model. 2010, 50, 1872–1886

Lead Optimization Using Matched Molecular Pairs: Inclusion of Contextual Information for Enhanced Prediction of hERG Inhibition, Solubility, and Lipophilicity

George Papadatos, Muhammad Alkarouri, Valerie J. Gillet,* and Peter Willett
Information School, University of Sheffield, Sheffield S1 4DP, U.K.

Visakan Kadirkamanathan
Department of Automatic Control and Systems Engineering, University of Sheffield, Sheffield S1 3JD, U.K.

Christopher N. Luscombe, Gianpaolo Bravi, Nicola J. Richmond, Stephen D. Pickett, Jameed Hussain, John M. Pritchard, Anthony W. J. Cooper, and Simon J. F. Macdonald
GlaxoSmithKline Medicines Research Centre, Stevenage SG1 2NY, U.K.

Received July 8, 2010

Previous studies of the analysis of molecular matched pairs (MMPs) have often assumed that the effect of a substructural transformation on a molecular property is independent of the context (i.e., the local structural environment in which that transformation occurs). Experiments with large sets of hERG, solubility, and lipophilicity data demonstrate that the inclusion of contextual information can enhance the predictive power of MMP analyses, with significant trends (both positive and negative) being identified that are not apparent when using conventional, context-independent approaches.

INTRODUCTION

Lead optimization is a complex, time-consuming task, in which the medicinal chemist seeks to obtain a sufficiently promising balance among potency, off-target interactions, toxicity, and pharmacokinetic behavior, inter alia, to make it worth allowing a molecule to progress to the candidate stage of the drug discovery pipeline. The successful optimization of an initial lead compound is hence crucially dependent on the medicinal chemist’s ability to choose which analogue (or set of analogues when, as is very often the case, chemical arrays are being used) should be synthesized next, based on the knowledge that has been obtained thus far in the optimization. For example, there might be a need for analogues that are notably more soluble but, as much as possible, still have the same (or even higher) potency.

Many chemoinformatic approaches are available to assist in lead optimization.1 For example, approaches based on bioisosterism suggest substructural transformations that can be applied to an existing bioactive lead to yield analogues that might exhibit superior pharmacokinetic properties.2-6 Quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) approaches have been used for optimization for many years7-10 and express the property of interest as a function of structural variables characterizing the molecules in a training set (typically those that have been synthesized and tested thus far in the optimization). Given such an expression, property values can then be computed for previously untested analogues, thus enabling the chemist to predict the change in property, ∆P, resulting from a specific structural change, ∆S. This is a powerful technique and one that is widely used; however, the medicinal chemist is likely to be at least as interested in the reverse process of deriving ∆S given a desired ∆P value. The ability to predict the change in structure that is required to bring about a specific change in property, or inverse QSAR,11,12 is the basis for the work reported here, which belongs to the class of matched molecular pair (MMP) methods that have come to the fore over the past few years.12-18 These methods are related to the bioisosterism approaches mentioned above in their focus on specific substructural transformations; however, they go further in providing quantitative estimates of the changes, ∆P, that result from the application of particular transformations, ∆S, and hence provide an inverse QSAR approach to optimization. Another difference is that, given an appropriate source of data, they can model not only biological activity, the principal focus of the bioisosterism approaches, but also any chemical, physicochemical, or ADME (absorption, distribution, metabolism, and excretion) property that needs to be optimized.

* Corresponding author e-mail: [email protected]; phone: 0044114-2222652.

An MMP is defined as two molecules that differ from each other by a small, specified change at one or more specified locations and that share a large, identical structural feature. We refer to the change between the pair as a transformation and the invariant feature as the context, with the point where the transformation has taken place referred to as the attachment point. These terms are illustrated in Figure 1, which shows an example of a single-point transformation, where there is only one attachment point. Multiple-point transformations are also possible, but they have not been considered in the current study. Efficient algorithms for MMP identification have been described by Raymond et al.19 and by Hussain and Rea.20
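The terminology above, together with the inverse QSAR idea of finding a ∆S for a desired ∆P, can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the `MatchedPair` class, the helper functions, the SMILES-like strings, and all numeric values are hypothetical and are not taken from the study or from any published MMP software.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class MatchedPair:
    """One single-point MMP (hypothetical representation)."""
    context: str    # invariant part; '*' marks the attachment point
    left: str       # fragment in the first molecule
    right: str      # fragment in the second molecule
    delta_p: float  # property(second molecule) - property(first molecule)

    @property
    def transformation(self) -> str:
        return f"{self.left}>>{self.right}"


def mean_delta_by_transformation(pairs):
    """Average delta P for each transformation, ignoring the context."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair.transformation].append(pair.delta_p)
    return {t: mean(values) for t, values in grouped.items()}


def suggest_transformations(pairs, target_delta, tolerance):
    """Inverse lookup: transformations whose mean delta P is near the target."""
    means = mean_delta_by_transformation(pairs)
    return sorted(t for t, m in means.items() if abs(m - target_delta) <= tolerance)


# Invented values for illustration only; not data from the study.
pairs = [
    MatchedPair("*c1ccccc1", "[H]", "C(F)(F)F", 0.9),
    MatchedPair("*C1CCCCC1", "[H]", "C(F)(F)F", 0.5),
    MatchedPair("*c1ccccc1", "[H]", "O", -0.6),
]

print(suggest_transformations(pairs, target_delta=0.7, tolerance=0.1))
```

In this sketch the lookup averages over every context in which a transformation occurs, which corresponds to the basic, context-independent form of MMP analysis.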

10.1021/ci100258p © 2010 American Chemical Society. Published on Web 09/28/2010

Figure 1. One matched molecular pair and its context. The transformation is H → CF3 (a single-point change) and is highlighted in yellow. The asterisk in the context denotes the attachment point.

An early MMP study by Sheridan et al. identified molecular transformations using a maximum common substructure (MCS) procedure that was applied to a subset of the MDL Drug Data Report (MDDR) database with potency as the end point.13 This is the only MMP study that has used public, as opposed to corporate, data as the basis for analysis. Hajduk and Sauer at Abbott Laboratories analyzed additions (i.e., H → Y), functional group transformations (e.g., Br → OMe), and multiple regiospecific phenyl substitutions and noted that the potency changes caused by most transformations followed a normal distribution centered near zero.14 Leach et al. at AstraZeneca focused on ADME properties, such as aqueous solubility, plasma protein binding, and oral exposure, and molecular transformations, such as phenyl ring additions (i.e., Ph-H → Ph-Y) and methylation of heteroatoms (e.g., OH → OCH3).21 Haubertin and Bruneau, also at AstraZeneca, analyzed a set of ca. 9000 functional groups and their effect on solubility, protein binding, and distribution coefficient (log D), as well as several computed physicochemical properties.12 Lewis and Cucurull-Sanchez at Pfizer investigated the effects of regiospecific single and double phenyl ring additions using in-house data on human liver microsome activity and intrinsic clearance.15 Perhaps the most comprehensive study is that of Gleeson et al. at GlaxoSmithKline, who reported a systematic analysis of ca. 500 000 ADMET (absorption, distribution, metabolism, elimination, and toxicity) data points generated in eight in vitro assays for P450 inhibition, hERG (human ether-à-go-go-related gene) inhibition, solubility, and permeability.16 The study considered only additions (H → Y) where Y was a member of a predefined list of ca. 90 frequently used substituents. Finally, in two practical applications of the MMP method, Birch et al. described the design of inhibitors of glycogen phosphorylase,17 and Southall and Ajay described the analysis of protein kinase patents.18

An assumption of the basic MMP approach is that the property difference, ∆P, resulting from a specific transformation depends only on the substructural change that has taken place, irrespective of the context, that is, of the structural environment in which that transformation has taken place. For example, if a hydrogen atom is replaced by a trifluoromethyl group (H → CF3), then the change in lipophilicity that results is assumed to be constant across the many potential contexts in which that transformation might take place. This is clearly a very strong assumption, and some of the published MMP studies have hence made limited use of contextual information to enhance the specificity of the analyses that can be carried out. For example, Gleeson et al. differentiated halogen, amine, and alcohol additions depending on whether aromatic or aliphatic addition was involved.16

The starting point for the work reported here is a belief that the power of the MMP approach can be substantially increased if full use is made of the contextual information that is available when data mining is carried out on a sufficiently large scale (as is possible with the corporate data archives of the major pharmaceutical companies). We hence consider the context as an inherent component of an MMP analysis, and one that can be used to enhance the practical utility of the relationships existing between ∆P and ∆S. A further distinguishing characteristic of our work is that the MMPs are identified using a completely unsupervised process. Thus, as advocated by Sheridan et al.13 and by Hussain and Rea,20 there are no predefined lists of substituents or transformations, and we are hence not restricted to, for example, particular functional groups or addition reactions. In this article, we describe in detail the context-based approach to MMP analyses that we have developed and the application of our method to large sets of hERG, solubility, and lipophilicity data.

METHODS
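To make the contrast between context-independent and context-aware analysis concrete, the following minimal Python sketch groups ∆P values for one transformation either by the transformation alone or by the transformation together with a coarse context label. The aromatic/aliphatic label is used here only as a simple stand-in for contextual information, the `summarize` function is a hypothetical helper, and the numbers are invented for illustration; none of this reproduces the context description or the data used in the study.

```python
from collections import defaultdict
from statistics import mean

# (transformation, coarse context label, delta P); invented values only.
observations = [
    ("H>>CF3", "aromatic", 0.9),
    ("H>>CF3", "aromatic", 1.1),
    ("H>>CF3", "aliphatic", 0.2),
    ("H>>CF3", "aliphatic", 0.4),
]


def summarize(observations, use_context):
    """Return {key: (pair count, mean delta P)} for each group.

    With use_context=False the key is the transformation alone;
    with use_context=True the key also includes the context label.
    """
    grouped = defaultdict(list)
    for transformation, context, delta_p in observations:
        key = (transformation, context) if use_context else (transformation,)
        grouped[key].append(delta_p)
    return {key: (len(values), mean(values)) for key, values in grouped.items()}


print(summarize(observations, use_context=False))
print(summarize(observations, use_context=True))
```

With these invented numbers, the pooled mean hides the fact that the two context groups behave differently, which is the kind of trend that a context-independent analysis cannot show.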

Data Sets. The work reported here uses structural and property data from the GlaxoSmithKline (GSK) corporate database relating to three important ADME properties: hERG inhibition, solubility, and lipophilicity. Human ether-à-go-go-related gene (hERG) codes for the homonymous potassium ion channel protein. The hERG ion channel is an important antitarget in drug discovery because its inhibition is associated with potentially fatal heart conditions, and compounds are routinely assayed against it during lead optimization.22 A GSK fluorescence polarization (FP) in vitro assay was used to obtain the hERG inhibition measurements studied here. This assay relies on the binding of a fluorescent ligand to hERG membranes: potential hERG ligands compete with the fluorescent compound and cause a decrease in the FP signal. Solubility is one of the most important physicochemical properties in drug discovery: low-solubility compounds suffer from poor absorption and bioavailability after oral dosing, and they can also cause synthetic and developmental problems.23 The solubility measurements here came from a GSK chemiluminescent nitrogen detector (CLND) solubility assay. After the preparation of the sample from 10 mM dimethyl sulfoxide (DMSO) stock solution, the response from the detector is directly proportional to the number of nitrogen atoms in the sample; hence, by knowing this number beforehand, it is easy to determine the molecule’s concentration in solution.24 Finally, lipophilicity is another physicochemical property of major importance that directly affects both biological activity and ADMET properties.23 The lipophilicity measurements came from a GSK chromatographic log D assay that measures a molecule’s gradient retention time in reverse-phase high-performance liquid chromatography (HPLC) at a pH of 7.4. For each of the three data sets, measurements containing modifier values (i.e., reported as “>” or “