J. Chem. Inf. Comput. Sci. 2003, 43, 680-690
Neighborhood Behavior of in Silico Structural Spaces with Respect to in Vitro Activity Spaces - A Novel Understanding of the Molecular Similarity Principle in the Context of Multiple Receptor Binding Profiles
Dragos Horvath* and Catherine Jeandenans
Cerep, 128 rue Danton, 92506 Rueil-Malmaison, France
Received November 7, 2002
As a consequence of recent advances in the field of High Throughput Screening, the systematic testing (“in vitro profiling”) of compounds against a panel of targets covering different therapeutic areas is nowadays used to generate relevant information with respect to the in vivo behavior of drug candidates. However, the development of the chemoinformatics tools required for the exploitation of such data is still at an early stage. In this paper, a formalism for the analysis of activity profile vectors (describing the experimental responses of compounds in each of the considered activity tests) is introduced and applied to the study of the Neighborhood Behavior (NB; the hypothesis that structurally similar compounds display similar biological properties) of molecular similarity metrics. The experimental activity profiles define an Activity Space in which more than 500 drugs and reference compounds are positioned, their coordinates being inhibitory propensities in the included tests and unambiguously characterizing a molecule in terms of its receptor binding properties. While previous studies of Neighborhood Behavior had to rely on a loose classification of compounds in terms of the therapeutic areas they were designed for, here the NB of a calculated “in silico” similarity metric has been redefined as a relationship between intermolecular dissimilarity scores in the “structural” and “activity” spaces, respectively, and expressed in terms of two quantitative criteria: “consistency” (the propensity of the metric to selectively rank activity-related compound pairs among the structurally most similar pairs) and “completeness” (monitoring the retrieval rate of activity-related compound pairs among the best ranked pairs of structural neighbors). These criteria were used to calibrate and validate a similarity metric based on Fuzzy Bipolar Pharmacophore Fingerprints.
INTRODUCTION
Nowadays, the early assessment1 of potential liabilities of tens to hundreds of lead compounds, milestones of the path towards the drug candidate, plays a leading role in drug discovery. It relies on robotized High Throughput Profiling (HTP), monitoring the activities Am,t of molecules m (m = 1...Nmols) against a series of t = 1...Ntarget receptor and enzyme binding, functional (cell-based) and/or ADMET-related biological assays. The resulting experimental output, the “activity profiles” of compounds, provides an extensive characterization of in vitro biological properties (potency and selectivity), being therefore a decisional aid for lead prioritization and potentially useful for the understanding of in vivo drug properties. However, the analysis of these vast activity matrices2 Amt is not a straightforward data mining exercise, and the development of specific mathematical tools accounting for the peculiarities of the acquired activity data (percentages of inhibition/activation, IC50 values, ...) is still at its beginning. In this context, a compound m is represented by an activity profile vector Am = (Am,1, Am,2, ..., Am,t, ..., Am,Ntarget) denoting a point in an Activity Space (AS). The goal of this paper is the development of a mathematical formalism describing the neighborhood properties of the AS, e.g., the definition of activity profile (“in vitro”) similarity metrics subsequently used to optimize the Neighborhood Behavior of in silico similarity metrics.
The in silico assessment and exploitation of molecular similarity,3-5 aimed at enhancing the success rate of drug discovery programs,6,7 is one important trend in chemoinformatics and high throughput molecular modeling. The heuristic concept of similarity, used by medicinal chemists to classify active compounds on the basis of their structural features, became the underlying principle of computed molecular similarity scores. Molecular structures (further on denoted by M and m) can be represented as points of a structural space8 (SS) defined by their calculated vectors of molecular descriptors, in which the in silico dissimilarity metric9 Σ(m,M) represents the distance between these points. The computational cost of molecular descriptors10 has decreased significantly in recent years, notably due to the development of modeling approaches for combinatorial products.11-13 This has opened the way to a whole series of time-and-expense-saving computational applications based on the molecular similarity concept, including the following: (i) cluster13 analysis of the SS, (ii) the selection of a most diverse14 core sublibrary of minimal size, homogeneously covering the SS occupied by a given (virtual or real) compound set, and (iii) the search for nearest neighbors of an active compound within a collection of yet untested (or even unsynthesized) compounds.15
* Corresponding author phone: +33.1.55.94.84.49; fax: +33.1.55.94.84.10; e-mail: [email protected].
10.1021/ci025634z CCC: $25.00 © 2003 American Chemical Society Published on Web 02/07/2003
J. Chem. Inf. Comput. Sci., Vol. 43, No. 2, 2003 681
Table 1. Most Important Symbols and Abbreviations, in Alphabetical Order
symbol        meaning
Amt           activity of a molecule m in a test t included in the HTP panel
A/V           average/variance normalization of a variable, setting its average value at the origin and its variance as unit length of the transformed coordinate
ADME-T        absorption, distribution, metabolism, excretion-toxicity
AS            activity space; each axis stands for a test in the activity profile
χΣ(s)         consistency criterion at similarity threshold s, according to metric Σ, as defined in eq 10
FBPA          fuzzy bipolar pharmacophore autocorrelograms
HTP           High Throughput Profiling (testing of a compound against a target panel)
HTS           High Throughput Screening
FS            False Similar compound pairs: m and M being similar: Σ(m,M) < s, although activity-unrelated: Λ(m,M) > l
Λ(k)(m,M)     activity dissimilarity score of molecules M and m. If present, k relates to a specific scoring scheme.
l             activity dissimilarity threshold chosen within the validity range of Λ
NB            Neighborhood Behavior
Ntargets      number of biological assays in the activity panel
Nmols         number of molecules in the available data set
P             The set of all NP = 170 236 molecule pairs (m,M) ∈ P. The subset of similar pairs, P(Σ ...
PFD, Σk(m,M), s, SS, TS, TD    ... s, nor activity-related: Λ(m;M) > l
wk, γ         weighing factors specifying the relative importance of each pharmacophore feature k. γ controls the “fuzziness” of the metric
ΩΣ(s)         overall optimality criterion at similarity threshold s, according to metric Σ, as defined in eq 11
These procedures implicitly assume that the structural space displays a significant Neighborhood Behavior,16 i.e., that a subset of the structurally nearest neighbors of an active compound is likely to include more actives than expected to be found in any other, randomly chosen, set of the same size. Therefore, calculated similarity scores displaying maximal NB need to be discovered, and a quantitative NB estimator must be provided in order to serve as the objective function in the metric optimization procedure. To date,17-21 the NB of various metrics has been monitored with respect to a single molecular property at a time: the “main” activity of the compound in the therapeutic area it had been designed for. However, in the novel context of High Throughput Profiling, NB can be regarded as the property of a Structural Space to map onto the Activity Space in a way ensuring that neighboring points in the former are likely to correspond to neighboring points in the latter. Unlike previous works on this topic,17-21 this paper describes a novel perspective on the Neighborhood Behavior of in silico similarity metrics confronted with experimental activity profiles. In other words, the main question debated in this paper is whether and how the classical NB concept “Structural similarity implies a similar activity with respect to a given target” can be generalized to “Structural similarity implies similar activity profiles”. Prerequisites to this quest are as follows: (i) a quantitative definition of what “similar activity profiles” are, e.g., the introduction of adequate AS metrics quantifying the similarity of activity profiles (in a context where the measured activities Amt are inhibition percentages at 10 µM) and potentially accounting for the bias due to the nonorthogonality of the AS axes (due to the relatedness of some receptors in the test panel) and (ii) the definition of quantitative NB scores adapted to this context. These NB criteria are then used as objective functions for the choice of the optimal functional form and fittable parameters of a 3D similarity metric based on Fuzzy Bipolar Pharmacophore Autocorrelograms15 (FBPA). A validation study assessing the NB parameters of the optimal FBPA metrics against different compounds in a different activity space is undertaken in order to cross-check the extent to which the optimal parameters of this metric are biased by the actual choice of targets in the panel used for calibration. For a quick overview of the novel terminology used in this paper, see the glossary in Table 1.
EXPERIMENTAL DATA: TARGET PANEL AND TESTED COMPOUNDS
Although HTP experiments may include cellular and/or ADMET-related assays, in this work the test panel has been restricted to receptor binding and enzyme inhibition tests, which best illustrate the recognition of ligands by macromolecular targets. The panel of Ntargets = 42 tests defining the activity space considered in this work is given in Table 2. Additional information with respect to these tests is available.22 A total of Nmols = 584 drugs, reference ligands, and precursors used in the pharmaceutical industry have been profiled, measuring their average percentages of inhibition at 10 µM (duplicate testing). These inhibition values are suited for use in drug discovery,23,24 discriminating between the (micromolar or stronger) leads that may be starting points of drug candidate optimization and inactive molecules.
Table 2. Biological Test Panel Used in This Work (a)
label      description
A1h        human rec. adenosine
Alpha1     nonselective rat brain alpha1
Alpha2     nonselective rat brain alpha2
Beta1h     human rec. beta1
AT1h       human rec. angiotensin AT1
BZDc       rat brain central benzodiazepine
Bomb       nonselective rat brain bombesin
B2h        guinea pig ileum bradykinin 2
CCKAh      human rec. cholecystokinin A
D1h        human rec. dopamine D1
D2h        human rec. dopamine D2
DaUpt      human rec. dopamine transporter
ETAh       human rec. endothelin A receptor
Galan      nonselective rat striatum galanin
H1c        guinea pig central histamine
ML1        chicken brain melatonin
M1h        human rec. muscarinic M1
M3h        human rec. muscarinic M3
NK1h       human glioblastoma neurokinin 1
NPY        human rec. neuropeptide Y
Muh        human µ opiate
5HT1Ah     human rec. serotonin-1A
5HT1D      bovine caudate serotonin-1D
5HT2ch     human rec. serotonin-2C
5HT3h      human rec. serotonin-3
5HT6h      human rec. serotonin-6
5HTUpt     human rec. serotonin transporter
Sigma1     guinea pig brain sigma-1
V1Ah       human rec. vasopressin-1A
K-ATP      rat brain ATP-dep. K channel
Cl         chloride channel
CatB       human placenta cathepsin B
Elast      human leukocyte elastase
PDEII      human phosphodiesterase-II
PDEIV      human phosphodiesterase-IV
PKC        rat brain protein kinase C
EGF-TK     human EGF tyrosine kinase
PK55fyn    bovine thymus p55fyn kinase
TNFa       human TNF-R growth factor
HIVP       HIV protease
NEUpth     human norepinephrine transporter
MAPkin     rat recombinant MAP kinase
CGRP       human calcitonin gene-related peptide
(a) rec. = recombinant; dep. = dependent; nonselective binding involves the different receptor subtypes.
Both structural and activity dissimilarity scores have been generated for each of the NP = Nmols × (Nmols - 1)/2 = 170 236 possible compound pairs (m,M). In the following, this set of all compound pairs will be denoted by P, its size being NP.
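The pair count quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: the function name `count_pairs` is a hypothetical helper introduced here, not something taken from the paper.

```python
from itertools import combinations


def count_pairs(n_mols: int) -> int:
    """Number of unordered compound pairs (m, M) with m != M."""
    return n_mols * (n_mols - 1) // 2


# Size of the pair set P for the 584 profiled compounds.
n_pairs = count_pairs(584)

# Cross-check the closed form against explicit enumeration on a small set.
small_enumerated = sum(1 for _ in combinations(range(10), 2))
```

For Nmols = 584 the closed form gives 584 × 583 / 2 = 170 236, matching the value of NP stated in the text.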
SPACES AND METRICS: DEFINITIONS AND DISCUSSION
4.1. FBPA Structural Space. The 252-dimensional structural space of Fuzzy Bipolar Pharmacophore Autocorrelograms (FBPA) has been described in detail elsewhere.15 The FBPA of a molecule M, denoted as ΨM(a,b,∆), is a three-dimensional fingerprint monitoring the numbers of atom pairs within each of the 252 = 21 × 12 categories that can be defined in terms of the 21 combinations of Nf = 6 pharmacophoric features a,b ∈ {Hydrophobicity, Aromaticity, Hydrogen Bond Acceptor & Donor, Cation, Anion} times Nbin = 12 considered distance ranges ∆ ∈ {(3,4), (4,5), ..., (14,15)} [Å]. Several metrics are applicable to this space:
(i) The original15 fuzzy similarity metric Σ(m,M) relies on partial similarity scores δab(M,m) calculated with respect to the atom pair distributions matching each pharmacophore feature pair a,b:

$$\delta_{a,b}(m,M) = 1 - \frac{2(\Psi_m \otimes \Psi_M)_{a,b}}{(\Psi_M \otimes \Psi_M)_{a,b} + (\Psi_m \otimes \Psi_m)_{a,b}} \tag{1}$$

where

$$(\Psi_m \otimes \Psi_M)_{a,b} = \sum_{\Delta_1=1}^{N_{bin}} \sum_{\Delta_2=1}^{N_{bin}} \Psi_m(a,b,\Delta_1)\,\Psi_M(a,b,\Delta_2)\,\exp[-\gamma(\Delta_1 - \Delta_2)^2] \tag{2}$$

The global dissimilarity score is defined as a weighted average of the pharmacophore pair partial scores δa,b(m,M), with weighing factors wk associated to each pharmacophore feature:

$$\Sigma(m,M) = \frac{\sum_{a=1}^{N_f} \sum_{b=a}^{N_f} w_a w_b\,\delta_{a,b}(m,M)\,\big|_{(\Psi_M \otimes \Psi_M)_{a,b} + (\Psi_m \otimes \Psi_m)_{a,b} > 0}}{\sum_{a=1}^{N_f} \sum_{b=a}^{N_f} w_a w_b\,\big|_{(\Psi_M \otimes \Psi_M)_{a,b} + (\Psi_m \otimes \Psi_m)_{a,b} > 0}} \tag{3}$$

(ii) A fuzzy and weighted Dice9 metric, bypassing partial similarity score evaluation:

$$\Sigma(m,M) = 1 - \frac{2\sum_{a,b} w_a w_b (\Psi_m \otimes \Psi_M)_{a,b}}{\sum_{a,b} w_a w_b (\Psi_M \otimes \Psi_M)_{a,b} + \sum_{a,b} w_a w_b (\Psi_m \otimes \Psi_m)_{a,b}} \tag{4}$$

(iii) The usual Dice9 metric, a particular case of eq 4 with weighing factors set to 1 and an infinitely large “fuzziness” parameter γ from eq 2:

$$\Sigma(m,M) = 1 - \frac{2\sum_{a,b}\sum_{\Delta} \Psi_m(a,b,\Delta)\,\Psi_M(a,b,\Delta)}{\sum_{a,b}\sum_{\Delta} \Psi_m(a,b,\Delta)\,\Psi_m(a,b,\Delta) + \sum_{a,b}\sum_{\Delta} \Psi_M(a,b,\Delta)\,\Psi_M(a,b,\Delta)} \tag{5}$$
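The three scoring schemes of eqs 1-5 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: a fingerprint is assumed to be a dictionary mapping a feature pair (a, b) to a list of per-bin values, and all function names, the feature labels "Ar" and "HA", and the three-bin toy fingerprints are hypothetical.

```python
import math


def fuzzy_product(x, y, gamma):
    """Eq 2: sum over bin pairs of x[d1] * y[d2] * exp(-gamma * (d1 - d2)^2)."""
    return sum(
        xi * yj * math.exp(-gamma * (i - j) ** 2)
        for i, xi in enumerate(x)
        for j, yj in enumerate(y)
    )


def partial_score(x, y, gamma):
    """Eq 1 for one feature pair; None if both self-products vanish."""
    denom = fuzzy_product(x, x, gamma) + fuzzy_product(y, y, gamma)
    if denom <= 0:
        return None
    return 1.0 - 2.0 * fuzzy_product(x, y, gamma) / denom


def sigma_partial_average(fp_m, fp_M, w, gamma):
    """Eq 3: weighted average of partial scores over populated feature pairs."""
    num = den = 0.0
    for (a, b), x in fp_m.items():
        score = partial_score(x, fp_M[(a, b)], gamma)
        if score is not None:  # the "> 0" condition of eq 3
            num += w[a] * w[b] * score
            den += w[a] * w[b]
    if den == 0:
        # Assumption of this sketch: two empty fingerprints are an error.
        raise ValueError("no populated feature pair in either fingerprint")
    return num / den


def sigma_fuzzy_dice(fp_m, fp_M, w, gamma):
    """Eq 4: fuzzy, weighted Dice score without partial scores."""
    cross = self_m = self_M = 0.0
    for (a, b), x in fp_m.items():
        y = fp_M[(a, b)]
        ww = w[a] * w[b]
        cross += ww * fuzzy_product(x, y, gamma)
        self_m += ww * fuzzy_product(x, x, gamma)
        self_M += ww * fuzzy_product(y, y, gamma)
    return 1.0 - 2.0 * cross / (self_m + self_M)


def sigma_dice(fp_m, fp_M):
    """Eq 5: plain Dice score (unit weights, no cross-bin terms)."""
    cross = self_m = self_M = 0.0
    for key, x in fp_m.items():
        y = fp_M[key]
        cross += sum(xi * yi for xi, yi in zip(x, y))
        self_m += sum(xi * xi for xi in x)
        self_M += sum(yi * yi for yi in y)
    return 1.0 - 2.0 * cross / (self_m + self_M)


# Toy fingerprints: 2 feature pairs, 3 distance bins (illustrative values only).
weights = {"Ar": 1.0, "HA": 2.0}
unit_weights = {"Ar": 1.0, "HA": 1.0}
fp_1 = {("Ar", "Ar"): [1.0, 0.0, 0.0], ("Ar", "HA"): [0.0, 2.0, 0.0]}
fp_2 = {("Ar", "Ar"): [0.0, 0.0, 1.0], ("Ar", "HA"): [0.0, 2.0, 0.0]}
```

With a very large γ the cross-bin terms of eq 2 vanish, so `sigma_fuzzy_dice` with unit weights coincides with `sigma_dice`, which is the limiting case described in item (iii). In the toy example the two schemes of eqs 3 and 4 give different scores for the same pair, which illustrates why the functional form itself is treated as something to be chosen.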
An overview of the characteristics of the metrics used in this work is given in Table 3. The metric Σ7 from Table 3 uses the absolute number of atom pairs ΨM(a,b,∆). However, the newly introduced diversity scoring rules Σ1...6 make use of Average/Variance (A/V) normalized FBPA (see Table 1). Note that 0 ≤ Σ7 ≤ 1, while 0 ≤ Σ1...6 ≤ 2.

Table 3. In Silico Metrics Explored in This Work
metric     eq     value range and remarks
Σ1, Σ2     (4)    [0,2] - average/variance normalized FBPA metrics using best 2 (sub)optimal parametrization schemes
Σ3         (4)    [0,2] - normalized FBPA with equal weights wk = 1 and fuzziness factor γ = 0.32
Σ4         (3)    [0,2] - normalized FBPA with wk = 1 and γ = 0.32
Σ5         (5)    [0,2] - Dice metric with normalized FBPA
Σ6         (3)    [0,2] - normalized FBPA with parameters from ref 15
Σ7         (3)    [0,1] - FBPA metric as calibrated in ref 15

4.2. Activity Space. As already mentioned, each of the Ntargets biological tests of the HTP panel corresponds to one dimension of the Activity Space (AS), featuring inhibition percentages of 10 µM ligand solutions. Activity space metrics, further on denoted by Λ(m,M), have been defined as the number of targets with respect to which two compounds M and m display “significant potency differences”, using the rules of eq 6: if the degrees of inhibition of a target t by m and M differ by more than a high threshold value H% (typically 70%), then this difference is considered to be “significant” and Λ(m,M) is incremented by one unit. If the difference is less than a low threshold L% (30%), it is most likely accounted for by experimental noise and will be ignored. In intermediate situations, a subunitary linearly interpolated increment will be applied:

$$\Lambda^{(1)}(m,M) = \sum_{t=1}^{N_{targets}} \lambda_t(m,M), \quad \text{where} \quad \lambda_t(m,M) = \begin{cases} 1 & \text{if } |A_{M,t} - A_{m,t}| \ge H\% \\ 0 & \text{if } |A_{M,t} - A_{m,t}| \le L\% \\ \dfrac{|A_{M,t} - A_{m,t}| - L\%}{H\% - L\%} & \text{otherwise} \end{cases} \tag{6}$$

If however T and T′ are related targets, then the ability of the in silico metric to distinguish binders from nonbinders of T somehow implies its ability to do so with respect to T′. Its success with respect to the second target should not count, for it is the consequence of target interrelatedness and not a merit of the metric. Therefore, we have chosen to account for target intercorrelations by weighing the distance contribution λt(m,M) of the target t by a subunitary factor, unless t is “orthogonal” to all the others in the test panel, i.e., if the activities of the molecules against t are not correlated with the ones against other targets. To do so, a measure of the target interrelatedness R(T,t) has been adopted.

$$R(T,t) = \frac{2\sum_{m=1}^{N_{mols}} A_{m,T}\,A_{m,t}}{\sum_{m=1}^{N_{mols}} A_{m,T}^2 + \sum_{m=1}^{N_{mols}} A_{m,t}^2} \tag{7}$$

The weight ωT of the increment λT(m,M) contributed by target T to the activity distance between M and m becomes

$$\omega_T = 1 \Big/ \sum_{t=1}^{N_{targets}} R^*(T,t), \quad \text{where} \quad R^*(T,t) = \begin{cases} R(T,t) & \text{if } R(T,t) \ge R_{min} \\ 0 & \text{otherwise} \end{cases} \tag{8}$$
where ωT = 1 if R(T,t) < Rmin (typically 0.5) for any T ≠ t. The introduction of a minimal relevant correlation coefficient Rmin was necessary in order to focus on meaningful target correlations only, avoiding possible artifacts due to spurious summation of the numerous very small but positive R values of actually uncorrelated target pairs. It is now straightforward to define the correlation-corrected metric:

$$\Lambda^{(2)}(m,M) = \sum_{t=1}^{N_{targets}} \omega_t\,\lambda_t(m,M), \quad \text{where } \lambda_t(m,M) \text{ is defined in eq 6} \tag{9}$$
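The activity-space metrics of eqs 6-9 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names and the 4 × 3 toy activity matrix are hypothetical, the matrix is assumed to be stored as `activity[m][t]` (inhibition percentages), and every target is assumed to have at least one nonzero activity so that eq 7 is defined.

```python
def increment(a_m, a_M, low=30.0, high=70.0):
    """Eq 6: contribution of one target to the activity dissimilarity."""
    diff = abs(a_M - a_m)
    if diff >= high:
        return 1.0
    if diff <= low:
        return 0.0
    return (diff - low) / (high - low)  # linear interpolation between L and H


def lambda_1(profile_m, profile_M, low=30.0, high=70.0):
    """Eq 6: unweighted count of significant potency differences."""
    return sum(increment(x, y, low, high) for x, y in zip(profile_m, profile_M))


def relatedness(col_T, col_t):
    """Eq 7: interrelatedness of two targets from their activity columns."""
    num = 2.0 * sum(x * y for x, y in zip(col_T, col_t))
    den = sum(x * x for x in col_T) + sum(y * y for y in col_t)
    return num / den


def target_weights(activity, r_min=0.5):
    """Eq 8: one weight per target; R(T,T) = 1 keeps each sum >= 1."""
    n_targets = len(activity[0])
    cols = [[row[t] for row in activity] for t in range(n_targets)]
    weights = []
    for T in range(n_targets):
        total = 0.0
        for t in range(n_targets):
            r = relatedness(cols[T], cols[t])
            if r >= r_min:  # ignore weak, possibly spurious correlations
                total += r
        weights.append(1.0 / total)
    return weights


def lambda_2(profile_m, profile_M, weights, low=30.0, high=70.0):
    """Eq 9: correlation-corrected activity dissimilarity."""
    return sum(
        w * increment(x, y, low, high)
        for w, x, y in zip(weights, profile_m, profile_M)
    )


# Toy data: 4 molecules x 3 targets; targets 0 and 1 are strongly related.
activity = [
    [90.0, 85.0, 0.0],
    [10.0, 15.0, 0.0],
    [50.0, 45.0, 80.0],
    [0.0, 5.0, 95.0],
]
omega = target_weights(activity)
d1 = lambda_1(activity[0], activity[1])
d2 = lambda_2(activity[0], activity[1], omega)
```

In this toy example molecules 0 and 1 differ significantly on targets 0 and 1, so the uncorrected score is 2. Because those two targets are almost perfectly correlated, each receives a weight close to 0.5 and the corrected score drops to about 1: the pair is counted as differing on roughly one independent target.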
MOLECULAR SIMILARITY IN STRUCTURAL AND ACTIVITY SPACES; NEIGHBORHOOD BEHAVIOR: DEFINITIONS AND TERMINOLOGY
Molecular similarity is a pairwise property, inversely related to the distance between the points representing the two compounds in the given structural or activity space. Classically, the term “molecular similarity” is used in the context of a structural descriptor space, implicitly assuming that the employed similarity criteria are of structural (topological, electron-density, or pharmacophoric) nature. Accordingly, in the present work, any reference to “similarity” without any further specification will imply the structural similarity scores Σ. By “subset of similar compound pairs”, we mean the set of all molecule pairs (m,M) of structural dissimilarity score Σ(m,M) lower than a given structural dissimilarity threshold s: P(Σ < s) = {(m,M)|Σ(m,M) < s}. If (m,M) ∈ P(Σ < s), then m and M are said to be similar compounds. A meaningful definition of what similar molecules are therefore implies (1) the design of an optimal similarity metric Σ and (2) the choice of an associated optimal value for the dissimilarity threshold (a.k.a. “dissimilarity radius”) s. The current paper, however, also introduces a different category of molecular similarity metrics Λ(m,M), based not on structural criteria but on a difference in response within a panel of biological tests. To avoid the possible confusion arising from the indiscriminate use of “molecular similarity”, we will employ the term “activity-related” to denote a pair of compounds (m,M) having a level of activity panel dissimilarity below an activity dissimilarity threshold l. P(Λ < l) = {(m,M)| Λ(m,M) < l} is the subset of “activity-related” molecule pairs, by contrast to P(Σ