Development and Validation of the EVA Descriptor for QSAR Studies

Jul 7, 1999
Chapter 20

David B. Turner¹, Peter Willett¹, Allan M. Ferguson², and Trevor W. Heritage²

¹ Krebs Institute for Biomolecular Research and Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom
² Tripos Inc., 1699 South Hanley Road, St. Louis, MO 63144

QSAR models are of great importance in the rationalisation and prediction of the relative bioactivities of sets of compounds. Recently, 3D-QSAR techniques, such as CoMFA, have proved to be an effective means of correlating shape-related features with bioactivity, provided that a suitable relative alignment of the structures concerned can be found. Here we describe the EVA QSAR method. EVA, which is based upon IR-range vibrational frequencies, provides an alignment-free methodology and is shown to produce statistically robust QSARs comparable in most cases to results obtained with CoMFA. The method has been extensively validated on eleven different data sets. A brief discussion of conformational sensitivity is given together with an evaluation of the possibilities for model interpretation. We also report ongoing work centred upon using a genetic algorithm to provide models with enhanced predictivity over "classical" EVA QSAR.

Since the advent of classical QSAR techniques, exemplified by Hansch (1), there has been considerable progress in the development of molecular descriptors and chemometric techniques for use in such studies. The development of 3D-QSAR techniques (2) that attempt to correlate biological activity with the values of various types of molecular field, for example steric, electrostatic or hydrophobic, has been of particular interest (3-5). The original, and best known, of the 3D-QSAR techniques is Comparative Molecular Field Analysis (3) (CoMFA), which uses steric and electrostatic field values calculated at the intersections of a three-dimensional grid surrounding the structures in the data set. A major limitation of CoMFA, and most other 3D-QSAR techniques, is the dependency upon the relative orientation of the molecules in the data set (6,7). Despite efforts to improve the efficiency of the alignment process (7-10) the selection of the molecular alignment is regarded as the major variable in the analysis. These problems are further exacerbated when the conformational flexibility of the molecules in the data set is considered.


© 1999 American Chemical Society


There is, therefore, considerable interest in the development of new descriptions of molecular structure that do not require the alignment of molecules, but that retain the 3D and molecular property information encoded within molecular fields. Descriptions of molecular fields other than those used in CoMFA, or of molecular surface properties, for example methods based on autocorrelation vectors (11), molecular moments (12), or MS-WHIM descriptors (13), may provide effective orientation-independent descriptions of molecular structure. In this chapter we review a novel descriptor of molecular structure, known as EVA (EigenVAlue descriptor), that is derived from calculated infra-red (IR) range vibrational frequencies. As discussed later in this chapter, EVA has been found to yield robust QSAR models that are, for the most part, statistically comparable to those derived using CoMFA, with the advantage that EVA does not require structural alignment.

Rationale

During the late 1980s workers at Shell Research Limited (14) reasoned that a significant amount of information pertaining to molecular properties, in particular biological activity, might be contained within the molecular vibrational wavefunction, of which the vibrational spectrum is a fingerprint. The EVA descriptor is thus derived from normal co-ordinate EigenVAlues (i.e. the vibrational frequencies) that are either calculated theoretically or extracted from experimental IR spectra. Typically, a classical normal co-ordinate analysis (15) (NCA) is performed on an energy-minimized structure, and the resulting eigenvalues represent the normal mode frequencies from which the EVA descriptor is derived. The associated normal co-ordinate eigenvectors (i.e. the vibrational motions) are not used within the EVA descriptor.
The force constants upon which a normal co-ordinate analysis is dependent may be determined using a molecular mechanics, semi-empirical, or ab initio quantum mechanical method. The accuracy of the calculated vibrational eigenvalues is, therefore, determined entirely by the quality of the force constants applied or derived.

Determination of the EVA Descriptor

Using the standard Cartesian co-ordinate system as a basis for describing the displacement of an atom from its equilibrium position in a vibrating molecule requires 3N coordinates for a molecule containing N atoms. Three of these coordinates describe rigid-body translational motion, and a further three describe rigid-body rotations. Thus, in the general case for a molecule of N atoms there are 3N-6 vibrational degrees of freedom, or 3N-5 for a linear molecule such as acetylene (only two coordinates are required to fix the orientation). The number of vibrational degrees of freedom is equivalent to the number of fundamental vibrational frequencies (normal modes of vibration) of the molecule. Whilst each of these fundamental vibrations can be calculated, they may or may not appear in an experimental IR absorption spectrum due to symmetry considerations, i.e. they may have zero (or close to zero) intensity (15). Thus, in order to derive the EVA descriptor, each structure is initially characterized by 3N-6 (or 3N-5) vibrational modes.

In all but the special case where the molecules in the data set contain the same number of atoms it is not possible to compare the vibrational frequencies directly. This so-called dimensionality problem does not arise during a CoMFA analysis because the molecular fields arising from each molecule are calculated across a fixed set of lattice points; this would be an issue if, for example, one wanted to compare directly the atomic point charges from which the electrostatic fields are derived. Furthermore, even in cases in which it is desired to compare molecules that do contain the same number of atoms, and hence the same number of vibrational modes, it is difficult to establish which vibrations are directly comparable between molecules; this problem arises from inherent and effectively indeterminate contributions made by individual atoms to a given vibrational mode.

In EVA, the dimensionality of the descriptor is unified across the entire data set by a three-step procedure that involves transformation of the sets of vibrational frequencies onto a scale where they are directly comparable (i.e., a fixed-dimensional scale). In the initial step of this standardization process the frequency values are projected onto a bounded frequency scale (BFS) with individual vibrations represented by points on this axis. The bounds chosen for the BFS of 1 and 4,000 cm⁻¹ encompass the frequencies of all fundamental molecular vibrations and match the range observed for experimentally-derived IR spectra.

The second step in the standardization process requires that each calculated frequency is characterized in terms of a kernel of fixed height, width, and shape. Each of the calculated vibrations is weighted equally during this process. The resulting value associated with each of the calculated vibrations permits the proportion of overlap of vibrations to be determined, and may be considered analogous to, but in no way representative of, peak intensity. Infra-red intensity information is not used in the generation of the EVA descriptor and, as explained below, the technique is not intended to simulate experimental IR spectra.
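The mode counts behind the dimensionality problem can be illustrated with a minimal Python sketch. The function name `vibrational_mode_count` and the choice of example molecules are illustrative assumptions, not part of the EVA method itself; only the 3N-6 and 3N-5 rules come from the text.

```python
def vibrational_mode_count(n_atoms: int, linear: bool = False) -> int:
    """Number of fundamental vibrations: 3N-6, or 3N-5 for a linear molecule."""
    if n_atoms < 2:
        raise ValueError("a molecule needs at least two atoms to vibrate")
    # Remove 3 translations plus 3 rotations (only 2 rotations if linear).
    return 3 * n_atoms - (5 if linear else 6)


# Illustrative molecules only: (number of atoms, is the molecule linear?)
molecules = {
    "water (H2O)": (3, False),
    "acetylene (C2H2)": (4, True),
    "methanol (CH3OH)": (6, False),
    "benzene (C6H6)": (12, False),
}

mode_counts = {
    name: vibrational_mode_count(n_atoms, linear)
    for name, (n_atoms, linear) in molecules.items()
}

for name, count in mode_counts.items():
    print(f"{name}: {count} normal modes")
```

Because each molecule yields a frequency list of a different length, the raw lists cannot be used directly as columns of a single QSAR data matrix, which is the motivation for the standardization procedure.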

In practice, in the second step a Gaussian kernel of fixed standard deviation (σ) is placed over each vibrational frequency value for each structure, resulting in a series of 3N-6 (or 3N-5) identical, and overlapping, Gaussians (Figure 1). The value of the EVA descriptor, $\mathrm{EVA}_x$, at any chosen sampling point, $x$, on the bounded frequency scale is then determined by summing the amplitudes of each and every one of the 3N-6 (or 3N-5) overlaid Gaussian kernels at that point:

$$\mathrm{EVA}_x = \sum_{i=1}^{3N-6} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - f_i)^2}{2\sigma^2}\right)$$

where $f_i$ is the $i$th frequency for the structure concerned.

It is important at this stage to reiterate that the purpose of the above EVA smoothing procedure is not an attempt to simulate the infra-red spectrum of the molecule of interest, since the transition dipole data is ignored, but rather it is to provide a basis upon which vibrations occurring at slightly different frequencies may be compared to one another. The Gaussian function applied to define peak shapes adds a probabilistic element in that the peak maxima are centered at each of the calculated frequency values ($f_i$) and thus these points are taken to be the most probable values of the respective frequencies. An EVA descriptor sampled at a point for which $x \neq f_i$ is thus considered to be a less probable value of the frequency. In such cases, the corresponding contribution of $f_i$ to the final value of the EVA descriptor at point $x$ ($\mathrm{EVA}_x$) will be less than the maximum possible contribution. To a certain extent, this behavior of the EVA descriptor reduces the dependency of the final QSAR model on the accuracy of the original calculated vibrational frequencies, which are sensitive to the molecule geometry optimization criteria and to the theoretical approximations or empirically based parameters of the chosen modeling paradigm (whether quantum, semi-empirical or molecular mechanics). Furthermore, and as discussed in detail below, this behavior has implications for the sensitivity of the descriptor to molecular conformation, in that small changes in vibrational frequencies arising from conformational changes may have insignificant effects on the resulting EVA descriptor values.

Figure 1. Profile of the "EVA curve" for three arbitrary vibrational frequencies expanded using a σ of 10 cm⁻¹. The "EVA curve" is determined by summing the estimated "intensities" (amplitudes) of the vibrations centred at 28 cm⁻¹ and 51 cm⁻¹, respectively. The EVA descriptor is extracted by sampling the frequency scale at fixed intervals of L cm⁻¹. Horizontal axis: vibrational frequency wavenumber (cm⁻¹). (Reproduced from Ref. 21 with kind permission from Kluwer Academic Publishers).

In the third, and final, step of the data standardization process the EVA function is sampled at fixed increments of L cm⁻¹ along the BFS, giving the 4,000/L values comprising the EVA descriptor. Typically, a descriptor set is derived using a Gaussian standard deviation (σ) term of 10 cm⁻¹ and a sampling increment (L) of 5 cm⁻¹, giving 800 descriptor variables. As is the case with CoMFA, the dimensionality of the EVA descriptor is much larger than the number of compounds in a typical QSAR data set and thus data reduction methods, such as Partial Least Squares to Latent Structures (16) (PLS) or Principal Components Regression (PCR), are applied to search for robust correlations with biological activity data.

For most molecules of interest to a medicinal chemist geometry optimisation and normal mode calculation is the time-limiting step. However, if molecular mechanics methods are used this is only liable to take about a minute per structure, depending mainly upon N and the hardware available. Therefore, whilst slower than CoMFA field calculations, the time needed for frequency calculations need not be prohibitive for QSAR data sets of typical size.
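The three-step standardization described above (projection onto the bounded frequency scale, equal-weight Gaussian expansion, and sampling at fixed increments) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `eva_descriptor`, the placement of the sampling points at L, 2L, ..., 4,000 cm⁻¹, and the example frequency lists are all hypothetical.

```python
import math


def eva_descriptor(frequencies, sigma=10.0, increment=5.0, upper=4000.0):
    """Sum equal-weight Gaussians centred on each frequency, then sample the sum.

    Assumption: sampling points lie at increment, 2*increment, ..., upper,
    which yields upper/increment values (800 for the typical settings).
    """
    n_points = int(round(upper / increment))
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))  # height of a single kernel
    descriptor = []
    for k in range(1, n_points + 1):
        x = k * increment
        # Every vibration contributes with the same weight; IR intensities are ignored.
        value = sum(
            norm * math.exp(-((x - f) ** 2) / (2.0 * sigma ** 2))
            for f in frequencies
        )
        descriptor.append(value)
    return descriptor


# Hypothetical frequency lists (cm^-1) for two structures of different size.
freqs_a = [650.0, 1200.0, 3100.0]
freqs_b = [655.0, 1200.0, 1450.0, 2950.0, 3100.0, 3400.0]

eva_a = eva_descriptor(freqs_a)
eva_b = eva_descriptor(freqs_b)

# Both structures now share one fixed-length representation.
print(len(eva_a), len(eva_b))

# Index 129 corresponds to x = 650 cm^-1; the 5 cm^-1 shift in structure b
# lowers the sampled value there but does not remove the overlap.
print(eva_a[129], eva_b[129])
```

With the typical settings quoted in the text (σ = 10 cm⁻¹, L = 5 cm⁻¹) every structure is mapped to 800 values regardless of its number of atoms, so the resulting vectors can be stacked into a data matrix for PLS or PCR.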


Applications of the EVA Descriptor in QSAR/QSPR Studies

One of the first demonstrations in the public domain of the regressive modeling capability of the EVA descriptor was obtained in a QSPR study (17) using Cramer's BCDEF data set (18). The data set consists of measured logP values for a highly heterogeneous set of 135 small organic chemicals, ranging from polycyclic aromatics, such as the highly lipophilic phenanthrene (logP = 4.46), to small hydrophilic moieties, including methanol (logP = 0.64). The EVA descriptors were derived using a 10 cm⁻¹ Gaussian