Chapter 20
Development and Validation of the EVA Descriptor for QSAR Studies

David B. Turner¹, Peter Willett¹, Allan M. Ferguson², and Trevor W. Heritage²

¹Krebs Institute for Biomolecular Research and Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, United Kingdom
²Tripos Inc., 1699 South Hanley Road, St. Louis, MO 63144

QSAR models are of great importance in the rationalisation and prediction of the relative bioactivities of sets of compounds. Recently, 3D-QSAR techniques, such as CoMFA, have proved to be an effective means of correlating shape-related features with bioactivity, provided that a suitable relative alignment of the structures concerned can be found. Here we describe the EVA QSAR method. EVA, which is based upon IR-range vibrational frequencies, provides an alignment-free methodology and is shown to produce statistically robust QSARs comparable in most cases to results obtained with CoMFA. The method has been extensively validated on eleven different data sets. A brief discussion of conformational sensitivity is given together with an evaluation of the possibilities for model interpretation. We also report ongoing work centred upon using a genetic algorithm to provide models with enhanced predictivity over "classical" EVA QSAR.

Since the advent of classical QSAR techniques, exemplified by Hansch (1), there has been considerable progress in the development of molecular descriptors and chemometric techniques for use in such studies. The development of 3D-QSAR techniques (2) that attempt to correlate biological activity with the values of various types of molecular field, for example steric, electrostatic or hydrophobic, has been of particular interest (3-5). The original, and best known, of the 3D-QSAR techniques is Comparative Molecular Field Analysis (3) (CoMFA), which uses steric and electrostatic field values calculated at the intersections of a three-dimensional grid surrounding the structures in the data set. A major limitation of CoMFA, and most other 3D-QSAR techniques, is the dependency upon the relative orientation of the molecules in the data set (6,7). Despite efforts to improve the efficiency of the alignment process (7-10), the selection of the molecular alignment is regarded as the major variable in the analysis. These problems are further exacerbated when the conformational flexibility of the molecules in the data set is considered.
© 1999 American Chemical Society
There is, therefore, considerable interest in the development of new descriptions of molecular structure that do not require the alignment of molecules, but that retain the 3D and molecular property information encoded within molecular fields. Descriptions of molecular fields other than those used in CoMFA, or descriptions of molecular surface properties, for example methods based on autocorrelation vectors (11), molecular moments (12), or MS-WHIM descriptors (13), may provide effective orientation-independent descriptions of molecular structure. In this chapter we review a novel descriptor of molecular structure, known as EVA (EigenVAlue descriptor), that is derived from calculated infra-red (IR) range vibrational frequencies. As discussed later in this chapter, EVA has been found to yield robust QSAR models that are, for the most part, statistically comparable to those derived using CoMFA, with the advantage that EVA does not require structural alignment.

Rationale

During the late 1980s workers at Shell Research Limited (14) reasoned that a significant amount of information pertaining to molecular properties, in particular biological activity, might be contained within the molecular vibrational wavefunction, of which the vibrational spectrum is a fingerprint. The EVA descriptor is thus derived from normal co-ordinate EigenVAlues (i.e. the vibrational frequencies) that are either calculated theoretically or extracted from experimental IR spectra. Typically, a classical normal co-ordinate analysis (15) (NCA) is performed on an energy minimized structure, and the resulting eigenvalues represent the normal mode frequencies from which the EVA descriptor is derived. The associated normal coordinate eigenvectors (i.e., the vibrational motions) are not used within the EVA descriptor.
The force constants upon which a normal co-ordinate analysis is dependent may be determined using a molecular mechanics, semi-empirical, or ab initio quantum mechanical method. The accuracy of the calculated vibrational eigenvalues is, therefore, determined entirely by the quality of the force constants applied or derived.

Determination of the EVA Descriptor

Using the standard Cartesian co-ordinate system as a basis for describing the displacement of an atom from its equilibrium position in a vibrating molecule requires 3N coordinates for a molecule containing N atoms. Three of these coordinates describe rigid-body translational motion, and a further three describe rigid-body rotations. Thus, in the general case for a molecule of N atoms there are 3N-6 vibrational degrees of freedom, or 3N-5 for a linear molecule such as acetylene (only two coordinates are required to fix the orientation). The number of vibrational degrees of freedom is equivalent to the number of fundamental vibrational frequencies (normal modes of vibration) of the molecule. Whilst each of these fundamental vibrations can be calculated, they may or may not appear in an experimental IR absorption spectrum due to symmetry considerations, i.e. they may have zero (or close to zero) intensity (15).

Thus, in order to derive the EVA descriptor, each structure is initially characterized by 3N-6 (or 3N-5) vibrational modes. In all but the special case where the molecules in the data set contain the same number of atoms it is not possible to compare the vibrational frequencies directly. This so-called dimensionality problem does not arise during a CoMFA analysis because the molecular fields arising from each molecule are calculated across a fixed set of lattice points; this would be an issue if, for example, one wanted to compare directly the atomic point charges from which the electrostatic fields are derived. Furthermore, even in cases in which it is desired to compare molecules that do contain the same number of atoms, and hence the same number of vibrational modes, it is difficult to establish which vibrations are directly comparable between molecules; this problem arises from inherent and effectively indeterminate contributions made by individual atoms to a given vibrational mode.

In EVA, the dimensionality of the descriptor is unified across the entire data set by a three-step procedure that involves transformation of the sets of vibrational frequencies onto a scale where they are directly comparable (i.e., a fixed-dimensional scale). In the initial step of this standardization process the frequency values are projected onto a bounded frequency scale (BFS) with individual vibrations represented by points on this axis. The bounds chosen for the BFS of 1 and 4,000 cm⁻¹ encompass the frequencies of all fundamental molecular vibrations and match the range observed for experimentally-derived IR spectra.

The second step in the standardization process requires that each calculated frequency is characterized in terms of a kernel of fixed height, width, and shape. Each of the calculated vibrations is weighted equally during this process. The resulting value associated with each of the calculated vibrations permits the proportion of overlap of vibrations to be determined, and may be considered analogous to, but in no way representative of, peak intensity. Infra-red intensity information is not used in the generation of the EVA descriptor and, as explained below, the technique is not intended to simulate experimental IR spectra.
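The dimensionality problem described above can be made concrete with a minimal sketch that counts normal modes from the 3N-6 (or 3N-5) rule. The function name `vibrational_mode_count` and the choice of example molecules are illustrative assumptions and are not taken from the EVA software; the atom counts follow from the standard molecular formulae.

```python
def vibrational_mode_count(n_atoms: int, linear: bool = False) -> int:
    """Return the number of normal modes: 3N-6, or 3N-5 for a linear molecule."""
    if n_atoms < 2:
        raise ValueError("a molecule needs at least two atoms to vibrate")
    if not linear and n_atoms < 3:
        raise ValueError("a diatomic molecule is necessarily linear")
    # Remove 3 translations plus 3 rotations (only 2 rotations if linear).
    return 3 * n_atoms - (5 if linear else 6)


# Hypothetical illustration: three molecules of different size.
examples = {
    "acetylene (C2H2, linear)": vibrational_mode_count(4, linear=True),
    "methanol (CH4O)": vibrational_mode_count(6),
    "phenanthrene (C14H10)": vibrational_mode_count(24),
}

for name, modes in examples.items():
    print(f"{name}: {modes} normal modes")
```

Because the three frequency lists have different lengths, they cannot be used directly as columns of a common descriptor matrix, which is the motivation for the standardization procedure.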
In practice, in the second step a Gaussian kernel of fixed standard deviation (σ) is placed over each vibrational frequency value for each structure, resulting in a series of 3N-6 (or 3N-5) identical, and overlapping, Gaussians (Figure 1). The value of the EVA descriptor, $\mathrm{EVA}_x$, at any chosen sampling point, x, on the bounded frequency scale is then determined by summing the amplitudes of each and every one of the 3N-6 (or 3N-5) overlaid Gaussian kernels at that point:

$$\mathrm{EVA}_x = \sum_{i=1}^{3N-6} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - f_i)^2}{2\sigma^2}\right)$$

where $f_i$ is the i-th frequency for the structure concerned.

It is important at this stage to reiterate that the purpose of the above EVA smoothing procedure is not an attempt to simulate the infra-red spectrum of the molecule of interest, since the transition dipole data is ignored, but rather it is to provide a basis upon which vibrations occurring at slightly different frequencies may be compared to one another. The Gaussian function applied to define peak shapes adds a probabilistic element in that the peak maxima are centered at each of the calculated frequency values ($f_i$) and thus these points are taken to be the most probable values of the respective frequencies. An EVA descriptor sampled at a point for which $x \neq f_i$ is thus considered to be a less probable value of the frequency. In such cases, the corresponding contribution of $f_i$ to the final value of the EVA descriptor at point x ($\mathrm{EVA}_x$) will be less than the maximum possible contribution. To a certain extent, this behavior of the EVA descriptor reduces the dependency of the final QSAR model on the accuracy of the original calculated vibrational frequencies, which are sensitive to the molecule geometry optimization criteria and to the theoretical approximations or empirically based parameters of the chosen modeling paradigm (whether quantum, semi-empirical or molecular mechanics). Furthermore, and as discussed in detail below, this behavior has implications for the sensitivity of the descriptor to molecular conformation, in that small changes in vibrational frequencies arising from conformational changes may have insignificant effects on the resulting EVA descriptor values.

Figure 1. Profile of the "EVA curve" for three arbitrary vibrational frequencies expanded using a σ of 10 cm⁻¹. The "EVA curve" is determined by summing the estimated "intensities" (amplitudes) of the vibrations centred at 28 cm⁻¹ and 51 cm⁻¹, respectively. The EVA descriptor is extracted by sampling the frequency scale at fixed intervals of L cm⁻¹. Axis label: Vibrational Frequency Wavenumber (cm⁻¹). (Reproduced from Ref. 21 with kind permission from Kluwer Academic Publishers.)

In the third, and final, step of the data standardization process the EVA function is sampled at fixed increments of L cm⁻¹ along the BFS, giving the 4,000/L values comprising the EVA descriptor. Typically, a descriptor set is derived using a Gaussian standard deviation (σ) term of 10 cm⁻¹ and a sampling increment (L) of 5 cm⁻¹, giving 800 descriptor variables. As is the case with CoMFA, the dimensionality of the EVA descriptor is much larger than the number of compounds in a typical QSAR data set and thus data reduction methods, such as Partial Least Squares to Latent Structures (16) (PLS) or Principal Components Regression (PCR), are applied to search for robust correlations with biological activity data.

For most molecules of interest to a medicinal chemist, geometry optimisation and normal mode calculation is the time-limiting step. However, if molecular mechanics methods are used this is only liable to take about a minute per structure, depending mainly upon N and the hardware available. Therefore, whilst slower than CoMFA field calculations, the time needed for frequency calculations need not be prohibitive for QSAR data sets of typical size.

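The three-step standardization described above (bounded frequency scale, equal-weight Gaussian kernels, fixed-interval sampling) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `eva_descriptor`, the placement of the sampling points at L, 2L, ..., 4,000 cm⁻¹, and the frequency lists are all assumptions made for the example.

```python
import math


def eva_descriptor(frequencies, sigma=10.0, step=5.0, upper=4000.0):
    """Sum equal-weight Gaussians centred on each frequency and sample the
    resulting curve at fixed increments along the bounded frequency scale."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    n_points = int(round(upper / step))  # 4,000 / L descriptor variables
    # Assumption: sampling points lie at step, 2*step, ..., upper.
    points = [step * (k + 1) for k in range(n_points)]
    return [
        sum(norm * math.exp(-((x - f) ** 2) / (2.0 * sigma ** 2))
            for f in frequencies)
        for x in points
    ]


# Hypothetical frequency lists (cm^-1) for two structures with different
# numbers of normal modes; the values are invented for illustration.
mol_a = [28.0, 51.0, 1650.0]
mol_b = [30.0, 51.0, 1650.0, 2950.0]

d_a = eva_descriptor(mol_a)
d_b = eva_descriptor(mol_b)
print(len(d_a), len(d_b))  # prints: 800 800
```

Both structures yield 800 values despite having different numbers of modes, so the descriptors can be stacked into a single matrix for PLS or PCR. In this sketch the shift from 28 to 30 cm⁻¹ in the second list alters only the values sampled near those frequencies, and by a modest amount, because the two kernels overlap almost completely at σ = 10 cm⁻¹.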
Applications of the EVA Descriptor in QSAR/QSPR Studies

One of the first demonstrations in the public domain of the regressive modeling capability of the EVA descriptor was obtained in a QSPR study (17) using Cramer's BCDEF data set (18). The data set consists of measured log P values for a highly heterogeneous set of 135 small organic chemicals, ranging from polycyclic aromatics, such as the highly lipophilic phenanthrene (log P = 4.46), to small hydrophilic moieties, including methanol (log P = 0.64). The EVA descriptors were derived using a 10 cm⁻¹ Gaussian