Chapter 14
Molecular Hologram QSAR
Trevor W. Heritage and David R. Lowis
Downloaded by UNIV OF ARIZONA on June 5, 2013 | http://pubs.acs.org Publication Date: July 7, 1999 | doi: 10.1021/bk-1999-0719.ch014
Tripos Inc., 1699 S. Hanley Road, St. Louis, MO 63144
QSAR techniques have proven to be extremely valuable in pharmaceutical research, particularly 3D-QSAR. However, the complexity of descriptor calculation, conformer generation, and structural alignment renders the use of this type of QSAR non-trivial. Furthermore, demands for analysis of large data sets such as those generated by combinatorial chemistry and high throughput screening have compounded this problem. Molecular Hologram QSAR (HQSAR) is a new technique that employs specialized fragment fingerprints (molecular holograms) as predictive variables of biological activity. By eliminating the need for molecular alignment, HQSAR models can be obtained more rapidly than other techniques, rendering them applicable to both small and large data sets. HQSAR models are comparable in predictive ability to those derived from 3D-QSAR techniques and can readily be extended to support chemical database searching.
© 1999 American Chemical Society
In Rational Drug Design; Parrill, A., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1999.
Since the works of Hansch and Fujita (1) and Free and Wilson (2) demonstrated the successful application of theoretical and computational methods to understanding and predicting biological activity, there has been considerable progress in the development of molecular descriptors and chemometric techniques. The entire field of Quantitative Structure-Activity Relationships (QSAR) has arisen, based upon the underlying assumption that the variations in biological activity within a series of molecules can be correlated with changes in measured or computed molecular features or properties of those molecules. Of particular interest has been the development of 3D QSAR techniques that attempt to correlate biological activity with the values of various types of molecular field, for example, steric, electronic, or hydrophobic (3).
The most popular method of 3D QSAR in use today, Comparative Molecular Field Analysis (CoMFA) (4), uses steric and electrostatic field values computed at the intersections of a three-dimensional grid that surrounds the molecules in the data set. Although numerous successes in the use of 3D QSAR to predict biological activity have been reported, there remains the major limitation that the molecules in the data set must be mutually aligned based on some consistent rule or strategy (3). Several approaches (5-6) to alleviate this problem have been attempted with only moderate success, and coupled with the conformational flexibility of the molecules in the data set, this problem remains the major barrier to 3D QSAR.
As a consequence, there is considerable interest in the development of alternative descriptions of molecular structure that do not require the alignment of molecules, such as autocorrelation vectors (7), molecular moment analysis (8-9), vibrational eigenvalue analysis (EVA) (10), and 3D WHIM descriptors (11). In this chapter, we review a new descriptor of molecular structure, known as the Molecular Hologram, that is based solely on 2D connectivity information. As discussed later in this chapter, Molecular Holograms yield statistically robust QSAR models that are comparable, in statistical terms, to those derived using 3D QSAR techniques, with the key advantage that no 3D structure or molecular alignment is required.
Molecular Hologram QSAR Methodology
Molecular Hologram QSAR (HQSAR) involves the identification of those substructural features (fragments) in sets of molecules that are relevant to biological activity. A key differentiator of this method relative to other fragment-based methods, such as Free-Wilson or CASE analyses, is that the Molecular Holograms generated encode all possible fragments, including branched, cyclic, and overlapping fragments. Thus, each atom in a molecule will occur in multiple fragments and therefore increment several bins in the Molecular Hologram. Unlike maximal common subgraph algorithms and the Stigmata approach, which seek structural commonalities, HQSAR yields a predictive relationship between substructural features in the data set and biological activity using the Partial Least Squares (PLS) technique (17).
Molecular Hologram Generation. A Molecular Hologram is a linear array of integers containing counts of molecular fragments, and originates from the traditional binary 2D fingerprints employed in database searching and molecular diversity applications. The process of hologram generation is depicted in Figure 1. The input data set consists of the 2D chemical structures and the corresponding biological responses. The molecular structures are broken down into all possible linear, branched, and cyclic combinations of connected atoms (fragments) containing between M and N atoms. Each unique fragment in the data set is assigned a pseudo-random, large positive integer value by means of a cyclic redundancy check (CRC) algorithm. Two key properties of the CRC algorithm are that (i) very few "collisions" between fragments are observed, that is, essentially every unique fragment is assigned a unique integer value, and (ii) the integer value assigned to a particular fragment is always reproducible for that fragment, even between runs. Each of these integers is then "folded", or hashed (18), into a bin (or position) in an integer array of fixed length L (L is generally a prime number between 50 and 500). The occupancy value of each bin is then incremented according to the number of fragments hashing to that bin. Thus, all generated fragments are hashed into array bins in the range 1 to L. This array is called a Molecular Hologram, and the associated bin occupancies are the descriptor variables.
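The generation scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the HQSAR implementation: fragments are enumerated by brute force as connected atom subsets, the fragment label is a deliberately simplistic, hypothetical string (sorted element symbols plus the number of internal bonds), `zlib.crc32` stands in for the CRC algorithm, folding is assumed to be a modulo operation, and bins are indexed from 0 rather than from 1.

```python
import zlib
from itertools import combinations


def is_connected(atoms, bonds):
    """Return True if the given atom indices form a single connected piece."""
    atoms = set(atoms)
    seen, stack = set(), [min(atoms)]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for i, j in bonds:
            if i == current and j in atoms:
                stack.append(j)
            elif j == current and i in atoms:
                stack.append(i)
    return seen == atoms


def enumerate_fragments(symbols, bonds, min_atoms, max_atoms):
    """List a label for every connected atom subset with min_atoms..max_atoms atoms."""
    labels = []
    for size in range(min_atoms, max_atoms + 1):
        for subset in combinations(range(len(symbols)), size):
            if not is_connected(subset, bonds):
                continue
            inner_bonds = sum(1 for i, j in bonds if i in subset and j in subset)
            # Hypothetical, simplistic label (an assumption of this sketch);
            # a real implementation would need a canonical fragment name.
            name = "".join(sorted(symbols[a] for a in subset))
            labels.append(name + "|" + str(inner_bonds))
    return labels


def molecular_hologram(labels, length):
    """Hash each fragment label with CRC32 and count it in one of `length` bins."""
    hologram = [0] * length
    for label in labels:
        crc = zlib.crc32(label.encode("utf-8"))  # reproducible between runs
        hologram[crc % length] += 1              # fold into a bin (assumed: modulo)
    return hologram


# Heavy atoms of ethanol: C0-C1-O2, fragments of 2 to 3 atoms (M = 2, N = 3).
symbols = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
labels = enumerate_fragments(symbols, bonds, 2, 3)
hologram = molecular_hologram(labels, 53)
```

Because identical labels always give the same CRC value, repeated fragments increment the same bin, and the sum of all bin occupancies equals the number of fragments generated.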
The hashing function is used to reduce the dimensionality of the Molecular Hologram descriptor, but leads to a phenomenon known as "fragment collision". During fragment generation, identical fragments always hash to the same bin (since they have the same CRC number), and the corresponding occupancy for that bin is incremented. However, since the hologram length is, in most cases, considerably smaller than the total number of unique fragments encountered in the data set, different unique fragments will be hashed to the same bin, causing "collisions" between fragments. This is discussed further in the section on hologram length.
Hologram QSAR Model Building. Computation of the Molecular Holograms for a data set of structures yields a data matrix (X) of dimension R × L, where R is the number of compounds in the data set and L is the length of the Molecular Hologram. For QSAR purposes, a matrix of target variables (biological activities) (Y) is also created. Standard PLS analysis is then applied to identify a set of orthogonal explanatory variables (components) that are linear combinations of the original L variables. Leave-one-out cross-validation is applied to determine the number of components that yields an optimally predictive model.
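The leave-one-out procedure mentioned above can be illustrated with a short Python sketch. It assumes the commonly used definition q² = 1 − PRESS/SS, which is not spelled out in this chapter, and it does not implement PLS: the two models below (a mean predictor and a one-descriptor least-squares line) are plainly stand-ins, and the function names are hypothetical.

```python
def loo_q2(x_rows, y, fit, predict):
    """Leave-one-out cross-validated q2, assumed here to be 1 - PRESS / SS."""
    n = len(y)
    y_mean = sum(y) / n
    press = 0.0
    for i in range(n):
        # Refit the model with compound i left out, then predict compound i.
        train_x = x_rows[:i] + x_rows[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_x, train_y)
        press += (y[i] - predict(model, x_rows[i])) ** 2
    ss = sum((value - y_mean) ** 2 for value in y)
    return 1.0 - press / ss


def fit_mean(train_x, train_y):
    """Stand-in model 1: always predict the training-set mean."""
    return sum(train_y) / len(train_y)


def predict_mean(model, row):
    return model


def fit_line(train_x, train_y):
    """Stand-in model 2: least-squares line on the first descriptor only."""
    xs = [row[0] for row in train_x]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(train_y) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(xs, train_y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_line(model, row):
    slope, intercept = model
    return slope * row[0] + intercept


# Toy data: one descriptor column, activity exactly linear in it.
x_rows = [[0], [1], [2], [3], [4]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
q2_line = loo_q2(x_rows, y, fit_line, predict_line)
q2_mean = loo_q2(x_rows, y, fit_mean, predict_mean)
```

In an HQSAR-style workflow the same function would presumably be called once per candidate number of PLS components, keeping the number that gives the highest q². Note that a model with no predictive value, such as the mean predictor, gives a negative q².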
Figure 1. Generation of Molecular Holograms. (Flowchart: structure, fragment generation, generation of a CRC number for each fragment, hashing of the CRC number into a hologram of length L, bin increment.)
Once an optimal model is identified, PLS yields a mathematical equation that relates the Molecular Hologram bin values to the corresponding biological activity of each compound in the data set. The generated QSAR model has the form:
\[ \mathrm{Activity}_i = c_0 + \sum_{l=1}^{L} x_{il}\, c_l \]
where $x_{il}$ is the occupancy value of the Molecular Hologram of compound $i$ at position or bin $l$, $c_l$ is the coefficient for that bin derived from the PLS analysis, $L$ is the length of the hologram, $\mathrm{Activity}_i$ is the biological activity of compound $i$, and $c_0$ is a constant.
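As a worked illustration of the model equation, the following Python sketch evaluates the constant plus the sum of bin occupancy times bin coefficient for one compound. The hologram, coefficients, and constant are invented toy values for a hologram of length 4, and the function name is hypothetical.

```python
def predict_activity(hologram, coefficients, constant):
    """Evaluate constant + sum over bins of occupancy * coefficient."""
    if len(hologram) != len(coefficients):
        raise ValueError("hologram and coefficient vector must have the same length")
    return constant + sum(x * c for x, c in zip(hologram, coefficients))


# Toy values (not from any real model): four bins, constant term 4.0.
hologram = [2, 0, 1, 3]
coefficients = [0.5, -1.0, 0.25, 0.0]
activity = predict_activity(hologram, coefficients, 4.0)
```

Here the first bin contributes 2 × 0.5, the third contributes 1 × 0.25, and the other two contribute nothing, so the predicted activity is 4.0 + 1.25.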
HQSAR Parameters. As is the case with all other QSAR methods, careful selection of parameters is critical to the success of HQSAR. The key parameters involved in the generation of molecular holograms are hologram length (L), fragment size (M and N), and the parameters that control how different fragments are distinguished: atoms, bonds, connections, hydrogens, and chirality.
Hologram Length. The hologram length controls the number of bins in the hologram fingerprint. Since the hologram length is significantly less than the number of fragments in most compounds, alteration of this parameter causes the pattern of bin occupancies to change. The effect of this is to alter the distribution and frequency of fragment collisions. During HQSAR analyses it is important to compare and contrast models generated at several different hologram lengths to ensure that the result observed is not merely an artifact of fragment collisions; lack of consistency in the PLS results at several lengths is a good indication that this phenomenon is occurring. The use of prime number hologram lengths ensures that different fragment collision patterns are observed at each length.
Fragment Size. Fragment size controls the minimum (M) and maximum (N) number of atoms contained within any fragment. These parameters can be changed to bias the analysis toward smaller or larger fragments.
Fragment Distinction. Depending on the application and data set in question, HQSAR allows fragments to be distinguished based on atoms, bonds, connections, hydrogens, and chirality parameters.
The atoms parameter enables fragments to be distinguished based on the elemental atom types they contain, for example, allowing benzene to be distinguished from pyridine. The bonds parameter enables fragments to be distinguished based on bond orders, for example, in the absence of hydrogens, allowing butane to be distinguished from 2-butene.
The connections parameter provides a measure of atomic hybridization states within fragments. That is, with connections on, fragments are distinguished based on the number and type of bonds made to their constituent atoms.
The hydrogens parameter enables the generation of fragments that include hydrogen atoms. A consequence of setting this option is that many more fragments are generated. The chirality parameter enables fragments to be distinguished based on atomic or bond stereochemistry. Thus, this parameter allows fragments containing a cis double bond to be distinguished from the trans counterpart, and R-enantiomers to be distinguished from S at chiral centers.
Application of Molecular Holograms in QSAR
One of the first demonstrations of the QSAR modeling power of HQSAR was obtained in a retrospective analysis of a data set of endothelin inhibitors. The data set consists of inhibition of endothelin-1 binding to A10 rat thoracic aorta smooth muscle cells for a series of 36 compounds containing an aryl sulfonamide moiety with an isoxazole analog bonded to the amide nitrogen. Analysis of the data set by the CoMFA technique is not straightforward due to different charge computation schemes, structure optimization techniques, and structure orientation schemes, although a model with a cross-validated r² (i.e., q²) of 0.70 and SE of 0.69 can be obtained. Molecular Holograms were generated for each molecule in the data set using lengths in the range 53 to 201, and fragment sizes in the range 2 to 9 atoms. The model based on holograms of length 53 gave a cross-validated r² of 0.59 and SE of 0.81 (see Figure 2).
Figure 3 shows the outcome of randomization testing of the HQSAR model. Randomization testing involves randomly redistributing the activity data across the compounds and attempting to derive statistical models that correlate the scrambled data with the molecular descriptor. Figure 3 shows the distribution of randomized q² values relative to the observed q², and provides a means of assessing the likelihood that the observed correlation could have arisen by chance.
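The randomization test described above can be sketched as follows. This is an illustration only: the statistic used here is the squared correlation between a single descriptor and the activity, a stand-in for the cross-validated q² of a PLS model, and the data, the function names, and the choice of 200 trials are assumptions made for the example.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between one descriptor and the activity."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def randomization_test(x, y, statistic, n_trials=200, seed=0):
    """Recompute the statistic after repeatedly scrambling the activities."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = statistic(x, y)
    scrambled = []
    for _ in range(n_trials):
        shuffled = list(y)
        rng.shuffle(shuffled)  # redistribute activities across compounds
        scrambled.append(statistic(x, shuffled))
    # Fraction of scrambled runs that match or beat the observed statistic.
    fraction = sum(1 for value in scrambled if value >= observed) / n_trials
    return observed, scrambled, fraction


# Toy data: activity is close to twice the descriptor value.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
observed, scrambled, fraction = randomization_test(x, y, r_squared)
```

A histogram of `scrambled` with `observed` marked on it corresponds to the kind of plot shown in Figure 3; a small `fraction` suggests that the observed correlation is unlikely to be a chance result.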
One of the key advantages of the CoMFA and related techniques has been the capability to visualize, using 3D isocontour plots, those regions of space indicated by the PLS model to be highly correlated with the activity data. In HQSAR it is possible to identify, from their PLS coefficients, those bins of the molecular hologram that were most significant in explaining the variation in activity. The fragments in those bins can then be identified, and each atom in the molecule is then color coded based on the fragments in which it occurred. Figure 4 shows the color coding for four members of the sulfonamide endothelin data set described above. It is satisfying that the color coding observed in this set is consistent with the 3D isocontour maps derived from the CoMFA study. Thus, amino group substitution at the 5-position of the 1-naphthyl group is favorable in the most active compound (8), but shifting the substitution around the ring to the 6- or 7-position (compounds 11 and 14) leads to a decrease in activity, as indicated by the color coding of the amino group nitrogen atom. A similar trend is seen in the 2-naphthyl series, as indicated by compound 31, which has very poor activity.
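One way to picture the atom color coding is to give each atom a score built from the PLS coefficients of the bins its fragments fall into. The sketch below is an assumption about how such a score might be formed, not the published procedure: it simply adds, for every fragment, the coefficient of that fragment's bin to each atom in the fragment. The fragment labels, the CRC32-modulo folding, and the function name are hypothetical.

```python
import zlib


def atom_contributions(fragments, coefficients):
    """Sum, per atom, the bin coefficients of all fragments containing it.

    fragments: list of (label, atom_indices) pairs.
    coefficients: one PLS coefficient per hologram bin.
    The summation rule is an assumption of this sketch.
    """
    length = len(coefficients)
    scores = {}
    for label, atoms in fragments:
        # Assumed folding: CRC32 of the label, modulo the hologram length.
        bin_index = zlib.crc32(label.encode("utf-8")) % length
        for atom in atoms:
            scores[atom] = scores.get(atom, 0.0) + coefficients[bin_index]
    return scores


# Toy example: three atoms, three fragments, every bin coefficient set to 1.0,
# so each score is just the number of fragments that contain the atom.
fragments = [("C-N", (0, 1)), ("C-C", (1, 2)), ("C-C-N", (0, 1, 2))]
coefficients = [1.0] * 53
scores = atom_contributions(fragments, coefficients)
```

With real coefficients, atoms with strongly positive scores would be colored as favorable for activity and atoms with strongly negative scores as unfavorable; the thresholds and colors are left open here.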
Figure 2. Cross-validated predicted activity vs. actual activity for the endothelin data set.
Figure 3. Histogram of cross-validated r² frequency of occurrence for HQSAR runs with scrambled response data for the endothelin data set.
Figure 4. HQSAR model interpretation for four members of the endothelin data set. (Figure is printed in color in color insert.)
In subsequent studies, the general applicability of the Molecular Hologram descriptor in QSAR studies has been investigated in detail using published data sets exhibiting a range of biological end-points. One key point to note is that the HQSAR analyses shown in Tables 1 and 2 are extremely fast (60 to 120 seconds per data set on an SGI O2 R10K) and the data set preparation time is also minimal. In contrast, the 3D QSAR techniques may take several weeks of preparation in order to generate an appropriate conformation and mutual alignment of structures.
Comparison with 2D QSAR techniques. Table 1 shows a comparison between HQSAR and several 2D QSAR methods, including connectivity indices, clogP/cMr, and descriptors based on molecular formula attributes, for some published data sets. In every case, HQSAR outperforms the other 2D QSAR methods in terms of the q² statistic, in some cases by quite a significant margin. In those cases where the other 2D QSAR techniques generated reasonable models, HQSAR showed similar predictive performance as judged by the SEcv statistic.
Comparison with 3D QSAR. Table 2 shows a comparison between HQSAR and 3D QSAR methods, primarily CoMFA, for some published data sets. Good PLS models (in terms of q²) can be obtained for each of the eight data sets, and these are comparable with the corresponding 3D QSAR model in most cases. The dependency of HQSAR on 2D molecular fragments does, however, reduce the generality of the method for ab initio predictions of activity for "unseen" compounds, particularly those that contain a large number of fragments that were not encountered in the training set. This is evidenced by the cross-validated standard error of prediction (SEcv) statistic shown in the table, which in general is higher (worse) than the corresponding value obtained from the CoMFA study. In two cases, leukotrienes (29) and triazines (26), HQSAR yields a significantly better QSAR model than the 3D technique (Apex-3D and CoMFA, respectively). In the case of the triazines, CoMFA yields a q² of 0.47, compared with q² = 0.70 for HQSAR. The CoMFA result can be significantly improved (to q² = 0.61) by explicit inclusion of lipophilicity parameters within the regression equation. This result indicates that the Molecular Holograms do incorporate a broad range of information that has influence over biological activity. In the remaining case where 3D QSAR performed less well, the angiotensins (30), HQSAR performed similarly.
Comparison of the HQSAR and CoMFA models shown in Table 2 indicates that, in general, CoMFA produces superior models in terms of predictive performance (SEcv), but the similarity in the model statistics suggests that HQSAR may be used as a probe for preliminary SAR prior to spending significant amounts of time building a complex 3D QSAR model. In addition, the similar trend between HQSAR and CoMFA models gives confidence that HQSAR can be reliably applied in cases where CoMFA, or 3D QSAR, is inappropriate or awkward, for example to large data sets.
Table 1: Comparison of HQSAR and 2D QSAR techniques.
                                 HQSAR            2D QSAR
Data Set                   N     q²      SEcv
Triazolinones (20)         42    0.34    0.53
Phenyltryptamines (21)     32    0.56    1.13
Benzindoles (22)           30    0.69    0.55
MAO hydrazides (23)        24    0.80    0.26
Phenylthiothymines (24)    40    0.83    0.79
Bisamidines (25)           37    0.82    0.25