J. Chem. Inf. Comput. Sci. 1998, 38, 669-677
669
Evaluation of Quantitative Structure-Activity Relationship Methods for Large-Scale Prediction of Chemicals Binding to the Estrogen Receptor† Weida Tong,*,‡ David R. Lowis,§ Roger Perkins,‡ Yu Chen,‡ William J. Welsh,4 Dean W. Goddette,§ Trevor W. Heritage,§ and Daniel M. Sheehan¶ R.O.W. Sciences, Inc., Jefferson, Arkansas 72079, Department of Chemistry & Center for Molecular Electronics, University of Missouri-St. Louis, St. Louis, Missouri 63121, Tripos Inc., St. Louis, Missouri 63144, and Division of Genetic and Reproductive Toxicology, National Center for Toxicological Research (NCTR), Jefferson, Arkansas 72079 Received January 27, 1998
Three different QSAR methods, Comparative Molecular Field AnaIysis (CoMFA), classical QSAR (utilizing the CODESSA program), and Hologram QSAR (HQSAR), are compared in terms of their potential for screening large data sets of chemicals as endocrine disrupting compounds (EDCs). While CoMFA and CODESSA (Comprehensive Descriptors for Structural and Statistical Analysis) have been commercially available for some time, HQSAR is a novel QSAR technique. HQSAR attempts to correlate molecular structure with biological activity for a series of compounds using molecular holograms constructed from counts of sub-structural molecular fragments. In addition to using r2 and q2 (cross-validated r2) in assessing the statistical quality of QSAR models, another statistical parameter was defined to be the ratio of the standard error to the activity range. The statistical quality of the QSAR models constructed using CoMFA and HQSAR techniques were comparable and were generally better than those produced with CODESSA. It is notable that only 2D-connectivity, bond and elemental atom-type information were considered in building HQSAR models. Since HQSAR requires no conformational analysis or structural alignment, it is straightforward to use and lends itself readily to the rapid screening of large numbers of compounds. Among the QSAR methods considered, HQSAR appears to offer many attractive features, such as speed, reproducibility and ease of use, which portend its utility for prioritizing large numbers of potential EDCs for subsequent toxicological testing and risk assessment. INTRODUCTION
The possibility that certain man-made chemicals can disrupt the sensitive endocrine systems of humans and other vertebrates by mimicking endogenous hormones has sparked intense scientific discussion and debate in recent years.1 This growing national concern has resulted in legislation, including reauthorization of the Safe Drinking Water Act and passage of the 1996 Food Quality Protection Act, mandating that the Environmental Protection Agency (EPA) develop a screening and testing program for endocrine disrupting compounds (EDCs).2,3 The EDC issue and the pressing regulatory requirements portend a prodigious financial burden for screening and testing that will likely comprise a suite of in Vitro and in ViVo assays for multiple endpoints. Some 80 000 or more existing chemicals, many commercially important and produced in enormous quantities, may ultimately need to be evaluated for their estrogenic activity under the EPA mandate.4 With the advent of combinatorial synthesis5 and high-throughput screening6 techniques, the number of chemi* To whom all correspondence should be addressed. E-mail wtong@ nctr.fda.gov. † The opinions expressed are those of the authors and not necessarily those of the U.S. Food and Drug Administration. ‡ R.O.W. Sciences, Inc. § Tripos Inc. 4 Department of Chemistry. ¶ Division of Genetic and Reproductive Toxicology.
cals to be tested is expected to grow dramatically in the coming years. Fortunately, this challenge is offset by the ability to construct quantitative structure-activity relationship (QSAR) models for the rapid prediction of activity. Such models have great potential for use in the identification and classification of large numbers of potential EDCs. At the very least, QSAR models could be employed to establish a prioritization procedure for subsequent biological testing. An EDC can be broadly defined as “an exogenous agent that interferes with the production, release, transport, metabolism, binding, action or elimination of natural hormones in the body responsible for the maintenance of homeostasis and the regulation of developmental processes”.1 Of the many biological mechanisms that can result in endocrine disruption, by far the most dominant and well studied is expression of an estrogenic response.7,8 Although several mechanistic events can determine the in ViVo estrogenic potency of a chemical, expression of an estrogenic response generally requires binding to the estrogen receptor (ER). In recent years, several QSAR models have been developed for estrogenic compounds binding to the ER.9-14 Most of these studies have employed the three-dimensional (3D)QSAR method of comparative molecular field analysis (CoMFA)15 for model building. This method requires a procedure known as “structural alignment” of the molecules under study because a common binding site is assumed. The utility of CoMFA has been demonstrated in a wide range of
S0095-2338(98)00008-0 CCC: $15.00 © 1998 American Chemical Society Published on Web 05/20/1998
670 J. Chem. Inf. Comput. Sci., Vol. 38, No. 4, 1998
applications.16-18 However, CoMFA requires some knowledge or hypothesis regarding the functionally active conformations of the molecules under study as a prerequisite for structural alignment. Moreover, care must be exercised when constructing molecular alignments because slight differences in alignment can lead to wide variations in the resultant CoMFA model. Classical QSAR models were also considered in the present study, and were produced using partial least-squares (PLS) multivariate linear regression techniques. Classical QSAR techniques attempt to correlate a biological activity or a physical property of interest with a pre-defined set of calculated physicochemical descriptors within a collection of related compounds. In contrast to CoMFA, classical QSAR methods require no structural alignment of the molecules.9 However, both CoMFA and classical QSAR methods require identification of a putative bioactive conformation derived from either experimental evidence, molecular modeling, or conjecture. This requirement may introduce uncertainties into the resulting QSAR models, especially when dealing with structurally diverse data sets containing highly flexible molecules.19 Hologram QSAR (HQSAR), recently introduced by Tripos, Inc.,20 is a novel QSAR method that eliminates the need for determination of 3D structure, putative binding conformations, and molecular alignment. In HQSAR, each molecule in the data set is divided into structural fragments that are then counted in the bins of a fixed length array to form a molecular hologram. This process is similar to the generation of molecular fingerprints employed in database searches21 and molecular diversity22 calculations. The bin occupancies of the molecular hologram are structural descriptors (independent variables) encoding compositional and topological molecular information. A linear regression equation that correlates variation in structural information (as encoded in the hologram for each molecule) with variation in activity data is derived through PLS regression analysis to produce a QSAR model. Unlike other fragmentbased methods,23 HQSAR encodes all possible molecular fragments (linear, branched, and overlapping). Optionally, additional 3D information, such as hybridization and chirality, may be encoded in the molecular holograms. Molecular holograms are generated in the same manner as hashed fingerprints where different unique fragments may populate the same holographic bin, allowing the use of a fixed length hologram fingerprint. This hashing procedure emphasizes the importance of patterns of fragment distribution within the hologram bins, which represents the nature of chemical structures more appropriately.21 QSAR studies involve two steps: first, descriptors are generated that encode chemical structural information, second, a statistical regression technique is employed to correlate the structural variation, as encoded in the descriptors, with the variation in biological activity. In the present study, three QSAR methods: CoMFA, CoDESSA,24 and HQSAR were evaluated using three separate data sets. Data sets 1 and 2 contained the same set of structurally diverse molecules but differed with respect to biological endpoints. Data set 3 was composed of a set of congeners exhibiting several degrees of conformational flexibility. All three QSAR methods derive a regression model from PLS analysis; consequently, they differ primarily in the nature of their
TONG
ET AL.
chemical descriptors. Specifically, CoMFA employs steric and electrostatic field descriptors that encode detailed information concerning intermolecular interaction in three dimensions. CODESSA calculates molecular descriptors on the basis of two-dimensional (2D) and 3D structures and quantum-chemical properties. HQSAR calculates exclusively fragment-based molecular descriptors that are explained in greater detail in the Methodology Section. By virtue of the differences in chemical descriptors, each of the three QSAR methods will relate molecular structure and properties to estrogenicity in a different way. The specific objective of the present study is to compare CoMFA, CODESSA, and HQSAR as QSAR methods for predicting the binding affinity of a subset of potential EDCs to the ERs. This objective is pursuant to our long-term goal of identifying a QSAR method that can be applied for the rapid screening of large numbers of potential EDCs. METHODOLOGY
Data Sets for Analysis. The biological activity data used in this study are the relative binding affinity (RBA) obtained from an ER competitive binding assay with labeled endogenous estrogen, 17β-estradiol (E2). The RBA is 100 times the ratio of the molar concentrations of E2 and the competing chemical required to decrease the receptor bound radioactivity by 50%. Data sets 1 and 2 contained the same 31 structurally diverse molecules (Figures 1-5) comprising 19 steroids, four triphenylethylenes, three diethylstilbestrol derivatives, two bis(4-hydroxylphenyl)alkanes, and three phytoestrogens. The RBA values for data sets 1 and 2 were obtained using human ER-R and rat ER-β, respectively.25 These compounds were used to develop the CoMFA models10 compared in this paper. Forty-seven of the compounds contained in data set 3 were largely congeners of the 2-phenylindole prototype structure (Figures 6 and 7).26-28 Data set 3 also included six structurally diverse estrogens: E2, ICI 164,384, ICI 182,780, tamoxifen, 4-hydroxytamoxifen, and hexestrol (Figures 1-3). The RBA values for compounds in data set 3 were obtained with calf ER. These data were used to derive the CoMFA and CODESSA models9 for comparison with the HQSAR models in this paper. QSAR Methods. All molecular modeling and statistical analyses were performed using Sybyl 6.315 and Pirouette 2.03.29 Procedures for selecting the putative bioactive molecular conformation required for CoMFA and CODESSA, together with rules for structural alignment employed in CoMFA, are described in previous reports.9,10 Calculation of CoMFA Descriptors. The aligned molecules were placed in a 3D cubic lattice with 2 Å spacing. Steric (van der Waals) and electrostatic (Coulombic) field descriptors were calculated for each molecule at all lattice points using an sp3 carbon probe with +1.0 charge. Calculated steric and electrostatic energies >30 kcal/mol were truncated to this value. The CoMFA field descriptors were scaled using the CoMFA standard scaling methods30 provided in Sybyl 6.3. Calculation of Classical QSAR Descriptors. The CODESSA program was used to generate values for >200 physicochemical descriptors.24 These descriptors are generally divided into five categories: constitutional, topological,
PREDICTION
OF
CHEMICALS BINDING
TO THE
ESTROGEN RECEPTOR
J. Chem. Inf. Comput. Sci., Vol. 38, No. 4, 1998 671
Figure 3. Structures of antiestrogens in data sets 1 and 2.
Figure 4. Structures of phytoestrogens in data sets 1 and 2.
Figure 5. Structures of industrial chemicals in data sets 1 and 2.
Figure 1. Structures of steroidal compounds in data sets 1 and 2.
Figure 2. Structures of synthetic estrogens in data sets 1 and 2.
geometrical, electrostatic, and quantum-chemical. The simplest descriptor type is constitutional (e.g., atom counts, molecular weight), which reflects the molecular composition without regard to geometric or electronic structure. Topological descriptors include the Kier and Hall, Randic, and Wiener indices, which are most sensitive to molecular connectivity. Geometrical descriptors, such as moment of inertia and molecular surface area, require the 3D coordinates of the constituent atoms of a molecule. Electrostatic descriptors reflect particular aspects of charge distribution and can be calculated using any of several empirical
procedures within the CODESSA program as well as a number of quantum-mechanical approaches. Quantumchemical descriptors enhance the conventional descriptors by providing information about the internal electronic properties of molecules. CODESSA is capable of computing ∼400 descriptors for each molecule. Descriptors for which values are invariant or incalculable for any compound within the data set were excluded from consideration. Of the ∼200 remaining descriptors, about half were quantum-chemical in nature. Each set of descriptor values was subjected to autoscaling31 prior to statistical analysis. Calculation of HQSAR Descriptors. The following procedure (Figure 8) was used to construct molecular holograms containing the HQSAR descriptors. First, all linear, branched, and overlapping substructural fragments in the size range 4 to 7 atoms were generated for each molecule.21 All generated fragments from a molecule were then hashed into a fixed length array to produce the molecular hologram. In detail, the procedure is as follows: the SLN (SYBYL Line Notation)32 for each fragment generated is mapped to a unique integer in the range of 0 to 231 using a CRC (cyclic redundancy check)33 algorithm. Each integer is then used to select a bin in an integer array of predetermined length (hologram length), the occupancy of which is then incremented by one. The hashing process occurs in cases where the value of the CRC-produced integer is larger than the length of the hologram, and the value of the remainder when the integer is divided by the hologram length is used to identify the array bin whose occupancy was to be incremented. The final array is the molecular hologram, and the bin occupancies are the descriptor variables that encode
672 J. Chem. Inf. Comput. Sci., Vol. 38, No. 4, 1998
TONG
ET AL.
Figure 8. Schematic illustrating generation of a molecular hologram: A molecule is broken into a number of structural fragments that are assigned fragment integer identifications (IDs) using the CRC algorithm. Then each fragment is placed in a particular bin based on its fragment integer ID corresponding to the bin ID. The bin occupancy numbers are HQSAR descriptors that count structural fragments in each bin.
Figure 6. Structures of 2-phenylindoles in data set 3.
Figure 7. Structures of 5,6-dihydroindolo[2,1-R]isoquinolines in data set 3.
molecular structural information. The hologram length (number of array bins) defines the dimensionality of the descriptor space. The hashing process greatly reduces the size requirement of a molecular hologram (compared with the case where each unique fragment is counted in its own bin) but leads to a phenomenon called “fragment collision”. Identical molecular
fragments always generate identical integers through the CRC algorithm and hence will always be counted in the same bin. Typically, because the number of unique fragments contained in a molecule is rather larger than the number of holographic bins, the hashing procedure described will map different integers, and therefore different unique fragments, to the same bin causing fragment collision. In other words, each holographic bin will correspond to several different substructural fragments. Surveying HQSAR models based on a range of different hologram lengths and selecting the hologram length that yields the lowest cross-validated standard error (or highest q2) minimizes the negative impact of such collisions. The HQSAR module provides 12 default hologram lengths that have been found to yield predictive models on a number of test data sets. These default hologram lengths are prime numbers such that each provides a unique set of fragment collisions. The exact model produced by HQSAR is dependent not only on the hologram length but also on the information contained in the generated fragments. The particular nature of substructural fragments generated by HQSAR and, consequently, the information contained in the resultant molecular holograms can be altered through the settings of seven parameters. These hologram construction parameters are divided into two classes: fragment size and fragment distinction. The two fragment size parameters, minimum and maximum fragment size, determine the maximum and minimum number of atoms in any one fragment (the default values for these parameters are 4 and 7, respectively). Fragment distinction parameters describe what information from the original molecule is retained in the fragment in terms of atoms, bonds, connections, hydrogens, and chirality. Table 1 and Figure 9 depict how these different parameter settings affect the information contained in the generated fragments and lead to the generation of distinct fragments from the same portion of the original molecule. PLS-QSAR. Predictive QSAR models were produced using separate PLS analyses of the three data sets to correlate variation in biological activity with variation in the descriptors described in the previous sections. The optimum number of principal components (PCs) corresponding to the smallest standard error of prediction was determined by the Leave-
PREDICTION
OF
CHEMICALS BINDING
TO THE
ESTROGEN RECEPTOR
J. Chem. Inf. Comput. Sci., Vol. 38, No. 4, 1998 673
Table 1. Definition of Fragment Parameters in HQSAR parameter atoms bonds connections hydrogens chirality
definition The atoms parameter enables fragments to be distinguished based on elemental atom types; for example, allowing NH3 to be distinguished from PH3. The bonds parameter enables fragments to be distinguished based on bond orders; for example, in the absence of hydrogen, allowing butane to be distinguished from 2-butene. The connections parameter provides a measure of atomic hybridization states within fragments; that is, connections causes HQSAR to keep track of how many connections are made to constituent atoms and the bond order of those connections. By default, HQSAR ignores the hydrogen atoms during fragment generation. The hydrogens parameter overrides this behavior. The chirality parameter enables fragments to be distinguished based on atomic and bond stereochemistry. Thus, stereochemistry allows cis double bonds to be distinguished from their trans counterparts, and R-enantiomers to be distinguished from S-enantiomers at all chiral centers. Table 2. Summary of the Key Statistical Parameters Obtained for Each QSAR Model datasets
statistics
CoMFA
HQSAR
CODESSA
1
q2 r2 PCs q2 r2 PCs q2 r2 PCs
0.70 0.95 4 0.60 0.95 4 0.61 0.97 9
0.67 0.88 4 0.68 0.91 5 0.53 0.93 9
0.46 0.79 2 0.61 0.92 4 0.54 0.68 3
2 3
Figure 9. Schematic illustrating different fragment parameters in HQSAR.
One-Out (LOO) cross-validation procedure. By this procedure, each compound is systematically excluded once from the data set, after which its activity is predicted by a model derived from the remaining compounds. The predicted activities of the “left out” compounds allow the calculation of q2 and cross-validated standard error. Using the optimal number of PCs, the final PLS analysis was carried out without cross-validation to generate a predictive QSAR model with a conventional correlation coeffficient r2 and a non-cross-validated standard error. RESULTS
Quality of the QSAR Models. The quality of a QSAR model can be assessed in terms of various statistical measures. The values of r2 and q2 are normally accepted as statistical measures of merit for a QSAR model. In many QSAR studies, the criterion r2 g 0.9 is employed to decide whether a model is internally self-consistent. It should be noted that r2 makes no assessment of the intrinsic precision or accuracy of the data itself. The value of q2, derived from the LOO cross-validation procedure, tests the stability of the model through perturbation of the regression coefficients by consecutively omitting each compound during the model generation procedure. Consequently, q2 can be considered a measure of the ability of the model to interpolate within the training set population. The criterion q2 g 0.5 is normally adopted in many CoMFA studies for determining the acceptability of the model for this purpose.34 Values of the r2 and q2 associated with the three QSAR models for each of three data sets are given in Table 2. In this example, only 2D connectivity, bond and elemental atom-type information (atoms and bonds parameters turned on) was used in the HQSAR calculations. It is seen that all three QSAR models exceeded the q2 g 0.5 criterion. In terms of goodness of fit, CoMFA models provided the highest r2 values
accounting for at least 95% of the variation in biological activity in all data sets. Individual r2 values for each data set were slightly lower for HQSAR than for CoMFA models. Importantly, the average r2 value for all three data sets exceeded 0.90 for both HQSAR and CoMFA. CODESSA yielded a good QSAR model for data set 2, but lower r2 values for data sets 1 and 3 thus indicating that further work on CODESSA may be required. Although r2 and q2 are important for validating the quality of a QSAR model, these parameters alone fail to account for other factors. One such factor is the number of principal components (degrees of freedom) that should be considered when comparing different QSAR models derived from an individual data set. The value of r2 generally increases as more PCs are included in the model. Thus, it would seem reasonable to scale a statistical parameter of choice by the number of PCs. Indeed, the primary motivation for using the PLS method is to build the most predictive model that fits the known biological data (high q2 and r2, respectively) with the fewest number of PCs to avoid overfitting of data points. Another factor is the range of biological activity within the data set, which also should be considered during the comparison of the quality of QSAR models across different data sets. Given two QSAR models that have the same r2 (or q2) value, the model derived from the data set with the larger biological activity range is more valid than that obtained with the smaller activity range. Alternatively, the standard error and cross-validated standard error can be used as measures of goodness of fit and predictivity. Although several ways exist to calculate the standard error for a regression equation, the number of degrees of freedom should be factored in when comparing different models. A more effective measure of model goodness of fit is the ratio of the standard error to the activity range. One advantage of explicitly including the range of biological activity is that the performance of separate QSAR models can be compared across different data sets. This ratio should generally be