
Chem. Res. Toxicol. 2001, 14, 1535-1545


Linear Regression and Computational Neural Network Prediction of Tetrahymena Acute Toxicity for Aromatic Compounds from Molecular Structure

J. R. Serra,† P. C. Jurs,*,† and K. L. E. Kaiser‡

Department of Chemistry, The Pennsylvania State University, University Park, Pennsylvania 16802, and National Water Research Institute, Burlington, Ontario L7R 4A6, Canada

Received June 13, 2001

A quantitative structure toxicity relationship (QSTR) has been derived for a diverse set of 448 industrially important aromatic solvents. Toxicity was expressed as the 50% growth impairment concentration (ICG50) for the ciliated protozoan Tetrahymena and spans the range -1.46 to 3.36 log units. Molecular descriptors that encode topological, geometrical, electronic, and hybrid geometrical-electronic structural features were calculated for each compound. Subsets of molecular descriptors were selected via a simulated annealing technique and a genetic algorithm. From this reduced pool of descriptors, multiple linear regression models and nonlinear models using computational neural networks (CNNs) were derived and then used to predict the ICG50 values for an external set of representative compounds. An average of 10 nonlinear CNN models with 11-5-1 architecture was found to best describe the system, with root-mean-square errors of 0.28, 0.29, and 0.34 log units for the training, cross-validation, and prediction sets, respectively.

† The Pennsylvania State University.
‡ National Water Research Institute.
¹ Abbreviations: QSTR, quantitative structure toxicity relationship; ICG50, 50% growth impairment concentration; CNN, computational neural network; QSAR, quantitative structure activity relationship; QSPR, quantitative structure property relationship; log P, octanol/water partition coefficient; PNN, probabilistic neural network; ADAPT, Automated Data Analysis and Pattern Recognition Toolkit; MLR, multiple linear regression; MOPAC, molecular orbital package; Tset, training set; Cvset, cross-validation set; Pset, prediction set; CPSA, charged partial surface area; RMS, root mean square; BFGS, Broyden-Fletcher-Goldfarb-Shanno; VIF, variance inflation factor.

Chem. Res. Toxicol., Vol. 14, No. 11, 2001
10.1021/tx010101q CCC: $20.00 © 2001 American Chemical Society Published on Web 10/30/2001

Introduction

Chemical byproducts from industrial systems that are allowed to escape into the environment can have toxic effects. Each of these chemicals has the potential to be harmful, and it is crucial that each compound be assessed for its toxicity. However, experimental assessment can be costly and time-consuming, and the experimental methods used today could potentially produce toxic side products (1).

Recently, computational methods have been used to solve complex problems in many areas of science. One particularly useful method, the development of quantitative structure activity relationships (QSAR),¹ has found diverse applications in chemistry. These applications include biological activity (QSAR) prediction (2-5), physical property (QSPR) prediction (6-8), and toxicity (QSTR) prediction (9-11). QSAR has great advantages over both experimental techniques and other computational methods. First, QSAR is a purely computational method that does not require the use of expensive equipment or hazardous chemicals. Second, QSAR is computationally inexpensive compared with molecular dynamics, Monte Carlo, and ab initio quantum mechanical methods.

To quantify the toxicity for a set of 448 industrially relevant aromatic compounds, a QSTR is developed for the 50% growth impairment concentration (ICG50) toward the ciliated protozoan Tetrahymena. These data are a subset of those reported by Niculescu et al. (12) and encompass most of the data available in the Terra Tox database, compiled and published by TerraBase Inc. (13). Tetrahymena has been the focus of many toxicity studies over the years because of its fast growth rates under simple and inexpensive culture conditions (1, 14). The ICG50 values were obtained by the standard methods of Schultz et al. Briefly, the organisms were dosed with a solution of sufficient concentration to inhibit the population growth rate by 50%. Population growth inhibition was then detected by absorbance at 540 nm (14).

A desirable QSTR does not depend on the octanol/water partition coefficient (log P), on some other experimental quantity or rule, or on a particular mechanism of action, and it is not limited to a homologous set of compounds that lack structural diversity (15). Recently, Niculescu et al. published a very successful Tetrahymena toxicity model using a probabilistic neural network (PNN) approach on a diverse set of 825 compounds using 33 molecular fragment descriptors (12). PNNs have proven to be useful techniques for the characterization of acute toxicity (16, 17). Computational neural networks (CNNs) provide another means of modeling Tetrahymena toxicity for a diverse set of compounds and have not been amply explored in this field. An advantage that CNNs have over PNNs is that they can possibly describe the model with one-tenth the number of descriptors.

This paper discusses the development of three QSTRs to predict 50% population growth impairment concentrations (ICG50) of a set of 448 diverse, industrially relevant organic compounds. The acute toxicity toward Tetrahymena is expressed as -log(ICG50), with ICG50 in mmol/L. The logarithm is taken to contract the data set to a computationally efficient range (-1.64 to 3.36 log units), and the negative sign is used so that the largest positive values represent the most toxic compounds and the largest negative values represent the least toxic.

The molecular weight of these 448 compounds ranged from 93 to 489 amu, with an average of 291 amu. The compounds contain only carbon, hydrogen, oxygen, nitrogen, sulfur, fluorine, chlorine, and bromine. All compounds investigated include at least one aromatic ring. There are 202 compounds that contain at least one oxygen, 104 compounds that contain at least one nitrogen, 117 compounds that contain at least one oxygen and one nitrogen, and 27 compounds that contain neither oxygen nor nitrogen.
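The toxicity scale described above can be illustrated with a minimal Python sketch. It assumes that "log" means the base-10 logarithm and that the concentration is given in mmol/L; the function name and the example concentrations are hypothetical and are not taken from the data set.

```python
import math


def toxicity_log_units(conc_mmol_per_l: float) -> float:
    """Return -log10 of a 50% growth impairment concentration given in mmol/L.

    Assumption: the paper's "log" is base 10.
    """
    if conc_mmol_per_l <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(conc_mmol_per_l)


# Hypothetical concentrations (mmol/L), chosen only to show the direction of the scale:
# a lower concentration needed for 50% impairment gives a larger (more toxic) value.
for conc in (10.0, 0.5, 0.001):
    print(f"{conc:g} mmol/L -> {toxicity_log_units(conc):+.2f} log units")
```

On this scale a compound that impairs growth at a low concentration receives a large positive value, which is why the most toxic compounds sit at the top of the range.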

Methodology

The Automated Data Analysis and Pattern Recognition Toolkit (ADAPT) software system formulates a QSTR by finding a suitable model relating molecular descriptors derived from the molecular structures to the toxicity values of each compound. This process is performed in five steps: (1) structure entry and possibly geometry optimization, (2) molecular descriptor generation, (3) feature selection, (4) model construction, using either multiple linear regression (MLR) or computational neural networks (CNN), and (5) model validation.

First, the molecular structures are sketched using HyperChem, a commercial software package by Hypercube, Inc., on a Pentium PC. Three-dimensional conformations are obtained using the PM3 semiempirical Hamiltonian (18) from the commercially available optimization program MOPAC (19). The compounds are randomly divided into a training set (tset), a cross-validation set (cvset), and a prediction set (pset), where the cvset and pset each make up 10% of the compounds and the tset comprises the remaining 80%.

Molecular structure descriptors are then calculated based on topological, geometrical, and electronic properties of the molecules that make up the data set. Topological descriptors encode graph theoretical properties of the molecules. These include simple atom and bond counts, connectivity indices (20-26), and simple path counts. Geometric descriptors are derived from three-dimensional molecular models of the structures and include such information as the principal moments of inertia, molecular volume, and the solvent accessible surface area (27, 28). Electronic descriptors characterize the structure with partial atomic charges, dipole moments, and bond lengths derived from either extended Hückel calculations or semiempirical methods from MOPAC. Hybrid descriptors can be formed from combinations of topological, geometric, and electronic descriptors (29-31). These include charged partial surface area (CPSA) (30, 31) descriptors that encode the potential for polar interactions as well as hydrogen-bonding effects (32, 33).

Feature selection methods are used to reduce the number of descriptors from approximately 200 to a reasonable number, usually fewer than 60. Objective feature selection is performed without the use of the dependent variable. Several statistical and correlation criteria must be met for a descriptor to pass into the reduced pool. Reduced descriptor pools are created by discarding those descriptors with identical values for greater than 90% of the compounds or by discarding one of any two descriptors with a pairwise correlation above 0.93. Subjective feature selection is performed by either the genetic algorithm (34) or the simulated annealing technique (35), using either a linear fitness evaluator or a nonlinear CNN fitness evaluator. This allows the user to specify the number of descriptors from the reduced pool, usually 6-12, to be included in a QSTR model. These evolutionary optimization routines are used to identify small sets of descriptors that are information rich.

Three types of models are created in this work: type 1 models, which use linear regression methods for both feature selection and model generation; type 2 models, which use linear methods for feature selection and nonlinear (CNN) methods for model development; and type 3 models, which use nonlinear (CNN) methods for both feature selection and model development. Finally, the best QSTR model is chosen.

The best set of descriptors for the type 1 models is chosen based on individual T values, and the best type 1 model is chosen by the overall root-mean-square (rms) error (36). The type 1 descriptor subsets are then submitted to a three-layer, fully connected, feed-forward CNN for development of the type 2 model. Network training is achieved by the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method (37). The CNN has an input neuron for each descriptor, a single output neuron for the property of interest, and a user-specified number of hidden neurons, restricted to avoid the risk of chance correlation. The CNN training is monitored by the minimization of the cvset rms error. When the cvset rms error reaches a minimum, the network training is stopped, because further decreases in the tset rms error signify that the CNN is memorizing the tset data and ceasing to create a generalized model. The number of nodes in the hidden layer is varied to seek the best performance. The best architecture is found when the addition of another hidden neuron does not significantly improve the quality of the model. The descriptor subsets for the type 3 models are evaluated in the same way as those for the type 2 models, except that the descriptor subsets originate from nonlinear feature selection using the genetic algorithm. The overall rms errors of the best type 1, 2, and 3 models are compared, and the one with the lowest rms error is deemed best.

The idea behind QSTR is that a correlation can be found between the toxicity of a compound and the molecular descriptors defining that compound. To check this, Monte Carlo or scramble runs were performed. The best type 1, type 2, and type 3 models created were checked by scrambling the dependent variables and building a model using the optimal descriptors for each model type. For a correlation between the toxicity of a compound and the molecular descriptors to be valid, the rms error of the respective model must be lower than the rms error of the scramble run.

All optimizations are performed on a DEC 3000 AXP Model 500 workstation running the Digital-UNIX operating system.
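The scramble-run check described above can be sketched in a few lines of Python. This is a simplified illustration, not the ADAPT procedure: it assumes a single synthetic descriptor and a straight-line least-squares fit in place of the 11-descriptor MLR and CNN models, and the data, the seed, and the helper name `fit_rms` are all hypothetical.

```python
import math
import random


def fit_rms(x, y):
    """Fit the least-squares line y = a + b*x and return the rms error of the fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    squared = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(squared / n)


rng = random.Random(0)

# Synthetic stand-in data (not the paper's): one descriptor with a real linear signal.
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [1.5 * xi + rng.gauss(0.0, 0.3) for xi in x]

real_rms = fit_rms(x, y)

# Scramble runs: permute the dependent variable, keep the descriptor values fixed,
# and refit. A genuine structure-toxicity relationship should not survive this.
scrambled_rms = []
for _ in range(20):
    y_scrambled = y[:]
    rng.shuffle(y_scrambled)
    scrambled_rms.append(fit_rms(x, y_scrambled))

print(f"rms error, real dependent variable:      {real_rms:.2f}")
print(f"rms error, scrambled (mean of 20 runs):  {sum(scrambled_rms) / len(scrambled_rms):.2f}")
```

If the fitted model were only a chance correlation, the scrambled fits would reach roughly the same rms error as the real one; a clearly lower rms error for the unscrambled data is the evidence the validity criterion asks for.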

Results and Discussion

The general progression of model production starts from the completely linear type 1 model, which uses linear feature selection and linear model building. It then progresses through the hybrid linear/nonlinear type 2 model, which uses linear feature selection and nonlinear model building. Finally, the completely nonlinear type 3 model, which employs nonlinear feature selection and nonlinear model building, is obtained.

Of the 448 compounds in the data set, 287 were randomly selected for the tset, 80 make up the pset, and 81 make up the cvset, as listed in Table 1. The cvset doubles as a second prediction set for the type 1 models. Care was taken to ensure that compounds at either extreme of the -log(ICG50) range were not included in the pset or cvset.

Type 1 Model (Linear Feature Selection and Linear Model Development). The descriptors chosen for the type 1 model are shown in Table 2. They were selected by the simulated annealing method coupled to a linear fitness evaluator. Only models with T-values greater than 4 were considered for this study. Models employing from 3 to 20 descriptors were examined, and an 11-descriptor model was the best found. The compounds from which the 11-descriptor type 1 model was formed were then tested for statistical outliers. On the basis of several MLR tests, no statistical outliers were present. No multicollinearities (r > 0.95) existed among the 11 descriptors, and variance inflation factors


Table 1. Compounds Used for Development of QSTR Model for Tetrahymena Toxicity

set   CAS no.   compound name

Pset Tset Cvset Cvset Tset Tset Cvset Pset Tset Tset Tset Tset Tset Tset Tset Tset Cvset Tset Tset Pset Tset Tset Tset Tset Pset Pset Tset Tset Pset Tset Tset Tset Pset Cvset Pset Pset Cvset Tset Tset Tset Tset Tset Pset Tset Cvset Cvset Cvset Pset Tset Tset Cvset Tset Tset Cvset Pset Tset Pset Pset Tset Cvset Cvset Pset Cvset Tset Cvset Tset Pset Tset Tset Pset Pset Pset Pset Cvset Tset Tset Tset Tset Tset Tset Tset Tset Tset Tset Tset Tset Tset Cvset Tset Tset Cvset Pset

100-00-5 100-01-6 100-10-7 100-12-9 100-14-1 100-17-4 100-25-4 100-29-8 10031-82-0 100-39-0 100-44-7 100-46-9 100-47-0 100-51-6 100-52-7 100-61-8 100-66-3 100-83-4 1009-14-9 1016-78-0 103-05-9 103-16-2 103-63-9 103-69-5 103-73-1 103-83-3 103-85-5 104-13-2 104-40-5 104-51-8 104-85-8 104-86-9 104-87-0 104-88-1 104-91-6 104-94-9 105-07-7 106-40-1 106-44-5 106-47-8 106-48-9 106-49-0 108-39-4 108-42-9 108-43-0 108-44-1 108-45-2 108-46-3 108-68-9 108-69-0 108-86-1 108-90-7 108-95-2 1126-46-1 1129-35-7 1129-37-9 1137-41-3 1137-42-4 117-80-6 118-31-0 118-79-6 118-92-3 118-93-4 1194-02-1 119-61-9 1198-55-6 120-47-8 120-51-4 120-72-9 120-80-9 120-82-1 120-83-2 121-14-2 121-69-7 121-71-1 121-73-3 121-87-9 121-89-1 121-90-4 122-79-2 122-97-4 123-07-9 123-08-0 123-30-8 123-31-9 127-66-2 13037-86-0 131-11-3 131-18-0 131-58-8 132-64-9 133-11-9

1-chloro-4-nitrobenzene 4-nitroaniline 4-(dimethylamino)benzaldehyde 4-ethylnitrobenzene 4-nitrobenzyl chloride 4-nitroanisole 1,4-dinitrobenzene 4-nitrophenetole 4-ethoxybenzaldehyde benzyl bromide benzyl chloride benzylamine benzonitrile benzyl alcohol benzaldehyde N-methylaniline anisole 3-hydroxybenzaldehyde valerophenone 3-chlorobenzophenone α,α-dimethylbenzenepropanol 4-(benzyloxy)phenol (2-bromoethyl)benzene N-ethylaniline phenoxyethane N,N-dimethylbenzylamine 1-phenyl-2-thiourea 4-n-butylaniline LNP n-butylbenzene 4-tolunitrile 4-chlorobenzylamine 4-methylbenzaldehyde 4-chlorobenzaldehyde 4-nitrosophenol 4-methoxyaniline 4-cyanobenzaldehyde 4-bromoaniline 4-methylphenol 4-chloroaniline 4-chlorophenol 4-toluidine 3-methylphenol 3-chloroaniline 3-chlorophenol 3-toluidine 1,3-phenylenediamine resorcinol 3,5-dimethylphenol 3,5-dimethylaniline bromobenzene chlorobenzene phenol methyl 4-chlorobenzoate methyl 4-cyanobenzoate 4-nitrobenzaldoxime 4-aminobenzophenone 4-hydroxybenzophenone Dichlone 1-aminomethylnaphthalene 2,4,6-tribromophenol anthranilic acid 2′-hydroxyacetophenone 4-fluorobenzonitrile benzophenone tetrachlorocatechol ethyl 4-hydroxybenzoate benzyl benzoate indole catechol 1,2,4-trichlorobenzene 2,4-dichlorophenol 2,4-dinitrotoluene N,N-dimethylaniline 3-acetylphenol 1-chloro-3-nitrobenzene 2-chloro-4-nitroaniline 3′-nitroacetophenone 3-nitrobenzoyl chloride phenyl acetate 3-phenyl-1-propanol 4-ethylphenol 4-hydroxybenzaldehyde 4-aminophenol hydroquinone 2-phenyl-3-butyn-2-ol 4-heptyloxyphenol DMP DPP 2-methylbenzophenone dibenzofuran phenyl-4-aminosalicylate

Tset Cvset Tset Tset Tset Cvset Tset Tset Pset Tset Tset Tset Pset Tset Tset Tset Tset Tset Pset Tset Tset Pset Tset Tset Tset Cvset Tset Cvset Tset Cvset Tset Pset Pset Tset Tset Tset Tset Tset Tset Tset Pset Tset Pset Pset Tset Tset Tset Tset Pset Pset Tset Cvset Tset Cvset Cvset Tset Cvset Cvset Tset Tset Tset Tset Tset Pset Tset Cvset Pset Tset Tset Tset Tset Cvset Tset Tset Cvset Tset Cvset Tset Tset Tset Cvset Tset Tset Pset Pset Tset Pset Tset Tset Tset Tset Tset

555-16-8 555-21-5 55815-20-8 562-84-8 56602-33-6 571-58-4 5728-52-9 573-56-8 57455-06-8 576-24-9 577-19-5 578-54-1 579-66-8 580-51-8 582-33-2 58-27-5 585-34-2 585-79-5 586-76-5 586-78-7 587-02-0 587-03-1 58-90-2 589-16-2 589-18-4 591-19-5 591-27-5 591-35-5 591-50-4 5922-60-1 59-50-7 60-09-3 60-12-8 601-71-9 605-69-6 608-27-5 608-71-9 609-89-2 609-93-8 610-15-1 611-06-3 611-20-1 612-24-8 612-25-9 613-45-6 613-90-1 613-94-5 615-36-1 615-43-0 615-58-7 615-65-6 615-74-7 616-86-4 618-45-1 618-58-6 618-87-1 619-24-9 619-25-0 619-42-1 619-45-4 619-50-1 619-72-7 619-73-8 620-17-7 620-22-4 620-24-6 622-62-8 623-04-1 623-12-1 62-53-3 626-01-7 626-02-8 626-43-7 634-67-3 634-83-3 634-91-3 634-93-5 6361-21-3 636-30-6 6373-50-8 64063-37-2 6418-38-8 643-28-7 644-08-6 645-09-0 645-56-7 653-37-2 65-45-2 6627-55-0 6641-64-1 66-76-2 67-36-7

4-nitrobenzaldehyde 4-nitrophenyl acetonitrile 2,4-dibromo-6-phenylphenol 3-cyanobenzamide BOP 1,4-dimethylnaphthalene 4-biphenylacetic acid 2,6-dinitrophenol 3-iodobenzyl alcohol 2,3-dichlorophenol 1-bromo-2-nitrobenzene 2-ethylaniline 2,6-diethylaniline 3-phenylphenol ethyl-3-aminobenzoate 2-methyl-1,4-naphthoquinone 3-tert-butylphenol 1-bromo-3-nitrobenzene 4-bromobenzoic acid 1-bromo-4-nitrobenzene 3-ethylaniline 3-methylbenzyl alcohol 2,3,4,6-tetrachlorophenol 4-ethylaniline 4-methylbenzyl alcohol 3-bromoaniline 3-aminophenol 3,5-dichlorophenol iodobenzene 2-amino-5-chlorobenzonitrile 4-chloro-3-methylphenol 4-aminoazobenzene phenethyl alcohol 2,5-dibromobenzoic acid 2,4-dinitro-1-naphthol 2,3-dichloroaniline pentabromophenol 2,4-dichloro-6-nitrophenol 2,6-dinitro-4-methylphenol 2-nitrobenzamide 2,4-dichloronitrobenzene 2-cyanophenol 2-nitrobenzonitrile 2-nitrobenzyl alcohol 2,4-dimethoxybenzaldehyde benzoyl cyanide benzoic acid hydrazide 2-bromoaniline 2-iodoaniline 2,4-dibromophenol 2-chloro-4-methylaniline 2-chloro-5-methylphenol 4-ethoxy-2-nitroaniline 3-isopropylphenol 3,5-dibromobenzoic acid 3,5-dinitroaniline 3-nitrobenzonitrile 3-nitrobenzyl alcohol methyl 4-bromobenzoate methyl 4-aminobenzoate methyl 4-nitrobenzoate 4-cyanonitrobenzene 4-nitrobenzyl alcohol 3-ethylphenol 3-methylbenzonitrile 3-hydroxybenzyl alcohol 4-ethoxyphenol 4-aminobenzyl alcohol 4-chloroanisole aniline 3-iodoaniline 3-iodophenol 3,5-dichloroaniline 2,3,4-trichloroaniline 2,3,4,5-tetrachloroaniline 3,4,5-trichloroaniline 2,4,6-trichloroaniline 2-chloro-5-nitrobenzaldehyde 2,4,5-trichloroaniline 4-cyclohexylaniline 2,6-dichloro-3-methylaniline 2,3-difluorophenol 2-isopropylaniline 4-phenyltoluene 3-nitrobenzamide 4-propylphenol pentafluorobenzaldehyde 2-hydroxybenzamide 2-bromo-4-methylphenol 4,5-dichloro-2-nitroaniline dicoumarol 4-phenoxybenzaldehyde


Tset Tset Pset Tset Cvset Tset Tset Pset Pset Tset Cvset Tset Cvset Tset Cvset Tset Tset Cvset Cvset Tset Tset Cvset Tset Tset Tset Tset Pset Pset Tset Tset Tset Tset Tset Tset Cvset Tset Tset Tset Tset Pset Pset Pset Tset Cvset Pset Tset Cvset Tset Tset Tset Tset Tset Tset Tset Tset Tset Tset Cvset Tset Tset Cvset Tset Tset Tset Tset Tset Cvset Tset Pset Tset Cvset Tset Tset Tset Tset Tset Cvset Tset Tset Tset Tset Tset Tset Cvset Tset Tset Tset Tset Cvset Tset Tset Pset

134-84-9 134-85-0 136-60-7 136-77-6 137-19-9 139-59-3 14191-95-8 14321-27-8 1443-80-7 1450-72-2 14548-45-9 1484-26-0 150-13-0 150-19-6 1504-58-1 150-76-5 1518-83-8 1527-89-5 15852-73-0 16245-79-7 1671-75-6 1674-37-9 1689-82-3 1745-81-9 17849-38-6 1821-39-2 1877-77-6 1885-29-6 18979-55-0 18982-54-2 1917-43-4 19438-10-9 2038-57-5 2046-18-6 2113-58-8 2138-22-9 2150-47-2 2237-30-1 2243-47-2 2357-47-3 2379-55-7 2409-55-4 24544-04-5 2495-37-6 24964-64-5 25167-83-3 253-52-1 253-82-7 260-94-6 2613-23-2 2683-43-4 2696-84-6 271-89-6 2845-89-8 28689-08-9 28804-96-8 2905-69-3 2928-43-0 2973-76-4 3012-37-1 3034-34-2 305-85-1 3209-22-1 3217-15-0 3218-36-8 3261-62-9 329-71-5 33228-44-3 33228-45-4 3360-41-6 33719-74-3 342-24-5 348-54-9 350-46-9 3544-25-0 3597-91-9 367-12-4 367-27-1 371-40-4 371-41-5 372-19-0 392-71-2 393-39-5 394-32-1 39905-57-2 402-45-9 4097-49-8 42882-31-5 4344-55-2 4383-06-6 4460-86-0 446-51-5

4-methylbenzophenone 4-chlorobenzophenone butyl benzoate 4-hexylresorcinol 4,6-dichlororesorcinol 4-phenoxyaniline 4-hydroxybenzyl cyanide N-ethylbenzylamine 4-acetylbenzonitrile 2-acetyl-4-methylphenol 4-bromophenyl-3-pyridyl ketone 3-benzyloxyaniline 4-aminobenzoic acid 3-methoxyphenol 3-phenyl-2-propyn-1-ol 4-methoxyphenol 4-cyclopentylphenol 3-methoxybenzonitrile 3-bromobenzyl alcohol 4-octylaniline heptanophenone octanophenone 4-phenylazophenol 2-allylphenol 2-chlorobenzyl alcohol 2-propylaniline 3-aminobenzyl alcohol 2-aminobenzonitrile 4-hexyloxyphenol 2-bromobenzyl alcohol 4-hydroxy-3-methoxybenzylamine methyl 3-hydroxybenzoate 3-phenyl-1-propylamine gamma-phenylpropyl cyanide 3-nitrobiphenyl 4-chloro-1,2-benzenediol methyl 2,4-dihydroxybenzoate 3-aminobenzonitrile 3-aminobiphenyl 5-amino-2-fluoro-benzotrifluoride 2,3-dimethylquinoxaline 2-tert-butyl-4-methylphenol 2,6-diisopropylaniline benzyl methacrylate 3-cyanobenzaldehyde 2,3,4,5-tetrachlorophenol phthalazine quinazoline acridine 3-chloro-4-fluorophenol 2,4-dichloro-6-nitroaniline 4-propylaniline 2,3-benzofuran 3-chloroanisole 1,5-dichloro-2,3-dinitrobenzene 4-biphenylcarbonitrile methyl 2,5-dichlorobenzoate 2-phenylbenzyl alcohol 5-bromovanillin benzyl thiocyanate 4-cyanobenzamide 2,6-diiodo-4-nitrophenol 2,3-dichloronitrobenzene 4-bromo-2,6-dichlorophenol 4-phenylbenzaldehyde 2-(4-tolyl)ethylamine 2,5-dinitrophenol 4-pentylaniline 4-hexylaniline 4-phenyl-1-butanol 3,5-dichloroanisole 2-fluorobenzophenone 2-fluoroaniline 1-fluoro-4-nitrobenzene 4-aminobenzyl cyanide 4-biphenylmethanol 2-fluorophenol 2,4-difluorophenol 4-fluoroaniline 4-fluorophenol 3-fluoroaniline 2,6-dichloro-4-fluorophenol 2-amino-5-fluoro-benzotrifluoride 5′-fluoro-2′-hydroxyacetophenone 4-hexyloxyaniline 4-hydroxybenzotrifluoride 4-tert-butyl-2,6-dinitrophenol 1-(1-naphthyl)ethylamine 4-butoxyaniline 3-hydroxy-4-methoxybenzyl alcohol 2,4,5-trimethoxybenzaldehyde 2-fluorobenzyl alcohol

Pset Tset Tset Tset Tset Tset Cvset Tset Tset Tset Tset Tset Tset Tset Pset Tset Tset Pset Tset Tset Tset Tset Tset Pset Cvset Tset Cvset Tset Cvset Tset Tset Tset Tset Tset Tset Tset Tset Pset Pset Tset Tset Tset Tset Tset Cvset Pset Tset Tset Tset Pset Tset Cvset Tset Tset Tset Cvset Tset Pset Tset Tset Tset Cvset Pset Tset Tset Tset Pset Tset Tset Pset Pset Pset Pset Tset Cvset Pset Pset Cvset Pset Pset Tset Pset Cvset Tset Cvset Pset Pset Cvset Tset Pset Tset Tset

69-72-7 697-82-5 698-87-3 7153-22-2 7244-78-2 7251-61-8 732-26-3 766-84-7 767-00-0 768-59-2 769-39-1 771-60-8 771-61-9 7781-98-8 78056-39-0 80-46-6 827-23-6 831-82-3 83-42-1 836-30-6 84-11-7 84-66-2 84-74-2 86-00-0 86-53-3 86-56-6 86-74-8 87-28-5 873-32-5 873-62-1 873-63-2 873-74-5 873-75-6 873-76-7 874-42-0 874-90-8 875-59-2 875-79-6 87-60-5 87-63-8 877-43-0 877-65-6 87820-88-0 87-86-5 87-87-6 88-04-0 88-06-2 88-65-3 88-69-7 88-72-2 88-73-3 88-74-4 88-75-5 89-59-8 89-61-2 89-69-0 89-95-2 90-01-7 90-02-8 90-12-0 90-15-3 90-30-2 90-41-5 90-43-7 90-72-2 90-90-4 91-10-1 91-15-6 91-19-0 91-20-3 91-22-5 91-57-6 91-63-4 91-66-7 920-99-3 92-52-4 92-67-1 92-69-3 92-82-0 93-55-0 93-58-3 935-95-5 93-89-0 94-09-7 942-92-7 94-67-7 94-71-3 95-01-2 95-15-8 95-51-2 95-53-4 95-55-6

salicylic acid 2,3,5-trimethylphenol 1-phenyl-2-propanol ethyl 4-cyanobenzoate 4-n-butoxynitrobenzene 2-methylquinoxaline 2,4,6-tri-tert-butylphenol 3-chlorobenzonitrile 4-cyanophenol 4-ethylbenzyl alcohol 2,3,5,6-tetrafluorophenol pentafluoroaniline pentafluorophenol ethyl 3-hydroxybenzoate 4,5-difluoro-2-nitroaniline 4-tert-pentylphenol 2,4-dibromo-6-nitroaniline 4-phenoxyphenol 2-chloro-6-nitrotoluene 4-nitrodiphenylamine 9,10-phenanthrenequinone DEP DBP 2-nitrobiphenyl 1-naphthonitrile 1-(dimethylamino)naphthalene carbazole 2-hydroxyethyl salicylate 2-chlorobenzonitrile 3-cyanophenol 3-chlorobenzyl alcohol 4-aminobenzonitrile 4-bromobenzyl alcohol 4-chlorobenzyl alcohol 2,4-dichlorobenzaldehyde 4-methoxybenzonitrile 4′-hydroxy-2′-methylacetophenone 1,2-dimethylindole 3-chloro-2-methylaniline 2-chloro-6-methylaniline 2,6-dimethylquinoline 4-tert-butylbenzyl alcohol Tralkoxydim pentachlorophenol tetrachlorohydroquinone 4-chloro-3,5-dimethylphenol 2,4,6-trichlorophenol 2-bromobenzoic acid 2-isopropylphenol 2-nitrotoluene 1-chloro-2-nitrobenzene 2-nitroaniline 2-nitrophenol 4-chloro-2-nitrotoluene 2,5-dichloronitrobenzene 2,4,5-trichloronitrobenzene 2-methylbenzyl alcohol 2-hydroxybenzyl alcohol salicylaldehyde 1-methylnaphthalene 1-naphthol N-phenyl-1-naphthylamine 2-aminobiphenyl 2-phenylphenol 2,4,6-tris(dimethylaminomethyl)phenol 4-bromobenzophenone 2,6-dimethoxyphenol phthalonitrile quinoxaline naphthalene quinoline 2-methylnaphthalene 2-methylquinoline N,N-diethylaniline 4-hydroxy-3-methoxybenzonitrile biphenyl 4-aminobiphenyl 4-hydroxybiphenyl phenazine propiophenone methyl benzoate 2,3,5,6-tetrachlorophenol ethyl benzoate Benzocaine hexanophenone salicylaldoxime 2-ethoxyphenol 2,4-dihydroxybenzaldehyde thianaphthene 2-chloroaniline 2-toluidine 2-aminophenol


Cvset Tset Cvset Tset Cvset Tset Cvset Tset Cvset Tset Cvset Tset Tset Cvset Cvset Cvset Cvset Tset Tset Tset Tset Tset Tset Tset Pset Pset Pset Pset Tset Tset Tset Tset Tset Cvset Tset Tset Tset Tset Tset Tset

456-47-3 459-56-3 4748-78-1 475-38-7 481-39-0 490-79-9 495-40-9 498-00-0 500-66-3 500-99-2 501-94-0 50-30-6 50-45-3 50-73-7 50-79-3 50-84-0 51-28-5 51-36-5 51-44-5 5159-41-1 526-75-0 527-54-8 527-60-6 528-29-0 529-19-1 529-20-4 53222-92-7 5344-90-1 534-52-1 535-80-8 536-90-3 538-68-1 540-37-4 540-38-5 54135-80-7 55-21-0 552-16-9 554-00-7 554-84-7 555-03-3

3-fluorobenzyl alcohol 4-fluorobenzyl alcohol 4-ethylbenzaldehyde 5,8-dihydroxy-1,4-naphthoquinone juglone gentisic acid butyrophenone 4-hydroxy-3-methoxybenzyl alcohol olivetol 3,5-dimethoxyphenol 4-hydroxyphenethyl alcohol 2,6-dichlorobenzoic acid 2,3-dichlorobenzoic acid 2,4,6-trichlorobenzoic acid 2,5-dichlorobenzoic acid 2,4-dichlorobenzoic acid 2,4-dinitrophenol 3,5-dichlorobenzoic acid 3,4-dichlorobenzoic acid 2-iodobenzyl alcohol 2,3-dimethylphenol 3,4,5-trimethylphenol 2,4,6-trimethylphenol 1,2-dinitrobenzene 2-tolunitrile 2-methylbenzaldehyde 3-amino-2-methylphenol 2-aminobenzyl alcohol DNOC 3-chlorobenzoic acid 3-methoxyaniline n-pentylbenzene 4-iodoaniline 4-iodophenol 2,3,4-trichloroanisole benzamide 2-nitrobenzoic acid 2,4-dichloroaniline 3-nitrophenol 3-nitroanisole

Tset Tset Tset Pset Tset Tset Tset Cvset Tset Pset Tset Tset Tset Cvset Tset Tset Cvset Tset Tset Tset Tset Tset Tset Tset Pset Tset Cvset Tset Pset Tset Tset Tset Cvset Tset Tset Pset Tset Tset Cvset Cvset

95-57-8 95-64-7 95-65-8 95-68-1 95-69-2 95-74-9 95-75-0 95-76-1 95-77-2 95-78-3 95-79-4 95-82-9 95-95-4 97-00-7 97-02-9 97-53-0 98-54-4 98-82-8 98-84-0 98-86-2 98-95-3 99-04-7 99-05-8 99-06-9 99-08-1 99-28-5 99-51-4 99-61-6 99-65-0 99-71-8 99-75-2 99-76-3 99-77-4 99-88-7 99-89-8 99-90-1 99-92-3 99-93-4 99-94-5 99-99-0

2-chlorophenol 3,4-dimethylaniline 3,4-dimethylphenol 2,4-dimethylaniline 4-chloro-2-methylaniline 3-chloro-4-methylaniline 3,4-dichlorotoluene 3,4-dichloroaniline 3,4-dichlorophenol 2,5-dimethylaniline 5-chloro-2-methylaniline 2,5-dichloroaniline 2,4,5-trichlorophenol 1-chloro-2,4-dinitrobenzene 2,4-dinitroaniline eugenol 4-tert-butylphenol cumene α-methylbenzylamine acetophenone nitrobenzene 3-methylbenzoic acid 3-aminobenzoic acid 3-hydroxybenzoic acid 3-nitrotoluene 2,6-dibromo-4-nitrophenol 3,4-dimethylnitrobenzene 3-nitrobenzaldehyde 1,3-dinitrobenzene 4-sec-butylphenol methyl 4-methylbenzoate methyl 4-hydroxybenzoate ethyl 4-nitrobenzoate 4-isopropylaniline 4-isopropylphenol 4′-bromoacetophenone 4′-aminoacetophenone 4-acetylphenol 4-methylbenzoic acid 4-nitrotoluene

Table 2. Eleven Descriptors of the Linear Type 1 QSTR Model for Tetrahymena Toxicity

descriptor  type         Tset descriptor range  coefficient  description
ALLP-1      topological  480-1981               -0.0028089   total number of paths
S7CH-19     topological  0.05103-0.5880         5.239        χ value of path chains of length 7^a
N6PC-19     topological  40-271                 0.017462     no. of path chains of length 6^b
WTPT-2      topological  1.901-2.108            -10.32       molecular ID divided by the number of atoms in the molecule
WTPT-3      topological  0-20.08                -0.097325    sum of all path weights starting from heteroatoms
3SP2-1      topological  0-5                    -0.3192      no. of sp2 hybridized carbons bonded to three other carbons
MDE-23      topological  0-15.02                0.1217       molecular distance edge between 2° and 3° carbons
MDE-24      topological  0-14.71                -0.2555      molecular distance edge between 2° and 4° carbons
ECCN-1      topological  45-512                 0.012054     eccentric connectivity index
SHDW-5      geometric    0.4214-0.9232          1.822        normalized molecular area projected onto the xz-plane
CHAA-1      hybrid       -1.747 to 0.004443     1.097        sum of charge on proton acceptor atoms
intercept                                       18.24

^a A path chain is defined as a set of atoms connected to each other such that some, or all, of the atoms are in ring systems.
^b A path cluster is defined as a path with at least one branch point.
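As a hedged illustration of how the entries of Table 2 combine into a prediction, the Python sketch below evaluates the linear type 1 model as the intercept plus the sum of coefficient times descriptor value. The coefficients and intercept are read from Table 2; the function name and the descriptor values for the example compound are hypothetical, chosen only to lie inside the Tset ranges, and do not correspond to any compound in Table 1.

```python
# Coefficients and intercept as listed in Table 2.
COEFFICIENTS = {
    "ALLP-1": -0.0028089,
    "S7CH-19": 5.239,
    "N6PC-19": 0.017462,
    "WTPT-2": -10.32,
    "WTPT-3": -0.097325,
    "3SP2-1": -0.3192,
    "MDE-23": 0.1217,
    "MDE-24": -0.2555,
    "ECCN-1": 0.012054,
    "SHDW-5": 1.822,
    "CHAA-1": 1.097,
}
INTERCEPT = 18.24


def predict_type1(descriptors):
    """Return the linear (type 1) estimate of -log(ICG50) for one compound."""
    missing = sorted(set(COEFFICIENTS) - set(descriptors))
    if missing:
        raise KeyError(f"missing descriptors: {missing}")
    return INTERCEPT + sum(
        coefficient * descriptors[name] for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical descriptor values for illustration only (not a real compound).
example = {
    "ALLP-1": 900.0,
    "S7CH-19": 0.2,
    "N6PC-19": 100.0,
    "WTPT-2": 2.0,
    "WTPT-3": 5.0,
    "3SP2-1": 1.0,
    "MDE-23": 2.0,
    "MDE-24": 0.0,
    "ECCN-1": 150.0,
    "SHDW-5": 0.6,
    "CHAA-1": -0.5,
}

print(f"type 1 estimate: {predict_type1(example):.2f} log units")
```

Because the WTPT-2 term alone contributes roughly -20 log units over its whole Tset range, the large positive intercept mostly offsets that term; the remaining descriptors then move the estimate within the observed toxicity range.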

(VIF) were less than 10. Pairwise correlations among the 11 descriptors ranged from 0.554 to 0.914 with a mean of 0.734. The 287 compound tset of this model had an rms error of 0.42 log units (r ) 0.85). The 81 compounds that comprise the first pset had an rms error of 0.66 log units (r ) 0.65). The 80 compounds that comprise the second pset had an rms error of 0.41 log units (r ) 0.85). The errors for the two prediction sets can be combined such that

RMSETotal )

x

(NP1 × RMSEP12) + (NP2 × RMSEP22) (1) NP1 + NP2

where NP1 is the number of members in the first pset (81) and NP2 is the number of members of the second pset (80). This gives an rms error of 0.55 log units (r ) 0.75). The compound 2,4,6-tri-tert-butylphenol (732-26-3) of the first pset had the largest residual of -3.71 which

inflated the rms error for the entire prediction set. If this compound is removed, the first pset rms error reduces to 0.52 log units (r ) 0.77), and the new combined pset rms error is 0.47 log units (r ) 0.81). The rms errors of the Monte Carlo runs were 0.75 log units for the tset, 0.74 log units for the first pset, and 0.69 log units for the second pset. These rms errors are much larger than those of the type 1 model. This provides evidence that the actual model was very unlikely due to chance but was based on real correlations between the ICG50 and the molecular structure descriptors. Of the 11 descriptors chosen for this model, nine are topological (geometry independent), one is geometric (geometry dependent), and one is a hydrogen-bonding hybrid descriptor. The topological descriptor ALLP-1 (38) encodes the total number of paths in the structure and can therefore give information on the size and degree of branching of the molecule. The topological descriptors S7CH-19 and N6PC-19 (38) also encode information on size and branching, including ring information in the case


of N6PC-19. The topological descriptors WTPT-2 and WTPT-3, calculated as the molecular ID divided by the number of atoms in the molecule and the sum of all path weights starting from heteroatoms, respectively, also encode size information (39). The 3SP2-1 descriptor encodes information on the hybridization of the carbon atoms in each compound. This, once again, includes information on branching since this particular descriptor encodes the number of sp2 hybridized carbons bonded to three other carbons in the compound. The electrotopological descriptor ECCN-1 (40, 41) encodes information on size and degree of branching of the molecule. The final topological descriptors MDE-23 and MDE-24 (42) encode information on the size of the molecule by counting the number of bonds between secondary and tertiary carbons and secondary and quaternary carbons, respectively. The geometric descriptor SHDW-5 (28) is calculated as the normalized molecular area of the molecule projected on the xz-plane. This descriptor encodes the molecule's size and branching, taking into account an accurate molecular geometry. The final descriptor in the 11-descriptor type 1 model is the hybrid (geometry dependent) descriptor CHAA-1. This descriptor encodes the molecule's ability to hydrogen bond. A general theme that appears among these 11 type 1 descriptors is that size and degree of branching are major factors in relating the molecular structure of these organic compounds to their toxicity toward Tetrahymena using purely linear means. Thus, it is conceivable that bulk and steric interactions may play a dominant role in inhibiting the population growth of Tetrahymena.

Type 2 Model (linear feature selection and nonlinear model development). Next, the optimal type 1 model descriptors are used to generate a CNN type 2 model. It is plausible that descriptors chosen by linear means would perform even better with a nonlinear model.
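As a numerical check on eq 1 and on the type 1 prediction-set errors reported above, the following minimal Python sketch recomputes the combined rms error from the quoted set sizes and rms errors, and then repeats the calculation with the largest residual (-3.71) removed from the first pset. The function names are illustrative and are not part of the ADAPT software; the inputs are the rounded values printed in the text, so agreement is only expected to two decimal places.

```python
import math

def pooled_rmse(n1, rmse1, n2, rmse2):
    """Eq 1: combine two set-level rms errors, weighting each squared error by set size."""
    return math.sqrt((n1 * rmse1**2 + n2 * rmse2**2) / (n1 + n2))

def rmse_without(n, rmse, residual):
    """rms error of an n-member set after dropping one member with the given residual."""
    sse = n * rmse**2 - residual**2      # remove that member's squared error
    return math.sqrt(sse / (n - 1))

# Rounded values quoted in the text for the type 1 model.
combined = pooled_rmse(81, 0.66, 80, 0.41)
first_pset_trimmed = rmse_without(81, 0.66, -3.71)
combined_trimmed = pooled_rmse(80, first_pset_trimmed, 80, 0.41)

print(f"combined pset rms error:          {combined:.2f}")
print(f"first pset without the outlier:   {first_pset_trimmed:.2f}")
print(f"combined pset without the outlier: {combined_trimmed:.2f}")
```

Within rounding, the three printed values agree with the 0.55, 0.52, and 0.47 log units stated for the type 1 model.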
The CNN training begins with a random set of weights and biases and continues until an optimal set of weights and biases is obtained. The progress of the training is monitored with the cvset compounds, and the optimal time to cease training occurs when the cvset rms error reaches a minimum. After this point, the CNN stops learning generalities of the tset and begins memorizing specific attributes of the molecules that comprise the tset. Previously, an average prediction from several CNNs has produced models with more predictive power than individual models, while reducing the risk of chance correlations (3). All allowed CNN architectures using the 11 type 1 descriptors were analyzed. An allowable architecture is one where the ratio of the number of observations (compounds in the tset) to the number of adjustable parameters is greater than 2. All architectures from 11-3-1 (48 adjustable parameters), which has 11 input neurons (one for each descriptor of the type 1 model), three hidden neurons, and one output neuron that predicts the calculated -log(ICG50) value, to 11-10-1 (131 adjustable parameters) were analyzed. The best CNN model is then found such that adding another hidden layer neuron does not produce a considerable decrease in tset and cvset rms error. An 11-3-1 architecture was chosen as optimal. The nonlinear CNN model coupled with the averaging technique resulted in an improved model compared to the type 1 model. The first prediction set for the type 1 model is now used as the cvset for the type 2 model. The


Figure 1. Plot of calculated -log(ICG50) versus experimental -log(ICG50) for eleven-descriptor, nonlinear type 3 model, for Tetrahymena toxicity.

tset rms error of the type 2 model decreased to 0.35 log units (r = 0.85), a 17% improvement over the type 1 model. The cvset rms error was calculated as 0.40 log units (r = 0.86), a 39% improvement over the first prediction set of the type 1 model. The pset rms error became 0.40 log units (r = 0.86), a 2.4% decrease over the type 1 model. No noteworthy outliers were evident for this model. Monte Carlo runs provided evidence that a correlation exists between the type 2 model descriptors and the experimental toxicity because the rms errors were once again much higher than the rms errors of the type 2 model. The rms error of the tset was 0.80 log units, the rms error of the cvset was 0.78 log units, and the rms error of the pset was 0.81 log units.

Type 3 Model (nonlinear feature selection and model development). Finally, a completely nonlinear type 3 model is calculated. The general decrease in rms error from type 1 to type 2 suggests that the type 3 model will produce results that continue this trend. Once again, 11-3-1 to 11-10-1 model architectures were analyzed. An average of 10 CNNs with an 11-5-1 architecture was chosen as optimal. A plot of the predicted -log(ICG50) versus observed -log(ICG50) values for this set of 10 type 3 models is shown in Figure 1. As expected, the tset rms error of the type 3 model is 0.28 log units (r = 0.94), which is a 20% improvement over the type 2 model and a 33% improvement over the type 1 model. The cvset rms error is calculated to be 0.29 log units (r = 0.93), a 28% improvement over the type 2 model and a 56% improvement over the type 1 model. The pset rms error is calculated to be 0.34 log units (r = 0.89), a 15% improvement over the optimal type 2 model and a 17% improvement over the type 1 model. A visual inspection of Figure 1 shows that no noteworthy outliers exist for the type 3 model.
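Two pieces of arithmetic in the type 2 and type 3 discussion can be reproduced with a short Python sketch: the number of adjustable parameters in an 11-h-1 network, and the percentage improvements between models. The parameter count assumes a fully connected network with one bias per hidden and output neuron, and the tset size of 287 is carried over from the type 1 model; both are assumptions of this sketch rather than statements from the text. Under that convention the 11-10-1 count reproduces the 131 quoted above, whereas the 11-3-1 count comes out as 40 rather than the quoted 48, so the authors may have counted the smaller network differently.

```python
def n_adjustable_parameters(n_in, n_hidden, n_out=1):
    """Weights plus one bias per hidden and output neuron (assumed convention)."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

def percent_improvement(old_rmse, new_rmse):
    """Relative decrease in rms error, expressed as a percentage of the old value."""
    return 100.0 * (old_rmse - new_rmse) / old_rmse

N_TSET = 287  # assumed: same training set size as the type 1 model

# Observation-to-parameter ratio for each candidate architecture.
for hidden in range(3, 12):
    params = n_adjustable_parameters(11, hidden)
    ratio = N_TSET / params
    status = "allowed" if ratio > 2 else "not allowed"
    print(f"11-{hidden}-1: {params:3d} parameters, ratio {ratio:.2f} ({status})")

# Improvements of the type 3 model, using the rms errors quoted in the text.
for label, type1, type2, type3 in [("tset", 0.42, 0.35, 0.28),
                                   ("cvset", 0.66, 0.40, 0.29),
                                   ("pset", 0.41, 0.40, 0.34)]:
    print(f"{label}: {percent_improvement(type2, type3):.1f}% over type 2, "
          f"{percent_improvement(type1, type3):.1f}% over type 1")
```

With these assumptions, 11-10-1 is the largest architecture whose ratio stays above 2, which is consistent with it being the upper limit of the architectures analyzed.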
Once again, Monte Carlo runs provided evidence that a correlation was found between the experimentally observed toxicity and the type 3 model descriptors. The


Table 3. Eleven Descriptors of the Nonlinear Type 3 QSTR Model for Tetrahymena Toxicity

| descriptor | type | TSet descriptor range | description |
| --- | --- | --- | --- |
| N6PC-19 | topological | 4-271 | no. of path clusters of length 6 (a) |
| N7CH-20 | topological | 1-14 | no. of path chains of length 7 (b) |
| NDB-13 | topological | 0-4 | no. of double bonds |
| 3SP2-1 | topological | 0-5 | no. of sp2 hybridized carbons bonded to three other carbons |
| 2SP3-1 | topological | 0-7 | no. of sp3 hybridized carbons bonded to two other carbons |
| ESUM-2 | topological | 0.00-68.94 | sum of E-state values over all heteroatoms |
| GRAV-2 | geometric | 22.97-48.18 | square root of the gravitational index over all heteroatoms |
| DPSA-2 | hybrid | 126.5 to 1.542 × 10^4 | difference in partial positive and partial negative molecular charges (c) |
| SCAA-2 | hybrid | -21.34 to 3.507 × 10^-2 | sum of (surface area × charge) of proton acceptor atoms/number of acceptor atoms |
| CHDH-1 | hybrid | 0.00-0.9202 | sum of charge on all donatable hydrogens |
| CTAA-0 | hybrid | 0-7 | count of proton acceptor atoms |

a A path cluster is defined as a path with at least one branch point. b A path chain is defined as a set of atoms connected to each other such that some, or all, of the atoms are in ring systems. c (sum total positive molecular charge × ∑positive atomic charges) - (sum total negative molecular charge × ∑negative atomic charges).

rms error of the tset was 0.78 log units, 0.77 log units for the cvset, and 0.78 log units for the pset. The 11 type 3 descriptors were not subjected to MLR analysis. There is no reason that a set of descriptors chosen by nonlinear means should give a reliable linear type 1 model. Any model created this way that showed low rms error would be likely to have poor general predictability. The set of 11 descriptors, shown in Table 3, that are used in the type 3 model includes six topological descriptors, one geometric descriptor, and four hybrid descriptors. These descriptors are chosen from the same reduced descriptor pool as the type 1 and type 2 descriptors; however, the 11 most information-rich descriptors are chosen via a genetic algorithm coupled to a CNN fitness evaluator. The six topological descriptors are similar to those of the type 1 and type 2 models. Two descriptors are common to all three models, N6PC-19 and 3SP2-1. Both of these encode information about branching and size. One fragment descriptor, NDB-13, which encodes the number of double bonds, is included in the 11 optimal descriptor pool. This may suggest that π electrons play a role in the toxicity of these compounds toward Tetrahymena. The geometric descriptor GRVH-2 is calculated as the square root of the gravitational index of all heavy atoms and encodes information about the size of the geometry-optimized compound. The hybrid descriptors SCAA-2, CHDH-1, and CTAA-0 encode information about hydrogen bonding. SCAA-2 encodes each molecule's ability to hydrogen bond with itself by calculating the sum of the surface area charge product of the proton acceptor atoms divided by the number of proton acceptor atoms. The descriptors CHDH-1 and CTAA-0 encode information about the molecule's ability to hydrogen bond with polar species where a water molecule is used to approximate a polar medium. 
CHDH-1 calculates the sum of the charges on all the donatable protons, and CTAA-0 gives the count of proton acceptor atoms. The descriptor DPSA-2 (30) encodes information about the partially charged surface areas of each molecule. This is calculated as the difference in the sum of the partial positive and partial negative molecular charges. This type 3 model also supports the idea that the size and degree of branching of the compound coupled with the ability of the molecule to hydrogen bond to the active site are important in determining the toxicity to Tetrahymena.
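The Monte Carlo check applied to all three models can be illustrated with a small randomization experiment. The sketch below assumes that each Monte Carlo run scrambles the response values before refitting, which is a common form of this test but is not spelled out in this section. To stay short it uses a single hypothetical descriptor with a least-squares line and synthetic data instead of the 11-descriptor MLR or CNN models, so the numbers it prints are not the paper's.

```python
import math
import random

def fit_rmse(x, y):
    """Fit the least-squares line y = a + b*x and return the rms error of the fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / n)

def scrambled_rmse(x, y, n_runs=50, seed=0):
    """Mean rms error over n_runs fits in which the responses are randomly shuffled."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        shuffled = y[:]          # copy so the real responses are left intact
        rng.shuffle(shuffled)    # break the descriptor-response pairing
        total += fit_rmse(x, shuffled)
    return total / n_runs

# Synthetic stand-in data (hypothetical): one descriptor, 287 "compounds".
rng = random.Random(42)
x = [rng.uniform(0.0, 5.0) for _ in range(287)]
y = [0.8 * xi - 1.0 + rng.gauss(0.0, 0.4) for xi in x]

real = fit_rmse(x, y)
chance = scrambled_rmse(x, y)
print(f"real model rms error: {real:.2f}; scrambled-response rms error: {chance:.2f}")
```

When a genuine relationship is present, the scrambled fits give a clearly larger rms error than the real fit, which is the pattern reported for the type 1, type 2, and type 3 models.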

Table 4. Eleven Descriptors of the Nonlinear Type 3 QSTR Model for Two Example Compounds

Table 4 shows the 11 descriptors of the type 3 model for two compounds drawn from the data set. The calculated toxicity value of 1.43 log units for compound 1 indicates it is rather toxic to Tetrahymena, whereas the calculated toxicity value of -0.73 log units for compound 2 demonstrates that it is rather nontoxic to Tetrahymena. A closer look at the descriptors for these two compounds could give some idea as to why. The first two descriptors, N6PC-19 and N7CH-20, are counts of simple χ indices. The descriptor N6PC-19 calculates the number of path clusters of length six. A path cluster of length six is defined as a chain of six heavy atoms with at least one branch point. A branch point can be thought of as a place where continuity is broken. A possible path for compound 1 would start from the methyl carbon (atom 7) attached to the ring, continue across the top two carbons of the ring (atoms 1 and 2), include both the carbonyl carbon (atom 8) and the oxygen (atom 9), and terminate at the bridge point of the second aromatic ring structure (atom 10). In this example, there is one branch point that occurs at the carbonyl carbon, atom 8. A second path cluster would once again start at the methyl carbon, include the carbon labeled atom 1, begin counterclockwise to include the carbon labeled atom 6, then backtrack clockwise around the ring, and end at the carbon labeled atom 3, including the carbonyl carbon (atom 8) in the trip. In this example, there are two branch points. The first occurs at the carbon atom labeled atom 1, and the second occurs at the carbon labeled atom 2. Looking at compound 2, one of the


Table 5. Compound Identification, Structure, and Predicted and Observed Tetrahymena Toxicity for the External Prediction Set


possible path clusters of length six would start from the nitrogen of the amine group labeled atom 7, continue in a counterclockwise fashion around the ring, include the carbon attached to the ring (atom 8), and end at the ring carbon directly across from the amine group (atom 4). Clearly there are many more path clusters of length six for a larger, more branched compound than exist for a smaller, less branched compound. The descriptor N7CH-20 encodes very similar features. The path chain is a string of atoms, in this case 7, that must contain at least one ring. The path chain descriptor of length seven encodes substituted benzene rings quite well. In compound 1, there are only three possible ways a path chain of length 7 occurs. The first starts from the carbonyl carbon labeled atom 8 and continues around the six carbons that comprise the aromatic ring that contains the methyl group. The second proceeds as the first but ends in the aromatic ring without the methyl group. The final path chain of length seven begins at the methyl

group carbon (atom 7) and proceeds counterclockwise around the attached aromatic ring (atoms 1-6). In compound 2, there are only two possible path chains of length seven. One begins at the amine nitrogen labeled atom 7 and includes all the carbon atoms (atoms 1-6) around the ring. The other begins at the carbon labeled atom 8 and continues around the ring including the carbons labeled atoms 1-6. The other descriptors that stand out are SCAA-2, CHDH-1, and CTAA-0. All of these descriptors encode information about a molecule’s ability to hydrogen bond. The descriptors SCAA-2 and CHDH-1 encode information on the ability of the molecule to hydrogen bond with itself. This could possibly be important at high concentrations and could explain why compounds with nonzero values for these descriptors exhibit lower toxicity than ones that have zero values here. A value of zero indicates that this compound is not able to hydrogen bond with itself, meaning there may be at least one proton acceptor site


Table 6. Outliers from the 52 Member External Prediction Set

Figure 2. Plot of calculated -log(ICG50) versus experimental -log(ICG50) for 52 member external prediction set.

or at least one proton donor site, but not both. In compound 1, there is one proton acceptor site, but no donor sites, so both SCAA-2 and CHDH-1 are 0. In compound 2, however, the oxygen of the hydroxyl group and the nitrogen of the amine group function as proton acceptor sites, while the hydrogen of the hydroxyl group functions as the proton donor site. Therefore, compound 2 has nonzero values for SCAA-2 and CHDH-1. The descriptor CTAA-0 encodes information on the molecule's ability to hydrogen bond with a polar solvent, like water. So, a compound need only have a proton acceptor site or a proton donor site, or both. This particular descriptor encodes the number of proton acceptor sites. In compound 1, the oxygen in the carbonyl group is the only proton acceptor site. In compound 2, as before, the nitrogen of the amine group and the oxygen of the hydroxyl group serve as proton acceptor sites.

External Prediction Set. Once the models have been created and validated, they are ready for use. In this case, the type 3 model was chosen as optimal based on the low tset and cvset rms errors. The type 3 model was used to predict the 50% growth impairment concentrations toward Tetrahymena for an external set of 52 compounds from a nonproprietary portion of the TerraTox database. These compounds are structurally similar to the compounds of the tset and cvset previously used, as shown in Table 5. The 52-member external pset has an rms error of 0.59 log units. A plot of predicted -log(ICG50) versus observed -log(ICG50) is shown in Figure 2. This high error can be attributed mainly to the 10% of the compounds that are shown in Table 6. There is no structural reason these compounds should be problematic since each type is adequately described in the tset. However, if these compounds are removed, the external pset rms error reduces to 0.35 log units. 
Thus, 90% of the compounds in the external prediction set had their -log(ICG50) values predicted with rms errors comparable to those of the prediction set used to validate the type 3 model. The three models created can readily be developed and tested using many QSTR programs. This work uses the

ADAPT software system, but any QSTR program that can create the topological, geometric, and electronic descriptors and can create MLR models and the CNN models previously described could implement these models.

Conclusions

Three 11-descriptor models based solely on molecular structure were developed. First, an 11-descriptor, completely linear type 1 model was formed using linear feature selection and linear model development. Then, the descriptors optimally chosen for the type 1 model were fed to a CNN, where an 11-3-1 architecture was chosen as the best hybrid linear-nonlinear type 2 model. Finally, a completely nonlinear 11-descriptor type 3 model was calculated using nonlinear feature selection and nonlinear model development. The completely nonlinear type 3 model based on an average of 10 CNNs was chosen as best based on rms error and r values (36). The rms errors of the tset, cvset, and pset are 0.28, 0.29, and 0.34 log units, respectively. A common theme among the descriptors chosen by the linear and nonlinear methods is that the toxicity of these compounds to Tetrahymena may be related to the size, degree of branching, and hydrogen-bonding ability of the molecule of interest.

Acknowledgment. This work was partially supported by DuPont Stine-Haskell Laboratory.

References

(1) Hill, D. L. (1972) The Biochemistry and Physiology of Tetrahymena, 1st ed., pp 230, Academic Press, New York.
(2) Eldred, D. V., Weikel, C. L., Jurs, P. C., and Kaiser, K. L. E. (1999) Prediction of fathead minnow acute toxicity of organic compounds from molecular structure. Chem. Res. Toxicol. 12, 670-678.
(3) Kauffman, G. W., and Jurs, P. C. (2000) Prediction of the sodium ion-proton antiporter by benzoylguanidine derivatives from molecular structure. J. Chem. Inf. Comput. Sci. 40, 753-761.
(4) Patankar, S. J., and Jurs, P. C. (2000) Prediction of IC50 values for ACAT inhibitors from molecular structure. J. Chem. Inf. Comput. Sci. 40, 706-723.
(5) Wessel, M. D., Jurs, P. C., Tolan, J. W., and Muskal, S. M. (1998) Prediction of human intestinal absorption of drug compounds from molecular structure. J. Chem. Inf. Comput. Sci. 38, 726-735.
(6) Johnson, S. R., and Jurs, P. C. (1999) Prediction of the clearing temperatures of a series of liquid crystals from molecular structure. Chem. Mater. 11, 1007-1023.
(7) Engelhardt McClelland, H., and Jurs, P. C. (2000) Quantitative structure-property relationships for the prediction of vapor pressures of organic compounds from molecular structure. J. Chem. Inf. Comput. Sci. 40, 967-975.
(8) Cumpson, P. J. (2001) Estimation of inelastic mean free paths for polymers and other organic materials: use of quantitative structure-property relationships. Surf. Interface Anal. 31, 23-34.
(9) DeWeese, A. D., and Schultz, T. W. (2001) Structure-activity relationships for aquatic toxicity to Tetrahymena: Halogen-substituted aliphatic esters. Environ. Toxicol. 16, 54-60.
(10) Filov, V. A., Golubev, A. A., Liublina, E. I., and Tokontsev, N. A. (1979) Quantitative Toxicology, 1st ed., pp 462, John Wiley & Sons, New York.
(11) LeBlond, J. D., Applegate, B. M., Menn, F. M., Schultz, T. W., and Sayler, G. S. (2000) Structure-toxicity assessment of metabolites of the aerobic bacterial transformation of substituted naphthalenes. Environ. Toxicol. Chem. 19, 1235-1246.
(12) Niculescu, S. P., Kaiser, K. L. E., and Schultz, T. W. (2000) Modeling the toxicity of chemicals to Tetrahymena pyriformis using molecular fragment descriptors and probabilistic neural networks. Arch. Environ. Con. Toxicol. 39, 289-298.
(13) TerraBase Inc. (2000) TerraTox 2000: Explorer CD-Rom, Burlington, Ontario.
(14) Schultz, T. W. (1997) Teratox: Tetrahymena pyriformis Population Growth Impairment Endpoint. Toxicol. Methodol. 7, 289-309.
(15) Verhaar, H. J. M., Vanleeuwen, C. J., and Hermens, J. L. M. (1992) Classifying environmental pollutants 1. Structure-activity relationships for prediction of aquatic toxicity. Chemosphere 25, 471-491.
(16) Kaiser, K. L. E., and Niculescu, S. P. (1999) Using probabilistic neural networks to model the toxicity of chemicals to the fathead minnow (Pimephales promelas): A study based on 865 compounds. Chemosphere 38, 3237-3245.
(17) Kaiser, K. L. E., and Niculescu, S. P. (2001) Modeling acute toxicity of chemicals to Daphnia magna: A probabilistic neural network approach. Environ. Toxicol. Chem. 20, 420-431.
(18) Stewart, J. J. P. (1990) Special Issue - Mopac - a semiempirical molecular-orbital program. J. Comput.-Aided Mol. Des. 4, 1-45.
(19) Stewart, J. P. P. (1991) MOPAC (Molecular Orbital Package), version 6.0.
(20) Murray, W. J., and Kier, L. B. (1976) Molecular connectivity 6. Examination of the parabolic relationship between molecular connectivity and biological activity. J. Med. Chem. 19, 573-578.
(21) Murray, W. J., Hall, L. H., and Kier, L. B. (1975) Molecular connectivity III: Relationship to partition coefficients. J. Pharm. Sci. 64, 1978-1981.
(22) Kier, L. B., and Hall, L. H. (1976) Molecular connectivity VII: Specific treatment of heteroatoms. J. Pharm. Sci. 65, 1806-1809.
(23) Kier, L. B., Murray, W. J., Randic, M., and Hall, L. H. (1976) Molecular connectivity V: Connectivity series applied to density. J. Pharm. Sci. 65, 1226-1230.
(24) Kier, L. B., and Murray, W. J. (1975) Molecular connectivity 4. Relationships to biological activities. J. Med. Chem. 18, 1272-1274.
(25) Kier, L. B., Hall, L. H., Murray, W. J., and Randic, M. (1975) Molecular connectivity I: Relationship to nonspecific local anesthesia. J. Pharm. Sci. 64, 1971-1974.
(26) Hall, L. H., Kier, L. B., and Murray, W. J. (1975) Molecular connectivity II: Relationship to water solubility and boiling point. J. Pharm. Sci. 64, 1974-1977.
(27) Bondi, A. (1964) van der Waals volumes and radii. J. Phys. Chem. 68, 441-451.
(28) Stouch, T. R., and Jurs, P. C. (1986) A simple method for the representation, quantification, and comparison of the volumes and shapes of chemical compounds. J. Chem. Inf. Comput. Sci. 26, 4-12.
(29) Abraham, R. J., and Smith, P. E. (1987) Charge calculations in molecular mechanics IV: A general method for quantitative systems. J. Chem. Inf. Comput. Sci. 9, 288-297.
(30) Dixon, S. L., and Jurs, P. C. (1992) Atomic charge calculations for quantitative structure property relationships. J. Comput. Chem. 13, 492-504.
(31) Stanton, D. T., and Jurs, P. C. (1990) Development and use of charged partial surface area structural descriptors in computer-assisted quantitative structure-property relationship studies. Anal. Chem. 62, 2323-2329.
(32) Pimentel, G. I., and McClellan, A. L. (1960) The Hydrogen Bond, Freeman, San Francisco.
(33) Vinogradov, S. N., and Linnell, R. H. (1971) Hydrogen Bonding, van Nostrand Reinhold, New York.
(34) Luke, B. T. (1994) Evolutionary programming applied to the development of quantitative structure-activity-relationships and quantitative structure-property relationships. J. Chem. Inf. Comput. Sci. 34, 1279-1287.
(35) Sutter, J. M., Dixon, S. L., and Jurs, P. C. (1995) Automated descriptor selection for quantitative structure-activity relationships using generalized simulated annealing. J. Chem. Inf. Comput. Sci. 35, 77-84.
(36) Walpole, R. E., and Myers, R. H. (1993) Probability and Statistics for Engineers and Scientists, 5th ed., pp 766, Prentice Hall, Englewood Cliffs.
(37) Wessel, M. D., and Jurs, P. C. (1994) Prediction of reduced ion mobility constants from structural information using multiple linear-regression analysis and computational neural networks. Anal. Chem. 66, 2480-2487.
(38) Randic, M., Brissey, G. M., Spencer, R. B., and Wilkins, C. L. (1979) Search for all self-avoiding paths for molecular graphs. Comput. Chem. 3, 5-13.
(39) Randic, M. (1984) On molecular identification numbers. J. Chem. Inf. Comput. Sci. 24, 164-175.
(40) Sharma, V., Goswami, R., and Madan, A. K. (1997) Eccentric connectivity index: A novel highly discriminating topological descriptor for structure-property and structure-activity studies. J. Chem. Inf. Comput. Sci. 37, 273-282.
(41) Muller, W. R., Szymanski, K., and Knop, J. V. (1987) An algorithm for construction of the molecular distance matrix. J. Chem. Inf. Comput. Sci. 8, 170-173.
(42) Liu, S., Cao, C., and Li, Z. (1998) Approach to estimation and prediction for the normal boiling point (NBP) of alkanes based on a novel Molecular Distance-Edge (MDE) vector, Lambda. J. Chem. Inf. Comput. Sci. 38, 387-394.
