11. Applications of Molecular Connectivity Indexes and Multivariate Analysis in Environmental Chemistry

Gerald J. Niemi (1), Ronald R. Regal (2), and Gilman D. Veith (3)

(1) Department of Pharmacology, University of Minnesota-Duluth, Duluth, MN 55812
(2) Department of Mathematical Sciences, University of Minnesota-Duluth, Duluth, MN 55812
(3) Environmental Research Laboratory, U.S. Environmental Protection Agency, Duluth, MN 55804

Downloaded by GRIFFITH UNIV on September 8, 2017 | http://pubs.acs.org Publication Date: November 6, 1985 | doi: 10.1021/bk-1985-0292.ch011


We have developed a data matrix of 90 variables calculated from molecular connectivity indices for 19,972 chemicals in the Toxic Substances Control Act (TSCA) inventory of industrial chemicals. Principal component analysis of this matrix revealed eight principal components that explained more than 93% of the variation in these data. The first three principal components convey generalized information on chemical structure: size, degree of branching, and number of cycles. The other components contained more specific information on branching, bonding, cyclicness, valency, and combinations of these structural attributes. Here we explored the use of the connectivity indices and their calculated principal components for their potential in predicting biodegradation, as measured by biochemical oxygen demand (BOD), and the octanol/water partition coefficient. This approach showed promise in the prediction of biodegradation, but was of limited use in the prediction of the partition coefficient. Because it is possible to calculate the connectivity indices at a nominal cost for nearly all chemicals, the approach will prove especially useful for the identification of chemicals with similar structures and for systematically exploring where data are lacking on biological endpoints for chemicals in TSCA.

More than 50,000 chemicals are currently listed in the Toxic Substances Control Act (TSCA) inventory, but physical-chemical properties are available for a relatively small percentage and biological endpoints for even fewer. The costs associated with thoroughly testing all chemicals are prohibitive, so models are needed to (1) predict the environmental effects of a new chemical, or (2) assess whether the chemical should be subject to a detailed testing regime (1).
0097-6156/85/0292-0148$06.00/0 © 1985 American Chemical Society. Breen and Robinson; Environmental Applications of Chemometrics; ACS Symposium Series; American Chemical Society: Washington, DC, 1985.

Although models are available to assess the environmental effects for some groups of chemicals (2-6), new compounds often cannot be categorized into one group or another. If it is unclear when a model can be applied to a new chemical, then the power of the model is substantially reduced. To overcome this weakness, we are developing a quantitative structure-activity strategy that is conceptually applicable to all chemicals. To be applicable, at least three criteria are necessary. First, we must be able to calculate the descriptors or independent variables directly from the chemical structure and, presumably, at a reasonable cost. Second, it should be possible to calculate the variables for any chemical. Finally, and most importantly, the variables must be related to a parameter of interest so that the variables can be used to predict or classify the activity or behavior of the chemical (1).

One important area of research is the development of new variables or descriptors that quantitatively describe the structure of a chemical. The development of these indices has progressed into the mathematical areas of graph theory and topology, and a large number of potentially valuable molecular descriptors have been described (7-9). Our objective is not to develop new descriptors, but to explore the potential applications of a group of descriptors known as molecular connectivity indices (10).
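As a concrete illustration of what such an index measures, the sketch below computes the two lowest-order simple indices from a hydrogen-suppressed molecular graph. It assumes the usual Kier and Hall definitions (10), which this chapter does not restate: each atom receives a value delta equal to its number of non-hydrogen neighbors, the zero-order index sums delta^(-1/2) over atoms, and the first-order path index sums (delta_i * delta_j)^(-1/2) over bonds. The function names and the adjacency-list encoding are illustrative choices, not part of the original work.

```python
def simple_deltas(adjacency):
    """Number of non-hydrogen neighbors of each atom (hydrogen-suppressed graph)."""
    return {atom: len(neighbors) for atom, neighbors in adjacency.items()}


def chi0(adjacency):
    """Zero-order simple index: sum of delta**-0.5 over atoms (assumed definition).

    Every atom must have at least one bond, otherwise delta is zero.
    """
    return sum(d ** -0.5 for d in simple_deltas(adjacency).values())


def chi1(adjacency):
    """First-order simple path index: sum of (delta_i * delta_j)**-0.5 over bonds."""
    delta = simple_deltas(adjacency)
    total = 0.0
    for i, neighbors in adjacency.items():
        for j in neighbors:
            if i < j:  # count each bond once
                total += (delta[i] * delta[j]) ** -0.5
    return total


# Carbon skeletons only; the integer atom labels are arbitrary.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(round(chi1(n_butane), 3))   # 1.914
print(round(chi1(isobutane), 3))  # 1.732
```

The branched isomer gets the lower first-order value, which is the kind of structural difference the indices are meant to capture.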
Molecular connectivity indices are desirable as potential explanatory variables because they can be calculated for a nominal cost (fractions of a second by computer) and they describe fundamental relationships about chemical structure. That is, they describe how non-hydrogen atoms of a molecule are "connected". Here we are most concerned with the statistical properties of molecular connectivity indices for a large set of chemicals in TSCA and the presentation of the results of multivariate analyses using these indices as explanatory variables to understand several properties important to environmental chemists. We will focus on two properties for which we have a relatively large data base: (1) biodegradation as measured by the percentage of theoretical 5-day biochemical oxygen demand (BOD) (11), and (2) the n-octanol/water partition coefficient, hereafter termed log P (12).

Data Base

The U.S. EPA Environmental Research Laboratory-Duluth, with the help of their cooperators, has developed a data matrix of 90 variables calculated from molecular connectivity indices (10) for 19,972 of the chemicals in TSCA. Molecular connectivity indices consist of four primary types (paths or the edges between atoms, clusters or branches, path/clusters, and cycles or rings) that are calculated from 0th to 9th order depending on the number of connections between atoms. Path terms can include as many orders as there are edges between atoms in the molecule, the minimum order for a cluster or a cycle is three, and the minimum for a path/cluster is four.
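The order limits just stated fix the number of variables per set of indices. A minimal sketch of that count follows; the dictionary and variable names are illustrative, and the factor of three reflects the three sets of indices (simple, bond-corrected, valence-corrected) held in the data base.

```python
# Orders run from 0 to 9; each index type has a minimum order (from the text).
MAX_ORDER = 9
MIN_ORDER = {"path": 0, "cluster": 3, "path/cluster": 4, "cycle": 3}
N_SETS = 3  # simple, bond-corrected, and valence-corrected indices

per_type = {name: MAX_ORDER - lowest + 1 for name, lowest in MIN_ORDER.items()}
per_set = sum(per_type.values())

print(per_type)          # {'path': 10, 'cluster': 7, 'path/cluster': 6, 'cycle': 7}
print(per_set)           # 30
print(per_set * N_SETS)  # 90
```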
Therefore, using 0th to 9th order, the number of variables for one set of connectivity indices is 30. In our data base, we included three sets of connectivity indices, for a total of 90 variables: simple indices in which the molecule is assumed to be a saturated hydrocarbon, bond-corrected indices that adjust for double and triple bonds between atoms, and valence-corrected indices that adjust for the heteroatoms in the molecule.

The data base for biodegradation information consisted of 5-day BOD tests that were screened from the literature using a systematic review procedure (13). A total of 340 chemicals was included in this analysis. In contrast, the data base for log P values is much larger because Leo and Weininger (14) have developed an additive model to estimate log P for chemicals. Their model also gives an estimate for the confidence in the estimated value of log P. Here we used a data base of 1700 chemicals that were determined by Leo and Weininger (14) to have a high confidence in the estimated value of log P.

Statistical Analysis

We focused on four primary objectives: (1) calculation of the statistical properties of the molecular connectivity indices, (2) evaluation of how the dimensionality (90 variables) of the molecular connectivity variables could be reduced (15-17), (3) prediction of high or low 5-day BOD using the molecular connectivity variables as discriminators (18), and (4) estimation of log P values using the molecular connectivity indices as explanatory variables (19-20). Evaluation of the statistical properties is a fundamental part of any statistical analysis, and here we concentrated on the distribution of each variable.
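The per-variable distribution summary described here (and reported in Table I as mean, standard deviation, and percentage of zeroes for original and log-transformed values) can be sketched as follows. This is only an illustration: the chapter does not say how zero values were handled before taking logarithms, so the offset of 1 inside the logarithm is an assumption, and the sample values are made up.

```python
import math
import statistics


def summarize(values):
    """Mean, standard deviation, and percentage of zeroes for one index variable."""
    # Assumption: add 1 before taking logs so that zero values stay defined.
    log_values = [math.log(v + 1.0) for v in values]
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "log_mean": statistics.mean(log_values),
        "log_sd": statistics.stdev(log_values),
        "pct_zero": 100.0 * sum(1 for v in values if v == 0) / len(values),
    }


# Hypothetical values of one cycle term for eight chemicals (mostly zero).
cycle_term = [0.0, 0.0, 0.0, 0.125, 0.0, 0.25, 0.0, 0.0]
summary = summarize(cycle_term)
print(summary["pct_zero"])  # 75.0
```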
To reduce the dimensionality of this data set we used principal component analysis (PCA) to explore the covariance structure of these data and to reduce the variables to a more manageable number (PA1 method with no rotation, 21).

Because a complete explanation of the procedure used to predict the relative degree of biodegradation of chemicals would be prohibitively long (see 22), here we focus on the overall results. Briefly, our multivariate statistical analysis included (1) calculating the factor scores derived from six principal components of a principal component analysis extracted from 45 of the molecular connectivity variables for the 19,972 chemicals of TSCA, (2) using these six principal components to identify ten clusters in the principal component space with the K-means clustering algorithm of the Biomedical Computer Program (23), and (3) calculating the discriminant functions that best separated chemicals with high BOD values (theoretical BOD > 17%) from those with low BOD values (theoretical BOD < 13%; there were no chemicals with BOD values between 13 and 17%) using the molecular connectivity indices as discriminators (discriminant function analysis, 23). The BOD values were divided into high and low groups because we were most interested in identifying whether a chemical was degradable (high BOD) or persistent (low BOD).

In comparison, log P does not follow a dichotomous logic similar to the BOD values and, therefore, we treated log P as a continuous variable. For the predictions of log P we explored several multiple regression models (21) and used either principal components or molecular connectivity indices as explanatory variables. With the principal components we explored the use of non-linear trends by including second and third order polynomials as explanatory variables. With the molecular connectivity indices we used two different variable modifications in the regression models: (1) logarithmic transformations, and (2) standardization by size, whereby the logarithm of the number of atoms was subtracted from the logarithm of each variable. The regression analyses included both forward and stepwise methods. We were primarily concerned with finding combinations of variables that best reduced the standard error of the estimate (square root of the mean square error) (21).

Results

Statistical properties of connectivity indices. Because there are so many variables involved in this analysis, we only included the mean and standard deviation of the original and log-transformed data for the simple molecular connectivity indices (Table I). The relative magnitudes of the means and standard deviations for the bond-corrected and valence-corrected indices were very similar to those for the simple indices. There are two features of these variables that have an important bearing on their statistical distribution. First, each variable is skewed to the right because of the presence of a few large molecules relative to the bulk of the chemicals in the data base.
Second, many of the variables have a relatively large number of zero values. This is especially pronounced in the third and fourth order cycle terms because more than 98% of the industrial chemicals do not have this configuration. The log-transformations improve the distributions considerably in terms of the ratio between the mean and standard deviation, but many of the variables do not approximate a normal distribution. Given these problems in the distribution of the data, we proceed with caution and consider subsequent analyses as exploratory. Furthermore, the shortcomings in the distribution of these data support the use of PCA in the data reduction as opposed to procedures of factor analysis. Because PCA is used in an exploratory manner and no specific hypothesis is being tested (24), the assumption of normality in these data can be relaxed.

Table I. Statistical properties of 0th to 9th order path, cluster, path/cluster, and cycle types of simple connectivity indices.

                            Original               Log-transformed*
                                   Standard               Standard     Percentage
Term            Order     Mean     deviation     Mean     deviation    of zeroes
Path              0      11.24       5.35        1.09       .19           0.0
                  1       7.10       3.51         .91       .20           0.0
                  2       6.34       3.51         .86       .21           0.4
                  3       4.67       3.00         .74       .23           4.9
                  4       3.45       2.51         .63       .25          10.1
                  5       2.51       2.15         .52       .26          19.0
                  6       1.58       1.56         .39       .25          33.3
                  7       1.00       1.19         .28       .23          38.5
                  8        .61        .86         .20       .20          43.5
                  9        .37        .63         .14       .17          50.1
Cluster           3       1.23       1.16         .32       .18          37.6
                  4        .09        .21         .03       .07          80.2
                  5        .36        .93         .11       .14          51.1
                  6        .09        .47         .02       .09          82.8
                  7        .18        .89         .04       .12          66.5
                  8        .11        .83         .02       .10          89.5
                  9        .13        .96         .02       .11          81.2
Path/cluster      4       2.35       2.68         .47       .26          18.1
                  5       3.23       3.86         .54       .32          18.2
                  6       4.68       6.57         .61       .40          20.2
                  7       5.43       8.74         .62       .45          22.8
                  8       6.09      11.80         .60       .50          25.0
                  9       6.50      15.09         .56       .54          29.3
Cycle             3        .01        .04         .00       .02          98.5
                  4        .01        .05         .00       .02          98.1
                  5        .02        .10         .01       .03          84.5
                  6        .07        .19         .05       .06          36.2
                  7        .12        .36         .08       .09          40.6
                  8        .17        .59         .11       .13          45.9
                  9        .21        .88         .13       .17          48.2

* natural logarithms

Data reduction. We used the log-transformed data in all analyses presented here. The PCA resulted in eight principal components with eigenvalues > 1, and they explained 93.5% of the variation in the original data (Table II). The first three principal components all convey generalized information on chemical structure: size (PC 1), degree of branchness (PC 2), and number of cycles (PC 3). PC 1 was positively correlated with all 90 variables (r > .32), except for the cyclic variables, in which r was as low as .07 for the 3rd order cyclic variables. PC 2 was positively correlated (r > .26) with all cluster variables, but negatively correlated with all path and cyclic variables. PC 3 was positively correlated (r > .32) with all cyclic variables and negatively correlated with most of the other variables. The remaining principal components explained patterns of variation among chemicals in terms of gradients of cyclicness, bonding, branching, valency (e.g., number and positions of heteroatoms such as halogens and oxygen), and combinations of these structural attributes.

Table II. Interpretations of 8 principal components calculated from 90 variables based on molecular connectivity indices for 19,972 industrial chemicals.

Principal                 Variation
component   Eigenvalue    explained (%)   Low values of principal component; high values of principal component
    1         47.36          52.6         Low: small molecules. High: large molecules.
    2         12.14          13.5         Low: few branches on molecule. High: multi-branched molecules.
    3         10.53          11.7         Low: non-cyclic molecules. High: multi-cyclic molecules.
    4          5.19           5.8         Low: 7th to 9th order cycles. High: 3rd to 4th order cycles.
    5          3.13           3.5         Low: molecules with single bonds and simple branching patterns. High: multi-branched molecules with double or triple bonds and/or with many heteroatoms.
    6          2.83           3.2
    7          1.74           1.9
    8          1.22           1.4

Descriptions listed for components 6 to 8 (their assignment to component and to low or high values is not recoverable from the source layout): complex valence-corrected branched chemicals with many heteroatoms; 5th to 7th order cycles; long chain molecules with few heteroatoms; complex 3rd and 4th order cyclic molecules; complex branching patterns and multi-cyclic molecules with few heteroatoms; short chain molecules with complex branches.

The results of the principal component analysis describe the "intrinsic" dimensionality that is measured about chemical structure by the molecular connectivity indices. Eight principal components explain 93.5% of the variation in the 90 original variables, but 23 principal components are required to explain 99.0% of the variation in these data. The relatively high percentage of explained variation in the first three principal components (77.8%) indicates that there are relatively high correlations among the variables derived from the molecular connectivity indices, and it is clear that size, branchness, and cyclicness are major structural properties of chemicals. In terms of data reduction, the principal component analysis has reduced the dimensionality from 90 variables to 8 new variables that still explain 93.5% of the variation in the data set.
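The percentages quoted above follow directly from the eigenvalues in Table II. The short check below assumes the components were extracted from the correlation matrix of the 90 variables, so that the total variance equals 90; that assumption is inferred from the table (47.36 / 90 = 52.6%) rather than stated in the text.

```python
# Eigenvalues of the eight retained components, as listed in Table II.
eigenvalues = [47.36, 12.14, 10.53, 5.19, 3.13, 2.83, 1.74, 1.22]
N_VARIABLES = 90  # assumed total variance of 90 standardized variables

percent = [100.0 * ev / N_VARIABLES for ev in eigenvalues]
retained = [ev for ev in eigenvalues if ev > 1.0]  # eigenvalue > 1 rule from the text

print(len(retained))               # 8
print(round(sum(percent[:3]), 1))  # 77.8
print(round(sum(percent), 1))      # 93.5
```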
Furthermore, if we can interpret the first three principal components as a size axis (PC 1), an axis of branchness (PC 2), and an axis of cyclicness (PC 3), then we can envision the higher order components as conveying both subtle and potentially important structural information because they are statistically uncorrelated (r = 0.0) or "independent" of size, branchness, and cyclicness.

Prediction of BOD value. In the ten clusters identified by the K-means clustering procedure, two clusters were represented by chemicals with only low BOD values and one cluster by chemicals with nearly all (18 of 19, or 95%) high BOD values (Table III). Therefore, no discrimination was attempted within these clusters. In the remaining clusters there were 202 high BOD chemicals and 97 low BOD chemicals. Of these, approximately 75% (152 of 202) were correctly classified into the high BOD class, while 73% (71 of 97) were correctly classified into the low BOD class. Using both the clustering and discrimination analyses, 77% (170 of 220) of the high BOD chemicals and 78% (93 of 120) of the low BOD chemicals in the data base were correctly classified.

Within each of the clusters, between 2 and 4 molecular connectivity indices were used in the final discriminant functions to separate the two classes of BOD. Within each cluster a different combination of variables was used as discriminators. Because of the exploratory nature of this analysis, we lowered the F-ratio inclusion level to 1.0. In several of the clusters, the F-ratios for variables included in the discriminant functions were consequently small (e.g., < 4.0).
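The combined classification rates can be reproduced from the counts given in the text and in Table III. The sketch below rests on one inference from the reported totals: the cluster with 18 high BOD chemicals and 1 low BOD chemical is treated as entirely high BOD, so its single low BOD chemical counts as misclassified. Variable names are illustrative.

```python
# Clusters assigned wholesale to one class (counts from Table III).
all_low_clusters = 6 + 16      # clusters 1 and 2 contain only low BOD chemicals
mostly_high_cluster = (18, 1)  # cluster 5: 18 high BOD, 1 low BOD

# Discriminant analysis within the remaining seven clusters (counts from the text).
high_correct_dfa, high_total_dfa = 152, 202
low_correct_dfa, low_total_dfa = 71, 97

high_correct = high_correct_dfa + mostly_high_cluster[0]
high_total = high_total_dfa + mostly_high_cluster[0]
low_correct = low_correct_dfa + all_low_clusters
low_total = low_total_dfa + all_low_clusters + mostly_high_cluster[1]

print(high_correct, high_total, f"{100 * high_correct / high_total:.1f}")  # 170 220 77.3
print(low_correct, low_total, f"{100 * low_correct / low_total:.1f}")      # 93 120 77.5
```

These values correspond to the 77% and 78% reported for the high and low BOD classes.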
Predictions of log P with regression. As would be expected, the largest values of the explained variation (r squared) and the smallest standard errors of estimate found with the regression models were those that included all 90 variables. These models included one with log-transformed variables and the other with both log-transformations and with each variable standardized by size (Table IV). In general, the standard errors of the estimates were large relative to the mean, which indicates a relatively poor fit for the models tested here.


Table III. Summary of cluster analysis and discriminant function analysis of high (> 17% of theoretical) and low (< 13% of theoretical) biochemical oxygen demand (BOD) values for 340 chemicals.

             Cluster analysis:          Discrimination analysis:
             number of chemicals        % correctly classified
Cluster      High BOD     Low BOD       High BOD     Low BOD
   1             0            6            -            -
   2             0           16            -            -
   3             5           13           60           85
   4            13            9           85           78
   5            18            1            -            -
   6            35           18           60           78
   7            10           12           80           83
   8            60           15           83           53
   9            29           14           72           64
  10            50           16           76           75
Total          220          120           77           78

Table IV. Summary of multiple regression analyses for the prediction of log P from molecular connectivity indices (sample size = 1700).

Number of                                Regression    Overall    R², first 10    R², all      Standard error
variables   Modification                 method        F          variables       variables    of estimate
   90       logarithm                    stepwise      34.8          .54             .62           1.26
   90       logarithm                    forward       30.2          .41             .62           1.26
   90       logarithm + size             stepwise      35.3          .54             .62           1.26
   90       logarithm + size             forward       30.1          .27             .62           1.26
   24       polynomials, third degree    stepwise      32.1          .24             .31           1.66
    8       principal component          stepwise      64.0           -              .23           1.75


Discussion

Of the two exploratory analyses that we presented, one showed relatively promising results for the general classification of high or low BOD values, while the other, with log P, provided unsatisfactory results. Among the advantages of the statistical procedures we used in the BOD model may have been the effect of reducing the dimensionality (PCA) and the complexity of the problem by clustering the chemicals into smaller groups. In contrast, for the regression analyses with log P, we attempted to deal with the problem on a global scale whereby all chemicals were included in one large analysis. Suckling et al. (25) have noted this dilemma by pointing out that complex problems are difficult to solve in one piece and need to be broken down into components, especially for the purposes of modelling. In the analytical procedures of the BOD model we used this process: we (1) reduced the dimensionality of the data, (2) clustered the chemicals according to similar structural properties, and (3) attempted to analyze the parameter of interest within a specific cluster or group of similar chemicals. A further advantage of this procedure is that a chemical can be assigned independently to a cluster without prior biases regarding where it "fits" within the realm of existing chemicals. This is especially important from the perspective of environmental regulation of a new chemical. A newly designed chemical, by virtue of being new, does not necessarily fit into one group of chemicals or another.

Principal components. One of the potential powers of the principal components is that they are calculated from a large data base representing many chemical configurations and they are independent axes (intercorrelation r = 0.0) (24). Therefore, they are well-suited for building prediction models with multivariate methods such as regression, where a desirable property among the explanatory variables is minimal multicollinearity (26). However, in these analyses, principal components and polynomial regression of the principal components were relatively poor predictors of log P, and neither appears to be useful in this application. This contrasts with the results presented by Murray et al. (19), who found a high correlation between log P and a form of the valence-corrected molecular connectivity index. Murray et al. (19) limited their correlation analysis to distinct chemical groups such as esters and alcohols. Our analysis included all the chemicals available in a global model.

Although the inclusion of all 90 variables produced a relatively high r² and the lowest standard error of estimate, we suspect that there are many spurious correlations involved in the prediction equation (27). It is slightly more encouraging that the first ten variables included in the stepwise regression models of the 90 variable log-transformed and 90 variable log- and size-transformed data sets produced r² values of .54. The standard error of the estimates of log P with the regression equations is substantially higher than that of the estimates obtained by Leo and Weininger (14) with their model.


Exploratory data analysis and data feedback. There are a large number of variables that can be calculated from a chemical structure, yet when the number of variables becomes large relative to the sample size, spurious or trivial correlations increase (27). For example, we believe the multiple regression analyses of log P with 90 variables may represent such a situation. Likewise, the discriminant function analysis (DFA) used in the BOD model examined a large number of variables relative to the sample size within some of the clusters (e.g., see cluster 3, Table III). In actuality, we used the DFA as an exploratory tool to find potential molecular configurations that were associated with high or low BOD values (22). This exploration led to the identification of several subgraphs of a molecule that may be associated with persistence or degradability of a chemical. The next step in the process would be to identify chemicals with these specific subgraphs and test them for their relative degradability.

We believe that two of the main limitations for progress in applications of multivariate analyses in environmental chemistry at the current time are (1) the lack of a large data base (hundreds or thousands of chemicals) that has been collected systematically for endpoints of environmental concern (e.g., biodegradation, toxicity, and carcinogenicity), and (2) the lack of a feedback mechanism whereby new chemicals or chemicals with specific molecular configurations are being retested for specific biological endpoints.
Regarding the former, we recognize that there are many l i m i t a t i o n s i n the BOD data base (e.g., 28); however, there are no alternative data bases with a sample size i n the hundreds that can be used i n attempts t o model the important environmental process of biodégradation. Future directions We are pursuing analyses with a v a r i e t y of additional endpoints, but i t i s a l r e a d y c l e a r t h a t c o n n e c t i v i t y i n d i c e s and multivariate techniques w i l l be useful i n some applications and l i m i t e d i n o t h e r s . A p o w e r f u l p r a c t i c a l a p p l i c a t i o n of t h i s approach i s that many variables can now be calculated f o r nearly a l l chemicals. Therefore, the "universe" of i n d u s t r i a l chemicals i n TSCA can be d e f i n e d i n m u l t i - d i m e n s i o n a l space. T h i s space gives EPA a tool for the i d e n t i f i c a t i o n of (1) whether there i s a s i m i l a r l y - s t r u c t u r e d chemical to a new chemical that needs to be evaluated, or (2) where data on b i o l o g i c a l endpoints are lacking w i t h i n t h i s c h e m i c a l space. The former concept of s i m i l a r l y s t r u c t u r e d or analogous c h e m i c a l s having s i m i l a r b i o l o g i c a l a c t i v i t y i s at the heart of the s t r u c t u r e - a c t i v i t y p e r s p e c t i v e f o r p r e d i c t i n g the e n v i r o n m e n t a l e f f e c t s of c h e m i c a l s . The l a t t e r p r o v i d e s an e f f e c t i v e , o b j e c t i v e means f o r t h e i d e n t i f i c a t i o n of what chemicals need to be tested. Both tools w i l l be u s e f u l i n understanding the r e l a t i o n s h i p s between chemical structure and b i o l o g i c a l a c t i v i t y and i n a p p l i c a t i o n s for the assessment of the environmental e f f e c t s of chemicals. 
One of the most important aspects, however, will be to understand the capabilities of the approach and especially its limitations.
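The two uses of a multi-dimensional chemical space described above (finding similarly structured chemicals, and locating regions where endpoint data are lacking) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in this chapter: it assumes the descriptors are standardized before the principal components are extracted, that similarity is measured by Euclidean distance between component scores, and that a Boolean mask marks chemicals with a measured endpoint. All function names and the toy data are hypothetical.

```python
import numpy as np


def principal_component_scores(X, n_components):
    """Standardize descriptors and project them onto leading components.

    Returns the scores and the fitted (mean, std, loadings) so that new
    chemicals can be projected into the same space.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                  # guard against constant columns
    Z = (X - mean) / std
    # Rows of Vt are the principal axes of the standardized data.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components].T
    return Z @ loadings, (mean, std, loadings)


def project(x_new, model):
    """Project one new chemical's descriptor vector into the fitted space."""
    mean, std, loadings = model
    return ((x_new - mean) / std) @ loadings


def nearest_neighbors(scores, query, k):
    """Indices and distances of the k chemicals closest to the query."""
    dist = np.linalg.norm(scores - query, axis=1)
    order = np.argsort(dist)[:k]
    return order, dist[order]


def distance_to_nearest_tested(scores, tested_mask):
    """Distance from every chemical to its closest tested chemical.

    Large values flag regions of the space without endpoint data.
    Uses a full pairwise distance array, so it suits small examples only.
    """
    tested = scores[tested_mask]
    pairwise = np.linalg.norm(scores[:, None, :] - tested[None, :, :], axis=2)
    return pairwise.min(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((20, 6))     # toy descriptors, not real indices
    scores, model = principal_component_scores(X, n_components=3)

    idx, dist = nearest_neighbors(scores, project(X[4], model), k=3)
    print("nearest chemicals to chemical 4:", idx)

    tested = np.zeros(20, dtype=bool)
    tested[:5] = True                    # pretend the first five were tested
    gaps = distance_to_nearest_tested(scores, tested)
    print("chemical farthest from any tested one:", int(np.argmax(gaps)))
```

For an inventory with tens of thousands of chemicals, the pairwise distance array in `distance_to_nearest_tested` would be too large to hold in memory, and a chunked or tree-based search would be a more practical design.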

Breen and Robinson; Environmental Applications of Chemometrics ACS Symposium Series; American Chemical Society: Washington, DC, 1985.


Kowalski (29) has correctly emphasized that "the measurements made by analytical chemists are associated with some degree of uncertainty," so "it is difficult to conceive of a more perfect marriage than analytical chemistry and statistics and applied mathematics." The marriage is the study of chemometrics and we are likely "only at the threshold of realizing the importance of statistical and mathematical techniques" (30) in applications of chemical measurements. Multitudes of data are currently capable of being generated by sophisticated analytical equipment. Similarly, adaptable software and high-speed computers are able to handle these large data sets. The next decade promises to be the "decade when chemistry advances as a multivariate science" (31).


Acknowledgments

This research was supported by a cooperative agreement (CR8108240190) between the U.S. Environmental Protection Agency and the University of Minnesota, Duluth.

Literature Cited

1. Veith, G. D. "State-of-the-Art Report on Structure-activity Methods Development"; Environmental Research Laboratory, Duluth, 1980.
2. Ogino, A.; Matsumura, S.; Fujita, T. J. Med. Chem. 1980, 23, 437.
3. Yuan, M.; Jurs, P. Toxicol. and Appl. Pharmacol. 1980, 52, 294.
4. Ray, S. K.; Basak, S. C.; Raychaudhury, C.; Roy, A. B.; Ghosh, J. J. Ind. J. Chem. 1981, 20B, 894.
5. Birge, W. J.; Cassidy, R. A. Fund. and Appl. Toxicol. 1983, 3, 359.
6. Gillett, J. W. Environ. Toxicol. and Chem. 1983, 2, 463.
7. Balaban, A. T. Theoret. Chim. Acta. 1979, 53, 355.
8. Sabljic, A.; Trinajstic, N. Acta Pharm. Jugosl. 1981, 31, 189.
9. Bertz, S. H. "A Mathematical Model of Molecular Complexity"; King, R. B., Ed.; STUDIES IN PHYSICAL AND THEORETICAL CHEMISTRY Vol. 28, Elsevier Science Publishers, Amsterdam, 1983; pp. 206-221.
10. Kier, L. B.; Hall, L. H. "Molecular Connectivity in Chemistry and Drug Research"; Academic Press, New York, 1976, p. 257.
11. Sawyer, C. N.; Bradney, L. Sew. Works J. 1946, 18, 1113.
12. Hansch, C.; Quinlan, J. E.; Lawrence, G. L. J. Org. Chem. 1968, 33, 347.
13. Vaischnav, D. "Biochemical oxygen demand data base"; Call, D. J.; Brooke, L. T.; Vaischnav, D.; AQUATIC POLLUTANT HAZARD ASSESSMENT AND DEVELOPMENT OF HAZARD PREDICTION TECHNOLOGY BY QUANTITATIVE STRUCTURE-ACTIVITY RELATIONSHIPS. University of Wisconsin, Superior research project report (CR809234) 1984.
14. Leo, A.; Weininger, D. "CLOGP Version 3.2 User Reference Manual"; Medicinal Chemistry Project, Pomona College, Claremont, 1984.
15. Cramer, R. D. J. Am. Chem. Soc. 1980, 102, 1837.
16. Lukovits, I.; Lopata, A. J. Med. Chem. 1980, 23, 449.
17. Burkhard, L. P.; Andren, A. W.; Armstrong, D. E. Chemosphere 1983, 12, 935.
18. Geating, J. "Project Summary: Literature Study of the Biodegradability of Chemicals in Water: Vols. 1 and 2"; U.S. Environmental Protection Agency, Cincinnati, EPA-600/s2-81175/176, 1981, p. 4.
19. Murray, W. J.; Hall, L. H.; Kier, L. B. J. Pharm. Sci. 1979, 64, 1978.
20. Govers, H.; Ruepert, C.; Aiking, H. Chemosphere 1984, 13, 227.
21. Nie, N. H.; Hull, C. H.; Jenkins, J. G.; Steinbrenner, K.; Bent, D. H. "Statistical Package for the Social Sciences"; McGraw-Hill, New York, 1975, p. 675.
22. Niemi, G. J.; Regal, R. "Predicting Biodegradability from Chemical Structure: the Use of Multivariate Analysis"; U.S. Environmental Protection Agency, Duluth, 1983.
23. Dixon, W. J.; Brown, M. B. "Biomedical Computer Programs - P Series"; University of California Press, Berkeley, 1979, p. 880.
24. Morrison, D. F. "Multivariate Statistical Methods"; McGraw-Hill, New York, 1967, p. 338.
25. Suckling, C. J.; Suckling, K. E.; Suckling, C. W. "Chemistry Through Models"; Cambridge University Press, Cambridge, 1978.
26. Tatsuoka, M. "Multivariate Analysis"; J. Wiley, New York, 1971.
27. Wold, S.; Dunn, W. J. III. J. Chem. Inf. Comput. Sci. 1983, 23, 6.
28. Howard, P. H.; Saxena, J.; Sikka, H. Environ. Sci. Technol. 1978, 12, 398.
29. Kowalski, B. R. Anal. Chem. 1980, 52, 112R.
30. Kowalski, B. R. Anal. Chem. 1978, 50, 1309A.
31. Kowalski, B. R.; Wold, S. In "Handbook of Statistics, Vol. 2"; Krishnaiah, P. R.; Kanal, L. N. Eds.; North-Holland Publishing Co., 1982; Chap. 31.

RECEIVED June 28, 1985
