J. Chem. Inf. Comput. Sci. 1990, 30, 137-144
Cluster Analysis of Acrylates To Guide Sampling for Toxicity Testing
RICHARD G. LAWSON and PETER C. JURS*
152 Davey Laboratory, Chemistry Department, Pennsylvania State University, University Park, Pennsylvania 16802
Received October 10, 1989
A set of 143 acrylates drawn from the TSCA inventory has been investigated for structurally defined clusters of compounds to simplify sampling for future toxicity screening. Each acrylate was represented by eight descriptors calculated from the molecular structure. Several standard clustering methods were used to find five natural clusters of compounds. These five clusters are largely populated by compounds with similar chemical attributes, with separate clusters formed for compounds with high absolute partial atomic charges, hydrophobic compounds, small compounds, halogenated compounds, and large or oligomeric compounds.
INTRODUCTION
Acrylates are an important class of polymers used in a wide variety of consumer products. Unfortunately, there is concern about the safety of a few of these materials as monomers.
[Structure diagram: basic acrylate structure]
Some reports in the literature cite potential toxicity problems, and others raise questions about carcinogenic potential in animals.1-4 Because of these questions the U.S. EPA is required to treat acrylates as potential carcinogens when new compounds are being considered for introduction into the marketplace. There are more than 200 acrylates currently on the TSCA inventory. Since there are so many compounds in the acrylate family, it would be impractical to test every one for biological activity. There must be some way, however, for researchers to assess the risks of the acrylates that exist now and the risks of new compounds.
If the mechanism of toxicity or carcinogenicity for acrylates were understood thoroughly enough to allow accurate a priori predictions of hazardous compounds, this knowledge could be used directly. Research in this direction is proceeding, but it is not presently possible even to estimate the biological activity of untested compounds. Considering the structural diversity in the set of acrylates examined in this study alone, it will likely be some time before this goal is reached. In the meantime, since it is important to know now which compounds are dangerous, or are at least likely to be dangerous, another strategy must be taken. If we cannot identify the actual relationships between the physical and chemical properties of acrylates and their biological activity, we can seek empirical relationships that bridge this gap. This is the foundation of structure-activity relationship (SAR) studies. Few acrylates have known biological activity, however, so there is not yet enough information to formulate even an empirical equation.
To pursue this structure-activity relationship strategy, the compounds under consideration would first have to be divided into groups with related physical and chemical properties; representative members could then be selected from these subsets for biological testing; and finally the connection between the physicochemical properties of the groups and their activity could be sought. The goal of this paper is thus to provide a means of sampling from the set of all acrylates so that toxicity data can be gathered objectively and efficiently. There are many obvious ways to divide acrylates into groups: halogenated vs nonhalogenated, sulfonated vs nonsulfonated, oligomers vs simple monomers, by molecular weight, etc.
These possibilities are all reasonable, but unless they are considered in a systematic way, some of the groups may be inconsistent or may overlap with other groups. Another consideration is that, to derive groupings that are most likely to be related to biological activity, the discriminating features must be chosen as objectively as possible from a pool of features intended to encode the reactivity of the compounds.
The general approach of using cluster analysis as an objective tool to divide compounds into meaningful groups, so that representative members can be sampled for screening, has been described by Hodes and Willett.5,6 Willett applied cluster analysis to approximately 8500 compounds in a commercial inventory to provide an automated means of choosing trial compounds for activity screening. The compounds were separated on the basis of atom- and bond-centered fragments. Hodes' work focused on a 4980-member subset of the 232 000 compounds in the National Cancer Institute's inventory. His objective was to derive natural groups of compounds from which test compounds could be selected for cell culture screening, again on the basis of molecular fragments. The reports of these projects focused mainly on the mathematical and computer aspects of the projects, but provided evidence for the validity and value of such strategies. This paper employs the same basic strategy but focuses more on the chemical aspects of the clustering, especially the use of chemically meaningful descriptions of the compounds.
Our search for an objective clustering of the acrylates was done with the ADAPT software system. ADAPT has been described previously in the literature.
Briefly, the system is composed of more than 90 independent software modules that allow the user to enter, model, and store structures in computer disk files and then to calculate their geometric, electronic, physicochemical, and topological features in addition to simple fragments. The compounds can then be studied with multivariate statistical or pattern recognition methods. In the present study, clustering programs were employed. These programs partition the compounds into related groups based on the calculated molecular structure descriptors.
THE DATA SET
The 143 acrylates selected for this study were found via a CAS ONLINE search done with a profile designed to retrieve acrylates of commercial importance but to eliminate salts, metal-containing species, adducts, etc., from the list, since these pose problems in structure entry and in calculations. A set of over 200 acrylates was originally obtained, but some compounds had to be excluded because they had too many atoms or contained atoms unsupported in the ADAPT software; 143 acrylates
0095-2338/90/1630-0137$02.50/0 © 1990 American Chemical Society
[Figure 1. Histogram of the frequency of molecular weights for all of the acrylates. Axes: molecular weight (horizontal), frequency (vertical).]
[Figure 2. Histogram of the frequency of calculated log P values for the acrylates with a maximum value of 7.0. Axes: log P (horizontal), frequency (vertical).]
remained after these restrictions. Substantial structural diversity is obvious in the data set, with 40 halogenated compounds, 24 containing sulfur, and 37 containing nitrogen. The nominal molecular weights of the 143 compounds range from 86 to 740 amu, as shown in Figure 1. Many of the molecules have long alkyl side chains, which leads to great conformational mobility.
DESCRIPTOR GENERATION
Any structure-activity relationship is based on terms that describe the properties of the molecules accounting for activity, directly or indirectly. For example, to describe the solubility of a class of compounds empirically, one might use the polarity, density, and size of the compounds. Since it is not as clear which acrylate properties are related to toxicology, it is desirable to have a large number of features that could be related to reactivity, distribution in living systems, etc. For the 143 acrylates of interest, not only is their toxicity unknown, but there is no comprehensive body of data for any property. Therefore, it was necessary to calculate or estimate properties to describe the electronic, geometric, and physicochemical features of the compounds.
Electronic features can be used to explain reactivity: the more highly charged a compound is, the more reactive it would be in general. The differences in charges on a particular atom, or on atoms located within a substructure, for a set of compounds could possibly explain differences in their reactivity. In a similar way, the differences in the size or shape of molecules as measured by geometric descriptors could account for differences in their access to an active site or in their rate of transport through living systems. The high degree of conformational flexibility in these compounds prohibited the use of geometric descriptors, however.
To test the sensitivity of geometric descriptors to conformational mobility, we calculated the moments of inertia and other conformation-dependent parameters for several energetically reasonable conformations of several structurally dissimilar molecules. The range of descriptor values found was too great for these features to be considered for further use.
Electronic descriptors have been useful in past SAR studies. However, the common ones, Hückel molecular orbital (HMO) calculations and extended Hückel molecular orbital calculations, could not be used: the compounds were for the most part saturated, which ruled out simple HMO treatment, and they could not be modeled confidently, which ruled out extended HMO treatment. More sophisticated electronic descriptors could have been applied if that were
necessary, but with 143 compounds, simple descriptors requiring little CPU time for calculation were favored. The results obtained below indicate that the simple descriptors were adequate. Some simple semiempirical calculations of σ charge density developed by Del Re are independent of geometry, however, so these were calculated. These are based largely on dipole moments and polarizability.
Measured or estimated physicochemical descriptors such as log P (the logarithm of the partition coefficient between n-octanol and water) or molecular polarizability have been found to be related to biological activity in the past.16,17 The few acrylate compounds that have partition coefficients reported in the literature have a wide range of values, especially considering the logarithmic scale. For consistency and simplicity we calculated log P values using the CLOGP 3.5 program.20 These values are displayed in Figure 2. Some of the acrylates strained the limits of the program's parameters, however, resulting in unrealistically high values (as high as 18). Because such values are unrealistic, and to prevent them from having too much influence on later model development, we truncated the log P values for this study at 7.0.
Topological descriptors are based only on the identities of the atoms and their connections to each other and are independent of geometry. Simple counts of atoms or bonds can be used, or more complex graph theoretical molecular connectivity indices can be calculated.21 The molecular connectivity indices provide estimates of the degree of branching, or mathematical measures of the size and shape of molecules, that are especially useful for compounds that cannot be modeled confidently.
More than 60 descriptors of these types were calculated. Before clustering was begun, the variables were evaluated to determine which ones could help discriminate between different groups of compounds.
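A minimal sketch of this kind of descriptor screening, assuming the descriptors are held as a list of rows (one row per compound) and using the criterion stated in this section, namely that a descriptor is discarded when a majority of its values are zero or otherwise identical. The function name `screen_descriptors`, the `max_fraction` parameter, and the toy data are hypothetical and are not part of ADAPT.

```python
from collections import Counter

def screen_descriptors(rows, max_fraction=0.5):
    """Return the indices of descriptor columns worth keeping.

    A column is dropped when a single value (zero or otherwise) accounts
    for more than `max_fraction` of the compounds (assumed threshold).
    """
    n = len(rows)
    kept = []
    for j in range(len(rows[0])):
        # Count of the most frequent value in column j.
        top_count = Counter(row[j] for row in rows).most_common(1)[0][1]
        if top_count / n <= max_fraction:
            kept.append(j)
    return kept

# Hypothetical toy data: 6 compounds, 3 descriptors.
toy_rows = [
    [2.0, 0.0, 1.1],
    [2.0, 0.0, 2.3],
    [2.0, 0.0, 0.7],
    [2.0, 0.0, 3.9],
    [2.0, 1.5, 5.2],
    [2.0, 0.4, 4.4],
]
kept_columns = screen_descriptors(toy_rows)
print(kept_columns)
```

In the toy data the first column is constant and the second is mostly zero, so only the third column survives the screen.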
For example, if the minimum σ charge for all compounds were identical, that feature would not help separate the acrylates. Accordingly, descriptors with a majority of zero values, or of identical nonzero values, were discarded. With the acrylate data set approximately 20 compound variables were eliminated on the basis of these considerations.
The number of descriptors required to adequately describe the variance of the data was estimated by using principal components analysis.22 Although nearly 40 variables were considered, only 5 variables were required to account for 95% of the variance, and 8 variables accounted for 98% of the variance. Examination of the eigenvalues to determine the variables accounting for the majority of the variance was difficult with so many variables. An alternative method to choose highly orthogonal variables is to use vector space analysis. This involves selecting a variable at random as a basis vector and adding the variable from the remaining set that is most orthogonal to the basis vector. These
two vectors form a plane. The remaining variables are added one at a time on the basis of the angle that they form with the space of increasing dimensionality. Repeating this procedure with a number of different basis vectors leads to a stable set of orthogonal variables, which is generally the same as the set that would have been determined through principal components analysis. Highly orthogonal, i.e., minimally correlated, descriptors are desirable because they contain information that is not encoded by other descriptors. Since standard methods for feature selection in cluster analysis have not been established, selecting variables on the basis of maximum variance, or maximum information content, seemed to be a reasonable approach.
The set of eight structural descriptors found to support clustering on the basis of these criteria is shown in Table I. The number of molecular paths (ALLP 1) is a topological descriptor based on the hydrogen-suppressed graph. The path environment descriptor PATH 2 describes the steric surroundings of the acrylate substructure as embedded within the molecules. Only paths of lengths up to 10 were included in the computation. TSCH 1 is a whole-molecule descriptor codifying the degree of departure from electronic neutrality. It is the total absolute value of the σ charges in the molecule. MOLC 7 is a molecular connectivity index based on atoms with three bonds attached to them (cluster 3). CLGP 0 is the calculated value of log P from the Pomona Medicinal Chemistry project software.20 SSS 2 is the count of the number of acrylate groups present in each molecule. CHIS 6 is ²χᵛ, the path 2 valence-corrected molecular connectivity index.
KAPA 3 is ³κ, a measure of the shape of the molecular structure derived from the connection table. It is defined in terms of ³Pmax, the maximum number of paths of length 3 for a molecule of the same size; ³Pmin, the minimum number of paths of length 3 for a molecule of the same size; and ³Pi, the actual number of paths of length 3 in the molecule.23 This index is reported to measure the presence of midchain branching in a molecular structure.
Thus, each acrylate compound is represented by a point in this eight-dimensional space, where each axis is associated with one of the structural descriptors. The problem of clustering the acrylates can be stated as seeking natural clusters of points in this eight-dimensional feature space.

         1      2      3      4      5      6      7      8
  1    1.00
  2    0.47   1.00
  3    0.09   0.13   1.00
  4    0.47   0.58  -0.19   1.00
  5    0.16   0.34   0.43   0.06   1.00
  6    0.19   0.48   0.68   0.15   0.60   1.00
  7    0.22   0.01   0.47   0.14   0.31   0.47   1.00
  8   -0.24   0.40   0.31   0.23   0.27   0.52   0.14   1.00
Figure 3. Pairwise correlation matrix of the eight variables selected for clustering on the basis of variance.

Table I. Set of Eight Structural Descriptors for the Acrylates
  1   SSS 2    count of SS-2 (acrylate)
  2   CHIS 6   ²χᵛ: path 2 valence-corrected molecular connectivity
  3   MOLC 7   ³χC: cluster 3 molecular connectivity index
  4   KAPA 3   ³κ: topological index measuring the importance of midchain branching
  5   PATH 2   path environment for SS-2 (acrylate)
  6   ALLP 1   number of molecular paths
  7   TSCH 1   sum of absolute values of all σ charges
  8   CLGP 0   calculated log P

Table II. Natural Number of Clusters
  clusters   iterations   mean SI   max SI   min SI
  4          11           0.80      1.00     0.69
  5          11           0.85      1.00     0.77
  6          11           0.85      0.99     0.66
  7          11           0.76      0.97     0.61
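The vector space analysis described in the DESCRIPTOR GENERATION section can be illustrated with a small sketch. It assumes each descriptor is a mean-centered column vector and measures "most orthogonal" as the largest angle between a candidate vector and the subspace spanned by the variables already chosen (computed here with Gram-Schmidt projections). This is one reading of the procedure, not the ADAPT implementation; the function names and the toy columns are hypothetical, and constant columns are assumed to have been screened out beforehand.

```python
import math

def center(col):
    """Subtract the column mean so that angles reflect correlation."""
    mean = sum(col) / len(col)
    return [x - mean for x in col]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_orthogonal(columns, start, k):
    """Greedily pick k column indices, beginning with `start`.

    At each step the remaining column with the largest angle to the
    subspace spanned by the chosen columns is added.
    """
    vecs = [center(c) for c in columns]
    chosen = [start]
    norm0 = math.sqrt(dot(vecs[start], vecs[start]))
    basis = [[x / norm0 for x in vecs[start]]]  # orthonormal basis so far
    while len(chosen) < k:
        best, best_sin, best_resid = None, -1.0, None
        for j, v in enumerate(vecs):
            if j in chosen:
                continue
            resid = list(v)
            for q in basis:  # remove the component along each basis vector
                proj = dot(resid, q)
                resid = [r - proj * qi for r, qi in zip(resid, q)]
            norm_v = math.sqrt(dot(v, v))
            sin_angle = math.sqrt(dot(resid, resid)) / norm_v if norm_v else 0.0
            if sin_angle > best_sin:
                best, best_sin, best_resid = j, sin_angle, resid
        norm_r = math.sqrt(dot(best_resid, best_resid))
        if norm_r < 1e-12:  # everything left is linearly dependent
            break
        chosen.append(best)
        basis.append([r / norm_r for r in best_resid])
    return chosen

# Hypothetical toy descriptors: column 1 is nearly a multiple of column 0.
toy_columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.1, 3.9, 6.0, 8.1, 9.9, 12.0],
    [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
]
picked = select_orthogonal(toy_columns, start=0, k=2)
print(picked)
```

Starting from column 0, the nearly collinear column 1 is passed over in favor of column 2. Repeating the call with different `start` values mirrors the repetition over several basis vectors mentioned in the text.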
Some of the variables chosen are topological, and some are not continuous. The count of acrylate substructures is a discrete variable. Although discrete variables are not as commonly used in Euclidean space as continuous variables, their use can be just as valid. In the case of the count of acrylate substructures, which can only take on the values 1, 2, 3, 4, or 5, Euclidean distances are still meaningful. A compound with a value of 4 has twice as many acrylate substructures as a compound with a value of 2. The difference between these compounds is the same as the difference between a compound with a value of 5 and a compound with a value of 3. The more abstract topological indices are just as valid in Euclidean space for the same reasons and have been used in many other studies.
CLUSTERING TENDENCY
Once a set of descriptors has been chosen, it is important to determine that the data are not distributed randomly throughout the space. With randomly distributed data any clustering results will certainly be artifacts. Hopkins' statistic has been used to test whether a spatial distribution shows aggregation or randomness. The Hopkins' statistic is defined as

$$H = \frac{\sum_j U_j}{\sum_j U_j + \sum_j W_j}$$

where the $U_j$ are distances from randomly selected locations in the sample space to their nearest neighbor points and the $W_j$ are distances from randomly selected data points to their nearest neighbors. The number of random locations and the number of data points selected are the same and should be 5-10% of the number of points in the data set.29 This statistic compares the nearest-neighbor distance distribution of randomly selected locations to that for the randomly selected patterns. Values of H close to 1.0 indicate clustering tendency, and values of H close to 0.5 indicate random placement of the points in space. The probability of a given value of H arising due to chance can be computed as well. Thus, the Hopkins' statistic is easily calculated and is appealing in its simplicity and directness.
The Hopkins' statistic was computed 10 times, with 14 sampling points each time, for the acrylate data set. Mean values of 0.76-0.86 were found, depending on the type of descriptor scaling used (raw descriptor values as opposed to autoscaled descriptors). These values indicate very strongly that the data set is clustered and justify proceeding with cluster analysis. The probability of a random data set giving H values in this range is only approximately 5%. While this computation provides very strong evidence that the data set is clustered, it says nothing about the populations, memberships, or numbers of clusters. These were determined by partitional clustering methods.
CLUSTERING STUDIES
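Before turning to the partitioning experiments, the Hopkins' statistic from the CLUSTERING TENDENCY section can be sketched in a few lines. This is an illustrative version under stated assumptions: the "sample space" is taken to be the bounding box of the data, distances are plain Euclidean distances, and the toy data sets are synthetic. None of these choices is confirmed by the paper, and the names `hopkins` and `mean_hopkins` are hypothetical.

```python
import math
import random

def nearest_distance(point, others):
    """Euclidean distance from `point` to the closest member of `others`."""
    return min(math.dist(point, o) for o in others)

def hopkins(data, m, rng):
    """Hopkins' statistic H = sum(U) / (sum(U) + sum(W)) with m samples."""
    dims = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]
    # U: random locations in the bounding box (assumed sample space)
    # to their nearest data point.
    u = []
    for _ in range(m):
        loc = [rng.uniform(lows[d], highs[d]) for d in range(dims)]
        u.append(nearest_distance(loc, data))
    # W: randomly selected data points to their nearest *other* data point.
    w = []
    for i in rng.sample(range(len(data)), m):
        w.append(nearest_distance(data[i], data[:i] + data[i + 1:]))
    return sum(u) / (sum(u) + sum(w))

rng = random.Random(0)
# Synthetic examples: three tight clumps versus uniformly scattered points.
clump_centers = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.8)]
clustered = [[rng.gauss(cx, 0.02), rng.gauss(cy, 0.02)]
             for cx, cy in clump_centers for _ in range(50)]
uniform = [[rng.random(), rng.random()] for _ in range(150)]

def mean_hopkins(data, m=15, repeats=10):
    """Average H over several resamplings, as done 10 times in the text."""
    return sum(hopkins(data, m, rng) for _ in range(repeats)) / repeats

h_clustered = mean_hopkins(clustered)
h_uniform = mean_hopkins(uniform)
print(round(h_clustered, 2), round(h_uniform, 2))
```

With 150 points, `m=15` corresponds to sampling 10% of the data set. The clumped data should give a mean H well above 0.5, and the uniform data a value near 0.5.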
Many clustering programs have been described in the literature, so it can be difficult to decide which program is best for a specific problem. A clustering method was selected for this study by using some simple and practical guidelines. First, with the relatively large number of compounds, and eight features associated with each of them, hierarchical methods, such as single-linkage or Ward's method, would be more
Table III. Acrylate Clustering Results
CAS no.     name

Cluster 1, 8 Compounds
000307982   1,1-dihydroperfluorooctyl acrylate
017527296   2-(perfluorohexyl)ethyl acrylate
017741605   2-(perfluorodecyl)ethyl acrylate
027905459   1,1,2,2-tetrahydroperfluorodecyl acrylate
034395249   2-(perfluorododecyl)ethyl acrylate
052723963   C2,HZ6N2O8 diacrylate
056361558   bisphenol A diethylene glycol diacrylate
064448686   C2JH2808 diacrylate

Cluster 2, 13 Compounds
002156969   decyl acrylate
002156970   lauryl acrylate
003076048   tridecyl acrylate
004813574   octadecyl acrylate (stearyl acrylate)
013048345   decamethylene glycol, diacrylate
013402023   hexadecyl acrylate
013533181   oleyl acrylate
021643425   tetradecyl acrylate
048076386   eicosyl acrylate
067952505   (1-methylethylidene)bis[4,1-phenyleneoxy(1-methyl-2,1-ethanediyl)] diacrylate
070146053   (1-methylethylidene)bis[2-methyl-4,1-phenylene)oxy(1-methyl-2,1-ethanediyl)] diacrylate
070495395   methylenebis[2,1-phenyleneoxy(1-methyl-2,1-ethanediyl)] diacrylate
084732285   C20H25N3O6

Cluster 3, 71 Compounds
000095396   cyclol acrylate
000096333   methyl acrylate
000103117   2-ethylhexyl acrylate
000106638   isobutyl acrylate
000106741   2-ethoxyethyl acrylate
000106901   glycidyl acrylate
000140885   ethyl acrylate
000141322   n-butyl acrylate
000356865   1,1-dihydroperfluoropropyl acrylate
000407476   2,2,2-trifluoroethyl acrylate
000424646   1,1-dihydroperfluorobutyl acrylate
000689123   isopropyl acrylate
000818611   2-hydroxyethyl acrylate
000925600   n-propyl acrylate
000937417   phenyl acrylate
000999553   allyl acrylate
001070708   1,4-butylene diacrylate
001663394   tert-butyl acrylate
002160896   hexafluoroisopropyl acrylate
002223827   neopentylglycol diacrylate
002274115   ethylene glycol diacrylate
002399486   tetrahydrofurfuryl acrylate
002426542   (diethylamino)ethyl acrylate
002439352   (dimethylamino)ethyl acrylate
002478106   4-hydroxybutyl acrylate
002495354   benzyl acrylate
002499583   heptyl acrylate
002499958   hexyl acrylate
002664553   nonyl acrylate
002998085   sec-butyl acrylate
002998234   amyl acrylate
003066715   cyclohexyl acrylate
003121617   2-methoxyethyl acrylate
003326907   2-hydroxy-3-chloropropyl acrylate
003530367   phenylethyl acrylate
003953104   2-ethylbutyl acrylate
004074888   diethylene glycol diacrylate
005390545   2-nitrobutyl acrylate
007251903   2-butoxyethyl acrylate
007328178   diethylene glycol ethyl ether acrylate
013048334   hexamethylene glycol diacrylate
013282821   3-butoxy-2-hydroxypropyl acrylate
013533056   diethylene glycol, monoacrylate
016868136   cyclopentyl acrylate
016969101   3-phenoxy-2-hydroxypropyl acrylate
017527310   perfluorobutyl acrylate
017977092   2,2-dinitropropyl acrylate
018526073   3-(dimethylamino)propyl acrylate
018621766   2-butene-1,4-diol, diacrylate
018933921   1,3-dimethylbutyl acrylate
019485031   1,3-butylene glycol diacrylate
019660163   2,3-dibromopropyl acrylate
019721370   bis(ethylene acrylate) monosulfide
023916338   2-butenyl acrylate
024493536   1,3-propanediol diacrylate
024615847   2-carboxyethyl acrylate
024910847   2,3-dichloropropyl acrylate
030697406   monoacryloyloxyethyl phthalate
037275471   trimethylolpropane diacrylate
044914036   2-methylbutyl acrylate
048145046   phenoxyethyl acrylate
051727505   2-hydroxypropyl acrylate
052591272   nonafluorohexyl acrylate
052607815   methylcarbamoyloxyethyl acrylate
063225536   2-acryloyloxyethyl butylcarbamate
066028306   N,N-bis(2-acryloyloxyethyl)formamide
067905082   C14H22O4
067905413   C,,-,H,fiO,
067952492   2-methylheptyl acrylate
068227996   C~~HI~FIINO~S
068298066   C I ~ H I ZiN04S FI

Cluster 4, 33 Compounds
000383073   C I 7 H,,FI7NO4S monoacrylate
000423825   C15H12F17NO4S monoacrylate
001492871   CI2HIIFPNO4S monoacrylate
001893523   Cl3Hl2FI3NOlS monoacrylate
003741773   2,4,6-tribromophenyl acrylate
005888335   isobornyl acrylate
007347195   2-(2,4,6-tribromophenoxy)ethyl acrylate
015419940   4-benzoyl-3-hydroxyphenyl acrylate
016432818   2-(4-benzoyl-3-hydroxyphenoxy) acrylate
017329792   C 1'1 H I 2F9NW04S
024447787   2,2-bis(4-acryloxyethoxyphenol)propane diacrylate
025268773   N-methyl perfluorooctane sulfonamidoethyl acrylate
030145518   acryloxypivalyl acryloxypivalate
048077958   2-[(perfluorodecyl)sulfonyl)]methylamino ethyl acrylate
049859703   C1,HI4Fl3NO4S acrylate
054449740   p-α,α-dimethylbenzylphenyl acrylate
058920313   C16H14F17NO4S acrylate
059071102   C14H12F15NO4S acrylate
065983315   CIJHZ0O3 acrylate
066008682   C17H12F21NO4S acrylate
066008693   C,$HI 2FI,N04S acrylate
066008706   C13H12F,3N04S acrylate
066671225   neopentyl glycol acrylate benzoate
066710972   C25H24Br4O6 acrylate
067584558   C10H10F9NO4S acrylate
067584569   C11H10F11NO4S acrylate
067584570   C,,H .- ,.-,FI .-?NO$ acrylate
068084628   C13H10F15NO4S acrylate
068227974   C15H14F15NO4S acrylate
068227985   C,4H14F,3N04S acrylate
068298602   C I 6 H l 6 F I ~ N 0 4S acrylate
072276052   C18H14F21NO4S acrylate
087320056   C1,H2,06 acrylate

Cluster 5, 18 Compounds
001680213   triethylene glycol diacrylate
003524683   pentaerythritol triacrylate
004986894   pentaerythritol tetraacrylate
005459381   glycerol triacrylate
015625895   trimethylolpropane triacrylate
017831719   tetraethylene glycol diacrylate
019778859   trimethylolethane triacrylate
052408421   tetramethylenebis(oxy(2-hydroxytrimethylene) diacrylate
053417291   pentaerythritol diacrylate
060506812   dipentaerythritol pentaacrylate
066028328   C19H27NO7
066028340   C17HnN07
067892993   C18H32O8
067893009   CI~H~IN@~
0679054580  2-hydroxy-1,6-hexanediyl diacrylate
071412356   C15H24O6
072928428   C15H24O8
085412540   C24H32O12
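The repeated-Kmeans experiment and the similarity index discussed in the CLUSTERING STUDIES section can be sketched as follows. This is a minimal illustration, not the programs used in the study. Two details are assumptions because the source does not spell them out: each cluster of the first partition is paired with the cluster of the second partition that shares the most members, and the weight of each cluster is its share of the compounds. The toy data and all function names are hypothetical.

```python
import itertools
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """Plain K-means: random data points as starting centers, then alternate
    assignment and centroid update until the memberships stop changing."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels

def similarity_index(labels_a, labels_b):
    """Weighted average of intersection/union over matched clusters."""
    n = len(labels_a)
    clusters_a, clusters_b = {}, {}
    for idx, (a, b) in enumerate(zip(labels_a, labels_b)):
        clusters_a.setdefault(a, set()).add(idx)
        clusters_b.setdefault(b, set()).add(idx)
    si = 0.0
    for members_a in clusters_a.values():
        # Assumed matching rule: the cluster in B with the largest overlap.
        match = max(clusters_b.values(), key=lambda mb: len(members_a & mb))
        weight = len(members_a) / n  # assumed weight: cluster population share
        si += weight * len(members_a & match) / len(members_a | match)
    return si

def mean_pairwise_similarity(partitions):
    pairs = list(itertools.combinations(partitions, 2))
    return sum(similarity_index(a, b) for a, b in pairs) / len(pairs)

# Synthetic data with three clumps; 11 trials per candidate cluster count.
rng = random.Random(1)
toy = [[rng.gauss(cx, 0.05), rng.gauss(cy, 0.05)]
       for cx, cy in [(0, 0), (1, 0), (0, 1)] for _ in range(20)]
for k in (2, 3, 4, 5):
    runs = [kmeans(toy, k, rng) for _ in range(11)]
    print(k, round(mean_pairwise_similarity(runs), 2))
```

Comparing the mean pairwise similarity across candidate values of `k` mirrors the comparison summarized in Table II. Note that this matching rule makes the index depend on which partition is taken first, so a symmetric variant may be preferable in practice.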
To better determine the proper number of clusters, and to computationally demanding than nonhierarchical methods, refine the results, the clustering program Isodata was used.34 such as Kmeans or Isodata. In addition, the resultant denThis is a more sophisticated clustering routine that employs drograms, which can be difficult to interpret or even misleading adjustable parameters to increase or decrease the number of with much smaller data sets, would be too complex for simple clusters within a trial on the basis of the cluster populations, interpretation. Finally, the objective in this study was to find the distance between them, the spread within them, and the a natural grouping of compounds, not the hierarchy of expected final number of clusters. Although the user does groupings that hierarchical methods provide. Two commonly enter the number of clusters that is expected, the final number used and well-established nonhierarchical partitioning methods, of clusters is determined by the chosen parameters and the Kmeans and Isodata, were thus used in this study. Since characteristics of the data set. In this way, with the target hierarchical methods were not used, they will not be discussed number of clusters, reasonable values for the distance between further. clusters, and the separation within them all determined by the To divide the compounds into reproducible clusters, the first preliminary Kmeans experiments, Isodata was expected to task is to determine the number of clusters that the data determine the appropriate number of clusters, and the most naturally comprise. The first step in this procedure involves stable populations. In general, using more than one clustering the clustering program K m e a n ~ . ~It’ is initialized with a set method on a data set will provide more confidence in the number of data points as starting cluster centers, and then it results. 
Although the algorithms of Kmeans and Isodata are assigns the compounds to those clusters on the basis of their both based on minimizing square error, and thus are not proximity to the respective centers. The cluster centers are completely independent, using both programs provides more updated to the centroid of these newly formed clusters, and confidence than using either one alone. The use of a clustering the points are reassigned, again on the basis of proximity. This program based on a completely different method would have procedure is repeated until there is no change in the cluster provided even more confidence in the results. Partitional populations. The number of clusters is input a t the beginning clustering programs other than Kmeans and Isodata were not of the program and cannot change. available a t the time of the study, however. The simplicity of this Kmeans routine can be exploited by The first experiments with Isodata used six as the target splitting the compounds into a fixed number of clusters renumber of clusters. The first 1 1 iterations were less consistent peatedly, with different initial cluster centers from one trial than anticipated, with a similarity coefficient of just 0.83, and to the next. This procedure can then be repeated for a different several partitions had five final clusters instead of six. Some number of clusters, so that the range of the expected number of the parameters were adjusted slightly in hopes of obtaining of clusters is covered. In general, the number of clusters that more consistent results with six clusters, but five final clusters is closest to the natural number of clusters will have more were obtained 9 times in 11 trials. This was strong evidence similar results from one trial to the next. For example, if the that the natural number of clusters was actually five. 
routine divides the compounds into more clusters than they would fall into naturally, it would be forcing similar compounds into different groups. If this is repeated with different starting points, the compounds that are artificially separated will likely be different from those in the previous iteration simply because artificial distinctions are being forced. If this is repeated many times for this number of clusters, the similarity of one clustering trial to the next will be relatively low. A similar argument holds for dividing the compounds into fewer clusters than the natural number. Earlier work with these acrylate compounds suggested that the natural number of clusters would be near 4, so 11 clustering trials were performed for 4, 5, 6, and 7 clusters. Work by Jain and Moreau with artificial data sets allows an estimate of the number of experiments with random starting points that are required to have a specified probability of obtaining the true partitioning.32 Eleven experiments were found to be necessary to have 99.5% probability of correctly partitioning the data into four clusters, and 95% probability of correctly partitioning the data into five clusters.

The results of comparing the 11 trials pairwise among themselves are shown in Table II. The similarity index presented in this table is a weighted average of a cluster by cluster comparison of the two partitions as shown here: $SI = \sum_i w_i \, (\mathrm{cluster}(i)_1 \cap \mathrm{cluster}(i)_2) / (\mathrm{cluster}(i)_1 \cup \mathrm{cluster}(i)_2)$, where the subscripts 1 and 2 denote the two partitions being compared. This index can range from near 0 for completely dissimilar clusters to near 1 for identical clusters. The form of this index relating the intersection of two corresponding clusters to their union was suggested by the form of the similarity index used in atom pairs work.33 The results in Table II suggest, and statistical tests confirm, that the differences are significant, that five cluster partitions are more similar to each other than either four or seven, and six cluster partitions are also more similar to each other than four or seven. Although these experiments were not able to definitely provide us with the natural number of clusters, we now had evidence that it was either five or six clusters.

In experiments with five as the target number of clusters the results were much more consistent: four clusters were obtained once and six clusters once, but 9 times of 11 there were five final clusters, with an average similarity of 0.87. The average similarity of 0.87 was determined to be statistically significantly different from the mean of 0.83 obtained earlier on the basis of the Mann-Whitney Wilcoxon rank-sum test.35 This was not only strong evidence that five was the proper number of clusters, but also that these partitions were stable. Of the 11 partitions the one that had the highest average similarity to the other 10 was chosen as the representative clustering. The compounds comprising each of the five clusters are listed in Table III. Thus, the use of the Kmeans and Isodata clustering routines provided us with the memberships of the five clusters.

These clusters of acrylate compounds can be characterized in general terms based on the distribution of some of their descriptor values. Cluster one is comprised of only eight highly charged compounds. The minimum total σ charge in this group is larger than that for all except for 6 of the remaining 135 compounds. The average total charge within the group is 6.55, which is more than 3 standard deviations higher than the mean total σ charge value for the entire population (2.24 ± 1.4). Five of the compounds are monomeric and perfluorinated, and the remaining three are dimers. All are high molecular weight compounds.

Cluster 2 is made up of 13 hydrophobic compounds. All the log P values within this group are greater than 5.0, with seven compounds having the maximum value of 7.0. The average log P value for the entire population is 3.0. There are nine monomers and four dimers in this group.

The large cluster of 71 compounds comprising cluster 3 contains the simple acrylates; 57 are monomeric, 13 dimeric, and 1 trimeric. The few nonmonomers in this group have lower path and molecular connectivity values than average. All of the compounds have rather low total σ charge values; 65 of
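The pairwise comparison of clustering trials and the choice of a representative partition can be illustrated with a minimal Python sketch. It is not the authors' program. Two details are assumptions made here because the text does not specify them: clusters in the two partitions are paired by the permutation that maximizes the index, and each weight is the mean fraction of compounds in the paired clusters. The function names `jaccard`, `similarity_index`, and `representative` are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_index(p, q):
    """Weighted average of cluster-by-cluster overlap between two partitions.

    p and q are lists of sets over the same items, with equal cluster counts.
    Assumed (not stated in the text): clusters are paired by the permutation
    that maximizes the index, and w_i = (|p_i| + |q_j|) / (2 * N).
    """
    if len(p) != len(q):
        raise ValueError("partitions must have the same number of clusters")
    n = sum(len(c) for c in p)
    best = 0.0
    # Brute force over pairings; acceptable for the 4 to 7 clusters used here.
    for perm in permutations(range(len(q))):
        score = sum(
            (len(c) + len(q[j])) / (2 * n) * jaccard(c, q[j])
            for c, j in zip(p, perm)
        )
        best = max(best, score)
    return best


def representative(partitions):
    """Index of the partition with the highest mean similarity to the others."""
    def mean_similarity(i):
        others = [q for j, q in enumerate(partitions) if j != i]
        return sum(similarity_index(partitions[i], q) for q in others) / len(others)

    return max(range(len(partitions)), key=mean_similarity)
```

With 11 trials, `representative` would be called on the list of 11 partitions, and the mean of all pairwise `similarity_index` values would correspond to the averages reported for each target number of clusters. Other weighting or pairing schemes would give somewhat different numbers.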
Table IV. Descriptor Values for All Acrylates
no. CAS no. 1 I I 95396 2 1 96333 I 1031 17 3 4 1 106638 I0674 I 1 5 6 10690 I 1 7 140885 1 141322 8 1 307982 9 1 1 356865 IO 383073 II I 1 407476 12 423825 1 13 424646 14 1 689 123 15 1 1 81861 I 16 17 1 925600 93741 7 18 1 19 999553 1 20 I070708 2 1 1492871 21 1663394 22 1 1680213 2 23 1893523 24 1 1 21 56969 25 1 21 56970 26 2160896 27 1 2 2223827 28 2274115 29 2 2399486 30 1 2426542 31 1 1 2439352 32 2478 106 33 1 2495354 34 I 2499583 35 1 1 2499958 36 37 1 2664553 2998085 38 I 2998234 1 39 30667 I5 40 1 1 3076048 41 3121617 42 1 3326907 43 1 44 3 3524683 3530367 45 1 3741773 1 46 3953104 47 1 4074888 48 2 1 48 13574 49 4 4986894 50 1 5390545 51 545938 1 52 3 5888335 1 53 54 725 I903 1 1 57 7328178 1 56 7347195 2 57 13048334 58 13048345 2 59 I3282821 1 60 1 13402023 1 61 13533056 1 62 13533181 63 I 54 19940 1 64 3 15625895 65 I64328 18 1 1 66 I6868 136 67 1 I69691 01 68 1 17329792 69 17527296 1 70 17527310 1 17741605 71 1 I783 I719 72 2 17977092 73 1 74 18526073 1 75 I8621766 2 1893392 I 76 1 77 I948503 I 2
2 3.874 0.727 3.251 2.228 1.569 1.790 0.956 1.725 1.232 0.45 1 2.238 0.500 1.477 0.327 1.690 1.095 1.372 1.938 1.123 2.450 2.524 2.738 2.968 1.724 3.846 4.554 0.484 3.788 1.743 2.469 2.398 2.175 1.802 2.297 2.786 2.432 3.493 1.860 2.079 2.901 4.907 1.364 1.979 4.094 2.607 4.982 2.5 17 2.356 6.675 4.747 2.141 2.93 1 5.740 2.338 2.181 5.576 3.157 4.572 2.916 5.968 1.708 6.309 3.999 4.289 4.594 2.547 3.112 0.986 0.266 0.761 -0.228 3.580 2.525 2.528 2.140 3.081 2.6 I 2
3 0.517 0.048 0.252 0.456 0.048 0.166 0.048 0.048 4.396 0.402 I .649 0.227 1.649 0.568 0.284 0.048 1x048 0.1 16 0.048 0.096 1.035 1.160 0.096 1.317 0.048 0.048 0.408 1.303 0.096 0.166 0.206 0.364 0.048 0.166 0.048 0.048 0.048 0.215 0.048 0. I66 0.048 0.048 0.177 0.85 0. I66 0.992 0.252 0.096 0.048 0.900 0.209 0.262 1.730 0.048 0.048 0.992 0.096 0.096 0.177 0.048 0.048 0.048 0.387 0.851 0.387 0.166 0.245 4.899 1.066 0.402 1.729 0.096 0.588 0.364 0.096 0.623 0.263
4 2.083 3.000 7.101 5.878 7.000 3.556 3.840 5.878 84.000 4.152 6.596 7.000 5.857 3.769 5.000 5.000 5.000 3.265 5.000 9.373 5.538 5.878 13.290 5.327 1 1.930 13.940 4.889 7.875 7.438 3.265 6.250 7.000 7.000 4.000 9.000 7.901 1 1.000 4.500 7.000 3.265 15.000 5.878 5.531 8.889 4.688 3.273 5.289 10.290 19.950 I 1.340 5.325 9.600 1.473 9.000 9.917 5.018 1 1.320 15.260 9.373 17.950 7.901 19.950 4.496 8.889 6.094 2.65 1 6.370 77.000 4.543 4.889 5.600 16.230 3.960 7.901 9.373 6.400 8.082
5 83 21 38 28 30 34 22 28 406 40 I18 31 113 48 27 25 25 51 25 42 79 28 54 95 45 51 45 45 37 46 37 31 30 52 36 33 42 28 30 51 54 28 31 64 55 69 34 45 69 75 37 54 115 36 39 76 48 60 42 63 33 69 133 64 142 45 63 351 78 42 1 I4 63 43 34 42 34 43
6 231 21 91 45 55 60 28 45 6 91 820 55 74 I 136 36 36 36 106 36 105 378 45 171 528 120 153 105 120 78 100 78 55 55 123 78 66 105 45 55 106 171 45 55 23 1 141 178 66 120 276 325 78 171 333 78 91 250 136 210 I05 23 1 66 276 488 23 1 626 85 20 1 2 351 105 74 1 23 1 105
66 I05 66 105
7 1.043 0.742 1.293 0.994 1.191 1.151 0.850 0.995 3.555 2.693 2.247 1.875 2.247 3.5 16 1.010 1.256 0.920 0.988 0.808 1.607 2.247 1.171 2.188 2.247 1.444 1.594 3.061 1.755 1.505 1.255 1.244 1.107 1.356 0.912 1.219 1.144 1.369 1.08 1 1.070 1.143 1.669 1.083 1.498 2.788 0.977 2.247 1.143 1.846 2.044 3.036 1.070 2.315 1.426 1.336 1.532 2.247 1.756 2.056 1.899 1.894 1.598 1.954 1.987 2.465 2.328 1.068 1.892 4.054 5.992 2.696 9.286 2.529 1.925 1.172 1.489 1.232 1.698
8 2.6 1 0.75 4.32 2.20 1.09 -0.13 1.28 2.33 0.00 2.38 6.05 2.04 4.99 2.62 1.58 -0.06 1.80 2.06 1.26 1.86 3.47 1.98 1.38 3.98 5.51 6.57 3.10 2.13 1.25 0.99 1.55 0.65 0.35 2.52 3.92 3.39 4.98 2.1 1 2.86 2.78 7.00 0.56 0.3 1 -0.34 2.84 4.33 3.26 1.32 7.00 1.18 1.77 1.54 4.09 2.15 1.16 4.77 2.92 5.03 1.33 7.00 0.01 7.00 3.87 2.18 4.3 1 2.22 1.67 0.00 2.95 2.01 3.89 1.45 0.67 0.3 I 1.31 3.04 1.64
Table IV (Continued) no.
CAS no.
78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 I04 105 I06 107 108 109
19660163 1972 1370 19778859 2 1643425 23916338 24447787 24493536 246 15847 249 10847 25268773 27905459 3014551 8 30697406 34395249 37275471 449 I4036 48076386 48077958 48 145046 49859703 5 1727505 52408421 52591272 5260781 5 52723963 53417291 54449740 56361558 58920313 5907 1 I02 605068 12 63225536 64448686 65983315 66008682 66008693 66008706 66028306 66028328 66028340 6667 1225 66710972 67584558 67584569 67584570 67892993 67893009 67905082 6790541 3 67905480 67952492 67952505 68084628 68227974 68227985 68227996 68298066 68298602 70 146053 70495395 71412356 72276052 72928428 84732285 85412540 87320056
I IO Ill 112 I13 1 I4 1 I5 1 I6 1I 7 1 I8 1 I9 120 121 I22 123 124 125 126 127 128 I29 130 131 132 133 134 135 I36 I37 138 139 140 141 I42 143
1
2
1 2 3 1 1 2 2 1 1 I 1 2 1 1 2 1 1 1 1 1 1 2 1 1 2 2 I 2 1 1 5 1 2 1 1 1 1 2 3 3 1 2 1 1
3.553 2.356 4.246 5.261 1.403 7.114 2.091 1.515 2.597 1.323 0.019 6.248 3.358 2.061 3.635 2.348 7.382 1.980 2.533 2.227 3.541 4.833 0.734 1.758 6.380 3.440 5.141 8.339 2.030 1.601 7.710 2.788 7.063 5.581 1.405 1.653 1.900 2.885 5.306 4.59 4.9 18 I 1.060 1.817 1.694 1.570 5.531 4.949 3.97 I 4.555 3.355 3.435 8.199 1.446 2.153 2.277 2.401 1.848 2.362 9.1 1 1 6.917 4.585 1.733 4.657 6.448 7.200 6.420
I 1 3 2 2 2 1 2 1 1 1 1 1 1 2 2 2 1 2
1 3 2
3 0.6 I4 0.096 1.071 0.048 0.048 1.149 0.096 0.1 I3 0.375 1.698 1.398 2.302 0.280 6.201 0.803 0.337 0.048 I .803 0.116 1.471 0.340 0.354 4.250 0.090 0.8 13 0.803 1.033 1.149 1.698 1.483 1.655 0.090 0.726 0.694 2.135 1.803 1.471 0.225 0.321 0.321 1.380 2.247 1.035 1.200 1.366 0.166 0.356 0.359 0.504 0.225 0.337 1.482 1.532 1.532 1.366 1.200 1.151 1.483 1.753 0.75 1 0.596 2.135 0.521 0.579 1.028 1.902
the 71 values are below the average for the entire population, and 10 of these values are below 1.0. Cluster 4 contains 33 compounds, including 24 halogenated compounds (21 fluorinated and 3 brominated). The log P values tend to be slightly higher than average. Cluster 5 contains 18 compounds that are mostly oligomeric: 1 monomer, 7 dimers, 8 trimers, 1 tetramer, and 1 pentamer. The relatively wide range of log P values includes the lowest
4
5
6
7
8
5.531 10.290 9.562 15.940 5.878 9.408 8.333 7.000 5.531 5.627 5.035 9.694 5.878 132.00 6.479 5.531 2 1.960 6.980 5.48 I 6.612 7.259 16.260 60.000 7.438 12.720 6.479 4.250 13.270 6.359 5.579 16.000 10.290 12.500 2.525 7.090 6.584 6.178 9.679 15.580 13.810 6.817 9.057 4.646 4.855 5.087 15.750 1 1.560 7.934 6.667 10.850 8.333 9.877 5.354 6.1 12 5.878 5.689 5.087 6.343 9.671 10.210 12.250 7.460 14.080 8.945 18.840 6.428
31 45 61 57 28 238 39 31 31
55 120 210 190 45 1217 91 55 55 703 528 253 300
2.247 1.702 2.394 1.744 0.889 2.857 1.538 1.645 1.168 2.247 7.639 2.704 2.454 4.359 2.217 1.068 2.194 2.247 1.329 2.247 2.213 3.416 2.479 1.897 4.361 2.540 I .372 3.540 2.247 2.247 4.661 2.1 I3 3.830 1.535 2.247 2.247 2.247 2.007 3.072 2.923 1.861 2.247 2.247 2.247 2.247 2.962 3.515 1.988 1.821 2.242 1.293 3.168 2.247 2.247 2.247 2.247 2.247 2.247 3.420 3.057 2.654 2.247 3.469 3.781 4.660 2.745
2.14 1.81 1.65 7.00 1.79 5.87 1.33 0.33 1.86 3.92 3.42 2.90 1.70
111
9 65 81 990 51 31 75 1 I7 58 99 69 71 210 37 144 51 135 282 115 104 111 44 255 186 132 1 I4 96 52 81 75 72 274 75 84 93 0 139 54 88 50 0 0 0 0 0 0 0 0 0 0 0 135 68 0 0 1 IO
IO 153 55 325 780 160 56 1 246 300 4 78 804 153 475 1739 780 630 703 120 1426 732 990 741 528 153 378 325 270 1589 325 406 496 445 645 171 282 153 91 1381 595 666 56 1 465 435 703 1573 1260 23 1 1035 276 1001 666 448
0.00 0.67 2.73 7.00 2.82 2.50 2.35 2.59 -0.16 0.00 0.59 3.40 -1.85 5.08 6.00 4.41 4.76 -0.84 2.17 3.42 2.68 3.66 3.19 2.72 0.91 3.06 2.00 3.49 7.00 3.52 3.76 4.00 1.28 0.61 3.62 3.44 1.36 4.32 6.49 4.23 4.18 3.94 3.71 4.29 5.82 7.00 5.69 2.31 3.29 0.04 5.40 2.37 2.04
log P in the entire data set. Four of the total six hydrophilic compounds are in this group.

CONCLUSIONS
In conclusion, five stable clusters of acrylates were found by using objective clustering methods based on calculated physical and chemical properties. The partitions that were
derived from the features we chose were not obvious divisions to make before we began this study, but at the same time they could make sense in terms of toxicology. At this stage we cannot make conclusions about a relationship between these groups and biological activity, but the objective sampling scheme itself has succeeded in its initial goals. Once samples selected on the basis of these clusters have passed through bioassays, it may be possible to present a SAR study to predict the toxicity of other compounds.

ACKNOWLEDGMENT

The financial support of the Chemical Manufacturers Association made this work possible. Log P values were calculated at Rohm and Haas by Dr. Clay Frederick and Mary Bright.

REFERENCES AND NOTES

(1) Hashimoto, K.; Aldridge, W. N. Biochemical Studies on Acrylamide, A Neurotoxic Agent. Biochem. Pharmacol. 1970, 19, 2591-2604.
(2) Pozzani, U. L.; Weil, C. S.; Carpenter, C. P. Subacute Vapor Toxicity and Range-Finding Data for Ethyl Acrylate. J. Ind. Hyg. Toxicol. 1949, 31, 311-316.
(3) Treon, J. F.; Sigmon, H.; Kitzmiller, K. U. The Toxicity of Methyl and Ethyl Acrylate. J. Ind. Hyg. Toxicol. 1949, 31, 317-326.
(4) Borzelleca, J. F., et al. Studies on the Chronic Oral Toxicity of Monomeric Ethyl Acrylate and Methylmethacrylate. Toxicol. Appl. Pharmacol. 1964, 6, 29.
(5) Hodes, L. Clustering a Large Number of Compounds. 1. Establishing the Method on an Initial Sample. J. Chem. Inf. Comput. Sci. 1989, 29, 66-71.
(6) Willett, P.; Winterman, V.; Bawden, D. Implementation of Nonhierarchic Cluster Analysis Methods in Chemical Information Systems: Selection of Compounds for Biological Testing and Clustering of Substructures Search Output. J. Chem. Inf. Comput. Sci. 1986, 26, 109-118.
(7) Stuper, A. J.; Brugger, W. E.; Jurs, P. C. Computer Assisted Studies of Chemical Structure and Biological Function; Wiley-Interscience: New York, 1979.
(8) Jurs, P. C. Pattern Recognition Used to Investigate Multivariate Data in Analytical Chemistry. Science 1986, 232, 1219-1224.
(9) Varmuza, K. Pattern Recognition in Chemistry; Springer-Verlag: Berlin, 1980.
(10) Jurs, P. C. Computer Assisted Studies of Structure-Activity Relations Using Pattern Recognition. Drug Inf. J. 1983, 17, 219-229.
(11) Jurs, P. C.; Stouch, T. R.; et al. Computer-Assisted Studies of Molecular Structure-Biological Activity Relationships. J. Chem. Inf. Comput. Sci. 1985, 25, 296-308.
(12) Rose, S. L.; Jurs, P. C. Computer-Assisted Studies of Structure-Activity Relationships of N-Nitroso Compounds Using Pattern Recognition. J. Med. Chem. 1982, 25, 769-776.
(13) Stouch, T. R.; Jurs, P. C. Computer-Assisted Studies of Molecular Structure and Genotoxic Activity by Pattern Recognition Techniques. EHP, Environ. Health Perspect. 1985, 61, 329-343.
(14) Del Re, G. A Simple MO-LCAO Method for the Calculation of Charge Distributions in Saturated Organic Molecules. J. Chem. Soc. 1958, 4031-4040.
(15) Del Re, G.; Pullman, B.; Yonezawa, T. Electronic Structure of the Alpha-Amino Acids of Proteins. Biochim. Biophys. Acta 1963, 153-182.
(16) Autian, J. Structure-Toxicity Relationships of Acrylic Monomers. EHP, Environ. Health Perspect. 1975, 11, 141-152.
(17) Lawrence, W. H.; Bass, G. E.; et al. Use of Mathematical Models in the Study of Structure-Toxicity Relationships of Dental Compounds: I. Esters of Acrylic and Methacrylic Acids. J. Dent. Res. 1972, 51, 526-535.
(18) Tanii, H.; Hashimoto, K. Structure-Toxicity Relationships of Acrylates and Methacrylates. Toxicol. Lett. 1982, 11, 125-129.
(19) Fujisawa, S.; Masuhara, E. Determination of Partition Coefficients of Acrylates, Methacrylates, and Vinyl Monomers Using High Performance Liquid Chromatography (HPLC). J. Biomed. Mater. Res. 1981, 15, 787-793.
(20) Pomona College Med-Chem package ClogP 3.5.
(21) Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis; Wiley: New York, 1986.
(22) Watanabe, S. Karhunen-Loeve Expansion and Factor Analysis: Theoretical Remarks and Applications. Proceedings of the 4th Conference on Information Theory; Publishing House of the Czechoslovak Academy of Sciences: Prague, 1965.
(23) Kier, L. B. A Shape Index from Molecular Graphs. Quant. Struct.-Act. Relat. Pharmacol., Chem. Biol. 1985, 4, 109.
(24) Randic, M. On Characterization of Molecular Branching. J. Am. Chem. Soc. 1975, 97, 6609-6615.
(25) Hansen, P. J.; Jurs, P. C. Prediction of Olefin Boiling Points from Molecular Structure. Anal. Chem. 1987, 59, 2322-2327.
(26) Rohrbaugh, R. H.; Jurs, P. C. Prediction of Retention Indexes for Diverse Drug Compounds. Anal. Chem. 1988, 60, 2249-2253.
(27) Randic, M.; Jurs, P. C. On a Fragment Approach to Structure-Activity Correlations. Quant. Struct.-Act. Relat. 1989, 8, 39-48.
(28) Hopkins, B. A New Method of Determining the Type of Distribution of Plant Individuals. Ann. Bot. 1954, 18, 213-226.
(29) Jain, A. K.; Dubes, R. C. Algorithms for Clustering Data; Prentice Hall: Englewood Cliffs, NJ, 1988.
(30) Lawson, R. G.; Jurs, P. C. A New Index for Clustering Tendency and Its Application to Chemical Problems. J. Chem. Inf. Comput. Sci. 1990, 30, 36-41.
(31) MacQueen, J. Some Methods for Classification and Analysis of Multivariate Data. Proceedings of the 5th Berkeley Symposium on Probability and Statistics; University of California Press: Berkeley, 1967.
(32) Jain, A. K.; Moreau, J. V. Bootstrap Techniques in Cluster Analysis. Pattern Recognit. 1987, 20, 541-568.
(33) Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications. J. Chem. Inf. Comput. Sci. 1985, 25, 64-73.
(34) Ball, G. H.; Hall, D. J. Isodata, an Iterative Method of Multivariate Analysis and Pattern Classification. Proceedings of the AFIPS Fall Joint Computer Conference; Spartan Books: Washington, DC, 1965; Vol. 2, pp 329-330.
(35) Ott, L. An Introduction to Statistical Methods and Data Analysis; Duxbury Press: Boston, 1984.