Screening Alternative Degreasing Solvents Using Multivariate Analysis

Environ. Sci. Technol. 2000, 34, 2587-2595

C. TREVIZO,† D. DANIEL,‡ AND N. NIRMALAKHANDAN*,†
Civil, Agricultural, and Geological Engineering Department and University Statistics Center, New Mexico State University, Las Cruces, New Mexico 88003

Multivariate analysis was used to explore physicochemical properties of organic chemicals that would characterize and identify degreasing solvents. The exploratory techniques used in this study include cluster analysis, discriminant function analysis, and canonical discriminant analysis. Out of a compilation of 16 physicochemical properties evaluated, aqueous solubility, Henry’s constant, and surface tension were identified as relevant properties that could effectively screen degreasing solvents from among 30 chemicals of similar chemical classes. The suitability of these three properties and of the multivariate techniques for classifying degreasing solvents was demonstrated on an external testing set of 10 solvent- and nonsolvent-type chemicals. On the basis of the results of these studies, canonical discriminant analysis is recommended as a potential tool for screening purposes. The cluster analysis procedure was informative for exploratory purposes; the discriminant function analysis procedure was not efficient in separating solvents from others.

* Corresponding author fax: (505)646-6049; e-mail: nkhandan@nmsu.edu.
† Civil, Agricultural, and Geological Engineering Department.
‡ University Statistics Center.
10.1021/es9912832 CCC: $19.00 © 2000 American Chemical Society. Published on Web 05/16/2000.

Introduction

Solvents are a class of chemicals that can dissolve specific components or break down certain chemicals in a complex mixture into more elementary forms. Because of this property, solvents have been used widely in applications ranging from cleaning, degreasing, coating, painting, and extracting to chemical processing, manufacturing, and equipment maintenance (1, 2). In addition to their direct use in industry, numerous commercial formulations and products containing solvents are used daily in the domestic, commercial, institutional, and military sectors. Common specific uses of solvents include mobilization of solids; preparation of reactants; application of particles onto a surface for coating; extraction of oil, flavors, and fragrances; thinning of paints, oils, and ink; adhesion of plastics; cleaning of printed circuit boards and machine parts; dry cleaning of garments; and decaffeination of coffee, among others (3).

Over 30 different synthetic organic chemicals have been used as degreasing solvents. It is estimated that the annual use of the five most commonly used solvents [viz., trichloroethylene (TCE), tetrachloroethylene (PCE), methylene chloride, 1,1,1-trichloroethane (TCA), and trichlorotrifluoroethane (CFC 113)] in the United States is around 800 000 t (4). Such large usage as well as improper storage and disposal of spent solvents over the past decades have resulted in their release into the environment, contaminating soils, groundwater, and the atmosphere. Because of their toxic, persistent, and recalcitrant nature, environmental contamination by degreasing solvents has emerged as one of the serious problems in the industrialized world. Recent studies have confirmed that many of the current solvents are hazardous to humans and harmful to the environment, causing (or suspected to cause) cancer, smog formation, ozone depletion, etc. As such, many of the common solvents are now targets of public concern and regulatory control. The Environmental Protection Agency (EPA) has included over 20 solvents in its list of 127 priority pollutants. The Clean Air Act Amendments of 1990 have listed several solvents as hazardous air pollutants (HAPs). The emissions of the most common solvents (viz., methylene chloride, PCE, TCE, TCA, carbon tetrachloride, and chloroform) are now regulated by 40 CFR, Parts 9 and 63, under the Toxic Release Inventory (TRI) program, whereby industries are now required to report to the EPA on their production and transfers.

In an effort to minimize the release and environmental impacts of solvents, industries are being forced to adopt process modifications, recycling, and reuse of solvents on one hand and to develop environment-friendly substitute solvents on the other (2). In seeking substitutes or designing new ones, it is important to identify or develop solvents that have the desired degreasing characteristics and, at the same time, are nontoxic, are readily biodegradable, and pose minimal threat to the environment. Evaluating the physical and chemical properties of solvents in current use is the first step in characterizing the desired features of a good solvent and in effectively developing a “greener” and efficient substitute solvent. Selection of substitute solvents is not a straightforward task because no single physicochemical property relates to solvent characteristics.
The search for alternate solvents has been “characterized as Edisonian” because of the trial-and-error nature of the experimental evaluation of numerous potential alternatives (5). While acknowledging this process to be a significant technical challenge, Zhao and Cabezas (2) have identified the following three steps in developing substitute solvents: step 1, determining the substitute candidates or the replacement formulations; step 2, conducting performance and evaluation tests; and step 3, conducting the full-scale test. The first step has been recognized as the most important and most difficult one. Efforts of previous workers in fulfilling the first step have been classified into three categories by Zhao and Cabezas (2): (i) screening of available solvent databases for single chemical substitutes; (ii) using computer-based molecular design tools to develop new chemicals with the desired properties; and (iii) designing mixtures of available chemicals to achieve desired properties. Several special-purpose computer software tools have been developed and are being applied for this purpose (2, 6). The first approach, screening databases, is the simplest and can also enhance the effectiveness of the other two methods.

Irrespective of the approach, identification of desired properties for a given application is a prerequisite in seeking substitute chemicals. One of the objectives of this study was to identify physicochemical properties of good solvents. The second objective of this study was to develop a screening process based on statistical multivariate analysis of physicochemical properties of good solvents. The screening
of substitute solvents remains a subjective process, depending on the application and the experience of end-users. Two of the commonly employed methods are the weighted-sum evaluation method and the pass/fail screening method. In the first method, quantifiable screening criteria weighted by appropriate weighting factors are summed up and compared for the alternatives. The criteria used are indirect measures of the overall effectiveness of the solvent. Some examples of criteria are reductions in raw material input, waste quantity, operational hazards, costs, etc. (4). The second method involves a step-by-step evaluation of the alternatives against yes/no or pass/fail criteria. Those that satisfy all the criteria are then selected for further testing. Examples of criteria might be as follows: is the flash point less than or greater than 140 °C, is the dielectric strength less than or greater than 20 kV, etc. (7). Proposed solvents that pass the necessary criteria are then evaluated further under field conditions.

An expert system software named SAGE is now available to aid in the screening process (http://clean.RTI.org/sol_alt). Users can run SAGE online over the Internet or download it to run on desktop computers to identify possible alternate solvents. This software first prompts the user to specify the material, nature, and shape of the part or surface to be cleaned; the contaminants to be cleaned; the degree of cleaning expected; the process configuration; etc. It then recommends a list of possible alternate solvents and processes that best satisfy the input data.

The ultimate aim of this study was to develop and validate an alternate screening process to aid the substitute solvent search process. A statistical exploratory approach involving multivariate analysis procedures is adopted in this study. The following procedures are used: cluster analysis, discriminant function analysis, and canonical discriminant analysis.
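The two conventional screening approaches described above (weighted-sum evaluation and pass/fail screening) can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the candidate names, criterion scores, weights, and property values are placeholders, and the pass directions (flash point above 140 °C, dielectric strength above 20 kV) are assumptions, since the text does not state which side of each threshold is acceptable.

```python
# Hypothetical candidates: scores, weights, and property values are
# illustrative placeholders, not data from this study.
candidates = {
    "solvent_A": {"scores": {"waste": 7, "hazard": 5, "cost": 6},
                  "flash_point_C": 150.0, "dielectric_kV": 25.0},
    "solvent_B": {"scores": {"waste": 9, "hazard": 8, "cost": 3},
                  "flash_point_C": 120.0, "dielectric_kV": 30.0},
}
weights = {"waste": 0.5, "hazard": 0.3, "cost": 0.2}  # assumed weighting factors


def weighted_sum(scores, weights):
    """Weighted-sum evaluation: sum of criterion scores times their weights."""
    return sum(weights[name] * scores[name] for name in weights)


# Pass/fail criteria; the direction of each inequality is an assumption.
criteria = {
    "flash_point_C": lambda value: value > 140.0,
    "dielectric_kV": lambda value: value > 20.0,
}


def passes_all(properties, criteria):
    """Pass/fail screening: a candidate must satisfy every criterion."""
    return all(check(properties[key]) for key, check in criteria.items())


# Method 1: rank all candidates by weighted sum (highest first).
ranking = sorted(candidates,
                 key=lambda n: weighted_sum(candidates[n]["scores"], weights),
                 reverse=True)

# Method 2: keep only the candidates that pass every criterion.
shortlist = [n for n in candidates if passes_all(candidates[n], criteria)]

print("weighted-sum ranking:", ranking)
print("pass/fail shortlist:", shortlist)
```

The sketch also shows why the two methods can disagree: a candidate may score highest on the weighted sum and still be eliminated by a single failed criterion.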

Materials and Methods

A training data set of 45 common solvent and nonsolvent chemicals was initially compiled as the starting point for this study. The following physicochemical properties for these chemicals were compiled from handbooks (e.g., refs 8-11) and literature (e.g., refs 12 and 13): boiling point (BoilPt), melting point (MeltPt), molecular weight (MW), octanol/water partition coefficient [log(P)], water solubility [log(S)], vapor pressure (VP), Henry’s law constant [log(HC)], surface tension (ST), solubility parameter (SolP), autoignition temperature (AT), excess molar refraction (R), solute dipolarity (π), effective hydrogen-bond acidity (β), effective hydrogen-bond basicity (R), and the characteristic volume of McGowan (MolarV). The significance of each of these parameters has been discussed elsewhere (e.g., refs 2, 10, and 13). In addition, calculated values of zero-order and first-order simple and valence molecular connectivity indexes (0χ, 0χν, and 1χν) were also adopted as additional properties (14).

Of the initial 45 chemicals identified, only 30 could be evaluated in this study as a training set because not all of the 18 physicochemical properties were available for the others. The solubility parameter and autoignition temperature could not be found for several of the remaining 30 chemicals in the training set, so these parameters were discarded as variables in the analyses. Each of the remaining 30 chemicals was then identified as a “good” solvent or a nonsolvent based upon recommendations in solvent handbooks and usage in industry. The final training set thus consisted of a total of 30 chemicals, classified into 22 good solvents and 8 nonsolvents, each having 16 physicochemical properties.

Additionally, a testing data set of 10 chemicals consisting of solvent and nonsolvent types was assembled to test any screening processes developed from the multivariate analysis
procedures evaluated in this study. Because of the difficulty in identifying solvents and nonsolvents for which all the physicochemical properties examined in this study were readily available, preliminary cluster analyses were performed prior to forming the testing set. These cluster analyses identified water solubility, Henry’s law constant, surface tension, and the zero-order valence molecular connectivity index as physicochemical properties that were likely to be useful in the evaluation process. Thus, the 10 chemicals in the testing data set were selected based upon the availability of these four physicochemical properties, while also striving to obtain testing chemicals that varied greatly in their ability to act as a solvent (e.g., propane is an extreme that must be classified as a nonsolvent by any reasonable method). Table 1 lists the 30 training set chemicals along with the 10 testing set chemicals.

To evaluate the screening method developed from cluster analysis, the 10 testing chemicals were added to the training set of 30 chemicals, and the cluster analysis process was repeated, noting the placement of the test chemicals in the dendrogram relative to the good solvents and the nonsolvents of the training set. Evaluation of the method based on discriminant analysis was straightforward: it simply gave a predicted classification of each testing set chemical accompanied by a (posterior) probability associated with the classification. The canonical discriminant analysis technique gave a distance measure for each of the testing chemicals, which was then compared to the distance measures of the good solvents and nonsolvents from the training set.

Cluster analysis, its accompanying graphs, and discriminant function analysis were conducted in JMP (SAS Institute Inc.). All other graphs and canonical discriminant analysis were run in SAS (SAS Institute Inc.). All computations and graphs in both JMP and SAS were carried out on a 233-MHz Apple Macintosh G3-based computer.
TABLE 1. Training and Test Chemicals and Their Identification Code, Solvent Classification (Good Solvent, Nonsolvent, or Test Chemical), log(S), log(HC), ST, and 0χν Values

data set  no.  chemical                  class  log(S)  log(HC)  ST     0χν
training  1    butanol                   good   4.87    -5.06    25.67  3.56
training  2    ethanol                   good   5.77    -5.20    22.39  2.15
training  3    2-propanol                good   6.00    -4.91    22.40  3.02
training  4    methanol                  good   6.06    -3.87    22.50  1.44
training  5    benzene                   good   3.25    -2.27    28.88  3.46
training  6    toluene                   good   2.73    -2.23    28.52  4.38
training  7    1,2-xylene                good   2.24    -2.29    30.31  5.30
training  8    carbon tetrachloride      good   2.91    -1.52    27.65  5.03
training  9    chlorobenzene             good   2.59    -2.34    32.93  4.68
training  10   1,1,1-trichloroethane     good   3.18    -2.10    25.14  4.90
training  11   methylene chloride        good   4.29    -2.61    28.77  2.97
training  12   perchloroethylene         good   2.17    -1.57    31.65  5.53
training  13   trichloroethylene         good   3.04    -1.99    32.00  4.47
training  14   ethyl acetate             good   4.81    -3.92    24.00  4.02
training  15   acetone                   good   6.00    -4.18    23.04  2.90
training  16   methyl ethyl ketone       good   5.38    -4.98    23.96  3.61
training  17   methyl isobutyl ketone    good   4.31    -4.03    24.74  5.19
training  18   chloroform                good   3.90    -2.36    26.67  3.97
training  19   1,1-dichloroethane        good   3.70    -2.23    24.66  3.84
training  20   1,2-dichloroethane        non    3.93    -3.01    32.57  3.68
training  21   1,1,2-trichloroethane     non    3.65    -2.92    35.37  4.68
training  22   cyclohexane               non    1.74    -0.71    26.43  4.24
training  23   acetic acid               non    6.78    -7.00    27.59  2.35
training  24   n-butyl acetate           good   3.83    -3.50    25.41  5.43
training  25   cyclohexanone             non    4.36    -4.92    35.19  4.52
training  26   diethylamine              good   5.89    -4.59    22.39  3.91
training  27   triethylamine             non    4.74    -3.86    20.72  5.56
training  28   propanol                  good   5.34    -5.16    23.71  2.86
training  29   methyl chloride           non    3.77    -2.08    15.19  2.13
training  30   propane                   non    1.79    -0.16    7.02   2.70
testing   31   pentachloroethane         test   2.70    -2.71    34.37  6.74
testing   32   n-pentane                 test   1.58    0.10     15.47  4.12
testing   33   n-octanol                 test   2.73    -4.60    28.20  4.98
testing   34   n-butylcyclohexane        test   -0.75   0.13     26.51  6.53
testing   35   3-ethylhexane             test   -0.15   0.63     21.08  6.40
testing   36   ethylbenzene              test   2.22    -2.17    28.59  2.78
testing   37   trichlorobenzene          test   1.54    -2.53    44.66  6.84
testing   38   propionic acid            test   6.00    -6.03    26.20  3.36
testing   39   valeric acid              test   4.38    -5.87    26.81  4.47
testing   40   acrylic acid              test   6.00    -6.39    47.13  2.63

Cluster Analysis. Hierarchical cluster analysis is a common multivariate pattern recognition technique used to group observations together according to their proximity to one another in the multidimensional space defined by the variables being studied (15). A cluster is defined to be either a single point or multiple points grouped together because of their relative closeness. To determine the closeness of two clusters, one must define a multidimensional measure of distance between the clusters and also the point of reference in each cluster between which the distance is measured. In most studies, though not all, the Euclidean distance based on standardized variables is used as the distance measure. The points of reference, between which the distance of two clusters is measured, define the cluster analysis “method”. The centroid method measures the distance between the means of the two clusters. The nearest-neighbor method measures the distance between the two observations, one from each cluster, that are closer than any other such pair. This study investigated use of the centroid method but ultimately found the nearest-neighbor method with a Euclidean distance measure to be more effective.

The results of the cluster analysis are displayed in two graphs: a dendrogram and an amalgamation schedule. The dendrogram is a tree diagram connecting all the observations, which are listed to the left, and illustrates the relationship between the clusters that are formed. Clusters that are connected by lower branches on the tree are closer than clusters that are connected by higher branches. The amalgamation schedule is a line chart whose vertexes are horizontally aligned with the dendrogram’s connecting branches, having one vertex associated with each merging of two clusters. The vertical distance between two vertexes indicates the distance between the two clusters that are merged by the corresponding branch. A plateau across vertexes indicates strong similarity among observations, while a sudden change in height indicates a large difference between the adjacent clusters.

In this study, hierarchical cluster analysis was used iteratively on the training data set. Analyses began by using the entire set of the physicochemical variables discussed above to cluster the chemicals from the training set. Numerous subsets of the physicochemical variables were then tried, including all pairs and triplets, until a minimal subset of variables was found that clustered “good” solvents apart from nonsolvents.

Discriminant Function Analysis. Discriminant function analysis is a multivariate analysis technique used to classify each observation into one of multiple subpopulations based upon its location in the multidimensional space defined by the variables in the data set (15). A training set, with observations already correctly classified into populations and having standardized variables, is used to develop “discriminant functions” that provide rules for classifying other observations not in the training set. In situations where there are two populations to be discriminated between (as in this study, where chemicals are classified as good solvents and nonsolvents), discriminant function analysis determines a single axis in the multidimensional space along which the greatest Euclidean distance is measured between the two population means relative to the variability of the observations within each of the two populations.
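For the two-population case just described, the discriminant axis can be sketched in a few lines of Python. This is a minimal illustration of the classical Fisher linear discriminant for two variables, not the JMP procedure used in the study. The data points are hypothetical, the variables are left unstandardized for brevity, and the midpoint cutoff assumes equal prior probabilities for the two populations.

```python
def mean_vec(rows):
    """Mean of a list of 2-D observations."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def pooled_cov(groups):
    """Pooled within-group covariance matrix (2 x 2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    dof = 0
    for rows in groups:
        m = mean_vec(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        dof += len(rows) - 1
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]


def score(w, x):
    """Discriminant score: location of x along the discriminant axis w."""
    return w[0] * x[0] + w[1] * x[1]


def fisher_axis(group_a, group_b):
    """Axis w = S^-1 (mean_a - mean_b) and a midpoint cutoff (equal priors assumed)."""
    s = pooled_cov([group_a, group_b])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    ma, mb = mean_vec(group_a), mean_vec(group_b)
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    cutoff = 0.5 * (score(w, ma) + score(w, mb))
    return w, cutoff


def classify(x, w, cutoff):
    """Scores above the cutoff fall on the group 'a' side of the axis."""
    return "a" if score(w, x) > cutoff else "b"


# Hypothetical two-variable observations for two populations.
group_a = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
group_b = [(6.0, 7.0), (7.0, 6.0), (7.0, 8.0), (8.0, 7.0)]

w, cutoff = fisher_axis(group_a, group_b)
print(classify((2.5, 2.5), w, cutoff), classify((6.5, 7.0), w, cutoff))
```

Dividing the difference in means by the pooled within-group covariance is what makes the separation "relative to the variability within each population": a direction with a large gap between means but large scatter is down-weighted.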

The discriminant function is just one of several important components in classification using discriminant function analysis. The discriminant function yields a “discriminant score” for each observation, which is the location of the observation along the discriminant axis. An observation with a discriminant score below a determined cutoff value is classified into one population, while an observation with a discriminant score above the cutoff value is classified into the other population. The cutoff value is determined so that classification coincides with the population having the highest “posterior probability”, the estimated probability that an observation is from a specific population given its observed discriminant score. Aside from determining the classification of an observation, posterior probabilities are informative because they estimate the confidence in the classification of the observation into a particular population.

Validation of the developed classification rules is often accomplished in two ways. Misclassification rates are reported for the training set, indicating the proportion of observations from each population that were misclassified by the classification rules. Also, the classification rules developed using the training set can be applied to a testing data set whose correct classifications are known, and misclassification rates are again reported. In this study, both types of validation were examined. Discriminant function analysis was performed, and corresponding misclassification rates were examined, using the full complement of variables with the training set. However, to examine misclassification rates for the testing data set, the discriminant function analysis was limited to using the variables available in the testing data set as discussed above.

TABLE 2. Groups of Variables with High Correlations among the 30 Training Chemicals

correlation ≥ 0.70    correlation ≥ 0.80    correlation ≥ 0.90
1χν, log(P), MW       0χν, MW               log(S), log(HC), log(P)
R, ST                 β, log(HC)            MolarV, 0χ, 0χν, 1χν
BoilPt, ST
β, log(P)
β, log(S)

Canonical Discriminant Analysis. Canonical discriminant analysis is a multivariate dimension reduction technique (15). This method results in a series of variables (or axes) called “canonical variables”, each being a linear combination of the original variables. The first canonical variable gives the maximum possible separation between the population means (or more precisely, maximizes the between population variability) relative to the within population variability. Each subsequent canonical variable in the series increases this relative separation as much as possible but contributes less than previous canonical variables in the series.

Canonical discriminant analysis differs from discriminant function analysis in several fundamental ways. First, being a dimension reduction technique, it does not directly yield classification rules. Second, discriminant function analysis uses each of the axes it yields to iteratively separate a population from another population or from a group of other populations, while each axis (or canonical variable) in canonical discriminant analysis contributes to the separation of all populations. Finally, for m populations, discriminant function analysis develops and requires m - 1 axes, whereas canonical discriminant analysis allows the user to select and utilize as many of the canonical variables (or axes) as desired, up to a maximum of m - 1, until the desired level of separation is obtained.

This study used canonical discriminant analysis to develop a single axis that would yield maximum separation between the good solvent and the nonsolvent populations. The implementation was refined by further classifying the nonsolvents into two groups based on the direction in which they lay relative to the good solvents.
The entire complement of physicochemical variables was used to study the training data set, but examination of the testing data set was limited to the variables defined in the testing data set as discussed previously.
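The first canonical variable described above can be computed as the leading eigenvector of W⁻¹B, where W and B are the within-group and between-group sums-of-squares-and-cross-products matrices. The following Python sketch does this for two variables only, so that the 2 × 2 eigenproblem can be solved in closed form. It is an illustration under stated assumptions, not the SAS procedure used in the study: the three groups and their coordinates are hypothetical, and the study itself worked with more variables.

```python
import math


def mean_vec(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter_matrices(groups):
    """Within-group (W) and between-group (B) SSCP matrices for 2-D data."""
    grand = mean_vec([r for g in groups for r in g])
    W = [[0.0, 0.0], [0.0, 0.0]]
    B = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        m = mean_vec(g)
        dm = [m[0] - grand[0], m[1] - grand[1]]
        for i in range(2):
            for j in range(2):
                B[i][j] += len(g) * dm[i] * dm[j]
                W[i][j] += sum((r[i] - m[i]) * (r[j] - m[j]) for r in g)
    return W, B


def first_canonical_axis(groups):
    """Unit-length leading eigenvector of W^-1 B and its eigenvalue."""
    W, B = scatter_matrices(groups)
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    Winv = [[W[1][1] / det, -W[0][1] / det],
            [-W[1][0] / det, W[0][0] / det]]
    M = [[sum(Winv[i][k] * B[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    a, b, c, d = M[0][0], M[0][1], M[1][0], M[1][1]
    # Largest root of the 2 x 2 characteristic polynomial.
    disc = max(0.0, (0.5 * (a - d)) ** 2 + b * c)
    lam = 0.5 * (a + d) + math.sqrt(disc)
    if abs(b) > 1e-12:
        v = [b, lam - a]
    elif abs(c) > 1e-12:
        v = [lam - d, c]
    else:  # M is already diagonal
        v = [1.0, 0.0] if a >= d else [0.0, 1.0]
    norm = math.hypot(v[0], v[1])
    return [v[0] / norm, v[1] / norm], lam


# Hypothetical groups laid out along a diagonal, mimicking the
# low nonsolvent / good solvent / high nonsolvent arrangement.
low = [(-1.0, 0.0), (1.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
good = [(4.0, 5.0), (6.0, 5.0), (5.0, 4.0), (5.0, 6.0)]
high = [(9.0, 10.0), (11.0, 10.0), (10.0, 9.0), (10.0, 11.0)]

axis, eigenvalue = first_canonical_axis([low, good, high])
can1 = {name: axis[0] * mean_vec(g)[0] + axis[1] * mean_vec(g)[1]
        for name, g in (("low", low), ("good", good), ("high", high))}
print(axis, eigenvalue, can1)
```

Projecting each observation onto the returned axis gives its value on the first canonical variable; new observations can be placed on the same axis without refitting, which is how a test set can be examined against groups defined by the training set.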

Results and Discussion

Initial Investigation. Initial investigations examined two-dimensional scatterplots and sample correlations of all 16 variables for the 30 training set chemicals. Both the correlation coefficients and the scatterplots revealed strong pairwise relationships among several groups. Table 2 displays sets of variables among which the sample correlation coefficients were at least 0.7, 0.8, and 0.9, respectively. Particularly strong correlations existed among the sets [log(S), log(HC), log(P)] and (MolarV, 0χ, 0χν, 1χν), where each pair in the set had a correlation coefficient greater than 0.9. These strong relationships reduce the need to have several of the variables from each set in analyses and give possible insight into results of some analyses such as the canonical discriminant analysis. Also of interest is the correlation of 0.74 between the solubility parameter (SolP) and the solvatochromic parameter, R, suggesting that SolP may not have contributed substantially beyond the contribution of R had it been usable in the analyses.
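The pairwise correlation screening described above can be sketched in Python using the training set values of log(S), log(HC), and ST from Table 1. The 0.70 threshold follows the lowest level in Table 2; applying it to the absolute value of the coefficient is an assumption, because the text does not state how negative correlations were treated. No particular output is claimed here, and the result need not match the published table exactly.

```python
import math
from itertools import combinations

# Training set values (chemicals 1-30) copied from Table 1.
data = {
    "log(S)": [4.87, 5.77, 6.00, 6.06, 3.25, 2.73, 2.24, 2.91, 2.59, 3.18,
               4.29, 2.17, 3.04, 4.81, 6.00, 5.38, 4.31, 3.90, 3.70, 3.93,
               3.65, 1.74, 6.78, 3.83, 4.36, 5.89, 4.74, 5.34, 3.77, 1.79],
    "log(HC)": [-5.06, -5.20, -4.91, -3.87, -2.27, -2.23, -2.29, -1.52, -2.34,
                -2.10, -2.61, -1.57, -1.99, -3.92, -4.18, -4.98, -4.03, -2.36,
                -2.23, -3.01, -2.92, -0.71, -7.00, -3.50, -4.92, -4.59, -3.86,
                -5.16, -2.08, -0.16],
    "ST": [25.67, 22.39, 22.40, 22.50, 28.88, 28.52, 30.31, 27.65, 32.93,
           25.14, 28.77, 31.65, 32.00, 24.00, 23.04, 23.96, 24.74, 26.67,
           24.66, 32.57, 35.37, 26.43, 27.59, 25.41, 35.19, 22.39, 20.72,
           23.71, 15.19, 7.02],
}


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


threshold = 0.70  # lowest level reported in Table 2
for u, v in combinations(data, 2):
    r = pearson(data[u], data[v])
    flag = "high" if abs(r) >= threshold else "low"
    print(f"{u} vs {v}: r = {r:.2f} ({flag})")
```

When two variables are flagged as highly correlated, keeping only one of them in a later analysis loses little information, which is the rationale given above for working with small variable subsets.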


Cluster Analysis. Cluster analysis was performed on both the training data set and the combined (training and testing) data set. The initial strategy of the cluster analysis was 2-fold. The first objective was to determine, through exploratory use (i.e., repeatedly adding or eliminating one variable at a time until reasonable clusterings were obtained), a minimal subset of the 16 variables that separates the “good” solvents from the nonsolvents. That is, we sought a small subset of variables that was sufficient to cluster the good solvents together but away from the nonsolvents. The purpose in doing this was to determine how few variables might be needed to separate the good solvents from the nonsolvents. Knowing this could potentially better focus the direction of future investigations on screening methodologies as well as simplify implementation of these techniques in practice. The second objective was to see if the chemicals from the testing data set could be appropriately clustered with the good solvents and nonsolvents using the minimal set.

In pursuing the first objective, both centroid and nearest-neighbor methods were employed, but the centroid method was not as fruitful, and the nearest-neighbor method was ultimately adopted. The first objective led to two sets of variables: [log(S), ST, 0χν] and [log(S), ST]. The clusterings resulting from the two variable sets were somewhat different. Perhaps the most notable difference was that the variable set containing 0χν clustered 1,2-dichloroethane among the good solvents, while the variable set without 0χν clustered it among the nonsolvents. Unfortunately, inclusion of the test chemicals caused the clustering of the training chemicals to rearrange substantially when 0χν was used, including the placement of three good solvents among the nonsolvents.
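The nearest-neighbor (single-linkage) procedure with Euclidean distance on standardized variables can be sketched as follows, using the training set values of log(S) and ST from Table 1. This is a naive pure-Python illustration, not the JMP implementation used in the study, so tie handling and the exact merge order may differ from the published dendrogram. Standardizing with the sample standard deviation is an assumption.

```python
import math

# Training set chemicals 1-30 with log(S) and ST copied from Table 1.
names = ["butanol", "ethanol", "2-propanol", "methanol", "benzene", "toluene",
         "1,2-xylene", "carbon tetrachloride", "chlorobenzene",
         "1,1,1-trichloroethane", "methylene chloride", "perchloroethylene",
         "trichloroethylene", "ethyl acetate", "acetone", "methyl ethyl ketone",
         "methyl isobutyl ketone", "chloroform", "1,1-dichloroethane",
         "1,2-dichloroethane", "1,1,2-trichloroethane", "cyclohexane",
         "acetic acid", "n-butyl acetate", "cyclohexanone", "diethylamine",
         "triethylamine", "propanol", "methyl chloride", "propane"]
log_s = [4.87, 5.77, 6.00, 6.06, 3.25, 2.73, 2.24, 2.91, 2.59, 3.18,
         4.29, 2.17, 3.04, 4.81, 6.00, 5.38, 4.31, 3.90, 3.70, 3.93,
         3.65, 1.74, 6.78, 3.83, 4.36, 5.89, 4.74, 5.34, 3.77, 1.79]
st = [25.67, 22.39, 22.40, 22.50, 28.88, 28.52, 30.31, 27.65, 32.93, 25.14,
      28.77, 31.65, 32.00, 24.00, 23.04, 23.96, 24.74, 26.67, 24.66, 32.57,
      35.37, 26.43, 27.59, 25.41, 35.19, 22.39, 20.72, 23.71, 15.19, 7.02]


def standardize(values):
    """Center to zero mean and scale to unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def single_linkage(points):
    """Return the merge history as (distance, members_a, members_b) tuples."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Nearest-neighbor distance: closest pair across the two clusters.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


points = list(zip(standardize(log_s), standardize(st)))
merges = single_linkage(points)

# Amalgamation schedule: merge distance at each step.
for step, (d, left, right) in enumerate(merges, start=1):
    print(step, round(d, 3), [names[i] for i in left], [names[i] for i in right])
```

The printed merge distances play the role of the amalgamation schedule: observations that join only at the last, largest distances are the most isolated ones in the data set.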
Inclusion of the test chemicals did not substantially alter the clustering of the training chemicals when 0χν was not a factor, and this led to a preference for the [log(S), ST] variable set. Figure 1 shows the dendrogram and associated amalgamation schedule graph (bottom of figure) for the training data set using the [log(S), ST] variable set. The dendrogram shows two distinct clusters among the good solvents (distinguished by distinct symbols preceding the chemical names: a circle for the first cluster and a star for the second cluster). It also shows that the nonsolvents were the last chemicals to be clustered (distinguished by a square preceding the chemical name), indicating that they are the most isolated chemicals in the data set, even from each other. The cluster analysis for the combined data set of 40 chemicals, based on log(S) and ST (Figure 2), placed valeric acid, n-octanol, ethylbenzene, and pentachloroethane among the good solvents. The first two seem to be misclassified, while the other two are appropriately placed. Propionic acid, n-pentane, n-butylcyclohexane, 3-ethylhexane, trichlorobenzene, and acrylic acid were all correctly placed among the nonsolvents.

Discriminant Function Analysis. Discriminant function analysis was performed on the training data set and the combined data set. The objective in both cases was to evaluate the potential for discriminant function analysis to appropriately identify chemicals as good solvents or nonsolvents. While the training data set has no test data to evaluate, discriminant function analysis reports the misclassifications that would occur for the training set chemicals using the rules it develops. Examination of the training data set allows investigating the use of the 16 variables, as opposed to the combined data set, which limits investigation to four variables.
Discriminant function analysis on the training data set with all 16 variables resulted in three misclassifications of good solvents (though one of these was 1,2-dichloroethane) and one misclassification of a nonsolvent, giving a total misclassification rate of 13.3%. Using only the variables [log(S), ST] resulted in 8/4 misclassifications (solvents as nonsolvents/nonsolvents as solvents), which was better than [log(S), ST, 0χν] with 8/5 misclassifications but not quite as good as [log(S), ST, log(HC)] with 7/3 misclassifications. Treating 1,2-dichloroethane as a solvent generally decreased the number of misclassifications by one or two, except when all 16 variables were used, where it increased misclassifications by two.

FIGURE 1. Dendrogram from cluster analysis with training cases only using nearest neighbor method with variables log(S) and ST.

Table 3 shows the classification results of the discriminant function analysis using [log(S), ST] with the combined data set. The predicted classification of the 10 test chemicals disagrees with the cluster analysis for four chemicals. Among the 30 training chemicals, 21 had posterior probabilities in the range of 0.5 ± 0.1, and 28 had posterior probabilities in the range of 0.5 ± 0.2. Hence, few chemicals were strongly classified, indicating a lack of certainty associated with the decision rules developed by the discriminant function analysis. Using other combinations of log(S), ST, log(HC), and 0χν did not substantially change these results. The posterior probabilities for the training set with all 16 variables were more extreme, typically occurring below 0.1 or above 0.9. This indicates more certainty in the decision rules developed by the discriminant function analysis. Ultimately, discriminant function analysis proved to be disappointing in its potential to classify solvents and nonsolvents using the four variables available in the testing data but showed some potential for situations where more of the 16 variables from the training data set are available.

Canonical Discriminant Analysis. Further investigation of the four predictor variables available in the testing data set [log(S), ST, log(HC), 0χν] through two- and three-dimensional plots prompted the idea of using either principal component analysis or canonical discriminant analysis for classifying solvents and nonsolvents. Using the training set, each possible subset of three variables from the original four variables [log(S), ST, log(HC), 0χν] was interactively examined in rotating three-dimensional plots. The good solvent observations and the nonsolvent observations were distinguished by different symbols, and patterns distinguishing the groups from one another were sought. A two-dimensional scatterplot of ST versus log(S) (Figure 3) illustrates the dominant and pertinent information found in this exploration. Here, the good solvents form along a path running between two subgroups of the nonsolvents. This pattern prompted the use of canonical discriminant analysis.

FIGURE 2. Dendrogram from cluster analysis including testing cases using nearest neighbor method with variables log(S) and ST.

A similar strategy using principal component analysis could also be developed. One strategy would determine primary axes along which the good solvent observations have the greatest variability. Another axis could then be calculated that runs perpendicular to the primary axes and simultaneously minimizes the distances from the nonsolvent observations to the axis. However, this is a more convoluted approach. In many cases, these approaches may yield similar results. However, a strategy using principal component analysis does not have the direct objective of separating the
groups as canonical discriminant analysis does (excepting the use of principal component analysis on the group means, weighted by the number of observations in the groups, which is equivalent to canonical discriminant analysis). In order for canonical discriminant analysis to be effective in this situation, it was necessary to reclassify the nonsolvents into two groupssthose lying above the path of the good solvents and those lying below the pathsfor a total of three groups (good solvents, high nonsolvents, and low nonsolvents). The idea was to determine an axis in the fourdimensional space of the original variables, which gives the best single measure of separation between the three groups. The first canonical variable from a canonical discriminant

FIGURE 3. Plot of surface tension vs log(S) (training set only).

TABLE 3. Discriminant Function Analysis Predicted Solvent Status of the Training and Testing Sets Chemicals and Their Posterior Probability of Good Solvent Classification Based on the Variables log(S) and Surface Tensionᵃ

| chemical | solvent status | predicted solvent status | posterior prob of solvent classification |
|---|---|---|---|
| butanol | good | good | 0.544634 |
| ethanol | good | good | 0.554575 |
| 2-propanol | good | good | 0.566480 |
| methanol | good | good | 0.570646 |
| benzene | good | non | 0.496453 |
| toluene | good | non | 0.465404 |
| 1,2-xylene | good | non | 0.460015 |
| carbon tetrachloride | good | non | 0.465013 |
| chlorobenzene | good | good | 0.507542 |
| 1,1,1-trichloroethane | good | non | 0.451004 |
| methylene chloride | good | good | 0.549188 |
| perchloroethylene | good | non | 0.471352 |
| trichloroethylene | good | good | 0.520518 |
| ethyl acetate | good | good | 0.522876 |
| acetone | good | good | 0.573521 |
| methyl ethyl ketone | good | good | 0.551923 |
| methyl isobutyl ketone | good | good | 0.505171 |
| chloroform | good | good | 0.505487 |
| 1,1-dichloroethane | good | non | 0.472552 |
| 1,2-dichloroethane | non | good | 0.572728 |
| 1,1,2-trichloroethane | non | good | 0.589128 |
| cyclohexane | non | non | 0.392095 |
| acetic acid | non | good | 0.659895 |
| n-butyl acetate | good | non | 0.487709 |
| cyclohexanone | non | good | 0.622482 |
| diethylamine | good | good | 0.560738 |
| triethylamine | non | non | 0.482454 |
| propanol | good | good | 0.547084 |
| methyl chloride | non | non | 0.372771 |
| propane | non | non | 0.214293 |
| pentachloroethane | test | good | 0.529391 |
| n-pentane | test | non | 0.276135 |
| n-octanol | test | non | 0.461832 |
| n-butylcyclohexane | test | non | 0.278218 |
| 3-ethylhexane | test | non | 0.255003 |
| ethylbenzene | test | non | 0.439882 |
| trichlorobenzene | test | good | 0.583710 |
| propionic acid | test | good | 0.607795 |
| valeric acid | test | good | 0.531997 |
| acrylic acid | test | good | 0.798574 |

ᵃ 1,2-Dichloroethane is treated as a nonsolvent.
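The agreement between the actual and predicted columns of Table 3 can be tallied with a short script. The following is a minimal sketch, not the authors' procedure: it assumes that a chemical is predicted to be a good solvent when its posterior probability exceeds 0.5 (a cutoff that is consistent with the predicted column of Table 3), and the names `TRAINING`, `predict`, and `tally` are illustrative only.

```python
# Training rows copied from Table 3: (chemical, actual status, posterior probability).
TRAINING = [
    ("butanol", "good", 0.544634), ("ethanol", "good", 0.554575),
    ("2-propanol", "good", 0.566480), ("methanol", "good", 0.570646),
    ("benzene", "good", 0.496453), ("toluene", "good", 0.465404),
    ("1,2-xylene", "good", 0.460015), ("carbon tetrachloride", "good", 0.465013),
    ("chlorobenzene", "good", 0.507542), ("1,1,1-trichloroethane", "good", 0.451004),
    ("methylene chloride", "good", 0.549188), ("perchloroethylene", "good", 0.471352),
    ("trichloroethylene", "good", 0.520518), ("ethyl acetate", "good", 0.522876),
    ("acetone", "good", 0.573521), ("methyl ethyl ketone", "good", 0.551923),
    ("methyl isobutyl ketone", "good", 0.505171), ("chloroform", "good", 0.505487),
    ("1,1-dichloroethane", "good", 0.472552), ("1,2-dichloroethane", "non", 0.572728),
    ("1,1,2-trichloroethane", "non", 0.589128), ("cyclohexane", "non", 0.392095),
    ("acetic acid", "non", 0.659895), ("n-butyl acetate", "good", 0.487709),
    ("cyclohexanone", "non", 0.622482), ("diethylamine", "good", 0.560738),
    ("triethylamine", "non", 0.482454), ("propanol", "good", 0.547084),
    ("methyl chloride", "non", 0.372771), ("propane", "non", 0.214293),
]

CUTOFF = 0.5  # assumed decision threshold, not stated in the text


def predict(prob, cutoff=CUTOFF):
    """Label a chemical 'good' when its posterior probability exceeds the cutoff."""
    return "good" if prob > cutoff else "non"


def tally(rows):
    """Count (actual, predicted) pairs over the given rows."""
    counts = {}
    for _name, actual, prob in rows:
        key = (actual, predict(prob))
        counts[key] = counts.get(key, 0) + 1
    return counts


counts = tally(TRAINING)
misclassified = counts[("good", "non")] + counts[("non", "good")]
print(counts)
print(f"misclassified: {misclassified} of {len(TRAINING)}")  # 12 of 30
```

Under this cutoff, 8 good solvents are labeled as nonsolvents and 4 nonsolvents are labeled as good solvents, so 12 of the 30 training chemicals are misclassified.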

analysis is defined on such an axis. Figure 4 shows a plot of the first canonical variable (CAN1) for the combined data set (note that the data are randomly spread in the horizontal direction for better visibility; there is no variable on the bottom axis). This plot shows even better separation than the plot of ST versus log(S) in Figure 3, yet it involves only a single (transformed) variable.

Placing the test data on the axis defined by CAN1 using the four variables from the combined data set gives a visual indication of how these data would be classified (Figure 5). Note that the test chemicals are not used in the canonical discriminant analysis itself. Rather, the value of CAN1 is calculated for the test chemicals using the total sample standardized canonical coefficients derived from the canonical discriminant analysis performed on the training chemicals only. Table 4 displays, in descending order, the value of CAN1 for all chemicals in the combined data set. This table indicates that acrylic acid and trichlorobenzene lie among the high nonsolvents; n-butylcyclohexane, 3-ethylhexane, and n-pentane lie among the low nonsolvents; valeric acid, pentachloroethane, n-octanol, and ethylbenzene lie among the good solvents; and propionic acid lies between the high nonsolvents and the good solvents. This is in good agreement with the classification by the cluster analysis procedure shown in Figure 2.

FIGURE 4. Plot of CAN1, no test group, using four variables from the combined set.

Often, including the second canonical variable (CAN2) will further separate the groupings, but as Figure 6 shows, addition of this variable does not help separate the good solvents and the nonsolvents. Figure 7 shows a plot of the first canonical variable using all 16 variables from the training data set (again, there is no variable on the bottom axis). Comparing this plot with the

FIGURE 5. Plot of CAN1, with test group, using the four variables from the combined set.

TABLE 4. 30 Training Chemicals and 10 Test Chemicals in Descending Order by Value of First Canonical Variable Based on log(S), log(HC), ST, and ⁰χᵛ

| name | class | CAN1 |
|---|---|---|
| acrylic acid | test | 7.61790960 |
| trichlorobenzene | test | 3.62024414 |
| cyclohexanone | high | 3.13461569 |
| acetic acid | high | 2.91174684 |
| 1,1,2-trichloroethane | high | 2.09923253 |
| propionic acid | test | 1.75530683 |
| 1,2-dichloroethane | good | 1.66534459 |
| valeric acid | test | 1.39453531 |
| pentachloroethane | test | 1.15143481 |
| chlorobenzene | good | 1.05130216 |
| butanol | good | 0.96966402 |
| n-octanol | test | 0.83893781 |
| trichloroethylene | good | 0.73851830 |
| propanol | good | 0.71605466 |
| methylene chloride | good | 0.67491067 |
| ethanol | good | 0.60000411 |
| methyl ethyl ketone | good | 0.52866831 |
| benzene | good | 0.30567651 |
| 2-propanol | good | 0.29132138 |
| ethylbenzene | test | 0.21502539 |
| methanol | good | 0.18538081 |
| 1,2-xylene | good | 0.15295675 |
| acetone | good | 0.13626953 |
| perchloroethylene | good | 0.09920000 |
| toluene | good | -0.08020000 |
| diethylamine | good | -0.08170000 |
| ethyl acetate | good | -0.13187222 |
| methyl isobutyl ketone | good | -0.21006047 |
| chloroform | good | -0.27406502 |
| n-butyl acetate | good | -0.39981566 |
| carbon tetrachloride | good | -0.77878854 |
| 1,1-dichloroethane | good | -0.86084612 |
| 1,1,1-trichloroethane | good | -1.10036275 |
| triethylamine | low | -1.38645320 |
| cyclohexane | low | -1.44790864 |
| n-butylcyclohexane | test | -2.64809756 |
| methyl chloride | low | -3.03262471 |
| 3-ethylhexane | test | -4.22102278 |
| n-pentane | test | -4.72104029 |
| propane | low | -6.47613626 |
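The reading of Table 4 described in the text can be mimicked with a simple interval rule. The sketch below is an assumption made for illustration, not a rule stated by the authors: a test chemical is called "good" if its CAN1 value falls within the range spanned by the good-solvent training chemicals, "high" or "low" if it lies at or beyond the nearest high or low nonsolvent, and "between groups" otherwise. The names `TABLE4`, `group_ranges`, and `screen` are hypothetical.

```python
# (name, class, CAN1) copied from Table 4; "test" marks the external test chemicals.
TABLE4 = [
    ("acrylic acid", "test", 7.61790960), ("trichlorobenzene", "test", 3.62024414),
    ("cyclohexanone", "high", 3.13461569), ("acetic acid", "high", 2.91174684),
    ("1,1,2-trichloroethane", "high", 2.09923253), ("propionic acid", "test", 1.75530683),
    ("1,2-dichloroethane", "good", 1.66534459), ("valeric acid", "test", 1.39453531),
    ("pentachloroethane", "test", 1.15143481), ("chlorobenzene", "good", 1.05130216),
    ("butanol", "good", 0.96966402), ("n-octanol", "test", 0.83893781),
    ("trichloroethylene", "good", 0.73851830), ("propanol", "good", 0.71605466),
    ("methylene chloride", "good", 0.67491067), ("ethanol", "good", 0.60000411),
    ("methyl ethyl ketone", "good", 0.52866831), ("benzene", "good", 0.30567651),
    ("2-propanol", "good", 0.29132138), ("ethylbenzene", "test", 0.21502539),
    ("methanol", "good", 0.18538081), ("1,2-xylene", "good", 0.15295675),
    ("acetone", "good", 0.13626953), ("perchloroethylene", "good", 0.09920000),
    ("toluene", "good", -0.08020000), ("diethylamine", "good", -0.08170000),
    ("ethyl acetate", "good", -0.13187222), ("methyl isobutyl ketone", "good", -0.21006047),
    ("chloroform", "good", -0.27406502), ("n-butyl acetate", "good", -0.39981566),
    ("carbon tetrachloride", "good", -0.77878854), ("1,1-dichloroethane", "good", -0.86084612),
    ("1,1,1-trichloroethane", "good", -1.10036275), ("triethylamine", "low", -1.38645320),
    ("cyclohexane", "low", -1.44790864), ("n-butylcyclohexane", "test", -2.64809756),
    ("methyl chloride", "low", -3.03262471), ("3-ethylhexane", "test", -4.22102278),
    ("n-pentane", "test", -4.72104029), ("propane", "low", -6.47613626),
]


def group_ranges(rows):
    """Return {class: (min CAN1, max CAN1)} over the training chemicals only."""
    ranges = {}
    for _name, cls, can1 in rows:
        if cls == "test":
            continue
        lo, hi = ranges.get(cls, (can1, can1))
        ranges[cls] = (min(lo, can1), max(hi, can1))
    return ranges


def screen(can1, ranges):
    """Assumed interval rule for placing a CAN1 value relative to the three groups."""
    good_lo, good_hi = ranges["good"]
    if good_lo <= can1 <= good_hi:
        return "good"
    if can1 >= ranges["high"][0]:  # at or above the lowest high nonsolvent
        return "high"
    if can1 <= ranges["low"][1]:   # at or below the highest low nonsolvent
        return "low"
    return "between groups"


RANGES = group_ranges(TABLE4)
RESULTS = {name: screen(can1, RANGES) for name, cls, can1 in TABLE4 if cls == "test"}
for name, label in RESULTS.items():
    print(f"{name:20s} {label}")
```

With the Table 4 values, this rule places propionic acid between the high nonsolvents and the good solvents and assigns the other nine test chemicals as described in the text.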

TABLE 5. Total Sample Standardized Canonical Coefficients for the Canonical Discriminant Analysis Using log(S), log(HC), ST, and ⁰χᵛ

| variable | CAN1 |
|---|---|
| log(S) | 0.163886 |
| log(HC) | -0.748385 |
| ST | 1.518834 |
| ⁰χᵛ | -0.255353 |

FIGURE 6. Plot of CAN1 vs CAN2, no test group, using the four variables from the combined set.

one in Figure 4 [which is based on the combined data set having only four variables: log(S), ST, log(HC), ⁰χᵛ], one can see that the separation between the good solvents and the nonsolvents is even greater than when [log(S), ST, log(HC), ⁰χᵛ] are used. While there is no test data set to evaluate that has all 16 variables, these results suggest even greater potential

for separating out the test chemicals if the variables absent from the combined data set were available.

The total sample standardized canonical coefficients for the canonical discriminant analysis using [log(S), ST, log(HC), ⁰χᵛ] (Table 5) show that log(HC) and ST contribute the most in the construction of CAN1, implying that these two variables are the most useful in separating the solvent and nonsolvent groups. Table 6 gives the total sample standardized canonical coefficients for the canonical discriminant analysis using all 16 variables. These coefficients imply that BoilPt, ⁰χ, and ⁰χᵛ contribute the most in separating the groups. However, because of the numerous strong correlations among many of the variables in the data set, these variables may not be the only ones capable of producing good separation between the groups. For example, removal of ⁰χ does not dramatically impact the results of the canonical discriminant analysis, nor does even the removal of ⁰χ, ⁰χᵛ, and ¹χᵛ. Because of the strong linear relationship between these variables and MolarV, MolarV and other less correlated variables are able to greatly compensate for the contributions of the removed variables.

FIGURE 7. Plot of CAN1, no test group, using all 16 variables from the training set.

TABLE 6. Total Sample Standardized Canonical Coefficients for the Canonical Discriminant Analysis Using All 16 Variables from the Training Data Set

| variable | CAN1 | variable | CAN1 |
|---|---|---|---|
| log(S) | 5.2450143 | R | 2.0698525 |
| log(HC) | 0.3571004 | π | -2.1804122 |
| ST | -2.3988698 | α | -2.9647183 |
| BoilPt | 3.3482704 | β | -8.0965957 |
| MeltPt | -0.7266137 | MolarV | 2.3238106 |
| MW | -0.3777263 | ⁰χ | 1.8892035 |
| log(P) | -5.9752174 | ⁰χᵛ | -0.4653586 |
| VP | -0.7120605 | ¹χᵛ | -0.2699279 |

Comparison of Multivariate Methods. Of the several techniques considered in this paper, canonical discriminant analysis appears to hold the most promise for screening potential solvents. Canonical discriminant analysis was able to separate the solvent and nonsolvent groups well using just the four variables available in the testing data set, yet it appears to have even greater ability to separate these groups when more variables are available. Additionally, the implication this measure has for a chemical's potential as a solvent is easily discernible when displayed in either a table or a graph.

A strategy using principal component analysis was initially considered as an alternative to canonical discriminant analysis, but it lacked the direct objective of separating the solvent and nonsolvent groups. Cluster analysis may be an informative tool but has no clear indicator of solvent potential associated with it. Discriminant function analysis required too many variables to be generally useful, resulted in numerous misclassifications of the training data set, and lacked the ability to illustrate a chemical's solvent potential in a simple manner.

Two-dimensional scatterplots and three-dimensional interactive rotating plots give insight into patterns that might be useful in developing strategies for screening chemicals and offer some understanding of relationships between variables, which is often useful in selecting variables to be used with a particular technique. Cluster analysis may prove useful in identifying commonalities and differences that exist among various groups of solvents. Understanding these distinctions may help develop future strategies in screening potential solvents. For example, it may be beneficial to classify solvents into two distinct groups for use in strategies based on discriminant function analysis or canonical discriminant analysis. Investigation of other variables may also lead to better screening methodology, but the variables used should be easily obtained, given that many of the chemicals that may be screened will not be well studied. Finally, further evaluation of the methods presented here using other sets of chemicals would add to our understanding of their suitability as screening techniques.
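How total sample standardized canonical coefficients such as those in Table 5 translate into a CAN1 value for a new chemical can be sketched as follows. This is a hedged illustration only: it assumes that each variable is first centered and scaled by its total-sample (training set) mean and standard deviation before being weighted, and the `MEANS`, `STDS`, and `example` values are made up for demonstration because the training-set statistics are not reported here. The function names are hypothetical.

```python
# Total sample standardized canonical coefficients for CAN1, copied from Table 5.
# The key "0chi_v" stands for the zero-order valence connectivity index.
COEFS = {"log(S)": 0.163886, "log(HC)": -0.748385, "ST": 1.518834, "0chi_v": -0.255353}


def can1_score(values, means, stds, coefs=COEFS):
    """Weighted sum of standardized variables (assumed form of the CAN1 calculation)."""
    return sum(coefs[k] * (values[k] - means[k]) / stds[k] for k in coefs)


def rank_by_weight(coefs=COEFS):
    """Variables ordered by the absolute size of their coefficient."""
    return sorted(coefs, key=lambda k: abs(coefs[k]), reverse=True)


# Hypothetical training-set means and standard deviations (NOT from the paper).
MEANS = {"log(S)": 3.0, "log(HC)": -2.0, "ST": 25.0, "0chi_v": 4.0}
STDS = {"log(S)": 1.0, "log(HC)": 1.0, "ST": 5.0, "0chi_v": 2.0}

# Hypothetical chemical: one standard deviation above the mean in ST, average otherwise.
example = {"log(S)": 3.0, "log(HC)": -2.0, "ST": 30.0, "0chi_v": 4.0}

print(rank_by_weight())                    # ['ST', 'log(HC)', '0chi_v', 'log(S)']
print(can1_score(example, MEANS, STDS))    # 1.518834
print(can1_score(MEANS, MEANS, STDS))      # 0.0 for a chemical at the training means
```

Ranking the coefficients by absolute size puts ST and log(HC) first, which mirrors the observation in the text that these two variables contribute the most to CAN1.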

Literature Cited

(1) Billatos, S.; Basaly, N. Green Technology and Design for the Environment; Taylor & Francis: London, 1997.
(2) Zhao, R.; Cabezas, H. Ind. Eng. Chem. Res. 1998, 37 (8), 3268-3280.
(3) Kirschner, E. M. Chem. Eng. News 1994, June, 13-20.
(4) Callahan, M.; Green, B. Hazardous Solvent Source Reduction; McGraw-Hill: New York, 1995.
(5) Allen, D. T. Pollut. Prev. Rev. 1997, Winter, 113-118.
(6) Pretel, J.; Lopez, A.; Bottini, B.; Brignole, A. AIChE J. 1994, 40, 1349-1353.
(7) Callahan, M.; Sciarrotta, T. Pollut. Prev. Rev. 1994, Winter.
(8) Howard, P. Handbook of Environmental Fate and Exposure Data for Organic Chemicals; Lewis Publishers: Chelsea, MI, 1990.
(9) Lide, D. Handbook of Organic Solvents; CRC Press: Boca Raton, FL, 1995.
(10) Smallwood, I. Handbook of Organic Solvent Properties; Arnold: London, 1996.
(11) Yaws, C. Chemical Properties Handbook; CRC Press: Boca Raton, FL, 1999.
(12) Jasper, J. J. Phys. Chem. Ref. Data 1972, 1 (4), 841-1009.
(13) Abraham, M.; Andonian-Haftvan, J.; Whiting, G.; Leo, A.; Taft, R. J. Chem. Soc., Perkin Trans. 2 1994, 1777-1791.
(14) Nirmalakhandan, N.; Speece, R. E. Environ. Sci. Technol. 1988, 22 (6), 606-615.
(15) Johnson, R.; Wichern, D. Applied Multivariate Statistical Analysis; Prentice Hall: New York, 1988.
(16) Hairston, D. Chem. Eng. 1997, February, 55-58.
(17) Meloun, M.; Militky, J.; Forina, M. Chemometrics For Analytical Chemistry Volume 1: PC-aided Statistical Data Analysis; Ellis Horwood: Chichester, 1992.

Received for review November 15, 1999. Revised manuscript received March 29, 2000. Accepted April 4, 2000. ES9912832
