Anal. Chem. 1984, 56, 1685-1688
flowmeter should perform well for most polymers encountered in GPC. Since the flowmeter precision decreases as the molecular weight of the polymer increases, and since GPC becomes more difficult at high molecular weights, problems with the flowmeter should occur only in extreme cases. The analysis presented here assumes that the cell delay constant, K (cf. eq 1), does not change with concentration or the power law constant. Experimentally, K does change, but it is not correlated with concentration for the data obtained. Our experience is that the reproducibility of K for a given GPC system and flowmeter is such that overall flow measurement precision is about 0.2% (6). The flowmeter employed in this research is the subject of a pending patent and licensed to Molytek, Inc., for commercial use.
ACKNOWLEDGMENT We thank Gene Rose for many useful discussions and assistance with the rheological measurements and Tom Chamberlin for helpful comments.

LITERATURE CITED
(1) Miller, T. E.; Small, H. Anal. Chem. 1982, 54, 907-910.
(2) Van Wazer, J. R.; Lyons, J. W.; Kim, K. Y.; Colwell, R. E. "Viscosity and Flow Measurement"; Interscience Publishers: New York, 1963; p 197.
(3) Cho, Y. I.; Hartnett, J. P. Adv. Heat Transfer 1982, 15, 84, 113.
(4) Powell, R. L.; Schwarz, W. H. Rheol. Acta 1975, 14, 729.
(5) Snyder, K. L.; Foster, K. L., The Dow Chemical Co., Midland, MI, private communication, 1982.
(6) Chamberlin, T. A.; Tuinstra, H. W. Anal. Chem. 1983, 55, 430.
RECEIVED for review February 23, 1984. Accepted April 12, 1984.
Minimizing Effects of Closure on Analytical Data

Erik Johansson* and Svante Wold
Research Group for Chemometrics, Umeå University, S-90187 Umeå, Sweden
Kristina Sjodin Department of Organic Chemistry, Royal Institute of Technology, S-10044 Stockholm, Sweden
Closure, or the constant sum, is a well-known problem in geology and geochemistry but has apparently been neglected in analytical chemistry. Closure affects the data, and thereby the results, each time there is a normalization (scaling) of the data sample (object). Chromatography and mass spectrometry are examples of methods where the closure problem is frequent. The effect of closure on two chromatographic data sets is demonstrated and commented upon. A discussion is given of how to predict the influence of closure and how to avoid or minimize the problems.
Analytical chemistry is becoming increasingly multivariate (1-4). In many situations where multivariate analytical data are measured, the data are normalized to a constant sum, which makes the data "closed". This closure introduces a dependence between the variables so that if one large variable goes up, the others automatically go down because their sum is fixed. The effect of closure might or might not be crucial to the results of the investigation.

In chromatography the data are often closed. The value of each constituent is often expressed as a percentage of the total amount to compensate for variation in the amount of sample injected or pyrolyzed (5). In mass spectrometry the fragment peaks (m/z) are presented with the largest fragment peak set to 100 and the other fragment peaks expressed as percentages of the largest peak (6). This is a type of sample normalization which affects both comparisons of mass spectra and library searches based on similarities.

The consequence of expressing all variables as percentages of the total is that "the sum of covariances of each variable will be exactly equal to its variance and opposite in sign" (7). This means there will be a considerable risk for spurious negative correlations between major variables (variables that have a high percentage of the total) and also a risk for spurious positive correlations between minor variables (8-11). Much effort has been spent on estimating the amount of closure on already closed data sets. Even though it may be possible to remove some of the trivial correlations that are introduced by normalization, it should be noted that, if data have gone through a closure-inducing normalization, no method exists that can "open up" such data (11, 12). In most analytical chemical treatments, fortunately, the raw data are available. This makes it possible either to keep the trivial correlations small or to some extent to predict the influence of closure.

THEORY To demonstrate the effect of closure on two chromatographic data sets we use principal components (PC) analysis, a common method for getting a view of multivariate data (13-15). PC analysis has been described in detail elsewhere (16); hence only a short presentation of pertinent features will be given. The data set, see Figure 1, is modeled by the PC expression in eq 1.
$$x_{ki} = \bar{x}_i + \sum_{a=1}^{A} t_{ka} b_{ai} + e_{ki} \qquad (1)$$
Here $x_{ki}$ represents the observed value of variable i (height or integral of peak i) of object k (chromatogram k). The parameters $\bar{x}_i$ are averages of the variables i. The PC parameters $b_{ai}$ and $t_{ka}$ are computed so that the residuals $e_{ki}$ are minimized in the least-squares sense. The smaller the residuals, the better the fit of the model, i.e., the more of the variance in the data is explained. The number A of statistically significant components (product terms) is determined by cross validation (20). PC based models are sensitive to the variance of the variables. In all of the following calculations, all variables are therefore scaled to the same variance (1.0). We emphasize that closure is a general problem that affects the data and hence any multivariate data analytic method applied to it, e.g., factor analysis (17), KNN (18), LDA (19), and ALLOC (20). It also affects ordinary variable by variable plots.

© 1984 American Chemical Society

ANALYTICAL CHEMISTRY, VOL. 56, NO. 9, AUGUST 1984

Figure 1. Data matrix X with elements $x_{ki}$.

Table I. Results of the Different Normalization Procedures^a

normalization        a        b                 c                 d
variables            6 var.   6 var.            6 var.            6 var.
closure              -        Norm (6) = 100    Norm (6) = 100    Norm high = 100
components           1        1                 (1)               (1)
variance explained   44%      47%               (33%)             (27%)

^a a, b, c, and d are defined in the text. The components in parentheses are not significant according to cross validation. Percentages, e.g., 44%, are the amount of variance explained by the PC model.
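The PC model of eq 1, applied to variables scaled to unit variance, can be illustrated with a minimal sketch. It assumes NumPy and obtains the least-squares solution from a singular value decomposition; the function names `autoscale` and `pc_model` are hypothetical, and the cross validation used in the paper to choose A is not reproduced here.

```python
import numpy as np

def autoscale(X):
    """Center each variable (column) and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pc_model(X, n_components):
    """Least-squares fit of x_ki = mean_i + sum_a t_ka * b_ai + e_ki.

    Returns scores T (t_ka), loadings B (b_ai), residuals E (e_ki) and
    the fraction of the variance explained by the retained components.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # The truncated SVD of the centered matrix gives the least-squares solution.
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]
    B = Vt[:n_components]
    E = X - mean - T @ B
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return T, B, E, explained
```

With a 15 x 6 matrix such as the spruce data, `pc_model(autoscale(X), 1)` would return one score per chromatogram, one loading per variable, and an explained fraction comparable in kind to the percentages in Table I.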
EXPERIMENTAL SECTION The first data set is a small pilot study of the amounts of monoterpenes in spruce and their relation to the geographical origin of the spruce. Needles from spruce were collected and a headspace sample was injected on a 2.0-m glass column (5% Dow 11 on Chromosorb W-AW-DMCS, 100-120 mesh). The amounts of monoterpenes were estimated as the peak height of each peak. Part of this data set (one geographical origin) was selected for the present study.

Closure can also be a problem in large data sets with many variables. This is shown by using part of the GC data from a study on volatile constituents of pine wood. Differences in the composition of volatile constituents between healthy and weakened trees were investigated. Weakened trees are more susceptible to insect infestation, and odor signals may be a significant factor in guiding the insects to suitable trees. Headspace analyses of wood samples of pines were performed on a 60-m OV-225 glass capillary column. A full description of the analytical procedures, the multivariate analysis, and the biological significance of the results for this second data set will be published later.

RESULTS Spruce Data with Six Variables. The test data contained 15 chromatograms (objects), each with six constituents (variables). To study the effect of closure, we performed PC analyses on the nonnormalized raw data and on data normalized in three different ways. The following PC analyses were made: (a) on the raw data; (b) on data summed to 100 for each object over all six variables; (c) on data summed to 100 for each object for the first five variables; (d) on data normalized as in MS, i.e., the largest peak set to 100 and the remaining expressed as percentages of this peak.
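The normalizations (b) to (d) can be written down in a few lines; treatment (a) is simply the raw matrix. This is a sketch under assumptions: NumPy is used, the function names are hypothetical, and (c) is read here as dropping the last variable and closing the remaining five, which is one interpretation of the description above.

```python
import numpy as np

def close_to_sum(X, columns=None, total=100.0):
    """Scale each object (row) so the chosen variables sum to `total`.

    Only the chosen columns are returned. columns=None uses all variables
    (treatment b); columns=slice(0, 5) gives treatment c as read here.
    """
    X = np.asarray(X, dtype=float)
    sub = X if columns is None else X[:, columns]
    return sub / sub.sum(axis=1, keepdims=True) * total

def close_to_largest(X, top=100.0):
    """Set the largest peak of each object to `top` (treatment d, MS style)."""
    X = np.asarray(X, dtype=float)
    return X / X.max(axis=1, keepdims=True) * top

# Hypothetical peak heights for two chromatograms with six variables.
X = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 5.0],
              [2.0, 2.0, 2.0, 2.0, 2.0, 10.0]])
b = close_to_sum(X)                # closed over all six variables
c = close_to_sum(X, slice(0, 5))   # closed over the first five variables
d = close_to_largest(X)            # largest peak set to 100
```

Each row of `b` and `c` sums to 100, while each row of `d` has a maximum of 100; in all three cases the absolute amounts in the raw data are lost.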
Figure 2. Plots of the relation between variables 2 and 3 for three different data treatments: (2a) nonclosed; (2b) closed over 6 variables; (2c) closed over 5 variables.

As can be seen from Table I, the difference between (a) and (b) is minor and the effect of closure is small. To demonstrate how sensitive to normalization a data set with few variables can be, one variable, the last, was deleted and a new round of PC analyses was performed. The PC results on five nonnormalized variables are similar to those on six nonnormalized variables and are not included in Table I. The data were thereafter normalized over five variables and a PC analysis was performed. The effect of this normalization is considerable and no statistically significant component is found. Another way of demonstrating the effect of closure is variable by variable plots. The two largest variables, 2 and 3, are plotted against each other in Figure 2. It can be seen that Figure 2a and Figure 2b are similar while Figure 2c is considerably different.

The normalization commonly used in mass spectrometry (d) affects the data strongly and gives a loss of information. No significant component is obtained and the calculated model describes less variance than all other models. This result indicates that the common habit in mass spectrometry of closing the largest variable to a constant (100) diminishes the possibility of extracting relevant information from mass spectral data.

Figure 3. Plot of the correlation between variables 9 and 14 for data closed over all variables (a) and the selectively closed data (b).

Pine Data with 31 Variables. One group of trees (weakened), giving a total of 12 samples, was subjected to two types of closure. The first normalization was the traditional one, i.e., to sum all 31 variables to 100. The second normalization was the newly proposed type (see below), where a homogeneous set of variables was selected and their sum normalized to 100 by a separate normalization factor for each object. Thereafter this normalization factor was used to scale all variables for this object. The selection of a homogeneous set of variables will be commented upon in the Discussion. The effect of the closure procedures was most profound for the two largest GC peaks. These are plotted against each other in Figure 3a (r = -0.85) and Figure 3b (r = 0.27). The strong spurious negative correlation stems from the fact that variables 9 and 14 have to occupy the same space: after closure, no case can exist where both 9 and 14 are large.
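The mechanism behind the spurious negative correlation can be reproduced with synthetic numbers. The sketch below does not use the pine data; it assumes NumPy and draws two large, independent variables plus ten small ones from hypothetical normal distributions, then closes each object to 100. It also checks the identity quoted from ref 7: for closed data the covariances of each variable with all the others sum to minus its variance, so every row of the covariance matrix sums to zero.

```python
import numpy as np

def close_rows(X, total=100.0):
    """Normalize every object (row) to a constant sum."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True) * total

# Synthetic, hypothetical data: two dominant peaks and ten minor ones,
# all generated independently of each other.
rng = np.random.default_rng(0)
n_objects = 5000
large = rng.normal(loc=100.0, scale=20.0, size=(n_objects, 2))
small = rng.normal(loc=5.0, scale=1.0, size=(n_objects, 10))
raw = np.hstack([large, small])
closed = close_rows(raw)

# Correlation between the two dominant variables before and after closure.
r_raw = np.corrcoef(raw[:, 0], raw[:, 1])[0, 1]
r_closed = np.corrcoef(closed[:, 0], closed[:, 1])[0, 1]

# For closed data each row of the covariance matrix sums to zero.
cov_closed = np.cov(closed, rowvar=False)
row_sums = cov_closed.sum(axis=1)
```

In this setup `r_raw` is close to zero because the variables were generated independently, whereas `r_closed` is strongly negative, purely as a consequence of the constant sum.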
DISCUSSION A number of Monte Carlo calculations of the results of closure have been presented (8, 21). The conclusions are that the effect of closure is small on variables with equal mean and variance when the number of variables is larger than 8 but that the effect increases as the number of variables decreases. As demonstrated in Figure 2a and Figure 2b, the effect of closure might be small even when as few as six variables are closed. That example shows that the effect of closure is dependent on the data structure of the nonclosed set. Skala has shown that if variables have different means and/or variances the closure effect might be strong even if the number of variables is large (21). These effects include the presence of artificial negative correlation among the large variables and sometimes a positive correlation between the smallest of the variables. Indeed, the artificial negative correlation between the largest variables was found in our second example, where the number of variables was fairly large.

Selective Closure: a Partial Solution. To avoid the most prominent closure effect, the artificial negative correlation among large peaks, we performed a selective closure of the data over a subset of the variables. The variables were selected according to the following criteria. First, the means and standard deviations of the variables should be approximately of the same size, and second, the number of variables in the selected set should be as large as possible. The selection of variables is conveniently made from a plot of the standard deviation for each variable against the mean for each variable. In Figure 4a it is apparent that variables 9 and 14 are almost an order of magnitude larger than the others; they were therefore excluded and the remaining variables plotted again. The result, shown in Figure 4b, is that the size and standard deviation of the variables are now reasonably homogeneous even if there still are differences in variable means and standard deviations.

Figure 4. Plots of mean value vs. standard deviation for the variables (crosses indicate many numbers on top of each other).

The 12 selected variables were normalized with a normalization factor so that for each object they had a sum of 100. The remaining variables, that is, the two large (9, 14) and the small variables, were thereafter multiplied by the normalization factor associated with each object. The small variables should not be included, to avoid positive correlations between them (7). This will have the effect that closure will be limited to the 12 included variables while the remaining variables are not included in the constant sum, and thereby they will not be directly affected by closure. We want to stress that this selective normalization does indeed give a closure effect on the data, but due to the selection of many homogeneous variables it will hopefully be small.

Our recommendation for closed GC data is therefore that variable-variable plots for a few of the largest variables should be made and, if a strong negative correlation exists, the selective normalization should be used instead. We do not recommend attempts to try a number of different normalization procedures, since this will increase the risk for spurious results. It is also important that the same type of normalization is done on all samples included in the data set. Attempts to normalize different parts of the data in different ways will give nonsense separations between these parts.

Practical results of the use of multivariate methods in mass spectrometry have so far been meager (22). One reason for this may be the habit of closing the raw data by setting the largest peak to 100. This usually will mean a loss of information compared to other forms of closure. It should also be noted that on most MS systems the raw data are available, and our belief is that these nonclosed data could be used instead.

CONCLUSION If the data are closed, no method exists that can "open up" the data (12). When the raw data are available there exist ways to keep the effect of closure small and thereby to allow the maximal amount of information to be extracted. The suitability of the selective closure procedure proposed in this paper, and all other closure procedures, is dependent on both the internal structure of the data and the scope of the investigation. Our recommendation is that normalization should be used with caution. If normalization is necessary, the selection of normalization procedure should be based on the prior information about the raw data structure and the scope of the investigation.
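The selective closure described in the Discussion amounts to computing one normalization factor per object from the selected homogeneous variables and applying it to every variable of that object. A minimal sketch follows, assuming NumPy; the function name `selective_closure` and the example numbers are hypothetical, and the choice of the selected set is left to the analyst (in the paper it is made by eye from the mean vs. standard deviation plot).

```python
import numpy as np

def selective_closure(X, selected, total=100.0):
    """Scale all variables of each object by the factor that makes the
    selected variables of that object sum to `total`.

    Returns the scaled matrix and the per-object normalization factors.
    """
    X = np.asarray(X, dtype=float)
    factor = total / X[:, selected].sum(axis=1, keepdims=True)
    return X * factor, factor.ravel()

# Hypothetical example: columns 0 and 1 are dominant peaks that are kept
# out of the constant sum; columns 2 to 4 form the homogeneous set.
X = np.array([[50.0, 40.0, 2.0, 3.0, 5.0],
              [80.0, 20.0, 4.0, 4.0, 2.0]])
scaled, factors = selective_closure(X, selected=[2, 3, 4])
```

Only the selected columns are forced to a constant sum; the dominant peaks are rescaled by the same factor but remain free to vary together, which is why the artificial negative correlation between them is not imposed.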
ACKNOWLEDGMENT Stimulating discussions with Kim Esbensen, Umeå, and Hal MacFie, Meat Research Institute, Bristol, United Kingdom, are gratefully acknowledged.

LITERATURE CITED
(1) Massart, D. L.; Dijkstra, A.; Kaufman, L. "Evaluation and Optimization of Laboratory Methods and Analytical Procedures"; Elsevier: Amsterdam, 1978.
(2) Frank, I. E.; Kowalski, B. R. Anal. Chem. 1982, 54, 232R-243R.
(3) Kowalski, B. R., Ed. "Chemometrics, Theory and Practice"; American Chemical Society: Washington, DC, 1977; ACS Symp. Ser. No. 52.
(4) Wold, S.; Dunn, W. J., III J. Chem. Inf. Comput. Sci. 1983, 23, 6-13.
(5) Blomquist, G.; Johansson, E.; Söderström, B.; Wold, S. J. Chromatogr. 1979, 173, 7-19.
(6) McLafferty, F. W. "Interpretation of Mass Spectra", 3rd ed.; University Science Books: California, 1980.
(7) Chayes, F. "Ratio Correlation, a Manual for Students of Petrology and Geochemistry"; The University of Chicago Press: Chicago, IL, 1971.
(8) Skala, W. Chem. Geol. 1979, 27, 1-9.
(9) Spjøtvoll, E.; Martens, H.; Volden, R. Technometrics 1982, 24, 173-186.
(10) Meuzelaar, H. L. C.; Haverkamp, J.; Hileman, F. D. "Pyrolysis Mass Spectrometry of Recent and Fossil Biomaterials"; Elsevier: Amsterdam, 1982.
(11) Karrer, L. M.; Gordon, H. L.; Rothstein, S. M.; Miller, J. M.; Jones, T. R. B. Anal. Chem. 1983, 55, 1723-1728.
(12) Butler, J. C. J. Math. Geol. 1981, 13, 53-88.
(13) Wold, S.; Sjöström, M. "Chemometrics, Theory and Practice"; Kowalski, B. R., Ed.; American Chemical Society: Washington, DC, 1977; ACS Symp. Ser. No. 52, p 243.
(14) Derde, M. P.; Coomans, D.; Massart, D. L. Anal. Chim. Acta 1982, 141, 187-192.
(15) Kvalheim, O. M.; Øygard, K.; Grahl-Nielsen, O. Anal. Chim. Acta 1983, 150, 145-152.
(16) Wold, S.; Albano, C.; Dunn, W. J., III; Esbensen, K.; Hellberg, S.; Johansson, E.; Sjöström, M. Proc. IUFOST Conf. Food Research and Data Analysis; Martens, H., Russwurm, H., Eds.; Elsevier: Amsterdam, 1983.
(17) Malinowski, E. R.; Howery, D. G. "Factor Analysis in Chemistry"; Wiley-Interscience: New York, 1980.
(18) Varmuza, K. Fresenius Z. Anal. Chem. 1974, 268, 352-356.
(19) Stuper, A. J.; Brugger, W. E.; Jurs, P. C. "Computer Assisted Studies of Chemical Structure and Biological Function"; Wiley-Interscience: New York, 1979.
(20) Coomans, D.; Massart, D. L.; Broeckaert, I.; Tassin, A. Anal. Chim. Acta 1981, 133, 215-224.
(21) Skala, W. J. Math. Geol. 1979, 9, 519-529.
(22) Varmuza, K. "Pattern Recognition in Chemistry"; Springer-Verlag: Berlin, 1980.
RECEIVED for review November 28, 1983. Accepted March 27, 1984. Grants from the Swedish Natural Science Research Council (NFR), the Swedish Council for Planning and Coordination of Research (FRN), and the National Swedish Board for Technical Development (STU) are gratefully acknowledged. K.S. acknowledges a fellowship from "Ingenjor Ernst Johnsons Fond".