Application of factor analysis to the study of mixed retention

laboratories is 17% on the Ellenberger Crude and the Heating Oil and 50% on the Light Arabian Crude. Since Laboratory No. 3 reported average values, the data were not included in the statistical evaluation; however, on the Ellenberger Crude and the Light Arabian Crude, the results were within the calculated relative standard deviation. However, on the Heating Oil, the result was outside the standard deviation obtained between laboratories.

CONCLUSION Part-per-billion levels of manganese can be determined directly in petroleum matrices by a heated vaporization atomic absorption technique. In this method, samples, after dilution in solvent, are analyzed using a method of micro standard additions. Comparison of the data obtained with results based on ashing of samples has demonstrated that the procedure compensates for matrix effects and no background absorption is encountered. An interlaboratory cross-check program indicates that over the 10-100 ng/g range, the relative standard deviation is 10% at a given laboratory and is 20% between independent laboratories. Above 100 ng/g, the relative precision in a given laboratory is 30%, and 50% between laboratories. The precision obtained at several other laboratories is the same as that obtained in this laboratory.

ACKNOWLEDGMENT The author is grateful to R. A. Hofstader for his many useful comments and generous support in this work.

RECEIVED for review May 17, 1974. Accepted September 4, 1974.

Application of Factor Analysis to the Study of Mixed Retention Mechanisms in Gas-Liquid Chromatography and Comparison to Linear Regression Analysis

Paul H. Weiner, H. L. Liao, and Barry L. Karger

Department of Chemistry, Northeastern University, Boston, Mass. 02115

The mathematical technique of factor analysis has been applied to the study of mixed mechanisms of retention in the gas-liquid chromatography of a series of simple solutes on an aqueous electrolyte phase of 1.46 molal (CH3CH2)4NBr. For saturated alkane solutes, factor analysis shows that only one factor is sufficient to span the factor space and reproduce all data within experimental error. For the polar and unsaturated solutes, two factors are required to reproduce the data within its experimental precision. The abstract factors are identified with the physically significant parameters of gas-liquid interfacial surface adsorption and bulk gas-liquid partition. Distribution constants were extracted from this system by factor analysis, and these results are compared with similar results obtained by applying linear regression analysis to the same data.

Gas-liquid chromatography has been quite successful in the measurement of accurate thermodynamic distribution parameters when the solute retention is a function of only one mechanism (i.e., bulk gas-liquid partition, gas-liquid interfacial adsorption, etc.) (1). This ideal situation is expected to exist for such simple systems as alkane solutes in alkane liquid phases. Even for these simple systems, however, more than one mechanism of retention might be observed if the experimental data were taken with greater precision. For many other systems, the solute's net retention will be a function of mixed mechanisms (2-9).

Author to whom reprint requests should be sent.

(1) C. L. Young, Chromatogr. Rev., 10, 129 (1968).
(2) J. R. Conder, D. C. Locke, and J. H. Purnell, J. Phys. Chem., 73, 700 (1969).
(3) J. R. Conder, J. Chromatogr., 39, 273 (1969).
(4) H. L. Liao, D. E. Martire, and J. P. Sheridan, Anal. Chem., 45, 2087 (1973).


If it is assumed that the solute obeys Henry's law with respect to all the mechanisms of retention, and that these mechanisms are not totally linearly dependent on one another, then the net retention volume, $V_N^0$, per gram of packing can be expressed as

$$V_N^0 = \sum_i K_i \phi_i \qquad (1)$$

where $K_i$ is the distribution constant for each mechanism of solute retention, $\phi_i$ is the corresponding intensive property of the system, such as the volume or surface area of the liquid phase per gram of packing, and the sum is taken over the various mechanisms of retention present in the system. The distribution constants in Equation 1 can be determined by studying a solute's net retention as a function of changes in the $\phi_i$'s. If the various $\phi$'s can be determined from separate experiments, then it is possible to solve Equation 1 for the $K_i$ values. For a good discussion on the extraction of $K_i$ values in mixed retention mechanisms, see Conder, Locke, and Purnell (2).

Recently, Devillez, Eon, and Guiochon (9) have postulated that additional mechanisms of retention may be important in certain cases, namely the Kelvin capillary condensation of the liquid phase into the pores of the solid support, thereby modifying the surface characteristics of the liquid surface. Depending on the precision of the measurements and the type of system, Equation 1 may contain four or more terms. It then becomes quite difficult to extract the desired distribution constants from the experimental retention data by either of the typical graphical or linear regression approaches of data analysis. The problem is particularly serious when the experimental retention data are limited to four or five data points for each solute, the usual case for most chromatographic studies in this area. [For a discussion of the relative merits of the graphical and linear regression procedures when the retention is a function of only one or two mechanisms of retention, see (6).] As the number of retention mechanisms in a given problem increases, the procedures that must be employed to obtain suitable quantities which will yield linear plots become quite cumbersome. Moreover, in some cases, it is not even clear exactly what mechanisms of retention are operative or even the exact number of mechanisms present within the data. In these latter situations, the extraction of accurate distribution coefficient information by a graphical or regression approach is tenuous.

For these reasons, we would like to propose factor analysis (10-15), an alternate form of data handling which can eliminate some of the problems associated with the graphical or linear regression approaches. (It is worth pointing out that Nikolov has recently treated the mixed mechanism problem by matrix algebra (16); however, the approach in the present paper is different.) Factor analysis is a multidimensional curve fitting scheme which can be used for linear equations in the form of Equation 1. Since the curve fitting takes place in an n-dimensional space, it is not necessary to mathematically transform the equation governing the system to extract quantities which will yield a linear plot in two dimensions. In regression analysis, one attempts to find a set of independent variables that regress onto a single dependent variable (17). Factor analysis, however, allows the investigator to study more than one dependent variable within the same analysis. This is especially important in the case where the net retention volumes of a given solute are measured only on four or five different per cent column loadings. Moreover, one can combine the data of the net retention volumes of a series of solutes taken on several liquid loadings into one problem to test the model and to extract the pertinent distribution constant information for the individual solutes. In essence, there are more data available to delineate the factors present within the data space. Therefore, factor analysis can be considered as a global averaging procedure which can potentially increase the precision of the extracted values from a set of data because of the enlarged data base on which it operates. This can be especially important in mixed retention studies where the number of changes in column conditions (e.g., liquid loading) within one problem is small.

One special advantage of factor analysis is its potential ability to determine the number of factors present in a data set without the necessity of first identifying the factors. While it is not presently possible by a simple test of factor analysis to ascertain unequivocally the true dimensionality (within the precision of the measurements) of a given data set, one can arrive at a conclusion with a fair degree of certainty. In the present paper, we will try to present the reasoning used to obtain a good estimate of the true dimensionality of a data set.

In the present study, factor analysis will be applied to gas-liquid chromatographic data in which it is known that more than one mechanism of retention is operative, namely, the retention of nonpolar and polar solutes on an aqueous electrolyte liquid phase (8). It is known that the dominant mechanism of retention should be gas-liquid interfacial adsorption, and the secondary mechanism of retention should be bulk gas-liquid partition. Using this system as an example, the results of both a factor analysis and a linear regression solution to the problem will be presented. Since there are only two factors in the problem, this example unfortunately does not fully contrast the relative merits of factor analysis and regression analysis. However, precise data are not available for a well defined system known to be affected by more than two mechanisms of retention.

(5) B. L. Karger, P. A. Sewell, R. C. Castells, and A. Hartkopf, J. Colloid Interface Sci., 35, 328 (1971).
(6) C. Eon, A. K. Chatterjee, and B. L. Karger, Chromatographia, 5, 28 (1972).
(7) P. Urone, Y. Takahisi, and G. H. Kennedy, Anal. Chem., 40, 1130 (1968).
(8) B. L. Karger and H. L. Liao, Chromatographia, 7, 288 (1974).
(9) C. Devillez, C. Eon, and G. Guiochon, J. Colloid Interface Sci., in press.

ANALYTICAL CHEMISTRY, VOL. 46, NO. 14, DECEMBER 1974

EXPERIMENTAL The experimental details of the measurement of net retention volumes per gram of packing with the aqueous stationary phases have been described previously (8). Sufficiently small sample sizes were taken such that all measurements were made under Henry's law conditions. (Symmetrical solute elution peaks were found.) All results were the average of three measurements with a precision in the net retention volumes of ±2%. The columns were checked for solvent bleed, and the per cent water and salt in the packing before and after runs agreed within 0.25%. The surface areas of the liquid phases were estimated from the measured net retention volume of octane as a reference solute (8). Briefly, in this procedure, a simple density correction for the presence of the added electrolyte was made to the surface area of a similarly loaded pure water support of known surface area. From this standard column (20% w/w aqueous solution on Porasil D), the change in surface area as a function of per cent liquid loading was estimated by measuring the net retention volume per gram of packing of n-octane, a solute known to be retained only by surface adsorption. The accuracy of the reported surface areas per gram of packing is dependent on the assumptions; however, the relationship between different loadings can be assumed to be quite accurate.

THEORY The details of the factor analysis model have been presented previously (10). In the present section, we will limit ourselves to a brief discussion of the rationale for applying factor analysis to the problem of mixed retention mechanisms, and the comparison of the relative merits of factor and regression analysis as data manipulators in this problem. Factor analysis can be used whenever the observable measurement of a system can be expressed as a linear sum of product functions as indicated by Equation 1. In the present problem, this requirement is met if each mechanism can be written as a separate term consisting of a product of a solute and stationary phase function and if, for convenience, the retention data are taken on the linear portions of the isotherms of each distribution process. For clarity, we can rewrite Equation 1 in matrix notation

$$[V_N^0] = [\Phi][K] \qquad (2)$$

In the factor analysis scheme, an experimental array of retention data, $[V_N^0]$, is mathematically decomposed into a product of two other matrices which can then be associated with $[K]$ and $[\Phi]$ by a linear transformation. For success, it is critical that the smallest dimension (i.e., row or column) of the data array be greater than the number of factors present in the data space. This requirement on data availability can be contrasted with the situation in regression analysis. The linear regression model treats a single dependent variable at a time and therefore is restricted to a data base consisting of a single column of $[V_N^0]$ for each problem (i.e., a single solute). Regression analysis can thus be performed on a smaller data base than that of factor analysis; however, enough experimental data are often available (i.e., retention for several solutes) that this is not a serious problem.

In the decomposition of $[V_N^0]$, a square symmetric correlation matrix $[C]$ is first formed by premultiplying the original data array by its transpose $[V_N^0]^T$ (i.e., a matrix formed by interchanging the rows and columns)

$$[C] = [V_N^0]^T [V_N^0] \qquad (3)$$

(10) P. H. Weiner, E. R. Malinowski, and A. R. Levinstone, J. Phys. Chem., 74, 4537 (1970).
(11) P. H. Weiner and E. R. Malinowski, J. Phys. Chem., 75, 1207 (1971).
(12) P. H. Weiner, C. K. Dack, and D. G. Howery, J. Chromatogr., 69, 249 (1972).
(13) P. H. Weiner and J. F. Parcher, Anal. Chem., 45, 302 (1973).
(14) P. H. Weiner, J. Amer. Chem. Soc., 95, 5485 (1973).
(15) D. G. Howery, Anal. Chem., 46, 829 (1974).
(16) R. N. Nikolov, Chromatographia, 4, 565 (1971).
(17) L. B. Anderson, Chem. Eng., 173 (1963).

To give equal weight to all the columns of data, the data matrix is first normalized, so that a normalized $[C]$ is formed. The normalized correlation matrix acquires its name from the fact that each of its elements represents the cosine of the angle between two vectors (i.e., columns) of the original normalized data matrix. Therefore, the main diagonal of the normalized correlation matrix contains all unities, since these elements represent the correlation of a column of the data matrix with itself. The off-diagonal elements, on the other hand, can have values which range from +1 to -1, depending on the degree of correlation between the two column vectors. In vector analysis notation, each element of the normalized correlation matrix represents the dot product of two normalized (to unit length) columns of the data matrix. Forming the correlation matrix $[C]$ can also be looked upon as an error smoothing or global averaging step, since each element of the correlation matrix is calculated by summing the products of corresponding elements of two appropriate columns of $[V_N^0]$. For example, the element situated in the first row and third column of $[C]$ is obtained by summing the products of corresponding elements of the first and third columns of $[V_N^0]$. The smoothing arises if the errors in the elements of $[V_N^0]$ are random. This global smoothing step can be contrasted with the situation present in a regression analysis, where each separate dependent variable (column of $[V_N^0]$ data) stands alone and no preanalysis step except replication is used to smooth the raw data. The effect of this smoothing in factor analysis can be achieved in regression analysis by increasing the number and range of data points that are included within the dependent variable of a problem. The next step in factor analysis is to determine the number of abstract factors required to reproduce the original data within experimental error.
This is accomplished by ordering the abstract factors of the correlation matrix in importance in terms of their respective eigenvalues and then by attempting to recalculate the original data set as less important eigenvectors are included within the factor space. By using this procedure, it is not always possible to make an unequivocal decision on the dimensionality of the data set. This step of the analysis is usually only powerful enough to narrow the possible choice down to a small range of values. Other supporting information, to be discussed later, can further narrow the choice. There is no similar capability available for the users of regression analysis, since the theoretical model to be tested must be specified before the analysis begins.

Once the number of abstract factors required to reproduce the original data within experimental error has been estimated to within a small range, the identification of the abstract factors with their physically significant counterparts can be attempted. This is accomplished by testing each suspected factor using a series of rotation matrices of dimensionality in the range determined by the data recalculation step discussed earlier. By studying the fit of the suspected test factors as the dimensionality of the rotation matrix is varied, one can sometimes obtain further information concerning the dimensionality of the space, since the rotation step will yield successful results for all real test factors only when the rotation matrix is of the proper dimensionality. With the factor analysis scheme in its present form, this identification step can be performed for each suspected factor without any a priori knowledge of the nature of the other factors present in the scheme. For example, if in a mixed retention mechanism problem there were contributions from bulk gas-liquid partition, gas-liquid interfacial adsorption, and solid surface adsorption, it would be possible to test models associated with each mechanism of retention without having to specify models for the other two mechanisms of retention at the same time. This capability of testing individual factors becomes more useful as the complexity of the problem being studied increases, assuming that the constraints imposed by the factor analysis model are still valid.

In factor analysis, it is possible to identify many physically significant parameters with the same abstract factor or combination of abstract factors of the space. Therefore, in a given problem, there is no guarantee that if a set of factors is found which separately are associable by a set of linear transformations with the abstract factors of a space, a linear combination of these same test factors will account for all independent factors in the space. To ascertain whether a set of independent test factors has indeed been found, it is necessary to attempt to recalculate the original data in terms of a linear combination of these physically significant test parameters.
In making this calculation, however, factor analysis does not impose the constraint that a solution must be found that minimizes the deviations between experimental and predicted data, as is the case in a regression analysis of the data. [This limitation of regression analysis has been well recognized in the literature (18, 19).] The original data will be reproduced only if the test factors contain within themselves adequate measures of all the important independent factors of the space.

One cautionary statement should be made concerning the identity of the factors found in a factor analysis treatment of a data set. Just as in regression analysis, there is no guarantee that a particular solution, which adequately accounts for the variance in the data, is a unique solution to the problem. All that can be said is that within the framework of the solution found to the problem, there are adequate measures for all the true factors present in the space.

Finally, in the comparison of the merits of factor and linear regression analysis, there is one area in which regression analysis has a clear advantage. Because of the minimum error constraint built into regression analysis, it is possible to place a confidence limit on the β-weighting factors determined from the data treatment (17). In the present problem, the β-weightings correspond to the various distribution constants of the system. On the other hand, no confidence limits can be set on the K values obtained from factor analysis in its present form. This is a serious shortcoming, since it is important to estimate the precision of the extracted distribution constants, especially in a problem of thermodynamic interest. It is possible to circumvent this limitation of factor analysis by estimating the confidence limits through a general method of data treatment initiated by Mosteller (20) and developed further by other investigators (21) called the "jackknife" method. The rationale behind this method is as follows. Two different estimates for the desired β-weightings are calculated. One set is obtained using the entire data base and the test factors. Let these coefficients (distribution constants in the present problem) be labelled $K$. The second set of coefficients, $K'$, is calculated from data sets in which a given row is deleted from the data matrix and test factors. A set of $K'$ values can thus be calculated for each of the subdata bases corresponding to the deletion of a different row from the original data set of $r$ rows. These $K'$ values are then transformed to a set of $K^*$ values according to Equation 4

$$K^* = rK - (r - 1)K' \qquad (4)$$

The arithmetical mean of this latter set of values, $\bar{K}^*$, will then represent a "best estimate" of the distribution coefficient (even better than the distribution constant obtained from the total data base)

$$\bar{K}^* = \frac{1}{r}\sum_{i=1}^{r} K_i^* \qquad (5)$$

Moreover, if the various $K^*$ values follow a normal distribution, it is possible to calculate interval estimates for these coefficients using the standard t-distribution with the aid of the following relationships

$$\mu = \bar{K}^* \pm t\,\frac{S^*}{\sqrt{r}} \qquad (6)$$

The $K^*$ coefficients are assumed to be independent and normally distributed so that the variance, $S^{*2}$, of $K^*$ (for each distribution constant) can be estimated from

$$S^{*2} = \frac{1}{r-1}\sum_{i=1}^{r}\left(K_i^* - \bar{K}^*\right)^2 \qquad (7)$$

where $\mu$ = the theoretical mean, $t$ = the Student's t-distribution with $r - 1$ degrees of freedom, and $(1 - \alpha)$ = the confidence level (22).

Although the jackknife method affords the investigator some quantitative measure on the reliability of his extracted β-weights, these values have little or no significance unless one is satisfied that the set of chosen test factors adequately represents the data. In regression analysis, the calculation for the confidence limits also involves a measure of the fit of the model to the data (17). A similar calculation is not directly possible when the jackknife method is used. In the present paper, the closeness of fit will be estimated by calculating both the overall mean error and percentage error of data reproduction. In factor analysis, these quantities can be calculated for data reproduction in terms of both abstract and physically significant test factors. If the theoretical model is reasonable, then the calculated values of mean error and percentage error using physically significant parameters should fall quite close to their counterparts using the same number of abstract factors. At this point, the investigator would know that he could not make a large improvement in the fit of his data without introducing more terms into his model.

(18) R. A. Stowe and R. P. Mayer, Ind. Eng. Chem., 61, 11 (1969).
(19) R. P. Mayer and R. A. Stowe, Ind. Eng. Chem., 61, 43 (1969).
(20) S. Mosteller, Rev. Inst. Int. Statist., 39, 363 (1971).
(21) H. L. Grey and W. R. Schucany, "The Generalized Jackknife Statistic," Marcel Dekker, New York, N.Y., 1972.
(22) L. B. Anderson, Chem. Eng., 139 (1963).

Table I. Retention Volume vs. Stationary Phase Loading at 12.5 °C, Data Base for Factor and Regression Analyses (a)

                                    Column (b)
Solute                      1        2        3        4        5
Class A
  n-Hexane                4.11     5.65     6.46     7.66     9.15
  Cyclohexane             3.64     4.70     5.50     6.64     7.83
  n-Heptane              10.57    14.16    16.61    19.62    21.65
  2-Methylheptane        21.85    30.27    33.97    40.53    49.22
  n-Octane               27.04    38.71    42.08    50.17    61.08
Class B
  Carbon tetrachloride    7.12     9.40    10.29    12.18    14.41
  Methylene chloride     15.81    17.86    17.92    19.84    22.43
  Chloroform             30.15    39.43    42.01    49.13    57.89
  Benzene                25.30    33.09    35.27    41.32    48.73
  Toluene                66.33    90.87    98.43   111.0    139.9

(a) Packing: 1.46 m tetraethylammonium bromide coated on 80/100 mesh Porasil D.
(b) 1: V_L^0 (ml/g) = 0.29, A_L^0 (cm²/g) = 7.8 × 10^4; 2: V_L^0 = 0.25, A_L^0 = 11.2 × 10^4; 3: V_L^0 = 0.205, A_L^0 = 12.3 × 10^4; 4: V_L^0 = 0.17, A_L^0 = 14.6 × 10^4; 5: V_L^0 = 0.14, A_L^0 = 17.8 × 10^4.

RESULTS AND DISCUSSION Shown in Table I are the net retention volumes per gram of packing of a series of solutes chromatographed on different loadings of an aqueous solution of 1.46 molal (CH3CH2)4NBr. The solutes designated A are suspected from previous work (8) of being retained solely (in terms of the precision of the measurements) by gas-liquid interfacial adsorption. The solutes designated B are suspected of being retained by both gas-liquid interfacial adsorption and bulk liquid partition. The precision of the net retention volumes is ±2%. The distribution constants extracted by linear regression analysis from the data in Table I are shown in the last column of Table VI. For class A solutes, Equation 1 reduced to

$$V_N^0 = K_A A_L^0 \qquad (8)$$

where $K_A$ = the adsorption constant at the gas-liquid interface and $A_L^0$ = the surface area of the coated liquid phase per gram of packing. For the class B solutes, the following equation was used to extract the desired surface and bulk liquid distribution constants from regression analysis

$$V_N^0 = K_A A_L^0 + K_L V_L^0 \qquad (9)$$

where $K_L$ = the bulk liquid distribution constant and $V_L^0$ = the stationary phase volume per gram of packing. It has been previously shown that Kelvin retention is negligible within our present system (8). The values listed in the last column of Table VI are previously published distribution constants obtained by separate regression analyses of the data for each solute. They represent a "best" estimate in a least squares sense. We would now like to compare these results with those obtained by applying factor analysis to the same data. We would also like to illustrate how factor analysis can handle a general problem in which the number of retention mechanisms for the various solutes is not known a priori. To present the results in a clear-cut fashion, the data in Table I were divided into several sets. Separate analyses were performed on the class A solutes treated as a group (Case I) and the class B solutes treated as a group (Case II).
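As a numerical illustration of the regression route, the sketch below fits the two-term form of Equation 1 to the benzene data of Table I by ordinary least squares and then attaches jackknife limits following the procedure outlined in the THEORY section. It is a minimal sketch under stated assumptions, not the original calculation: the use of NumPy, the helper names `fit` and `jackknife`, the unweighted two-term fit without intercept, and the tabulated Student's t value of 2.776 (95% confidence, four degrees of freedom) are all choices made here for illustration.

```python
import numpy as np

# Column conditions and benzene retention volumes transcribed from Table I.
A_L = np.array([7.8, 11.2, 12.3, 14.6, 17.8]) * 1e4   # surface area, cm^2/g
V_L = np.array([0.29, 0.25, 0.205, 0.17, 0.14])       # liquid volume, ml/g
V_N = np.array([25.30, 33.09, 35.27, 41.32, 48.73])   # benzene, ml/g

# Design matrix for the assumed model V_N = K_A * A_L + K_L * V_L.
X = np.column_stack([A_L, V_L])

def fit(X, y):
    """Ordinary least-squares estimate of [K_A, K_L] (hypothetical helper)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def jackknife(X, y, t_value):
    """Delete-one jackknife: pseudo-values K* = r*K - (r - 1)*K'."""
    r = len(y)
    K = fit(X, y)                                     # estimate from all rows
    pseudo = np.array([
        r * K - (r - 1) * fit(np.delete(X, i, axis=0), np.delete(y, i))
        for i in range(r)
    ])
    mean = pseudo.mean(axis=0)                        # mean of the K* values
    s = pseudo.std(axis=0, ddof=1)                    # sample std. dev. of K*
    half_width = t_value * s / np.sqrt(r)             # half-width of interval
    return K, mean, half_width

# t_value is an assumption: two-sided 95% level with r - 1 = 4 degrees of freedom.
K, K_star, half_width = jackknife(X, V_N, t_value=2.776)
print("least squares  K_A, K_L:", K)
print("jackknife mean K_A, K_L:", K_star)
print("interval half-widths   :", half_width)
```

The same `jackknife` routine could in principle wrap any estimator that returns distribution constants, including one based on factor analysis, by replacing `fit`.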


Table II. Eigenvalues for Five Data Combinations (See Table I) (a)

Case I          Case II         Case III        Case IV         Case V
4.9982          4.9905          9.9821          5.9967          5.9881
1.44 × 10^-3    9.43 × 10^-3    1.62 × 10^-2    1.87 × 10^-3    1.18 × 10^-2
3.37 × 10^-4    2.65 × 10^-5    1.37 × 10^-3    1.16 × 10^-3    5.82 × 10^-5
2.15 × 10^-5    1.65 × 10^-?    3.48 × 10^-4    2.63 × 10^-4    5.81 × 10^-?
5.56 × 10^-9    5.21 × 10^-?    2.07 × 10^-5    8.95 × 10^-?    4.82 × 10^-7

(a) Case I: Class A (Table I). Case II: Class B. Case III: Class A + Class B. Case IV: Class A + chloroform. Case V: Class B + n-octane.

These two cases represented the clearest division of the solutes based on our previous knowledge of the system concerning the number of mechanisms of retention present in the data (8). The results at various stages in the factor analyses of these two sets will be used as standards to compare with other sets. Various combinations of the class A and B solutes were also studied by factor analysis. First, all the class A and B solutes were combined into one set (Case III). Second, in order to test the sensitivity of the factor analysis approach, two extreme sets were studied: all class A solutes plus chloroform (Case IV) and all class B solutes plus n-octane (Case V). Of the various sets, Case IV will present the severest test of the method, since the second factor, namely partition, is represented to such a small extent (see Table VI).

Shown in Table II are the eigenvalues obtained by diagonalizing the normalized correlation matrix of each of the five data sets analyzed. Only the first five eigenvalues are shown for each case, since each of the data sets contains only 5 rows (i.e., five column loading conditions), and a data set cannot contain more significant eigenvalues or eigenvectors than the smallest dimension of the data array. It is difficult to make use of the information in this table to ascertain positively the dimensionality of the five data sets; however, it is quite useful to see qualitatively how the relative magnitudes of the first three eigenvalues change from case to case.
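The eigenvalue step can be sketched numerically for the class A solutes (Case I). The block below is a minimal illustration under stated assumptions rather than the original program: it uses NumPy, the helper name `normalized_correlation` is hypothetical, and the retention volumes are transcribed from Table I with rows taken as the five column loadings and columns as the five solutes. No claim is made that the printed eigenvalues match Table II to the digits shown there.

```python
import numpy as np

# Net retention volumes (ml/g) for the class A solutes, transcribed from Table I.
# Rows: column loadings 1-5. Columns: n-hexane, cyclohexane, n-heptane,
# 2-methylheptane, n-octane.
V = np.array([
    [4.11, 3.64, 10.57, 21.85, 27.04],
    [5.65, 4.70, 14.16, 30.27, 38.71],
    [6.46, 5.50, 16.61, 33.97, 42.08],
    [7.66, 6.64, 19.62, 40.53, 50.17],
    [9.15, 7.83, 21.65, 49.22, 61.08],
])

def normalized_correlation(data):
    """Scale every column to unit length, then form C = V^T V."""
    unit = data / np.linalg.norm(data, axis=0)
    return unit.T @ unit

C = normalized_correlation(V)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order;
# reverse them so the most important abstract factor comes first.
eigenvalues = np.linalg.eigvalsh(C)[::-1]
print(eigenvalues)
```

Because every column is normalized to unit length, the diagonal of `C` is unity and the eigenvalues sum to the number of solutes, which is why the leading eigenvalue grows with the size of the data set.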
An examination of the first eigenvalues (relative to the second ones) for each of the five cases reveals that there seems to be one predominant factor present within each data set. The changing magnitude of this first eigenvalue among the five cases simply reflects the different size data matrices that are being analyzed. From our a priori knowledge of the processes involved, it is easy to speculate that this dominant factor is associated with gas-liquid interfacial adsorption. We next note that the second eigenvalue is approximately a factor of 10 larger in Case II than in Case I. It can be concluded that there is some likelihood that a second factor is present in Case II relative to Case I. This is reasonable, since it is expected that Case II, which contains the class B solutes, should be governed by both gas-liquid adsorption and bulk gas-liquid partition, while Case I should be governed by only one mechanism of retention, namely gas-liquid interfacial adsorption. However, this conclusion can only be tentative, since the second eigenvalue is quite small in Cases I and II. For Case III, which contains all class A and B solutes, the second eigenvalue is quite similar in magnitude to that in Case II, indicating the probability that two factors are also important in the former set. On the other hand, the third eigenvalue for Case III is 10 times smaller than the second eigenvalue, and it can be tentatively concluded that a third factor is negligible. Case V presents no further problem of interpretation, since the results are similar to those found in Cases II and III. In Case IV, however, the results are ambiguous, with the second and third eigenvalues being of the same order of magnitude. This is somewhat expected, since this data set is predominantly composed of class A solutes with only one class B solute containing a small second retention mechanism. Again, the relative magnitudes of the eigenvalues should be used as indications of the number of terms, not as positive proof.

Another tool in the factor analysis scheme that can aid the investigator in judging the dimensionality of his data sets is the recalculation of the original data in terms of abstract factors. This is accomplished by systematically increasing the number of abstract factors that are included in the data recalculation step and then comparing the recalculated data to the experimental data with respect to the experimental precision. In this comparison, it would be useful to have a single measure to judge the agreement, e.g., the correlation coefficient in regression analysis. Unfortunately, no corresponding quantity is available with factor analysis. To obtain some measure of the degree of fit, we will report the mean error (in terms of retention volume per gram of packing) and the per cent error of data recalculation. When these criteria are not sensitive enough to make a reasonable judgment on the dimensionality of the data (because of a large range of solute retention volumes included within the data set), the mean error and per cent error will also be presented on an individual column by column basis of the data set. In this manner, the error of data reproduction can be judged in relation to the magnitude of the quantity being recalculated.

We also tried to use as a measure of data reproduction the ψ function proposed by Exner (23), which basically compares the error variance to the data variance. However, this function was not sensitive enough to make a clear-cut decision concerning the true dimensionality of the data studied here. This situation might have arisen because of the presence of data sets which are dominated by one major factor. The results of these analyses will not be presented.

Table III. Mean and Per Cent Error Values for Data Reproduction in Terms of Abstract Factors for the Five Data Combinations

              One factor             Two factors            Three factors
Case (a)   Mean       Per cent    Mean       Per cent    Mean       Per cent
           error (b)  error       error (b)  error       error (b)  error
I          0.36       1.22        0.10       0.33        0.028      0.09
II         1.09       3.11        0.07       0.41        0.028      0.08
III        0.97       2.24        0.28       0.93        0.111      0.32
IV         0.07       1.71        0.07       1.32        0.036      0.72
V          0.55       1.81        0.34       0.96        0.101      0.25

(a) See bottom of Table II. (b) Units = ml/gram.
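The data recalculation described above can be sketched for the class A solutes (Case I). This is a minimal illustration under stated assumptions: it uses NumPy, the helper names `reproduce` and `errors` are hypothetical, the retention volumes are transcribed from Table I, and the "mean error" and "per cent error" are taken here as the mean absolute deviation and the mean absolute relative deviation, which may differ from the exact definitions used for Table III.

```python
import numpy as np

# Class A retention volumes (ml/g) transcribed from Table I.
# Rows: column loadings 1-5. Columns: the five class A solutes.
V = np.array([
    [4.11, 3.64, 10.57, 21.85, 27.04],
    [5.65, 4.70, 14.16, 30.27, 38.71],
    [6.46, 5.50, 16.61, 33.97, 42.08],
    [7.66, 6.64, 19.62, 40.53, 50.17],
    [9.15, 7.83, 21.65, 49.22, 61.08],
])

def reproduce(data, n_factors):
    """Recalculate the data from the n most important abstract factors."""
    norms = np.linalg.norm(data, axis=0)
    unit = data / norms                          # columns scaled to unit length
    _, vecs = np.linalg.eigh(unit.T @ unit)      # eigenvalues in ascending order
    keep = vecs[:, ::-1][:, :n_factors]          # eigenvectors, largest first
    return (unit @ keep @ keep.T) * norms        # project, then undo the scaling

def errors(data, n_factors):
    """Mean absolute error (ml/g) and mean per cent error (assumed definitions)."""
    diff = np.abs(data - reproduce(data, n_factors))
    return diff.mean(), 100.0 * (diff / data).mean()

for n in (1, 2, 3):
    mean_err, pct_err = errors(V, n)
    print(f"{n} factor(s): mean error {mean_err:.3f} ml/g, per cent error {pct_err:.2f}")
```

Comparing the per cent error at each dimensionality with the stated experimental precision of ±2% is the decision rule applied in the text; retaining all five factors reproduces the data exactly and therefore carries no diagnostic value.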
The mean error and per cent error values in terms of abstract factors for the five cases are shown in Table III. For Case I, the overall mean error of data reproduction when one abstract factor was used to recalculate the data is ±0.36 ml/gram with a range of ±0.034 ml/gram for hexane to ±0.80

(23) O. Exner, Collect. Czech. Chem. Commun., 31, 3222 (1966).

ANALYTICAL CHEMISTRY, VOL. 46, NO. 14, DECEMBER 1974

Table IV. Test of Physically Assigned Factors, AL° and VL°, for Each of the Five Data Combinations

              AL°, 10⁴ cm²/g                       VL°, ml/g
                    Predicted                            Predicted
         Test    One factor  Two factors      Test    One factor  Two factors

Case I (see bottom of Table II)
          7.8       8.0         7.9           0.29       0.12        0.14
         11.2      10.9        10.1           0.25       0.16        0.15
         12.3      12.4        12.2           0.205      0.18        0.23
         14.6      14.8        14.6           0.17       0.22        0.26
         17.8      17.5        17.9           0.14       0.25        0.17

Case II
          7.8       9.1         7.8           0.29       0.14        0.30
         11.2      11.6        11.1           0.25       0.17        0.24
         12.3      12.4        12.3           0.205      0.18        0.20
         14.6      14.4        14.8           0.17       0.21        0.17
         17.8      16.9        17.7           0.14       0.25        0.15

Case III
          7.8       8.6         7.9           0.29       0.13        0.30
         11.2      11.3        10.9           0.25       0.17        0.26
         12.3      12.4        12.4           0.205      0.18        0.19
         14.6      14.6        14.8           0.17       0.22        0.17
         17.8      17.2        17.7           0.14       0.25        0.15

Case IV
          7.8       8.2         7.9           0.29       0.12        0.25
         11.2      11.1        10.9           0.25       0.16        0.24
         12.3      12.4        12.3           0.205      0.18        0.23
         14.6      14.8        14.8           0.17       0.22        0.22
         17.8      17.5        17.8           0.14       0.25        0.10

Case V
          7.8       8.9         7.8           0.29       0.13        0.30
         11.2      11.6        11.1           0.25       0.17        0.24
         12.3      12.3        12.2           0.205      0.18        0.20
         14.6      14.4        14.7           0.17       0.21        0.17
         17.8      17.0        17.7           0.14       0.25        0.14

ml/gram for octane. This represents an average percentage error of reproduction of 1.2% for all data, which compares favorably with the reported experimental precision (±2%). When two factors are used, the overall mean error drops to ±0.10 ml/gram with a range of ±0.02 to ±0.33 ml/gram. With two factors, the experimental data are reproduced to 0.33%, far exceeding the precision of the experimental data. Therefore, one factor seems sufficient to reproduce the data within experimental error. The present analysis supports the tentative conclusion reached earlier concerning the eigenvalues of the Case I data set. For Case II, the reported overall mean error with one factor is ±1.09 ml/gram with a range of ±0.26 ml/gram for CCl4 to ±3.19 ml/gram for toluene. This corresponds to a percentage error of ~3%, which exceeds the reported precision of the data set. With two abstract factors, the mean error falls to ±0.07 ml/gram, indicating that all data are being reproduced within experimental error. Again this conclusion supports our previous hypothesis obtained by the study of the eigenvalues. Using similar analyses, we may also conclude that two factors are required to span the data spaces for Case III. Cases IV and V represent a gray region; it is unclear from Table III whether one or two factors are required.

From these tentative examinations of the dimensionality of the data sets, it is possible to attempt to associate physically significant parameters with the abstract factors. From previous discussions, it is expected that the gas-liquid interfacial area per gram of packing, AL°, should test as a factor for data sets containing both class A and B solutes, while the bulk volume per gram of packing factor, VL°, should be operative only when class B solutes are present in the data sets. Furthermore, the VL° factor should test as a factor only when a two-dimensional rotation matrix is used, since it is a secondary factor. As mentioned previously, this factor rotation and testing scheme offers a powerful method of checking conclusions concerning the dimensionality of data sets reached with the aid of the eigenvalues and data reproduction steps discussed above. This arises because the factor rotation scheme will work well for secondary factors only if the rotation matrix contains the correct number of dimensions. Dominant factors will usually fit fairly well even with an improperly dimensioned rotation matrix. To ascertain whether a suspected factor can be associated with a single abstract factor or a linear combination of the abstract factors of a data space, a least squares transformation or rotation matrix is calculated which attempts to map the suspected factor onto a linear combination of the abstract factors of the space. Mathematically, this can be stated as follows. Let {S}_c be the suspected test factor expressed as a column vector containing r rows (where r is the number of rows in the data matrix); let [A] be a matrix of the abstract factors of the space which contains r rows and as many columns, n, as there are suspected abstract factors in the space; and let {T}_c be a transformation column vector containing n rows. In terms of the above defined arrays, we seek a transformation vector {T}_c such that

$$[A]\,\{T\}_c = \{S\}_c \qquad (10)$$

For details, the reader is referred to a previous paper (10). This calculation will be successful only when the suspected test factor and the abstract factors differ only by a simple linear transformation, and the correct number of abstract factors is used in [A]. Once this "best" transformation vector has been found, Equation 10 is then used to recalculate the best representation of {S}_c. Therefore, by comparing the suspected test factor with its "best" predicted rotated counterpart on a row by row basis, a decision can be made on the validity of the test factor. Again, as was indicated earlier, a similar calculation can be made for each separate factor without having to input any information concerning the identity of the other factors present in the space, since for each test factor, a different {T}_c can be found. The results of these rotations on the two suspected factors, AL° and VL°, with the five data sets are shown in Table IV. In Case I, we find, as expected, that the AL° factor tests well using a one-dimensional rotation vector, while the VL° factor does not give a positive test using either a one-dimensional or two-dimensional transformation vector with Equation 10. Note that with the AL° suspected factor, the fit does not improve significantly with a two-dimensional transformation vector. We can conclude that for the class A solutes, the VL° or bulk gas-liquid partition factor is not operative. Only gas-liquid interfacial adsorption is important. For Case II, which concerns only the class B solutes, we would again expect that the AL° factor should test quite well with a one- or two-dimensional rotation vector, because it is the dominant mechanism of retention. For the VL° factor, we would expect that the fit should be poor for a one-dimensional rotation vector since the VL° factor corresponds to a secondary factor, while there should be a dramatic improvement in fit when a two-dimensional transformation vector is used.
This is exactly the situation found when these tests were made, as is shown for Case II in Table IV. The improvement of the VL° fit when the dimensionality of the rotation vector is increased from one to two dimensions is quite striking when compared to the corresponding results for VL° in Case I. Similar conclusions are reached when the AL° and VL° test factors are examined with Cases III and V. Both data sets contain class B solutes, so it is expected that a two-dimensional transformation matrix should be required to span these data sets and that VL° should test as a factor. For Case IV, one would also expect that a two-dimensional space should exist because of the presence of the class B solute, CHCl3, among the class A solutes. However, our previous examination of the eigenvalues for this case, shown in Table II, indicated that the presence of a second factor within this data space was unclear. It is seen in Table IV that the fit of the VL° test factor using a two-dimensional rotation vector is worse than in Cases II, III, and V. However, when compared with the fit obtained for the

Table V. Mean Error and Per Cent Error Values of Data Reproduction in Terms of Physically Significant Parameters

            One-factor rotation (a)          Two-factor rotation (b)
Case (c)    Mean error   Per cent error      Mean error   Per cent error
I             0.25          0.85               0.34          1.15
II            1.45          4.20               0.16          0.45
III           0.78          2.50               0.20          0.64
IV            0.47          0.83               0.22          0.39
V             1.07          3.52               0.17          0.56

a The one-factor solution is always in terms of AL°. b The two factors are AL° and VL°. c See bottom of Table II for definition of cases.

VL° test factor in Case I, the importance of the VL° factor in Case IV comes into clearer perspective. These results illustrate an important point about carelessly using the results of a factor analysis solution to a problem. To test for the presence or absence of a given factor, the data which are included within the original data set must be carefully selected to contain adequate representations for all important factors. Also, it is necessary to keep in mind the relative importance of the given factor in recalculating the observed variance of the experimental data when the fit of a given factor is judged. This example nicely illustrates the power of the rotation step of factor analysis to aid in the determination of the proper dimensionality of a data space. In many respects, this rotation scheme offers the investigator the clearest picture of the data dimensionality within the constraints of the precision of the data and the relative importance of the factor being tested. Therefore, the previous conclusions reached concerning the dimensionality of the data spaces can be strongly supported by the results of the factor rotation scheme. They still, however, cannot be considered to be unequivocal proof that the factor space has been properly delineated. Once a set of test factors equal to or greater in number than the dimensionality of the factor space has been found, it is then possible to check if a linear combination of these test factors adequately encompasses measures of all the important abstract factors present in the data. This is accomplished by making a simultaneous rotation of the test factors onto the abstract factors and then recalculating the original data in terms of these suspected factors. In this calculation, the number of test factors used within a given calculation is equal to the suspected dimensionality of the factor space of the problem.
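The single-factor test of Equation 10 and the simultaneous rotation just described can be sketched as follows. This is only an illustration under stated assumptions, not the authors' program: the abstract factors are taken from a singular value decomposition, the transformation is an ordinary least-squares solution, and the data are rebuilt from the test factors themselves rather than from their least-squares predictions (so that no minimum-error criterion is imposed on the result). The AL° and VL° values are the test values of Table IV; `K_A`, `K_L`, the synthetic data, and all function names are invented.

```python
import numpy as np

def abstract_factors(data, n):
    """Return ([A], loadings) with data ~ [A] @ loadings, using n factors."""
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    return u[:, :n] * s[:n], vt[:n, :]

def target_test(A, s_test):
    """Least-squares {T} for [A]{T} ~ {S}; returns ({T}, predicted {S})."""
    t, *_ = np.linalg.lstsq(A, s_test, rcond=None)
    return t, A @ t

def rebuild_from_test_factors(A, loadings, test_factors):
    """Rotate all test factors at once, then rebuild the data from them."""
    T, *_ = np.linalg.lstsq(A, test_factors, rcond=None)  # n x n transformation
    constants = np.linalg.solve(T, loadings)              # rotated loadings
    return test_factors @ constants, constants

AL0 = np.array([7.8, 11.2, 12.3, 14.6, 17.8]) * 1e4   # cm^2/g, from Table IV
VL0 = np.array([0.29, 0.25, 0.205, 0.17, 0.14])       # ml/g, from Table IV
K_A = np.array([5e-5, 8e-5, 3e-4, 2.5e-4])            # invented adsorption constants
K_L = np.array([0.0, 3.0, 20.0, 17.0])                # invented partition constants
data = np.outer(AL0, K_A) + np.outer(VL0, K_L)        # noise-free synthetic data

A1, _ = abstract_factors(data, 1)
A2, L2 = abstract_factors(data, 2)

# A secondary factor should fit poorly with one dimension and well with two.
_, vl_pred_1 = target_test(A1, VL0)
_, vl_pred_2 = target_test(A2, VL0)
print("VL0 test:     ", VL0)
print("one dimension:", np.round(vl_pred_1, 3))
print("two dimension:", np.round(vl_pred_2, 3))

# Simultaneous rotation of both test factors and data reproduction.
rebuilt, constants = rebuild_from_test_factors(A2, L2, np.column_stack([AL0, VL0]))
print("mean error:", np.abs(rebuilt - data).mean())
```

In this sketch the rows of `constants` play the role of the distribution constants associated with each test factor; with real, noisy data they would only approximate the underlying values.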
Since this calculation in no way forces a minimum error criterion on the simultaneous rotation step, the fit of experimental to predicted data will be good only in the situation that all important factors have been adequately accounted for. The fit of the model to the data base can be judged by observing the mean error of reproduction for each recalculation of the data base. The results of these calculations are shown in Table V for the five cases studied. In interpreting the results shown in Table V, it is important to keep in mind what the corresponding results were when the data matrix in question was recalculated in terms of a similar number of abstract factors (results shown in Table III). If the model is good, then the mean error term should approach the corresponding value obtained for this quantity calculated in terms of a similar number of abstract factors. However, if the proposed model does not adequately describe the phenomena, the calculated mean error will deviate markedly from its abstract counterpart. In this respect, it is important to stress again that the mean


Table VI. Adsorption and Partition Constants Determined by Factor and Regression Analyses Adsorption constants FA*,X Solute

Carbon tetrachloride Methylene chloride Chloroform Benzene Toluene n-Hexane Cyclohexane n-Heptane 2-Methylheptane n-Octane

Case I

Case 111

Case I1

7.82

* 0.11

7.83

* 0.12

0.23

10.56 f 0.22

i 0.24

30.94 i 0.25 26.06 f 0.21 76.90 i 0.58 5.11 i 0.11 4.32 * 0.17 12.06 i 0.11 27.48 k 0.28 34.19 0.10

10.56

f

30.93 26.06 76.87

f

0.19 i. 0.56

5.18 rt 0.04 4.44 f 0.06 12.83 0.43 27.63 i 0.08 34.37 * 0.07

*

*

Case IV

30.92

5.09 4.31 11.85 27.47 34.19

Partition constants, Solute

Carbon tetrachloride Methylene chloride Chloroform Benzene Toluene n-Hexane Cyclohexane n-Heptane 2-Methylheptane n-Octane

Case I‘

Case I1

3.44 25.56

Case 111

+ 0.72

3.40

rt

0.79

1.72

25.51

t

1.67

i

20.44 f 1.09 16.87 i 1.02 21.59 k 2.31

20.37 * 1.17 16.80 i 1.08 21.44 f 2.46 0.40 * 0.62 0.83 i 1.33 4.41 i 4.40 1.14 i 2.43 1.23 0.86

*

cm

i

0.45

* 0.23

Case V

7.83

i

0.11

10.58

i

0.25

30.98 26.10 77.03

r

0.23

i. 0.19

34.25

* 0.70

* 0.26 + 0.20

* 0.35

* 0.04

Case V

3.42

0.24 0.49 3.95 0.54 1.40

* 0.07

* 0.08 30.99 * 0.15 10.69

26.12 76.98 5.27 4.50 13.56 27.73 34.35

0.13 0.34 i 0.05 * 0.04 i 0.10 + 0.19 f 0.25

I

rt

FL*

Case lV

20.06

* 0.52

Regression

7.87

* 2.65 i

Reqression

0.75

3.14

25.51 -c 1.82

24.39

20.30 16.75 21.16

i 1.19 t 1.13 i 2.55

20.19 = 1.05 16.49 i 0.90 21.27 i 2.40

1.02

* 0.53

i

2

0.47

* 0.57

1.42

* 1.57

+ 12.5 rt 2.17 i 0.43

a These quantities cannot be estimated for Case I, since VL° did not test as a valid factor.

error, calculated for the abstract case, is usually a limiting value which indicates how well one could expect to do with any n-term model. For Case I, our previous analysis indicated that the factor space was probably one dimensional and that AL° was a good measure for this factor. Shown in Table V are the mean error (and per cent error) when both a one-factor (AL°) and a two-factor (AL° and VL°) solution were used to fit the data. The calculated mean error and per cent error are lower for the one-factor solution than for the two-factor solution, yielding one more supporting piece of information that there is no VL° term or factor present in this data set. This result occurs because no minimum error constraint is forced on the model. Therefore, it can be seen that with the factor analysis scheme, improvements in the fit of a model occur only if the model is improved by the modification. It is to be noted that the value of the mean error in Table V for Case I in terms of one factor is somewhat smaller than that calculated in terms of one abstract factor (Table III). This surprising result can be explained by the fact that the AL° test factor was calculated from n-octane retention data. Therefore, when this same AL° factor is used to recalculate the retention data, it is naturally expected to reproduce the n-octane retention data quite well. (The mean error of recalculation of n-octane retention data decreases from a value of ±0.8 ml/gram in the abstract calculation to ±0.13 ml/gram.) As this circular situation arises very rarely, it can be assumed that in most situations the abstract values of mean error are limiting. The other four data sets in Table V are straightforward to interpret. In all cases, the two-factor fit was significantly better than the one-factor fit, which is reasonable in the

light of our understanding of the phenomena involved. These results even hold true for Case IV, in which some ambiguity existed as to the dimensionality of the data. From the results in Table V, it may be concluded that the model does accurately mirror the data. With this conclusion, it is then possible to proceed to the last step to extract the appropriate distribution constants by factor analysis using the jackknife procedure. The results of this calculation are shown in Table VI. It is to be noted that the constants are actually the "best estimate" values (20), K̄* (Equation F t), and the confidence limits were obtained from Equation 6. For adsorption constants, KA*, the values calculated by factor analysis for the five data cases all agree within their stated confidence limits. Even for Case IV, the calculated value of K̄A* for CHCl3 agrees well with the other data sets. This is not surprising since the adsorption mechanism of distribution is the dominant mechanism of retention for all the solutes. Furthermore, in comparing the K̄A* values in Case III, which contains the total data base, with those obtained by separate regression analyses, it is found that the two calculation methods agree within the calculated confidence limits except for n-heptane. The confidence limits determined by the jackknife method are in all cases greater than those calculated by a regression analysis. This may arise from the fact that factor analysis does not provide a minimum error solution, while regression analysis forces the errors to their minimum values. Furthermore, it is inherent in the formalism of the jackknife approach that the calculated confidence limits may be larger than an estimate based simply on the concept of a standard error. The jackknife confidence limits are designed to take account of the two different sources of data uncertainty, namely, those


associated with variables which are known and controlled (internal uncertainty) and those variables that are unknown and uncontrolled (supplementary uncertainty). The standard error estimate of calculating confidence limits does not adequately account for the supplementary uncertainty type of error (20). For the partition constants, the internal consistency of the factor analysis results among the five cases is again quite good. Surprisingly, this good fit even extends to the KL* calculated for chloroform in Case IV. For this data point, the jackknifed confidence limit is larger than its counterparts calculated in the other cases; however, this is reasonable behavior for the coefficient of such a small secondary factor. Again the agreement for Case III between the factor analysis values and their regression-determined counterparts is quite satisfactory for all class B solutes. For the class A solutes, the reported values of KL* are quite small when compared to those calculated for the class B solutes. This indicates that bulk partition is not an important mechanism for solute retention. Furthermore, the large confidence limits of KL* for the class A solutes (sometimes even including a value near zero) imply that not much weight can be given to the physical meaning of these values.
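The jackknife idea behind these confidence limits can be illustrated with a short sketch. It uses the standard Tukey form of the jackknife (pseudo-values, their mean as the "best estimate", and their standard error scaled by a Student t value) and assumes that one packing (one row) is deleted at a time; whether this matches the deletion scheme and Equations used in this work is an assumption. For simplicity, the estimator jackknifed here is a least-squares fit of one solute's retention data to AL° and VL°, not the factor analysis procedure itself. The AL° and VL° values are from Table IV; the retention values, the perturbation, and the function names are invented.

```python
import numpy as np

def jackknife(estimator, data, t_value):
    """Tukey jackknife over the rows of `data`.

    Returns (best estimate, confidence half-width) for each parameter.
    """
    n = data.shape[0]
    theta_all = np.asarray(estimator(data))
    # Pseudo-value i: n * (all-data estimate) - (n - 1) * (estimate without row i)
    pseudo = np.array([
        n * theta_all - (n - 1) * np.asarray(estimator(np.delete(data, i, axis=0)))
        for i in range(n)
    ])
    best = pseudo.mean(axis=0)
    std_err = pseudo.std(axis=0, ddof=1) / np.sqrt(n)
    return best, t_value * std_err

def fit_constants(rows):
    """Least-squares constants for one solute; columns are AL0, VL0, retention."""
    k, *_ = np.linalg.lstsq(rows[:, :2], rows[:, 2], rcond=None)
    return k

AL0 = np.array([7.8, 11.2, 12.3, 14.6, 17.8]) * 1e4   # cm^2/g, from Table IV
VL0 = np.array([0.29, 0.25, 0.205, 0.17, 0.14])       # ml/g, from Table IV
noise = np.array([0.05, -0.03, 0.02, -0.04, 0.01])    # invented perturbation, ml/g
retention = 8e-5 * AL0 + 3.0 * VL0 + noise            # invented retention data

rows = np.column_stack([AL0, VL0, retention])
# 2.776 is the two-sided 95% Student t value for 4 degrees of freedom.
best, half_width = jackknife(fit_constants, rows, t_value=2.776)
print("best estimates:  ", best)
print("confidence limit:", half_width)
```

Because each pseudo-value reflects how much the estimate shifts when one observation is removed, the resulting limits respond to any disturbance in the data, which is consistent with the remark above that jackknifed limits can exceed those based on a standard error alone.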

CONCLUSION From the results presented above, it is concluded that factor analysis offers a viable alternative to regression analysis for the study of mixed mechanisms of retention and for the extraction of useful thermodynamic information. The ability to test individual suspected factors can be useful in a multi-factor problem of any reasonable complexity. Furthermore, as a minimum error solution to the problem is not imposed, the present form of factor analysis offers a more valid picture of the fit of a given model to a data set. If the fit is good, then one can feel confident in applying regression analysis to the data. At present, there are some unanswered questions on the relative merits of the global averaging of the raw data afforded by factor analysis and the minimum error constraints imposed by regression analysis to the ultimate problem of the extraction of the “best” results from a given problem. More study has to be given to this important question. A serious shortcoming of the present factor analysis scheme is that it is hard to make a quantitative statement on the number of factors present in a data set. Various


mathematical tools can be used to aid the investigator in ascertaining the correct dimensionality of a data space, the most sensitive of which seems to be the test factor rotation scheme. The conclusions based on this rotation scheme are predicated, however, on the precision of the experimental data and the relative importance of the factor being tested. Ideally, one should test the least important factor to ascertain best the correct dimensionality of a data space. We have tried to present the reasoning that one must use to arrive at a good estimate of the true dimensionality of a problem; however, it must be pointed out that no one tool at present can be used to arrive at a completely unequivocal answer to this important question. Another previous limitation of factor analysis which impeded its applications to analytical problems has been eliminated by the mating of the jackknife method of calculating confidence limits with the normal factor analysis approach for the calculation of β-weights. The jackknifed procedure should be used to determine the confidence limits in regression analysis as well (20).

Finally, it must be restated that the present limited example does not provide the best choice of a problem to compare regression and factor analysis because of its small dimensionality. In more complex problems, extreme care must be taken to acquire accurate and reproducible retention data because any small change in the system could greatly affect the relative importance of the several mechanisms of retention. In these latter cases, the importance of the jackknifed solution for obtaining meaningful results increases because there may exist some uncontrolled or supplementary variables within the system. In these situations, the model probably cannot be validated unless the distribution constants can also be determined by another independent method.

ACKNOWLEDGMENT The authors thank Svante Wold, Umeå University, Umeå, Sweden, for his suggestion of using the jackknife method for estimating confidence limits in factor analysis and Mervin Lynch of Northeastern University for his helpful discussions in statistics.

RECEIVED for review June 26, 1974. Accepted August 12, 1974. We would like to thank the NSF for support of this work.
