1004
Anal. Chem. I 9 8 4 56, 1004-1010
Validation of Hypothesis on a Data Matrix by Target Factor Analysis Avraham Lorber Nuclear Research Centre-Negeu, P.O. Box 9001,Beer-Sheua 84190,Israel
Target factor analysis (TFA), a method of rotatlng abstract eigenvectors into physically significant target test vectors, is presented as a projection of the target vectors onto the space spanned by the eigenvectors. The presentation makes use of the theory of the distribution of quadratlc forms to glve the distribution of the errors in the predicted target vectors. The acceptablllty of the hypothetical target test vector is glven by the x2 and by the Flsher-varlance ratlo tests. Examples from model data, mass spectrometry, generailred internal reference method, and curve resolution are presented.
Factor analysis, since its first application to chemical data analysis by Wallace (I),was found to be a substantial tool in the analysis of multivariate chemical data (2). Abstract factor analysis (AFA) is used to determine the number of independent variables in the data matrix. Subsequent to AFA the investigator wishes to find physical interpretation of the variables from which the data matrix is composed. The method that verifies a hypothesis on physically significant properties of the variables in the data matrix is target factor analysis (TFA) (3). TFA finds rotation vectors that rotate the physically meaningless eigenvectors obtained by AFA into a physically significant vectors. The input to this step is the test vector that, by hypothesis, represents a physical property. The rotation is performed by multiplying the rotation vector by the matrix of eigenvectors, whereby the predicted target vector is obtained. The confirmation or rejection of the hypothesis is accompished by comparing the target test against the predicted target vector, which should be equal if the hypothesis is true. Both the theory and applications to various problems in chemistry were presented in detail in the text by Malinowski and Howery (2). The rotation vectors are found by a least-squares solution of a matrix equation involving a matrix composed of the significant eigenvectors and the target test. Roscoe and Hopke (4)solved the matrix equation by a weighted least-squares method. They also proposed an iterative scheme that with a minimal knowledge of the physical quantities will converge to the desired physically significant predicted target vector. Rasmussen et al. (5) have noticed that the TFA method projects the target test vector onto the subspace generated by the significant eigenvectors. Thus the resultant predicted target vector lies in the eigenvector space. This observation allowed them to consider two other methods which are closely related to TFA, the Gram-Schmidt orthogonalization and the method used by Ho et al. (6) which is related to Bessel’s inequality. Because the statistical inference concerning the acceptability of a specific test vector is essential to TFA, it is important to study the errors associated with the rotation. Malinowski (7)presented two functions, the SPOIL and the RELI functions, that measure the compatibility of the test vector. However, as these are considered to be new concepts in factor analysis, their distribution is not known and statistical inference in terms of a known test is impossible.
Malinowski (7, 8) gave, as a rule of thumb, regions for acceptability for the SPOIL function. In this paper, TFA is developed from the viewpoint of projection matrices. The theory of the distribution of quadratic forms serves as the basic of the study of the errors in the TFA method. These theoretical considerations permit statistical inference in terms of well-known statistical tests. The theory of projection matrices and distribution of quadratic forms presented here is based on the text of Rao (9)and the text of Seber (10).The singular value decomposition is described in the text of Lawson and Hanson (11).
THEORY Presentation of TFA by Projection Matrices. An array of data obtained from an experiment contained m rows and n columns may be presented in matrix notation as D. It is a well-known result in linear algebra that any real matrix may be decomposed into three matrices
D = USVT
(1) where U is an m x n orthonormal matrix (i.e., U W = I,, where UT denotes the transpose and I, the identity matrix of rank m ) , V is an orthonormal, square matrix of order n, and S is a diagonal matrix (only the diagonal elements may differ from zero), with elements sl, ..., s, for which s I s2 I ... 1 s, I 0. These scalars are the nonzero singular values of the matrix D. The singular values are unique and are the square roots of the eigenvalues of the matrix DTD. The subscript k denotes the smallest significant eigenvalue. The decomposition presented in eq 1 is referred to as the “singular value decomposition” (SVD). This decomposition serves in presenting the theory. However, because the SVD is considered as a further decomposition of other orthonormal decomposition techniques, the implication to them is straightforward. For simplicity, only the case where m > n will be considered. This restriction does not pose any limitation because any data matrix may be transposed so that the number of rows will be greater than the number of columns. The objective of AFA is to determine k , the number of significant eigenvalues. When k is determined, eq 1is reduced to
D =USVT
(2)
Now U is a m X k matrix containing the first k eigenvectors that span the space of DDT. The remaining m - k eigenvectors span the null space of DDT. VT is a k X n matrix containing the first k eigenvectors that span the space of DTD. The remaining n - k eigenvectors of V span the null space of DTD. 8 is a diagonal matrix with the first k singular values the diagonal elements. D is the reproduced m X n data matrix obtained from k factors only. The result of AFA is that the original data are decomposed into orthogonal vectors which have no physical identity. The objective of TFA is to present the data matrix as a multiple of two matrices (A and B) whose vectors represent physical properties D = AB (3)
0003-2700/84/0356-1004$01.50/00 1984 American Chemical Society
ANALYTICAL CHEMISTRY, VOL. 56, NO. 6, MAY 1984
1005
Here A is a m X k matrix, with columns, a, representing physical properties that lie in the space generated by the rows of D. B is a k x n matrix, whose rows, b, represent physical properties that lie in the space generated by the columns of D. Because a lies in the space of U, it may be represented as a linear combination of U
projection operation as it is well discussed in the literature on generalized inverse. References to this subject are given in ref 9. Distribution of the Errors and Statistical Inference. The sum of squares of the deviations between the target test vector and that predicted from TFA is
(4)
Qtotal= eRTeR = (a - UUTtR)T(a - UUTtR) = tRT(I, - uUT)tR (15)
a = OrR
Here rRis a k vector called a rotation vector; the subscript R denotes the fact that the rotation is performed on the space generated by the rows of D. The same reasoning leads to
b = rCVT
(5)
Here rc is a k row vector that rotates the space generated by the columns of D. The subscript C stands for column rotation. The matrices obtained from k row rotations RR and from k column rotations Rc are interconnected by the relations
RR = SRc-’;
Rc = RR-lS
(6)
A test vector tR) assumed to be a physically significant vector that lies in the space generated by the rows of D, may be represented as t R = OrR (7) Following the usual least-squares method, both sides of eq 7 are multiplied by UT. As U is orthonormal, we get
rpR = U T t R
(8)
which is a best linear unbiased estimate (BLUE) of the row rotation vector. The predicted targent vector is found by inserting rR, which resulted from eq 8, into eq 4 to get
a = UUTtR
(9)
The same steps lead to the corresponding equations for the columns rotation vectors tc
rc = tcV
(10)
mT
The matrices and VVT are called projection matrices. The results of eq 9 and 11 are not unique to TFA, rather, they hold for any linear hypothesis. The orthogonal projection matrices arise from the MoorePenrose pseudoinverse. The pseudoinverse, X, of the matrix D had to satisfy the following four relations: (a) DXD = D (b)XDX = X (12) (d) (XD)T = XD (c) (DX)T= DX When all four relations are satisfied X is said to be the pseudoinverse of D, denoted by D+. Using eq 2, we can write D+ = @SOT)+ = V S - l U T (13) and get the relations = DD+;
VVT =
D+D
(14) The pseudoinverse as defined in eq 12 is unique, thus the results of eq 11 and eq 12 presented by the SVD are the same as the results obtained by any other decomposition, on condition that the projection matrices satisfy the MoorePenrose relations. The first two conditions in eq 12 are analogous to the statement that the projection matrices DD+ and D+D are idempotent (a matrix, Z, satisfying Z2 = Z). The last two relations are equivalent to the condition that the matrices are symmetric (ZT = Z). These two properties of the projection matrices are important in the derivation of the distribution of the errors in the predicted target vector. In this paper we do not intend to present the geometrical meaning of the m T
where eRis the error vector associated with the the test vector tR. In obtaining eq 15, one observes that (I, is an idempotent matrix. The theory of the distribution of quadratic forms (Qbd is a quadratic form) of idempotent matrices is well studied and we will make use of well-known theorems to describe the distribution of the errors in the TFA and get an expression that will enable inference in terms of the Fisher variance ratio. The mathematical development is described in the Appendix.
mT)
EXPERIMENTAL SECTION Four data sets, three of which were taken from the chemistry literature and the fourth of an artificial data, served to investigate and demonstrate the usefulness of the mathematical theory. A computer program was written in Fortran IV. The singular value decomposition was calculated by the subroutine SVDRSfrom ref 11. The program was run on a Digital PDP 11/34. RESULTS AND DISCUSSION Model Study. Before the new error theory is applied to real chemical data, it is important to test the new variance ratio by using model set data for which all information is known. A synthetic data base of various measurements on rectangles was constructed to test the variance ratio. Twenty sets of linear combinations of length, width, and thickness values served to construct the rows. Six linear combinations of two rectangles served to construct the columns. This data model is familiar, simple in structure, and very inexpensive to create. In addition the data model can be considered as a crude model for much spectral data in that all measurements are nonnegative and are expressed in the same units. To simulate real “analflical measurement” each data value was perturbed by 2% which is composed of 1% relative standard deviation (RSD), plus an additional random noise of 0.3 which simulates errors whose magnitude is not dependent on the ”measurement” magnitude (background determination, drift in signals). The data matrix was decomposed into the eigenvectors and the eigenvalues by the singular value decomposition. The singular values are 2.446, 0.119, 0.020,0.014,0.011,and 0.007. Application of AFA methods (2) confirms that only the first two eigenvalues are significant. Table I presents the results of target testing five vectors, all of which are perturbations to various degrees in a vector constructed from a linear combination of the two rectangles. The pure vector (unperturbed vector) is presented in the first column of the table. At the bottom of the table, the values of x2, SPOIL, and the variance ratio are presented. The significance level for accepting the hypothesis STFA2 5 SD2, both for the x2 and the variance ratio, are also presented. Malinowski in his theory of errors in the TFA (7) developed two functions for the purpose of assessing the validity and the worthiness of a target test vector, the reliability function, RELI, and the spoil function, SPOIL. He considered both functions to be new concepts in factor analysis; therefore, their interpretation and usage will be subjected to modifications. The calculation of the RELI function requires estimating the amount of error in the test vector; consequently, it’s application is limited. Malinowski divided the values of the SPOIL function into three regions: (1) an acceptable region (0.0-3.0); (2) a moderately acceptable region (3.0-6.0); and (3) an unacceptable region (>6.0). In Table I only the SPOIL values
1006
ANALYTICAL CHEMISTRY, VOL. 56, NO. 6, MAY 1984
Table I. Results of TFA on a Model Data Generated by Two Rectangles
perturbed by 2 RSD test predicted
pure vector 25.00 9.00 5.00 34.00 14.00 30.00 39.00 19.00 23.00 35.00 80.00 59.00 48.00 24.00 58.00 77.00 32.00 0.00 20.00 0.00
25.85 8.76 5.18 34.18 13.45 31.05 38.82 19.76 22.98 35.27 79.58 62.43 48.04 24.12 58.21 79.10 31.37 0.07 18.77 -0.18
X2
SPOIL variance ratio significance level for the x 2 test significance level for the Fisher test
25.03 8.93 4.67 34.63 14.05 30.45 40.07 19.15 23.20 35.55 81.16 59.35 48.20 23.73 58.75 77.17 32.32 0.13 20.19 -0.08
perturbed by 4 RSD test predicted 24.03 9.53 5.06 33.33 13.82 29.68 36.92 17.72 21.02 32.91 83.54 62.98 42.48 21.92 63.62 81.47 31.88 0.24 20.90 -0.13
25.45 8.78 4.73 34.80 13.97 30.92 40.47 19.15 22.90 36.08 82.55 60.00 48.43 23.76 59.02 78.42 31.85 0.10 20.48 -0.13
perturbed by 8 RSD test predicted 21.32 9.70 5.88 36.24 13.24 25.59 37.02 18.02 23.65 26.30 97.51 60.70 50.66 29.51 50.14 83.53 32.53 -0.12 17.16 -0.30
26.40 8.51 4.87 35.30 13.86 32.00 41.44 19.23 22.40 37.31 85.72 61.57 49.09 23.94 59.82 79.29 31.03 0.05 21.17 -0.22
fourth row perturbed test predicted 25.00 9.00 5.00 25.00 14.00 30.00 39.00 19.00 23.00 35.00 80.00 59.00 48.00 24.00 58.00 77.00 32.00 0.00 20.00 0.00
24.55 8.66 4.58 33.82 13.67 29.85 39.21 18.67 22.53 34.85 79.61 58.09 47.08 23.15 57.38 76.31 31.36 0.12 19.79 -0.09
fourth row perturbed __test predicted 25.00 9.00 5.00 40.00 14.00 30.00 39.00 19.00 23.00 35.00 80.00 59.00 48.00 24.00 58.00 77.00 32.00 0.00 20.00 0.00
24.79 8.99 4.64 34.49 14.07 30.18 39.82 19.12 23.30 35.24 80.38 58.95 48.02 23.67 58.53 77.92 32.50 0.14 20.01 -0.06
20.81 1.53 1.16 0.25
75.82 4.19 3.45 0.00
341.60 8.94 13.03 0.00
80.18 3.65 4.55 0.00
32.00 2.13 1.67 0.025
0.12
0.00
0.00
0.00
0.05
~~
are presented. The RELI function values are not presented as their use in practical situations is limited. From Table I it is obvious that both the x2 and the variance ratio detect the vector perturbed by 2 RSD and the vector whose fourth element was perturbed by 9 as the only acceptable vectors. The SPOIL function also detects that these two vectors fall within the acceptable region, while the vector perturbed by 4 RSD and the vector perturbed by 11 at the fourth element are moderately acceptable and the vector perturbed by 8 RSD is unacceptable. Comparison of the SPOIL function with the other two is difficult as the meaning of “moderately acceptable” is hard to measure. Close scrutiny of the vdues obtained shows that the SPOIL function values are behaving like the x2 values, while the variance ratio values are behaving differently. This behavior is seen when the results of the vector perturbed by 4 RSD are compared to those obtained from the vector perturbed by 11 a t its fourth element. The response of the variance ratio is in an opposite direction to the response of the x2 and the SPOIL. This behavior is not unexpected as the values obtained by the variance ratio are results of projecting the test vector on a different space than the space used by the SPOIL and the x2. This behavior of the variance ratio is of advantage because it is more sensitive to single perturbation than the others. In practical situations it is expected that perturbations on specific elements will occur more frequently then uniformly distributed errors. Table I presents another important property of the orthogonal projection. When the predicted vectors are compared to the pure vector, it is obvious that the predicted vector lies closer to the true vector than does the test vector. Roscoe and Hopke ( 4 ) suggest using this smoothing capability in order to construct an iteration scheme that with minimal knowledge of the physical quantities will converge to the desired physically significant predicted target vector. From the presentation of the TFA by orthogonal projection matrices, it is clear that because the predicted vector obtained by multiplying the test vector by an idempotent matrix already lies in the space
spanned by the primary eigenvectors, subsequent multiplication of the predicted vector will result in the predicted vector itself. Thus, no iteration is possible. Roscoe and Hopke found that the method succeeded in converging only for the weighted TFA. This success is attributed to perturbations which occurred during the inversion needed to obtain the weighted variance-covariance matrix. However, their results indicate that by perturbing the projection matrix or the predicted vector iteration is possible. Library Search of Overlapping Mass Spectra. TFA has a variety of chemical applications. It is particularly useful for identifying components of related mixtures (i.e., mixtures containing the same components but in different concentrations). The mass spectral data of Ritter et al. (12)serve for illustration. The data matrix was generated by measuring the intensities of four related mixtures of cyclohexane and cyclohexene at 20 mass-to-charge positions. Ritter et al. showed that the rank of the data matrix was 2,giving clear evidence that two components were present. Malinowski and McCue (13) showed in detail how TFA could be used not only to identify but also to quantify the components in these mixtures. Rasmussen et al. (5) incorporated a library file containing approximately 17 000 mass spectra into an orthogonal projection search method and confirmed that cyclohexane and cyclohexene gave the nearest matches. The five nearest matches included bicyclo[3.1.0]hexane, bicyclopropyl, and fluorocyclohexane. When these five nearest matches were subjected to TFA, the results summarized in Table I1 were obtained. The SPOIL function was calculated from its definition (7). The distance was calculated from eq 22. The variance ratio was calculated from eq 29. Table I1 also presents the area under the F distribution, which gives the significance level for accepting the hypothesis that STFA2 ISD2.Malinowski (7) found for the cyclohexane test vector a value of SPOIL = 1.39 vs. 2.36 in this study; however, the discrepancy is attributed to the fact that Malinowski used the mass spectral intensity data of pure cyclohexane from the work of Ritter et al. (12),while
ANALYTICAL CHEMISTRY, VOL. 56, NO. 6, MAY 1984
1007
Table 11. Comparisons of Target Testing Mass Spectral Intensities of Five Compounds That Give Nearest Matches to Cyclohexane/Cyclohexene Mixturesa cyclohexene bicyclo[3.1 .O]hexanec fluorocyclohexane bicyclopropylC cyclohexane -
-
test differenced test differenced test differenced test differenced test differenced 24.9 10.7 11.3 0.4 6.7 13.6 1.8 19.4 18.1 4.6 27 10.7 6.6 6.5 2.8 2.9 12.5 9.4 6.3 8.9 1.6 28 4.9 2.1 4.7 2.0 0.5 1.0 -1.0 2.6 8.9 2.0 29 52.0 18.1 13.5 -13.1 30.5 1.7 5.1 36.6 20.5 0.9 39 5.1 1.1 2.7 -0.6 3.0 -0.3 1.4 5.0 5.0 0.7 40 64.6 27.0 39.1 6.7 42.5 12.6 4.5 36.6 56.9 4.5 41 4.5 -0.6 11.4 5.3 2.0 -0.9 -0.2 2.7 25.8 1.7 42 0.3 -1.5 7.2 4.8 0.5 -0.3 0.2 -0.5 43 12.2 0.8 5.6 -2.4 3.5 -2.6 2.2 -4.8 1.0 8.7 2.8 0.5 51 15.5 4.5 6.1 -2.3 8.1 -1.4 2.4 12.9 53 4.1 0.3 100.0 26.1 27.5 -27.0 64.4 -0.7 5.2 76.9 54 6.2 -1.2 7.3 -1.6 17.7 7.9 3.2 -2.4 -0.1 5.6 55 34.2 0.0 0.4 -12.7 25.1 6.1 0.4 -4.6 0.4 -3.5 100.0 1.2 56 92.6 -17.9 100.0 19.1 2.1 -7.9 100.0 100.0 2.9 -1.0 67 3.4 -2.5 5.6 1.1 4.6 -0.5 -0.1 5.5 1.7 -0.4 68 0.3 -3.3 4.1 -1.3 0.3 -0.9 -0.8 0.1 23.4 -5.7 69 4.5 -2.9 1.4 -4.1 3.5 -3.0 0.2 7.4 0.6 -0.3 79 8.5 -1.8 8.8 1.2 9.6 0.4 11.5 1.4 0.5 -0.2 81 4.5 -36.3 31.1 1.2 21.8 -14.4 40.9 1.0 0.2 -1.2 82 0.3 -8.1 0.7 -12.9 0.3 -2.0 -1;2 0.1 72.9 -4.4 84 0.144 0.112 0.011 0.030 normalized 0.005 distance 15.30 13.30 6.48 3.76 SPOIL 2.36 47.50 32.30 9.26 variance ratio 1.20 2.95 0.002 0.21 0.005 significance level 41.3 6.18 for the Fisher test (%) Data are taken from Table I of ref 12. Reference 14. Reference 15. Difference = test - predicted. m/z
in this study the data were taken from a library file. Rasmussen et al. pointed out that several related methods are available for the comparison of vectors representing library mass spectra with the basic subspace generated by the significant eigenvectors: the Gram-Schmidt vector orthogonalization method, in which the orthogonal distance between the reference vector and the basic plane is the quantity of interest; the TFA as described by Malinowski and McCue (13) which relies on a comparison between the library test vector, which is referred to as the target, and the vector produced by projecting the test vector ipto the significant eigenvectors space, which is called the predicted target; comparison of the length of the projection into the basis space by normalizing the reference vector to unit length. This method is related to Bessel's inequality as pointed out by Ho et al. (6). When they examined these three related methods, they found that the Bessel inequality method only succeeded in finding the correct compounds in the related mixtures, while the other two succeeded partially. The distances computed by the Gram-Schmidt and by the TFA were essentially the same. In the light of the present study of TFA, it is clear that since the three methods measure the quadratic forms of projection matrices (which are unique on the condition that they are projection matrices of the Moore-Penrose pseudoinverse),the calculated distances should be equal. The reason that only the Bessel inequality succeeded in finding the correct components indicates that normalization is a necessary step before comparing results of different reference vectors. It is evident from Table I1 that the variance ratio is capable of finding the true components and rejecting the other components that give the nearest match. The SPOIL function also succeeds in finding the true components and rejecting the others. However, as the regions in the SPOIL functions were determined by resylts of simulations, use of the variance ratio is preferable as it has well-known statistical significance. Both the variance ratio and the SPOIL function are preferred to the distance measure that only sorts the test vectors ac-
cording to their distance from the data space. Generalized Internal Reference Method (GIRM). GIRM is a method that enables compensation of nonrandom fluctuations in analytical signals (16). Several internal standard channels, that respond differently to variations of the parameters of the analytical system, are measured simultaneously. The fluctuations in the instrumental parameters are found by rotating the temporal characteristics of the internal standard signals from the spectral space to the instrumental parameters space. By precalibration of the response of the analytical signals to variations in instrumental parameters, the compensation is possible. A matrix, whose elements describe the response of each internal standard signal to variation in a specific instrumental parameter, serves to rotate the measured variations in internal standard signals to the instrumental parameters space. Table I11 presents two matrices used by the GIRM. The matrices were obtained by an inductively coupled plasma (ICP) excitation source, coupled to a photodiode array detecting system. The matrix obtained at the 395-408-nm region is from ref 17 and the matrix at the 416-426-nm region is from ref 18. Both matrices were subjected to AFA (17), giving clear evidence that the matrix taken in the 395-408 nm region is of full rank and the 416-426-nm region matrix is probably singular of rank three. Aside from the different spectral lines, the determination of the effect of varying the uptake rate for the two matrices was obtained by different experimental procedures. In the 395-408-nm region the uptake-rate effect was calibrated by contracting the feed solution tube to various degrees, while, in the 416-426-nm region, the calibration was performed by changing the hydraulic head on the pneumatic nebulizer. Reaching singularity is associated with error amplification, which is an undesired effect, especially when compensation for small perturbations is claimed. It was suspected that the differences in calibrating the uptake rate are responsible for the singularity observed in the matrix of the 416-426-nm
1008
ANALYTICAL CHEMISTRY, VOL. 56, NO. 6, MAY 1984
Table 111. Matrices of the GIRM, Test on the Uptake Rate, and the Predicted Vector by TFA spectral windows, nm 39 5-408
416-4 26
aerosol incident power
carrier
cooling
spectral line, nm
uptake rate
gas
gas
AI1 Ca I1 Eu I1 Mn I ArI Sr I1
396.15 396.85 397.20 403.08 404.44 407.77
1.003 1.016 1.000 1.183 -0.445 0.995
0.820 1.341 1.000 -1.127 0.521 1.129
0.517 1.092
Ga I Ar I ELII1 Sr I1 Ca I
417.21 419.83 420.51 421.55 422.67 424.68 425.44
-1.503 0.412 1.000 1.258 -2.821 1.708 -2.334
0.210 1.284 1.000 1.187 -0.198 1.431 -0.369
sc I1 Cr I
region. TFA is advised to confirm this hypothesis. It was assumed that all signals produced by species that reach the excitation zone from the solution are affected to the same degree by varying the uptake rate, while the argon spectral line is not affected a t all. Therefore, a test vector of unit elements for all spectral lines except zero for argon's spectral line is used as a test vector. All four eigenvectors were incorporated in the TFA. The test and the predicted vectors are presented in Table 111. Because all the four eigenvectors were assumed to be significant, error estimate on the data matrices cannot be obtained. The error in the data matrices was estimated to be greater than the error in measuring the spectral line intensities which was measured (18)to equal 0.07% RSD. Taking this value as the error in the data mgtrices, the x2 values may be calculated. A value of 2.02 with two degrees of freedom is obtained for the TFA on the data in the 395-408-nm region; a value of 1500 with three degrees of freedom is obtained for the data in the 416-426-11111 region. Thus, our hypothesis that varying the hydraulic head on the nebulizer does not simulate variation in uptake rate is confirmed. When the 416-426-nm region was used to demonstrate the applicability of the GIRM (18), it was found that although the data matrix was near to singularity, no error amplification occurred. Nevertheless, the TFA performed here is essential to answering a more practical question; does the GIRM compensate for human errors during sample preparation (sample weighing, sample loss, error in solution mixing, and dilution). Human errors are characterized by the same behavior as the test vectors, Thus, if the hypothesis thqt human errors may be described as a linear combination of instrumental parameters is accepted, compensation for human errors during sample preparation is possible. The example of the application of TFA to GIRM demonstrated the following: (a) x2 inferencg is possible when the error in the data matrix is known; (b) TFA can be applied to matrices of full rank. Curve Resolution by Chemical Hypothesis. The recognition of TFA as a projection of a test vector on spaces spanned by the eigenvectors, enabled the derivation of error distribution and statistical inference whose advantage has been demonstrated in the preceding examples. Another advantage that is gained by this representation is that there are two projection matrices, one is spanning the row space and the other the columns space. Therefore, taking spectra of related mixtures, for example, in addition to hypothesis on the structure of the rows (spectra of pure components),hypothesis on the structure of the columns (concentrations) can be accomplished. Curve resolution using a postulated chemical equilibrium (19,20) or postulated reaction kinetics (20), is an example for
target factor analysis test predicted
-0.076 0.973
1.000
1.000
-3.224 6.533 1.235
-0.189 -2.774 1.613
1.000 1.000 1.000 1.000 0.000 1.o 00
-0.124 0.208
0.000
1.000
-1.124
1.000
1.000
1.157 -0.137 1.418 -0.691
1.141 -0.423 1.377 -0.384
0.000
1.000 1.000 1.000 1.000 1.000
1.004 0.998 1.005 0.998 -0.000 1.ooo
0.899 0.017 1.068 1.032 0.900 0.917 1.075
cases in which the test vector is projected on the matrix VVT rather than on the matrix OUT. Curve resolution, using a postulated chemical equilibrium, was suggested by Kankare (19). The method calculates a test vector by using the initial concentrations of the ith mixture (i = 1, ..., n) and by using a postulated stoichiometric coefficients. The method is illustrated by the following relatively simple example. n solutions were prepared by mixing two components, A and B. The initial concentrations for each mixture were [Ailoand [Bilp The spectra of the mixtures were taken at equilibrium and it was varified by AFA that one additional component appeared. We postulate that the chemical reaction is K
A+B=AB By the law of mass action we have
where K is the equilibrium constant. Mass balance is accomplished by using the initial concentrations. Thus, the concentration for each mixture can be calculated by [AB] = f/z(R (R2- 4[Ai]o[Bi]o)'/2) where
R = [AJo + [Bile + 1/K The only undefined variable remaining undefined is K which can be determined by minimization. Sylvestre et al. (20) presented a related method for the resolution of spectra gathered in kinetic measurements. The method uses initial concentrations and a postulated reaction rate for the curve resolution. The validation of the postulate is accomplished by projecting the n calculated concentrations on the matrix BOT. In the present article, numerical results on this subject are not presented, since the full portrayal of the method with chemical examples can be found (19, 20). Kankare (19) and Sylvestre et al. (20) have not presented their methods by orthogonal projection. Thus, the computation of their methods requires few steps and the error estimation is difficult. CONCLUSIONS In this article, the TFA method was formulated by means of orthogonal projection matrices arising from the MoorePenrose pseudoinverse. These projection matrices give minimum variance unbiased linear estimates. However, it is well-known that the projection operation is also defined for less restrictive estimates (21). There might be situations in which it will be advantageousto use a more general projection
ANALYTICAL CHEMISTRY, VOL. 56, NO. 6, MAY 1984
1009
we have the following theorems. If P is symmetric and idempotent of rank r, then the quadratic form yTpy is distributed as
with r degrees of freedom. x2 is a noncentral x2 with 0 9 0/u2 noncentrality parameter. Given m X m symmetric idempotent matrices P, (i = 1,2), the quadratices BTpiO/uzare statistically independent, if and only if, PlP2= 0. This result may be generalized to more than two quadratic forms. There are two important properties of the x2 distribution of which we take advantage: If Q, X2(r,)for i = 1 , 2 and Qi are statistically independent, then Q = Q1 + Qz is distributed as X2(r)with r = r1 rz degrees of freedom. If Qi x2(ri)for i = 1, 2 , r1 > r2, and Q = Q1 - Q2 is statistically independent of Q2, then Q x2(r),where r = rl - r2. From the above results on the distribution of quadratic forms, it is straightforward that the distribution of the errors in the TFA method obtained in eq 15, is
-
+
-
-
-
Qbtd
x2(m- 12)
(17)
This is true, since (Im-- UUT) is an idempotent symmetric matrix and (I, = 0. An unbiased estimate of is
mm
Qbtd
Figure 1. Mnemonic diagram of the relationship between the various
=m-k
S2total
quadratic forms (top)and the variances (bottom). matrix. Determining the number of independent variables by AFA is dependent on the criteria used. Furthermore, inspection of the eigenvalues usually suggests that there is no unique “rank” clearly assignable to the matrix (22). Rather, there is a range of ranks that may be reasonable choices. In situations like library search, in which it is known that the data matrix is composed of documented vectors, there is no significance to AFA. Rather, the number of constituents should be judged statistically from the set of nearest matches. Thus a method that does not necessitate AFA may here be considered. Biased estimators like the ridge regression method (23),that enables matrix inversion without prior knowledge of the rank, may qualify for library search. In such a case TFA should be generalized to include projection operation of biased estimators (21).
ACKNOWLEDGMENT The author wishes to express his thanks to M. Glouberman, Z. Goldbart, and M. Loobish for their help in executing this study. APPENDIX Staistical inference on a test vector necessitates (a) knowing the distribution of the errors in the predicted vector and the degrees of freedom associated with the errors, (b) estimating the errors which are inherent in the data matrix (Le., errors due to the data and not due to incompatibility of the test vector), and (c) choosing the appropriate statistical test. Distribution of errors in TFA. We consider a vector y, with m elements distributed in a multivariate normal distribution, with a dispersion matrix C(y) = uzIm,which means that the errors are uncorrelated and have the same variance, and the expected values are E(y) = 0 (the case of weighted least squares, in which, C(y) = 2, where 2 is a symmetric full rank matrix of variances, may be transformed easily into the simpler form; thus, for the sake of presentation it is more convenient to discuss the simpler form). On these assumptions
Analogous expressions may also be obtained for the test vectors on the VVT space. The quadratic form is Qc = eCTeC = (b - VVTtC)T(b - VVTtc) = tCT(1, - VVT)t, (19)
which is distributed as
Qc
-
x2(n- k)
(20)
and an unbiased estimate is
Sc2 = Q c / b - k)
(21)
Usually, normalization of the target test vector is required in order to avoid different scaling problems. The ratio
gives the sum of squares of projecting the test vector on the null space and is called (5) the distance. And for some purposes such as library search, where the interest is in finding nearests matches, the distance will find the set of nearest matches. However, in the general case of TFA, we are interested in statistical inference. Estimation of the Errors in the Data Matrix. The error in the data matrix S D 2 may be calculated from = dT(Im- UUT)d
(23)
where d is a column of the data matrix D. In order to determine the degrees of freedom, we consider that eq 23 may be written as Q d = dT[(I, - uuT) 4- (UUT- UUT)]d (23a) From the defining equation of the SVD we observe that
d w U T d = dTd Therefore, eq 23 becomes
(24)
1010
ANALYTICAL CHEMISTRY, VOL. 56, NO. 6, MAY 1984
which is distributed with x 2 distribution with n - k degrees of freedom. Thus, BLUE of Sd2is
Sd2 =
Qd/(n -
k)
(26)
It is important to consider the differences in degrees of freedom that are obtained when the projection matrix I, UUTis multiplied by tR,m - k , and that obtained when it is multiplied by d, n - k . The projection matrix I, - OUT is the sum of two matrices I, - WTand UUT- UUT,the first of which is the projection on the least-squares null space (i.e., the errors due to the fact that m > n ) and the second is the projection on the null space produced by the nonsignificant eigenvectors that are determined by AFA. When a vector from the data matrix is multiplied by I, - UUT,it will definitely result in zero. This fact is expressed mathematically in eq 23-26. The randomly distributed vector tR, which has not been incorporated in the decomposition,will result in a value different than zero when it is multiplied by I, - UUT. Sylvestre et al. (20) in their account of errors in the data matrix failed to make this distinction; however, they arrived to the "n - k" term by heuristic argument. It is convenient to normalize the columns of D. The normalization is required to get equal variances for all the columns. On the assumption that the variances are independently distributed, we obtain the following estimate for the error in the data matrix:
(28) which is distributed as a Fisher variance ratio with [ m- k ( n - l ) ( n - k ) , ( n - l ) ( n - k ) ] degrees of freedom. There are two limitations on the usage of the variance ratio defined by eq 28. There is always the possibility that the test vector is closer to the true data space than the data itself; therefore, negative values for the variance ratio might be obtained. In cases in which n is large and k is small, we may arrive at the situation that m - k < n(n - k ) ; therefore, STFA2 may not be determined. An approximation that will serve to find a variance ratio with no need to perform subtraction is suggested. It was shown in eq 23-26 that the error in the data matrix is determined by projecting d onto the projection matrix UUT Thus, if the test vector will be projected on the matrix I, - UUT,the resultant two variances are independent, and the variance ratio:
m.
F=
tT(1, - UUT)t
( m - n)(tTt)
(29)
which is distributed as F[m- n, ( n - l ) ( n - k ) ] and is free of the flaws in the variance ratio determined by eq 28. In the model study only the variance ratio calculated from eq 29 is presented.
LITERATURE CITED S D has ~ ( n - l ) ( n - k ) degrees of freedom. Statistical Inference. Figure 1illustrates the geometry of the TFA and the errors associated with it. A statistical inference on the target test vector relies on knowledge of the errors SD2 in the data matrix. In the case that S D 2 is exactly known, the x2test may be used to perform an inference, while in the case that it is not exactly known, it may be estimated from the data matrix and the F Fisher variance ratio test. Although the relationships among the error terms illustrated in Figure 1 are essentially those derived by Malinowski ( 8 , the present account differs in two main respects: The error in the data matrix in this work is calculated by quadratic forms, while Malinowski computed it from the result of pricipal component analysis. The degrees of freedom are different. Malinowski's expression for the variance ratio, the SPOIL function, is different than the variance ratio considered here. In the forthcoming discussion only the test vector on the row space will be considered since the degrees of freedom in determining the variance ratio for the column test vector are not sufficient. A prerequisite for using the F test is that the two variances are independently distributed. Figure 1shows that SmA2(the error due to incompatibility of the test vector) and S Dare ~ independent. Since SD2 is distributed as x2 with (r - l ) ( n k ) degrees of freedom,we may use the property of subtracting two variances distributed as x2 to calculate STFA'.When SWA' is divided by S D 2 , we obtain
(1) Wallace, R. M. J. Phys. Chem. 1960, 64, 899-901. (2) Malinowski, E. R.; Howery, D. G. "Factor Analysis in Chemistry"; Wiley: New York, 1980. (3) Weiner, P. H.; Mallnowski, E. R.; Levinstone, A. R. J. Phys. Chem. 1970, 74, 4537-4542. (4) Roscoe, B. A,: Hopke, P. K. Compuf. Chem. 1981, 5 , 1-7. (5) Rasmussen, G. T.; Hohne, B. A.: Wieboldt, R. C.; Isenhour, T. L. Anal. Chim. Acta 1979, 712, 151-164. (6) Ho, C.-N.; Christian, G. D.: Davidson, E. R. Anal. Chem. 1976, 50, 1108-1 113. (7) Malinowski, E. R. Anal. Chlm. Act8 1978, 103,339-354. (8) Maiinowski, E. R. Anal. Cbim. Acta 1981, 133,99-101. (9) Rao, C. R. "Linear Statistlcal Inference and Its Applications", 2nd ed.; Wiley: New York, 1973. (10) Seber, G. A. F. "Linear Regression Analysis": Wiley: New York, 1977. (11) Lawson, C. L.; Hanson, R. J. "Solving Least Squares Problems"; Prentice-Hall: Englewood, NJ, 1974. (12) Ritter, G. L.; Lowry,S'R.; Isenhour, T. L.; Wilkins. C. L. Anal. Chem. 1976, 48. 591-595. (13) Malinowski, E, R.: McCue, M. Anal. Chem. 1977, 49, 284-287. (14) Stenhagen, E.; Abrahamsson, S.; Mclafferty, F. W. "Atlas of Mass Spktral Data": Wiley: New York, 1969. (15) EPA/NIH. Mass Spectral Data Base, NRDS-NBS63, 1978. (16) Lorber, A,; Goldbbrt, 2. Anal. Chem. 1984, 56,37-43. (17) Lorber, A.; Goldbart, 2. Anal. Chlm. Acta: in press. (18) Lorber, A,; Goldbart, 2 . ; Eidan, M. And. Chem. 1984, 56, 43-48. (19) Kankare, J. J. Anal. Chem. 1970, 42, 1322-1326. (20) Sylvestre, E. A.; Lawton, W. H, Magglo, M. S . Technometrics 1974, (21) Rao, C ' R . Ann. Stat. 1976, 4, 1023-1037. (22) Duewgr, 0. L.; Kowalski, B. R.: Faschlng, J. L. Anal. Chem. 1976, 48, 2602-2010. _.._ -. (23) Marquardt, D. M. Technometflcs 1970, 12, 591-612.
RECEIVED for review June 30, 1983. Accepted December 12, 1983.