
Theory of Error in Factor Analysis

Edmund R. Malinowski

Department of Chemistry and Chemical Engineering, Stevens Institute of Technology, Hoboken, N.J. 07030

A theory of error for abstract factor analysis (AFA) is developed. It is shown that the resulting eigenvalues can be grouped into two sets: a primary set, which contains the true factors together with a mixture of error, and a secondary set, which consists of pure error. Removal of the secondary set from the AFA scheme leads to data improvement. Three types of error are shown to exist: RE, the real error; XE, the extracted error; and IE, the imbedded error. These errors are related in a Pythagorean sense and can be calculated from a knowledge of the secondary eigenvalues, the size of the data matrix, and the number of factors involved. Mathematical models are used to illustrate and verify various facets of the theory.

Factor analysis (FA) is a mathematical technique for solving multidimensional problems of a certain type. Although FA was developed originally to solve psychological and sociological problems (1), it is rapidly gaining importance in various fields of chemistry such as gas-liquid chromatography (2-4), spectrophotometry (5-7), nuclear magnetic resonance (8, 9), mass spectroscopy (10-13), and drug activity (14). This surge in applications was promulgated chiefly by the sudden availability of the digital computers required to carry out the tedious calculations, as well as by an increased awareness of the methodology.


There presently exists a cautious reluctance by the chemical community to become overly excited about FA. Part of this reluctance is due to the fact that deduction of the number of factors operative in a data matrix, the first step in FA, has not been a clear-cut and straightforward procedure. If the data contained no experimental error, deducing the exact number of factors would present no problem to the analyst. Up to the present time, no method has been devised to determine whether or not a data matrix is truly factor analyzable. By investigating the factor space of the experimental error in relation to the true factor space of the data, for the first time since the conception of factor analysis, we have been able to achieve a partial answer to this problem. The details of our deductive reasoning will be presented in two parts. The first part, presented here, describes the underlying theory. The second part, which is presented in the accompanying paper, provides an interpretation of some of the results of the theory together with applications to real chemical problems.

A Synopsis of Abstract Factor Analysis. Factor analysis (FA) is a mathematical technique which attempts to express a data point as a sum of product functions. A perfectly pure data point $d_{ik}^*$ is a point which contains no experimental error and is perfectly factor analyzable in n-space. Such a point can be expressed as a linear sum of n product terms, called factors. Each factor is comprised of a product of cofactors. That is:

$$d_{ik}^* = \sum_{j=1}^{n} r_{ij}^*\, c_{jk}^* \qquad (1)$$

where $r_{ij}^*$ is the jth row cofactor and $c_{jk}^*$ is the jth column cofactor. The sum is taken over all n factors which constitute the complete factor space. In matrix notation, this equation takes the form

$$[D^*] = [R^*][C^*] \qquad (2)$$

where [D*], [R*], and [C*] are matrices with elements $d_{ik}^*$, $r_{ij}^*$, and $c_{jk}^*$, respectively. We will call [D*] the pure data matrix, [R*] the pure row matrix, since this matrix is associated with the row headings of the data matrix, and [C*] the pure column matrix, since this matrix is associated with the column headings of the data matrix. If the data matrix consists of r rows and c columns, its size will be r x c; the size of the row matrix will be r x n; and the size of the column matrix will be n x c.

The problem to be solved by FA is simply the following: starting with the data matrix, obtain [R*] and [C*]. To accomplish this task, FA makes use of the covariance matrix. The covariance matrix is defined as the product of the data matrix premultiplied by its transpose:

$$[Z] = [D]^{\mathrm{T}}[D] \qquad (3)$$

Notice that the superscripted asterisks have been deleted. This is done to emphasize the fact that this equation is perfectly general and applies to any data matrix. Throughout our discussion we make use of "covariance about the origin" and not "covariance about the mean" as most commonly done by others. The covariance matrix is diagonalized by finding a matrix [Q] such that

$$[Q]^{-1}[Z][Q] = [\,\lambda_j\,\delta_{jk}\,] \qquad (4)$$

where $\delta_{jk}$ is the well-known Kronecker delta,

$$\delta_{jk} = \begin{cases} 0 & \text{if } j \neq k \\ 1 & \text{if } j = k \end{cases} \qquad (5)$$

Here $\lambda_j$ is an eigenvalue of the set of equations having the following form:

$$[Z]\,Q_j = \lambda_j\, Q_j \qquad (6)$$

where $Q_j$ is the jth column of matrix [Q]. These columns, called eigenvectors, form an orthonormal set. In reality, the diagonalization matrix [Q] is a valid representation for the column matrix [C]. To prove this we reason as follows. Because [Q] is orthonormal,

$$[D] = [D][Q][Q]^{\mathrm{T}} \qquad (7)$$

where we define [U] = [D][Q]. Upon rearranging Equation 7, we see that

$$[D] = [U][Q]^{\mathrm{T}} \qquad (8)$$

When this is compared to Equation 2, we conclude that

$$[R] = [U] = [D][Q] \qquad (9)$$

and

$$[C] = [Q]^{\mathrm{T}} \qquad (10)$$

The factor analysis problem reduces to simply finding the matrix which diagonalizes the covariance matrix. Since the row and column matrices obtained by the method described above represent abstract mathematical solutions to the problem and cannot, in general, be identified directly with physically significant quantities, we call this procedure abstract factor analysis (AFA). In spite of this shortcoming, AFA can be an extremely useful tool for predicting chemical and physical phenomena, as already shown by several pioneering investigators (2, 8, 10, 12).

Details of Decomposition. Before investigating the effects of experimental error on the FA procedure, it is important for us to fully understand the "decomposition" (diagonalization) of the covariance matrix, because our discussion will rely upon this knowledge. Since the data lie in n-dimensional space, it will be convenient for us to consider a basis set of unit vectors ($U_1, U_2, \ldots, U_n$) which spans the factor space. These vectors are mutually orthonormal. Instead of the scalar representation, $d_{ik}^*$, a data point is better represented as a vector, $\mathbf{D}_{ik}^*$, in n-factor space. Instead of Equation 1, a better representation of the FA problem is given by

$$\mathbf{D}_{ik}^* = \sum_{j=1}^{n} r_{ij}^*\, U_j\, c_{jk}^* \qquad (11)$$

Here $r_{ij}^*$ and $c_{jk}^*$ are the values projected onto the unit basis vector $U_j$. For the $Z_{ab}$ element of the covariance matrix we reason as follows:

$$Z_{ab} = \sum_{i} \mathbf{D}_{ia}^* \cdot \mathbf{D}_{ib}^* = \sum_{i}\Bigl(\sum_{j} r_{ij}^* U_j c_{ja}^*\Bigr)\cdot\Bigl(\sum_{k} r_{ik}^* U_k c_{kb}^*\Bigr) = \sum_{j}\sum_{k}\Bigl(\sum_{i} r_{ij}^* r_{ik}^*\Bigr) c_{ja}^* c_{kb}^*\,(U_j \cdot U_k) = \sum_{j} \lambda_j^* c_{ja}^* c_{jb}^* \qquad (12)$$

The last step is true because $U_j \cdot U_k = \delta_{jk}$. From this equation we readily deduce that the covariance matrix can be "decomposed" into the following sum:

$$[Z] = \lambda_1^* C_1^* C_1^{*\mathrm{T}} + \lambda_2^* C_2^* C_2^{*\mathrm{T}} + \cdots + \lambda_n^* C_n^* C_n^{*\mathrm{T}} \qquad (13)$$

where

$$\lambda_j^* = \sum_{i=1}^{r} (r_{ij}^*)^2 \qquad (14)$$

and

$$C_j^* = (c_{j1}^*,\, c_{j2}^*,\, \ldots,\, c_{jc}^*)^{\mathrm{T}} \qquad (15)$$

Here $C_j^*$ is a column vector with components $c_{j1}^*, c_{j2}^*, \ldots, c_{jc}^*$, and $C_j^{*\mathrm{T}}$ is its transpose, a row vector. Note that $C_j^* C_j^{*\mathrm{T}}$ (known as a dyad) produces a matrix. Furthermore, by studying Equation 13 and comparing it to Equations 6 and 9, we conclude that

$$[C^*] = [\,C_1^*\;\; C_2^*\;\; \cdots\;\; C_n^*\,]^{\mathrm{T}} \qquad (16)$$

The relationships described in Equations 14, 15, and 16 are extremely important in FA. Equation 14 shows clearly that each eigenvalue is simply the sum of the squares of the projections of the row factors onto the respective basis axis.
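The decomposition in Equations 3-16 is easy to verify numerically. The sketch below is not part of the original paper; it is a minimal illustration assuming NumPy, with a small hypothetical two-factor data matrix and variable names of our own choosing. It builds the covariance matrix, diagonalizes it, and shows that the abstract row and column matrices of Equations 9 and 10 reproduce the data.

```python
import numpy as np

# Hypothetical pure data matrix (r = 4 rows, c = 3 columns) built from n = 2 factors,
# D* = R* C*, so its covariance matrix has exactly two nonzero eigenvalues.
R_star = np.array([[1.0, 0.5],
                   [2.0, 1.0],
                   [3.0, 0.0],
                   [4.0, 2.0]])          # r x n pure row matrix
C_star = np.array([[1.0, 2.0, 3.0],
                   [0.5, 0.0, 1.0]])     # n x c pure column matrix
D = R_star @ C_star                      # r x c pure data matrix (Equation 2)

Z = D.T @ D                              # covariance about the origin (Equation 3)
eigvals, Q = np.linalg.eigh(Z)           # columns of Q are orthonormal eigenvectors (Equation 6)
order = np.argsort(eigvals)[::-1]        # sort eigenvalues from largest to smallest
eigvals, Q = eigvals[order], Q[:, order]

R = D @ Q                                # abstract row matrix, [R] = [D][Q] (Equation 9)
C = Q.T                                  # abstract column matrix, [C] = [Q]^T (Equation 10)

print(np.round(eigvals, 6))              # only n = 2 eigenvalues differ from zero (Equation 14)
print(np.allclose(D, R @ C))             # True: the abstract factors reproduce the data
print(np.allclose(np.trace(Z), eigvals.sum()))  # trace equals the sum of the eigenvalues
```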

Extra Factors Produced by Experimental Error. Our development of FA thus far has been concerned with perfectly pure data, containing no experimental error. Such situations rarely exist, and it is important for us to consider what happens in the FA scheme when imperfect data are involved. Experimental error invariably produces extra eigenvalues and eigenvectors which tend to confuse the situation. Retention of unnecessary eigenvalues will result in reproducing experimental error as well as the true data. It is important that we develop some criterion to choose the correct number of eigenvectors. Because of experimental error, a data point $d_{ik}$ is best expressed as a sum of two terms

$$d_{ik} = d_{ik}^* + e_{ik} \qquad (17)$$

where $d_{ik}^*$ is a perfectly pure data point and $e_{ik}$ is the experimental error associated with this particular point. We will learn shortly that even when we do select the proper number of eigenvectors to reproduce the data, it is impossible for us to remove all of the error. Part of the error will mix into the FA scheme. We can express this fact by writing

$$e_{ik} = e_{ik}^{\,\#} + e_{ik}^{\,0} \qquad (18)$$

where $e_{ik}^{\,\#}$ represents that part which mixes into the FA scheme and cannot be removed, and $e_{ik}^{\,0}$ is that part which can be extracted by removing the unwanted extra eigenvectors. Placing Equation 18 into 17 we see that

$$d_{ik} = \bar d_{ik} + e_{ik}^{\,0} \qquad (19)$$

where

$$\bar d_{ik} = d_{ik}^* + e_{ik}^{\,\#} \qquad (20)$$

Here $\bar d_{ik}$ represents the value of a data point as reproduced by FA. Notice that it contains a mixture of error, $e_{ik}^{\,\#}$. From Equation 19 we learn that the extracted error $e_{ik}^{\,0}$ is the difference between the raw data point $d_{ik}$ and the reproduced data point $\bar d_{ik}$.

The raw data matrix is the sum of the pure data matrix [D*] and the error matrix [E]. Although these three matrices have the same size, r x c (r rows and c columns), their ranks (hereafter called dimensions) are different. The pure data matrix is n-dimensional. The error matrix is either r- or c-dimensional, depending upon which is smaller. The raw data matrix has the same dimension as the error matrix. For convenience we will assume throughout our discussion here that c is smaller than r, and hence the raw data and error matrices are both c-dimensional. Since n is less than either c or r, a larger number of eigenvectors is required to span the error space than the pure data space. Any orthogonal set of axes can be used to describe the error space provided that a sufficient number of axes are employed. In fact, we may use the same basis axes to define the error space as used to define the raw data space. To distinguish these axes from all other axes, we will employ a superscript dagger. However, additional orthogonal axes are required since the error space is larger than the pure data space. Hence we may express the error associated with data point $d_{ik}$ as the following linear sum

$$e_{ik} = \sum_{j=1}^{c} \varepsilon_{ij}\, c_{jk}^{\dagger} \qquad (21)$$

$$e_{ik} = \sum_{j=1}^{n} \varepsilon_{ij}\, c_{jk}^{\dagger} + \sum_{j=n+1}^{c} \varepsilon_{ij}^{\,0}\, c_{jk}^{\dagger} \qquad (22)$$

Here $c_{jk}^{\dagger}$ is a component of the jth basis axis, $\varepsilon_{ij}$ is the projection of the ith error onto the jth axis of the factor space, and $\varepsilon_{ij}^{\,0}$ is the projection of the ith error onto the jth axis of the pure error space. Placing Equations 22 and 1 into 17 we obtain

$$d_{ik} = \sum_{j=1}^{n} r_{ij}^{\dagger}\, c_{jk}^{\dagger} + \sum_{j=n+1}^{c} \varepsilon_{ij}^{\,0}\, c_{jk}^{\dagger} \qquad (23)$$

where $r_{ij}^{\dagger}$ is defined as follows:

$$r_{ij}^{\dagger} = r_{ij}^* + \varepsilon_{ij} \qquad (24)$$

By employing a reasoning analogous to that employed previously (see Equations 11, 12, 13, 14, 15, and 16), we conclude that

$$[Z] = \sum_{j=1}^{n} \lambda_j^{\dagger}\, C_j C_j^{\mathrm{T}} + \sum_{j=n+1}^{c} \lambda_j^{\,0}\, C_j C_j^{\mathrm{T}} \qquad (25)$$

where

$$\lambda_j^{\dagger} = \sum_{i=1}^{r} (r_{ij}^* + \varepsilon_{ij})^2 \qquad \text{for } j = 1, \ldots, n \qquad (26)$$

and

$$\lambda_j^{\,0} = \sum_{i=1}^{r} (\varepsilon_{ij}^{\,0})^2 \qquad \text{for } j = n+1, \ldots, c \qquad (27)$$

These expressions show exactly how the error mixes into the eigenvalues which emerge from factor analysis. In Equation 25 the first term on the right involves a sum of n terms. These terms involve the factors associated with the pure data but contain an admixture of error. The second set involves a sum of terms which are associated with pure error. This second set should be deleted from the analysis since its retention will lead to reproduction of experimental error. We note here that if there were no experimental error, CT; and no would be zero, c$ would equal c;h, and, from Equation 24, rcwould equal rl*,as expected. Furthermore, Equation 26 would reduce to Equation 14, as expected, and A:, according to Equation 27, would vanish, as expected. The first set of n eigenvalues A: are called the primary eigenvalues. They consist of the largest eigenvalues. The second set of (c - n ) eigenvalues A: are called the secondary eigenvalues. These secondary eigenvalues are composed of pure error and their removal from the FA scheme will lead to improvement in data reproduction. In order to deduce how the error mixes into the eigenvectors, we express the error in terms of basis axes associated with r$ rather than c$. In other words we write

$$e_{ik} = \sum_{j=1}^{n} r_{ij}^{\dagger}\, \upsilon_{jk} + \sum_{j=n+1}^{c} r_{ij}^{\dagger}\, \upsilon_{jk}^{\,0} \qquad (28)$$

Here $r_{ij}^{\dagger}$ is a component of the jth basis axis, and $\upsilon_{jk}$ and $\upsilon_{jk}^{\,0}$ are the projections of the kth error onto the jth axis of the factor space and pure error space, respectively. Inserting Equations 28 and 1 into Equation 17 we find

$$d_{ik} = \sum_{j=1}^{n} r_{ij}^{\dagger}\, c_{jk}^{\dagger} + \sum_{j=n+1}^{c} r_{ij}^{\dagger}\, \upsilon_{jk}^{\,0} \qquad (29)$$

In the last step we define $c_{jk}^{\dagger}$ as follows:

$$c_{jk}^{\dagger} = c_{jk}^* + \upsilon_{jk} \qquad (30)$$

Equation 30 reveals precisely how the experimental error perturbs the components of the eigenvectors which emerge from FA. We note here that if there were no error, $\upsilon_{jk}$ and $\upsilon_{jk}^{\,0}$ would be zero, $r_{ij}^{\dagger}$ would equal $r_{ij}^*$, and, hence, $c_{jk}^{\dagger}$ would equal $c_{jk}^*$, as expected.

Eigenvalues and Error. Previously we learned that if the data matrix contains no error, the covariance matrix can be decomposed into a sum of n factors (see Equation 13). However, due to experimental error, FA invariably yields an excessive number of factors, equal to r or c, whichever is smaller (see Equation 25). Because the trace of the covariance matrix is invariant upon the similarity transformation expressed by an equation analogous to Equation 4, we conclude that

$$\sum_{i=1}^{r}\sum_{k=1}^{c} d_{ik}^2 = \mathrm{trace}\,[Z] = \sum_{j=1}^{c} \lambda_j \qquad (31)$$

In other words, the sum of the squares of all the data points equals the sum of the eigenvalues. These eigenvalues can be separated into two groups:

$$\sum_{i=1}^{r}\sum_{k=1}^{c} d_{ik}^2 = \sum_{j=1}^{n} \lambda_j^{\dagger} + \sum_{j=n+1}^{c} \lambda_j^{\,0} \qquad (32)$$

The first group on the right is the sum of the primary eigenvalues $\lambda_j^{\dagger}$, which contain some error imbedded with the true factors. The second group is the sum of the secondary eigenvalues $\lambda_j^{\,0}$, which are composed solely of experimental error. The removal of this latter group will lead to overall improvement in data reproduction. For this reason we wish to develop a criterion for deducing which eigenvalues belong to the primary and secondary sets. From the primary eigenvectors, the eigenvectors associated with the primary eigenvalues, we can regenerate the data, which we will label $\bar d_{ik}$. From the same arguments used to obtain Equation 31, we obtain, in this case, the following relationship:

$$\sum_{i=1}^{r}\sum_{k=1}^{c} \bar d_{ik}^{\,2} = \mathrm{trace}\,[\bar Z] = \sum_{j=1}^{n} \lambda_j^{\dagger} \qquad (33)$$

Subtracting Equation 33 from Equation 31 we obtain

$$\sum_{i=1}^{r}\sum_{k=1}^{c} \bigl(d_{ik}^2 - \bar d_{ik}^{\,2}\bigr) = \sum_{j=n+1}^{c} \lambda_j^{\,0} = \sum_{i=1}^{r}\sum_{j=n+1}^{c} (\varepsilon_{ij}^{\,0})^2 \qquad (34)$$

where $\varepsilon_{ij}^{\,0}$ is that part of the experimental error which can be extracted from the FA scheme by simply deleting the secondary eigenvectors. For convenience we will call this sum the extracted error (XE). Although the pure data and raw data matrices are not orthogonal to the experimental error matrix, it can be shown that the FA reproduced data matrix $[\bar D]$ is orthogonal to its associated residual error matrix $[E^0]$. That is

$$[\bar D]^{\mathrm{T}}[E^0] = [0] \qquad (35)$$

where

$$[D] = [\bar D] + [E^0] \qquad (36)$$

Residual Standard Deviation-The Real Error. The residual standard deviation (RSD) provides us with an excellent criterion for deducing the primary eigenvalues. The residual standard deviation is defined by the following expression:

$$r(c - n)(\mathrm{RSD})^2 = \sum_{i=1}^{r}\sum_{j=n+1}^{c} (\varepsilon_{ij}^{\,0})^2 \qquad (37)$$

Figure 1. Mnemonic diagram of the Pythagorean relationship between the real error (RE), the extracted error (XE), and the imbedded error (IE), and their relationships to the pure, raw, and FA reproduced data.

The sum on the right represents the sum of the squares of the projections of the error points onto the secondary eigenvector axes only. Since these axes involve pure error components, the (RSD) is a measure of the real error (RE), the difference between the pure data, possessing no error, and the experimental data, possessing experimental error. Placing Equation 34 into Equation 37 and rearranging the result, we find

$$(\mathrm{RSD}) = \left[\frac{\sum_{j=n+1}^{c} \lambda_j^{\,0}}{r(c-n)}\right]^{1/2} \qquad (38)$$

This equation can be used to deduce which eigenvalues belong to the primary and secondary sets. To start the deductive process, we first assume that only the single largest eigenvalue belongs to the primary set. The (RSD) is calculated by summing over all the remaining eigenvalues in accordance with the expression above. If the calculated (RSD) approximately equals the estimated error, we conclude that only one factor is operating and our problem is solved. If, however, the calculated (RSD) is much greater than our estimated experimental error, then we have not chosen a sufficient number of eigenvectors to represent the primary set. In this case we calculate the (RSD) assuming that the two largest eigenvalues belong to the primary set. Again we compare the (RSD) to the estimated error. We continue in this fashion, systematically transferring the largest eigenvalues, one at a time, from the secondary set to the primary set, until the resulting (RSD) approximately equals our estimation of the error. When this occurs we will have found which eigenvalues belong to the primary and secondary sets, and our problem is solved.
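The eigenvalue-transfer procedure just described is easy to automate once the eigenvalues are in hand. The following is a hypothetical illustration only (assuming NumPy; the function name, the eigenvalues, and the error estimate are invented for the example, not taken from the paper) of computing the RSD of Equation 38 for successively larger primary sets.

```python
import numpy as np

def deduce_n(eigenvalues, r, c, estimated_error):
    """Return the smallest number of primary factors n for which the residual
    standard deviation of Equation 38 drops to the estimated experimental error."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # largest eigenvalue first
    for n in range(1, c):
        rsd = np.sqrt(lam[n:].sum() / (r * (c - n)))            # sum over the secondary set only
        if rsd <= estimated_error:
            return n, rsd
    return c, 0.0   # all factors retained; the data are then reproduced exactly

# Hypothetical eigenvalues from a 10 x 4 raw data matrix
lam = [850.0, 120.0, 0.31, 0.27]
n, rsd = deduce_n(lam, r=10, c=4, estimated_error=0.2)
print(n, round(rsd, 3))   # 2 0.17 -> two primary eigenvalues; RSD of the secondary set ~ 0.17
```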

Root-Mean-Square Error-The Extracted Error. The root-mean-square error is defined by the following expression:

$$rc\,(\mathrm{RMS})^2 = \sum_{i=1}^{r}\sum_{k=1}^{c} (e_{ik}^{\,0})^2 \qquad (39)$$

where $e_{ik}^{\,0}$ is the difference between the raw data point $d_{ik}$ and the FA regenerated data point $\bar d_{ik}$ (see Equation 19). That is,

$$e_{ik}^{\,0} = d_{ik} - \bar d_{ik} \qquad (40)$$

We now make the following sequence of observations:

$$d_{ik}^2 = (\bar d_{ik} + e_{ik}^{\,0})^2 = \bar d_{ik}^{\,2} + 2\,\bar d_{ik}\,e_{ik}^{\,0} + (e_{ik}^{\,0})^2$$

$$\sum_{i}\sum_{k} d_{ik}^2 = \sum_{i}\sum_{k} \bar d_{ik}^{\,2} + 2\sum_{i}\sum_{k} \bar d_{ik}\,e_{ik}^{\,0} + \sum_{i}\sum_{k} (e_{ik}^{\,0})^2$$

However, Equation 35 tells us that $\sum_i\sum_k \bar d_{ik}\,e_{ik}^{\,0} = 0$. Therefore, we conclude that

$$\sum_{i}\sum_{k} (e_{ik}^{\,0})^2 = \sum_{i}\sum_{k} d_{ik}^2 - \sum_{i}\sum_{k} \bar d_{ik}^{\,2} \qquad (41)$$

Placing this into Equation 39 and recalling Equation 34, we find

$$(\mathrm{RMS}) = \left[\frac{\sum_{j=n+1}^{c} \lambda_j^{\,0}}{rc}\right]^{1/2} \qquad (42)$$

Upon comparing this to Equation 38, we see that

$$(\mathrm{RMS}) = \left(\frac{c-n}{c}\right)^{1/2} (\mathrm{RSD}) \qquad (43)$$

Since n is less than c, the (RMS) is less than the (RSD). In fact, the (RMS) measures the amount of error which is extracted, as we shall learn shortly.
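Equations 42 and 43 imply that the RMS error of reproduction can be obtained either directly from the residuals d - d-bar or from the secondary eigenvalues alone. A minimal numerical check, assuming NumPy and a randomly generated two-factor model (all names and numbers here are hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
r, c, n = 50, 4, 2
D_pure = rng.normal(size=(r, n)) @ rng.normal(size=(n, c))   # hypothetical pure data, n factors
D = D_pure + rng.normal(scale=0.05, size=(r, c))             # raw data with random error

lam, Q = np.linalg.eigh(D.T @ D)
order = np.argsort(lam)[::-1]
lam, Q = lam[order], Q[:, order]

D_bar = D @ Q[:, :n] @ Q[:, :n].T                            # data reproduced from n primary factors
rms_direct = np.sqrt(np.mean((D - D_bar) ** 2))              # RMS from the reproduction residuals
rms_eigen = np.sqrt(lam[n:].sum() / (r * c))                 # Equation 42
rsd = np.sqrt(lam[n:].sum() / (r * (c - n)))                 # Equation 38
print(np.isclose(rms_direct, rms_eigen))                     # True
print(np.isclose(rms_eigen, np.sqrt((c - n) / c) * rsd))     # True: Equation 43 holds
```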

Pythagorean Relationship between (RE), (IE), and (XE). Let us apply our mathematical reasoning to the experimental error matrix instead of the raw data matrix. We first form a covariance matrix from the error matrix. We then decompose this matrix into a sum of dyads (outer products of the eigenvectors; i.e., a row eigenvector premultiplied by a column eigenvector to yield a matrix) times their respective eigenvalues, obtaining an expression analogous to Equation 13.

Table I. Artificial (One-Dimensional) Pure Data Matrix, Error Matrix, Raw Data Matrix, and FA Reproduced Raw Data Matrix

                Pure data matrix     Error matrix        Raw data matrix         FA reproduced raw data matrix
                [D*]                 [E]                 [D] = [D*] + [E]        [D-bar] = [R][C]
                Column 1  Column 2   Column 1  Column 2  Column 1  Column 2      Column 1  Column 2
                   1         1          0.2       0.0       1.2       1.0         1.0936    1.1052
                   2         2         -0.2      -0.2       1.8       1.8         1.7904    1.8095
                   3         3         -0.1       0.1       2.9       3.1         2.9846    3.0163
                   4         4          0.0      -0.1       4.0       3.9         3.9288    3.9705
                   5         5         -0.1       0.0       4.9       5.0         4.9240    4.9763
                   6         6          0.2      -0.2       6.2       5.8         5.9671    6.0305
                   7         7          0.2      -0.1       7.2       6.9         7.0118    7.0862
                   8         8         -0.2       0.1       7.8       8.1         7.9086    7.9926
                   9         9         -0.2       0.1       8.8       9.1         8.9033    8.9978
                  10        10         -0.1       0.2       9.9      10.2         9.9974   10.1036

Table 11. Transpose of the Column-Factor Matrices Resulting from FA

IO

From raw data matrix,

From pure data matrix, [C*lT,

Column 2

9 -

z

8 -

0

ElT

I

Eigenvector 1

Eigenvector 1

Eigenvector 2

0.7071068 0.7071068

0.7033625 0.7 108314

0.7108314 -0.7033625

-

W 2

Table 111. Row-Factor Matrices from FA

From pure data matrix,

From raw data matrix,

B*l,

PI

Factor 1

Factor 1

Factor 2

1.41421 2.82843 4.24264 5.65685 7.07107 8.48528 9.89950 11.31371 12.72792 14.14214

1.55487 2.54555 4.24333 5.58569 7.00063 8.48367 9.96895 11.24396 12.65816 14.21377

0.14964 0.01344 -0.11901 0.10021 -0.03374 0.32765 0.26479 -0.15275 -0.14528 -0.13707
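Tables II and III (and the reproduced data of Table I) can be regenerated by diagonalizing the 2 x 2 covariance matrix of the raw data in Table I. A minimal sketch assuming NumPy (not part of the original paper; note that the eigenvectors may emerge with their overall signs reversed, since an eigenvector is defined only up to sign):

```python
import numpy as np

# Raw data matrix of Table I (10 rows x 2 columns)
D = np.array([[1.2,  1.0], [1.8,  1.8], [2.9,  3.1], [4.0,  3.9], [4.9,  5.0],
              [6.2,  5.8], [7.2,  6.9], [7.8,  8.1], [8.8,  9.1], [9.9, 10.2]])

lam, Q = np.linalg.eigh(D.T @ D)          # covariance about the origin, then diagonalization
order = np.argsort(lam)[::-1]             # primary (largest) eigenvalue first
lam, Q = lam[order], Q[:, order]

print(np.round(lam, 4))                   # approximately [767.1514, 0.2886]
print(np.round(Q.T, 7))                   # rows are the eigenvectors of Table II (up to sign)
print(np.round(D @ Q, 5))                 # row-factor matrix [R] of Table III
print(np.round(D @ Q[:, :1] @ Q[:, :1].T, 4))   # FA reproduced raw data, last columns of Table I
```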

In fact, we can use the same eigenvectors to describe the error space as used to describe the raw data space. The resulting eigenvalues, however, would be composed of error components rather than raw data components. Instead of Equation 32, we would obtain the following expression:

$$\sum_{i=1}^{r}\sum_{k=1}^{c} e_{ik}^2 = \sum_{j=1}^{n} \lambda_j^{e} + \sum_{j=n+1}^{c} \lambda_j^{e0} \qquad (44)$$

where

$$\lambda_j^{e} = \sum_{i=1}^{r} \varepsilon_{ij}^2 \qquad (45)$$

and

$$\lambda_j^{e0} = \sum_{i=1}^{r} (\varepsilon_{ij}^{\,0})^2 \qquad (46)$$

Since the trace of the error covariance matrix is invariant upon the similarity transformation, we conclude that the term on the left side of Equation 44 is the sum of the squares of all of the error points in the error matrix. Thus we learn from Equations 44, 45, and 46 that

$$\sum_{i=1}^{r}\sum_{k=1}^{c} e_{ik}^2 = \sum_{i=1}^{r}\sum_{j=1}^{n} \varepsilon_{ij}^2 + \sum_{i=1}^{r}\sum_{j=n+1}^{c} (\varepsilon_{ij}^{\,0})^2 \qquad (47)$$

The term on the left represents the sum of the squares of all of the errors. The first term on the right involves the projections of the errors onto the primary axes, and the second term on the right involves the projections of the errors onto the secondary axes. Each of these three terms is intimately related to the residual standard deviation:

$$rc\,(\mathrm{RSD})^2 = \sum_{i=1}^{r}\sum_{k=1}^{c} e_{ik}^2 \qquad (48)$$

$$rn\,(\mathrm{RSD})^2 = \sum_{i=1}^{r}\sum_{j=1}^{n} \varepsilon_{ij}^2 \qquad (49)$$

$$r(c-n)(\mathrm{RSD})^2 = \sum_{i=1}^{r}\sum_{j=n+1}^{c} (\varepsilon_{ij}^{\,0})^2 \qquad (50)$$

Inserting these equations into Equation 47 and rearranging, we find, as expected,

$$(\mathrm{RSD})^2 = \left(\frac{n}{c}\right)(\mathrm{RSD})^2 + \left(\frac{c-n}{c}\right)(\mathrm{RSD})^2 \qquad (51)$$

This expression has an interesting and informative interpretation when we rewrite it in the following Pythagorean formulation:

$$(\mathrm{RE})^2 = (\mathrm{IE})^2 + (\mathrm{XE})^2 \qquad (52)$$

Table IV. Summary of Model Data and the Results of Factor Analysis

            Artificial data and artificial error               Results of factor analyzing the impure data using n factors
  r x c     Range of pure data   Range of error   RMS of error    RE       IE       RMS(d-bar - d*)(a)
  10 x 2    1 to 10              -0.2 to 0.2      0.148           0.170    0.120    0.087
  16 x 5    2 to 170             -0.08 to 0.08    0.041           0.041    0.026    0.026
  16 x 5    2 to 170             -0.99 to 0.91    0.566           0.511    0.323    0.381
  15 x 5    0 to 27              -1.9 to 2.0      1.040           0.962    0.608    0.726
  10 x 6    0 to 32              -0.9 to 0.9      0.412           0.376    0.266    0.311
  16 x 9    -581 to 955          -1.0 to 1.0      0.548           0.499    0.333    0.398
  10 x 9    -137 to 180          -1.0 to 1.0      0.463           0.372    0.277    0.400

(a) RMS of the difference between the FA reproduced data and the pure data, $[\sum\sum(\bar d_{ik} - d_{ik}^*)^2/rc]^{1/2}$.

where

$$(\mathrm{RE}) = (\mathrm{RSD}) \qquad (53)$$

$$(\mathrm{IE}) = \left(\frac{n}{c}\right)^{1/2} (\mathrm{RSD}) \qquad (54)$$

$$(\mathrm{XE}) = \left(\frac{c-n}{c}\right)^{1/2} (\mathrm{RSD}) \qquad (55)$$

Here (RE) is the real error, the difference between the pure data and the raw data; (IE) is the imbedded error, the difference between the pure data and the FA reproduced data; and (XE) is the extracted error, the difference between the FA reproduced data and the raw data. As a mnemonic, these relationships are portrayed in Figure 1. A very interesting and important feature of abstract factor analysis now becomes apparent. Since c > n, from Equations 53 and 54, we see that (IE) < (RE). The imbedded error, the error remaining in the FA reproduced data matrix, will always be less than the real error. Hence abstract factor analysis should always lead to data improvement. This fact, in itself, is extremely valuable.

Using Correlation about the Origin. The theoretical equations developed here were derived on the basis of covariance about the origin. Many factor analysts prefer to use correlation about the origin. This involves normalizing each data column vector so that its dot product equals unity. The purpose of this procedure is to give equal statistical weight to each column of data. When we use correlation rather than covariance, we obtain theoretical equations analogous to Equations 48 through 55; however, instead of Equation 38, the residual standard deviation takes the following form:

$$(\mathrm{RSD}) = \left[\frac{\Bigl(\sum_{i=1}^{r}\sum_{k=1}^{c} d_{ik}^2\Bigr)\Bigl(\sum_{j=n+1}^{c} \lambda_j^{\,0}\Bigr)}{rc(c-n)}\right]^{1/2} \qquad (56)$$
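Equations 53-55 suggest a small utility that converts the secondary eigenvalues of a data matrix directly into error estimates. The sketch below is an illustration only (assuming NumPy; the function name is ours, not the paper's); applying it to the one-dimensional model of Table I reproduces the RE and IE values quoted in the following section.

```python
import numpy as np

def error_estimates(eigenvalues, n, r, c):
    """Real, imbedded, and extracted error from the secondary eigenvalues
    (Equations 38 and 53-55); eigenvalues must be sorted, largest first."""
    lam = np.asarray(eigenvalues, dtype=float)
    rsd = np.sqrt(lam[n:].sum() / (r * (c - n)))   # RSD = RE (Equation 53)
    ie = np.sqrt(n / c) * rsd                      # imbedded error (Equation 54)
    xe = np.sqrt((c - n) / c) * rsd                # extracted error (Equation 55)
    return rsd, ie, xe

# One-dimensional model of Table I: r = 10, c = 2, n = 1, secondary eigenvalue 0.2886124
re, ie, xe = error_estimates([767.1514, 0.2886124], n=1, r=10, c=2)
print(round(re, 3), round(ie, 3), round(xe, 3))    # 0.17 0.12 0.12 (RE ~ 0.170, IE ~ 0.120)
```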

A Simple Mathematical Example. In order to fully appreciate the proposed error theory, it is helpful to study a simple artificial set of model data. A one-dimensional factor space represents the simplest case, requiring a minimum of effort on our part while illustrating the important features. The first two columns of Table I represent the elements of our artificially created data matrix. This 10 x 2 matrix is a pure data matrix possessing no experimental error. It is obviously one-dimensional since a plot of the elements in the first column against the corresponding elements in the second column yields a series of points which fall exactly on a straight line of unit slope. Factor analysis of this pure data matrix yields only one eigenvalue and one eigenvector, namely, $\lambda_1 = 770$ and $C_1^{\mathrm{T}} = (0.7071068,\ 0.7071068)$.

We next create an error matrix composed of a random set of numbers, both positive and negative, scattered about zero. The components of this matrix are shown in Table I. When these error points are added, matrix-wise, to the pure data, the raw data shown in Table I are obtained. The raw data matrix emulates a real situation, analogous to experimental data which contain random experimental error. When the elements of the first column of the raw data are plotted against the corresponding elements of the second column, the points do not fall on the line of unit slope (see Figure 2). Instead, the points are scattered so that the least-squares line, which passes through the origin, has a slope equal to 1.00985066. This line represents the primary eigenvector which emerges from FA. A measurable secondary eigenvalue and eigenvector also emerge from the analysis. In fact, we find $\lambda_1 = 767.1514$ with $C_1^{\mathrm{T}} = (0.7033625,\ 0.7108314)$, and $\lambda_2 = 0.2886124$ with $C_2^{\mathrm{T}} = (0.7108314,\ -0.7033625)$.

These eigenvectors constitute the columns of the column-factor matrix. Table II shows a comparison between the column-factor matrices associated with the pure data and the raw data. The row-factor matrices produced by FA of the pure data and the raw data are listed in Table III. One can show, by some tedious calculations, that all of the points listed in the second and third columns of Tables II and III obey Equations 24 and 30. This fact lends credence to the proposed theory of error because these numbers were obtained by a completely different route involving principal component analysis.

Concerning the raw data matrix, the values in Table III represent the perpendicular projections of the raw data points onto the respective eigenvector axes shown in Figure 2. Each eigenvalue represents the sum of the squares of the projections of the data points onto the corresponding eigenvector axis. The introduction of error perturbs the primary axis. The primary eigenvalue is no longer 770 but 767.1514. Also, the raw data points are scattered about this axis, and a secondary eigenvector, perpendicular to the primary axis, emerges from FA. The sum of the squares of the projections of the data points onto the secondary axis is no longer zero, as for the pure data, but is 0.2886124.

According to the proposed theory of error, the real error (RE) can be calculated strictly from a knowledge of the secondary eigenvalues which are produced by FA, as stated by Equation 38. Applying this equation to our one-dimensional model problem, for which n = 1, r = 10, and c = 2, we predict that the real error should be 0.170. This prediction is in agreement with the true RMS error, 0.148, calculated from the points in the error matrix of Table I. Perfect agreement should not be expected because the number of points in the error matrix is too small to follow high-probability statistics. The theory of error predicts that factor analysis should always lead to data improvement.


A visual point-by-point inspection of Table I shows that, in general, the FA reproduced raw data emulate the pure data better than the raw data, which were used in the factor analysis to generate the reproduced data. Equation 54 quantitatively predicts the amount of error which remains in the reproduced data. Since there are two columns of data and only one primary factor, c = 2 and n = 1, Equation 54 predicts that the imbedded error (IE) is 0.120. This error is less than the real error. Furthermore, this error compares favorably with the root-mean-square of the differences between the pure data and the reproduced data, namely 0.087.

Other model sets of data having different dimensionalities, sizes, and errors were constructed and factor analyzed. A brief summary of these models and the results of FA are given in Table IV. In each case the RMS of the artificial error compares favorably with the (RE) predicted by means of Equation 38. Also, in each case, the (IE) calculated from Equation 54 compares favorably with the RMS of the differences between the pure and reproduced data. This lends further support to the validity of the theory.
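The data-improvement claim can be checked directly for the model of Table I: the RMS deviation of the FA reproduced data from the pure data should be smaller than that of the raw data. A minimal sketch, assuming NumPy and using the Table I matrices (not code from the original paper):

```python
import numpy as np

D_pure = np.column_stack([np.arange(1, 11), np.arange(1, 11)]).astype(float)  # pure data of Table I
E = np.array([[ 0.2, 0.0], [-0.2,-0.2], [-0.1, 0.1], [ 0.0,-0.1], [-0.1, 0.0],
              [ 0.2,-0.2], [ 0.2,-0.1], [-0.2, 0.1], [-0.2, 0.1], [-0.1, 0.2]])  # error matrix
D = D_pure + E                                        # raw data matrix

lam, Q = np.linalg.eigh(D.T @ D)
Q1 = Q[:, [np.argmax(lam)]]                           # primary eigenvector only (n = 1)
D_bar = D @ Q1 @ Q1.T                                 # FA reproduced raw data

rms = lambda X: np.sqrt(np.mean(X ** 2))
print(round(rms(D - D_pure), 3))       # 0.148  real error actually present in the raw data
print(round(rms(D_bar - D_pure), 3))   # 0.087  error remaining after factor analysis
print(round(rms(D - D_bar), 3))        # 0.12   extracted error, in line with XE = 0.120
```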

ACKNOWLEDGMENT

The author extends his thanks to Harry Rozyn, supported by the College Work Study Program, for his assistance with the computer program, and to John Petchul, participant in the "Undergraduate Projects in Technology and Medicine" of Stevens Institute of Technology, for his help in carrying out many of the calculations.

LITERATURE CITED

(1) R. J. Rummel, "Applied Factor Analysis", Northwestern University Press, Evanston, Ill., 1970.
(2) P. T. Funke, E. R. Malinowski, D. E. Martire, and L. Z. Pollara, Sep. Sci., 1, 661 (1966).
(3) P. H. Weiner, C. J. Dack, and D. G. Howery, J. Chromatogr., 69, 249 (1972).
(4) R. B. Selzer and D. G. Howery, J. Chromatogr., 115, 139 (1975).
(5) J. J. Kankare, Anal. Chem., 42, 1322 (1970).
(6) Z. Z. Hugus, Jr., and A. A. El-Awady, J. Phys. Chem., 75, 2954 (1971).
(7) J. T. Bulmer and H. F. Shurvell, Can. J. Chem., 53, 1251 (1975).
(8) P. H. Weiner, E. R. Malinowski, and A. R. Levinstone, J. Phys. Chem., 74, 4537 (1970).
(9) P. H. Weiner and E. R. Malinowski, J. Phys. Chem., 75, 1207 (1971); 75, 3160 (1971).
(10) R. W. Rozett and E. McLaughlin Petersen, Anal. Chem., 47, 1301 (1975); 47, 2377 (1975); 48, 817 (1976).
(11) J. B. Justice, Jr., and T. L. Isenhour, Anal. Chem., 47, 2286 (1975).
(12) G. L. Ritter, S. R. Lowry, T. L. Isenhour, and C. L. Wilkins, Anal. Chem., 48, 591 (1976).
(13) E. R. Malinowski and M. McCue, Anal. Chem., 49, 284 (1977).
(14) M. L. Weiner and P. H. Weiner, J. Med. Chem., 16, 655 (1973).

RECEIVED for review September 27, 1976. Accepted January 17, 1977. Presented, in part, on September 2, 1976, at the Symposium on Chemometrics: Theory and Applications during the American Chemical Society Centennial-California Section Diamond Jubilee Meeting in San Francisco, Calif.

Determination of the Number of Factors and the Experimental Error in a Data Matrix

Edmund R. Malinowski

Department of Chemistry and Chemical Engineering, Stevens Institute of Technology, Hoboken, N.J. 07030

An imbedded error function and an indicator function, calculated solely from the eigenvalues which result from subjecting a data matrix to abstract factor analysis, are used to determine not only the number of controlling factors but also the root mean square of the experimental error, without any a priori knowledge of the error. Model data are used to illustrate the behavior of these functions. The method is applied to problems of interest to chemists, involving nuclear magnetic resonance, absorption spectroscopy, mass spectra, gas-liquid chromatography, and drug activity.

Factor analysis (FA), a computer technique for solving multidimensional problems, is rapidly gaining importance in chemistry (1-9). Factor analysis is based upon expressing a data point as a linear sum of product functions. Abstract factor analysis (AFA) is concerned only with data reproduction using the abstract factors which result from the analysis. The first step in the AFA process, called principal component analysis (PCA), involves determining the number of factors in the sum. In order to do this properly, all methods, except the currently proposed methods, require a knowledge of the experimental error. Unfortunately, such information is not always available, and oftentimes it is incorrect, leading to erroneous conclusions. We wish to describe two methods, based on covariance FA, which allow us to deduce not only the size of the factor space but also the experimental error without any a priori knowledge of the error.


The two methods are based upon the Theory of Error for Abstract Factor Analysis (9). They depend only upon a knowledge of the eigenvalues which result from AFA. The first method utilizes the concept of the imbedded error (IE), the error which cannot be removed by AFA. The second method involves the use of an indicator (IND) function. This function is empirical and its true origin is not fully understood at the present time. Both methods are not applicable to all types of data matrices; they are limited to matrices which contain a relatively uniform error throughout, free from systematic and erratic errors.

The fact that one can reproduce a set of data within experimental error does not guarantee that the problem is factor analyzable. False solutions often arise. For example, one can subject a totally random matrix of numbers to factor analysis and obtain factors which reproduce the data within a specified degree of accuracy, even though the data are not truly sums of product functions. The IE and IND functions can also be used to determine whether or not a given data matrix is genuinely factor analyzable. At the present time, no other method exists for weeding out those matrices which are not truly factor analyzable.

PRELIMINARY DISCUSSION

Recently, Malinowski (9) developed a theory of error concerning abstract factor analysis. Using "covariance about the origin", he was able to show how the experimental error mixes