Improving the Reliability of Factor Analysis of Chemical Data by Utilizing the Measured Analytical Uncertainty

D. L. Duewer and B. R. Kowalski*
Laboratory for Chemometrics, Department of Chemistry, University of Washington, Seattle, Wash. 98195

J. L. Fasching
Department of Chemistry, University of Rhode Island, Kingston, R.I. 02881

A procedure for incorporating measured analytical uncertainty into data analysis methodology is discussed, with particular reference to factor analysis. The suitability of various dispersion matrices and matrix rank determination criteria for data having analytical uncertainty is investigated. A criterion useful for judging the number of factors insensitive to analytical uncertainty is presented. A model data structure for investigating the behavior of factor analysis techniques in a known, controlled manner is described and analyzed. A chemically interesting test data base having analytical uncertainty is analyzed and compared with the model data. The limit to meaningful factor analysis for a portion of the API Project 44 low resolution mass spectral data is explored.

The multivariate statistical techniques of "factor analysis" have recently been used to study a variety of problems of interest in analytical chemistry, including environmental (1), chromatographic (2), nuclear magnetic resonance (3), and mass spectral data analysis (4, 5). The methodologies and computer techniques used in these studies have mostly been adapted from procedures developed for use in the social sciences (especially psychology) dealing with data obtained from populations tractable to statistical analysis. Uncertainties in data of this type are generally assumed to "cancel out" if a suitably large sample of the population is examined; as a consequence, no specific treatment of data uncertainty has been developed for use with factor analysis. Current work in our laboratories on the incorporation of analytical uncertainty into the methodologies implemented in our data analysis system ARTHUR (6) indicates that consideration of such uncertainty could be a powerful aid in the use of factor analysis with chemical data. Previous work in "matrix rank determination" which explicitly utilized data uncertainty strongly supports this conclusion (7). As a test of this concept, we have examined a model data set and two data sets of chemical interest incorporating "real" data uncertainties. The following sections of this paper briefly review the goals and techniques of factor analysis, discuss how data uncertainty may be incorporated into general data analysis methodologies, define the data sets considered, present the results and implications of the analysis as applied to each and, finally, give the summary results for the complete study.

FACTOR ANALYSIS

The primary goals of all factor analysis methodologies are "reducing the dimensionality of a matrix of observations", finding "a particular method of analysis and reduction of dimensions which will yield relatively stable and invariant results . . .", and, ultimately, to "facilitate interpretation of the phenomena . . . one is investigating" (8). Any technique which attempts to achieve these goals using just the information redundancy among a body of "measurements" taken on a group of "samples" can be considered as "factor analysis"; however, the term is normally applied to methodologies which involve eigenanalysis of a dispersion matrix. The characteristic eigenvalues and eigenvectors (the lengths and directions of the principal components or axes) of the data are typically the starting point for further analysis, such as varimax, quartimax, oblimax, and target rotation (9). Unfortunately, the amount and character of the information used in the subsequent analysis are determined by the initial eigenanalysis. (Recent work on determining the number of "interpretable factors" appears to alleviate this problem (10).) In light of the importance of eigenanalysis to factor analysis, and the limits it imposes on subsequent interpretation-aiding methodologies, we have chosen to limit this study to the effects of data uncertainty on eigenanalysis.

In keeping with the terminology used in chemical pattern recognition, "samples" and "measurements" will be used instead of the social science oriented "entities", "characteristics" (or "attributes"), and/or "occasions". We feel that this terminology is much more general and appropriate for applications in analytical chemistry. The term "feature" will be used to refer to any transformation and/or combination of measurement(s). A given feature j of a given sample i will be designated x_ij. The n features used to describe sample i define a "data vector" ("pattern" in the older chemical pattern recognition literature) which is designated x_i. The m data vectors form the m × n "data matrix" [X]. The dispersion matrix most commonly used in factor analysis is the n × n correlation matrix ("correlation about the mean") [Rm], where each component is defined

\[ r_{jk}^{m} = \frac{\sum_{i=1}^{m} (x_{ij} - \bar x_j)(x_{ik} - \bar x_k)}{\left[ \sum_{i=1}^{m} (x_{ij} - \bar x_j)^2 \, \sum_{i=1}^{m} (x_{ik} - \bar x_k)^2 \right]^{1/2}} \]

Recently, it has been argued that for some analytical data, in particular mass spectral data of similar compounds, dispersion matrices which do not normalize the mean and/or the variance of each measurement give a more complete representation of the information implicit in the data (11). We have investigated the effect of data uncertainty on three such matrices in addition to the correlation matrix: the scatter ("covariance about the origin") [Co]

\[ c_{jk}^{o} = \sum_{i=1}^{m} x_{ij} \, x_{ik} \]

the covariance ("covariance about the mean") [Cm]

\[ c_{jk}^{m} = \sum_{i=1}^{m} (x_{ij} - \bar x_j)(x_{ik} - \bar x_k) \]

and the "correlation about the origin" [Ro]

\[ r_{jk}^{o} = \frac{\sum_{i=1}^{m} x_{ij} \, x_{ik}}{\left[ \sum_{i=1}^{m} x_{ij}^2 \, \sum_{i=1}^{m} x_{ik}^2 \right]^{1/2}} \]
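The four matrices are easy to compute from a samples-by-features data matrix. The following NumPy sketch is ours, not taken from the authors' ARTHUR system; it uses unnormalized sums, since a constant factor such as 1/m or 1/(m − 1) changes neither the eigenvector directions nor the relative eigenvalues.

```python
import numpy as np

def dispersion_matrices(X):
    """The four n x n dispersion matrices for an m x n data matrix X."""
    Xc = X - X.mean(axis=0)        # remove feature means
    Co = X.T @ X                   # scatter: "covariance about the origin"
    Cm = Xc.T @ Xc                 # "covariance about the mean"
    so = np.sqrt(np.diag(Co))      # feature norms about the origin
    sm = np.sqrt(np.diag(Cm))      # feature norms about the mean
    Ro = Co / np.outer(so, so)     # "correlation about the origin"
    Rm = Cm / np.outer(sm, sm)     # correlation matrix ("about the mean")
    return Co, Cm, Ro, Rm
```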


Eigenanalysis solves for the n eigenvalues, λ_j, and eigenvectors, v_j, characteristic of the chosen dispersion matrix. With the numerical analysis formalism we use, the eigenvectors are sorted in order of decreasing associated eigenvalues and are normalized to unit length as well as being orthogonal to one another ("orthonormal"). The square of each component, v_jk, of a given eigenvector j is thus directly interpretable as that fraction of the variance spanned by vector j which is attributable to feature k. The associated eigenvalue λ_j is a measure of the total variance spanned by v_j. A new data matrix [Y] of linear combination features or "factors", ŷ_j, may be created through the transform

\[ [Y] = [X][V]^{\mathrm{T}} \]

where [V]^T is the transpose of the n × n eigenvector matrix. These factors are, like the eigenvectors from which they are derived, ordered in decreasing representation of data variance. When taken in order, they form the best least-mean-square-error orthogonal representation for any given data subspace (any number of dimensions less than n) using only linear combinations of the features. (If the covariance matrix, [Cm], is used, this is the Karhunen-Loève expansion (12).) The number of independent variances present in the data matrix (the matrix rank) can be determined from analysis of the eigenvalues, eigenvectors, and/or the factors. In the absence of any data uncertainty, nonlinear interactions between features, or round-off error, the rank is equal to the number of nonzero eigenvalues; this situation seldom arises with real data. We have examined several commonly used criteria for determining the number of "significant" factors to determine their usefulness with data which has associated analytical uncertainty: λ̄, λ(%info), ē (13), and Bartlett's χ² (14). Other criteria, such as "cross-validation", have been used to determine matrix rank, but will not be considered in this study (15).
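A minimal sketch of the eigenanalysis step and the factor transform (our notation, with eigenvectors as columns of V; not the paper's CDC implementation):

```python
import numpy as np

def eigenanalysis(D, X):
    """Eigenanalysis of a symmetric dispersion matrix D built from data X (m x n).

    For [Cm]/[Rm], X should be mean-centered (and variance-scaled for [Rm])
    before projection, matching the matrix used to build D.
    """
    lam, V = np.linalg.eigh(D)        # eigh: real eigenvalues, orthonormal eigenvectors
    order = np.argsort(lam)[::-1]     # sort by decreasing eigenvalue
    lam, V = lam[order], V[:, order]  # columns of V are the sorted eigenvectors
    Y = X @ V                         # factors, ordered by variance spanned
    return lam, V, Y
```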

The average eigenvalue, λ̄, criterion accepts all eigenvalues as large as the average eigenvalue; that is

\[ \lambda_j \ge \bar\lambda = \frac{1}{n} \sum_{j=1}^{n} \lambda_j \]

This convention is founded on the consideration that a factor spanning less than the average variance of the original features contains little or no information on the feature dispersion. We have found it to be appropriate when: a. all features have variances of comparable magnitude and b. all features are "somewhat" correlated. (Any feature contributing unique information contributes no more than the average variance.)

The significant variance, λ(%info), criterion accepts all eigenvalues which are necessary to account for the total variance thought to come from the desired information (in contrast to the "uncertainty") in the data; that is

\[ 100 \sum_{j=1}^{p} \lambda_j \Big/ \sum_{j=1}^{n} \lambda_j \ge 100 - (\%\,\text{uncertainty}) \]

where p is the postulated number of independent variances. This convention assumes that the variance attributable to the data uncertainty will be largely contained in the factors with small eigenvalues. Our experience indicates it is appropriate when: a. all features have variances of comparable magnitude, b. all uncertainties are of comparable magnitude, c. the uncertainties are small compared to the feature values, and d. the uncertainties are normally distributed about the feature values.

The average error, ē, criterion accepts all factors required to reproduce the original data matrix within the average RMS data uncertainty

\[ \bar e = \left[ \frac{1}{nm} \sum_{i=1}^{m} \sum_{k=1}^{n} \left( x_{ik} - x'_{ik} \right)^2 \right]^{1/2}, \qquad x'_{ik} = \sum_{j=1}^{p} y_{ij} \, v_{jk} \]

In terms of matrices, [X'] = [Y][V'], where [V'] is [V] with the lowest (n − p) eigenvectors zeroed out. We have found this convention to be appropriate when: a. all uncertainties are of comparable magnitude, b. the uncertainties are small compared to the feature values, and c. the uncertainties are normally distributed about the feature values. (The criterion can be modified to remove the first assumption by testing on the RMS uncertainty of each feature rather than averaging it over all features.) It should be noted that, since for all orthonormal matrices [V]^T = [V]^−1, [X'] is constrained to be identical to [X] if all n eigenvectors are included in [V']. Since the eigenvectors are sorted in order of decreasing eigenvalue (vector length or fraction variance spanned), ē is constrained to become smaller with each additional eigenvector added.

Bartlett's χ² criterion accepts the hypothesis of having exactly p factors, for some probability level (we chose P = 0.5), for the smallest value of p where

\[ -B \ln R \le \chi^2_{\nu}, \qquad B = m - (2n + 5)/6 - 2p/3, \qquad \nu = (n - p - 1)(n - p + 1)/2 \ \text{degrees of freedom} \]

(here R is the ratio of the geometric mean to the arithmetic mean of the n − p smallest eigenvalues, raised to the (n − p)th power) and χ²_ν is obtained from any standard statistical reference. (Alternatively, the probability level P(−B ln R) can be calculated; the procedure is somewhat involved, but many computer statistical libraries contain such a program.) This criterion is usable only with the correlation matrix and only for data drawn from a normal population. For such data, it does establish the upper limit of "statistically meaningful" factors.
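For illustration, the four criteria reduce to a few lines each. This is a hedged sketch of our own (function names are ours; the χ² quantile is taken from SciPy rather than a printed table, and the Bartlett ratio R is computed as described above):

```python
import numpy as np
from scipy.stats import chi2

def rank_avg_eigenvalue(lam):
    """lambda-bar criterion: count eigenvalues at least as large as the average."""
    return int(np.sum(lam >= lam.mean()))

def rank_percent_info(lam, pct_uncertainty):
    """lambda(%info): smallest p accounting for 100 - %uncertainty of the variance."""
    cum = 100.0 * np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cum, 100.0 - pct_uncertainty)) + 1

def rank_avg_error(X, Y, V, rms_uncertainty):
    """e-bar: smallest p reproducing X within the average RMS data uncertainty."""
    for p in range(1, X.shape[1] + 1):
        Xp = Y[:, :p] @ V[:, :p].T        # [X'] = [Y][V'] with n - p vectors zeroed
        if np.sqrt(np.mean((X - Xp) ** 2)) <= rms_uncertainty:
            return p
    return X.shape[1]

def rank_bartlett(lam, m, prob=0.5):
    """Bartlett's chi-squared test of equality of the n - p smallest eigenvalues."""
    n = len(lam)
    for p in range(n - 1):
        rest = lam[p:]
        R = np.prod(rest / rest.mean())   # (geometric/arithmetic mean) ** (n - p)
        B = m - (2 * n + 5) / 6.0 - 2 * p / 3.0
        nu = (n - p - 1) * (n - p + 1) / 2.0
        if -B * np.log(R) <= chi2.ppf(prob, nu):
            return p
    return n
```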

DATA UNCERTAINTY

Essentially all analyses of experimental data must contend with both sampling and analytical uncertainty. Sampling uncertainty, the error in the estimation of population parameters resulting from studying only a finite subpopulation, can be minimized by proper randomization, replication, and blocking of samples and, if enough samples are studied, by such computational procedures as "jackknifing" (16). Analytical uncertainty, the error in the estimation of a given measurement value for a given sample, defines the "goodness" of the data for a given group of samples obtained with given experimental procedures and is thus the limiting criterion for the information obtainable from those samples. In practice, it is often assumed that the variance attributable to the analytical uncertainty is small compared to the variance across the samples, and it is ignored in the data analysis. Statistical and pattern recognition methodologies exist which weight the various features by a function of the uncertainty and so adjust the "goodness" of the various features for the analysis (17). As previously discussed, the ē criterion uses data uncertainty to evaluate the results of eigenanalysis on data which, up to the time the criterion is applied, have been assumed to be "exact". If the analytical uncertainty is considered to define a distribution of possible values about some expectation value, it becomes possible to utilize the analytical uncertainty to estimate the "goodness" of the results of data analysis on data with uncertainties.


[Figure 1. Rectangle data: correlation between "measurements" and factors (c²_jk). Area of square (□) is proportional to c²_jk for the true-data; cross (+) defines edges of squares whose areas are proportional to c²_jk ± std dev for UP-data.]

If some distribution, D(x_ij, u_ij), for a given measured value x_ij can be defined, considering x_ij to be the "expectation" value of the distribution and its associated analytical uncertainty u_ij as a measure of the distribution spread, then synthetic data values may be generated by the production of random numbers having the distribution D(x_ij, u_ij). These synthetic values are equally probable estimators of the "true" sample value since they have been obtained from the same value distribution. Thus "uncertainty-perturbed-data" (UP-data) matrices, [Xu], may be generated from the "true"-data matrix, [X], and its associated uncertainty matrix, [U]. These UP-data matrices are equally probable subpopulations of the parent population. The results of analysis of such UP-data matrices can in turn be considered as estimating the results obtainable from an infinite population of possible results. No matter what methodologies are applied, the results of the data analysis can be treated statistically if sufficient UP-data matrices are analyzed. This procedure provides a measure of the reliability of the results for a particular set of samples and measurements. In particular, the number of factors "stable" to analytical uncertainty can be determined in this manner.

It should be noted that the practice of reporting measurement values as "aa.aa ± b.bb" is, in fact, an explicit statement that the reported value "aa.aa" is the expectation value of a distribution which is either normal about "aa.aa" with a standard deviation of "b.bb" or is uniform about "aa.aa" with "edges" at "aa.aa ± b.bb". Data reported as "aa.aa +b.bb, −c.cc" may be taken as defining a χ² distribution. The special case of values reported at or below a detection limit may be treated by assigning an expectation value of one-half the detection limit and a standard deviation equal to the expectation value. Other methods of treating this problem can certainly be devised. The following data analyses assume the analytical uncertainty to be normal about the data values, although χ² distributions are perhaps more typical of analytical data. The mean and standard deviation for the various result-values from the UP-data sets have been calculated assuming that the various results follow the normal distribution. This treatment is intended to provide only a conservative estimate of both the analytical uncertainty and the distribution of results.
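Generating UP-data is then a single normal draw per matrix. A sketch under the stated normal-distribution assumption; the function names and the n_sets default are ours, not ARTHUR's, and the helper merely encodes the detection-limit convention just described.

```python
import numpy as np

rng = np.random.default_rng()

def up_data(X, U, n_sets=7):
    """Yield uncertainty-perturbed copies of X; U holds one std dev per entry."""
    for _ in range(n_sets):
        # each draw is an equally probable estimate of the "true" data matrix
        yield rng.normal(loc=X, scale=U)

def below_detection_limit(limit):
    """Convention above: expectation value limit/2, std dev equal to it."""
    return limit / 2.0, limit / 2.0
```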

STUDY I. RECTANGLE DATA

A synthetic data base of various "measurements" on rectangles defined from random digits was constructed to test the various dispersion matrices and rank determination criteria. Twenty-five sets of "Length", "Width", "Thickness", and "Random" values were defined by five, four, three, and four random digits, respectively. Six combination "measurements" were calculated from these fundamental factors: the three "wrap-around" lengths 2(Length + Width), 2(Length + Thickness), and 2(Width + Thickness), and the three diagonals (Length² + Width²)^1/2, (Length² + Thickness²)^1/2, and (Width² + Thickness²)^1/2. Note that the diagonals are not simple linear combinations of the fundamental factors but have a second-order component.
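The rectangle data can be regenerated from these definitions. A sketch (the seed is arbitrary, so the exact random values of the original study are of course not reproduced):

```python
import numpy as np

rng = np.random.default_rng(1976)                 # arbitrary seed
m = 25
L = rng.integers(0, 100000, m).astype(float)      # "Length": five random digits
W = rng.integers(0, 10000, m).astype(float)       # "Width": four random digits
T = rng.integers(0, 1000, m).astype(float)        # "Thickness": three random digits
R = rng.integers(0, 10000, m).astype(float)       # "Random": four random digits

X = np.column_stack([
    L, W, T,
    2 * (L + W), 2 * (L + T), 2 * (W + T),        # "wrap-around" lengths
    np.sqrt(L**2 + W**2), np.sqrt(L**2 + T**2),   # diagonals: second-order terms
    np.sqrt(W**2 + T**2),
    R,
])
U = 0.01 * X                                      # 1% relative "analytical uncertainty"
```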

Table I. Rectangle Data Statistics

                                                        Meanᵃ                           Std dev
"Measurement"                    Abbrev.   s/x̄ᵇ    [X]        [Xu]ᶜ             [X]       [Xu]ᶜ
1.  Length                       L         1.0%    51 950     51 900 ± 200      27 820    27 800 ± 200
2.  Width                        W         1.0%    5 022      5 030 ± 10        2 984     2 990 ± 10
3.  Thickness                    T         1.0%    554        554 ± 1           279       281 ± 1
4.  2(Length + Width)            2(L+W)    1.0%    113 900    113 800 ± 200     53 630    53 500 ± 300
5.  2(Length + Thickness)        2(L+T)    1.0%    105 000    104 900 ± 200     55 480    55 400 ± 300
6.  2(Width + Thickness)         2(W+T)    1.0%    11 160     11 150 ± 10       6 171     6 180 ± 20
7.  (Length² + Width²)^1/2       D(L,W)    1.0%    52 670     52 700 ± 100      27 030    27 000 ± 100
8.  (Length² + Thickness²)^1/2   D(L,T)    1.0%    51 960     52 000 ± 200      27 810    27 900 ± 100
9.  (Width² + Thickness²)^1/2    D(W,T)    1.0%    5 077      5 070 ± 10        2 952     2 950 ± 10
10. Random                       R         1.0%    5 220      5 220 ± 20        2 969     2 970 ± 20

ᵃ Length in arbitrary units. ᵇ "Analytical uncertainty" relative to "measurement" mean. ᶜ From five UP-data matrices.


The true-data matrix, [X], formed from these data is ideal for testing factor analysis methodologies since it explicitly contains all fundamental factors and is mathematically exact. It is also familiar, simple in structure, and very inexpensive to create. (Similar data sets have been used, in fact, from the earliest work in modern factor analysis (18).) In addition, these data can be considered a crude model for much spectral data in that all "measurements" have a minimum possible value (zero) and are "expressed in the same units". (This is a common misunderstanding: these "measurements" have the same unit of length (say, meters) but have unique units of measurement (meters of Length, meters of Width, etc.).) The range of measurement means and standard deviations is typical of chemical data, as well (Table I).

To test the effect of "analytical uncertainty" with these data, a 1% relative standard deviation was assigned to each value. That is, each x_ij of each UP-data matrix, [Xu], generated was drawn from a population normal about x_ij with a standard deviation of x_ij/100. This value for the "analytical uncertainty" was chosen to approximate that encountered in "typical" chemical instrumental analysis. All results presented in the rest of this section are based upon the analysis of the true-data matrix and six UP-data matrices.

The square of the correlation between the 10 "measurements", x̂_j, and the first six factors, ŷ_k, for the four dispersion matrices studied are pictorially presented in Figure 1. (Factors ŷ₇₋₁₀ are, within graphical resolution, identical to ŷ₆.) This value is a convenient measure of how well a given feature aligns with the variance spanned by a given factor and is defined

\[ c_{jk}^2 = \frac{\left[ \sum_{i=1}^{m} (x_{ij} - \bar x_j)(y_{ik} - \bar y_k) \right]^2}{\sum_{i=1}^{m} (x_{ij} - \bar x_j)^2 \, \sum_{i=1}^{m} (y_{ik} - \bar y_k)^2} \]
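In code, c²_jk is simply the squared Pearson correlation between feature column j of [X] and factor column k of [Y]. A small sketch of ours:

```python
import numpy as np

def squared_correlations(A, B):
    """c2[j, k]: squared correlation of column j of A with column k of B."""
    Ac = A - A.mean(axis=0)
    Bc = B - B.mean(axis=0)
    num = (Ac.T @ Bc) ** 2
    den = np.outer((Ac**2).sum(axis=0), (Bc**2).sum(axis=0))
    return num / den
```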

[Figure 2. Rectangle data: rank determination criteria, one panel per dispersion matrix ([Co], [Cm], [Ro], [Rm]). (△) Values for true-data; (○) values for UP-data. Error bar spans ± std dev.]

A c²_jk value of zero (no square present in the appropriate grid of Figure 1) indicates that the given feature x̂_j is completely independent of the variance spanned by the factor ŷ_k; a c²_jk value of unity (square of unit area in the appropriate grid of Figure 1) indicates that the feature is completely contained within the variance spanned by the factor.

Mean normalization facilitates the interpretation of the eigenvectors, and thus the factors, as there are more extreme c²_jk (very large and very small) and fewer intermediate c²_jk resulting from the analysis of the mean-normalized matrices [Cm] and [Rm] than with the origin-normal [Co] and [Ro]. (The various rotations used to simplify the interpretation of the factors have this elimination of intermediate dependence as their goal.) This explicit removal of the feature means (in effect, augmenting the matrix) is very analogous to allowing a constant term in a multilinear regression ("least-squares fitting"): it isn't required by the mathematics, but it does make the interpretation and application of the results easier. The isolation of the feature means allows the uncertainty in the estimation of the population means to be largely canceled (1st-order statistics) rather than be included in the (2nd-order) dispersion coefficients. The variability of the c²_jk values obtained from the UP-data (the asymmetry of the crosses in Figure 1) and the correspondence of the UP-data and the true-data c²_jk values (overlap of the UP-data crosses with their true-data square in Figure 1) indicate that the mean-normal matrices are less sensitive to the imposed analytical uncertainty than is the scatter matrix.

Variance normalization concentrates the significant dispersion information, as large c²_jk occur for one fewer factor with the variance-normal [Ro] and [Rm] than with the unnormalized [Co] and [Cm] (Figure 1). This concentration aids in the determination of the matrix rank and may be important when using factor analysis to reduce the dimensionality of the data. It appears to result from the explicit removal (or isolation) of the simple scale differences between features. The variance-normal matrices, like the mean-normal, are less sensitive to the data uncertainty. This clearly results from the removal of scale differences between the features and the uncertainties. The uncertainty in the Length, for instance, is as large in absolute value as the information in Thickness, although the relative uncertainties are identical.

It may be of some interest that, with all four dispersion matrices studied, the factors most clearly associated with "Thickness", ŷ₄ and ŷ₅, span less variance than does the factor, ŷ₃, which is clearly associated with the unique "Random" fundamental factor. This reflects the partial representation in the initial factors of the variance due to "Thickness".

The eigenvalues and associated rank determination information for the λ̄, λ(%info), and ē criteria are given graphically in Figure 2. The square of the correlation between a given factor obtained from the true-data matrix and from the UP-data matrices is presented in Figure 3. This value is a convenient measure of factor stability and is defined

\[ C_{y}^2 = \frac{\left[ \sum_{i=1}^{m} (y_{ij} - \bar y_j)(y'_{ij} - \bar y'_j) \right]^2}{\sum_{i=1}^{m} (y_{ij} - \bar y_j)^2 \, \sum_{i=1}^{m} (y'_{ij} - \bar y'_j)^2} \]

where y' denotes a UP-data derived factor.
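C²_y can be computed with the same squared-correlation routine sketched earlier, pairing each true-data factor either with the same-index UP-data factor or with its most highly correlated one; the gap between the two pairings is what exposes factor "switching" later in the paper. A sketch of ours:

```python
import numpy as np

def factor_stability(Y_true, Y_up):
    """Same-index and best-match C2_y between true-data and UP-data factors."""
    c2 = squared_correlations(Y_true, Y_up)  # routine sketched earlier
    same_index = np.diag(c2)                 # factor j vs. UP-data factor j
    best_match = c2.max(axis=1)              # factor j vs. its most similar UP factor
    return same_index, best_match
```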


[Figure 3. Rectangle data: correlation between true-data and UP-data factors (C²_y). Error bar spans ± std dev.]

[Figure 5. Obsidian data: Fisher weights for factors. (△) Weights for true-data; (○) weights for UP-data. Error bar spans ± std dev.]

Table II. Rectangle Data Rank Summary

Rank criterion    [Co]   [Cm]   [Ro]   [Rm]
1. λ̄              2      2      3      4
2. λ(99%)         1      2      4      4
3. ē              4      4      4      4
4. χ²             a      a      a      6
5. C²_y           3      3      4      4

a Not applicable.

[Figure 4. Obsidian data: Fisher weights for metals. (△) Weights for true-data; (○) weights for UP-data. Error bar spans ± std dev.]

where y' denotes a UP-data derived factor. Table I1 lists the observed rank for each of the dispersion matrices as determined by simple application of the various criteria and the somewhat subjective analysis of the Cf values. (The dashed lines in Figures 2a and 2c indicate the acceptance values for the given criteria: the acceptance value for Figure 2b, 99%) overlays much of the data and is not shown.) The various criteria give results which are most consistent with the data structure when applied to the two variance-normal dispersion matrices [Ro]and [Rm].(Remember that there are four linear factors and three second-order terms present in the data.) The average error criterion, e, and the C: analysis appear to be least sensitive to the dispersion matrix formulation and best reflect the true matrix rank. The large difference between the eigenvaluesobtained from the analysis of the true- and UP-data matrices should be noted (Figure 2a). The variance-normal matrices, [Ro]and [ R m ] , give values which begin to differ for Xg, the two matrices which are not variance-normal, [Co] and [Cm],give values which begin to differ at Xq. A similar, perhaps even more pronounced, difference between the true- and the UP-data analyses is apparent for the e values. The factor for which this difference becomes appreciable may be taken as the point at which analytical uncertainty starts to dominate the data analysis. 2006

Thus, of the four dispersion matrices studied, the correlation matrix, [Rm], is clearly the most stable toward analytical uncertainty and the scatter matrix, [Co], the least stable. Variance normalization equalizes the contribution of each factor to the analysis (and equalizes the contribution of each factor's uncertainty) while mean normalization aids in the interpretation of the factors and permits the uncertainty in the feature means to be contained in first-order rather than second-order terms. The various rank determination criteria give comparable results when applied to the correlation matrix. There are obvious practical advantages to the λ̄, λ(%info), and χ² criteria over both the ē and C²_y in terms of calculation time and cost, since the former criteria utilize information which comes immediately from the initial eigenanalysis while the latter require fairly extensive further computation. The detailed information on which features are contributing the greatest uncertainty to the analysis, which is potentially available from the ē calculations, is its most attractive attribute. The C²_y values provide a unique measure of how sensitive the factors are to the data uncertainty, no matter how much variance they span. It is apparent, however, that determination of the true rank of data having uncertainty is not a trivial task, even with data of the simplest fundamental structure.

STUDY II. OBSIDIAN DATA

To test the validity of the inferences drawn in the study of the synthetic rectangle data for real chemical data, we examined the obsidian data described in refs. (19, 20). These data consist of the concentrations (determined by x-ray fluorescence) of the 10 metals Ba, Ca, Ti, Fe, Zr, Sr, K, Mn, Rb, and Y for 63 obsidian source samples from 4 sites near San Francisco. Analytical uncertainties were available for most data values; uncertainties which were either not determined or not reported were estimated by one of us (JLF). The average uncertainty and statistics for these data are given in Table III. All results presented in this section are based upon the analysis of the true-data and seven UP-data matrices.


Table III. Obsidian Data Statistics

                        Meanᵃ                       Std dev
Metal       s/x̄ᵇ    [X]       [Xu]ᶜ             [X]       [Xu]ᶜ
1.  Ba      11.2%    42.48     42.7 ± 0.4        16.70     17.3 ± 0.9
2.  Ca      5.6%     680.8     683. ± 6.         278.6     286. ± 4.
3.  Ti      7.5%     260.2     261. ± 4.         111.8     116. ± 3.
4.  Fe      5.1%     1209.     1200. ± 10.       323.3     326. ± 8.
5.  Zr      8.2%     156.8     158. ± 1.         43.73     45. ± 1.
6.  Sr      30.6%    31.59     33. ± 1.          17.36     19. ± 1.
7.  K       6.4%     392.8     395. ± 4.         54.51     61. ± 3.
8.  Mn      8.7%     46.52     46.8 ± 0.6        14.41     15.2 ± 0.5
9.  Rb      9.0%     107.5     107. ± 1.         16.84     21. ± 4.
10. Y       12.0%    56.02     56. ± 0.7         6.99      10.0 ± 0.6

ᵃ Concentration in ppm. ᵇ Analytical uncertainty relative to metal concentration. ᶜ From 7 UP-data matrices.

Table IV. Obsidian Data Rank Summary

Rank criterion    [Co]   [Cm]   [Ro]   [Rm]
1. λ̄              2      3      2      4
2. λ(86%)         1      2      1      4
3. ē              3      2      4      3
4. χ²             a      a      a      7
5. C²_y           3-5    2-3    3-4    2-4
6. W (Fisher)     4      3      4      3

a Not applicable.


Unlike the "measurements" of the rectangle data, there are no obvious simple relationships among the 10 metals. This makes the interpretation of the factor analysis results more difficult. However, experience with these data has demonstrated that the four sites can be easily identified using almost any combination of 3 or more of the metals. The utility of each metal for making this site identification can be quantitatively estimated using the multicategory Fisher weight (6, 21), the average of the ratios between the square of the difference between means and the sum of the squares of the standard deviations for all combinations of two sites; a sketch of this calculation follows. These weights are shown graphically in Figure 4. With the exception of yttrium, all metals have weights greater than 0.1. (Note, however, the difference between the weights for the true-data and the average of the UP-data. We believe the random perturbations of the UP-data unmask artificial relationships among the samples, such as matrix effects. The large difference between the true- and the UP-data weights for Ba and Sr appears to be a result of having only detection limit values for these metals for all samples coming from two of the sites.) The utility of the four dispersion matrices for capturing the site identification information can similarly be assessed with the multicategory Fisher weights of the resulting factors. The values are given graphically in Figure 5. Note that the mean-normal matrices give factors which generally decrease in weight while the origin-normal [Co] and [Ro] have their maximum weight at factor ŷ₂ or ŷ₃. The mean-normal matrices also give fewer factors of significant weight, indicating that the site identification information is more "concentrated". The correlation matrix, [Rm], gives both the largest weight of any of the matrices studied and appears to have the least difference between the true- and the UP-data analysis for the initial factors. The number of factors having multicategory Fisher weights as large as 0.1 is given in Table IV for each of the four dispersion matrices, along with the matrix rank summary.
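As described, the multicategory Fisher weight averages (difference of means)²/(sum of squared standard deviations) over all pairs of categories. A sketch assuming a NumPy array of site labels; the names are ours:

```python
import numpy as np
from itertools import combinations

def fisher_weights(X, labels):
    """Multicategory Fisher weight of each feature, averaged over category pairs."""
    pair_weights = []
    for a, b in combinations(np.unique(labels), 2):
        Xa, Xb = X[labels == a], X[labels == b]
        num = (Xa.mean(axis=0) - Xb.mean(axis=0)) ** 2          # squared mean difference
        den = Xa.var(axis=0, ddof=1) + Xb.var(axis=0, ddof=1)   # sum of squared std devs
        pair_weights.append(num / den)
    return np.mean(pair_weights, axis=0)
```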

[Figure 6. Obsidian data: rank determination criteria. (△) Values for true-data; (○) values for UP-data. Error bar spans ± std dev.]

Determination of the intrinsic rank for these data from the various criteria is, as expected, much less certain than with the rectangle data. Few, if any, "natural breaks" occur in any of the graphs of Figure 6. (The difference in the eigenvalues between the true- and the average of the UP-data may indicate the point at which the imposed random uncertainty begins to dominate the data analysis.) The results of the simple application of the various criteria are given in Table IV.


Table V. Mass Spectral Analytical Uncertainty

                                                       m/e intensity      Number of nonzero m/e
Compound                   Formula    Spec.    sᵃ      s/x̄ᵇ     Minᶜ    Maxᵈ    Conᵉ    sᶠ     s/x̄ᵍ
Ethanol                    C₂H₅OH     6        1.8     0.24     18      35      16      6.5    0.26
n-Butane                   C₄H₁₀      6        1.4     0.20     28      45      16      5.8    0.15
Benzene                    C₆H₆       6        0.71    0.23     34      49      23      7.5    0.18
Cyclopentyl-cyclopentane   C₁₀H₁₈     3        1.1     0.24     83      133     66      26.    0.24

ᵃ Std dev of intensity relative to base peak intensity. ᵇ Std dev of intensity relative to average intensity. ᶜ Minimum number of nonzero intensities reported. ᵈ Maximum number of nonzero intensities reported. ᵉ Number of m/e's consistently reported with nonzero intensity. ᶠ Std dev of number of nonzero intensities reported. ᵍ Std dev of number of nonzero intensities reported, relative to average number reported.


[Figure 7. Obsidian data: correlation between true-data and UP-data factors (C²_y). Error bar spans ± std dev.]

It is apparent that the λ̄ and λ(%info) criteria are meaningful only with the correlation matrix, [Rm]. The different variances of the metals dictate variance normalization for the underlying assumptions of these criteria to be satisfied, but the dramatic difference between the two variance-normal matrices [Ro] and [Rm] is perhaps surprising. This difference apparently arises from the fairly large and unequal relative uncertainties of the metals. The rectangle data, having an assigned relative uncertainty of 1% for all "measurements", do not show this obvious difference. The mean-normal matrices give fewer stable factors, as measured by the cross-correlation of the true-data factors with those from the UP-data, C²_y, but appear to give a sharper "break" between stable and unstable factors (Figure 7). The apparently anomalous behavior of factors ŷ₃ and ŷ₄ from the correlation matrix results from both factor "switching" and "mixing". Factor "switching" comes from the ordering of the factors by the magnitude of the associated eigenvalues; if two (or more) eigenvalues are very similar, the random fluctuations in the UP-data from one run to the next can induce a trivial reordering of those factors. The interpretation of the factors is quite unchanged by this reordering; however, the reordering must be recognized and corrected for any further analysis to make sense. (This problem should be considered whenever two or more data sets are to be compared.) Factor "mixing" seems to be a result of having an essentially unique metal, Y, in the data; without information redundancy in the features, there can be no averaging of the analytical uncertainty. The factor(s) strongly associated with such data will be at least as poorly defined as the original measurement(s). Subsequent rotation of the eigenvectors may well help this problem, although the utility of keeping unique features in a factor analysis data base, once they have been identified as such, is certainly questionable. The "dip" in the Fisher weights for factor ŷ₃ of the correlation matrix (Figure 5) comes from the high correlation of the very low-weight Y with that factor. This suggests a rather fundamental limitation of factor analysis: the principal independent variances of the data are not necessarily the structure of interest to the analyst.

[Figure 8. Mass spectral data: eigenvalue-based rank determination criteria. Bar spans difference between true-data (○) and UP-data values.]

STUDY III. API PROJECT 44 MASS SPECTRA

To test the various rank determination criteria with a large data base, we constructed a data matrix of low resolution mass spectra from the API Project 44 data (as encoded on the Mass Spectrometry Centre's 6807 mass spectra tape (22)). All of these spectra have been normalized to the base peak. The 659 spectra of compounds having molecular weight within the range 12 to 141 and molecular formula within the range C₂₋₁₀H₂₋₂₂O₀₋₄N₀₋₂ were selected for study. Where two or more spectra were given for the same compound, the spectrum having the greatest number of nonzero m/e's was selected. Doublet and metastable peak intensities were added to the nearest unit m/e intensity. All unit m/e's having a nonzero intensity for any compound were included in the study (m/e's 3 to 11 were excluded). This data matrix is similar, if not identical, to that discussed in earlier pattern recognition and factor analysis studies (5, 23).

The analytical uncertainty associated with these data was evaluated from a study of four dissimilar compounds multiply reported in the Project 44 data (Table V). The average standard deviation relative to the m/e intensity over all m/e for each compound is approximately 20%; relative to the base-peak intensity, approximately 1.3%. This corresponds roughly to a 4% spectral intensity resolution if all m/e distributions are assumed to be normal. The average relative standard deviation in the number of nonzero unit m/e intensities for each compound studied is surprisingly high (20%), although the small number of compounds studied dictates that this value should be interpreted with caution.


Table VI. Mass Spectral Data Rank Summary

Criterion     Rankᵃ
1. λ̄          44-47
2. λ(98%)     98-105
3. ē          75-ᵇ
4. χ²         92-95
5. C²_y       11-15

ᵃ As determined from [X] and one [Xu] analysis. ᵇ [Xu] was not examined with this criterion.

[Figure 9. Mass spectral data: data reproduction within given error (ē)-based rank determination criterion.]

[Figure 10. Mass spectral data: correlation between true-data and UP-data factors (C²_y). (○) C²_y between same-index factors; (▽) C²_y between most similar factors.]

To approximate both the uncertainty in the intensity of the reported m/e's and the uncertainty in whether the zero-intensity m/e's are really zero-intensity, we assume all m/e intensities are normally distributed about their reported value, x_ij, with a standard deviation of 0.67% base peak intensity (approximately 2% spectral intensity resolution). Since all intensities must be positive, all UP-data values generated less than zero were set to zero. Values generated greater than the base peak were not modified. Economic considerations dictated that these data be used only to study the various rank determination criteria using the correlation matrix, and that the minimum possible number of UP-data matrices be studied (each eigenanalysis of these data cost approximately $50). We found, rather to our surprise, that the true- and a single UP-data matrix analyses were sufficient to establish the limits of analysis for these data.

The values for the various rank determination criteria are presented graphically in Figures 8-10 and the results of the simple application of the criteria are given in Table VI. The discrepancy between the results obtained from the λ̄ and all other criteria is striking. The difference between the initial eigenvalues for the true- and the UP-data and the number of eigenvalues very close to unity reveals much about these data: the correlation between m/e's is neither extensive nor stable to analytical uncertainty. Essentially, all features of these data are orthogonal to each other. (Recent work on determining the suitability of data for factor analysis may be of interest (24).) The correlation between the true- and the UP-data factors, C²_y, indicates that extensive factor "mixing" between the similarly eigenvalued factors exists (Figure 10). Comparison of the correlation between the same-indexed factors with that between the factors most highly correlated suggests that factor "switching" is also encountered with these data. It is apparent that, even with factor "switching" taken into account, the number of factors stable to analytical uncertainty is less than 20.

The instability of these data to analytical uncertainty and the disagreement between the various rank determination criteria are, in retrospect, perhaps not surprising. The structural identity of a given compound may be determinable from the m/e's characteristic of that compound, but the various functional groups have both characteristic m/e's and, often more importantly, characteristic differences between m/e's. The cross-product between even similar compounds may give very few nonzero terms because of a trivial offset in the characteristic masses (compare C₂H₅OH with C₂H₄DOH). For dissimilar compounds, the autocorrelation of the mass spectra may well represent at least functional group information in a more stable manner. The problem of proper representation of low resolution mass spectral information is currently being studied by the Laboratory for Chemometrics.
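Purely as an illustration of this representation idea (the paper leaves it to future work), autocorrelating a unit-m/e stick spectrum recasts absolute peak positions as mass differences, so a uniform mass offset between similar compounds no longer eliminates the overlap. A hypothetical sketch:

```python
import numpy as np

def mass_autocorrelation(spectrum):
    """Autocorrelate a unit-m/e intensity vector: peaks mark mass differences."""
    full = np.correlate(spectrum, spectrum, mode="full")
    return full[len(spectrum) - 1:]   # keep non-negative mass shifts only
```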

CONCLUSIONS

Factor analysis can be used to explore the number and nature of the fundamental factors expressed in a body of measurements. It is apparent that both mean and variance normalization of the data help to stabilize the analysis to the effects of analytical uncertainty. Of the four dispersion matrices studied, the correlation matrix allows, by far, the most stable and easily interpreted analysis.

We feel that none of the various rank determination criteria studied is clearly superior or completely satisfactory when used alone. The agreements and/or differences among the various criteria are a better guide to analysis than is reliance on a single rule. The λ̄, λ(%info), and Bartlett's χ² criteria have the advantage of requiring little further calculation beyond the initial eigenanalysis.

Use of UP-data, data generated within the distributions defined by the measured data values and their associated analytical uncertainties, allows the estimation of the reliability of the results of data analysis. The difference between the true- and the UP-data results may indicate when analytical uncertainty dominates the data analysis. The variability observed in a number of UP-data analyses measures how reliable the results actually are.

We recommend the use of synthetic, simply structured, and somewhat realistic data as a learning tool, no matter what mathematical methodology is being used. Use of such data not only aids in "debugging" the necessary procedures, it trains the researcher's intuition and may help reveal potential problems and limitations of the methodology used.



COMPUTATIONS

All calculations required for this study were carried out on the University of Washington Academic Computer Center's CDC 6400/CYBER 73 dual mainframe system. The programs used in this study are currently being added to our data analysis system ARTHUR; we hope to make this expanded system available early in 1977.

ACKNOWLEDGMENT

The authors express their gratitude to Eric Knudson and Maynarhs da Koven for their encouragement of and assistance with this study.


LITERATURE CITED

(1) E. W. Stromberg and J. L. Fasching, "The Application of Cluster Analysis to Trace Elemental Concentrations in Geological and Biological Matrices", 7th Materials Research Symposium, "Accuracy in Trace Analysis: Sampling, Sample Handling, and Analysis", October 7-11, 1974; NBS Special Publication, U.S. Government Printing Office (Dec. 1975).
(2) S. Wold and K. Anderson, J. Chromatogr., 80, 43 (1973).
(3) P. H. Weiner and E. R. Malinowski, J. Phys. Chem., 75, 1207 (1971).
(4) R. W. Rozett and E. M. Petersen, Anal. Chem., 47, 2377 (1975).
(5) J. B. Justice, Jr., and T. L. Isenhour, Anal. Chem., 47, 2286 (1975).
(6) D. L. Duewer, J. R. Koskinen, and B. R. Kowalski, "Documentation for ARTHUR Version 1-8-75", Chemometrics Society Report No. 2, Laboratory for Chemometrics, 1975.
(7) D. Katakis, Anal. Chem., 37, 876 (1965).
(8) P. Horst, "Factor Analysis of Data Matrices", Holt, Rinehart and Winston, Inc., New York, 1965.
(9) R. J. Rummel, "Applied Factor Analysis", Northwestern University Press, Evanston, Ill., 1970.
(10) C. B. Crawford, Psych. Bull., 82, 226 (1975).
(11) R. W. Rozett and E. M. Petersen, Anal. Chem., 47, 348 (1975).
(12) H. C. Andrews, "Introduction to Mathematical Techniques in Pattern Recognition", Wiley-Interscience, New York, 1972.
(13) P. H. Weiner, E. R. Malinowski, and A. R. Levinstone, J. Phys. Chem., 74, 4537 (1970).
(14) M. S. Bartlett, J. R. Statist. Soc., B16, 296 (1954).
(15) S. Wold, "Pattern Recognition by Means of the Disjoint Principal Components Models", Technical Report No. 2, Research Group for Chemometrics, Institute of Chemistry, Umeå University, S-901 87 Umeå, Sweden, March 1975.
(16) R. G. Miller, Biometrika, 61, 1 (1974).
(17) P. R. Bevington, "Data Reduction and Error Analysis for the Physical Sciences", McGraw-Hill Book Co., New York, 1969.
(18) L. L. Thurstone, "Multiple Factor Analysis", University of Chicago Press, Chicago, Ill., 1947.
(19) D. F. Stevenson, F. H. Stross, and R. F. Heizer, Archaeometry, 13, 17 (1971).
(20) B. R. Kowalski, T. F. Schatzki, and F. H. Stross, Anal. Chem., 44, 2176 (1972).
(21) R. A. Fisher, Ann. Eugen., 7, 179 (1936).
(22) The Mass Spectrometry Data Centre, Building A8.1A, AWRE, Aldermaston, Reading RG7 4PR, England.
(23) W. L. Felty and P. C. Jurs, Anal. Chem., 45, 885 (1973).
(24) C. D. Dziuban and E. C. Shirkey, Psych. Bull., 81, 358 (1974).

RECEIVED for review May 11, 1976. Accepted July 23, 1976. We gratefully acknowledge the financial support of the Office of Naval Research under grant N00014-75-C-0536 and the Public Health Service under grant R01-MB-00184-02.
