Anal. Chem. 1991, 63, 1058-1063

Improved Principal Component Analysis of Noisy Data

Martin J. Fay, Andrew Proctor, Douglas P. Hoffmann,† and David M. Hercules*

Department of Chemistry, University of Pittsburgh, Pittsburgh, Pennsylvania 15260

Several tests exist for determining the number of principal factors, n, when applying principal component analysis (PCA) to a series of spectra. However, as the signal/noise (S/N) ratio of the data decreases, the discriminating power of these tests to accurately determine n also decreases. If the correlation function response values, r_0, normally used in PCA to form the correlation/covariance matrix are replaced by their autoconvolution values, r_sym, the discriminating power of PCA can be markedly improved for noisy data. r_sym is a measure of the symmetry of the correlation function and gains its power in PCA from the fact that the symmetry of the correlation function suffers much less degradation in the presence of noise than does the correlation response, r_0. A comparison is made between the use of r_0 and r_sym values with computer-generated synthetic spectra and with real ESCA spectra derived from known physical mixtures of cobalt aluminate and cobalt oxide.

INTRODUCTION

Principal component analysis (PCA) is a method of multivariate analysis that can determine the number of factors or principal components, n, which make up a set of data. PCA has proved to be especially valuable for spectral analysis since it can help to determine the number of components that make up a series of poorly resolved spectral mixtures. PCA involves the mathematical transformation of a data matrix (series of mixture spectra) into a set of abstract variables known as eigenvectors that describe the variance in the data matrix. For a series with N spectra, N eigenvectors are generated. The eigenvectors that represent the principal components of the data matrix are termed primary eigenvectors, and the remaining eigenvectors that account for random noise are termed secondary eigenvectors. One of the major challenges in carrying out PCA is deciding which of the N eigenvectors are primary and so determine n. Numerous criteria have been proposed to help determine n. Malinowski and Howery (1) have classified these criteria into two categories: (A) those that require knowledge of the amount of noise in the data, and (B) those that do not require prior knowledge of the noise. Some of the commonly used criteria that do not require knowledge of the noise level are the imbedded error function (1), the factor indicator function (1), cross-validatory methods (2), ratios of successive eigenvalues (3), frequency analysis (4), the reduced eigenvalue ratio (5), and the F test (6, 7). Some of the proposed criteria that do require knowledge of the noise level are the real error function (1) (which is equal to the residual standard deviation) and the chi-squared function (8). As would be expected, criteria that do not require knowledge of the noise level are more widely used owing to the general difficulty in quantitatively determining the amount of noise in a set of data.
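The decomposition described above can be sketched numerically. This is a minimal illustration, not the authors' code: the spectra, mixing coefficients, and threshold are all invented, and the data matrix D is taken to have one spectrum per row.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a synthetic series of 6 "spectra" that are mixtures of 2 components.
x = np.linspace(0, 1, 100)
comp1 = np.exp(-((x - 0.3) ** 2) / 0.005)   # Gaussian component 1
comp2 = np.exp(-((x - 0.6) ** 2) / 0.005)   # Gaussian component 2
coeffs = rng.random((6, 2))                  # random mixing coefficients
D = coeffs @ np.vstack([comp1, comp2])       # 6 spectra x 100 points

# Covariance-type matrix [Z] (6 x 6), then eigendecomposition.
Z = D @ D.T
eigvals = np.linalg.eigvalsh(Z)[::-1]        # sorted, largest first

# With noise-free data, only the primary eigenvalues are nonzero;
# the secondary eigenvalues collapse to numerical zero.
n = np.sum(eigvals > 1e-8 * eigvals[0])
print(n)  # -> 2, the number of components mixed in above
```

With added noise the secondary eigenvalues become small but nonzero, which is exactly the situation the criteria listed above (and this paper) address.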
† Present address: Martin Marietta Energy Systems, Oak Ridge, TN 37830-7272.

Although a large number of criteria have been developed to assist in selecting n, it is still problematic when the level of noise is relatively high. Part of this difficulty in determining n can be attributed to the sensitivity of correlation values (r_0) to noise. This has been demonstrated by Hoffmann et al. (9), who showed that the correlation between two identical Mossbauer spectra decreased rapidly when the S/N of one of the spectra was reduced. Since a critical step in PCA involves decomposition of the correlation matrix (which contains r_0 values), the sensitivity of r_0 to high levels of noise is clearly a problem for PCA of noisy data. The correlation and convolution of two discrete periodic functions a(k) and b(k) are defined respectively as

corr τ = Σ[k=0 to N−1] a(k) b(τ + k)    (1)

conv τ = Σ[k=0 to N−1] a(k) b(τ − k)    (2)
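Taking the definitions in eqs 1 and 2 at face value (periodic indexing mod N), they can be written out directly and checked against the familiar FFT identities. The arrays here are arbitrary test data, not anything from the paper:

```python
import numpy as np

def corr(a, b):
    """Discrete periodic correlation, eq 1: corr(tau) = sum_k a(k) b(tau + k)."""
    N = len(a)
    return np.array([sum(a[k] * b[(tau + k) % N] for k in range(N))
                     for tau in range(N)])

def conv(a, b):
    """Discrete periodic convolution, eq 2: conv(tau) = sum_k a(k) b(tau - k)."""
    N = len(a)
    return np.array([sum(a[k] * b[(tau - k) % N] for k in range(N))
                     for tau in range(N)])

a = np.array([1.0, 2.0, 3.0, 0.0])
b = np.array([0.0, 1.0, 0.5, 0.25])

# Circular convolution matches the FFT product; circular correlation
# matches the conjugated FFT product (both for real-valued inputs).
fft_conv = np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))
fft_corr = np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))
print(np.allclose(conv(a, b), fft_conv))  # -> True
print(np.allclose(corr(a, b), fft_corr))  # -> True
```

The FFT route is the one the paper develops under Theoretical Considerations, since only the transforms of the original spectra are ever needed.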

where τ has incremental values from 0 to N − 1 and N is the number of data points in one period of the discrete functions a(k) and b(k). Figure 1 shows the cross-correlation function of two computer-generated spectra a(k) and b(k) and the autoconvolution of the correlation function. Autoconvolution is the convolution of a function with itself. This is essentially the correlation of a function with a rotated version of itself about the ordinate at τ = 0. This rotation is implied by the minus sign in b(τ − k) in eq 2. The value of the correlation function at τ = 0 is r_0, and it is this value that is commonly used in PCA. To improve the correlation analysis of noisy data, Hoffmann et al. proposed the use of the parameter r_sym, which is the value at τ = 0 of the autoconvolution of the correlation function (shown in Figure 1). The parameter r_sym is essentially a measure of the symmetry of the correlation function and as such is also a measure of spectral similarity. In the correlation analysis of Mossbauer spectra (9), it was found that r_sym was less sensitive to noise than r_0 and provided a better gauge of spectral similarity for noisy data. In PCA, all possible cross-correlation r_0 values are assembled into a diagonally symmetric square matrix, [Z], which undergoes a decomposition step, yielding the primary and secondary eigenvectors (1). We propose the use of r_sym values instead of correlation, r_0, values in an effort to improve the principal component analysis of noisy spectra. In this paper, the ability of PCA, using r_sym matrices, to determine the number of principal components will be tested on synthetic spectra with various amounts of added noise. In addition, PCA will be carried out using r_0 and r_sym values on real ESCA spectra of physical mixtures with varying noise levels.
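The central claim — that the symmetry measure r_sym degrades less with noise than the correlation response r_0 — can be sketched numerically. The formulas below follow the Fourier-space expressions derived later in the paper; the spectra, noise level, and seed are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 256)
# An asymmetric two-peak "spectrum"; compare it against a noise-degraded copy.
clean = np.exp(-((x - 3) ** 2) / 0.5) + 0.4 * np.exp(-((x - 6) ** 2) / 2.0)
noisy = clean + rng.normal(0.0, 0.15, x.size)

A, B = np.fft.fft(clean), np.fft.fft(noisy)
CORR = A * np.conj(B)            # FT of the cross-correlation function
R, I = CORR.real, CORR.imag

# r_0: correlation response at tau = 0, normalized so identical spectra give 1.
r0 = np.sum(R) / np.sqrt(np.sum(np.abs(A) ** 2) * np.sum(np.abs(B) ** 2))
# r_sym: symmetry of the correlation function (autoconvolution at tau = 0),
# normalized so a perfectly symmetric correlation function gives 1.
r_sym = np.sum(R ** 2 - I ** 2) / np.sum(R ** 2 + I ** 2)

print(r0 < r_sym)  # -> True: the symmetry measure suffers less from the noise
```

Even at this modest noise level r_0 drops noticeably below unity, while r_sym stays much closer to 1, which is the behavior the paper exploits.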
EXPERIMENTAL SECTION

ESCA spectra were obtained from a series of Co3O4/CoAl2O4 physical mixtures using a Leybold Heraeus LHS-10 X-ray photoelectron spectrometer, operated in the fixed analyzer transmission mode (ΔE = 100 eV) using nonmonochromatic Al Kα radiation (1486.6 eV) at a power of 240 W (12 kV, 20 mA). Powdered samples were mounted on double-sided tape and maintained at < mbar throughout the spectral acquisition period. Data were collected by using an IBM-compatible PC and custom-built interface. Spectra of differing S/N were obtained by scanning for different periods of time.

THEORETICAL CONSIDERATIONS

Calculation of r_0 and r_sym. Evaluation of r_0 in the "time" domain is trivial, simply being the normalized sum of the



products of all points at zero displacement (τ = 0). (For PCA, determination of all the correlation values in the correlation matrix is normally achieved by premultiplying the data matrix by its transpose: [Z] = [D]^T[D].) Evaluation of r_sym requires that the whole correlation function be known. This leads to use of the Fourier transform (FT) method, since it is a more efficient route to determine the total correlation function. In general, the discrete FT, H(n) [n = 0, N − 1], of a function h(k) [k = 0, N − 1] is given by

H(n) = Σ[k=0 to N−1] h(k) exp(−i2πnk/N),    n = 0, N − 1    (3a)

and the inverse transform is correspondingly

h(k) = (1/N) Σ[n=0 to N−1] H(n) exp(+i2πnk/N),    k = 0, N − 1    (3b)
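Eqs 3a and 3b use the same sign and 1/N conventions as NumPy's FFT, which a direct evaluation of the sums confirms (the input values are arbitrary):

```python
import numpy as np

def dft(h):
    """Direct evaluation of eq 3a: H(n) = sum_k h(k) exp(-i 2 pi n k / N)."""
    N = len(h)
    k = np.arange(N)
    return np.array([np.sum(h * np.exp(-2j * np.pi * n * k / N))
                     for n in range(N)])

h = np.array([1.0, 3.0, -2.0, 0.5, 4.0])
print(np.allclose(dft(h), np.fft.fft(h)))          # -> True: eq 3a
# The inverse (eq 3b) carries the 1/N factor, as np.fft.ifft does:
print(np.allclose(h, np.fft.ifft(dft(h)).real))    # -> True: eq 3b
```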

If h(k) represents the original spectral data, it is by definition totally real and H(n) is generally a complex entity. Only when h(k) is even (i.e., symmetric) or odd (antisymmetric) will H(n) be respectively totally real or totally imaginary. The Fourier transformation of corr τ (eq 1) leads to the following result (10)

CORR n = A(n) B*(n) = R_A R_B + I_A I_B + i(I_A R_B − I_B R_A) = R_CORR + i·I_CORR    (4)

CORR n, A(n), and B(n) are the FT's of corr τ, a(k), and b(k), respectively (B*(n) is the complex conjugate of B(n)). R_A, R_B, R_CORR and I_A, I_B, I_CORR are the real and imaginary parts of the transforms A(n), B(n), and CORR n. The Fourier transformation of conv τ (eq 2) leads to a similar result to that of the FT of corr τ but with B*(n) in eq 4 replaced by B(n). Autoconvolution of corr τ is carried out in Fourier space simply by multiplication of CORR n (eq 4) by itself

ACONV n = CORR n × CORR n = R_CORR² − I_CORR² + i·2R_CORR·I_CORR = R_ACONV + i·I_ACONV    (5)

where R_ACONV and I_ACONV are the real and imaginary parts of the FT of the autoconvolution function, aconv τ, whose FT is ACONV n. Evaluation of r_0 and r_sym involves the inverse transformation of CORR n and ACONV n to evaluate corr τ and aconv τ at τ = 0. This requires evaluation of the inverse transform of a complex entity, R + iI, at τ = 0 only. Thus, from eq 3b (τ = k)

F(τ = 0) = (1/N) Σ[n=0 to N−1] (R + iI)(cos 0 + i sin 0)    (6)

Clearly, the sine term is zero and the cosine term is unity for all values of n. In addition, the only output required is the real part and so eq 6 reduces to

F(τ = 0) = (1/N) Σ[n=0 to N−1] R

where R is R_CORR for r_0 or R_ACONV for r_sym. The real and imaginary parts are normalized in an identical manner. The normalization factor is based on the concept that the autocorrelation of a function gives an r_0 value of unity. Thus, if R + iI represents the FT of a(k) or b(k), then the normalization factor is

Norm_A² = (1/N) Σ[n=0 to N−1] (R_A² + I_A²)    (7)

Norm_B² = (1/N) Σ[n=0 to N−1] (R_B² + I_B²)    (8)

Using eqs 4, 7, and 8, we can write r_0 as

r_0 = Σ(R_A R_B + I_A I_B) / [Σ(R_A² + I_A²) Σ(R_B² + I_B²)]^(1/2)    (9)

and using eqs 4, 5, 7, and 8, we can write r_sym as

r_sym = Σ(R_CORR² − I_CORR²) / Σ(R_CORR² + I_CORR²)    (10a)

where Σ is the sum from n = 0 to N − 1. Hence, r_0 and r_sym can be obtained from five sums of products of the FT's of the original spectra. In this manner, the full inverse FT is not carried out since we are only evaluating at one point, i.e., τ = 0. Clearly from eq 10a, if I_CORR is zero, this means that r_sym = 1.0. I_CORR can be zero for two distinct reasons. Most importantly for this discussion, I_CORR will tend to zero as the correlation pattern becomes more symmetric as a result of the cross-correlation becoming more like autocorrelation (i.e., the two spectra are identical). However, I_CORR will also be forced to equal zero if both spectra are initially symmetric (or antisymmetric). This is because the FT's of symmetric spectra are totally real and so I_A and I_B will be zero, which will force I_CORR = 0 (see eq 4). This can be easily overcome by forcing the abscissa to be nonlinear (e.g., converting x to x² and interpolating back to equal abscissa units). This mathematical operation has no effect on the validity of the comparison.

Figure 1. Schematic diagram illustrating the correlation of a(k) and b(k) to produce r_0 and the autoconvolution of the correlation function to produce r_sym.

Generation of Synthetic Spectra. The synthetic spectra in Figure 2 were generated by adding three Gaussian peaks of varying intensity. The peak positions and widths were kept constant. Gaussian noise was added to the synthetic spectra at point (X, Y) assuming that it was distributed with a mean value Y and a standard deviation of Y^(1/2). This meant that the amount of noise could be adjusted by varying the baseline intensity. A random noise pattern was generated in the following manner:
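The symmetric-spectrum caveat can be checked numerically: two different but perfectly symmetric profiles still give r_sym = 1 because I_CORR vanishes, while warping the abscissa (x → x², resampled onto a uniform grid) restores the discrimination. The profiles and grid here are invented for illustration, and the r_sym expression follows eq 10a:

```python
import numpy as np

def r_sym(a, b):
    # Symmetry of the correlation function of a and b, via eqs 4 and 10a.
    CORR = np.fft.fft(a) * np.conj(np.fft.fft(b))
    R, I = CORR.real, CORR.imag
    return np.sum(R**2 - I**2) / np.sum(R**2 + I**2)

x = np.arange(256.0)
a = np.exp(-((x - 128) / 10) ** 2)   # symmetric Gaussian
b = np.exp(-((x - 128) / 25) ** 2)   # different width, also symmetric

# Symmetric inputs have real FTs (up to a phase that cancels in CORR),
# so I_CORR = 0 and r_sym = 1 even though a and b clearly differ.
print(round(r_sym(a, b), 6))  # -> 1.0

# Nonlinear abscissa: resample both spectra on a uniform grid in x**2.
t = np.linspace(0.0, x[-1] ** 2, 256)
aw = np.interp(np.sqrt(t), x, a)
bw = np.interp(np.sqrt(t), x, b)
print(r_sym(aw, bw) < r_sym(a, b))  # -> True: the forced symmetry is broken
```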

Y_noisy = Y + Y^(1/2)·RND    (11)

where RND is a random normal deviate, which is defined as (11)

RND = Σ[i=1 to 12] R_i − 6    (12)

where R_i is a uniform random deviate in the range 0-1. This allows a maximum noise deviation of ±6 standard deviations. A given noise pattern can be reproduced by using the same seed number in a random number generating program, and different patterns can be obtained by using different seeds. For this study, different seeds were used to add noise to different spectra in the series.

Principal Component Analysis. Principal component analysis was carried out on the r_0 and r_sym matrices, [Z], by
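Eqs 11 and 12 translate directly into code. Summing 12 uniform deviates gives a variance of 12 × (1/12) = 1, so RND approximates a standard normal deviate bounded by ±6; the seed and test values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

def rnd(rng):
    """Eq 12: sum of 12 uniform deviates minus 6, an approximate N(0, 1) value."""
    return np.sum(rng.random(12)) - 6.0

def add_noise(y, rng):
    """Eq 11: Y_noisy = Y + sqrt(Y) * RND, i.e. counting-statistics noise."""
    return np.array([yi + np.sqrt(yi) * rnd(rng) for yi in y])

# Example: a flat baseline of 100 counts picks up ~10-count fluctuations.
noisy_baseline = add_noise(np.full(4, 100.0), rng)

# Check the deviate: mean ~0, standard deviation ~1, bounded by +/-6.
samples = np.array([rnd(rng) for _ in range(20000)])
print(abs(samples.mean()) < 0.05, abs(samples.std() - 1.0) < 0.05)  # -> True True
```

Raising the baseline Y raises sqrt(Y) and hence the absolute noise, which is how the paper tunes the S/N of the synthetic series.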

[Figure 2. Synthetic spectra; area ratios of the three components were 20/20/80 and 17/23/80.]

Table I. Results for the Pure Artificial Spectral Series (Figure 2) (Three Known Components, 128 Points)^a

r_0
factor  eigenvalue   IND          F         DF1  DF2
1       6.98273      1.31735E-4   1187.512  1    6
2       1.21086E-2   1.13625E-4   5.754     1    5
3       5.15884E-3   6.44587E-6   1864.485  1    4
4       1.36149E-6   1.14593E-5   0.493     1    3
5       1.36149E-6   2.57835E-5   0.495     1    2
6       1.36149E-6   1.03134E-4   0.496     1    1
7       1.36149E-6                          1    0

r_sym
factor  eigenvalue   IND          F         DF1  DF2
1       6.98156      1.36356E-?   1108.189  1    6
2       1.76433E-2   4.64452E-5   50.179    1    5
3       8.09085E-4   2.02555E-5   29.613    1    4
4       1.34443E-5   3.60098E-5   0.493     1    3
5       1.34443E-5   8.10221E-5   0.495     1    2
6       1.34443E-5   3.24088E-4   0.496     1    1
7       1.34443E-5                          1    0
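The indicator-function criterion in Table I can be read off programmatically: n is taken at the minimum of IND. The values below are transcribed from the r_0 block of the table, and the one-liner is purely illustrative:

```python
import numpy as np

# IND values for factors 1-6 from the r_0 block of Table I.
ind_r0 = [1.31735e-4, 1.13625e-4, 6.44587e-6, 1.14593e-5, 2.57835e-5, 1.03134e-4]

n = int(np.argmin(ind_r0)) + 1   # factors are numbered from 1
print(n)  # -> 3, matching the three known components
```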