Discriminant analysis by double stage principal component analysis


Anal. Chem. 1983, 55, 1710-1712


Discriminant Analysis by Double Stage Principal Component Analysis

Ronald Hoogerbrugge,* Simon J. Willig, and Piet G. Kistemaker

FOM-Institute for Atomic and Molecular Physics, Kruislaan 407, 1098 SJ Amsterdam, The Netherlands

Discriminant analysis is possible by double stage principal component analysis. The mathematical justification and a flow scheme of the implementation in the ARTHUR package are given. Also a comparison has been made between the discriminant analysis results obtained with ARTHUR and with SPSS.

ANALYTICAL CHEMISTRY, VOL. 55, NO. 11, SEPTEMBER 1983
0003-2700/83/0355-1710$01.50/0 © 1983 American Chemical Society

The proliferation of modern analytical techniques, which supply the chemist with large arrays of data for the objects (samples) analyzed, has forced the researcher to adopt pattern recognition techniques (1). Every chemist who has tried to interpret spectral patterns of highly complex mixtures has realized that human intelligence alone is not sufficient to unravel the composite spectra adequately. In our laboratory highly complex pyrolysis mass spectra of organic samples like microorganisms, body fluids, biopolymers, and geological samples are interpreted routinely. The combination of this mass spectrometric analysis and automated data analysis has been successful in a wide area of applications, e.g., quantitative analysis of polymers (2), typing of microorganisms (3), and analysis of ancient and recent sediments (4).

The mathematical procedures adopted in multivariate data analysis are of a quite general nature and, besides analytical chemistry, have been applied widely in the social and medical sciences, biology, and image reconstruction. The output of the calculations is preferably presented as the results of a statistical evaluation of the identities of the objects analyzed. The mutual relationship of the objects is visualized in so-called scatter plots. In these two-dimensional plots the objects are represented by points whose coordinates are determined by the scores on only two selected variables. To construct the plots, new variables can be composed which describe the character of the objects as well as possible. The most effective new variables are the principal components (5, 6). Therefore data analysis packages like SPSS (7) and ARTHUR (8) contain a principal component analysis procedure.

However, in identification projects, principal components are not the optimal choice and generally discriminant functions (5) have to be used. These discriminant functions describe most efficiently the differences between predefined groups of objects with respect to the variation within the groups. In many cases the number of variables is too large compared with the number of objects (5). Therefore it is necessary first to reduce the number of variables, preferably with principal component analysis. This approach not only gives more reliable discriminant analysis results but also offers the possibility of performing discriminant analysis by a second principal component analysis, as will be shown in this paper.

Recently we started to investigate the potential of the ARTHUR package for data analysis of pyrolysis mass spectra. Because our data sets are built up of grouped data, discriminant analysis is essential in addition to the available supervised and unsupervised classification methods. Because of the small number of objects analyzed per group, the SIMCA (6) method cannot be used. Therefore we investigated the possibility of implementing discriminant analysis in the package as simply as possible. In this paper the mathematical justification for our approach to discriminant analysis will be given together with the required flow diagram used in the ARTHUR package.

METHOD

[Figure 1: program flow diagram, not reproduced.]
Figure 1. Flow scheme of discriminant analysis. Each procedure is represented by a rectangle which contains the function at the left side and the name of the routine in the ARTHUR package at the right side. The program flow is marked with a heavy bar and for each routine incoming and outgoing data are indicated.

The first linear discriminant function (LDF), $D_1$, is defined as

$$\frac{D_1^{T} B D_1}{D_1^{T} W D_1} = \text{maximal} \tag{1}$$

where $D_1^{T}$ is the transposed LDF, $B$ is the between group covariance matrix, and $W$ is the within group covariance matrix. For the second LDF, $D_2$, the same definition holds but with the precondition that $D_1$ and $D_2$ are independent. Formally we can define a transformation matrix $M$ as $M^{T} W M = I$. Defining $B' = M^{T} B M$ and $D_1' = M^{-1} D_1$, eq 1 reduces to

$$\frac{{D_1'}^{T} B' D_1'}{{D_1'}^{T} D_1'} = \text{maximal}$$

This is similar to the definition of the first principal component (PC) of $B'$ (5). For $M$ we can choose the scaled eigenvectors of $W$, so it is possible to calculate LDF's by double stage PC analysis.

For reliable LDF's it is essential to make the number of variables (features) small compared with the number of objects (9). It is therefore worthwhile to reduce the number of features with a minimum loss of information. Data reduction on the basis of the prominent PC's of $W$ is not adequate because an uncontrollable amount of between-group variance can be lost. Therefore the transformation matrix $M$ has to be built up of the PC's (with large eigenvalues) of the total covariance matrix $C = B + W$. Routinely the PC's with an eigenvalue exceeding 1% of the total variance present are selected. However, the number of PC's retained has to be less than a quarter of the number of spectra in the set analyzed. When we scale $M$ so that $C' = M^{T} C M = I$, the transformed within group matrix $W' = M^{T} W M$ can be written as $I - B'$. Equation 1 now reads

$$\frac{{D_1'}^{T} B' D_1'}{{D_1'}^{T} D_1' - {D_1'}^{T} B' D_1'} = \text{maximal} \tag{2}$$

The first eigenvector of $B'$ (i.e., with the largest eigenvalue) is the solution of eq 2, and the set of eigenvectors of $B'$ is equal to the set of transformed LDF's $D' = M^{-1} D$. The eigenvalue $\lambda_{LDF}$ of an LDF can be calculated from the eigenvalue $\lambda_B$ of $B'$ by

$$\lambda_{LDF} = \frac{\lambda_B}{1 - \lambda_B} \tag{3}$$
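The step from eq 2 to eq 3 can be spelled out as follows. This is a short worked check added for clarity rather than part of the original derivation. It assumes that the transformed within group matrix $W' = I - B'$ is positive definite, so that every eigenvalue of $B'$ satisfies $0 \le \lambda_B < 1$. The symbols $r$ (the Rayleigh quotient of $B'$) and $v$ (an eigenvector of $B'$) are introduced here only for illustration.

```latex
r(D') = \frac{{D'}^{T} B' D'}{{D'}^{T} D'}
\quad\Longrightarrow\quad
\frac{{D'}^{T} B' D'}{{D'}^{T} D' - {D'}^{T} B' D'} = \frac{r}{1 - r},
\qquad
B' v = \lambda_B v
\quad\Longrightarrow\quad
r(v) = \lambda_B ,\;\;
\frac{r(v)}{1 - r(v)} = \frac{\lambda_B}{1 - \lambda_B}
```

Because $r/(1 - r)$ increases monotonically with $r$ on $[0, 1)$, maximizing the ratio in eq 2 is equivalent to maximizing the Rayleigh quotient $r$, and the maximum is reached at the eigenvector of $B'$ with the largest eigenvalue.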

Discriminant scores are the projections of the objects on the LDF's. Neglecting the loss of information introduced by skipping the PC's with a small eigenvalue, the discriminant scores are equal to the projection of the factor scores on the transformed LDF's.

A flow scheme of the approach is shown in Figure 1. For the execution of this type of discriminant analysis only one subroutine has to be added to the ARTHUR package: the routine "Mean", which replaces each spectrum by its category mean. The first pass through the feature scaling is not essential for the discriminant analysis but has been shown to be beneficial for various data sets (11). Furthermore, using the ARTHUR package version 7-1-78, the factor scores were much more accurate when the original features were autoscaled (i.e., zero mean and unit variance). The autoscaling of the factor scores is required to make the transformed covariance matrix, $C'$, equal to $I$. It has to be noted that the procedure "Scale" in the ARTHUR package, version 7-1-78, scales the feature variances to $(n - 1)^{-1}$ instead of unity ($n$ is the number of objects in the data set). Although this scaling does not influence the final discriminant scores, the eigenvalues $\lambda_{LDF}$ cannot then be calculated according to eq 3. For convenience we modified the procedure "Scale" to give truly autoscaled data.
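The sequence of steps described above (autoscale, first PCA, autoscale the factor scores, replace by category means, second PCA) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the ARTHUR code: the function names `autoscale`, `category_means`, and `double_stage_lda`, the parameter `min_var_fraction`, and the synthetic data are all hypothetical, and every covariance matrix uses $n - 1$ in the denominator. The sketch also assumes at least two features and more objects than retained PC's plus groups.

```python
import numpy as np


def autoscale(X):
    """Scale every column to zero mean and unit variance (n - 1 denominator)."""
    Xc = X - X.mean(axis=0)
    return Xc / Xc.std(axis=0, ddof=1)


def category_means(X, labels):
    """Replace each row by the mean of its category (the role of the "Mean" routine)."""
    out = np.empty_like(X, dtype=float)
    for g in np.unique(labels):
        rows = labels == g
        out[rows] = X[rows].mean(axis=0)
    return out


def double_stage_lda(X, labels, min_var_fraction=0.01):
    """Return (discriminant scores, LDF eigenvalues) from two PCA stages."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = X.shape[0]

    # Stage 1: PCA of the total covariance matrix C of the autoscaled features.
    Xs = autoscale(X)
    evals, evecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]

    # Keep PC's above 1% of the total variance, and fewer than n / 4 of them.
    keep = evals > min_var_fraction * evals.sum()
    keep[max(1, int(np.ceil(n / 4)) - 1):] = False

    # Autoscaled factor scores: their covariance C' is the identity matrix.
    F = autoscale(Xs @ evecs[:, keep])

    # Stage 2: PCA of B', the covariance of the category-mean-replaced scores.
    Bp = np.atleast_2d(np.cov(category_means(F, labels), rowvar=False))
    lam_b, V = np.linalg.eigh(Bp)
    order = np.argsort(lam_b)[::-1]
    n_ldf = min(len(np.unique(labels)) - 1, F.shape[1])
    lam_b, V = lam_b[order][:n_ldf], V[:, order][:, :n_ldf]

    # Scores are projections of the factor scores; eigenvalues follow eq 3.
    return F @ V, lam_b / (1.0 - lam_b)


# Synthetic example (hypothetical data): 3 groups of 10 objects, 5 features.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 10)
centers = rng.normal(scale=3.0, size=(3, 5))
X = centers[labels] + rng.normal(size=(30, 5))
scores, lam_ldf = double_stage_lda(X, labels)
```

Because the same $n - 1$ denominator is used throughout, $B' + W' = I$ holds for the autoscaled factor scores, so the ratio of between group to within group variance of each score column should reproduce the value given by eq 3.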

RESULTS AND DISCUSSION

A typical result of an ARTHUR run using the sequence of routines as indicated in Figure 1 is shown in Figure 2. The data set used is a series of pyrolysis mass spectra of six polyhexoses with different linkage types. Each polymer was analyzed five times to measure the reproducibility of the analytical procedure. These subsets of five spectra were treated as a group and the representative points of a group in the scatter plots are connected. It is clear that for separation purposes the LDF's are superior to the PC's. The significant differences between the groups of spectra as expressed by the LDF's are explained in ref 12.

[Figure 2: scatter plots, not reproduced. Recoverable axis labels: Score F1 (panel a), Score D1 (panel b).]
Figure 2. Scatter plots of Curie-point pyrolysis mass spectra of a series of polyhexoses with different linkage types: (a) first two principal components; (b) first two discriminant functions.

The algorithm used to calculate eigenvectors in the ARTHUR package has not been advertised as being highly sophisticated. For this reason the discriminant analysis results obtained with the ARTHUR package were compared with the results obtained by data processing with the SPSS package. Three sets of pyrolysis mass spectra were selected from various fields of interest: (a) structure analysis of carbohydrates (12), (b) taxonomy of yeasts (3), (c) identification of mycobacteria (13). The scatter plots of the first two LDF's obtained with both packages are very similar for each of the three sets analyzed. For the carbohydrates the correlation of the scores on the first LDF's, obtained with both packages, is 0.985. Also for the second LDF's the correlation is 0.985. This is quite sufficient for graphical presentation purposes.

ACKNOWLEDGMENT

The authors thank W. Windig for pointing out the fact that autoscaled factor scores lead to orthogonal discriminant functions and for helpful discussions.

LITERATURE CITED

(1) Kowalski, B. R. Anal. Chem. 1975, 47, 1152A.
(2) van de Meent, D.; de Leeuw, J. W.; Schenck, P. A.; Windig, W.; Haverkamp, J. J. Anal. Appl. Pyrolysis 1982, 4, 133-142.
(3) Windig, W.; Haverkamp, J.; Kistemaker, P. G. Anal. Chem. 1983, 55, 81.
(4) van Graas, G.; de Leeuw, J. W.; Schenck, P. A.; Haverkamp, J. Geochim. Cosmochim. Acta 1981, 45, 2465-2474.
(5) Tiedemen; Tatsuoka; Langmuir "Multivariate Statistics for Personal Classification"; Wiley: London, 1967.
(6) Massart, D. L.; Dijkstra, A.; Kaufman, L. "Evaluation and Optimization of Laboratory Methods and Analytical Procedures"; Elsevier: Amsterdam, 1978.
(7) Nie, H. N.; Hull, C. H.; Jenkins, J. G.; Steinbrenner, K.; Bent, D. H. "Statistical Package for the Social Science"; McGraw-Hill: New York, 1975.
(8) "ARTHUR"; Infometrix, Inc.: Seattle, WA.
(9) Harper, A. M.; Duewer, D. L.; Kowalski, B. R.; Fasching, J. L. In "Chemometrics: Theory and Application"; Kowalski, B. R., Ed.; American Chemical Society: Washington, DC, 1977; ACS Symposium Series No. 52.
(10) Wold, S.; Sjostrom, M. In "Chemometrics: Theory and Application"; Kowalski, B. R., Ed.; American Chemical Society: Washington, DC, 1977; ACS Symposium Series No. 52.
(11) Eshuis, W.; Kistemaker, P. G.; Meuzelaar, H. L. C. J. Anal. Appl. Pyrolysis 1977, 199-212.
(12) van der Kaaden, A.; et al., unpublished work, FOM-Institute for Atomic and Molecular Physics, Amsterdam, 1982.
(13) Wieten, G.; Haverkamp, J.; Groothuis, D. G.; Berwald, L. G.; David, H. L. J. Gen. Microbiol., in press.

RECEIVED for review February 16, 1983. Accepted May 2, 1983. This investigation was supported by the Foundation for Fundamental Research on Matter (FOM).