17 Simple Modeling by Chemical Analogy Pattern Recognition 1
2
3
W. J. Dunn III , Svante Wold , and D. L. Stalling 1
Department of Medicinal Chemistry and Pharmacognosy, University of Illinois at Chicago, Chicago, IL 60612 Research Group for Chemometrics, Umea University, S901 87 Umea, Sweden Columbia National Fisheries Research Laboratory, U.S. Fish and Wildlife Service, Columbia, MO 65201
2
Downloaded by UNIV OF BATH on July 3, 2016 | http://pubs.acs.org Publication Date: November 6, 1985 | doi: 10.1021/bk-1985-0292.ch017
3
The overall objective of the use of pattern recognition in environmental problems is to identify, categorize or classify samples based on chemical data describing the samples. A number of pattern recognition methods are available for application to measured chemical data. Although these methods can be used to classify single compounds or the components of complex mixtures, they sometimes differ considerably in the way in which the classification rules are derived and applied. The SIMCA (Simple Modelling by Chemical Analogy) method is unique in having been developed specifically for appli cation to chemical data and it has been shown in a number of studies to work very well. Its advantages are discussed. The p o t e n t i a l o f modern chemical instrumentation to detect and measure the composition o f complex mixtures has made i t necessary to consider the use o f methods o f multivariable data analysis i n the o v e r a l l evaluation of environmental measurements. In a number o f instances, the category (chemical class) of the compound that has given r i s e to a series of signals may be known but the s p e c i f i c e n t i t y responsible for a given signal may not be. This i s true, for example, for the polychlorinated biphenyls (PCB's) i n which the clean-up procedure and use o f s p e c i f i c detectors eliminates most p o s s i b i l i t i e s except PCB's. Such h i e r a r c h i c a l procedures simplify the problem somewhat but i t i s s t i l l advantageous to apply data reduction methods during the course of the interpretation process. A method that has been used with increasing success i s the SIMCA method of pattern recognition (1). This method i s extremely powerful when applied to data on complex mixtures, and a number of reports on such applications have recently appeared {2, 3, f ) · Steps i n a Pattern Recognition
Study
Pattern recognition methods are usually applied i n discrete steps, which are outlined here. I t i s assumed that the chemical measurements 0097-6156/ 85/ 0292-0243S06.00/ 0 © 1985 American Chemical Society Breen and Robinson; Environmental Applications of Chemometrics ACS Symposium Series; American Chemical Society: Washington, DC, 1985.
ENVIRONMENTAL APPLICATIONS OF CHEMOMETRICS
244
that characterize the samples in a study are relevant to the formulated problem. Steps
Downloaded by UNIV OF BATH on July 3, 2016 | http://pubs.acs.org Publication Date: November 6, 1985 | doi: 10.1021/bk-1985-0292.ch017
1. 2. 3. 4. 5. 6.
Establish t r a i n i n g sets. Derive c l a s s i f i c a t i o n r u l e s . Select features. Refine c l a s s i f i c a t i o n r u l e s . C l a s s i f y unknowns. Review results g r a p h i c a l l y .
Various methods work i n similar ways with regard to some of the steps but the methods may d i f f e r i n very s i g n i f i c a n t and c r i t i c a l ways in other steps. In some ways the SIMCA method i s unique when viewed in the context of these steps. Establishment of Training Sets. The t r a i n i n g sets are the samples or compounds that are to serve as the basis of c l a s s i f i c a t i o n . It i s assumed that t h i s set and i t s associated data are representative of the class of samples that are to be categorized. The process of setting up t r a i n i n g sets i s somehwat a r b i t r a r y in the sense that i t i s based on the experience and knowledge of the person or persons conducting the analysis. This step i s not a function of the method of analysis being used. I t i s important, however, that an a n a l y t i c a l chemist who i s intimately f a m i l i a r with the instrument and with sample behavior be involved at t h i s point. Once the t r a i n i n g sets have been established, i t i s necessary to obtain data on them relevant to c l a s s i f i c a t i o n of subsequent samples. These data are the basis of the c l a s s i f i c a t i o n rules to be derived. These samples of unknown c l a s s assignment are known as the test samples or c o l l e c t i v e l y as the test set. The training set(s) and test set are tabulated with their data, as i n Figure 1. In matrix notation, the data describing the samples can be expressed as a vector, as the Equation (1), and each sample, i s then x
k
•
X
^ l'
X
2
f
X
3'
X
X
i ·•* pJ
(1)
represented as a point i n p_-dimensional space. When the data for the training sets are projected into variable space, the classes, i d e a l l y cluster as i n Figure 2. Derivation of C l a s s i f i c a t i o n Rules Up to this point the methods of c l a s s i f i c a t i o n operate in the same way. They d i f f e r considerably, however, in the way that rules for c l a s s i f i c a t i o n are derived. In t h i s regard the various methods are of three types: 1) class discrimination or hyperplane methods, 2) d i s tance methods, and 3) class modeling methods. In the class discrimination methods or hyperplane techniques, of which linear discriminant analysis and the linear learning machine are examples, the equation of a plane or hyperplane i s calculated that separates one class from another. These methods work well i f p r i o r knowledge allows the analyst to assume that the test objects must
Breen and Robinson; Environmental Applications of Chemometrics ACS Symposium Series; American Chemical Society: Washington, DC, 1985.
DUNN ET AL.
SIMCA
Pattern Recognition
variable sample 1
1 x
l l
2 x
12
ι X
l
*
*
ρ
i
2
Downloaded by UNIV OF BATH on July 3, 2016 | http://pubs.acs.org Publication Date: November 6, 1985 | doi: 10.1021/bk-1985-0292.ch017
3
k
x
k l*
x
k i
η Figure 1.
Data matrix for a pattern recognition study.
Figure 2. Graphical representation of 2-classes of pattern recognition data i n 3-dimensions.
Breen and Robinson; Environmental Applications of Chemometrics ACS Symposium Series; American Chemical Society: Washington, DC, 1985.
ENVIRONMENTAL APPLICATIONS OF CHEMOMETRICS
Downloaded by UNIV OF BATH on July 3, 2016 | http://pubs.acs.org Publication Date: November 6, 1985 | doi: 10.1021/bk-1985-0292.ch017
246
belong to one of the two classes. The p o s s i b i l i t y that the test set samples are members of neither of the training sets i s not allowed by these methods. I t i s also necessary that the number of objects be much greater than the number of variables. The distance methods operate d i f f e r e n t l y . The c l a s s i f i c a t i o n of a test set member i s based on the c l a s s assignment of the samples i n the t r a i n i n g set nearest to the unknowns. The type of distance used can d i f f e r but i s usually the Euclidian distance, and the number of nearest neighbors i s selected in advance. Usually the 3 to 5 nearest neighbors are selected and the p o s s i b i l i t y that the unknown may not be represented in the t r a i n i n g sets i s allowed. Only one class modeling method i s commonly applied to a n a l y t i c a l data and this i s the SIMCA method (1) of pattern recognition. In t h i s method the class structure (cluster) i s approximated by a point, l i n e , plane, or hyperplane. Distances around these geometric func tions can be used to define volumes where the classes are located i n variable space, and these volumes are the basis for the c l a s s i f i c a tion of unknowns. This method allows the development of information beyond class assignment (5). The class models are of the form of Equation 2, which i s a _ A ^ · =Χ· + ^ k . + e, . ki ι ^ ka a i ki a—u
Σ
(2)
p r i n c i p a l components model. For A=0 the samples in a class are i d e n t i c a l and the c l a s s i s represented by a point i n space; for A=l i t i s represented by a l i n e , and for A > 2 by a plane or hyperplane. The objective of p r i n c i p a l components modeling i s to approximate the systematic c l a s s structure by a model of the form of Equation 2. This i s shown diagramatically below i n Equation 3. Here X i s the
(3)
data matrix for a training set. The vector product of the t's (princ i p a l components) and b's (loadings), represent the systematic part of the matrix and Ε represents the residual matrix. In t h i s example, two p r i n c i p a l components are a r b i t r a r i l y selected. More or fewer may be necessary, and t h i s i s a function of a predetermined stopping rule for extraction of p r i n c i p a l components from X. In SIMCA method, a cross v a l i d a t i o n technique {J) i s used. SIMCA uses the NIPALS (Nonlinear i t e r a t i v e P A r t i a l Least Squares) algorithm for p r i n c i p a l component abstraction (6). Due to the s i m p l i c i t y of the algorithm and the ease of programming i t for use
Breen and Robinson; Environmental Applications of Chemometrics ACS Symposium Series; American Chemical Society: Washington, DC, 1985.
17. DUNN ET AL.
SIMCA
Pattern Recognition
247
on small computers, a short discussion of the NIPALS algorithm i s presented here. The NIPALS procedure works as i n Figure 3. One a r b i t r a r i l y selects a normalized vector b. The product X b' i s a η χ 1 vector. The product of t h i s vector in transpose with X gives a 1 χ ρ vector. Normalized, t h i s vector can be used to update b in the f i r s t multiplication. This i s continued i t e r a t i v e l y u n t i l t and b con verge, usually within about 25 interactions. The product t b is then substracted from X and the process continued with the matrix Ε u n t i l a stopping point i s reached. This method has the advantage of not requiring matrix inversion for c a l c u l a t i o n of the p r i n c i p a l com ponents .
Downloaded by UNIV OF BATH on July 3, 2016 | http://pubs.acs.org Publication Date: November 6, 1985 | doi: 10.1021/bk-1985-0292.ch017
Feature Selection Feature selection i s the process by which the data or variables impor tant for class assignment are determined. In t h i s step of a pattern recognition study the various methods d i f f e r considerably. In the hyperplane methods, the strategy i s to begin with a block of variables for the classes, calculate a c l a s s i f i c a t i o n function, and test i t for c l a s s i f i c a t i o n of the training set. In t h i s i n i t i a l phase, generally many more variables are included than are necessary. Variables are then detected in a stepwise process and a new rule i s derived and tested. This process i s repeated u n t i l a set of variables i s obtained that w i l l give an acceptable l e v e l of c l a s s i f i c a t i o n . This approach to feature selection leads to a set of descriptors that are optimal for c l a s s discrimination. These variables may or may not contain information that describes the classes. In SIMCA, a c l a s s modeling method, a parameter c a l l e d modeling power i s used as the basis of feature s e l e c t i o n . This variable i s defined in Equation 4, where S., i s the standard deviation of a v a r i -
MPOW = 1 - S / S i
(4)
i
able after i t i s f i t t e d to Equation 2 and S. i s the standard devia tion before i t i s f i t t e d to the model. This'parameter i s a measure of how well the variable contributes to the systematic class structure; i t s values are in the range 0 < MPOW < 1. Variables with low MPOW, which are considered noise, are deleted; those with high MPOW are retained. This c r i t e r i o n for s e l e c t i o n o f features leads to a set o f descriptors that contain optimal information about c l a s s membership as opposed to information about c l a s s differences. Model Refinement After the feature s e l e c t i o n process has been c a r r i e d out once by the SIMCA method, i t i s necessary to r e f i n e the model because the model may s h i f t s l i g h t l y . This r e f i n i n g of the model leads to an optimal set of descriptors with optimal mathematical structure.
American Chemical Society Library 1155 IStt St., N.W. D.C. 20038 Breen and Robinson;Washington, Environmental Applications of Chemometrics ACS Symposium Series; American Chemical Society: Washington, DC, 1985.
Downloaded by UNIV OF BATH on July 3, 2016 | http://pubs.acs.org Publication Date: November 6, 1985 | doi: 10.1021/bk-1985-0292.ch017
ENVIRONMENTAL APPLICATIONS OF CHEMOMETRICS
Figure 3. NIPALS algorithm for extraction of p r i n c i p a l components from a data matrix.
Breen and Robinson; Environmental Applications of Chemometrics ACS Symposium Series; American Chemical Society: Washington, DC, 1985.
17.
DUNN ET AL.
SIMCA
Pattern Recognition
249
Downloaded by UNIV OF BATH on July 3, 2016 | http://pubs.acs.org Publication Date: November 6, 1985 | doi: 10.1021/bk-1985-0292.ch017
C l a s s i f i c a t i o n of Unknowns Class assignment, by the methods of c l a s s i f i c a t i o n discussed e a r l i e r , d i f f e r s considerably. In the hyperplane methods, a plane or hyper plane i s calculated that separates each c l a s s , and c l a s s assignment i s based on the side of t h i s discriminant plane on which the unknown falls. The l i m i t a t i o n of t h i s approach i s that i t requires p r i o r knowledge (or an assumption) that the unknown be a member of one of the classes in the training sets. In the distance methods, c l a s s assignment i s based on the d i s tance of the unknown to i t s k-nearest neighbors; since the distances of the t r a i n i n g set objects from each other are known, one can deter mine whether an unknown i s not a member of the t r a i n i n g sets. Since SIMCA i s a c l a s s modeling method, c l a s s assignment i s based on f i t of the unknowns to the class models. This assignment allows the c l a s s i f i c a t i o n r e s u l t that the unknown i s none of the described classes, and has the advantage of providing the r e l a t i v e geometric portion of the newly c l a s s i f i e d object. This makes i t possible to assess or quantitate the test sample in terms of external variables that are available for the t r a i n i n g sets. Graphical Presentation of Results This aspect of data analysis i s somewhat neglected, as i t i s a s s o c i ated more with the interpretation of results than with the analyses. I t i s important to be able to view the structure of the data for the classes. This i s done in a variety of ways depending on the a n a l y t i c a l methods. The graphical technique most commonly used i s that of p l o t t i n g eigenvectors or p r i n c i p a l components. SIMCA uses t h i s method and software has been developed for three-dimensional color display of p r i n c i p a l components data. Other p l o t t i n g tech niques are also used i n SIMCA. The SIMCA method of pattern recognition i s i n a comprehensive set of programs for c l a s s i f i c a t i o n , and we have discussed how i t works in t h i s regard. C l a s s i f i c a t i o n problems represent only a few of types of problems that can be solved with t h i s approach.
Literature Cited 1. 2. 4. 5. 6. 7.
Wold, S. Pattern Recognition 1976, , 127-139. Dunn, W. J. III; Stalling, D. L.; Schwartz, T. R.; Hogan, J. W.; Petty, J. D.; Johansson, E.; Lindberg, W.; Sjostrom, M. Analysis 1984, 12, 477-485. Lindberg, W.; Persson, J.-A.; Wold, S. Anal. Chem. 1983, 55, 643-648. Albano, C.; Dunn, W. J. III; Edlund, U.; Johansson, E.; Norden, B.; Sjostrom, M.; Wold S. Anal. Chim. Acta Comp. Tech. Optim. 1978, 2, 429-443. Wold, S.; Albano, C.; Dunn, W. J. III; Esbensen, K.; Heliberg, S.; Johansson, Ε.; Sjostrom, M. In "Food Research and Data Analysis"; Martens, H.; and Russwurm, H.; Eds., Science: London, 1983; pp. 182-189. Wold, S. Technometrics 1978, 20, 397-406.
RECEIVED July 17, 1985
Breen and Robinson; Environmental Applications of Chemometrics ACS Symposium Series; American Chemical Society: Washington, DC, 1985.