An Introduction to Multivariate Calibration and Analysis

Kenneth R. Beebe
Bruce R. Kowalski
Laboratory for Chemometrics
Department of Chemistry, BG-10
University of Washington
Seattle, Wash. 98195

0003-2700/87/A359-1007$01.50/0 © 1987 American Chemical Society

The advent of the laboratory computer has allowed analytical chemists to collect enormous amounts of data on a wide variety of problems of interest. With this ability, however, has come the realization that more data does not necessarily mean more information. One can collect reams of computer output without knowing anything more about the system under investigation. Only when the data are interpreted and put to use do they become valuable to the chemist and to society in general: then data become information. For this reason, data analysis methods that can handle large quantities of data are becoming a necessity for the practicing [...] the awareness of the need for more sophisticated techniques grows. However, widespread use of the full set of analytical tools will be realized only when the analyst is familiar with the general goals and advantages of the methods.

This REPORT serves as an introduction to the area of multivariate analysis known as multivariate calibration. Our goal is to give the chemist insight into the workings of a collection of statistical techniques and to enable him or her to judge the appropriateness of the techniques for an application of interest. The list of methods described is by no means a complete representation of those that can be applied to multivariate data. Also, we do not compare the results obtained using one method with those of another; instead, we describe the methods in the manner of a tutorial. For this reason, the examples chosen illustrate how the methods work rather than compare among them.

The three multivariate methods that will be discussed are multiple linear regression (MLR), principal component regression (PCR), and partial least squares (PLS). MLR, an extension of ordinary least squares, is the easiest to understand and also the most commonly used; PCR and PLS have yet to achieve widespread acceptance in chemistry.
Calibration and prediction

Multivariate statistics is a collection of powerful mathematical tools that can be applied to chemical analysis when more than one measurement is acquired for each sample. For the sake of consistency, near-infrared (near-IR) reflectance analysis will be used as an example of such an analytical method throughout this REPORT. However, the techniques described can be used in any situation where multiple measurements are acquired.

One example of an analytical problem solved by near-IR reflectance analysis is the estimation of the protein and moisture content of wheat samples. As with many analytical methods, this procedure consists of two phases: calibration and prediction. The chemist begins by constructing a data matrix (R) from the instrument responses at selected wavelengths for a given set of calibration samples. In the case of near-IR reflectance analysis, the R matrix can be constructed from the logarithmic reflectances (log(1/ref)) or from some other combination of reflectances obtained at the various wavelengths (4). A matrix of concentration values (C) is then formed using independent or referee methods such as Kjeldahl nitrogen analysis for protein determination and oven-drying for the determination of moisture content. The goal of the calibration phase is to produce a model that relates the near-IR spectra to the values obtained by the independent or referee methods.

Figure 1 illustrates the resulting matrices for this example. R is a 3 × 8 matrix with three rows of near-IR spectra from the analysis of three samples and eight columns corresponding to the eight near-IR wavelengths chosen for the analysis. (Throughout this REPORT, these columns will be referred to as the original variables for R.) The C matrix is 3 × 2 dimensional for this example. Again, the three samples occupy the three rows, and the two columns represent the protein and moisture content of the samples as determined by the referee methods. According to this notation, the terms c_i1 and c_i2 represent the protein content
ANALYTICAL CHEMISTRY, VOL. 59, NO. 17, SEPTEMBER 1, 1987 • 1007 A
and moisture content, respectively, for the ith wheat sample found in the ith row of C.

Figure 1. Configuration of R and C matrices.

The next step in the calibration phase is the foundation of the entire analysis. The analyst must choose an appropriate mathematical method that will best reproduce C given the R matrix. How the analyst defines "best" will ultimately determine the method that is chosen. MLR assumes that the best approach to estimating C from R is to find the linear combination of the variables in R that minimizes the errors in reproducing C. It proposes a relationship between C and R such that C = RS + E, where S is a matrix of regression coefficients and E is a matrix of errors associated with the MLR model. S is estimated by linear regression, and the term

\[
\mathrm{Err} = \sum_{i}\sum_{k}\left(c_{ik} - \hat{c}_{ik}\right)^{2} = \sum_{i}\sum_{k} e_{ik}^{2} \qquad (1)
\]

is minimized. In this expression, c_ik is the actual concentration element in the ith row and kth column of C, ĉ_ik is the MLR estimate of the same element using S, and e_ik is the corresponding element of the matrix E.

This approach appears to be reasonable, and in fact it is the best method when the analyst is dealing with well-behaved systems (linear responses, no interfering signals, no analyte-analyte interactions, low noise, and no collinearities). However, problems can be encountered with the use of real data. A model chosen solely according to this criterion attempts to use all of the variance in the R matrix, including any irrelevant information, to model C. When the resulting model is then applied to a new sample, the model will assume that the correlation found between the calibration R and C matrices also exists in that sample. Because the model was built using irrelevant information in the R matrix, this assumption will not be true. Unfortunately, even noise has a very high probability of being used to build the model. The following example illustrates this point. Consider the matrices

[Matrices R, C, and S; entries illegible in the source scan.]

MLR was used to determine the S matrix, as described earlier. A standard measure of the effectiveness of the model is the value of Err as presented above. For this example Err = 0.49, and it will be assumed that the matrix S is an accurate estimate of the true model. The coefficients in S, therefore, will closely approximate the true relationship between the variables in R and C.

To illustrate how a calibration method can be inappropriate, a column of random numbers ranging from zero to 100 was added to the R matrix. The addition of this column is analogous to the inclusion of a wavelength in near-IR analysis that has no useful information for describing the protein or moisture content of the samples. The resulting matrix R2 and MLR model are
[Matrices R2 and S2; entries illegible in the source scan.]
The model resulting from the regression of the same C as in the first example onto this new matrix R2 is given in S2. For this model, Err = 0.07, where the smaller value implies that this second model is more effective at modeling C. To further evaluate these results, note that when R is multiplied by S, the jth column of R is always multiplied by the jth row of S. The importance of variable j to the model can therefore be determined by examining the jth row of S. The nonzero entries in the fourth row of S2 reveal that a variable consisting of random numbers (column number four of R2) was chosen as a significant contributor to the calibration model. In an ideal analysis, this random variable would have been ignored in the model-building phase of the analysis. Furthermore, the inclusion of this column in R2 has changed the estimated model coefficients so that they no longer represent the true model. The upper 3 × 3 portion of S2, which represents the model for the first three variables in R2, is not equal to S, which represents the true model for the same variables.

The addition of a column of random numbers has resulted in a model that appears to be better, in that it is more effective at reproducing C, and yet it does not describe the true relationship. This is because MLR uses all of the matrix R to build the model, regardless of whether or not it is relevant in describing the true model. Therefore, an erroneous model can be derived and subsequently used to predict the characteristics (e.g., protein content) of new samples. Thus MLR alone often will generate misleading models with subsequent errors in prediction.

The remainder of this REPORT investigates the advantages of PCR and PLS over MLR. To understand these advantages, it is necessary to understand how the methods work. Graphical representations of many of the concepts have been included so that the reader not well versed in linear algebra will be able to follow the discussions.
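The MLR example with the added random column can be reproduced on synthetic data. The sketch below is not the authors' calculation: the matrices are randomly generated stand-ins (the published entries are not used), `mlr_fit` and `err` are hypothetical helper names, and Err is assumed to be the sum of squared residuals between C and RS. NumPy's `numpy.linalg.lstsq` supplies the least-squares estimate of S.

```python
import numpy as np

def mlr_fit(R, C):
    """Least-squares estimate of S in the model C = R S + E."""
    S, *_ = np.linalg.lstsq(R, C, rcond=None)
    return S

def err(R, C, S):
    """Sum of squared residuals between C and its MLR estimate R S (assumed form of Err)."""
    E = C - R @ S
    return float(np.sum(E ** 2))

rng = np.random.default_rng(0)  # fixed seed; every number below is synthetic

# Hypothetical calibration set: 6 samples, 3 response variables, 3 constituents.
R = rng.uniform(0.0, 10.0, size=(6, 3))
S_true = np.array([[0.7, 0.5, 0.5],
                   [0.4, -0.4, -0.2],
                   [-0.1, 0.3, 0.1]])  # made-up "true" coefficients
C = R @ S_true + rng.normal(scale=0.3, size=(6, 3))

S1 = mlr_fit(R, C)
err1 = err(R, C, S1)

# Append a column of random numbers between 0 and 100 that carries no information about C.
noise_col = rng.uniform(0.0, 100.0, size=(6, 1))
R2 = np.hstack([R, noise_col])
S2 = mlr_fit(R2, C)
err2 = err(R2, C, S2)

print("Err without the random column:", err1)
print("Err with the random column:   ", err2)
print("Row of S2 belonging to the random column:", S2[3])
print("Largest change in the first three rows:", np.abs(S2[:3] - S1).max())
```

Because the columns of R are a subset of the columns of R2, the least-squares fit on R2 can never have a larger Err than the fit on R, whatever the extra column contains. A lower Err on the calibration set is therefore not evidence that the added variable is meaningful.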
Notation

Standard linear algebra notation will be employed throughout this article. Bold uppercase letters refer to matrices; plain uppercase letters are used for their row and column dimensions. For example, the letter R refers to the matrix of responses and is I × J dimensional. Bold lowercase letters signify column vectors such as r2, which represents the second column of matrix R. By this convention r2^T represents the second column of R written as a row vector. Plain lowercase letters are scalars, representing single elements of vectors (m_i), matrices (r_ij), or other constants such as regression coefficients (b). The transpose of a matrix or vector is represented by a superscript T. (See References 5 and 6 for books on linear algebra.)

Matrix representations. Matrices of low dimensionality (i.e., the numbers of rows and columns are small integer values) can be represented graphically. Consider the following 2 × 3 matrix:
R can be represented graphically as R in row space or as R in column space. Row space is the space formed with the rows of R as the axes. It is, therefore, two-dimensional. Because there are three columns, graphing R in row space consists of plotting three points (each point corresponding to a column) in two-dimensional space. The coordinates of the points representing the columns in each dimension are simply the matrix entries in the corresponding row. The resulting plot for this matrix
Figure 2. The matrix R in row space.
Figure 3. The matrix R in column space.
is shown in Figure 2. Analogous to row space, the column space has two points and is three-dimensional. Figure 3 is a two-dimensional representation of this three-dimensional space. This example can be used to understand higher dimensional matrices, so that a 10 × 12 matrix would have 12 points in 10-dimensional row space and 10 points in 12-dimensional column space. Examples in more than three dimensions can be understood if one refers back to two or three dimensions and considers situations of higher dimensionality as expansions of these simpler cases.

Projections. Another concept that is important to understand is that of projecting either a point or a vector onto a vector or plane. Each of these can be viewed as being the perpendicular shadow of one object onto another. Figure 4 illustrates the result of projecting the vector a onto a plane in three-dimensional space.

Factors. The last concept that is basic to understanding both PCR and PLS is that of a factor, because both PCR and PLS are factor-based modeling procedures. For our purposes, a factor is defined as any linear combination of the original variables in R or C. It can be shown that given J linearly independent factors for an I × J matrix R, one can also represent the variables in R as a linear combination of these same J factors. It is important to determine factors for greater freedom in representing the useful information in the R matrix. For example, the information in an I × J matrix R can be expressed as an I × J matrix R', where the columns of R' are linear combinations of the original columns in R. The advantage of this is that if a particular column in R is not useful (as in the MLR example cited earlier), one can attempt to find a set of factors that gives that column small weights when forming R'. This is only one of the advantages of a factor-based calibration method over the methods that simply use the raw data.

The other possible advantages can be understood by examining one of the factor methods mentioned in the introduction: PCR (3, 7, 8). The following discussion concerns the preliminary step of PCR: principal component analysis (PCA). An intuitive feel for how PCA works can be gained by considering the result of PCA performed on the 5 × 2 matrix R shown below.
[5 × 2 matrix R; entries illegible in the source scan.]
In column space, the matrix is five points in two dimensions as shown in Figure 5.
Figure 4. The projection of a vector onto a plane.
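The perpendicular shadow pictured in Figure 4 can be computed numerically. The following is a minimal sketch under stated assumptions: the vector `a`, the matrix `plane` (whose two columns span the plane), and the helper name `project_onto_subspace` are all hypothetical and are not taken from the figure. The projection is obtained by solving a least-squares problem with `numpy.linalg.lstsq`.

```python
import numpy as np

def project_onto_subspace(a, B):
    """Orthogonal projection of vector a onto the space spanned by the columns of B."""
    # Find the combination of the columns of B that lies closest to a ...
    coeffs, *_ = np.linalg.lstsq(B, a, rcond=None)
    # ... and map it back into the original space: this is the shadow of a.
    return B @ coeffs

# Hypothetical example: project a onto the x-y plane of three-dimensional space.
a = np.array([2.0, 3.0, 4.0])
plane = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])  # the two columns span the plane

p = project_onto_subspace(a, plane)
residual = a - p  # the part of a that the plane cannot describe

print("projection:", p)
print("plane^T residual:", plane.T @ residual)  # close to zero: residual is perpendicular
```

The residual being perpendicular to every direction in the plane is what makes the projection the closest point in the plane to the original vector.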
Figure 5. The matrix R plotted in column space with the first eigenvector.
In real analyses, the columns of R are often mean-centered and scaled. Mean centering simply involves subtracting the average of a column from each of the entries in that column. Scaling, which gives equal weights to each variable, involves dividing each entry by the standard deviation of the column. PCA is then performed on the covariance matrix R^T R formed from the mean-centered and scaled matrix. The first eigenvector, corresponding to the largest eigenvalue, is by definition the direction in the space defined by the columns of R that describes the maximum amount of variation or spread in the samples. Figure 5 shows the data and the direction of the first eigenvector where the space defined by R is a plane. In this example, all of the variation in the data can be described using one eigenvector. The samples all fall on a line in column space, and therefore all of the variation lies in one direction.

When all of the variation in the samples cannot be accounted for using only one eigenvector, a second eigenvector can be found that is perpendicular or orthogonal to the first and describes the maximum amount of residual variation (not described by the first eigenvector) in the data set. Figure 6 is the plot of a 30 × 2 matrix and the associated first two eigenvectors. The direction of the first eigenvector describes the maximum amount of variation or spread in the data. In this particular case, the samples in column space happen to fall within a two-dimensional ellipse, and the first eigenvector corresponds to the major axis of the ellipse.
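The PCA steps described here (mean centering, scaling, and an eigen-decomposition of R^T R) can be sketched as follows. The 5 × 2 matrix below is a hypothetical stand-in chosen so that the samples fall on a single line, as in the Figure 5 example; it is not the matrix from the original article. The helper name `pca_eig` is likewise an assumption, and scaling here divides each column by its sample standard deviation so that every variable ends up with unit variance.

```python
import numpy as np

def pca_eig(R, scale=True):
    """Eigenvalues and eigenvectors of X^T X, where X is R mean-centered (and optionally scaled)."""
    X = R - R.mean(axis=0)                  # mean centering, column by column
    if scale:
        X = X / X.std(axis=0, ddof=1)       # unit variance for every column
    evals, evecs = np.linalg.eigh(X.T @ X)  # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(evals)[::-1]         # largest eigenvalue first
    return X, evals[order], evecs[:, order]

# Hypothetical 5 x 2 matrix whose five samples lie on one line in column space.
R = np.array([[-2.0, -4.0],
              [-1.0, -2.0],
              [0.0, 0.0],
              [1.0, 2.0],
              [2.0, 4.0]])

X, evals, V = pca_eig(R)
explained = evals / evals.sum()  # fraction of the total variation along each eigenvector

scores = X @ V            # factors: linear combinations of the columns of X
X_rebuilt = scores @ V.T  # all J factors together reproduce X

print("fraction of variation per eigenvector:", explained)
print("first eigenvector:", V[:, 0])
print("eigenvectors orthogonal:", np.isclose(V[:, 0] @ V[:, 1], 0.0))
```

For data like these, essentially all of the variation falls on the first eigenvector. The `scores` matrix also illustrates the earlier definition of a factor: each of its columns is a linear combination of the original variables, and the full set of J factors recovers the centered and scaled data.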