New approach for distance measurement in locally weighted regression

Anal. Chem. 1994,66, 249-260

New Approach for Distance Measurement in Locally Weighted Regression

Ziyi Wang, Tomas Isaksson,† and Bruce R. Kowalski*

Center for Process Analytical Chemistry, Laboratory for Chemometrics, Department of Chemistry, University of Washington, Seattle, Washington 98195

This paper presents a new approach for distance measurement in locally weighted regression (LWR2) by balancing the information in both chemical and spectral spaces. The new method (LWR2) is compared with the ordinary locally weighted regression method (LWR), another modified LWR method (LWR1), and the linear calibration methods, principal component regression (PCR) and partial least squares (PLS). A simulation was conducted to study how noise in chemical and spectral spaces affects the predictive ability and stability of the LWR2 method with respect to the original LWR. The simulation showed that the LWR2 method is more robust and maintains good predictive ability in the presence of noise in both chemical and spectral spaces. Three near-infrared transmittance (NIT) data sets for food products and one Taguchi gas sensor array data set were further used to test LWR2. The first three data sets were based on measurements of diffuse NIT for water concentrations in 103 meat samples, fat concentrations in 100 homogenized beef samples, and temperatures in 94 homogenized beef samples, respectively. The last data set was based on measurements of eight Taguchi gas sensors for two-component mixtures of toluene and benzene in 100 samples. LWR2 produced up to a 52% improvement in terms of prediction errors as compared to ordinary LWR.

The most commonly used calibration methods are linear in the sense that they produce models that are linear functions of the analytical measurements (X). In other words, the dependent variable y can be obtained from a linear combination of independent variables X according to eq 1,

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e} \tag{1}$$

where $\mathbf{b}$ is the regression vector determined in the calibration step and $\mathbf{e}$ is an error vector. A nonlinear data set is therefore one in which a nonlinear relation exists between y and X. Throughout this paper, scalars are represented by italic lower case letters, e.g., $y$. Column vectors are represented by boldface lower case letters, e.g., $\mathbf{x}$, and row vectors are represented as the column vectors transposed, e.g., $\mathbf{x}^{\mathrm{T}}$. Matrices are represented by boldface upper case letters, e.g., $\mathbf{X}$.

Methods like principal component regression (PCR), partial least squares regression (PLS), and multiple linear regression (MLR) are well-established and have proved useful in many applications (1-8). However, these methods are based on statistical linear models and therefore cannot efficiently model a nonlinear relationship between y and X. PCR and PLS can to a certain extent model nonlinearities by including additional factors in the model, but this may not be satisfactory for two reasons. First, the model is not parsimonious and is prone to overfitting, which can reduce its predictive power. Second, outlier detection can become difficult. To illustrate this, an example of a simple nonlinear system is shown in Figure 1. X is projected into the space of the first two principal components (PCs). Two PCs are necessary in this case to describe the data, even though the data lie on a curved line. Therefore, any new data point that lies in the plane described by the two PCs (○ in the figure) will not be identified as an outlier, even if the point does not fall on the curve.

[Figure 1. Example of nonlinear data (×) and an outlier (○), plotted as PC 1 versus PC 2. Two principal components are necessary to describe the data. The outlier is not detected since it lies in the first two principal component space.]

With increasing numbers of data sets with nonlinear relationships between y and X, nonlinear calibration methods have to be applied in order to find better calibration models. Only recently have methods for multivariate nonlinear calibration been introduced into the chemical literature. These methods include data preprocessing with so-called multiplicative scatter correction (MSC) (9, 10), projection pursuit regression (PPR) (11, 12), alternating conditional expectations (ACE) (13, 14), multivariate adaptive regression splines (MARS) (15, 16), neural networks (NN) (17-19), and a linear approximation method called locally weighted regression (LWR) (20-23). It should be noted that neither MSC nor LWR is a nonlinear method. MSC attempts to handle the nonlinear problem by preprocessing the data. LWR is based on local linear models that collectively can be used for nonlinear relations. A comparison study (14) has recently compared the performance of the above methods, except for MSC, on six nonlinear data sets. That study showed that when nonlinearities were present in the data sets, models with good predictive ability could be constructed with the above methods. As a simple linear approximation technique, LWR could adequately model the nonlinear relation between y and X. Compared with the other nonlinear methods, LWR had an advantage in terms of the simplicity of the model constructed, as only two parameters need to be optimized in the calibration step.

LWR was first proposed by Cleveland and Devlin (20) as a curve fitting method. It was published as a linear method for estimating nonlinear regression surfaces when limited information is available about the surface. LWR is based on the assumption that a continuous and smooth nonlinear function can be locally approximated with piecewise linear functions. Therefore, LWR is in fact a piecewise linear method. Since Cleveland and Devlin (20) gave no clear description of how to use the method in the prediction of unknown samples, prediction with locally weighted regression was introduced into analytical chemistry by Naes et al. (21)

One critical step in LWR is to select the local calibration samples by distance measures to the sample to be predicted. Euclidean distance was originally used by Cleveland and Devlin (20). Due to the high dimensionality, collinearity, and noise in chemical data, the Mahalanobis distance in the PC space was proposed by Naes et al. (21) to determine the closeness between the prediction sample and the calibration samples. Another modified distance measure was proposed by Naes and Isaksson (22). The strategy of that measure was to weight the principal components according to their predictive ability in a preliminary PCR prior to the Mahalanobis distance measure. Then, the distances in the weighted PC space (pseudo Mahalanobis distance) were calculated. LWR with weighted principal components (called LWR1 in the following) was shown to give better prediction than LWR.

One drawback of selecting local calibration samples based on the distance measures in both LWR and LWR1 is that distances are calculated only with regard to closeness to the prediction sample in the space of the independent variables (X) (called spectral space in the following). While the local calibration samples selected are close to the prediction sample in spectral space, they may not necessarily be close in the space of the dependent variables (y) (called chemical space in the following). One example of homogenized meat samples studied by Naes and Isaksson (22) is shown in Figure 2. In the figure, one sample to be predicted (indicated by ○) gave a higher prediction error for water concentration than the rest of the prediction samples. With 25 local calibration samples in the first three principal component space, the error was 2.87 wt % water compared to the overall average error of 0.60 wt %. The 10 closest local calibration samples (indicated by ×) selected from 70 calibration samples (indicated by ●) based on the Mahalanobis distance are shown in chemical space. Obviously, these samples are scattered over a large area. As pointed out (22), such a result could indicate a weakness in LWR for these data since the concept of local linearity is supposed to be more strongly related to closeness in chemical composition.

[Figure 2. The 10 calibration samples (×) geometrically closest to one prediction sample (○) in the first three principal component space (Mahalanobis distance) presented in part of the protein, fat, and water triangle. Dots illustrate the rest of the calibration samples.]

Like PCR, LWR uses a local inverse calibration method rather than the classical method. Comparisons of the inverse and classical methods have been studied in depth by Hoadley (24) and Naes (25). These studies show that the inverse method is the best for the prediction of y values (concentrations, for example) in a region near the center of the y values in the calibration samples. In addition, the inverse method is best suited for interpolation, while the classical method is the best for extrapolation. This requires LWR to select the calibration samples whose y values are close to the y value of the prediction sample. In other words, the y value of the prediction sample should be close to the center (mean) of the y values in the calibration samples. The above example shows that selecting the calibration samples on the basis of their closeness to the prediction samples in the spectral space does not necessarily mean they are close to the prediction samples in the chemical space. From an empirical point of view, Naes and Isaksson (26) have studied the effect of using different types of calibration designs for near-infrared data analysis. One of the conclusions is that if, for example, better prediction is desired in the central region where the prediction samples are, this should be reflected in the experimental design by addition of calibration samples near the center. This is counter to the arguments used for strictly linear modeling that dictate the use of high leverage samples.

To illustrate this point, one simple example is given in Figure 3 based on a two-component mixture system with one analyte (A) and one interferent (I). Analyte A and interferent I have their unique absorbance peaks at wavelengths $\lambda_1$ and $\lambda_2$, respectively. Figure 3a shows the pure spectra of analyte A and interferent I, three mixture spectra corresponding to three calibration samples, 1-3, and one mixture spectrum corresponding to the prediction sample, p. The concentrations of the analyte and interferent in the four samples are presented in Figure 3b. In Figure 3c, the Mahalanobis distances in the first two PC space are shown. The Mahalanobis distances between sample p and samples 1-3 are 0.34, 0.67, and 0.70, respectively. Assume in this case that the ordinary LWR needs two local calibration samples to be included in the local PCR model with the first two PCs. Obviously, samples 1 and 2 will be selected on the basis of their Mahalanobis distances to sample p. Since the concentrations of analyte A need to be calibrated, the regression vector $\mathbf{b}$ from PCR should represent a direction that is uniquely related to the concentration of analyte A and orthogonal to any other variation. In addition, the projection of the spectrum of the prediction sample onto $\mathbf{b}$ must be directly proportional to the concentration of analyte A in the prediction sample. Samples 1 and 2 contain the same concentration of interferent I, i.e., 4, but different concentrations of analyte A, i.e., 2 and 3, respectively. The regression vector $\mathbf{b}_{12}$ obtained by PCR from spectra 1 and 2 is shown in Figure 3d. In Figure 3d, the spectra for samples 1-3 and p are represented by the corresponding vectors. The vertical vectors (indicated in Figure 3d by dashed arrows) represent the concentration variations of interferent I in the samples. It should be noted that this figure is based on the actual calculation. However, since it is not drawn to scale, it is used for illustration purposes only. Since the only variation in spectra 1 and 2 is due to the concentration variation of analyte A, $\mathbf{b}_{12}$ represents the direction of that variation. However, since there is no concentration variation from the interferent I in samples 1 and 2, $\mathbf{b}_{12}$ is not able to find the optimum direction that is orthogonal to the variation of the concentration of the interferent I. Obviously, an error occurs in the prediction of the concentration of analyte A in sample p when the vector p is projected onto the regression vector $\mathbf{b}_{12}$. The true error is 0.56.

Notice from Figure 3b that samples 1 and 3 are closer to sample p in concentration of analyte A than sample 2, which is selected by LWR. If the sum of the distances in chemical space (with respect to analyte A) and in spectral space is used as the distance measure, samples 1 and 3 will be selected as the local calibration samples rather than samples 1 and 2. The regression vector $\mathbf{b}_{13}$ is obtained from spectra 1 and 3 by PCR. The vector $\mathbf{b}_{13}$ represents a direction that is orthogonal to the variation of the interferent concentrations present in samples 1 and 3. The correct concentration of analyte A in sample p is obtained by projecting the vector p onto $\mathbf{b}_{13}$. The example shows the weakness of LWR in selecting calibration samples. The analyte concentrations in calibration samples selected by LWR could be far from those of the prediction samples in chemical space, and not enough variation of interferent concentrations may be present in the calibration samples. In this case, a method that selects calibration samples on the basis of their closeness to the prediction sample in both chemical and spectral spaces has a significant advantage over an ordinary LWR method.

Robustness is another important reason for incorporating the chemical information into the distance measure in LWR. Figure 4 gives an illustration. In Figure 4, the concentration y is nonlinearly related to the response X. The straight line b represents a regression vector obtained from a linear method. A concentration outlier (indicated by Δ) is present and may not appear to be an outlier to the regression vector b. The two vertical dashed lines represent a spectral window chosen by LWR for a prediction sample p whose response is $x_p$. Obviously, the outlier is included in the local calibration model. However, assume that a good initial concentration estimate for sample p can be obtained. A window on the chemical (concentration) axis can then be set such that the outlier is either excluded from the model or assigned a less significant weight than in the LWR method, due to its large distance to the prediction sample in chemical space. Therefore, a more robust LWR could be achieved by incorporating the chemical information into the distance measure.

This paper presents a new approach for distance measurement in locally weighted regression (called LWR2 in the following) by balancing the information in both chemical and spectral spaces. The idea is that distances between prediction samples and calibration samples should be small in both chemical and spectral spaces. Here, the 2 in the abbreviation LWR2 has a dual meaning. First, it stands for a new method. Second, it indicates that the calibration samples are chosen on the basis of closeness in two spaces, chemical and spectral. The LWR2 method is tested with three near-infrared transmittance (NIT) data sets and one Taguchi gas sensor array data set against PCR, PLS, LWR, and LWR1. Simulations are also conducted to study how noise in both chemical and spectral spaces affects the predictive ability and stability of the new approach with respect to ordinary LWR.

(1) Pell, R. J.; Erickson, B. C.; Hannah, R. W.; Callis, J. B.; Kowalski, B. R. Anal. Chem. 1988, 60, 2824.
(2) Haaland, D.; Thomas, E. Anal. Chem. 1988, 60, 1202.
(3) Haaland, D. Anal. Chem. 1988, 60, 1208.
(4) Carey, W. P.; Wangen, B. C.; Dyke, J. T. Anal. Chem. 1989, 61, 1667.
(5) Miller, C.; Eichinger, B. E. Appl. Spectrosc. 1990, 44, 496.
(6) Wise, B. M.; Ricker, N. L.; Veltkamp, D. F.; Kowalski, B. R. Proc. Control Qual. 1990, 1 (1), 41.
(7) Lindberg, W.; Clark, G. D.; Hanna, C. P.; Whitman, D. A.; Christian, G. D.; Ruzicka, J. Anal. Chem. 1990, 62, 849.
(8) Lacy, N.; Christian, G. D.; Ruzicka, J. Anal. Chem. 1990, 62, 1482.
(9) Geladi, P.; McDougall, D.; Martens, H. Appl. Spectrosc. 1985, 39, 491.
(10) Isaksson, T.; Naes, T. Appl. Spectrosc. 1988, 42, 1273.
(11) Friedman, J. H.; Stuetzle, W. J. Am. Stat. Assoc. 1981, 76, 817.
(12) Beebe, K. R.; Kowalski, B. R. Anal. Chem. 1988, 60, 2273.
(13) Breiman, L.; Friedman, J. H. J. Am. Stat. Assoc. 1985, 80, 580.
(14) Sekulic, S.; Seasholtz, M. B.; Wang, Z.; Kowalski, B. R.; Lee, S.; Holt, B. Nonlinear Multivariate Calibration Method. Anal. Chem. 1993, 65, 835A.
(15) Friedman, J. Ann. Stat. 1991, 19, 1.
(16) Sekulic, S.; Kowalski, B. R. J. Chemom. 1992, 6, 199.
(17) Long, J. R.; Gregoriou, V. G.; Gemperline, P. J. Anal. Chem. 1990, 62, 1791.
(18) Gemperline, P. J.; Long, J. R.; Gregoriou, V. G. Anal. Chem. 1991, 63, 2313.
(19) Wang, Z.; Kowalski, B. R. Dynamically Capacity Allocating Neural Network in Continuous Calibration. Submitted for publication in J. Chemom.
(20) Cleveland, W. S.; Devlin, S. J. J. Am. Stat. Assoc. 1988, 83, 596.
(21) Naes, T.; Isaksson, T.; Kowalski, B. R. Anal. Chem. 1990, 62, 668.
(22) Naes, T.; Isaksson, T. Appl. Spectrosc. 1992, 46, 34.
(23) Hammond, S. V. Making Light Work: Advances in Near Infrared Spectroscopy; Ian Michael Publications: Chichester, UK, 1992; pp 584-589.
(24) Hoadley, B. J. Am. Stat. Assoc. 1970, 65, 356.
(25) Naes, T. Biometrical J. 1985, 27, 265.
(26) Naes, T.; Isaksson, T. Appl. Spectrosc. 1989, 43, 328.

† On leave from Matforsk, Norwegian Food Research Institute, Oslovein 1, 1430 As, Norway.

0003-2700/94/0366-0249$04.50/0 © 1994 American Chemical Society

Analytical Chemistry, Vol. 66, No. 2, January 15, 1994

THEORY

In this section, the distance measures in the LWR, LWR1, and LWR2 methods are discussed.

[Figure 3. (a) Pure spectra of analyte A and interferent I. Spectra of calibration samples 1-3 and prediction sample p. (b) The concentrations of analyte A and interferent I in samples 1-3 and p. (c) Score plot of samples 1-3 and p in the first two principal component space. The numbers on the line indicate the Mahalanobis distances between sample p and samples 1-3. (d) Samples 1-3 and p are represented in vector space. The vertical dashed arrows represent the variations of the concentrations of interferent I for the corresponding samples. Different linear models were obtained by choosing different calibration samples. $\mathbf{b}_{12}$ is obtained from calibration samples 1 and 2. $\mathbf{b}_{13}$ is obtained from calibration samples 1 and 3.]

LWR Regression Steps. The basic steps are the same among the LWR, LWR1, and LWR2 methods and are outlined as follows (for further details, see ref 21). For each sample to be predicted:
(1) Identify and select the N calibration samples that are closest to the prediction sample by measuring their distances from the prediction sample. The distance measures used in LWR, LWR1, and LWR2 are different and are discussed below.
(2) Perform a weighted linear PCR based on the N calibration samples in the M principal component space.
(3) Predict the constituents in the prediction sample based on the weighted local PCR model.

There are two parameters in the procedure that need to be determined in the calibration step, namely, N, the number of calibration samples to be selected for the local model, and M, the number of PCs to be used. Both can be determined by previous knowledge, cross validation, or a separate prediction set. A cubic weight function was suggested by Cleveland and Devlin (20) and was used by Naes et al. (21) Each calibration sample is assigned a weight by the weight function. The weight function is defined as

$$w_i(\mathbf{x}_p) = W(u) \tag{2a}$$

where

$$W(u) = \begin{cases} (1 - u^3)^3, & 0 \le u \le 1 \\ 0, & \text{otherwise} \end{cases} \qquad \text{and} \qquad u = \frac{\rho(\mathbf{x}_p, \mathbf{x}_i)}{d(\mathbf{x}_p)} \tag{2b}$$

$w_i(\mathbf{x}_p)$ is the weight associated with the ith calibration sample in the weighted regression for the prediction of sample p. $\rho(\mathbf{x}_p, \mathbf{x}_i)$ is the distance between the prediction sample p and the ith calibration sample. $d(\mathbf{x}_p)$ is the largest $\rho(\mathbf{x}_p, \mathbf{x}_i)$ over the N samples in the local calibration, which can be mathematically expressed as

$$d(\mathbf{x}_p) = \max_{i = 1, \ldots, N} \rho(\mathbf{x}_p, \mathbf{x}_i) \tag{3}$$

Obviously, the closer the calibration sample i is to the prediction sample p, the larger the weight $w_i(\mathbf{x}_p)$ is, and vice versa. Here, the cubic weight function simulates a normal distribution of the local calibration samples. Other weight functions, such as the multivariate Gaussian function, can also be used.

[Figure 4. Illustration of the advantage of LWR2 over LWR and linear calibration methods in terms of outlier detection: ×, calibration sample; ○, prediction sample; Δ, concentration outlier. The plot shows concentration y versus response X with the linear regression vector b.]

Distance Measure in LWR. The Euclidean distance proposed by Cleveland and Devlin (20) was modified to the Mahalanobis distance by Naes et al. (21) The Mahalanobis distance in PC space is defined as

$$\rho(\mathbf{x}_p, \mathbf{x}_i) = \left[ (\mathbf{x}_i - \mathbf{x}_p)^{\mathrm{T}} \, \mathbf{X}(M)^{+} \, (\mathbf{x}_i - \mathbf{x}_p) \right]^{1/2} \tag{4}$$

where $\mathbf{X}(M)$ is the covariance matrix of the complete data set, M denotes the number of PCs used, and $^{+}$ denotes the generalized inverse. $\mathbf{x}_i$ is the spectrum of the calibration sample i. $\mathbf{x}_p$ is the spectrum of the prediction sample p. $\rho(\mathbf{x}_p, \mathbf{x}_i)$ is the distance between the calibration sample i and the prediction sample p in the M principal component space.

Distance Measure in LWR1. The Mahalanobis distance in LWR was modified by Naes and Isaksson (22). The strategy was to weight the principal components according to their predictive ability in the linear PCR. The distance is the one in the weighted principal component space, a pseudo Mahalanobis distance. The differences between RMSEPs (defined later) for the different principal components in PCR are used as weighting coefficients in the distance measure $\rho$. In mathematical terms, $\rho$ is defined by

$$\rho(\mathbf{x}_p, \mathbf{x}_i) = \left[ \sum_{m=1}^{M} \rho(t_{pm}, t_{im})^2 \left( \mathrm{RMSEP}_{m-1}^2 - \mathrm{RMSEP}_m^2 \right) \right]^{1/2} \tag{5}$$

where $\mathrm{RMSEP}_m^2$ is the square of the root mean square error of prediction for PCR using m principal components. $\rho(t_{pm}, t_{im})$ is the distance between the scores of prediction sample p and calibration sample i for PC m, divided by the square root of the mth eigenvalue. If $(\mathrm{RMSEP}_{m-1}^2 - \mathrm{RMSEP}_m^2)$ is negative, it is replaced by 0.

Distance Measure in LWR2. It is clearly demonstrated by the examples in the Introduction that local calibration samples should be close to the prediction samples in both chemical and spectral spaces. Therefore, the proposed distance measure is a weighted average between the Mahalanobis distance in the spectral space and the distance in the chemical space. Preliminary concentration estimates for the prediction samples can be obtained by methods like PCR, PLS, or LWR. In the present work, the initial estimates are obtained by the PCR method. With the initial concentration estimate $\hat{y}_p$ for the prediction sample p, the proposed distance measure is

$$\rho_{ip} = \alpha \rho^{Y}_{ip} + \beta \rho^{X}_{ip} \tag{6}$$

where $\rho_{ip}$ is the distance between the calibration sample i and the prediction sample p. $\rho^{X}_{ip}$ is the Mahalanobis distance in the spectral space defined by eq 4.

$\rho^{Y}_{ip}$ is the distance in the chemical space defined by

$$\rho^{Y}_{ip} = \frac{\left| y_i - \hat{y}_{p(\mathrm{PCR})} \right|}{\delta_y} \tag{6a}$$

$\hat{y}_{p(\mathrm{PCR})}$ is the initial concentration estimate for the prediction sample p from PCR. $\delta_y$ is introduced to make $\rho^{Y}$ unitless and between 0 and 1 like $\rho^{X}$. In the present paper, $\delta_y$ is defined as

[eq 6b is missing from the source text]

Parameters $\alpha$ and $\beta$ are weighting coefficients that weight the contributions of the two distance measures. $\alpha$ and $\beta$ are constrained to $\alpha + \beta = 1$. Notice that if $\alpha = 0$, $\rho_{ip}$ is the distance in the spectral space used by the LWR method. Therefore, the distance measure in the LWR method is a special case of the distance $\rho_{ip}$.
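As a minimal sketch of how eqs 2, 3, and 6 fit together, the following Python fragment combines precomputed chemical and spectral distances and then selects and weights the local calibration samples. The function names are hypothetical, the spectral distances are the Mahalanobis distances quoted for the Figure 3 example, and the chemical distances are invented for illustration only (they are not values from the paper).

```python
def cubic_weight(u):
    """Cubic weight function W(u) of eq 2: (1 - u^3)^3 on [0, 1], else 0."""
    return (1.0 - u ** 3) ** 3 if 0.0 <= u <= 1.0 else 0.0


def combined_distances(rho_x, rho_y, alpha):
    """Eq 6 with beta = 1 - alpha: rho_ip = alpha * rho_Y + beta * rho_X."""
    beta = 1.0 - alpha
    return [alpha * ry + beta * rx for rx, ry in zip(rho_x, rho_y)]


def select_local_samples(rho, n_local):
    """Pick the n_local calibration samples with the smallest distance and
    return their indices together with their cubic weights (eqs 2 and 3)."""
    order = sorted(range(len(rho)), key=lambda i: rho[i])[:n_local]
    d = max(rho[i] for i in order)  # eq 3: largest distance among the selected
    weights = [cubic_weight(rho[i] / d) if d > 0 else 1.0 for i in order]
    return order, weights


# Spectral distances of calibration samples 1-3 to sample p (Figure 3 example).
rho_x = [0.34, 0.67, 0.70]
# Hypothetical, already scaled chemical distances (assumption, for illustration).
rho_y = [0.2, 1.0, 0.3]

# alpha = 0 reduces to the ordinary LWR choice: samples 1 and 2 (indices 0, 1).
print(select_local_samples(combined_distances(rho_x, rho_y, 0.0), 2))
# alpha = 0.5 brings in chemical closeness: samples 1 and 3 (indices 0, 2).
print(select_local_samples(combined_distances(rho_x, rho_y, 0.5), 2))
```

Because $d(\mathbf{x}_p)$ in eq 3 is the largest distance among the selected samples, the farthest selected sample always receives a weight of zero in this sketch.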

Iterative Procedure in LWR2. As described under the distance measure in LWR2, the LWR2 procedure first estimates a concentration for the prediction sample by PCR. The estimated concentration is then used in the distance measure, as in eq 6a, to select the local calibration samples and to obtain the first LWR2 concentration estimates for the prediction samples. Since the initial PCR concentration estimates for the prediction samples may not be accurate enough due to nonlinearity in the data sets, the first concentration estimates by LWR2 can be used again in the distance measure to get the next concentration estimates, and so on. The estimate for $\rho^{Y}_{ip,j}$ can be obtained from

$$\rho^{Y}_{ip,j} = \frac{\left| y_i - \hat{y}_{p,j-1} \right|}{\delta_y} \tag{7}$$

where j indicates the number of iterations and $\hat{y}_{p,0}$ is the initial PCR estimate. Such a procedure is repeated until convergence, i.e., until the concentration estimates no longer change within defined limits. Therefore, j = 1, 2, 3, ..., J, where J is the number of iterations at which the method has converged. Compared with LWR, the iterative LWR2 has two more parameters to be optimized, i.e., $\alpha$ in eq 6 and $J_{\mathrm{opt}}$, the optimum number of iterations. These parameters in the iterative LWR2 can be obtained in the same way as in LWR, by cross validation or a separate prediction set. However, such an optimization procedure is computationally intensive for the iterative LWR2 method. In the present work, the parameters N, M, and $\alpha$ are obtained from a separate prediction set for the minimum RMSEP value (defined in the following section) after the first iteration (j = 1) by LWR2 and fixed afterward. Then, the LWR2 method is implemented iteratively under the fixed parameters N, M, and $\alpha$. $J_{\mathrm{opt}}$ is obtained from the minimum RMSEP values for j = 1, 2, 3, ..., J. Such an iterative procedure was tested for the four real data sets studied and worked well consistently.
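The iteration described above can be sketched for a deliberately simplified case. The Python fragment below assumes a single response variable, so that the local PCR collapses to a weighted straight-line fit and the spectral distance to an absolute difference; both distances are scaled by their largest value, which is an assumption made here and not necessarily the paper's definition of $\delta_y$. The names `weighted_line_fit` and `lwr2_predict` are hypothetical, and stopping on a tolerance stands in for the "defined limits" mentioned in the text.

```python
def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares straight line; returns (intercept, slope)."""
    sw = sum(ws)
    if sw == 0:  # degenerate case: fall back to equal weights
        ws, sw = [1.0] * len(xs), float(len(xs))
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my - slope * mx, slope


def lwr2_predict(x_cal, y_cal, x_p, y_init, alpha, n_local,
                 max_iter=20, tol=1e-6):
    """Iterate the LWR2 estimate for one prediction sample (univariate sketch).
    Returns the final estimate and the number of iterations performed."""
    x_scale = max(abs(x - x_p) for x in x_cal) or 1.0
    y_hat = y_init
    for j in range(1, max_iter + 1):
        # Chemical distance uses the estimate from the previous iteration.
        y_scale = max(abs(y - y_hat) for y in y_cal) or 1.0
        rho = [alpha * abs(y - y_hat) / y_scale
               + (1.0 - alpha) * abs(x - x_p) / x_scale
               for x, y in zip(x_cal, y_cal)]
        idx = sorted(range(len(rho)), key=lambda i: rho[i])[:n_local]
        d = max(rho[i] for i in idx) or 1.0
        ws = [(1.0 - (rho[i] / d) ** 3) ** 3 for i in idx]  # cubic weights
        a, b = weighted_line_fit([x_cal[i] for i in idx],
                                 [y_cal[i] for i in idx], ws)
        y_new = a + b * x_p
        if abs(y_new - y_hat) < tol:
            return y_new, j
        y_hat = y_new
    return y_hat, max_iter


# Hypothetical nonlinear calibration data: y = x^2 on a regular grid.
x_cal = [i / 10 for i in range(21)]
y_cal = [x * x for x in x_cal]
x_p = 1.05

# A global straight-line fit plays the role of the initial PCR estimate.
a0, b0 = weighted_line_fit(x_cal, y_cal, [1.0] * len(x_cal))
y_init = a0 + b0 * x_p
y_hat, n_iter = lwr2_predict(x_cal, y_cal, x_p, y_init, alpha=0.3, n_local=6)
print(y_init, y_hat, n_iter)
```

In this toy case the global linear estimate is far from the true value $1.05^2$, while the local estimate after iteration is close to it; the paper itself selects the number of iterations from the minimum RMSEP rather than from a tolerance.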

[Figure 5. Three pure spectra for glucose, casein, and lactate used for the simulation of a three-component mixture, plotted against wavelength.]

Method Comparison. In order to compare the predictive ability of the methods, independent test data sets were used. The prediction errors were calculated as the root mean square error of prediction (RMSEP),

$$\mathrm{RMSEP} = \left[ \frac{1}{I_p} \sum_{i=1}^{I_p} \left( \hat{y}_{pi} - y_{pi} \right)^2 \right]^{1/2} \tag{8}$$

where $I_p$ denotes the number of prediction samples in the test set, $y_{pi}$ denotes the true chemically measured reference value (concentration, for example), and $\hat{y}_{pi}$ denotes the predicted reference value (concentration) from the calibration method. The prediction improvement of method 2 over method 1 is calculated as

$$\%\ \text{improvement} = \frac{\mathrm{RMSEP}_1 - \mathrm{RMSEP}_2}{\mathrm{RMSEP}_1} \times 100 \tag{9}$$
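Eqs 8 and 9 translate directly into code. The following is a small sketch with hypothetical function names; the two RMSEP values in the usage lines are the noiseless-case results reported for LWR and LWR2 in Tables 1 and 2.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean square error of prediction over the test set (eq 8)."""
    n = len(y_true)
    return math.sqrt(sum((yp - yt) ** 2 for yt, yp in zip(y_true, y_pred)) / n)


def percent_improvement(rmsep_1, rmsep_2):
    """Improvement of method 2 over method 1 in percent (eq 9)."""
    return (rmsep_1 - rmsep_2) / rmsep_1 * 100.0


# Noiseless simulated data: RMSEP 1.49 for LWR (method 1), 0.64 for LWR2.
print(round(percent_improvement(1.49, 0.64), 1))
```

The printed value agrees with the 57% improvement quoted for the noiseless case.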

Software and Hardware. All method computations were performed using MATLAB (The Mathworks, Inc., Sherborn, MA). All graphics presented were also generated using MATLAB and subsequently imported into MacDraw Pro (Claris Corp., Santa Clara, CA) in order to add labels and annotations. All software was implemented on a DEC 3100 workstation (Digital Equipment Corp., Maynard, MA). The LWR2 algorithm took about 10 times as long as LWR to obtain the model parameters, namely, the number of principal components (M), the number of calibration samples (N), and the weighting coefficient $\alpha$. However, in the prediction step, both algorithms took approximately the same amount of time.

COMPUTER SIMULATION

The fact that LWR2 takes advantage of the information in both chemical (y) and spectral (X) spaces raises pertinent questions. How does the noise in y and X affect the predictive ability of the method? Can LWR2 perform well with a high level of noise? How does the parameter $\alpha$ adjust the amount of the contribution from each space at different noise levels? In order to answer these questions, a simulation was conducted by adding different levels of noise to concentrations and spectra to study the stability of LWR2 as compared to LWR.

In order for the simulation to be realistic, pure spectra for casein, glucose, and lactate were taken from a data set studied by Naes et al. (21), which are shown in Figure 5. A mixture design with closure between three constituents, as suggested by Naes et al. (21), is shown in Figure 6. The open circles on the vertices of the triangle are the 66 calibration samples. The dots are the 55 prediction samples. The mixture spectra for both calibration and prediction samples were obtained by

$$\mathbf{x} = y_c^{\,p_c}\,\mathbf{x}_c + y_g^{\,p_g}\,\mathbf{x}_g + y_l^{\,p_l}\,\mathbf{x}_l \tag{10}$$

where $y_c$, $y_g$, and $y_l$ are the concentrations of casein, glucose, and lactate, respectively, in the sample, and $y_c + y_g + y_l = 100$ due to the closure constraint. $\mathbf{x}_c$, $\mathbf{x}_g$, and $\mathbf{x}_l$ are the pure spectra for casein, glucose, and lactate, respectively. The power terms in the equation (written here as $p_c$, $p_g$, and $p_l$; their numerical values are not legible in the source text) were used to simulate the nonlinear relation between the measurements or responses (x) and the concentrations (y). From eq 10, the highest nonlinear relation exists between x and $y_l$. Therefore, the calibration was only conducted for the prediction of lactate concentration in the simulation.

[Figure 6. Experimental design for the simulated data. The calibration (○) and prediction (●) samples are presented in the casein, lactate, and glucose triangle (sum of constituents is 100).]

The scores for the prediction samples were computed from the eigenvectors obtained from the calibration samples. The scores of the first two principal components for both calibration and prediction samples are plotted in Figure 7 (the calibration samples are denoted by ○ and the prediction samples are denoted by ●). Figure 7 shows that the main information about chemical concentrations is included adequately in the space of these two principal components, but in a nonlinear fashion. Therefore, a good nonlinear method should be able to give a good prediction using only two principal components.

[Figure 7. Score plot of the first two principal components from the simulated data (noiseless case). The calibration samples are denoted by ○ while the prediction samples are denoted by ●. The obvious nonlinear relation between chemical concentrations (y) and the first two principal components is observed.]

Table 1. RMSEP Results from LWR for Simulated Data (a)

noise in y_l (A)   noise in X (R)
                   0%          5%          10%         15%         25%
0                  1.49 (10)   1.74 (10)   3.10 (20)   4.10 (50)   4.69 (30)
1                  1.67 (10)   2.49 (10)   3.17 (20)   4.02 (30)   4.83 (30)
2                  2.42 (10)   3.39 (10)   3.31 (20)   4.07 (20)   5.15 (20)
3                  3.31 (20)   3.56 (10)   3.73 (20)   4.48 (30)   7.63 (60)

(a) The number in each parenthesis indicates the number of local calibration samples (N) by which the RMSEP results were obtained. All the results were obtained using two principal components (M).

In order to compare the stability and predictive ability of LWR and LWR2, different noise levels were added to the concentrations and spectra. Absolute noise was added to concentrations in both calibration and prediction samples according to eq 11,

$$y_l = y_l + A \cdot \mathrm{noise}(0, 1) \tag{11}$$

where noise(0,1) is normally distributed random noise with $\mu = 0$ and $\sigma = 1$. A is the level of the noise added to the concentration $y_l$. A = 0, 1, 2, and 3 was used in the simulation. Relative noise was added to each spectrum (x) of the calibration and prediction samples according to eq 12, where $x_i$ indicates the spectral value at the ith wavelength.

$$x_i = x_i + R \, x_i \cdot \mathrm{noise}(0, 1) \tag{12}$$

R indicates the percentage noise added to the spectrum. R = 0%, 5%, lo%, 15%, and 25% was used in the simulation. It should be noted that the highest noise levels for y~and X were simulated only for method comparison. Realistically, reference methods or instrument measurements are seldom so inaccurate. A total of 20 data sets were simulated with different spectra and concentration noise levels. As indicated in the above discussion, since the relevant information in the spectral space is included in the first two principal components, the LWR and LWR2 methods are forced to use only the first two principal components. The RMSEP results for each data set are reported in Table 1 for the LWR method and Table 2 for theLWR2method. It should benoted that theiterativeLWR2 was not applied and the RMSEP results for Table 2 were obtained after the first iteration (j= 1). Each result in Table 1 is the best among the RMSEP results computed for N = 10, 20, 30, 40, and 50. Similarly, each result in Table 2 is the best among the RMSEP results computed for N = 10,20, 30, 40, and 50 and a = 0.1, 0.2, 0.3, ... 1.0. From Tables 1 and 2, the RMSEP results generally increase for both methods as noise increases except in two cases. In Tables 1 and 2, at A = 2, R = 5%, the RMSEP results are decreased when noise in X is increased from 5% to 10%. In Table 1, at A = 0, R = 1576, the RMSEP result is decreased when noise in yl is increased from 0 to 1. However, the apparent decrease is not statistically significant. In the noiseless case, LWR2 gives a 57% improvement over LWR, which indicates Analytical Chemistry, Vol. 86, No. 2,January 15, 1994


Table 2. RMSEP Results from LWR2 for Simulated Data(a)

noise in                                          noise in X (R)
y_l (A)   0%                      5%                      10%                     15%                     25%
0         0.64 (10, 0.3) (57.0)   1.28 (10, 0.3) (26.4)   2.83 (20, 0.3) (8.7)    3.05 (30, 0.5) (25.6)   4.63 (20, 0.2) (1.3)
1         1.12 (10, 0.3) (32.9)   1.82 (10, 0.3) (26.9)   2.88 (20, 0.3) (9.1)    3.79 (40, 0.4) (5.7)    4.83 (30, 0) (0)
2         2.32 (10, 0.2) (4.1)    3.28 (10, 0.2) (3.2)    3.22 (30, 0.2) (2.7)    3.93 (40, 0.4) (3.4)    5.15 (20, 0) (0)
3         3.22 (20, 0.2) (2.7)    3.56 (10, 0.0) (0)      3.73 (20, 0.0) (0)      4.45 (30, 0.2) (0.7)    7.63 (60, 0) (0)

(a) The numbers in the first parentheses indicate the number of local calibration samples (N) and the α value for which the RMSEP result was obtained. All the results were obtained using two principal components (M) and after the first iteration (j = 1). The number in the second parentheses indicates the percentage improvement over LWR.

the calibration samples selected by LWR2 have a better linear relation between y_l and the first two PCs. At the first two noise levels in y_l (A = 0 and 1), LWR2 gives a significant improvement over LWR even as the noise in X increases. When the noise in y_l is very high, as at the last two noise levels (A = 2 and 3), the concentration information still gives a slight improvement in some cases. The most important aspect to notice is the α values. As can be seen, they decrease as the noise in y_l increases. From eq 6, this means that the distance measure in LWR2 puts more emphasis on the distance in spectral space. As a result, the LWR2 method gradually approaches the LWR method as the noise in y_l increases. LWR2 was equal to the LWR method, with α = 0, when A = 3 and R = 5%, 10%, and 25%. When A = 0 and 1, the α values increased at R = 15%, which indicates that the distance measure puts more emphasis on the distance in chemical space. However, when the noise in X became too high (R = 25%), the α value decreased to 0.2 at A = 0 and to zero at A = 1, 2, and 3. This shows that the distance measure in chemical space is limited by the fact that the information capacities of y_l and x are different. y_l is a concentration value, a single-number measurement, which is sensitive to noise. x, on the other hand, is a multichannel measurement. In the calibration step, principal component analysis was used to project X into a principal component space, and only the first two principal components were included in the model. Since the first two principal components preserve most of the variance in the data and are not sensitive to the noise, the information from the spectral space has a multivariate advantage over that from the chemical space. This is why the contribution from the spectral space to the distance measure in LWR2 is more than 50% most of the time; namely, α is less than 0.5.
This was also shown to be the case in the following four real data sets, where the highest optimal α value was 0.4. The contribution from the chemical space to the distance measure still decreases at high noise levels in the spectral space and low noise levels in the chemical space because of noise propagation from the spectral space through PCR to the concentration estimates. Since the iterative LWR2 was not applied, the distance measure in the chemical space depended solely on the initial concentration estimates by PCR. However, the accuracy of these initial concentration estimates became poorer as the noise in the spectral space increased, which might hinder the contribution from the chemical space to the distance measure. The simulation shows that, with only two principal components, the LWR2 method is very stable and maintains good predictive ability compared to the LWR method in the

Analytical Chemistry, Vol. 66, No. 2, January 15, 1994

presence of different noise levels in both concentrations and spectra. The information from concentrations helps to select the calibration samples that have a more linear relationship between the concentrations and spectral responses due to the closeness between the calibration samples and prediction samples in both chemical and spectral spaces. However, the information from the spectral space still contributes more in the distance measure. The advantage of incorporating the chemical information into the distance measure is minimized if the noise in y is too high.
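The way the weighting coefficient shifts the selection of local calibration samples between the two spaces can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Euclidean metric on the PC scores, and the scaling of each distance by its maximum are assumptions made here for illustration, and `alpha` stands for the weighting coefficient of the distance measure.

```python
import numpy as np

def combined_distance(scores_cal, y_cal, scores_new, y_new_est, alpha):
    """Blend chemical-space and spectral-space distances for one prediction sample."""
    scores_cal = np.asarray(scores_cal, dtype=float)
    y_cal = np.asarray(y_cal, dtype=float)
    # Spectral distance: Euclidean distance between PC scores (assumed metric).
    d_spec = np.linalg.norm(scores_cal - np.asarray(scores_new, dtype=float), axis=1)
    # Chemical distance: gap between reference values and the current estimate.
    d_chem = np.abs(y_cal - y_new_est)
    # Assumed normalization: divide by the largest distance so both lie in [0, 1].
    d_spec = d_spec / max(d_spec.max(), 1e-12)
    d_chem = d_chem / max(d_chem.max(), 1e-12)
    return alpha * d_chem + (1.0 - alpha) * d_spec

def select_local_samples(distance, n_local):
    """Indices of the n_local calibration samples with the smallest distance."""
    return np.argsort(distance)[:n_local]

# Toy data: three calibration samples, one prediction sample at the origin
# whose concentration is currently estimated as 5.0.
scores_cal = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
y_cal = [10.0, 0.0, 5.0]
for alpha in (0.0, 1.0):
    d = combined_distance(scores_cal, y_cal, [0.0, 0.0], 5.0, alpha)
    print(alpha, select_local_samples(d, 1))
```

With `alpha = 0` the sketch reduces to a purely spectral selection and picks the sample at index 0; with `alpha = 1` it picks the sample whose concentration is closest to the estimate (index 2). Intermediate values trade the two criteria off, which is the behavior the simulation probes.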

EXPERIMENTAL DATA

Four real data sets are used to compare the performance of five different methods, namely, PCR, PLS, LWR, LWR1, and LWR2. The procedures for collecting the data are summarized in this section.

Set 1. The same data set studied by Naes and Isaksson22 is used here to compare the predictive ability of the three LWR methods discussed above. The data are based on measurements of diffuse near-infrared transmittance (NIT) for water concentrations in 103 samples of meat. The water concentrations were obtained by air drying at 105 °C. The NIT spectra consisted of intensities at 100 wavelengths (2-nm steps) in the region between 850 and 1050 nm. Each chemical concentration value is an average of three replicates. The data set was divided into two sets: 70 samples were allocated to the calibration set and 33 to the prediction set.

Set 2. A total of 100 homogenized beef samples were measured using diffuse NIT. NIT spectra were measured in the 850-1050 nm region (2-nm steps) on a Tecator 1225 Infratec food and feed analyzer (Tecator AB, Höganäs, Sweden). The fat content was determined in triplicate by an ethylene tetrachloride extraction method (Fosslet, Foss Electric, Hillerød, Denmark). The 100 samples were divided into a calibration set of 68 samples and a prediction set of 32 samples. Summary statistics for the data are provided in Table 3. The data are further described and discussed in Isaksson et al.27

Table 3. Summary Statistics for Fat Concentrations in Data Set 2

                             fat concn (wt %)
set (no. of samples)    min     max     avg     std
all (100)               0.94    23.17   7.86    4.44
calibration (68)        0.94    23.17   7.91    4.69
prediction (32)         1.30    17.47   7.75    4.22

Set 3. A total of 94 homogenized beef samples were heat-treated to different temperatures in the range from ~50 to ~85 °C. After cooling to room temperature, diffuse NIT spectra were measured in the 850-1050 nm region (2-nm steps) on a Tecator 1225 Infratec food and feed analyzer (Tecator AB). The data were divided into two sets: 77 samples were allocated to the calibration set and 17 to the prediction set. Previous maximum heat treatment temperatures were used as dependent variables (y). Summary statistics for the data are provided in Table 4. The data are further described and discussed in Ellekjaer and Isaksson.28

Table 4. Summary Statistics for Temperatures in Data Set 3

                             temp (°C)
set (no. of samples)    min     max     avg     std
all (94)                49.8    85.5    66.9    11.3
calibration (77)        49.8    85.5    67.1    10.8
prediction (17)         49.8    85.5    66.9    11.3

Set 4. Data set 4 is based on measurements of eight Taguchi gas sensors (TGS) for two-component mixtures of toluene and benzene. The eight TGS sensors (Figaro Inc.) consisted of three TGS 823, two TGS 816, TGS 824, TGS 815, and TGS 825. The eight sensors were arranged in a linear fashion in a flow cell through which air was passed at a constant flow rate of 200 mL/min. Organic solvent vapors of toluene and benzene were generated using bubblers at a temperature of -15 °C (+5 °C for benzene). The flow of the air to the sensor block and bubblers was controlled by mass-flow controllers (Tylan Inc.) linked to a data-acquisition and control computer. Solvent vapor concentrations in the range of 5-500 ppm could be generated. For each sample, the vapors were generated for 10 min to allow adequate equilibration time with the sensors, and between samples, the sensors were exposed to purge air for 45 min. In order to compare the methods, the data set was divided into a calibration set and a prediction set of 50 samples each. Table 5 gives the statistics on the data set. Detailed experimental procedures are described in Carey and Yee.29

Table 5. Summary Statistics for the Concentrations of Toluene and Benzene in Data Set 4

                             toluene (ppm)                      benzene (ppm)
set (no. of samples)    min     max     avg     std       min     max     avg     std
all (100)               23.0    396.1   182.4   140.9     53.0    470.9   212.0   130.9
calibration (50)        23.0    396.0   186.0   142.4     53.0    470.9   210.3   131.4
prediction (50)         23.4    396.1   178.7   139.4     54.5    470.9   213.8   130.3

(27) Isaksson, T.; Miller, C. E.; Naes, T. Appl. Spectrosc. 1992, 46, 1685.
(28) Ellekjaer, M. R.; Isaksson, T. J. Sci. Food Agric. 1992, 59, 335.
(29) Carey, W. P.; Yee, S. Sens. Actuators 1992, 9, 113.

RESULTS AND DISCUSSION

Set 1. The RMSEP results from LWR, LWR1, and LWR2 are presented in Tables 6-8, respectively. It should be noted that the results in Table 8 from the LWR2 method were obtained after the first iteration (j = 1). As shown in ref 22, LWR showed superior predictive ability for water concentrations over PCR or PLS. The best RMSEP result for LWR is 0.69 wt %. After the first iteration, the best RMSEP result from LWR2 (0.65 wt %) is the same as that from LWR1.

Table 6. RMSEP Results for LWR(a)

        number of principal components
N(b)    2       3       4       5
15      4.89    1.00    0.87    1.25
20      4.79    0.94    0.80    1.00
25      4.61    0.80    0.77    0.98
30      4.49    0.71    0.79    0.93
35      4.38    0.69    0.79    0.89
40      4.32    0.70    0.80    0.88

(a) The calibrations are for water concentrations (wt %) (data set 1). (b) Indicates the number of samples in each local calibration.

Table 7. RMSEP Results for LWR1(a)

        number of principal components
N(b)    2       3       4       5
15      4.39    0.88    1.20    1.40
20      4.38    0.73    0.90    0.99
25      4.40    0.67    0.72    0.80
30      4.41    0.65    0.69    0.74
35      4.44    0.65    0.71    0.72
40      4.45    0.67    0.73    0.74

(a) The calibrations are for water concentrations (wt %) (data set 1). (b) Indicates the number of samples in each local calibration.

Table 8. RMSEP Results for LWR2(a)

        number of principal components
N(b)    2       3       4       5
15      4.68    0.86    0.87    1.65
20      4.61    0.67    0.77    1.24
25      4.46    0.65    0.80    0.90
30      4.37    0.66    0.79    0.83
35      4.32    0.67    0.78    0.80
40      4.30    0.67    0.77    0.78

(a) The calibrations are for water concentrations (wt %) (data set 1). All the results were obtained after the first iteration (j = 1). (b) Indicates the number of samples in each local calibration.

As mentioned in the Introduction, the prediction error for one prediction sample using LWR is 2.87 wt % at N = 25 and M = 3, compared to the overall average error of 0.60 wt %. By using the distance measure in LWR1, also at N = 25 and M = 3, the prediction error for this prediction sample decreased from 2.87 to 0.83 wt %. The 10 samples closest to this prediction sample as chosen by the distance measure in LWR1 are shown in the chemical space in Figure 8, where the

Figure 8. The 10 calibration samples (×) geometrically closest to the same prediction sample (○) as in Figure 2 in the first three weighted principal components (distance measure in LWR1), presented in part of the protein, fat, and water triangle (axis ticks at 10%, 20%, and 30% fat). Dots illustrate the rest of the calibration samples.

prediction sample is indicated by ○, the 10 closest calibration samples by ×, and the rest of the calibration samples by dots. As can be observed, the selected calibration samples are closer to the prediction sample in the chemical space than those from LWR in Figure 2. The 10 calibration samples closest to the prediction sample as chosen by the distance measure from LWR2 at N = 25, M = 3, and α = 0.2 after the first iteration (j = 1) are shown in the chemical space in Figure 9. It can be seen that these calibration samples are much closer to the prediction sample in the chemical space than those from LWR1. The prediction error for this prediction sample by LWR2 decreased from 2.87 to 0.76 wt %, which is an improvement over LWR1. This shows that calibration samples that are close to the prediction sample in both chemical and spectral spaces give better predictive ability. The iterative procedure for the LWR2 method was tested for N = 25, M = 3, and α = 0.2. As shown in Figure 10, the minimum RMSEP result (0.63 wt %) was obtained after the second iteration, i.e., J_op = 2. The procedure converged to 0.64 wt % after the third iteration. This minimum result is ~8% better than LWR and ~3% better than LWR1. The same data set was also studied by Naes et al. using neural networks.30 The RMSEP result from the neural networks was 0.64 wt %.
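The iterative procedure tested above can be sketched as follows. This is a simplified illustration rather than the authors' code: the function names are hypothetical, the spectral distance is taken as a Euclidean distance on unweighted PC scores, each distance is scaled by its maximum, and the cubic weights are assumed to have the tricube form. The initial concentration estimate comes from a global PCR, and each iteration reselects the local calibration samples using the latest estimate.

```python
import numpy as np

def pc_scores(X_cal, X_new, n_pc):
    """Project both sets onto loadings computed from the calibration set only."""
    mean = X_cal.mean(axis=0)
    _, _, vt = np.linalg.svd(X_cal - mean, full_matrices=False)
    loadings = vt[:n_pc].T
    return (X_cal - mean) @ loadings, (X_new - mean) @ loadings

def regress_and_predict(T, y, t_new, weights=None):
    """Weighted least squares of y on scores T (with intercept), evaluated at t_new."""
    w = np.ones(len(y)) if weights is None else weights
    sw = np.sqrt(w)
    A = np.column_stack([np.ones(len(y)), T])
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return float(np.concatenate(([1.0], t_new)) @ coef)

def scaled(d):
    """Scale distances to [0, 1] (assumed normalization)."""
    return d / max(d.max(), 1e-12)

def iterative_lwr2(X_cal, y_cal, x_new, n_pc, n_local, alpha, n_iter):
    """Return the concentration estimates after 0, 1, ..., n_iter iterations."""
    T, T_new = pc_scores(X_cal, x_new[None, :], n_pc)
    t_new = T_new[0]
    d_spec = scaled(np.linalg.norm(T - t_new, axis=1))
    y_est = regress_and_predict(T, y_cal, t_new)  # initial estimate from global PCR
    estimates = [y_est]
    for _ in range(n_iter):
        d_chem = scaled(np.abs(y_cal - y_est))
        d = alpha * d_chem + (1.0 - alpha) * d_spec
        idx = np.argsort(d)[:n_local]             # local calibration samples
        w = (1.0 - scaled(d[idx]) ** 3) ** 3      # cubic (tricube) weights
        y_est = regress_and_predict(T[idx], y_cal[idx], t_new, w)
        estimates.append(y_est)
    return estimates

# Synthetic, exactly linear data used only to exercise the sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 4.0
x_new = rng.normal(size=5)
print(iterative_lwr2(X, y, x_new, n_pc=5, n_local=15, alpha=0.2, n_iter=3))
```

In practice the number of iterations would be chosen from the RMSEP on a validation or prediction set, as done for J_op in the text; the synthetic data here are linear, so every iteration returns essentially the same value.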


Figure 9. The 10 calibration samples (×) geometrically closest to the same prediction sample (○) as in Figure 2 in both chemical and spectral spaces (distance measure in LWR2), presented in part of the protein, fat, and water triangle. Dots illustrate the rest of the calibration samples.

Figure 10. RMSEP results from PCR, LWR, LWR1, and iterative LWR2 for data set 1, plotted against the number of iterations.

(30) Naes, T.; Kvaal, K.; Isaksson, T.; Miller, C. J. Near Infrared Spectrosc. 1993, 1, 1.

Set 2. The results for data set 2 are presented in Table 9. Three chemical constituents were measured in the samples: protein, water, and fat. The protein predictions by PCR and PLS were very precise. Because of the small variation in protein content among the samples, no further improvement was obtained by any of the LWR methods. The water prediction was improved over PCR and PLS by LWR and LWR2, with LWR2 slightly better than LWR. LWR1 obtained approximately the same prediction error as PCR and PLS for water. Therefore, the main emphasis here

Table 9. RMSEP Results for Fat Concentrations (wt %) in Data Set 2 from PCR, PLS, and Three Different LWR Methods

method   RMSEP   factors                       N(a)
PCR      0.42    14
PLS      0.45    9
LWR      0.51    10                            40
LWR1     0.35    14                            60
LWR2     0.33    7 (J_op = 4, α = 0.4)(b)      40

(a) Indicates the number of samples in local calibration. (b) J_op indicates the number of iterations, while α indicates the weighting coefficient in the distance measure in LWR2.

Table 10. RMSEP Results for Temperatures (°C) in Data Set 3 from PCR, PLS, and Three Different LWR Methods

method   RMSEP   factors                       N(a)
PCR      3.1     6
PLS      3.0     13
LWR      2.1     3                             20
LWR1     2.4     3                             20
LWR2     1.8     3 (J_op = 1, α = 0.2)(b)      30

(a) Indicates the number of samples in local calibration. (b) J_op indicates the number of iterations, while α indicates the weighting coefficient in the distance measure in LWR2.

is put on the percentage fat prediction. The concentration variation for fat is the largest in the data set. As seen in Table 9, the RMSEP result from LWR is actually worse than that from either PCR or PLS. This outcome is due to the cubic weight function used by LWR. The idea behind the cubic weight function is to simulate a normal distribution of the calibration samples: the closer a calibration sample is to the prediction sample, the larger its weight in the regression, and vice versa. Generally, the function works very well provided nonlinear relations exist between y and X. However, as observed by Naes and Isaksson,22 some of the local calibration samples that are farther from the prediction sample actually influence the estimation of the regression vector. Therefore, a uniform weight function was proposed by Naes and Isaksson,22 in which the same weight is assigned to all the local calibration samples, and an improvement over the cubic function was obtained for the data they studied. Obviously, if all the calibration samples were selected by LWR with a uniform weight function, LWR would be the same as PCR and would therefore give the same results as PCR. However, using the cubic weight function, both LWR1 and LWR2 obtained significantly better RMSEP results than PCR, PLS, and LWR. The best RMSEP from LWR1 (0.35 wt %) was obtained at N = 60 and M = 14. The best RMSEP from LWR2 (0.33 wt %) was obtained at N = 40, M = 7, α = 0.4, and J_op = 4. The iterative LWR2 converged rapidly to an RMSEP of 0.33 wt % after the fourth iteration. A larger model with more PCs resulted from LWR1 (M = 14) than from LWR2 (M = 7). In the distance measure in LWR1, the principal components that have the largest predictive ability in PCR have the largest influence on the distance.
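The difference between the cubic and uniform weight functions discussed for this data set can be made concrete with a short sketch. It assumes that the cubic weight function has the tricube form w = (1 - u^3)^3 commonly used in locally weighted regression, with u the distance scaled by the largest distance among the local calibration samples; the function names are illustrative only.

```python
import numpy as np

def cubic_weights(distances):
    """Tricube weights: close samples get weights near 1, the farthest gets 0."""
    d = np.asarray(distances, dtype=float)
    u = d / d.max()          # scale so the farthest local sample has u = 1
    return (1.0 - u ** 3) ** 3

def uniform_weights(distances):
    """Uniform weights: every local calibration sample counts equally."""
    return np.ones(len(distances))

local_distances = [0.0, 0.5, 1.0]
print(cubic_weights(local_distances))
print(uniform_weights(local_distances))
```

Under the cubic function the sample at half the maximum distance still receives about two-thirds of the full weight, while the farthest sample is excluded entirely; under the uniform function all selected samples influence the regression vector equally.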
As in the case of data set 1, the LWR1 distance measure chooses the calibration samples that have a better linear relation between y and the weighted PCs. In this case, 60 of the 68 calibration samples were selected to build the model, which strongly indicates a linear relationship between y and the weighted principal components. The best RMSEP result (0.33 wt %) was obtained by LWR2 with a much smaller model than LWR1. This indicates that the most important information for prediction is in the first seven PCs, but in a nonlinear way, as indicated by the smaller number of local calibration samples used in LWR2 as compared with LWR and LWR1. Both LWR and LWR1, however, were unable to find the linear relation between y and the first seven PCs. LWR1, by weighting the PCs and expanding the PC space, found the linear relation between y and 14 PCs. Since the first few PCs contain the largest variation compared with the rest, they are more stable against minor changes in the spectra

Table 11. RMSEP Results for Toluene Concentrations (ppm) in Data Set 4 from PCR, PLS, and Three Different LWR Methods

method   RMSEP   factors                       N(a)
PCR      52.3    7
PLS      52.5    7
LWR      32.0    3                             15
LWR1     25.5    6                             25
LWR2     23.8    3 (J_op = 2, α = 0.4)(b)      10

(a) Indicates the number of samples in local calibration. (b) J_op indicates the number of iterations, while α indicates the weighting coefficient in the distance measure in LWR2.

Table 12. RMSEP Results for Benzene Concentrations (ppm) in Data Set 4 from PCR, PLS, and Three Different LWR Methods

method   RMSEP   factors                       N(a)
PCR      52.8    4
PLS      50.0    3
LWR      41.5    6                             45
LWR1     27.8    3                             15
LWR2     19.9    3 (J_op = 10, α = 0.3)(b)     10

(a) Indicates the number of samples in local calibration. (b) J_op indicates the number of iterations, while α indicates the weighting coefficient in the distance measure in LWR2.

and noise. Therefore, the LWR2 model with the smallest number of PCs should be more stable and have better predictive ability.

Set 3. The RMSEP results for data set 3 are presented in Table 10. An improvement of about 30% over PLS and PCR was obtained using LWR, which indicates a highly nonlinear data set. The best RMSEP result (1.8 °C) is given by LWR2 at N = 30, M = 2, α = 0.2, and J_op = 1, which is an improvement of ~14% over LWR. Notice in this case that the RMSEP result from LWR1 is worse than that from LWR. This shows a weakness of the method. Both LWR and LWR1 obtained their best results at N = 20 and M = 3. The only difference is that LWR1 weighted the first three PCs according to their predictive ability estimated by PCR, while the first three PCs in LWR received equal weights. Therefore, the predictive ability estimated for the PCs by PCR may not be accurate when there is a highly nonlinear relation between y and the PCs.

Set 4. The RMSEP results from the five methods for the prediction of toluene and benzene are provided in Tables 11 and 12, respectively. It can be seen that the use of PLS or PCR resulted in a high prediction error for both toluene and benzene. Comparison of the RMSEP results with the standard deviations of the true concentrations in the prediction sets in Table 5 shows that neither PLS nor PCR was able to adequately model the variance of the analyte concentrations in the prediction sets. This is primarily due to the highly


nonlinear responses from the sensors, according to Carey and Yee.29 For the prediction of toluene, the RMSEP result for LWR2 is 23.8 ppm at N = 10, M = 3, α = 0.4, and J_op = 2, an improvement of 26% over the best result of LWR and 7% over the best result of LWR1. For the prediction of benzene, the best RMSEP result (19.9 ppm) is obtained from LWR2 at N = 10, M = 3, α = 0.3, and J_op = 10, which is a 52% improvement over LWR and a 28% improvement over LWR1. In both cases, the values of α are relatively high compared to those for data sets 1 and 3, which means that the information from chemical space makes a larger contribution to the distance measure. According to Carey and Yee,29 with only five different types of sensors and the similarity of response for TGS 823 and TGS 825, the array actually consisted of only four discriminating sensors. The relative noise of the sensor response was 5%, as estimated from the measurement reproducibility of each sensor. The noise for the concentrations of toluene and benzene was not reported. However, it was confirmed that the precision of the reference method was better than 2%. Therefore, with extremely nonlinear response, high noise, and low discriminating power of the sensors, the multichannel advantage of the sensor array measurements is not as dominant as with the above NIT data sets. This leads to the higher contribution to the distance measure from the chemical space information.
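The percentage improvements quoted for data set 4 follow directly from the RMSEP values in Tables 11 and 12. The short check below uses a hypothetical helper, `percent_improvement`, defined here as the relative reduction in RMSEP with respect to the reference method.

```python
def percent_improvement(rmsep_reference, rmsep_new):
    """Relative reduction in RMSEP, in percent, with respect to the reference method."""
    return 100.0 * (rmsep_reference - rmsep_new) / rmsep_reference

# RMSEP values (ppm) taken from Tables 11 and 12.
toluene = {"LWR": 32.0, "LWR1": 25.5, "LWR2": 23.8}
benzene = {"LWR": 41.5, "LWR1": 27.8, "LWR2": 19.9}

for name, table in (("toluene", toluene), ("benzene", benzene)):
    for reference in ("LWR", "LWR1"):
        gain = percent_improvement(table[reference], table["LWR2"])
        print(f"{name}: LWR2 vs {reference}: {gain:.1f}%")
```

Rounded to whole percentages, the four values are 26, 7, 52, and 28, in agreement with the figures given in the text.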


CONCLUSION

A novel distance measure for locally weighted regression (LWR2) has been proposed and tested. The distance measure is a weighted average of the distances in both chemical and spectral spaces. The adjustable parameter α controls the weighting of each space and can be chosen by either cross validation or a separate prediction data set. As for the comparison between LWR and LWR2, evidence from the simulation shows that LWR2 can handle the nonlinearity between y and X better than LWR in the space of the first two principal components, where most of the relevant information is present, but in a nonlinear fashion. It also shows that LWR2 is a more stable method than LWR in the presence of noise in both chemical and spectral spaces. The simulation and the illustration for one of the prediction samples in data set 1 give good examples that local linearity can be achieved by using a local linear calibration model with calibration samples close to the prediction sample in both chemical and spectral spaces. The simulation also shows that if the noise in y is too high, LWR2 performs about the same as LWR. Three LWR methods for nonlinear calibration were tested on four real data sets and compared with two linear calibration methods, PCR and PLS. The LWR methods show superior performance over PCR and PLS except for one case in which the result with LWR is slightly worse than those with PCR and PLS.


Also, as is noted from the results, very few principal components are necessary to obtain good prediction results from the three LWR methods. This phenomenon was also observed by Naes et al.21,22 Since these first few principal components are more stable and less sensitive to minor changes in the spectra, the LWR methods are expected to be more stable and less sensitive to outliers. It is well-known that linear calibration methods such as PCR or PLS need more factors only to correct for nonlinearities in the relation between y and the first few principal components or factors. However, high levels of noise in the higher-numbered principal components or factors can render the methods ineffective and sensitive to outliers. This is the main reason why PCR or PLS models with more factors cannot handle nonlinearity in data as efficiently. The results from the real data sets show that LWR2 has the best predictive ability among the three LWR methods. In all cases, iterative LWR2 converged well, which shows the stability of the iterative method. In most cases, the minimum RMSEP results can be obtained after only a few iterations. By setting smaller or even zero weights for some low-numbered PCs (with large variances), LWR1 has the tendency to select some PCs with small variances that have better predictive ability. This is why LWR1 constructed models with many more factors than LWR or LWR2 in two cases. However, since the predictive ability of the PCs is estimated by a preliminary PCR, such an estimate may not be accurate, especially for highly nonlinear data. That is the case for data set 3, where LWR1 performed worse than LWR. Similarly, LWR2 also uses PCR to obtain initial concentration estimates for the prediction samples in order to help select the calibration samples. Most of the time, such initial concentration estimates may not be accurate due to outliers or nonlinear data.
However, by combining information in chemical and spectral spaces, the iterative LWR2 method can achieve the minimum prediction error within a few iterations. The optimum number of iterations can be determined either by cross validation or by a separate prediction set.
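Choosing the number of iterations (or any other setting, such as the weighting coefficient or the number of local samples) from a separate prediction set amounts to computing the RMSEP for each candidate and keeping the smallest. A minimal sketch is given below; the helper names are hypothetical, and the per-iteration RMSEP values are those reported in the text for data set 1 at N = 25 and M = 3.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def best_setting(rmsep_by_setting):
    """Return the (setting, RMSEP) pair with the smallest RMSEP."""
    return min(rmsep_by_setting.items(), key=lambda item: item[1])

# RMSEP (wt %) after iterations 1, 2, and 3 for data set 1, as reported in the text.
rmsep_by_iteration = {1: 0.65, 2: 0.63, 3: 0.64}
print(best_setting(rmsep_by_iteration))
```

For these values the selection returns the second iteration, matching J_op = 2 reported for data set 1. With cross validation, the same selection would be applied to the cross-validated error instead of the RMSEP on a separate set.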

ACKNOWLEDGMENT

The authors thank Dr. Patrick Carey for providing the Taguchi sensor data set. Dr. Tormod Naes and Dr. Age Smilde are acknowledged for their valuable comments. Mr. William Cusworth, III, and Mr. Paul Mobley are acknowledged for their criticism in preparing the manuscript. This work was funded through a grant from the Center for Process Analytical Chemistry (CPAC) at the University of Washington. CPAC is an NSF/Industry cooperative research center.

Received for review June 28, 1993. Accepted October 18, 1993. Abstract published in Advance ACS Abstracts, December 1, 1993.