Optimization in Locally Weighted Regression - ACS Publications

Anal. Chem. 1998, 70, 4206-4211

Optimization in Locally Weighted Regression Vitezslav Centner and D. Luc Massart*

ChemoAC, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium

The application of locally weighted regression (LWR) to nonlinear calibration problems and strongly clustered calibration data often yields more reliable predictions than global linear calibration models. This study compares the performance of LWR that uses PCR and PLS regression, the Euclidean and Mahalanobis distance as a distance measure, and the uniform and cubic weighting of calibration objects in local models. Recommendations are given on how to apply LWR to near-infrared data sets without spending too much time in the optimization phase.

Nonlinearity and the occurrence of sample subgroups (i.e., clustering) are among the most common pitfalls in multivariate calibration (refs 1-3). If nonlinearity and clustering are strong, then global linear calibration models do not fit calibration data correctly, which leads to biased predictions. The idea of using locally weighted regression (LWR) is to develop a local model for each new sample based on the nearest neighbors, i.e., the calibration objects nearest to it (refs 4-12). It is hoped that such a model enables a more accurate prediction of the concentration (ĉ_0) for the new sample from its measurements a_0 than would be achieved with a global linear model. So far, the most often employed type of LWR is the one that uses principal component regression (PCR) as a regression method to build local models, the Mahalanobis distance as a distance measure to find nearest neighbors, and cubic weighting to weight the nearest neighbors in local models according to their distances to the new object (refs 6, 7, 9). Potential alternatives are partial least squares (PLS) (to PCR), the Euclidean distance (to the Mahalanobis distance), and uniform weighting (to cubic weighting). To decide on what type of modeling (PCR/PLS), distance measure (Euclidean/Mahalanobis), and weighting (uniform/cubic) to use, but also to determine the global optimal complexity of local models and the global optimal number of nearest neighbors in local models, cross-validation or prediction testing is usually employed (refs 6-8, 13-15). The global optimum is the solution yielding the best predictions over the whole calibration domain. To find the optimal combination, considerable computational time is needed, however. The aim of this article is to show that for near-infrared (NIR) data, the combination PLS, Euclidean distance measured in the original measurement space, and uniform weighting generally yields satisfactory results.

THEORY

1. LWR Algorithms. The method of locally weighted regression was proposed by Jensen, Martens, and Næs (refs 4, 5). However, since these references were overlooked in the chemometrics literature, most authors refer to the work of Cleveland and Devlin (ref 9) as the original source. The kernel of the LWR algorithm is as follows:
(i) define the number of nearest neighbors to be used to build local calibration models (n_local);
(ii) for every new (test) object with unknown concentration c_0, find the n_local calibration samples closest to it (nearest neighbors);
(iii) build a local calibration model using the nearest neighbors only; assign the weights of the neighbors in the local model according to their distance to the new object;
(iv) predict the concentration of the new sample (ĉ_0) from its absorbance a_0 by applying the local calibration model developed.

In refs 4-6 as well as in many other publications (refs 7-12), the authors described LWR using PCR, the Mahalanobis distance, and cubic weighting (where the weight w_i of the nearest neighbor i is given by w_i = (1 - D_i^3)^3 and D_i is the distance between the new object and its ith nearest neighbor). In this paper, we also evaluate the performance of LWR when using PLS, the Euclidean distance, and uniform weighting. We have the following reasons to do this.
(a) As PCs can have different roles in different parts of the calibration domain (in the individual local models), it is difficult to select a global subset of PCs that would be optimal over the whole domain. The implementation of PCR with a selection of PCs according to correlation (refs 16-18) is therefore not feasible in LWR. In contrast to PCR, which builds on the maximum variance explained,

(1) Centner, V.; Verdu-Andres, J.; Walczak, B.; Jouan-Rimbaud, D.; Despagne, F.; Pasti, L.; Poppi, R.; Massart, D. L.; de Noord, O. E. A comparison of multivariate calibration techniques applied to experimental NIR data sets, submitted.
(2) Jouan-Rimbaud, D.; Massart, D. L.; Leardi, R.; de Noord, O. E. Anal. Chem. 1995, 67, 4295.
(3) Centner, V.; Massart, D. L.; de Noord, O. E. Anal. Chim. Acta 1996, 330, 1.
(4) Jensen, S. A.; Martens, H. Multivariate Calibration of Fluorescence Data for Quantitative Analysis of Cereal Composition. In Food Research and Data Analysis; Martens, H., Russwurm, H., Eds.; Applied Science Publ.: New York, 1983; p 253.
(5) Martens, H.; Næs, T. TrAC, Trends Anal. Chem. 1984, 3, 204.
(6) Næs, T.; Isaksson, T.; Kowalski, B. R. Anal. Chem. 1990, 62, 664.
(7) Næs, T.; Isaksson, T. Appl. Spectrosc. 1992, 46, 34.
(8) Wang, Z.; Isaksson, T.; Kowalski, B. R. Anal. Chem. 1994, 66, 149.
(9) Cleveland, W. S.; Devlin, S. J. J. Am. Stat. Assoc. 1988, 83, 596.
(10) Ruppert, D.; Wand, M. P. Ann. Stat. 1994, 22, 1346.
(11) Næs, T.; Isaksson, T. NIR News 1994, 5 (4), 7.
(12) Næs, T.; Isaksson, T. NIR News 1994, 5 (5), 8.
(13) Martens, H.; Næs, T. Multivariate Calibration; Wiley: Chichester, U.K., 1989.
(14) Brown, P. J. Measurement, Regression, and Calibration; Clarendon Press: Oxford, U.K., 1993.
(15) Thomas, E. V. Anal. Chem. 1994, 66, 795A.
(16) Sun, J. J. Chemom. 1995, 9, 21.
(17) Cowe, I. A.; McNicol, J. W. Appl. Spectrosc. 1985, 39, 257.
(18) Jolliffe, I. T. Appl. Stat. 1982, 31, 300.

4206 Analytical Chemistry, Vol. 70, No. 19, October 1, 1998

S0003-2700(98)00208-X CCC: $15.00

© 1998 American Chemical Society Published on Web 09/03/1998

PLS focuses directly on correlation, which often leads to more parsimonious models. Since the simplicity of local models (compared to the global models) is one of the main arguments in favor of LWR (ref 6), the application of PLS seemed more appropriate than PCR. Another reason for using PLS is the speed of this technique compared to PCR when the optimized PLS algorithm SIMPLS (ref 19) is used. The difference in speed becomes important mainly in the optimization stage, i.e., when cross-validation or prediction testing is carried out to determine the global optimal number of nearest neighbors and the global optimal complexity of local models. In this context, it should be mentioned that we consider a full cross-validation in this study, by which is meant a procedure in which the extraction of latent variables is repeated for each calibration object left out. This differs, for instance, from LWR as applied in ref 6, where PCs were extracted from the original data once, before starting cross-validation; the scores of each object were then left out one by one and used to optimize the LWR model. The former type of cross-validation was applied rather than the latter because it enables estimation of both model complexity and prediction error, whereas the latter does not provide unbiased estimates of prediction error.

(b) The advantage of using the Euclidean distance (ED) over the Mahalanobis distance (MD) is that it can be calculated directly in the original measurement space A, whereas MD cannot. Indeed, when the number of original variables p is higher than the number of objects n, as is usual in NIR spectroscopy, then the calculation of MD in the original space leads to singularity problems. For this reason, PCs have to be used instead. Since MD then varies with the number of PCs considered, it has to be determined for each number of PCs separately. This procedure is much more time-consuming than the calculation of ED, which is carried out only once.
Moreover, the higher (or irrelevant) PCs then have the same influence on the selection of nearest neighbors as the first (or the most relevant) PCs.

(c) If uniform weighting is applied, then the same weights are assigned to all objects. This is not the case with cubic weighting. When cubic weighting is utilized, then it can happen that all the weight is given to two or three nearest neighbors. The uniform weighting of nearest neighbors in local models therefore should ensure that those models are more stable than the models obtained by means of cubic weighting.

(19) de Jong, S. Chemom. Intell. Lab. Syst. 1993, 18, 251.

2. LWR Using PLS, the Euclidean Distance, and Uniform Weighting. The LWR algorithm based on PLS, ED, and the uniform weighting, with double use of cross-validation, can be summarized as follows.

(1) Define the maximal considered complexity of the LWR model M, M ≤ M_PLS, where M_PLS is the optimal complexity of the global PLS model.

(2) Define a set of numbers (n_near) of nearest neighbors (n_local) for which the optimization will be carried out by means of leave-one-out cross-validation [e.g., n_near = (M + 1, M + 2, M + 5, M + 10, ..., n_local, ..., n - 1)]. This cross-validation enables estimation of the global optimal number of nearest neighbors n_local.

(3) For each n_near = M + 1, ..., n - 1, apply leave-one-out cross-validation. For each object i = 1, ..., n, proceed as follows:

(a) Leave the ith object out.

(b) Calculate the Euclidean distance (ED_ik) between the object left out (i) and each calibration sample k, where k = 1, ..., n and k ≠ i

$$\mathrm{ED}_{ik} = \sqrt{\sum_{j=1}^{p} (a_{ij} - a_{kj})^2} \tag{1}$$

where j is the variable index, j = 1, ..., p.

(c) Rank the ED_ik values in ascending order. For each number of nearest neighbors, n_local, included in the vector n_near, n_local = M + 1, ..., n - 1, perform the following.

(1) Form the local calibration data matrix (A_local) by taking the n_local objects closest to the object i. Their corresponding local concentration vector is called c_local.

(2) From each element of the data matrix A_local, subtract the column mean of the respective column (column centering).

(3) Subtract the same column mean from the spectrum of the object i (a_i).

(4) For the global model complexity m, m = 1, ..., M, build a local PLS model by estimating the PLS b coefficients b_local,m

$$c_{\mathrm{local}} = A_{\mathrm{local}} \cdot b_{\mathrm{local},m} + e \tag{2}$$

where e is the vector of residuals from the local model. Use the b_local,m values to predict the c value of the object i left out

$$\hat{c}_{i(\mathrm{local},m)} = a_i \cdot b_{\mathrm{local},m} \tag{3}$$

This cross-validation enables estimation of the global optimal model complexity m.

(4) Quantify the predictive power of the LWR models containing m = 1, ..., M components, built on the M + 1 to n - 1 nearest neighbors. Calculate the root mean square error of cross validation (RMSECV_local,m) as a measure of the predictive ability:

$$\mathrm{RMSECV}_{\mathrm{local},m} = \sqrt{\sum_{i=1}^{n} (c_i - \hat{c}_{i(\mathrm{local},m)})^2 / n} \tag{4}$$

where c_i is the reference and ĉ_i(local,m) is the predicted concentration of the sample i. The results can be presented as a table

m \ n_local   M + 1           ...              n - 1
1             RMSECV_M+1,1    ...              RMSECV_n-1,1
...           ...             RMSECV_local,m   ...
M             RMSECV_M+1,M    ...              RMSECV_n-1,M
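The cross-validation loop of steps 3 and 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`rmse`, `loo_rmsecv_grid`, `select_optimum`) are hypothetical, the local model is passed in as a callable `predict(A, c, a0, n_local, m)`, and the stand-in `local_mean` used in the usage lines simply averages the concentrations of the nearest neighbors instead of fitting a local PLS model. The selection helper applies the rule given in the text: lowest RMSECV, with ties resolved toward the largest n_local and the smallest m.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two spectra (eq 1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rmse(reference, predicted):
    """Root mean square error between reference and predicted values (eq 4)."""
    n = len(reference)
    return math.sqrt(sum((c - ch) ** 2 for c, ch in zip(reference, predicted)) / n)

def loo_rmsecv_grid(A, c, predict, n_near, M):
    """Leave-one-out RMSECV for every (n_local, m) pair.

    `predict(A, c, a0, n_local, m)` is any local-model predictor (an assumption
    of this sketch). It is called again for each object left out, so the local
    model is rebuilt every time (full cross-validation).
    """
    grid = {}
    for n_local in n_near:
        for m in range(1, M + 1):
            preds = []
            for i in range(len(A)):
                A_rest = A[:i] + A[i + 1:]      # leave the ith object out
                c_rest = c[:i] + c[i + 1:]
                preds.append(predict(A_rest, c_rest, A[i], n_local, m))
            grid[(n_local, m)] = rmse(c, preds)
    return grid

def select_optimum(grid):
    """Lowest RMSECV; ties go to the largest n_local, then the smallest m."""
    return min(grid, key=lambda k: (grid[k], -k[0], k[1]))

def local_mean(A, c, a0, n_local, m):
    """Hypothetical stand-in for a local model: mean c of the nearest neighbors.

    It ignores m; a real LWR would fit a local PLS model with m components.
    """
    order = sorted(range(len(A)), key=lambda k: euclidean(a0, A[k]))
    return sum(c[k] for k in order[:n_local]) / n_local

# Two well-separated clusters of four objects each (synthetic data).
A = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
c = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
grid = loo_rmsecv_grid(A, c, local_mean, n_near=[2, 3, 4], M=2)
best_n_local, best_m = select_optimum(grid)
```

With this synthetic data, two or three neighbors always come from the object's own cluster, whereas a fourth neighbor has to be taken from the other cluster, so the error rises once n_local exceeds the cluster size minus one.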

The n_local and m that yield the lowest RMSECV_local,m are selected for further use. When there are several n_local and m giving the same minimal RMSECV_local,m, then the solution where n_local is maximal and m is minimal should be considered (the number of objects per dimension then is maximal).

(5) If an independent test set is available (as was the case in this study), then the predictive power of the LWR model based on n_local nearest neighbors and m latent variables can be evaluated. The root mean square error of prediction (RMSEP) is then determined as follows

$$\mathrm{RMSEP} = \sqrt{\sum_{i=1}^{n_t} (c_{0,i} - \hat{c}_{0,i})^2 / n_t} \tag{5}$$

where c_0,i is the reference concentration of the test sample i and ĉ_0,i is its concentration predicted with the LWR model developed using n_local nearest neighbors and m latent variables.

Instead of using cross-validation as indicated in step 3, prediction testing can be applied to determine the optimal n_local and m. This is then done as follows. For each test object with absorbances a_0:

(a) Determine ED between a_0 and each calibration object a_i (see also eq 1). Rank the calibration objects according to their ED to the test object.

(b) Take the n_local calibration objects closest to the test object. This subset forms a local calibration set A_local. The vector of the corresponding concentrations is called c_local.

(c) Perform column centering of A_local and, using the same mean, also the column centering of a_0.

(d) Build the local PLS model using the centered A_local and c_local as shown in eq 2.

(e) Use the estimated b_local,m to predict the concentration of the test object from its a_0.

$$\hat{c}_{0(\mathrm{local},m)} = a_0 \cdot b_{\mathrm{local},m} \tag{6}$$

Figure 1. Object scores on PC1 versus object scores on PC2 for the data set POLYMER.
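Steps (a) to (e) can be put together in a short, dependency-free Python sketch. It is an illustration under assumptions rather than the authors' code: the function names are hypothetical, PLS1 is implemented with the classical NIPALS deflation scheme rather than SIMPLS (ref 19), and c_local is mean-centered along with A_local (the text only states the centering of A_local and a_0), with the mean added back to the prediction.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two spectra (eq 1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pls1_fit(X, y, m):
    """NIPALS PLS1 on centered X (rows = objects) and centered y.

    Returns up to m components, each as (weights w, loadings p, coefficient q).
    """
    X = [row[:] for row in X]
    y = y[:]
    n_vars = len(X[0])
    comps = []
    for _ in range(m):
        w = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(n_vars)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:                       # no covariance left to model
            break
        w = [v / norm for v in w]
        t = [sum(r * v for r, v in zip(row, w)) for row in X]   # scores
        tt = sum(v * v for v in t)
        p = [sum(row[j] * ti for row, ti in zip(X, t)) / tt for j in range(n_vars)]
        q = sum(yi * ti for yi, ti in zip(y, t)) / tt
        X = [[r - ti * pj for r, pj in zip(row, p)] for row, ti in zip(X, t)]  # deflate X
        y = [yi - q * ti for yi, ti in zip(y, t)]                              # deflate y
        comps.append((w, p, q))
    return comps

def pls1_predict(comps, x):
    """Predict the centered c of one centered spectrum x by sequential deflation."""
    x = x[:]
    c_hat = 0.0
    for w, p, q in comps:
        t = sum(a * b for a, b in zip(x, w))
        c_hat += q * t
        x = [a - t * pj for a, pj in zip(x, p)]
    return c_hat

def lwr_predict(A, c, a0, n_local, m):
    """Steps (a)-(e): ED ranking, uniform weighting, local PLS1 with m components."""
    order = sorted(range(len(A)), key=lambda k: euclidean(a0, A[k]))     # (a)
    idx = order[:n_local]                                                 # (b)
    A_loc = [A[k] for k in idx]
    c_loc = [c[k] for k in idx]
    n_vars = len(a0)
    a_mean = [sum(row[j] for row in A_loc) / n_local for j in range(n_vars)]
    c_mean = sum(c_loc) / n_local
    Xc = [[row[j] - a_mean[j] for j in range(n_vars)] for row in A_loc]  # (c)
    yc = [v - c_mean for v in c_loc]   # centering c is an assumption of this sketch
    comps = pls1_fit(Xc, yc, m)                                           # (d)
    x0 = [a0[j] - a_mean[j] for j in range(n_vars)]
    return c_mean + pls1_predict(comps, x0)                               # (e)

# Synthetic one-variable example with a quadratic relationship, c = a**2.
A = [[float(a)] for a in range(10)]
c = [row[0] ** 2 for row in A]
local_pred = lwr_predict(A, c, [4.0], n_local=3, m=1)     # three nearest neighbors
global_pred = lwr_predict(A, c, [4.0], n_local=10, m=1)   # all objects: a global linear fit
```

For this synthetic data the local model predicts about 16.7 and the model built on all ten objects predicts 24.0, against a true value of 16. This mirrors, on a toy scale, the reduced bias of local models on nonlinear data.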

Calculate RMSEP_local,m similarly as indicated in eq 4. Select the global optimal n_local and m for further use. When n_local and m have been selected, the LWR model can be applied to predict the concentration of new samples in the same way as described for prediction testing.

EXPERIMENTAL NIR DATA SETS

(A) POLYMER. The data set POLYMER was collected to develop a calibration model enabling us to determine the amount of a minor mineral compound (refs 1, 22) in new polymer samples. The spectra were recorded from 1100 to 2498 nm and standard normal variate (SNV) transformed. The data were split into a calibration set (40 objects) and test set (14 objects) as described in ref 1.

(B) SUGAR. The calibration concerns the determination of sugar in water. NIR spectra of 24 sugar solutions were collected from 1100 to 2500 nm. Since this data set is small, all data are kept together and not split into a calibration and test set. Only the cross-validation results are reported further.

(C) POLY-DAT. The calibration problem concerns the determination of the hydroxyl number of polyether polyols by NIR spectroscopy (from 1132 to 2128 nm). The NIR spectra were offset corrected and outliers were removed from the data (ref 3). The Duplex algorithm was applied to obtain the calibration set (60 objects) and the test set (24 objects) (ref 1).

(D) WHEAT. The data set WHEAT is a reference data set submitted to the database of Chemometrics and Intelligent Laboratory Systems (ref 20) by Kalivas. The moisture content is the subject of calibration (spectra from 1100 to 2500 nm). The offset setup described in ref 1 was utilized in this study: 59 objects are placed into the calibration set and 40 objects into the test set.

(20) Kalivas, J. Chemom. Intell. Lab. Syst. 1997, 37, 255.
(21) Centner, V.; de Noord, O. E.; Massart, D. L. Detection of nonlinearity in multivariate calibration. Anal. Chim. Acta, in press.
(22) Verdu-Andres, J.; Massart, D. L.; Sterna, C. Anal. Chim. Acta 1997, 349, 271.


Figure 2. Concentration c versus object scores on PC1 plot for the data set POLYMER. The global linear, the global quadratic and an example of the LWR model are shown.

RESULTS AND DISCUSSION

From previous investigations (refs 1, 21) it is known that there is a significant nonlinearity in the data sets POLYMER and SUGAR and that the POLY-DAT data set is significantly clustered. On the other hand, the data set WHEAT (refs 1, 21) is linear and clustered only on the third principal component. LWR is therefore not expected to yield significantly better predictions in this case than the global linear PLS or PCR models. This data set is included for demonstration purposes, namely, to show what happens when local models are applied in cases where there is no reason to do so.

1. Data Set POLYMER. The plot of object scores on PC1 versus object scores on PC2 (see Figure 1) and the plot of concentration c versus object scores on PC1 (see Figure 2) indicate that the calibration problem is strongly nonlinear and the data set is highly clustered. The nonlinearity of the relationship between PC1 and c is significant (ref 21) and can be approximated by a quadratic equation (refs 1, 21, 22). Figure 2 shows that the development of local calibration models based on a few nearest neighbors (e.g., 10) improves the fit of the data. If the loss in precision due to the decreased number of calibration samples in the model is not more significant than the gain in model fit, then the local LWR models should give better predictions than a global (linear or quadratic) model. If this is not the case, then LWR will give predictions comparable to those of global models.

Table 1. RMSE Values Achieved for the Data Set POLYMER by Applying LWR with PLS/PCR Regression, the Euclidean/Mahalanobis Distance, and Uniform/Cubic Weighting (a)

method  distance  weighting  n_local  RMSECV (m)                RMSEP (m)
                                      1       2       3         1       2       3
PLS     ED        uniform    4        0.0228  0.0196  0.0226    0.0061  0.0070  0.0085
PLS     ED        uniform    5        0.0232  0.0170  0.0183    0.0072  0.0060  0.0053
PLS     ED        uniform    6        0.0252  0.0187  0.0206    0.0061  0.0071  0.0071
PLS     ED        uniform    7        0.0258  0.0198  0.0199    0.0061  0.0073  0.0068
PLS     ED        uniform    8        0.0249  0.0193  0.0202    0.0083  0.0077  0.0075
PLS     MD        uniform    4        0.0239  0.0199  0.0272    0.0150  0.0125  0.0086
PLS     MD        uniform    5        0.0261  0.0173  0.0236    0.0174  0.0088  0.0086
PLS     MD        uniform    6        0.0271  0.0178  0.0425    0.0186  0.0111  0.0071
PLS     MD        uniform    7        0.0271  0.0182  0.0357    0.0184  0.0109  0.0121
PLS     MD        uniform    8        0.0261  0.0190  0.0447    0.0124  0.0156  0.0225
PCR     ED        uniform    4        0.0245  0.0198  0.0226    0.0063  0.0076  0.0085
PCR     ED        uniform    5        0.0246  0.0233  0.0180    0.0070  0.0071  0.0058
PCR     ED        uniform    6        0.0262  0.0228  0.0204    0.0073  0.0087  0.0069
PCR     ED        uniform    7        0.0258  0.0171  0.0202    0.0081  0.0093  0.0070
PCR     ED        uniform    8        0.0250  0.0174  0.0193    0.0088  0.0099  0.0080
PCR     ED        cubic      4        0.0309  0.0872  0.0432    0.0123  0.0114  0.0253
PCR     ED        cubic      5        0.0219  0.0308  0.0681    0.0068  0.0166  0.0109
PCR     ED        cubic      6        0.0226  0.0191  0.0252    0.0055  0.0095  0.0066
PCR     ED        cubic      7        0.0261  0.0200  0.0220    0.0061  0.0093  0.0050
PCR     ED        cubic      8        0.0253  0.0191  0.0182    0.0061  0.0071  0.0059
PCR     MD        uniform    4        0.0267  0.0176  0.0272    0.0170  0.0124  0.0086
PCR     MD        uniform    5        0.0281  0.0172  0.0243    0.0127  0.0083  0.0103
PCR     MD        uniform    6        0.0282  0.0173  0.0293    0.0139  0.0083  0.0109
PCR     MD        uniform    7        0.0274  0.0175  0.0307    0.0139  0.0082  0.0111
PCR     MD        uniform    8        0.0263  0.0180  0.0388    0.0113  0.0160  0.0211
PCR     MD        cubic      4        0.0307  0.0395  0.0355    0.0233  0.0281  0.0250
PCR     MD        cubic      5        0.0267  0.0190  0.0598    0.0496  0.0155  0.0476
PCR     MD        cubic      6        0.0249  0.0172  0.1072    0.0128  0.0069  0.0104
PCR     MD        cubic      7        0.0276  0.0166  0.0400    0.0116  0.0071  0.0101
PCR     MD        cubic      8        0.0265  0.0163  0.0252    0.0128  0.0071  0.0100

(a) m, complexity of the model; n_local, number of nearest neighbors used to build local models. Reference RMSE values obtained with the global PLS models: (i) linear PLS with five latent variables: RMSECV = 0.0613, RMSEP = 0.0445; (ii) quadratic PLS with one latent variable: RMSECV = 0.0774, RMSEP = 0.0576.
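As a worked check of how the optimum is read off such a table, the short Python snippet below applies the selection rule (lowest RMSECV, ties toward larger n_local and smaller m) to the PLS / ED / uniform rows. The values are as read from Table 1; the variable names are hypothetical and the snippet is only an illustration, not part of the original study.

```python
# RMSECV and RMSEP of the PLS / ED / uniform variant, as read from Table 1.
# Keys are n_local; each list holds the values for m = 1, 2, 3.
rmsecv = {
    4: [0.0228, 0.0196, 0.0226],
    5: [0.0232, 0.0170, 0.0183],
    6: [0.0252, 0.0187, 0.0206],
    7: [0.0258, 0.0198, 0.0199],
    8: [0.0249, 0.0193, 0.0202],
}
rmsep = {
    4: [0.0061, 0.0070, 0.0085],
    5: [0.0072, 0.0060, 0.0053],
    6: [0.0061, 0.0071, 0.0071],
    7: [0.0061, 0.0073, 0.0068],
    8: [0.0083, 0.0077, 0.0075],
}

# Lowest RMSECV; ties resolved toward larger n_local, then smaller m.
candidates = [(err, -n_local, m)
              for n_local, row in rmsecv.items()
              for m, err in enumerate(row, start=1)]
best_err, neg_n_local, best_m = min(candidates)
best_n_local = -neg_n_local

# RMSEP of the selected model and its ratio to the global linear PLS RMSEP
# (0.0445, from the table footnote).
rmsep_at_optimum = rmsep[best_n_local][best_m - 1]
ratio_vs_global_pls = 0.0445 / rmsep_at_optimum
```

The rule selects five nearest neighbors and two latent variables for this variant, and the RMSEP at that optimum is the 0.0060 quoted in the discussion.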

The results reported in Table 1 show that LWR indeed drastically improves the accuracy of predictions compared to global PLS modeling. The optimal linear global PLS model with five latent variables yields RMSECV = 0.0613 and RMSEP = 0.0445. The optimal quadratic global model gives RMSECV = 0.0712 and RMSEP = 0.0576. This is much worse than the RMSECV and RMSEP reached with all LWR models (see Table 1). Figure 3 shows the plot of c values predicted using LWR and PLS, respectively, versus the reference c. The plot confirms the superiority of predictions reached with LWR over predictions obtained with PLS.

Cross-validation shows that the optimal complexity of the LWR models is two, independent of whether PLS or PCR, ED or MD, and uniform or cubic weighting is applied. The optimal RMSECV for all these combinations ranges from 0.0163 to 0.0191. The corresponding optimal RMSEP values are from 0.0060 to 0.0093. It can be concluded that the minimal RMSEP achieved with LWR is more than 4 times lower than that accomplished with global PLS. The combination PLS - ED - uniform weighting gives the smallest prediction errors. However, the accuracy of predictions obtained with all LWR alternatives is comparable. The advantage of the variant mentioned is that it is computationally the fastest alternative.

The difference between RMSECV and RMSEP is due to the data splitting applied (see ref 1). Since the goal of calibration is

Figure 3. Concentration predicted using (O) the global linear PLS model and (+) local LWR model plotted versus the reference c for the data set POLYMER.

Table 2. Optimal RMSE Values Achieved with LWR (Using PLS, ED, and Uniform Weighting) and (Global Linear) PLS Applied to the Data Set POLYMER When Extrapolation Is Needed in the Prediction Phase

        extrapolation in A         extrapolation in c
        m   RMSECV   RMSEP         m   RMSECV   RMSEP
PLS     5   0.0499   0.0834        5   0.0503   0.0704
LWR     1   0.0135   0.0281        3   0.0189   0.0415

to investigate LWR in the ideal calibration situation where all test samples are within the calibration domain, the prediction error of the LWR models estimated by RMSEP is smaller than that estimated by RMSECV. If new samples do not necessarily fit well into the distribution of the calibration samples, one should consider RMSECV, i.e., 0.0191, as a realistic measure of the error of prediction. A method comparison (ref 23) shows that, for the POLYMER data set, LWR yields more accurate predictions than global PLS even when a mild extrapolation in A and c is needed in the prediction phase (see Table 2). The error of predictions reached with LWR then is approximately half of that achieved with global PLS. It is probable, however, that strong extrapolations with LWR would not yield good results.

(23) Pasti, L.; Centner, V.; Walczak, B.; Despagne, F.; Jouan-Rimbaud, D.; Massart, D. L.; de Noord, O. E. A comparison of multivariate calibration techniques applied to experimental NIR data sets. Part II: Evaluation of the predictive ability of models in extrapolation condition, in preparation.

2. Data Set SUGAR. The nonlinearity of the data set SUGAR is as strong as that of the POLYMER data set (ref 22). A difference between the two calibration sets is that the samples of POLYMER are clustered, whereas the SUGAR samples are not. The RMSECV values reached with LWR range from 0.71 to 0.95 (see Table 3). The accuracy of predictions generally is comparable to that of PLS (RMSECV = 0.82). It therefore seems that global PLS is able to cope with nonlinearity quite well. An advantage of the LWR models is, however, that they are simpler. The application of two latent variables in the LWR models instead of six latent variables used in the PLS model is an improvement. The combination of PLS with ED and the uniform weighting, preferred by us, gives good predictions that are not significantly


Table 3. Optimal RMSECV Values Achieved with Two-Component LWR Models Applied to the Data Set SUGAR (a)

n_local  method  distance  weighting  RMSECV
7        PLS     ED        uniform    0.73
8        PLS     MD        uniform    0.80
5        PCR     ED        uniform    0.78
10       PCR     ED        cubic      0.71
8        PCR     MD        uniform    0.95
8        PCR     MD        cubic      0.84

(a) Reference RMSECV values obtained with the global PLS models: (i) linear PLS with six latent variables, RMSECV = 0.82; (ii) quadratic PLS with six latent variables, RMSECV = 0.84.

Table 4. Optimal RMSE Values Reached with LWR Applied to the Data Set POLY-DAT (a)

n_local  method  distance  weighting  complexity m  RMSECV  RMSEP
20       PLS     ED        uniform    7             1.82    1.86
15       PLS     MD        uniform    7             1.77    1.48
23       PCR     ED        uniform    9             1.84    2.25
23       PCR     ED        cubic      9             2.02    2.28
25       PCR     MD        uniform    7             1.60    1.95
30       PCR     MD        cubic      7             1.64    2.29

(a) Reference RMSE values achieved with the global PLS models: (i) linear PLS with seven latent variables, RMSECV = 1.62 and RMSEP = 2.23; (ii) quadratic PLS with seven latent variables, RMSECV = 2.31 and RMSEP = 3.00.

Table 5. Optimal RMSE Values Achieved under Different Conditions with Three-Component LWR Model Applied to the Data Set WHEAT (a)

n_local  method  distance  weighting  RMSECV  RMSEP
58       PLS     ED        uniform    0.2433  0.2150
58       PLS     MD        uniform    0.2433  0.2150
58       PCR     ED        uniform    0.2425  0.2150
58       PCR     ED        cubic      0.2403  0.2162
58       PCR     MD        uniform    0.2425  0.2150
58       PCR     MD        cubic      0.2403  0.2226

(a) Reference RMSE values obtained with the global PLS models: (i) linear with three latent variables, RMSECV = 0.2433 and RMSEP = 0.2150; (ii) quadratic with three latent variables, RMSECV = 0.311 and RMSEP = 0.296.

different from the optimal combination: PCR with ED and the cubic weighting. To check how LWR functions when an extrapolation in the space of measurements, in the space of measurements and concentration, or in the concentration space is needed in the prediction phase, a test set including eight extreme samples was prepared for each type of extrapolation. In all cases, the optimal LWR model included fewer latent variables than the optimal global PLS model (one instead of four components was used). Except for the extrapolation in the concentration space, which represented a 35% extension of the concentration range, all these simpler models resulted in comparable or more accurate predictions than the global PLS model. When (strongly) extrapolating in the concentration space, LWR did not give as good predictions as global PLS.

3. Data Set POLY-DAT. The data set POLY-DAT is linear (ref 22), but the Hopkins statistic (ref 3) indicates a significant clustering tendency. This is confirmed by visual observation. Table 4 shows that many more calibration objects are needed in the LWR models for POLY-DAT than in the models for POLYMER or SUGAR. The number of nearest neighbors used here ranges from 15 to 30 (depending on the LWR variant applied). This number corresponds (approximately) to the size of the two large clusters present in the data set. This indicates that within each cluster the calibration relationship is linear. Instead of using LWR, one could also split the data into two calibration problems and treat them separately (see ref 3). However, LWR does this automatically. The second reason more nearest neighbors are needed to reach optimal predictions is the complexity of the calibration problem. As 7 or 9 latent variables are included in the optimal model, a number of nearest neighbors between 15 and 30 seems reasonable.

It should be noted that RMSECV obtained with global PLS (1.62) tends to be smaller than that of LWR (from 1.60 to 2.02). On the other hand, RMSEP obtained with LWR (from 1.48 to 2.28) tends to be lower than for global PLS (2.23). Within the calibration domain LWR predicts better than PLS, because the loss in precision due to the use of sample subsets instead of the whole calibration set is less important than the gain from the decreased bias that results from a better fit of local models to the data. This is not the case over the whole calibration range including extreme samples (whole calibration set). If an extrapolation were necessary in the prediction phase, then LWR would give predictions that are not better than, but comparable to, those of global PLS (ref 23). The proposed LWR variant (PLS, ED, uniform weighting) is slightly worse than the optimal combination: PLS, MD, uniform weighting. However, the difference does not seem to be very important.

4. Data Set WHEAT. The data set WHEAT is included to show what happens when LWR is applied in situations where it should not be used, i.e., for linear homogeneous data. Table 5 indicates that whatever combination (method, distance measure, weighting) is applied, the optimal predictions are achieved when n_local is such that all calibration objects are included in the LWR models. The difference due to the use of ED and MD then disappears because all samples are treated as the nearest neighbors. LWR with uniform weighting is equal to the global PLS and PCR model. The difference due to the application of the cubic weighting is negligible. Of course, LWR does not help to improve predictions when the calibration problem is linear and the data set (practically) homogeneous. However, there is no danger that LWR would be worse in such cases than global PCR or PLS, since LWR automatically becomes global PCR or PLS.
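The two weighting schemes compared throughout can be written down in a few lines of Python. This is a hedged sketch: the function names are hypothetical, and the scaling of each distance by the distance to the farthest selected neighbor (so that D_i lies between 0 and 1) is an assumption borrowed from common LWR practice; the text gives only w_i = (1 - D_i^3)^3. The distances in the usage lines are invented for illustration.

```python
def uniform_weights(distances):
    """Every selected neighbor gets the same weight."""
    return [1.0] * len(distances)

def cubic_weights(distances):
    """Cubic weights w_i = (1 - D_i**3)**3 with D_i = d_i / max(d).

    Scaling by the largest distance is an assumption of this sketch.
    """
    d_max = max(distances)
    if d_max == 0:                      # all neighbors coincide with the new object
        return [1.0] * len(distances)
    return [(1.0 - (d / d_max) ** 3) ** 3 for d in distances]

# Hypothetical distances from a new object to its five nearest neighbors:
# two close neighbors and three much more distant ones.
distances = [1.0, 1.2, 9.0, 9.5, 10.0]
w_uniform = uniform_weights(distances)
w_cubic = cubic_weights(distances)

# Share of the total cubic weight carried by the two closest neighbors.
share_two_nearest = sum(w_cubic[:2]) / sum(w_cubic)
```

With distances of this kind, nearly all of the cubic weight falls on the two closest neighbors and the farthest neighbor receives zero weight, whereas uniform weighting lets all five neighbors contribute equally to the local model.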
Finally, it should be mentioned that the LWR optimization applied in this study determines one global optimal complexity of the local models and one global optimal number of nearest neighbors. If the number of calibration samples and the calibration domain spanned by the data were large, so that LWR models with different complexity would be expected in different parts of the domain, then one could think of optimizing the model complexity and the number of nearest neighbors within different parts of the domain. However, to decide where the borders of such different parts lie within the domain, strong chemical knowledge would have to be available.

CONCLUSIONS

Locally weighted regression was applied to linear and nonlinear, to clustered, and to homogeneous data sets. The most significant improvement in the accuracy of predictions compared to global linear models was obtained for a nonlinear heterogeneous data set. For the nonlinear calibration problems studied here, simpler models were accomplished. This improvement is also important, because simpler models are easier to interpret. Moreover, such models are more robust. In general, it seems that LWR is to be recommended for much more general use than is now the case. The method yields better results than global PLS in two of the situations described above and is not worse or reduces automatically to global PLS in the other cases. It therefore minimizes the risk of losing accuracy when unnoticed clustering or nonlinearity is present.

There are two contraindications for LWR: (1) When the number of samples used to build local models is so small that the determination of the model parameters becomes very imprecise, then predictions may also be imprecise. This situation should, however, be recognized in the optimization phase of LWR. To cope with this situation, one could use more calibration samples or record replicated measurements for the existing objects. This would help to improve the precision of predictions of local models. (2) LWR models should not be applied when a strong extrapolation is needed in the prediction phase. In fact, such strong extrapolation is contraindicated for any calibration method.

The LWR variant that is based on PLS regression, Euclidean distance as a distance measure to find the nearest neighbors, and uniform weighting is recommended because it is the simplest and fastest and yields prediction errors that are at least as low as those of the more complex variants.

ACKNOWLEDGMENT

This study received financial support from the European Commission (SMT Program contract SMT4-CT95-2031), DWTC, and FWO. C. Sterna (Rhone-Poulenc), O. E. de Noord (Shell), and R. D. Maesschalck (VUB) are thanked for providing the data sets POLYMER, POLY-DAT, and SUGAR.

Received for review February 23, 1998. Accepted July 8, 1998. AC980208R
