Anal. Chem. 2000, 72, 1657-1665
Development of a Robust Calibration Model for Nonlinear In-Line Process Data Fre´de´ric Despagne† and D. Luc Massart*
ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussel, Belgium Paul Chabot
Elf Atochem North America, 2000 Market Street, Philadelphia, Pennsylvania 19103
A comparative study involving a global linear method (partial least squares), a local linear method (locally weighted regression), and a nonlinear method (neural networks) has been performed in order to implement a calibration model on an industrial process. The models were designed to predict the water content in a reactor during a distillation process, using in-line measurements from a near-infrared analyzer. Curved effects due to changes in temperature and variations between the different batches make the problem particularly challenging. The influence of spectral range selection and data preprocessing has been studied. With each calibration method, specific procedures have been applied to promote model robustness. In particular, the use of a monitoring set with neural networks does not always prevent overfitting. Therefore, we developed a model selection criterion based on the determination of the median of monitoring error over replicate trials. The back-propagation neural network models selected were found to outperform the other methods on independent test data. The need for ever-faster and less-expensive sample identification in the chemical, pharmaceutical, petroleum, or food industries, paralleled with the development of high-performance instrumentationsin particular low-cost silica optical fiberssand multivariate chemometrics techniques, has led to a multiplication of on-line or in-line implementations of near-infrared (NIR) process analyzers.1 The variability of processes imposes that rugged analyzers be implemented and that the calibration models that use the analyzer output for quantitative estimations be robust to variations in the spectral data. Among other variables, temperature often fluctuates in industrial processes. An increase in temperature results in a decrease in the degree and strength of hydrogen bonding.2 Temperature fluctuations cause changes in the equilibrium distribution of OHbound and OHunbound species which lead to apparent shifts and broadening of O-H absorption bands. Recently, Wu¨lfert et al. studied in detail the influence of temper* Corresponding author: (tel) +32 2 477 47 34; (fax) +32 2 477 47 35; (email)
[email protected]. †Current address: Eutech, 22 Avenue d′Auguette, 13117 Lave ´ ra, France. (1) Hassell, D.; Bowman, E. M. Appl. Spectrosc. 1998, 52, 18A-29A. (2) Li, Y.; Brown, C. W., Lo, S. J. Near Infrared Spectrosc. 1999, 7, 55-62. 10.1021/ac991076k CCC: $19.00 Published on Web 03/02/2000
© 2000 American Chemical Society
ature effects on vibrational spectra and multivariate calibration models.3 The effect of temperature leads to a curved relationship between concentration and absorbance. To accommodate such effects, one can choose to develop a partial least-squares (PLS) or a principal component regression (PCR) calibration model, both of which are well-known by chemometricians and offer inference properties to the user via plots of scores and loadings. However, these methods assume that the relationship between the spectra and the property of interest is linear. Curved effects can be modeled with these methods but they require that a larger number of factors be retained, with the risk of including irrelevant sources of variance in the model.4 Alternatively, one can decide to apply a nonlinear modeling technique. Nonlinear methods are more complex to use than linear ones, essentially because they are more prone to overfitting, especially when only a limited number of samples is available. However, on truely nonlinear data they outperform methods such as PLS or PCR.5,6 In a recent comparative study of multivariate calibration techniques applied to industrial data in interpolation conditions, locally weighted regression (LWR) and classical back-propagation neural networks (NN) appeared as two of the most efficient techniques out of 24, because of their flexibility and their ability to model not only nonlinear but also linear data sets.7 To develop a calibration model for in-line implementation on an industrial process in which temperature fluctuates, preliminary studies with off-line samples aimed at including temperature as an input variable in PLS models. Results were disappointing for aqueous solutions for which large changes in the spectra were observed as a function of temperature. A better possibility would be to use a temperature control on the sample before recording its NIR spectrum, but this approach involves more hardware, and the time necessary to stabilize the temperature of the solution between two measurements is unacceptably long. Therefore it was decided to develop nonlinear models to accommodate the temperature effects. (3) Wu ¨ lfert, F.; Kok, W. T.; Smilde, A. K. Anal. Chem. 1998, 70, 1761-1767. (4) Martens, H.; Naes, T. Multivariate Calibration; Wiley & Sons: Chichester, U.K., 1989. (5) Long, J.; Gregoriou, V.; Gemperline, P. Anal. Chem. 1990, 62, 1791-1797. (6) Blank, T.; Brown, S. D. Anal. Chem. 1993, 65, 3081-3089. (7) Centner, V.; Verdu-Andres, J.; Walczak, B.; Jouan-Rimbaud, D.; Despagne, F.; Pasti, L.; Poppi, R.; Massart, D. L.; de Noord, O. E. A comparison of multivariate calibration techniques applied to experimental NIR data sets. Appl. Spectrosc., in press.
Analytical Chemistry, Vol. 72, No. 7, April 1, 2000 1657
In this study, we compared the performance of PLS, LWR, and NN for modeling the water content in a process stream using NIR measurements. Different preprocessings of the spectral data were tested (selection of spectral range, second derivative), and the prediction errors of the three methods were evaluated on an independent data set. With each method, special care was taken during model optimization to promote model robustness and avoid overfitting. For selection of the number of factors to retain with PLS, a criterion based on the simulation of instrumental perturbations was used. For LWR, the randomization test was applied to select the number of nearest neighbors and number of factors to retain in the local linear models. For NN, a critical step that is rarely discussed in detail is the selection of the optimal topology and optimal set of weights. A procedure based on replicate trials was developed for selecting reliable solutions (solutions qualitatively similar on different subsets of samples drawn from the same population). THEORY (1) Development of PLS Models. PLS is considered as a reference algorithm for modeling multivariate chemical data.8-11 The principle of PLS is to project a data set from a space described by a set of original variables to another space defined by new variables called factors or latent variables (LVs). The response y is modeled as A
y)
∑b t
T a a qa
+e
(1)
a)1
where ta is the column vector of X-matrix scores on factor a, qTa is the row vector of y loadings on factor a, and e is a column vectors of residuals. The critical step in PLS is the determination of model complexity A, that is, the number of factors to retain. Models with too many factors overfit the calibration data and contain a significant amount of irrelevant information only related to noise. Models with too few factors do not properly describe the multivariate relationship between the descriptors and the response, resulting in a loss of predictive power. Leave-one-out cross-validation (LOOCV) is a widespread technique for estimating PLS model complexity; however, it is sensitive to overfitting and asymptotically inconsistent.12 LOOCV corresponds to validation with a small perturbation of the statistical sample; therefore, the number of factors associated with the minimum validation error often exceeds the number of factors necessary for a good generalization. To develop parsimonious (and therefore robust) PLS models, we applied an approach based on simulated perturbations that was proposed for model complexity estimation.13 The method consists of performing model validation on a subset of calibration samples on which instrumental perturbations have been simulated; it was inspired by a work of Gemperline, who used simulated perturbations for the selection of rugged NN models.14 The (8) Frank, I. E.; Friedman, J. H. Technometrics 1993, 35, 109-135. (9) Geladi, P. J. Chemom. 1988, 2, 231-246. (10) Hoskuldsson, A. J. Chemom. 1988, 2, 211-228. (11) Geladi, P.; Kowalski, B. Anal. Chim. Acta 1986, 185, 1-17. (12) Shao, J. J. Am. Stat. Assoc. 1993, 88, 486-494. (13) Despagne, F.; Massart, D. L. Anal. Chem. 1997, 69, 3391-3399. (14) Gemperline, P. Chemom. Intell. Lab. Syst. 1997, 39, 29-40.
1658 Analytical Chemistry, Vol. 72, No. 7, April 1, 2000
Figure 1. Principle of locally weighted regression.
simulated instrumental perturbations (stray light, optical path length change, wavelength shift, and heteroscedastic noise) are combined according to an experimental design so that, for each perturbation-free spectrum, 16 artificially perturbed spectra can be generated and added to the validation set. By following the evolution of validation error as a function of the number of factors, one can identify the factors that are the most affected by the simulated perturbations and therefore select the optimal model complexity. (2) Development of LWR Models. Locally weighted regression is considered as a method of choice for multivariate calibration in situations where data sets exhibit nonlinearity or clustering.15-18 The principle of LWR is to decompose a single model in a series of local linear models that can better accommodate clusters or nonlinearities present in the data. It is illustrated in Figure 1, for a univariate model y ) f(x). The “o” represent the calibration objects, the “*” represent the new objects whose Y-value must be estimated. In the calibration set, the nearest neighbors to each new object are determined in the X-space. For each new object, the set of nearest neighbors retained is used to build a local linear regression model. Once the model parameters have been calculated, the model can be applied to predict the Y-value of the new object. Several parameters must be fixed or optimized during LWR modeling: (1) the number of nearest neighbors in the local models around new objects; (2) the distance measure to identify nearest neighbors in X-space (Euclidean distance or Mahalanobis distance); (3) the nature of weighting of the neighbors in the regression step according to their distance to the new object (cubic weighting or uniform weighting); (4) the regression method for constructing local models (PCR or PLS); (5) the number of factors retained in the local models. The distance measure, neighbors weighting, and regression method can be set a priori by the user or optimized by trial and error. Recently, Centner and Massart studied the performance of LWR on several NIR data sets and they recommended the use of the Euclidean norm for distance determination in local models, PLS algorithm in the regression step, and uniform weighting of the calibration objects in local models.19 Only the number of (15) Wang, Z.; Isaksson, T.; Kowalski, B. R. Anal. Chem. 1994, 66, 249-260. (16) Naes, T.; Isaksson, T. Anal. Chem. 1990, 62, 664-673. (17) Naes, T.; Isaksson, T. NIR News 1994, 5/4, 7-8. (18) Naes, T.; Isaksson, T. NIR News 1994, 5/5, 8-9.
Figure 2. Local overfitting with locally weighted regression.
nearest neighbors and number of factors must then be optimized, which is done by LOOCV on the calibration set. The optimal number of nearest neighbors and factors are those that yield the best compromise over the whole calibration set and correspond to the lowest cumulative cross-validation error. An appealing feature of LWR is that it can perform at least as well as a full PLS model in the cross-validation step. For ncv samples and if all possible numbers of nearest neighbors are tested in the LWR cross-validation step (from 2 to ncv - 1), the rootmean-square error of cross-validation (RMSECV) of the (ncv 1)-nearest-neighbors model corresponds to the RMSECV of the global PLS model.
x∑ ncv
RMSECV )
i)1
applications where model simplicity and interpretability is a concern. A local modeling method should only be retained if the following conditions are fulfilled: (1) the complexity of the local models is similar or lower than the complexity of a global model; (2) the cross-validation error of local models is significantly smaller than the cross-validation error of a global model. To promote model robustness, we suggest systematically comparing the performance of local models with the performance of the global model (developed using all nearest neighbors) during LWR model optimization by cross-validation. The randomization test is applied to cross-validation residuals to test whether the distribution of residuals in the local and global models is significantly different.20 If a significant difference exists and the cross-validation error associated with local models is lower (which will be the case if the data set is nonlinear), then we suggest again applying the randomization test to compare residuals obtained with different model complexities and retaining the smallest possible number of factors in local models, providing that the crossvalidation error is not significantly different from the lowest crossvalidation error obtained. (3) Development of NN Models. It was demonstrated that back-propagation NN are with one hidden layer capable of modeling any continuous function.21 Therefore they can be successfully used in situations where a nonlinear relationship exists between spectroscopic data and the property of interest. NN for multivariate calibration have been extensively described in the literature.5,6,22-25 The form of a NN model that relates a set of independent variables xi to a response y can be expressed as nh
yˆ ) fo(θ′′ +
2
(yi - yi) ncv
(2)
where ncv is the number of samples used for cross-validation. A limitation of LWR is the tendency of the algorithm to local overfitting. Problems occur for data sets that are intrinsically linear but exhibit some clustering. If irrelevant sources of variance are present in the data set (imprecision of the reference method, presence of noise in the spectral data), LWR will tend to generate overfitted local models in the cross-validation step and interpretation of results can be misleading. One may get the wrong impression that the data set is nonlinear and select local models, when indeed a global linear model is more robust and interpretable. The situation is illustrated in Figure 2 for a simple univariate problem. In a regression model, the stability of coefficients is related to the square root of the number of calibration samples. Therefore, global models built with a large number of samples should be preferred to local models, except in situations where the degree of nonlinearity is such that the loss in stability caused by the use of a reduced number of calibration samples is compensated by the better precision of the local models.19 In addition to a better stability, a global PLS model offers the advantage that the set of factors is unique for all samples; therefore, identification of the main sources of variance dominating the process is easier with loading plots. This is particularly important for industrial in-line (19) Centner, V.; Massart, D. L. Anal. Chem. 1998, 70, 4206-4211.
∑ j)1
nd
w′′j fh(
∑w′ x θ′)) if
i
(3)
i)1
nd is the number of descriptors (input variables) presented at the input layer of the NN, nh is the number of nodes in the hidden layer, and fh and fo are the transfer functions used in the hidden layer and output layer, respectively. Generally, fh is nonlinear (sigmoid or hyperbolic tangent) and fo can be linear or nonlinear depending on the degree of nonlinearity of the data set. The w′if and w′′j represent the connection weights between input and hidden layey and between hidden and output layer, respectively. Biases θ′ and θ′′ act as offset terms by shifting the position of the transfer functions of the hidden and output layers. The adjustable parameters of the model (w′if and w′′j, θ′ and θ′′) are initially ascribed random values and optimized during an iterative learning procedure. Several algorithms exist to perform learning; they are derived from the basic error back-propagation principle. After each presentation of the full set of calibration samples (called an iteration or epoch), each weight and bias in the NN is corrected in proportion to the derivative of the cumulative error obtained at the output layer with respect to this parameter. The procedure of training a NN must be considered as an optimization problem in which one seeks the minimum of an error (20) van der Voet, H. Chemom. Intell. Lab. Syst. 1994, 25, 313-323. (21) Hornik, K.; Stinchcombe, M.; White, H. Neural Networks 1989, 2, 359366. (22) Gemperline, P. J.; Long, J. R.; Gregoriou, V. G. Anal. Chem. 1991, 63, 23132323. (23) Borggaard, C.; Thodberg, H. H. Anal. Chem. 1992, 64, 545-551. (24) Bos, M.; Bos, A.; van der Linden, W. E. Analyst 1993, 118, 323-328. (25) Despagne, F.; Massart, D. L. Analyst 1998, 123, 157R-178R.
Analytical Chemistry, Vol. 72, No. 7, April 1, 2000
1659
surface in the multidimensional space defined by the weights and biases. The exact profile of this surface depends on a number of factors such as the amount of noise present in the data and the nature of transfer functions used in the NN. In general, the error surface has a large number of local minimums or saddle points.26 There is no guarantee that a training algorithm will find the global minimum of the NN error surface, but this is not an issue if the data used for training contain some noise because this noise is associated with the global solution. Finding this global solution would lead to overfitting the training data. Since analytical data are rarely noise-free, the challenge is to find a solution as close as possible to the true underlying model, while not overfitting the noise present in training data. Overfitting with NN is equivalent to modeling a second-order relationship with a third-order or fourth-order polynomial. Overfitted models have poor generalization ability. The natural tendency of NN to overfit training data is even more pronounced with the second-order optimization methods currently used for training NN in most applications, because they tend to grow weights very rapidly.27 To avoid overfitting, it is recommended to reduce as much as possible the number of weights and biases in the NN (by eliminating irrelevant input and hidden nodes) and use a monitoring set. This monitoring set must contain samples as representative as possible of the global population, but different from the training samples. During the iterative training procedure, the NN performance is regularly tested on the monitoring samples. Training is stopped when the monitoring error stops decreasing or starts increasing, which indicates that the NN starts to overfit the training data and loses generalization ability. When the NN topology is optimized, it is customary to select a model according to its performance on the monitoring set. However, on strongly nonlinear or nonhomogeneous data sets, NN can develop several combinations of transfer functions that lead to equally good numerical values on training data, yet with different sets of parameters and different generalization ability. The use of an independent monitoring set to determine the end point of training does not always prevent overfitting due to chance correlation between the initial set of random weights and the monitoring samples. We pointed out that the training error surface is relatively complex, but it is also true for the monitoring error surface. To get an idea of the complexity of such surfaces, we represented in Figure 3 the distribution of root-mean-square errors of monitoring RMSEM for a nonlinear data set, over 50 replicate training runs of NN with the same topology and different sets of initial random weights.
RMSEM )
x∑
nmon
(yi - yˆi)2
i)1
nmon
(4)
where nmon is the number of samples used in the monitoring set. It is clear from this figure that replicate trials do not lead to a single minimum on the monitoring error surface. The variability of RMSEM values is rather large (from 0.22 to 0.38), but the user has to retain a single set of weights associated with a single (26) Hertz, J.; Krogh, A.; Palmer, R. Introduction to the Theory of Neural Computation; Addison-Wesley: Redwood City, CA, 1991. (27) Blank, T. B. Multivariate Analysis of Chemical Data Using Multilayer Perceptrons. Ph.D. Thesis, University of Delaware, 1996.
1660 Analytical Chemistry, Vol. 72, No. 7, April 1, 2000
Figure 3. Distribution of RMSEM values over 50 replicate training runs with the same topology and different sets of initial random weights, for a nonlinear industrial data set.
RMSEM value to define the final model. In this case, it is tempting to retain one of the four sets of weights associated with the lowest RMSEM values, but according to the histogram, these solutions are not representative of the population of solutions obtained over the 50 replicate trials. Alternatively, one can choose to retain the solution with the RMSEM value closest to the mean RMSEM, but this average value can be influenced by one or a few overoptimistic results obtained by chance. To develop robust and reliable NN models, we suggest basing the choice of the final model on a more robust criterion and retaining the set of weights associated with the RMSEM value closest to the median value obtained over the replicate trials. This criterion ensures that the final solution is not influenced by accidental training runs driven by chance to isolated local minimums of the monitoring error surface. The procedure can be implemented as follows: (1) Develop a NN model and optimize its topology (number of input and hidden nodes) by performing replicate trials with the same topology, starting with different sets of initial random weights. (2) For each trial, determine the end point of training with a monitoring set. (3) For each topology, save the set of optimized weights associated with the RMSEM value closest to the median over the replicate trials. (4) After testing different topologies, retain for the final model the set of optimized weights associated to the topology with the smallest median of RMSEM. This procedure involves replicate calculations that are relatively fast if a second-order algorithm is used for optimizing the NN weights and biases. EXPERIMENTAL SECTION (1) Description of the Process. The data were recorded on a pilot plant process in which two forms of organotins are distilled and dissolved in a tank initially filled with water. The goal of the study is to monitor in-line the evolution of water concentration in the tank as the organotins are dissolved. NIR spectra were recorded using a NIRSystem 5000 spectrophotometer equipped with a fiber-optic bundle and a transflectance probe installed in a sampling box on a side stream of the water concentrator. The clean probe was used as a reference. For each of the 193 samples
Figure 4. The 193 raw NIR spectra of aqueous solutions measured on-line.
Figure 6. Score plot in the PC1-PC2 plane for the raw spectra: o, calibration; *, test.
Figure 7. Loadings on PC1 for the raw spectra. Figure 5. Evolution of water content and temperature in a batch.
analyzed, 50 scans were coadded to obtain a spectrum between 1100 and 2500 nm with a 4-nm resolution (see Figure 4). The water content (% weight) was determined experimentally for each sample. The samples were recorded over 11 batches. To illustrate the temperature variations during the distillation process, we represented in Figure 5 the evolution of temperature and water content over one batch. In each batch, a few samples (37 in total) were selected at regular time intervals for additional laboratory analyses. As these 37 samples were considered as representative of the production, they constituted a natural test set that we used for validation of the different calibration models. The remaining 156 samples were used as a calibration set. The repartition of test and calibration samples can be seen in Figure 6 , where a score plot in the PC1PC2 plane has been reported. It can be seen on this figure that the data set is not homogeneous and that a significant nonlinearity exists in the X-space. (2) Data Preprocessing. The benefits of appropriate data preprocessing on model robustness and parsimony have been illustrated by de Noord.28 We observe in Figure 4 that the instrument dynamic range is exceeded in the region 1800-2100
nm (signal >1 A.U., region A) corresponding to the combination of O-H fundamental stretching and bending vibrations. Therefore an additional nonlinear effect can be expected due to detector saturation in this region. A visual study of the spectra showed that the signal-to-noise ratio (S/N) was degrading in the region 2400-2500 nm, blurring relevant chemical information (region B in Figure 4). In addition, the principal component loadings on PC1 showed a slope in baseline between 1500 and 2400 nm, probably due to variations in measurement conditions among the different batches and shift of the instrument baseline over time (Figure 7). This systematic effect was corrected by taking the second derivative of the spectra, calculated over an average of nine points. The score plot in Figure 8 shows that the second derivative seems to attenuate the nonlinearity in the X-space for the first two principal components. (3) Comparison of Methods and Preprocessings. The calibration models were developed as explained in the Theory section. The PLS model parameters were calculated on the 156 samples from the calibration set, and complexity optimization was performed by identifying the minimum prediction error on a subset of the calibration samples on which instrumental perturbations were simulated. (28) de Noord, O. E. Chemom. Intell. Lab. Syst. 1994, 23, 65-70.
Analytical Chemistry, Vol. 72, No. 7, April 1, 2000
1661
(4) Software. All algorithms used for this study are part of the Matlab Chemometrics Toolbox developed at the Vrije Universiteit Brussel. The initial Matlab code for the LevenbergMarquardt optimization algorithm was obtained from the Neural Network-Based System Identification Toolbox of M. Nørgaard, freely available on the Internet.32 All calculations were performed on a PII 233-MHz computer using Windows 95 as operating system.
Figure 8. Score plot in the PC1-PC2 plane for the second derivative spectra: o, calibration; *, test.
The LWR parameters were optimized by performing a LOOCV on the 156 calibration samples, followed by application of the randomization test to check that local models were significantly better than global models and to find the minimum complexity in local models. For the development of NN models, the 156-sample calibration set was split with the Duplex algorithm in a 116-sample training set and a 40-sample monitoring set. This algorithm is designed to maximize representativity between the two sets by selecting pairs of spectra that are the farthest apart from each other (Euclidean distance) and alternatively assigning these pairs to the training and monitoring set.25,29 Principal components were then determined on the training set. The monitoring and test sample scores were determined by projecting the samples on these PCs. Sample scores on the first 10 PCs were used as inputs to the initial NN models. The NN models were developed with hyperbolic tangent transfer functions in the hidden layer and linear transfer function in the output layer. During training, weight update estimates were calculated with the Levenberg-Marquardt algorithm, a pseudo-second-order optimization method that allows faster convergence compared to the generalized delta rule.30,31 Topology optimization was performed as explained in part 3 under Theory. For comparison of the different methods, the root-meansquared error of prediction (RMSEP) was evaluated with the 37sample test set.
RMSEP )
x∑ ntest
(yi - yˆi)2
i)1
ntest
(5)
To evaluate the influence of different preprocessings, models were developed on full spectra, spectra without region B (low S/N), and spectra without regions A (saturation of detector) and B. For each spectral range, models on raw and second-derivative data were developed. The different setups are summarized in Table 1. (29) Snee, R. D. Technometrics 1977, 19, 415-428. (30) Fletcher, R. Practical Methods of Optimization; Wiley: Chichester, U.K., 1987. (31) Derks, E. P. P. A.; Buydens, L. M. C. Chemom. Intell. Lab. Syst. 1998, 41, 171-184.
1662
Analytical Chemistry, Vol. 72, No. 7, April 1, 2000
RESULTS AND DISCUSSION (1) PLS. All prediction results are reported in Table 1. We observe that the PLS models developed on second-derivative data are always more parsimonious than the models developed on raw data. This is due to the elimination of one source of variation, the batch-to-batch background differences. However, the prediction results obtained on second-derivative spectra are not better than those obtained on raw spectra. One would expect that predictions improve with second-derivative spectra after removal of region B, since second-derivative spectra are more sensitive than raw spectra to a degradation in the S/N. This improvement is not observed since the RMSEP slightly increases; however, the model on second-derivative data is more parsimonious after removal of region B. Figures 6 and 8 show that the curved relationship between PC1 and PC2 is smoothed by the application of a second derivative, but this preprocessing also introduces some distortion in the X-space: batch 2, which exhibited the same pattern as the other batches on raw spectra, exhibits after second derivation a nonlinear pattern almost symmetric to the other batches. It is likely that this distortion propagates to the X-Y relationship and cannot be corrected with a linear method. The best overall result (RMSEP ) 0.87) is obtained on raw spectra when region B, which exhibits a low S/N, is removed (Figure 9). Despite the high number of factors used, the PLS model cannot accommodate the nonlinearity in the relationship between spectra and water content, as illustrated by the curvature in prediction residuals at both extremities of the water content range. (2) LWR. With all preprocessings, application of the randomization test confirmed that the local model results were significantly different from the results obtained with global PLS models using the same number of factors. The relatively small number of nearest neighbors retained in all cases (from 4 to 21, out of 155 possible neighbors) is characteristic of a truely nonlinear data set. The number of nearest neighbors retained with raw spectra is systematically smaller than with second-derivative spectra, which is again probably due to the significant curvature that affects all batches in the X-space. As for PLS, the removal of region B on second-derivative spectra does not lead to a reduction in RMSEP, but the complexity of local models drops from 6 to 3 LVs. The lowest prediction error for LWR (RMSEP ) 0.41) is also obtained with raw spectra, after removal of region B (see Figure 10). When raw spectra are used and if region A is also removed, the RMSEP increases from 0.41 to 0.83 for LWR and from 0.87 to 1.18 for PLS. This indicates that region A is important for calibration because it contains some information nonlinearly (32) Nørgaard, M. Neural Network Based System Identification Toolbox v1.0, Technol. Report 95-E-773, Institute of Automation, Technical University of Denmark. http://www.iau.dtu.dk/Projects/proj/nnsysid.html.
Table 1. Prediction Results preprocessing raw spectra second-derivative spectra raw spectra, region B removed second-derivative spectra, region B removed raw spectra, regions A and B removed second-derivative spectra, regions A and B removed
PLS
LWR
NN
8 LV RMSEP ) 1.15 7 LV RMSEP ) 1.08 9 LV RMSEP ) 0.87 6 LV RMSEP ) 1.20 7 LV RMSEP ) 1.18 4 LV RMSEP ) 1.32
4 nn, 3 LV RMSEP ) 0.64 21 nn, 6 LV RMSEP ) 0.58 4 nn, 3 LV RMSEP ) 0.41 11 nn, 3 LV RMSEP ) 0.60 9 nn, 5 LV RMSEP ) 0.83 10 nn, 3 LV RMSEP ) 0.71
4-4-1 sl, [1 2 3 4] RMSEP ) 0.33 4-4-1 sl, [1 2 4 5] RMSEP ) 0.55 4-5-1 sl, [1 2 3 4] RMSEP ) 0.33 4-4-1 sl, [1 2 4 5] RMSEP ) 0.31 4-4-1 sl, [1 2 3 4] RMSEP ) 0.53 4-4-1 sl, [1 2 3 4] RMSEP ) 0.67
Figure 9. Prediction results for PLS model built on raw spectra, region B removed.
Figure 10. Prediction results for LWR model built on raw spectra, region B removed.
related to the water content. Compared to PLS, the curvature in LWR residuals has been corrected and the prediction error is about 2 times lower. However, in this case, regression coefficients for each local model are determined on a subset of four samples only. Therefore, the robustness of the LWR model is questionable. (3) NN. Whatever the preprocessing, NN systematically outperform PLS and LWR. However, we insist on the need to identify a reliable solution with replicate NN training runs started with different sets of random points and eliminate the most extreme solutions (lowest or highest RMSEM values). As an
illustration, for the models developed on raw spectra with region B removed, if instead of selecting the set of weights for the final model according to the median RMSEM value (0.32), one selects the set of weights associated to the lowest RMSEM value over 20 training runs (0.21), an RMSEP value of 2.40 instead of 0.33 is obtained. The number of training runs was then extended to 100, but we did not observe any other occurrence of such a high RMSEP value (all other values were