Anal. Chem. 1997, 69, 3391-3399
Optimization of Partial-Least-Squares Calibration Models by Simulation of Instrumental Perturbations

Frédéric Despagne and Désiré-Luc Massart*
ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium

Onno E. de Noord
Shell International Chemicals BV, Shell Research and Technology Centre, P.O. Box 38 000, 1030 BN Amsterdam, The Netherlands
A critical step in partial-least-squares (PLS) modeling is the optimization of the model. Cross-validation is often applied, but in spite of its statistical properties, it suffers from some severe shortcomings. In particular, cross-validation has a tendency to give overfitted models, whereas parsimonious models should be preferred. We propose an alternative form of internal validation, based on the simulation of instrumental perturbations on a subset of calibration samples. A simple criterion is proposed for the adjustment of the perturbations. The method is applied to the validation of nine PLS1 calibration models on industrial data sets and compared with cross-validation and with cross-validation combined with a randomization test. It is shown that parsimonious models can be obtained, with good predictive power when they are applied to external test data.

Multivariate techniques are particularly suited for the determination of mixture composition from complex spectral data such as near-IR data.1 One of the most popular multivariate techniques is the partial-least-squares (PLS) algorithm, which has been extensively described in the chemometrics literature.2-5 The principle of PLS is to project a data set from a space described by a set of original variables to another space defined by new variables called factors, or latent variables (LV). The factors are linear combinations of the original variables that span the same data space; their number is equal to the number of original variables (or to the number of samples minus 1 if this number is smaller). Factors are calculated iteratively to maximize the covariance between the X-variables (independent variables) and the Y-variable (dependent variable), so they can be ordered as a function of the amount of variance they explain in the X- or Y-data, respectively. If we assume that only the first few factors carry relevant information and that the other factors explain mainly noise present in the data set, then we can discard these less informative factors and reduce the original space to a much smaller subspace.

(1) Martens, H.; Naes, T. Multivariate Calibration; Wiley & Sons: Chichester, U.K., 1989.
(2) Frank, I. E.; Friedman, J. H. Technometrics 1993, 35, 109-135.
(3) Geladi, P. J. Chemom. 1988, 2, 231-246.
(4) Hoskuldsson, A. J. Chemom. 1988, 2, 211-228.
(5) Geladi, P.; Kowalski, B. Anal. Chim. Acta 1986, 185, 1-17.
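To make the factor-selection step described above more concrete, the following is a minimal Python sketch (not the perturbation-based procedure proposed in this paper) that fits PLS1 models with increasing numbers of latent variables on synthetic data and retains the number that minimizes the prediction error on a held-out subset. The data dimensions and the use of scikit-learn's PLSRegression are illustrative assumptions only.

```python
# Minimal PLS1 calibration sketch on synthetic data (illustrative only;
# not the instrumental-perturbation procedure developed in this paper).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_wavelengths = 60, 200                 # hypothetical data dimensions
X = rng.normal(size=(n_samples, n_wavelengths))    # stand-in for NIR spectra
y = X[:, :10].sum(axis=1) + 0.1 * rng.normal(size=n_samples)  # synthetic property

# Split calibration data into a training subset and a held-out validation subset.
X_cal, X_val, y_cal, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_A, best_rmse = None, np.inf
for A in range(1, 16):                             # candidate numbers of latent variables
    pls = PLSRegression(n_components=A)
    pls.fit(X_cal, y_cal)
    rmse = np.sqrt(np.mean((pls.predict(X_val).ravel() - y_val) ** 2))
    if rmse < best_rmse:
        best_A, best_rmse = A, rmse

print(f"selected {best_A} latent variables, validation RMSE = {best_rmse:.3f}")
```

The sketch simply illustrates how model complexity (the number of latent variables) is traded against prediction error; the paper's contribution is the criterion used to make that choice.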
For the situation where we have only one dependent variable y (PLS1), we can represent the model in the m-dimensional space of original variables x_i as

y = \sum_{i=1}^{m} b_i x_i + e    (1)

The b_i are the regression coefficients. In the factor space, this equation becomes
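As a small numerical illustration of eq 1, the sketch below generates data according to the model y = Σ b_i x_i + e and recovers the coefficients b_i and residuals e. For simplicity the coefficients are estimated here by ordinary least squares rather than PLS; the data and dimensions are assumptions made purely for illustration.

```python
# Numerical illustration of eq 1: y = sum_i b_i x_i + e (illustrative sketch only;
# coefficients are estimated by ordinary least squares, not PLS).
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 30                                  # hypothetical: 5 variables, 30 samples
X = rng.normal(size=(n, m))
b_true = np.array([1.0, -0.5, 0.0, 2.0, 0.3])
y = X @ b_true + 0.05 * rng.normal(size=n)    # data generated according to eq 1

b, *_ = np.linalg.lstsq(X, y, rcond=None)     # estimated regression coefficients b_i
e = y - X @ b                                 # residuals e
print(f"estimated b: {np.round(b, 3)}")
print(f"residual standard deviation: {e.std():.3f}")
```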