Anal. Chem. 1998, 70, 35-44
Theoretical Justification of Wavelength Selection in PLS Calibration: Development of a New Algorithm Clifford H. Spiegelman,*,† Michael J. McShane,‡ Marcel J. Goetz,‡,§ Massoud Motamedi,| Qin Li Yue,†,⊥ and Gerard L. Cote´‡
Department of Statistics and Biomedical Engineering Program, Texas A&M University, College Station, Texas 77845, and Biomedical Engineering Center, Laser & Spectroscopy Program, University of Texas Medical Branch, Galveston, Texas 77550
The mathematical basis of improved calibration through selection of informative variables for partial least-squares calibration has been identified. A theoretical investigation of calibration slopes indicates that including uninformative wavelengths negatively affect calibrations by producing both large relative bias toward zero and small additive bias away from the origin. These theoretical results are found regardless of the noise distribution in the data. Studies are performed to confirm this result using a previously used selection method compared to a new method, which is designed to perform more appropriately when dealing with data having large outlying points by including estimates of spectral residuals. Three different data sets are tested with varying noise distributions. In the first data set, Gaussian and log-normal noise was added to simulated data which included a single peak. Second, near-infrared spectra of glucose in cell culture media taken with an FT-IR spectrometer were analyzed. Finally, dispersive Raman Stokes spectra of glucose dissolved in water were assessed. In every case considered here, improved prediction is produced through selection, but data with different noise characteristics showed varying degrees of improvement depending on the selection method used. The practical results showed that, indeed, including residuals into ranking criteria improves selection for data with noise distributions resulting in large outliers. It was concluded that careful design of a selection algorithm should include consideration of spectral noise distributions in the input data to increase the likelihood of successful and appropriate selection. Spectroscopic methods are being increasingly employed for quantitative applications in chemistry, biology, and medicine. While advances in instrumentation bring increased resolution and sensitivity, there persists a need to convert the collected data into useful information. Multivariate calibration models are statistical †
Department of Statistics, Texas A&M University. Biomedical Engineering Program, Texas A&M University. § Current address: Sunshine Medical Instruments, Inc., 550 Pilgrim Dr., Suite F, Foster City, CA 94404. | University of Texas Medical Branch. ⊥ Current address: Statistical and Mathematical Sciences, Eli Lilly and Co., Lilly Corporate Center, Indianapolis, IN 46285. ‡
S0003-2700(97)00573-8 CCC: $14.00 Published on Web 01/01/1998
© 1997 American Chemical Society
tools that have been developed as one way of aiding investigators in this capacity. Partial least-squares (PLS) regression is a popular multivariate calibration technique for quantitative analysis of spectral data because of its ability to overcome problems common to these data such as collinearity, band overlaps, and interactions.1,2 Interactions that cause new molecular bonding formations give rise to spectra that are nonlinear in Beer’s law because explicit information about the concentration of this new species is not included by the model.1 However, the results of any calibration technique, including PLS, can only be as good as the data the algorithm is given. It has been noted experimentally that including data with poor information regarding the parameter of interest results in less than optimal calibrations.3,4 As a result, considerable effort has been directed toward developing and evaluating different methods to find the useful variables or eliminate the noisy ones.3-19 Selection procedures have been coupled with a variety of different calibration modeling techniques, with most work concentrated in the area of conventional multiple linear regression (MLR). Factorbased techniques, such as PLS and principal component regres(1) Haaland, D. M.; Thomas, E. V. Anal. Chem. 1988, 60, 1193. (2) Martens, H.; Næs, T. Multivariate Calibration; Wiley & Sons: New York, 1989; pp 314-321. (3) Rossi, D. T.; Pardue, H. L. Anal. Chim. Acta 1985, 175, 153. (4) Brown, C. W.; Lynch, P. F.; Obremski, R. J.; Lavery, D. S. Anal. Chem. 1982, 54, 1472. (5) Goetz, M. J.; Spiegelman, C. H.; Cote´, G.; Motamedi, M. Technical Report No. 226, Department of Statistics, Texas A&M University, 1995. (6) Bangalore, A. S.; Schaffer, R. E.; Small, G. W.; Arnold, M. A. Anal. Chem. 1996, 68, 4200. (7) Navarro-Villoslada, F.; Pe´rez-Arribas, L. V.; Leo´n-Gonza´lez, M. E., Polo-Dı´ez, L. M. Anal. Chim. Acta 1995, 313, 93. (8) Jouan-Rimbaud, D.; Walczak, B.; Massart, D. L.; Last, I. R.; Prebble, K. A. Anal. Chim. Acta 1995, 304, 285. (9) Lucasius, C. B.; Kateman, G. Trends Anal. Chem. 1991, 10, 254. (10) Jouan-Rimbaud, D. Massart, D. L.; Leardi, R.; de Noord, O. E. Anal. Chem. 1995, 67, 4295. (11) Kalivas, J. H.; Roberts, N.; Sutter, J. M. Anal. Chem. 1989, 68, 2024. (12) Sasaki, K.; Kawata, S.; Minami, S. Appl. Spectrosc. 1986, 40, 185. (13) Liang, Y.; Xie, Y.; Yu, R. Anal. Chim. Acta 1989, 222, 347. (14) Lucasius, C. B.; Beckers, M. L. M.; Kateman, G. Anal. Chim. Acta 1994, 286, 135. (15) Ho ¨rchner, U.; Kalivas, J. H. Anal. Chim. Acta 1995, 311, 1. (16) Brown, P. J.; Spiegelman, C. H.; Denham, M. C. Philos. Trans. R. Soc. London 1991, 337, 311. (17) Brown, P. J. J. Chemometrics 1992, 6, 151. (18) Brown, P. J. J. Chemometrics 1993, 7, 255. (19) Thijssen, P. C.; Vogels, L. J. P.; Smit, H. C.; Kateman, G. Fresenius Z. Anal. Chem. 1985, 320, 531.
Analytical Chemistry, Vol. 70, No. 1, January 1, 1998 35
sion (PCR), have been considered with wavelength selection algorithms less often, although improvements in performance have been reported in several studies.5-8 Regardless of the regression method employed, the thrust behind wavelength selection is the identification of a subset of spectral wavelengths that will produce the smallest possible errors when used to predict chemical concentrations. To arrive at the true optimal subset of wavelengths for a given data set, a consideration of all possible combinations is computationally prohibitive. Elaborate search-based selection methods such as the genetic algorithm (GA),6,9,10 simulated annealing (SA),11 and branch-and-bound (BB)12,13 have been developed to arrive at a solution more efficiently, and these work well across data types. Comparisons between SA, GA, and some stepwise procedures indicate that optimal solutions are found with SA and GA, while stepwise selection is unlikely to arrive at a global optimum, especially with large, complex data sets.14,15 However, search strategies such as GA, SA, and BB still tend to be slow and cumbersome compared to the simpler and more intuitive stepwise methods. Several stepwise selection schemes have been proposed to select wavelengths from small data sets faster and more methodically than the search procedures.5,16-21 The advantages of these methods are speed and simplicity, but they may suffer in robustness. They depend upon an ordering or ranking of the variables, which usually makes them sensitive to noise distributions and therefore they may not perform consistently across data of different noise character. Even though the concept of variable selection is intuitively wise and is clearly helpful in practical terms, the technical bases of the improvements when used with PLS to our knowledge have not been reported. Our first objective is to demonstrate both in practice and in theory that judicious selection of spectral regions for PLS calibration is good science. A mathematical explanation of the PLS calibration slopes and the effects of included noise is used to justify variable selection. The effects of selection on prediction are evaluated for simulated data, near-infrared (NIR) spectroscopy data, and Raman spectroscopy data, each having different noise characteristics. As the theory will indicate, the improvement in PLS calibration through variable selection is valid across data types. While it is true that selection can help in most cases, a given selection method may not perform equally well across data types when adequate attention is not given to spectral noise type and magnitude; this is an important consideration in choosing a selection procedure. Our second purpose, then, is to set forth a new selection criterion that is less affected by spectral outliers than some previous methods. This criterion is evaluated by comparison of performance across a variety of different data types. THEORY Our first discussion of the theory will center on the mathematical foundation of improved calibration through wavelength selection. Second, the basis of an improved ranking criterion will be presented. Finally, a discussion of the algorithm employed in this study will be given. (20) McShane, M. J.; Cote´, G. L.; Spiegelman, C. H. Appl. Spectrosc., 1997, 51, 1559. (21) Centner, V. Massart, D. L.; de Noord, O. E.; de Jong, S.; Vandeginste, B. M.; Sterna, C. Anal. Chem. 1996, 68, 3851.
36
Analytical Chemistry, Vol. 70, No. 1, January 1, 1998
I. Justification for Selection. The theoretical justification proceeds with three nested models. The first two are not realistic, but they lead to closed form solutions that transparently show the effect of including background noise regions in PLS calibrations. The third model includes additive noise in the peak regions, but this does not change the effects of purely noisy regions. Therefore, the results from model 2 can be extended to model 3. Let Y, the instrumental response (such as measured light intensity or absorbance), have two components: Y1, the peak regions for the chemical of interest and Y2, the background noise regions. The vector x holds the reference values (e.g., concentrations of the chemical of interest) and β is a vector of regression coefficients defining the relationship between Y and x. A capital E denotes random error. The appendix describes the notation in more detail. The three models are
Y ) (Y1,Y2) model 1: Y1 ) xβ′, Y2 ≡ 0 model 2: Y1 ) xβ′, Y2 ) E2 model 3: Y1 ) xβ′ + E1, Y2 ) E2 The general inverse calibration model is defined as
xˆ ) βˆ PLSY
(1)
where xˆ is the vector of predicted chemical values and βˆ PLS is the estimate for the regression vector, produced through PLS calibration. For model 1 the PLS calibration slopes are
βˆ PLS,1 )
β β′β
(2)
and for model 2 they are
βˆ PLS,2 )
(
(x′x)[(x′x)2(β′β) + x′EE′x]
(x′x)3(β′β)2 + 2x′xβ′βx′EE′x + x′(EE′)2x
)
β (3)
It is not hard to see from (3) that inclusion of a noisy background region has two typical effects. The first is that it produces a large relative bias toward the origin, due to the noise term E found in the denominator. As E increases, the calibration slopes converge to zero and it is clear that the inclusion of E should be avoided to preserve sensitivity. Second, a small additive bias is produced in the opposite direction, due to the noise terms in the numerator. This effect can be described as
additive biasβ )
(
(x′x)[x′EE′x] 3
2
(x′x) (β′β) + 2x′xβ′βx′EE′x + x′(EE′)2x
)
β (4)
Again it is transparent that the calibration suffers from inclusion of the noise terms. The Appendix includes a more detailed discussion of these effects. The practical result of reduced model
bias is increased calibration accuracy and precision, leading to smaller prediction errors, the goal of every calibration. II. Improved Selection. As mentioned in the introduction, some previously proposed stepwise selection criteria may not work equally across different data types. For example, a previously proposed selection criterion is the variable signal-to-noise ratio |βˆ i/σˆ i|, where βˆ i is the estimated slope from regressing the spectra at the ith wavelength on concentration and σˆ i is the estimated standard deviation of spectral measurements, which gives a measure of usefulness for each variable.5,14-16 However, for data with large outlying data points such as those produced by cosmic rays in the CCD of the Raman instrument, σˆ tends to underestimate the true noise contribution, and hence, variables are given greater weight than they deserve. The implication is that inclusion of such a variable with a large outlier will increase the bias in the estimate of regression coefficients. The reason for this problem and a possibility for alleviation are given in the following discussion. The Appendix provides supporting information on this topic. In general, for data with large outlying points, the noise is better modeled by a log-normal or exponential distribution than a Gaussian. It is known that the fourth power of spectral residuals 4 , typically converge to triple the fourth at each wavelength, Eˆ e1i power of variance estimates, 3σˆ 4i , as the noise distribution nears Gaussian. From this information, one can see that ranking using only residuals is equivalent to a ranking using only variance estimates if the noise is normally distributed. However, use of residuals is more appropriate when the noise is distributed differently. Therefore, we propose to improve selection by ranking variables according to a modified signal-to-noise ratio
| | βˆ i
(5)
4
xEˆ e41i
A comparison of this ranking with the one previously employed, as well as a hybrid approach was performed and the results presented below. It should be noted that, while outliers have not often been explicitly addressed in the literature, one method in the literature described as uninformative variable elimination by partial leastsquares (UVE-PLS) has been reported with a variant (UVE-M) that is robust to outliers.21 The ranking proposed in that work was based on the ratio of regression coefficient to the standard deviation of the regression coefficient.21 The modification to increase robustness was the use of the ratio of median value of regression coefficients to the interquartile range of regression coefficients at each wavelength.21 The benefit gained from this variation was not clear from the presented results, but the appearance of the robust variant does indicate consideration of the problem of outlying data points. III. Selection Algorithm. Variables are ranked from most desirable to least desirable according to a tunable signal-to-noise ratio
|x 4
νE ˆ e41i
βˆ i (n - 1) 4 σˆ i + (1 - ν) n
|
(6)
derived from the above discussion and results given in the
Figure 1. Flow diagram for execution of selection algorithm.
Appendix. Both (Eˆ eli4 )1/4 and σˆ are incorporated and a weighting variable or “tuning parameter”, ν, is assigned that is used to vary the degree of influence of (Eˆ eli4 )1/4 and σˆ relative to each other. As ν f 0, the ranking converges to one proportional to that used by previous authors, namely |βˆ i/σˆ i|.3,13-15 This allows us to evaluate the effectiveness of |βˆ i/σˆ i| (ν ) 0) and |βˆ i/(Eˆ eli4 )1/4| (ν ) 1) alone and in combination (ν ) 0.5). We seek to find the set of variables producing minimum crossvalidated mean-squared error (CVMSE) as calculated by
CVMSE )
n
(yˆi - yi)2
i-1
n
∑
(7)
where yˆi is the predicted chemical concentration for the ith sample in a model developed without the ith sample, yi is the actual concentration of the ith sample, and n is the total number of samples. The algorithm is straightforward: once the variables are ranked, PLS models are generated in iterative fashion, each loop including the next highest ranked variable into the test set. The algorithm attempts to minimize prediction errors and therefore continues until all variables are included. The variables providing a minimum CVMSE are then selected. The CVMSE is simultaneously computed for all latent variables specified for the PLS model and thus helps to determine the optimal number of factors as well. Figure 1 gives a simple flow chart of the procedure. Analytical Chemistry, Vol. 70, No. 1, January 1, 1998
37
Table 1. Comparison of Selection Results for New Method ν)0 data type simulated SNR ) 60 SNR ) 5.5 SNR ) 1.75 NIR Raman raw processed
MSEP (mg/dL)2
ν ) 0.5 no. of variables
MSEP (mg/dL)2
ν)1 no. of variables
MSEP (mg/dL)2
no. of variables
2.1112 1.9991 1.5912 49.4103
2 38 42 286
2.1112 1.9958 1.6033 42.5265
2 45 40 282
2.1112 1.9964 1.5730 49.7410
2 46 47 280
3801.8579 368.3165
317 512
2788.4020 428.4004
480 576
2575.7881 432.9873
501 580
It should be noted that the choice of PLS latent variables is an important step in calibration and may be considered during wavelength selection. Our approach here was to set the maximum number of latent variables based upon the full set. If use of another latent variable did not account for more than 2% of spectral variance or improve the prediction error by 2%, then this LV was not used. In the simulation and NIR data, only one factor was needed using all the variables so only one latent variable was calculated. For both sets of Raman data, three latent variables were considered and the selection was performed for each case. EXPERIMENTAL SECTION Instrumentation. Near-infrared spectroscopic data were obtained from a Mattson Infinity FT-IR spectrometer (Madison, WI) with a 20-W tungsten-halogen source, CaF2 beam splitter, and cryogenically cooled InSb detector. The 2.0-2.5-µm (50004000-cm-1) wavelength region was isolated through the combined use of an Oriel 2500-nm cutoff shortwave-pass filter (Stratford, CT) and a flat germanium window with antireflection coating which allowed a minimum of 80% transmittance above 2.05 µm. Samples were placed in 1-mm Infrasil quartz cells from Starna (Atascadero, CA) for sampling. Raman experimental data were obtained using a 180° confocal probe with a Spex 500M spectrometer (Edison, NJ) coupled to a cryogenically cooled 256 × 1024 CCD detection system. An argon laser was used to excite the samples with 20 mW of 514.5-nm radiation. Samples were contained in 20-mL scintillating vials. The monochromator used an entrance slit width of 200 µm. A 1200 grooves/mm holographic grating blazed at 900 nm was used to disperse the light. Reagents. For the NIR data, reagent-grade crystalline glucose from EM Science (Gibbstown, NJ) was dissolved in GTSF-2 cell culture medium manufactured by Krug Life Sciences (Houston, TX) to compose 41 samples of different glucose concentration covering 0-492 mg/dL. For the Raman experiment, 41 samples spanning 0-1000 mg/dL were mixed by dissolving D-(+)-glucose in 0.9% isotonic saline from Baxter Healthcare (Deerfield, IL). Procedures. Computer models were generated to simulate spectroscopic absorbance data. A Gaussian peak with standard deviation of 3 was calculated for five points, representing an absorption peak. This was placed amid 100 other points (i.e., wavelengths) having value zero. The vector size was 1 × 105. A 1 × 60 power (concentration) vector was generated and multiplied by the pure spectrum to obtain a matrix of simulated chemical absorbances of size 60 × 105. To make the simulation more 38 Analytical Chemistry, Vol. 70, No. 1, January 1, 1998
realistic, noise with σ ) 1, 10, and 25% of the peak absorbance value was added to the system. The noise added to each point had a 90% probability of coming from a normal distribution and a 10% probability of coming from a log-normal distribution. For the NIR data, double-sided interferograms were collected for the samples in random order with respect to concentration. Each interferogram was based on 256 coadded scans. Fourier transformation was performed with Mattson WinFirst 3.1 software and included triangular apodization and Mertz phase correction. The ratios of single-beam sample spectra with pure GTSF-2 backgrounds were used to obtain glucose absorbance spectra. Three replicate spectra were collected for each sample without removing the cuvette from the spectrometer. The data were digitally filtered to remove baseline and slope shifts due to temperature variation. A description of the digital filtering procedure is found elsewhere.20 In the Raman system, integration of the signal was carried out for 90 s. This resulted in the Raman Stokes spectra of 1024 points covering from 190 to 1742 cm-1. Large spikes in the data due to cosmic ray events were removed, and the point was replaced by interpolation. Scaling artifacts were removed from both sets using the technique of multiplicative signal correction (MSC).2 Software. All calculations were performed in MATLAB 5.0 produced by Mathworks (Natick, MA). PLS regression and crossvalidation were performed with software contained in the PLS Toolbox by Eigenvector Technologies (Manson, WA). RESULTS AND DISCUSSION The selection algorithm (alternately using ν ) 0, 0.5, and 1) was applied to each of the different data sets. Calibration sets were composed of 75% of the spectra in each data set, chosen randomly with the constraint that end points (smallest and largest concentrations) were included. These data were used with the selection algorithms. After selection of wavelengths based upon the minimum cross-validated mean-squared error, prediction sets composed of the remaining spectra were used to evaluate the selection with independent data. For each tested set of data, three plots were produced to help explain the inputs to the algorithm and the results. For Figures 2-7, the following information describes the figure parts. Part a simply displays the spectral data input to the algorithm, before mean-centering. Part b shows the rankings given by the program for the three cases ν ) 0, 0.5, and 1. These plots give insight to the differences in selected variables between the proposed criteria and how this varies between data
Figure 2. (a) Representative simulated spectra of SNR ) 60, with concentrations 0.1 (solid), 0.9 (dotted), and 2.9 (dashed). (b) Rankings for ν ) 0 (solid), 0.5 (dotted), and 1 (dashed). (c) Inverse CVMSE with increasing model size for ν ) 0 (circle), 0.5 (triangle), and 1 (asterisk).
Figure 3. (a) Representative simulated spectra of SNR ) 5.5, with concentrations 0.1 (solid), 0.9 (dotted), and 2.9 (dashed). (b) Rankings for ν ) 0 (solid), 0.5 (dotted), and 1 (dashed). (c) Inverse CVMSE with increasing model size for ν ) 0 (circle), 0.5 (triangle), and 1 (asterisk).
sets. Finally, part c shows the evolution of inverse CVMSE as more variables are added to the model. The maximum value in this plot identifies the set of selected variables. Table 1 describes the statistics of the experiments in terms of mean-squared error of prediction (MSEP), as obtained for the independent prediction
data using the PLS calibration model defined by the chosen variables. The following discussion of individual results for each data set will fill in details. Simulation. Simulated absorption peaks appear at variables 51-55 in each case, as seen in Figures 2a, 3a, and 4a. For the Analytical Chemistry, Vol. 70, No. 1, January 1, 1998
39
Figure 4. (a) Representative simulated spectra of SNR ) 1.75, with concentrations 0.1 (red), 0.9 (green), and 2.9 (blue). (b) Rankings for ν ) 0 (solid), 0.5 (dotted), and 1 (dashed). (c) Inverse CVMSE with increasing model size for ν ) 0 (circle), 0.5 (triangle), and 1 (asterisk).
first case, the signal-to-noise ratio (SNR) was set to approximately 60. The rankings, shown in Figure 2b clearly identify the absorption peak centered around wavelength 53. With this data, the different rankings provide the identical result as seen from 40 Analytical Chemistry, Vol. 70, No. 1, January 1, 1998
Figure 5. (a) Representative NIR spectra with glucose concentrations of 125 (solid), 250 (dotted), and 375 mg/dL (dashed). (b) Rankings for ν ) 0 (solid), 0.5 (dotted), and 1 (dashed). (c) Inverse CVMSE with increasing model size for ν ) 0 (circle), 0.5 (triangle), and 1 (asterisk).
Figure 2c and the first row in Table 1. In each case, a minimum CVMSE is found when two variables are included. Exclusion of useless variables improves the prediction and only two of the five true response variables were used to obtain a minimum CVMSE.
Figure 6. (a) Representative raw Raman Stokes spectra with glucose concentrations of 0 (solid), 501.19 (dotted), and 1000 mg/ dL (dashed). (b) Rankings for ν ) 0 (red, 0.5 (green), and 1 (blue). (c) Inverse CVMSE with increasing model size for ν ) 0 (circle), 0.5 (triangle), and 1 (asterisk). Three latent variables were used: the dotted lines indicate traces for a single factor, the dashed traces correspond to two factors, and the solid traces are from three-factor models.
In the second and third cases, SNRs of 5.5 and 1.75 were considered, respectively. From the rankings, an increasing
Figure 7. (a) Processed Raman Stokes spectra with glucose concentrations of 0 (solid), 501.19 (dotted), and 1000 mg/dL (dashed). (b) Rankings for ν ) 0 (red), 0.5 (green), and 1 (blue). (c) Inverse CVMSE with increasing model size for ν ) 0 (circle), 0.5 (triangle), and 1 (asterisk). Three latent variables were used: the dotted lines indicate traces for a single factor, the dashed traces correspond to two factors, and the solid traces are from three-factor models.
disparity is seen as SNR decreases. Despite this fact, the results in terms of CVMSE remain identical for the first five wavelengths selected. However, the wavelengths selected after this point are Analytical Chemistry, Vol. 70, No. 1, January 1, 1998
41
not the same for the different criteria as is seen by the prediction results. For the first case (SNR ) 5.5), a minimum CVMSE is found when only variance is used in ranking the variables, but the MSEP obtained from the independent set using these variables was larger than those given by rankings including residuals. In the second case (SNR ) 1.75), use of residuals decreases both the MSEP and the number of wavelengths chosen. The results indicate that when noise is not normally distributed, as was the case here, increasing noise levels changes the way the different rankings behave, an important consideration when selection performance is evaluated. This dependence will be illustrated further with data from NIR and Raman spectroscopy experiments where the noise types are different. NIR Spectroscopy. With Fourier processing the NIR spectra are very smooth, having the high-frequency noise filtered out. Absorption peaks for glucose appear approximately at 2.275 and 2.326 µm. The digital filtering has produced negative peaks as well, due to valleys between peaks, that appear around 2.25 and 2.3 µm. Rankings are very similar among ν ) 0, 0.5, and 1 as shown by Figure 5b. There is a small difference in the order of selection yet this has little bearing on the CVMSE for this data set, as is seen from Figure 5c. Table 1 gives the MSEP values obtained for this experiment, showing that the different rankings give comparable results. Raman Spectroscopy. Two forms of the Raman data were used. The first was the raw state which included baseline variations and large outliers produced by the cosmic rays. The second form was the processed data, in which the effects of cosmic rays had been removed. In both cases, scaling effects were reduced through use of MSC. Figures 6a and 7a show these input data, respectively. Glucose Raman peaks are located around 1120 and near 518 cm-1. Rankings for the unprocessed spectra reveal large differences between the criteria. As seen in Figure 6c and Table 1, the CVMSE clearly benefits from the inclusion of residuals in the selection ranking. This effect is due to the extremely large outliers exhibited by the spectra, which are not adequately accounted for by simply using variance to weight the variables. Here a minimum MSEP is obtained by ranking only with residuals. As shown in Table 1, more variables are chosen compared to the ranking using only variance. It is known that CCD technology has a Poisson noise distribution when cosmic ray effects are absent. This is the case for the processed data, for which the large spikes produced by cosmic rays have been removed. The noise more closely resembles a Gaussian distribution and the result is that the rankings perform similarly. Figure 7c shows that the CVMSE is similar across factors and ranking criteria. Table 1 indicates that the minimum MSEP is obtained using the variance ranking only, but the results are comparable across the methods. It is evident that the new ranking for selection improves calibration results for cases with large outliers and noisy data in general. In most cases there is a decrease in CVMSE as residuals are given more weight in the rankings (as ν f 1). In every case, use of residuals provides at worst a comparable CVMSE and comparable model size and usually improves the calibration when noise deviates from a normal distribution. 42
Analytical Chemistry, Vol. 70, No. 1, January 1, 1998
CONCLUSION The goal of this research was to provide a mathematical justification for wavelength selection in PLS calibration, showing explicitly the effects of including noise variables into the model. The results presented here reveal the increased bias in the PLS calibration slopes as noisy variables are introduced. They also indicate the utility of wavelength selection before calibration with PLS regardless of the noise character of the data. In addition to the mathematical justification for wavelength selection, a selection algorithm was developed that was more robust in the presence of outlying spectral points than some previous methods, allowing it to be used with more data types. The results of the study show that the use of residuals (as opposed to variance) in ranking variables for selection is particularly useful when dealing with large outliers and noisy data in general. The observed differences are clearly dependent upon the noise type and noise magnitude of the data. For data with large outliers, the proposed new ranking criteria results in smaller CVMSE and MSEP than the previously used ranking. For data with more subtle noise, the selection methods perform virtually the same. These results suggest that use of residuals to weight variables can be applied to many different types of data and tend to outperform rankings using only variance estimates. It is useful to replace variance estimates with residuals or at least include residuals in a hybrid ranking scheme to avoid problems associated with large outlying data points. Last, the wealth of literature on wavelength selection and lack of comprehensive comparisons of wavelength selection methods when used with different data types suggests that an important and interesting study would be to compare the results of all known wavelength selection algorithms and their behavior when presented with spectral outliers. Such an investigation is beyond the scope of this work, but the authors provide the idea as a possibility for a future project. ACKNOWLEDGMENT This work is supported in part by grants from the Chemistry and Statistics and Probability programs at the NSF (DMS9523878), the Whitaker Foundation, and the National Aeronautics and Space Administration (NAG9-821). The research was also supported by a fellowship from the Whitaker Foundation. APPENDIX The purpose of this Appendix is to show that, from a theoretical point of view, the inclusion of background regions in PLS calibration is analogous to adding noise in linear regression. We will give some formulas that explicitly display the effect of using background regions in calibration. For clarity, the appendix equations are numbered separately from those in the text. (22) Kiefer, J.; Wolfowitz, J. Ann. Math. Stat. 1956, 27, 887. (23) Gleser, L. J. ASA Proc. Bus. Econ. Sec. 1983, 57-66. (24) Fuller, W. A. Measurement Error Models; J. Wiley and Sons: New York, 1987. (25) Carroll, R. J.; Ruppert, D.; Stefanski, L. A. Measurement Error in Nonlinear Models; Chapman and Hall: New York, 1995; pp 21-78. (26) Carroll, R. J.; Spiegelman, C. H. J. Qual. Tech. 1986, 18, 170. (27) Lakshminarayanan, M. Y.; Gunst, R. F. Biometrika 1984, 71, 569. (28) Denham, M. C. Ph.D. Dissertation, University of Liverpool, 1991.
Fitting a regression model when there is error in the x-variable is called a measurement error model or an errors-in-variables model. We first define our measurement error model: We observe n independent pairs of measurements (Xi, Yi). The random Xi’s are related to the random Yi’s by the equations
Yi ) βxi + Vi Xi ) xi + i
(A1)
where i ) 1, ..., n. We assume that the independent errors i and Vi have mean zero and constant variance σ2x and σ2, respectively. We further assume that the predictors xi are a sequence of independent and identically distributed random variables having a finite variance. This is called the functional form of the measurement error model. If the predictors were random, it would be called the structural form.21,22 It is known that the least-squares estimator is biased toward the origin.23,24 In particular, the least-squares estimate of β, βˆ , converges to β(∑x2i /∑(x2i + 2i ), almost surely as n f ∞. It is also known that if the ratio ∑2i /∑x2i is big, the least-squares estimate has poor behavior relative to an errors-in-variables.25,26 If the ratio is not big, then the behavior of the least-squares estimate may be more acceptable as an estimator. For more detail about when least squares may be acceptable for a measurement error model, see refs 25 and 26. We will show that the PLS estimator has properties similar to the errors-in-variables model, except that measurement error in xi is replaced by noise variables or predictors that have little information. We consider three nested models of increasing complexity. The simpler models provide insight that is shrouded by the more complicated final model. Notation for PLS:
Y ) (Y1,Y2)
Y1 and the errors E1 have dimension n by q1, the measurements Y2 and the errors E 2 have dimension n by q2, x has dimension n by 1, and β has dimension q by 1. All errors are independent and have mean 0 and the ith column of EI has variance σii2. Now we give some facts and notation about PLS, but readers are referred to ref 27 for additional properties of PLS. The PLS calibration coefficients for model i ) 1, 2, and 3 are denoted βˆ PLS,i. If there is no second subscript on βˆ PLS,i, we will make clear by context the models that are represented. From a rewriting of an estimator we have
βˆ PLS )
)
x′YY′x Y′x 27 x′(YY′)2x
(A4)
From this we find several important results based on the models above. Following are the results and their proofs. Result 1. For model 1 the PLS estimator is the same as the least-squares estimator, that is
βˆ PLS,1 )
β β′β
(A5)
Proof of Result 1. By plugging in the form of Y for model 1 into (A4), we get
βˆ PLS )
x′(xβ′βx′)xβx′x x′(xβ′βx′)2x
(A6)
Now β′β is a scalar, so βˆ PLS ) (x′x)3β′ββ/(x′x)3(β′β)2 ) β/β′β, which is (A5). QED. Result 2. For model 2, the PLS calibration estimates have a form that explicitly shows the impact of the pure noise terms in Y2. In particular
βˆ PLS,2 )
model 1: Y1 ) xβ′, Y2 ≡ 0
(
(
(x′x)[(x′x)2(β′β) + x′EE′x]
(x′x)3(β′β)2 + 2x′xβ′βx′EE′x + x′(EE′)2x
)
β
(A7)
model 2: Y1 ) xβ′, Y2 ) E2 model 3: Y1 ) xβ′ + E1, Y2 ) E2
(A2)
Bold capitals denote matrices, bold lowercase denote vectors, and apostrophes indicate transposed matrices or vectors. The rows of Y represent measurement of light intensity (reflected, transmitted, etc.) at different wavelengths measured at a fixed chemical concentration. The columns represent measurements at a fixed wavelength over the set of concentrations.
y11 ... y1q Y ) l ... l yn1 ... ynq
(A3)
Y2 represents pure noise observations which carry no information relevant to the chemical measurement under consideration. Note that individual elements of the matrices E1 and E2 will be denoted by e, with appropriate subscripts where necessary to indicate whether it is an element of E1 or E2. The measurements
Proof of Result 2. Recall that, for model 2, Y ) [Y1 E2] and that YY′ ) Y′1Y1 + E′2E2. By using these equations and (A4) we get
βˆ PLS )
{(x′x)2β′β + x′E2E′2}βx′x x′(xβ′βx′ + E2E′2)2x
(A8)
The result, (A7), follows by expanding the denominator and canceling common terms from the numerator and denominator. QED. The final result is for models 1-3. Result 3. Under models 2 and 3 for fixed n, q1, x, and σ2, as q2 tends to infinity the estimator βˆ PLS tends to zero in probability. Proof of Result 3. By the Cauchy-Schwartz inequality (x′YY′x)2 e (x′(YY′)2x)(x′x). Using this inequality we get βˆ PLS′βˆ PLS ) (x′YY′x)3/(x′(YY′)2x)2 e (x′x)2(x′YY′x)3/(x′YY′x)4 ) (x′x)2/ x′YY′x. Now we use the fact that x′YY′x g x′E2E′2x and thus βˆ PLS′βˆ PLS e (x′x)2/x′E2E′2x. Finally, we note that the value of the Analytical Chemistry, Vol. 70, No. 1, January 1, 1998
43
denominator tends to q2σ2 x′x as q2 f ∞ by the strong law of large numbers. QED. Remarks. (1) The estimator βˆ PLS,2 is the sum of two terms. By dividing the numerator and denominator by (x′x)3(β′β)2 the first term can be rewritten as
(
1
)
x′(EE′)2x 2x′EE′x + 1+ (x′x)2β′β (x′x)3(β′β)2
βPLS,1
(A9)
Note that the role of the error in the PLS estimator through the terms x′EE′x and x′(EE′)2x is clear. The larger these terms get, the closer this first term gets to zero. The remaining term in βˆ PLS,2 represents an additive bias. It too goes to zero as x′(EE′)2x tends to infinity. The term x′(EE′)2x tends to 0 as either q2 or σ2 tends to infinity. (2) Result 2 is suggestive of a selection algorithm for reducing estimated bias. The terms x′EE′x and x′(EE′)2x in the denomina-
44
Analytical Chemistry, Vol. 70, No. 1, January 1, 1998
tor of (A9) suggest that σ2/β2 and (Ee4)1/2/β2 are important. We will try selection algorithms based upon estimates of ν(σ2i /β2i ) + 4 1/2 (1 - ν)((Ee1i ) /β2i ) for ν ) 0.0, 0.5, and 1.0 for each variable. 4 1/2 The smaller the estimates of ν(σ2i /β2i ) + (1 - ν)((Ee1i ) /β2i ), the more likely that we need to include the corresponding variable in our calibration model. Other choices of ν may be helpful, but we only consider the extremes and a center point for this work. (3) If the underlying error distribution has a continuous density with respect to Lebesgue measure the convergence given by result 3 also happens in L1. This can be seen by direct calculation for the normal distribution.
Received for review June 4, 1997. Accepted October 2, 1997.X AC9705733 X
Abstract published in Advance ACS Abstracts, December 1, 1997.