Application of Wavelet Transform To Extract the Relevant Component

Anal. Chem. 1997, 69, 4317-4323

Application of Wavelet Transform To Extract the Relevant Component from Spectral Data for Multivariate Calibration D. Jouan-Rimbaud,† B. Walczak,†,‡ R. J. Poppi,†,§ O. E. de Noord,| and D. L. Massart*,†

ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium, and Shell Research and Technology Center, Shell International Chemicals B. V., P.O. Box 38000, 1030 BN Amsterdam, The Netherlands

An approach aiming at extracting the relevant component for multivariate calibration is introduced, and its performance is compared with the “uninformative variable elimination” approach and with the standard PLS method for the modeling of near-infrared data. The extraction of the relevant component is carried out in the wavelet domain. The PLS results on these relevant features are better, and therefore, it seems that this approach can successfully be used to remove noise and irrelevant information from spectra for multivariate calibration.

In multivariate calibration applied to spectroscopic data, one often deals with hundreds of correlated variables. Therefore, special methods have to be used to build multivariate calibration models with such data, e.g., principal component regression (PCR) and partial least squares (PLS).¹ Both these methods are based on the construction of a few orthogonal latent variables from the original ones. In PCR, in order to work only with the information relevant for the model, it is advised to select the PCs before modeling.²⁻⁴ In PLS, the latent variables are constructed to maximize the covariance between the matrix of independent variables X and the vector of dependent variable y in each dimension. However, some irrelevant information can still be present in the first latent variables, as a high covariance does not always imply a large correlation but can also be the result of a small correlation between y and the X variables and a large variance in X.

The results of both approaches are often worsened by the presence of information irrelevant for calibration. Removal of the irrelevant information from the data before the modeling should improve the final calibration model. Indeed, as stated by Thomas,⁵ “the inclusion of non-informative (or difficult) measurements in a model can seriously degrade performance. For many difficult problems, wavelength selection can greatly improve the performance of full-spectrum methods”.
Selection of variables is also useful in order to eliminate variables where nonlinearities are present.⁶ As demonstrated in ref 2, a preselection of relevant variables can significantly improve the multivariate modeling. It is therefore of great interest to have at one’s disposal a method enabling selection of relevant variables for PCR or PLS. Some methods have already been proposed to achieve this goal.²,⁷⁻¹⁰ In particular, the method called “uninformative variable elimination for PLS” (UVE-PLS)¹⁰ is very interesting, as the model retains only those variables that carry more information than noise variables (see Theory).

The goal of this article is to present a modification of the UVE-PLS method. The proposed method, which we call “relevant component extraction for PLS” (RCE-PLS), is based on the discrete wavelet transform.¹¹ The idea is to identify the wavelet transform coefficients that are related to noise, to compute their contribution to the model, and to reject all wavelet transform coefficients that are less informative than the most informative noise coefficients. With the selected wavelet coefficients, a spectrum in the whole original domain can be reconstructed, with, at each wavelength, only the relevant component of absorbance.

† Vrije Universiteit Brussel.
‡ On leave from Silesian University, Katowice, Poland.
§ On leave from Universidade Estadual de Campinas, Brazil.
| Shell International Chemicals B. V.
(1) Martens, H.; Naes, T. Multivariate Calibration; Wiley: Chichester, UK, 1989.
(2) Jouan-Rimbaud, D.; Walczak, B.; Massart, D. L.; Last, I. R.; Prebble, K. A. Anal. Chim. Acta 1995, 304, 285-295.
(3) Davies, A. M. C. Spectrosc. Eur. 1995, 7 (4), 36-38.
(4) Sutter, J. M.; Kalivas, J. H.; Lang, P. M. J. Chemom. 1992, 6, 217-225.
(5) Thomas, E. V. Anal. Chem. 1994, 66, 795A-804A.
(6) Brown, P. J. J. Chemom. 1992, 6, 151-161.

S0003-2700(97)00293-X CCC: $14.00 © 1997 American Chemical Society

THEORY

1. Uninformative Variable Elimination for PLS.¹⁰ Uninformative variable elimination for PLS is presented in ref 10. The method consists of computing the reliability of each variable in the model and retaining only the most informative variables. In order to compute a reliability coefficient for each variable, a leave-one-out strategy is adopted: each time an object is left out, a PLS model is computed, and a row vector of regression b coefficients of the closed form of the PLS model is obtained. There are as many b coefficient vectors as there are objects in the data set, and one can therefore compute the reliability of each b coefficient as the ratio $c = \mathrm{mean}(b)/\mathrm{std}(b)$, where mean(b) and std(b) are the mean and the standard deviation of the b coefficients obtained for that variable. The procedure to compute c is therefore a jackknifing approach.
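The leave-one-out computation of the reliability coefficients can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sample standard deviation is used for std(b), and the regression step is passed in as a callable. The `slope_fit` stand-in below is not PLS; it is a per-variable least-squares slope used only so that the example runs without external libraries. In practice the callable would return the closed-form PLS b vector.

```python
from statistics import mean, stdev


def reliability(X, y, fit):
    """Jackknife reliability c_j = mean(b_j) / std(b_j) over leave-one-out fits.

    X is a list of rows (objects), y a list of responses, and fit(X, y) must
    return one regression coefficient per column of X.
    """
    b_runs = []
    for i in range(len(X)):                  # leave object i out
        X_loo = X[:i] + X[i + 1:]
        y_loo = y[:i] + y[i + 1:]
        b_runs.append(fit(X_loo, y_loo))
    # zip(*b_runs) iterates over variables: each item holds the m estimates of b_j
    return [mean(b_j) / stdev(b_j) for b_j in zip(*b_runs)]


def slope_fit(X, y):
    """Hypothetical stand-in for PLS: least-squares slope of y on each variable."""
    y_bar = mean(y)
    b = []
    for col in zip(*X):
        x_bar = mean(col)
        sxx = sum((x - x_bar) ** 2 for x in col)
        sxy = sum((x - x_bar) * (t - y_bar) for x, t in zip(col, y))
        b.append(sxy / sxx)
    return b


# Variable 0 tracks y closely; variable 1 is unrelated scatter.
X = [[1.0, 0.3], [2.0, -0.1], [3.0, 0.4], [4.0, -0.2], [5.0, 0.1], [6.0, 0.2]]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
c = reliability(X, y, slope_fit)
print(c)
```

The informative variable gets a stable coefficient across the leave-one-out runs and therefore a large |c|, whereas the coefficient of the unrelated variable fluctuates and its |c| stays small.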
Note that there exists a similarity between each c coefficient and the |t| value that is computed to test the significance of the regression coefficients in MLR. However, performing a t test is not possible here, as the determination of the number of degrees of freedom is difficult; this is the reason why the c coefficients corresponding to real variables are compared to the c coefficients corresponding to noise variables. The goal of UVE-PLS is to identify variables with b coefficients that contribute significantly to the model (“large” c values).

The first step of the method is to add an m × p matrix N of random variables to the m × p spectral matrix X, as shown in Figure 1, where m denotes the number of observations and p denotes the number of variables. The amplitude of the random variables is multiplied by a very low factor, so that the PLS model computed on the matrix [X, N], of dimensions m × 2p, is not influenced by N. It is clear that the variables in N are uninformative (they represent noise), and therefore, their c values are used as an indicator of the value that can be reached by an uninformative variable. The simplest way (although not necessarily the best) to determine the cutoff value is to take the maximum c value obtained for a variable from N ($c_{N,\max}$). All variables from X with $c < c_{N,\max}$ are considered uninformative and are removed from the spectral X matrix. However, some random variables can show high c values due to chance correlations. This leads to the selection of fewer real variables for the PLS and may result in the elimination of relevant variables. To avoid this, Centner et al.¹⁰ proposed to use a cutoff level below $c_{N,\max}$, defined by different quantiles (99, 95, 90%) of the ranked parameters $c_N$.

Figure 1. Addition of random variables (block N) to the spectral matrix X.

(7) Garrido Frenich, A.; Jouan-Rimbaud, D.; Massart, D. L.; Kuttatharmmakul, S.; Martínez Galera, M.; Martínez Vidal, J. L. Analyst 1995, 120, 2787-2792.
(8) Lindgren, F.; Geladi, P.; Rännar, S.; Wold, S. J. Chemom. 1994, 8, 349-363.
(9) Lindgren, F.; Geladi, P.; Berglund, A.; Sjöström, M.; Wold, S. J. Chemom. 1995, 9, 331-342.
(10) Centner, V.; Massart, D. L.; de Noord, O. E.; de Jong, S.; Vandeginste, B. M.; Sterna, C. Anal. Chem. 1996, 68, 3851-3858.
(11) Mallat, S. IEEE Trans. Pattern Anal. Machine Intell. 1989, 11 (7), 674-693.

Analytical Chemistry, Vol. 69, No. 21, November 1, 1997 4317

2. Relevant Component Extraction for PLS. Instead of adding random noise variables to the original data matrix X, one can estimate the noise component of the original data and then estimate a threshold value beyond which b coefficients may be considered reliable. Identification of the noise component of data, difficult in the original domain, can be performed in other domains.
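As a point of reference for the noise-based threshold just mentioned, the UVE selection rule of section 1 can be sketched as follows. The sketch makes several assumptions that are not taken from the paper: the function names, the scale factor `1e-10` for the random block, the use of absolute values of c (so that strongly negative coefficients also count as reliable), and the simple rank-based quantile.

```python
import math
import random


def augment(X, scale=1e-10, seed=0):
    """Append a block N of low-amplitude random variables to X, giving [X, N]."""
    rng = random.Random(seed)
    return [row + [scale * rng.random() for _ in row] for row in X]


def uve_select(c_real, c_noise, quantile=1.0):
    """Return (indices of retained real variables, cutoff).

    quantile=1.0 uses the largest |c| of the noise block as the cutoff;
    smaller values (e.g. 0.99, 0.95, 0.90) use the corresponding rank of
    the sorted |c_noise| values instead.
    """
    ranked = sorted(abs(c) for c in c_noise)
    idx = max(0, math.ceil(quantile * len(ranked)) - 1)
    cutoff = ranked[idx]
    keep = [j for j, c in enumerate(c_real) if abs(c) > cutoff]
    return keep, cutoff


# Hypothetical reliability values for five real and four noise variables.
c_real = [12.0, 0.8, -9.5, 2.1, 0.3]
c_noise = [0.5, -1.2, 2.5, -0.4]
print(uve_select(c_real, c_noise))          # ([0, 2], 2.5)
print(uve_select(c_real, c_noise, 0.75))    # ([0, 2, 3], 1.2)
```

Lowering the quantile makes the rule less sensitive to a single noise variable that reaches a high c value by chance, at the price of admitting more borderline variables.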
In the present study, we propose to use the discrete wavelet transform (DWT).¹¹ The DWT operates on individual spectra. Each spectrum, of length $p = 2^N$, is replaced by $2^N$ wavelet transform coefficients. Therefore, the initial data matrix X (m × p) is replaced by a matrix W of the same dimensions, containing wavelet transform coefficients. The information content of matrices X and W is exactly the same, but the basis of the spectrum presentation is changed. In the wavelet basis, i.e., when the spectrum is described by wavelet coefficients, it is easy to distinguish the noise component from the component containing the information needed to reconstruct the initial spectrum. This is illustrated in Figure 2, where the initial spectrum and its wavelet transform coefficients (on a normal and on a log scale) are presented. In Figure 2b, the presented set of wavelet coefficients (associated with the DWT basis) contains the approximation coefficient of level N and all detail coefficients from level N - 1 to 1. As one can immediately notice, the spectrum presentation in the wavelet basis is sparse. There are many wavelet transform coefficients with very small amplitude, representing unimportant details, that can be omitted without substantially affecting the information content.

Figure 2. (a) Near-IR spectrum of the first sample from data set 2; (b) the wavelet transform coefficients of the same spectrum (filter 8); (c) the wavelet transform coefficients of the same spectrum on a logarithmic scale.

Applying Saito’s method, we can choose the cutoff value automatically, that is, the value under which all wavelet coefficients are considered to represent noise only. The detailed description of this method can be found in refs 12 and 13. For the readers’ convenience we present the main idea of this approach. It is based on information theory¹⁴⁻¹⁶ and can be summarized as follows. The wavelet transform coefficients are sorted in decreasing order of their absolute value, and as few of them as possible are used to reconstruct the initial spectrum with the smallest possible error. The number k* of wavelet coefficients that are significant (i.e., relevant for signal reconstruction without inclusion of noise) can be determined as the solution of the minimum description length (MDL) equation:

$$\mathrm{MDL}(k^*, f^*) = \min_{0 \le k < p,\; 1 \le f \le M} \left[ \frac{3}{2}\, k \log(p) + \frac{p}{2} \log \left\| \left(I - \Theta^{(k)}\right) R_f s \right\|^2 \right] \qquad (1)$$

where p denotes the length of the signal s, $R_f s$ is the vector containing the p wavelet transform coefficients obtained by applying filter f to the original signal, k denotes the number of retained wavelet transform coefficients (nonzero elements in this vector), f numbers the filter (M filters being considered), I is the p-dimensional identity operator (matrix), and $\Theta^{(k)}$ is a thresholding operation that keeps the k largest elements (in absolute value) intact and sets all other elements to zero. The MDL allows the determination of the smallest possible number of wavelet coefficients that will enable as good a spectrum reconstruction as possible. The first term of this equation can be seen as a penalty function, increasing linearly with the number of wavelet transform coefficients retained (as one would like as few coefficients as possible). The second term represents the error between the original signal and the signal reconstructed with the k largest elements and decreases as k increases. In Figure 3, which represents the cost function MDL for the spectrum presented in Figure 2a, these terms are represented by the curves numbered 1 and 2, respectively. The resulting cost function (MDL), curve 3 in Figure 3, reaches its minimum for a certain number of wavelet coefficients, k*, which indicates the number of wavelet coefficients significant for spectrum reconstruction. In this case, the optimal number of wavelet transform coefficients is equal to 128. The remaining (512 - 128) coefficients can be set to zero. Hence, the initial spectrum can be reconstructed using only 128 coefficients instead of 512. The difference between the initial spectrum and the reconstructed one is due to the extracted noise component (see Figure 4).

Figure 3. Cost function (3) and its components (1 and 2) for the near-IR spectrum presented in Figure 2a.

Figure 4. (a) The initial near-IR spectrum; (b) the reconstructed near-IR spectrum and (c) the estimated noise component.

Figure 5. Matrix W of the wavelet transform coefficients divided into two blocks W1 and W2, containing coefficients significant for signal reconstruction (W1) and noise (W2) components.

(12) Saito, N. Simultaneous noise suppression and signal compression using a library of orthonormal bases and the minimum description length criterion. In Wavelets in Geophysics; Foufoula-Georgiou, F., Kumar, P., Eds.; Academic Press: New York, 1994.
(13) Walczak, B.; Massart, D. L. Chemom. Intell. Lab. Syst. 1997, 36, 81-94.
(14) Rissanen, J. Ann. Stat. 1983, 11, 416-431.
(15) Rissanen, J. IEEE Trans. Inform. Theory 1984, 30, 629-636.
(16) Rissanen, J. Stochastic Complexity in Statistical Inquiry; World Scientific: Singapore, 1989.

It should be stressed that our goal when applying Saito’s method is the identification of the wavelet coefficients that are associated with noise. A problem is how to select the significant wavelet transform coefficients for the whole set of spectra (instead of individual spectra). Different options were considered in our study, and it was decided to apply Saito’s method to the wavelet transform coefficients of the mean spectrum. The idea is that if a given coefficient is associated only with noise in each individual spectrum, it should be very small in the mean spectrum and be identified as noise. The wavelet transform coefficients identified as noise for the mean spectrum were considered as noise for all the spectra in the data set.
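The selection of k* with the MDL criterion of eq 1 can be illustrated for a single filter with a short, self-contained sketch. It is not the authors' implementation: it assumes the Haar wavelet (the simplest member of the Daubechies family) in place of the longer filters used in the study, natural logarithms, and hypothetical function names. The test signal is a step with a small alternating ripple standing in for noise.

```python
import math


def haar_dwt(signal):
    """Full orthonormal Haar DWT; len(signal) must be a power of two.

    Returns [approximation, coarsest detail, ..., finest details].
    """
    root2 = math.sqrt(2.0)
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / root2 for a, b in pairs] + details
        approx = [(a + b) / root2 for a, b in pairs]
    return approx + details


def mdl_k(coeffs):
    """Return (k*, cost) minimising (3/2) k log p + (p/2) log(discarded energy)."""
    p = len(coeffs)
    energy = sorted((c * c for c in coeffs), reverse=True)
    # tail[k] = energy discarded when only the k largest coefficients are kept
    tail = [0.0] * (p + 1)
    for k in range(p - 1, -1, -1):
        tail[k] = tail[k + 1] + energy[k]
    best_k, best_cost = 0, math.inf
    for k in range(p):                       # 0 <= k < p
        if tail[k] <= 0.0:                   # exact reconstruction: log(0) undefined
            break
        cost = 1.5 * k * math.log(p) + 0.5 * p * math.log(tail[k])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


# Step of height 4 plus an alternating ripple of amplitude 0.05.
signal = [(0.0 if i < 32 else 4.0) + 0.05 * (-1) ** i for i in range(64)]
w = haar_dwt(signal)
k_star, cost = mdl_k(w)
print(k_star)   # 2: only the approximation and the coarsest detail are kept
```

Repeating the calculation for each candidate filter and retaining the filter with the lowest cost would give the pair (k*, f*) of eq 1. Applying `mdl_k` to the coefficients of the mean spectrum, as described above, then yields one set of noise coefficients for the whole data set.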

It should be noted that there exist a very large number of possible wavelet bases.¹⁷ We limited ourselves to the Daubechies family of orthogonal wavelets containing 10 filters varying in their support and smoothness.¹⁸ The choice of filter is, of course, data-dependent. In our study, the optimal filter was selected with Saito’s method for the mean spectrum of each studied data set. The optimal filter is defined as that for which the cost function reaches a minimum. Once the optimal filter is selected, all the individual spectra are decomposed using this filter. As a result of the DWT, a matrix W is obtained, with coefficients identified as the important ones for the data reconstruction (block W1) and with coefficients regarded to be responsible only for noise (block W2). Schematically it can be presented as shown in Figure 5.

The main steps of the RCE-PLS procedure are identical to those of UVE-PLS; i.e., the PLS is performed for the matrix $W = [W_1\ W_2]$ and the vector y. The reliability of the b coefficients is estimated by a leave-one-out procedure, similarly to UVE-PLS. The reliability of the b coefficients for the noise variables (block W2) gives some indication of the variables from the block W1 that are relevant in signal reconstruction but irrelevant for calibration purposes. Note that the small amplitude of the noise coefficients in W2 does not influence the PLS model. In order to determine the cutoff level of reliability, another criterion was considered, defined as¹⁹

$$\mathrm{cutoff} = \frac{\mathrm{constant}}{0.6745}\,\mathrm{median}(|c_N|) \qquad (2)$$

where the constant can vary from 1 to 3, and $c_N$ corresponds to the noise coefficients from the block W2. This definition of the cutoff is quite popular in the field of signal processing. When white Gaussian noise with mean 0 and standard deviation 1 is present, the obtained cutoff corresponds to the standard deviation of the noise.

The complexity of the final model (PLS with the relevant wavelet coefficients) is estimated by a leave-one-out cross-validation procedure, which also yields the prediction error (expressed as the root mean square error of prediction, RMSEP).

(17) Chui, C. K. Introduction to Wavelets; Academic Press: Boston, MA, 1991.
(18) Daubechies, I. Commun. Pure Appl. Math. 1988, 41, 909-996.
(19) Beyer, W. H. CRC Standard Mathematical Tables and Formulae, 29th ed.; CRC Press: Boca Raton, FL, 1991.
(20) Jouan-Rimbaud, D.; Massart, D. L.; Leardi, R.; de Noord, O. E. Anal. Chem. 1995, 67, 4295-4301.

EXPERIMENTAL SECTION

Data Set 1. The first data set used has already been presented in a previous calibration study.²⁰ It is a set of near-IR spectra of polyether polyols. It was investigated further,²¹ and in this article, we have worked with the cluster referred to as set A in ref 21 after removal of the replicate outlier and outlying object that were detected then. In this set, the spectra of 26 samples are present. Replicate spectra were averaged. The wavelength range was originally 1100-2158 nm (530 wavelengths), but it was reduced to 512 variables ($2^9$) by removing the first 18 variables.

Data Set 2. This data set consists of FT-IR spectra of seven standards, measured at 701 wavelengths between 1600 and 900 cm⁻¹. Each spectrum was duplicated, and this time, due to the small number of samples present, the duplicated spectra were used as such and not averaged as in data set 1. Of course, in the cross-validation procedure, the two replicates of a sample were removed together. The original PLS model on standard normal variate (SNV) transformed data²² is much better than that on raw data, so we have worked with SNV-transformed data. The wavelength range was reduced by eliminating the first 189 variables, as the chemical information is known to be mainly present in the range 1000-900 cm⁻¹, and each spectrum was then described by 512 ($2^9$) variables.

Data Set 3. This data set deals with 24 near-IR spectra of pharmaceutical tablets and has already been studied.²,²³ The spectra were originally measured at 1050 wavelengths, between 400 and 2498 nm. In order to be used here, the spectral range was reduced to 1024 ($2^{10}$) wavelengths by removing the last 26 variables, so that the spectral range we have worked with was reduced to 400-2446 nm.

Data Set 4. This data set consists of near-IR spectra of gasoline samples (19 standards) measured at 277 wavelengths between 1550 and 2400 nm. The goal of the calibration is to quantify ethanol and methyl tert-butyl ether (MTBE) present in gasoline samples. Synthetic mixture samples for the calibration model were prepared from gasoline without prior addition of MTBE or ethanol, supplied by Replan/Petrobras (Brazil).
Absolute ethanol (Quimex) and MTBE p.a. (Aldrich) were employed to produce calibration solutions of different concentrations (w/w). Spectrophotometric absorption measurement was performed using an acousto-optic tunable filter (AOTF)²⁴ as wavelength selector. The spectral range was reduced to 256 wavelengths ($2^8$) by removing the last 21 variables, and a Savitzky-Golay first-derivative 11-point cubic smooth was applied on the whole spectrum to correct baseline drift and eliminate measurement noise.

SOFTWARE

All computations were performed with Matlab for Windows version 4.0,²⁵ with programs written in-house. The library of wavelet filter coefficients was taken from the Matlab Toolbox for wavelets, WavBox3.²⁶

(21) Centner, V.; Massart, D. L.; de Noord, O. E. Anal. Chim. Acta 1996, 330, 1-17.
(22) Barnes, R. J.; Dhanoa, M. S.; Lister, S. J. Appl. Spectrosc. 1989, 43, 772-777.
(23) Jouan-Rimbaud, D.; Khots, M. S.; Massart, D. L.; Last, I. R.; Prebble, K. A. Anal. Chim. Acta 1995, 315, 257-266.
(24) Guchardi, R. Dissertation Thesis, State University of Campinas, Brazil 1996.
(25) Matlab, The MathWorks, Inc., 1992.
(26) WavBox 3 by Carl Taswell, available via [email protected].

RESULTS AND DISCUSSION

For each investigated data set, the mean spectrum was used to find the optimal wavelet filter. As presented in Table 1, the cost function (MDL) reaches a minimum for filter 8 of the Daubechies family of filters in the case of data sets 1, 2, and 4, and for filter 9 in the case of data set 3. These two wavelet filters are presented in Figure 6, and it can be seen that they are very similar. The number of significant wavelet transform coefficients is 236, 120, 330, and 105 for data sets 1-4, respectively. All the remaining wavelet transform coefficients were considered as associated with spectral noise and used in the modeling phase as the noise features from block W2 (see Figure 5), which serve to estimate the minimal reliability of an informative variable. The results of PLS, UVE-PLS, and its modified version, RCE-PLS, are presented in Table 2.

Table 1. Number of Significant Wavelet Transform Coefficients (k), and Value of the Cost Function (MDL), for the 10 Filters from the Daubechies Family Calculated for Data Sets 1-4

          data set 1          data set 2          data set 3          data set 4
filter    k     cost × 10^3   k     cost × 10^3   k     cost × 10^3   k     cost × 10^3
1         63     0.1029       43     1.0087       93    -1.1100       41     0.0776
2         110   -0.8624       111    0.3287       189   -2.8250       50    -0.9561
3         182   -1.3903       140   -0.0011       233   -3.7472       43    -1.0682
4         205   -1.7828       135   -0.2999       273   -4.3280       52    -1.1859
5         223   -2.1008       142   -0.3883       310   -4.5801       70    -1.2409
6         236   -2.1761       116   -0.5311       330   -4.8393       89    -1.2600
7         243   -2.2449       126   -0.4944       297   -4.8922       85    -1.2847
8         236   -2.3464       120   -0.6192       314   -4.9426       105   -1.4256
9         225   -2.3305       131   -0.5273       330   -5.0265       94    -1.2899
10        249   -2.2714       130   -0.4845       289   -4.8983       104   -1.3404

Figure 6. (a) Daubechies wavelet number 8 and (b) Daubechies wavelet number 9.

Table 2. Results of the PLS, UVE-PLS, and RCE-PLS, Applied to Data Sets 1-4

model      RMSEP   no. of factors   no. of variables   filter
Data Set 1
PLS        1.898   6                512
UVE-PLS    1.531   5                86
RCE-PLS    1.296   7                14                 8
Data Set 2
PLS        0.602   2                512
UVE-PLS    0.277   2                52
RCE-PLS    0.214   3                16                 8
Data Set 3
PLS        0.514   9                1024
UVE-PLS    0.373   6                143
RCE-PLS    0.296   5                114                9
Data Set 4 (Ethanol)
PLS        0.463   4                256
UVE-PLS    0.248   3                87
RCE-PLS    0.198   3                9                  8
Data Set 4 (MTBE)
PLS        0.771   5                256
UVE-PLS    0.342   4                90
RCE-PLS    0.311   4                19                 8
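The relative gains implied by the RMSEP values of Table 2 can be recomputed directly. The short sketch below only restates the tabulated numbers; the dictionary layout and the function name are illustrative choices.

```python
# RMSEP values from Table 2, in the order (PLS, UVE-PLS, RCE-PLS).
rmsep = {
    "Data Set 1": (1.898, 1.531, 1.296),
    "Data Set 2": (0.602, 0.277, 0.214),
    "Data Set 3": (0.514, 0.373, 0.296),
    "Data Set 4 (Ethanol)": (0.463, 0.248, 0.198),
    "Data Set 4 (MTBE)": (0.771, 0.342, 0.311),
}


def improvement(reference, new):
    """Relative decrease of RMSEP with respect to the reference model, in percent."""
    return 100.0 * (reference - new) / reference


for name, (pls, uve, rce) in rmsep.items():
    print(f"{name}: UVE-PLS {improvement(pls, uve):.1f}%, "
          f"RCE-PLS {improvement(pls, rce):.1f}%")
```

The recomputed values are of the same order as the rounded percentages given in the discussion; the RCE-PLS figure for MTBE comes out nearer 60% than 65%.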

Figure 7. Reliability coefficients for the RCE-PLS model of ethanol with data set 4 (-) and the nine wavelet coefficients relevant for calibration.
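The kind of selection illustrated in Figure 7 can be sketched with the cutoff of eq 2. The factor 0.6745 is the median of |x| for a standard normal variable, so that median(|c_N|)/0.6745 estimates the standard deviation of Gaussian noise. The names below and the comparison of absolute reliability values against the cutoff are assumptions of this sketch, and the numbers are invented for illustration.

```python
from statistics import median


def robust_cutoff(c_noise, constant=1.0):
    """Cutoff of eq 2: (constant / 0.6745) * median(|c_N|)."""
    return constant / 0.6745 * median(abs(c) for c in c_noise)


def relevant_features(c_w1, c_w2, constant=1.0):
    """Indices of W1 coefficients whose |c| exceeds the cutoff derived from W2."""
    cutoff = robust_cutoff(c_w2, constant)
    return [j for j, c in enumerate(c_w1) if abs(c) > cutoff]


# Hypothetical reliability values for four W1 and five W2 coefficients.
c_w1 = [5.2, 0.4, -1.1, 2.8]
c_w2 = [0.2, -0.5, 0.1, 0.9, -0.3]
print(relevant_features(c_w1, c_w2, constant=1.0))   # [0, 2, 3]
print(relevant_features(c_w1, c_w2, constant=3.0))   # [0, 3]
```

A larger constant gives a stricter selection: with the value 3, only the features whose reliability lies well above the noise level survive.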

Figure 8. Signal s presented as the sum of three components: a component relevant for calibration (r), a component irrelevant for calibration (i), and the noise (n).

One notices that the proposed approach performs much better than the PLS and UVE-PLS approaches. The cross-validated RMSEP of the PLS models can be improved by about 30, 65, and 40% for data sets 1, 2, and 3, respectively, 60% for the modeling of ethanol with data set 4, and 65% for the modeling of MTBE. UVE-PLS is also much better than PLS, since it improves RMSEP by 20, 55, 30, 45, and 55% for the data sets in the same order. Although both of them are based on the idea introduced by Centner et al.,¹⁰ UVE-PLS and RCE-PLS are not equivalent. Before explaining the differences between them, it should be stressed again that, in the proposed procedure, the wavelet transform coefficients are used as features in modeling, and that the spectral reconstruction is not required. The reconstructed spectra are presented in this article only to illustrate the differences between variable elimination and component extraction.

A plot of the reliability of the b coefficients for the modeling of ethanol with RCE-PLS is shown in Figure 7. Among all the significant wavelet coefficients (the first 105 on the plot, from the block W1), many do not look more informative than the noisy ones. The instrumental signal (s) can be presented as the sum of three components: the noise (n), a component relevant for calibration (r), and a component irrelevant for calibration (i), but still significantly different from noise (see Figure 8). The goal of the RCE-PLS approach is to eliminate the components i and n from each individual spectrum. This can be done in the wavelet domain. The noise component can be reconstructed by applying the inverse discrete wavelet transform to the subset of wavelet transform coefficients identified as noise by Saito’s method. The wavelet transform coefficients associated with component i are identified in the modeling phase as those features from block W1 for which the reliability of the b coefficients is lower than the reliability of the b coefficients for the noise features (W2 block). The remaining wavelet transform coefficients (those with the reliable b coefficients) are associated with component r.

For instance, in the case of data set 3, 330 wavelet transform coefficients were identified as significant for spectra reconstruction (see Table 1) and were included in the block W1, whereas the remaining 694 coefficients were identified as those associated with noise and were therefore included in the block W2. In the modeling phase, only 114 out of the 330 significant wavelet transform coefficients from the block W1 were identified as relevant features for calibration (see Table 2). Therefore, the noise component (denoted as n) can be reconstructed by applying the inverse discrete wavelet transform to the 694 wavelet transform coefficients from the block W2, the component relevant for calibration with y (denoted as r) can be reconstructed with the 114 relevant wavelet transform coefficients, and the third component (denoted as i) can be reconstructed with the remaining 216 coefficients from the block W1. An original spectrum from data set 3 and its three reconstructed components are presented in Figure 9.

Figure 9. (a) One original spectrum from data set 3 and its components: (b) relevant for calibration, (c) irrelevant for calibration, and (d) noise.

Figure 10. (a) All spectra of data set 3 and their (b) relevant components and (c) sum of irrelevant and noise components.

Figure 11. Mean spectrum of data set 3 with the marked wavelengths selected as the informative ones in the uninformative variable elimination approach.

Figure 12. (solid line) One spectrum of data set 4 (after mean centering) and (+++) the spectrum reconstructed from the nine wavelet coefficients relevant for calibration.

Figure 13. Graphical interpretation of the selected features in the uninformative variable elimination and relevant component extraction approaches.

Instead of using the original spectra (matrix X, 24 × 1024) in the PLS model, we used a subset of their wavelet coefficients (24 × 114). This means that, in the original domain, we used the components of spectra presented in Figure 10b. The eliminated components (i and n) are presented in Figure 10c. As one can notice, the main variation in the original data is irrelevant for modeling with y (see parts a and c of Figure 10). The visualization of the UVE-PLS results is more convenient for chemical interpretation, as one can immediately see in the spectrum which range is of interest (Figure 11). With RCE-PLS, one needs to reconstruct spectra to see where the domain of interest for the modeling is. For example, in Figure 12, a first-derivative spectrum of data set 4 is presented, together with the first-derivative spectrum reconstructed from the nine coefficients relevant for the modeling of ethanol. Clearly, two spectral regions are important, and they correspond to spectral regions where ethanol has absorption peaks.
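The decomposition of a spectrum into the components r, i, and n follows from the linearity of the inverse transform: each component is obtained by zeroing all coefficients outside its index set and inverting. A minimal sketch, assuming the Haar wavelet instead of the Daubechies filters used in the study and purely hypothetical index sets, is:

```python
import math

ROOT2 = math.sqrt(2.0)


def haar_dwt(signal):
    """Full orthonormal Haar DWT; len(signal) must be a power of two."""
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(a - b) / ROOT2 for a, b in pairs] + details
        approx = [(a + b) / ROOT2 for a, b in pairs]
    return approx + details


def haar_idwt(coeffs):
    """Inverse of haar_dwt for the same coefficient ordering."""
    approx = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(approx)]
        pos += len(approx)
        finer = []
        for a, d in zip(approx, detail):
            finer += [(a + d) / ROOT2, (a - d) / ROOT2]
        approx = finer
    return approx


def component(coeffs, keep):
    """Reconstruct the part of the signal carried by the coefficients in keep."""
    return haar_idwt([c if j in keep else 0.0 for j, c in enumerate(coeffs)])


spectrum = [2.0, 2.2, 3.1, 2.9, 1.0, 1.1, 0.4, 0.6]
w = haar_dwt(spectrum)
# Hypothetical index sets: relevant (r), irrelevant (i), and noise (n).
relevant, irrelevant, noise = {0, 1}, {2, 3}, {4, 5, 6, 7}
r, i, n = (component(w, s) for s in (relevant, irrelevant, noise))
recon = [a + b + c for a, b, c in zip(r, i, n)]
print(recon)   # equals the original spectrum up to rounding
```

Because the three index sets partition the coefficients, the three reconstructed components add up to the original spectrum at every wavelength.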
The interpretation of the features selected with the UVE-PLS and RCE-PLS approaches is presented graphically in Figure 13. The selection of the informative spectral range in the UVE-PLS approach does not eliminate the irrelevant and noise components in this range, but can be seen as an approach that focuses on the range where component r dominates over i and n, as shown by the large value of the reliability coefficients in this particular range. On the other hand, when the spectrum is reconstructed from the selected relevant wavelet coefficients, the whole original variable range is kept (i.e., the number of original variables does not decrease), but only part of the information carried by each wavelength remains, namely the part relevant for calibration. Essential differences between the two discussed approaches can be expected when components i and n have significant contributions to the spectra in the whole spectral range, because there is then no range where r significantly dominates over i and n, while it may still be possible to remove the components i and n from each variable.

CONCLUSIONS

Wavelet transform enables transformation of the spectral signals from the original domain to the wavelet domain, in which denoising operations can be carried out more easily. The wavelet transform coefficients can be treated as new features for multivariate calibration. Elimination of the subset of wavelet transform coefficients associated with noise and with information irrelevant for calibration with y is equivalent to elimination of the noise and irrelevant components of spectra along the whole spectral range. In the present study, only RCE-PLS is considered, but, of course, the relevant component extraction approach can be coupled with any multivariate calibration method (e.g., RCE-PCR).

Although the discrete wavelet transform was applied to present data in the wavelet domain, the wavelet packet transform (WPT)²⁷,²⁸ can be considered as a more general approach. WPT applied to a spectrum of length $2^N$ leads to a matrix of $N \times 2^N$ wavelet transform coefficients. We can take advantage of this redundancy to select the so-called best basis with $2^N$ coefficients, in which the signal has the most sparse presentation. When the WPT was applied to the data sets presented here, the best basis, determined with the entropy criterion,²⁹ was always found to be similar to the DWT basis, and this is the reason why we limited ourselves to the DWT.

(27) Coifman, R. R.; Meyer, Y.; Wickerhauser, V. In Progress in wavelet analysis and applications; Meyer, Y., Roques, S., Eds.; Editions Frontieres; 1993; pp 77-93.
(28) Cody, M. A. Dr. Dobb’s J. 1994, 17, 16-28.
(29) Coifman, R. R.; Wickerhauser, M. V. IEEE Trans. Inform. Theory 1992, 38 (2), 713-719.

ACKNOWLEDGMENT

Prof. B. G. M. Vandeginste (Unilever, Vlaardingen, NL), Dr. K. A. Prebble (Wellcome, Dartford, UK), and R. Guchardi and C. Pasquini (State University of Campinas, BR) are thanked for providing the data sets used in this study. The authors also thank the Nationaal Fonds Wetenschappelijk Onderzoek and the Standards, Measurement and Testing program for financial assistance. R.J.P. thanks the NFWO for his postdoctoral fellowship.

Received for review March 18, 1997. Accepted July 16, 1997.

AC970293N

Abstract published in Advance ACS Abstracts, September 1, 1997.