
Anal. Chem. 2006, 78, 7674-7681

Extracting Chemical Information from Spectral Data with Multiplicative Light Scattering Effects by Optical Path-Length Estimation and Correction

Zeng-Ping Chen, Julian Morris, and Elaine Martin*

School of Chemical Engineering and Advanced Materials, Centre for Process Analytics and Control Technology, University of Newcastle upon Tyne, NE1 7RU, England, UK

When complex mixtures that exhibit sample-to-sample variability are analyzed using spectroscopic instrumentation, the variation in the optical path length, resulting from the physical variations inherent within the individual samples, results in significant multiplicative light scattering perturbations. Although a number of algorithms have been proposed to address the effect of multiplicative light scattering, each rests on underlying assumptions that require additional information about the spectra. This information is difficult to obtain in practice and is frequently unavailable. Thus, to remove the need for such additional information, a new algorithm, optical path-length estimation and correction (OPLEC), is proposed. The methodology is applied to two near-infrared transmittance spectral data sets (powder mixture data and wheat kernel data), and the results are compared with the extended multiplicative signal correction (EMSC) and extended inverted signal correction (EISC) algorithms. Within the study, it is concluded that the EMSC algorithm cannot be applied to the wheat kernel data set because core information for the implementation of the algorithm is not available, while the analysis of the powder mixture data using EISC resulted in incorrect conclusions being drawn and hence a calibration model whose performance was unacceptable. In contrast, OPLEC was observed to effectively mitigate the detrimental effects of physical light scattering and significantly improve the prediction accuracy of the calibration models for the two spectral data sets investigated, without requiring any additional information about the calibration samples.
The number of implementations of spectroscopic instrumentation, such as near-infrared spectroscopy, has increased significantly across a range of industrial sectors, including the food, agriculture, speciality chemicals, and pharmaceutical sectors (ref 1), as a consequence of its applicability for on-line process monitoring. More specifically, practical and economic benefits have been reported as a consequence of there being no requirement for sample preparation and the resulting sampling frequency being greater than that for off-line assay measurements. However, a number of issues arise with respect to spectroscopic instrumentation, and those of greatest relevance to this paper are the following: (i) the resulting data comprise hundreds of measurement values per spectrum, (ii) there is a lack of selectivity with respect to the spectroscopic measurements, and (iii) there is a need to develop analytical methods without the prior separation of the analytes. One approach to tackling these issues has been through the application of chemometric algorithms, such as principal component analysis (ref 2) and partial least-squares (PLS) (ref 3). The goal of these techniques is to extract the chemical information, such as the concentration of chemical compounds, that is inherent within the spectral measurements.

However, when analyzing solid and heterogeneous types of samples that exhibit sample-to-sample variability using spectroscopic instrumentation, the variation in the optical path length arising from the physical differences between samples (due to particle size and shape, sample packing, and sample surface, for example) may result in the multiplicative light scattering effect masking the spectral variations relating to the differences between the chemical compounds within a sample (ref 4). The effect of multiplicative light scattering is difficult to handle through the application of standard bilinear calibration methodologies, as these are based on the construction of latent variables that are a linear combination of the wavelengths. Consequently, if the spectral data are not appropriately preprocessed, the underlying behavior of the data, relating to the chemical properties, will be masked by the effect of multiplicative light scattering. A number of chemometric preprocessing methods (refs 4-11) have been proposed to explicitly model the effect of multiplicative light scattering.

* To whom correspondence should be addressed: (e-mail) [email protected].
(1) Siesler, H. W.; Ozaki, Y.; Kawata, S.; Heise, H. M. Near-Infrared Spectroscopy: Principles, Instruments, Applications; Wiley-VCH: Weinheim, 2002.

7674 Analytical Chemistry, Vol. 78, No. 22, November 15, 2006
One of the most frequently reported techniques in the literature is multiplicative signal correction (MSC) (ref 4). The methodology involves regressing each spectrum in a set of related samples, i.e., samples that comprise the same chemical components, on a reference spectrum (for example, the mean spectrum) to estimate the intercept and slope of the regression equation, which will theoretically capture the information relating to the effect of multiplicative light scattering. Each individual spectrum is then corrected by subtracting the intercept and dividing by the slope. However, this process for correcting for multiplicative light scattering is only reliable if the chemical variation between the spectra to be corrected and the reference spectrum is negligible or, alternatively, if the regression procedure is only applied to that part of the spectrum that does not contain chemical information, i.e., that part that is only influenced by multiplicative light scattering. If these conditions are not satisfied, then the estimated intercept and slope may contain information relating to the analyte of interest, and this information will be lost during the implementation of the correction procedure.

An alternative procedure proposed to correct for multiplicative light scattering, which is similar to MSC, is the inverted scatter correction (ISC) algorithm (ref 8) described by Helland et al. In contrast to MSC, where each spectrum is regressed on a reference spectrum, ISC takes the reference spectrum as the regressand and the spectrum to be corrected as the regressor. Although this so-called “forward” model adopted by ISC gives rise to certain statistical differences between ISC and MSC, ISC still has the same application limitations as discussed for MSC. Both MSC and ISC have been extended to incorporate prior spectroscopic knowledge. Martens and Stark (ref 9) introduced chemical terms (the pure spectra of the chemical components in the samples) into the MSC algorithm to obtain a more effective separation of the chemical and physical effects in light spectroscopy. ISC was subsequently modified by Pedersen et al. (ref 10) to include linear and quadratic terms of the wavelengths and a quadratic term relating to the spectrum to be corrected (EISC1).

(2) Cowe, I.; McNicol, J. W. Appl. Spectrosc. 1985, 39, 257-266.
(3) Martens, H.; Martens, M. Multivariate Analysis of Quality: An Introduction; John Wiley and Sons: Chichester, 2001.
(4) Geladi, P.; McDougall, D.; Martens, H. Appl. Spectrosc. 1985, 39, 491-500.

10.1021/ac0610255 CCC: $33.50
© 2006 American Chemical Society. Published on Web 10/14/2006
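The MSC procedure described above (regress each spectrum on a reference, then subtract the intercept and divide by the slope) can be sketched in a few lines. This is a minimal illustration rather than the implementation used in ref 4: the function name `msc`, the toy spectrum, and the offset and gain values are all assumptions made for the example.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative signal correction (illustrative sketch).

    spectra   : (n_samples, n_wavelengths) array of absorbance spectra
    reference : spectrum to regress on; defaults to the mean spectrum
    Returns the corrected spectra plus the fitted intercepts and slopes.
    """
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)
    # Fit x_i ~ a_i + b_i * reference for every spectrum at once.
    design = np.column_stack([np.ones_like(reference), reference])
    coef, *_ = np.linalg.lstsq(design, spectra.T, rcond=None)
    intercepts, slopes = coef[0], coef[1]
    # Subtract the intercept and divide by the slope.
    corrected = (spectra - intercepts[:, None]) / slopes[:, None]
    return corrected, intercepts, slopes

# Toy data (hypothetical): one underlying spectrum seen with different
# additive offsets and multiplicative gains, and no chemical variation.
wavelengths = np.linspace(850.0, 1050.0, 50)
base = 1.0 + 0.5 * np.sin(wavelengths / 40.0)
a_true = np.array([0.0, 0.2, -0.1])
b_true = np.array([1.0, 1.5, 0.7])
raw = a_true[:, None] + b_true[:, None] * base

corrected, a_hat, b_hat = msc(raw, reference=base)
```

In this toy case there is no chemical variation between the spectra, which is exactly the condition under which the text says MSC is reliable; with chemical variation present, the fitted intercept and slope would absorb part of the analyte signal.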
The inclusion of linear and quadratic terms of the wavelengths, as two additional regressors in EISC1, was to account for the possible presence of smooth wavelength-dependent spectral variations between samples. As stated by Pedersen et al. (ref 10), the quadratic term relating to the spectrum to be corrected was included to model the heterogeneity of the samples and, hence, improve the estimation of the basic interference effects, with the intercept and slope relating to multiplicative light scattering. More recently, Martens et al. (ref 11) further extended the MSC and ISC algorithms by incorporating both chemical terms and linear and quadratic terms of the wavelengths (EMSC2 and EISC2) to help realize the efficient separation of the physical light scattering effects from the chemical light absorbance effects in vibrational spectra. The results from the analysis of the near-infrared transmittance spectra of mixtures of wheat gluten and starch powders using both EMSC2 and EISC2 were very good. However, the success of the EMSC2 and EISC2 algorithms is strongly dependent on the availability of the pure spectra for all the chemical components present in the samples and on the consistency of the spectral contributions from the components in the mixtures with the components isolated in the pure state. In practice, the applicability of EMSC2 and EISC2 is limited as a consequence of the difficulties in satisfying these two requirements. A methodology that can correct for the effect of multiplicative light scattering for systems with little, or even no, prior spectroscopic knowledge is therefore highly desirable.

The aim of this study is to propose a multiplicative light scattering correction algorithm that can be implemented without any prior spectroscopic knowledge. The performance of the proposed algorithm is investigated on two near-infrared transmittance spectral data sets, and the results are compared with two methodologies that have previously been applied to these data sets, EMSC2 and EISC1.

(5) Barnes, R. J.; Dhanoa, M. S.; Lister, S. J. Appl. Spectrosc. 1989, 43, 772-777.
(6) Miller, C. E.; Næs, T. Appl. Spectrosc. 1990, 44, 895-898.
(7) Isaksson, T.; Kowalski, B. R. Appl. Spectrosc. 1993, 47, 702-709.
(8) Helland, I. S.; Næs, T.; Isaksson, T. Chemom. Intell. Lab. Syst. 1995, 29, 233-241.
(9) Martens, H.; Stark, E. J. Pharm. Biomed. Anal. 1991, 9 (8), 625-635.
(10) Pedersen, D. K.; Martens, H.; Nielsen, J. P.; Engelsen, S. B. Appl. Spectrosc. 2002, 56 (9), 1206-1214.
(11) Martens, H.; Nielsen, J. P.; Engelsen, S. B. Anal. Chem. 2003, 75, 394-404.

THEORY

Multiplicative Light Scattering. For $I$ transparent solutions comprising $J$ absorbing chemical components, where the cuvette width is kept constant during the recording of each measurement, the theoretical absorbance spectrum ($\mathbf{x}_{i,\mathrm{Chem}}$, row vector) of sample $i$, according to Beer-Lambert’s law, is a linear combination of the absorbance contributions of all $J$ components:

$$\mathbf{x}_{i,\mathrm{Chem}} = \sum_{j=1}^{J} c_{i,j}\,\mathbf{s}_j, \quad i = 1, 2, \ldots, I \qquad (1)$$

where the row vector $\mathbf{s}_j$ is the absorption spectrum and $c_{i,j}$ is the concentration of the $j$th component in sample $i$. By assuming that the $\mathbf{s}_j$ ($j = 1, 2, \ldots, J$) are linearly independent, the multivariate linear calibration model built between $\mathbf{x}_{i,\mathrm{Chem}}$ and $c_{i,j}$ ($i = 1, 2, \ldots, I$) can provide satisfactory predictions for the concentration of component $j$ in future solution samples. If the samples to be analyzed are solid (powder, granules) or emulsions and dispersions, it is practically challenging to make the optical path length constant across the samples. For relatively simple systems, the effect of light scattering caused by changes in the optical path length due to the physical variations of the samples can be approximated by the following EMSC2 model (ref 11):

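$$\mathbf{x}_i = a_i\mathbf{1}_r + b_i\mathbf{x}_{i,\mathrm{Chem}} + d_i\boldsymbol{\lambda} + e_i\boldsymbol{\lambda}^2 + \boldsymbol{\varepsilon}_i = a_i\mathbf{1}_r + \sum_{j=1}^{J} b_i c_{i,j}\,\mathbf{s}_j + d_i\boldsymbol{\lambda} + e_i\boldsymbol{\lambda}^2 + \boldsymbol{\varepsilon}_i \qquad (2)$$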

where $\mathbf{x}_i$ is the measured absorbance spectrum of sample $i$ and $\mathbf{1}_r$ is a row vector with its elements equal to unity. The coefficients $a_i$ and $b_i$ denote the additive and multiplicative effects of light scattering due to the physical variations of sample $i$ relative to a reference sample. The coefficients $d_i$ and $e_i$ are introduced to account for the smooth wavelength-dependent spectral variations that may be present between samples. The wavelength row vector $\boldsymbol{\lambda}$ is a linear function of the wavelength (in nanometres), scaled so that its entries lie between -1 and +1. $\boldsymbol{\varepsilon}_i$ captures the unknown sources of spectral variation. Without the data being preprocessed using the appropriate techniques, the relationship between the measured absorbance spectra $\mathbf{x}_i$ and $c_{i,j}$ ($i = 1, 2, \ldots, I$) cannot be explicitly modeled by any of the popular multivariate linear calibration methods such as PCR and PLS.

Conventional Parameter Estimation and Spectral Correction. Although the models currently reported in the literature for the correction of multiplicative light scattering differ slightly (ref 8), they can all be regarded as a simplification or modification of the EMSC2 model (ref 11) described in eq 2. In general, they share the same parameter estimation and spectral correction strategy. The key steps in the algorithm are summarized below:
(1) Construct the regressor matrix for the model of interest. For EMSC2, the pure spectra of all the chemical components in the mixture samples are required to construct the regressor matrix. Further details are given in ref 11.
(2) Estimate the model coefficients (i.e., $a_i$, $b_i$, $d_i$, and $e_i$ in eq 2) by applying least-squares regression to the regressand (the spectral data matrix) and the regressor matrix.
(3) Insert the estimated coefficients into eq 3 to remove the effect of multiplicative light scattering, thereby yielding the corrected spectrum $\mathbf{x}_{i,\mathrm{Corrected}}$ ($\mathbf{x}_{i,\mathrm{Corrected}} \approx \mathbf{x}_{i,\mathrm{Chem}}$) in which only the absorbance contributions from the chemical variations are present.

$$\mathbf{x}_{i,\mathrm{Corrected}} = (\mathbf{x}_i - a_i\mathbf{1}_r - d_i\boldsymbol{\lambda} - e_i\boldsymbol{\lambda}^2)/b_i \qquad (3)$$
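The three steps above can be illustrated on synthetic data generated from eq 2. The sketch below is hedged in two ways. First, the pure spectra, concentrations, and scatter coefficients are invented for the example. Second, the regressor matrix is built directly from the pure spectra, and $b_i$ is recovered as the sum of the pure-spectrum coefficients, which assumes the concentrations are fractions that sum to one; ref 11 may parametrize the chemical terms differently.

```python
import numpy as np

rng = np.random.default_rng(0)
n_wl = 100
lam = np.linspace(-1.0, 1.0, n_wl)      # scaled wavelength axis, as in eq 2

# Hypothetical pure spectra (Gaussian bands), for illustration only.
s1 = np.exp(-((lam + 0.4) / 0.15) ** 2)
s2 = np.exp(-((lam - 0.3) / 0.20) ** 2)

# Simulate eq 2 without the noise term: fractions sum to one per sample.
n = 6
c1 = np.linspace(0.0, 1.0, n)
a = rng.uniform(-0.2, 0.2, n)           # additive offsets a_i
b = rng.uniform(0.6, 1.6, n)            # multiplicative path-length factors b_i
d = rng.uniform(-0.1, 0.1, n)           # linear wavelength terms d_i
e = rng.uniform(-0.1, 0.1, n)           # quadratic wavelength terms e_i
X_chem = np.outer(c1, s1) + np.outer(1.0 - c1, s2)
X = a[:, None] + b[:, None] * X_chem + d[:, None] * lam + e[:, None] * lam**2

# Step 1: regressor matrix [1_r; s_1; s_2; lambda; lambda^2].
R = np.vstack([np.ones(n_wl), s1, s2, lam, lam**2])

# Step 2: least squares, one coefficient row per spectrum (X ~ coef @ R).
coef_T, *_ = np.linalg.lstsq(R.T, X.T, rcond=None)
a_hat, h1, h2, d_hat, e_hat = coef_T
b_hat = h1 + h2                         # h_j = b_i * c_ij and sum_j c_ij = 1

# Step 3: eq 3.
X_corr = (X - a_hat[:, None] - d_hat[:, None] * lam
          - e_hat[:, None] * lam**2) / b_hat[:, None]
```

The example makes the practical limitation visible: step 1 cannot be carried out unless `s1` and `s2` are known, which is the requirement that the following section sets out to remove.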

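Ideally, measurement error should be the only source of difference between $\mathbf{x}_{i,\mathrm{Corrected}}$ and $\mathbf{x}_{i,\mathrm{Chem}}$; hence, it can be expected that the multivariate linear calibration model built on $\mathbf{x}_{i,\mathrm{Corrected}}$ exhibits predictive ability similar to that obtained for $\mathbf{x}_{i,\mathrm{Chem}}$, the theoretical absorbance spectrum.

Optical Path-Length Estimation and Correction (OPLEC). The parameter estimation and spectral correction of EMSC2 rely on the availability of the pure spectra of all the chemical components in the samples and on the consistency of the spectral contributions from the components in the mixtures with the components isolated in the pure state. Both these aspects are difficult to satisfy in practice. This limitation of EMSC2 results from the requirement that all the model parameters in the spectral correction step are available. If the goal of the analysis is to build a robust and accurate calibration model, only the parameter $b_i$, which contains information about the multiplicative effects of the optical path-length variation, is needed. The following section focuses on how $b_i$ can be estimated for the calibration samples and how it can be used for prediction in the case where there is no prior information about the pure spectra of the chemical components in the samples. First, the influence of the baseline offset ($a_i$) and the smooth wavelength-dependent spectral variations that may be present between samples ($d_i$ and $e_i$) can be removed by projecting the measured spectrum $\mathbf{x}_i$ onto the orthogonal complement of the space spanned by the row vectors of $\mathbf{P} = [\mathbf{1}_r; \boldsymbol{\lambda}; \boldsymbol{\lambda}^2]$:

$$\mathbf{z}_i = \mathbf{x}_i(\mathbf{I} - \mathbf{P}^{+}\mathbf{P}) = (a_i\mathbf{1}_r + b_i\mathbf{x}_{i,\mathrm{Chem}} + d_i\boldsymbol{\lambda} + e_i\boldsymbol{\lambda}^2 + \boldsymbol{\varepsilon}_i)(\mathbf{I} - \mathbf{P}^{+}\mathbf{P}) = \sum_{j=1}^{J} b_i c_{i,j}\,\mathbf{k}_j + \boldsymbol{\varepsilon}_i^{*}, \quad i = 1, 2, \ldots, I \qquad (4)$$

$$\mathbf{k}_j = \mathbf{s}_j(\mathbf{I} - \mathbf{P}^{+}\mathbf{P}), \quad \boldsymbol{\varepsilon}_i^{*} = \boldsymbol{\varepsilon}_i(\mathbf{I} - \mathbf{P}^{+}\mathbf{P})$$

where $\mathbf{I}$ represents an appropriately dimensioned identity matrix.

Suppose the first component is the target component in the mixtures and $\sum_{j=1}^{J} c_{i,j} = 1$ (which strictly holds when $c_{i,j}$ represents a unit-free concentration such as weight fraction or mole fraction); then eq 4 can be re-expressed as

$$\mathbf{z}_i = b_i c_{i,1}\mathbf{k}_1 + b_i\Bigl(1 - \sum_{\substack{j=1 \\ j \neq 2}}^{J} c_{i,j}\Bigr)\mathbf{k}_2 + \sum_{j=3}^{J} b_i c_{i,j}\,\mathbf{k}_j + \boldsymbol{\varepsilon}_i^{*} = b_i c_{i,1}\Delta\mathbf{k}_1 + b_i\mathbf{k}_2 + \sum_{j=3}^{J} b_i c_{i,j}\,\Delta\mathbf{k}_j + \boldsymbol{\varepsilon}_i^{*}, \quad \Delta\mathbf{k}_j = \mathbf{k}_j - \mathbf{k}_2 \qquad (5)$$

Now, given the concentration vector $\mathbf{c}_1 = [c_{1,1}; \ldots; c_{i,1}; \ldots; c_{I,1}]$ of the target component in the calibration samples, the column vector $\mathbf{b} = [b_1; \ldots; b_i; \ldots; b_I]$ can be obtained by the following procedure even if the pure spectra ($\mathbf{s}_j$, $j = 1, 2, \ldots, J$) are unavailable. Let $\mathbf{Z}_{\mathrm{base}}$ be a full-row-rank matrix assembled from $J$ spectra (row vectors) selected from $\mathbf{Z} = [\mathbf{z}_1; \ldots; \mathbf{z}_i; \ldots; \mathbf{z}_I]$, and let $\mathbf{Z}_{\mathrm{rest}}$ comprise the rest of the spectra (rows) in $\mathbf{Z}$. Likewise, $\mathbf{b}_{\mathrm{base}}$, $\mathbf{b}_{\mathrm{rest}}$, $\mathbf{c}_{1,\mathrm{base}}$, and $\mathbf{c}_{1,\mathrm{rest}}$ consist of the corresponding elements of $\mathbf{b}$ and $\mathbf{c}_1$ with the same values of index $i$ as the spectra in $\mathbf{Z}_{\mathrm{base}}$ and $\mathbf{Z}_{\mathrm{rest}}$, respectively. Each row in $\mathbf{Z}_{\mathrm{rest}}$ can be expressed as a linear combination of the rows in $\mathbf{Z}_{\mathrm{base}}$:

$$\mathbf{Z}_{\mathrm{rest}} = \mathbf{A}\mathbf{Z}_{\mathrm{base}}, \quad \mathbf{A} = \mathbf{Z}_{\mathrm{rest}}\mathbf{Z}_{\mathrm{base}}'(\mathbf{Z}_{\mathrm{base}}\mathbf{Z}_{\mathrm{base}}')^{-1} \qquad (6)$$

According to eq 5, a linear relationship exists between $\mathbf{z}_i$ and $b_i$ and also between $\mathbf{z}_i$ and $b_i c_{i,1}$. Therefore, the following equations hold:

$$\mathbf{b}_{\mathrm{rest}} = \mathbf{A}\mathbf{b}_{\mathrm{base}} \qquad (7)$$

$$\mathrm{diag}(\mathbf{c}_{1,\mathrm{rest}})\,\mathbf{b}_{\mathrm{rest}} = \mathbf{A}\,\mathrm{diag}(\mathbf{c}_{1,\mathrm{base}})\,\mathbf{b}_{\mathrm{base}} \qquad (8)$$

where $\mathrm{diag}(\mathbf{c}_{1,\mathrm{rest}})$ denotes the diagonal matrix whose diagonal elements are the elements of $\mathbf{c}_{1,\mathrm{rest}}$. Inserting eq 7 into eq 8 yields

$$\mathrm{diag}(\mathbf{c}_{1,\mathrm{rest}})\,\mathbf{A}\mathbf{b}_{\mathrm{base}} = \mathbf{A}\,\mathrm{diag}(\mathbf{c}_{1,\mathrm{base}})\,\mathbf{b}_{\mathrm{base}} \qquad (9)$$

Since there is no requirement to know the absolute value of $b_i$, the first element of $\mathbf{b}_{\mathrm{base}}$ can be assigned a value of unity. The rest of the elements of $\mathbf{b}_{\mathrm{base}}$ can be calculated using nonnegative least-squares regression. Given $\mathbf{b}_{\mathrm{base}}$, $\mathbf{b}_{\mathrm{rest}}$ can then be estimated according to the following equation, which is obtained by adding eqs 7 and 8:

$$\mathbf{b}_{\mathrm{rest}} = \{\mathrm{diag}(\mathbf{c}_{1,\mathrm{rest}}) + \mathbf{I}\}^{-1}\mathbf{A}\{\mathrm{diag}(\mathbf{c}_{1,\mathrm{base}}) + \mathbf{I}\}\mathbf{b}_{\mathrm{base}} \qquad (10)$$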
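The projection and the estimation of the path-length factors for one fixed choice of base spectra can be sketched as follows. This is a noise-free, two-component illustration under stated assumptions, not the authors' Matlab code: the function names, the synthetic spectra, and the hand-picked base indices are hypothetical, and ordinary least squares stands in for the nonnegative least-squares step named in the text.

```python
import numpy as np

def project_out_baseline(X, lam):
    """Remove offset, linear and quadratic wavelength terms from each row."""
    P = np.vstack([np.ones_like(lam), lam, lam**2])
    return X @ (np.eye(lam.size) - np.linalg.pinv(P) @ P)

def estimate_b(Z, c1, base_idx):
    """Estimate b (up to scale) for one given set of base spectra."""
    n = Z.shape[0]
    base_idx = np.asarray(base_idx)
    rest_idx = np.setdiff1d(np.arange(n), base_idx)
    Z_base, Z_rest = Z[base_idx], Z[rest_idx]
    c_base, c_rest = c1[base_idx], c1[rest_idx]
    # For full-row-rank Z_base, pinv(Z_base) = Z_base' (Z_base Z_base')^-1.
    A = Z_rest @ np.linalg.pinv(Z_base)
    # diag(c_rest) A b_base = A diag(c_base) b_base, written as M b_base = 0.
    M = c_rest[:, None] * A - A * c_base[None, :]
    b_base = np.ones(base_idx.size)          # first element fixed at unity
    # Ordinary least squares here; the text uses nonnegative least squares.
    sol, *_ = np.linalg.lstsq(M[:, 1:], -M[:, 0], rcond=None)
    b_base[1:] = sol
    # b_rest = {diag(c_rest) + I}^-1 A {diag(c_base) + I} b_base.
    b_rest = (A @ ((c_base + 1.0) * b_base)) / (c_rest + 1.0)
    b = np.empty(n)
    b[base_idx] = b_base
    b[rest_idx] = b_rest
    return b

# Synthetic two-component data (all values hypothetical).
rng = np.random.default_rng(1)
lam = np.linspace(-1.0, 1.0, 80)
s1 = np.exp(-((lam + 0.4) / 0.15) ** 2)
s2 = np.exp(-((lam - 0.3) / 0.20) ** 2)
c1 = np.array([0.0, 1.0, 0.5, 0.25, 0.75, 0.4, 0.6, 0.1])
n = c1.size
b_true = rng.uniform(0.6, 1.6, n)
a, d, e = (rng.uniform(-0.2, 0.2, n) for _ in range(3))
X_chem = np.outer(c1, s1) + np.outer(1.0 - c1, s2)
X = a[:, None] + b_true[:, None] * X_chem + d[:, None] * lam + e[:, None] * lam**2

Z = project_out_baseline(X, lam)
b_hat = estimate_b(Z, c1, base_idx=[0, 1])   # J = 2 base spectra
b_relative = b_true / b_true[0]              # b is only defined up to scale
```

Because the first element of the base vector is fixed at one, `b_hat` matches the simulated factors only after they are divided by the factor of the first base sample. Note also that a sample whose target concentration equals that of a base sample contributes a zero row to `M`, so at least some samples with other concentrations are needed to fix the ratio.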

The estimate of $\mathbf{b}$ is now obtained by rearranging the elements of $\mathbf{b}_{\mathrm{base}}$ and $\mathbf{b}_{\mathrm{rest}}$ in the appropriate order. Due to the presence of noise and possible interferences, different selections of $\mathbf{Z}_{\mathrm{base}}$ may yield slightly different estimates of $\mathbf{b}$. Therefore, in this paper, a set of matrices $\mathbf{Z}_{i,\mathrm{base}}$ ($i = 1, 2, \ldots, I$) is constructed by first selecting the $i$th spectrum $\mathbf{z}_i$ from $\mathbf{Z}$ and then sequentially adding a new spectrum that contains the most relevant spectral information not contained in the spectra already selected, until the number of spectra in $\mathbf{Z}_{i,\mathrm{base}}$ is equal to $J$ (the number of actual spectral variation sources, including chemical components and possible interferences in the samples). Each $\mathbf{Z}_{i,\mathrm{base}}$ produces an estimate of $\mathbf{b}$ ($\mathbf{b}_i$), and the average over all the $\mathbf{b}_i$ ($i = 1, 2, \ldots, I$) is taken as the estimate of $\mathbf{b}$. Having calculated $\mathbf{b}$, the following two calibration models can be built using a multivariate linear calibration method such as PLS:

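$$\mathrm{diag}(\mathbf{c}_1)\,\mathbf{b} = [\mathbf{1}_c, \mathbf{Z}]\boldsymbol{\beta}_1, \quad \mathbf{b} = [\mathbf{1}_c, \mathbf{Z}]\boldsymbol{\beta}_2 \qquad (11)$$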

where $\mathbf{1}_c$ is a column vector with its elements equal to unity. The two estimated regression vectors $\boldsymbol{\beta}_1$ and $\boldsymbol{\beta}_2$ can then be used to correct for the effect of multiplicative light scattering on the concentration predictions of the target component in any test sample as follows:

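$$\mathbf{z}_{\mathrm{test}} = \mathbf{x}_{\mathrm{test}}(\mathbf{I} - \mathbf{P}^{+}\mathbf{P}), \quad b_{\mathrm{test}}c_{1,\mathrm{test}} = [1, \mathbf{z}_{\mathrm{test}}]\boldsymbol{\beta}_1, \quad b_{\mathrm{test}} = [1, \mathbf{z}_{\mathrm{test}}]\boldsymbol{\beta}_2, \quad c_{1,\mathrm{test}} = \frac{[1, \mathbf{z}_{\mathrm{test}}]\boldsymbol{\beta}_1}{[1, \mathbf{z}_{\mathrm{test}}]\boldsymbol{\beta}_2} \qquad (12)$$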
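The dual-model prediction of eqs 11 and 12 can be sketched on synthetic, noise-free data. Several choices here are assumptions made for the illustration: a truncated least-squares fit (`numpy.linalg.lstsq` with an explicit `rcond`) replaces PLS, the simulated path-length factors are used as if they had already been estimated, and all spectra and function names are hypothetical.

```python
import numpy as np

def with_intercept(Z):
    """Build [1, Z]: prepend a column of ones to the projected spectra."""
    return np.column_stack([np.ones(Z.shape[0]), Z])

def fit_dual_models(Z_cal, c1_cal, b_cal, rcond=1e-10):
    """Eq 11, with truncated least squares standing in for PLS."""
    D = with_intercept(Z_cal)
    # rcond discards numerically negligible directions of [1, Z].
    beta1, *_ = np.linalg.lstsq(D, c1_cal * b_cal, rcond=rcond)
    beta2, *_ = np.linalg.lstsq(D, b_cal, rcond=rcond)
    return beta1, beta2

def predict_c1(Z_test, beta1, beta2):
    """Eq 12: ratio of the two model outputs cancels the factor b_test."""
    D = with_intercept(Z_test)
    return (D @ beta1) / (D @ beta2)

def simulate(c1, rng, lam, s1, s2):
    """Two-component spectra with random scatter terms, then projection."""
    n = c1.size
    b = rng.uniform(0.6, 1.6, n)
    a, d, e = (rng.uniform(-0.2, 0.2, n) for _ in range(3))
    X_chem = np.outer(c1, s1) + np.outer(1.0 - c1, s2)
    X = a[:, None] + b[:, None] * X_chem + d[:, None] * lam + e[:, None] * lam**2
    P = np.vstack([np.ones_like(lam), lam, lam**2])
    Z = X @ (np.eye(lam.size) - np.linalg.pinv(P) @ P)
    return Z, b

# Hypothetical pure spectra and concentrations.
rng = np.random.default_rng(2)
lam = np.linspace(-1.0, 1.0, 80)
s1 = np.exp(-((lam + 0.4) / 0.15) ** 2)
s2 = np.exp(-((lam - 0.3) / 0.20) ** 2)
c1_cal = np.array([0.0, 1.0, 0.5, 0.25, 0.75, 0.4, 0.6, 0.1])
c1_test = np.array([0.2, 0.55, 0.9])

Z_cal, b_cal = simulate(c1_cal, rng, lam, s1, s2)
Z_test, _ = simulate(c1_test, rng, lam, s1, s2)   # b_test is never used

beta1, beta2 = fit_dual_models(Z_cal, c1_cal, b_cal)
c1_pred = predict_c1(Z_test, beta1, beta2)
```

Since the prediction is a ratio, any common scale factor in `b_cal` cancels, which is why knowing the factors only up to scale is sufficient.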

Case Studies. The effectiveness of the OPLEC algorithm in removing the spectral variations related to multiplicative light scattering was tested through its application to two sets of near-infrared spectroscopic data. The first data set was obtained from mixtures of wheat gluten and starch powder (ref 11), and the second was generated from single wheat seeds (ref 10). Since these two data sets have previously been analyzed by the EMSC2 and EISC1 algorithms, respectively, the performance of the OPLEC algorithm was compared against them.

Powder Mixture Data: Near-Infrared Transmittance Spectra of Powder Mixtures. The powder mixture data set consists of 100 near-infrared transmittance spectra (between 850 and 1050 nm) of five mixtures of gluten and starch powder with different weight ratios (1:0, 0.75:0.25, 0.5:0.5, 0.25:0.75, 0:1). For each of the five powder mixtures, five samples were randomly taken and loosely packed into five different glass cuvettes. Two consecutive transmittance spectra were recorded for each sample. Following this, each sample was packed more firmly, and a further two consecutive transmittance spectra were recorded, resulting in a total of 100 spectra. Each of the 100 transmittance spectra was transformed into an absorbance spectrum. Sixty spectra from the three mixtures with a gluten/starch ratio of 1:0, 0.5:0.5, and 0:1 formed the calibration data set. The test set comprised the remaining 40 spectra from the other two mixtures. More experimental details are given in the original paper of Martens et al. (ref 11).

Wheat Kernel Data: Near-Infrared Transmittance Spectra of Single Wheat Seeds. Pedersen et al. (ref 10) utilized a near-infrared transmittance spectral data set for the determination of the protein content of wheat kernels to demonstrate the performance of their proposed EISC1 algorithm. The data set was assembled from 523 single-kernel transmittance spectra (in the range of 850-1050 nm) of wheat kernels from three different locations. A total of 415 spectra of wheat kernels from two locations formed the basis of the calibration data set. The remaining 108 spectra of wheat kernels from the third location made up the test data set. The test samples were stored for approximately two additional months before the measurements were recorded. All the transmittance spectra were converted into absorbance spectra prior to the analysis. Experimental details can be found in their paper, and the data are available at http://www.models.kvl.dk/research/data/.

Data Analysis. For the aforementioned two data sets, PLS regression was used to build the calibration models between the mean-centered concentrations of the target components (gluten in the powder mixture data and protein in the wheat seed data) and the corresponding mean-centered raw or preprocessed near-infrared spectra. The root-mean-square error of prediction for the test data set (RMSEP_test) is used as the performance criterion to assess the predictive ability of the resulting PLS models. All preprocessing and construction of multivariate calibrations were performed on a Pentium computer using Matlab version 6.5 (The Mathworks, Inc.).

RESULTS AND DISCUSSION

Powder Mixture Data. The 20 replicates of the near-infrared absorbance spectra for each of the 5 mixtures of gluten and starch powder are displayed in Figure 1. From the perspective of calibration, it would be expected to see five separate groups of spectra. However, due to the presence of multiplicative light scattering caused by the changes in the optical path length, it can be observed that the 20 spectra from the same mixture differ significantly. PLS was then used to build a calibration model between the concentrations of the gluten in the powder mixtures and the corresponding raw spectra. Examining the results from the leave-one-out cross-validation root-mean-square error of prediction for the calibration data set (not shown) indicated that a PLS model with nine components provided the best predictive ability for the raw data.
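The performance criterion named in the Data Analysis paragraph can be written out explicitly. The paper does not spell out the formula, so the sketch below assumes the usual definition of a root-mean-square error of prediction; the function name and the example values are hypothetical and are not taken from the data sets.

```python
import math

def rmsep(y_true, y_pred):
    """Root-mean-square error of prediction (usual definition, assumed here)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical reference and predicted weight fractions for four test spectra.
reference = [0.25, 0.75, 0.25, 0.75]
predicted = [0.26, 0.74, 0.24, 0.76]
example_rmsep = rmsep(reference, predicted)
```

The value is expressed in the same units as the reference concentrations, so results for the powder data (weight fractions) and the wheat data (percent protein) are not directly comparable without the relative errors quoted in the text.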
The predictions of the optimal PLS model are not as good as expected: its RMSEP_test is 0.024, which is equivalent to a relative predictive error of 6.08%. This result clearly demonstrates that even when the number of PLS components used is sufficiently large, PLS cannot model the raw spectra satisfactorily when the true behavior is masked by the effects of multiplicative light scattering. With a view to separating the spectral variations caused by multiplicative light scattering, as a consequence of the changes in the physical properties of the samples, from those resulting from the chemical components, the preprocessing techniques of OPLEC, EISC1, and EMSC2 were applied to the raw spectra given in Figure 1. The number of spectra used to construct the matrices $\mathbf{Z}_{i,\mathrm{base}}$ in OPLEC was taken to be two, i.e., the number of chemical components in the powder mixtures. As suggested in the original paper of Martens et al. (ref 11), spectra $\mathbf{x}_3$ and $\mathbf{x}_{93}$ were used in EMSC2 as the pure spectra of gluten and starch, respectively. Additionally, to investigate the possible influence of the choice of pure spectra on the performance of EMSC2, the mean spectra of the 20 replicates of pure gluten samples and pure starch samples were also considered as the input pure spectra. For the convenience of comparison, both sets of pure spectra were normalized. Hereafter, EMSC2 with the two different selections of input pure spectra will be referred to as EMSC2(3,93) and EMSC2(mean).

Figure 1. Raw absorbance spectra of the five mixtures of gluten and starch powder with different weight ratios (black lines, 1:0; blue lines, 0.75:0.25; red lines, 0.5:0.5; green lines, 0.25:0.75; cyan lines, 0:1).

Figure 2 shows the spectra preprocessed by OPLEC, EMSC2(3,93), EMSC2(mean), and EISC1. The spectra preprocessed by all four methods exhibit distinct spectral patterns for each of the five powder mixtures. The 20 replicates for each mixture are more or less indistinguishable. However, this does not necessarily mean that all the methods provided satisfactory results. From a calibration perspective, the variations in the five spectral patterns should also correctly reflect the variations in the concentrations of the components within the five powder mixtures. From Figure 2, it can be seen that the spectra preprocessed by OPLEC (Figure 2a) and EMSC2(3,93) (Figure 2b) maintain the expected equal spacing between the neighboring spectral patterns. Although the results of EMSC2(mean) (Figure 2c) are similar to those of EMSC2(3,93) (Figure 2b), the less symmetric spacing between neighboring spectral patterns indicates that the choice of input pure spectra does affect the performance of EMSC2. In contrast, the spectra preprocessed by EISC1 (Figure 2d) do not contain the correct information about the concentrations of the components in the powder mixtures. The unequal spacing that is clearly evident between neighboring spectral patterns indicates that either some of the chemical information has been incorrectly removed along with the effects of multiplicative light scattering or the effects of multiplicative light scattering have not been effectively corrected for. Either case will affect the performance of the calibration models built from EISC1 preprocessed spectra.

Figure 2. Calibration (blue solid lines) and test (red dotted lines) spectra of powder mixtures preprocessed by different methods: (a) OPLEC (the parameters $b_j$ for the test spectra were determined by a two-component PLS model), (b) EMSC2(3,93), (c) EMSC2(mean), and (d) EISC1.

Figure 3 confirms the above hypothesis. PLS models built on the calibration spectra preprocessed by EISC1 gave unacceptable predictions, with errors even larger than those attained from the PLS model generated from the raw calibration spectra. The failure of EISC1 on this powder mixture data suggests that, like its predecessor ISC, it is not suitable for samples with significant spectral variations resulting from changes in the chemical composition. EMSC2 can greatly improve the predictive accuracy of the PLS models. However, the significant difference between the predictions provided by the PLS models built on the calibration spectra preprocessed by EMSC2(mean) and EMSC2(3,93) further demonstrates that the choice of input pure spectra is crucial to its performance. The application of OPLEC offers almost the same (if not better) improvement in the predictive ability of the PLS models as does EMSC2(3,93). Both methods attained the same minimal RMSEP_test (0.005) but with slightly different numbers of PLS components. It is worth noting that, after preprocessing by OPLEC, a two-component PLS model provided excellent predictive results with an RMSEP_test equal to 0.008, which is equivalent to a relative error of 1.7%, while the corresponding RMSEP_test of EMSC2(3,93) is 0.013, i.e., 2.5% in terms of the relative error. Considering the fact that OPLEC does not use the pure spectra of the chemical components in the mixture samples as EMSC2(3,93) does, such a result is encouraging.

Figure 3. Predictive performance of the PLS models built on the calibration spectra of the powder mixtures preprocessed by different methods (black circle, the raw spectra; blue triangle up, OPLEC; yellow diamond, EMSC2(3,93); green square, EMSC2(mean); red triangle down, EISC1).

Wheat Kernel Data. Figure 4 shows the raw spectra for the calibration set of wheat kernel data. The spectra cover the wavelength region 850-1050 nm.
This region contains the second overtones of the O-H stretching vibration (980-1010 nm) in carbohydrates and water, the second overtone of the N-H stretching vibration (near 1010 nm) in the secondary amides (protein), and the third overtones of aliphatic C-H stretching vibrations (833-880 nm). Consequently, spectral variations relating to the concentration variations of the chemical components such as water, starch, and protein in the samples are expected to be observed. However, such spectral variations are dominated by the effects of multiplicative light scattering as a result of the variations in the optical path length that occurred during the recording of each measurement. PLS was initially applied to implicitly compensate for the light scattering variations and model the relationship between the raw spectra and the corresponding protein content. To ensure a robust PLS model, the 415 calibration samples, with protein contents ranging from 7.0 to 17.0%, were sorted in ascending order by their protein content and then split into 10 consecutive cross-validation segments, with each of the first 9 segments consisting of 42 samples. The cross-validated root-mean-square error of prediction for the calibration data set attained a minimal value (RMSE_cv, 0.64%) for 11 PLS components. The predictive results of the 11-component PLS model are presented in Figure 5. The RMSEP_test is of the order of 0.70%, which corresponds to an average relative predictive error of 5.9%. Clearly, the implicit modeling approach did not give satisfactory predictions for the test samples, especially those with relatively low protein content. Explicit multiplicative light scattering correction methods are therefore applied to improve the results.

Figure 4. The 415 raw spectra for the calibration set of wheat kernel data.

Figure 5. Predictive results for the 11-component PLS model built from the raw calibration spectra of the wheat kernel data. Diagonal line, theoretically correct predictions.

Since the pure spectra of the chemical components in the wheat seeds are unavailable, EMSC2 cannot be applied to this data set; hence, only OPLEC and EISC1 are discussed. For this data set, the number of spectra ($J$) included in the matrix $\mathbf{Z}_{i,\mathrm{base}}$ of OPLEC is determined by cross-validation. The RMSE_cv profiles for different values of $J$ are shown in Figure 6, and it can be concluded that the optimal value is between 9 and 10. To ensure that all possible sources of spectral variation are captured by $\mathbf{Z}_{i,\mathrm{base}}$, $J$ was set to 10.

Figure 6. RMSE_cv versus number of PLS components for the calibration spectra of the wheat kernel data preprocessed by OPLEC using $\mathbf{Z}_{i,\mathrm{base}}$ with a different number of spectra (dotted line, 7; long dash line, 8; solid line, 9; dash-dot line, 10; dash-dot-dot line, 11; short dash line, 12).

The calibration and test spectra transformed by OPLEC are shown in Figure 7a and b, respectively. It can be seen that the large additive offset and multiplicative scaling effects have been effectively removed. The transformed spectra maintain the characteristics of the near-infrared spectra, with significant spectral variations relating to the concentration variations of the chemical components (water, starch, protein) in the samples being readily observed. Furthermore, the magnitude of the transformed test spectra is smaller than that of the transformed calibration spectra. This may reflect the possible loss of water content in the test samples during the additional storage period. Compared with the spectra preprocessed by OPLEC, the calibration and test spectra transformed by EISC1 (Figure 7c and d) bear little similarity. This is an indication that the application of EISC1 may have removed some of the spectral variation relating to the chemical components. This loss of chemical information is a consequence of the fact that the empirical model of EISC1 contains no term that explicitly relates to the concentration of the chemical components.

Tenfold cross-validation was used to determine the optimal number of components to include in the PLS models using the resulting preprocessed OPLEC and EISC1 spectral data. The RMSE_cv profiles recommended an 11-component PLS model for OPLEC and a 9-component PLS model for EISC1 (not shown). Figure 8 compares the predictive performance of the two resulting PLS models. Overall, OPLEC and EISC1 achieve similar performance with respect to the accuracy of the predictions, i.e., 0.40 and 0.42%, respectively. In terms of the relative prediction error, both methods give the same value, 3.3%, which is significantly lower than that obtained using the raw spectra (5.9%). Although both EISC1 and OPLEC have effectively improved the quality of the predictions for protein content, the reasons for their success differ. The improvement in the predictions does not necessarily mean that EISC1 can effectively separate chemical information from the effects of multiplicative light scattering.
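The cross-validation split used for the wheat kernel calibration set (samples sorted by protein content, then cut into 10 consecutive segments with 42 samples in each of the first 9) can be sketched as below. The function name and the stand-in protein values are hypothetical; only the sample count, the number of segments, and the segment size come from the text.

```python
def sorted_consecutive_segments(values, n_segments=10, segment_size=42):
    """Sort sample indices by value and cut them into consecutive segments.

    The first n_segments - 1 segments hold segment_size indices each;
    the last segment takes whatever remains.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    segments = [
        order[k * segment_size:(k + 1) * segment_size]
        for k in range(n_segments - 1)
    ]
    segments.append(order[(n_segments - 1) * segment_size:])
    return segments

# Stand-in protein contents for 415 samples, spread over 7.0-17.0%.
# The modular step only shuffles the order; these are not the real values.
protein = [7.0 + 10.0 * ((i * 37) % 415) / 414 for i in range(415)]
segments = sorted_consecutive_segments(protein)
segment_sizes = [len(s) for s in segments]
```

With 415 samples, the first 9 segments account for 378 samples and the tenth holds the remaining 37. Because the segments are consecutive in sorted order, each held-out segment covers a narrow band of protein content, so the lowest and highest segments are predicted by extrapolation.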
For EISC1, it was already observed from the transformed calibration and test spectra (Figure 7c and d) that there had been a loss of chemical information during the preprocessing procedure. One possible reason for its success on this particular data set is that the “lost” chemical information mainly related to other components such as starch and water, rather than protein, and most of the spectral variation relating to protein was retained in the transformed spectra. This speculation is partly supported by the fact that no obvious spectral variations due to the second overtones of O-H stretching vibrations are observed at 980-1000 nm, while the possible contributions of the second overtone of N-H stretching vibrations in protein are readily observable at ∼1010 nm in Figure 7c and d. Thus, if the target component were water or starch instead of protein, the results might well have differed. In contrast, as demonstrated in Figure 7a and b, no chemical information has been removed by OPLEC along with the multiplicative light scattering effects. Hence, the good predictive results reinforce the effectiveness of OPLEC in terms of correcting the spectral variation due to changes in the physical properties.

Figure 7. Transformed wheat kernel data. OPLEC preprocessing: (a) mean-centered calibration data; (b) mean-centered test spectra. EISC1 preprocessing: (c) mean-centered calibration data; (d) mean-centered test spectra.

Figure 8. Predictive results of the optimal PLS models built on the calibration spectra of the wheat kernel data preprocessed by OPLEC (red cross) and EISC1 (blue circle). Diagonal line, theoretically correct predictions.

CONCLUSIONS

Without using any prior spectroscopic knowledge, OPLEC succeeded in separating the physical light scattering effects from the spectral variations related to the chemical components, and hence, the prediction accuracy of the calibration models for both the powder mixture data and the wheat kernel data was significantly enhanced. Unlike other existing multiplicative light scattering correction methods, which are only applicable when the pure spectra of all chemical components in the samples are available or the effect of chemical variations among the spectra is negligible, OPLEC places no additional information requirements on the spectral data. Consequently, OPLEC potentially has wider applicability than the existing methods reported in the literature.

ACKNOWLEDGMENT

The authors thank Professor Harald Martens at MATFORSK/Norwegian Food Research Institute for the powder mixture data.

The authors also acknowledge the financial support of the EPSRC grant GR/R19366/01 (KNOW-HOW) and GR/R43853/01 (Chemicals Behaving Badly II).

Received for review June 4, 2006. Accepted September 12, 2006. AC0610255 Analytical Chemistry, Vol. 78, No. 22, November 15, 2006
