Anal. Chem. 1999, 71, 557-565

Multivariate Sensitivity for the Interpretation of the Effect of Spectral Pretreatment Methods on Near-Infrared Calibration Model Predictions

Nicolaas (Klaas) M. Faber

Netherlands Forensic Science Laboratory, Volmerlaan 17, NL-2288 GD Rijswijk, The Netherlands

Predictions obtained from a multivariate calibration model are sensitive to variations in the spectra such as baseline shifts, multiplicative effects, etc. Many spectral pretreatment methods have been developed to reduce these distortions, and the best method is usually the one that minimizes the prediction error for an independent test set. This paper shows how multivariate sensitivity can be used to interpret spectral pretreatment results. Understanding why a particular pretreatment method gives good or bad results is important for ruling out chance effects in the conventional process of "trial and error", thus obtaining more confidence in the finally selected model. The principles are exemplified using the transmission near-infrared spectroscopic prediction of oxygenates in ampules of the standard reference material gasoline. The pretreatment methods compared are the multiplicative signal correction, first-derivative method, and second-derivative method. It is shown that for this application the first- and second-derivative methods are successful in removing the background. However, differentiating the spectra substantially reduces the multivariate net analyte signal (in the worst case by a factor of 21). Consequently, a significantly smaller multivariate sensitivity is obtained, which leads to increased spectral error propagation, resulting in a larger uncertainty in the regression vector estimate and larger prediction errors. Differentiating the spectra also increases the spectral noise (each time by a factor of 2^(1/2)), but this effect, which is well-known, is of minor importance for the current application.

In applications of multivariate calibration methods such as principal component regression (PCR) or partial least squares (PLS), one assumes a certain relationship to hold between the multivariate instrument responses ("predictors") and the analyte concentration ("predictand"). However, nonideal experimental circumstances may lead to a fluctuating contribution to the spectra, which poses a serious problem to the applicability of the underlying principles. Therefore, several procedures have been proposed in the past to eliminate or reduce these spectral variations. The following overview of recent applications of infrared (IR) and near-infrared (NIR) spectroscopy shows that the "optimum" data pretreatment method is highly dependent on the specific application at hand.


A number of data preprocessing methods for NIR reflectance spectra were evaluated for the determination of the active compound in a pharmaceutical preparation, identifying the descending ranking: second-derivative method, normalization, first-derivative method, multiplicative signal correction (MSC),1 standard normal variate, and detrending.2 Ten gasoline properties have been predicted using NIR, with a weighted moving average algorithm performing better than the second-derivative method.3 Methanol and methyl tert-butyl ether (MTBE) content in gasoline has been determined using mid-IR spectroscopy, with two-point baseline correction performing better than the second-derivative method or smoothing.4 Good results have been reported for prediction of MTBE content in gasoline using just mean-centered NIR data (for IR and Raman data, variance scaling was also applied prior to PLS analysis).5 Twenty gasoline properties (including oxygenate content) have been predicted using NIR, with the second-derivative method reported to be slightly superior to the first-derivative method.6

An obvious explanation for the variable results obtained by spectral pretreatment lies in the fact that individual methods rely on different assumptions about the structure of the spectral distortion. In addition, as detailed in this paper, spectral pretreatment may also change the systematic part of the data. Depending on the shape of the spectra, this may happen in an advantageous or disadvantageous manner. Thus, successful application of a particular method cannot be guaranteed in a specific situation and may well be a matter of "trial and error". In practice, the best spectral pretreatment method usually is the one that minimizes the prediction error for an independent test set. However, assessing the predictive ability of a model with reasonable certainty requires a sufficiently large test set, which may not always be available. This is evident from the large scatter in plots of root-mean-squared error of prediction (RMSEP) versus model dimensionality often presented in the literature. Here it is proposed to additionally characterize the resulting calibration models using multivariate analytical figures of merit such as the sensitivity.

(1) Geladi, P.; McDougall, D.; Martens, H. Appl. Spectrosc. 1985, 39, 491.
(2) Blanco, M.; Coello, J.; Iturriaga, H.; Maspoch, S.; de la Pezuela, C. Appl. Spectrosc. 1997, 51, 240.
(3) Litani-Barzilai, I.; Sela, I.; Bulatov, V.; Zilberman, I.; Schechter, I. Anal. Chim. Acta 1997, 339, 193.
(4) Garcia, F. X.; De Lima, L.; Medina, J. C. Appl. Spectrosc. 1993, 47, 1036.
(5) Cooper, J. B.; Wise, K. L.; Welch, W. T.; Bledscoe, R. R.; Sumner, M. B. Appl. Spectrosc. 1996, 50, 917.
(6) Swarin, S. J.; Drumm, C. A. Spectroscopy 1992, 7, 42.


Monitoring these figures of merit may lead to a better understanding of what a pretreatment method actually achieves in terms of "cleaning up" the spectra. Several researchers have suggested that this approach leads to improved insight, which could enable more robust decision making concerning critical steps in the construction of the model.7

Multivariate analytical figures of merit have been introduced to analytical chemistry by Lorber.8 The key concept in his framework is the definition of net analyte signal (NAS) as the part of the gross response that is useful for calibration. All analytical figures of merit, e.g., sensitivity, selectivity, and signal-to-noise ratio (SNR), follow in straightforward fashion as natural generalizations of the univariate case. Recently, a number of papers have been published in this journal that deal with NAS calculation in inverse calibration models.9-11 In inverse calibration models, the spectra are the predictors rather than the predictand as in the classical model. This is a more flexible setup of the data since it allows one to calibrate for individual constituents without requiring explicit knowledge about the interferences. (Lorber's original theory8 was developed for the classical model.) While Lorber et al.9 focused on general methodology, Wentzell et al.10 demonstrated the utility of the NAS concept for evaluating calibration method performance and Xu and Schechter11 reported successful wavelength selection.

It is important to note that different definitions exist for the multivariate sensitivity. However, since these definitions are closely related,12 the same conclusions should be arrived at when accounting for these differences. Here the original definition of Lorber is preferred because the relationship between Lorber's sensitivity and prediction error variance has the same mathematical form as for univariate calibration.13-15 This fact enables us, for example, to represent the multivariate calibration model as a pseudounivariate calibration graph and visualize the propagation of spectral errors as a simple multiplication with the "inverse" sensitivity, which is the reciprocal of the sensitivity.

The principles are exemplified using the transmission NIR spectroscopic prediction of the oxygenates MTBE, ethanol (EtOH), and water in ampules of standard reference material (SRM) gasoline (unit mass fraction % O). Variations in refraction and reflection induce baseline shifts in the spectra. Additionally, variations in the diameter and wall thickness of the ampules cause variations in the path length, which is a multiplicative effect. Spinning the ampules reduces the baseline shifts to a large extent.16,17 However, subtle variations remain, which require the use of some form of mathematical correction technique before the construction of the model can proceed. The pretreatment methods compared in the current work are the MSC, the first-derivative method, and the second-derivative method.

(7) Personal communication with B. R. Kowalski (1995) and J. H. Kalivas (1997).
(8) Lorber, A. Anal. Chem. 1986, 58, 1167.
(9) Lorber, A.; Faber, N. M.; Kowalski, B. R. Anal. Chem. 1997, 69, 1620.
(10) Wentzell, P. D.; Andrews, D. T.; Kowalski, B. R. Anal. Chem. 1997, 69, 2299.
(11) Xu, L.; Schechter, I. Anal. Chem. 1997, 69, 3722.
(12) Kalivas, J. H.; Lang, P. M. Chemom. Intell. Lab. Syst. 1996, 32, 135.
(13) Faber, N. M.; Lorber, A.; Kowalski, B. R. J. Chemom. 1997, 11, 419.
(14) Faber, N. M.; Kowalski, B. R. J. Chemom. 1997, 11, 181.
(15) Faber, N. M.; Kowalski, B. R. Chemom. Intell. Lab. Syst. 1996, 34, 283.
(16) Choquette, S. J.; Chesler, S. N.; Duewer, D. L.; Wang, S.; O'Haver, T. C. Anal. Chem. 1996, 68, 3525.
(17) Faber, N. M.; Duewer, D. L.; Choquette, S. J.; Green, T. L.; Chesler, S. N. Anal. Chem. 1998, 70, 2972; Erratum, Anal. Chem. 1998, 70, 4877.


The calibration method used is PCR; the results obtained using PLS are very similar and will not be shown in detail. Finally, it is emphasized that this paper is not concerned with finding the best data pretreatment method for the current application. The purpose of this paper is to provide diagnostics for assessing the performance of any pretreatment method. (Neither is it attempted here to find the best calibration method, or the optimum splitting of the samples into calibration and test sets, for that matter.) Previously reported results16,17 were all obtained using the MSC technique. De Noord18 and Sum and Brown,19 among others, have shown that often a combination of methods works better. (Moreover, the best method may even depend on the constituent of interest.19) Several combinations of methods have been tested, and no (significant) improvement over "simple" MSC was observed.

THEORY

Construction of the Model. Assuming a linear relationship to hold between analyte concentrations and spectral data leads to the model equation,

c = Rb + e    (1)

where c (n × 1) is the vector of true analyte concentrations, R (n × p) is the matrix of true spectral responses, b (p × 1) is the unknown regression vector, e (n × 1) is a vector of residuals, n is the number of calibration samples, and p is the number of wavelengths. Throughout this paper it will be assumed that the data (spectra and concentrations) are mean-centered. In most experimental situations, the true analyte concentrations and spectral responses are not available owing to measurement errors. Consequently, the regression vector is estimated using the measured analyte concentrations ("laboratory values"), c̃ = c + Δc, and spectral responses, R̃ = R + G + ΔR, where the tilde indicates measured rather than true quantities, the prefix Δ denotes a random measurement error in the associated quantity, and G is the matrix containing spectral variations other than random noise. The regression vector obtained for A-dimensional PCR is given by

b̂_A = (R̃^t R̃)_A^- R̃^t c̃    (2)

where the ^ indicates estimated (or predicted) quantities, subscript A indicates the model dimensionality, superscript − signifies a generalized inverse, and superscript t signifies matrix or vector transposition. Optional data pretreatment prior to parameter estimation (and prediction) is implied in the notation. Use of a generalized inverse is a convenient way of dealing with nonsquare or rank-deficient predictor matrices. Briefly, A scores are constructed as linear combinations of the p original variables. Then the regression vector estimate is obtained by an ordinary least-squares (OLS) step where the original variables are replaced by the scores. PCR and PLS (and other generalized inverse-based methods) only differ in the criterion that is used to construct the scores.

(18) de Noord, O. E. Chemom. Intell. Lab. Syst. 1994, 23, 65.
(19) Sum, S. T.; Brown, S. D. Appl. Spectrosc. 1998, 52, 869.
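As a concrete illustration of eq 2 and of the score/OLS description above, the following is a minimal sketch of the A-dimensional PCR estimate via a truncated singular value decomposition. It assumes NumPy, and the names `R`, `c`, and `A` are hypothetical placeholders for the mean-centered calibration spectra, concentrations, and the chosen dimensionality; it is not the author's Matlab implementation.

```python
import numpy as np

def pcr_regression_vector(R, c, A):
    """A-dimensional PCR estimate of the regression vector (eq 2).

    R : (n, p) mean-centered calibration spectra
    c : (n,)   mean-centered reference concentrations
    A : number of principal components retained
    """
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    T = U[:, :A] * s[:A]                      # scores (n x A)
    q = np.linalg.lstsq(T, c, rcond=None)[0]  # OLS step on the scores
    return Vt[:A].T @ q                       # rotate back to the p wavelengths

# e.g.: b_hat = pcr_regression_vector(R_tilde, c_tilde, A=12)
```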

It is worth noting that in inverse calibration one often has more variables than calibration samples (p > n) and the situation is described as "underdetermined". However, the OLS step on the scores requires that n > A, i.e., the data should be "overdetermined" with respect to the number of scores that is needed to obtain a predictive model. This condition is usually met in practice.

The uncertainties in the regression coefficient estimates (variances as well as covariances) are collected in the covariance matrix. If the percentage explained variance (% VAR) of the spectral matrix is close to 100, an OLS-type expression is valid for PCR,14,15 i.e.,

V(b̂_A) ≈ (R̃^t R̃)_A^- [‖b_A‖_E^2 σ_ΔR^2 + σ_e^2 + σ_Δc^2]    (3)

where V(b̂_A) is the covariance matrix of the regression vector estimate, ‖·‖_E symbolizes the Euclidean norm, and σ^2 denotes the variance in the associated quantity. Some remarks are in order here. The OLS-type expression follows from a so-called zeroth-order approximation, which has been shown to be adequate for the application at hand.17 Moreover, in a strict sense the errors must be homoscedastic for eq 3 to be valid, which is not a realistic assumption for NIR spectra. However, the purpose of this paper is to compare the effect of different pretreatment methods. Since this is done by looking at relative changes, it is reasonable to assume that differences obtainable for specific pretreatments by using the "correct" expressions will cancel out. (Complicated expressions have been derived for PCR and PLS that account for heteroscedastic and correlated errors.14) The first factor on the right-hand side is a matrix that carries the information about the collinearity of the spectra in A-dimensional principal component space; its diagonal elements are often referred to as variance inflation factors (VIFs) in full-rank OLS (p < n).12 It is seen that the propagation of the spectral measurement error amounts to an additional multiplication with the norm of the regression vector. The last two variances within brackets are associated with the analyte concentration, i.e., one for the residuals (difference between true values and their expectation) and one for the measurement error. Often the two are not distinguished, which is not correct. Their influence is the same on the uncertainty in the regression coefficient estimates but very different on the prediction error variance (see below).

Prediction. In analogy to eq 1, the true analyte concentration in the unknown sample, c_u, is given by

c_u = c̄ + r_u^t b + e_u    (4)

where c̄ is the true mean analyte concentration for the calibration set, r_u is the true (mean-centered) unknown sample spectrum, e_u is the unknown sample residual, and the subscript u indicates an unknown sample. (The usual convention of making the mean centering of the analyte concentration explicit in the prediction step is followed here.) Given the estimated regression vector, the analyte concentration is predicted as

ĉ_u = c̄̃ + r̃_u^t b̂_A    (5)

where ĉ_u is the predicted analyte concentration, c̄̃ is the measured mean analyte concentration for the calibration set, r̃_u = r_u + g_u + Δr_u is the measured (mean-centered) spectral vector for the unknown sample, and g_u is the vector containing spectral variations other than random noise. The prediction error (PE) is defined as the difference between the prediction and the true value:

PE_u ≡ ĉ_u − c_u    (6)

Using the zeroth-order approximation, its variance, V(PE_u), is given by

V(PE_u) ≈ (1/n + h_u)[‖b_A‖_E^2 σ_ΔR^2 + σ_e^2 + σ_Δc^2] + ‖b_A‖_E^2 σ_Δr_u^2 + σ_e_u^2    (7)

where h_u is the so-called leverage of the unknown sample. The leverage indicates how close the unknown sample is to the calibration samples in A-dimensional principal component space. The first term on the right-hand side is due to the uncertainty in the model whereas the last two terms are associated with the prediction sample. The model term consists of two parts: the 1/n term for estimating the model center and the leverage term for estimating the regression coefficients. It is observed that the measurement error in the analyte concentrations only affects the model term; no measurement is made, in principle, for the unknown sample. (The whole idea of calibration is to "replace the reference method".) It is emphasized that spectral pretreatment influences the terms associated with the spectral measurement error only. In addition, the model term is usually much smaller than the unknown sample contribution to eq 7. The reason for this is that errors associated with the calibration samples average out to a large extent. The remainder of the paper therefore concentrates on the contribution of the unknown sample spectrum to eq 7. It is further assumed that the errors originate from the same distribution for calibration and prediction samples, i.e., σ_Δr_u^2 = σ_ΔR^2. Recently, Berger and Feld20 have reported on an application where this term dominates (however, for their data, σ_Δr_u^2 >> σ_ΔR^2). It is seen that the measurement errors in the unknown sample spectrum propagate in the same way as measurement errors in the calibration spectra. An interpretation in terms of multivariate sensitivity is given below.

Multivariate Analytical Figures of Merit. Using the concept of multivariate net analyte signal, the multivariate model can be represented as a pseudounivariate calibration function.13 First, the NAS vector is calculated according to one of the schemes presented in the literature.9-11 (A more economic algorithm is proposed in ref 21.) Next, this vector is converted into a scalar by taking the Euclidean norm. The result is that eq 4 can be rewritten as

c_u = c̄ + r_u^* b + e_u    (8)

where r_u^* denotes the scalar NAS and b is the Euclidean norm of the regression vector, i.e.,

(20) Berger, A. J.; Feld, M. S. Appl. Spectrosc. 1997, 51, 725.
(21) Faber, N. M. Anal. Chem., in press.


b = ‖b‖_E    (9)

This quantity is related to the multivariate sensitivity, s, as

b = 1/s    (10)

which shows that b is conveniently designated as the "inverse" sensitivity. Using these concepts it becomes clear that the terms associated with the spectral measurement errors in eq 7 have a simple interpretation in terms of inverse sensitivity and therefore also in terms of sensitivity. It is interesting to note that, according to Mark,22 b is a statistic that has been called the "index of random error" because it expresses the sensitivity of the calibration equation to the random error of the optical data (e.g., electronic noise). Surprisingly, numerical values for b are seldom reported in the literature.

Equation 8 has the algebraic form of a univariate calibration function. Figure 1 is the representation of a multivariate model consistent with eq 8. Plots such as Figure 1 enable one to conveniently visualize the spectral error propagation as the mapping of a region around the unknown sample (scalar) NAS onto a region around the predicted analyte concentration. Being able to visualize individual contributions to complicated expressions is believed to be a valuable addition to the existing toolbox of diagnostics for multivariate calibration. In contrast, the common presentation of prediction results as a plot of predicted versus reference values does not supply information about how different error sources contribute to the prediction error.

Spectral Pretreatment Methods. From the literature survey given in the introduction, it is clear that the MSC, first-derivative method, and second-derivative method are popular pretreatment methods in NIR spectroscopy. Only a brief discussion of the underlying assumptions concerning the mathematical form of the spectral variations is given here. For more details, see refs 1 and 5. MSC aims at removing additive and multiplicative effects by regressing the spectra onto a reference spectrum (often the mean of the calibration set). The first-derivative method removes a constant background whereas the second-derivative method also handles a sloping background. All methods achieve these goals in a local sense. Since these methods rely on different assumptions about the structure of the spectral variations, their successful application in practice is not guaranteed. Even for methods that are related it is difficult to say in advance how the prediction results will compare. For example, the standard normal variate technique and MSC have been shown to be linearly related.23 Nevertheless, different prediction results are obtained because the regression vector estimate obtained by generalized inverse-based methods is not invariant to a change of scale of the predictor variables.

EXPERIMENTAL SECTION

Description of the Data. Full details concerning the experimental design, spectral data acquisition, peak identification, and reference method (gravimetry) are presented elsewhere.16,17

(22) Mark, H. Principles and Practice of Spectroscopic Calibration; Wiley: New York, 1991; p 56.
(23) Dhanoa, M. S.; Lister, S. J.; Barnes, R. J. J. Near Infrared Spectrosc. 1994, 2, 43.


Figure 1. Multivariate inverse calibration model presented as linear univariate calibration function of scalar NAS versus analyte concentration. The contribution of the uncertainty in the prediction sample NAS, dNAS, to the uncertainty in the predicted analyte concentration, dc, can be viewed as the mapping of a region on the NAS axis onto a region on the concentration axis.
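For concreteness, the quantities behind eqs 8-10 and Figure 1 can be obtained directly from an estimated regression vector. The following is a minimal sketch (assuming NumPy; all variable names are hypothetical); the scalar NAS is computed here by projecting the spectrum onto the direction of the regression vector, which is consistent with eq 8 but not necessarily identical to the schemes of refs 9-11 and 21.

```python
import numpy as np

def pseudo_univariate_quantities(b_hat, r_u, sigma_dr):
    """Scalar NAS, inverse sensitivity, and sensitivity for one sample.

    b_hat    : (p,) estimated regression vector
    r_u      : (p,) mean-centered spectrum of the prediction sample
    sigma_dr : standard deviation of the spectral measurement noise
    """
    inv_sens = np.linalg.norm(b_hat)       # b = ||b||_E, eq 9
    sens = 1.0 / inv_sens                  # s = 1/b, eq 10
    nas_u = float(r_u @ b_hat) / inv_sens  # scalar NAS, so c_hat = c_bar + nas_u * b
    # Figure 1: an uncertainty dNAS on the NAS axis maps onto the concentration
    # axis as dc = b * dNAS; taking dNAS equal to the spectral noise level gives
    dc = inv_sens * sigma_dr
    return nas_u, inv_sens, sens, dc
```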

Briefly, calibration and test sets consist of 40 samples each. NIR absorbances are taken at 391 wavenumbers evenly spaced in the region 6000-9000 cm^-1. The standard deviation of the measurement error in the NIR absorbances was obtained by replication to be σ̂_ΔR = 10^-4 AU. This value constitutes an average over the entire spectrum. The measurement error in the reference method was estimated to be 1.2 × 10^-4 for MTBE, 2.3 × 10^-4 for EtOH, and 5.9 × 10^-4 for H2O (σ̂_Δc; in unit mass fraction % O). In ref 17 it was concluded that these values are small enough to be neglected. This is a condition that should always be tested because, in a strict sense, a model must be validated using reference values that are error-free. Otherwise the resulting prediction error estimate is pessimistic (i.e., too high) and should be corrected.24

Calculations. All calculations were performed in Matlab Version 4.2c (The MathWorks, Natick, MA). Standard implementations of PCR and MSC are used. Derivatives are obtained by taking differences without smoothing. This procedure increases the noise in the spectra by a factor of 2^(1/2) and 2, respectively. It is noted that in practice derivatives are always calculated with some sort of smoothing. Here Savitzky-Golay filters of various window sizes have been tested, but the prediction results were not significantly affected. A plausible explanation for this result is that full-spectrum methods such as PCR and PLS already smooth the spectra. This smoothing is particularly effective here because there are many degrees of freedom in the spectra, leading to a small "imbedded error".

RESULTS AND DISCUSSION

This section is organized as follows. First, the effect of differentiation is presented for some of the "normal" diagnostics currently in use, i.e., (1) plots of the resulting spectra, (2) the quality of the model fit to the calibration spectra, (3) the prediction error, and (4) plots of the regression vector estimate. Next, an interpretation of the observed trends is given in terms of the multivariate analytical figures of merit sensitivity, selectivity, and SNR.

(24) Faber, N. M.; Kowalski, B. R. Appl. Spectrosc. 1997, 51, 660.

Figure 2. Illustration of data pretreatment for calibration sample one: (a) raw spectrum, (b) centering and MSC, (c) centering and first-derivative (1ST) method and (d) centering and second-derivative (2ND) method.
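As a sketch of the pretreatments illustrated in Figure 2 (not the author's Matlab implementation; NumPy assumed, function names hypothetical): MSC regresses each spectrum on the mean calibration spectrum, and the derivatives are simple differences without smoothing.

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative signal correction: regress each spectrum on a
    reference spectrum (here the mean of the calibration set) and
    remove the fitted offset and slope."""
    ref = X.mean(axis=0) if reference is None else reference
    Xc = np.empty_like(X, dtype=float)
    for i, x in enumerate(X):
        slope, offset = np.polyfit(ref, x, 1)   # fit x ~ offset + slope * ref
        Xc[i] = (x - offset) / slope
    return Xc

def derivative(X, order=1):
    """Difference-based 'derivative' without smoothing, as used here;
    each differencing step raises the noise standard deviation by 2**0.5."""
    D = np.asarray(X, dtype=float)
    for _ in range(order):
        D = np.diff(D, axis=1)
    return D
```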

Effect of Differentiation on NIR Spectra. Figure 2 gives the raw spectrum and the mean-centered spectra after application of data pretreatment for calibration sample one. It is noted that the absorbance axis is labeled in units of 10^-4 AU. The reason for doing so is that the numerical values displayed in Figure 2 give a good idea of the quality of the instrument: they constitute SNRs at the individual wavelengths for the raw spectrum and the MSC spectrum; for the first-derivative and second-derivative spectra one has to further divide these numbers by 2^(1/2) and 2, respectively. It is seen that the mean spectrum makes by far the largest contribution to the data, from which one may infer that the spectra are highly collinear. Stated otherwise, the model is constructed using only small deviations from the mean. As expected, the noise is amplified by differentiating the data. However, another effect, which has received only little attention in the literature, is that the signal itself is diminished (the early study by Juhl and Kalivas25 forms a notable exception). As a result, the second-derivative spectrum looks much more noisy than the first-derivative spectrum, which is somewhat misleading.

Effect of Differentiation on Residual Standard Deviation. The residual standard deviation (RSD) of the calibration spectral matrix, R̃, is a measure of how close the model fits the data; it therefore describes how well pretreatment removes undesired substructures from the spectra. The RSD is obtained by summing the eigenvalues for the discarded principal components as

RSDA )

x

n

∑ λˆ /(n - A - 1)(p - A) a

(11)

a)A+1

where λ̂_a is the estimated eigenvalue associated with the ath principal component. The eigenvalue is equal to the sum of squares of the spectral matrix explained by a principal component.

(25) Juhl, L. L.; Kalivas, J. H. Anal. Chim. Acta 1988, 207, 125.

Figure 3. RSD for calibration spectra versus number of extracted principal components after applying MSC (o), first-derivative method (*) and second-derivative method (+).

Figure 4. Principal component (PC) loadings seventeen through twenty obtained after applying MSC to calibration set. The numbers in parentheses denote the percentage explained variance (%VAR) associated with a principal component.

The denominator in eq 11 converts the summation over the residual sums of squares into a variance estimate (mean-centering leads to the additional loss of degrees of freedom expressed by the term −1). Obviously, this calculation method also leads to an average error estimate. Figure 3 shows how the RSD varies as a function of the number of extracted principal components for different data pretreatment methods. It is seen that for MSC data the RSD has not completely settled down after extracting more than 15 principal components. This means that there is still some structure left in the data, which is confirmed by the loading plots shown in Figure 4 (although the percentage explained variance for these principal components is extremely small). In contrast, the RSD curve for first-derivative data is essentially flat after extracting approximately 10 principal components. For second-derivative data, this flattening effect is even more pronounced.
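A minimal sketch of eq 11 (NumPy assumed; the eigenvalues are taken as the squared singular values of the mean-centered spectral matrix):

```python
import numpy as np

def rsd(R, A):
    """Residual standard deviation of eq 11 for an A-component model.

    R : (n, p) mean-centered calibration spectra
    A : number of extracted principal components
    """
    n, p = R.shape
    eigvals = np.linalg.svd(R, compute_uv=False) ** 2   # eigenvalues of R^t R
    residual_ss = eigvals[A:].sum()                      # sum over a = A+1 ... n
    return np.sqrt(residual_ss / ((n - A - 1) * (p - A)))
```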


Table 1. Summary of Prediction Results Obtained after Applying MSC, First-Derivative (1ST) Method, and Second-Derivative (2ND) Method

            MSC                  1ST                  2ND
analyte     PCs   RMSEP (% O)    PCs   RMSEP (% O)    PCs   RMSEP (% O)
MTBE        6     0.017          6     0.021          6     0.027
EtOH        12    0.0035         12    0.0098         12    0.050
H2O         10    0.0023         11    0.0035         10    0.0051

It is clear that by differentiating the data one removes tiny substructures; one also introduces noise, which is evidenced by the first-derivative curve falling under the second-derivative curve in Figure 3. Closer inspection of Figure 3 shows that a model fit-based spectral noise standard deviation estimate for the MSC data does not agree very well with the estimate that is based on replication (σ̂_ΔR = 10^-4 AU). This follows from the observation that for second-derivative data one obtains an estimate that is slightly less than the replication-based estimate. However, for second-derivative data, the noise should have increased by a factor of 2. Consequently, the analysis of the residuals of principal component analysis (PCA) will lead to a noise estimate smaller than the replication-based one by a factor of 2. The reason for this discrepancy may be that PCA fits substructures that contribute to the replication-based estimate. This can be interpreted as overfitting: many more than 20 principal components are needed to handle the substructures that are effectively removed by differentiation. To avoid the effects of overfitting, it is safer to use a noise estimate that is independent of the data from which the model is constructed.14

Effect of Differentiation on Prediction Error. It is worth recalling that Figure 3 has shown that both derivative methods are highly effective in removing small substructures from the spectral data. Thus, taking derivatives simplifies the job of modeling the spectra "within the noise", but the main question should be, does it also improve prediction? Table 1 gives the summary statistics obtained when models are constructed using various data pretreatment methods. RMSEP is calculated in the usual way from an independent test set. It is seen that generally a similar number of PCs is needed to obtain the "optimum" model. (For the determination of the "optimum" number of PCs, see Figure 4 in ref 17.) Taking the first-derivative spectra leads to a slightly larger prediction error for MTBE. In contrast, the results for EtOH are considerably worse (RMSEP has increased by more than a factor of 2), while the results for H2O have also significantly deteriorated. Taking the second-derivative spectra leads to a further deterioration of the prediction results, especially for EtOH (additional increase of RMSEP by more than a factor of 5). The cross-validation results obtained for the calibration set are consistent with the prediction results.

It is interesting to compare the numbers given in Table 1 with the results obtained without MSC, i.e., when only mean-centering is applied to the absorbance data. In general, more principal components are needed to obtain the "optimum" predictive model. The root-mean-squared error of calibration (RMSEC) values (i.e., the average fit errors for the calibration set) are comparable for MTBE and H2O but higher for EtOH (especially for PLS), whereas the RMSEP values are slightly higher for all analytes.

Figure 5. Regression vector estimate (unit mass fraction % O / absorbance) of optimum PCR model for EtOH after applying: (a) MSC, (b) first-derivative (1ST) method and (c) second-derivative (2ND) method.

It follows that for the current application the use of MSC does not seem to be critical to obtain good prediction results: the models constructed without MSC are very similar to the ones constructed with MSC. Stated otherwise, PCR (and PLS) seem to be well-suited for modeling subtle substructures using additional factors without immediately giving unstable predictions. However, the need for a "larger" model may lead to problems if the model is put to the test on real unknowns, since robustness is known to degrade with increasing model dimensionality.26 This is the main reason why previously reported results16,17 were obtained with MSC. Finally, it is emphasized that the calculation of RMSEP is based on a rather large test set (40 samples), so that these estimates are relatively precise. Assuming that the estimated mean-squared error of prediction (MSEP) is distributed as a χ2 random variable with 40 degrees of freedom leads to a relative uncertainty in the RMSEP estimate of 11%.17 With a smaller test set, less precise estimates are obtained and it would, for example, be conceivable that first-derivative data yield the smallest RMSEP estimate for MTBE. (The same reasoning holds for the prediction of all analytes using data without MSC.) It follows that in the common situation where fewer test samples are available it will often be useful to look at additional diagnostics to rule out chance effects in the selection of the optimum spectral pretreatment method.

Effect of Differentiation on Regression Vector Estimate. In the preceding section, it was found that the effects of differentiation are most pronounced for the prediction of EtOH. Consequently, the remainder of this paper will be restricted to further investigating the models constructed for this analyte. Figure 5 shows the regression vector estimates obtained after various data pretreatments. A notable increase of noise in the regression coefficients is observed. Practically, this observation together with the increased prediction error estimates is sufficient evidence to conclude that the best model is obtained using MSC. However, the increase of the noise in Figure 5 is much larger than the increase of noise in the data (factor of 2^(1/2); see Figure 2).

(26) Seasholtz, M. B.; Kowalski, B. R. Anal. Chim. Acta 1993, 277, 165.
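The 11% figure quoted above (and the 22% relative uncertainty in MSEP used further below) follow from treating the estimated MSEP as a scaled χ2 variable with 40 degrees of freedom; a quick check of this arithmetic, assuming NumPy:

```python
import numpy as np

def rmsep(c_pred, c_ref):
    """Root-mean-squared error of prediction over an independent test set."""
    d = np.asarray(c_pred) - np.asarray(c_ref)
    return np.sqrt(np.mean(d ** 2))

nu = 40                           # degrees of freedom (40 test samples)
rel_sd_msep = np.sqrt(2.0 / nu)   # ~0.22: relative uncertainty of MSEP
rel_sd_rmsep = 0.5 * rel_sd_msep  # ~0.11: relative uncertainty of RMSEP
```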

Table 2. Relative Change in Multivariate Inverse Sensitivity, b, and Multivariate Sensitivity, s, Going from Multiplicative Signal-Corrected (MSC) to First-Derivative (1ST) Spectra, and from First-Derivative to Second-Derivative (2ND) Spectra

            MSC → 1ST          1ST → 2ND
analyte     b      s           b      s
MTBE        6.6    0.15        2.9    0.35
EtOH        21     0.048       3.4    0.29
H2O         6.3    0.16        1.2    0.82

Consequently, the origin of the problems encountered with the derivative spectra remains largely unknown. This situation is highly unsatisfactory from a scientific as well as a practical point of view, since only if the underlying cause can be uncovered may we expect to find well-founded practical guidelines for spectral pretreatment.

Interpretation in Terms of Multivariate Sensitivity. Less conspicuous than the increase of noise in Figure 5 is the change of scale in these plots: the regression coefficients themselves increase upon differentiation. This trend is consistent with the decrease of the signal observed in Figure 2, since, obviously, larger regression coefficients are needed to bring about the conversion to analyte concentration in eq 5. (Similar trends are observed for MTBE and H2O.) However, an increase of the regression coefficients results in an increased (multivariate) inverse sensitivity according to eq 9 and a decreased (multivariate) sensitivity according to eq 10. Table 2 summarizes the effect of differentiation on inverse sensitivity and sensitivity. The results are in agreement with the trends observed by Juhl and Kalivas.25

Now it also becomes clear that Figure 5 is misleading in the sense that the increased noise in the regression coefficients seems to be the cause of the increase in prediction error. There are two reasons why this is not the best interpretation of the observations. First, the model term in eq 7, through which the noise in the coefficients contributes, gives only a relatively small contribution to the prediction error variance (with a large calibration set the 1/n term and the leverage term are both much smaller than 1). The second reason is that both the increased noise in the regression coefficients and the increased prediction error are due to the same cause, i.e., spectral error propagation, which is determined by the size of the regression vector. (Therefore it does not matter whether a regression vector is noisy or smooth; smoothing the regression vector, which is a sensible procedure if the model term is important, had little effect on prediction error.) Equation 3 simply states that it is relatively hard to precisely estimate large regression coefficients in the presence of significant spectral noise, whereas Figure 1 concisely pictures the role of a large regression vector in prediction. Finally, it is noted that all models for EtOH have the same dimensionality, and it is therefore relatively simple to isolate the relevant causes and effects. Increase of dimensionality would also tend to increase the noise in the results, owing to the larger number of parameter estimates (smaller number of degrees of freedom), but this is not the case here.

Table 3 shows how RMSEP, MSEP, and the contribution of the spectral measurement uncertainty to prediction error variance (hence to MSEP, since MSEP = variance + squared bias) change upon differentiating the data. It is seen that the latter contributions

Table 3. Relative Change in RMSEP, MSEP, and Contribution of Spectral Measurement Uncertainty to Prediction Error Variance, b^2 σ_ΔR^2, Going from Multiplicative Signal-Corrected (MSC) to First-Derivative (1ST) Spectra, and from First-Derivative to Second-Derivative (2ND) Spectra

            MSC → 1ST                        1ST → 2ND
analyte     RMSEP   MSEP   b^2 σ_ΔR^2        RMSEP   MSEP   b^2 σ_ΔR^2
MTBE        1.2     1.5    87                1.2     1.5    17
EtOH        2.8     8.1    881               5.1     26     23
H2O         1.5     2.3    79                1.5     2.1    3.0

have increased dramatically going from multiplicative signal-corrected to first-derivative spectra (see column four). The largest increase is found for EtOH, whose multivariate sensitivity has decreased by a factor of 21. However, the effect on increasing MSEP is still relatively small for all three analytes (see column three). This can only mean that the term associated with the spectral uncertainties originally present in the data was negligible compared with the other contributions to eq 7. This result corroborates the outcome of a noise addition experiment where adding noise with a standard deviation of up to 6 × 10^-4 had no palpable effect on prediction error.17 Going from first-derivative to second-derivative spectra leads to a considerable increase of RMSEP for MTBE and H2O (20 and 50%, respectively), while it is dramatic for EtOH (factor of 5). Evidently, for MTBE the spectral noise still contributes little to prediction error variance. However, for EtOH, the deterioration of MSEP can be completely explained in terms of the decrease of sensitivity (factor of 3.4) and the increase of spectral noise (factor of 2^(1/2)): taking the uncertainty in the MSEP estimate into account (22%; see ref 17) shows that the factors 26 and 23 are not statistically different. For H2O, the additional decrease of sensitivity (factor of 1.2) explains the increase of MSEP to a large extent.

It is always instructive to complement numerical results with graphical tools. Figure 6 displays the pseudounivariate presentation of the models constructed for EtOH. The horizontal axis is labeled in units of 10^-4 AU (as in Figure 2). For the MSC data, the numerical values for (scalar) NAS have an interpretation as multivariate SNR as defined by Lorber.8 For a particular sample, Lorber's SNR is given by the ratio of the NAS, r*, and the spectral noise standard deviation, σ,

SNR = r*/σ    (12)

Obviously, for first- and second-derivative data one has to correct for the increased noise (factor of 2^(1/2) and 2, respectively), but this does not really prevent the kind of qualitative interpretation that will make multivariate calibration results more accessible to the practical analytical chemist. It is seen that the (multivariate) SNR is excellent for the multiplicative signal-corrected data. The prediction errors are small, as evidenced by the fact that the reference values (calibration and test samples) are close to the (estimated) model. Going from multiplicative signal-corrected to first-derivative data leads to larger prediction errors, which is evidenced by the increased scatter around the model. Inverse sensitivity (slope) has increased, and SNR has decreased.
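A small sketch of that correction (assuming NumPy; `nas_u` is the scalar NAS of a sample, and the noise standard deviation is scaled by 2^(1/2) per differencing step, consistent with the difference-based derivatives used here):

```python
def multivariate_snr(nas_u, sigma, deriv_order=0):
    """Lorber-type SNR of eq 12, with the spectral noise level corrected
    for difference-based derivatives (factor 2**0.5 per differencing step)."""
    return abs(nas_u) / (sigma * 2.0 ** (deriv_order / 2.0))

# e.g., second-derivative data with sigma = 1e-4 AU:
# snr_u = multivariate_snr(nas_u, 1e-4, deriv_order=2)   # noise scaled by 2
```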


Figure 6. Pseudo-univariate presentation of optimum PCR model for EtOH after applying (a) MSC, (b) first-derivative method, and (c) second-derivative method: calibration set (*) and test set (o). The line symbolizes the model and the cross (+) indicates the model center.

It is important to note that the prediction errors are vertical deviations from the model, which are caused in part by increased fluctuations in the NAS, which are horizontal deviations from the true NAS. Finally, going from first-derivative to second-derivative spectra leads to a further increase of the inverse sensitivity and decrease of SNR. For EtOH, the critical level is reached where the contribution of the prediction sample spectral error dominates the prediction error. (Another example is given by Berger and Feld.20) Prediction sample 38 perfectly illustrates the general tendency of the reference values (approximately true values) to drift away from the model in the horizontal direction. The maximum horizontal deviation is slightly larger than 3 times the estimated standard deviation of the spectral noise, where the factor of 2 for differentiation is accounted for. It is important to note that a maximum deviation in Figure 6c of, say, more than 4 times the estimated standard deviation of the spectral noise would invalidate the use of the additional approximations leading to the variance expressions 3 and 7. From a practical point of view, it is gratifying that a rigorous approach is not always necessary to explain observations. For example, ignoring heteroscedasticity and correlation has not prevented a satisfactory explanation of the deterioration of prediction results upon differentiation for the current data set. (In ref 14, complicated expressions are derived that take heteroscedasticity and correlation into account.)

Interpretation in Terms of Multivariate Selectivity and SNR. It is interesting to contrast the previous discussion with two common interpretations of the effect of differentiating the spectra. Both interpretations describe taking derivatives as a trade-off process and would therefore explain why the quality of the results varies with the characteristics of the input data in the way one often observes. The first interpretation is that it enhances the selectivity of the data (derivative spectra exhibit more fine structure; see Figure 2) at the expense of increasing the noise already present (by a factor ~2^(1/2) since differences are taken). It is emphasized that there is not necessarily a contradiction here if one defines multivariate selectivity and sensitivity in a consistent manner. A consistent definition of selectivity is given by Lorber.8 Lorber's selectivity measures the amount of overlap between the spectra of the analyte and the interferences; it ranges between 0 (complete overlap) and 1 (no overlap). Kalivas and Lang discuss that Lorber's sensitivity includes selectivity information since12

s = ξ ‖s‖_E    (13)

where ξ denotes the selectivity and the s within the norm is the pure analyte spectrum (a vector). In principle, the selectivity measure should therefore lead to the same conclusions. However, there is a trivial reason why in general the selectivity of eq 13 cannot be used in inverse calibration: for its calculation, one needs the pure spectrum of the analyte, s, which is seldom available. (Consequently, a "more practical" definition of selectivity was proposed in ref 9.) Moreover, an interpretation in terms of selectivity should also include the effect on the total signal, i.e., the pure spectrum of the analyte. Being the product of these two (generally) unknown quantities, Lorber's sensitivity implicitly includes both effects, which is clear from the simple relationship between prediction error variance and sensitivity obtained by combining eqs 7, 9, and 10. These are all strong reasons for preferring the interpretation in terms of sensitivity. For a further discussion of the relative merits of Lorber's sensitivity, see refs 27-29.

The second interpretation is in terms of SNR. For example, Blanco et al.5 state that "The derivation transformation has the disadvantage that it always diminishes the signal-to-noise ratio and is highly sensitive to the presence of noise in the original spectrum." (Italics by the author.) Clearly, the validity of this statement depends on the definition of SNR. It holds, for example, for the univariate SNRs displayed in Figure 2. The question is whether this observation is useful, since these SNRs only give information about the quality of the instrument. However, for the characterization of the model, the overlap between the spectra also needs to be taken into account, i.e., the selectivity. For example, high univariate SNRs do not imply that an accurate model can be constructed. In the worst case, complete overlap between the analyte spectrum and the interferents' spectra prevents determination of the analyte, however large the univariate SNRs may be. A suitable multivariate definition of SNR has been proposed by Lorber; see eq 12.

(27) Faber, N. M.; Lorber, A.; Kowalski, B. R. Chemom. Intell. Lab. Syst. 1997, 38, 89.
(28) Kalivas, J. H.; Lang, P. M. Chemom. Intell. Lab. Syst. 1997, 38, 95.
(29) Faber, N. M. Anal. Chim. Acta, in press.

Equation 12 allows for a quantitative statement about the effect of taking derivatives on the multivariate model: if the derivative spectra have a larger (scalar) NAS than the raw spectra, the (multivariate) SNR will increase if the increase of NAS is larger than a factor of 2^(1/2); otherwise it decreases. This reasoning shows that application of derivative methods does not necessarily amount to a trade-off process, as commonly believed. For the example studied in this paper the NAS decreases upon differentiation for all analytes. In contrast with the sensitivity, the SNR is sample-dependent. The sensitivity is therefore preferred for formulating a general statement about the effect of data pretreatment. The SNR will, however, indicate which predictions are most affected. For example, a crude estimate for the limit of detection (LOD) in unit response is 3 times the SNR; to obtain the LOD in unit concentration one has to divide this number by the sensitivity (see ref 13 and references therein).

CONCLUSIONS

Multivariate sensitivity is potentially useful for interpreting spectral pretreatment results. For example, taking derivatives increases the noise level in the data but, perhaps more importantly, it also affects the size of the regression vector, which is equivalent to changing the multivariate sensitivity. Stated otherwise: differentiating the data not only increases the uncertainties in the data, which is generally known, but also affects the extent to which they propagate to the final predictions, which is believed to be a new and useful result. Especially the latter effect has caused the predictions of EtOH and H2O to deteriorate upon differentiating the data. (In the most extreme case, the sensitivity decreased by a factor of 21.) These results, together with the difference in underlying assumptions about the form of the spectral variations,

may explain why variable results are reported in the literature for these popular data pretreatment methods. De Noord18 and Sum and Brown19 assessed the performance of a data pretreatment method on the basis of the reproducibility of spectra. However, data pretreatment may fail altogether if enhanced reproducibility is obtained at the cost of sufficiently decreased multivariate sensitivity. Hence the monitoring of multivariate figures of merit may give valuable complementary information. Very recently, Rutan et al.30 have proposed methodology based on experimental design and PCA for identifying the main sources of spectral variation. Their work will further aid in the selection of the proper data pretreatment method. The development of a comprehensive set of diagnostics for data pretreatment is believed to be an important subject for future research in chemometrics.

ACKNOWLEDGMENT

The National Institute of Standards and Technology is thanked for making the oxygenate data available for the current work. The author is indebted to Dave Duewer for opening the window on practical aspects of measurement science. Presented at CC'97 Conferentia Chemometrica, Budapest, August 21-23, 1997.

(30) Rutan, S. C.; de Noord, O. E.; Andréa, R. R. Anal. Chem. 1998, 70, 3198.

Received for review April 13, 1998. Accepted September 2, 1998.
