9279
Anal. Chem. 1993, 65, 3279-3289
Strategies for Coupling Digital Filtering with Partial Least-Squares Regression: Application to the Determination of Glucose in Plasma by Fourier Transform Near-Infrared Spectroscopy Gary W.Small' Center for Intelligent Chemical Instrumentation, Department of Chemistry, Clippinger Laboratories, Ohio University, Athens, Ohio 45701-2979
Mark A. Arnold and Lois A. Marquardt Department of Chemistry, The University of Iowa, Iowa City, Iowa 52242
Protocols are established for coupling digital filtering techniques with partial least-squares (PLS) regression for use in constructing multivariate calibration models from Fourier transform near-infrared absorbance spectra. Calibration models are developed to predict glucose concentrations in bovine plasma samples. Employing a calibration data set of 300 spectra collected from 55 plasma samples and 3 plasma lots, individual calibration models are developed based on four spectral ranges selected from the region 5000-4000 cm-l. A separate test set of 69 spectra collected from 14 plasma samples is used to evaluate the computed models. Gaussian-shaped bandpass digital filters are implemented by use of Fourier filtering techniques and employed to preprocess spectra to remove variation due to the background absorbance of the plasma matrix. PLS regression is used with the filtered spectra to compute calibration models for glucose. The optimization of the filter bandpass parameters is explored through the use of response surface methods. Through these optimization studies, calibration models are developed that achieve standard errors of estimate and standard errors of prediction in the range 0.4-0.5 mM across the concentration range of 2.5-25.5 mM. It is determined that the use of digital filtering as a preprocessing step significantly improves the performance of the resulting calibration models, minimizes the importance of spectral range in the calibration model development, and reduces the required number of PLS factors in each model.
INTRODUCTION Recent efforta in our laboratories directed to the development of blood glucose sensing technology based on nearinfrared (near-IR) spectroscopy have focused on studying the effects of spectral interferences. Investigations have been conducted with an aqueous matrix at physiological pH1 and in buffered protein solutions.2 In addition, the impact of variations in sample temperature has been addressed.3 In
* Author to whom correspondence should be sent.
(1)Arnold, M. A.; Smell, G. W.AwZ. Chem. 1990,62,1457-1464. 0003-2700/93/03853279$04.00/0
this paper, we report our first efforta to determine glucose in an actual biological matrix through the study of bovine plasma samples derived from three different plasma lots. In each of our previous studies, a key component of the glucose determination has been the use of appropriate data analysis methodology to extract glucose spectral information from the background absorbance of other matrix components. To accomplish this task under conditions of increasing complexity in the sample matrix, we have turned to the use of digital filtering techniques such as Fourier filtering and multivariate calibration methods such as partialleasbsquares (PLS)regression. In the use of these data analysis approaches, a key consideration is the degree to which the adjustable parameters inherent in each method are optimized. In the work reported here, the challenge of determining glucose in plasma is used to illustrate the design of protocola for the optimal coupling of spectral preprocessing digital filters with calibration model generation based on PLS regression.
EXPERIMENTAL SECTION Apparatus. Spectra used in this work were collected with a Nicolet 740 Fourier transform infrared spectrometer (Nicolet Analytical Instruments, Madison, WI) operating in the Department of Chemistry at the University of Iowa. The spectrometer was confiiedwith a 76-Wtungsten-halogen source,CaFz beam splitter, and an Nz (1)-cooled InSb detector. A multilayeroptical interferencefilter (Barrh s o c . , Westford, MA) was placed in the optical path to isolate the range of 5000-4000 cm-1. The presence of the fiiter serves to reduce the noise level in the collected data by restricting the wavelengths reaching the detector.' Solution temperatures were controlled with a VWR Model 1140 refrigerated temperature bath (VWR Scientific, Chicago, IL). Temperatures were measured with a l/srin.-diameter copperconstantan thermocouple probe (Omega Inc., Stamford, CT) in conjunction with an Omega Model 670 digital meter. This arrangement allowed temperatures to be measured with an accuracy of f0.1 OC. Interferograms acquired by the Nicolet 620 computer controlling the spectrometer were Fourier processed to single-beam spectra by use of Nicolet software. The spectra were truncated to the range of 5000-4002 cm-1 and transferred to a Silicon Graphics4D/460 computer operating in the Center for Intelligent Chemical Instrumentation at Ohio University. Except for the analysis of variance studies, all subsequent data analysis was performed on the SiliconGraphicssystem. All computersoftware usedin processing the spectraldata on the SiliconGraphicssystem (2) Marquardt, L. A.; Arnold, M. A.; Small, G. W. A w l . Chem., preceding paper in thia issue. (3)Hazen, K. H.; h o l d , M. A.; Smell,G. W. submitted for publication in Appl. Spectrosc. Q l9S3 American Chemical Society
3280
ANALYTICAL CHEMISTRY, VOL. 65, NO. 22, NOVEMBER 15, 1993
Table I. Description of Plasma Samples lot I replic replicn conc spectra sample (mM) groups 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
7.2 9.1 9.9 10.8 11.5 12.9 13.5 14.7 15.5 16.3 16.5 18.2 18.9 20.0 20.9 22.4 22.7 23.3 25.5
18 4
4 4 5 5 6 9 8 7 4 4 5 5 4 4 5 7 5
7 2 2 2 2 2 3 3 3 3 2 2 2 2 2 2 2 3 2
sample
conc (mM)
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
4.4 5.5 6.2 7.4 8.2 9.2 10.4 11.2 11.8 13.0 14.1 14.9 19.9 16.8 24.6 18.3 18.8 16.0 20.2 21.5 20.8 7.7 12.0 10.0 15.5
44
was implemented in FORTRAN 77 (Irisversion 3.3.3L). Fourier filtering and multiple linear regression computations were performed with subroutines from the IMSL software package (IMSL, Inc., Houston, TX). Analysis of variance calculations were performed by use of the MINITAB statistical software package (version 7.1, Minitab, Inc., State College, PA) implemented on a Hewlett-Packard 486/25T computer operating under MS-DOS (version 4.01, Microsoft, Inc., Redmond, WA). Reagents. Reagent grade glucose, potassium phosphate salts, and EDTA were obtained from common suppliers. The phosphate buffer solutions were prepared with reagent grade water obtained by passing housedistilled water through a Milli-Qthreehouse purification unit (Millipore Corp., Bedford, MA). Water was purified immediately before use. Procedures. Over a period of 11months, three lots of fresh bovine blood were obtained from a local abattoir and stored with EDTA to prevent coagulation. A given lot of blood was allowed to stand for 24 h to remove glucose by metabolization. The blood was then centrifuged to obtain the plasma. Glucose standards were prepared by adding known amounts of dried glucose to the glucose-depleted plasma matrix. For plasma lots I, 11, and 111, respectively, 19, 25, and 25 samples were prepared, yielding a total of 69 samples. The final glucose concentrations were measured by an outside laboratory with a Beckman Astra amperometric glucose analyzer (Beckman Instruments, Inc., Fullerton, CA). In the typical case, severaldays elapsed between the collectionof the spectral data and the amperometric glucose measurements. The plasma samples were kept frozen between the spectral data collection and the reference measurement to help minimize any change in concentration that might be introduced by the delay. Table I lists the measured concentrations for each of the plasma samples. The relative standard deviationof glucose measurements made with the BeckmanAstra is typically found to be 3-6 76. Regression analyses of measured concentrations vs theoretically predicted concentrations based on the amount of glucose added typicallyproduce standard errors in the range of 0.5 mM. For the spectral data acquisition, plasma samples were placed in a 1-mmpath-length cell made from infrared quartz (Wilmad Glass Co., Buena, NJ). The cell was placed in a water-jacketed mount inside the sample compartment of the spectrometer, and the plasma was allowed to equilibrate to 37 "C. This sample temperature was selected to coincide with normal human body temperature. The resulting spectral measurements are thus compatible in terms of sample temperature with direct nonin-
lot I1 replic spectra 9 3 6 6 6 6 5 6 6 6 6 6 3 6 5 6 6 2 5 5 3 3 6 6 6
replicn groups
sample
3 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 1 1 2 2 2
45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
lot I11 conc replic (mM) spectra 3.4 2.9 8.5 4.6 12.4 16.9 4.6 6.3 17.3 19.2 10.0 5.1 9.1 10.9 11.3 9.7 12.8 2.5 14.4 2.5 24.0 7.2 15.2 18.8 15.2
7 6 3 4 6 4 7 9 3 3 6 6 5 6 3 3 7 6 6 5 6 3 3 3 3
replicn PUPS
2 2 1 2 2 2 2 2 1 1 2 2 2 2 1 1 2 2 2 2 2 1 1 1 1
vasive spectral measurements that might be made on human subjects for the purpose of determining blood glucose concentrations. A thermocouple probe was positioned in the cell to allow the temperature to be monitored continuously. After temperature equilibration, 256 coadded interferogram scans were acquired. Doublesided interferograms were collected, consisting of 16 384 points. In the Fourier processing of the interferogram data, triangular apodization was employed, along with Mertz phase correction. The phase array used was based on 200 points on each side of the interferogram center burst. The point spacing in the transformed spectra was 1.9 cm-l. The region of 50004002 cm-1 contained 519 spectral points. Periodically throughout the data collection, spectra of a pH 7.2,O.l M phosphate buffer solution were acquired to serve as spectral backgrounds. The sample temperature for the buffer solutions was also 37 "C. During the data collection for plasma lots I-In, respectively, 19,24, and 22 buffer spectra were collected. The same sample handling, data acquisition, and Fourier processing steps described above were used to obtain the buffer spectra. For each of the three lots of plasma samples, spectra were acquired over a period of 3 days. The plasma spectra were collected in a random order with respect to estimated concentration. T w o types of replicate spectra were collected of each sample. First, two or three consecutive scans were made for each filling of the sample cell. Second, additional variability was introduced by collecting spectra of most samples on more than one occasion (i.e., at a different time of day or on another day). The total number of replicate spectra collected for each sample is listed in Table I, along with the number of replication groups. One replication group is defined as a set of replicate spectra collected from one filling of the sample cell and its placement in the spectrometer. Across the 69 samples, the number of replicate spectra ranged from 2 to 18, while the number of replication groups ranged from 1to 7. T w o or more replication groups were used for 55 of the 69 samples. The total number of spectra collected was 369.
RESULTS AND DISCUSSION Description of Calibration and Test Data Sets. To investigate the couplingof digital filtering with PLSregression for use in predicting glucose concentrations in bovine plasma, the spectral data of the 69 samples were divided into two groups, corresponding to 80 % and 20% of the data, respec-
ANAL.TICAL CHEMISTRY, VOL. 65, NO. 22, NOVEMBER 15, 1993 3281
Table 11. Description of Data Sets Used in Calibration and Prediction calibration0 prediction source samples spectra smaples spectra lot I lot I1
16 (4) 20 (6)
lot I11 total
19 (4) 55 (14)
100 (25) 103 (28) 97 (17) 300 (70)
3 5
6 14
13 30 26 69
a Samples and spectra comprising the monitoring set for the fiiter generation work are indicated in parentheses.
-0.003
11
t 4.004
1
0.00075 0.00050
-0.007
1
t
n
I
'
-0.008 ' ' 5000 4900 4800 4700 4600 4500 4400 4300 4200 4100 4000 Wavenumber ( c m j )
Flguro 1. Absorbance dlfference spectra of three plasma samples. Spectra A-C correspond to samples 67 (15.2 mM), 61 (12.8 mM), and 57 (9.1 mM), respecthrely. (3lucosebands are observed In the regions of 4300, 4400, and 4700 cm-l.
tively. All replicate spectra corresponding to 55 of the samples were placed into a calibration data set for use in constructing calibration models for predicting glucose concentration. The replicate spectra of the remaining 14 samples were withheld as a separate test set of data for use in evaluatingthe predictive ability of the computed calibration models. Individual samples were assigned to the calibration and test data sets in a random manner, with the following constraint. The samples in each plasma lot with the lowest and highest concentrations were placed in the calibration set. This ensured that the full range of concentration of each plasma lot was represented in the calibration set. The breakdown of numbers of samples and spectra from each plasma lot is presented in Table 11. A total of 300 spectra were placed in the calibration set and 69 were placed in the test set. Motivation for SpectralPreprocessing. Previous work in our laboratories directed to the determination of glucose by near-IR spectroscopy has focused on the region of 500Cb 4000 cm-1.1J Figure 1 displays three absorbance difference spectra that illustrate the presence of glucoseabsorption bands in this region. The spectra labeled A 4 in the figure correspond to samples 67 (15.2mM), 61 (12.8mM), and 57 (9,l mM), respectively. Single-beamspectra of these sampleswere ratioed to phosphate buffer spectra collectedat similar times, and the resulting transmittance values were converted to absorbance. A similarly constructed absorbance spectrum of sample 45 (3.4mM) was subtracted from each of the other spectra to produce the difference spectra displayed in the figure. Glucose bands are clearly observed in the regions of 4300,4400,and 4700 cm-1. To predict a glucoseconcentration directly from a collected spectrum, the information resident in these glucose bands must be extracted. This task is complicated by the fact that no matrix-matched background spectrum is typically available for use in processing the spectral data. The use of an artificial background spectrum (e.g., phosphate buffer as a background
-0.00050 5000 4900 4800 4700 4600 4500 4400 4300 4200 4100 4000 Wavenumber
(cm-1)
(Top)Absorbance spectra of two repllcates of sample 4 (10.8 mM) and one replicate of sample 58 (10.9 mM). (Bottom) Correspondlng spectraafter Fourier fllterlng with a Gausslan bandpass functlon centered at 0.052fand wlth a width specifled by a standard devlation of 0.013f. Note the decrease in basellne varlatbn and the enhancement of the spectra in the regbn of the glucose bands at 4400 and 4300 cm-l. Flguro 2.
for bovine plasma) may introduce severe baseline artifacts in the resulting absorbance spectrum. This problem can be observed in the three absorbance spectra displayed in the top portion of Figure 2. Two of the spectra are replicates of sample 4 (10.8 mM), while the third corresponds to sample 58 (10.9mM). Due to the presence of virtually identical glucose concentrations across the three spectra, it can be concluded that the variations observed are due largely to artifacts introduced through the use of buffer spectra as backgrounds. Large variations are particularly noted in the region above 4900 cm-1 and below 4200 cm-1. These regions are heavily influenced by the strong absorbance bands of water centered at 5300 and 3800 cm-1. In addition to the poor match between the plasma and buffer spectra in these regions, significantly less light is able to reach the detector due to the water absorbance. This reduction in signal also contributes to the increasedvariation noted in the spectra. A comparison with Figure 1 reveals a lack of discernible glucose signals in the absorbance spectra of Figure 2. For a spectral-processing strategy to be successful in extracting glucose information, variation of the type observed in the figure must be removed. Preprocessing of the spectral data to remove such artifacts is clearly motivated. PLS Regression. PLS regression is one of the most widely used strategies for processing spectral data and building a calibration model for predicting analyte concentrations.4~6In the context of a near-IR-based glucose determination, the (4) Sj&+r6m, M.; Wold, S.;Lindberg, W.; Pereson, J. A.; Martam, H. Anal. C h m . Acta 1983,160,61-70. (6)Martens,H.; Naee, T.Multivariate Calibration;Wiley: New York, 1989; Chapter 3.
3282
ANALYTICAL CHEMISTRY, VOL. 65, NO. 22, NOVEMBER 15, 1993
Table 111. Calibration and Prediction h u l t r for Filtered and Nonfiltered Data R2 SEEa SEPb spectral PLS range(cm-1) fadors (%) (mM) (mM) Nonfiitered Data
$ [ .
0 0 l i i
3
.. 0
A
.
* . . ' .
E
2
4
6
8
10
12
14
16
18
20
96.8 99.1 99.2 98.7 Filtered Data 10 99.3 8 99.3 9 99.3 8 99.2
5000-4002 4900-4200 4500-4200 4600-4200
1
01
18 11 11 10
5000-4002 4900-4200 4600-4200 4600-4200
I
1.13 0.60 0.55 0.72
1.54 0.49 0.66 0.64
0.53 0.53 0.54 0.54
0.46 0.39 0.42 0.37
Number of PLS factors
Flgure 3. SEP vs number of PLS factors for calibration models constructed from the 5000-4002-~m-~range. The resutts of uslng both nonflltered (A) and filtered data (6)are presented. The polnts
plotted for the filtered data represent the mean SEP computed from the use of the 10 calibration models derked from the corresponding 10 best filters found during the fllter design procedure. Error bars In the plot Indicate the 95% confidence intervals about the means.
*
a Standard error of estimate for the calibration model. Standard error of prediction.
S
24
input to the PLS procedure is an n X p matrix of absorbance spectra, where n is the number of spectra and p is the number of resolution elements (Le., absorbancevalues), along with an n x 1vector of measuredglucose concentrations. Before input to the PLS algorithm, the matrix of absorbance spectra is typically centered by subtracting from each absorbancevalue the mean absorbance at that resolution element. The result of PLS regression is an (h + 1)term calibration model of the form ci = b,
+ b1ti,, + ...+ b,x,
I
0
I
0
4
0
8
(1)
where cj is the predicted glucose concentration for spectrum i, the xi values are termed PLS factor scares, and the b terms are coefficients determined from a multiple linear regression analysis of the measured glucose concentrations of the calibration data and the h sets of PLS scores computed from the spectra of the calibration samples. To illustrate the use of PLS regression on the bovine plasma spectral data, calibration models were constructed based on four different spectral ranges. The entire range of 5OOO4002 cm-1 was used, along with three subsets of this range. The range of 4!300-4200 cm-1 was selected to remove the regions of largest variation in the spectra. T w o smaller ranges (4500-4200 and 4600-4200 cm-1) were selected to focus the calibration on the glucose bands near 4300 and 4400 cm-l. Employing the 300 spectra in the calibration set, PLS regression was used to generate models based on 1-20 terms. Each model was tested by applying it to the prediction of the 69 spectra in the test set. The standard error of prediction (SEP)was computed and plotted vs the number of PLS factors used. Figure 3A is a plot of this type for the models corresponding to the range of 5000-4002 cm-1. The model producing the minimum SEP was selected as the best model for each spectral range. For the case of the 6000-4002-~m-~ region, Figure 3A indicates that the 18-termmodel produced the best prediction results (SEP = 1.13 mM). The upper section of Table I11lists the statistics describing each of these best calibration models. Listed in the table are the number of terms in the model, the value of R2 (% ), the standard error of estimate (SEE), and the SEP. Figure 4 displays two plots of estimated glucose concentration (i.e., the concentration predicted by the calibration model) vs the concentration measured by the Beckman Aatra glucose analyzer. Both plots are derived from the 5000-4002-cm-1 spectral range. The upper plot correspondsto the model based on 10 PLS factors while the lower plot correspondsto the 18-term model found to be the best for this range. Open circles indicate spectra
12
16
20
24
28
20
24
28
Measured Concentration (mM)
0
-1
0
4
0
12
16
Measured Concentration (mM)
Figure4. Estimated glucose concentration vs concentration measured by the Beckman Astra glucose analyzer for 10-term(top)and l&term (bottom)PLS regression models. Both plots are based on nonflltered data and are derived from the 5000-4002~m-~spectral range. Calibration set and test set spectra are lndlcated by open and cbsed
circles, respectively.
in the calibration set while solid circles denote spectra in the test set. The solid lines in the plots denote perfect correlation between estimated and measured glucose concentrations. An advantage of the PLS algorithm is that it provides a means by which to extract a series of components from a set of input spectra in a manner that is driven by the explanation of variance in the vector of concentrations. Two disadvantages of the technique are apparent, however. First, noise components of the input data that happen to correlate with analyte concentration can be extracted and encoded in the factor scores. Second, if the concentration information is obscured by other larger sources of variation (e.g., baseline artifacts of the type displayed in Figure 2), the early factors computed will focus simply on extracting components of the baseline variation that correlate with concentration. In extreme cases,
ANALYTICAL CHEMISTRY, VOL. 65, NO. 22, NOVEMBER 15, 1093
the presence of significant baseline variation can effectively prevent the analyte information from being extracted, regardlesa of the number of PLS factors computed. To place the PLS technique in the proper perspective, it should be noted that the presence of significant spectral variation unrelated to analyte concentration can be problematic for any calibration method. These points are illustrated by an inspection of Figure 3A, Figure 4, and Table 111. The inclusion in the calculations of the regions of large spectral variation has a serious negative impact on the results. The best model for the 5000-4002-~m-~ region requires 18 terms, while the best models for the other regions require either 10 or 11terms. Even with the large number of additional terms, SEE and SEP are 2-3 times greater than for the models based on the other three spectral regions. When a 10-termmodel is used, the plot of estimated vs measured glucose concentration at the top of Figure 4 indicates a large degree of scatter and a poor match between the points corresponding to the calibration and test sets. The lower plot in Figure 4 reveals that significantscatter remains even when the l&term model is used. A total of 19 spectra (14 calibration, 5 prediction) in the lower plot exhibit residuals that exceed twice the SEE. Eleven of the spectra correspond to samples from plasma lot I, although the 19 spectra span 14 different plasma samples. No obvious unifying characteristic links these 19spectra. PLS regression alone is simply unable to model the glucose concentration adequately when presented with the raw spectra. The calibration and prediction results indicate that the spectral range is a key variable in the calibration model development. Even when the region of largest variation is removed, significant differencea are noted between the results for the remaining three spectral regions. Figure 2 suggests that significantbaseline variation still exists in the restricted spectral ranges. The presence of this variation may explain the differences between the results produced by these three ranges. The limitations of the PLS algorithm noted above motivate the use of an alternate spectral-processingstrategy to remove variation not related to analyte concentration. Such a procedure would be employed prior to the computation of the PLS factors and the generation of the calibration model. A key issue, however, is the ability to discriminate analyte and nonanalyte variation. In this work, we devise protocols for the use of digital fitering as a general preprocessing tool to be coupled with PLS regression. Digital Filtering. Digital fitering techniques are signalprocessing methods used to remove specified frequency components from measured data.6 The utility of these methods in removing spectral variation is based on the assumption that analyte and nonanalyte information can be separated due to differences in the underlying frequencies that comprise the data. If a spectrum is decomposed into its underlying harmonic components, it can be argued that verylow-frequency information is dominated by slowing varying components of the spectral trace (e.g., baseline variation). In a similar manner, high-frequency information is dominated by rapidly varying noise components of the spectrum. Information about analyte absorption bands should be concentrated in some middle range of frequencies. The digital fiitering technique used in this work is termed Fourier fiitering7-9 and is begun by applying a windowing function to the input spectrum, followed by a forward fast Fourier transform (FFT). The results presented here make (6) Childers, D.; Durling, A. Digital Filtering and Signal Processing;
Week St. Paul, MN, 1975. (7) Horlick, G. Anal. Chem. 1972,44,943-947. (8) B u h , C. A. Anal. Chem. 1974,46,890-895. (9) h i , T. M.; Warner, I. M. Appl. Spectrouc. 1984, 38, 422-429.
3288
.._________________________ 0.0 0.0
0.1
0.2 0.3 Digital Frequency (1)
0.4
0.5
Flgwe 5. Fourler domaln amplitude spectrum(solld Hne and left wls) and Gaussian filter bandpass functlon magnitude response (dadred llne and rlght y-axb). The region of 0.0-0.lf In the filtered Fwrkr domaln spectrum Is displayed In the Inset.
use of the Bessel windowing function.lO The real/imaginary profiie that results from the FFT is termed a Fourier domain spectrum. The frequency axis of this profile is often mapped onto a linear scale from 0.0 to 0.5 and is termed digital frequency 0.In the actual filtering step, frequencies are suppressed by performing a complex multiplication of the Fourier domain spectrum with a fiiter bandpass function. The bandpass function determines which frequencies in the data will be suppressed. An inverse FFT is next applied to the complex product, returning the data to the original domain. Since the FFT is a h e a r transformation, the inverse FFT is actually not required. Calibration models can be constructed directly from filtered Fourier domain spectra.11J2 A key advantage of returning the data to the original domain, however, is that the effects of the filtering procedure can be determined with respect to known spectral features. In performing the filtering operation, a variety of bandpass functions can be employed. Smoothly varying functions are often preferred to prevent the introduction of artifacts in the inverse FFT due to sharp transitions in the filtered Fourier domain spectrum. In the work described here, Gaussian bandpass functions were used as they possess a smoothly varying profile while still offering a relatively sharp rolloff. The solid l i e in Figure 5 indicates the amplitudes of each of the underlying frequencies of the absorbance spectrum of sample 58 displayed in the top portion of Figure 2. The spectrum was windowed with the Bessel function before the FFT was applied. The plotted amplitudes are a composite of the real and imaginary frequency components. The amplitude spectrum is clearly dominated by informationnear zero frequency. The dashed line in Figure 5 is a Gaussian bandpass function centered at 0.052f and with a width specified by a standard deviation of 0.013f. The half-power points of the function are located at approximately0.036 and 0.068f, corresponding to a width at half-height of approximately 0.032f. The right y-axis in the figure displays the magnitude response of the filter bandpass. A portion (0.00.ln of the fitered Fourier domain spectrumobtained through the application of this bandpass function is displayed in the inset of the figure. The lower spectrum in Figure 2 displaysthe results obtained by employing this bandpass function in Fourier filtering the tliree spectra in the top of the figure. As noted previously, a key feature of the unfiltered absorbance spectra at the top ~
(10) Kauppinen, J. K.; Moffatt, D. J.; Cameron, D. G.;Mantach, H.H. Appl. Opt. 1981,20,1866. (11) McClure, W. F.; Hamid,A.;Gieabrecht,F.G.;Weeks, W. W.App1. Spectrosc. 1984,38,322-329. (12) Donahue, S. M.; Brown, C. W.; Obremski, R. J. Appl. Spectrosc. 1988,42,353-359.
3284
ANALYTICAL CHEMISTRY, VOL. 65, NO. 22, NOVEMBER 15, 1993
.O.l5/ 5000
,
,
,
,
4900 4800 4700
,
,
4600
,
4500
,
,
,
,
,
,
,
j
4400 4300 4200 4100 4000
Wavenumber (cm.1)
Flgure 8. Loading weights for the fkst PLS factor computed from the regbn 5000-4002 cm-I of the nonflltered (A) and filtered (B) spectra of the calibration set. For the filtered data,the bandpass dispiayed in Figure 5 was used. The dashed line superimposed on the figure is the dlfference spectrum of sample 67 from Figure 1A.
of the figure is the large amount of baseline variation in the regions of 4200-4000 and 5000-4900 cm-l. A comparison of the top and bottom plots reveals that the filtering step has removed a significantfraction of this variation. In addition, the spectral information in the regions of the three glucose bands has been enhanced. Traces A and B in Figure 6 illustrate the impact of the filtering procedure on the PLS regression calculation. Figure 6A is a plot of the loading weights416for the fiit PLS factor computed from region 5000-4002 cm-1 of the nonfiltered spectra of the calibration set. Figure 6B is an analogous plot computed from calibration spectra which were preprocessed by applying the filter bandpass displayed in Figure 5. The dashed line superimposedon the figures is the same difference spectrum of sample 67 plotted previously as Figure 1A. The loading weights indicate the relative contributions of the individual spectral resolution elements to the PLS factor. It is clear from Figure 6A that the spectral regions containing large variation dominate the first PLS factor. For the filtered data, these regions have little impact as the fiiteringprocedure has eliminated much of the variation. Also, the first PLS factor computed with the filtered data is dominated by the region near the glucose band at 4400 cm-1. This band was judged most important in modeling glucose concentration in previous studies that used glucose samples prepared in phosphate buffer and bufferedprotein solutions.12 This result confirms that the filtering step is able to preprocess the data in a way that allows the PLS calculation to work more efficiently in extracting glucose spectral information. The fiitered spectra in Figure 2 resemble derivativespectra. This is not a coincidence, as a derivative calculation can be considered as an application of a filter that passes the highfrequency componente of a signal while suppressingthe lowfrequency components. For example, a harmonic function such as f ( t ) = .sin(2r(f)t) + sin(2r(2f)t) is composed of two frequency components (i.e., frequencies f and 2j). The first derivative of the function isf(t) = 27rO cos(2df)t) + 27r(2f) cos(2d2j)t). The component a t twice the frequency (i.e., 2j) is weighted by a factor of 2 in the derivative. A component at 3f would be weighted by a factor of 3. It can be seen from an extension of this example that the bandpass for a firetderivative filter is simply a line with a slope of 1 and an intercept of 0. This bandpass causes the low-frequency components of a signal to be suppressed. The Gaussian bandpass function used above performs a similar operation in suppressinglow-frequencysignals. However, it differs from the first-derivative fiiter in that high-frequency components are also suppressed.
The key to the Fourier fiitering procedure described above was the definition of the fiiter bandpass. If the fiitering step is to be useful as a general-purpose spectral preprocessing tool, protocols must be established for the defiiition of the bandpass. A significant complication in the development of such protocols is the fact that both the filtering and PLS regression calculations are ideally computations that extract components of the spectrum that relate to analyte concentration. The PLS calculation does this directly by use of the known concentrations of the calibration samples. Given an optimal bandpass, the fitering procedure ale0helps to extract analyte information by suppressing frequency components of the data that do not relate to changes in analyte concentration. To have maximum effectiveness, the filtering and PLS regression steps should work together, rather than be treated as independent processing steps. Stated differently, the design of the optimal filter bandpass requires information regarding the quality of the calibration model that would result from the application of PLS regression to the filtered data. Overview of Filter Design Procedures. The approach used here to define the optimal filter bandpass mirrors a procedure that we have used previously in building univariate calibration models based on integrated absorbancesof filtered spectral bands.' Statistics derived from the calibration model are used to define an objective function that describes the degree to which the filter bandpass is optimal. Given a bandpass shape (e.g., Gaussian), the filter can be explicitly defiied by two parameters: bandpass position and width. For the case of the Gaussian-shaped bandpass, these parameters correspond,respectively, to the mean and standard deviation of the Gaussian function. Together with the position and width parameters, the objectivefunction defies a response surface for the filter design. For example, if the value of R2 for the calibration model is used as the objective function, the optimal filter is specified by the mean and standard deviation of the Gaussian filter bandpass that produces the maximum R2 when a calibration model is constructed from the filtered data. The response surface can be interrogated for the optimal filter parameters in one of two ways. The surface can be mapped by exhaustively sampling values of the mean and standard deviation, filtering the data with the filter defined by the mean and standard deviation, building the calibration model with the filtered data, and computing the value of the objective function. The alternative approach is to use numerical optimization algorithms such as the simplex methodla or simulated annealing'" to traverse the response surface in search of the optimal value of the objective function. The latter approach has the advantage of speed in that fewer values of the objective function (i.e., fewer calibration models) need to be computed. It possesses the disadvantage that there is no guarantee that the optimal value found is the overall global optimum, rather than some local optimum. For the work reported here, we have employed a mapping of the response surface in order to be able to draw conclusions about the overall filter design approach without having to be concerned about the degree to which an optimization algorithm is able to find the global optimum of the response surface. The application of the above strategyto calibration models based on PLS regression is complicated by an additional experimental parameter, the number of PLS factors to employ in constructing the calibration model. The impact of the number of terms in the calibration model on the fiiter (13)Routh, M.W.; Swartz,P.A,;Denton, M . B.Anal. Chem. 1977,49, 1422-1428. (14) Sutter, J.
M.;Kalivas, J. H.Anal. Chem. 1991, 63, 2383-2386.
ANALYTICAL CHEMISTRY, VOL. 65, NO. 22, NOVEMBER 15, 1003
parameters must therefore be evaluated. This requires an independent study of the response surfaces corresponding to calibration models with different numbers of terms. In addition to the study of the number of terms in the calibration model, three other issues must be addressed in order to establish the effectiveness of Fourier filtering as a generalized spectral preprocessing tool. These issues are (1) the choice of an optimal objective function to use in defining the response surface for the filter design, (2)the effect of spectral range on the filter design, and (3) the effect of variation in the filter parameters on the performance of the calibration model derived from the fitered data. Selection of Objective Function. The principal requirement for an objectivefunction is that it provide sufficient sensitivity to allow the optimal filter parameters to be determined. Different objectivefunctions can exhibit widely different sensitivities. For example, the calibration model R2 is an obvious choice as an objective function. However, R2follows a cos2functionalform,leadingto avery flat response at values of R2 approaching 1.0 (i.e., the optimum). In this case, a wide range of filter designs will appear to produce similar calibration models. After evaluating several different objective functions, we adopted a function based on the mean squared error (MSE) produced in the calculation of the calibration model. The specific function used was 1 (2)
where the summation is carried out over the n calibration s w a , h is the number of PLS factors used, ci is the measured glucose concentration corresponding to spectrum i, and ti is the glucose concentrationpredicted by the calibration model. The denominator in eq 2 is the MSE of the calibration model. An MSE of 0.Orepresent.a a perfect fit of the calibration spectra to the model. The inverse of the MSE was used to increase the sensitivity of the objective function and to cause the optimal value of the function to be a maximum rather than a minimum. Figure 7A is a map of the response surface generated by use of this function. The objective function is plotted vs the mean and standard deviation of the corresponding Gaussian filter bandpass used to preprocess the spectral data. The spectral range of 5OOO-4002cm-1 was used, and the calibration models were based on 10 PLS factors. The range of 0.0-0.5f was sampled with a regular grid corresponding to a spacing of 0.01 in the mean and 0.005 in the standard deviation. A totalof 5000calibration models were generated corresponding to the individual filters. An inspection of Figure 7A reveals a series of optimal filters, corresponding to the ridge in the response surface. To evaluate these filters, a number of individual filters whose mean and standard deviation are located on the ridge were selected and the associated calibration models were applied to the prediction of glucose concentrations corresponding to the 69 spectra in the test set. A wide variety of prediction results were obtained across the selected filters, ranging from excellent to poor in terms of SEP. This indicates that many of the filters produce filtered spectra that lead to spurious calibration models that are not useful in prediction. Since the PLS algorithm is designed to extract components of the calibration spectra that correlate with concentration, it is possible to filter the data in such a way that random data values are introduced that correlate with concentration. The resulting models are accompanied by relatively good descriptivestatistics such as the MSE but are not effectivewhen applied to prediction data.
S286 A
Figure 7. Response surface maps for the filter design based on the use of 1/MSE (A) and l/(MSE MSPE) (6)as the objective function. No monltorkrgset was used Inthe generationof plot A, whlle a monitoring set containlng 25% of the calibration samples was used In generating plot 0.
+
In an effort to increase the robustness of the filter design scheme,attempts were made to incorporateprediction results into the objectivefunction. The calibrationset was randomly divided into two subsets-a new (smaller)calibration set and a separate “monitoringset”. Under this revised scheme, the filtered spectra of the new calibration set were used to generate a calibration model. That model was used to predict the glucose concentrations corresponding to spectra in the monitoring set. Based on these calibration and prediction results, a modified objective function was defined as 1
where n, is the number of spectra in the new calibration set, n, is the number of spectra in the monitoring set, h is the number of PLS factors used in the calibration model, ci is the measured glucose concentration of spectrum i in the calibration set, E j is the corresponding glucose concentration predicted by the model, mi is the measured glucose concentration of spectrum i in the monitoring set, and IP1.i is the glucose concentration predicted by the model. The denominator in eq 3 represents the sum of MSE and the mean squared prediction error (MSPE). This objective function gives equal weight to the calibration and prediction results obtained with the fiitered data. One drawback to this function that should be noted is that individual plasma samples may contribute to the function unequally, given that different numbers of replicate spectra may be collected for each sample. While we feel the calibration set in the present study is large enough to minimize this potential problem, it is an issue that should be considered when designinga calibration study based on this methodology. Given the desire to use a monitoring set to add robustness to the objective function, it is important to establish the optimal size of the monitoring set relative to that of the calibration set. To address this issue, eight calibration/ monitoring set pairs were generated corresponding to ratios
ANALYTICAL CHEMISTRY, VOL. 65, NO. 22, NOVEMBER 15, 1993
3280 0.5
1
1
0.0 0
10
20
30
40
50
60
Percentage 01 Calibration Samples Used for Monitoring
Flguro 8. Box plots descrlbing the distributions of the filter positions of the top 100 filters for each cailbration/monltorlng set.
of l00/0%, 95/5%, 87/13%, 80/20%, 75/25%, 6513596, 561 44%, and 47153% in terms of the number of samples in the calibration set relative to the number of samples in the monitoring set. The 100/0% set corresponds to the case in which all samplesare in the calibration set (i.e., no monitoring set is used). The monitoring sets were generated in a sequential order, with samples being selected randomly for inclusion. On the basis of this procedure, the smaller monitoring sets were subsets of the larger sets. For example, the 10% monitoring set was generated by random selection of samples to add to the 5% monitoring set. The random selection was based on samples rather than on spectra to ensure that replicate spectra of the same sample would not be present in both the calibration and monitoring sets. Once a sample was selected, all replicate spectra correspondingto that sample were placed in the monitoring set. Employing the objective function in eq 3, responsesurfaces were generated for each of the calibrationlmonitoring sets. Figure 7B is the response surface for the 75125% calibration/ monitoringset. A comparison to the plot of the 100/0% case (Figure 7A) reveals a significant reduction in the size of the ridge that defies the optimal filter parameters. The optimal filter parameters are much more clearly defied in the case in which the monitoring set is used. To study the impact of different monitoring set sizes in better defining the optimal filter parameters, the bandpass positions (i.e., the mean of the Gaussian bandpass function) corresponding to the largest 100 values of the objective function (out of 5000 total)were selected for each of the eight calibrationlmonitoring sets. By sampling the fiiter position corresponding to the best 100 filters, an outline of the top of the response surface is obtained. As the response surface changes from a broad ridge to a narrower optimal region, the median of the filter position should contract to a limiting value and the width of the distribution of the position values should decrease. Figure 8 contains a series of box plots for each calibration/ monitoring set that describes each distribution of positions of the top 100filters. The horizontal lines in each box denote the median fiiter position, while the notches specify the 95% confidence interval about the median. The vertical limits of each box denote the 50 % of the position values that surround the median. The upper and lower vertical lines denote the maximum and minimum position values for each set of 100. A box plot that spans a larger vertical distance corresponds to a wider distribution of values. In our application, this indicates a response surface with a broad ridge. Each box is plotted at a horizontal position corresponding to the percentage of the 55 calibration samples used in constructing the monitoring set. In the plot, a contraction of the response surface is noted beginning with the 25 % monitoring set. The distributions
of positions of the top 100 filters then stabilize to limiting values of the median and the distribution width. This result suggests that, for the bovine plasma data, a monitoring set consisting of 25% of the calibration samples is sufficient to define a sharp optimal region of the response surface. For this reason, the 75/25 % calibration/monitoring set was used for all further work with the plasma data. In Table 11,the values in parentheses indicate the numbers of samples and spectra from each of the plasma lots that comprised the 25 % monitoring set. Calibration and Prediction Results for Filtered Data. The filter design strategy described above was applied to the plasma data in order to evaluate the utility of Fourier fiitering as a spectral preprocessing tool. The impact of the selected spectral range and the number of PLS factors was evaluated. In addition, the precision of the filter bandpass parameters was studied. A 3000-point sampling of the response surface was employed in these studies. The mean of the Gaussian bandpass function was varied from 0.0 to O . l f , while the standard deviation was varied from 0.0 to 0.03f. Both ranges were sampled in steps of 0.00lf. This restricted region of the response surface was selected on the basis of the complete surfaces plotted in Figure 7. For each spectral range, separate response surfaces were mapped for calibration models based on 3-12 PLS factors. For each range and each model size, the filters corresponding to the top 10 values of the objective function were selected and the associated calibration models were applied to the 69 spectra in the prediction set. The SEP values were computed for each of these cases, as were the means and standard deviations of the SEP values. In addition, the smallest SEP value was found for each case. This value was subsequently used to select the optimal model size for each spectral range. The lower section of Table I11 lists the calibration and prediction results for these optimal models. The use of the prediction set for selecting the optimum model size does mean that the SEP values reported may be slightly lower than those that would be obtained from the application of the selected models to some other prediction set. A more independent, but computationally more intensive, procedure for selecting the model size would be to withhold subsets of the calibration data, generate calibration models with the remaining data, and use the spectra withheld as a prediction set for selecting the model size. Repeated application of this procedure by withholdingthe spectra of one sample at a time and pooling the prediction results is termed a cross-validation prediction.& A comparison of the results for the filtered and nonfiltered data reveals three significant points. First, for models constructed with the fiitered data, significant improvements are noted in both the calibration and prediction results. As expected, the benefits of filtering are greatest for the 5000-4002-~m-~region. Relative to the nonfiltered case, reductions of 53 % and 70 5% are observed in the values of SEE and SEP, respectively. Here, the filtering step is clearly critical in removing variation that is hampering the PLS regression calculation. Figure 9 is a plot of estimated vs measured glucose concentrations for the 10-term calibration model computed with the filtered data from 5000 to 4002 cm-'. The filter bandpass used is the same as that displayed in Figure 5. The format of the plot and the axis scales are identical to Figure 4. A striking improvementin both calibration and prediction results is noted relative to the results obtained with the 10term and 18-term models computed from the nonfiltered data. It is clear from an inspection of the figure that the precision of calibration and prediction is constant with concentration. Significant improvements are also realized for models constructed with filtered data from the restricted spectral
ANALYTICAL CHEMISTRY, VOL. 65, NO. 22, NOVEMBER 15, 1993 291
,
,
,
,
,
,
,
,
,
,
,
,
Table IV. Precision of Estimated Concentrations SD, filtered SD, nonfiitered spectral data (mM) data (mM) range (cm-1) within between within between 5000-4002 4900-4200 4500-4200 4600-4200
-11
0
'
'
'
'
8
4
'
'
12
'
'
'
16
'
20
24
I
28
Measured Concentration (mM)
Flguro 0. Estimated vs measured glucose concentratlonsfor the 10term caiibratkm model computed wlth the filtered data from 5000 to 4002 cm-l. The filter bandpass used is the same as that displayed In Flgure 5. Calibration set and test set spectra are indicated by open
and closed circles, respecthrely.
regions. For example, reductions in SEP of 20%, 36%, and 42% are observed for the 4900-4200-, 4500-4200-, and 4600-4200-cm-1 spectral regions, respectively. Thus, even when significant variation has been removed by restricting the spectral range, the filtering procedure can still lead to improvements in the computed calibration models. The results in Table I11 also reveal that the filtering step helps to eliminatespectral range as an important experimental variable. As noted previously, the use of nonfiltered data produces widely varying calibration and prediction results for the different spectral ranges. However, the results for models based on filtered data are remarkably consistent across the different ranges. Across the four spectral ranges, the relative standard deviations in SEE and SEP for models based on nonfiltered data are 31.4 % and 57.4%, respectively. The corresponding values for models based on filtered data are 1.1% and 9.6 % This is a highly significant result, as a large amount of time in any calibration study can be spent adjusting the spectral range submitted to the PLS regression computation. Finally, a reduction in the required number of PLS factors is observed for models computed from the filtered data. As expected, this reduction is greatest (10 PLS factors vs 18) for the region of 5000-4002 cm-1 where large amounts of baseline variation are present. Even with the restricted spectral regions, however, a reduction of two to three PLS factors is observed. It is clear from these results that the Fourier filtering step allows the PLS calculation to work more efficiently in extracting glucose informationfrom the spectral data. This is a significant result, given the fundamental principle that a simpler model is in most cases a more robust model. Examination of Precision of Estimated Concentrations. The collection of replicate spectra for each sample and the use of replication groups collected at different times allow the calculation of the precision of the estimated concentrations in two ways. Within a replication group for a given sample, calculation of the standard deviation of the estimated concentrationsprovides an estimate of the intrinsic noise level of a single spectral measurement, expressed in concentration terms. Calculation of the standard deviation of the estimated concentrationsfor a sampleacross replication groups provides an estimate of the day-to-daystability of the measurement. These calculations were performed separately for each of the 14 prediction samples (25 replication groups total),and standard deviations were computed within and between replication groups. These computed values were then pooled across the 14samplesto produce overall within- and between-
.
9287
0.296 0.242 0.244 0.257
0.432 0.348 0.352 0.292
1.281 0.231 0.229 0.265
1.307 0.352 0.659 0.664
group standard deviations. Separate calculations were performed for each of the four spectral ranges and for filtered and nonfiltered data. The resulting pooled standard deviations are presented in Table IV. Except for the 5000-4002-cm-1 spectral range with nonfiltered data, the pooled within-group standard deviations are remarkably consistent within the range of 0.2-0.3 mM. The problems encounteredin building an effective calibration model for the 5000-4002-cm-~range with nonfiltered data have been discussed previously. The fact that the withingroup standard deviation is stable across three spectral regions and stable with respect to whether or not filtering was performed suggests that it is an estimate of the intrinsic instrumental precision, expressed in concentration terms. A significantly larger range of values is observed in the pooled between-group standard deviations. It is apparent that the digital filteringstep significantlyreduces the betweengroup standard deviation for the two smallest spectral ranges. These differences in standard deviation are significant at the 95 % confidence level, based on statisticalF-tests. One would expect the spectral variation between replication groups to be largely composed of baseline changes. Here, the effectiveness of the filtering procedure in removing baseline variation is observed. For the 4900-4200-~m-~range, the filtered and nonfiltered data produce similar standard deviations. With this region, a large enough spectral range is available to enable the PLS algorithm alone to account for the baseline variation, although three additional model terms are required. The consistency in the computed within- and between-group standard deviations for all of the fdtered data provides further evidence that the filtering step effectively removes spectral range as a variable in the calibration design. Effects of Imprecision in Filter Parameters. An important question in designing filters for use in spectral preprocessing is the degree to which the filter position and width parameters need to be set to optimalvalues. In addition, the question arises as to the effect of spectral range and the number of PLS factors on the optimal filter parameters. Stated differently, do separate filters need to be designed for each spectral range and for each model size? As noted above, response surface maps were generated to allow these questions to be addressed. The top, middle, and bottom plots in Figure 10 display clustered bar charta that display the relationships between the number of PLS factors used and filter position, filter width, and SEP, respectively. Each cluster of bars corresponds to the four spectral ranges. From left to right in each cluster, the striped, open, solid, and hatched bars correspond to 5000-4002,4900-4200,4500-4200, and 4600-4200 cm-1, respectively. The height of each bar corresponds to the mean value of the parameter (i.e., filter position, filter width, or SEP) computed over the 10 filters exhibiting the largest values of the objective function. Error bars denote the upper 95% confidence limit associated with each mean. Since each mean is based on 10 observations, the confidencelimits can be compared across the bars as indicators of the width of each statistical distribution. The top two plots in Figure 10 can be used to address the issue of imprecision in the filter parameters by considering the ideal cases. If spectral range had no influence on the
9288
ANALYTICAL CHEMISTRY, VOL. 65, NO. 22, NOVEMBER 15, 1993
0.06 0.05
1
i b
c
0.04 0.09
Y
0.02 0.01
0.00
3
4
5
6 7 8 9 Number of PLS Factors
10
11
12
10
11
12
0.024 1
0.01 8
E
gb
0.012
E 0.006
0.000
9
4
5
6
r
8
9
Numbrr of PLS Factors 3.0
3
25
0
2.0
-i
1.5
f
Ly
g
1.0 0.5
0.0
Number of PLS FacIofa
Flgurr 10. Clustered bar charts dlsplaylng the relationships between the number of PLS factors used and filter position (top), filter width (middle),and SEP (bottom). The striped, open, solid, and hatched bars In each cluster correspond to 5000-4002, 4900-4200, 4500-4200, and 4600-4200 cm-l, respectively. Bar helghts correspond to the mean value of the parameter computed over the 10 filters exhibiting the largest values of the objective function. Error bars denote the upper 95% confidence limit associated with each mean value.
filter design, the heights of the four bars in each cluster would be identical to within the precision of the corresponding mean value. Similarly, if the number of PLS factors had no effect, the height of each group of bars would not change as a function of the number of PLS factors. Across the spectral ranges and the different numbers of PLS factors, significantly more variation is noted in filter width than in filter position. However, the mean values of both parameters vary to agreater extent than can be explained by the 95% confidence intervals. Analysis of variance (ANOVA) studies16 were performed to confirm this visual interpretation of the results. First, one-way ANOVA studies were conducted separately for the four spectral ranges. Only the range of 5-12 PLS factors was used in the study due to the poor calibration models formed
with less than five PLS factors. In each case, significant differences were found to exist among the means of the filter position and width parameters for the different numbers of PLS factors. This result suggests that, on statistical grounds, the filter design needs to be performed for each model size. An examination of residual plots from the ANOVA revealed no irregularities that would invalidate this conclusion. Next, the data were blocked by spectral range and a twoway ANOVA study was performed. The ANOVA model used included a term for the interaction between spectral range and the number of PLS factors. The study performed on the range of 5-12 PLS factors confirmed the results obtained with one-way ANOVA. Significant differences were noted between the filter position and width parameters. In this study, spectral range, the number of PLS factors, and the interaction of the two were all found to be significant. As before, the analysis of residual plots revealed no irregularities that would invalidate the study. The visual inspection of the top two plots in Figure 10 and the ANOVA results must be placed into perspective by inspecting the lower plot in Figure 10 and Figure 3B. The lower plot in Figure 10is a bar chart that indicates the effect of imprecision in the filter parameters on SEP. Here, it can be seen that, once the number of PLS factors has reached an optimal value, the effect of variation in the filter parameters on SEP is relatively small. Figure 3B is a plot of SEP vs number of PLS factors for the 50004002-cm-lspectral region. It is included in Figure 3 for comparison with the analogous plot for the nonfiltered data. The error bars are again constructed to indicate the 95% confidence interval for SEP based on the use of the 10 filters yielding the largest values of the objective function. Each point represents the mean SEP computed from the application of 10different calibration models to the prediction set. This plot provides further indication that, once the model has reached the optimal size, the effect on SEP of an imprecise filter bandpass is minimal. This is confirmed in the plot by the collapse of the error bars about the mean SEP as the model approaches the optimal size. The ANOVA studies confirm that statistically different filter parameters will be obtained if the filter design is performed for different spectral ranges and for models with different numbers of PLS factors. Some variation in the filter design is tolerable, however, as the different filters appear to work similarly in terms of their ability to preprocess spectra. Since PLS factors used in building the calibration model are computed separately for each set of filtered calibration spectra, the PLS calculation is able to adapt itself to the specific filter bandpass used. This serves to make the determination of a precise filter bandpass less important.
CONCLUSIONS The results presented in this paper demonstrate that, given appropriate data analysis methods, near-IR spectroscopy can be used to determine glucose concentrations in undiluted plasma matrices with high levels of accuracy and precision. Calibration models developed across three different plasma matrices were able to achieve values of SEE and SEP in the range of 0.4-0.5 mM. For the best model computed from the range of 4600-4200 cm-1, the SEP value of 0.37 mM corresponds to a mean percent error of 4.28%. This result is particularly impressive, given the quoted 3-6 % relative standard deviation of the Beckman Astra glucose analyzer used to determine the concentrations of the samples in the calibration and test sets. (15) Box, G. E. P.; Hunter, W. G.; Hunter, J. S. Statistics for Experimenters; Wiley: New York, 1978; Chapters 6, 7.
ANALYTICAL CHEMISTRY, VOL. 65, NO. 22, NOVEMBER 15, 1993
For the same calibration model, the mean relative standard deviation in predicted glucose concentration computed over the replicates of the 14 samples in the test set was 3.18%. This value indicates excellent precision in the predicted concentrations. The fact that these prediction results were obtained with a true external set of test spectra gives further indication that the computed calibration models are robust. The high levels of accuracy and precision obtained in this work confirm that near-IR spectroscopy is a competitive technique for the determination of glucose in complex biological matrices. A key to obtaining excellent calibration and prediction results was the ability to couple Fourier filtering to PLS regression in an optimal way. The results obtained in this study demonstrate that digital filtering can be a highly effective spectral preprocessing tool, provided the filter bandpass is optimized in conjunction with the calculation of the calibration model. While the work reported here focused exclusivelyon the use of Fourier filtering with PLS regression, the procedures developed for optimizing the filter bandpass should transfer directly to studies in which calibration models are developed with techniques such as principal components regression (PCRI.5 A key issue that determinesthe practicality of implementing the filter design scheme is the degree to which the filter parameters need to be optimized. The results of the study with the plasma data were somewhat inconclusive in this regard. While statistically different filter parameters were found when the filter design was applied to different spectral ranges and to models with different numbers of PLS factors, the 10 best filters tended to function similarly when used to preprocess spectra. It should be stressed, however, that these conclusions may be somewhat data dependent. We have reached similar conclusions in our previous work with glucose in buffered protein solutions,2 but other types of data may behave differently. Our recommendation is that potential users of this filter design method should explore the effects of spectral range, model size, and imprecision in the filter parameters to the maximum degree possible, given their computational resources. For example, if limited computational capabilities are present, a relatively coarse grid mesh could be used on the response surface for the filter design or numerical optimizationmethods could be used to search for the optimal filter parameters without exhaustivelysamplingthe response surface. It has been our experience that if significant sampleto-sample variation is present, the use of the filtering preprocessing step with a less than optimum filter bandpass still yields results superior to those obtained with PLS regression alone. The filtering step can be performed similarly with timedomain digital filters instead of through Fourier filtering. Numerous filter design methods are available to allow finite impulse response (FIR) and infinite impulse response (IIR) (16) Savitsky, A,; Golay, M. J. E. Anal. Chem. 1964, 36, 1627-1639.
3289
bandpass filters to be generated that can operate directly on spectra and thereby eliminate the use of the FFT.6 These methods could be used directly in the filter optimization or in building a dedicated filter once the optimum bandpass has been determined through the Fourier filtering procedures used here. For cases in which the input spectra have a very small number of points, the FFT may not able to compute an accurate frequency profile. In such cases, these alternate filter design methods may be required. Finally, the similarity of the filtered spectra in Figure 2 to derivative spectra raises the question of how the results obtained with the filtered data compare to those obtained with ordinary derivative spectra. While an exhaustive comparison would require a thorough investigation of different derivative approximation methods, it is useful to compare our results to those obtained with a standard differentiation algorithm. A common approach to numerical differentiation of spectral data is to use the polynomial least-squares derivative filters derived by Savitsky and Golay.16 These fiiters incorporate differentiation with smoothing. As an example, five of the Savitsky-Golay first-derivative filters (five- and seven-point quadratic, five- and seven-point cubic quartic, and seven-point quintic sexic) were applied to the calibration and prediction spectra, and PLS calibration models were developed over the 4900-4200-~m-~ spectral range. The computed models were applied to the prediction spectra, and SEP values were computed. The best SEP values obtained over the range of 1-20 PLS factors varied widely with the specific derivative approximationused, ranging from 0.50 to 0.81 mM. The models producing these SEP values ranged from 8 to 15 terms. The best of these SEP values is still greater than the 0.39 mM value obtained with our bandpass-filtered data. The results of this comparison study reveal that the choice of the derivative approximation function is critical. This is not surprising, as these derivative approximation methods are simply bandpass filters. Our approach based on tuning the filter bandpass precisely to the calibration data provides a clear advantage over the use of any standard filter with a predetermined bandpass.
ACKNOWLEDGMENT This research was supported by the National Institutes of Health under Grant 1-R01-DK45126-01A1. Partial funding for the purchase of the Nicolet 740 spectrometerwas provided by the National Institutes of Health under Grant 1-R03RR04583-01. The Department of the Army is acknowledged for providing the Silicon Graphics 4D/460 computer system. Portions of this work were presented at the 1993 Pittsburgh Conference and Exposition on Analytical Chemistry and Applied Spectroscopy, Atlanta, GA, March 1993. RECEIVEDfor review June 2, 1993. Accepted August 20, 1993.' a Abstract
published in Advance ACS Abstracts, October 1, 1993.