Anal. Chem. 2009, 81, 2199–2207
Multivariate Calibration with Basis Functions Derived from Optical Filters

Toshiyasu Tarumi,† Yuping Wu, and Gary W. Small*

Department of Chemistry and Optical Science and Technology Center, University of Iowa, Iowa City, Iowa 52242

Multivariate calibration models are constructed through the use of Gaussian basis functions to extract relevant information from single-beam spectral data. These basis functions are related by analogy to optical filters and offer a pathway to the direct implementation of the calibration model in the spectrometer hardware. The basis functions are determined by use of a numerical optimization procedure employing genetic algorithms. This calibration methodology is demonstrated through the development of quantitative models in near-infrared spectroscopy. Calibrations are developed for the determination of physiological levels of glucose in two synthetic biological matrixes, and the resulting models are tested by application to external prediction data collected as much as 4 months outside the time frame of the calibration data used to compute the models. The calibrations developed with the Gaussian basis functions are compared to conventional calibration models computed with partial least-squares (PLS) regression. For both data sets, the models based on the Gaussian functions are observed to outperform the PLS models, particularly with respect to calibration stability over time.

Multivariate calibration methods have come into routine use in a variety of applications in which no single measurement variable has sufficient selectivity to allow a univariate relationship to be established with analyte concentration.1 In quantitative applications of optical spectroscopy, multivariate calibration is required when spectral overlap among the components of a mixture sample obscures the signature of the analyte.
In this case, a calibration model based on measurements at multiple wavelengths is used to extract the analyte information from the variable spectral background. With modern spectrometers, spectra are easily measured over hundreds of wavelengths. A set of calibration samples must be measured over these wavelengths to form the basis for the quantitative model. To obtain a statistically valid calibration model in this setting, the ratio of observations (i.e., samples) to independent variables (i.e., wavelengths used in the model) must be on the order of 6:1.2

* To whom correspondence should be addressed. E-mail: gary-small@uiowa.edu.
† Present address: Nihon Buchi KK, IMON Bld. 3F, 2-7-17 Ikenohata, Taito-Ku, Tokyo, 110-0008, Japan.
(1) Martens, H.; Næs, T. Multivariate Calibration; Wiley: New York, 1989.
(2) Standard Practices for Infrared Multivariate Quantitative Analysis; American Society for Testing and Materials: West Conshohocken, PA, 2000.
10.1021/ac802023w CCC: $40.75. 2009 American Chemical Society. Published on Web 02/10/2009.

This criterion dictates that either a very
large number of calibration samples must be measured or a way must be found to extract the key information from a large number of wavelengths and represent that information in a relatively small number of independent variables that are subsequently used to construct the calibration model. The most widely used approach to addressing this issue is to employ one of a number of related latent variable methods to reduce the dimensionality of the spectra without losing the key information that is needed to establish the calibration model. The two most commonly applied methods to accomplish this goal are principal component analysis (PCA)3 and partial least-squares (PLS).4 If the n calibration spectra (p wavelengths) are assumed to occupy the rows of an n × p matrix, X, the PCA and PLS methods compute h underlying components in X (h ≪ p) that represent the important spectral information. These methods take advantage of the inherent correlation that exists in X among the wavelengths, thereby extracting components that do not carry redundant information. In mathematical terms, these extracted components can be considered new basis functions that allow the information in X to be represented with lower dimensionality. A regression model is then computed to relate the analyte concentrations of the calibration samples to the corresponding coordinates (scores) of the spectra obtained with respect to the new basis. Once the calibration model is established, it is informative to take the computed model and basis functions and calculate the equivalent model that directly relates the original spectral variables in X to concentration. For a glucose analysis in near-infrared (near-IR) spectroscopy (data set B in the research reported here), the solid trace in Figure 1 shows the resulting regression coefficients for a PLS model. The computed coefficients are plotted across the spectral range of the model, indicating the spectral signatures that the model is extracting.
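As a concrete illustration of this latent-variable approach, the sketch below builds a small calibration model with PCA standing in for the PLS decomposition used in the paper and back-transforms it to regression coefficients in the original wavelength space. This is a minimal sketch under stated assumptions; the function and variable names are illustrative, not from the original work.

```python
import numpy as np

def pca_regression(X, c, h):
    """Reduce n x p spectra X to h principal-component scores, regress
    concentration c on the scores, and back-transform the model to
    regression coefficients in the original wavelength space.
    PCA is used here as a stand-in for the PLS decomposition."""
    Xc = X - X.mean(axis=0)
    cc = c - c.mean()
    # Loadings: top h right singular vectors of the centered spectra
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:h].T                      # p x h basis
    T = Xc @ V                        # n x h scores
    r, *_ = np.linalg.lstsq(T, cc, rcond=None)
    # Equivalent model in the original variables: c ~ mean(c) + Xc @ b
    b = V @ r                         # p-length regression coefficient trace
    return b
```

Plotting the returned coefficient vector against wavenumber would yield a trace analogous to the solid line in Figure 1.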
The sample matrix in this case was an aqueous buffer containing six independently varying constituents. Examination of Figure 1 reveals the presence of a combination of broad and sharp spectral features in the regression coefficient trace. The dashed line superimposed on the figure is the pure-component spectrum of aqueous glucose5 over the corresponding spectral range. Comparison of the two plots reveals the presence in the regression coefficient trace of significantly narrower features than are present in the glucose spectrum. Although some narrow features are required in the regression coefficients to model the overlapping (3) Jolliffe, I. T. Principal Component Analysis; Springer-Verlag: New York, 1986. (4) Haaland, D. M.; Thomas, E. V. Anal. Chem. 1988, 60, 1193–1202. (5) Amerov, A. K.; Chen, J.; Arnold, M. A. Appl. Spectrosc. 2004, 58, 1195–1204.
Analytical Chemistry, Vol. 81, No. 6, March 15, 2009
2199
Figure 1. Regression coefficients for a PLS model for glucose derived from the calibration data in data set B (solid line and left y-axis) and the absorptivity spectrum of aqueous glucose (dashed line and right y-axis). The model was based on 12 latent variables and the spectral range of 4250-4500 cm-1.
signals from multiple components, the occurrence of very narrow features (e.g., 4250-4275 cm-1) raises the question of whether those features arise simply from artifacts or chance correlations in the calibration data. If so, the ability of the model to predict samples outside of the calibration set may be compromised. To address this issue, several researchers have worked to modify the latent variable methods to penalize features in the regression coefficient trace that are too narrow.6-10 In effect, these methods seek a set of lower-order basis functions for use in extracting information from the measured data. As these lower-order functions are inherently biased against narrow features in the regression coefficient trace, potential artifacts of the type displayed in Figure 1 are avoided. In the work presented here, a conceptually similar approach is used to extract features from spectral data. A numerical optimization method is employed to define a set of Gaussian basis functions that can be used to represent the important information in a calibration set of near-IR spectra. In addition to providing a lower-order basis, a key characteristic of the use of these functions is that they provide a direct path to the implementation of the calibration model in the spectrometer hardware. Because of their analogy to optical filters with a single band-pass, the Gaussian basis functions allow the calibration model to take the form of a specialized filter photometer that is dedicated to a given analytical determination. This modeling approach is evaluated for the quantitation of physiological levels of glucose in two different synthetic biological matrixes.

EXPERIMENTAL SECTION

Sample Preparation. Two data sets were used in this research. Both were based on the analysis of physiological levels of glucose in synthetic biological matrixes. (6) Kramer, N.; Boulesteix, A.-L.; Tutz, G. Chemom. Intell. Lab. Syst. 2008, 94, 60–69. (7) Reiss, P. T.; Ogden, R. T. J. Am. Stat. Assoc.
2007, 102, 984–996. (8) Saeys, W.; De Ketelaere, B.; Darius, P. J. Chemom. 2008, 22, 335–344. (9) Alsberg, B. K. J. Chemom. 1993, 7, 177–193. (10) Marx, B. D.; Eilers, P. H. C. Technometrics 1999, 41, 1–13.
Data set A consisted of 203 three-component samples containing nine levels of glucose (0.8-16.2 mM) in conjunction with a 4 × 4 factorial design of urea (7, 14, 21, 28 mM) and triacetin (1.4, 2.1, 2.8, 3.5 g/L). Triacetin was used here to model triglycerides in biological matrixes. All samples were prepared in a pH 7.4, 0.1 M phosphate buffer solution (NaH2PO4 · H2O + NaOH, Fisher Scientific, Fair Lawn, NJ). Sodium benzoate (0.5% w/w, Fisher Scientific) was added to the buffer as a preservative. The glucose (ACS reagent, Fisher Scientific) and urea (ACS reagent, Fisher) reagents were dried before use, and the triacetin (99%, Sigma Chemical Co., St. Louis, MO) was stored over molecular sieves (Fisher) to remove trace water. Water used for sample preparation was obtained from a Milli-Q Plus water purification system (Millipore, Inc., Bedford, MA). Stock solutions were prepared by weighing the relevant reagents and diluting with the phosphate buffer. A separate stock solution was prepared for each of the 16 combinations of triacetin and urea. Sample solutions with varying glucose concentrations were then generated by in-line mixing of a 20 mM glucose stock solution with the solutions of the sample matrix. Two peristaltic pumps (Rabbit-Plus and RP-1, Rainin Instrument Co., Woburn, MA) were used for the mixing. Data set B consisted of six-component samples containing varying levels of α-D-glucose, sodium L-lactate, urea, sodium L-ascorbate, β-alanine, and triacetin dissolved in the same phosphate buffer described above. By adding lactate, ascorbate, and alanine to the sample matrix of data set A, significantly greater spectral overlap was obtained. This provided a more rigorous test of the selectivity of the data analysis methodology. A uniform experimental design11 was used to generate an initial set of 2129 mixture combinations.
The subset selection method of Carpenter and Small12 was then employed to obtain 64 calibration and 25 prediction samples from the pool of 2129 combinations. The concentration ranges for glucose, lactate, urea, ascorbate, alanine, and triacetin were 1.6-30, 0.8-32, 1-30, 1.2-19.8, 1.5-28.5, and 1.4-30 mM, respectively. In the calibration data set, the correlation coefficients between the concentrations of the components varied from -0.1 to 0.13, whereas for the prediction set, the correlation coefficients were in the range from -0.27 to 0.21. Stock solutions of each component were prepared by weighing α-D-glucose (ACS reagent, Aldrich, Milwaukee, WI), sodium L-lactate (99%, Aldrich), urea (minimum 99.5%, Sigma, St. Louis, MO), sodium L-ascorbate (>99%, Aldrich), β-alanine (>99%, Aldrich), or triacetin (approximately 99%, Sigma) and diluting with the buffer. Each mixture sample was then prepared by pipetting appropriate volumes of the stock solutions and diluting with the buffer. Data Collection. For data set A, the data collection was performed over 59 days. A set of 86 calibration samples was generated with the pumping system and spectra were acquired on days 1, 2, 5, and 8. A total of 27 samples were subsequently measured (days 8 and 11) and used in optimizing the calibration. These samples are termed the monitoring set. Prediction sets of (11) Fang, K. T.; Wang, Y. Number-Theoretic Methods in Statistics; Chapman and Hall: London, 1993. (12) Carpenter, S. E.; Small, G. W. Anal. Chim. Acta 1991, 249, 305–321.
36 samples (days 11 and 18) and 54 samples (days 45, 47, 50, 53, 56, 59) were then collected for use in testing the optimized calibrations. Interferograms of 8192 points were obtained with a Digilab FTS-60A Fourier transform (FT) spectrometer (Varian, Randolph, MA) equipped with a 100 W tungsten-halogen source, CaF2 beam splitter, and InSb detector cooled with liquid nitrogen. The spectral bandwidth was 15 801 cm-1, and 256 scans were coadded. The flow of sample solution produced at the exit of the in-line mixer was directed into an Infrasil quartz flow cell with a 2 mm path length (Starna Cells, Atascadero, CA). A K-band interference filter (Barr Associates, Westford, MA) was placed before the sample cell to isolate the near-IR spectral region of 5000-4000 cm-1. The temperature of the flow cell was controlled to 37 ± 0.1 °C during the data collection by use of a circulating bath (model 9100, Fisher Scientific) and water-jacketed cell holder. The temperature was monitored by affixing a T-type thermocouple (Omega Engineering, Stamford, CT) directly to the exterior of the cell. A digital thermocouple meter (Omega Engineering) was used to record the temperature readings. For each sample matrix having a fixed concentration of triacetin and urea, the glucose concentration was varied in an increasing or decreasing order over the 0.8-16.2 mM range. The starting glucose concentration (i.e., 0.8 or 16.2 mM) was varied randomly with respect to the triacetin and urea levels. The order of the data collection was also random with respect to the triacetin and urea levels. Before starting the interferogram collection for a given sample matrix, phosphate buffer was pumped through the flow cell and three replicate interferograms were obtained. These buffer samples were used as a reference background in subsequent calculations. Three replicate interferograms were obtained for each glucose level.
At the start and end of the data acquisition, fractions of the sample solution were collected after the flow cell in order to establish the actual glucose concentrations generated. A YSI model 2300 Stat Plus analyzer (YSI, Inc., Yellow Springs, OH) was used to determine the glucose concentration in each collected fraction. Burmeister and Arnold have estimated a standard error of 0.28 mM for glucose measurements in aqueous solutions with this instrument.13 The mean value of the measured concentrations in the two fractions was assigned as the value of the glucose concentration associated with the three collected interferograms. For data set B, the samples were measured with a Nicolet Nexus 6700 FT spectrometer (Thermo Electron Corporation, Madison, WI) equipped with a globar source, CaF2 beam splitter, and liquid-nitrogen-cooled InSb detector. The K-band optical interference filter was again used to isolate the spectral region of 5000-4000 cm-1. A thin-film type neutral density filter with 63% transmittance (Rolyn Optics, Covina, CA) was also placed before the sample cell to attenuate the incident light for the purpose of ensuring detector linearity. Spectra were collected as 8192-point double-sided interferograms with 256 coadded scans. The spectral bandwidth was again 15 801 cm-1. A liquid transmission cell with a 20 mm diameter circular aperture (model 118-3, Wilmad Glass, Buena, NJ), sapphire windows (Meller Optics, Providence, RI), and a path length of (13) Burmeister, J. J.; Arnold, M. A. Anal. Lett. 1995, 28, 581–592.
1.5 mm was used to contain the sample solutions for measurement. The temperature of the cell was controlled to 37 ± 0.1 °C by the same procedures described above. The YSI analyzer was again used as a reference method to verify the glucose concentration of each sample solution after preparation although the actual prepared concentrations were used in the modeling computations. Samples in the calibration data set were measured over a period of 4 consecutive days, with 12, 16, 12, and 24 samples being measured on days 1, 2, 3, and 4, respectively. The 24 samples measured on day 4 served as the monitoring set used in the optimization of the calibration model. The spectra of the 25 samples in the prediction set were collected nine times over 115 days. The first, second, and third prediction sets were measured beginning 1 day, 1 week, and 2 weeks after the calibration data set was collected. After that, spectra of the prediction samples were collected every 2 weeks. Each collection of the prediction set was spread over 2 consecutive days, with 12 samples being measured on the first day and 13 on the second day. In all cases, samples were run in a random order with respect to glucose concentrations to reduce the correlation between glucose concentration and any time-dependent artifacts. Three replicate spectra were measured consecutively for each sample solution. During each collection session, spectra of buffer were also recorded periodically as a background. On each experiment day, 16 buffer spectra were collected before the sample collection during the first 2 h of spectrometer warm-up. Then, three buffer spectra were collected after six samples were measured and again at the end of each day. Since each prediction set was collected over 2 days, there were 44 buffer spectra available for use as backgrounds with each prediction data set. Computations. 
For both data sets, interferograms were converted to single-beam spectra with a point spacing of 1.93 cm-1 by Fourier processing with one level of zero-filling, triangular apodization, and Mertz phase correction. All further computations were performed under MATLAB (version 6.5, The MathWorks, Inc., Natick, MA) running on a Dell Precision 670 workstation (Dell Computer Corp., Austin, TX) operating under Red Hat Enterprise Linux WS (Red Hat, Inc., Raleigh, NC) or a Dell Dimension 4400 computer (Dell Computer, Austin, TX) operating under Microsoft Windows XP (Microsoft Inc., Redmond, WA). The optimization of the Gaussian basis functions was implemented with a public-domain MATLAB toolbox based on genetic algorithms (GAs).14

RESULTS AND DISCUSSION

Characterization of Data. Figure 2A plots the absorptivity spectra (L/mM · mm) for the aqueous species in data sets A and B, including water.5 Although each species has a distinct spectrum, there is tremendous spectral overlap, particularly when all the species are considered together as is the case for data set B. Figure 2B plots the net analyte signal (NAS) for glucose with respect to the sample matrixes of data sets A (solid line) and B (dashed line). Here, the NAS is that defined by Lorber15 as the part of the pure-component analyte spectrum that is distinct from (i.e., mathematically orthogonal to) the spectra of the other constituents (14) Houck, C.; Joines, J.; Kay, M. NCSU-IE Technical Report 95-09; North Carolina State University: Raleigh, NC, 1995. (15) Lorber, A. Anal. Chem. 1986, 58, 1167–1172.
Figure 2. (A) Pure-component aqueous absorptivity spectra (L/mM · mm) for the chemical constituents of data sets A and B at 37 °C. The plots numbered 1-7 correspond to triacetin, ascorbate, glucose, alanine, urea, lactate, and water. (B) Net analyte signal of glucose in data sets A (solid line) and B (dashed line). The greater magnitude of the glucose net analyte signal in data set A reflects the lower degree of spectral overlap in the corresponding sample matrix.
in the sample matrix. The larger NAS corresponding to data set A indicates the lower degree of spectral overlap in this sample matrix. The degree of overlap can be quantified by Lorber's selectivity (s) measure:

s = ||NAS|| / ||a||    (1)
In eq 1, the selectivity is defined as the ratio of the magnitude of the NAS to the magnitude of the pure-component spectrum of the analyte, indicated here in vector notation as a. The value of s varies from 0 to 1, where 1 indicates the analyte is completely selective relative to the background matrix. Over the 4200-4800 cm-1 spectral range, the selectivity values for data sets A and B were 0.23 and 0.051, respectively. These values reflect the chemical components of the two sample matrixes as represented by their absorptivity spectra in Figure 2A. Other sources of spectral variance such as the effect of changes in sample temperature were not included in the calculation. These results suggest that both data sets are demanding from the standpoint of spectral overlap, with data set B being more challenging than data set A.
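The NAS and selectivity of eq 1 can be computed by projecting the analyte spectrum onto the orthogonal complement of the space spanned by the interferent spectra. The sketch below assumes the spectra are available as NumPy vectors; the names are illustrative.

```python
import numpy as np

def selectivity(analyte, interferents):
    """Lorber's selectivity (eq 1): the fraction of the pure-component
    analyte spectrum that is orthogonal to the interferent spectra.

    analyte:      (p,) pure-component spectrum of the analyte
    interferents: (p, m) matrix whose columns are interferent spectra
    """
    # Projector onto the space spanned by the interferent spectra
    proj = interferents @ np.linalg.pinv(interferents)
    # Net analyte signal: the part of the analyte spectrum orthogonal
    # to the interferent subspace
    nas = analyte - proj @ analyte
    return np.linalg.norm(nas) / np.linalg.norm(analyte)
```

An analyte spectrum orthogonal to every interferent gives s = 1; one lying entirely within the interferent subspace gives s = 0.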
Overview of Analysis Strategy. The goal of this study was to develop a general strategy for extracting information from spectra through the use of a set of Gaussian basis functions. Once obtained, the function values were used as independent variables in the construction of calibration models. Gaussian functions are defined by two parameters, µ and σ, corresponding to the mean and standard deviation, respectively. Drawing the analogy to an optical filter, µ corresponds to the center wavelength and 2.354σ specifies the full width at half-maximum (fwhm) of the filter band-pass. The Gaussian functions were scaled to a maximum of 1.0 to mimic an optical transmittance of 1.0 at the center wavelength. In this work, the specific basis functions were defined through the use of a numerical optimization procedure based on GAs. Genetic Algorithm Architecture. A GA uses a "chromosome" consisting of a collection of parameter values to be optimized. Taken together, this parameter set represents a candidate solution to the optimization problem. In this study, a chromosome consisted of 2k elements corresponding to k Gaussian basis functions, each of which was specified by a center position and fwhm value. For example, the first and second elements of the chromosome represented the fwhm and the center position of the first basis function, whereas the (2k - 1)th and 2kth elements represented the fwhm and the center position of the kth basis function. The parameter representing the fwhm of the Gaussian function was assigned an integer value from 1 to 19, corresponding to a range of 24-240 cm-1 in steps of 12 cm-1. Analogously, the parameter representing the center position was assigned an integer value from 1 to 54, corresponding to 4288-4712 cm-1 in steps of 8 cm-1. In the first step of the GA optimization, an initial population consisting of 60 such chromosomes was generated by assigning a uniformly distributed random number to each parameter.
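A Gaussian basis function of this form, together with the integer coding of the chromosome parameters described above, can be sketched as follows. The decode helper is a hypothetical illustration of the stated parameter ranges, not code from the original work.

```python
import numpy as np

def gaussian_filter(wavenumbers, center, fwhm):
    """Gaussian basis function scaled to a maximum of 1.0, mimicking an
    optical filter with unit transmittance at its center wavelength."""
    sigma = fwhm / 2.354  # fwhm = 2.354 * sigma, as in the text
    return np.exp(-0.5 * ((wavenumbers - center) / sigma) ** 2)

def decode(fwhm_code, center_code):
    """Hypothetical decoding of the integer genes: fwhm codes 1-19 map to
    24-240 cm-1 in steps of 12; center codes 1-54 map to 4288-4712 cm-1
    in steps of 8."""
    return 24 + 12 * (fwhm_code - 1), 4288 + 8 * (center_code - 1)
```

For example, the extreme codes decode to the stated limits: decode(1, 1) gives (24, 4288) and decode(19, 54) gives (240, 4712).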
The GA operates by drawing an analogy to the biological concepts of survival of the fittest and natural selection. The chromosomes are evaluated on the basis of their fitness (i.e., their relative success as solutions to the optimization problem), and the fittest chromosomes are selected as parents and allowed to recombine to produce a new generation of child chromosomes with (hopefully) better fitness. These steps represent one iteration (generation) of the GA. The fitness of each chromosome was evaluated by using the corresponding set of k basis functions to build a calibration model and assessing the performance of that model. To evaluate each basis function, the Gaussian was generated at the same point spacing as the collected spectra, the two "spectra" were multiplied point-by-point, and the resulting trace was integrated with the trapezoidal rule.16 The numerical integration was performed to maintain the analogy to the integrating character of an optical filter, although simply taking the inner product of the two spectral vectors would produce an analogous result. Independent variables for the calibration model were formed as values of the basis functions relative to a background response: di,j = (si,j - bj)/bj
(2)
(16) Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; Vetterling, W. T. Numerical Recipes: The Art of Scientific Computing; Cambridge University Press: Cambridge, U.K., 1986.
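The evaluation of a basis function against a spectrum, and the differential response of eq 2, can be sketched as follows (a minimal sketch with illustrative names; the trapezoidal rule is implemented directly):

```python
import numpy as np

def filter_response(spectrum, basis, wavenumbers):
    """Integrate the point-by-point product of a single-beam spectrum and
    a Gaussian basis function with the trapezoidal rule, mimicking the
    integrating character of an optical filter."""
    y = spectrum * basis
    dx = np.diff(wavenumbers)
    return float(np.sum(dx * (y[:-1] + y[1:]) / 2.0))

def differential_response(sample, background, basis, wavenumbers):
    """Eq 2: d = (s - b)/b, where s and b are the basis-function values
    for a sample spectrum and a background (buffer) spectrum."""
    s = filter_response(sample, basis, wavenumbers)
    b = filter_response(background, basis, wavenumbers)
    return (s - b) / b
```

A sample spectrum that is everywhere twice the background, for instance, yields a differential response of exactly 1.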
In eq 2, si,j and bj represent the values of the jth of k basis functions for the ith sample and a background spectrum, respectively. Spectra of phosphate buffer collected in conjunction with the sample spectra were used to define the background response in eq 2. A calibration model was then formed on the basis of the di,j: ci = r0 + r1di,1 + r2di,2 + ... + rkdi,k + ei
(3)
In eq 3, ci is the known analyte concentration for sample i, the k di,j terms are the differential responses as defined in eq 2, the rj terms are coefficients obtained from a multiple linear regression analysis of a set of calibration spectra, and ei is the residual or error about the model for sample i. The fitness (R) of the chromosome was then evaluated as

R = [(SEC + SEM + |SEC - SEM|)/2]^-1    (4)

SEC = [Σ(i=1 to nc) (ci - ĉi)^2 / (nc - k - 1)]^(1/2)    (5)

SEM = [Σ(i=1 to nm) (ci - ĉi)^2 / nm]^(1/2)    (6)
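Taken together, eqs 3-6 amount to fitting a multiple linear regression on the differential responses and scoring the resulting model on both the calibration and monitoring sets. A minimal sketch, with illustrative names:

```python
import numpy as np

def fit_mlr(D, c):
    """Eq 3: multiple linear regression of concentration c on the k
    differential responses in D (n x k); returns intercept + coefficients."""
    A = np.column_stack([np.ones(len(c)), D])
    coef, *_ = np.linalg.lstsq(A, c, rcond=None)
    return coef

def predict(coef, D):
    return coef[0] + D @ coef[1:]

def fitness(D_cal, c_cal, D_mon, c_mon, k):
    """Eqs 4-6: reciprocal of the average of SEC and SEM plus the
    |SEC - SEM| penalty for models that perform differently on the
    calibration and monitoring sets."""
    coef = fit_mlr(D_cal, c_cal)
    res_cal = c_cal - predict(coef, D_cal)
    res_mon = c_mon - predict(coef, D_mon)
    sec = np.sqrt(np.sum(res_cal ** 2) / (len(c_cal) - k - 1))
    sem = np.sqrt(np.sum(res_mon ** 2) / len(c_mon))
    return 1.0 / ((sec + sem + abs(sec - sem)) / 2.0)
```

Note that (SEC + SEM + |SEC - SEM|)/2 is algebraically equal to max(SEC, SEM), so the fitness scores each chromosome by its worse-performing data set.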
In eq 4, SEC and SEM denote the standard error of calibration and standard error of monitoring, respectively. These values are defined in eqs 5 and 6, where nc and nm specify the number of spectra in the calibration and monitoring data sets, respectively, and ĉi is the concentration for sample i predicted by the calibration model. Chromosomes that are more fit (i.e., produce a better performing calibration model) have a larger value of R than those that are less fit. The |SEC - SEM| term was included to penalize models that exhibit different performance with the calibration and monitoring sets. The production of a new generation of chromosomes was initiated by applying tournament selection. In this approach, two chromosomes were randomly chosen from the population and the chromosome with the larger fitness of the two was selected into the pool of parent chromosomes. In this study, the tournament was repeated 60 times to fill the pool of 60 parent chromosomes. Child chromosomes were then generated from the pool of parent chromosomes through a mating process. In this study, arithmetic crossover was applied nine times first and then single-point crossover was applied nine times, replacing 36 parent chromosomes with 36 child chromosomes. Arithmetic crossover produces two child chromosomes, C1 and C2, by taking two complementary linear combinations of the parent chromosomes, P1 and P2, as follows: C1 = aP1 + (1 - a)P2
(7)
C2 = (1 - a)P1 + aP2
(8)
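The two crossover operators can be sketched as follows (illustrative names; in the integer-coded chromosomes of this study, the fractional values produced by arithmetic crossover would be mapped back to valid integer codes):

```python
import numpy as np

rng = np.random.default_rng()

def arithmetic_crossover(p1, p2):
    """Eqs 7 and 8: two complementary linear combinations of the parents,
    with a drawn from a uniform distribution on [0, 1)."""
    a = rng.uniform()
    return a * p1 + (1 - a) * p2, (1 - a) * p1 + a * p2

def single_point_crossover(p1, p2):
    """Exchange all elements of the parents whose indices are greater
    than or equal to a randomly selected cut point."""
    cut = int(rng.integers(1, len(p1)))
    c1 = np.concatenate([p1[:cut], p2[cut:]])
    c2 = np.concatenate([p2[:cut], p1[cut:]])
    return c1, c2
```

As the text notes, arithmetic crossover introduces new parameter values (the children sum to the parents but differ element-wise), whereas single-point crossover only rearranges existing values.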
In eqs 7 and 8, a represents a random number taken from a uniform distribution between 0 and 1. This crossover operation introduces new parameter values into the child chromosomes. Single-point crossover simply exchanges all the elements of the parent chromosomes whose indices are greater than a randomly selected number. In contrast with arithmetic crossover, single-point crossover does not introduce new parameter values into the child chromosomes. Since parent chromosomes were randomly selected to participate in arithmetic or single-point crossover, at least 60 - 36 = 24 chromosomes in the pool of parents were not changed by the mating process. The last step involved mutation of the unchanged parent chromosomes and the child chromosomes generated by the mating process. In the GA implementation in this study, the mutation process replaced an element of a chromosome with a random number generated from a uniform distribution over the possible values of the element. Both the chromosome and the element within it were randomly selected. The rate of mutation was fixed at 5%. For a population of 60 chromosomes each of which consisted of 2k elements, the mutation process was repeated 6k times. At the end of the mutation process, the chromosome with the lowest fitness was replaced with the one with the highest fitness found to that point to ensure that the chromosome with the highest fitness was always selected into the next generation. The cycle of the evaluation, selection, crossover, and mutation processes was then repeated over 150 generations. The chromosome with the highest fitness at the 150th generation specified the optimal parameters for the Gaussian basis functions. For each of the k basis functions selected by the GA as optimal, partial F-values were computed to evaluate the significance of the function in modeling glucose concentrations.
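The partial F-value for the jth basis function compares the residual sum of squares of the full model with that of the model refit without the jth variable. A sketch under the assumption that the critical value is looked up separately from tabulated F(1, n - k - 1) values; names are illustrative:

```python
import numpy as np

def partial_f_values(D, c):
    """Partial F-value for each of the k columns of D: the increase in the
    residual sum of squares when the variable is removed, divided by the
    residual mean square of the full model."""
    n, k = D.shape

    def rss(X):
        A = np.column_stack([np.ones(n), X])
        coef, *_ = np.linalg.lstsq(A, c, rcond=None)
        r = c - A @ coef
        return float(r @ r)

    rss_full = rss(D)
    ms_full = rss_full / (n - k - 1)
    fvals = []
    for j in range(k):
        reduced = np.delete(D, j, axis=1)  # refit without variable j
        fvals.append((rss(reduced) - rss_full) / ms_full)
    return np.array(fvals)
```

Each value would then be compared against the tabulated critical F-value at the 99% significance level, and functions falling below it dropped from the model.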
A basis function was removed from the calibration model if its partial F-value did not exceed the tabulated F-value at a significance level of 99%. This step was used to remove variables from the calibration model that were not accounting for a statistically significant amount of the variance in the analyte concentrations. Finally, the GA optimization was repeated 100 times with different initial populations to help ensure a global optimum was found. At the end of this process, the k basis functions that produced the overall best fitness score were selected to form the final calibration model. This model was then tested by application to the prediction data that had been withheld from all of the previous model optimization steps. Analysis of Data Set A. Separate executions of the GA were performed for k = 1-12, producing 12 models for consideration. The corresponding SEM values were then compared to choose a final model for application to the two prediction sets. Figure 3 shows the value of the fitness function, as well as the SEM as a function of k. Though the largest fitness value was observed with 12 basis functions, the fitness value did not change significantly after k = 9. The minimum SEM of 0.17 mM was observed with 9 and 12 basis functions. Values of SEM obtained with eight or fewer functions were compared with the minimum SEM. On the basis of an F-test at a significance level of 95%, an SEM of 0.20 mM obtained with six basis functions was found not significantly different from the minimum SEM. This model was chosen for further study, although consideration of the reported standard
Figure 3. Values of the fitness function (eq 4) and SEM (eq 6) as a function of the number of basis functions (k) for data set A. Circles represent the values of the fitness function, while triangles denote the SEM values.
Figure 4. Six Gaussian basis functions that defined the optimal model for data set A (solid lines). The dashed line plots a single-beam spectrum of phosphate buffer. The K-band interference filter used in the data collection restricts the collected spectrum to the region of 4000-5000 cm-1. The strong absorption bands of water near 3800 and 5200 cm-1 further reduce the transmitted light intensity to the range of 4150-4850 cm-1.
error in the reference concentrations of 0.28 mM could be used to support the selection of the five-term model (SEM = 0.26 mM). The clear break in the plot of SEM versus k in Figure 3 at six basis functions was persuasive in making our selection. The corresponding value of SEC for the six-term model was 0.21 mM. Figure 4 shows the basis functions selected along with a scaled single-beam spectrum of the background buffer solution. The six basis functions are highly overlapped, three being relatively broad and three being quite narrow. From a perspective of filter photometry, a combination of optical filters of this type would not normally be used to implement an analysis. This result is quite interesting from that standpoint. The calibration model was next applied to the prediction of glucose concentrations in the two sets of prediction data. As noted previously, the prediction sets were acquired as many as 51 days after the end of the collection of the calibration data. For prediction sets 1 and 2, values of the standard error of prediction (SEP) were 0.39 and 0.42 mM. The calculation of SEP is identical to that of SEM in eq 6. Although these values are both higher than the SEC of the model, they still represent good performance over the glucose concentration range of 0.8-16.2 mM. Analysis of Data Set B. With the encouraging results from the analysis of data set A, work was next undertaken to address
Figure 5. Concentration correlation plot for data set B obtained with the optimal calibration model computed with 10 basis functions. Upper triangles represent the calibration and monitoring data, while circles represent the prediction data in prediction set 1.
the significantly more challenging glucose calibration in data set B. As noted previously, the calibration samples collected on the last day for data set B were used as the monitoring set for this experiment. Accordingly, there were 40 samples (120 spectra including replicates) and 24 samples (72 spectra including replicates) in the calibration and monitoring sets, respectively. In computing the differential responses in eq 2, the mean of the last six buffer spectra collected each day was used to compute the bj. The GA was executed separately for values of k from 6 to 16. To determine the best model size, SEM values (eq 6) calculated from the corresponding calibration models were compared. The smallest SEM of 0.34 mM was observed with k = 13, although the SEM values did not change greatly after k = 9. As with data set A, an F-test at the 95% level was used to compare SEM values to the minimum SEM. On the basis of this comparison, the model for k = 10 (SEM = 0.41 mM) was selected for further study. The optimal calibration model was then applied to the prediction data. Figure 5 shows a concentration correlation plot of the calibration and monitoring sets (triangles) and the first prediction set (circles). The SEC value for this model was 0.40 mM, whereas the value of SEP for the first prediction set was 0.50 mM. To evaluate the prediction performance of the calibration model over time, prediction data sets collected over 4 months were used. As described previously, nine prediction data sets were collected approximately every 2 weeks after the calibration set. The glucose concentrations in these prediction sets were predicted with the calibration model, and the resulting SEP values are plotted in Figure 6. Inspection of Figure 6 shows good performance in prediction, although some degradation is observed with time. Prediction errors slowly increase, although they remain below 1.0 mM until prediction set 9 (SEP = 1.17 mM).
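The model-size selection described above can be sketched as follows. Because eq 6 is not reproduced in this section, the SEM is assumed here to be the root-mean-square error over the monitoring set, and both F-test degrees of freedom are taken as the number of monitoring spectra; both choices are assumptions for illustration:

```python
import numpy as np
from scipy import stats

def sem(c_true, c_pred):
    """Assumed form of eq 6: RMS error over the monitoring set."""
    c_true, c_pred = np.asarray(c_true), np.asarray(c_pred)
    return np.sqrt(np.mean((c_true - c_pred) ** 2))

def select_k(sem_by_k, n_monitor, alpha=0.05):
    """Smallest model size whose SEM is not significantly worse than the
    minimum SEM, judged by an F-test on the ratio of squared errors."""
    sem_min = min(sem_by_k.values())
    f_crit = stats.f.ppf(1.0 - alpha, n_monitor, n_monitor)
    for k in sorted(sem_by_k):
        if (sem_by_k[k] / sem_min) ** 2 <= f_crit:
            return k
```

With hypothetical SEM values, a model larger than the global minimizer is rejected only when its error ratio exceeds the 95% F critical value, which is how a more parsimonious k (here k = 10 rather than k = 13) can be retained.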
Overall, this is excellent performance when the long time frame is considered. Figure 7 depicts the set of 10 basis functions selected by the GA. As observed previously with data set A, a combination of broad and narrow bands is seen, with significant overlap among the basis functions. The center positions of the Gaussians are concentrated in the range of 4300-4500 cm-1, a region in which glucose has a characteristic C-H combination band at 4400 cm-1 and several of the other compounds have significant features (see Figure 2A).

Comparison to Conventional PLS Calibrations.

To put the results presented previously in context, the calibration data were
Table 1. Summary of Prediction Results with Different Models for Data Set B

                    SEC or SEP (mM)
cal/prd no.       PLS(a)    Gaussian(b)
calibration        0.43       0.40
prediction 1       0.39       0.50
prediction 2       0.71       0.54
prediction 3       0.86       0.46
prediction 4       0.94       0.55
prediction 5       1.44       0.88
prediction 6       1.39       0.63
prediction 7       1.58       0.62
prediction 8       1.99       0.92
prediction 9       2.30       1.17

(a) PLS model (absorbance spectra) with 12 factors and spectral range of 4250-4500 cm-1. (b) Model based on 10 Gaussian basis functions optimized by the GA.

Figure 6. Values of SEP for all prediction sets in data set B.

Figure 7. The 10 Gaussian basis functions determined by the GA for data set B.

also used to build conventional PLS models. Models were constructed with spectra in absorbance units, with the buffer spectra used as backgrounds. Optimization of these models considered the spectral range submitted to the PLS algorithm, as well as the number of latent variables employed in the model. With data set A, contiguous spectral regions of 200, 300, 400, 500, and 600 cm-1 were investigated, with each range size stepped across the region of 4224-4999 cm-1 in steps of 25 cm-1. Consideration of different spectral ranges helped to ensure that uninformative spectral points were not included in the PLS calculation. Models with 5-20 latent variables were evaluated. The calibration and monitoring data were used in conjunction with eq 4 to determine the best range and model size. Once the range was selected, an F-test at the 95% level was used in conjunction with the SEM values to finalize the optimal number of latent variables. This procedure produced a calibration model based on nine PLS latent variables over the 4349-4749 cm-1 range. The SEC and SEM values were 0.21 and 0.20 mM, respectively. When applied to the two prediction sets, SEP values of 0.52 and 0.50 mM were obtained. These SEP values are slightly higher than those obtained with the Gaussian basis functions. With data set B, a similar optimization procedure was performed. Absorbance data were again used. Contiguous spectral ranges were investigated with widths from 200 to 600 cm-1 in steps of 50 cm-1. For each range size, starting points from 4150
to 4850 cm-1 were studied in steps of 50 cm-1. For each range studied, the number of latent variables was varied from 5 to 20. For this data set, models were ranked on the basis of SEM only, and the F-test at the 95% level was again used to finalize the number of latent variables once the optimal spectral range was established. The optimal model was based on 12 PLS latent variables computed from the 4250-4500 cm-1 spectral range. The regression coefficients that define this model were discussed previously and plotted in Figure 1. Table 1 summarizes the SEP values obtained for the nine prediction sets in data set B, together with the corresponding results from the model based on Gaussian basis functions. Inspection of the table reveals that the model based on the Gaussian functions clearly outperformed the conventional PLS model, especially with respect to robustness over time.

Comparison of Regression Coefficient Vectors.

Another way to compare the calibration models formed with Gaussian basis functions to the corresponding PLS models is to compute an overall p × 1 vector of regression coefficients, v, analogous to that displayed in Figure 1. For an input 1 × p single-beam spectrum, x_i, whose glucose concentration, c_i, is to be predicted with a previously computed model, we need v such that

$$ c_i = \mathbf{x}_i \mathbf{v} + c_{\mathrm{offset}} \qquad (9) $$

where c_offset is a constant offset that reflects the presence of an intercept term in eq 3. As shown in eq 3, the original model was formulated on the basis of a (k + 1) × 1 vector of regression coefficients, r, that related the concentrations of the calibration samples to the corresponding differential responses, d_{i,j}. For sample i, the k values of d_{i,j} were obtained from the corresponding values of s_{i,j} and b_j, as shown in eq 2. For the time period over which a given p-dimensional background spectrum, x_ref, is used to compute the b_j, we can combine eqs 2 and 3 to yield

$$ c_i = r_0 + \sum_{j=1}^{k} \left( \frac{r_j}{b_j} s_{i,j} - r_j \right) = r_0' + \sum_{j=1}^{k} \frac{r_j}{b_j} s_{i,j} \qquad (10) $$

where

$$ r_0' = r_0 - \sum_{j=1}^{k} r_j \qquad (11) $$

If we compute the values of s_{i,j} and b_j as the inner products of x_i and x_ref, respectively, with the jth of k p-dimensional Gaussian functions, g_j, eq 10 is expanded to

$$ c_i = r_0' + \sum_{j=1}^{k} \left( \frac{r_j}{\sum_{q=1}^{p} x_{\mathrm{ref},q}\, g_{j,q}} \right) \sum_{m=1}^{p} x_{i,m}\, g_{j,m} = r_0' + \sum_{j=1}^{k} \sum_{m=1}^{p} z_{j,m}\, x_{i,m} \qquad (12) $$

where

$$ z_{j,m} = \frac{r_j\, g_{j,m}}{\sum_{q=1}^{p} x_{\mathrm{ref},q}\, g_{j,q}} \qquad (13) $$

Taking advantage of the distributive property of multiplication over addition, eq 12 can be rewritten as

$$ c_i = r_0' + \sum_{m=1}^{p} \left( \sum_{j=1}^{k} z_{j,m} \right) x_{i,m} = r_0' + \sum_{m=1}^{p} v_m\, x_{i,m} \qquad (14) $$

Comparison of eqs 9 and 14 reveals that

$$ c_{\mathrm{offset}} = r_0' = r_0 - \sum_{j=1}^{k} r_j \qquad (15) $$

and reincorporation of eq 13 into eq 14 leads to the elements of v in eq 9:

$$ v_m = \sum_{j=1}^{k} \frac{r_j\, g_{j,m}}{\sum_{q=1}^{p} x_{\mathrm{ref},q}\, g_{j,q}} \qquad (16) $$
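Under the assumption that eq 2 defines the differential response as d_{i,j} = s_{i,j}/b_j − 1 (the form consistent with eq 10), the equivalence of the two-step filter model and the collapsed one-step vector v can be checked numerically. All dimensions, centers, widths, coefficients, and spectra below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 500, 10                        # spectral points, basis functions (hypothetical)
wn = np.linspace(4150.0, 4850.0, p)   # wavenumber grid, cm-1

# Gaussian basis functions g_j (k x p); centers and widths are illustrative
centers = rng.uniform(4300.0, 4500.0, k)
widths = rng.uniform(10.0, 100.0, k)
G = np.exp(-(((wn[None, :] - centers[:, None]) / widths[:, None]) ** 2))

r0 = rng.normal()                     # intercept of eq 3
r = rng.normal(size=k)                # regression coefficients of eq 3
x_ref = 1.0 + 0.1 * rng.random(p)     # background single-beam spectrum
x = 1.0 + 0.1 * rng.random(p)         # sample single-beam spectrum

# Two-step prediction: filter responses, then assumed differential responses
s = G @ x                             # s_j = x . g_j
b = G @ x_ref                         # b_j = x_ref . g_j
c_two_step = r0 + r @ (s / b - 1.0)   # eq 3 with d_j = s_j / b_j - 1

# One-step prediction via eqs 15 and 16
v = (r / b) @ G                       # v_m = sum_j r_j g_{j,m} / (x_ref . g_j)
c_offset = r0 - r.sum()               # eq 15
c_one_step = x @ v + c_offset         # eq 9

assert np.isclose(c_one_step, c_two_step)
```

The broadcast `(r / b) @ G` implements eq 16 in a single matrix product, which is why the filter-domain model can be inspected as an ordinary regression coefficient vector over wavenumber.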
Employing eq 16, Figure 8A plots the vector of regression coefficients obtained from the calibration model for data set B based on 10 Gaussian basis functions. In Figure 8B, the region of 4250-4500 cm-1 is overlaid with the vector of PLS regression coefficients from Figure 1. The two sets of regression coefficients in Figure 8B cannot be compared directly because the PLS coefficients were computed from spectra in AU, whereas those from the Gaussian-based model are designed to be applied to single-beam spectra. However, two general points can be noted. First, the regression model derived from the Gaussian basis functions is free from extremely narrow features or features that appear artifactual. Compare, for example, the regions of 4250-4275 or 4450-4500 cm-1 in the two curves in Figure 8B. This may help to explain why the model based on the Gaussian functions exhibits better prediction performance over time than the PLS model. Second, it is clear now why the Gaussian “filters” displayed in Figure 7 have an unusual pattern of widths and a high degree of overlap. When combined through regression analysis to form the calibration model, they generate the higher-order function that
Figure 8. (A) Regression coefficient vector (eq 16) corresponding to the calibration model for data set B that employed 10 Gaussian basis functions. (B) Overlay of the region of 4250-4500 cm-1 in the regression coefficient vector from panel A (solid line and right y-axis) with the regression coefficients of the corresponding PLS model from Figure 1 (dashed line and left y-axis).
is represented by the regression coefficient vector in Figure 8A. This provides a unique way of looking at optimal filter selection in filter photometry. In addition, it is clear that a variety of functions other than Gaussians could be used in conjunction with eq 16 to formulate the calibration model.

CONCLUSIONS

The research presented in this paper demonstrates that effective multivariate calibration models can be constructed through the use of simple Gaussian basis functions determined through numerical optimization. The GA procedure employed here proved to be a workable strategy for optimizing the basis functions. The calibration models exhibited good performance in prediction and showed good robustness over time. For both data sets A and B, improved prediction performance was observed relative to conventional PLS models based on absorbance data. These results were especially significant with data set B, which offered a significantly more challenging calibration problem than data set A. Data set B also featured a more demanding prediction test with respect to the time span between the calibration and prediction data. A key advantage of the calibration method based on Gaussian basis functions is the analogy to optical filtering offered by the approach. The potential to implement the calibration model directly in the spectrometer hardware is an attractive feature that could offer key advantages in designing optical sensors dedicated to a target application. The results presented here also demonstrate that the optimal filter set for a given application does not simply mimic a low-resolution spectrometer. The presence of both broad and narrow filter band-pass functions, as well as the high degree of overlap among these functions, suggests that the optimal filter set must be determined through an optimization procedure of the type employed in this work. Examination of the regression coefficients computed with eq 16 revealed that these overlapping filters of different widths are required in order to approximate the overall higher-order function that forms the calibration model.

ACKNOWLEDGMENT

This research was supported by the National Institutes of Health under Grant DK067445.

Received for review September 23, 2008. Accepted December 22, 2008.