Pure Component Selectivity Analysis of Multivariate Calibration

Mar 27, 2004 - Mark A. Arnold,*,† Gary W. Small,‡ Dong Xiang,† Jiang Qui,§ and David W. Murhammer§. Optical Science and Technology Center and ...
Anal. Chem. 2004, 76, 2583-2590

Pure Component Selectivity Analysis of Multivariate Calibration Models from Near-Infrared Spectra

Mark A. Arnold,*,† Gary W. Small,‡ Dong Xiang,† Jiang Qui,§ and David W. Murhammer§

Optical Science and Technology Center and Department of Chemistry, and Department of Chemical and Biochemical Engineering, University of Iowa, Iowa City, Iowa 52242, and Center for Intelligent Chemical Instrumentation and Department of Chemistry and Biochemistry, Clippinger Laboratories, Ohio University, Athens, Ohio 45701

A novel procedure is proposed to characterize the chemical basis of selectivity for multivariate calibration models. This procedure involves submitting pure component spectra of both the target analyte and suspected interferences to the calibration model in question. The resulting model output is analyzed and interpreted in terms of the relative contribution of each component to the predicted analyte concentration. The utility of this method is illustrated by an analysis of calibration models for glucose, sucrose, and maltose. Near-infrared spectra are collected over the 5000-4000-cm-1 spectral range for a set of ternary mixtures of these sugars. Partial least-squares (PLS) calibration models are generated for each component, and these models provide selective responses for the targeted analytes with standard errors of prediction ranging from 0.2 to 0.7 mM over the concentration range of 0.5-50 mM. The concept of the proposed pure component selectivity analysis is illustrated with these models. Results indicate that the net analyte signal is solely responsible for the selectivity of each individual model. Despite strong spectral overlap for these simple carbohydrates, calibration models based on the PLS algorithm provide sufficient selectivity to distinguish these commonly used sugars. The proposed procedure demonstrates conclusively that no component of the sucrose or maltose spectrum contributes to the selective measurement of glucose. Analogous conclusions are possible for the sucrose and maltose calibration models.

* Corresponding author.
† Optical Science and Technology Center and Department of Chemistry, University of Iowa.
‡ Ohio University.
§ Department of Chemical and Biochemical Engineering, University of Iowa.
10.1021/ac035516q CCC: $27.50 © 2004 American Chemical Society. Published on Web 03/27/2004. Analytical Chemistry, Vol. 76, No. 9, May 1, 2004.

Near-infrared absorption spectra are characterized by weak, broad, and highly overlapping absorption bands corresponding to combination and overtone vibrations associated with C-H, N-H, and O-H functional groups. Multivariate calibration models are frequently required to achieve selective analytical measurements from such spectra (1-3). Generally, such models are statistical in nature, where mathematical algorithms are used to find the best combination of weighted wavelengths for distinguishing the selected analyte concentration from an array of chemical constituents within the sample matrix. The calibration process involves collecting a series of spectra from a set of representative samples, where these samples exhibit the full range of variations for all spectroscopically sensitive chemical and physical parameters expected for the eventual unknown samples.

Various regression (4, 5) and artificial neural network (6-8) algorithms are available to correlate spectral variations within the set of calibration spectra to the corresponding variations in the analyte concentration matrix. One particularly effective, and hence popular, algorithm is partial least-squares (PLS) regression. In PLS regression, the multitude of spectral dimensions is reduced to a few orthogonal latent variables, or factors. These factors are selected to maximize the correlation between spectral variance and analyte concentration (9-11).

A major limitation of statistical-based calibration methods is the propensity to correlate spectral variations that exist within the calibration spectra but do not originate from the targeted analyte. This problem can manifest itself in many ways, depending on the type of sample, the experimental design used to collect the calibration data, and the conditions of the underlying spectroscopic measurements. Correlations between analyte concentration and the concentrations of one or more solutes within the chemical matrix represent one manifestation of this limitation. Such concentration correlations might be naturally occurring or might be introduced by an improper experimental design. An example is the natural correlation between glucose and lactate concentrations in cell culture media during batch cultivations (12-15). Cellular metabolism results in the consumption of glucose and the generation of lactate, which establishes an inverse correlation between these solutes. The use of untreated samples collected from batch cultivations as calibration standards will result in calibration models for glucose that depend on the concentration of lactate. This lactate dependency is created because the spectral features of lactate will correlate with glucose concentration through this metabolism-induced concentration correlation. Accurate glucose measurements are possible despite this correlation as long as the glucose-lactate concentration relationship in the training data set is identical to that in subsequent unknown samples. Systematic errors will result, however, if conditions change and this concentration relationship is not maintained. In general, such analyte-solute dependencies are undesirable and must be avoided, or eliminated, to maximize the robustness of the calibration model.

The ability to characterize the selectivity of multivariate calibration models is critical before such models are considered ready for practical analytical applications. Clearly, a firm understanding of the chemical basis of model selectivity is required before the suitability and reliability of such models can be evaluated for intended applications. Historically, PLS and other multivariate calibration techniques have been widely used as a “black box” with little regard for the chemical basis of measurement selectivity. In part, this black box treatment has been accepted because methods or tools for characterizing model selectivity have not come into widespread use. Recent studies by several groups have sought to address this issue through the development of computational methods for assessing the selectivity of PLS models (16-18).

(1) Boysworth, M. K.; Booksh, K. S. In Handbook of Near-Infrared Analysis, 2nd ed.; Burns, D. B., Ciurczak, E. W., Eds.; Practical Spectroscopy Series 27; Dekker: New York, 2001; pp 209-240.
(2) Hall, J. W.; Pollard, A. Clin. Chem. 1992, 38, 1623-1631.
(3) Heise, M.; Bittner, A.; Marbach, R. J. Near Infrared Spectrosc. 1998, 6, 349-359.
(4) Tan, H.; Brown, S. D. Anal. Chim. Acta 2003, 490, 291-301.
(5) Brereton, R. G. Analyst 2000, 125, 2125-2154.
(6) Almeida, J. S. Curr. Opin. Biotechnol. 2002, 13, 72-76.
(7) Peterson, K. L. Rev. Comp. Chem. 2000, 16, 53-140.
(8) Cirovic, D. A. Trends Anal. Chem. 1997, 16, 148-155.
(9) Bjorsvik, H. R.; Martens, H. Handbook of Near-Infrared Analysis, 2nd ed.; Burns, D. B., Ciurczak, E. W., Eds.; Practical Spectroscopy Series 27; Dekker: New York, 2001; pp 185-207.
(10) Wold, S.; Sjöström, M.; Eriksson, L. Chemom. Intell. Lab. Syst. 2001, 58, 109-130.
(11) Haaland, D. M.; Thomas, E. V. Anal. Chem. 1988, 60, 1193-1202.
(12) Rhiel, M. H.; Amrhein, M. I.; Marison, I. W.; von Stockar, U. Anal. Chem. 2002, 74, 5227-5236.

Model performance is typically assessed by analyzing a series of samples not used to generate the calibration model.
Generally, these so-called validation or prediction samples are composed of essentially the same chemical matrix as the calibration standards. Validation measurements provide information on the accuracy and precision of the model and can be used to determine when the system is overmodeled with too many latent variables or when the model must be upgraded with new calibration spectra to account for changes in either the measurement conditions or the sample matrix. It is important to recognize, however, that validation measurements do not probe the chemical basis of selectivity but test whether the requisite spectral correlations are maintained between the calibration and validation data.

In some studies, measurement selectivity has been assessed by building calibration models from a set of “simplified” training standards, while judging model performance with more complex samples. An example is the measurement of individual sugars in fruit juices (19, 20). In these studies, PLS calibration models were generated with spectra collected from a series of aqueous standard solutions composed of ternary mixtures of glucose, fructose, and sucrose. The models were then tested on a set of actual fruit juice samples. Accurate measurements in this case demonstrated model selectivity for the targeted analyte over all components within the chemical matrix of the fruit juice sample. This approach is rarely applicable because it can only be used when the matrix components do not significantly contribute to the measured spectra. Fruit juices are primarily composed of water and the three sugars used in the calibration standards. All other molecular constituents are present at significantly lower concentrations, thereby permitting effective calibration in the absence of the full chemical matrix.

A simple experimental method is proposed here for assessing the selectivity of multivariate calibration models. This method consists of submitting pure component spectra to the calibration model and evaluating the model output. Selectivity is illustrated when the model predicts an analyte concentration of zero from pure component spectra of the potential interfering matrix constituents. Selectivity is also demonstrated when the correct analyte concentration is obtained from pure component analyte spectra of different concentrations. This report represents a first step toward development of this pure component selectivity analysis method. In this study, the proposed method is applied to individual PLS calibration models generated for glucose, sucrose, and maltose from a set of ternary mixtures. This experimental analysis of selectivity is confirmed by use of standard computational methods based on examining the regression coefficient vector of the PLS calibration model. This analysis establishes the selectivity of these models, thereby demonstrating the utility of the proposed methodology.

(13) Arnold, M. A.; Burmeister, J. J.; Small, G. W. Anal. Chem. 1998, 70, 1773-1781.
(14) McShane, M. J.; Coté, G. L. Appl. Spectrosc. 1998, 52, 1073-1078.
(15) Lewis, C. B.; McNichols, R. J.; Gowda, A.; Coté, G. L. Appl. Spectrosc. 2000, 54, 1453-1457.
(16) Lorber, A.; Faber, K.; Kowalski, B. R. Anal. Chem. 1997, 69, 1620-1626.
(17) Faber, N. M. Anal. Chem. 1998, 70, 5108-5110.
(18) Ferré, J.; Brown, S. D.; Rius, F. X. J. Chemom. 2001, 15, 537-553.
(19) Rambla, F. J.; Garrigues, S.; de la Guardia, M. Anal. Chim. Acta 1997, 344, 41-53.
(20) Rodriguez-Saona, L. E.; Fry, F. S.; McLaughlin, M. A.; Calvey, E. M. Carbohydr. Res. 2001, 336, 63-74.

EXPERIMENTAL SECTION

Apparatus and Reagents. All spectra were collected with a Nexus 670 Fourier transform spectrometer (Thermo-Nicolet, Madison, WI).
The spectrometer was equipped with a 20-W tungsten-halogen lamp source, calcium fluoride beam splitter, and indium antimonide (InSb) cryogenically cooled detector. A multilayer interference filter (Barr and Associates, Westford, MA) was positioned before the sample cell to confine the spectral range to 5000-4000 cm-1. Samples were placed in an aluminum-jacketed cell (Wilmad Glass, Buena, NJ) with sapphire windows (Meller Optics, Providence, RI). Teflon spacers were used to control a 1.5-mm optical path length. The sample temperature was maintained at 27.0 ± 0.1 °C with a circulating water bath (model 1167, VWR Scientific Products, Pittsburgh, PA). Glucose, maltose, potassium dihydrogen phosphate, and sodium hydrogen phosphate were purchased from Sigma Chemical Co. (St. Louis, MO). Sucrose was obtained from MCB Manufacturing Chemists, Inc. (Cincinnati, OH). All aqueous solutions were prepared with deionized water purified with a Milli-Q Reagent Water system (Millipore, Bedford, MA).

Procedures. All solutions were prepared in a phosphate buffer that consisted of 0.025 M KH2PO4 and 0.025 M Na2HPO4 adjusted to a final pH of 6.86. Fifty unique ternary mixtures were prepared with different concentrations of glucose, sucrose, and maltose. The concentration for each solute varied randomly over a concentration range of approximately 0.5-50 mM. Each solution was prepared by dissolving the necessary amount of dried powder in an appropriate volume of buffer.

Data Collection and Spectral Processing. Near-infrared spectra were collected as 128 coadded, double-sided interferograms.

Interferograms were collected with a gain of 8 and an aperture setting of 55. The Omnic software operating the spectrometer (version 5.1b, Thermo-Nicolet) was used to apply triangular apodization and Mertz phase correction to each interferogram and then to convert interferograms into single-beam spectra. No zero filling was implemented before Fourier processing. This procedure generated single-beam spectra with a 0.964-cm-1 point spacing. Triplicate spectra were collected for each sample after thermal equilibration. Background spectra originated from blank buffer and were collected at the beginning of each day and then following every fifth sample. A total of 180 single-beam spectra (150 for the 50 samples and 30 for the background solutions) were transferred to an Iris Indigo computer (Silicon Graphics, Inc., Mountain View, CA) for further processing. All calibration models were generated from spectra in absorbance units (AU), where the ratio was computed between each sample single-beam spectrum and the most recent background spectrum of buffer.

Sample Design and Model Development. Analyte-specific calibration models demand a sample design with minimal covariance between components. Many strategies are available for this purpose. One strategy uses a uniform design algorithm to maximize the variance across system components (21). Alternatively, a genetic algorithm can be used to minimize correlation between solute concentrations (22). Another possibility is to assign solute concentrations randomly over a specified concentration range. In all cases, significance of the corresponding correlation coefficients must be judged statistically (23). A random concentration sample design was used in this study to generate 50 ternary mixtures with uncorrelated concentrations of glucose, sucrose, and maltose. Values of the coefficient of determination (r2) were 0.0102, 0.0102, and 0.0001 for the glucose-sucrose, glucose-maltose, and sucrose-maltose solute combinations, respectively.
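The statistical judgment of such small correlations can be illustrated with a short calculation. The sketch below assumes a two-sided t-test on the Pearson correlation coefficient, which is one common choice; the text does not state which test was used. The function name and the hard-coded critical value (about 2.68 for 48 degrees of freedom at the 99% level) are assumptions introduced for illustration.

```python
import math

def correlation_t_statistic(r_squared: float, n_samples: int) -> float:
    """t statistic for testing whether a Pearson correlation differs from zero."""
    r = math.sqrt(r_squared)
    return r * math.sqrt((n_samples - 2) / (1.0 - r_squared))

# r2 values reported above for the 50-sample random design
r2_values = {
    "glucose-sucrose": 0.0102,
    "glucose-maltose": 0.0102,
    "sucrose-maltose": 0.0001,
}

# Assumed two-sided critical t value, 99% confidence, 48 degrees of freedom
T_CRITICAL_99 = 2.68

for pair, r2 in r2_values.items():
    t = correlation_t_statistic(r2, n_samples=50)
    print(f"{pair}: t = {t:.2f}, significant = {t > T_CRITICAL_99}")
```

Under these assumptions every t statistic falls well below the critical value, so none of the pairwise correlations would be judged significant.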
These correlations are not statistically different from zero at the 99% confidence level. The same analysis was performed for the various subsets of the data (see below for details), and no significant correlation was indicated for any of the solute concentration combinations. The exact concentration range for each solute was 0.71-49.32, 0.94-48.32, and 0.48-49.87 mM for glucose, sucrose, and maltose, respectively. The mean concentrations were 24.54, 22.89, and 25.60 mM for glucose, sucrose, and maltose, respectively.

The full set of data (150 absorbance spectra from 50 samples) was split into separate calibration and prediction data sets for the purpose of generating calibration models with PLS regression. All spectra associated with 10 randomly selected samples were assigned to the prediction data set, and the 120 spectra associated with the remaining 40 samples comprised the calibration data set. Concentration correlations were negligible for the samples within this calibration data set with coefficients of determination (r2) of 0.0324, 0.0099, and 0.0015 for glucose-sucrose, glucose-maltose, and sucrose-maltose, respectively. The calibration set was further divided into training and monitoring subsets for optimization purposes. All spectra associated with 10 randomly selected samples were placed in the monitoring set, and the remaining 30 samples were used for the training set. The optimum number of model factors (latent variables) and spectral range were determined by comparing values for the standard error of calibration (SEC) and standard error of monitoring as described elsewhere (24, 25). As in our earlier work (25), results from three unique splittings of the calibration data were used to establish the ideal spectral range and number of factors. Once these parameters were established, the monitoring and training data were recombined and the final calibration model was determined from a PLS analysis of the full calibration data set. Spectra in the prediction data set were analyzed with the resulting calibration model. Model performance was assessed by computing the SEC for the full calibration set and standard error of prediction (SEP) for the spectra in the prediction set. Unique calibration and prediction data sets were used for each analyte. In addition, spectra were not mean-centered prior to the PLS analysis and no filtering or preprocessing steps were used to enhance the selectivity or signal-to-noise ratio of the spectra prior to analysis.

(21) Fang, K. T.; Wang, Y. Number-Theoretic Methods in Statistics; Chapman and Hall: New York, 1994.
(22) Ding, Q.; Small, G. W.; Arnold, M. A. Appl. Spectrosc. 1999, 53, 402-414.
(23) Johnson, R. A.; Bhattacharyya, G. K. Statistics Principles and Methods, 2nd ed.; John Wiley & Sons: New York, 1992; p 686.

RESULTS AND DISCUSSION

The proposed pure component selectivity analysis is demonstrated by evaluating PLS calibration models for a group of structurally related carbohydrates. Sucrose is a disaccharide composed of glucose and fructose units, while maltose is a disaccharide composed of two glucose units. The chemical similarity of these solutes creates highly overlapping near-infrared absorption spectra, thereby creating a significant challenge to measure one sugar selectively over the others.

Spectral Overlap. Absorbance spectra are presented in Figure 1A for individual solutions of 40 mM glucose, sucrose, and maltose in phosphate buffer. Absorption bands are similar for all three sugars.
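The sample-wise data splitting described in the model development procedure keeps all replicate spectra of a sample in the same subset. A minimal sketch of that bookkeeping follows; the function name, the random seed, and the use of integer sample labels are assumptions made for illustration and are not taken from the original work.

```python
import numpy as np

def split_by_sample(sample_ids, n_holdout, rng):
    """Return boolean masks (keep, holdout); replicates of a sample stay together."""
    sample_ids = np.asarray(sample_ids)
    unique_ids = np.unique(sample_ids)
    holdout_ids = rng.choice(unique_ids, size=n_holdout, replace=False)
    holdout = np.isin(sample_ids, holdout_ids)
    return ~holdout, holdout

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
sample_ids = np.repeat(np.arange(50), 3)  # 50 samples x 3 replicate spectra

# 10 samples (30 spectra) go to prediction, 40 samples (120 spectra) to calibration
calibration, prediction = split_by_sample(sample_ids, n_holdout=10, rng=rng)

# Inside the calibration set: 10 monitoring samples, 30 remaining samples
cal_ids = sample_ids[calibration]
rest_in_cal, monitoring_in_cal = split_by_sample(cal_ids, n_holdout=10, rng=rng)

print(calibration.sum(), prediction.sum())            # 120 30
print(rest_in_cal.sum(), monitoring_in_cal.sum())     # 90 30
```

Repeating the inner split with different seeds would give the several unique splittings of the calibration data mentioned in the procedure.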
Each spectrum includes a broad band centered at approximately 4700 cm-1, as well as two narrower bands centered at approximately 4400 and 4300 cm-1. The exact peak positions of these bands are 4655, 4402, and 4317 cm-1 for glucose, 4702, 4395, and 4317 cm-1 for sucrose, and 4681, 4396, and 4318 cm-1 for maltose.

The major differences in the pure component spectra presented in Figure 1A correspond to vertical shifts along the absorbance axis and baseline curvature, particularly at the wavenumber extremes. These principal differences are not related to the absorption properties of glucose, sucrose, and maltose but correspond to variations in the spectral backgrounds between the sample and background spectra used in the absorbance calculations. These variations are related to differences in water concentrations between the sample and buffer solutions, as well as insufficiently controlled instrumental and sample-related parameters such as spectrometer alignment and solution temperature.

Much of this background variation can be characterized and removed by analyzing a group of spectra collected from blank buffer solutions over the course of the experiment. In all, 51 buffer spectra were collected for this purpose. Thirty background spectra were acquired during the data collection described above, and an additional 21 spectra were collected subsequently in order to better characterize the background variance. Because the chemical composition is constant across these spectra, the only sources of variation are changes in instrumental response and sample temperature. The buffer spectra were placed in chronological order according to their acquisition time, and the ratio was computed of each spectrum to the first spectrum in the sequence. The resulting transmittance spectra were then converted to AU. These spectra are termed 100% lines because in the absence of variation they would appear as lines with zero slope located at 100% transmittance (0 AU).

Principal component analysis (PCA) (26) was used to characterize the variance in these absorbance 100% lines. Spectra were not mean-centered or otherwise scaled prior to the PCA. Figure 1B presents the first three principal components, which explain 99.998% of the total spectral variance over the 4800-4200-cm-1 range. Each factor describes a nonrandom source of background variation. Although the physical meaning of these factors is difficult to assign, the first and third factors appear to arise from variations in single-beam spectral intensities and the shape of the second factor is suggestive of differences in solution temperature (27-29).

Figure 1. (A) Near-infrared combination region spectra for 40 mM concentrations of glucose (solid line), sucrose (dash-dot line), and maltose (dashed line) dissolved in pH 6.86 phosphate buffer. (B) First (solid line), second (dash-dot line), and third (dashed line) principal components for spectral 100% lines (in AU) computed from blank buffer spectra. (C) Residual pure component spectra for glucose (solid line), sucrose (dash-dot line), and maltose (dashed line) after removing the three principal components plotted in panel B.

(24) Riley, M. R.; Rhiel, M.; Zhou, X.; Arnold, M. A.; Murhammer, D. W. Biotechnol. Bioeng. 1997, 55, 11-15.
(25) Riley, M. R.; Arnold, M. A.; Murhammer, D. W.; Walls, E. L.; DelaCruz, N. Biotechnol. Prog. 1998, 14, 527-533.

A different view of spectral overlap between glucose, sucrose, and maltose can be obtained by removing this background variation from the pure component spectra. Figure 1C shows the result of removing the three background principal components from the pure component spectra presented in Figure 1A. These difference spectra correspond to the component of each solute spectrum that is orthogonal to the spectral variation inherent within the buffer spectra. This treatment eliminates much of the vertical variation, thereby enhancing the solute-dependent spectral differences. Inspection of these treated spectra reveals subtle differences in the position, size, and shape of the three absorption bands. Large differences are observed over the C-H region between 4420 and 4200 cm-1. Over this spectral region, glucose and maltose have similar spectra and the spectrum for sucrose is significantly different, particularly the relative magnitude of the 4400- and 4300-cm-1 absorption bands. By contrast, over the 4800-4420-cm-1 range, the maltose and sucrose spectra are similar and significantly different from the glucose spectrum.

The differences and similarities of absorbance spectra can be quantified by a spectral vector analysis. In this analysis, each absorbance spectrum is represented as a vector in a multidimensional space where each spectral resolution element corresponds to an orthogonal axis.
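The background removal and the vector view of spectra can be sketched in a few lines of linear algebra. The example below is a sketch under stated assumptions, not the authors' implementation: it uses an uncentered singular value decomposition for the PCA, simulated baseline drifts in place of measured 100% lines, and synthetic Gaussian bands in place of sugar spectra. The sine of the angle between two spectral vectors is used as the similarity measure, and all function and variable names are hypothetical.

```python
import numpy as np

def background_components(lines_100pct, n_components=3):
    """Principal components (rows) of uncentered 100% lines, obtained by SVD."""
    _, _, vt = np.linalg.svd(np.asarray(lines_100pct, dtype=float),
                             full_matrices=False)
    return vt[:n_components]

def remove_background(spectrum, components):
    """Part of `spectrum` orthogonal to the orthonormal background components."""
    spectrum = np.asarray(spectrum, dtype=float)
    return spectrum - components.T @ (components @ spectrum)

def sine_between(a, b):
    """Sine of the angle between two spectral vectors (0 = parallel, 1 = orthogonal)."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    cos = np.clip(cos, -1.0, 1.0)
    return float(np.sqrt(1.0 - cos ** 2))

def band(x, center, width):
    """Synthetic Gaussian absorption band (illustrative only)."""
    return np.exp(-0.5 * ((x - center) / width) ** 2)

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 622)  # 622 resolution elements, scaled axis

# Simulated 100% lines: random mixtures of offset, slope, and curvature
drift = np.vstack([np.ones_like(x), x, x ** 2])
lines = (rng.normal(size=(51, 3)) @ drift) * 1e-3
pcs = background_components(lines, n_components=3)

# Two overlapping synthetic "solute" spectra with different baselines
solute_a = band(x, -0.2, 0.1) + 0.5 * band(x, 0.4, 0.05) + 0.002
solute_b = band(x, -0.1, 0.1) + 0.8 * band(x, 0.4, 0.05) - 0.001 * x

before = sine_between(solute_a, solute_b)
after = sine_between(remove_background(solute_a, pcs),
                     remove_background(solute_b, pcs))
print(f"sine before background removal: {before:.3f}")
print(f"sine after background removal:  {after:.3f}")
```

Because the components returned by the SVD are orthonormal, the residual is simply the spectrum minus its projection onto them, which matches the description of the residual spectra as the part orthogonal to the buffer variation.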
The spectra in Figure 1 consist of 622 resolution elements, so that each spectrum can be represented as a vector in this 622-dimensional space. An analysis of the direction of these vectors gives a measure of their similarity. Spectral overlap can be quantified as the sine of the angle (θ) between vectors. Exactly overlapping spectra will result in vectors with the same direction, leading to sine(0) = 0.0. Similarly, two spectra that are completely nonoverlapping will produce orthogonal vectors (sine(π/2) = 1.0). Lorber has used the same approach in his definition of spectral selectivity (30). In this definition, orthogonal spectra are assigned a selectivity of 1.0, thereby indicating maximum selectivity.

Sine θ values between the spectral vectors corresponding to the pure component spectra in Figure 1A are 0.67 (glucose-sucrose), 0.45 (glucose-maltose), and 0.29 (sucrose-maltose). After removing the background variance (Figure 1C), the corresponding sine θ values are 0.65, 0.40, and 0.37, respectively. These sine θ values indicate that the highest degree of spectral overlap exists between the glucose-maltose spectra and the maltose-sucrose spectra over the 4800-4200-cm-1 spectral range. These values also demonstrate a similar degree of spectral overlap before and after removal of the background spectral variance, which implies that this background removal process does not selectively alter the shape of the solute spectra.

The vector angles cited above correspond to the solute spectra over the full spectral window of 4800-4200 cm-1. Spectral vectors over a more restricted range of 4420-4220 cm-1 emphasize the C-H bonding region. Sine θ values for the spectra in Figure 1C over this restricted spectral range are 0.82, 0.49, and 0.58, respectively, for glucose-sucrose, glucose-maltose, and sucrose-maltose. Each value is higher than the corresponding sine θ value for the full spectral range, which indicates that the C-H bonding region provides greater distinction between these sugars. Sine θ values over the 4800-4420-cm-1 spectral range are 0.13, 0.09, and 0.09, respectively, thereby indicating considerable overlap and little distinction between sugars.

(26) Jollife, I. T. Principal Component Analysis; Springer-Verlag: New York, 1986.
(27) Olesberg, J. T.; Arnold, M. A.; Hu, S.-Y. B.; Wiencek, J. M. Anal. Chem. 2000, 72, 4985-4990.
(28) Hazen, K. H.; Arnold, M. A.; Small, G. W. Appl. Spectrosc. 1998, 52, 1597-1605.
(29) Eddy, C. V.; Arnold, M. A. Clin. Chem. 2001, 47, 1279-1286.
(30) Lorber, A. Anal. Chem. 1986, 58, 1167-1172.

Spectral Quality. The ability to extract selective information from near-infrared spectra of mixtures depends on the quality of the individual spectra. High spectral reproducibility and low noise are necessary to distinguish the subtle spectral differences described above. In this work, variations in intensity of single-beam spectra and root-mean-square (rms) noise of 100% lines (28) were measured as indicators of spectral quality for the entire data set.

Reproducibility of the raw single-beam spectra was measured as the repeatability of the transmitted intensity at the point of minimum solution absorbance. For these aqueous samples, the maximum intensity in the single-beam spectrum occurs at 4510.67 cm-1. Across the full data set of 150 sample spectra, the standard deviation for intensity at 4510.67 cm-1 was 0.22 (arbitrary units) and the mean intensity was 33.33 (arbitrary units), which corresponds to a relative standard deviation of 0.67%. Furthermore, no systematic variance is evident in a plot of maximum single-beam intensity as a function of order of collection (data not shown). This degree of random variation corresponds to a three-day period, over which all spectra were collected.

The 100% lines were computed for all samples by dividing the individual single-beam spectra by each corresponding replicate and converting the transmittance values to AU. All three possible combinations were generated for each sample. Each 100-cm-1 segment was fitted to a second-order polynomial baseline by regression analysis, and the rms noise corresponding to the spectral segment was computed about this baseline (15, 20). The mean, standard deviation, and relative standard deviation were computed across all replicates for all samples. The results are summarized in Table 1 for each 100-cm-1 segment of the 100% lines. As expected, noise levels are greatest at the extremes where the incident radiation is mostly absorbed by water. The minimum rms noise level is observed for the 4600-4500-cm-1 segment, which corresponds to the water absorption minimum. The mean rms noise for the 4600-4500-cm-1 segment is only 3.7 µAU. In addition, the average rms noise is below 10 µAU over the 4700-4300-cm-1 spectral range. No systematic variation in the rms noise is evident as a function of the order of data collection.

Table 1. rms Noise on 100% Lines for the Spectral Data from 50 Unique Ternary Mixtures

spectral section (cm-1)    rms noise, mean (µAU)    RSD (%)
5000-4900                  3456                     21.98
4900-4800                  160                      19.58
4800-4700                  20.2                     17.83
4700-4600                  6.26                     14.02
4600-4500                  3.74                     12.39
4500-4400                  4.17                     13.53
4400-4300                  9.12                     14.87
4300-4200                  49.4                     18.18
4200-4100                  878                      22.26
4100-4000                  10442                    17.78

Calibration Models. Results for each analyte are summarized in Table 2. Five PLS factors provide the best performance in each case. The optimized spectral range is similar for each analyte and focuses on the subtle differences associated with the C-H bonding region, as noted in Figure 1. Values of SEC and SEP are similar for each analyte, with the lowest errors for sucrose and highest errors for glucose. Larger errors for the monosaccharide relative to the disaccharides are consistent with lower absorptivities expected for fewer C-H groups per mole for glucose compared to either sucrose or maltose. In addition, the SEP values track the degree of spectral overlap as quantified by the sine θ values noted above. Sucrose demonstrates the least overlap and lowest standard error, while glucose possesses the greatest overlap and highest standard error. Nevertheless, each analyte can be measured selectively.

Table 2. Parameters and Results for PLS Calibration Models

analyte    rank(a)    spectral range, cm-1    SEC, mM    SEP, mM    MPEP(b) (%)
glucose    5          4420-4220               0.71       0.73       6.8
sucrose    5          4500-4300               0.28       0.17       0.63
maltose    5          4550-4400               0.53       0.35       2.0

(a) Number of factors or latent variables in the final PLS calibration model. (b) Mean percent error in concentration for prediction samples.

Concentration correlation plots are presented in Figure 2 for each analyte. In all cases, both calibration and prediction data fall along the ideal unity line. Residual analysis indicates the extent to which the concentrations of sucrose and maltose impact the measurement of glucose. Figure 3 shows the residual in glucose concentration as a function of the concentrations of glucose (Figure 3A), sucrose (Figure 3B), and maltose (Figure 3C). No systematic variation is evident in any of these plots, which indicates that glucose can be measured independently of the other carbohydrates.
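The rms noise calculation on 100% lines described in the Spectral Quality discussion can be sketched as follows. This is an illustrative sketch on simulated replicate spectra: the rms is taken here as the square root of the mean squared residual about a second-order polynomial fit, which is an assumption about the exact normalization, and the function names and noise level are hypothetical.

```python
import numpy as np

def hundred_percent_line(single_beam_a, single_beam_b):
    """Ratio of two replicate single-beam spectra, converted to absorbance (AU)."""
    return -np.log10(np.asarray(single_beam_a) / np.asarray(single_beam_b))

def segment_rms_noise(wavenumbers, absorbance, order=2):
    """rms deviation of one spectral segment about a polynomial baseline."""
    x = wavenumbers - wavenumbers.mean()  # centered axis for a well-conditioned fit
    coeffs = np.polyfit(x, absorbance, order)
    residual = absorbance - np.polyval(coeffs, x)
    return float(np.sqrt(np.mean(residual ** 2)))

rng = np.random.default_rng(2)

# One 100-cm-1 segment at the 0.964-cm-1 point spacing reported above
wn = np.arange(4500.0, 4600.0, 0.964)

# Simulated replicate single-beam spectra with small multiplicative noise
base = 33.0 + 0.01 * (wn - 4550.0)
replicate_1 = base * (1.0 + rng.normal(scale=1e-5, size=wn.size))
replicate_2 = base * (1.0 + rng.normal(scale=1e-5, size=wn.size))

line = hundred_percent_line(replicate_1, replicate_2)
noise_au = segment_rms_noise(wn, line)
print(f"rms noise for 4600-4500 cm-1 segment: {noise_au * 1e6:.2f} µAU")
```

Applying `segment_rms_noise` to each 100-cm-1 segment and each replicate pair, then averaging, would give a table with the same layout as Table 1.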
In fact, analogous residual plots for sucrose and maltose indicate similar selectivity for these sugars over the other two potential interferences. Linear regression analysis of these residual plots indicates that none of the slopes, y-intercepts, or values of r2 are significantly different from zero at the 95% confidence level.

Pure Component Selectivity Analysis. Selectivity was further characterized by examining results after submitting pure component spectra to each PLS calibration model. Ideally, the glucose calibration function, for example, will be constructed from the net analyte signal (NAS) of glucose, where NAS is a term used to describe that portion of the analyte spectrum that is orthogonal to all nonanalyte spectral variations within the data set, including spectral features related to additional solutes and the solvent (30). The ratio of the magnitude of the NAS vector to the magnitude of the pure component spectrum yields the selectivity measure (sine θ) defined by Lorber.
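The net analyte signal and the associated selectivity ratio can be written as a simple projection. The sketch below assumes that the nonanalyte variation is fully described by a small set of known interferent spectra, which is a simplification of the definition given above (background and solvent variation are ignored). The spectra are synthetic Gaussian bands and the function names are hypothetical.

```python
import numpy as np

def net_analyte_signal(analyte, interferents):
    """Component of `analyte` orthogonal to the rows of `interferents`."""
    analyte = np.asarray(analyte, dtype=float)
    interferents = np.atleast_2d(np.asarray(interferents, dtype=float))
    # pinv(K) @ K projects onto the space spanned by the interferent spectra
    projection = np.linalg.pinv(interferents) @ (interferents @ analyte)
    return analyte - projection

def selectivity(analyte, interferents):
    """Ratio |NAS| / |pure spectrum|; 1 means no overlap, 0 means complete overlap."""
    nas = net_analyte_signal(analyte, interferents)
    return float(np.linalg.norm(nas) / np.linalg.norm(analyte))

# Synthetic, strongly overlapping bands standing in for three solutes
x = np.linspace(0.0, 1.0, 200)
analyte = np.exp(-0.5 * ((x - 0.45) / 0.1) ** 2)
interferents = np.vstack([
    np.exp(-0.5 * ((x - 0.50) / 0.1) ** 2),
    np.exp(-0.5 * ((x - 0.60) / 0.1) ** 2),
])

nas = net_analyte_signal(analyte, interferents)
print(f"selectivity = {selectivity(analyte, interferents):.3f}")
print(f"max |interferent . NAS| = {np.abs(interferents @ nas).max():.2e}")
```

A regression vector pointing along the NAS direction gives zero response to any interferent spectrum, which is the property the pure component selectivity analysis tests experimentally.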


Figure 2. Concentration (mM) correlation plots for PLS calibration models for glucose (A), sucrose (B), and maltose (C). The solid line represents ideal correlation, and open and closed circles correspond to calibration and prediction points, respectively.

If analyte orthogonality is achieved within the glucose calibration model, then a pure component spectrum of either sucrose or maltose will produce a result of zero concentration from the glucose model because no component of the sucrose or maltose spectrum will be involved in the prediction of the glucose concentration. To illustrate this concept, pure component spectra for glucose, sucrose, and maltose were submitted to the individual PLS calibration functions for glucose, sucrose, and maltose. These spectra corresponded to 1, 5, 10, 20, and 40 mM solutions of each carbohydrate. Results for the glucose calibration model are presented as a pure component selectivity plot in Figure 4.
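The procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PLS models used in this work: the regression vector `b` and the unit-concentration spectra are hypothetical, spectra are assumed to scale linearly with concentration, and the model intercept is taken as zero. Each pure component spectrum is submitted to the model, and the predicted concentration is regressed against the actual concentration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y versus x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical unit-concentration spectra (absorbance per mM, three channels).
unit_spectra = {
    "analyte": [1.0, 1.0, 0.0],
    "interferent_a": [1.0, 0.0, 0.0],
    "interferent_b": [0.0, 0.0, 1.0],
}

# Hypothetical regression vector standing in for a fitted calibration model.
b = [0.02, 0.98, 0.0]

# Concentrations of the pure component solutions (mM), as listed in the text.
concentrations = [1.0, 5.0, 10.0, 20.0, 40.0]

def selectivity_fit(component):
    """Submit pure component spectra to the model; regress predicted on actual."""
    predicted = [
        dot(b, [c * k for k in unit_spectra[component]]) for c in concentrations
    ]
    return fit_line(concentrations, predicted)

for component in unit_spectra:
    slope, intercept = selectivity_fit(component)
    print(f"{component}: slope = {slope:.3f}, intercept = {intercept:.3f} mM")
```

In this sketch, a slope near unity for the analyte and near zero for each interferent corresponds to the ideal behavior described above, while a nonzero slope for an interferent would flag spectral information from that component entering the prediction.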

Figure 3. Residual plots from the PLS calibration model for glucose as a function of the concentration of glucose (A), sucrose (B), and maltose (C). The solid line represents a residual of 0 mM, and open and closed circles correspond to calibration and prediction points, respectively.

Pure component spectra of glucose produced predicted glucose concentrations that matched the known values. More significantly, when presented to the glucose model, pure component spectra of sucrose and maltose resulted in glucose model predictions of zero. This finding demonstrates that none of the spectral features within the pure component spectra of sucrose and maltose is used by the glucose PLS model for concentration predictions. These data prove that the PLS calibration model for glucose corresponds to the net glucose signal relative to the spectra of sucrose and maltose. In this case, this orthogonal relationship is the fundamental basis for glucose selectivity. Analogous analyses of the PLS models for sucrose and maltose generate essentially identical results, thereby proving selectivity for multiple analytes from a single near-infrared spectrum. Responses of the calibration models to all pure component spectra are summarized in Table 3 as a series of regression coefficients from a least-squares analysis. In all cases, slopes are essentially zero for potential interfering carbohydrates, which indicates no response from that substance and excellent selectivity for the principal analyte.

Table 3. Linear Regression Analysis from Pure Component Selectivity Plots

                        glucose response              sucrose response              maltose response
calibration function    slope    y-inter(a)  r²       slope    y-inter(a)  r²       slope    y-inter(a)  r²
glucose                 0.965    0.4         0.9987   -0.036   0.4         0.4926   -0.012   -0.3        0.0915
sucrose                 0.002    0.01        0.0752   0.982    -0.08       0.9999   -0.004   0.2         0.1959
maltose                 -0.010   0.01        0.1275   0.005    -0.5        0.0500   0.969    -0.03       0.9993

(a) Millimolar.

Figure 4. Pure component selectivity plot for the PLS calibration model for glucose showing responses to pure component spectra for glucose (circles), sucrose (squares), and maltose (triangles). Solid lines show ideal responses.

As noted above, the lack of response from pure component spectra of cosolutes suggests that the NAS is the basis of selectivity for each calibration model. To test this further, the NAS was computed for each analyte and compared to the regression coefficient vector (β) for the corresponding PLS calibration model. These NAS values were computed from pure component spectra by standard methods.31 In this experiment, the NAS for glucose was computed from the pure component glucose spectrum and the pure component spectra of sucrose, maltose, and phosphate buffer. Selectivity of a PLS calibration model can be characterized by comparing the regression coefficient vectors (β) of the PLS calibration equation to the NAS. In each case, the NAS was converted to the corresponding regression vector (β) by the method reported by Kowalski and co-workers.16 The plots presented in Figure 5 permit a visual comparison of these β vectors for glucose, sucrose, and maltose. Figure 5A shows the β vector calculated from the NAS for glucose in gray and β vector for the glucose PLS calibration model in black. Panels B and C of Figure 5 show analogous plots for sucrose and maltose, respectively. In all cases, the two vectors match, which confirms that each calibration model is based on the selective spectral information of the corresponding analyte.

Figure 5. Comparison of regression coefficient vectors computed from NAS (gray) and PLS (black) for glucose (A), sucrose (B), and maltose (C).

(31) Malinowski, E. R. Factor Analysis in Chemistry, 2nd ed.; John Wiley & Sons: New York, 1991.

CONCLUSIONS

Utility of the proposed pure component selectivity analysis is demonstrated with a series of PLS calibration models generated from near-infrared spectra collected from ternary mixtures of glucose, sucrose, and maltose. Although spectral features are highly overlapping for these sugars, selective measurements are possible. The pure component selectivity analysis indicates that none of the concentration measurements for glucose is derived from either the spectrum of sucrose or maltose. This analysis shows that the unique component of the glucose spectrum (i.e., the NAS) is solely responsible for the measurement. This finding is confirmed by comparing the regression coefficient vectors from the computed NAS and the PLS algorithm.

This pure component selectivity analysis is proposed as a general technique for assessing the chemical basis of selectivity for multivariate calibration models. It provides a complementary procedure to the use of the NAS alone to characterize the selectivity of an analyte with respect to the sample matrix. Rather than a somewhat abstract numerical measure of selectivity (e.g., sine (θ)), the procedure illustrated here provides the analyst with selectivity information that is tied directly to the performance of the calibration model. In addition, by taking the ratio of the slope for the analyte in the pure component selectivity plot to the corresponding slope for an interference, a selectivity measure analogous to that used in univariate calibration can be obtained.
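As a rough illustration of this slope ratio, the sketch below applies it to the slopes listed in Table 3. The use of the absolute value of the interference slope is an assumption made here so that the ratio stays positive. Because the interference slopes are essentially zero, the resulting ratios are best read as order-of-magnitude indications rather than precise selectivity coefficients.

```python
# Slopes as read from Table 3. Outer keys are calibration models; inner keys
# are the pure component spectra submitted to each model.
slopes = {
    "glucose": {"glucose": 0.965, "sucrose": -0.036, "maltose": -0.012},
    "sucrose": {"glucose": 0.002, "sucrose": 0.982, "maltose": -0.004},
    "maltose": {"glucose": -0.010, "sucrose": 0.005, "maltose": 0.969},
}

def selectivity_ratios(model):
    """Analyte slope divided by |interference slope| (abs() is an assumption)."""
    analyte_slope = slopes[model][model]
    return {
        component: analyte_slope / abs(slope)
        for component, slope in slopes[model].items()
        if component != model
    }

for model in slopes:
    for interference, ratio in selectivity_ratios(model).items():
        print(f"{model} model, {interference} interference: ratio = {ratio:.0f}")
```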

Pure component selectivity analysis should also be valuable when exploring issues of selectivity for multivariate models derived from complex sample matrixes. For a complex matrix, not all pure component spectra may be available, and thus any NAS calculation must be based solely on the calibration model.16-18 In such cases, when the slope of the pure component selectivity plot is unity for the analyte and zero for all reasonable interferences, additional confidence is gained in the NAS vector estimated from the calibration data.

The work presented here represents the first step in the development of this selectivity analysis. The findings confirm the utility of this method for a properly designed sample matrix, where the resulting multivariate calibration model is based solely on spectral information originating from the analyte of interest. The next step is to characterize models in which spectral information originating from nonanalyte sources is involved, for example, when concentrations are correlated among the sample constituents.

ACKNOWLEDGMENT

This research was supported by grants from the Microgravity Science & Applications Division of the National Aeronautics and Space Administration (NAG8-1352) and the National Institute of Diabetes and Digestive and Kidney Diseases of the National Institutes of Health (DK-60657).

Received for review December 19, 2003. Accepted February 23, 2004. AC035516Q