Anal. Chem. 1996, 68, 4200-4212
Genetic Algorithm-Based Method for Selecting Wavelengths and Model Size for Use with Partial Least-Squares Regression: Application to Near-Infrared Spectroscopy
Arjun S. Bangalore, Ronald E. Shaffer,† and Gary W. Small*
Center for Intelligent Chemical Instrumentation, Department of Chemistry, Clippinger Laboratories, Ohio University, Athens, Ohio 45701-2979
Mark A. Arnold
Department of Chemistry, University of Iowa, Iowa City, Iowa 52242
Genetic algorithms (GAs) are used to implement an automated wavelength selection procedure for use in building multivariate calibration models based on partial least-squares regression. The method also allows the number of latent variables used in constructing the calibration models to be optimized along with the selection of the wavelengths. The data used to test this methodology are derived from the determination of aqueous organic species by near-infrared spectroscopy. The three data sets employed focus on the determination of (1) methyl isobutyl ketone in water over the range of 1-160 ppm, (2) physiological levels of glucose in a phosphate buffer matrix containing bovine serum albumin and triacetin, and (3) glucose in a human serum matrix. These data sets feature analyte signals near the limit of detection and the presence of significant spectral interferences. Studies are performed to characterize the signal and noise characteristics of the spectral data, and optimal configurations for the GA are found for each data set through experimental design techniques. Despite the complexity of the spectral data, the GA procedure is found to perform well, leading to calibration models that significantly outperform those based on full spectrum analyses. In addition, a significant reduction in the number of spectral points required to build the models is realized.

Multivariate calibration models have come into wide use in implementing quantitative spectroscopic analyses due to their ability to overcome deviations from the Beer-Lambert law caused by effects such as overlapping spectral bands and interactions between components. Lorber and Kowalski have shown that improvements in the prediction performance of these calibration models can be obtained by increasing the number of sensors or wavelengths used in the model formulation.1 Given the capability of computing models based on the use of multiple wavelengths, however, optimization issues arise regarding which specific wavelengths should be included. Several researchers have demonstrated that employing selected wavelength regions rather than the entire spectrum improves the accuracy of predictions.2,3

Significant work has been performed in the area of wavelength selection for use in multivariate calibration. Gemperline has reviewed the work in this area prior to 1989,4 and numerous studies have been published since his review. Selection of optimal subsets of wavelengths has been achieved by use of criteria based on both the information content of the spectral data and statistics describing the performance of the calibration model. Some of the criteria used have included the spectral signal-to-noise (S/N) ratio, measures of matrix orthogonality such as the condition number or determinant, and calibration model statistics such as the minimum mean-squared error (MMSE) or predicted residual error sum of squares (PRESS).5-10 Alternatively, Brown has developed a graphical method for identifying the individual wavelengths or wavelength ranges obeying the Beer-Lambert law.11

The selection criteria noted above have typically been implemented through the use of search-based strategies or optimization techniques in which various combinations of wavelengths are evaluated. Optimization techniques used have included simplex optimization,5 branch and bound combinatorial optimization techniques,6,7 genetic algorithms (GAs),12,13,15 stepwise elimination (SE),13 simulated annealing (SA),14 and generalized simulated annealing (GSA).5,14
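As an illustration of how a model statistic such as PRESS can rank candidate wavelengths, the following minimal sketch computes a leave-one-out PRESS for a single-wavelength inverse calibration (concentration regressed on absorbance). It is a deliberate simplification and not the procedure of any cited study: the function names `fit_line` and `press_loo`, the wavelength labels `w1` and `w2`, and all numerical values are invented for illustration only.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def press_loo(x, y):
    """Leave-one-out PRESS: each sample is predicted by a model
    fitted to the remaining samples, and squared errors are summed."""
    total = 0.0
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        a, b = fit_line(x_train, y_train)
        total += (y[i] - (a + b * x[i])) ** 2
    return total


# Synthetic, made-up data: concentrations and absorbances at two
# hypothetical wavelengths (not taken from the paper).
conc = [1.0, 2.0, 3.0, 4.0, 5.0]
absorbance = {
    "w1": [0.10, 0.20, 0.30, 0.40, 0.50],  # exactly linear response
    "w2": [0.12, 0.18, 0.33, 0.37, 0.52],  # same trend with scatter
}

# Lower PRESS means better cross-validated prediction.
scores = {name: press_loo(a, conc) for name, a in absorbance.items()}
best = min(scores, key=scores.get)
```

In a search-based strategy, the same kind of score would be evaluated for combinations of wavelengths rather than for one wavelength at a time.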
† Present address: Naval Research Laboratory, Environmental Chemistry and Sensor Chemistry Section, Chemistry Division, Code 6116, Washington, DC 20375-5342.
(1) Lorber, A.; Kowalski, B. R. J. Chemom. 1988, 2, 67-79.
(2) Brown, C. W.; Lynch, P. F.; Obremski, R. J.; Lavery, D. S. Anal. Chem. 1982, 54, 1472-1479.
(3) Rossi, D. T.; Pardue, H. L. Anal. Chim. Acta 1985, 175, 153-161.
(4) Gemperline, P. J. J. Chemom. 1989, 3, 549-568.
(5) Kalivas, J. H.; Roberts, N.; Sutter, J. M. Anal. Chem. 1989, 61, 2024-2030.
(6) Liang, Y.; Xie, Y.; Yu, R. Anal. Chim. Acta 1989, 222, 347-357.
(7) Sasaki, K.; Kawata, S.; Minami, S. Appl. Spectrosc. 1986, 40, 185-190.
(8) Salamin, P. A.; Bartels, H.; Forster, P. Chemom. Intell. Lab. Syst. 1991, 11, 57-62.
(9) Rimbaud, D. J.; Walczak, B.; Massart, D. L.; Last, I. R.; Prebble, K. A. Anal. Chim. Acta 1995, 304, 185-295.
(10) Brown, P. J. J. Chemom. 1992, 7, 255-265.
(11) Brown, P. J. J. Chemom. 1992, 6, 151-161.
(12) Lucasius, C. B.; Kateman, G. Trends Anal. Chem. 1991, 10, 254-261.
(13) Lucasius, C. B.; Beckers, M. L. M.; Kateman, G. Anal. Chim. Acta 1994, 286, 135-153.
(14) Hörchner, U.; Kalivas, J. H. Anal. Chim. Acta 1995, 311, 1-13.
(15) Jouan-Rimbaud, D.; Massart, D.; Leardi, R.; Noord, O. D. Anal. Chem. 1995, 67, 4295-4301.
4200 Analytical Chemistry, Vol. 68, No. 23, December 1, 1996
S0003-2700(96)00712-3 CCC: $12.00
© 1996 American Chemical Society
These selection criteria and search methods have been used in conjunction with several model-building techniques. Most of the work in wavelength selection has been performed with conventional multiple linear regression. Model-building techniques based on the use of latent variables (e.g., principal component regression (PCR) or partial least-squares regression (PLSR)) have not been used as widely with wavelength selection, as these methods have a built-in capability to extract relevant information from a complete spectrum. However, several workers have demonstrated that wavelength selection can also enhance the performance of models generated with PCR and PLSR.9,16 In characterizing previous wavelength selection studies, two general statements can be made. First, most studies have employed data in which the analyte spectral signals were characterized by a high S/N ratio. In cases in which spectral interferents have been present, the analyte signals have tended not to be swamped by the signals due to the interferents. In these cases, it has generally been possible to build effective calibration models with relatively few wavelengths, and the various wavelength selection criteria and search methods tested have all performed effectively. Second, the selection of wavelengths to use in the calibration model has generally been performed separately from the optimization of other model parameters. For example, when PCR or PLSR is used, a key parameter in the calibration model is the number of latent variables employed. The optimal value of this parameter would seem to depend on which wavelengths have been selected, but previous implementations of wavelength selection procedures have not incorporated the number of latent variables into the optimization. 
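The dependence of the number of latent variables on the selected wavelengths can be made concrete with a small sketch of a candidate solution that carries both pieces of information together. This is an assumed, hypothetical encoding written for illustration (the names `Candidate`, `repair`, `random_candidate`, and `mutate` are not from the paper), and it omits the calibration model itself. It relies only on the fact that a latent-variable model cannot use more latent variables than it has predictor variables.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    mask: list  # one 0/1 flag per spectral point (1 = point is used)
    n_lv: int   # number of latent variables requested for the model


def selected_indices(c):
    """Indices of the spectral points switched on in the mask."""
    return [i for i, bit in enumerate(c.mask) if bit]


def repair(c):
    """Clamp n_lv to the range 1..(number of selected points)."""
    n_selected = sum(c.mask)
    c.n_lv = max(1, min(c.n_lv, n_selected))
    return c


def random_candidate(n_points, max_lv, rng):
    """Draw a random mask (at least one point on) and a random n_lv."""
    mask = [rng.randint(0, 1) for _ in range(n_points)]
    if not any(mask):
        mask[rng.randrange(n_points)] = 1
    return repair(Candidate(mask, rng.randint(1, max_lv)))


def mutate(c, rate, max_lv, rng):
    """Flip each mask bit with probability `rate` and nudge n_lv by
    -1, 0, or +1, then repair so the pair stays consistent."""
    mask = [bit ^ 1 if rng.random() < rate else bit for bit in c.mask]
    if not any(mask):
        mask[rng.randrange(len(mask))] = 1
    n_lv = max(1, min(c.n_lv + rng.choice([-1, 0, 1]), max_lv))
    return repair(Candidate(mask, n_lv))


rng = random.Random(0)
parent = random_candidate(n_points=20, max_lv=10, rng=rng)
child = mutate(parent, rate=0.1, max_lv=10, rng=rng)
```

Because the mask and the latent-variable count travel in one object, any change to the selected wavelengths can immediately be reflected in the admissible model size, which is the coupling that a separate, sequential optimization would miss.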
Work in our laboratories is focused on the development of techniques in near-infrared (near-IR) spectroscopy for use in measuring analytes at low concentrations and in matrices in which the analyte signals are largely swamped by those due to other matrix constituents.17-19 As part of this work, a successful GA procedure has been developed for the collective optimization of five variables that are relevant to the use of PLSR in building calibration models based on near-IR spectral data.20 The variables explored in this work were the starting and ending points of a contiguous spectral range submitted to the PLSR procedure, the number of latent variables used, and the frequency position and width of a band-pass digital filter employed to preprocess the spectral data. This work proved that the GA could find optimal values for several disparate variables associated with the calibration model and that the PLSR procedure could be integrated into the objective function driving the optimization. Based on this success, this paper describes the development of a GA method for use in jointly selecting the individual spectral points (i.e., wavelengths) and the number of latent variables employed in developing a PLSR calibration model. Three near-IR spectral data sets are used in this research, each of which contains analyte signals near the limit of detection and/or spectral interferents whose signals dominate those of the analyte.
(16) Navarro-Villoslada, F.; Pérez-Arribas, L. V.; León-González, M. E.; Polo-Díez, L. M. Anal. Chim. Acta 1995, 313, 93-101.
(17) Small, G. W.; Arnold, M. A.; Marquardt, L. A. Anal. Chem. 1993, 65, 3279-3289.
(18) Marquardt, L. A.; Arnold, M. A.; Small, G. W. Anal. Chem. 1993, 65, 3271-3278.
(19) Hazen, K. H.; Arnold, M. A.; Small, G. W. Appl. Spectrosc. 1994, 48, 477-483.
(20) Shaffer, R. E.; Arnold, M. A.; Small, G. W. Anal. Chem. 1996, 68, 2663-2675.
EXPERIMENTAL SECTION
Instrumentation. The three near-IR spectral data sets used in this work represent sample matrices of increasing complexity. The analytes and matrices studied were (1) methyl isobutyl ketone (MIBK) in water, (2) glucose in an aqueous phosphate buffer matrix containing bovine serum albumin (BSA) and triacetin (GTB data set), and (3) glucose in human serum samples. The GTB and human serum data sets have been used in previous work.20,21 The MIBK and GTB data sets were collected at Ohio University with a Digilab FTS-60A Fourier transform spectrometer (Bio-Rad, Cambridge, MA). The instrument was operated with a standard near-IR configuration consisting of a 100-W tungsten halogen source, CaF₂ beam splitter, and liquid nitrogen-cooled InSb detector. The data collection focused on the combination spectral region of 5000-4000 cm⁻¹. This region was isolated by placing a K-band interference filter (Barr Associates, Westford, MA) in the optical path of the spectrometer. Samples were contained in an Infrasil quartz transmission cell with a 2-mm path length. Sample temperatures were controlled to 25-26 °C for the MIBK data and to 37-38 °C for the GTB data by use of a water-jacketed holder for the sample cell. The human serum data set was collected at the University of Iowa with a Nicolet 740 Fourier transform spectrometer equipped with a 250-W tungsten halogen source, CaF₂ beam splitter, and liquid nitrogen-cooled InSb detector. The region of 5000-4000 cm⁻¹ was again isolated with a K-band interference filter (Barr Associates). The path length of the Infrasil quartz sample cell was 2.5 mm, and the sample temperature was controlled to 37.0 ± 0.2 °C by use of a water-jacketed cell holder.
Reagents. Samples for the MIBK data set were prepared by dissolving reagent-grade MIBK (J. T. Baker, Inc., Phillipsburg, NJ) in water.
The reagent-grade water used was obtained by passing house-distilled water through a Milli-Q Plus water purification system (Millipore Corp., Bedford, MA). A 500 ppm stock solution of MIBK was prepared by weighing the liquid reagent directly into a volumetric flask and diluting with the reagent-grade water. Individual samples were prepared by dilutions of the stock solution. The MIBK data set had a total of 17 samples, spanning the nominal concentration range of 1-160 ppm. The concentrations employed were 2.1, 4.8, 10.4, 20.8, 24.8, 31.2, 48.3, 52.0, 62.4, 72.4, 77.0, 96.6, 102.6, 115.6, 128.2, 144.5, and 153.9 ppm. The GTB data set was constructed by use of samples prepared in pH 7.4, 0.1 M phosphate buffer. The analyte studied was glucose. To simulate common interferences encountered in the analysis of glucose in blood samples, BSA and triacetin were placed in the sample matrix to model proteins and triglycerides, respectively. Separate stock solutions of 50 mM glucose (ACS reagent, Fisher Scientific, Fair Lawn, NJ), 190 g/L BSA (Cohn fraction V powder, Sigma Chemical Co., St. Louis, MO, Product No. A 4503), and 35 g/L triacetin (99%, Sigma Chemical Co.) were prepared in phosphate buffer. Individual samples were made by mixing appropriate volumes of the stock solutions and diluting to 50 mL with the phosphate buffer. The concentrations employed were based on a 10 × 4 × 4 factorial design. Ten levels of glucose were used (1, 3, 5, 7, 9, 11, 13, 15, 17, 19 mM), along with four levels of triacetin (1.4, 2.1, 2.8, 3.5 g/L) and four levels of BSA (49.3, 64.4, 79.8, 94.7 g/L). This produced a total of 10 × 4 × 4 = 160 samples.
(21) Hazen, K. H. Ph.D. Dissertation, University of Iowa, Iowa City, IA, 1995.
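The 10 × 4 × 4 factorial design for the GTB data set can be enumerated directly from the stated concentration levels. The short sketch below does this and also applies the standard dilution relation C1·V1 = C2·V2 to estimate the stock volume needed for a 50 mL sample. The helper name `stock_volume_mL` and the variable names are illustrative assumptions, and the computed volumes are arithmetic consequences of the stated stock concentrations, not volumes reported by the authors.

```python
from itertools import product

# Concentration levels as stated for the GTB data set.
glucose_mM = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
triacetin_g_per_L = [1.4, 2.1, 2.8, 3.5]
bsa_g_per_L = [49.3, 64.4, 79.8, 94.7]

# Full factorial design: every combination of the three factors.
design = list(product(glucose_mM, triacetin_g_per_L, bsa_g_per_L))

# Stock concentrations as stated (mM for glucose, g/L for the others).
GLUCOSE_STOCK_mM = 50.0
TRIACETIN_STOCK_g_per_L = 35.0
BSA_STOCK_g_per_L = 190.0


def stock_volume_mL(target_conc, stock_conc, final_volume_mL=50.0):
    """Volume of stock needed so that, after dilution to
    final_volume_mL, the sample has target_conc (C1*V1 = C2*V2).
    target_conc and stock_conc must share the same units."""
    return target_conc * final_volume_mL / stock_conc


# Stock volumes for the most concentrated sample in the design.
glucose, triacetin, bsa = design[-1]
volumes = (
    stock_volume_mL(glucose, GLUCOSE_STOCK_mM),
    stock_volume_mL(triacetin, TRIACETIN_STOCK_g_per_L),
    stock_volume_mL(bsa, BSA_STOCK_g_per_L),
)
```

For the most concentrated sample, the three stock volumes sum to less than 50 mL, so every combination in the design is reachable by mixing the stocks and diluting to volume with buffer.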
Human serum samples were obtained from the Department of Pathology at the University of Iowa Hospitals and Clinics. Samples were collected from the general hospital population and analyzed for glucose content by the hospital clinical chemistry laboratory. The serum samples were then frozen until just prior to collection of the spectral data. The data set employed here was based on 238 samples spanning 160 unique (difference