Anal. Chem. 1998, 70, 623-627
An Enhanced Algorithm for Linear Multivariate Calibration Andrew J. Berger,* Tae-Woong Koo, Irving Itzkan, and Michael S. Feld
G. R. Harrison Spectroscopy Laboratory, Massachusetts Institute of Technology, Room 6-014, Cambridge, Massachusetts 02139
We present a new method of linear multivariate calibration that can generate better prediction results than those obtained by partial least squares (PLS). This is accomplished by incorporating the spectrum of the desired species into the calibration procedure. The method combines the advantages of different standard methods and is therefore called hybrid linear analysis (HLA). In side-by-side tests using both simulated and experimental data, HLA produced lower prediction errors than PLS in all instances. We recommend HLA over PLS in situations where the spectrum of the desired species is available. A common task in analytical chemistry and in medicine is to calculate the concentration of one species in a multicomponent sample. Generally, the presence of the other components makes it impossible to determine the concentration from a single measurement. Instead, multiple measurements are performed and a multivariate algorithm extracts the concentration from these values. The optimal choice of algorithm obviously depends upon how many measurements can be performed and what sorts of measurements they are. In this paper, we propose a new algorithm that, in many common situations, performs better than the algorithms currently available. Frequently, one can make certain assumptions of linearity about the spectral measurement process. First, one assumes that the spectrum observed is a linear combination of what we shall call “pure” spectra, one for each species present in the sample. Second, one assumes that the spectral response of the species of interest is linearly proportional to its concentration. Under such conditions, the concentration c can be predicted by the dot product of two vectors, the spectrum (i.e., set of measurements) s and a calibration vector b:
c ) s‚b
(1)
We shall use lowercase boldface type throughout the text to denote vectors and uppercase to denote matrices. Choice of the optimal linear multivariate calibration technique depends upon the extent of the calibration data, as mentioned above. For orientation, a brief summary of the basic methods is given here; for more detail, the reader is referred to some excellent tutorial articles.1,2 If the full set of pure spectra is * Author to whom correspondence should be addressed: (e-mail)
[email protected]. (1) Haaland, D. M.; Thomas, E. V. Anal. Chem. 1988, 60, 1193-1202. (2) Geladi, P.; Kowalski, B. R. Anal. Chim. Acta 1986, 185, 1-17. S0003-2700(97)00721-X CCC: $15.00 Published on Web 02/01/1998
© 1998 American Chemical Society
available, then the calibration vectors for all species are the columns of the matrix Pt(PPt)-1 (where P is a spectral matrix comprised of the pure spectra); this is called the ordinary leastsquares (OLS) solution. In many cases, however, the available spectra are not the pure ones but rather the spectra of a set of representative samples, called a calibration set. If the concentrations of all species in this set are known, then P can be estimated by the equation P ≈ (KtK)-1KtS (where K is a matrix of concentration values and S is a matrix of spectral intensities for the calibration mixtures), and one can then use OLS to obtain the calibration vectors; this is called the classical least-squares (CLS) solution. If, more restrictively, only one species’ concentration is known, then the calibration vector for that single species can be found via the regression techniques of partial least-squares (PLS) or principal component analysis (PCA) followed by regression (called PCR). These methods cannot deduce any of the pure spectra; instead, they compute b directly from the mixture spectra. For that reason, PLS and PCR are often called implicit models, whereas OLS and CLS are explicit ones. Implicit and explicit models each have their advantages. Because explicit models use information about every species, they are capable of providing superior calibrations. However, because implicit models require information only about a single species, they can be significantly easier to implement. In addition, explicit models suffer when a species is mischaracterized or omitted or when spectral artifacts such as baselines are present, whereas implicit models tend to be robust.3 In our group’s research on Raman spectroscopy of biological analyte mixtures, we have often found ourselves in the situation of knowing both the concentration and the pure spectrum of a single species such as glucose. This has meant that we have too much information for PLS or PCR, which ignores the pure spectrum, and too little for OLS or CLS, which requires more pure spectra or more concentrations, respectively. We have used PLS for our concentration predictions but have suspected that a better calibration algorithm can be found, one that allows us to take advantage of knowing the spectrum of glucose without requiring us to monitor other species. In this paper, we present a new calibration method that accomplishes this goal. For input, it requires standard PLS information (spectra of a calibration set plus associated concentrations of one species) plus the spectrum of the species of interest. In order for the method to be used effectively, this spectrum must be known with high accuracy. Because this method combines (3) Thomas, E. V.; Haaland, D. M. Anal. Chem. 1990, 62, 1091-1099.
Analytical Chemistry, Vol. 70, No. 3, February 1, 1998 623
the explicit-modeling advantage of knowing a pure spectrum with the implicit-modeling advantage of ignoring all other species, we call it hybrid linear analysis (HLA). One would expect such a method to produce predictions superior to PLS and PCR because more physical information is used for calibration. In the following section we derive the HLA method. In the following sections, we describe two studies in which PLS and HLA were used to analyze the same data. The first study uses numerically simulated mixture spectra and illustrates the full advantages of HLA calibration. The second study uses experimentally obtained near-infrared Raman spectra of aqueous mixtures of glucose, lactic acid, and creatinine and demonstrates the performance of HLA on actual spectral data. We will show that, in all instances in both studies, HLA generates lower average prediction errors than does PLS. This result suggests that HLA should be used instead of PLS when the pure spectrum of a species can be obtained. DERIVATION Below, we derive the HLA method of calculating b, the calibration vector for concentration predictions. The desired species is denoted by the letter A. As stated in the introduction, the input parameters for the HLA method are the following: S, a matrix containing the spectra of many representative samples, called a calibration set. Each row contains one spectrum. The number of samples must be at least as large as the number of pure species; k, a column vector containing the concentration of A in each calibration sample; and a, a row vector containing an accurate spectrum of A measured at unit concentration. Using S, k, and a, we solve for b as follows: 1. Remove from S all spectral contributions from A, leaving behind a matrix of spectra Sb:
Sb ) S - ka
(2)
This is equivalent to taking the spectra of the calibration samples after removing species A; only the background interferents then contribute, and the result is Sb. 2. Generate a matrix of orthogonal basis spectra V composed of the principal components of Sb, just as is done in PCA. All of the spectra contained in Sb can be modeled by those in V; in this sense, the V spectra act like pure spectra of the background species. 3. Subtract from a its projections onto each of the V vectors, leaving a residual spectrum r:
r ) a - aVtV
(3)
This residual r is the portion of a that cannot be modeled by the spectra contained in V; i.e., it is orthogonal to V. Any high-quality spectrum containing some amount of A can be used to generate the residual; a is a logical and convenient choice. 4. Normalize r to get b:
b ) r/(a‚r) 624
Analytical Chemistry, Vol. 70, No. 3, February 1, 1998
(4)
Figure 1. Pure unit spectra of G, L, and C for simulation study. Each spectrum corresponds to unit concentration. Spectra are offset for clarity.
We have used the idea that b is the contravariant vector to all of the pure spectra other than a.4,5 This means that, for any pure spectrum x of a background species, x‚b ) 0. Since these pure spectra are not known, this condition may seem impossible to guarantee. However, by ensuring in step 3 that b is orthogonal to all spectra in V, we indirectly ensure that it is also orthogonal to all pure background spectra, since these can be expressed as linear combinations of the principal components of Sb. It can easily be verified that b has the desired properties of a concentration prediction vector. First, a‚b ) 1, so b correctly predicts the concentration of A in the unit spectrum. Since the signal scales linearly, the prediction will be correct for any concentration of pure A. Next, (a + x)‚b ) 1 + 0, so the presence of other species does not affect the predictions. Since, by assumption, all spectra are linear combinations of a and the background spectra, b will correctly predict the concentration of A in any sample. Those familiar with the PLS and PCR know that the user has the freedom to select the number of prinicpal components used in the model; this is known as the calibration rank. This same freedom exists for HLA, since it uses PCA in step 2. EXPERIMENTAL SECTION 1. Simulated Mixture Spectra. Excerpts from the Raman spectra of glucose, lactic acid, and creatinine were used as a basis set to generate simulated mixture spectra. To emphasize that the mixtures were simulated and that no comparison with our experimental study below is intended, we will refer to the simulated species simply as G, L, and C; the spectra g, l, and c are shown in Figure 1. Each mixture spectrum was a linear combination of random amounts of g, l, and c plus Gaussian random numbers generated by MATLAB. Figure 2 shows an example. In these spectra, the noise amplitude was chosen to make the signal-to-noise ratio approximately 5:1 for the largest peaks. Thirty of the spectra were used for calibration; the other 30 were reserved for validation. The random concentrations were saved as reference values. (4) Lorber, A.; Kowalski, B. R. J. Chemom. 1988, 2, 67-79. (5) Sanchez, E.; Kowalski, B. R. J. Chemom. 1988, 2, 247-263.
Figure 2. Typical simulated mixture with computer-generated random noise.
Figure 3. Typical near-infrared Raman spectrum of PBS containing dissolved glucose, lactic acid, and creatinine. Ten-second acquisition.
PLS and HLA were used to create calibration vectors bP and bH for each species; PCR was not tested because its performance is usually similar but slightly inferior to that of PLS.3 We used the PLS algorithm described by Haaland and Thomas.1 In addition, the pure spectra g, l, and c were used together to generate error-free bO vectors via the OLS method; these represented ideal, perfect calibrations. The concentrations of G, L, and C in each validation sample were then predicted three times using eq 1, once for each version of b. The prediction uncertainty of each method was estimated by computing the root-meansquared error of prediction (RMSEP) for the 30 validation samples. In all, nine RMSEP values were created, one for each species predicted by each method. Because this was a simulation, no spectral binning was performed, but the ranks of the PLS and HLA calibrations were chosen to minimize RMSEP. All calculations were performed on a Sun SPARCstation using in-house MATLAB routines. 2. Experimental Mixture Spectra. We prepared 19 mixtures of glucose, lactic acid, and creatinine in phosphate-buffered saline (PBS) from 200 mM stock solutions. Concentrations of the species in the mixtures ranged from 0 to 66 mM; we estimate the error introduced by our pipetting technique to be about 0.5
Figure 4. Pure Raman spectra obtained by subtracting a spectrum of saline from spectra of stock solutions. Spectra are offset for clarity.
Figure 5. Calibration vectors for component G from analysis of simulated mixtures. Spectra are offset for clarity.
mM per analyte. We used a concentration design in which the concentrations of different species were randomized. From each sample, a near-infrared Raman spectrum was acquired in 10 s. A typical spectrum is shown in Figure 3. Separately, spectra of the three stock solutions and of PBS were acquired for 100 s. All samples were measured in 1-cm quartz cuvettes using a system similar to one we have previously described;6 the one used here had a more efficient collection geometry in order to increase signal. The pure spectra of dissolved glucose, lactic acid, and creatinine were obtained by subtracting the PBS spectrum from the stock solution spectra; Figure 4 shows the results. Residual offset and slope (due to background fluctuations) were manually removed in order to depict more realistic pure spectra; however, this was found to be irrelevant to the concentration predictions. These spectra were rescaled by their concentrations and integration times to correspond to 1 mM concentration and 10-s integration. Because the concentrations were not known to high precision, we varied the amplitudes of these spectra over a range (6) Berger, A. J.; Wang, Y.; Feld, M. S. Appl. Opt. 1996, 35, 209-213.
Analytical Chemistry, Vol. 70, No. 3, February 1, 1998
625
Figure 6. Prediction plots for simulated species L generated using bP (left), bH (center), and bO (right). In each case, the dashed line at 45 deg, the line of perfect prediction, is a visual guide, not a fit. Table 1. Comparison of RMSEP Values from PLS, HLA, and OLS Analysis of the Simulated Mixture Spectraa RMSEP values PLS f species PLS HLA OLS RMSEP(PLS-OLS) RMSEP(HLA-OLS) HLA reduction G L C
0.35 0.25 0.19 0.66 0.49 0.35 0.56 0.32 0.17
0.16 0.31 0.39
0.06 0.14 0.15
63 55 62
a RMSEP values are in arbitrary concentration units. The final column shows the percent reduction in switching from PLS to HLA.
Table 2. Comparison of RMSEP Values Generated by PLS and HLA in Analysis of Experimental Raman Spectraa RMSEP values (mM) component
PLS
HLA
PLS f HLA reduction
glucose lactic acid creatinine
1.9 3.6 1.2
1.1 2.6 1.0
42 28 20
a The final column shows the percent reduction in switching from PLS to HLA.
of 8% to produce optimal HLA predictions; we assume that the optimal predictions correlated with a true rescaling to exactly 1 mM. As in the simulation, PLS and HLA calibration and prediction were performed. OLS, however, could not be performed because of the uncharacterized background fluctuations, as explained in the discussion of explicit models in the introduction. All 19 spectra were analyzed; no data points were discarded as outliers. To optimize the use of limited data, leave-one-out cross-validation was employed, in which 18 samples were used to predict concentrations in the nineteenth, with each sample rotated out in turn. As in the simulated case, this procedure was performed for each of the three species. Spectral binning and calibration rank were varied in order to obtain optimal results for both methods. RESULTS The first three columns of Table 1 list the RMSEP values for each species predicted by each method in the simulation study. Because HLA incorporates the spectrum of the desired species into the calibration, we expect it to produce lower prediction uncertainties than PLS, as stated earlier. This is exactly what is observed: HLA generates lower RMSEP values than PLS in all three cases. We can assess the errors in bP and bH by referencing their RMSEP values to those of the error-free bO calibrations. In the second half of Table 1, we have taken the PLS and HLA results from the first half and subtracted the OLS results, producing the quantities RMSEP(PLS-OLS) and RMSEP(HLA-OLS). In the last column, we observe that HLA consistently reduced this amount by about 60%. This shows that HLA, by performing a better calibration than PLS, eliminated more than half of the prediction uncertainty that was possible to eliminate. It could not produce a perfect calibration because the spectra used to compute V contained noise. 626 Analytical Chemistry, Vol. 70, No. 3, February 1, 1998
Figure 7. Calibration vectors for prediction of glucose from experimental Raman spectra, derived using different algorithms. Vectors are offset for clarity.
The ability of HLA to perform better calibrations than PLS is most clearly seen in the comparison of the calibration vectors. Figure 5 shows the b vectors produced by PLS, HLA, and OLS for the prediction of species G; similar results were obtained for L and C. In each case, we see bH vectors that have substantially lower noise than bP and that closely approach the shape of the ideal bO vector. The added knowledge of a single pure spectrum allows HLA to overcome the noisiness of the calibration spectra (Figure 2) that limits PLS. With this advantage, HLA can improve the concentration predictions dramatically, as in Figure 6, which shows predictions of L in each sample by each of the three methods. The PLS predictions are substantially skewed toward the average concentration value, but the HLA predictions appear similar in quality to the OLS results.
Figure 8. Prediction plots for lactic acid generated using bP (left) and bH (right). The dashed line at 45 deg represents the line of perfect prediction; it is a visual guide, not a fit.
The RMSEP values generated by PLS and HLA in the experimental study are shown in Table 2. We once again observe results consistent with our expectation that the RMSEP should decrease when the extra calibration information in HLA is used. The last column lists the percentage decreases in switching from PLS to HLA; they range from 20 to 42%. These decreases are not OLS-subtracted as in Table 1 because no OLS result is available; we note that the percentages would increase if the data were OLS-subtracted. An attempt to estimate the OLS results7 was inconclusive. The bP and bH calibration vectors for glucose are shown in Figure 7. The two vectors look very similar, as they also do for the other two species. Optimal predictions results were achieved by grouping 10 pixels together for glucose and lactic acid and seven for creatinine (because it has sharper, higher peaks, as shown in Figure 4), which is why these spectra have such lower resolution. The similarity of all three b pairs indicates that, for experimental data as well as simulated data, HLA uses the same spectral features as PLS to extract information. The predictions of lactic acid concentration are shown in Figure 8. Here, we observe that HLA substantially improved the prediction of about five samples while leaving the rest about the same. Improvements were observed in all RMSEP values, which supports our assertion that HLA will perform better than PLS as long as a high-quality pure spectrum of the species of interest can be obtained. DISCUSSION The motivation behind this work has been that calibrations can be improved by adding additional information. When the spectrum of a single species is known, neither PLS nor OLS can (7) Berger, A. J.; Feld, M. S. Appl. Spectrosc. 1997, 51, 725-732. (8) The procedure is as follows: First, generalize step 1 of the derivation in the Derivation section to remove spectral contributions from multiple species. Principal components of the residual spectra will then describe only the remaining (unknown) background constituents. Using the technique of step 3, orthogonalize the known background spectra to the principal components and to each other. These orthogonalized spectra are then included in the ensemble of spectra V, along with the principal components. In other words, part of V is now constructed from known line shapes, while only the unknown ones are derived from PCA. The derivation then proceeds as before with steps 3 and 4 to calculate the b vector.
incorporate the information in a productive manner. The HLA algorithm, in contrast, uses the spectrum in a physical way, allowing us to separate the spectral contributions of the desired species from those of the background. This enables HLA to construct a calibration vector b without having to use regression, which makes the process both more intuitive and more robust. By testing HLA side by side with PLS on both simulated and experimental data, we have shown that HLA consistently generates lower prediction errors, which is to be expected since it uses more information in the calibration stage. We therefore suggest that HLA should be used instead of PLS in situations where the pure spectrum of the desired species can be obtained. We note that the HLA method naturally extends to situations in which the pure spectra and concentrations of one or more background species are also known.8 As the number of known background spectra increases, HLA relies less and less on the calibration spectra. This makes the method more accurate, since noise in the calibration spectra is what limits the accuracy of b. Ideally, the spectra of all known species should be gathered, leaving only spectral artifacts to be treated as background. In conclusion, the advantages of HLA are that it is easier to understand than PLS, easier to implement than OLS, and able to incorporate all of the known pure spectra in a robust manner, which none of the algorithms currently in use are able to do. Even in situations where PLS provides a satisfactory RMSEP, it may be that HLA can provide an equivalent RMSEP using calibration data that require much less total collection time. We emphasize that the use of HLA is not limited to Raman spectra; for example, its application to the task of noninvasive blood glucose measurement is worthy of investigation. ACKNOWLEDGMENT This work was carried out at the MIT Laser Biomedical Research Center, supported by NIH Grant P41-RR02594.
Received for review July 7, 1997. Accepted October 22, 1997.X AC970721P X
Abstract published in Advance ACS Abstracts, December 15, 1997.
Analytical Chemistry, Vol. 70, No. 3, February 1, 1998
627