Anal. Chem. 1996, 68, 2392-2400
Wavelength Selection for Simultaneous Spectroscopic Analysis. Experimental and Theoretical Study
Liang Xu and Israel Schechter*
Department of Chemistry, Technion-Israel Institute of Technology, Technion City, Haifa 32,000 Israel
Simultaneous multicomponent analysis is usually carried out by multivariate calibration models (such as principal component regression) that utilize the full spectrum. We demonstrate, by both experimental and theoretical considerations, that better results can be obtained by a proper selection of the spectral range to be included in calculations. We develop the theory that models the analytical uncertainty in multicomponent analysis and show the conditions where wavelength selection is essential (for example, when considerable spectral overlapping exists). An error indicator function is developed to predict the analytical performance under given experimental conditions, using a certain spectral range. This function is applied to locate the most informative spectral ranges to be utilized in multicomponent analysis. Selection of spectral ranges by this method is shown to ensure optimal results that considerably improve analytical performance in some cases. The similarity between the results obtained by this function and actual experimental results proves the validity of the proposed error indicator for wavelength selection. In addition to the experimental examples, extensive computer simulations have been carried out in order to study the validity of the theory over a wide range of the relevant parameters.

Multicomponent spectral analysis is now gaining popularity. Large amounts of spectral data can be collected by modern instruments (such as photodiode arrays and CCD detectors), and several mathematical approaches have been designed to deal with these overdetermined systems.1-6 Early applications used complete spectra for determination of all components in a mixture.
Rossi and Pardue showed, however, that accuracy can be improved by a careful selection of wavelength ranges, which reduces collinearity or spectral overlap.7 The decision of how many data points and which wavelengths should be included in the analytical process is not trivial and is usually an empirical choice.8 Various criteria have already been developed to allow for automatic wavelength selection.8-25 Among the proposed methods, the determinant and the condition number of the calibration matrix are the most preferred criteria for prediction of the best wavelength combination.8-12 Nevertheless, these two criteria are principally designed for exactly determined systems, although some modifications have been made to deal with overdetermined systems. [Most applications use the root mean square error of prediction (RMSEP) to characterize the performance of the principal component regression (PCR) procedure, carried out under optimized conditions as obtained by the above criteria.] The major disadvantage of the current criteria is that all components are examined together with the same matrix, while in most cases the optimal conditions for determination of each component are different and should be optimized separately.

Recently, two stochastic search heuristics, namely, simulated annealing and genetic algorithms, have been suggested for wavelength selection. They are supposed to reach the global minimum and to overcome the problems of multivariate minimizations.26-29 The applications of these methods, however, are still in their infancy.30-32

Meanwhile, PCR and partial least squares (PLS) have been introduced as full-spectrum methods. They provide considerable improvement in analytical precision and accuracy as compared to other methods that are restricted to a small number of data points (e.g., inverse least squares).33-38 It is generally believed that these methods eliminate the necessity of wavelength selection. A theoretical proof (based on certain assumptions) was given that addition of sensors or wavelengths always improves the results.39-42 Nevertheless, more and more evidence, from simulations and from experiments, shows that these full-spectrum algorithms could also benefit from wavelength selection, particularly for difficult (overlapping) multicomponent systems (see, for example, refs 13 and 28). It is realized that there are practical spectral situations that are not included in the simple assumptions made in previous theoretical studies.41,42 In all these cases, where full-spectrum multicomponent calibration methods are applied, it is necessary to reconsider the modeling of the prediction error.

In the following, we demonstrate the necessity of wavelength selection in simultaneous multicomponent analysis and we develop a suitable criterion for carrying out this task. In this paper, the possibility of determining a component is discussed in terms of the net analyte signal (NAS), which has been developed by Lorber.39 The advantage of this approach is that the possibility of simultaneous determination can be studied separately for each component and optimization can be carried out accordingly. A mathematical approach is presented to study the uncertainty in the NAS of a component of interest, and a new mathematical function is derived to model its variance. We show that the analytical performance is governed to a large extent by the uncertainty in the NAS of a component. Therefore, the derived function, which is referred to as the error indicator in the following, can be used as a criterion for wavelength selection. The proposed method is tested with both simulated and experimental spectral data. The actual analytical performance is evaluated from prediction errors obtained with the PCR calibration method.

S0003-2700(95)01142-5 CCC: $12.00 © 1996 American Chemical Society
2392 Analytical Chemistry, Vol. 68, No. 14, July 15, 1996

(1) Kowalski, B. R.; Seasholtz, M. B. J. Chemom. 1991, 5, 129.
(2) Martens, H.; Naes, T. Multivariate Calibration; Wiley: New York, 1989.
(3) Malinowski, E. R. Factor Analysis in Chemistry; Wiley: New York, 1991.
(4) Otto, M.; Wegscheider, W. Anal. Chem. 1985, 57, 63.
(5) Lindberg, W.; Persson, J. A.; Wold, S. Anal. Chem. 1983, 55, 643.
(6) Jolliffe, I. T. Principal Component Analysis; Springer-Verlag: New York, 1986.
(7) Rossi, D. T.; Pardue, H. L. Anal. Chim. Acta 1985, 175, 153.
(8) Mark, H. Appl. Spectrosc. 1988, 42, 1427.
(9) Jochum, C.; Jochum, P.; Kowalski, B. R. Anal. Chem. 1981, 53, 85.
(10) Juhl, L. L.; Kalivas, J. H. Anal. Chim. Acta 1986, 187, 347.
(11) Juhl, L. L.; Kalivas, J. H. Anal. Chim. Acta 1988, 207, 125.
(12) Otto, M. Anal. Chim. Acta 1986, 180, 445.
(13) Otto, M.; George, T. Anal. Chim. Acta 1987, 200, 379.
(14) Frans, S. D.; Harris, J. M. Anal. Chem. 1985, 57, 2680.
(15) Bergmann, G.; Oepen, B. V.; Zinn, P. Anal. Chem. 1987, 59, 2522.
(16) Kaiser, H. Z. Anal. Chem. 1972, 260, 252.
(17) Ebel, S.; Glaser, E.; Abdulla, S.; Steffens, U.; Walter, V. Z. Anal. Chem. 1982, 313, 24.
(18) Junker, A.; Bergmann, G. Z. Anal. Chem. 1974, 272, 267.
(19) Junker, A.; Bergmann, G. Z. Anal. Chem. 1976, 278, 191.
(20) Junker, A.; Bergmann, G. Z. Anal. Chem. 1976, 278, 273.
(21) Honigs, D. E.; Freelin, J. M.; Hieftje, G. M.; Hirschfeld, T. B. Appl. Spectrosc. 1983, 37, 491.
(22) Thijssen, P. C.; Kateman, G.; Smit, H. Anal. Chim. Acta 1984, 157, 99.
(23) Morgan, D. R. Appl. Spectrosc. 1977, 31, 415.
(24) Warren, F. V., Jr.; Bidlingmeyer, B. A.; Delaney, M. F. Anal. Chem. 1987, 59, 1890.
(25) Smeyers-Verbeke, J.; Detaevernier, M. R.; Massart, D. L. Anal. Chim. Acta 1986, 191, 181.
(26) Li, T. H.; Lucasius, C. B.; Kateman, G. Anal. Chim. Acta 1992, 268, 123.
(27) Lucasius, C. B.; Beckers, M. L. M.; Kateman, G. Anal. Chim. Acta 1994, 286, 135.
(28) Kalivas, J. H.; Roberts, N.; Sutter, J. M. Anal. Chem. 1989, 61, 2024.
(29) Sutter, J. M.; Kalivas, J. H. Anal. Chem. 1991, 63, 2386.
(30) Kalivas, J. H.; Sutter, J. M. Anal. Chem. 1992, 64, 1200.
(31) Tucker, E. E. Anal. Chem. 1992, 64, 1199.
(32) Horchner, U.; Kalivas, J. H. Anal. Chim. Acta 1995, 311, 1.
(33) Haaland, D. M.; Thomas, E. V. Anal. Chem. 1988, 60, 1193.
(34) Haaland, D. M.; Thomas, E. V. Anal. Chem. 1988, 60, 1202.
(35) Thomas, E. V.; Haaland, D. M. Anal. Chem. 1990, 62, 1091.
(36) Otto, M.; Thomas, J. D. R. Anal. Chem. 1985, 57, 2647.
(37) Geladi, P.; Kowalski, B. R. Anal. Chim. Acta 1986, 185, 1.
(38) Kisner, H. J.; Brown, C. W.; Kavarnos, G. J. Anal. Chem. 1983, 55, 1703.
(39) Lorber, A. Anal. Chem. 1986, 58, 1167.
(40) Booksh, K. S.; Kowalski, B. R. Anal. Chem. 1994, 66, 782A.
(41) Lorber, A.; Kowalski, B. R. J. Chemom. 1988, 2, 93.
(42) Lorber, A.; Kowalski, B. R. J. Chemom. 1988, 2, 67.

THEORY
1. Theoretical Background. Conventional notation has been adopted in the following discussion: boldface capital letters are used for matrices and boldface lowercase letters for vectors. Superscript T designates vector or matrix transposition, and superscript + denotes the pseudoinverse of the nonsquare matrix of an overdetermined system. According to factor analysis conventions,3 the data matrix $\mathbf{D}$, obtained from experimental measurements, is related to matrix $\mathbf{R}$, representing the unit response of the components in a mixture, and to matrix $\mathbf{C}$, representing the required concentrations of the components:

$$\mathbf{D} = \mathbf{R}\mathbf{C} \tag{1}$$

Since mean centering is generally applied as a preliminary data treatment, the background contribution term is omitted in the above equation. During calibration, matrix $\mathbf{R}$ can be obtained as follows:

$$\mathbf{R} = \mathbf{D}\mathbf{C}^{+} \tag{2}$$

where $\mathbf{C}^{+}$ is the pseudoinverse of the concentration matrix $\mathbf{C}$ and can be calculated either by a normal least-squares method or by a singular-value decomposition procedure. Thus, an unknown concentration vector $\mathbf{c}_{un}$ can be predicted from the measured unknown sample response $\mathbf{d}_{un}$ according to the following equation:

$$\mathbf{c}_{un} = \mathbf{R}^{+}\mathbf{d}_{un} \tag{3}$$

From eq 2 the pseudoinverse $\mathbf{R}^{+}$ is equal to $\mathbf{C}\mathbf{D}^{+}$;3,41 thus the following expression is derived for the unknown concentration vector:

$$\mathbf{c}_{un} = \mathbf{C}\mathbf{D}^{+}\mathbf{d}_{un} \tag{4}$$

The above equation is referred to as the direct calibration model, where the concentrations of all components are known. In fact, only one row of $\mathbf{R}^{+}$ is needed for the prediction of one component, as shown in eq 3. Therefore, eq 4 can be rearranged in the following form:

$$c_{un,k} = \mathbf{c}_{k}^{T}\mathbf{D}^{+}\mathbf{d}_{un} \tag{5}$$

where the vector $\mathbf{c}_{k}^{T}$ consists of the known concentrations of the kth component in calibration. As compared to eq 4, the above equation is referred to as the indirect calibration model because it allows the calibration of one component in a multicomponent system, where only the concentrations of the component of interest are known in calibration. The term $\mathbf{c}_{k}^{T}\mathbf{D}^{+}$ is generally recognized as the regression vector $\mathbf{b}_{k}$ for the corresponding component:

$$\mathbf{b}_{k}^{T} = \mathbf{c}_{k}^{T}\mathbf{D}^{+} \tag{6}$$

This is the normal way to obtain $\mathbf{b}$ from $\mathbf{D}$ and $\mathbf{c}$ during calibration, where $\mathbf{D}$ usually undergoes singular-value decomposition to generate its pseudoinverse $\mathbf{D}^{+}$. In order to estimate the error in prediction of concentrations, the errors in $\mathbf{d}$, $\mathbf{D}^{+}$, and $\mathbf{c}_{k}$ should be taken into account, as can be seen from eq 5. The major difficulty of this approach is the inversion of $\mathbf{D}$ because the pseudoinverse of the sum of two matrices is not equal to the sum of the pseudoinverse matrices. In previous studies41,42 it was assumed that errors in $\mathbf{D}$ only affect eigenvalues and can be represented by errors in the known concentrations during calibration, while the space spanned by the entire set of eigenvectors is correct. The following equation has been derived to predict the concentration variance:41

$$\operatorname{var}(c_{un,k}) = \sum_{i=1}^{n} b_{i,k}^{2}\,\operatorname{var}(d_{un,i}) + \kappa \sum_{i=1}^{m} h_{i,un}^{2}\,\operatorname{var}(c_{k,i}) \tag{7}$$

where n is the number of data points and m is the number of samples in calibration; var( ) stands for the variance of a variable; $b_{i,k}$ are the elements of the regression vector $\mathbf{b}_{k}$; $h_{i,un}$ are the elements of the vector $\mathbf{h}_{un}$, defined as $\mathbf{h}_{un} = \mathbf{D}^{+}\mathbf{d}_{un}$. The coefficient $\kappa$ is equal to the ratio of the variance of the concentration model to $\operatorname{var}(c_{k,i})$. The variance of the model can be computed from concentration residuals, $\Delta\mathbf{c} = (\mathbf{I} - \mathbf{D}^{+}\mathbf{D})\mathbf{c}_{k}$.41 The variances $\operatorname{var}(d_{un,i})$ and $\operatorname{var}(c_{k,i})$ are the variances of the responses of the unknown sample and of the concentration of the kth component in the calibration set, respectively. This equation has
been employed for wavelength selection by minimizing the prediction error. For a specific calibration procedure with a proper experimental design, the prediction error is determined by the norm of the regression coefficients, as shown in the above equation. Mathematical proof has been given that the norm decreases when wavelengths are added, which eventually leads to the conclusion that addition of wavelengths always improves prediction quality.42 This is the basis of the general belief that the whole spectrum (within practical or economical limitations) should be used in PCR models. In the following we show that the assumptions that led to this conclusion are not always valid.

2. A New Model for the Uncertainty in the Norm of Net Analyte Signal. In the present approach, the concept of net analyte signal is employed in order to investigate the possibility of determining a component in a mixture. The NAS has been developed under the assumption that pure spectra of all components are known. It reveals the possibility of determining a component in a multicomponent system (with either the classical or the inverse model), as well as the internal relationship between these models. The net analyte signal is defined as the part of a spectrum that is orthogonal to the subspace spanned by the spectra of all other components. For the kth component with known concentration $c_{0,k}$ and spectrum $\mathbf{d}_{0}$, the net analyte signal can be calculated by the following projection:39

$$\mathbf{d}_{0,net} = (\mathbf{I} - \mathbf{R}_{k}\mathbf{R}_{k}^{+})\mathbf{d}_{0} = \mathbf{d}_{0} - \mathbf{R}_{k}\mathbf{R}_{k}^{+}\mathbf{d}_{0} \tag{8}$$

where $\mathbf{I}$ is the unit matrix and $\mathbf{R}_{k}$ is the sensitivity matrix consisting of the spectra of all pure components excluding the kth component. The net analyte signal of the kth component in an unknown sample can be found in the same way:

$$\mathbf{d}_{un,net} = (\mathbf{I} - \mathbf{R}_{k}\mathbf{R}_{k}^{+})\mathbf{d}_{un} \tag{9}$$

Therefore, the quantitation of the kth component can be accomplished by solving the following equation, which is obtained by combining eqs 8 and 9:

$$c_{0,k}\,\mathbf{d}_{un,net} = c_{un,k}\,\mathbf{d}_{0,net} \tag{10}$$

Transformation from vector to scalar is carried out by multiplying the vector by its transpose. Multiplying both sides of the above equation by the transpose of $\mathbf{d}_{0,net}$ leads to

$$c_{un,k} = c_{0,k}\,\frac{\mathbf{d}_{0,net}^{T}\mathbf{d}_{un,net}}{\|\mathbf{d}_{0,net}\|^{2}} = \frac{c_{0,k}}{\|\mathbf{d}_{0,net}\|}\cdot\frac{\mathbf{d}_{0,net}^{T}\mathbf{d}_{un,net}}{\|\mathbf{d}_{0,net}\|} \tag{11}$$

Accordingly, by multiplying both sides of eq 10 by the transpose of $\mathbf{d}_{un,net}$, the following has been obtained:39

$$\|\mathbf{d}_{un,net}\| = \frac{\mathbf{d}_{un,net}^{T}\mathbf{d}_{0,net}}{\|\mathbf{d}_{0,net}\|} \tag{12}$$

Inserting the above equation into eq 11 results in

$$c_{un,k} = \frac{c_{0,k}}{\|\mathbf{d}_{0,net}\|}\,\|\mathbf{d}_{un,net}\| = \left(\frac{\|\mathbf{d}_{0,net}\|}{c_{0,k}}\right)^{-1}\|\mathbf{d}_{un,net}\| \tag{13}$$

where $(\|\mathbf{d}_{0,net}\|/c_{0,k})^{-1}$ is considered as the sensitivity factor. The reason for introducing the above equation is to show that the norm of the net analyte signal is really of importance for the quantitation of a component. Actually, eq 5 is more generally applicable than eq 13, which requires explicit knowledge of the pure spectra. Equation 13, however, shows that the possibility of determining the concentration depends on the quality of the net analyte signal of the component or, more exactly, on the relative error of the norm of the net analyte signal. Therefore, eq 13 offers a possible way to optimize analytical conditions by minimizing the relative error in the norm of the net analyte signal. It can be seen from eqs 8 and 9 that the orthogonal part is obtained by subtracting the projection on the subspace of the other components from the original vector. For the unknown sample response, $\mathbf{d}_{un}$, we have

$$(\mathbf{I} - \mathbf{R}_{k}\mathbf{R}_{k}^{+})\mathbf{d}_{un} = \mathbf{d}_{un} - \mathbf{R}_{k}\mathbf{R}_{k}^{+}\mathbf{d}_{un} \tag{14}$$

Therefore, the uncertainty in the net analyte signal $\mathbf{d}_{un,net}$ is not less than that of the measured signal $\mathbf{d}_{un}$, even if the matrix $\mathbf{R}_{k}$ is errorless. For simplicity, the error in $\mathbf{d}_{un,net}$ is approximated by the error in $\mathbf{d}_{un}$. It is clear that errors in the measured signal are transferred to the corresponding net analyte signal, as revealed by eq 14. Since the length of the net analyte signal vector is utilized for quantitation, as shown in eq 13, the effect of the error on the norm is our major concern in this study. The calculated net analyte signal can be expressed as follows:

$$\mathbf{d}_{net} = \mathbf{d}_{net,tru} + \Delta\mathbf{d}_{net} \tag{15}$$

The first term $\mathbf{d}_{net,tru}$ stands for the true vector and the second one for the error. Thus, the following can be obtained:

$$\|\mathbf{d}_{net}\|^{2} - \|\mathbf{d}_{net,tru}\|^{2} = 2\Delta\mathbf{d}_{net}^{T}\mathbf{d}_{net,tru} + \|\Delta\mathbf{d}_{net}\|^{2} \tag{16}$$

It can be seen that the difference between the squared norm of the calculated net analyte signal and that of the true net analyte signal consists of two contributions. The first term can be expanded as

$$2\Delta\mathbf{d}_{net}^{T}\mathbf{d}_{net,tru} = 2\sum_{i=1}^{n} d_{i}\,\Delta d_{i} \tag{17}$$
where $d_{i}$ are the elements of the true net analyte signal vector $\mathbf{d}_{net,tru}$ and $\Delta d_{i}$ are the elements of the corresponding error vector $\Delta\mathbf{d}_{net}$. If errors in all measurements are normally distributed, with the same standard deviation $\sigma$, the first part is a Gaussian random number with a standard deviation of $2\|\mathbf{d}_{net,tru}\|\sigma$ and is hereafter referred to as the Gaussian part of the total difference. The second part is the squared norm of the error vector:

$$\|\Delta\mathbf{d}_{net}\|^{2} = \sum_{i=1}^{n} \Delta d_{i}^{2} \tag{18}$$
Obviously, this part follows a $\chi^2$ distribution with a mathematical expectation of $n\sigma^2$ and is referred to as the $\chi^2$ part hereafter. Two extreme situations are worth discussing. First, when the norm of $\mathbf{d}_{net,tru}$ is large enough, the $\chi^2$ part can be omitted. This approximation is usually termed the first-order approximation, where one error term that is multiplied by another error term (by itself in the present case) is ignored.41 Actually, the $\chi^2$ term in eq 16 can be dropped only if the multicomponent system is well-conditioned. The importance of the $\chi^2$ part will be shown in the following. For a well-conditioned system, the norm of the net analyte signal is sufficiently large; therefore, the Gaussian part dominates the difference between the two squared norms in eq 16. Only in this case is the standard deviation of the difference of the two squared norms approximately equal to the standard deviation of the Gaussian part. The second extreme situation is when the norm of the net analyte signal becomes smaller and smaller, and the contribution from the $\chi^2$ part becomes comparable to the Gaussian part. It means that, for a system with a small net analyte signal, which corresponds to an ill-conditioned system (i.e., severe spectral overlap), the $\chi^2$ part cannot be omitted. In fact, when the norm of the net analyte signal approaches zero, the difference of the two squared norms is governed by the $\chi^2$ part. In this case, the standard deviation of the difference is taken as equivalent to the expected value of the $\chi^2$-distributed random number. It is desirable to obtain an expression for the standard deviation over the whole range of the net analyte signal, assuming that the contribution from the $\chi^2$ part can be represented by Gaussian noise with a standard deviation of $n\sigma^2$. In other words, the sum of the two random numbers in eq 16 can be represented approximately as a combined Gaussian distribution.
Therefore, the variance of the difference between the two squared norms is given by the standard deviations of the combined Gaussians:
$$\operatorname{var}\left(\|\mathbf{d}_{net}\|^{2} - \|\mathbf{d}_{net,tru}\|^{2}\right) = (2\|\mathbf{d}_{net,tru}\|\sigma)^{2} + (n\sigma^{2})^{2} \tag{19}$$
Once the variance of the difference between the two squared norms is obtained, the calculation of the variance for net analyte signal is straightforward, because eq 16 can be rearranged as
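$$\|\mathbf{d}_{net}\| - \|\mathbf{d}_{net,tru}\| = \frac{2\Delta\mathbf{d}_{net}^{T}\mathbf{d}_{net,tru} + \|\Delta\mathbf{d}_{net}\|^{2}}{\|\mathbf{d}_{net}\| + \|\mathbf{d}_{net,tru}\|} \tag{20}$$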
Obviously, the variance of the norm of net analyte signal is
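$$\operatorname{var}\left(\|\mathbf{d}_{net}\| - \|\mathbf{d}_{net,tru}\|\right) = \frac{(2\|\mathbf{d}_{net,tru}\|\sigma)^{2} + (n\sigma^{2})^{2}}{\left(\|\mathbf{d}_{net}\| + \|\mathbf{d}_{net,tru}\|\right)^{2}} \tag{21}$$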
The above two equations have been derived using certain approximations that must be justified first. Their validity is demonstrated in the following by extensive simulations.
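The projection and quantitation steps of eqs 8, 9, and 13 can be illustrated with a minimal sketch. It assumes a two-component system, so that the matrix of the other components reduces to a single interferent spectrum and the projection needs no general pseudoinverse. The band positions, the concentrations, and all function names are hypothetical choices made only for this illustration; they are not taken from the original programs.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm of a sequence."""
    return math.sqrt(dot(u, u))


def net_analyte_signal(d, interferent):
    """Eqs 8 and 9 for ONE interferent (assumption): subtract from d its
    projection onto the interferent spectrum r, i.e. r (r.d) / (r.r)."""
    scale = dot(interferent, d) / dot(interferent, interferent)
    return [di - scale * ri for di, ri in zip(d, interferent)]


def gaussian_band(x, center, sd=1.0):
    """Hypothetical band shape used only for this illustration."""
    return math.exp(-0.5 * ((x - center) / sd) ** 2)


wavelengths = [0.1 * i for i in range(101)]              # arbitrary units
pure_a = [gaussian_band(x, 4.0) for x in wavelengths]    # analyte at c0 = 1
pure_b = [gaussian_band(x, 5.5) for x in wavelengths]    # interferent

c0 = 1.0
# Noise-free mixture with hypothetical concentrations 0.7 (A) and 0.3 (B).
mixture = [0.7 * a + 0.3 * b for a, b in zip(pure_a, pure_b)]

d0_net = net_analyte_signal(pure_a, pure_b)      # eq 8
dun_net = net_analyte_signal(mixture, pure_b)    # eq 9
c_estimate = c0 * norm(dun_net) / norm(d0_net)   # eq 13
```

In this noise-free case the estimate recovers the analyte concentration, because the interferent contribution is removed entirely by the projection. With noisy data the same ratio of norms carries the uncertainty modeled by eq 21.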
3. The Error Indicator as a Criterion for Wavelength Selection. It is assumed in the present approach that the prediction error in multivariate analysis is determined by the quality of the corresponding net analyte signal. Therefore, the relative error in the norm of the net analyte signal can be calculated in order to evaluate the analytical performance under specific experimental conditions. For convenience, this function is referred to as the error indicator (EI):
$$EI = \frac{\left[\operatorname{var}\left(\|\mathbf{d}_{net}\| - \|\mathbf{d}_{net,tru}\|\right)\right]^{1/2}}{\|\mathbf{d}_{net,tru}\|} \tag{22}$$
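As a minimal sketch, eqs 21 and 22 can be combined into a single scalar function. The sketch assumes that the sum of the calculated and true norms in the denominator of eq 21 is approximated by twice the available norm, and that the noise level is supplied by the caller. The function names are illustrative only.

```python
import math


def norm_variance(nas_norm, n_points, sigma):
    """Eq 21, assuming ||d_net|| + ||d_net,tru|| is about 2 * nas_norm."""
    gaussian_part = (2.0 * nas_norm * sigma) ** 2
    chi2_part = (n_points * sigma ** 2) ** 2
    return (gaussian_part + chi2_part) / (2.0 * nas_norm) ** 2


def error_indicator(nas_norm, n_points, sigma):
    """Eq 22: relative error in the norm of the net analyte signal."""
    return math.sqrt(norm_variance(nas_norm, n_points, sigma)) / nas_norm


# Same NAS norm and noise level, but ten times more data points:
# the chi-square part grows and the indicator gets worse.
ei_few_points = error_indicator(0.5, 100, 0.01)
ei_many_points = error_indicator(0.5, 1000, 0.01)
```

For a large norm the variance tends to the squared measurement noise, while for a small norm the term that depends on the number of data points takes over. This is why adding uninformative wavelengths raises the indicator.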
The full expression can be obtained by inserting eq 21 into the above equation. In practice, the norm value of the true net analyte signal, which is unknown, can simply be replaced by the measured one. The error in the measurements can be estimated as39
$$\sigma = \left[\frac{\mathbf{d}^{T}(\mathbf{I} - \mathbf{R}\mathbf{R}^{+})\mathbf{d}}{n - m}\right]^{1/2} \tag{23}$$
where n is the number of data points and m is the number of components. It should be pointed out that this error indicator is developed for the purpose of wavelength selection, which is carried out by minimizing the relative error in the net analyte signal. Therefore, eq 21 is not expected to predict the exact or absolute analytical errors; only the relative behavior is of importance for the purpose of wavelength selection.

EXPERIMENTAL SECTION
Fluorescence measurements were carried out with a portable experimental setup. The third harmonic of a pulsed Nd:YAG laser (Minilite, by Continuum) was used as the excitation source. It delivered 15 mJ at 355 nm, at a repetition rate of 10 Hz. Pulse duration was 4 ns. The spectral signals were collected by a 200 mm optical fiber and detected by a plug-in spectrometer (PC1000, Ocean Optics), which was installed directly into a computer expansion slot. In this arrangement, spectra were recorded using a linear CCD detector of 1024 elements. All spectra were measured at a fixed wavelength interval of 0.16 nm, digitized (8 bits ATD), and transferred directly to the PC memory.

Stock solutions of 1-pyrenebutyric acid (PbA, 9.1 × 10⁻⁵ M), 1-pyrene-carboxylic acid (PcA, 1.0 × 10⁻⁴ M), and pyrene-1-sulfonic acid (PsA, 8.0 × 10⁻⁵ M) were prepared by dissolving the corresponding reagent in dilute KOH. Sample solutions were prepared by transferring different volumes of the three components, ranging from 0 to 2 mL (with a mean of 1 mL), and then adding sufficient water to make a total volume of 25 mL. The pH of all solutions was adjusted to 12. All reagents were analytical grade and used as received. Distilled water was used throughout.

COMPUTER PROGRAMS
All computer programs for spectral simulations and for data analysis were written in FORTRAN 77 and run on a UNIX workstation and on a personal computer (80486 processor).
Simulated vectors of the net analyte signals were generated with pseudorandom numbers,43 evenly distributed in the range of 0-1. Various intensities were obtained by multiplying the random numbers by a parameter. Simulated spectra were produced as a sum of Gaussian functions, calculated at constant intervals. Peak separations were indicated by a resolution factor, $R_s = \Delta\lambda_{max}/4\sigma$, where $\Delta\lambda_{max}$ is the difference in the means and $\sigma$ is the standard deviation of the corresponding Gaussian peaks.14 Normally distributed noise at a given level was generated by the Box-Muller algorithm43 and added to the exact data. Many concentration combinations were generated for each sample, with the random number ranging from 0 to 1 with a mean value of 0.5.

(43) Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; Vetterling, W. T. Numerical Recipes: The Art of Scientific Computing; Cambridge University Press: Cambridge, U.K., 1990.

For each spectral condition under investigation (characterized by a given sum of Gaussians, a given measurement range, and data points), the analytical performance was studied using the PCR procedure, as follows: First, 30 different samples, under the same spectral conditions but with different component concentrations, were generated. This set of spectra was used as the calibration set for the PCR model development. Then, another 100 spectra were generated and used for evaluation of the PCR prediction capabilities under the same conditions. A relatively large number of samples were used both for calibration and for prediction, in order to minimize the effects of randomly selected values. We used a PCR algorithm previously described in the literature.33-35 In the present study, a centering procedure was used as pretreatment of raw data during calibration. Cross-validation was employed in the PCR algorithm, and the optimal number of factors was determined automatically from the prediction error sum of squares (PRESS), through an F-statistic test. Outliers in calibration sets were automatically detected both from the concentration and the absorption F-ratio.
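The data generation described above can be sketched as follows. This is an illustrative Python outline, not the original FORTRAN 77 code. The peak positions are those quoted in the Figure 2 caption, and the grid range, the seed, and all function names are assumptions made for the example.

```python
import math
import random


def resolution_factor(center_a, center_b, sd):
    """Rs = (difference of the peak maxima) / (4 * standard deviation)."""
    return abs(center_a - center_b) / (4.0 * sd)


def gaussian_peak(x, center, sd=1.0):
    """Gaussian peak of unit height."""
    return math.exp(-0.5 * ((x - center) / sd) ** 2)


def box_muller(rng, sigma):
    """One normal deviate (mean 0, sd sigma) from two uniform deviates."""
    u1 = 1.0 - rng.random()      # shifted to (0, 1] so that log(u1) is finite
    u2 = rng.random()
    return sigma * math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def simulate_set(pure_spectra, n_samples, sigma, rng):
    """Return (concentrations, spectra) for n_samples noisy mixtures."""
    concentrations, spectra = [], []
    n_points = len(pure_spectra[0])
    for _ in range(n_samples):
        conc = [rng.random() for _ in pure_spectra]   # uniform in 0-1
        exact = [sum(c * s[i] for c, s in zip(conc, pure_spectra))
                 for i in range(n_points)]
        spectra.append([v + box_muller(rng, sigma) for v in exact])
        concentrations.append(conc)
    return concentrations, spectra


rng = random.Random(1996)                       # hypothetical seed
grid = [0.01 * i for i in range(1001)]          # assumed range 0-10
pure_a = [gaussian_peak(x, 2.5) + gaussian_peak(x, 7.5) for x in grid]
pure_b = [gaussian_peak(x, 2.4) + gaussian_peak(x, 7.5) for x in grid]

cal_conc, cal_spectra = simulate_set([pure_a, pure_b], 30, 0.01, rng)
test_conc, test_spectra = simulate_set([pure_a, pure_b], 100, 0.01, rng)
```

The two returned sets correspond to the 30 calibration samples and the 100 prediction samples. The PCR step itself (cross-validation, PRESS, outlier detection) is not reproduced here.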
The quality of the analysis, under specific conditions, was evaluated by the root mean square error of prediction (RMSEP).33,34 For convenience of presentation, PCR calculations of the real samples were based on the added volumes of stock solutions rather than on the exact concentrations; therefore, RMSEP should be related to the added volumes, which were evenly distributed in the range of 0-2 mL.

RESULTS AND DISCUSSION
1. Modeling the Uncertainty in Net Analyte Signal. The validity of eq 21 was first tested with computer simulations, and the results are given in Figure 1. Excellent agreement between the values predicted by eq 21 and the simulation results was obtained, as seen in this figure. As can be seen from eq 21, the error in the norm of the net analyte signal is determined by three factors, which are briefly discussed in the following. These factors are the norm of the NAS vector, the number of vector elements (i.e., the number of experimental data points), and the noise level in the vector elements (i.e., in the measurements). Random numbers have been used to generate the NAS vectors, since they may have any spectral form. The sign of each vector element has been omitted because only the uncertainty of the norm is of significance for the net analyte signal evaluation. Figure 1a shows the dependence of the error in the NAS on its norm. The number of elements in the NAS vector and the noise level in these elements have been kept fixed. It can be seen that when the norm gets larger, the error reaches a constant value that is equal to the measurement error. In this case, eq 21 is well approximated as follows:

$$\operatorname{var}\left(\|\mathbf{d}_{net}\| - \|\mathbf{d}_{net,tru}\|\right) \approx \frac{(2\|\mathbf{d}_{net,tru}\|\sigma)^{2}}{\left(\|\mathbf{d}_{net}\| + \|\mathbf{d}_{net,tru}\|\right)^{2}} \approx \sigma^{2} \tag{24}$$

where $2\|\mathbf{d}_{net,tru}\|\sigma \gg n\sigma^{2}$ and $\|\mathbf{d}_{net}\| + \|\mathbf{d}_{net,tru}\| \approx 2\|\mathbf{d}_{net,tru}\|$.
Figure 1. Modeling the error in net analyte signal. Standard deviations in NAS norm were calculated from 100 simulations. (a) Effect of NAS norm. Number of elements in NAS vector and their noise level are fixed at 100 and 0.01 (standard deviation) respectively. (b) Effect of number of elements in NAS vector. Its norm and its noise level are fixed at 5.64 and 0.01, respectively. (c) Effect of noise level in measurements. Norm of NAS and the number of elements are fixed at 5.64 and 100, respectively.
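A simulation of the kind summarized in Figure 1 can be sketched as below. The sketch makes several assumptions that are not stated in the text: the error is taken as the root mean square deviation of the noisy norm from the true norm, the denominator of eq 21 is approximated by twice the true norm, and 2000 trials are used instead of 100 to reduce sampling scatter. The function names and the seed are illustrative.

```python
import math
import random


def predicted_norm_error(true_norm, n_points, sigma):
    """Square root of eq 21 with ||d_net|| + ||d_net,tru|| ~ 2 * true_norm."""
    combined = math.sqrt((2.0 * true_norm * sigma) ** 2
                         + (n_points * sigma ** 2) ** 2)
    return combined / (2.0 * true_norm)


def simulated_norm_error(true_norm, n_points, sigma, n_trials, rng):
    """RMS deviation of the noisy norm from the true norm (assumed metric)."""
    raw = [rng.random() for _ in range(n_points)]     # arbitrary spectral form
    scale = true_norm / math.sqrt(sum(v * v for v in raw))
    d_true = [scale * v for v in raw]
    total = 0.0
    for _ in range(n_trials):
        noisy = [v + rng.gauss(0.0, sigma) for v in d_true]
        total += (math.sqrt(sum(v * v for v in noisy)) - true_norm) ** 2
    return math.sqrt(total / n_trials)


rng = random.Random(0)                                # hypothetical seed
for norm_value in (0.5, 1.0, 5.64):
    simulated = simulated_norm_error(norm_value, 100, 0.01, 2000, rng)
    predicted = predicted_norm_error(norm_value, 100, 0.01)
    print(norm_value, simulated, predicted)
```

Under these assumptions the simulated and predicted errors should agree closely at the conditions of Figure 1 (100 elements, noise level 0.01), with the error approaching the measurement noise as the norm grows.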
It means that, in a well-conditioned system, the error in the norm is determined by the Gaussian part; thus, the first-order approximation is justified in this case. It should be noted that in this case the relative error becomes smaller as the norm increases, although the error in the norm becomes independent of the norm. For this reason, errors in the determined concentrations may be lower than measurement errors, a result that is in agreement with previous studies. On the other hand, when the norm becomes smaller, its error increases significantly, due to the existence of the $\chi^2$ part. It is clear from these simulations that the $\chi^2$ part cannot be generally omitted, particularly in ill-conditioned systems. Therefore, the first-order approximation that ignores the $\chi^2$ part of eq 21 is not always valid, and the results of several previously published
studies are not justified in these cases. Figure 1b shows the dependence of the error on the number of elements in the NAS vector (number of experimental data points). The norm of the vector and the measurement noise level have been kept constant. The error in the norm increases monotonically with the number of data points. This means that when the contribution of data points to the net analyte signal is not significant, addition of these data points practically spoils the quality of the net analyte signal and eventually degrades the analytical results. This fact reveals the necessity of restricting the data points to be included in the analysis. In other words, wavelength selection is necessary, particularly in ill-conditioned systems.

Figure 1c shows the dependence of the error in the net analyte signal on the noise level of its vector elements. When the noise level is sufficiently small, the error in the norm is equal to the error in the vector elements (i.e., in the measured signals). This is equivalent to the case of a large net analyte signal norm. When the noise level becomes larger, errors in the norm become nonlinearly dependent on measurement errors and the system shifts gradually from well-conditioned toward ill-conditioned. This corresponds to the case of a small net analyte signal norm under conditions of a fixed noise level. A comparison of the predictions, with and without the $\chi^2$ part of eq 21, is also shown in the figure. Omission of the $\chi^2$ part leads to incorrect results at considerable experimental noise levels, since the nonlinear increase of the function cannot be reproduced with the Gaussian part alone. This plot emphasizes again the weakness of previous studies,40-42 where the first-order approximation has been employed and the $\chi^2$ part has been practically ignored. It is clear from the above that whether a system is ill-conditioned or not depends on both the noise level and the NAS vector norm.
A component that suffers from a severe spectral overlap (i.e., small net analyte signal norm) can be analyzed, provided that the noise level in the signal is sufficiently small. Therefore, it makes no sense to talk about spectral overlap in a system without information on measurement errors.

2. Wavelength Selection with Error Indicator Function. As shown in the previous sections (eq 13), the errors in predicted concentrations obtained by the PCR method are dependent on the quality of the net analyte signal of the corresponding component. It means that the relative error in the norm of the net analyte signal can be considered as an error indicator and can be utilized to model the analytical performance under various conditions. This error indicator can be employed as a criterion to locate the optimum experimental conditions. In order to evaluate the usefulness of this error indicator function as a criterion for wavelength selection, extensive simulations have been carried out on a two-component system with varying noise level. Previous similar studies referred only to oversimplified systems, where each component was represented by a single Gaussian peak, e.g., refs 13 and 14. However, such models cannot represent most multicomponent spectra; therefore, a more general representation is required. Usually, the signals (e.g., spectra) from a multicomponent system can be divided into two groups: the parts that are similar, and are not very useful for analysis, and the parts that are different and contain the information on the various concentrations. For example, the near-IR spectrum of petroleum products consists of many constant peaks, and only several peaks vary due to a change in the aromatic
composition. We take advantage of the fact that our algorithms do not depend on the actual shape of the signals, and we represent the multicomponent signals by a spectrum with two peaks: One peak represents all parts of the experimental signals that are similar, and the other represents the parts that are different. For simplicity of presentation, only a two-component system is studied in this section, and extension to multicomponent systems is given in the following. Such a representation of a complex spectrum is given in Figure 2a, consisting of two peaks: P1 stands for the variations and P2 stands for the similarities. This figure shows a particular spectral overlap, and our task is to find the most informative domain of the spectra, which is the range that provides the lowest analytical errors. Then, we evaluate the predicting capability of the error indicator function. In order to locate the most informative range in the spectra, a moving-window strategy was employed: Both the width of the spectral window and the location of its center were varied, providing a two-dimensional grid. Thus, the RMSEP of a component, as well as the error indicator function, can be investigated as a function of the width and location of the measurement window and plotted as the third dimension over the above grid. The prediction errors in analysis of component A, under various measurement schemes, are shown in Figure 2b. A minimum is observed in this surface, indicating the best possible measurement program. Taking into account that the data include "white noise" only, the presence of a minimum in this surface is not obvious. It is usually assumed that PCR models cancel out "white noise"; therefore, in many cases the whole spectrum is introduced into PCR calibration.
The previously provided theoretical consideration has supported the inclusion of the complete spectra.42 That conclusion, however, was drawn from the mathematical assumption that addition of wavelengths or sensors always decreases the norm of the regression vector. As a result, prediction errors decrease as indicated in eq 7, where the regression vector was assumed to be errorless (or errors in the regression vector were neglected). However, the minimum presented in Figure 2b demonstrates that the previous conclusion is not generally valid, which means that the assumptions made in the previous studies were oversimplified. Therefore, it does make sense to find the most informative parts of the spectra. The above RMSEP surface has to be compared to the surface of the error indicator function. Figure 2c shows this function in the same coordinates as Figure 2b. The similarity of the two surfaces is obvious. The contours, shown on the bottom facet, indicate not only that the two surfaces are similar but also that their minima are located at the same place. It should be noted that there are some differences in the fine structures of the two surfaces; nevertheless, considering the fact that the error function was originally developed just to describe the uncertainty in the net analyte signal, the similarity is surprising. The main characteristics of the RMSEP surface and the origin of its minimum can now be studied in terms of the components of the error function: First, the uncertainty of the net analyte signal is determined by its norm, and the larger the norm, the smaller the error in predicted concentrations. Accordingly, a general trend can be observed in both surfaces, that prediction error decreases when the width of the moving window increases. Obviously, a narrow window results in a small net analyte signal, which provides a large prediction error; a too narrow window cannot represent the
Figure 2. (a, upper left) Spectra of two components. Each spectrum consists of two Gaussian peaks of unity standard deviation. P1, with a resolution factor of $R_s = 0.025$ ($\lambda_{\max}^{A} = 2.5$, $\lambda_{\max}^{B} = 2.4$), represents dissimilarity ranges. P2, with a resolution factor $R_s = 0.0$ ($\lambda_{\max}^{A} = \lambda_{\max}^{B} = 7.5$), represents the similarity. (b, lower left) Prediction error of component B as a function of the measurement program (spectral window’s width and center). Calculation is based on a fixed wavelength interval of 0.01 (arbitrary unit) and Gaussian noise of 0.01 (standard deviation). Note the well-defined minimum in the surface, which indicates the best measurement conditions. The optimal window is located at 1.5-4.0, which corresponds to P1. (c, upper right) Prediction for component B with error indicator under the same coordinates as in (b). (d, lower right) Prediction for component B with error indicator obtained by using eq 19 instead of eq 16, where * indicates the omission of the χ² part.
differences in the spectra. Under a specific window width, prediction error increases significantly when the window moves away from peak P1, which stands for the center of the spectral differences. The similarity of the two spectra accounts for the decrease of net analyte signal when moving toward peak P2. The more similar the spectra are, the smaller the part that is orthogonal to the other, or to the linear combination of the others in a multicomponent system. When the window moves toward the tail of the spectra, the low intensities account for the decrease of the actual net analyte signal. This eventually results in higher prediction errors. Following the RMSEP surface of Figure 2b at a constant window location, the prediction error generally decreases with the increase of the window’s width. This happens because more information about the component is included, which results in an increase of the net analyte signal. However, when the window becomes larger and larger, more data points from spectral regions of high overlap or low intensities are included. Unfortunately, inclusion of these data points will not increase the net analyte signal of the relevant component. On the other hand, errors contributed by the χ² part increase with the number of data points, as can be seen from eq 21. In fact, these data points contribute more noise than information or even noise only. In this case, the actual prediction error increases due to the contribution from the
χ² part. These considerations are responsible for the observed minimum in the RMSEP surface. It should be pointed out that the net effect of addition of data points depends on the noise level, as indicated by eq 21. Addition of data may either improve or spoil the analytical results. Obviously, the location of the minimum (i.e., the best experimental conditions) changes according to the noise level, although it is mainly determined by the spectral characteristics. In other words, for a specific scheme of spectral overlap, the number of data points to be included in calibration depends on the experimental noise level. The lower the noise level, the more data points can be included. In fact, when multivariate calibration is concerned, one can hardly talk about spectral overlap without considering the noise levels. Equation 21 emphasizes the relevance of experimental noise investigation in multivariate calibrations. For comparison, Figure 2d was calculated by inserting eq 24 into eq 22, where the χ² part has been omitted. This is the well-known first-order approximation, which is generally assumed in the literature. This surface is determined by the norm of the net analyte signal; therefore, addition of more data points always results in a larger norm. This surface decreases continuously when the spectral window is enlarged. As a result, no minimum occurs in this surface, which corresponds exactly to the results obtained in previous studies.41,42 This comparison demonstrates
Figure 4. The same calculations for PcA as for PsA in Figure 3. (a) Determination of PcA and (b) prediction of PcA analysis by error indicator.
Figure 3. (a) Fluorescence spectra of pure components. Concentrations of PbA, PcA, and PsA were 9.1 × 10⁻⁵, 1.0 × 10⁻⁴, and 8.0 × 10⁻⁵ mol/L, respectively. (b) Determination of PsA as a function of measurement program. Calculation of RMSEP is based on the recovery of added volumes (in milliliters) of PsA stock solution. (c) Prediction of determination of PsA with error indicator.
the major disadvantage of previous studies, which is corrected in the current theory. The distinction between the EI and the RMSEP criteria should be further emphasized: As shown in eq 13, the EI has been based on the hypothesis that the relative errors in predicted concentrations are dependent on the relative NAS error. RMSEP, however, results from complete PCR calculation; thus, it is used to characterize the PCR performance under specific conditions. Maximizing the length of the NAS only is not sufficient for optimization purposes. For example, addition of more data points
usually results in a larger norm of the NAS. However, according to eq 21, it introduces more noise as well. The final results may be spoiled if the relative error in the norm of the NAS is increased, as shown in eq 22. 3. Application to Experimental Results of a Real Multicomponent System. A three-component system, consisting of potassium salts of PbA, PsA, and PcA, was prepared as described in the Experimental Section. The experimental study was designed to demonstrate the above theory. The fluorescence spectra of the pure components are shown in Figure 3a. It can be seen that the spectra of PbA and PsA are very similar and can be considered as an example of an ill-conditioned case (under the current experimental conditions). In this case, the standard deviation of the measured signals was about 2.8 counts. Under the same conditions, PcA is distinct and may represent the well-conditioned components. The results of the actual analysis of PsA are shown in Figure 3b. These results were obtained from a series of measurements used for development of a PCR calibration model and other measurements used for RMSEP calculation. The error indicator for the same experimental conditions is shown in Figure 3c, as a function of the same variables. Because of the limited number of samples, the surface of (b) is not as smooth as (c). However, the general features of the two surfaces are the same, which demonstrates the validity of the above results for real experimental data. The optimal window for the analysis of PsA is the one at
375-385 nm, as predicted with the error indicator in (c). This result corresponds to the spectral difference between PbA and PsA, shown in (a). The actual smallest RMSEP value for PCR prediction of PsA is obtained around this minimum, which confirms the validity of the error indicator as the right tool for wavelength selection. Similar results have been obtained for analysis of PbA (not presented). The results from the analysis of PcA are shown in Figure 4. These two surfaces are rather flat, as compared to the case of PsA, and the minimum can hardly be located (although a minimum is always present in the error indicator surface). That is to say that in this case the χ² part of eq 21 can be practically omitted, as a first-order approximation. Therefore, from the practical point of view, wavelength selection for analysis of a well-conditioned component is not essential. As a matter of fact, no clear minimum is observed in surface (a), because it has been calculated from a limited number of samples.
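The contrast between ill-conditioned and well-conditioned components can be illustrated with a minimal NumPy sketch. It computes, for each component, the fraction of the pure-spectrum norm that remains after projecting out the other spectra, i.e., the relative net analyte signal norm in the first-order picture. The three Gaussian bands used here (centers at 380, 382, and 420 nm, standard deviation 10 nm) are hypothetical shapes chosen only to mimic two similar spectra and one distinct spectrum; they are not the measured spectra of PbA, PsA, and PcA. The function name `nas_fraction` is likewise an assumption, and the noise-dependent χ² contribution of eq 21 is not included.

```python
import numpy as np

def nas_fraction(S, k):
    """Fraction of the norm of spectrum k that is orthogonal to all other spectra.

    S holds one pure-component spectrum per row.
    """
    s_k = S[k]
    others = np.delete(S, k, axis=0)
    # Least-squares fit of s_k by the other spectra; the residual is the
    # part of s_k that no linear combination of the others can reproduce.
    coef, *_ = np.linalg.lstsq(others.T, s_k, rcond=None)
    nas = s_k - others.T @ coef
    return float(np.linalg.norm(nas) / np.linalg.norm(s_k))

# Hypothetical band shapes: two nearly coincident bands and one distinct band.
x = np.linspace(350.0, 450.0, 501)
band_centers = [380.0, 382.0, 420.0]
S = np.vstack([np.exp(-0.5 * ((x - c) / 10.0) ** 2) for c in band_centers])

fractions = [nas_fraction(S, k) for k in range(S.shape[0])]
for c, f in zip(band_centers, fractions):
    print(f"band at {c:.0f} nm: relative NAS norm = {f:.3f}")
```

In this toy case the two overlapping bands retain only a small part of their norm as net analyte signal, while the distinct band retains nearly all of it, which is consistent with the observation that wavelength selection matters mainly for the ill-conditioned components.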
CONCLUDING REMARKS This paper has introduced the error indicator function, which was developed for the calculation of the relative error in the norm of the net analyte signal of a component in a mixture. A mathematical model has been derived for the uncertainty in net analyte signals. It was shown that the χ² part in the variance of the net analyte signal norm cannot be generally omitted, because of its nonlinear dependence on the noise level. Thus, the traditional first-order approximation is not justified in many cases, and the full treatment, described above, must be carried out. The current assumption that the analytical performance of multivariate calibrations is determined by the quality of the net analyte signal has been supported by extensive simulations and by experimental examples. The necessity of wavelength selection for the so-called full-spectrum methods, such as PCR, has been confirmed in ill-conditioned situations. The ability of the proposed error indicator function to perform effective wavelength selection has been demonstrated with both simulations and experimental results.
Received for review November 27, 1995. Accepted April 4, 1996.X
Analytical Chemistry, Vol. 68, No. 14, July 15, 1996
ACKNOWLEDGMENT This study has been supported by the Israeli Ministry of Agriculture, The Water Research Authority.
AC951142S
X Abstract published in Advance ACS Abstracts, May 15, 1996.