Anal. Chem. 1996, 68, 170-175
Information Theory Approach to Underdetermined Simultaneous Multicomponent Analysis

Israel Schechter
Department of Chemistry, Technion-Israel Institute of Technology, Technion City, Haifa 32 000, Israel
An algorithm for the analysis of simultaneous multicomponent systems is proposed. The method is suitable for underdetermined systems, where the number of chemical components is larger than the number of measured data points. It provides a rational solution to such underdetermined cases, with no need for model assumptions. Such situations are faced, for example, in the application of low-cost sensors to environmental analysis, where complex mixtures must be handled. The proposed solution is based on information theory and provides the unique set of concentrations that is the most unbiased one. The algorithm maximizes the entropy of the solution set, which means that the final unjustified information is minimal. A detailed description of the mathematical procedure is provided. The method is exemplified for spectroscopic analysis and evaluated by extensive computer simulations. Several effects, such as the number of components and the number of measured points, the noise level, the spectral characteristics, and the sampling design, are studied. The stability of the algorithm and the analytical performance in some cases are evaluated with respect to these factors. Satisfactory results are obtained in systems with realistic noise levels.

Simultaneous multicomponent analysis is an important issue in modern analytical chemistry. It is general knowledge that numerous data are needed for accurate multicomponent analysis, and indeed, it has been shown that when an abundance of data is available, improved analytical performance can be achieved. Nevertheless, there are situations where simultaneous analyses have to be carried out using only a few pieces of information. This is often the case in the development of low-cost chemical sensors. Such devices are supposed to handle complicated mixtures; however, due to cost restrictions, only limited data are collected. For example, an expensive photodiode array spectrometer, which provides ∼1000 pieces of spectral data, might be replaced by a few narrow-band filters and photodetectors.

Many algorithms (such as PLS and PCR) have been developed to handle analysis based on numerous data.1,2 These so-called overdetermined systems are commonly used to improve analytical results.3 However, as far as we know, no algorithms have been suggested to help analysts with underdetermined systems.

(1) Gans, P. Data Fitting in the Chemical Sciences; John Wiley & Sons: Chichester, U.K., 1992.
(2) Miller, J. C.; Miller, J. N. Statistics for Analytical Chemistry, 3rd ed.; Ellis Horwood and Prentice Hall: New York, 1993.
(3) Branham, R. L., Jr. Scientific Data Analysis. An Introduction to Overdetermined Systems; Springer-Verlag: New York, 1990.
In these analytical systems, only a few data points are available, even fewer than the number of compounds in the mixture.

This paper is aimed at the above-mentioned underdetermined systems. We suggest an algorithm that provides simultaneous multicomponent analysis when the number of components is larger than the number of available data points. Traditionally, such systems have been considered of limited value, due to the perception that no unique mathematical solution is possible. One could solve the equations by using additional chemical intuition or additional information from general knowledge of the particular system of interest; however, the final results would depend on such biased assumptions. Our task is to find a unique solution that is the least biased possible, such that analytical results can be obtained.

We are looking for a solution to underdetermined chemical systems that is unique, is in agreement with all available data, and contains no biased, unjustified information. Such a solution is provided by information theory, which has the ability to handle underdetermined conditions.4 Generally, this solution is known as the maximum entropy solution.5 The maximum entropy approach has already been successfully applied to several analytical problems, such as resolution enhancement and image analysis.6,7 This paper applies the maximum entropy principle to analytical systems and provides an algorithm to calculate the desired concentrations.

Of course, we do not defy the laws of mathematics by providing the exact solution to underdetermined systems, and there could be cases where our solution is wrong. The concentrations that maximize the entropy (under the experimental constraints) are not necessarily the true analytical concentrations. Yet, they are the only rational solution that one can provide, and as we show in the following, the recovery errors are generally acceptable. The method and the algorithm are presented and exemplified by simple spectroscopic cases, including realistic experimental noise. They are evaluated by means of numerous simulations, and several factors of relevance in analytical chemistry are studied.

THEORY

Our purpose is to analyze a mixture of Nc components by means of Nw measured linear instrumental responses. For simplicity of presentation, we shall refer to spectroscopic analysis. In this case, we measure the spectral intensities at Nw wavelengths.

(4) Ash, R. Information Theory; Interscience: New York, 1965.
(5) Buck, B.; Macaulay, V. A. Maximum Entropy in Action; Clarendon Press: Oxford, U.K., 1991.
(6) Grant, A. I.; Packer, K. J. Enhanced Information Recovery From Spectroscopic Data. In Maximum Entropy and Bayesian Methods; Skilling, J., Ed.; Kluwer: Dordrecht, The Netherlands, 1988.
(7) Kauppinen, J. K.; Moffatt, D. J.; Mantsch, H. H. Can. J. Chem. 1992, 70, 2887-2894.
These results are given by

    D_j = \sum_{i=1}^{N_c} R_{ji} C_i ,    j = 1, ..., N_w    (1)

where Dj is the measured spectral response from the mixture at wavelength j, Rji is the unit spectral response of component i at wavelength j, and Ci is the relative concentration of component i in the mixture, such that the sum of all concentrations is equal to 1 (or to 100%). If Nw is equal to Nc, the system is determined, and one can calculate the desired concentrations uniquely. If Nw is larger than Nc, the system is overdetermined, and one can use one of the many available methods (e.g., least squares, PCR) to obtain more precise results that compensate for noise in the data. However, if Nw is smaller than Nc, the system is underdetermined, and there are infinitely many mathematical solutions; that is, an infinite number of concentration sets satisfy the given equations.

We are looking for a single set of concentrations that fulfills the following requirements: First, it solves all Nw equations (1); that is, it agrees with all experimental data. Second, this set should be the least biased of all possible sets. In other words, we require that the additional information introduced by the arbitrary selection of the set be minimal. There is just one such set, and it is given by information theory. This set is the one that maximizes the entropy, S (or the missing information):

    S = - \sum_{i=1}^{N_c} C_i \ln C_i    (2)

Thus, eq 2 is an additional condition that has to be fulfilled by the concentration vector C. The method of Lagrange parameters8 describes the vector that maximizes the entropy in the form

    C_i = e^{-(\lambda_0 + \sum_{j=1}^{N_w} \lambda_j R_{ji})}    (3)

where λ0 takes into account that the sum of all concentrations is 1 (or 100%) and is defined by the following expression (note that λ0 is a function of all λj, by definition):

    \xi(\lambda_1, ..., \lambda_{N_w}) \equiv e^{\lambda_0} = \sum_{i=1}^{N_c} e^{- \sum_{j=1}^{N_w} \lambda_j R_{ji}}    (4)

Using eqs 4 and 1, we get the new set of equations to be solved:

    \sum_{i=1}^{N_c} (R_{ji} - D_j) e^{- \sum_{k=1}^{N_w} \lambda_k R_{ki}} = 0 ,    j = 1, ..., N_w    (5)

which can be simplified by an origin shift:

    P_{ji} = R_{ji} - D_j    (6)

Since this centering should not change the set of Lagrange parameters (λ1, ..., λNw), the only change is in the magnitude of the normalization factor, λ0:

    \sum_{k=1}^{N_w} \lambda_k R_{ki} + \lambda_0 = \sum_{k=1}^{N_w} \lambda_k P_{ki} + \lambda'_0 ,    \lambda'_0 = \lambda_0 + \sum_{k=1}^{N_w} \lambda_k D_k    (7)

Hence, one can rewrite eq 5 as

    \sum_{i=1}^{N_c} P_{ji} e^{- \sum_{k=1}^{N_w} \lambda_k P_{ki}} = 0 ,    j = 1, ..., N_w    (8)

The solution of these Nw equations provides the set of Lagrange parameters that is needed for the calculation of the required concentrations. A variational algorithm to solve such equations has already been proposed9 in the context of another maximum entropy problem. A variational set of concentrations has to be considered:

    C_i^{\nu} = e^{- \sum_{k=1}^{N_w} \lambda_k^{\nu} P_{ki}} / \xi(\lambda^{\nu})    (9)

where

    \xi(\lambda^{\nu}) = \sum_{i=1}^{N_c} e^{- \sum_{k=1}^{N_w} \lambda_k^{\nu} P_{ki}}    (10)

and the index ν indicates variational functions; that is, this set of concentrations is a trial set that has to be varied until the maximum entropy solution is obtained. The algorithm for maximizing the entropy is practically a variational method that varies these trial sets toward the final solution. As shown in the following, this variation procedure can be carried out by a simple function minimization: it has been shown9 that the solution of eqs 8 is equivalent to minimization of the potential function F:

    F = \ln \xi(\lambda_1^{\nu}, ..., \lambda_{N_w}^{\nu})    (11)

The global minimum of this convex function can easily be found (by means of a standard nonlinear minimization algorithm), which provides the requested Lagrange parameters.

(8) Jaynes, E. T. Phys. Rev. 1957, 106, 620.
(9) Agmon, N.; Alhassid, Y.; Levine, R. D. J. Comput. Phys. 1979, 30, 250-259.

The only severe numerical problem in the above algorithm is caused by linearly dependent rows of the matrix P. These rows must be linearly independent to ensure that eqs 8 have a solution and that the set of Lagrange parameters is unique. When the rows of P are not linearly independent, some data points are not independent, and one can always eliminate one or more of the experimental data. Nevertheless, due to numerical considerations, even near linear dependence among the rows of P introduces errors. Thus, these rows must be orthogonalized before the minimization process. This orthogonalization is carried out by the Gram-Schmidt procedure, and a new matrix P' is generated instead of P, by means of the operator G:
    P' = G P    (12)

The final Lagrange parameter vector λf is then calculated by means of the same operator:

    \lambda_f = \lambda G    (13)

The Gram-Schmidt procedure is commonly described as an iterative algorithm;10 however, that approach is not suitable in our case, since the operator itself is needed for the further calculation in eq 13. Thus, we have derived a matrix version of the Gram-Schmidt operator. (The code is available, upon request, from the author.) As already pointed out, a unique solution for the Lagrange parameters requires that the matrix P be of maximal rank (i.e., that its rows be linearly independent). The proposed Gram-Schmidt procedure not only orthogonalizes the rows but also provides the angle between the vector being orthogonalized and its projection on the subspace of the vectors that are already orthogonal. This angle can be used for identification of exact or near linear dependence. In the first case, the corresponding experimental data are eliminated (resulting in a new P matrix), while under near linear dependence, the angle may be used for error estimation (and the dimension of P' is unchanged).

(10) Ralston, A.; Rabinowitz, P. A First Course in Numerical Analysis, 2nd ed.; McGraw-Hill: New York, 1988.
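The complete procedure (eqs 6-13) is compact enough to restate in code. The following is a minimal NumPy/SciPy sketch, not the author's FORTRAN77 implementation; all names are illustrative, and SciPy's general-purpose BFGS minimizer stands in for whatever nonlinear minimization routine the original code used:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import logsumexp

    def gram_schmidt_operator(P, tol=1e-10):
        # Row-wise Gram-Schmidt in matrix form: returns (P_orth, G) with
        # P_orth = G @ P. Rows found numerically dependent are dropped,
        # mirroring the elimination of redundant measurements in the text.
        orth_rows, G_rows = [], []
        n = P.shape[0]
        for j in range(n):
            v = P[j].astype(float)
            g = np.zeros(n)
            g[j] = 1.0
            for u, gu in zip(orth_rows, G_rows):
                coef = (v @ u) / (u @ u)
                v = v - coef * u
                g = g - coef * gu
            if np.linalg.norm(v) < tol * max(np.linalg.norm(P[j]), 1.0):
                continue  # (near-)dependent row: this datum is redundant
            orth_rows.append(v)
            G_rows.append(g)
        return np.array(orth_rows), np.array(G_rows)

    def maxent_concentrations(R, D):
        # R: (Nw x Nc) unit spectral responses; D: (Nw,) measured mixture data
        P = R - D[:, None]                    # origin shift, eq 6
        P_orth, G = gram_schmidt_operator(P)  # eq 12: P' = G P

        def F(lam):                           # eq 11: F = ln xi(lambda)
            return logsumexp(-lam @ P_orth)

        lam = minimize(F, np.zeros(P_orth.shape[0]), method="BFGS").x
        z = -lam @ P_orth
        C = np.exp(z - logsumexp(z))          # eq 9, normalized by xi
        lam_f = lam @ G                       # eq 13
        return C, lam_f

Because F is convex, any descent method that reaches a stationary point has found the global minimum, which is why a generic quasi-Newton routine suffices here.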
EXPERIMENTAL SECTION

Analysis of the proposed algorithm has been carried out by computer simulations. Spectra of Gaussian peaks were created, and normally distributed noise, with a given standard deviation, was added. Noise was created by the Box-Muller algorithm.11 Spectra of mixtures were generated in the following way: First, the peak parameters (center, width, height) of all compounds were provided. Second, the concentrations of all components were generated, and their sum was renormalized to 100%. The final spectrum was calculated as the sum of all individual contributions. After all spectrum parameters were generated, sampling points were simulated. Several sampling strategies were investigated. The final experimental points were simulated as the spectral response at the selected sampling points, plus Gaussian noise. For the sake of generality of our results, we have randomized many of the relevant parameters, as described in the following. For example, when the effect of wavelength sampling is studied, the maximum intensities are randomized, and the results are given as the average over many randomized sets of intensities.

The maximum entropy program accepts as input the single-component spectral responses (at unit concentration) and the experimental points from a mixture. The output consists of the calculated concentrations. The average deviation from the true concentrations is recorded. All software was written in FORTRAN77 and compiled by the MS FORTRAN compiler. The code was run on a personal computer (80486 processor). Each single maximum entropy calculation takes ∼0.01 s.

(11) Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; Vetterling, W. T. Numerical Recipes; Cambridge University Press: Cambridge, U.K., 1989.
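A sketch of this simulation scheme, under assumed (illustrative) peak-parameter ranges, could look as follows; NumPy's Gaussian generator replaces the explicit Box-Muller step:

    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_peak(wl, center, width, height):
        # Spectrum of one pure component at unit concentration
        return height * np.exp(-0.5 * ((wl - center) / width) ** 2)

    def simulate_mixture(Nc=10, Nw=7, noise_rel=0.0005):
        # Returns pure responses R (Nw x Nc), noisy mixture data D, true C.
        # All parameter ranges below are assumptions for illustration only.
        wavelengths = np.sort(rng.uniform(0.0, 1.0, Nw))   # sampling points
        centers = rng.uniform(0.2, 0.8, Nc)
        widths = rng.uniform(0.05, 0.15, Nc)
        heights = rng.uniform(0.5, 1.5, Nc)                # randomized maxima
        R = np.column_stack([gaussian_peak(wavelengths, c, w, h)
                             for c, w, h in zip(centers, widths, heights)])
        C = rng.uniform(0.0, 1.0, Nc)
        C /= C.sum()                                       # renormalize to 100%
        D = R @ C
        D = D + rng.normal(0.0, noise_rel * np.abs(D))     # additive Gaussian noise
        return R, D, C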
Figure 1. Spectra of 10 pure compounds at unit concentration and the spectrum of a mixture (solid curve). Arrows indicate the seven wavelengths where intensities are measured.

Table 1. Examples of Underdetermined Spectroscopic Measurements and the Maximum Entropy Solutions^a

No Noise
  spectral measurement                       concentrations
  initial value        recovery error (%)    initial value        recovery error (%)
  4.764055 × 10^-6      5.785490 × 10^-7     4.711425 × 10^-2     -2.397539 × 10^-2
  1.105812 × 10^-3      3.332580 × 10^-7     6.124853 × 10^-2      1.743517 × 10^-1
  4.215401 × 10^-2      6.294458 × 10^-8     7.067138 × 10^-2     -5.691001 × 10^-1
  3.240211 × 10^-1     -9.984369 × 10^-8     8.362780 × 10^-2      8.804960 × 10^-1
  7.080988 × 10^-1     -4.535622 × 10^-8     9.422851 × 10^-2     -6.170365 × 10^-1
  5.120140 × 10^-1      2.927261 × 10^-8     1.060071 × 10^-1     -4.483161 × 10^-2
  9.013817 × 10^-2      3.264552 × 10^-8     1.177856 × 10^-1      2.598729 × 10^-1
                                             1.272085 × 10^-1     -2.192655 × 10^-4
                                             1.413428 × 10^-1     -1.251730 × 10^-1
                                             1.507656 × 10^-1      4.670643 × 10^-2

Noise Level 0.095%
  spectral measurement                       concentrations
  initial value        recovery error (%)    initial value        recovery error (%)
  4.761541 × 10^-6     -1.155221 × 10^-10    4.711425 × 10^-2      1.893544 × 10^-1
  1.106038 × 10^-3     -6.977473 × 10^-11    6.124853 × 10^-2      6.739566 × 10^-2
  4.214836 × 10^-2     -3.318945 × 10^-10    7.067138 × 10^-2     -1.197431
  3.220645 × 10^-1     -5.811990 × 10^-11    8.362780 × 10^-2     -5.467610 × 10^-1
  6.947993 × 10^-1      1.077595 × 10^-9     9.422851 × 10^-2      5.694423 × 10^-1
  4.999929 × 10^-1      9.064878 × 10^-10    1.060071 × 10^-1      3.085257
  8.824397 × 10^-2      4.450789 × 10^-10    1.177856 × 10^-1      7.934573 × 10^-1
                                             1.272085 × 10^-1      2.999927 × 10^-1
                                             1.413428 × 10^-1     -7.159758
                                             1.507656 × 10^-1      4.092068

^a The recovery of the spectral measurements is excellent, and the average concentration error is acceptable and depends on the noise level (and other factors).
RESULTS AND DISCUSSION

Spectroscopic Example. The principle of the maximum entropy program is demonstrated by simple spectroscopic examples. A mixture of 10 compounds has been simulated. The spectra of the pure compounds are slightly different (in maximum intensity and wavelength), and each compound has a different concentration in the mixture. Figure 1 shows these individual spectra and the resultant spectrum of the mixture (solid curve). Only seven experimental measurements were simulated, as shown by the arrows in the figure. Several sampling sets were randomly selected, and the maximum entropy concentrations were calculated. Representative results are presented in Table 1. As expected, the recovered spectral measurements (calculated from the recovered concentrations) are in excellent agreement with all true values (the maximum error is in the range of 10^-7%). The recovery of the 10 concentrations is excellent when no noise is present and is acceptable when a noise level of 0.05% was used. As explained in the following, the stability to noise, rather than the recovery errors, is of major relevance.
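For readers following the code sketches above, the corresponding end-to-end check (again illustrative, not the original FORTRAN77 run) takes a few lines:

    # Reuses simulate_mixture and maxent_concentrations from the sketches above
    R, D, C_true = simulate_mixture(Nc=10, Nw=7, noise_rel=0.0)
    C_me, _ = maxent_concentrations(R, D)

    # Measured data are reproduced to optimizer tolerance; the concentration
    # error reflects the information genuinely missing from 7 data points
    print("max data residual: ", np.max(np.abs(R @ C_me - D)))
    print("mean |conc. error|:", np.mean(np.abs(C_me - C_true)))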
Figure 2. Improvement factor (namely, the ratio of the average statistical error to the average error using the maximum entropy algorithm) as a function of the number of components in the mixture and the number of measurements.
Effect of the Numbers of Components and of Data Points. As shown in the previous example, this method is designed for cases where the number of experimental points is smaller than the number of components in the mixture. We now study the performance of the algorithm as a function of these variables. It is expected that the average recovery error of the concentrations is higher when the number of components far exceeds the number of measurements; however, we want to provide a quantitative characterization.

Let us consider the proper way to evaluate the recovery errors. The problem is that we need to cancel out a trivial effect that causes the average error to decrease when the number of components is increased: suppose that we are given a set of Nc concentrations (that sum up to 100%), and we randomly select a set of Nc numbers in the same range. We calculate the deviation of the first random number from the first concentration and continue to the last concentration. These deviations can be averaged over many sets of random numbers to give the average statistical deviation. This statistical deviation decreases with Nc, since all numbers are given in a limited range (0-100%) and the average distance between numbers decreases. A proper evaluation of the results should cancel out this effect. Therefore, we present here only the improvement factor, calculated as the ratio of the statistical deviation to the average deviation obtained by the maximum entropy program.

The results are shown in Figure 2. Only cases where the number of components is larger than the number of data points are considered. As expected, at a constant number of components, better results are obtained with more data points. Comparison of cases with a constant difference between the numbers of components and data points shows that better results are always obtained with the larger numbers. This finding can be intuitively explained by the size of the relative missing information: less relative information is missing when we measure 13 data points in a 15-component mixture than when we measure three points in a five-component mixture. It is interesting that the improvement factors vary from one up to seven, which means that this algorithm provides a remarkable improvement compared to the other possibilities (a random or arbitrary guess).
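The statistical baseline used in this normalization can be estimated by a short Monte Carlo routine. The sketch below assumes the random guesses are drawn uniformly and renormalized to 100%, which is one plausible reading of the scheme described above; the paper's exact sampling of random sets may differ:

    # Continues the earlier sketches (rng, C_true, C_me already defined)
    def statistical_deviation(C_true, trials=100_000):
        guesses = rng.uniform(0.0, 1.0, (trials, len(C_true)))
        guesses /= guesses.sum(axis=1, keepdims=True)  # each guess sums to 100%
        return np.mean(np.abs(guesses - C_true))       # average random-guess error

    improvement = statistical_deviation(C_true) / np.mean(np.abs(C_me - C_true))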
Figure 3. Effect of experimental noise on the analytical error in a system of 10 components and seven experimental data points: (A) maximum intensities of all components are equal; (B) maximum intensity of a single component is slightly different; (C-E) maximum intensities of two, three, and four components, respectively, are different; (F) maximum intensities are randomized.
Experimental Noise. The recovery performance depends on the experimental noise level, as well as on other factors such as the spectral characteristics and the sampling. In the following, we study the average errors as a function of these factors for realistic noise levels. Noise levels up to 0.1% are considered, which represent many realistic spectral systems. Calculations have been carried out in a system of 10 components and seven experimental data points. It should be pointed out, however, that the scope of this method is not to minimize the recovery errors but to provide the only rational solution in underdetermined systems. Thus, the average error, which depends on the particular distance between the "true" concentrations and the maximum entropy concentrations, is not very informative. Nevertheless, the change in the error is the relevant factor, since it indicates numerical and algorithmic instabilities.

Generally, the average error increases almost linearly with the relative noise level, as can be seen in Figure 3. However, the slope depends on the spectroscopic characteristics. The worst case has been observed when all maximum intensities were exactly equal. In this rare case, all pure components are represented by peaks with the same maximum intensity and differ only in spectral width and wavelength. In this case (represented by curve A), the algorithm is very sensitive to the noise, implying that excellent experimental data are needed for analysis. As we break the symmetry and let just a single pure spectrum have a slightly different maximum intensity, the slope drops, providing more reasonable results (curve B). The slope improves further when we vary the maximum intensities of more components (curves C-E). The best results are obtained when the maximum intensities have been randomized (no internal correlations), as seen in curve F. A realistic spectral system is likely to be close to the random curve, or at least in the range of curves C and D. In all these cases, the change in the average error as a function of noise level is moderate, which indicates a relatively stable algorithm.

Deviations from Beer's Law. As previously mentioned, this algorithm needs no model assumption; however, our study is based on a particular spectroscopic case, which assumes that the system follows Beer's law. Thus, it is interesting to study the effect of deviations from Beer's law. For this purpose, we have selected the following function to simulate the deviations:
    f = -\log \left( \frac{10^{-C} + r}{1 + r} \right)
Here f is a function of the concentration C and of a parameter r, and we calculate the absorption as proportional to f. This function is equal to C for r = 0 and simulates a deviation from C for positive r values (the deviations increase with C, as expected for regular deviations from Beer's law). The average error as a function of r and of the noise level is presented in Figure 4. r is varied from 0 to 0.01, which corresponds to a 4% deviation from Beer's law. This surface shows moderate changes in the average error, indicating stability of the algorithm toward realistic deviations from Beer's law.
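A quick numerical check of this deviation function (a sketch; the 4% figure refers to C = 1 at r = 0.01) confirms the behavior described above:

    import numpy as np

    def beer_deviation(C, r):
        # Equals C exactly at r = 0; bends below C as r grows
        return -np.log10((10.0 ** (-C) + r) / (1.0 + r))

    for r in (0.0, 0.001, 0.01):
        f = beer_deviation(1.0, r)
        print(f"r = {r:<6} f(1) = {f:.4f}  deviation = {1 - f:.1%}")
    # r = 0.01 gives f(1) of ~0.963, i.e. roughly a 4% deviation from Beer's law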
Figure 4. Average error as a function of deviation from Beer's law (represented by r) and of the spectral noise. The maximum deviation is 4%. Spectral values have been randomized.
Figure 5. Average error as a function of the maximum intensity of a single component (A) and the noise level. The case of A = 0 represents a component that is transparent in the measured spectral range. Maximum intensities of all other components are randomized.
A Transparent Component. This algorithm is based on the assumption that the sum of all concentrations is 1, which is an obvious fact (and can always be obtained using proper units). The origin of this requirement is that the maximum entropy method has been developed for probabilities, which also must sum up to 1. Actual values of concentrations may be zero, just as zero probabilities are acceptable. However, there is an interesting question regarding a possible analytical case where one or several components are transparent in the spectral range being analyzed. This case was studied by decreasing the maximum intensity of one component in a mixture until it reached zero. This effect has also been investigated as a function of the spectral noise level, and the results are shown in Figure 5. For each point in this surface, the maximum intensities of all other components were randomized (to obtain best conditions), while the maximum intensity of a single component (A) was kept constant. The average error increases when A approaches zero and when the noise level is increased. However, even at A = 0, which is the case of a transparent component, reasonable results can be obtained. The noise effect is somewhat more intense at small A than at large values. At the highest noise level and zero A, the error is rather high, yet not catastrophic.

Sampling Optimization. An important experimental issue is the design of the sampling process. The question is whether one can optimize the sampling process in order to obtain the best results. For a given spectral system, we are looking for the best selection of the relevant measurement parameters, as described in the following. The two most relevant spectroscopic parameters are the measurement range and its wavelength location. Referring to the spectroscopic example of Figure 1, one could ask at which wavelengths the experimental measurements should be located for the best analytical performance. Such questions, regarding the sampling of the spectrum, are often of practical relevance. Results for a spectral system such as that of Figure 1 are shown in Figure 6. The average error is presented as a function of the selected width of the spectral window and the location of this window. A segment (along the spectrum), in which all measurements are to be carried out, has to be selected. We vary the width of this segment and the location of its center. The wavelength units in this figure are the same as in Figure 1. For each point in this plot, a spectral window (segment) was selected and placed at the given central wavelength. Then, measurements were randomly distributed in this wavelength range. Obviously, too narrow a spectral window provides poor results, because of almost linear dependence conditions. On the other hand, when the window becomes too wide, the error increases again, since some points will be placed out of the informative range. For similar reasons, the central wavelength has an optimal location as well. Therefore, the surface of the analytical error spanned in these two coordinates has a global minimum. This means that the sampling can be optimized so that the maximum entropy solution provides the best results. However, this minimum is rather broad, indicating that one can still use the method at reasonable conditions that are not fully optimized. The minimum is more pronounced at narrow windows; thus, in such cases more attention should be paid to the spectral location of the measurements.

Figure 6. Analytical error as a function of the sampling program. Width of the spectral window and central wavelength are varied. The observed global minimum indicates that the sampling program can be optimized. Spectral characteristics were randomized, and a noise level of 0.05% was introduced.

CONCLUDING REMARKS

An algorithm for the analysis of multicomponent systems using limited data points has been developed and evaluated. This algorithm is suitable for low-cost chemical sensors (e.g., spectroscopic sensors) and for other underdetermined analytical systems. The method is completely free of introduced bias: any other solution of the underdetermined problem must assume a parametric model, and the analytical results then depend on the particular model chosen. The maximum entropy approach provides the unique solution that does not assume any unknown model. The performance of this method has been evaluated for a variety of noise levels, spectral characteristics, and deviations from Beer's law. The simulated data include realistic noise levels and sources of error. In all cases, the changes in the observed errors were moderate, indicating a general stability of the numerical procedures and of the algorithm. The study of the effect of the sampling design showed that an optimal sampling program can be achieved, such that the maximum entropy solution provides the best analytical performance. Nonetheless, the maximum entropy result can be far from correct when the true concentration set has low entropy. It is emphasized that regardless of the average errors predicted by our simulations, this method is a unique and rational solution to the problem of underdetermined systems that are subjected to constraints (experimental measurements).
ACKNOWLEDGMENT

We thank R. D. Levine for an introduction to the field of maximum entropy methods. This study was supported by the Israeli Ministry of Science and the Arts.

Received for review March 28, 1995. Accepted October 2, 1995. (Abstract published in Advance ACS Abstracts, November 1, 1995.)

AC950302X