Anal. Chem. 1984, 56, 995-1003
multivariate absorbance data provided by a diode array detector provides signal averaging and a signal-to-noise improvement. Since curve resolution performs the quantitative resolution using all the absorbance information, the problem of varying apparent resolution, which affects the perpendicular dropline method, is avoided. The accuracy of the dropline method is greatest when the two chromatographic peaks are of equal size, and it progressively decreases as the relative difference in the individual bands increases. Curve resolution performs the quantitation in the eigenvector space using all the wavelengths recorded and then displays the results at the selected wavelength; it is therefore less susceptible to errors caused by changing the wavelength at which the quantitation is displayed. The limit to the accuracy obtainable using curve resolution depends on the degree of chromatographic resolution and the degree of spectral uniqueness. Curve resolution makes use of two assumptions: first, only nonnegative quantities of material are present, and, second, only nonnegative absorbances are permitted. To ensure that these assumptions are valid, the base line absorbance immediately before and after the coeluting band must be tested to ensure that the base line is not negative at any of the stored wavelengths. Experience with the HP 1040A detector has indicated that drift in the detector base line can result in negative absorbance being recorded. Proper base line subtraction to remove this detector offset is the most significant practical limitation in determining the accuracy of the quantitative results obtained. The problem of base line correction is not limited to curve resolution. It is also a factor in determining the accuracy when quantitation is accomplished by using any method to separate partially resolved peaks.
ACKNOWLEDGMENT
The authors wish to express their appreciation to Steve George of Hewlett-Packard for loan of a 1040A detector.
Registry No. Nia, 98-92-0; Rib, 83-88-5.
LITERATURE CITED
(1) Davis, J. M.; Giddings, J. C. Anal. Chem. 1983, 55, 418.
(2) Yost, R.; Stoveken, J.; MacLean, W. J. Chromatogr. 1977, 134, 73.
(3) Poile, A. F.; Coulon, R. D. J. Chromatogr. 1981, 204, 149.
(4) Knorr, F. J.; Thorsheim, H. R.; Harris, J. M. Anal. Chem. 1981, 53, 821.
(5) McCue, M.; Malinowski, E. R. J. Chromatogr. Sci. 1983, 21, 229.
(6) Lawton, W. H.; Sylvestre, E. A. Technometrics 1971, 13, 617.
(7) Sharaf, M. A.; Kowalski, B. R. Anal. Chem. 1981, 53, 518.
(8) Sharaf, M. A.; Kowalski, B. R. Anal. Chem. 1982, 54, 1291.
(9) Spjøtvoll, E.; Martens, H.; Volden, R. Technometrics 1982, 24, 173.
(10) Martens, H. Anal. Chim. Acta 1979, 112, 423.
Received for review September 15, 1983. Accepted January 24, 1984. The work was supported in part by the Department of Energy. D.W.O. was a Chevron Research Fellow during the time this work was performed.
Statistical Approach for Estimating the Total Number of Components in Complex Mixtures from Nontotally Resolved Chromatograms
David P. Herman, Marie-France Gonnord, and Georges Guiochon*
Laboratoire de Chimie Analytique Physique, Ecole Polytechnique, 91128 Palaiseau Cedex, France
A recently proposed statistical model of peak overlap in complex mixtures is tested by computer simulation. The theory, originally conceived by Davis and Giddings, assumes that solute retention in complex mixtures is random and can thus be described by Poisson statistics. Consistent with this model, the present computer simulation results indicate that by defining the number of observed peaks, $p$, in a chromatogram as being the number of occurrences of peak maxima, over a limited range of peak capacities, $N_c$, the number of observed peaks is related in a simple manner to the column's peak capacity through the relationship $\ln (p - 1) = \ln (m - 1) - m/(N_c - 1)$, where $m$ is the actual true number of components (resolved and unresolved). Significant deviations from the log-linear relationship between peak number and reciprocal peak capacity are, however, observed when $m/(N_c - 1)$ ratios are greater than 1.0. A method of determining this ratio knowing neither $m$ nor $N_c$ independently is provided. The theory is tested to the extent that real-world complex mixture chromatograms can be made to adhere to its underlying assumptions by using TLC retention data from the literature and from data generated herein for a crude oil sample and one of its distillates by GLC. The theory is also applied to estimate the total number of components present in the crude oil samples and the probability that any one observed peak in their optimum efficiency chromatograms is a quantifiable singlet.
© 1984 American Chemical Society
ANALYTICAL CHEMISTRY, VOL. 56, NO. 6, MAY 1984
Column performances and technologies have reached the point where high efficiency columns (e.g., half a million theoretical plates in open tubular capillary gas chromatography) have come into common use and are now commercially available. The primary impetus behind their development has been the growing need to analyze real mixtures of increasing complexity. Despite these very high efficiencies, many complex mixtures have been shown to contain many unresolved peaks. Although it is the state of the art of the chromatographer's skill to adjust phase selection and column characteristics so as to improve the resolution of overlapping components, when a large number of compounds within a single sample have to be analyzed, experience shows that it is virtually impossible to resolve all of them on any one single stationary phase. Hence, our ability to perform qualitative and quantitative analyses of complex mixtures based solely on chromatographic data is severely limited. For example, accurate quantitative analyses based upon peak height or peak area measurements require either that each chromatographic peak correspond to a single identifiable chemical component (i.e., require extremely high column efficiencies) or that the detector response be highly specific to the compounds of interest. The latter solution, although viable, results in an overall loss of chemical information and does not allow an estimate of the total number of components to be made. Real-time spectral generating detectors, on the other hand, dramatically increase the amount of available chemical data and can subsequently be used to spectrally resolve severely overlapping peaks via a variety of numerical methods (1, 2). The nonuniqueness of elution profiles does, however, pose some particular problems in the qualitative analysis of complex mixtures. For example, in GC/MS analyses automatic direct searching routines often fail to perform correct spectrum identifications when mixtures of two or more components coelute (e.g., geometric isomers of an aromatic series). This explains the great deal of current interest in multiple component deconvolution routines used to determine the number of individual components in a multicomponent peak as well as their individual spectra. It is therefore constructive to obtain an estimate of the occurrence of overlapping peaks in real chromatograms. Using combinatorial analysis, Rosenthal was the first to point out that the occurrence of overlapping components in complex chromatograms is far more prevalent than may have been previously realized (3). He obtained excellent agreement between his theoretical prediction of the probability of a given configuration of singlets, doublets, ..., multiplets and the one experiment provided. Another theory of component overlap is that proposed by Davis and Giddings (4), which is based on the assumption that the component peaks of complex mixtures arrange themselves randomly along the elution axis. Assuming a Poisson distribution of retention times, they derived a simple expression which allows one to estimate the total number of components (resolved and unresolved) in a sample by evaluating the number of observed peaks as a function of the column's peak capacity. Once the total number of components in a nontotally resolved sample is known, the theory further allows the number of singlets, doublets, triplets, etc. in any chromatogram of known peak capacity to be calculated. As noted by its authors, however, the latter theory is expected to be most valid for equal component peak height chromatograms.
Real-world complex mixtures obviously produce chromatograms exhibiting unequal peak height distributions and are perhaps more realistically described as being Poisson distributed. There are some published data, however, suggesting that the peak height distributions of many complex chromatograms may in fact be exponential in character (5). Hence, one of the goals of this work was to determine the effect of unequal component peak heights on the ability of the theory to correctly predict the actual number of components in a mixture via computer generation of complex chromatograms possessing equal, random, and exponential component peak height distributions. In addition, we wish to critique the different assumptions of the theory as they relate to real-world complex samples and to determine its range of validity. That is to say, we have attempted to determine to what extent complex chromatograms do or, by judicious choice of separation parameters, can be made to adhere to its underlying assumptions of random component retention times, constant peak densities, and constant peak widths. For these purposes we have utilized both thin-layer and liquid chromatographic data abstracted from the literature and data generated herein by analyzing a crude oil sample and one of its distillates on a capillary GC column. The range of validity of the simple algebraic expression referred to above relating the observed number of peaks to the theoretical peak capacity was determined by enumerating peaks at varying carrier gas flow rates in the GC experiments (i.e., at varying column efficiencies/peak capacities). For these experiments, argon was chosen as the carrier gas so that a small change in the carrier gas flow rate would result in a large change in peak capacity. Finally, once experimentally verified, the theory was applied to estimate the average number of occurrences of singlet, doublet, and triplet peaks in the crude oil and distillate chromatograms obtained under optimum efficiency conditions.
EXPERIMENTAL SECTION
Equipment and Reagents. A Varian Model 3700 (Varian, Walnut Creek, CA) gas chromatograph equipped with a dual differential flame ionization detector was used. U-grade argon (Air Liquide, France) was utilized exclusively as the carrier gas. A 27 m long, 110 μm i.d. Pyrex glass capillary column statically coated with OV-1 (6) was prepared in the laboratory, generating over 200 000 theoretical plates ($k'$ = 3.25) at optimum flow velocities. Simulated chromatograms were generated on a HP 21 MX (Hewlett-Packard, Palo Alto, CA) computer. Programs were written in the FORTRAN language. Solvents were UV grade reagents obtained from Merck (Merck, Darmstadt, West Germany). The crude oil sample was obtained from the Emeraude oil field (Congo).
Procedure. The crude oil sample was subjected to a prior liquid-liquid extraction procedure so as to separate the aromatic and aliphatic fractions (7). Only the aromatic fraction was studied in this work. The crude oil PAH were dissolved in methylene chloride (2% v/v) and directly injected onto the capillary GC column. Synthetic standard test mixtures of 1-methylnaphthalene, 2,7-dimethylnaphthalene, 1,8-dimethylnaphthalene, and 2,3,5-trimethylnaphthalene in methylene chloride (1% v/v) were chromatographed in a manner identical with that of the PAH fractions. The standard deviations of their elution profiles were used to calculate peak capacities over the interval of interest. Chromatographic conditions were as follows: with a split ratio of 1:150, 0.4 μL of either PAH sample or test mixture was injected and analyzed with a 50 to 200 °C linear temperature program at linear flow rates of 8.02, 23.26, 31.25, 38.46, 46.15, and 69.77 cm/s. In order to keep the retention indexes constant for all eluates, the temperature programming rate was increased in proportion to the linear flow rate; the proportionality factor was 2.24 × 10^[exponent illegible] °C/cm.
A light petroleum hydrocarbon distillate, fulgene, was analyzed in like manner on the same column as above at five different flow rates of 14.0, 27.7, 40.3, 52.2, and 78.5 cm/s. The proportionality factor between the temperature programming rate and linear flow rate was instead 1.24 × 10^[exponent illegible] °C/cm. Finally, peak capacities were calculated from peak variances of normal undecane and normal dodecane synthetic standard mixtures.
THEORY
The underlying assumption of the theory of component overlap recently submitted by Davis et al. (4) is that component peaks of complex mixtures arrange themselves randomly along the chromatographic time scale and can thus be described by use of Poisson statistics. A unique property of the Poisson distribution is that distances between consecutive points, $x'$, are exponentially distributed and can be described by the following normalized probability function:
$$P(x') = \lambda e^{-\lambda x'} \qquad (0 < x' < \infty) \tag{1}$$
where $\lambda$ is the peak density given as the ratio of the number of components, $m$, to the total interval length $X$. Integration of eq 1 with respect to $x'$ between zero and $x_0$ gives the probability that any two consecutive points will fall within the distance $x_0$ and is given as
$$P(x' < x_0) = 1 - e^{-\lambda x_0} \tag{2}$$
The probability that two consecutive points are separated by a distance greater than $x_0$ is the complementary function
$$P(x' \geq x_0) = e^{-\lambda x_0} \tag{3}$$
If we assign to $x_0$ a value corresponding to the minimum distance between two Gaussian peaks in a chromatogram required to achieve a sufficient degree of resolution so as to be able to identify them as two distinct peaks, eq 2 and 3 can be used to derive an expression for the probability of observing singlets, doublets, and higher order n-tets. The expression
derived by Davis et al. for this probability is
$$P(n) = (1 - e^{-\lambda x_0})^{n-1} e^{-2\lambda x_0} \tag{4}$$
where $n$ = 1, 2, 3, ... for singlets, doublets, triplets, etc. (4). The total number of observed peaks, $p$, is then given simply as the sum of the number of singlets, doublets, triplets, etc. and can be written as
$$p = m \sum_{n=1}^{\infty} P(n) = m e^{-\lambda x_0} \tag{5}$$
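The step from eq 4 to eq 5 is a geometric series. It is written out below as a short sketch; the shorthand $a = \lambda x_0$ is introduced here only for brevity and is not a symbol used in the original derivation.

```latex
\begin{align*}
\sum_{n=1}^{\infty} P(n)
  &= e^{-2a} \sum_{n=1}^{\infty} \left(1 - e^{-a}\right)^{n-1} \\
  &= e^{-2a} \cdot \frac{1}{1 - \left(1 - e^{-a}\right)} \\
  &= e^{-a}, \qquad a = \lambda x_0 .
\end{align*}
```

The series converges because $0 \leq 1 - e^{-a} < 1$ for any $a \geq 0$, and multiplying the result by $m$ gives $p = m e^{-\lambda x_0}$.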
Initially defining the peak capacity, $N_c$, of a chromatographic interval of length $X$ to be $X/x_0$, eq 5 can be rewritten as
$$p = m e^{-m/N_c} \tag{6}$$
or in logarithmic form
$$\ln p = \ln m - m/N_c \tag{7}$$
Thus, the total number of observed peaks for a given complex mixture chromatographed on columns of identical selectivities but of differing efficiencies (i.e., differing peak capacities) is log-linearly related to the reciprocal of the peak capacity. Equation 7 allows an estimate of the number of components actually present in a mixture to be made from peak number-peak capacity data where a plot of $\ln p$ vs. $1/N_c$ is predicted to be linear with slope $-m$ and intercept $\ln m$. Each of the preceding seven equations can be found derived in greater detail in the original work of Davis et al. (4) and is presented herein so as to facilitate a discussion of our modifications of these equations consistent with our slightly different definition of the peak capacity. For the purposes of this publication the peak capacity of a chromatographic interval of length $X$ (or, if measured in time units, $\Delta T$) is defined as the maximum number of equally spaced Gaussian peaks that may be placed within an interval such that one observes exactly one peak maximum for every Gaussian peak contained therein. For Gaussian peaks of equal standard deviation, $\sigma$, and amplitude, the peak capacity may be written as
$$N_c = (\Delta T + x\sigma)/(x\sigma) = (\Delta T + x_0)/x_0 \tag{8}$$
where $x$ is the minimum number of $\sigma$ units of separation between the $N_c$ equally spaced peaks required to observe exactly $N_c$ peak maxima. The minimum value of $x$ for which such a one-to-one correspondence between peak maxima and component peak number exists is easily shown to be 2.063 for noise-free signals. Note that our definition of the peak capacity of an interval is exactly unity greater than that used above to derive eq 6 and will be adopted herein so as to allow for the real possibility that two peak maxima may occur exactly at each of the two interval boundaries.
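As a minimal sketch of how eq 7 might be applied, the following Python fragment fits a straight line to ln p versus 1/Nc by ordinary least squares and reads m from both the slope and the intercept. The function name `fit_eq7`, the component number m = 100, and the peak capacities are illustrative assumptions and do not come from the original work. The data are generated noise-free from eq 6, so the fit simply recovers the input value.

```python
import math


def fit_eq7(peak_capacities, observed_peaks):
    """Least-squares fit of ln p = ln m - m/Nc (eq 7).

    Returns (m_slope, m_intercept): m taken from the negative slope
    and m taken from exp(intercept).
    """
    xs = [1.0 / nc for nc in peak_capacities]
    ys = [math.log(p) for p in observed_peaks]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return -slope, math.exp(intercept)


# Hypothetical, noise-free data generated from eq 6 (assumed m = 100).
true_m = 100
capacities = [150.0, 200.0, 300.0, 500.0]
peaks = [true_m * math.exp(-true_m / nc) for nc in capacities]

m_slope, m_intercept = fit_eq7(capacities, peaks)
print(round(m_slope, 3), round(m_intercept, 3))
```

With real peak counts the two estimates will generally differ, which is why the slope and intercept values are reported separately.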
From eq 3 we can write an expression for the probability that any pair of consecutive peaks is separated by a distance greater than $x_0$ (i.e., has an interjacent minimum) as
$$P(x' > x_0) = e^{-m/(N_c - 1)} \tag{9}$$
The expected number of minima in a chromatogram is thus obtained by multiplying this probability by $m - 1$, the maximum possible number of minima. Keeping in mind that the number of peak maxima is one greater than the number of interjacent minima, the expected number of peak maxima in a chromatogram is given by the expression
$$p - 1 = (m - 1) e^{-m/(N_c - 1)} \tag{10}$$
or in logarithmic form
$$\ln (p - 1) = \ln (m - 1) - m/(N_c - 1) \tag{11}$$
Note that eq 7 and 11 are very similar and will differ significantly only when enumerating a small number of peaks at low peak capacities. Equation 11 is, however, believed to be more fundamentally correct when defining a peak as the
occurrence of a peak maximum. For example, in the limit that the peak capacity of an interval approaches a value of 1.0, $m$ overlapping component peaks within that interval must sum to produce exactly one peak maximum. The numbers of peaks predicted by eq 7 and 11 in this limit are $m e^{-m}$ and 1.0, respectively. Hence, eq 11 was used exclusively in this work to compute $m$ from peak number-peak capacity data.
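Equation 10 can be checked numerically with a simplified point-process model. The sketch below is not the Gaussian-summation simulation used in this work: it places m points uniformly at random on an interval and counts a new peak whenever two consecutive points are more than $x_0$ apart. The function names, the seed, and the values m = 100 and Nc = 201 are assumptions chosen only for illustration.

```python
import math
import random


def count_maxima(m, interval, x0, rng):
    """Observed peaks = 1 + number of consecutive gaps wider than x0."""
    positions = sorted(rng.uniform(0.0, interval) for _ in range(m))
    gaps = (b - a for a, b in zip(positions, positions[1:]))
    return 1 + sum(1 for gap in gaps if gap > x0)


def predicted_maxima(m, nc):
    """Eq 10 rearranged: p = 1 + (m - 1) * exp(-m / (Nc - 1))."""
    return 1.0 + (m - 1) * math.exp(-m / (nc - 1.0))


rng = random.Random(1984)          # fixed seed so the run is repeatable
m, interval, nc = 100, 1.0, 201.0  # hypothetical values, m/(Nc - 1) = 0.5
x0 = interval / (nc - 1.0)         # from eq 8: Nc = (interval + x0) / x0
trials = 2000

mean_p = sum(count_maxima(m, interval, x0, rng) for _ in range(trials)) / trials
print(mean_p, predicted_maxima(m, nc))
```

Under these assumptions the simulated mean peak count should lie close to the eq 10 prediction. Because the sketch ignores peak shape and unequal peak heights, it says nothing about the deviations discussed for high $m/(N_c - 1)$ ratios.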
RESULTS AND DISCUSSION
Computer Simulation. In their original derivation of eq 7, Davis and Giddings assumed the component peaks of complex chromatograms to be of equal peak height so that the minimum distance of approach criterion, $x_0$, remained constant and determinable over the chromatographic interval of interest. It must be realized, however, that $x_0$ will remain essentially constant for equal peak height chromatograms only under the conditions of low peak density, i.e., where severely overlapping components do not sum to produce chromatograms exhibiting unequal peak heights. Obviously, complex mixtures are composed of components covering a wide range of concentrations and hence produce chromatograms exhibiting widely varying peak height distributions regardless of peak density. The computer simulations described below demonstrate the ability of eq 11 to give a "reasonable" estimate of the true number of components in hypothetical mixtures whose component zones have retention times distributed according to the Poisson law and for which peak heights are either equally, randomly, or exponentially distributed. Chromatograms of complex mixtures were simulated by summing $m$ Gaussian distributions whose first moments, u(i), were randomly distributed over the time interval of the chromatogram, $\Delta T$, and can be represented by the equation
[...] n > 2, widely spaced data points in the linear range at high peak capacities so that a true regression analysis could be performed with n - 2 degrees of freedom. In the current example, however, this was not readily possible because we were in effect "efficiency limited" in such a way that additional usable data (i.e., at $m/(N_c - 1)$ < 0.8) differing significantly in value from those presented could not be collected on the capillary column used in this study. From these two data points, values of $m_{slope}$ and $m_{intercept}$ were calculated to be 64.7 and 70.4, respectively. Arbitrarily computing the average of $m_{slope}$ and $m_{intercept}$ and rounding the result to the nearest integer, we estimate the actual number of components present in the fulgene sample, whose retention indexes lie between 1100 and 1200 noninclusive, to be 68. Compare this number to the 42 peaks that were observed in this interval under the optimum flow rate conditions shown in Figure 7. This result implies that the average multiplicity of each of the observed 42 peaks is approximately 1.6 and that
the probability that any one of the 68 components will appear as a singlet (estimated from eq 4 where $N_c$ was equal to 120.7) is 32%. Thus, of the 42 observed peaks, the theory predicts that only 22 appear as singlets, the remainder being higher order multiplets. Also shown in Figure 9 is the $\ln (p - 1)$ vs. $(N_c - 1)^{-1}$ plot for the crude oil sample in which the enumeration interval was defined for those components eluting between 74.63 and 100.84 min, noninclusive (i.e., between those peaks identified as 1-methylnaphthalene and ethylmethylnaphthalene, respectively). Values of $\Delta T(f)$ were calculated by use of eq 15 where $T_{r2}(f)$ and $T_{r1}(f)$ were the ethylmethylnaphthalene and 1-methylnaphthalene retention times, respectively. $\sigma_{av}(f)$ values were calculated at each flow rate by averaging the peak standard deviations of the four-component standard test mixture specified above whose retention times spanned the designated interval. The plot is observed to exhibit a significant degree of curvature between reciprocal peak capacities of 0.0053 and 0.0110, which most likely indicates that $m/(N_c - 1)$ ratios are significantly greater than 0.8 at most of the lower peak capacities. Hence, of the available data, fit of eq 11 to the two data points at the two highest peak capacities ($N_c$ equal to 190.5 and 92.2) should give the best estimate of $m$. The calculated best fit values of $m_{slope}$ and $m_{intercept}$ from these two data points were 138.6 and 137.2, respectively. Averaging and assuming an actual component number of 138, the average multiplicity of each of the 67 peaks observed in the optimum efficiency chromatogram shown in Figure 6 for those components eluting within the above-specified interval is 2.06. Moreover, the probability of observing any one of these 138 components as a singlet is estimated from eq 4 to be 23.7%. This in turn implies that, of the observed 67 peaks, only 33 appear as singlets.
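The multiplicity and singlet figures quoted above can be approximated with a few lines of Python. This is a sketch under a stated assumption: the singlet probability is taken as eq 4 with n = 1 and $\lambda x_0 = m/N_c$, which the text does not spell out, so small differences from the quoted percentages are to be expected. The function names are illustrative.

```python
import math


def singlet_probability(m, nc):
    """Eq 4 with n = 1, assuming lambda * x0 = m / Nc: P(1) = exp(-2 m / Nc)."""
    return math.exp(-2.0 * m / nc)


def summarize(m, nc, observed_peaks):
    """Average multiplicity, singlet probability, expected singlet count."""
    p1 = singlet_probability(m, nc)
    return {
        "multiplicity": m / observed_peaks,
        "p_singlet": p1,
        "singlets": m * p1,
    }


# Values of m, Nc, and observed peak number as quoted in the text.
fulgene = summarize(m=68, nc=120.7, observed_peaks=42)
crude_oil = summarize(m=138, nc=190.5, observed_peaks=67)

for name, result in (("fulgene", fulgene), ("crude oil", crude_oil)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

For fulgene this gives a multiplicity near 1.6, a singlet probability near 32%, and about 22 singlets. For the crude oil it gives a multiplicity near 2.06 and a singlet probability of roughly 23 to 24%, close to the figures in the text.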
Hence, there is a greater than 50% chance that an attempt to quantitate a given component based solely on a peak height or peak area measurement will be in error. Columns having nearly $2.8 \times 10^8$ and $3.4 \times 10^9$ theoretical plates (i.e., respective peak capacities of approximately 2320 and 7850 for a 26-min time interval as above) would be required to reduce this probability to more acceptable levels of 5% and 1%, respectively. It is at this point instructive to obtain an estimate of the precision with which $m$ = 138 was determined from eq 11 for the crude oil sample. Because only two data points were used in its determination, an estimate of the uncertainty in each of the two $\ln (p - 1)$ values at the two highest peak capacities is possible only by reference to the simulation experiments in which 20 independent simulation experiments were conducted at each $N_c$. Assuming that the uncertainty in $p$ values in both simulation and real chromatograms is due primarily to the inability of a finite number of points to represent ideal randomness, the standard deviations in peak numbers ($p$ = 67 and 31 in Figure 9) at the two highest peak capacities are estimated to be 3.5 and 1.6, respectively. These values were obtained by noting that the relative standard deviations in $p$ at each peak capacity for the simulated, $m$ = 200, exponential peak height chromatogram were nearly constant at 0.053. After a transformation to logarithmic coordinates, it is easily shown by propagation of error that the standard deviation in $m_{slope}$ from these two data points is ±13 components. The reader should realize that this value is only a crude estimate of the precision that may be anticipated when $m$ is determined by only two data points in or near the linear range of the function. Improved precision estimates will require that many data points in the linear range be accessible such that regression fit of several data points allows $m$ and its uncertainty to be determined unambiguously.
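The two-point estimate and its uncertainty can be reproduced approximately as follows. This sketch assumes first-order error propagation with independent errors in the two peak counts, $\sigma_{\ln(p-1)} \approx \sigma_p/(p - 1)$, and no uncertainty in $N_c$; the text does not state the exact form used, and the function names are illustrative.

```python
import math


def m_from_two_points(nc_a, p_a, nc_b, p_b):
    """Slope estimate of m from eq 11 using two (Nc, p) data points."""
    x_a, x_b = 1.0 / (nc_a - 1.0), 1.0 / (nc_b - 1.0)
    y_a, y_b = math.log(p_a - 1.0), math.log(p_b - 1.0)
    return -(y_b - y_a) / (x_b - x_a)


def sigma_m_slope(nc_a, p_a, sd_a, nc_b, p_b, sd_b):
    """Standard deviation of the slope estimate (assumed first-order form)."""
    x_a, x_b = 1.0 / (nc_a - 1.0), 1.0 / (nc_b - 1.0)
    # Variance of the difference of the two ln(p - 1) values.
    var_dy = (sd_a / (p_a - 1.0)) ** 2 + (sd_b / (p_b - 1.0)) ** 2
    return math.sqrt(var_dy) / abs(x_b - x_a)


# Crude oil data quoted in the text: Nc = 190.5 (p = 67) and Nc = 92.2 (p = 31),
# with standard deviations in p of 3.5 and 1.6.
m_slope = m_from_two_points(190.5, 67, 92.2, 31)
sd_m = sigma_m_slope(190.5, 67, 3.5, 92.2, 31, 1.6)
print(round(m_slope, 1), round(sd_m, 1))
```

Under these assumptions the slope estimate comes out near 138.6 with a standard deviation of about 13 components, in line with the values given above.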
An independent method of estimating $m/(N_c - 1)$ ratios from real chromatograms is obviously needed so that the analyst can be reasonably sure that the peak number-peak capacity data over which regression fit of eq 11 is performed are such that $m/(N_c - 1) \leq 0.8$. In illustration, if the data point at $N_c$ = 190.5 in Figure 9 for the crude oil sample had not been obtainable (i.e., had we collected these data on a column of much lower efficiency), $m_{slope}$ and $m_{intercept}$ values calculated from the two data points at $N_c$ = 92.2 and 72.9 would have been 76.0 and 70.0, respectively. Hence, our estimate of $m$ would have been in error by nearly a factor of 2. Fortunately, it is possible to prerecognize those individual chromatograms in which $m/(N_c - 1)$ ratios are greater than or equal to 0.8 without knowing either $m$ or $N_c$ individually by observing the relative frequency with which the chromatographic signal reaches base line. For randomly displaced peaks, the number of minima in a chromatogram which reach base line (to within some small tolerance) can be written as
$$B(f) = (m - 1) e^{-\lambda x'' \sigma} \tag{16}$$
where $x''$ is the number of units of $\sigma$ separation between two peak maxima required such that the interjacent minimum reaches the base line to within some small predefined value. At $6\sigma$ separation, for example (i.e., a resolution of 1.50), the height of an interjacent minimum above base line is approximately 2.2% of the average height of the two encompassing peak maxima. Utilizing this criterion, the ratio of the number of base line occurrences, $B(f)$, to the number of observed peaks, $p(f)$, in a chromatogram is given approximately as
$$B(f)/p(f) \approx [e^{-2.1\lambda\sigma}]^{(6 - 2.1)/2.1} \cong [e^{-m/(N_c - 1)}]^{1.857} \tag{17}$$
where $N_c$ is defined as above at 2.10 $\sigma$ separation. Equation 17 thus allows a rough estimate of the ratio $m/(N_c - 1)$ to be made from experimentally determinable parameters without prior knowledge of either $m$ or $N_c$. Because the number of base line occurrences in real chromatograms is likely to be subject to substantial errors owing to noise and nonlinear sloping base lines (typical under temperature programming conditions), it should be stressed that eq 17 should be used only as a rough guideline as to whether a peak number-peak capacity datum should be included in or excluded from the regression analysis. With these limitations and criteria in mind we therefore conclude that any chromatogram in which the ratio of base line occurrences to peak number is significantly less than 23% ($m/(N_c - 1) \geq 0.8$) should not be used to estimate $m$ by means of eq 11. The relative base line frequencies of the crude oil chromatograms at $N_c$ equal to 190.5, 92.2, and 72.9 were determined and found to be approximately 28, 10, and 8%, respectively.
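Equation 17 can be inverted to give a rough value of $m/(N_c - 1)$ from a measured base line frequency. The following is a minimal sketch with illustrative function names; the exponent is taken as (6 - 2.1)/2.1, and the 0.8 cutoff and the three base line frequencies are those quoted in the text.

```python
import math

EXPONENT = (6.0 - 2.1) / 2.1  # about 1.857, from the 6 and 2.1 sigma criteria


def baseline_fraction(ratio):
    """Eq 17: expected B/p for a given m/(Nc - 1) ratio."""
    return math.exp(-ratio * EXPONENT)


def ratio_from_baseline_fraction(b_over_p):
    """Eq 17 solved for m/(Nc - 1), given a measured B/p."""
    return -math.log(b_over_p) / EXPONENT


threshold = baseline_fraction(0.8)  # B/p at the 0.8 cutoff
print(round(threshold, 3))

# Base line frequencies reported for the crude oil chromatograms.
for nc, b_over_p in ((190.5, 0.28), (92.2, 0.10), (72.9, 0.08)):
    ratio = ratio_from_baseline_fraction(b_over_p)
    verdict = "usable" if ratio < 0.8 else "exclude"
    print(nc, round(ratio, 2), verdict)
```

On these numbers only the chromatogram at the highest peak capacity falls below the 0.8 cutoff. Given the sensitivity of base line counts to noise and drift, the result is best read as a screening aid rather than a measurement.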
Hence, in the example given above we are prewarned that a determination of $m$ based solely on the two data points at $N_c$ = 92.2 and 72.9 would be in serious error. It should be noted that although $m/(N_c - 1)$ is less than 0.8 at $N_c$ = 190.5, the ratio is significantly greater than 0.8 for the lower of the two data points ($N_c$ = 92.2) utilized above to calculate $m$ = 138 for the crude oil sample by fit of eq 11. By inference from the conclusions derived in the simulation experiments for exponentially distributed peak heights, we postulate that the calculated component number of 138 in actuality somewhat underestimates the true number of observable components eluting within the designated interval. A better estimate of $m$ would require that we use a column having a significantly greater number of theoretical plates so that several data points in which $m/(N_c - 1)$ < 0.8 could be collected. The analyst should further realize that this calculated $m$ value is most certainly a lower estimate of the true number of components within the sample anyway, for the theory cannot take into account those components whose concentrations are so small that their peaks are lost in the noise signal, i.e., defined herein at S/N < 2. The conclusion above nevertheless remains valid despite this slight error in our estimate of $m$. That is, the crude oil sample, even after preextraction, is much too complex to allow either qualitative or quantitative analyses to be performed with any reasonable degree of confidence on columns having less than several hundred million theoretical plates. Clearly, the currently available high efficiency columns of nearly one million theoretical plates alone are not sufficient to analyze mixtures of this complexity but must be used instead in conjunction with either highly specific or spectral generating detectors.
LITERATURE CITED
(1) Knorr, F. J.; Thorsheim, H. R.; Harris, J. M. Anal. Chem. 1981, 53, 821-825.
(2) Woodruff, H. B.; Tway, P. C.; Cline Love, L. J. Anal. Chem. 1981, 53, 81-84.
(3) Rosenthal, D. Anal. Chem. 1982, 54, 63-66.
(4) Davis, J. M.; Giddings, J. C. Anal. Chem. 1983, 55, 418-424.
(5) Nagels, L. J.; Creten, W. L.; Vanpeperstraete, P. M. Anal. Chem. 1983, 55, 216-220.
(6) Gonnord, M. F.; Guiochon, G.; Onuska, F. I. Anal. Chem. 1983, 55, 2115-2120.
(7) Natusch, D. F. S.; Tomkins, B. A. Anal. Chem. 1978, 50, 1429-1434.
(8) Snyder, L. R. J. Chromatogr. Sci. 1972, 10, 200-212.
(9) Mattox, V. R.; Litwiller, R. D.; Carpenter, P. C. J. Chromatogr. 1979, 175, 243-260.
RECEIVED for review October 5, 1983. Accepted January 16, 1984.