
Nonlinear Calibration. Lowell M. Schwartz. Anal. Chem., 1977, 49 (13), pp 2062-2069. DOI: 10.1021/ac50021a043. Publication Date: November 1977.
CONCLUSIONS

The results show that, even with this relatively simple approach, satisfactory results can be obtained with solid samples. Thus PTS is a simple method of obtaining the band-gap energy of semiconductors and has recently also proved useful in studying the spectral and electrical potential dependence of the temperature of a semiconductor electrode under irradiation (20). This latter application may be especially useful, since few methods exist for studying the spectroscopy of an operating electrode in situ. Improvements in cell design, apparatus, and instrumentation are possible and these would improve the response and make the treatment of the data more convenient. The addition of a well-controlled thermostat bath around the sample would decrease base-line corrections. Similarly, a matched pair of thermistors in a differential arrangement, both in contact with the samples but one near the irradiated portion and one in the dark, could be employed. This idea has already been applied and the base-line drift has been almost completely eliminated even in an unthermostated bath. With this arrangement it might also be possible to construct a dual-beam instrument for direct correction of the spectral output of the lamp-monochromator system. The PTS of solutions would be improved by finding a better reflective, but thermally conducting, coating for the thermistor which would decrease its background response. Such improvements, as well as further applications and a theoretical treatment of PTS, are currently being investigated.

ACKNOWLEDGMENT

We appreciate the advice and assistance of Neil Jespersen.

LITERATURE CITED

(1) A. Rosencwaig, Anal. Chem., 47, 592A (1975).
(2) W. R. Harshbarger and M. B. Robin, Acc. Chem. Res., 6, 329 (1973).
(3) M. J. Adams, A. A. King, and G. F. Kirkbright, Analyst (London), 101, 73 (1976).
(4) R. Gray, V. Fishman, and A. J. Bard, Anal. Chem., 49, 697 (1977).
(5) G. A. Crosby, J. N. Demas, and J. B. Callis, J. Res. Natl. Bur. Stand., Sect. A, 76, 561 (1972).
(6) J. B. Callis, M. Gouterman, and J. D. S. Danielson, Rev. Sci. Instrum., 40, 1599 (1969).
(7) P. G. Seybold, M. Gouterman, and J. B. Callis, Photochem. Photobiol., 9, 229 (1969).
(8) W. Lahmann and H. J. Ludewig, Chem. Phys. Lett., 45, 177 (1977).
(9) H. J. V. Tyrrell and A. E. Beezer, "Thermometric Titrimetry", Chapman and Hall Ltd., London, 1968.
(10) G. A. Vaughan, "Thermometric and Enthalpimetric Titrimetry", Van Nostrand-Reinhold, London, 1973.
(11) B. E. Graves, Anal. Chem., 44, 993 (1972).
(12) S. L. Cooke and B. B. Graves, Chem. Instrum., 1, 119 (1968).
(13) R. Tamamushi, J. Electroanal. Chem., 65, 263 (1975).
(14) K. S. V. Santhanam, N. Jespersen, and A. J. Bard, J. Am. Chem. Soc., 99, 274 (1977).
(15) S. L. Murov, "Handbook of Photochemistry", Marcel Dekker, New York, N.Y., 1973, p 114.
(16) W. H. McAdams, "Heat Transmission", McGraw-Hill, New York, N.Y., 1954, pp 462-464.
(17) The Sadtler Standard Spectra (Dyes, Pigments and Stains), Sadtler Research Laboratories, Philadelphia, Pa.
(18) J. W. Robinson, Ed., CRC "Handbook of Spectroscopy", Vol. II, CRC Press, Cleveland, Ohio, 1974.
(19) H. S. Carslaw and J. C. Jaeger, "Conduction of Heat in Solids", Oxford University Press, London, 1947, p 18.
(20) A. Fujishima, G. H. Brilmyer, and A. J. Bard, "Semiconductor-Liquid Interfaces under Illumination", Electrochemical Society, 1977, in press.

RECEIVED for review July 15, 1977. Accepted August 22, 1977. The support of this research by the National Science Foundation and the Army Research Office is gratefully acknowledged.

Nonlinear Calibration

Lowell M. Schwartz

Department of Chemistry, University of Massachusetts, Boston, Massachusetts 02125

This paper deals with the problem of finding the statistical uncertainty, expressed as confidence limits, for an analysis based on a measurement read through a nonlinear calibration or standard curve of arbitrary form. Several alternative methods with varying levels of computational complexity are proposed, and an illustrative example shows the results obtainable by each method.

A recent publication (1) has noted that when an analysis X is determined by reading a measurement Y through a nonlinear calibration curve, the probability distribution of X cannot be expected to be normal (Gaussian) even though the random scatter of Y is distributed normally. Consequently, the use of the statistics "mean" and "standard deviation" to describe the location and uncertainty, respectively, of X is somewhat problematical, as these measures apply unambiguously only to the normal distribution. In fact, the mean of replicate measurements of Y, which is the unbiased best estimate of the true Y value, when projected through the calibration curve, yields the mode of the X distribution function. This X value, although the best estimate of the true value of the analysis, deviates from the mean of the X distribution function as sketched in Figure 1 of Ref. 1. The purpose of this study is to explore the use of the statistic "confidence interval" (between two "confidence limits") as a more appropriate measure of the uncertainty of the nonnormal random variable X. This notion is not new, having been discussed long ago in connection with analyses projected through linear calibration lines (2-4). Later, the method was extended to situations involving measurements of several different samples where simultaneous or joint confidence intervals are required for all the corresponding analyses as read through a common linear calibration line (5-7). The further extension to nonlinear calibration curves, although cited formally by Miller (7), apparently has not been discussed in a practical way in the chemical literature.

The problem is formulated as follows: Calibrating measurements yi are taken of a series numbering n standard known samples xi. One or more measurements of yi may be made of each xi. The number of replications of each yi is denoted as ni. These data are used to construct a calibration curve y_cal(x) to represent the unknown functional dependence y(x), which is in general nonlinear. A number N of unknown samples similar in nature to the calibrating samples are to be analyzed for the quantity X by measurement of Y in a manner identical to the calibrating measurements. Ni replicate Yi measurements may be made, and corresponding Xi values are determined by projecting the mean Ȳi through the calibration curve.
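The projection step itself can be sketched in a few lines. This is an illustration only, not taken from the paper's text: the tabulated standards below are the ȳi means from Table II of the illustrative example, and the segment-by-segment inversion anticipates the linear-segment calculation made there.

```python
# Sketch: project a measured mean Y-bar through a tabulated calibration
# curve by inverse linear interpolation (values from Table II below).
x_std = [0.750, 1.000, 1.250, 1.500, 1.750, 2.000, 2.250, 2.500, 2.750, 3.000]
y_std = [0.2450, 0.4509, 0.6011, 0.7506, 0.7938, 0.9029,
         0.9327, 0.9406, 0.9580, 0.9728]  # means of the triplicate y_ij

def project(y_bar, xs, ys):
    """Invert a monotone tabulated calibration curve segment by segment."""
    for x1, x2, y1, y2 in zip(xs, xs[1:], ys, ys[1:]):
        if y1 <= y_bar <= y2:
            return x1 + (x2 - x1) * (y_bar - y1) / (y2 - y1)
    raise ValueError("Y outside calibration range")

X = project(0.7814, x_std, y_std)  # mean of the six unknown readings
print(round(X, 3))  # 1.678, matching the linear-segment analysis
```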

We require confidence limits for the Xi as measures of the uncertainties of these analyses. The number of analyses N to be made through a particular calibration curve is indefinite and so we seek independent confidence limits, i.e., nonsimultaneous confidence intervals. Let us assume that the calibration curve is found from the (xi, yi) data by a least-squares method and that the functional form of the curve is

y_cal(x) = Σk Pk fk(x)

where the Pk denote the least-squares parameters and the functions fk(x) are, perhaps, integral powers of x, but not necessarily so. The parameters Pk are normally-distributed random variables whose variances reflect the scatter of the data points about the curve. Thus a prediction from this function for a given fixed x value is also a normally-distributed random variable, since it is calculated as a linear combination of the Pk. Several (M) measurements of the same unknown yield a mean Ȳ. This mean is a normally-distributed random variable in the sense that repeated determinations of M measurements at a time yield normally distributed Ȳ values. The mean of these Ȳ values is Yav and the projection of this through the calibration curve yields the result Xav for the analysis. Although Yav − y_cal(Xav) is identically zero, the difference e = Ȳ − y_cal(Xav) is not necessarily zero, since any particular Ȳ ≠ Yav. The difference e, however, has a mean value of zero and, in fact, is normally distributed about zero since both Ȳ and y_cal(Xav) are normally distributed. Writing the sample variance of e as var(e) = s²U(X), the confidence limits for an analysis are the two values of X satisfying

[Ȳ − y_cal(X)]² / [s²U(X)] = tc²    (1)

where tc is the critical value of Student's t at the chosen confidence level. The population variance σ² common to all the measurements is estimated by s², obtained by pooling the individual variances of the several samples, each weighted by the appropriate number of degrees of freedom:

s² = [Σi (ni − 1) var(yi) + Σi (Ni − 1) var(Ȳi)] / [Σi (ni − 1) + Σi (Ni − 1)]    (2)

where the mean of the replicates yij is

ȳi = (1/ni) Σj yij

LINEAR SEGMENTS

When the calibration curve is represented by straight-line segments drawn between adjacent calibrating points, Equation 1 becomes a quadratic in X (Equation 5a) which has real roots only if a criterion (Equation 6) limiting tc·s relative to |y2 − y1| is satisfied, and this condition should be checked before attempting to solve Equation 5a or to proceed graphically. The principal disadvantage of this linear segment approach is that this criterion fails frequently. If the calibration function is severely curved, then calibrating points must be closely spaced if the segmented representation is to be accurate. Thus |y2 − y1| must be small, and this causes the trouble. Faced with an unsatisfactory situation, there are several recourses: (1) Drawing line segments between more widely spaced points increases |y2 − y1| but sacrifices accuracy. (2) Accept narrower confidence limits. If, for example, 95% confidence limits lead to failure of the criterion, then 90% or 80% confidence limits yield a smaller tc, which may succeed. (3) Make additional y1 and y2 measurements of the standards x1 and x2 in order to increase n1 and n2. (4) Abandon the linear segments approximation in favor of one of the methods described in the following sections.

THREE-PARAMETER FUNCTION

Whereas the simplest computation involves a sequence of line segments, the next level of computational complexity would be based on fitting a two- or three-parameter empirical equation to the full range of calibration points. The most obvious equation of this nature is the three-parameter parabolic function y_cal = C0 + C1x + C2x², but this form might be generalized to y_cal = C0 + C1f1(x) + C2f2(x), where f1(x) or f2(x) or both are nonlinear functions of x. The difficulty with using calibration curves of this type is that the analyst must exercise judgment in selecting functions f1(x) and f2(x) suited to the particular shape of his calibration data. Having chosen f1(x) and f2(x), conventional methods of multiple regression may be used to find the adjustable parameters C0, C1, and C2.
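The pooled estimate of Equation 2 is easy to compute directly. A minimal sketch, using the replicate data of Table II in the illustrative example (ten yij triplicates plus the six replicate Y readings), reproduces the s² quoted there.

```python
# Sketch: pooled variance estimate of Equation 2, using the replicate
# data of Table II (ten triplicates of y_ij plus six Y readings).
def ss(reps):
    """Sum of squared deviations = (n - 1) * var for one replicate set."""
    m = sum(reps) / len(reps)
    return sum((r - m) ** 2 for r in reps)

y_reps = [
    [0.2160, 0.2960, 0.2230], [0.5103, 0.4120, 0.4304],
    [0.6342, 0.5483, 0.6209], [0.7932, 0.7721, 0.6865],
    [0.8903, 0.7134, 0.7777], [0.9218, 0.8925, 0.8945],
    [0.8966, 0.9657, 0.9359], [0.9176, 0.9298, 0.9745],
    [0.9847, 0.9401, 0.9493], [0.9237, 1.0285, 0.9661],
]
Y_reps = [0.7911, 0.7647, 0.7904, 0.7688, 0.8469, 0.7264]

num = sum(ss(r) for r in y_reps) + ss(Y_reps)            # total sum of squares
dof = sum(len(r) - 1 for r in y_reps) + len(Y_reps) - 1  # 20 + 5 = 25 d.f.
s2 = num / dof
print(round(s2, 5))  # 0.00222, as quoted in Table II
```

The degrees of freedom, Σ(ni − 1) + (Ni − 1) = 25, are the same ones on which the critical tc values of Table I are based.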


For fitting purposes the three-parameter function is written in the centered form

y_cal(x) = ȳ + C1(x − x̄) + C2[f(x) − f̄]    (7)

where the weighted means are x̄ = (1/ν) Σi wi xi, f̄ = (1/ν) Σi wi f(xi), and ȳ = (1/ν) Σi wi ȳi, with ν = Σi wi. If, as in the previous section, we assume that all the yi measurements are equally precise, the weighting factors wi equal ni, the number of replicate measurements at each xi, and ν is the sum total of all calibrating measurements. ȳ is one of the three adjustable parameters and the other two are found by regression analysis (8) to be

C1 = (Sff Sxy − Sfx Sfy)/Δ
C2 = (Sxx Sfy − Sfx Sxy)/Δ

where

Δ = Sxx Sff − Sfx²
Sxx = Σi wi xi² − ν x̄²
Sxy = Σi wi xi ȳi − ν x̄ ȳ
Sfx = Σi wi xi f(xi) − ν f̄ x̄
Sff = Σi wi [f(xi)]² − ν f̄²
Sfy = Σi wi ȳi f(xi) − ν ȳ f̄

Regression analysis also provides formulas for the coefficient variances and covariance:

var(C1) = σ² Sff/Δ
var(C2) = σ² Sxx/Δ
cov(C1, C2) = −σ² Sfx/Δ

where σ² is again the population variance of the yij and Yij measurements, which is to be estimated by s² of Equation 2. The variance in the denominator of Equation 1, when Equation 7 is used for y_cal(Xi), becomes s²U2(Xi), where

U2(Xi) = 1/Ni + 1/ν + (Xi − x̄)² Sff/Δ + [f(Xi) − f̄]² Sxx/Δ − 2(Xi − x̄)[f(Xi) − f̄] Sfx/Δ    (8)

As in the previous section, confidence limits for the analysis Xi are determined by finding the two appropriate values of X which satisfy the equation

{Ȳ − ȳ − C1(X − x̄) − C2[f(X) − f̄]}² − tc² s² U2(X) = 0    (9a)

If f(x) is chosen as x², Equation 9a is quartic in X; otherwise it is transcendental. In either case, the practical procedure is to use a numerical method such as a Newton-Raphson iteration. Alternatively, the intersection of the horizontal line y = Ȳi with the two confidence bands

y = ȳ + C1(X − x̄) + C2[f(X) − f̄] ± tc s [U2(X)]^(1/2)    (9b)

yields the confidence limits on the x-axis. Again pathological circumstances, one of which is the calibration curve being too flat in the region of the (Xi, Ȳi) point, may lead to unsatisfactory solutions. The line y = Ȳi may not intersect both confidence bands or may intersect each band more than once. Clearly, if f(x) = x², the quartic Equation 9a admits four possible roots, only two of which are appropriate. If a numerical procedure is used to solve Equation 9a, then special caution must be used to assure convergence to the desired roots. Considering all the possible circumstances that might arise from an unlimited variety of calibration curve shapes, the graphical procedure brings the advantage of pictorial clarity to the problem of selecting proper solutions. Nevertheless, in contrast to the line segment method, satisfactory solutions can be obtained for the confidence limits in most cases when using a three-parameter empirical function. This latter method, however, requires a more extensive calculation, either a long hand calculation or a short digital computer program, and, as mentioned above, some judgment as to the choice of f(x).
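A numerical sketch of this section (not the author's FORTRAN program): with f(x) = x², equal weights wi = 3, and the Table II means of the illustrative example, the S sums give C1 and C2, and the two roots of Equation 9a are then located by a coarse scan followed by bisection in place of Newton-Raphson. The values Ȳ = 0.7814, Ni = 6, s² = 0.00222, and tc = 2.060 are taken from the example.

```python
# Sketch of the three-parameter method: f(x) = x**2, equal weights
# w_i = 3, and the y-bar means of Table II; Y-bar, N_i, s**2, and t_c
# follow the illustrative example.
x = [0.750, 1.000, 1.250, 1.500, 1.750, 2.000, 2.250, 2.500, 2.750, 3.000]
yb = [0.2450, 0.4509, 0.6011, 0.7506, 0.7938, 0.9029,
      0.9327, 0.9406, 0.9580, 0.9728]
w = 3.0
f = lambda t: t * t
nu = w * len(x)
xm = sum(w * xi for xi in x) / nu             # weighted mean of x
fm = sum(w * f(xi) for xi in x) / nu          # weighted mean of f(x)
ym = sum(w * yi for yi in yb) / nu            # weighted mean of y

Sxx = sum(w * xi * xi for xi in x) - nu * xm * xm
Sxy = sum(w * xi * yi for xi, yi in zip(x, yb)) - nu * xm * ym
Sfx = sum(w * xi * f(xi) for xi in x) - nu * fm * xm
Sff = sum(w * f(xi) ** 2 for xi in x) - nu * fm * fm
Sfy = sum(w * yi * f(xi) for xi, yi in zip(x, yb)) - nu * ym * fm
D = Sxx * Sff - Sfx ** 2
C1 = (Sff * Sxy - Sfx * Sfy) / D              # ~ +1.0414  (Equation 13)
C2 = (Sxx * Sfy - Sfx * Sxy) / D              # ~ -0.1972

Yb, Ni, s2, tc = 0.7814, 6, 0.00222, 2.060

def ycal(X):
    return ym + C1 * (X - xm) + C2 * (f(X) - fm)

def U2(X):                                    # Equation 8
    return (1 / Ni + 1 / nu + (X - xm) ** 2 * Sff / D
            + (f(X) - fm) ** 2 * Sxx / D
            - 2 * (X - xm) * (f(X) - fm) * Sfx / D)

def g(X):                                     # Equation 9a, to be zeroed
    return (Yb - ycal(X)) ** 2 - tc ** 2 * s2 * U2(X)

roots, lo = [], 1.2
while lo < 2.2:                               # coarse scan, then bisection
    hi = lo + 0.001
    if g(lo) * g(hi) < 0:
        a, b = lo, hi
        for _ in range(60):
            mid = (a + b) / 2
            a, b = (mid, b) if g(a) * g(mid) > 0 else (a, mid)
        roots.append((a + b) / 2)
    lo = hi
print([round(r, 3) for r in roots])           # the two 95% confidence limits
```

The scan-and-bisect step sidesteps the convergence caution noted above for Newton-Raphson, at the cost of choosing a search window by hand.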

POLYNOMIAL FUNCTION

A third method is to construct the calibration curve with as many adjustable parameters as are required to fit adequately the curvature of the calibrating data. Virtually no judgment is required in the fitting procedure and systematic error due to lack-of-fit is eliminated with certainty, but the calculation is relatively extensive, i.e., a hand calculation would be prohibitive. The fitting procedure is the same as that mentioned in a previous paper (1). Polynomials of successively higher degree are fit until such time as an F test on the residual variances indicates that systematic error due to lack-of-fit has been eliminated and only random variability remains. At this stage the calibration curve is an m-degree polynomial which may be written either as a power series in x with coefficients cj or as a linear combination of orthogonal polynomials Pj(x) with coefficients aj (10, 11)

y_cal = Σ(j=0..m) cj x^j = Σ(j=0..m) aj Pj(x)    (10)

The fitting procedures, either by multiple regression or by orthogonal polynomials, provide formulas for var(aj) or var(cj) and cov(cj, ck), and these quantities are each a product of σ² (or the estimate s²) and a function of the xi data. The denominator of Equation 1 becomes s²Um(Xi) where, if y_cal(Xi) is substituted by the power series form of Equation 10,

Um(Xi) = 1/Ni + (1/s²) Σ(j=0..m) Xi^2j var(cj) + (2/s²) Σ(j=0..m) Σ(k<j) Xi^(j+k) cov(cj, ck)    (11)

or, if orthogonal polynomials are used,

Um(Xi) = 1/Ni + (1/s²) Σ(j=0..m) [Pj(Xi)]² var(aj)    (12)

The appropriate roots X are extracted from

[Ȳ − y_cal(X)]² − tc² s² Um(X) = 0

or from the intersection of y = Ȳi with the confidence bands

y = y_cal(X) ± tc s [Um(X)]^(1/2)
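A sketch of the polynomial route with NumPy (assumptions: the single-measurement variant of the illustrative example, with m = 3, Ni = 1, Ȳ = 0.7911, and tc = 2.447, the two-sided 95% critical t for 6 degrees of freedom; the paper's Table I covers 25 degrees of freedom only). The unscaled coefficient covariance matrix lets Um(X) of Equation 11 be written compactly as a quadratic form.

```python
# Sketch of Equations 10-11 with NumPy: a cubic power-series fit to
# single calibration measurements (the first y_ij of each Table II
# triplicate); U_m(X) is a quadratic form in the unscaled coefficient
# covariance matrix, since var(c_j) = s2 * C[j, j] here.
import numpy as np

x = np.array([0.750, 1.000, 1.250, 1.500, 1.750, 2.000,
              2.250, 2.500, 2.750, 3.000])
y1 = np.array([0.2160, 0.5103, 0.6342, 0.7932, 0.8903, 0.9218,
               0.8966, 0.9176, 0.9847, 0.9237])
m, Ni, Yb, tc = 3, 1, 0.7911, 2.447     # tc: 95% two-sided t, 6 d.f.

V = np.vander(x, m + 1, increasing=True)       # columns 1, x, x^2, x^3
coef, *_ = np.linalg.lstsq(V, y1, rcond=None)  # power-series coefficients
resid = y1 - V @ coef
s2 = float(resid @ resid) / (len(x) - m - 1)   # variance of residuals, 6 d.f.
C = np.linalg.inv(V.T @ V)                     # covariances divided by sigma^2

def ycal(X):
    return float(sum(c * X ** j for j, c in enumerate(coef)))

def Um(X):                                     # Equation 11, compactly
    p = np.array([X ** j for j in range(m + 1)])
    return 1.0 / Ni + float(p @ C @ p)

def g(X):                                      # root equation for the limits
    return (Yb - ycal(X)) ** 2 - tc ** 2 * s2 * Um(X)

grid = np.arange(1.0, 2.2, 0.001)
sign = np.sign([g(X) for X in grid])
limits = [float(grid[i]) for i in range(len(grid) - 1) if sign[i] != sign[i + 1]]
print([round(L, 2) for L in limits])           # near the quoted 1.303 and 1.806
```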

Table I. Critical tc Values for 25 Degrees of Freedom
α     0.30   0.20   0.10   0.05   0.025  0.01   0.005
tc    1.058  1.316  1.708  2.060  2.385  2.787  3.078

and the problems associated with these calculations are the same as in the previous section dealing with a three-parameter curve.

ILLUSTRATIVE EXAMPLE

The data shown in Table II were taken from a calibration "experiment" run on a digital computer. A nonlinear calibrating function y = tanh³x was evaluated at a number of xi values between 0.750 and 3.00. Corresponding yi values were taken in triplicate and these were scattered normally with mean tanh³xi and with a standard deviation of 0.0500. Then an "unknown" sample, whose response Y = 0.8000, was "measured" six times. These measurements also were scattered normally with 0.0500 standard deviation. In a real experiment, of course, none of the "true" information: y = tanh³x, σ² = 0.00250, Y = 0.8000, or X = 1.646, would be known to the analyst.

Table II. Computer Calibration "Experimental" Data Based on y = tanh³x and σ² = 0.00250
xi      Triplicate yij             ȳi       (ni − 1) var(yi)
0.750   0.2160, 0.2960, 0.2230    0.2450   0.00392
1.000   0.5103, 0.4120, 0.4304    0.4509   0.00547
1.250   0.6342, 0.5483, 0.6209    0.6011   0.00427
1.500   0.7932, 0.7721, 0.6865    0.7506   0.00638
1.750   0.8903, 0.7134, 0.7777    0.7938   0.01602
2.000   0.9218, 0.8925, 0.8945    0.9029   0.00054
2.250   0.8966, 0.9657, 0.9359    0.9327   0.00241
2.500   0.9176, 0.9298, 0.9745    0.9406   0.00180
2.750   0.9847, 0.9401, 0.9493    0.9580   0.00110
3.000   0.9237, 1.0285, 0.9661    0.9728   0.00555
Sextuplicate Yij: 0.7911, 0.7647, 0.7904, 0.7688, 0.8469, 0.7264; Ȳ = 0.7814; (Ni − 1) var(Yij) = 0.00792
Total sum of squares = 0.05538. Based on 25 degrees of freedom, s² = 0.00222.

Calculation Using a Linear Segmented Calibration Curve. The unknown response Ȳ = 0.7814 falls between the ȳi values 0.7506 and 0.7938 and so, by linear interpolation, X = 1.678. With s² based on 25 degrees of freedom, a table of double-sided probability points of the t distribution shows the critical tc values listed in Table I. The criterion (6) for a successful solution of Equation 5 in this example holds only if tc < 1.12. Consequently, Equation 5 can be solved for approximately 70% or lesser confidence limits. Since 95% confidence limits are conventionally chosen to yield a "reasonable" level of certainty that the "true" X is bracketed within, this restriction to 70% or less is unfortunate, and several recourses have been mentioned above. One of these is to accept the lesser percentage confidence interval, and so solving Equation 5 with tc = 1.058 yields 70% confidence limits of 1.431 and 2.767.

Calculations Using Three-Parameter Empirical Equations. If f(x) is chosen as x² to attempt to fit the data with a parabolic function, the resulting situation is shown in Figure 2. The calibrating points ȳi vs. xi from Table II are shown without connecting lines; the dashed curve is the best least-squares parabolic fit

y_cal = 0.7548 + 1.0414(x − 1.875) − 0.1972(x² − 4.031)    (13)

and the two solid curves are the 95% confidence bands generated with t0.05 = 2.060 in Equation 9b. The line y = 0.7814 intersects the calibration curve twice but obviously the intersection at X = 1.657 is the correct analysis. Similarly, y = 0.7814 makes four intersections with the two confidence bands and of these the appropriate confidence limits are 1.543 and 1.789.

Figure 2. A three-parameter parabolic calibrating function, Equation 13 (dashed curve), is fit to the calibration points (xi, ȳi) of Table II. The intersection of the horizontal line Ȳ = 0.7814 with the calibration curve yields two solutions, of which X = 1.657 is the appropriate one. The intersections of the horizontal line with the 95% confidence bands (solid curves) yield confidence limits of 1.543 and 1.789.

Examination of the scatter of the data points around the calibration curve in Figure 2 indicates that for xi > 2, the yi values seem to be following a weaker function of x than f(x) = x². A logarithmic function might be better suited and so, utilizing the flexibility of Equation 7, we try f(x) = ln x. The resulting calibration curve

y_cal = 0.7548 − 0.3462(x − 1.875) + 1.0968(ln x − 0.5431)    (14)

is drawn as the dashed line in Figure 3 together with the 95% confidence bands. Intersections with y = 0.7814 yield X = 1.635 and confidence limits 1.500 and 1.796. The variances of residuals about the parabolic and logarithmic curves are 0.00047 and 0.00027, respectively, which confirms the judgment that the logarithmic fit is the better of the two.

Figure 3. A three-parameter logarithmic calibrating function, Equation 14 (dashed curve), is fit to the calibrating points (xi, ȳi) of Table II. The horizontal line Ȳ = 0.7814 intersects the calibration curve at X = 1.635 and intersects the 95% confidence bands (solid curves) at 1.500 and 1.796.

Calculations Using a Polynomial Function. To illustrate this third method, we will suppose that only single measurements were made of each yi and Yi. As is discussed in a later section, we must base the estimate of s² on the variance of residuals of the data points about the polynomial curve chosen by F-testing. The single measurements used in this example are the first of the three yij entries in Table II and Ȳ = 0.7911. The calibrating polynomial turns out to be

y_cal = −1.067 + 2.343x − 0.9148x² + 0.1191x³    (15)

X = 1.500 and, with s² = 0.00112 based on 6 degrees of freedom, the 95% confidence limits are 1.303 and 1.806 as shown in Figure 4.

Figure 4. A three-degree polynomial calibrating function, Equation 15 (dashed curve), is fit to single calibrating points (xi, yi1) listed as the first entry of the triplicate yij in Table II. The horizontal line Ȳ = 0.7911 intersects the calibration curve at X = 1.500 and intersects the 95% confidence bands at 1.303 and 1.806.

COMPARISONS

A calibration experiment could be expected to yield the true analysis only if (a) systematic errors were absent from all measurements yi and Y and from the calibration curve-fitting procedure, (b) a sufficiently large number of replications of the measurement Y were made so that the mean of these approached the population mean, and (c) a sufficiently large number of calibrating points were taken so that the calibration

curve passed through them would be an accurate representation of the true functional form. The "experiment" corresponding to the illustrative example fell considerably short of these ideal conditions. The several calculational methods yielded results with varying discrepancies and these are summarized in Figure 5.

Figure 5. The X-distribution function shown is the output of 200 Monte Carlo computer simulations smoothed with an empirical Pearson Type-IV distribution function. Below the curve are shown the X analyses and dispersions calculated by the several alternative methods.

The X-distribution function shown in the top part of the figure is the result of treating the data by the Monte Carlo method (1). After running 200 simulated analyses, the X values were fitted to a Pearson Type IV distribution function (12) and that curve is the one shown. Below, and to the same scale as the curve, are indicated the various results. It turned out in this example that the polynomial function method required only m = 2 by the F-tests and so this method yielded the same results as those obtained with the three-parameter parabolic curve. The several X analyses obtained via the different methods were: parabolic, 1.657; logarithmic, 1.635; and linear-segmented, 1.678. Recall that the "true" value is 1.646. The Monte Carlo simulation method using the same F-tests also fit to an m = 2 degree polynomial and so yielded an X-distribution mode of 1.657. The mean of the X-distribution was 1.653 for this particular computer run. Comparing the 95% confidence intervals obtained by the several methods: parabolic, 0.25; logarithmic, 0.30; and linear-segmented, none. To compare the Monte Carlo method, we note that, for normal distributions, a 95% confidence interval based on twenty or more degrees of freedom is approximately (within 5%) four standard deviations. Although the X-distribution output of the Monte Carlo simulation is not strictly normal, this relationship can serve to transform the observed 0.057 standard deviation to an approximate 95% confidence interval of 0.23 for comparison.

The variability in results is accounted for by statistical uncertainty and also by practical limitations. For example, because the precision of "measurement" of standards and analyses was as large as σ = 0.0500, the polynomial and Monte Carlo methods fit with only m = 2 polynomials and so the calibration curve was as crude as σ = 0.0500 warrants. If the precision had been greater, these methods would have selected a higher degree polynomial as a better match to the "true" function and so would have yielded a more accurate analysis. This higher degree polynomial, of course, would be obtained only if the number of calibrating measurements yij were in excess of the degree. If too few yij are measured, the F-tests in these methods might not operate effectively. The three-parameter methods will always be inherently inaccurate for highly nonlinear curves unless the correct functional form is known. Had the "analyst" the insight to guess f(x) = tanh³x rather than x² or ln x, the analysis would have been more accurate. The linear-segment method suffers from the several limitations described earlier. This method can be expected to be useful only for cases involving high precision measurements and calibration lines with slight curvature.

An analyst in choosing among these methods might also consider the following factors:
1. In general, the m-degree polynomial method provides the most reliable confidence limits because it utilizes F-testing to fit a proper calibration curve. The cost of this reliability is using a digital computer program of approximately 300 FORTRAN statements, which is considerably more complex than the other alternatives for confidence limits.


2. The Monte Carlo simulation method is equally reliable and costs about the same as the m-degree polynomial method, but its output is essentially the statistical moments of the X-distribution function. This method would be used when the standard deviation of the X-distribution and/or an empirical fit of that distribution were required.
3. The three-parameter curve method is less reliable because no systematic way exists for choosing the best functional form of the calibration curve. Its advantage is the modest computational effort required, about 100 FORTRAN statements. But if in a given case the analyst knows a priori the proper functional form of the calibration curve, the unreliability is eliminated and this method is the best.
4. In general, the linear-segment curve method should be used only when lack of computing resources militates against the other methods.
Listings of the FORTRAN programs used in this study will be sent on request to the author.
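The Monte Carlo approach of point 2 can be sketched compactly. This is a stand-in for the author's program, not a listing of it; the assumptions are the true curve y = tanh³x with σ = 0.05 as in the illustrative example, an m = 2 refit on each of 200 trials, and the unknown's six readings collapsed into one normal deviate for their mean.

```python
# Sketch of the Monte Carlo method: perturb the standards and the
# unknown's mean response, refit a quadratic calibration curve, and
# project the perturbed mean through it; repeat 200 times.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.75, 3.0, 10)
y_true = np.tanh(x) ** 3
sigma, trials, Y_true = 0.05, 200, 0.8000

Xs = []
for _ in range(trials):
    y = y_true + rng.normal(0.0, sigma, x.size)        # perturbed standards
    Yb = Y_true + rng.normal(0.0, sigma / np.sqrt(6))  # mean of six readings
    c = np.polyfit(x, y, 2)                            # quadratic recalibration
    rr = np.roots([c[0], c[1], c[2] - Yb])             # solve ycal(X) = Yb
    rr = rr[np.abs(rr.imag) < 1e-9].real
    if rr.size:
        Xs.append(float(rr.min()))                     # root on the rising branch

Xs = np.array(Xs)
print(round(float(Xs.mean()), 2), round(float(Xs.std()), 3))
```

The collected Xs values are the raw material for the mode, mean, standard deviation, and Pearson Type IV fit discussed in the COMPARISONS section.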

MODIFICATIONS

Up to this point we have assumed that replicate measurements have been made of the yi and Yi, that all such measurements have the same inherent precision, and that the most reliable estimate of the population variance of all these measurements is found by pooling the individual variances according to Equation 2. While these assumptions may be valid in some experimental situations, circumstances will arise for which these are questionable. Several such situations are discussed as follows:

A. Suppose F tests between the several var(yi) and var(Yi) indicate that the var(yi) share the same σ² but that the population variances of the Yi are significantly different. Here one might question the validity of the calibrating procedure, i.e., whether the calibration points were, indeed, measured in an identical manner to the analyses. Nevertheless, confidence limits can be obtained, but separate variance estimates must be made. For the calibrating points, sy² is calculated using Equation 2 but omitting the Yi contributions. For the analyses, the measurements are pooled into a common sY² using Equation 2 (omitting the yi) if F-testing justifies that a common sY² exists; otherwise individual sYi² are calculated using Equation 2c. tc is selected on the total degrees of freedom inherent in sY² and sy².

1. Linear-Segment Method. In Equations 4 and 5, s²U1(Xi) would be replaced by

sY²/Ni + sy²[U1(Xi) − 1/Ni]

and in criterion 6, sy replaces s.

2. Three-Parameter Method. In Equations 8 and 9, s²U2(X) is replaced by

sY²/Ni + sy²[U2(X) − 1/Ni]

3. Polynomial Method. In Equations 11 and 12, s²Um(Xi) is replaced by

sY²/Ni + sy²[Um(Xi) − 1/Ni]

where var(aj), var(cj), and cov(cj, ck) are based on sy² rather than s².

B. Suppose only single measurements are made of the calibrating points. If a sufficient number of replicate measurements are made of the Yi, then sY² calculated from these may serve as sy² if it may be assumed that the yi and Yi are uniformly precise. If only one or two replications are available for Yi, then neither sy² nor sY² can be calculated directly. However, if the calibration curve is an accurate representation of the y(x) function, the scatter of the calibrating points about the curve, i.e., the variance of residuals, may be attributed to random experimental scatter and so this can be used for sy². Considering the natures of the three methods, it is clear that only the polynomial function method ensures that this condition is fulfilled. Equations 11 and 12 are used without modification, but s² is assumed to be the variance of residuals of the lowest degree polynomial selected by F-testing. tc is selected on (n − m − 1) degrees of freedom. The choice whether to use the variance of residuals or replicate Yi measurements to estimate s² depends on which can supply the greater number of degrees of freedom.

C. Suppose a priori evidence or F-testing of the var(yi) indicates that the precision of the calibrating points is not uniform along the curve, i.e., sy² depends on xi. The equations derived in this paper assume homoscedastic yi, i.e., sy² is a constant. If the function sy²(xi) can be treated as or transformed to be invariant with xi, then the methods described apply directly. Two such situations are: (1) If sy²(xi) is an erratic function of xi, the erratic nature may or may not be reproducible. If irreproducible, the expedient recourse is to regard it as random and calculate an average sy² over the length of the calibration curve to use as the basis of var[y_cal(X)] in the denominator of Equation 1. Perhaps this average value could be used also for sY², but prudence suggests that sY² be measured independently by making replicate Yi measurements and then using the modifications in section A above. (2) If sy²(xi) is a regularly varying function of xi with known form, it is possible in some cases to effect a transformation of variables, for example, y to y', which makes sy²(xi) independent of xi. One such example is in counting measurements (13). Once this transformation is done, the methods described apply. If the calibrating data cannot be made homoscedastic, more extensive modifications are required. These cases will be the subject of a separate article.

LITERATURE CITED

(1) L. M. Schwartz, Anal. Chem., 48, 2287 (1976).
(2) C. Eisenhart, Ann. Math. Stat., 10, 162 (1939).
(3) E. C. Fieller, J. R. Stat. Soc., Suppl., 7, 1 (1940).
(4) E. Paulson, Ann. Math. Stat., 13, 440 (1942).
(5) J. Mandel and F. J. Linnig, Anal. Chem., 29, 743 (1957).
(6) J. Mandel, Ann. Math. Stat., 29, 903 (1958).
(7) R. G. Miller, Jr., "Simultaneous Statistical Inference", McGraw-Hill Book Co., New York, N.Y., 1966, Chapter 3.
(8) O. L. Davies and P. L. Goldsmith, Ed., "Statistical Methods in Research and Production", 4th Revised Edition, Oliver and Boyd, Edinburgh, 1972.
(9) J. Mandel, "The Statistical Analysis of Experimental Data", Interscience-Wiley Publishing Co., New York, N.Y., 1964, Chapter 12.
(10) G. E. Forsythe, J. Soc. Ind. Appl. Math., 5 (2), 74 (1957).
(11) A. Ralston, "A First Course in Numerical Analysis", McGraw-Hill Book Co., New York, N.Y., 1965, Chapter 6.
(12) M. G. Kendall and A. Stuart, "The Advanced Theory of Statistics", Vol. 1, 2nd edition, Hafner Publishing Co., New York, N.Y., 1958.
(13) R. R. Sokal and F. J. Rohlf, "Biometry, the Principles and Practice of Statistics in Biological Research", W. H. Freeman, San Francisco, Calif., 1969, Chapter 13.

RECEIVED for review May 10, 1977. Accepted August 5, 1977.


ANALYTICAL CHEMISTRY, VOL. 49, NO. 13, NOVEMBER 1977

Chemical Characterization via Fluorescence Spectral Files: Data Compression by Fourier Transformation

Kalvin W. K. Yim, Thomas C. Miller,¹ and Larry R. Faulkner*

Department of Chemistry, University of Illinois, Urbana, Illinois 61801

A computer-based file-searching technique was developed for detailed comparisons of fluorescence spectra. Data compression was achieved by Fourier transformation of the spectra, and the search was based on comparisons of limited arrays of Fourier components. Tests were made with the worst available cases, viz., the ring-alkylated phenols, which could not be distinguished from each other by earlier techniques for searching fluorescence files. Unequivocal identification was achieved by the present method in every tested case. The optimum number of Fourier components used in the comparison was shown to be about 10. The utility of this method for experimental applications is discussed.

Fluorescence spectrometry, which possesses exceptional sensitivity and supplies two spectra for any emitting species, is potentially a powerful tool for trace characterization. However, its qualitative potential has not been realized, despite its extensive applications in quantitative analysis. One of the obstacles is that most practical samples are mixtures of emitters, and it is difficult to extract spectra of the pure components from the composite emission and excitation distributions. Progress with this problem has been reported recently. Warner et al. have successfully combined two-dimensional video recording with mathematical techniques of matrix manipulation, and they have been able to extract individual spectra from those of mixtures (1). Another approach to the problem involves the use of matrix isolation techniques to produce quasi-linear emission spectra, as Stroupe et al. have done (2). Under these conditions, individual components can be identified in composite emission spectra by the appearance of sharp characteristic lines. Of particular interest to us, however, is the prospect for separating components by chromatographic means and characterizing them by recording emission and excitation spectra directly in a column outflow (3, 4). We believe that such a combination of liquid chromatography and fluorescence spectrometry could be particularly valuable if a reliable means can be demonstrated for interpreting spectra in terms of molecular structure. The problem of interpretation is not straightforward, because information in fluorescence spectra is contained more subtly than it is in NMR, IR, or mass spectral records. Spectral features usually cannot be assigned mentally to subgroups on a molecule. On the other hand, computer-based file-searching techniques do not require the obvious correlations required for mental interpretation; hence, they are the approaches of choice for structural interpretation of fluorescence records.
Previous studies by our group have shown that these spectra do possess enough structural information to be useful (5, 6). A simple search algorithm utilizing only the salient features of the spectra, e.g., total number of peaks, peak locations, and relative intensities, demonstrates surprising discriminating ability. It could unequivocally identify unknowns with rather structured spectra, and it accurately suggested molecular analogues for unknowns whose spectra were not in the library. The deficiency in that simple algorithm was felt with substances displaying comparatively smooth and structureless spectra, e.g., the ring-alkylated phenols. In such cases, the search algorithm could not distinguish the spectrum of an unknown from those of many structurally similar materials. The available algorithm is useful in these instances as a rapid device for screening out unlikely candidates in a presearch of the data base, but a more detailed comparison procedure is required for unambiguous identification. We have successfully devised such a procedure, and we report below on its reliability.

Strategy. Several requirements must be met by entries in the reference file, which is the heart of any detailed comparison method: (1) They must be computer-compatible; hence, they must be finite ordered arrays of numbers. (2) They must be compactly edited so that nonessential features are omitted in order to minimize storage space and maximize search speed. (3) All the distinct features of a spectrum must be retained in order to preserve its value as a discriminator. File-searching procedures that have been developed for mass spectrometry (7-15) have been facilitated by the discrete nature of the m/e scale. Transforming the analogue spectrum into the digital domain can be done with negligible information loss, and it can be done efficiently in the sense that the digital array is virtually the most compact discrete representation that would characterize the complete spectrum. An analogous treatment cannot be straightforwardly applied to continuous spectra, such as the fluorescence curves which are of interest here.

¹ Present address: Procter and Gamble Company, 6110 Center Hill Road, Cincinnati, Ohio 45224.
Two problems arise: (1) The spectra themselves do not suggest a digital representation, so the need for a machine-compatible version is usually met arbitrarily by sampling the spectra at a constant wavelength interval. However, this method requires many sample points for a faithful digital representation of the analogue signal. A typical excitation spectrum digitized at a 1-nm spacing might require 150 wavelength-intensity pairs. Thus, the resulting representation may be faithful, but it is hardly efficient, and reliable data compression methods acquire more importance than they have for mass spectral files. (2) In the digital representation of a continuous spectrum, any given intensity-wavelength pair is typically rather strongly related to neighboring pairs. Information is much less isolated in discrete elements than it is in mass spectral representations; therefore data compression is more than a job of editing. In seeking ways to represent fluorescence spectra discretely and compactly, we have acted upon the fact that smooth spectra can be faithfully represented by the Fourier synthesis of relatively few sinusoidal harmonics (16). Thus we create a signature for a given species by subjecting its digitized emission and excitation spectra to fast Fourier transformation. Then the first several components of each curve are combined into a file entry representing the compound. The resulting library of Fourier signatures forms the basis for more detailed comparisons between unknowns and candidates suggested by presearches via the earlier algorithm.
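The compression-and-search strategy just described can be sketched as follows: a smooth digitized spectrum (150 points here, matching the 1-nm example above) is reduced to its first ~10 Fourier components, and an unknown is assigned to the library entry whose component array lies closest to its own. The Gaussian "spectra", the library names, and the Euclidean distance metric are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

N_COMPONENTS = 10  # roughly the optimum number reported in the article

def signature(spectrum, n=N_COMPONENTS):
    """Compress a digitized spectrum to its first n complex Fourier components."""
    return np.fft.rfft(spectrum)[:n]

def match(unknown, library):
    """Return the library key whose Fourier signature is closest to the unknown's."""
    sig_u = signature(unknown)
    dists = {name: np.linalg.norm(sig_u - sig) for name, sig in library.items()}
    return min(dists, key=dists.get)

# Illustrative smooth, structureless "spectra": broad Gaussians on a 150-point grid
wl = np.arange(150.0)
def gauss(center, width):
    return np.exp(-((wl - center) / width) ** 2)

library = {
    "phenol-like A": signature(gauss(60, 18)),
    "phenol-like B": signature(gauss(66, 18)),  # nearly overlapping neighbor
    "phenol-like C": signature(gauss(60, 25)),
}

# Unknown = spectrum B plus a little measurement noise
unknown = gauss(66, 18) + 0.01 * np.random.default_rng(0).normal(size=150)
print(match(unknown, library))  # -> "phenol-like B"
```

Because the smooth curves concentrate their energy in the lowest harmonics, the truncated signature preserves the small shape differences between near-identical spectra while storing 10 complex numbers instead of 150 intensity values.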
