Confidence limits on the product of two uncertain numbers

Mar 17, 1972 - distributed random numbers has been determined empiri- cally by fitting a parameterized analytic function to a his- togram of a large n...
0 downloads 0 Views 504KB Size
of the more finely resolved compound type subclasses with the given hydrocarbon mass Z series classification obtained directly on the whole concentrate fraction and tabulated as part of Figure 3. The NMR data of Figure 10 show almost a 50% decrease in total hydrogen from GPC fraction 12 to fraction 36, the biggest decrease occurring in the proportional quantity of paraffinic hydrogens, peaks a and b. This fits the known pattern of early GPC elution of longer chain, higher molecular weight compounds and later GPC elution of highly condensed, lower molecular weight aromatic structures with little or no aliphatic substituents. Examination of aromatic hydrogen peaks e, f, and g, on the basis of increasing GPC retention volumes, shows increase in size, increase in fine structural detail, and a shift farther downfield, all pointing to increasing aromaticity, increasing ring condensation of both aromatics and naphthenoaromatics, and a decreasing degree of alkyl substitution. These observations are further enhanced by peak c, which, as the result of methyl, methylene, and methine groups attached directly to an aromatic ring, indicate the number of substituents attached to each molecule and the chain length of each substituent. The small peak d points to bridge hydrogens in condensed naphthene rings on condensed aromatic rings. The net results are average NMR spectra for each of the later eluting GPC fractions that resemble some of the NMR data by

Mair (18) for lower molecular weight homologs of cyclopentanophenanthrenes, dibenzothiophenes, fluorenes, etc., also found in petroleum.

ACKNOWLEDGMENT The authors are indebted to D. A. Doughty, of the Bureau of Mines Bartlesville (Okla.) Energy Research Center, for the nuclear magnetic resonance data provided in the course of this investigation, to R. L. Hopkins, now deceased, for performing the ion-exchange and chemicaltreatment steps in the early phases of the distillate study, and to J. F. McKay, of the Bureau of Mines Laramie (Wyo.) Energy Research Center, for providing some supporting TLC, UV, and fluorescence data. Received for review March 17, 1972. Accepted March 5, 1973. The investigation performed is part of the work of American Petroleum Institute Research Project 60 on "Characterization of Heavy Ends of Petroleum," which the Bureau of Mines is conducting at the Bartlesville (Okla.) and Laramine (Wyo.) Energy Research Centers. Reference to specific brands of equipment is made for identification only and does not imply endorsement by the Bureau of Mines. Presented before the Division of Petroleum Chemistry, Inc., 163rd National Meeting, American Chemical Society, Boston, Mass., April 9-14, 1972. (18) B J Mair Proc , Amer Petrol lnst

Sect 3 44, 378 (1964)

Confidence Limits on the Product of Two Uncertain Numbers Richard J. Olcott Southwestern at Memohis. Memohis. Tenn. 381 12 and St. Jude Children's Research Hospital, 332 N. Lauderdale, Memphis, Tenn. 38705

The distribution function for the product of two normally distributed random numbers has been determined empirically by fitting a parameterized analytic function to a histogram of a large number of such products. Separate calculations were carried out to display the effect of variation in dispersion of the multiplier relative to that of the multiplicand. The empirical distribution has been "Studentized" to cover the case of a few determinations. Tables of the inverted integrated distributions vs. the probability P and total degrees of freedom are presented along with a protocol for their use in calculating confidence limits on derived experimental results.

Often in a n investigation, the result of interest is the product of two numbers, each of which is derived from experiment and therefore uncertain to some degree. The measured concentration of an unknown in a titration, for example, is the product of a titer factor obtained from the reaction stoichiometry and the (uncertain) concentration of the standardized titrant, times the (uncertain) ratio of titrant volume to sample size. Again, the x-, y - , and z coordinates in absolute distance units of an atom in a

crystal are each the product of (uncertain) fraction-ofunit-cell-edge coordinates obtained from analysis of X-ray reflection intensities, times (uncertain) absolute unit cell dimensions obtained from measurement of the orientations of those reflections. In any such situation, the question arises of the probable range limits to be placed on the product. The matter of confidence intervals in general has been discussed by Mood ( I ) , Laitinen ( Z ) , and Neidig, et al. ( 3 ) . For a variable x which is normally distributed about a "true" value xt (which may contain systematic error), an efficient and unbiased estimate of xt is given by the arithmetic mean of n determinations f = Z x i / n . The dispersion of x about xt is measured by the absolute standard deviation uX which in turn is estimated as s, = J Z ( x i - x ) z / ( n - 1).The experimental f will rarely coincide with xt, but P out of 100 independent redeterminations of f should have (f - ts,/m < x, < (? t s , / f i ) ,

+

(1) A. McF. Mood, "Introduction to the Theory of Statistics," McGrawHill, New York, N.Y., 1950, Chapter 11. (2) H. A. Laitinen, "Chemical Analysis," McGraw-Hill, New York, N.Y., 1960, pp. 546-8. (3) H. A. Neidig, T. R. Yingling, R. E. Griswold, and T. G . Teates, Chemistry. 40, 28 (1967)

A N A L Y T I C A L C H E M I S T R Y , VOL. 45, NO. 9, A U G U S T 1973

1737

where t, tabulated ( 4 ) as Student's t, is a coefficie'nt which depends only on the number of degrees of freedom ( n - 1) and on the probability P. The range (f - t s x / f i f ts,/& is the range a t RCconfidence and f f ts,/ fiare the RC confidence limits on x . The situation becomes more complicated when the value of interest is some combination of two experimental values x and y. Kendall and Stuart ( 5 ) spend an entire chapter displaying different modes of setting confidence limits for d = f - 9. If ux = uy = CJ, we may treat the difference as normally distributed with mean d and est,imated standard deviation s d = v'(nXsx2 + n- ._ The P% confidence limits are d f t S d d m , ,_ where t is the usual Student's t for ( n , ny - 2) degrees of freedom. References ( 5 ) and (6) should be consulted if there is any reason to believe that ox # uy. Bliss (6) gives a detailed account of the data treatment for both the ( x y) and ( x / y ) cases, but the important ( x y ) class appears not to have been presented in detail. The naive solution to the problem is to state that, if a < x < b and c < y < d at, say, 90070 confidence, then the appropriate range for xy runs from ac to bd. Such a strategy would be approximately correct only if xi and yi were nearly inversely related, but often the variables are nearly or completely independent. Consider the probability that xy, < ac: the stated 5% probability that x < a must be multiplied by the probability that y < ( a c / x ) ; since the latter is generally less than unity the probability is less than 5% that xy < ac so ac is too small a lower limit for a 90% range. By extension, ac < xyt < bd is actually a range a t some confidence level between 90 and 100% and reporting P = 90% underestimates the precision of fg. A second approach has been presented by McDuffie ( 7 ) . The relative confidence limit on f is CL, = ts,/fv'% and similarly for y . McDuffie's suggestion, that the cumulative relative confidence limit on x ~ be j taken as CCL,,. = JCL,2 + CL,2, is equivalent to assuming that xy will be normally distributed with estimated mean ag and estimated relative standard deviation S,, = d ( s x 2 / f 2 )+ (s,,~/ 7)if n, = n4..However, if x and y are each symmetrically distributed (as implied by the taking of averages), then xy cannot be so distributed and thus cannot be characterized by a normal distribution. For example, if x and y are each symmetrically distributed, then the four products (f + bx)(9+ 6 y ) , (f + &)(g - 6 y ) , (E - 6,)(9 + 6 y ) , and (f 6,)(7 - 6,y) are equally likely to occur. Taking f = 9 = 1, 6, = r3y = 0.1, we have (1.21, 0.99, 0.99, 0.81), which are not distributed symmetrically about 1.O. In the following paragraphs, we first determine the distribution of xy, and then develop the equivalent of a table of t so that confidence limits may be obtained for different numbers of degrees of freedom.

+

+

We are interested in the distribution of xy = u and we will proceed following the methods of Mood and Graybill (8).The joint distribution of u: and y is: The distribution of u alone can be obtained by integrating out y: J-,

,

+

DISTRIBUTION OF x y We assume a t the outset that x and y are independent and that each is normally distributed: f ( x ) = N(x,,a,;r) =

1 exp[-(r - xti2/2u,*]

(1)

*U,

We lose no generality by taking the unit of x to be x t and thereby placing U , on a relative scale. Adopting the same conventions for y we have the joint distribution of x and y as: (4) Among others: "Handbook of Chemistry and Physics." R. C. Weast, Ed., 48th ed.. Chemical Rubber Publishing Co.. Cleveland, Ohio, 1967, p A-161. ( 5 ) M. G. Kendali and A. Stuart, "The Advanced Theory of Statistics." Voi. 2, Hafner, New York, N.Y., 1961, Chapters 20, 21 (6) C. I. Bliss, Biometrics, 12, 491 (1956). (7) B. McDuffie, J. Chem. Educ., 44, 234 (1967).

1738

This last integral is not tractable by analytical methods, but the form of the argument in the exponential indicates that g(w) is controlled by two parameters: the abscissa scale is set by u3. and the scale-invariant shape by the ratio ( u x / u y ) . To proceed further, it was necessary to resort t o an empirical determination of g ( l c ) . In outline, the strategy was to generate a large number of random numbers normally distributed about unity, multiply successive pairs, and construct a frequency profile of the products. The profile was then least-squares fit to an analytical function with adjustable parameters to obtain g(u ) in a form suitable for further work. The normally distributed random numbers were gener9 , 6 ) crP + 1, where u p is the deated (9) as p , = ( 2 1 ~ ~ sired standard deviation and q1 is a pseudo-random number uniformly distributed between zero and one. The IBM 1620 used in this calculation uses base-10 arithmetic and the maximum integer is 99999. The uniform random number generator was of the mixed congruential form ( I O ) , in which q L - l = [1101 x 105ql + 1597](mod 99999) x The apparent period of this generator is greater than 30000. Uniformity and randomness of distribution may be tested by overall frequency distribution and by the density of runs of similar values. When 20000 consecutively generated numbers were tallied into a histogram of 20 intervals from zero to one, the tally in each interval was between 975 and 1030 with no trends showing. This same 20000-number series was tested for runs of values greater or less than 0.5: if h'1 and .%'2 are the total frequencies and "-"for the (both > 20) of two symbols [here, sign of ( p , - 0.5)] and r is the total number of runs in a random series of symbols, then r should be normally distributed (II) with mean kLr= 1 + 2NlN2/(Nl + N2) and variN Z- N2)/(N1 + N2)2(N1+ N Z ance ur2 = ~ N I N Z ( ~ N-~N1 -1). The statistic z = ( r - k L r ) / u characterizes r the runs density: a binary series of symbols may be accepted as random a t 95% confidence if It1 < 1.96. The present series gave t = -0.41.

"+"

(8) A . McF. Mood and F. A. Graybill, "Introduction to the Theory of Statistics." 2nd ed., McGraw-Hill. New York, N.Y.. 1963, Chapter 10. (9) "System/360 Scientiflc Subroutine Package Version I i , " 3rd ed.. International Business Machines, White Plains. N.Y., 1967, p 54. (10) H . Maisel and G. Gnugneii. "Simulation of Discrete Stochastic Systems," Science Research Associates, Chicago, Ill., 1972, pp 140ff. (11) S. Siegel. "Nonparametric Statistics for the Behavioral Sciences," McGraw-Hili. New York. N . Y . . 1956. pp 56-8.

A N A L Y T I C A L CHEMISTRY, VOL. 45, N O . 9, A U G U S T 1973

Table I . Least-Squares Best-Fit Parametersa (fJx/fJ,y) = 1.0 0.3 0.1 0.03 0.0032(1) bi 0.0355(3) 0.0193(1) 0.0112(1) Ci 1.031(9) 1.1 17(27) 1.094(51) 1.1 1 (19) b2 0.0500(7) 0.0522(9) 0.0558(10) 0.0596(39) Cp 1.017(7) 0.998(9) 1.004(10) 0.987(10) b3 -0.0341 (3) -0.0203(2) -0.0157(1) -0.0119(1) ~3 0.800(7) 0.803(1 7) 0.820(35) 0.793(35) bq 0.0489(6) 0.0492(8) 0.0493(8) 0.0495(9) C4 1.194(8) 1.1 1 O ( 10) 1.056(11) 1.025(12) Nb 30000 10000 10000 5000 R" 4.5 6.4 7.0 8.6 a Equation 5. numbers in parenthesis are estimated standard deviation of least significant digits. Number of products entered into histogram to be fitted. "lo4 X [\'(gcaIcd gobwd)2/(201- 7)]112. These values may which was about 0.1 in each be compared with the sum [\'(gobRd)2]1'2. case.

-

I

,011

0

I

-4

-3

-2

-I

I

I

0

2

?

3

W

Figure 1. Solid line: least-squares best-fit function (Equation 5) through the empirical distribution (points) g(W) for ( n x / u v ) = 1.0 (30000products). Dashed line: reflection through W = 0 of g(W > 0)

The numerical g(w) was obtained by forming the products ( p 2 , ) ( p 2 , + l )and tallying these into a histogram dividing the region about unity into 200 intervals of length determinations made in finding f and 9 and so the quality 0.05m, where n2 = ax2 + u~~ + ux2uy2 is the exact variance of fp when both ux and my are known (12). Succesof s is also a function of the number of degrees of freedom. sive runs of 5000 products with ( m X / u y ) = 1.0, m y = 0.30, If the errors in x and y are normally distributed, then s2 0.10, 0.03, 0.01, 0.003, 0.001 showed histograms that were will be approximately ,y2-distributed with I? = n, n, virtually superimposable when the variation in u was ac2 degrees of freedom. Following again the methods of Refcounted for by scaling the abscissa in this manner. erence ( 8 ) , we define a variable T = W / m The joint The fitting of analytical functions to the numerical g ( ~ ) distribution of W and the estimated relative standard dewas carried out with an iterative nonlinear least-squares viation S is: routine described in detail elsewhere (13). In summary, F ( h ; S) = g ( W ) X 2 ( h ; S ? partial derivatives (evaluated as differences) of J g c a l c d ( w ) d w with each of the refinable parameters are used to construct a set of simultaneous equations for the parameter corrections which will reduce X((aca1cd From this, we have the joint distribution of T and S as: ~ o b s d ) ~The . corrections are applied to obtain a new set of parameters and the refinement is reiterated until successive sets of parameters agree within a specified amount or until the sum of squared deviations is minimized. A variety of function forms was tested; the most successful were sums of exponentials:

+

w,

(5) where W = (u: - l ) / u . Best-fit parameters appear in Table I and a representative fitting is plotted in Figure 1. In general, a smaller value of ( m x / u y ) leads to a g( W) that more closely approximates a simple Gaussian distribution (as expected from Equation 4). The departure of g(W) from simple symmetry a t larger ( u , / u y ) is made clear by the dashed line in Figure 1, where g( W L 0) has been reflected through W = 0 for easy comparison with g(W < 0).

The distribution of r alone is obtained by integrating out 3':

+

G ( k ;T < O ) = K C , : ' ( b , / c i ) [ l ( ~ ~ / k e ~I f : ' +)I ]) / ?-

+

G ( k ;7 >0) = K(b*/c,)[l ( T ~ / ~ c , ' ) ] - ' ~ + "'' where K is a collection of normalization constants.

}

(8)

UTILIZATION OF G(k; 7) With b's and c's obtained from the empirical g(W) and a given number of degrees of freedom k , the function G ( k : 7 ) characterizes the relative frequencies a t which different values of r are expected to appear in any experiment. In STUDENTIZATION OF g( W ) particular, H - ( k ; 7 ) = j.-LG(k; u ) du and H + ( k ; 7 ) = Equation 5 would be a sufficient characterization of the J'; G ( k ; v ) dc give the probabilities of finding W more negative and more positive than 7 . Conversely, if a level of behavior of g( W) for an experiment involving an infinite confidence P is chosen, the values of T* corresponding to number of determinations, for in that case u could be deH f ( k ; 7 ) = (1 - P ) / 2 will mark off the P 7 c confidence . in any real extermined exactly from g, and u - ~However, limits of W. A table of T* us. I? and P would thus serve as periment U'S are available only as estimates. Goodman the equivalent for xy of the usual tables of Student's t for (12) has shown that an unbiased estimate of absolute o2 is given by .s2 = ( f 2 s Y 2 / n , . )+ (.i'2s,2/nx)- (sx2s,s2/nxny). x alone. Equation 8 was integrated numerically for each of the (Note the reversed sign from the exact case.) The quality distributions determined above, by successively summing of the estimates s, and sy depends upon the number of the appropriate right-hand side evaluated a t intervals of 0.02 in 7 from -10.0 to +10.0. The integral was normal(12)L. A. Goodman. J . Amer. Stat. A s s . . 55,708 (1960). ized to unity by dividing each partial sum by the total. (13)R. J. Olcott, Ph.D. Thesis. University of Wisconsin, 1972,Chapter 5 , Appendix. Values of T corresponding to several commonly used A N A L Y T I C A L C H E M I S T R Y , VOL. 45, N O .

9, A U G U S T 1973

1739

Table II. Confidence Coefficients P = 99% 95 (ux/u,) = .1 .o k = 4 -5.10,5.29 -3.17,3.25 6 -4.15, 4.38 -2.79, 2.89 8 -3.74,3.97 -2.62,2.72 10 -3.52,3.76 -2.53,2.63 12 -3.39,3.62 -2.47,2.58 24 -3.08,3.32 -2.33,2.44 36 -2.99,3.23 -2.29,2.40 -2.82,3.06 -2.21,2.32 (ux/uy)

= 0.30

k = 4 6 8 10 12 24 36

-4.98,4.96 -4.07,4.08 -3.68,3.70 -3.47,3.50 -3.34,3.37 -3.05, 3.08 -2.96,3.00 -2.80,2.84

m

7-

and 90

T+

50

-2.47,2.50 -2.25,2.29 -2.15,2.19 -2.09,2.14 -2.06,2.10 -1.97,2.02 -1.94,1.99 -1.89,1.94

-0.938,0.853 -0.904,0.820 -0.891,0.808 -0.883,0.800 -0.878,0.795 -0.866,0.783 -0.861,0.779 -0.853,0.772

-3.08,3.04 -2.72,2.69 -2.56,2.54 -2.47,2.45 -2.41,2.40 -2.28,2.27 -2.24,2.23 -2.16,2.16

-2.39,2.33 -2.18,2.13 -2.08,2.04 -2.03,1.99 -1.99,1.96 -1.91,1.88 -1.88,1.85 -1.84,1.80

-0.873,0.791 -0.847,0.767 -0.834,0.755 -0.826,0.748 -0.821,0.743 -0.809,0.732 -0.805,0.728 -0.797,0.721

-2.99,2.89 -2.64,2.56 -2.48,2.41 -2.40,2.33 -2.34,2.28 -2.20,2.16 -2.18,2.12 -2.10,2.05

-2.31,2.22 -0.837,0.752 -2.11,2.03 -0.811,0.729 -2.02,1.94 -0.799,0.718 -1.96,1.89 -0.792,0.711 -1.93,1.86 -0.787,0.707 -1.85,1.78 -0.775,0.696 -1.83, 1.76 -0.771,0.693 -1.78, 1.72 -0.764,0.686

(ux/uy) =

k = 4

6 8 10 12 24 36

0.10 -4.85,4.74 -3.96,3.88 -3.58,3.52 -3.38,3.32 -3.25,3.20 -2.97,2.94 -2.89,2.85 -2.73,2.70

( u x / u ~=) 0.03

k = 4 -4.67,4.63 -2.86,2.83 -2.20,2.18 -0.778,0.756

6 8 10 12 24 36 w

-3.80,3.79 -3.44,3.44 -3.24,3.25 -3.12,3.13 -2.85,2.87 -2.77,2.79 -2.64,2.62

(ux/u,) = 0 k = 4 *4.604

6 8 10 12 24 W

3.707 3.355 3.169 3.055 2.797 2.576

-2.52,2.51 -2.38,2.36 -2.29,2.28 -2.24,2.24 -2.12,2.12 -2.08,2.08 -2.03,1.99

-2.01,1.99 -1.92,1.91 -1.87,1.86 -1.84,1.83 -1.76,1.76 -1.74,1.73 -1.72,1.67

-0.755,0.734 -0.743,0.723 -0.736, 0.716 -0.731,0.711 -0.720,0.701 -0.717,0.697 -0.730,0.670

12.776 2.447 2.306 2.228 2.179 2.064 2.026

f2.132 1.943 1.860 1.812 1.782 1.711 1.645

f0.741 0.718 0.706 0.700 0.695 0.685 0.674

values of P were found by linear interpolation between the appropriate partial sums. Results are set out in Table 11. Values for k = m were obtained by similar treatment of g(W) alone (Equation 5 ) . Values for ( u X / u y )= 0 are for a simple Gaussian distribution, obtained from a table of Student’s t (4). Assessing the accuracy of the entries in Table I1 is not a simple task. The estimated standard deviations of the values in Table I suggest an uncertainty of 2-4 ppt in the integrand evaluated at a point. This uncertainty is itself probably high. First, the esd’s were calculated on the

1740

basis of 201 data points in the histogram; no account was taken of the large sample size upon which the histogram was based. Furthermore, separate refinements of histograms from two different series of 5000 products [ ( u X / u y ) = 1.01 gave parameter values that agreed well within the esd’s. Finally, however, it is not a t all clear how uncertainty is to be propagated through an integration, especially when the integration cannot be performed analytically. On the balance, the T*-values are probably correct to within 5 ppt. Plots of I T * ( us. different variables show some interesting trends. I T* 1 is a monotonically decreasing function of k , as expected when the quality of the estimate of u increases with the number of degrees of freedom. T + and T both show J-shaped plots against log( u x / u y ) , but the slope of I T - I on the rise is somewhat less steep; the two curves cross near or beyond ( u x / u y ) = 1.0 and the point of crossing moves further out as P decreases. In any case T + and I T - I necessarily meet again at ( u x / u y ) = 0. Table I1 can be used to determine confidence limits on xy according to the following protocol: 1. Determine f, 7 , n,, n,, and the estimated relative standard deviations S, = ( l / f ) & ( x , - f ) 2 / ( n x- 1) and S,. Note that y is the factor with the larger S. 2. Determine the estimated product 37 and the estimated standard deviation of the product S = ( S x 2 / n x + S,2/ny - Sx2Sy2/n,ny)1/2. 3. Select a value for P and find, from the Table, values for T - and T + corresponding to k = n, ny - 2 degrees of freedom and the observed (Sx/S,) ratio. 4. The range at ROconfidence will b e n g ( l + ~ - S / f i 5 xy, 5 fg(l+ T+S/&. For example, in a typical determination of the potassium acid phthalate content of a mixture with an inert diluent by titration with NaOH solution, a student obtained the following results: Standardization: 0.1025, 0.1030, 0.1025, 0.1027M NaOH (Mean = 0.1027M, S,,,,, = 4.60 ppt). Titrant volume/sample mass: 15.87, 15.80, 15.79, 15.84 ml/gram (Mean = 15.82 ml/gram, S ,,, = 4.67 PPt) ’ In the notation of the protocol, the concentration is the variable x, and we derive the following quantities: f = 0.1027, 7 = 15.82, n, = n, = 4, S, = 0.00460, S, = 0.00467, S = 0.00328, k = 6, S,/S, = 1.0. The experimental quantity of interest is the product Cxy, where C is a collection of presumably error-free constants for conversion of units, stoichiometric factors, etc., and in this instance Cfg = 33.18%. Turning to Table 11, we find for P = 9590 the entries T - = -2.78, T + = $2.88, giving a range at 95% confidence of 33.06 5 C x y , 5 33.31%. As it turned out, this range included 33.2570, the accepted value for the sample in question.

+

ACKNOWLEDGMENT The author is grateful to J. T. McCall, Christian Brothers College, Memphis, for helpful discussions. Received for review October 27, 1972. Accepted March 5, 1973. Some of the computational work reported herein was supported by United States Public Health Service grant HL-09495. The rest of the computer time was generously provided by the Computer Center, Southwestern at Memphis.

ANALYTICAL CHEMISTRY, VOL. 45, NO. 9, AUGUST 1973