Understanding Least Squares through Monte Carlo Calculations

Journal of Chemical Education, Vol. 82, No. 1, January 2005, p 157
Information • Textbooks • Media • Resources

Understanding Least Squares through Monte Carlo Calculations Joel Tellinghuisen Department of Chemistry, Vanderbilt University, Nashville, TN 37235; [email protected]

The method of least squares (LS) is arguably the most important data analysis tool available to physical scientists. The statistical properties of linear least squares (LLS) are well established (1–6): the independent variable, x, is assumed to be error-free. If the data for the dependent variable, y, are unbiased, with errors that are random, independent, and of finite variance, the LLS estimators will be unbiased and of finite variance. If the data errors are "normal" (i.e., drawn from a Gaussian distribution), then the LLS estimators will also be normally distributed. If the data are weighted in accord with their inverse variances, the LLS estimators will also be minimum variance. Under these conditions, LS is also a maximum likelihood method, which is reassuring, since maximum likelihood is the more fundamental statistical principle behind data analysis.

We are so used to assuming that our data conform to these requirements that we seldom consider their significance. For example, data are often transformed mathematically to facilitate their subsequent analysis by LS. Inversion (reciprocation) is used in the analysis of equilibrium and binding constant data, enzyme kinetics, adsorption isotherms, and fluorescence quenching; and logarithmic conversion is used with first-order kinetics data. If the original data are normal, these conversions render them nonnormal and biased; even worse, inversion produces a variate that has infinite variance, the LS analysis of which must yield estimators of infinite variance (7, 8). In practice this problem is never manifested in a single data set, and the LS estimators will usually remain well-defined in an asymptotic sense. However, the bias can persist and even increase with increasing number of data points n, leading to estimators that are inconsistent (biased in the limit n → ∞). Put simply, this situation leads to answers that, at sufficiently large n, are sure to be "wrong" by any statistical test.
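The inversion bias just described is easy to demonstrate. The following sketch (mine, not from the article; Python with numpy assumed) inverts unbiased normal "measurements" and shows that the mean of the inverted values overshoots the inverse of the true mean:

```python
import numpy as np

# Invert normally distributed measurements and observe the resulting bias.
# mu and sigma are illustrative values (5% relative error), not from the text.
rng = np.random.default_rng(0)
mu, sigma, n = 10.0, 0.5, 1_000_000

u = rng.normal(mu, sigma, n)   # unbiased, normal measurements
z = 1.0 / u                    # the transformed (inverted) variate

# To first order, E[1/u] = (1/mu)(1 + sigma**2/mu**2) > 1/mu: biased high.
print(z.mean(), 1.0 / mu)
```

With these settings the mean of z comes out near 0.10025 rather than 0.10000, a small but statistically unmistakable bias at n = 10^6.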
In my experience, these properties of LS float off the printed page and roll off the tongue much more easily than they register meaningfully in the consciousness. I have found Monte Carlo (MC) calculations an instructive device for appreciating what LS really means, and this article is an attempt to share that mode of comprehension. Through MC methods we can use the computer to carry out a given "experiment" so many times (10^5 or more, easily) as to generate highly reliable statistical data for it. This means that the various probability distributions of interest—normal, t, and χ²—are all well defined by the histogrammed data. Thereby we can readily visualize how, under deviations from the usual assumptions, LLS can yield results that are unbiased but not minimum-variance, or minimum-variance but not unbiased, or even unbiased and minimum-variance but nonnormal. In laying the groundwork for the MC illustrations, I have also summarized the mathematics of LLS in a very compact matrix notation that renders it practically "formulaic". While matrix notation for LS has been used previously in this Journal in piecemeal fashion (9–12), the comprehensive form given below is both simple and powerful; it warrants emphasis.



Theory and Notation

Linear Least Squares—Matrix Formulation

The goal of an LS fit is the minimization of the sum of weighted, squared residuals (δi = observed yi − calculated yi),

S = Σ wi δi²    (1)

with respect to a set of adjustable parameters β, where wi is the weight for the ith point. The column vector β contains the p adjustable parameters; its transpose is a row vector: βᵀ = (β1, β2, ..., βp). The problem is a linear one if the measured values of the dependent variable, y, can be related to those of the independent variable(s) (x, u, ...) and the adjustable parameters through the matrix equation,

y = Xβ + δ    (2)

where y and δ are column vectors containing n elements (for the n measured values), and the design matrix X has n rows and p columns and depends only on the values of the independent variable(s) (assumed to be error-free) and not on β or y. For example, a fit to y = a/x + bx + c e^3u is linear, with two independent variables (x, u), three adjustable parameters (a, b, c), and X elements Xi1 = 1/xi, Xi2 = xi, Xi3 = e^3ui. On the other hand, the fit becomes nonlinear if, for example, the second term is changed to x/b, or the third to 3e^cu. The result of the minimization procedure is the set of equations,

Xᵀ W X β ≡ A β = Xᵀ W y    (3)

where the square weight matrix W is diagonal, with n elements Wii = wi.1 The matrix A (= Xᵀ W X) is called the matrix of the normal equations. The solution to eq 3 is,

β = A⁻¹ Xᵀ W y    (4)

where A⁻¹ represents the inverse of A. Knowledge of β permits calculation of the residuals δ from eq 2 and thence of S, which in matrix form is

S = δᵀ W δ    (5)

The variances in the parameters are the diagonal elements of the variance–covariance matrix V, which is proportional to A⁻¹. As has been noted, minimum-variance estimates require that Wii = wi ∝ σi⁻². If the data variances are known in an absolute sense and this proportionality constant is taken as 1.00,

V = A⁻¹    (6)

This equation is exact, and the resulting V may be called the a priori variance–covariance matrix, since prior knowledge of the σi is required for its generation. Normally distributed data lead to normally distributed parameters, meaning both follow,

PG(µ, σ; u) = [1/(σ(2π)^1/2)] exp[−(u − µ)²/(2σ²)]    (7)


where µ and σ are the true mean and standard deviation, respectively. For the data, u is the actual value of y at xi, µ is the true value, and σ is σyi; for the ith parameter, µ is βi,true, u is the value of βi determined for a specific data set, and σ = Vii^1/2. If the data are normal and are weighted correctly, S is distributed as a χ² variate for ν = n − p degrees of freedom (1, 2). Correspondingly, S/ν follows the reduced chi-square, χν², distribution, given by,

Pred(z) = C z^(ν−2)/2 exp(−νz/2)    (8)

where z = χν² and C is a normalization constant. The χ² distribution has a mean of ν and a variance of 2ν (13), which means that χν² has a mean of unity and a variance of 2/ν. In the limit of large ν, the χν² distribution becomes Gaussian.

What does it mean to say that eq 6 is exact? In MC studies of LLS, it means that one can use it to verify a computational procedure, since the σi are set by the computationalist. Thus, for example, with normally distributed error, one expects 95% of the estimates of βi to fall within ±1.96 Vii^1/2 of the true value. Conversely, a statistically significant deviation from this expectation suggests a flaw in the MC procedures. There is some confusion in the literature regarding these matters. In general the off-diagonal elements in V (the covariances) are nonzero, which means that the parameters β are correlated. The correlation matrix C is obtained from V through

Cij = Vij / (Vii Vjj)^1/2    (9)

and yields elements that range between −1 and 1. However, each of the parameters in a linear fit is distributed normally about its true value, with σβi = Vii^1/2, irrespective of its correlation with the other parameters. The correlation comes into play only when we ask for joint confidence intervals of two or more parameters, in which case the confidence bands become ellipsoids in two or more dimensions (5, 12). The correlation is also given exactly by eq 9, obviating MC computations for characterizing such joint confidence ellipsoids for LLS fits.

At the other extreme, suppose that nothing is known in advance about the statistics of the yi, except that we believe the parent distributions all to be normal, with the same variance, independent of yi. We set all the wi to 1.00, giving the case of unweighted least squares (or unweighted regression). The variance in y is then estimated from the fit as,

σy² ≈ sy² = Σ δi² / (n − p) = S/ν    (10)

which is recognized as the usual expression for estimating a variance by sampling (except that the degrees-of-freedom correction is p rather than 1). The use of eq 10 represents an a posteriori assessment of the variance in y.2 The corresponding a posteriori variance–covariance matrix is

Vp = (S/ν) A⁻¹    (11)

Under the same conditions as stated just before eq 8, sy² is distributed as a scaled χ² variate. In intermediate situations, the σyi may not all be constant but may be known in only a relative sense. This is the case, for example, when data of unknown, constant σ are transformed before fitting. Here it is useful to designate one datum (or subset of the data) as reference, with wi = 1, and take all other weights to be sref²/syi². Then the quantity S/ν (= sref²) obtained from the fit is more properly called the "estimated variance for data of unit weight", and the estimated variance for a general point in the data set is sref²/wi.

In the case of the a priori V, confidence limits on the parameters are obtained from tables of the error function. However, if the σyi are not known in advance but must be estimated from the fit itself, by eq 10, the parameter standard errors from eq 11 are also uncertain estimates, sβi = Vp,ii^1/2 ≈ σβi. Accordingly we must employ the t-distribution to assess the parameter confidence limits. Under the same conditions that yield a normal distribution for the parameters β and scaled χ² distributions for sy² and for the Vp,ii from eq 11, the quantities (βi − βi,true)/sβi belong to the t-distribution for ν degrees of freedom (1), which is given by,

Pt(t) = C′ (1 + t²/ν)^−(ν+1)/2    (12)

with C′ another normalizing constant. For small ν the t-distribution is lower in the peak than the unit-variance Gaussian distribution, with more extensive tails. However, the t-distribution converges on the unit-variance normal distribution in the limit of large ν, making the distinction between the two distributions unimportant for large data sets.

The elements of V scale with σy² and with 1/n. The latter, for example, means that the parameter standard errors (Vii^1/2) go as n^−1/2, all other things being equal. This means that they are to be interpreted in the same manner as the standard deviation in the mean in the case of a simple average. (One can readily verify that for a fit of data to y = a, the equations do yield for σa (sa) the usual expressions for the standard deviation in the mean.)

Users of commercial data analysis programs should be aware that those programs that provide estimates of the parameter errors do not always make clear which equation—6 or 11—is used. For example, recent versions of the program KaleidaGraph (Synergy Software) use eq 11 in unweighted fits to user-defined functions, but eq 6 in all weighted fits (15). This means that in cases where the weights are known in only a relative sense, the user must scale the parameter errors by the factor (S/ν)^1/2 to obtain the correct a posteriori values.3
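The a priori and a posteriori recipes (eqs 3–6, 10, 11) are compact enough to sketch in a few lines. The data values below are invented for illustration (Python with numpy assumed; the article itself used KaleidaGraph, not Python):

```python
import numpy as np

def wls(X, y, w):
    """Weighted LLS via the normal equations.
    Returns beta (eq 4), the a priori V = A^-1 (eq 6),
    and the a posteriori Vp = (S/nu) A^-1 (eq 11)."""
    W = np.diag(w)
    A = X.T @ W @ X                 # matrix of the normal equations (eq 3)
    V = np.linalg.inv(A)
    beta = V @ X.T @ W @ y          # eq 4
    d = y - X @ beta                # residuals delta
    S = d @ W @ d                   # eq 5
    nu = len(y) - X.shape[1]        # degrees of freedom
    return beta, V, (S / nu) * V

# invented straight-line data with unit weights
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
X = np.column_stack([np.ones_like(x), x])
beta, V, Vp = wls(X, y, np.ones_like(y))

# cross-check the normal-equations solution against numpy's solver
ref = np.linalg.lstsq(X, y, rcond=None)[0]
```

With unit weights, V is the a priori matrix for σy = 1; multiplying the square roots of its diagonal by (S/ν)^1/2, as in the KaleidaGraph remark above, converts them to the a posteriori values of eq 11.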

Error Propagation

Errors in functions f of uncertain variables are generally computed using the textbook error propagation formula:

σf² = Σ (∂f/∂βi)² σβi²    (13)

However, this expression assumes that the independent variables are truly independent or uncorrelated. That assumption almost never holds when the variables are the adjustable parameters from an LS fit, and one must use the full expression (2, 3, 16),

σf² = gᵀ V g    (14)

in which the elements of g are gi = ∂f/∂βi.4 Equation 14, too, is exact for linear functions f of normal variates. In many cases eq 14 can be bypassed and the desired σf² can be calculated by simply redefining the fit so that f is one of the adjustable parameters (16). This procedure can be shown to be identical to use of eq 14 and is illustrated below.

Monte Carlo Computational Methods

The properties of LLS summarized earlier are premised upon the existence of a "true" function linear in the adjustable parameters β, onto the data (y) of which are superimposed random errors. The MC computations repetitively replicate this model for the data, solving for β and calculating S and Vp for typically 10^5 or more statistically equivalent data sets. The results are then histogrammed to obtain the distributional information, and their statistics are assessed. The demands of mustering so much data have led me to use "on-the-fly" procedures for binning and accumulating statistics (7, 8, 16).

The built-in random number generator in most computer programs generates numbers from the uniform distribution on the default range 0–1. This distribution has a mean of 0.5 and a standard deviation of (12)^−1/2 (1). The numbers are easily mapped onto the range −1 to 1, for which the mean is zero and the standard deviation is (3)^−1/2. Uniform random deviates are readily converted to Gaussian with the Box–Muller method (5). I have also used the RAN0 routine of Press et al. (5) to achieve an extra measure of randomness in the uniform deviates.

The MC statistics are computed with the usual equations for sampling. For example, the average for parameter a over N data sets is ⟨a⟩ = Σ ai/N, and the sampling variance is sa² = ⟨a²⟩ − ⟨a⟩². In confirming MC consistency with theory, it is necessary to know the precision of the MC parameter estimates, which (at the 68.3% or 1-σ level, for normal data) is their estimated standard error, sa/√N. On the other hand, the sampling estimates of the parameter standard errors are subject to the previously mentioned properties of the χ² distribution, for N degrees of freedom in this case. Thus their relative standard errors are (2N)^−1/2 = 0.00224 for N = 10^5. I have used the KaleidaGraph program (KG) extensively to compute parameter standard errors for different models.
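The two ingredients just described, conversion of uniform deviates to Gaussian and "on-the-fly" accumulation, can be sketched as follows (my rendering in Python/numpy; the article's own computations used a built-in uniform generator and the RAN0 routine):

```python
import numpy as np

def box_muller(rng, n):
    """Convert uniform deviates on (0,1] to standard-normal deviates."""
    u1 = 1.0 - rng.random(n)        # shift to (0,1] to avoid log(0)
    u2 = rng.random(n)
    return np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

# on-the-fly statistics: keep only sum(a) and sum(a^2), never the samples
rng = np.random.default_rng(1)
N = 100_000
s1 = s2 = 0.0
for _ in range(10):                 # ten batches of 10^4 deviates
    a = box_muller(rng, N // 10)
    s1 += a.sum()
    s2 += (a * a).sum()

mean = s1 / N
var = s2 / N - mean**2              # sampling variance, <a^2> - <a>^2
print(mean, var)
```

For N = 10^5 the mean should sit within a few multiples of N^(−1/2) ≈ 0.003 of zero, and the sampling standard deviation within a relative (2N)^(−1/2) = 0.00224 or so of unity, exactly the precision limits quoted above.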
This is most conveniently done using exactly fitting data and the "weight data" option of the "general" fit routine (which handles both linear and nonlinear fit models). I have also used KaleidaGraph to fit histogrammed data to distributions. In this application the data uncertainties are taken as the square root of the bin counts, in keeping with the approximate Poisson nature of the binning process (4).

Illustrations

A Straight-Line Model with Constant Error

The simplest linear model of interest, the straight line, serves nicely to illustrate the fundamental properties of LLS.






This is the five-point model used previously to investigate bias from weighting and from data transformation (7, 8): y = a + bx, with a = 1 and b = 5. Written out explicitly for this model, eq 2 reads,

y = (y1, y2, y3, y4, y5)ᵀ = Xβ + δ =

    | 1   1.1 |           | δ1 |
    | 1   3.3 |   | a |   | δ2 |
    | 1   5.5 | · | b | + | δ3 |    (15)
    | 1   8.3 |           | δ4 |
    | 1  12   |           | δ5 |

where in its second entry the matrix X contains the specific values of xi used in these test computations. The matrix A defined in eq 3 becomes,

A = Xᵀ W X, with W = diag(σy1⁻², σy2⁻², σy3⁻², σy4⁻², σy5⁻²), giving

A = | Σwi     Σwi xi  |
    | Σwi xi  Σwi xi² |    (16)

where wi = 1/σyi². The right side of eq 3 reads

Xᵀ W y = | Σwi yi    |
         | Σwi xi yi |    (17)

and eq 3 then becomes,


a Σwi + b Σwi xi = Σwi yi
a Σwi xi + b Σwi xi² = Σwi xi yi    (18)

which is recognized as the two equations in the two unknowns a and b that result from the standard textbook methods of deriving the LS equations for this model. If the σyi are known a priori, the variance–covariance matrix is given by eq 6,

V = A⁻¹ = (1/det A) |  Σwi xi²   −Σwi xi |  =  |  0.41860²    −0.020733  |
                    | −Σwi xi    Σwi     |     | −0.020733    0.058588²  |    (19)


where the numerical entries in the final matrix are obtained for constant σyi = 0.5. The diagonal elements of V are the parameter variances, V11 = σa² and V22 = σb². Thus σa = 0.41860 and σb = 0.058588, values that are exact in this assumption of known σyi. While eqs 15–17 may seem like a lengthy path to the relatively simple result in eq 18, they represent a tremendous aid to computations with more complex models, since users need only to be able to write the matrix X that relates the observed yi values to the parameters and the independent variables; the computer does the rest. Standard linear and nonlinear LS codes employ this matrix architecture (4, 5, 9–12).

The validity of these results is illustrated through the Monte Carlo computations. These are done by adding random Gaussian error having this σy to the exact (true) yi values and solving for a and b repetitively using eq 4. (Note that A⁻¹, given by eq 19, is fixed for all data sets.) Statistics are accumulated for each parameter over the N randomized data sets in a given MC run, and the values are binned into histogram form. The a posteriori parameter standard errors are also computed for each data set; from eq 11, these are obtained by multiplying the a priori values by (χν²)^1/2. These sa and sb values are needed to generate the t-variates for histogramming.

Results from several MC runs on 10^5 data sets are illustrated in Figure 1 and summarized in Table 1. From the table we see that the sampling statistics of all quantities are reasonably consistent with expectations except in one case, the standard deviation in χν² for the uniform-error model. Thus, even with uniform random error, which is grossly different from the usually assumed normal error, the LS solution yields unbiased, minimum-variance estimates of the parameters. However, the figures show that none of the histogrammed quantities from the uniform-error model follows the distributions predicted and observed for normal error.
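These exact and MC results can be reproduced directly from eqs 16 and 19 (a sketch in Python/numpy, not the article's actual code):

```python
import numpy as np

# Five-point model from the text: y = a + b*x, a = 1, b = 5, sigma_y = 0.5.
x = np.array([1.1, 3.3, 5.5, 8.3, 12.0])
sigma_y = 0.5
X = np.column_stack([np.ones_like(x), x])
W = np.eye(5) / sigma_y**2

A = X.T @ W @ X                     # eq 16
V = np.linalg.inv(A)                # eq 19
sig_a, sig_b = np.sqrt(np.diag(V))
print(sig_a, sig_b)                 # 0.41860 and 0.058588, as in the text

# MC check: add normal error and refit repeatedly (eq 4).
# The solver matrix A^-1 X^T W is fixed for all data sets, as noted above.
rng = np.random.default_rng(2)
true = np.array([1.0, 5.0])
solver = V @ X.T @ W
fits = np.array([solver @ (X @ true + rng.normal(0, sigma_y, 5))
                 for _ in range(20_000)])
print(fits.mean(axis=0))            # close to (1.00, 5.00): unbiased
print(fits.std(axis=0))             # close to (0.4186, 0.0586): minimum variance
```

Twenty thousand synthetic data sets are enough to confirm the a priori σa and σb to about 1%, in the spirit of the 10^5-set runs of Table 1.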
Note that because the x structure of the data set is preserved on expanding from 5 to 20 points, the parameter standard errors drop by exactly a factor of two (a result readily predicted using KG). What will happen if we average the 4 values at each xi in the 20-point data set before fitting them? Clearly each single MC data set will yield exactly the same results if we

Figure 1. Histogrammed results of 10^5 LS estimates of the intercept a in the straight-line model described in text, for constant error, σy = 0.5, uniformly and normally distributed. The illustrated curves are the results of fitting to the theoretical distributions, with all parameters except a scaling factor fixed at the theoretical values: (A) Gaussians (eq 7) with σa = 0.4186 (5 pts) and 0.2093 (20 pts); (B) t-distribution (eq 12) with ν = 3 (5 pts) and 18 (20 pts); (C) reduced χ² distribution (eq 8) with ν = 3 and 18. These fits themselves gave respective χ² values of 40.9 (30 histogram points), 22.6 (30 pts), 35.5 (46 pts), 23.3 (37 pts), 40.6 (47 pts), and 11.9 (25 pts). Note that in (B) the a posteriori estimate sa is used to generate the t-variate for each of the 10^5 data sets, from t(a) = (a − 1)/sa. The constant peak amplitude (rather than area) for the 5- and 20-point data sets in (A) is a consequence of binning on the same relative interval of σa/4 in both cases.

weight each set of 4 by 4 times the original weight for individual values. This is the same as taking each of the 5 averages of y to have σ̄y = σy/2, that is, the theoretical (or a priori) standard deviation in the mean. Alternatively, if we use the actual sampling (or a posteriori) statistics of each group of 4, we will not get identical results, and the statistics of this process will not follow the usual rules (17). In effect, by the latter procedure we are granting the data the right to determine their own destiny. This disparity emphasizes the importance of having prior knowledge of the data error and using this prior knowledge to assign the weights. If such prior weighting is used in the present case, the results in Figure 1A will be identical to those for the original 20-point data sets, but those in B and C will follow the distributions for ν = 3 instead of those for ν = 18.

Table 1. Monte Carlo Statistics from 10^5 Data Sets on Linear Model^a

Parameter   Value      δ^b      σ/s^c        δs^b

Exact values (five-point data set)
a           1.00       ---      0.41860      ---
b           5.00       ---      0.058588     ---
χν²         1.00       ---      (2/3)^1/2    ---

Normal error
a           1.00015    −0.11    0.41784      −0.81
b           5.00002    −0.11    0.058601     −0.10
χν²         0.99947    −0.21    0.82015      −2.00

Normal error, 20 data points (4 at each xi)^d
a           0.99991    −0.14    0.20971      −0.88
b           4.99996    −0.43    0.029312     −0.28
χν²         0.99997    −0.03    0.33240      −1.25

Uniform error, 5 data points
a           1.00083    −0.63    0.41956      −1.03
b           4.99983    −0.92    0.058550     −0.29
χν²         0.99920    −0.40    0.636        −98.7

a y = a + bx, with a = 1, b = 5, σy = 0.5, and 5 xi values: 1.1, 3.3, 5.5, 8.3, and 12.
b Deviations, (observed − true) in units of sampling standard errors, which are σ/(10^5)^1/2 for parameters, σ/(2 × 10^5)^1/2 for standard deviations.
c Ensemble estimates of standard deviations for MC data. Except for the first three entries, which are exact, these have a relative uncertainty of 0.224%; they are given with extra precision to facilitate comparisons.
d Parameter standard errors exactly half those for 5 points; standard error in χν² = (2/ν)^1/2 = 1/3.

In all of the examples covered in Figure 1 and Table 1, the LS estimators are unbiased. If the fitted data are the reciprocals of normal variates, we find that the parameter estimators are biased, as has been noted already. This case is illustrated in Figure 2, which shows that the parameter distributions are also no longer Gaussian. From the sampling statistics, the bias in a is +0.031 and that in b is −0.00275; both are significant, by 23σ and 15σ, respectively.

Figure 2. Histogrammed results of 10^5 LS estimates of a and b obtained by fitting the inverse of a normal variate. Random normal error of magnitude σz = σy/y² was added to z = 1/y, and then the "data" yi = 1/zi were fitted. The histogrammed quantity is Z = (β − βtrue)/σβ. The curves represent best LS fits to Gaussians having adjustable means and widths; the fit χ² values of 611 and 106 confirm that even with this extra freedom, the Gaussian distribution is not adequate.

The reciprocal of a normal variate has infinite variance, violating one of the fundamental assumptions of LLS. However, there is no indication of any problem from this in the computations behind Figure 2, because the relative errors in the data are small (7, 8). The Lorentzian distribution has infinite variance (4), so it is of interest to examine the effects of using Lorentzian random errors instead of normal or uniform. We expect the LLS estimators to be similarly unbounded, and indeed I have observed divergent MC statistics for this case. For example, five different 10^5-set MC runs led to average values for a of −1.86, 1.74, 0.74, 1.64, and 1.49, with ensemble standard deviations ranging from 100 to 1400. Fits of the parameter distributions indicated they were also Lorentzian. The dispersion of a Lorentzian distribution can be characterized by its half-width. However, eq 6 does not correctly relate the parameter half-widths to that of the data (as it does for normal error).5

Let us return to the proper cases of Table 1 and Figure 1 and examine the propagated error in the fitted function y(x). Taking f = a + bx, we find from eq 14,

σf² = (1  x) V (1  x)ᵀ = V11 + 2V12 x + V22 x²    (20)

which differs from the commonly used eq 13 by the inclusion of the middle term. Since the Vij coefficients are known, we can straightforwardly evaluate σf² as a function of x (11). We can verify that this expression is exact for any x0 by having our MC code compute f(x0) for selected x0, using each of the N sets of parameters and accumulating the statistics and distributional information. However, it is easy to see that this result is exact, because we can as well fit to y = a′ + b(x − x0) = a′ + bx′, which is a different but statistically equivalent linear fit having the property that y(x0) = a′. Thus the standard error in a′ is the desired error in the fit function at x0; and by all the properties of LLS, the a priori standard error in a′ is given exactly by eq 6. We can implement this computation trivially with a program like KG, obtaining, for example, standard errors of 0.32561, 0.22362, and 0.41464 at x0 = 2, 6, and 12, respectively. (b and σb remain unchanged as x0 is varied.)

In another approach to this computation, we note that by eqs 9 and 19, the correlation between a and b is −0.84537. This is proportional to V12, which in turn is proportional to Σxi. In the alternate fit described above, the correlation vanishes at x0 = 6.04, so for this choice of x0 we can use the simpler eq 13 to compute the error band on the fit function at all x.
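Both routes to the error band can be checked numerically (a sketch in Python/numpy; KG is not needed for the linear case):

```python
import numpy as np

# Exact V for the five-point constant-error model (eq 19), sigma_y = 0.5.
x = np.array([1.1, 3.3, 5.5, 8.3, 12.0])
X = np.column_stack([np.ones_like(x), x])
W = np.eye(5) / 0.5**2
V = np.linalg.inv(X.T @ W @ X)

def sigma_f(x0):
    """Propagated error in the fitted line at x0, eq 20."""
    return np.sqrt(V[0, 0] + 2 * V[0, 1] * x0 + V[1, 1] * x0**2)

for x0 in (2.0, 6.0, 12.0):
    print(x0, sigma_f(x0))          # 0.32561, 0.22362, 0.41464, as quoted

corr = V[0, 1] / np.sqrt(V[0, 0] * V[1, 1])   # eq 9
print(corr)                         # -0.84537
print(np.sum(W @ x) / np.sum(np.diag(W)))     # 6.04: x0 of zero correlation
```

The last line is the weighted mean of the xi, which for constant weights is simply the average x̄ = 6.04, the shift that makes a′ and b uncorrelated.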

The Straight Line with Proportional Error

Next let us consider the case where the data have proportional error, σyi ∝ yi, and examine the results both with and without weighting.6 Typical MC results for 4% error, for which the exact parameter errors are σa = 0.33606 and σb = 0.13551, are shown in Figure 3. With neglect of weights, the parameter distributions remain unbiased and normal but are no longer minimum-variance. The t-variate is grossly in discord with its correct distribution, as is true also of the χν² distribution (not shown), confirming that both of the latter distributions are premised upon proper weighting of the data. Note that the a posteriori Vp is completely unreliable as a predictor of the parameter errors in this case; from the MC statistics the Vp-based sa and sb are 1.121 ± 0.003 [1.121(3)] and 0.1569(4), respectively, as compared with the MC sampling estimates of 0.842(2) and 0.2092(5). This shows that under incorrect weighting, Vp can "lie" either optimistically or pessimistically.

As a variation on the theme illustrated in Figure 3, suppose we have data of constant absolute error σy in y (as in Figure 1) but wish to enhance somehow our determination of the relatively imprecise intercept a. Recognizing the importance of the small-x points, we elect to weight the data as xi⁻¹. Unfortunately, as the old saying goes, "you cannot fool Mother Nature". From the ensemble statistics of the MC calculations, sa actually increases by 18% in this manipulation. Worse still, the Vp-based estimate of sa is grossly optimistic (by a factor of 2), deceiving us into thinking we are doing much better than we are. The twofold moral of these examples is very important: (i) minimum-variance estimates of the parameters are achieved only when the weights are proportional to the inverse data variances, as dictated by the data themselves and not by the analyst's hopes and desires, and (ii) the a posteriori Vp is invalid if the weighting is otherwise.
As one consequence of these realizations, it is inappropriate to use trial-and-error variation of weighting formulas in a quest for minimum parameter errors from Vp. Of course, in the above example, the goal of a more precise intercept might be implemented by altering the data structure to include more small-x values among the measurements, and that would be a fully legitimate use of V in experiment design.
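The exact numbers behind Figure 3 follow from the same machinery. The weighted fit uses eq 6 with wi = 1/σyi²; for the unweighted fit, the exact sampling variance of the intercept follows from propagating the heteroscedastic data errors through the unweighted LS solution (a sketch in Python/numpy):

```python
import numpy as np

# Straight line y = 1 + 5x with 4% proportional error, sigma_yi = 0.04*y_i.
x = np.array([1.1, 3.3, 5.5, 8.3, 12.0])
y_true = 1.0 + 5.0 * x
sigma = 0.04 * y_true
X = np.column_stack([np.ones_like(x), x])

# correct weighting, w_i = 1/sigma_i^2 (eq 6)
V = np.linalg.inv(X.T @ np.diag(sigma**-2) @ X)
sig_wtd = np.sqrt(np.diag(V))
print(sig_wtd)                      # 0.33606, 0.13551, as quoted in the text

# unweighted fit: beta = H y with H = (X^T X)^-1 X^T, so the true
# parameter covariance is H diag(sigma^2) H^T -- still unbiased,
# but no longer minimum variance.
H = np.linalg.inv(X.T @ X) @ X.T
var_unwtd = H @ np.diag(sigma**2) @ H.T
sa_unwtd = np.sqrt(var_unwtd[0, 0])
print(sa_unwtd)                     # ~0.842, matching the MC estimate 0.842(2)
```

The exact unweighted σa of about 0.842 is two and a half times the minimum-variance value of 0.336, which is just the disparity displayed by the two histograms in Figure 3A.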

Figure 3. MC results (10^5 fits) for the intercept a with proportional error, σy = 0.04y, both properly accommodated in the fit model (open points) and neglected (solid): (A) the curves are fitted Gaussians having σ = 0.3356(8) and 0.843(2); (B) the t-variate is histogrammed, with the t-distribution curve fit shown for the weighted results.

χ² and Confidence Limits

Returning to the constant-error model of Figure 1, suppose we select from our MC run just those sets that yield values of a in the extreme 10% wings of the distribution and then histogram b and χν² for those same sets. The χν² values for this subset are still properly distributed (Figure 4), showing there is no correlation between the parameters and χ². The b distribution from the original model is significantly affected, but not that from the alternate fit with x0 = 6.04. Again, this difference is a consequence of the significant correlation between a and b in the original model but the zero correlation for x0 = 6.04 in the altered model.

Alternatively we can ask what happens to χ² for a single data set when we alter the fit parameters from their minimum-variance values (4, 12). For this purpose it is useful to examine a "real" data set, as summarized in Table 2. If we now alter either of the parameters by ±σ (±s) and refit the other, we will observe (i) an increase in χ² by exactly 1.00 for the V-based results, and (ii) an increase in S by the factor (ν + 1)/ν = 4/3 for the Vp-based results. (If χ² has its expected value in the first case, the same fractional increase occurs.) If the parameter is altered by ±2σ (or ±2s), the increases are 4.00 and 7/3; by ±3σ (or ±3s), 9.0 and 4 (4, 5). These increases are not special for a two-parameter fit; they hold any time one parameter in a linear fit is frozen at the altered value and all of the others are refitted. The results for ±σ are the reason why in ad hoc fitting, one is justified in dropping a parameter having a value smaller in magnitude than its standard error: the increase in S will be proportionately smaller than the increase in ν, leading to a smaller estimated sy² in the more compact fit.7

The information about the correlation between a and b in the original model can also be visualized by histogramming



Figure 5. Contour diagram of two-dimensional histogram of results for a and b from 10^5 fits for the constant-error model. The binning interval was σ/4 in each parameter. Note log scale for "elevation"; the highest single bin count was 1827.


Figure 4. MC results for χν² (A) and b (B) from 10^5 fits, for the constant-error model with selection of the results that fall into the outer 20% range for a. Results are shown for the original model (open) and the altered model having x′ = x − 6.04 (solid). Curves represent results of LS fits to the theoretical distributions. The histogrammed quantity in (B) is Z = (b − 5)/σb.

Table 2. Five-Point "Real" Data Set^a with Constant Error, σy = 0.5

                    V Results                Vp Results
Parameter        Value       σ            Value        σ
a                0.55189     0.41860      0.55189^b    0.30493^c
b                5.05598     0.05859      5.05598^b    0.04268^c
χ2               1.59190     ---          ---          ---
S                ---         ---          0.39798^c    ---
sy2              ---         ---          0.13266      ---
a                0.97049^d   ---          0.85682^d    ---
b                5.00645     0.03130      5.01990      0.02280
χ2               2.5919^e    ---          ---          ---
S                ---         ---          0.53064      ---
sy2              ---         ---          0.13266      ---

^a The values in the data set are (x, y): (1.1, 6.17), (3.3, 16.83), (5.5, 28.83), (8.3, 42.48), and (12.0, 61.14).
^b Parameter values are identical in the two fits, since σy is constant.
^c Obtainable from the V results by multiplication by (χ2/3)^1/2 for sa and sb, and by σy2 = 1/4 for S.
^d For this and subsequent entries in the lower block, a has been fixed at a + σa or a + sa from the original fits.
^e The increase in χ2 by 1, 4, and 9 for alterations of one parameter by σ, 2σ, and 3σ, respectively, can be observed also for perfectly fitting data, for which χ2 = 0 for the original fit.
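The Δχ2 = 1 behavior is easy to verify numerically. The sketch below (Python with NumPy, using the Table 2 data) performs the weighted fit, then freezes b at its fitted value plus σb and refits a alone; χ2 increases by exactly 1, as claimed.

```python
import numpy as np

# Data from Table 2 (constant error, sigma_y = 0.5)
x = np.array([1.1, 3.3, 5.5, 8.3, 12.0])
y = np.array([6.17, 16.83, 28.83, 42.48, 61.14])
sig = 0.5

# Design matrix for y = a + b*x; weighted LS via the normal equations
X = np.column_stack([np.ones_like(x), x])
W = np.eye(len(x)) / sig**2
A = X.T @ W @ X                 # curvature matrix
V = np.linalg.inv(A)            # a priori covariance matrix
beta = V @ X.T @ W @ y          # [a, b] = [0.55189, 5.05598]
chi2 = ((y - X @ beta)**2 / sig**2).sum()   # 1.59190, as in Table 2

# Freeze b at b_hat + sigma_b and refit a alone
sb = np.sqrt(V[1, 1])           # 0.05859
b_frozen = beta[1] + sb
r = y - b_frozen * x            # remaining residual, fitted by a constant a
a_refit = r.mean()              # weighted LS solution (weights are equal here)
chi2_frozen = ((r - a_refit)**2 / sig**2).sum()

print(round(chi2_frozen - chi2, 6))   # -> 1.0, exact for any linear model
```

The same run with ±2σb and ±3σb offsets reproduces the increases of 4 and 9.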


the data in two dimensions. Figure 5 shows a contour diagram of such data, which can be fitted to a two-dimensional Gaussian (1), analogous to the one-dimensional distributions used to represent the relevant results in Figures 1–4. The elliptical shape of the contours can be predicted exactly from the A matrix (4, 5):

A11(δa)^2 + 2A12(δa)(δb) + A22(δb)^2 = Δχ2    (21)
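The coverage property of this ellipse can be checked by direct Monte Carlo sampling. The sketch below assumes the true values a = 1, b = 5, the Table 2 x values, and σ = 0.5, consistent with the constant-error model used throughout; it counts the fraction of fitted (a, b) pairs falling inside the Δχ2 = 2.303 ellipse.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.array([1.1, 3.3, 5.5, 8.3, 12.0])
sig = 0.5
a_true, b_true = 1.0, 5.0          # true parameter values for this sketch

X = np.column_stack([np.ones_like(x), x])
A = X.T @ X / sig**2               # curvature matrix of eq 21
V = np.linalg.inv(A)

nmc = 100_000
Y = a_true + b_true * x + rng.normal(0.0, sig, size=(nmc, len(x)))
beta = Y @ (V @ X.T / sig**2).T    # LLS solutions for all data sets at once
d = beta - np.array([a_true, b_true])
q = np.einsum('ni,ij,nj->n', d, A, d)   # A11 da^2 + 2 A12 da db + A22 db^2

frac = (q <= 2.303).mean()
print(frac)   # ≈ 0.683: the ellipse captures about 68.3% of the (a, b) pairs
```

This mirrors the 68,295-in-10^5 count reported for Figure 6A; the exact fraction varies slightly with the random seed.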

where δa and δb are the displacements of the parameters from their true values and Δχ2 is chosen to provide a specified confidence (12). For the 68.3% level of joint confidence for two parameters, Δχ2 = 2.303, leading to the ellipses shown in Figure 6 for both the original model and for the altered model having zero correlation between a′ and b. The reduced version of the latter is particularly instructive, since the ellipse is now a perfect circle. These ellipses delimit a region of parameter space that should capture 68.3% of the sets of a and b parameters in MC computations like those used to generate the one-dimensional distributions in Figures 1–4. The MC run used to produce Figure 5 yielded 68,295 pairs within the ellipse in Figure 6A.

While the joint confidence regions shown in Figures 5 and 6 are informative, they should not distract one from the significance of the simple normal distributions for individual parameters. The individual parameter distributions are more fundamental and in many cases of multiparameter fitting are the only ones of interest. For example, analysis of kinetics data generally requires two or more adjustable parameters, but the rate constant k is often the only quantity reported; thus its individual distribution is the only one of relevance to its confidence limits. As a second example, consider the joint distribution delimited by the circular region in Figure 6B. The two parameters here are rigorously independent, so information about one says nothing about the other. Emphasizing the joint confidence region in this case is like emphasizing the probability of observing two heads in two successive tosses of a coin over that of one head in one toss. The circular confidence region is based on the properties of


Figure 6. 1-σ (68.27%) joint confidence regions for a and b (solid) and a′ and b (dashed) for linear-fit model with constant error, and altered version fitted about x0 = 6.04. Note the use of absolute scales along the axes in (A) but reduced scales in (B).

χ2, but we can define this region in a different way that makes this analogy even stronger. Since the two parameters are independent, their joint probability density is simply the product of their individual probabilities. Thus, the probability that both parameters are simultaneously within their 1-σ regions is 0.683^2 = 0.466. If we seek to increase the latter figure to 0.683, we must enlarge the individual probabilities to 0.826, which means setting the ranges at ±1.36 σ. This would produce a square region intersecting the circle in Figure 6B. This same enlargement is always present in joint confidence regions, and it increases as the number of quantities jointly examined is increased. It is noteworthy that the projections of the ellipse and the circle in Figure 6B onto the two axes are identical and are, in fact, just (2.303)^1/2. The




same factor will apply for any two-way joint confidence ellipse, no matter how many adjustable parameters are included in the fit model. But it makes little sense to just state this quantity without also showing the actual ellipse: in those cases where the correlation is most important, most of the region defined by the projections lies outside the joint confidence ellipse; in those cases where the correlation is small, the joint confidence concept is anyway less relevant.

In further amplification of this last point, two highly correlated parameters have a narrow joint confidence ellipse, showing that they vary hand-in-hand. An important practical consequence is that if a way can be found to determine one of these parameters separately from the fit in question, the second will then be much better determined. For example, in Figures 5 and 6, the error in the constant a′ in the uncorrelated fit about x0 = 6.04 is 0.22361 and stays at this value if the slope b is fixed (hence removed from the fit). If b is similarly fixed in the original fit, the error in a drops to this same value, from its original value of 0.41860.

How does this issue of joint confidence relate to the propagated error in a function of the parameters by eq 14? Continuing with the choice of f as the fit function itself, eq 14 is exact for the variance σf2 at any selected value x0 of the independent variable, and the calculated values of f from the MC run will be distributed normally about the true value (= 1 + 5x0). For 10^5 data sets, we expect 68,269 hits within ±σf of this value. But suppose we now ask how many of our data sets will yield calculated values within ±σf of the true values at every one of a set of x0 values (for which σf will vary with x0). To make this illustration concrete, I have chosen x0 = 0, 2, 4, 6, 8, 10, and 12, for which in one particular MC run, the number of "successes" dropped from 68,310 for x0 = 4 alone to 40,789 for all 7 values.
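The numbers quoted here are easy to reproduce. In the sketch below, the eq 14 propagated error at x0 = 6.04 equals σa′ from the fit recast about x0, which is also the reduced error in a when b is fixed (0.22361 vs the original 0.41860):

```python
import numpy as np

x = np.array([1.1, 3.3, 5.5, 8.3, 12.0])
sig = 0.5
x0 = 6.04

# Original fit y = a + b*x: a priori covariance and sigma_a
X = np.column_stack([np.ones_like(x), x])
V = np.linalg.inv(X.T @ X / sig**2)
sa = np.sqrt(V[0, 0])                     # 0.41860

# Propagated error of f(x0) = a + b*x0 (eq 14), gradient (1, x0)
g = np.array([1.0, x0])
sf_prop = np.sqrt(g @ V @ g)

# Recast fit y = a' + b*(x - x0): sigma_a' gives the same quantity directly
Xp = np.column_stack([np.ones_like(x), x - x0])
Vp = np.linalg.inv(Xp.T @ Xp / sig**2)
sf_repar = np.sqrt(Vp[0, 0])              # 0.22361, the uncorrelated-fit error
print(sa, sf_prop, sf_repar)
```

The identity holds exactly because a′ = a + b x0 is a linear reparametrization; x0 = 6.04 is the mean of the x values, which is why the correlation vanishes there.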
If we want to raise the success rate in the latter calculation to 68.3%, we must increase the error bars at each x0; the appropriate quantity by which to do this is again (2.303)^1/2 = 1.518. With this enlarged error band, the number of hits rose to 70,038. Addition of two points at x0 = −10 and 20 brought it down to 68,770; and two more points at x0 = −100 and 200 gave slight further reduction to 68,663. In short, the enlarged error band now encloses 68.3% of all calculated curves, whereas the 1-σ values from eq 14 apply to individual points along the curve. The latter pointwise error band is again arguably more fundamental, as it answers directly the question: If I repeat this experiment very many times and use the results for a and b to compute the value of the function at x0, what will I get?
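A compact MC sketch of this pointwise-versus-whole-curve distinction, under the same straight-line model and assumptions as above (the counts vary slightly with the random seed):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([1.1, 3.3, 5.5, 8.3, 12.0])
sig = 0.5
X = np.column_stack([np.ones_like(x), x])
V = np.linalg.inv(X.T @ X / sig**2)

nmc = 100_000
Y = 1.0 + 5.0 * x + rng.normal(0.0, sig, size=(nmc, len(x)))
beta = Y @ (V @ X.T / sig**2).T                   # [a, b] for every data set

x0 = np.array([0., 2., 4., 6., 8., 10., 12.])
F0 = np.column_stack([np.ones_like(x0), x0])      # fit function at each x0
f = beta @ F0.T                                   # f(x0) for every MC set
sigf = np.sqrt(np.einsum('ki,ij,kj->k', F0, V, F0))  # eq 14 pointwise sigma_f
truef = 1.0 + 5.0 * x0

inside = np.abs(f - truef) <= sigf                # within the 1-sigma band
frac_single = inside[:, 2].mean()                 # x0 = 4 alone: ~0.683
frac_all = inside.all(axis=1).mean()              # all 7 at once: ~0.41
frac_all_wide = (np.abs(f - truef) <= 1.518 * sigf).all(axis=1).mean()
print(frac_single, frac_all, frac_all_wide)       # last is back near 0.68-0.70
```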

A Three-Parameter Model—Instructive Comparisons

To put the results for the two-parameter linear model into fuller perspective, it is useful to compare with a linear model having more than two parameters. The analysis of gas chromatography data (19) by the van Deemter equation,

y = Ax + B/x + C    (22)

is of interest in this regard, as this example has been targeted in several recent studies in this Journal (12, 20, 21). First, note that this is indeed a linear least squares problem (though some have referred to it otherwise), as all three parameters occur linearly in the fit function. Taking the parameters in order (β1 = A, etc.), the elements of X are


Xi1 = xi, Xi2 = 1/xi, Xi3 = 1. Harris treated this problem using both unweighted and weighted LS (20), with 5% proportional error assumed for the latter. For the sake of illustrating the properties of LLS, either choice will suffice. However, it should be emphasized that in an actual application this choice is not arbitrary: it really must be founded on reliable experimental information about the data or a solid theoretical understanding of the measurement process. Of the other treatments of this example in this Journal, that by de Levie (21) considers only the unweighted case,8 while Ogren et al. (12) treat only the weighted case. Since the latter is more instructive, I will consider it here.

If this problem is modeled in MC calculations like those used to produce Figures 1–4, each of the three parameters will be normally distributed about its true value, with σβ values given by eq 6, since we assume the quoted errors (∼5%) on the data have absolute significance. The parameter errors are 0.001971, 1.652, and 0.1511, respectively. These values are smaller than those quoted by Ogren et al. (12), who have reported the "projected" three-way joint confidence values in their Table 5, making them larger by (3.53)^1/2. As already noted, this is extremely conservative, since the resulting rectangular parallelepiped includes much volume outside of the joint confidence ellipsoid. The unscaled parameter standard errors have an unambiguous significance and are always relevant, regardless of whether one is interested in interparameter correlation; thus these are arguably the more appropriate quantities to report.

As Ogren et al. have noted, the χ2 from this fit is only 2.795, as compared with an expected value of 10 (13 points − 3 parameters). From Table C4 in ref 4, a value this small occurs randomly less than 2% of the time, so we might be suspicious that the scale of the error is wrong, even if the assumption of proportional error is right.
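The structure of the weighted calculation can be sketched as follows. The actual data of ref 19 are not reproduced here, so the sketch generates hypothetical van Deemter data from assumed parameter values with 5% proportional error; only the machinery (design matrix, weights, V) mirrors the text.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical plate-height data: y = A*x + B/x + C at 13 flow rates,
# with 5% proportional error, as in the weighted treatment discussed above.
# A_t, B_t, C_t are assumed "true" values for illustration only.
A_t, B_t, C_t = 0.02, 25.0, 1.5
x = np.linspace(5.0, 120.0, 13)
y_true = A_t * x + B_t / x + C_t
sig = 0.05 * y_true                       # 5% proportional error
y = y_true + rng.normal(0.0, sig)

# A *linear* LS problem: A, B, C all enter linearly
X = np.column_stack([x, 1.0 / x, np.ones_like(x)])   # Xi1=x, Xi2=1/x, Xi3=1
W = np.diag(1.0 / sig**2)
V = np.linalg.inv(X.T @ W @ X)            # a priori covariance matrix
beta = V @ X.T @ W @ y
perr = np.sqrt(np.diag(V))                # unscaled (V-based) parameter errors
chi2 = ((y - X @ beta)**2 / sig**2).sum() # expect ~10 for nu = 13 - 3
print(beta, perr, chi2)
```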
Reducing all of the data errors by the factor (2.795/10)^1/2 will lead to a reduction of all the V-based parameter errors by the same factor and will increase χ2 to its expected value of 10. This is exactly equivalent to the use of eq 11 for Vp, that is, an a posteriori assessment of the data error and hence of the parameter errors. For either the original data errors or the adjusted errors, if we choose any of the three parameters and increment or decrement it by its appropriate σ, 2σ, or 3σ, freeze it at this value and refit the other two parameters, we will observe increases in χ2 by 1, 4, and 9, respectively.

Suppose we wish to examine the joint confidence region of A and C. We construct the appropriate ellipse in a manner similar to that of eq 21. Even though the number of fit parameters is now three, the number of parameters jointly examined determines the appropriate value of Δχ2, and that value is again 2.303, just as in the case of eq 21 and Figure 6. For the coefficients in eq 21, however, we must do more work in this case (5): we select the appropriate 2 × 2 submatrix of V, V13, which has elements V11, V13, V31, and V33. The Aij elements in eq 21 then become the corresponding elements of V13^−1. Note that since this is a linear problem, we know from the outset all we need to know to compute this ellipse; in particular, C13 = −0.8805 shows that A and C are moderately correlated.9 MC computations will confirm this A–C joint confidence ellipse by yielding about 68.3% of values within this region; at the same time B is not examined and can take any value.10
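The submatrix selection can be written out explicitly. Only σA, σC, and C13 (the values quoted above) are needed for the A–C pair; the B-related covariances never enter. A sketch:

```python
import numpy as np

# Build the 2x2 submatrix V13 for the (A, C) pair from the quoted
# standard errors and correlation coefficient
sA, sC = 0.001971, 0.1511
c13 = -0.8805
V13 = np.array([[sA**2,         c13 * sA * sC],
                [c13 * sA * sC, sC**2        ]])

Aij = np.linalg.inv(V13)       # coefficients A11, A12, A22 for eq 21;
                               # with dchi2 = 2.303 this defines the 68.3%
                               # joint confidence ellipse for A and C
corr = V13[0, 1] / np.sqrt(V13[0, 0] * V13[1, 1])
print(round(corr, 4))          # recovers C13 = -0.8805 from V13
```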






We can use the same trick as before to bypass eq 14 in calculating the statistical error in the fit function: Define the fit as y = A(x − x0) + B(1/x − 1/x0) + C′, from which y(x0) = C′ and the error at x0 is σC′.11 The resulting error band will again be found to be narrower than that shown in Figure 7 of ref 12, by the factor (3.53)^1/2, and the distinction is the same as before: this enlargement is needed if one wants to define the error bands such that they capture 68.3% of all calculated curves, whereas the unmagnified σC′ value describes the scatter of values at a single point. The latter is normally the desired quantity: the goal of such data is the location and characterization of the minimum of y with respect to x, as that represents the point of maximum efficiency for the chromatographic separation. The minimum falls near x0 = 33 for these data, and with the original σyi values, σC′ = 0.0658 there.12 Just as for the straight-line fit, one finds that the correlation between A and C′ varies with x0. It vanishes near x0 = 59, for which the reduced joint confidence region becomes a circle, like that shown in Figure 6B.

Comments

The present MC computations serve to illustrate and confirm known properties of linear least squares under the usual assumptions of normally distributed random error in the data and proper weighting in their analysis. However, in cases where these assumptions are violated, the MC results provide new information: about the deviations from the Gaussian, t-, and χ2 distributions when the data are not normal and unbiased, and especially the extent to which the a posteriori Vp errs in its predictions of parameter errors when the fit model does not comport with the actual error structure of the data. This unreliability of Vp when weights are neglected for heteroscedastic data (data of varying σy) is widely underappreciated in research work; it warrants more emphasis in the classroom.
Exacerbating this problem is the fact that the a posteriori Vp dominates in both teaching and research. Even when data are given correct relative weights, there is a deficiency in the use of Vp. At heart this approach assumes total ignorance about at least the scale of the error σyi in the experimental data, which means that the statistical properties of χ2 will limit the parameter error estimates to a relative precision of (2ν)^−1/2. In many applications—in analytical chemistry, for example, where the demands of sample preparation can limit the degrees of freedom to ν = 2–4—this translates into 35–50% uncertainty in the estimated σ and a significant further expansion of the confidence region owing to the need to employ the t-distribution. There is much to be gained by characterizing the data error independently and then taking the a priori approach that is so obvious from a Monte Carlo perspective. It is, of course, impossible to obtain perfect experimental knowledge of σyi. However, accumulated data from frequent use of an instrument or measurement technique in repeated experiments of similar nature should peg it much better than will an LS fit of the data from a single experiment. Also, with the widespread use of computers to log data, often with averaging at each point, the assumption of normally distributed error can be well founded, by virtue of the central limit theorem (1). In the a priori approach, one is justified in using V for the parameter variances and the Gaussian distribution for parameter confidence limits, even for small ν; and importantly, the χ2 test is available to judge the consistency of the data and the fit model.

The properties of LLS do not carry over to nonlinear estimators, which are not normally distributed, are biased, and often do not even have finite variance.13 That is unfortunate, because many data analysis problems require nonlinear models. Happily, most of the properties of LLS turn out to be approximately true for nonlinear LS, provided the relative standard errors in the parameters are sufficiently small. From examination of a number of common nonlinear fitting problems, I have arrived at a 10% rule of thumb (7, 8, 16): If the standard error for a parameter is less than 10% of the parameter's magnitude, then the "normal" interpretation is likely to provide confidence limits reliable to within 10%. If details about the parameter distributions and biases are needed, MC computations provide a straightforward way to acquire this information.

Notes

1. If the measured yi are correlated rather than independent, there are nonzero off-diagonal elements in W (2). Also, the weights must be independent of y and β for the fit to remain linear.

2. This was designated "external consistency" by Birge (14) and Deming (3), in contrast with "internal consistency" for the case where the σi are known a priori. Unfortunately these terms have a propensity for ambiguity and their meaning has been reversed in at least one text, that by Bevington and Robinson (4).

3. In the KaleidaGraph program the quantity called "Chisq" in the output box is just the sum of weighted squared residuals S, which is χ2 only when the input σi values are valid in an absolute sense. For an unweighted fit, wi = 1.

4. Note that eq 13 is obtained from eq 14 by neglecting the off-diagonal elements of V.

5.
Lorentzian random deviates can be generated by taking a ratio of two normal deviates of zero mean (1). There is no problem solving the LLS equations for data having Lorentzian error, as eq 4 is not sensitive to the nature of the error. However, the sampling statistics for a Lorentzian variate will be divergent, because the fundamental statistical theorem of sampling, the central limit theorem, does not hold unless the variance is finite (1).

6. Constant and proportional error roughly bracket the observed extremes. There is an inherent ambiguity in defining the weights for actual data with proportional error. Not knowing the true yi, we must use either the observed yi or the adjusted yi from the fit itself. Both choices lead to parameter bias, and the latter requires a nonlinear algorithm (7, 8). Here we will just use the true values, a luxury afforded us only in the context of the MC calculations.

7. By ad hoc fitting, I mean cases where there is no firm commitment to a model, as in representing smooth data by polynomials. If a parameter is thought to be justified on firm theoretical grounds, one might adopt a more stringent statistical test for its rejection (10, 18).

8. de Levie's (21) reported parameter errors are readily verified with a program like KG; however, his expression for sy contains an extra subtracted degree of freedom, and his error bands on the fit function were obtained using an absolute value approach that is typically much more conservative than the exact eq 14.

9. The correlation coefficients reported in ref 12 appear to be in error. The equation in question, eq 14 in that article, is correct only for a two-parameter problem; the general equation is the present eq 9.

10. The extension of this three-parameter case to three independent (uncorrelated) quantities is instructive. If we toss three true coins simultaneously, the probability of a head on any selected one of them is 1/2, irrespective of what happens to the other two. If we select two of the coins and




ask for the joint probability of two heads, it is 1/4. The joint probability of heads on all three coins is 1/8. To increase this to 1/2, we must somehow "load" the coins to make their individual head probabilities (1/2)^1/3. Similarly, by increasing the range to ±1.56 σ, we can define a volume that will capture 68.3% of the three-way events for three independent Gaussian variates. All of these probabilities are functions not just of the single variate, but also of how many individual events are taken as the compound event.

11. The values of A and B and their standard errors remain constant as x0 is varied here.

12. The MC approach for constructing the enlarged error bands in ref 12 represents an interesting complement to the MC computations described above for testing the error bands at a range of selected x0. One starts with an original data set, either exactly fitting ("true") or already containing random error. Random error is added to these data to produce N sets of MC randomized data and correspondingly, N sets of fit parameters β. From the latter, one computes f(x0) for selected x0, and χ2 for the original data. These MC χ2 values must always exceed χ2 for the original fit of the original data (which would be 0 for exactly fitting data) and yield Δχ2 = χ2MC − χ2orig. For those MC sets that satisfy a particular Δχ2 test, the maximum and minimum values of the parameters and of f(x0) are taken as the appropriate statistical ranges. For Δχ2 = 1, these values approximate the 1-σ range for all quantities; for Δχ2 = 3.53, they approach the 3-way joint confidence projected errors for the parameters and the 1-σ "total curve" error band on the function. They always slightly undershoot the full error bands, which can be calculated easily for these linear fits from V, as already described. However, for nonlinear models, where eq 6 is only approximately true, they can provide realistic estimates of confidence limits (4, 5).

13.
As one simple example, suppose our constant-error fit model is changed to y = 1/A + bx. This becomes a nonlinear fit to a straight line, and the estimator A will not have the properties of an LLS estimator. In fact, each individual data set will return a value of A that is exactly the reciprocal of the value a returned in the linear fit. a has a relative error of 0.4186, and by error propagation, so does A. However, one can readily verify that even though a is normal, A is very nonnormal, biased, and divergent in its sampling statistics (8).
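This claim is easy to see in a quick MC sketch (same constant-error straight-line model with true a = 1, b = 5): a is normal with σ ≈ 0.4186, while its reciprocal A is skewed and heavy-tailed, and its sample mean and variance are unstable from run to run.

```python
import numpy as np

rng = np.random.default_rng(11)
x = np.array([1.1, 3.3, 5.5, 8.3, 12.0])
sig = 0.5
X = np.column_stack([np.ones_like(x), x])
V = np.linalg.inv(X.T @ X / sig**2)

nmc = 100_000
Y = 1.0 + 5.0 * x + rng.normal(0.0, sig, size=(nmc, len(x)))   # true a=1, b=5
beta = Y @ (V @ X.T / sig**2).T
a = beta[:, 0]                 # LLS estimator: normal, unbiased
A = 1.0 / a                    # estimator of A in the model y = 1/A + b*x

print(a.mean(), a.std())       # ~1.000 and ~0.4186
# The reciprocal is biased and divergent: its median stays near 1, but the
# sample mean and extremes are dominated by near-zero values of a
print(np.median(A), np.abs(A).max())
```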

Literature Cited

1. Mood, A. M.; Graybill, F. A. Introduction to the Theory of Statistics, 2nd ed.; McGraw-Hill: New York, 1963.
2. Hamilton, W. C. Statistics in Physical Science: Estimation, Hypothesis Testing, and Least Squares; The Ronald Press Co.: New York, 1964.
3. Deming, W. E. Statistical Adjustment of Data; Dover: New York, 1964.
4. Bevington, P. R.; Robinson, D. K. Data Reduction and Error Analysis for the Physical Sciences, 2nd ed.; McGraw-Hill: New York, 1992.
5. Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; Vetterling, W. T. Numerical Recipes; Cambridge University Press: Cambridge, United Kingdom, 1986.
6. Draper, N. R.; Smith, H. Applied Regression Analysis, 3rd ed.; Wiley: New York, 1998.
7. Tellinghuisen, J. J. Phys. Chem. A 2000, 104, 2834–2844.
8. Tellinghuisen, J. J. Phys. Chem. A 2000, 104, 11829–11835.
9. Dado, G.; Rosenthal, J. J. Chem. Educ. 1990, 67, 797–800.
10. Zielinski, T. J.; Allendoerfer, R. D. J. Chem. Educ. 1997, 74, 1001–1007.
11. Salter, C. J. Chem. Educ. 2000, 77, 1239–1243.
12. Ogren, P.; Davis, B.; Guy, N. J. Chem. Educ. 2001, 78, 827–836.
13. Abramowitz, M.; Stegun, I. A. Handbook of Mathematical Functions; Dover: New York, 1965.
14. Birge, R. T. Phys. Rev. 1932, 40, 207–227.
15. Tellinghuisen, J. J. Chem. Educ. 2000, 77, 1233–1239.
16. Tellinghuisen, J. J. Phys. Chem. A 2001, 105, 3917–3921.
17. Tellinghuisen, J. J. Mol. Spectrosc. 1996, 179, 299–309.
18. Shoemaker, D. P.; Garland, C. W.; Nibler, J. W. Experiments in Physical Chemistry, 6th ed.; McGraw-Hill: New York, 1996; pp 397–404.
19. Moody, H. W. J. Chem. Educ. 1982, 59, 290–291.
20. Harris, D. C. J. Chem. Educ. 1998, 75, 119–121.
21. de Levie, R. J. Chem. Educ. 1999, 76, 1594–1598.
