Research: Science and Education
Two-Variable Linear Regression: Modeling with Orthogonal Least-Squares Analysis

J. Vicente de Julián-Ortiz
Instituto de Tecnología Química, CSIC-Universidad Politécnica de Valencia, Av. de los Naranjos s/n, 46022 Valencia, Spain

Lionello Pogliani
Dipartimento di Chimica, Università della Calabria, 87030 Rende (CS), Italy

Emili Besalú*
Department of Chemistry, Universitat de Girona, Facultat de Ciències, Avda. Montilivi s/n, 17071 Girona, Spain

*
[email protected]

In many fields of chemistry, the ordinary least-squares method is the preferred way to fit data. Nevertheless, univariate linear regression by classical least-squares (CLS) analysis has some drawbacks that are usually overlooked in experimental science courses and even in many chemical research papers. Some users, especially students, tend to "forget" the basic assumptions of the method and apply it to series of data pairs that have errors in both variables. Even more concerning, some people still believe that classical univariate linear regression treats the x and y data series symmetrically. Then, once the fitting equation is obtained, the mathematical expression is erroneously and freely manipulated, ignoring that the regression equation of y on x is not the same as the equation for the x on y fit.

Orthogonal least-squares (OLS) fitting is a good method to avoid the unsymmetrical treatment of the data. The CLS method (in the y on x case, for instance) minimizes only the squared distances parallel to the y axis between the experimental points and the fitting line; the distances parallel to the x axis are not considered because x is assumed to be free of error or uncertainty. OLS regression is an alternative that gives a symmetric treatment because it minimizes the sum of squared orthogonal distances to the fitted line. It is an intuitive method that deals with errors in both variables. For every data point $(x_i, y_i)$, $i = 1, \ldots, N$, the distance that enters the sum of errors to be minimized ($u_i$ for the CLS technique or $d_i$ in the OLS method) is shown in Figure 1.

Figure 1. The two distances that are minimized by the two methods: the vertical, $u_i$, and the orthogonal, $d_i$, distances.

The OLS equation, $y = \alpha_{\pm} + \beta_{\pm} x$, is obtained from the following formulation (1), which gives two possible equations, one being the wanted one and the other the line orthogonal to it:

$$\beta_{\pm} = -B \pm \sqrt{B^2 + 1}$$

and

$$\alpha_{\pm} = \bar{y} - \beta_{\pm}\,\bar{x}$$

where $\alpha$ is the intercept, $\beta$ is the slope, and B is defined as

$$B = \frac{S_x^2 - S_y^2}{2 S_{xy}}$$

The covariance of x and y is defined as

$$S_{yx} = S_{xy} = \frac{\sum_{i=1}^{N} X_i Y_i}{N} = \langle XY \rangle$$

where $X_i = x_i - \bar{x}$ and $Y_i = y_i - \bar{y}$. Similarly, $S_x^2 = \langle X^2 \rangle$ and $S_y^2 = \langle Y^2 \rangle$. Both equations correspond to the two principal components of the data. The uncertainties on the intercept and on the slope are
expressed in terms of the respective variances:

$$S_{\alpha}^2 = \frac{S_d^2 \sum_{i=1}^{N} x_i^2}{N \sum_{i=1}^{N} X_i^2}$$

and

$$S_{\beta}^2 = \frac{S_d^2}{\sum_{i=1}^{N} X_i^2}$$

where $S_d$ is the standard deviation of the residuals, whose square is defined as

$$S_d^2 = \frac{\sum_{i=1}^{N} d_i^2}{N - 2} = \frac{\sum_{i=1}^{N} (y_i - \beta x_i - \alpha)^2}{(N - 2)(1 + \beta^2)}$$

An extension of this information is given in the supporting information, including an inference procedure for the OLS equation slope interval.

Literature Cited
1. Weisstein, E. W. Least Squares Fitting - Perpendicular Offsets. http://mathworld.wolfram.com/LeastSquaresFittingPerpendicularOffsets.html (accessed June 2010).

Supporting Information Available
Expanded version of the text with an example. This material is available via the Internet at http://pubs.acs.org.

Journal of Chemical Education, Vol. 87 No. 9, September 2010, pubs.acs.org/jchemeduc
© 2010 American Chemical Society and Division of Chemical Education, Inc. 10.1021/ed100307z. Published on Web 07/08/2010.
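As a rough illustration of the slope, intercept, and uncertainty formulas in the main text, the following Python sketch computes an OLS fit and compares it with the two CLS slopes (y on x and x on y). The names `orthogonal_fit`, `cls_slope`, and `OrthogonalFit`, the sample data, and the rule used to choose between the two roots (keeping the candidate with the smaller orthogonal residual variance) are assumptions made for this illustration; they are not taken from the article or its supporting information.

```python
import math
from typing import NamedTuple, Sequence


class OrthogonalFit(NamedTuple):
    slope: float
    intercept: float
    slope_sd: float      # square root of S_beta^2
    intercept_sd: float  # square root of S_alpha^2


def cls_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Classical least-squares slope for the regression of ys on xs."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


def orthogonal_fit(xs: Sequence[float], ys: Sequence[float]) -> OrthogonalFit:
    """Orthogonal least-squares line y = alpha + beta * x (illustrative sketch)."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least three (x, y) pairs of equal length")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    X = [x - x_mean for x in xs]  # centered x values, X_i
    Y = [y - y_mean for y in ys]  # centered y values, Y_i
    sum_X2 = sum(v * v for v in X)
    s_x2 = sum_X2 / n
    s_y2 = sum(v * v for v in Y) / n
    s_xy = sum(a * b for a, b in zip(X, Y)) / n
    if s_xy == 0:
        raise ValueError("zero covariance: B is undefined")

    B = (s_x2 - s_y2) / (2 * s_xy)
    root = math.sqrt(B * B + 1)

    def residual_variance(beta: float) -> float:
        # S_d^2 for the line through the centroid with slope beta
        alpha = y_mean - beta * x_mean
        ss = sum((y - beta * x - alpha) ** 2 for x, y in zip(xs, ys))
        return ss / ((n - 2) * (1 + beta * beta))

    # Assumed selection rule: of the two mutually orthogonal candidate lines,
    # keep the one with the smaller orthogonal residual variance.
    beta = min((-B + root, -B - root), key=residual_variance)
    alpha = y_mean - beta * x_mean
    s_d2 = residual_variance(beta)
    slope_sd = math.sqrt(s_d2 / sum_X2)
    intercept_sd = math.sqrt(s_d2 * sum(x * x for x in xs) / (n * sum_X2))
    return OrthogonalFit(beta, alpha, slope_sd, intercept_sd)


if __name__ == "__main__":
    # Made-up data, used only to exercise the functions.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]
    print("OLS fit:", orthogonal_fit(xs, ys))
    print("CLS slope, y on x:", cls_slope(xs, ys))
    print("CLS slope, x on y, inverted:", 1.0 / cls_slope(ys, xs))
```

With positively correlated data such as these, the OLS slope should fall between the y on x slope and the inverted x on y slope, and swapping the roles of x and y should give the reciprocal slope, which is the symmetric behavior the text describes.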