Robust Linear Regression Using Theil's Method

Chemical Education Today

Letters

In chemistry it is often necessary to fit a straight line to a set of data, and the technique most commonly used is the method of least squares. Unfortunately, this method suffers from one important drawback: it is not resistant to outliers, meaning that suspect points that lie outside the range of the others can have an undue influence on the estimates of the slope and intercept. This is primarily because all data values directly influence the least-squares estimates.

The standard approach is to screen for outliers, by looking at the residuals of the data from the linear fit or by testing replicates at the same design point (i.e., the same x value), and then to fit a least-squares line to the data with the suspected outliers removed or replaced. However, even when such obviously erroneous values have been removed or corrected, values that appear to be outliers may still occur. Moreover, the decision to reject a value or not, based on some kind of test such as Dixon’s Q-test (1), will clearly affect the mean and standard deviation. Reasons should always be given when suspected outliers are rejected.

An alternative approach is to use a method that is specifically designed to reduce the effect that outliers can have on slope and intercept estimates. In a recent study (2) the relative performance of least squares and of a technique designed for this purpose, called Theil’s method (3), was compared for a wide range of data sets.

In the incomplete version of the method, for a given set of n observed values $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, the estimate of the slope is calculated from the median of pairwise slopes. When n = 2N is even, for each pair of points $(X_i, Y_i)$, $(X_{N+i}, Y_{N+i})$, $i = 1, 2, \ldots, N$, the slope of the straight line through them is calculated from

$$ m_i = \frac{Y_{N+i} - Y_i}{X_{N+i} - X_i}, \qquad i = 1, 2, \ldots, N \tag{1} $$

That is, the pairwise slopes are based on the first points, second points, etc. in each half of the data. The estimates of the slope, m, and intercept, b, of a straight line relationship y = mx + b (relating the physical variables x and y) are then given by

$$ m = \operatorname{median}(m_i) \tag{2} $$

$$ b = \operatorname{median}(b_i) = \operatorname{median}(Y_i - m X_i) \tag{3} $$
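Equations 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not the routine from ref 2: the function name `theil_incomplete` and the sample data are hypothetical, the points are assumed to be ordered by x with distinct x values, and for an odd number of points the middle point is left out of the pairwise slopes only, while the intercept is taken over all points.

```python
from statistics import median


def theil_incomplete(x, y):
    """Sketch of Theil's incomplete method for y = m*x + b (eqs 1-3)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two points")

    # Assumption: the two halves are the lower and upper parts of the x range,
    # so the points are sorted by x first. Distinct x values are assumed.
    pts = sorted(zip(x, y))
    n = len(pts)
    half = n // 2

    lower = pts[:half]
    upper = pts[n - half:]  # for odd n this skips the middle point

    # Eq 1: slope through the i-th point of each half.
    slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(lower, upper)]

    # Eq 2: slope estimate is the median of the pairwise slopes.
    m = median(slopes)

    # Eq 3: intercept estimate is the median of Y_i - m*X_i over all points.
    b = median(yi - m * xi for xi, yi in pts)
    return m, b


# Hypothetical data: y = 2x + 1 with one gross outlier at x = 5.
m, b = theil_incomplete([1, 2, 3, 4, 5, 6], [3, 5, 7, 9, 30, 13])
print(m, b)  # 2.0 1.0
```

Only one of the three pairwise slopes involves the outlier, so the median ignores it and the underlying line is recovered.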

Journal of Chemical Education



For an odd number of points the middle point is omitted when calculating the pairwise slopes. Recall that the median is the middle value when an odd number of values are put in ascending order, and the mean of the two middle values when an even number of values are put in ascending order. Clearly, as the number of observed values increases so does the number of pairs and slope estimates, but this is not a problem if a spreadsheet or something similar is used for the calculations. (It should be noted that in the Matlab routine for Theil’s incomplete method in ref 2 the final command cth=median(c); is missing.) There is also a complete version of Theil’s method, which is even more robust but more complicated because all possible pairs of points are used.

In the study (2), least squares and Theil’s methods are applied to data sets in which all points except one randomly selected point, the outlier, lie on a straight line. It is found that for six points or more the means of all possible slope and intercept estimates coincide with the true values for Theil’s methods, but this is not the case for least squares when the outlier is retained. These results are replicated for data sets containing two such outliers when there are 10 or more data points. So when outliers are present Theil’s methods reduce their effect and are worthy of consideration. If the outliers are removed in advance then the least-squares estimates are comparable with those using Theil’s methods, but this means that the user must screen and remove, with reasons, the suspect points.

Theil’s methods do not provide error estimates. On the other hand, the standard least-squares estimates, and the corresponding error estimates and confidence intervals (see Note 1), are based on a very precise set of assumptions about the errors in the x and y values (2). In practice these may not be valid for the given set of data, although confidence intervals are often quoted anyway.
Having said that, the unreliability of estimates of parameter standard errors in least squares is due more to the large inherent uncertainty in a variance estimated from a small number of fitted points than to a violation of the assumptions behind least squares. It is arguably better to have some idea of parameter precision than none at all.

We do not advocate a switch from the standard approach outlined at the beginning to one using Theil’s methods. However, it is important that practitioners understand the limitations of least squares and the issues concerning the removal or replacement of suspected outliers, and that there are other methods available which are as robust and accurate as least squares with outlier removal or replacement. A reliable estimate produced by a robust method can be just as valuable as one produced by least squares with outlier removal or replacement and accompanied by confidence intervals based on assumptions that are not rigorously satisfied. The price of opting for a robust method for the sake of safety is that the guarantee of minimum variance that comes with least squares is lost. In any case, we certainly do advocate that practitioners critically evaluate their data rather than simply applying a method and assuming that it works.
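The single-outlier comparison discussed earlier can be illustrated with a short Python sketch. It is not the code or the data of ref 2: the function names and the data set are hypothetical, the complete version is taken here to mean the median of the slopes through all pairs of points with distinct x values, and the intercept is assumed to follow eq 3 as in the incomplete version. It shows one outlier position only, not the average over all positions reported in the study.

```python
from itertools import combinations
from statistics import mean, median


def theil_complete(x, y):
    """Sketch of Theil's complete method: median slope over all point pairs."""
    pts = list(zip(x, y))
    # Slopes through every pair of points; pairs with equal x are skipped.
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(pts, 2)
              if x2 != x1]
    m = median(slopes)
    # Assumption: intercept taken as in eq 3, the median of Y_i - m*X_i.
    b = median(yi - m * xi for xi, yi in pts)
    return m, b


def least_squares(x, y):
    """Ordinary least-squares slope and intercept, for comparison."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    m = sxy / sxx
    return m, ybar - m * xbar


# Hypothetical data: y = 2x + 1 with one gross outlier at x = 5.
x = [1, 2, 3, 4, 5, 6]
y = [3, 5, 7, 9, 30, 13]

print("Theil (complete):", theil_complete(x, y))
print("Least squares:   ", least_squares(x, y))
```

For this data set 10 of the 15 pairwise slopes equal 2, so the median recovers the line y = 2x + 1, whereas the least-squares slope is pulled to roughly 3.6 by the retained outlier.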

Vol. 82 No. 10 October 2005



www.JCE.DivCHED.org

Note

1. Confidence intervals using rank correlation methods can be given for data without making these assumptions.

Literature Cited

1. Miller, J. C.; Miller, J. N. Statistics for Analytical Chemistry, 3rd ed.; Ellis Horwood: London, 1993; p 63.
2. Glaister, P. Int. J. Math. Educ. Sci. Technol. 2005, 36, 110–117.
3. Emerson, J. D.; Hoaglin, D. C. Resistant Lines for y versus x. In Understanding Robust and Exploratory Data Analysis; Hoaglin, D. C., Mosteller, F., Tukey, J. W., Eds.; John Wiley & Sons: New York, 1983; pp 129–165.

P. Glaister
Department of Mathematics
University of Reading
Whiteknights, Reading, RG6 6AX, UK
[email protected]
