Some Misconceptions of Regression Analysis in Physical Organic Chemistry
C. Kinney Hancock
Texas A&M University, College Station
During the past three decades, quantitative structure-property relationships for organic compounds have been developed empirically at an ever-increasing rate (1-3). The majority of these correlations are linear free energy relationships derived by ordinary linear regression analysis (method of least squares); however, there is increasing use of multiple regression analysis to derive relationships involving more than one independent variable. The purpose of this note is to call attention to some of the misconceptions in the use of regression analysis in physical organic chemistry that have been encountered in the literature. Such misconceptions probably exist in other scientific fields as well. The intent is to be generally helpful and not to be adversely critical of individuals; therefore, specific references will not be cited.
Probably the most common misconception is the comparison of the significance of two regressions on the basis of the magnitudes alone of the corresponding correlation coefficients. In order to make valid comparisons, the number of variables and the degrees of freedom (d.f. = n - m, where n is the number of observations and m is the total number of variables) must be considered in addition to the magnitude of the correlation coefficient. For example (4a), a correlation coefficient of 0.950 for a linear regression (two variables) with 2 d.f. is significant only at the 95% confidence level, while a considerably smaller correlation coefficient of 0.874 for another linear regression with 5 d.f. is highly significant at the 99% confidence level. Compared to the latter linear regression, significance at the same confidence level of 99% for a multiple regression involving two independent variables (a total of three variables) and the same d.f., 5, requires a minimum correlation coefficient of 0.917. In the extreme, i.e., d.f. = 0, the correlation coefficient is exactly unity without any statistical significance.
In other words, any two points lie on a straight line, any three points lie in a plane, etc.
Perhaps equally as serious as the above is the comparison of the significance of two regressions on the basis of the magnitudes alone of the corresponding standard deviations from regression. If the standard deviations from regression are about the same for two relationships, then the more dependable and more significant of the two is the one that involves the wider range in values of the independent variable. More generally and more exactly, the comparison of significance should be made on the basis of the appropriate estimated standard errors of the regression coefficients and the appropriate tests of significance (4b).
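
The role of the range of the independent variable can be made explicit with the standard textbook expression for the estimated standard error of the slope in a simple linear regression. The notation below (x for the independent variable, b for the slope, s for the standard deviation from regression) is assumed here for illustration and is not taken from the original note.

```latex
s_b = \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}},
\qquad
t = \frac{b}{s_b} \quad \text{with } n - 2 \text{ d.f.}
```

For the same s, a wider spread of x-values enlarges the denominator, so s_b is smaller and the t-value for the slope is larger.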
Journal of Chemical Education
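
The correlation-coefficient examples quoted earlier (0.950 with 2 d.f., 0.874 with 5 d.f., and 0.917 for three variables with 5 d.f.) can be checked numerically. The following is a minimal sketch, not the procedure behind the cited table: it assumes the usual overall F-test for a regression, for which the p-value equals the regularized incomplete beta function evaluated at 1 - r^2 with parameters (d.f./2, k/2), where k is the number of independent variables. The function name `regression_p_value` and its arguments are hypothetical, and the integral is evaluated by Simpson's rule using only the standard library.

```python
import math


def regression_p_value(r, n_obs, n_vars, steps=2000):
    """Approximate p-value for a (multiple) correlation coefficient r.

    n_obs  : number of observations (n in the text)
    n_vars : total number of variables, dependent plus independent (m in the text)
    steps  : number of Simpson intervals (must be even)
    """
    k = n_vars - 1        # number of independent variables
    df = n_obs - n_vars   # degrees of freedom, n - m
    if k < 1:
        raise ValueError("at least one independent variable is required")
    if df < 1:
        # With d.f. = 0 the fit is exact (r = 1) and no test is possible.
        raise ValueError("d.f. = n - m must be at least 1")

    x = 1.0 - r * r
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0

    a, b = df / 2.0, k / 2.0

    # p = I_x(a, b). Substituting t = x * u**2 in the beta integral gives
    # 2 * x**a * integral_0^1 u**(2a - 1) * (1 - x * u**2)**(b - 1) du,
    # which has no endpoint singularity for 0 < x < 1.
    def integrand(u):
        return u ** (2.0 * a - 1.0) * (1.0 - x * u * u) ** (b - 1.0)

    h = 1.0 / steps
    total = integrand(0.0) + integrand(1.0)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    integral = total * h / 3.0

    beta = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return 2.0 * x ** a * integral / beta


if __name__ == "__main__":
    # (r, n, m) chosen to reproduce the d.f. quoted in the text.
    for r, n, m in [(0.950, 4, 2), (0.874, 7, 2), (0.917, 8, 3)]:
        print(r, n - m, round(regression_p_value(r, n, m), 4))
```

Under these assumptions the three cases give p-values of roughly 0.05, 0.01, and 0.01, in line with the confidence levels quoted, even though the correlation coefficients themselves differ considerably.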
Another misconception is illustrated by the following example. Currently, the most common linear regression analysis in physical organic chemistry is that of the regression of log k on σ to obtain a Hammett (5) equation, log k = log k₀ + ρσ, where log k₀ is the regression intercept and ρ is the regression slope. This equation is useful in several ways; one of these is the estimation of new k-values from the corresponding σ-values, if available. It has been stated in the literature that not the above equation but, instead, the equation for the regression of σ on log k should be used for the estimation of new σ-values from experimental k-values. This proposal is questionable for two reasons. First, for many reaction series of m- and p-substituted benzene derivatives, it is obvious that there is a linear functional relationship (6) between log k as the dependent variable and σ as the independent variable. Second, even though there are uncertainties in σ-values, they are generally less than the uncertainties in experimental k-values. For these reasons, the equation for the regression of log k on σ is equally applicable to the estimation of new k-values from known σ-values and of new σ-values from experimental k-values, the latter by solving the same equation for σ, i.e., σ = (log k - log k₀)/ρ (6).
Finally, in both written and oral presentations, it is occasionally stated that the inclusion of another independent variable, even though it is of no significance, will always improve the relationship. Generally, the statement might be that the relationship will probably not be significantly worse by inclusion of an additional variable. More exactly, the statement should be that the inclusion of an additional variable will always reduce the total of squared deviations from regression (the numerator of the mean square error of deviations from regression) but, at the same time, will increase m and hence reduce the degrees of freedom, n - m (the denominator of the mean square error of deviations from regression).
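
Written out, the quantity in question is the mean square error of deviations from regression, shown below with a small worked case. The symbols SS and s and all numerical values are hypothetical, chosen only to illustrate the arithmetic; they are not data from the original note.

```latex
s^2 = \frac{SS}{n - m},
\qquad
SS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

% Hypothetical case with n = 6 observations:
%   two variables   (m = 2): SS = 4.00, s^2 = 4.00 / 4 = 1.00, s = 1.00
%   three variables (m = 3): SS = 3.60, s^2 = 3.60 / 3 = 1.20, s \approx 1.10
% SS falls by 10% while the d.f. fall by 25%, so s^2 rises.
```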
The net effect may therefore be that the estimated root mean square error of deviations from regression is increased; i.e., in unfavorable cases, the relationship may even be made worse by the inclusion of an additional variable. The inclusion of another variable reduces the d.f. by one, and, for the same significance, a higher correlation coefficient is required, as pointed out previously. So, most likely, the misconception arises from the observation that including the insignificant variable increases the correlation coefficient, without recognition of the accompanying decrease of one in the d.f.
Acknowledgments. The author is indebted to The Robert A. Welch Foundation for partial support of this work and to Dr. H. O. Hartley, Director of the Institute of Statistics of Texas A&M University, for his helpful comments and discussion.
Literature Cited
(1) JAFFÉ, H. H., Chem. Rev., 53, 191 (1953).
(2) TAFT, R. W., JR., DENO, N. C., AND SKELL, P. S., Ann. Rev. Phys. Chem., 9, 287 (1958).
(3) WELLS, P. R., Chem. Rev., 63, 171 (1963).
(4) SNEDECOR, G. W., "Statistical Methods," 4th ed., The Iowa State College Press, Ames, Iowa, 1946; (a) Table 13.6, p. 351; (b) pp. 118 and 349.
(5) HAMMETT, L. P., "Physical Organic Chemistry," McGraw-Hill Book Co., Inc., New York, 1940, Chap. 7.
(6) DAVIES, O. L., "Statistical Methods in Research and Production," 3rd ed., Oliver and Boyd, London, 1957, p. 169.
Volume 42, Number 11, November 1965