Thoughts on Stepwise Regression Analysis
Stepwise regression analysis technique is presented

In the analysis of experimental data, it is often desired to construct an equation of the form

y = b0 + b1x1 + b2x2 + . . . + bkxk

where y is the response variable, x1, x2, . . ., xk are the input variables, and b0, b1, . . ., bk are estimated coefficients. In general, the equation will not "reproduce" the observed y values exactly. This may be the result of measurement errors in observing y, or it may be that the equation does not faithfully represent truth.
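For concreteness, the fitting itself is ordinary least squares. The sketch below, using NumPy, estimates b0, b1, . . ., bk from made-up illustrative data; it is not tied to any data set discussed in this paper.

```python
import numpy as np

# Made-up data: five observations of k = 2 input variables and a response y.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 8.1, 10.0])

# Prepend a column of ones so that the intercept b0 is estimated as well.
A = np.column_stack([np.ones(len(y)), X])

# Least-squares estimates b0, b1, b2.
b, *_ = np.linalg.lstsq(A, y, rcond=None)

y_hat = A @ b          # fitted values; they will not "reproduce" y exactly
residuals = y - y_hat  # measurement error plus any model inadequacy
```

With an intercept in the equation, the residuals sum to zero; the failure to reproduce y exactly shows up in their spread, not their mean.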
Purpose of the Analysis
In many cases, the analyst may be trying to better understand the underlying causal system. Where he can establish the proper true equation, the b's are estimates of the true β's. In other cases, the analyst does not pretend to have the appropriate "causal" model but is seeking an imitative model. Where the system represented is stable, an imitative model can be very useful for predicting the response, y, for known values of the input variables, x. This paper is solely concerned with the technique of constructing a model for such predictive purposes.
Technique

The question then arises as to which variables might reasonably be included in a predictive model. Knowledge of the system is essential here because there are so many ways this process can be upset. Some examples of erroneous conclusions that can be reached are illustrated by Box (1). The more nearly the model approximates the real relationship, the better are the chances for success. Some purely imitative variables (not directly causal) may correlate very well with the response and thereby be useful. But such high correlation does not indicate a direct physical connection. Nor does a lack of correlation prove a lack of physical connection. Given a starting set of candidates, x1, x2, . . ., xk, the question naturally arises as to which subset to use. Perhaps an adequate equation can be constructed using only a few of the input variables x1, x2, . . ., xk. There is no unique statistical procedure to best determine which input variables to include. Several procedures exist, but all depend to some extent on personal judgment. One well-known technique for constructing the regression equation is known as stepwise regression. This paper will be concerned only with this technique, which is thoroughly discussed by Draper and Smith (2). Before examining this technique, some data requirements need to be considered.

AUTHOR Paul T. Pope is an Assistant Professor in the Department of Mathematics, The University of Tulsa, Tulsa, Oklahoma 74704.

Data Collection
An initial step in the analysis is the collection of data from what is often an "unplanned" experiment. Some pre-examination is wise. Even for imitative input variables, there must be variation in their values if any relationships are to be seen. And, of course, the response must really vary in "response" to changes in the causal variables. Otherwise, the model derived may only be "fitting" the measurement error. Data should be screened carefully for mistakes. A single miscopied number can destroy the whole project.

VOL. 62, NO. 7, JULY 1970
The quantity as well as the quality of data is important. There must be more combinations of conditions represented than there are constants to be estimated. For this, replicates (repeats at the same conditions) do not count. It is also important to have some information left over for estimating the residual error. Here, replicates do count. These are largely economic as well as statistical questions but, in general, the more the better.
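The counting rules above can be written out as simple bookkeeping. The observation counts below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Hypothetical design: 12 observations taken at 9 distinct combinations
# of conditions, with 3 candidate input variables.
n = 12
n_distinct = 9
p = 3

n_constants = p + 1          # b0, b1, ..., bp to be estimated

# Distinct combinations of conditions (replicates do not count here)
# must exceed the number of constants to be estimated.
assert n_distinct > n_constants

# Replicates do count toward estimating the residual error.
error_df = n - n_constants      # information left over for the error estimate
pure_error_df = n - n_distinct  # contributed by the replicates alone
```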
The Stepwise Procedure

When the data have passed some preliminary examination, the stepwise procedure can be applied. The procedure enters the input variables into the equation one at a time. It does this by choosing the variable which, at this point, has the most information relating to the remaining variation in the response. Some arbitrary rule must be used to decide when to stop adding variables. Usually, a pseudo "statistical significance" test is used for this. At each stage when a new variable, xi, is being considered, a statistic Fi is computed. This is the ratio of the regression mean square for xi (adjusted for all other variables in the equation so far) to the mean square for error. If Fi exceeds a chosen critical value, then the variable xi is included in the equation. Because the appropriate conditions for the test are not met, a more severe (larger) than usual critical value should be employed. This will depend on the particular circumstances of the analysis. The larger critical value helps to compensate for the process of always choosing the "best" remaining input variable. As a guide, F values from tables may be used where the degrees of freedom are 1 for the numerator and "error degrees of freedom" for the denominator.
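One pass of the selection step described above might be sketched as follows. This is an illustrative implementation, not the author's program: it uses least-squares fits throughout, measures each candidate's adjusted regression sum of squares as the drop in residual sum of squares, and uses the residual mean square at that point as the denominator.

```python
import numpy as np

def rss(A, y):
    """Residual sum of squares from a least-squares fit of y on A."""
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ b) ** 2))

def forward_step(X, y, included, f_critical):
    """Try to add one variable: pick the candidate with the largest
    adjusted regression sum of squares, and admit it only if its F
    ratio exceeds f_critical. Returns the chosen column index or None."""
    n = len(y)
    ones = np.ones((n, 1))
    base = np.hstack([ones] + [X[:, [j]] for j in included])
    rss_base = rss(base, y)
    best_j, best_drop = None, 0.0
    for j in range(X.shape[1]):
        if j in included:
            continue
        trial = np.hstack([base, X[:, [j]]])
        drop = rss_base - rss(trial, y)  # SS for x_j, adjusted for the rest
        if drop > best_drop:
            best_j, best_drop = j, drop
    if best_j is None:
        return None
    trial = np.hstack([base, X[:, [best_j]]])
    error_df = n - trial.shape[1]
    mse = rss(trial, y) / error_df       # residual mean square at this point
    f_i = best_drop / mse                # numerator df is 1
    return best_j if f_i > f_critical else None
```

A call such as forward_step(X, y, [], 3.46) returns the index of the admitted variable, or None if no candidate passes the test.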
Estimating the Error

The question arises as to what should be used for the mean square error at each step. If the mean square error used at each step is the residual mean square at that point, it may be considerably larger than the error variance associated with the best imitative model. (This error variance may be very near the true "measurement" error variance.) This inflated error imposes a severe hardship on the selection procedure where the "explanation" of the response is divided among several variables. What is proposed here is that an estimate of error be obtained by first placing all the contending input variables into the equation. Variables which do not contribute to predicting will be included by this process only temporarily. The error estimate will be near that which the best model will give. The only loss will be in error degrees of freedom. This is a small price to pay for this improvement in the procedure. An illustration of what might otherwise happen is given by the following. Suppose we have three input variables x1, x2, x3 and, for the sake of simplicity, let the correlations between these be zero. The situation can be illustrated by the following analysis-of-variance table.
INDUSTRIAL AND ENGINEERING CHEMISTRY
Analysis of Variance

Source    Degrees of freedom    Sum of squares    Mean squares
x1                 1                  1525             1525
x2                 1                  1550             1550
x3                 1                  1575             1575
Error              8                  3400              425
Total             11                  8050
The sum of squares for error computed by putting all three variables in the equation is 3400 with 8 degrees of freedom. In constructing the equation, x3 is entered first since it has the largest sum of squares. The statistic F3 = MSR3/MSE = 1575/425 = 3.71. If we use the 90% point of the F distribution, which is 3.46, as our critical point, then x3 would be included in the equation. However, if MSE had been computed for only x3 in the equation, we would get, taking advantage of the orthogonality,
MSE = (1525 + 1550 + 3400)/10 = 647.5
and F3 = 2.43. This is less than the "critical value," which says that the variable x3 would be excluded. Thus an estimate of error obtained by first placing all input variables of interest in the equation is desirable.
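The arithmetic of this example can be checked directly; the sums of squares below are those given in the analysis-of-variance table above.

```python
# Sums of squares from the table (x1, x2, x3 are mutually orthogonal).
ss = {"x1": 1525.0, "x2": 1550.0, "x3": 1575.0}
ss_error_full = 3400.0   # error SS with all three variables in the equation
df_error_full = 8

# F for x3 against the error estimate from the full equation (proposed here):
mse_full = ss_error_full / df_error_full   # 3400/8 = 425
f3_full = ss["x3"] / mse_full              # 1575/425 = 3.71

# F for x3 when the error is the residual with only x3 entered:
mse_residual = (ss["x1"] + ss["x2"] + ss_error_full) / 10  # 647.5
f3_residual = ss["x3"] / mse_residual      # 1575/647.5 = 2.43

# Against the 90% point F(1, 8) = 3.46, x3 is admitted in the first
# case but would be excluded in the second.
```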
Validation

An imitative model may be successfully used for prediction if the model is somehow indirectly related to truth and if the conditions of the system which provide that relationship remain stable. As mentioned in the beginning, there are a number of techniques designed to construct a prediction equation, each being somewhat subjective. Regardless of how constructed, a predictor equation is judged good or bad depending on its ability to predict. In all situations, it would be wise to test the equation on a new set of data in order to determine if simply a "one time" fit had been achieved.
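A check of this kind might be sketched as follows; the function name and the particular split into old and new data are illustrative, with the equation fitted on one set and judged on the other.

```python
import numpy as np

def holdout_check(X_fit, y_fit, X_new, y_new):
    """Fit the equation by least squares on one data set, then judge it
    on data it has never seen. Returns the root-mean-square error on
    each set; a large gap suggests only a 'one time' fit was achieved."""
    A_fit = np.column_stack([np.ones(len(y_fit)), X_fit])
    b, *_ = np.linalg.lstsq(A_fit, y_fit, rcond=None)

    A_new = np.column_stack([np.ones(len(y_new)), X_new])
    rmse_fit = float(np.sqrt(np.mean((y_fit - A_fit @ b) ** 2)))
    rmse_new = float(np.sqrt(np.mean((y_new - A_new @ b) ** 2)))
    return rmse_fit, rmse_new
```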
Conclusions

Used with care at each step and with a thorough understanding of the system represented, the stepwise regression analysis technique can be of assistance in constructing imitative models. Its use can be strengthened if at each stage the benefit of an additional input variable is judged by using the residual error variance relative to a model which includes all candidates. This modification tends to reduce the risk of missing useful predictors. Of course, there remains the risk that useful predictors will be missed because variation was insufficient to reveal existing relationships.
Acknowledgment

The author would like to thank the referees for several helpful suggestions.

REFERENCES

(1) Box, G. E. P., Technometrics, 8, 625-629 (1966).
(2) Draper, N. R., and Smith, H., "Applied Regression Analysis," John Wiley and Sons, Inc., New York, 1966.