PITFALLS OF STEPWISE REGRESSION ANALYSIS

ROBERT A. STOWE RAYMOND P. MAYER

Incorrect estimation of error is a risk in regression analysis. This article tells how to avoid traps that lead to false experimental conclusions.

In making an individual measurement, the experimenter is, in effect, randomly drawing one result from an infinite number of results, normally distributed about the true mean. An experiment is performed by measuring a property under one set of conditions. Then, a variable is changed and the response is again measured. The difference in these two values is called the effect of changing the variable. The experimenter's problem is illustrated in Figure 1.

Figure 1. Distributions in measurements

Here, each Gaussian curve represents the infinite set of normally distributed measurements possible. The minus and plus represent the low and high levels of the variable. In Case I, we have the condition in which changing the variable resulted in little or no change in the response. In Case III, a change in the variable had a large effect on the value of the response; this illustrates a significant effect. The intermediate case is shown in Case II. The shaded area represents a region of values common to each distribution. The distribution at the low level and the distribution at the high level are not sufficiently different to be sure that the values drawn from them actually come from different distributions. The experimenter is psychologically conditioned to accept Case III. Statistical techniques assist in the interpretation of the data by showing the probabilities of the results having occurred by chance.

TABLE I. 2^4 DESIGN FOR STUDYING VARIABLES A, B, C, AND D

Run   Treatment Combination      Run   Treatment Combination
 1    (1)                         9    bc
 2    a                          10    bd
 3    b                          11    cd
 4    c                          12    abc
 5    d                          13    abd
 6    ab                         14    acd
 7    ac                         15    bcd
 8    ad                         16    abcd
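As an illustration of how the sixteen treatment combinations arise (a sketch added here, not part of the original article), every high/low setting of the four variables can be enumerated and given its conventional lower-case label:

```python
from itertools import product

# 2^4 factorial: one run for every high/low combination of A, B, C, and D.
# A lower-case letter in the label means that variable is at its high level;
# "(1)" denotes the run with every variable at its low level.
factors = "abcd"
for levels in product((0, 1), repeat=4):
    label = "".join(f for f, lvl in zip(factors, levels) if lvl) or "(1)"
    print(levels, label)
```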

Figure 2. Distribution of numbers in sack

Experimental

Desired comparisons were illustrated by use of an orthogonal, or balanced, design in which the procedures are simple and the calculations are relatively straightforward and uncomplicated. A 16-run, complete factorial design was selected. This design is normally used for studying four variables, A, B, C, and D, each at two levels. Table I shows the treatment combinations for the runs, in which the presence of the lower case letter a, b, c, or d indicates the high level of the corresponding variable. In this design, each possible combination of variables is present.

A physical measurement can be considered to have been drawn randomly from the normal distribution containing the infinite set of values for that measurement. For purposes of illustration, all of the measurements or responses for the entire 16 runs are to be obtained from the same normal distribution. Thus, it is known beforehand that no real effects will be present. The normal distribution was approximated closely by placing numbers on tokens. The number of tokens having the various values for per cent yield are shown in Figure 2. In all, 100 tokens were taken. The curved line shows the Gaussian distribution which the tokens approximate. The true mean is 35.0% and the true standard deviation is 4.0.

An experiment was run by shaking the tokens together in a sack and withdrawing a single token. Its value was used for the response of per cent yield for the conditions of that run. Each token was replaced before drawing again.


The values obtained from the sack are shown in Table II and, ranging between 26 and 40%, appear typical for the results of experimental runs. Most production plants or research laboratories would be happy to achieve conditions that would give this much increase in the reaction yield. However, as is known here, these values were obtained from a single population and have no cause-and-effect relationship. The selection was random and the values occurred strictly by chance. Indeed, the drawn values can be used to characterize the distribution from which they came. The calculated mean of 33.75 and the calculated standard deviation of 3.5 compare closely with the values of 35 and 4 which are built into the sack.
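A minimal sketch of this sampling scheme (simple random draws from a normal distribution with the same mean and standard deviation, rather than the authors' sack of 100 tokens; the seed is arbitrary):

```python
import random
import statistics

random.seed(1)                      # any seed; each run of the script is one "experiment"

TRUE_MEAN, TRUE_SD = 35.0, 4.0      # per cent yield built into the sack

# Sixteen runs: each response is an independent draw, since tokens are
# replaced before the next drawing. Rounding mimics integer token values.
yields = [round(random.gauss(TRUE_MEAN, TRUE_SD)) for _ in range(16)]

print("responses:", yields)
print("mean:", statistics.mean(yields))
print("std dev:", round(statistics.stdev(yields), 2))
```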

Calculation of Effects

The formulas used in the calculation of the effects of the variables are listed below; variable A is used as an example. The effect is the average of the eight runs made at the high level of variable A minus the average of the eight runs made at the low level, as in Equation 1. The standard error of the effects is calculated from the standard deviation of the observations, as shown in Equation 2. Using the known standard deviation of 4 for this population, the standard error for an effect is 2.0. Normally, this value must be derived from an experimentally determined standard deviation; various methods will be discussed later. However, the present illustration will employ the known value built into the sack. The calculation of "t" is shown in Equation 3. The equation normalizes the effect into "t" units for use in the "t" test for significance.
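Written out, and consistent with the description above (a standard deviation of 4.0 and N = 16 observations giving a standard error of 2.0), the three equations presumably read:

$$\text{Effect}_A = \bar{y}_{A\,\text{high}} - \bar{y}_{A\,\text{low}} = \frac{1}{8}\sum_{\text{high }A} y_i \;-\; \frac{1}{8}\sum_{\text{low }A} y_i \qquad (1)$$

$$\text{S.E.}_{\text{eff}} = \frac{2\,s_0}{\sqrt{N}} = \frac{2(4.0)}{\sqrt{16}} = 2.0 \qquad (2)$$

$$t = \frac{\text{Effect}}{\text{S.E.}_{\text{eff}}} \qquad (3)$$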

TABLE II. RAW DATA FROM DRAWING

Run   % Yield      Run   % Yield
 1      35          9      39
 2      36         10      36
 3      33         11      31
 4      33         12      40
 5      33         13      32
 6      36         14      26
 7      35         15      34
 8      32         16      29

x̄ = 33.75.   s₀ = 3.5.

TABLE III. CALCULATED EFFECTS OF VARIABLES

Variable   Effect    "t"     Confidence, %
D          -4.25     2.125   95
AD         -2.75     1.375   80
CD         -2.50     1.250   75
B           2.25     1.125   75
BC         -2.00     1.000
BCD         1.25     0.625
A           1.00     0.500
C           0.75     0.375
AC          0.75     0.375
ABCD        0.75     0.375
ABD         0.50     0.250
ACD         0.50     0.250
AB          0.25     0.125
BD          0.00     0.000
ABC         0.00     0.000

TABLE IV. ANOVA, ANALYSIS OF VARIANCE ON ONE-WAY CLASSIFICATION
(Columns: Source, Sum of Squares, D.F. Rows: S.S. between levels, S.S. within levels, S.S. total (corrected); R². Confidence for variable D: 99%.)


An interesting point is that the F ratio indicates a much higher degree of confidence than the 95% indicated by the "t" test using the known error in the designed experiment. This arises from the technique called for in the ANOVA model. In this model, total variance is partitioned and assigned to various causes. The residual assigned to variance within levels is used as an estimate of the error. In this illustration for the D variable, the total variance should have been used for the error estimate, since all observations came from the same population. Arbitrarily selecting the variance associated with the greatest effect leaves a within-levels residual too small for proper estimation of the error. This suggests that when independent estimation of error is absent, the standards for accepting results from ANOVA calculations need to be revised upward.
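The arithmetic behind Table IV can be reconstructed from quantities reported elsewhere in the article (the effect of -4.25 for D and the total pool of 45.75 in Table VII); assuming the usual one-way partition, it runs:

$$SS_{\text{between}} = \frac{N(\text{Effect}_D)^2}{4} = \frac{16\,(4.25)^2}{4} = 72.25 \quad (1\ \text{d.f.})$$

$$SS_{\text{total}} = 4\sum \text{Effect}^2 = 4(45.75) = 183.0 \quad (15\ \text{d.f.}), \qquad SS_{\text{within}} = 183.0 - 72.25 = 110.75 \quad (14\ \text{d.f.})$$

$$F = \frac{72.25}{110.75/14} = 9.13, \qquad R^2 = \frac{72.25}{183.0} = 39.5\%$$

These match the F ratio and per cent explained reported for variable D in Tables VI and VII, and an F of 9.13 on 1 and 14 degrees of freedom corresponds to better than 99% confidence.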

Stepwise Regression Analysis by Computer

Next, data from the 16 runs were submitted to the computer for calculations. A sophisticated, modern, stepwise regression analysis program was used. The results of the first six steps are shown in Table VI. The variables were selected in order of their importance. Variable D, selected in Step 1, is again seen to be the most significant. The values obtained are identical with those just presented from the ANOVA calculations. In the succeeding steps, the variables are selected in the same ranked order obtained from the calculation of effects. The F ratio remains relatively high while the total per cent explained builds up to 92.9. These are very impressive results. Unfortunately, the values have no meaning, since they arose by chance drawing from a single distribution.

There is a great temptation to accept results from a computer on pure faith. The results always look impressive when printed out on official paper. Unfortunately, it is not always obvious what calculations are being performed by the computer. The stepwise regression analysis programs are too complex to be checked by the normal researcher. In addition, the computer services are often provided by a separate department. A communications barrier can exist both because of distance and the use of specialized language. In the present case, it is clear that the F ratio for variable D used the same estimation of error as in the ANOVA technique. Indeed, if the ANOVA process had been extended to include additional variables, the F ratio and per cent explained values would have been identical to these stepwise regression results. This suggested that the computer calculations should be clarified.

Stepwise Regression Analysis by Hand

The factorial design described earlier was orthogonal. Therefore, it has the advantage that the computer results can be duplicated easily by hand calculations, as illustrated in Table VII for the six most important variables. To construct the table, the effect of each variable is squared.


TABLE VI. STEPWISE REGRESSION ANALYSIS BY COMPUTER

Step   Variable   F Ratio   Total % Explained
1      D          9.13      39.5
2      AD         4.88      56.0
3      CD         5.41      69.7
4      B          6.32      80.7
5      BC         8.31      89.5
6      BCD        4.33      92.9

The fourth column is a pool of the values in the preceding column, formed by starting with the very bottom, or fifteenth, variable. Each new effect squared is added to the pool as one goes upward in the table. The value of 4.8125 represents the sum of the squares of the effects up through and including variable BCD. The pool builds up to 45.7500 as variable D is added in the last step. The average pool is formed by dividing the value for the pool by the number of variables contributing to it at that level. The F ratio in the next column is evaluated by dividing the effect squared for each variable by the average pool for the step just below it. Thus, for variable D, the value of 18.0625, when divided by 1.9777, gives the F ratio of 9.13, and so on. The last column lists values for R², or total per cent explained. The first value of 39.5% is 100 times the effect squared of variable D divided by the total pool value, that is, 100 times 18.0625 over 45.7500. The contribution of each succeeding variable is 100 times its effect squared over the same pool total. R² increases as it sums the contribution of each succeeding step.

The orthogonality of the designed experiment used here has made it possible to hand calculate the values given by the computer for stepwise regression. It is thus possible to see the nature of the computer calculations and to relate them to the ANOVA and the factorial experiment illustrations shown earlier. In particular, the hand calculation illustrates the pooling technique used by the stepwise regression analysis program. As pointed out for the ANOVA calculations, this pool, or residual variance, technique for error estimation gave a very erroneous impression of the significance of the data. Thus, stepwise regression, using an internally generated estimation of error, is unable to warn of the insignificance of the data.
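A minimal Python sketch of this pooling arithmetic, working directly from the effects in Table III, reproduces the pool, F ratio, and cumulative per cent explained of Table VII (and hence the computer values in Table VI):

```python
# Reproduce the Table VII arithmetic from the published effects (Table III).
# The pool is built from the bottom of the ranking upward; each F ratio divides
# a variable's squared effect by the average pool of everything ranked below it.
effects = {
    "D": -4.25, "AD": -2.75, "CD": -2.50, "B": 2.25, "BC": -2.00,
    "BCD": 1.25, "A": 1.00, "C": 0.75, "AC": 0.75, "ABCD": 0.75,
    "ABD": 0.50, "ACD": 0.50, "AB": 0.25, "BD": 0.00, "ABC": 0.00,
}

ranked = sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
squares = [e * e for _, e in ranked]
total_pool = sum(squares)                      # 45.7500

for step, (name, effect) in enumerate(ranked[:6], start=1):
    pool = sum(squares[step - 1:])             # this effect plus all smaller ones
    below = squares[step:]                     # effects ranked below this step
    f_ratio = squares[step - 1] / (sum(below) / len(below))
    r2 = 100.0 * sum(squares[:step]) / total_pool
    print(f"Step {step}: {name:4s} effect {effect:5.2f}  pool {pool:8.4f}  "
          f"F {f_ratio:5.2f}  % explained {r2:5.1f}")
```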

Half-Normal Plots

Half-normal plots, as discussed by Daniel (1), have been employed in the interpretation of data in factorial experiments. The Y-axis of the plots is a probability function associated with the rank order of the variables. The size of the effect is on the X-axis of the plot. The nature of the probability function is such that results due to normally distributed errors fall on a straight line. Significant effects, however, are displaced toward the right from the line.

Figure 3. Half-normal plot of effects

Figure 3 shows a half-normal plot of the present data. The 15 variables are plotted in order of ranked size of the effects. The largest effect again is variable D, at the upper right portion of the plot. The solid line is a straight line drawn to fit the data of the first 10 points in the lower left portion of the plot. The five largest effects deviate considerably to the right of this line. The interpretation here is that the five largest effects fall off the line due to their significance and that they are real variables. Thus, this technique, as usually employed, is not able to warn of the true insignificance of the data. A further refinement may be used for cases in which the standard error of an effect is known. As previously discussed, the standard error of an effect is 2.0 for this set of data. The probability associated with one sigma unit is 0.68, which lies just below the position on the Y-axis for the fifth variable. This is shown by the solid black data point in the middle of the figure. The lower, dashed line was drawn through this point and the origin. It becomes a meaningful estimate of the values which are expected to occur due to chance. All the points are distributed fairly uniformly along this line. Thus, the five largest values appear to be entirely without significance. They are merely a part of the distribution expected with a standard error of this magnitude. Combining the half-normal plot with the known standard error emphasizes visually that one of the 15 effects may be twice the size of the standard error and yet still be due to random error only.

The possibility exists that there was something unusual or exceptional about the original set of numbers drawn from the sack. Therefore, the experiment was repeated four more times. The additional data were used to calculate the effects of the variables, were subjected to ANOVA calculations, and were submitted to the computer for stepwise regression. Similar results were obtained. Different variables were selected as most important, but again the significance as measured by the statistical tests was high. These procedures gave no warning that the data came from a single population and had no significance. The half-normal plot was much more satisfactory. In three of the new sets of data, the effects fit a straight line and thus were correctly identified as arising from a single population. The other required the combination of the known standard error of an effect with the half-normal plot to identify adequately the larger effects as being part of the random distribution.
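A minimal sketch of the dashed chance line, using the known standard error of 2.0 and one common half-normal plotting position (the positions used by Daniel differ slightly, so this is an approximation):

```python
from statistics import NormalDist

# Absolute effects from Table III, smallest to largest, and the known
# standard error of an effect that was built into the sack.
abs_effects = sorted([4.25, 2.75, 2.50, 2.25, 2.00, 1.25, 1.00, 0.75, 0.75,
                      0.75, 0.50, 0.50, 0.25, 0.00, 0.00])
se_effect = 2.0
m = len(abs_effects)
z = NormalDist()

for i, eff in enumerate(abs_effects, start=1):
    p = (i - 0.5) / m                            # plotting position for rank i
    chance = se_effect * z.inv_cdf((1 + p) / 2)  # half-normal quantile times S.E.
    print(f"P = {p:.3f}   observed |effect| = {eff:4.2f}   chance line = {chance:4.2f}")
```

For the largest rank the chance line gives about 4.3, essentially the 4.25 observed for variable D, which is the visual point made by the dashed line in Figure 3.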

Estimation of Error

Up to this point, the need for an independent estimate of the standard error has been emphasized. The illustrations have used the standard error known to have been built into the sack. It is now appropriate to examine some of the methods used to determine the error, other than pooling the lowest ranking variables. One common procedure is to assume that the effects of the higher order terms are a measure of the error. The design used here lends itself readily to using the four 3-factor and the one 4-factor terms. The calculation is simple: their effects are squared and averaged, and the square root is extracted to obtain an estimate of the standard error of an effect, as shown in Table VIII. The standard error was thus estimated in both the original trial and in each of the four subsequent trials. As expected, the estimates show a distribution about the true error in the sack. The first and fourth trials are seen to provide poorer estimates of the error.
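As a check, the interaction effects of Table III (ABC = 0.00, ABD = 0.50, ACD = 0.50, BCD = 1.25, ABCD = 0.75) give, for the first trial,

$$\text{S.E.}_{\text{eff}} = \sqrt{\frac{0.00^2 + 0.50^2 + 0.50^2 + 1.25^2 + 0.75^2}{5}} = \sqrt{0.525} = 0.72,$$

which is the first entry of Table VIII.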

TABLE VII. STEPWISE REGRESSION ANALYSIS BY HAND

Variable   Effect   Effect²   Pool      Av Pool   F Ratio   R²
D          -4.25    18.0625   45.7500   3.0500    9.13      39.5
AD         -2.75     7.5625   27.6875   1.9777    4.88      56.0
CD         -2.50     6.2500   20.1250   1.5481    5.41      69.7
B           2.25     5.0625   13.8750   1.1563    6.32      80.7
BC         -2.00     4.0000    8.8125   0.8011    8.31      89.5
BCD         1.25     1.5625    4.8125   0.4813    4.33      92.9


TABLE VIII. ESTIMATION OF ERROR FROM 5 INTERACTIONS

S.E.eff = [ Σ E² / 5 ]^(1/2),  E taken over ABC, ABD, ACD, BCD, and ABCD

Trial     S.E.eff
1         0.72
2         1.51
3         1.52
4         2.91
5         1.80
(Sack)    (2.00)

Some warning is needed. One should be cautious about deciding that the error has been adequately estimated from pooling only five interaction terms. A larger number are needed if a more accurate measurement is desired. However, a point to emphasize is that selection of the effects to be pooled should be made prior to conducting the runs. There is considerable bias introduced by using a pooling technique and selecting the terms after the runs have been completed. Justifying the choice because the results look reasonable or believable is not adequate.

Another common method used to estimate the error is replication of a number of the runs. Ordinarily, economic considerations prohibit replication of each run. The calculation of the standard deviation of an observation is shown in Table IX. The example is for the case where four runs were used for the replication. The difference between the original and the duplicate run is assumed to be due to the error. These differences are squared, summed, divided by (2 x 4), and the square root extracted. The standard error of an effect was described earlier; it is calculated as shown on the second line. The table shows estimates of the error obtained by several trials of four replicates each. Here again, some of the estimates are high and some are low compared with the known value in the sack. Replicating a larger number of runs would have defined the error more precisely. However, one has to weigh the cost of further runs to determine the error more accurately vs. the desire to try new runs to explore the variables under other conditions.
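The two calculation lines of Table IX presumably follow the description above: with d_i the difference between an original run and its duplicate, and the standard error of an effect taken as in Equation 2,

$$s = \sqrt{\frac{\sum_{i=1}^{4} d_i^{\,2}}{2 \times 4}}, \qquad \text{S.E.}_{\text{eff}} = \frac{2s}{\sqrt{16}} = \frac{s}{2}.$$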

Summary

There are several points to make in summary. First, the analysis of variance in which the pooling or residual variance method is used to estimate the error has been shown to give results which indicate greater than 99% confidence of significance when none exists. This was due to the manner in which the variance was partitioned, and apparently the pooling technique for error estimation is not adequate. Second, the "t" test gave results which corresponded to 95% confidence levels when using a known standard error.

TABLE IX. ESTIMATION OF ERROR FROM REPLICATION OF 4 RUNS

Trial     S.E.eff
1         0.85
2         1.73
3         1.56
4         2.13
(Sack)    (2.00)

However, this is not surprising, since the half-normal plot indicates that about one variable in 15 will be large enough to be considered significant at this level. Third, the stepwise regression techniques were shown to suffer the same shortcomings in the estimation of error as the ANOVA technique because of their similarity in using a residual variance pool. The pitfalls of the stepwise regression technique were illustrated in conjunction with an orthogonally designed experiment. The method of internal pooling of variance for estimating the error leads to even higher risk when applied to nonorthogonal data. Because of this, the practice of submitting plant-operating data to the computer for stepwise regression analysis is very subject to misinterpretation as showing significant effects when in fact there are none. Fourth, an independent standard error estimation, in combination with a half-normal plot, was shown to be a powerful statistical tool for judging the significance of results. Last, the procedure used here is recommended as a statistical tool in its own right. When there is doubt about the significance of a set of results, use a sack of numbers from a single population as a control test. As will sometimes happen, the random numbers will do better. In that case, the conclusion is obvious that the whole project must be taken back to the bench for further work.

The authors recognize that most of these concepts are inherent in the properties of the normal distribution. However, the actual drawing of the numbers was a stimulating experience and added objectivity to the interpretation of the data. This is a highly recommended procedure for checking the validity of results obtained from plant and laboratory experiments. It defines the magnitude of the significance which can be obtained by chance in the system under study and establishes a standard which must be met before the results are accepted as valid.

BIBLIOGRAPHY

(1) Daniel, C., Technometrics, 1, 311-41 (1959).
(2) Davies, O. L., "Statistical Methods in Research and Production," Hafner Publishing Co., New York, N.Y., 1957.
(3) Mayer, R. P., and Stowe, R. A., IND. ENG. CHEM., 61 (5), 42-6 (1969).
(4) Wilson, E. B., Jr., "An Introduction to Scientific Research," McGraw-Hill, New York, N.Y., 1952.