WOULD YOU BELIEVE 99.9969% EXPLAINED?


INDUSTRIAL AND ENGINEERING CHEMISTRY, VOL. 61, NO. 5, MAY 1969

RAYMOND P. MAYER
ROBERT A. STOWE

A case history of an actual production quality problem reveals that the use of certain commonly accepted statistical techniques may lead to erroneously good correlations of plant data. The authors suggest how to avoid the trap.

AUTHORS
Raymond P. Mayer and Robert A. Stowe are with The Dow Chemical Co.'s Research and Development Laboratory in Ludington, Mich. 49431.

Would you believe 99.9969% explained? If you were told some data had been submitted to a modern computer for regression calculations and that these were the bona fide results, would that help? Also, if you knew that the program had been debugged and the results checked for accuracy, does it begin to sound plausible?

We live in an age where the computer is used regularly. Data are fed in and results are printed back almost before we turn around. Complex calculations are performed routinely by standard programs. It must be admitted that the capability of computers has been magnified to an extreme by science fiction. Indeed, one story concerns a computer with enough logic, random choice, and probability circuits that it developed a personality and became almost human. However, the normal individual accepts the computer as a complex but useful part of our world.

Research, technical, and quality control personnel bump into the computer on many occasions. It is used to provide a variety of services, ranging from storing information, to making calculations, to issuing our paychecks. While most everyone has heard the story of the janitor being given the vice president's salary, reliability has increased and routine matters are handled accurately with no difficulty.

The computer has enlarged our capability to perform complex mathematical calculations. Thus, mathematical tools are available for routine use which, prior to the advent of the computer, would never have been used by the average quality control specialist. Among the fairly new techniques is stepwise multiple regression analysis. However, at times our use of this tool may exceed our understanding of its complexities. We would like to describe a case history which illustrates some of the pitfalls of this analysis and points out some factors which help to assure its appropriate use.

Case History

Recently, we were involved in a production quality problem. A customer was experiencing different and unpredictable service life with each batch of our regular production material. Gradually, as service life data accumulated, the customer began to question the quality of the batches. This in turn activated a considerable program within our organization. Perhaps the circumstances are common enough that you will recognize various aspects of the situation.

Representatives of the different groups responsible were called together to consult on the problem. These included production superintendents, quality control managers, statisticians, research people, chemical process engineers, analytical specialists, and scientists from our computations laboratory. The aim was to find out what was wrong with the batches and to correct it quickly.

The service life of each batch had already been tabulated. Routine checks of the production, analytical, and quality control data revealed no obvious reasons for the lack of quality. Therefore, the various groups involved in this effort agreed to assemble all the available data and submit it to the computer for analysis. Our production plant supplied greater details on the conditions used in the preparation of each batch. The specification analysis data obtained prior to shipment were re-examined. Additional analytical tests were made on retained samples from each batch.

A list of 16 variables was finally agreed upon. The list was big enough to include both unanimous choices and individual favorites. All concerned agreed that no stone should be left unturned, and that nothing should be overlooked. Since there were 22 batches involved, this represented a great mass of data. However, the computations lab reported that this was no problem. Indeed, they suggested that the variables could also be checked for correlations through use of power terms, logarithms, and reciprocals, if this were desired.

The computations laboratory has a routine program designed to do least-squares fitting of data. It allows the computer to calculate the coefficients associated with each variable and to give an estimation of the spread of the data through the correlation coefficient. Furthermore, this program readily allows combining of variables to give the combination which best fits the data. The variables are picked in order of importance. Each step in the calculation is subjected to a confidence test. The computer shuts itself down if the next step does not meet an F test ratio of at least two. The final equation is a mathematical description of the correlation which exists between the dependent variable and the significant independent variables.

Table I shows a portion of the coded data submitted to the computer. There were 22 batches and 16 independent variables in the complete table. The second column lists a wide range of service life values. The next columns list the values for the independent variables, $X_1$ to $X_{16}$. As mentioned earlier, there were no obvious correlations present in the data. These values were submitted to the computer, and in due time the results were returned.

TABLE I. RAW DATA

                              Independent Variables
Batch   Service Life, Y   $X_1$   $X_2$   $X_3$   ...   $X_{16}$
  1          377            6       1       1     ...      8
  2          586            4      10       2     ...      3
  3          645            2       6       1     ...      6
  4          238            9       4       5     ...      5
  5          639            6       2       8     ...      9
  6          345            8       1       6     ...      3
 ...
 22          350            3      10       8     ...      1

TABLE II. TERMS FOR EQUATION

Term                                Number
$X_1$ to $X_{16}$                     16
$X^{-1}$, inverse                     16
$X^2$, square                         16
$X^{-2}$, inverse of square           16
Total                                 64

When a plot was made between a single variable such as $X_3$ and the service life, the result appeared as in Figure 1. The straight line shows the least-squares fit to the data, where the slope is $B_3$ and the intercept is $B_0$. The correlation coefficient, r, indicates the spread of data about the straight line. If all the points are on the line, r = 1. If they form a completely random or buckshot pattern, r = 0. The deviations of the points from the straight line are attributed to a combination of experimental error and the influence of other variables. If a plot were made of the other variables, similar results would be achieved.

Equation 1 indicates the way the dependent variable, Y, the independent variables, X, and the various coefficients fit together in a linear regression equation. The equation is expanded as necessary to include all the terms used in the calculation. Here, the $B_0$ term is the combination of the intercepts from each individual plot of the data.

$$Y = B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3 + \cdots + B_{16} X_{16} \qquad (1)$$

The correlation coefficient, r, is calculated by Equation 2. Equation 3 shows the relationship for the slope, B, associated with a given variable, while Equation 4 illustrates a use of the correlation coefficient. These are standard equations and are shown only to give an idea of what the computer is doing. Notice that there is a relationship between the slope, B, and the correlation coefficient, r. This makes it possible to discuss the results in terms of either B or r later.
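As a rough illustration of how the slope, the correlation coefficient, and the "% Explained" figure relate to one another, the single-variable case can be sketched in Python. This is a minimal sketch under the usual textbook least-squares definitions, which may differ in notation from the article's Equations 2 and 3. The function name `simple_regression` and the returned keys are hypothetical. Only the six batches visible in Table I are used, so the numbers it prints are illustrative and will not match the 22-batch results.

```python
from math import sqrt

def simple_regression(x, y):
    """Fit y = b0 + b*x by least squares; report r and % explained.

    Hypothetical helper using standard textbook formulas.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)   # spread of x about its mean
    syy = sum((b - mean_y) ** 2 for b in y)   # spread of y about its mean
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / sqrt(sxx * syy)                 # correlation coefficient
    slope = sxy / sxx                         # B; algebraically r * sqrt(syy / sxx)
    return {
        "intercept": mean_y - slope * mean_x,
        "slope": slope,
        "r": r,
        "percent_explained": 100 * r * r,     # the "% Explained" of Equation 4
        "sd_ratio": sqrt(syy / sxx),
    }

# The six batches visible in Table I: X1 and service life Y.
x1 = [6, 4, 2, 9, 6, 8]
service_life = [377, 586, 645, 238, 639, 345]

fit = simple_regression(x1, service_life)
print(fit)
```

Because the slope equals r multiplied by the ratio of the spreads of Y and X, a statement about B for a given variable can be restated as a statement about r, which is the link the text relies on.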

$$\%\ \text{Explained} = 100\, r^2 \qquad (4)$$

Figure 1. Plot of service life, Y, for each of 22 batches vs. independent variable $X_3$

Table II shows the terms selected by group decision for the regression analysis. The group had a desire to make the best use of the data and to obtain the best fit. The computer was programmed to compute and use the inverse of each variable as well as the square and inverse of the square, for a total of 64 terms. While the computer had the ability to handle additional functions, such as logarithms and cross-term products, the group felt that these could be left for possible calculation at a future date.

Table III summarizes the first five steps of the regression program. On the first step, $X_7^2$ was selected, as its correlation coefficient was the largest, -0.46. The negative sign implies that service life decreases as this variable increases. The square of variable 7 was selected as being better than the variable itself. As listed in the second column, this explained 21.2% of the variation in service life. There is an F level of 5.2 between the variance due to this variable and the residual variance. This is above the F value selected for adequate confidence and so the computer proceeded to the next step.

Step 2 showed that variable 11, as the inverse of its square, was next in importance with an F ratio of 5.6. This brought the per cent explained up to a total of 32.4. Similarly, other variables were selected to bring the total to "70% Explained" by Step 5. The F value fluctuated slightly as the size of the residual variance decreased, but the F ratio was sufficiently high at each step to proceed. Notice the rather marked change in the correlation coefficient for $X_{11}^{-2}$ at Step 5. It goes from -0.43 in Step 4 to -0.80 as the fifth term is added. This shift occurs as the equation adjusts to maintain the best fit to the data.

The further one goes, the better the results get, as listed in Table IV. The total explained reaches 80, 90, and 99%. The total increases more slowly at each step as the additional variables become less important. However, the F level is maintained throughout.

Part of the stepwise program is a critical look at variables previously selected for the regression equation. As the contribution of each variable is adjusted to give the best fit, it sometimes happens that one will be reduced to the point where it is no longer significant in the equation.

TABLE III. STEPWISE REGRESSION

                                         Correlation Coefficient, r
Step   % Explained   F Level   $X_7^2$   $X_{11}^{-2}$   Term 3   Term 4   Term 5
 1        21.2         5.2      -0.46
 2        32.4         5.6      -0.49       -0.43
 3        48.8         7.1      -0.47       -0.47         0.42
 4        57.3         4.6      -0.42       -0.42         0.43     0.31
 5        70.0         8.2      -0.49       -0.80         0.43     0.36     0.52
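The selection procedure summarized in Table III can be sketched in Python as follows. This is a minimal sketch of forward selection with a conventional partial F test, not the computations laboratory's actual program, whose F calculation may differ in detail. The names `forward_stepwise`, `dot`, and `center` are hypothetical, the small data set is made up for illustration, and the re-examination and removal of terms already in the equation is left out.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def forward_stepwise(terms, y, f_to_enter=2.0):
    """Forward selection with a partial F test (hypothetical helper).

    terms: dict of term name -> list of values, one per batch.
    Returns [(name, percent_explained, f_ratio), ...] in order of entry.
    """
    n = len(y)
    resid = center(y)                        # residuals of the intercept-only fit
    total_ss = dot(resid, resid)
    # Candidates are kept orthogonal to every term already in the equation.
    pool = {name: center(v) for name, v in terms.items()}
    scale = {name: dot(z, z) for name, z in pool.items()}
    steps = []
    while pool:
        df_resid = n - len(steps) - 2        # residual d.f. once one more term enters
        if df_resid < 1:
            break
        gains = {}                           # sum of squares each candidate would remove
        for name, z in pool.items():
            zz = dot(z, z)
            if zz > 1e-10 * scale[name]:     # skip terms the equation already reproduces
                gains[name] = dot(z, resid) ** 2 / zz
        if not gains:
            break
        best = max(gains, key=gains.get)
        rss_new = max(dot(resid, resid) - gains[best], 0.0)
        f_ratio = gains[best] / max(rss_new / df_resid, 1e-300)
        if f_ratio < f_to_enter:
            break                            # the computer "shuts itself down"
        z = pool.pop(best)
        zz = dot(z, z)
        coef = dot(z, resid) / zz
        resid = [e - coef * a for e, a in zip(resid, z)]
        for name, w in pool.items():         # adjust remaining candidates for the new term
            c = dot(z, w) / zz
            pool[name] = [b - c * a for a, b in zip(z, w)]
        steps.append((best, 100 * (1 - dot(resid, resid) / total_ss), f_ratio))
    return steps

# Made-up illustration: y depends on x1; x2 is unrelated.
x1 = list(range(1, 13))
x2 = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1, 0.2, -0.2]
y = [2 * a + e for a, e in zip(x1, noise)]

steps = forward_stepwise({"x1": x1, "x2": x2, "x1_sq": [a * a for a in x1]}, y)
for name, pct, f in steps:
    print(name, round(pct, 2), round(f, 1))
```

The sketch re-expresses every remaining candidate relative to the terms already chosen, which is why the apparent contribution of a term can shift from step to step, as the correlation coefficients in Table III do.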

One term had been added at Step 11. However, at Step 17, its F level had dropped to 1.7. The computer had instructions to reject variables below F equals 2. Therefore, it was dropped. This reduced the "% Explained" slightly, to 99.74 at Step 17, but restored the statistical confidence in the equation.

The computer then continued through the final steps in the computation, with the equation building up to the final result of 99.9969% Explained, as shown in Table V. The results are printed out and the computer rests. The question is asked again, "Would you believe 99.9969% Explained?"

It would seem now that our group should have been happy. We had an equation which explained 99.9969% of the variation in service life in terms of the production variables. However, there were some doubters in the group. And, actually, there are several factors which might contribute some uncertainty and cause one to question the results.

TABLE IV. FURTHER STEPS

Step   % Explained   F Level
  6       78.5          7.3
  7       85.2          7.8
  8       90.6          9.1
  9       94.7         11.0
 10       96.5          7.1
 11       97.3          4.3
 12       98.0          4.2
 13       98.5          4.3
 14       99.31        10.2
 15       99.57         5.3
 16       99.78         6.8

TABLE V. FINAL STEPS

Step   % Explained   F Level
 18       99.86         6.5
 19       99.938        7.0
 20       99.986       14.4
 21       99.9969      11.8

Contributing Factors

Few data sets (22). First, there were relatively few data sets or batches. One might wish for a greater number before accepting the calculated results.

Many independent variables (64). In this work, we had a total of 64 terms, arising from the original 16 variables plus their powers. This is a great number to choose among. By sheer luck, one might find a few that correlate. Using so many independent variables is especially dangerous when it is remembered that there were 22 batches and that only 21 terms are required to give an exact fit to the entire data.

Before recommending extensive plant revisions to improve the product quality, the validity of the equation somehow had to be checked. While the computer had made no errors, what if we had erred in presenting the basic data? Could there be a number of variables we had overlooked, or could the customer possibly have contributed to the short life of some batches? What were the chances that the variables used in the study had nothing to do with the problem?

These are difficult questions to answer. While the textbooks recognize the problem, there are no clear-cut guidelines or rules to follow. One does not suddenly announce that the equations are faulty or that the data are not pertinent to the problem. Indeed, how does one identify and prove such a statement?

This method was suggested: Suppose that one drew numbers out of a hat for the plant conditions and analytical values. Suppose the computer were allowed to correlate these with the service life. Surely the plant data should be better than these random numbers. Because no one could predict how such numbers would behave, numbers were drawn at random. The numbers 1 to 10 were used, with each number replaced before the next drawing. By use of this technique, a value was placed in the table for all 16 variables in each of the 22 batches. A few small side bets were placed on the outcome and the information was submitted to the computer.

The computer squared the sums and summed the squares and printed out the results, which proved to have an unbelievably high correlation to the values for the service life of the batches. The stepwise regression results presented above were actually the results obtained for the random numbers. Whereas the actual production data originally explained about 75 to 80% of the variation in service life, the random numbers showed 99.9969% Explained!

Of course, with results like that out of a hat, it was no longer reasonable to make extensive plant revisions.

After the dust settled, it became necessary to explain what happened. What had gone wrong to cause such results?
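The remark that only 21 terms are required to fit 22 batches exactly can be checked with numbers out of a hat. The Python sketch below is a simplified stand-in for the experiment described above, not a reproduction of it: the service-life values are themselves random stand-ins, the columns are added in a fixed order with no selection at all, and the name `percent_explained_curve` and the seed are arbitrary choices made for the illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def percent_explained_curve(columns, y):
    """% of the variation in y explained as columns enter one at a time.

    Hypothetical helper: least squares via Gram-Schmidt orthogonalization,
    with an intercept and no selection of columns.
    """
    resid = center(y)
    total_ss = dot(resid, resid)
    basis = []                                # orthonormal directions already used
    curve = []
    for col in columns:
        z = center(col)
        for q in basis:                       # strip what earlier columns already cover
            c = dot(q, z)
            z = [a - c * b for a, b in zip(z, q)]
        norm = dot(z, z) ** 0.5
        if norm > 1e-9:                       # ignore a column that adds nothing new
            q = [a / norm for a in z]
            basis.append(q)
            c = dot(q, resid)
            resid = [e - c * a for e, a in zip(resid, q)]
        curve.append(100 * (1 - dot(resid, resid) / total_ss))
    return curve

rng = random.Random(1969)                     # arbitrary seed
n_batches = 22
service_life = [rng.randint(200, 700) for _ in range(n_batches)]   # stand-in values
hat = [[rng.randint(1, 10) for _ in range(n_batches)] for _ in range(21)]

curve = percent_explained_curve(hat, service_life)
for k in (1, 5, 10, 15, 20, 21):
    print(k, "terms:", round(curve[k - 1], 4), "% explained")
```

With an intercept and 21 linearly independent terms there are as many adjustable constants as batches, so the fit is exact whatever the numbers mean. Letting the program choose the best 21 of 64 candidate terms only makes the climb toward 100% faster.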
More important, what could be done to safeguard against future problems of this type with the computer? Can computers be used for legitimate calculations when random numbers look so impressive?

We have examined the details of this experiment and have selected what we consider to be some of the signals which warn of spurious results. In addition to the two items which have already been discussed, there are five other factors which contribute to making such results possible.

Correlation among independent variables, max r = -0.61. The term independent variable is often used loosely. Data are fed into the calculations and called independent because they are used to correlate the dependent variable. Correlation coefficients as high as -0.61 were found between variables that had been labeled as independent. The correlation between each variable and its powers was even higher. This is completely contrary to recommended procedure for designed experiments. Ideal designs call for zero correlation between independent variables, i.e., the design should be orthogonal (1). Here, the analysis suffers because the coefficients for each variable in the multiple regression equation are not the same as if they had been determined separately. The correlation between independent variables makes the calculation change with each step in the analysis; there is no single value for each variable. One sees only adjusted values. Orthogonal design would have removed this difficulty.

Many small correlations to dependent variable, max r = 0.46; 100 r² = 21%. One noticeable feature of this random data was that many correlations existed with the dependent variable. However, they were all relatively small. The largest value for r was 0.46, or 100 r² equals 21% Explained. Note that this value is smaller than that observed between pairs of independent variables. When correlations this large exist between independent variables by random chance alone, there must be an even greater correlation with the dependent variable for one to maintain confidence. The program should have been rejected at Step 1, when the initial correlation was so small. The serious error arose when so many small correlations were put together to get the one big correlation which totaled 99.9969% Explained.
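How large a correlation chance alone can produce among 64 candidate terms and 22 batches can be estimated by repeating the draw from the hat many times. The Python sketch below is illustrative only: the service-life values are random stand-ins, the 200 repetitions and the seed are arbitrary, and the names `corr` and `largest_abs_r` are hypothetical. It also computes the correlation between the numbers 1 to 10 and their squares, to show why a variable and its powers cannot be treated as independent.

```python
import random
from math import sqrt

def corr(x, y):
    """Ordinary correlation coefficient r between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# The four forms of each variable listed in Table II.
TRANSFORMS = (
    lambda v: v,
    lambda v: 1 / v,
    lambda v: v * v,
    lambda v: 1 / (v * v),
)

def largest_abs_r(rng, n_batches=22, n_vars=16):
    """Largest |r| between random y and any of the 64 random candidate terms."""
    y = [rng.randint(200, 700) for _ in range(n_batches)]       # stand-in service lives
    best = 0.0
    for _ in range(n_vars):
        x = [rng.randint(1, 10) for _ in range(n_batches)]      # numbers out of a hat
        for f in TRANSFORMS:
            best = max(best, abs(corr([f(v) for v in x], y)))
    return best

rng = random.Random(1969)                                        # arbitrary seed
trials = sorted(largest_abs_r(rng) for _ in range(200))
median_max_r = trials[len(trials) // 2]

digits = list(range(1, 11))
r_with_square = corr(digits, [v * v for v in digits])

print("median of the largest |r| found by chance:", round(median_max_r, 2))
print("r between X and X squared for X = 1..10:", round(r_with_square, 3))
```

If the largest correlation in the plant data is no bigger than what this kind of trial routinely turns up, the first step of the regression has given no reason to continue.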
Adjusting to "discover hidden effects." One so-called benefit of computers is their ability to discover hidden effects among intercorrelated data. This occurs through the process of adjusting to minimize the sum of squares fit to the data. While this is not wrong mathematically, it certainly is risky when dealing with data which are known to be random, as in this example, or are being tested for randomness. Thus, in this case, variable $X_{15}$, which had r = 0.07 or only 0.5% relationship to the customer's problem in the original analysis, was selected in Step 8 and identified as a hidden effect contributing an additional 5.4% to the explanation. When dealing with plant data, one must resist this temptation to explore and should shut off the computer when it begins to select such worthless variables. A related point is that other variables which start out similarly small may be adjusted upward, may lack a few percentage points of being "discovered," and will be subsequently adjusted downward, never to be seen again. Both types of hidden effects are better forgotten. Another related phenomenon is the dropping of variables selected in a previous step. Either a variable is or isn't important to the customer's problem. If the data aren't sufficient to decide, then better data should be obtained.

Empirical correlation for "best fit." The computer selects the variables which give the best fit between the data and the empirical equation selected as a model. While this has the advantage of helping to avoid overlooking valuable information, it "proves" and "explains" nothing. Thus, random data are forced into an empirical equation and made to look legitimate. Such treatment only gives clues to pursue in further studies. A better approach is to specify the physical laws governing the problem, select the appropriate equations expected, and test the data for fit to these prespecified equations.

Pooling the variance or residual sum of squares vs. replication. The final contributing factor is that the only statistical check on the validity of the correlations is through comparison with the residual sum of squares. This "pooling the variance" has its adherents and statistical background. It is not a substitute for replication. The present data have no replication and, hence, no independent estimation of the variance or error. Split batches sent to the customer would have provided an error estimate that would check his ability to influence the service life of the product. Duplicate runs in the plant would have given an estimate of the plant's ability to reproduce conditions. Only variables that have a bigger influence than these sources would be significant. Note also that the F ratio dropped to as low as 4. This corresponds to a 90 to 95% confidence level for the majority of the steps. The 99.9969% is thus built on a series of small contributions which meet only this smaller hurdle.

Conclusion

It is suggested that this case history was not unique. Group consideration and action on important technical problems occur more and more frequently as our technology becomes more specialized. It is felt that the decisions made by this group regarding the analysis of the data were fairly typical. Pressures to use the latest mathematical tools in conjunction with complex computer programs are often great. Response to this pressure is evident in the number of training programs available to people engaged in quality control work. Such people, however, are in a very real dilemma: They have the need to learn and to use these modern techniques, while at the same time, they have the responsibility to see that these techniques are not abused. This responsibility is often extremely difficult to meet in a practical way, that is, without resort to statistical jargon.

It is suggested that the drawing of numbers out of a hat is a useful technique. The results may not always be as striking as those presented here, but they will illustrate how the program handles random data. Plant data should be substantially different and better before one can say with confidence, "Yes, I believe."

LITERATURE CITED

(1) Degray, R. J., "Keeping It On The Square," Ind. Eng. Chem., 58 (7), 57-60 (1966).