Theoretical Backgrounds of the Statistical Methods

Underlying Probability Model Used in Making a Statistical Inference

Frederick Mosteller
Harvard University, Cambridge 38, Mass.

This paper was written to explain some important statistical ideas which can help engineering and industrial chemists with everyday problems. The result is a discussion of the notion of testing hypotheses, including such concepts as null hypothesis, significance level, power, and regression. The ideas presented are common to a large body of statistical theory, and if they are understood the reader will be better equipped both for analyzing the kinds of problems discussed here and for assimilating the statistical theory necessary for tackling more complicated problems of statistical analysis.

IN THE preceding paper Scheffé (3) has given an account of the basic probability concepts used to construct the mathematical models for statistical inference. He has provided an application of these concepts to the important problem of the control of quality of a manufacturing process. The present paper will consider further models and their applications to statistical problems.

MODELS FOR A TEST OF DIFFERENCES

It is difficult to analyze data without a model in mind. Suppose one became interested in the question of whether or not the population from which the following data were drawn could reasonably have an average of 4.50:

4.48  3.41  6.07  4.21
4.38  4.02  2.60  4.75
3.45  4.48  3.41  3.66
3.95  1.96  3.60  3.94
4.84  4.11  2.64  5.27

It can, of course, be observed that the actual mean of the data is 79.23/20 or 3.961, but whether this is very far from 4.50 or not is impossible to tell from this one fact alone. Only four observations (6.07, 4.75, 4.84, 5.27) out of the 20 are greater than 4.50, but similarly it cannot be told whether this is an unusual situation or not.

The statistician's approach to this problem is to assume that the original statement of the situation is true for purposes of computation, that is, that 4.50 is the true mean. He further assumes a model. In practice this model is usually chosen as reasonable on the basis of extensive previous experience with similar data. The assumption that the true mean is 4.50, plus the assumption of the model, constitutes what is called the null hypothesis. Against the null hypothesis is set up the alternative hypothesis. In the present case the alternative hypothesis is that the true mean is not 4.50 (and that the original model is true). The model is set up in such a way that one can compute the probability of getting results as bad as or worse than those observed if the assumptions (the null hypothesis) are true. In connection with specific tests the models assumed will be indicated. Without some model or belief as a guide in the situation no forward progress can be made. In special problems something may be assumed about the distribution function, such as symmetry or normality, as part of the model.

A credibility level, or significance level, is also set to help make the decision whether the null hypothesis is true or not. It is common practice to set this level at 0.05 or 0.01. In other words, if deviations are observed in a sample which can occur less than once in twenty trials when the null hypothesis is true, the null hypothesis is rejected. This discussion describes a statistical test of significance. The test involves a null hypothesis (including a model), an alternative hypothesis, a test for which probabilities can be computed, and the application of the test to the data.

As a very simple model it can be assumed that observations are just as likely to fall above the true mean as they are to fall below. The true mean was assumed to be 4.50. As a test, the probability of getting four or fewer observations on either side of the mean will be computed. The probability of getting four or fewer observations above the mean is given by five terms of the binomial expansion

$$\left(\frac{1}{2}\right)^{20}\left[\frac{(20)(19)(18)(17)}{(4)(3)(2)(1)} + \frac{(20)(19)(18)}{(3)(2)(1)} + \frac{(20)(19)}{(2)(1)} + \frac{20}{1} + 1\right]$$

$$= \frac{4845 + 1140 + 190 + 20 + 1}{1{,}048{,}576} = \frac{6196}{1{,}048{,}576} = 0.00591$$

This computation gives the probability of obtaining four or fewer observations above the mean; because of symmetry, the probability of getting four or fewer below the mean is also 0.00591. Adding gives the final probability P = 0.0118. If 0.05 has been chosen as the significance level and the sample P observed is less than 0.05, then there is a choice of two decisions: either this is a very unusual sample or the null hypothesis is false. Unless there is good reason to believe this sample unusual, the null hypothesis is rejected. Since the null hypothesis was concerned with both the model and the amount of departure from the assumed mean of 4.50, it has to be decided on which count to reject the null hypothesis. If previous evidence with data of this kind has been that the distributions are symmetric about the mean, the notion that the mean is 4.50 is then rejected.

The test just discussed is called the sign test. It has numerous applications in industrial problems. A useful reference about the sign test is that of Dixon and Mood (2).
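The binomial arithmetic above is easy to verify by machine. The following is a minimal Python sketch, illustrative only (the function name is ours, not the paper's); it reproduces the one-tail probability 0.00591 and the two-sided P = 0.0118:

```python
from math import comb

def sign_test_p(k: int, n: int) -> float:
    """Two-sided sign test P-value: the chance of k or fewer observations
    on either side of the hypothesized mean when each side is equally likely."""
    one_tail = sum(comb(n, j) for j in range(k + 1)) / 2 ** n  # = 6196/1,048,576
    return 2 * one_tail  # the distribution is symmetric, so the two tails are equal

# Four of the twenty observations exceeded the hypothesized mean of 4.50.
print(sign_test_p(4, 20))  # about 0.0118
```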

Other information from past data might have been available. For example, it might have been known that the distribution was approximately normal, and a good estimate of the standard deviation might be at hand. In the present case the distribution was known to be normal with standard deviation unity. In such a case a more powerful test can be made. In general, the more information available, the more powerful is the test that can be made.

The new null hypothesis is that the mean is 4.50 and that the data are normally distributed with standard deviation equal to unity. The alternative is that the mean is not 4.50. The standard deviation of the mean is the standard deviation of the original population divided by the square root of the sample size:

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

In this case $\sigma = 1$, $n = 20$, and $\sigma_{\bar{x}} = 1/\sqrt{20} = 0.224$. Now the distribution of sample means from a normal distribution is again normal, and if the sample mean less the true mean is taken and this departure is measured in terms of the standard deviation of the sample mean, the resulting quantity is normally distributed with zero mean and standard deviation of unity. This latter quantity, sometimes called the critical ratio, is

$$t = \frac{\bar{x} - m}{\sigma_{\bar{x}}} = \frac{\bar{x} - m}{\sigma/\sqrt{n}}$$

where $\bar{x}$ is the observed mean and $m$ is the true mean. If this quantity is too far away from zero, the null hypothesis is rejected. Roughly, when the null hypothesis is true, $t$ should be between $+1$ and $-1$ in 68% of all samples, and between $+2$ and $-2$ in 95% of all samples. For the present example

$$t = \frac{3.961 - 4.500}{0.224} = \frac{-0.539}{0.224} = -2.41$$

Since this value of $t$, $-2.41$, is outside the range $+2$ to $-2$, the null hypothesis is rejected at the 5% level. A table of the normal distribution shows that the probability of obtaining a value of $t$ this far from zero or farther in either direction is 0.016. This result agrees fairly well with that obtained from the sign test. When the assumptions of the t-test are fulfilled, it is on the average somewhat more sensitive to deviations from the null hypothesis than is the sign test. However, in some specific examples, including the present one, it is not more sensitive. The measure of sensitivity is called the power of the test.
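The critical ratio and its two-sided probability can be checked in a few lines. A minimal Python sketch using only the standard library (the function names are illustrative):

```python
from math import erf, sqrt

def critical_ratio(xbar: float, m: float, sigma: float, n: int) -> float:
    """t = (observed mean - hypothesized mean) / (sigma / sqrt(n))."""
    return (xbar - m) / (sigma / sqrt(n))

def two_sided_p(t: float) -> float:
    """P(|Z| >= |t|) for a standard normal Z, via the error function."""
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(t) / sqrt(2.0))))

t = critical_ratio(3.961, 4.500, 1.0, 20)
print(round(t, 2), round(two_sided_p(t), 3))  # -2.41 and 0.016, as in the text
```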

POWER CURVES AND OPERATING CHARACTERISTIC CURVES

How is the power of a test measured? Ideally, of course, the test would always reject the null hypothesis when it was not true, and accept it when true. Such a requirement imposes the necessity of an infinite sample size! This is ridiculous. The only way to proceed is to give a hostage to fortune. Users of tests must be willing to be wrong a certain portion of the time before any tests can be constructed. There are two kinds of errors possible:

I. Reject the null hypothesis when it is true
II. Accept the null hypothesis when it is false

These errors were named producer and consumer errors by the industrial statisticians who first noticed them, and more prosaically Type I and Type II errors by the mathematical statisticians on their rediscovery. Associated with these errors there are risks, probabilities of making each kind of error. There is only one probability value associated with the first kind of error, while there are many probabilities associated with the second kind of error. The significance level chosen is the probability of making the first kind of error. Having fixed the significance level, the situation is examined to see what the probabilities are of making the second kind of error. In Figure 1, two power curves are shown, one for each of two sample sizes. As the sample size increases, the probability of rejecting the null hypothesis when it is false increases, which means the risks associated with the second kind of error are smaller. The probability of rejecting the null hypothesis when it is true remains constant at 0.05.

[Figure 1. Power Curves for Normal Distribution Testing Observed Mean against a Standard Assumed Mean When Standard Deviation Is Known. Abscissa: difference between true and assumed mean in standard deviation units of the population.]

Curves like those in Figure 1 are called power curves. Some people prefer to look at the probability of accepting the null hypothesis. This can be done by inverting the figure. When the power curve is inverted, the result is the operating characteristic curve. The information is exactly the same; there happen to be separate names because the concepts arose from two different sources.

The power curve is particularly useful when there is some notion of the magnitude of the effect expected in an experiment, and some notion of the variability in the experimental variable. With these notions in hand it can be decided what sample size is needed to be pretty sure of discovering the effect in question. Often real effects are masked by measurement and system variabilities. When too small an experiment is designed, one is unlikely to find an effect. Power curves are useful in designing studies, particularly with respect to their magnitude.
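Curves like those in Figure 1 come from a direct computation: for a two-sided test at the 0.05 level with known $\sigma$, the power at a true difference of $\delta$ standard deviations is $\Phi(-1.96 + \delta\sqrt{n}) + \Phi(-1.96 - \delta\sqrt{n})$. A minimal Python sketch follows; the sample sizes 10 and 20 are illustrative, since the sizes used in the original figure are not recoverable here:

```python
from math import erf, sqrt

def phi(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power(delta: float, n: int, z_crit: float = 1.96) -> float:
    """Probability of rejecting the null hypothesis when the true mean
    differs from the assumed mean by delta population standard deviations."""
    shift = delta * sqrt(n)
    return phi(-z_crit + shift) + phi(-z_crit - shift)

# At delta = 0 the power equals the significance level, 0.05;
# for any fixed delta > 0 it grows with the sample size.
for n in (10, 20):
    print(n, [round(power(d, n), 3) for d in (0.0, 0.25, 0.5, 1.0)])
```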

REGRESSION IN TWO DIMENSIONS

The regression problem is essentially one of fitting curves to data. Consider two variables $x$ and $y$. Corresponding to each $x$ there is assumed to be a distribution of $y$ values. For example, if adult males weighing 150 pounds in the northeastern United States are considered, they will not all have the same height.

[Figure 2. Regression Curve of Means of y on x. Prototype distributions of y given for four fixed values of x.]

In Figure 2 is indicated a set of distributions of $y$ for four different values of $x$. The distributions have been laid on their sides for clarity in a two-dimensional picture. Each distribution of $y$ has a mean value, $m_y$, which is unknown but for which the mean of the observations, $\bar{y}$, is an unbiased estimate. The curve (if it exists) connecting the means, $m_y$, of these $y$ distributions is called the regression curve.

In such a case as height versus weight in man, the distribution of weight for a fixed height, say 5 feet 6 inches, could also have been considered. Connecting such means would give the regression of weight on height. If, instead, the mean heights for fixed weights are connected, the regression of height on weight is given. The existence of these two regression curves has led to much confusion, because often the model generating two regression curves is not applicable. An excellent discussion of this and related questions about regression is contained in an article by Berkson (1). Here it will suffice to note that the model for which two regression curves are applicable assumes that there is a bivariate distribution, that is, a joint distribution of both $x$ and $y$, as in the height versus weight example. In many other practical cases it is not reasonable to assume such a joint distribution.

When there are a number of standards, say four different standard concentrations of a drug, and the effect of these concentrations is assessed many times by the same method, one expects to get a distribution of results clustering around the correct value for any particular concentration. Thus for one standard concentration it is reasonable to think of a distribution of observed results. However, it is not very reasonable to think of the four standard concentrations as randomly distributed. Statisticians call these standards by the self-contradictory term fixed variates. A reasonable model for this latter situation, more like those commonly met in industrial practice, assumes as in Figure 2 that there is a distribution of $y$ for any particular $x$. This distribution arises from measurement errors and from uncontrolled variables.

The simplest example, and the most pleasant from the point of view of the mathematics, is the case where the regression curve is assumed to be a straight line. There are two other considerations that make the straight-line case important. First, any decent mathematical function can be approximated by a straight line over a short interval. Very often, from the point of view of the underlying mathematical curve, short ranges are being considered. Secondly, even though a straight line may not obtain initially, often a simple transformation such as the logarithmic one will yield a straight line in the new variables, as in fitting the function

$$y = \alpha e^{\beta x}$$

for which $\log y = \log \alpha + \beta x$ is linear in $x$.
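For instance, such a curve can be fitted with straight-line machinery by regressing $\log y$ on $x$. A brief Python sketch with hypothetical data (the values roughly follow $y = 2e^x$ and are ours, not the paper's):

```python
from math import exp, log

# Hypothetical observations lying near y = 2 * exp(x).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.4, 14.8, 40.2]

# Fit log y = log(alpha) + beta * x by least squares in the new variables.
ls = [log(y) for y in ys]
n = len(xs)
xbar, lbar = sum(xs) / n, sum(ls) / n
beta = sum((x - xbar) * (l - lbar) for x, l in zip(xs, ls)) / \
       sum((x - xbar) ** 2 for x in xs)
alpha = exp(lbar - beta * xbar)
print(alpha, beta)  # close to 2 and 1 for these illustrative data
```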

Now assume that the true regression is

$$Y = \alpha + \beta x \qquad (1)$$

where $Y$ is thus the same as $m_y$ for the corresponding value of $x$, and that there are $n$ observations split up among the various fixed values of $x$. Then $\alpha$ and $\beta$ can be estimated by the quantities $a$ and $b$, respectively, where

$$a = \bar{y} - b\bar{x}$$

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

The observations are $x_i, y_i$, and the denominator of $b$ must not be zero; that is, more than one fixed value of $x$ must be used.


In the summations the subscripts have been dropped, and the summations run from 1 to $n$. Of course, $\bar{x}$ and $\bar{y}$ are the means of the $x$'s and of the $y$'s, respectively. The estimates $a$ and $b$ are the classic least squares estimates. One advantage of these particular estimates is that if one conceives of performing the operation of taking $n$ observations on many occasions and computing the estimates $a$ and $b$ over and over, then the long-run averages of the $a$'s and of the $b$'s are $\alpha$ and $\beta$. When a statistic such as $a$ or $b$ has the property that its long-run average equals the true value of the quantity being estimated, it is said to be unbiased.

If it is assumed further that at each of the fixed $x$ values there is a distribution of $y$ values whose variance is $\sigma^2$, an unbiased estimate of $\sigma_{y \cdot x}^2$, the variance around the regression line, can be obtained from

$$s_{y \cdot x}^2 = \frac{\sum (y - a - bx)^2}{n - 2}$$

where the summation runs from 1 to $n$. There are other advantages of these estimates associated with the distributions of $a$ and $b$ from sample to sample; for example, in large samples ($n > 30$) the quantity

$$\frac{(b - \beta)\sqrt{\sum (x - \bar{x})^2}}{s_{y \cdot x}}$$

is approximately normally distributed. This ratio provides a way of testing whether a particular regression line has a slope too far away from that of a standard, or of testing the difference between two regression lines, in the same manner as the previous test of a null hypothesis. More details on this matter are given by Wilks (4). When the variability of the $y$ distributions changes from one fixed value of $x$ to another, the theory is much more involved. That case will not be discussed here.
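The least squares estimates, the residual variance, and the large-sample slope test just described fit in a short routine. A hedged Python sketch follows; the data, like the function names, are hypothetical, and in practice the normal approximation for the slope ratio calls for n > 30:

```python
from math import sqrt

def fit_line(xs, ys):
    """Least squares estimates a, b for y = a + b*x, together with the
    unbiased residual variance s2 = sum of squared residuals / (n - 2)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)  # nonzero: several fixed x values
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    a = ybar - b * xbar
    s2 = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    return a, b, s2, sxx

def slope_ratio(b, beta0, s2, sxx):
    """(b - beta0) * sqrt(sum (x - xbar)^2) / s; approximately standard
    normal in large samples when the true slope is beta0."""
    return (b - beta0) * sqrt(sxx) / sqrt(s2)

# Hypothetical repeated measurements at four fixed values of x.
xs = [1, 1, 2, 2, 3, 3, 4, 4]
ys = [2.1, 1.9, 3.2, 2.8, 4.1, 3.9, 5.0, 5.2]
a, b, s2, sxx = fit_line(xs, ys)
print(a, b, s2, slope_ratio(b, 0.0, s2, sxx))
```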

CONCLUSION

The notions of tests of significance have been discussed, with particular reference to the underlying probability model used in making a statistical inference. The ideas of null hypothesis, significance level, and power curve or operating characteristic curve have been introduced. A probability model has been described for linear regression, and it has been indicated that for this model the usual least squares estimates of the slope and intercept of the straight line are unbiased estimates of the true values.

LITERATURE CITED

(1) Berkson, J., J. Am. Statistical Assoc., 45, 164-80 (1950).
(2) Dixon, W. J., and Mood, A. M., Ibid., 41, 557-66 (1946).
(3) Scheffé, H., IND. ENG. CHEM., 43, 1292 (1951).
(4) Wilks, S. S., "Mathematical Statistics," pp. 157-9, Princeton, N. J., Princeton University Press, 1943.

RECEIVED June 23, 1950. Presented before the Division of Industrial and Engineering Chemistry, Symposium on Statistics in Quality Control in the Chemical Industry, at the 117th Meeting of the AMERICAN CHEMICAL SOCIETY, Detroit, Mich.