Theoretical Backgrounds of the Statistical Methods

Some Basic Concepts of Probability and Statistics

Henry Scheffé, Columbia University, New York, N. Y.


An outline of certain basic concepts of probability and statistics is attempted without introducing all the detailed mathematics which make the concepts rigorous. The topics included are: probability, distributions, statistical independence, population and sample, sampling distributions, central limit theorem, and states of statistical control.

THE first two papers in this symposium lay the theoretical basis for the statistical methods discussed in the following papers. The theoretical basis of all statistical methods is the mathematical theory of probability. It is, of course, not possible to give in this paper a rigorous mathematical formulation of all the theoretical concepts involved in the statistical methods.

The first question, and the most basic one, is: what is probability? The mathematician defines probability by a system of axioms from which he can develop the whole mathematical theory of probability. As with any other mathematical theory of physical phenomena, some physical assumptions must also be made which allow one to bridge the gap between the theory and the actual physical situation. Sufficient for this purpose seems to be the physical assumption that events with very small probability happen very rarely. This assumption is much more vague in character than the axioms of the mathematical theory, but this is typical of assumptions that connect mathematical models with physical occurrences in any field.

It is not possible to develop here the mathematical model containing the answer to the question, what is probability? However, one can describe a simple physical situation which may be regarded as a standard one for exemplifying the theoretical concepts. This situation is called a bowl experiment. By this one means the following:

Suppose a bowl contains a number of chips which are physically similar except for marking, so that they can be distinguished into two or more classes, say by coloring or numbering. The bowl experiment consists of repetitions of the following sequence of actions. The chips are thoroughly mixed in the bowl. One chip is taken out of the bowl, the person drawing the chip not looking to see which he is selecting. The distinguishing characteristic of the chip is recorded (number or color), the record including the order of the drawing. The chip is returned to the bowl. The process is now ready for repetition, beginning with thorough mixing.

If it is assumed that all the chips in the bowl have the same probability of being drawn, then the mathematical model (which is not being developed here) tells the following:

Suppose the total number of chips in the bowl is N, and that the chips are classified into two classes of S chips and N - S chips in such a way that every chip is in only one of the two classes. If for convenience the drawing of a chip is called a trial, and if the trial is called a success if one of the chips of the first class of S chips is drawn, then the probability of success is S/N. For example, if a bowl experiment is made with 100 chips numbered from 1 to 100, and if success consists of drawing a chip with number less than 31, then S = 30, N = 100, and the probability of success is 30/100 = 0.3. In general, then, if the probability of some event is p, this means that in repeated trials the event behaves like the appearance of success in a bowl experiment similar to the one just described with S/N = p.

The frequency interpretation of the statement that an event has probability p can be deduced from the mathematical axioms together with the above physical assumption about the very rare occurrence of events with very small probability. The conclusion is, roughly, that if a very large number of trials is made of an event with probability p, then the observed proportion of successes will very rarely differ from the probability, p, by more than a small amount. This frequency interpretation may be expressed briefly by saying that the observed proportion of successes is in the long run equal to p.

The notion of statistical independence is basic for the present purpose. Mathematically, the independence of two events is defined by the requirement that the probability of both of the events occurring is the product of their respective probabilities. A bowl experiment illustrating this notion will give some insight into its significance.

Suppose the chips in the bowl each have two kinds of attributes associated with two events, say coloring and numbering. Thus, some chips might be white and the remainder yellow, and some might be marked 0 and the remainder 1. Each chip has a color and a number. Suppose there are 100 chips, of which 40 are white and 60 yellow, so that the probability of a white chip being drawn is 0.4; and suppose also that 25 of the 100 chips are marked 0 and 75 marked 1, so that the probability of a chip marked 0 is 0.25. If a chip is drawn it can be considered a trial of two events: one event is the occurrence or nonoccurrence of a white chip, and its probability is 0.4; the other event is the occurrence or nonoccurrence of 0, and its probability is 0.25. Are these two events statistically independent? More information is needed to answer this question. The answer depends on how the two classes, the class of white chips and the class of chips marked 0, overlap.

According to the mathematical definition, the events "white chip" and "chip marked 0" are independent if the probability of both occurring (namely, a white chip marked 0) is the product of the probabilities (namely, 0.4 times 0.25, or 0.1). But the probability of a white chip marked 0 is equal to the total number of white chips marked 0 divided by the total number of chips, and this will be 0.1 if and only if exactly 10 of the 100 chips are white chips marked 0. From this information one can determine the numbers in the other three cells of the following fourfold table:

             0      1   Total
    White   10     ..      40
    Yellow  ..     ..      60
    Total   25     75     100

They must be those indicated below.

             0      1   Total
    White   10     30      40
    Yellow  15     45      60
    Total   25     75     100

The proportion of chips marked 0 is the same among the white chips as among the yellow, and the same as in the whole table, namely, 0.25 in every case.
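The fourfold table above can be checked by a simulated bowl experiment. The following sketch is an illustration only; the random seed and the number of trials are arbitrary choices. It draws chips with replacement from the 100-chip bowl and verifies that the observed proportions behave as the frequency interpretation and the independence criterion predict.

```python
import random

# Build the bowl from the fourfold table: 10 white-0, 30 white-1,
# 15 yellow-0, 45 yellow-1 (100 chips in all).
bowl = ([("white", 0)] * 10 + [("white", 1)] * 30 +
        [("yellow", 0)] * 15 + [("yellow", 1)] * 45)

random.seed(1)            # arbitrary seed, for reproducibility
trials = 100_000
draws = [random.choice(bowl) for _ in range(trials)]  # drawing with replacement

p_white = sum(c == "white" for c, m in draws) / trials
p_zero = sum(m == 0 for c, m in draws) / trials
p_both = sum(c == "white" and m == 0 for c, m in draws) / trials

# The observed proportion of white chips marked 0 should be close to
# 0.4 * 0.25 = 0.1, as the independence criterion requires.
print(p_white, p_zero, p_both)
```

In the long run the three printed proportions settle near 0.4, 0.25, and 0.1, so the product rule holds for this bowl.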

INDUSTRIAL AND ENGINEERING CHEMISTRY, June 1951 (Vol. 43, No. 6)

By analyzing the concept of statistical independence in this way one is led to see that in physical problems the assumption of the statistical independence of two events is appropriate when one is willing to assume that the occurrence of either of the events cannot be influenced by whether or not the other occurs.

Simple physical illustrations of statistical independence and dependence can be found in the phenomenon of radioactive decay of a substance. Let T1 and T2 be two time intervals, and say the event E1 occurs if there are one or more emissions during time interval T1, and similarly define the event E2 for the time interval T2. If the time intervals T1 and T2 do not overlap, events E1 and E2 are statistically independent; if T1 and T2 overlap, E1 and E2 are dependent.

The physical significance of statistical independence leads then to another illustration of statistical independence by means of a bowl experiment with two bowls. If one event is defined by a drawing from one bowl and another event by a drawing from another bowl, then the two events are independent. Thus, instead of having a single bowl made up according to the above fourfold table there could be two bowls, one with white and yellow chips in the proportion 40 to 60, and the other with chips marked 0 and 1 in the proportion 25 to 75.

The extension of the notion of statistical independence to three or more events is slightly more complicated as far as the mathematical definition goes, but the extension of the physical interpretation and the bowl experiment interpretation is simple. The bowl experiment interpretation of the statistical independence of k events can be made as follows: Suppose k bowls are prepared so that the k events determined by the k bowl experiments separately have the same probabilities as the k events in question. Statistical independence of the k events means that their joint behavior is like that of a set of k drawings with one drawing made from each bowl.
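The two-bowl interpretation can likewise be sketched in a few lines (again an illustration only, with arbitrary seed and trial count). One drawing is made from each bowl per trial, so the two events are independent by construction, and the joint proportion should factor into the product of the separate probabilities.

```python
import random

random.seed(3)            # arbitrary seed, for reproducibility
trials = 100_000

color_bowl = ["white"] * 40 + ["yellow"] * 60   # proportion 40 to 60
number_bowl = [0] * 25 + [1] * 75               # proportion 25 to 75

# One drawing from each bowl per trial; count trials where both
# "white chip" and "chip marked 0" occur.
both = sum(random.choice(color_bowl) == "white" and
           random.choice(number_bowl) == 0
           for _ in range(trials)) / trials

print(both)  # in the long run, close to 0.4 * 0.25 = 0.1
```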
Another basic concept in statistics is that of a random variable. A random variable is a quantity that can take on different values, the occurrence of the different values being governed by a fixed set of probabilities. An example of a random variable is the number of heads obtained if 4 coins are tossed; the possible values of the random variable are the 5 integers 0, 1, 2, 3, 4, and the probabilities that these values be obtained are, respectively, 1/16, 4/16, 6/16, 4/16, 1/16, as determined by a well-known probability law which need not be described here.

If a random variable has only a finite number of possible values, a bowl of chips can be prepared in such a way that the drawings from the bowl experiment behave like observations on that particular random variable. One marks the chips with the possible values of the random variable, putting in a number of chips with each marking proportional to the probability of the random variable taking on that value. For example, consider the random variable in the coin-tossing experiment mentioned above. To prepare a bowl experiment whose outcome simulates this random variable one might put 160 chips into the bowl, of which 10 are marked 0, 40 marked 1, 60 marked 2, 40 marked 3, and 10 marked 4.

A random variable has a probability distribution which it is sometimes helpful to picture like a mass distribution. Imagine distributing a unit of mass on an x-axis, the amount of mass assigned to each point being equal to the probability that the random variable assumes the corresponding value. Thus, the mass picture for the random variable in the above example consists of 5 mass points at x = 0, 1, 2, 3, 4, the masses being, respectively, 1/16, 4/16, 6/16, 4/16, 1/16. Thus far all random variables have implicitly been assumed to be discrete, i.e., such that the mass picture consists of discrete mass points.
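The "well-known probability law" alluded to for the coin-tossing example is the binomial law for 4 fair coins, P(k) = C(4, k) / 2^4; the short sketch below (an illustrative addition) computes the probabilities and the matching 160-chip bowl.

```python
from math import comb

# Number of heads in 4 tosses of a fair coin: P(k) = C(4, k) / 2^4.
probs = [comb(4, k) / 2**4 for k in range(5)]
print(probs)  # [1/16, 4/16, 6/16, 4/16, 1/16] as decimals

# The matching bowl of 160 chips: the count of chips marked k is
# proportional to P(k).
chips = [int(160 * p) for p in probs]
print(chips)  # [10, 40, 60, 40, 10] chips marked 0 through 4
```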
However, the mass picture immediately suggests the notion of a continuous random variable, corresponding to a continuous smearing out of the unit of mass over the x-axis. Continuous random variables may be defined by a function called the probability density, which is the analog of the density of mass on the x-axis. The most important example of a continuous random variable is one with a normal distribution; a random variable has a normal distribution if it has a probability density of the form

    p(x) = (1 / (σ √(2π))) exp(-(x - m)² / (2σ²))

that is, if the probability density falls off exponentially with the square of the distance from the point m; the meaning of the constants m and σ will be considered later. The function p(x) may be interpreted by multiplying it by the differential dx: p(x)dx is the probability of finding the random variable in the interval from x to x + dx. Random variables generated by bowl experiments are necessarily discrete; however, a continuous random variable may in a certain sense be approximated as closely as desired by a suitable bowl experiment. A "normal bowl" refers to one which generates, up to this approximation, a normal random variable.

The concepts of mean, variance, and standard deviation of a random variable or of a probability distribution are easily defined in terms of the mass picture. Consider the mass distribution on the x-axis which gives the mass picture for the random variable. The mean of the random variable is the x-coordinate of the center of mass of this mass distribution. Its variance is its moment of inertia about an axis perpendicular to the x-axis through the center of mass. The square root of the variance, which is the same as the radius of gyration about this axis, is called the standard deviation of the random variable. The mean is a central value for the random variable or probability distribution in a sense that is entirely clear from this mechanical definition. The variance and standard deviation are measures of the spread of the probability distribution about its mean, in senses that are again clear to our mechanical intuitions on the basis of the definitions in terms of moment of inertia and radius of gyration. The meaning of the constants m and σ in the probability density of the normal distribution introduced above can now be stated: m is the mean and σ is the standard deviation of the distribution.

The mean and variance are examples of expected values.
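The mechanical definitions (center of mass, moment of inertia, radius of gyration) translate directly into sums over the mass picture. The sketch below, an illustrative addition, applies them to the 4-coin random variable described above.

```python
from math import comb, sqrt

# Mass picture of the 4-coin random variable: mass P(k) at x = k.
xs = range(5)
mass = [comb(4, k) / 16 for k in xs]

mean = sum(x * m for x, m in zip(xs, mass))               # center of mass
var = sum((x - mean) ** 2 * m for x, m in zip(xs, mass))  # moment of inertia
std = sqrt(var)                                           # radius of gyration

print(mean, var, std)  # 2.0 1.0 1.0
```

The mean 2 and variance 1 agree with the binomial formulas np and np(1 - p) for n = 4, p = 1/2.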
The expected value of a function of a random variable may be defined as a weighted average of the function, the weighting being done according to the probability distribution or mass distribution of the random variable. Thus, the mean is the expected value or weighted average of the random variable itself, while the variance is the expected value of the squared difference of the random variable from its mean. A frequency interpretation of an expected value may be made as follows:


Imagine getting a long series of observations on the random variable, such as could be obtained in a bowl experiment generating the random variable, and each time an observation is obtained, averaging all the observations obtained thus far. The average of the observations will be, in the long run, equal to the mean or expected value of the random variable. If, instead of averaging the observations, one averages some function of the observations, this average will be in the long run equal to the expected value of the function.
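This long-run averaging is easy to mimic with a simulated bowl experiment (an illustrative sketch; the seed and trial count are arbitrary). The 160-chip bowl for the 4-coin random variable has expected value 2 and, for the function x², expected value E[X²] = variance + mean² = 1 + 4 = 5.

```python
import random

# Bowl generating the 4-coin random variable: 10 chips marked 0,
# 40 marked 1, 60 marked 2, 40 marked 3, 10 marked 4.
bowl = [0]*10 + [1]*40 + [2]*60 + [3]*40 + [4]*10

random.seed(2)            # arbitrary seed, for reproducibility
n = 50_000
observations = [random.choice(bowl) for _ in range(n)]

running_mean = sum(observations) / n                    # average of X
running_mean_sq = sum(x * x for x in observations) / n  # average of X^2

print(running_mean)     # close to the expected value, 2
print(running_mean_sq)  # close to E[X^2] = 5
```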

It is of the greatest importance to distinguish between statistical parameters of distributions or random variables, like the mean or variance of a distribution, and similar quantities calculated from a set of data. It will be convenient to call any set of data a sample (later it will be considered whether or not it might be a random sample) and to call the statistical quantities calculated from the data, sample values. Thus, the sample mean of any set of measurements is their arithmetic average, the sample variance is the average of the squared deviations from the sample mean, and the sample standard deviation is the square root of the sample variance. In many industrial problems, instead of the sample variance or sample standard deviation's being used in statistical calculations, the range is used because it is easier to calculate. The range of the sample is the amount by which the largest measurement in the sample differs from the smallest.

Walter A. Shewhart of the Bell Telephone Laboratories has emphasized that the information contained in a set of data is not exhausted by giving the values of statistical measures like the sample mean, sample variance, range, etc. There is useful information in the order in which the measurements were obtained, and this order does not affect the statistical measures which have been mentioned; if the order of the data is changed these measures still have the same values.

Most of the usual theory of statistical inference from samples is based on the assumption that the sample is a random sample from some population. In order to define this one needs to indicate first what is meant by the statistical independence of two or more random variables. What is meant by the statistical independence of two or more events was stated above. Two discrete random variables are said to be statistically independent if, for every value x that the first can take on and every value y the second can take on, the events that the first random variable equals x and the second equals y are statistically independent. This means that the joint probability function of the two random variables factors into the probability function of the first alone times the probability function of the second alone. In the case of two continuous random variables the mathematical condition for statistical independence is a similar factoring of the joint probability density function. The bowl experiment interpretation is that if k random variables are statistically independent they behave like drawings of k chips from k different bowls, each bowl made up to generate the corresponding random variable.
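The sample values just defined can be sketched as a short function. Note that the sample variance, as defined in the text, is the average of the squared deviations (divisor n, not the n - 1 of later convention). The measurements used below are made up purely for illustration.

```python
from math import sqrt

def sample_values(data):
    """Sample mean, sample variance (divisor n, as defined in the text),
    sample standard deviation, and range of a set of measurements."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    return mean, variance, sqrt(variance), max(data) - min(data)

# Hypothetical measurements, for illustration only.
mean, var, std, rng = sample_values([2, 4, 4, 4, 5, 5, 7, 9])
print(mean, var, std, rng)  # 5.0 4.0 2.0 7
```

Reordering the data leaves all four values unchanged, which is exactly Shewhart's point about the information carried by the order of the measurements.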
A sample of n observations is said to be a random sample from a population with a certain distribution if the n observations are on n statistically independent random variables all with the same distribution. The common distribution is then called the distribution of the population. The bowl experiment interpretation of a random sample from a population is as follows. Make up a bowl of chips which has the distribution specified for the population. Draw n chips from the bowl (with replacement of each chip after drawing, as usual); this gives the same results as drawing from n different bowls, according to the bowl experiment interpretation of statistical independence given above. The result of the drawings, i.e., the set of readings of the n chips in the order drawn, is a random sample from the population in question.

Closely related to the notion of random sampling from a population is the concept of a state of statistical control of a manufacturing process or a method of measurement. The theory of statistical control of processes has been developed by Shewhart (1). Imagine a sequence of numbers generated by the manufacturing or measuring process. The numbers might be, for example, the diameters of successive shafts coming off a production line, or the per cent of some chemical as determined by successive analyses by the same method applied to the same standard preparation. If every set of n numbers behaves like a random sample from a fixed population (for every n) the system is said to be in a state of statistical control. The bowl interpretation is again that the numbers behave like drawings from a bowl. The importance of this concept arises from the following three circumstances:

1. The assumption on which most statistical methods are based, i.e., that a state of statistical control exists.
2. Shewhart's (1) first empirical principle: that a state of statistical control practically never exists in manufacturing processes or methods of measurement before the process is stabilized by being analyzed and modified by statistical control techniques.
3. Shewhart's (1) second empirical principle: that after a state of statistical control is attained it tends to persist.

A large part of theoretical statistics consists of the calculation of various sampling distributions, and every application of statistical methods involves the use of one or more sampling distributions. Any quantity which can be calculated from the sample alone, without knowledge of the population, is called a statistic; examples of statistics are the sample values discussed above. The statistic might be a more complicated function of the sample, for example, the fraction whose numerator is the sample mean minus a given constant and whose denominator is the sample standard deviation.

Imagine random samples taken repeatedly from the same population and the same statistic calculated for each sample, for example, the sample mean calculated from a sequence of random samples. The values taken on by the statistic will vary from sample to sample; the statistic is a random variable, and its probability distribution may be calculated by mathematical methods from the population distribution. The probability distribution of the statistic in random samples from a given population is called its sampling distribution for that population.

For instance, suppose the statistic is the sample mean and the samples are random samples of n from a normal population. Then it may be shown mathematically that the sampling distribution of the sample mean is also normal, and that the relation between the normal distribution of the population and the normal distribution of the sample mean is the following. Both distributions are centered at the same point, i.e., both have the same mean, but the spread of the distribution of the sample mean is smaller than that of the population, the variance of the sample mean being 1/n times that of the population. An interesting and practically useful result is that even if the population is not normal the distribution of the sample mean is approximately normal.
This follows from the central limit theorem, which states that with increasing sample size the distribution of the sample mean for random samples from an arbitrary population tends toward a limiting form which is normal.

In conclusion, some bowl experiment interpretations of these remarks on sampling distributions will be indicated. Suppose a bowl of chips is prepared to represent a given population. Random samples of n can then be obtained as already mentioned, and for each of these the value of the statistic can be calculated. The sequence of values thus obtained would behave exactly like a sequence of drawings from another bowl representing the sampling distribution of the statistic. The bowl representing the sampling distribution of the statistic could be prepared as follows: Suppose the constitution of the bowl representing the population is known, i.e., the number, N, of chips and the N numbers marked on the chips are known. It is then possible to enumerate the ways in which a random sample of n can be drawn: Each element of the sample of n can be any one of the N chips, and so the number of random samples is N^n; for each of these one calculates the value of the statistic and puts a chip into the new bowl marked with this value. This gives N^n chips in the new bowl representing the sampling distribution of the statistic. The meaning of the central limit theorem is that if a new bowl is made up in this way to represent the sampling distribution of the sample mean, then regardless of the constitution of the original bowl, drawings from the new bowl will behave almost like drawings from a normal bowl.
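The enumeration just described (all N^n equally likely samples, with a chip in the new bowl for each) can be carried out exactly for a tiny bowl. The sketch below, with bowl contents and sample size chosen arbitrarily for illustration, verifies that the sampling distribution of the sample mean has 1/n times the population variance.

```python
from itertools import product

# A tiny bowl representing the population (contents are illustrative).
bowl = [0, 1, 2, 4]
N, n = len(bowl), 2

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Enumerate all N^n equally likely random samples of n (drawn with
# replacement) and mark a chip in the "new bowl" with each sample mean.
new_bowl = [mean(s) for s in product(bowl, repeat=n)]

print(len(new_bowl))                         # N^n = 16 chips
print(variance(bowl), variance(new_bowl))    # the second is 1/n times the first
```

The same construction with larger n would show the chips of the new bowl piling up in the bell-shaped pattern of a normal bowl, which is the content of the central limit theorem.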

LITERATURE CITED

(1) Shewhart, W. A., "Economic Control of Quality of Manufactured Product," New York, D. Van Nostrand Co., 1931; "Statistical Method from the Viewpoint of Quality Control," Washington, D. C., Graduate School of the Department of Agriculture, 1939.

RECEIVED April 10, 1950. Presented before the Division of Industrial and Engineering Chemistry, Symposium on Statistics in Quality Control in the Chemical Industry, at the 117th Meeting of the AMERICAN CHEMICAL SOCIETY, Detroit, Mich. This paper was prepared under sponsorship of the Office of Naval Research.