SYMPOSIUM ON DESIGN OF EXPERIMENTS FOR DEVELOPING NEW ANALYTICAL METHODS

Presented before the Division of Analytical and Micro Chemistry at the 113th Meeting of the American Chemical Society, Chicago, Ill.

INTRODUCTORY REMARKS


GRANT WERNIMONT, Eastman Kodak Company, Rochester 4, N. Y.

IN THE fall of last year, Philip J. Elving, chairman of the Division of Analytical and Micro Chemistry, remarked that most discussions on the use of statistical methods presupposed that a large amount of data had been collected. He pointed out that much of the development work on new analytical methods had to be judged from the results of a minimum number of tests, and asked whether a symposium could not be arranged in which the emphasis would be placed upon methods of evaluating a small amount of data. Biological and agricultural experimentalists have come to grips with this same difficulty. Their solution of the problem, largely through the work of R. A. Fisher, has been to design their experiments carefully, so that a maximum amount of information can be extracted from the results. These experimental designs have not been used very much by chemists until recently; but it is becoming increasingly clear that such designed experiments can be used in all kinds of experimental work. It is the purpose of this symposium to show something of how designed experiments can be used to help develop new analytical test methods.

The discussion of the principles which must be followed in designing and executing all experiments, if we are to be able to draw valid conclusions from our results, is in a way the most important part of the symposium, because no amount of mathematical treatment can extract useful information from test results which were not obtained under controlled conditions. It is often easy to interpret the results of a simple experiment with little or no statistical treatment of the data. But situations arise when clear-cut decisions are not so easy to make. Another paper is devoted to an explanation of some of the techniques which statisticians have developed to help us in such cases. It is possible to design a series of experiments so that we get answers to several questions simultaneously. The design and use of these multiple-factor experiments are presented here, and time is given to a consideration of what aspects of statistics might well be taught to students in analytical chemistry. The four participants in the symposium are all closely connected with analytical chemistry and are actively engaged in practicing the principles they discuss.

Statistical Tests of Significance

JAMES H. DAVIDSON, Merck & Co., Inc., Rahway, N. J.

Illustrations are given of the general methodology and logic useful in establishing control by statistical tests. The three phases of effort proceeding from pure mathematics to applied statistics to technology or research are bound together to form an all-inclusive state of control.

IF WE were to play a game of chance where I flipped a coin and paid out a dollar whenever a tail appeared, but you paid me a dollar each time a head appeared, I am sure you would not expect to lose in the long run. Without any formal thought of statistics your intuition would tell you that half of the time you would win and half of the time you would lose. If, at the beginning of play, four heads came up on the first four tosses, you would probably think I was pretty lucky. If I tossed seven heads in a row, you might well begin to wonder to yourself about the fairness of play. At the tenth head in a row, I believe you would feel convinced that either the coin had two heads or the tossing was inequitable in some other way. In effect, this line of approach on your part is the same as that used by the statistician. However, the statistician would not rely upon intuition alone but rather would consider the empirically close relationship between certain mathematical expressions and physical happenings. In the particular example cited, he would reason that if there were no bias in the tossing, there would be equal likelihood of a head or a tail showing. Hence, on any one toss the probability of obtaining a head would be one half.


Further, assuming each toss to be independent of the others, the probability of two heads in a row would be (1/2)² or 0.25. It is easy to extend this argument to show that, in a perfectly fair game, the probability of having four heads in a row is 0.0625, of having seven heads in a row is 0.0078, and of having ten heads in a row is 0.00098. Here the mathematics of the situation cease and the statistician must turn to logic or common-sense principles for his decision. There are now two types of error he can make. After the coin had come up heads ten times, he might conclude I was cheating, when in fact I was not cheating, for such a run of events could occur in a fair game 98 times in 100,000. This is known as an error of the first kind, if it was the tacit assumption beforehand that the tossing was fair. It is the probability of the occurrence of this kind of error that determines the limits for a control chart. Again, after the first four heads he reasonably might have assumed that I was not cheating, when in fact I was. This would be an error of the second kind. Errors of this type cannot be controlled so simply as errors of the first kind, although in some instances the probability of their occurrence can be reduced to a minimum. Thus, to investigate the simple act of tossing a coin, use was made of mathematics in both a pure and an applied sense and of logic. Each of the three steps was in this particular case so closely integrated as to blur its identity. Nevertheless, the conception of an ideal coin and the calculation of the probability of finding either a head or a tail are in the realm of pure mathematics. The application of this model to an actual physical case and the comparison of the actual observations with those expected from the hypothesis belong in the realm of applied mathematics. Lastly, the interpretation of the findings and the decision as to the action to take belong to the person with intimate knowledge of the practical situation. In many instances, such as the one just cited, all three parts of the procedure will be vested with one person, but in the more complicated cases, two or even three specialists may be involved. However, the basic logic of the test of significance is the same. In the following paragraphs, a few of the more common uses of three of the most adaptable tests of significance are given. It is felt that these concepts will prove to be an invaluable aid to the chemical analyst.
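
The run probabilities quoted above are easy to verify; the following Python sketch (a modern illustration, not part of the original discussion) simply evaluates (1/2)^k for the three runs and compares them against the conventional cutoff for an error of the first kind.

```python
# Probability of k heads in a row from a fair coin, as in the discussion above.
for k in (4, 7, 10):
    p = 0.5 ** k
    print(f"{k} heads in a row: probability = {p:.5f}")
# 4 -> 0.06250, 7 -> 0.00781, 10 -> 0.00098
# Judged against a cutoff such as 0.01 for an error of the first kind,
# seven heads in a row already casts doubt on the fairness of the tossing.
```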

DEVELOPMENT OF STATISTICAL METHOD

Perhaps the earliest proponent of the use of statistical method in analyzing data was Bayes (1), who in 1763 proposed that the solution to the "problem of chances" would meet the need. Actually, he did little to solve the problem, but his logical exposition of it laid the groundwork for the astute mathematician, Laplace. The work of Laplace had great influence on the development of statistics in France and England during the early nineteenth century. His German contemporary, K. F. Gauss, established the theory of errors as a useful concept in evaluating measurements and introduced distribution theories to the physical sciences. The first of the more modern distributions to be discovered was the χ² distribution, which forms the basis for the χ² test. The function was mentioned by Helmert (4) in 1876, and rediscovered and developed by Pearson (6) in 1900. The value of χ² may be defined in two ways:

χ² = Σ (xᵢ − x′ᵢ)² / x′ᵢ     (1)

where xᵢ = an observed count of events and x′ᵢ = the corresponding theoretical or expected count of events; and

χ² = n s² / σ²     (2)

where s² = observed variance of n sample measurements, defined by s² = Σ (xᵢ − x̄)² / (n − 1); σ² = true variance of the population from which the sample was taken; and n = number of independent observations. Having thus defined χ², its distribution function can be written in the form:

f(χ²) = (χ²)^(ν/2 − 1) e^(−χ²/2) / [2^(ν/2) Γ(ν/2)]

where ν is the number of degrees of freedom, which in this case is one less than the number of values in the sample to be tested. Tables have been calculated giving the probability of exceeding given values of χ², assuming the sample were taken from the population designated. This has been the work of mathematicians and represents the integration of the known function for various degrees of freedom, ν. It can be shown that the distribution becomes nearly normal for values of ν greater than 30, and the test becomes more sensitive for more degrees of freedom. Now consider again the simple game of tossing a coin. If one wanted to be elaborate, he could use the χ² test to challenge my honesty in play. At the end of four tosses one would expect two heads and two tails, and this arrangement forms the hypothetical distribution which can be tested as shown in Table I.

Table I. Test of Distribution

              No. of Heads    No. of Tails
Expected            2               2
Observed            4               0

From Equation 1, one calculates χ² as follows:

χ² = (4 − 2)²/2 + (0 − 2)²/2 = 4

In the example, since there are but two categories in which to place observations (heads or tails), there is but one degree of freedom, so ν = 1. Looking in a table of χ², one finds that if the hypothesis were true that the observed values are essentially like those expected, then the probability of finding χ² = 4 is 0.046. Again, at the end of seven tosses one has Table II.

Table II. Test of Distribution

              No. of Heads    No. of Tails
Expected           3.5             3.5
Observed            7               0

χ² = (7 − 3.5)²/3.5 + (0 − 3.5)²/3.5 = 7

From the table of χ² with ν = 1, one finds the probability of χ² = 7 is 0.0082. Similarly, at the end of ten tosses, working in the same way, one finds χ² = 10, which has a probability of occurring by chance fluctuations of 0.00157. Considering the looseness of comparing the discrete binomial distribution to a continuous function, these probabilities agree sensibly well with those found earlier for the given runs of heads. It is now time to interpret for action the above results, which have been found by applied mathematical means. If one accepts the more or less arbitrary rule that when, under the hypothesis, the event observed has a probability of occurrence of less than 0.01, the hypothesis is rejected, then the tossing would be judged dishonest after seven heads had occurred.



In case the χ² probability for the event is greater than 0.05 under the hypothesis, then the hypothesis is not disproved; so it is generally accepted. Between these two probabilities is a region where the decision falls heavily upon the discretion of the person considering the particular problem. Often more data are requested. It can be seen that, using either the line of thought developed in the first paragraphs or the χ² test, the results would be the same: four heads in a row could not constitute evidence of a biased game, but seven heads in a row would. Hence, the three phases of the question are completed.
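
For readers who wish to reproduce the coin calculations, the following Python sketch applies Equation 1 to Tables I and II; the scipy routines are conveniences of this illustration and are not part of the original treatment.

```python
# Chi-square goodness-of-fit for the coin data of Tables I and II (Equation 1).
from scipy.stats import chi2, chisquare

for observed, expected in ([4, 0], [2, 2]), ([7, 0], [3.5, 3.5]):
    stat, _ = chisquare(f_obs=observed, f_exp=expected)
    # two categories (heads, tails) give one degree of freedom
    print(f"observed {observed}: chi-square = {stat:.1f}, P = {chi2.sf(stat, 1):.4f}")
# observed [4, 0]: chi-square = 4.0, P = 0.0455
# observed [7, 0]: chi-square = 7.0, P = 0.0082
```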

The following example illustrates a use for the second definition of χ². Over a long period of experience, it has been found that the average standard deviation due to analysis of thiamine hydrochloride in vitamin enrichment mixture is 6.8 mg. per ounce. On a particular sample, a chemist submitted the following four results: 764, 752, 770, and 741 mg. per ounce. The question is asked whether these values are consistent with the accepted error of determination. Using the usual formula for small samples, one finds that s² = 166.25, so that according to Equation 2

χ² = n s² / σ² = (4 × 166.25) / 46.24 = 14.38

In this case, one tests the hypothesis that the variation in the sample agrees with the previously accepted error. As there are four observations in the sample, there are three degrees of freedom, and entering a table of χ², one finds that a value as large as 11.34 has a probability of occurring of 0.01. The value of χ² found in the example was larger than that which could occur by chance only once in a hundred times, so the interpretation of the applied mathematics would be that the hypothesis is rejected. The chemist's results as submitted were too variable in the light of previous knowledge of the test's precision. It is unlikely (less than one chance in a hundred) that the difference between his variation and the variation established by past experience could have occurred by random fluctuation.
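
The variance check can be reproduced in the same way; the sketch below follows the article's Equation 2 as printed (χ² = n s²/σ²), with σ = 6.8 mg. per ounce taken from the long-run experience quoted above.

```python
# Chi-square test of the thiamine results against the accepted error of determination.
import numpy as np
from scipy.stats import chi2

results = np.array([764.0, 752.0, 770.0, 741.0])   # mg per ounce
n = len(results)
s2 = np.var(results, ddof=1)      # "usual formula for small samples" (divisor n - 1): 166.25
stat = n * s2 / 6.8 ** 2          # Equation 2 as printed; many modern texts use (n - 1) * s2 / sigma^2
print(f"chi-square = {stat:.2f}, P = {chi2.sf(stat, n - 1):.4f}")
# chi-square = 14.38; the tabled value for P = 0.01 with three degrees of freedom is 11.34
```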

STUDENT'S t TEST

Shortly after the inception of the χ² test, in 1908 to be exact, Gosset (7), a chemist working in an English brewery, published a paper under the pseudonym "Student" in which he gave the distribution of the ratio of an average to its standard deviation. The ratio was later slightly modified and designated by t, and since then it has been known as Student's t. When the statistic to be examined is a sample average, the simplest definition of the ratio is:

t = d √n / s

where d is the difference between the sample average and the population average, and s is the standard deviation for the sample of n determinations. It must be assumed that the sample is taken from an over-all normal distribution where the mean and standard deviation are independent. The distribution of the function t can be expressed as:

f(t) = Γ[(ν + 1)/2] / [√(νπ) Γ(ν/2)] × (1 + t²/ν)^(−(ν + 1)/2)

where ν is the number of degrees of freedom. Tables of t have been calculated for different levels of probability and various degrees of freedom. The tables show the values of t which can occur by chance alone with a given probability when samples are taken from a single normal distribution. Like the χ² distribution, the t distribution becomes normal very rapidly as n increases beyond 30. The t test may be used to test the hypothesis that a sample average varies but an insignificant amount from a population mean. The following example illustrates this use.

For a certain vitamin mixture the required amount of iron is 2400 mg. per ounce. On a certain lot of the material, the following six assays were reported: 2373, 2428, 2409, 2395, 2399, and 2411 mg. per ounce.

The differences between each of these values and the average of the distribution from which they are supposed to have come are, respectively, −27, +28, +9, −5, −1, and +11 mg. per ounce. It is found that the average of the differences is +2.5 mg. per ounce and their standard deviation is s = 18.47. Hence

t = (2.5 × √6) / 18.47 = 0.33

Looking at a table of the t distribution for five degrees of freedom, one finds that a value of t as large as this could occur by chance alone between 71 and 78% of the time. Thus it seems reasonable to conclude that the lot from which the sample was taken conforms to the specification of 2400 mg. per ounce, provided the sample was random and representative of the lot. The value of t might well have been negative and the same conclusion would be valid. In this case, however, it might be reasonable to tolerate a larger positive value of t than a negative one, because one might want the chance of accepting a lot with less than the required amount to be smaller than the chance of accepting one with more than its proportion. In using the t test in this manner, some caution must be exercised in examining the data. For instance, if the results were obtained in the order 2373, 2395, 2399, 2409, 2411, 2428, one would hardly accept them as random. However, the value of t would be identical with that already found. In such a case, the cause for the trend must be sought. Any intuitive knowledge of the data which can be wisely utilized should be brought to bear on any set of results.
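
The iron-assay example lends itself to the same kind of check; in the Python sketch below, scipy's one-sample t routine is applied to the six assays and the 2400 mg. per ounce specification quoted above (the routine and its two-sided probability are conveniences of this illustration).

```python
# One-sample t test of the six iron assays against the 2400 mg/oz specification.
import numpy as np
from scipy.stats import ttest_1samp

assays = np.array([2373, 2428, 2409, 2395, 2399, 2411], dtype=float)
t_value, p_value = ttest_1samp(assays, popmean=2400.0)
print(f"t = {t_value:.2f} with {len(assays) - 1} degrees of freedom, two-sided P = {p_value:.2f}")
# t is about 0.33; P is roughly 0.75, consistent with the 71 to 78% read from the tables
```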

Often in chemical work, as in other fields, it is desired to determine whether two averages can be assumed to agree. Again the t distribution can be used, but in a more general form:

t = [(x̄₁ − x̄₂) − (m₁ − m₂)] / √(n₁s₁² + n₂s₂²) × √[n₁n₂(n₁ + n₂ − 2) / (n₁ + n₂)]

where the subscripts refer to the respective samples. The hypothesis that m₁ = m₂ is assumed and tested. Again, the samples are assumed to be from a normal distribution, but here the variances s₁² and s₂² must also be assumed to come from the same over-all distribution for the test to be valid. The sample sizes are designated by n₁ and n₂, respectively. This form of the t test uses a pooled variance and does not give so conservative a value as other forms, but generally the final results are comparable.

A series of moisture determinations was taken from two portions of a lot of potassium chloride, and one would test the hypothesis that the averages of the two series are alike:

Series 1 (%): 0.5, 0.8, 0.7, 0.6, 0.9, 0.5, 0.4
Series 2 (%): 0.7, 0.6, 1.0, 1.1, 0.8, 0.9, 1.2, 1.1, 0.9

x̄₁ = 0.6    s₁² = 0.0324    n₁ = 7
x̄₂ = 0.9    s₂² = 0.0394    n₂ = 9

Upon substituting these values into the formula, one has:

t = [(0.6 − 0.9) / √(7 × 0.0324 + 9 × 0.0394)] × √[(7 × 9 × 14) / 16] = −8.91 / 3.05 = −2.92

The table of the t distribution is entered at n₁ + n₂ − 2 = 14 degrees of freedom, for there are n₁ − 1 degrees of freedom in the first series and n₂ − 1 in the second, which are additive. It is found from the table that such a value of t could occur by chance alone only about once in a hundred times. It would then seem reasonable to take action on the premise that the two averages were significantly different. In this case, also, one should be reasonably sure that the determinations represent controlled random sampling.
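
The whole comparison can be reproduced with a few lines; the sketch below evaluates the pooled form given above for the two moisture series (it uses the unrounded series averages, so the result differs slightly from the hand calculation).

```python
# Two-sample t test for the potassium chloride moisture data, pooled as in the text
# (n1*s1^2 + n2*s2^2 in the denominator).
import numpy as np
from scipy.stats import t as t_dist

series1 = np.array([0.5, 0.8, 0.7, 0.6, 0.9, 0.5, 0.4])
series2 = np.array([0.7, 0.6, 1.0, 1.1, 0.8, 0.9, 1.2, 1.1, 0.9])
n1, n2 = len(series1), len(series2)
s1_sq, s2_sq = np.var(series1, ddof=1), np.var(series2, ddof=1)

t_value = ((series1.mean() - series2.mean())
           / np.sqrt(n1 * s1_sq + n2 * s2_sq)
           * np.sqrt(n1 * n2 * (n1 + n2 - 2) / (n1 + n2)))
p_value = 2 * t_dist.sf(abs(t_value), n1 + n2 - 2)
print(f"t = {t_value:.2f} with {n1 + n2 - 2} degrees of freedom, two-sided P = {p_value:.3f}")
# t is about -2.9 with 14 degrees of freedom and P a little over 0.01;
# scipy.stats.ttest_ind, which pools with n - 1 weights, gives a somewhat larger |t|.
```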

F TEST

In 1921, a short time after Student's publication, Fisher (3) proposed his famous Z distribution. The value of Z is defined as:

Z = ½ ln (s₁²/s₂²)

where ln represents the logarithm to the base e and s₁² and s₂² represent two variances, the larger of which is s₁². Snedecor later modified this ratio and made tables for the distribution he designated as F in honor of Fisher. The value of F is defined by:

F = s₁²/s₂²

Its distribution is general and includes both the χ² and t tests as special cases. The distribution can be written conveniently for the Z function as follows:

f(Z) = 2 n₁^(n₁/2) n₂^(n₂/2) e^(n₁Z) / [B(n₁/2, n₂/2) (n₁e^(2Z) + n₂)^((n₁ + n₂)/2)]

where n₁ and n₂ are the degrees of freedom associated with s₁² and s₂², respectively. Fisher also made the transformations and substitutions necessary to derive the χ² and t distributions from the above expression. The F distribution, like the χ², is independent of the parent population parameters from which the samples are taken and merely requires that the variances be independent. To be practical, the distribution sampled should approximate normality. As a simple illustration of the use of the F test, one might consider the example of the moisture determinations on potassium chloride. One assumption of the t test was that the variances of the samples were estimates of a single value. This assumption can be tested by the F distribution. Thus for the value of F:

F = 0.0394 / 0.0324 = 1.22

One enters a table of the distribution of F at 8 degrees of freedom for the greater variance and 6 degrees of freedom for the lesser variance. The value of F which could occur by chance five times in a hundred is F = 6.13; hence a value as small as 1.22 could occur more often, and the hypothesis that the variances are essentially equal is not refuted. It is usually well to accompany each t test for the differences between two samples by a test of this nature.
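
A corresponding check of the variance ratio is sketched below; the series and their variances are those of the moisture example, and the one-sided probability from scipy is an addition of this illustration rather than the tabled 5% point quoted in the text.

```python
# F test of the two moisture-series variances (larger over smaller).
import numpy as np
from scipy.stats import f as f_dist

series1 = np.array([0.5, 0.8, 0.7, 0.6, 0.9, 0.5, 0.4])
series2 = np.array([0.7, 0.6, 1.0, 1.1, 0.8, 0.9, 1.2, 1.1, 0.9])
s1_sq, s2_sq = np.var(series1, ddof=1), np.var(series2, ddof=1)

F = s2_sq / s1_sq                         # series 2 has the larger variance; F is about 1.22
p = f_dist.sf(F, dfn=len(series2) - 1, dfd=len(series1) - 1)   # 8 and 6 degrees of freedom
print(f"F = {F:.2f}, one-sided P = {p:.2f}")
# a ratio this small is well within chance variation, so the assumption of a common
# variance behind the pooled t test is not refuted
```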

PHILOSOPHY OF SIGNIFICANCE

In the foregoing specific discussions, in every case an observed value of a function was found from the data and compared with a value in the corresponding frequency distribution which could occur by chance alone with a given probability. If the probability is low and the value observed is greater than the value in the table, the assumed hypothesis is rejected. On the other hand, if the value observed is lower than that which might be expected with reasonable frequency, the hypothesis is accepted. This is all done in the faith that the values obtained from the physical events can be tested against a purely mathematical model. All three distributions discussed require in theory that the parent population have a normal distribution. This restriction has been largely overcome for the t distribution by the work of Pearson and Geary (5). The assumption of equal variances when comparing two samples need not be made if one is willing to apply Behrens' (2) test, which has been stated only for the case of equal sample sizes and does not hold in the theory of confidence intervals. This latter limitation is serious because it does not allow a prediction of future events. The theory of confidence limits requires that the assertion, that the parameter which is being estimated lies within the given interval, be true in an assigned portion of cases as prescribed by the probability level. Every test so far described conforms to the confidence interval theory, so that one can project conclusions into the future and feel sure they will hold with a certain probability, which is the chance of committing an error of the first kind. Errors of the second kind are usually considered to be at a minimum, although they are not actively calculated. In most cases, a compromise must be made between the probabilities of committing the two types of error, because as the probability of one type of error increases, the probability of the other decreases. For the practical situation, an error of the first kind is the more critical, since it rejects a hypothesis. Hence, emphasis is placed upon its control. In all these tests of significance, randomness in values is assumed, as well as control within sets. It should be borne in mind that whenever one makes a test of significance, he is testing not only the hypothesis he has in mind but also all the assumptions that have been made during the development of the test. Therefore it is important that the test used should involve as few assumptions as possible. When a test of significance shows a result that is incompatible with the hypothesis, the assumptions should be checked more thoroughly than when the reverse is true, for usually nonsignificance is taken as a necessary but not entirely sufficient step in accepting a postulate. This argument is analogous to that used for the two kinds of error.

The tests described here are rather timeworn and prosaic from the viewpoint of mathematical statistics, but they present excellent illustrations of the general methodology and logic useful in establishing control. In more recent years, methods have been developed for determining random order to a much finer degree. The mathematicians are always seeking new distributions which will lead to more powerful tests of data. Whether one makes the simplest kind of test or the most complicated, the underlying logic is the same. The three phases of effort proceeding from pure mathematics to applied statistics to technology or research are bound together to form an all-inclusive state of control. As to the type of bond, one might make an observation similar to Lippmann's remark about normal curves: the experimenters believe the methods hold good because they are mathematical theorems, and the mathematicians believe they work because they have been substantiated by experimental fact. Actually they are probably both justified.

LITERATURE CITED

(1) Bayes, T., Phil. Trans., 53, 370 (1763).
(2) Behrens, W. V., Landw. Jahrb., 68, 807 (1929).
(3) Fisher, R. A., Metron, 1, No. 4, 1 (1921).
(4) Helmert, F. R., Z. Math. Phys., 21, 192 (1876).
(5) Pearson, E. S., and Geary, R. C., "Tests of Normality," London, Biometrika Office, 1938.
(6) Pearson, K., Phil. Mag. (5), 50, 157 (1900).
(7) Student, Biometrika, 6, 1 (1908).

RECEIVED September 10, 1948.