I/EC STATISTICAL DESIGN
A Workbook Feature
by W. J. Youden, National Bureau of Standards
Problems in Testing Materials

Testing procedures must run a gauntlet of tests to establish their performance

Many of the problems in testing materials arise because the tests are intended to serve as indicators of how the material under test will perform when placed in service. Seldom can the test duplicate exactly the conditions encountered in actual use. Often the time element is involved and a quick laboratory test is expected to reveal service performance over a long future period of time.

Laboratory Tests and Service Performance
The first difficulty is to determine what measurable characteristics of the material are correlated with the capacity of the material to meet the demands placed upon it. In the beginning, it is necessary to follow along the history of some materials in actual service. After certain tests have been found to be good predictors of performance, the search takes a new turn. New and more convenient tests are sought that are closely correlated with existing tests. The new test can then replace the old test using the relationship established between the old and new. Problems may arise here if the two tests are not compared over the whole range of quality of the material under test.

Reproducibility of Test Results
Tests on materials are not exempt from the problems that beset all types of measurement. One inescapable problem concerns the reproducibility of the readings. Repeated tests conducted in the same laboratory display better reproducibility than tests conducted in several laboratories. Hence, in the early stages of developing a test, it is better to concentrate the inquiry in one laboratory.

A well-defined test procedure should give good reproducibility under good conditions—i.e., a qualified operator working with proper equipment in a stable environment. Conducting the repeat measurements as nearly simultaneously as possible affords a temporarily stable environment. If tests are spread out over a period of time, agreement becomes poorer. Consequently, reproducibility of a test procedure in practice is often not as good as duplicate adjacent readings suggest.

These problems appear in all testing procedures, whether the material be steel, rubber, cement, or paper. Reproducibility is particularly important because any dispersion in the test results due to test procedure tends to reflect adversely on the material. Then the problem is to detect and isolate this source of variation as a flaw in the procedure or in the material.
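As an illustrative sketch (the laboratories and figures below are invented, not from the column), the contrast between within-laboratory and between-laboratory reproducibility can be seen by pooling the spread of repeats inside each laboratory and comparing it with the spread of the laboratory averages:

```python
import statistics

# Hypothetical results: three laboratories, four repeat tests each.
labs = {
    "Lab 1": [10.1, 10.3, 10.2, 10.2],
    "Lab 2": [10.8, 10.9, 10.7, 10.8],
    "Lab 3": [9.6, 9.7, 9.5, 9.6],
}

# Within-laboratory spread: pool the variances of the repeats in each lab.
within = (sum(statistics.variance(r) for r in labs.values()) / len(labs)) ** 0.5

# Between-laboratory spread: standard deviation of the laboratory averages.
between = statistics.stdev(statistics.mean(r) for r in labs.values())

print(f"within-lab s  = {within:.3f}")
print(f"between-lab s = {between:.3f}")  # much larger than the within-lab spread
```

With these invented numbers the between-laboratory spread is several times the within-laboratory spread, which is the pattern the text describes: adjacent repeats in one laboratory flatter the test procedure.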
Statistical Test for Trends

Tests are generally run in a time sequence. If several tests are made on the same material, the results should be examined for possible trends. Suppose the test results are x1, x2, . . ., xn. The usual measurement of dispersion, the standard deviation, is given by the formula s = √[Σ(x − m)²/(n − 1)], where m is the mean of the n results. The differences, d1 = (x1 − x2), d2 = (x2 − x3), . . ., between immediately successive measurements may also be used to calculate an estimate of the standard deviation, using the formula s′ = √[Σd²/2(n − 1)]. The two estimates, s and s′, should be in reasonable agreement if the results are truly independent.

The presence of trends has the effect of making the s′ estimate smaller than s, because adjacent readings tend to be more alike than separated readings. The disparity between the two estimates s and s′ may be tested by calculating the ratio of s′² to s². When this ratio falls below certain limits determined by the total number of measurements, successive measurements are not independent. The following limits at the 95% probability level are adapted from a table by C. A. Bennett [IND. ENG. CHEM. 43, 2063 (1951)].

n      Ratio
4      0.39
6      0.44
8      0.49
10     0.53

In the February column, the coded thicknesses of nickel plate on eight successive strips cut from a sheet were listed as 4, 6, 10, 11, 13, 15, 17, 19. These data give s² = 188.9/7 and s′² = 37/14. The ratio s′²/s² is about 0.10, far below the tabulated value. The evidence for a trend is convincing. Here the trend presumably is in the material and not in the test procedure.

If the eight results are rearranged in a random order—e.g., 10, 19, 15, 11, 13, 6, 4, 17—the estimate for s′² becomes 339/14. The ratio for this particular random order is 0.90 and reflects the removal of the trend from the series. This suggests that if the strips had been tested in a random order and the ratio test still indicated a trend (when the results are arranged in the order of testing), the drift must be in the testing procedure. Often tests are run in the same order as that used in taking the samples, or in the same order as the locations of the samples. Whenever the sampling and testing order coincide, it will not be possible to decide from the data whether the trend is in the equipment or in the material. If the tests are run in a random order, the trend test may be applied to the results when arranged, first, in the order of testing and, second, in the order determined by the location of the test specimen.
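The successive-difference ratio test described above is short enough to sketch in Python (a sketch, not from the original column; the function name trend_ratio is mine):

```python
def trend_ratio(x):
    """Return s'^2 / s^2 for a sequence of measurements x, in test order."""
    n = len(x)
    m = sum(x) / n
    s2 = sum((v - m) ** 2 for v in x) / (n - 1)       # usual variance estimate s^2
    d2 = sum((a - b) ** 2 for a, b in zip(x, x[1:]))  # squared successive differences
    sp2 = d2 / (2 * (n - 1))                          # s'^2 from the differences
    return sp2 / s2

strips = [4, 6, 10, 11, 13, 15, 17, 19]    # nickel-plate strips, in cutting order
print(round(trend_ratio(strips), 2))       # ratio near 0.10, below the 0.49 limit for n = 8

shuffled = [10, 19, 15, 11, 13, 6, 4, 17]  # the random order quoted in the text
print(round(trend_ratio(shuffled), 2))     # ratio near 0.90; the trend is removed
```

The two printed ratios reproduce the 0.10 and 0.90 figures worked out above for the nickel-plate data.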
Precision and Accuracy

The problem of rejecting an individual test result is a recurrent one. Within a laboratory, familiarity with the precision of the test method usually provides the operator with a fair yardstick for setting aside a particular result. However, one hesitates to reject an apparently out-of-line result when the reports come from a group of laboratories. Certainly a different standard of rejection is appropriate, but when all but one or two laboratories turn in results that are closely bunched, it does not seem proper to condemn the test procedure until the out-of-line laboratories check their results.

What do the terms precision and accuracy mean? Consider a single operator who runs some tests—some on one set of equipment and some on a second set. The operator notices that results with the second set of equipment tend to run higher than results with the first set. If all the tests on one set of equipment are run together and all tests with the second equipment run at a later time, the difference may be associated with time and not with equipment. Assuming that the tests have been run so that the observed difference in results can be confidently ascribed to something associated with one or the other piece of equipment, then one source of dispersion has been located.

What are the statistical labels that should be used in this connection? The results for one set of equipment may be clustered around the average with a very small dispersion. The "precision" may be good; similarly for the other set. The data show, however, a fairly constant difference between the sets of equipment. If one of these sets can be regarded as a standard
(perhaps by reason of careful calibration at some qualified agency), the experimenter assigns an appropriate correction to the results from the other set of equipment. In effect, a constant error is presumed to apply to the results with that set of equipment. This piece of equipment gave results that were not "accurate," even though the results were in good agreement among themselves.

If, instead of two sets, there were a great many sets of equipment, one of them should be taken as the standard. Each set of equipment may be expected to possess its own particular constant error and indeed be tagged with this correction, which many would associate with the word accuracy. Examine the collection of these tags. The collection of constant errors may themselves constitute a random distribution of magnitudes—some positive, some negative, some small, some large—with reference to the chosen standard. From this point of view, a collection from many laboratories of test results on the same material—one result from each piece of equipment—shows exactly the kind of pattern commonly associated with random errors and the concept of precision.

Precision and accuracy labels can be arbitrary, depending upon the point of view. The investigator needs techniques for collecting data that will reveal to him what is going on in the testing procedures. Suppose, for example, that in the industry there are a fairly large number of laboratories. If two different test materials are sent to each laboratory for testing, the results can be illuminating. Let the results be tabulated as shown below.
Listed in the last two columns of the table are the signs of the deviations for each result when compared with the average for all laboratories. The first plus sign under A means that a1 was larger than the average, ā. The minus sign under B means that b1 was smaller than b̄.

Consider the expected collection of signs appearing in the last two columns if the test procedure exhibits only random errors associated with precision. A given result has an equal chance of being above or below the over-all average; hence, plus and minus signs should appear equally often. A given laboratory has a one in four chance of having two plus signs, the same one in four chance of having two minus signs, and a one in two chance of one plus and one minus. Nothing can be deduced about a particular laboratory, but much can be deduced about the test procedure. Of the n laboratories, about one fourth should show two plus signs, about one fourth should show two minus signs, and about one half should show one plus and one minus sign, provided only random errors operate.
If the method is vulnerable to temporary constant errors, there should be many more cases where the two signs are alike than unlike. If this situation exists, much of the scatter of the results from the laboratories arises from these temporary biases. A complete statistical analysis would also make use of the actual magnitudes of these deviations from the over-all average.

Only in so far as the nature of the dispersion can be found can steps be taken to improve the reproducibility of a test procedure. A test method should carry as part of its description an estimate of the reproducibility which can be achieved by a substantial majority of testing laboratories—and not by just one or a selected few. Laboratories falling short of this standard may have problems other than the testing procedure.
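The sign-pattern argument can be checked by simulation (a sketch with invented parameters, not from the column: each laboratory tests materials A and B once, and bias_sd controls the size of the temporary constant error a laboratory carries into both of its results):

```python
import random

random.seed(1)

def sign_table(n_labs, bias_sd, noise_sd):
    """Return (+/-, +/-) sign pairs, one per laboratory, for the deviations of
    its two results from the averages over all laboratories."""
    biases = [random.gauss(0, bias_sd) for _ in range(n_labs)]  # per-lab constant error
    a = [b + random.gauss(0, noise_sd) for b in biases]         # material A results
    b = [bb + random.gauss(0, noise_sd) for bb in biases]       # material B results
    a_bar = sum(a) / n_labs
    b_bar = sum(b) / n_labs
    return [("+" if x > a_bar else "-", "+" if y > b_bar else "-")
            for x, y in zip(a, b)]

def fraction_alike(table):
    """Fraction of laboratories whose two signs agree (++ or --)."""
    return sum(1 for s, t in table if s == t) / len(table)

# With only random errors, like signs should occur in about half the laboratories ...
print(fraction_alike(sign_table(10_000, bias_sd=0.0, noise_sd=1.0)))
# ... but temporary laboratory biases push like-sign pairs well above one half.
print(fraction_alike(sign_table(10_000, bias_sd=2.0, noise_sd=1.0)))
```

The first fraction hovers near the 1/2 predicted in the text (1/4 two-plus, 1/4 two-minus), while the second rises sharply, which is exactly the symptom of temporary constant errors described above.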
Our authors like to hear from readers. If you have questions or comments, or both, send them via The Editor, I/EC, 1155 16th Street N.W., Washington 6, D.C. Letters will be forwarded and answered promptly.