11
How to Avoid Lying with Statistics
ALLAN E. AMES and GEZA SZONYI
Downloaded by GEORGETOWN UNIV on August 31, 2015 | http://pubs.acs.org Publication Date: June 1, 1977 | doi: 10.1021/bk-1977-0052.ch011
Polaroid Corp., 750 Main St., Cambridge, MA 02139
I. Introduction
Statistical analysis of randomly varying data has become commonplace in the age of calculators and computers. Such analyses are often carried out routinely using built-in calculator functions or standard computer subroutines. Parameters derived in such computations usually include the average (mean) and the standard deviation, which characterize the central tendency and the variability of individual data sets. To compare two data sets with each other, the difference between their averages and the ratio of their variances are customarily used. Little thought is usually given to the fact that the computation of these parameters presupposes that the data analyzed are essentially normally distributed and that this distribution is monomodal, i.e., shows essentially one major peak or central value only. If this is not the case, and standard (parametric) methods are applied to evaluate such data, the results obtained will present an incorrect picture. In other words, the evaluator is "lying with statistics", usually without being aware of doing so (1). To make matters worse, nearly all built-in programs for calculators, as well as the majority of the subroutines and programs for computers, are of the parametric type. Nowhere in the instruction manuals is it adequately stressed that the use of parametric methods presupposes a known, usually normal, distribution, which must be ascertained before these methods are applied.
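To make the point concrete, the following minimal Python sketch uses an invented bimodal sample; the numbers and the helper name `summarize` are assumptions made for illustration and are not data or code from this paper. It computes the mean and standard deviation as a built-in routine would, and also reports how far the nearest actual observation lies from that mean.

```python
from statistics import mean, stdev

# Hypothetical bimodal sample: one cluster near 10, another near 30.
# These values are invented for illustration only.
sample = [9.0, 10.0, 11.0, 29.0, 30.0, 31.0]


def summarize(data):
    """Return (mean, sample standard deviation, distance from the mean
    to the nearest observation)."""
    m = mean(data)
    s = stdev(data)  # sample standard deviation (n - 1 in the denominator)
    nearest = min(abs(x - m) for x in data)
    return m, s, nearest


m, s, nearest = summarize(sample)
print(f"mean = {m:.2f}, standard deviation = {s:.2f}")
print(f"closest observation is {nearest:.2f} away from the mean")
```

In this sketch the mean falls in the gap between the two clusters, so the computed "central value" describes no observation in the set, which is the kind of incorrect picture the text warns about.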
In Chemometrics: Theory and Application; Kowalski, B.; ACS Symposium Series; American Chemical Society: Washington, DC, 1977.
This paper will show that for many applications the use of distribution-free (nonparametric) methods may be preferable to, and easier to use than, the conventional parametric approach (1). (Throughout this paper, the term "parametric tests" pertains to the t-test and the F-test, used with the appropriate tables (2, 3, 4, 5, 6, 8, 9).) We will also provide a simple method for testing the normality of the distribution of data sets. Examples will be given to demonstrate the penalties which can be incurred when parametric methods are used for the analysis of nonparametric data.
II. Methodology
Normally distributed data are often represented graphically by the well-known, bell-shaped Gaussian curves (10, 11, 12, 13, 14). To obtain such a curve, the data in question are usually sorted in ascending order and then grouped into classes. The number of items in each class, i.e., the class frequency, is then plotted against the class midpoint in such standard frequency plots. Another method of representation for the same data is the cumulative frequency plot. Cumulative frequencies are obtained by adding ("cumulating") for each class all the class frequencies up to that point and dividing by the total number of data in the data set (10, 15), as shown in Figure 1.

This paper deals with two applications of the cumulative distribution. In one case, a known continuous distribution (the normal distribution) is compared to an unknown discrete distribution in order to determine its normality or deviation from it. In the other case, two unknown discrete cumulative distributions are compared to each other to determine their sameness or difference. For both cases the same test statistic is applicable (16, 17, 18):

$$T = \sup_{x} \left| F(x) - S(x) \right|$$

where F(x) is the cumulative distribution function of an unknown distribution; S(x) is the cumulative distribution function of either a known (e.g., normal) or an unknown distribution; and sup (supremum)
signifies the maximum and, together with the two vertical lines, indicates that one should take the maximum absolute difference enclosed by these lines. Thus, T is the maximum vertical distance between the two cumulative distribution graphs. This computed T, or a similarly derived value, is subsequently compared to the appropriate tabulated value at a selected confidence level. The two distributions are regarded as being different if T is greater than the corresponding tabulated value; otherwise, the opposite holds true.

A number of methods have been described in the literature to test whether a given data set can be considered essentially normally distributed. Table 1 summarizes the most important of these methods. As seen from this table, all methods except the Shapiro-Wilk method are based on the use of cumulative distribution functions. (The Chi Square test uses the normal cumulative distribution to compute theoretical frequencies (19).) There is no overall agreement, based on the literature surveyed, as to which method is "best" for all possible empirical distributions (20, 21, 22). Comparison of the various normality tests has shown that in some cases the same data are considered normal by some tests and not normal by others (22). However, the Chi Square test is generally regarded as inferior to all other tests (20, 21, 22).
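The statistic T defined above, the maximum vertical distance between two cumulative distribution graphs, can be sketched in a few lines of Python for both cases the text describes. This is a minimal sketch under stated assumptions, not the authors' program: the function names (`normal_cdf`, `ecdf`, `t_against_normal`, `t_two_samples`) are hypothetical, the sample's cumulative distribution is taken as a step function over the individual observations rather than over grouped classes, and the comparison with tabulated values is left out because the appropriate table depends on the test used.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ecdf(data, x):
    """Cumulative frequency: fraction of observations less than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)


def t_against_normal(data, mu, sigma):
    """Maximum vertical distance between the sample's cumulative step
    function and the cumulative curve of a normal distribution."""
    xs = sorted(data)
    n = len(xs)
    t = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # The step function rises from i/n to (i + 1)/n at x, so the
        # distance is checked just below and at the step.
        t = max(t, abs(f - i / n), abs((i + 1) / n - f))
    return t


def t_two_samples(a, b):
    """Maximum vertical distance between two sample cumulative step
    functions. Both only change at observed values, so it is enough
    to check those points."""
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


# Hypothetical usage with invented numbers:
print(t_two_samples([1, 2, 3], [4, 5, 6]))   # samples that do not overlap
print(t_against_normal([0.0], 0.0, 1.0))     # a single observation at the mean
```

The returned value would then be compared with a tabulated critical value as described above. If the mean and standard deviation passed to `t_against_normal` are estimated from the same sample, the standard Kolmogorov-Smirnov tables no longer apply and tables constructed for that situation are needed.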
The extent to which parametric methods can be used for borderline normality cases has not been described in the literature surveyed. This area needs substantial exploration and, at this stage, informed intuition is as good a guide as any. This paper deals with a simple test for normality, the Lilliefors Test (23, 24),