Processing Outliers in Statistical Data - American Chemical Society

Downloaded by CALIFORNIA INST OF TECHNOLOGY on September 22, 2017 | http://pubs.acs.org Publication Date: July 15, 1985 | doi: 10.1021/bk-1985-0284.ch004

4

Processing Outliers in Statistical Data

JOHANN A. MÜHLBAUER

Continental Gummiwerke AG, Postfach 169, D-3000 Hannover 21, West Germany

This paper presents a method for deciding how to handle seemingly inconsistent data (outliers). The univariate and multivariate methods recommended are strongly based on statistics and on the experience of the author in using them.

What does one do with seemingly inconsistent data? Almost everyone concerned with the analysis of experimental data has been confronted at one time or another with this problem. Figure 1a gives a graphical representation of this subject. There is a set of observations or objects of observation which seem to be "inconsistent" with the main body of the data. Such suspicious observations will be referred to as outliers throughout this paper. Certainly the results of an investigation can be influenced to a high degree by such outlying observations. How does one handle these observations?

Basic Philosophy

There are four main strategies concerning the processing of outliers. Figures 1b to 1e give a graphical interpretation of these strategies.

Rejection. The first strategy is to remove the suspicious datum from the data. Then, the analysis and the conclusions to be drawn are based only on the remaining values. This certainly is the way to deal with outliers which result from human errors, gross errors of measurement or something similar (Figure 1b).

Incorporation. Incorporation of the suspicious observation in the analysis is in some ways the opposite of our first strategy. This type of action will sometimes result in a totally different view of our initial problem (Figure 1c).

Concentration. Neglecting all the nonsuspicious values and concentrating the further analysis on the outlying data is a strategy which is sometimes very useful in process optimization, quality assurance or archeology (Figure 1d).

Accommodation. The philosophy of this strategy is to include the outlying observations in the analysis. Methods are then used to define the final actions which are only slightly influenced by the presence of outliers (Figure 1e). Such statistical methods are developed under the name of "robust statistics."

0097-6156/85/0284-0037$06.00/0 © 1985 American Chemical Society

Kurtz; Trace Residue Analysis ACS Symposium Series; American Chemical Society: Washington, DC, 1985.

Influencing Factors

The choice of the strategy to be used depends on the particular situation. The choice of the strategy might also depend on the ability to answer the question: Are the outliers really inconsistent with the remainder of the data? Unfortunately, not only the final action but also the method by which we will define whether or not an outlier is really inconsistent depends on the situation.

There are several different but interdependent factors which will significantly influence the whole process of handling outliers. One must consider the distinctions

- between deterministic and statistical (or rather unknown) causes of outliers,
- between univariate or multivariate data sets, i.e., the nature of the data,
- between different specific probability models like the normal or the exponential distribution,
- between different forms of statistical analysis in which the outliers have to be encountered, like ANOVA, random sampling and so on,
- between single or multiple outliers, and,
- most fundamentally, between the different aims and purposes that one may have in studying outliers.

The Decision Procedure

Figures 2 to 4 describe the recommended procedure for processing outliers. These flowcharts could be used also to create a computer program.

Figure 1. Basic concepts (panels 1a to 1e; graphic not reproduced).

Figure 2. Decision flowchart, part 1 (graphic not reproduced; the branch connections and the reference numbers inside the boxes are illegible in the source). Box texts:
- Automatic processing of standard data?
- Use robust statistical methods.
- Use "EXPLICIT" processing of outliers.
- Have a thorough look at the data. Check them for human errors or mismeasurements. Proceed with the cleared data.
- Univariate data?
- Consult a fellow mathematician or use references.
- Classical sampling problem?
- Data supposed to be normal?
- Use AMT estimator for location parameter. Use median of deviations from sample median for scale estimator.
- Use robust regression by Andrews' RHO.

Figure 3. Decision flowchart, part 2 (graphic not reproduced; branch connections and reference numbers illegible). Box texts:
- Fitting a regression line?
- Detecting inconsistent subsamples only?
- Any form of the General Linear Model to be used?
- Use the classical methods of multiple comparisons, e.g., analysis of variance.
- Are the classical assumptions for fitting regression lines met?
- Prepare the problem so that it may be solved by classic regression methods. If necessary contact a fellow mathematician.
- Use references or contact a fellow mathematician.
- In the context of the General Linear Model use the MAXIMUM ABSOLUTE STUDENTIZED RESIDUAL to detect inconsistency. Keep in mind that inconsistency is RELATIVE to the assumed form of the model.
- Use the MAXIMUM ABSOLUTE STUDENTIZED RESIDUAL to detect inconsistent values. Keep in mind that inconsistency is RELATIVE to the assumed form of the regression line.

Figure 4. Decision flowchart, part 3 (graphic not reproduced; branch connections and reference numbers illegible). Box texts:
- Data supposed to be exponential?
- Is there a transformation of the data into normal or exponential form which transforms outliers into outliers?
- Use Shapiro-Wilk exponential W-test consecutively to clear the data. Keep rejected data for supplementary studies.
- Use robust statistical methods.
- Test the transformed data.

The explanation of some of the terms used in these charts follows:

Automatic Processing of Standard Data. The main characteristics of this procedure are that the

- data is produced and processed routinely without any
analysis. The data is not part of any special set or special environment. Sometimes the data is collected by automated devices and sometimes by independent service organizations.
- results of the analysis are given only in a summarized manner, such as a mean value, a standard deviation, the slope of a regression line, etc.

Univariate Data - Multivariate Data. If one deals only with one variable under study, e.g., the concentration of a particular chemical in the water of a river, this is a univariate problem. It is univariate even when the variable under study depends on several other variables such as temperature and location of sampling. On the contrary, if more than one variable is under study simultaneously, this would be called a multivariate problem. An example of a multivariate problem is determining water quality using several analyzed variables.

Classical Sampling Problem. If one is only interested in estimating the location and scatter parameters of a population, this is a classical sampling problem.

Classical Assumptions for Fitting Regression Lines. The dependent variable y might be expressed in the following way:

$$y = f(x_1, \ldots, x_n; b_1, \ldots, b_n) + e$$

In this formula, f is a function of the independent variables $x_1$ to $x_n$ and the unknown parameters $b_1$ to $b_n$ which is linear in the parameters. The function

$$f(x_1, x_2; b_1, b_2) = b_1 x_1 + b_2 x_2$$

is the classical example, but the function

$$f(x_1, x_2; b_1, b_2) = b_1 \sin(x_1) + b_2 \cos(x_2)$$

is also possible. The error, e, is supposed to be normally distributed with mean 0 and standard deviation sigma. This means that the various measurements for y are (stochastically) independent and the associated e's come from an identical population (they have homoscedasticity or equal variance over the full range).

Detecting Inconsistent Subsamples Only. A positive response to this choice results from an analytical problem involving an interlaboratory comparison. The main interest is to find those laboratories which produce inconsistent results. The results of each laboratory form a subsample. We can find inconsistent subsamples and, therefore, inconsistent laboratories.

Is There a Transformation of the Data into Normal or Exponential Form? Many data sets are distributed according to probability laws that are not the common normal distribution law. Transformations are possible to convert such data sets to a normal or a nearly normal distribution. It is evident that transforming the data is only appropriate when the original problem, for example, deciding whether two populations are different or not, is not affected by the transformation. Several cases are possible. The following transformation,

$$y = (t + 3/8)^{0.5}$$

where t = number of occurrences, will transform Poisson data to normal. This next transformation,

$$y = \arcsin\left\{\left[(t + 3/8)/(n + 3/4)\right]^{0.5}\right\}$$

where n = number of runs, will transform binomial data to nearly normal. Finally,

$$y = \operatorname{arcsinh}\left\{\left[(t + 3/8)/(n - 3/4)\right]^{0.5}\right\}$$

will transform negative binomial data to nearly normal.

Calculation and Processing Procedures; the Processing Flow Chart

There are various methods to process the data which are mentioned in the flowchart. All of them are covered only by citations. You will find the basic references in Table I.
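The three transformations given earlier for Poisson, binomial and negative binomial counts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the negative binomial form is written with a quotient, (t + 3/8)/(n - 3/4), as in the usual published form of this transformation. The text does not define n for the negative binomial case, so it is left as a plain argument here.

```python
import math

def poisson_transform(t):
    """Square-root transform of a Poisson count t: y = (t + 3/8)^0.5."""
    return math.sqrt(t + 3.0 / 8.0)

def binomial_transform(t, n):
    """Arcsine transform of t occurrences in n runs."""
    return math.asin(math.sqrt((t + 3.0 / 8.0) / (n + 3.0 / 4.0)))

def negative_binomial_transform(t, n):
    """Inverse hyperbolic sine transform of a negative binomial count t.

    Assumption: the bracket is the quotient (t + 3/8) / (n - 3/4),
    which requires n > 3/4.
    """
    if n <= 0.75:
        raise ValueError("n must be larger than 3/4")
    return math.asinh(math.sqrt((t + 3.0 / 8.0) / (n - 3.0 / 4.0)))

# Hypothetical counts, transformed before any outlier test is applied.
counts = [0, 1, 4, 9, 16]
transformed_counts = [poisson_transform(t) for t in counts]
```

Any outlier test for normal data would then be applied to the transformed values rather than to the raw counts.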

Table I. Mathematical Methods Included in the Flow Chart

Method                                          Reference Number
AMT-estimator                                   1
Shapiro-Wilk W-test for normal data             2A
Shapiro-Wilk W-test for exponential data        2B
Maximum studentized residual                    2C
Median of deviations from sample median         3
Andrews' rho for robust regression              4
Classical methods of multiple comparisons       5
Multivariate methods                            6-9
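Table I lists the median of deviations from the sample median as a scale estimator. A minimal sketch follows, assuming that absolute deviations are meant. The sample median is used here as a simple stand-in for the AMT location estimator, which is not reproduced. The function name and the sample values are illustrative only.

```python
import statistics

def median_absolute_deviation(values):
    """Median of the absolute deviations from the sample median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

# Hypothetical sample with one suspicious high value.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 14.5]

location = statistics.median(sample)          # robust location (stand-in for AMT)
scale = median_absolute_deviation(sample)     # robust scale
classical_scale = statistics.stdev(sample)    # inflated by the suspicious value
```

The comparison of `scale` with `classical_scale` shows why a robust scale estimate is recommended: a single outlying value changes the standard deviation much more than the median of the deviations.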


Example. For an example of the use of this decision procedure, I will use DATASET D (see Appendix I). The data set is to be used to prepare a calibration graph in chromatographic analysis. It contains a number of excessively high values in the lower levels due to the presence of an overlapping contaminant.


We start at the top of the Decision Flow Chart, part 1, shown in Figure 2.

Decision diamond: Automatic processing of standard data? Since the answer is "NO", the left branch is followed. Instructions are met to have a thorough look at the data. There are several numbers which seem to be inconsistent. However, with no additional data available to this author, I will proceed.

Decision diamond: Univariate data? "YES"

Decision diamond: Classical sampling problem? As the answer is "NO", I restart at the top of Figure 3.

Decision diamond: Fitting a regression line? "YES"

Decision diamond: Are the classical assumptions for fitting regression lines met? "NO" Clearly the measurements at the different x-levels differ in their variability. This can be shown by using the F-test. Another method is outlined in another chapter of this text (10). In this case weighted least squares will resolve the problem of heteroscedasticity or unequal variance across the graph. I have chosen weights of 1, 1, 0.1, 0.01 and 0.01 for the resolution of this problem.
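The weighted least squares step can be illustrated with a short sketch. It assumes one weight per x-level, applied to every replicate at that level, and a straight-line model y = a + bx. The data below are hypothetical and are not the values of DATASET D; only the five weights are taken from the text. The function name is illustrative.

```python
def weighted_line_fit(x, y, w):
    """Return (a, b) minimizing sum(w_i * (y_i - a - b * x_i)**2)."""
    total_weight = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_weight
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_weight
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical calibration data: two replicates at each of five x-levels,
# with scatter that grows with x.
levels = [0.5, 1.0, 5.0, 10.0, 20.0]
level_weights = [1, 1, 0.1, 0.01, 0.01]   # weights quoted in the text
x = [level for level in levels for _ in range(2)]
w = [weight for weight in level_weights for _ in range(2)]
y = [13.0, 14.0, 27.0, 26.0, 128.0, 138.0, 250.0, 280.0, 500.0, 560.0]

a_weighted, b_weighted = weighted_line_fit(x, y, w)
a_plain, b_plain = weighted_line_fit(x, y, [1.0] * len(x))  # unweighted, for comparison
```

Small weights at the high levels keep the noisier measurements there from dominating the fitted line, which is the purpose of the weighting described above.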

Table II. Fitting DATASET D Data to the First Order Regression Model, y = a + bx

Quantity                       Calculated Values       Critical Values
                               t        Max ASR        t        Max ASR
Equation Coefficients (1)
  a = 0.13                     0.28                    2.10
  b = 26.42                    8.66                    2.10
Max ASR (2)                             2.29                    2.78

(1) Correlation coefficient for the regression fitting is 0.90.
(2) Max ASR occurs at x = 0.5, y = 44.1.


Decision command: Use the maximum absolute studentized residual method to detect inconsistent values.
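A minimal sketch of the maximum absolute studentized residual follows. It assumes an unweighted straight-line fit and internally studentized residuals, i.e. each residual divided by s times the square root of (1 - leverage); the chapter does not spell out these details, so both are assumptions. The critical value is not computed here and has to be taken from tables such as those in reference (2). The function name and data are hypothetical.

```python
import math

def max_abs_studentized_residual(x, y):
    """Return (Max ASR, index of that point) for the line y = a + b*x.

    Assumes at least three points and a fit that is not exact
    (a residual variance of zero would cause a division by zero).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in residuals) / (n - 2)            # residual variance
    leverages = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    studentized = [r / math.sqrt(s2 * (1.0 - h))
                   for r, h in zip(residuals, leverages)]
    index = max(range(n), key=lambda i: abs(studentized[i]))
    return abs(studentized[index]), index

# Hypothetical data: a straight line with small scatter and one high value.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi + 0.1 * (-1) ** i for i, xi in enumerate(x)]
y[4] += 10.0

max_asr, suspect_index = max_abs_studentized_residual(x, y)
```

The point at `suspect_index` would be called inconsistent only if `max_asr` exceeds the tabulated critical value for the sample size and model in use.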


When this method is used, Table II shows the results when the regression model is the normal first order linear model. Since the maximum absolute studentized residual (Max ASR) found, 2.29, was less than the critical value relative to this model, 2.78, the conclusion is that there are no inconsistent values.

It is evident that the calculated t-value for the constant value, a, is less than the critical t-value. From the statistical viewpoint this value, then, is negligible. The data can then be recalculated according to the first order model without a constant value. Table III shows the result of this recalculation. There are no changes relating to the conclusions made concerning the outlier determination.

Table III. Fitting DATASET D Data to the First Order Regression Model, y = bx

Quantity                       Calculated Values       Critical Values
                               t        Max ASR        t        Max ASR
Equation Coefficients (1)
  b = 27.14                    17.2                    2.09
Max ASR (2)                             2.36                    2.78

(1) Correlation coefficient for the regression fitting is 0.97.
(2) Max ASR occurs at x = 0.5, y = 44.1.

Three critical points can be made in this analysis. The first one is located at the "thorough look" instruction. This examination in reality involves a critical analysis of the experimental protocol and the data produced from it. For example, it was quite evident in collecting the standards data from DATASET D that values were well out of line with previous determinations. See other DATASETS, especially DATASET E in the Appendix, for confirmation of this idea. The second critical point is at the "Preparation of the problem" instruction. In this case heteroscedasticity must be removed before submitting the data to regression analysis. Weighted least squares of several types (11) and power transformations (10) can be used. The third critical point is at the same instruction and is the choice of the regression model used for the calibration graph. First order, higher order, and spline (12) methods can all be used for this model. All these choices will significantly influence the decision concerning the reality of inconsistent values.
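The test of the constant term described above, followed by a refit without it, can be sketched as follows. This is an unweighted illustration on hypothetical data, not a reproduction of the weighted DATASET D calculation. The critical value 2.776 is the two-sided 5% point of Student's t for 4 degrees of freedom from standard tables, chosen to match the six hypothetical points. The function names are illustrative.

```python
import math

def intercept_t_value(x, y):
    """Return (a, b, t_a) for the least-squares line y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)                                   # residual variance
    se_a = math.sqrt(s2 * (1.0 / n + x_bar ** 2 / sxx))  # standard error of a
    return a, b, a / se_a

def slope_through_origin(x, y):
    """Least-squares slope of the model y = b*x (no constant term)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Hypothetical calibration points.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.1, 5.9, 9.2, 11.8, 15.1, 17.9]

a, b, t_a = intercept_t_value(x, y)
T_CRITICAL = 2.776  # Student's t, two-sided 5%, n - 2 = 4 degrees of freedom
drop_constant = abs(t_a) < T_CRITICAL
slope = slope_through_origin(x, y) if drop_constant else b
```

When the constant is dropped, the studentized residuals should be recomputed for the new model, since inconsistency is judged relative to the assumed form of the regression line.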


Literature Cited

1. Andrews, D. F.; Bickel, P. J.; Hampel, F. R.; Huber, P. J.; Rogers, W. J.; Tukey, J. W. "Robust Estimates of Location: Survey and Advances"; Princeton Univ. Press: Princeton, NJ, 1972; pp. 5, 15, 39-44, others.
2. Barnett, V.; Lewis, T. "Outliers in Statistical Data"; Wiley: New York, 1978; pp. (A) 89-103, (B) 76-88, (C) 234-265.
3. Huber, P. J. "Robust Statistics"; Wiley: New York, 1981; pp. 107-109.
4. Lawson, J. S. J. Quality Technology 1982, 14, 19-33.
5. Miller, R. G. "Simultaneous Statistical Inference"; McGraw-Hill: New York, 1966; p. 98.
6. Beckman, R. J.; Cook, R. D. Technometrics 1983, 25, 119-163.
7. Campbell, N. A. Applied Statistics 1980, 29, 231-237.
8. Maronna, R. A. The Annals of Statistics 1976, 4, 51-67.
9. Schwager, S. J.; Margolin, B. H. The Annals of Statistics 1982, 10, 943-954.
10. Kurtz, D. A.; Rosenberger, J. R.; Tamayo, G., Chapter 9 in this book.
11. Mitchell, D. G., Chapter 8 in this book.
12. Wegscheider, W., Chapter 10 in this book.

RECEIVED March 25, 1985
