Environ. Sci. Technol. 1985, 19, 1201-1206

Estimation for Small Normal Data Sets with Detection Limits

Alan Gleit*

Versar, Inc., Springfield, Virginia 22151

*Address correspondence to this author at Dowell Schlumberger, Inc., Tulsa, OK 74101.

Environmental data are characterized by the dual problems of small sample size and many values reported as “below detection limit”. We propose several possible estimators and test them via simulation. We find a method that outperforms the usual arbitrary techniques (such as using the detection limit for the missing data).

Introduction

When environmental phenomena are measured, the measuring devices/procedures used are unable to detect low concentrations. Thus, concentrations below certain threshold levels are not measurable. Standard "detection limits" are set by various agencies for various phenomena for various types of measuring devices. Measured values below these limits are reported as "below detection limit" or as "trace" and are thus not available for statistical analysis. (Sometimes values below these limits are available, but their accuracy is greatly in doubt.) Consequently, the statistician often has a very basic problem facing him: how does he analyze data sets that contain a reasonable percentage of "below detection limit" entries?

Environmental data are characterized not only by detection limits but also by small sample size. Required measurements for compliance purposes often are performed annually, quarterly, or, at most, monthly due to the expense or disruption caused by the testing. Studies of pilot plants or demonstration plants are often of such short duration that 5-10 samples are all that are obtained. Thus, methods for estimating the parameters of environmental data using asymptotic or large sample size procedures are usually inapplicable. In summary, environmental data usually have the following characteristics which make them difficult to analyze: (1) the data are cut off from below by detection limits; (2) the sample size is very small.

As an example, suppose we have taken eight samples of air near a chemical warehouse in order to see if there are leaks (fugitive emissions). Concentrations below 0.8 ppb, say, are below the reliability of the measurement procedure. Of the eight samples, suppose five are below the detection limit while the other three are measured to have concentrations of 1, 2, and 5 ppb. How do we find the average concentration? Surely the smallest value that the true average could be is

(0 + 0 + 0 + 0 + 0 + 1 + 2 + 5)/8 = 1 ppb

while the largest is

(0.8 + 0.8 + 0.8 + 0.8 + 0.8 + 1 + 2 + 5)/8 = 1.5 ppb

Hence, the true average lies somewhere between 1 and 1.5 ppb, but where? If this were a regulated pollutant with a standard set at 1.4 ppb, is the facility in violation? The U.S. Environmental Protection Agency often requires that the missing data be replaced by the detection limit; doing that, the facility appears to be in violation. Does that seem reasonable?
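To make the two bounds explicit, here is a minimal sketch (in Python; illustrative only, not part of the original paper) that computes them by substituting 0 and then the detection limit for the censored samples:

    # Bounds on the sample mean of the warehouse example: censored values
    # replaced by 0 (lower bound) or by the detection limit (upper bound).
    detection_limit = 0.8               # ppb
    measured = [1.0, 2.0, 5.0]          # the three quantified samples
    n_censored = 5                      # samples reported "below detection limit"
    n = len(measured) + n_censored

    low = sum(measured) / n
    high = (sum(measured) + n_censored * detection_limit) / n
    print(low, high)                    # 1.0 1.5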


Below we find a method (expected values) that outperforms this somewhat arbitrary technique. Expected values give, in general, a more accurate estimate of the mean for each case in our simulation.

We study below the problem of below detection limit data coupled with small sample size via computer simulations. We suppose that we are given N data points, p of which are below the detection limit L and N - p of which have reported values larger than L. We suppose that the distribution for the underlying stochastic process is known to be either normal or log normal. We wish to estimate the mean and variance.

The problem we are addressing is one of censoring. In general, censoring means that observations at one or both extremes are not available. Our problem is one of left censoring; life testing usually involves right censoring (large values not available). Two types of life censoring have received much attention. Type I occurs when the test is terminated at a specified time; type II occurs when the test is terminated at a particular failure. In type I censoring the number of failures as well as the failure times are random variables. This, of course, makes type I censoring far more complicated. Consequently, type II methods have often been applied to type I data with the hope that the bias is not appreciable. Our problem is analogous to type I censoring since the number p of measurements below L is a random variable. As the normal distribution is symmetric, for that case we are considering type I censored, small sample size data.

The problem of estimating the parameters of censored normal data has been extensively studied. Maximum likelihood has been studied in ref 1-3. These procedures are inefficient and highly biased for small data sets and require numerical interpolation in extensive tables. Linear estimators have been studied in ref 2 and 4-6. These methods are based on type II censoring and so are badly biased for small type I data sets. A method using moments was suggested in ref 7 but was highly inefficient. Other methods involve replacing the missing data by the expected values of order statistics or by constants. We consider three possible constants: zero, half the detection limit, and (the most conservative estimator for the mean) the detection limit. Though this latter value is badly biased, it has often been suggested by the U.S. Environmental Protection Agency as its accepted procedure merely because of its conservatism.

Below we describe the results of a Monte Carlo simulation to investigate various techniques to estimate the mean and standard deviation for type I normally censored data. If our sample were log normal, we could apply the techniques to the logarithms of the data and then transform back. Thus, no separate analysis of log-normal data is required.
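For the log-normal remark above, the transform-and-back step might look as follows. This sketch is not from the paper: estimate_censored_normal stands for any of the censored-normal estimators discussed below, and the back-transform uses the standard log-normal moment formulas, a detail the paper does not spell out:

    import math

    def estimate_lognormal_mean_var(values_above, n_censored, limit,
                                    estimate_censored_normal):
        # Work on logarithms, where the data are normal with parameters (mu, sigma).
        logs = [math.log(v) for v in values_above]
        mu, sigma = estimate_censored_normal(logs, n_censored, math.log(limit))
        # Standard moments of a log-normal distribution with log-scale (mu, sigma).
        mean = math.exp(mu + sigma ** 2 / 2.0)
        var = (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)
        return mean, var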

Maximum Likelihood Estimator

We assume 0 < p < N values of our N data points lie below the known limit L. We estimate the mean and standard deviation by those values that maximize the likelihood that the given data set would actually have been obtained. The procedure requires tables, found in ref 2, and at least two data points above the limit L, i.e., p ≤ N - 2.
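The tabulated procedure of ref 2 is not reproduced here. As a rough stand-in (an assumption of this sketch, not the paper's method), the same left-censored normal likelihood can be maximized numerically: each of the p censored points contributes a factor Φ((L - μ)/σ), and each observed point contributes a normal density:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def censored_mle(observed, n_censored, limit):
        """Numerically maximize the left-censored normal likelihood.
        observed: values above the detection limit; n_censored: count below it.
        Requires at least two observed values (p <= N - 2)."""
        x = np.asarray(observed, dtype=float)

        def neg_log_lik(theta):
            mu, log_sigma = theta
            sigma = np.exp(log_sigma)           # keeps sigma positive
            ll = n_censored * norm.logcdf((limit - mu) / sigma)
            ll += norm.logpdf(x, loc=mu, scale=sigma).sum()
            return -ll

        start = np.array([x.mean(), np.log(x.std() + 1e-6)])
        res = minimize(neg_log_lik, start, method="Nelder-Mead")
        return res.x[0], np.exp(res.x[1])       # (mu_hat, sigma_hat)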


Truncated Maximum Likelihood

Our next technique is easy to conceptualize: forget that data below L have been obtained, assume that the distribution of the remaining points is governed by the normal distribution truncated at L, and then use maximum likelihood estimators. This procedure again requires tables and at least two data points.

Fill-In with Constants

Various constants have been suggested as proxies for the data below L. The U.S. Environmental Protection Agency and various state health departments have a mandate to protect the environment and the human population from harmful pollutants. Thus, EPA often suggests that all censored values be replaced by the censoring value L to obtain the most conservative estimator for the mean pollutant levels. Clearly the most liberal policy is the one that substitutes zero for the censored data: if I cannot measure it, it is not there. Those suggesting some balance might use L/2. Let us suppose that we use C as the proxy for the missing data. Then we "know" all N data points: the N - p real values and p values of C. We then would be able to use the standard formulas for the mean and standard deviation.

Fill-In with Expected Values

The most appealing proxies for the missing data are their expected values, i.e., the expectations of the first p order statistics conditional on being less than the limit L. However, the expected values depend on the unknown mean and variance. So this procedure must be done iteratively on a computer. The algorithm can be easily described: (1) Select any convenient initial guesses for the mean and variance. (2) On the basis of our tentative current values for the mean and variance, calculate the expected values for the first p order statistics conditional on being less than L. (3) Now that the first p data points are "known", compute the mean and variance using the standard formulas (see Appendix). These are the updated estimates. (4) If the current and updated versions are close, STOP. Otherwise, go back to step 2 using the updated versions as the current values. In practice, this algorithm converged rapidly when "close" was defined by a small numerical tolerance.
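A compact sketch of the iteration follows, with one simplification: every censored point is replaced by the single conditional mean E[X | X < L] rather than by the p individual order-statistic expectations the paper calls for (those require the Appendix formulas), so this version is only an approximation of the published algorithm:

    import numpy as np
    from scipy.stats import norm

    def fill_in_expected(observed, n_censored, limit, tol=1e-6, max_iter=200):
        """Iterative fill-in with expected values (simplified sketch).
        Assumes at least two observed values above the limit."""
        x = np.asarray(observed, dtype=float)
        mu, sigma = x.mean(), max(x.std(ddof=1), 1e-3)   # step 1: initial guesses

        for _ in range(max_iter):
            z = (limit - mu) / sigma
            # step 2: conditional mean of a normal variate given X < limit
            # (inverse Mills ratio: E[X | X < L] = mu - sigma*phi(z)/Phi(z))
            e_cens = mu - sigma * norm.pdf(z) / norm.cdf(z)
            filled = np.concatenate([np.full(n_censored, e_cens), x])
            # step 3: standard formulas on the completed sample
            new_mu, new_sigma = filled.mean(), filled.std(ddof=1)
            # step 4: stop when the update is small
            if abs(new_mu - mu) < tol and abs(new_sigma - sigma) < tol:
                break
            mu, sigma = new_mu, new_sigma
        return mu, sigma

Spreading the censored mass over the p smallest order statistics, as the paper does, mainly affects the standard deviation estimate; the mean update behaves similarly.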

Linear Methods

We first order the N - p data points that lie above the censoring value L. We then consider all linear estimators G = Σ G_j X_(j) and H = Σ H_j X_(j) with expected values μ and σ, respectively. From this collection, we may find those with minimal variance. Tables for the best coefficients can then be prepared (8).

Simulation

To evaluate the performance of our estimators, we performed a simulation. When the data are rescaled and translated, all the formulas depend on choosing two of the three parameters μ, σ, and L. We normalized the simulated data to L = 1. Discussions with environmental engineers led us to believe that mean levels at the detection limit and one or two standard deviations above and below L were the cases to consider. Consequently, we selected the following seven combinations for μ and σ:

μ = 0.67; σ = 0.2, 0.3
μ = 1.00; σ = 0.1, 0.2, 0.3
μ = 1.33; σ = 0.2, 0.3

We selected N = 5, 10, and 15 as representative small data set sizes. Using a standard pseudo-random-number generator and the Box-Muller transformation, we generated one million standard normal random variates. Using these, we generated 50,000 data sets for each of the 21 combinations of N, μ, and σ. These data sets were then artificially censored at the cutoff L = 1 and passed to the several estimators to guess values for μ and σ. The data sets were then grouped by the value of p, the number of censored values, p = 0, 1, ..., N - 1. We then computed the average bias and variance for each technique for each value of p which represented at least 5% of the total sample of 50,000 data sets (for N = 5 or 10) or 2.5% of the total sample (for N = 15).

The results are very consistent throughout for each value of p. The linear methods and truncated MLE were terribly biased and had huge variances. The MLE usually had large bias and large variances. The fill-in with constants did well only when the mean and the proxy constant lined up: when μ = 1.33, fill-in with 1.0 was acceptable; when μ = 0.67, fill-in with 0.5 was acceptable. Fill-in with expected values did a good job throughout the simulation.

Tables I-XXVII give the mean square error (square of bias plus variance of the estimator) for five techniques: fill-in with expected values, MLE, and fill-in with constants (0.0, 0.5, and 1.0). Each table reports the mean square error for fixed values of p and N for all cases of (μ, σ) which represent at least 5% of the 50,000 data sets for N = 5 or 10 and 2.5% of the 50,000 data sets for N = 15. We also computed the average mean square error for each value of p and N. This represents in some fashion the error that would result from the given techniques if each value of μ and σ meeting our 5% or 2.5% criterion were equally likely. Of course, if the user had prior information about the mean and/or standard deviation, he would use the appropriate column(s).
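A minimal reconstruction of the data-generation loop (the modern RNG, the seed, and the estimator wiring are assumptions of this sketch; the paper used a standard pseudo-random-number generator with the Box-Muller transformation):

    import numpy as np

    rng = np.random.default_rng(0)      # stand-in for the paper's generator
    L = 1.0
    combos = [(0.67, 0.2), (0.67, 0.3),
              (1.00, 0.1), (1.00, 0.2), (1.00, 0.3),
              (1.33, 0.2), (1.33, 0.3)]

    for N in (5, 10, 15):
        for mu, sigma in combos:
            for _ in range(50_000):     # reduce the count to run quickly
                data = mu + sigma * rng.standard_normal(N)
                observed = np.sort(data[data >= L])
                p = N - observed.size   # number of artificially censored points
                # ... pass (observed, p, L) to each estimator, tally bias/variance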

Table I. N = 5, p = 1, L = 1, Mean Square Errors × 10^4

                         μ = 1.0               μ = 1.33
                  σ=0.1  σ=0.2  σ=0.3     σ=0.2  σ=0.3      av
expected value       30    129    295       108    106     134
MLE                 120    479   1078       255    574     501
fill-in with 0      192     77     53       683    530     307
fill-in with 0.5     19     31    135       280    213     136
fill-in with 1       46    185    416        78     96     164

Table II. N = 5, p = 2, L = 1, Mean Square Errors × 10^4

                         μ = 1.0           μ = 1.33
                  σ=0.1  σ=0.2  σ=0.3      σ=0.3      av
expected value        5     19     40        488     138
MLE                 210    840   1889       1644    1146
fill-in with 0     1246    946    700       2470    1341
fill-in with 0.5    237    127     71        910     336
fill-in with 1       27    108    243        151     132

Table III. N = 5, p = 3, L = 1, Mean Square Errors × 10^4

                  μ = 0.67          μ = 1.0
                     σ=0.3   σ=0.1  σ=0.2  σ=0.3      av
expected value         362       8     68    204     161
MLE                   4969     405   1620   3645    2660
fill-in with 0         438    3231   2885   2570    2281
fill-in with 0.5       102     722    570    444     460
fill-in with 1        1566      13     52    118     437


Table IV. N = 5, p = 4, L = 1, Mean Square Errors × 10^4

                      μ = 0.67            μ = 1.0
                  σ=0.2  σ=0.3    σ=0.1  σ=0.2  σ=0.3      av
expected value       52      5       34    147    344     116
fill-in with 0     2055   1908     6149   5907   5672    4338
fill-in with 0.5     30     20     1477   1361   1254     828
fill-in with 1     1206   1331        4     16     36     519

Table V. N = 10, p = 1, L = 1, Mean Square Errors × 10^4

                      μ = 1.33
                  σ=0.2  σ=0.3      av
expected value       36     55      46
MLE                  83    238     161
fill-in with 0      165    100     133
fill-in with 0.5     73     57      65
fill-in with 1       32     64      48

Table VI. N = 10, p = 2, L = 1, Mean Square Errors × 10^4

                      μ = 1.33
                  σ=0.2  σ=0.3      av
expected value       84     61      73
MLE                 193    468     331
fill-in with 0      662    480     571
fill-in with 0.5    258    164     211
fill-in with 1       58     48      51

Table VII. N = 10, p = 3, L = 1, Mean Square Errors × 10^4

                         μ = 1.0           μ = 1.33
                  σ=0.1  σ=0.2  σ=0.3      σ=0.3      av
expected value       13     57    133        162      91
MLE                 148    592   1333        838     728
fill-in with 0      601    369    202       1257     607
fill-in with 0.5     92     26     25        436     145
fill-in with 1       33    132    298         65     132

Table VIII. N = 10, p = 4, L = 1, Mean Square Errors × 10^4

                         μ = 1.0
                  σ=0.1  σ=0.2  σ=0.3      av
expected value        4     18     39      20
MLE                 201    804   1808     938
fill-in with 0     1242    934    676     951
fill-in with 0.5    234    117     51     134
fill-in with 1       25    101    226     117

Table IX. N = 10, p = 5, L = 1, Mean Square Errors × 10^4

                         μ = 1.0
                  σ=0.1  σ=0.2  σ=0.3      av
expected value        1      3      5       3
MLE                 272   1086   2444    1267
fill-in with 0     2119   1774   1464    1786
fill-in with 0.5    443    297    187     309
fill-in with 1       18     71    160      83

Table X. N = 10, p = 6, L = 1, Mean Square Errors × 10^4

                         μ = 1.0
                  σ=0.1  σ=0.2  σ=0.3      av
expected value        4     26     76      35
MLE                 374   1497   3368    1746
fill-in with 0     3230   2883   2559    2891
fill-in with 0.5    721    565    432     573
fill-in with 1       12     46    104      54

Table XI. N = 10, p = 7, L = 1, Mean Square Errors × 10^4

                  μ = 0.67          μ = 1.0
                     σ=0.3   σ=0.1  σ=0.2  σ=0.3      av
expected value         136      20    149    442     187
MLE                   6018     551   2203   4957    3432
fill-in with 0        1036    4572   4258   3957    3456
fill-in with 0.5        13    1064    917    784     695
fill-in with 1        1441       7     27     62     384

Table XII. N = 10, p = 8, L = 1, Mean Square Errors × 10^4

                      μ = 0.67
                  σ=0.2  σ=0.3      av
expected value       71     72      72
MLE                4043   7943    5993
fill-in with 0     2025   1906    1966
fill-in with 0.5     26     17      22
fill-in with 1     1227   1326    1277

Table XIII. N = 10, p = 9, L = 1, Mean Square Errors × 10^4

                      μ = 0.67
                  σ=0.2  σ=0.3      av
expected value        8     26      17
fill-in with 0     3117   3041    3079
fill-in with 0.5    118    105     112
fill-in with 1     1169   1218    1194

Table XIV. N = 15, p = 1, L = 1, Mean Square Errors × 10^4

                      μ = 1.33
                  σ=0.2  σ=0.3      av
expected value       21     50      36
MLE                  49    164     107
fill-in with 0       71     41      56
fill-in with 0.5     35     39      37
fill-in with 1       20     57      39

Table XV. N = 15, p = 2, L = 1, Mean Square Errors × 10^4

                      μ = 1.33
                  σ=0.2  σ=0.3      av
expected value       37     33      35
MLE                  95    273     184
fill-in with 0      282    166     224
fill-in with 0.5    110     57      84
fill-in with 1       27     38      33

Table XVI. N = 15, p = 3, L = 1, Mean Square Errors × 10^4

                      μ = 1.33
                  σ=0.2  σ=0.3      av
expected value       75     47      61
MLE                 174    430     302
fill-in with 0      653    466     560
fill-in with 0.5    249    149     199
fill-in with 1       45     32      39

Table XVII. N = 15, p = 4, L = 1, Mean Square Errors × 10^4

                         μ = 1.0           μ = 1.33
                  σ=0.1  σ=0.2  σ=0.3      σ=0.3      av
expected value       17     74    173        103      92
MLE                 135    539   1213        643     633
fill-in with 0      436    233    101        947     429
fill-in with 0.5     58     10     34        317     105
fill-in with 1       36    143    322         42     136

Table XVIII. N = 15, p = 5, L = 1, Mean Square Errors × 10^4

                         μ = 1.0           μ = 1.33
                  σ=0.1  σ=0.2  σ=0.3      σ=0.3      av
expected value        8     39     93        217      89
MLE                 161    645   1452        927     796
fill-in with 0      787    524    319       1611     810
fill-in with 0.5    131     43     15        562     188
fill-in with 1       30    119    267         68     121

Table XIX. N = 15, p = 6, L = 1, Mean Square Errors × 10^4

                         μ = 1.0
                  σ=0.1  σ=0.2  σ=0.3      av
expected value        4     16     37      19
MLE                 198    791   1780     923
fill-in with 0     1241    931    669     947
fill-in with 0.5     29    114     44      62
fill-in with 1       25     98    220     114

Table XX. N = 15, p = 7, L = 1, Mean Square Errors × 10^4

                         μ = 1.0
                  σ=0.1  σ=0.2  σ=0.3      av
expected value        1      4      7       4
MLE                 241    962   2166    1123
fill-in with 0     1800   1460   1160    1473
fill-in with 0.5    365    224    123     237
fill-in with 1       20     78    175      91

Table XXI. N = 15, p = 8, L = 1, Mean Square Errors × 10^4

                         μ = 1.0
                  σ=0.1  σ=0.2  σ=0.3      av
expected value        1      4     10       5
MLE                 293   1172   2637    1367
fill-in with 0     2463   2112   1791    2122
fill-in with 0.5    528    375    251     385
fill-in with 1       15     59    134      69

Table XXII. N = 15, p = 9, L = 1, Mean Square Errors × 10^4

                         μ = 1.0
                  σ=0.1  σ=0.2  σ=0.3      av
expected value        3     20     57      27
MLE                 368   1475   3318    1720
fill-in with 0     3228   2879   2550    2886
fill-in with 0.5    720    561    425     569
fill-in with 1       11     45    101      52

Table XXIII. N = 15, p = 10, L = 1, Mean Square Errors × 10^4

                  μ = 0.67          μ = 1.0
                     σ=0.3   σ=0.1  σ=0.2  σ=0.3      av
expected value         262       8     68    192     133
MLE                   5533     459   1836   4132    2990
fill-in with 0         804    4102   3775   3463    3036
fill-in with 0.5        29     944    792    655     605
fill-in with 1        1477       8     31     70     397

Figures 1-3 present some of our results in terms of graphs. For each fixed sample size (N = 5, 10, and 15) and with σ fixed at 0.2, for each value of p = 1, ..., N - 1, we selected the most likely of the three tested means. We

Figure 1. N = 5, σ = 0.2: mean square errors for three procedures. Circles: fill-in with 0.5; crosses: fill-in with 1.0; plain line: expected values.

Table XXIV. N = 15, p = 11, L = 1, Mean Square Errors × 10^4

                  μ = 0.67          μ = 1.0
                     σ=0.3   σ=0.1  σ=0.2  σ=0.3      av
expected value          78      23    165    500     192
MLE                   6457     628   2514   5657    3814
fill-in with 0        1296    5067   4767   4478    3902
fill-in with 0.5         4    1192   1050    918     791
fill-in with 1        1400       5     21     48     369

Table XXV. N = 15, p = 12, L = 1, Mean Square Errors × 10^4

                      μ = 0.67
                  σ=0.2  σ=0.3      av
expected value       75     30      53
MLE                3995   7777    5886
fill-in with 0     2025   1906    1966
fill-in with 0.5     26     15      21
fill-in with 1     1226   1325    1276

Table XXVI. N = 15, p = 13, L = 1, Mean Square Errors × 10^4

                      μ = 0.67
                  σ=0.2  σ=0.3      av
expected value       25     75      50
MLE                4894  10239    7567
fill-in with 0     2726   2632    2679
fill-in with 0.5     79     65      72
fill-in with 1     1188   1253    1221

Table XXVII. N = 15, p = 14, L = 1, Mean Square Errors × 10^4

                      μ = 0.67
                  σ=0.2  σ=0.3      av
expected value        3     25      14
fill-in with 0     3534   3481    3508
fill-in with 0.5    164    152     158
fill-in with 1     1149   1180    1165