Characterization of Background Concentrations of Contaminants

Statistical Methods. The proposed Bayesian approach assumes environmental concentration data can be ap proximated by the log-normal distribution (12, ...
1 downloads 0 Views 117KB Size
Environ. Sci. Technol. 2006, 40, 6021-6025

Characterization of Background Concentrations of Contaminants Using a Mixture of Normal Distributions SONG S. QIAN* AND REGAN E. LYONS Nicholas School of the Environment and Earth Sciences, Duke University, Durham, North Carolina 27708-0328

We present a Bayesian approach for characterizing background contaminant concentration distributions using data from sites that may have been contaminated. Our method, focused on estimation, resolves several technical problems of the existing methods sanctioned by the U.S. Environmental Protection Agency (USEPA) (a hypothesis testing based method), resulting in a simple and quick procedure for estimating background contaminant concentrations. The proposed Bayesian method is applied to two data sets from a federal facility regulated under the Resource Conservation and Restoration Act. The results are compared to background distributions identified using existing methods recommended by the USEPA. The two data sets represent low and moderate levels of censorship in the data. Although an unbiased estimator is elusive, we show that the proposed Bayesian estimation method will have a smaller bias than the EPA recommended method.

1. Introduction Characterization of the background concentration of a contaminant is part of the U.S. Environmental Protection Agency’s (USEPA) risk assessment and risk management at Comprehensive Environmental Response, Compensation, and Liability Act (CERCLA or “Superfund”) sites (1). The primary goal of the CERCLA program is to protect human health and the environment from current and potential threats posed by uncontrolled release of hazardous substances. Because the same hazardous substances associated with a release can occur naturally in the environment or as a result of anthropogenic impacts not associated with siterelated activities, EPA recommends that risk assessments consider whether site-specific concentrations fall within the range of local background conditions or whether they may be indicative of site-related contamination (2). Background information is important to risk managers because the CERCLA program generally does not clean up to concentrations below natural or anthropogenic background levels (1). A large quantity of data is often available at hazardous waste sites. The statistical confidence of the characterization of background concentrations is enhanced by retaining the maximum number of data points in the analysis, including data from possibly impacted areas with no known contamination by the chemical of interest. However, using potentially impacted areas as background sites introduces the challenge of separating potentially contaminated data from data * Corresponding author phone: 919 613-8105; fax: 919 681-5740; e-mail: [email protected]. 10.1021/es0606071 CCC: $33.50 Published on Web 08/22/2006

 2006 American Chemical Society

indicative of local background. Currently, the USEPA (2) and the U.S. Naval Facilities Engineering Command (3) suggest a statistical approach for evaluating each available data point and removing potentially contaminated data points. This approach is based primarily on statistical tests of discordancy, supplemented by visual examination of the sample distribution. All data points from identified background sites are combined and subject to one or more tests for the rejection of outliers (i.e., tests of discordancy). A typical test for outliers is based on the normality assumption. An observation is deemed to be an outlier if the observation is extremely large (or small) based on the estimated normal mean and standard deviation using all observations except the one at hand. On one hand, a statistical hypothesis test, in general, has limited power to detect a true outlier when the sample size is small. On the other hand, the test often rejects the null hypothesis for practically insignificant differences when the sample size is large (the Lindley’s Paradox, (4)). Consequently, if the collected data from multiple background areas are to be analyzed separately, the outlier test is of relatively low power, and if the data are combined, we may reject too many data points as outliers. This is a dilemma of using statistical hypothesis testing that few have discussed. Even when the dilemma of hypothesis testing is resolved, misleading background characterization is possible using a hypothesis testing approach that calls for removal of observations with concentrations larger than a fixed value. Suppose that the presence of outliers or contaminated observations is a result of a release of the contaminant in part of the sampling area. Statistically we may represent the data as samples from two different distributions: the background distribution and the contaminated distribution (Figure 1). Using a statistical hypothesis test for the outlier is equivalent to setting an acceptable upper limit concentration, above which observations are classified as contaminated. This approach is not satisfactory when the centers of the two distributions are not sufficiently far apart: the upper tail of the background distribution is excluded while the lower tail of the contaminated distribution is included. In addition, a discordancy test is only effective when the number of “outliers’’ in the sample is small relative to the total sample size. The exact number of contaminated observations is rarely known. When the number of contaminated samples is large (e.g., the left column in Figure 1), the problem of outlier detection becomes one of discrimination between two classes of data. Using Figure 1 as a basic assumption of the underlying statistical processes that generated the sample measurements, we present a Bayesian hierarchical model for characterizing background concentration distribution. Our method is based on a mixture of two normal distributions. It is consistent with the basic model of data contamination described by Dixon (5). Furthermore, we use a hierarchical model to connect all sampling areas together assuming the background concentration distribution is the same for all sampling areas and the contaminated distributions varies from area to area. Our model directly estimates the background distribution parameters. As a result, we avoid the use of hypothesis testing, and thereby avoiding the technical problems discussed earlier. The idea of mixtures of distributions is an old one (6, 7), and its early theoretical development and application are summarized by Titterington et al. (8). Recent development in statistical computation (9, 10) makes practical applications of finite mixture models possible. VOL. 40, NO. 19, 2006 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

6021

FIGURE 2. Probability of a given log concentration value come from the contaminated distribution as shown in Figure 1.

FIGURE 1. Two hypothetical mixed sampling results. The top row shows the sample density and the hypothetical background and contaminated concentration distributions. The background distribution is normal with mean 0 and variance 1 and the contaminated distribution is normal with mean 4 and variance 4. The bottom row shows the corresponding normal quantile-quantile plots. The two columns show two different mixes. The sample in the left column has a background versus contaminated ratio of 1:1, and the sample in the right column has a ratio of 9:1.

2. Materials and Methods 2.1. Data. Data used in this paper are soil concentrations of lead (Pb) and mercury (Hg) from a U.S. federal facility in Missouri, USA. It is currently regulated for hazardous waste release under the Resource Conservation and Restoration Act (RCRA) and CERCLA. The data set, compiled by ARCADIS G&M, Inc. for the U.S. Army Environmental Center (11), includes soil metal concentration data collected by multiple studies in a total of 30 sampling areas from 1990 to 2003. Six sampling areas were identified as impacted directly by sitespecific activities or releases of lead and removed from the data set. Data from the remaining 24 sampling areas were included in the site-wide data set for characterizing the background levels of various metals. There are 731 observations of Pb soil concentrations (with 33 measurements below method detection limits) and 837 observations of Hg soil concentrations (with 627 measurements below detection limits). There are between 4 and 189 Pb measurements (mean 31, standard deviation 42) and 4-154 Hg measurements (mean 35 standard deviation 34) from each sampling area. Because the objective of this paper is to present an alternative statistical method for quantifying background concentrations, we chose to use data sets that were compiled following USEPA and Navy guidelines (2, 3) and were reviewed and approved by the USEPA (11), and were used by U.S. Army contractor for quantifying the respective background contaminant concentrations for the study site. Details of data collection/screening and site description are omitted. 2.2. Statistical Methods. The proposed Bayesian approach assumes environmental concentration data can be approximated by the log-normal distribution (12, 13), the same assumption used in the standard method sanctioned by USEPA (14). The basis of our approach is to assume that log-transformed soil metal concentration data from a sampling area are random samples from two normal distributions, one is the background concentration distribution N(µb,τb) and the other is the contaminated concentration distribution 6022

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 40, NO. 19, 2006

N(µc,τc). The notation N(µ,τ) represents a normal distribution with mean µ and precision τ. The precision is the reciprocal of the variance often used in Bayesian statistics. The background distribution is assumed to be the same for the entire site, while the contaminated concentration distribution is sampling-area-specific. Conditional on the two distributions (i.e., when the two distributions are known), we can assess the relative likelihood of a given data point being from one distribution over the other (the likelihood ratio of the two distributions). For example, the two hypothetical distributions displayed in Figure 1 are background of N (0, 1) (mean 0 variance 1) and the contaminated of N(4, 1/4) (mean 4 and variance 4). If we observe x ) 2, the likelihood that the observation is from the background distribution is 0.054 (the density of N (0, 1) evaluated at x ) 2), and the likelihood of the observation being from the contaminated distribution is 0.121. The probability of the observation being from the contaminated distribution is p ) 0.121/(0.121 + 0.054) ) 0.69, if we have no prior knowledge about the source of the observation. This probability indicates the uncertainty about whether or not a particular concentration value comes from a contaminated soil sample. If we know p, the distribution of the log transformed concentration x ) log (y) can be expressed as the following:

x ∼ N(µ, τ) where µ ) rµc + (1 - r)µb, τ ) rτc + (1 - r)τb, and r is a bernoulli random variable with probability of success p. Figure 2 shows the probability of being from the contaminated distribution as a function of the log concentration value, using the two hypothetical distributions in Figure 1. Note that the probability increases at the lower end of the plot. That is, the likelihood of being from the contaminated distribution increases (relative to the likelihood of being from the background distribution) when the data value becomes very small. Although the likelihood of observing a small value such as -4 is very small for both distributions, the contaminated distribution has heavier tails than the background distribution because the variance of the contaminated distribution is larger. Alternatively, if the background level distribution has a larger variance, we will see increased likelihood in favor of the background distribution when the observation becomes very large. The conventional wisdom tells us that a lower concentration is more likely from the background distribution and a higher concentration is more likely from a contaminated distribution. To reflect this belief in our model, we set τb ) τc ) τ. This prior restriction will not result in a same variance in the posterior distributions of the background and contaminated concentrations. This is because the predictive distribution of the contaminated concentration is π(x˜c) ) ∫ ∫µc,τ N(x|µc,τ,Y)dµcdτ. The mean and variance of the predictive distribution are determined by both the data and the prior distribution on µc and τ. Let xij be the jth observation (log concentration) from the ith sampling area, and xij is assumed to follow a (conditional) normal distribution. We assume that the background log concentration distribution for the study site is the same for

FIGURE 3. Histogram of the estimated posterior predictive background lead concentration distribution all sampling areas with mean µb and precision τ. For the contaminated soil samples, the level of contamination may vary from area to area within the site, with area-specific contaminated distribution means µcj. Using these basic assumptions, a hierarchical model is proposed:

xij ∼ N(µij,τ) µij ) rijµcj + (1 - rij)µb µcj ∼ N(µc,τhyper) rij ∼ bernoulli(p)

(1)

Prior distributions must be specified for µc, µb, τ, τhyper, and p. If there is no specific information on these parameters, vague priors can be used:

µb, µc ∼ N(0,0.0001) τhyper,τ ∼ gamma(0.001,0.001) p ∼ beta(1,1)

(2)

We note that the normal distribution of xij is conditional on rij. The marginal distribution of xij is most likely not normal. When the observation xij is below detection limit (or censored), the first line of eq 1 is replaced by xij ∼ N(µij, τ)I(,upij), where I(,upij) is the censored data operator indicating the observation xij is known to be below the detection limit upij and is otherwise uncertain. Details of handling censored data can be found in refs 15 and 16. An application can be found in ref 17. The Bayesian approach enables an easy and straightforward treatment of censored data, and its implementation under WinBUGS (18) makes it possible to incorporate the missing data computation into a single model. In order to make the model numerically stable, we must constrain that µcj > µb, the mean contaminated log concentration larger than the background log concentration. This is done by introducing a non-negative random variable θj and let µcj ) µb + θj. The posterior distribution of unknown parameters is estimated using the Markov chain Monte Carlo (MCMC) implemented under WinBUGS (19). Convergence of MCMC is determined by running multiple (5) chains and using the Gelman’s R ˆ statistic (20).

3. Results and Discussion The results of the model are presented in terms of posterior random samples of µb and τb. These samples define a posterior predictive distribution of the background level. For Pb, the background concentration distribution has a (equal tail) 95% credible interval between 3.63 and 72.64 mg/kg (Figure 3). The mean is about 21.72 mg/kg (median 16.21) and the standard deviation is 19.36 mg/kg.

FIGURE 4. Histogram of the estimated posterior predictive background mercury log concentration distribution The estimated posterior Hg background distribution has a 95% credible interval between 0.0014 and 0.2295 mg/kg. The posterior distribution is displayed in terms of log Hg concentration in Figure 4. The background Hg concentration has a mean of about 0.0425 mg/kg (median of 0.0185 mg/kg) and a standard deviation of 0.10 mg/kg. The mean is larger than the median because the estimated posterior distribution is highly skewed. Because 75% of the Hg concentration observations are below their respective method detection limits, this high level of uncertainty on the mean is expected. However, from a practical perspective, we are quite certain that the background concentration values are unlikely (only a 2.5% chance) to exceed 0.23 mg/kg, as indicated by the 95% credible interval. As many have pointed out (e.g., ref 21), the mean is not a good estimate of the central value, and the standard deviation is not a good estimate of the variability when the data distribution is positively skewed. We suggest that the estimated median and the 95% credible interval be used to characterize the background distributions. As a comparison, the same data sets are analyzed using the USEPA recommended method, which is a combination of discordancy tests and visual examination of probability plots to identify and remove “contaminated’’ observations. Before testing for outliers, below-detection-limit observations are either replaced by a fixed value (Pb, half of the detection limit) or removed from the data set (Hg) based on official guidelines (22, 3). We recognize that neither of the two treatments of censored data is desirable (21). The tests were applied to each sampling area separately. When the sample size is less than 25, the test of Dixon (23) was used. The Dixon’s test iteratively reveals whether or not the highest value is an outlier. When the sample size is larger than 25, the test of Rosner (24) was used. These tests cannot be used when the data set includes a large number of contaminated samples. Additional outliers are identified by visual examination of probability plots (or the quantilequantile plots defined in ref 25). In these instances, observations which deviate from a straight line closest to the origin are determined to be indicative of contamination ((3), and see Figure 1 for an illustration). Reliance on this graphical approach introduces a degree of uncertainty into the analysis, as identification of contaminated observations by this means relies heavily on subjective determinations by the analyst. These two discordancy tests and graphical analysis rejected 126 Pb concentration values as outliers. The remaining 605 “background’’ Pb observations have a sample mean of 17.3 mg/kg and standard deviation of 16.8. Similarly, 68 Hg observations were rejected as outliers from the 210 uncensored observations and the remaining 142 uncensored “background’’ Hg observations have a mean of 0.083 mg/kg and a standard deviation of 0.088 mg/kg. In comparison, the Bayesian estimated background Pb concentration distribution differs from the EPA-sanctioned estimate both in the mean and the standard deviation. VOL. 40, NO. 19, 2006 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

6023

FIGURE 5. Probabilities of data points being contaminated are plotted against the measured soil metal concentrations. Because the concentration distribution is approximately lognormal, the difference is mainly in the spread and the central locations of the two estimates are similar. The Bayesian estimated standard deviation is higher than the same from the EPA-sanctioned method. This is expected, because (1) substituting censored values with a fixed number will generally reduce the sample variance, and (2) using a cutoff point to remove observations that are seen as potentially contaminated artificially truncates the distribution, thereby reducing the variability. We note that substitution the censored values with a fixed value and using a cutoff point to remove potential outliers can result in biased estimate of the background mean. The difference in the estimated Hg background distribution between the two methods is substantial. Because the censored values account for 75% of the total Hg observations, removing them will result in overestimating the background mean and under estimating the background variance. The Bayesian estimated background standard deviation is actually smaller. Because the soil metal concentration distribution is approximately log-normal and the standard deviation of a log-normal distribution (σy) is a function of both the log mean (µx) and log variance (σx) [i.e., σ2y ) (eσx2 - 1)(e2µx+σx2)], it is likely that the overestimated mean resulted in an overestimated standard deviation even though the log variance was underestimated. The probability of a given soil sample (with log concentration xi) being contaminated is estimated by the mean of ri (Figure 5). As discussed in Section 2.2, in order to have a monotonically increasing probability as the concentration increases, we set the precision of the contaminated and the background distribution to be the same. In Figure 5, data from all sampling areas are combined. Therefore, different probabilities are associated with similar soil metal concentrations. This common variance assumption, however, does not constrain the posterior predictive distributions of the background and contaminated concentrations to have the same variance (Figure 6). We note that, because it is impossible to identify the contaminated samples definitely, the Bayesian approach is potentially biased. The estimated mean background concentration will be higher than the actual mean background concentration and the estimated mean contaminated concentration will be lower than the actual one. In other words, they are pulled toward the middle. The magnitude of the bias is largely decided by the number of contaminated samples in the data. We conducted a simulation study using the two hypothetical distributions in Figure 1. Simulated observations (without censoring) are 6024

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 40, NO. 19, 2006

FIGURE 6. Estimated background metal concentration distributions are compared with contaminated distributions from four selected sampling areas.

TABLE 1. Summary of the Estimated Background Distribution Parameters Using the Two Competing Methods mean (should be 0) standard deviation (should be 1) % background Bayesian EPA/Navy 60 70 80 90 95

0.392 0.252 0.178 0.031 0.010

-1.197 -0.787 -0.489 -0.294 -0.185

Bayesian

EPA/Navy

1.427 1.251 1.170 1.130 1.058

0.480 0.594 0.718 0.800 0.900

drawn from the two hypothetical distributions with five different compositions of “background’’ versus “contaminated’’ samples, with a total sample size of 1000. The bias of the Bayesian method is reduced as the number of contaminated observations becomes small relative to the total sample size. The hypothesis testing method recommended by the USEPA and U.S. Navy has a bias that will be reduced but is always larger than the corresponding Bayesian bias (Table 1). This simulation indicates that our approach not only properly handles censored values (the censored values are either removed or replaced by a fixed value using the existing method), but also reduces estimation bias. Because both methods are biased when the number of contaminated observations is large, it is important that a reference site be selected carefully. Although our method will not definitely separate the contaminated samples from the background ones, the estimated background concentration distribution parameters provide more information than a collection of “background data”. Using these parameters we can derive more informative comparisons of the background concentration distribution and data from contaminated sites than using the hypothesis testing approach suggested by the USEPA and U.S. Navy (refs, 2, 3). Furthermore, our method provides a framework for properly handling censored values. Using our method, we are able to automate the process and thereby reduce the time required for such analysis. Because the concentrations of Hg and Pb are potentially correlated due to similar exposure mechanisms, a bivariate normal distribution would be a better model for this problem. However, we chose not to use this approach because soil samples are not paired.

Appendix The model presented here can be easily implemented under WinBUGS, The WinBUGS program is listed below. model { for (i in 1:N){ x[i] ∼ dnorm(mu[i], tau)I(, up[i]); mu[i] < - r[i] * muC[area[i]] + (1-r[i]) * muB; r[i] ∼ dbern(p); } for (j in 1:Narea){ theta[j] ∼ dnorm(0, 0.0001)I(0,); muC[j] < - muB + theta[j]; } tau ∼ dgamma(0.001,0.001); muB ∼ dnorm(0, 0.0001); p ∼ dbeta (1, 1); }

(14)

Literature Cited

(17)

(1) U.S. Environmental Protection Agency. Risk assessment guidance for superfund. I.; Human health evaluation manual (Part A); Technical Report, EPA 540-1-89-002; Office of Emergency and Remedial Response, U.S. EPA: Washington DC, 1989. (2) U.S. Environmental Protection Agency. Guidiance for comparing background and chemical concentrations in soil for CERCLA sites; Technical Report, EPA 540-R-01-003, OSWER9285.7-41; Office of Emergency and Remedial Response, U.S. EPA: Washington DC, 2002. (3) Naval Facilities Engineering Cammand (NAVFAC). Guidance for environmental background analysis; Volume I. Soil. NFESC User’s Guide; Technical Report UG-2049-ENV; NAVFAC: Washington, DC, 2002. (4) Lindley, D. V. A statistical paradox. Biometrika 1957, 44, 187192. (5) Dixon, W. J. Analysis of extreme values. Ann. Math. Stat. 1950, 21 (4), 488-506. (6) Newcomb, S. A generalized theory of the combination of observations so as to obtain the best result. Am. J. Math. 1886, 8, 343-366. (7) Pearson, K. Contribution to the mathematical theory of evolution. Philo. Trans. R. Soc. London, Ser. A 1894, 185, 71-110. (8) Titterington, D. M.; Smith, A. F. M.; Makov, U. E. Statistical Analysis of Finite Mixture Distributions; Wiley: New York, 1985. (9) Casella, G.; Mengerse, K. L.; Robert, C. P.; Titterington, D. M. Perfect samplers for mixtures of distributions. J. R. Stat. Soc., B 2002, 64, 777-790. (10) Robert, C. P. Mixtures of distributions: inference and estimation. In: Markov chain Monte Carlo in Practice; Gilks, W. R.,

(11)

(12) (13)

(15) (16)

(18)

(19)

(20) (21) (22)

(23) (24) (25)

Richardson, S., Spiegelhalter, D. J.; Chapman & Hall: London, 1996. ARCADIS G& M Inc. Final Background Characterization Report: Lake City Army Ammunition Plant, Independence, Missouri; Technical Report; Arcadis G & M, Inc.: Tulsa, OK, 2005. Ott, W. R. Environmental Statistics and Data Analysis; Lewis Publishers: Boca Raton, FL, 1995. Limpert, E.; Stahel, W. A.; Abbt, M. Log-normal distributions across the sciences: keys and clues. Bioscience 2001, 51 (5), 341-352. Singh, A. K.; Singh, A.; Engelhardt, M. The lognormal distribution in environmental applications. Technology Support Center Issue, EPA/600/R-97/006; United States Environmental Protection Agency, Office of Research and Development, Office of Solid Waste and Emergency Response Washington, DC, 1997. Little, R. J. A.; Rubin, D. B. Statistical Analysis with Missing Data, 2nd edition; Wiley: New York, 2002. Lockwood, J. R.; Schervish, M. J.; Gurian, P.; Small, M. J. Characterization of arsenic occurrence in source waters of U.S. community water system. J. Am. Stat. Assoc. 2001, 96 (456), 1184-1193. Qian, S. S.; Schulman, A.; Koplos, J.; Kotros A.; Keller, P. A hierarchical modeling approach for estimating national distributions of chemicals in public drinking water systems. Environ. Sci. Technol. 2004, 38, 1176-1182. Lunn, D. J.; Thomas, A.; Best, N.; Spiegelhalter, D. Winbugssa Bayesian modelling framework: Concepts, structure, and extensibility. Stat. Comput. 2000, 10 (4), 325-337. Spiegelhalter, D.; Thomas, A.; Best, N.; Lunn, D. WinBUGS User Manual, Version 1.4. http://www.mrc-bsu.cam.ac.uk/bugs, 2003. Gelman, A.; Rubin, D. B. Inference from iterative simulation using multiple sequences. Stat. Sci. 1992, 7, 457-511. Helsel, D. R. Less than obvious. Environ. Sci. Technol. 1990, 24 (12), 1767-1774. U.S. Environmental Protection Agency. Guidance for data quality assessment, practical methods for data analysis; Technical Report EPA/600/R-96/084, EPA QA/G-9, QA00 Update.; Office of Environmental Information: Washington, DC., 2000. Dixon, W. J. Processing data for outliers. Biometrics 1953, 9 (1), 74-89. Rosner, B. On the detection of many outliers. Technometrics 1975, 17 (2), 221-227. Cleveland, W. S. Visualizing Data; Hobart Press: Summit, NJ, 1993.

Received for review March 15, 2006. Revised manuscript received July 18, 2006. Accepted July 20, 2006. ES0606071

VOL. 40, NO. 19, 2006 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

6025