Article pubs.acs.org/est
Fitting a Distribution to Censored Contamination Data Using Markov Chain Monte Carlo Methods and Samples Selected with Unequal Probabilities Michael S. Williams* and Eric D. Ebel Food Safety and Inspection Service United States Department of Agriculture, 2150 Centre Avenue, Building D, Fort Collins, Colorado 80526, United States S Supporting Information *
ABSTRACT: The fitting of statistical distributions to chemical and microbial contamination data is a common application in risk assessment. These distributions are used to make inferences regarding even the most pedestrian of statistics, such as the population mean. The reason for the heavy reliance on a fitted distribution is the presence of left-, right-, and interval-censored observations in the data sets, with censored observations being the result of nondetects in an assay, the use of screening tests, and other practical limitations. Considerable effort has been expended to develop statistical distributions and fitting techniques for a wide variety of applications. Of the various fitting methods, Markov Chain Monte Carlo methods are common. An underlying assumption for many of the proposed Markov Chain Monte Carlo methods is that the data represent independent and identically distributed (iid) observations from an assumed distribution. This condition is satisfied when samples are collected using a simple random sampling design. Unfortunately, samples of food commodities are generally not collected in accordance with a strict probability design. Nevertheless, pseudosystematic sampling efforts (e.g., collection of a sample hourly or weekly) from a single location in the farmto-table continuum are reasonable approximations of a simple random sample. The assumption that the data represent an iid sample from a single distribution is more difficult to defend if samples are collected at multiple locations in the farm-to-table continuum or risk-based sampling methods are employed to preferentially select samples that are more likely to be contaminated. This paper develops a weighted bootstrap estimation framework that is appropriate for fitting a distribution to microbiological samples that are collected with unequal probabilities of selection. An example based on microbial data, derived by the Most Probable Number technique, demonstrates the method and highlights the magnitude of biases in an estimator that ignores the effects of an unequal probability sample design.
1. INTRODUCTION Every year hundreds of thousands of samples are collected and analyzed to assess microbial and chemical contamination in food and water.1 In many countries, ongoing efforts to improve hygiene have resulted in a decline in levels of contamination in all commodities. A consequence of this reduction is that the proportion of samples that test positive for the even the most contaminated commodities has fallen to less than 0.1. Samples that are test-negative or cannot be successfully enumerated are referred to as censored data because the true concentration cannot be observed.2,3 While the increase in the proportion of censored observations is unequivocally beneficial to public health, data sets with a large proportion of censored observations present many unique analytical challenges. A key application for these types of data sets is food-safety exposure assessment, where censored data can be fitted to a statistical distribution to describe contamination at some point in the farm-to-table continuum. In some applications, the population under consideration is of limited scope, with samples collected from a single field, retail outlet, or production This article not subject to U.S. Copyright. Published XXXX by the American Chemical Society
establishment. In these applications, the assumption is that the samples are independent and identically distributed (iid) observations (i.e., a random sample), which is usually reasonable. In other situations, data are collected from multiple fields or establishments.4−8 In this latter situation, the assumption of the data being an iid sample is less tenable because of clustering of samples within locations and also because the size of different sampling locations and the levels of the contaminant could be related (e.g., if a sampling locations production volume is correlated with production practices that directly influence levels of contamination). One solution to this problem would be to fit individual distributions to the data collected at each location. This approach, however, is often impractical because it can require sample sizes in excess 500 from each location when the Received: July 24, 2014 Revised: October 20, 2014 Accepted: October 21, 2014
A
dx.doi.org/10.1021/es5035574 | Environ. Sci. Technol. XXXX, XXX, XXX−XXX
Environmental Science & Technology
Article
The proposed solution combines results from a large number of statistical techniques, so the population and the related notation are used to develop results for unequal probability sampling, weighted distribution theory, and weighted maximum likelihood estimation. This discussion is followed by the development of the proposed weighted bootstrap solution, and its application to MCMC methods for fitting a log-normal distribution to MPN data. 2.1. Population Description. The example population is loosely based on a common food safety application where samples of products are collected in production facilities and tested for microbial and/or chemical contaminants. The parameter values are similar to those observed in the testing of whole chicken carcasses in the United States. The artificial population consists of N = 100 000 elements (e.g., carcasses) and a sample of size n = 1500 is chosen from the population. The example assumes that the goal of a survey is to estimate a distribution describing the average concentration of a pathogen across the population. Individual carcasses will be indexed by i, and λi denotes the average concentration of the pathogen on carcass i. The actual number of pathogenic organisms in sample i will be denoted by the integer value yi. Unequal probability sampling designs typically use an easily observed covariate, denoted by xi, to define the probability of inclusion of population element i in a sample. For this application, it is assumed that xi represents the concentration of a nonpathogenic indicator organism on carcass i, such as generic Escherichia coli, and that the inclusion probability will be πi = nxi/X, where X = ∑i N= 1xi. These πi values are appropriate for probability proportional to size (pps) sample designs.17 Unequal probability sampling designs yield precise estimates of the population total when λi ∝ πi, which implies a positive correlation between the λi and xi values is desirable. To achieve this correlation, the λi and xi values, on the log10 scale, were generated from a multivariate normal distribution of the form
contaminant of interest is detected in less than 10% all samples.9−11 The assumption of the sample consisting of iid observations also fails when an unequal probability sample design is used to preferentially select some population elements over others. Examples include the risk-based food-safety programs,12 targeted surveillance applications used in disease surveillance,13,14 and EPAs Disinfectants and Disinfection Byproduct Rule,15 which emphasizes sampling at the end of a water distribution system (where byproduct levels are expected to achieve their highest levels) but the length or time of each system varies. While these types of sampling applications are rarely strict probability samples, the sample selection process results in a sample where the iid assumption is unreasonable, and the data are most appropriately described as an unequal probability sample.16,17 In many of these applications, however, there is sufficient knowledge about the population to approximate the differential probability of each element being selected into the sample, where the probability of a population element being selected is referred to as the probability of selection or inclusion probability.16−18 This quantity will be denoted by the symbol πi = P(element i is selected). Ignoring the effect of the sample design used to collect samples can induce a bias in the estimated distribution parameters because the distribution describing the population and the distribution of samples will be fundamentally different if the level of contamination is related to the inclusion probability. Methodology to relate the population and sample probability density functions is an application of weighted distribution theory.19−21 Many different methods, such as maximum likelihood estimation (MLE) and regression on order statistics,2,22 have been proposed to fit distribution functions to contamination data. MLE was previously demonstrated to be superior to other methods of fitting a distribution to contamination data sets when sample sizes are reasonably large.2,10 It also has the advantage in that it can be easily adapted to account for some unequal probability sampling designs.23,24 Recent studies, however, suggest that Bayesian methods, based on Markov chain Monte Carlo (MCMC) models, are equal or even superior to MLE, particularly when the proportion of censored observations is high.25,26 A disadvantage of MCMC models is that there are no simple methods of reweighting the data to account for unequal probability samples. In this study, a weighted bootstrap resampling approach is proposed to address the problem of unequal probability sampling. The focus of the study will be using MCMC methods to fit a log-normal distribution to Most Probable Number (MPN) data27,28 that are collected with unequal probabilities of selection. The analytical methods are implemented in a combination of the R29 and BUGS environments.30,31 The proposed solution represents a good compromise between simplicity and robustness. The method is tested with an artificial data set consisting of censored observations using a computer simulation. The simulation study illustrates the performance of the technique and demonstrates the bias in the estimated parameters that can occur when adjustments for unequal probability sampling are not applied.
⎛μ ⎛ log x ⎞ ⎜ 10 ⎟ ∼ MVN ⎜ x , ⎜ log λ ⎟ ⎜ μλ ⎝ 10 ⎠ ⎝
⎛ σ 2 ρσ σ ⎞⎞ x λ ⎟ ⎜ x ⎟ ⎜ 2 ⎟⎟ ⎝ ρσxσλ σλ ⎠⎠
The average concentration values for the pathogen were derived from a log-normal distribution with parameters λi ∼ log−normal(μλ = −2.5,σλ = 1.25). While these parameter values were chosen to be easily identified values for graphical comparisons, they are similar to the parameter values explaining levels of Salmonella and Campylobacter found on poultry carcasses in a recent nationwide survey of U.S. chicken producers.32 The average concentration values for the indicator organism also were generated from a log-normal distribution with parametrization xi ∼ log−normal(μx = 6.68,σx = 0.34). These parameters were derived from indicator organism sample data of chicken carcass rinse samples.33 The variance-covariance matrix for the multivariate normal distribution was ⎡ 0.113 0.240 ⎤ ∑=⎢ ⎥ ⎣ 0.240 1.562 ⎦
Following the conversion between the normal to log-normal scales, the data exhibited a weak sample correlation of ρ = 0.088. The data set contains information representing the application of a screening test for each sample, followed by the application of the MPN technique27,28,34,35 to estimate the average concentration in each sample that was test-positive on
2. MATERIALS AND METHODS The development of the proposed method requires first defining properties and notation for an artificial population. B
dx.doi.org/10.1021/es5035574 | Environ. Sci. Technol. XXXX, XXX, XXX−XXX
Environmental Science & Technology
Article
the screening test. The MPN technique36 is a common method to quantify the level of contamination in a sample because it is able to provide estimates at low concentrations. This technique uses a series of dilution count experiments to derive estimates of the concentration of the microorganism of interest. The results of the screening test assume that the true numbers of organisms, denoted by yi, are uniformly distributed throughout the screening sample of volume vscreen = 60 mL. This leads to a limit of detection of 0.0167 organisms per ml, assuming a perfect test. To model this process, a yi value is randomly drawn from a Poisson distribution with rate parameter 60λi, and this value is used to determine if one or more organisms are present in the sample. The screen test data element is set to δi = 1when one or more organisms is present (i.e.,yi ≥ 1) and δi = 0 otherwise. The total number of samples,n, is broken down into subsets. The first subsets contains samples where the pathogen is not detected (i.e., δ = 0i). These are denoted by nND = n − ∑in= 1δi. The number of samples that are positive on the screening test is nMPN = ∑i n= 1δi. The average proportion of samples where the sample was screen-test positive and one or more organisms were found in the aliquot used for enumeration was 0.107. The MPN data represent a simulated 3-tube; 4-dilution experiment (i.e., a set of 12 test tubes, divided into sets of three and with each set containing one of four different volumes of the original sample material). At each of the four dilution levels, mk = 3simulated tubes are inoculated with a volume vk of the original sample material, with the volumes of the dilutions being 10.0, 1.0, 0.1, and 0.01 mL and indexed by k = 1,...K = 4. In practice each tube usually contains selective enrichment media and the contents are incubated. Following incubation, the number of tubes (sk) where organisms are detected at dilution k is counted. For any of the n = 1500 samples that were screen-test positive (δi = 1), a simulated MPN experiment is performed. The number of organisms in each tube is modeled as a realization of a Poisson distribution. At each dilution, a random draw from the Poisson distribution determines if one or more organisms are present in each tube. Tubes are positive when the count is one or greater. For example, using the v1 = 10 mL dilution for sample i, three random draws indexed by l and denoted by zl ∼ Poisson(10λi),l = 1,2,3 are simulated and the number of positive tubes,s1i, is the count of tubes where zl ≥ 1. The MPN method has a lower limit of quantitation of 0.03 = (1/33.33) organisms per ml and an upper limit of quantitation of 110 organisms per ml. In summary, for each sample i the observable data are the vector(xi,δi,s1i,s2i,s3i,s4i,m1i,m2i,m3i,m4i) = (xi,δi,si,mi), and the goal of analysis is to estimate the parameter vector(μ̂λσ̂λ) for the lognormal distribution describing the distribution for the latent variable λi. To simplify notation, it will be assumed that the λi values are available for each sample for much of the remainder of the theoretical development. 2.2. Sample Distribution and Sample Design. Assume the λi values are realizations derived from the population-level log-normal distribution of average concentration values, so λ ∼ f(μλ,σλ) with μλ and σλ being the parameter values for the distribution. Each λi is assigned a value0 < πi ≤ 1 that represents the probability that population element i will be included in a sample. The distribution of samples drawn from the population, with inclusion probabilities defined by the π values, is related to the population distribution through the relationship
f *(μλ , σλ) =
πf (μλ , σλ) E[π ]
where the term E[π]serves as a normalizing constant.19−21 Under equal probability sample designs, such as simple random, systematic, or Bernoulli sampling, the distribution of samples is identical to the population distribution because π ≡ E[π]. This application differs from size-biased distribution theory in that f(μλ,σλ)and f * (μλ,σλ) need not belong to the same family of distributions, so the analyst selects one or more candidate distributions that are appropriate for the application.21 When samples are selected with unequal probabilities, the π term has the effect of changing the influence of elements in the sample. An intuitive explanation of the inclusion probability is that if population element i is included in the sample, then this sample element represents (1/πk) similar population elements. These inclusion probabilities can be used to calculate descriptive statistics for the population using the HorvitzThompson estimator.17 For example, an estimator of the population total is n
Λ̂ =
∑ i=1
λi πi
Most any population parameter can be estimated via reweighting of the sample data by the inverse of the inclusion probabilities. Many different sample designs result in population elements having unequal inclusion probabilities. The sample design considered here, for the purpose of illustration, is the selection and estimation of the average concentration λi from an MPN experiment when the inclusion probability for population element i is based on the size of a covariate value xi, which is referred to as a probability proportional to size design.16−18 2.3. Maximum Likelihood. A distribution can be fitted to unequal probability sampling data by the method of maximum likelihood estimation.23,24 This estimator will be presented to help motivate the proposed weighted bootstrap technique. If n iid samples are selected from a population with inclusion probability πi and true average concentration λi were observed, the maximum likelihood estimator is given by n
Λ weighted(μ , σ ) =
∏ i=1
E[πi]f *(λi) = πi
n
∏ wi i=1
n
f *(λi)with ∑ wi = n i=1
Note that the term wi scales the contribution of each sampled λi value in the likelihood and that wi ≡ 1in the case that simple random or systematic sampling is used to draw the sample (i.e., the weighted likelihood reduces to the standard likelihood estimator under equal probability designs). Note that variance estimates derived using the weighted maximum likelihood estimator will be biased when samples are selected from a population comprising clusters because the assumption of independence between samples is not satisfied.23,24 One way to think of the weights in the maximum likelihood estimator is that they represent the number of times an element from the sampled distribution f * () would appear in an equivalent simple random sample drawn from the population distribution f(). For example, if wi = 2, an equivalent maximum likelihood estimator could be obtained by modifying the data to C
dx.doi.org/10.1021/es5035574 | Environ. Sci. Technol. XXXX, XXX, XXX−XXX
Environmental Science & Technology
Article
replicate the sample element λi two times and use a new weight of wi = 1 for each of the replicates. A solution that replicates data would work for all integer-valued weights, but not for realvalued weights. An approximate solution would be to round the weights to the nearest integer value, but this would be problematic for weights less than 0.5 because these sample elements would be replicated 0 times in the likelihood. 2.4. Weighted Bootstrap. The bootstrap is a nonparametric method for approximating the sampling distribution of an estimator derived from an iid sample. The derivation of the properties of a bootstrap estimator are not estimator dependent, so the method can be applied to the general class of estimators that serve as summarizers of the sample data, which is denoted by S().37 While many applications related to the use of the bootstrap estimators focus on the generation of confidence intervals and statistical tests,38 maximum likelihood and Bayesian estimators are also appropriate summarizers for bootstrap and weighted bootstrap sample data,38,39 though examples of the approach taken here have not been identified. While many finite population sampling designs fail to meet a strict definition of an iid sample, most food-safety applications deal with populations where N ≫ n so the effects of the departure from the independence assumption are usually of minimal concern. It is, however, fairly common for the assumption that each observation is selected from an identical distribution to fail, with a typical situation being when samples are selected from multiple locations. In cases where the central tendencies of the distributions for each cluster are similar, the iid bootstrap can still be applied and will result in conservative interval estimates.40,41 The other alternative is to structure the bootstrap resampling to mimic the original sampling design. For example, the bootstrapping routine for a two-stage cluster sample would be to first select a bootstrap sample of locations (clusters) and then draw individual bootstrap samples from each cluster. The weighted bootstrap42,43 is an extension of the bootstrap method for constructing a new data sequence where the elements are selected, with replacement, from the original sample in proportions that are determined by the inclusion probabilities for the original sample values. So if Δi represents the number of times sample element i appears in a weighted bootstrap sample, the expected number of times a sample element appears in any one of B bootstrap samples is (E[Δi] = (E[πi])/πi). Thus, across a series of B weighted bootstrap samples, the weighted bootstrap sampling distribution approximates the population distribution because f (μλ , σλ) =
E[π ]f *(μλ , σλ) π
Inferences about the population are drawn from the B replicates of the weighted bootstrap procedure. The number of bootstrap replicates often ranges from between B = 50 and 200. The resampling scheme in Step 1 should mimic the original sample37 in order to maintain any correlated structure that exists in the population. This implies that when the data are grouped, as is the case when sampling occurs both within and between clusters or when the population is stratified, the bootstrap resampling scheme requires separately sampling each group of observations independently. In this study the MCMC routine serves as the summarizing function. It is applied to each bootstrap sample to generate bootstrap estimates of the pathogen distribution parameters denoted (μ*λb, σ*λb), b = 1,2,...B. This bootstrap sampling distribution is used to make inferences about the target parameters (μλ, σλ). For example, the weighted bootstrap estimator of the μλ parameter is μ ̂WB =
1 B
B
∑ μλ*b b=1
3. SIMULATION STUDY The goal of the simulation study was to demonstrate the performance of an estimation strategy that uses a weighted bootstrap resampling method and the MCMC fitting technique to estimate the parameters of a log-normal distribution. For the specific application of fitting a log-normal distribution, the data derived from each sample for the MPN method is the data vector((δ1,s1,m1)T, (δ2,s2,m2)T,...(δnMPN,snMPN,mnMPN)T,δnMPN+1,...δn). The MCMC model described in Williams and Ebel (2012a) serves as the summarizing statistic, which provides estimates of μλ and σλ for the pathogen distribution. Two different sampling methods were used. The first method serves as a baseline for comparison. It used simple random without replacement sampling (SRS) to draw an equal probability sample of size n = 1500 from the population. The log-normal distribution is fitted to the data using the MCMC routine to generate estimates μ̂ SRS, σ̂SRS. The second sampling method selects a probability proportional to size sample, where the xi values serve as the size measure. This sample also contains n = 1500 observations. Two estimation techniques are used with this sample. The first technique ignores the unequal probability sampling design and fits the log-normal distribution to the data using the MCMC routine and generates estimates μ̂UW, σ̂UW. These estimates are used to demonstrate the bias that can occur when unequal probability sampling data are treated as an iid sample. The final estimation technique uses the weighted bootstrap routine described above to generate B = 50 bootstrap sample estimates. The average of the 50 individual bootstrap estimates provided the weighted bootstrap estimates μ̂WB and σ̂WB. The simulation study drew a total of 500 simple random samples and 500 unequal probability samples from the population. For each sample, the MCMC routine generated a chain of 2000 samples, with the first 1000 samples serving as the burn-in period and the remaining 1000 samples used to generate the estimates for μ̂ SRS,σ̂SRS,μ̂UW,σ̂UW,μ̂ λ*b,σλb * . For each of the 500 unequal probability samplings, the simulation study uses B = 50 bootstrap samples to estimate the parameters μ̂WB,σ̂WB. These values represent short burn-in periods and MCMC sample sizes, as well as the minimum recommended number of bootstrap samples. This is due to the amount of
= E[Δ]f *(μλ , σλ)
Implementation of the weighted bootstrap for a general application with vector-valued observations, denoted by di, and summarizing statistic S() is as follows; suppose the sample data consists of observations{d 1 ,d 2 ,d 3 ,...d n }with inclusion probabilities{π1,π2,π3,...πn}. The proposed weighted bootstrap approach is as follows: 1. S e l e c t a w i t h - r e p l a c e m e n t s a m p l e d e n o t e d {d1*,d2*,d3*,...dn*} where the inclusion probability of element i on each of the n draws is inversely proportional toπi. 2. Determine the value of the statistic S(d1*,d2*,d3*,...dn*) that summarizes the data. 3. Repeat steps 1 and 2 B times. D
dx.doi.org/10.1021/es5035574 | Environ. Sci. Technol. XXXX, XXX, XXX−XXX
Environmental Science & Technology
Article
Figure 1. Comparison of estimates for the simple random sampling (SRS), unweighted fitting (UW), and weighted bootstrap (WB) methods. The two graphs on the left-hand side of the figure demonstrate strong agreement between the individual estimates derived using the simple random sampling data and the unequal probability sampling data with the weighted bootstrap fitting technique. The estimates from both methods are tightly clustered about the nominal values of μλ = −2.5 and σλ = 1.25. The two graphs on the right-hand side of the figure demonstrate the observed biases in the μ̂ UW and σ̂UW estimators, particularly for the μ̂ UW parameter.
for comparison. The left-most two figures compare the distributions of μ̂SRS, μ̂WB and σ̂SRS, σ̂WB across the 500 iterations of the simulation study. The means of the observations for all four distributions essentially match the nominal values (i.e., μ̅SRS = −2.496,μ̅WB = −2.509 and σ̅SRS = 1.250, σ̅WB = 1.254). In contrast, failing to account for the unequal probabilities of selection leads to a noticeable bias in the log-normal parameters (μ̅UW = −1.959, σ̅UW = 1.264). This effect is depicted by the graphs on the right-hand size of Figure 1, where virtually all of the μ̅UW estimates are roughly half a unit too large. The bias in σ̅UW is not substantial for the choice of parameters used to construct this population, but this result cannot be generalized to other populations. The standard deviations of the estimates across the 500 iterations of the simulation study were sd(μ̂ SRS) = 0.053, sd(μ̂UW) = 0.041, sd(μ̂WB) = 0.066 and sd(σ̂SRS) = 0.044, sd(σ̂UW) = 0.037, sd(σ̂WB) = 0.046, so for this particular data set the weighted bootstrap estimator is the least efficient. A stronger linear relationship between the x and λ values in the data set would be required for the weighted bootstrap estimator to exhibit less uncertainty than that of the simple random sampling estimator.16,17
time required for the MCMC routine to run for each sample, which was approximately 1 h for each of the 500 iterations. While longer burn-in periods and at least 100 bootstrap samples would be recommended for an analysis of an actual data set, these sample sizes are adequate for the purposes of demonstrating the performance of the proposed weighted bootstrap solution. Assessing the bias in the pathogen distribution parameter estimates was accomplished by comparing the means of the estimated μ̂ SRS,σ̂SRS,μ̂UW,σ̂UW,μ̂WB, and σ̂WB parameters with the known values for the artificial data set. Using the μλ parameter as an example, these statistics are given by 1 μ̅ = * 500
500
∑ μ*̂ ,j j=1
where * = SRS,UW,or WB indicate the fitting method. The standard deviation of the estimates also was calculated across the 500 iterations of the simulation study to assess the accuracy of the parameter estimates. Computer code, written in R and OpenBUGS,31,44,45 is available as Supporting Information.
4. RESULTS The performance of the three methods is summarized in Figure 1, where the results for the SRS method are used as a baseline E
dx.doi.org/10.1021/es5035574 | Environ. Sci. Technol. XXXX, XXX, XXX−XXX
Environmental Science & Technology
Article
(3) Helsel, D. R.; Cohn, T. A. Estimation of descriptive statistics for multiply censored water quality data. Water Resour. Res. 1988, 24, 1997−2004. (4) Gonzales-Barron, U.; Butler, F. A comparison between the discrete Poisson-gamma and Poisson-lognormal distributions to characterise microbial counts in foods. Food Control 2011, 22, 1279−1286. (5) Á lvarez-Fernández, E.; Alonso-Calleja, C.; García-Fernández, C.; Capita, R. Prevalence and antimicrobial resistance of Salmonella serotypes isolated from poultry in Spain: Comparison between 1993 and 2006. Int. J. Food Microbiol. 2012, 153, 281−287. (6) Busschaert, P.; Geeraerd, A. H.; Uyttendaele, M.; Van Impe, J. F. Estimating distributions out of qualitative and (semi) quantitative microbiological contamination data for use in risk assessment. Int. J. Food Microbiol. 2010, 138, 260−269. (7) Busschaert, P.; Geeraerd, A. H.; Uyttendaele, M.; Van Impe, J. F. Hierarchical Bayesian analysis of censored microbiological contamination data for use in risk assessment and mitigation. Food Microbiol. 2011, 28, 712−719. (8) Wilson, I. G. Salmonella and Campylobacter contamination of raw retail chickens from different producers: A six year survey. Epidemiol. Infec. 2002, 129, 635−645. (9) Shorten, P. R.; Pleasants, A. B.; Soboleva, T. K. Estimation of microbial growth using population measurements subject to a detection limit. Int. J. Food Microbiol. 2006, 108, 369−375. (10) Shumway, R. S.; Azari, R. H.; Kayhanian, M. Statistical approaches to estimating mean water quality concentrations with detection limits. Environ. Sci. Technol. 2002, 36, 3345−3353. (11) Williams, M. S.; Cao, Y.; Ebel, E. D. Sample size guidelines for fitting a lognormal probability distribution to censored most probable number data with a Markov chain Monte Carlo method. Int. J. Food Microbiol. 2013, 165, 89−96. (12) Uyttendaele, M.; Baert, K.; Ghafir, Y.; Daube, G.; De Zutter, L.; Herman, L.; Dierick, K.; Pierard, D.; Dubois, J. J.; Horion, B.; Debevere, J. Quantitative risk assessment of Campylobacter spp. in poultry based meat preparations as one of the factors to support the development of risk-based microbiological criteria in Belgium. Int. J. Food Microbiol. 2006, 111, 149−163. (13) Williams, M. S.; Ebel, E. D.; Wells, S. J. Population inferences from targeted sampling with uncertain epidemiologic information. Prev. Vet. Med. 2009, 89, 25−33. (14) Morignat, E.; Ducrot, C.; Roy, P.; Baron, T.; Vinard, J.; Biacabe, A.; Madec, J.; Bencsik, A.; Debeer, S.; Eliazsewicz, M. Targeted surveillance to assess the prevalence of BSE in high-risk populations in western France and the associated risk factors. Vet. Rec. 2002, 151, 73− 77. (15) EPA Stage 2 Disinfectants and Disinfection Byproduct Rule. http://water.epa.gov/lawsregs/rulesregs/sdwa/stage2/ (accessed September 30, 2014). (16) Cochran, W. G. Sampling Techniques, 3 ed.; John Wiley and Sons: New York, 1977. (17) Särndal, C. E.; Swensson, B.; Wretman, J. Model Assisted Survey Sampling; Springer, 2003. (18) Thompson, S. K. Sampling; Wiley, 2012. (19) Patil, G. P., Weighted distribution. In Encyclopedia of Environmetrics; El-Shaarawi, A. H., Piegorsch, W. W., Eds.; John Wiley & Sons, Ltd: Chichester, 2002; Vol. 4, pp 2369−2377. (20) Gove, J. H.; Patil, G. P. Modeling the basal area-size cistribution of forest stands: A compatible approach. For. Sci. 1998, 44, 285−297. (21) Van Deussen, P. C. Fitting assumed distributions to horizontal point sampling diameters. For. Sci. 1986, 32, 146−148. (22) Pouillot, R.; Delignette-Muller, M. L. Evaluating variability and uncertainty separately in microbial quantitative risk assessment using two R packages. Int. J. Food Microbiol. 2010, 142, 330−340. (23) Fuller, W. A. Sampling Statistics; John Wiley & Sons, 2011. (24) Williams, M. S.; Ebel, E. D.; Cao, Y. Fitting distributions to microbial contamination data collected with an unequal probability sampling design. J. Appl. Microbiol. 2013, 114, 152−160.
5. DISCUSSION The many complicating factors associated with fitting a lognormal distribution to MPN data collected with unequal selection probabilities presents a conundrum of how best to balance complexity with the need for eliminating a potentially substantial source of bias. The proposed weighted bootstrap resampling approach addresses the problem of unequal probability sampling data used in conjunction with MCMC methods. This study focuses on this application because of its relevance to food-safety risk assessment. It should be noted that the weighted bootstrap method can be applied to virtually any study where MCMC methods are used to make inferences from unequal probability sampling data. While the weighted bootstrap paired with MCMC parameter estimation is computationally intensive (1−2 h to evaluate a data set with 2000 observation using current laptop computer technology) it also represents the most general solution because the weighted bootstrap, when properly implemented, can incorporate the effects of clustering within the population without the need to explicitly model these effects. To illustrate this advantage consider the following (overly) simplistic example; suppose the goal of a survey is to estimate the overall proportion p of infected animals from an unequal probability sample in a population consisting of two herds, with one herd containing only infected animals and the second herd free of infection. In repeated sampling from these two herds, the Horvitz-Thompson estimator p̂ of the proportion infected will have zero variance. A weighted maximum likelihood estimator can be constructed24 that explicitly includes the adjustments for the unequal probability sampling design, but the estimated variance of p̂, derived from the weighted MLE, will be greater than zero. In contrast, a weighted bootstrapping method, which draws with-replacement samples from each herd independently, also will have the appropriate zero variance. While it may be possible to construct weighted MLE models that account for various sources of clustering within the data, such an approach is much less general than the proposed weighted bootstrapping method.
■
ASSOCIATED CONTENT
S Supporting Information *
Computer code for implementing the proposed method. This material is available free of charge via the Internet at http:// pubs.acs.org.
■
AUTHOR INFORMATION
Corresponding Author
* E-mail:
[email protected]. Author Contributions
The manuscript was written through equal contributions of both authors. All authors have given approval to the final version of the manuscript. Notes
The authors declare no competing financial interest.
■
REFERENCES
(1) Hill, W. E.; Suhalim, R.; Richter, H. C.; Smith, C. R.; Buschow, A. W.; M, S. Polymerase chain reaction screening for Salmonella and Enterohemorrhagic Escherichia coli on beef products in processing establishments. Foodborne Pathog. Dis. 2011, 8, 1045−1053. (2) Helsel, D. R. Nondetects and Data Analysis: Statistics for Censored Environmental Data; Wiley Inter-science: Hoboken, 2005. F
dx.doi.org/10.1021/es5035574 | Environ. Sci. Technol. XXXX, XXX, XXX−XXX
Environmental Science & Technology
Article
(25) Williams, M. S.; Ebel, E. D. Methods for fitting the Poissonlognormal distribution to microbial testing data. Food Control 2012, 27, 73−80. (26) Williams, M. S.; Ebel, E. D. Methods for fitting a parametric probability distribution to most probable number data. Int. J. Food Microbiol. 2012, 157, 251−258. (27) Cochran, W. G. Estimation of bacterial densities by means of the most probable number. Biometrics 1950, 6, 105−116. (28) Haas, C. N. Estimation of microbial densities from dilution count experiments. Appl. Environ. Microbiol. 1989, 55, 1934−1942. (29) R Development Core Team R: A language and environment for statistical computing. http://www.R-project.org/. (30) Lunn, D.; Spiegelhalter, D.; Thomas, A.; Best, N. The BUGS project: Evolution, critique and future directions (with discussion). Stat. Med. 2009, 28, 3049−3082. (31) Lunn, D.; Thomas, A.; Best, N.; Spiegelhalter, D. WinBUGS - A Bayesian modeling framework: Concepts, structure, and extensibility. Stat. Comput. 2000, 10, 325−337. (32) FSIS. The Nationwide Microbiological Baseline Data Collection Program: Young Chicken Survey: July 2007−June 2008; U. S. Department of Agriculture: Washington D.C., 2009. (33) FSIS. Nationwide young chicken microbiological baseline data collection program (November 1999−October 2000); U. S. Department of Agriculture: Washington, D.C., 2001. (34) Halvorson, H. O.; Ziegler, N. R. Applications of statistics to problems in bacteriology. I. A means of determining bacterial populations by the dilution method. J. Bacteriol. 1933, 25, 101−121. (35) Garthright, W.; Blodgett, R. FDAs preferred MPN methods for standard, large or unusual tests, with a spreadsheet. Food Microbiol. 2003, 20, 439−445. (36) FDA. Bacteriological Analytical Manual. Appendix 2: Most Probable Number from Serial Dilutions; Washington, D.C., 2010. (37) Chao, M.-T.; Lo, S.-H. Maximum likelihood summary and the bootstrap method in structured finite populations. Stat. Sin. 1994, 4, 389−406. (38) Rao, J. N. K.; Wu, C. F. J. Resampling inference with complex survey data. J. Am. Stat. Assoc. 1988, 83, 231−241. (39) Efron, B.; Tibshirani, R. J. An Introduction to the Bootstrap; Taylor & Francis, 1994. (40) Lui, R. Y. Bootstrap procedure under some non-i.i.d. models. Ann. Stat. 1988, 16, 1696−1708. (41) Crouhy, M.; Galai, D.; Mark, R. Risk Management; McGraw Hill: New York, 2011. (42) Hall, P.; Mammen, E. On general resampling algorithms and their performance in distribution estimation. Ann. Stat. 1994, 22, 2011−2030. (43) Smith, A. F. M.; Gelfand, A. E. Bayesian statistics without tears: A sampling-resampling perspective. Am. Stat 1992, 46, 84−88. (44) Thomas, A.; Hara, B. O.; Ligges, U.; S, S. Making BUGS Open. R News 2006, 12−17. (45) R Development Core Team R: A language and environment for statistical computing. http://www.R-project.org.
G
dx.doi.org/10.1021/es5035574 | Environ. Sci. Technol. XXXX, XXX, XXX−XXX