Demographic and Behavioral Modifiers of Arsenic Exposure Pathways

Jun 19, 2008 - (Journal of Exposure Analysis and Environmental Epidemiology, 2002, 12, 29–43). Using demographic information (e.g., gender or age) a...
1 downloads 12 Views 211KB Size
Environ. Sci. Technol. 2008, 42, 5607–5614

Demographic and Behavioral Modifiers of Arsenic Exposure Pathways: A Bayesian Hierarchical Analysis of NHEXAS Data THOMAS J. SANTNER,* PETER F. CRAIGMILE, CATHERINE A. CALDER, AND RAJIB PAUL Department of Statistics, 404 Cockins Hall, 1958 Neil Avenue, The Ohio State University, Columbus, Ohio 43210

Received September 17, 2007. Revised manuscript received February 28, 2008. Accepted March 18, 2008.

We introduce a Bayesian hierarchical statistical model that describessubpopulation-specificpathwaysofexposuretoarsenic. Our model is fitted to data collected as part of the National Human Exposure Assessment Survey (NHEXAS) and builds on the structural-equation-based analysis of the same data by Clayton et al. (Journal of Exposure Analysis and Environmental Epidemiology, 2002, 12, 29–43). Using demographic information (e.g., gender or age) and surrogates for environmental exposure (e.g., tobacco usage or the average number of minutes spent in an enclosed workshop), we identify subgroup differences in exposure routes. Missing and censored data, as well as uncertainty due to measurement error, are handled systematically in the Bayesian framework. Our analysis indicates that household size, amount of time spent at home, use of tapwater for drinking and cooking, number of glasses of water drunk, use of central air conditioning, and use of gas equipment significantly modify the arsenic exposure pathways.

1. Introduction In this paper, we propose a Bayesian hierarchical statistical model that describes subpopulation-specific exposure pathways of arsenic (As), a toxic heavy metal. Human exposure to As may cause irritation of the gastrointestinal and respiratory tracts and prolonged exposure is associated with melanosis, hyperpigmentation, depigmentation, hyperkeratosis, and skin cancer, and may also affect the nervous, cardiovascular, and hematopoietic systems (2). Given the range of adverse health outcomes associated with exposure to As, there is a need to better understand the mechanisms by which individuals become exposed and whether these mechanisms are identical for all groups of individuals. To further the understanding of As exposure mechanisms, we fit a subpopulation-specific pathways model to data collected as part of the National Human Exposure Assessment Survey (NHEXAS), a three-phase survey instrument that is described in refs 1, 3, and 4. We draw on the NHEXAS environmental exposure and biomarker measurements as well as NHEXAS’s demographic and behavioral information. Determining the relationship between levels of a toxic substance in environmental media and human exposure can be difficult due to the fact that certain media contribute both directly and indirectly to personal exposure. For example, * Corresponding author phone: 614.292.2866; fax: 614.292.2096; e-mail: [email protected]. 10.1021/es702338v CCC: $40.75

Published on Web 06/19/2008

 2008 American Chemical Society

assuming an individual both drinks and cooks with tapwater, consider their exposure to a toxic substance from the concentration of that substance in their food and tapwater, both of which can be directly measured. The level of the toxic substance in the individual’s food may also depend on the level in the individual’s drinking water indirectly because water is used in food preparation. As a result, the partial correlation between the level of the substance in the tapwater and the level in the biomarker of exposure, may be a misleading assessment of the contribution of tapwater to the total exposure. Indeed, the direct exposure route from drinking water to personal exposure may not be important if the individual drinks primarily bottled or filtered water, but the exposure route from tapwater via food may be of significance if tapwater is used for cooking. Due to the complexity of these sorts of relationships, standard multiple regression tools are not appropriate in studies of exposure pathways. Instead, hierarchical pathways models have been proposed as an alternative approach to analyze exposure data. Such models permit both direct and indirect links between toxic substances in media and biomarker-determined exposure to be explored. Using NHEXAS data from refs 1, 5, and 6, we developed exposure pathways models for As that describe dermal, inhalation, and ingestion exposure mechanisms. In their fundamental paper (1), Clayton, et al. proposed a linked system of linear regression equations, a structural equation model (SEM), that relates the exposure measurements collected in various environmental media and are used to explain variation in total As exposure (measured by the urine biomarker). Variation in the levels of As in the participants’ outdoor air and soil samples are used to explain the variation in the indoor air and surface dust samples. The soil, indoor air, and surface dust levels, together with the As levels in the participants’ tapwater, are used to explain the variation in the food and beverage samples and in the air that the participants breathe. Ultimately, the concentration of As in the participants’ urine (the biomarker of exposure) is related to levels of As in the various environmental media. Using maximum likelihood estimates of the coefficients in the component regression equations that comprise the SEM and the associated estimated standard errors (1), they identify the pathways whose coefficients differed significantly from zero. They fitted several versions of their pathways models and explored the impact of the units of the environmental media variables, including concentrations, intakes, and loadings measures. Our analysis uses the structure of their “reduced” model, which is diagrammed in Figure 3, as a baseline for exploring subpopulation differences in exposure pathways. The boxes in the figure represent the As measured in the various environmental media and in the biomarker; the arrows connecting the boxes indicate direct relationships that are quantified in the model. For example, this structure allows us to quantify the direct relationship between the concentration of As in soil (SL) and in urine (UR), as well as the indirect relationship between SL and UR via the pathway indoor air (IA) to beverage (BV). The Bayesian analyses of the NHEXAS data in refs 5 and 6 use a hierarchical pathways model; the model shown in ref 5 extends that of ref 6 by adding source exposures. Both refine the analysis in ref 1 in the way that missing data, measurements that fall below the instrument detection limit (below the minimum detection limit, or MDL), and measurement error are handled. Instead of imputing the missing and censored observations before fitting the model (by filling in missing data using “similar” observations and plugging in VOL. 42, NO. 15, 2008 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

5607

one half of the MDL, respectively), the Bayesian hierarchical framework is able to accommodate these types of observations in a coherent manner that acknowledges the uncertainty in the imputation and does not require removing participants with incomplete records from the analysis. In addition, the pathways structure is specified through latent variables corresponding to the “true” levels of As in the various media. By introducing these latent variables, the analysis accounts for uncertainty in the measurement of As levels in the statistical inferences. In this paper, we extend the reduced model of ref 1 (see Figure 3) using the Bayesian framework of refs 5 and 6 by identifying demographic and behavioral variables that modify the pathways of exposure for certain subgroups of the general population. We term these variables effect and pathway modifiers. Pathway modifiers are surrogates for the levels of As in environmental media that are not measured directly in the NHEXAS study. For example, the amount of fresh food that one consumes may influence the baseline level of As ingested via the pathway in connecting solid food (SF) and UR, since the amount of fresh food consumed may serve as a surrogate for the amount of food consumed that was grown using As-containing pesticides. Similarly, the indicator variable of whether an individual has used gas equipment during the past week may be a pathway modifier for the pathway linking IA to personal air (PA) because the use of such equipment may increase the As exposure due to inhalation. From a modeling viewpoint, a pathway modifier of the pathway linking medium A to medium B is any variable that changes the intercept in the regression relating A to B. An effect modifier is a variable that affects a pathway connecting medium A to medium B by changing the strength of the relationship between A and B. For example, the amount of time spent at home may be an effect modifier of the pathway linking IA to PA: an individual’s PA may depend more heavily on the individual’s IA the more time the individual spends at home. In the regression setting, an effect modifier of the pathway relating medium A to medium B is a variable that has an interaction with A. However, in keeping with principle of hierarchy that states interactions between variables should not occur without each individual variable in the model, we also add an intercept modification whenever the variable is an effect modifier. Thus it is possible for a variable to be either a pathway modifier or an effect modifier for the pathway connecting any As source to any As desination node, but it cannot act as both. By introducing effect and pathway modifiers into the analysis, we are able to use information collected on the NHEXAS participants that informs about As exposure, but is not a direct measurement of As in an environmental medium. As a result, we are able distinguish subgroups of the general population, defined by demographic and behavioral characteristics, whose routes of exposure differ from the typical (baseline) individual. As is found in the soil, water, and the atmosphere. It is both naturally occurring and anthropogenically generated. The principle route of exposure for the general population is oral, either through diet or drinking water. In the remainder of this section, we review material in the toxicology, environmental, and health-effects literature that describes both the naturally occurring and anthropogenic sources of As in order to motivate our selection of demographic and behavioral information from NHEXAS to use in our subpopulation-specific pathways analysis. We draw heavily on the following sources: the review in ref 7 on the toxicological effects of As and the summaries in refs 8 and 9 on the health effects of As. As occurs naturally in the groundwater and also in some mineral ores, especially in ores near geothermal activity (10). As a result, well water that is contaminated by natural sources, 5608

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 42, NO. 15, 2008

such as As-containing bedrock, has been reported to be a leading cause of As toxicity throughout the world (7). This suggests that users of well water can be at greater risk for As exposure and thus we consider the source of drinking water as a potential exposure pathway modifier. As can be found in soil, in wind-blown dirt, and in lava from volcanoes. It occurs naturally in seawater and vegetation and is released into the air in sea salt spray and forest fires. As is present in the atmosphere for numerous reasons; one important source is that naturally occurring As in coal and oil is released to the atmosphere by oil- and coal-based power plants (11). Worldwide, anthropogenic sources of As far exceed naturally occurring sources (10). Important anthropogenic sources of As are products and processes that use wood preservatives, pesticides (insecticides, herbicides, and algicides), and growth stimulants for plants and animals. In particular, runoff from pesticides is one reason that groundwater in some regions contains elevated As concentrations. Pesticides may also contribute to total As exposure via ingestion of food (in wine through grapes sprayed with Ascontaining pesticides and in seafood from bivalves, certain cold water and bottom-feeding finfish, and seaweed). Smokers may inhale additional amounts of As because of pesticide residue on tobacco leaves. Thus the (amount of) tobacco use may be a pathway modifier. Lastly, in the United States, municipal waste incineration and nonferrous metal smelting are additional anthropogenic sources of As which suggests further pathway modifiers based on the distance to trash burning facilities or to smelting plants and on one’s occupation (12). In the U.S., As production has greatly decreased in the recent past, however, non-U.S. production has increased steadily. Based on 1990 data, ref 13 reported that the total emissions of As in the contiguous 48 states, excluding road dust or wind-blown dust from construction or agricultural tilling, to be approximately 3.0 tons/day with 90% of emissions coming from point sources and 5% from both area and mobile sources. The remaining sections of the paper are organized as follows. Section 2 describes the NHEXAS data, including the demographic and behavioral characteristics of individuals that we use to explore differences in exposure pathways. Section 2 also contains a description of our subpopulationspecific Bayesian pathways model and includes details on the handling of missing and censored observations, the choice of prior specifications, and the Markov Chain Monte Carlo algorithm used to fit the model. Section 3 describes inferences derived from our model, with particular emphasis on the important subgroups whose pathways differ from the general population, and includes a comparison of our results to those in ref 1.

2. Materials and Methods 2.1. Data. All of the data used in our analysis come from the NHEXAS study which is described in detail in ref 4. We analyze data from the NHEXAS participants from EPA’s Region 5 (Illinois, Indiana, Ohio, Michigan, Minnesota, and Wisconsin); our data set consists of 174 individuals from 32 counties. During our model-building exercise, we started with a Bayesian analysis of the model proposed in ref 1 as our baseline model. For each medium, the observations were transformed to the natural logarithm scale, and then standardized to have mean zero and variance one. The minimum detection limits were transformed accordingly. We then extended this baseline model to adjust for variable dilutions among the participants’ urine samples. This was done by controlling for levels of creatinine and several demographic variables (age, gender, race, body mass index), following the study detailed in ref 14 that demonstrated variability in the normal ranges of creatinine among different populations. Then, we examined a number of demographic

TABLE 1. Definitions of Variables Used in Various Models and Their Roles (Depending on the Model) as Media (M), Components of the Creatinine Adjustment (C), and Either E = Effect Modifier or P = Pathway Modifiers variable

type

definition and units

UR FD BV IA OA PA SL SD WF mlogcrtn male child 4-11 hhold hrs.home glass tap.c fruit tap.d no.c.air gasequip tobacco work20 no.fish no.wellwater

M M M M M M M M M C C, E C, E E, P E E P P E E P P P P P

natural log of arsenic concentration in urine (µg/L) natural log of arsenic intake in food (µg/day) natural log of arsenic intake in beverage (µg/day) natural log of arsenic concentration in indoor air (ng/m3) natural log of arsenic concentration in outdoor air (ng/m3) natural log of arsenic concentration in personal air (ng/m3) natural log of arsenic concentration in soil (ng/g) natural log of arsenic loading in surface (sill) dust (µg/cm2) natural log of arsenic concentration in tapwater, flushed (µg/L) negative of the natural log of creatinine (µg/L) binary (0/1) variable indicating the participant is male binary (0/1) variable indicating participant is between 4 and 11 yrs (inclusive) household size average number of hours participant spends inside home/day average number of glasses of water participant drinks/day binary (0/1) variable indicating participant cooks with tapwater binary (0/1) variable indicating participant eats fruit g 3 days/three-months binary (0/1) variable indicating participant drinks tapwater binary (0/1) variable indicating participant’s home does not have central air binary (0/1) variable indicating participant used gas equipment in the past week binary (0/1) variable indicating the participant is a current smoker binary (0/1) variable indicating participant works g 20 min/day in a workshop binary (0/1) variable indicating participant eats fish less than once/month binary (0/1) variable indicating the source of running water is well water

and behavioral variables as potential pathway and/or effect modifiers. A listing of all variables used in the final model of Table 3, together with their designated use as an effect or pathway modifier (depending on the pathway and whether for the reduced or full model) is given in Table 1. 2.2. Hierarchical Pathways Model. The Bayesian hierarchical pathways model described below allows us to identify differential As exposure pathways corresponding to various subpopulations. As discussed above, this approach allows us to accommodate the large amount of missing and censored observations in the NHEXAS data in a straightforward manner. In addition, we can capture explicitly the component of variation in the biomarker and environmental media measurements due to measurement error, which results from instrument imprecision and sampling variation (see Section 4 of the Supporting Information). The key contribution of our extension is to incorporate effect and pathway modifiers in the (unobserved) process equation. For each individual i and medium j, the model includes stages for the observed log concentrations (or loadings or intakes) in terms of the unknown “true” standardized log concentration of As in that individualspecific medium. First we define the data model for the observed data, then the process model for the true but unknown standardized log concentration (or loading or intake) which is of primary interest, and last the prior distributions for model parameters. 2.2.1. Model Notation. Throughout, let X denote the joint distribution for the multivariate object X, and [X|Y] denote the conditional distribution of X given Y. In addition we use the following symbols. I: number of individuals. J: number of process variables. L: number of pathway modifier covariate variables. M: number of effect modifier covariate variables. Xij: unknown “true” standardized log concentration of As in medium j for individual i. X: process variable matrix of dimension I × J. Y: data variable matrix of dimension I × J. Z: data observation state matrix of dimension I × J.

P: pathway modifier covariate variable matrix of dimension I × L. E: effect modifier covariate variable matrix of dimension I × M. ω: vector of measurement error precisions of length J. τ: vector of process error precisions of length J. β: regression parameters that relate the mean of each process variable to potentially other process variables, pathway modifier covariates, and effect modifier covariates. 2.2.2. Data model. Let N(µ, σ2) denote the normal distribution with mean µ and variance σ2, and CN(µ, σ2; c) denote the left-censored normal distribution with mean µ and variance σ2 that is censored at c. For each observation Yij, we let the (observation status) variable Zij be 0 if the observed value is below the MDL value Mij, or be 1 if Yij is above the MDL, or be 2 if Yij is missing. Then given the latent log concentration process and measurement errors, the data model is J

[Y|X, ω] )

∑ ∏

[Yij|Xij, ωj]

j)1 {i:zij)0 or 1} J

)

∏ [ ∏ CN(X , I ⁄ ω ;M ) × ∏ N(X , 1 ⁄ ω )] ij

j)1

j

ij

{i : zij)0}

ij

j

{i : zij)1}

Here ωj is the “precision” (1/variance) for the normal or censored normal distribution for Yij, as the case may be. 2.2.3. Process Models. Before describing the process model, we note that hierarchical pathway models, as described here, are like “classical” measurement error models in that they assume multiple stages and that the regressors at each stage are measured with error—see (15). The measurement error structure is made explicit in prior model stages. The model for the true (latent) process given its parameters β and τ is VOL. 42, NO. 15, 2008 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

5609

J

[X|β, τ] ) ∏ [Xj|{XK : k * j}, βj, τj]

TABLE 2. The RSD Values and Associated Measurement Error Precisions, ωj, for the Media That We Consider in Our Paper

j)1

I

J

)∏

j)1 J

)∏

j)1

∏ [X |{x ij

jK : k * j}, βj, τj]

i)1 I

∏ N(µ , 1 ⁄ τ ) ij

j

i)1

X },{βP},{βE},{βEX})T is a vector of length r ) where βj ) (Rj,{βjk jl jl jm j X + NP + 2NE. We assume the conditional mean of X 1 + Nj j j ij

is NjX

µij ) Rj +

∑ k)1

Njp

βXjkXisjkX +

∑ l)1

media

RSDj

ωj

air beverage dust food soil urine water

10% 10% 10% 10% 10% 15% 5%

893.9940 893.9940 893.9940 893.9940 893.9940 393.9863 3593.9985

NjE

βjlpPisjlp +

∑E

E EX E [βjm + βjmXjsEX] isjm im

m)1

(1) In eq 1, {Pik} and {Eik} are the sets of pathway and effect modifiers, respectively, associated with subject i. The term Rj is the baseline intercept in the pathway (regression) leading to medium j. The second term defines the main-effect relationships between the log As concentrations observed in X denotes other media and the mean of Xij. The coefficient βjk the strength of the linear association between Xij and Xi,sjkX X is an integer not equal to j, that denotes the specific (here sjk medium being included as a covariate in the regression equation). In the third term, βjlP defines the strength of the linear relationship between the pathway modifier Pi,sjkP for P is an integer that selects the appropriate the medium Xij (sjk pathway modifier). In the final term, βEX jm describes the strength EX for the of the effect modifier Ei,sjkE on the process variable Xi,sjm E is an integer that selects the appropriate medium Xij (sjm EX is an integer not equal to effect-modifier covariate, and sim j, that selects the modified process variable). The regression E E , E can be viewed as the hierarchical term that is term βjm i sjm E is added because of the presence of the interaction term Ei,sjm EX. × Xj,sim 2.2.4. Priors. We first discuss specification of ω, then our choice of priors for β and τ. The NHEXAS data set includes relative standard deviation (RSD) values that are media specific. The RSD is the absolute value of the coefficient of variation. We set the measurement precision ωj in our model based on the RSD values reported for each media as follows. Based on the definition of RSD for media j, RSDj, we assume that,

proper but that it is not substantively influenced by the prior assumption, see ref 16. In specifying prior distributions, we assume conditionally conjugate forms for computational convenience. The prior distribution assumed for β in the process model is J

[β] )

∏ j)1

log(Yij) ) log(Xij) + log(λij) where log (λij) is also a random quantity, but now bounded on [log(Uj), log(Vj)]. Assuming that log(λij) ∼ N(0, 1/ωj), we obtain 6 ≈ log(Vj) - log(Uj). ωj Rearranging the above formula we define ωj to be ωj )

6 . log(Vj) - log(Uj)

The RSD values and associated measurement error precisions, ωj, for the media that we consider in our paper are shown in Table 2. Our philosophy in selecting priors for β and τ was to assume vague, but proper distributions. Such an assumption guarantees that the posterior distribution is 5610

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 42, NO. 15, 2008

∏ MVN (m , Σ ) rj

j

j

j)1

where β ) (βT1 ,. . ., βjT)T. Here MVNp(.,.) denotes the pdimensional multivariate normal probability density function, and for j ) 1, . . ., J, mj is a known mean vector of length rj, and Σj is a known rj × rj positive definite covariance matrix. In our analysis we used mj ) 0 for each j to indicate no prior belief about the sign of the βj parameters. Since we are modeling standardized data, the prior covariance matrix, Σj, was a diagonal matrix in which the first value in the diagonal (corresponding to the intercept parameter) was set at a very small number (10-6) to tie the prior mean to zero (again because the standardized data had sample mean zero), and the remaining diagonal terms were set to a large value to indicate lack of prior information about the variance of the slope parameters in the model. The results were not sensitive to adding correlation to the βj parameters, or changing the (diagonal) variance terms for the slope parameters. For the process precisions, τ, we assumed that J

[τ] )

∏ j)1

J

[τj] ∼

∏ Ga(c , d ) j

j

j)1

for fixed constants {cj} and {dj} where Ga(c, d) denotes the probability density function

Yij ) Xijλij where Xij is the latent value of the concentration for individual i in media j, Yij is the associated observed value, and λij is a random quantity bounded on the interval [Uj, Vj], with Uj ) 1 - RSDj/100 and Vj ) 1 + RSDj/100. Then

J

[βj] ∼

f(x) )

dc c-1 x exp(-dx)I(x g 0) Γ(c)

In the absence of information about the distribution of τj, we set cj ) dj ) 0.001 for all j. The results did not change for the more informative choice of cj ) dj ) 1. 2.3. The Bayesian Model Fitting. In the Bayesian paradigm, all inferences are based on the posterior distribution of the parameters given the data. From Bayes’ Theorem, the posterior distribution is proportional to the product of the likelihood of the data and the prior density of the unknown parameters. As in most complex applications of Bayesian statistics, the joint posterior distribution is not available in closedform. Thus we used a simulation-based method known as Markov Chain Monte Carlo (MCMC) to draw samples from the posterior distribution (for example: refs 17 and 18. An MCMC algorithm constructs a Markov chain that, with sufficient sampling, produces samples from the posterior distribution. Summary statistics of these samples can then be used for inferential purposes: for example, the mean of the samples provides a point estimate of the center of the posterior distribution and is often used as a Bayesian analogue of the maximum likelihood estimator. In more detail, we

TABLE 3. For the Baseline Model, the Left-Most Column (SEM coef) Gives the Estimated Regression Coefficients from the Clayton et al. (2002) SEM Likelihood Analysis and the Clayton et al. Designations of the Magnitude of the Absolute Value of the Associated t-Statistics, and For Each of the Three Models, the Three Columns Labeled Pmean, Psd, and PPβ+ Are the Posterior Means, Posterior Standard Deviations, and Posterior Probability of the Coefficient Exceeding Zero for Each Regression Coefficient from Our Bayesian Analysis of That Pathways Modela variables Y

UR

FD

BV

IA

OA

PA

SL SD WF a

baseline model

final model-full

explanatory

SEM coef

Pmean

Psd

PPβ+

PA SD SL FD BV mlogcrtn male child4to11 hrs. home hhold male*PA IA SD WF tap.c no.fish fruit tap.c*WF IA SD WF tap.d glass tap.d*WF glass*WF OA SL hhold no.c.air no.c.air*OA hhold*OA Intercept OA IA tobacco gasequip hrs.home work20 hrs.home*OA hrs.home*IA intercept IA no.wellwater

0.06 0.09 -0.01 0.30 ** -0.02

0.069 -0.138 0.443 0.237 -0.102

0.161 0.117 0.190 0.135 0.137

0.664 0.113 0.991 0.959 0.228

0.18 / -0.06 0.11

0.107 -0.051 0.070

0.099 0.080 0.090

0.856 0.265 0.776

0.09 -0.18 / 0.24 **

-0.039 -0.057 0.257

0.091 0.074 0.087

0.330 0.217 0.998

0.58 /// 0.08

0.641 -0.056

0.097 0.147

1.000 0.437

0.06 0.50 **

-0.001 -0.129 0.702

0.010 0.144 0.110

0.481 0.182 1.000

0.04

-0.002 0.305

0.010 0.135

0.403 0.989

Pmean 0.210 -0.189 0.267 0.274 -0.238 0.208 0.132 0.176 0.046 -0.084 -0.291 0.117 -0.011 -0.394 -0.277 0.280 -0.009 0.490 -0.085 -0.037 0.311 0.233 -0.084 -0.269 0.070 0.745 -0.018 0.018 0.124 0.072 -0.044 0.000 0.817 -0.295 -0.045 0.543 -0.004 0.456 -0.058 0.062 -0.002 0.262 0.047

final model-reduced

Psd

PPβ+

Pmean

Psd

PPβ+

0.189 0.115 0.200 0.178 0.159 0.128 0.265 0.512 0.032 0.094 0.301 0.098 0.081 0.322 0.195 0.194 0.251 0.333 0.096 0.083 0.283 0.130 0.030 0.248 0.046 0.233 0.088 0.024 0.116 0.190 0.055 0.010 0.767 0.657 0.150 0.277 0.005 0.195 0.046 0.040 0.010 0.129 0.082

0.870 0.046 0.910 0.935 0.069 0.947 0.690 0.632 0.927 0.183 0.164 0.884 0.458 0.111 0.077 0.922 0.488 0.932 0.189 0.324 0.868 0.964 0.002 0.140 0.936 0.999 0.453 0.761 0.857 0.654 0.204 0.489 0.850 0.321 0.384 0.979 0.215 0.991 0.110 0.939 0.431 0.976 0.714

0.092 -0.249 0.292 0.258 -0.216 0.208 0.108

0.185 0.118 0.214 0.159 0.149 0.122 0.278

0.707 0.016 0.900 0.947 0.073 0.960 0.645

0.043 -0.069

0.034 0.088

0.900 0.215

0.111 -0.015 -0.451 -0.284 0.272

0.097 0.072 0.319 0.182 0.186

0.874 0.413 0.076 0.061 0.927

0.548 -0.082 -0.049 0.312 0.230 -0.079 -0.255 0.067 0.623 0.035 0.026 0.114

0.332 0.093 0.076 0.284 0.124 0.028 0.256 0.045 0.095 0.172 0.026 0.111

0.951 0.191 0.259 0.863 0.967 0.002 0.160 0.931 1.000 0.540 0.837 0.848

-0.001 -0.067 0.087

0.010 0.128 0.453

0.446 0.307 0.577

0.412 -0.006 0.440

0.269 0.005 0.193

0.937 0.106 0.988

0.036 -0.002 0.246 0.050

0.026 0.010 0.126 0.083

0.913 0.424 0.977 0.731

*** indicates t exceeds 3, ** indicates t between 2 and 3, * indicates t between 1 and 2.

used a Gibbs sampler, a particular type of MCMC algorithm, to iteratively sample the distribution of each of the parameter conditional on the data and on the current value of the other model parameters. The algorithm can easily handle missing data and observations below the MDL by treating them as unknown parameters in the model and drawing “imputed values” from their sampling distributions. We implemented our Gibbs sampler in the R statistical computing environment (19). Using standard convergence diagnostics for MCMC algorithms (17), we deemed that a “burn-in” of 10,000 iterations was sufficient for the algorithm to produce samples from the posterior distribution. We ran the algorithm for an additional 50,000 iterations, subsampling every tenth iteration to reduce the autocorrelation in the posterior samples. This produced 5,000 nearly uncorrelated draws from the posterior distribution on which to base statistical inferences. See the Supporting Information for further details on our model-fitting algorithm.

In our analysis, the slope parameters that link the log As measurements for different media and the various effect and pathway modifiers are of primary interest. Using the draws from the posterior distribution given the data, we estimated the posterior mean and standard deviation of each slope parameter using the approach discussed above. In addition, we estimated the posterior probability of each slope coefficient being positive, which can be viewed as a Bayesian alternative to the frequentist P-value obtained from the onesided test of the null hypothesis that the slope parameter is equal to zero.

3. Results and Discussion In order to understand the implications of our Bayesian approach, we first compare our baseline model of Figure 3 with the maximum likelihood estimates of the analogous model given in the first column of Table 3. In the analysis of (1), values below the MDL were replaced by one-half of VOL. 42, NO. 15, 2008 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

5611

FIGURE 1. Histograms of the posterior distribution of the standardized log As in urine concentrations for three subjects who have data values below the MDL. The vertical black lines denote one-half the MDL. the MDL, and missing values were imputed using values spatially nearby, or from similar media (e.g., using an outdoor air measurement for a missing indoor air measurement). In our analysis, all censored or missing values are considered unknown parameters within the Bayesian framework. To illustrate the result of regarding unknown measurements as censored values in the Bayesian analysis, Figure 1 displays histograms of the posterior standardized distribution of the log concentration of As in UR for three subjects who had values below the MDL. The vertical lines denote the value of one-half the MDL for each subject. This figure shows that the plug-in value for the censored observations does not adequately capture the uncertainty caused by censoring. The Bayesian fit is shown in the columns 2-4 of numerical values in Table 3 (under the baseline model columns). Overall, we find good agreement between the likelihood and Bayesian analyses of this model with some interesting second-order differences. As an example of agreement, for the pathway whose terminal node is As in beverages (BV) has an estimated coefficient of 0.257 for As in flushed water (WF) compared with an estimate of 0.237 obtained from the likelihood analysis. Both analyses show that the data provide strong evidence of a positive relationship between As in outdoor air (OA) and As in indoor air (IA) and between IA and As in personal air (PA). As an example of the inferential enhancements of the Bayesian model, consider the paths in the Figure 3 model diagram leading to As in urine (UR). The Bayesian analysis finds strong evidence of a positive relationship between As in urine and both As in soil (SL) and As in solid food (FD); the likelihood analysis supports a positive relationship only for FD. The Bayesian estimate of the coefficient of SL is both positive (scientifically plausible) and strong (has posterior probability exceeding 0.99 of being positive). Extending the baseline model of Figure 3, we used scientific knowledge from the literature variables and empirical statistical significance (posterior probability of the parameter being greater than zero) to identify potential effect and pathways modifiers. The entire set of additional variables considered and the pathways in the baseline model that they modify can be obtained from the Supporting Information. In our analysis, all variables were defined so that positive slopes indicated the direction of the effect suggested by the scientific literature or by intution. To perform variable selection among candidate covariates (and thus model selection), we examined the magnitude of the posterior probability of the slope parameter being positive for each variable. A large value of this posterior probability provided empirical evidence for retaining the variable, whereas a small value indicated no evidence of the effectiveness of the variable. We also based variable selection and classification (as a PM or EM) on the scientific information which we obtained from the exposure literature survey. For example, for deriving the “reduced” final model from the “full” final model in Table 3, we retained gender (male) in the urine 5612

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 42, NO. 15, 2008

pathway based on its prominence in the exposure-science literature as an important covariate and also its usefulness in creatinine adjustment. Similar reasoning was used in the model evolution shown for models 1-44 in the Supporting Information. We note that the use of the posterior probability of the positiveness of a regression parameter is not the only method for variable selection that has been proposed in the Bayesian literature. Other well-known techniques for model selection are the Bayes information criterion (BIC), the deviance information criterion (DIC), and the comparison of Bayes factors among other methods. While we do not believe that there is a single “best” model, some of the models we fit were clearly inferior to others, based on their lack of scientific interpretability and/or lack of empirical support. The middle three columns of Table 3 present the Bayesian fit of a model that was selected for presentation on the basis of its scientific credibility and empirical evidence (this model corresponds to model 44 in the Supporting Information). The right-most three columns of Table 3 present the Bayesian fit of a simplified version of this model. Both of the models fitted in the numerical columns 5-7 and 8-10 in Table 3) support the use of the creatinine adjustment advocated in ref 14 (with posterior probabilities of a positive coefficient for log creatinine are 0.95 and 0.96, respectively) and related demographic adjustments (gender and being a child). They also suggest that individuals who spend more hours in the home (hrs. home) have greater levels of As exposure. The As that is consumed in food, FD, is strongly related to the biomarker of exposure, and there are several empirically significant pathway modifiers of the pathway leading to FD. The use of tapwater for cooking is an effect modifier for the amount of As in flushed water (WF): the regression coefficient of -0.451 for flushed water in the reduced model is modified to +0.096 when tapwater is used for cooking. Individuals who eat fish less than once a month (a pathway modifier) tend to have greater levels of As in their food. This result appears to run counter to conventional scientific wisdom, but the contrast group contains not only heavy fish consumers but those who eat it in moderation and the effects we are measuring are more likely due to an unobserved measure of a healthy diet. The pathway whose terminal node is BV also has empirically important effect and pathway modifiers. If tapwater is used for drinking, i.e., tap.d is a pathway modifier, then the baseline amount of BV is increased. In addition, the more water an individual drinks, the stronger the relationship between WF and BV. The pathway leading to IA has two effect modifiers of moderate significance: the number of household members, hhold, and the presence of central air conditioning in the home, no.c.air. The larger the number of household members, the greater the baseline concentration of As in the household’s IA. Similarly, households that do not have central air conditioning have greater baseline As concentration levels in their IA. These findings are plausible since larger households may draw more As inside from the outdoors, whereas central air conditioning filtering systems may lower the concentration of As in indoor air. Lastly, the pathway whose terminal node is PA has three empirically interesting pathway and effect modifiers. The first of these is the use of gas equipment; participants who use gas equipment once a week or more appear to have higher concentrations of As in their PA. The second subgroup with a modified pathway leading to personal air consists of individuals who spend 20 min or more per day in workshops: members of the subgroup appear to have greater concentrations of As in their PA than the general population. Lastly,

FIGURE 2. Boxplots of the posterior distribution of the percentage change in the As concentration (on the original nonlog scale), for 1 and 2% increases in the As concentrations in personal air, soil, and food, respectively, keeping all other factors fixed.

FIGURE 3. The baseline structure of the Bayesian pathways model considered in this paper. The boxes represent the environmental media and biomarker of exposure measured in the NHEXAS study; the arrows connecting the boxes specify direct relationships. the number of hours that an individual spends at home appears to be an effect modifier for the concentration of As in indoor air. This finding is certainly plausible since the more time someone spends indoors, the greater the impact of IA on PA. Because our data is an approximate probability sample of Region 5, we can make some inference about the percentage change in As in urine in this population. Figure 2 displays the posterior distribution of the predicted percentage change in urine (on the original nonlog scale), for 1 and 2% increases in the As concentrations in personal air, soil, and food, respectively, keeping all other factors fixed. Note that this percentage is different for each individual and is affected by the uncertainty in the association between each media and urine. The more uncertain the relationship, the more uncertain the percentage change. We can see that the percentage change tends to be positive for 1 and 2% increases in all three media (personal air, soil, and food), and tends to be largest for an increase in food. This is consistent with previous work that diet is the primary route of As exposure. Specifically, we have a posterior mean percentage change in As urine of 0.37% for a 1% increase in food, versus a 0.28% change for a 1% increase in soil and a 0.15% change for a 1% increase in personal air). The greatest uncertainty in the percentage change among the media occurs in personal air. The As concentrations in soil, groundwater, and outdoor air vary spatially. While we do not have the exact spatial

locations of each subject, we do have their county ID. To investigate the possibility of spatial dependence, we considered an extension of Clayton’s baseline model in which the mean log As concentration in soil, flushed water, and outdoor air were allowed to vary by county. We found that the MCMC algorithm that is used to provide the posterior distribution of the county-level-mean of the standardized log As in soil did not converge; this is most likely because there is very little observed soil data above the MDL; the Bayesian model is essentially unidentifiable. Removing the county-varying mean for soil, improved the fit and covergence of the model. For a model having only county-specific flushed water and outdoor air means, we found evidence that the mean standardized log concentration of As is significantly higher for some counties, relative to others. However the relationships between the media did not change appreciably. Given the issues of identifiablity of the model (there are 32 county-level means per media that must be modeled and there are only 174 individuals), we elected to present the simpler (single mean) models in this article. In summary, we have shown the Bayesian statistical paradigm provides a framework amenable to the study of pathways of exposure because it can validly accommodate missing and censored observation, it incorporates measurement errors, and it allows differences among subpopulations through the inclusion of effect and pathway modifiers. For the toxic heavy metal As, we have demonstrated that household size, amount of time spent at VOL. 42, NO. 15, 2008 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

5613

home, use of tapwater for drinking and cooking, the amount of water consumed, the use of central air conditioning, and the use of gas equipment all significantly modify the pathways of exposure to As. We believe that these methods and findings should prove useful in future investigations of subpopulation differences in exposure to As and other toxics.

Acknowledgments This research was supported by the American Chemistry Council’s (ACC) Long-Range Research Initiative under an agreement between the ACC and the Ohio State University Research Foundation. The agreement was a result of a submission to the Environmental Protection Agency’s (EPA) FY2003 STAR Grant program, in response to an RFA titled, “Environmental Statistics Research: Novel Analyses of Human Exposure Related Data”, which was jointly funded by the EPA’s National Center for Environmental Research (NCER) and the ACC. We thank Noel Cressie, Bruce E. Buxton, Hongfei Li, Nancy J. McMillan, Michele Morara, Ke Wang, and Jian Zhang for helpful discussions of the material described in this paper. We would also like to thank the referees for suggestions that both improve the presentation and strengthen the results of the paper.

Supporting Information Available Further details on the model and data analysis. This material is available free of charge via the Internet at http:// pubs.acs.org.

Literature Cited (1) Clayton, C. A.; Pellizzari, E. D.; Quackenboss, J. J. National human exposure assessment survey: Analysis of exposure pathways and routes for arsenic and lead in EPA Region 5. J. Exposure Anal. Environ. Epidemiol. 2002, 12, 29–43. (2) WHO. Arsenic, Environmental Health Criteria 18; World Health Organization: Geneva, 1981. (3) U.S. Environmental Protection Agency. NHEXAS Exposure Factors Descriptive Analysis; National Center for Environmental Assessment, Environmental Protection Agency: Washington, DC, 2003; Contract no. 68-C-00-122, Peer Review, Draft Report. (4) Pellizzari, E.; Lioy, P.; Quackenboss, J.; Whitmore, R.; Clayton, A.; Freeman, N.; Waldman, J.; Thomas, K.; Rodes, C.; Wilcosky, T. Population-based exposure measurements in EPA Region 5: A phase I field study in support of the National Human Exposure Assessment Survey. J. Exposure Anal. Environ. Epidemiol. 1995, 5, 327–358.

5614

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 42, NO. 15, 2008

(5) McMillan, N. J.; Morara, M.; Young, G. S.; Hierarchical Bayesian modeling of human exposure pathways and routes. ASA Proceedings of the 2006 Joint Statistical Meetings; American Statistical Association: Alexandria VA, 2006; pp. 2492-2503. (6) Cressie, N.; Buxton, B. E.; Calder, C. A.; Craigmile, P. F.; Dong, C.; McMillan, N. J.; Morara, M.; Santner, T. J.; Wang, K.; Young, G.; Zhang, J. From sources to biomarkers: a hierarchical Bayesian approach for human exposure modeling. J. Stat. Planning Inference 2007, 137, 3361–3379. (7) Fan, A. Assessement of metals in drinking water with specific references to lead, copper, arsenic, and selenium. In Toxicology of Metals; Chang, L. W., Magoes, A. E. L., Suzuki, T., eds.; CRC Press: Boca Raton, FL, 1996; pp 39-53. (8) U.S. Environmental Protection Agency. Special Report on Ingested Arsenic: Skin Cancer; Nutritional Essentiality, EPA/625/ 3-87/013; National Center for Environmental Assessment, Environmental Protection Agency: Washington, DC. 1984. (9) U.S. Environmental Protection Agency. Air Quality Criteria for Lead, EPA-600/8-83/028aF; Environmental Criteria and Assessment Office, Environmental Protection Agency: Research Triangle Park NC, 1986; Vol I-IV. (10) ATSDR. Toxicological Profile for Arsenic; Agency for Toxic Substances and Disease Registry, Public Health Services, U.S. Department of Health and Human Services: Washington, DC, 2007. (11) Pacyna, J. Atmospheric emissions of arsenic, cadmium, lead and mercury from high temperature processes in power generation and industry. In Lead, Mercury, Cadmium and Arsenic in the Environment; Hutchinson, T., Meema, K., Eds.; John Wiley & Sons Ltd.: New York, 1987; pp 69-87. (12) Garcia-Vargas, G. G.; Cebri′an, M. E. Health effects of arsenic. In Toxicology of Metals; Chang, L. W., Magoes, L., Suzuki, T., Eds.; CRC Press: Boca Raton, FL, 1996; pp 39-53. (13) Rosenbaum, A.; Axelrad, D.; Woodruff, T. National estimates of outdoor air toxics concentrations. J. Air Waste Manage. Assoc. 1999, 49, 1138–1152. (14) Barr, D.; Wilder, L.; Caudill, S.; Gonzalez, A.; Heedham, L.; Pirkle, J. Urinary creatinine concentrations in the U.S. population: Implications for urinary biologic monitoring measurements. Environ. Health Perspect. 2007, 113, 192–200. (15) Carroll, R. J.; Ruppert, D.; Stefanski, L. A.; Crainiceanu, C. M., Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed.; Chapman and Hall/CRC: New York, 2006. (16) Berger, J. O. Statistical Decision Theory and Bayesian Analysis; Springer-Verlag: New York, 1985. (17) Gelman, A.; Carlin, B. P.; Stern, H. S.; Rubin, D. B. Bayesian Data Analysis; Chapman & Hall, London, 1995. (18) Chen, M.-H.; Shao, Q.-M.; Ibrahim, J. G. ; Monte Carlo Methods in Bayesian Computation; Springer-Verlag: New York, 2000. (19) R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2007; ISBN 3-900051-07-0.

ES702338V