Article pubs.acs.org/est
High Throughput Heuristics for Prioritizing Human Exposure to Environmental Chemicals John F. Wambaugh,*,† Anran Wang,†,§,∥ Kathie L. Dionisio,‡ Alicia Frame,†,∥ Peter Egeghy,‡ Richard Judson,† and R. Woodrow Setzer† †
National Center for Computational Toxicology, and ‡National Exposure Research Laboratory, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, North Carolina 27711, United States § North Carolina State University, Department of Statistics, Raleigh, North Carolina 27695-8203, United States ∥ Oak Ridge Institute for Science and Education Grantee, P.O. Box 117, Oak Ridge, Tennessee 37831-0117, United States S Supporting Information *
ABSTRACT: The risk posed to human health by any of the thousands of untested anthropogenic chemicals in our environment is a function of both the hazard presented by the chemical and the extent of exposure. However, many chemicals lack estimates of exposure intake, limiting the understanding of health risks. We aim to develop a rapid heuristic method to determine potential human exposure to chemicals for application to the thousands of chemicals with little or no exposure data. We used Bayesian methodology to infer ranges of exposure consistent with biomarkers identified in urine samples from the U.S. population by the National Health and Nutrition Examination Survey (NHANES). We performed linear regression on inferred exposure for demographic subsets of NHANES demarked by age, gender, and weight using chemical descriptors and use information from multiple databases and structure-based calculators. Five descriptors are capable of explaining roughly 50% of the variability in geometric means across 106 NHANES chemicals for all the demographic groups, including children aged 6−11. We use these descriptors to estimate human exposure to 7968 chemicals, the majority of which have no other quantitative exposure prediction. For thousands of chemicals with no other information, this approach allows forecasting of average exposure intake of environmental chemicals.
■
INTRODUCTION
Because of the ubiquitous use of chemicals in our modern society, people are potentially exposed to thousands of chemicals present in the home, workplace, and other surroundings. Exposures can occur through multiple pathways via air, water, food, and soil. A recent, nontargeted screen of pooled human plasma samples identified the presence of more than 2000 potentially anthropogenic compounds.1 Out of several hundred chemicals specifically targeted for investigation in the 2005 Centers for Disease Control and Prevention (CDC) NHANES study, 148 synthetic chemicals and pollutants were detected in the blood and urine of a representative sample of the U.S. civilian population.2 The synthetic chemicals and pollutants detected in the majority of individuals tested included acrylamides, cotinine, trihalomethanes, phenolic compounds (i.e., bisphenol A, triclosan, benzophenone), phthalates, chlorinated pesticides, organophosphate pesticides, pyrethroids, heavy metals, aromatic hydrocarbons, polybrominated diphenyl ethers, perfluorocarbons, and a host of polychlorinated biphenyls and solvents.3 There are currently ∼80 000 chemicals registered in the U.S. under the Toxic Substances Control Act, 700 to 1000 chemicals entering commerce every year, and an estimated 30 000 chemicals in wide commercial use.4−7 A concern facing regulatory agencies and public health advocates is the potential adverse effects of these chemicals on human health and the environment, as only a relatively small subset of these chemicals has been sufficiently well characterized.4−6 To address this concern, the U.S. government initiated the Toxicity Forecaster (ToxCast) program at the Environmental Protection Agency (EPA) and formed the interagency Tox21 consortium involving the EPA, the National Institutes of Health, and the Food and Drug Administration. The ToxCast program and Tox21 consortium have evaluated over 8000 chemicals using a broad range of in vitro high-throughput screening (HTS) assays to identify potential hazards through interactions with proteins, pathways, and cellular processes.8 In order to provide the complementary exposure science to assess risk for these and other chemicals, the EPA and U.S.
Received: July 23, 2014 Revised: October 7, 2014 Accepted: October 13, 2014
This article not subject to U.S. Copyright. Published XXXX by the American Chemical Society
dx.doi.org/10.1021/es503583j | Environ. Sci. Technol. XXXX, XXX, XXX−XXX
Environmental Science & Technology
Markov Chain Monte Carlo (MCMC) allow the numerical solution of problems that might be intractable with other statistical approaches.17 We selected this approach because we wish to consider the full range of plausible exposures indicated by a given analyte in urine (e.g., multiple chemicals can produce the same analyte compound) and because biomonitoring data is often confounded by limits of detection (i.e., the concentration of analyte is only known to be below a certain limit). Bayesian analysis allowed us to account for uncertainty and variability within the NHANES data, relate urine analytes to parent chemical exposures, estimate uncertainty in our regression coefficients, and to propagate that uncertainty into our predictions. Here we present a high throughput heuristics methodology that can prioritize environmental chemicals based on human exposure. For demographic groups reported by NHANES we used linear regression models on chemical descriptors that could be rapidly and automatically gleaned from databases or predicted from chemical structure. Separate models for demographic groupings, stratified by gender, age, or body mass index (BMI), allow for group-specific prediction of exposure potential. We then investigate whether or not we can identify differences in the heuristics between these groups or whether a common set of heuristic factors can explain variation in chemical exposure for all demographic groups.
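The treatment of measurements below a limit of detection can be illustrated with a censored likelihood. The following is a minimal sketch only, assuming a log-normal population distribution with log-scale mean `mu` and standard deviation `sigma`; the function names and the example values are hypothetical, and survey weights are ignored here.

```python
import math


def normal_logpdf(z):
    """Log density of the standard normal distribution at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def censored_lognormal_loglik(observations, mu, sigma):
    """Log-likelihood of urine concentrations under a log-normal model.

    observations: list of (value, is_censored) pairs. For a censored
    observation, value is the limit of detection and only the probability
    of falling below it contributes to the likelihood.
    """
    total = 0.0
    for value, is_censored in observations:
        z = (math.log(value) - mu) / sigma
        if is_censored:
            # P(concentration < limit of detection); guard against log(0)
            total += math.log(max(normal_cdf(z), 1e-300))
        else:
            # Log-normal density: normal density of log(value) times 1/(value*sigma)
            total += normal_logpdf(z) - math.log(sigma) - math.log(value)
    return total


# Hypothetical data: two detected concentrations and one non-detect at LOD = 0.5
example = [(1.2, False), (0.8, False), (0.5, True)]
print(censored_lognormal_loglik(example, mu=0.0, sigma=1.0))
```

A non-detect therefore still informs the fit: it pulls the estimated distribution toward values that make concentrations below the limit plausible, rather than being dropped or replaced with a fixed value.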
National Academies of Science have developed a long-range vision for exposure science in the 21st century, and a strategy for implementing this vision over the next 20 years.9,10 Consistent with that strategy, the EPA’s ExpoCast initiative11 has been developing mechanistic and heuristic models for making high-throughput exposure (HTE) predictions that can be rapidly parametrized for thousands of chemicals. Coupled with hazard-related HTS, HTE modeling can move risk-based evaluation earlier in the chemical management decision process.5,7,12,13 Where the putative human dose of concern as identified by hazard HTS is similar to doses predicted by exposure HTS, chemicals become targets for further investigation.14 We previously evaluated the predictive value of HTE models that can be rapidly parametrized for thousands of chemicals with pharmacokinetic inferences of chemical exposure from the NHANES biomonitoring data.15 A total of 1936 chemicals were evaluated using far-field, mass balance human exposure models and a yes/no indicator for near-field (e.g., consumer product in the home) chemical use. Predicted human exposures (mg/kg BW/day) were compared to exposures that could be inferred from NHANES urine concentrations for 82 of the 1936 chemicals. Joint regression of the 82 inferred exposures on the model predictions and the near-field indicator provided a calibrated consensus prediction that was used for prioritization of the entire 1936 chemical library. The variance of this regression (i.e., the amount that the calibrated predictor over- or under-predicted the inferred exposures) provided an empirical determination of uncertainty in the model predictions, so that the uncertainty of these HTE estimates could be appropriately considered. Information on chemical use was found to be most predictive; generally, chemicals above the limit of detection in NHANES had indoor/consumer product use.
For the 82 NHANES chemicals this simple (i.e., yes/no) near-field indicator variable had greater predictive power than the more complicated environmental exposure models. The ability of a simple heuristic (whether or not chemicals were present in the home) to forecast exposure in the prior study raises the question of what other simple heuristics might be used to classify chemicals. Given a variety of rapidly obtained data, such as putative chemical use categories and physicochemical properties largely obtained from quantitative structure−activity relationship (QSAR) models, which factors best explain exposures inferred from the available NHANES biomonitoring data? Having previously succeeded in evaluating existing models with a relatively limited number of chemical exposures inferred from NHANES urine data (N = 82), we now aim to build a new model by identifying more detailed heuristics using only moderately more chemicals (N = 106). Caveats to this approach include the danger of overfitting; to assess model parsimony we rely on the Akaike Information Criterion (AIC).16 We aim to determine the simplest model (i.e., the fewest factors) that best describes the data without fitting noise in that data. The minimal predictive subset of factors we identify in this manner we describe as the heuristics of environmental chemical exposure. Bayesian analysis techniques allow us to infer ranges of exposures that are consistent with exposure biomarkers measured in urine samples and reported by the CDC NHANES. Bayesian statistics is a rigorous statistical methodology that attempts to consider all likely explanations for data. When approached in a Bayesian manner, techniques such as
■
MATERIALS AND METHODS
Chemical Descriptors. The Tox21 Chemical Library includes over 8000 compounds such as industrial chemicals, pesticides, consumer product and food ingredients, and pharmaceuticals.8 The full list of chemicals considered is available from http://epa.gov/ncct/Tox21/. High throughput (i.e., automatically retrieved or algorithmically generated) chemical descriptors could be obtained for 7968 chemicals. Physicochemical properties (Supporting Information (SI) Table 1) were obtained primarily from EPA’s EPI (Estimation Programs Interface) Suite of physical/chemical property and environmental fate QSAR estimation programs and database of experimentally obtained physicochemical properties (http://www.epa.gov/opptintr/exposure/pubs/episuite.htm). Experimental data was used in place of QSAR when available. Chemicals were assigned to a small number of broad chemical use categories using an automated approach to mining the full ACToR (Aggregated Computational Toxicology Resource) database, the EPA’s online warehouse of publicly available chemical toxicity data (http://actor.epa.gov/). A total of 514 separate listings (including federal, state, and international regulatory listings) for chemicals falling into specific use classes (e.g., U.S. EPA Antimicrobials Reregistration) were used to derive a set of 15 broad chemical use categories (e.g., “Antimicrobials”). All chemicals in the ACToR database were checked against the listings and assigned to the associated categories; a single chemical could fall into multiple use categories. Any chemical that was known to be a pesticide was excluded from the food use category. After category assignments for chemicals (given in SI Table 1) were generated, the number of occurrences of that chemical in a listing associated with each use category (the number of “hits”) was evaluated. Any chemicals with 1−3 hits for a given category were manually checked to determine whether or not that category was accurately applied.
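The hit-counting step described above can be sketched as follows. This is an illustrative outline only, not the pipeline actually used to build the use database: the listing names other than the one quoted in the text, the chemical identifiers, and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from individual listings to broad use categories.
LISTING_TO_CATEGORY = {
    "U.S. EPA Antimicrobials Reregistration": "Antimicrobials",
    "Example State Antimicrobial List": "Antimicrobials",
    "Example Colorant Inventory": "Colorant",
}


def count_hits(listing_members):
    """Count, per chemical and category, how many listings contain the chemical.

    listing_members: dict mapping listing name -> set of chemical identifiers.
    """
    hits = defaultdict(lambda: defaultdict(int))
    for listing, chemicals in listing_members.items():
        category = LISTING_TO_CATEGORY[listing]
        for chem in chemicals:
            hits[chem][category] += 1
    return hits


def needs_manual_check(hits, low=1, high=3):
    """Return (chemical, category) pairs supported by only 1-3 hits."""
    return sorted(
        (chem, cat)
        for chem, cats in hits.items()
        for cat, n in cats.items()
        if low <= n <= high
    )


def to_binary(hits, categories):
    """Collapse hit counts to 0/1 use-category indicators."""
    return {
        chem: {cat: int(cats.get(cat, 0) > 0) for cat in categories}
        for chem, cats in hits.items()
    }


# Hypothetical example with two chemicals and three listings
members = {
    "U.S. EPA Antimicrobials Reregistration": {"chem-A", "chem-B"},
    "Example State Antimicrobial List": {"chem-A"},
    "Example Colorant Inventory": {"chem-B"},
}
hits = count_hits(members)
print(needs_manual_check(hits))
print(to_binary(hits, ["Antimicrobials", "Colorant"]))
```

Keeping the counts before collapsing to binary values is what makes the manual review possible: a category supported by a single listing is weaker evidence than one supported by many.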
The final use assignments were then converted to binary values. Of the 15 chemical use categories in the resulting database (“ACToR UseDB”),
of Exposure”), MCMC simulation inferred various likely combinations of parent chemical exposure rates that would be consistent with the analytes present in urine. Each sample of inferred exposure rates within the Markov chain for each demographic group was investigated via the method of best subsets with complete enumeration.22 A total of 19 potential predictors were considered: production volume, 13 chemical use categories, 3 physicochemical properties, and 2 randomly generated descriptors (to assess the reliability of the results). The average AIC for each subset size across all samples indicated the optimal number of factors. The Bayesian analysis of geometric mean exposure rates was repeated with the assumption that exposure rates were log-normally distributed about a linear predictor model containing all factors. Standardized regression coefficients (weights) for the factors, an average (intercept) value, and an error term were jointly estimated along with parent exposure rates using Hamiltonian MCMC performed via Stan.23 A horseshoe prior was used to shrink regression coefficients for most predictors toward zero, thus working to reduce the number of factors included in the predictor model.24 Then, using the subset size identified by best subsets, those factors with the largest absolute regression coefficients were identified as the optimal subset of factors. The Bayesian analysis was finally repeated using regression on only those factors.
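Best subsets with complete enumeration, scored by AIC, can be sketched for a single sample of inferred exposures as below. This is a simplified illustration under stated assumptions: ordinary least squares with Gaussian errors, an AIC computed up to an additive constant, and hypothetical predictor names and data. In the analysis described above the procedure is repeated for every Markov chain sample and the AIC is averaged per subset size; that outer loop is omitted here.

```python
import itertools
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def ols_rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * y[i] for i, row in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(b * v for b, v in zip(beta, design[i]))) ** 2 for i in range(n))


def aic(rss, n, n_params):
    """Gaussian-error AIC, up to an additive constant (smaller is better)."""
    return n * math.log(rss / n) + 2 * n_params


def best_subsets(predictors, y):
    """Return {subset size: (AIC, predictor names)} for the best subset of each size."""
    n = len(y)
    names = sorted(predictors)
    best = {}
    for size in range(1, len(names) + 1):
        for subset in itertools.combinations(names, size):
            rss = ols_rss([predictors[s] for s in subset], y)
            score = aic(rss, n, size + 2)  # slopes + intercept + error variance
            if size not in best or score < best[size][0]:
                best[size] = (score, subset)
    return best


# Hypothetical data: y depends on x1 only; x2 and x3 are unrelated descriptors
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
noise = [0.1, -0.2, 0.15, -0.05, 0.2, -0.1, 0.05, -0.15]
y = [2.0 * a + e for a, e in zip(x1, noise)]
predictors = {
    "x1": x1,
    "x2": [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "x3": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0],
}
for size, (score, subset) in sorted(best_subsets(predictors, y).items()):
    print(size, round(score, 2), subset)
```

Complete enumeration is feasible here because the number of candidate predictors is small; with 19 predictors there are 2^19 − 1 (about 524 000) non-empty subsets per sample.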
petrochemicals and fertilizers were not present among the NHANES chemicals. Modified categories were constructed from the “Consumer use” and “Chemical/Industrial Process use” ACToR categories, and from the “Pesticide use” and “Inert Ingredients in Pesticides use” ACToR categories. For the initial best subsets analysis (Figure 2), an additional category “Other” was created for chemicals with no use indicated by ACToR. Total United States chemical production volume (lbs/year) was obtained from the EPA High Production Volume (HPV) list (http://www.epa.gov/hpvis/). Chemicals not listed were assumed to be produced at less than 25,000 lbs/year (the threshold for reporting). Biomarkers of Exposure. The CDC NHANES is a program of studies designed to assess the health and nutritional status of adults and children in the U.S. (http://www.cdc.gov/nchs/nhanes.htm). As part of the NHANES, the CDC monitors biomarkers of chemical exposure (chiefly metabolites) in the blood and urine of the noninstitutionalized civilian U.S. population residing in the 50 states and District of Columbia to quantify the levels of chemical compounds present in U.S. residents, and regularly presents findings from these studies in The National Report on Human Exposure to Environmental Chemicals.2 Each two-year study cycle comprises approximately 10 000 individuals, with chemical exposure biomarker data available from only a subset of about 2000 individuals. A reverse pharmacokinetics approach15,18,19 was used to infer parent compound chemical exposure from NHANES biomonitoring data for creatinine-adjusted urine concentrations. Assumptions similar to those of Mage et al.18 were made, chiefly that the individuals were at steady-state due to a constant uptake rate such that analyte molecules present in urine corresponded to exposure to and intake of the related parent molecules over the time represented by that urine sample.
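Under a steady-state assumption of this kind, the mass of analyte leaving in urine each day balances the daily intake of parent compound, so a creatinine-adjusted urine concentration can be scaled to an intake rate. The sketch below is a simplified single-analyte, single-parent illustration and not the full inference used here: it assumes, by default, that each mole of parent taken in appears as one mole of analyte in urine, and the function name, parameter names, and example values are all hypothetical.

```python
def inferred_exposure_mg_per_kg_day(
    urine_ug_per_g_creatinine,
    creatinine_g_per_day,
    body_weight_kg,
    mw_parent,
    mw_analyte,
    molar_fraction_excreted=1.0,
):
    """Steady-state intake of parent compound implied by a urine analyte level.

    molar_fraction_excreted is the assumed fraction of parent intake (in moles)
    that is excreted in urine as this analyte; 1.0 is a simplifying assumption.
    """
    # Mass of analyte excreted per day (ug/day)
    analyte_ug_per_day = urine_ug_per_g_creatinine * creatinine_g_per_day
    # Convert analyte mass to the equivalent mass of parent compound
    parent_ug_per_day = analyte_ug_per_day * (mw_parent / mw_analyte) / molar_fraction_excreted
    # ug -> mg, then normalize by body weight
    return parent_ug_per_day / 1000.0 / body_weight_kg


# Hypothetical values: 10 ug analyte per g creatinine, 1.5 g creatinine/day,
# 60 kg body weight, parent twice as heavy as the analyte
print(inferred_exposure_mg_per_kg_day(10.0, 1.5, 60.0, 200.0, 100.0))
```

When several parent chemicals can yield the same analyte, a single measured concentration no longer determines a unique intake for each parent, which is one reason a range of plausible exposures is inferred rather than a point value.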
In 2010, NHANES began reporting urine flow data along with chemical monitoring data. This data was modeled to determine distributions appropriate for each demographic analyzed, and the modeled values were used for chemicals without urine flow data. The creatinine excretion rate along with the subject’s body weight as reported by NHANES were used along with a mapping derived from the NHANES reports between parent and analyte compounds (including the relative molecular weights) to convert urine concentration per mg creatinine (e.g., μg/L) to an exposure in units of mg/kg bodyweight/day. Linear Regression Model. The structure of the NHANES sampling design was accommodated using tools from the R survey package.20 Geometric mean population exposures to parent compounds were inferred via Bayesian analysis from the NHANES subject-specific urine samples. First, the geometric mean urine concentrations were estimated from the raw individual NHANES data using pseudomaximum likelihood. For observations below the limit of detection, a censored likelihood was used with the assumption that the population distribution of compounds in urine is log-normal. Other than using subject-specific data, as opposed to summary quantiles, to characterize the population distribution for the calculation of geometric mean concentrations, the method for estimating parent exposures was similar to that described in Wambaugh et al.:15 Geometric mean parent chemical exposure rates were initially inferred via Markov Chain Monte Carlo (MCMC) implemented in JAGS v3.1.0.21 Given the steady-state reverse pharmacokinetics assumptions and parent-to-analyte mapping (both described in “Biomarkers
■
RESULTS
One hundred six (106) parent chemical compound exposures were inferred from 68 urine analytes identified in the NHANES biomonitoring data. We developed a high throughput heuristic model using the available NHANES data, and then used this model to rank 7968 chemicals by their potential human exposure. Figure 1 shows a heat map of the high throughput chemical descriptors used for the evaluation (NHANES) chemicals. The rows in Figure 1 are the chemical descriptors (i.e., potential predictors) and the columns are individual chemicals. The rows and the columns are permuted to place similar values near each other according to hierarchical clustering by Euclidean distance. The horizontal side bar at the top annotates the chemicals to indicate the chemical classes assigned by NHANES. The parabens, for example, are clustered together. No strong correlations between the use category predictors are observed; however, this was achieved by performing Boolean logical operations on some of the ACToR UseDB categories. The “Consumer use” and “Chemical/Industrial Process use” categories were transformed into three modified categories (chemicals with both uses, and chemicals with exclusively one or the other use), and the “Pesticide use” (which may include both active and inert pesticide ingredients) and “Inert Ingredients in Pesticides use” ACToR UseDB categories were transformed to create two mutually exclusive modified categories: “Pesticide Active” use and “Pesticide Inert” use. The high throughput chemical descriptors for the entire 7968 chemical library evaluated are shown in SI Figure 1. Two separate but complementary techniques were used to analyze the potential predictive factors. In Figure 2, we used the method of best subsets to analyze how many and which factors should be included to best describe the chemical exposures inferred for the general U.S. population NHANES urine data without overfitting.
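The Boolean transformation of the overlapping use categories can be written out explicitly. The sketch below is one plausible reading of that description: in particular, defining “Pesticide Active” as pesticide use that is not flagged as an inert ingredient is an assumption of this example, and the function name is hypothetical.

```python
def modified_use_categories(consumer, industrial, pesticide, pesticide_inert):
    """Derive modified 0/1 use categories from four original Boolean flags."""
    return {
        # Three mutually exclusive consumer/industrial categories
        "Consumer & Industrial": int(consumer and industrial),
        "Consumer no Industrial": int(consumer and not industrial),
        "Industrial no Consumer": int(industrial and not consumer),
        # Two mutually exclusive pesticide categories (assumed definitions)
        "Pesticide Inert": int(pesticide_inert),
        "Pesticide Active": int(pesticide and not pesticide_inert),
    }


# Hypothetical chemical used in consumer products and industry, and listed
# as an inert pesticide ingredient
print(modified_use_categories(True, True, True, True))
```

Because at most one of the three consumer/industrial indicators and at most one of the two pesticide indicators can equal 1 for a given chemical, the derived predictors no longer duplicate each other the way the original overlapping flags could.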
Each sample in the converged Markov chain is a different combination of parent chemical exposure
Figure 2. For the 106 chemicals inferred from NHANES biomonitoring we determined that five chemical descriptors were optimal for predicting exposures without overfitting (i.e., fitting noise). The top portion of Figure 2 shows the average relative AIC (smaller is better) for models made with different numbers of parameters for explaining 1500 different combinations of chemical exposures. The 97.5th percentile and the 2.5th percentile of the relative AIC of the samples are indicated by the vertical line at each size of subset. The horizontal dotted line indicates the minimum average relative AIC, and the vertical dotted line indicates its corresponding size of subset. The predictors involved in the optimal model with higher frequencies are represented by darker circles, and those with lower frequencies by lighter circles. As a sanity check, two random variables, generated from a binomial distribution with probabilities of 50% and 10% of obtaining 1, are not selected as optimal descriptors in the five-factor model.
Figure 1. Databases and predictions from chemical structure provide high throughput descriptors that were used to investigate correlation with environmental chemical exposure to humans. These parameters were available for 7968 chemicals (available in SI Table 1 and shown in SI Figure 1). Here we show the data for the 106 chemicals for which we also have biomonitoring data from the CDC NHANES study. The rows correspond to the high throughput descriptors, consisting of Boolean (0/1) use category annotations from the ACToR database, experimental and/or predicted physical-chemical properties, and chemical production volume data. Continuous variables have been scaled to be between zero and one. The columns are chemicals, whose class as identified by the CDC is annotated by the color bar at the top of the heat map and the legend at the bottom. The dendrograms at the top and left-hand side indicate how chemicals and categories have respectively been clustered hierarchically using the complete linkage method (Johnson 1967).
occurs when the number of predictive factors in the subset is five. Therefore, a five-factor model is suggested by AIC. Because each sample from the Markov chain represents a different combination of chemical exposures that would be consistent with the urine biomarkers identified by NHANES, different factors may perform better for different samples. In the bottom-half of Figure 2 the predictive factors involved in the best model with higher frequencies among the 1500 samples are represented by darker circles, and those with lower frequencies by lighter circles. Best subsets analysis suggests that a subset of five factors is most appropriate for describing the inferred NHANES exposures, but is only clear about the identity of three of those factors. Among the five-factor models, “Pesticide Inert” use is included in 93.7% of models, “Production volume” is included in 64.9%, and “Consumer & Industrial” use is included in 63.9% (“Industrial” indicates use in chemical/industrial processes). Using the best subsets technique, the final two factors are more equivocal: nine different factors are included in anywhere from 15% to 33% of the five-factor models, including six ACToR use categories (“Antimicrobial”, “Colorant”, “Personal Care”, “Pesticide Active”, “Flame Retardant”, “Industrial no Consumer”, “Consumer no Industrial”), hydrophobicity, and both random variables. In order to identify the five optimal predictors, we performed a joint Bayesian exposure inference and linear regression
rates that would be consistent with the urine analytes identified in the NHANES; we consider the 1500 samples to be representative of the range of possible exposures. For each sample, best subsets is used to determine the optimal parameters to explain the data given a choice of a fixed number of factors. For example, if the best one-factor model identifies the single best parameter, then the best two-factor model would explore all combinations of two factors to identify the best two factors for predicting the inferred exposures, and so on, to a model that uses all available factors. AIC, the criterion we use to determine model parsimony and prevent overfitting, is a function of the number of factors (subset size) and the success of the best combination of that number of factors for describing the sample. In the top-half of Figure 2 the distribution of the relative AIC across all samples is plotted for each subset size. Lower AIC scores indicate a more parsimonious model, that is, a model that explains more variance with fewer predictive factors. The plot points at each subset size in Figure 2 indicate the median AIC across all samples, whereas the 97.5th percentile and the 2.5th percentile are indicated by the vertical line at each size of subset. The horizontal dotted line indicates the minimum average relative AIC, and the vertical dotted line indicates that the minimum average relative AIC
Figure 3. Separate linear regressions were made of the inferred exposures on the five optimal descriptors for different demographic groups monitored by the CDC NHANES studies. Figure 3 (A) shows the posterior distribution of standardized regression coefficients for the top five factors. The wider bar indicates the 50% range, while the narrower bar indicates the 95% range for each coefficient. Large positive coefficients indicate correlation with increased exposures, whereas large negative coefficients correspond to decreased exposures. Among the demographic groups considered there is little variation in the significance of those factors for the chemicals we have examined. “BMI_GT_30” indicates individuals with a body mass index greater than 30 (obese), while “BMI_LT_30” are not obese. “ReproAgeFemale” refers to women aged 16−49. In Figure 3 (B) we show the correlation of inferred exposures and the calibrated model for the total population. The solid line indicates the 1:1 line (perfect predictor). Bayesian analysis was used to distribute urine products using mass-balance, giving the 95% confidence intervals (light lines) and medians (solid circles).
Table 1. High Throughput Exposure Heuristics
heuristic | description | number of chemicals: inferred NHANES chemical exposures (106) | number of chemicals: full chemical library (7968)
ACToR “consumer use & chemical/industrial process use” | chemical substances in consumer products (e.g., toys, personal care products, clothes, furniture, and home-care products) that are also used in industrial manufacturing processes, does not include food or pharmaceuticals. | 37 | 683
ACToR “chemical/industrial process use with no consumer use” | chemical substances and products in industrial manufacturing processes that are not used in consumer products, does not include food or pharmaceuticals | 14 | 282
ACToR UseDB “pesticide inert use” | secondary (i.e., nonactive) ingredients in a pesticide which serve a purpose other than repelling pests, pesticide use of these ingredients is known due to more stringent reporting standards for pesticide ingredients, but many of these chemicals appear to be also used in consumer products (see Figure 1). | 16 | 816
ACToR “pesticide active use” | active ingredients in products designed to prevent, destroy, repel, or reduce pests (e.g., insect repellants, weed killers, and disinfectants). | 76 | 877
TSCA IUR 2006 total production volume
sum total (kg/year) of production of the chemical from all sites that produced the chemical in quantities of 25 000 pounds or more per year. If information for a chemical is not available, it is assumed to be produced at