Environ. Sci. Technol. 1998, 32, 3396-3404
Decision Tree Method for the Classification of Chemical Pollutants: Incorporation of Across-Chemical Variability and Within-Chemical Uncertainty
JOSEPH N. S. EISENBERG*
School of Public Health, University of California, Berkeley, California 94720
THOMAS E. MCKONE
University of California, School of Public Health and Lawrence Berkeley National Laboratory, Berkeley, California 94720
We have developed a decision tree methodology for the classification of chemicals by estimates of potential human exposure. The steps involved in the construction of a decision tree are as follows. Monte Carlo simulations are conducted by randomly sampling chemical and environmental properties, whose range of values represents the variability of parameters across a defined set of chemicals and environmental conditions. The tree structure is then defined by a series of constraints placed on the various chemical and environmental properties using the Classification and Regression Tree Algorithm (CART). Each node of the tree is associated with a human exposure value and is considered a bin, which classifies chemicals whose properties are consistent with those parametric constraints associated with the particular node. In addition to being associated with parametric constraints, each bin or tree node is associated with a human exposure level. In this manner, the tree structure functions as a template from which a set of chemicals are classified into parametric regions that are associated with an exposure level. Three important properties of this classification approach are as follows: (a) The variability across this chemical set is described by the template. (b) Parameter correlations are described by assessing which bins are represented by at least one chemical. (c) The sensitivity of the classification is assessed using both the uncertainty of the values for a particular chemical and any uncertainty or variability associated with site-specific exposure and environmental properties. To illustrate these properties, a case study was conducted in which exposures were estimated using the multimedia exposure model CalTOX assuming a regional chemical release into soil. A decision tree template was constructed and then used to classify 79 chemicals. 
Analysis of the simulation outputs identified 4 out of 14 chemical properties whose value ranges played the dominant role in the classification of chemicals into exposure ranges (R² = 0.78); i.e., 78% of the exposure variation seen in the data could be explained using only 4 of the 14 chemical properties that are known to affect chemical fate and transport. The most important classifier was the half-life in root-zone soil, τs. In addition, a sensitivity analysis of 93 site-specific environmental and exposure properties suggested that only four of these factors influenced the classification.
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 32, NO. 21, 1998
Introduction
The number of synthetic chemicals released to the environment has increased dramatically in the last century. These chemicals originate from human activities (such as combustion related to energy production, industrial processes, and agriculture) and can result in human exposure through inhalation, ingestion, or dermal uptake. Although the traditional approach toward regulation has been to assess chemicals one by one, this approach is costly and time-consuming. Recently, a number of ranking systems have been developed to help streamline the process of identifying those chemicals that require regulation [see Davis et al. (1) for a review of 51 such systems]. Many of these ranking methods suffer from important disadvantages. First, explicit treatment of uncertainty, and in particular the capacity to see how uncertainty reduction will impact classification, is rarely included in any ranking scheme. Therefore, the approaches for dealing with sparse or no data are often based on subjective judgment. Second, many of the simple quantitative ranking approaches are based on the result of multiplying together a small number of factors. An example of this approach is the Waste Minimization Prioritization Tool (WMPT), which uses four agent-specific properties: toxicity, persistence, bioaccumulation, and mass. Each of these four factors is an aggregated term that represents complex processes. Neither the multiplication of the four properties nor the aggregation required to obtain values for these properties is based on how these properties are likely to impact potential exposure. Thus, the approach is not based on a mechanistic understanding of the environmental transport/transformation processes, and rankings may be based on an unjustified weighting of one factor over another. Alternatively, many ranking systems, such as the U.S.
EPA Hazardous Waste Identification Rule (HWIR) (2), are so complex that the user cannot interpret what factors are critical in making key classifications. The HWIR waste classification is based on the use of multiple, linked models with large numbers of inputs, many of which are based on critical and often scientifically indefensible assumptions that cannot be communicated to the user (3). In general, with such model-based classification schemes, it is easy to embed deep within the computer code values that are both highly uncertain and significant to the ranking process; i.e., these models tend to be “black box” simulators, providing exposure estimates based on release data but with no information on the underlying uncertainty and sensitivity associated with those estimates.
In contrast, we have developed a systematic approach for choosing attributes from which to classify in the context of the variability across chemicals and the uncertainty of particular chemical properties. The focus of this paper is to present our decision tree method for classifying chemicals based on human exposure estimates from a mathematical model, arguing that exposure is a critical and often overlooked component in assessing risk. This method is illustrated through a case study, using CalTOX (a multimedia exposure model) and a set of 79 chemicals. After presenting the results of this case study, we discuss how an exposure-based classification can be coupled with toxicity data to provide a comprehensive chemical risk ranking. We also discuss how this decision tree method does not require estimates from mathematical models but can also classify based on empirically derived data sets.
In brief, our method consists of analyzing exposure estimates from model simulations using the Classification and Regression Tree (CART) algorithm (4), a multivariate nonlinear statistical algorithm that provides the parametric classification rules for dependent variables; i.e., this analysis assesses which parameter values and parametric relationships are most important in classifying an output as a high vs low exposure. In this manner, a parametric delineation or decision tree of high/low exposure regions is constructed using the smallest reliable set of factors, identifying the set of factors and their relationships needed for classification prior to analyzing the properties of any particular chemical. This tree can then be used as a template to assess the exposure potential of actual chemicals by observing which regions they fall into based on the constructed parametric rules.
* Corresponding author fax: (510) 642-5815; e-mail address: [email protected].
S0013-936X(97)00975-9 CCC: $15.00 © 1998 American Chemical Society. Published on Web 09/12/1998
Method for Chemical Classification
The basic approach of this analysis is to classify chemicals into categories using physiochemical factors. The categories are defined by constraints placed on the physiochemical values and are characterized by a mean potential human exposure value and standard deviation. The rules that define these exposure level regions are selected to minimize the overall variance and therefore maximize the ability to predict exposure levels. The CART algorithm is economical in its rule selection and will only place constraints on those factors needed to achieve a low overall variance, thereby identifying the factors most important in assessing exposure. Once established, the classification criteria allow for discussion of sensitivity and uncertainty; i.e., this approach provides a framework to assess how likely it is that a chemical would be reclassified given additional information pertaining to the physiochemical factors and what impact that classification shift would have with respect to the exposure assessment. Specifically, Monte Carlo simulations were conducted by sampling from ranges of parameter values. The output exposure levels coupled with the associated parameter values were then analyzed by CART, which produces a classification tree. So as not to bias our classification tree toward any given set of chemicals, parameter values were independently sampled from a uniform distribution. Thus, the decision tree served as a template from which chemicals were classified. The following section provides an overview of CART, the classification algorithm.
Classification Analysis (CART). The simulation outputs from CalTOX were analyzed using the Classification and Regression Tree (CART) algorithm, a nonparametric statistical procedure that classifies data via a series of “yes/no” questions concerning the physiochemical factors (4).
The CART analysis is a classification method that produces a tree structure using a parametric decision at each node based on an inequality, p_i < c, where p_i corresponds to property i and c is a constant value. Observations that satisfy this condition are sent to the left node while the others are sent to the right node. Each node is characterized by a mean value and standard deviation associated with the observations that reach it. A specific split is chosen that maximizes the reduction in variance between the parent node and the resultant two child nodes. Splitting continues until a stopping rule is satisfied: either the node has no variance or the number of observations is small. Higher nodes are revisited and, if necessary, readjusted to improve the lower level errors. This revisitation is especially important when particularly strong interactions among parameters exist. From this decision tree, an exhaustive list of sub-trees is made. For each tree size, defined by the number of terminal nodes, the tree is chosen that maximizes the decrease in variance. In general, as tree size increases, the variance is reduced; as tree size decreases, the variance is increased. The objective is to find an optimal combination of tree size and variance, somewhat analogous to choosing the correct bin size for a histogram [see Breiman (4) for details]. As a summary metric, R² is defined as the fractional decrease in variance between the first node of the tree and the nodes representing the nth level of the tree. The variance at level n is calculated as a weighted sum of the variances of each node at that level.
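The split criterion and the R² summary described above can be illustrated with a minimal sketch. It is not the CART implementation used in the study: it performs a single greedy split search only (no revisiting of higher nodes, no sub-tree selection), and the function names, the `min_leaf` parameter, and the four-run toy data set are assumptions made for illustration.

```python
from statistics import pvariance


def sse(values):
    """Sum of squared deviations from the mean (n times the population variance)."""
    return pvariance(values) * len(values) if values else 0.0


def best_split(rows, outputs, min_leaf=1):
    """Find the rule p_i < c that most reduces the summed squared error.

    rows    -- list of dicts mapping property name to sampled value
    outputs -- list of model outputs (e.g., log exposure), one per row
    Returns (property name, threshold c, summed squared error after the split).
    """
    best = None
    for name in rows[0]:
        levels = sorted({r[name] for r in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(levels, levels[1:]):
            c = (lo + hi) / 2
            left = [y for r, y in zip(rows, outputs) if r[name] < c]
            right = [y for r, y in zip(rows, outputs) if r[name] >= c]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            after = sse(left) + sse(right)
            if best is None or after < best[2]:
                best = (name, c, after)
    return best


def weighted_variance(nodes):
    """Variance at one tree level: node variances weighted by observation counts.

    nodes -- list of (number of observations, variance) pairs
    """
    total = sum(n for n, _ in nodes)
    return sum(n * v for n, v in nodes) / total


def r_squared(var_root, var_level):
    """Fractional decrease in variance between the first node and a tree level."""
    return 1.0 - var_level / var_root


# Toy data (hypothetical): four runs, two properties, log-exposure outputs.
rows = [
    {"tau_s": 5.0, "Koc": 100.0},
    {"tau_s": 10.0, "Koc": 5000.0},
    {"tau_s": 100.0, "Koc": 200.0},
    {"tau_s": 500.0, "Koc": 8000.0},
]
outputs = [-12.0, -11.0, -4.0, -5.0]
name, threshold, sse_after = best_split(rows, outputs)
```

In this toy case the search selects a threshold on `tau_s` because that split leaves the least residual variance in the two child nodes; a full CART run would then recurse on each child and prune.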
Case Study
For this case study, we have made the following assumptions: First, a 1 ppm initial concentration of the chemical is placed within the soil, a source medium that plays a major role in the regulatory arena. The assumption of a uniform chemical concentration layer when evaluating soil contamination is common within the literature (5). Second, potential dose scenarios are applied to a residential population living on or near this land unit for various exposure durations (ED). Third, the output of each simulation is the sum of the average potential daily dose over the dermal, ingestion, and inhalation exposure routes (mg kg⁻¹ d⁻¹). Fourth, the toxic effects slope factor for all simulations was set to unity; i.e., we focused on identifying the physiochemical factors that drive exposure. Fifth, for each parameter within CalTOX a lower and upper bound was assigned based on ranges that are likely to exist when classification simulations are applied to synthetic chemical release into soils within California. The 107 physiochemical factors consist of 14 chemical properties, 41 landscape properties, and 52 exposure properties (Tables 1, A1, and A2).
Environmental Fate and Transport Modeling (CalTOX). Potential exposure levels were generated for a given set of physiochemical factors using the CalTOX model, a set of spreadsheet process models and spreadsheet data sets used to assist in assessing human exposures and defining soil cleanup levels at uncontrolled hazardous waste sites (6-8). The CalTOX multimedia transport and transformation model is a dynamic model that calculates time-varying concentrations of contaminants among multiple environmental media when these contaminants are introduced initially to soil layers or released continuously to air or water.
The seven environmental media or compartments used in CalTOX are (1) air, which is used to represent the lower troposphere; (2) ground-surface soil, which represents the litter and dust layer on a soil horizon; (3) plants, which represent the aboveground vegetation; (4) rooting-zone soil, which is used to represent roughly the top 1 m of the unsaturated zone in which plant roots are found; (5) the vadose-zone soil, which represents the soil layers below the root zone but above the saturated zone; (6) surface water, which represents ponds, lakes, and/or rivers; and (7) sediments, which represent the sediment layers at the bottom of ponds, lakes, and/or rivers. For each compartment, CalTOX computes time-varying chemical concentrations that result from a combination of transport and degradation processes. The model then partitions the chemical into multiple exposure pathways and subsequent multiple exposure routes (Figure 1). Mathematically, CalTOX addresses the inventory of a chemical in each compartment and the likelihood that, over a given period of time, the chemical will remain in the compartment, be transported to some other compartment, or be transformed into some other chemical species. Quantities or concentrations within compartments are described by a set of linear, coupled, first-order differential equations. A compartment is described by its total mass, total volume, solid-phase mass, liquid-phase mass, and gas-phase mass. Contaminants are moved among and lost from each compartment through a series of transport and transformation processes that can be represented mathematically as first-order losses. Maddalena et al. (9) have compiled and described the CalTOX multimedia mass-balance equations and have shown that CalTOX produces results comparable to regional fugacity models (10). It should be noted here that the focus of this paper is on parameter uncertainty and not on model accuracy; i.e., given exposure estimates, we are interested in both identifying the parameters most important for classification and assessing the stability of our proposed classification scheme in the context of parameter uncertainty.

TABLE 1. List of Chemical Properties and the Range of Values Associated with the 79 Chemicals in the CalTOX Data Set
property | symbol | lower bound | upper bound
molecular weight (g/mol) | MW | 50 | 420
octanol-water partition coefficient | Kow | 1.0 | 5 × 10⁹
melting point (K) | Tm | 1 × 10² | 1 × 10³
vapor pressure (Pa) | VP | 1 × 10⁻⁸ | 6 × 10⁵
solubility (mol/m³) | S | 1 × 10⁻⁷ | 1 × 10³
Henry’s law constant (Pa-m³/mol) | H | 3 × 10⁻⁴ | 6 × 10³
diffusion coefficient in pure air (m²/d) | Dair | 2 × 10⁻² | 2.0
diffusion coefficient in pure water (m²/d) | Dwater | 3 × 10⁻⁵ | 1.4 × 10⁻⁴
organic carbon partition coefficient | Koc | 4.0 | 5 × 10⁶
reaction half-life in air (d) | τa | 2 × 10⁻³ | 1 × 10⁴
reaction half-life in ground-surface soil (d) | τg | 4.0 | 9 × 10³
reaction half-life in root-zone soil (d) | τs | 4.0 | 9 × 10³
reaction half-life in the vadose-zone soil (d) | τv | 1 × 10² | 2 × 10⁴
reaction half-life in groundwater zone soil (d) | τq | 1 × 10¹ | 2 × 10⁴
reaction half-life in surface water | τw | 2 × 10⁻³ | 3 × 10⁴
reaction half-life in the sediment zone (d) | τd | 7.0 | 3 × 10⁴

FIGURE 1. General integration of source, dispersion, and exposure used in CalTOX. The dispersion among environmental media is primarily defined by chemical and landscape parameters, whereas the transfer of chemicals along exposure pathways is dependent on both environmental media concentrations and human activity data (exposure factors).

The exposure assessment in CalTOX is based on building links among environmental media and exposure media. The links among multimedia transport models and multiple pathway exposure models are used in CalTOX to estimate average daily doses within a human population. Exposure media include outdoor air, indoor air, food, household dust, homegrown foods, animal food products, and tap water. Exposure routes are inhalation, ingestion, and dermal uptake. The potential dose over an exposure duration is expressed in CalTOX as an average daily dose rate (ADDpot), in mg kg⁻¹ d⁻¹, i.e., averaged over a specified averaging time:
$$ADD_{pot} = C_{Env} \times ITF \times \frac{CR}{BW} \times \frac{ED \times EF}{AT} \tag{1}$$
where C_Env is the concentration in environmental media (mg/kg), ITF is an intermedia transfer factor that relates contaminant concentration in exposure media to concentration in environmental media (units depend on the two media), CR is the contact rate (kg/d), BW is the body weight (kg), ED is the exposure duration (yr), EF is the exposure frequency (d/yr), and AT is the averaging time (d). McKone and Daniels (11) have described the application of eq 1 to multiple exposure pathways for contaminated soils. Their approach has been incorporated in CalTOX (8).
Model Parametrization. All parameter values were assigned uniform distributions as previously described. In this section we describe the sources for these range assignments. As referenced below, the chemical properties and exposure data used in CalTOX are based on EPA references that have been reviewed extensively by the U.S. EPA, CalEPA, and a California peer review and public review process. The values used in this study are primarily measured values. Those values that were estimated come from the literature. For chemical properties, we used the ranges corresponding to the 79 chemicals currently used in the CalTOX data sets. Properties for these chemicals, such as solubility, vapor pressure, octanol-water partitioning, environmental half-lives, etc., are available from Cal-EPA on their World-Wide Web site (http://www.cwo.com/∼herd1/downset.html/) and have been derived primarily from a report issued by U.S.
EPA (12). These ranges are similar to those presented in widely available reference compilations of chemical properties, such as Mackay et al. (13) and Verschueren (14). For the upper and lower bounds on landscape properties, we derived ranges from a review of California landscape data compiled by Schwelen et al. (15) for the California EPA. California has a varied topography and climate pattern that has resulted in a wide variety of landscape characteristics, e.g., climates, ecosystem types, soil characteristics, and hydrologic patterns. In the Schwelen et al. (15) report, ranges of soil characteristics are derived from the STATSGO database (16); variations in climate data and some hydrologic characteristics are obtained from NOAA (17); and additional hydrologic data come from van de Leeden et al. (18). Ranges of values for exposure parameters are based on Cal-EPA Department of Toxic Substances Control values as originally compiled by McKone (8) and are similar to ranges used by the U.S. EPA (19) and other Cal-EPA agencies (20).
Simulation Design. Monte Carlo simulations were conducted on a Pentium microcomputer using the software package Crystal Ball, version 4.0 (21). The simulations were divided into three steps. First, all of the landscape and exposure properties were assigned the nominal values shown in Tables A1 and A2 (see Supporting Information). The 14 chemical properties (Table 1) were randomly sampled for each simulation run, and the output was saved along with the parameter values of the 14 chemical properties. After completing 5000 runs, the data were analyzed using CART; i.e., a series of parametric rules were developed to construct regions represented by a mean exposure potential and a standard deviation. The results of CART were then used to identify the chemical property values associated with high and low exposure situations and to assign categories.
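The first Monte Carlo step can be sketched as follows. This is only an illustration of independent uniform sampling within fixed bounds: the study used Crystal Ball with CalTOX, whereas here `placeholder_model` is a hypothetical stand-in with no physical meaning, only four of the Table 1 properties are included, and the function names and the seed are assumptions.

```python
import math
import random

# (lower, upper) bounds for four of the chemical properties, copied from Table 1.
BOUNDS = {
    "tau_s": (4.0, 9e3),  # reaction half-life in root-zone soil (d)
    "tau_v": (1e2, 2e4),  # reaction half-life in the vadose-zone soil (d)
    "Koc": (4.0, 5e6),    # organic carbon partition coefficient
    "Kow": (1.0, 5e9),    # octanol-water partition coefficient
}


def sample_properties(rng, bounds=BOUNDS):
    """Draw each property independently from a uniform distribution on its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}


def run_step_one(model, n_runs=5000, seed=0):
    """Sample the chemical properties n_runs times and store inputs with outputs.

    model -- any callable mapping a property dict to an exposure estimate;
             in the study this role is played by CalTOX.
    """
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        props = sample_properties(rng)
        runs.append((props, model(props)))
    return runs


def placeholder_model(props):
    """HYPOTHETICAL stand-in for the exposure model; not a CalTOX approximation."""
    return math.log10(props["tau_s"]) - 8.0


runs = run_step_one(placeholder_model, n_runs=5000, seed=0)
```

The stored `(properties, output)` pairs are the kind of data set that would then be passed to a regression-tree routine to derive the parametric rules.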
Since in this study we are concerned with exposure and not toxicity, we normalized the chemical potencies to 1 (mg kg⁻¹ d⁻¹)⁻¹, a value that is within the distribution of potencies across measured chemicals (22). With unit potency, an exposure level is numerically equal to the corresponding risk. Therefore, we define exposure levels above 10⁻⁶ mg kg⁻¹ d⁻¹ as “high”, based on the fact that a target de minimis risk of 10⁻⁶ is frequently selected by regulatory agencies. As described in the Discussion section, normalization of the potencies allows for the incorporation of toxicity information that would extend our exposure classification to a risk classification. Up to this point, the classification regions were based on the constraints placed on the 14 chemical properties (Table 1), independent of any particular chemical. Next, using these defined exposure regions, the 79 chemicals listed in the CalTOX database were classified based on their chemical properties. From this classification, five prototypical chemicals were chosen that were representative of the different exposure regimes identified by CART. These chemicals were used in the next two Monte Carlo simulation steps to study the sensitivity of the 93 landscape and exposure properties shown in Tables A1 and A2. For the second Monte Carlo simulation step, the chemical and exposure properties were held constant, and the landscape properties were randomly sampled. The exposure properties were assigned the same values as in the first Monte Carlo simulation step. The chemical properties were assigned point values from the CalTOX database that were associated with one of the five prototypical chemicals chosen. Four more sets of Monte Carlo simulations were conducted using the other prototypical chemicals. The third Monte Carlo simulation step was analogous to the second with the landscape properties held constant and the exposure properties randomly sampled.
As with the data set from the first Monte Carlo step, each of the 10 data sets generated from the last two Monte Carlo steps was analyzed using CART.
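The third step (chemical and landscape properties fixed, exposure properties sampled) can be illustrated in miniature using eq 1 on its own as the model. This is a sketch under loud assumptions: the ranges in `EXPOSURE_RANGES` are hypothetical placeholders rather than the Table A2 values, a single pathway replaces the multi-pathway CalTOX calculation, and all function names are invented for the example.

```python
import random


def average_daily_dose(c_env, itf, cr, bw, ed, ef, at):
    """Eq 1: ADD_pot = C_Env * ITF * (CR / BW) * (ED * EF / AT), in mg kg^-1 d^-1."""
    return c_env * itf * (cr / bw) * (ed * ef / at)


# HYPOTHETICAL (lower, upper) ranges for illustration only; not from Table A2.
EXPOSURE_RANGES = {
    "cr": (0.5, 2.0),      # contact rate (kg/d)
    "bw": (50.0, 90.0),    # body weight (kg)
    "ed": (10.0, 30.0),    # exposure duration (yr)
    "ef": (300.0, 365.0),  # exposure frequency (d/yr)
}


def sample_exposure_step(c_env, itf, at, n_runs=1000, seed=0):
    """Hold the media concentration and transfer factor fixed; sample exposure factors."""
    rng = random.Random(seed)
    doses = []
    for _ in range(n_runs):
        f = {k: rng.uniform(lo, hi) for k, (lo, hi) in EXPOSURE_RANGES.items()}
        doses.append(
            average_daily_dose(c_env, itf, f["cr"], f["bw"], f["ed"], f["ef"], at)
        )
    return doses


# Example: 1 mg/kg in soil, an assumed transfer factor, 70-yr averaging time in days.
doses = sample_exposure_step(c_env=1.0, itf=0.01, at=70 * 365.0, n_runs=1000, seed=0)
```

The spread of `doses` for a fixed chemical is the kind of shifted exposure range that the sensitivity analysis compares against the range of the region in which the chemical was originally classified.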
Results
Results from the case study are used here to illustrate the features of our decision tree classification method. Three important properties of this classification approach are as follows: (1) The tree structure reflects the parametric variability across chemicals. (2) A measure of the parametric correlation of a particular set of chemicals can be assessed through the comparison of the regions represented by at least one chemical with those not represented by any chemicals. (3) The sensitivity in classifying a particular chemical provides a measure of how both the uncertainty in identifying the correct chemical values and the uncertainty and variability of site-specific information impact this classification. It should be remembered that the values presented here are estimates made by the CalTOX model based on a pulse dose of 1 mg/kg in soil.
Generation of Classification Template. Figure 2 shows the results of the first simulation run, in which the 14 chemical properties were randomly sampled from ranges that include those of all 79 chemicals in the CalTOX database. At each node, the log mean exposure (within the oval) and the geometric standard deviation (below the oval) are summary statistics associated with those simulations satisfying the branching rules that lead to the particular node. Three distinct regions were identified. These regions were defined by 4 of the 14 parameters sampled: the half-life in the root-zone soil, τs; the organic carbon partition coefficient, Koc; the octanol-water partition coefficient, Kow; and the half-life in the vadose-zone soil, τv. Therefore, the specific values of the remaining 10 parameters were found not to be important for assessing exposure within their specified a priori range.
Within each of these three regions, subregions were assigned as LE or HE depending on whether exposure levels were below or above 10⁻⁶ mg kg⁻¹ d⁻¹, corresponding to a log exposure of -6 (again the value of 10⁻⁶ was chosen for historical reasons and is only meant to illustrate our classification approach). Each of these subregions has an associated log mean exposure value and standard deviation. As one traverses down the tree, resulting in additional parametric constraints, the geometric standard deviation (GSD) decreases, providing an exposure estimate with less variance. For example, subregion LEIIc has a log exposure level of -11.6 ± 3.5, corresponding to an exposure level of 2.5 × 10⁻¹² mg kg⁻¹ d⁻¹ (GSD = 30), whereas subregion LEIIe has a log exposure level of -17.3 ± 1.9, corresponding to an exposure level of 5 × 10⁻¹⁸ mg kg⁻¹ d⁻¹ (GSD = 6.7). Overall, by using the above four parameters, the classification tree decreased the variance from 11.74 to 2.53 (R² = 1 - 2.53/11.74 = 0.78). The R² statistic was calculated from the CART diagram in Figure 2, where the variance prior to classification is the square of the GSD from the first node, and the variance after classification is a weighted sum of variances from the 11 terminal nodes. Figure 3 contains a parametric summary of this classification, showing, for example, that region III, consisting of 64 chemicals, was defined by constraints on both τs and Kow. Overall, Figure 3 illustrates the broad classification based on a single rule constraining τs, along with the three exceptions to this rule. That is, the CART routine identified the half-life in root-zone soil, τs, as the most important classifier of exposure level. The constraint τs < 27 d classified a primarily low exposure region (regions I and II) with two exceptions: subregion HEI defined by τs < 27 d, τv > 50 d, and Koc < 312; and subregion HEIIa defined by τs < 27 d, τv > 38 d, 312 < Koc < 13 000, and τq > 40 d.
Analogously, τs > 27 d classified a primarily high exposure region (region III) with one exception: subregion LEIIIa defined by 27 d < τs < 61 d, Kow < 2.0 × 10⁷, and Koc > 913.

FIGURE 2. CART diagram of exposure as a function of chemical properties. Within the oval is the log-exposure, in units of log(mg kg⁻¹ d⁻¹), and below is the standard deviation.

FIGURE 3. Illustration of parametric constraints for each of the three large classification regions (LEI, LEII, and HEIII) along with the subregions that are exceptions to these classifications (HEI, HEIIa, and LEIIIa). The number of CalTOX database chemicals that are represented within these regions and subregions are in parentheses after the classification name.

This first simulation run was used to develop the template from which chemicals were classified and is basically a characterization of the across-chemical variability. The structure of this template is independent of the parameter correlations that exist within the set of chemicals being classified, allowing for the classification of additional previously uncharacterized chemicals. That is, if we sampled across a specific chemical set, thereby accounting for parametric correlations, the resulting template would be specific to this chemical set and would not provide a means of classifying other chemicals. Our choice of developing a more generic template was due to the existence of huge numbers of chemicals that are continually being characterized. As described in the following section, the correlation structure of a given chemical set is reflected by comparing those regions represented by chemicals with those regions not represented by chemicals. Of course, if the properties of a new chemical reside outside of the originally prescribed ranges, then a new template would be required.
Chemical Classification. Using this tree structure based on constraints of five chemical properties, 79 known chemicals were mapped onto the tree and classified (Table 2). As evident in Table 2, not all subregions shown in Figure 2 were represented by one or more of the 79 chemicals. Sixty-four chemicals were classified into region III (log mean exposure = -4 ± 1.4 mg kg⁻¹ d⁻¹), three of which were classified into a low exposure subregion (log mean exposure = -7.3 ± 1.3 mg kg⁻¹ d⁻¹). Figure 3 presents these numbers for a few regions. The majority of these region III chemicals were classified into subregions IIIa, IIIb, or IIIc, which were characterized by a relatively large half-life in root-zone soil (τs > 27 d) and an octanol-water partition coefficient (Kow
TABLE 2. Classification of 79 Chemicals Listed in the CalTOX Database exposure class
exposure subclass
Kow
name
Koc
LE I LE I LE I
dimethyl phthalate acetone butanol
47 0.6 7
79
HE I HE I HE I HE I HE I HE I HE I HE I
1,3-dichloropropene methyl chloride methyl bromide toluene diethyl phthalate ethylbenzene xylenes (total) isophorone
65 8 14 482 222 1330 1300 50
26 6 10 139 83 228 271
estd Koca
τs
τv
0.29 3.36
4.00 4.00 4.00
15 15 28
24
7.21 17.5 17.5 28.4 29. 6.50 15.1 17.5
58.8 63 63 109 115 117 187 63
6.15 × 104
4 13.5
91 59.6
1.57 × 109
14 17.5
200 190
25 39.4 47.5 51.8 56.2 74.9
278 109 45.2 278 190 50.6 61 143 250 88.3 83 175 104 61.4 197 74.6 222 45 156 104 104 143 122 104
134 263 63 260 243 336 916 388 798 10.7 n/a 616 393 198 184 368 388 119 1450 181 181 42 1450 336
LE II LE II
LE IId LE IId
butyl benzyl phthalate hexachlorocyclopentadiene
4.29 × 104 1.28 × 105
4.27 × 104
LE II LE II
LE IIc LE IIc
bis(2-ethylhexyl)phthalate di-n-octyl phthalate
1.64 × 105 3.27 × 109
8.74 × 104
HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III
HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa HE IIIa
1,1-dichloroethylene 1,2-dichloroethane methylene chloride vinyl chloride benzene 1,1-dichloroethane chloroform bis(2-chloroethyl) ether 1,1,2-trichloroethane 1,1,2,2-tetrachloroethane atrizine 1,1,1-trichloroethane bromoform nitrobenzene carbon tetrachloride chlorobenzene 1,2-dichlorobenzene (o) styrene cis-1,2-dichloroethylene 2,6-dinitrotoluene 2,4-dinitrotoluene bromodichloromethane trans-1,2-dichloroethylene chlorodibromomethane
135 27.9 18 15.2 151 62 90 21 126 245 513 273 225 69 527 644 2840 892 52 82 99 108 117 156
4.1 18.4 22 29.2 55.1 59 60 76 76 79 100 110 126 156 198 228 384 912
HE III HE III HE III
LE IIIa LE IIIa LE IIIa
R-HCH (R-BHC) fluorene acenaphthene
6310 1.47 × 104 9260
1840 9750 5030
42.7 44.4 57.1
139 152 216
HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III
HE IIIb HE IIIb HE IIIb HE IIIb HE IIIb HE IIIb HE IIIb HE IIIb HE IIIb HE IIIb HE IIIb HE IIIb HE IIIb HE IIIb
β-HCH (β-BHC) 1,2,4-trichlorobenzene methoxychlor benzo[a]pyrene di-n-butyl phthalate hexachlorobenzene γ-HCH (lindane) naphthalene 3,3-dichlorobenzidine endosulfan hexachloroethane hexachloro-1,3-butadiene heptachlor epoxide chrysene
6930 1.00 × 104 4.53 × 104 2.20 × 106 3.64 × 104 3.51 × 105 5290 2390 3800 3830 1.00 × 104 6.25 × 104 1.38 × 105 5.64 × 105
2430 1700 7.89 × 104 2.49 × 106 1580 4.58 × 104 1500 1070
68.9 70 131 229 234 253 332 398 104 228 104 104 293 385
131 388 208 880 48 5150 123 130 388 28.1 388 388 553 2380
HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III HE III
HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc HE IIIc
1,4-dichlorobenzene (p) tetrachloroethylene anthracene carbon disulfide 1,2-dichloropropane pyrene fluoranthene aldrin benz[a]anthracene trichloroethylene PCB-1254?c dibenz[a,h]anthracene dieldrin heptachlor endrin benzo[b]fluoranthene DDD DDE toxaphene chlordane 2,3,7,8-TCDD DDT
3080 382 3.03 × 104 107 123 1.04 × 105 1.27 × 105 7.20 × 106 4.97 × 105 322 2.59 × 106 4.65 × 106 5.13 × 105 1.67 × 105 8.53 × 104 1.91 × 106 1.35 × 106 3.95 × 106 1850 1.21 × 106 4.62 × 106 1.38 × 106
574 197 2.22 × 104
522 594 649 657 728 820 852 867 878 930 943 945 991 1100 1210 1920 3210 3210 4020 5700 6660 9170
522 760 970 48.5 2740 4010 1020 585 1460 757 943 17300 1100 131 3230 1580 5750 5720 20600 1390 6660 5.720
HE III
HE IIId
indeno[1,2,3-c,d]pyrene
1.76 × 107
665
2060
a
1820 1840 4820 3.00 × 104 6.63 × 104 2.71 × 105
47 7.06 × 104 4.94 × 104 4.84 × 104 4.59 × 104 85.6 1.26 × 106 2.03 × 106 1.26 × 104 6810 1.14 × 104 8.64 × 104 5.18 × 104 5.38 × 106 3.09 × 104
When there is no empirical value for Koc, it is estimated from the equation Kow × 0.48. VOL. 32, NO. 21, 1998 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
FIGURE 4. Summary of landscape and exposure sensitivity analysis. For each of the five chemicals listed, an exposure range is shown for the region in which the chemical was originally classified (black bars), for the shifted range due to the sensitivity of the landscape parameters (dark gray bars), and for the shifted range due to the sensitivity of the exposure parameters (light gray bars).

< 2 × 10⁷). The three chemicals classified into LEIIIa were characterized by a half-life in root-zone soil that was slightly larger than that in regions I and II (27 d < τs < 61 d), compensated by a large Koc and a Kow < 2 × 10⁷. Of the remaining 15 chemicals, four were classified into region II, a low exposure region, and 11 into region I. Of the region I chemicals, three were classified into LEI (log mean exposure = -8 ± 2.7 mg kg⁻¹ d⁻¹) and eight into HEI (log mean exposure = -3.6 ± 1.6 mg kg⁻¹ d⁻¹). Regions I and II are both characterized by a small half-life in root-zone soil (τs < 27 d) and are distinguished by a large Koc associated with region II and a small Koc associated with region I.

Therefore, Figure 2 describes the parameter correlation structure pertaining to this case study of 79 chemicals, each released into the soil at 1 ppm. The largest region that is not represented by any chemicals was within region II, suggesting that a combination of τs < 27 d and Koc between 312 and 1.4 × 10⁴ was inconsistent with the case study chemical set. The region best represented, 63 of the 79 chemicals, was within region III, suggesting that most chemicals were characterized by τs > 27 d and Kow < 2.0 × 10⁷.

Sensitivity Analysis. On the basis of the above findings, five representative chemicals were selected to explore landscape and exposure parameter sensitivity.
Methyl bromide was chosen from region I and benzene was chosen from subregion HEIIIa to represent the volatile organics, hexachlorobenzene was chosen from subregion HEIIIb and dioxin (2,3,7,8-tetrachlorodibenzo-p-dioxin) was chosen from subregion HEIIIc to represent the semivolatile persistent organics, and fluorene was chosen from subregion LEIIIa to represent the PAH compounds. Recall that for the first set of simulations the landscape properties were randomly sampled and the exposure properties were held constant. For the second simulation set, the exposure properties were randomly sampled and the landscape properties were held constant. For both of these simulation sets, the chemical properties were assigned the values associated with one of the five chemicals chosen. Figure 4 shows summary results of the sensitivity analysis. For each of the five chemicals shown in Figure 4 the following applies: The black bar illustrates the mean and range of exposure values for the region in which the chemical was classified. The dark gray bar illustrates the range of values obtained when the single-value (arithmetic mean) chemical properties and the single-value (arithmetic mean) exposure properties are used together with values of the landscape properties sampled from range distributions. The light gray bars are analogous to the dark gray bars
except that the exposure properties are sampled from range distributions and the landscape and chemical properties are fixed at their respective arithmetic mean values. For benzene, hexachlorobenzene, and fluorene, varying either the landscape or exposure property values had little effect on the original classification of the chemical; i.e., values obtained from the simulations (gray bars) remained within the bounds of the black bars. These results suggest that, in general, the uncertainty and variability of landscape and exposure factor values do not shift the chemicals out of their exposure classes; i.e., these chemical classifications were not sensitive to the 1 order of magnitude uncertainties assigned to the 93 landscape and exposure values. The exceptions to this observation were dioxin and methyl bromide. For dioxin, the range of exposure values from both the landscape and exposure sensitivity studies resulted in higher values; however, since dioxin was already classified as a high exposure chemical, its classification status did not change. For methyl bromide, the range of exposure values from both the landscape and exposure sensitivity studies crossed the high/low exposure classification. When fixed nominal values for the landscape and exposure properties were used with methyl bromide, it was classified as a high exposure chemical. However, Figure 4 shows that this high exposure classification can shift to a low exposure classification when there are variations in either the landscape or exposure parameters. The break between low and high exposure for methyl bromide was determined primarily by two landscape parameter values: βq, the porosity of the aquifer layer, and recharge, the rainwater infiltration through the soil (m/d). If βq < 0.22 and recharge < 0.00015 m/d, methyl bromide was classified as a low exposure chemical; otherwise, it was classified as a high exposure chemical.
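The landscape break point reported for methyl bromide can be written as a two-condition rule. The following is a minimal sketch rather than anything taken from CalTOX or CART: the function names are hypothetical, and the uniform sampling ranges in the usage example are illustrative assumptions, not the range distributions used in the study.

```python
import random

# Break points quoted in the text for methyl bromide (landscape parameters).
BETA_Q_BREAK = 0.22        # porosity of the aquifer layer (dimensionless)
RECHARGE_BREAK = 0.00015   # rainwater infiltration through the soil, m/d


def methyl_bromide_landscape_class(beta_q, recharge):
    """Return 'low' or 'high' exposure class from the two landscape values."""
    if beta_q < BETA_Q_BREAK and recharge < RECHARGE_BREAK:
        return "low"
    return "high"


def fraction_low(n_samples, beta_q_range, recharge_range, seed=0):
    """Estimate how often randomly sampled landscapes fall in the low class.

    Both ranges are (min, max) tuples sampled uniformly; this sampling
    scheme is an ASSUMPTION made only for illustration.
    """
    rng = random.Random(seed)
    low = 0
    for _ in range(n_samples):
        beta_q = rng.uniform(*beta_q_range)
        recharge = rng.uniform(*recharge_range)
        if methyl_bromide_landscape_class(beta_q, recharge) == "low":
            low += 1
    return low / n_samples


if __name__ == "__main__":
    # Hypothetical ranges that bracket both break points.
    print(fraction_low(10_000, (0.10, 0.30), (0.00005, 0.00030)))
```

Because both conditions must hold for the low exposure class, a landscape with low aquifer porosity but high recharge (or the reverse) stays in the high exposure class under this rule.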
For the exposure parameters, if the averaging time over which dose calculations were made (AT) was > 32 000 d (88 y) and the fraction of water needs provided by groundwater was > 0.225, methyl bromide was classified as a low exposure chemical. The room ventilation rate in the bathroom, water use in the shower, and exposure duration were all secondary factors that could lower the exposure levels for methyl bromide. The robustness of the classification depends on both the reliability of the chemical properties data used and the variability of site-specific properties, such as those specified by the 93 landscape and exposure parameters. The classification could also be affected by uncertainty in chemical property values. For
some chemicals, properties such as Koc, H, and particularly environmental half-lives (τs and τv) can have large uncertainties due to both measurement error and poorly understood dependence on variable landscape factors such as soil conditions (23). Explicit representation of the uncertainty in the properties of a specific chemical could provide an additional measure of the robustness or sensitivity of any given classification. Although not shown, this type of uncertainty analysis is analogous to the sensitivity studies described in this section; i.e., values would be varied based on an assessment of the uncertainties pertaining to a given chemical’s properties and simulations run to compare the new classification status with the previous ones.
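The chemical-property uncertainty analysis described above, which was not carried out in the study, could be sketched as follows for the region I break point reported in this paper (τv of about 50 d for chemicals with τs < 27 d and Koc < 312). This is an illustration under stated assumptions: the function names, the lognormal form, and the geometric standard deviation are hypothetical choices, not values from the study.

```python
import math
import random

TAU_V_BREAK = 50.0  # d; approximate LEI/HEI break point for region I


def region_one_class(tau_v):
    """Class of a chemical ALREADY in region I (tau_s < 27 d, Koc < 312).

    Treating exactly 50 d as LEI is an assumption; the text gives only
    an approximate break point.
    """
    return "HEI" if tau_v > TAU_V_BREAK else "LEI"


def shift_probability(nominal_tau_v, gsd, n_samples=10_000, seed=0):
    """Fraction of sampled tau_v values whose class differs from nominal.

    tau_v is drawn from a lognormal distribution whose median is the
    nominal value and whose geometric standard deviation is `gsd`
    (a hypothetical uncertainty input, must be >= 1).
    """
    rng = random.Random(seed)
    nominal_class = region_one_class(nominal_tau_v)
    mu, sigma = math.log(nominal_tau_v), math.log(gsd)
    shifted = sum(
        region_one_class(rng.lognormvariate(mu, sigma)) != nominal_class
        for _ in range(n_samples)
    )
    return shifted / n_samples


if __name__ == "__main__":
    # Nominal vadose-zone half-life of 28 d; gsd = 2.0 is an assumption.
    print(shift_probability(28.0, 2.0))
```

A shift probability near zero would indicate a robust classification, while a larger value flags a chemical whose class depends on how well τv is known.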
Discussion

In developing a system for exposure and/or risk estimation, there is an inherent tension between ensuring that a model is comprehensive and striving for simplicity. How comprehensive a model needs to be depends on the questions of interest and the resolution required. For example, in our case study using CalTOX, in the context of exposures due to soil contaminants, we found that only 4 of the 14 chemical properties and, to a lesser extent, 4 of the 93 landscape and exposure properties came into play when we classified chemicals into exposure level categories. This result illustrates a common behavior of large complex models; i.e., only a small subset of the model parameters actually controls the output of a given application. The CART algorithm not only identified these important factors but also provided explicit categorical rules for classification. In addition, CART provided information on both why a given chemical was classified in a particular category and the sensitivity of this classification to uncertainties. For example, in our case study, benzene was classified in a high exposure region (HEIIIa). Shifting this classification to a low exposure region would require either the estimate of τs to decrease from 190 to < 27 d and that of τv from 260 to < 50 d, or the estimate of Koc to increase from 55 to > 913 and that of τs to decrease from 190 to < 61 d. Here we see that a relatively large shift of parameter values is needed to shift the classification of benzene. In contrast, for many chemicals, shifting between the LEI and HEI categories requires a much smaller shift of parameter values. For example, butanol has an estimated vadose-zone half-life of 28 d. If this estimate were found to be greater than 50 d, butanol would shift from LEI to HEI. Likewise, chemicals such as methyl bromide and methyl chloride would shift from HEI to LEI if estimates of τv decreased from 60 to 50 d.
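The size of each shift quoted above can be compared on a common scale by expressing it as a fold change between the current estimate and the threshold. The short calculation below uses only the values given in the text; the helper name `fold_change` is ours.

```python
# (chemical, parameter, current estimate, threshold) as quoted in the text.
shifts = [
    ("benzene", "tau_s", 190.0, 27.0),
    ("benzene", "tau_v", 260.0, 50.0),
    ("benzene", "Koc", 55.0, 913.0),
    ("benzene", "tau_s", 190.0, 61.0),
    ("butanol", "tau_v", 28.0, 50.0),
    ("methyl bromide", "tau_v", 60.0, 50.0),
]


def fold_change(current, threshold):
    """Multiplicative distance between an estimate and a threshold (>= 1)."""
    return max(current, threshold) / min(current, threshold)


for chemical, parameter, current, threshold in shifts:
    factor = fold_change(current, threshold)
    print(f"{chemical:15s} {parameter:6s} {current:6.0f} -> {threshold:6.0f}  x{factor:.1f}")
```

The benzene shifts correspond to changes of roughly 3-fold to 17-fold in a single parameter, and two such changes are needed together, whereas butanol needs a change of about 1.8-fold and methyl bromide about 1.2-fold in τv alone.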
Thus, CART provides not only a scheme for exposure classification but also a process for assessing the robustness of the classification relative to uncertainties in the parameters that are critical to that classification. This sensitivity information is useful in the context of parametric uncertainty, providing insight as to which uncertainties are significant with respect to exposure classification and which are not. This can in turn help establish priorities for research that would be most useful in improving the accuracy of the classifications. For example, with respect to the broad classification of chemicals in the three primary regions labeled in Figure 2, improved estimates of the root-zone soil half-life would have the single largest impact on increasing the accuracy of exposure estimates and thereby decreasing the coefficient of variation of the classification tree. However, as noted in the previous paragraph, with respect to region I chemicals, improved estimates of the vadose-zone half-life are as important as estimates of the root-zone soil half-life. The handling of data gaps can proceed in much the same manner. That is, if there were no data on the vadose-zone half-life for a given chemical, the classification tree of Figure 1 suggests that information on τv is required
to distinguish between a low and high exposure classification only if τs < 27 d and Koc < 312. The break point between the high and low exposure classifications is on the order of τv = 50 d.

Our simulation case study suggested that chemical properties define the broad structure of the exposure categories. The uncertainty of the landscape and exposure properties in most cases did not alter these original classifications, though for some chemicals that were near transitions the classification could shift depending on what specific values were chosen for the landscape and exposure properties. It is of interest to note that two of the important chemical properties were associated with the half-life of a chemical within different soil regions. This finding is in agreement with the study by Labieniec et al. (24), which concludes that compounds that are persistent in the subsurface exhibit a higher risk variability than those compounds that are highly degradable. Degradation properties are difficult to quantify, and where data are available, they tend to have large variations from location to location and large uncertainties due to measurement error and bias. In many cases, regulatory agencies deal with these uncertainties by simply assuming no degradation, that is, an infinite half-life. Our analysis reveals the importance of restricting the range of half-life values to some plausible but still highly uncertain range for regulatory analysis. In addition, we see the value of including ranges for defining the need for and value of additional and improved measurements.

Classifying exposure is only one component of the risk assessment equation. High relative exposure to some chemicals may be less of a public health risk than lower exposures to other chemicals. Toxicity information is also required to provide a comprehensive risk assessment.
One way to incorporate toxicity into this analysis is to consider it as a second dimension, which may be correlated to some degree with exposure if both exposure and toxicity depend on common or related chemical structures or properties. The joint uncertainty of the exposure and toxicity measures could then be assessed to determine the categorical probabilities for a two-by-two classification matrix: high/low exposure and high/low toxicity.

With respect to the regulatory process, if data are unknown but theoretical bounds can be put on risk factors, simulations can provide sensitivity information that can be used in decision making. Therefore, rather than requiring a classification system to use a “best guess” or a “worst case” value, the full range of values that are thought to be plausible can be used, and the importance of this lack of precision can be explicitly assessed. In this manner, the uncertainty of the correct chemical property values can be accounted for when it affects the sensitivity of the classification. If none of the parameter values within this range results in a significant classification shift, then a better understanding of these values will not add value to the regulatory decision. In contrast, if there is a classification shift, the parameter value at which this shift occurs will inform the regulatory process as to the best approach toward making a decision. The conclusion may be that the shift occurs in a region that is unlikely to be valid and is therefore not important, or it may be that this is a critical value to better understand, which would prioritize additional research. In either case, this analytical approach provides a tool that can be used by regulators to help make decisions in the presence of uncertain and/or missing data.
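The two-by-two exposure/toxicity matrix suggested above was not implemented in this study. As a purely illustrative sketch, the categorical probabilities could be estimated by sampling correlated log-scale exposure and toxicity measures. Everything here is an assumption: the bivariate normal form, the correlation, the cut points, the toxicity scale, and the function name are all hypothetical.

```python
import math
import random
from collections import Counter


def two_by_two_probabilities(mean_log_exp, sd_log_exp, mean_log_tox, sd_log_tox,
                             rho, exp_cut, tox_cut, n_samples=10_000, seed=0):
    """Monte Carlo estimate of P(exposure class, toxicity class).

    Log exposure and log toxicity are drawn from a bivariate normal
    distribution with correlation `rho` (assumed form). A draw is 'high'
    when it exceeds the corresponding cut point.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard normal with correlation rho to z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        log_exp = mean_log_exp + sd_log_exp * z1
        log_tox = mean_log_tox + sd_log_tox * z2
        key = ("high" if log_exp > exp_cut else "low",
               "high" if log_tox > tox_cut else "low")
        counts[key] += 1
    return {(e, t): counts[(e, t)] / n_samples
            for e in ("high", "low") for t in ("high", "low")}


if __name__ == "__main__":
    # Exposure inputs borrow the HEI log mean (-3.6) and treat the reported
    # +/- 1.6 as a standard deviation (an assumption); all toxicity inputs,
    # the correlation, and both cut points are hypothetical.
    probs = two_by_two_probabilities(-3.6, 1.6, 0.0, 1.0, rho=0.5,
                                     exp_cut=-5.0, tox_cut=0.0)
    for cell, p in probs.items():
        print(cell, round(p, 3))
```

Each key of the returned dictionary is an (exposure class, toxicity class) pair, so the four values form the cells of the matrix and sum to one.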
Acknowledgments

We acknowledge the helpful critiques and suggestions provided by the three anonymous reviewers and the editor Mitch Small. This work was supported in part with funding provided by the State of California through the Department of Toxic Substances Control (DTSC) by Contract Agreement 95-T1050.
Supporting Information Available

Two tables listing the landscape and exposure properties and range of values used (2 pages). Ordering information is given on any current masthead page.
Literature Cited

(1) Davis, G.; Swanson, M.; Jones, S. Comparative evaluation of chemical ranking and scoring methodologies; University of Tennessee Center for Clean Products and Clean Technologies: Knoxville, TN, 1994.
(2) U.S. EPA. Proposed Rule: Hazardous Waste Identification Rules Waste (HWIR), pre-publication version, Identification Number EPA Docket F-95-WHWP-FFFFF; U.S. Environmental Protection Agency: Washington, DC, 1995.
(3) U.S. EPA. Review of a methodology for establishing human health and ecologically based exit criteria for the Hazardous Waste Identification Rule (HWIR); U.S. EPA Science Advisory Board: Washington, DC, 1996.
(4) Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and Regression Trees; Wadsworth, Inc.: Monterey, CA, 1984.
(5) Jury, W.; Russo, D.; Streile, G.; El Abd, H. Water Resour. Res. 1990, 26, 13-20.
(6) McKone, T. E. CalTOX, A Multimedia Total-Exposure Model for Hazardous-Wastes Sites Part I: Executive Summary; Prepared for the State of California, Department of Toxic Substances Control, Lawrence Livermore National Laboratory, 1993.
(7) McKone, T. E. CalTOX, A Multimedia Total-Exposure Model for Hazardous-Wastes Sites Part II: The Dynamic Multimedia Transport and Transformation Model; Prepared for the State of California, Department of Toxic Substances Control, Lawrence Livermore National Laboratory, 1993.
(8) McKone, T. E. CalTOX, A Multimedia Total-Exposure Model for Hazardous-Wastes Sites Part III: The Multiple-Pathway Exposure Model; Prepared for the State of California, Department of Toxic Substances Control, Lawrence Livermore National Laboratory, 1993.
(9) Maddalena, R. L.; McKone, T. E.; Layton, D. W.; Hsieh, D. P. H. Chemosphere 1995, 30, 869-899.
(10) Mackay, D. Multimedia Environmental Models, the Fugacity Approach; Lewis Publishers: Chelsea, MI, 1991.
(11) McKone, T. E.; Daniels, J. I. Regul. Toxicol. Pharmacol. 1991, 13, 36-61.
(12) U.S. EPA. Chemical Properties for Soil Screening Levels; Prepared by Research Triangle Institute, Research Triangle Park, NC, 1994.
(13) Mackay, D.; Shiu, W. Y.; Ma, C. Illustrated Handbook of Physical-Chemical Properties and Environmental Fate for Organic Chemicals; Lewis Publishers: Boca Raton, FL, 1992-1995; Vols. 1-4.
(14) Verschueren, K. Handbook of Environmental Data on Organic Chemicals, 3rd ed.; Van Nostrand Reinhold: New York, 1996.
(15) Schwalen, E. T.; Kiefer, K. L.; Geng, S.; McKone, T. E.; Hsieh, D. P. H. The Distribution of California Landscape Variables for CalTOX; Prepared by the Risk Science Program, University of California, Davis, for the State of California Department of Toxic Substances Control, 1995.
(16) U.S. DA. STATSGO Database; U.S. Department of Agriculture, National Resource Conservation Service, 1994.
(17) NOAA. Climates of the States, Volume II Western States; U.S. National Oceanic and Atmospheric Administration, Water Information Center, Inc.: 1974.
(18) van der Leeden, F.; Troise, F. L.; Todd, D. K. The Water Encyclopedia, 2nd ed.; Lewis Publishers: Chelsea, MI, 1991.
(19) U.S. EPA. Exposure Factors Handbook; U.S. Environmental Protection Agency, Office of Health and Environmental Assessment: Washington, DC, 1989.
(20) OEHHA. Air Toxics Hot Spots Program Risk Assessment Guidelines Part IV, Technical Support Document, Exposure Assessment and Stochastic Analysis; Office of Environmental Health Hazard Assessment, California Environmental Protection Agency: 1996.
(21) Decisioneering. Crystal Ball, 4.0 ed.; Decisioneering: Boulder, CO, 1996.
(22) Taylor, A. C.; Evans, J. S.; McKone, T. E. Risk Anal. 1993, 13, 403-411.
(23) McKone, T. E. SAR QSAR Environ. Res. 1993, 1, 41-51.
(24) Labieniec, P. A.; Dzombak, D. A.; Siegrist, R. L. J. Environ. Eng. 1996, 122, 612-621.
Received for review November 6, 1997. Revised manuscript received July 22, 1998. Accepted July 28, 1998. ES970975S