Environ. Sci. Technol. 1999, 33, 3332-3340
Exploring Factors Controlling the Variability of Pesticide Concentrations in the Willamette River Basin Using Tree-Based Models S O N G S . Q I A N * ,† A N D CHAUNCEY W. ANDERSON‡ Environmental Sciences and Resources Program, Portland State University, Portland, Oregon 97207, U.S. Geological Survey, Water Resources Division, Portland, Oregon 97216
We analyzed available concentration data of five commonly used herbicides and three pesticides collected from small streams in the Willamette River Basin in Oregon to identify factors that affect the variation of their concentrations in the area. The emphasis of this paper is the innovative use of classification and regression tree models for exploratory data analysis as well as analyzing data with a substantial amount of left-censored values. Among variables included in this analysis, land-use pattern in the watershed is the most important for all but one (simazine) of the eight pesticides studied, followed by geographic location, intensity of agriculture activities in the watershed (represented by nutrient concentrations in the stream), and the size of the watershed. The significant difference between urban sites and agriculture sites is the variability of stream concentrations. While all 16 nonurban watersheds have significantly higher variation than urban sites, the same is not necessarily true for the mean concentrations. Seasonal variation accounts for only a small fraction of the total variance in all eight pesticides.
Introduction In 1990, the Oregon Department of Environmental Quality (ODEQ) identified the protection and enhancement of water quality in the Willamette River Basin as one of its critical long-term resource-management goals (1). In response, the Oregon Legislature established the Willamette River Basin Technical Advisory Steering Committee to work with ODEQ to oversee and establish priorities for the completion of a multiphase Willamette River Basin Water Quality Study (hereafter referred to as the Willamette study). The overall objective of the Willamette study is to develop a complete database for the river basin, coupled with operative water quality models, to enable Federal, state, and local agencies to cooperatively ensure the preservation and beneficial use of the Willamette River Basin and its associated biota (2). In the first two phases of the study, the advisory committee focused on issues including assessment of habitat, biological communities, point- and nonpoint-source pollution, and modeling of flow and water quality. These phases provided * Corresponding author phone: (503)725-8190; fax: (503)725-3888; e-mail:
[email protected]. † Portland State University. ‡ U.S. Geological Survey. 3332
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 33, NO. 19, 1999
an indication of concentrations and relative detection frequency of pesticides in the Willamette River Basin. However, it remained unclear to what extent the results held true for small streams in agricultural and urban areas throughout the basin. On the basis of these results, the U.S. Geological Survey (USGS) undertook a study of water quality in small agricultural and urban streams in the Willamette River Basin as part of phase three of the Willamette study. The USGS study was to (i) describe the distribution of dissolved pesticide concentrations in small streams with intensive urban and agricultural upstream land uses throughout the basin, (ii) document exceedances of water quality guidelines for the targeted pesticides, and (iii) identify the relative importance of broad measures of land use and seasonality in determining those concentrations. The sampling sites were on streams draining 16 small, randomly selected agricultural watersheds and on four urban watersheds (3). In addition to 86 pesticides, the USGS study measured suspended sediment (SS), nutrient concentrations, 5-day biochemical oxygen demand (BOD5), stream temperature, dissolved oxygen (DO), and pH. Information on watershed land use were collected, and pesticides applications were estimated. Figure 1 presents the data used in this paper (we grouped data according to urban and agricultural sites for obvious reasons). Details about data collection and analysis can be found in ref 3. This paper uses data from the USGS study to demonstrate the use of classification and regression trees (CART) for identifying factors associated with the variability of pesticide concentrations in the water environment. The USGS data were collected in 95 samples from 20 randomly selected urban and agricultural subbasins (3). Among a total of 36 pesticides detected basinwide, we select eight compounds as subjects of this paper, including the four most frequently detected (>80% of all samples) herbicides (atrazine, desethylatrazine, simazine, and metolachlor), herbicide diuron (detected in 73% of samples), and three less frequently detected (less than 40% of samples) chemicals (diazinon, pronamide, and tebuthiuron). Information on the distribution of pesticide concentrations is important in assessing the impact on biota or risk of human exposure. When factors associated with pesticide concentrations are known, mechanisms of pesticide release to the environment can be more adequately assessed, and the public can be advised accurately on the potential risks associated with pesticide use. CART is a binary recursive partitioning method that yields a class of models called tree-based models. It is an alternative to linear and additive models for regression and classification problems. The method is attractive to many exploratory environmental studies due to its capability of handling both continuous and discrete variables, its inherent ability to model interactions among predictors, and its hierarchical structure. Instead of using CART for its conventional functions (i.e., prediction and classification), we use CART to identify variables that contribute significantly to the variability of pesticide concentrations. This function is achieved due to the hierarchical structure of a CART model, which reveals the relative importance of each selected variable in the fitted tree structure. Using a tree-based model as an exploratory data analysis tool, we are able to explore the structure of data while imposing few prior assumptions.
Methods Tree-based modeling (4) is an exploratory technique for uncovering structure in data, increasingly used for (i) devising prediction rules that can be rapidly and repeatedly evaluated, 10.1021/es9812148 CCC: $18.00
1999 American Chemical Society Published on Web 08/27/1999
FIGURE 1. Boxplots of log-transformed pesticides concentrations illustrate the difference between urban and agricultural sites. (ii) screening variables, (iii) assessing the adequacy of linear models, and (iv) summarizing large multivariate data sets (5). The models are fitted by binary recursive partitioning, whereby a data set is successively split into increasingly homogeneous subsets until it is infeasible to continue (6). Tree-based models are so-called because the primary method of displaying the fit is the form of a binary tree. Many advantages of using the tree-based modeling approach can be identified. For example, tree-based models are much easier to interpret and discuss than linear models when the set of predictors contains a mix of numeric variables and factors; tree-based models are invariant to monotone transformation of predictors so that the precise form in which these appear in a model is irrelevant; tree-based models are more adept at capturing nonadditive behavior [the standard linear model does not allow interactions between variables unless they are prespecified and of a particular multiplicative form (6)]. The recursive partitioning method can be described on a conceptual level as a process of reducing the measure of “impurity” (4). For a regression problem, this impurity is
measured by the deviance (or the sum of squares) of a particular node i: mi
Di )
∑ (y
k
- µi)2
(1)
k)1
where Di is the deviance for the ith node which has mi observations (indexed by k), yk is the kth observation in the node, and µi is the predicted mean for node i. For a classification problem, the deviance is defined as gi
∑p
Di ) -
k
log (pk)
(2)
k)1
known as the information index, or gi
Di )
∑ p (1 - p ) k
(3)
k
k)1
VOL. 33, NO. 19, 1999 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
3333
known as the Gini index, where gi is the number of classes in the node, pk is the proportion of observations that are in class k. The deviance is zero for a pure node, in which all the yk values are the same (for a regression problem) or all the observations belong to one class (for a classification problem). At the beginning, all observations are assigned to the same node. Each split puts observations into two child nodes (left and right), and the deviation after split is
Di child ) Di,L + Di,R
(4)
The split that maximizes the reduction in deviance ∆D ) Di - Di child is the split chosen at a given node. For a regression problem, this process is equivalent to choosing the split to maximize the between-groups sum of squares in a simple analysis of variance (ANOVA). Without stopping rules, the binary splitting can go on until each final node has only one data point. This kind of overfitting is certainly undesirable. A commonly used procedure for selecting the “right size” of a tree is crossvalidation (4, 6). The original data are divided into mutually exclusive sets, and for each set a sequence of trees with different size are grown to the remaining sets. The set “held out” is then used to evaluate the predictive deviances for the fitted sequence of trees. A plot of the cross-validated deviance or other summary statistics (e.g., the predictive “R 2” or the proportion of variance explained using the held-out set) versus tree size will help determine the right size of the model. In the final (cross-validated) tree, not all candidate predictor variables will be included. This particular feature may be undesirable for some problems. However, in exploratory studies, we view this feature as desirable since treebased models can serve as a means of variable selection. In other words, not all candidate variables are important in explaining the variation of the response variable. Using treebased models, we will be able to identify important variables in predicting the response variable (the top few variables in the tree). We note that the binary recursive partition process works one variable at a time; as a result, the number of candidate predictor variables will not pose the problem of overfitting with too many variables as in a linear regression problem. Because a final tree model divides the predictor variable space into subregions, and within each subregion the response variable variance is relatively small, the recursive partitioning process can be viewed as the opposite of that of ANOVA. In other words, ANOVA tests whether the considered factors contribute to the response variance significantly, and the tree-based model identifies those factors that contribute significantly to the response variable variation. Pesticide concentrations are measured as continuous variables, so regression tree modeling can be applied directly. However, a common problem of many environmental studies is that a significant number of observations may be at values below the method detection limit (MDL) or left-censored. There are many possible ways of dealing with left-censored data (7); for example, replacing those censored data with a fixed number is a common practice. However, when the portion of censored data is overwhelming (say over 20%), replacing them with a constant may bias the resulting analysis. Instead, we may treat the concentration data as categorical, i.e., dividing the data into categories such as below MDL, low, mid-range, and upper-range, and a classification tree model can be used. In this study, we treat the four most frequently detected herbicide concentrations as continuous variables, since less than 10% of all data were under MDL; these concentrations were treated as being equal to the values of detection limits. The next four pesticides have at least 27% undetected concentration values and are therefore treated as categorical. We use both the classification 3334
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 33, NO. 19, 1999
tree and regression tree to model the herbicide diuron (with 27% measurements under MDL) to compare the two ways of dealing with the problem of nondetection.
Results Concentrations of each pesticide are used as the response variable for the respective model. For the regression models, concentration values were log-transformed so that their distributions are more likely to be symmetric. Potential predictor variables included in each model are the percentage of subbasin area covered by agricultural (Ag), residential or urban (Resid), and forested (Forest) land uses, watershed size (Size), number of crops in the watershed (NoCrop), sampling site location (Latitude and Longitude), river water chemistry measurements (NH4+ or NH4; NO2- + NO3- or NOx, TKN, 5-day biochemical oxygen demand or BOD; total phosphorus or TP; soluble reactive phosphorus or SRP; and fecal bacteria), and month of the year (Month) when samples were taken (April, May, July, October, and November) to represent the seasonal effect. We use water chemistry measurements to represent the degree of human impact of the watershed. For instance, agricultural intensity may be partly represented by the concentration of nutrients, and BOD and fecal coliform bacteria are indicative of pollution from human and animal waste. The final models are presented graphically. These models are selected based on their predictability simulated by using cross-validation. Since the final model may be a result of purely random selection of the first split among several competing splits (4), it is important to examine these competing splits in order to draw appropriate conclusions. As an example, we present competing splits for the diuron models. We first compare the two models developed for diuron: one is a regression model and the other is a classification model. Regression versus Classification. The comparison of the results from using a regression tree and a classification tree on the diuron data provides insight into whether the conversion of pesticide concentration from a continuous variable to a categorical one is an effective way for handling left-censored data. In this section, we also present the crossvalidation results used for selecting the final model. The upper panel of Figure 2 presents the final regression tree model for diuron with four splits. The first split is on Ag. The second split is on NOx for Ag < 74.5%. The third split is on NH4 (the first three splits represent the scale and intensity of agriculture activities in the watershed), and the last split is on Forest. The middle panel shows boxplots of diuron concentrations (log-transformed) within each of the final nodes. The basis of including only the first four splits in the final model is presented in lower panel, which shows that further splits would not reduce model’s relative predictive error (the lower right panel) or increase predictive R 2 (the lower left panel). The classification model for diuron is presented in Figure 3 (top panel). The final model has five splits. The first four splits are the same as in the regression model. The original concentration data were divided into four categories: below MDL, low, medium, and high. The below MDL class is obvious. The other three classes are decided on the basis of a cumulative probability plot shown in the lower panel, where two natural breaks are apparent. We note that the number of data points in each class is not the same. The two models (Figures 2 and 3) are similar to each other. The first four splits of both models are on the same variables, and they were split at the same or very close values. This result indicates that the data structure was not altered significantly after the transformation of the diuron concentration from a continuous to a categorical variable, supporting
FIGURE 2. Top panel shows the final regression (tree) model for diuron. The text above each split shows the variable that is split and the condition for the left branch. The number of observations in each node and the mean value of the response variable are printed below each split. Here the mean value is the average of the log-transformed diuron concentrations. When used in prediction, we determine first whether Ag < 74.5. If the answer is no, we check the variable NH4 and predict diuron concentration to be -0.3789 if NH4 < 0.0835 and predict 0.5659 otherwise. The relative height of each stem represents the amount of variation explained by the respective split. The middle panel shows boxplots of log diuron concentrations in each of the four final nodes. Cross-validation results for the diuron regression model are shown in the two bottom panels. The bottom left panel shows the model R 2 values at each split (the solid line labeled as “Apparent”) and the relative prediction R 2 calculated from a cross-validation simulation (the dashed line labeled as the “X Relative”). The bottom right panel presents the relative prediction error based on the same cross-validation simulation (or X relative error) at each split ((1 standard error). the idea that tree-based models can be useful for environmental data analysis where left-censoring of data is often a problem. The final models shown in Figures 2 and 3 are not to be regarded as the only feasible models, as discussed in ref 4; competing splits might be excluded purely by chance (e.g., due to measurement or sampling error). Therefore, listing some of the competing splits (Table 1) may help us identify important factors. In Table 1, the variable column indicates the variable to be split, the split at column shows the value at which the respective variable is split, and the improve column presents the proportion of variance reduction (regression) or the percent of misclassification reduced (classification) after the respective split. Node number in Table 1 represents the order of splits. Unfortunately, the current setup of CART does not allow us to fix the first split
at a given variable and investigate possible alternatives. We note that the first three competing splits in Table 1 happen to be the first three variables selected in the final model. As a result, we do not anticipate the overall conclusions would be changed should an alternative model be selected. Regression Models. Regression models were fitted to the log-transformed pesticide concentrations. The pesticide concentrations were log-transformed to stabilize variance and to make the distribution of the resulting variable roughly symmetric. As a consequence of this transformation, model predictions (the mean) represent the median in the original metric. We use the sample variance of the log-transformed concentration data as a measure of spread. The variability of atrazine (with a variance of logarithm concentration or log variance being 0.630) in streams was largely determined by land use (Ag) (Figure 4a). The median VOL. 33, NO. 19, 1999 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
3335
FIGURE 3. Final classification (tree) model for diuron (top panel) revels a similar structure as in Figure 2. The text above each node is variable that is split and the condition for the left branch. The classification of the corresponding node is listed under the respective split. The numbers (in parentheses) indicate the number of data points in each class (below MDL/low/medium/high). See Figure 2 for other explanations. The lower panel is the quantile plot of log-transformed diuron concentrations (the cumulative distribution of the data). The data are divided into four groups based on the shape of the plot. Observations with values below MDL are set to be equal to MDL (-1.70 or 0.02 µg/L). The two horizontal lines separate low, medium, and high at values shown (original scale, µg/L, in parentheses). Note that there may be other methods for dividing the data into groups that are physically meaningful or meet certain management requirement. atrazine concentration for those watersheds with 85.5% or more agriculture use is 3.388 µg/L (or 0.5299 in logarithm, with a log variance of 0.422), and the median for the other watersheds is 0.073 µg/L (-1.138, 0.415). Among watersheds with Ag between 42.5% and 85.5%, the number of crops found in the watershed (NoCrop) is the next important variable. For watersheds with NoCrop < 7 (usually dominated by grass seed crops), the median atrazine concentration is 1.11 µg/L (0.04666, 0.36), and the median concentration is about 0.093 µg/L (-1.029, 0.204) for those watersheds with more than 3336
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 33, NO. 19, 1999
seven different crops. When Ag is less than 42.5% of the total watershed (-1.5760, 0.122), changes in atrazine concentration are largely seasonal (not shown), although the seasonal difference is relatively small. Based on the cross-validation R 2 and relative prediction error, a model with two splits (or three terminal nodes) is as “good” as models with three or four splits. In other words, the percent agricultural coverage in the watershed alone can be used as the predictor of atrazine concentration in the stream. We presented the model with three splits in Figure
TABLE 1. Competing Splits for the Diuron Model regression model variable Ag NOx NH4 TP NOx TKN Size NH4 BOD Month Forest Size SRP
split at node no. 1: 94 observations 74.5 1.69 0.0835 0.174 node no. 2: 63 observations 1.735 0.5825 3430 node no. 3: 31 observations 0.0835 2.55 category node no. 4: 49 observations 5.5 3970 0.051
classification model improve
variable
0.3836 0.3808 0.3651 0.3333
Ag NH4 TP NOx
0.3481 0.2562 0.2317
NOx TKN Size
0.4107 0.3301 0.2967
NH4 Month Latitude
0.4367 0.3262 0.2285
Forest Longitude Size Latitude NoCrop Longitude
4a to show one additional factor (NoCrop) that may have influence on atrazine concentrations. A close competitor to NoCrop is the geographic location. The model shows that the basin can be split into two regions at latitude about 44°45′′ (a few kilometers north of Albany, OR). Atrazine concentration is lower in the north and higher in the south. It is noted that agricultural areas south of Albany are dominated by grass seed crops. Changes of the concentrations of desethylatrazine (with a log variance of 0.292) are also dominated by the changes in the agricultural coverage of the watershed (Figure 4b). For desethylatrazine, the cutoff point for percent agricultural coverage is 50%. The second factor affecting desethylatrazine concentrations is geographic location for those sites with agriculture coverage more than 50%, watersheds north of latitude ∼44°45′′ have lower desethylatrazine concentrations. Since desethylatrazine is a decomposition byproduct of atrazine, it is understandable that similar structure is seen in the two models (Figure 4a,b). Variances for the three terminal nodes are similar (∼0.10). NOx, SRP, and Latitude of the watershed are the three factors determining concentrations of metolachlor (log variance 0.933) (Figure 4c). The median concentration varies from 0.006 µg/L (-2.199, 0.277) for low nutrient sites (NOx < 1.855 and SRP < 0.152) to 0.483 µg/L (-0.3162, 0.255) for those southern sites (south of ∼45°5′′) with NOx concentration larger than 1.855. Variances for the other two terminal nodes are close to 0.470. The concentrations of simazine (log variance 0.443) are largely explained by geographic location (Figure 4d). Sites west of the Willamette River (west of longitude ∼123°9′′) have a median simazine concentration of 0.007 µg/L (-2.14, 0.047), and sites to the far east of the valley (east of longitude ∼122°50′′) have a median of 0.290 µg/L (-0.5379, 0.093). In between, the median simazine concentration is 0.040 µg/L (-1.401) (with a high variability, ranging from 0.015 µg/L (-1.82, 0.062) for small watersheds with low TKN concentration to 0.046 µg/L (-1.338, 0.273) for large watersheds and to 0.08 µg/L (-1.096, 0.386) for sites with high TKN). We note here that the influence of watershed size and TKN concentration may not be statistically significant, since the predictive R 2 for the third and fourth splits (on TKN and watershed size) were actually reduced (i.e., the increase in apparent R 2 can be explained as fitting noise). The full model has further
split at node no. 1: 94 observations 74.5 0.0835 0.2015 1.69 node no. 2: 63 observations 1.5 0.5825 3430 node no. 3: 31 observations 0.0835 categorical 44.91 node no. 4: 47 observations 5.5 -122.8 3970 node no. 5: 21 observations 44.91 12 -123.1
improve 31.31 31.26 29.06 24.23 11.130 10.840 5.900 9.662 8.904 8.691 9.089 6.440 5.839 8.352 8.352 6.345
splits on Month (higher concentrations in spring). We considered removing TKN as a predictor variable to determine if seasonal variation is significant. The cross-validation results indicate that longitude of the sampling site is the principal significant factor affecting variation of simazine concentrations. Classification Models. Classification modeling is applied to diazinon, diuron, pronamide, and tebuthiuron. The model for diuron is presented in the Regression versus Classification section. A classification model predicts class association. In our situation, there are four classes of contaminant concentrations (below MDL, low, medium, and high). The prior probability of each individual site belonging to a particular class is taken to be 1/4. The prediction of class association for each node is the class with the highest posterior probability. For diazinon data, 70 of the 95 samples are below MDL, and 50 of the 58 samples with forest coverage larger than 3.5% have diazinon concentrations below MDL. When forest coverage is less than 3.5%, seasonal variation dominates (high in spring and low in summer and fall) (Figure 5a). All samples collected from watersheds with agricultural land use less than 42.5% have pronamide concentrations below MDL (Figure 5b). Pronamide concentrations are highly variable in watersheds with high agriculture coverage. According to the model, this high variability is mainly seasonal (highly variable in spring and early fall, high in late fall/early winter). Low tebuthiuron concentrations are likely in watersheds with small residential area coverage (