Monitoring Network Design for Phytoremediation Systems Using

May 10, 2011 - The underlying assumption in the above monitoring network design approaches is that sufficient information is available through direct ...
0 downloads 0 Views 2MB Size
ARTICLE pubs.acs.org/est

Monitoring Network Design for Phytoremediation Systems Using Primary and Secondary Data Sources Gayathri Gopalakrishnan,*,† Barbara S. Minsker, and Albert J. Valocchi Department of Civil and Environmental Engineering, University of Illinois at UrbanaChampaign, 205 N Mathews Avenue, Urbana, Illinois 61801, United States

bS Supporting Information ABSTRACT: Phytoremediation, or contaminant removal using plants, has been deployed at many sites to remediate contaminated soil and groundwater. Research has shown that trees are low-cost, rapid, and relatively simple-to-use monitoring systems as well as inexpensive alternatives to traditional pump-and-treat systems. However, tree monitoring is also an indirect measure of subsurface contamination and inherently more uncertain than conventional techniques such as wells or soil borings that measure contaminant concentrations directly. This study explores the implications for monitoring network design at real-world sites where scarce primary data such as monitoring wells or soil borings are supplemented by extensive secondary data such as trees. In this study, we combined secondary and primary data into a composite data set using models to transform secondary data to primary, as primary data were too sparse to attempt cokriging. Optimal monitoring networks using both trees and conventional techniques were determined using genetic algorithms, and trade-off curves between cost and uncertainty are presented for a phytoremediation system at Argonne National Laboratory. Optimal solutions found at this site indicate that increasing the number of secondary data sampled resulted in a significant decrease in global uncertainty with a minimal increase in cost. The choice of the data transformation model had an impact on the optimal designs and uncertainty estimated at the site. Using a data transformation model with a higher error resulted in monitoring network designs where primary data were favored over colocated secondary data. The spatial configuration of the monitoring network design was similar with regard to the areas sampled, irrespective of the data transformation model used. Overall, this study shows that using a composite data set, with primary and secondary data, results in effective monitoring designs, even at sites where the only data transformation model available is one with significant error.

’ INTRODUCTION The goal of spatial soil and groundwater monitoring is often to provide the best possible map of contaminant concentrations at a particular instant in time given data limitations such as sparse monitoring networks and irregular geometry.14 Methods commonly used in generating these maps are hydrogeological and geostatistical approaches.1 The hydrogeological approach uses fate and transport models to estimate contaminant concentrations5 whereas the geostatistical approach utilizes geostatistical estimation procedures such as kriging to provide minimum error estimates of contaminant concentrations at unsampled locations using linear combinations of sample values.68 While the geostatistical approach is computationally less intensive than the hydrogeological method for a stationary contaminant map, sufficient qualitative and quantitative information is still required to develop accurate contaminant maps at real-world sites. This information is typically obtained through samples collected from soil borings or groundwater monitoring wells. The number and location of the wells and borings, sampling frequency and type of samples collected form the basis of monitoring network design.9 Monitoring network design has focused primarily on three problems to date: (a) how to design or augment monitoring networks for site characterization using geostatistics10 (b) where r 2011 American Chemical Society

to locate new monitoring points for contaminant plume detection through a combination of optimization and numerical simulation1 and (c) how to reduce spatial and temporal redundancies in existing monitoring networks at sites undergoing remediation through redundancy approaches, often using geostatistics and optimization.2,3,1113 The objectives used in designing networks are typically site-specific and should be chosen to reflect stakeholder preferences.1 Commonly used objectives include minimizing the estimation uncertainty and minimizing the cost of monitoring.2,1114 The underlying assumption in the above monitoring network design approaches is that sufficient information is available through direct measurements of the attribute of interest, also known as primary data. This is not usually true at most real-world sites, where the cost associated with direct measurements of interest (for example, groundwater concentrations of a particular chemical sampled by monitoring wells) is significant, resulting in sparse data. At many sites, scarce primary data is often supplemented with cheaper and more extensive secondary data.68,15 Received: April 29, 2010 Accepted: May 2, 2011 Revised: March 28, 2011 Published: May 10, 2011 4846

dx.doi.org/10.1021/es1042657 | Environ. Sci. Technol. 2011, 45, 4846–4853

Environmental Science & Technology Examples of secondary data include water level data at monitoring wells,16 electric resistivity and electromagnetic induction data obtained from geophysical surveys,17 and concentrations of cocontaminants found in the soil or groundwater that are sampled more extensively than the contaminant of interest.6,18 In recent years, researchers have found that contaminant concentrations found in plant tissue samples collected at phytoremediation sites are correlated with soil and groundwater concentrations and could potentially be used as secondary data.1921 Specifically, Vroblesky et al.19 found that chemicals, primarily trichloroethene (TCE) and cis-1,2, dichloroethene (DCE), are present in the trunks of trees growing above shallow contaminated groundwater. It was hypothesized that direct uptake of contaminated groundwater by plant roots was reflected in tree core concentrations and hence analysis of tree cores could be used to qualitatively delineate a groundwater plume. Struckhoff et al.20 showed that there was a significant correlation between soil concentrations of tetrachloroethene (PCE) and tree cores collected from the trunks of the trees. While tree core samples could potentially be used as secondary data sources, the sampling method has disadvantages when used in long-term monitoring. Collecting multiple core samples from the same tree would be necessary for monitoring purposes and is likely to cause damage to the tree, thus rendering the method infeasible. In order to overcome this disadvantage, Gopalakrishnan et al.21 collected branch samples from trees growing at a phytoremediation site with VOC-contaminated soil and groundwater and developed significant correlations (>0.8) between contaminant concentrations in the tree branch samples and VOC concentrations found in the soil and groundwater, respectively. Branch samples are faster and less expensive to collect than tree core samples, can be collected multiple times from the same tree and could potentially be used as secondary data in monitoring network design. The presence of secondary data sources raises an interesting question for practitioners in the field: how can primary and secondary data be combined to obtain the best map of the contaminants? Co-kriging, a geostatistical interpolation technique that estimates the primary attribute at an unsampled location as a function of the primary attribute at sampled locations and the secondary attribute at all sampled locations, has been the preferred technique to date.6,22 Estimation errors using cokriging are lower due to the inclusion of additional secondary data; however the technique is extremely demanding in terms of the data and modeling effort required.6,23,24 Previous studies have focused on minimizing secondary data requirements through techniques such as branch and bound algorithms,25 fuzzy set theory16 and sequential search algorithms.26 A common drawback to these methods is the focus on secondary data while assuming sufficient primary data are available at the site to perform the analysis. At many real-world sites, primary data are too few to perform cokriging accurately, even with an abundance of secondary data.27 The practitioner then has to make a difficult decision: should primary data be increased by drilling additional soil borings or groundwater monitoring wells? This decision would substantially increase the cost of monitoring and may not be feasible at all sites. An alternative is to combine existing secondary and primary data in order to develop a sufficiently accurate map of the contaminants. When the choice is to combine data types, a method that accounts for the relationship between primary and secondary data needs to be developed. Hubbard et al.17 developed linear relationships between secondary geophysical data and hydraulic

ARTICLE

well bore measurements and used these relationships in a Bayesian approach to estimate hydraulic conductivity over the entire aquifer. This approach was first validated using synthetic data sets and subsequently at the field scale where it was found that secondary geophysical data “can provide valuable, high-resolution information” that may help to reduce the ambiguity associated with hydrogeological heterogeneity at the fieldscale.17 Abbaspour et al.27 explored the use of combined secondary and primary data to predict contaminant concentrations and contrasted the results to those obtained using cokriging. A relationship was determined via a “transfer function” or a regression equation developed using secondary data present at the same location as primary data (colocated data).27 Secondary data were then transformed into estimated primary data using the “transfer function” and a single data set generated by combining the estimated and measured primary data.27 Each data point in the data set was assumed to have an error associated with it. For measured primary data, this was assumed to be the measurement error for each sample. For estimated primary data, this was assumed to be the sum of the measurement error and the estimation error associated with the transformation of the secondary data into primary data. Abbaspour et al.27 tested this approach at 2 field sites. Results indicated that using a combined data set enabled estimation of contaminant concentrations when insufficient primary data were available and resulted in lower estimation errors compared to cokriging. At most real-world sites, a “true contaminant map” where contaminant concentrations are known at every single location is not available. Practitioners assume a “known contaminant map” based on available primary and secondary data and verify this through constant monitoring at the site. When primary and secondary data are combined to develop the best possible contaminant map, an interesting question is raised while designing the monitoring network: which data types and locations are more valuable in mapping the contaminants? Primary data that are measured directly have lower error values but are also sparse and expensive to collect. Estimated primary data have higher error values as a result of the transformation from secondary data but are abundant and less expensive to sample. This question is addressed for the first time in this paper where we develop an approach to optimizing monitoring network design that considers both primary and secondary data. We then investigate the relative value of different data types and the impact of the transformation method used and associated error on monitoring network design. The objective of the approach is to enable practitioners at real-world sites to develop alternative monitoring network designs using both primary and secondary data at sites where primary data are insufficient for geospatial statistical analyses. Additionally, the most efficient monitoring network design depends on the choice of performance criteria. If the objective is to generate the most accurate map of the contaminants given available data, the best design would require sampling all primary and secondary data, a strategy that is often prohibitively expensive. Sampling a subset of primary and secondary data would reduce the monitoring cost but could result in greater uncertainty in estimating the contaminant map. Uncertainty is modeled by computing the error associated with estimating contaminant concentrations at unsampled locations. Uncertainty can be local, where the error is estimated at a single unsampled location, or global, where the error is computed for all unsampled locations.28 4847

dx.doi.org/10.1021/es1042657 |Environ. Sci. Technol. 2011, 45, 4846–4853

Environmental Science & Technology Local uncertainty is often used when the goal of monitoring network design is to augment the network by installing new monitoring locations. Global uncertainty is used when the goal is to “measure the relative importance of monitoring locations based on their influence when computing the spatial moments (such as mass) for the entire contaminant map”.2 The error is often measured as kriging estimation variances for the contaminantmap. This paper also investigates the trade-offs between the cost of the monitoring network design and global uncertainty in estimating the contaminant map when different data types are used. The approach was tested at a phytoremediation site at Argonne National Laboratory, Illinois (Argonne), previously described by Gopalakrishnan et al.21 A detailed description of the site is given in the Materials and Methods section and in the Supporting Information (SI). Primary data are trichloroethylene (TCE) concentrations in soil samples and secondary data are TCE concentrations in plant branch tissue. Here, we first use two data transformation models with different estimation errors to convert secondary data to primary data. A multiobjective optimization algorithm in combination with kriging is then used to design the monitoring network and quantify the trade-offs between cost and global uncertainty for the contaminant map.

’ MATERIALS AND METHODS The phytoremediation site at Argonne consists of willow trees and poplar trees that are remediating contaminated soil and groundwater, respectively. Volatile organic chemicals, primarily TCE, are the main contaminants found at this site.21,29 Soil consists of sand and silty sand with clay lenses. Groundwater at this site is relatively shallow at approximately 69 m below ground surface.29 Hybrid willows were planted on a 3  3 m grid in the area of soil contamination with roots at approximately 3 m depth. Hybrid poplars were planted on a 4.8  4.8 m grid in the area of groundwater contamination using the TreeWell technology so that the roots were forced to grown in the deeper contaminated groundwater. As described by Gopalakrishnan et al,21 in 2004, branch samples from 120 willows and 126 poplar trees were obtained by collecting a section of a tree branch at the ground surface in the southwest direction.21 Soil samples were collected from 31 locations at depths of 2.53.5 m and groundwater samples were collected from six monitoring wells. Soil samples were compared to branch samples from the nearest willow trees and a similar analysis was carried out for the groundwater samples and poplar trees. Data transformation models were developed to transform tree data into soil data.21 Detailed descriptions of the site, sampling and analysis protocols and model development are described by Gopalakrishnan et al.21 and are available in the SI. In this paper, primary data are TCE concentrations in soil samples and secondary data are TCE concentrations in plant branch tissue collected from willows. Data from the groundwater monitoring system at this site have not been used as only six groundwater monitoring wells are present, which is insufficient to perform a meaningful geospatial statistical analysis. The methodology used here combines (1) data transformation models to convert secondary data to primary data with (2) interpolation using log-normal kriging6 and error estimates from data transformation and (3) search and optimization using the Nondominated Sorted Genetic Algorithm-II (NSGA-II).30 The

ARTICLE

details of each component in this methodology are presented in the following sections of this paper. Data Transformations. Secondary data in this case study (i.e., willow branch TCE concentrations) are transformed into primary data (i.e., soil TCE concentrations) using a linear regression model and an analytical model (for details on model development see Gopalakrishnan et al.21). The models were developed and compared using the 31 soil samples and samples from the nearest trees (colocated samples). The site-specific linear regression model (Model 1) was developed by directly comparing TCE concentrations in the 31 soil samples with TCE concentrations in branch samples from the nearest neighbor willow trees. Model 1 (R2 = 0.89) is given by log½Csoil  ¼ 1:97log½Cbranch   0:17

ð1Þ

where Csoil = TCE concentration in soil (μg/kg) and Cbranch = TCE concentration in willow branch (μg/kg). The analytical model (Model 2) was developed considering uptake and transport processes for contaminants in trees. Contaminants in the subsurface first enter the plants through the roots, are transported in the plant tissue (xylem) and undergo diffusion, transformation and sequestration reactions.1921 Model 2 is given by Cbranch ¼

Csoil exp½  ððRa =ðRb  Ra ÞÞð2Dr π=Q Þ Kd ðnz þ z2 ÞTSCFð1  f ÞKww

ð2Þ

where Kd = equilibrium distribution coefficient between soil and water = foc0.63 Kow foc = mass fraction of organic carbon (g/g) = 1.75% Kow = octanolwater partition coefficient for TCE () = 10 2.33 Ra = radius of the branch (cm) Rb = effective diffusion length of the chemical in the branch (cm) = 0.85Ra Dr = diffusion coefficient (cm2/s) = 1.1  104 Q = flow rate through the tree (L/d) = 15 n = number of branches at the sampling point = 39 z = height of sampling location above the roots (cm) = 275 z2 = distance along the branch where sample is collected (cm) = 10 TSCF = translocation stream concentration factor () = 0.75 f = fraction of contaminant degraded in the root zone or rhizosphere = 0 Kww = equilibrium partition coefficient between the wood and water in the tree = 98.6 L/kg The diffusion coefficient in Model 2 was obtained by fitting the model to data (R2 = 0.85). As shown here, Model 1 has fewer parameters and is easier to use than Model 2. Both models are compared in this study in order to evaluate the effects of the transformation model on designing the optimal monitoring network. We assume that tree and soil data are point data and we do not consider differences resulting from the fact that tree roots will integrate over some volume of surrounding soil. Interpolation Using Log-Normal Kriging. The composite data set comprising measured and estimated primary data is highly skewed and approximately log-normal. Hence, log-normal kriging, where a logarithmic transformation is used to convert the data into an approximately normal distribution, was selected.68,22 While there are theoretical limitations to log-normal kriging, 4848

dx.doi.org/10.1021/es1042657 |Environ. Sci. Technol. 2011, 45, 4846–4853

Environmental Science & Technology

ARTICLE

primarily due to the sensitivity of the back transformation from logarithmic data space, the method yields the best results when data are highly skewed or relatively sparse.31 All operations are performed in the logarithmic data space, including calculating the estimation errors for transformed secondary data, interpolation using kriging, and search and optimization. The software package KT3D from GSLIB is used in the geostatistical analysis.22 The variogram model is selected so as to minimize the cross-validation errors obtained in kriging.6 A spherical model with a nugget of 0, sill of 7, and range of 40 is selected. Errors in the variogram model as a result of using data of different quality have not been considered here. Estimation Error for the Transformed Secondary Data. We follow the method developed by Abbaspour et al.27 to combine primary and secondary data. The estimation error for transformed secondary data is calculated in two steps. First, the residual mean square error for colocated secondary data (tree samples) is calculated using the equation below. p

S2 Z, Yi ¼

ðlog½Csoil, m   log½Csoil, m, e Þ2 ∑ m¼1 p

ð3Þ

where S2Z,Yi = Residual mean square error Csoil, m = Measured TCE concentration in soil at location m (μg/kg) Csoil, m, e = Estimated TCE concentration in soil at location m using the data transformation models described above (μg/kg) p = Number of colocated locations = 31 Then, the estimation errors for secondary data that are not colocated are calculated by 2 3 6 ðlog Cbrch, 0  log Cbrch, avg Þ2 7 1 7 þ 1 þ S2 Z0 ¼ S2 Z, Yi 6 6 7 n p 4 25 ðlog Cbrch, j  log Cbrch, avg Þ



j¼1

ð4Þ where S2 Z0 = Residual mean square error at location 0 Cbrch, 0 = Measured TCE concentration in tree branch at location 0 (μg/kg) n = Total number of secondary data =120 Cbrch, avg = Average TCE concentration for tree branches (μg/kg) = n

∑ ½Cbrch, m  m¼1 n The estimation errors for secondary data transformed using Model 1 ranged from 0.14 to 0.17 and from 0.68 to 0.72 when Model 2 was used. The higher errors for Model 2 compared to Model 1 are likely due to Model 1 being a site-specific model with significantly fewer parameters and a higher correlation with the data compared to Model 2.21 Additionally, some of the parameters used in Model 2 (e.g., TSCF, Kww) were obtained from laboratory-scale experiments and parameter values could be different at the field-scale, which would also contribute to the

higher errors obtained in Model 2. As we are evaluating the impact of data transformation methods in this study, we assume that the measurement error is negligible compared to the estimation error. Multiobjective Optimization. Multiobjective optimization is used to help decision makers identify and explicitly select and balance conflicting objectives for the application being evaluated. In many cases, improving performance in one objective (e.g., cost) reduces performance in another objective (e.g., uncertainty and optimal trade-offs between the objectives of the application need to be identified. These trade-offs are composed of a set of solutions that are better than all the other solutions in at least one objective and are known as nondominated solutions. These solutions are plotted to form a Pareto front for 2 objectives or a Pareto surface for more than 2 objectives.2 In this paper, the NSGA-II developed by Deb et al.30 is used for multiobjective optimization. Genetic algorithms (GAs) are heuristic techniques that search for the optimal solution to a problem based on Darwinian ‘natural selection’. The decision variables defined for the problem are coded as a string of binary digits (01) or real numbers and are linked to form the “chromosome”. We used binary decision variables in this problem, as a location was either sampled (1) or not (0). Each chromosome represents a single trial design. The fitness of each member of a randomly generated initial population of these strings is determined by how well the design satisfies specified objectives and constraints. Once the fitness of the entire population of designs is determined, the GA uses three classic Darwinian operators to evolve the population to the optimal or near-optimal solution—selection, crossover and mutation. The NSGA-II uses a two-step selection processes that combines both binary tournament selection and ranking-based selection.30 Binary tournament selection occurs first. In this procedure randomly selected strings are compared using binary tournament selection and the “fittest” of them are allowed to enter the mating pool. The next operator is crossover. In this process, members of the mating pool are coupled together to mate with a specific probability Pc. Mating involves selecting one or more “crossover” points where the strings exchange bits (genes) with each other. The third operator used is mutation, where randomly selected bits within the new child population are changed with a given probability of mutation Pm. Finally, the NSGA-II uses rankingbased selection to determine which of the parent and child designs will survive. In this selection scheme, the populations of N parent designs and N child solutions in the current generation t are combined to yield a selection pool of 2N individuals, from which the N highest ranked individuals are allowed to pass to generation t þ 1. As shown by Reed and Minsker,2 “this selection method aids the algorithm in efficiently identifying high-order Pareto surfaces because it is implicitly elitist (i.e., the best designs are guaranteed to survive into the next generation)”. Details on designing the NSGA-II for this problem are presented in the SI. Optimization Formulation. The objectives for the case study were determined in consultation with field managers at Argonne. These objectives are: (1) minimizing monitoring cost and (2) minimizing the global uncertainty in estimation of contaminant concentrations. Global uncertainty was selected at this site as the goal is to evaluate the relative redundancy of monitoring locations for the entire contaminant map. The monitoring cost is defined as the cost of sampling and analysis for both soil and tree samples. Cost estimates were obtained from field managers at Argonne. The best estimate of contaminant concentrations is obtained when all soil borings and 4849

dx.doi.org/10.1021/es1042657 |Environ. Sci. Technol. 2011, 45, 4846–4853

Environmental Science & Technology

ARTICLE

Following Reed and Minsker,2 we quantify uncertainty at all unsampled locations using the estimated standard deviation (i.e., the square root of estimation variance) obtained from kriging. However, in this problem, we have an additional source of error as a result of transforming the secondary data into estimated primary data. Therefore, we calculate the global uncertainty for the monitoring network as the sum of the error computed for all unsampled locations in the monitoring network and the estimation error associated with the transformed secondary data sampled in the monitoring design (tree samples). The global uncertainty function is then given by uncertainty ¼

i ¼ unsamp

∑ i¼1

ðUi Þ þ

j ¼ ntreesamp



j¼1

ðEtree, j Þ

ð6Þ

where Ui = square root of estimation variance at the unsampled location i unsamp = total number of unsampled locations in the monitoring network design error for the sampled tree location j = Etree,j = Estimation √ 2 S Zj ntreesamp = number of trees that are sampled in the monitoring network design The decision variables for this case study are as follows: Figure 1. Cost and uncertainty trade-off curves for the composite data set consisting of measured soil samples and tree samples transformed using (a) Linear regression or Model 1 and (b) Analytical model or Model 2.

trees are sampled, which is prohibitively expensive. Sampling a subset of the soil borings and trees will result in greater uncertainty in estimating the contaminant concentrations while simultaneously reducing the monitoring cost. The Pareto front is found by minimizing both uncertainty and cost. The cost function is given by cost ¼

i ¼ nsoil

j ¼ ntrees

∑ ðSsoil þ Asoil Þisoil þ j∑¼ 1 i¼1

samp

xk, i

¼ 0, if the ith soil boring or tree is not sampled ð7Þ where k is the monitoring scheme design and i is the soil boring or tree location The constraints for this case study provide upper bounds on the number of locations sampled, as follows: 0 e nsoil e soilmax 0 e ntrees e treemax

ðStree þ Atree Þjtree ð5Þ

where Ssoil = sampling cost of 1 soil boring = $70 Asoil = analysis cost of 1 soil boring = $110 nsoil = number of soil borings Stree = sampling cost of 1 tree = $4 Atree = analysis cost of 1 tree =$50 isoil = 1 if the soil boring at location i is sampled and 0 if not jtree = 1 if the tree at location j is sampled and 0 if not ntrees = number of trees Global uncertainty is estimated for monitoring networks using only primary data by computing minimum error estimates of contaminant concentrations at unsampled locations using linear combinations of sampled values (kriging) and the associated estimation variance.68 The estimation variance represents the uncertainty of estimates at unsampled locations and has been used to derive confidence intervals centered on the estimated value.2,3235 This method assumes that the variance is only a function of the monitoring network configuration.6 In this paper, we assume that the design when all monitoring locations are sampled provides the baseline level of uncertainty and the global uncertainty for all other designs is computed relative to the baseline.

¼ 1, if the ith soil boring or tree is sampled

ð8Þ

where soilmax is the maximum number of soil boring locations at the site = 31 and treemax is the maximum number of tree locations at the site = 120

’ RESULTS AND DISCUSSION The results presented here illustrate the impact of secondary data on monitoring network design and the trade-offs between cost and global uncertainty for optimal designs developed for the phytoremediation case study. There are 151 decision variables for this particular case and approximately 5  1045 possible sampling designs. The NSGA-II reduces this sampling space to fewer than 100 possible designs that form the Pareto front. The designs on the Pareto front are optimal in that increasing performance in one objective will decrease performance in the second objective (e.g., reducing cost will increase uncertainty). In this section, we present optimal monitoring network designs that include secondary data and discuss the impact of data transformation models, error estimates and the relative importance of different data types when determining the optimal design. Figure 1(a,b) present the trade-off curves between cost and global uncertainty for the monitoring network designs when the linear model (Model 1, Figure 1a) and analytical model (Model 4850

dx.doi.org/10.1021/es1042657 |Environ. Sci. Technol. 2011, 45, 4846–4853

Environmental Science & Technology 2, Figure 1b) are used to transform secondary data (tree samples) to primary data (soil samples). The design utilizing only soil data to map the contamination is shown in both figures as a reference point, even though it was not found to be an optimal design. As seen in Figure 1(a,b), when tree data are included in the monitoring network design, the estimated uncertainty is reduced by 2040% compared to the design with only soil data, for the same level of expenditure. Alternatively, if the same level of uncertainty as the solution with only soil data is desired, the cost could be reduced by 88100% by sampling trees. While the magnitude of the reductions was a function of the data transformation model used, incorporating secondary data in the design significantly improved the performance of the monitoring network for both models. Optimal designs sample between 7 and 151 locations, including trees and soil borings. The lowest cost design in both cases consisted of seven tree samples and no soil samples while the highest cost design contained all 120 trees and 31 soil borings. As seen in Figure 1a, the maximum reductions in uncertainty with minimal increases in cost are achieved for designs sampling between 65 and 80 trees and 0 and 5 soil borings, indicating that the trees are providing far more useful information than the soil borings. In Figure 1b, the designs that achieved the greatest reductions in uncertainty with the lowest increases in cost comprised 5061 tree samples and 1121 soil borings, indicating that with Model 2 the value of the tree data is a bit lower than in Model 1 due to the higher transformation errors, but still greater than the soil borings. In both cases, increasing the number of sampled locations further did not change the uncertainty significantly, while the cost of monitoring increased. Note that the baseline value for the uncertainty when all of the soil and tree locations are included is approximately 4% for Model 1 and 20% for Model 2. The uncertainty estimate is greater than 0 even when all the locations are sampled because of the inclusion of data transformation errors for the tree locations. The values are higher for Model 2 because the transformation error is greater than Model 1. As shown in Figure 1(b), there are significantly fewer nondominated solutions when Model 2 is used compared to Model 1. The gaps present in Figure 1(b) are not due to a shortcoming of the NSGA-II, but are the result of including data transformation errors when estimating global uncertainty. Including data transformation errors implicitly adds another objective to the problem, since there is a conflict between minimizing the errors from estimating the concentrations at unsampled locations and minimizing the errors from transforming the secondary data. The error associated with estimating the concentrations at unsampled locations is minimized by including more samples, most of which are tree samples. This, in turn, increases the data transformation error. Thus, including additional tree samples does not necessarily minimize the global uncertainty. For example, consider the gap in Figure 1(b) between the design with 61 tree samples and 31 soil samples and the design with 120 tree samples and 31 soil samples. In this case, when the tree samples exceed 61, the reduction in errors at the unsampled locations is offset by the increase in data transformation error. This results in designs that cost between $9,000 and $12,000 but are dominated by designs with fewer samples and lower costs that have the same level of uncertainty. These gaps are less visible in Figure 1(a) as the data transformation error for Model 1 is approximately 5 times smaller than Model 2. Thus, for most designs shown in Figure 1(a), including additional tree samples

ARTICLE

results in a small increase in data transformation error while achieving a significant reduction in the estimation errors at the unsampled locations. The magnitude of the data transformation error also influences the relative importance of primary and secondary data in the optimal monitoring design. As expected, more locations are used to decrease the uncertainty in Figure 1(a,b), regardless of the data transformation model. However, when Model 1 is used, only tree locations are added to the design until an uncertainty level of 28% is reached for the design with 65 trees. At that point, further reductions in uncertainty are only possible by including soil borings in the design. On the other hand, when Model 2 is used, soil borings are necessary in the design in order to achieve an uncertainty level of 40% and lower. Incorporating additional trees in the design resulted in a marginal decrease in uncertainty until an uncertainty level of 30% is reached for the design with 61 trees and 21 soil borings. Subsequently, uncertainty can only be decreased further by including all of the soil borings or all of the tree locations. These results indicate that tree locations are more valuable in developing optimal monitoring designs for Model 1 as opposed to Model 2. Thus, smaller data transformation errors favor secondary data in the monitoring network design compared to primary data. The solutions highlighted in red in the trade-off curves of Figure 1 sample between 34 and 82 locations and are visualized in SI Figures S2S4 to investigate how the choice of the data transformation model and trade-offs in the objectives affect the sampling geometrics. The colors shown in SI Figures S2S4 represent the range of contaminant concentrations, with the dark brown representing the areas with the highest TCE concentrations. Edge effects seen in the contaminant maps are a result of the use of interpolation techniques such as kriging and have been ignored when evaluating the monitoring designs as they are outside of the baseline monitoring network of 151 locations. Designs that use Model 1 are presented in SI Figure S2(a,b) and those that use Model 2 are presented in Figure S3(a-b). A single design is selected from each trade-off front with approximately the same number of locations sampled (71 and 72) and illustrated in SI Figure S4(a, b) to show the impacts of the transformation models. In SI Figure S2(a), the design consists of 34 tree locations and in SI Figure S2(b), 65 trees are sampled. In both cases, the locations sampled are along the boundaries and leading edges of the contamination, which are most important for reducing uncertainty. The uncertainty is further reduced in SI Figure S2(b) compared to SI Figure S2(a) by sampling areas of high concentration and allowing better delineation of the area of contamination. Similar results are obtained in SI Figure S3(a,b) when Model 2 is used instead of Model 1 and the designs comprise 38 and 82 locations, respectively. In this case, sampled locations are also found along boundaries and leading edges of the contaminant map. This suggests that the spatial configuration of the monitoring network design is similar with regard to the areas sampled, irrespective of the data transformation model used. This result has implications for sites where accurate, sitespecific models with low transformation errors are not always available to convert secondary data to primary data. Phytoremediation systems are a classic example of such situations. At most phytoremediation sites, relatively few soil borings and groundwater monitoring wells are colocated with the trees. Thus, developing a site-specific linear relationship between primary and secondary data with a low transformation error is not always 4851

dx.doi.org/10.1021/es1042657 |Environ. Sci. Technol. 2011, 45, 4846–4853

Environmental Science & Technology possible. However, the analytical model (Model 2) presented here can be used to transform tree data to soil or groundwater data, albeit with a larger transformation error. Our results suggest that even if the only data transformation model available is Model 2, an effective monitoring design can be implemented by using the secondary tree data. The two designs illustrated in SI Figure S4(a,b) are selected from the trade-off curves of Figure 1 such that the total number of locations sampled is almost the same, but the ratio of tree and soil samples are different. In SI Figure S4(a), the design is obtained using Model 1 and consists of 66 trees and five soil samples and in SI Figure S4(b), the design comprises 54 trees and 18 soil samples using Model 2. The spatial distribution of the network is again similar for both designs, with locations sampled along contaminant boundaries and leading edges. However, the design in SI Figure S4(a) has a smaller uncertainty estimate compared to the design in SI Figure S4(b), primarily because of the lower value of the data transformation error for Model 1. The effect of the data transformation error is observed in the choice of soil or tree samples when trees are located next to soil borings. In general, the network shown in SI Figure S4(b) prefers to sample the soil boring as opposed to SI Figure S4(a) where the tree location is preferred. This result suggests that when the data transformation error is significant, primary data are preferred over secondary data when they are colocated. In areas where primary data are unavailable, the choice of the data transformation model did not change the secondary data that are sampled in the network, but did impact the value of the uncertainty estimate. Abbaspour et al.27 proposed the use of a combined data set at sites where primary data are insufficient for cokriging. Our results indicate that in such a situation, a monitoring network can be designed using both primary and secondary data to better map the contamination. However, this approach has been tested at present for cases where there is a strong correlation between primary and secondary data. The impacts on monitoring network design when there is a weak correlation would need to be further investigated. A second drawback of this method is that errors in variogram estimation from using data of different quality are not considered. The impact on monitoring network design from including these errors is of interest and could be evaluated in future research by incorporating data transformation error as measurement error in variogram estimates. Additionally, the possible difference in support structure between primary and secondary data as illustrated here for soil and tree samples have not been considered here. Trees have roots that explore the area around them and hence are different from soil samples that are closer to point samples. The impact on geostatistical techniques and the choice of optimal monitoring network designs because of this difference would need to be evaluated in the future. Further, our results have focused on the case where the objective is global uncertainty. At many sites, the objective may be local uncertainty and the impact of different objectives on monitoring network design using primary and secondary data would need to be investigated. Also, the monitoring network in this paper has been optimized for data collected in a single time period. At sites where contaminant transport is a concern and multiyear monitoring is necessary, periodic evaluation of the optimal monitoring network design may be necessary. Additionally, tree branch samples were available at ground level at this site, which minimized contaminant losses due to diffusion, but this may not always be the case at sites with mature trees. At such

ARTICLE

sites, further research should be undertaken to identify whether tree core samples may be necessary as secondary data sources. In summary, our results provide clear evidence that including secondary data in monitoring network design for real-world sites has the potential to significantly reduce monitoring costs by incorporating secondary data in designing monitoring networks. Optimal solutions found at this site indicate that increasing secondary data sampling results in a significant decrease in global uncertainty with a minimal increase in cost. The technique used in this study to combine secondary and primary data into a single data set through data transformation models also shows promise for sites with sparse primary data in that information from both data sources can be utilized with minimal effort. The choice of the data transformation model has an impact on the optimal designs and also affected uncertainty estimations. We found that the relative importance of different data types (i.e., primary versus secondary data) depends on the objectives chosen for the site as well as the data transformation model. While these results show promise in developing more effective monitoring designs at real-world sites, methods to incorporate the data transformation errors from variogram estimates and differences in support structures between data types, and hence the impacts on monitoring network design, require further investigation. Further validation of this method using groundwater data would also be valuable for practitioners.

’ ASSOCIATED CONTENT

bS

Supporting Information. Additional information including four figures. This material is available free of charge via the Internet at http://pubs.acs.org.

’ AUTHOR INFORMATION Corresponding Author

*Phone: 630-252-7051; fax: 630-252-9281; e-mail: ggopalakrishnan@ anl.gov. Notes †

Now at Energy Systems Division, Argonne National Laboratory

’ ACKNOWLEDGMENT We thank Dr. M. Cristina Negri at Argonne National Laboratory for her assistance in obtaining the cost and contaminant data at the field site and in providing the field manager’s perspective in setting the objectives at the site. Funding from Argonne EQO for site monitoring and from the U.S. Department of Energy (Grant No. DE-FG07-02ER 635302) is gratefully acknowledged. ’ REFERENCES (1) Loaiciga, H; Charbeneau, R. J.; Everett, L. G.; Fogg, G. E.; Hobbs, B. F.; Rouhani, S. Review of ground-water quality monitoring network design. J. Hydraul. Eng. 1992, 118 (1), 11–37. (2) Reed, P.; Minsker, B. Striking the balance: Long-term groundwater monitoring design for multiple conflicting objectives. J. Water Res. Plann. Manage. 2004, 150, 141–148. (3) Reed, P.; Minkser, B.; Goldberg, D. E. A multiobjective approach to cost effective long-term groundwater monitoring using an elitist nondominated sorted genetic algorithm with historical data. J. Hydroinformatics 2001, 3, 71–90. (4) Zhou, Y. Sampling frequency for monitoring the actual state of groundwater systems. J. Hydrol. 1996, 180, 301–31. 4852

dx.doi.org/10.1021/es1042657 |Environ. Sci. Technol. 2011, 45, 4846–4853

Environmental Science & Technology (5) Everett, L. G. Groundwater Monitoring; General Electric Technology Marketing Operations. Schenectady, NY, 1980. (6) Goovaerts, P. Geostatistics for Natural Resource Evaluation; Oxford University Press: New York, 1997. (7) Kitanidis, P. K. Introduction to Geostatistics with Applications in Hydrogeology; Cambridge University Press, New York, NY; 1997. (8) Chiles, J. P.; Delfiner; P. Geostatistics: Modeling Spatial Uncertainty; Wiley Series in Probability and Statistics, John Wiley & Sons, Inc.: New York, NY; 1999. (9) ASCE Task Committee on Long-term Groundwater Monitoring Design. Long-Term Groundwater Monitoring: The State of the Art: American Society of Civil Engineers, Reston, VA; 2003. (10) ASCE Task Committee on Geostatistical Techniques. Review of geostatistics in geohydrology. I: Basic concepts. J. Hydraul. Eng. 1990, 116(5), 612632. (11) Johnson, V. M.; Tuckfield, R. C.; Ridley, M. N.; Anderson, R. A. Reducing the sampling frequency of groundwater monitoring wells. Environ. Sci. Technol. 1996, 30, 355–358. (12) Aziz, J. J.; Newell, C. J.; Rifai, H. S.; Ling, M.; Gonzales, J. R. Monitoring and Remediation Optimization System (MAROS): Software User’s Guide, Version 1; United States Air Force Center for Environmental Excellence, Brooks AFB: San Antonio, TX; 2000. (13) Cameron, K.; Hunter, P. Optimization of LTM networks: Statistical approaches to spatial and temporal redundancy. In Proceedings of the American Institute of Chemical Engineers, 2000 Spring National Meeting, Remedial Process Optimization Topical Conference, Atlanta, 2000 (14) Reed, P.; Minsker, B.; Valocchi, A. J. Cost effective long-term groundwater monitoring design using a genetic algorithm and global mass interpolation. Water Resour. Res. 2000, 36, 3731–3741. (15) Isaaks, E. H.; Srivastava, R. M. An Introduction to Applied Geostatistics; Oxford University Press: New York; 1989. (16) Passarella, G.; Vurro, M.; D’Agostino, V.; Barcelona, M. J. Cokriging optimization of monitoring network configuration based on fuzzy and non-fuzzy variogram evaluation. Environ. Monit. Assess. 2003, 82 (1), 1–21. (17) Hubbard, S.; Chen, J.; Peterson, J.; Majer, E.; Williams, K.; Swift, D.; Bailouz, B.; Ruben, Y. Hydrogeological characterization of the South Oyster Bacterial Transport Site using geophysical data. Water Resour. Res. 2001, 37 (10), 2431–2456. (18) Istok, D. J.; Smyth, J. D.; Flint, A. L. Multivariate geostatistics analysis of groundwater contamination: A case history. Ground Water 1993, 31 (1), 63–74. (19) Vroblesky, D. T.; Neitch, C. T.; Morris, J. T. Chlorinated ethenes from groundwater in tree trunks. Environ. Sci. Technol. 1999, 33, 510–515. (20) Struckhoff, G. C.; Burken, J. G.; Schumacher, J. G. Vapor-phase exchange of perchloroethene between soil and plants. Environ. Sci. Technol. 2005, 39, 1563–1568. (21) Gopalakrishnan, G.; Negri, M. C.; Minsker, B. S.; Werth, C. J. Monitoring subsurface contamination using tree branches. Ground Water Monit. Rem. J. 2007, 27 (1), 65–74. (22) Deutsch, C. V.; Journel, A. G. GSLIB: Geostatistical Software Library and User’S Guide; Oxford University Press: New York, 1992. (23) Ahmadi, S. H.; Sedghamiz, A. Application and evaluation of kriging and cokriging methods on groundwater depth mapping. Environ. Monit. Assess. 2008, 138, 357–368. (24) Journel, A. G.; Juijbregts, C. J. Mining Geostatistics; Academic Press: New York; 1978. (25) Benjemaa, F.; Marino, M. A.; Loaiciga, H. A. Multivariate geostatistical design of groundwater monitoring networks. J. Water Res. Plann. Manage. 1994, 120 (4), 505–522. (26) Pan, G.; Moss, K.; Heiner, T.; Carr, J. R. A fortran program for a three dimensional cokriging with case demonstration. Comp. Geosci. 1992, 18 (5), 557–578. (27) Abbaspour, K. C.; Schulin, R.; Van Genuchten, M. T.; Schlappi, E. An alternative to co-kriging for situations with small sample sizes. Math. Geol. 1998, 30 (3), 259–274. (28) Goovaerts, P. Geostatstical modeling of uncertainty in soil science. Geoderma 2001, 103, 3–26.

ARTICLE

(29) Quinn, J. J.; Negri, M. C.; Hinchman, R. R.; Moos, L. M.; Wozniak, J. B.; Gatliff, E. G. Predicting the effect of deep-rooted hybrid poplars on the groundwater flow system at a large-scale phytoremediation site. Int. J. Phytorem. 2001, 3, 41–60. (30) Deb, K. Multi-Objective Optimization using Evolutionary Algorithms; John Wiley & Sons LTD: New York, NY, 2001. (31) Saito, H.; Goovaerts, P. Geostatistical Interpolation of positively skewed and censored data in a dioxin-contaminated site. Environ. Sci. Technol. 2000, 34, 4228–4235. (32) Christakos, G.; Olea, R. A. A multiple-objective optimal exploration strategy. Math. Comput. Model. 1988, 11, 413–418. (33) Olea, R. A. Sampling design optimization for spatial functions. Math. Geol. 1984, 16 (4), 365–391. (34) Yfantis, E. A.; Flatman, G. T.; Behar, J. V. 1987 Efficiency of kriging estimation for square, triangular, and hexagonal grids. Math. Geol. 1987, 19(3), 183205. (35) Minsker, B. Genetic algorithms. In Hydroinformatics: Data Integrative Approaches in Computation, Analysis, and Modeling; P. Kumar, Ed,; CRC Press: Boca Raton, FL, 2005; ISBN 0849328942.

4853

dx.doi.org/10.1021/es1042657 |Environ. Sci. Technol. 2011, 45, 4846–4853