Article pubs.acs.org/est
Application of Stochastic Models in Identification and Apportionment of Heavy Metal Pollution Sources in the Surface Soils of a Large-Scale Region
Yuanan Hu and Hefa Cheng*
State Key Laboratory of Organic Geochemistry, Guangzhou Institute of Geochemistry, Chinese Academy of Sciences, Guangzhou 510640, China
ABSTRACT: As heavy metals occur naturally in soils at measurable concentrations and their natural background contents have significant spatial variations, identification and apportionment of heavy metal pollution sources across large-scale regions is a challenging task. Stochastic models, including the recently developed conditional inference tree (CIT) and the finite mixture distribution model (FMDM), were applied to identify the sources of heavy metals found in the surface soils of the Pearl River Delta (PRD), China, and to apportion the contributions from natural background and human activities. Regression trees were successfully developed for the concentrations of Cd, Cu, Zn, Pb, Cr, Ni, As, and Hg in 227 soil samples from a region of over 7.2 × 10⁴ km² based on seven specific predictors relevant to the source and behavior of heavy metals: land use, soil type, soil organic carbon content, population density, gross domestic product per capita, and the lengths and classes of the roads surrounding the sampling sites. The CIT and FMDM results consistently indicate that Cd, Zn, Cu, Pb, and Cr in the surface soils of the PRD were contributed largely by anthropogenic sources, whereas As, Ni, and Hg in the surface soils mostly originated from the soil parent materials.
■
INTRODUCTION
Heavy metals and their compounds are naturally ubiquitous throughout the environment. Heavy metals in soils originate primarily from weathering of the parent materials but can also result from accumulation of the metals released from industrial and agricultural activities. According to Alloway,1 the major anthropogenic inputs of heavy metals to soils and the environment are metalliferous mining and smelting, agricultural and horticultural materials, sewage sludges, fossil fuel combustion, metallurgical industries, electronics, chemical and other manufacturing industries, waste disposal, sports shooting and fishing, and warfare and military training. Globally, mine tailings, smelter emissions, waste incineration, and atmospheric deposition are the most important sources of heavy metal pollution, whereas vehicular emissions are widely accepted as the main sources of heavy metals in urban topsoils and dust.2 In China, the major sources of heavy metal pollution are emissions from industrial operations, whereas wastewater irrigation and land fertilization with sludges contribute most to the heavy metal pollution of agricultural soils.3

China has undergone rapid transformation from a traditionally agriculture-based economy to a strongly manufacturing-based economy since the economic reform that began in the late 1970s.4 The establishment of industrial operations and fast urban expansion have drastically increased industrial and municipal wastewater discharges and other pollutant emissions nationwide.5 The lack of pollution controls and poor enforcement of environmental laws and regulations further exacerbated the widespread environmental pollution.3,6,7

The Pearl River Delta (PRD), located in southern China’s Guangdong Province, which used to be cultivated intensively, has been transformed into a major manufacturing hub and one of the most densely urbanized regions in the world over the past three decades. Studies have found significant enrichment of trace metals in the recently deposited sediments8−10 and in the aquatic organisms of the Pearl River Estuary.11 Like sediment, soil is a major sink of heavy metals released from anthropogenic sources,12 so it is important to understand the status of heavy metal pollution in the surface soils of the PRD.

One of the key prerequisites in developing policies on pollution control, soil remediation, and environmental management is an accurate assessment of the contributions and impacts of the anthropogenic sources.13,14 However, the lack of large-scale and long-term monitoring data often makes the apportionment of contributions from different sources a

Received: October 22, 2012
Revised: March 9, 2013
Accepted: March 15, 2013
Published: March 15, 2013
© 2013 American Chemical Society | dx.doi.org/10.1021/es304310k | Environ. Sci. Technol. 2013, 47, 3752−3760
Environmental Science & Technology
Data Collection and Preparation. Four groups of predictors were used to evaluate the sources of heavy metals and their contributions: (1) land use, which is an important marker of human activities; (2) soil properties, including TOC and soil type (which refers to the different sizes of mineral particles in the soil and is related to the parent material); (3) local socioeconomic conditions that influence the bulk emissions of various pollutants from industrial and vehicular sources, including population density and gross domestic product (GDP) per capita; and (4) road conditions, which are represented by the lengths and classes of the roads surrounding the sampling sites. Three of these seven predictors (land use, soil type, and road class) were categorical variables. The land use type was classified into six categories based on the dominant land use pattern: agricultural land, industrial area, waste disposal/treatment site, urban area, forest land, and drinking water source protection area. The soil types were identified as ferric acrisols (ACf), haplic acrisols (ACh), humic acrisols (ACu), cumulic anthrosols (ATc), calcaric cambisols (CMc), eutric fluvisols (FLe), and haplic solonchaks (SCh) by matching the sampling site locations with the Harmonized World Soil Database.34 The roads were classified as primary or secondary roads and highways; tracks, trails, or footpaths; and connectors. A buffer zone of 5 km radius was generated for each sampling site, and the total length of the roads within the zone and their main classes were identified based on the region’s roadmap using ArcGIS. The population density and the GDP per capita were obtained from the relevant statistical yearbook and census data.35

Modeling Methodology. The relationship between the predictors and the heavy metal contents in the soil samples was described by the CIT, which embeds recursive binary partitioning into a well-defined framework of conditional inference procedures.28 The conditional distribution of the response variable Y is a function of the m-dimensional covariate X = (X1, X2, ..., Xm):
challenging task. To this end, stochastic models can serve as important tools for the identification and apportionment of pollution sources. The finite mixture distribution model (FMDM), which can help derive the properties of the subpopulations in a pooled population,15 has been used to establish the criteria of heavy metal concentrations and to differentiate the effects of natural background and human activities.14,16 Interpretation of complex spatial patterns of ecological data and remote sensing data by decision tree methods has gained significant attention over the past decade.17,18 Nonetheless, only a few studies have used such methods in assessing soil pollution by heavy metals and persistent organic pollutants.19−21 Compared to other statistical tools such as linear regression and discriminant analysis, decision trees offer several major advantages, including flexibility in handling different types of response variables, the capability of identifying complicated relationships without strong model assumptions, robustness with respect to outliers, the ability to deal with missing values, and easy interpretation of results.17,22 Classic decision trees such as the classification and regression tree (CART)23 and “C5”24 are constructed by selecting the best split, measured by the Gini index or information gain, through an exhaustive search over all possible variables to split and all possible places for a split. These algorithms suffer from two major limitations: overfitting,25 and biased variable selection, that is, a preference for continuous variables and for variables with more categories.26,27 Although pruning based on cross-validation can be applied to remove the overfitting to the training data, the biased variable selection significantly complicates the interpretation of the decision trees.28,29

In the present study, a recently developed tree method, the conditional inference tree (CIT), was used to identify the important factors responsible for the distribution of heavy metals (Cr, Ni, Cu, Zn, As, Cd, Hg, and Pb) in the surface soils of the PRD. CIT can handle different types and scales of predictors while overcoming the problems of overfitting and biased variable selection.28 Random forest, which is an efficient algorithm for both classification and regression trees based on bootstrap aggregation,30 was also implemented to evaluate the importance of the predictors. A random forest consisting of CITs based on subsampling without replacement was used in this study to guarantee unbiased and reliable variable selection.31 A conditional permutation scheme was also adopted in calculating the variable importance scores to reduce the preference for correlated predictor variables.32 In the cases that could not be well explained by the CIT, FMDM analysis was implemented as a supplementary tool to help distinguish the contributions of natural background and anthropogenic sources.
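The idea of separating variable selection from split-point selection can be illustrated with a minimal sketch. It is not the CIT algorithm of ref 28: it assumes numeric predictors only, replaces the conditional inference tests with a Monte Carlo permutation test, and chooses the cut point by squared error. All function names (`association`, `permutation_p_value`, `best_cut`, `select_and_split`) are hypothetical.

```python
import random


def association(x, y):
    """Absolute centred cross-product between a predictor and the response."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return abs(sum((a - mx) * (b - my) for a, b in zip(x, y)))


def permutation_p_value(x, y, n_perm, rng):
    """Monte Carlo p-value for the null hypothesis that x and y are independent."""
    observed = association(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if association(x, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_cut(x, y):
    """Cut point on x that minimises the summed squared error of the two children."""
    def cost(cut):
        left = [b for a, b in zip(x, y) if a <= cut]
        right = [b for a, b in zip(x, y) if a > cut]
        return sse(left) + sse(right)
    return min(sorted(set(x))[:-1], key=cost)


def select_and_split(predictors, y, alpha=0.05, n_perm=999, rng=None):
    """Step 1: pick the predictor with the smallest Bonferroni-adjusted p-value.
    Step 2: only if that p-value is significant, search for the cut point."""
    rng = rng or random.Random(0)
    m = len(predictors)
    adjusted = {name: min(1.0, m * permutation_p_value(x, y, n_perm, rng))
                for name, x in predictors.items()}
    name = min(adjusted, key=adjusted.get)
    if adjusted[name] > alpha:
        return None  # stop splitting: no predictor is significantly associated
    return name, best_cut(predictors[name], y)
```

Because the predictor is chosen by a p-value rather than by the best achievable split, a variable with many possible cut points gains no advantage at the selection step, and the significance threshold acts as a stopping rule in place of pruning.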
$$D(Y \mid X) = D(Y \mid X_1, X_2, \ldots, X_m) = D(Y \mid f(X_1, X_2, \ldots, X_m)) \tag{1}$$
where f (X1, X2, ..., Xm) is a function of the covariate. The recursive binary partitioning algorithm for a given population ϕn with n samples (ϕn = {(Yi, X1i, ..., Xmi); i = 1, 2, ..., n}) was constructed using the corresponding case weight vector (W = (w1, w2, ..., wn)). Each node of the tree was represented by a case weight vector W, which had nonzero integers for the corresponding observations that are elements of the node and had zeros for those not. The variable selection and splitting procedures were completed separately to minimize the bias. The general partitioning procedure is summarized in the Supporting Information. To assess the importance of the available predictors for estimating the heavy metal concentrations in the surface soils, their variable importance scores were calculated by means of random conditional inference forests using the heavy metal concentrations as the response variables. A random forest is a classifier consisting of a collection of binary classification or regression trees constructed by generating bootstrap samples from the original training set and fitting a tree to the generated samples using a variable randomly selected at each node (Supporting Information). In the random forest framework, the variable importance was quantified by measuring the difference in the prediction accuracy before and after permuting Xj averaged over all the trees,30 which has the form of:
■
MATERIALS AND METHODS
Sample Collection and Chemical Analyses. The study region (Figure S1 of the Supporting Information) is located between the latitudes of 21°40′0″ N and 24°0′0″ N and between the longitudes of 112°0′0″ E and 115°30′0″ E in the PRD, covering more than 7.2 × 10⁴ km² of land surface. Samples of surface soils (0−10 cm depth) were collected from 227 sites. The contents of heavy metals (Cr, Ni, Cu, Zn, As, Cd, Hg, and Pb) in the soils were determined after microwave-assisted digestion following the procedures of a previous study.33 Their total organic carbon (TOC) contents were measured with a Vario EL Cube elemental analyzer (Elementar, Germany).
Table 1. Basic Statistics of the Contents of Heavy Metals Measured in the Surface Soils of the PRD (n = 227) and Their Classifications
The last four columns give the no. of samples in each class of soil quality.a

| metal | mean (mg/kg) | std dev (mg/kg) | min (mg/kg) | median (mg/kg) | max (mg/kg) | I | II | III | exceeding III |
|---|---|---|---|---|---|---|---|---|---|
| Cr | 67.2 | 79.4 | 2.91 | 55.4 | 1007 | 186 | 31 | 6 | 4 |
| Ni | 26.0 | 62.1 | N/D | 16.1 | 849 | 201 | 7 | 17 | 2 |
| Cu | 49.9 | 280 | 1.69 | 23.4 | 4215 | 163 | 29 | 33 | 2 |
| Zn | 115 | 205 | 9.79 | 72.2 | 2515 | 155 | 52 | 16 | 4 |
| As | 19.4 | 36.0 | 1.82 | 11.2 | 294 | 148 | 51 | 16 | 12 |
| Cd | 0.22 | 0.29 | 0.02 | 0.14 | 3.06 | 152 | 27 | 42 | 6 |
| Hg | 0.07 | 0.10 | N/D | 0.04 | 0.96 | 207 | 15 | 5 | 0 |
| Pb | 51.4 | 48.1 | 4.95 | 42.4 | 464 | 92 | 133 | 2 | 0 |

a According to the soil quality standards in China (GB 15618−1995), class I is defined as unpolluted, classes II and III are slightly and moderately polluted, respectively, whereas exceeding the threshold of class III indicates heavy pollution; N/D: not detected.
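The class counts in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below copies the four class columns from the table and derives, for each metal, the number and percentage of samples above the class II standard (class III plus exceeding III). The names `CLASS_COUNTS`, `exceeding_class_ii`, and `percent` are illustrative only.

```python
N_SAMPLES = 227

# metal: (class I, class II, class III, exceeding III), copied from Table 1
CLASS_COUNTS = {
    "Cr": (186, 31, 6, 4),
    "Ni": (201, 7, 17, 2),
    "Cu": (163, 29, 33, 2),
    "Zn": (155, 52, 16, 4),
    "As": (148, 51, 16, 12),
    "Cd": (152, 27, 42, 6),
    "Hg": (207, 15, 5, 0),
    "Pb": (92, 133, 2, 0),
}


def exceeding_class_ii(metal):
    """Samples above the class II standard: class III plus exceeding III."""
    _, _, class_iii, above_iii = CLASS_COUNTS[metal]
    return class_iii + above_iii


def percent(count):
    """Share of the 227 samples, rounded to one decimal place."""
    return round(100.0 * count / N_SAMPLES, 1)


if __name__ == "__main__":
    for metal, counts in CLASS_COUNTS.items():
        assert sum(counts) == N_SAMPLES  # every row should account for all samples
        n = exceeding_class_ii(metal)
        print(f"{metal}: {n} samples ({percent(n)}%) above class II")
```

Each row sums to 227, and the 133 Pb samples in class II correspond to 58.6% of the samples.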
where μi and σi are the mean and standard deviation of the Pi(y|θi), respectively. The parameters αi, μi, and σi for each individual component were estimated by the method of maximum likelihood using the Newton-type and the expectation maximization algorithms. The single, double, and triple log−normal distribution models were examined for all data sets, and the double log−normal distribution model was found to work the best. That is, the heavy metal concentrations in the surface soils could be adequately described by two-mixture distributions, and the respective PPDF could be treated as a mixture of observations of background areas and polluted areas. FMDM analysis of the soil heavy metal concentration data was also performed in the R software environment.

$$\mathrm{VI}(X_j) = \frac{1}{N_{\mathrm{tree}}} \sum_{t=1}^{N_{\mathrm{tree}}} \left[ E_t(X_j) - E_t \right] \tag{2}$$
where VI(Xj) is the variable importance for Xj, Ntree is the number of trees, Et is the prediction error of a single tree t on the associated out-of-bag samples, and E t (X j ) is the corresponding prediction error with the variable Xj randomly permuted. The errors were calculated as the mean square error for regression and misclassification rate for the classification models. A combination of univariate analysis, Mahalanobis distance method, and the pruning algorithm were used to detect and remove high-leverage data points to construct robust decision trees.36−38 The CITs and random forests were implemented in the R software environment. Because the CIT algorithm does not have the problem of overfitting and has the prediction accuracy equivalent to those of optimally pruned trees obtained with other decision tree techniques,28 the minimum node size requirement was set at 1% of sample population when implementing the CITs. The results showed that over 3% of the observations were sent into each of the two daughter nodes at each split in all trees. FMDM analysis was used to identify the population probability density functions (PPDFs) of the background areas and polluted areas for Cr, Ni, As, and Hg in the surface soils. The observed heavy metal concentrations in the selected samples were treated as a finite mixture of multiple unimodal distributions. The general form of the mixture distribution model is:
■
RESULTS AND DISCUSSION
1. Descriptive Statistics of Heavy Metal Pollution. Table 1 shows the basic descriptive statistics for heavy metal contents in the surface soils of the PRD and summarizes the number of samples in different classes of soil quality according to the Environmental Quality Standard for Soils in China (GB 15618−1995). Cd, Cu, As, Zn, and Ni were the major heavy metals of concern found in the surface soils. The class II standards were exceeded in 48 samples (21.1%) for Cd, 35 samples (15.4%) for Cu, 28 samples (12.3%) for As, 20 samples (8.8%) for Zn, and 19 samples (8.4%) for Ni. Only 10 samples had Cr concentrations and 5 samples had Hg concentrations exceeding the class II standards. Meanwhile, Pb pollution appeared to be the most widespread, with 133 samples (58.6%) falling in class II (slightly polluted), even though only 2 samples fell in class III and none exceeded the class III standard (Table 1). This was probably due to the fact that atmospheric Pb emitted by vehicles and industrial sources is predominantly present in submicrometer aerosols that can be transported over long distances.39 Overall, the elevated levels of heavy metals in the surface soils indicate that heavy metal pollution is a significant environmental concern in the PRD.

In general, the contents of heavy metals in a soil are determined by the inputs from several sources, including parent material, atmospheric deposition, agrochemicals, organic wastes, and other inorganic pollutants, along with the losses of metals removed in crop materials and through leaching and volatilization processes.1 Because heavy metals occur naturally in soils at measurable concentrations, it is not possible to assess the contributions from anthropogenic sources based on concentration measurements alone. Spatial distribution maps (Figure S2 of the Supporting Information) could reveal the
$$P(y \mid \theta) = \sum_{i=1}^{k} \alpha_i P_i(y \mid \theta_i) \tag{3}$$
where P(y|θ) is the probability density function of the random variable y, k is the number of components, θi is the set of parameters defining the Pi(y|θi) of the ith component, and αi is the corresponding mixing weight and satisfies:

$$\sum_{i=1}^{k} \alpha_i = 1 \quad (0 \le \alpha_i \le 1) \tag{4}$$
P(y|θ) could denote any mixture distribution comprised of natural and anthropogenic contributions, and its components were identified by fitting the log−normal model to the heavy metal concentration data:

$$P_i(y \mid \theta_i) = \frac{1}{\sqrt{2\pi}\, y \sigma_i} \exp\!\left[ -\frac{(\ln y - \mu_i)^2}{2\sigma_i^2} \right], \quad y > 0 \tag{5}$$
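A two-component version of the mixture in eqs 3−5 can be fitted with a plain expectation maximization loop, sketched below. This is an illustration, not the authors' R procedure: it uses EM only (no Newton-type step), works on ln y (where each log−normal component becomes a normal one, so the responsibilities are unchanged), and returns log-scale parameters. The function names and the median-split initialization are assumptions made for this sketch.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal density, used on ln(y) for each log-normal component."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_lognormal_mixture(y, n_iter=200):
    """EM estimates of (alpha, mu, sigma) for a two-component log-normal mixture.

    mu and sigma are on the log scale, as in eq 5. Component 0 starts from the
    lower half of the sorted data and component 1 from the upper half.
    """
    x = sorted(math.log(v) for v in y)
    half = len(x) // 2
    mu = [sum(x[:half]) / half, sum(x[half:]) / (len(x) - half)]
    spread = (x[-1] - x[0]) / 4.0 or 1.0
    sigma = [spread, spread]
    alpha = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation
        resp = []
        for xi in x:
            w = [alpha[k] * normal_pdf(xi, mu[k], sigma[k]) for k in range(2)]
            total = sum(w) or 1e-300
            resp.append([wk / total for wk in w])
        # M-step: weighted updates of the mixing weights, means, and SDs
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            alpha[k] = nk / len(x)
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            var = sum(r[k] * (xi - mu[k]) ** 2 for r, xi in zip(resp, x)) / nk
            sigma[k] = max(math.sqrt(var), 1e-6)
    return alpha, mu, sigma
```

Calling `fit_two_lognormal_mixture(concentrations)` on a list of positive concentrations returns the two mixing weights (which sum to 1, as required by eq 4) together with the log-scale means and standard deviations of the lower and upper components.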
Figure 1. Regression trees for (a) Cd, (b) Zn, and (c) Cu. In the CITs, n is the number of sampling sites, and the metal concentrations are shown in the unit of mg/kg.
possible sources for some heavy metals based on the apparent overlaps of hot spots of metal pollution with industrial and urban centers (Supporting Information). Nonetheless, such an approach was qualitative at the best, whereas stochastic models could quantitatively evaluate the contributions to heavy metal pollution from different sources. As a result, CIT and FMDM analyses were performed to identify the sources of heavy metals found in the surface soils and to apportion the contributions from natural background and human activities. 2. Heavy Metals Showing Strong Human Impact: Cd, Zn, Cu, Pb, and Cr. Parts a−c of Figure 1 show the details of CITs for Cd, Zn, and Cu. For these metals, the most important splitting factor of the root nodes was the land use type indicating strong contributions from anthropogenic sources. The right branches with lower metal levels contained the soil samples collected from the areas affected weakly by human activities (i.e., drinking water source protection areas and forest lands) that could represent the distributions of these metals in the natural background. As expected, the surface soils from the areas affected more strongly by human activities (the left branches), including industrial areas, urban areas, waste disposal/treatment sites, and agricultural lands, had significantly more Cd, Zn, and Cu compared to the background areas. The CIT for Cd consisted of 5 nodes including 3 terminal nodes (part a of Figure 1). The mean concentration of Cd in the left branch (0.27 mg/kg) was more than two times higher than that in the right one (0.11 mg/kg), which suggests that
human activities in the PRD had resulted in significant soil pollution by Cd. The next splitting factor of the left branch was the type of soil. The node containing the calcaric cambisols and eutric fluvisols had a mean concentration of Cd (0.53 mg/kg) that was more than twice that (0.25 mg/kg) of the node composed of the ferric acrisols, haplic acrisols, cumulic anthrosols, and haplic solonchaks. Soil parent material determines the contents of metals initially present in the soil even though they may be leached out during the weathering process. As a result, the type of soil can make a significant difference in terms of the contents of some heavy metals.

The regression tree of Zn (part b of Figure 1) shows similar patterns to that of Cd. The soils from the drinking water source protection areas and forest lands (the right branch) showed a relatively low mean Zn concentration (55.4 mg/kg). The first splitting factor of the left branch was the soil type. The mean Zn concentration (249.4 mg/kg) in the terminal node of the calcaric cambisols and eutric fluvisols was approximately two times higher than those in the other types of soils. The last two terminal nodes were separated by the soil TOC. Higher concentrations (mean, 145.6 mg/kg) belonged to the ferric acrisols, haplic acrisols, cumulic anthrosols, and haplic solonchaks with TOC contents above 2.0%, whereas those with lower TOC contents had a mean Zn concentration of 98.0 mg/kg. Soil organic matter consists of dead plant and animal tissues in various stages of decomposition, and its content depends on climatic conditions, vegetative cover, topographical
Figure 2. Contributions from natural background and anthropogenic inputs to soil Pb and Cr in the PRD: (a) Regression tree for Pb, (b) regression tree for Cr, (c) FMDM model fit for Cr contents in the non-acrisols, and (d) FMDM model fit for Cr contents in the acrisols. In the CITs, n is the number of sampling sites, and the metal concentrations are shown in the unit of mg/kg.
position, and soil texture, with productive agricultural soils typically containing 3−6% of organic matter. Soils with higher TOC contents were probably influenced more by human activities, and were more efficient at retaining heavy metals,40 resulting in greater accumulation of them.

With 11 nodes including 6 terminal nodes, the CIT of Cu (part c of Figure 1) is more complicated than those of Cd and Zn and shows a more pronounced effect of human activities. The right branch of the regression tree was split into two terminal nodes representing the soils from the drinking water source protection areas and forest lands with mean Cu concentrations of 10.6 and 20.2 mg/kg, respectively. The left branch consisted of the soils from the agricultural lands, industrial areas, urban areas, and waste disposal/treatment sites, which had higher Cu concentrations than the right one. The soils in the left branch were split first according to the TOC content. Samples containing more than 4.24% of organic matter were separated to terminal node 8, and they had the highest mean Cu concentration (71.2 mg/kg) among all of the nodes. The next splitting factor was the GDP per capita. Higher GDP per capita corresponded to more developed local economy and consequently higher emissions of pollutants (including heavy metals) in general. That is why terminal node 7, which consisted of the soils from the areas with GDP per capita greater than 87 040 Yuan, showed a relatively high mean Cu concentration (44.7 mg/kg). The population density as the last splitting variable separated the soils in the areas with population densities greater than 1163 per km2, which had a mean Cu concentration (43.7 mg/kg) two times higher than that (21.8 mg/kg) of the soils from less densely populated areas. The urban areas with denser population were typically concentrated with more vehicles and industrial emission sources releasing more heavy metals to the local soils. The strong dependence of soil Cu concentrations on socioeconomic conditions demonstrates that the anthropogenic inputs were the major factor controlling the distribution of Cu in the surface soils of the PRD.

The regression tree of Pb with 4 terminal nodes (part a of Figure 2) resulted from the statistical analysis of 227 soil samples. The main decision factor was the land use type, which separated the urban soils and waste disposal/treatment site soils with relatively higher Pb concentrations from the others. Pb is the most widely scattered toxic metal on earth solely due to anthropogenic activities, and a wide range of anthropogenic sources, mainly mining, smelting, industrial uses, waste incineration, coal burning, and leaded gasoline, are responsible for lead pollution in the environment.39 In the PRD, vehicular emissions are one of the major sources of lead pollution, which explains why heavier Pb pollution occurred in soils from the urban areas besides waste disposal/treatment sites. In the left branch, the next decision factor was the soil TOC. The soils from the urban areas and waste disposal/treatment sites with TOC greater than 3.94% had a mean Pb concentration (206.8 mg/kg) more than three times higher than that (61.2 mg/kg) in the soils with lower TOC contents. Unlike agricultural soils, urban soils and soils from waste disposal/treatment sites typically have low organic matter contents. The occurrence of high TOC contents (>3.94%) probably resulted from past contamination by sludges and wastewaters, which also introduced Pb to the soils. In the right branch, GDP per capita was the next decision factor. Soils with lower Pb concentrations (mean, 38.3 mg/kg) were found in node 6 characterized by lower GDP per capita (≤117 739 Yuan). Greater emissions from industrial sources and vehicles are expected in the areas with more developed local economy (i.e., higher GDP per capita). The wide occurrence of Pb at relatively low levels in the surface soils is consistent with Pb-containing aerosols being a dominant source of Pb pollution in the PRD.
Table 2. Fitted Parameters for the Log−Normal Distributions of the Concentrations of Cr, As, Ni, and Hg in the Surface Soils of the PRD and the Goodness of Fit Obtained from FMDM Analysis
The first component is the background distribution and the second component is the polluted distribution.

| metal | soil subgroup | first component: weight | first component: mean | first component: standard deviation | second component: weight | second component: mean | second component: standard deviation | Df a | Chisq b | p c |
|---|---|---|---|---|---|---|---|---|---|---|
| Cr | acrisols | 0.48 | 28.7 | 24.2 | 0.52 | 68.8 | 31.3 | 35 | 27.5 | 0.81 |
| Cr | non-acrisols | 0.84 | 67.4 | 48.1 | 0.16 | 86.3 | 11.8 | 45 | 30.6 | 0.95 |
| As | soils other than eutric fluvisols | 0.85 | 11.1 | 7.9 | 0.15 | 28.9 | 11.7 | 41 | 35.2 | 0.72 |
| Hg | soils other than eutric fluvisols | 0.94 | 0.05 | 0.04 | 0.06 | 0.24 | 0.11 | 55 | 45.2 | 0.82 |
| Ni | acrisols | 0.62 | 10.7 | 16.8 | 0.38 | 20.6 | 7.3 | 15 | 16.8 | 0.33 |
| Ni | non-acrisols | 0.98 | 24.6 | 22.0 | 0.02 | 122.1 | 14.4 | 20 | 16.7 | 0.67 |

a Degrees of freedom of the fitted mixture model. b The chi-squared goodness-of-fit statistic. c The significance level (p-value) for the test with the null hypothesis H0: the estimated model is consistent with the observed distribution. Reject H0 if p ≤ 0.1.
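The fitted components in Table 2 can be used to estimate how likely a given concentration is to belong to the polluted distribution. The sketch below is an illustration under one explicit assumption: the tabulated mean and standard deviation are treated as concentration-scale moments of each log−normal component and converted to the log-scale μ and σ of eq 5. The table does not state which scale is reported, so the resulting numbers are indicative only. The function and constant names are hypothetical.

```python
import math


def lognormal_params(mean, sd):
    """Log-scale (mu, sigma) of a log-normal with the given arithmetic mean and SD.

    ASSUMPTION: Table 2 reports concentration-scale moments.
    """
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - 0.5 * sigma2, math.sqrt(sigma2)


def lognormal_pdf(y, mu, sigma):
    """Eq 5: density of one log-normal component at concentration y > 0."""
    z = (math.log(y) - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * y * sigma)


def polluted_probability(y, background, polluted):
    """Posterior probability that y comes from the polluted component.

    Each component is a (weight, mean, sd) tuple taken from Table 2;
    the weighted densities are the terms of the sum in eq 3.
    """
    parts = []
    for weight, mean, sd in (background, polluted):
        mu, sigma = lognormal_params(mean, sd)
        parts.append(weight * lognormal_pdf(y, mu, sigma))
    return parts[1] / (parts[0] + parts[1])


# Cr in acrisols (Table 2): (weight, mean, standard deviation)
CR_ACRISOLS_BACKGROUND = (0.48, 28.7, 24.2)
CR_ACRISOLS_POLLUTED = (0.52, 68.8, 31.3)

if __name__ == "__main__":
    for y in (10.0, 50.0, 100.0):
        p = polluted_probability(y, CR_ACRISOLS_BACKGROUND, CR_ACRISOLS_POLLUTED)
        print(f"Cr = {y:g} mg/kg: P(polluted) = {p:.2f}")
```

Under this assumption, a low Cr concentration in an acrisol is attributed almost entirely to the background component, and the attribution shifts toward the polluted component as the concentration increases.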
Figure 3. Contributions from natural background and anthropogenic inputs to soil As and Hg in the PRD: (a) Regression tree for As, (b) regression tree for Hg, (c) FMDM model fit for As contents in the soils other than eutric fluvisols, and (d) FMDM model fit for Hg contents in the soils other than eutric fluvisols. In the CITs, n is the number of sampling sites, and the metal concentrations are shown in the unit of mg/kg.
Soil concentrations of Cr at 226 sampling sites were used for developing the corresponding regression tree (part b of Figure 2). The main decision factor was the land use type, which separated the soils from the waste disposal/treatment sites from the rest. Cr was used widely in electroplating, leather tanning, and textile industries in the PRD, whereas coal and oil combustion, waste incineration, and breakdown of chromium-based automotive catalytic converters could also release Cr. Industrial effluent and solid wastes from chromate-processing facilities, when disposed of improperly in wastewater treatment plants and landfills, could cause pollution of the surrounding soils.1 That is why the soils from waste disposal/treatment sites contained much higher levels of Cr compared to those from the other areas. In the right branch, the soil type was the next splitting parameter dividing the subgroup into two terminal nodes. The acrisols (ferric, haplic, and humic) showed a lower mean concentration of Cr (49.6 mg/kg) than the non-acrisols (mean Cr concentration, 70.4 mg/kg), probably due to the lower Cr contents in the parent materials of the former.

Cr concentrations in the surface soils (excluding those from the waste disposal/treatment sites) were further analyzed with FMDM. As shown in parts c and d of Figure 2 and Table 2, the mean values of the background log−normal distributions in the acrisols and non-acrisols were 28.7 and 67.4 mg/kg, whereas the corresponding polluted distributions had means of 68.8 and 86.3 mg/kg, respectively. The weight values of the polluted distributions indicate that Cr pollution was more widespread in the acrisols (0.52) than in the non-acrisols (0.16). The mean of the polluted distribution in the acrisols was comparable to that of the background distribution in the non-acrisols, suggesting that the impact of human activities on soil Cr pollution was not
Figure 4. Contributions from natural background and anthropogenic inputs to soil Ni in the PRD: (a) Regression tree for Ni, (b) FMDM model fit for Ni contents in the non-acrisols, and (c) FMDM model fit for Ni contents in the acrisols. In the CITs, n is the number of sampling sites, and the metal concentrations are shown in the unit of mg/kg.
significant outside of the waste disposal/treatment sites in the PRD.

3. Heavy Metals Contributed Mostly by Parent Materials: As, Hg, and Ni. Parts a and b of Figure 3 and part a of Figure 4 show the details of CITs for As, Hg, and Ni. The most important splitting factor of the root nodes was the soil type for these metals. The variations in the background concentrations of heavy metals among different types of soils were high enough to overshadow the contributions from anthropogenic inputs. That is, human activities were not the most significant factor affecting the spatial distributions of As, Ni, and Hg in the surface soils of the PRD. In general, the left branches with higher metal contents correspond to the soils with elevated background levels of As, Ni, and Hg, whereas the right branches contained the soils with lower metal contents.

The regression trees of both As (part a of Figure 3) and Hg (part b of Figure 3) had 2 terminal nodes with soil type being the only decision factor for the splits. The highest mean concentrations of As (38.8 mg/kg) and Hg (0.32 mg/kg) were found in the eutric fluvisols. These results suggest that the distributions of As and Hg in the surface soils of the PRD were controlled predominantly by the natural background contents. The concentrations of As and Hg in all types of soils except the eutric fluvisols were further analyzed with FMDM (parts c and d of Figure 3). The mean values of the background log−normal distributions of As and Hg in these soils were 11.1 and 0.05 mg/kg, whereas those of the polluted distributions were 28.9 and 0.24 mg/kg, respectively (Table 2). Overall, the contribution of human activities to soil As was limited, with only 15% of the surface soils affected by anthropogenic inputs, whereas the contribution from anthropogenic sources to soil Hg was comparatively negligible (Table 2).
The regression tree of Ni with two terminal nodes (part a of Figure 4) was developed from the statistical analysis of Ni concentrations in the soils from 224 sampling sites. Soil type separated the acrisols (ferric, haplic, and humic) with lower Ni concentrations (mean, 14.2 mg/kg) from the rest (mean, 25.9 mg/kg). The results of CIT indicate that the parent materials contributed greatly to soil Ni in the PRD, whereas no clear indication of anthropogenic contribution was manifested. FMDM results (parts b and c of Figure 4) show that the mean values of the background log−normal distributions of Ni in the acrisols and non-acrisols were 10.7 and 24.6 mg/kg, whereas those of the corresponding polluted distributions were 20.6 and 122.1 mg/kg, respectively. There was a significant contribution from anthropogenic sources to Ni in the acrisols, whereas anthropogenic inputs resulted in very high concentrations of Ni in only 2% of the non-acrisols, which is indicative of the presence of few strong point sources (Table 2). Identification of Human Influence on Heavy Metal Pollution. Conventional multivariate analysis techniques, such as principle component analysis and clustering analysis, can help identify the probable sources of heavy metals based on their associations with each other. The stochastic models used in this study are much more powerful and effective tools at pinpointing the sources of heavy metals and apportioning the contributions of natural background and anthropogenic inputs. FMDM analysis alone could identify the subpopulations with different characteristics for the heavy metals in the surface soils (Figure S3 and Table S1 of the Supporting Information). However, it could not accurately explain the causes of the difference in heavy metal contents observed in the surface soils across a large-scale region, where the variations resulting from soil parent materials could be significant. 
Compared to FMDM, CIT performed much better at quantitatively assessing the
dx.doi.org/10.1021/es304310k | Environ. Sci. Technol. 2013, 47, 3752−3760
Figure 5. Variable importance scores of the individual predictors for estimation of the distributions of soil heavy metals in the surface soils of the PRD obtained from variable importance analysis. For easy comparison, the values of variable importance were rescaled by setting their sum to 1. In general, the predictors with higher positive variable importance scores have stronger association with the concentrations of heavy metals in the soils.
contributions of natural and anthropogenic sources. Furthermore, with the aid of random forests, CIT could also be used to evaluate the importance of individual predictors for estimating the distributions of soil heavy metals, as shown in Figure 5. A significant impact of soil type was observed for all of the heavy metals investigated except Cu and Pb, indicating large variability in the natural background levels of the other metals. High importance scores of land use type, GDP per capita, and population density were also observed for most heavy metals, indicating that anthropogenic sources contributed significantly to soil heavy metal pollution in the PRD. It should be noted that the variables with the highest importance scores were not always the primary predictors in the regression trees, in which the split variables were chosen according to a statistical significance test instead of variance reduction. As shown in the cases of Cr, As, Hg, and Ni, coupling the CIT with the FMDM can help reveal the impact of human activities on the soil heavy metal pollution in the PRD. Because both natural and anthropogenic sources of heavy metals are present, detection of heavy metals in soils (even above the regulatory levels) does not necessarily indicate the occurrence of pollution, in contrast to the cases of synthetic organic pollutants (e.g., polychlorinated biphenyls). Heavy metals in surface soils may exhibit complex spatial variations and patterns in regions impacted by fast industrialization and rapid urbanization, such as the PRD, which makes it challenging to assess the contribution from human activities. The stochastic models used in this study are capable of identifying the anthropogenic versus natural origins of heavy metals in the surface soils across large-scale regions.
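The rescaling described for Figure 5, in which the variable importance scores are set to sum to 1, can be sketched as follows. The raw scores are hypothetical and are not values from the study; the function name `rescale_importance` and the predictor labels are likewise assumptions. A negative raw score, which permutation-based importance can produce for an uninformative predictor, stays negative after rescaling.

```python
def rescale_importance(scores):
    """Rescale raw importance scores so that they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("importance scores sum to zero; cannot rescale")
    return {name: value / total for name, value in scores.items()}


# Hypothetical raw scores for one metal (illustration only).
raw_scores = {
    "soil type": 4.0,
    "land use": 2.5,
    "GDP per capita": 2.0,
    "population density": 1.0,
    "soil organic carbon": 0.5,
    "road length": 0.5,
    "road class": -0.5,
}

rescaled = rescale_importance(raw_scores)

# Rank predictors from most to least important after rescaling.
ranking = sorted(rescaled, key=rescaled.get, reverse=True)
for name in ranking:
    print(f"{name}: {rescaled[name]:.2f}")
```

Because every metal's scores are divided by their own sum, the rescaled values are comparable across metals in relative terms only; they say nothing about how well the predictors explain each metal in absolute terms.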
With proper selection of predictors relevant to the sources and transport of heavy metals, such as local socioeconomic conditions, land use, and soil properties, CIT can serve as an effective tool for explaining the relationship between the concentrations of soil heavy metals and the selected independent variables. Furthermore, combination of CIT with FMDM can help explain the impact of human activities on the spatial variations and patterns of soil heavy metals and apportion the contributions from natural background and anthropogenic inputs.
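The significance-based choice of split variables that distinguishes CIT from variance-reduction trees can be illustrated with a simplified sketch. It uses a Monte Carlo permutation test with a between-group sum of squares statistic and a Bonferroni adjustment, which is a simplification of the conditional inference framework and not the procedure as implemented in any particular package. All function names and the toy data are hypothetical, only categorical predictors are handled, and the search for the split point itself is omitted.

```python
import random


def between_group_ss(groups, y):
    """Between-group sum of squares of y for a categorical predictor."""
    grand_mean = sum(y) / len(y)
    sums, counts = {}, {}
    for group, value in zip(groups, y):
        sums[group] = sums.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    return sum(counts[g] * (sums[g] / counts[g] - grand_mean) ** 2
               for g in sums)


def permutation_p_value(groups, y, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for independence of y and groups."""
    rng = random.Random(seed)
    observed = between_group_ss(groups, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if between_group_ss(groups, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def select_split_variable(predictors, y, alpha=0.05):
    """Pick the predictor with the smallest Bonferroni-adjusted p-value.

    Returns (name, adjusted_p_values); name is None when no predictor is
    significant at level alpha, which is the stopping rule of this sketch.
    """
    m = len(predictors)
    adjusted = {name: min(1.0, m * permutation_p_value(groups, y))
                for name, groups in predictors.items()}
    best = min(adjusted, key=adjusted.get)
    return (best if adjusted[best] <= alpha else None), adjusted


# Toy concentrations (mg/kg) and predictors, NOT data from the study.
y = [13, 15, 14, 16, 12, 14, 15, 13, 14, 16,
     25, 27, 26, 24, 28, 26, 25, 27, 26, 24]
predictors = {
    "soil type": ["acrisol"] * 10 + ["other"] * 10,
    "land use": ["farm", "urban"] * 10,
}

chosen, adjusted = select_split_variable(predictors, y)
print("split on:", chosen)
print("adjusted p-values:", adjusted)
```

In this toy case the response differs strongly between the two soil groups and hardly at all between the two land uses, so the sketch selects soil type; when no predictor passes the significance threshold, no split is made.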
■
ASSOCIATED CONTENT
Supporting Information
Additional information on the sampling site locations, general partitioning procedure for construction of the CITs, general principle of random forest, spatial distributions of soil Cd, Cr, Hg, and As levels in the PRD, and more FMDM fitting results for the heavy metals. This material is available free of charge via the Internet at http://pubs.acs.org.
■
AUTHOR INFORMATION
Corresponding Author
*Phone: (+86) 20 8529-0175; fax: (+86) 20 8529-0706; e-mail: [email protected].
Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS
The authors gratefully acknowledge the anonymous reviewers for their valuable comments and suggestions on an earlier draft. We also thank Xueping Liu, Jinmei Bai, Gaoling Wei, Zaicheng He, Yanli Wei, Liangying Liu, Haiyan Hu, and Baozhong Zhang for help with field sampling and chemical analyses. This work was supported in part by the Natural Science Foundation of China (Grants 41202251, 41073079, and 41121063) and the Chinese Academy of Sciences (Y234081001 and “Interdisciplinary Collaboration Team” program). This is contribution No. IS-1640 from GIGCAS.
■
REFERENCES
(1) Alloway, B. J. Heavy Metals in Soils, 2nd ed.; Blackie Academic and Professional: London, 1995.
(2) Mostert, M. M. R.; Ayoko, G. A.; Kokot, S. Application of chemometrics to analysis of soil pollutants. Trends Anal. Chem. 2010, 29 (5), 430−445.
(3) Cheng, S. Heavy metal pollution in China: origin, pattern and control. Environ. Sci. Pollut. Res. 2003, 10 (3), 192−198.
(4) Cheng, H.; Hu, Y. Planning for sustainability in China’s urban development: Status and challenges for Dongtan eco-city project. J. Environ. Monit. 2010, 12 (1), 119−126.
(5) Cheng, H.; Hu, Y. Improving China’s water resources management for better adaptation to climate change. Climatic Change 2012, 112 (2), 253−282.
(6) Wei, B.; Yang, L. A review of heavy metal contaminations in urban soils, urban road dusts and agricultural soils from China. Microchem. J. 2010, 94 (2), 99−107.
(7) Chen, H.; Zheng, C.; Tu, C.; Zhu, Y. Heavy metal pollution in soils in China: Status and countermeasures. Ambio 1999, 130−134.
(8) Li, X.; Wai, O. W. H.; Li, Y. S.; Coles, B. J.; Ramsey, M. H.; Thornton, I. Heavy metal distribution in sediment profiles of the Pearl River estuary, South China. Appl. Geochem. 2000, 15 (5), 567−581.
(9) Li, X.; Shen, Z.; Wai, O. W. H.; Li, Y.-S. Chemical forms of Pb, Zn and Cu in the sediment profiles of the Pearl River Estuary. Mar. Pollut. Bull. 2001, 42 (3), 215−223.
(10) Liu, W. X.; Li, X. D.; Shen, Z. G.; Wang, D. C.; Wai, O. W. H.; Li, Y. S. Multivariate statistical study of heavy metal enrichment in sediments of the Pearl River Estuary. Environ. Pollut. 2003, 121 (3), 377−388.
(11) Ip, C. C. M.; Li, X. D.; Zhang, G.; Farmer, J. G.; Wai, O. W. H.; Li, Y. S. Over one hundred years of trace metal fluxes in the sediments of the Pearl River Estuary, South China. Environ. Pollut. 2004, 132 (1), 157−172.
(12) Wong, S. C.; Li, X. D.; Zhang, G.; Qi, S. H.; Min, Y. S. Heavy metals in agricultural soils of the Pearl River Delta, South China. Environ. Pollut. 2002, 119 (1), 33−44.
(13) Zhang, X.; Lin, F.; Wong, M.; Feng, X.; Wang, K. Identification of soil heavy metal sources from anthropogenic activities and pollution assessment of Fuyang County, China. Environ. Monit. Assess. 2009, 154 (1), 439−449.
(14) Lin, Y.; Cheng, B.; Shyu, G.; Chang, T. Combining a finite mixture distribution model with indicator kriging to delineate and map the spatial patterns of soil heavy metal pollution in Chunghua County, central Taiwan. Environ. Pollut. 2010, 158 (1), 235−244.
(15) Everitt, B. Finite Mixture Distributions; Chapman and Hall: London, New York, 1981.
(16) Yang, S. Y.; Chang, W. L. Use of finite mixture distribution theory to determine the criteria of cadmium concentrations in Taiwan farmland soils. Soil Sci. 2005, 170 (1), 55−62.
(17) De’ath, G.; Fabricius, K. E. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 2000, 81 (11), 3178−3192.
(18) Xu, M.; Watanachaturaporn, P.; Varshney, P. K.; Arora, M. K. Decision tree regression for soft classification of remote sensing data. Remote Sens. Environ. 2005, 97 (3), 322−336.
(19) Zhang, X.; Lin, F.; Jiang, Y.; Wang, K.; Wong, M. T. F. Assessing soil Cu content and anthropogenic influences using decision tree analysis. Environ. Pollut. 2008, 156 (3), 1260−1267.
(20) Vega, F. A.; Matías, J. M.; Andrade, M. L.; Reigosa, M. J.; Covelo, E. F. Classification and regression trees (CARTs) for modelling the sorption and retention of heavy metals by soil. J. Hazard. Mater. 2009, 167 (1−3), 615−624.
(21) Kubošová, K.; Komprda, J.; Jarkovský, J.; Sáňka, M.; Hájek, O.; Dušek, L.; Holoubek, I.; Klánová, J. Spatially resolved distribution models of POP concentrations in soil: a stochastic approach using regression trees. Environ. Sci. Technol. 2009, 43 (24), 9230−9236.
(22) Lee, T.-S.; Chiu, C.-C.; Chou, Y.-C.; Lu, C.-J. Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Comput. Stat. Data Anal. 2006, 50 (4), 1113−1130.
(23) Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. Classification and Regression Trees; Wadsworth International Group: Belmont, CA, 1984.
(24) Quinlan, J. R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers: San Mateo, CA, 1993.
(25) Bradford, J.; Kunz, C.; Kohavi, R.; Brunk, C.; Brodley, C. In Proceedings of 10th European Conference on Machine Learning (ECML-1998); Springer: Berlin, 1998; pp 131−136.
(26) Shih, Y. S. A note on split selection bias in classification trees. Comput. Stat. Data Anal. 2004, 45 (3), 457−466.
(27) White, A. P.; Liu, W. Z. Technical note: Bias in information-based measures in decision tree induction. Mach. Learn. 1994, 15 (3), 321−329.
(28) Hothorn, T.; Hornik, K.; Zeileis, A. Unbiased recursive partitioning: A conditional inference framework. J. Comput. Graph. Stat. 2006, 15 (3), 651−674.
(29) Nagy, K.; Reiczigel, J.; Harnos, A.; Schrott, A.; Kabai, P. Tree-based methods as an alternative to logistic regression in revealing risk factors of crib-biting in horses. J. Equine Vet. Sci. 2010, 30 (1), 21−26.
(30) Breiman, L. Random forests. Mach. Learn. 2001, 45 (1), 5−32.
(31) Strobl, C.; Boulesteix, A. L.; Zeileis, A.; Hothorn, T. Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinf. 2007, 8 (1), 25.
(32) Strobl, C.; Boulesteix, A. L.; Kneib, T.; Augustin, T.; Zeileis, A. Conditional variable importance for random forests. BMC Bioinf. 2008, 9 (1), 307.
(33) Zhang, W.; Liu, X.; Cheng, H.; Zeng, E.; Hu, Y. Heavy metal pollution in sediments of a typical mariculture zone in South China. Mar. Pollut. Bull. 2012, 64 (4), 712−720.
(34) FAO/IIASA/ISRIC/ISSCAS/JRC Harmonized World Soil Database (version 1.2); FAO: Rome, Italy and IIASA, Laxenburg, Austria, 2012.
(35) Guangdong Bureau of Statistics. Guangdong Statistical Yearbook 2011; China Statistics Press: Beijing, 2011.
(36) John, G. H. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining; AAAI Press: Menlo Park, CA, 1995; pp 174−179.
(37) Hodge, V.; Austin, J. A survey of outlier detection methodologies. Artif. Intell. Rev. 2004, 22 (2), 85−126.
(38) Riani, M.; Atkinson, A. C.; Cerioli, A. Finding an unknown number of multivariate outliers. J. Roy. Stat. Soc. B 2009, 71 (2), 447−466.
(39) Cheng, H.; Hu, Y. Lead (Pb) isotopic fingerprinting and its applications in lead pollution studies in China: A review. Environ. Pollut. 2010, 158 (5), 1134−1146.
(40) Pitt, R.; Clark, S.; Field, R. Groundwater contamination potential from stormwater infiltration practices. Urban Water 1999, 1 (3), 217−236.