Environ. Sci. Technol. 2008, 42, 4029–4036
Multivariate Soft-Modeling To Predict Radiocesium Soil-to-Plant Transfer ANNA RIGOL,* MARTA CAMPS, ANNA DE JUAN, GEMMA RAURET, AND MIQUEL VIDAL Departament de Qu´imica Anal´itica, Universitat de Barcelona, Mart´i i Franque`s, 1-11, 08028 Barcelona, Spain
Received September 7, 2007. Revised manuscript received February 14, 2008. Accepted March 3, 2008.
A multivariate soft-modeling approach based on an exploratory principal component analysis (PCA) followed by a partial least squares regression (PLS) was developed, tested, and validated to estimate radiocesium transfer to grass from readily measurable soil characteristics. A data set with 145 soil samples and 21 soil and plant parameters was used. Samples were soils from various field plots contaminated by the Chernobyl accident (soddy-podzolic and peaty soils), submitted to several agricultural treatments (disking, ploughing, fertilization, and liming). Parameters included soil characteristics and the corresponding radiocesium soil-to-plant transfer factors. PCA of data showed that soil samples were grouped according to the field plots and that they covered a wide range of possible soil-to-plant transfer scenarios. PLS was used to design and build the multivariate prediction model. The soil database was split in two parts: (i) a representative calibration set for training purposes and model building and (ii) a prediction set for external validation and model testing. The regression coefficients of the model confirmed the relevant parameters to describe radiocesium soil-to-plant transfer variation (e.g., phyllosilicate content and NH4+ status), which agreed with previous knowledge on the interaction mechanisms of this radionuclide in soils. The prediction of soil-to-plant transfer was satisfactory with an error of the same order of magnitude as the variability of field replicates.
Introduction After a radioactive fallout event in the terrestrial environment leading to the contamination of large areas, a fast and detailed evaluation of the radiological consequences is required along with the proposal of efficient intervention actions. A variety of models has been developed for the prediction of the radionuclide fate in terrestrial environments. Some of these models have been implemented in computerized decision support systems (DSS) that allow the estimation of effective intervention strategies for contaminated scenarios. However, there is not yet an agreement about the best approach to be used (general or specific, black box or mechanistic), and this choice often depends on the specificity and quality of the input data. One of the most important exposure pathways to be considered in modeling the impact of a radioactive fallout * Corresponding author phone: 34-934021277; fax: 34-934021233; e-mail:
[email protected]. 10.1021/es702251y CCC: $40.75
Published on Web 04/29/2008
2008 American Chemical Society
event is the transfer of radionuclides through the food chain and the subsequent internal dose in humans due to ingestion of contaminated food. Areas producing food products are especially vulnerable to radioactive contamination, and vulnerability depends on the radionuclide considered. Radiocesium presents high transfer rates to a wide range of foodstuffs, mainly in soils with very low clay content, such as sandy and organic soils (1). Before the Chernobyl accident, a number of models were designed to estimate the transfer of radiocesium to food products (2-4), but it was only afterward that knowledge acquired on radiocesium-soil interaction revealed that variation in the soil-to-plant transfer of radiocesium to a given plant species was primarily due to differences in soil characteristicssmainly soil solution composition and clay content. Thus, semimechanistic hardmodeling was increasingly used to estimate the activity concentrations of radiocesium in plants from soil characteristics routinely measured or readily estimated. Roca-Jove et al. (5) predicted radiocesium root uptake from K+ absorption rates using the Barber and Cushman nutrient uptake model (6). The TEMAS model (7) evaluated the soil-to-plant transfer by taking into account physical soil characteristics, such as texture and bulk density. However, the key contribution to the improvement of the semimechanistic models was the incorporation of the specific parameters that best describe the radiocesium-soil interaction: the radiocesium interception potential (RIP) and the K+ and NH4+ concentration in soil solution. These parameters are used to predict radiocesium soil-to-plant transfer (8-10) and can be either measured (8) or deduced from common soil properties using parametric relationships. Thus, K+ concentration in soil solution can be estimated from the percentage of K+ occupancy of the inorganic cation exchange capacity (9), and the RIP, which is an estimation of the radiocesium specific sorption capacity of the soil, may be deduced from the percentage of clay (11). In the SAVE IT software package this approach is also combined with other radioecological processes for a dynamic prediction of the transfer of radiocesium to food products (12). Soft-modeling methods offer a valuable option to study environmental data and processes. These nonmechanistic approaches are measurement-driven and can be used to find similarities or relationships among samples or measured variables and to look for the mathematical relationship that allows the prediction of an environmental parameter from the knowledge of other potentially related environmental properties. The main difference with semimechanistic or hard-modeling approaches is that the relationships found stem from the variation of the experimental measurements and are not supported by mathematical expressions that describe underlying physicochemical phenomena. Softmodeling approaches skip the formulation of mechanistic expressions, which is a relevant advantage since the complete knowledge of all events related to the variation of an environmental parameter, needed for a good mechanistic model, is seldom achieved. A common objection posed to predictive models obtained by soft-modeling techniques is, however, the lack of causality, i.e., the model responds to the mathematical relationship between the properties measured and the parameter predicted for a pool of samples but does not directly imply that the variation of these properties is generally responsible for the variation of the predicted parameter. This lack of causality may be fully understood when the number of samples studied is small and when they cover a very narrow range of environmental scenarios; hence the importance of working with representative, large, and high-quality databases VOL. 42, NO. 11, 2008 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
4029
that include all the potential variables affecting a target process. Of the soft-modeling tools available, partial leastsquares regression (PLS) has long been used for the prediction of quantitative information from multivariate linear instrumental responses (spectra, chromatograms, etc.) recorded in multicomponent systems (13-16). However, to date only a few studies can be found in which PLS is applied for the prediction of environmental processes (17-20). The present paper is the first attempt to apply multivariate soft-modeling and, concretely, PLS, as an alternative to hardmodeling for the prediction of radiocesium transfer to grass from readily available soil and plant characteristics in natural meadows. To do so, a large and representative data set with 145 soil samples and 21 soil and plant parameters was used. Samples were soils from various field plots contaminated by the Chernobyl accident (soddy-podzolic and peaty soils), submitted to several agricultural treatments (disking, ploughing, fertilization, and living). The example presented is a scenario of special concern due to its high radionuclide bioavailability with the subsequent risk of contamination of milk and other food products.
Experimental Section Data Origin. The data set selected for this study was obtained from the European funded REDUP project (21). In this project a two-year experiment was designed to evaluate changes in radiocesium soil-to-plant transfer to grass in natural meadows submitted to several agricultural treatments, considered as intervention actions in that project. A database of 145 samples was created that included about 21 variables (basically soil parameters) for each sample. Soil and Plant Samples. Soil samples belonged to six experimental field plots of Russia, Belarus, and Ukraine: three soddy-podzolic soils (Sawichi, Christinovka, and VIUA) and three peaty soils (Dublin, Mateyki, and Rudnuy). General characteristics for the untreated soils are shown in section S1 of the Supporting Information (SI). Several treatments were considered at each site: tillage techniques (disking and ploughing), liming (two doses), and application of NPK fertilizer (two doses). For each treatment, four randomly located plots of approximately 3.5 × 5 m were used as replicates in all the experimental sites. After treatments, the soils were reseeded with a mixture of three plant species. At harvest, four plant subsamples from an area of 0.5 × 1 m were taken from each plot and mixed into one bulk sample providing four plant replicates per treatment. Soils were sampled to a depth of 20 cm. Four soil replicates per plot were taken and mixed into one bulk soil sample per plot, thus also leading to four soil replicates per treatment. Soil and Plant Variables. Soils and plants originated from the various treatments were characterized by several parameters. Table 1 shows the minimum and maximum values, the low and high boundary of the interquartile range (25th and 75th, respectively), and the median obtained for all soil and plant variables in the wide range of samples collected during each year of the experiment. Some of the variables considered in the data sets were common soil parameters such as pH, organic matter content (OM), bulk density (BD), cation exchange capacity (CEC), and exchangeable cations (Mexch). More specific parameters were also considered for their relevance in radiocesium behavior, such as those related to the specific sorption of radiocesium in soils: phyllosilicate content (phyllo), illite, and the radiocesium interception potential (RIP). The field capacity (FC) and the soil solution composition (Mss) of each soil sample were also determined and included in the data sets. With respect to the plant parameters, only the biomass density per area was considered. 137Cs mass activity density was determined in plant and soil samples [(Am)plant and (Am)soil, respectively] by the research institutes responsible for the field work. 137Cs was determined 4030
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 42, NO. 11, 2008
by high-resolution γ spectrometry using a HPGe detector (ORTEC GMX series) coupled to a multichannel analyzer (IN-1200). Efficiency calibration was carried out with a 152Eu standard source. The soil-to-plant transfer factor (Cr) was calculated as the ratio between the radiocesium concentration in plant and in soil. Methodology. As data obtained for the first and second year of the experiment were considered different in terms of meadow maturation, three different data sets were defined corresponding to the first year of field experiments (first year data set, 81 samples), the second year of experiments (second year data set, 64 samples), and the two years of experiments (first + second year data set). Data analysis and modeling involved three main steps: (i) Data Screening and Pretreatment. Each data matrix was built up with the rows related to the soil samples and 21 columns corresponding to the soil and plant variables. As seen in Table 1, data distributions were skewed for most variables. To correct for this fact, the values of all variables were log-transformed, except pH and BD values that showed naturally a nonskewed distribution. Due to the large differences in scale among variables, the suitably transformed data were also autoscaled prior to analysis, i.e., mean centered (each element subtracted by its mean column) and scaled to unit variance (each element divided by the standard deviation of its column). This gave the same weight to all variables in the analyses. (ii) Data Exploration and Description by Principal Component Analysis (PCA). PCA is used here to investigate the existing relationships between samples and variables. As shown in Figure S1 of SI, PCA provides a transformation of the high-dimensional variable space to an efficient lowdimensional space of abstract uncorrelated variables, the principal components (PC). The number of components found is related to the number of independent sources needed to explain the observed data variance. PCA decomposes the original data set in the product of scores (related to sample description) and loadings (related to variable description). The score plot displays the samples in the PC space, mapping the structure of the sample population, and permitting the detection of groups of samples (clusters) and possible outliers. The loading plot displays the variables in the PC space, and their correlation and relevance can be interpreted considering vectors from the origin to the variable position. A small angle between vectors indicates correlation of the corresponding variables, either positive (same direction) or negative (opposite direction). The relevance of the original variables is proportional to the absolute value of their loadings in the PC space. (iii) Prediction of Radiocesium Soil-to-Plant Transfer (Cr) by PLS. Partial least squares regression is a multivariate regression method that builds a model between a sample variable (y) and a set of other sample descriptors (X) for predictive purposes. In the environmental context of this work, Cr was the predicted variable and the rest of soil and plant parameters were used as predictor variables. In a way similar to PCA, the PLS model is built in a low-dimensional space formed by PLS components. However, while PCA is a maximum variance projection method in the X-space, PLS seeks to find a maximum covariance model of the relationship between the X- and y-space. Thus, PLS components include the variance among the explanatory variables (X) that is useful for the prediction of y (13, 14). The construction of the inverse PLS regression model, y ) XB, consists of the creation of the PLS space and the calculation of the model regression coefficients (B) in this space. Full cross-validation was used in this case to determine the number of significant PLS components to be included in the regression model. This procedure is based on excluding one-at-a-time consecutive rows of the matrix X and the
VOL. 42, NO. 11, 2008 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
4031
1.2 1.8 0.003 0.03
0.76 0.22 0.21
% % mmol L-1 mmol L-1
mmol L-1 mmol L-1 mmol L-1
illite phyllo (NH4)ss Kss
Cass Mgss Nass 2.6 0.52 0.61
1.3 4.2 0.07 0.10
61
3.8 1.2 0.04 28
0.11
5.0 0.04
20
100 0.80 6.1
2035 1800 0.43
25%
4.5 2.3 1.2
2.3 5.8 0.36 0.75
105
17 1.6 0.08 29
0.19
5.4 0.08
36
170 1.1 8.9
3190 3510 0.89
median
7.0 3.3 2.5
2.6 8.8 0.86 1.3
193
33 2.2 0.16 57
0.33
6.3 0.20
50
218 1.2 31
5240 5340 1.6
75%
14 5.5 4.6
3.2 21 2.2 3.1
2030
44 8.6 1.1 142
1.2
6.9 0.65
147
355 1.4 81
13800 15300 3.4
max
0.94 0.18 0.27
0.7 1.8 0.001 0.01
14
2.2 0.31 0.005 11
0.06
3.9 0.01
12
100 0.30 2.2
436 1260 0.04
min
1.4 0.31 0.46
0.97 3.6 0.03 0.17
71
3.9 0.97 0.03 17
0.11
5.0 0.03
18
262 0.80 6.9
940 3380 0.11
25%
3.6 1.3 0.86
1.2 4.2 0.23 0.35
136
7.7 1.7 0.08 22
0.17
5.5 0.07
28
323 1.1 8.9
2215 4550 0.69
median
second yeara
4.9 2.0 1.5
1.6 7.7 0.39 0.50
278
20 3.2 0.10 28
0.25
5.9 0.23
37
432 1.2 18
3800 9960 1.1
75%
7.1 4.3 3.7
2.4 19 1.6 1.3
619
40 8.0 0.56 147
0.44
7.6 0.43
161
850 1.4 81
11800 16200 4.3
max
definition
plant biomass density per area bulk density of soil organic matter content: loss on ignition at 450 °C for 16 h field capacity: moisture content in soil after water removal by centrifugation (0.33 bar) in the saturated state pH measured in KCl exchangeable NH4+: BaCl2extraction and UV spectroscopy (22) exchangeable K+: CH3COONH4 extraction (23) and ICP-OES exchangeable Ca2+ (idem Kexch) exchangeable Mg2+ (idem Kexch) exchangeable Na+ (idem Kexch) cation exchange capacity: sum of extractable bases and acidity with BaCl2-triethanolamine (TEA) (24) radiocesium interception potential: product of Kd and Kss at 0.1 mmol L-1 of K+ (25) Illite content (26) phyllosilicate content (26) NH4+ content in soil solutionc K+ content in soil solution (after rewetting soil at field capacity) by ICP-OES Ca2+ content in soil solution (idem Kss) Mg2+ content in soil solution (idem Kss) Na+ content in soil solution (idem Kss)
137Cs
mass activity density in plant mass activity density in soil ratio between (Am)plant and (Am)soil
137Cs
a Minimum, 25% percentile (lower quartile), median, 75% percentile (upper quartile), and maximum values of variables. b (Am)plant was used for the calculation of Cr, but this variable was not included in the data sets for the PLS model. c (NH4)ss was calculated from Kss assuming the same ratio obtained from the exchangeable NH4+ and K.
18
mmolc kg-1
RIP
1.1 0.53 0.01 9.7
kg-1
cmolc cmolc kg-1 cmolc kg-1 cmolc kg-1
0.02
cmolc kg-1
Kexch
Caexch Mgexch Naexch CEC
3.7 0.01
cmolc kg-1
pH (NH4)exch
11
15 0.30 3.0
200 720 0.07
min
%
Bq plant Bq kg-1 soil Bq kg-1 plant/ Bq kg-1 soil g m-2 g cm-3 %
kg-1
unit
FC
biomass BD OM
(Am)plant (Am)soil Cr
b
variable
first yeara
TABLE 1. Abbreviations, Units, Definitions, and Statistics of the Values of Soil and Plant Variables
TABLE 2. Variance Explained in the PCA Models data set
matrix dimension
1st year 2nd year 1st + 2nd year
81 samples × 21 variables 64 samples × 21 variables 145 samples × 21 variables
a
Percentage of variance.
b
PC1 a
36 (36) 32 (32) 33 (33)
n
∑ (y
2 i i predicted - ymeasured
)
i)1
n
(1)
The number of components, for which the RMSECV reaches its minimal value, is selected to construct the final model. The use of PLS allowed for the development of a qualitative study, focused on the assessment of the relevance of the different measured variables in the description/prediction of the soil-to-plant transfer and for the development of the regression model to quantitatively predict the value of this parameter in the different soil samples. Both qualitative and quantitative tasks are described in more detail in subsections iii.a and iii.b, respectively. (iii.a) Relevance of Variables To Predict Radiocesium Soilto-Plant Transfer (Cr) by PLS. In a first stage, PLS was used to model separately the three data sets (first year, second year, and first + second year) in order to ascertain the relevance of the different variables for the prediction of radiocesium soil-to-plant transfer and to assess the differences among the data sets due to the soil maturation degree. Since this part of the study was focused on the qualitative interpretation of the PLS models, these models were built using all samples in the data sets, and the results presented come from the cross-validated models. The difference among models and the relevance of the different variables for radiocesium soil-to-plant transfer prediction was studied by looking at the quality parameters of the model and at the regression coefficients, B. The regression coefficients contain information linked to the relevance of the original X variables for y prediction. Thus, the higher the absolute value of the coefficient, the higher the relevance. The sign generally indicates the direction of correlation. Although this interpretation would be absolutely straightforward only in the case where X variables were completely uncorrelated, the most salient trends in the regression coefficients can be safely interpreted as relevant for predictive purposes. (iii.b) Construction and Validation of a Radiocesium Soilto-Plant Transfer Prediction Model by PLS. The predictive model was built on the data set composed by all the samples (first + second year). To test the potential usefulness of the PLS model by using external validation, the population of samples was representatively divided into a calibration set, used for the establishment of the model, and a prediction set, used to validate externally the predictive ability of the model. Two thirds of the total set of samples were selected for calibration using the Kennard-Stone algorithm (27). The remaining samples were used as the prediction set. The soil samples of the prediction set were used to compare the predicted and the field values. For this comparison, predicted PLS values were back-transformed to the original non-logarithmic units. A root mean square error of 4032
9
PC2
PC3
PC4
PC5
PC6
28 (64) 33 (65) 26 (59)
14 (78) 13 (78) 12 (71)
12 (90) 8 (86) 7 (78)
2 (92) 4 (90) 6 (84)
2 (94) 2 (92) 5 (89)
Percentage of cumulative variance.
corresponding variable y. For the remaining data set, the PLS model is constructed considering various numbers of components. Then, the model is used to predict y in the excluded samples. The root mean square error of crossvalidation (RMSECV) is calculated from all the models done with the different numbers of components according to the following expression:
RMSECV )
b
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 42, NO. 11, 2008
prediction (RMSEP) is calculated from these back-transformed values with the same expression used before for RMSECV. The predictive ability of the model was evaluated from (a) the comparison of predicted and field soil-to-plant transfer values for the individual soil samples assayed and (b) the quality of the correlation obtained among all the predicted and field values. The Unscrambler 6.11a software (Copyright 1986-1997, CAMO ASA, Norway) was used for PCA and PLS analyses. Further details on the theoretical background of the multivariate calibration methods and on the software applied can be found elsewhere (28).
Results and Discussion Data Exploration and Description. In a first step, an exploratory analysis of the autoscaled data of the three data sets was carried out using principal component analysis. Whereas soil and plant data of the Rudnuy site in the first year of experiment and of Mateyki site in the second year were not available, the combination of the data from the two years (first + second year data set) provided a complete mapping of the samples from the six sites. Table 2 shows the percentage of variance explained with each PC for the three data sets. In all cases, more than 70% of the total variance was explained with the three first principal components. As similar conclusions were drawn from the score and loading plots obtained from the three data sets, Figure 1 shows, as an example, the plots of PC1, PC2, and PC3 obtained for the first + second year data set (see Figure S2 of SI for the plots corresponding to the first and second year data sets). As shown in the three-dimensional score plot (Figure 1a), samples clustered mainly according to the field plots (soil type), the agricultural treatment being less relevant for the grouping of samples and leading to only slight differences in the PC values. Dublin samples constituted a cluster different from the rest along the direction of PC1. PC2 helped to separate well-defined clusters for Christinovka and Mateiky, different from the group of soils including Sawichi, VIUA, and Rudnuy. Differences among these three latter soils were mainly enhanced along the PC3 direction. Moreover, samples belonging to the same experimental site showed a certain separation according to the year of the experiment, indicating that soil and plant characteristics varied during the field treatments among years. The PCA score plot also showed that soil samples succeeded in covering a wide range of possible soil-to-plant transfer scenarios, which is a positive aspect for the use of this data to build a model of general validity for prediction of radiocesium transfer to plant. With respect to the loading plots, and concretely, in the PC2 vs PC1 plot (Figure 1b), almost all variables showed significant values at least in one of the principal components. Although slight differences were observed between the loading plots of the three data sets studied (see also Figure S2 of SI), some common correlations among variables were seen. Focusing on soil parameters, some expected relationships were discerned in the PC2 vs PC1 loading plot, which confirmed the quality of the data used for the construction of the model and the appropriateness of the methodology used to describe the natural situation. Thus, OM, CEC, exchangeable cations, and FC were positively correlated with
TABLE 3. Description of Cross-Validated PLS Models field vs predicted regressiona data set
number of y explained RMSECVb PLS cumulative components variance
slope
r
1st year 2nd year 1st + 2nd year
2 2 4
including (Am)soil 82 0.40 88 0.44 81 0.44
1.0 ( 0.1 0.89 0.99 ( 0.09 0.92 0.99 ( 0.07 0.91
1st year 2nd year 1st + 2nd year
2 2 4
excluding (Am)soil 80 0.42 85 0.43 77 0.44
1.0 ( 0.1 0.88 1.0 ( 0.1 0.91 0.99 ( 0.07 0.90
a Offsets of all field vs predicted regressions were found not to be statistically significant. b Value calculated after backtransforming the logarithmic predicted values to original Cr units.
FIGURE 1. PCA score plot (a) and loading plots (b, c) for PC1, PC2, and PC3 and for the first + second year data set. Sawichi: S, first year; s, second year. Dublin: D, first year; d, second year. VIUA: V, first year; v, second year. Christinovka: C, first year; c, second year. Mateyki: M, first year. Rudnuy: r, second year. The arrows in the score plots are the projection of the averaged sample for each year and site in the PC1-PC2 space. each other and negatively correlated to BD. pH also showed a negative correlation with the concentration of cations in soil solution, since a decrease in pH causes desorption of cations from the solid phase. On the other hand, radiocesium interception potential (RIP), illite, and phyllosilicate contents were positively correlated, as has been widely reported (11, 29). The radiocesium soil-to-plant transfer factor (Cr) was in the opposite direction of PC2 with respect to RIP and to phyllosilicate and illite contents. This negative correlation has been previously reported (8, 11) and is explained by the tendency of clay materials to irreversibly sorb radiocesium at specific sites, decreasing the fraction of the radionuclide that can be taken by the plant roots. The correspondence between samples and variables in scores and loading plots helps to convey the nature of the
different soil clusters detected. Thus, high positive values of PC1 were observed for OM, CEC, exchangeable cations, and FC. All of these variables relate to a noticeable organic matter content of the soil, thus causing that Dublin soils appeared in the score plot overlapping the PC1 axis in the positive direction. RIP, phyllosilicate, and illite contents, variables that are often related to large clay content in the soil, showed high positive values of PC2 in the loading plot. This same location was also observed in the score plot for Christinovka samples, which are the soils with the highest clay content among the soils tested here. The Cr variable appeared in the loading plot with an almost negligible PC1 and a significant PC2, the same position shown in the score plots by the group of soils composed by VIUA, Sawichi, and Rudnuy samples which corresponds, in general, to soils with lower clay content and higher Cr values than the rest of soil samples. These three soils were only distinguished by the PC3, which is basically associated with the pH variable, as shown in the loading plot PC3 vs PC1. Indeed the Sawichi soils, in opposite direction to pH variable, are the most acid among all soil populations. The representativeness of the soil database used has been shown in the score plots, with soil clusters with very distinct characteristics, and in Table 1, where soil variables, such as the Cr values to be predicted, range in value of 2 or 3 orders of magnitude. The loading plots also pointed out correlations among variables typically encountered in previous general studies on soil behavior. Strange relationships that could have stemmed from a too small and homogeneous soil sample population were not detected. The soil samples used include a wide variety of possible scenarios and, therefore, the explored data sets seem suitable for the elaboration of a general model able to predict radiocesium transfer to grass in natural meadows in soils with varying characteristics. Relevance of Variables To Predict Radiocesium Transfer to Plant. The first year, the second year, and the first + second year data sets were modeled separately using PLS to reveal which variables better explained Cr in each year of experiment. To test the relevance of (Am)soil in the prediction of Cr, two options were considered for modeling each data set: including or excluding this variable. Table 3 summarizes the statistical description of the crossvalidated PLS model obtained in each case, and almost identical results were obtained for the three data sets and the two options tested. Suitable models with two PLS components for the one-year data sets and four PLS components VOL. 42, NO. 11, 2008 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
4033
FIGURE 2. Regression coefficients of the first + second year PLS model. for the two-year data sets were obtained with an explained y cumulative variance around 80%. Satisfactory regressions with slopes close to 1 were obtained for the three sets of data, confirming the absence of bias in the models and, therefore, of systematic overestimations or underestimations of Cr. The RMSECV values, calculated after back-transforming the logarithmic values of predicted Cr to the original units, were also very similar for the three models (around 0.4) with absolute values of residuals for the predicted Cr of the individual samples in the range from 0.001 to 1.2. To further evaluate the goodness of the models, reference values of field Cr were calculated as the mean of field results obtained for the same soil and the same agricultural treatment. Field residuals were defined as the difference between an individual field result and the respective reference value. As an example, in Figure S3 of SI, the residuals obtained for the first + second year data set excluding the (Am)soil variable are represented with respect to the corresponding reference value of field Cr. PLS and field residuals are within the same order of magnitude; hence, the error of PLS model for Cr prediction is comparable with the field variability of this parameter. The regression coefficients of the PLS models obtained for the three data sets, including and excluding the (Am)soil variable, were compared to identify the most relevant variables to describe the Cr variation for each data set. The direction and magnitude of the influence of soil and plant variables on Cr can generally be associated with the sign and the magnitude of its regression coefficient, respectively. However, since soil and plant variables may be partially correlated, their related regression coefficients cannot be considered in an individual or completely independent way. Only the most salient trends in the regression coefficients can be safely used for chemical interpretation. Figure 2 shows, as an example, the regression coefficients obtained for the first + second year data set excluding (Am)soil, since nonsignificant variations in the relative relevance of variables were observed when including (Am)soil, and similar conclusions can be drawn from the regression coefficients of the three data sets (Figure S4 of SI shows the results obtained for the first and second year data sets). Among the clearest trends, the high positive correlation of the K+ and NH4+ concentration in soil solution, with the predicted variable (Cr), is worthy of note in the three data sets. Another general trend was the relevance of RIP and phyllosilicate variables that presented high negative regression coefficients. Explanation of the relevance of these variables lies in the mechanisms governing root uptake (see Section S2 of SI). Thus, previous articles in the literature have shown that, for 4034
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 42, NO. 11, 2008
FIGURE 3. Predicted versus field 137Cs soil-to-plant transfer factors in logarithmic units using external validation. The solid line represents the ideal log(Cr)pred ) log(Cr)measured relationship. Offsets were found not to be statistically significant. radiocesium, the solid-liquid distribution coefficient (Kd) is directly related to the presence of clays, especially 2:1 phyllosilicates (i.e., illite), since specific sorption of radiocesium occurs in these materials (30-32). This specific sorption capacity is quantified by the RIP (25). Then, an increase in RIP, phyllosilicate, or illite content may provoke an increase in Kd and thus, assuming that Cr is inversely proportional to Kd, a decrease in Cr, this being reflected by the high negative regression coefficients of the most relevant RIP and phyllosilicate variables. On the other hand, K+ and NH4+, are the main competing ions of radiocesium for these specific sites, and high concentrations of these species in soil solution may contribute to an increase in radiocesium Cr by decreasing Kd. This is confirmed by the high positive regression coefficients of K+ and NH4+ concentrations in soil solution. Other variables that have a significant weight on Cr prediction with a negative correlation are plant biomass and pH. The effect of plant biomass is explained by the decrease on radiocesium concentration in the plant when increasing its biomass due to a dilution effect. In contrast, pH does not have a direct effect on radiocesium sorption; however, it may play an indirect role through a negative correlation with (NH4)ss (33). Finally, the regression coefficients of the three data sets suggested that an increase in exchangeable divalent cations might be related to a decrease in radiocesium soilto-plant transfer. A possible interpretation of this behavior already observed in previous investigations (34) is that divalent cations compete with radiocesium uptake through competition for exchange sites in the apoplast of the root cortex. Thus, prior to root uptake, radiocesium must pass through a cell-wall free space, characterized by cationic exchange mechanisms associated with carboxylic groups. The radiocesium loading in the apoplast is influenced by the solution composition, being lower at high divalent ion concentrations. Construction and External Validation of a Model for Prediction of Radiocesium Transfer to Plant. Once the similarity in terms of variable relevance for Cr prediction of the three data sets was demonstrated, the whole database including the two years of experiments (first + second year data set) was selected for the construction of a model of more general application. Since multivariate soft-modeling might be used for estimation of Cr in readily contaminated soils and for prediction of radiocesium mobility in soils
FIGURE 4. Comparison of the field 137Cs soil-to-plant transfer factor and the predicted PLS value (back-transformed to original units) obtained using external validation. Error bars indicate (a) the standard deviation of field replicates (n ) 4) for field values and (b) the RMSEP for predicted values. susceptible of radioactive contamination, the model was built without considering the (Am)soil variable. The soil database was split in two: a representative calibration set for training purposes and model building (twothirds of the data), and a prediction set for external validation and model testing (one-third of the data). The PLS model obtained with the calibration set showed a pattern of regression coefficients similar to those obtained in the previous section with the whole data set, confirming the representativeness of selected data. Once constructed, the model was externally validated with the test set. A satisfactory global correlation between predicted and field Cr values was obtained, with a slope close to 1 (Figure 3). This confirmed the accuracy, the robustness, and the absence of bias in the PLS model. Individual predicted values were back-transformed to the original units and compared with those obtained from field experiments. Multivariate softmodeling succeeded in predicting radiocesium soil-to-plant transfer. Thus, the predicted Cr agrees with the field Cr value within the experimental error (see Figure 4). Moreover, the root-mean-square error of prediction (RMSEP) was 0.33, with absolute values of residuals for the predicted Cr of the individual samples in the range from 0.003 to 0.7. This error is acceptable taking into account that field residuals were often higher with values up to 0.9 (see Figure S3 of SI). It is worth noting that, even using the original untransformed data, acceptable correlations were obtained with prediction errors of the same order of magnitude as using logtransformed data sets. However, log-transformed data were preferred to confer more stability to the regression model. The results obtained in the present study encourage the use of soft-modeling for the prediction of environmental processes and confirm that this approach is a useful alternative to mechanistic modeling. In contrast to multivariate modeling, the traditional modeling based on assumed mechanisms is frequently done for a few variables at a time and hence neglects variable interactions (synergisms and antagonisms). In addition, a complete knowledge of all events and mechanisms in environmental processes is hardly ever achieved in practice, and this makes the complete reliance on mechanistic models risky. In contrast, semiempirical models underlying PCA and PLS are more forgiving to lack of knowledge, and hence they are often preferable in situations where the information on mechanisms is only partial. The wealth of diagnostics available in multivariate
modeling also provides more knowledge about the data and their relationships, as well as about their deficiencies, than the mechanistic modeling that works by fitting a fixed mathematical expression to the variation of the data and allows less freedom in the interpretation of the results. Therefore, the combination of soft-modeling methodologies with representative environmental data sets is offered as a new and excellent option to interpret and predict environmental processes, having potential application in a wide variety of scenarios in terms of process nature and degree of previous knowledge.
Acknowledgments This research was supported by the Ministerio de Ciencia y Tecnologı´a (PPQ2002-00264) and the Ministerio de Medio Ambiente (1.2-193/2005/3-B). M.C. thanks the Spanish government for the concession of a training grant in research. The authors also thank N. Grebenshikova and S. Firsakova (RIR, Belarus), N. Sanzharova and S. Fesenko (RIARAE, Russia), Y. Ivanov and S. Levchuk (UIAR, Ukraine), and C. Shand and M. Cheshire (MLURI, U.K.) for supplying some of the plant and soil data.
Supporting Information Available Description of the general characteristics for the untreated soils used in the study (Section S1); scheme of the creation of the PC space in PCA analyses (Figure S1); PCA score and loading plots for PC1, PC2, and PC3 and for the first and second year data sets (Figure S2); comparison of PLS and field residuals of Cr obtained for the first and second year data sets (Figure S3); regression coefficients of the first and second year PLS models (Figure S4); and description of the mechanisms governing root uptake (Section S2). This material is available free of charge via the Internet at http:// pubs.acs.org.
Literature Cited (1) Handbook of Parameter Values for the Prediction of Radionuclide Transfer in Temperate Environments; Technical Report No. 364; International Atomic Energy Agency: Vienna, 1994. (2) Wicker, F. W.; Kirchner, T. B.; Breshears, D. D.; Otis, M. D. Estimation of radionuclide ingestion: the “PATHWAY” foodchain model. Health Phys. 1990, 59, 645–657. (3) Mu ¨ ller, H.; Pro¨hl, G. ECOSYS-87: a dynamic model for assessing radiological consequences of nuclear accidents. Health Phys. 1993, 64, 232–252. VOL. 42, NO. 11, 2008 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
4035
(4) Johnson, R. H.; Mitchell, N. T. User manual for the SPADE suite codes, SPADE 3.0, PC version; Report M2757-R2; Ministry of Agriculture Fisheries and Food: U.K., 1993. (5) Roca-Jove, M. C.; Vallejo-Calzada, V. R. Predicting radiocaesium root uptake based on potassium uptake parameters. A mechanistic approach. Plant Soil 2000, 222, 35–49. (6) Barber, S. A.; Cushman, J. H. In Modelling Waste Water Renovation Land and Treatment; Iskander, I. K., Ed.; WileyInterscience, New York, 1981; pp 382. (7) Gutierrez, J.; Vazquez, C.; Meckbach, R.; Wilkins, B.; Rafferty, B.; Holm, E.; Badie, M.; Burton, O. Techniques and Management Strategies for Environmental Restoration and their Ecological Consequences; TEMAS project Final Report, FI4P-CT95-0021; CIEMAT: Madrid, 2000. (8) Camps, M.; Rigol, A.; Hillier, S.; Vidal, M.; Rauret, G. Quantitative assessment of the effects of agricultural practices designed to reduce 137Cs and 90Sr soil-plant transfer in meadows. Sci. Total Environ. 2004, 332, 23–38. (9) Absalom, J. P.; Young, S. D.; Crout, N. M. J.; Nisbet, A. F.; Woodman, R. F. M.; Smolders, E.; Gillett, A. G. Predicting soil to plant transfer of radiocesium using soil characteristics. Environ. Sci. Technol. 1999, 33, 1218–1223. (10) Absalom, J. P.; Young, S. D.; Crout, N. M. J.; Sanchez, A.; Wright, S. M.; Smolders, E.; Nisbet, A. F.; Gillett, A. G. Predicting the transfer of radiocaesium from organic soils to plants using soil characteristics. J. Environ. Radioact. 2001, 52, 31–43. (11) Smolders, E.; Van den Brande, K.; Merckx, R. Concentrations of 137Cs and K in soil solution predict the plant availability of 137Cs in soils. Environ. Sci. Technol. 1997, 31, 3432–3438. (12) Gillett, A. G.; Crout, N. M. J.; Absalom, J. P.; Wright, S. M.; Young, S. D.; Howard, B. J.; Barnett, C. L.; McGrath, S. P.; Beresford, N. A.; Voigt, G. Temporal and spatial prediction of radiocaesium transfer to food products. Radiat. Environ. Biophys. 2001, 40, 227–235. (13) Boardman, A. E.; Hui, B. S.; Wold, H. The partial least squaresfix point method of estimating interdependent systems with latent variables. Comm. Statist. Theory Methods 1981, 10, 613– 639. (14) Wold, S.; Martens, H.; Wold, H. The multivariate calibrationproblem in chemistry solved by the PLS method. Lect. Notes Math. 1983, 973, 286–293. (15) Roig, M.; de Juan, A.; Garcı´a, J. F.; Toribio, M.; Vidal, M.; Rauret, G. Determination of a mixture of gamma-emitting radionuclides using solid scintillation detectors and multivariate calibration. Anal. Chim. Acta 1999, 379, 121–133. (16) Navea, S.; Tauler, R.; de Juan, A. Application of the local regression method interval partial least-squares to the elucidation of protein secondary structure. Anal. Biochem. 2005, 336, 231–242. (17) Eriksson, L.; Hermens, J. L. M.; Johansson, E.; Verhaar, H. J. M.; Wold, S. Multivariate analysis of aquatic toxicity data with PLS. Aquat. Sci. 1995, 57, 217–241. (18) Smolin ´ ski, A.; Walczak, B.; Einax, J. W. Robust multivariate calibration in environmental studies. Anal. Lett. 2003, 36, 2317– 2336. (19) Aulinger, A.; Einax, J. W.; Prange, A. Setup and optimization of a PLS regression model for predicting element contents in river sediments. Chemom. Intell. Lab. Syst. 2004, 72, 35–41.
4036
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 42, NO. 11, 2008
(20) Sonesten, L. Land use influence on 137Cs levels in perch (Perca fluviatilis L.) and roach (Rutilus rutilus L.). J. Environ. Radioactiv. 2001, 55, 125–143. (21) Vidal, M.; Camps, M.; Grebenshikova, N.; Sanzharova, N.; Ivanov, Y.; Vandecasteele, C.; Shand, C.; Rigol, A.; Firsakova, S.; Fesenko, S.; Levchuk, S.; Cheshire, M.; Sauras, T.; Rauret, G. Soil- and plant-based countermeasures to reduce 137Cs and 90Sr uptake by grasses in natural meadows: the REDUP project. J. Environ. Radioactiv. 2001, 56, 139–156. (22) Methods for the chemical analysis of water and wastes; Method 350.1, EPA-600/4-79-020, Environmental Protection Agency. Government Printing Office: Washington, DC, 1979. (23) Thomas G. W. In Methods of soil analysis. Part 2. Chemical and microbiological properties. Page, A.L.; Miller, R.H.; Keenney, D.R., Eds.; ASA-SSSA Publisher: Madison, Wisconsin, 1982, pp. 159. (24) Soil Survey Laboratory Methods Manual. Burt, R. Ed.; Investigations Report No. 42, Version 4.0, Natural Resources Conservation Service, USDA: Washington, 2004. (25) Wauters, J.; Elsen, A.; Cremers, A.; Konoplev, A.; Bulgakov, A. A.; Comans, R.N.J. Prediction of solid/liquid distribution coefficients of radiocesium in soils and sediments. Part one: a simplified procedure for the solid phase characterization. Appl. Geochem. 1996, 11, 589–594. (26) Hillier, S. Accurate quantitative analysis of clay and other minerals in sandstones by XRD: comparison of a Rietveld and a reference intensity ratio (RIR) method and the importance of sample preparation. Clay Miner. 2000, 35, 291–302. (27) Kennard, R. W.; Stone, L. A. Computer aided design of experiments. Technometrics 1969, 11, 137–148. (28) Multivariate Analysis in Practice; Esbensen, K., Scho¨nkopf, S., Midtgaard, T., Eds.; Computer-Aided Modelling AS: Trondheim, Norway, 1994. (29) Waegeneers, N.; Smolders, E.; Merckx, R. Ecological risk assessment: a statistical approach for estimating the radiocesium interception potential of soils. J. Environ. Quality 1999, 28, 1005– 1011. (30) Staunton, S. Adsorption of radiocesium on various soils: Interpretation and consequences of the effects of soil:solution ratio and solution composition on the distribution coefficient. Eur. J. Soil Sci. 1994, 45, 409–418. (31) Absalom, J. P.; Young, S. D.; Crout, N. M. J. Radio-cesium fixation dynamics: measurement in six Cumbrian soils. Eur. J. Soil Sci. 1995, 46, 461–469. (32) Rigol, A.; Vidal, M.; Rauret, G.; Shand, C. A.; Cheshire, M. V. Competition of organic and mineral phases in radiocesium partitioning in organic soils of scotland and the area near Chernobyl. Environ. Sci. Technol. 1998, 32, 663–669. (33) Sanchez, A. L.; Smolders, E.; Van den Brande, K.; Merckx, R.; Wright, S. M.; Naylor, C. Predictions of in situ solid/liquid distribution of radiocesium in soils. J. Environ. Radioactiv. 2002, 63, 35–47. (34) Smolders, E.; Sweeck, L.; Merckx, R.; Cremers, A. Cationic interactions in radiocesium uptake from solution by spinach. J. Environ. Radioactiv. 1996, 34, 161–170.
ES702251Y