Environ. Sci. Technol. 2006, 40, 7848-7853
Estimating Nutrients and Chlorophyll a Relationships in Finnish Lakes OLLI MALVE* Finnish Environment Institute, Helsinki, Finland SONG S. QIAN Duke University, Nicholas School of the Environment and Earth Sciences, Durham, North Carolina
We model the response of chlorophyll asa surrogate for the phytoplankton community volumesto variations in lake total phosphorus (TP) and total nitrogen (TN) concentrations. The model is fitted to a large crosssectional data set from the Finnish Lake monitoring network. The objective is to support the Finnish Government in identifying management actions to achieve compliance of the chlorophyll a concentration standard with a given confidence level and to provide tools for the estimation of critical (target) loads for nutrients in monitored lakes. We develop a Bayesian hierarchical linear model which combines advantages of both the currently preferred nonhierarchical lake-type-specific linear model and lakespecific linear model fitted separately using data from a single lake. The hierarchical model is less biased at lakelevel compared to the lake type model. In contrast to the lake model, it predicts the lake specific chlorophyll a response to nutrients outside the lake specific observational range. The hierarchical model is used to calculate probabilities of chlorophyll a concentration exceeding the standard under different nitrogen and phosphorus concentration combinations. These probabilities can be used to estimate acceptable nitrogen-phosphorus concentration combinations by a lake manager. We discuss how our study can be useful in implementing the European Water Framework Directive.
1. Introduction Although more and more information on lake hydrology, chemistry, and biology are collected from well-designed and funded water quality monitoring networks, the lack of adequate data is often the biggest obstacle in developing a water quality management plan (National Research Council (1)). The complexity of natural processes in lakes and reservoirs makes it difficult to transform routinely monitored data into scientific knowledge that can be used for supporting lake-specific management decision making (2). Even with the increased monitoring effort, lake-specific data are often sparse. As a result, data from similar lakes are often pooled to increase the sample size to achieve the necessary statistical power. Once a set of lakes and reservoirs are classified as belonging to a same type, lake-specific observations are often pooled for analysis. Linear regressions are applied to combined observations within a type of lakes and reservoirs. For example, EUTROMOD (3) is a collection of empirical eutrophication models fitted to cross-sectional lake data from different geographical regions in the U.S. * Corresponding author phone: 358 19 40300 359; fax: 358 19 40300 391; e-mail:
[email protected]. 7848
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 40, NO. 24, 2006
The assumption of homogeneity within a type is rarely realistic. Therefore, neither an empirical model based on the data of all lakes within a type nor a model based on the data of a single lake alone are satisfactory. A Bayesian hierarchical linear model that combines data from multiple types of lakes to predict single lakes within the lake type provides a tradeoff between the models fitted to the data of a lake type and to the data of a single lake. The Bayesian hierarchical model allows aggregation at different levels and provides a mechanism for pooling information to strengthen a model’s lakelevel predictive power. Hierarchical and nonhierarchical linear regressions of total phosphorus (TP) and total nitrogen (TN) on chlorophyll a (Chla) are fitted to the data from the Finnish lake monitoring network. The lakes in the network have been classified into nine lake types for the determination of ecological status (4) by the Finnish Environment Institute (SYKE) (5). The lakes were classified according to the expert assessment and to lake morphological and chemical metrics such as depth, surface area, and color. It was believed that the accuracy of the estimates of the standard (natural) ecological status of the lake types, the human impacts, and of the ecological status of the lakes can be optimized that way. Updating of parameters and predictive distributions has been proposed for the implementation of the EU Water framework directive (7) to facilitate the adaptive decision making process. The models may be fitted and the water quality management decisions refined sequentially over time when the new monitoring data becomes available. Because of the Bayesian nature of our models, we are able to derive a lake-specific predictive probability distribution of Chla concentration, which can be used to estimate the probability of water quality standard violation for each lake. In addition, the posterior distributions of model parameters from this study provide prior distributions for future model revision or for developing lake-specific models.
2. Data National water quality monitoring of Finnish lakes started in 1965 after the passage of the Water Act in 1962. Sampling strategy and analysis methods have been described by Niemi et al. (6). In January 2000, 253 lake sites from the national monitoring network were integrated into the European Environment Agency’s Eurowaternet (8) to produce statistically reliable information that allow the European Commission, member states, and the general public to judge the effectiveness of the environmental policy. Information was required on the status of Europe’s inland water resources, quality and quantity and how the status relates and responds to pressures on the environment (8). The use of lake-type-specific Chla response models is based on the assumption that lakes within each type in the Geomorphological Typology of Finnish Lakes (Table 1) are likely to have similar Chla response to nutrient variation. We use 19 248 July and August observations of TP, TN, Chla from 2289 Finnish Lakes from 1988 to 2004. About 42% of the observations are from July and 58% from August. However, observations are unevenly distributed between years, types, and lakes (Tables 3 and 4, Supporting Information). Of the total of 2289 lakes, 900 lakes have only one observation. The average number of observations is eight (s.d. 26) per lake. One lake has 441, and there are 12 lakes that have more than 150 observations. Trophic status of the lake types varies (see Table 1, Supporting Information). Humic lake types (5ssmall, humic, deep lakes; 6sdeep, very humic lakes; 8sshallow, humic 10.1021/es061359b CCC: $33.50
2006 American Chemical Society Published on Web 10/24/2006
TABLE 1. Geomorphological Typology of Finnish Lakes Specified by Finnish Environment Institutea lake type I II III IV V VI VII VIII IX a
name
characteristics
large, nonhumic lakes large, humic lakes medium and small, nonhumic lakes medium area, humic deep lakes small, humic, deep lakes deep, very humic lakes shallow, nonhumic lakes shallow, humic lakes shallow, very humic lakes
SA > 4,000 Ha, color 4,000 Ha, color >30 SA: 50 - 4,000 Ha, color 3 m SA: 50 - 500 Ha, color: 30-90, D > 3 m color >90, D > 3 m color 90, D < 3 m
SA ) surface area, D ) depth.
FIGURE 1. Regression tree plot of observed log(Chla) [µg L-1] partitioned with TN and TP concentrations [µg L-1]. lakes; and 9sshallow, very humic lakes) have highest Chla and nutrient concentrations and lowest TN/TP - ratio. Linear correlation between predictor variables log(TP) and log(TN) is high with a correlation coefficient of 0.62 (Table 2, Supporting Information) posing a potential problem in separating the effects of these two predictors. The correlations between log(Chla) and the two predictors are similar. To discern the roles of TP and TN in deciding Chla concentration, we use the classification and regression tree (CART) procedure (11) as a variable selection method to identify important factors associated with the variation of our response variable Chla (12). The CART model is selected based on a crossvalidation procedure to minimize a model’s prediction error (11). The variable selected first is the most important one or has the most effect on Chla. Using log(TP), log(TN), lake type, depth, color, surface area, and year of observation as potential predictors, the resulting tree (Figure 1) includes TP as the most influential predictor. TN plays a role only when TP is larger than 42.75 µg L-1. Even though the lake type is not an important predictor in the CART model for the entire data set, when the CART procedure is applied separately to each lake type, we discover remarkable differences in the nutrient effects between the lake types. In all lake types, both nutrients have an effect, but
in large and deep lakes (types 1, 2, 4, 5, and 6) and in shallow, humic lakes (type 8), TP is more influential. In medium and small non-humic lakes and in shallow non-humic and very humic lakes (types 3, 7, and 9), TN is more influential. These differences between the lake types reveals the usefulness of the geomorphological typology in the prediction of Chla and in the lake management. To further corroborate the role of nutrients, we create two conditioning plots (coplots) (9). The conditional plots in Figure 1, Supporting Information, shows that the log(Chla)log(TP) relationship is relatively stable, for all conditions of log(TN) and lake depth, while the log(Chla)-log(TN) relationship varies as a function of log(TP) and depth. This result indicate a potential interaction between log(TP) and log(TN).
3. Methods 3.1. Bayesian Hierarchical Linear Model. Bayesian methods are currently experiencing an increasing popularity in the sciences as a means of probabilistic inference (13). Among their advantages are (1) the ability to incorporate prior information, (2) the ease with which Bayesian methods can be incorporated into a formal decision analytic context, (3) the explicit handling of uncertainty, and (4) the straightforward ability to assimilate new information in contexts such as adaptive management. Introduction to Bayesian statistics and data analysis can be found in refs 14, 15, and 16. In a predictive model for lake Chla concentrations, model parameters for different lakes in the same lake type may be similar. Therefore, estimates of these parameters can be expressed in terms of a common prior distribution of model parameters. In other words, we assume that lake-specific model parameters are random variables from a common distribution. Computationally, it is natural to model the data hierarchically. That is, individual observations of Chla concentration are modeled conditional on lake-specific model parameter values, the lake-specific model parameter values are modeled conditional on lake-type-specific parameters, which are themselves modeled conditional on a parameter distribution of all lakes in Finland (eqs 1-4). An important feature of the hierarchical model is that the hierarchical probability distribution imposes dependence among parameters which allows a hierarchical model to have enough parameters to form a realistic model without over fitting the data (14). Details of the Bayesian hierarchical modeling approach can be found in ref 14. Qian et al. (17, 18) indicated that using a hierarchical modeling approach to pool data from different sources often results in reduced model uncertainty and improved accuracy in estimated model parameters. The hierarchical linear model is summarized as follows:
log(yijk) ∼ N(Xβij,τ2)
(1)
Xβij ) β0,ij + β1,ij × log(TPijk) + β2,ij × log(TNijk) + β3,ij × log(TPijk) × log(TNijk) (2) βij ∼ N(βi, σ2i )
(3)
βi ∼ N(β, σ2)
(4)
where log(yijk) is the kth observed log(Chla) value from lake j in type i. X is the model matrix containing observed TP and TN values from lake j in type i, βij ) [β0,ij,β1,ij,β2,ij,β3,ij] is the lake-specific model parameter vector which consists of intercept (β0,ij) and slopes for log(TP) (β1,ij), log(TN) (β2,ij), and for the combined effect of log(TP) and log(TN) (β3,ij), τ2 is the model error variance, βi ) [β0,i,β1,i,β2,i,β3,i] is a vector 2 of model parameter means for lake type i, σ2i ) [σ0,i , σ21,i, σ22,i, VOL. 40, NO. 24, 2006 / ENVIRONMENTAL SCIENCE & TECHNOLOGY
9
7849
σ23,i] is a vector of between lake variance of model parameters for lake type i, and β ) [β0,β1,β2,β3] and σ2 ) [σ20, σ21, σ22, σ23] are the means and between type variance. Note that the hierarchical notation in eqs 1-4 indicates conditional distributions, i.e., yijk is normally distributed conditional on Xβij and τ2, βij is normally distributed conditional on βi, σ2i and βi is normally distributed conditional on β, σ2. The interaction term is added to the model to account for the non-additive effects of TP and TN indicated by the coplots in Figure 2, Supporting Information. The noninformative prior distributions of β, τ,σi, and σ in eqs 1-4 are as shown:
β ∼ N(0, 10 000)
(5)
σi,σ,τ ∼ unif (0, 100)
(6)
where N(0, 10 000) is the normal distribution of β with mean 0 and variance 10 000 and unif (0, 100) is the uniform distribution of σi, σ, and τ with lower (0) and upper (100) limits. The prior distributions for σi, σ, τi, and β are considered “non-informative” or vague. The width of the 95% prior interval for β is approximately (200, practically “flat” in the region of interest. The standard non-informative prior for a variance parameter is p(σ2) ∝ 1/σ2, which arises from assuming that the log of the variance parameter has a uniform distribution on (-∞, +∞). This prior is improper, which could lead to an improper posterior distribution. Instead, we used a uniform distribution for standard deviation, as suggested by Gelman (14). 3.2. Bayesian Nonhierarchical Linear Models. To compare hierarchical model to its classical counterparts a nonhierarchical type specific dummy variable model is fitted. We combine data from all nine types and use the lake type as a dummy variable (23). By using a lake type dummy variable, the resulting model had type-specific slopes and intercepts and a common model error variance. The common model error variance allows a meaningful model comparison with the hierarchical linear model. The Bayesian linear model with a dummy variable is summarized as follows:
log(yik) ∼ N(XβiZi, τ2)
(7)
9
XβiZi )
∑[β
0,iZ0,i
+ β1,iZ1,i × log(TPik) + β2,iZ2,i ×
i)1
log(TNik) + β3,iZ3,i × log(TPik) × log(TNik)] (8) βi ∼ N(µβi, σ2i )
(9)
where yij is the kth observation of Chla concentration from lakes in type i, X is the model matrix containing observed TP and TN values from all lakes, βi ) [β0,i,β1,i,β2,i,β3,i] is the lake type-specific model parameter vector which consists of intercept (β0,i) and slopes for log(TP) (β1,i), log(TN) (β2,i), and the interaction of log(TP) and log(TN) (β3,i), Zi ) [Z0,i,Z1,i,Z2,i,Z3,i] is the 0,1 dummy-coded matrix, τ2 is the model error variance. 2 µβi ) [µβ0,i, µβ1,i, µβ2,i, µβ3,i] and σ2i ) [σ0,i , σ21,i, σ22,i, σ23,i] are the mean and the variance for βi. The vague prior distributions of µβi, τ, σi are as follows:
µβi ∼ N(0, 10 000)
(10)
σi,τ ∼ unif(0, 100)
(11)
To see how well lake-specific observations alone identify a non-hierarchical linear regression model, we fit a linear regression model (eqs 12-14) to data sets from selected lakes. We used these lake-specific linear models as a baseline for 7850
9
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 40, NO. 24, 2006
the hierarchical model and the non-hierarchical dummy variable model.
log(yk) ∼ N(Xβ,τ2)
(12)
Xβ ) β0 + β1 × log(TPk) + β2 × log(TNk) + β3 × log(TPk) × log(TNk) (13) β ∼ N(µβ, σ2)
(14)
3.3. Computation and Model Comparison. Analytical solutions to the parameter estimates of hierarchical model are not known. Therefore, a Markov chain Monte Carlo simulation (MCMC) method (19) is used for simultaneously estimating the distribution parameters by sampling the parameters from their joint posterior distribution. The MCMC method is implemented using the freely available WinBUGS software (20) (21). The MCMC method allows us to sample all unknown parameters (βij,βi,σ2i ,σ2,τ2) from their joint posterior distribution. All inferences are made based on these posterior samples. The number of simulations that the MCMC simulation needs to converge to the true posterior distribution is called the burn-in period. Samples after the burn-in are saved for the statistical inference of the posterior distribution. Multiple MCMC chains with different lengths are run and R ˆ statistics (24) are calculated to select the length of the burnin period. Burn-in period is long enough if R ≈ 1. The burnin period for the hierarchical model is 45 000. We take 1000 samples for each unknown quantity from the next 45 000 MCMC iterations to reduce autocorrelation of the sample. The burn-in period for the non-hierarchical models is 10 000 and 1000 samples are taken for each unknown quantity from the next 10 000 MCMC iterations. The two models are compared using the deviance information criterion (DIC), a Bayesian measure of model complexity and fit (21). DIC is a sum of the posterior mean deviance D(θ), a Bayesian measure of fit or “adequacy,” and a complexity measure pD, which corresponds to the trace of the product of Fisher’s information and the posterior covariance. The Bayesian deviance D h is based on -2log(likelihood), a measure of residual information in data conditional on parameter θ or -2log[p(Y|θ)]. In our case, the data (Y ) log(chla)) are assumed to be from a normal distribution. The Bayesian deviance is
p(Y|θ) D(θ) ) -2 log f(Y)
(15)
where f(Y) is the mathematical upper limit of the likelihood function (it is reached when the estimated mean Xβij is equal to the average observed log(chla) values from lake j in type i). The smaller the D(θ), the closer the actual likelihood (p(Y|θ)) is to the maximum (hence a better model). The complexity measure is the mean deviance minus the deviance evaluated at the posterior parameter means this:
pD ) D(θ) - D(θ h)
(16)
The DIC is defined as
DIC ) D(θ) + pD
(17)
a Bayesian measure of model fit penalized by an additional complexity term. A smaller DIC indicates a “better” model. R squared (R2), the statistical measure of how well a regression line approximates real data points, is also used to compare the mean fit of the models. R2 is computed from the ratio of two sources of variation, SStotal and SSerror (eq 18). SStotal is the variability of the log(Chla) about its mean. SSerror
is the variability of the log(Chla) about the predicted mean values from the model. If SSerror is much smaller than SStotal (0