Nonparametric Rank Regression for Analyzing Water Quality

Jan 25, 2011 - ARTICLE pubs.acs.org/est. Nonparametric Rank Regression for Analyzing Water Quality. Concentration Data with Multiple Detection Limits...
0 downloads 0 Views 2MB Size
ARTICLE pubs.acs.org/est

Nonparametric Rank Regression for Analyzing Water Quality Concentration Data with Multiple Detection Limits Liya Fu†,§ and You-Gan Wang*,‡,§ †

Key Laboratory for Applied Statistics of MOE and School of Mathematics and Statistics, Northeast Normal University, Changchun, 130024, China ‡ Centre for Applications in Natural Resource Mathematics (CARM), School of Mathematics and Physics, The University of Queensland, St Lucia, Queensland, 4072, Australia § CSIRO Mathematics, Informatics and Statistics, 120 Meiers Road, Indooroopilly, Queensland, 4068, Australia ABSTRACT: Environmental data usually include measurements, such as water quality data, which fall below detection limits, because of limitations of the instruments or of certain analytical methods used. The fact that some responses are not detected needs to be properly taken into account in statistical analysis of such data. However, it is well-known that it is challenging to analyze a data set with detection limits, and we often have to rely on the traditional parametric methods or simple imputation methods. Distributional assumptions can lead to biased inference and justification of distributions is often not possible when the data are correlated and there is a large proportion of data below detection limits. The extent of bias is usually unknown. To draw valid conclusions and hence provide useful advice for environmental management authorities, it is essential to develop and apply an appropriate statistical methodology. This paper proposes rank-based procedures for analyzing non-normally distributed data collected at different sites over a period of time in the presence of multiple detection limits. To take account of temporal correlations within each site, we propose an optimal linear combination of estimating functions and apply the induced smoothing method to reduce the computational burden. Finally, we apply the proposed method to the water quality data collected at Susquehanna River Basin in United States of America, which clearly demonstrates the advantages of the rank regression models.

’ INTRODUCTION In environmental studies, measurements often have values below a specified detection limit (DL) because of limitations of the measurement procedures or the analytical tools used in the laboratories. Data below or above a detection limit are also referred to as left or right censored data, respectively. There may also be multiple detection limits involved, if, for example, an instrument was upgraded during the project period or data was combined from multiple laboratories. Various strategies have been used to analyze the data with DLs,1-12 such as deletion of the values below detection limits (BDLs) or simple substitution methods in which the measurements below detection limits are replaced with fixed values, such as zero, one-half of the DLs, or the DLs.1,2 However, the deletion method may cause upward bias and a loss of information, and the substitution methods are generally not suitable when the results strongly depend on the substituted values.2 In particular, when there is a high proportion of censored data, the biased standard errors will further distort our inference.3,4 If the distribution of the measurements is known to be either normal or log-normal, an alternative method is to fill in values randomly selected from the appropriate distribution or replace the values below the DLs with their conditional expected values (conditional on being less than the DLs).2,6 This latter method produces unbiased regression parameter estimates and corrects for bias in variance estimates when the distributional assumptions are met, otherwise, bias will occur.2,5-7 The robust method of regression on order statistics r 2011 American Chemical Society

(ROS) suffers from the same problem3,8,9,10). In presence of both detection limits and non-normality, data analysis can become even more daunting. To most effectively monitor and manage the water quality, and furthermore protect aquatic living resources, the Susquehanna River Basin Commission (SRBC) in the U.S.A. implemented a five-year nutrient-monitoring program to establish a database for the mainstream Susquehanna River, its major tributaries, and other selected tributaries (data available at http://www.srbc.net/programs/CBP/nutrientprogram.htm). More than 14 water quality indicators were measured (twice per month, roughly), some are of particular interest, for example, dissolved ammonia and orthophosphate. Dissolved ammonia are toxic to fish and aquatic organisms, even in very low concentrations. A concentration of dissolved ammonia greater than approximately 0.1 mg/L usually indicates that the water is polluted and can cause fish death. Orthophosphate is produced by natural processes and is one of the elements of phosphorusis necessary for the growth of plants and animals. Therefore, orthophosphate can stimulate the growth of water plants that provide food for fish. However, if too much orthophosphate is present, algae and water weeds grow profusely and use up large amounts of oxygen, which can lead to the death of many fish and aquatic organisms. Therefore, it is important to monitor Received: April 22, 2010 Accepted: December 10, 2010 Revised: December 5, 2010 Published: January 25, 2011 1481

dx.doi.org/10.1021/es101304h | Environ. Sci. Technol. 2011, 45, 1481–1489

Environmental Science & Technology concentrations of dissolved ammonia and orthophosphate in rivers and waterways. The SRBC data sets on dissolved ammonia and orthophosphate were combined from Pennsylvania Department of Environmental Protection (PADEA) Laboratories and Columbia Analytical Services. The proportions of values below detection limits were 19.3% and 23.2% for dissolved ammonia and orthophosphate, respectively. The detection limits for dissolved ammonia in these two laboratories are 0.02 and 0.01 mg/L, respectively. The detection limits for orthophosphate in the two laboratories are 0.01 and 0.002 mg/L, respectively. As we will see later, this data set is highly non-normal and logtransformation does not alleviate this non-normality problem. Checking for normality is also difficult when most low values are censored. Detection limits often vary because of the different analytical instruments used by different laboratories at different time periods. In some cases, data were collected for different research projects and different limits were used because of cost considerations. In contrast to the parametric methods, robust nonparametric methods that do not require distributional assumptions have received some attention in environmental research.13-15 Halperin16 proposed an extension of the Wilcoxon-Mann-Whitney test to analyze the data with a fixed detection limit. Zhang et al. 4 developed a nonparametric estimation procedure and established the theoretical equivalence of three nonparametric methods: the Wilcoxon rank sum, the Gehan, and the Peto-Peto tests under a fixed detection limit and some mild conditions. Their simulation studies demonstrated that nonparametric methods outperform the parametric methods. The rank estimating functions are step functions, which are discontinuous in the regression parameters. This makes it difficult to find consistent estimators and their asymptotic variance and covariance matrices. Much progress has been made to overcome this difficulty.17-20 In environmental research, the statistical analysis of the multivariate censored data has an additional complication because of the underlying temporal correlations from observations within each site. Jung and Ying 21 proposed a great idea of constructing a generalization of the Wilcoxon-Man-Whithey rank statistic based on an independence working model assumption for clustered data analysis. This method is simple and the resulting estimators also have desirable statistical properties. This method based on an independence “working” model ignores the within-site correlations in obtaining parameter estimates, while taking account of the correlation in calculating the standard errors. The development of more efficient estimators with similar computational complexity is of interest for clustered censored data analysis, when measurements from the same site exhibit strong temporal correlations. In this paper, we aim to develop robust rank-based procedures for analyzing non-normal environmental data with multiple detection limits. In particular, we extend the idea of Wang and Zhu22 to clustered censored data by reweighting the Gehan-type weight estimating functions corresponding to the contributions from the between- and within-cluster ranks; we also reduce the computational burden by using the induced smoothing method.20,23-25 The proposed estimators offer a substantial improvement in the mean squared errors when strong temporal effects exist.25 We illustrate the proposed methodology by analyzing the water quality data from Susquehanna River Basin in U.S.A. and investigate whether the environmental variables have significant effects on the concentrations of two water quality indicators (Dissolved Ammonia and Orthophosphate) using generalized linear mixed models and rank

ARTICLE

regression models. Finally, we draw conclusions and present discussions on implications and future research issues. All analysis was done in R. The computing codes are available from authors.

’ RANK REGRESSION FOR DATA WITH MULTIPLE DETECTION LIMITS Rank Regression Models. Let {Yik, k = 1,..., ni, i = 1, ..., N}and {Cik, k = 1, ..., ni, i = 1, ..., N} be the measurements and the corresponding detection limits, and let {Xik, k = 1, ..., ni, i = 1, ..., N} be the corresponding p  1 covariate vectors, where ni is the number of measurements at the Ith site, and N is the number of sites. We assume that (Y i1 , ..., Y in i ) and (C i1 , ..., C in i ) are independent condition al on (Xi1, ..., Xini). Here Δik = 1 indicates that Yik is below the detection limit and Δik = 0 indicates that Yik is not censored. Let Y~ ik = min(Yik,Cik) and Zik = log(Y~ ik), then observations consist of the triplet (Zik, Δik, Xik). In the case of BDL (Δik = 1), we will observe Y~ ik = Cik instead of Yik itself. The established statistical models in survival analysis are often based on right censored data, and if this is desirable, one can consider (-Zik, Δik, -Xik) instead of (Zik, Δik, Xik). We consider the following rank regression model

Zik ¼ Xik Τ β þ εik

ð1Þ

where β is a vector of unknown regression parameters, and (εi1, ..., εini) are independent vectors for i = 1, ..., N. However, for each site, the error terms (εi1, ..., εini) may have temporal correlations. We assume that the median of εik - εjl is zero, for example, p(εik g εjl) = p(εik e εjl) for any i 6¼ j, and the vector (εi1, ..., εini) has a joint distribution, independent of the site index i.25 Between and Within Estimating Functions. Under the independence model assumption, the estimating function using a Gehan-type weighted function is ni X N X N X 1 X Δik ðXik - Xjl ÞIðeik gejl Þ M2 i ¼ 1 j ¼ 1 k ¼ 1 l ¼ 1 nj

SðβÞ ¼

ð2Þ

where M = ΣN i=1ni is the total number of measurements, eik = log(Y~ ik) - XikTβ are the residuals, and I is the indicator function taking values 1 or 0. The estimate of βis obtained by solving S(β) = 0. Note that this function is monotonic but discontinuous in β.26 To incorporate the temporal correlations, it is natural to consider the following two types of estimating functions using ranks between and within sites ni X N X N X 1 X Δik ðXik - Xjl ÞIðeik gejl Þ M 2 i ¼ 1 j6¼i k ¼ 1 l ¼ 1 nj

Sb ðβÞ ¼

Sw ðβÞ ¼

ni X ni N X 1X Δik ðXik - Xil ÞIðeik geil Þ M i¼1 k¼1 l¼1

According to Lee et al. 18, the consistent estimators of the covariance matrices for (M)1/2Sb(β) and (M)1/2Sw(β) are asymptotically equivalent to Vb ¼

ni X ni N X X i¼1 k¼1 l¼1

Vw ¼

ni X ni N X X

jik jTik and Δik ðXik - Xil ÞðXik - Xil ÞΤ Iðeik geil Þ

i¼1 k¼1 l¼1

1482

dx.doi.org/10.1021/es101304h |Environ. Sci. Technol. 2011, 45, 1481–1489

Environmental Science & Technology

ARTICLE

Table 1. Parameter Estimates (EST), Bias (100), and the Mean Squared Errors (MSE) (100) for the Three Different Estimators of β1 (True Value is 1) and β2 (True Value is 0.5) Based on 1000 Simulation Studies When the Error Distributions Follow a Multivariate Normal Distribution with Mean Zero and Exchangeable Correlation Structure with Parameter F = 0.6 β1 = 1

β2 = 0.5 rank

imputation censoring (%)

DL/2

0

20

40

DL

AFT model

AFT model

0.997

0.997

0.996

0.500

0.500

0.498

-0.325

-0.325

-0.418

0.001

0.001

-0.189

MSE

0.770

0.770

0.954

0.187

0.187

0.236

EST bias

1.009 0.892

0.790 -21.025

1.000 0.051

0.506 0.626

0.389 -11.067

0.501 0.125

MSE

1.284

5.420

1.027

0.384

1.512

0.264

EST

0.694

0.556

1.014

0.340

0.274

0.504

bias

-30.625

-44.395

1.433

-16.025

-22.581

0.383

10.174

21.121

1.356

2.783

5.450

0.369

jik ¼

nj X nm N X N X X j¼1 l¼1 m¼1 h¼1

fΔik ðXik - Xjl ÞIðeik gejl Þ - Δjl ðXik - Xmh ÞIðeik eejl ÞgIðejl gemh Þ nr N P P Iðers eejl Þ r¼1 s¼1

The smoothed versions of Sb(β) and Sw(β) are given as follows23,24 ~ Sb ðβÞ ¼ 0 1 nj ni X N X X e e 1 ik jl B C Δik ðXik - Xjl ÞΦ@qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiA M 2 i6¼j k ¼ 1 l ¼ 1 Τ ðXik - Xjl Þ Γb ðXik - Xjl Þ ~ Sw ðβÞ ¼ 0 l6¼k

DL

bias

respectively, where

1 M i¼1 k¼1

DL/2

EST

MSE

ni X ni N X X

rank

imputation

1

eik - eil B C Δik ðXik - Xil ÞΦ@qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiA Τ ðXik - Xil Þ Γw ðXik - Xil Þ

~b,β ~w) where (Γb,Γw) are variance matrices of the estimators (β obtained from {~Sb(β), ~Sw(β)}, respectively, and Φ( 3 ) is the cumulative distribution function of the standard normal distribu~ b = ∂E{~Sb(β)}/ tion. It is easy to obtain the Jacobian matrices D T T ~ ~ ∂β and Dw = ∂E{Sw(β)}/∂β after specifying Γb and Γw. The iterative expressions are -1 ~ b- 1 ðβ~b ÞV ðβ~b ÞD ~ b ðβ~b Þ and MΓb ¼ D -1 ~ w- 1 ðβ~w ÞV ðβ~w ÞD ~ w ðβ~w Þ MΓw ¼ D

ð3Þ

The steps to find the smoothed estimates and their variance matrices are as follows: Step 1. Let (XΤX)-1/M be the initial variance matrix of Γb and Γw. Step 2. In the kth iteration, using Γb(k-1) and Γw(k-1), obtain ~w(k)) by solving equa~b(k),β the updated estimates (β tions ~Sb(β) = 0 and ~Sw(β) = 0.

~w(k)) and ~b(k),β Step 3. Using the current estimates of (β (k-1) (k-1) ,Γw ), updates the values of (Γb(k),Γw(k)) (Γb in expression 3. Step 4. Repeat Steps 2 and 3 until a convergence criterion is met. The numerical differences between the smoothed and unsmoothed estimating functions are negligible.24 The estimators ~b and β ~w are asymptotically multivariate normal distributions, β and the final values of Γb and Γw are the corresponding covariance matrices P estimates. We can estimate the covariance matrix of (Sb,Sw) as ̂ using the perturbed resampling method22 and then obtain the final combined estimates by solving the next equation27 ! ~ X Sb ðβÞ ~ ^ Τ Τ ~b , D ~wÞ -1 ~ Sr ðβÞ ¼ ðD ¼ 0 ð4Þ Sw ðβÞ The induced smoothing method is substantially faster computationally and stable numerically.23-25 Substantial improvement is also expected from the combined estimates when the temporal correlation is strong. Simulation Studies. The performance of the different methods is now investigated by simulation studies using uncensored/ censored data. Data are generated using Zik = Xik1β1 þ Xik2β2 þ εik, N = 40, β1 = 1, and β2 = 0.5. The detection limits are generated from a uniform distribution U(0, τ), where τ is to determine the censoring rate. The first covariate Xik1 is binary (equiprobable 0 or 1), and the second covariate Xik2 follows the standard normal distribution N(0,1). We consider three different censoring rates 0%, 20%, and 40%. The error term has a random intercept representing the site effects, that is, εik = Ri þ ηik. The resulting within-site correlation is chosen using F = var.(Ri)/var.(εik). For the rank-based method, the covariance matrix Σ is approximated using 3000 samples from an exponential distribution with unit mean. Results of the simulation studies are shown in Table 1. The rank-based method and the glmmPQL method perform similarly in absence of censoring, which is consistent with the theoretical results 28. However, one DL method has very large bias when censoring rate is 20% and neither imputation methods perform well when censoring rate is 40%. As one may expect, the rankbased method perform well in all cases. 1483

dx.doi.org/10.1021/es101304h |Environ. Sci. Technol. 2011, 45, 1481–1489

Environmental Science & Technology

ARTICLE

Figure 1. Time series plots for concentrations of dissolved ammonia and orthophosphate from 2005 to 2009. The cycles in the plots indicate the measurements below detection limits.

’ STATISTICAL ANALYSIS OF WATER QUALITY DATA WITH MULTIPLE DETECTION LIMITS In this section, we will use the proposed method to analyze the data from Susquehanna River Basin from 2005 to 2009. Background of the Data set. In this study, samples were collected from six sites on the Susquehanna River, three sites on its major tributaries and 14 sites on the other selected tributaries in the basin during various flows. For each of the six sites, two samples were collected each month: one near the twelfth of the month and one during monthly base-flow conditions. Additionally, at least four high-flow events were sampled. When possible, a second high-flow event was sampled in accordance with spring planting in the basin. During high-flow sampling events, samples were collected daily during the rise and fall of the hydrograph.

For the other 17 sites, fixed-date monthly samples were collected during the middle of each month. Additionally, two storm samples were collected per quarter at each site 29. In the context of load estimation, there are a number of methods available, such as simple average, ratio methods, and rating curve approaches 30, and in particular, Cohen et al. 31 included seasonal effects and long-term trends. In our analysis here, we will investigate whether the concentrations of dissolved ammonia and orthophosphate are affected by the following environmental variables: instantaneous flow in cubic feet per second, season, temperature, and water pH in standard units. Preliminary Analysis. We first constructed time series plots for these two water quality indicators (see Figure 1). It can be seen that the maximum values of dissolved ammonia and orthophosphate 1484

dx.doi.org/10.1021/es101304h |Environ. Sci. Technol. 2011, 45, 1481–1489

Environmental Science & Technology

ARTICLE

Figure 2. Time series plots for log-values of dissolved ammonia and orthophosphate from 2005 to 2009 for each site.

both appeared in 2007. The concentrations of dissolved ammonia in 2007 were much higher than other years (top panel of Figure 1). The time series plots for each site (Figure 2) indicate that the concentrations of dissolved ammonia and orthophosphate at sites 1576754, 1576787, and 1578475 were much larger than other sites, which can also be seen from the boxplots for the log values of dissolved ammonia and orthophosphate (Figure 3).

The boxplots indicate that the maximum values for dissolved ammonia and orthophosphate appeared at sites 1576787 and 1570000, respectively. The boxplots also show that there are many unusual values for dissolved ammonia and orthophosphate. The Q-Q plots indicate strong deviations from normal distributions for both water quality concentrations (on log scale) (Figure 4). The preliminary analysis indicates that the data from 1485

dx.doi.org/10.1021/es101304h |Environ. Sci. Technol. 2011, 45, 1481–1489

Environmental Science & Technology

ARTICLE

Figure 3. Boxplots for log values from dissolved ammonia and orthophosphate by site.

dissolved ammonia and orthophosphate are non-normal with a number of unusual values (large in absolute values), apart from the values below detection limits. Models and Results. In this subsection, we will establish models to investigate whether the concentrations of dissolved ammonia and orthophosphate (on the natural log scale) are impacted by the following environmental variables: seasonality (S), temperature (T), instantaneous flow (F) in cubic feet per second, and water's pH (P) in standard units. We consider the following model logðYik Þ  β0 þ β1 sinð2πSk Þ þ β2 cosð2πSk Þ þ β3 sinð4πSk Þ þ β4 cosð4πSk Þ þ β5 Tik þ β6 Pik þ β7 lnðFik Þ þ εik

ð5Þ

where Y is the concentration of the water quality indicator. The subscripts i and k are for at site i and time k; (β 1 ,β 2 ,β 3 ,β 4 ) describe the annual and six-month variation in concentration of the water quality indicator; (β 5 ,β 6 ,β 7 ) indicate the relationship between the concentration and temperature, water pH, and streamflow, respectively; ε is the error term. To test whether an environmental variable has a significant effect on the concentration, we can do the following hypothesis test H0 : βl ¼ 0 vs:H1 : βl 6¼ 0 where l = 1, ..., 7. We will consider the Wald-type test statistic β̂l/SE(β̂l), which follows the standard normal distribution. 1486

dx.doi.org/10.1021/es101304h |Environ. Sci. Technol. 2011, 45, 1481–1489

Environmental Science & Technology

ARTICLE

Figure 4. Q-Q plots showing deviations from normality for log values of dissolved ammonia and orthophosphate (censored observations excluded).

We first investigate two traditional approaches, replacing the censored values with one-half the detection limits and then with the detection limits. In each case, we use the traditional generalized linear mixed models to fit the data and regard sites as random effects. Parameter estimates and their p-values in model 5 obtained using the statistical software R using package glmmPQL are listed at the left panels of Tables 2 and 3. It can be seen that the concentrations of dissolved ammonia and orthophosphate are both affected by temperature, pH, and streamflow (left panels of Table 2 and 3). When values below detection limits are replaced with the detection limits, the results indicate the concentrations of dissolved ammonia have a six-month cycle, and the estimate of

β3 using orthophosphate data set is -0.0141. However, when one-half the detection limits are substituted for each value less than the detection limits, the six-month cycle is no longer significant for dissolved ammonia. In general, there are noticeable differences in parameter estimates when different imputation values are used and tests results from the simple substitution methods may depend on the values substituted. It is thus desirable to develop an alternative approach. We now consider the robust rank regression method. The results are presented at the right panels of Tables 2 and 3. Note that the intercept for the rank regression can be estimated by the median of the residuals. For dissolved ammonia, compared to the estimates using the generalized linear 1487

dx.doi.org/10.1021/es101304h |Environ. Sci. Technol. 2011, 45, 1481–1489

Environmental Science & Technology

ARTICLE

Table 2. Parameter Estimates, Their Standard Errors (SE), and p-Values for Dissolved Ammonia Using the Generalized Linear Mixed Model and the Rank Regression Model, Respectively imputation methods

rank-based methods

DL/2 parameters

DL p-value

estimates (SE)

AFT Models p-value

estimates (SE)

estimates (SE)

p-value

β1

0.0101 (0.0248)

0.6827

0.0133 (0.0194)

0.4921

-0.0064 (0.0155)

0.6802

β2

-0.1032 (0.0259)

0.0001

-0.0813 (0.0202)

0.0001

-0.0373 (0.0143)

0.0090

β3 β4

-0.0453 (0.0247) -0.1089 (0.0257)

0.0673

-0.0431 (0.0194) -0.0869 (0.0201)

0.0261

-0.0613 (0.0148) -0.0641 (0.0150)