Environ. Sci. Technol. 2009, 43, 1128–1133
Time-Frequency Analysis of Beach Bacteria Variations and its Implication for Recreational Water Quality Modeling ZHONGFU GE* AND WALTER E. FRICK U.S. Environmental Protection Agency, Ecosystems Research Division, 960 College Station Road, Athens, Georgia 30605; Email:
[email protected].
Received April 23, 2008. Revised manuscript received December 9, 2008. Accepted December 12, 2008.
This paper exploited the potential of the wavelet analysis in resolving beach bacteria concentration and candidate explanatory variables across multiple time scales with temporal information preserved. The wavelet transform of E. coli concentration and its explanatory variables observed at Huntington Beach, Ohio in 2006 exhibited well-defined patterns of different time scales, phases, and durations, which cannot be clearly shown in conventional time-domain analyses. If linear regression modeling is to be used for the ease of implementation and interpretation, the wavelet-transformed regression model reveals that low model residual can be realized through matching major patterns and their phase angles between E. coli concentration and its explanatory variables. The property of pattern matching for linear regression models can be adopted as a criterion for choosing useful predictors, while phase matching further explains why intuitively good variables such as wave height and onshore wind speed were excluded from the optimal models by model selection processes in Frick et al. (Environ. Sci. Technol. 2008, 42, 4818-4824). The phase angles defined by the wavelet analysis in the time-frequency domain can help identify the physical processes and interactions occurring between bacteria concentration and its explanatory variables. It was deduced, for this particular case, that wind events resulted in elevated E. coli concentration, wave height, and turbidity at the beach with a periodicity of 7-8 days. Wind events also brought about increased beach bacteria concentrations through large-scale current circulations in the lake with a period of 21 days. The time length for linear regression models with statistical robustness can also be deduced from the periods of the major patterns in bacteria concentration and explanatory variables, which explains and supplements the modeling efforts performed in (1).
* Corresponding author. National Research Council; U.S. Environmental Protection Agency, Ecosystems Research Division, 960 College Station Road, Athens, GA 30605; Phone: +1 219-926-8336 Ext 430; Fax: +1 219-929-5792; Email: [email protected].
ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 43, NO. 4, 2009
Introduction It has been widely recognized that proper and timely prediction of water quality at recreational beaches is critical to protection of public health. Fast prediction, however, has been seriously restricted by the current standard of analyzing the concentrations of fecal indicator bacteria, e.g., Escherichia
coli (E. coli) and enterococci, which typically takes 18-24 h to complete. After this time, bacteria concentrations observed based on such sampling standards generally provide little information about the current condition of water quality at the beach. Boehm et al. (2) demonstrated multiple time scale structures in enterococci concentration at Huntington Beach, California. It was shown that the time scales of bacteria variations in the surf zone span at least 7 orders of magnitude, from minutes to decades. Therefore, even if morning samples (e.g., 9 a.m.) at a beach are taken as conservative representatives of the bacteria concentrations of the associated days, bacteria concentrations still vary on time scales larger than 1 day due to, for example, lunar cycles and seasonality. To better study such time structures at specific beaches, more advanced analytical techniques, beyond time-history graphs, are desired. In the present study, we implement the wavelet analysis, a mathematical tool that has been extensively applied in geophysics, engineering, and physics (3), to resolve multiple time structures in E. coli concentration and in its explanatory variables. With desirable capabilities of analyzing nonstationary time series, the wavelet analysis has the potential of becoming a standard approach for identifying useful explanatory variables and their important time scales. The delay of sampling results has also shifted our interest to statistical approaches. While nonlinear modeling approaches such as artificial neural networks have been attempted (4, 5), practical techniques for modeling recreational water quality are primarily multiple linear regression (MLR) analysis and its derived methods (6-10). Moreover, the U.S. Geological Survey (USGS) has been monitoring and modeling bacteria concentrations (E. coli) and potential explanatory variables at Huntington Beach, Ohio since 1997 (11, 12).
Besides a valuable multiyear database, they established a beach advisory Web site (13) to issue daily sampling results and the probability of the nowcast E. coli concentration to exceed EPA’s 235 col/100 mL standard during the 2006-2008 beach seasons. In an independent effort for the same beach, Frick et al. (1) attempted linear regression modeling on data sets of only a few weeks. They found that models based on 5-7 weeks of data yield satisfactory predictions (nowcasts) with adjusted R2 typically over 0.60, which are much higher than those obtained from multiyear models for the same beach (11, 12). This finding, as well as their results showing considerable variability in the predictive capacity of the models based on different lengths of data, is perhaps a manifestation of the fact that nonstationarity exists in their data sets. As wavelet analysis is well suited for analyzing nonstationary time series, a preliminary reassessment is conducted in the present work. The objectives of the present work are first to exploit the potential of wavelet analysis in resolving multiple time scale structures that are expected to be present in bacteria concentration and in its explanatory variables, and second to investigate the effectiveness of linear regression models in the presence of nonstationary time structures. Approaches to improve regression models on nonstationary data sets also will be discussed.
10.1021/es8024116 CCC: $40.75 © 2009 American Chemical Society. Published on Web 01/20/2009
Materials and Methods Study Site and Data Sets. The study site, Huntington Beach, Ohio, has been described in a number of USGS reports (11, 12). Further details can be found on the “Ohio-nowcast” Web site (13). Briefly, Huntington Beach is on the southern shoreline of Lake Erie. Daily data sampling has been conducted during recreational seasons (from late May through early September), including analysis of water samples
for E. coli concentration and observation of explanatory variables for model development and testing. For the 2006 data set used by Frick et al. (1) and here, E. coli concentrations were sampled daily around 9 a.m. in water 2-3 feet deep in the surf zone. At the same time and location, wave height was measured with a graduated stick. Water samples were analyzed for concentrations of E. coli (col/100 mL) and turbidity at local laboratories within 6 h of collection. The rainfall and wind direction data were obtained from Cleveland-Hopkins International Airport, about 8 miles southeast of the beach. In order to identify a complete set of variables, Frick et al. (1) collected a few more variables from public databases, such as solar radiation at both 8 and 9 a.m. Most meteorological parameters were taken from the National Weather Service (http://www.weather.gov). More details on the data collection in 2006 can be found in ref 1. After appropriate preprocessing (e.g., taking natural logarithm of E. coli concentration and decomposing wind velocity in onshore and alongshore directions), we identified 10 explanatory variables that could be useful for E. coli prediction, and they are further investigated in the present work: turbidity (NTU), water temperature (F), wave height (categorized as 1-4), antecedent 24 h rainfall (inches), onshore wind speed (mph), alongshore wind speed (mph), solar radiation at 8 and 9 a.m. (w/m2), cloud cover (%), and dew point (F). The complete data set for 2006 consists of 85 consecutive days from May to August. Introduction to the Wavelet Analysis. Conventional studies of nearshore bacteria variations are essentially in the time domain (2). It is often difficult for time-domain analyses to unambiguously recognize different time structures in the original time series. 
Using the wavelet analysis, a one-dimensional time series is decomposed into a two-dimensional time-frequency domain, which is similar to a Fourier analysis but still preserves temporal information. Different temporal patterns are thus isolated in different scales, and they no longer superimpose on each other as in the original time series. The Continuous Wavelet Transform (CWT) is usually defined as
$$X(a,\tau) = \int_{-\infty}^{\infty} x(t)\,\psi_{a,\tau}^{*}(t)\,dt \qquad (1)$$
where x(t) denotes the original time series, X(a,τ) and ψa,τ(t) denote the wavelet coefficient (the result of the wavelet transform) and the wavelet function, respectively, at scale a and time τ, and * denotes the complex conjugate. The Morlet wavelet is used in the present work, whose mother wavelet is given as
$$\psi(t) = \pi^{-1/4}\, e^{i\omega_0 t}\, e^{-t^{2}/2} \qquad (2)$$
where ω0 is set at 6 to approximately satisfy the admissibility condition (3). The family of the Morlet wavelets ψa,τ(t) can be generated by time translation and scale dilation, specifically
$$\psi_{a,\tau}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-\tau}{a}\right) \qquad (3)$$
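As a rough illustration of eqs 1-3, the transform can be approximated by a direct Riemann sum on the daily sampling grid. The sketch below is not the Torrence and Compo program adapted by the authors. The function names, the synthetic 8-day sinusoid, and the period-to-scale conversion (the standard Morlet relation, period = 4πa/(ω0 + √(2 + ω0²)), which is not stated in this text) are assumptions made for illustration only.

```python
import numpy as np

OMEGA0 = 6.0  # nondimensional frequency, set at 6 as in the text


def morlet(t, omega0=OMEGA0):
    """Mother Morlet wavelet of eq 2."""
    return np.pi ** -0.25 * np.exp(1j * omega0 * t) * np.exp(-t ** 2 / 2.0)


def period_to_scale(period, omega0=OMEGA0):
    """Convert a Fourier period to a Morlet scale (assumed relation, see lead-in)."""
    fourier_factor = 4.0 * np.pi / (omega0 + np.sqrt(2.0 + omega0 ** 2))
    return period / fourier_factor


def cwt_morlet(x, scales, dt=1.0, omega0=OMEGA0):
    """Discretized eq 1 using the wavelet family of eq 3 (Riemann sum over t)."""
    x = np.asarray(x, dtype=float)
    t = np.arange(x.size) * dt
    coeffs = np.empty((len(scales), x.size), dtype=complex)
    for i, a in enumerate(scales):
        # u[k, j] = (t_j - tau_k) / a, with tau_k taken on the sampling grid
        u = (t[None, :] - t[:, None]) / a
        psi = morlet(u, omega0) / np.sqrt(a)
        coeffs[i] = (x[None, :] * np.conj(psi)).sum(axis=1) * dt
    return coeffs


# Synthetic 85-day series with an 8-day oscillation (not the Huntington Beach data)
days = np.arange(85)
series = np.sin(2.0 * np.pi * days / 8.0)
periods = np.array([4.0, 8.0, 16.0])
X = cwt_morlet(series, period_to_scale(periods))

# Period with the largest coefficient modulus at mid-record (day index 42)
strongest = periods[np.argmax(np.abs(X[:, 42]))]
```

The direct sum costs O(N²) per scale, which is acceptable for an 85-point record. FFT-based implementations are the usual choice for longer series.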
While there are many wavelet families to choose, the Morlet wavelet has two advantages for this particular study. One advantage is that the Morlet wavelet is a sinusoid modulated by a Gaussian envelope, making it particularly suitable for matching local wavy fluctuations such as high bacteria concentrations and high winds. Moreover, the Morlet wavelet is complex and therefore can result in complex wavelet coefficients with both magnitude and phase information. For complex wavelet coefficients X1(a,τ) and X2(a,τ), which are the wavelet transforms of time series x1(t) and x2(t) respectively, their wavelet cross spectrum can be defined as
X1*(a,τ)X2(a,τ) (14). The phase angle of the wavelet cross spectrum φ(a,τ) is simply the phase difference between X1(a,τ) and X2(a,τ), i.e.,
$$\varphi(a,\tau) = \varphi_2(a,\tau) - \varphi_1(a,\tau) \qquad (4)$$
where φ1 and φ2 are the phase angles of X1(a,τ) and X2(a,τ), respectively. This can be readily verified by expressing the complex wavelet coefficients in terms of their moduli and phase angles (15). Wavelet Analysis and Regression Modeling. When a regression model is to be constructed to estimate the response y(t) based on a set of explanatory variables xi(t) (i = 1, ..., p), regression coefficients βi (i = 0, 1, ..., p) are determined by, for example, the least-squares method that minimizes the variance of the residual ε(t). The model can be expressed as
$$y(t) = \beta_0 + \beta_1 x_1(t) + \beta_2 x_2(t) + \cdots + \beta_p x_p(t) + \varepsilon(t) \qquad (5)$$
When both sides of eq 5 are wavelet-transformed, the linearity of the wavelet transform leads to
$$Y(a,\tau) = \beta_0 + \beta_1 X_1(a,\tau) + \beta_2 X_2(a,\tau) + \cdots + \beta_p X_p(a,\tau) + \mathrm{E}(a,\tau) \qquad (6)$$
where Y, Xi (i = 1, ..., p), and Ε denote the wavelet coefficients of y, xi (i = 1, ..., p), and ε, respectively. Equation 6 thus has interesting properties. For example, any major pattern (e.g., a peak value) at (a0,τ0) in the wavelet coefficient of an explanatory variable (e.g., X1), if not counterbalanced by a similar pattern in those of any other explanatory variables, will be mapped into the wavelet coefficient of the response around the same time and scale (a0,τ0). Alternatively, we deduce the property of pattern matching: every major pattern in the wavelet coefficient of the response should match similar patterns in the wavelet coefficient of one or more explanatory variables around the same time and scale. If a major pattern in the response at (a0,τ0) cannot be found in any explanatory variables, this pattern is unexplained by the current set of variables and hence must exist in Ε(a,τ) around (a0,τ0). (Obviously, if peak values occur at (a0,τ0) in the response but not in any of the variables, eq 6 leads to Y(a0,τ0) ≈ Ε(a0,τ0).) If this is the case, such a major pattern, often a peak or a strong periodical variation, is responsible for an elevated model residual. The only way to reduce the residual of the MLR model is to find a new explanatory variable that can account for the unexplained major pattern in the response. Therefore, the property of pattern matching deduced from eq 6 can help identify useful explanatory variables and examine whether the present set of variables is sufficiently representative.
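The linearity argument behind the wavelet-transformed regression model can be checked numerically. The following sketch fits eq 5 by least squares on synthetic data and then confirms that transforming the response gives the same result as combining the transforms of the design-matrix columns and of the residual. The data, the coefficient values, the scale 7.7, and the helper name `cwt_row` are all hypothetical. The constant column is transformed like any other column, which is a choice made here for an exact numerical identity.

```python
import numpy as np


def cwt_row(x, a, omega0=6.0):
    """Morlet wavelet coefficients of x at one scale a (daily sampling, direct sum)."""
    t = np.arange(len(x), dtype=float)
    u = (t[None, :] - t[:, None]) / a  # u[k, j] = (t_j - tau_k) / a
    psi = np.pi ** -0.25 * np.exp(1j * omega0 * u - u ** 2 / 2.0) / np.sqrt(a)
    return np.conj(psi) @ np.asarray(x, dtype=float)


# Synthetic stand-ins for two explanatory variables and a response
rng = np.random.default_rng(0)
n = 85
t = np.arange(n)
x1 = np.sin(2.0 * np.pi * t / 8.0) + 0.1 * rng.standard_normal(n)
x2 = np.cos(2.0 * np.pi * t / 21.0) + 0.1 * rng.standard_normal(n)
y = 1.5 + 0.8 * x1 - 0.4 * x2 + 0.2 * rng.standard_normal(n)

# Least-squares fit of eq 5; columns are [1, x1, x2]
A = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# Transform both sides at one scale and compare
a = 7.7
lhs = cwt_row(y, a)
rhs = sum(b * cwt_row(col, a) for b, col in zip(beta, A.T)) + cwt_row(resid, a)
max_gap = np.max(np.abs(lhs - rhs))  # expected to be at round-off level
```

Because the identity holds at every scale and time, any feature of the response transform that the explanatory terms do not reproduce has to appear in the residual transform, which is the pattern-matching property described above.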
Since the wavelet coefficients are complex, we will show the real parts of them in the following discussion, simply because the pattern-matching property also holds for the real (or imaginary) part of the wavelet-transformed linear regression model, namely
$$\mathrm{Re}[Y(a,\tau)] = \beta_0 + \beta_1 \mathrm{Re}[X_1(a,\tau)] + \beta_2 \mathrm{Re}[X_2(a,\tau)] + \cdots + \beta_p \mathrm{Re}[X_p(a,\tau)] + \mathrm{Re}[\mathrm{E}(a,\tau)] \qquad (7)$$
where Re means the real part of a complex number. The widely used wavelet scalogram, defined as the squared modulus of the wavelet coefficient (3), is not suitable for demonstrating pattern matching. This is because |Y|2 = YY* and the multiplication of Y with its conjugate will complicate the pattern comparison by introducing unwanted cross terms such as 2βiβjRe[XiXj*] (i, j = 1, ..., p and i ≠ j).
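To see where the cross terms come from, consider a reduced case with p = 2 in which the intercept and residual terms are dropped for brevity (a simplification assumed here, not made in the text). With real regression coefficients, expanding the squared modulus gives:

```latex
\begin{aligned}
|Y|^{2} &= (\beta_1 X_1 + \beta_2 X_2)(\beta_1 X_1 + \beta_2 X_2)^{*} \\
        &= \beta_1^{2}|X_1|^{2} + \beta_2^{2}|X_2|^{2}
           + \beta_1\beta_2\left(X_1 X_2^{*} + X_1^{*} X_2\right) \\
        &= \beta_1^{2}|X_1|^{2} + \beta_2^{2}|X_2|^{2}
           + 2\beta_1\beta_2\,\mathrm{Re}\!\left[X_1 X_2^{*}\right]
\end{aligned}
```

The last term depends on the relative phase of X1 and X2, so the scalogram of the response is not a simple weighted sum of the scalograms of the explanatory variables.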
FIGURE 1. Wavelet coefficient (real part) of E. coli concentration (14 contours): largest values are indicated by red and smallest by blue; horizontal dash-dot lines indicate important time scales; solid curves define the cone of influence.
Results Comparison of Major Patterns. Computer programs provided by Torrence and Compo (14) for the CWT were adapted by the authors and used here. Figure 1 shows the real part of the wavelet coefficient of E. coli concentration. The abscissa represents the time domain and the ordinate shows the scale domain, which is equivalently the period. A cone of influence (COI) due to edge effects at both ends of the data set is indicated by solid curves, such that below the curves results might be contaminated by the transforms that involve data points close to the edges (day 1 and day 85) (14). To separate significant peaks from randomness, we conducted significance tests following Ge (16). All values below the 30% significance level were set at zero, so that all contours in Figure 1 can be considered statistically significant (different from zero) with a confidence level of 70%. A quick inspection of the wavelet coefficient identifies two evident time patterns with relatively long durations. One is marked as pattern A, approximately a periodical undulation happening from day 20 to day 85 with a period of 7-8 days. The other one is marked as pattern B from day 30 to day 85 with a period of 21 days. A 17 day pattern is also evident, which gradually transitions into pattern B from day 30. There are other time patterns that appear to be more short-lived. We will therefore consider patterns A and B to be major ones. It also is clear from Figure 1 that the E. coli concentration series is a nonstationary process with multiple time scales. The nonstationarity is reflected by, for example, the restricted duration of pattern A and the gradual formation of pattern B at day 30. When linear regression models are to be used to predict E.
coli concentration as shown in Figure 1, we expect to see that the two major patterns, A and B, are matched by similar patterns in some of the explanatory variables. The wavelet-transformed turbidity is shown in Figure 2. A pattern with a period of 7-8 days (still referred to as pattern A) is recognizable, while this pattern appears to experience a frequency shift from a period of 7-8 days to 6-7 days during day 50 to day 60. Pattern B is almost missing in turbidity, which implies that turbidity cannot explain the 21 day periodicity in E. coli concentration. A similar inspection of the wavelet transforms of dew point and cloud cover shows that they both have pattern A but not pattern B. In Frick et al. (1) turbidity, dew point, and cloud cover were consistently the top three useful variables for 5- to 7-week linear regression models, determined by model selection processes based on Mallows’ Cp. Therefore, the matching of pattern A partially explains why these three variables were found to be most effective among all candidate variables.
FIGURE 2. Same as Figure 1 but for turbidity.
FIGURE 3. Same as Figure 1 but for wave height.
FIGURE 4. Phase difference between E. coli concentration and explanatory variables. Upper panel is for pattern A (period of 7-8 days); magenta: turbidity; red: wave height; blue: onshore wind speed; black: cloud cover; green: dew point. Lower panel is for pattern B (period of 21 days); red: wave height; blue: onshore wind speed. Ticks on the right side of the figure show the equivalent time delays deduced from the associated phase differences.
An unexpected finding in Frick et al. (1) is probably that wave height did not seem to be useful in MLR models for predicting bacteria concentration, a conclusion seemingly at odds with other observations at Great Lakes beaches (6, 7, 9). Figure 3 shows the wavelet coefficient of wave height with patterns A and B both present, although some fine details in the patterns are not identical to those in E. coli concentration. According to the property of pattern matching, we would expect wave height to be at least as effective as turbidity, cloud cover, and dew point in modeling the variations in E. coli concentration. The wavelet coefficient of onshore wind speed (given in the Supporting Information) also has a pattern with a period of 21 days first appearing at day 30 or even earlier and extending to day 85. This pattern seems to be a perfect match for pattern B in E. coli concentration. Onshore wind speed also has a pattern that is similar to pattern A in E. coli concentration, while it is somewhat interrupted during day 50 to day 60 and shifted away from day 65. Compared to Figure 1, one would think that onshore wind speed is not an obviously worse variable than turbidity. In fact, again, onshore wind speed was not found as explanatory as turbidity, cloud cover, and dew point, just like what was found about wave height (1). We will discuss this paradox below. Comparison of Wavelet Phase Angles. For oscillatory patterns such as patterns A and B in E. coli concentration, pattern matching also implies a matching in the phase angle. This can be understood by considering a simple example: a sinusoid 3sin(2πf0t) can be perfectly linearly modeled by another sinusoid sin(2πf0t) because they have the same pattern (a sinusoidal periodicity with a frequency f0) and the same phase angle (or zero phase difference). However, no good linear model can be constructed between 3sin(2πf0t) and sin(2πf0t + π/4). Even though they still have the same pattern, the phase shift by π/4 makes it impossible to find a linear relation as good as the one for the case of zero phase difference. If the phase difference is π/2, the two time series are completely uncorrelated with a zero mean correlation coefficient. This example illustrates the importance of the relative phase difference between matching patterns. For time series with multiple time structures, the wavelet analysis can be used to define local frequency (period) and local phase angle for every point in the time-frequency domain. The phase difference between the response (e.g., E. coli concentration) and its explanatory variables is given by eq 4. Ideally, only when similar major patterns in an explanatory variable and E. coli concentration have a zero phase difference can this variable be truly useful in accounting for the variations in E. coli concentration through a linear regression model. Figure 4 contains all explanatory variables that have matching patterns with patterns A and B in E. coli concentration. (Other variables, such as rainfall and solar radiation, do not have a well-defined pattern A or B.) Each curve indicates the phase difference by which the pattern in the explanatory variable leads that in E. coli concentration (eq 4). Therefore negative phase differences are for the cases of patterns in the explanatory variables lagging behind those in E. coli concentration.
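The sinusoid example can be reproduced numerically. The sketch below samples 3sin(2πf0t) and sin(2πf0t + φ) over a whole number of periods and computes their correlation coefficient for φ = 0, π/4, and π/2. The choice f0 = 1/8 per day and the sampling grid are assumptions for illustration. Under these conditions the correlation coefficient equals cos φ, so a single-predictor linear model would explain a fraction cos²φ of the variance.

```python
import numpy as np

f0 = 1.0 / 8.0  # assumed frequency: one cycle per 8 days
# 800 evenly spaced samples covering exactly 10 periods
t = np.linspace(0.0, 80.0, 800, endpoint=False)
reference = 3.0 * np.sin(2.0 * np.pi * f0 * t)


def corr_with_shift(phase):
    """Correlation between the reference sinusoid and a phase-shifted unit sinusoid."""
    shifted = np.sin(2.0 * np.pi * f0 * t + phase)
    return np.corrcoef(reference, shifted)[0, 1]


r_zero = corr_with_shift(0.0)               # in phase
r_quarter_pi = corr_with_shift(np.pi / 4)   # shifted by pi/4
r_half_pi = corr_with_shift(np.pi / 2)      # in quadrature
```

For the π/4 case the squared correlation is about 0.5, which quantifies why a linear relation "as good as the one for the case of zero phase difference" cannot be found.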
For pattern A, which has a period of approximately 8 days, a 180° phase difference is equivalent to a time delay of about four days (half a period). Phase differences of 90° and 45° are hence equivalent to two days and one day, respectively. Therefore, patterns with phase differences smaller than 45° can be considered synchronized, in the sense that they rise, fall, and attain their peaks and valleys on the same day. For pattern B, which has a much longer period (21 days), 90° and 45° phase differences are equivalent to time delays of 5.3 and 2.6 days. Note that phase differences close to 0° or 180° are both favorable for linear regression, because in-phase and out-of-phase patterns are linearly related (with positive and negative coefficients, respectively), whereas phase differences around 90° tend to minimize the utility of the explanatory variable for the associated pattern in linear regression models.
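The conversion between phase difference and time delay used above is a simple proportion of the period. A minimal sketch follows, with the phase difference taken from the cross spectrum as in eq 4. The two complex coefficients are made-up values, not results from the data set, and the function names are hypothetical.

```python
import numpy as np


def phase_difference_deg(X1, X2):
    """Phase of the cross spectrum conj(X1) * X2, i.e. phi2 - phi1 of eq 4, in degrees."""
    return np.degrees(np.angle(np.conj(X1) * X2))


def phase_to_delay_days(phase_deg, period_days):
    """Time delay equivalent to a phase difference for a pattern of the given period."""
    return phase_deg / 360.0 * period_days


# Hypothetical wavelet coefficients at one (scale, time) point
X_ecoli = 2.0 * np.exp(1j * 0.3)
X_variable = 0.5 * np.exp(1j * (0.3 - np.pi / 2.0))

lag_deg = phase_difference_deg(X_ecoli, X_variable)  # about -90: the variable lags
delay_A = phase_to_delay_days(lag_deg, 8.0)          # about -2 days for pattern A
delay_B = phase_to_delay_days(lag_deg, 21.0)         # about -5.25 days for pattern B
```

The moduli of the coefficients drop out of the phase calculation, so only the relative timing of the two patterns matters.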
It is now clear from Figure 4 why wave height turned out not to be a good explanatory variable by model selection processes (1). It has both patterns A and B, but its pattern B lags behind that of E. coli concentration by more than two days throughout the whole period. From day 30 to day 40, the phase difference is close to -90°, completely in quadrature with pattern B of E. coli concentration. As a result, pattern B in wave height would increase the residual of linear regression models through the nonzero phase difference with respect to its counterpart in E. coli concentration. Onshore wind speed was excluded by Frick et al. (1) for similar reasons. Pattern B of onshore wind speed lags behind that of E. coli concentration by 1-2 days in most of the study period. Its pattern A has an even larger phase difference (very close to 90° around day 50 and day 74) with that of E. coli concentration. Consequently, the seriously nonzero phase difference has rendered onshore wind speed an undesirable variable for predicting E. coli concentration through linear regression models. The other three variables, cloud cover, dew point, and turbidity, do not have pattern B but all have pattern A with near-zero phase differences. For instance, the phase difference between E. coli concentration and turbidity is consistently smaller than 45° from day 20 to day 85. The curves for dew point and cloud cover are very similar, and they both have negligible phase differences with E. coli concentration from day 20 to day 50 and from day 60 to day 70. There is a sudden change in the phase difference around day 60. After this phase shift, patterns A in these two variables become out of phase (180°) with that in E. coli concentration. Their phase differences remain far from 90° until about day 65 when they drop to below 135°.
In summary, the fact that wave height and onshore wind speed were not considered to be the most useful explanatory variables in the linear regression models developed by Frick et al. (1) is consistent with the observation that pattern B of wave height and pattern A of onshore wind speed have phase differences that are seriously nonzero with respect to their corresponding patterns in E. coli concentration. They were hence eliminated early to achieve a smaller residual in the model
selection process through a backward elimination (1), regardless of whether the other patterns in them are in phase with those in E. coli concentration or not. In comparison, turbidity, dew point, and cloud cover all have patterns that are similar to pattern A of E. coli concentration. Their phase differences are very small or negligible in most part of the 85 day period. Physical Explanation. Given a complete set of matching patterns and their phase differences in Figure 4, we can deduce physical processes that might have caused the variations in E. coli concentration. It should be noted that there are other chemical and microbiological processes that can also contribute to the complexity of the problem, such as nutrient input into the ambient waters and bacteria dieoff due to water temperature, salinity, and pH. Physical parameters by no means constitute a complete explanation for the variability of beach bacteria concentration. We start with pattern A in the central period from day 20 to about day 60, in which all phase differences are relatively stable. Since pattern A in onshore wind speed has the largest phase difference, we assume that a peak value in onshore wind speed occurs first over Lake Erie, and the growing wind field, which is often a large-scale event, is observed by an inland meteorological tower at the Cleveland-Hopkins International Airport. On the second day (when the phase difference is smaller than 45°), dew point temperature and cloud cover reach their peak values almost simultaneously, followed by elevated E. coli concentrations observed at the beach maybe a few hours later. Since wave height and turbidity have phase angles lagging behind that of E. coli concentration, wave height attains its peak value later than does E. coli concentration and turbidity reaches its maximum still later. But they all seem to happen on the same day. 
Ignoring the first and the last 10 days which are within the COI (Figures 1-3), the sequence of events is less regular in the periods such as from day 10 to day 20 and from day 60 to day 75. These two periods are more of transitional ones. For example, from day 60 to day 75, onshore wind speed appears to have a rapidly increasing phase difference. The changing wind speed, possibly signifying a severe weather event, interacts with cloud cover and dew point and gives rise to larger time delays between patterns A in these two variables and that in E. coli concentration than the time delays in the central 40 day period. The changing winds, however, have a much weaker influence on wave height and turbidity. The sequence of events in pattern A can be explained by theories of wind-generated waves. Wind growth, sometimes wind storms, occurs over the lake and induces growing surface waves by exerting shear stresses tangential to the water surface as well as a stochastic pressure fluctuation field normal to the water surface. The time scale for the growth of wind-generated waves is closely related to the duration of the wind events, typically ranging from a few hours to tens of hours (17). The elevated wave height at the beach has further consequences such as enhanced mixing of beach water and suspension of E. coli due to larger horizontal and vertical velocities of fluid particles (17). In the meantime, the beach water becomes more and more turbid. For pattern B from day 20 to day 60, an increase in E. coli concentration apparently occurs first, leading an increase in onshore wind speed and in wave height by about 2 days and 3-5 days, respectively. Since wind speed is an independent meteorological factor that cannot be affected by beach bacteria concentration, the above interpretation of the sequence of events is not practical. 
By noting that a 2 day lead for pattern B (with a period of 21 days) is equivalent to a 19 day lag, it is more likely that the event of wind growth occurs first, giving rise to an increased wave height at the beach about 1-2 days later and then leading to elevated E. coli concentrations in the beach water another 17 days later.
Here the time difference between increased onshore winds and wave height is again 1-2 days, consistent with that for pattern A. In addition to wind-generated waves, the increased onshore winds in pattern B might also induce current circulations in Lake Erie (18). Considering the time delay of 19 days, this wind-induced current circulation might have a large spatial scale. The currents that entrain bacterial contaminants from other sources (7) finally reach the study beach 19 days later and cause the change in E. coli concentration. Since data for current speed at the beach and for discharges from nearby estuaries are lacking for this study, no better or detailed explanation can be deduced about the interactions of patterns B in different variables. Regression Models Guided by the Wavelet Analysis. Another immediate application of the wavelet analysis to regression modeling is the estimation of the time length over which regression models can be established with statistical robustness. To our knowledge, this issue has never been successfully resolved. In fact, when constructing a MLR model on nonstationary variables, the robustness of the model is first assured by the stability of the first and second moments, the mean and the variance, of the variables. This is simply because the regression model is determined by these moments. For E. coli concentration and its explanatory variables, the largest major pattern has a period of 21 days (pattern B). Treating periodical patterns as sinusoidal time series, we estimate that the mean value of the time series becomes stable when averaged over a time duration longer than approximately 5/3 times its period. In other words, the time length for robust MLR models should not be less than 35 days (5/3 times 21 days), or 5 weeks. As the stability of the mean dominates that of higher moments for this particular problem, MLR models that are based on data sets shorter than 5 weeks are not statistically robust. 
This is also consistent with the observation shown in ref 1 that the adjusted R2 of the MLR models on the same data set as used here rise abruptly (to over 0.6) for five-, six-, and seven-week models. Therefore, the wavelet analysis implemented in the present work has helped to define, for linear regression models, the lower limit of the time length, an important issue that Frick et al. (1) failed to identify or explain. Setting the upper limit for the time length of MLR models, however, is not straightforward. More details can be found in the Supporting Information.
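The dependence of the sample mean on record length can be illustrated for an idealized sinusoidal pattern. The sketch below computes, for a unit-amplitude sinusoid with a 21-day period, the largest absolute window mean over all starting phases as the window grows from 1 to 7 weeks. It is only an illustration of the idea and does not reproduce the authors' derivation of the 5/3 factor. The function name and the numerical grids are assumptions.

```python
import numpy as np


def worst_case_window_mean(window_days, period_days, n_phase=720, n_t=2000):
    """Largest |mean| of a unit sinusoid over a window, scanned over starting phase."""
    # Midpoint grid over the window [0, window_days)
    t = (np.arange(n_t) + 0.5) * window_days / n_t
    phases = np.linspace(0.0, 2.0 * np.pi, n_phase, endpoint=False)
    waves = np.sin(2.0 * np.pi * t[None, :] / period_days + phases[:, None])
    return np.max(np.abs(waves.mean(axis=1)))


period_B = 21.0  # period of pattern B in days
weeks = np.arange(1, 8)
worst = np.array([worst_case_window_mean(7.0 * w, period_B) for w in weeks])
one_week, five_weeks = worst[0], worst[4]
```

In this idealized setting the worst-case mean is roughly 0.83 of the amplitude for a 1-week window and roughly 0.17 for a 5-week (35-day) window, consistent with the statement that the mean becomes comparatively stable once the record exceeds about 5/3 of the longest period.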
Discussion The above results and analyses have shown that the application of wavelet analysis has not only revealed multiple time scale structures in the time series of beach bacteria concentration but also provided a means to identify the interactions between major time structures (patterns) in E. coli concentration and its explanatory variables. The wavelet-transformed linear regression model, as given by eqs 6 and 7, indicates that the property of pattern matching is a practical strategy for selecting explanatory variables. Moreover, it is equally important to ensure near-zero phase differences between corresponding major patterns in E. coli concentration and in explanatory variables if the patterns are oscillatory ones. The results based on wavelet phase angles have successfully explained why wave height and onshore wind speed, intuitively strong factors in close relations with beach bacteria concentrations, failed to survive in the four-variable optimal models selected by Cp-based backward elimination processes (1). We however should mention here that the pattern and phase matching properties were deduced within a framework of linear modeling (eqs 6 and 7). When nonlinear relations exist between bacteria concentration and the explanatory variables, a major pattern at (a0,τ0) in the wavelet transform of a variable may be associated with a major pattern at a very different time and scale, (a1,τ1), in E. coli concentration. Pattern and phase comparisons can be very complicated. In the present study, some explanatory variables, such as rainfall, do not appear to have a well-defined pattern A or B. This only means that those variables do not have persistent linear relations with E. coli concentration for this particular case. They might still have strong influences on the beach bacteria concentration through nonlinear mechanisms. In this case, nonlinear modeling approaches such as neural networks (4, 5) can be employed. The elimination of wave height and onshore wind speed as seen in ref 1 resulted in loss of information conveyed by their major patterns. In order to make better use of major patterns in explanatory variables, one possible approach is performing time shifts in order to synchronize patterns (i.e., to make the associated phase difference zero). It is, however, practically difficult to find a single time delay for each explanatory variable, because different major patterns in a variable tend to have different phase differences with respect to their corresponding ones in E. coli concentration. As a result, shifting a variable in favor of one major pattern might well increase the phase difference of another. It is also possible in some cases that the time difference between patterns is not an integer number of days. This makes simple time shifting impractical. To solve a similar problem in forecasting short-term spring snowmelt river flood, Adamowski (19) developed a wavelet-based modeling technique that performs narrowband wavelet decomposition of variables to isolate important patterns and reconstructs them based on their amplitudes and phase angles. Major patterns in E. coli concentration and in explanatory variables can be extracted and shifted in a similar fashion, which can possibly create more useful explanatory variables and improve the predictive capacity of regression models. These approaches will be attempted in future studies.
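The simple time-shifting idea mentioned in the Discussion can be sketched as a scan over integer-day lags. The example uses a synthetic single-pattern series in which the response is the driver delayed by exactly two days, which is the favorable case. It does not address the difficulties noted above (several patterns with different phase differences, or non-integer delays). The function name and the data are hypothetical.

```python
import numpy as np


def best_integer_lag(x, y, max_lag):
    """Return the lag k in 0..max_lag (days) maximizing corr(x[t - k], y[t]) and that correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = []
    for k in range(max_lag + 1):
        x_shifted = x[: x.size - k]  # x[t - k] for t = k .. n-1
        y_aligned = y[k:]            # y[t]     for t = k .. n-1
        scores.append(np.corrcoef(x_shifted, y_aligned)[0, 1])
    k_best = int(np.argmax(scores))
    return k_best, scores[k_best]


# Synthetic 85-day series: the response repeats the driver two days later
days = np.arange(85)
driver = np.sin(2.0 * np.pi * days / 8.0)
response = np.sin(2.0 * np.pi * (days - 2) / 8.0)

lag, r = best_integer_lag(driver, response, max_lag=3)
```

With real variables containing both patterns A and B, a single lag chosen this way would generally synchronize one pattern at the expense of the other, which is the motivation for the narrowband decomposition approach cited in the text.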
Acknowledgments
We thank Richard Zepp and Mike Cyterski of EPA for the helpful discussions. This work also benefited from NRC’s Research Associateship Programs. Although this work was reviewed by EPA and approved for publication, it may not necessarily reflect official Agency policy. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.
Supporting Information Available
A figure showing the wavelet transform of onshore wind speed and further discussion about the time length for linear regression models. This material is available free of charge via the Internet at http://pubs.acs.org.

Literature Cited
(1) Frick, W. E.; Ge, Z.; Zepp, R. G. Nowcasting and forecasting concentrations of biological contaminants at beaches: A feasibility and case study. Environ. Sci. Technol. 2008, 42, 4818–4824.
(2) Boehm, A. B.; Grant, S. B.; Kim, J. H.; Mowbray, S. L.; McGee, C. D.; Clark, C. D.; Foley, D. M.; Wellman, D. E. Decadal and shorter period variability of surf zone water quality at Huntington Beach, California. Environ. Sci. Technol. 2002, 36, 3885–3892.
(3) Addison, P. S. The Illustrated Wavelet Transform Handbook: Introductory Theory and Applications in Science, Engineering, Medicine and Finance; IOP Publishing Ltd.: Bristol, UK, 2002.
(4) Lin, B.; Syed, M.; Falconer, R. A. Predicting faecal indicator levels in estuarine receiving waters: An integrated hydrodynamic and ANN modelling approach. Environ. Modell. Software 2008, 23, 729–740.
(5) Zhang, Q.; Stanley, S. J. Forecasting raw-water quality parameters for the North Saskatchewan River by neural network modeling. Water Res. 1997, 31, 2340–2350.
(6) Nevers, M. B.; Whitman, R. L. Nowcast modeling of Escherichia coli concentrations at multiple urban beaches of southern Lake Michigan. Water Res. 2005, 39, 5250–5260.
(7) Nevers, M. B.; Whitman, R. L.; Frick, W. E.; Ge, Z. Interaction and influence of two creeks on Escherichia coli concentrations of nearby beaches: Exploration of predictability and mechanisms. J. Environ. Qual. 2007, 36, 1338–1345.
(8) Hou, D.; Rabinovici, S. J. M.; Boehm, A. B. Enterococci predictions from partial least squares regression models in conjunction with a single-sample standard improve the efficacy of beach management advisories. Environ. Sci. Technol. 2006, 40, 1737–1743.
(9) Olyphant, G. A. Statistical basis for predicting the need for bacterially induced beach closures: Emergence of a paradigm. Water Res. 2005, 39, 4953–4960.
(10) Ge, Z.; Frick, W. E. Some statistical issues related to multiple linear regression modeling of beach bacteria concentrations. Environ. Res. 2007, 103, 358–364.
(11) Francy, D. S.; Gifford, A. M.; Darner, R. A. Escherichia coli at Ohio Bathing Beaches: Distribution, Sources, Waste-Water Indicators, and Predictive Modeling; Water Resources Investigations Report 02-4285; U.S. Geological Survey: Columbus, OH, 2003.
(12) Francy, D. S.; Darner, R. A. Procedures for Developing Models to Predict Exceedances of Recreational Water-Quality Standards at Coastal Beaches; Techniques and Methods 6-B5; U.S. Geological Survey: Columbus, OH, 2006.
(13) Ohio Nowcasting Beach Advisories. http://www.ohionowcast.info.
(14) Torrence, C.; Compo, G. P. A practical guide to wavelet analysis. Bull. Am. Meteorol. Soc. 1998, 79, 61–78.
(15) Ge, Z. Significance tests for the wavelet cross spectrum and wavelet linear coherence. Ann. Geophys. 2008, 26, 3819–3829.
(16) Ge, Z. Significance tests for the wavelet power and the wavelet power spectrum. Ann. Geophys. 2007, 25, 2259–2269.
(17) Kinsman, B. Wind Waves: Their Generation and Propagation on the Ocean Surface; Dover Publications, Inc.: Mineola, NY, 2002.
(18) Beletsky, D.; Saylor, J. H.; Schwab, D. J. Mean circulation in the Great Lakes. J. Great Lakes Res. 1999, 25, 78–93.
(19) Adamowski, J. F. Development of a short-term river flood forecasting method for snowmelt driven floods based on wavelet and cross-wavelet analysis. J. Hydrol. 2008, 353, 247–266.
ES8024116
VOL. 43, NO. 4, 2009 / ENVIRONMENTAL SCIENCE & TECHNOLOGY