Harmonic Analysis of Environmental Time Series with Missing Data or

Sep 13, 2007 - The air quality data is a 13 year series of 24 hour average particulate ... These data have irregular sampling periods and missing data...
0 downloads 0 Views 252KB Size
Environ. Sci. Technol. 2007, 41, 7030-7038

Harmonic Analysis of Environmental Time Series with Missing Data or Irregular Sample Spacing SHABNAM DILMAGHANI,† ISAAC C. HENRY, PURIPUS SOONTHORNNONDA,‡ ERIK R. CHRISTENSEN,‡ AND R O N A L D C . H E N R Y * ,† Department of Civil & Environmental Engineering, University of Southern California, 3620 South Vermont Avenue, Los Angeles, California 90089-2541 and Department of Civil Engineering and Mechanics, University of Wisconsins Milwaukee, Milwaukee, Wisconsin 53201-0784

The Lomb periodogram and discrete Fourier transform are described and applied to harmonic analysis of two typical data sets, one air quality time series and one water quality time series. The air quality data is a 13 year series of 24 hour average particulate elemental carbon data from the IMPROVE station in Washington, D.C. The water quality data are from the stormwater monitoring network in Milwaukee, WI and cover almost 2 years of precipitation events. These data have irregular sampling periods and missing data that preclude the straightforward application of the fast Fourier transform (FFT). In both cases, an anthropogenic periodicity is identified; a 7-day weekday/ weekend effect in the Washington elemental carbon series and a 1 month cycle in several constituents of stormwater. Practical aspects of application of the Lomb periodogram are discussed, particularly quantifying the effects of random noise. The proper application of the FFT to data that are irregularly spaced with missing values is demonstrated on the air quality data. Recommendations are given when to use the Lomb periodogram and when to use the FFT.

Introduction Time series of properties of environmental systems typically have a periodic component. These periodic variations are often related to basic daily or seasonal solar cycles, although weekly or monthly periodicities related to human activity cycles may also be present. Random and episodic variations are also important sources of variability in time series, but they are not discussed here. Harmonic analysis based on the Fourier transform is regarded by many as the sovereign method for investigating periodic behavior of time series. However, most reports in the environmental literature rely on plots of hourly, daily, or monthly averages as the only tools to investigate periodicities in the data. A good example is the ozone weekday/weekend effect, the paradoxical increase in tropospheric ozone concentrations in California on Saturday and Sunday, even though emissions of precursor species are lower on weekends than on weekdays. Graphs and tables of daily average concentrations were almost the only methods used to study this puzzling periodicity (1). * Corresponding author e-mail: [email protected]. † University of Southern California. ‡ University of WisconsinsMilwaukee. 7030

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 41, NO. 20, 2007

Harmonic analysis would seem to be a more efficient and quantitative method to study this and other periodic features of environmental time series. Previous applications of harmonic analysis have mostly been of time series of tropospheric ozone concentrations (2-4). Sirois examined oxides of sulfur and nitrogen in eastern North America (5). The work of Hies et al. (6) on harmonic analysis of airborne elemental carbon particulate is discussed below. None of these works use the methods applied in this paper. Bragatto et al. (7) do use some of the same methods to examine 1 year’s worth of 1 h benzene concentrations determined by differential absorption spectrometry. The emphasis in their paper is on results rather than the method itself. Most of the previous studies have applied a logarithmic transform to the data, used windowing functions, and filtered and smoothed the data or periodogram. The analysis presented here investigated these and other advanced techniques but found that these did not contribute to the understanding of the periodic behavior of the time series and, therefore, such methods were not applied. It is in the nature of time series of environmental data to have significant gaps of missing data or to have irregular sampling intervals or both. This is perhaps the main reason that harmonic analysis is seldom applied to environmental data. The fast Fourier transform (FFT) is a very efficient and easily available means of calculating the discrete Fourier transform, but only if the time series has data sampled at equal intervals with no missing values. This paper presents a technique borrowed from astrophysics that allows one to estimate the power spectrum and Fourier transform of data with a great deal of missing data and highly irregular sampling without interpolating the data to a regular spacing or replacing the missing values. Two examples are given, one of a long time series of airborne particulate matter composition with semiregular spacing and many missing values, and one of stormwater composition measured at inherently irregular intervals. The discrete Fourier transform (DFT) is the natural choice for quantitative analysis of the periodicities in time series, and the FFT is the most computationally efficient means to calculate the DFT. However, the FFT requires equally spaced data with no missing observations. A few, isolated missing points can be filled in by interpolation; however, environmental data often have long runs of missing data that cannot be replaced by interpolated values in any reasonable manner. Also, the sampling schedule may be irregular or semiregular, as in the case where samples are taken on fixed days of the week or at the onset of precipitation. The first example in this paper uses data from the IMPROVE particulate monitoring network taken on a Wednesday-Saturday schedule. In this case, even if there were no missing data, it is impossible to apply most Fourier methods, as these require that the sample spacing be uniform. As shown in the air quality example below, the FFT can be applied to semiregular data of this type, if one is careful to identify false periodicities caused by the analysis. Completely irregular sampling is the norm for networks monitoring strormwater, since samples are taken only during precipitation events, as in the water quality example below. Before going to the examples, the methods to be applied are described.

The Lomb Periodogram and DFT The discrete Fourier transform of a series of concentrations ck observed at equally spaced times tk, k ) 0,..., n - 1 is given by 10.1021/es0700247 CCC: $37.00

 2007 American Chemical Society Published on Web 09/13/2007

F(fj) )

1

n-1

∑c n

exp(-2πifjtk)

k

(1)

k)0

where i2 ) -1, j, k ) 0,..., n-1, and fj ) j/(tn-1 -t0). If the ck are real numbers, then the second half of the Fourier coefficients are the complex conjugate of the first half and the DFT need only be calculated at frequencies up to ffloor(n/2). The function floor(x) is the smallest integer not exceeding x. At these so-called natural frequencies, an estimate of the power spectrum is given by the sum of the squares of the Fourier coefficients and can be written as

P(f) ) |F(f)|2 )

1 n

n

[(

∑ c cos(2πft )) j

2

+

j

j)1

n

(

∑ c sin(2πft )) ] 2

j

j

(2)

j)1

The term P(0) is defined as the square of the mean of the cj. If the mean is subtracted from the data, then P(0) is 0. In statistics, an estimate of the power spectrum at a finite set of frequencies as in eq 2 is known as a periodogram. In 1897, Sir Arthur Schuster first observed that the periodogram as defined above allows one “to separate any true periodicity of a variable from among its irregular changes, provided we can extend the time limit sufficiently” (8). The Schuster periodogram defined in eq 2 has been criticized on the basis that it is not a statistically consistent estimator of the “true” power spectrum in that as n increases the variance of the Schuster periodogram does not decrease but stays constant. However, whereas the variance of the Schuster periodogram does not decrease with n, the signal-to-noise ratio does decrease (9). The form of the periodogram given in eq 2 has a number of technical problems when applied to unequally spaced data or equally spaced data with missing values (9, 12). Chief among these is the fact that the sines and cosines are no longer orthogonal over the set of unequally spaced observations. This invalidates the derivation of eq 2, which is based on the orthogonality of sines and cosines over the set of natural frequencies (10). The solution is to define a time shift that makes the sines and cosines orthogonal over the unequally spaced set of observations for an arbitrary frequency. Applying this shift results in the Lomb periodogram (11):

(

P(fk) ) 1 2σ

2

n



(

n

cjcos(2πfk(tj - τ)))2

j)1

( +

n



∑ c sin(2πf (t - τ))) j

k

j

j)1 n

cos2(2πfk(tj - τ))

j)1

∑ sin (2πf (t - τ)) 2

( ) j)1

n

τ(fk) )

1 2πfk

k

j

}

2

∑ sin(2πf t ) k j

tan

-1

j)1 n

(3)

∑ cos(2πf t ) k j

j)1

where σ2 is the variance of the data. Normalization by σ2 is natural here even if the data have not been mean subtracted since the first term of the Fourier transform is the square of the mean value and is not part of the periodogram. Note that the time shift τ depends only on the frequency and sampling

times, and not on the data. The usual practice is to calculate the periodogram at set of natural frequencies assuming uniform sample spacing equal to the average sample spacing, but any set of frequencies can be used. A popular software implementation of the Lomb periodogram allows the user to specify an over sampling factor creating a periodogram with 5 or 10 times the number of natural frequencies (14). This gives a smoother periodogram with less chance of missing an important peak. A freely available implementation of this algorithm in C can be found at http:// www.physionet.org/physiotools/wag/lomb-1.htm. This form of the periodogram is equivalent to a least-squares fit of the data with sine and cosine functions (9-11) and Mathias et al. (13) give a computationally efficient algorithm for calculating it. Because the DFT is a finite estimate of the continuous Fourier transform, it is subject to several difficulties. One is the well-known phenomenon of aliasing, in which the uniform spacing of the data samples interacts with a highfrequency periodicity in the data to produce a false peak at a lower frequency. Generally, irregular sampling intervals greatly reduce the size of false peaks due to aliasing. The possibility of aliasing is assessed quantitatively by the window function defined as

W(f) )

1 n

2

n-1

({

∑ k)0

n-1

cos(2πftk)}2 + {

∑ sin(2πft )} ) 2

k

(4)

k)0

where the value of W (f) ranges between zero and one. The window function depends solely on the timing and spacing of the data and has large peaks at frequencies where aliasing may be a problem. If the data has a periodicity of frequency f, and if f0 is the frequency of a peak in the window function, then the periodogram will have a false peak at f0 - f. The use of the window function is demonstrated below in the air quality example. The DFT and periodogram have other problems caused by finite sampling. To overcome these and to improve the sensitivity of the analysis for detection of weak periodicities, most previous reports of Fourier analysis of environmental time series apply one or more nonlinear transformations to the data, such as taking logs or multiplying the data by a Hanning or other windowing function (not to be confused with the window function defined above). These and other nonlinear transformations have not been applied in this paper because nonlinear transformations destroy the simple connection between the DFT and the original data and cannot be undone. This outweighs any good that the transformations might accomplish. For example, if the data are log transformed and then multiplied by a windowing function, there is no longer any simple relationship between the periodogram and the fraction of the variance explained by the periodicities. Also, nonlinear transformations make it impossible (or at least very difficult) to get phase information about the original time series from the coefficients of the DFT (the importance of this is discussed below). Note that this criticism does not apply to linear transformations of the data. Since the DFT is a linear transformation, linear transformations can be undone and do not destroy the simple relationship between the data and the periodogram or the coefficients of the DFT. In this paper, the only transformation applied is a simple linear detrending of the data. Whereas the periodogram is very useful in identifying periodicities in the data, the phase information lacking in the periodogram is usually very important in environmental applications; a 365 day cycle that peaks in the summer is quite different from one that peaks in the winter. For phase information, an estimate of the DFT is required. The Lomb estimate of the DFT is given by (15) VOL. 41, NO. 20, 2007 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

7031

n

∑ A(f )c cos(2πf (t - τ)) +

F(fk) ) F0(fk)

k

j

k

j

j)1

iB(fk)cj sin(2πfk(tj - τ)) F0(fk) )

n exp(-i2πfkt) x2

shape as shown by the existence of large harmonics at p/2, p/3, p/4, etc. Practical use of these formulas will be discussed below. Finally, Scargle shows that the Lomb periodogram of uncorrelated Gaussian noise has an exponential distribution with mean 1 and gives an expression for statistical confidence levels on the peaks in the periodogram (9). For example, the 99% confidence level is given by

n

∑ cos (2πf (t - τ))] 2

A(fk) ) [

k

-1/2

z99 ) -ln[1 - (1 - 0.01)1/N]

j

(10)

j)1 n

B(fk) ) [

∑ sin (2πf(t - τ))] 2

-1/2

(5)

j

j)1

If the transform is calculated over the set of natural frequencies, then the inverse transform is calculated by the usual formula:

Yj )

N

1

∑F 2N

k

exp(i2πfktj) + Fk exp(-i2πfktj)

(6)

k)1

where N is the number of natural frequencies. The DFT for equally spaced data evaluated at the natural frequencies will reproduce the original data exactly. In general, the DFT for unequally spaced data will not reproduce the original data exactly. Nevertheless, one can estimate the times of maxima and minima for a given periodicity from the DFT phase as discussed next. If the periodogram has a significant peak at period p and the corresponding complex DFT coefficient is Fp ) |Fp|exp(iφ), where φ ) Im(Fp)/Re(Fp) is the phase angle and |Fp| is the magnitude, then the sinusoidal waveform associated with this periodicity is given by

y(t) )

(|

|

1 t F exp(iφ)exp i2π + 2N p p

)

(

)

||

(

Fp exp(-iφ)exp -2πi

t p

))

|Fp| t t exp i2π + φ + exp - i2π + φ 2N p p

( (

)

)

((

|Fp| t cos 2π + φ N p

(

)))

)

(7)

Assuming that the time series starts at time t ) 0 and 0 e φ e 2π, tmax the time the maximum occurs is



tmax + φ ) 2π p

(

tmax ) p 1 -

φ 2π

)

(8)

The time of the minimum will simply be tmax+p/2, modulo p, if necessary. If the time series starts at time T0 then the time of the maximum and minimum are given by

Tmax ) mod(T0 + tmax,p) p Tmin ) mod Tmax + ,p 2

(

)

(9)

where mod(x,p) represents x modulo p. The above estimates of the times of maxima and minima assume that the waveform is primarily a simple sinusoidal shape. These formulas do not apply to waveforms that have a more complex 7032

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 41, NO. 20, 2007

where N is the number of independent frequencies. Selection of the N is discussed in the next paragraph. Assuming the data are corrupted by uncorrelated, Gaussian noise, there is a 99% probability that a peak in the periodogram greater that z99 is not noise. If the noise is not Gaussian, then z99 is not strictly a confidence level. Even in this case, eq 10 is often a good indicator of the noise level in the Lomb periodogram. The exponent N in eq 10 is the number of independent frequencies. If the data are equally spaced, this is the number of natural frequencies N ) floor(n/2), where n is the number of data points. The function floor(x) is the smallest integer not exceeding x. If the data are regularly spaced except for missing data, N can be approximated by the number of natural frequencies calculated as if there were no missing data. For irregularly spaced data, there is no single, simple expression for the number of independent frequencies. If the data are sampled at a regular set of intervals, such as the data in the air quality example below, then the number of independent frequencies can be estimated as the number of natural frequencies assuming that the data are equally spaced with a spacing given by the average spacing. In the water quality example below there are very few, highly scattered observations; the number of independent frequencies, N, given by the average spacing method is 21. The 90, 95, and 99% confidence levels given by eq 10 are 5.29, 6.02, and 7.64. Several sets of 100 000 Lomb periodograms were calculated using simulated random data from a standard normal distribution with the same spacing as the water quality example. The empirically determined 90, 95, and 99 confidence levels were 5.31, 5.67, and 6.45. The agreement with eq 10 for N ) 21 is good for the 90% level, fair for 95, and poor for the 99% level. This shows that, at least for data sampled as in the water quality example, eq 10 is not a valid functional form and confidence levels based on this formula cannot be trusted. Horne and Baliuna give an empirical formula for N if the data are evenly spaced and some examples of N when the data are not evenly spaced (16). Their results do not agree with our simulations, and their methods are not recommended. To summarize, if the data are almost equally spaced, then eq 10 is valid with N equal to the number of natural frequencies in the data based on the average spacing of the data. For data sampled at highly irregular periods, eq 10 is not valid and should not be used. Instead, an ad hoc study with simulated random data can be used to estimate the confidence levels.

Water Quality Example Harmonic analysis of time series of stormwater composition data is a very challenging problem. Samples are only taken during precipitation events and may not be taken at all in winter or other times of the year. Nonetheless, the Lomb periodogram and Fourier transform may still be able to identify and characterize the periodicities in the data. As an example, the methods given above have been applied to data for several years from the greater Milwaukee, Wisconsin area. The subarea considered is from the City of St. Francis, near Lake Michigan about three miles south of Milwaukee Harbor. This is mainly a residential area with both local and

FIGURE 1. Time series of first flush Cu concentrations in stormwater runoff at Station SWSF 14 in Milwaukee, WI from November 2000 to October 2002. countywide roads. The Milwaukee Metropolitan Sewerage District (MMSD) stormwater sampling activity is a voluntary program started in 2000 to estimate the quantity and quality of stormwater in the greater Milwaukee area. Between April 2000 and May 2001, 10 storm events were sampled, with lower frequency before and after that period. Figure 1 shows the times series of the concentration of copper in the stormwater. Before calculating the Lomb periodogram, the set of fequencies to be examined must be determined, unlike the equally spaced case where the natural frequencies are fixed by the data. From Figure 1, it is clear that periods of less than 7-10 days cannot be supported by the data. However, to be on the safe side, use the set of natural frequencies as if the data were sampled every other day, i.e., a spacing of 2 days. The data cover a period of 697 days so the number of natural frequencies is floor(697/2/2) ) 174. The Lomb periodogram is not limited to these frequencies, as the FFT would be. A major advantage of the Lomb periodogram is that the frequencies can be over sampled to improve the resolution of the periodogram. If one over samples by a factor of 10, then there are 1740 frequencies covering periods of 4-697 days. Figure 2 shows the Lomb periodogram for this set of periods. Although there are only 22 observations in just under 2 years, the Lomb periodogram for Cu shows a large peak at about 31 days (31.66 to be exact). A large peak at this period is also seen in the periodograms of zinc, calcium, and total suspended solids (TSS). The periodogram clearly shows a periodicity of about 31 days, or 1 month. However, to better understand the possible physical significance of this, the actual time of the maxima and minima are required. To get this information, the Lomb approximation of the DFT was also calculated. The phase angle was determined for the average of the DFT peak at 31.66 days and its two neighbors. This averaging over three periods gives a better estimate of the phase, which can be somewhat erratic. The phase angle so determined is 2.3405. The data start on the sixth day of the month so T0 ) 6 and

by eq 9 the maximum occurs at day 26 of the month and the minimum at day 10. This result can be tested by plotting the Cu concentrations on a time axis that has been folded every 31 days, as can be seen in Figure 3. Also plotted in Figure 3 are the sinusoidal waveform determined from the Lomb estimate of the DFT for a period of 31.66 days and a line determined by nonparametric regression of the Cu on the time modulo 31. The figure supports a minimum around day 10 of the month. Very similar plots are found for Zn, Ca, and TSS. Thus, there is strong evidence for a real monthly periodicity in the data. Inquires to the local authorities indicate that during the period in question the streets in the area were swept once a month. The drainage area for the site is bounded by E. Howard Ave. on the north, S. Lake Dr. to the east, E. Denton Drive to the south, and S. Packard Ave to the west. The stormwater from this area drains to Lake Michigan. East of Lake Dr. there is a new development with separate sewer and stormsewer. Further to the east a cliff drops about 11 m to Lake Michigan. Because of the small size of the drainage area (∼23 ha), street sweeping may have a more noticeable effect on this station than others in the greater Milwaukee area. Regarding the constituents monitored, TSS would be expected to vary with street sweeping since the sweeper collects solids from the streets. Calcium in the form of CaCO3 is a large fraction of solids. Both zinc and copper are known significant metals associated with urban stormwater originating mainly from automobile traffic. Zinc comes from tire wear, and copper is derived from brake linings and metal plating (17, 18).

Air Quality Example This example analyzes data for 24 hour average airborne particulate elemental carbon (EC) concentration data for the Washington, D.C. IMPROVE sampling network from Wednesday August 31, 1988 to May 26, 2002, as shown in Figure 4. The sampling schedule was every Wednesday and Saturday through August 26, 2000, and every third day from VOL. 41, NO. 20, 2007 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

7033

FIGURE 2. Lomb Periodogram of first flush Cu concentrations in 22 stormwater samples for Station SWSF 14 in Milwaukee, WI near lake Michigan. The dashed line is the 99% confidence level.

FIGURE 3. Station SWSF 14 first flush Cu concentration data plotted on a time axis folded every 31 days. The concentration data have been normalized to a mean of 0 and a standard deviation of 1. The solid line is the sinusoidal waveform determined for a period of 31.66 days from the Lomb approximation of the DFT and normalized to a standard deviation of 1. The dashed line is a nonparametric regression estimate of the relationship of time and normalized concentrations. August 31, 2000 to Sunday May 26, 2002. There were a total of 1341 samples out of a total possible of 1462 (1250 for the Wednesday through Saturday sampling schedule and 212 for the every third day schedule). There are 121 missing samples or 8.3% missing data. There are runs of 62 and 32 7034

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 41, NO. 20, 2007

days with no data, and many periods of 1 week or more without data, as seen in Figure 5. Thus, filling in missing data by interpolation is not an option. More serious than the missing data is the problem of unequal spacing of the data. The Wednesday through Saturday schedule gives a semi-

FIGURE 4. Elemental carbon 24 h concentration time series from the IMPROVE Washington, D.C. site.

FIGURE 5. Intervals between samples of the Washington, D.C. elemental carbon data. regular pattern of 3 and 4 day sample spacing. Even if the missing data were filled in, with a sampling pattern like this it is not realistically possible to use the fast Fourier transform (FFT), which requires equally spaced data. Just how the FFT can be applied to this data is given after a discussion of the Lomb periodogram results. The first step in the spectral analysis of the Washington, D.C. EC data is to remove the large and obvious trend. Linear

regression determined the trend line shown in Figure 4. The detrended EC ) EC - 2.2048 × 10-4 t - 1.916 µgm-3, where t is the serial day number starting at 0 on Wednesday, August 31, 1988. The Lomb periodogram was calculated for the detrended data and is seen in Figure 6. The periodogram was calculated at the set of natural frequencies as if the data were evenly spaced every 3 days with 10 times over sampling. The horizontal dashed line is the 99% confidence limit for VOL. 41, NO. 20, 2007 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

7035

FIGURE 6. Lomb periodogram of detrended EC data using natural frequencies for 3 day sample spacing with 10 times over sampling.

FIGURE 7. Schuster periodogram of pseudodaily detrended EC data calculated by the FFT. peaks from random errors. Clearly, there are only two significant peaks: one at a period of 363.5 days and one at 7.004 days. There is a nearly significant peak near 2000 days (5.5 years) that is likely the result of imperfect removal of the trend by the simple linear detrending. The first peak is the yearly or seasonal cycle and the second is perhaps a weekday7036

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 41, NO. 20, 2007

weekend effect. To find out, the approximate time of the minima and maxima are needed. Starting with the yearly cycle, EC is expected to peak in the winter since it is a primary pollutant. As before, the phase angle is calculated from the DFT for the peak at 363.5 and its two nearest neighbors; the result is -1.918. August 31 is

FIGURE 8. Window function of the Washington IMPROVE EC sampling times. day 243 of the year, so this is the value of T0 in eq 9. The time of the maximum in the yearly cycle of EC is estimated to be day 354 or December 19, in the winter as expected. For the weekly cycle, T0 is 3.5 if the week is taken to start at 0 on midnight Sunday. From the DFT, the phase angle is 0.707 and the time of the maximum is 2.7 or late Tuesday. The time of the minimum is this plus 3.5 or 6.2, which is early Saturday. So, there is evidence of a 7 day cycle with a minimum on the weekend and a maximum during the week. This, too, is reasonable, for in the U.S. the main source of EC is usually heavy-duty diesel trucks, for which the emissions are known to show a marked decrease during the weekend. The Lomb periodogram does not have any other large peaks, indicating that there are no other long-term periodicities in the meteorology or emissions. Hies et al. (6) present a harmonic analysis of 1 year of daily EC data taken at a number of sites in Germany. Using a much more complicated form of analysis not based on the Lomb periodogram, they also found a 7 day cycle and its 3.5 day harmonic, which they attribute to traffic patterns. Since their data set was only 1 year long, they did not report a yearly cycle. They also report periodicities at 4.6 days and in a range between 13 and 42 days. Similar periodicities are not seen in the much longer data set analyzed here. The 4.6 day cycle is too short to be seen, and perhaps the longer cycles are caused by the episodic behavior of the EC during the 1 year of data. A much longer time series, as presented here, would tend to average out the episodic variations. The Shuster periodogram of the detrended EC data can be obtained from the DFT calculated by the FFT as follows. Each detrended EC data point is assigned a serial day number starting with 1 for the first observation and ending with 5017, the number of days covered by the data. Create a vector of zeros of length 5017. Place each detrended EC data point in the vector at the corresponding serial day number. The resulting vector has zeros where there is no detrended EC data. Since the mean of the detrended EC data is 0, the mean of this new data vector is also 0. The DFT of the zero-filled

data vector is calculated by the FFT. The natural frequencies will be k/5017 day-1, k ) 1, ..., 2508. The Schuster periodogram for these frequencies is given in Figure 7. To compare it with the Lomb periodogram in Figure 6, it has been normalized by dividing by the variance of the zero-filled data vector and its length. For periods greater than 6 days, the two figures are very similar; the minor differences are mostly because the two periodograms are calculated over slightly different sets of frequencies. Since the zero-filled data vector has 1 day spacing, the Schuster periodogram starts at a period of 2 days. There are large peaks at 2.33 and 3.5 days. These are classic aliasing peaks and can be discounted as false peaks. The window function (eq 4) for the Washington, D.C. sampling times and frequencies is given in Figure 8. It is interpreted in this way. If the data has a periodicity of period P, and if P0 is the period of a peak in the window function and P0 < P, then the periodogram will have a false peak at 1/(1/P0 - 1/P). There are two large peaks at 7 and 358 days in the Schuster periodogram. The window function for the EC data has large peaks at 1.4, 1.75, 2.33, and 3.5 days. Thus, the 7 day peak will be aliased to 1.75, 2.33, and 3.5 days. This explains the 2.33 and 3.5 day peaks seen in Figure 7. If one looks closer, peaks can also be seen at 2.348 and 3.5346, which are due to the aliasing of the large peak at 358 days. The FFT can be used to calculate the Schuster periodogram for even the highly irregular sampling in the water quality example given above. The procedure is the same as in the air quality example. Create a vector of zeros with length equal to the number of days spanned by the data, 697 in this case. Next, detrend the Cu data or, since there is no trend, simply subtract the mean. Insert the mean subtracted data in the vector of zeros at the rows corresponding to the serial day number. As with the EC data above, we now have a vector with mean 0 that has zeros at rows corresponding to days without Cu data. The Schuster periodogram of this vector can be calculated by the FFT as it is uniformly spaced with no missing values. The Schuster periodogram normalized by the length and variance of VOL. 41, NO. 20, 2007 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

7037

the zero-filled data vector is very close to the Lomb periodogram calculated for the same set of periods. The agreement is excellent for periods greater than 10 days (see Figures S1 and S2 in the Supporting Information for this paper). Because of the highly irregular sampling, no aliasing peaks are expected and none are seen. This is also confirmed by the fact that the window function does not have any large peaks. The following recommendations are made for harmonic analysis of time series of environmental data: First, detrend the data; generally a linear detrending is all that is needed. Use the Lomb periodogram if the number of data points is small enough so that the computing time is not too long. Over sample the frequencies by as much as needed to get a smooth periodogram. If the number of data points is very large, create a zero-filled vector of detrended data as described above and use the FFT to calculate the DFT and the Schuster periodogram. The natural frequencies can be adjusted by padding the vector with zeros as needed. Calculate the window function to identify any aliasing problems. Divide the Schuster periodogram by the variance and length of zero-filled data vector to make it compatible with the Lomb periodogram confidence levels. If the data are approximately uniformly spaced, estimate the confidence levels using eq 10 with N equal to the number of natural frequencies based on the average spacing of the data. If the data are very irregularly spaced, determine the confidence levels by a simulation study of the periodogram of Gaussian noise with the same data spacing.

(2) (3) (4) (5) (6) (7) (8) (9) (10) (11) (12)

(13)

Acknowledgments

(14)

This was supported by Science To Achieve Results (STAR) Grant number R832157 from the Environmental Protection Agency.

(15) (16)

Supporting Information Available Schuster periodogram of copper concentrations at station SWSF 14 using the fast Fourier transform. This material is available free of charge via the Internet at http:// pubs.acs.org.

Literature Cited (1) Fujita, E. M.; Campbell, D. E.; Zielinska B.; Sagebiel, J. C.; Bowen, J. L.; Goliff, W. S.; Stockwell W. R.; Lawson, D. R. Diurnal and weekday variations in the source contributions of ozone

7038

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 41, NO. 20, 2007

(17) (18)

precursors in California’s South Coast Air Basin. J. Air Waste Manage. Assoc. 2003, 53, 844-863. Sirois, A.; Olson, M.; Pabla, B. The use of spectral analysis to examine model and observed O3 data. Atmos. Environ. 1995, 29, 411-422. Rao, S. T.; Zurbenko, I. G.; Neagu, R.; Porter, P. S.; Ku, J. Y.; Henry, R. F. Space and time scales in ambient ozone data. Bull. Am. Meteorol. Soc. 1997, 78, 2153-2166. Sebald, L.; Treffeisen, R.; Reimer, E.; Hies, T. Spectral analysis of air pollutants. Part 2: ozone time series. Atmos. Environ. 2000, 34, 3503-3509. Sirois, A. Temporal variation of oxides of sulphur and nitrogen in ambient air in eastern Canada: 1979-1994. Tellus 1997, 49B, 270-291. Hies, T.; Treffeisen, R.; Sebald, L.; Reimer, E. Spectral analysis of air pollutants. Part 1: elemental carbon time series. Atmos. Environ. 2000, 34, 3495-3502. Bragatto, P. A.; Cordiner, S. M.; Lepore, F. L.; Sacco, D.; Ventrone, I. Long period benezene measurements in urban atmosphere. Environ. Monit. Assess. 2000, 65, 99-107. Schuster, A. On lunar and solar periodicities of earthquakes. Proceed. R. Soc. 1897, 61, 455 - 465. Scargle, J. D. Studies in astronomical time series analysis. II. Statistical aspects of spectral analysis of unevenly spaced data. Astrophys. J. 1982, 263, 835-853. Bloomfield, P. Fourier Analysis of Time Series: An Introduction. John Wiley & Sons: New York, 2000; pp 9-23. Lomb, N. R. Least-squares frequencysanalysis of unevenly spaced data. Astrophys. Space Sci. 1976, 39, 447-462. Scargle, J. D. Studies in astronomical time series analysis. III. Fourier transforms, autocorrelation functions, and crosscorrelation functions of unevenly spaced data. Astrophys. J. 1989, 343, 874-887. Mathias, A.; Grond, G. F. R.; Seese, D.; Canela, M.; Diebner, H. H. Algorithms for spectral analysis of irregularly sampled time series. J. Stat. Software, May 2004, 11, 1-27. Press, W. H.; Rybicki, G. B. Fast algorithm for spectral analysis of unevenly sampled data. Astrophys. J. 1989, 338, 277-280. Schultz, M.; Stattegger, K. SPECTRUM: Spectral analysis of unevenly spaced paleoclimate time series. Comput. Geosci. 1997, 23, 929-945. Horne, J.; Baliunas, S. A prescription for period analysis of unevenly sampled time series. Astrophys. J. 1986, 302, 757763. Christensen, E. R.; Guinn, V. P. Zinc from automobile tires in urban runoff. ASCE J. Environ. Eng. 1979, 105, 165-168. Sansalone, J. J.; Hird, J. P.; Cartledge, F. K.; Tittlebaum, M. E. Event-based stormwater quality and quantity loadings from elevated urban infrastructure affected by transportation. Water Environ. Res. 2005, 77, 348-365.

Received for review January 4, 2007. Revised manuscript received July 17, 2007. Accepted July 26, 2007. ES0700247