Semiparametric Regression Approach to Adjusting for Meteorological

Sep 23, 1999 - Daily meteorological predictor variables are based on cor ..... model, i.e., α1 = ∑j=112α0j, is shared by all 12 of the predictors ...
3 downloads 0 Views 167KB Size
Environ. Sci. Technol. 1999, 33, 3873-3880

Semiparametric Regression Approach to Adjusting for Meteorological Variables in Air Pollution Trends THOMAS S. SHIVELY* AND THOMAS W. SAGER Department of Management Science and Information Systems, CBA 5.202, The University of Texas at Austin, Austin, Texas 78712

A semiparametric regression model is used to analyze relationships between ambient ozone, related meteorological predictor variables, and time trends. This model enjoys a number of advantages over alternatives. The model produces outputs that parallel the outputs of linear regression but without forcing the structure of linear regression upon the data. Instead, the abundant data are allowed to determine the form of the model as in other nonparametric regression models. Four Houston, TX, sites with relatively complete ozone and meteorological data from 1983 to 1995 are analyzed. For these four sites, the model explains over 60% of the variation in log (ozone) and estimates declines in meteorologically adjusted ozone of 13-26%. These estimated declines appear to be real and are consistent with other evidence, such as declines in NOx, declines in exceedance days, and implementation of control policies. Because the estimated declines are interpreted, as in ordinary regression, as having been adjusted for year-toyear differences in the meteorological variables, one can infer that factors other than changes in the weather are responsible for the declines.

1. Introduction This paper proposes a semiparametric regression technique to obtain the long-term trend in ozone levels after accounting for the effects of meteorological conditions. The Environmental Protection Agency (EPA) has recently mandated trend analysis and adjustment for the effects of meteorological conditions (1), so it is important to develop methods for implementing this mandate. The semiparametric method we propose allows the data to specify the functional forms used to model both the trend and the relationships between ozone and meteorological variables. Allowing the data to specify the functional forms minimizes the risk of imposing incorrect prior information through parametric assumptions on the functional forms that may give misleading results. Ground-level ozone is one of six primary, or criteria, pollutants targeted for control by the Clean Air Act, as amended, in affected U.S. urban areas. Houston, TX, is generally recognized as having the second highest ambient ozone problem among U.S. urban areas (2). The National Ambient Air Quality Standard (NAAQS) for ozone requires that areas previously in noncompliance not exceed the * Corresponding author tel: (512)471-3322; fax: (512)471-0587; e-mail: [email protected]. 10.1021/es990286b CCC: $18.00 Published on Web 09/23/1999

 1999 American Chemical Society

threshold of 12 pphm (hourly average) more than 3 days in 3 consecutive years. Houston regularly exceeds this NAAQS a couple of dozen times or more per year. For example, a scan of hourly ozone data provided by the Texas Natural Resource Conservation Commission (TNRCC) from 15 monitoring stations in the Houston area from 1983 through 1993, with about 9-10 stations operational at any given time, showed the tally of the number of distinct days each year in which at least one station recorded an hourly measurement that exceeded the NAAQS ozone standard of 12 pphm was 79 days for 1983, 55 days for 1984, and 54, 46, 53, 49, 37, 48, 34, 25, and 25 days for 1985-1993, respectively. States with noncompliant regions such as Houston are required to develop State Implementation Plans to bring these areas into compliance with the NAAQS. The estimated costs of such policies are substantial. Yet the real effects of good plans can easily be obscured by uncontrollable changes in meteorological conditions. Worsening weather conditions, for example, may overwhelm the effects of these plans and make it difficult to interpret trends in the raw ozone data. The Texas ozone experience in 1995 provides a case in point. In 1995, measured levels of ambient ozone in noncompliance regions such as Houston were at their highest levels since 1990. However, preliminary analysis also suggests that weather conditions in 1995 favored the formation of ground-level ozone. Recorded ozone levels seem to have been declining gradually since the early 1980s. The question to ask is whether the increase in 1995 was due to unusually adverse weather conditions or to some other factor such as faltering control policy. Answering this question requires the ability to separate the effects of meteorological conditions from the effects of control policy and other factors. The general methodology that addresses such questions is regression, which estimates the relationship between a response variable (pollutant levels) and predictor variables (meteorological conditions, controls, etc.) and assesses the importance of each predictor. The most commonly used forms of regression are linear regression and, less frequently, semiparametric or nonparametric regression. The purpose of this paper is to demonstrate the usefulness of semiparametric regression for addressing some of the above questions. The methodology combines some of the best features of both traditional and nonparametric regression. We illustrate the method by analyzing ambient ozone data, although the procedure can easily be applied to other pollutants as well. Previous work in the environmental literature on trend analysis that incorporates meteorological variables includes refs 3-6. In each case, the response variable is some characteristic of ozone, such as the daily maximum hourly measurement, and the predictors include a mix of weather and/or precursor variables and a time variable representing trend. The magnitude and sign of the trend component in these models may be interpreted as incorporating the effect of control policy and other factors after adjusting for the effects of all other predictors. The first three of these references employ conventional parametric regression models, which require that the form of the relationship (e.g., linear) be specified. Our experience suggests that such models may be too constraining for fitting ozone data. For example, the parametric forms necessarily create estimates that follow the forms, whether the data do or not, and this may lead to curious artifacts in the estimates that are not related to known physical theory. Moreover, considerable statistical skill and subject-matter knowledge is required to get the correct form of the model. However, ozone data are usually sufficiently VOL. 33, NO. 21, 1999 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

3873

abundant in the temporal domain that the data themselves can be allowed to specify the form of the model. This avoids the risk of specifying a possibly incorrect parametric form for the model, which can lead to very misleading conclusions. If we could determine a parametric form a priori that both made physical sense and was likely to include the true relationship, we would not have used this semiparametric method. A posteriori, a fitted semiparametric model can sometimes suggest a good parametric model for the physical relationship. However, because of the complicated interactions in our model, we did not find one. Nonparametric regression models avoid prior specification of the model form by letting the data determine the form. However, if a nonparametric model permits too broad a class of model forms, statistical efficiency may be reduced with the practical consequence that it may be difficult to draw useful conclusions. When using nonparametric models, it may be desirable to limit the class of model forms by screening out unlikely possibilities in order to improve efficiency. If it may be said that parametric models risk incorporating too much prior information into regression models, then nonparametric models may incorporate too little. The modeling approach adopted in this paper is a form of semiparametric regression. The model marries a limited parameter set to a process for generating a fairly wide class of possible models, thus incorporating the best features of both parametric and nonparametric regression. Excellent references for nonparametric regression include refs 7 and 8. The initial version of our model was introduced into the statistical literature by Wahba (9). Recent papers in the statistical literature that discuss the properties of the model include refs 10 and 11. Gao et al. (6) also use a semiparametric model to estimate weather-adjusted ozone, but their methodology is subject to both computational and theoretical difficulties that our model avoids. The characteristic of ozone that is modeled in this paper is the mean (or expected value) of the distribution of daily ozone measurements given the meteorological conditions on the day. To be sure, other ozone parameters are also important. The current NAAQS targets high ozone concentrations. Trends in the frequency and the size of exceedances over a specified threshold level have been studied in refs 12 and 13 among others. However, there are good reasons for paying attention to the expected value as well. For example, if chronic exposure to lower ozone levels proves to have adverse health consequences comparable in significance to those resulting from acute exposure to high levels, then reduction of extreme values at the expense of rising mean levels may be undesirable. Moreover, the expected value is a more statistically stable parameter than an extreme value. The EPA has recently shown some sensitivity to this argument and has adopted an alternative standard to be based on an 8-h moving average approach to extreme values. If the distribution of the response variable or of the regression residuals can be identified, there is often a direct relationship between the expected value and/or dispersion parameters and the parameters characterizing the extreme values. This would allow reliable estimation of trends in the extremes from more stable parameters (14). The remainder of the paper is organized as follows: Section 2 contains a description of the data used in this study. Section 3 presents a brief outline of the semiparametric regression model, while section 4 discusses the results of applying the model to the data and includes some concluding remarks.

2. The Data Ozone results from complex chemical reactions in the atmosphere involving oxygen, sunlight, precursor chemicals, 3874

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 33, NO. 21, 1999

and weather. Ozone tends to be high during the summer months and during the day. Therefore, we analyze data covering the high ozone season (May 1-October 31) during the day (6 a.m. to 7 p.m.) for the years 1983-1995 inclusively for four fixed monitoring sites in the Houston region. The southeast to northwest orientation of these sites tracks the prevailing southeasterly wind. Thus, trends and conditions in common to the four sites should be as representative of areawide trends and conditions in Houston as at any available sites. The daily ozone value we use from each site is the largest of the 13 hourly averages of ozone measurements from 6 a.m. through 7 p.m. The convention used to handle missing values is described later. The natural logarithm of the daily maximum ozone is used as the response variable rather than ozone itself, because the logarithm of ozone is more nearly normally distributed and because the logarithmic transformation reduces the impact of outliers in the analysis. Daily meteorological predictor variables are based on corresponding hourly temperature, humidity, wind speed, and wind direction data collected at the sites. Ozone concentrations generally increase with increasing solar energy and decrease with increasing wind speed and humidity. A direct measure of solar energy was not available at the monitoring stations included in this study. However, the daily temperature range (maximum daily temperature minus minimum daily temperature) provides an excellent and well-accepted proxy for solar energy. Ozone concentrations generally decrease with increasing wind speed, as windy weather disperses the chemical precursors of ozone. Atmospheric water vapor essentially washes ozone from the air. Finally, certain wind directions can also impact the formation of ozone. For example, the Houston region often experiences higher monitored ozone values when the wind is from the southeast and brings in precursor emissions from the Gulf Coast chemical industry (personal communication with the TNRCC meteorological staff). There are also potentially important effects due to interactions between the meteorological variables. For example, for lower wind speeds, ozone concentrations are typically much higher for large temperature ranges than for small temperature ranges. However, high wind speeds disperse the precursors to ozone faster than any amount of solar energy (proxied by temperature range) can create ozone. Thus, the relationship between ozone and temperature range may depend on the wind speed. If so, this implies that there should be a term to model the interaction between temperature range and wind speed. For similar reasons, interactions may exist between wind speed and wind directions. Finally, there may be an interaction effect between temperature range and humidity. These five potential interaction effects are modeled by the variables x8,t-x12,t below. 2.1. The Variables. The following is a list of the specific predictor variables used. The subscripted “x” notation anticipates the mathematical statement of the full model in eq 2. In each predictor, the argument “t” stands for time, explicitly indicating by subscript or parenthetically (viz., xt or x(t)) that each datum originates from a different day. TREND (x1,t): Annual trend, assumes the value 1 for each day in 1983, 2 for each day in 1984, ..., 13 for each day in 1995. TRANGE (x2,t): Temperature range, difference between the maximum and minimum hourly temperature measurements during the 6:00 a.m. to 7:00 p.m. period on day t. WIND SPEED (x3,t): Average wind speed, mean of the hourly wind speed measurements during the 6:00 a.m. to 7:00 p.m. period on day t. HUMIDITY (x4,t): Average relative humidity, mean of the hourly relative humidity measurements during the 6:00 a.m. to 7:00 p.m. period on day t. NW/NE (x5,t): Proportion of hours in which the wind direction was predominantly between NW and NE during

More formally, the semiparametric regression model with one time-dependent continuous predictor is

TABLE 1. Houston Sites Analyzed in This Study

site location Aldine Drive, north of downtown Clinton Avenue, downtown Deer Park, southeast of downtown far northwest Harris County

no. of nonmissing days 2119 1804 1841 2051

% of days missing 11.4 24.6 23.0 14.3

the 6:00 a.m. to 7:00 p.m. period on day t. Air moving in from the north of Houston tends to be relatively clean. NE/ESE (x6,t): Proportion of hours in which the wind direction was predominantly between NE and ESE during the 6:00 a.m. to 7:00 p.m. period on day t. Air moving in to Houston from generally easterly directions tends to bring in relatively more precursors. ESE/SSW (x7,t): Proportion of hours in which the wind direction was predominantly between ESE and SSW during the 6:00 a.m. to 7:00 p.m. period on day t. Air moving in to Houston from southeasterly and southern directions tends to bring in relatively more precursors. (The remaining wind direction, SSW/NW, is not used because it adds no explanatory power to the other three directions: the fourth is determined by the other three since the sum of the four is 100%.) Wind speed by temperature range interaction term (x8,t), the product of WS × TRANGE. Wind speed by NW/NE interaction term (x9,t), the product of WS × NW/NE. Wind speed by NE/ESE interaction term (x10,t), the product of WS × NE/ESE. Wind speed by ESE/SSW interaction term (x11,t), the product of WS × ESE/SSW. Humidity by temperature range interaction term (x12,t), the product of TRANGE × HUM. (The interaction term HUM × WS, representing the interaction between humidity and average wind speed, is not included in the model because it is a nearly perfect linear combination of other predictors. Thus, HUM × WS adds no explanatory power to the model.) 2.2. Missing Values. In each case, the ozone response variable and meteorological predictor variables are calculated on a daily basis by reference to the subset of the 13 possible hourly measurements that are nonmissing for 1 day. If more than 6 h of data are missing on 1 day for any weather variable or for ozone, then the entire day is omitted from the analysis. Each site potentially provides 2392 days for analysis over the 13-year period of the study. However, missing values typically reduce the available data by 10-25%. The sites used in the analysis are those with the most complete records over the time period of the study. Most of the missing data correspond to extended down times of 2 weeks to 1 month because of maintenance or instrument malfunction. Only 1993 is missing more than half the data at any site. Ozone levels immediately before and immediately after down times test insignificantly different from other times, so there is no evidence that down times are related to ozone levels. The sites and the number of available days at each site are shown in the Table 1.

3. The Methodology 3.1. The Mathematical Model. The well-known basic form of a conventional linear regression model with one timedependent predictor x(t), response y(t), and error term (t) is y(t) ) R0 + R1x(t) + (t). The semiparametric model retains this simple linear form at its core but adds an additional term to allow for a broad class of possible deviations from linearity.

y(t) ) f (x(t)) + (t) ) R0 + R1x(t) + τ



x(t)

0

(x(t) - s)W(ds) + (t) (1)

R0 is the value of f at 0, i.e., R0 ) f (0); R1 is the slope of the function at 0, i.e., R1 ) f ′(0); τ is a smoothing parameter; W(s) is a Wiener process with unit dispersion parameter; and W(ds) is the change in the Wiener process over an interval of length ds. The (t) are independent and identically distributed N(0, σ2) random variables. The parameters R0, R1, τ2, and σ2 are unknown parameters that need to be estimated. We put diffuse (noninformative) priors on the R0 and R1 [i.e., the prior on (R0, R1)′ is N(0, kI2) with k f ∞] and flat (noninformative) priors from 0 to 1014 on the τ2 and σ2. The representation of f (x) in eq 1 corresponds to Wahba’s (9) specification of a cubic stochastic spline function, whose slope changes stochastically according to a Wiener process. This representation provides a very flexible class of functions with which to model a relationship. Note that if τ ) 0, then eq 1 reduces to a linear function in x. An intuitive explanation of the expression for f (x) in eq 1, why it represents a flexible method to model the relationship between y and x, and how it is derived is given in the Supporting Information. Mathematical justifications for properties of the semiparametric model may be found in refs 9-11. To handle multiple predictor variables, additional linear and integral terms are added, one linear and one integral term per predictor. The mathematical form for a 12-predictor model of the type that we apply in this paper is as follows:

y(t) ) f1(x1(t)) + f2(x2(t)) + ... + f12(x12(t)) + (t) ) 12

∑{R

0,j



+ R1jxj(t) + τj

j)1

xj(t)

0

(xj(t) - s)Wj(ds)} + (t) (2)

Thus, the multiple-predictor model is an additive model in which each term fj(xj(t)) is of the form in eq 1, and xj(t) corresponds to one of the 12 predictors listed in section 2. 3.2. Estimation. This section provides a brief outline of the Bayesian estimation procedure we use to estimate the functions fj, j ) 1, ..., p. The representation of fj in eq 2 can be rewritten as

fj(xj(t)) ) R0j + R1j(xj(t)) + gj(xj(t)) where

gj(xj(t)) ) τj



xj(t)

0

(xj(t) - s)W(ds)

Then, the model in eq 2 can be written in matrix notation as

y ) XR + g1 + ... + gp + 

(3)

where y ) {y(1), ..., y(T)}′; the tth row of X is (1, x1(t), ..., xp(t)); R ) (R0, R11, ..., R1p)′ with R0 ) R01 + ... + R0p [The R0j cannot be estimated separately, so the function fj can only be estimated up to an additive constant. This situation is similar to that encountered with a multivariate linear regression model.]; gj ) {gj(xj(1)), ..., gj(xj(T))}′; and  ){(1), ..., (T)}′. The Bayesian estimate of the parameter vector R is the posterior mean of R given the data y, i.e., E(R|y). Similarly, the Bayesian estimate of gj is E(gj|y). From these, we can estimate fj up to the additive constant R0j by computing E( fj|y) ) xj E(R1j|y) + E(gj|y) where xj ) {xj(1), ..., xj(T)}′. In principle, E(R|y) and E(gj|y) can be computed using numerical integration. However, in practice, the numerical integrations VOL. 33, NO. 21, 1999 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

3875

required to calculate these quantities are computationally infeasible. For this reason, we use the Gibbs sampler to estimate E(R|y) and E(gj|y). The Gibbs sampler provides a computationally efficient method for computing the mean of a posterior distribution in a Bayesian model that avoids high dimensional numerical integration. A description of the Gibbs sampler we use is given in the Supporting Information. 3.3. Parametric, Nonparametric, and Semiparametric Regression. Both parametric and nonparametric methods may suffer significant disadvantages. A major disadvantage of the parametric approach is that it imposes its structure upon the data. For example, the use of linear regression will force a linear relationship upon the data, even if the true relationship is nonlinear. The use of parametric regression is appropriate only if the specified parametric functional form is correct. Nonparametric regression adapts to local variation in the relationship between log (ozone) and x. For example, a common way to estimate f (x) nonparametrically is to apply a weighted average to the log (ozone) data at x and its near neighbors, with the weight diminishing by distance from x. However, nonparametric regression may have two significant disadvantages. First, the nonparametric estimate may not be stable where there are relatively few data in the vicinity of x. Second, most nonparametric regression methods find it difficult to choose the appropriate degree of local averaging without active user intervention. Overaveraging leads to considerable bias in estimating f, whereas underaveraging results in estimates that are too variable. As a nonparametric method, the semiparametric regression model used in this paper permits a relatively wide class of functions for f. But the class is limited to reasonable candidates by a small set of key parameters and a simple process for generating candidate functions from the data and those parameters. The small parameter set controls the size of the candidate class, while the candidate function generating process avoids the limiting restrictions of parametric regression. When the data indicate that the traditional linear regression model will suffice for f, the semiparametric model produces it. Moreover, the degree of local averaging is governed by a smoothing parameter τ that is estimated automatically without user intervention. Our estimation method implements a Bayesian technique using noninformative priors, with computational details handled via Gibbs sampling. An intuitive explanation of the candidate function generating process and the Bayesian estimation technique used to estimate the functions is given in the Supporting Information. Analogues of diagnostic and precision measures for ordinary linear regression are also available in semiparametric regression, including residual analysis, standard error of the regression, and R 2. For each data value y(t), an estimate yˆ(t) ) ˆf1(x1(t)) + ... + ˆf12(x12(t)) is produced. Thus the residuals y(t) - yˆ(t) ) y(t) - [ ˆf1(x1(t) + ... + ˆf12(x12(t))] can be calculated, plotted, and diagnosed for model violations. In addition, an estimate of σ can be calculated as

σˆ )

x∑ 1

n

n t)1

[y(t) - fˆ 1(x1(t)) - ... - fˆ 12(x12(t))]2

and the R 2 as n

∑[y(t) - fˆ (x (t)) - ... - fˆ 1

R2)1-

1

12(x12(t))]

2

t)1

n

∑[y(t) - yj]

2

t)1

3876

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 33, NO. 21, 1999

These formulas are identical to the case of linear regression, although the mathematical properties are still under development. For example, the asymptotic properties of σˆ are unknown, although simulation results given in ref 15 indicate that σˆ provides a good estimate of σ. An advantage of semiparametric regression in a Bayesian context is that it can provide interpretations not directly available in ordinary regression. For example, consider the linear regression procedure for deciding whether to retain a predictor in the model. This is formulated as a test of the null hypothesis that the predictor’s coefficient is zero versus the alternative that it is not zero. In deciding this test, the most natural question to ask is: What is the probability that the null hypothesis is true? Classical regression does not answer this question. (Indeed, the p value is often misinterpreted as the answer to the question.) Classically, we must answer circuitously: If the null hypothesis is true, then the p value is the probability of a test result at least as far from zero as was observed; and if the null hypothesis is false, then we must deal with type II errors and power functions. However, the Bayesian approach provides a direct answer to the question. A noninformative prior probability is combined with the data to yield the posterior probability that the coefficient is zero. The details of the implementation of this approach for the semiparametric model are given in ref 11 and in the Supporting Information. In the Results section of this paper, we use a probability cutoff of 0.5 to decide whether to retain a predictor. Under this rule, if the posterior probability that the predictor’s coefficient is zero exceeds 0.5, then it is more likely than not that the predictor does not belong in the model. Clearly, other decision rules could also be used. For example, if the cost of erroneous retention of a predictor exceeds the cost of erroneous deletion, then the cutoff probability should be lowered. Details on implementing cost-based cutoffs may be found in a standard treatment of Bayesian decision analysis, for example, in ref 16, section 10.3.2 (especially Theorem 10.3.3). The probability that the coefficient is between any two numbers can also be calculated. Thus, the Bayesian approach provides true confidence (i.e., probability) intervalssunlike the classical approach, where a 90% confidence interval is often misinterpreted as meaning there is a 90% probability that the interval contains the coefficient.

4. The Results The semiparametric model described in section 3 was applied separately to the data from each of the four Houston sites identified in section 2. The results are discussed in some detail for the Clinton Avenue site to illustrate the type of conclusions that may be drawn from our model. The results for the other sites are similar and are summarized. Figure 1 shows a plot of the estimated trend component of the semiparametric model with the corresponding linear regression trend line (the short-dashed line in Figure 1) superimposed for contrast. (For the purpose of this illustration, the vertical location of the plots in Figure 1 is determined by setting the other predictors at their modal values: 8.3 °C for temperature range, 5 mph for wind speed, 61% for humidity, and southerly wind direction.) The long-dashed lines are 90% confidence bands obtained from the semiparametric model. Figure 1 reveals that there has been a decrease in meteorologically adjusted log (ozone). The estimated decrease in log (ozone) is 0.26 over the 1983-1995 period according to the semiparametric model versus an estimated decline of 0.29 according to the linear regression model. Both of these declines can be interpreted as having been adjusted for the meteorological predictors in the model. That is, for any given combination of temperature range, wind speed, humidity and wind direction, the estimated average decline in log (ozone) is 0.26. Because we believe

FIGURE 1. Time trend of weather-adjusted log (ozone) at the Houston HCL site. The vertical log scale is relative to other predictors set to modal values (see text). that the weather predictors included in our model capture much of the relevant effect of meteorological variables, the interpretation of the decline is that even after having compensated for the mitigating effect of good weather and the aggravating effect of bad weather, there remains a decline of 0.26 at the Clinton site due to other factors (including control policy) not in the model. A decline of 0.26 in log (ozone) units converts into a decline of 22.8% in ozone itself with a 95% probability interval of (14.6%, 31.0%). High levels of ozone are more favorably affected in absolute terms than lower levels. For example, conditions that would have given rise to an ozone level of 200 ppb in 1983 would give rise to a level of only about 154 ppb in 1995; and 100 ppb in 1983 would correspond to 77 ppb in 1995. In addition, the function shown in Figure 1 for the semiparametric model seems more realistic than the straight line of ordinary regression, which implies equal decreases from year to year. The semiparametric function shows a steep initial decline from 1983 to 1984, a generally slower decrease from 1984 to 1994, and an increase in 1995. The three-dimensional plots in Figures 2-6 show how predictor variables interact in pairs to affect ozone in ways that would not be revealed by single-variable plots. For example, Figure 2a shows the combined effect of temperature range and wind speed. Figure 2a has been split into TRANGE and WIND SPEED sections in Figure 2, panels b and c, respectively. Figure 2b indicates that along sections of constant temperature range, ozone declines as the wind speed risess declining more gradually at lower temperature ranges and more dramatically at higher temperature ranges. This is consistent with the hypothesis that higher wind speeds prevent ozone from forming by diluting and dispersing the chemical precursors. Figure 2c shows that, along sections of constant wind speed, ozone rises gradually as temperature range increases but then tapers off somewhat paradoxicallys perhaps an artifact of sparse data for very large temperature ranges. The sections at higher wind speeds in Figure 2c are associated with lower ozone levels, in accord with our expectation. In general, the features of Figure 2 agree with our expectations. But one unusual feature is the very rough area occurring at middle to high wind speeds and moderate temperature ranges in Figure 2a-c. Although this region shows the expected rise in ozone with increasing temperature range, the ozone also rises with increasing wind speed. The variance of the ozone function estimate increases in this region and the data become sparser, but the effect appears real. So further investigation of this anomaly seems warranted.

FIGURE 2. Effect of temperature range and wind speed at the Houston HCL site. The vertical log scale is relative to temperature range ) 0 and wind speed ) 0. In Figure 3, we see that humidity has a dramatic effect on ozone. There are few days with humidity below 30%. Along VOL. 33, NO. 21, 1999 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

3877

FIGURE 3. Effect of temperature range and relative humidity at the Houston HCL site. The vertical log scale is relative to temperature range ) 0 and humidity ) 0.

FIGURE 4. Effect of wind speed and NW/NE wind direction at the Houston HCL site. The vertical log scale is relative to wind speed ) 0 and proportion NW/NE wind ) 0.

FIGURE 5. Effect of wind speed and NE/ESE wind direction at the Houston HCL site. The vertical log scale is relative to wind speed ) 0 and proportion NE/ESE wind ) 0. given TRANGE sections, ozone rises slightly until humidity reaches about 50% and then plunges to its minimum at 100% humidity. As expected, at given humidity levels, ozone rises with TRANGE. The effects revealed in Figure 3 are largely in accord with expectation. Figures 4-6 show the effects of various combinations of wind speed and direction. For example, on days when the wind is mostly from the NW/NE, indicated by wind from NW/NE close to 1.0 in Figure 4, log (ozone) decreases only moderately with increasing wind speed. On such days, relatively clean air from the continental interior is blown into Houston. In Figure 6, we see that, when the wind is mostly from the ESE and SW, ozone is relatively high and tapers off before rising again somewhat at WIND 3878

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 33, NO. 21, 1999

FIGURE 6. Effect of wind speed and ESE/SW wind direction at the Houston HCL site. The vertical log scale is relative to wind speed ) 0 and proportion ESE/SW wind ) 0. SPEED ) 14. In Figure 5, ozone levels rise steadily and significantly with increasing wind speed. We do not have a physical explanation for this phenomenon, although we note that the higher wind speeds occur relatively infrequently. Wind direction seems to be of lesser importance relative to the other predictors. The ozone surface varies more in the TRANGE by HUMIDITY and TRANGE by WIND SPEED figures than in the other figures and relatively more for HUMIDITY than for the other variables. This indicates that changes in humidity are relatively more important for changes in ozone than are the other predictors. In ordinary linear regression, by analogy, the partial and/or sequential R 2 of the humidity variable would exceed that of the other predictors. It should be noted that the vertical axis in Figures 2-6 is an interval-scale rather than a ratio-scale variable. This means that vertical differences make sense, but the location of the zero point is arbitrary. The reason is that in both linear and semiparametric regression the constant part of the model, i.e., R1 ) ∑j)112R0j, is shared by all 12 of the predictors and cannot be subdivided among themsthat is, the individual constant terms, R0j, j ) 1, ..., 12, are nonidentifiable. This implies the predictor components f(xj(t)) of both the linear and semiparametric models can be estimated only up to a constant; therefore, the location of the intercepts on the vertical axes are arbitrary and chosen for convenience. However, the vertical placement of the functions is immaterial since we wish to estimate the change in ozone levels as the predictors change. We have observed that our semiparametric model estimates a 0.26 decline in log (ozone) from 1983 to 1995 at the Clinton site. In ordinary linear regression, the significance of this decline could be determined by testing the hypothesis that the coefficient of the trend variable is zero, i.e., that the trend variable drops out of the set of predictors. In the semiparametric model (eq 2), the jth predictor does not have an effect on the response if and only if both the linear and nonlinear components vanish. These components vanish if both R1j and τj, respectively, are zero. Using the Bayesian methodology, we can calculate the posterior probability that the trend variable should be excluded from the model by calculating the posterior probability that the associated R1j and τj are both zero. If this posterior probability exceeds 1/2, then the evidence in the data implies that the trend variable should be excluded; otherwise the evidence implies that it should be included. Similarly, the posterior probabilities provide evidence for including or excluding other predictors. The marginal posterior probabilities for each predictor are shown in Table 2. The table indicates that all 12 predictors should be retained. In particular, the component TREND is

TABLE 2. Variable Selectiona TREND 0.000 ESE/SSW 0.099 aRetain

Marginal Posterior Probability That r1j ) 0 and τj ) 0b HUM WS NW/NE 0.000 0.404 0.075 HUM × TRANGE WS × NW/NE WS × NE/ESE 0.193 0.379 0.226

TRANGE 0.000 WS × TRANGE 0.000

predictor if tabled probability < 0.5.

b

If R ) 0 and τ ) 0 for any predictor, then that predictor vanishes.

almost certainly important. Therefore, the general decline from 1983 to 1995 is real with very high probability and is not the result of random variability. 4.1. Model Diagnostics. Cross-validation results are used to show that the semiparametric estimation procedure provides reliable out-of-sample estimates of the trend and meteorological functions. To validate the model, every fourth observation in the total sample was put into a validation data set with the remaining observations used for an estimation data set. Thus, the estimation data set consists of 75% of each year’s data, and the out-of-sample validation data set consists of 25% of each year’s data. The root-meansquared error (RMSE) is used to measure the quality of the fit in the out-of-sample validation data set. It is defined as

RMSE )

x

1 [ Tvalidation sample

∑[y(t) - fˆ (x (t)) - ... - fˆ 1

1

12(x12(t))]

2

12

∑f (x (t)) + (t) j

TABLE 3. Summary of Models for Other Sites

R2

site

Aldine 0.612 Deer Park 0.597 NW Harris 0.565 County a

j

j)1

where (t) ) F(t - 1) + u(t), with u(t) independent and identically distributed N(0, σ2) random variables. Note that if F ) 0 the errors are independent. The trend and meteorological variables x1, ..., x12 from the Clinton monitoring site are used to form the design matrix. The functions f1, ..., f12 used to generate the y values are the function estimates obtained from the Clinton monitoring site when all 13 years of data (1804 observations) are used, and the (t) are assumed

probability that estimated 95% probability r ) 0 and τ ) 0 % ozone interval for % for the TREND decline ozone decline terma 18.4 25.8 13.2

(12.9, 23.9) (19.7, 31.9) (7.1, 19.3)

0.000 0.000 0.000

If R ) 0 and τ ) 0 for the TREND term, then the TREND term vanishes.

TABLE 4. Summary of Autocorrelation and Cross-Validation Results for Other Sites

site

autocorrelation coefficient

RMSE for estimation data set

RMSE for validation data set

Aldine Deer Park NW Harris County

0.32 0.38 0.33

0.311 0.343 0.309

0.323 0.367 0.324

]

where Tvalidation sample is the number of observations in the out-of-sample validation data set, fˆ 1 represents the estimate of fj using the estimation data set, and the sum is over all data points in the validation data set. A similar definition can be used to define the RMSE for the estimation data set. Using these definitions, the RMSE for the estimation and out-of-sample validation data sets are 0.355 and 0.361, respectively, so the out-of-sample RMSE is only 1.4% worse than the estimation RMSE. The mean response for the natural logarithm of ozone in the validation sample is 1.606. This implies that the estimation procedure is providing reliable function estimates. The semiparametric regression model produces measures of fit that correspond well with those of ordinary linear regression. For example, the semiparametric R 2 for the Clinton site is 0.619 and indicates that 61.9% of the variability in log (ozone) is explained by the semiparametric model. The estimate of σ is 0.36. Inspection of residual plots showed no patterns or evidence of heteroscedasticity. In addition, the Kolmogorov-Smirnoff test for normality accepts the null hypothesis that the residuals are normally distributed at the 10% significance level. There is a small but statistically significant positive autocorrelation coefficient of 0.28 at the Clinton site. However, the cross-validation results given above indicate that the practical effect of this autocorrelation on model fit is marginal. To provide further evidence of the effect of autocorrelation a simulation was done. The “true” model used to generate the data in the simulation is

y(t) )

NE/ESE 0.159 WS × ESE/SSW 0.029

to be independent. Similarly, the value of σ used to generate the data (σ ) 0.36) is the estimated value of σ obtained from the Clinton site when the (t) are assumed to be independent. We consider two values of F in the simulation: (1) F ) 0 (i.e., independent errors); and (2) F ) 0.28 (which is the estimated autocorrelation coefficient for the Clinton site). The estimation and validation samples in this simulation were constructed the same as they were in the cross-validation study discussed in the preceding two paragraphs that used the actual data. For F ) 0, the average RMSE among the 10 validation samples is 0.38 with a standard deviation of 0.01. For F ) 0.28, the average RMSE is 0.39 with a standard deviation of 0.01. Thus, the out-of-sample validation results are only about 3% worse for F ) 0.28 than for F ) 0.0. This implies that the practical effect on the function estimates of the autocorrelation is small and therefore that autocorrelation of this size does not have a substantial practical effect on the validity of the model. 4.2. Other Sites. We applied the same semiparametric regression model to log (ozone) at the three other Houston sites. Highlights of the analysis are shown in Table 3. All sites experienced considerable declines in weather-adjusted ozone concentrations. The posterior probabilities indicate that these declines are real: For each site, the probability is virtually zero that the trend is really flat. This implies with probability approaching one, the declines are not merely the result of random variability in the data. As measured by the semiparametric R 2, the predictor variables explained a similar proportion of the variability of log (ozone) as at the Clinton site, approximately 60%. The estimated autocorrelation coefficients and the crossvalidation results for each site are given in Table 4. While each of the autocorrelation coefficients is large enough to reject the assumption of independent errors, the crossvalidation results indicate the effect on the model fit is marginal. VOL. 33, NO. 21, 1999 / ENVIRONMENTAL SCIENCE & TECHNOLOGY

9

3879

In a preliminary analysis of 1983-1992 only, the estimated percentage ozone declines obtained using the semiparametric model were even greatersfrom 28% to 45% at the four sites. Curiously, the mean level of measured ozone increased in both 1994 and 1995, although our model of weather-adjusted ozone shows a decline in 1994 followed by a rise in 1995 (see Figure 1). 4.3. Supporting Evidence for a Downward Trend. The estimated declines from 1983 to 1995 are substantial, especially given the Houston area’s population and industrial growth. Therefore, it is reasonable to ask if collateral evidence supports these estimated declines. The answer is yes. First, the annual averages of the raw ozone data (unadjusted for meteorological conditions) are decreasing over the period 1983-1995. Second, there is a downward trend in the precursors to ozone and in emission inventories from 1983 to 1995. Annual averages of NOx, an ozone precursor, and estimated total emissions from transportation in Houston have declined over the 1983-1995 period. Measured industrial emissions have also declined, despite an increase in the number of industrial plants (personal communications with TNRCC staff). Unfortunately, we cannot confirm whether ambient VOCs have declined since measured ambient levels of VOCs are not generally available from the TNRCC before 1993. In general, the estimated decline is consistent with the available collateral evidence. Finally, we note that the annual averages of humidity and wind speed are flat over the period 1983-1995, while temperature range shows a general upward trend that should be associated with an upward trend in ozone, all other things being equal. We therefore estimate a decline in ozone that occurs in the face of weather more favorable to ozone formation.

Acknowledgments This research was supported by the Texas Natural Resources ConservationCommission(TNRCC)underContract4500000084 and Contract 582900760. The views expressed herein are those of the authors and do not necessarily represent the position of the TNRCC. We are grateful to the TNRCC for providing the data analyzed in this study. Both authors were also supported by Faculty Research Committee grants from the Graduate School of Business at the University of Texas at Austin.

3880

9

ENVIRONMENTAL SCIENCE & TECHNOLOGY / VOL. 33, NO. 21, 1999

Supporting Information Available An intuitive graphical explanation of the spline regression model, a brief description of the estimation and variable selection techniques, and the computational algorithm used are given (11 pages). This material is available free of charge via the Internet at http://pubs.acs.org.

Literature Cited (1) Fed. Regist. 1993, 58 (28), 8469. (2) U.S. Environmental Protection Agency, Office of Air Quality. National Air Quality and Emissions Trends Report, 1995; U.S. Government Printing Office: Washington, DC, October 1996; EPA 454/R-96005 (see Tables A-16 and A-19). (3) Sager, T. W.; Hemphill, M. W.; Gise, J. P. Proceedings of the Air and Waste Management Association; AWMA: Pittsburgh, 1991; Paper 91-66.6. (4) Bloomfield, P.; Royle, A.; Yang, Q. Technical Report 1. National Institute of Statistical Sciences: Research Triangle Park, NC, 1993. (5) Flaum, J.; Rao, S. T.; Zurbenko, I. J. Air Waste Manage. Assoc. 1996, 35. (6) Gao, F.; Sacks, J.; Welch, W. J. Technical Report 14. National Institute of Statistical Sciences: Research Triangle Park, NC, 1994. (7) Green, P. J.; Silverman, B. W. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach; Chapman and Hall: New York, 1994. (8) Hastie, T. J.; Tibshirani, R. J. Generalized Additive Models; Chapman and Hall: New York, 1990. (9) Wahba, G. Spline Models for Observational Data; SIAM: Philadelphia, 1990. (10) Wong, C. M.; Kohn; R. J. Econometrics 1996, 74, 209. (11) Shively, T. S.; Kohn, R.; Wood, S. J. Am. Stat. Assoc. In press. (12) Shively, T. S. Atmos. Environ. 1991, 25B, 387. (13) Smith, R. L.; Shively, T. S. Atmos. Environ. 1995, 29, 3489. (14) Sager, T. W.; George, E. I.; Murff, E. J. Proceedings of the Air and Waste Management Association; AWMA: Pittsburgh, 1995; Paper 95-TA43.01. (15) Ansley, C. F.; Kohn, R.; Wong, C. M. J. Stat. Comput. Simulat. 1992, 44, 1. (16) Casella, G.; Berger, R. Statistical Inference; Wadsworth: Pacific Grove, CA, 1990.

Received for review March 15, 1999. Revised manuscript received July 23, 1999. Accepted July 28, 1999. ES990286B