Subscriber access provided by PEPPERDINE UNIV
Article
Blending multiple nitrogen dioxide data sources for neighborhood estimates of long-term exposure for health research Ivan Charles Hanigan, Grant J Williamson, Luke David Knibbs, Joshua Horsley, Margaret Rolfe, Martin Cope, Adrian Barnett, Christine Cowie, Jane S. Heyworth, Marc L Serre, Bin Jalaludin, and Geoffrey G Morgan Environ. Sci. Technol., Just Accepted Manuscript • DOI: 10.1021/acs.est.7b03035 • Publication Date (Web): 26 Sep 2017 Downloaded from http://pubs.acs.org on September 26, 2017
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
Environmental Science & Technology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 30
Environmental Science & Technology
1
Blending multiple nitrogen dioxide data sources for
2
neighborhood estimates of long-term exposure for
3
health research
4
Ivan C. Hanigan*, Centre for Air Quality and Health Research and Evaluation, Woolcock
5
Institute of Medical Research, University of Sydney, Sydney, Australia & University of Canberra,
6
Canberra, Australia
7
Grant J. Williamson, Centre for Air Quality and Health Research and Evaluation, Woolcock
8
Institute of Medical Research Sydney, Australia & School of Biological Sciences, University of
9
Tasmania, Hobart, Australia
10
Luke D. Knibbs, Centre for Air Quality and Health Research and Evaluation, Woolcock Institute
11
of Medical Research Sydney, Australia & School of Public Health, The University of
12
Queensland, Herston, Australia
13
Joshua Horsley, School of Public Health, University of Sydney, Sydney, Australia
14
Margaret I. Rolfe, School of Public Health, University of Sydney, Sydney, Australia
15
Martin Cope, Centre for Air Quality and Health Research and Evaluation, Woolcock Institute of
16
Medical Research Sydney, Australia & CSIRO, Melbourne, Australia
ACS Paragon Plus Environment
1
Environmental Science & Technology
Page 2 of 30
17
Adrian G. Barnett, Institute of Health and Biomedical Innovation & School of Public Health and
18
Social Work, Queensland University of Technology, Brisbane, Australia
19
Christine T. Cowie, Centre for Air Quality and Health Research and Evaluation, Woolcock
20
Institute of Medical Research, University of Sydney; South West Sydney Clinical School,
21
University of NSW & Ingham Institute for Applied Medical Research, Sydney, Australia
22
Jane S. Heyworth, Centre for Air Quality and Health Research and Evaluation, NESP Clean Air
23
and Urban Landscapes, School of Population and Global Health, The University of Western
24
Australia, Perth, Australia
25
Marc L. Serre, University of North Carolina, Chapel Hill, USA
26
Bin Jalaludin, Centre for Air Quality and Health Research and Evaluation, Woolcock Institute of
27
Medical Research, University of Sydney; South West Sydney Clinical School, University of NSW
28
& Ingham Institute for Applied Medical Research, Sydney, Australia
29
Geoffrey G. Morgan, Centre for Air Quality and Health Research and Evaluation, Woolcock
30
Institute of Medical Research & University Centre for Rural Health, North Coast, School of
31
Public Health, University of Sydney, Sydney, Australia
32
Corresponding Author
33
*Corresponding Author. Correspondence should be addressed to: Post: Building 22, University
34
of Canberra, Bruce, ACT, 2601. Email:
[email protected]; Phone: +61 2 6201
35
5298; Fax: +61 2 6201 5999.
36
ACS Paragon Plus Environment
2
Page 3 of 30
Environmental Science & Technology
37
TOC ART
38
Abstract
39
Exposure to traffic related nitrogen dioxide (NO2) air pollution is associated with adverse health
40
outcomes. Average pollutant concentrations for fixed monitoring sites are often used to estimate
41
exposures for health studies, however these can be imprecise due to difficulty and cost of spatial
42
modeling at the resolution of neighborhoods (e.g. a scale of tens of meters) rather than at a
43
coarse scale (around several kilometers). The objective of this study was to derive improved
44
estimates of neighborhood NO2 concentrations by blending measurements with modeled
45
predictions in Sydney, Australia (a low pollution environment). We implemented the Bayesian
46
Maximum Entropy (BME) approach to blend data with uncertainty defined using informative
47
priors. We compiled NO2 data from fixed-site monitors, chemical transport models, and satellite-
48
based land use regression models to estimate neighborhood annual average NO2. The spatial
ACS Paragon Plus Environment
3
Environmental Science & Technology
Page 4 of 30
49
model produced a posterior probability density function of estimated annual average
50
concentrations that spanned an order of magnitude from 3 to 35 ppb. Validation using
51
independent data showed improvement, with Root Mean Squared Error (RMSE) improvement of
52
6% compared with the land use regression model and 16% over the chemical transport model.
53
These estimates will be used in studies of health effects and should minimize misclassification
54
bias.
55
Introduction
56
There is evidence that short-term acute exposures to ambient air pollution cause adverse health
57
effects such as hospitalizations or deaths from cardiovascular disease and respiratory disease1.
58
Several studies have shown an association between long-term exposure to air pollution and
59
increased risks of various health outcomes2. However, evidence is lacking at the lower end of the
60
exposure-response function, because most existing studies have been conducted in cities where
61
pollutant levels are relatively high3. At lower concentrations it is important that errors associated
62
with exposure are minimized. Sydney, Australia’s most populous city (~5 million), and its
63
surrounding regions have relatively low concentrations of air pollution compared with similar
64
economically developed cities. For example, the annual average nitrogen dioxide (NO2)
65
concentration was 7.6 parts per billion (ppb) in 2011 (the focal year of our study) (source: New
66
South Wales Government air pollution data). This is lower than concentrations found in many
67
health studies in other cities around the world. In a review of 15 cohort studies, Hoek et al.4
68
found much higher long-term annual average NO2 concentrations for their cohorts with a mean
69
of 18.7 ppb (8.9 to 35.0 ppb). Rome, for example, is a similar population size to Sydney but
ACS Paragon Plus Environment
4
Page 5 of 30
Environmental Science & Technology
70
long-term average NO2 in that cohort was 23.0 ppb, three times higher than the average in
71
Sydney.
72
Considerable research effort has been spent evaluating statistical methods which are sensitive
73
to subtle variation in exposures5. However, there is no consensus concerning the most
74
appropriate methods to estimate long-term, spatially-varying exposures, especially when the risk
75
of exposure misclassification bias is high. Valid estimates of the spatial distribution of air
76
pollution are required so that exposure-response associations can be investigated in
77
epidemiological studies. Additional research is necessary that can provide more accurate and
78
precise estimates of exposure, and to ascertain the key sources of the uncertainties that remain.
79
NO2 is a marker of traffic-related air pollution and more spatially heterogeneous compared
80
with particulate matter in urban areas where traffic is the main source6, hence our interest in
81
better exposure estimation for this pollutant. We aimed to provide improved concentration
82
estimates by improving the representativeness of the modeled spatial patterns in order to better
83
estimate exposure at the residential address. This improved characterization of the variation in
84
exposures should produce more accurate estimates of risk-response functions.
85
Progress in the development of spatially resolved air pollution exposure assessment models
86
has become increasingly sophisticated, both nationally and internationally, using new data
87
sources such as satellite data and new statistical techniques. For instance Akita et al.7, Xu et al.8
88
and Buteau et al.9 have used Bayesian blending methods to improve air pollution models in
89
North America. Meanwhile other approaches have developed such as that used by Knibbs et al.10
90
who produced improved NO2 air pollution maps for Australia using a Land Use Regression
91
(LUR) model that combined satellite images with ground based predictor variables. A different
ACS Paragon Plus Environment
5
Environmental Science & Technology
Page 6 of 30
92
approach used by Cope et al.11 in Australia applies physics and chemistry principles to model the
93
dispersion of emissions to predict pollution. Relatedly, Shaddick et al.12 recently developed a
94
global Bayesian-based PM2.5 model.
95
While several novel air pollution exposure assessment methods are therefore now available,
96
few studies have sought to combine methods to leverage the best attributes of each, particularly
97
in low exposure settings where uncertainty of the measurements will be a key challenging point
98
(i.e. close to detection limits). Therefore, the purpose of this study was to blend together
99
information from multiple data sources to estimate annual average NO2 concentrations at the
100
level of Sydney neighborhoods (small areas incorporating housing within hundreds of meters of
101
roads). We implemented the Bayesian Maximum Entropy (BME) approach13 to integrate
102
modeled predictions with measurements of NO2, to provide a new spatial model. Our goal was to
103
blend data from multiple sources based on BME (a method which has been recently used
104
internationally7), but our aim was not to compare BME against any other blending methods. To
105
date no Bayesian blending methods have previously been implemented in the Sydney study area
106
specifically, and only occasionally in air pollution modelling more generally, therefore this work
107
is filling a key knowledge gap.
108
Methods and materials
109
Our approach involved blending fixed-site monitor data with a satellite-based Land Use
110
Regression model (SatLUR) and Chemical Transport Model (CTM). The resulting estimates
111
were validated against passive samplers because they were independent of all other data inputs.
112
Our input data are shown in panels A, B and C of Figure 1. Panels B and C show that NO2
113
varies substantially across Sydney, and SatLUR estimates display greater variation than CTM.
ACS Paragon Plus Environment
6
Page 7 of 30
Environmental Science & Technology
114
Panel D shows the derived trend surface of all inputs, used to adjust for regional spatial
115
autocorrelation prior to our spatial modeling.
116 117
Figure 1. Maps of A) NO2 monitoring sites with roads, B) Satellite based Land Use Regression
118
model (SatLUR), C) Chemical Transport Model (CTM) and D) the inverse distance weighted
119
smoothed surface used as an offset for our Bayesian Maximum Entropy (BME) modeling.
120
Study area and period
121
Our study area was the Sydney greater metropolitan region of New South Wales (NSW),
122
Australia. This is the most populous region in Australia. The study area covered over 17,000
123
square kilometers, and had an estimated population in 2015 of 5.8 million (Australian Bureau of
124
Statistics, LGA population estimates 2015). Our study period comprised the years 2011 (for the
ACS Paragon Plus Environment
7
Environmental Science & Technology
Page 8 of 30
125
predictive modeling) and 2013-2014 (for the validation dataset) which we selected based on the
126
availability of pollution data and because it coincides with the five-year follow up of the
127
longitudinal “45andUp” cohort study with a sample size of 99,317 persons in the greater Sydney
128
region which adds to the applicability of the study to support health research (see
129
www.saxinstitute.org.au for more information on the cohort). The comparability of the two
130
periods was supported by our exploratory analyses that showed that the spatial pattern of annual
131
averages between these periods was stable (see the Supporting Information document).
132
The Sydney region has relatively low concentrations of air pollution when compared with
133
similar economically developed cities around the world; the mean of annual average daily NO2
134
concentrations at 14 fixed-site monitors in 2011 was 7.6 ppb, ranging from 1.8 to 13.0 ppb at
135
individual monitors (NSW Office of Environment and Heritage monitor data,
136
http://www.environment.nsw.gov.au/AQMS/search.htm).
137
Fixed-site monitoring NO2 measurements
138
During the study period The NSW Office of Environment and Heritage (OEH) monitored the
139
concentrations of NO2 at 14 sites across the Sydney Metropolitan area for regulatory compliance.
140
We obtained daily average ground-level NO2 measurements for January to December 2011 from
141
the OEH data portal (http://www.environment.nsw.gov.au/AQMS/search.htm). NO2
142
concentrations were measured using the standard chemiluminescence methods. The
143
measurements had undergone quality assurance procedures via internal processing prior to public
144
release of data online (see http://www.environment.nsw.gov.au/resources/air/nepm-air-
145
monitoring-plan.pdf for details).
ACS Paragon Plus Environment
8
Page 9 of 30
146
Environmental Science & Technology
We computed annual averages for each of the 14 monitors using data for all days where
147
concentrations had been recorded between January and December 2011. All sites had more than
148
90% of days where concentrations had been recorded. The annual average was calculated using
149
the mean of 24 hour daily averages as originally provided.
150
Satellite-based Land Use Regression Model
151
We used a national satellite-based NO2 Land Use Regression model (SatLUR) developed by
152
Knibbs et al.10. The model development and validation are described extensively elsewhere
153
(Knibbs et al.10; Knibbs et al.14). The SatLUR model used data from fixed-site government-run
154
monitors from networks around Australia as the outcome variable, and the model incorporated
155
high-resolution land-use predictors across Sydney (including roads, impervious surfaces,
156
industrial point sources, and industrial land use) and satellite observations of NO2 from the ozone
157
monitoring instrument aboard the Aura satellite. Two estimates of SatLUR derived NO2
158
concentrations were available: one based on surface data (estimated using surface-to-column
159
ratios from the WRF-Chem model), and the second based on the estimates of total tropospheric
160
NO2 column density. The two LUR models developed using the separate satellite estimates
161
produced similar results but the column-density model estimates were used in this study due to
162
ease of implementation compared with the surface model10. The column model captured up to
163
81% of spatial variability in annual average NO2 concentrations when cross-validated, and up to
164
66% when validated against an independent set of NO2 measurements14. Predicted values of NO2
165
were estimated for Mesh Block centroids from the 2011 Census. Mesh Blocks were the smallest
166
geographical statistical unit used by the Australian Bureau of Statistics in the 2011 census. In
167
NSW in 2011 Mesh Blocks had a median population of 83 (range 3 to 1,932 persons) and mean
168
area of 8.8 square kilometers (range 0.0005 to 12,541 km2) (Australian Bureau of Statistics data).
ACS Paragon Plus Environment
9
Environmental Science & Technology
Page 10 of 30
169
Chemical Transport Model
170
Chemical Transport Models (CTMs) were used to estimate concentrations of NO2 as one of
171
our data inputs. The CTMs used emissions data and physical dispersion modeling to predict NO2
172
air pollution concentrations. We used NO2 predictions from the CTMs developed by Cope et al.11
173
for July 2010 through June 2011. The model comprised a prognostic meteorological model, the
174
2008 NSW OEH air emissions inventory (developed to describe the emissions from the shipping
175
industry as well as traffic and industry emissions), and a chemical transport and particle
176
dynamics model11. The model used a 1 × 1 kilometer grid cell for the inner region around the
177
central urban area of Sydney and a 3 × 3 kilometer square grid cell for the larger area (Greater
178
Sydney Metropolitan Region). Daily predicted NO2 average concentrations from the CTM were
179
used to calculate monthly concentrations, which were then combined to calculate annual average
180
NO2 concentrations.
181
Passive samplers
182
The resulting estimates from the BME model were validated against passive samplers data
183
because they were independent of all other data inputs. Forty-seven Ogawa passive samplers
184
were deployed at sites located within the metropolitan area of the city that were selected to
185
capture within-city and near-road variability in NO2. Samples were taken for two week periods
186
during July-August (winter) 2013, November-December (summer) 2013 and March-April
187
(autumn) 2014. We used data for this period because no other data were collected during 2011.
188
The fixed-site monitors showed minimal change in concentration between 2011 and 2014 (see
189
Supporting Information document). Passive sampling locations were chosen to capture the full
190
range of expected NO2 concentrations across the city, based on knowledge of land use, traffic
ACS Paragon Plus Environment
10
Page 11 of 30
Environmental Science & Technology
191
conditions and industrial sources. Passive sampler siting followed the method as outlined in the
192
European Study of Cohorts for Air Pollution Effects (ESCAPE) protocol15. For more details see
193
the Supporting Information document.
194
Conversion to annual mean NO2 was conducted using standard protocols14. Passive sampler
195
measurements were adjusted to an estimated annual mean using the ratio of mean NO2 measured
196
by fixed-site monitors during each measurement period, compared with its annual mean. The
197
ratio was calculated based on three separate fixed-site monitors in the study area. The selection
198
criteria for the fixed-site monitors and the adjustment process are described in the original
199
paper14.
200
Bayesian Maximum Entropy spatial model
201
We implemented the BME geostatistical approach13. BME is a statistical framework for
202
spatial estimation that can process a wide range of general and site specific knowledge based on
203
entropy maximization and epistemic Bayesian processing rules. We used the BMElib13 software
204
which is a numerical implementation of the BME framework in the case where the general
205
knowledge includes mean and covariance, and the site specific knowledge consist of hard and
206
soft data. Where we talk specifically about the implementation of BME, we refer to the
207
numerical package in the BMEprobaMoments function found in the BMElib13 software. We
208
created an exposure surface by blending the measured concentrations of NO2 at the fixed-site
209
monitors with the two models (SatLUR and CTM). These data all included various sources of
210
uncertainty such as measurement error, emission inventory assumptions, and spatial uncertainty
211
about the geographical predictors (such as industrial land use classes). Each data point therefore
ACS Paragon Plus Environment
11
Environmental Science & Technology
212
contained information about the uncertainty that the BMElib model used when estimating the
213
concentration.
214
The model approach is similar to spatial interpolation, which is a method used to generate
215
estimates across an entire region (e.g. the city of Sydney) from a scattered set of data points in
216
that region (e.g. monitors and model estimation points). BMElib is similar to the kriging
217
algorithm common in GIS software which estimates the interpolated values using a statistical
218
model governed by the covariance relationships between data points across space.
219
Page 12 of 30
The BME framework allows prior information to be weighted, based on a Probability Density
220
Function (PDF). A PDF represents the known or assumed likelihood of all the possible values.
221
The PDF is used to specify the probability of the value falling within a particular range of values.
222
PDFs can be combined to give estimates that are based on all the available knowledge. To do
223
this, the BME framework uses maximum entropy, which is a concept used in information theory.
224
Maximum entropy involves a modest claim being made regarding certainty about the
225
probabilities of expected values, and PDFs can be combined to achieve a maximally informative
226
model, given the uncertain general knowledge base. Thus, the ‘informativeness’ requirement is
227
mathematically expressed in terms of a maximum entropy condition of the type based on the
228
Shannon information concept13. The estimated PDF is the one that best represents the current
229
state of knowledge asserted by the prior PDFs. This is referred to as the posterior PDF.
230
When the simplest constraints are used in BME, no PDF is assigned as all the data points are
231
assumed to be the ‘true’ value at that spatial location. The BMEprobaMoments algorithm then
232
reduces to the kriging algorithm. However, if additional information is included to apply
233
probabilistic constraints then each datum is assigned a PDF. The PDF explicitly models the
ACS Paragon Plus Environment
12
Page 13 of 30
Environmental Science & Technology
234
measurement error in each datum using a distribution to reflect uncertainty that is centered on the
235
observed datum. Incorporating these distributions is straightforward in a Bayesian paradigm and
236
including measurement error in the model more closely captures reality. This should give a better
237
estimate of the mean pollutant concentration as well as be better at capturing uncertainty than a
238
standard geostatistical model16.
239
In this study, we applied probabilistic constraints on our prior expectations of the uncertainty
240
around that central estimate (i.e. the range and shape of the PDF). Then using the BMElib
241
program13, these prior PDFs are blended together and a posterior PDF produced. The mean of the
242
BME posterior PDF minimizes the mean square estimation error and was used as our estimate of
243
NO2 in the output result.
244
We categorized our data into two groups: (i) hard data: the fixed-site monitor measurements
245
with high precision or minimal uncertainty; and (ii) soft data: the SatLUR and CTM model
246
estimates with an uncertainty characterized by a PDF.
247
Parametrization for prior probability density functions of soft data
248
We created the PDFs with a triangular distribution where the mean was set to the concentration
249
estimated by either of our two models (the SatLUR or CTM model), and the corresponding range
250
(the minimum and maximum for the triangular distribution) was estimated from the uncertainties
251
associated with the expected values at each location, based on how well they reproduce the co-
252
located observed values from the NO2 passive samplers within each location. This meant that
253
areas with less observed measurement error had a narrower PDF. We used the Root Mean
254
Squared Error (RMSE) of these paired predictions and observations to determine the range of the
255
prior PDFs. This was 2.7 ppb for the SatLUR and 3.1 ppb for the CTM.
ACS Paragon Plus Environment
13
Environmental Science & Technology
Page 14 of 30
256
The SatLUR estimates for Mesh Block centroids were used as one type of soft data input. We
257
extended the spatial coverage of these by also incorporating a grid of points equally spaced at 2.5
258
kilometers across the region (because all points within a Mesh Block give the same value), so
259
that each estimation node in the modeled output would have guaranteed data inputs within the
260
search radius used by the BMEprobaMoments algorithm.
261
De-trending data inputs using a global offset
262
Similar to kriging, BMElib relies on the assumption that the random errors have zero mean and
263
constant variance, and so trend removal can help satisfy assumptions of normality and
264
stationarity. The first stage in our analysis was to develop a smoothed trend surface. We used an
265
inverse distance weighted average smoother with 300 × 300 meter grid cell resolution as an
266
offset to de-trend the input datasets. We chose this resolution so that the smooth trend should
267
produce low variance in the residuals while maintaining adequate residual autocorrelation. As
268
Xu et al.8 point out this is because when a trend surface is smoother and less variable across
269
space, the transformed residuals retain more variability. Conversely as the surface becomes less
270
smooth the residuals are less autocorrelated. Residuals for each data input and this offset surface
271
were calculated, along with the upper and lower limits of the PDFs.
272
Determination of the spatial covariance model
273
The experimental covariance model was estimated based on the residuals of all hard and soft
274
data within a 10 × 10 kilometer square that included high expected concentrations of NO2. We
275
used the ArcMap 10.2.1 kriging cross-validation tool to optimize the parameters for an
276
exponential spatial covariance model. The covariance sill reported by ArcGIS was 1.04. The
ACS Paragon Plus Environment
14
Page 15 of 30
Environmental Science & Technology
277
covariance range was 1238.5 meters. Please see the Supporting Information document for further
278
details on the spatial covariance model.
279
Estimation of the BME posterior PDFs
280
We computed the BME posterior PDFs at the nodes of a two dimensional network of
281
estimation points on the Australian Map Grid (GDA94, zone 56), with regular spacing every 100
282
m within 10 × 10 kilometer squares. We chose this resolution as a trade-off between
283
computational time and our aim to model NO2 at the level of street addresses. We did this within
284
these subsets to improve the efficiency of computational work and allow more rapid testing and
285
revision of the parametrizations. We then combined all resultant subset squares into a mosaic for
286
the full study region. We used the mean estimate of NO2 from the posterior PDFs as our
287
predicted concentration. During computation we allowed the inclusion of data points external to
288
the squares out to a buffer distance of 1200 meters to minimize any discontinuity at the
289
boundaries when the squares were combined. This buffer distance means that all estimation
290
nodes within each subset square has a complete set of input data points within the search radius,
291
even when the point is on the boundary of the subset square. We set the maximum search radius
292
for each estimation node at 1,240 meters so that the experimental covariance range was met, and
293
to ensure that each node would include some data points (e.g. soft data for SatLUR were at the
294
equally spaced 2,500 meter grid in addition to each of the Mesh Block centroids). We assumed
295
that the mean was constant over the estimation neighborhood, after assessing the frequency
296
distribution of our de-trended residuals from the global offset. The BMElib model returned
297
estimates that represent the mean value which minimizes the estimation error variance of the
298
posterior PDF. This value is on the scale of the residuals and so when added to the global offset
299
the result provides the estimated long-term NO2 concentrations.
ACS Paragon Plus Environment
15
Environmental Science & Technology
Page 16 of 30
300
Validation methods
301
We used an external validation method to assess our model. Our validation dataset comprised
302
NO2 passive sampler observations (adjusted for seasonality using the method described in
303
Knibbs et al.14). In this approach we assessed the agreement between the modeled estimates and
304
a set of unrelated, independent measurements. We calculated the RMSE for our BMElib model
305
as well as the two models used as data inputs (SatLUR and CTM).
306 307
We used Matlab R2016a, BMElib2.0b, ArcGIS 10.2.1 and R version 3.2.5 for data preparation and analyses.
308
Results and discussion
309
The BMElib model estimated annual average NO2 concentrations are shown in Figure 2.
310 311 312 313
Figure 2. Maps of NO2 annual average (ppb) estimated by Bayesian Maximum Entropy blending of various data inputs for A) Sydney and B) Liverpool (a built-up south-western suburb of Sydney)
ACS Paragon Plus Environment
16
Page 17 of 30
Environmental Science & Technology
314
Validation statistics
315
We compared the RMSE for the CTM, SatLUR and BMElib output (Figure 3D). In all models
316
there was an indication of under-prediction at levels above 10 ppb (Figure 3 A-C).
317 318
Figure 3. Scatter plots of measured versus estimated NO2 annual averages (ppb) estimated by
319
A) Chemical Transport Model (CTM), B) Satellite-based Land Use Regression (SatLUR) and
320
C) Bayesian Maximum Entropy (BME) integration of fixed-site monitors, CTM and SatLUR
321
data. Panel D shows the Root Mean Squared Error (RMSE) in ppb for each modeling approach.
322
ACS Paragon Plus Environment
17
Environmental Science & Technology
323
Page 18 of 30
Table 1. Root Mean Squared Error (RMSE) for each model against the validation dataset. Modeling method
Root Mean Squared Error (ppb)
Chemical Transport Model (CTM)
3.1
Satellite-based Land Use Regression (SatLUR)
2.8
Bayesian Maximum Entropy (BME)
2.6
324 325
Key results and comparison with other studies
326
We found that the BMElib resulted in a reduced RMSE against the validation dataset when
327
compared to the RMSE for the CTM and SatLUR. The magnitude of the difference between the
328
three RMSE was 0.5 ppb (BMElib compared to CTM) and 0.2 (BMElib compared to SatLUR).
329
As a percentage difference this was a 16.2% improvement over the CTM and 5.7% improvement
330
over the SatLUR.
331
A similar study by Akita et al.7 in Spain found larger improvements in predictive accuracy of
332
NO2 using BME compared with their CTM and LUR (not Satellite-based). Akita et al. found that
333
their BME model reduced the RMSE for validation data by approximately 40% from both their
334
CTM and LUR models, and over 60% from an ordinary kriging model of sensor data. A
335
possible reason for this is that NO2 concentrations in Sydney were lower than concentrations
336
measured in Spain. Our maximum predicted concentration was 35 ppb in Sydney, whereas
337
Spain had a maximum concentration of 58 ppb (converted from the original predicted NO2 value
338
reported in Akita et al. using the conversion factor of 1.9125).
339
Alternatively, the difference may reflect higher underlying precision of our input model data
340
(our CTM and SatLUR). The CTM used by Akita et al. was the CALIOPE Air Quality Forecast
ACS Paragon Plus Environment
18
Page 19 of 30
Environmental Science & Technology
341
Modeling (AQFM) system and was reported to have a RMSE against the fixed-site monitors of
342
6.9 ppb. In contrast, the CTM from Cope et al.11 which we used has a smaller RMSE of 3.1 ppb
343
against our passive sampler validation dataset. Therefore the BME model RMSE of 4.0 ppb (7.6
344
µg/m3) reported by Akita et al. represented a substantial improvement over their CTM. An
345
additional likely explanation is that there were only 14 observations used as hard data here, while
346
there were a lot more observations (N=80) in the Akita et al. model, therefore we would expect a
347
much greater drop in RMSE in that study than here. While our relative improvements were more
348
modest, they are still important for epidemiological studies because the small relative risks
349
arising from air pollution exposures (e.g., of the order of ~1.04), are assigned to large population
350
numbers, which results in a large population attributable risk from air pollution.
351
Strengths and weaknesses
352
Our project demonstrates an alternative to similar applications of the BME in two important
353
ways. First, we de-trended the input datasets using an inverse distance weighted smoother to
354
obtain our offset, which is a robust alternative to other options and requires fewer assumptions
355
regarding the spatial autocorrelation structure. To the best of our knowledge this is a new
356
contribution to the approach. Second, we defined the range of the prior probability intervals
357
using a triangular density function based on the agreement of the input data with the external
358
validation dataset. This is an important parameter and using our approach we estimate the
359
uncertainty against an independent set of external observed data, rather than from internal model
360
cross-validation.
361 362
In general, the strengths of the BME approach include the ability to combine all available data inputs despite their different temporal and spatial attributes into one estimate, making use of all
ACS Paragon Plus Environment
19
Environmental Science & Technology
363
available prior information regarding uncertainties. Weaknesses of our study include that the
364
workload for preparing the data and conducting the analysis was very labor intensive. In
365
addition, the actual calculations themselves were computationally intensive and required many
366
hours of computer time to complete.
Page 20 of 30
367
A limitation of this study was the incomplete seasonal coverage of the passive sampler data
368
used as the external validation. The seasonal adjustments made by Knibbs et al.14 of the three ×
369
fortnightly samples is a widely used method for passive samplers which is likely to make them
370
representative of annual (long-term) averages. Even so, the potential remains that the sampled
371
seasonal coverage of the measurements might not fully capture the annual average conditions.
372
One major conclusion of our study is that the blending done in the BMElib model showed an
373
improvement in predictive performance in comparison to either the CTM or SatLUR alone.
374
However the three modelled outputs are not exactly comparable because the SatLUR was
375
developed with a training dataset that spanned multiple cities across the Australian continent. It
376
is possible that a SatLUR model built exclusively from the Sydney data might perform better
377
than this national model17,18,19, however no city-specific SatLUR currently exists that could be
378
used for our study, so we have made use of all information at the spatial extent and resolution
379
that was available. However, this remains a limitation and it is not possible to conclude that the
380
BMElib is the optimal method based on our study. Rather, we can only conclude that BMElib
381
with a SatLUR model at a continental extent had the greatest performance to-date as measured
382
by our validation set for Sydney.
383
Future directions
ACS Paragon Plus Environment
20
Page 21 of 30
384
Environmental Science & Technology
Future research should attempt to ameliorate a key weakness of any blending model, which is
385
that they cannot overcome limitations that are common across all the data inputs (i.e.
386
observations, SatLUR and CTM). For example, it is likely that NO2 in reality may be much
387
more spatially variable than what is displayed in any of the input datasets20 and so the spatial
388
variability of the original data inputs places limits on the extent that Bayesian blending can
389
produce spatial variability in the predictions. However, the aim of such blending is to get an
390
estimate that is at least as good as the best of the available inputs, or even better. BMElib can
391
theoretically achieve this outcome by incorporating all available information and accounting for
392
uncertainty to combine all the available data into one single exposure metric, without losing
393
information from each individual data set.
394
There are a number of parameters in the model specification that might be optimized in future
395
research. One of the most important assumptions relates to the range of the interval used to
396
construct the PDF for the soft data. Also, the shape of the PDF used may be important. We used
397
a triangular PDF because of the ease of specification (only the mean, minimum and maximum
398
are needed) however other functional forms such as a normal distribution may result in further
399
improvements if they capture the prior information better, but this was beyond the scope of this
400
study. In future research we aim to experiment with ways that the prior PDF parameters can be
401
used to optimize the predictions based on validation (i.e. change the parameters and compare the
402
impact these changes have on the validation statistics).
403
Implications
ACS Paragon Plus Environment
21
Environmental Science & Technology
404
This paper presents a combination of novel approaches to the operationalization of the BME
405
spatial modeling framework for air pollution estimation and shows the validation statistics for
406
this approach provided better results than some more classical alternatives.
Page 22 of 30
ACS Paragon Plus Environment
22
Page 23 of 30
Environmental Science & Technology
407
Acknowledgements
408
The authors gratefully acknowledge contributions from individuals within Government
409
departments and private institutes involved in the measurement and modeling of air pollutant
410
concentrations.
411
Supporting Information Available
412
BME_NO2_Sydney_SI.docx describes the full details of our data preparation, parameter
413
selection procedure and some technical detail about our datasets. This information is available
414
free of charge via the Internet at http://pubs.acs.org.
415
Author Contributions
416
The manuscript was written through contributions of all authors. All authors have given approval
417
to the final version of the manuscript.
418
Funding Sources
419
Funding for this work was provided by the Centre for Air quality and health Research and
420
evaluation (CAR), an Australian National Health and Medical Research Council Centre for
421
Research Excellence, administered by the Woolcock Institute of Medical Research, University of
422
Sydney.
423
References
424
(1)
425 426 427
Pope, C. A.; Dockery, D. W. Health effects of fine particulate air pollution: lines that connect. J. Air Waste Manage. Assoc. 2006, 56 (6), 709–742.
(2)
Brunekreef, B. Health effects of air pollution observed in cohort studies in Europe. J. Expo. Sci. Environ. Epidemiol. 2007, 17, S61–S65.
ACS Paragon Plus Environment
23
Environmental Science & Technology
428
(3)
Page 24 of 30
Beelen, R.; Raaschou-Nielsen, O.; Stafoggia, M.; Andersen, Z. J.; Weinmayr, G.;
429
Hoffmann, B.; Wolf, K.; Samoli, E.; Fischer, P.; Nieuwenhuijsen, M.; et al. Effects of
430
long-term exposure to air pollution on natural-cause mortality: An analysis of 22 European
431
cohorts within the multicentre ESCAPE project. Lancet 2014, 383 (9919), 785–795.
432
(4)
Hoek, G.; Krishnan, R. M.; Beelen, R.; Peters, A.; Ostro, B.; Brunekreef, B.; Kaufman, J.
433
D. Long-term air pollution exposure and cardio- respiratory mortality: a review. Environ.
434
Health 2013, 12 (1), 43.
435
(5)
Peng, R. D.; Dominici, F. Statistical methods for environmental epidemiology with R. A
436
case study in air pollution and health; Springer Science & Business Media: New York,
437
USA, 2008.
438
(6)
Health Effects Institute Panel on the Health Effects of Traffic-Related Air Pollution.
439
Traffic-related air pollution: a critical review of the literature on emissions, exposure, and
440
health effects. Health Effects Institute Special Report 17; Boston, MA, 2010.
441
(7)
Akita, Y.; Baldasano, J. M.; Beelen, R.; Cirach, M.; de Hoogh, K.; Hoek, G.;
442
Nieuwenhuijsen, M.; Serre, M. L.; de Nazelle, A. Large Scale Air Pollution Estimation
443
Method Combining Land Use Regression and Chemical Transport Modeling in a
444
Geostatistical Framework. Environ. Sci. Technol. 2014, 48 (8), 4452–4459.(8)
445
(8) Xu, Y.; Serre, M. L.; Reyes, J.; Vizuete, W. Bayesian Maximum Entropy integration of
446
ozone observations and model predictions: a national application. Environ. Sci. Technol.
447
2016, 50 (8), 4393–4400.
ACS Paragon Plus Environment
24
Page 25 of 30
448
Environmental Science & Technology
(9)
Buteau, S.; Hatzopoulou, M.; Crouse, D. L.; Smargiassi, A.; Burnett, T.; Logan, T.;
449
Deville, L.; Goldberg, M. S. Comparison of spatiotemporal prediction models of daily
450
exposure of individuals to ambient nitrogen dioxide and ozone in Montreal, Canada.
451
Environ. Res. 2017, 156 (2), 201–230.
452
(10) Knibbs, L. D.; Hewson, M. G.; Bechle, M. J.; Marshall, J. D.; Barnett, A. G. A national
453
satellite-based land-use regression model for air pollution exposure assessment in
454
Australia. Environ. Res. 2014, 135, 204–211.
455
(11) Cope, M.; Keywood, M.; Emmerson, K.; Galbally, I.; Boast, K.; Chambers, S.; Cheng, M.;
456
Crumeyrolle, S.; Dunne, E.; Fedele, R.; et al. The Centre for Australian Weather and
457
Climate Research Sydney Particle Study-Stage-II; CSIRO and Office of Environment and
458
Heritage, Sydney, Australia, 2014; ISBN: 978-1-4863-0359-5.
459
(12) Shaddick, G.; Thomas, M. L.; Jobling, A.; Brauer, M.; van Donkelaar, A.; Burnett, R.:
460
Chang, H. H.; Cohen, A.; Van Dingenen, R.; Dora, C.; et al. Data Integration Model for
461
Air Quality: A Hierarchical Approach to the Global Estimation of Exposures to Ambient
462
Air Pollution, J. R. Stat. Soc. C. 2017. doi:10.1111/rssc.12227.
463 464
(13) Christakos, G.; Bogaert, P.; Serre, M. Temporal GIS: Advanced functions for field-based applications; Springer-Verlag: New York, 2002.
465
(14) Knibbs, L. D.; Coorey, C. P.; Bechle, M. J.; Cowie, C. T.; Dirgawati, M.; Heyworth, J. S.;
466
Marks, G. B.; Marshall, J. D.; Morawska, L.; Pereira, G.; et al. Independent Validation of
467
National Satellite-Based Land-Use Regression Models for Nitrogen Dioxide Using Passive
468
Samplers. Environ. Sci. Technol. 2016, 50 (22), 12331–12338.
ACS Paragon Plus Environment
25
Environmental Science & Technology
Page 26 of 30
469
(15) Beelen, R.; Hoek, G.; Vienneau, D.; Eeftens, M.; Dimakopoulou, K.; Pedeli, X.; Tsai, M.-
470
Y.; Künzli, N.; Schikowski, T.; Marcon, A.; et al. Development of NO2 and NOx land use
471
regression models for estimating air pollution exposure in 36 study areas in Europe – The
472
ESCAPE project. Atmos. Environ. 2013, 72 (2), 10–23.
473 474 475
(16) Diggle, P. J.; Tawn, J. A.; Moyeed, R. A. Model-Based Geostatistics. J. R. Stat. Soc. C. 1998, 47 (3): 299–350. (17) Larkin, A. J.; Geddes, A.; Martin, R. V.; Xiao, Q.; Liu, Y.; Marshall, J. D.; Brauer, M.;
476
Hystad, P. A Global Land Use Regression Model for Nitrogen Dioxide Air Pollution.
477
Environ. Sci. Technol. 2017, 51 (12), 6957–6964.
478
(18) Hoek, G.; Eeftens, M.; Beelen, R.; Fischer, P.; Brunekreef, B.; Boersma, K. F.; Veefkind,
479
P., Satellite NO2 data improve national land use regression models for ambient NO2 in a
480
small densely populated country. Atmos. Enviro. 2015, 105, 173-180.
481
(19) Bechle, M. J.; Millet, D. B.; Marshall, J. D., National Spatiotemporal Exposure Surface for
482
NO2: Monthly Scaling of a Satellite-Derived Land-Use Regression, 2000-2010. Environ
483
Sc.i Technol. 2015, 49, (20), 12297-305.
484
(20) Apte, J. S.; Messier, K P.; Gani, S.; Brauer, M.; Kirchstetter, T. W.; Lunden, M. M.;
485
Marshall, J. D.; Portier, C. J.; Vermeulen, R.C.H.; Hamburg, S. P. High-Resolution Air
486
Pollution Mapping with Google Street View Cars: Exploiting Big Data. Environ. Sci.
487
Technol. 2017, 51, (12), 6999–7008.
ACS Paragon Plus Environment
26
Page 27 of 30
Environmental Science & Technology
Figure 1. Maps of A) NO2 monitoring sites with roads, B) Satellite based Land Use Regression model (SatLUR), C) Chemical Transport Model (CTM) and D) the inverse distance weighted smoothed surface used as an offset for our Bayesian Maximum Entropy (BME) modeling. 210x148mm (300 x 300 DPI)
ACS Paragon Plus Environment
Environmental Science & Technology
Figure 2. Maps of NO2 annual average (ppb) estimated by Bayesian Maximum Entropy blending of various data inputs for A) Sydney and B) Liverpool (a built-up south-western suburb of Sydney) 99x34mm (300 x 300 DPI)
ACS Paragon Plus Environment
Page 28 of 30
Page 29 of 30
Environmental Science & Technology
Figure 3. Scatter plots of measured versus estimated NO2 annual averages (ppb) estimated by A) Chemical Transport Model (CTM), B) Satellite-based Land Use Regression (SatLUR) and C) Bayesian Maximum Entropy (BME) integration of fixed-site monitors, CTM and SatLUR data. Panel D shows the Root Mean Squared Error (RMSE) in ppb for each modeling approach. 237x127mm (300 x 300 DPI)
ACS Paragon Plus Environment
Environmental Science & Technology
TOC Art 237x127mm (300 x 300 DPI)
ACS Paragon Plus Environment
Page 30 of 30