Subscriber access provided by CORNELL UNIVERSITY LIBRARY
Article
Estimating PM2.5 Concentrations in the Conterminous United States Using the Random Forest Approach Xuefei Hu, Jessica Hartmann Belle, Xia Meng, Avani Wildani, Lance Waller, Matthew Strickland, and Yang Liu Environ. Sci. Technol., Just Accepted Manuscript • Publication Date (Web): 23 May 2017 Downloaded from http://pubs.acs.org on May 25, 2017
Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.
Environmental Science & Technology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Page 1 of 29
Environmental Science & Technology
1
Estimating PM2.5 Concentrations in the Conterminous United States Using the Random
2
Forest Approach
3
Xuefei Hu1, Jessica H. Belle1, Xia Meng1, Avani Wildani2, Lance A. Waller3, Matthew J.
4
Strickland4, Yang Liu1*
5
1
6
Atlanta, GA 30322, USA
7
2
8
USA
9
3
Department of Environmental Health, Rollins School of Public Health, Emory University,
Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322,
Department of Biostatistics & Bioinformatics, Rollins School of Public Health, Emory
10
University, Atlanta, GA 30322, USA
11
4
School of Community Health Sciences, University of Nevada Reno, Reno, NV 89557, USA
12 13
KEYWORDS: PM2.5, AOD, National, and Random Forests
14 15
* Corresponding Author:
16
Yang Liu
17
Tel: +1-404-727-2131, Fax: +1-404-727-8744
18
E-mail:
[email protected] 19 20 21
1 ACS Paragon Plus Environment
Environmental Science & Technology
Page 2 of 29
22
Abstract: To estimate PM2.5 concentrations, many parametric regression models have been
23
developed, while non-parametric machine learning algorithms are used less often and national-
24
scale models are rare. In this paper, we develop a random forest model incorporating Aerosol
25
Optical Depth (AOD) data, meteorological fields, and land use variables to estimate daily 24-
26
hour averaged ground level PM2.5 concentrations over the conterminous United States in 2011.
27
Random forests are an ensemble learning method that provides predictions with high accuracy
28
and interpretability. Our results achieve an overall cross validation (CV) R2 value of 0.80. Mean
29
prediction error (MPE) and root mean squared prediction error (RMSPE) for daily predictions
30
are 1.78 and 2.83 µg/m3, respectively, indicating a good agreement between CV predictions and
31
observations. The prediction accuracy of our model is similar to those reported in previous
32
studies using neural networks or regression models on both national and regional scales. In
33
addition, the incorporation of convolutional layers for land use terms and nearby PM2.5
34
measurements increase CV R2 by ~0.02 and~0.06, respectively, indicating their significant
35
contributions to prediction accuracy. Two different variable importance measures both indicate
36
that the convolutional layer for nearby PM2.5 measurements and AOD values are among the most
37
important predictor variables for the training process.
38
39 40
2 ACS Paragon Plus Environment
Page 3 of 29
Environmental Science & Technology
41
1. Introduction
42
PM2.5 refers to airborne particles less than 2.5 µm in the aerodynamic diameter and has been
43
linked to many adverse health outcomes including cardiovascular and respiratory morbidity and
44
mortality 1-4. Hence, obtaining accurate local PM2.5 concentrations plays a crucial role in
45
addressing many environmental public health concerns. However, measurements from central-
46
site monitors often lack adequate spatiotemporal resolution to capture variability in the study
47
population exposures, thus resulting in exposure error and biased health effect estimates, while
48
using remote sensing and air quality models can improve spatiotemporal variability of ambient
49
pollutant concentrations and related exposures5.
50
Satellite-derived aerosol optical depth (AOD) data with comprehensive spatiotemporal coverage
51
have the capability to expand PM2.5 predictions beyond those provided by ground monitoring
52
networks alone and such data have been increasingly used for PM2.5 concentration estimation in
53
areas where ground measurements are not available. AOD products used in previous studies
54
include those derived from the Moderate Resolution Imaging Spectroradiometer (MODIS) 6-8,
55
Multiangle Imaging SpectroRadiometer (MISR) 7, 8, Geostationary Operational Environmental
56
Satellite Aerosol/Smoke Product (GASP) 9, 10, Multi-Angle Implementation of Atmospheric
57
Correction (MAIAC) 11-13, and Visible Infrared Imaging Radiometer Suite (VIIRS) 14. To
58
establish the relationship between PM2.5 and AOD, previous efforts tend to focus on the
59
development of various regression models 9, 11, 15. Those models have evolved from the use of
60
AOD as the only predictor 16-18 to the incorporation of multiple additional predictors including
61
meteorological and land use variables 6, 7, 9, 13, 19, 20 and from one-stage 6, 18, 19 to multi-stage
62
models 9, 11, 15, 20, 21.
3 ACS Paragon Plus Environment
Environmental Science & Technology
Page 4 of 29
63
Although parametric regression models can sometimes reach a high level of prediction accuracy
64
with cross-validated R2 values above 0.8 22, they become increasingly complicated and encounter
65
difficulties when datasets have a large number of predictors and many data records. Compared to
66
traditional parametric regression models, non-parametric machine learning algorithms have
67
several strengths. First, they typically involve fewer and less restrictive assumptions regarding
68
independence of observations and distributions of outcomes and these assumptions are refined as
69
new data are added. Second, modelers do not need to define as strict of a set of relationships
70
among variables before implementing non-parametric machine learning algorithms as they do for
71
traditional parametric regression approaches. Examples of the use of non-parametric machine
72
learning algorithms for PM2.5 concentration estimation include Gupta and Christopher 23, who
73
used an artificial neural network (ANN) to estimate surface level PM2.5 from MODIS AOD data
74
in the southeastern United States. Zou, et al. 24 proposed a radial basis function (RBF) neural
75
network method to estimate PM2.5 concentrations in the state of Texas, USA. Reid, et al. 25used
76
the generalized boosting model (GBM) with 29 predictor variables to estimate PM2.5
77
concentrations during the 2008 northern California wildfires. Di, et al. 26 used a backward
78
propagation neural network to calibrate GEOS-Chem simulations and predicted PM2.5
79
concentrations in 1 km x 1 km grid cells in the Northeastern United States.
80
These studies demonstrate the potential of using non-parametric machine learning algorithms for
81
PM2.5 concentration estimation on the regional scale. Nevertheless, such efforts made on a
82
national scale (e.g., over the conterminous United States) are still very limited to date. Zhan, et
83
al. 27 developed a geographically-weighted gradient boosting machine (GW-GBM) to predict
84
daily PM2.5 concentrations across China. Di, et al. 28 incorporated convolutional layers into a
85
neural network to account for spatial and temporal autocorrelation and PM2.5 concentrations were
4 ACS Paragon Plus Environment
Page 5 of 29
Environmental Science & Technology
86
estimated with high accuracy (e.g., cross-validated R2 above 0.8) over the continental United
87
States from 2000 to 2012. Although neural networks tend to produce high prediction accuracy,
88
their results are difficult to interpret. Neural networks lack variable importance measures in the
89
classification, which can help estimate the strength of relationships between PM2.5 and various
90
predictors (e.g., meteorological and land use variables). Additional families of machine learning
91
algorithms provide such variable importance measures, and when building a national model for
92
PM2.5 concentration estimation, we examined random forest algorithm in this light. Like neural
93
networks, random forests provide multi-variate, non-parametric, non-linear regression and
94
classification based on a training dataset. Because random forests are an ensemble method, the
95
training time can be reliably tuned based on desired accuracy and available computing resources.
96
More importantly, the random forest approach also provides variable importance measures to
97
measure the prediction strength of each variable 29, thereby yielding more interpretable results
98
than those typically generated from neural networks.
99
In this paper, we develop a multi-variate random forest model incorporating AOD data (e.g.,
100
satellite AOD and model simulated AOD), meteorological fields, and land use variables to
101
estimate daily 24-hour averaged ground-level PM2.5 concentrations over the conterminous United
102
States for the year 2011. Prediction accuracy is evaluated by 10-fold cross validation (CV)
103
statistics, and two variable importance measures are implemented to examine the impact of each
104
predictor on PM2.5 concentration estimation on a national scale.
105 106
2. Materials and Methods
107
We define our study area as the conterminous United States consisting of 48 adjoining U.S.
108
states and Washington D.C. (Figure 1). The entire area is divided into nine climate regions,
5 ACS Paragon Plus Environment
Environmental Science & Technology
Page 6 of 29
109
including Northwest, West North Central, Northeast, East North Central, Central, West,
110
Southwest, South, and Southeast. These climate regions were identified by National Oceanic and
111
Atmospheric Administration’s (NOAA’s) National Centers for Environmental Information
112
scientists (https://www.ncdc.noaa.gov/monitoring-references/maps/us-climate-regions.php). [Insert Figure 1 about here]
113 114
2.1 PM2.5 measurements
115
The 24-hour averaged PM2.5 concentrations for 2011 collected from 1248 U.S. Environmental
116
Protection Agency (EPA) federal reference method samplers were downloaded from the EPA’s
117
Air Quality System Technology Transfer Network (http://www.epa.gov/ttn/airs/airsaqs/).
118
2.2 AOD data
119
We use satellite-derived AOD from MODIS as our primary predictor and GEOS-Chem AOD
120
simulations when satellite AOD is missing.
121
2.2.1 MODIS AOD
122
Collection 6 level 2 Aqua MODIS retrievals at 550 nm wavelength, specifically MYD04_L2
123
files, at a nominal resolution of 10 km were regridded to the 12 x 12 km2 grid used in the
124
Community Multi-Scale Air Quality (CMAQ) modeling system 30. Regridding to a static grid is
125
necessary when modeling with Level 2 MODIS data since the pixels shift in location and size
126
with each satellite overpass. We used a regridding approach based on the full area of individual
127
MODIS pixels, using pixel boundaries reconstructed from neighboring midpoints using a
128
Voronoi tessellation algorithm31. The daily average AOD value within each CMAQ grid cell was
129
calculated by averaging any MODIS pixels that fell at least partially within that cell. AOD
130
averages were calculated using only high confidence AOD retrievals from the combined deep-
131
blue and dark-target parameter (DB-DT) introduced in Collection 6 32. Our approach combines
132
the best elements of two distinct AOD retrieval algorithms, effectively creating an AOD 6 ACS Paragon Plus Environment
Page 7 of 29
Environmental Science & Technology
133
parameter that combines the accuracy users have come to expect from the dark-target algorithm
134
with the increased coverage that the deep-blue algorithm provides33. Lee, et al. 18 used Aqua
135
AOD to predict missing Terra values to increase the spatiotemporal coverage of AOD. However,
136
predicted AOD values inevitably contain additional errors that may reduce model performance.
137
Thus, we only used Aqua AOD in this study.
138
2.2.2 GEOS-Chem AOD
139
The GEOS-Chem model is a global three-dimensional model of tropospheric chemistry driven
140
by assimilated meteorological observations from the Goddard Earth Observing System (GEOS)
141
of the NASA Global Modeling Assimilation Office (GMAO)
142
(http://acmg.seas.harvard.edu/geos/doc/man/index.html). We use GEOS-Chem v10.1 in the
143
current analysis, driven by meteorological data of GEOS-5 for year 2011. The nested grid
144
simulations in North America with 0.5° x 0.666°spatial resolution produces compositional
145
AODs, including sulfate-nitrate-ammonium AOD, black carbon AOD, organic carbon AOD,
146
accumulation mode sea salt AOD, coarse mode sea salt AOD and total dust AOD at 3-h
147
intervals. We calculate total column AOD by summing all 6 AOD parameters over 37 vertical
148
layers (up to ~20 km from the surface) 34.
149 150
2.3 Meteorological fields
151
We obtained meteorological fields from two separate datasets. The first is the North American
152
Regional Reanalysis (NARR) (http://www.emc.ncep.noaa.gov/mmb/rreanl/) with a spatial
153
resolution of ~32 km and a temporal resolution of three hours. The second is the North American
154
Land Data Assimilation System Phase 2 (NLDAS-2) (http://ldas.gsfc.nasa.gov/nldas/) with a
155
spatial resolution of ~13 km and a temporal resolution of one hour. The meteorological fields
7 ACS Paragon Plus Environment
Environmental Science & Technology
Page 8 of 29
156
used in this analysis include air temperature, dew point temperature, visibility, pressure, potential
157
evaporation, downward longwave radiation flux, downward shortwave radiation flux, relative
158
humidity, u-wind (east-west component of the wind vector) and v-wind (north-south component
159
of the wind vector). All meteorological measurements for the period from 10 am to 4 pm local
160
standard time were averaged to generate daytime meteorological fields. The averaged
161
meteorological fields represent the average weather condition at the Aqua overpass time (~1:30
162
pm local time) and reduce the adverse impact of possible extreme weather occurred at the
163
satellite overpass time on the relationship between PM2.5 and meteorological variables.
164 165
2.4 Land use variables
166
We downloaded elevation data at a spatial resolution of ~30 m from the National Elevation
167
Dataset (NED) (http://ned.usgs.gov). We extracted road network data, including limited access
168
highway, highway, and local roads, from ESRI StreetMap USA (Environmental System
169
Research Institute, Inc., Redland, CA). We obtained forest cover and impervious surface, both at
170
the spatial resolution of ~30 m, from the 2011 Landsat-derived land cover map and impervious
171
surface map downloaded from the National Land Cover Database (NLCD)
172
(http://www.mrlc.gov). We obtained primary PM2.5 and PM10 emissions data (tons/year) from the
173
2011 EPA National Emissions Inventory (NEI) facility emissions report
174
(https://www.epa.gov/air-emissions-inventories/2011-national-emissions-inventory-nei-data).
175
Finally, we obtained 2010 population density data at the census tract level from the U.S. Census
176
Bureau (https://www2.census.gov/geo/tiger/).
177 178
2.5 Regional and temporal dummy variables
8 ACS Paragon Plus Environment
Page 9 of 29
Environmental Science & Technology
179
The PM2.5-AOD relationship has been shown to exhibit some regional variation due to
180
meteorological conditions 35, and there are also daily variations in this relationship 18. Hence, we
181
include a dummy climate region variable to account for regional variations and monthly and
182
daily dummy variables to account for daily variations.
183 184
2.6 Data integration
185
All data were re-projected to the USA Contiguous Albers Equal Area Conic coordinate system.
186
For the model training dataset, we assigned meteorological fields and AOD data to each PM2.5
187
monitoring site using the nearest neighbor approach. We averaged forest cover, impervious
188
surface, and elevation values and summed road length and point emission values over a 1 km x 1
189
km square buffer centered on each PM2.5 monitoring site. We assigned each monitoring site a
190
population density value of the census tract containing its location. For prediction dataset, the
191
same procedure was applied to each MODIS grid cell.
192 193
2.7 Methods
194
2.7.1 Convolutional layer
195
To account for spatial and temporal autocorrelations , we adopt the distance-inversed weighted
196
average function proposed by Di, et al. 28 to create convolutional layers for nearby PM2.5
197
measurements and all land use terms and use these layers as ordinary input predictor variables in
198
our model. For each monitoring site and MODIS grid cell, distance-inversed weighted averages
199
of nearby PM2.5 measurements and land use terms were calculated for that site and grid cell.
200
Although it is optimal to use monitoring sites within a certain distance, due to the difficulties in
201
identifying such a distance, we used all monitoring sites in the conterminous U.S. to create the
9 ACS Paragon Plus Environment
Environmental Science & Technology
Page 10 of 29
202
convolutional layers, and this also helps to reduce possible missing values in convolutional
203
layers. The kernel function can be expressed in general terms as n
zj
204
∑ = ∑
i =1 n
wij z i
(1)
w i =1 ij
205
where zj is the value of convolutional layer at monitoring site or grid cell j, zi is the PM2.5 and
206
land use terms at monitoring site i, and wij ∝
207
monitoring site i). When creating convolutional layers for monitoring site j, the PM2.5 and land
208
use terms at monitoring site j were not included in the calculation.
209
2.7.2 Model structure and validation
210
Random forests are a set of decision trees, and each tree is constructed using the best split for
211
each node among a subset of predictors randomly chosen at that node. In the end, a simple
212
majority vote is taken for prediction. Random forests are user-friendly and have only two
213
parameters, including mtry (the number of predictors sampled for splitting at each node) and ntree
214
(the number of trees grown). The random forest algorithm first draws ntree bootstrap samples
215
from the original data, and then for each sample, grows an unpruned classification or regression
216
tree with mtry of predictors randomly sampled and the best split chosen at each node. Finally,
217
predictions are made by aggregating the predictions of ntree trees (e.g., majority vote for
218
classification and average for regression). The error rate can be calculated using predictions of
219
out-of-bag samples (the data not in the bootstrap sample) 36, 37. In this study, by comparing
220
results of different settings of mtry and ntree, we set mtry as 12 and ntree as 1000 to achieve the best
221
prediction accuracy.
1 ( d ij is the distance between grid cell j and d ij2
10 ACS Paragon Plus Environment
Page 11 of 29
Environmental Science & Technology
222
We developed a random forest model incorporating AOD data, meteorological fields, land use
223
variables, and their convolutional layers to estimate ground-level PM2.5 concentrations over the
224
conterminous United States in 2011. The predictor variables include
225
(1) daily 24-hour averaged ground level PM2.5 measurements (µg/m3) ; (2) the x-y coordinates of
226
monitoring sites or MODIS grid cell centroids in the Albers projection used for this analysis; (3)
227
dummy variables: month, day, and climate regions; (4) meteorological fields: NARR variables
228
(dew point temperature (K), visibility (m), 2-m pressure (pa), 10-m pressure (pa), 30-m pressure
229
(pa), 30-m temperature (K), 180-150mb temperature (K)), NLDAS variables (potential
230
evaporation (kg/m2), downward longwave radiation flux (W/m2), downward shortwave radiation
231
flux (W/m2), connective available potential energy (J/kg), pressure (pa), 2-m temperature (K), 2-
232
m relative humidity (%),east-west component of the wind vector (m/sec) at 10 m, north-south
233
component of the wind vector (m/sec) at 10 m); (5) land use terms: elevation (m), local road (m),
234
forest cover (unitless), primary PM2.5 point emissions (tons/year), impervious surface (%), and
235
population density (persons/km2); (6) convolutional layers for elevation (m), local road (m),
236
forest cover (unitless), primary PM2.5 point emissions (tons/year), impervious surface (%),
237
population density (population/km2), highway (m), limited access highway (m), primary PM10
238
point emissions (tons/year), nearby PM2.5 measurements (µg/m3); (7) MODIS AOD (unitless)
239
and GEOS-Chem simulated AOD (unitless) used when MODIS AOD was missing.
240
We applied a 10-fold cross validation (CV) technique to establish and validate our prediction
241
results. The entire training dataset was randomly split into 10 subsets with each subset containing
242
approximately 10% of the training data. In each round of cross validation, we use nine subsets
243
for model training and make predictions for the held-out test subset. The process was repeated 10
244
times until every subset was tested. In addition, we also conducted a spatial CV and a temporal
11 ACS Paragon Plus Environment
Environmental Science & Technology
Page 12 of 29
245
CV, based on partitioning the training dataset by monitoring site and day of year, respectively.
246
We calculated statistical indicators such as R2, Mean Prediction Error (MPE), and Root Mean
247
Squared Prediction Error (RMSPE) between CV predictions and observations to assess the
248
prediction accuracy of the proposed model for the entire study area and study period. We also
249
ran the model and conducted CV for each season and each climate region separately. We
250
calculate two importance values for each predictor variable to examine what variables provide
251
the largest impact on our predictions. The first measure is the increase of mean square errors
252
(MSE) of predictions which are estimated using out-of-bag samples as a result of each predictor
253
variable being permuted. The higher the number, the more important the predictor variable. The
254
second one is the increase in node purities from splitting on the predictor variable. Node purity
255
measures the similarity of responses at a node across regression trees in the runs of the random
256
forest model. More useful variables achieve higher increases in node purities 37. All modeling
257
was done using R statistical software version 3.3.2 38.
258
3. Results
259
3.1 Descriptive statistics
260
The mean, standard deviation, maximum, and minimum for all variables over 2011 are presented
261
(Table S1). The annual mean PM2.5 concentrations is 9.69 µg/m3 for the continental U.S. in
262
2011, and the overall mean of MODIS AOD is 0.14. The percentage of missing MODIS AOD is
263
~68% in the conterminous U.S. during 2011. A Pearson’s correlation was conducted for all
264
variables (Table S2) to avoid potential multicollinearity problems. The results show that the
265
correlation coefficients between most of the variables are relatively low. It should be noted that
266
random forests are applicable even with highly correlated variables 39.
267
3.2 Results of model validation
12 ACS Paragon Plus Environment
Page 13 of 29
Environmental Science & Technology
268
Table 1 and Figure 2 presents the overall, spatial, and temporal CV validation results for the
269
entire study area and study period, including the values of R2, MPE, and RMSPE between CV
270
predicted values and observations. The results show that the overall R2 reached a relatively high
271
value of 0.80. MPE and RMSPE for daily predictions are 1.78 and 2.83 µg/m3, respectively,
272
indicating a good agreement between CV estimates and observations over the conterminous
273
United States in 2011. In addition, using GEOS-Chem simulated AOD to replace MODIS AOD
274
reduced CV R2 by ~0.01 to 0.79, indicating the feasibility of using simulated AOD when
275
MODIS AOD is not available. Nevertheless, mirroring previous studies, we find prediction
276
accuracy varies by season and climate region (Tables 2 and 3). For instance, model performance
277
is the best in summer, followed by winter and fall, and the worst in spring in terms of CV R2.
278
For different climate regions, prediction accuracy is relatively high in west, northeast, central,
279
and east north central regions, and low in northwest and west north central regions. The model
280
has intermediate performance in southeast, south, and southwest regions.
281
[Insert Tables 1-3 about here]
282
[Insert Figure 2 about here]
283 284
3.3 Variable importance assessment
285
Figure 3 illustrates the results of variable importance assessment for predictor variables in the
286
final model. The results show that the convolutional layer for nearby PM2.5 measurements and
287
AOD are among the top five most important predictor variables for both importance measures.
288
[Insert Figure 3 about here]
289
3.4 Contribution of convolutional layers
290
We compared the CV R2 values between models with and without convolutional layers (Table
291
S3). The results show that incorporation of convolutional layers for land use terms could increase 13 ACS Paragon Plus Environment
Environmental Science & Technology
Page 14 of 29
292
CV R2 by ~0.02, and inclusion of the convolution layer for nearby PM2.5 measurements increases
293
CV R2 by ~0.06, indicating a strong contribution of these convolutional layers to prediction
294
accuracy. We also compared CV R2 between models using our convolutional layers and surfaces
295
yielded using universal kriging (Table S4). The results show that CV R2 of the model using
296
surfaces generated using universal kriging is ~0.04 lower than that of our model, indicating that
297
geostatistical interpolation methods such as universal kriging can also be used to generate
298
convolutional layers.
299
3.5 Estimation of PM2.5 concentrations
300
Due to the use of GEOS-Chem simulated AOD when MODIS AOD is missing, the coverage of
301
our predictions extends from ~32% to full coverage. Our annual mean PM2.5 surface on the
302
CMAQ grid (12 x 12 km2) over the continental U.S. in 2011 appears in Figure 4. The results
303
show that PM2.5 predictions (Figure 4a) and observations (Figure 4b) share a similar spatial
304
pattern. PM2.5 concentrations are generally higher in the eastern U.S. (e.g., the southeast and
305
central regions) than in the western part. There are also areas with high predicted and observed
306
PM2.5 levels in the western U.S. (e.g., the San Joaquin Valley in California). High concentrations
307
also occur in large urban centers such as Indianapolis, IN, Houston, TX, Columbus, OH,
308
Chicago, IL, and Los Angeles, CA. Figure 4c illustrates the difference between annual mean
309
PM2.5 predictions and ground measurements at each ground monitor and difference
310
interpolations over the continental U.S. generated using the inverse distance weighted (IDW)
311
interpolation. Figure 4c shows that underestimation mostly occurs in areas with high
312
concentrations which are mainly in the eastern U.S., while overestimation appears primarily in
313
the western U.S. where PM2.5 concentrations are generally low. The observed differences at
14 ACS Paragon Plus Environment
Page 15 of 29
Environmental Science & Technology
314
~95% of ground monitors are within ±4 µg/m3. Seasonal PM2.5 maps (Figure S1) show that air
315
pollution is more severe in summer, particularly in the eastern U.S. [Insert Figures 4 about here]
316 317
4. Discussion
318
Our approach has several strengths. First, we used random forests to build a PM2.5 predictive
319
model on a national scale and demonstrated that such an approach can achieve high prediction
320
accuracy, even across so broad a geographic area. For instance, on the regional scale, our model
321
yields a similar CV R2 to previous studies conducted in the northeastern U.S. 22 and the southeast
322
40
323
region. On the national scale, the prediction accuracy of our model is comparable to that of a
324
neural network approach proposed by Di, et al. 28. Both their model and ours used many similar
325
predictor variables such as AOD, land use terms, and meteorological data. However, our final
326
model did not include those variables with low importance such as additional GEOS-Chem
327
outputs, Normalized Difference Vegetation Index (NDVI), absorbing aerosol index (AAI),
328
boundary layer height, and albedo. In fact, including them reduced prediction accuracy. We also
329
tested interactions between climate regions and meteorological variables, between day and
330
meteorological variables, between climate regions and AOD, and between day and AOD. These
331
interaction terms did not help improve prediction accuracy and thus were not included in the
332
final model. In addition, we included some variables not included in Di, et al. 28 such as forest
333
cover and percent impervious surface, both of which show high importance in our model.
334
Compared to their neural network approach based on more than 50 predictors, our random forest
335
approach has fewer predictors (~40). In addition, the neural network approach involves a more
336
complex, two-stage structure, while our random forest approach is based on a simpler, one-stage
337
structure.
in 2011, even though we optimize our model for the continental U.S. instead of within each
15 ACS Paragon Plus Environment
Environmental Science & Technology
Page 16 of 29
338
Another strength of our random forest approach is that it provides an importance estimate for
339
each predictor variable. Variable importance measures can help pinpoint what variables are most
340
important in the reduction of prediction errors and thus make variable selection more efficient
341
than neural networks. Our random forest approach provided two types of variable importance
342
measures during the training process, including the increase in MSE and the increase in node
343
purities, both of which provided additional insight into the set of predictor variables yielding the
344
greatest improvements in prediction accuracy. In our implementation, when we chose variables,
345
we tried to avoid highly correlated variables that may severely affect original importance
346
measures 39. To make importance values directly comparable, we first added all potential
347
predictor variables into the random forest model and calculated importance values (e.g., the
348
increase in MSE) for all variables, and then compared the prediction accuracy of models
349
incorporating different combinations of variables to identify the best set of predictors among
350
those under consideration. Díaz-Uriarte and Alvarez de Andrés 41 suggested a backward
351
elimination strategy and iteratively fitted random forests, at each iteration building a new forest
352
after discarding variables with the smallest variable importance values. We adopted a similar
353
strategy and iteratively fitted random forests with each time discarding the variable with the
354
smallest importance value. Finally, we identified a threshold where the incorporation of variables
355
with importance values above the threshold generally increased prediction accuracy, while the
356
inclusion of those with importance values below the threshold tended to decrease prediction
357
accuracy. The variables with importance values below the threshold were discarded, and those
358
with importance values above the threshold were further considered. This variable selection step
359
dramatically simplified the process of identifying useful predictor variables. For example, we
360
tested NDVI as a predictor in our model. We found that the importance value of NDVI fell
16 ACS Paragon Plus Environment
Page 17 of 29
Environmental Science & Technology
361
below our threshold, and the inclusion of NDVI in the model slightly reduced the prediction
362
accuracy, probably due to the incorporation of other predictors with stronger individual
363
associations with the outcome (e.g., forest cover and impervious surface). In contrast, for neural
364
networks, variables need to be toggled on and off individually to compare model performance in
365
order to determine whether the variable is useful or not. This can be time-consuming, especially
366
when large training and testing datasets are involved.
367
One potential limitation of this study is our use of MODIS AOD data with 10 km spatial
368
resolution, instead of a new AOD product with 1 km spatial resolution derived using the Multi-
369
Angle Implementation of Atmospheric Correction (MAIAC) algorithm. Lee, et al. 40
370
demonstrated that the 1-km MAIAC AOD can outperform the existing 10-km data in predicting
371
PM2.5. An advantage of 10-km data over 1-km data is considerable reduction in computing time.
372
However, incorporating 1-km MAIAC AOD data into our model not only can predict PM2.5
373
concentrations at a higher spatial resolution, but may further improve the prediction accuracy,
374
and we plan to extend our approach to this setting. Another limitation is regarding the temporal
375
mismatch among PM2.5 measurements, MODIS AOD, and meteorological fields. In this study,
376
PM2.5 measurements are 24-hour averaged, while MODIS AOD is a single value obtained at the
377
Aqua overpass time and meteorological variables are averaged from 10 am to 4 pm. It is optimal
378
to use both 24-hour averaged AOD and meteorological variables in the model. For
379
meteorological fields, we tested the 24-hour averaged meteorological values in the model and did
380
not find a substantial difference in prediction accuracy between models using meteorological
381
fields averaged within two different time periods. For AOD, other AOD products should be
382
considered. GASP AOD retrievals are attempted every half hour during daylight and thus can
383
help with the temporal mismatch between PM2.5 and AOD. Although Green, et al. 42 found that
17 ACS Paragon Plus Environment
Environmental Science & Technology
Page 18 of 29
384
PM2.5 predicted from MODIS AOD is more accurate than those generated from GASP AOD at
385
Bondville, Illinois, It is still worthwhile to examine its performance on a national scale, we will
386
address this issue in future research. Furthermore, the results of spatial and temporal CV indicate
387
that the model is more capable of explaining temporal variation in the relationship between PM2.5
388
and predictor variables than accounting for spatial variation. Thus, more work needs to be done
389
in the future to address this issue. Finally, although our model reached a relatively high
390
prediction accuracy, the results depict overestimation and underestimation of PM2.5
391
concentrations across the continental U.S. If these outputs were used in an epidemiologic study,
392
then these regional differences in accuracy could produce (or obfuscate) regional heterogeneity
393
in health associations.
394 395 396 397 398 399 400 401 402 403 404 18 ACS Paragon Plus Environment
Page 19 of 29
405
Environmental Science & Technology
Table 1. Cross validation results for the entire study area and study period. Overall Spatial Temporal
R2 0.80 0.70 0.79
RMSPE (µg/m3) 2.83 3.43 2.88
MPE (µg/m3) 1.78 2.24 1.81
Slope 1.00 1.00 1.01
MPE (µg/m3) 1.62 1.71 1.75 2.20
Slope 1.00 1.00 1.01 1.01
MPE (µg/m3) 1.19 1.72 1.86 1.77 1.98 2.14 1.66 1.73 1.68
Slope 1.01 1.00 1.00 0.98 1.00 1.01 1.01 1.01 1.00
406 407
Table 2. Cross validation results for each season. Spring Summer Fall Winter
R2 0.78 0.81 0.79 0.80
RMSPE (µg/m3) 2.54 2.78 2.67 3.55
408 409
Table 3. Cross validation results for each climate region. Northwest West North Central Northeast East North Central Central West Southwest South Southeast
R2 0.68 0.64 0.79 0.82 0.80 0.83 0.74 0.74 0.74
RMSPE (µg/m3) 2.37 2.61 2.84 2.50 2.86 3.32 2.85 2.66 2.71
410 411 412 413 414 415 416 417 418 419 420 421 19 ACS Paragon Plus Environment
Environmental Science & Technology
Page 20 of 29
422 423
Figure 1. Study area.
424 425 426 427 428 429 430 431 432 433 434 435
20 ACS Paragon Plus Environment
Page 21 of 29
Environmental Science & Technology
436 437
Figure 2. 10-fold cross validation. (a) Overall; (b) Spatial; (c) Temporal.
438 439
21 ACS Paragon Plus Environment
Environmental Science & Technology
Page 22 of 29
440 441
Figure 3. Variable importance. (a) Increase in mean square errors (%IncMSE); (b) Increase in
442
node purities (IncNodePurity). 22 ACS Paragon Plus Environment
Page 23 of 29
Environmental Science & Technology
443 444
Figure 4. Annual mean predictions. (a) Annual mean PM2.5 predictions over the continental U.S.
445
for 2011; (b) Annual mean PM2.5 measurements at ground monitors; (c) Difference between
446
annual mean predictions and observations at ground monitors and difference interpolations over
447
the continental U.S.
448 449 450 451 452 453 454 455 456 457 23 ACS Paragon Plus Environment
Environmental Science & Technology
458
AUTHOR INFORMATION
459
Corresponding Author
460
* (Y.L.) Phone: +1-404-727-2131; fax: +1-404-727-8744; e-mail:
[email protected].
461
Present Addresses
462
1
463
Atlanta, GA 30322, USA
464
Author Contributions
465
The manuscript was written by Xuefei Hu and edited by all authors. All authors have given
466
approval to the final version of the manuscript.
467
Notes
468
The authors declare no competing financial interest.
469
ACKNOWLEDGMENT
470
This work was partially supported by the NASA Applied Sciences Program (grant No.
471
NNX16AQ28G, PI: Liu).
472
SUPPORTING INFORMATION AVAILABLE
473
Tables S1-S4 and Figure S1. This information is available free of charge via the Internet at
474
http://pubs.acs.org.
Page 24 of 29
Department of Environmental Health, Rollins School of Public Health, Emory University,
475 476 477 478 24 ACS Paragon Plus Environment
Page 25 of 29
Environmental Science & Technology
479
REFERENCES
480 481 482
(1). Madrigano, J.; Kloog, I.; Goldberg, R.; Coull, B. A.; Mittleman, M. A.; Schwartz, J., Long-term Exposure to PM2.5 and Incidence of Acute Myocardial Infarction. Environ. Health Perspect. 2013, 121, (2), 192-196.
483 484 485 486
(2). Neophytou, A. M.; Costello, S.; Brown, D. M.; Picciotto, S.; Noth, E. M.; Hammond, S. K.; Cullen, M. R.; Eisen, E. A., Marginal Structural Models in Occupational Epidemiology: Application in a Study of Ischemic Heart Disease Incidence and PM2.5 in the US Aluminum Industry. Am. J. Epidemiol. 2014, 180, (6), 608-615.
487 488 489
(3). Bose, S.; Hansel, N.; Tonorezos, E.; Williams, D.; Bilderback, A.; Breysse, P.; Diette, G.; McCormack, M. C., Indoor particulate matter associated with systemic inflammation in COPD. J. Environ. Prot. 2015, 6, (5), 566-572.
490 491 492 493
(4). Habre, R.; Moshier, E.; Castro, W.; Nath, A.; Grunin, A.; Rohr, A.; Godbold, J.; Schachter, N.; Kattan, M.; Coull, B., The effects of PM2. 5 and its components from indoor and outdoor sources on cough and wheeze symptoms in asthmatic children. J. Expo. Sci. Environ. Epidemiol. 2014, 24, (4), 380-387.
494 495 496 497
(5). Baxter, L. K.; Dionisio, K. L.; Burke, J.; Ebelt Sarnat, S.; Sarnat, J. A.; Hodas, N.; Rich, D. Q.; Turpin, B. J.; Jones, R. R.; Mannshardt, E.; Kumar, N.; Beevers, S. D.; Ozkaynak, H., Exposure prediction approaches used in air pollution epidemiology studies: Key findings and future recommendations. J. Expo. Sci. Environ. Epidemiol. 2013, 23, (6), 654-659.
498 499 500
(6). Hu, X.; Waller, L. A.; Al-Hamdan, M. Z.; Crosson, W. L.; Estes Jr, M. G.; Estes, S. M.; Quattrochi, D. A.; Sarnat, J. A.; Liu, Y., Estimating ground-level PM2.5 concentrations in the southeastern U.S. using geographically weighted regression. Environ. Res. 2013, 121, 1-10.
501 502 503
(7). Liu, Y.; Franklin, M.; Kahn, R.; Koutrakis, P., Using aerosol optical thickness to predict ground-level PM2.5 concentrations in the St. Louis area: A comparison between MISR and MODIS. Remote Sens. Environ. 2007, 107, (1-2), 33-44.
504 505
(8). Ma, Z.; Hu, X.; Huang, L.; Bi, J.; Liu, Y., Estimating ground-level PM2. 5 in China using satellite remote sensing. Environ. Sci. Technol. 2014, 48, (13), 7436-7444.
506 507 508
(9). Liu, Y.; Paciorek, C. J.; Koutrakis, P., Estimating Regional Spatial and Temporal Variability of PM2.5 Concentrations Using Satellite Data, Meteorology, and Land Use Information. Environ. Health Perspect. 2009, 117, (6), 886-892.
509 510 511
(10). Paciorek, C. J.; Liu, Y.; Moreno-Macias, H.; Kondragunta, S., Spatiotemporal associations between GOES aerosol optical depth retrievals and ground-level PM2.5. Environ. Sci. Technol. 2008, 42, (15), 5800-5806.
512 513 514 515
(11). Hu, X.; Waller, L. A.; Lyapustin, A.; Wang, Y.; Al-Hamdan, M. Z.; Crosson, W. L.; Estes Jr, M. G.; Estes, S. M.; Quattrochi, D. A.; Puttaswamy, S. J.; Liu, Y., Estimating groundlevel PM2.5 concentrations in the Southeastern United States using MAIAC AOD retrievals and a two-stage model. Remote Sens. Environ. 2014, 140, 220-232. 25 ACS Paragon Plus Environment
Environmental Science & Technology
Page 26 of 29
516 517 518
(12). Hu, X.; Waller, L. A.; Lyapustin, A.; Wang, Y.; Liu, Y., 10-year spatial and temporal trends of PM2.5 concentrations in the southeastern US estimated using high-resolution satellite data. Atmos. Chem. Phys. 2014, 14, (12), 6301-6314.
519 520 521
(13). Hu, X.; Waller, L. A.; Lyapustin, A.; Wang, Y.; Liu, Y., Improving satellite-driven PM2.5 models with Moderate Resolution Imaging Spectroradiometer fire counts in the southeastern U.S. J. Geophys. Res., Atmos. 2014, 119, (19), 11375-11386.
522 523 524
(14). Wu, J.; Yao, F.; Li, W.; Si, M., VIIRS-based remote sensing estimation of ground-level PM2.5 concentrations in Beijing–Tianjin–Hebei: A spatiotemporal statistical model. Remote Sens. Environ. 2016, 184, 316-328.
525 526 527
(15). Kloog, I.; Nordio, F.; Coull, B. A.; Schwartz, J., Incorporating Local Land Use Regression And Satellite Aerosol Optical Depth In A Hybrid Model Of Spatiotemporal PM2.5 Exposures In The Mid-Atlantic States. Environ. Sci. Technol. 2012, 46, (21), 11913-11921.
528 529 530
(16). Wang, J.; Christopher, S. A., Intercomparison between satellite-derived aerosol optical thickness and PM2.5 mass: Implications for air quality studies. Geophys. Res. Lett. 2003, 30, (21), 2095.
531 532
(17). Hu, Z., Spatial analysis of MODIS aerosol optical depth, PM2.5, and chronic coronary heart disease. Int. J. Health Geogr. 2009, 8, (1), 27.
533 534 535
(18). Lee, H. J.; Liu, Y.; Coull, B. A.; Schwartz, J.; Koutrakis, P., A novel calibration approach of MODIS AOD data to predict PM2.5 concentrations. Atmos. Chem. Phys. 2011, 11, (15), 79918002.
536 537 538
(19). Liu, Y.; Sarnat, J. A.; Kilaru, A.; Jacob, D. J.; Koutrakis, P., Estimating ground-level PM2.5 in the eastern united states using satellite remote sensing. Environ. Sci. Technol. 2005, 39, (9), 3269-3278.
539 540 541
(20). Kloog, I.; Koutrakis, P.; Coull, B. A.; Lee, H. J.; Schwartz, J., Assessing temporally and spatially resolved PM2.5 exposures for epidemiological studies using satellite aerosol optical depth measurements. Atmos. Environ. 2011, 45, (35), 6267-6275.
542 543 544
(21). Ma, Z.; Hu, X.; Sayer, A. M.; Levy, R.; Zhang, Q.; Xue, Y.; Tong, S.; Bi, J.; Huang, L.; Liu, Y., Satellite-based spatiotemporal trends in PM2. 5 concentrations: China, 2004–2013. Environ. Health Perspect. 2016, 124, (2), 184-192.
545 546 547 548
(22). Kloog, I.; Chudnovsky, A. A.; Just, A. C.; Nordio, F.; Koutrakis, P.; Coull, B. A.; Lyapustin, A.; Wang, Y.; Schwartz, J., A new hybrid spatio-temporal model for estimating daily multi-year PM2.5 concentrations across northeastern USA using high resolution aerosol optical depth data. Atmos. Environ. 2014, 95, 581-590.
549 550 551
(23). Gupta, P.; Christopher, S. A., Particulate matter air quality assessment using integrated surface, satellite, and meteorological products: 2. A neural network approach. J. Geophys. Res., Atmos. 2009, 114, D20205.
26 ACS Paragon Plus Environment
Page 27 of 29
Environmental Science & Technology
552 553 554
(24). Zou, B.; Wang, M.; Wan, N.; Wilson, J. G.; Fang, X.; Tang, Y., Spatial modeling of PM2.5 concentrations with a multifactoral radial basis function neural network. Environ. Sci. Pollut. Res. 2015, 22, (14), 10395-10404.
555 556 557 558
(25). Reid, C. E.; Jerrett, M.; Petersen, M. L.; Pfister, G. G.; Morefield, P. E.; Tager, I. B.; Raffuse, S. M.; Balmes, J. R., Spatiotemporal Prediction of Fine Particulate Matter During the 2008 Northern California Wildfires Using Machine Learning. Environ. Sci. Technol. 2015, 49, (6), 3887-3896.
559 560 561
(26). Di, Q.; Koutrakis, P.; Schwartz, J., A hybrid prediction model for PM2.5 mass and components using a chemical transport model and land use regression. Atmos. Environ. 2016, 131, 390-399.
562 563 564
(27). Zhan, Y.; Luo, Y.; Deng, X.; Chen, H.; Grieneisen, M. L.; Shen, X.; Zhu, L.; Zhang, M., Spatiotemporal prediction of continuous daily PM2.5 concentrations across China using a spatially explicit machine learning algorithm. Atmos. Environ. 2017, 155, 129-139.
565 566 567
(28). Di, Q.; Kloog, I.; Koutrakis, P.; Lyapustin, A.; Wang, Y.; Schwartz, J., Assessing PM2.5 Exposures with High Spatiotemporal Resolution across the Continental United States. Environ. Sci. Technol. 2016, 50, (9), 4712-4721.
568 569
(29). Friedman, J.; Hastie, T.; Tibshirani, R., The elements of statistical learning. Springer series in statistics Springer, Berlin: 2001; Vol. 1.
570 571
(30). Levy, R., Collection 006 (C6) Modis atmosphere l2 aerosol product. In 6 ed.; LAADS Web: https://ladsweb.nascom.nasa.gov/data/search.html, 2014.
572 573
(31). Turner, R., deldir: Delaunay triangulation and Dirichlet (Voronoi) tessellation. R package version 0.0-8 2009, URL http://cran.r-project.org/web/packages/deldir.
574 575 576
(32). Levy, R. C.; Mattoo, S.; Munchak, L. A.; Remer, L. A.; Sayer, A. M.; Patadia, F.; Hsu, N. C., The Collection 6 MODIS aerosol products over land and ocean. Atmos. Meas. Tech. 2013, 6, (11), 2989-3034.
577 578
(33). Belle, J.; Liu, Y., Evaluation of Aqua MODIS Collection 6 AOD Parameters for Air Quality Research over the Continental United States. Remote Sens. 2016, 8, (10), 815.
579 580 581
(34). Li, S. S.; Garay, M. J.; Chen, L. F.; Rees, E.; Liu, Y., Comparison of GEOS-Chem aerosol optical depth with AERONET and MISR data over the contiguous United States. J. Geophys. Res., Atmos. 2013, 118, (19), 11228-11241.
582 583 584
(35). Schaap, M.; Apituley, A.; Timmermans, R. M. A.; Koelemeijer, R. B. A.; de Leeuw, G., Exploring the relation between aerosol optical depth and PM2.5 at Cabauw, the Netherlands. Atmos. Chem. Phys. 2009, 9, (3), 909-925.
585
(36).
Breiman, L., Random Forests. Mach. Learn. 2001, 45, (1), 5-32.
27 ACS Paragon Plus Environment
Environmental Science & Technology
Page 28 of 29
586 587
(37). Liaw, A.; Wiener, M., Classification and Regression by randomForest. R News 2002, 2, (3), 18-22.
588 589
(38). R Core Team, A language and environment for statistical computing. In R Foundation for Statistical Computing: Vienna, Austria, 2016.
590 591
(39). Strobl, C.; Boulesteix, A.-L.; Kneib, T.; Augustin, T.; Zeileis, A., Conditional variable importance for random forests. BMC Bioinform. 2008, 9, (1), 307.
592 593 594 595
(40). Lee, M.; Kloog, I.; Chudnovsky, A.; Lyapustin, A.; Wang, Y.; Melly, S.; Coull, B.; Koutrakis, P.; Schwartz, J., Spatiotemporal prediction of fine particulate matter using highresolution satellite images in the Southeastern US 2003-2011. J. Expo. Sci. Environ. Epidemiol. 2016, 26, (4), 377-384.
596 597
(41). Díaz-Uriarte, R.; Alvarez de Andrés, S., Gene selection and classification of microarray data using random forest. BMC Bioinform. 2006, 7, (1), 3.
598 599 600
(42). Green, M.; Kondragunta, S.; Ciren, P.; Xu, C., Comparison of GOES and MODIS Aerosol Optical Depth (AOD) to Aerosol Robotic Network (AERONET) AOD and IMPROVE PM2.5 Mass at Bondville, Illinois. J. Air Waste Manage. Assoc. 2009, 59, (9), 1082-1091.
601 602 603 604
28 ACS Paragon Plus Environment
Page 29 of 29
Environmental Science & Technology
TOC Art 84x47mm (300 x 300 DPI)
ACS Paragon Plus Environment