Estimating PM2.5 Concentrations in the Conterminous United States


Article

Estimating PM2.5 Concentrations in the Conterminous United States Using the Random Forest Approach Xuefei Hu, Jessica Hartmann Belle, Xia Meng, Avani Wildani, Lance Waller, Matthew Strickland, and Yang Liu Environ. Sci. Technol., Just Accepted Manuscript • Publication Date (Web): 23 May 2017 Downloaded from http://pubs.acs.org on May 25, 2017


Environmental Science & Technology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Estimating PM2.5 Concentrations in the Conterminous United States Using the Random Forest Approach

Xuefei Hu1, Jessica H. Belle1, Xia Meng1, Avani Wildani2, Lance A. Waller3, Matthew J. Strickland4, Yang Liu1*

1 Department of Environmental Health, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA

2 Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA

3 Department of Biostatistics & Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA

4 School of Community Health Sciences, University of Nevada Reno, Reno, NV 89557, USA

KEYWORDS: PM2.5, AOD, National, and Random Forests

* Corresponding Author: Yang Liu

Tel: +1-404-727-2131, Fax: +1-404-727-8744

E-mail: [email protected]

Abstract: To estimate PM2.5 concentrations, many parametric regression models have been developed, while non-parametric machine learning algorithms are used less often and national-scale models are rare. In this paper, we develop a random forest model incorporating Aerosol Optical Depth (AOD) data, meteorological fields, and land use variables to estimate daily 24-hour averaged ground-level PM2.5 concentrations over the conterminous United States in 2011. Random forests are an ensemble learning method that provides predictions with high accuracy and interpretability. Our results achieve an overall cross validation (CV) R² value of 0.80. Mean prediction error (MPE) and root mean squared prediction error (RMSPE) for daily predictions are 1.78 and 2.83 µg/m³, respectively, indicating good agreement between CV predictions and observations. The prediction accuracy of our model is similar to that reported in previous studies using neural networks or regression models on both national and regional scales. In addition, the incorporation of convolutional layers for land use terms and nearby PM2.5 measurements increases CV R² by ~0.02 and ~0.06, respectively, indicating their significant contributions to prediction accuracy. Two different variable importance measures both indicate that the convolutional layer for nearby PM2.5 measurements and AOD values are among the most important predictor variables for the training process.

1. Introduction

PM2.5 refers to airborne particles less than 2.5 µm in aerodynamic diameter and has been linked to many adverse health outcomes including cardiovascular and respiratory morbidity and mortality 1-4. Hence, obtaining accurate local PM2.5 concentrations plays a crucial role in addressing many environmental public health concerns. However, measurements from central-site monitors often lack adequate spatiotemporal resolution to capture variability in the study population exposures, thus resulting in exposure error and biased health effect estimates, while using remote sensing and air quality models can improve the characterization of the spatiotemporal variability of ambient pollutant concentrations and related exposures 5.

Satellite-derived aerosol optical depth (AOD) data with comprehensive spatiotemporal coverage have the capability to expand PM2.5 predictions beyond those provided by ground monitoring networks alone and such data have been increasingly used for PM2.5 concentration estimation in areas where ground measurements are not available. AOD products used in previous studies include those derived from the Moderate Resolution Imaging Spectroradiometer (MODIS) 6-8, Multiangle Imaging SpectroRadiometer (MISR) 7, 8, Geostationary Operational Environmental Satellite Aerosol/Smoke Product (GASP) 9, 10, Multi-Angle Implementation of Atmospheric Correction (MAIAC) 11-13, and Visible Infrared Imaging Radiometer Suite (VIIRS) 14. To establish the relationship between PM2.5 and AOD, previous efforts tend to focus on the development of various regression models 9, 11, 15. Those models have evolved from the use of AOD as the only predictor 16-18 to the incorporation of multiple additional predictors including meteorological and land use variables 6, 7, 9, 13, 19, 20 and from one-stage 6, 18, 19 to multi-stage models 9, 11, 15, 20, 21.

Although parametric regression models can sometimes reach a high level of prediction accuracy with cross-validated R² values above 0.8 22, they become increasingly complicated and encounter difficulties when datasets have a large number of predictors and many data records. Compared to traditional parametric regression models, non-parametric machine learning algorithms have several strengths. First, they typically involve fewer and less restrictive assumptions regarding independence of observations and distributions of outcomes, and these assumptions are refined as new data are added. Second, modelers do not need to define as strict a set of relationships among variables before implementing non-parametric machine learning algorithms as they do for traditional parametric regression approaches. Examples of the use of non-parametric machine learning algorithms for PM2.5 concentration estimation include Gupta and Christopher 23, who used an artificial neural network (ANN) to estimate surface-level PM2.5 from MODIS AOD data in the southeastern United States. Zou, et al. 24 proposed a radial basis function (RBF) neural network method to estimate PM2.5 concentrations in the state of Texas, USA. Reid, et al. 25 used the generalized boosting model (GBM) with 29 predictor variables to estimate PM2.5 concentrations during the 2008 northern California wildfires. Di, et al. 26 used a backward propagation neural network to calibrate GEOS-Chem simulations and predicted PM2.5 concentrations in 1 km x 1 km grid cells in the Northeastern United States.

These studies demonstrate the potential of using non-parametric machine learning algorithms for PM2.5 concentration estimation on the regional scale. Nevertheless, such efforts made on a national scale (e.g., over the conterminous United States) are still very limited to date. Zhan, et al. 27 developed a geographically-weighted gradient boosting machine (GW-GBM) to predict daily PM2.5 concentrations across China. Di, et al. 28 incorporated convolutional layers into a neural network to account for spatial and temporal autocorrelation, and PM2.5 concentrations were estimated with high accuracy (e.g., cross-validated R² above 0.8) over the continental United States from 2000 to 2012. Although neural networks tend to produce high prediction accuracy, their results are difficult to interpret. Neural networks do not directly provide variable importance measures, which can help estimate the strength of relationships between PM2.5 and various predictors (e.g., meteorological and land use variables). Additional families of machine learning algorithms provide such variable importance measures, and when building a national model for PM2.5 concentration estimation, we examined the random forest algorithm in this light. Like neural networks, random forests provide multi-variate, non-parametric, non-linear regression and classification based on a training dataset. Because random forests are an ensemble method, the training time can be reliably tuned based on desired accuracy and available computing resources. More importantly, the random forest approach also provides variable importance measures to quantify the prediction strength of each variable 29, thereby yielding more interpretable results than those typically generated from neural networks.

In this paper, we develop a multi-variate random forest model incorporating AOD data (e.g., satellite AOD and model simulated AOD), meteorological fields, and land use variables to estimate daily 24-hour averaged ground-level PM2.5 concentrations over the conterminous United States for the year 2011. Prediction accuracy is evaluated by 10-fold cross validation (CV) statistics, and two variable importance measures are implemented to examine the impact of each predictor on PM2.5 concentration estimation on a national scale.

2. Materials and Methods

We define our study area as the conterminous United States, consisting of 48 adjoining U.S. states and Washington D.C. (Figure 1). The entire area is divided into nine climate regions: Northwest, West North Central, Northeast, East North Central, Central, West, Southwest, South, and Southeast. These climate regions were identified by scientists at the National Oceanic and Atmospheric Administration’s (NOAA’s) National Centers for Environmental Information (https://www.ncdc.noaa.gov/monitoring-references/maps/us-climate-regions.php).

[Insert Figure 1 about here]

2.1 PM2.5 measurements

The 24-hour averaged PM2.5 concentrations for 2011 collected from 1248 U.S. Environmental Protection Agency (EPA) federal reference method samplers were downloaded from the EPA’s Air Quality System Technology Transfer Network (http://www.epa.gov/ttn/airs/airsaqs/).

2.2 AOD data

We use satellite-derived AOD from MODIS as our primary predictor and substitute GEOS-Chem AOD simulations when satellite AOD is missing.
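The fallback rule described above can be sketched in a few lines of Python. This is only an illustration and not the authors' code: the function name `select_aod`, the use of `None` to mark a missing retrieval, and the returned source label are all assumptions made for the example.

```python
from typing import Optional, Tuple


def select_aod(modis_aod: Optional[float],
               geos_chem_aod: Optional[float]) -> Tuple[Optional[float], str]:
    """Pick the AOD predictor for one grid cell and day.

    Satellite AOD is preferred; the simulated value is used only when the
    satellite retrieval is missing (represented here as None, an assumption).
    """
    if modis_aod is not None:
        return modis_aod, "MODIS"
    if geos_chem_aod is not None:
        return geos_chem_aod, "GEOS-Chem"
    return None, "missing"
```

Returning the source label alongside the value is a design choice for the sketch; it makes it easy to check later how much of the prediction surface relies on simulated rather than retrieved AOD.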

2.2.1 MODIS AOD

Collection 6 level 2 Aqua MODIS retrievals at 550 nm wavelength, specifically MYD04_L2 files, at a nominal resolution of 10 km were regridded to the 12 x 12 km² grid used in the Community Multi-Scale Air Quality (CMAQ) modeling system 30. Regridding to a static grid is necessary when modeling with Level 2 MODIS data since the pixels shift in location and size with each satellite overpass. We used a regridding approach based on the full area of individual MODIS pixels, using pixel boundaries reconstructed from neighboring midpoints using a Voronoi tessellation algorithm 31. The daily average AOD value within each CMAQ grid cell was calculated by averaging any MODIS pixels that fell at least partially within that cell. AOD averages were calculated using only high confidence AOD retrievals from the combined deep-blue and dark-target parameter (DB-DT) introduced in Collection 6 32. Our approach combines the best elements of two distinct AOD retrieval algorithms, effectively creating an AOD parameter that combines the accuracy users have come to expect from the dark-target algorithm with the increased coverage that the deep-blue algorithm provides 33. Lee, et al. 18 used Aqua AOD to predict missing Terra values to increase the spatiotemporal coverage of AOD. However, predicted AOD values inevitably contain additional errors that may reduce model performance. Thus, we only used Aqua AOD in this study.
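The per-cell averaging step can be illustrated with a simplified Python sketch. It is not the authors' implementation: MODIS pixel footprints are approximated here by axis-aligned boxes rather than the Voronoi polygons described above, and the names `overlaps` and `regrid_mean` and the tuple layout are hypothetical.

```python
def overlaps(a, b):
    """True if two boxes (xmin, ymin, xmax, ymax) share a positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def regrid_mean(cells, pixels):
    """Average high-confidence pixel AOD over every grid cell a pixel touches.

    cells:  {cell_id: box} for the static target grid.
    pixels: list of (box, aod, high_confidence) for one day's retrievals.
    Boxes stand in for the true pixel footprints (a simplifying assumption).
    """
    result = {}
    for cell_id, cell_box in cells.items():
        values = [aod for box, aod, ok in pixels
                  if ok and overlaps(box, cell_box)]
        # A cell with no usable retrieval stays missing for that day.
        result[cell_id] = sum(values) / len(values) if values else None
    return result
```

A pixel that straddles a cell boundary contributes to every cell it touches, which mirrors the "at least partially within" rule in the text; the average is unweighted, as described.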

2.2.2 GEOS-Chem AOD

The GEOS-Chem model is a global three-dimensional model of tropospheric chemistry driven by assimilated meteorological observations from the Goddard Earth Observing System (GEOS) of the NASA Global Modeling Assimilation Office (GMAO) (http://acmg.seas.harvard.edu/geos/doc/man/index.html). We use GEOS-Chem v10.1 in the current analysis, driven by GEOS-5 meteorological data for the year 2011. The nested grid simulation in North America with 0.5° x 0.666° spatial resolution produces compositional AODs, including sulfate-nitrate-ammonium AOD, black carbon AOD, organic carbon AOD, accumulation mode sea salt AOD, coarse mode sea salt AOD, and total dust AOD at 3-h intervals. We calculate total column AOD by summing all 6 AOD parameters over 37 vertical layers (up to ~20 km from the surface) 34.
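The column total described above is a double sum over species and layers. A minimal Python sketch follows; the species keys and the dictionary layout are hypothetical and do not correspond to GEOS-Chem output variable names.

```python
# Hypothetical labels for the six compositional AOD parameters.
SPECIES = (
    "sulfate_nitrate_ammonium",
    "black_carbon",
    "organic_carbon",
    "sea_salt_accumulation",
    "sea_salt_coarse",
    "dust",
)


def total_column_aod(layer_aod):
    """Sum per-layer AOD over all species and all vertical layers.

    layer_aod maps each species label to a list with one AOD value per
    vertical layer for a single grid cell and time step.
    """
    missing = [s for s in SPECIES if s not in layer_aod]
    if missing:
        raise KeyError(f"missing species: {missing}")
    return sum(sum(layer_aod[s]) for s in SPECIES)
```

For example, six species with 37 layers of 0.001 each give a column total of about 0.222.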

2.3 Meteorological fields

We obtained meteorological fields from two separate datasets. The first is the North American Regional Reanalysis (NARR) (http://www.emc.ncep.noaa.gov/mmb/rreanl/) with a spatial resolution of ~32 km and a temporal resolution of three hours. The second is the North American Land Data Assimilation System Phase 2 (NLDAS-2) (http://ldas.gsfc.nasa.gov/nldas/) with a spatial resolution of ~13 km and a temporal resolution of one hour. The meteorological fields used in this analysis include air temperature, dew point temperature, visibility, pressure, potential evaporation, downward longwave radiation flux, downward shortwave radiation flux, relative humidity, u-wind (east-west component of the wind vector), and v-wind (north-south component of the wind vector). All meteorological measurements for the period from 10 am to 4 pm local standard time were averaged to generate daytime meteorological fields. The averaged meteorological fields represent the average weather conditions around the Aqua overpass time (~1:30 pm local time) and reduce the adverse impact of possible extreme weather occurring at the satellite overpass time on the relationship between PM2.5 and meteorological variables.
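The daytime averaging window can be sketched as follows. The function name `daytime_mean`, the `(local_hour, value)` record layout, and the choice to treat both window endpoints as inclusive are assumptions for illustration; the text does not state how the boundaries were handled.

```python
def daytime_mean(records, start_hour=10, end_hour=16):
    """Average values whose local standard hour falls in [start_hour, end_hour].

    records: iterable of (local_hour, value) pairs for one site and one day.
    Returns None when no record falls inside the window.
    """
    values = [value for hour, value in records
              if start_hour <= hour <= end_hour]
    return sum(values) / len(values) if values else None
```

With hourly data this averages seven values; with three-hourly data it averages whichever time steps land inside the window (three in the best case), which is why the two datasets can yield slightly different daytime means for the same variable.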

2.4 Land use variables

We downloaded elevation data at a spatial resolution of ~30 m from the National Elevation Dataset (NED) (http://ned.usgs.gov). We extracted road network data, including limited access highways, highways, and local roads, from ESRI StreetMap USA (Environmental Systems Research Institute, Inc., Redlands, CA). We obtained forest cover and impervious surface, both at a spatial resolution of ~30 m, from the 2011 Landsat-derived land cover map and impervious surface map downloaded from the National Land Cover Database (NLCD) (http://www.mrlc.gov). We obtained primary PM2.5 and PM10 emissions data (tons/year) from the 2011 EPA National Emissions Inventory (NEI) facility emissions report (https://www.epa.gov/air-emissions-inventories/2011-national-emissions-inventory-nei-data). Finally, we obtained 2010 population density data at the census tract level from the U.S. Census Bureau (https://www2.census.gov/geo/tiger/).

2.5 Regional and temporal dummy variables

The PM2.5-AOD relationship has been shown to exhibit some regional variation due to meteorological conditions 35, and there are also daily variations in this relationship 18. Hence, we include a dummy climate region variable to account for regional variations and monthly and daily dummy variables to account for temporal variations.

2.6 Data integration

All data were re-projected to the USA Contiguous Albers Equal Area Conic coordinate system. For the model training dataset, we assigned meteorological fields and AOD data to each PM2.5 monitoring site using the nearest neighbor approach. We averaged forest cover, impervious surface, and elevation values and summed road length and point emission values over a 1 km x 1 km square buffer centered on each PM2.5 monitoring site. We assigned each monitoring site the population density value of the census tract containing its location. For the prediction dataset, the same procedure was applied to each MODIS grid cell.
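The two assignment rules used in the data integration step (nearest neighbor for gridded fields, square-buffer aggregation for land use terms) can be sketched in Python as below. This is an illustrative sketch only: coordinates are assumed to be projected metres, point features stand in for raster pixels and emission sources, and all function names are hypothetical.

```python
import math


def nearest_value(site_xy, cells):
    """Return the value of the grid cell whose centroid is closest to the site.

    cells: list of ((x, y), value) pairs in projected coordinates (metres).
    """
    _, value = min(
        cells,
        key=lambda c: math.hypot(c[0][0] - site_xy[0], c[0][1] - site_xy[1]),
    )
    return value


def in_square_buffer(site_xy, point_xy, side=1000.0):
    """True if the point lies inside a side x side square centred on the site."""
    half = side / 2.0
    return (abs(point_xy[0] - site_xy[0]) <= half
            and abs(point_xy[1] - site_xy[1]) <= half)


def buffer_sum(site_xy, points, side=1000.0):
    """Sum values (e.g. point emissions) falling inside the square buffer."""
    return sum(v for xy, v in points if in_square_buffer(site_xy, xy, side))


def buffer_mean(site_xy, points, side=1000.0):
    """Average values (e.g. forest cover pixels) inside the square buffer."""
    inside = [v for xy, v in points if in_square_buffer(site_xy, xy, side)]
    return sum(inside) / len(inside) if inside else None
```

Summing road length would additionally require clipping line features to the buffer, which is left out of this sketch.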

2.7 Methods


2.7.1 Convolutional layer

To account for spatial and temporal autocorrelations, we adopt the inverse-distance weighted average function proposed by Di, et al. 28 to create convolutional layers for nearby PM2.5 measurements and all land use terms and use these layers as ordinary input predictor variables in our model. For each monitoring site and MODIS grid cell, inverse-distance weighted averages of nearby PM2.5 measurements and land use terms were calculated for that site and grid cell. Although it is optimal to use monitoring sites within a certain distance, due to the difficulties in identifying such a distance, we used all monitoring sites in the conterminous U.S. to create the convolutional layers, and this also helps to reduce possible missing values in convolutional layers. The kernel function can be expressed in general terms as

$$z_j = \frac{\sum_{i=1}^{n} w_{ij} z_i}{\sum_{i=1}^{n} w_{ij}} \qquad (1)$$

where $z_j$ is the value of the convolutional layer at monitoring site or grid cell j, $z_i$ is the PM2.5 or land use term at monitoring site i, and $w_{ij} \propto 1/d_{ij}^2$ ($d_{ij}$ is the distance between grid cell j and monitoring site i). When creating convolutional layers for monitoring site j, the PM2.5 and land use terms at monitoring site j were not included in the calculation.

2.7.2 Model structure and validation

Random forests are a set of decision trees, and each tree is constructed using the best split for each node among a subset of predictors randomly chosen at that node. In the end, the predictions of all trees are combined to produce the final prediction. Random forests are user-friendly and have only two parameters: mtry (the number of predictors sampled for splitting at each node) and ntree (the number of trees grown). The random forest algorithm first draws ntree bootstrap samples from the original data, and then for each sample, grows an unpruned classification or regression tree with mtry predictors randomly sampled and the best split chosen at each node. Finally, predictions are made by aggregating the predictions of the ntree trees (i.e., majority vote for classification and average for regression). The error rate can be calculated using predictions of out-of-bag samples (the data not in the bootstrap sample) 36, 37. In this study, by comparing results of different settings of mtry and ntree, we set mtry to 12 and ntree to 1000 to achieve the best prediction accuracy.
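To make the roles of mtry, ntree, bootstrap sampling, and out-of-bag error concrete, the following is a deliberately small, pure-Python regression forest. It is a teaching sketch under stated assumptions, not the implementation used in this study (which was carried out in R): all function names are hypothetical, trees are grown until targets are identical, and no attempt is made at efficiency.

```python
import random
from statistics import mean


def best_split(rows, targets, feature_ids):
    """Return the (feature, threshold) with the lowest summed squared error."""
    best, best_sse = None, float("inf")
    for f in feature_ids:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            if not left or not right:
                continue
            ml, mr = mean(left), mean(right)
            sse = (sum((t - ml) ** 2 for t in left)
                   + sum((t - mr) ** 2 for t in right))
            if sse < best_sse:
                best, best_sse = (f, thr), sse
    return best


def grow_tree(rows, targets, mtry, rng):
    """Grow an unpruned tree; a leaf is a mean target, a node is a tuple."""
    if len(set(targets)) == 1:
        return mean(targets)
    # mtry predictors are sampled afresh at every node.
    split = best_split(rows, targets, rng.sample(range(len(rows[0])), mtry))
    if split is None:  # sampled predictors are constant here: stop early
        return mean(targets)
    f, thr = split
    left = [i for i, r in enumerate(rows) if r[f] <= thr]
    right = [i for i, r in enumerate(rows) if r[f] > thr]
    return (f, thr,
            grow_tree([rows[i] for i in left], [targets[i] for i in left], mtry, rng),
            grow_tree([rows[i] for i in right], [targets[i] for i in right], mtry, rng))


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node


def fit_forest(rows, targets, ntree=100, mtry=1, seed=0):
    """Return a list of (tree, out_of_bag_indices) pairs."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(ntree):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = grow_tree([rows[i] for i in sample],
                         [targets[i] for i in sample], mtry, rng)
        forest.append((tree, set(range(n)) - set(sample)))
    return forest


def predict_forest(forest, row):
    """Regression forests aggregate by averaging the tree predictions."""
    return mean(predict_tree(tree, row) for tree, _ in forest)


def oob_mse(forest, rows, targets):
    """Mean squared error using, for each row, only trees that never saw it."""
    errors = []
    for i, (row, target) in enumerate(zip(rows, targets)):
        votes = [predict_tree(tree, row) for tree, oob in forest if i in oob]
        if votes:
            errors.append((mean(votes) - target) ** 2)
    return mean(errors)
```

In practice one would compare `oob_mse` across a grid of mtry and ntree values, which is analogous in spirit to the comparison of settings described above.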

We developed a random forest model incorporating AOD data, meteorological fields, land use variables, and their convolutional layers to estimate ground-level PM2.5 concentrations over the conterminous United States in 2011. The variables used in the model include

(1) daily 24-hour averaged ground-level PM2.5 measurements (µg/m³), the quantity being estimated;

(2) the x-y coordinates of monitoring sites or MODIS grid cell centroids in the Albers projection used for this analysis;

(3) dummy variables: month, day, and climate region;

(4) meteorological fields: NARR variables (dew point temperature (K), visibility (m), 2-m pressure (Pa), 10-m pressure (Pa), 30-m pressure (Pa), 30-m temperature (K), 180-150 mb temperature (K)) and NLDAS variables (potential evaporation (kg/m²), downward longwave radiation flux (W/m²), downward shortwave radiation flux (W/m²), convective available potential energy (J/kg), pressure (Pa), 2-m temperature (K), 2-m relative humidity (%), east-west component of the wind vector (m/sec) at 10 m, north-south component of the wind vector (m/sec) at 10 m);

(5) land use terms: elevation (m), local road (m), forest cover (unitless), primary PM2.5 point emissions (tons/year), impervious surface (%), and population density (persons/km²);

(6) convolutional layers for elevation (m), local road (m), forest cover (unitless), primary PM2.5 point emissions (tons/year), impervious surface (%), population density (persons/km²), highway (m), limited access highway (m), primary PM10 point emissions (tons/year), and nearby PM2.5 measurements (µg/m³);

(7) MODIS AOD (unitless), with GEOS-Chem simulated AOD (unitless) used when MODIS AOD was missing.

We applied a 10-fold cross validation (CV) technique to establish and validate our prediction results. The entire training dataset was randomly split into 10 subsets, with each subset containing approximately 10% of the training data. In each round of cross validation, we used nine subsets for model training and made predictions for the held-out test subset. The process was repeated 10 times until every subset was tested. In addition, we also conducted a spatial CV and a temporal CV, based on partitioning the training dataset by monitoring site and day of year, respectively. We calculated statistical indicators such as R², Mean Prediction Error (MPE), and Root Mean Squared Prediction Error (RMSPE) between CV predictions and observations to assess the prediction accuracy of the proposed model for the entire study area and study period. We also ran the model and conducted CV for each season and each climate region separately.

We calculated two importance values for each predictor variable to examine which variables have the largest impact on our predictions. The first measure is the increase in the mean squared error (MSE) of predictions, estimated using out-of-bag samples, when a predictor variable is permuted. The higher the number, the more important the predictor variable. The second is the increase in node purities from splitting on the predictor variable. Node purity measures the similarity of responses at a node across regression trees in the runs of the random forest model. More useful variables achieve higher increases in node purities 37. All modeling was done using R statistical software version 3.3.2 38.
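The fold construction and CV statistics described above can be sketched in Python as follows. Several details are assumptions because the text does not spell out the formulas: R² is computed here as the squared Pearson correlation between observations and CV predictions, MPE as the mean absolute difference, and RMSPE as the root mean squared difference. The function names are hypothetical.

```python
import math
import random


def kfold_indices(n, k=10, seed=0):
    """Randomly partition row indices 0..n-1 into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def group_folds(groups, k=10, seed=0):
    """Folds in which all rows sharing a label (site or day) stay together.

    Passing monitoring-site IDs gives a spatial CV; passing day-of-year
    labels gives a temporal CV.
    """
    labels = sorted(set(groups))
    random.Random(seed).shuffle(labels)
    fold_of = {label: i % k for i, label in enumerate(labels)}
    folds = [[] for _ in range(k)]
    for row, label in enumerate(groups):
        folds[fold_of[label]].append(row)
    return folds


def cv_statistics(obs, pred):
    """R2 (squared Pearson r), MPE (mean absolute error), and RMSPE."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((p - mp) ** 2 for p in pred)
    return {
        "R2": sxy ** 2 / (sxx * syy),
        "MPE": sum(abs(p - o) for o, p in zip(obs, pred)) / n,
        "RMSPE": math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / n),
    }
```

Note that a squared-correlation R² is insensitive to a constant bias in the predictions, which is one reason to report MPE and RMSPE alongside it.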

3. Results


3.1 Descriptive statistics

The mean, standard deviation, maximum, and minimum for all variables over 2011 are presented in Table S1. The annual mean PM2.5 concentration is 9.69 µg/m³ for the continental U.S. in 2011, and the overall mean of MODIS AOD is 0.14. The percentage of missing MODIS AOD is ~68% in the conterminous U.S. during 2011. Pearson’s correlation analysis was conducted for all variables (Table S2) to check for potential multicollinearity problems. The results show that the correlation coefficients between most of the variables are relatively low. It should be noted that random forests are applicable even with highly correlated variables 39.

3.2 Results of model validation

Table 1 and Figure 2 present the overall, spatial, and temporal CV results for the entire study area and study period, including the values of R², MPE, and RMSPE between CV predicted values and observations. The results show that the overall R² reached a relatively high value of 0.80. MPE and RMSPE for daily predictions are 1.78 and 2.83 µg/m³, respectively, indicating good agreement between CV estimates and observations over the conterminous United States in 2011. In addition, using GEOS-Chem simulated AOD to replace MODIS AOD reduced CV R² by ~0.01 to 0.79, indicating the feasibility of using simulated AOD when MODIS AOD is not available. Nevertheless, mirroring previous studies, we find that prediction accuracy varies by season and climate region (Tables 2 and 3). For instance, in terms of CV R², model performance is best in summer, followed by winter and fall, and worst in spring. For different climate regions, prediction accuracy is relatively high in the west, northeast, central, and east north central regions, and low in the northwest and west north central regions. The model has intermediate performance in the southeast, south, and southwest regions.

[Insert Tables 1-3 about here]

[Insert Figure 2 about here]

3.3 Variable importance assessment

Figure 3 illustrates the results of variable importance assessment for predictor variables in the final model. The results show that the convolutional layer for nearby PM2.5 measurements and AOD are among the top five most important predictor variables for both importance measures.

[Insert Figure 3 about here]
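The first importance measure (increase in MSE after permuting a predictor) can be illustrated with a generic Python sketch. It simplifies the procedure in two ways that should be read as assumptions: it works with any prediction function rather than a fitted forest, and it permutes columns of a single evaluation set instead of each tree's out-of-bag sample. The function names are hypothetical.

```python
import random


def mse(predict, rows, targets):
    """Mean squared error of a prediction function on (rows, targets)."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, seed=0):
    """Increase in MSE when each predictor column is shuffled in turn.

    rows is a list of tuples; the result has one value per predictor,
    and larger values indicate more important predictors.
    """
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)
    increases = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between predictor j and the target
        permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        increases.append(mse(predict, permuted, targets) - baseline)
    return increases
```

A predictor the model never uses gets an increase of exactly zero, since shuffling it leaves every prediction unchanged.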

3.4 Contribution of convolutional layers

We compared the CV R² values between models with and without convolutional layers (Table S3). The results show that incorporation of convolutional layers for land use terms increases CV R² by ~0.02, and inclusion of the convolutional layer for nearby PM2.5 measurements increases CV R² by ~0.06, indicating a strong contribution of these convolutional layers to prediction accuracy. We also compared CV R² between models using our convolutional layers and surfaces generated using universal kriging (Table S4). The results show that the CV R² of the model using surfaces generated by universal kriging is ~0.04 lower than that of our model, indicating that geostatistical interpolation methods such as universal kriging can also be used to generate convolutional layers, albeit with slightly lower accuracy.
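For reference, the convolutional layers compared here are inverse-squared-distance weighted averages (Equation 1 in Section 2.7.1). A minimal Python sketch of that kernel follows. The function name, the argument layout, and the handling of a target that coincides exactly with a site (return that site's value) are assumptions for illustration and are not stated in the text.

```python
def idw_layer(target_xy, site_xy, site_values, exclude=None):
    """Inverse-squared-distance weighted average of site values at a target.

    target_xy:   (x, y) of the monitoring site or grid cell j.
    site_xy:     list of (x, y) for monitoring sites i.
    site_values: PM2.5 or land use term at each site.
    exclude:     index of a site to leave out, used when the target is
                 itself a monitoring site so it does not inform itself.
    """
    tx, ty = target_xy
    num = den = 0.0
    for k, ((x, y), z) in enumerate(zip(site_xy, site_values)):
        if k == exclude:
            continue
        d2 = (x - tx) ** 2 + (y - ty) ** 2
        if d2 == 0.0:
            return z  # assumption: a coincident site dominates the average
        w = 1.0 / d2  # w_ij proportional to 1 / d_ij^2
        num += w * z
        den += w
    return num / den if den > 0 else None
```

For example, sites at distances 1 and 2 with values 10 and 20 receive weights 1 and 0.25, giving (10 + 5) / 1.25 = 12.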

3.5 Estimation of PM2.5 concentrations

Due to the use of GEOS-Chem simulated AOD when MODIS AOD is missing, the coverage of our predictions increases from ~32% to full coverage. Our annual mean PM2.5 surface on the CMAQ grid (12 x 12 km²) over the continental U.S. in 2011 appears in Figure 4. The results show that PM2.5 predictions (Figure 4a) and observations (Figure 4b) share a similar spatial pattern. PM2.5 concentrations are generally higher in the eastern U.S. (e.g., the southeast and central regions) than in the western part. There are also areas with high predicted and observed PM2.5 levels in the western U.S. (e.g., the San Joaquin Valley in California). High concentrations also occur in large urban centers such as Indianapolis, IN, Houston, TX, Columbus, OH, Chicago, IL, and Los Angeles, CA. Figure 4c illustrates the difference between annual mean PM2.5 predictions and ground measurements at each ground monitor, together with a surface of these differences over the continental U.S. generated using inverse distance weighted (IDW) interpolation. Figure 4c shows that underestimation mostly occurs in areas with high concentrations, which are mainly in the eastern U.S., while overestimation appears primarily in the western U.S., where PM2.5 concentrations are generally low. The observed differences at ~95% of ground monitors are within ±4 µg/m³. Seasonal PM2.5 maps (Figure S1) show that air pollution is more severe in summer, particularly in the eastern U.S.

[Insert Figure 4 about here]

4. Discussion

318

Our approach has several strengths. First, we used random forests to build a PM2.5 predictive

319

model on a national scale and demonstrated that such an approach can achieve high prediction

320

accuracy, even across so broad a geographic area. For instance, on the regional scale, our model

321

yields a similar CV R2 to previous studies conducted in the northeastern U.S. 22 and the southeast

322

40

323

region. On the national scale, the prediction accuracy of our model is comparable to that of a

324

neural network approach proposed by Di, et al. 28. Both their model and ours used many similar

325

predictor variables such as AOD, land use terms, and meteorological data. However, our final

326

model did not include those variables with low importance such as additional GEOS-Chem

327

outputs, Normalized Difference Vegetation Index (NDVI), absorbing aerosol index (AAI),

328

boundary layer height, and albedo. In fact, including them reduced prediction accuracy. We also

329

tested interactions between climate regions and meteorological variables, between day and

330

meteorological variables, between climate regions and AOD, and between day and AOD. These

331

interaction terms did not help improve prediction accuracy and thus were not included in the

332

final model. In addition, we included some variables not included in Di, et al. 28 such as forest

333

cover and percent impervious surface, both of which show high importance in our model.

334

Compared to their neural network approach based on more than 50 predictors, our random forest

335

approach has fewer predictors (~40). In addition, the neural network approach involves a more

336

complex, two-stage structure, while our random forest approach is based on a simpler, one-stage

337

structure.

in 2011, even though we optimize our model for the continental U.S. instead of within each

Another strength of our random forest approach is that it provides an importance estimate for each predictor variable. Variable importance measures can help pinpoint which variables contribute most to the reduction of prediction errors and thus make variable selection more efficient than it is with neural networks. Our random forest approach provided two types of variable importance measures during the training process, the increase in MSE and the increase in node purities, both of which provided additional insight into the set of predictor variables yielding the greatest improvements in prediction accuracy. In our implementation, when we chose variables, we tried to avoid highly correlated variables that may severely affect the original importance measures 39. To make importance values directly comparable, we first added all potential predictor variables into the random forest model and calculated importance values (e.g., the increase in MSE) for all variables, and then compared the prediction accuracy of models incorporating different combinations of variables to identify the best set of predictors among those under consideration. Díaz-Uriarte and Alvarez de Andrés 41 suggested a backward elimination strategy that iteratively fits random forests, at each iteration building a new forest after discarding the variables with the smallest importance values. We adopted a similar strategy and iteratively fitted random forests, each time discarding the variable with the smallest importance value. Finally, we identified a threshold where the incorporation of variables with importance values above the threshold generally increased prediction accuracy, while the inclusion of those with importance values below the threshold tended to decrease prediction accuracy. The variables with importance values below the threshold were discarded, and those with importance values above the threshold were further considered. This variable selection step dramatically simplified the process of identifying useful predictor variables. For example, we tested NDVI as a predictor in our model. We found that the importance value of NDVI fell below our threshold, and the inclusion of NDVI in the model slightly reduced the prediction accuracy, probably due to the incorporation of other predictors with stronger individual associations with the outcome (e.g., forest cover and impervious surface). In contrast, for neural networks, variables need to be toggled on and off individually to compare model performance in order to determine whether each variable is useful. This can be time-consuming, especially when large training and testing datasets are involved.
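The backward elimination loop described above can be illustrated with a minimal sketch. It is written in Python rather than the R randomForest package cited in the references, and it makes several assumptions that are not part of the study: a k-nearest-neighbour regressor stands in for the random forest so that the example needs no external libraries, importance is measured as the increase in held-out MSE after shuffling one column (analogous to, but not the same as, the out-of-bag %IncMSE measure), and the data and the feature names `aod`, `forest_cover`, and `noise` are synthetic.

```python
import random

def knn_predict(train_X, train_y, row, k=5):
    """Stand-in learner (NOT a random forest): mean target of the k nearest rows."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(tx, row)), ty)
        for tx, ty in zip(train_X, train_y)
    )[:k]
    return sum(ty for _, ty in nearest) / len(nearest)

def held_out_mse(train_X, train_y, test_X, test_y):
    errs = [(knn_predict(train_X, train_y, r) - t) ** 2
            for r, t in zip(test_X, test_y)]
    return sum(errs) / len(errs)

def select_columns(X, cols):
    return [[row[c] for c in cols] for row in X]

def permutation_importance(train_X, train_y, test_X, test_y, rng):
    """Increase in held-out MSE after shuffling each column in turn."""
    base = held_out_mse(train_X, train_y, test_X, test_y)
    scores = []
    for j in range(len(test_X[0])):
        col = [row[j] for row in test_X]
        rng.shuffle(col)
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(test_X, col)]
        scores.append(held_out_mse(train_X, train_y, shuffled, test_y) - base)
    return base, scores

def backward_elimination(names, train_X, train_y, test_X, test_y, seed=0):
    """Refit repeatedly, each time dropping the least important variable."""
    rng = random.Random(seed)
    active = list(range(len(names)))
    history = []  # (variables used, held-out MSE) for every iteration
    while active:
        tr = select_columns(train_X, active)
        te = select_columns(test_X, active)
        base, scores = permutation_importance(tr, train_y, te, test_y, rng)
        history.append(([names[i] for i in active], base))
        weakest = min(range(len(active)), key=lambda i: scores[i])
        active.pop(weakest)
    return history

# Synthetic data: the target depends on the first two columns only.
data_rng = random.Random(42)
names = ["aod", "forest_cover", "noise"]
X = [[data_rng.uniform(0, 10) for _ in names] for _ in range(200)]
y = [3.0 * r[0] + 1.0 * r[1] + data_rng.gauss(0, 0.5) for r in X]

history = backward_elimination(names, X[:150], y[:150], X[150:], y[150:])
for variables, mse in history:
    print(variables, round(mse, 2))
```

Comparing the held-out error across the entries of `history` is one way to locate a threshold of the kind described above: the point beyond which dropping further variables makes predictions worse.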

One potential limitation of this study is our use of MODIS AOD data with 10 km spatial resolution, instead of a new AOD product with 1 km spatial resolution derived using the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm. Lee, et al. 40 demonstrated that the 1-km MAIAC AOD can outperform the existing 10-km data in predicting PM2.5. An advantage of 10-km data over 1-km data is a considerable reduction in computing time. However, incorporating 1-km MAIAC AOD data into our model would not only allow PM2.5 concentrations to be predicted at a higher spatial resolution but may also further improve the prediction accuracy, and we plan to extend our approach to this setting. Another limitation is the temporal mismatch among PM2.5 measurements, MODIS AOD, and meteorological fields. In this study, PM2.5 measurements are 24-hour averaged, while MODIS AOD is a single value obtained at the Aqua overpass time and meteorological variables are averaged from 10 am to 4 pm. Ideally, both AOD and meteorological variables in the model would be 24-hour averages. For meteorological fields, we tested the 24-hour averaged meteorological values in the model and did not find a substantial difference in prediction accuracy between models using meteorological fields averaged over the two different time periods. For AOD, other AOD products should be considered. GASP AOD retrievals are attempted every half hour during daylight and thus can help with the temporal mismatch between PM2.5 and AOD. Although Green, et al. 42 found that PM2.5 predicted from MODIS AOD is more accurate than that generated from GASP AOD at Bondville, Illinois, it is still worthwhile to examine the performance of GASP AOD on a national scale, and we will address this issue in future research. Furthermore, the results of spatial and temporal CV (R2 of 0.70 and 0.79, respectively; Table 1) indicate that the model is more capable of explaining temporal variation in the relationship between PM2.5 and predictor variables than accounting for spatial variation. Thus, more work needs to be done in the future to address this issue. Finally, although our model reached a relatively high prediction accuracy, the results show both overestimation and underestimation of PM2.5 concentrations across the continental U.S. If these outputs were used in an epidemiologic study, then these regional differences in accuracy could produce (or obfuscate) regional heterogeneity in health associations.
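The distinction between spatial and temporal CV drawn above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the study: it assumes that spatial CV holds out whole monitoring sites and temporal CV holds out whole days, and the record layout, the helper `assign_folds`, and the site and day identifiers are hypothetical.

```python
import random

def assign_folds(records, key, n_folds=10, seed=0):
    """Give every record a fold index so that records sharing `key` stay together."""
    groups = sorted({rec[key] for rec in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    fold_of_group = {g: i % n_folds for i, g in enumerate(groups)}
    return [fold_of_group[rec[key]] for rec in records]

# Hypothetical monitor-day records: 30 sites observed on 20 days each.
records = [{"site": f"S{s:02d}", "day": d} for s in range(30) for d in range(20)]

spatial_folds = assign_folds(records, "site")   # a held-out fold contains unseen sites
temporal_folds = assign_folds(records, "day")   # a held-out fold contains unseen days

held_out_sites = {r["site"] for r, f in zip(records, spatial_folds) if f == 0}
training_sites = {r["site"] for r, f in zip(records, spatial_folds) if f != 0}
print(sorted(held_out_sites), held_out_sites.isdisjoint(training_sites))
```

Under this grouping, a spatial fold tests how well the model transfers to locations it has never seen, which is consistent with spatial CV being the more demanding of the two evaluations.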

Table 1. Cross validation results for the entire study area and study period.

            R2     RMSPE (µg/m3)   MPE (µg/m3)   Slope
Overall     0.80   2.83            1.78          1.00
Spatial     0.70   3.43            2.24          1.00
Temporal    0.79   2.88            1.81          1.01

Table 2. Cross validation results for each season.

            R2     RMSPE (µg/m3)   MPE (µg/m3)   Slope
Spring      0.78   2.54            1.62          1.00
Summer      0.81   2.78            1.71          1.00
Fall        0.79   2.67            1.75          1.01
Winter      0.80   3.55            2.20          1.01

Table 3. Cross validation results for each climate region.

                      R2     RMSPE (µg/m3)   MPE (µg/m3)   Slope
Northwest             0.68   2.37            1.19          1.01
West North Central    0.64   2.61            1.72          1.00
Northeast             0.79   2.84            1.86          1.00
East North Central    0.82   2.50            1.77          0.98
Central               0.80   2.86            1.98          1.00
West                  0.83   3.32            2.14          1.01
Southwest             0.74   2.85            1.66          1.01
South                 0.74   2.66            1.73          1.01
Southeast             0.74   2.71            1.68          1.00
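As a reading aid for Tables 1-3, the sketch below computes the four reported statistics from paired observations and cross-validated predictions. The exact definitions are assumptions made for illustration: R2 is taken as the squared Pearson correlation, MPE as the mean absolute prediction error, and the slope as that of a least-squares regression of observations on predictions. The function name `cv_metrics` and the sample values are hypothetical.

```python
import math

def cv_metrics(observed, predicted):
    """Return R2, RMSPE, MPE and slope for paired observations and predictions."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_po = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    return {
        # Assumed definition: squared Pearson correlation.
        "r2": s_po ** 2 / (s_pp * s_oo),
        # Root mean squared prediction error, in the units of the data.
        "rmspe": math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n),
        # Assumed definition: mean absolute prediction error.
        "mpe": sum(abs(o - p) for o, p in zip(observed, predicted)) / n,
        # Assumed definition: slope of observed regressed on predicted.
        "slope": s_po / s_pp,
    }

# Made-up PM2.5 values in µg/m3, for illustration only.
metrics = cv_metrics([5.0, 10.0, 15.0, 20.0], [6.0, 9.0, 16.0, 19.0])
print(metrics)
```

With these definitions MPE can never exceed RMSPE, which matches the pattern in all three tables.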

Figure 1. Study area.

Figure 2. 10-fold cross validation. (a) Overall; (b) Spatial; (c) Temporal.

Figure 3. Variable importance. (a) Increase in mean square errors (%IncMSE); (b) Increase in node purities (IncNodePurity).

Figure 4. Annual mean predictions. (a) Annual mean PM2.5 predictions over the continental U.S. for 2011; (b) Annual mean PM2.5 measurements at ground monitors; (c) Difference between annual mean predictions and observations at ground monitors and difference interpolations over the continental U.S.

AUTHOR INFORMATION

Corresponding Author

* (Y.L.) Phone: +1-404-727-2131; fax: +1-404-727-8744; e-mail: [email protected].

Present Addresses

1 Department of Environmental Health, Rollins School of Public Health, Emory University, Atlanta, GA 30322, USA

Author Contributions

The manuscript was written by Xuefei Hu and edited by all authors. All authors have given approval to the final version of the manuscript.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENT

This work was partially supported by the NASA Applied Sciences Program (grant No. NNX16AQ28G, PI: Liu).

SUPPORTING INFORMATION AVAILABLE

Tables S1-S4 and Figure S1. This information is available free of charge via the Internet at http://pubs.acs.org.

REFERENCES

(1). Madrigano, J.; Kloog, I.; Goldberg, R.; Coull, B. A.; Mittleman, M. A.; Schwartz, J., Long-term Exposure to PM2.5 and Incidence of Acute Myocardial Infarction. Environ. Health Perspect. 2013, 121, (2), 192-196.

(2). Neophytou, A. M.; Costello, S.; Brown, D. M.; Picciotto, S.; Noth, E. M.; Hammond, S. K.; Cullen, M. R.; Eisen, E. A., Marginal Structural Models in Occupational Epidemiology: Application in a Study of Ischemic Heart Disease Incidence and PM2.5 in the US Aluminum Industry. Am. J. Epidemiol. 2014, 180, (6), 608-615.

(3). Bose, S.; Hansel, N.; Tonorezos, E.; Williams, D.; Bilderback, A.; Breysse, P.; Diette, G.; McCormack, M. C., Indoor particulate matter associated with systemic inflammation in COPD. J. Environ. Prot. 2015, 6, (5), 566-572.

(4). Habre, R.; Moshier, E.; Castro, W.; Nath, A.; Grunin, A.; Rohr, A.; Godbold, J.; Schachter, N.; Kattan, M.; Coull, B., The effects of PM2.5 and its components from indoor and outdoor sources on cough and wheeze symptoms in asthmatic children. J. Expo. Sci. Environ. Epidemiol. 2014, 24, (4), 380-387.

(5). Baxter, L. K.; Dionisio, K. L.; Burke, J.; Ebelt Sarnat, S.; Sarnat, J. A.; Hodas, N.; Rich, D. Q.; Turpin, B. J.; Jones, R. R.; Mannshardt, E.; Kumar, N.; Beevers, S. D.; Ozkaynak, H., Exposure prediction approaches used in air pollution epidemiology studies: Key findings and future recommendations. J. Expo. Sci. Environ. Epidemiol. 2013, 23, (6), 654-659.

(6). Hu, X.; Waller, L. A.; Al-Hamdan, M. Z.; Crosson, W. L.; Estes Jr, M. G.; Estes, S. M.; Quattrochi, D. A.; Sarnat, J. A.; Liu, Y., Estimating ground-level PM2.5 concentrations in the southeastern U.S. using geographically weighted regression. Environ. Res. 2013, 121, 1-10.

(7). Liu, Y.; Franklin, M.; Kahn, R.; Koutrakis, P., Using aerosol optical thickness to predict ground-level PM2.5 concentrations in the St. Louis area: A comparison between MISR and MODIS. Remote Sens. Environ. 2007, 107, (1-2), 33-44.

(8). Ma, Z.; Hu, X.; Huang, L.; Bi, J.; Liu, Y., Estimating ground-level PM2.5 in China using satellite remote sensing. Environ. Sci. Technol. 2014, 48, (13), 7436-7444.

(9). Liu, Y.; Paciorek, C. J.; Koutrakis, P., Estimating Regional Spatial and Temporal Variability of PM2.5 Concentrations Using Satellite Data, Meteorology, and Land Use Information. Environ. Health Perspect. 2009, 117, (6), 886-892.

(10). Paciorek, C. J.; Liu, Y.; Moreno-Macias, H.; Kondragunta, S., Spatiotemporal associations between GOES aerosol optical depth retrievals and ground-level PM2.5. Environ. Sci. Technol. 2008, 42, (15), 5800-5806.

(11). Hu, X.; Waller, L. A.; Lyapustin, A.; Wang, Y.; Al-Hamdan, M. Z.; Crosson, W. L.; Estes Jr, M. G.; Estes, S. M.; Quattrochi, D. A.; Puttaswamy, S. J.; Liu, Y., Estimating ground-level PM2.5 concentrations in the Southeastern United States using MAIAC AOD retrievals and a two-stage model. Remote Sens. Environ. 2014, 140, 220-232.

(12). Hu, X.; Waller, L. A.; Lyapustin, A.; Wang, Y.; Liu, Y., 10-year spatial and temporal trends of PM2.5 concentrations in the southeastern US estimated using high-resolution satellite data. Atmos. Chem. Phys. 2014, 14, (12), 6301-6314.

(13). Hu, X.; Waller, L. A.; Lyapustin, A.; Wang, Y.; Liu, Y., Improving satellite-driven PM2.5 models with Moderate Resolution Imaging Spectroradiometer fire counts in the southeastern U.S. J. Geophys. Res., Atmos. 2014, 119, (19), 11375-11386.

(14). Wu, J.; Yao, F.; Li, W.; Si, M., VIIRS-based remote sensing estimation of ground-level PM2.5 concentrations in Beijing–Tianjin–Hebei: A spatiotemporal statistical model. Remote Sens. Environ. 2016, 184, 316-328.

(15). Kloog, I.; Nordio, F.; Coull, B. A.; Schwartz, J., Incorporating Local Land Use Regression And Satellite Aerosol Optical Depth In A Hybrid Model Of Spatiotemporal PM2.5 Exposures In The Mid-Atlantic States. Environ. Sci. Technol. 2012, 46, (21), 11913-11921.

(16). Wang, J.; Christopher, S. A., Intercomparison between satellite-derived aerosol optical thickness and PM2.5 mass: Implications for air quality studies. Geophys. Res. Lett. 2003, 30, (21), 2095.

(17). Hu, Z., Spatial analysis of MODIS aerosol optical depth, PM2.5, and chronic coronary heart disease. Int. J. Health Geogr. 2009, 8, (1), 27.

(18). Lee, H. J.; Liu, Y.; Coull, B. A.; Schwartz, J.; Koutrakis, P., A novel calibration approach of MODIS AOD data to predict PM2.5 concentrations. Atmos. Chem. Phys. 2011, 11, (15), 7991-8002.

(19). Liu, Y.; Sarnat, J. A.; Kilaru, A.; Jacob, D. J.; Koutrakis, P., Estimating ground-level PM2.5 in the eastern United States using satellite remote sensing. Environ. Sci. Technol. 2005, 39, (9), 3269-3278.

(20). Kloog, I.; Koutrakis, P.; Coull, B. A.; Lee, H. J.; Schwartz, J., Assessing temporally and spatially resolved PM2.5 exposures for epidemiological studies using satellite aerosol optical depth measurements. Atmos. Environ. 2011, 45, (35), 6267-6275.

(21). Ma, Z.; Hu, X.; Sayer, A. M.; Levy, R.; Zhang, Q.; Xue, Y.; Tong, S.; Bi, J.; Huang, L.; Liu, Y., Satellite-based spatiotemporal trends in PM2.5 concentrations: China, 2004–2013. Environ. Health Perspect. 2016, 124, (2), 184-192.

(22). Kloog, I.; Chudnovsky, A. A.; Just, A. C.; Nordio, F.; Koutrakis, P.; Coull, B. A.; Lyapustin, A.; Wang, Y.; Schwartz, J., A new hybrid spatio-temporal model for estimating daily multi-year PM2.5 concentrations across northeastern USA using high resolution aerosol optical depth data. Atmos. Environ. 2014, 95, 581-590.

(23). Gupta, P.; Christopher, S. A., Particulate matter air quality assessment using integrated surface, satellite, and meteorological products: 2. A neural network approach. J. Geophys. Res., Atmos. 2009, 114, D20205.

(24). Zou, B.; Wang, M.; Wan, N.; Wilson, J. G.; Fang, X.; Tang, Y., Spatial modeling of PM2.5 concentrations with a multifactoral radial basis function neural network. Environ. Sci. Pollut. Res. 2015, 22, (14), 10395-10404.

(25). Reid, C. E.; Jerrett, M.; Petersen, M. L.; Pfister, G. G.; Morefield, P. E.; Tager, I. B.; Raffuse, S. M.; Balmes, J. R., Spatiotemporal Prediction of Fine Particulate Matter During the 2008 Northern California Wildfires Using Machine Learning. Environ. Sci. Technol. 2015, 49, (6), 3887-3896.

(26). Di, Q.; Koutrakis, P.; Schwartz, J., A hybrid prediction model for PM2.5 mass and components using a chemical transport model and land use regression. Atmos. Environ. 2016, 131, 390-399.

(27). Zhan, Y.; Luo, Y.; Deng, X.; Chen, H.; Grieneisen, M. L.; Shen, X.; Zhu, L.; Zhang, M., Spatiotemporal prediction of continuous daily PM2.5 concentrations across China using a spatially explicit machine learning algorithm. Atmos. Environ. 2017, 155, 129-139.

(28). Di, Q.; Kloog, I.; Koutrakis, P.; Lyapustin, A.; Wang, Y.; Schwartz, J., Assessing PM2.5 Exposures with High Spatiotemporal Resolution across the Continental United States. Environ. Sci. Technol. 2016, 50, (9), 4712-4721.

(29). Friedman, J.; Hastie, T.; Tibshirani, R., The elements of statistical learning. Springer series in statistics Springer, Berlin: 2001; Vol. 1.

(30). Levy, R., Collection 006 (C6) Modis atmosphere l2 aerosol product. In 6 ed.; LAADS Web: https://ladsweb.nascom.nasa.gov/data/search.html, 2014.

(31). Turner, R., deldir: Delaunay triangulation and Dirichlet (Voronoi) tessellation. R package version 0.0-8 2009, URL http://cran.r-project.org/web/packages/deldir.

(32). Levy, R. C.; Mattoo, S.; Munchak, L. A.; Remer, L. A.; Sayer, A. M.; Patadia, F.; Hsu, N. C., The Collection 6 MODIS aerosol products over land and ocean. Atmos. Meas. Tech. 2013, 6, (11), 2989-3034.

(33). Belle, J.; Liu, Y., Evaluation of Aqua MODIS Collection 6 AOD Parameters for Air Quality Research over the Continental United States. Remote Sens. 2016, 8, (10), 815.

(34). Li, S. S.; Garay, M. J.; Chen, L. F.; Rees, E.; Liu, Y., Comparison of GEOS-Chem aerosol optical depth with AERONET and MISR data over the contiguous United States. J. Geophys. Res., Atmos. 2013, 118, (19), 11228-11241.

(35). Schaap, M.; Apituley, A.; Timmermans, R. M. A.; Koelemeijer, R. B. A.; de Leeuw, G., Exploring the relation between aerosol optical depth and PM2.5 at Cabauw, the Netherlands. Atmos. Chem. Phys. 2009, 9, (3), 909-925.

(36). Breiman, L., Random Forests. Mach. Learn. 2001, 45, (1), 5-32.

(37). Liaw, A.; Wiener, M., Classification and Regression by randomForest. R News 2002, 2, (3), 18-22.

(38). R Core Team, A language and environment for statistical computing. In R Foundation for Statistical Computing: Vienna, Austria, 2016.

(39). Strobl, C.; Boulesteix, A.-L.; Kneib, T.; Augustin, T.; Zeileis, A., Conditional variable importance for random forests. BMC Bioinform. 2008, 9, (1), 307.

(40). Lee, M.; Kloog, I.; Chudnovsky, A.; Lyapustin, A.; Wang, Y.; Melly, S.; Coull, B.; Koutrakis, P.; Schwartz, J., Spatiotemporal prediction of fine particulate matter using high-resolution satellite images in the Southeastern US 2003-2011. J. Expo. Sci. Environ. Epidemiol. 2016, 26, (4), 377-384.

(41). Díaz-Uriarte, R.; Alvarez de Andrés, S., Gene selection and classification of microarray data using random forest. BMC Bioinform. 2006, 7, (1), 3.

(42). Green, M.; Kondragunta, S.; Ciren, P.; Xu, C., Comparison of GOES and MODIS Aerosol Optical Depth (AOD) to Aerosol Robotic Network (AERONET) AOD and IMPROVE PM2.5 Mass at Bondville, Illinois. J. Air Waste Manage. Assoc. 2009, 59, (9), 1082-1091.
