A High Resolution Spatiotemporal Fine Particulate Matter Exposure Assessment Model for the Contiguous United States

Currently available nationwide prediction models for fine particulate matter (PM2.5) lack prediction confidence intervals and usually do not describe cross validated model performance at different spatiotemporal resolutions and extents. We used 41 different spatiotemporal predictors, including data on land use, meteorology, aerosol optical density, emissions, wildfires, population, traffic, and spatiotemporal indicators, to train a machine learning model to predict daily average PM2.5 concentrations at 0.75 sq km resolution across the contiguous United States from 2000 through 2020. We utilized a generalized random forest model that allowed us to generate asymptotically-valid prediction confidence intervals and took advantage of its ensemble structure to quickly and cheaply characterize leave-one-location-out CV model performance for different temporal resolutions and geographic regions. Using a variable importance metric, we selected 8 predictors that were able to accurately predict daily PM2.5, with an overall leave-one-location-out cross validated median absolute error of 1.20 μg/m³, an R² of 0.84, and a confidence interval coverage fraction of 95%. When considering aggregated temporal windows, the model achieved leave-one-location-out cross validated median absolute errors of 0.99, 0.76, 0.63, and 0.60 μg/m³ for weekly, monthly, annual, and all-time exposure assessments, respectively. We further describe the model's cross validated performance in different geographic regions of the United States, finding that it performs worse in the Western half of the country, where there are fewer monitors. The code and data used to create this model are publicly available, and we have developed software packages to be used for exposure assessment. 
This accurate exposure assessment model will be useful for epidemiologists seeking to study the health effects of PM2.5 across the continental United States, while possibly considering exposure estimation accuracy and uncertainty specific to the spatiotemporal resolution and extent of their study design and population.


Introduction
Particulate matter less than 2.5 μm in aerodynamic diameter (PM 2.5 ) has a wide range of negative health effects, including effects on the respiratory, cardiovascular, metabolic, nervous, and reproductive systems, as well as adverse pregnancy and birth outcomes, cancer, and mortality ("Integrated Science Assessment for Particulate Matter" 2020). Newer developments in PM 2.5 exposure assessment methods have improved the spatial resolution and accuracy of estimates, often reducing the bias and uncertainty present in health studies. Furthermore, the relationship between PM 2.5 and health effects has been studied at increasingly higher temporal resolutions, allowing for investigation of its short-term and acute effects. Application of these models within large registries and multi-site studies has highlighted the need for timely, accessible, open, high resolution, and nationwide exposure assessment models that can quickly be tailored to individual studies at different spatiotemporal resolutions and extents. Current nationwide PM 2.5 exposure assessment models do not consider model performance at different spatiotemporal aggregations, resolutions, and extents (e.g., daily estimates at zip codes across the entire United States versus annual averages at exact locations within one city), which would allow for study-specific estimates of bias and uncertainty. Furthermore, uncertainty in exposure assessment of ambient pollutants is often ignored in health studies because exposure predictions do not include prediction standard errors or prediction 95% confidence intervals. Without this critical feature, scientists cannot propagate uncertainty from exposure assessment to health effects models in order to produce unbiased estimates that truly reflect the uncertainty contributed by both exposure estimation and health effects modeling.
The objective here was to create an open and reusable high resolution (0.75 sq km) spatiotemporal PM 2.5 exposure assessment model for the contiguous United States from 2000 through 2020. We used existing PM 2.5 measurements from the Environmental Protection Agency to train a prediction model based on features related to land use, meteorology, aerosol optical density, emissions, wildfires, population density, development, and traffic. We designed this model such that it could (1) produce accurate predictions with accompanying estimates of prediction uncertainty and (2) be used to easily quantify cross validated accuracy at different spatial and temporal resolutions and extents.

Study Domain and Spatiotemporal Grid
In order to spatially harmonize predictors and predictions, the H3 hexagonal hierarchical spatial index (Brodsky 2018) at a resolution of 8 was used to create a grid of 11,932,970 hexagonal cells across the contiguous United States. Each hexagon grid cell is represented using a geohash of 15 characters and has an average area of 0.74 sq km and an average side length of 461.4 m. A hierarchical spatial index was used to quickly scale spatial data and because an average polygon (e.g., zip code or census tract) can be approximated with hexagonal tiles with a smaller margin of error than would be possible with square tiles (Brodsky 2018). The latitude and longitude of each measurement location was geohashed to its containing resolution 8 H3 grid cell, and co-located daily measurements were averaged.

Land Use
Land use information was obtained from the National Land Cover Database (NLCD) and summarized at each resolution 8 H3 grid cell as the average percentage imperviousness of the contained 30 m x 30 m cells. We also included the fraction of each cell that was "green" (i.e., any NLCD category except water, ice/snow, developed medium intensity, developed high intensity, and rock/sand/clay) as well as the fraction of each cell that was non-impervious. The impervious category was further classified into types of imperviousness: primary urban, primary rural, secondary urban, secondary rural, tertiary urban, tertiary rural, thinned urban, thinned rural, nonroad urban, nonroad rural, energy production urban, and energy production rural. NLCD data were available as annual values for 2001, 2006, 2011, and 2016. In the training and prediction data, each day was assigned to the nearest calendar year with available data.
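The nearest-year assignment described above can be sketched in a few lines. This is an illustrative Python sketch (the analysis itself was done in R), and the function name is ours:

```python
# Assign an observation's calendar year to the nearest year with NLCD data,
# as described for the training and prediction data.
NLCD_YEARS = (2001, 2006, 2011, 2016)

def nearest_nlcd_year(year: int, available=NLCD_YEARS) -> int:
    """Return the available NLCD year closest to the given calendar year."""
    return min(available, key=lambda y: abs(y - year))
```

The same rule applies to any predictor released on a multi-year cycle; only the tuple of available years changes.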

Meteorology
Meteorological information was obtained from the National Centers for Environmental Prediction (NCEP) North American Regional Reanalysis (NARR) dataset (Kistler et al. 2001). Daily averages of planetary boundary layer height, visibility, wind speed and direction, air temperature, relative humidity, precipitation rate, and surface pressure for each resolution 8 H3 grid cell were calculated based on the 12 x 12 km NARR cell that contained its centroid.

Aerosol Optical Density
Aerosol optical density (AOD) information was obtained from the Moderate Resolution Imaging Spectroradiometer (MODIS) aboard the National Aeronautics and Space Administration (NASA) Terra and Aqua satellites. Specifically, AOD at 470 nm from the Multi-Angle Implementation of Atmospheric Correction (MAIAC) Land AOD Daily L2G 1km SIN Grid V006 product (MCD19A2) (https://doi.org/10.5067/MODIS/MCD19A2.006) was used by joining the centroid of each 1 km grid cell to its containing resolution 8 H3 hexagon cell. AOD measurements were utilized if they were (1) less than 2, (2) from pixels that were "Clear" or "Possibly Cloudy" in the Cloud Mask, and (3) from pixels that were "Clear" or "Adjacent to a single cloudy pixel" in the Adjacency Mask.

PM 2.5 Emissions
Emissions information was obtained from the National Emissions Inventory (NEI) Database as annual estimated tons of PM 2.5 , available for 2008, 2011, 2014, and 2017. County-level emissions due to nonpoint, onroad, nonroad, and event sources were joined to all containing resolution 8 H3 hexagon cells. NEI event sources capture wildfires and prescribed burns and are calculated by the EPA based on satellite detection approaches, fire models, and fire activity data provided by local agencies. The exact latitude and longitude of point sources were assigned to their containing resolution 8 H3 hexagon cells and all cells were assigned a value equal to the distance to the closest NEI site. In the training and prediction data, each day was assigned to the nearest calendar year with data.
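The distance-to-closest-point-source feature can be approximated with a great-circle distance calculation. The following Python sketch is illustrative only (the original work was implemented in R, and coordinates and function names here are hypothetical):

```python
# Great-circle (haversine) distance from a grid-cell centroid to the
# closest NEI point source, in kilometers.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    R = 6371.0  # mean Earth radius, km
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

def distance_to_closest_site(centroid, sites):
    """Minimum distance (km) from a cell centroid to any NEI point source."""
    return min(haversine_km(*centroid, *site) for site in sites)
```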

Wildfires
Wildfire information was obtained from Version 1.5 of the Fire Inventory from the National Center for Atmospheric Research (FINN) dataset (Wiedinmyer et al. 2011), which estimates daily open biomass burning at a 1 sq km resolution as the area of total land burned and the estimated PM 2.5 emitted from the fire event. The centroid of each 1 km grid cell was joined to its containing resolution 8 H3 hexagon cell, with daily estimates available for 2002 through 2018.

Population Density
Estimates of total population from the 2018 5-yr American Community Survey at the census tract level were used to calculate the population density for each resolution 5 H3 hexagon cell as the area-weighted total population divided by the total area.
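The area-weighted calculation amounts to apportioning each tract's population by the fraction of its area that overlaps the cell. A minimal Python sketch (illustrative; the paper's analysis used R, and the input structure here is our own assumption):

```python
# Area-weighted population density: each census tract contributes population
# in proportion to the fraction of its area overlapping the hexagon cell.
def population_density(overlaps):
    """overlaps: list of (tract_population, tract_area_km2, overlap_area_km2).
    Returns people per sq km over the union of the overlapping pieces."""
    pop = sum(p * (o / a) for p, a, o in overlaps)
    area = sum(o for _, _, o in overlaps)
    return pop / area
```

When the tracts fully tile the cell, the summed overlap area equals the cell area, matching the description above.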

Traffic
Distance to the closest S1100 road from the 2018 TIGER/Line Shapefiles for the centroid of each resolution 8 H3 hexagon cell centroid was used to approximate interstate traffic.

Derived Spatiotemporal Predictors
To include temporal information, we included the calendar year, day of the year, and day of the week as numeric predictors, as well as an indicator variable that was true if the date was a U.S. holiday (New Year's Day, Fourth of July, Thanksgiving, Christmas, Labor Day, Memorial Day, Martin Luther King Jr. Day). Spatial predictors were included as the X and Y coordinates of each grid cell and an identifier for each H3 grid cell.
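The derived temporal predictors for a single date can be sketched as follows (a Python illustration, not the authors' R code; for brevity the holiday set below includes only the fixed-date holidays, not the floating ones listed above):

```python
# Derived temporal predictors for one observation date.
from datetime import date

def temporal_predictors(d: date) -> dict:
    return {
        "year": d.year,
        "doy": d.timetuple().tm_yday,   # day of the year, 1-366
        "dow": d.isoweekday(),          # day of the week, 1 = Monday
        # fixed-date U.S. holidays only (illustrative subset)
        "holiday": (d.month, d.day) in {(1, 1), (7, 4), (12, 25)},
    }
```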
A convolution function was used to create a training feature describing average regional daily PM 2.5 by spatially aggregating the measurements to resolution 3 H3 grid cells (average area of 12,392 sq km each), calculating the inverse distance weighted mean of neighboring cells within a 5-ring buffer, and then calculating a moving three day average.
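The two stages of the convolution feature, an inverse distance weighted mean across neighboring coarse cells followed by a trailing three day moving average, can be sketched like this (illustrative Python, not the authors' R implementation; distances and values are hypothetical):

```python
# Stage 1: inverse distance weighted mean across neighboring coarse cells.
def idw_mean(neighbors):
    """neighbors: list of (value, distance_km); closer cells weigh more."""
    weighted = [(v, 1.0 / d) for v, d in neighbors]
    return sum(v * w for v, w in weighted) / sum(w for _, w in weighted)

# Stage 2: trailing three day moving average of the regional values.
def three_day_average(series):
    """Trailing 3-day moving average; shorter windows at the series start."""
    return [sum(series[max(0, i - 2): i + 1]) / len(series[max(0, i - 2): i + 1])
            for i in range(len(series))]
```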

Generalized Random Forest Modeling
We used the generalized random forest (GRF) framework (Athey, Tibshirani, and Wager 2019) to train a regression forest. This differs from the more traditional random forest by (1) sampling without replacement for each resample in order to be able to generate variance estimates and (2) using the gradient of the objective function to optimize a linear approximation to the criterion in order to reduce computation times. Within the GRF, variances for predictions are estimated by training trees in small groups on subsets of each resample and then comparing predictions within and across groups (Athey, Tibshirani, and Wager 2019).
We accounted for autocorrelation among measurements from the same location in order to produce cluster-robust predictions and variance estimates by resampling at the location level rather than the measurement (i.e., location-day) level. This was accomplished by using the H3 grid cell identifier to select observations for resamples, such that all observations from a given location were selected together during training and prediction.
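Location-level (cluster) subsampling can be sketched as below (a Python illustration of the idea; the paper's model used the grf R package, and names here are ours):

```python
# Cluster subsampling: draw monitor locations without replacement, and let
# every daily observation at a drawn location enter the resample together.
import random

def cluster_subsample(observations, frac=0.5, seed=0):
    """observations: list of (h3_cell, day, value) tuples."""
    locations = sorted({obs[0] for obs in observations})
    rng = random.Random(seed)
    chosen = set(rng.sample(locations, k=int(len(locations) * frac)))
    return [obs for obs in observations if obs[0] in chosen]
```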
Missing data in predictors were handled using the missingness incorporated in attributes (MIA) criterion (Twala, Jones, and Hand 2008) because our previous work has shown that missingness itself, especially within AOD measurements, can be a good predictor of PM 2.5 concentrations (Brokamp, Jandarov, et al. 2018). In practice, this means that observations with missing values can be split on missingness and/or their observed values.
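The core of the MIA idea can be illustrated with a toy split evaluation: missing observations are sent to whichever child node yields the lower error. This is a simplified Python sketch of the criterion, not the grf implementation:

```python
# MIA sketch: when evaluating a split on a predictor with missing values
# (None), try routing the missing observations to each child and keep the
# assignment with the lower total sum of squared errors (SSE).
def sse(ys):
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def mia_split_sse(xs, ys, threshold):
    observed = [(x, y) for x, y in zip(xs, ys) if x is not None]
    missing = [y for x, y in zip(xs, ys) if x is None]
    left = [y for x, y in observed if x <= threshold]
    right = [y for x, y in observed if x > threshold]
    # missing values join whichever child minimizes the total SSE
    return min(sse(left + missing) + sse(right),
               sse(left) + sse(right + missing))
```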
Random forests were trained using a subsample of half the available training data for each tree and each split was generated by considering 14 randomly selected predictors, with the restrictions that each resulting split could not be less than 5% of the observations in the parent node and must have at least 1 observation. We trained one random forest model per calendar year for the entire contiguous United States.

Assessing Cross Validated Model Performance
Leave one location out (LOLO) cross validated (CV) predictions were created for each measured daily PM 2.5 concentration by holding out one location and predicting all of its daily measurements using the remaining locations. To estimate model performance when averaging concentrations over weeks, months, years, or all time, we averaged the measured concentrations and predictions and defined the standard error of each averaged prediction as the square root of the sum of the squared individual standard errors divided by the number of averaged values. Cross validated statistics for each of the held out locations and temporal windows (daily, weekly, monthly, annual, all) were calculated, including median absolute error (MAE), root mean squared error (RMSE), squared correlation between predicted and observed (R²), slope from an intercept-less regression model between predicted and observed (slope), and the percentage of measurements that were covered by the 95% prediction confidence interval. The cross validated statistics across all locations for each temporal window were summarized using the median and boxplots.

Variable Selection
We initially used a variable importance metric (a weighted sum of how many times each feature was split on at each depth in the forest, up to 4 splits deep) to sequentially reduce the number of predictors included in the GRF model while estimating the cross validated MAE. At each step, we removed variables that were at least one order of magnitude lower in importance compared to the most important variable. We balanced the loss in accuracy against the computational time needed to train and predict using the GRF model as well as the interpretability of the included predictors.
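One elimination step of this procedure reduces to a simple cutoff on the importance scores. A Python sketch (illustrative; scores and names are hypothetical):

```python
# Keep only predictors within one order of magnitude of the most
# important predictor; the rest are dropped at this elimination step.
def keep_predictors(importance, factor=10.0):
    """importance: dict of predictor name -> importance score."""
    cutoff = max(importance.values()) / factor
    return {k: v for k, v in importance.items() if v > cutoff}
```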

Computing
All aspects of this analysis were completed in R (R Core Team 2021), specifically using the sf (Pebesma 2018), h3 (Kuethe 2019), fst (Klik 2020) and grf (Tibshirani, Athey, and Wager 2020) packages. All code used to acquire and harmonize data, train and evaluate models, and make predictions is publicly available in an online repository (https://github.com/geomarker-io/st_pm_hex).

Training Data
Data on measured PM 2.5 concentrations and 41 different predictors, including land use, meteorology, AOD, PM 2.5 emissions, wildfires, population density, traffic, and spatiotemporal indicators, were harmonized to daily averages within cells in a hexagonal lattice pattern defined by the H3 hexagonal hierarchical spatial index (Brodsky 2018). This resulted in 11,932,970 hexagonal cells covering the contiguous United States, with an average area of 0.74 sq km each. The hierarchical nature of the index system allowed us to quickly join indexed data across disparate spatial resolutions and aggregate them across different levels of precision as needed. The hexagon is preferable to other polygons that tile regularly (i.e., square and triangle) because its immediate neighbors are equidistant and expanding rings of neighbors approximate circles, which reduces spatial distortion and misalignment problems when scaling spatial data in a hierarchical grid or lattice.
2,374,589 24-hour average PM 2.5 measurements from 2000 through 2020 contained within 1,685 unique hexagon grid cells were retrieved. Grid cells with PM 2.5 measurements had a median of 1,042 distinct daily measurements, and each calendar day with PM 2.5 measurements had a median of 150 grid cells. Daily measurements ranged from -0.1 to 640.6 μg/m³, with a mean of 10.76 μg/m³ (25th percentile: 5.80, median: 9.00, 75th percentile: 13.60 μg/m³). Notably, only 0.3% (n = 7,184) of grid-days with PM 2.5 measurements had non-missing AOD measurements. Table 3 contains the number of PM 2.5 measurements and the mean PM 2.5 concentration for each calendar year.

Generalized Random Forest Modeling
We trained regression forests using the generalized random forest (GRF) framework to predict measured PM 2.5 concentrations. Variable importance was used to select the following predictors (listed in order of decreasing importance in the final model): PM 2.5 convolution, X coordinate, day of year, air temperature, PM 2.5 emissions from event sources, planetary boundary layer height, Y coordinate, and wind speed. The variable importance for the initial regression forest containing all predictors is presented in Table 1. The cross validated MAE for this forest was 0.97 μg/m³. Regression forests trained with the 28, 14, and 8 predictors with the highest variable importances had cross validated MAEs of 0.97, 0.93, and 0.90 μg/m³, respectively. Our variable importance results suggest that aerosol optical density did not significantly increase PM 2.5 prediction accuracy, similar to a previous nationwide PM 2.5 prediction model (Di et al. 2019) and a research report on this question (Paciorek and Liu 2012). Notably, the initial model with 41 predictors had an out of bag MAE of 0.97 μg/m³ and was reduced to 8 total predictors while maintaining a similarly low out of bag MAE of 0.90 μg/m³. Utilizing a lower number of predictors in the final model reduces the computational resources required to train and predict with machine learning models. Furthermore, although decreases in exposure prediction error do not necessarily lead to less biased health effects estimates (Szpiro, Paciorek, and Sheppard 2011), including exposure predictors as covariates in health effects models can cause bias (Cefalu and Dominici 2014). However, there are methods proposed for correcting inference based on outcomes predicted by machine learning (Wang, McCormick, and Leek 2020) and specifically within air pollution epidemiology (Szpiro, Sheppard, and Lumley 2011; Szpiro and Paciorek 2013; Gryparis et al. 2008).

Quantifying Model Performance
Although leave-one-location-out (LOLO) CV estimates have lower bias and variance compared to k-fold approaches (Watson et al. 2020), the latter are often used instead because LOLO CV normally requires computationally intensive, repeated model training on data subsets or resamples. Here, we were able to estimate LOLO CV predictions without requiring expensive iterative model training by making predictions using only trees from the generalized random forest that did not use the held-out location for training.
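The trick can be sketched compactly: because each tree is trained on a location-level resample, a held-out location's prediction simply averages the trees whose resample excluded it. An illustrative Python sketch (the real model used the grf R package; the data structures here are hypothetical):

```python
# LOLO prediction from a single trained ensemble: average only the trees
# whose training resample did not include the held-out location, so no
# model retraining is needed.
def lolo_predict(trees, location, x):
    """trees: list of (trained_locations, predict_fn) pairs."""
    preds = [predict(x) for locs, predict in trees if location not in locs]
    return sum(preds) / len(preds)
```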
This allowed for the use of a single trained ensemble to generate LOLO estimates for 1,685 unique locations and proved to be an advantage over predictive machine learning models that do not utilize this resample-and-aggregate ensemble framework, such as boosted trees and neural networks. Instead of using one error statistic to summarize model performance, we used this novel method to quickly and cheaply characterize LOLO prediction error for different temporal resolutions (e.g., daily, weekly, monthly, annual) and geographic regions.
The medians of the cross validated statistics across all locations for each temporal window are presented in Table 2. Even though the number of overall measurements decreased over time, the overall concentration and variability of PM 2.5 also decreased over time, such that model performance did not change drastically for any one given year (Table 3). Cross validated model performance for different geographic regions is shown in Figure 2. The model tended to perform better in the Eastern half of the United States, which had more PM 2.5 measurements. Notably, 3 of the regions in the Western United States did not have any PM 2.5 measurements, so cross validated error estimates could not be calculated for these regions. Estimation of model accuracy for predicted exposures in different regions of the United States as well as at different temporal aggregations will allow health studies to consider the impact of model accuracy specific to their spatiotemporal domain; for example, daily exposures within one major metropolitan area versus monthly exposures across the entire United States. Figure 4 presents the relationship between the length of the predicted PM 2.5 confidence interval and the residual for 1,000 randomly selected cross validated predictions. As expected, the model is more confident (i.e., shorter prediction 95% confidence intervals) when it is more accurate (i.e., smaller residual values).

Discussion
This work is the first PM 2.5 exposure assessment model with prediction confidence intervals and will allow for use of existing approaches (Szpiro and Paciorek 2013; Spiegelman 2016; Zandbergen et al. 2012) to propagate uncertainty from exposure estimation to health effect estimation. Other approaches have taken an empirical approach to estimating prediction uncertainty by using the standard deviation of predictions from multiple models or of prediction residuals, but this model is the first to produce asymptotically-valid prediction confidence intervals. Additionally, our novel approach to LOLO CV will allow for quick estimation of model accuracy at different spatial and temporal resolutions and extents that can be specialized to specific study populations.
Compared to other recent nationwide PM 2.5 prediction models, our model demonstrated a similar level of cross validated accuracy, with an LOLO daily CV R² of 0.84 compared to 10-fold CV R² estimates of 0.86 (Di et al. 2019) and 0.84 (Hu et al. 2017). Notably, our cross validated predictions at the annual level showed improved performance (LOLO CV R²: 0.95) compared to another existing model that also assessed annual performance (Di et al. 2019) (10-fold CV R²: 0.89). Small differences in model performance are to be expected given that our model included more recent predictions (i.e., 2018-2020) and that we used a leave-one-location-out rather than a 10-fold approach to estimate cross validated model performance.
One limitation inherent to our approach is possible spatiotemporal discontinuities in predicted concentrations due to the use of spatiotemporal data with different resolutions and frequencies. One example of this can be seen in Figure 5: predictions for November 13th, 2019 in Los Angeles, CA show discontinuities coinciding with the boundaries of the NARR meteorological data. In the future, the use of the hexagonal hierarchical spatial indexing system will be advantageous for querying model predictions at different spatial aggregation scales, which might be necessary for protection of private health information in epidemiologic studies. Furthermore, the hexagonal hierarchical spatial index was designed to minimize quantization error introduced when people move throughout cities, making this exposure assessment tool ideal for utilizing high resolution GPS or travel activity data in order to assess high resolution daily ambient PM 2.5 exposures for health effects studies. The increasingly high temporal precision at which health outcomes are measured will improve temporal environmental health studies, but will require hourly resolution exposure assessment models; these will become feasible with the upcoming NASA TEMPO satellite (Zoogman et al. 2017) and have been demonstrated elsewhere (Jiang et al. 2021; Sun, Gong, and Zhou 2021). Future research could work to extend this modeling framework to hourly PM 2.5 concentrations by utilizing TEMPO satellite data and hourly meteorological data. Additionally, the current model framework could also be expanded to other EPA criteria pollutants.
As this set of predictions and the prediction model are maintained and used for exposure assessment and methodology, we aim to make them findable, accessible, interoperable, and reusable (FAIR) (Wilkinson et al. 2016). In addition to making the code and data used to create the model publicly available, we are also making estimates freely available and have developed software packages to be used for exposure assessment. This free and open source exposure prediction model could provide more insight and transparency into human health studies that are eventually used for policy decisions, including the EPA's National Ambient Air Quality Standards. FAIR model predictions and software tools will hopefully make high resolution exposure assessment more easily and widely used in human health studies. Ideally, this would reduce barriers to entry for scientists interested in environmental exposures and would facilitate more interdisciplinary research in existing and ongoing nationwide health studies.

Conclusion
In conclusion, this accurate exposure assessment model will be useful for epidemiologists seeking to study the health effects of PM 2.5 across the continental United States, while possibly considering exposure estimation accuracy and uncertainty specific to the spatiotemporal resolution and extent of their study design and population.