Submitted:
23 October 2024
Posted:
23 October 2024
Abstract
Keywords:
1. Introduction
- Augment the imbalanced PM2.5 dataset using cluster-based undersampling at several majority-to-minority class ratios
- Investigate how two minority-majority cutoff thresholds, both based on limits set by the EPA, affect model performance
- Build and train a transformer model that leverages multi-head attention for PM2.5 forecasting
- Develop a robust forecasting model that accurately predicts PM2.5 concentrations in New York City, Philadelphia, and Washington, D.C., particularly during extreme pollution spikes caused by events such as wildfires
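The first contribution can be illustrated with a minimal sketch of cluster-based undersampling. It assumes the low-value (majority) class is grouped by a simple one-dimensional k-means on the PM2.5 value and then sampled in proportion to cluster size. The paper's actual clustering features and algorithm are not given in this outline, so every name and parameter below is hypothetical.

```python
import random


def kmeans_1d(values, k, iters=20, seed=0):
    """Tiny 1-D k-means; returns a cluster index for each value."""
    rng = random.Random(seed)
    centers = rng.sample(values, k)
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Move each center to the mean of its members (keep it if the cluster is empty).
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def cluster_undersample(majority, n_keep, k=3, seed=0):
    """Keep roughly n_keep majority samples, drawn proportionally from each cluster."""
    rng = random.Random(seed)
    labels = kmeans_1d(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [v for v, lab in zip(majority, labels) if lab == c]
        # Per-cluster quota; rounding means the total can differ slightly from n_keep.
        quota = round(n_keep * len(members) / len(majority))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept


# Hypothetical low-value PM2.5 readings (assumption for illustration only).
majority = [i / 10 for i in range(100)]
kept = cluster_undersample(majority, n_keep=30)
print(len(kept), "of", len(majority), "majority samples kept")
```

Sampling within clusters, rather than uniformly at random, is intended to keep the reduced majority class representative of the original value distribution.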
2. Literature Review
2.1. Data Augmentation Techniques for PM2.5
2.2. Transformer-Based PM2.5 Prediction Models
3. Data
3.1. Study Area
3.2. Data Description
3.2.1. Ground-Level PM2.5 Measurements
3.2.2. Satellite-Derived Aerosol Optical Depth (AOD)
3.2.3. Meteorological Variables
3.2.4. Geographical Variables
4. Methodology
4.1. Data Preprocessing and Collocation
4.2. Cutoff Threshold
4.3. Cluster-Based Undersampling
4.4. Transformer Model Architecture

4.4.1. Positional Encoding
4.4.2. Multi-Head Attention
4.4.3. Encoder
4.4.4. Decoder
4.5. Model Training and Evaluation
4.5.1. Model Training and Hyperparameter Tuning
4.5.2. Accuracy Measures
5. Experiments and Results
5.1. Accuracy Assessment
5.2. Partial Sampling Ratio
5.3. Cutoff Threshold
5.4. Time Series Analysis

6. Discussion
7. Conclusion







| City | Area | Population | Coordinates |
|---|---|---|---|
| New York City | 790 square km (302.6 square miles) | 8.336 million | 40.4774° N, 74.2591° W (southwest) to 40.9176° N, 73.7004° W (northeast) |
| Philadelphia | 347.52 square km (134.18 square miles) | 1.567 million | 39.8670° N, 75.2803° W (southwest) to 40.1379° N, 74.9558° W (northeast) |
| Washington, D.C. | 176 square km (68 square miles) | 671,803 | 38.7916° N, 77.1198° W (southwest) to 38.9955° N, 76.9094° W (northeast) |

| Variable | Type | Source | Unit |
|---|---|---|---|
| PM2.5 | Target | AirNow | μg/m³ |
| AOD | Covariate | MODIS MAIAC (Terra and Aqua) | Unitless |
| Boundary Layer Height | Covariate | ECMWF ERA5-hourly | m |
| Relative Humidity | Covariate | ECMWF ERA5-hourly | % |
| Temperature (at 2 m) | Covariate | ECMWF ERA5-hourly | K |
| Surface Pressure | Covariate | ECMWF ERA5-hourly | Pa |
| Wind Speed | Covariate | ECMWF ERA5-hourly | m/s |
| Elevation | Covariate | USGS | m |

| City | Total Samples | Low-Value | High-Value | Ratio of High to Low-Value |
|---|---|---|---|---|
| New York City | 2,284,200 | 2,038,105 | 246,095 | 0.1207 |
| Washington, D.C. | 456,840 | 406,716 | 50,124 | 0.1232 |
| Philadelphia | 1,285,200 | 1,027,868 | 257,332 | 0.2503 |
| Total | 4,026,240 | 3,472,689 | 553,551 | 0.1594 |

| City | Total Samples | Low-Value | High-Value | Ratio of High to Low-Value |
|---|---|---|---|---|
| New York City | 2,284,200 | 2,272,914 | 11,286 | 0.00496 |
| Washington, D.C. | 456,840 | 454,649 | 2,191 | 0.00481 |
| Philadelphia | 1,285,200 | 1,276,870 | 8,330 | 0.00652 |
| Total | 4,026,240 | 4,004,433 | 21,807 | 0.00544 |
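
How a cutoff threshold turns continuous PM2.5 readings into low-value and high-value groups, and how the ratio column in the two class-distribution tables above is obtained, can be sketched as follows. The readings and the cutoff are placeholder values rather than the EPA limits used in the paper, the choice to treat values equal to the cutoff as low-value is an assumption, and the function names are hypothetical.

```python
def split_by_cutoff(values, cutoff):
    """Split readings into low-value (<= cutoff) and high-value (> cutoff) lists."""
    low = [v for v in values if v <= cutoff]
    high = [v for v in values if v > cutoff]
    return low, high


def high_to_low_ratio(n_high, n_low):
    """Ratio of high-value to low-value sample counts."""
    return n_high / n_low


# Placeholder readings in ug/m3 and a placeholder cutoff (assumptions for illustration).
readings = [4.2, 7.9, 12.5, 36.0, 8.8, 55.1]
low, high = split_by_cutoff(readings, cutoff=35.0)

# Counts taken from the New York City row of the first class-distribution table.
nyc_ratio = high_to_low_ratio(246_095, 2_038_105)
print(len(low), len(high), round(nyc_ratio, 4))  # prints: 4 2 0.1207
```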
| Partial Sampling Ratio | High-Value Samples | Low-Value Samples |
|---|---|---|
| 10/90 | 3,498 | 31,482 |
| 20/80 | 6,996 | 27,984 |
| 30/70 | 10,494 | 24,486 |
| 40/60 | 13,992 | 20,988 |
| 50/50 | 17,490 | 17,490 |
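
The counts in the table above are consistent with splitting a fixed budget of 34,980 samples (the sum of every row) according to the partial sampling ratio. A small sketch of that arithmetic, using a hypothetical helper name:

```python
def partial_sampling_counts(total, high_pct):
    """Return (high_value, low_value) counts for a high/low percentage split."""
    high = total * high_pct // 100
    return high, total - high


TOTAL = 34_980  # high-value plus low-value samples in each row of the table

for high_pct in (10, 20, 30, 40, 50):
    n_high, n_low = partial_sampling_counts(TOTAL, high_pct)
    print(f"{high_pct}/{100 - high_pct}: {n_high:,} high-value, {n_low:,} low-value")
```

For example, the 10/90 setting gives 3,498 high-value and 31,482 low-value samples, matching the first row.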
| Training Parameter | Values |
|---|---|
| Model training data | 2021, 2022, 2023 |
| Data split | Training (80%) and testing (20%) |
| Optimizer | Adam |
| Learning Rate | 0.001 |
| Epochs | 20 |
| Number of encoder and decoder layers | 6 |
| Model Dimension | 8 |
| Batch Size | 256 |
| Input length | 8 |
| Output length | 8 |
| Dropout Rate | 0.1 |
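
With an input length of 8 and an output length of 8, each training example presumably pairs 8 consecutive time steps with the 8 that follow. The sketch below shows one common way to build such windows. The stride of 1 and the function name are assumptions, since the paper's actual windowing is not described in this outline.

```python
def make_windows(series, input_len=8, output_len=8, stride=1):
    """Slice a sequence into (input, target) pairs of consecutive time steps."""
    pairs = []
    last_start = len(series) - input_len - output_len
    for start in range(0, last_start + 1, stride):
        x = series[start:start + input_len]
        y = series[start + input_len:start + input_len + output_len]
        pairs.append((x, y))
    return pairs


series = list(range(24))      # stand-in for 24 consecutive PM2.5 values
pairs = make_windows(series)
print(len(pairs))             # 9 windows, with start positions 0 through 8
print(pairs[0])               # first input is steps 0-7, first target is steps 8-15
```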

| Resampling Ratio | RMSE (Whole) | MAE (Whole) | R² (Whole) | RMSE (High-Value) | MAE (High-Value) | R² (High-Value) |
|---|---|---|---|---|---|---|
| Original | 3.174 | 0.661 | 0.801 | 32.013 | 26.705 | 0.036 |
| 10/90 | 3.217 | 0.726 | 0.796 | 29.366 | 20.284 | 0.188 |
| 20/80 | 3.090 | 1.145 | 0.812 | 25.948 | 19.044 | 0.366 |
| 30/70 | 2.823 | 1.535 | 0.843 | 25.243 | 18.827 | 0.400 |
| 40/60 | 2.816 | 1.325 | 0.845 | 23.284 | 17.383 | 0.490 |
| 50/50 | 2.757 | 1.044 | 0.850 | 21.287 | 14.114 | 0.574 |

| Resampling Ratio | RMSE (Whole) | MAE (Whole) | R² (Whole) | RMSE (High-Value) | MAE (High-Value) | R² (High-Value) |
|---|---|---|---|---|---|---|
| Original | 3.174 | 0.661 | 0.801 | 41.34 | 28.269 | 0.607 |
| 10/90 | 2.282 | 1.592 | 0.897 | 19.747 | 13.81 | 0.633 |
| 20/80 | 2.080 | 1.386 | 0.914 | 15.353 | 10.077 | 0.778 |
| 30/70 | 2.306 | 1.671 | 0.895 | 16.095 | 12.204 | 0.756 |
| 40/60 | 2.423 | 1.726 | 0.884 | 16.556 | 12.917 | 0.741 |
| 50/50 | 2.677 | 1.875 | 0.858 | 19.116 | 14.321 | 0.656 |
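
The three accuracy measures reported in the results tables have standard definitions, sketched here in plain Python. The R² below is the coefficient of determination (one minus the ratio of the residual to the total sum of squares). Whether the paper uses this form or the squared correlation is not stated in this outline, so treat that choice as an assumption. The sample values are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up observed and predicted PM2.5 values (ug/m3).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
print(round(rmse(y_true, y_pred), 3), mae(y_true, y_pred), round(r2(y_true, y_pred), 3))
```

Computing the same functions on only the high-value subset of the test data would give the second group of columns in each table.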

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).