Towards Real Time, High Spatial Resolution Air Pollution Exposure Estimation in Microenvironments Supported by Physics Induced Machine Learning Approaches

John Bartzis; Ioannis Sakellaris; Spyros Andronopoulos; Alexandros Venetsanos; Fernando Martin; Stijn Janssen

doi:10.20944/preprints202604.0398.v1

Submitted: 30 March 2026

Posted: 07 April 2026


Abstract

Reliable and timely estimation of air pollution exposure at high spatial and temporal resolution remains challenging in complex urban environments, where pollutant concentrations vary due to traffic emissions, urban morphology, and meteorological conditions. This study presents a physics-informed machine learning framework for near–real-time estimation of NO₂ concentrations at fine spatial scales. The approach combines a limited set of steady-state Computational Fluid Dynamics (CFD) simulations with operational meteorological and air-quality data. CFD simulations under specific wind directions are first used to characterize site-specific dispersion patterns. These outputs are then scaled using hourly meteorological observations to generate physics-based concentration descriptors. A machine learning predictor, implemented using Random Forest and Extreme Gradient Boosting, is trained to refine these estimates by incorporating additional environmental and observational features. The method is applied to a 1 km × 1 km urban district in Antwerp, Belgium, within the FAIRMODE intercomparison framework. Validation against measurements from 105 passive samplers collected over one month shows substantial improvement compared to standalone dispersion modeling, with coefficients of determination up to R² = 0.965 and reduced bias across locations. These findings demonstrate that integrating physical modeling with machine learning enables accurate and computationally efficient high-resolution exposure assessment in urban settings.

Keywords: physics-informed machine learning; urban air quality; NO₂ exposure; computational fluid dynamics (CFD); Random Forest Regression; XGBoost; microenvironment modeling; near–real-time estimation; urban dispersion; FAIRMODE

Subject: Environmental and Earth Sciences - Pollution

1. Introduction

Urban air pollution remains a major environmental and public-health challenge, particularly in densely populated urban environments where exposure to priority pollutants such as nitrogen dioxide (NO₂), particulate matter (PM), and ozone (O₃) exhibits strong spatial and temporal variability at fine scales [1,2,3]. Within traffic-dominated urban microenvironments, pollutant concentrations are governed by highly localized emission sources, complex street-scale flow and dispersion processes, and rapidly varying meteorological conditions, leading to pronounced concentration gradients over short distances [4,5]. Accurate characterization of population exposure in such microenvironments is therefore essential for effective air-quality management, epidemiological assessment, and the design of targeted mitigation strategies.

Among regulated urban air pollutants, NO₂ is of particular relevance as a primary indicator of traffic-related emissions and near-road exposure [6,7,8]. Owing to its strong association with combustion processes and well-documented health impacts, NO₂ is widely used as a tracer for traffic-induced air pollution in urban environments [9]. Consequently, NO₂ provides a suitable first focus for the development and evaluation of high-resolution exposure estimation methodologies, while remaining representative of broader challenges associated with urban air-quality assessment.

Regulatory air-quality monitoring networks provide continuous online measurements for major priority pollutants, including NO₂, PM, and O₃, with high accuracy and data quality. However, the spatial coverage of such networks is typically limited to a relatively small number of fixed monitoring stations due to economic, infrastructural, and operational constraints [6,8,9,10]. As a result, large portions of the urban domain, particularly at the street and microenvironment scale, remain unmonitored in real time. Estimation of pollutant concentrations at unmeasured locations therefore relies on model-based spatial extrapolation approaches, including simplified dispersion models, land-use regression techniques, and urban-scale chemical transport models [6,11,12]. In complex urban environments, these methods may be associated with substantial uncertainty, especially when applied in near–real-time operational contexts [13].

High-resolution modeling frameworks, such as computational fluid dynamics (CFD) and advanced urban chemical transport models, can provide improved representation of flow, dispersion, and chemical processes, enabling the resolution of fine-scale concentration patterns for multiple pollutants [14,15,16]. Nevertheless, their high computational requirements generally restrict their application to offline analyses or quasi-operational scenarios, limiting their direct applicability for real-time exposure estimation across urban microenvironments.

Recent advances in Artificial Intelligence (AI) and Machine Learning (ML) have opened new avenues for addressing these limitations by enabling the integration of observational data with physical knowledge of atmospheric processes [17,18,19]. Machine learning techniques have been increasingly applied to the estimation and forecasting of urban air pollutant concentrations, including NO₂, PM, and O₃, demonstrating strong capability in capturing nonlinear relationships between emissions, meteorology, and concentrations [20,21,22]. However, purely data-driven approaches may lack physical consistency and robustness when applied to complex dispersion regimes. Physics-induced or physics-informed machine learning approaches, which embed physical constraints or model-derived information within data-driven frameworks, offer a promising pathway toward reliable, high-resolution, near-real-time air pollution exposure estimation in data-sparse urban environments [23,24].

The challenge addressed here is how to build on existing experience with physics-induced machine learning methodologies in order to tackle complex air-quality problems (e.g., urban environments) and provide answers to the user even in real time. The recent work of Bartzis et al. [25] inaugurated such an approach by examining the specific problem of puff emissions quantization in an urban microenvironment. The present study aims to go further and develop a novel physics-induced, AI-driven methodology for near–real-time estimation of air pollution that can treat microenvironments of any complexity by providing exposure estimates at selected locations. The proposed framework seeks applicability to the major priority pollutants (NO₂, PM, O₃); this study focuses primarily on NO₂ as a pollutant strongly linked to traffic emissions and near-road exposure. The methodology integrates (a) historical observational data with outputs of advanced flow and dispersion modeling and (b) contemporaneous online air-quality observations, typically available only at a limited number of monitoring locations. By combining physical process understanding with machine learning techniques, the proposed approach aims to provide reliable exposure estimates at any location of interest within an urban city sector. A case study is selected for demonstration and validation of the methodology. It is drawn from the FAIRMODE Intercomparison Exercise [26,27] and concerns a city sector of Antwerp, Belgium. Since this is a first attempt, the demonstration results are intended at least to establish the credibility of the proposed approach and to set the basis for its future systematic validation and evaluation.

2. Methodology

2.1. The Present Site

The selected site is located in the city of Antwerp (Belgium) and was used in the FAIRMODE joint intercomparison exercise, whose primary aim was to assess the suitability of present modeling methodologies for estimating long-term average air pollutant concentration maps in urban hot spots [26,27]. It is an urban district of roughly 1 km × 1 km in a built-up area typical of North-West European cities, consisting of a mix of street canyons and open areas. An illustration of the site is shown in Figure 1. Details of the site data can be found in Martin et al. (2024) [26]. A data summary is given in Table 1.

As shown in Figure 1, two monitoring stations (42R801 and 42R802) provide hourly averaged NO₂ concentrations. Additional experimental NO₂ data were provided by a network of 105 passive samplers installed within the region, which collected NO₂ from 30/04/2016 to 28/05/2016, covering almost one month. The value reported at each sampler location is the NO₂ concentration averaged over the whole sampling period.

2.2. The Present Approach

The basic idea is to use regular monitoring data to feed a methodology able to provide near–real-time exposure estimates at any location of interest in the surrounding area. In the present methodology, the end product is a Machine Learning Regressor (MLR) that performs this task. The selected MLR is trained and tested using the available hourly data. The challenge is to establish a reliable, near–real-time operational approach. The methodology is validated at the 105 locations where passive sampling was carried out (Table 1).

The proposed methodology therefore has to remain (a) relatively simple, with maximum exploitation of existing expertise, and (b) operational with minimal computing time and power. It is a stepwise approach, as follows:

2.2.1. The Site Reference Simulations

A limited number of CFD simulations are carried out under reference steady flow and dispersion conditions and a prescribed spatial emission profile. A suitable flow and dispersion model is required for this purpose. In the present study, the CFD RANS code ADREA-HF is used, which has been extensively applied to such studies in the past [30,31]. The standard k-ε scheme is used for turbulence closure. A special characteristic of ADREA-HF is that the real urban topography is introduced into the Cartesian domain not only as volume and surface porosities but also as associated building, ground, and emission surfaces with their real area and orientation. Each road has been introduced as a near-ground emitting surface.

The computational domain is 1020 m × 1020 m × 300 m in the West-East, South-North, and vertical directions, respectively, and includes the site shown in Figure 1. The domain is divided into 204 × 204 × 60 rectangular cells, with a uniform horizontal resolution of 5 m × 5 m and a non-uniform vertical grid ranging from 2 m at the ground to 10 m at the top.

The output of CFD simulations to be utilized under the present methodology consists of steady-state NO₂ concentrations at specific locations that include (a) the two monitoring positions 42R801 and 42R802 and (b) the 105 passive samplers’ positions mentioned above.

The limited number of CFD flow and dispersion simulations have been performed under the following conditions:

(a) All simulations are performed with a single realistic reference wind speed (VREF = 10 m/s) defined at the domain top (i.e., 300 m).

(b) Each simulation corresponds to one wind direction. Equidistant wind directions covering all 360 degrees have been selected, close enough to each other to ensure reliable interpolation when needed. In the FAIRMODE intercomparison exercise, an increment of 22.5 to 45 degrees (i.e., 8 to 16 sectors) seems to give acceptable results [27]. In this study, to be conservative, 32 directions (NSIM = 32) were selected, corresponding to a wind direction increment $\Delta\theta_{SIM} = 11.25$ degrees. Thus, the selected wind directions $\theta_{SIM}$ are:

$$\theta_{SIM,ns} = (ns - 1) \cdot \Delta\theta_{SIM}, \quad ns = 1, 2, \dots, NSIM \quad (NSIM = 32,\ \Delta\theta_{SIM} = 11.25 \text{ degrees})$$

(c) The inflow boundary condition is derived by calculating the inflow wind speed profile using ADREA-HF code, through the simulation of the corresponding 1D neutral flow with speed at the above VREF.

(d) At each road surface, the relevant annual traffic emission rate is introduced in terms of the associated emission flux (μg/m3/sec).

The 32 CFD simulations have been performed, and the computed 'reference' NO₂ concentrations $CCS_{ns}$ (ns = 1, …, NSIM) at each of the 107 locations mentioned above (i.e., the 105 passive sampler sites and the two monitoring stations) have been stored for online use when needed.
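As an illustration of how the stored reference results could be organized for later lookup, the following minimal sketch builds the 32 simulated wind directions and a table with one row per direction and one column per receptor location. The array name `ccs_table` and its zero placeholder values are assumptions made for illustration only; the actual values would come from the ADREA-HF runs.

```python
import numpy as np

NSIM = 32                    # number of reference CFD simulations
N_LOCATIONS = 107            # 105 passive samplers + 2 monitoring stations
DTHETA_SIM = 360.0 / NSIM    # wind direction increment (degrees)

# Simulated wind directions: theta_ns = (ns - 1) * DTHETA_SIM, ns = 1..NSIM
theta_sim = np.arange(NSIM) * DTHETA_SIM

# Hypothetical storage layout for the reference concentrations CCS:
# row = wind direction index, column = receptor location.
ccs_table = np.zeros((NSIM, N_LOCATIONS))
```

With this layout, the reference concentrations for one location across all directions are simply one column of the table.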

2.2.2. The Online Flow and Dispersion Model Application

The next step is to estimate the hourly NO₂ concentrations for the whole of 2016 while avoiding direct CFD simulation of every hour. Such simulations would be impractical because they demand (a) high computational power and time and (b) additional inflow boundary conditions that could be difficult to obtain with the required accuracy. An attractive alternative is to rely on the reference CFD simulations of Step 1 (Section 2.2.1) and to use the expertise gained so far to provide results via suitable scaling and extrapolation approximations. Significant effort has been devoted to this direction in recent years, addressing concentration and exposure behavior especially at high time resolution, including short-time releases in complex terrain [25,32,33]. The proposed approach has been applied to the present Antwerp data with relative success, as follows:

Following Bartzis et al. [34], the hourly observed wind data are considered as pairs of wind speed and direction [(Vn, θn), n = 1, …, NWIND]. There are 8784 such pairs, corresponding to the number of hours in the leap year 2016. For each sensor and each hour, the observed wind direction θn is used to estimate the 'reference' concentration $CSS_n$ for that direction by linear interpolation of the computed 'reference' concentrations $CCS_{ns}$. The CFD-based model concentrations $CSM_n$ are then estimated by applying the 1/V scaling rule [34]:

$$CSM_n = ETF_n \cdot CSS_n \cdot \frac{VREF}{V_n}, \quad n = 1, \dots, NWIND \qquad (1)$$

where $ETF_n$ is the emission time factor given by the Antwerp data, reflecting the hourly variation with respect to the annual emission.

It is noted that the above scaling has been successfully applied to the present Antwerp data [35].
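The interpolation and scaling steps can be sketched as follows. This is a minimal illustration, assuming that the reference concentrations for one location are held in an array indexed by simulated direction and that the interpolation wraps around from 348.75 degrees back to 0 degrees; the function names are hypothetical and not taken from the authors' code.

```python
import numpy as np

VREF = 10.0              # reference wind speed of the CFD runs (m/s)
NSIM = 32                # number of simulated wind directions
DTHETA = 360.0 / NSIM    # direction increment (degrees)

def interpolate_reference(ccs, theta_deg):
    """Linearly interpolate reference concentrations CCS to direction theta_deg.

    ccs: array of shape (NSIM,) or (NSIM, n_locations), ordered by direction.
    """
    theta = np.mod(theta_deg, 360.0)
    pos = theta / DTHETA
    lower = int(np.floor(pos)) % NSIM
    upper = (lower + 1) % NSIM        # wraps the last sector back to 0 degrees
    weight = pos - np.floor(pos)      # fractional position inside the sector
    return (1.0 - weight) * ccs[lower] + weight * ccs[upper]

def scaled_concentration(ccs, theta_deg, wind_speed, etf):
    """Equation (1): CSM = ETF * CSS * VREF / V."""
    css = interpolate_reference(ccs, theta_deg)
    return etf * css * VREF / wind_speed
```

For example, with a reference value of 2.0 at the observed direction, an observed wind speed of 5 m/s, and an emission time factor of 1.0, the scaled concentration is 2.0 × 10 / 5 = 4.0.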

To obtain the final modeled concentrations, the background concentrations (CB) given by the Antwerp data (Table 1) are added:

$$CM_n = CB_n + CSM_n, \quad n = 1, \dots, NWIND \qquad (2)$$

where $CB_n$ is the given hourly background concentration, reflecting the contribution of NO₂ transported from outside the domain.

The $CM_n$ values were also the ones provided to the FAIRMODE Intercomparison Exercise [36]. In the remainder, for convenience, the CM values are denoted as MODEL values.
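For clarity, substituting Equation (1) into Equation (2) gives the complete hourly MODEL expression in a single line. The numerical values in the example that follows are purely illustrative assumptions and are not taken from the Antwerp dataset.

```latex
CM_n = CB_n + ETF_n \cdot CSS_n \cdot \frac{VREF}{V_n}

% Illustrative values: CB_n = 20, ETF_n = 1.2, CSS_n = 5, VREF = 10 m/s, V_n = 4 m/s
CM_n = 20 + 1.2 \cdot 5 \cdot \frac{10}{4} = 20 + 15 = 35
```

The example shows how a wind speed below VREF amplifies the local contribution, while the background term is added unchanged.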

However, it is important at this stage, to make the following comments:

(a) Past experience has shown that the methodology based on Equation (2) gives good results especially at locations that are reasonably representative of the surrounding area. A deterioration is expected where the local flow is rather unstable and the inherent sub-hourly concentration fluctuations are relatively high [33,34].

(b) The results depend strongly on the background concentrations, which are not always easy or straightforward to estimate.

(c) All calculations assume neutral atmospheric stability. Preliminary data analysis has shown that, in this particular case study, the ambient atmospheric stability was indeed neutral for most of the hours [35]. In addition, the local turbulence produced by the complex urban canopy promotes near-neutral conditions. Nevertheless, stability is a factor that cannot always be ignored a priori.

(d) NO₂ is a rather chemically active substance, and this could affect the NO₂ budget, especially in winter, although no significant effects are generally expected over distances of the order of 1 km or less.

(e) For completeness, one should not forget the additional errors arising from the CFD modeling assumptions, including the inflow boundary conditions.

The additional use of AI-aided methodologies is expected to remove, at least partly and either directly or indirectly, some of the above drawbacks.

2.2.3. The Machine Learning Regressors (MLR)

The present focus is on supervised machine learning methodologies able to handle tabulated data, as already indicated above. An important requirement is that the selected methods can successfully meet the imposed time constraints. Past experience in air pollution applications is an additional selection criterion. Among well-established regression methodologies, Random Forest Regression (RFR) and Extreme Gradient Boosting Regression (XGBR) were chosen as starting methods because they have been widely used in both meteorology and air quality problems. RFR has been frequently applied to pollutant concentration prediction and environmental regression tasks due to its robustness and strong generalization capabilities [37,38]. Similarly, XGBR has demonstrated high predictive accuracy for air quality estimation and related meteorological modeling [39,40,41]. An important characteristic of the selected models is their ability to operate efficiently under computational time constraints. Ensemble tree-based methods such as Random Forest (RF) and Extreme Gradient Boosting (XGBoost) are known for their parallelizable structure and scalability. RF builds trees independently, allowing efficient parallel computation [41]. XGBoost further improves computational efficiency through optimized gradient boosting, parallel tree construction, cache-aware memory access, and sparsity-aware learning [39]. These properties make both methods particularly suitable for applications requiring rapid model training and prediction, such as environmental monitoring and operational air quality forecasting systems.

2.2.3.1. The MLR Features

A key factor in developing the MLR algorithm is the selection of suitable feature variables.

The site under consideration is routinely monitored on an hourly basis, including parameters that theoretically affect NO₂ dispersion, such as ambient wind speed and direction, temperature, humidity, and solar irradiation. It has been observed [42] that NO₂ concentration time series show diurnal behavior, with an integral time scale of the order of 24 h. The hour of the day is therefore an additional feature to consider. Furthermore, at the present site, hourly NO₂ concentrations are measured at two locations, as indicated in Table 1: 42R801 (background) and 42R802 (traffic). It is logical to include the local background (42R801) concentrations as an input feature.

Consequently, the local traffic (42R802) concentrations are reserved for the target variable, as explained below.

A critical parameter shaping flow and dispersion is the ground morphology. At such a high spatial resolution, it cannot be described in purely geometrical terms. Following the conceptual framework introduced by Bartzis et al. [25] and adapting it to the present problem, the modeled concentrations CSM given by Equation (1) are used to carry this information. To avoid involving the often problematic external transport concentrations, the following input variable (XMODEL) is proposed:

$$XMODEL = CSM - CSM_{[42R801]} + COBS_{[42R801]} \qquad (3)$$

Conceptually, the XMODEL array expresses, in a simple way, two important determining effects: (1) the projection of the detailed ground morphology onto flow and dispersion through CSM and (2) the influence of remote pollutant transport into the local domain through the concentrations observed at a single station (42R801), removing the need for the often problematic 'background' concentration (CB).
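Equation (3) can be sketched in a few lines. This is a minimal illustration that assumes the hourly series are available as equally long arrays; the function and argument names are hypothetical.

```python
import numpy as np

def xmodel(csm_location, csm_42r801, cobs_42r801):
    """Equation (3): XMODEL = CSM - CSM[42R801] + COBS[42R801].

    csm_location: hourly CSM (Equation 1) at the location of interest
    csm_42r801:   hourly CSM (Equation 1) at station 42R801
    cobs_42r801:  hourly observed NO2 concentration at station 42R801
    """
    csm_location = np.asarray(csm_location, dtype=float)
    csm_42r801 = np.asarray(csm_42r801, dtype=float)
    cobs_42r801 = np.asarray(cobs_42r801, dtype=float)
    return csm_location - csm_42r801 + cobs_42r801
```

One consequence of this definition is that, at the 42R801 location itself, the two CSM terms cancel and XMODEL reduces to the observed concentration.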

The choice of the target variable (TARGET) used to train and test the selected MLR is rather straightforward: the observed hourly concentrations at the second station, 42R802, i.e.,

$$TARGET\ (\text{training/testing}) = COBS_{[42R802]} \qquad (4)$$

In summary,

(a) The MLR input variables consist of the hourly arrays of (i) the wind speed, wind direction, temperature, relative humidity, and solar irradiation observed at the meteorological station, (ii) the hour of the day (1-24 h), and (iii) the XMODEL parameter given by Equation (3).
(b) The MLR output variable, designated as TARGET, is the hourly array of the observed NO₂ concentrations at the second station, 42R802, as indicated by Equation (4).

2.2.3.2. MLR Training and Testing

Both RFR and XGBR have been built using Python 3.9 and scikit-learn. The feature matrix used for training and testing is identical for both methods and consists of the variables described in Section 2.2.3.1. Regarding the number of rows, all hours without missing data were selected. Thus, 8342 of the 8784 hourly records of the year 2016 were used. Of these, 80% were used for training and 20% for testing.
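The training and testing procedure can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' script. The feature names, the random (rather than chronological) 80/20 split, the hyperparameters, and the synthetic target are all assumptions, since the source does not specify them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical column names for the seven input features of Section 2.2.3.1
FEATURES = ["wind_speed", "wind_direction", "temperature",
            "relative_humidity", "solar_irradiation", "hour", "xmodel"]

rng = np.random.default_rng(0)
n_hours = 1000  # synthetic stand-in for the 8342 complete hours

# Synthetic feature matrix; real inputs would come from the site data.
X = np.column_stack([
    rng.uniform(0.5, 12.0, n_hours),    # wind speed (m/s)
    rng.uniform(0.0, 360.0, n_hours),   # wind direction (degrees)
    rng.uniform(-5.0, 30.0, n_hours),   # temperature
    rng.uniform(30.0, 100.0, n_hours),  # relative humidity (%)
    rng.uniform(0.0, 900.0, n_hours),   # solar irradiation
    rng.integers(1, 25, n_hours),       # hour of the day, 1..24
    rng.uniform(5.0, 120.0, n_hours),   # XMODEL, Equation (3)
])
# Synthetic target loosely tied to XMODEL plus noise (assumption)
y = 0.8 * X[:, 6] + 5.0 + rng.normal(0.0, 3.0, n_hours)

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
importances = dict(zip(FEATURES, model.feature_importances_))
```

A gradient-boosting regressor exposing the same `fit`/`predict` interface (for example `XGBRegressor` from the xgboost package) could be substituted for the Random Forest without changing the rest of the sketch.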

Both methods give good and comparable coefficients of determination: for the testing data, R² = 0.936 for RFR and R² = 0.965 for XGBR, whereas for the training data, R² = 0.960 for RFR and R² = 0.989 for XGBR.
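For reference, the coefficient of determination is assumed here to follow the standard definition (the one implemented, for instance, by scikit-learn's `r2_score`); the source does not state the formula explicitly. With $y_i$ the observed hourly concentrations, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the observations over the $N$ hours considered:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
```

Under this definition, a value of 1 corresponds to perfect prediction and a value of 0 to a predictor no better than the observed mean.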

The importance of the various input variables is illustrated in Figure 2. The values are comparable for the two MLRs. The dominant input variable is XMODEL for both RFR and XGBR. This seems quite reasonable, considering that this parameter combines morphology and the key underlying physics. The lower importance of the wind parameters can be explained by the fact that the wind effect has, to a large degree, already been included in XMODEL. The role of the thermally related parameters (temperature, humidity, solar radiation) appears limited, supporting the assumption that the neutral atmospheric stability approximation is reasonably valid in this specific application. RFR indicates a slightly higher importance for XMODEL and, consequently, a slightly lower importance for the remaining variables.

The behavior of the MLRs with respect to the testing data is illustrated in Figure 3a,b, which shows the predicted versus the observed NO₂ concentrations. The results are again comparable, as expected from the above. They also appear quite reasonable for both methods, supporting the validity of the present concept. The XGBR predictions are slightly better.

It is rather clear that the present methodology in effect acts as a corrector of the pollutant concentration spatial distribution predicted by CFD modeling at a given hour, by taking into account additional defining data such as (a) the meteorological station observations (i.e., temperature, relative humidity, and atmospheric stability) and (b) the hourly pollutant concentration data from only one local station.

3. The Validation Exercise

The validation exercise concerns the 105 locations where the mean NO₂ concentrations during the period 30/04/2016 to 28/05/2016 were measured by passive sampling. The locations are shown in Figure 1. They cover a good range both horizontally and vertically. More specifically, the horizontal range relative to the 42R801 reference station is -491 m to 491 m along the X-axis and -506 m to 457 m along the Y-axis. Along the Z-axis, the samplers are distributed at various heights as follows: (a) four samplers near the ground (height 2.5 m to 3.5 m), (b) 85 samplers at the 1st floor (height 3.23 m to 7.6 m), (c) 11 samplers at the 2nd floor (height 4.34 m to 10.85 m), (d) 5 samplers at the 3rd floor (height 10.6 m), and (e) one sampler at the 6th floor (height 19 m).

Each MLR is used to provide hourly NO₂ concentrations at the 105 locations during the above sampling period (i.e., 696 h), and the averages over the campaign period are then computed. These averages are compared with the observed ones.
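The per-location averaging can be sketched as follows. This is a minimal illustration that assumes the meteorological features are shared by all locations and that XMODEL is the last feature column; `predict` stands for any trained regressor's prediction function, and all names are hypothetical.

```python
import numpy as np

def campaign_mean(predict, met_features, xmodel_hourly):
    """Average hourly predictions over the campaign for each location.

    predict:       callable mapping (n_hours, n_features) -> (n_hours,)
    met_features:  (n_hours, n_met) features shared by all locations
    xmodel_hourly: (n_hours, n_locations) XMODEL values, one column per sampler
    """
    n_hours, n_locations = xmodel_hourly.shape
    means = np.empty(n_locations)
    for k in range(n_locations):
        # Assumed column order: shared features first, XMODEL last
        X = np.column_stack([met_features, xmodel_hourly[:, k]])
        means[k] = predict(X).mean()
    return means
```

With a fitted scikit-learn model, `predict` would simply be `model.predict`; the returned array then holds one campaign-mean value per sampler, ready for comparison with the passive-sampler averages.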

Inspection of the data for this period shows that several hourly NO₂ concentrations from the selected reference station 42R801 were missing: 31 of the 696 values, scattered over the period. An imputation step was therefore required. To keep things simple and practical, the imputation was performed using the RFR approach described above, with the following changes: the hourly NO₂ concentrations at station 42R801 replaced the target variable, and the given hourly NO₂ background concentrations replaced the XMODEL input variable. A new RFR predictor was built using those data, following the same 80/20 training/testing strategy. The resulting R² statistics were good: R² = 0.912 for the testing data and R² = 0.945 for the training data. The estimated concentrations were inserted into the 42R801 NO₂ series to build a complete array for the present validation exercise. Only one method (RFR) was applied for imputation, since the intention was to use an identical dataset for both methods in the validation phase.
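The imputation step can be sketched as follows. This is a minimal illustration that assumes missing hours are marked as NaN in the target series; it omits the 80/20 evaluation split described above, and the function name and hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def impute_missing(features, target, random_state=0):
    """Fill NaN entries of `target` with Random Forest predictions.

    The regressor is fitted only on the hours where the target was observed
    and then applied to the hours where it is missing.
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    missing = np.isnan(target)
    if not missing.any():
        return target.copy()
    model = RandomForestRegressor(n_estimators=100, random_state=random_state)
    model.fit(features[~missing], target[~missing])
    filled = target.copy()
    filled[missing] = model.predict(features[missing])
    return filled
```

Observed values are left untouched; only the gaps are replaced, which keeps the completed 42R801 series identical for any regressor used afterwards.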

The core validation exercise concerns the performance of the two selected MLRs (RFR and XGBR) in predicting the average NO₂ concentrations at the 105 sampler sites. The results for each sampler are illustrated in Figure 4, where the samplers are presented in order of concentration. Both MLRs clearly yield significant improvements. This improvement is also illustrated in global terms in Figure 5, which compares the average concentrations over all 105 samplers. The extent to which the CFD model underprediction is affected by the specific CB input array needs further investigation. Both regressors gave almost identical results, and the agreement with the observed values is good. In the high-exposure regime, the predicted peaks follow the observed trend. Where discrepancies occur, the predictions follow the trend of the flow and dispersion model, as expected, since the MLRs are trained on model-derived input. In other words, the problem lies mainly in the model rather than in the predictors. At elevated concentrations, the highly local behavior is expected to be sensitive to the spatial resolution of the model as well as to the detailed local spatial profile of the emissions.

It is interesting to see to what degree the height of the sampling location affects the performance of the predictions. Table 2 gives an indication, showing the average concentrations per floor. The MLRs clearly improve the results considerably. The most significant discrepancy appears at the 6th-floor measurement. This can be explained by the fact that concentration levels at greater heights are expected to be influenced more by non-local pollutant transport.

4. Concluding Remarks

This study introduced a physics-induced, AI-driven methodology for near–real-time estimation of NO₂ concentrations in complex urban microenvironments. The proposed approach combines a limited but physically consistent set of CFD reference simulations with operational meteorological and monitoring data, and with MLRs as corrective and adaptive components. The key innovation lies in embedding morphology- and physics-based descriptors (XMODEL) into the feature space of the machine learning predictor, thereby preserving the fundamental dispersion behavior while allowing data-driven correction of residual errors.

Application to the Antwerp FAIRMODE case study demonstrated that both Random Forest Regression and Extreme Gradient Boosting significantly improve concentration estimates compared to pure dispersion modeling. The improvement is consistent across most sampling heights and spatial locations, with the largest discrepancies appearing at higher elevations where non-local transport effects become more dominant. The results confirm that the proposed framework effectively acts as a physically guided model corrector, capable of capturing additional influences such as atmospheric stability and secondary effects without requiring computationally intensive real-time CFD simulations.

The methodology satisfies the two principal operational requirements: (a) computational efficiency suitable for near–real-time applications and (b) maximum exploitation of existing physical modeling expertise. Although the present study focused on NO₂ as a traffic-related tracer pollutant, its conceptual approach can be extended to other priority pollutants such as PM and O₃, provided that appropriate background treatment and chemical considerations are incorporated.

Future work should focus on systematic multi-site validation, incorporation of stability-dependent corrections, refinement of background concentration treatment, and assessment under non-neutral atmospheric conditions. Integration within operational air-quality management systems and smart-city platforms represents a natural next step. Overall, the study provides a solid methodological foundation toward reliable, high-resolution, real-time urban exposure estimation through physics-informed machine learning.

Author Contributions

Conceptualization, J. G. B.; Methodology, J. G. B., I.A., S.A., A.V., ; Software, J. G., I.A., S.A., A. V.; Validation, J. G. B., S.A., F.M.; Formal analysis, J. G. B., I.A., S.A; Investigation, J. G. B.; Data curation, J.G.B., F.M., J, S.; Writing—original draft preparation, J.G.B, I.S. Writing—review and editing, All; Visualization, I.S.; Supervision, J.G.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Dataset available on request from the authors.

Acknowledgments

The authors are thankful for participating in the FAIRMODE Microscale Modelling Intercomparison Exercise.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

NO2	Nitrogen Dioxide
PM	Particulate Matter
O3	Ozone
CFD	Computational Fluid Dynamics
AI	Artificial Intelligence
ML	Machine Learning
FAIRMODE	Forum for Air Quality Modeling
MLR	Machine Learning Regressors
RFR	Random Forest Regression
XGBR	Extreme Gradient Boosting Regression
RF	Random Forest
XGBoost	Extreme Gradient Boosting

References

Beelen, R.; Raaschou-Nielsen, O.; Stafoggia, M.; Andersen, Z.J.; Weinmayr, G.; Hoffmann, B.; Wolf, K.; Samoli, E.; Fischer, P.; Nieuwenhuijsen, M.; et al. Effects of Long-Term Exposure to Air Pollution on Natural-Cause Mortality: An Analysis of 22 European Cohorts within the Multicentre ESCAPE Project. Lancet 2014, 383, 785–795.
Targa, J.; Colina, M.; Banyuls, L.; González Ortiz, A.; Soares, J. Status Report of Air Quality in Europe for Year 2023, Using Validated Data. ETC HE Report 2025/2, Zenodo, 2025.
World Health Organization (WHO). WHO Global Air Quality Guidelines: Particulate Matter (PM2.5 and PM10), Ozone, Nitrogen Dioxide, Sulfur Dioxide and Carbon Monoxide; WHO Guidelines Approved by the Guidelines Review Committee; World Health Organization: Geneva, 2021; ISBN 978-92-4-003422-8.
Soulhac, L.; Salizzoni, P.; Cierco, F.-X.; Perkins, R. The Model SIRANE for Atmospheric Urban Pollutant Dispersion; Part I, Presentation of the Model. Atmos. Environ. 2011, 45, 7379–7395.
Vardoulakis, S.; Fisher, B.E.A.; Pericleous, K.; Gonzalez-Flesca, N. Modelling Air Quality in Street Canyons: A Review. Atmos. Environ. 2003, 37, 155–182.
Berkowicz, R. OSPM - A Parameterised Street Pollution Model. Environ. Monit. Assess. 2000, 65, 323–331.
European Parliament. Directive 2008/50/EC, Air Quality. European Environment Agency. Available online: https://www.eea.europa.eu/policy-documents/directive-2008-50-ec-of (accessed on 14 November 2022).
Snyder, E.G.; Watkins, T.H.; Solomon, P.A.; Thoma, E.D.; Williams, R.W.; Hagler, G.S.W.; Shelow, D.; Hindin, D.A.; Kilaru, V.J.; Preuss, P.W. The Changing Paradigm of Air Pollution Monitoring. Environ. Sci. Technol. 2013, 47, 11369–11377.
Hoek, G.; Beelen, R.; de Hoogh, K.; Vienneau, D.; Gulliver, J.; Fischer, P.; Briggs, D. A Review of Land-Use Regression Models to Assess Spatial Variation of Outdoor Air Pollution. Atmos. Environ. 2008, 42, 7561–7578.
Zhang, Y.; Bocquet, M.; Mallet, V.; Seigneur, C.; Baklanov, A. Real-Time Air Quality Forecasting, Part I: History, Techniques, and Current Status. Atmos. Environ. 2012, 60, 632–655.
Lateb, M.; Meroney, R.N.; Yataghene, M.; Fellouah, H.; Saleh, F.; Boufadel, M.C. On the Use of Numerical Modelling for Near-Field Pollutant Dispersion in Urban Environments - A Review. Environ. Pollut. 2016, 208, 271–283.
Tominaga, Y.; Stathopoulos, T. CFD Simulation of Near-Field Pollutant Dispersion in the Urban Environment: A Review of Current Modeling Techniques. Atmos. Environ. 2013, 79, 716–730.
Baklanov, A.; Schlünzen, K.; Suppan, P.; Baldasano, J.; Brunner, D.; Aksoyoglu, S.; Carmichael, G.; Douros, J.; Flemming, J.; Forkel, R.; et al. Online Coupled Regional Meteorology Chemistry Models in Europe: Current Status and Prospects. Atmospheric Chem. Phys. 2014, 14, 317–398.
Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear Partial Differential Equations. J. Comput. Phys. 2019, 378, 686–707.
Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat. Deep Learning and Process Understanding for Data-Driven Earth System Science. Nature 2019, 566, 195–204.
Wang, R.; Yu, R. Physics-Guided Deep Learning for Dynamical Systems: A Survey. 2023.
de Villeroché, A.; Le Guen, V.; Mouradi, R.-S.; Massin, P.; Bocquet, M.; Farchi, A.; Cheng, S.; Armand, P. Physics-Informed Neural Networks for Atmospheric Flow Modeling of Pollutant Dispersion in Industrial Sites. Air Qual. Atmosphere Health 2026, 19, 38.
Grange, S.K.; Carslaw, D.C.; Lewis, A.C.; Boleti, E.; Hueglin, C. Random Forest Meteorological Normalisation Models for Swiss PM₁₀ Trend Analysis. Atmospheric Chem. Phys. 2018, 18, 6223–6239.
Willard, J.; Jia, X.; Xu, S.; Steinbach, M.; Kumar, V. Integrating Scientific Knowledge with Machine Learning for Engineering and Environmental Systems. 2022.
Di, Q.; Amini, H.; Shi, L.; Kloog, I.; Silvern, R.; Kelly, J.; Sabath, M.B.; Choirat, C.; Koutrakis, P.; Lyapustin, A.; et al. An Ensemble-Based Model of PM2.5 Concentration across the Contiguous United States with High Spatiotemporal Resolution. Environ. Int. 2019, 130, 104909.
Li, P.; Zhang, T.; Jin, Y. A Spatio-Temporal Graph Convolutional Network for Air Quality Prediction. Sustainability 2023, 15.
Liao, Q.; Zhu, M.; Wu, L.; Pan, X.; Tang, X.; Wang, Z. Deep Learning for Air Quality Forecasts: A Review. Curr. Pollut. Rep. 2020, 6, 399–409.
Anitescu, C.; İsmail Ateş, B.; Rabczuk, T. Physics-Informed Neural Networks: Theory and Applications. In Machine Learning in Modeling and Simulation: Methods and Applications; Rabczuk, T., Bathe, K.-J., Eds.; Springer International Publishing: Cham, 2023; pp. 179–218. ISBN 978-3-031-36644-4.
Karniadakis, G.E.; Kevrekidis, I.G.; Lu, L.; Perdikaris, P.; Wang, S.; Yang, L. Physics-Informed Machine Learning. Nat. Rev. Phys. 2021, 3, 422–440.
Bartzis, J.; Andronopoulos, S.; Sakellaris, I. Source Term Estimation for Puff Releases Using Machine Learning: A Case Study. Atmosphere 2025, 16.
Martín, F.; Janssen, S.; Rodrigues, V.; Sousa, J.; Santiago, J.L.; Rivas, E.; Stocker, J.; Jackson, R.; Russo, F.; Villani, M.G.; et al. Using Dispersion Models at Microscale to Assess Long-Term Air Pollution in Urban Hot Spots: A FAIRMODE Joint Intercomparison Exercise for a Case Study in Antwerp. Sci. Total Environ. 2024, 925, 171761.
Martín, F.; Rodrigues, V.; Santiago, J.L.; Sousa, J.; Stocker, J.; Janssen, S.; Jackson, R.; Russo, F.; Villani, M.G.; Tinarelli, G.; et al. Estimating the Air Quality Standard Exceedance Areas and the Spatial Representativeness of Urban Air Quality Stations Applying Microscale Modelling. Sci. Total Environ. 2025, 988, 179824.
Janssen, S.; Dumont, G.; Fierens, F.; Mensink, C. Spatial Interpolation of Air Pollution Measurements Using CORINE Land Cover Data. Atmos. Environ. 2008, 42, 4884–4903.
Degraeuwe, B.; Hooyberghs, H.; Janssen, S.; Lefebvre, W.; Maiheu, B.; Megaritis, A.; Vanhulsel, M. A Source Apportionment and Air Quality Planning Methodology for NO2 Pollution from Traffic and Other Sources. Environ. Model. Softw. 2024, 176, 106032.
Andronopoulos, S.; Bartzis, J.G.; Würtz, J.; Asimakopoulos, D. Modelling the Effects of Obstacles on the Dispersion of Denser-than-Air Gases. J. Hazard. Mater. 1994, 37, 327–352.
Venetsanos, A.G.; Papanikolaou, E.; Bartzis, J.G. The ADREA-HF CFD Code for Consequence Assessment of Hydrogen Applications. Int. J. Hydrog. Energy 2010, 35, 3908–3918.
Bartzis, J.G.; Efthimiou, G.C.; Andronopoulos, S. Modelling Exposure from Airborne Hazardous Short-Duration Releases in Urban Environments. Atmosphere 2021, 12, 130.
Bartzis, J.G.; Sakellaris, I.A.; Efthimiou, G. On Exposure Uncertainty Quantification from Accidental Airborne Point Releases. J. Hazard. Mater. Adv. 2022, 6, 100080.
Bartzis, J.G.; Sakellaris, I.A.; Andronopoulos, S.; Venetsanos, A.; Triantafyllou, A. Towards New Simplified Methodologies on Source Term Estimation and Associated Uncertainties from Accidental Airborne Releases. Build. Environ. 2024, 251, 111222.
Bartzis, J.; Sakellaris, I.; Tolias, I.; Venetsanos, A. Simplified Microscale Modeling Methodologies for Urban Air Quality. In Proceedings of Abstracts, 13th International Conference on Air Quality: Science and Application; University of Hertfordshire, 2022.
Martín, F.; Janssen, S.; Rodrigues, V.; Sousa, J.; Santiago, J.L.; Rivas, E.; Stocker, J.; Jackson, R.; Russo, F.; Villani, M.G.; et al. Using Dispersion Models at Microscale to Assess Long-Term Air Pollution in Urban Hot Spots: A FAIRMODE Joint Intercomparison Exercise for a Case Study in Antwerp. Sci. Total Environ. 2024, 925, 171761.
Belgiu, M.; Drăguţ, L. Random Forest in Remote Sensing: A Review of Applications and Future Directions. ISPRS J. Photogramm. Remote Sens. 2016, 114, 24–31.
Zhan, Y.; Luo, Y.; Deng, X.; Chen, H.; Grieneisen, M.L.; Shen, X.; Zhu, L.; Zhang, M. Spatiotemporal Prediction of Continuous Daily PM2.5 Concentrations across China Using a Spatially Explicit Machine Learning Algorithm. Atmos. Environ. 2017, 155, 129–139.
Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794.
Zhan, Y.; Luo, Y.; Deng, X.; Grieneisen, M.L.; Zhang, M.; Di, B. Spatiotemporal Prediction of Daily Ambient Ozone Levels across China Using Random Forest for Human Exposure Assessment. Environ. Pollut. 2018, 233, 464–473.
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
Bartzis, J.G.; Kalimeri, K.K.; Sakellaris, I.A. Environmental Data Treatment to Support Exposure Studies: The Statistical Behavior for NO2, O3, PM10 and PM2.5 Air Concentrations in Europe. Environ. Res. 2020, 181, 108864.


Figure 1. Site overview. The red dots indicate the NO₂ monitoring locations and the green dots the passive sampler locations.

Figure 2. Importance of the input features of the present MLRs.

Figure 3. Predictions of the present MLRs, i.e., RFR (a) and XGBR (b), compared with the observed NO₂ concentrations at the 42R802 station for the hours assigned to testing.

Figure 4. Predictions of the present MLRs (i.e., RFR and XGBR) compared with the observed average NO₂ concentrations measured by each passive sampler. The samplers are numbered in ascending order of measured average concentration.

Figure 5. Predictions of the present MLRs (i.e., RFR and XGBR) compared with the observed NO₂ concentrations averaged over all samplers.

Table 1. Main site data.

Data	Source
Measured hourly NO₂ concentrations	Two stations: urban background (42R801) and traffic (42R802)
Modeled hourly background NO₂ concentrations (CB), representing incoming transport	The RIO model [28]
Measured concentrations from distributed passive sampling	105 passive samplers during the period 2016-04-30 to 2016-05-28. Location range (relative to the 42R801 location): X from -491 m to 491 m; Y from -506 m to 457 m; Z from 2.5 m to 19 m
Measured hourly meteorological data: wind speed and direction (30 m above ground); near-surface (3 m) temperature, relative humidity, and total radiation	One meteorological station: VMM measurement station M802 at Antwerp-Luchtbal (location: 51.261° N, 4.425° E)
Modeled annual traffic emissions for a selection of major and secondary roads	The official Flemish FASTRACE traffic emission model (version 2.1), based on COPERT 5 emission factors [29]
Modeled hourly emission factors	Provided by VITO, derived from monthly and daily profiles

Table 2. Average sampler NO₂ concentrations at various vertical positions (µg/m³).

Floor	No. of samplers	OBS	MODEL	RFR	XGBR
Near Ground	4	41.61	29.16	37.35	37.09
1st	84	40.18	30.53	38.64	38.27
2nd	11	38.38	30.46	38.57	38.28
3rd	5	40.58	32.26	40.33	39.91
6th	1	44.78	27.37	35.70	35.42
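
As a rough cross-check of Table 2, the per-floor averages can be combined into a single sampler-weighted mean for each column, from which a mean bias relative to the observations follows. The sketch below is not part of the study's workflow: the weighting by number of samplers and the variable names are assumptions made for illustration, and because the inputs are the rounded table values, the results only approximate an average computed over the individual samplers.

```python
# Per-floor values copied from Table 2 (µg/m³):
# floor -> (number of samplers, OBS, MODEL, RFR, XGBR)
floors = {
    "Near Ground": (4, 41.61, 29.16, 37.35, 37.09),
    "1st": (84, 40.18, 30.53, 38.64, 38.27),
    "2nd": (11, 38.38, 30.46, 38.57, 38.28),
    "3rd": (5, 40.58, 32.26, 40.33, 39.91),
    "6th": (1, 44.78, 27.37, 35.70, 35.42),
}
columns = ("OBS", "MODEL", "RFR", "XGBR")

# Total number of samplers across all floors.
total = sum(row[0] for row in floors.values())

# Assumption: weight each floor average by its number of samplers.
weighted = {
    name: sum(row[0] * row[i + 1] for row in floors.values()) / total
    for i, name in enumerate(columns)
}

# Mean bias of each estimate relative to the observed weighted mean.
bias = {name: weighted[name] - weighted["OBS"] for name in columns[1:]}

for name in columns:
    print(f"{name}: {weighted[name]:.2f} µg/m³")
for name, value in bias.items():
    print(f"bias {name}: {value:+.2f} µg/m³")
```

Since 84 of the samplers sit on the first floor, the weighted means are dominated by that row; the single sixth-floor sampler has little influence on the overall figures.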

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.