Preprint
Article

This version is not peer-reviewed.

Shaping Adaptive and Sustainable Smart Urban Ecosystems

Submitted: 19 April 2026
Posted: 21 April 2026


Abstract
The transition from smart to intelligent cities allows the deployment and management of information and communication technologies in the urban context to be driven by holistic sustainability requirements rather than by purely technical considerations, such as feasibility, or by fragmented, siloed operational patterns. This work proposes a multi-dimensional decision-making framework to manage a smart-intelligent city as an urban Cyber-Physical System across the environmental, economic, and social sustainability pillars, their metrics, and their trade-offs. A methodology based on Deep Reinforcement Learning and reward-shaping mechanisms is proposed to represent and assess the dependencies among the sustainability pillars and their interplay. A case study on the planning, deployment, and management of a Low-Power Wide-Area Network in a Sicilian municipality has been developed to demonstrate the effectiveness of the proposed approach in dealing with the dynamics and the non-linear dependencies of the sustainability pillars. The results provide a blueprint for urban planners to develop sustainable, resilient, cost-effective, and environmentally friendly smart-intelligent city frameworks.

1. Introduction

The modern urban environment is conceptualized as a dynamic, evolving, and increasingly complex urban ecosystem in which the convergence of physical and computational elements is essential to maintain a high quality of life [1,2]. In this setting, municipalities face significant and often conflicting challenges, such as infrastructure degradation, inadequate service provision, growing demand, cost, and environmental impact. These challenges are central to the 11th United Nations Sustainable Development Goal (SDG 11), which calls for sustainable cities and communities that are inclusive, safe, and resilient, as illustrated in Figure 1 [19].
Despite this potential, even the sustainability of the smart city Information and Communication Technology (ICT) infrastructure is a critical concern. Current models often suffer from a lack of systematic capacity planning, resilience, and long-term sustainability strategies. Many existing infrastructures follow siloed patterns and a static deployment bias, where sensors, devices and networks are specifically designed, deployed and configured for a given vertical application, with fixed parameters regardless of the fluctuating urban context. This absence of resource reuse and modularity leads to low adaptivity, rendering such systems unsustainable in the face of rapid environmental or social shifts.
A viable solution to these limitations is the integration of Artificial Intelligence (AI), moving towards intelligent cities [48]. This evolution represents a fundamental paradigm shift toward holistic urban ICT sustainability, where AI serves as the neural blueprint of the urban entity [3,4]. Such adaptive urbanism allows the city to function like a living organism, mutating its behavior to optimize its response to bio-social demands while minimizing its environmental footprint [10].
Related work distinguishes between classical computational methods, which lack the real-time adaptivity required for complex urban scenarios, and more advanced machine learning approaches. Although supervised learning is highly effective for predictive tasks, it remains limited by its reliance on extensive labeled datasets, which are rarely available for dynamic urban control [20,21,22,23]. Consequently, Reinforcement Learning (RL) has emerged as a dominant methodology, enabling autonomous agents to derive optimal strategies through direct interaction with the urban environment [24,27]. Furthermore, breakthroughs in Multi-Agent Reinforcement Learning (MARL) [25,31] and Large Language Models (LLM) [26,28,29] have introduced advanced reasoning capabilities, although balancing the computational overhead of AI with the energy constraints of edge devices remains a persistent open problem in the design of sustainable systems [30].
The main contribution of this work is a holistic approach to the sustainability of the urban ICT infrastructure, structured around three main pillars: environmental ($P_e$), economic ($P_c$), and social ($P_s$). This framework moves beyond traditional siloed patterns by prioritizing resource reuse, proper capacity planning, and multi-objective optimization. The proposed methodology is framed within the Cyber-Physical Systems (CPS) paradigm, where smart-intelligent cities are modeled as interconnected urban CPS. A real-time, adaptive, AI-driven model based on Deep Reinforcement Learning (DRL) and Deep Q-Networks (DQN) [32,34,35,36] is adopted to optimize the environmental, economic, and social sustainability metrics and the trade-offs among them.
The proposed framework represents a strategic solution to address the multi-faceted nature of SDG 11. By leveraging the DQN-based CPS, the integration of the three pillars ($P_e$, $P_c$, $P_s$) and the related metrics allows for a direct contribution to the operational targets of SDG 11, as summarized in Table 1. The effectiveness of this solution is demonstrated through a case study on Low-Power Wide-Area Network (LPWAN) [16] technologies, specifically the LoRaWAN specification [17].
The remainder of this work is organized as follows: Section 2 discusses the background and related work; Section 3 presents the proposed approach and sustainability metrics; Section 4 details the experimental case study; Section 5 analyzes the results of our investigation; and Section 6 concludes the work with future perspectives.

3. Context-Aware Optimization

To evaluate the systemic impact of the ICT infrastructure in urban environments, this study defines a multi-dimensional framework taking into account environmental, economic, and social sustainability metrics. They are quantified through recognized scientific metrics, providing a standardized approach to evaluate how a Smart City infrastructure interacts with the urban ecosystem and its governance.
As illustrated in Figure 2, our approach introduces a conceptual framework integrating multiple CPS into a cohesive urban ecosystem. It is characterized by the Artificial Intelligence of Things (AIoT) acting as a “cognitive” engine that transforms the old concept of static infrastructure into an adaptive urban ecosystem. The AIoT acts as a stochastic controller that processes real-time telemetry to optimize system responses.
The architecture of the proposed adaptive urban ecosystem is illustrated in Figure 3. The framework conceptualizes the transition from a static sustainability model to a dynamic, context-aware system. As shown in this diagram, the system is driven by two primary inputs: Sustainability, which encodes the SDG 11 objectives, and the Urban Context, which provides real-time environmental and operational data. These inputs converge on the Preliminary Setup Layer (PSL), highlighted by the dashed red boundary. Within this layer, the sustainability pillars $P_e(t)$, $P_c(t)$, and $P_s(t)$ are not treated as static metrics but as dynamic functions of the urban context.
The interaction among these pillars is governed by a Multi-Objective Optimization engine. This component processes the input parameters (i.e., from the PSL) to resolve inherent trade-offs, also ensuring that the infrastructure remains efficient and reliable under varying urban conditions. Finally, an Adaptive Feedback loop returns the optimized state to the Urban Context, enabling a continuous self-healing and self-configuring cycle.

3.1. Smart Urban Context

The core of the proposed framework lies in the formalization of the smart urban context as a dynamic CPS ecosystem governed by a state-space representation. A dynamic urban context $C(t)$ can thus be defined as the $n$-tuple of Equation (1):
$$C(t) = \left( c_1(t), c_2(t), \ldots, c_n(t) \right) \in \mathcal{C} \subseteq \mathbb{R}^n \tag{1}$$
where each component $c_i(t)$ represents a specific environmental or systemic metric, such as traffic congestion levels, network availability, or sudden urban emergencies, and where $\mathcal{C}$ is the context state space.
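To tune the model to urban-wide areas, usually organized into geographical districts (e.g., urban districts, industrial zones, or residential quarters), a context matrix $S(t)$ is formally defined by Equation (2):
$$S(t) = \begin{pmatrix} C_{1,1}(t) & \cdots & C_{1,q}(t) \\ \vdots & \ddots & \vdots \\ C_{p,1}(t) & \cdots & C_{p,q}(t) \end{pmatrix} \in \mathcal{C}^{p,q} \subseteq \mathbb{R}^{n \times p \times q}, \quad p, q \in \mathbb{N} \tag{2}$$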
mapping the $p \cdot q > 0$ districts of the urban area onto the corresponding contexts $C_{i,j}(t)$, each representing the $(i,j)$-th geographical district at time $t$. This formulation ensures that the context matrix $S(t)$ captures the dynamics of the city and its districts, preserving geographic and topological distance through matrix adjacency. The municipality can thus identify spatial correlations and proximity-based dependencies, which are essential for risk prevention (e.g., hydro-geological monitoring) and urban optimization (e.g., commercial promotion in high-footfall areas). Complex spatio-temporal correlations that would be lost in a lower-dimensional or aggregated representation remain visible, which supports the search for optimal trade-offs among the three pillars of sustainability. By aggregating localized district-level data, the representation also provides a comprehensive overview of the city dynamics, so that the framework can detect criticalities and vulnerabilities of specific districts from the interplay of their context metrics, such as traffic congestion, weather patterns, and crowd density.
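As an illustration of Equation (2), the following minimal Python sketch stores the context matrix as a $p \times q$ grid of $n$-metric vectors. The nested-list layout, the helper names, and the three example metrics are assumptions made for this sketch only and are not taken from the authors' implementation.

```python
from typing import List, Tuple

Context = List[float]                 # C_{i,j}(t): n metrics of one district
ContextMatrix = List[List[Context]]   # S(t): p x q grid of district contexts


def make_context_matrix(p: int, q: int, n: int, fill: float = 0.0) -> ContextMatrix:
    """Return a p x q grid whose cells each hold an independent n-metric context."""
    if p <= 0 or q <= 0 or n <= 0:
        raise ValueError("p, q and n must be positive")
    return [[[fill] * n for _ in range(q)] for _ in range(p)]


def matrix_shape(s: ContextMatrix) -> Tuple[int, int, int]:
    """Return (p, q, n) for a well-formed context matrix."""
    return len(s), len(s[0]), len(s[0][0])


# Hypothetical 3 x 3 city with n = 3 metrics per district:
# (traffic congestion, network availability, emergency flag), all in [0, 1].
S = make_context_matrix(p=3, q=3, n=3)
S[1][1] = [0.8, 0.95, 0.0]
```

Keeping the grid layout explicit, rather than flattening it into one long vector, is what lets adjacent cells stand for adjacent districts.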
The practical implications of such granular monitoring are manifold. High-resolution telemetry is crucial for hydro-geological risk prevention and for optimizing urban traffic planning through proactive interventions. The ability to analyze real-time visitor and tourist flows enables more effective promotion of events and commercial activities in more crowded areas. The high-dimensional observation space is transformed into an actionable decision-support tool, allowing the municipality to identify both systemic risks and strategic opportunities by capturing the inter-dependencies between neighboring districts.
This representation does not merely describe the environment but actively shapes the weighting of the sustainability pillars within the proposed multi-objective optimization process, ensuring that the resulting urban trade-offs are both resilient and context-aware. The optimization engine first determines the criticality of the current situation. For example, in highly critical scenarios, the controller may prioritize immediate social and network performance by extending the duration of high-power operational states. Conversely, in routine contexts, the optimization logic shifts toward energy saving and cost reduction. The context therefore affects the sustainability of the smart urban ecosystem and the main metrics selected to characterize its pillars, as described below. This holistic representation emphasizes that the transition toward a smart city is not merely technological but is intrinsically linked to measurable sustainability goals, where CPS bridge the gap between physical infrastructure and socio-economic outcomes.

3.2. Sustainability Pillars and Metrics

Environmental sustainability $P_e$ is traditionally quantified by the ecological cost of the digital infrastructure, focusing primarily on metrics such as energy consumption and carbon footprint. However, in the context of a Smart City, this perspective is inherently limited. True environmental sustainability must be redefined through the proactive role of CPS, acting as responsive sentinels. Beyond mere device-level energy savings, these systems monitor both local and neighboring contexts to manage urban dynamics proactively. By anticipating traffic peaks, high-density events, and congestion, the DQN policy prioritizes the systemic prevention of pollution surges, using the infrastructure as an active tool for ecological mitigation rather than a passive energy consumer. In such a scenario, energy consumption is not a static value but a baseline strictly coupled with the operational context $S(t)$. It serves as a direct measure of the CPS impact on environmental and economic costs, as the power consumption scales dynamically across different modes, from low-power stand-by or idle states to high-performance active monitoring.
Let $d_{i,j}$ be the number of devices deployed across the $(i,j)$-th urban district. The total energy consumption $E$ over an observation interval $[t_0, t]$ is formally defined by Equation (3):
$$E(s,t) = \int_{t_0}^{t} P(S(\tau), \tau)\, d\tau \quad [\mathrm{kWh}] \tag{3}$$
where the lowercase $s$ represents the specific realization of the state at the current time $t$, i.e., $s = S(t)$. The total power $P$ is the sum of the instantaneous power measured at time $t$ for each device in each district, formally defined by Equation (4):
$$P(s,t) = \sum_{i=1}^{p} \sum_{j=1}^{q} \sum_{k=1}^{d_{i,j}} P_{i,j,k}(C_{i,j}(t), t) \quad [\mathrm{W}] \tag{4}$$
The carbon footprint ($CF$) represents the Global Warming Potential (GWP) [44] associated with the existence and operation of the system in the observation interval $[t_0, t]$. It accounts for the embodied carbon of each device, provided as a manufacturer specification, and the emissions derived from its contextual energy consumption. Leveraging the energy formulation defined in Equation (3), the total $CF$ for the infrastructure across the $p \cdot q$ districts is formalized by Equation (5):
$$CF(s,t) = \sum_{i=1}^{p} \sum_{j=1}^{q} \sum_{k=1}^{d_{i,j}} GWP_{i,j,k} + \epsilon_{geo} \cdot E(s,t) \quad [\mathrm{kgCO_2eq}] \tag{5}$$
where $GWP_{i,j,k}$ is the aggregate cradle-to-gate carbon footprint of the $k$-th device located in the $(i,j)$-th urban district (including its enclosure, electronics, and battery where applicable). This term represents the embodied impact linked to the physical existence of the device. The term $\epsilon_{geo} \cdot E(s,t)$ represents the operational impact, where $\epsilon_{geo}$ denotes the grid emission intensity factor [45] and $E(s,t)$ is the context-dependent energy consumption derived from the instantaneous power $P(s,t)$ as defined by Equation (3). This formulation allows the framework to evaluate how the selection of specific hardware models and their operational duty cycles, driven by the urban dynamics captured in $S(t)$, jointly influence the total environmental sustainability of the infrastructure.
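To make Equations (3) and (5) concrete, the sketch below approximates the energy integral from sampled power readings with the trapezoidal rule and then adds the embodied and operational carbon terms. The sampling scheme, the explicit watt-hour to kWh conversion, and all numeric values are assumptions for illustration; they are not measurements from the case study.

```python
from typing import Sequence


def energy_kwh(times_h: Sequence[float], power_w: Sequence[float]) -> float:
    """Trapezoidal approximation of Eq. (3).

    times_h: sample instants in hours; power_w: total power P(s, t) in watts.
    """
    if len(times_h) != len(power_w) or len(times_h) < 2:
        raise ValueError("need at least two aligned samples")
    watt_hours = 0.0
    for k in range(1, len(times_h)):
        dt = times_h[k] - times_h[k - 1]
        watt_hours += 0.5 * (power_w[k] + power_w[k - 1]) * dt
    return watt_hours / 1000.0  # Wh -> kWh


def carbon_footprint(embodied_kg: Sequence[float], eps_geo: float, e_kwh: float) -> float:
    """Eq. (5): embodied GWP of every device plus grid intensity times energy."""
    return sum(embodied_kg) + eps_geo * e_kwh


# Hypothetical example: two devices drawing 5 W in total for 24 hours.
e = energy_kwh([0.0, 12.0, 24.0], [5.0, 5.0, 5.0])          # 0.12 kWh
cf = carbon_footprint([2.0, 3.0], eps_geo=0.4, e_kwh=e)     # kgCO2eq
```

With these assumed inputs, the embodied term (5 kgCO2eq) dominates the operational one (0.048 kgCO2eq) over a single day, which shows why hardware selection and duty cycles are evaluated jointly.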
The economic sustainability ($P_c$) ensures the financial viability and scalability of the infrastructure within municipal budget constraints. It is modeled through the Total Cost of Ownership (TCO), a comprehensive financial metric that balances the initial investment with long-term maintenance over the entire operational life of the CPS infrastructure. It is defined by Equation (6) as the sum of the initial investment and the accumulated operating costs:
$$TCO(s,t) = CAPEX(t_0) + OPEX(s,t) \quad [\mathrm{Currency}] \tag{6}$$
The term $CAPEX(t_0)$ represents the Capital Expenditure incurred at the initial time $t_0$, acting as a static and context-independent value. It covers the procurement of devices for all the districts, including hardware costs, licenses, and physical installation. By defining it at $t_0$, the model treats the initial investment as a fixed boundary condition for the optimization problem, as formally defined by Equation (7):
$$CAPEX(t_0) = \left. \sum_{i=1}^{p} \sum_{j=1}^{q} \sum_{k=1}^{d_{i,j}} \left( H_{i,j,k} + I_{i,j,k} + L_{i,j,k} \right) \right|_{t_0} \quad [\mathrm{Currency}] \tag{7}$$
where $H_{i,j,k}$, $I_{i,j,k}$, and $L_{i,j,k}$ denote the hardware, installation, and licensing costs for the $k$-th device in the $(i,j)$-th geographical district, respectively.
Conversely, $OPEX(s,t)$ represents the dynamic and context-aware Operational Expenditure. It is formally defined by Equation (8):
$$OPEX(s,t) = \frac{(t - t_0)}{T} \cdot K_T + OPEX_V(s,t) \quad [\mathrm{Currency}] \tag{8}$$
where the $K_T$ term represents the time-invariant operational costs (e.g., cloud subscription fees or fixed infrastructure leasing) incurred in each period $T$. The variable operational expenditure $OPEX_V$ during the observation interval $[t_0, t]$ is formally defined by Equation (9) as a direct function of the geographical urban context, capturing how urban dynamics drive resource allocation and energy demand:
$$OPEX_V(s,t) = \int_{t_0}^{t} \sum_{i=1}^{p} \sum_{j=1}^{q} \sum_{k=1}^{d_{i,j}} \left[ V_{i,j,k}(\tau) \cdot P_{i,j,k}(C_{i,j}(\tau), \tau) + M_{i,j,k}(\tau) + \delta_k \cdot B_{i,j,k}(\tau) \cdot O_{pred}(C_{i,j}(\tau), \tau) \right] d\tau \tag{9}$$
In this formulation, the first term of the sum captures the monetary cost of the energy, where $V_{i,j,k}$ is the unit energy price (currency per kWh) and $P_{i,j,k}$ is the device power appearing in Equation (4). The second term captures the localized maintenance and hardware preservation costs across the districts over the observation interval $[t_0, t]$: $M_{i,j,k}$ denotes the routine maintenance for the $k$-th device in the $(i,j)$-th district, encompassing inspections and standard operational overhead. The third term addresses the critical management of power-source longevity. The binary parameter $\delta_k$ indicates the presence of a battery for the $k$-th device model, as specified by Equation (10):
$$\delta_k = \begin{cases} 1 & \text{if the } k\text{-th device is battery-powered} \\ 0 & \text{if the } k\text{-th device is mains-powered} \end{cases} \tag{10}$$
The term $B_{i,j,k}(t)$ represents the replacement cost for the battery (including procurement and field logistics). This cost is modulated by the predictive operator $O_{pred}$, which estimates the degradation rate and the state of health of the device. By incorporating this predictive logic, the model acknowledges that operational stress directly accelerates the chemical aging of the battery. The above formulation is useful for long-term planning, as battery-operated devices ($\delta_k = 1$) can incur periodic replacement costs and labor, whereas mains-powered devices ($\delta_k = 0$) contribute more significantly to the energy cost.
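The cost model of Equations (6) to (10) can be sketched in Python as follows. The `Device` record, the function names, and every price or rate are hypothetical placeholders; in particular, the wear rate stands in for the output of the predictive operator $O_{pred}$, whose actual form is not specified here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Device:
    hardware: float         # H_{i,j,k}
    installation: float     # I_{i,j,k}
    licensing: float        # L_{i,j,k}
    battery_powered: bool   # delta_k of Eq. (10)


def capex(devices: Sequence[Device]) -> float:
    """Eq. (7): one-off cost of all devices at t0."""
    return sum(d.hardware + d.installation + d.licensing for d in devices)


def variable_cost_rate(d: Device, price_per_kwh: float, power_kw: float,
                       maintenance_rate: float, battery_cost: float,
                       wear_rate: float) -> float:
    """Integrand of Eq. (9) for one device, in currency per hour.

    wear_rate is an assumed stand-in for O_pred (fraction of battery life per hour).
    """
    delta = 1.0 if d.battery_powered else 0.0
    return price_per_kwh * power_kw + maintenance_rate + delta * battery_cost * wear_rate


def opex(elapsed: float, period: float, fixed_per_period: float, variable: float) -> float:
    """Eq. (8): fixed cost K_T per period T plus the variable part OPEX_V."""
    return elapsed / period * fixed_per_period + variable


def tco(devices: Sequence[Device], elapsed: float, period: float,
        fixed_per_period: float, variable: float) -> float:
    """Eq. (6): CAPEX plus OPEX."""
    return capex(devices) + opex(elapsed, period, fixed_per_period, variable)


mains = Device(100.0, 20.0, 5.0, battery_powered=False)
battery = Device(50.0, 10.0, 0.0, battery_powered=True)
total = tco([mains, battery], elapsed=12, period=1, fixed_per_period=10.0, variable=50.0)
```

In this sketch the battery term vanishes for mains-powered devices, which mirrors how $\delta_k = 0$ removes the replacement cost from Equation (9).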
The social sustainability ($P_s$) is evaluated through the ability of the system to provide a reliable and timely service to the community. Information systems and IT services increasingly perform a wide variety of organizational functions and personal activities. Therefore, high-quality information systems and IT services are essential to provide value and avoid possible negative consequences for their stakeholders. According to the ISO/IEC System Quality and Software Quality Requirements and Evaluation (SQuaRE) family of International Standards [47], the social value of a CPS is quantified through objective quality of service (QoS) parameters. In this work, we redefine these metrics to be context-aware, ensuring they respond dynamically to the urban context. The first QoS metric is the Packet Delivery Ratio (PDR), which measures the communication integrity and the reliability of the data flow from the devices to the application. It is formally defined by Equation (11):
$$PDR(s,t) = \frac{\displaystyle \int_{t_0}^{t} \sum_{i=1}^{p} \sum_{j=1}^{q} \sum_{k=1}^{d_{i,j}} \Psi_{RX,i,j,k}(S(\tau), \tau)\, d\tau}{\displaystyle \int_{t_0}^{t} \sum_{i=1}^{p} \sum_{j=1}^{q} \sum_{k=1}^{d_{i,j}} \Psi_{TX,i,j,k}(S(\tau), \tau)\, d\tau}, \quad t - t_0 \gg \tau_{max} \tag{11}$$
where the $PDR$ is defined as the ratio of cumulative successfully received packets ($\Psi_{RX}$) to total transmitted packets ($\Psi_{TX}$) across the urban scenario. By modeling these variables as time-varying processes, the integral structure captures the cumulative throughput over the observation window $[t_0, t]$, effectively filtering out instantaneous stochastic noise. To ensure statistical consistency, the interval $[t_0, t]$ is assumed to be significantly larger than the maximum network propagation delay. Furthermore, the explicit dependence of both integrals on the urban context $S(t)$ formalizes how urban stressors, such as localized fading or congestion, modulate this ratio.
The delay $\tau_{max}$ is defined by Equation (12):
$$\tau_{max} = \max_{i,j,k} \tau_{Prop,i,j,k}(t) \tag{12}$$
where the term $\tau_{Prop,i,j,k}(t)$ denotes the instantaneous latency experienced by a packet transmitted by the $k$-th device within the $(i,j)$-th district. This delay accounts for the physical distance between nodes, the signal propagation speed in the urban medium, and potential context-driven retransmissions or multi-hop overheads.
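A discrete counterpart of Equations (11) and (12) can be sketched as below, replacing the integrals with cumulative per-device packet counters collected over the window. The factor used to interpret "significantly larger" is an assumption of this sketch, as are the sample counts and delays.

```python
from typing import Sequence


def packet_delivery_ratio(received: Sequence[int], transmitted: Sequence[int]) -> float:
    """Discrete form of Eq. (11): cumulative received over cumulative transmitted."""
    total_tx = sum(transmitted)
    if total_tx == 0:
        raise ValueError("no packets transmitted in the observation window")
    return sum(received) / total_tx


def window_is_valid(t0: float, t: float, prop_delays: Sequence[float],
                    factor: float = 100.0) -> bool:
    """Check t - t0 >> tau_max, reading '>>' as 'at least factor times larger'.

    tau_max is the largest per-device delay, as in Eq. (12); factor is assumed.
    """
    tau_max = max(prop_delays)
    return (t - t0) >= factor * tau_max


# Hypothetical counters for three devices over a one-hour window (seconds).
pdr = packet_delivery_ratio(received=[95, 180, 47], transmitted=[100, 200, 50])
ok = window_is_valid(t0=0.0, t=3600.0, prop_delays=[0.05, 0.2, 0.12])
```

Summing before dividing, rather than averaging per-device ratios, matches the ratio-of-integrals structure of Equation (11) and weights each device by its traffic volume.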
The Service Timeliness, denoted as $\sigma(s,t)$, quantifies the average end-to-end latency over the observation window $[t_0, t]$, ensuring the system meets specific real-time requirements. It is formally defined by Equation (13) as the mean temporal gap between data generation at the sensing layer and its final reception at the application level:
$$\sigma(s,t) = \mathbb{E}\left[ t_{D,i,j,k} - t_{S,i,j,k} \right]_{(t - t_0)} \quad [\mathrm{s}] \tag{13}$$
where $t_S$ and $t_D$ represent the timestamps of data sensing and delivery for each packet, respectively. To preserve the operational integrity of safety-critical services, we impose a strict latency constraint:
$$\sigma(s,t) \le \tau_{req} \tag{14}$$
where $\tau_{req}$ represents the application-specific deadline. While $\tau_{max}$ of Equation (12) defines the physical upper bound of the network propagation delay, the timeliness $\sigma$ accounts for the total system latency, including processing times and queuing delays. The condition $\sigma \le \tau_{req}$ ensures that the response of the system to the urban environment $S(t)$ is based on timely rather than outdated information, preserving the effectiveness of the decision-making process under dynamic conditions.
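The timeliness metric and its deadline constraint can be illustrated with a short Python sketch that treats the expectation in Equation (13) as a sample mean over the packets observed in the window. The timestamps and the deadline value are hypothetical.

```python
from typing import Sequence


def service_timeliness(sensed: Sequence[float], delivered: Sequence[float]) -> float:
    """Sample-mean estimate of Eq. (13): average of t_D - t_S in seconds."""
    if len(sensed) != len(delivered) or not sensed:
        raise ValueError("need one delivery timestamp per sensing timestamp")
    gaps = [t_d - t_s for t_s, t_d in zip(sensed, delivered)]
    if any(g < 0 for g in gaps):
        raise ValueError("a packet cannot be delivered before it is sensed")
    return sum(gaps) / len(gaps)


def meets_deadline(sigma: float, tau_req: float) -> bool:
    """Latency constraint: sigma must not exceed the application deadline."""
    return sigma <= tau_req


# Hypothetical packets: sensing and delivery timestamps in seconds.
sigma = service_timeliness(sensed=[0.0, 10.0, 20.0], delivered=[0.5, 11.0, 21.5])
on_time = meets_deadline(sigma, tau_req=2.0)
```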
The Service Availability ($A_s$) [47] represents the fraction of time the infrastructure is fully operational and capable of fulfilling its functional requirements. Although the service availability at an individual device level is sensitive to localized stressors, $A_s$ is a critical system-level Key Performance Indicator (KPI). The proposed model leverages the aggregation of individual availability metrics to derive a systemic indicator that reflects the capacity of the smart city infrastructure to provide continuous services, despite possible failures of single nodes.
The availability $a_k$ of a device is defined by Equation (15):
$$a_k(t) = \frac{\mathrm{Uptime}_k(t_0, t)}{\mathrm{Downtime}_k(t_0, t) + \mathrm{Uptime}_k(t_0, t)} \tag{15}$$
where $\mathrm{Uptime}_k(t_0, t)$ denotes the cumulative duration within the time interval $(t_0, t)$ in which the $k$-th device is fully operational and the service is active. Conversely, $\mathrm{Downtime}_k(t_0, t)$ accounts for the periods in which the device is not operational and the service is inactive. The system-level availability is then formally specified by Equation (16):
$$A_s(s,t) = \sum_{i=1}^{p} \sum_{j=1}^{q} \frac{1}{d_{i,j}} \sum_{k=1}^{d_{i,j}} a_{i,j,k}(t) \tag{16}$$
where $a_{i,j,k}(t)$ is the availability of Equation (15) for the $k$-th device of the $(i,j)$-th district and $d_{i,j}$ denotes the number of devices deployed across the $(i,j)$-th urban district. Unlike static reliability models, this formulation accounts for environmental and operational stressors across the full urban context $S(t)$. A high $A_s$ value is mandatory for mission-critical urban services, ensuring 24/7 availability and resilience against localized disruptions.
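The aggregation of Equations (15) and (16) is sketched below in Python. As written, Equation (16) sums the per-district mean availabilities; the optional `normalize` flag, which divides by the number of populated districts to return a value in $[0, 1]$, is an assumption of this sketch, as is the choice to skip districts without devices.

```python
from typing import Sequence


def device_availability(uptime: float, downtime: float) -> float:
    """Eq. (15): uptime over total observed time for one device."""
    total = uptime + downtime
    if total <= 0:
        raise ValueError("observation interval must be positive")
    return uptime / total


def system_availability(districts: Sequence[Sequence[float]],
                        normalize: bool = False) -> float:
    """Eq. (16): sum over districts of the mean device availability.

    districts holds one list of a_{i,j,k} values per district. Empty districts
    (d_{i,j} = 0) are skipped. With normalize=True (an assumption, not part of
    Eq. (16)) the sum is divided by the number of populated districts.
    """
    means = [sum(a) / len(a) for a in districts if a]
    if not means:
        raise ValueError("no devices deployed")
    total = sum(means)
    return total / len(means) if normalize else total


# Hypothetical values: two devices in one district, one device in another.
a1 = device_availability(uptime=90.0, downtime=10.0)     # 0.9
raw = system_availability([[1.0, a1], [0.8]])
frac = system_availability([[1.0, a1], [0.8]], normalize=True)
```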

3.3. DQN-Based Optimization

As discussed in Section 2, a DQN-based optimization strategy has been adopted to achieve context-aware governance of urban resources, thereby facilitating the transition toward more adaptive and sustainable smart city ecosystems. A general schema of the proposed method is depicted in Figure 4. This architectural framework transitions from raw environmental data to a high-level decision-making state through a three-level hierarchical abstraction. At the most granular level (System Metrics), the system captures systemic metrics represented as $c_n(t)$. Such metrics are geographically aggregated into $p \cdot q > 0$ district contexts $C_{i,j}(t) = \left( c_1(t), c_2(t), \ldots, c_n(t) \right)$, each encapsulating the $n$-dimensional status of the $(i,j)$-th district at time $t$.
The integration of this state representation into the DQN pipeline follows a specific information flow. The Environment block, acting as the CPS, outputs the context matrix $S(t)$, which is then passed to the DQN agent. The agent uses convolutional layers to keep the grid shape intact, which allows it to recognize spatial patterns, such as how a problem in one district might affect its neighbors. By preserving this grid structure, the agent better understands the layout of the city, leading to more effective urban management decisions.
More specifically, at each time step $t$, the agent observes the global state of the city through the context matrix $S(t) \in \mathcal{C}^{p,q}$. This representation captures the status of all functional districts $(i,j)$, characterized by systemic metrics whose raw data, collected from IoT devices, are normalized before being processed by the deep neural network (DNN) to ensure numerical stability and effective feature extraction. Once the agent identifies the optimal policy for the devices of each district $(i,j)$, it translates this policy into an operational action $a_{i,j} \in A^{p,q}$ that sets the devices to a specific operational condition. The operational conditions are discretized into a finite set of $h \in \mathbb{N}$ operational modes, as defined by Equation (17):
$$O = \{ o_1, \ldots, o_h \}, \quad h \in \mathbb{N} \tag{17}$$
The action $a_{i,j}$ can then be defined as the pair of Equation (18):
$$a_{i,j} = (o_s, o_f) \in A^{p,q} \subseteq O^2 \tag{18}$$
with $o_s, o_f \in O$ as the starting and final states of an action. It is important to remark that, in general, the operational conditions are orthogonal to the sustainability pillars, and that $o_s$ may be the same as $o_f$, i.e., actions with $o_s = o_f$ enforce no changes.
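The action space of Equations (17) and (18) can be enumerated with a few lines of Python. The three mode names are hypothetical; the source only states that there are $h$ discrete operational modes.

```python
from itertools import product
from typing import List, Sequence, Tuple

Action = Tuple[str, str]  # (o_s, o_f): starting and final operational mode

# Hypothetical set O of h = 3 operational modes.
MODES = ("sleep", "idle", "active")


def action_space(modes: Sequence[str]) -> List[Action]:
    """All ordered pairs (o_s, o_f) in O^2, including the no-change pairs."""
    return list(product(modes, repeat=2))


def is_noop(action: Action) -> bool:
    """True when o_s == o_f, i.e. the action enforces no mode change."""
    return action[0] == action[1]


actions = action_space(MODES)
noops = [a for a in actions if is_noop(a)]
```

For $h$ modes the per-district action set has at most $h^2$ elements, of which $h$ leave the devices unchanged.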
The Sustainability Gain $G^{\pi}(s,a)$ is formally defined by Equation (19) for a specific state-action pair $(s,a) \in S \times A$. It is the expected cumulative discounted return, which the DQN agent aims to maximize to identify the optimal policy ($\pi$). Specifically, $G^{\pi}(s,a)$ corresponds to the action-value function, satisfying the Bellman optimality criterion [49].
$$G^{\pi}(s,a) = \mathbb{E}\left[ \sum_{y=0}^{\infty} \gamma^{y} \sum_{i=1}^{p} \sum_{j=1}^{q} R_{i,j}\big( S(t+y), A(t+y) \big) \right], \quad \gamma \in (0,1) \tag{19}$$
This formulation encapsulates the multi-level complexity of urban infrastructure through several key components. The expectation operator $\mathbb{E}[\cdot]$ accounts for the intrinsic stochasticity of the urban environment: state transitions and the resulting rewards are influenced by unpredictable events, such as traffic spikes or sudden emergencies. The temporal index $t$ denotes the current decision epoch, while $y \in \mathbb{N}$ represents the discrete look-ahead horizon. The term $t + y$ indicates that the agent's current return is not merely a function of immediate rewards, but a discounted accumulation over expected future states $S(t+y)$ and actions $A(t+y)$. The infinite horizon ($y \to \infty$), moderated by the discount factor $\gamma \in (0,1)$, ensures the strategic sustainability of the policy. For $\gamma \to 0^+$, the objective function collapses into a single-step optimization, where the agent considers only the immediate reward. Conversely, as $\gamma \to 1$, the agent weights present and future rewards equally. In this limit, however, the agent loses the ability to rank different strategies, as any policy providing a positive reward would result in the same mathematical value ($\infty$). By keeping $\gamma < 1$, the series converges to a finite number, allowing the DQN to determine which policy is superior by comparing finite values.
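The role of the discount factor can be checked numerically with the following Python sketch, which evaluates a truncated version of the sum in Equation (19) for a single sampled reward sequence (the expectation is not modeled). The reward values are hypothetical; the closed form $r / (1 - \gamma)$ for a constant reward $r$ is the standard geometric-series limit.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Truncated sum of gamma**y * reward[y], evaluated back to front."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in the open interval (0, 1)")
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def constant_reward_limit(r: float, gamma: float) -> float:
    """Infinite-horizon return for a constant per-step reward r."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in the open interval (0, 1)")
    return r / (1.0 - gamma)


# Per-step rewards already summed over the p x q grid (hypothetical values).
short = discounted_return([1.0, 1.0, 1.0], gamma=0.5)    # 1 + 0.5 + 0.25
long_run = discounted_return([1.0] * 200, gamma=0.9)     # approaches 10
limit = constant_reward_limit(1.0, gamma=0.9)
```

As $\gamma$ approaches 1 the limit $r / (1 - \gamma)$ grows without bound, which is the numerical counterpart of the loss of ranking ability described above.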
The sustainability gain $G^{\pi}(s,a)$ aggregates the local rewards $R_{i,j}$ at the level of the $(i,j)$-th district to ensure computational scalability within large-scale smart city deployments. As the number of urban IoT devices may grow into the thousands, a device-level reward structure would introduce high-frequency noise and a dimensional explosion in the feedback signal, severely impacting the DQN convergence. With a district-level reward $R_{i,j}$, the system treats each geographical area as a functional domain. This ensures that the learning complexity remains tied to the resolution of the urban ($p \times q$) grid rather than to the fluctuating density of the deployed hardware. The reward is formally defined by Equation (20), aggregating the environmental, economic, and social sustainability metrics of all devices within that area:
$$R_{i,j}(s,a) = 1.0 - \left[ \lambda_1 L_e(\eta_{i,j}, a_{i,j}) + \lambda_2 L_c(\eta_{i,j}, a_{i,j}) + \lambda_3 L_s(\eta_{i,j}, a_{i,j}) \right] \tag{20}$$
where $\lambda_k \in [0,1] \subset \mathbb{R}$ are the sustainability factors balancing the impact of the normalized losses $L$ on the system, such that $\sum_{k=1}^{3} \lambda_k = 1$. The reward function $R_{i,j}$ formalizes the operational efficiency of the $(i,j)$-th district by coupling the local action $a_{i,j}$ with the expanded neighborhood [50] state ($\eta_{i,j}$). This formulation adopts a penalty-from-unity approach: starting from an ideal value of $1.0$, the reward is decreased by a weighted sum of these components. As a consequence, maximizing the expected return $G^{\pi}(s,a)$ is mathematically equivalent to minimizing the long-term cumulative loss. A policy $\pi$ is thus a parametric configuration that can be identified by the 3-tuple of Equation (21):
$$\pi = (\lambda_1, \lambda_2, \lambda_3) \tag{21}$$
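The penalty-from-unity reward of Equation (20) can be sketched in Python as follows, with a policy given as the weight triple of Equation (21). The loss values and the weights are hypothetical, and the losses are assumed to be already normalized to $[0, 1]$.

```python
from typing import Tuple

Policy = Tuple[float, float, float]  # (lambda_1, lambda_2, lambda_3) of Eq. (21)


def district_reward(l_e: float, l_c: float, l_s: float, policy: Policy) -> float:
    """Eq. (20): start from 1.0 and subtract the weighted sum of the losses."""
    if any(not 0.0 <= lam <= 1.0 for lam in policy):
        raise ValueError("each lambda must lie in [0, 1]")
    if abs(sum(policy) - 1.0) > 1e-9:
        raise ValueError("lambdas must sum to 1")
    lam1, lam2, lam3 = policy
    return 1.0 - (lam1 * l_e + lam2 * l_c + lam3 * l_s)


# Hypothetical normalized losses and an environment-leaning policy.
reward = district_reward(l_e=0.2, l_c=0.5, l_s=0.1, policy=(0.5, 0.3, 0.2))
ideal = district_reward(0.0, 0.0, 0.0, policy=(0.5, 0.3, 0.2))
```

Because the weights sum to 1 and the losses are assumed to lie in $[0, 1]$, the reward in this sketch also stays within $[0, 1]$.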
The spatial dependency is modeled upon the Moore neighborhood convention, which is widely adopted in networked RL to capture local inter-dependencies and interference patterns in distributed sensing infrastructures [50]. It is formally defined by Equation (22):
$$\eta(i,j) = \{ C_{u,v} \in S : |u - i| \le r,\ |v - j| \le r \} \tag{22}$$
where $\eta(i,j)$ is composed of the context subset (sub-matrix) of the global state space $S$ centered on the district $(i,j)$ with radius $r$.
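The neighborhood of Equation (22) reduces to a clipped index window on the district grid, as in the Python sketch below. Zero-based indices and clipping at the grid border are assumptions of this sketch; the source does not state how border districts are handled.

```python
from typing import List, Tuple


def moore_neighborhood(p: int, q: int, i: int, j: int, r: int = 1) -> List[Tuple[int, int]]:
    """Indices (u, v) with |u - i| <= r and |v - j| <= r inside a p x q grid.

    The center (i, j) is included, as in Eq. (22). Indices are zero-based and
    the window is clipped at the grid border (an assumption of this sketch).
    """
    if not (0 <= i < p and 0 <= j < q) or r < 0:
        raise ValueError("center must lie in the grid and r must be non-negative")
    return [(u, v)
            for u in range(max(0, i - r), min(p, i + r + 1))
            for v in range(max(0, j - r), min(q, j + r + 1))]


center = moore_neighborhood(3, 3, 1, 1)   # full 3 x 3 window
corner = moore_neighborhood(3, 3, 0, 0)   # clipped to 2 x 2
```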
By conditioning the environmental ($L_e$), economic ($L_c$), and social ($L_s$) losses on the context of adjacent districts, the system prevents the emergence of selfish policies that might optimize local parameters at the expense of neighboring areas. Each loss component $L_x$ of Equation (20) is modeled as a weighted aggregation of a specific subset of normalized metrics, as specified by Equation (23):
$$L_x = \sum_{k=1}^{n} \omega_k \cdot \hat{c}_k; \quad \hat{c}_k = \frac{c_k}{c_k^{max}} \tag{23}$$
where $\omega_k$ represents the relative weight of each normalized metric $\hat{c}_k$ within that specific loss, such that $\sum_{k=1}^{n} \omega_k = 1$, and $c_k^{max}$ is the maximum value threshold allowed to preserve the operation of the equipment. The hierarchical weighting structure allows the DQN agent to internalize a complex multi-dimensional state while maintaining a clear balance between high-level sustainability goals and specific physical constraints.
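Equation (23) can be sketched in Python as a weighted sum of threshold-normalized metrics. The metric values, thresholds, and weights are hypothetical. No clipping is applied, so a metric above its threshold yields a normalized value greater than 1; whether such values should be clipped is left open here.

```python
from typing import Sequence


def normalized_loss(values: Sequence[float], max_values: Sequence[float],
                    weights: Sequence[float]) -> float:
    """Eq. (23): sum of omega_k * (c_k / c_k_max) over the metrics of one loss."""
    if not (len(values) == len(max_values) == len(weights)) or not values:
        raise ValueError("values, max_values and weights must be non-empty and aligned")
    if any(m <= 0 for m in max_values):
        raise ValueError("thresholds must be positive")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * c / m for c, m, w in zip(values, max_values, weights))


# Hypothetical environmental loss from two metrics with equal weights.
l_e = normalized_loss(values=[50.0, 2.0], max_values=[100.0, 4.0], weights=[0.5, 0.5])
```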

4. Case Study

To validate the proposed approach, a real-world urban deployment (i.e., the historical center of a Sicilian city) consisting of five neighboring districts has been investigated. The effectiveness of this solution is demonstrated through a case study on LoRaWAN technologies [17] as the smart city ICT infrastructure. LoRaWAN is an LPWAN specification designed to wirelessly connect battery-operated things to the internet in regional, national, or global networks, and targets key IoT requirements such as bi-directional communication, end-to-end security, mobility, and localization services [18].
As shown in Figure 5, a 3×3 map defines the urban domain where districts A, B, C, D, and E represent the active operational area covered by a central LoRaWAN gateway located in district C (0,0). Non-covered peripheral areas are masked to focus the policy of the DQN agent on the core interconnected districts. Each active district is equipped with one installation point (pole) including several IoT devices to capture the multi-faceted dynamics of the district, as reported in Table 3.
Each cell corresponds to a surface area of $0.04\ \mathrm{km}^2$, chosen according to the urban morphology of the historic center and its impact on LoRaWAN communication. To ensure the replicability of the proposed approach, the cell size can be scaled according to the topology and geo-morphological features of a new monitored environment.
More specifically, Pole A acts as an integrated sensing cluster hosting one environmental station, one traffic sensor, and one parking node, while Pole B provides a dual-source setup for environmental and traffic monitoring. A similar configuration is replicated on Pole C, which hosts an additional environmental station, a traffic sensor, a parking node, and a standalone LoRaWAN gateway providing the backbone for the LoRaWAN communication layer. Mobility dynamics are further captured by Pole D, which serves as a high-density hub with one traffic sensor and two parking nodes, and by Pole E, conceived for localized parking monitoring with a single sensing unit.
Thereby, the experimental testbed comprises 12 heterogeneous edge devices and a central gateway $GW$, leveraging hardware and firmware technology provided by the innovative company SmartMe.io. The gateway acts as the primary orchestrator, ensuring seamless connectivity between the edge nodes and a dedicated cloud-based smart mobility platform for real-time data processing and analytics. This technological infrastructure is categorized into three functional typologies of devices, as shown in Table 4.
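The pole configuration described above can be written down as a small data structure, which also makes the device count easy to check. The sketch below is only a transcription of the textual description; the dictionary keys and helper names are illustrative assumptions, and Pole B is assumed to host exactly one environmental station and one traffic sensor.

```python
# Device inventory per pole, transcribed from the textual description.
# Key names are illustrative; the gateway on Pole C is tracked separately
# because it is not counted among the edge devices.
POLES = {
    "A": {"environmental_station": 1, "traffic_sensor": 1, "parking_node": 1},
    "B": {"environmental_station": 1, "traffic_sensor": 1},
    "C": {"environmental_station": 1, "traffic_sensor": 1, "parking_node": 1},
    "D": {"traffic_sensor": 1, "parking_node": 2},
    "E": {"parking_node": 1},
}
GATEWAY_POLE = "C"

def count_edge_devices(poles):
    """Total number of edge devices across all poles."""
    return sum(sum(devices.values()) for devices in poles.values())

def count_by_type(poles):
    """Number of devices of each functional typology."""
    totals = {}
    for devices in poles.values():
        for kind, amount in devices.items():
            totals[kind] = totals.get(kind, 0) + amount
    return totals

total_devices = count_edge_devices(POLES)
devices_by_type = count_by_type(POLES)
```

Under these assumptions the inventory sums to the 12 edge devices stated in the text, split into three functional typologies.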
All poles are permanently mains-powered, and devices operate in a continuous DC active state. As a consequence, the $OPEX_V$ formulation (Equation (9)) is simplified ($\delta_k = 0$). Table 5 reports the baseline energy metrics under standard operating conditions, assuming a static duty-cycle without any dynamic policy intervention. These values represent the reference consumption against which the agent's energy-saving capabilities are subsequently measured.
This baseline was established during the start-up phase of the project, when measurements were calibrated and validated using certified instrumentation. It is designed to be periodically updated and refined, provided that such updates remain compatible with existing network constraints and infrastructure overhead (e.g., the 1% duty-cycle in LoRaWAN communication).
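To make the 1% duty-cycle constraint concrete, the following sketch computes the transmit-time budget it implies over a one-hour window and the number of uplinks that fit in it. The function names and the time-on-air value are hypothetical; the actual time on air depends on payload size and radio settings that are not reported here.

```python
def airtime_budget_s(duty_cycle, window_s=3600.0):
    """Maximum cumulative transmit time (seconds) allowed in a window."""
    if not 0.0 < duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in (0, 1]")
    return duty_cycle * window_s

def max_uplinks(duty_cycle, time_on_air_s, window_s=3600.0):
    """Number of whole uplinks that fit in the airtime budget."""
    if time_on_air_s <= 0.0:
        raise ValueError("time_on_air_s must be positive")
    return int(airtime_budget_s(duty_cycle, window_s) // time_on_air_s)

# A 1% duty cycle leaves 36 s of transmit time per hour.
hourly_budget = airtime_budget_s(0.01)
# Hypothetical time on air of 1.5 s per uplink.
uplinks_per_hour = max_uplinks(0.01, 1.5)
```

Any baseline update that increases reporting frequency has to stay within this budget, which is why the update procedure is tied to the network constraints.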
During runtime, the system collects a set of heterogeneous physical measurements from the urban environment, formally defined as a real-valued 10-dimensional tuple by Equation (24):
$$C(t) = [c_1(t), c_2(t), \ldots, c_{10}(t)] \in \mathcal{C} \subseteq \mathbb{R}^{10}$$
in compliance with Equation (1).
Table 6 reports the metrics considered for the examined scenario, including features and related impacts on the city dynamics. The radio parameters ($c_9$, $c_{10}$) and urban demand indicators ($c_4$, $c_5$, $c_6$) determine the transmission power and frequency. Physical stressors ($c_1$, $c_2$) derived from the weather dataset, alongside the Ultraviolet (UV) index ($c_3$), influence the probability of hardware fatigue and the maintenance frequency. In our $TCO$ model, these environmental factors directly impact the amortization, $OPEX$, and $CAPEX$ by modulating the expected life-cycle of the mains-powered devices. Furthermore, the performance metrics of the LoRaWAN network ($c_7$, $c_8$), combined with the activity density ($c_6$), define the overall $QoS$.
The Table 6 metrics, selected from the weather, traffic, parking, and LoRaWAN datasets, serve as the primary inputs for the optimization process. To prevent bias during the DQN training phase, these raw observations are scaled into a normalized demand vector $\hat{C}$, as defined by Equation (23). Operating on the normalized inputs, the DQN agent executes the operational phases of its architecture to maximize the cumulative reward $R$ defined by Equation (20). The training process integrates experience replay and target network synchronization to ensure the stability of the Q-value estimation. To balance the investigation of new strategies against the exploitation of those already learned, an exploration-exploitation mechanism allows the agent to explore the state-action space before converging toward the optimal policy $\pi$ that maximizes the sustainability gain $G_{\pi}(s, a)$.
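The two stabilization techniques named above, experience replay and target network synchronization, can be illustrated with a framework-free sketch. The class and function names below are assumptions introduced for illustration; network parameters are represented as a plain dictionary, and a hard (full-copy) synchronization is assumed since the update rule is not specified in the text.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity):
        # deque with maxlen silently evicts the oldest transition when full.
        self._storage = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the temporal correlation of consecutive steps.
        return rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)

def sync_target(online_params, target_params):
    """Hard update: overwrite the target parameters with the online ones."""
    target_params.clear()
    target_params.update(online_params)
```

In a training loop, `sync_target` would typically be called every fixed number of steps, so that the targets used in the Q-value update change slowly relative to the online network.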
A distinctive feature of this implementation is the spatial modeling of the state, which extends beyond localized observations. The implemented logic integrates a context with the dynamics detected in the adjacent districts $(A, B, D, E)$, defined as the radius $r = 1$ Moore neighborhood set $\eta(0,0)$ of Equation (22). The agent therefore does not merely react to local changes: it can also capture how districts affect each other, for example how traffic moves or how environmental changes spread from the outer districts to the center. It can thus anticipate incoming data traffic from neighboring areas and adjust settings in advance to stay within the constraints.
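The neighborhood construction can be sketched as follows: the radius-1 Moore neighborhood of a cell contains its eight surrounding cells, and masking restricts it to the active districts. The grid coordinates assigned to districts A, B, D, and E below are a hypothetical layout chosen only for illustration, since the actual placement is given by Figure 5.

```python
def moore_neighborhood(center, radius=1):
    """All cells within Chebyshev distance `radius` of `center`, center excluded."""
    cx, cy = center
    return [
        (cx + dx, cy + dy)
        for dx in range(-radius, radius + 1)
        for dy in range(-radius, radius + 1)
        if (dx, dy) != (0, 0)
    ]

def active_neighbors(center, active_cells, radius=1):
    """Neighborhood cells that belong to the active (non-masked) area."""
    return [cell for cell in moore_neighborhood(center, radius) if cell in active_cells]

# Hypothetical placement of the five active districts on the 3x3 grid.
ACTIVE = {(0, 0): "C", (0, 1): "A", (-1, 0): "B", (1, 0): "D", (0, -1): "E"}

neighbor_names = sorted(ACTIVE[cell] for cell in active_neighbors((0, 0), ACTIVE))
```

The masked cells simply never appear in the context, so the agent's state only aggregates observations from districts that actually host a pole.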
The experimental setup implements three ($h = 3$) discrete operational conditions for the edge infrastructure, as defined by Equation (25) specializing Equation (17):
$$O = \{o_{ECO}, o_{STD}, o_{HPM}\}$$
where each mode $o_{ECO}$, $o_{STD}$, and $o_{HPM}$ defines a specific power profile. The $o_{ECO}$ mode identifies the lower operational bound for the edge device, characterized by a power consumption of 2 W. From a hardware perspective, this configuration corresponds to an idle state or a low-power duty cycle where GPU-intensive tasks are suspended and sensing frequencies are minimized. In the $o_{STD}$ mode, the device operates at its nominal power profile of 4 W. This setup ensures compliance with standard regulatory limits while providing sufficient data granularity for routine urban monitoring. The $o_{HPM}$ mode is characterized by maximum computational and transmission effort to ensure high system responsiveness during urban emergencies or peak demand. In this configuration, the device enables full real-time edge inference, resulting in a power draw of 7 W. This value aligns with the 10 W Thermal Design Power (TDP) profile of the Jetson Nano, accounting for sustained GPU utilization during complex neural inference.
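The three power profiles lend themselves to a simple energy estimate. The sketch below encodes the modes with the power values stated above and computes the daily energy of one device for an hourly mode schedule; the schedule itself and the names `Mode` and `daily_energy_wh` are illustrative assumptions.

```python
from enum import Enum

class Mode(Enum):
    """Operational conditions with their power draw in watts."""
    ECO = 2.0
    STD = 4.0
    HPM = 7.0

def daily_energy_wh(schedule):
    """Energy in Wh for a list of 24 hourly modes (1 h spent in each entry)."""
    if len(schedule) != 24:
        raise ValueError("schedule must contain exactly 24 hourly entries")
    return sum(mode.value for mode in schedule)

# Reference case: a device kept in the nominal mode all day.
always_std = daily_energy_wh([Mode.STD] * 24)
# Hypothetical schedule: 18 h in ECO and 6 h in STD.
mostly_eco = daily_energy_wh([Mode.ECO] * 18 + [Mode.STD] * 6)
```

Comparing a schedule against the always-nominal reference gives a first-order view of the energy a policy saves or spends per device and per day.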
In applying DRL techniques to urban scenarios, overfitting is a critical challenge: it occurs when the neural network accurately learns the specific noise or outliers of the training dataset but fails to generalize its policy to unseen conditions. In our context, an over-fitted DQN agent might trigger wrong actions in response to sensor anomalies.

5. Results

To validate the robustness of the proposed approach, a comprehensive analysis has been conducted on the urban infrastructures described in Table 3. District C has been selected as the primary $(0,0)$ reference due to its full-fledged configuration, as shown in Table 4. Such a multi-modal data environment provides a challenging and representative scenario. Three policies, characterizing the agent's behavioral attitude toward urban dynamics, emerge as trade-offs from the multi-objective optimization. We define these policies as lazy, balanced, and responsive.
The lazy policy $\pi_l = (0.6, 0.3, 0.1)$ identifies an energy-saving and cost-saving profile. In this configuration, the DQN agent keeps the devices in a low-power state to minimize energy consumption and reduce the associated economic expenditure. It simultaneously protects the device lifespan: by operating primarily in less demanding and less stressful conditions, it preserves the integrity of the hardware electronics. Thereby, the system ensures long-term operational resilience by avoiding the thermal and computational stress typical of higher-performance modes. Although this policy safeguards hardware longevity and related costs, it introduces a significant operational risk to the urban scenario: by failing to capture micro-scale traffic fluctuations and peak congestion events, the system underestimates the actual urban stress. As a consequence, it can represent a systemic risk for the community, as decision-makers are provided with insufficient data that masks pollution hotspots and traffic bottlenecks. In summary, the lazy policy achieves device-level resilience at the expense of city observability and, consequently, of its sustainability.
The responsive policy $\pi_r = (0.1, 0.3, 0.6)$ identifies a social-oriented profile. This configuration is highly sensitive to fluctuations in urban traffic demand; while it accepts a higher energy and economic cost at the device level, it minimizes the information latency. By operating at peak performance, the system avoids underestimating mobility dynamics, which is crucial to prevent flawed or weak decision-making. In this case study, responsiveness translates into substantial environmental and social benefits: by providing high-resolution monitoring of traffic congestion and parking occupancy, the system enables more effective urban flow management and a reduction in traffic congestion. Consequently, the ability to detect and react to micro-scale mobility fluctuations allows for the targeted mitigation of pollution hotspots. In this configuration, the local energy consumption $E$ and the capital expenditure $CAPEX$ can be regarded as a strategic investment to achieve a systemic reduction of the overall environmental footprint.
The balanced policy $\pi_b = (0.3, 0.4, 0.3)$ identifies a nominal baseline profile designed to ensure that urban monitoring remains reliable and inclusive without reaching the energy peaks of the responsive mode while avoiding the information latency of the lazy approach.
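How the three weight vectors steer the agent can be illustrated with a weighted scalarization of the pillar losses. The sketch below is built on two assumptions that are not stated explicitly in this section: that the weights are ordered as (environmental, economic, social), and that the per-pillar losses are combined linearly. The loss values are hypothetical.

```python
# Policy weight vectors as reported in the text; the ordering
# (environmental, economic, social) is an assumption.
POLICY_WEIGHTS = {
    "lazy": (0.6, 0.3, 0.1),
    "balanced": (0.3, 0.4, 0.3),
    "responsive": (0.1, 0.3, 0.6),
}

def scalarized_loss(policy, loss_env, loss_eco, loss_soc):
    """Linear combination of the three pillar losses under a policy's weights."""
    w_env, w_eco, w_soc = POLICY_WEIGHTS[policy]
    return w_env * loss_env + w_eco * loss_eco + w_soc * loss_soc

# Hypothetical losses of a low-power state: low environmental and economic
# loss, high social loss because little information is collected.
low_power_losses = (0.2, 0.2, 0.9)
ranking = {
    name: scalarized_loss(name, *low_power_losses) for name in POLICY_WEIGHTS
}
```

Under these assumptions the same low-power state is penalized least by the lazy weights and most by the responsive ones, which is consistent with the behavioral attitudes described for the three policies.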
To account for seasonal variations in traffic-parking patterns and environmental conditions, four representative months have been selected to sample the annual operational cycle from August 2025 to March 2026: August (summer), October (autumn), January (winter), and March (spring). This seasonal sampling ensures that the DQN policy is not over-fitted to specific temporal conditions but remains effective and generalized across different climatic and social contexts. The following analysis details the behavioral patterns learned by the DQN agent across the four representative months, illustrating how the lazy, balanced, and responsive policies map the normalized traffic demand to the operational conditions $O = \{o_{ECO}, o_{STD}, o_{HPM}\}$.
In the August scenario depicted in Figure 6, the agent faces distinct summer traffic peaks where the policies exhibit highly differentiated behaviors. The lazy policy stays in the $o_{ECO}$ state for the majority of the day, showing high tolerance for traffic increases and transitioning to $o_{STD}$ only during the mid-day and evening peaks. The balanced policy serves as a stable baseline, remaining in $o_{STD}$ for almost the entire 24-hour cycle to ensure constant monitoring. Finally, the responsive policy acts as a vigilant sentinel, proactively switching to $o_{HPM}$ during the two main traffic surges.
In the October scenario depicted in Figure 7, the lazy policy switches from the $o_{ECO}$ mode to the $o_{STD}$ mode only during the two highest peaks of the day, maintaining its low-power profile as in August. The balanced policy remains in $o_{STD}$ throughout the active city hours (07:00–21:00) and reverts to $o_{ECO}$ only during deep night. The responsive policy switches to $o_{HPM}$ at each traffic peak, specifically targeting the early morning rush, the mid-day plateau, and the evening return, thus maximizing the time spent in the high-performance $o_{HPM}$ condition.
The January scenario depicted in Figure 8 reveals how the agent adapts to winter traffic demand. The lazy policy confirms the extreme cost-oriented attitude observed in August and October. Devices remain in $o_{ECO}$ even during significant traffic demand, with only a brief transition to $o_{STD}$ during the late morning. Although the balanced policy is not highly responsive to the midday peaks (0.5–0.7) in traffic demand, it maintains devices in $o_{STD}$ during the daily period (07:00–21:00) when the city center is expected to be persistently affected by vehicular traffic flows. Devices are switched to $o_{ECO}$ during the early morning hours (normalized traffic demand < 0.1) and around 22:00. The responsive policy, instead, promptly escalates to $o_{HPM}$ to cover the broad mid-day traffic plateau and a spike around 16:00.
The March scenario depicted in Figure 9 confirms the characteristic behavior of the three policies. The lazy policy maintains devices in $o_{ECO}$, activating the $o_{STD}$ condition only during the peaks of demand (around 11:00 and 17:00). The balanced policy maintains its cost-oriented objective. Finally, the responsive policy confirms its high sensitivity.
The cross-seasonal analysis confirms that the DQN agent has successfully synthesized three distinct optimization policies. The policy convergence observed across the four representative months suggests a high degree of robustness: regardless of the seasonal baseline, the lazy policy consistently acts as a lower bound for energy expenditure, while the responsive policy serves as a high-fidelity upper bound, with a direct benefit in terms of social sustainability.
The balanced policy exhibits an intermediate and stabilizing behavior. It effectively filters out minor traffic fluctuations to maintain devices in a steady $o_{STD}$ monitoring state, proving to be a compromise profile between social, economic, and environmental sustainability, without incurring the economic penalties of the responsive mode. However, the balanced policy is less effective than the responsive policy in addressing environmental sustainability.
The economic sustainability of the examined CPS is evaluated through a $TCO$ analysis over a 10-year operational horizon. The initial $CAPEX$ is established as the baseline, and comprises the acquisition of the equipment shown in Table 4, alongside a centralized gateway, software orchestration platforms, and professional installation costs. The growth in $TCO$ is driven by the $OPEX$, which is primarily conditioned by energy consumption costs, assuming a baseline rate of 0.20/kWh, and by system maintenance costs.
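The energy share of the $OPEX$ can be approximated with simple arithmetic. The sketch below computes the 10-year energy cost of one device held constantly in a given mode, using the power profiles and the 0.20/kWh rate stated in the text; it deliberately ignores maintenance costs, and the function names and the $CAPEX$ figure are hypothetical placeholders rather than values from the study.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 h, leap days ignored

def energy_opex(power_w, years, rate_per_kwh=0.20):
    """Energy cost of a device drawing `power_w` watts continuously."""
    energy_kwh = power_w / 1000.0 * HOURS_PER_YEAR * years
    return energy_kwh * rate_per_kwh

def opex_to_capex_ratio(opex, capex):
    """OPEX expressed as a fraction of the initial CAPEX."""
    if capex <= 0:
        raise ValueError("capex must be positive")
    return opex / capex

# Bounds for a single device over the 10-year horizon.
eco_cost = energy_opex(2.0, years=10)   # always in the low-power mode
hpm_cost = energy_opex(7.0, years=10)   # always in the high-performance mode

# Hypothetical CAPEX, used only to show how the ratio is formed.
example_ratio = opex_to_capex_ratio(hpm_cost, capex=5000.0)
```

A learned policy mixes the modes over time, so its energy cost falls between these two bounds; the full $TCO$ additionally includes maintenance.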
The results of Figure 10 allow an investigation of the policy sensitivity: the responsive policy exhibits the highest growth, reaching an OPEX-to-CAPEX ratio of approximately 5.6% after 10 years. The DQN agent consistently prioritizes the $o_{HPM}$ mode to ensure near-zero latency and high information fidelity during peak traffic demand, which inherently maximizes the power draw across the NVIDIA Jetson Nano rails.
The lazy policy achieves the highest economic sustainability, limiting the 10-year $OPEX$ increase to less than 4%. All policies show a subtle transition to a linear trend after the initial five-year phase, reflecting their stabilization and leading to a predictable marginal cost per year.
Finally, the balanced policy shows a mid-range trade-off. The projections are reported with 95% confidence intervals, supporting the reliability of the economic estimates.
The training process of the DQN agent has been monitored through three key metrics, as illustrated in Figure 11. The Epsilon curve represents the exploration-exploitation trade-off. It follows a planned decay from 1.0 to a minimum threshold of 0.1 around episode 700. This transition ensures that the agent sufficiently explores the state-action space in the early stages before shifting toward the exploitation of the learned optimal policy in the final phases of training. The high stability of the reward and loss trends in the last 200 episodes indicates that the agent has reached a reliable and converged operational state. The Mean Squared Bellman Error (MSBE) represents a fundamental metric [49] for evaluating approximation accuracy, formulating learning objectives, and investigating the theoretical properties of RL algorithms. The $MSBE$ trend shows a consistent downward trajectory, starting from approximately 0.55 and approaching zero as the training progresses. This reduction indicates that the model has accurately captured the underlying dynamics of the context. The Cumulative Reward trend exhibits a strong upward slope, stabilizing at a high plateau (approximately 0.9) after 800 episodes. This trend confirms that the agent is successfully learning a control policy that maximizes the objective function, balancing energy efficiency and service quality according to the defined reward structure.
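The exploration schedule described above can be sketched as an epsilon-greedy rule with a decaying exploration rate. The text reports only the start value, the minimum, and the episode at which the minimum is reached; the linear shape of the decay and the function names below are assumptions.

```python
import random

def epsilon_at(episode, eps_start=1.0, eps_min=0.1, decay_episodes=700):
    """Exploration rate at a given episode, assuming a linear decay."""
    if episode >= decay_episodes:
        return eps_min
    fraction = episode / decay_episodes
    return eps_start + fraction * (eps_min - eps_start)

def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a list of Q-values."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With three operational conditions, `q_values` would hold one estimate per mode; early episodes mostly pick modes at random, while after episode 700 the agent follows its estimates in roughly nine decisions out of ten.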
Figure 12 shows a radar chart of the multi-objective optimization policies resulting from our analysis, mapped on a set of UN SDG 11 targets. This map is obtained considering the set of heterogeneous metrics of the dynamic urban context $C(t)$ defined in Equation (24).
The responsive policy demonstrates a clear prioritization of Target 11.2 (Sustainable Transport) and Target 11.5 (Disaster Resilience and Safety). By proactively escalating to the high-performance state $o_{HPM}$ during traffic peaks, the agent ensures the information fidelity required to mitigate congestion and reduce emergency response times. This “citizen-centric” behavior acknowledges that a localized increase in energy consumption $E(s,t)$ is a strategic investment to achieve systemic environmental benefits and public safety, directly supporting Target 11.6 (reducing the environmental impact of cities).
Conversely, the lazy policy aligns primarily with Target 11.b (Resource Efficiency). By maintaining the infrastructure in a low-power $o_{ECO}$ state for the majority of the operational cycle, it minimizes the carbon footprint $CF(s,t)$ and the operational expenditure $OPEX$. While this approach maximizes the physical and financial longevity of the CPS, it results in lower performance regarding real-time mobility management.
Finally, the balanced policy identifies an equilibrium point for Target 11.7 (Inclusive and Reliable Monitoring). By acting as a stable baseline in $o_{STD}$, it provides continuous and reliable data flows without the energy surges of the responsive mode or the data scarcity of the lazy profile.
The analysis confirms that the DQN-based framework does not merely optimize a technical trade-off but offers an optimal set of strategies. This allows municipal decision-makers to dynamically tune the infrastructure behavior to meet specific sustainability priorities, from strict resource conservation to high-fidelity urban resilience.

6. Conclusions and Future Directions

The proposed multi-dimensional decision-making framework successfully transitions urban management from static, siloed ICT deployments to an integrated, adaptive urban Cyber-Physical System. By leveraging Deep Reinforcement Learning, specifically Deep Q-Networks, the model effectively navigates the non-linear trade-offs inherent in the environmental, economic, and social pillars of sustainability. The empirical results from the LoRaWAN case study demonstrate that reward-shaping mechanisms allow for the precise modulation of node duty-cycles, balancing energy conservation with service availability. This approach mitigates the inefficiencies of traditional high-performance configurations while preserving the resilience required for modern urban infrastructures. Moreover, this methodology provides a scalable blueprint for achieving carbon-neutral, cost-effective, and socially equitable smart-intelligent cities. This research confirms that the intersection of autonomous AI agents and granular urban telemetry is essential for fulfilling the complex requirements of UN Sustainable Development Goal 11.
Future iterations of this work will explore the integration of decentralized federated learning to further enhance data privacy and system-wide scalability in heterogeneous urban environments. To address long-term challenges, aging- and reliability-aware mechanisms are envisioned to be integrated into the DQN controller. One feasible option is to introduce a cumulative stress counter as a state variable, so that the agent can learn to proactively mitigate hardware aging.
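The cumulative stress counter mentioned above could take a form along the following lines. This is a purely illustrative sketch of one possible design: the per-mode stress increments, the stress budget, and the function names are hypothetical and are not part of the presented framework.

```python
# Hypothetical stress accrued per hour spent in each mode (arbitrary units).
STRESS_PER_HOUR = {"ECO": 0.0, "STD": 1.0, "HPM": 3.0}

def update_stress(stress, mode):
    """Accumulate the stress contributed by one hour in the given mode."""
    return stress + STRESS_PER_HOUR[mode]

def augment_state(state, stress, stress_budget):
    """Append the normalized stress counter to the agent's state tuple."""
    normalized = min(stress / stress_budget, 1.0)
    return tuple(state) + (normalized,)

# Four hours of operation under a hypothetical mode sequence.
stress = 0.0
for mode in ["STD", "HPM", "HPM", "ECO"]:
    stress = update_stress(stress, mode)

augmented = augment_state((0.4, 0.7), stress, stress_budget=10.0)
```

Exposing the counter in the state would let a reward term penalize sustained high-performance operation, which is one way the agent could learn to trade responsiveness against hardware aging.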

Author Contributions

Conceptualization, M.G. and S.D.; methodology, M.G. and S.D.; software, M.G.; validation, M.G. and S.D.; formal analysis, M.G. and S.D.; investigation, M.G.; resources, M.G. and S.D.; data curation, M.G.; writing—original draft preparation, M.G. and S.D.; writing—review and editing, M.G. and S.D.; visualization, M.G.; supervision, S.D.; project administration, S.D.; funding acquisition, S.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been funded by the Italian Ministero delle Imprese e del Made in Italy (MIMIT) under the project SMART•E - piattaforma per l’IoT Maintenance, il Facility e l’Asset Management dell’industria 4.0, grant number FTE0000382 (CUP: B47H22004430008, COR: 22573728).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available. Data availability is restricted by a confidentiality agreement between the technology provider and the local public administration. Due to the integration of the testbed into the municipal infrastructure, the raw datasets contain sensitive administrative information that cannot be disclosed to ensure compliance with security protocols and institutional privacy. Requests to access the datasets should be directed to SmartMe.io.

Acknowledgments

The authors would like to express their gratitude to SmartMe.io for the technical support throughout the experimental phase of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI Artificial Intelligence
AIoT Artificial Intelligence of Things
CAPEX Capital Expenditure
CF Carbon Footprint
CM Classical Methods
CR Coding Rate
CPS Cyber-Physical System
DC Direct Current
DNN Deep Neural Networks
DRL Deep Reinforcement Learning
DQN Deep Q-Networks
DT Digital Twin
EU European Union
GW Gateway
GWP Global Warming Potential
ICT Information and Communication Technology
IEC International Electrotechnical Commission
IoT Internet of Things
ISO International Organization for Standardization
KPI Key Performance Indicator
LLM Large Language Models
LPWAN Low-Power Wide-Area Network
MARL Multi-Agent Reinforcement Learning
OPEX Operational Expenditure
PDR Packet Delivery Ratio
PSL Preliminary Setup Layer
QoS Quality of Service
RL Reinforcement Learning
SDG Sustainable Development Goal
SF Spreading Factor
SL Supervised Learning
SQuaRE Systems and Software Quality Requirements and Evaluation
TCO Total Cost of Ownership
UV Ultraviolet

References

  1. Alahi, M.E.E.; Sukkuea, A.; Tina, F.W.; Nag, A.; Kurdthongmee, W.; Suwannarat, K.; Mukhopadhyay, S.C. Integration of IoT-Enabled Technologies and Artificial Intelligence (AI) for Smart City Scenario: Recent Advancements and Future Trends. Sensors 2023, 23, 5206.
  2. Bellini, P.; Nesi, P.; Pantaleo, G. IoT-Enabled Smart Cities: A Review of Concepts, Frameworks and Key Technologies. Appl. Sci. 2022, 12, 1607.
  3. Terven, J. Deep Reinforcement Learning: A Chronological Overview and Methods. AI 2025, 6, 46.
  4. Plaat, A. Deep Reinforcement Learning; Springer Nature Singapore: Singapore, 2022; ISBN 978-981-19-0638-1.
  5. Ficili, I.; Giacobbe, M.; Tricomi, G.; Puliafito, A. From Sensors to Data Intelligence: Leveraging IoT, Cloud, and Edge Computing with AI. Sensors 2025, 25, 1763.
  6. Pochelu, P.; Cartiaux, H.; Schleich, J. What artificial intelligence can do for high-performance computing systems? Engineering Applications of Artificial Intelligence 2026, 164, 113248.
  7. De Vita, F.; Bruneo, D. Leveraging Stack4Things for Federated Learning in Intelligent Cyber Physical Systems. J. Sens. Actuator Netw. 2020, 9, 59.
  8. Beguni, C.; Căilean, A.-M.; Zadobrischi, E.; Avătămăniței, S.-A.; Lavric, A.; Stoian, F.-M. The Convergence of Artificial Intelligence and Public Policy in Shaping the Future of Ride-Hailing: A Review. Smart Cities 2026, 9, 40.
  9. Zghidi, N.; Trabelsi, R. Impact of Digitalization on Sustainable Development: A Comparative Analysis of Developed and Developing Economies. J. Risk Financial Manag. 2025, 18, 359.
  10. Bibri, S.E.; Huang, J. Artificial intelligence of things for sustainable smart city brain and digital twin systems: Pioneering environmental synergies between real-time management and predictive planning. Environ. Sci. Ecotechnol. 2025, 26, 100591.
  11. Rojek, I.; Mikołajewski, D.; Galas, K.; Piszcz, A. Advanced Deep Learning Algorithms for Energy Optimization of Smart Cities. Energies 2025, 18, 407.
  12. Wang, K.; Ke, Y. Social sustainability of communities: A systematic literature review. Sustainable Production and Consumption 2024, 47, 585–597.
  13. Capecchi, S.; Corduas, M.; Piccolo, D. Social Sustainability and Subjective Well-Being: A Study on Italian Inner Areas. Sustainability 2025, 17, 2078.
  14. Dresp-Langley, B.; Ekseth, O.K.; Fesl, J.; Gohshi, S.; Kurz, M.; Sehring, H.-W. Occam Razor for Big Data? On Detecting Quality in Large Unstructured Datasets. Appl. Sci. 2019, 9, 3065.
  15. Seng, K.P.; Ang, L.M.; Ngharamike, E. Artificial intelligence Internet of Things: A new paradigm of distributed sensor networks. Int. J. Distrib. Sens. Netw. 2022, 18, 15501477211062835.
  16. Mekki, K.; Bajic, E.; Chaxel, F.; Meyer, F. A comparative study of LPWAN technologies for large-scale IoT deployment. ICT Express 2019, 5, 1–7.
  17. Giacobbe, M.; Zanafi, S.; Zaia, A.; Puliafito, A. Empirical Validation of LoRaWAN Coverage Models for Smart Urban Connectivity. In Proceedings of the European Council for Modelling and Simulation, ECMS; 2025; pp. 493–499.
  18. LoRa Alliance, What is LoRaWAN? 2026. Available online: https://lora-alliance.org/about-lorawan/ (accessed on 13 April 2026).
  19. United Nations. Sustainable Development Goals. Department of Economic and Social Affairs. Available online: https://sdgs.un.org/goals (accessed on 30 March 2026).
  20. Kim, S.-K.; Chan, I.C. Novel Machine Learning-Based Smart City Pedestrian Road Crossing Alerts. Smart Cities 2025, 8, 114.
  21. Palley, B.; Poças Martins, J.; Bernardo, H.; Rossetti, R. Integrating Machine Learning and Digital Twins for Enhanced Smart Building Operation and Energy Management: A Systematic Review. Urban Sci. 2025, 9, 202.
  22. Fatorachian, H.; Kazemi, H.; Pawar, K. Digital Technologies in Food Supply Chain Waste Management: A Case Study on Sustainable Practices in Smart Cities. Sustainability 2025, 17, 1996.
  23. Fatorachian, H.; Kazemi, H.; Pawar, K. Enhancing Smart City Logistics Through IoT-Enabled Predictive Analytics: A Digital Twin and Cybernetic Feedback Approach. Smart Cities 2025, 8, 56.
  24. Sutton, R.S.; Barto, A.G. Reinforcement learning. Journal of Cognitive Neuroscience 1999, 11, 126–134.
  25. Liang, J.; Miao, H.; Li, K.; Tan, J.; Wang, X.; Luo, R.; Jiang, Y. A Review of Multi-Agent Reinforcement Learning Algorithms. Electronics 2025, 14, 820.
  26. Chen, M.; Sun, L.; Li, T.; Sun, H.; Zhou, Y.; Zhu, C.; Wang, H.; Pan, J.Z.; Zhang, W.; Chen, H.; et al. ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning. arXiv 2025, arXiv:2503.19470.
  27. Oh, J.; Farquhar, G.; Kemaev, I.; et al. Discovering state-of-the-art reinforcement learning algorithms. Nature 2025, 648, 312–319.
  28. Yu, Q.; Zhang, Z.; Zhu, R.; Yuan, Y.; Zuo, X.; Yue, Y.; Wang, M. Dapo: An open-source llm reinforcement learning system at scale. arXiv 2025, arXiv:2503.14476.
  29. Hubert, T.; Mehta, R.; Sartran, L.; et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature 2026, 651, 607–613.
  30. Chaudhari, S.; Aggarwal, P.; Murahari, V.; Rajpurohit, T.; Kalyan, A.; Narasimhan, K.; Deshpande, A.; da Silva, B.C. RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs. ACM Comput. Surv. 2025, 58, 53.
  31. Hady, M.A.; Hu, S.; Pratama, M.; et al. Multi-agent reinforcement learning for resources allocation optimization: a survey. Artif. Intell. Rev. 2025, 58, 354.
  32. De Vita, F.; Nardini, G.; Virdis, A.; Bruneo, D.; Puliafito, A.; Stea, G. Using Deep Reinforcement Learning for Application Relocation in Multi-Access Edge Computing. IEEE Commun. Stand. Mag. 2019, 3, 71–78.
  33. Liu, R.; Nageotte, F.; Zanne, P.; de Mathelin, M.; Dresp-Langley, B. Deep Reinforcement Learning for the Control of Robotic Manipulation: A Focussed Mini-Review. Robotics 2021, 10, 22.
  34. Kim, Y.; Jung, B.C.; Song, Y. Online Learning for Joint Energy Harvesting and Information Decoding Optimization in IoT-Enabled Smart City. IEEE Internet Things J. 2023, 10, 10675–10686.
  35. Zhang, B.; Liu, C.H.; Tang, J.; Xu, Z.; Ma, J.; Wang, W. Learning-Based Energy-Efficient Data Collection by Unmanned Vehicles in Smart Cities. IEEE Trans. Ind. Inform. 2018, 14, 1666–1676.
  36. Park, J.; Baek, J.; Song, Y. Optimizing smart city planning: A deep reinforcement learning framework. ICT Express 2025, 11, 129–134.
  37. Remmache, M.I.; Boudouh, S.S.; Bendouma, T.; Abdelhafidi, Z. Balancing Energy and Latency in Multi-Edge MEC Systems Using DQN-Based Task Offloading. In Proceedings of the 2025 7th International Conference on Pattern Analysis and Intelligent Systems (PAIS), Laghouat, Algeria, 2025; pp. 1–5.
  38. Minskere, L.; Kalnina, D.; Salkovska, J.; Batraga, A. Urban Communication in Smart Cities: Stakeholder Participation Motivators. Smart Cities 2026, 9, 58.
  39. Das, D.K.; Aiyetan, A.O.; Mostafa, M.M.H. Advancing Smart Cities in Africa: Barriers, Potentials, and Strategic Pathways for Sustainable Urban Transformation. Smart Cities 2026, 9, 38.
  40. Shafiullah, M.; Katranji, A.R.; Hassan, M.; Rahman, M.M.; Shezan, S.A. Advanced Multivariate Deep Learning Methodology for Forecasting Wind Speed and Solar Irradiation. Smart Cities 2026, 9, 59.
  41. Bisegna, F.; Vespasiano, F.; Pompei, L.; Burattini, C.; Belli, E.; Bellucci, A.M.; Di Vittorio, F.; Blaso, L. Towards the Decarbonization of Urban Communities: Evaluation of Smart and Green Strategies to Reduce Gas Carbon Emissions. Smart Cities 2026, 9, 26.
  42. Santos, O.; Ribeiro, F.; Metrôlho, J.; Dionísio, R. Using Smart Traffic Lights to Reduce CO2 Emissions and Improve Traffic Flow at Intersections: Simulation of an Intersection in a Small Portuguese City. Appl. Syst. Innov. 2024, 7, 3.
  43. Abu-Rayash, A.; Dincer, I. Development of an integrated model for environmentally and economically sustainable and smart cities. Sustain. Energy Technol. Assess. 2025, 73, 104096.
  44. Eurostat, Smart cities - statistics on urban institutions. Statistics Explained. 2023. Available online: https://ec.europa.eu/eurostat/statistics-explained/SEPDF/cache/10719.pdf (accessed on 24 March 2026).
  45. International Energy Agency (IEA). World Energy Outlook 2024; IEA Publications: Paris, France, 2024; Available online: https://www.iea.org/reports/world-energy-outlook-2024 (accessed on 24 March 2026).
  46. National Institute of Standards and Technology (NIST). Trustworthiness (System). Computer Security Resource Center (CSRC) Glossary. Available online: https://csrc.nist.gov/glossary/term/trustworthiness_system (accessed on 24 March 2026).
  47. ISO/IEC Standard No. 25002:2024; Systems and software engineering — Software product Quality Requirements and Evaluation (SQuaRE) — Managed quality model. ISO: Geneva, Switzerland, 2024.
  48. Bittencourt, J.C.N.; Jesus, T.C.; Peixoto, J.P.J.; Costa, D.G. The Road to Intelligent Cities. Smart Cities 2025, 8, 77.
  49. Patterson, A.; Liao, V.; White, M. Robust Losses for Learning Value Functions. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023, 45, 6157–6167.
  50. Zaitsev, D.A.; Shmeleva, T.R.; Ghaffari, P. Modeling Multidimensional Communication Lattices with Moore Neighborhood by Infinite Petri Nets. 2021 International Conference on Information and Digital Technologies (IDT), Zilina, Slovakia, 2021; pp. 171–181.
Figure 1. United Nations Sustainable Cities and Communities Goal (SDG11), and its Targets.
Figure 2. Conceptual Framework of Adaptive Urban Ecosystem composed of multiple AI-driven CPS.
Figure 3. Overall Architecture of the proposed multi-objective optimization framework, leading to context-aware optimized urban trade-offs.
Figure 4. DQN-based optimization schema, highlighting the cyber-to-physical and physical-to-cyber transitions in the smart urban ecosystem.
Figure 5. Spatial 3 × 3 grid representation of the urban district forming the monitored urban center. The map is centered on the Moore-modeled district C(0,0), with radius $r = 1$.
Figure 6. DQN Policy Sensitivity Analysis for the C district in August. The plot illustrates the trade-offs resulting from the DQN-based optimization relative to normalized traffic demand (shaded area).
Figure 7. DQN Policy Sensitivity Analysis for the C district in October.
Figure 8. DQN Policy Sensitivity Analysis for the C district in January.
Figure 9. DQN Policy Sensitivity Analysis for the C district in March.
Figure 10. Projected 10-year TCO growth as a percentage of initial investment across different operational trade-offs. Results are reported with a 95% confidence interval.
Figure 11. DQN training metrics: MSBE loss, cumulative reward, and epsilon decay trends over 1000 episodes, showing model stability and convergence.
Figure 12. DQN-based trade-offs alignment with UN SDG 11 targets. The radar plot illustrates the alignment between the agent behavioral profiles and the global indicators for sustainable and resilient cities.
Table 1. Comprehensive Mapping of the proposed framework onto SDG 11 Targets and Implementation Strategies.

| Sustainability Pillar | SDG 11 Targets | Framework Contribution |
|---|---|---|
| Environmental ($P_e$) | 11.6 Environment | Minimizes the urban carbon footprint by mitigating traffic-related emissions. It dynamically regulates district resources to counter pollution peaks, ensuring a synergy between energy conservation and the overall environmental quality of the city area. |
| Economic ($P_c$) | 11.4 Heritage; 11.5 Resilience | Minimizes economic impact by deploying resilient monitoring networks for urban heritage and early-warning systems. It reduces financial losses and recovery costs from environmental disasters through proactive infrastructure management and efficient resource allocation. |
| Social ($P_s$) | 11.1 Housing; 11.2 Transport; 11.7 Public Spaces | Guarantees the necessary operational performance for smart mobility services, preventing service disruptions in urban transport infrastructures. It ensures that public transit and pedestrian spaces remain safe, accessible, and inclusive by maintaining reliable connectivity and real-time data flow. |
| Integrated ($P_e$, $P_c$, $P_s$) | 11.3 Urbanization | Orchestrates participatory settlement planning by adapting to dynamic “bio-social” demands through multi-objective optimization. |
| | 11.a Regional Planning | Supports strong links between urban and rural areas, overcoming siloed patterns. |
| | 11.b Integrated Policies | Internalizes multi-dimensional rewards to promote resource efficiency, climate change mitigation, and adaptive urban resilience. |
| | 11.c Sustainable Building | Provides municipalities with a dynamic instrument for “adaptive” city planning, focusing on sustainable infrastructure and resilient buildings. |
Table 2. Comparative Assessment of Classical Methods (CM), Supervised Learning (SL), and Reinforcement Learning (RL) in the reference context.

| Feature | CM (LP/Heuristics) | SL | RL |
|---|---|---|---|
| Data Heterogeneity | Rigid Pre-defined Rules | Historical Pattern Matching | Goal-Oriented Optimization |
| Portability | Lightweight but Inflexible | Memory Intensive | Asymmetrical Workload |
| Env. Interaction | Open-loop Execution | Passive Observation | Active Closed-loop Feedback |
| Sustainability | Fixed Thresholds | Pattern Replication | Autonomous Exploration |
| Trade-off Logic | Static Weighting | Perfectly Labeled Data | Dynamic Pareto Balancing |
| Scientific Ref. | Classical Theory | [7,20,21,22,23] | [24,25,26,27,28,29,30,31,32,33,36,40] |
Table 3. Spatial distribution and functional roles of the urban sensing infrastructure across the five neighboring districts and the centroidal gateway (GW).

| District/Pole | Coord. $(i, j)$ | Deployed Devices | Strategic Role |
|---|---|---|---|
| A | (0, 1) | 1 Env., 1 Traffic, 1 Parking | Integrated Urban Sensing |
| B | (1, 1) | 1 Env., 1 Traffic | Telemetry & Flow Analysis |
| C | (0, 0) | 1 Env., 1 Traffic, 1 Parking, 1 GW | Network Backbone & Sensing |
| D | (1, 0) | 1 Traffic, 2 Parking | Mobility & Occupancy Hub |
| E | (0, 1) | 1 Parking | Localized Spot Monitoring |
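The coordinates $(i, j)$ in Table 3 refer to the 3 × 3 grid of Figure 5, centered on district C(0,0) with Moore radius $r = 1$. As a minimal sketch of that neighborhood model, the following snippet enumerates the cells within Chebyshev distance $r$ of a center cell. The helper name `moore_neighborhood` is hypothetical and is not part of the original framework.

```python
from itertools import product


def moore_neighborhood(center, r=1):
    """Return the cells within Chebyshev distance r of `center`, excluding it."""
    ci, cj = center
    return [
        (ci + di, cj + dj)
        for di, dj in product(range(-r, r + 1), repeat=2)
        if (di, dj) != (0, 0)
    ]


# Neighbors of the central district C(0,0) for the radius used in the case study.
neighbors = moore_neighborhood((0, 0), r=1)
print(len(neighbors), neighbors)
```

For $r = 1$ the neighborhood contains $(2r + 1)^2 - 1 = 8$ cells, which together with the center form the 3 × 3 grid.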
Table 4. Technical specifications and functional classification of the urban IoT testbed.

| Typology | Hardware Platform | Num | Primary Function |
|---|---|---|---|
| Environmental (weather station) | Raspberry Pi 4 (12V supply) | 3 | Ambient telemetry |
| Parking (optical sensor) | Nvidia Jetson Nano (5V rail)¹ | 5 | Optical occupancy sensing |
| Traffic (optical sensor) | Nvidia Jetson Nano (5V rail)¹ | 4 | Vehicle flow analysis |

Network & Context Parameters

| Parameter | Value |
|---|---|
| Protocol | LoRaWAN Class C (SF7, 125 kHz, CR 4/5)² |
| Duty Cycle | 1% (EU 868 MHz) |

¹ Powered via DC–DC converter from a 24V source. ² ChirpStack, open source LoRaWAN Network Server.
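To illustrate what the radio parameters in Table 4 imply for transmission scheduling, the sketch below estimates the LoRa time on air with the standard Semtech SX127x formula and derives the minimum interval between uplinks that respects a 1% duty cycle. The 20-byte PHY payload, the 8-symbol preamble, and both function names are assumptions made for illustration; the source does not state the packet size used in the testbed.

```python
import math


def lora_time_on_air(payload_bytes, sf=7, bw_hz=125_000, cr=1,
                     preamble_symbols=8, explicit_header=True, crc=True):
    """Time on air in seconds; cr=1 corresponds to coding rate 4/5."""
    t_sym = (2 ** sf) / bw_hz
    # Low data rate optimization is mandated only for SF11/SF12 at 125 kHz.
    de = 1 if (sf >= 11 and bw_hz == 125_000) else 0
    ih = 0 if explicit_header else 1
    numerator = 8 * payload_bytes - 4 * sf + 28 + 16 * int(crc) - 20 * ih
    payload_symbols = 8 + max(
        math.ceil(numerator / (4 * (sf - 2 * de))) * (cr + 4), 0
    )
    return (preamble_symbols + 4.25 + payload_symbols) * t_sym


def min_period_for_duty_cycle(toa_s, duty_cycle=0.01):
    """Shortest interval between transmission starts allowed by the duty cycle."""
    return toa_s / duty_cycle


PAYLOAD_BYTES = 20  # assumed PHY payload size, not taken from the source
toa = lora_time_on_air(PAYLOAD_BYTES)  # SF7, 125 kHz, CR 4/5 as in Table 4
print(f"time on air: {toa * 1000:.3f} ms")
print(f"min uplink period at 1% duty cycle: {min_period_for_duty_cycle(toa):.2f} s")
```

Under these assumptions a single uplink occupies the channel for a few tens of milliseconds, so the duty-cycle limit translates into a minimum spacing of a few seconds between packets per device and sub-band.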
Table 5. Baseline Power and daily (24h) Energy Consumptions in standard operating mode, without DQN optimization.

| Typology | Num | $P_{device}$ (W) | $E_{device}$ (kWh) | Num · $E_{device}$ (kWh) |
|---|---|---|---|---|
| Environmental (weather station) | 3 | 2.364 | 0.0567 | 0.1702 |
| Parking (optical sensor) | 5 | 3.395 | 0.0815 | 0.4074 |
| Traffic (optical sensor) | 4 | 4.460 | 0.1070 | 0.4281 |
| LoRaWAN Gateway | 1 | 3.583 | 0.0860 | 0.0860 |
| Overall benchmark | 13 | - | 0.3312¹ | 1.0917 |

¹ It represents the benchmark for the district C (0,0) in our case study.
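The energy columns of Table 5 follow from the measured power through $E_{device} = P_{device} \cdot 24\,\mathrm{h} / 1000$. The short sketch below reproduces that arithmetic from the tabulated power values; the variable and function names are illustrative only.

```python
HOURS_PER_DAY = 24

# (typology, number of devices, power in W), copied from Table 5.
baseline = [
    ("Environmental", 3, 2.364),
    ("Parking", 5, 3.395),
    ("Traffic", 4, 4.460),
    ("LoRaWAN Gateway", 1, 3.583),
]


def daily_energy_kwh(power_w, hours=HOURS_PER_DAY):
    """Convert a constant power draw in W into daily energy in kWh."""
    return power_w * hours / 1000.0


per_device = {name: daily_energy_kwh(p) for name, _, p in baseline}
# District C hosts one device of each typology (Table 3), so its benchmark
# is the sum of the per-device energies.
district_c = sum(per_device.values())
# The fleet total weights each typology by its device count.
fleet_total = sum(n * daily_energy_kwh(p) for _, n, p in baseline)

for name, energy in per_device.items():
    print(f"{name}: {energy:.4f} kWh/day")
print(f"District C benchmark: {district_c:.4f} kWh/day")
print(f"Fleet total: {fleet_total:.4f} kWh/day")
```

The computed values agree with Table 5 up to the last reported digit; small differences in the fleet total arise because the table appears to sum per-row values that were already rounded.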
Table 6. Set of metrics considered for the examined scenario, with related measures and impacts on the dynamics of the city.

| Dataset | Metric | Measure | Impact |
|---|---|---|---|
| Weather | $c_1$ | Ambient Temperature (°C) | Hardware thermal stress and operational efficiency |
| | $c_2$ | Relative Humidity (%) | Signal attenuation and link propagation quality |
| | $c_3$ | Ultraviolet (UV) Index | Seasonal fluctuations and weather-induced variability |
| Traffic | $c_4$ | Flow Index | Real-time service urgency and prioritization |
| | $c_5$ | Total Vehicle Count | Demand intensity and infrastructure load |
| Parking | $c_6$ | Lot Occupancy Rate | Local urban activity density and utility demand |
| LoRaWAN | $c_7$ | Packet Delivery Ratio (PDR) | Network reliability and communication quality |
| | $c_8$ | Latency ($\tau$) | Transmission delay and real-time responsiveness |
| | $c_9$ | RSSI (dBm) | Signal strength and transmission power efficiency |
| | $c_{10}$ | SNR (dB) | Link robustness and noise interference levels |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.