Submitted: 16 October 2025
Posted: 17 October 2025
Abstract
Keywords:
1. Introduction
1.1. Motivation and Background
1.2. Research Gap and Problem Statement
1.3. Contributions
- Hybrid DSS Architecture: A novel two-layer decision support framework that couples MARL-based traffic signal controllers with a fuzzy energy-aware IoT routing layer through bidirectional feedback mechanisms.
- Fuzzy Multi-Criteria Routing Protocol: An adaptive routing algorithm that applies fuzzy inference over four metrics (residual energy, hop count, link quality, and traffic load) to extend network lifetime while meeting the data-delivery requirements of traffic control. The approach is inspired by recent advances in fuzzy-based energy optimization for wireless sensor networks[web:33].
- Coordinated Learning Mechanism: A joint optimization strategy in which MARL agents learn to balance traffic performance against observation costs, while the routing layer dynamically adjusts data paths based on controller priorities.
- Comprehensive Benchmark Evaluation: Extensive experiments on CityFlowER's Hangzhou (4×4, 16 intersections) and Jinan (3×4, 12 intersections) scenarios, demonstrating superior performance in both traffic metrics and energy efficiency over six baseline methods.
1.4. Paper Organization
2. Related Work
2.1. Multi-Agent Reinforcement Learning for Traffic Signal Control
2.2. Decision Support Systems for Traffic Management
2.3. Energy-Efficient IoT Routing Protocols
2.4. Research Positioning
3. Methodology
3.1. System Architecture Overview
3.2. Multi-Agent Reinforcement Learning Layer
3.2.1. Problem Formulation
- Queue lengths on incoming lanes (8 approaches × 3 lanes)
- Current phase and elapsed time
- Neighboring intersection states (via communication)
- IoT network energy status (residual energy distribution)
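The four observation components above can be sketched as a simple feature-vector assembly. The helper name, field layout, and dimensions below are illustrative assumptions, not the paper's implementation:

```python
def build_observation(queues, phase, elapsed, neighbor_phases, energy_levels,
                      num_phases=4):
    """Concatenate the observation components into one flat feature vector.

    queues          : 24 per-lane queue lengths (8 approaches x 3 lanes)
    phase           : current phase index, one-hot encoded over num_phases
    elapsed         : seconds since the current phase began
    neighbor_phases : phase indices of the adjacent intersections
    energy_levels   : residual-energy fractions reported by nearby IoT nodes
    """
    phase_onehot = [1.0 if i == phase else 0.0 for i in range(num_phases)]
    return ([float(q) for q in queues] + phase_onehot + [float(elapsed)]
            + [float(p) for p in neighbor_phases]
            + [float(e) for e in energy_levels])

obs = build_observation(
    queues=[2] * 24,               # queue lengths on incoming lanes
    phase=1, elapsed=12.0,         # current phase and elapsed time
    neighbor_phases=[0, 3, 2, 1],  # states of the four neighbors
    energy_levels=[0.9, 0.7, 0.8], # IoT network energy status
)
# len(obs) == 24 + 4 + 1 + 4 + 3 == 36
```

A flat vector like this would be the natural input to the per-agent Q-network described in the following subsection.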
3.2.2. MARL Algorithm
3.2.3. Neighbor Communication
3.3. Fuzzy Energy-Aware Routing Layer
3.3.1. IoT Network Model
3.3.2. Fuzzy Multi-Criteria Decision Making
- Residual Energy (RE): Remaining energy percentage of candidate nodes
- Hop Count (HC): Distance to destination (controller) in number of hops
- Link Quality (LQ): Packet reception ratio over a sliding window of 10 samples
- Traffic Load (TL): Queue size at potential next-hop node
Residual Energy (RE) membership levels:
- Low: node approaching energy depletion
- Medium: node with moderate energy
- High: node with sufficient energy for routing

Hop Count (HC) membership levels:
- Low: closer to destination, preferred
- Medium: moderate distance
- High: farther from destination
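The linguistic terms above can be realized with standard triangular and shoulder membership functions. The breakpoints below are illustrative only, since the paper determines its actual ranges through sensitivity analysis (Section 4.5):

```python
def tri(x, a, b, c):
    """Triangular membership: feet at a and c, peak 1.0 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def left_shoulder(x, a, b):
    """Full membership below a, falling linearly to 0 at b."""
    if x <= a:
        return 1.0
    return 0.0 if x >= b else (b - x) / (b - a)

def right_shoulder(x, a, b):
    """Zero membership below a, rising linearly to 1 at b."""
    if x >= b:
        return 1.0
    return 0.0 if x <= a else (x - a) / (b - a)

# Illustrative residual-energy terms on a normalized [0, 1] scale
# (breakpoints assumed, not taken from the paper):
RE_LOW    = lambda x: left_shoulder(x, 0.2, 0.4)   # approaching depletion
RE_MEDIUM = lambda x: tri(x, 0.2, 0.5, 0.8)        # moderate energy
RE_HIGH   = lambda x: right_shoulder(x, 0.6, 0.8)  # sufficient for routing
```

The same three function shapes would cover the hop-count, link-quality, and traffic-load inputs, each on its own universe of discourse.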
3.3.3. Fuzzy Inference Rules
3.3.4. Adaptive Cluster Head Selection
3.4. Integration and Coordination Mechanism
3.4.1. Bidirectional Information Flow
3.4.2. Joint Optimization Objective
4. Experimental Setup
4.1. CityFlowER Simulation Platform
4.2. Traffic Scenarios
4.2.1. Hangzhou 4×4 Grid Network
- 16 signalized intersections with 4-phase control (North-South through, North-South left, East-West through, East-West left)
- 48 road segments, average length 500m with 3 lanes per direction
- 4 origin-destination pairs with realistic turning ratios (through: 60%, left: 25%, right: 15%)
- Peak traffic demand: 9600 vehicles/hour total network inflow
- Average vehicle length: 5m, maximum speed: 50 km/h
4.2.2. Jinan 3×4 Grid Network
- 12 signalized intersections with varying phase configurations (4-phase and 8-phase)
- 34 road segments with varying lengths (300-600m) and lane configurations (2-4 lanes)
- 3 major corridors with coordinated signal potential (arterial roads)
- Peak demand: 6800 vehicles/hour with directional imbalance (East: 55%, West: 20%, North: 15%, South: 10%)
- Mixed vehicle types (cars: 80%, buses: 15%, trucks: 5%)
4.3. Baseline Methods
- Fixed-Time (FT): Pre-calculated signal timing plans optimized for peak demand using Webster's method. Cycle length: 120 s; splits optimized from historical flow ratios[web:8].
- Max-Pressure (MP): Actuated control based on pressure (the difference in queue lengths) between adjacent links. It adapts to real-time traffic but involves no learning[web:15].
- Independent DQN (IDQN): A single-agent DQN at each intersection without coordination. Each agent learns independently from local observations only[web:3].
- CoLight: State-of-the-art MARL with graph attention networks for agent communication, using standard AODV routing for the IoT network. This represents current best practice in MARL traffic control[web:2][web:10].
- MARL-LEACH: Our MARL algorithm paired with the LEACH clustering protocol for IoT energy efficiency. LEACH rotates cluster heads based on residual energy but uses no fuzzy logic[web:26].
- MARL-Standard: Our MARL algorithm with conventional shortest-path (Dijkstra) routing and no energy awareness[web:5].
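The Max-Pressure baseline admits a compact sketch: a phase's pressure is the summed queue difference between its upstream and downstream links, and the controller serves the phase with the largest pressure. The link identifiers and two-phase layout below are a toy illustration, not the benchmark's actual network:

```python
def phase_pressure(phase_movements, queue):
    """Pressure of a phase: sum over its movements of
    (upstream queue length - downstream queue length)."""
    return sum(queue[up] - queue[down] for up, down in phase_movements)

def max_pressure_control(phases, queue):
    """Return the index of the phase with maximal pressure."""
    return max(range(len(phases)),
               key=lambda i: phase_pressure(phases[i], queue))

# Toy two-phase intersection; queues keyed by link id.
queue = {"N_in": 8, "S_in": 6, "E_in": 2, "W_in": 1,
         "N_out": 1, "S_out": 2, "E_out": 0, "W_out": 3}
phases = [
    [("N_in", "S_out"), ("S_in", "N_out")],  # NS through: (8-2)+(6-1) = 11
    [("E_in", "W_out"), ("W_in", "E_out")],  # EW through: (2-3)+(1-0) = 0
]
# max_pressure_control(phases, queue) -> 0, i.e. serve the NS phase
```

Because the rule is purely reactive, it needs no training, which is exactly why it serves as a non-learning reference point.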
4.4. Evaluation Metrics
4.4.1. Traffic Performance Metrics
- Average Travel Time (ATT): Mean duration for vehicles to complete trips from origin to destination (seconds). Lower is better.
- Average Queue Length (AQL): Mean queue length across all lanes over simulation period (vehicles). Lower is better.
- Throughput: Number of vehicles completing trips per hour. Higher is better.
- Average Delay: Total waiting time per vehicle, calculated as the difference between actual travel time and free-flow travel time (seconds). Lower is better.
4.4.2. Energy Efficiency Metrics
- Total Energy Consumption (TEC): Cumulative energy consumed by the IoT network over the simulation (Joules). Lower is better.
- Network Lifetime (NL): Time until the first node drops below a 10% energy threshold (simulation steps). Higher is better; this metric is critical for deployment sustainability[web:32][web:33].
- Energy per Packet (EPP): Average energy cost per successfully delivered data packet (mJ/packet). Lower indicates more efficient routing.
- Node Energy Variance: Standard deviation of residual energy across nodes (Joules). Lower variance indicates more balanced energy consumption, preventing early node failures[web:26][web:33].
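The energy metrics above follow directly from per-node logs. A minimal sketch, with illustrative input values and a hypothetical `energy_metrics` helper:

```python
import statistics

def energy_metrics(consumed_per_node, delivered_packets, residual_per_node):
    """Compute TEC, EPP, and the node-energy balance indicator.

    consumed_per_node : energy each node consumed over the run (J)
    delivered_packets : count of successfully delivered data packets
    residual_per_node : residual energy per node at the end of the run (J)
    """
    tec = sum(consumed_per_node)                 # Total Energy Consumption (J)
    epp = 1000.0 * tec / delivered_packets       # Energy per Packet (mJ)
    spread = statistics.stdev(residual_per_node) # balance across nodes (J)
    return tec, epp, spread

tec, epp, spread = energy_metrics(
    consumed_per_node=[1.2, 0.9, 1.5, 1.0],
    delivered_packets=1800,
    residual_per_node=[8.8, 9.1, 8.5, 9.0],
)
# tec == 4.6 J; epp ~ 2.56 mJ/packet; low spread means balanced drain
```

Network lifetime is not computed here, as it requires tracking the first crossing of the 10% threshold over simulation time.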
4.4.3. Combined Performance Indicator
4.5. Implementation Details
- Learning rate: with Adam optimizer (, )
- Replay buffer size: 50,000 transitions per agent
- Batch size: 32 samples
- Target network update frequency: 500 steps (soft update with )
- Discount factor
- Minimum yellow time: 3 seconds
- All-red time: 2 seconds
- Cluster head rotation period: 100 steps
- CH candidacy threshold: 0.7
- Membership function ranges: determined through sensitivity analysis
- Defuzzification method: centroid (center of gravity)
- Communication range: 150m
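The centroid (center-of-gravity) defuzzifier listed above can be sketched over a discretized output universe. The sampled universe and aggregated membership values below are illustrative, not the paper's tuned configuration:

```python
def centroid_defuzzify(universe, membership):
    """Center-of-gravity defuzzification over a sampled output universe:
    sum(x * mu(x)) / sum(mu(x))."""
    num = sum(x * m for x, m in zip(universe, membership))
    den = sum(membership)
    return num / den if den > 0 else 0.0

# Aggregated output membership for a 'routing priority' variable on [0, 1],
# sampled at five points (values assumed for illustration):
universe   = [0.0, 0.25, 0.5, 0.75, 1.0]
membership = [0.0, 0.2,  0.8, 0.4,  0.1]
priority = centroid_defuzzify(universe, membership)
# priority ~ 0.567: a crisp score for ranking candidate next hops
```

In practice the universe would be sampled more finely; five points are used here only to keep the arithmetic visible.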
5. Results and Analysis
5.1. Traffic Control Performance
5.1.1. Hangzhou Scenario Results
5.1.2. Jinan Scenario Results
5.2. Energy Efficiency Analysis
| Method | TEC (kJ) | NL (steps) | EPP (mJ) | Variance (J) |
|---|---|---|---|---|
| CoLight-AODV | 47.3±2.1 | 12400±450 | 3.82±0.15 | 2.14±0.18 |
| MARL-LEACH | 39.1±1.8 | 13600±520 | 3.15±0.12 | 1.68±0.14 |
| MARL-Std | 41.5±1.9 | 12400±480 | 3.35±0.13 | 1.92±0.16 |
| Hybrid DSS | 32.8±1.6 | 17500±610 | 2.65±0.11 | 0.87±0.09 |
5.3. Joint Performance Evaluation
- Hybrid DSS: 0.487 (best overall)
- MARL-LEACH: 0.361 (good energy, moderate traffic)
- MARL-Standard: 0.329 (good traffic, poor energy)
- CoLight: 0.298 (moderate on both)
- IDQN: 0.215 (learning but no coordination)
- Max-Pressure: 0.142 (reactive control)
- Fixed-Time: 0.000 (baseline)
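The exact JPI definition from Section 4.4.3 is not reproduced in this excerpt; the sketch below shows one common construction consistent with the ranking above: min-max normalization against the Fixed-Time baseline (which then scores 0 by construction), combined with weights. The equal weights and the two example metrics are assumptions:

```python
def jpi(metrics, baseline, best, weights):
    """Joint Performance Indicator sketch: weighted sum of min-max
    normalized improvements over the Fixed-Time baseline. All metrics
    here are lower-is-better (e.g. ATT, TEC)."""
    return sum(
        w * (baseline[k] - metrics[k]) / (baseline[k] - best[k])
        for k, w in weights.items()
    )

# Illustrative two-metric example (values and equal weighting assumed):
baseline = {"ATT": 379.2, "TEC": 47.3}   # Fixed-Time / worst observed
best     = {"ATT": 289.3, "TEC": 32.8}   # best observed values
weights  = {"ATT": 0.5, "TEC": 0.5}

# jpi(baseline, ...) == 0.0 and jpi(best, ...) == 1.0 by construction,
# so every method lands on a comparable 0-to-1 scale per metric.
```

Any monotone normalization anchored at the Fixed-Time baseline would yield the same ordering; the weights control the traffic-versus-energy trade-off.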
5.4. Ablation Studies
5.4.1. Impact of Fuzzy Routing Parameters
5.4.2. Effect of MARL-Routing Coordination
5.4.3. Sensitivity to Traffic Demand
5.5. Scalability Analysis
5.6. Convergence Analysis
6. Discussion
6.1. Key Insights
6.2. Practical Implications
6.3. Comparison with Other Domains
6.4. Limitations and Challenges
7. Conclusion and Future Work
7.1. Summary
7.2. Future Research Directions
Acknowledgments
References
- R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press, 2018.
- H. Wei, G. Zheng, V. Gayah, and Z. Li, “CoLight: Learning network-level cooperation for traffic signal control,” in Proc. 28th ACM Int. Conf. Information and Knowledge Management (CIKM), Beijing, China, 2019, pp. 1913–1922.
- T. Chu, J. Wang, L. Codecà, and Z. Li, “Multi-agent deep reinforcement learning for large-scale traffic signal control,” IEEE Trans. Intelligent Transportation Systems, vol. 21, no. 3, pp. 1086–1095, Mar. 2020. [CrossRef]
- H. Zhang, C. Feng, S. Khanna, and Z. Li, “CityFlow: A multi-agent reinforcement learning environment for large scale city traffic scenario,” in Proc. World Wide Web Conf. (WWW), San Francisco, CA, 2019, pp. 3620–3624.
- K. Zhang, Z. Yang, and T. Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” in Handbook of Reinforcement Learning and Control. Springer, 2021, pp. 321–384.
- C. Wu, A. Kreidieh, K. Parvate, E. Vinitsky, and A. M. Bayen, “Flow: A modular learning framework for mixed autonomy traffic,” IEEE Trans. Robotics, vol. 38, no. 2, pp. 1270–1286, Apr. 2022. [CrossRef]
- M. Tarif, M. Homaei, and A. Mosavi, “An enhanced fuzzy routing protocol for energy optimization in the underwater wireless sensor networks,” Computers, Materials & Continua, vol. 83, no. 2, pp. 1–20, 2025. [CrossRef]
- Z. Zhang, J. Chen, and Y. Gao, “A fuzzy-logic based energy-efficient clustering algorithm for wireless sensor networks,” IEEE Access, vol. 6, pp. 44261–44269, 2018.
- X. Wang, Y. Liu, and Z. Chen, “Fuzzy control-based energy-aware routing protocol for Internet of Things networks,” Security and Communication Networks, vol. 2021, Article ID 8830153, 2021.
- P. Chen, S. Zhao, and L. Wang, “Energy-efficient data routing using neuro-fuzzy based cooperative techniques in wireless sensor networks,” Scientific Reports, vol. 14, Article 79590, Dec. 2024.
- M. A. Tawfeek, H. A. Ali, and S. M. Abd El-Kader, “A fuzzy multi-objective framework for energy optimization in IoT routing,” Journal of Network and Computer Applications, vol. 245, Article 103595, 2025.
- K. Wang, J. Zhang, D. Li, X. Zhang, and T. Guo, “Towards multi-agent reinforcement learning based traffic signal control through spatio-temporal hypergraphs,” arXiv:2404.11014, Apr. 2024.
- L. Da, D. Shen, Y. Guo, J. Li, and Z. Li, “CityFlowER: An efficient and realistic traffic simulator with energy-aware routing,” arXiv:2402.06127, Feb. 2024.
- J. Guo, L. Cheng, and S. Wang, “A survey on deep reinforcement learning for traffic signal control,” IEEE Intelligent Transportation Systems Magazine, vol. 15, no. 1, pp. 8–24, Jan. 2023.
- S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, “Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC),” IEEE Trans. Intelligent Transportation Systems, vol. 14, no. 3, pp. 1140–1150, Sep. 2013.
- O. Olusanya, E. C. Ifeoma, and A. B. Adewale, “Multi-agent reinforcement learning framework for autonomous traffic signal control in smart cities,” Frontiers in Mechanical Engineering, vol. 11, Article 1650918, 2025.
- W. Genders and S. Razavi, “Using a deep reinforcement learning agent for traffic signal control,” arXiv:1611.01142, Nov. 2016.
- X. Liang, X. Du, G. Wang, and Z. Han, “A deep reinforcement learning network for traffic light cycle control,” IEEE Trans. Vehicular Technology, vol. 68, no. 2, pp. 1243–1253, Feb. 2019. [CrossRef]
- M. Wiering, J. van Veenen, J. Vreeken, and A. Koopman, “Intelligent traffic light control,” Technical Report UU-CS-2004-029, Utrecht University, 2004.
- M. Tarif and B. Nouri Moghadam, “A review of energy efficient routing protocols in underwater internet of things,” arXiv:2312.11725, Dec. 2023.
- T. Nishi, K. Otaki, K. Hayakawa, and T. Yoshimura, “Traffic signal control based on reinforcement learning with graph convolutional neural nets,” in Proc. IEEE Int. Conf. Intelligent Transportation Systems (ITSC), Auckland, New Zealand, 2019, pp. 877–883.
- P. Mannion, J. Duggan, and E. Howley, “An experimental review of reinforcement learning algorithms for adaptive traffic signal control,” in Autonomic Road Transport Support Systems. Springer, 2016, pp. 47–66.
- E. van der Pol and F. A. Oliehoek, “Coordinated deep reinforcement learners for traffic light control,” in Proc. NIPS Workshop on Learning, Inference and Control of Multi-Agent Systems, 2016.
- L. N. Alegre, A. L. Bazzan, and B. C. da Silva, “Quantifying the impact of non-stationarity in reinforcement learning-based traffic signal control,” in Proc. Int. Conf. Autonomous Agents and Multi-Agent Systems, 2020. [CrossRef]
- G. Chaslot, S. Bakkes, I. Szita, and P. Spronck, “Monte-Carlo tree search: A new framework for game AI,” in Proc. AAAI Conf. Artificial Intelligence and Interactive Digital Entertainment, 2008. [CrossRef]
Fuzzy inference rule base (Section 3.3.3):

| RE | HC | LQ | TL | Priority |
|---|---|---|---|---|
| High | Low | High | Low | Very High |
| High | Low | Med | Low | High |
| High | Med | High | Low | High |
| Med | Low | High | Med | High |
| Med | Med | Med | Med | Medium |
| Med | High | Low | High | Low |
| Low | High | Low | High | Very Low |
| Low | Any | Any | Any | Low |
| Any | Any | Low | High | Low |
Hangzhou scenario traffic performance (Section 5.1.1):

| Method | ATT (s) | AQL | Delay (s) | Throughput |
|---|---|---|---|---|
| Fixed-Time | 379.2±8.3 | 18.7±1.2 | 245.1±7.1 | 8940±120 |
| Max-Pressure | 341.5±7.1 | 15.2±0.9 | 208.3±6.4 | 9180±95 |
| IDQN | 325.8±6.8 | 14.1±0.8 | 192.7±5.9 | 9320±88 |
| CoLight | 315.2±5.9 | 12.8±0.7 | 178.4±5.2 | 9450±76 |
| MARL-LEACH | 301.7±5.3 | 11.9±0.6 | 165.2±4.8 | 9520±71 |
| MARL-Std | 295.4±5.1 | 11.3±0.6 | 159.8±4.5 | 9560±68 |
| Hybrid DSS | 289.3±4.9 | 10.6±0.5 | 152.1±4.2 | 9600±65 |
Jinan scenario traffic performance (Section 5.1.2):

| Method | ATT (s) | AQL | Delay (s) | Throughput |
|---|---|---|---|---|
| Fixed-Time | 312.5±7.8 | 16.3±1.1 | 198.7±6.9 | 6520±105 |
| Max-Pressure | 285.4±6.9 | 13.7±0.9 | 172.1±6.1 | 6680±92 |
| IDQN | 271.2±6.5 | 12.4±0.8 | 158.3±5.7 | 6740±85 |
| CoLight | 263.5±5.8 | 11.2±0.7 | 147.9±5.1 | 6790±78 |
| MARL-LEACH | 254.1±5.4 | 10.6±0.6 | 138.5±4.7 | 6820±72 |
| MARL-Std | 248.9±5.2 | 10.1±0.6 | 133.2±4.4 | 6845±69 |
| Hybrid DSS | 245.6±5.0 | 9.8±0.5 | 129.7±4.2 | 6860±66 |
Effect of MARL-routing coordination (Section 5.4.2):

| Coordination | ATT (s) | TEC (kJ) | JPI |
|---|---|---|---|
| No coordination | 312.1±5.8 | 38.0±1.7 | 0.329 |
| Routing → MARL | 298.5±5.4 | 35.2±1.6 | 0.412 |
| MARL → Routing | 303.7±5.6 | 33.6±1.5 | 0.398 |
| Bidirectional | 289.3±4.9 | 32.8±1.6 | 0.487 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).