Submitted: 02 February 2026
Posted: 05 February 2026
Abstract
Keywords:
1. Introduction

- Multi-Agent Risk Prediction Network (MARP-Net): This module is responsible for comprehensively assessing potential safety risks in complex driving scenes. It leverages a Graph Neural Network (GNN) architecture to model spatial and temporal interactions among all vehicles in the scene, combined with a Transformer-based Encoder to predict future multi-agent trajectories. Based on these predictions, MARP-Net computes various collision risk indicators, such as Time-to-Collision (TTC) and Time Headway (THW), outputting a detailed risk heatmap and identifying critical adversarial vehicles.
- Explainable Evasive Trajectory Generator (EETG-Module): This component takes the identified risks and critical vehicles from MARP-Net and performs two functions. First, a Constrained Optimization Solver generates safety-critical evasive trajectories that adhere to vehicle kinematics, traffic regulations, and collision avoidance constraints. Second, an LLM-based Explanation Generator, obtained by fine-tuning a model such as Llama-2-13B or Mistral-7B, produces natural language explanations for the proposed evasive actions (e.g., “Due to sudden lane change of front vehicle A, a left lane change is advised to avoid collision.”). This enhances the system’s transparency and trustworthiness.

We evaluate Proactive-Scene using a combination of large-scale real-world autonomous driving datasets, including nuScenes [14] and the Waymo Open Dataset [15], augmented with curated safety-critical “near-miss” scenarios generated in high-fidelity simulators such as CARLA [16] and SUMO [17]. The evaluation uses closed-loop testing across diverse and challenging driving situations. Proactive-Scene outperforms the baseline methods on metrics covering risk prediction accuracy, evasion performance, and explanation coherence. For instance, it achieves a Collision Rate of 6.72%, compared to 18.73% for a Reactive Rule-based Planner and 7.89% for a Standard RL-based Planner. It also behaves more proactively, with a Recall of Critical Events of 77.89% and an Evasion Timeliness of 1.28 s, alongside an Explanation Coherence score of 7.83.

- We propose Proactive-Scene, a novel comprehensive framework for proactive risk assessment and explainable evasive trajectory generation, transitioning autonomous driving systems from reactive to anticipatory safety.
- We develop a multi-modal perception and prediction pipeline, integrating Graph Neural Networks and Transformer architectures (MARP-Net) for robust multi-agent risk identification, coupled with a constrained optimization solver for kinematically feasible and safe evasive trajectory generation.
- We introduce an innovative LLM-based module for generating natural language explanations of proposed evasive maneuvers, significantly enhancing the interpretability and trustworthiness of safety-critical autonomous driving decisions.
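
The two collision risk indicators named in the MARP-Net description, Time-to-Collision (TTC) and Time Headway (THW), can be illustrated with a minimal sketch. It uses the common car-following definitions (gap divided by closing speed, and gap divided by follower speed) for a single lead/follower pair; the function names and the one-dimensional simplification are assumptions made for illustration and are not taken from the paper.

```python
import math


def time_to_collision(gap_m: float, ego_speed_mps: float, lead_speed_mps: float) -> float:
    """TTC in seconds: gap divided by closing speed (hypothetical helper).

    Returns infinity when the ego vehicle is not closing in on the lead vehicle.
    """
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0.0:
        return math.inf
    return gap_m / closing_speed


def time_headway(gap_m: float, ego_speed_mps: float) -> float:
    """THW in seconds: gap divided by the ego (follower) speed (hypothetical helper)."""
    if ego_speed_mps <= 0.0:
        return math.inf
    return gap_m / ego_speed_mps


if __name__ == "__main__":
    # Ego at 15 m/s, lead at 10 m/s, 20 m apart.
    print(time_to_collision(20.0, 15.0, 10.0))
    print(time_headway(20.0, 15.0))
```

In a multi-agent setting these quantities would be evaluated along predicted trajectories for every ego/agent pair rather than from a single instantaneous gap.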
2. Related Work
2.1. Multi-Agent Prediction and Proactive Motion Planning for Autonomous Driving
2.2. Explainable Artificial Intelligence and Large Language Models in Autonomous Driving
3. Method

3.1. Multi-Agent Risk Prediction Network (MARP-Net)
3.1.1. Multi-Agent Interaction Modeling with Graph Neural Networks
3.1.2. Future Trajectory Prediction with Transformer-based Encoder
3.1.3. Risk Assessment and Heatmap Generation
3.2. Explainable Evasive Trajectory Generator (EETG-Module)
3.2.1. Constrained Optimization Solver for Trajectory Generation
3.2.2. LLM-based Explanation Generator
3.3. Overall Training and Inference Process
5. Experiments
5.1. Experimental Setup
5.1.1. Datasets
5.1.2. Baselines
- Reactive Rule-based Planner: A conventional planner that relies on predefined rules and thresholds to react to immediate collision threats. It does not incorporate explicit multi-agent prediction or proactive risk assessment.
- Prediction-only System: This baseline integrates a multi-agent trajectory prediction module (similar to MARP-Net’s prediction component but without the explicit risk heatmap generation) with a basic reactive planner. It predicts future trajectories but does not actively seek to avoid predicted risks until they become imminent.
- Standard RL-based Planner: A state-of-the-art reinforcement learning-based planner that learns optimal driving policies through interaction with the environment. While often exhibiting robust performance, it typically lacks explicit interpretability for its decision-making process.
5.1.3. Evaluation Metrics
- Risk Prediction Metrics:
  - Recall of Critical Events (%): The percentage of actual critical safety events correctly identified by the system. Higher is better.
  - Precision of Critical Events (%): The percentage of identified critical events that are indeed actual critical events. Higher is better.
  - Prediction FDE (m): Final Displacement Error, measuring the Euclidean distance between predicted and ground-truth agent positions at the end of the prediction horizon. Lower is better.
- Evasion Performance Metrics:
  - Collision Rate (%): The percentage of evaluation scenarios resulting in a collision. Lower is better.
  - Evasion Timeliness (s): The average time before a potential collision at which an evasive maneuver is initiated. A higher value indicates more proactive behavior. Higher is better.
  - Jerk (m/s³): Average absolute jerk experienced by the ego vehicle during evasive maneuvers, indicating ride comfort. Lower is better.
- Explainability Metrics:
  - Explanation Coherence (score): A quantitative score (e.g., automated or expert-rated) reflecting the logical consistency and relevance of generated explanations to the evasive action. Higher is better.
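
Several of the metrics above follow directly from their definitions. The sketch below shows one way to compute recall and precision of critical events, FDE, and average absolute jerk. It assumes that critical events are matched by identifier and that jerk is approximated by finite differences of sampled acceleration; the paper does not specify either, and all function names are hypothetical.

```python
import math
from typing import Sequence, Set, Tuple

Point = Tuple[float, float]


def recall_precision(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Recall and precision (in %) of critical events, matched by event ID."""
    true_positives = len(predicted & actual)
    recall = 100.0 * true_positives / len(actual) if actual else 0.0
    precision = 100.0 * true_positives / len(predicted) if predicted else 0.0
    return recall, precision


def final_displacement_error(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Euclidean distance (m) between the last predicted and last true position."""
    (px, py), (tx, ty) = pred[-1], truth[-1]
    return math.hypot(px - tx, py - ty)


def mean_abs_jerk(accel_mps2: Sequence[float], dt_s: float) -> float:
    """Average absolute jerk (m/s^3) from acceleration samples spaced dt_s apart."""
    jerks = [abs(b - a) / dt_s for a, b in zip(accel_mps2, accel_mps2[1:])]
    return sum(jerks) / len(jerks)


if __name__ == "__main__":
    print(recall_precision({"e1", "e2", "e3"}, {"e1", "e2", "e4", "e5"}))
    print(final_displacement_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)]))
    print(mean_abs_jerk([0.0, 0.5, 0.0], dt_s=0.5))
```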
5.2. Overall Performance Comparison
5.3. Analysis of Proactive-Scene Components
5.3.1. Effectiveness of MARP-Net
5.3.2. Effectiveness of EETG-Module
5.4. Human Evaluation of Explanations
5.5. Ablation Studies
- Impact of GNN: Replacing the Graph Neural Network for multi-agent interaction modeling with a simpler Multi-Layer Perceptron significantly degrades both risk prediction (72.15% Recall, 75.30% Precision, 1.10m FDE) and evasion performance (9.48% Collision Rate). This highlights the GNN’s importance in accurately capturing complex spatial-temporal dependencies between agents, which is vital for robust trajectory prediction and subsequent risk identification.
- Impact of Transformer: Substituting the Transformer-based encoder for future trajectory prediction with a simpler LSTM model also leads to a noticeable drop in performance (74.55% Recall, 77.20% Precision, 1.01m FDE). This validates the Transformer’s superior ability to model long-range temporal dependencies and contextual information within agent historical states, crucial for precise trajectory forecasting.
- Impact of Constrained Optimization Solver: When the advanced Constrained Optimization Solver is replaced by a heuristic rule-based evasive planner, the Collision Rate drastically increases to 13.91%, and Evasion Timeliness drops to 0.72s, along with higher Jerk (1.63 m/s³). This underscores the necessity of a sophisticated planner capable of considering complex constraints and optimizing for safety, comfort, and proactive behavior simultaneously, rather than relying on reactive rules.
- Impact of Risk Heatmap: Even with accurate trajectory predictions, removing the aggregated risk heatmap and relying solely on direct pair-wise collision checks results in a higher Collision Rate (8.52%) and reduced Evasion Timeliness (1.15s). This demonstrates the value of projecting individual risks into a holistic, discretized heatmap, allowing the optimization solver to navigate not just around predicted agent positions but also around broader areas of potential conflict, thereby improving proactive avoidance.
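
To make the heatmap ablation concrete, the following sketch shows one possible way to project per-agent risks onto a discretized grid. The Gaussian spread, the max aggregation, the grid dimensions, and the function name `risk_heatmap` are all assumptions made for illustration; the paper's actual projection is not described in this section.

```python
import math
from typing import List, Sequence, Tuple

# (x in m, y in m, risk weight in [0, 1]) for one predicted agent position.
RiskPoint = Tuple[float, float, float]


def risk_heatmap(points: Sequence[RiskPoint],
                 size_m: float = 20.0,
                 cell_m: float = 1.0,
                 sigma_m: float = 2.0) -> List[List[float]]:
    """Aggregate per-agent risks into a square grid (hypothetical scheme).

    Each point spreads its risk with a Gaussian falloff, and each cell keeps
    the maximum contribution it receives.
    """
    n = int(size_m / cell_m)
    grid = [[0.0] * n for _ in range(n)]
    for row in range(n):
        for col in range(n):
            # Cell centre in metres.
            cx = (col + 0.5) * cell_m
            cy = (row + 0.5) * cell_m
            for x, y, risk in points:
                dist_sq = (cx - x) ** 2 + (cy - y) ** 2
                value = risk * math.exp(-dist_sq / (2.0 * sigma_m ** 2))
                grid[row][col] = max(grid[row][col], value)
    return grid


if __name__ == "__main__":
    heat = risk_heatmap([(5.5, 5.5, 0.8), (12.5, 8.5, 0.4)])
    print(heat[5][5], heat[8][12])
```

A planner that consumes such a grid can penalize cells near a predicted conflict, not only the exact predicted positions, which is the effect the ablation attributes to the heatmap.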
5.6. Performance Across Diverse Safety-Critical Scenarios
- Sudden Lane Change: This scenario, characterized by an abrupt cut-in from an adjacent lane, is highly challenging. Proactive-Scene achieves a significantly lower Collision Rate (7.45% vs. 9.12%) and better Evasion Timeliness (1.10s vs. 0.96s), attributed to MARP-Net’s capability to predict such aggressive maneuvers earlier and EETG-Module’s swift evasive planning.
- Emergency Braking: In situations where a lead vehicle unexpectedly brakes hard, early detection and a smooth, timely response are critical. Our framework exhibits a lower Collision Rate (5.98% vs. 7.05%) and higher Evasion Timeliness (1.36s vs. 1.14s), indicating its effectiveness in maintaining safe following distances and executing proactive braking or lane change maneuvers.
- Pedestrian Jaywalking: This involves unpredictable agent behavior. Proactive-Scene shows superior performance (8.02% Collision Rate vs. 10.15%, 1.05s Evasion Timeliness vs. 0.81s) by leveraging its robust GNN-based interaction modeling to better anticipate pedestrian intent, even when highly ambiguous, and generate safe trajectories.
- Intersection Conflict: Intersections pose complex multi-agent coordination problems. Our framework maintains a notable advantage (8.95% Collision Rate vs. 12.08%, 1.01s Evasion Timeliness vs. 0.77s), demonstrating its ability to reason about multiple interacting agents and their potential collision zones in intricate environments.
- Occlusion/Blind Spot: These scenarios test the predictive capabilities under partial observability. Despite inherent difficulties, Proactive-Scene still achieves a lower Collision Rate (8.48% vs. 11.21%) and higher Evasion Timeliness (1.09s vs. 0.86s), likely due to MARP-Net’s ability to infer likely behaviors from partial observations and HD map context.
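
The per-scenario gaps above are easier to compare as relative reductions. The short sketch below computes the relative Collision Rate reduction for each scenario from the paired values reported in the list. The choice of relative reduction as a summary statistic is ours and does not appear in the paper.

```python
# (Proactive-Scene collision rate %, comparison collision rate %) per scenario,
# copied from the scenario list above.
COLLISION_RATES = {
    "Sudden Lane Change": (7.45, 9.12),
    "Emergency Braking": (5.98, 7.05),
    "Pedestrian Jaywalking": (8.02, 10.15),
    "Intersection Conflict": (8.95, 12.08),
    "Occlusion/Blind Spot": (8.48, 11.21),
}


def relative_reduction(ours: float, baseline: float) -> float:
    """Relative reduction in percent: 100 * (baseline - ours) / baseline."""
    return 100.0 * (baseline - ours) / baseline


if __name__ == "__main__":
    for name, (ours, baseline) in COLLISION_RATES.items():
        print(f"{name}: {relative_reduction(ours, baseline):.1f}% fewer collisions")
```

Under this definition, the largest relative reduction is in Intersection Conflict (about 26%) and the smallest is in Emergency Braking (about 15%).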
5.7. Computational Performance Analysis
6. Conclusion
References
- Sun, H.; Xu, G.; Deng, J.; Cheng, J.; Zheng, C.; Zhou, H.; Peng, N.; Zhu, X.; Huang, M. On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark. In Findings of the Association for Computational Linguistics: ACL 2022; Association for Computational Linguistics, 2022; pp. 3906–3923.
- Xu, J.; Ju, D.; Li, M.; Boureau, Y.-L.; Weston, J.; Dinan, E. Bot-Adversarial Dialogue for Safe Conversational Agents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics, 2021; pp. 2950–2968.
- Huang, S. Reinforcement Learning with Reward Shaping for Last-Mile Delivery Dispatch Efficiency. European Journal of Business, Economics & Management 2025a, 1(4), 122–130.
- Huang, S. Prophet with Exogenous Variables for Procurement Demand Prediction under Market Volatility. Journal of Computer Technology and Applied Mathematics 2025b, 2(6), 15–20.
- Liu, W. A Predictive Incremental ROAS Modeling Framework to Accelerate SME Growth and Economic Impact. Journal of Economic Theory and Business Management 2025, 2, 25–30.
- Si, S.; Zhao, H.; Luo, K.; Chen, G.; Qi, F.; Zhang, M.; Chang, B.; Sun, M. A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks. 2025. Available online: https://arxiv.org/abs/2510.05608.
- Jian, P.; Yu, D.; Zhang, J. Large language models know what is key visual entity: An LLM-assisted multimodal retrieval for VQA. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024; pp. 10939–10956.
- Jian, P.; Yu, D.; Yang, W.; Ren, S.; Zhang, J. Teaching vision-language models to ask: Resolving ambiguity in visual questions. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025a; Volume 1, pp. 3619–3638.
- Jian, P.; Wu, J.; Sun, W.; Wang, C.; Ren, S.; Zhang, J. Look again, think slowly: Enhancing visual reflection in vision-language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025b; pp. 9262–9281.
- Si, S.; Ma, W.; Gao, H.; Wu, Y.; Lin, T.-E.; Dai, Y.; Li, H.; Yan, R.; Huang, F.; Li, Y. SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. Available online: https://openreview.net/forum?id=viktK3nO5b.
- Qi, L.; Wu, J.; Choi, J. M.; Phillips, C.; Sengupta, R.; Goldman, D. B. Over++: Generative Video Compositing for Layer Interaction Effects. arXiv 2025a, arXiv:2512.19661.
- Gong, B.; Qi, L.; Wu, J.; Fu, Z.; Song, C.; Jacobs, D. W.; Nicholson, J.; Sengupta, R. The Aging Multiverse: Generating Condition-Aware Facial Aging Tree via Training-Free Diffusion. arXiv 2025, arXiv:2506.21008.
- Qi, L.; Wu, J.; Gong, B.; Wang, A. N.; Jacobs, D. W.; Sengupta, R. Mytimemachine: Personalized facial age transformation. ACM Transactions on Graphics (TOG) 2025b, 44, 1–16.
- Chen, Y.; Liu, Y.; Chen, L.; Zhang, Y. DialogSum: A Real-Life Scenario Dialogue Summarization Dataset. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Association for Computational Linguistics, 2021; pp. 5062–5074.
- Vidgen, B.; Nguyen, D.; Margetts, H.; Rossini, P.; Tromble, R. Introducing CAD: the Contextual Abuse Dataset. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics, 2021; pp. 2289–2303.
- Herzig, J.; Müller, T.; Krichene, S.; Eisenschlos, J. Open Domain Question Answering over Tables via Dense Retrieval. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics, 2021; pp. 512–519.
- Aggarwal, S.; Mandowara, D.; Agrawal, V.; Khandelwal, D.; Singla, P.; Garg, D. Explanations for CommonsenseQA: New Dataset and Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing; Association for Computational Linguistics, 2021; Volume 1, pp. 3050–3065.
- Qian, C.; Liu, W.; Liu, H.; Chen, N.; Dang, Y.; Li, J.; Yang, C.; Chen, W.; Su, Y.; Cong, X.; Xu, J.; Li, D.; Liu, Z.; Sun, M. ChatDev: Communicative Agents for Software Development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics, 2024; pp. 15174–15186.
- Hao, S.; Gu, Y.; Ma, H.; Hong, J.; Wang, Z.; Wang, D.; Hu, Z. Reasoning with Language Model is Planning with World Model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, 2023; pp. 8154–8173.
- Seo, A.; Kang, G.-C.; Park, J.; Zhang, B.-T. Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing; Association for Computational Linguistics, 2021; Volume 1, pp. 6167–6177.
- Liu, Z.; Chen, N. Controllable Neural Dialogue Summarization with Personal Named Entity Planning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, 2021; pp. 92–106.
- He, J.; Kryscinski, W.; McCann, B.; Rajani, N.; Xiong, C. CTRLsum: Towards Generic Controllable Text Summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics, 2022; pp. 5879–5915.
- Fu, J.; Huang, X.; Liu, P. SpanNER: Named Entity Re-/Recognition as Span Prediction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing; Association for Computational Linguistics, 2021; Volume 1, pp. 7183–7195.
- Zhang, X.; Li, R.; Yu, J.; Xu, Y.; Li, W.; Zhang, J. Editguard: Versatile image watermarking for tamper localization and copyright protection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024; pp. 11964–11974.
- Zhang, X.; Tang, Z.; Xu, Z.; Li, R.; Xu, Y.; Chen, B.; Gao, F.; Zhang, J. Omniguard: Hybrid manipulation localization via augmented versatile deep image watermarking. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 3008–3018.
- Xu, Z.; Zhang, X.; Li, R.; Tang, Z.; Huang, Q.; Zhang, J. Fakeshield: Explainable image forgery detection and localization via multi-modal large language models. arXiv 2024, arXiv:2410.02761.
- Si, S.; Zhao, H.; Chen, G.; Li, Y.; Luo, K.; Lv, C.; An, K.; Qi, F.; Chang, B.; Sun, M. GATEAU: Selecting Influential Samples for Long Context Alignment. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V., Eds.; Association for Computational Linguistics: Suzhou, China, 2025b; pp. 7380–7411. https://doi.org/10.18653/v1/2025.emnlp-main.375. Available online: https://aclanthology.org/2025.emnlp-main.375/.
- Lampinen, A.; Dasgupta, I.; Chan, S.; Mathewson, K.; Tessler, M.; Creswell, A.; McClelland, J.; Wang, J.; Hill, F. Can language models learn from explanations in context? In Findings of the Association for Computational Linguistics: EMNLP 2022; Association for Computational Linguistics, 2022; pp. 537–563.
- Dai, L.; Xu, Y.; Ye, J.; Liu, H.; Xiong, H. Seper: Measure retrieval utility through the lens of semantic perplexity reduction. arXiv 2025, arXiv:2503.01478.


| Metric | Reactive Rule-based | Prediction-only | Standard RL-based | Proactive-Scene (Ours) |
|---|---|---|---|---|
| Rec. Crit. (%) ↑ | 5.21 | 68.42 | 75.18 | 77.89 |
| Prec. Crit. (%) ↑ | 8.55 | 72.10 | 78.33 | 80.05 |
| Pred. FDE (m) ↓ | 2.89 | 1.05 | 0.98 | 0.91 |
| Coll. Rate (%) ↓ | 18.73 | 12.56 | 7.89 | 6.72 |
| Ev. Time (s) ↑ | 0.25 | 0.88 | 1.12 | 1.28 |
| Jerk (m/s³) ↓ | 1.87 | 1.52 | 1.15 | 0.98 |
| Explanation Coherence (score) ↑ | N/A | N/A | 5.21 | 7.83 |
| Human Eval. Relevance (score) ↑ | N/A | N/A | 4.5 | 7.6 |

| Model Variation | Risk Prediction: Rec. Crit. (%) ↑ | Risk Prediction: Prec. Crit. (%) ↑ | Risk Prediction: Pred. FDE (m) ↓ | Evasion Performance: Coll. Rate (%) ↓ | Evasion Performance: Ev. Time (s) ↑ | Evasion Performance: Jerk (m/s³) ↓ |
|---|---|---|---|---|---|---|
| Full Proactive-Scene (Ours) | 77.89 | 80.05 | 0.91 | 6.72 | 1.28 | 0.98 |
| w/o GNN | 72.15 | 75.30 | 1.10 | 9.48 | 1.06 | 1.17 |
| w/o Transformer | 74.55 | 77.20 | 1.01 | 8.05 | 1.18 | 1.06 |
| w/o Optimization Solver | 77.89 | 80.05 | 0.91 | 13.91 | 0.72 | 1.63 |
| w/o Risk Heatmap | 77.89 | 80.05 | 0.91 | 8.52 | 1.15 | 1.02 |

| Scenario Type | Proactive-Scene (Ours): Coll. Rate (%) ↓ | Proactive-Scene (Ours): Ev. Time (s) ↑ | Proactive-Scene (Ours): Jerk (m/s³) ↓ | Standard RL-based Planner: Coll. Rate (%) ↓ | Standard RL-based Planner: Ev. Time (s) ↑ |
|---|---|---|---|---|---|
| Sudden Lane Change (Cut-in) | 7.45 | 1.10 | 1.10 | 9.12 | 0.96 |
| Emergency Braking (Lead Vehicle) | 5.98 | 1.36 | 0.94 | 7.05 | 1.14 |
| Pedestrian Jaywalking | 8.02 | 1.05 | 1.20 | 10.15 | 0.81 |
| Intersection Conflict | 8.95 | 1.01 | 1.31 | 12.08 | 0.77 |
| Occlusion/Blind Spot | 8.48 | 1.09 | 1.16 | 11.21 | 0.86 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).