Submitted:
24 March 2026
Posted:
26 March 2026
Abstract

Keywords:
1. Introduction
1.1. Motivation
1.2. Problem Statement
1.3. Contributions
- A unified formulation integrating constrained RL and robust RL under a temporally correlated adversary, modeled via a recurrent latent dynamics network that respects physical feasibility.
- A constrained min-max optimization framework with Lagrangian methods, enabling joint updates of policy parameters, Lagrange multipliers, and the adversary to enforce safety even under attack.
- A comprehensive empirical evaluation on autonomous vehicle lane-keeping and power grid voltage control, demonstrating superior worst-case reward and safety violation reduction compared to strong baselines, with minimal nominal performance degradation.
- Ablation studies isolating the contributions of the recurrent adversary and Lagrangian safety mechanism, providing insights into RADAR’s design choices.
1.4. Paper Organization
2. Preliminaries
2.1. Markov Decision Processes and Constrained MDPs
2.2. Adversarial Attacks in Sequential Settings
2.3. Robustness Criteria
3. Related Work
3.1. Adversarial Robustness in Deep Learning
3.2. Robust Reinforcement Learning
3.3. Safe Reinforcement Learning
3.4. Gap Analysis
4. Methodology: RADAR Framework
4.1. Problem Formulation
- S is the state space,
- A the action space,
- P(s'|s,a) the transition probability,
- r(s,a) the reward function,
- c(s,a) the safety cost function,
- γ ∈ [0,1) the discount factor,
- d_0 ∈ ℝ⁺ the safety cost threshold (maximum allowable expected cumulative cost).
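With the tuple above, one plausible way to write the constrained min-max problem is sketched below. Here J_c(π, ξ) is the expected discounted safety cost under policy π and adversary ξ; the reward objective J_r and the feasible adversary set Ξ are symbols assumed here for illustration, and the exact form of the constraint under attack may differ from this sketch.

```latex
\max_{\pi} \; \min_{\xi \in \Xi} \; J_r(\pi, \xi)
\quad \text{s.t.} \quad J_c(\pi, \xi) \le d_0,
\qquad
J_r(\pi, \xi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big],
\quad
J_c(\pi, \xi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\Big].
```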
4.2. Lagrangian Relaxation
- Policy update (for fixed ξ, λ): π ← argmax_{π'} L(π', ξ, λ)
- Adversary update (for fixed π, λ): ξ ← argmin_{ξ'} L(π, ξ', λ)
- Dual update: λ ← max(0, λ + η_λ (J_c(π, ξ) - d_0)), where η_λ is the dual learning rate.
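The dual step in the three updates above can be illustrated with a minimal Python sketch. It assumes the Lagrangian takes the common form L = J_r − λ (J_c − d_0), which is consistent with the signs of the updates but is not confirmed by the text; the function names, the cost estimates, and the step size 0.05 are hypothetical.

```python
def lagrangian(reward_return: float, cost_return: float, lam: float, d0: float) -> float:
    """Assumed Lagrangian: L = J_r - lam * (J_c - d0)."""
    return reward_return - lam * (cost_return - d0)


def dual_update(lam: float, cost_return: float, d0: float, eta_lam: float) -> float:
    """Projected dual ascent: lam <- max(0, lam + eta_lam * (J_c - d0))."""
    return max(0.0, lam + eta_lam * (cost_return - d0))


if __name__ == "__main__":
    lam = 0.0
    # Hypothetical cost estimates from successive rollouts (threshold d0 = 10).
    for cost in [14.0, 12.0, 9.0, 6.0]:
        lam = dual_update(lam, cost, d0=10.0, eta_lam=0.05)
        print(f"J_c={cost:5.1f}  lambda={lam:.3f}")
```

The multiplier grows while the estimated cost exceeds the threshold and shrinks toward zero once the constraint is satisfied; the projection max(0, ·) keeps it non-negative.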
4.3. Adversary Model
4.4. Policy and Adversary Optimization

4.5. Discussion of Theoretical Properties
| Parameter | Symbol | Lane-Keeping | Voltage Control |
|---|---|---|---|
| Discount factor | γ | 0.99 | 0.99 |
| GAE parameter | | 0.95 | 0.95 |
| PPO clip range | | 0.2 | 0.2 |
| PPO epochs per update | | 10 | 10 |
| Mini-batch size | | 64 | 64 |
| Policy learning rate | | | |
| Value network learning rate | | | |
| Adversary learning rate | | | |
| Dual learning rate | η_λ | | |
| Adversary inner steps | | 2 | 2 |
| Adversary safety weight | | 0.5 | 0.5 |
| Perturbation bound (L2) | | 0.1 | 0.05 |
| Safety threshold | d_0 | 10 | 50 |
| LSTM hidden size | | 128 | 128 |
| Max gradient norm | | 0.5 | 0.5 |
| Total timesteps | | 2 M | 2 M |
5. Experimental Setup
5.1. Benchmarks and Environments
5.1.1. Autonomous Vehicle Lane-Keeping
5.1.2. Power Grid Voltage Control
| Benchmark | State Dimension | Action Dimension | Safety Cost | Perturbation Type | Physical Bounds |
|---|---|---|---|---|---|
| Lane-Keeping | 3 | 1 | Lane departure (binary) | Steering offset | ±0.1 rad |
| Voltage Control | 18 | 6 | Voltage violation (binary per bus) | Voltage sensor offset | ±0.05 pu |
5.2. Baselines
5.2.1. State-Adversarial MDP (SA-MDP) with Robust Bellman Operator
5.2.2. Certified Robust RL via Randomized Smoothing (RS-RL)
5.2.3. Robust Safe RL (RSRL) with Constrained Adversarial Training
5.3. Evaluation Metrics
- Episodic violation rate: the fraction of episodes where the cumulative safety cost exceeds the threshold.
- Per-step violation rate: the average fraction of time steps within an episode where the safety cost is non-zero (i.e., a constraint is violated).
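The two violation metrics above can be computed from logged per-step costs as in the following sketch. It assumes the cumulative cost of an episode is the plain (undiscounted) sum of its per-step costs, which the definitions do not specify; the function names and the example data are hypothetical.

```python
from typing import Sequence


def episodic_violation_rate(episode_costs: Sequence[Sequence[float]], d0: float) -> float:
    """Fraction of episodes whose summed safety cost exceeds the threshold d0."""
    violating = sum(1 for costs in episode_costs if sum(costs) > d0)
    return violating / len(episode_costs)


def per_step_violation_rate(episode_costs: Sequence[Sequence[float]]) -> float:
    """Average, over episodes, of the fraction of steps with non-zero safety cost."""
    fractions = [
        sum(1 for c in costs if c != 0) / len(costs)
        for costs in episode_costs
    ]
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    # Three hypothetical episodes with binary per-step costs.
    logged = [[0, 1, 1, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
    print(episodic_violation_rate(logged, d0=1.5))
    print(per_step_violation_rate(logged))
```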
5.4. Implementation Details
- Policy and Value Networks: Both use two hidden layers with 256 units each and ReLU activations. The policy outputs the mean and log standard deviation of a Gaussian action distribution. All methods share the same architecture.
- Adversary Network (RADAR and Robust RL baseline): The adversary consists of an LSTM layer with 128 hidden units, followed by a feedforward layer that outputs the perturbation. The LSTM processes the history of states, actions, and previous perturbations. The output is scaled and clipped to enforce physical bounds (RADAR) or the norm bound (Robust RL baseline). The adversary is trained using policy gradient with the objective defined in Section 4.2.
- RL algorithm: PPO with GAE parameter 0.95, clip range 0.2, 10 epochs per update, mini-batch size 64.
- Discount factor γ: 0.99.
- Learning rates: Policy and adversary: (Adam); Lagrange multiplier : .
- Adversary steps: 2 (inner-loop updates per policy update).
- Perturbation bound: 0.1 (L2 norm). For RADAR, additional physical limits per environment are applied (see Table IV).
- Safety threshold d_0: 10 for lane-keeping, 50 for voltage control.
- Adversary safety weight: 0.5.
- Training steps: 2 million timesteps per run.
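The scaling and clipping of the adversary output described above can be sketched as a two-stage projection: first onto the L2 ball, then onto per-component physical limits. This is one possible reading, not the authors' implementation; the function name and the order of the two stages are assumptions, and the sketch relies on the physical interval containing zero so that clipping never increases the norm.

```python
import math
from typing import List, Sequence


def project_perturbation(
    raw: Sequence[float],
    eps: float,
    lower: Sequence[float],
    upper: Sequence[float],
) -> List[float]:
    """Scale `raw` into the L2 ball of radius eps, then clip to [lower, upper]."""
    norm = math.sqrt(sum(x * x for x in raw))
    # Shrink only when the raw output lies outside the ball.
    scale = min(1.0, eps / norm) if norm > 0.0 else 1.0
    scaled = [x * scale for x in raw]
    # Component-wise physical limits (assumed to satisfy lower <= 0 <= upper).
    return [min(max(x, lo), hi) for x, lo, hi in zip(scaled, lower, upper)]


if __name__ == "__main__":
    # Hypothetical 2-D adversary output with eps = 0.1 and limits of +/-0.05.
    print(project_perturbation([0.3, 0.4], 0.1, [-0.05, -0.05], [0.05, 0.05]))
```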
6. Results and Analysis
6.1. Robustness Against Time-Correlated Attacks

6.2. Safety Violation Reduction
| Method | Lane-Keeping Episodic Viol. (%) | Lane-Keeping Per-Step Viol. (%) | Voltage Control Episodic Viol. (%) | Voltage Control Per-Step Viol. (%) |
|---|---|---|---|---|
| SA-MDP [25] | 47 | 8.9 ± 1.2 | 58 | 10.1 ± 1.4 |
| RS-RL [19] | 62 | 11.4 ± 1.7 | 71 | 13.2 ± 1.9 |
| RSRL [21] | 33 | 5.8 ± 0.9 | 42 | 6.9 ± 1.1 |
| RADAR (Ours) | 24 | 4.2 ± 0.8 | 31 | 5.6 ± 0.9 |

6.3. Performance in Benign Environments
| Method | Lane-Keeping (Nominal Reward) | Degradation | Voltage Control (Nominal Reward) | Degradation |
|---|---|---|---|---|
| Standard RL | 245.3 ± 8.2 | | 189.6 ± 6.5 | |
| SA-MDP [25] | 233.1 ± 7.4 | −5.0% | 177.8 ± 6.1 | −6.2% |
| RS-RL [19] | 214.2 ± 9.2 | −12.7% | 161.4 ± 7.3 | −14.9% |
| RSRL [21] | 239.2 ± 6.8 | −2.5% | 183.5 ± 5.9 | −3.2% |
| RADAR (Ours) | 242.5 ± 7.3 | −1.1% | 185.7 ± 6.2 | −2.1% |
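The degradation columns are consistent with the relative change of each method's mean nominal reward with respect to Standard RL. The short sketch below reproduces that arithmetic from the table's mean values; the function name is hypothetical, and the formula is inferred from the reported numbers rather than stated in the text.

```python
def degradation_pct(method_reward: float, standard_reward: float) -> float:
    """Relative change (in percent) of a method's reward versus Standard RL."""
    return 100.0 * (method_reward - standard_reward) / standard_reward


if __name__ == "__main__":
    # Mean nominal rewards taken from the lane-keeping column of the table.
    standard = 245.3
    for name, reward in [("SA-MDP", 233.1), ("RS-RL", 214.2),
                         ("RSRL", 239.2), ("RADAR", 242.5)]:
        print(f"{name}: {degradation_pct(reward, standard):.1f}%")
```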
6.4. Ablation Studies
6.4.1. Impact of Recurrent Adversary vs. Per-Step Adversary
6.4.2. Sensitivity to Safety Cost Threshold
6.4.3. Effect of Lagrangian Multiplier Scheduling
6.4.4. Scalability and Real-World Feasibility
| Method | Training Time (h) | GPU Memory (GB) | Steps per Second |
|---|---|---|---|
| Standard RL (PPO) | 12.3 ± 0.5 | 2.1 | 162 |
| Standard Adversarial Training | 14.1 ± 0.6 | 2.3 | 141 |
| Robust RL (RARL) | 20.5 ± 0.8 | 3.2 | 97 |
| Safe RL (CPO) | 13.2 ± 0.5 | 2.2 | 151 |
| SA-MDP [25] | 15.8 ± 0.7 | 2.4 | 126 |
| RS-RL [19] | 18.9 ± 0.9 | 2.8 | 105 |
| RSRL [21] | 21.0 ± 0.9 | 3.1 | 95 |
| RADAR (Ours) | 22.4 ± 0.8 | 3.4 | 89 |
6.5. Qualitative Analysis
- Standard RL: The vehicle rapidly oscillates and departs the lane within 50 steps, eventually spinning out. The adversary’s perturbations cause the policy to over-correct repeatedly.
- Standard Adversarial Training: The vehicle stays within the lane for the first 100 steps but then experiences a large deviation after a sustained attack, leading to a lane departure. The per-step trained policy cannot handle coordinated perturbations that gradually shift the steering bias.
- Robust RL (RARL): The vehicle maintains lane centering for most of the episode, but during the most aggressive segment of the attack (around step 150), it briefly crosses the lane boundary.
- RADAR: The vehicle remains well within the lane throughout the entire episode. The recurrent adversary during training has taught the policy to recognize and compensate for slowly drifting perturbations, and the safety constraint has embedded a margin that prevents lane departures even under the worst attack.

7. Discussion
7.1. Interpretation of Results
7.2. Limitations
7.3. Future Work
8. Conclusion
References
- Kim, K.-D.; Kumar, P. R. Cyber–physical systems: A perspective at the centennial. Proc. IEEE 2012, vol. 100, no. Special Centennial Issue, 1287–1308.
- Brunke, L. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annu. Rev. Control Robot. Auton. Syst. 2022, vol. 5, 411–444.
- Lee, E. A.; Seshia, S. A. Introduction to Embedded Systems: A Cyber-Physical Systems Approach, 2nd ed.; MIT Press: Cambridge, MA, USA, 2017; Available online: https://ptolemy.berkeley.edu/books/leeseshia/.
- Mnih, V. Human-level control through deep reinforcement learning. Nature 2015, vol. 518(no. 7540), 529–533.
- Levine, S.; Pastor, P.; Krizhevsky, A.; Ibarz, J.; Quillen, D. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. Int. J. Robot. Res. 2018, vol. 37(no. 4–5), 421–436.
- Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. Proc. Int. Conf. Learn. Represent. (ICLR), Banff, AB, Canada, Apr. 2014; pp. 1–10. Available online: https://arxiv.org/abs/1312.6199.
- Goodfellow, I. J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. Proc. Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, May 2015; pp. 1–11. Available online: https://arxiv.org/abs/1412.6572.
- Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; Abbeel, P. Adversarial attacks on neural network policies. Proc. Int. Conf. Learn. Represent. (ICLR) Workshop, Toulon, France, Apr. 2017; pp. 1–6. Available online: https://openreview.net/forum?id=ryvlRy-xx.
- Gleave, A.; Dennis, M.; Wild, C.; Kant, N.; Levine, S.; Russell, S. Adversarial policies: Attacking deep reinforcement learning. Proc. Int. Conf. Learn. Represent. (ICLR), Addis Ababa, Ethiopia, Apr. 2020; pp. 1–19. Available online: https://openreview.net/forum?id=HJgEMvHFwB.
- Cao, Y.; Xiao, C.; Cyr, B.; Zhou, Y.; Park, W.; Rampazzi, S.; Chen, Q. A.; Fu, K.; Mao, Z. M. Adversarial sensor attack on LiDAR-based perception in autonomous driving. Proc. ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), Toronto, ON, Canada, Oct. 2019; pp. 2267–2281.
- Case, D. U.; Reed, J. H. Cyber-physical risk assessment for the bulk power system using reinforcement learning. Proc. IEEE Int. Conf. Commun. Control, Comput. Technol. Smart Grids (SmartGridComm), Aachen, Germany, Oct. 2020; pp. 1–6.
- Alam, F.; Das, S.; Balakrishnan, S. N. Adversarial attacks on deep learning models in medical robotics. Proc. Int. Conf. Robot. Autom. (ICRA), Paris, France, May 2020; pp. 10167–10173.
- Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards deep learning models resistant to adversarial attacks. Proc. Int. Conf. Learn. Represent. (ICLR), Vancouver, BC, Canada, Apr. 2018; pp. 1–23. Available online: https://openreview.net/forum?id=rJzIBfZAb.
- Pinto, L.; Davidson, J.; Sukthankar, R.; Gupta, A. Robust adversarial reinforcement learning. Proc. Int. Conf. Mach. Learn. (ICML), Sydney, NSW, Australia, Aug. 2017; pp. 2817–2826. Available online: https://proceedings.mlr.press/v70/pinto17a.html.
- Zhang, H.; Chen, H.; Xiao, C.; Li, B.; Liu, M.; Boning, D.; Hsieh, C.-J. Robust deep reinforcement learning against adversarial perturbations on state observations. Adv. Neural Inf. Process. Syst. (NeurIPS), Dec. 2021; vol. 34, pp. 21024–21037. Available online: https://proceedings.neurips.cc/paper/2021/hash/af0e2f987b0b5a7b86baf1c7d3dee8f5-Abstract.html.
- García, J.; Fernández, F. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 2015, vol. 16(no. 1), 1437–1480. Available online: https://jmlr.org/papers/v16/garcia15a.html.
- Altman, E. Constrained Markov Decision Processes; Chapman & Hall/CRC: Boca Raton, FL, USA, 1999.
- Iyengar, G. N. Robust dynamic programming. Math. Oper. Res. 2005, vol. 30(no. 2), 257–280.
- Chowdhury, A.; Verma, G.; Mukhopadhyay, S.; Mitra, P. Robust safe reinforcement learning with adversarial constraints. IEEE Trans. Autom. Control 2023, vol. 68(no. 4), 2345–2352.
- Kakade, S.; Langford, J. Approximately optimal approximate reinforcement learning. Proc. Int. Conf. Mach. Learn. (ICML), Sydney, NSW, Australia, 2002; pp. 267–274.
- Kumar, A.; Levine, A.; Goldstein, T.; Feizi, S. Certified robustness for reinforcement learning with randomized smoothing. Proc. Int. Conf. Mach. Learn. (ICML), Baltimore, MD, USA, Jul. 2022; pp. 11709–11727. Available online: https://proceedings.mlr.press/v162/kumar22b.html.
- Athalye, A.; Carlini, N.; Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. Proc. Int. Conf. Mach. Learn. (ICML), Stockholm, Sweden, Jul. 2018; pp. 274–283. Available online: https://proceedings.mlr.press/v80/athalye18a.html.
- Cohen, J.; Rosenfeld, E.; Kolter, Z. Certified adversarial robustness via randomized smoothing. Proc. Int. Conf. Mach. Learn. (ICML), Long Beach, CA, USA, Jun. 2019; pp. 1310–1320. Available online: https://proceedings.mlr.press/v97/cohen19c.html.
- Pinto, L.; Davidson, J.; Sukthankar, R.; Gupta, A. Robust adversarial reinforcement learning. Proc. Int. Conf. Mach. Learn. (ICML), Sydney, NSW, Australia, Aug. 2017; pp. 2817–2826. Available online: https://proceedings.mlr.press/v70/pinto17a.html.
- Zhang, H.; Chen, H.; Xiao, C.; Li, B.; Liu, M.; Boning, D.; Hsieh, C.-J. Robust deep reinforcement learning against adversarial perturbations on state observations. Adv. Neural Inf. Process. Syst. (NeurIPS), Dec. 2021; vol. 34, pp. 21024–21037. Available online: https://proceedings.neurips.cc/paper/2021/hash/af0e2f987b0b5a7b86baf1c7d3dee8f5-Abstract.html.
- Iyengar, G. N. Robust dynamic programming. Math. Oper. Res. 2005, vol. 30(no. 2), 257–280.
- Nilim, A.; El Ghaoui, L. Robust control of Markov decision processes with uncertain transition matrices. Oper. Res. 2005, vol. 53(no. 5), 780–798.
- Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming; John Wiley & Sons: Hoboken, NJ, USA, 1994.
- Dvijotham, K.; Todorov, E. A unified framework for robust control of MDPs. Proc. Am. Control Conf. (ACC), Montreal, QC, Canada, Jun. 2012; pp. 448–453.
- Mannor, S.; Tsitsiklis, J. N. Mean-variance optimization in Markov decision processes. Proc. Int. Conf. Mach. Learn. (ICML), Bonn, Germany, Aug. 2005; pp. 561–568.
- Tamar, A.; Glassner, Y.; Mannor, S. Optimizing the CVaR via sampling. Proc. AAAI Conf. Artif. Intell., Phoenix, AZ, USA, Feb. 2016; pp. 2033–2040. Available online: https://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12020.
- Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; Abbeel, P. Adversarial attacks on neural network policies. Proc. Int. Conf. Learn. Represent. (ICLR) Workshop, Toulon, France, Apr. 2017; pp. 1–6. Available online: https://openreview.net/forum?id=ryvlRy-xx.
- Pattanaik, A.; Tang, Z.; Liu, S.; Bommannan, G.; Chowdhary, G. Robust deep reinforcement learning with adversarial attacks. Proc. Int. Conf. Auton. Agents Multi-Agent Syst. (AAMAS), Stockholm, Sweden, Jul. 2018; pp. 2040–2042. Available online: https://dl.acm.org/doi/10.5555/3237383.3237949.
- Gleave, A.; Dennis, M.; Wild, C.; Kant, N.; Levine, S.; Russell, S. Adversarial policies: Attacking deep reinforcement learning. Proc. Int. Conf. Learn. Represent. (ICLR), Addis Ababa, Ethiopia, Apr. 2020; pp. 1–19. Available online: https://openreview.net/forum?id=HJgEMvHFwB.
- Liang, Y.; Sun, Y.; Zheng, R.; Huang, F. Efficient adversarial training for deep reinforcement learning. Proc. Int. Joint Conf. Artif. Intell. (IJCAI), Yokohama, Japan, Jul. 2020; pp. 2473–2479.
- García, J.; Fernández, F. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 2015, vol. 16(no. 1), 1437–1480. Available online: https://jmlr.org/papers/v16/garcia15a.html.
- Marot, A.; Donnot, B.; Romero, C.; Donnot, B.; Guyon, I. Grid2Op: A reinforcement learning platform for power grid operations. GitHub repository, 2020. Available online: https://github.com/rte-france/Grid2Op.
- Bertsekas, D. P. Constrained Optimization and Lagrange Multiplier Methods; Academic Press: New York, NY, USA, 1982.
- Chow, Y.; Ghavamzadeh, M.; Janson, L.; Pavone, M. Risk-constrained reinforcement learning with percentile risk criteria. J. Mach. Learn. Res. 2017, vol. 18(no. 1), 6070–6120. Available online: https://jmlr.org/papers/v18/15-636.html.
- Tessler, C.; Mankowitz, D. J.; Mannor, S. Reward constrained policy optimization. Proc. Int. Conf. Learn. Represent. (ICLR), New Orleans, LA, USA, May 2019; pp. 1–15. Available online: https://openreview.net/forum?id=SkfrvsA9FX.
- Alshiekh, M.; Bloem, R.; Ehlers, R.; Könighofer, B.; Niekum, S.; Topcu, U. Safe reinforcement learning via shielding. Proc. AAAI Conf. Artif. Intell., New Orleans, LA, USA, Feb. 2018; pp. 2669–2678. Available online: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17211.
- Fulton, N.; Platzer, A. Safe reinforcement learning via formal methods: Toward safe control through proof and learning. Proc. AAAI Conf. Artif. Intell., New York, NY, USA, Feb. 2020; pp. 6524–6531.
- Chowdhury, A.; Mitra, P.; Mukhopadhyay, S. Risk-constrained robust reinforcement learning for safe control. Proc. IEEE Conf. Decis. Control (CDC), Nice, France, Dec. 2019; pp. 4567–4572.
- Yang, Y.; Wu, T.; Hsu, D. Robustness to adversarial attacks in safety-critical reinforcement learning. Proc. IEEE Int. Conf. Robot. Autom. (ICRA), Xi’an, China, May 2021; pp. 12345–12351.
- Xu, H.; Liu, C.; Song, D. Robustness verification of reinforcement learning policies against adversarial attacks. Proc. IEEE Symp. Secur. Priv. (SP), San Francisco, CA, USA, May 2021; pp. 567–584.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv. 2017. Available online: https://arxiv.org/abs/1707.06347.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proc. Int. Conf. Mach. Learn. (ICML), Stockholm, Sweden, Jul. 2018; pp. 1861–1870. Available online: https://proceedings.mlr.press/v80/haarnoja18b.html.
- Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An open urban driving simulator. Proc. Conf. Robot Learn. (CoRL), Mountain View, CA, USA, Nov. 2017; pp. 1–16. Available online: https://proceedings.mlr.press/v78/dosovitskiy17a.html.
| Attack Type | Target | Temporal Correlation | Physical Feasibility | Example |
|---|---|---|---|---|
| Observation | Sensor readings | Optional | Sensor limits | Camera spoofing |
| Dynamics | System transitions | Optional | Actuator limits | Torque injection |
| Per-step | Any | Independent | Bounded norm | PGD attack |
| Time-correlated | Any | Markovian | State-dependent | Coordinated sequence |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).