Submitted: 17 January 2026
Posted: 19 January 2026
Abstract
Keywords:
1. Introduction
1. A dynamics model approximating unknown system dynamics
2. A state-dependent Riemannian contraction metric parameterized via a novel softplus-Cholesky decomposition
3. A policy optimized with adaptive contraction-based regularization
- First to learn contraction metrics jointly with unknown dynamics in RL
- Novel softplus-Cholesky parameterization ensuring positive definiteness with good gradient flow
- Adaptive stability regularization balancing exploration and stability
- Comprehensive theoretical guarantees: convergence, robustness, sample complexity
- Empirically validated across 5 continuous control benchmarks with 7 baselines
2. Related Work
2.1. Model-Based Reinforcement Learning
2.2. Lyapunov-Based Stability in Learning
2.3. Safe Reinforcement Learning
2.4. Contraction Theory and Control Contraction Metrics
2.5. Metric Learning in RL
2.6. Comparative Summary
3. Preliminaries
3.1. Notation
3.2. Reinforcement Learning
3.3. Contraction Theory
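For reference, the core statement of contraction analysis [15] used throughout the paper: a system $\dot{x} = f(x, u)$ is contracting at rate $\lambda > 0$ with respect to a uniformly positive-definite metric $M(x)$ if, in the standard form (which we take condition (5) of Section 5.4 to instantiate for the learned dynamics and metric),

$$
\dot{M}(x) + M(x)\,\frac{\partial f}{\partial x} + \frac{\partial f}{\partial x}^{\!\top} M(x) \;\preceq\; -2\lambda\, M(x).
$$

Under this condition the differential energy $V(x, \delta x) = \delta x^{\top} M(x)\, \delta x$ satisfies $\dot{V} \le -2\lambda V$, so any two trajectories converge toward each other exponentially at rate $\lambda$.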
4. Method
4.1. Overview
1. Dynamics Model approximating the true dynamics
2. Contraction Metric providing stability guarantees
3. Policy optimized with contraction regularization
4.2. Neural Contraction Metric
4.2.1. Softplus-Cholesky Parameterization
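As a concrete illustration, here is a minimal PyTorch sketch of a softplus-Cholesky metric network consistent with the architecture described in Appendix C.1; the class and variable names are ours, and the small diagonal offset is an assumption for a numerical floor, not a detail confirmed by the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContractionMetric(nn.Module):
    """State-dependent metric M(x) = L(x) L(x)^T with a softplus diagonal."""

    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        n_tril = state_dim * (state_dim + 1) // 2
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_tril),
        )
        self.state_dim = state_dim
        # row/column indices of the lower-triangular entries
        self.register_buffer("tril_idx", torch.tril_indices(state_dim, state_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        raw = self.net(x)                                   # (B, n_tril)
        L = x.new_zeros(x.shape[0], self.state_dim, self.state_dim)
        L[:, self.tril_idx[0], self.tril_idx[1]] = raw
        d = torch.arange(self.state_dim, device=x.device)
        # softplus keeps the diagonal strictly positive, so M = L L^T is
        # positive definite by construction while gradients stay well-behaved
        L[:, d, d] = F.softplus(L[:, d, d]) + 1e-4
        return L @ L.transpose(-1, -2)                      # M(x)
```

Appendix C.1 additionally scales the off-diagonal entries by $0.1\,\tanh(\cdot)$; that variant only changes the lines that populate $L$.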
4.2.2. Metric Properties
4.3. Contraction Energy and Loss
4.3.1. Virtual Displacement Approximation
4.3.2. Contraction Energy
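Reading Sections 4.3.1 and 4.3.2 together, a one-step sketch of the virtual-displacement estimate of contraction energy might look as follows; `dynamics`, `metric`, `eps`, and `lam` are illustrative stand-ins for the learned model, the metric network, the perturbation magnitude, and the contraction rate:

```python
import math
import torch

def contraction_loss(dynamics, metric, x, u, eps=0.01, lam=0.1):
    """Hinge penalty on growth of the metric energy V = dx^T M(x) dx."""
    delta = eps * torch.randn_like(x)              # sampled virtual displacement
    x_next = dynamics(x, u)                        # nominal next state
    delta_next = dynamics(x + delta, u) - x_next   # displacement after one step

    V = torch.einsum('bi,bij,bj->b', delta, metric(x), delta)
    V_next = torch.einsum('bi,bij,bj->b', delta_next, metric(x_next), delta_next)

    # contraction at rate lam requires V_next <= exp(-2*lam) * V;
    # penalize only violations, so satisfied states contribute no gradient
    return torch.relu(V_next - math.exp(-2.0 * lam) * V).mean()
```

The enhanced loss of Section 4.3.3 presumably adds the metric-regularization terms discussed in Sections 7.9 and 7.10 on top of this core penalty.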
4.3.3. Enhanced Loss Function
4.4. Policy Optimization with Stability
4.4.1. Composite Objective
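Schematically, in our notation (the precise discounting and expectation follow the paper's equations), the composite objective trades return against the contraction penalty:

$$
J(\phi) \;=\; \mathbb{E}_{\tau \sim \pi_\phi}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(x_t, u_t)\right] \;-\; \beta\, \mathcal{L}_{\text{contr}}(\pi_\phi),
$$

where $\beta > 0$ is the stability weight adapted as described next.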
4.4.2. Adaptive Stability Regularization
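A minimal sketch of one plausible adaptive rule, assuming a multiplicative update toward a target contraction-loss level; the target, step size, and clipping bounds are our illustrative choices, not values from the paper:

```python
def update_beta(beta, contraction_loss, target=0.01, step=1e-3,
                beta_min=1e-3, beta_max=10.0):
    """Tighten the stability weight when contraction is violated, relax otherwise."""
    if contraction_loss > target:
        beta *= 1.0 + step   # stability losing ground: regularize harder
    else:
        beta *= 1.0 - step   # contraction satisfied: free capacity for exploration
    return min(max(beta, beta_min), beta_max)
```

Whatever the exact rule, the clipping matters in practice: an unbounded $\beta$ would eventually dominate the reward term and reproduce the "exploration severely restricted" failure mode noted in Section 7.10.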
4.5. Dynamics Model Learning
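A compact sketch of the ensemble described in Appendix C.1 (5 members, 3 hidden layers of 256 ReLU units; the batch normalization and dropout listed there are omitted here for brevity):

```python
import torch
import torch.nn as nn

class DynamicsEnsemble(nn.Module):
    """Ensemble of feedforward models predicting the next state."""

    def __init__(self, state_dim: int, action_dim: int,
                 n_models: int = 5, hidden: int = 256):
        super().__init__()
        self.models = nn.ModuleList(
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, state_dim),
            )
            for _ in range(n_models)
        )

    def forward(self, x: torch.Tensor, u: torch.Tensor):
        xu = torch.cat([x, u], dim=-1)
        preds = torch.stack([m(xu) for m in self.models])  # (E, B, state_dim)
        # mean prediction plus disagreement as a crude epistemic-uncertainty signal
        return preds.mean(dim=0), preds.var(dim=0)
```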
4.6. Complete Algorithm
**Algorithm 1:** Contraction Dynamics Model (CDM) for MBRL
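A schematic reconstruction of the training loop from Sections 4.1-4.5 and the update counts in Appendix C.2; every helper here is a placeholder for the corresponding component, not the authors' released code:

```python
def train_cdm(env, ensemble, metric, policy, buffer, beta=1.0,
              iterations=500, batch_size=256, horizon=5):
    """Skeleton of Algorithm 1 (CDM for MBRL) as we read it."""
    for _ in range(iterations):
        collect_rollouts(env, policy, buffer, episodes=10)   # real transitions
        for _ in range(50):                                  # dynamics updates
            update_dynamics(ensemble, buffer.sample(batch_size))
        for _ in range(25):                                  # metric updates
            update_metric(metric, ensemble, buffer.sample(batch_size))
        for _ in range(50):                                  # policy updates
            batch = buffer.sample(batch_size)
            imagined = model_rollout(ensemble, policy, batch, horizon)
            update_policy(policy, imagined, metric, beta)    # reward - beta * L_contr
        beta = update_beta(beta, evaluate_contraction(metric, ensemble, policy))
    return policy
```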
5. Theoretical Analysis
5.1. Convergence Guarantees
5.2. Resilience to Model Errors
5.3. Sample Complexity
5.4. Global Convergence Conditions
1. There exists an equilibrium point $x^*$ such that $f(x^*, u^*) = 0$ for some control $u^*$.
2. The metric and dynamics satisfy the contraction condition (5) globally over the state space.
3. The policy is designed such that the closed-loop dynamics drive trajectories inward (boundedness condition).
6. Computational Complexity Analysis
6.1. Per-Iteration Complexity
| Component | Forward Pass | Backward Pass |
|---|---|---|
| Dynamics Model (Ensemble) | | |
| Metric Network | | |
| Policy Network | | |
| Contraction Energy | | |
| Total per iteration | | |
- Dynamics ensemble: the standard MBRL cost
- Metric learning: additional overhead on top of the baseline
- Matrix operations: cubic in the state dimension, for the eigenvalue computations in the regularization
6.2. Overhead Analysis
6.3. Scalability Analysis
6.4. Memory Requirements
- Replay buffer: shared with the baseline
- Metric parameters: additional relative to the baseline
- Gradient buffers: additional relative to the baseline
7. Experiments
7.1. Experimental Setup
7.1.1. Environments
1. Pendulum-v1: Classic inverted pendulum (3D state, 1D action)
2. CartPole Continuous: Continuous version of pole balancing (4D state, 1D action)
3. Reacher-v2: 2-link arm reaching (11D state, 2D action)
4. HalfCheetah-v3: Locomotion task (17D state, 6D action)
5. Walker2d-v3: Bipedal walking (17D state, 6D action)
7.1.2. Baselines
- PETS [2]: Probabilistic ensembles with trajectory sampling
- MBPO [3]: Model-based policy optimization
- Learned CCM: Contraction metric learned separately from the dynamics (Section 7.12)
- SAC [26]: Soft actor-critic
- TD3 [27]: Twin delayed deep deterministic policy gradient
- PPO [28]: Proximal policy optimization
- CPO [10]: Constrained policy optimization
7.1.3. Hyperparameters
- Dynamics ensemble: 5 networks, 3 hidden layers of 256 units each
- Policy network: 2 hidden layers of 256 units
- Metric network: 2 hidden layers of 128 units
- Contraction rate λ (see the sensitivity analysis in Section 7.10)
- Initial stability weight β (adapted during training)
- Learning rates: Adam optimizer for all networks
- Batch size: 256
- Perturbation variance for virtual displacements
- Metric regularization weights
7.2. Main Results
- CDM outperforms model-based baselines (PETS, MBPO) by 10-40%
- CDM surpasses even model-free SAC despite using fewer samples
- Learned CCM baseline struggles without analytical dynamics, validating our approach
- Improvements are consistent across diverse task types
7.3. Sample Efficiency
7.4. Performance Comparison
7.5. Stability Analysis
- Contraction energy decreases at the predicted exponential rate
- Policy rollouts exhibit significantly lower variance (3-4× reduction)
- Empirical convergence rate matches Theorem 1
- Clear tradeoff between stability (β) and performance
7.6. Resilience to Model Errors
7.7. Basin of Attraction Analysis
7.8. Loss Dynamics Analysis
- Contraction loss converges fastest (92% reduction)
- Losses are moderately correlated (0.3-0.6 correlation coefficients)
- All losses stabilize within reasonable ranges
- No evidence of training instability or collapse
7.9. Comprehensive Ablation Study

7.9.1. Key Findings
7.9.2. Component Importance Analysis
1. Metric learning (21.2% impact): Most critical component
2. Ensemble dynamics (18.7%): Reduces model error
3. Contraction regularization (15.4%): Provides stability guidance
4. Metric regularization (8.3%): Prevents ill-conditioned metrics
7.10. Hyperparameter Sensitivity Analysis
Contraction rate λ:
- Too small: Over-aggressive contraction hampers exploration
- Too large: Weak stability guarantees
- Recommendation: a moderate value provides a good balance

Stability weight β:
- Too small: Insufficient stability regularization
- Too large: Exploration severely restricted
- The adaptive mechanism compensates for sub-optimal initialization
- Recommendation: a moderate initial weight with adaptive updates

Perturbation magnitude:
- Too small: Poor approximation of the differential dynamics
- Too large: Violates the infinitesimal assumption
- Recommendation: roughly 1% of the typical state magnitude

Metric regularization weight:
- Too small: Metrics become ill-conditioned
- Too large: Metrics remain near identity, losing expressiveness
7.11. Metric Architecture Ablation
| Architecture | Parameters | Reward | Training Time |
|---|---|---|---|
| 1 layer, 64 units | 12.5k | | 1.3× |
| 2 layers, 128 units | 48.2k | | 1.6× |
| 3 layers, 128 units | 65.8k | | 2.1× |
| 2 layers, 256 units | 178.4k | | 2.3× |

Training times are relative to the MBPO baseline.
- 2 layers, 128 units optimal: Good balance of expressiveness and efficiency
- Deeper networks (3 layers) don’t improve performance significantly
- Wider networks (256 units) increase cost without gains
- Shallow networks (1 layer) lack capacity for complex metrics
7.12. Comparison with Learned CCM Baseline
| Method | Final Reward | Convergence Speed (steps) | Resilience (under 10% model noise) | Computation Time |
|---|---|---|---|---|
| Learned CCM | | 210k | 45% | 1.9× |
| CDM (Ours) | | 110k | 78% | 1.6× |
| Improvement | +29.1% | +47.6% | +73.3% | +15.8% |
1. Joint optimization: Learning the metric with the dynamics provides better gradient flow
2. RL-specific design: The contraction loss is directly integrated into the policy objective
3. Computational efficiency: Virtual displacements are cheaper than convex optimization
7.13. Computational Cost
- 20-25% overhead over MBPO baseline (matches theoretical prediction)
- Still competitive with model-free SAC in wall-clock time despite fewer samples
- Overhead decreases relatively for higher-dimensional systems
- Parallelization opportunity: Metric and dynamics updates can run concurrently
| Method | Pendulum | CartPole | Reacher | HalfCheetah | Walker2d | Avg Overhead |
|---|---|---|---|---|---|---|
| SAC | 0.8 | 1.2 | 2.4 | 5.6 | 6.1 | — |
| MBPO | 1.3 | 2.1 | 4.2 | 8.9 | 9.7 | +62% vs SAC |
| CDM | 1.6 | 2.5 | 5.1 | 10.8 | 11.4 | +23% vs MBPO (+86% vs SAC) |
8. Discussion
8.1. Key Insights
1. Contraction metrics are learnable without known dynamics: Neural parameterizations successfully capture complex state-dependent stability structure, contradicting prior assumptions that analytical models are required.
2. Stability actively improves performance: Contraction regularization does not just prevent failures; it guides exploration toward high-reward regions, improving both sample efficiency (30-40%) and asymptotic performance (10-40%).
3. Resilience to model error is substantial: CDM retains 78% performance under 10% model noise vs. 52% for MBPO, validating the theoretical resilience guarantees (Theorem 2).
4. Scalability is practical: The approach scales to high-dimensional systems (17D Walker2d) with only 20% computational overhead, and the overhead decreases relatively with dimension.
5. Global convergence in practice: While the theoretical guarantees are local, empirical basins of attraction cover >85% of the state space, providing practical stability assurance.
6. Hyperparameters are robust: Performance degrades gracefully outside the optimal ranges; the method does not require extensive tuning.
8.2. Comparison with Related Approaches
8.3. Limitations and Future Work
Limitations:
- Computational overhead: the roughly 20% additional cost may be prohibitive for some applications
- Local guarantees: Global convergence requires additional assumptions
- Continuous spaces: Current formulation limited to continuous state-action spaces
- Sim-to-real gap: No physical robot validation yet
- Metric interpretability: Learned metrics lack clear physical interpretation
Future directions:
- Physical experiments: Validate on real robotic systems (cart-pole, quadrotors, manipulators)
- Task-specific architectures: Exploit structure in contact-rich tasks, locomotion
- Multi-task learning: Share contraction structure across related tasks
- Integration with safe RL: Combine stability guarantees with safety constraints
- Partial observability: Extend to POMDPs via belief space metrics
- Theoretical extensions: Tighten global convergence conditions, regret bounds
- Discrete-continuous hybrid: Extend to mixed action spaces
- Efficient computation: GPU-accelerated metric operations, metric network pruning
8.4. Broader Impact
Potential benefits:
- Safer autonomous systems (vehicles, drones, robots)
- More reliable medical robotics and prosthetics
- Resilient industrial automation with formal guarantees
- Reduced failures in deployed RL systems
Caveats:
- Stability guarantees are only as good as the learned model
- Should not replace comprehensive safety testing
- Requires careful validation before safety-critical deployment
9. Conclusion
Theoretical contributions:
- Exponential trajectory convergence in expectation
- Resilience bounds to model errors
- Sample complexity characterization
- Novel global convergence conditions
Empirical results:
- 10-40% better final performance
- 30-40% improved sample efficiency
- 3-4× reduced policy variance
- Superior resilience to model errors
Acknowledgments
Appendix A. Proof of Theorem 2
Appendix B. Proof of Theorem 3
Appendix C. Additional Experimental Details
Appendix C.1. Network Architectures
Dynamics model (each ensemble member):
- Input: State-action concatenation
- Hidden layers: 3 layers of 256 units each
- Activation: ReLU
- Output: State prediction
- Initialization: Xavier uniform
- Batch normalization after each hidden layer
- Dropout (0.1) between layers for ensemble diversity
Metric network:
- Input: State
- Hidden layers: 2 layers of 128 units each
- Activation: ReLU
- Output: Lower-triangular entries $L(x)$
- Diagonal entries: softplus activation
- Off-diagonal entries: tanh scaled by 0.1
- Final metric: $M(x) = L(x) L(x)^\top$
Policy network:
- Input: State
- Hidden layers: 2 layers of 256 units each
- Activation: ReLU
- Output: Mean and log-std for a Gaussian policy
- Action: tanh squashing to the action bounds
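A sketch matching this description; the log-std clamp range is a common convention we assume rather than a reported detail:

```python
import torch
import torch.nn as nn

class TanhGaussianPolicy(nn.Module):
    """Gaussian policy with tanh squashing onto the action bounds."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.trunk(x)
        std = self.log_std(h).clamp(-20.0, 2.0).exp()
        # reparameterized sample, squashed into [-1, 1]; rescale to env bounds
        return torch.tanh(self.mean(h) + std * torch.randn_like(std))
```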
Appendix C.2. Training Procedures
- Warm-up: 5000 steps with random policy
- Episodes per iteration: 10
- Episode length: 1000 steps (with early termination)
- Total iterations: 500
- Optimizer: Adam
- Learning rates: shared across all networks
- Gradient clipping by norm
- Batch size: 256
- Replay buffer size: 1M transitions
- Dynamics updates per iteration: 50
- Metric updates per iteration: 25
- Policy updates per iteration: 50
- Model-based rollout horizon: 5 steps
Appendix C.3. Computational Resources
- GPU: NVIDIA RTX 3090 (24GB)
- CPU: AMD Ryzen 9 5950X (16 cores)
- RAM: 64GB DDR4
- OS: Ubuntu 20.04 LTS
- CUDA: 11.4
- PyTorch: 1.12.0
Appendix C.4. Reproducibility
Appendix D. Extended Related Work Discussions
Appendix D.1. Contraction Theory in Robotics
- Motion planning with contraction constraints [19]
- Multi-agent coordination using coupled contraction metrics
- Adaptive control for time-varying systems
- Resilient estimation with contraction-based observers
Appendix D.2. Meta-Learning for Control
Appendix D.3. Physics-Informed Neural Networks
References
1. Deisenroth, M.P.; Rasmussen, C.E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011; pp. 465–472.
2. Chua, K.; Calandra, R.; McAllister, R.; Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, 2018; Vol. 31.
3. Janner, M.; Fu, J.; Zhang, M.; Levine, S. When to trust your model: Model-based policy optimization. In Advances in Neural Information Processing Systems, 2019; Vol. 32.
4. Hafner, D.; Lillicrap, T.; Ba, J.; Norouzi, M. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020.
5. Buckman, J.; Hafner, D.; Tucker, G.; Brevdo, E.; Lee, H. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Advances in Neural Information Processing Systems, 2018; Vol. 31.
6. Kurutach, T.; Clavera, I.; Duan, Y.; Tamar, A.; Abbeel, P. Model-ensemble trust-region policy optimization. In International Conference on Learning Representations, 2018.
7. Richards, S.M.; Berkenkamp, F.; Krause, A. The Lyapunov neural network: Adaptive stability certification for safe learning of dynamical systems. In Conference on Robot Learning, PMLR, 2018; pp. 466–476.
8. Chang, Y.C.; Roohi, N.; Gao, S. Neural Lyapunov control. In Advances in Neural Information Processing Systems, 2019; Vol. 32.
9. Berkenkamp, F.; Turchetta, M.; Schoellig, A.; Krause, A. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, 2017; Vol. 30.
10. Achiam, J.; Held, D.; Tamar, A.; Abbeel, P. Constrained policy optimization. In International Conference on Machine Learning, PMLR, 2017; pp. 22–31.
11. Dalal, G.; Dvijotham, K.; Vecerik, M.; Hester, T.; Paduraru, C.; Tassa, Y. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.
12. Garcez, A.d.; Lamb, L.C.; Bader, S. Safe reinforcement learning via projection on a safe set. Engineering Applications of Artificial Intelligence 2019, 85, 133–144.
13. Thananjeyan, B.; Balakrishna, A.; Nair, S.; Luo, M.; Srinivasan, K.; Hwang, M.; Gonzalez, J.E.; Ibarz, J.; Finn, C.; Goldberg, K. Recovery RL: Safe reinforcement learning with learned recovery zones. IEEE Robotics and Automation Letters 2021, 6, 4915–4922.
14. Ha, S.; Liu, K.C. SAC-Lagrangian: Safe reinforcement learning with Lagrangian methods. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021; Vol. 35, pp. 9909–9917.
15. Lohmiller, W.; Slotine, J.J.E. On contraction analysis for non-linear systems. Automatica 1998, 34, 683–696.
16. Aminpour, M.; Hager, G.D. Contraction theory for nonlinear stability analysis and learning-based control: A tutorial overview. Annual Review of Control, Robotics, and Autonomous Systems 2019, 2, 253–279.
17. Manchester, I.R.; Slotine, J.J.E. Control contraction metrics: Convex and intrinsic criteria for nonlinear feedback design. IEEE Transactions on Automatic Control 2017, 62, 3046–3053.
18. Tsukamoto, H.; Chung, S.J. Neural contraction metrics for robust estimation and control. IEEE Robotics and Automation Letters 2021, 6, 8017–8024.
19. Singh, S.; Majumdar, A.; Slotine, J.J.; Pavone, M. Robust online motion planning via contraction theory and convex optimization. In 2019 International Conference on Robotics and Automation (ICRA), IEEE, 2019; pp. 5883–5889.
20. Revay, M.; Wang, R.; Manchester, I.R. Lipschitz bounded equilibrium networks. arXiv preprint arXiv:2010.01732, 2020.
21. Sun, W.; Dai, R.; Chen, X.; Sun, Q.; Dai, L. Learning control contraction metrics for non-autonomous systems. In Learning for Dynamics and Control, PMLR, 2021; pp. 526–537.
22. Wang, L.; Chen, X.; Sun, Q.; Dai, L. Learning control contraction metrics for tracking control. In 2022 IEEE 61st Conference on Decision and Control (CDC), IEEE, 2022; pp. 4185–4191.
23. Jönschkowski, R.; Brock, O. Learning state representations with robotic priors. Autonomous Robots 2015, 39, 407–428.
24. Abel, D.; Arumugam, D.; Lehnert, L.; Littman, M.L. A theory of abstraction in reinforcement learning. Journal of Artificial Intelligence Research 2021, 72, 1–65.
25. Tirinzoni, A.; Sessa, A.; Pirotta, M.; Restelli, M. Transfer of value functions via variational methods. In Advances in Neural Information Processing Systems, 2018; Vol. 31.
26. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905, 2018.
27. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, PMLR, 2018; pp. 1587–1596.
28. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.





Table: Comparative summary of methods (Section 2.6). * Local theoretical guarantee; see Sections 5.4 and 7.7.

| Method | Stability Guarantee | Unknown Dynamics | RL Setting | Sample Efficient | Global Guarantee |
|---|---|---|---|---|---|
| *Model-Based RL* | | | | | |
| PETS [2] | ✗ | ✓ | ✓ | ✓ | ✗ |
| MBPO [3] | ✗ | ✓ | ✓ | ✓ | ✗ |
| Dreamer [4] | ✗ | ✓ | ✓ | ✓ | ✗ |
| *Stability-Focused Learning* | | | | | |
| Neural Lyapunov [7] | ✓ | ✗ | ✗ | ✗ | ✓ |
| NLC [8] | ✓ | ✓ | Limited | ✗ | ✗ |
| *Safe RL* | | | | | |
| CPO [10] | Constraints | ✓ | ✓ | Medium | ✗ |
| SAC-Lag [14] | Constraints | ✓ | ✓ | ✓ | ✗ |
| *Contraction Methods* | | | | | |
| Classical CCM [17] | ✓ | ✗ | ✗ | ✗ | ✓ |
| Sun et al. [21] | ✓ | ✗ | ✗ | ✗ | Local |
| Tsukamoto [18] | ✓ | ✗ | ✗ | ✗ | ✗ |
| CDM (Ours) | ✓ | ✓ | ✓ | ✓ | Local* |
Table: Final performance comparison across environments (Section 7.4).

| Method | Pendulum | CartPole | Reacher | HalfCheetah | Walker2d |
|---|---|---|---|---|---|
| PETS | | | | | |
| MBPO | | | | | |
| Learned CCM | | | | | |
| SAC | | | | | |
| TD3 | | | | | |
| PPO | | | | | |
| CPO | | | | | |
| CDM (Ours) | | | | | |
| Improvement | +16.4% | +4.7% | +22.0% | +10.3% | +13.6% |
Table: Sample efficiency, measured in environment steps (Section 7.3).

| Method | Pendulum | CartPole | Reacher | HalfCheetah | Walker2d |
|---|---|---|---|---|---|
| Best Baseline | 180k | 220k | 260k | 350k | 380k |
| CDM (Ours) | 110k | 150k | 170k | 230k | 250k |
| Reduction | 38.9% | 31.8% | 34.6% | 34.3% | 34.2% |
Table: Ablation study, averaged over 3 trials per variant (Section 7.9).

| Variant | Trials | Final Reward | Best Reward | Avg Reward |
|---|---|---|---|---|
| Full CDM | 3 | | | |
| No Metric Regularization | 3 | | | |
| No Contraction (β = 0) | 3 | | | |
| Fixed Metric (M = I) | 3 | | | |
| Single Dynamics (no ensemble) | 3 | | | |