Submitted: 06 January 2026
Posted: 06 January 2026
Abstract

Keywords:
1. Introduction: How Do Machines Learn to Make Decisions?
2. The Mathematical Foundation of Decision-Making—Markov Decision Process (MDP)
2.1. What Is the Markov Property?
2.2. The Five Elements of an MDP
2.3. Policy and Return
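As a small illustration of the return named in this heading, the sketch below computes the discounted return of a finite reward sequence, assuming the usual definition G = r_1 + γ r_2 + γ² r_3 + … The function name `discounted_return` and the sample rewards are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Walk backwards so each step applies the recursion G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical three-step episode with reward 1 at every step.
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With γ = 0.5 the three rewards contribute 1 + 0.5 + 0.25, so `example_return` is 1.75.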
3. From Policy Evaluation to Optimal Policy: The Evolution of the Bellman Equation
3.1. State-Value Function and Action-Value Function
3.2. The Bellman Equation: The Cornerstone of Policy Evaluation
3.3. The Bellman Optimality Equation: The Beacon for Finding the Optimal Policy
3.4. Dynamic Programming: Model-Based Solution Methods
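Value iteration is one of the model-based methods covered by dynamic programming. The sketch below is a minimal version that assumes a small tabular MDP whose transition model is fully known. The data layout (`P[s][a]` as a list of `(probability, next_state)` pairs, `R[s][a]` as an expected reward) and the two-state example are assumptions made for illustration.

```python
def value_iteration(P, R, gamma, tol=1e-10):
    """Approximate optimal state values for a known tabular MDP."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            # Bellman optimality backup: best action value under the model.
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        # Stop once no state value changed by more than the tolerance.
        if delta < tol:
            return V


# Hypothetical two-state, two-action MDP used only for illustration.
P_example = [
    [[(1.0, 0)], [(1.0, 1)]],  # state 0: action 0 stays, action 1 moves to state 1
    [[(1.0, 1)], [(1.0, 0)]],  # state 1: action 0 stays, action 1 moves to state 0
]
R_example = [
    [0.0, 1.0],  # rewards for the two actions in state 0
    [2.0, 0.0],  # rewards for the two actions in state 1
]
V_star = value_iteration(P_example, R_example, gamma=0.5)
```

For this example the values converge to roughly 3 for state 0 and 4 for state 1: staying in state 1 is worth 2 / (1 - 0.5) = 4, and moving there from state 0 is worth 1 + 0.5 · 4 = 3.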
4. Model-Free Learning—From Theory to Practice
4.1. Monte Carlo and Temporal Difference Learning
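The two model-free estimators named in this heading differ in the target they move a value estimate towards. The sketch below assumes a tabular value list `V` and the standard textbook update rules. The function names and argument order are hypothetical.

```python
def mc_update(V, s, G, alpha):
    """Monte Carlo: move V[s] towards the full observed return G."""
    V[s] += alpha * (G - V[s])
    return V[s]


def td0_update(V, s, r, s_next, alpha, gamma):
    """TD(0): move V[s] towards the one-step bootstrapped target."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error
```

Monte Carlo waits until the episode ends to obtain `G`, while TD(0) updates after every step by reusing the current estimate `V[s_next]`.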
4.2. Q-Learning: A Practical Breakthrough from the Bellman Optimality Equation
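Q-learning replaces the expectation in the Bellman optimality equation with a sampled transition. A minimal tabular sketch follows, assuming `Q` is a list of per-state lists of action values. The helper names and the ε-greedy exploration rule are illustrative assumptions.

```python
import random


def epsilon_greedy(Q, s, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=lambda a: Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One off-policy update towards r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```

The `max` over next-state action values is what makes the update off-policy: the target assumes greedy behaviour even when the action actually taken was exploratory.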
4.3. Function Approximation: Coping with High-Dimensional State Spaces
5. The Rise of Deep Reinforcement Learning
5.1. The Breakthrough of Deep Q-Networks (DQN)
5.2. Policy Gradient: Directly Optimizing the Policy
6. The Deep Deterministic Policy Gradient (DDPG) Algorithm
6.1. The Challenge of Continuous Action Spaces
6.2. The Core Idea of DDPG
6.3. The Connection Between DDPG and the Bellman Optimality Equation
6.4. DDPG Algorithm
Initialize actor network μ and critic network Q
Initialize corresponding target networks μ’ and Q’
Initialize experience replay buffer R
for each time step do
    Select action a = μ(s) + exploration noise
    Execute a, observe reward r and new state s’
    Store transition (s, a, r, s’) in R
    Randomly sample a minibatch from R
    # Update critic (based on modified Bellman optimality equation)
    # Update actor (approximate policy improvement)
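The data-handling steps of the loop above, together with the critic's regression target, can be sketched in plain Python. This is an illustration under stated assumptions rather than a full DDPG implementation. The actor and critic are passed in as ordinary callables, the exploration noise is assumed to be Gaussian, the target form y = r + γ Q’(s’, μ’(s’)) follows the DDPG paper of Lillicrap et al., and all class and function names are hypothetical. The actor update needs gradients of the critic with respect to the action, so it is omitted here.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.data = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored transitions.
        return rng.sample(list(self.data), batch_size)


def select_action(actor, s, noise_std, rng=random):
    """a = mu(s) + exploration noise (Gaussian noise is an assumption)."""
    return actor(s) + rng.gauss(0.0, noise_std)


def critic_targets(batch, target_actor, target_critic, gamma):
    """y = r + gamma * Q'(s', mu'(s')) for each sampled transition."""
    return [
        r + gamma * target_critic(s_next, target_actor(s_next))
        for (_, _, r, s_next) in batch
    ]


def critic_loss(batch, critic, targets):
    """Mean squared error between Q(s, a) and the targets y."""
    errors = [(critic(s, a) - y) ** 2 for (s, a, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)
```

Because the actor is deterministic, the target evaluates the target critic at the single action μ’(s’) instead of maximizing over all actions, which is what makes the Bellman-style update tractable in a continuous action space.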
6.5. Advantages and Challenges of DDPG
7. Frontier Developments and a Unified Theoretical Perspective
7.1. Subsequent Improved Algorithms
7.2. A Unified Theoretical Perspective on Reinforcement Learning
7.3. Future Directions
8. Conclusion: The Science of Intelligent Decision-Making Within a Unified Framework
References
- Bellman, R. Dynamic Programming; Princeton University Press, 1957.
- Sutton, R. S.; Barto, A. G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press, 2018.
- Watkins, C. J. C. H.; Dayan, P. Q-learning. Machine Learning 1992, 8(3-4), 279–292.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; Hassabis, D. Human-level control through deep reinforcement learning. Nature 2015, 518(7540), 529–533.
- Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR) 2016.
- Fujimoto, S.; van Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. International Conference on Machine Learning (ICML) 2018.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning (ICML) 2018.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
- Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. International Conference on Machine Learning (ICML) 2014.
- Arulkumaran, K.; Deisenroth, M. P.; Brundage, M.; Bharath, A. A. A brief survey of deep reinforcement learning. IEEE Signal Processing Magazine 2017, 34(6), 26–38.

| Theoretical Stage | Core Concept | Mathematical Expression | Representative Algorithms | Applicable Scenarios |
|---|---|---|---|---|
| Problem Definition | Markov Decision Process | | - | Decision problem formulation |
| Policy Evaluation | Bellman Equation | | Policy evaluation algorithms | Analyzing a given policy |
| Optimal Condition | Bellman Optimality Equation | | Theoretical benchmark | Defining the standard for an optimal policy |
| Model-Based Solution | Dynamic Programming | | Value iteration, Policy iteration | Environments with known model |
| Model-Free Learning | Temporal Difference Learning | | SARSA, Q-learning | Environments with unknown model |
| High-Dimensional Extension | Function Approximation | | Linear function approximation | Large state spaces |
| Deep Integration | Deep Q-Learning | | DQN and its variants | High-dimensional inputs such as images |
| Policy Optimization | Policy Gradient | | REINFORCE | Direct policy optimization |
| Integrated Methods | Actor-Critic | actor: | A3C, DDPG | Continuous action spaces |
| Continuous Control | Deterministic Policy Gradient | | DDPG, TD3, SAC | Robot control, etc. |
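
For reference, the core concepts listed in the table have well-known textbook forms, collected below. The notation follows common usage (for example Sutton and Barto) and is an assumption: these are not necessarily the exact expressions of the table's Mathematical Expression column. Here θ⁻ denotes the parameters of a target network, π_θ a stochastic policy and μ_θ a deterministic one.

```latex
\begin{align*}
\text{MDP:}\quad & \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle \\
\text{Bellman equation:}\quad & V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma V^{\pi}(s') \bigr] \\
\text{Bellman optimality:}\quad & V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma V^{*}(s') \bigr] \\
\text{TD(0) update:}\quad & V(s) \leftarrow V(s) + \alpha \bigl[ r + \gamma V(s') - V(s) \bigr] \\
\text{Q-learning update:}\quad & Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr] \\
\text{DQN loss:}\quad & L(\theta) = \mathbb{E}\Bigl[ \bigl( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \bigr)^{2} \Bigr] \\
\text{Policy gradient:}\quad & \nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\bigl[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi_{\theta}}(s, a) \bigr] \\
\text{Deterministic policy gradient:}\quad & \nabla_{\theta} J(\theta) = \mathbb{E}_{s}\bigl[ \nabla_{a} Q(s, a) \big|_{a = \mu_{\theta}(s)}\, \nabla_{\theta} \mu_{\theta}(s) \bigr]
\end{align*}
```

Each successive row keeps the same one-step bootstrapped target structure, r + γ · (value of the next state), and changes only how that value is represented or how the next action is chosen.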

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.