Submitted: 01 June 2026
Posted: 02 June 2026
Abstract
Keywords:
1. Introduction
- (1) We propose a temporal coherence measurement mechanism based on the local directional consistency of TD residuals. By characterizing the stability of update signals within a local temporal window, this mechanism provides a unified sample-wise reliability criterion for both advantage estimation and policy updates. It enables the algorithm to distinguish stable trajectory segments from highly fluctuating ones, thereby reducing the dependence of policy optimization on a fixed form of temporal feedback modeling.
- (2) We develop a temporal-coherence-aware advantage construction method that adaptively integrates short-horizon and long-horizon advantage estimates. The resulting advantage signal combines the merits of low variance and low bias. Compared with conventional advantage estimation based on a fixed GAE parameter, the proposed method can dynamically adjust the temporal scale of advantage estimation according to the reliability of local temporal feedback, thereby improving the stability of policy update signals.
- (3) We propose a coherence-adaptive proximal optimization algorithm, termed CAPO. While preserving the stable optimization framework of PPO, CAPO embeds the temporal coherence coefficient into the clipped surrogate objective to form a sample-wise dynamic clipping mechanism. This design allows policy updates to automatically balance sufficient policy improvement and conservative proximal constraints according to signal reliability. Experimental results on multiple representative control tasks demonstrate that the proposed method achieves a more stable training process, faster convergence, and superior overall control performance.
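The ideas in contributions (1) and (3) can be illustrated with a minimal Python sketch. The exact formulas are not reproduced in this text, so everything below is an assumption made for illustration: the coherence score is taken as the absolute mean sign of the TD residuals in a forward window, the clipping range is a linear interpolation between a minimum and a maximum bound, and the names `temporal_coherence`, `dynamic_clip`, and `clipped_surrogate` are hypothetical.

```python
from typing import List


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def temporal_coherence(td_residuals: List[float], window: int = 4) -> List[float]:
    """Assumed coherence score in [0, 1] for each time step.

    For step t, take the residuals in [t, t + window), truncated at the end
    of the trajectory, and return |mean of their signs|. A score of 1.0 means
    every residual in the window has the same sign; 0.0 means the signs cancel.
    """
    n = len(td_residuals)
    scores = []
    for t in range(n):
        local = td_residuals[t:min(t + window, n)]
        scores.append(abs(sum(sign(d) for d in local)) / len(local))
    return scores


def dynamic_clip(coherence: float, eps_min: float = 0.08, eps_max: float = 0.25) -> float:
    """Assumed linear map: more coherent samples get a wider clipping range."""
    return eps_min + coherence * (eps_max - eps_min)


def clipped_surrogate(ratio: float, advantage: float, eps: float) -> float:
    """Standard PPO clipped term for one sample, using a per-sample eps."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    deltas = [0.5, 0.4, 0.6, 0.3, -0.2, 0.1, -0.3, 0.2]
    for t, c in enumerate(temporal_coherence(deltas, window=4)):
        eps = dynamic_clip(c)
        print(t, round(c, 2), round(eps, 3), round(clipped_surrogate(1.3, 1.0, eps), 3))
```

Under these assumptions, a segment whose residuals keep the same sign receives the widest clipping range, while a segment with alternating signs is held to the most conservative one.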
2. Preliminaries
2.1. Markov Decision Process
2.2. Proximal Policy Optimization
2.3. Generalized Advantage Estimation and TD Residual
3. Methodology
3.1. Temporal Coherence Measurement
3.2. Coherence-Adaptive Advantage Construction
3.3. Coherence-Adaptive Proximal Policy Optimization
3.4. Mechanism Analysis
3.5. Algorithm Summary
Algorithm 1: The proposed CAPO algorithm.

Input: Policy parameters, value parameters, discount factor, coherence window length H, short-horizon and long-horizon GAE parameters, and minimum and maximum clipping bounds.
Output: Optimized policy parameters.
1: Initialize the policy and value parameters.
2: for each training iteration do
3:   Collect trajectories using the old policy.
4:   Compute TD residuals using the old value function.
5:   Compute the temporal coherence coefficient within a local window.
6:   Construct the short-horizon advantage with the short-horizon GAE parameter.
7:   Construct the long-horizon advantage with the long-horizon GAE parameter.
8:   Fuse the two advantages to obtain the coherence-adaptive advantage.
9:   Normalize the fused advantage.
10:  Compute the dynamic clipping coefficient.
11:  Optimize the CAPO clipped surrogate objective.
12:  Update the critic by minimizing the value-function loss.
13:  Update the old policy and value parameters with the current ones.
14: end for
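Steps 4 and 6 to 9 of Algorithm 1 can be sketched in plain Python as follows. The TD residual and the GAE backward recursion are the standard definitions. The fusion rule (a convex combination weighted by the coherence coefficient), the default discount factor of 0.99, and all function names are assumptions made for illustration, and the sketch treats the input as a single trajectory segment without episode termination.

```python
from typing import List, Sequence


def td_residuals(rewards: Sequence[float], values: Sequence[float],
                 last_value: float, gamma: float = 0.99) -> List[float]:
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for one segment."""
    next_values = list(values[1:]) + [last_value]
    return [r + gamma * nv - v for r, nv, v in zip(rewards, next_values, values)]


def gae(deltas: Sequence[float], gamma: float, lam: float) -> List[float]:
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1}."""
    adv = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv


def fuse(adv_short: Sequence[float], adv_long: Sequence[float],
         coherence: Sequence[float]) -> List[float]:
    """Assumed fusion: lean on the long horizon when coherence is high."""
    return [c * a_l + (1.0 - c) * a_s
            for a_s, a_l, c in zip(adv_short, adv_long, coherence)]


def normalize(adv: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(adv) / len(adv)
    std = (sum((a - mean) ** 2 for a in adv) / len(adv)) ** 0.5
    return [(a - mean) / (std + eps) for a in adv]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 1.0]
    values = [0.5, 0.4, 0.6, 0.5]
    coherence = [1.0, 0.5, 0.5, 0.0]  # hand-picked values, for illustration only
    deltas = td_residuals(rewards, values, last_value=0.0)
    adv_short = gae(deltas, gamma=0.99, lam=0.85)
    adv_long = gae(deltas, gamma=0.99, lam=0.97)
    print(normalize(fuse(adv_short, adv_long, coherence)))
```

Because both estimates share the same TD residuals, the two GAE passes differ only in how quickly future residuals are discounted, so the fused value always lies between the short-horizon and long-horizon estimates for each sample.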
4. Experimental Results and Analysis
4.1. Experimental Environments and Settings
4.2. Experimental Results
4.2.1. Overall Comparison
4.2.2. Results on Discrete-Action Control Tasks
4.2.3. Results on Continuous-Action Control Tasks
4.3. Discussion
5. Conclusion and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
| Hyperparameter | Symbol | Value |
|---|---|---|
| Short-horizon GAE parameter | | 0.85 |
| Long-horizon GAE parameter | | 0.97 |
| Temporal coherence window | H | 4 |
| Minimum clipping bound | | 0.08 |
| Maximum clipping bound | | 0.25 |
| Entropy coefficient | | 0.01 |
| Value loss coefficient | | 0.5 |
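For readers who want to carry these settings into code, one option is a small immutable configuration object. This is only a sketch: the class name `CAPOConfig`, the field names, and the validation rules are hypothetical, and only the default values are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CAPOConfig:
    """Hypothetical container for the hyperparameters listed in the table."""
    gae_lambda_short: float = 0.85   # short-horizon GAE parameter
    gae_lambda_long: float = 0.97    # long-horizon GAE parameter
    coherence_window: int = 4        # temporal coherence window H
    clip_min: float = 0.08           # minimum clipping bound
    clip_max: float = 0.25           # maximum clipping bound
    entropy_coef: float = 0.01       # entropy coefficient
    value_loss_coef: float = 0.5     # value loss coefficient

    def __post_init__(self) -> None:
        # Sanity checks chosen for this sketch; they are not stated in the paper.
        if not 0.0 <= self.gae_lambda_short <= self.gae_lambda_long <= 1.0:
            raise ValueError("expected 0 <= lambda_short <= lambda_long <= 1")
        if not 0.0 < self.clip_min <= self.clip_max:
            raise ValueError("expected 0 < clip_min <= clip_max")
        if self.coherence_window < 1:
            raise ValueError("coherence_window must be at least 1")


if __name__ == "__main__":
    print(CAPOConfig())
```

Freezing the dataclass keeps a run's settings from being modified after construction, which makes logged configurations easier to trust.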

| Environment | CAPO | PPO | A2C |
|---|---|---|---|
| CartPole-v1 | | | |
| Acrobot-v1 | | | |
| LunarLander-v2 | | | |
| MountainCarContinuous-v0 | | | |
| LunarLanderContinuous-v2 | | | |
| BipedalWalker-v3 | | | |

| Environment | Threshold | CAPO | PPO | A2C |
|---|---|---|---|---|
| CartPole-v1 | Reward | 63k steps | 72k steps | – |
| Acrobot-v1 | Reward | 58k steps | 99k steps | – |
| LunarLander-v2 | Reward | 451k steps | 121k steps | – |

| Environment | Threshold | CAPO | PPO | A2C |
|---|---|---|---|---|
| MountainCarContinuous-v0 | Reward | 27k steps | 21k steps | – |
| LunarLanderContinuous-v2 | Reward | 27k steps | 65k steps | – |
| BipedalWalker-v3 | Reward | 714k steps | 1524k steps | – |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.