CAPO: A Coherence-Adaptive Advantage Mechanism Instantiated in Proximal Policy Optimization

Mingjie Li; Jiajie Hu; Weijun Wang; Tao Zhang; Daoshun Xie; Zhihui Hu; Jiangfeng Xian

doi:10.20944/preprints202606.0185.v1

Submitted: 01 June 2026

Posted: 02 June 2026


Abstract

Performance degradation caused by temporal-difference signal fluctuations and the inflexibility of fixed proximal constraints remains a key challenge in policy optimization algorithms. To address these issues, this article develops a Coherence-Adaptive Proximal Optimization (CAPO) method. We first derive a temporal-coherence-aware advantage estimation mechanism by measuring the directional consistency of temporal-difference residuals within a local time window. Based on this mechanism, short-horizon and long-horizon advantage estimates are adaptively integrated to provide a more stable and temporally reliable policy improvement signal. CAPO is then instantiated within the PPO framework by incorporating the temporal coherence mechanism into both advantage construction and proximal policy updates. The proposed method preserves the stable optimization structure of PPO while adaptively adjusting the clipping range according to sample-wise temporal coherence. In this way, CAPO allows larger policy improvement steps when the learning signal is temporally consistent and imposes more conservative updates when the signal is unstable. Experiments on several representative OpenAI Gym control tasks show that CAPO achieves performance better than or comparable to that of standard PPO in most benchmark environments, with improved training stability and convergence behavior, especially in tasks where local temporal feedback provides reliable policy improvement information.

Keywords: deep reinforcement learning; proximal policy optimization; coherence-adaptive proximal optimization; advantage estimation; temporal-difference residual; policy gradient; continuous control

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

In deep reinforcement learning (DRL), the improvement of policy performance fundamentally relies on the reliable construction of policy update signals [1]. For policy-gradient-based reinforcement learning algorithms, an agent typically estimates the advantage function from sampled trajectories and uses this signal to guide the policy toward actions that yield higher long-term returns. Therefore, advantage estimation not only determines the direction of policy-gradient updates, but also directly affects the update magnitude, training stability, and final control performance. This issue becomes particularly critical in continuous-control and high-dimensional decision-making tasks, where environmental feedback often exhibits strong temporal correlations and local fluctuations. Value-function approximation errors, reward noise, and delayed trajectory returns can be propagated into the advantage function through temporal-difference residuals, thereby destabilizing the policy update signal [2,3,4]. Accordingly, extracting more reliable advantage information from temporal feedback remains a central challenge for improving the stability and convergence efficiency of deep policy optimization algorithms.

A substantial body of research has been devoted to improving policy-gradient estimation and stabilizing policy updates. Sutton et al. established the theoretical foundation of policy-gradient methods, providing a principled framework for directly optimizing parameterized policies [5]. Building upon this foundation, Konda and Tsitsiklis introduced the actor–critic architecture, in which a value-function estimator is incorporated to reduce the variance of policy-gradient estimates and improve the efficiency of policy learning [6]. To further constrain the instability induced by overly aggressive policy updates, Schulman et al. proposed trust region policy optimization (TRPO), which restricts the update magnitude by imposing a Kullback–Leibler (KL) divergence constraint between the old and new policies, thereby mitigating policy degradation caused by excessively large updates [7]. Subsequently, proximal policy optimization (PPO) replaced the computationally expensive second-order constrained optimization in TRPO with a clipped surrogate objective. Owing to its favorable balance between training stability and implementation simplicity, PPO has become one of the most widely adopted on-policy policy optimization algorithms [8]. In parallel, generalized advantage estimation (GAE) introduced a bias–variance trade-off parameter, λ, to provide a continuous interpolation between Monte Carlo returns and temporal-difference estimates, offering an effective advantage estimation tool for PPO and related policy optimization methods [9].

Despite the strong stability and general applicability demonstrated by the combination of PPO and GAE across a wide range of control tasks, both its advantage estimation scheme and proximal constraint mechanism still exhibit inherent limitations [10]. First, GAE typically relies on a fixed λ parameter to control the temporal scale of advantage estimation. A larger λ enables the estimator to incorporate more long-horizon return information, but it also makes the advantage estimates more susceptible to reward fluctuations and value-function approximation errors. In contrast, a smaller λ places greater emphasis on local temporal-difference (TD) residuals, which can reduce variance but may weaken the ability to perform long-term credit assignment. Since the temporal stability of feedback signals varies substantially across samples during training, a fixed λ is unlikely to simultaneously satisfy the estimation requirements of both stable trajectory segments and highly fluctuating ones. Second, PPO usually adopts a uniform clipping coefficient, ϵ, to constrain the policy probability ratio for all samples, without explicitly differentiating the reliability of their update signals [11]. When the signs of TD residuals remain consistent within a local temporal window, the corresponding advantage signal is often supported by a clear directional trend, in which case a fixed clipping boundary may unnecessarily restrict effective policy improvement. Conversely, when TD residuals change signs frequently within a local window, the current update signal may contain substantial noise; under such conditions, a uniform clipping constraint may be insufficient to suppress unreliable policy updates.

To address these limitations, recent studies have sought to further enhance policy learning from multiple perspectives, including exploration enhancement, value estimation improvement, and policy constraint optimization. Chen et al. designed a dynamically adjusted entropy mechanism that enables the algorithm to adaptively regulate exploration intensity according to different training stages, thereby improving policy search in the early phase of training [12]. Wang et al. further incorporated a cognitive entropy mechanism into the PPO framework to improve exploration efficiency and convergence performance in complex decision-making tasks [13]. Duan et al. proposed distributional soft actor–critic, which learns the state–action return distribution to alleviate Q-value overestimation and thereby improves policy performance in continuous-control tasks [14]. Liu et al. further introduced a distributional value function into policy-gradient estimation, allowing the policy gradient to exploit distributional value information rather than relying solely on expected value estimates, which enhances exploration capability [15]. In addition, Cheng et al. improved the stability of policy improvement by introducing a relative-entropy constraint into the actor–critic framework [16]. Shang et al. extended relative-entropy regularization to continuous dynamic policy programming, thereby improving both learning stability and sample efficiency in continuous action spaces [17].

It should be noted that the aforementioned studies mainly focus on enhancing exploration, improving value estimation accuracy, or constraining the magnitude of policy updates. However, within the standard PPO framework, the temporal scale of advantage estimation and the clipping boundary of policy updates are still typically governed by fixed hyperparameters, and these two components are not jointly adapted according to the reliability of local temporal feedback [18]. In other words, most existing methods implicitly assume that advantage signals at different time steps possess a comparable level of reliability, while overlooking the directional consistency exhibited by TD residuals within local temporal windows [19]. In fact, TD residuals not only reflect the estimation discrepancy between immediate rewards and the value function, but also encode implicit information about whether the return trend of the current trajectory segment is temporally stable. When the directions of TD residuals remain consistent across adjacent time steps, the corresponding policy improvement signal is supported by stronger temporal continuity. Conversely, when positive and negative TD residuals frequently cancel each other within a local window, the resulting advantage estimate may be substantially affected by noise, local perturbations, or value-estimation errors. Therefore, how to exploit the local directional consistency of TD residuals to construct a policy optimization mechanism that can simultaneously regulate advantage estimation and the policy clipping range remains an important problem worthy of further investigation [20].

Motivated by the above discussion, this paper proposes a coherence-adaptive proximal optimization algorithm (CAPO), which aims to improve the training stability and convergence efficiency of PPO through the coordinated design of advantage estimation and proximal policy constraints. Specifically, we first construct a temporal reliability measure based on the local directional consistency of TD residuals, which is used to evaluate whether the update signal of a given sample exhibits a stable trend within a local temporal window. Then, guided by the resulting temporal coherence coefficient, CAPO adaptively integrates short-horizon and long-horizon advantage estimates, enabling the advantage construction process to dynamically balance low variance and low bias according to the reliability of temporal feedback. Furthermore, the same temporal coherence coefficient is incorporated into the clipped surrogate objective of PPO to construct a sample-wise dynamic clipping boundary. In this way, policy updates are allowed to explore a broader optimization region for samples with high temporal coherence, while being subject to stricter proximal constraints for samples with low temporal coherence.

The main contributions of this work are summarized as follows.

(1): We propose a temporal coherence measurement mechanism based on the local directional consistency of TD residuals. By characterizing the stability of update signals within a local temporal window, this mechanism provides a unified sample-wise reliability criterion for both advantage estimation and policy updates. It enables the algorithm to distinguish stable trajectory segments from highly fluctuating ones, thereby reducing the dependence of policy optimization on a fixed form of temporal feedback modeling.
(2): We develop a temporal-coherence-aware advantage construction method that adaptively integrates short-horizon and long-horizon advantage estimates. The resulting advantage signal combines the merits of low variance and low bias. Compared with conventional advantage estimation based on a fixed GAE parameter, the proposed method can dynamically adjust the temporal scale of advantage estimation according to the reliability of local temporal feedback, thereby improving the stability of policy update signals.
(3): We propose a coherence-adaptive proximal optimization algorithm, termed CAPO. While preserving the stable optimization framework of PPO, CAPO embeds the temporal coherence coefficient into the clipped surrogate objective to form a sample-wise dynamic clipping mechanism. This design allows policy updates to automatically balance sufficient policy improvement and conservative proximal constraints according to signal reliability. Experimental results on multiple representative control tasks indicate that, relative to standard PPO, the proposed method achieves a more stable training process, faster convergence, and better or comparable control performance in most of the tested environments.

The remainder of this paper is organized as follows. Section 2 introduces the preliminaries of reinforcement learning, PPO, and advantage estimation. Section 3 presents the proposed temporal-coherence-aware advantage estimation and adaptive proximal update mechanisms, and further instantiates them within the PPO framework as CAPO. Section 4 describes the experimental design and reports the simulation results. Finally, Section 5 concludes this study and discusses potential directions for future work.

2. Preliminaries

This section introduces the theoretical foundations required for the proposed method. We first formulate the reinforcement learning problem as a Markov decision process (MDP). Then, we review the basic objective of proximal policy optimization (PPO). Finally, we describe the construction of generalized advantage estimation (GAE) and temporal-difference (TD) residuals, and analyze how the temporal scale of advantage estimation affects policy updates. These preliminaries provide the theoretical basis for the temporal-coherence-aware advantage estimation and adaptive proximal update mechanisms proposed in Section 3.

2.1. Markov Decision Process

Reinforcement learning is essentially a sequential decision-making problem that arises from the interaction between an agent and its environment. Such a problem is commonly formulated as a Markov decision process (MDP) [21]. A standard MDP can be defined as a five-tuple:

M = (S, A, P, R, γ),

(1)

where S denotes the state space, A denotes the action space, and P (s^{'} ∣ s, a) represents the state transition probability function, which specifies the conditional probability of transitioning to the next state s^{'} after executing action a in state s. The function R (s, a) denotes the immediate reward function, which characterizes the feedback obtained from the current state–action pair, and γ \in [0, 1) is the discount factor used to balance immediate rewards and long-term returns.

At time step t, the agent selects an action a_{t} from the current state s_{t} according to a policy π (a_{t} ∣ s_{t}). The environment then returns the next state s_{t + 1} and the immediate reward r_{t}. Through continuous interaction, the agent generates a trajectory:

τ = (s_{0}, a_{0}, r_{0}, s_{1}, a_{1}, r_{1}, \dots) .

(2)

The goal of reinforcement learning is to learn an optimal policy π^{*} that maximizes the expected discounted cumulative return, which can be formulated as:

π^{*} = arg max_{π_{θ}} J (θ) = arg max_{π_{θ}} E_{τ \sim π_{θ}} [\sum_{t = 0}^{\infty} γ^{t} r_{t}],

(3)

where τ denotes the trajectory induced by policy π_{θ}, s_{t} and a_{t} denote the state and action at time step t, respectively, and E_{τ \sim π_{θ}} [\cdot] represents the expectation over the trajectory distribution generated by the policy π_{θ}. For policy-gradient-based reinforcement learning algorithms, the policy parameters θ are iteratively updated by maximizing the objective function in Equation (3). The decision-making process of an MDP is illustrated in Figure 1.

2.2. Proximal Policy Optimization

Proximal policy optimization (PPO) is a representative on-policy policy optimization algorithm that has been widely used in deep reinforcement learning [22]. Its core idea is to improve training stability by constraining the magnitude of policy changes when updating the policy using trajectories sampled from the current policy. Let θ_{old} denote the parameters of the old policy and θ denote the parameters of the current policy to be optimized. The importance sampling ratio between the new and old policies is defined as:

r_{t} (θ) = \frac{π_{θ} (a_{t} ∣ s_{t})}{π_{θ_{old}} (a_{t} ∣ s_{t})} .

(4)

This ratio measures the change in the probability of selecting action a_{t} under state s_{t} when moving from the old policy to the current policy. Specifically, r_{t} (θ) > 1 indicates that the current policy increases the probability of selecting the action, whereas r_{t} (θ) < 1 indicates that the current policy decreases this probability.

Based on the importance sampling ratio, PPO constructs the following clipped surrogate objective:

L^{CLIP} (θ) = E_{t} [min (r_{t} (θ) {\hat{A}}_{t}, clip (r_{t} (θ), 1 - ϵ, 1 + ϵ) {\hat{A}}_{t})],

(5)

where \hat{A}_{t} denotes the advantage estimate at time step t, ϵ is the clipping coefficient, and clip (\cdot) restricts the policy probability ratio to the interval [1 - ϵ, 1 + ϵ]. The clipped objective plays a key role in stabilizing policy optimization. When the probability ratio between the new and old policies remains within a reasonable range, the algorithm follows the standard policy-gradient update. However, when the policy update becomes excessively large, the clipping mechanism prevents the objective from increasing further, thereby suppressing unstable updates and reducing training oscillations.
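The per-sample term inside the expectation of Equation (5) can be sketched as follows. This is an illustrative scalar version rather than a training implementation; the function name `clipped_surrogate` and the numerical values are assumptions made for the example.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float) -> float:
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage, ratio above 1 + eps: the objective is capped at 1.2 * A.
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0, eps=0.2)
# Negative advantage, ratio below 1 - eps: the objective is capped at 0.8 * A.
capped_loss = clipped_surrogate(ratio=0.5, advantage=-1.0, eps=0.2)
# Negative advantage, ratio above 1 + eps: the min keeps the unclipped,
# more pessimistic value, so the penalty is not capped.
uncapped_penalty = clipped_surrogate(ratio=1.5, advantage=-1.0, eps=0.2)
```

The third case shows why the outer min matters: clipping only removes the incentive to move the ratio further in the direction that would increase the objective.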

It should be emphasized that, in PPO, both the direction and intensity of policy updates are mainly determined by the advantage estimate \hat{A}_{t}. When \hat{A}_{t} > 0, the algorithm tends to increase the probability of selecting the corresponding action; when \hat{A}_{t} < 0, it tends to decrease this probability. Therefore, the quality of advantage estimation directly affects the effectiveness of policy optimization in PPO [23]. If the advantage signal is corrupted by noise, value-estimation errors, or local reward fluctuations, PPO may still update the policy along an unreliable direction even though the clipping mechanism limits the magnitude of policy changes. Consequently, constructing a more stable and reliable advantage signal is crucial for further improving the performance of PPO.

2.3. Generalized Advantage Estimation and TD Residual

In the actor–critic framework, the advantage function is commonly used to measure how much better or worse a specific action is compared with the average behavior under the current state [24]. Let V_{ϕ} (s_{t}) denote the state-value function parameterized by ϕ. The temporal-difference (TD) residual at time step t can be defined as:

δ_{t} = r_{t} + γ (1 - d_{t}) V_{ϕ} (s_{t + 1}) - V_{ϕ} (s_{t}),

(6)

where d_{t} denotes the terminal indicator. When d_{t} = 1, the episode terminates at step t, so the value of the next state is excluded. When d_{t} = 0, the TD residual incorporates both the immediate reward and the estimated value of the next state. Therefore, the TD residual reflects the discrepancy between the one-step feedback and the value-function prediction, and it serves as a fundamental building block for advantage estimation.
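Equation (6) can be sketched for a short batch of transitions as follows. The function name `td_residuals`, the argument layout (separate lists for V(s_t) and V(s_{t+1})), and the example numbers are assumptions made for illustration.

```python
from typing import List, Sequence


def td_residuals(
    rewards: Sequence[float],
    values: Sequence[float],
    next_values: Sequence[float],
    dones: Sequence[bool],
    gamma: float,
) -> List[float]:
    """One-step TD residuals: r_t + gamma * (1 - d_t) * V(s_{t+1}) - V(s_t)."""
    deltas = []
    for r, v, v_next, done in zip(rewards, values, next_values, dones):
        # When d_t = 1 the bootstrap term gamma * V(s_{t+1}) is dropped.
        bootstrap = 0.0 if done else gamma * v_next
        deltas.append(r + bootstrap - v)
    return deltas


# Hypothetical two-step rollout whose second transition ends the episode.
deltas = td_residuals(
    rewards=[1.0, 0.0],
    values=[0.5, 1.0],
    next_values=[1.0, 2.0],
    dones=[False, True],
    gamma=0.9,
)
# deltas[0] = 1.0 + 0.9 * 1.0 - 0.5, deltas[1] = 0.0 + 0.0 - 1.0
```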

To balance estimation bias and variance, generalized advantage estimation (GAE) constructs the advantage signal by exponentially accumulating multi-step TD residuals [25]. It is formulated as:

{\hat{A}}_{t}^{GAE} = \sum_{l = 0}^{T - t} {(γ λ)}^{l} δ_{t + l},

(7)

where λ \in [0, 1] is the bias–variance trade-off parameter. When λ is large, the advantage estimator incorporates more long-horizon return information, leading to lower bias but potentially higher variance. In contrast, when λ is small, the advantage estimate relies more heavily on short-horizon TD information, which reduces variance but may weaken long-term credit assignment.
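The exponentially weighted sum in Equation (7) is usually evaluated with a backward recursion. The sketch below assumes a single episode segment with no terminal transition inside it, so that the recursion A_t = δ_t + γλ A_{t+1} unrolls exactly to the sum of residuals; the function name `gae` and the toy residuals are hypothetical.

```python
from typing import List, Sequence


def gae(deltas: Sequence[float], gamma: float, lam: float) -> List[float]:
    """Advantages A_t = sum_l (gamma * lam)^l * delta_{t+l} for one segment."""
    advantages = [0.0] * len(deltas)
    running = 0.0
    # Backward pass: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


toy_deltas = [1.0, -2.0, 4.0]
# lam = 0: every advantage equals its own one-step residual.
one_step = gae(toy_deltas, gamma=0.5, lam=0.0)
# lam = 1: later residuals enter with weights 0.5 and 0.25.
multi_step = gae(toy_deltas, gamma=0.5, lam=1.0)
```

In this toy case the first advantage is 1.0 under both settings because the later residuals happen to cancel (0.5 × (−2) + 0.25 × 4 = 0), while the second advantage changes from −2.0 to 0.0.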

It can therefore be observed that λ essentially controls the temporal scale of advantage estimation. A fixed λ implies that all samples use the same temporal scale to construct advantage signals. However, during practical training, the stability of TD residuals varies across different trajectory segments. For trajectory segments in which the directions of TD residuals are relatively consistent, long-horizon advantage estimation can more fully exploit continuous feedback information. Conversely, for trajectory segments where positive and negative TD residuals fluctuate frequently, long-horizon accumulation may amplify unstable signals, making short-horizon advantage estimation more robust.

Based on this observation, the following section constructs a temporal coherence coefficient from the local directional consistency of TD residuals, which is used to adaptively integrate short-horizon and long-horizon advantages. Furthermore, this coefficient is introduced into the clipped objective of PPO to dynamically adjust the sample-wise proximal update range. In other words, the proposed method does not alter the fundamental optimization framework of PPO. Instead, it introduces a temporal-coherence-aware mechanism into two key components, namely advantage estimation and clipping constraints, thereby improving the reliability of policy update signals and the stability of the training process.

3. Methodology

This section presents the proposed coherence-adaptive proximal optimization method in detail. To address the limitations of standard PPO, where the temporal scale of advantage estimation is fixed and the proximal clipping boundary lacks sample-wise adaptability, we first construct a temporal coherence measurement mechanism based on the local directional consistency of TD residuals. Based on this mechanism, a temporal-coherence-aware advantage construction method is then developed by adaptively integrating short-horizon and long-horizon advantages, thereby producing a more reliable policy update signal. Finally, the constructed temporal coherence coefficient is further embedded into the clipped surrogate objective of PPO, leading to the proposed coherence-adaptive proximal optimization algorithm, termed CAPO.

3.1. Temporal Coherence Measurement

In standard PPO, policy updates typically rely on advantage estimates constructed by GAE [26]. However, the temporal scale in GAE is mainly controlled by a fixed parameter λ, making it difficult to adaptively adjust advantage estimation according to the stability of feedback signals in different trajectory segments. When the directions of TD residuals within a local temporal window are relatively consistent, the update signal in the corresponding trajectory segment exhibits stronger temporal continuity, and it is therefore appropriate to incorporate return information over a longer horizon. In contrast, when TD residuals frequently change signs within a local window, the update signal may be affected by reward noise, value-estimation errors, or local perturbations. In such cases, excessive long-horizon accumulation may amplify unreliable signals.

To characterize the above temporal feedback property, we first define the one-step TD residual based on the value function associated with the old policy:

δ_{t} = r_{t} + γ (1 - d_{t}) V_{ϕ_{old}} (s_{t + 1}) - V_{ϕ_{old}} (s_{t}),

(8)

where r_{t} denotes the immediate reward at time step t, V_{ϕ_{old}} (\cdot) denotes the value function corresponding to the old policy, γ is the discount factor, and d_{t} is the terminal indicator. When d_{t} = 1, the episode terminates at step t, and the value of the next state is excluded from the TD residual computation.

Let H denote the predefined length of the local coherence window. For time step t, we compute the signed accumulation and the absolute accumulation of local TD residuals starting from t, respectively, as follows:

S_{t} = \sum_{i = 0}^{H_{t} - 1} δ_{t + i},

(9)

B_{t} = \sum_{i = 0}^{H_{t} - 1} |δ_{t + i}|,

(10)

where H_{t} denotes the effective window length after considering terminal states. If a terminal state is encountered within the window, subsequent time steps are excluded from the coherence computation for the current time step.

Based on the above two quantities, the temporal coherence coefficient is defined as:

ω_{t} = clip (\frac{| S_{t} |}{B_{t} + η}, 0, 1),

(11)

where η is a small positive constant used to avoid division by zero, and ω_{t} \in [0, 1]. As shown in Equation (11), ω_{t} essentially characterizes the directional consistency of TD residuals within a local temporal window. When the TD residuals in the window have nearly consistent directions, the magnitude of the signed accumulation approaches the absolute accumulation, namely | S_{t} | \approx B_{t}, and thus ω_{t} approaches 1. Conversely, when positive and negative TD residuals frequently cancel each other, | S_{t} | becomes significantly smaller than B_{t}, and ω_{t} approaches 0.
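Equations (9)–(11) can be sketched as a sliding-window computation over a rollout buffer. The version below rests on two assumptions that the text does not fix: the step carrying the terminal flag is itself counted and only later steps are dropped, and the window is also truncated at the end of the buffer. The function name `temporal_coherence` and the toy residuals are hypothetical.

```python
from typing import List, Sequence


def temporal_coherence(
    deltas: Sequence[float],
    dones: Sequence[bool],
    window: int,
    eta: float = 1e-8,
) -> List[float]:
    """omega_t = clip(|S_t| / (B_t + eta), 0, 1) over a forward window of H steps."""
    n = len(deltas)
    omegas = []
    for t in range(n):
        signed, absolute = 0.0, 0.0
        for i in range(window):
            k = t + i
            if k >= n:  # assumption: truncate at the end of the buffer
                break
            signed += deltas[k]
            absolute += abs(deltas[k])
            if dones[k]:  # assumption: steps after a terminal are excluded
                break
        omegas.append(min(1.0, max(0.0, abs(signed) / (absolute + eta))))
    return omegas


toy = [1.0, 1.0, 1.0, -1.0, 1.0, -1.0]
omegas = temporal_coherence(toy, dones=[False] * 6, window=3)
# t = 0: residuals (1, 1, 1) agree in sign, so omega is close to 1.
# t = 1: residuals (1, 1, -1) give |S| = 1 and B = 3, so omega is about 1/3.
# t = 4: residuals (1, -1) cancel, so omega is 0.
```

Under these assumptions, a window that is truncated to a single step always yields ω_{t} close to 1 (here at t = 5), since a single residual cannot disagree with itself; how such boundary samples should be treated is a design choice not settled by the equations.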

It should be emphasized that ω_{t} does not directly measure the magnitude of rewards or returns. Instead, it describes whether the current policy update signal exhibits a consistent temporal trend within a local time range. Therefore, this coefficient provides a unified sample-wise reliability criterion for subsequent advantage estimation and policy clipping. By introducing local temporal coherence into the modeling of policy update signals, CAPO compares the signed accumulation strength and the overall fluctuation strength of TD residuals within a local window, thereby determining whether the current advantage signal is supported by a stable temporal trend. In this way, CAPO can distinguish stable feedback from fluctuating feedback at the sample level, providing a unified basis for the subsequent adjustment of the advantage estimation time scale and the adaptive design of the proximal clipping boundary.
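The range of ω_{t} can be checked directly from Equations (9)–(11). The short derivation below uses only the triangle inequality and the assumption η > 0; it introduces no new symbols.

```latex
|S_t| = \left| \sum_{i=0}^{H_t-1} \delta_{t+i} \right|
\;\le\; \sum_{i=0}^{H_t-1} \left| \delta_{t+i} \right| = B_t
\quad\Longrightarrow\quad
0 \;\le\; \frac{|S_t|}{B_t + \eta} \;<\; 1 .
```

Under this reading, the clip operation in Equation (11) acts only as a numerical safeguard, and |S_{t}| = B_{t} holds exactly when all nonzero residuals in the window share the same sign.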

3.2. Coherence-Adaptive Advantage Construction

After obtaining the temporal coherence coefficient, we further construct a coherence-adaptive advantage estimator. Standard GAE uses a single λ parameter to control the temporal scale of advantage estimation. However, a fixed λ cannot easily accommodate both stable and fluctuating trajectory segments at the same time. To address this issue, we define short-horizon and long-horizon advantage estimates separately, and then adaptively integrate them using the temporal coherence coefficient.

Specifically, given a smaller parameter λ_{s} and a larger parameter λ_{l}, where 0 \leq λ_{s} < λ_{l} \leq 1, the short-horizon advantage estimate is defined as:

{\hat{A}}_{t}^{s} = \sum_{l = 0}^{T - t} {(γ λ_{s})}^{l} δ_{t + l},

(12)

and the long-horizon advantage estimate is defined as:

{\hat{A}}_{t}^{l} = \sum_{l = 0}^{T - t} {(γ λ_{l})}^{l} δ_{t + l},

(13)

where \hat{A}_{t}^{s} relies more heavily on local TD information and is therefore lower in variance and more conservative, whereas \hat{A}_{t}^{l} incorporates more long-horizon return information and can provide more complete credit assignment. Nevertheless, when TD residuals fluctuate strongly, the long-horizon estimate may introduce higher variance.

Based on the above two advantage estimates, the CAPO advantage is constructed as:

{\hat{A}}_{t}^{CAPO} = ω_{t} {\hat{A}}_{t}^{l} + (1 - ω_{t}) {\hat{A}}_{t}^{s} .

(14)

As shown in Equation (14), the CAPO advantage is a convex combination of the short-horizon and long-horizon advantages. When ω_{t} approaches 1, the local TD residuals exhibit strong directional consistency, indicating that the current update signal is supported by a stable temporal trend. In this case, CAPO assigns a larger weight to the long-horizon advantage estimate to more fully exploit long-term return information. Conversely, when ω_{t} approaches 0, the local TD residuals show substantial directional conflicts, indicating that the current update signal has lower reliability. In this case, CAPO relies more on the short-horizon advantage estimate to reduce the interference of unstable signals in policy updates.
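The convex combination in Equation (14) can be sketched element-wise as follows, assuming the short-horizon advantages, long-horizon advantages, and coherence coefficients have already been computed for the same time steps. The function name `blend_advantages` and the numbers are hypothetical.

```python
from typing import List, Sequence


def blend_advantages(
    adv_short: Sequence[float],
    adv_long: Sequence[float],
    omegas: Sequence[float],
) -> List[float]:
    """A_t = omega_t * A_t^long + (1 - omega_t) * A_t^short, per sample."""
    return [
        w * a_long + (1.0 - w) * a_short
        for a_short, a_long, w in zip(adv_short, adv_long, omegas)
    ]


# Same pair of estimates, different coherence: omega = 1 returns the
# long-horizon value, omega = 0.25 stays close to the short-horizon one.
blended = blend_advantages(
    adv_short=[1.0, 1.0],
    adv_long=[3.0, 3.0],
    omegas=[1.0, 0.25],
)
```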

It should be noted that the proposed advantage construction does not empirically select among several fixed λ values. Instead, it uses local temporal coherence as an adaptive modulation factor for the temporal scale of advantage estimation. For samples with stable temporal feedback, CAPO increases the contribution of the long-horizon advantage, allowing the policy update to better exploit return information across multiple time steps. For samples with fluctuating temporal feedback, CAPO strengthens the influence of the short-horizon advantage to suppress the propagation of local noise and value-estimation errors during multi-step accumulation. As a result, the constructed advantage signal can dynamically balance low bias and low variance according to the feedback reliability of different trajectory segments, thereby providing a more stable optimization basis for subsequent policy updates.
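Substituting Equations (12) and (13) into Equation (14) makes this modulation of the temporal scale explicit. The rearrangement below is only algebra on the stated definitions; the weight symbol c_{t,k} is introduced here for illustration, and the summation index is written as k to avoid a clash with the subscript of λ_{l}.

```latex
\hat{A}_t^{CAPO}
= \sum_{k=0}^{T-t} c_{t,k}\, \delta_{t+k},
\qquad
c_{t,k} = \omega_t \,(\gamma \lambda_l)^{k} + (1-\omega_t)\,(\gamma \lambda_s)^{k} .
```

Since c_{t,0} = 1 for every ω_{t}, the one-step residual δ_{t} always enters with unit weight; ω_{t} only changes how quickly the weights on later residuals decay, between the rate γλ_{s} and the rate γλ_{l}.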

To further improve training stability, the CAPO advantage is normalized within each training batch:

{\bar{A}}_{t}^{CAPO} = \frac{{\hat{A}}_{t}^{CAPO} - μ_{A}}{σ_{A} + η},

(15)

where μ_{A} and σ_{A} denote the mean and standard deviation of \hat{A}_{t}^{CAPO} in the current batch, respectively. The normalization operation mitigates the influence of advantage-scale variations on the magnitude of policy-gradient updates across different training stages. In practical implementation, the normalized advantage can also be clipped to prevent extreme advantage values from causing excessively large gradient updates.
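Equation (15), together with the optional clipping mentioned above, can be sketched as follows. The use of the population standard deviation and the `clip_value` argument are assumptions; the paper does not state which variance estimator or clipping threshold is used.

```python
import math
from typing import List, Optional, Sequence


def normalize_advantages(
    advantages: Sequence[float],
    eta: float = 1e-8,
    clip_value: Optional[float] = None,
) -> List[float]:
    """Batch-normalize advantages as (A - mean) / (std + eta), then optionally clip."""
    n = len(advantages)
    mean = sum(advantages) / n
    # Assumption: population (biased) standard deviation over the batch.
    std = math.sqrt(sum((a - mean) ** 2 for a in advantages) / n)
    normalized = [(a - mean) / (std + eta) for a in advantages]
    if clip_value is not None:
        # Hypothetical symmetric clipping of extreme normalized values.
        normalized = [max(-clip_value, min(clip_value, a)) for a in normalized]
    return normalized


plain = normalize_advantages([1.0, 3.0])                    # about [-1, 1]
clipped = normalize_advantages([1.0, 3.0], clip_value=0.5)  # [-0.5, 0.5]
```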

3.3. Coherence-Adaptive Proximal Policy Optimization

Based on the coherence-adaptive advantage estimator constructed in the previous subsection, we further embed the temporal coherence coefficient into the proximal policy update objective of PPO, leading to the proposed CAPO algorithm. It should be emphasized that the core of CAPO is not to alter the fundamental proximal optimization framework of PPO, but to use temporal coherence to jointly modulate advantage estimation and the clipping boundary. In this way, the policy update can adaptively adjust its update range according to the reliability of local temporal feedback.

Let θ_{old} denote the parameters of the old policy and θ denote the parameters of the current policy to be optimized. For the state–action pair (s_{t}, a_{t}) at time step t, the importance sampling ratio is defined as:

r_{t} (θ) = \frac{π_{θ} (a_{t} ∣ s_{t})}{π_{θ_{old}} (a_{t} ∣ s_{t})} .

(16)

In standard PPO, the clipping coefficient ϵ is usually treated as a global fixed hyperparameter. In contrast, CAPO constructs a sample-wise dynamic clipping boundary according to the temporal coherence coefficient:

ϵ_{t} = ϵ_{min} + ω_{t} (ϵ_{max} - ϵ_{min}),

(17)

where ϵ_{min} and ϵ_{max} denote the lower and upper bounds of the clipping range, respectively. Since ω_{t} \in [0, 1], the dynamic clipping coefficient satisfies:

ϵ_{t} \in [ϵ_{min}, ϵ_{max}] .

(18)

When ω_{t} is large, the advantage signal of the current sample is supported by stronger local temporal coherence. CAPO therefore applies a relatively relaxed proximal constraint, allowing the policy to make fuller use of reliable feedback. Conversely, when ω_{t} is small, the advantage signal may be affected by local noise, value-estimation errors, or short-term fluctuations. In this case, CAPO adopts a more conservative proximal constraint to reduce the interference of unreliable updates during policy iteration. Through this design, CAPO transforms the globally fixed clipping boundary in standard PPO into a temporal-coherence-driven sample-wise dynamic clipping mechanism, thereby improving the adaptability of policy updates to variations in local feedback quality.

Based on the CAPO advantage estimate and the dynamic clipping boundary, the actor objective of CAPO is formulated as:

L_{actor}^{CAPO} (θ) = E_{t} [min (r_{t} (θ) {\bar{A}}_{t}^{CAPO}, clip (r_{t} (θ), 1 - ϵ_{t}, 1 + ϵ_{t}) {\bar{A}}_{t}^{CAPO})],

(19)

where {\bar{A}}_{t}^{CAPO} denotes the normalized CAPO advantage estimate, and ϵ_{t} is the sample-wise dynamic clipping coefficient. As shown in Equation (19), CAPO preserves the basic structure of the PPO clipped surrogate objective, while replacing the fixed-horizon advantage estimate and the fixed clipping boundary with a temporal-coherence-aware advantage estimate and a dynamic proximal constraint, respectively.
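To make the role of the per-sample clipping coefficient concrete, the sketch below evaluates a sample mean of the objective in Equation (19) over a small batch. It is plain Python without automatic differentiation, so it only illustrates the forward computation; the function name and the example numbers are assumptions made for illustration.

```python
def capo_actor_objective(ratios, advantages, epsilons):
    """Batch mean of Eq. (19), with one clip range per sample."""
    terms = []
    for r, adv, eps in zip(ratios, advantages, epsilons):
        # Clip the ratio to [1 - eps_t, 1 + eps_t] for this sample only.
        clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
        # Pessimistic (minimum) choice between unclipped and clipped terms.
        terms.append(min(r * adv, clipped_r * adv))
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Same ratio and advantage, different coherence-driven clip ranges.
    low_coherence = capo_actor_objective([1.25], [1.0], [0.1])
    high_coherence = capo_actor_objective([1.25], [1.0], [0.3])
    print(low_coherence, high_coherence)
```

With a ratio of 1.25 and a positive advantage, the sample with ϵ_{t} = 0.1 is clipped at 1.1, whereas the sample with ϵ_{t} = 0.3 is not clipped, so only the latter still contributes a gradient with respect to the ratio.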

In practical optimization, to remain consistent with the standard actor–critic framework, we further incorporate the value-function loss and the entropy regularization term. The overall optimization loss of CAPO is defined as:

L_{CAPO} (θ, ϕ) = - L_{actor}^{CAPO} (θ) + c_{v} L_{V} (ϕ) - c_{e} H (π_{θ}),

(20)

where c_{v} and c_{e} denote the coefficients of the value-function loss and entropy regularization, respectively, and H (π_{θ}) denotes the policy entropy. The value-function loss is used to constrain the estimation error of the critic and is written as:

L_{V} (ϕ) = E_{t} [{(V_{ϕ} (s_{t}) - {\hat{V}}_{t})}^{2}],

(21)

where {\hat{V}}_{t} denotes the value-learning target. In this work, the value target is constructed by combining the unnormalized CAPO advantage with the old value estimate:

{\hat{V}}_{t} = {\hat{A}}_{t}^{CAPO} + V_{ϕ_{old}} (s_{t}) .

(22)

This design aligns the critic learning target with the CAPO advantage construction. Unlike directly using an independent return target, the proposed value target incorporates adaptive advantage information, allowing the critic supervision signal to reflect the modulation effect of local temporal coherence on policy updates. Consequently, the policy improvement on the actor side and the value estimation on the critic side are coordinated under the same temporal reliability mechanism. This coordination helps reduce the objective mismatch between advantage construction and value learning, thereby further enhancing the stability of the training process.
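Equations (20) to (22) can be combined into a short forward computation. The sketch below assumes batch means for the expectations, and the default coefficients `c_v=0.5` and `c_e=0.01` are common PPO-style placeholders rather than values confirmed by the paper.

```python
def value_targets(capo_advantages, old_values):
    """Eq. (22): target = unnormalized CAPO advantage + old value estimate."""
    return [adv + v_old for adv, v_old in zip(capo_advantages, old_values)]


def value_loss(values, targets):
    """Eq. (21): mean squared error between critic outputs and targets."""
    return sum((v - t) ** 2 for v, t in zip(values, targets)) / len(values)


def capo_total_loss(actor_objective, v_loss, entropy, c_v=0.5, c_e=0.01):
    """Eq. (20): minimize -L_actor + c_v * L_V - c_e * H."""
    return -actor_objective + c_v * v_loss - c_e * entropy


if __name__ == "__main__":
    targets = value_targets([1.0, -0.5], [2.0, 3.0])
    loss_v = value_loss([3.0, 2.0], targets)
    print(targets, loss_v, capo_total_loss(1.0, loss_v, entropy=1.0))
```

Note that the unnormalized advantage is used for the target, so normalization on the actor side does not rescale what the critic is asked to predict.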

3.4. Mechanism Analysis

The core idea of CAPO can be summarized as using the local directional consistency of TD residuals to jointly regulate the temporal scale of advantage estimation and the constraint range of policy updates. Compared with standard PPO, CAPO exhibits two main mechanistic characteristics.

First, CAPO enables adaptive adjustment of the temporal scale in advantage estimation. Standard GAE relies on a fixed λ parameter, which means that all samples share the same temporal scale for advantage construction. In contrast, CAPO simultaneously constructs short-horizon and long-horizon advantage estimates and adaptively integrates them using the temporal coherence coefficient ω_{t}. When the local feedback signal is stable, CAPO increases the contribution of the long-horizon advantage, allowing the policy update to exploit more long-term return information. When the local feedback signal is unstable, CAPO increases the contribution of the short-horizon advantage, making the policy update more conservative. This mechanism alleviates the limited adaptability of fixed-parameter GAE under different sample conditions.
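The two-scale construction can be sketched with the standard backward GAE recursion run twice. The fusion rule below is an assumption: a convex combination weighted by ω_{t} is the simplest form consistent with the limiting cases described in this section, but the exact rule is defined in the previous subsection and may differ. The values `lam_s=0.5` and `lam_l=0.95` are illustrative placeholders.

```python
def gae(deltas, dones, gamma, lam):
    """Standard GAE: A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}."""
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * (1.0 - dones[t]) * running
        advantages[t] = running
    return advantages


def fuse_advantages(short_adv, long_adv, omegas):
    """ASSUMED fusion: (1 - omega_t) * A_short + omega_t * A_long."""
    return [(1.0 - w) * a_s + w * a_l
            for a_s, a_l, w in zip(short_adv, long_adv, omegas)]


if __name__ == "__main__":
    deltas = [1.0, 1.0, 1.0]       # TD residuals from the old critic
    dones = [0.0, 0.0, 0.0]        # no episode boundary in this toy rollout
    a_short = gae(deltas, dones, gamma=0.99, lam=0.5)   # placeholder lam_s
    a_long = gae(deltas, dones, gamma=0.99, lam=0.95)   # placeholder lam_l
    print(fuse_advantages(a_short, a_long, omegas=[0.0, 0.5, 1.0]))
```

Under this assumed rule, each fused advantage always lies between the short-horizon and long-horizon estimates for the same time step.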

Second, CAPO enables sample-wise adaptation of the proximal clipping boundary. Standard PPO applies a uniform clipping coefficient to all samples, and thus cannot distinguish whether the update signal of each sample is reliable. In contrast, CAPO dynamically computes ϵ_{t} according to ω_{t}, such that samples with high temporal coherence are assigned a larger policy update region, whereas samples with low temporal coherence are subject to stricter proximal constraints. Therefore, CAPO does not simply enlarge or shrink the clipping range of PPO. Instead, it dynamically balances sufficient policy improvement and conservative updates according to the reliability of the local update signal.

Furthermore, when ω_{t} = 0, CAPO reduces to a conservative update form that uses only the short-horizon advantage and the minimum clipping range:

{\hat{A}}_{t}^{CAPO} = {\hat{A}}_{t}^{s}, ϵ_{t} = ϵ_{min} .

(23)

When ω_{t} = 1, CAPO reduces to a less restrictive update form that uses only the long-horizon advantage and the maximum clipping range:

{\hat{A}}_{t}^{CAPO} = {\hat{A}}_{t}^{l}, ϵ_{t} = ϵ_{max} .

(24)

Therefore, CAPO can be regarded as a continuous adaptive policy optimization mechanism that connects short-horizon conservative updates and long-horizon sufficient updates. This mechanism enables policy optimization to move beyond a single fixed temporal scale and a uniform clipping boundary, allowing the update behavior to be automatically adjusted according to local temporal feedback characteristics.

3.5. Algorithm Summary

The overall procedure of CAPO is summarized as follows. First, the agent interacts with the environment using the old policy π_{θ_{old}} and collects states, actions, rewards, terminal indicators, action log probabilities, and state-value estimates. Then, the TD residuals at each time step are computed using the old value function, and the temporal coherence coefficient is evaluated within a local temporal window. Next, the short-horizon and long-horizon advantages are constructed using λ_{s} and λ_{l}, respectively, and are adaptively fused by the temporal coherence coefficient to obtain the CAPO advantage. Finally, the CAPO advantage and the dynamic clipping boundary are embedded into the PPO clipped surrogate objective, and the actor and critic are jointly optimized.

The algorithmic comparison between PPO and CAPO is illustrated in Figure 2, and the pseudo-code of CAPO is presented in Algorithm 1.

Without modifying the fundamental actor–critic architecture of PPO, the proposed algorithm introduces a temporal-coherence-driven mechanism for jointly regulating advantage estimation and proximal policy constraints. Specifically, CAPO reconstructs the policy update signal through the adaptive fusion of short-horizon and long-horizon advantages, and dynamically adjusts the proximal clipping range according to local temporal coherence. In this manner, policy optimization is allowed to perform more sufficient improvement under reliable feedback, while maintaining conservative updates under fluctuating feedback. Meanwhile, this work further characterizes the update behavior of CAPO from multiple perspectives, including the temporal coherence level, dynamic clipping magnitude, policy distribution variation, and clipping ratio. These analyses provide a basis for subsequently evaluating the training stability and convergence characteristics of the proposed method in the experimental section.

Algorithm 1: The proposed CAPO algorithm.

Input: Policy parameters θ, value parameters ϕ, discount factor γ, coherence window length H, GAE parameters λ_{s} and λ_{l}, clipping bounds ϵ_{min} and ϵ_{max}.

Output: Optimized policy parameters θ.

1: Initialize θ_{old} \leftarrow θ and ϕ_{old} \leftarrow ϕ.

2: for each training iteration do

3: Collect trajectories using the old policy π_{θ_{old}}.

4: Compute TD residuals using the old value function.

5: Compute the temporal coherence coefficient ω_{t} within a local window.

6: Construct the short-horizon advantage {\hat{A}}_{t}^{s} with λ_{s}.

7: Construct the long-horizon advantage {\hat{A}}_{t}^{l} with λ_{l}.

8: Fuse the two advantages to obtain {\hat{A}}_{t}^{CAPO}.

9: Normalize {\hat{A}}_{t}^{CAPO} to obtain {\bar{A}}_{t}^{CAPO}.

10: Compute the dynamic clipping coefficient ϵ_{t}.

11: Optimize the CAPO clipped surrogate objective.

12: Update the critic by minimizing the value-function loss.

13: Update old parameters: θ_{old} \leftarrow θ, ϕ_{old} \leftarrow ϕ.

14: end for
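Steps 4, 5, and 9 of Algorithm 1 can be illustrated with the sketch below. The coherence measure here is a hypothetical stand-in, not the paper's definition: it takes the absolute mean sign of the TD residuals in a forward window of length H, which yields a value in [0, 1] that equals 1 when all residuals in the window share a sign. The forward window alignment and the zero-mean, unit-variance normalization are likewise assumptions.

```python
import math


def td_residuals(rewards, values, next_values, dones, gamma):
    """Step 4: delta_t = r_t + gamma * (1 - done_t) * V(s_{t+1}) - V(s_t)."""
    return [r + gamma * (1.0 - d) * nv - v
            for r, v, nv, d in zip(rewards, values, next_values, dones)]


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def coherence(deltas, window):
    """Step 5 (HYPOTHETICAL measure): |mean sign| over deltas[t : t + window]."""
    omegas = []
    for t in range(len(deltas)):
        chunk = deltas[t:t + window]  # shorter near the end of the rollout
        omegas.append(abs(sum(sign(d) for d in chunk)) / len(chunk))
    return omegas


def normalize(advantages, eps=1e-8):
    """Step 9 (assumed form): zero-mean, unit-variance standardization."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    return [(a - mean) / (math.sqrt(var) + eps) for a in advantages]


if __name__ == "__main__":
    deltas = [1.0, 2.0, -1.0, -3.0]
    print(coherence(deltas, window=2))
    print(normalize(deltas))
```

In this toy rollout the second time step sees one positive and one negative residual in its window, so its coherence drops to zero, while the other steps see consistent signs.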

4. Experimental Results and Analysis

4.1. Experimental Environments and Settings

To systematically evaluate the effectiveness of the proposed CAPO method, comparative experiments were conducted on six representative control tasks from the OpenAI Gym platform. The selected environments cover both discrete-action and continuous-action control tasks, and span different levels of difficulty, ranging from basic balance control and underactuated system control to high-dimensional continuous locomotion control. Specifically, the benchmark environments include CartPole-v1, Acrobot-v1, LunarLander-v2, MountainCarContinuous-v0, LunarLanderContinuous-v2, and BipedalWalker-v3. These environments differ significantly in terms of state-space dimensionality, action-space type, dynamical complexity, and task objective structure, thereby providing a comprehensive testbed for evaluating the learning efficiency, training stability, and final control performance of different policy optimization methods. Illustrations of the experimental environments are shown in Figure 3.

Three policy optimization algorithms were compared: the proposed CAPO, PPO, and A2C. PPO serves as the direct baseline and is used to verify the effectiveness of the temporal-coherence-aware advantage estimation and dynamic proximal clipping mechanisms introduced into the standard PPO framework. A2C is adopted as a representative advantage-based actor–critic method to further examine the performance differences among different policy optimization mechanisms in typical control tasks. Since all three algorithms are policy-gradient-based reinforcement learning methods, they provide a targeted and comparable basis for experimental evaluation.

To ensure fairness in comparison, all algorithms were implemented and evaluated under a unified training framework. Except for their specific policy update mechanisms, the network architecture, optimizer settings, training steps, environment interaction procedure, and major training configurations were kept as consistent as possible across different methods. For each algorithm in each benchmark environment, six independent runs were conducted using different random seeds, and the resulting performance was statistically analyzed. In the reward curves, the solid line represents the mean episodic reward over multiple independent runs, while the shaded region denotes the variation range across different random seeds. By performing repeated experiments and reporting averaged results, the influence of random initialization, sampling noise, and environmental stochasticity on individual runs can be effectively reduced, enabling a more objective evaluation of the overall learning performance and training stability of different algorithms.

The experiments were conducted on a workstation equipped with an Intel(R) Core(TM) i5-14600KF processor, an NVIDIA GeForce RTX 5070 GPU with 12 GB of memory, and 32 GB of system RAM. The main hyperparameter settings are summarized in Table 1.

4.2. Experimental Results

To visually compare the learning performance of different algorithms across the six control tasks, Figure 4 presents the mean episodic reward curves of CAPO, PPO, and A2C in each benchmark environment. The solid lines denote the averaged results over multiple independent runs with different random seeds, while the shaded regions indicate the corresponding standard deviations. A higher mean episodic reward generally indicates that the agent obtains a larger cumulative return and achieves better control performance in the corresponding task. For negative-reward tasks such as Acrobot-v1, a mean episodic reward closer to zero indicates better performance.

To further quantitatively evaluate the stable control performance of different algorithms in the later stage of training, we calculate the average episodic reward over the last 30% of the training process. The results are reported in the form of mean ± standard deviation over six independent runs. The mean value reflects the overall final-stage performance of each algorithm, while the standard deviation indicates the stability of training across different random seeds. The statistical results of the final-stage average rewards in the six environments are summarized in Table 2.
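The final-stage metric described above can be sketched as follows. Two details are assumptions, since the text does not specify them: the last 30% is taken over the logged sequence of episodic rewards of each run, and the spread across seeds is computed as the sample standard deviation (`statistics.stdev`) rather than the population form.

```python
import statistics


def final_stage_score(rewards, fraction=0.3):
    """Mean episodic reward over the last `fraction` of one training run."""
    n_tail = max(1, round(len(rewards) * fraction))
    return statistics.mean(rewards[-n_tail:])


def summarize_runs(runs, fraction=0.3):
    """Mean and sample standard deviation of final-stage scores across seeds."""
    scores = [final_stage_score(run, fraction) for run in runs]
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Two toy runs with ten logged rewards each; real runs would use six seeds.
    run_a = [0, 0, 0, 0, 0, 0, 0, 10, 20, 30]
    run_b = [0, 0, 0, 0, 0, 0, 0, 20, 30, 40]
    mean, std = summarize_runs([run_a, run_b])
    print(f"{mean:.2f} ± {std:.2f}")
```

With only six seeds per setting, the choice between sample and population standard deviation changes the reported spread noticeably, so it is worth stating which one is used.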

As shown in Table 2, CAPO achieves the best final-stage performance in four out of the six benchmark environments, including CartPole-v1, Acrobot-v1, LunarLanderContinuous-v2, and BipedalWalker-v3. In particular, CAPO exhibits clear advantages in Acrobot-v1 and BipedalWalker-v3, where it not only obtains higher average rewards but also shows smaller standard deviations than PPO, indicating improved training stability and robustness across different random seeds. In CartPole-v1, CAPO also reaches a higher final-stage reward with lower variability, suggesting that the proposed coherence-adaptive mechanism can improve convergence consistency in relatively simple discrete-control tasks.

For LunarLander-v2 and MountainCarContinuous-v0, PPO achieves higher final-stage average rewards than CAPO. In LunarLander-v2, PPO obtains the best mean reward, while CAPO still substantially outperforms A2C, indicating that the proposed method maintains competitive learning ability in discrete-action landing control. In MountainCarContinuous-v0, the performance gap between CAPO and PPO is relatively small, and both algorithms achieve stable positive rewards, whereas A2C fails to learn an effective control policy. These results suggest that CAPO does not universally dominate PPO in all environments, but it provides more stable and competitive performance across tasks with different action spaces and dynamical characteristics.

Overall, the experimental results demonstrate that the proposed CAPO method can effectively improve the stability and final-stage performance of PPO in most benchmark tasks. By adaptively regulating the advantage estimation horizon and the proximal clipping range according to local temporal coherence, CAPO is able to exploit reliable feedback more sufficiently while suppressing unstable updates caused by fluctuating temporal signals. This mechanism contributes to more consistent learning behavior, especially in environments with complex dynamics or high-dimensional continuous control requirements.

4.2.1. Overall Comparison

As shown in Figure 4, CAPO achieves better or comparable learning performance to PPO in most benchmark environments and substantially outperforms A2C overall. Specifically, in CartPole-v1, Acrobot-v1, LunarLanderContinuous-v2, and BipedalWalker-v3, CAPO exhibits a faster reward improvement trend, higher final-stage average rewards, or more stable training curves. The advantages of CAPO are particularly evident in continuous-action control tasks such as LunarLanderContinuous-v2 and BipedalWalker-v3, indicating that the proposed temporal-coherence-aware advantage estimation and dynamic proximal clipping mechanisms can effectively improve the quality of policy updates in complex continuous-control scenarios.

Compared with standard PPO, the advantage of CAPO is not simply reflected in achieving the highest final reward across all tasks. Instead, its improvement is mainly manifested in two aspects. First, CAPO can improve episodic rewards more rapidly during the early or middle stage of training, suggesting that the constructed policy update signal is more effective. Second, CAPO maintains higher final-stage rewards in several high-dimensional continuous-control tasks, indicating better training stability and control performance under complex dynamics. In contrast, A2C generally exhibits slower convergence and markedly lower final-stage rewards than PPO and CAPO in most environments. This result suggests that actor–critic methods without proximal policy constraints are more susceptible to gradient fluctuations and sampling noise in complex control tasks.

It should also be noted that CAPO does not achieve clearly superior results over PPO in LunarLander-v2 and MountainCarContinuous-v0. This indicates that the performance gain of CAPO is closely related to the reward structure, action-space type, and whether local TD residuals can effectively reflect the direction of policy improvement. For tasks with special reward structures or those that are relatively easy to approach a performance plateau, the improvement margin of CAPO may be limited.

4.2.2. Results on Discrete-Action Control Tasks

In the CartPole-v1 environment, both PPO and CAPO are able to rapidly learn effective control policies, but their final-stage performance exhibits noticeable differences. As shown in Figure 4(a), CAPO gradually reaches a higher reward level during the middle and later stages of training and more stably approaches the upper reward limit of the task. According to Table 2, the final-stage average reward of CAPO is 394.58 ± 4.90, which is higher than that of PPO (371.73 ± 13.79) and A2C (45.93 ± 6.29). Compared with PPO, CAPO improves the final-stage average reward by 22.85 and achieves a markedly lower standard deviation, indicating stronger policy retention capability after approaching near-optimal control performance. This result suggests that, in a basic discrete balance-control task, CAPO can further improve final-stage control performance while preserving the stable optimization characteristics of PPO.

In the Acrobot-v1 environment, the agent is required to swing an underactuated two-link system to a target height. Since this task typically adopts negative rewards, a mean episodic reward closer to zero indicates better control performance. As shown in Figure 4(b), both CAPO and PPO effectively improve the episodic reward, whereas the learning process of A2C is markedly unstable. Table 2 shows that CAPO achieves a final-stage average reward of -81.51 ± 1.35, outperforming PPO (-94.77 ± 3.55) and A2C (-250.27 ± 192.86). Compared with PPO, CAPO improves the average reward by 13.26 and exhibits a smaller final-stage standard deviation, indicating better stability in the underactuated control task. From the perspective of convergence speed, CAPO reaches the reward level of -100 after approximately 58k environment interaction steps, whereas PPO requires approximately 99k steps to reach a comparable level. This further indicates that CAPO can learn an effective swing-up control strategy more rapidly.

In the LunarLander-v2 environment, the agent needs to complete attitude adjustment, deceleration, and safe landing within a discrete action space. Compared with the previous two discrete-control tasks, this task is more complex and includes stronger terminal feedback and collision penalties in its reward structure. As shown in Figure 4(c), CAPO can quickly move away from the negative-reward region in the early stage of training and clearly outperforms A2C. However, during the middle and later stages, PPO obtains a higher mean episodic reward. As reported in Table 2, PPO achieves a final-stage average reward of 130.04 ± 30.02, which is higher than that of CAPO (83.79 ± 19.99), while A2C only reaches -23.79 ± 11.47. In other words, although CAPO improves the final-stage reward by 107.58 compared with A2C, it remains 46.25 lower than PPO.

This result indicates that CAPO does not achieve an advantage over PPO in LunarLander-v2. A possible reason is that this task involves pronounced stage-wise decision-making and terminal-reward characteristics, where the local directional consistency of TD residuals may not always sufficiently reflect the true quality of the long-term landing strategy. When local feedback fluctuates strongly, CAPO tends to adopt a more conservative advantage estimate and a narrower clipping range. This helps suppress unstable updates, but may also limit sufficient policy exploration during complex landing phases. Therefore, in tasks such as LunarLander-v2, where terminal rewards play a relatively important role, the fixed clipping mechanism of standard PPO may preserve a larger policy search space.

To further compare the learning efficiency of different algorithms in discrete-action tasks, we calculate the number of environment interaction steps required for each algorithm to reach a predefined reward threshold. Since different tasks have distinct reward scales and convergence ranges, the performance threshold is set separately according to the reward characteristics of each environment. If A2C fails to stably reach the corresponding threshold during training, the result is denoted by “–”. The results are shown in Table 3.
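A steps-to-threshold measurement of this kind can be sketched as below. The precise criterion for "reaching" a threshold is not stated in the text, so this sketch assumes the first logged step at which the trailing mean of the last `window` reward values meets the threshold; the function name, the `window` parameter, and the toy numbers are all illustrative.

```python
def steps_to_threshold(steps, rewards, threshold, window=1):
    """Return the first logged step whose trailing-window mean reward
    is at least `threshold`, or None if the threshold is never reached."""
    for i in range(window - 1, len(rewards)):
        recent = rewards[i + 1 - window:i + 1]
        if sum(recent) / window >= threshold:
            return steps[i]
    return None


if __name__ == "__main__":
    # Toy Acrobot-style log: rewards are negative, closer to zero is better.
    steps = [1000, 2000, 3000, 4000]
    rewards = [-300.0, -150.0, -90.0, -80.0]
    print(steps_to_threshold(steps, rewards, threshold=-100.0))
    print(steps_to_threshold(steps, rewards, threshold=-100.0, window=2))
    print(steps_to_threshold(steps, rewards, threshold=0.0))
```

A `None` result corresponds to the "–" entries used when an algorithm does not reach the threshold. A larger `window` makes the criterion stricter, which matters for noisy curves where a single lucky episode would otherwise count as convergence.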

As shown in Table 3, CAPO requires fewer environment interaction steps than PPO to reach the specified reward thresholds in CartPole-v1 and Acrobot-v1, indicating higher learning efficiency in basic discrete-control and underactuated-control tasks. In particular, in Acrobot-v1, CAPO reaches the reward level of -100 with substantially fewer interaction steps than PPO, further supporting its advantage in tasks involving multi-step feedback accumulation. However, in LunarLander-v2, PPO reaches the reward threshold of 100 much earlier than CAPO. This shows that the performance improvement of CAPO does not hold across all discrete-action tasks, but is closely related to the reward structure, the strength of terminal feedback, and whether local TD residuals can accurately characterize the direction of long-term policy improvement.

Overall, CAPO achieves better final-stage average performance than PPO and A2C in both CartPole-v1 and Acrobot-v1, while also exhibiting faster convergence and lower training fluctuation. In LunarLander-v2, although CAPO clearly outperforms A2C, it does not surpass PPO. These results suggest that the proposed temporal-coherence-aware mechanism is more suitable for discrete-control tasks in which local feedback trends are relatively clear and policy improvement exhibits continuous accumulation. For tasks with prominent terminal rewards and stronger stage-wise decision-making characteristics, further refinement of the coherence-adaptive mechanism may still be required.

4.2.3. Results on Continuous-Action Control Tasks

This subsection further analyzes the experimental results on three continuous-action control tasks, including MountainCarContinuous-v0, LunarLanderContinuous-v2, and BipedalWalker-v3. Compared with discrete-action tasks, continuous-action control requires the agent to directly output continuous action values, thereby imposing higher requirements on action regulation accuracy, policy update stability, and long-term credit assignment. Figure 4(d)–(f) show the mean episodic reward curves of the three algorithms in the continuous-action tasks, while Table 2 reports the average rewards over the last 30% of the training process.

In the MountainCarContinuous-v0 environment, the agent needs to continuously control the acceleration of the car, repeatedly accumulate kinetic energy under limited energy constraints, and finally reach the target position at the top of the hill. The policy structure of this task is relatively clear. Once the agent learns an effective energy-accumulation behavior, the mean episodic reward can rapidly approach a high level. As shown in Figure 4(d), both CAPO and PPO quickly improve their rewards within a relatively small number of training steps and eventually stabilize in a high-reward region, whereas A2C exhibits a substantially delayed reward improvement. According to Table 2, the final-stage average reward of CAPO is 81.87 ± 2.03, while PPO achieves 83.11 ± 1.90. The two methods show very similar performance, with PPO being slightly higher than CAPO. In contrast, A2C obtains only -18.24 ± 58.19, indicating that it fails to stably learn an effective control policy. These results suggest that, in continuous-control tasks with relatively simple reward structures, explicit policy patterns, and evident performance plateaus, the improvement margin of CAPO is relatively limited, but it can still maintain final-stage performance comparable to PPO.

In the LunarLanderContinuous-v2 environment, the agent is required to control the main engine and side engines in a continuous action space to achieve deceleration, attitude adjustment, and safe landing. Compared with the discrete version, this task places higher demands on continuous action regulation, and the policy learning process is more susceptible to local state variations and value-estimation errors. As shown in Figure 4(e), CAPO exhibits a faster reward improvement in the early stage of training and enters a higher reward region earlier than the baselines. In the later stage of training, the mean reward of CAPO remains overall higher than those of PPO and A2C. Table 2 shows that CAPO achieves a final-stage average reward of 88.26 ± 14.45, outperforming PPO (78.56 ± 29.05) and substantially exceeding A2C (-87.49 ± 25.12). Compared with PPO, CAPO improves the final-stage average reward by 9.70 and achieves a clearly smaller standard deviation, indicating that it not only improves final-stage performance but also enhances training stability across different random seeds.

In the BipedalWalker-v3 environment, the agent needs to control a multi-joint bipedal robot to maintain balance and move forward on complex terrain. This task involves a higher-dimensional state space and continuous action space, and therefore requires strong action coordination, long-term stability, and cross-time-step credit assignment. As shown in Figure 4(f), CAPO demonstrates faster reward growth during the early and middle stages of training and achieves substantially higher rewards than PPO and A2C in the later stage. Table 2 reports that CAPO reaches a final-stage average reward of 289.55 ± 12.71, which is substantially higher than PPO (224.97 ± 37.18) and A2C (35.07 ± 78.56). Compared with PPO, CAPO improves the final-stage average reward by 64.58; compared with A2C, the improvement reaches 254.48. Moreover, the standard deviation of CAPO is markedly lower than that of PPO, indicating more stable training behavior in high-dimensional continuous locomotion control.

To further evaluate convergence efficiency in continuous-action tasks, Table 4 reports the number of environment interaction steps required by the three algorithms to reach predefined reward thresholds.

As shown in Table 4, CAPO requires substantially fewer environment interaction steps than PPO to reach the specified reward thresholds in LunarLanderContinuous-v2 and BipedalWalker-v3, indicating that it can learn effective continuous-control policies more efficiently. In particular, in BipedalWalker-v3, CAPO reaches the reward level of 200 after approximately 714k environment interaction steps, whereas PPO requires approximately 1524k steps to reach a comparable level. This result demonstrates that CAPO has a more evident convergence-efficiency advantage in high-dimensional continuous locomotion control. In contrast, in MountainCarContinuous-v0, PPO reaches the reward threshold of 80 slightly earlier than CAPO, and the final-stage average rewards of the two methods are very close. This further suggests that the main advantage of CAPO does not lie in tasks with relatively simple policy structures.

Overall, across the three continuous-action tasks, CAPO achieves better final-stage average performance than PPO and A2C in LunarLanderContinuous-v2 and BipedalWalker-v3, while also exhibiting faster convergence and lower training fluctuation. These results indicate that CAPO is particularly suitable for continuous-control tasks involving complex temporal feedback, high action-coordination requirements, and more important long-term credit assignment. The reason is that CAPO can adaptively regulate the temporal scale of advantage estimation and the policy clipping range according to the local directional consistency of TD residuals. When the feedback trend is stable, CAPO sufficiently exploits long-horizon return information; when the feedback fluctuates strongly, it suppresses unreliable updates. Consequently, CAPO improves the quality of policy optimization in complex continuous-control tasks.

4.3. Discussion

The experimental results across the six benchmark environments demonstrate that CAPO achieves favorable learning performance and training stability in most tasks. For discrete-action control tasks such as CartPole-v1 and Acrobot-v1, CAPO improves the final-stage average reward and accelerates convergence to some extent. For continuous-action control tasks such as LunarLanderContinuous-v2 and BipedalWalker-v3, the advantages of CAPO are more evident, as reflected by higher final-stage average rewards, faster learning speed, and smaller performance variations across different random seeds.

These results indicate that the proposed temporal-coherence-aware mechanism based on the local directional consistency of TD residuals can effectively improve the reliability of policy update signals. When the local feedback trend is relatively stable, CAPO increases the contribution of long-horizon advantage estimation and appropriately relaxes the policy update range, allowing the agent to make fuller use of reliable temporal feedback. Conversely, when local feedback fluctuates strongly, CAPO tends to adopt a more conservative advantage estimate and a narrower clipping boundary, thereby reducing the influence of unreliable updates on the training process. Therefore, CAPO shows more pronounced advantages in tasks with complex temporal feedback and higher requirements for continuous control.

Meanwhile, the experimental results also show that CAPO does not achieve clearly superior performance over PPO in LunarLander-v2 and MountainCarContinuous-v0. This suggests that the effectiveness of the proposed mechanism is related to the reward structure, action-space type, and whether local TD residuals can effectively characterize the direction of long-term policy improvement. For tasks with dominant terminal rewards or relatively evident performance plateaus, the improvement margin of CAPO may be limited. This observation also indicates that temporal-coherence-aware policy optimization is more suitable for tasks in which local feedback trends contain meaningful information for policy improvement, while further refinement may be required for tasks with sparse or strongly stage-dependent reward structures.

Overall, CAPO provides a flexible extension to the standard PPO framework by jointly adapting the advantage estimation horizon and proximal clipping range according to local temporal coherence. This design allows the policy update to balance sufficient improvement and conservative constraint at the sample level, which contributes to improved stability and convergence efficiency in several representative control tasks.

5. Conclusion and Future Work

This paper proposes a coherence-adaptive proximal optimization method, termed CAPO, to improve the adaptability of advantage estimation and proximal policy updates in the PPO framework. Different from standard PPO, which typically adopts a fixed temporal scale for advantage estimation and a uniform clipping boundary for all samples, CAPO introduces a temporal coherence coefficient based on the local directional consistency of TD residuals. This coefficient is used to characterize the reliability of sample-wise update signals within a local temporal window. Based on this measurement, short-horizon and long-horizon advantage estimates are adaptively fused to balance variance reduction and long-term credit assignment according to the stability of temporal feedback. In addition, the same temporal coherence coefficient is incorporated into the clipped surrogate objective to construct a sample-wise dynamic clipping mechanism, allowing the policy update range to be adjusted according to the reliability of local feedback.

Comparative experiments were conducted on six representative OpenAI Gym control tasks to evaluate CAPO against PPO and A2C. The results show that CAPO achieves better or comparable performance to PPO in several benchmark environments and generally performs better than A2C. In particular, CAPO shows relatively clear advantages in CartPole-v1, Acrobot-v1, LunarLanderContinuous-v2, and BipedalWalker-v3, where it obtains higher final-stage rewards or more stable training behavior. These results suggest that the proposed temporal-coherence-aware mechanism can improve the reliability of policy update signals, especially in tasks where local TD residuals provide informative feedback for policy improvement. However, CAPO does not consistently outperform PPO in all environments. In LunarLander-v2 and MountainCarContinuous-v0, the improvement is limited or PPO achieves better final-stage performance, indicating that the effectiveness of CAPO is still related to the reward structure, action-space characteristics, and temporal feedback patterns of different tasks.

Future work will focus on further improving the adaptability and generalization of the proposed mechanism. First, more flexible temporal coherence measurement strategies can be explored, such as incorporating longer temporal windows, task-stage information, or uncertainty estimates, to better handle tasks with strong terminal rewards or stage-dependent feedback. Second, the proposed method can be evaluated in more complex continuous-control tasks and realistic physical simulation environments to further examine its practical applicability. Finally, although CAPO is instantiated within the PPO framework in this study, the temporal-coherence-aware mechanism may also be integrated with other policy-gradient algorithms or extended to multi-agent reinforcement learning scenarios, which will be investigated in future research.

Author Contributions

Conceptualization, M.L. and T.Z.; methodology, M.L.; software, M.L.; validation, M.L., J.H., W.W. and Z.H.; formal analysis, M.L.; investigation, M.L.; data curation, M.L.; writing—original draft preparation, M.L.; writing—review and editing, J.H., W.W., T.Z. and Z.H.; visualization, M.L.; supervision, T.Z.; project administration, T.Z.; funding acquisition, T.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China, grant number 52201403; the Chenguang Program of Shanghai Education Development Foundation and Shanghai Municipal Education Commission, grant number 23CGA61; the Fujian Provincial Natural Science Foundation of China, grant number 2024J01702; the Natural Science Foundation of Xiamen, China, grant number 502Z202373038; and the National Natural Science Foundation of China, grant number 52371369.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data generated and analyzed during this study are available from the corresponding author upon reasonable request.

Acknowledgments

Not applicable.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Figure 1. Illustration of the Markov decision process, where the agent interacts with the environment by selecting actions according to its policy and receiving states and rewards from the environment.

Figure 2. Workflow comparison between standard PPO and the proposed CAPO method, illustrating how CAPO introduces temporal-coherence measurement to adaptively regulate advantage estimation and the sample-wise clipping range.

Figure 3. Illustrations of the six benchmark control environments used in the experiments, including CartPole-v1, Acrobot-v1, LunarLander-v2, MountainCarContinuous-v0, LunarLanderContinuous-v2, and BipedalWalker-v3.

Figure 4. Mean episodic reward curves of CAPO, PPO, and A2C on six OpenAI Gym benchmark environments, where the solid lines represent the average rewards over six independent runs and the shaded regions denote the standard deviations.

Table 1. Main hyperparameter settings of the proposed CAPO algorithm.

Hyperparameter	Symbol	Value
Short-horizon GAE parameter	$λ_{s}$	0.85
Long-horizon GAE parameter	$λ_{l}$	0.97
Temporal coherence window	$H$	4
Minimum clipping bound	$ϵ_{min}$	0.08
Maximum clipping bound	$ϵ_{max}$	0.25
Entropy coefficient	$c_{e}$	0.01
Value loss coefficient	$c_{v}$	0.5

Table 2. Comparison of final-stage average episodic rewards of different algorithms in six benchmark environments.

Environment	CAPO	PPO	A2C
CartPole-v1	$394.58 \pm 4.90$	$371.73 \pm 13.79$	$45.93 \pm 6.29$
Acrobot-v1	$-81.51 \pm 1.35$	$-94.77 \pm 3.55$	$-250.27 \pm 192.86$
LunarLander-v2	$83.79 \pm 19.99$	$130.04 \pm 30.02$	$-23.79 \pm 11.47$
MountainCarContinuous-v0	$81.87 \pm 2.03$	$83.11 \pm 1.90$	$-18.24 \pm 58.19$
LunarLanderContinuous-v2	$88.26 \pm 14.45$	$78.56 \pm 29.05$	$-87.49 \pm 25.12$
BipedalWalker-v3	$289.55 \pm 12.71$	$224.97 \pm 37.18$	$35.07 \pm 78.56$

Table 3. Comparison of convergence efficiency in discrete-action control tasks.

Environment	Threshold	CAPO	PPO	A2C
CartPole-v1	Reward $\geq 380$	63k steps	72k steps	–
Acrobot-v1	Reward $\geq -100$	58k steps	99k steps	–
LunarLander-v2	Reward $\geq 100$	451k steps	121k steps	–

Table 4. Comparison of convergence efficiency in continuous-action control tasks.

Environment	Threshold	CAPO	PPO	A2C
MountainCarContinuous-v0	Reward $\geq 80$	27k steps	21k steps	–
LunarLanderContinuous-v2	Reward $\geq 50$	27k steps	65k steps	–
BipedalWalker-v3	Reward $\geq 200$	714k steps	1524k steps	–
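The step counts in Tables 3 and 4 can be read more easily as ratios. The short sketch below divides the PPO step count by the CAPO step count for each environment, so a value above 1 means CAPO reached the threshold in fewer steps. The numbers are copied from the two tables; the dictionary and function names are illustrative only, and A2C is omitted because the tables report no step count for it.

```python
# Steps (in thousands) to reach each reward threshold, as (CAPO, PPO),
# copied from Tables 3 and 4.
STEPS_K = {
    "CartPole-v1": (63, 72),
    "Acrobot-v1": (58, 99),
    "LunarLander-v2": (451, 121),
    "MountainCarContinuous-v0": (27, 21),
    "LunarLanderContinuous-v2": (27, 65),
    "BipedalWalker-v3": (714, 1524),
}


def step_ratios(table):
    """Return PPO steps / CAPO steps; above 1 means CAPO got there first."""
    return {env: ppo / capo for env, (capo, ppo) in table.items()}


for env, ratio in step_ratios(STEPS_K).items():
    print(f"{env}: {ratio:.2f}")
```

On these figures CAPO reaches the threshold in fewer steps in four of the six tasks, with LunarLander-v2 and MountainCarContinuous-v0 as the exceptions, which matches the pattern in the final-stage rewards of Table 2.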

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
