Reinforcement Learning-Enhanced Adaptive NMPC for Safe Autonomous Driving

Sheng Jin; Joel Y. Y. Loh

doi:10.20944/preprints202605.1986.v1

Submitted: 27 May 2026

Posted: 28 May 2026


Abstract

Nonlinear Model Predictive Control (NMPC) has garnered significant attention in autonomous systems due to its ability to predict future states and manage complex vehicle dynamics. However, the adaptability of existing NMPC methods is constrained by the need to manually set the weight coefficients in the NMPC cost function. This study explores an approach that integrates NMPC with Reinforcement Learning (RL), specifically Proximal Policy Optimization (PPO), to dynamically adjust the NMPC weight matrices. The investigation begins by establishing a physics-based model for a two-wheeled differential-drive vehicle. A PPO model is then trained and deployed in real time to adapt the NMPC weight matrices, achieving a 71% reduction in tracking error compared with the NMPC baseline. Importantly, the performance gain arises from PPO’s ability to reshape the NMPC cost function in real time, amplifying both orientation and lateral penalties in curves while relaxing them on straights, thereby enabling adaptive trade-offs between accuracy and control effort that static-weight NMPC cannot achieve. To enhance safety, the controller is integrated with a Control Barrier Function (CBF) layer for real-time obstacle avoidance, while PPO’s real-time weight adaptation improves tracking performance relative to NMPC+CBF. Finally, robustness evaluations under friction uncertainty, sensor noise, and path disturbances demonstrate that the PPO+NMPC+CBF method maintains reliable tracking accuracy and safety margins.

Keywords: autonomous vehicles; control barrier function (CBF); nonlinear model predictive control (NMPC); proximal policy optimization (PPO); reinforcement learning (RL)

Subject: Engineering - Electrical and Electronic Engineering

1. Introduction

Path tracking is a fundamental problem in autonomous driving, where the controller must ensure accurate trajectory following while satisfying vehicle dynamics and safety constraints. In recent years, various control strategies have been explored for path tracking. Traditional PID controllers, favored for their simplicity, are widely applied to lane keeping and parking, with nested PID improving accuracy [1,2]. Sliding Mode Control (SMC) enhances robustness through nonlinear switching, while fuzzy PID and adaptive fuzzy SMC improve adaptability via online gain tuning [3,4,5]. Yet these controllers cannot explicitly handle multi-variable constraints, limiting their scalability in autonomous driving. Model Predictive Control (MPC) overcomes this by embedding dynamics and constraints into an optimization problem [6]. Extensions with state lattice planners [7] and refined cost functions [8] improve planning and comfort, but linear approximations limit fidelity in dynamic scenarios [9]. To address this shortcoming, Nonlinear Model Predictive Control (NMPC) directly incorporates nonlinear vehicle dynamics into the predictive model, allowing more accurate state evolution and improved performance in complex and highly dynamic driving conditions [10]. It enables joint steering and torque control for obstacle avoidance at high speeds and enhances rollover prevention for four-wheel independent drive vehicles [11,12]. In addition, NMPC supports automatic weight tuning via genetic algorithms and robust control through Gaussian process modeling of uncertainty [13,17]. However, its effectiveness is highly sensitive to factors such as vehicle speed, road curvature, and sampling frequency. Prior studies report that high vehicle speed or road curvature significantly amplifies prediction errors and can destabilize the optimizer if weights are not carefully adjusted [18,19].
Similarly, insufficient sampling density degrades prediction fidelity, while excessive sampling increases solver iterations and computational time, leading to response delays [20]. As a result, NMPC often requires trade-offs between responsiveness, stability, and computational cost. The latter grows approximately quadratically with prediction horizon and number of constraints, making real-time operation challenging for embedded platforms [21]. Furthermore, traditional NMPC approaches rely on fixed weights and parameters that are typically tuned via expert knowledge or costly trials, making them time-consuming and difficult to scale [22].

Several adaptive tuning strategies have been proposed to reduce this dependence on manual NMPC weight selection. Genetic algorithms have been used to tune nonlinear MPC parameters for autonomous vehicle velocity and steering control [13], while Bayesian optimization has been applied to MPC weight adaptation in connected and automated vehicle scenarios [14]. Other approaches use curvature-dependent or gain-scheduled MPC weights so that tracking penalties change with the local path geometry [15]. Recent weight-varying MPC methods also combine optimization-based weight selection with reinforcement learning to adjust controller priorities online [16]. These approaches show that adaptive weighting is an active research direction, but they differ in the source of adaptation. Offline optimizers such as genetic algorithms and Bayesian optimization can reduce manual tuning but may require repeated scenario evaluations. Gain-scheduled and curvature-dependent methods are computationally efficient but depend on predefined scheduling rules. The present work instead learns an online mapping from local tracking features to NMPC weight scaling coefficients, while retaining the NMPC solver for constrained control action generation.

To overcome the limitations of classical controllers, recent work has explored Reinforcement Learning (RL) for greater adaptability. RL learns optimal strategies through interaction with the environment, avoiding explicit models or manual tuning. Most control applications adopt the Actor–Critic (AC) architecture, where the Actor maps observations to actions and the Critic evaluates long-term rewards [23]. This end-to-end mechanism enables low-latency inference, making RL suitable for high-frequency or resource-constrained systems [24,25]. Therefore, to better handle complex dynamic environments, integrating RL with traditional controllers has gained increasing attention. Studies have combined RL with PID to adjust weights in real time, reducing AGV tracking errors and improving adaptability to complex paths [26,27]. In MPC, RL has been used to tune cost functions and prediction horizons online, enhancing trajectory tracking without increasing computational load [28,29], while neural networks have been applied to learn vehicle dynamics for improved prediction accuracy and robustness [30]. For NMPC, RL-based frameworks such as Re4MPC dynamically select models, constraints, and cost functions, significantly reducing computational overhead [31], and Actor-Critic structures have been employed to optimize weights and initialization for better convergence and stability [32]. Additionally, real-time tuning methods combining digital twins and RL improve control accuracy in uncertain environments such as surface systems [33,34,35]. However, most RL-based controllers adopt an end-to-end design, which limits interpretability and weakens constraint enforcement [36]. They also lack formal safety guarantees, restricting their applicability in safety-critical autonomous driving.

Since the weighting matrices in NMPC directly determine the trade-off between tracking accuracy and energy consumption, adaptive tuning becomes a natural entry point. The core challenge is to integrate RL into NMPC in a way that preserves transparency, respects constraints, and ensures real-time safety. This work addresses these issues by proposing a three-layer framework in which a PPO policy adapts the NMPC weights online and a Control Barrier Function (CBF) layer enforces safety via real-time projection. The performance of the proposed approach is illustrated in Figure 1, which presents a trajectory comparison highlighting its advantages over baseline methods.

This paper is organized as follows. Section 2 formulates the autonomous vehicle trajectory tracking problem and defines the control objectives. Section 3 presents the proposed PPO+NMPC+CBF framework, detailing the integration of PPO-based adaptive weight tuning with NMPC optimization and safety enforcement via CBF. Section 4 presents the simulation setup and experimental results, including evaluations of tracking accuracy, safety, and robustness. Finally, concluding remarks are provided in Section 5.

2. Problem Formulation

This section formulates the trajectory-tracking problem for a differential-drive vehicle. Both the dynamic model of the system and the associated NMPC controller are presented to establish the basis for the proposed PPO-augmented safe control framework.

2.1. Vehicle Dynamic Model

This section systematically derives the kinematic and dynamic models of a two-wheel differential drive robot with a front caster and rear Ackermann wheels [37]. A nonlinear state-space representation is subsequently established to facilitate the design of advanced control strategies.

Consider a global inertial frame $(X, Y)$ and a local body-fixed frame $(x, y)$ attached to the vehicle’s center of mass, as illustrated in Figure 2. Denoting $v$ as the forward linear velocity and $\omega$ as the yaw rate about the Z-axis, the kinematic equations are given by:

$$\dot{Z} = \begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{\theta} \end{bmatrix} = \begin{bmatrix} v \cos\theta \\ v \sin\theta \\ \omega \end{bmatrix} \tag{1}$$

The dynamic terms of the differential-drive vehicle are derived by applying Newton’s second law in the longitudinal direction and torque balance about the yaw axis. The resulting nonlinear five-state model captures both translational and rotational dynamics, considering wheel torques, rolling resistance, aerodynamic drag, and yaw damping. With the state vector $\Theta = [x \; y \; \theta \; v \; \omega]^{T}$, the compact state-space form is expressed as follows:

$$\dot{\Theta} = F(\Theta, u) = \begin{bmatrix} \dot{Z} \\ \dot{v} \\ \dot{\omega} \end{bmatrix} = \begin{bmatrix} v \cos\theta \\ v \sin\theta \\ \omega \\ \dfrac{(T_{l} + T_{r})/r - c_{rr}\, m g \,\operatorname{sgn}(v) - c_{v} v}{m} \\ \dfrac{L (T_{r} - T_{l})/(2r) - c_{\omega}\, \omega}{J_{z}} \end{bmatrix} \tag{2}$$

where $T_{l}, T_{r}$ denote the left and right wheel torques, $r$ is the wheel radius, $c_{rr}$ and $c_{v}$ represent the rolling resistance and aerodynamic drag coefficients, $L$ is the wheelbase, $c_{\omega}$ is the yaw damping coefficient, $m$ is the vehicle mass, $\omega$ denotes the yaw rate of the vehicle, and $J_{z}$ is the yaw inertia. The nonlinear function $F(\cdot)$ encodes the kinematic and dynamic relations of the vehicle, while $u = [T_{l} \; T_{r}]^{T}$ represents the input vector of wheel torques. A detailed derivation of the vehicle kinematics and dynamics is provided in Appendix A.
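As an illustration only, the five-state model of Eq. (2) could be coded as below. The parameter values in `PARAMS` and the choice of a fourth-order Runge-Kutta integrator are assumptions made for this sketch; the paper does not state the numerical values or the discretization it uses.

```python
import numpy as np

# Hypothetical parameter values, chosen only to make the sketch runnable.
PARAMS = dict(m=10.0, Jz=0.5, r=0.05, L=0.3, c_rr=0.01, c_v=0.5, c_w=0.1, g=9.81)

def dynamics(state, u, p=PARAMS):
    """Right-hand side F(Theta, u) of Eq. (2); state = [x, y, theta, v, omega]."""
    _, _, theta, v, omega = state
    T_l, T_r = u
    # Longitudinal force balance: drive force, rolling resistance, drag.
    v_dot = ((T_l + T_r) / p["r"]
             - p["c_rr"] * p["m"] * p["g"] * np.sign(v)
             - p["c_v"] * v) / p["m"]
    # Yaw torque balance: differential wheel torque and yaw damping.
    omega_dot = (p["L"] * (T_r - T_l) / (2.0 * p["r"]) - p["c_w"] * omega) / p["Jz"]
    return np.array([v * np.cos(theta), v * np.sin(theta), omega, v_dot, omega_dot])

def rk4_step(state, u, dt, p=PARAMS):
    """One explicit RK4 step of the continuous-time model (assumed integrator)."""
    k1 = dynamics(state, u, p)
    k2 = dynamics(state + 0.5 * dt * k1, u, p)
    k3 = dynamics(state + 0.5 * dt * k2, u, p)
    k4 = dynamics(state + dt * k3, u, p)
    return state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```

With equal wheel torques the yaw acceleration is zero and the vehicle accelerates straight ahead, which is a quick sanity check on the signs in Eq. (2).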

2.2. NMPC Controller for Trajectory Tracking

Nonlinear Model Predictive Control extends the classical MPC framework by incorporating nonlinear vehicle dynamics into the prediction process. While retaining the quadratic cost structure of linear MPC, NMPC accounts for nonlinear state transitions, thereby improving fidelity in highly dynamic conditions.

To solve the nonlinear constrained optimization problem, a Sequential Quadratic Programming (SQP) scheme is employed. At each control step, the nonlinear dynamics and inequality constraints are locally linearized around the current iterate $(\Theta_{i}^{\text{guess}}, u_{i}^{\text{guess}})$. A quadratic programming (QP) subproblem is then constructed:

$$\text{QP}_{\text{NMPC}} = \arg\min_{\Delta\Theta, \Delta u} \sum_{k=0}^{N-1} \frac{1}{2} \begin{bmatrix} \Delta\Theta_{i,k} \\ \Delta u_{i,k} \end{bmatrix}^{T} H_{i,k} \begin{bmatrix} \Delta\Theta_{i,k} \\ \Delta u_{i,k} \end{bmatrix} \tag{3a}$$

s.t.

$$\Delta\Theta_{i,0} = \hat{\Theta}_{i} - \Theta_{i,0}^{\text{guess}} \tag{3b}$$

$$\Delta\Theta_{i,k+1} = A_{i,k}\, \Delta\Theta_{i,k} + B_{i,k}\, \Delta u_{i,k} \tag{3c}$$

$$C_{i,k}\, \Delta\Theta_{i,k} + D_{i,k}\, \Delta u_{i,k} + h_{i,k} \leq 0 \tag{3d}$$

The SQP update variables are defined by $\Theta_{i,k} = \Theta_{i,k}^{\text{guess}} + \Delta\Theta_{i,k}$ and $u_{i,k} = u_{i,k}^{\text{guess}} + \Delta u_{i,k}$, where the superscript guess denotes the current SQP iterate and the delta terms are the QP decision variables.

Here, $A_{i,k} = \partial F / \partial \Theta$ and $B_{i,k} = \partial F / \partial u$ are the Jacobians of the dynamics, $C_{i,k}$ and $D_{i,k}$ are the Jacobians of the inequality constraints, and $h_{i,k} = h(\Theta_{i}^{\text{guess}}, u_{i}^{\text{guess}})$ is the constraint offset. The Hessian of the Lagrangian is approximated using a Gauss–Newton scheme, yielding a positive semi-definite weighting matrix $H_{i,k} \approx W_{i,k}$, which reduces computational burden while preserving descent directions. The Newton directions $(\Delta\Theta_{i}, \Delta u_{i})$ are updated via line search until convergence [38], and the first control input is applied in a receding horizon fashion.

The inequality in Eq. (3d) is the first-order approximation of the nonlinear constraint function written in the standard SQP form $g(\Theta, u) \leq 0$. The sign convention $\leq 0$ means that admissible trajectories correspond to non-positive constraint values, while positive values indicate constraint violation. Thus, Eq. (3d) enforces the locally linearized state, input, input-rate, and safety constraints within the QP subproblem.

The linearization in Eq. (3) is used as a local SQP approximation rather than as a global approximation of the nonlinear vehicle dynamics. Its validity depends on the predicted state and input trajectories remaining close to the current SQP iterate over the prediction horizon. In this study, the vehicle operates under relatively smooth low-speed path-tracking conditions, with bounded inputs, a short prediction horizon, and repeated re-linearization at every receding-horizon step. These conditions limit the deviation between the nonlinear dynamics and their first-order approximation over each short horizon. The approximation may become less accurate during aggressive maneuvers, high-speed operation, large initial tracking errors, abrupt obstacle interactions, severe actuator saturation, or longer prediction horizons. Thus, the linearized QP subproblem is used only as a local computational step inside the SQP-based NMPC solver, while the closed-loop controller relies on repeated online re-linearization to remain consistent with the nonlinear model.
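To make the local linearization concrete, the following sketch approximates the Jacobians $A_{i,k}$ and $B_{i,k}$ by central finite differences. This is an assumption for illustration: the paper does not say whether its Jacobians are analytic, automatic, or numerical. The helper `unicycle_step` is a hypothetical Euler-discretized version of the kinematic model in Eq. (1), used only so that the result can be checked by hand.

```python
import numpy as np

def numerical_jacobians(f, x, u, eps=1e-6):
    """Central-difference Jacobians A = df/dx and B = df/du at (x, u)."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    n_out = f(x, u).size
    A = np.zeros((n_out, x.size))
    B = np.zeros((n_out, u.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        A[:, j] = (f(x + dx, u) - f(x - dx, u)) / (2.0 * eps)
    for j in range(u.size):
        du = np.zeros_like(u)
        du[j] = eps
        B[:, j] = (f(x, u + du) - f(x, u - du)) / (2.0 * eps)
    return A, B

def unicycle_step(x, u, dt=0.1):
    """Toy discrete model: state [px, py, theta], input [v, omega] (illustrative)."""
    px, py, th = x
    v, w = u
    return np.array([px + dt * v * np.cos(th),
                     py + dt * v * np.sin(th),
                     th + dt * w])

# Linearize around a straight-line guess: heading 0, speed 1, no turning.
A, B = numerical_jacobians(unicycle_step, [0.0, 0.0, 0.0], [1.0, 0.0])
```

Re-evaluating `numerical_jacobians` at every receding-horizon step corresponds to the repeated re-linearization described above; the approximation is only trusted near the point where it was computed.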

Let $\mathcal{X}$ denote the admissible state set, $\mathcal{U}$ the admissible wheel-torque input set, and $\Delta\mathcal{U}$ the admissible input-increment set. These calligraphic set symbols are used consistently in the constraints below. The trajectory tracking problem with embedded safety can then be written as a finite-horizon optimal control problem:

$$\min_{\{u_{k}\}_{k=0}^{N-1}} \sum_{k=0}^{N-1} \left( e_{k}^{T} Q e_{k} + u_{k}^{T} R u_{k} \right) \tag{4a}$$

s.t.

$$\Theta_{k} = \hat{\Theta}_{k} \tag{4b}$$

$$\Theta_{k+1} = F(\Theta_{k}, u_{k}), \quad k = 1, \dots, N-1 \tag{4c}$$

$$\Theta_{k} \in \mathcal{X}, \quad k = 1, \dots, N-1 \tag{4d}$$

$$u_{k} \in \mathcal{U}, \quad k = 0, \dots, N-2 \tag{4e}$$

$$\Delta u_{k} := u_{k} - u_{k-1} \in \Delta\mathcal{U}, \quad k = 0, \dots, N-2 \tag{4f}$$

where $e_{k} = y_{k} - r_{k}$ is the tracking error and $Q, R$ weight tracking accuracy and control effort. The admissible state and input sets are given by $\mathcal{X}$ and $\mathcal{U}$, while $\Delta\mathcal{U}$ constrains the rate of input variation to improve feasibility and smoothness.

The weighting matrices are chosen to satisfy the standard quadratic-regulator definiteness conditions. Specifically, $Q \succeq 0$ is positive semi-definite so that tracking errors are penalized with non-negative weights, while $R \succ 0$ is positive definite so that the control-effort term remains strictly convex with respect to the input. These conditions are maintained throughout the simulations.

In this formulation, the predictive optimizer enforces system dynamics and feasibility, while the PPO agent adaptively tunes the cost weights in real time. This separation allows NMPC to remain interpretable, while RL adapts the cost and the CBF layer enforces safety, both in real time.

The controller should be interpreted as a receding-horizon numerical optimizer rather than a globally optimal regulator. At each sampling instant, the NMPC solver computes a finite-horizon solution under local SQP linearization, a Gauss-Newton Hessian approximation, finite solver tolerances, and PPO-adapted cost weights. The resulting input is therefore locally optimized, or suboptimal when the SQP iterations are terminated before exact convergence, rather than globally optimal for the original nonlinear problem.

For the same reason, this work does not claim a formal closed-loop stability guarantee in the Lyapunov or asymptotic sense. Such a guarantee would require additional terminal costs, invariant terminal sets, recursive-feasibility arguments, or a dedicated Lyapunov analysis. The simulations instead evaluate bounded tracking behavior, constraint satisfaction in the tested scenarios, and robustness under selected disturbances.

3. Construction of PPO+NMPC+CBF Framework

This section presents the construction of the PPO+NMPC+CBF framework, which integrates reinforcement learning, predictive optimization, and safety-constrained control. The PPO module is first introduced to generate adaptive weight coefficients for NMPC, followed by the PPO-adapted NMPC formulation. Finally, a Control Barrier Function layer is incorporated to impose obstacle-avoidance constraints. Together, these components form a closed-loop controller in which PPO adapts the NMPC cost weights, NMPC enforces the vehicle dynamics and input limits, and the CBF layer adds obstacle-avoidance constraints. Strict forward invariance of the safe set is obtained only when the CBF constraints are feasible and the slack variables introduced later remain inactive. When relaxation is required, the controller should instead be interpreted as providing practical safety preservation with penalized safety-margin violations.

3.1. PPO Model Training

PPO is a policy gradient algorithm that improves stability by optimizing a clipped surrogate objective, preventing destructive policy updates [39]. At each step of the rollout, i.e., during the policy’s interaction with the environment, the agent receives a compact feature vector summarizing kinematic states, path-relative geometry, task progress, and a simplified cost summary. The Actor head outputs scaling coefficients that rescale the NMPC weighting matrices: $Q$, which penalizes state deviation from the reference, and $R$, which penalizes control effort. The NMPC solver uses these adapted weights to compute the optimal control input for the plant. The Critic head evaluates the current state by predicting its long-term value, guiding the policy update through advantage estimates. Advantages are computed using Generalized Advantage Estimation (GAE), and the policy is optimized by maximizing the clipped surrogate objective:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_{t} \left[ \min\left( r_{t}(\theta)\, \hat{A}_{t},\; \operatorname{clip}\left(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_{t} \right) \right] \tag{5}$$

where $r_{t}(\theta)$ is the probability ratio between the updated and old policies, and $\hat{A}_{t}$ denotes the advantage estimator [40]. The full PPO objective integrates the clipped surrogate with a value function penalty and an entropy bonus to balance exploration and stability [39]:

$$L^{PPO}(\theta) = \hat{\mathbb{E}}_{t} \left[ L_{t}^{CLIP}(\theta) - c_{1} L_{t}^{VF}(\theta) + c_{2}\, S[\pi_{\theta}](s_{t}) \right] \tag{6}$$

where $c_{1}, c_{2}$ are scalar hyperparameters, $L_{t}^{VF}(\theta)$ is the squared value error $\left(V_{\theta}(s_{t}) - V_{t}^{targ}\right)^{2}$, and $S[\pi_{\theta}]$ is the entropy term.

In Eq. (5), $r_{t}(\theta)$ is the policy probability ratio and $\hat{A}_{t}$ is the advantage estimate. The clipping parameter $\epsilon$ limits the policy update ratio to the interval $[1 - \epsilon, 1 + \epsilon]$, which prevents a single batch from producing an excessively large actor update. Equation (6) then combines the clipped policy objective, value-function fitting, and entropy regularization, so the PPO update balances policy improvement, critic accuracy, and exploration.
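Equations (5) and (6) can be evaluated on a batch of samples as in the NumPy sketch below. It is not the authors' training code: the default values `eps=0.2`, `c1=0.5`, and `c2=0.01` are common choices assumed here, and a real implementation would compute these quantities in an automatic-differentiation framework so that gradients can flow to the policy parameters.

```python
import numpy as np

def clipped_surrogate(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Batch estimate of L^CLIP in Eq. (5)."""
    ratio = np.exp(np.asarray(log_prob_new) - np.asarray(log_prob_old))
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The element-wise minimum gives a pessimistic bound on the improvement.
    return float(np.mean(np.minimum(unclipped, clipped)))

def ppo_objective(log_prob_new, log_prob_old, advantages,
                  values, value_targets, entropy,
                  c1=0.5, c2=0.01, eps=0.2):
    """Batch estimate of L^PPO in Eq. (6), to be maximized."""
    l_clip = clipped_surrogate(log_prob_new, log_prob_old, advantages, eps)
    l_vf = float(np.mean((np.asarray(values) - np.asarray(value_targets)) ** 2))
    return l_clip - c1 * l_vf + c2 * float(np.mean(entropy))
```

For a sample with positive advantage, a ratio above $1 + \epsilon$ earns no additional objective value, so the update has no incentive to push the policy further in that direction within one batch.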

The PPO Neural Network (NN) reweights the predictive controller’s quadratic penalties instead of commanding torques directly. As illustrated in Figure 3, at each step the PPO NN reads a compact feature vector that summarizes kinematic states, path-relative geometry, task progress, and a simplified summary of the cost function. The policy returns a small set of non-negative coefficients, which scale the diagonal terms of $Q$ and $R$.

The shared backbone extracts features used by both the actor and critic. The policy head applies Softplus so the scaling coefficients remain non-negative before they are passed to the NMPC cost matrices, while the value head estimates the state value for critic training.

During training, the environment returns a scalar reward that combines five reward and penalty terms:

$$r_{t} = \omega_{pos} \cdot \mathrm{norm}_{\mathrm{dist},t} + \omega_{head} \cdot \mathrm{head}_{t} + \omega_{speed} \cdot \mathrm{speed}_{t} + \omega_{goal} \cdot \mathrm{goal}_{t} + \omega_{shape} \cdot \mathrm{shape}_{t} \tag{7}$$

These components jointly encourage accurate, smooth, and efficient path tracking.

(i) The position term $\mathrm{norm}_{\mathrm{dist},t}$ represents the lateral deviation of the vehicle’s center from the reference path, normalized to reflect tracking accuracy;

(ii) The heading term $\mathrm{head}_{t}$ denotes the angular error between the vehicle heading and the tangent of the path, capturing steering alignment;

(iii) The speed term $\mathrm{speed}_{t}$ is the difference between the current forward velocity and the nominal reference speed $v_{ref}$, reflecting speed stability;

(iv) The goal term $\mathrm{goal}_{t}$ measures either the distance to the target or a terminal indicator of goal attainment, enforcing convergence to the final point;

(v) The shaping term $\mathrm{shape}_{t}$ represents the reduction in distance-to-goal between consecutive steps, acting as a potential-based shaping reward to encourage gradual progress toward the target [42].

The relative influence of each component is determined by the corresponding weights $\omega_{pos}, \omega_{head}, \omega_{speed}, \omega_{goal}, \omega_{shape}$.
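The five-term reward of Eq. (7) might be assembled as in the sketch below. Several details are assumptions, because the text does not specify them: the numerical weights, the sign convention (error terms carry negative weights, progress terms positive ones), the saturation of the normalized lateral error at 1, and the use of a binary goal indicator rather than a distance.

```python
import math

# Hypothetical weights; the signs encode an assumed penalty/reward convention.
DEFAULT_WEIGHTS = {"pos": -1.0, "head": -0.5, "speed": -0.2, "goal": 10.0, "shape": 1.0}

def wrap_angle(a):
    """Wrap an angle to the interval [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))

def step_reward(lat_err, head_err, v, v_ref, dist_goal, prev_dist_goal,
                reached, weights=DEFAULT_WEIGHTS, lat_scale=1.0):
    """Scalar reward r_t with the five components (i)-(v) of Eq. (7)."""
    norm_dist = min(abs(lat_err) / lat_scale, 1.0)  # (i) normalized lateral error
    head = abs(wrap_angle(head_err))                # (ii) heading error magnitude
    speed = abs(v - v_ref)                          # (iii) speed deviation
    goal = 1.0 if reached else 0.0                  # (iv) terminal indicator (assumed)
    shape = prev_dist_goal - dist_goal              # (v) progress toward the goal
    return (weights["pos"] * norm_dist + weights["head"] * head
            + weights["speed"] * speed + weights["goal"] * goal
            + weights["shape"] * shape)
```

Under these assumed weights, a step taken exactly on the path at the reference speed earns only the shaping term, while lateral, heading, or speed errors reduce the reward.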

Algorithm 1 summarizes the offline PPO training framework used to adapt the NMPC weight matrices. In each iteration, the policy generates scaling factors for the cost matrices, which the NMPC solver exploits to compute control actions. Collected trajectories are used to compute rewards and advantage estimates, and the policy is updated via the clipped surrogate objective with entropy regularization. The resulting policy is then deployed to provide adaptive weight tuning during execution.

In Algorithm 1, $\pi_{\theta}$ denotes the actor policy, $V_{\phi}$ denotes the critic network, the reference-path environment supplies rollout observations and rewards, and the NMPC solver converts PPO-scaled weights into feasible wheel-torque commands. The variables $o_{t}$ and $a_{t}$ denote the observation and PPO action at time step $t$, respectively.

Algorithm 1. Offline PPO Training Framework.

Input: Initial policy $\pi_{\theta}$, value network $V_{\phi}$, reference path environment, NMPC solver.
Output: Trained policy $\pi_{\theta}^{*}$.

Initialize networks and empty rollout buffer $\mathcal{D}$;
while training budget not exhausted do
    Reset environment; obtain $o_{0}$;
    while episode not finished do
        Policy outputs $a_{t}$ to rescale the $(Q, R)$ diagonals;
        NMPC solves horizon problem; apply first control;
        Simulate environment to obtain $o_{t+1}$ and reward $r_{t}$;
        Store $(o_{t}, a_{t}, r_{t}, o_{t+1}, \log \pi_{\theta}(a_{t} \mid o_{t}), V_{\phi}(o_{t}))$ in $\mathcal{D}$;
    end
    Estimate GAE and value targets from $\mathcal{D}$;
    Update $\pi_{\theta}$ via clipped surrogate with entropy;
    Update $V_{\phi}$ by regression to value targets;
end
Return trained policy $\pi_{\theta}^{*}$.
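The step "Estimate GAE and value targets" in Algorithm 1 could be realized as follows for a single-episode rollout. The discount `gamma` and the GAE parameter `lam` are assumed defaults, not values reported in the paper. The value `last_value` is the critic estimate for the state after the final step, taken as zero if the episode terminated.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout without resets.

    Returns (advantages, value_targets), each with the same length as rewards.
    """
    rewards = np.asarray(rewards, dtype=float)
    v = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros_like(rewards)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * v[t + 1] - v[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    value_targets = advantages + v[:-1]
    return advantages, value_targets
```

Setting `lam=0` reduces the estimator to the one-step temporal-difference error, while `lam=1` recovers the discounted Monte Carlo return minus the value baseline.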

3.2. PPO+NMPC

This section introduces a framework that integrates Proximal Policy Optimization with Nonlinear Model Predictive Control. The proposed framework enhances nonlinear predictive control by introducing an adaptive weight-shaping mechanism. At each decision step, the PPO policy receives a compact observation vector that summarizes vehicle motion states, path-following deviations, and task progress. Instead of producing control inputs directly, the policy outputs a set of non-negative coefficients that scale the diagonal elements of the NMPC weighting matrices:

$$Q' = \operatorname{diag}(\alpha_{1} q_{1}, \alpha_{2} q_{2}, \dots) \tag{8}$$

$$R' = \operatorname{diag}(\beta_{1} r_{1}, \beta_{2} r_{2}, \dots) \tag{9}$$

Here, $Q'$ and $R'$ are the PPO-adapted versions of the nominal NMPC cost matrices $Q$ and $R$ in Eq. (4). The prime notation is used only to distinguish the online scaled matrices from the fixed nominal matrices. The coefficients $\alpha_{i}$ and $\beta_{j}$ are generated by the PPO policy and scale the corresponding diagonal entries of $Q$ and $R$, thereby reshaping the quadratic tracking and control-effort penalties.

The nominal coefficients satisfy $q_{i} \geq 0$ and $r_{j} > 0$. The Softplus activation and clipping bounds in Eq. (10) keep the PPO scaling coefficients within admissible ranges, so the adapted matrices preserve $Q' \succeq 0$ and $R' \succ 0$, provided the lower clipping bound satisfies $c_{min} > 0$. PPO therefore changes the relative importance of tracking and control effort without violating the definiteness requirements of the quadratic NMPC cost.

The mapping from observation to action, and the resulting control law, can be expressed as:

$$(\alpha_{i}, \beta_{j}) = \operatorname{clip}\left( \operatorname{softplus}\left( W_{2}\, \phi(W_{1} o_{k} + b_{1}) + b_{2} \right), c_{min}, c_{max} \right) \tag{10}$$

$$u_{k} = \kappa_{NMPC}\left( \Theta_{k};\, Q'(\alpha_{i}),\, R'(\beta_{j}) \right) \tag{11}$$

where $o_{k}$ is the observation vector at step $k$; $W_{1}, W_{2}, b_{1}, b_{2}$ are policy network parameters; $\phi(\cdot)$ is the hidden-layer activation function; $c_{min}, c_{max}$ are lower and upper bounds applied to both $\alpha_{i}$ and $\beta_{j}$; $\Theta_{k}$ is the system state at step $k$; and $\kappa_{NMPC}(\cdot)$ denotes the NMPC solver that computes a feasible control input $u_{k}$ under adapted weights and physical constraints.
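A minimal NumPy sketch of the weight-scaling map in Eqs. (8) to (10) is given below. The single hidden layer follows Eq. (10), but the `tanh` activation, the layer sizes, the clipping bounds `c_min=0.1` and `c_max=10.0`, and the nominal weights in the usage lines are all assumptions for illustration. The NMPC solver $\kappa_{NMPC}$ of Eq. (11) is not reproduced here.

```python
import numpy as np

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

def policy_scalings(obs, W1, b1, W2, b2, c_min=0.1, c_max=10.0):
    """Eq. (10): observation -> clipped, strictly positive scaling coefficients."""
    hidden = np.tanh(W1 @ obs + b1)  # phi(.) assumed to be tanh
    return np.clip(softplus(W2 @ hidden + b2), c_min, c_max)

def adapted_weights(scales, q_nom, r_nom):
    """Eqs. (8)-(9): scale the nominal diagonals to obtain Q' and R'."""
    q_nom = np.asarray(q_nom, dtype=float)
    r_nom = np.asarray(r_nom, dtype=float)
    alpha, beta = scales[:q_nom.size], scales[q_nom.size:]
    return np.diag(alpha * q_nom), np.diag(beta * r_nom)

# Usage with random (untrained) parameters and hypothetical nominal weights.
rng = np.random.default_rng(0)
obs = rng.normal(size=6)
W1, b1 = rng.normal(size=(8, 6)), np.zeros(8)
W2, b2 = rng.normal(size=(5, 8)), np.zeros(5)
scales = policy_scalings(obs, W1, b1, W2, b2)
Q_adapted, R_adapted = adapted_weights(scales, [10.0, 5.0, 0.0], [0.1, 0.1])
```

Because every coefficient is clipped to a positive interval, multiplying it by a non-negative $q_{i}$ or a positive $r_{j}$ cannot change the sign of a diagonal entry, which is why the definiteness of the cost matrices is preserved.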

The use of PPO as an online weight-adaptation layer also differs from offline tuning and rule-based scheduling. Offline tuning methods select a fixed set of weights before deployment, whereas curvature-scheduled or gain-scheduled designs update weights according to predefined variables or rules. In the proposed framework, the policy updates the weighting coefficients at each control step using the current tracking state, path-relative features, and task information. The controller therefore retains the constrained NMPC structure, while the relative emphasis on lateral error, heading error, and control effort changes with the local driving context.

This mechanism enables the controller to change its priorities dynamically. When the reference path involves sharp curves, the policy increases the relative weights associated with orientation and lateral deviation, which reduces tracking error and improves path fidelity. Conversely, when the vehicle moves along straighter segments, these penalties are relaxed while more emphasis is placed on minimizing control effort, resulting in smoother actuation and greater energy efficiency. By doing so, the NMPC solver always optimizes against a cost function that reflects real-time driving demands.

An important feature of this design is the separation of responsibilities. Reinforcement learning provides adaptability by adjusting cost weights online, while NMPC ensures interpretability and constraint satisfaction by solving the predictive optimization problem under nonlinear dynamics and physical limits. This division retains the constraint handling of the NMPC solver while improving robustness compared to fixed-weight formulations.

Nonetheless, the current framework only adapts to path geometry and internal states. It cannot directly address the presence of moving obstacles or other safety-critical scenarios. To overcome this limitation, the next section introduces a Control Barrier Function layer that augments the adaptive controller with real-time safety constraints.

3.3. PPO+NMPC+CBF

Control Barrier Functions provide a formal condition for forward invariance of a designated safe set $\mathcal{C}$ [40]. For a control-affine system $\dot{x} = f(x) + g(x) u$, a continuously differentiable function $h : \mathbb{R}^{n} \to \mathbb{R}$ defines $\mathcal{C} = \{x \mid h(x) \geq 0\}$. Safety is preserved if there exists an extended class-$\mathcal{K}_{\infty}$ function $\alpha(\cdot)$ such that:

$$L_{f} h(x) + L_{g} h(x)\, u + \alpha(h(x)) \geq 0 \tag{12}$$

which restricts the evolution of $h(x)$ to prevent the state from crossing the boundary $h(x) = 0$. This condition generalizes Nagumo’s theorem to controlled dynamical systems and provides a sufficient condition for set invariance [44]. In discrete-time settings, the CBF condition is often approximated by:

$$h_{k} - h_{k-1} + \gamma h_{k-1} \geq 0 \tag{13}$$

where $\gamma > 0$ controls the allowable approach rate to the constraint boundary [45].

Equation (12) is the continuous-time CBF condition, in which the terms $L_{f} h(x)$ and $L_{g} h(x)$ are Lie derivatives of the safety function along the drift and input vector fields. The inequality limits the decrease of $h(x)$ so that, when the constraint is feasible, a trajectory that starts inside $\mathcal{C}$ cannot cross the boundary $h(x) = 0$.

Equation (13) is the discrete-time counterpart used over the NMPC prediction horizon. For $0 < \gamma \leq 1$, it can be written as $h(x_{k}) \geq (1 - \gamma)\, h(x_{k-1})$. Hence, if $h(x_{k-1}) \geq 0$ and the constraint is enforced without relaxation, the next predicted state remains inside the safe set. The parameter $\gamma$ controls how fast the trajectory may approach the boundary, with smaller values enforcing a more conservative approach rate.

In this obstacle avoidance task, the safety function is defined as:

$$h(x) = d(x, x_{obs}) - (r_{obs} + \delta) \tag{14}$$

where $d(\cdot)$ denotes the Euclidean distance between the vehicle and the obstacle center, $r_{obs}$ is the obstacle radius, and $\delta$ is a safety margin. If condition (12) is feasible and enforced without relaxation, the predicted trajectory maintains the prescribed minimum clearance from the obstacles over the prediction horizon.
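The safety function of Eq. (14) and the discrete-time condition of Eq. (13) can be checked numerically as in the sketch below. The obstacle geometry and the value $\gamma = 0.5$ in the usage lines are illustrative assumptions, and the code only evaluates the condition for given positions; it does not solve for a control input.

```python
import numpy as np

def h_obstacle(pos, obs_center, r_obs, delta):
    """Eq. (14): distance to the obstacle center minus (radius + margin)."""
    gap = np.asarray(pos, dtype=float) - np.asarray(obs_center, dtype=float)
    return float(np.linalg.norm(gap)) - (r_obs + delta)

def discrete_cbf_residual(h_prev, h_next, gamma):
    """Left-hand side of Eq. (13); the condition holds when this is >= 0."""
    return h_next - h_prev + gamma * h_prev

# Usage: obstacle of radius 1.0 at the origin with a 0.5 margin (illustrative).
h0 = h_obstacle([3.0, 0.0], [0.0, 0.0], r_obs=1.0, delta=0.5)
h_slow = h_obstacle([2.5, 0.0], [0.0, 0.0], r_obs=1.0, delta=0.5)
h_fast = h_obstacle([1.6, 0.0], [0.0, 0.0], r_obs=1.0, delta=0.5)
ok_slow = discrete_cbf_residual(h0, h_slow, gamma=0.5) >= 0.0
ok_fast = discrete_cbf_residual(h0, h_fast, gamma=0.5) >= 0.0
```

Both candidate positions are still outside the forbidden region, but the faster approach removes more than the fraction $\gamma$ of the remaining margin in one step, so it fails the condition even though $h$ stays positive.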

By embedding the safety function into the predictive optimization, the complete control problem becomes an NMPC formulation with adaptive cost shaping from PPO and safety constraints imposed by CBF. At each control step, the PPO policy outputs scaling coefficients $\alpha_{i}, \beta_{j}$ that adapt the diagonal entries of the cost matrices $Q$ and $R$. The resulting finite-horizon optimization problem is expressed as

$$\min_{\{u_{k}\}_{k=0}^{H-1},\, \{s_{k}^{(o)}\}_{k=1}^{H}} \sum_{k=0}^{H} e_{k}^{T} Q' e_{k} + \sum_{k=0}^{H-1} u_{k}^{T} R' u_{k} + \rho \sum_{o \in \mathcal{O}} \sum_{k=1}^{H} \left\| s_{k}^{(o)} \right\|_{1} \tag{15a}$$

s.t.

$$\Theta_{k+1} = F(\Theta_{k}, u_{k}), \quad k = 0, \dots, H-1 \tag{15b}$$

$$\Theta_{k} \in \mathcal{X}, \quad u_{k} \in \mathcal{U}, \quad \Delta u_{k} \in \Delta\mathcal{U}, \tag{15c}$$

$$h^{(o)}(\Theta_{k}) - h^{(o)}(\Theta_{k-1}) + \gamma\, h^{(o)}(\Theta_{k-1}) \geq - s_{k}^{(o)}, \tag{15d}$$

$$s_{k}^{(o)} \geq 0, \quad k = 1, \dots, H. \tag{15e}$$

Here, $Q', R'$ are the PPO-adapted weight matrices and $s_{k}^{(o)}$ are slack variables penalized with factor $\rho > 0$. The functions $h^{(o)}(\Theta_{k})$ represent safety functions defined for each obstacle $o \in \mathcal{O}$. Introducing slack variables ensures feasibility even in narrow passages or under strong disturbances, but it also opens the possibility of constraint violations. Therefore, each slack variable is penalized in the cost function by a large factor $\rho$, discouraging unnecessary relaxation of the safety constraints.

Constraint (15d) applies the discrete CBF condition in Eq. (13) to each obstacle $o$ and each prediction step $k$. The safety function $h^{(o)}(\Theta_{k})$ is positive when the predicted vehicle position is outside the obstacle radius plus the margin, zero on the safety boundary, and negative inside the forbidden region. With $s_{k}^{(o)} = 0$, Eq. (15d) enforces the discrete CBF condition directly. With $s_{k}^{(o)} > 0$, the lower bound is relaxed to preserve feasibility, and the resulting trajectory may temporarily reduce the nominal safety margin.

It is important to distinguish the strict CBF condition from the relaxed implementation used here. If {s_{k}}^{(o)} = 0 for all obstacles and prediction steps, the discrete CBF condition is enforced directly and the safe set remains forward invariant under the model assumptions. If {s_{k}}^{(o)} > 0 for some obstacle or step, the constraint is relaxed to maintain feasibility, and the safety margin may be temporarily reduced. The penalty term on {s_{k}}^{(o)} therefore does not guarantee zero violation; it makes relaxation costly so that slack is used only when strict feasibility cannot be maintained.
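The role of the slack penalty can be made concrete by evaluating the objective (15a) for a fixed candidate trajectory and a single obstacle. The sketch below is illustrative only: the helper names are hypothetical, the error vectors are shortened to two components, and all numbers are invented. For a fixed trajectory, the cheapest admissible slack at each step is the amount by which constraint (15d) would otherwise be violated.

```python
def weighted_quadratic(vec, diag_weights):
    """v^T diag(w) v for a diagonal weight matrix."""
    return sum(w * v * v for v, w in zip(vec, diag_weights))

def minimal_slack(h_prev, h_next, gamma):
    """Smallest s >= 0 with h_next - h_prev + gamma * h_prev >= -s, as in (15d)."""
    return max(0.0, -(h_next - h_prev + gamma * h_prev))

def relaxed_objective(errors, inputs, q_diag, r_diag, h_values, gamma, rho):
    """Evaluate cost (15a) for one obstacle along a given predicted trajectory."""
    tracking = sum(weighted_quadratic(e, q_diag) for e in errors)
    effort = sum(weighted_quadratic(u, r_diag) for u in inputs)
    slacks = [minimal_slack(h_values[k - 1], h_values[k], gamma)
              for k in range(1, len(h_values))]
    return tracking + effort + rho * sum(slacks), slacks

# Hypothetical horizon H = 2 with made-up errors, inputs, and safety values.
errors = [(0.1, 0.0), (0.2, 0.1), (0.0, 0.0)]   # e_0 .. e_H
inputs = [(1.0, 0.5), (0.0, 0.0)]               # u_0 .. u_{H-1}
h_values = [0.5, 0.1, 0.2]                      # h(Theta_0) .. h(Theta_H)
cost, slacks = relaxed_objective(errors, inputs, (2.0, 1.0), (0.1, 0.1),
                                 h_values, gamma=0.7, rho=1000.0)
print(cost, slacks)
```

In this example the first step approaches the obstacle too quickly and needs a slack of about 0.05, which contributes about 50 to a total cost of roughly 50.2. The tracking and effort terms together are below 0.3, which shows why a large ρ makes the solver avoid relaxation whenever a feasible alternative exists.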

Figure 4 provides a schematic overview of the proposed framework. The PPO+NMPC+CBF framework combines adaptive cost shaping, predictive optimization, and safety enforcement in a unified loop. PPO provides online weight adjustment, NMPC enforces dynamic feasibility, and the CBF layer promotes obstacle avoidance through safety constraints. When the relaxed CBF constraints require nonzero slack, the method preserves feasibility while minimizing safety-margin violations rather than guaranteeing strict obstacle clearance. The following section evaluates its effectiveness through simulation experiments.

4. Experiments

This section presents comprehensive simulation experiments to validate the proposed PPO+NMPC+CBF framework. The experiments evaluate tracking accuracy, real-time obstacle avoidance, and robustness under uncertainties, demonstrating the framework’s performance relative to baseline controllers in terms of error reduction, safety enforcement, and control stability. Results confirm the framework’s effectiveness in the tested simulations, with PPO+NMPC achieving approximately 71% tracking error reduction and PPO+NMPC+CBF maintaining zero collisions in the evaluated dynamic-obstacle scenarios. All simulations are conducted within a unified environment, and the controller parameters are kept consistent unless otherwise specified.

4.1. Simulation Setup

The simulation environment was implemented in Python 3.10, using SciPy (SLSQP solver) for numerical optimization and PyTorch for the PPO implementation. Development was carried out on an Intel Core i7-13650HX CPU with 8 GB RAM and an NVIDIA RTX 4060 Laptop GPU. Experiments were conducted on a 200 × 200 grid map representing structured urban environments (see Appendix B), with dynamic obstacles (4.4, 2.8, r = 0.4 m) introduced at step 65 and (6.4, 6.3, r = 0.3 m) introduced at step 150. All experiments use Case 9 of the dataset. The key parameters utilized in the simulation environment are summarized in Table 1.

The reported simulations were executed in an offline Python environment and were not profiled as an embedded real-time implementation. Therefore, the current hardware specification should be interpreted as the computational platform used for simulation, rather than as evidence of guaranteed real-time deployment performance. The short prediction horizon and 0.1 s sampling time used in the simulations are also consistent with the local-validity assumption of the SQP linearization used in the NMPC solver.

The present evaluation is limited to numerical simulation. The reported results therefore validate the proposed control architecture under controlled simulation conditions, but they do not constitute hardware-in-the-loop or physical robot validation. Real-world deployment would introduce additional effects that are not fully captured in the current model, including actuator delay, wheel slip, localization error, sensor latency, communication delay, and embedded-computation limits. These factors may affect both NMPC feasibility and the timing of PPO-based weight adaptation.

4.2. Simulation Results and Analysis

The PPO model was trained offline on a self-constructed dataset of 32 maps (Case 9 was excluded and reserved for testing), each a 200 × 200 grid representing a structured urban environment. The maps were procedurally generated to cover a spectrum of layouts, ranging from straight paths with sparse obstacles to dense clusters, providing varied training conditions for the policy.

The training environment used a 10-dimensional state vector and a 7-dimensional action vector. The policy network employed an Actor–Critic architecture with a hidden layer of 128 units, Softplus activation for non-negative outputs, and clipped surrogate objectives for stability. Training lasted 500 episodes with a batch size of 32 and a learning rate of 0.001 using the Adam optimizer.
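Two ingredients of this setup, the Softplus output mapping and the clipped surrogate objective, can be sketched in plain Python without a deep-learning framework. The sketch makes several assumptions that are not stated in the text: the clipping parameter of 0.2 is the value commonly used with PPO, the raw actor outputs are invented, the seven outputs are split into five factors for Q and two for R (as suggested by the discussion of Figure 8), and the factors multiply the baseline diagonal entries.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); the result is never negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO objective min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical raw actor outputs for the 7-dimensional action.
raw_action = [0.0, 1.0, -1.0, 0.5, 0.0, 2.0, -2.0]
scales = [softplus(a) for a in raw_action]

# Assumed split: five factors for diag(Q), two for diag(R), baseline Q = R = diag(1).
q_scaled = [s * q for s, q in zip(scales[:5], [1.0] * 5)]
r_scaled = [s * r for s, r in zip(scales[5:], [1.0] * 2)]
print(q_scaled, r_scaled)
```

The clipping limits how much a single update can gain from moving the probability ratio away from 1. For a positive advantage, for example, a ratio of 1.5 is credited only as 1.2 when `clip_eps` is 0.2.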

The PPO policy was trained on 32 procedurally generated grid maps and evaluated on Case 9, which was excluded from training. This setup tests whether the learned weight-tuning strategy can be reused on a new map within the same simulation environment. The result should therefore be interpreted as generalization within the tested map class, not as evidence of broad applicability to arbitrary driving environments. The policy observes local tracking and task features, rather than the full map, so it can adjust the NMPC weights when the vehicle encounters different local path conditions such as straight segments, turns, and obstacle-induced deviations. However, the training set remains limited, and the controller has not yet been evaluated under substantially different road layouts, vehicle parameters, reference speeds, sensor delays, actuator dynamics, or hardware constraints. Future work will extend the evaluation to larger and more diverse map sets, out-of-distribution scenarios, and hardware-in-the-loop or real-vehicle experiments.

As shown in Figure 5, the cumulative reward increased from approximately -65 in early episodes and stabilized near -10 by episode 500, reflecting improved tracking accuracy, speed regulation, and goal approach. The policy began to exhibit convergence after roughly 180 episodes, where the reward curve leveled off and variability reduced, indicating consistent performance and effective balancing of the dense reward components across diverse maps.

To evaluate the benefits of PPO-driven adaptive weight tuning, we compared fixed-weight NMPC with PPO+NMPC on Case 9 without dynamic obstacles, using the pre-trained PPO model. Both controllers shared the same nonlinear dynamics and A^{*} reference path, differing only in cost adaptation: NMPC employed static weights (Q = R = \mathrm{diag}(1)), while PPO+NMPC dynamically scaled Q and R according to state observations.

The metrics comparison in Figure 6 shows that PPO+NMPC significantly improves positional and heading accuracy compared with fixed-weight NMPC, leading to more precise path following. Specifically, the mean position error decreases from 76.2 mm to 22.3 mm (a reduction of about 71%), and the mean absolute heading error decreases from 11.14° to 7.31°, demonstrating substantial gains in tracking fidelity. Here, the heading error is reported as the mean of absolute angular deviations, providing a robust measure of orientation accuracy across diverse trajectories. These accuracy gains are accompanied by slight reductions in step count and path length, but they also come with higher total energy consumption and lower energy efficiency. The radar chart highlights this contrast, with PPO+NMPC scoring higher on accuracy-related metrics but lower on energy-related metrics. The result indicates that the PPO policy does not improve all performance dimensions simultaneously. Instead, it shifts the controller toward tighter tracking and faster correction at the expense of actuation effort. This behaviour is consistent with the learned weight adaptation because larger lateral and heading penalties in curved or high-error regions cause the NMPC solver to apply stronger corrective torques. Overall, Figure 6 illustrates how PPO-driven adaptation prioritizes tracking fidelity and stability rather than energy minimization.
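The two accuracy metrics and the quoted relative improvements can be reproduced with a short sketch. It assumes that each logged pose is compared with a time-aligned reference pose and that heading differences are wrapped to [-π, π] before the absolute value is taken. The text does not describe how the reference point is matched, so both choices are assumptions, and the function names are hypothetical.

```python
import math

def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def mean_position_error(actual, reference):
    """Mean Euclidean distance between paired (x, y) points."""
    dists = [math.hypot(ax - rx, ay - ry)
             for (ax, ay), (rx, ry) in zip(actual, reference)]
    return sum(dists) / len(dists)

def mean_abs_heading_error_deg(actual, reference):
    """Mean absolute wrapped heading difference, in degrees."""
    errs = [abs(math.degrees(wrap_angle(a - r)))
            for a, r in zip(actual, reference)]
    return sum(errs) / len(errs)

def relative_reduction(baseline, improved):
    """Fractional reduction of a metric relative to its baseline value."""
    return (baseline - improved) / baseline

# Values reported for fixed-weight NMPC and PPO+NMPC on Case 9.
position_gain = relative_reduction(76.2, 22.3)   # mm
heading_gain = relative_reduction(11.14, 7.31)   # degrees
print(round(position_gain, 3), round(heading_gain, 3))
```

Applied to the reported values, the position error falls by about 71% and the mean absolute heading error by about 34%.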

Table 2 summarizes the quantitative performance comparison across all controllers above. Results show that PPO augmentation consistently reduces mean position and heading errors and slightly decreases step count, while the CBF layer adds safety constraints for obstacle avoidance in the tested scenarios.

The energy-related metrics in Table 2 show the cost of adaptive weight tuning. PPO+NMPC and PPO+NMPC+CBF both reduce tracking error, but both also require higher total energy than their fixed-weight counterparts. The CBF layer enforces obstacle-avoidance constraints, while PPO shifts the NMPC cost toward more aggressive tracking corrections. As a result, the proposed controller is best interpreted as an accuracy- and safety-oriented controller rather than an energy-optimal controller. In applications where energy efficiency is the primary objective, the PPO reward function or the NMPC cost should include a stronger energy penalty so that the learned policy balances tracking accuracy, safety, and actuation effort more evenly.

Real-time computational performance is an important requirement for deploying the proposed controller on an autonomous platform. In the present study, the controller was evaluated in simulation and the implementation was not instrumented to separately record NMPC solve time, PPO inference time, SQP iteration count, maximum control-loop latency, or achieved closed-loop control frequency. PPO inference is expected to add only a small overhead because the policy network is a compact feedforward model, whereas the dominant computational cost is expected to come from the NMPC optimization and the CBF-constrained solve. However, without explicit profiling, the present results should be interpreted as simulation-based validation of the control architecture rather than a complete real-time implementation benchmark. A full runtime study should report the mean and maximum NMPC solve time, PPO inference time, total control-loop latency, average solver iterations, and achieved control frequency under representative hardware and embedded deployment conditions.

The present comparison focuses on fixed-weight NMPC, PPO+NMPC, NMPC+CBF, and PPO+NMPC+CBF. A direct benchmark against other adaptive NMPC strategies, such as genetic-algorithm-tuned NMPC, Bayesian-optimization-tuned NMPC, curvature-dependent weight scheduling, and adaptive MPC, is outside the scope of the current simulation study. Such a comparison would be useful because these methods adapt the controller through different mechanisms and may lead to different trade-offs among tracking accuracy, energy use, computational cost, and safety-margin preservation. Future work will evaluate these adaptive tuning strategies under a common vehicle model, map set, and obstacle-avoidance benchmark.

Building upon the adaptive tracking improvements demonstrated with PPO+NMPC, the next step is to evaluate PPO+NMPC+CBF in scenarios where dynamic obstacles are introduced during the trajectory. NMPC+CBF and PPO+NMPC+CBF use the same nonlinear dynamics model, A^{*} reference path, and CBF constraints (γ = 0.7, d_{safe} = 0.15 m). The difference lies in weight tuning: NMPC+CBF maintains fixed weights, while PPO+NMPC+CBF employs dynamic adaptation via the pre-trained PPO policy. This setup enables a direct comparison of safety and tracking performance under identical conditions.

Figure 7 compares position and heading tracking errors between NMPC+CBF and PPO+NMPC+CBF. The error curves (left) show that PPO+NMPC+CBF not only yields smaller deviations, especially after obstacle encounters, but also returns to the reference trajectory more quickly, indicating faster recovery. The transient response following disturbances is noticeably smoother, suggesting stronger robustness compared with the fixed-weight baseline. The boxplots (right) further highlight this improvement: both the mean position error (from 0.100 m to 0.08 m) and the median position error (from 0.05 m to 0.03 m) decrease, while the heading error distribution becomes narrower with fewer outliers. Beyond reducing error magnitudes, the reduced variance and tighter distributions demonstrate more consistent tracking and more reliable convergence, lowering the risk of extreme deviations.

Following the tracking error analysis, Figure 8 illustrates how PPO dynamically adjusts the Q and R matrices within the PPO+NMPC+CBF framework by showing weight adjustment factors over 250 time steps. Rather than treating all terms equally, the policy adapts specific weights to changing trajectory demands. For example, Q_{1} to Q_{3} and R_{2} show clear variability, indicating active regulation of lateral, longitudinal, and heading errors as well as torque balance between wheels. This reflects the controller’s attempt to prioritize accuracy and stability under different operating phases. By contrast, Q_{4}, Q_{5}, and R_{1} remain nearly constant, suggesting that the policy deliberately keeps angular-rate and certain input penalties stable to preserve safety margins and prevent excessive control effort. Notably, R_{2} shows greater variability than R_{1}, indicating that the policy exploits the right-wheel torque penalty as a flexible degree of freedom for fine-tuning trajectory corrections. Meanwhile, the left-wheel penalty remains stable, serving as a baseline to preserve overall control balance. The overall pattern highlights a division of labor: some weights are flexibly tuned to handle trajectory complexity, while others remain fixed to guarantee robust constraint enforcement. This balance demonstrates how PPO adapts the cost landscape in real time, providing interpretability for why certain performance trade-offs emerge.

4.3. Robustness Evaluation

Real-world autonomous driving is inevitably affected by uncertainties such as model mismatch, sensor imperfections, and path disturbances. To evaluate the robustness of the proposed PPO+NMPC+CBF framework, three representative disturbances were considered: (i) friction uncertainty, introduced by scaling the rolling resistance and velocity damping coefficients; (ii) sensor noise, simulated by injecting Gaussian noise of varying standard deviations into state observations; and (iii) path disturbance, generated by perturbing the reference trajectory with random deviations. Each scenario was tested over 20 independent trials using the pre-trained PPO policy from Section 4.2 and the nonlinear vehicle model from Section 2.1, with mean tracking error adopted as the primary metric. Here, σ denotes the disturbance standard deviation used in the robustness evaluation.
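The three disturbance types and the 20-trial protocol could be organized roughly as in the following sketch. All names are hypothetical, and `run_episode` stands in for a closed-loop simulation that is not shown. The dictionary keys for the friction coefficients and the use of Gaussian deviations for the path disturbance are assumptions, since the text only specifies "random deviations".

```python
import random
import statistics

def scale_friction(params, factor):
    """Return a copy of the model parameters with scaled resistance coefficients."""
    perturbed = dict(params)
    perturbed["c_rr"] = params["c_rr"] * factor
    perturbed["c_v"] = params["c_v"] * factor
    return perturbed

def add_sensor_noise(state, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to each state entry."""
    return [x + rng.gauss(0.0, sigma) for x in state]

def perturb_path(path, sigma, rng):
    """Shift every reference waypoint by an independent Gaussian deviation."""
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in path]

def run_trials(run_episode, n_trials=20, seed=0):
    """Call run_episode(rng) once per trial and summarize the mean tracking errors."""
    errors = [run_episode(random.Random(seed + i)) for i in range(n_trials)]
    return statistics.mean(errors), statistics.median(errors)

# Placeholder episode: returns a synthetic error instead of running the controller.
def dummy_episode(rng):
    noisy = add_sensor_noise([0.0, 0.0], sigma=0.05, rng=rng)
    return abs(noisy[0]) + abs(noisy[1])

mean_err, median_err = run_trials(dummy_episode)
print(mean_err, median_err)
```

Seeding each trial separately keeps the trials independent while making a given disturbance level reproducible across controllers.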

The boxplots in Figure 9 highlight the selective strengths of the PPO+NMPC+CBF framework. Its limited sensitivity to friction mismatch suggests that the learned weight adaptation remains stable under the tested parameter variations, although this does not by itself establish generalization to different vehicle models or driving environments. The gradual degradation under sensor noise reflects the limits of observation quality, but the errors remain bounded, showing that the controller avoids noise amplification. For path disturbances, the wider error spreads suggest more challenging recovery, yet the absence of large outliers indicates that the CBF layer helps prevent unsafe divergence in the tested scenarios. The mean and median tracking errors remain low (≤ 0.13 m) across all disturbance scenarios, which reflects stable and bounded system behavior rather than any catastrophic deviation. In contrast, prior studies have reported that conventional NMPC controllers exhibit larger deviations and wider error distributions when subjected to observation noise or model mismatch [46,47], reflecting their sensitivity to parameter uncertainty and external disturbances.

Overall, the experiments show that PPO+NMPC improves tracking accuracy compared with fixed-weight baselines, while the integration of the CBF layer adds safety constraints for obstacle avoidance and improves recovery after disturbances in the tested scenarios. Robustness evaluations further confirm stable performance under friction, noise, and path uncertainties, with tracking errors remaining bounded. PPO+NMPC+CBF therefore shows a simulation-level balance between adaptability, accuracy, safety constraints, and control effort.

The robustness tests provide an initial indication of closed-loop behaviour under selected simulated uncertainties, but they do not replace experimental validation. A more complete validation should include hardware-in-the-loop testing and physical experiments on a mobile robot platform. In such tests, the same PPO+NMPC+CBF pipeline should be evaluated under real-time control constraints, with measured actuator response, localization uncertainty, obstacle-detection latency, and solver execution time included in the loop.

5. Conclusions

This paper proposed a PPO-augmented nonlinear MPC framework with a CBF safety layer for autonomous vehicle trajectory tracking. The design addresses adaptability, safety-constrained control, and robustness by combining dynamic weight adjustment with predictive optimization. Extensive simulations demonstrated that PPO-enhanced NMPC achieves notable reductions in position and heading errors compared with linear MPC and fixed-weight NMPC, but these gains are obtained with higher control effort and lower energy efficiency. Thus, the present policy favours tracking accuracy and safety over energy minimization. The CBF layer improves obstacle avoidance and enables smoother recovery after disturbances. Strict safety-constraint satisfaction holds when the CBF constraints remain feasible without slack relaxation, while the implemented relaxed formulation provides practical safety preservation by penalizing any required reduction of the safety margin. Robustness tests under friction uncertainty, sensor noise, and path disturbances further confirmed stable performance, with both mean and median tracking errors remaining within a small range across all tested scenarios. Collectively, these results validate the proposed PPO+NMPC+CBF framework as an effective and interpretable simulation framework for adaptive trajectory tracking and safety-constrained navigation within the tested environment class. Future work will include larger and more diverse map evaluations, out-of-distribution scenarios, explicit real-time profiling, hardware-in-the-loop validation, and physical experiments on a mobile robotic platform. These tests will be used to quantify NMPC solve time, PPO inference time, control-loop latency, solver iteration count, achievable control frequency, tracking error, and safety-margin preservation under real sensing, actuation, and computation constraints.

Accordingly, the reported results should be read as simulation evidence of bounded closed-loop behavior under the tested conditions, rather than as a proof of global optimality or formal asymptotic stability of the complete learning-enhanced controller.

Acknowledgments

The authors acknowledge grant support from the Royal Society (RGS\R2\242489) and from the Dame Kathleen Ollerenshaw Fellowship.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

DERIVATION OF VEHICLE MODEL

The derivations in Appendix A are not claimed as a new theoretical contribution. They follow the standard Newton-Euler force and torque balance for differential-drive vehicles, consistent with the background model in [37], and are included to make the simulation model, assumptions, and parameter definitions self-contained.

The linear acceleration of the differential-drive vehicle in the forward direction is obtained by balancing the driving force generated by the wheel torques against the principal resistive forces. Applying Newton’s second law to the vehicle mass yields the net force balance from which the instantaneous linear acceleration

\dot{v}

can be directly computed as:

\dot{v} = \frac{(T_{l} + T_{r}) / r - c_{rr} m g \, \mathrm{sgn}(v) - c_{v} v}{m}

(A1)

where T_{l}, T_{r} denote the driving torques of the left and right wheels, r is the wheel radius, m is the vehicle mass, and g is the gravitational acceleration. Opposing this motion are the rolling resistance and the viscous drag, which scale with the coefficients c_{rr} and c_{v}, respectively. The angular acceleration of the differential-drive vehicle about its Z-axis is obtained by balancing the net driving torque generated by the wheel torque difference against the viscous damping torque. The net torque balance on the yaw inertia J_{z} gives the instantaneous angular acceleration \dot{ω} as:

\dot{ω} = \frac{L (T_{r} - T_{l}) / (2 r) - c_{ω} ω}{J_{z}}

(A2)

where L denotes the wheelbase, c_{ω} is the yaw damping coefficient, and ω is the yaw rate of the vehicle.
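For readers who want to experiment with the model, Eqs. (A1) and (A2) translate directly into code. The sketch below is a minimal illustration rather than the simulation code used in the paper. The parameter values are invented, the dictionary keys are hypothetical, and both the forward-Euler step and the standard unicycle kinematics for (x, y, θ) are assumptions. Only the 0.1 s step size is taken from the text.

```python
import math

def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)

def accelerations(v, omega, t_left, t_right, p):
    """Evaluate Eqs. (A1) and (A2) for the given speeds and wheel torques."""
    drive_force = (t_left + t_right) / p["r"]
    rolling = p["c_rr"] * p["m"] * p["g"] * sign(v)
    v_dot = (drive_force - rolling - p["c_v"] * v) / p["m"]
    yaw_torque = p["L"] * (t_right - t_left) / (2.0 * p["r"])
    omega_dot = (yaw_torque - p["c_w"] * omega) / p["J_z"]
    return v_dot, omega_dot

def euler_step(state, t_left, t_right, p, dt=0.1):
    """Advance (x, y, theta, v, omega) by one forward-Euler step."""
    x, y, theta, v, omega = state
    v_dot, omega_dot = accelerations(v, omega, t_left, t_right, p)
    return (x + dt * v * math.cos(theta),
            y + dt * v * math.sin(theta),
            theta + dt * omega,
            v + dt * v_dot,
            omega + dt * omega_dot)

# Hypothetical parameters; none of these values come from the paper.
params = {"m": 10.0, "r": 0.1, "g": 9.81, "c_rr": 0.01, "c_v": 0.5,
          "L": 0.5, "J_z": 1.0, "c_w": 0.2}

next_state = euler_step((0.0, 0.0, 0.0, 1.0, 0.0), 1.0, 1.0, params)
print(next_state)
```

Equal wheel torques produce no yaw acceleration, while a torque difference enters only through the L (T_r - T_l) / (2 r) term. This is a quick consistency check on the signs in (A2).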

Appendix B

TRAINING DATASET

The training dataset of 32 A^{*} reference paths, constructed on 200 × 200 grids to emulate urban environments, is provided:

References

1. Marino, R.; Scalzi, S.; Orlando, G.; Netto, M. A nested PID steering control for lane keeping in vision-based autonomous vehicles. Proc. Amer. Control Conf., Jun. 2009; pp. 2885–2890.
2. Du, X.; Tan, K. K.; Htet, K. K. K. Vision approach towards fully self-reverse parking system. Proc. IEEE Int. Conf. Mechatronics Autom., Aug. 2014; pp. 186–191.
3. Du, X.; Tan, K. K. Autonomous reverse parking system based on robust path generation and improved sliding mode control. IEEE Trans. Intell. Transp. Syst. 2015, vol. 16(no. 3), 1225–1237.
4. Chen, X.; Bao, Q.; Zhang, B. Research on 4WIS electric vehicle path tracking control based on adaptive fuzzy PID algorithm. Proc. Chinese Control Conf., Jul. 2019; pp. 6753–6760.
5. Yeh, Y.-C.; Li, T.-H. S.; Chen, C.-Y. Adaptive fuzzy sliding-mode control of dynamic model-based car-like mobile robot. Int. J. Fuzzy Syst. 2009, vol. 11(no. 4), 272–281.
6. Limon, D.; Ferramosca, A.; Alvarado, I.; Alamo, T. Model predictive control for setpoint tracking. arXiv 2024, arXiv:2403.02973.
7. Zhang, C.; Chu, D.; Liu, S.; Deng, Z.; Wu, C.; Su, X. Trajectory planning and tracking for autonomous vehicle based on state lattice and model predictive control. IEEE Intell. Transp. Syst. Mag. 2019, vol. 11(no. 2), 29–40.
8. Wang, H.; Liu, B.; Ping, X.; An, Q. Path tracking control for autonomous vehicles based on an improved MPC. IEEE Access 2019, vol. 7, 161064–161073.
9. Stano, P.; Montanaro, U.; Tavernini, D.; Tufo, M.; Fiengo, G.; Novella, L.; et al. Model predictive path tracking control for automated road vehicles: A review. Annu. Rev. Control 2023, vol. 55, 194–236.
10. Köhler, J.; Müller, M. A.; Allgöwer, F. A nonlinear tracking model predictive control scheme for dynamic target signals. Automatica 2020, vol. 118, Art. no. 109030.
11. Yuan, H.; Sun, X.; Gordon, T. Unified decision-making and control for highway collision avoidance using active front steer and individual wheel torque control. Veh. Syst. Dyn. 2019, vol. 57(no. 8), 1188–1205.
12. Zhu, G.; Jie, H.; Hong, W. NMPC-based path tracking control strategy for 4WID autonomous vehicle considering handling stability under extreme conditions. Proc. 7th CAA Int. Conf. Veh. Control Intell., Oct. 2023; pp. 1–6.
13. Du, X.; Htet, K. K. K.; Tan, K. K. Development of a genetic-algorithm-based nonlinear model predictive control scheme on velocity and steering of autonomous vehicles. IEEE Trans. Ind. Electron. 2016, vol. 63(no. 11), 6970–6977.
14. Le, V.; Malikopoulos, A. Controller adaptation via learning solutions of contextual Bayesian optimization. IEEE Robot. Autom. Lett. 2025, vol. 10, 8308–8315.
15. Alcalá, E.; Puig, V.; Quevedo, J.; Rosolia, U. Autonomous racing using Linear Parameter Varying-Model Predictive Control (LPV-MPC). Control Eng. Pract. 2020, vol. 95, Art. no. 104270.
16. Zarrouki, B.; Spanakakis, M.; Betz, J. A safe reinforcement learning driven weights-varying model predictive control for autonomous vehicle motion control. Proc. IEEE Intell. Veh. Symp. (IV) 2024, 1401–1408.
17. Ostafew, C. J.; Schoellig, A. P.; Barfoot, T. D. Robust constrained learning-based NMPC enabling reliable mobile robot path tracking. Int. J. Robot. Res. 2016, vol. 35(no. 13), 1547–1563.
18. Pannek, J.; Gerdts, M. Performance of sensitivity based NMPC updates in automotive applications. arXiv 2014, arXiv:1401.3548.
19. Diehl, M.; Ferreau, H. J.; Haverbeke, N. “Efficient numerical methods for nonlinear MPC and moving horizon estimation,” in Nonlinear Model Predictive Control: Towards New Challenging Applications; Magni, L., Raimondo, D. M., Allgöwer, F., Eds.; Springer: Berlin, Heidelberg, 2009; pp. 391–417.
20. Kayacan, E.; Saeys, W.; Ramon, H.; Belta, C.; Peschel, J. M. Experimental validation of linear and nonlinear MPC on an articulated unmanned ground vehicle. IEEE/ASME Trans. Mechatron. 2018, vol. 23(no. 5), 2023–2030.
21. Stella, L.; Themelis, A.; Sopasakis, P.; Patrinos, P. A simple and efficient algorithm for nonlinear model predictive control. Proc. 56th IEEE Conf. Decis. Control, Dec. 2017; pp. 1939–1944.
22. Goodwin, G. C.; Cea, M. G.; Seron, M. M.; Ferris, D.; Middleton, R. H.; Campos, B. Opportunities and challenges in the application of nonlinear MPC to industrial problems. Proc. IFAC World Congr., Aug. 2012; pp. 39–49.
23. Pane, Y. P.; Nageshrao, S. P.; Babuška, R. Actor-critic reinforcement learning for tracking control in robotics. Proc. 55th IEEE Conf. Decis. Control, Dec. 2016; pp. 5819–5826.
24. Kosta, K.; Anwar, M. A.; Panda, P.; Raychowdhury, A.; Roy, K. RAPID-RL: A reconfigurable architecture with preemptive-exits for efficient deep-reinforcement learning. Proc. IEEE Int. Conf. Robot. Autom., May 2022; pp. 7492–7498.
25. Riemer, M.; Subbaraj, G.; Berseth, G.; Rish, I. Enabling realtime reinforcement learning at scale with staggered asynchronous inference. arXiv 2024, arXiv:2412.14355.
26. Shan, Y.; Zheng, B.; Chen, L.; Chen, L.; Chen, D. A reinforcement learning-based adaptive path tracking approach for autonomous driving. IEEE Trans. Veh. Technol. 2020, vol. 69(no. 10), 10581–10595.
27. Sierra-Garcia, J. E.; Santos, M. Combining reinforcement learning and conventional control to improve automatic guided vehicles tracking of complex trajectories. Expert Syst. 2023, vol. 41(no. 2), Art. no. e13076.
28. Bellegarda, G.; Byl, K. An online training method for augmenting MPC with deep reinforcement learning. Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2020; pp. 5453–5459.
29. Chen, Z.; Lai, J.; Li, P.; Awad, O. I.; Zhu, Y. Prediction horizon-varying model predictive control (MPC) for autonomous vehicle control. Electronics 2024, vol. 13(no. 8), Art. no. 1442.
30. Rokonuzzaman, M.; Mohajer, N.; Nahavandi, S.; Mohamed, S. Model predictive control with learned vehicle dynamics for autonomous vehicle path tracking. IEEE Access 2021, vol. 9, 128233–128249.
31. Akmandor, N. Ü.; Prajapati, S.; Zolotas, M.; Padır, T. Re4MPC: Reactive nonlinear MPC for multi-model motion planning via deep reinforcement learning. arXiv 2025, arXiv:2506.08344.
32. Reiter, R.; Ghezzi, A.; Baumgärtner, K.; Hoffmann, J.; McAllister, R. D.; Diehl, M. AC4MPC: Actor-critic reinforcement learning for nonlinear model predictive control. arXiv 2024, arXiv:2406.03995.
33. Martinsen, B.; Lekkas, A. M.; Gros, S. Reinforcement learning-based NMPC for tracking control of ASVs: Theory and experiments. Control Eng. Pract. 2022, vol. 120, Art. no. 105024.
34. Berg, H. S.; Menges, D.; Tengesdal, T.; Rasheed, A. Digital twin syncing for autonomous surface vessels using reinforcement learning and nonlinear model predictive control. Sci. Rep. 2025, vol. 15(no. 1), Art. no. 9344.
35. Mehndiratta, M.; Camci, E.; Kayacan, E. Automated tuning of nonlinear model predictive controller by reinforcement learning. Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2018; pp. 3016–3021.
36. Ceusters, G.; Camargo, L. R.; Franke, R.; Nowé, A.; Messagie, M. Safe reinforcement learning for multi-energy management systems with known constraint functions. arXiv 2022, arXiv:2207.03830.
37. Malu, S. K.; Majumdar, J. Kinematics, localization and control of differential drive mobile robot. Glob. J. Res. Eng. 2014, vol. 14(no. H1), 1–7.
38. Yang, H.; Deng, F.; He, Y.; Jiao, D.; Han, Z. Robust nonlinear model predictive control for reference tracking of dynamic positioning ships based on nonlinear disturbance observer. Ocean Eng. 2020, vol. 215, Art. no. 107885.
39. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
40. Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv 2015, arXiv:1506.02438.
41. Han, M.; Zhang, L.; Wang, J.; Pan, W. Actor-critic reinforcement learning for control with stability guarantee. IEEE Robot. Autom. Lett. 2020, vol. 5(no. 4), 6217–6224.
42. Devlin, S.; Kudenko, D. Dynamic potential-based reward shaping. Proc. Int. Conf. Autonomous Agents Multiagent Syst., Jun. 2012; pp. 433–440.
43. Ames, D.; Coogan, S.; Egerstedt, M.; Notomista, G.; Sreenath, K.; Tabuada, P. Control barrier functions: Theory and applications. Proc. Eur. Control Conf., Jun. 2019; pp. 3420–3431.
44. Horváth, Z.; Song, Y.; Terlaky, T. Invariance conditions for nonlinear dynamical systems. arXiv 2016, arXiv:1607.01107.
45. Ames, D.; Xu, X.; Grizzle, J. W.; Tabuada, P. Control barrier function based quadratic programs for safety critical systems. IEEE Trans. Autom. Control 2017, vol. 62(no. 8), 3861–3876.
46. Suwartadi, E.; Kungurtsev, V.; Jäschke, J. Sensitivity-Based Economic NMPC with a Path-Following Approach. Processes 2017, vol. 5(no. 1), 8.
47. Zhang, H.; Li, P.; García, C. E. Robust stability of nonlinear model predictive control based on extended Kalman filter. J. Process Control 2012, vol. 22(no. 1), 82–89.

Sheng Jin received the B.E. degree in Automation from Guangdong University of Technology, Guangzhou, China, in 2024. He is currently pursuing the M.Sc. degree in Advanced Control and Systems Engineering with the Department of Electrical and Electronic Engineering at the University of Manchester, Manchester, U.K. His research interests include learning-based control and reinforcement learning.

Joel Loh received the Ph.D. degree in Electrical and Computer Engineering from the University of Toronto, Toronto, Canada, in 2020. He is currently a Dame Kathleen Ollerenshaw Fellow (Assistant Professor) with the Department of Electrical and Electronic Engineering, University of Manchester, Manchester, U.K. His research interests include metamaterials, memristors, chemical sensing, and artificial intelligence.

Figure 1. Trajectory comparison among the $A^{*}$ reference path, NMPC+CBF, and the proposed PPO+NMPC+CBF framework. The proposed method achieves closer adherence to the reference while ensuring safe clearance from obstacles.

Figure 2. Differential-drive vehicle model showing position $(x, y)$, velocity $v$, yaw rate $\omega$, and wheel torques $T_{l}, T_{r}$. The yaw angle $\theta$ is measured counterclockwise from the global X-axis. The variable $J_{z}$ represents the yaw moment of inertia about the vertical Z-axis.

Figure 3. PPO weight-adaptation module. The actor outputs non-negative scaling coefficients for the NMPC weighting matrices $Q$ and $R$, while the critic estimates the state value for PPO training.

Figure 4. Workflow of the proposed PPO+NMPC+CBF control framework. (a) $A^{*}$ generates a reference path from maps and static obstacles. (b) The environment simulates vehicle response using torque inputs. (c) Real-time module observes states and adjusts NMPC weights via a neural network. (d) Offline PPO trains Actor–Critic networks to optimize the weight policy. (e) NMPC solves constrained optimal control. (f) CBF imposes real-time safety constraints for dynamic obstacle avoidance. Solid lines represent data flow; dashed borders indicate external modules.

Figure 5. PPO Training Reward Across 500 Episodes (smoothed with error band). The red line marks the onset of the steady phase at around 180 episodes.

Figure 6. Comparison of NMPC and PPO+NMPC across key performance metrics. The left and middle subplots show absolute values and relative improvements. The radar chart visualizes normalized metrics, with Energy defined as the time-integrated wheel torques and Efficiency as the path length–to–energy ratio. All radar values are normalized by setting the larger value of each metric to 1, ensuring direct visual comparison between the two controllers.

Figure 7. Comparative Analysis of Tracking Errors for NMPC+CBF and PPO+NMPC+CBF Frameworks, showing error trajectories with uncertainty bands (left) and error distributions with boxplots (right).

Figure 8. Dynamics and Activity of $Q$ and $R$ Matrix Weight Adjustments in PPO+NMPC+CBF Framework. Each curve corresponds to a specific cost component: $Q_{1}$ (lateral error), $Q_{2}$ (longitudinal error), $Q_{3}$ (heading error), $Q_{4}$ and $Q_{5}$ (velocity and angular rate), and $R_{1}$ and $R_{2}$ (control effort on left and right wheel torques).

Figure 9. Robustness analysis of PPO+NMPC+CBF under friction uncertainty, sensor noise, and path disturbances. Boxplots illustrate mean tracking errors across varying disturbance levels. The results show stable performance with limited noise sensitivity and moderate disturbance effects.

Table 1. SIMULATION PARAMETERS.

Parameter	Value
Vehicle mass $(m)$	$10\ \mathrm{kg}$
Inertia $(J_{z})$	$0.4\ \mathrm{kg \cdot m^{2}}$
Wheel radius $(r)$	$0.05\ \mathrm{m}$
Wheelbase $(L)$	$0.30\ \mathrm{m}$
Rolling friction coef. $(c_{rr})$	$0.01$
Velocity damping coef. $(c_{v})$	$1.0$
Angular damping coef. $(c_{\omega})$	$0.05$
NMPC horizon $(N)$	$10$
Time step $(d_{t})$	$0.1\ \mathrm{s}$
Max velocity $(v_{max})$	$2.0\ \mathrm{m/s}$
Max torque $(T_{max})$	$\mathrm{N \cdot m}$

Table 2. PERFORMANCE METRICS FOR BASELINE AND PPO-AUGMENTED CONTROLLERS.

Metrics	NMPC	PPO+NMPC	NMPC+CBF	PPO+NMPC+CBF
Steps	$237$	$227$	$250$	$240$
Mean position error $(m)$	$0.076$	$0.022$	$0.101$	$0.086$
Path length $(m)$	$14.22$	$13.99$	$16.26$	$15.48$
Total energy $(J)$	$34.88$	$45.94$	$38.79$	$49.72$
Energy efficiency $(m / J)$	$0.408$	$0.305$	$0.419$	$0.311$
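
The efficiency row of Table 2 and the headline tracking-error reduction can be re-derived from the other rows. The following is a minimal sketch, assuming only the definitions given in the Figure 6 caption (efficiency as the path length-to-energy ratio, radar values normalized so that the larger of the two equals 1). The dictionary layout and the helper names `efficiency`, `relative_reduction`, and `normalize_pair` are illustrative and are not taken from the authors' code.

```python
# Values copied from Table 2; the structure and names below are illustrative only.
TABLE_2 = {
    "NMPC":         {"error_m": 0.076, "path_m": 14.22, "energy_J": 34.88},
    "PPO+NMPC":     {"error_m": 0.022, "path_m": 13.99, "energy_J": 45.94},
    "NMPC+CBF":     {"error_m": 0.101, "path_m": 16.26, "energy_J": 38.79},
    "PPO+NMPC+CBF": {"error_m": 0.086, "path_m": 15.48, "energy_J": 49.72},
}


def efficiency(row):
    """Energy efficiency in m/J: path length divided by total energy."""
    return row["path_m"] / row["energy_J"]


def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


def normalize_pair(a, b):
    """Radar-chart normalization: scale both values so the larger one equals 1."""
    top = max(a, b)
    return a / top, b / top


# Efficiency per controller, rounded to the three decimals used in Table 2.
eff = {name: round(efficiency(row), 3) for name, row in TABLE_2.items()}

# Tracking-error reduction of each PPO variant against its own baseline.
err_drop = relative_reduction(TABLE_2["NMPC"]["error_m"],
                              TABLE_2["PPO+NMPC"]["error_m"])
err_drop_cbf = relative_reduction(TABLE_2["NMPC+CBF"]["error_m"],
                                  TABLE_2["PPO+NMPC+CBF"]["error_m"])

# Normalized energy pair as it would appear on the radar chart.
energy_radar = normalize_pair(TABLE_2["NMPC"]["energy_J"],
                              TABLE_2["PPO+NMPC"]["energy_J"])

print(eff)
print(f"error reduction without CBF: {err_drop:.1%}")
print(f"error reduction with CBF:    {err_drop_cbf:.1%}")
print(f"normalized energy (NMPC, PPO+NMPC): {energy_radar}")
```

With the tabulated values, the computed efficiencies match the last row of Table 2, and the error reduction without CBF comes out at about 71 %, consistent with the figure quoted in the abstract. The same arithmetic gives a reduction of roughly 15 % once the CBF layer is active, and shows that the accuracy gain is paid for with higher total energy in both PPO variants.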

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
