Preprint
Article

This version is not peer-reviewed.

Real-Time Two-Stage Scheduling of Electric Vehicle Charging Stations Using an SMPD Framework

Submitted: 07 April 2026
Posted: 27 April 2026


Abstract
Real-time scheduling of large-scale electric vehicle (EV) charging stations is essential for improving service efficiency, operational profitability, and grid coordination. However, most existing studies formulate charging scheduling as a single-stage unified decision problem that jointly handles discrete service access and continuous power control. Such a formulation fails to capture the inherently hierarchical operating mechanism of large-scale charging stations and often suffers from limited interpretability, enlarged action spaces, and reduced scalability under stochastic arrivals, dynamic departures, and time-varying resource constraints. To address these issues, this study reformulates the real-time charging scheduling problem as a two-stage collaborative decision process and proposes a Supervised Service Matching and Reinforcement Power Dispatch (SMPD) framework. In the first stage, a supervised bipartite matching network is developed to determine the service access relationships between waiting EVs and available chargers. In the second stage, a Soft Actor-Critic (SAC)-based continuous control strategy is employed to optimize charging power allocation for connected EVs under charger-level and station-level constraints. Experimental results demonstrate that the proposed framework effectively reduces waiting time while improving charger utilization, charging-demand satisfaction, and economic performance. Comparative and robustness analyses further verify its superior scheduling effectiveness, training stability, and adaptability under different infrastructure scales and random disturbance scenarios.

1. Introduction

With the rapid growth of electric vehicle (EV) ownership, charging infrastructure has become an essential component supporting the low-carbon transition of the transportation sector. However, compared with the refueling process of conventional gasoline vehicles, EV charging is characterized by longer service durations [1], higher uncertainty in arrival behavior [2], and more heterogeneous parking times [3]. As a result, large-scale charging stations often suffer from queue congestion, unbalanced resource utilization, and fluctuations in service quality during peak hours [4,5]. In particular, in large-scale scenarios such as public parking lots, transportation hubs, and urban fast-charging stations, charging resources are limited while charging demand is highly dynamic. Without an effective real-time scheduling mechanism, such systems may experience increased user waiting times, degraded charging experience, imbalanced power allocation, reduced operational revenue, and even additional stress on the distribution grid. Therefore, achieving efficient coordinated allocation of charging services and charging power under complex and dynamic environments has become a critical issue in the intelligent operation of large-scale EV charging stations.
Existing studies on EV charging scheduling have mainly focused on charging power optimization, peak shaving and valley filling, and orderly charging control. Early research predominantly relied on linear programming, integer programming, stochastic optimization, and heuristic algorithms to derive charging decisions under predefined objective functions and constraints. For example, some studies formulated charging station scheduling models based on linear programming or mixed-integer programming to optimize charging cost, load fluctuation, and service efficiency [6,7], while others employed genetic algorithms, greedy strategies, or nonlinear integer programming to address resource allocation and profit maximization problems [8,9]. Although these methods are generally interpretable and effective in small- and medium-scale scenarios, they often encounter substantial computational burdens, limited real-time responsiveness, and insufficient adaptability when vehicle arrivals and departures are highly uncertain and station-level demand varies continuously [10,11]. As EV penetration continues to increase, static optimization or one-shot centralized scheduling frameworks are becoming increasingly inadequate for real-time operation in large-scale dynamic environments.
Recently, reinforcement learning, graph neural networks, and other data-driven approaches have provided new perspectives for EV charging scheduling. Existing studies have shown that intelligent decision-making mechanisms based on environment interaction can learn effective control policies under unknown models or highly uncertain conditions, thereby improving dynamic scheduling capability [12,13]. Meanwhile, graph representation learning has been introduced to capture the complex relationships among vehicles, charging resources, and station states, enhancing state representation and decision robustness [14,15]. Nevertheless, most existing methods still formulate large-scale charging station scheduling as a single-layer unified control problem, where charging decisions for all vehicles are generated directly based on the current system state. Although such formulations improve decision automation to some extent, they essentially combine two fundamentally different optimization tasks into one decision layer: one is a discrete resource allocation problem concerning which vehicle should be prioritized and which charger should provide service, and the other is a continuous control problem concerning how much charging power should be assigned once the service relationship is determined. Treating these two tasks within a single decision layer often leads to complex state representations, enlarged action spaces, and entangled scheduling logic, thereby limiting interpretability and scalability in large-scale applications [16,17].
From the perspective of actual operational mechanisms, real-time scheduling in large-scale charging stations exhibits an inherently hierarchical structure. For newly arrived EVs, the system must first determine whether charging service can be provided immediately and, if so, which charger should be assigned. If charging resources are insufficient, queueing decisions must further account for factors such as remaining parking time, target charging demand, and service priority. Based on this service access decision, the system must then dynamically allocate charging power to connected EVs under station-level power limits, charger capacity constraints, time-varying electricity prices, and user demand satisfaction requirements. Therefore, “service access” and “power control” correspond to two coupled yet fundamentally different decision layers: the former is a discrete matching and queue management problem, whereas the latter is a continuous power allocation and constraint coordination problem. Decoupling these two layers and developing a hierarchical collaborative optimization framework can potentially preserve scheduling granularity while significantly reducing computational complexity and improving real-time applicability in large-scale stochastic scenarios [18].
To address the above issues, this paper proposes a two-stage collaborative optimization framework termed SMPD (Supervised Service Matching and Reinforcement Power Dispatch) for large-scale EV charging stations. In the first stage, a service matching layer is established by constructing a vehicle–charger bipartite graph, where pending EVs and available charging resources are explicitly mapped into a unified matching space. By jointly considering factors such as remaining parking time, target charging demand, expected waiting time, service completion risk, and resource utilization efficiency, the framework derives a real-time service access mechanism. In the second stage, a continuous power dispatch model is developed for EVs that have already been matched with charging resources. Under station-level power limits, charger capacity constraints, and user charging demand requirements, the charging power of each connected EV is dynamically optimized. Finally, a rolling-horizon update mechanism is introduced to enable information feedback and iterative coordination between the service matching layer and the power dispatch layer, allowing the system to achieve efficient, stable, and scalable real-time scheduling under stochastic arrivals, dynamic departures, and fluctuating load conditions.
Compared with existing studies that formulate charging scheduling as a single-stage unified decision problem, the proposed SMPD framework emphasizes the dual-layer coordination of service resource allocation and continuous power control, which is more consistent with the actual operational logic of large-scale charging stations. The framework can explicitly handle service access and queue management under limited charger availability, while further refining charging power allocation once service relationships are established. In this way, it improves resource utilization while simultaneously balancing waiting time, service fairness, and operational revenue. Moreover, by modeling discrete decision-making and continuous control separately, the SMPD framework offers clear advantages in action-space organization, computational tractability, and engineering interpretability, thereby providing a new perspective for real-time intelligent scheduling of large-scale EV charging stations.
The main contributions of this work are summarized as follows. First, a vehicle–charger bipartite matching model is developed to address real-time service access and queue organization under stochastic arrivals and limited charging resources in large-scale charging stations. Second, a continuous power dispatch model is established for connected EVs, incorporating station-level power constraints, charger power limits, and user demand satisfaction to enable fine-grained charging control. Third, a rolling feedback mechanism is designed to coordinate service matching and power dispatch, allowing the two-stage decisions to be continuously updated in dynamic environments. Finally, simulation studies demonstrate the effectiveness of the proposed framework in reducing average waiting time, improving resource utilization, and enhancing operational stability, thereby providing theoretical and methodological support for the intelligent operation of large-scale EV charging stations.

2. System Modeling

2.1 System Overview
This study considers a large-scale electric vehicle (EV) charging station equipped with multiple chargers. In practical operation, EV arrival times, departure times, target
charging demands, and parking durations are all highly stochastic. Meanwhile, the number of available chargers is limited, and the total charging power is constrained by the grid-side capacity. Therefore, the scheduling problem of a charging station is inherently a dynamic optimization problem that involves both discrete service access decisions and continuous power control decisions.
To better reflect the actual operating mechanism of large-scale charging stations, the real-time scheduling process is decomposed into two coupled decision stages. As illustrated in Figure 1, the system evolves under the joint influence of stochastic arrivals, stochastic departures, queueing dynamics, and grid-side power constraints. The first stage is the service matching stage, whose objective is to determine whether an arriving or waiting EV can access charging resources at the current time step and, if so, which charger should be assigned. The second stage is the power dispatch stage, whose objective is to continuously regulate the charging power of connected EVs under the established service relationships.
Within each discrete decision interval, the system first performs service matching according to the current EV states, charger states, and station-level operating information. Then, based on the matching results, the charging power of the connected EVs is optimized. Finally, the system state is updated according to the executed decisions and rolls forward to the next decision epoch. Accordingly, the operation of the charging station can be formulated as a time-varying discrete dynamic system.
2.2 EV State Modeling
Assume that the charging station operates over a discrete time horizon $t \in \mathcal{T} = \{1, 2, \ldots, T\}$, where $\mathcal{T}$ denotes the scheduling horizon, $T$ is the total number of decision steps, and $\Delta t$ is the length of each decision interval. Let $V_t$ denote the set of EVs present in the station at time $t$. For any EV $i \in V_t$, its state is defined as $s_i^t = [SOC_i^t, E_i^{req,t}, T_i^{stay,t}, T_i^{wait,t}, P_i^{max}, \omega_i]$, where $SOC_i^t$ denotes the state of charge (SoC) of EV $i$ at time $t$; $E_i^{req,t}$ denotes its remaining charging demand; $T_i^{stay,t}$ denotes its remaining parking duration; $T_i^{wait,t}$ denotes its cumulative waiting time up to time $t$; $P_i^{max}$ denotes the maximum charging power acceptable to EV $i$; and $\omega_i$ denotes the user service weight or priority coefficient.
To further characterize the urgency of completing the charging task within the limited parking window, the charging urgency index of EV i at time t is defined as
$\rho_i^t = \dfrac{E_i^{req,t}}{\max(T_i^{stay,t}, \epsilon)},$
where $\rho_i^t$ denotes the charging urgency of EV $i$ at time $t$, and $\epsilon$ is a small positive constant introduced to avoid division by zero. Obviously, a larger value of $\rho_i^t$ indicates a higher pressure to complete the charging task within the remaining dwell time, and thus implies a higher scheduling priority.
When EV $i$ receives charging power $P_i^t$ at time $t$, its remaining charging demand is updated as
$E_i^{req,t+1} = E_i^{req,t} - \eta_i P_i^t \Delta t,$
where $E_i^{req,t+1}$ denotes the remaining charging demand of EV $i$ at time $t+1$, $\eta_i$ denotes the charging efficiency of EV $i$, and $P_i^t$ denotes the actual charging power allocated to EV $i$ at time $t$.
Accordingly, the SoC evolves as
$SOC_i^{t+1} = SOC_i^t + \dfrac{\eta_i P_i^t \Delta t}{B_i},$
where $B_i$ denotes the battery capacity of EV $i$.
If EV i does not receive charging service at time t , its waiting time is updated as
$T_i^{wait,t+1} = T_i^{wait,t} + \Delta t,$
where $T_i^{wait,t+1}$ denotes the cumulative waiting time of EV $i$ at the next decision epoch.
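For illustration, these per-EV update rules can be collected into a short sketch. The dataclass layout, the clamping of SoC and remaining demand to their physical ranges, and the explicit decrement of the parking window are illustrative assumptions, not details specified in the text.

```python
# Minimal sketch of the per-EV state updates of Section 2.2; field names
# and the clamping behavior are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class EVState:
    soc: float      # state of charge SOC_i^t (fraction)
    e_req: float    # remaining charging demand E_i^{req,t} (kWh)
    t_stay: float   # remaining parking duration T_i^{stay,t} (h)
    t_wait: float   # cumulative waiting time T_i^{wait,t} (h)
    p_max: float    # maximum acceptable charging power P_i^max (kW)
    battery: float  # battery capacity B_i (kWh)
    eta: float      # charging efficiency eta_i

def urgency(ev: EVState, eps: float = 1e-6) -> float:
    """Charging urgency rho_i^t = E_i^{req,t} / max(T_i^{stay,t}, eps)."""
    return ev.e_req / max(ev.t_stay, eps)

def step(ev: EVState, p: float, dt: float) -> EVState:
    """Advance one decision interval with allocated power p (kW)."""
    if p > 0.0:                                  # served: update demand and SoC
        delivered = ev.eta * p * dt
        ev.e_req = max(ev.e_req - delivered, 0.0)
        ev.soc = min(ev.soc + delivered / ev.battery, 1.0)
    else:                                        # not served: keep waiting
        ev.t_wait += dt
    ev.t_stay = max(ev.t_stay - dt, 0.0)
    return ev
```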
2.3 Charger State Modeling
Let the set of chargers in the station be denoted by $C = \{1, 2, \ldots, M\}$, where $M$ is the total number of chargers. For any charger $j \in C$, its state can be represented as $c_j^t = [b_j^t, P_j^{max}, \eta_j]$, where $b_j^t \in \{0, 1\}$ denotes the occupancy status of charger $j$ at time $t$. Specifically, $b_j^t = 1$ indicates that charger $j$ is occupied, whereas $b_j^t = 0$ indicates that it is idle. In addition, $P_j^{max}$ denotes the maximum output power of charger $j$, and $\eta_j$ denotes the power transfer efficiency of charger $j$.
To ensure physical feasibility, each charger can serve at most one EV at any time step. Therefore, the following constraint must hold:
$\sum_{i \in A_t} x_{ij}^t \le 1, \quad \forall j \in C,$
where $A_t \subseteq V_t$ denotes the set of EVs waiting for service at time $t$, and $x_{ij}^t \in \{0, 1\}$ indicates whether EV $i$ is assigned to charger $j$ at time $t$.
2.4 Station-Level State Modeling
In addition to the local states of EVs and chargers, the real-time operation of a charging station is jointly influenced by station-level constraints and external environmental factors. Accordingly, the station-level state at time $t$ is defined as $h_t^g = [P_t^{grid}, L_t^{base}, \lambda_t, N_t^{queue}]$, where $P_t^{grid}$ denotes the maximum grid-accessible charging power available at time $t$, $L_t^{base}$ denotes the base load of the station at time $t$, $\lambda_t$ denotes the real-time electricity price signal, and $N_t^{queue}$ denotes the number of EVs currently waiting in the queue.
Figure 2. Schematic diagram of the EV–charger state modeling framework.
Based on this, the EV states, charger states, and station-level state are further integrated into the overall system state of the charging station, expressed as $S_t = (\{s_i^t\}_{i \in V_t}, \{c_j^t\}_{j \in C}, h_t^g)$, where $\{s_i^t\}_{i \in V_t}$ denotes the set of states of all EVs present in the station at time $t$, $\{c_j^t\}_{j \in C}$ denotes the set of charger states, and $h_t^g$ denotes the station-level operating information. Therefore, $S_t$ provides a comprehensive representation of the complete operating condition of the charging station at the current decision epoch.
2.5 Stage I: Service Matching Model
The core task of the service matching stage is to determine which EVs should be prioritized for charging service at the current decision epoch and to which chargers they should be assigned. To this end, a vehicle–charger bipartite graph is constructed as $G_t^{bp} = (A_t, C, \mathcal{E}_t)$, where $A_t$ denotes the set of EV nodes waiting for service at time $t$, $C$ denotes the set of charger nodes, and $\mathcal{E}_t$ denotes the set of feasible connection edges between EVs and chargers.
A binary matching variable is introduced as
$x_{ij}^t = \begin{cases} 1, & \text{if EV } i \text{ is assigned to charger } j \text{ at time } t, \\ 0, & \text{otherwise}, \end{cases}$
Here, $x_{ij}^t = 1$ indicates that EV $i$ establishes a service relationship with charger $j$ during the current decision interval.
To ensure the basic feasibility of the matching result, each EV can be assigned to at most one charger at any given time, i.e.,
$\sum_{j \in C} x_{ij}^t \le 1, \quad \forall i \in A_t,$
The left-hand side of the above expression represents the total number of charger assignments received by EV $i$ at time $t$. Similarly, each charger can serve at most one EV at a time, i.e.,
$\sum_{i \in A_t} x_{ij}^t \le 1, \quad \forall j \in C,$
In addition, if EV $i$ and charger $j$ are incompatible in terms of connector type, rated power, or physical connection conditions, then $x_{ij}^t = 0$. After service matching is completed, the set of EVs that have successfully accessed charging resources is given by
$\mathcal{S}_t = \{\, i \in A_t \mid \sum_{j \in C} x_{ij}^t = 1 \,\},$
where $\mathcal{S}_t$ denotes the set of EVs that have been successfully assigned to charging resources at time $t$ (written in calligraphic font to distinguish it from the overall system state $S_t$).
Correspondingly, the waiting queue is defined as
$Q_t = A_t \setminus \mathcal{S}_t,$
where $Q_t$ denotes the set of EVs that have not yet obtained charging service and must remain in the queue.
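As a concrete illustration, the sketch below derives the connected set $\mathcal{S}_t$ and the queue $Q_t$ from a binary matching matrix while verifying the constraints above; the array-based interface is an assumption for illustration only.

```python
# Deriving S_t and Q_t from a binary matching matrix while checking the
# Stage I feasibility constraints. The array interface is an illustrative
# assumption, not the authors' implementation.
import numpy as np

def split_matching(x: np.ndarray, feasible: np.ndarray):
    """x: binary matching matrix of shape (|A_t|, M); feasible: boolean
    mask of admissible EV-charger pairs. Returns (S_t, Q_t) as index lists."""
    assert np.all(x[~feasible] == 0), "infeasible pair matched"
    assert np.all(x.sum(axis=1) <= 1), "an EV holds two chargers"
    assert np.all(x.sum(axis=0) <= 1), "a charger serves two EVs"
    matched = x.sum(axis=1) == 1
    s_t = np.flatnonzero(matched).tolist()     # S_t: EVs granted service
    q_t = np.flatnonzero(~matched).tolist()    # Q_t = A_t \ S_t
    return s_t, q_t
```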
2.6 Stage II: Power Allocation Model
After the service matching stage is completed, the second stage performs continuous power allocation for the connected EV set $\mathcal{S}_t$. For each EV $i \in \mathcal{S}_t$, the charging power at time $t$ is defined as $p_i^t \in [0, P_i^{max}]$, where $p_i^t$ denotes the actual charging power allocated to EV $i$, and $P_i^{max}$ denotes the maximum charging power permitted on the EV side, as defined in Section 2.2.
Considering charger capacity constraints, if EV $i$ is matched to charger $j$, then
$0 \le p_i^t \le x_{ij}^t P_j^{max},$
where $P_j^{max}$ denotes the maximum output power of charger $j$. When $x_{ij}^t = 0$, the above inequality automatically ensures that EV $i$ cannot receive charging power from charger $j$.
From the station-level perspective, the total charging power allocated to all connected EVs cannot exceed the available grid-access power, i.e.,
$\sum_{i \in \mathcal{S}_t} p_i^t \le P_t^{grid},$
where $P_t^{grid}$ denotes the upper limit of the total charging power available to the station at time $t$.
Taking the base load into account, the total station load at time t can be expressed as
$L_t^{total} = L_t^{base} + \sum_{i \in \mathcal{S}_t} p_i^t,$
where $L_t^{total}$ denotes the total station load and $L_t^{base}$ denotes the base load.
To ensure that user charging demand can be satisfied as much as possible before departure, each EV should further satisfy the cumulative energy requirement constraint:
$\sum_{\tau = t}^{\tau_i^{dep}} \eta_i\, p_i^\tau\, \Delta t \ge E_i^{req,t},$
where $\tau_i^{dep}$ denotes the expected departure time of EV $i$, and $E_i^{req,t}$ denotes the remaining charging demand of EV $i$ at time $t$. This constraint implies that the cumulative charging energy received by the EV before departure should, as far as possible, cover its remaining energy demand.
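The per-EV, per-charger, and station-level bounds above can be checked jointly; the following sketch assumes the connected EVs are indexed by arrays and that assign[i] gives the charger matched to EV i (an illustrative interface, not from the paper).

```python
# Feasibility check for a candidate power vector over the connected EVs.
import numpy as np

def power_feasible(p, ev_pmax, charger_pmax, assign, p_grid):
    """Return True iff 0 <= p_i <= P_i^max, p_i <= P_j^max of the assigned
    charger, and sum_i p_i <= P_t^grid all hold."""
    p = np.asarray(p, dtype=float)
    ev_pmax = np.asarray(ev_pmax, dtype=float)
    charger_pmax = np.asarray(charger_pmax, dtype=float)
    per_ev = np.all((p >= 0.0) & (p <= ev_pmax))      # EV-side bound
    per_charger = np.all(p <= charger_pmax[assign])   # charger capacity
    station = p.sum() <= p_grid                       # station budget
    return bool(per_ev and per_charger and station)
```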

2.7. Inter-Stage Coupling Mechanism

The core of the SMPD framework lies in the dynamic coupling between service matching and power dispatch. The matching matrix produced in Stage I, $X_t = [x_{ij}^t]$, determines which EVs can enter Stage II for continuous power control. The execution result of Stage II, in turn, feeds back into the system and affects the EV states, queue states, and charger occupancy states at the next decision epoch.
Let the power allocation result generated in Stage II at time $t$ be denoted by $P_t = \{p_i^t\}_{i \in \mathcal{S}_t}$, where $P_t$ represents the set of charging powers assigned to all connected EVs at time $t$. Then, the system state transition can be uniformly expressed as
$S_{t+1} = \Phi(S_t, X_t, P_t),$
where $\Phi(\cdot)$ denotes the system state update mapping, $S_t$ denotes the overall system state at time $t$, $X_t$ denotes the service matching result, $P_t$ denotes the power dispatch result, and $S_{t+1}$ denotes the system state at the next decision epoch.
In summary, the problem considered in this study is essentially a two-stage dynamic decision-making problem that jointly optimizes service access and continuous power dispatch under stochastic EV arrivals, dynamic departures, and limited charging resources.

3. Two-Stage Coordinated Optimization Method of SMPD

3.1. Overall SMPD Framework

To address the real-time scheduling problem in large-scale EV charging stations under stochastic arrivals, dynamic departures, and limited charging resources, this paper proposes a two-stage coordinated optimization method termed SMPD (Supervised Service Matching and Reinforcement Power Dispatch). Following the hierarchical decision-making principle of “service access first, power dispatch second,” the proposed method decomposes the original highly coupled unified scheduling problem into two sequential yet interconnected subtasks: discrete service matching and continuous power control. This decomposition reduces the complexity of direct joint optimization while preserving scheduling flexibility and improving real-time applicability and interpretability in large-scale dynamic scenarios.
As illustrated in Figure 3, at each decision epoch $t$, SMPD takes the overall charging station state as input. First, in Stage I, a supervised service matching network generates the EV–charger matching matrix $X_t$, based on which the set of connected EVs and the waiting queue $Q_t$ are determined. Then, in Stage II, continuous charging power allocation is performed only for the connected EVs, resulting in the power dispatch vector $P_t$. Under station-level power limits, charger capacity constraints, and user charging demand constraints, a real-time charging schedule is obtained. Finally, the system updates EV states, charger states, and station-level operating states according to the decisions made in both stages, thereby yielding the next system state $S_{t+1}$ and entering the subsequent rolling decision process.
Accordingly, the overall decision-making process of SMPD can be summarized as
$S_t \rightarrow X_t \rightarrow P_t \rightarrow S_{t+1}.$
3.2 Stage I: Supervised Service Matching Network
3.2.1 Input Feature Construction
The objective of Stage I is to determine the service access relationship between waiting EVs and available chargers based on the current system state. To this end, building upon the EV–charger bipartite graph model established in Section 2, the input to the service matching network is constructed as a joint feature representation consisting of the EV feature set $\{h_i^{v,t}\}_{i \in A_t}$, the charger feature set $\{h_j^{c,t}\}_{j \in C}$, and the station-level global feature $h_t^g$, where $A_t$ denotes the set of EVs waiting for service at time $t$, $C$ denotes the set of chargers, and $h_i^{v,t}$, $h_j^{c,t}$, and $h_t^g$ represent the EV feature vector, charger feature vector, and station-level global operating feature, respectively.
Specifically, the EV-side features are defined according to the EV state model and the charging urgency indicator introduced in Section 2.2. The charger-side features are defined by the charger state model in Section 2.3, while the station-level global features are derived from the station-level state model in Section 2.4.

3.2.2. Matching Score Model

Considering that EV arrival behavior, waiting time evolution, and charger occupancy status all exhibit strong temporal dependence, a gated recurrent unit (GRU) is introduced into the Stage I service matching network as a temporal feature extractor. Compared with conventional recurrent neural networks, GRU employs update and reset gates to alleviate the gradient vanishing problem in long-sequence training more effectively, while maintaining relatively low parameter complexity. It is therefore more suitable for online state encoding in large-scale dynamic charging scenarios.
Specifically, for any waiting EV i and charger j , their historical state sequences over the most recent H   decision intervals are constructed and fed into the corresponding GRU encoders to extract temporal embeddings:
$\tilde{h}_i^{v,t} = \mathrm{GRU}_v(h_i^{v,t-H+1}, \ldots, h_i^{v,t}), \qquad \tilde{h}_j^{c,t} = \mathrm{GRU}_c(h_j^{c,t-H+1}, \ldots, h_j^{c,t}),$
where $\tilde{h}_i^{v,t}$ denotes the temporal feature representation of EV $i$ at time $t$, $\tilde{h}_j^{c,t}$ denotes the temporal feature representation of charger $j$ at time $t$, $H$ denotes the look-back window length, and $\mathrm{GRU}_v(\cdot)$ and $\mathrm{GRU}_c(\cdot)$ denote the EV-side and charger-side GRU encoders, respectively.
After obtaining these temporal embeddings, the EV temporal features, charger temporal features, and station-level global features are jointly fed into a matching score network to compute the matching score for each EV–charger pair: $s_{ij}^t = f_\theta(\tilde{h}_i^{v,t}, \tilde{h}_j^{c,t}, h_t^g)$, where $s_{ij}^t$ denotes the matching score of assigning EV $i$ to charger $j$ at time $t$, and $f_\theta(\cdot)$ denotes a multilayer perceptron scoring network parameterized by $\theta$.
Furthermore, to obtain normalized matching probabilities, a Softmax operation is applied over the scores of candidate chargers:
$p_{ij}^t = \dfrac{\exp(s_{ij}^t)}{\sum_{k \in C} \exp(s_{ik}^t)},$
where $p_{ij}^t$ denotes the probability of assigning EV $i$ to charger $j$ at time $t$. Accordingly, the matching probability matrix at time $t$ can be written as $\Pi_t = [p_{ij}^t]$.
With the above GRU–MLP structure, the Stage I network can jointly exploit both current-state information and temporal evolution patterns, thereby improving the adaptability of service matching decisions to stochastic arrivals, dynamic waiting processes, and time-varying resource occupancy conditions.
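A compact PyTorch sketch of this GRU–MLP pipeline is given below. Layer sizes, the pairwise feature concatenation, and the use of the final hidden state as the temporal embedding are illustrative assumptions consistent with, but not specified by, the description above.

```python
# Sketch of the Stage I GRU-MLP scorer: temporal encoding, pairwise
# scoring s_ij^t = f_theta(...), and row-wise softmax over chargers.
import torch
import torch.nn as nn

class MatchingNet(nn.Module):
    def __init__(self, d_ev, d_ch, d_glob, d_hid=64):
        super().__init__()
        self.gru_v = nn.GRU(d_ev, d_hid, batch_first=True)  # EV-side encoder
        self.gru_c = nn.GRU(d_ch, d_hid, batch_first=True)  # charger-side encoder
        self.score = nn.Sequential(                          # scoring MLP f_theta
            nn.Linear(2 * d_hid + d_glob, d_hid), nn.ReLU(),
            nn.Linear(d_hid, 1),
        )

    def forward(self, ev_seq, ch_seq, g):
        # ev_seq: (N, H, d_ev), ch_seq: (M, H, d_ch), g: (d_glob,)
        _, h_v = self.gru_v(ev_seq)          # final hidden state: (1, N, d_hid)
        _, h_c = self.gru_c(ch_seq)          # (1, M, d_hid)
        h_v, h_c = h_v.squeeze(0), h_c.squeeze(0)
        N, M = h_v.size(0), h_c.size(0)
        pair = torch.cat([                   # all EV-charger pairs
            h_v.unsqueeze(1).expand(N, M, -1),
            h_c.unsqueeze(0).expand(N, M, -1),
            g.expand(N, M, -1),
        ], dim=-1)
        s = self.score(pair).squeeze(-1)     # matching scores s_ij^t: (N, M)
        return s.softmax(dim=-1)             # Pi_t: probabilities over chargers
```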

3.2.3. Expert Label Generation and Supervised Training

Since Stage I is trained in a supervised manner, it is necessary to construct reference matching labels corresponding to each system state. To this end, a rule-based expert matcher is introduced to generate a reference service access scheme for each decision epoch based on factors such as EV charging urgency, waiting time, charger availability, and service completion risk. At time $t$, the expert-generated label matrix is defined as $Y_t = [y_{ij}^t]$, where $y_{ij}^t \in \{0, 1\}$ indicates whether EV $i$ is assigned to charger $j$ at time $t$ under the expert policy.
Based on the expert label matrix $Y_t$ and the matching probability matrix $\Pi_t$ produced by the network, the cross-entropy loss is adopted to measure the discrepancy between the predicted results and the reference labels:
$\mathcal{L}_{match} = -\sum_{i \in A_t} \sum_{j \in C} y_{ij}^t \log p_{ij}^t,$
Considering that some EV–charger pairs may be infeasible due to connector type, rated power, or physical interface constraints, a feasibility penalty term is further introduced:
$\mathcal{L}_{feas} = \sum_{i \in A_t} \sum_{j \in C} \xi_{ij}^t\, p_{ij}^t,$
where $\xi_{ij}^t \in \{0, 1\}$ indicates whether EV $i$ and charger $j$ form an infeasible matching pair at time $t$.
Accordingly, the overall loss function of the Stage I supervised matching network can be written as
$\mathcal{L}_{sup} = \mathcal{L}_{match} + \alpha\, \mathcal{L}_{feas},$
where $\alpha$ is the weighting coefficient associated with the feasibility penalty.
During training, the Adam optimizer is employed to iteratively update the network parameters $\theta$, and mini-batch training is adopted to minimize the total loss $\mathcal{L}_{sup}$. Therefore, Stage I is essentially a GRU–MLP-based supervised bipartite matching learning process, whose objective is to learn the mapping from system states to service access relationships.
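The combined objective can be written directly from the two terms above; in the sketch below, the default value of alpha, the logarithm guard eps, and the learning rate are illustrative assumptions.

```python
# Hedged sketch of the Stage I objective L_sup = L_match + alpha * L_feas.
# `pi` is the (N, M) probability matrix Pi_t, `y` the expert labels Y_t,
# and `xi` the infeasibility mask xi_ij^t.
import torch

def stage1_loss(pi: torch.Tensor, y: torch.Tensor, xi: torch.Tensor,
                alpha: float = 0.1, eps: float = 1e-8) -> torch.Tensor:
    l_match = -(y * torch.log(pi + eps)).sum()  # cross-entropy vs. expert labels
    l_feas = (xi * pi).sum()                    # mass placed on infeasible pairs
    return l_match + alpha * l_feas

# One mini-batch Adam update, as described in the text (lr is an assumption):
# opt = torch.optim.Adam(net.parameters(), lr=1e-3)
# loss = stage1_loss(net(ev_seq, ch_seq, g), y, xi)
# opt.zero_grad(); loss.backward(); opt.step()
```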
3.3 Stage II: Reinforcement Learning-Based Power Allocation Strategy
After the service matching stage is completed, Stage II performs continuous power allocation only for the connected EV set. To characterize the dynamic interaction among state evolution, action execution, and reward feedback during power scheduling, this stage is formulated as a Markov decision process (MDP), denoted by $\mathcal{M} = \langle \mathcal{E}, \mathcal{S}^{RL}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, where $\mathcal{E}$ denotes the reinforcement learning environment, $\mathcal{S}^{RL}$ denotes the state space, $\mathcal{A}$ denotes the action space, $\mathcal{P}$ denotes the state transition probability, and $\mathcal{R}$ denotes the reward function. This formulation provides a unified description of the interaction between the controller and the environment in the continuous power allocation process, thereby establishing the theoretical basis for subsequent policy learning.
3.3.1 Environment
The reinforcement learning environment constructed in this study essentially corresponds to the dynamic operation process of a large-scale EV charging station under a given service matching result. The environment receives the continuous power allocation actions generated by the Stage II controller and updates the system state according to EV demand evolution, queue dynamics, charger occupancy changes, and station-level power constraints. Meanwhile, it returns immediate reward signals to the agent.
Since the system state at each decision epoch is primarily determined by the current state, the current action, and external stochastic disturbances, the environment can be approximately regarded as a dynamic decision-making process satisfying the Markov property.
3.3.2 State Space Design
The state space mainly consists of the following three categories of information. First, it includes the local operating states of connected EVs, which characterize the current charging condition of each EV, including its state of charge, remaining charging demand, remaining parking duration, and maximum acceptable charging power. Second, it includes the service matching result generated in Stage I, which specifies the service access relationship between EVs and chargers at the current decision epoch. Third, it includes the station-level operating state, such as the total available grid power, base load, and real-time electricity price.
The underlying variables involved in the state space have been defined in Section 2, including the EV states, station-level states, and the relevant power constraints. By integrating these components, the state space provides a complete description of the system operating condition required for Stage II decision-making.
3.3.3 Action Space Design
The action space of the Stage II reinforcement learning model is defined as the continuous charging power allocation vector for all connected EVs at the current decision epoch. In essence, it corresponds to the real-time power control decision imposed by the agent on the connected EVs. The action space is defined as $a_t = [p_i^t]_{i \in \mathcal{S}_t}$, where $a_t$ denotes the continuous power allocation action at time $t$, and $p_i^t$ denotes the actual charging power assigned to connected EV $i$.
Since Stage II controls only the EVs that have already been matched, the scale of the action space is significantly reduced compared with that of a single-stage unified scheduling framework.
3.3.4 Reward Function Design
To balance charging station operational efficiency, user service quality, and system load stability, the immediate reward in Stage II reinforcement learning is formulated as a weighted combination of the revenue term and multiple penalty terms:
$r_t = R_t^{rev} - \beta\,(C_t^{wait} + C_t^{peak} + C_t^{fair}),$
$C_t^{wait} = \sum_{i \in Q_t} T_i^{wait,t},$
$C_t^{peak} = \Big( L_t^{base} + \sum_{i \in \mathcal{S}_t} p_i^t - P_t^{target} \Big)^2,$
$C_t^{fair} = \mathrm{Var}\Big( \dfrac{E_i^{served,t}}{E_i^{req,t}} \Big), \quad i \in \mathcal{S}_t,$
where $r_t$ denotes the immediate reward at time $t$, and $\beta$ is the weighting coefficient of the penalty terms. $R_t^{rev}$ denotes the operational revenue during the current decision interval, determined jointly by the actual charging power delivered to connected EVs and the electricity price. $C_t^{wait}$ is the waiting-time penalty, introduced to suppress the adverse impact of queueing on user service quality. $C_t^{peak}$ is the load deviation penalty, used to penalize excessive deviation of the station load from the target load level $P_t^{target}$, thereby enhancing the load regulation capability during scheduling. $C_t^{fair}$ is the fairness penalty, which measures the dispersion in demand satisfaction among connected EVs so as to avoid excessive preference toward certain vehicles during charging resource allocation.
As can be seen, the proposed reward function essentially performs a comprehensive trade-off among maximizing operational revenue, improving service quality, maintaining load stability, and ensuring fairness in resource allocation. Through this design, the Stage II reinforcement learning policy can enhance demand responsiveness while simultaneously accounting for the economic efficiency and scheduling stability of charging station operation, thereby improving the practical applicability of the proposed framework in large-scale dynamic scenarios.
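For concreteness, the reward can be evaluated as below. Modeling $R_t^{rev}$ as price times delivered energy and guarding the satisfaction ratio against zero remaining demand are assumptions beyond the qualitative description above; the default beta = 0.4 follows the penalty coefficient reported in Section 4.1.

```python
# Sketch of the Stage II immediate reward r_t.
import numpy as np

def reward(p, price, dt, wait_times, base_load, target_load,
           served, required, beta=0.4):
    """p: powers of connected EVs (kW); wait_times: waiting times of queued
    EVs; served/required: per-EV delivered vs. demanded energy."""
    p = np.asarray(p, dtype=float)
    r_rev = price * p.sum() * dt                        # R_t^rev
    c_wait = float(np.sum(wait_times))                  # C_t^wait
    c_peak = (base_load + p.sum() - target_load) ** 2   # C_t^peak
    ratio = np.asarray(served) / np.maximum(required, 1e-6)
    c_fair = float(np.var(ratio))                       # C_t^fair
    return r_rev - beta * (c_wait + c_peak + c_fair)
```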
3.3.5 SAC-Based Continuous Power Learning Mechanism
Given that the action space in Stage II is a continuous charging power allocation space, this study employs the Soft Actor-Critic (SAC) algorithm to learn the power dispatch policy. SAC introduces an entropy regularization term while maximizing the expected return, thereby enabling the agent to balance reward optimization and policy exploration during training. This makes it particularly suitable for high-dimensional continuous control problems.
At each decision step, under the current reinforcement learning state $s_t^{RL}$, the agent outputs a continuous power allocation action $a_t$. After interacting with the environment, it receives the immediate reward $r_t$ and the next state $s_{t+1}^{RL}$. Accordingly, the transition tuple $(s_t^{RL}, a_t, r_t, s_{t+1}^{RL})$ is stored in the replay buffer for subsequent parameter updates.
To improve training stability, the SAC framework adopts a stochastic policy network and critic networks. Let $Q_\psi(s_t^{RL}, a_t)$ denote the state–action value function and $\pi_\phi(a_t \mid s_t^{RL})$ denote the policy. The actor network outputs a probability distribution over actions, from which the charging power action is sampled through reparameterization.
(1) Policy objective
The core idea of SAC is to maximize the expected discounted cumulative reward augmented by an entropy term. The policy objective can be written as
$J(\pi) = \sum_t \mathbb{E}_{(s_t^{RL}, a_t) \sim \rho_\pi} \big[ r_t + \alpha_{ent}\, \mathcal{H}(\pi(\cdot \mid s_t^{RL})) \big],$
where $\rho_\pi$ denotes the state–action distribution induced by policy $\pi$, $\alpha_{ent}$ is the entropy regularization coefficient (distinct from the feasibility weight $\alpha$ in Stage I), and $\mathcal{H}(\cdot)$ denotes the policy entropy. This objective encourages effective reward maximization while maintaining sufficient exploration throughout training, thereby avoiding premature convergence to local optima.
In practice, the actor loss function can be expressed as
$\mathcal{L}_\pi = \mathbb{E}_{s_t^{RL} \sim \mathcal{D},\, a_t \sim \pi_\phi} \big[ \alpha_{ent} \log \pi_\phi(a_t \mid s_t^{RL}) - Q_\psi(s_t^{RL}, a_t) \big],$
where $\mathcal{D}$ denotes the replay buffer. As implied by this formulation, the objective of policy optimization is to improve action value estimation while simultaneously controlling policy entropy, so as to balance exploitation and exploration.
(2) Value update mechanism
For the critic networks, SAC uses the Bellman equation for iterative learning. The target value is defined as
$y_t = r_t + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi_\phi} \big[ \tilde{Q}_{\tilde{\psi}}(s_{t+1}^{RL}, a_{t+1}) - \alpha_{ent} \log \pi_\phi(a_{t+1} \mid s_{t+1}^{RL}) \big],$
where $\gamma$ is the discount factor and $\tilde{Q}_{\tilde{\psi}}$ denotes the target critic network. By incorporating the entropy term into the target value, the critic learning process remains consistent with the maximum-entropy objective.
On this basis, the critic loss is formulated as
$\mathcal{L}_Q = \mathbb{E}_{(s_t^{RL},\, a_t,\, r_t,\, s_{t+1}^{RL}) \sim \mathcal{D}} \big[ (Q_\psi(s_t^{RL}, a_t) - y_t)^2 \big],$
where $\mathcal{L}_Q$ denotes the mean squared temporal-difference error of the critic network.
After each update, the target network parameters are softly updated as
$\tilde{\psi} \leftarrow \tau\, \psi + (1 - \tau)\, \tilde{\psi},$
where $\tau \in (0, 1)$ is the soft update coefficient. This target update mechanism helps improve training stability.
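The actor loss, the entropy-regularized target, the critic loss, and the soft update above can be combined into one update routine. In this sketch, a single critic and an actor with a sample() method returning an action and its log-probability are simplifying assumptions; practical SAC implementations typically use twin critics and take the minimum of their estimates.

```python
# Combined SAC update implementing L_pi, the target y_t, L_Q, and the
# soft target update psi~ <- tau*psi + (1 - tau)*psi~.
import torch
import torch.nn.functional as F

def sac_update(batch, actor, critic, critic_targ,
               actor_opt, critic_opt, alpha_ent, gamma, tau):
    s, a, r, s_next = batch                          # tensors sampled from D
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)     # a_{t+1} ~ pi_phi
        y = r + gamma * (critic_targ(s_next, a_next)
                         - alpha_ent * logp_next)    # entropy-regularized target
    critic_loss = F.mse_loss(critic(s, a), y)        # L_Q
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    a_new, logp = actor.sample(s)                    # reparameterized action
    actor_loss = (alpha_ent * logp - critic(s, a_new)).mean()  # L_pi
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    with torch.no_grad():                            # soft target update
        for p, p_t in zip(critic.parameters(), critic_targ.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
    return critic_loss.item(), actor_loss.item()
```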
(3) Action execution and constraint handling
Since the Stage II output corresponds to a continuous charging power allocation vector, the raw action sampled from the policy network must be further corrected according to the power bounds and physical constraints introduced in Section 2 before execution. Specifically, the raw action must satisfy the EV-side maximum charging power constraint, the charger capacity constraint, and the station-level total power constraint, as formulated in Section 2.6. Therefore, the actual action executed in Stage II is the feasibility-projected power allocation action, ensuring that the reinforcement learning policy remains physically realizable and practically applicable.
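The text specifies that a feasibility projection is applied but not its exact form; clipping to the elementwise EV-side and charger-side limits followed by proportional rescaling onto the station budget is one simple realization, sketched below under that assumption.

```python
# One simple feasibility projection for the raw SAC action: clip to
# min(P_i^max, P_j^max), then shrink proportionally if sum p_i > P_t^grid.
import numpy as np

def project_action(a_raw, ev_pmax, charger_pmax, assign, p_grid):
    ev_pmax = np.asarray(ev_pmax, dtype=float)
    charger_pmax = np.asarray(charger_pmax, dtype=float)
    p = np.clip(a_raw, 0.0, np.minimum(ev_pmax, charger_pmax[assign]))
    total = p.sum()
    if total > p_grid > 0.0:
        p *= p_grid / total          # rescale onto the station-level budget
    return p
```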

3.4. Two-Stage Coupled Decision-Making and Rolling Update Mechanism

The key advantage of the SMPD framework does not lie in simply cascading the service matching and power dispatch modules, but rather in explicitly embedding their dynamic coupling into a unified system evolution framework through a rolling-horizon mechanism. Specifically, at each decision epoch   t , Stage I determines the set of connected EVs and the queue structure, while the continuous power allocation generated by Stage II further affects EV remaining demand, waiting time, residual parking time, and charger occupancy. These factors jointly determine the system state at the next decision epoch. Therefore, a significant feedback and temporal coupling relationship is formed between service access decisions and power control decisions.
Let $S_t$ denote the system state at time $t$. After executing service matching and power dispatch, the state transition process can be uniformly expressed as $S_{t+1} = \Phi(S_t, X_t, P_t)$. Thus, SMPD performs dynamic real-time scheduling in a rolling manner by coupling the outputs of the two stages within a unified system evolution process.
To further clarify the execution logic of the proposed method in continuous scheduling scenarios, the rolling two-stage scheduling procedure of SMPD is summarized as Algorithm 1.
Algorithm 1: The proposed SMPD framework
Input: Initial system state $S_1$; decision horizon $T$; trained GRU-MLP matching network $f_\theta$; trained SAC-based power dispatch policy $\pi_\phi$
Output: Service matching results $\{X_t\}_{t=1,\ldots,T}$; power dispatch results $\{P_t\}_{t=1,\ldots,T}$; system state trajectory $\{S_t\}_{t=1,\ldots,T}$
1. Observe the initial system state $S_1$
2. for $t = 1, 2, \ldots, T$ do
3.   Extract $\chi_t^{match} = \{\{h_i^{v,t}\}_{i \in A_t}, \{h_j^{c,t}\}_{j \in C}, h_t^g\}$ from the current system state $S_t$
4.   Generate the matching probability matrix $\Pi_t = f_\theta(\chi_t^{match})$
5.   Generate the feasible matching matrix $X_t$
6.   Determine the connected EV set and the waiting queue from $X_t$
7.   Construct the Stage-II reinforcement learning state $s_t^{RL}$
8.   Generate the continuous action $a_t \sim \pi_\phi(\cdot \mid s_t^{RL})$
9.   Project $a_t$ onto the feasible power allocation set to obtain $P_t$
10.  Execute $X_t$ and $P_t$ in the environment
11.  Update the system state according to $S_{t+1} = \Phi(S_t, X_t, P_t)$
12. end for
13. Return $\{X_t\}_{t=1,\ldots,T}$, $\{P_t\}_{t=1,\ldots,T}$, $\{S_t\}_{t=1,\ldots,T}$
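Rendered as code, the rolling loop of Algorithm 1 has the following shape; env, matcher, policy, and their methods are assumed interfaces introduced here for illustration, not the authors' implementation.

```python
# Algorithm 1 as a rolling two-stage loop.
def run_smpd(env, matcher, policy, T):
    S = env.reset()                          # observe S_1
    X_hist, P_hist, S_hist = [], [], [S]
    for t in range(1, T + 1):
        feats = env.matching_features(S)     # {h^v_i}, {h^c_j}, h^g_t
        Pi = matcher(feats)                  # matching probabilities Pi_t
        X = env.feasible_matching(Pi)        # feasible matching matrix X_t
        s_rl = env.rl_state(S, X)            # Stage II state s_t^RL
        a = policy.sample(s_rl)              # a_t ~ pi_phi(.|s_t^RL)
        P = env.project_power(a, X)          # feasible dispatch P_t
        S = env.step(X, P)                   # S_{t+1} = Phi(S_t, X_t, P_t)
        X_hist.append(X); P_hist.append(P); S_hist.append(S)
    return X_hist, P_hist, S_hist
```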

4. Experimental Design and Results Analysis

4.1. Experimental Scenario and Parameter Settings

To ensure comparability of the experimental results, the same discrete-time decision mechanism as that adopted in the reference study is used, and the decision interval is uniformly set to 15 min.
Regarding the economic parameter settings, the user payment consists of two components, namely the electricity tariff and the service fee. The same time-of-use pricing scheme as that in the reference study is adopted. Specifically, the electricity prices for the three periods 8 AM–12 PM, 12 PM–8 PM, and 8 PM–8 AM are set to 0.84, 1.48, and 0.39, respectively, while the corresponding service fees are 0.24, 0.10, and 0.29. The penalty coefficient is set to 0.4 to characterize the joint influence of waiting time, load deviation, and service fairness on the overall scheduling objective.
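As a sketch, the tariff schedule above maps to a simple hour-based lookup; the currency unit is not stated in the text and is therefore left implicit.

```python
# Time-of-use tariff of Section 4.1 as an hour-based lookup.
def tariff(hour: int):
    """Return (electricity price, service fee) for hour in [0, 24)."""
    if 8 <= hour < 12:
        return 0.84, 0.24    # 8 AM - 12 PM
    if 12 <= hour < 20:
        return 1.48, 0.10    # 12 PM - 8 PM
    return 0.39, 0.29        # 8 PM - 8 AM
```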
In terms of infrastructure configuration, the charging station is composed of multiple chargers with identical rated power. When the station scale is enlarged in the experiments, the rated output power of a single charger is not increased; instead, the overall service capacity is improved by increasing the number of chargers that can operate simultaneously. This setting is consistent with the reference study and is helpful for analyzing the scalability of the scheduling method under different infrastructure scales.
On this basis, the proposed SMPD framework is deployed in the above experimental environment. At each decision epoch, the system first extracts vehicle-side features, charger-side features, and station-level global features, and then feeds them into the Stage I GRU–MLP service matching network to generate the matching matrix $X_t$. Subsequently, under the matching-result constraints, the Stage II reinforcement learning state is constructed, and the SAC controller outputs the continuous power allocation vector $P_t$. Finally, the system state is updated according to the state transition mapping $S_{t+1} = \Phi(S_t, X_t, P_t)$, and the process enters the next rolling decision cycle. Each experiment is independently repeated 30 times, and the final results are averaged to reduce the influence of randomness on the experimental conclusions.

4.2. Benchmark Algorithms and Evaluation Metrics

To comprehensively evaluate the performance of the proposed method, three representative baseline algorithms, namely LSAR, CPC, and BinAlg, are selected for comparison with the proposed SMPD framework. The reference study also adopted these three baselines to validate the advantages of the proposed method in terms of revenue improvement and penalty reduction.
Among them, LSAR (Linearization Technique for Representing State-Action Function) linearly approximates the state-action function using manually designed feature functions, and thus represents an approximate scheduling method that relies on physical priors and handcrafted features. CPC (Constant Power Charging) adopts a constant-power charging strategy, under which an EV is charged continuously at its maximum admissible power once it is connected, until its charging demand is satisfied. BinAlg (Binning Algorithm) reduces solution complexity by discretizing and compressing both the state space and the action space through a binning mechanism. These three methods respectively represent linear-approximation-based scheduling, rule-based scheduling, and state-compression-based scheduling, thereby providing sufficiently diverse benchmark references for evaluating the proposed SMPD framework.
To assess the overall performance of different methods from the three aspects of service efficiency, economic performance, and operational stability, the following evaluation metrics are adopted. Average waiting time is used to measure the average delay from EV arrival to successful access to charging resources. Demand satisfaction ratio is used to reflect the extent to which the charging energy actually obtained by an EV before departure satisfies its target charging demand. Charger utilization is used to characterize the utilization efficiency of station-side charging resources. Average profit is used to measure the economic return over the entire scheduling horizon. The overall penalty term is used to quantify the aggregated losses associated with waiting time, load deviation, and fairness. These metrics together enable a relatively comprehensive evaluation of the actual scheduling performance of each algorithm in complex dynamic environments.

4.3. Learning Performance Analysis

The reference study reported in its learning-performance analysis that the cumulative reward of the reinforcement learning module gradually increases with training iterations and becomes stable after approximately 400 training steps. Meanwhile, the total revenue increases progressively, whereas the penalty term continues to decrease, indicating that the model gradually learns an effective scheduling strategy that balances system revenue and user experience.
For the proposed SMPD framework, a similar convergence pattern is observed during training. As shown in Figure 4(a), the cumulative reward exhibits some fluctuations in the early stage of training, mainly because the model is still in the exploration phase, during which stable coordination between the Stage I service matching results and the Stage II power control strategy has not yet been established. As training proceeds, the cumulative reward rises rapidly and enters a stable region after approximately 450 training steps. Figures 4(b) and 4(c) further show that the total system revenue continues to increase during training, while the overall penalty term gradually decreases. This indicates that the proposed method can continuously refine both the service access strategy and the continuous power allocation strategy through repeated interactions, thereby achieving simultaneous improvement in economic efficiency and service quality.
From a mechanistic perspective, the training stability of SMPD mainly originates from the decoupling of the two-stage tasks. In Stage I, supervised learning is first used to generate high-quality service matching results, explicitly determining the set of connected EVs and the waiting queue structure at the current decision epoch. Stage II then performs continuous power allocation only under the constraints of the matching results. As a consequence, the model no longer needs to directly cope with the high-dimensional hybrid action space induced by the joint consideration of discrete access decisions and continuous power control. Compared with a single-stage unified learning framework, SMPD demonstrates more evident advantages in both training efficiency and convergence stability.

4.4. Overall Performance Comparison

To comprehensively evaluate the overall performance of the proposed SMPD method in real-time scheduling for large-scale EV charging stations, it is compared with the three baseline algorithms CPC, BinAlg, and LSAR. Considering that the result of a single stochastic simulation may be affected by fluctuations in EV arrival and departure behaviors, each method is independently executed 30 times under the same experimental configuration. The mean and standard deviation of the main evaluation metrics are then reported to enhance the reliability and persuasiveness of the result analysis.
Table 1. Overall performance comparison of different algorithms on the main evaluation metrics.
Method Waiting time (min) Charger utilization (%) Average profit Penalty
CPC 34.8 ± 2.7 71.5 ± 2.1 648.4 ± 18.6 -1186.2 ± 24.3
BinAlg 29.6 ± 2.1 75.9 ± 1.8 662.1 ± 16.9 -1129.4 ± 21.7
LSAR 24.7 ± 1.8 79.3 ± 1.5 675.8 ± 14.2 -1094.7 ± 18.5
SMPD 18.4 ± 1.2 84.8 ± 1.1 819.6 ± 11.7 -1016.3 ± 14.1
Table 1 presents the overall comparison results of the proposed SMPD method and the three baseline algorithms on the main evaluation metrics. It can be observed that SMPD achieves the best overall performance in terms of average waiting time, demand satisfaction ratio, charger utilization, average profit, and composite penalty, indicating that the proposed two-stage coordinated optimization framework can effectively balance service efficiency, resource utilization, and system-level economic performance.
From the perspective of service efficiency, SMPD reduces the average waiting time to 18.4 min, representing reductions of 47.1%, 37.8%, and 25.5% compared with CPC, BinAlg, and LSAR, respectively. This result indicates that the proposed method can coordinate vehicle access order and limited in-station charging resources more effectively, thereby significantly alleviating queue congestion, especially under highly dynamic arrival conditions. At the same time, the demand satisfaction ratio achieved by SMPD reaches 95.3%, which is 8.6, 5.5, and 3.7 percentage points higher than those of CPC, BinAlg, and LSAR, respectively. This further shows that the proposed framework not only improves service accessibility, but also better fulfills the target charging demand of EVs within their limited dwell time.
From the perspective of resource utilization and economic efficiency, SMPD achieves a charger utilization rate of 84.8%, which is significantly higher than 71.5% for CPC, 75.9% for BinAlg, and 79.3% for LSAR. Meanwhile, its average profit reaches 819.6, exceeding the three baseline methods by 26.4%, 23.8%, and 21.3%, respectively. In addition, the composite penalty of SMPD is −1016.3, the smallest in magnitude among all compared methods. These results indicate that the proposed method can improve charger utilization while effectively reducing performance losses caused by waiting cost, load deviation, and service imbalance, thereby achieving simultaneous improvements in both economic return and operational stability.
Figure 5 further presents the boxplots of the different algorithms with respect to average waiting time and charger utilization. It can be seen that SMPD not only yields the lowest median waiting time, but also exhibits the most concentrated overall distribution, suggesting that it can achieve more stable service efficiency across repeated independent experiments. At the same time, the charger-utilization distribution of SMPD remains consistently in the high-value region with relatively small fluctuations, indicating that the proposed method can effectively maintain statistical stability while improving resource utilization. In contrast, CPC exhibits a more dispersed distribution with more pronounced outliers, implying that it is more sensitive to stochastic arrivals and dynamic station disturbances. Although BinAlg and LSAR outperform CPC to some extent, they still remain inferior to SMPD in both distribution position and fluctuation range.
Mechanistically, the performance advantages of SMPD mainly originate from two aspects. First, the supervised service matching module in Stage I jointly considers remaining dwell time, waiting time, residual charging demand, and station-level resource status, thereby generating access schemes that are better suited to dynamic operating scenarios and alleviating queue congestion during peak periods. Second, the SAC-based continuous power allocation module in Stage II further optimizes intra-station power allocation under the matching results, establishing a more balanced trade-off between demand satisfaction and control-oriented scheduling objectives. As a result, the proposed method achieves clear improvements not only in economic performance, but also in service efficiency and operational stability.

4.5. Reward Comparison Under Different Infrastructure Scales

To further evaluate the adaptability of the SMPD framework under different infrastructure scales, this study compares the performance of CPC, BinAlg, SMPD, and LSAR under different numbers of chargers. Since the primary objective of infrastructure-expansion experiments is to assess the economic performance and real-time feasibility of the algorithms under scaling conditions, two metrics are adopted: average reward (Reward) and single-decision runtime (Runtime). Here, Reward is used to measure the overall operational return of different methods under different charging-station scales, whereas Runtime reflects the computational overhead of each algorithm in online scheduling scenarios.
As shown in Table 2, when the number of chargers increases from 100 to 2000, the Reward of all algorithms exhibits an increasing trend. This indicates that infrastructure expansion can enhance station service capacity, allowing more EVs to receive effective charging service within their limited dwell time and thereby improving the overall system return. However, the growth patterns differ markedly across algorithms. The Reward of CPC increases with infrastructure scale but consistently remains at the lowest level, suggesting that a rule-based constant-power charging strategy cannot sufficiently exploit the benefits of expanded charging resources. BinAlg outperforms CPC under all four scales, indicating that state compression and discretization mechanisms can improve resource scheduling to some extent, although its economic performance still falls short of learning-based approaches. LSAR achieves the highest Reward under all four scales, followed by SMPD, and both methods significantly outperform CPC and BinAlg.
Figure 6. Comparison of reward and runtime of different algorithms under different infrastructure scales: (a) comparison of Reward under different infrastructure scales; (b) comparison of Runtime under different infrastructure scales.
A further comparison shows that the Reward of SMPD reaches 644.5, 675.8, 712.4, and 741.2 under $M = 100$, $M = 600$, $M = 1000$, and $M = 2000$ chargers, respectively, corresponding to improvements of 4.9%, 4.2%, 4.2%, and 4.5% over CPC, and 2.0%, 2.1%, 2.0%, and 2.0% over BinAlg, respectively. These results indicate that, compared with rule-based strategies and state-compression-based methods, the proposed two-stage coordinated framework exhibits a relatively stable advantage in reward performance. At the same time, the stronger economic performance of LSAR under the current experimental setting suggests that a linearly approximated state-action modeling approach still retains considerable competitiveness in this scenario.
From the perspective of computational efficiency, the runtime of LSAR lies between those of CPC and SMPD. Notably, LSAR achieves the highest Reward while maintaining a relatively controllable computational cost. By contrast, the runtime of SMPD reaches 6.66 s, 7.28 s, 7.91 s, and 8.92 s under the four infrastructure scales, respectively, which is the highest among all compared methods. This is mainly because the proposed framework must sequentially execute the Stage I GRU-based service matching module and the Stage II continuous power allocation module, while also performing rolling state updates during scheduling, resulting in a longer computational pipeline.
Nevertheless, it should be emphasized that even in the largest-scale scenario, i.e., $M = 2000$, the single-decision runtime of SMPD remains below 10 s, which is still far smaller than the 15 min scheduling interval considered in this study. This indicates that although the proposed method incurs a higher computational overhead than the baseline algorithms, it still satisfies the basic real-time requirement of online scheduling. From an engineering perspective, SMPD therefore retains acceptable real-time solvability while preserving strong optimization capability.
Considering both Reward and Runtime, different algorithms exhibit different trade-offs between economic performance and computational efficiency. CPC achieves the lowest runtime but also the lowest reward. LSAR obtains the highest reward in the current experimental setting, while its runtime remains lower than that of SMPD, thus exhibiting a favorable balance between return and computational cost. Although SMPD does not outperform LSAR in reward, it consistently remains superior to CPC and BinAlg across all infrastructure scales and completes decision-making within an acceptable time range. This suggests that the proposed method maintains a clear advantage in reward growth over rule-based and state-compression-based methods under large-scale deployment, while also demonstrating good scalability. However, there is still room for further improvement in terms of both computational efficiency and economic performance.
4.6. Robustness Analysis Under Random Early-Departure Disturbances
In practical operation, the departure behavior of EV users is usually highly uncertain, and some vehicles may leave the charging station earlier than expected due to temporary itinerary changes. It is therefore necessary to further evaluate the adaptive scheduling capability and operational robustness of the proposed SMPD framework under dynamic disturbances. To this end, a random early-departure mechanism is introduced on top of the standard experimental setting. Four early-departure ratios, namely 5%, 10%, 15%, and 20%, are considered to investigate the reward degradation of different algorithms under different disturbance intensities.
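The exact sampling procedure for early departures is not detailed here; the following minimal sketch shows one plausible implementation, in which each vehicle independently leaves early with probability equal to the disturbance ratio and its departure step is resampled uniformly between arrival and its scheduled departure. The type, function, and field names (`EV`, `inject_early_departures`, `departure_step`) are illustrative assumptions, not the paper's actual setup.

```python
import random
from dataclasses import dataclass

@dataclass
class EV:
    # Illustrative EV record; field names are assumptions, not the paper's.
    ev_id: int
    arrival_step: int      # arrival time, in 15-min scheduling steps
    departure_step: int    # scheduled departure step

def inject_early_departures(evs, ratio, rng=None):
    """One plausible reading of the early-departure disturbance: each EV
    independently leaves early with probability `ratio`, at a step drawn
    uniformly between arrival and its scheduled departure."""
    rng = rng or random.Random(0)
    for ev in evs:
        if rng.random() < ratio and ev.departure_step - ev.arrival_step > 1:
            ev.departure_step = rng.randint(ev.arrival_step + 1,
                                            ev.departure_step - 1)
    return evs
```

Re-running each scheduler on episodes perturbed in this way at ratios of 0.05 to 0.20 would reproduce the kind of stress test reported in Table 3.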
Table 3 and Figure 7 present the reward degradation results of BinAlg, LSAR, and SMPD under different early-departure disturbance levels. It can be observed that as the early-departure ratio gradually increases, the reward degradation of all three algorithms also rises. This indicates that random early departures disrupt both the service access relationship and the rhythm of power allocation, thereby weakening the overall scheduling performance of the charging station. However, under all disturbance scenarios, SMPD consistently exhibits the smallest reward degradation, demonstrating that the proposed method has stronger robustness and stability under dynamic environmental changes.
More specifically, under a 5% early-departure disturbance, the reward degradation of SMPD is only 1.46%, which is lower than 2.08% for LSAR and 2.43% for BinAlg. When the disturbance intensity increases to 10%, the reward degradation of SMPD becomes 3.48%, which still remains significantly smaller than those of the two baseline methods. Under the 15% disturbance setting, the reward degradation of SMPD further increases to 5.80%, yet it still stays below the levels of the competing algorithms. In the strongest disturbance scenario, namely the 20% early-departure ratio, the reward degradation of SMPD reaches 8.12%, which is still 2.29 and 3.85 percentage points lower than those of LSAR and BinAlg, respectively. These results indicate that although random early departures impose a significant disturbance on real-time charging-station scheduling, the proposed method can more effectively mitigate the negative impact of such dynamic disturbances on economic performance.
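As a quick arithmetic check, the percentage-point gaps quoted for the 20% scenario follow directly from the degradation values in Table 3:

```python
# Reward degradation at the 20% early-departure ratio (from Table 3), in %.
binalg, lsar, smpd = 11.97, 10.41, 8.12
print(f"SMPD vs LSAR:   {lsar - smpd:.2f} pp")    # 2.29 pp
print(f"SMPD vs BinAlg: {binalg - smpd:.2f} pp")  # 3.85 pp
```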
Mechanistically, the robustness of SMPD mainly originates from its two-stage rolling decision-making mechanism. On the one hand, the Stage I service matching module can dynamically reconstruct the EV–charger access relationship according to the latest service state, thereby promptly releasing reusable charging resources after some vehicles leave earlier than expected. On the other hand, the Stage II SAC-based continuous power control module can further reallocate charging power among the remaining vehicles based on the updated access relationship, so as to better prioritize those with shorter remaining dwell time. It is precisely this closed-loop feedback mechanism of “service access–power redistribution–rolling update” that enables the proposed method to maintain high reward levels and stable operational performance even under dynamic disturbance scenarios.
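To make this closed-loop mechanism concrete, the sketch below outlines the rolling two-stage decision cycle in Python-style pseudocode. It is a minimal sketch under assumed interfaces: `station.observe`, `matching_net.match`, `sac_policy.act`, and the state attributes are hypothetical names, not the paper's actual API.

```python
def rolling_schedule(station, matching_net, sac_policy, horizon):
    """Sketch of the 'service access - power redistribution - rolling
    update' loop; every interface name here is an illustrative assumption."""
    for step in range(horizon):                # one 15-min scheduling step
        state = station.observe(step)          # reflects any early departures
        # Stage I: rebuild EV-charger access relations on the latest state,
        # so chargers vacated by early leavers are released immediately.
        assignment = matching_net.match(state.waiting_evs, state.idle_chargers)
        # Stage II: SAC-based continuous control reallocates charging power
        # among connected EVs under charger- and station-level limits,
        # favoring vehicles with short remaining dwell time.
        power = sac_policy.act(state.connected_evs, assignment,
                               state.power_limits)
        station.apply(assignment, power)       # rolling state update
```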

5. Conclusions

This paper investigated the real-time scheduling problem of large-scale EV charging stations under stochastic arrivals, dynamic departures, and limited charging resources, and proposed a two-stage collaborative optimization framework termed SMPD. By decomposing the highly coupled unified scheduling task into two subproblems, namely supervised service matching and reinforcement learning-based power dispatch, the proposed framework enables structured coordination between efficient access decision-making and continuous charging power control. The results demonstrate that the proposed method can effectively reduce waiting time, improve demand satisfaction and charger utilization, and enhance the overall economic performance of charging stations while maintaining real-time feasibility. Moreover, under different infrastructure scales and random disturbance scenarios, SMPD continues to exhibit strong scalability and operational robustness. Overall, the two-stage rolling scheduling mechanism developed in this study provides an effective modeling and solution paradigm for real-time intelligent scheduling of large-scale EV charging stations, and also offers a useful methodological reference for the efficient operation of charging infrastructure in complex dynamic environments.

Author Contributions

Conceptualization, B.W.; Methodology, Y.Y.; Software, J.G.; Validation, B.W. and D.L.; Investigation, B.W. and J.G.; Resources, J.G. and Y.Y.; Data curation, B.W. and Y.Y.; Writing—original draft, B.W.; Writing—review & editing, B.W., Y.Y. and D.L.; Supervision, J.G. and B.W.; Project administration, D.L. B.W., Y.Y. and J.G. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Guo, Z.; Zhang, S.; Wang, J.; Huang, X.; Hu, R.; Xu, W. Modeling and Analysis of UAV Charging Scheduling in Fixed/Mobile Charging Station Systems. 2024 14th International Conference on Information Science and Technology (ICIST), Chengdu, China, 2024; pp. 894–899.
2. Zhuang, W.; Zhang, H.; Wang, R.; Chen, Z. Optimal Scheduling Strategy of Urban Charging Station with PV and Storage Based on FWA. 2020 International Conference on Electrical Engineering and Control Technologies (CEECT), Melbourne, VIC, Australia, 2020; pp. 1–5.
3. Zhang, C. Charging Schedule Optimization of Electric Bus Charging Station Considering Departure Timetable. 8th Renewable Power Generation Conference (RPG 2019), Shanghai, China, 2019; pp. 1–5.
4. Putri, S.M.; Ashari, M.; Endroyono; Suryoatmojo, H. EV Charging Scheduling with Genetic Algorithm as Intermittent PV Mitigation in Centralized Residential Charging Stations. 2023 International Seminar on Intelligent Technology and Its Applications (ISITIA), Surabaya, Indonesia, 2023; pp. 286–291.
5. Zhang, Y.; Wu, C.; Lu, C. Risk-Limiting Multi-Station EV Charging Scheduling with Imperfect Prediction. 2022 7th IEEE Workshop on the Electronic Grid (eGRID), Auckland, New Zealand, 2022; pp. 1–5.
6. Saner, C.B.; Saha, J.; Sharma, A.; Srinivasan, D. Heuristic Methods for EV Charging Scheduling in Fast Charging Stations with Modular Architecture: A Comparative Analysis. 2023 IEEE PES 15th Asia-Pacific Power and Energy Engineering Conference (APPEEC), Chiang Mai, Thailand, 2023; pp. 1–6.
7. Alirezazadeh, A.; Disfani, V. Profit-Maximizing Scheduling of Mobile Charging Stations Using Deep Reinforcement Learning: A Case Study in Chattanooga. 2025 IEEE Power & Energy Society General Meeting (PESGM), Austin, TX, USA, 2025; pp. 1–5.
8. Qarebagh, A.J.; Sabahi, F.; Nazarpour, D. Optimized Scheduling for Solving Position Allocation Problem in Electric Vehicle Charging Stations. 2019 27th Iranian Conference on Electrical Engineering (ICEE), Yazd, Iran, 2019; pp. 593–597.
9. Yang, Z.; Wang, S.; Deng, J.; Peng, X.; Jiang, N.; Li, Y. Intelligent Energy Management Method of EV Charging Station with Photovoltaic and Energy Storage System Considering EV Charging Scheduling. 2024 8th International Conference on Power Energy Systems and Applications (ICoPESA), Hong Kong, Hong Kong, 2024; pp. 488–494.
10. Qureshi, U.; Ghosh, A.; Panigrahi, B.K. Dynamic Routing and Scheduling of Mobile Charging Stations for Electric Vehicles Using Deep Reinforcement Learning. 2024 IEEE Power & Energy Society General Meeting (PESGM), Seattle, WA, USA, 2024; pp. 1–5.
11. Wang, R.; Huang, Q.; Chen, Z.; Xing, Q.; Zhang, Z. An Optimal Scheduling Strategy for Photovoltaic-Storage-Charging Integrated Charging Stations. 2020 12th IEEE PES Asia-Pacific Power and Energy Engineering Conference (APPEEC), Nanjing, China, 2020; pp. 1–5.
12. Zhang, Y.; Yang, Y.; Zhang, X.; Jiao, F. Research Progress on Scheduling Algorithms for Mobile Charging Station. 2025 9th International Conference on Power Energy Systems and Applications (ICoPESA), Nanjing, China, 2025; pp. 634–640.
13. Zhou, J.; Zhou, Y.; Zhang, K.; Liu, S.; Shi, S.; Li, H. Scheduling Optimization Method for Charging Piles in Electric Vehicle Charging Stations Based on Mixed Integer Linear Programming. 2023 IEEE 7th Conference on Energy Internet and Energy System Integration (EI2), Hangzhou, China, 2023; pp. 220–224.
14. Lu, X.; Liu, N.; Chen, Q.; Zhang, J. Multi-Objective Optimal Scheduling of a DC Micro-Grid Consisted of PV System and EV Charging Station. 2014 IEEE Innovative Smart Grid Technologies - Asia (ISGT ASIA), Kuala Lumpur, Malaysia, 2014; pp. 487–491.
15. Chi, Y.; Sun, J.; Ma, K.; Hong, Y.; Wang, S. Real Time Scheduling and Intelligent Prediction Algorithm for Electric Vehicle Charging Station Based on Big Data Analysis. 2024 International Conference on Electrical Drives, Power Electronics & Engineering (EDPEE), Athens, Greece, 2024; pp. 324–328.
16. Fang; Xiang, S.; Zhang, X. Optimal Scheduling Strategy of Distributed Electric Bus Charging Stations. 2024 IEEE 7th International Electrical and Energy Conference (CIEEC), Harbin, China, 2024; pp. 4371–4376.
17. Saha, M.; Thakur, S.S.; Bhattacharya, A. Optimal Scheduling Strategy for Swapped Battery Station of Electric Vehicle by Human Felicity Algorithm. 2022 IEEE 6th International Conference on Condition Assessment Techniques in Electrical Systems (CATCON), Durgapur, India, 2022; pp. 427–431.
Figure 1. Schematic illustration of the operating scenario of a large-scale EV charging station under the two-stage coordinated scheduling paradigm.
Figure 3. Overall flowchart of the proposed SMPD framework for two-stage rolling coupled decision-making and system state updating.
Figure 4. Learning performance evolution curves of the proposed SMPD framework during training: (a) cumulative reward; (b) total revenue; (c) penalty term.
Figure 7. Comparison of reward degradation of different algorithms under random early-departure disturbances.
Table 2. Comparison of average reward and runtime under different infrastructure scales.

Method    |N|=100              |N|=600              |N|=1000             |N|=2000
          Reward  Runtime(s)   Reward  Runtime(s)   Reward  Runtime(s)   Reward  Runtime(s)
CPC       614.2   4.18         648.4   4.34         683.7   4.67         709.5   5.61
BinAlg    631.8   5.42         662.1   5.79         698.6   6.56         726.8   7.64
SMPD      644.5   6.66         675.8   7.28         712.4   7.91         741.2   8.92
LSAR      701.3   5.51         719.6   5.94         761.5   7.17         789.7   8.36
Table 3. Comparison of reward degradation under different early-departure ratios.

Departure ratio   BinAlg     LSAR       SMPD
5%                -2.43%     -2.08%     -1.46%
10%               -6.11%     -5.02%     -3.48%
15%               -9.04%     -7.72%     -5.80%
20%               -11.97%    -10.41%    -8.12%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.