Reinforcement Learning for Electric Vehicle Charging using Dueling Neural Networks

We consider the problem of coordinating the charging of an entire fleet of electric vehicles (EVs), using a model-free approach, i.e., purely data-driven reinforcement learning (RL). The objective of the RL-based control is to optimize charging actions, while fulfilling all EV charging constraints (e.g., timely completion of the charging). In particular, we focus on batch-mode learning and adopt fitted Q-iteration (FQI). A core component in FQI is approximating the Q-function using a regression technique, from which the policy is derived. Recently, a dueling neural networks architecture was proposed and shown to lead to better policy evaluation in the presence of many similar-valued actions, as applied in a computer game context. The main research contributions of the current paper are that (i) we develop a dueling neural networks approach for the setting of joint coordination of an entire EV fleet, and (ii) we evaluate its performance and compare it to an all-knowing benchmark and an FQI approach using the EXTRA trees regression technique, a popular approach currently discussed in EV-related works. We present a case study where RL agents are trained with an ε-greedy approach for different objectives, i.e., (a) cost minimization, and (b) maximization of self-consumption of local renewable energy sources. Our results indicate that RL agents achieve significant cost reductions (70–80%) compared to a business-as-usual scenario without smart charging. Comparing the dueling neural networks regression to EXTRA trees indicates that for our case study's EV fleet parameters and training scenario, the EXTRA trees-based agents achieve higher performance in terms of both lower costs (or higher self-consumption) and stronger robustness, i.e., less variation among trained agents. This suggests that adopting dueling neural networks in this EV setting is not particularly beneficial, as opposed to the Atari game context from where this idea originated.

Over the last decade, there has been an unprecedented increase in the usage of EVs, and this trend is expected to continue over the coming decade [1]. EVs are a key element in the energy transition process, providing new opportunities as flexible load assets: they tend to be parked and connected to a charging station longer than needed to complete the charging [2]. As discussed in [3], the flexibility of EVs can be used to provide services to different stakeholders in smart grids, such as ancillary services for local grid management, cost benefits for individual EV owners, and improving local renewable energy consumption. In this paper we focus on an operator responsible for the charging of a fleet of parked EVs. The operator ensures that each EV is charged to a user-defined energy level and often does so by charging each EV as soon as it arrives. We will consider two main scenarios for the operator to deviate from this strategy: (i) minimize the total cost of charging, and (ii) maximize the consumption of locally generated renewable energy (PV). The charging strategies for both these scenarios are constrained by the need to complete the charging of each EV before its departure.

As a model-free, purely data-driven technique, reinforcement learning is excellently suited for the EV fleet charging problem, where dynamics are often dependent on the uncertain arrival times, departure times and energy requirements of individual EVs. As discussed further in Section 1.1, several RL-based approaches have been researched previously for DR-related control problems, and each approach has its own merit. Given the specifics of this problem (a continuous state representation and a set of discrete actions), we focus on the batch reinforcement learning technique of fitted Q-iteration (FQI) [7]. As discussed in [8], FQI exploits data more efficiently than other RL algorithms such as policy optimization or deep Q-networks. It relies purely on past transitions between the agent and the environment, and trains a regression model to approximate the Q-function and derive optimum control decisions. With this, we aim to determine optimum charging actions for the fleet of EVs without any explicit knowledge about individual EV parameters such as arrival times, departure times or energy requirements. Several regression techniques have been studied previously in the RL literature, including tree-based methods and deep learning based approaches. Amongst the latter, a novel technique of dueling neural networks has been shown to significantly improve the performance of RL agents trained to play Atari games [9]. Based on this, we aim to investigate the impact of dueling neural networks on the performance of fitted Q-iteration in the above mentioned EV charging problem.

The main contributions of this paper can be summarized as follows: (i) we develop a dueling neural networks approach for the setting of joint coordination of an entire EV fleet, and (ii) we evaluate its performance and compare it to an all-knowing benchmark and an FQI approach using the EXTRA trees regression technique.

Several works have used reinforcement learning techniques in the context of demand response and electric vehicle charging [3]. Here, we provide an overview of work related to different RL techniques and their application to EV charging. The Q-learning algorithm has been extensively researched for its application in demand response (DR) [10,11]. However, the algorithm requires discrete state and action spaces, generally represented as a Q-table and updated after every transition. This is a major drawback of the algorithm, leading to inefficient use of data and a long time to convergence.
Following recent advances in computational capacities and deep learning techniques, the current state of the art focuses on approximating the Q-function using functional approximators (regression models), including deep neural networks. These approaches can be classified into two main categories: (i) policy optimization and (ii) value iteration. For (i), the work presented in [12] uses a constrained policy optimisation algorithm to train an RL agent for an EV charging/discharging problem. This is a policy-search-based approach in which a deep neural network is trained to take actions that optimise the charging objective, and it is compared against other commonly used methods such as deep Q-networks and actor-critic approaches.

In contrast, we present a method of coordinating the charging of multiple EVs with significantly less training data (~100 days).

For (ii), a value iteration approach based on the DQN algorithm [13] has been studied in [14]; the reported approach achieves results approximately equal to a stochastic-programming-based benchmark.

Similar to this, we propose a dueling neural networks architecture instead of the EXTRA trees regression technique to train the agent and obtain optimum control policies.

The articles mentioned above, specifically [12,15], consider related EV charging problems. Several modelling approaches have been proposed for EVs and EV fleets: a three-step modelling approach using a priority-based dispatch [18]; a binning algorithm for a 2D grid representation of the aggregate state [19]; and a constrained MDP formulation of EV charging/discharging [12]. We will adapt the three-step modelling approach presented in [18] to model an EV fleet.

The main aim of this paper is to investigate dueling neural networks as a regression technique in FQI and assess its performance in obtaining optimal control decisions in the RL setting. Based on the literature, the EXTRA trees regression technique is a popular choice for such problems, i.e., FQI for DR scenarios [20]. It has been shown to outperform alternative approaches such as multi-layer perceptrons, XGBoost, bags of ELMs (extreme learning machines), etc. in the context of residential heating loads for DR [20].

Hence, we will benchmark dueling neural networks against the EXTRA trees regression technique.

A reinforcement learning approach relates to the process of an agent learning a control policy h through observed interactions with the environment to be controlled. The decision-making process of an RL agent is formulated using a Markov Decision Process (MDP) [21], as shown in Fig. 1. In this paper, we consider discrete MDPs, where the agent interacts with the environment in discrete time steps t. An MDP is defined by its state space X, action space U, and transition function f : X × U × W → X (Eq. (1)). The expression in Eq. (1) represents the dynamics of the environment from a state x_t ∈ X to x_{t+1}, under action u_t ∈ U and a random process w_t ∈ W with a probability distribution p_w(·, x_t). Each state transition is accompanied by an immediate reward signal r_t, defined by a reward function ρ : X × U × W → R (Eq. (2)).

The goal of the RL agent is to find a policy h : X → U, u = h(x), that minimizes the total reward during a finite time horizon T, starting from an initial state x_1. This T-time-horizon reward is denoted by J^h(x_1) and is given by Eq. (3). J^h can be expressed in a recursive and convenient way by using the state-action value function, or Q-function [6], defined in Eq. (4). The Q-function defined in Eq. (4) gives the expected cumulative reward over the time horizon T and hence can be used to characterize a policy h. The optimal Q-function corresponds to the best Q-function that can be obtained over all policies h and is given by Eq. (5). The policy that gives the optimal Q-function for all state-action pairs is termed an optimal policy and can be calculated as a greedy policy (Eq. (6)). Further, the Bellman optimality equation [21] can be used to obtain the optimal Q-function, as shown in Eq. (7).
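For concreteness, the quantities referenced in Eqs. (1)–(7) can be written in the standard finite-horizon form used in fitted Q-iteration [7]; the restatement below follows the usual conventions (with minima rather than maxima, since the reward here is a cost to be minimized) and may differ slightly from the paper's exact notation:

```latex
\begin{align}
x_{t+1} &= f(x_t, u_t, w_t), \quad w_t \sim p_w(\cdot, x_t) && \text{(dynamics, Eq. (1))}\\
r_t &= \rho(x_t, u_t, w_t) && \text{(reward, Eq. (2))}\\
J^h(x_1) &= \mathbb{E}\Big[\sum_{t=1}^{T} \rho\big(x_t, h(x_t), w_t\big)\Big] && \text{(T-horizon return, Eq. (3))}\\
Q^h(x, u) &= \mathbb{E}\big[\rho(x, u, w) + J^h\big(f(x, u, w)\big)\big] && \text{(Q-function, Eq. (4))}\\
Q^*(x, u) &= \min_h Q^h(x, u) && \text{(optimal Q-function, Eq. (5))}\\
h^*(x) &\in \arg\min_{u \in U} Q^*(x, u) && \text{(greedy/optimal policy, Eq. (6))}\\
Q^*(x, u) &= \mathbb{E}\Big[\rho(x, u, w) + \min_{u' \in U} Q^*\big(f(x, u, w), u'\big)\Big] && \text{(Bellman optimality, Eq. (7))}
\end{align}
```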

The state space X represents the set of all possible states of the environment, which in this case is the fleet of EVs to be charged. The charging of each EV i is defined by a set of parameters: arrival time (t^arv_i), departure time (t^dep_i), energy required for a full charge, and maximum charging power (P^max_i). This set of parameters, Ω_{i,t}, constitutes the internal state of EV i and completely defines its behavior. With prior knowledge of Ω_{i,t}, EV i can be controlled in a deterministic manner for any time t. Hence, the internal state of the fleet (x^int_t ∈ X) is defined in Eq. (8), where N^con_t represents the number of EVs connected at time t. Using this definition, the minimum and maximum energy levels E^min_{i,t} and E^max_{i,t} for each EV i at a given time t are defined in Eq. (9). Further, the state of energy of EV i at time t is represented by x^phys_{i,t} and satisfies the constraints given in Eq. (10), i.e., E^min_{i,t} ≤ x^phys_{i,t} ≤ E^max_{i,t}.
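A minimal sketch of how these bounds can be computed, assuming the usual deadline-constrained form (an EV can never hold more energy than the maximum rate allows since arrival, and never less than what still permits reaching its required energy before departure); the `EV` container, field names and `dt` slot length are illustrative, not the authors' code, and the paper's exact Eq. (9) may differ:

```python
from dataclasses import dataclass

@dataclass
class EV:
    t_arv: int      # arrival time slot
    t_dep: int      # departure time slot
    e_req: float    # energy required for a full charge [kWh]
    p_max: float    # maximum charging power [kW]

def energy_bounds(ev: EV, t: int, dt: float = 1.0):
    """Return (e_min, e_max) for EV `ev` at time slot `t` (slot length `dt` hours).

    e_max: at most what charging at p_max since arrival could have delivered,
           capped by the requested energy.
    e_min: at least enough so that charging at p_max until departure can still
           deliver e_req in time.
    """
    e_max = min(ev.e_req, ev.p_max * dt * max(0, t - ev.t_arv))
    e_min = max(0.0, ev.e_req - ev.p_max * dt * max(0, ev.t_dep - t))
    return e_min, e_max

# The state of energy x_phys must satisfy e_min <= x_phys <= e_max (Eq. (10)).
```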

The action space U corresponds to the set of actions that the agent can take to influence the state of the environment. An action thus corresponds to the charging power that is given to each EV i at time t and is assumed to take discrete values, 0 or P^max_i. The state of energy of each EV is bounded by Eq. (10), and hence each EV can overrule an action taken by the agent to satisfy these constraints. This is modelled using the function B, which maps the action u_{i,t} determined by the agent for EV i to a physical control action. Here, E^lim_{i,t} consists of the minimum and maximum energy boundaries as defined in Eq. (9), and B(·) is defined in Eq. (13), as shown in Section 2.3.
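A sketch of such a backup function B, assuming it simply curtails the requested charging power so that the energy at the end of the slot stays inside the limits E^lim of Eq. (9); the exact form used in the paper may differ, and the function name and arguments are ours:

```python
def backup(u_req: float, x_phys: float, e_min_next: float, e_max_next: float,
           p_max: float, dt: float = 1.0) -> float:
    """Backup function B: map the agent's requested charging power u_req for one
    EV to a feasible physical power, so that the energy at the end of the slot
    stays within [e_min_next, e_max_next]."""
    p_floor = max(0.0, (e_min_next - x_phys) / dt)   # minimum power to stay feasible
    p_ceil = max(0.0, (e_max_next - x_phys) / dt)    # power that would hit the upper bound
    # Overrule the agent if necessary and respect the charger rating.
    return min(p_max, max(p_floor, min(u_req, p_ceil)))
```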

The transition function f describes the transition of the environment from state x_t to x_{t+1} due to an action u_{i,t} under an uncertainty w_t. For this problem, the uncertainty w_t represents the uncertainty in arrival and departure times of each EV and is modelled implicitly in this formulation. Further, the transition function defines the charging process of each EV. Using a linear model and the constraints defined in Eq. (10), the new energy state x^phys_{i,t+1} of an EV i is modelled as in Eq. (12), which represents the transition over the duration of a time slot. The actual power u^phys_{i,t} used by EV i at time t depends on the new state of energy and is given by Eq. (13).

The reward function ρ is modelled following the objective of the agent. In this paper, two different agent objectives are studied. Consequently, two different reward functions are used, described as follows:

1. Minimizing cost: The objective of the agent is to minimize the cost of energy consumed during the charging process. To achieve this, the reward signal is based on the day-ahead price λ_t and corresponds to the cost of the energy consumed by the N^con_t connected EVs in each time slot.

2. Maximizing self-consumption: The objective of the agent is to maximize self-consumption of locally generated solar energy. To achieve this, the hourly price signal λ_t and the solar generation P^PV_t are considered, and the reward is calculated as the cost of energy consumed or delivered by the N^con_t connected EVs in each time slot.
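An illustrative implementation of the two reward signals, assuming the cost reward is simply the day-ahead price λ_t times the energy actually consumed in the slot, and the self-consumption reward prices the net load remaining after the local PV generation P^PV_t is consumed first; the paper's exact expressions may differ, and the function names are ours:

```python
def reward_cost(u_phys, price, dt=1.0):
    """Objective 1: cost of the energy charged by all connected EVs in one slot.

    u_phys: list of physical charging powers [kW], one per connected EV.
    price:  day-ahead price lambda_t [EUR/kWh]."""
    total_power = sum(u_phys)           # kW, aggregated over the N_con connected EVs
    return price * total_power * dt     # price [EUR/kWh] * energy [kWh]

def reward_self_consumption(u_phys, price, p_pv, dt=1.0):
    """Objective 2: cost of the energy drawn from (or injected into) the grid,
    after the local PV generation p_pv [kW] has been consumed first."""
    net_power = sum(u_phys) - p_pv      # positive: grid import, negative: export
    return price * net_power * dt
```

Since the agent minimizes the total reward, a cost-shaped reward of this kind directly encodes both objectives.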

In this section, the implementation of our RL agent is discussed. We adopt a three-step modelling and dispatch approach previously presented in [18]. In step 1, state information of each EV is collected and represented as an aggregate.
Step 2 involves using this aggregate state information to calculate the charging power for the entire fleet, followed by step 3, where this power is distributed between individual EVs following a priority-based dispatch. This modelling approach provides a high-fidelity model that is linear, scalable, and sufficiently complex to train and test the agent.

The internal state defined in Eq. (8) gives a raw representation of the state of each EV in the fleet. This representation is high-dimensional, and its dimensions vary with time depending on the number of EVs present. To avoid using this representation, alternative features were engineered. To capture time dependence, a time component x^time_t corresponding to the hour of the day was used. Further, an aggregate state of energy of the fleet, x^agg_t, was used to express a time-dependent, controllable state component. With these aggregate features, the agent only partially observes the environment. This partial observability is addressed by using an approximate state [6], defined in Eq. (17), in which k represents the number of past observations included in the approximate state and is referred to as "depth". This is a hyperparameter that determines the quality of the state information being passed on to the agent. The depth will vary for different information settings; following a sensitivity analysis, for this problem we use depth = 2. Similar to the aggregate state, an aggregate agent action is obtained by aggregating over all connected EVs. Consequently, since we consider a charging rate of 3 kW and a maximum of 100 EVs, the action space U is modified to a set of discrete levels, where all values represent aggregate charging power in kW.

Based on the work presented in [7], FQI uses a regression model and a batch of previous interactions between the agent and the environment to estimate an approximate optimal Q-function, Q*, over all state-action pairs. While different regression models can be used, this paper focuses on the application of dueling neural networks. With a finite time horizon, it is hypothesised that the dueling neural networks architecture can be effective in representing the Q-function by capturing the variations caused by the value function as well as the action advantages. Further, we will compare the dueling neural networks regression to a more conventional approach using EXTRA trees [22], [16]. As described in Algorithm 1, FQI uses a set F consisting of tuples (x_l, u_l, x'_l, r_l). Here, x_l is the approximate state of the environment as defined in Eq. (17) and u_l represents the action taken during this instance. Consequently, x'_l denotes the state reached after the transition and r_l is the reward obtained. Deviating from the algorithm presented in [7], we use an ensemble of Q-functions, each corresponding to a time slot. For the final time slot, the Q-function reduces to the immediate reward, and the Q-functions of earlier time slots are trained backwards from it. This approach ensures better convergence by avoiding the problem of training regression models on non-stationary Q-function targets.

Algorithm 1: Fitted Q-iteration with one Q-function per time slot
1: Input: batch of transitions F = {(x_l, u_l, x'_l, r_l)}, time horizon T
2: for each time slot k, proceeding backwards in time, do
3:   Build training set T = {(i_l, o_l), l = 1, 2, ..., #F}:
4:     i_l = (x_l, u_l)
5:     o_l = r_l plus the best value of the next time slot's Q-function in x'_l
6:   Use regression algorithm on T to obtain Q_k
7: end for
8: Output: Q-functions Q_k for all time slots

The batch of past transitions F is built during the training process following an ε-greedy exploration technique [6]. In this exploration technique, the agent selects an action u_k according to Eq. (20), based on an exploration probability ε. During the training, the exploration probability ε is initialized to 1 and reduced over the training period to obtain an optimal policy. The value of ε is a hyperparameter and is used to strike a balance between exploration and exploitation by the agent.
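A condensed Python sketch of this training loop, with one Q-model per time slot and ε-greedy action selection while building the batch; the regressor is shown as scikit-learn's ExtraTreesRegressor (the dueling network of the next section can be substituted), and all names (`fitted_q_iteration`, the per-slot indexing of `F`, etc.) are illustrative rather than the authors' implementation:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(F, horizon, actions, n_trees=100):
    """Finite-horizon FQI with one Q-model per time slot, trained backwards.

    F[k] holds the transitions (x, u, x_next, r) observed in time slot k,
    with x the approximate state of Eq. (17). Returns models Q[1..horizon]."""
    Q = [None] * (horizon + 2)                      # Q[k] approximates slot k
    for k in range(horizon, 0, -1):
        X_train, y_train = [], []
        for (x, u, x_next, r) in F[k]:
            target = r
            if Q[k + 1] is not None:
                # Rewards are costs, so the best next-slot value is the minimum.
                q_next = [Q[k + 1].predict([list(x_next) + [a]])[0] for a in actions]
                target += min(q_next)
            X_train.append(list(x) + [u])
            y_train.append(target)
        Q[k] = ExtraTreesRegressor(n_estimators=n_trees).fit(X_train, y_train)
    return Q

def eps_greedy_action(Q_k, x, actions, eps, rng=np.random.default_rng()):
    """Epsilon-greedy action selection used while building the batch F."""
    if rng.random() < eps:
        return rng.choice(actions)
    q_vals = [Q_k.predict([list(x) + [a]])[0] for a in actions]
    return actions[int(np.argmin(q_vals))]
```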
The training process starts with an empty set F; tuples (x_l, u_l, x'_l, r_l) are collected with each interaction between the agent and the environment, following the ε-greedy exploration technique mentioned previously.

In the dueling architecture, the Q-function is represented as the sum of a state value function and an action advantage function, each estimated by a separate stream of the network. With this representation, dueling neural networks can effectively capture the variations in the Q-function that are caused by immediate dispatch actions as well as by long-term charging state values. In [9], it has been shown that agents using dueling neural networks architectures perform better than vanilla deep neural networks in the Atari game environments. Hence, we investigate whether agents trained using a dueling neural networks architecture would also be more effective for our EV fleet problem. Figure 2 shows the schematic representation of a dueling neural networks architecture. As discussed previously, the advantage function measures the relative importance of taking an action u_t in a particular state x_t and is defined in Eq. (21), where V^h represents the value function following policy h.
For the optimal case, if h* represents the optimal policy and u* represents the best action in state x, then the advantage and value functions are given by Eq. (22).
where Q*, V* and A* are the optimal Q-function, value function and advantage function, respectively. From Eq. (21), it is clear that, given the value of the Q-function for a state-action pair, it is not possible to uniquely recover the value function and the advantage function values. This issue is addressed by forcing the advantage function estimator to zero for the best action, based on Eq. (22), as discussed in [9]. This aggregation is represented as "∑" in Fig. 2 and gives the Q-function estimate for the given input state and all possible actions in U. As shown in Fig. 2, the input module, value function stream and advantage function stream are represented by learnable parameters θ, α and β, respectively. The value function and advantage function approximations are expressed in parametric form as V(x; θ, α) and A(x, u; θ, β), respectively. The parameterized Q-function approximation Q(x, u; θ, α, β) is then obtained using Eq. (23).

In the third step, a collective charging action u_t is selected as a greedy policy (Eq. (6)) using the learned Q-function from Algorithm 1. The collective charging power u_t is then distributed between all EVs following a priority-based dispatch algorithm [18]. In this dispatch algorithm, for each time slot all EVs are assigned a corner priority dependent on their departure times, energy requirements and charging capacities. Following this, EVs with a higher corner priority are given preference for charging, while charging is delayed for EVs with lower corner priorities. This priority dispatch algorithm is modelled using Eqs. (24)-(27). The power demand of an individual EV i at time t is expressed as f_{i,t}(p), a function of the priority p (Eq. (24)), where p^c_{i,t} is the corner priority for EV i at time t (Eq. (25)). The corner priority is a heuristic that indicates the priority assigned to an EV for charging: the closer the corner priority is to 0, the lower the chances of the EV being dispatched (charged). At the fleet level, the power demand of all EVs is aggregated (Eq. (26)). The collective charging action u_t is then distributed between individual EVs following the priority dispatch technique, by first calculating an equilibrium priority (Eq. (27)). This equilibrium priority is passed on to all EVs, which then consume power according to their individual demand functions evaluated at this equilibrium priority.
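Returning to the regression model itself, the following is a minimal PyTorch sketch of the dueling architecture of Fig. 2: a shared input module (parameters θ), a value stream (α), an advantage stream (β), and the aggregation of Eq. (23) that forces the advantage of the best action to zero. Layer sizes, the state dimension and the number of discrete aggregate actions are illustrative assumptions, not the tuned values of Table 2:

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Dueling Q-network: shared trunk (theta), value stream (alpha),
    advantage stream (beta), combined as in Eq. (23)."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(                   # shared input module, theta
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)             # V(x; theta, alpha)
        self.advantage = nn.Linear(hidden, n_actions) # A(x, u; theta, beta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.trunk(x)
        v = self.value(h)                             # shape (batch, 1)
        a = self.advantage(h)                         # shape (batch, n_actions)
        # Aggregation of Eq. (23): force the advantage of the best action to zero.
        # [9] uses a max (reward maximization); with cost-shaped rewards the
        # corresponding operator would be a min.
        return v + a - a.max(dim=1, keepdim=True).values

# Assumed example: approximate state of Eq. (17) with depth 2 -> 4 features,
# and 101 aggregate power levels (0, 3, ..., 300 kW) as the discrete actions.
net = DuelingQNetwork(state_dim=4, n_actions=101)
q_values = net(torch.randn(8, 4))   # Q-estimates for a batch of 8 states
```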

EV Fleet Simulation Environment
The EV fleet model described in the previous section is adopted for a fleet of 100 EVs, with the assumed parameters for this case study. The simulations are performed using FQI with two regression models: dueling neural networks and EXTRA trees. The EXTRA trees implementation is based on the work presented in [17] and implemented using the sklearn package [24]. The parameters used for both models after hyperparameter tuning are given in Table 2.

The performance of each model needs to be compared with a business-as-usual case of charging EVs as soon as they arrive, and with a benchmark case where a controller takes optimum charging decisions based on perfect knowledge about the environment (i.e., the EV parameters Ω). Additionally, the two models are compared to each other to investigate which one yields the best policy. For this comparison, we define a score metric.
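A natural choice for such a three-way comparison, assumed here purely for illustration (the paper's exact definition may differ), normalizes an agent's daily cost between the business-as-usual cost and the perfect-knowledge benchmark cost:

```latex
\text{score} = \frac{C_{\text{BAU}} - C_{\text{agent}}}{C_{\text{BAU}} - C_{\text{benchmark}}}
```

Under this convention, a score of 0 corresponds to business-as-usual charging and a score of 1 to the all-knowing benchmark.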

The results for the experimental setup described in the previous section are presented below. First, the results for Minimizing Cost are presented; later, the results for Maximizing Self-Consumption are shown.

Both sets of results are focused on assessing the performance of the dueling neural networks architecture and comparing it with EXTRA trees.

1 In practice, the agent should be trained at the end of each day based on the forecasted price of the next day. To reduce the computational burden of the entire simulation, each agent was trained after every 10 days.
2 The simulations were carried out on a laptop with the following specifications: i7 10th Gen processor, 16 GB RAM.

Objective 1: Minimizing Cost
The agent objective is to minimize the cost of energy consumed during a day. As defined in Section 2.4, the agent is exposed to day-ahead prices and is expected to take optimal actions depending on the variation in prices over the day. The training period includes a total of 10 different price profiles. The performance of both regression models was tested on 3 validation days. Figure 4 shows the comparison for all 20 agents of both agent classes for the 3 validation days.

The results from Table 3 suggest that both dueling neural networks and EXTRA trees based agents are capable of computing control actions that lead to a significant reduction in daily operational cost (shown in Table 1). It can be observed that an average cost reduction of 60% can be achieved using the best performing agents, compared to the business-as-usual case.

Objective 2: Maximizing Self-Consumption

Here, the primary objective of the agent is to maximize self-consumption of locally generated solar (PV) energy. In Fig. 6, the agents' results are compared with the business-as-usual case (shown in Table 2). Figure 7 shows the state and action comparisons between the best performing agents for both regression techniques.

The all-knowing benchmark (the controller with perfect knowledge of the EV parameters Ω) is formulated as a mixed integer linear program (MILP), with the objective of minimizing the cost of energy during the charging process for each EV and then aggregating it for the entire fleet. The optimisation problem for an EV i is formulated as shown in Eq. (A29).
where N^con_t represents the number of EVs connected at time t. Eq. (A29) is solved using the differential evolution optimisation algorithm [26].
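A sketch of this per-EV benchmark, reusing the illustrative EV fields introduced earlier and SciPy's differential evolution solver mentioned above; the completion requirement is handled with a simple penalty term, so this is an illustrative stand-in for Eq. (A29) rather than the exact formulation:

```python
import numpy as np
from scipy.optimize import differential_evolution

def benchmark_cost(ev, prices, dt=1.0, penalty=1e3):
    """Minimum charging cost for one EV given perfect knowledge of day-ahead
    prices and of its arrival, departure and energy requirement."""
    n = ev.t_dep - ev.t_arv                         # number of connected time slots
    slot_prices = np.asarray(prices[ev.t_arv:ev.t_dep])

    def objective(u):                               # u: charging power per slot [kW]
        cost = float(np.dot(slot_prices, u)) * dt
        shortfall = max(0.0, ev.e_req - float(np.sum(u)) * dt)
        return cost + penalty * shortfall           # penalize incomplete charging

    result = differential_evolution(objective, bounds=[(0.0, ev.p_max)] * n, seed=0)
    return result.fun, result.x

# The fleet-level benchmark is the sum of the per-EV optima, as described above.
```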

The benchmark values, together with the rewards for the business-as-usual case and the best performing agents, are shown in Table 2. A further comparison with agents based on vanilla (non-dueling) neural networks is described in Section 5; Figure 8 shows the results of this comparison. Both Fig. 8 and Fig. 9 indicate that the vanilla neural network based agents are inferior in their performance. This is primarily due to inaccuracies in approximating the Q-function for the two objectives. Further, these results also indicate that, comparatively, dueling neural networks based agents are better at approximating this Q-function, owing to the explicit streams for value and advantage function estimation.