Energy Demand Response in a Food Processing Plant: A Deep Reinforcement Learning Approach

Submitted: 22 November 2024

Posted: 26 November 2024


Abstract

The food industry faces significant challenges in managing operational costs due to its high energy intensity and rising energy prices. Industrial food processing facilities, with substantial thermal capacities and large demands for cooling and heating, offer promising opportunities for demand response (DR) strategies. This study explores the application of deep reinforcement learning (RL) as an innovative, data-driven approach for DR in the food industry. By leveraging the adaptive, self-learning capabilities of RL, energy costs in the investigated plant are effectively decreased. The RL algorithm is compared with the well-established optimization method mixed integer linear programming (MILP), and both are benchmarked against a reference scenario without DR. The two optimization strategies demonstrate cost savings of 17.57% and 18.65% for RL and MILP, respectively. Although RL is slightly less efficient in cost reduction, it significantly outperforms MILP in computational speed, being approximately 20 times faster. During operation, RL needs only 2 ms per optimization, compared to 19 s for MILP, making it a promising optimization tool for edge computing. Moreover, while MILP's computation time increases considerably with the number of binary variables, RL efficiently learns dynamic system behavior and scales to more complex systems without significant performance degradation. These results highlight that deep RL, when applied to DR, offers substantial cost savings and computational efficiency, with broad applicability to energy management in various domains.


1. Introduction

Industrial food processing facilities are energy-intensive [1], with up to 50% of food production costs being energy-related [2]. One effective approach to reduce energy costs, lower emissions, or enhance energy efficiency is DR for load profile adjustments [3]. The enormous DR potential of the industrial sector is shown in Siddiquee et al. [4]. In the food industry, this DR potential exists due to the high energy demand for heating and cooling food products [5]. One promising approach to exploit this load shifting potential is via thermal energy storages (TESs) [6]. In this study, a food processing plant with a TES is optimized using RL. The first part of the introduction focuses on DR in food processing plants using thermal capacities as TES, followed by RL as a general optimization method and DR applications of RL in various fields. Finally, the research gap is highlighted and the contributions are stated.

1.1. Thermal Capacities for Demand Response

The effectiveness of food products as TES has been proven by several studies [7,8,9,10,11] investigating supermarkets. In these studies, refrigerated food products are used as TES for load shifting by allowing a controlled temperature range. Transient models of food products can be physics-based via transient energy balances or data-driven, and a variety of optimization algorithms are used for both [7,10,11]. While [7,8,9,10,11] focus on commercial buildings, [13,14,15,16,17] explore industrial settings, particularly refrigerated warehouses. The thermal capacity of such warehouses is mainly due to the stored products [18]. Case studies show event-based DR potential in two American refrigerated warehouses [13]. Akerma et al. [14] present a detailed model for load shedding during 2- or 4-hour DR events, and a follow-up paper [15] confirms that this 2-hour load shifting does not significantly impact the stored food. Ma et al. [16] investigated multiple warehouses, using non-convex optimization for cooperative DR to reduce energy costs and consumption by varying set point temperatures. Khorsandnejad et al. [17] use MILP to analyze a cold warehouse with various compressors and a TES, achieving up to 50% reduction in electricity costs, 25-30% in CO2 emissions, and about 50% in peak load.
Besides existing work on the DR potential of food storage warehouses, multiple studies [19,20,21,22,23,24,25] have examined DR applications in food processing plants. For instance, Chen et al. [19] developed an energy hub model to simulate the electricity demand of a food processing procedure within an industrial park. They implemented a two-stage robust co-optimization approach using MILP to minimize electricity costs under a time-of-use (TOU) pricing scheme. The model included a battery energy storage system and a TES for flexibility. Giordano et al. [20] examined the energy supply of a milk production plant. They modeled the energy supply using fixed load profiles and incorporated an energy storage for flexibility. A rule-based operation optimized energy consumption, reducing energy demand and carbon emissions. Pazmiño-Arias et al. [21] optimized the energy supply of a dairy factory using an energy hub model with fixed load profiles for heat, cooling, and electricity. They applied nonlinear optimization to minimize total energy costs under a TOU pricing scenario. Cirocco et al. [22] investigated the DR potential of combining industrial TES with a photovoltaic (PV) system in an Australian winery. Using historical load profiles, the authors utilized a TES to boost PV self-consumption. A nonlinear optimal control algorithm reduced electricity costs by 22% compared to a non-optimized case without PV. Saffari et al. [23] conducted a case study in a dairy factory producing yogurt and cheese. The work investigated the combination of TES and PV using a MILP formulation based on load profiles. Simulations showed cost reductions from 1.5% to 10%, depending on the scenario. In earlier studies conducted by the authors of this paper [24,25], an Austrian food processing plant is modeled and its operation is optimized in a simulation study. The studies use a detailed model of the food production process for DR, and flexibility options like thermal mass or a chilled water buffer are compared. The DR optimization problem is solved via MILP, achieving reductions in electrical power consumption by up to 18%, electricity costs by up to 24%, and peak load by up to 36%.
In summary, the literature shows a clear potential for DR in the food industry.

1.2. Reinforcement Learning for Demand Response

Literature reviews such as Zhang and Grossmann [26] from 2016 and the literature from Section 1.1 show that MILP is the standard method for optimization in industrial DR problems. In Zhang and Grossmann [26], 36 out of 42 industrial demand side management (DSM) papers used MILP. While MILP is computationally efficient for small problems and capable of finding the global optimum, a high number of binary variables can drastically increase the computation time. Furthermore, a downside of MILP is that the problem has to be solved separately for every time window in model predictive control (MPC). In contrast, machine learning (ML) techniques like RL require training only once, resulting in a computationally efficient application afterwards. Therefore, the potential of RL in a food processing plant is investigated in this study.
Recently, promising technologies like Deep Q-Learning (DQL) have emerged. DQL, introduced by Mnih et al. [27], replaces the Q-table from classical Q-learning [28] with a neural network, enabling it to handle high-dimensional state spaces. DQL showed enormous potential, achieving high scores on a large set of Atari 2600 games [27]. Limitations of the DQL network are its tendency to overestimate action values and the problem of moving targets. To solve this, van Hasselt et al. [29] introduced double deep Q-learning (DDQL). They adapted the DQL network by using one neural network for choosing the best action and one for estimating the value of the next action. Thereby, they achieved better or similar scores in the Atari games compared to DQL. RL's application scope has rapidly expanded, making it a key research focus in various fields. Vázquez-Canteli and Nagy [30] investigated the application of traditional RL algorithms such as Q-learning or Monte Carlo methods for DR and concluded that these RL algorithms can support DR and thereby help integrate renewable energies into the power grid. Deep reinforcement learning (DRL) methods such as DQL and DDQL for smart building energy management are investigated in a review paper [31], showing that nearly all model-free DRL-based building energy optimization methods are still not implemented in practice. Furthermore, DRL can be used in buildings to simultaneously reduce energy costs, peak load, and occupant dissatisfaction. A notable example of RL in practice is Google DeepMind's achievement of a 40% reduction in data center cooling costs using RL techniques [32]. In the following part of the literature review, the classification from OpenAI [33] is used, where model-free RL is categorized into policy-based methods (policy optimization), value-based methods (Q-learning), or methods based on both. There, DQL and DDQL are listed as deep Q-networks (DQN). In the DR literature, value-based RL is applied in residential [34,35,36,37,38] or commercial buildings [39,40,41], smart grids [42,43], or microgrids [44]. The applied RL algorithms are based on Q-learning [34,42,43] or DQN [35,36,37,38,39,40,41,44] and are used to improve the energy efficiency or to reduce the energy costs. Combined value- and policy-based optimization is applied in commercial buildings such as offices [45,46,47], in a laboratory setup [48], or in an industrial warehouse [49]. The applied RL algorithms are based on soft actor-critic (SAC) [45,46,47], deep deterministic policy gradient (DDPG) [48], or augmented random search [49]. SAC was successfully implemented in a real office building, showing a decrease of temperature violations by 68% while maintaining a similar energy consumption as the reference case [47]. Policy-based RL is applied in a commercial building [50] and a university [51], where proximal policy optimization (PPO) is used in both. Policy-based RL algorithms are well-suited for continuous action spaces, whereas value-based RL like DQN is preferred for discrete action spaces due to its lower computational costs and greater sample efficiency. Therefore, DQN has potential for application in energy management, because heating, ventilation, and air conditioning (HVAC) systems often use digital controllers with a discrete set point.

1.3. Contributions

Summarizing, RL shows great potential in various application fields. However, DDQL has not yet been applied to optimize the set point temperature of a refrigerated warehouse in a food processing plant for DR using real-time pricing (RTP), nor has it been compared with a state-of-the-art MPC controller based on MILP. We investigate an onsite warehouse in a food processing plant. This warehouse is influenced not only by the ambient temperature and the HVAC system, but also by the food production process, which significantly impacts the system's dynamics. We use the thermal capacity of the stored goods and the building as TES and apply DR by optimizing the set point temperature of the warehouse's proportional integral (PI) controller. Our main contributions are:
  • We optimize the set point temperature of a PI controller rather than directly controlling the cooling power. This approach enhances stability and simplifies practical implementation.
  • We apply DDQL, a state-of-the-art RL algorithm, for load shifting to reduce energy costs in an RTP scenario.
  • We formulate the problem as MILP to compare RL with a state-of-the-art MPC controller.
  • We investigate the energy cost savings and the computation time of RL and MILP.
The remainder of this study is organized as follows: The system model for RL and MILP, the simulation study concept, the RL algorithm, and the MILP are described in Section 2. In Section 3, we present the optimization results, including an analysis of the RL training process and the influence of the price signal on the DR potential. Finally, conclusions are drawn in Section 4.

2. Methods

In this study, we investigate a food processing plant where cheese is produced and stored. The focus of the paper is the industrial warehouse of the plant, which is used for cooling and storing cheese. The production of multiple months is stored in a high-bay racking system. Cheese enters the warehouse at temperatures between 80°C and 130°C and is cooled to approximately 0°C for storage until sold. An industrial refrigeration system handles this cooling process, and the thermal mass of the warehouse, including the stored cheese products, functions as a TES. By allowing a defined temperature range between 0°C and 5°C, this TES provides flexibility for optimization. The goal of the optimizer is to shift the electrical load of the refrigeration plant to periods with lower electricity prices, reducing overall energy costs. Instead of directly controlling the cooling power, the optimizer adjusts the set point temperature of an existing PI controller. We compare two optimization approaches: dynamic optimization using RL and classical MILP. The optimization is incentivized using hourly RTP. For comparison, we use a realistic, currently employed scenario, in which the set point temperature remains constant at 2.5°C, as a reference case. The study uses data from an industrial plant, including production data (the mass flow of product into the warehouse due to production, $\dot{m}_{in}$), the ambient temperature ($T$), the energy efficiency ratio of the refrigeration system ($\beta$), and approximations of the heat losses to the ambient ($\dot{Q}_{loss}$). Historical data from May 2020 to April 2023 is available. The hourly spot market prices from Energy Exchange Austria (EXAA) [52] are used as RTP for the optimization process. The data is split into two sets: two years of data are used to train the RL agent, while one year is reserved for testing, either via MILP or via the RL agent. The simulation study is conducted in Python.

2.1. Industrial Warehouse Model

Figure 1 shows an overview of the system. The industrial warehouse is modeled with the thermal capacity $C_H$ of the complete warehouse and a transient energy balance. The thermal capacity is the sum of the internal thermal capacities of the stored cheese products and the thermal mass of the building, where the steel of the high-bay racking is a relevant factor. This is modeled by
$C_H(t) = c_{cheese}\, m_{cheese}(t) + c_{building}\, m_{building}$ (1)
where $E$ is the energy, $c_{cheese}$ and $c_{building}$ are the specific thermal capacities of the cheese and the building, and $m_{cheese}$ and $m_{building}$ are the masses of the cheese and the building, respectively. The temperature of the industrial warehouse $T_H$ can be calculated via a transient energy balance:
$\frac{dE}{dt} = C_H(t)\,\dot{T}_H(t) = \dot{Q}_{gain}(t) + \dot{m}_{in}(t)\,h(T_{in}) - \dot{m}_{out}(t)\,h(T_{out}) - \dot{Q}_{cool}(t)$ (2)
where $\dot{Q}_{gain}(t)$ is the heat gain from the ambient, $\dot{m}_{in}(t)$ is the mass flow rate of hot cheese into the warehouse, $\dot{m}_{out}(t)$ is the mass flow rate of cold cheese out of the warehouse, $h(T)$ is the enthalpy of the cheese, and $\dot{Q}_{cool}(t)$ is the cooling power of the chiller. The chiller is modeled via an average energy efficiency ratio $\beta$:
$\dot{Q}_{cool}(t) = P_{el}(t)\,\beta$ (3)
Assuming that $\dot{m}_{in}(t)$ and $\dot{m}_{out}(t)$ are equal, $C_H$ is constant. As a further simplification, $T_{in}$ and $T_{out}$ of the cheese are assumed to be constant. Thereby, the heat flow rates can be simplified to:
$\dot{Q}_{heat}(t) = \dot{Q}_{gain}(t) + \dot{m}_{in}(t)\,h_{in} - \dot{m}_{out}(t)\,h_{out}$ (4)
Applying all simplifications mentioned, this leads to:
$\dot{T}_H(t) = \frac{1}{C_H}\left(\dot{Q}_{heat}(t) - P_{el}(t)\,\beta\right)$ (5)
Applying discretization to the inputs ($\dot{Q}_{heat}(t)$, $P_{el}(t)$) for a specified time interval $i$ of duration $\Delta t$, the transient thermal energy balance can be simplified to:
$T_{H,i+1} = T_{H,i} + \frac{1}{C_H}\left(\dot{Q}_{heat,i} - P_{el,i}\,\beta\right)\Delta t$ (6)
Equation (6) can be used to calculate the system behavior.
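As a minimal illustration, Equation (6) can be evaluated for one time step with a few lines of Python; the parameter values follow Table 1, while the function name and interface are our own.

```python
# Minimal sketch of the discretized energy balance in Equation (6).
# Default values follow Table 1; the interface is an assumption for illustration.
def next_temperature(T_H, Q_heat, P_el, C_H=4360e6, beta=4.938, dt=60.0):
    """Return T_{H,i+1} in °C after one time step of length dt (s).

    T_H    : warehouse temperature T_{H,i} in °C
    Q_heat : heat load Q_heat,i in W
    P_el   : electrical power of the chiller P_el,i in W
    C_H    : thermal capacity in J/K, beta: energy efficiency ratio
    """
    return T_H + (Q_heat - P_el * beta) * dt / C_H
```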
The cooling power $\dot{Q}_{cool}(t)$ is controlled via a basic time-discrete PI controller with saturation:
Algorithm 1: PI controller with saturation
1: $u_i = k_P\,(T_{H,i} - T_{set,i}) + (k_I\,\Delta t - k_P)\,(T_{H,i-1} - T_{set,i-1}) + u_{i-1}$
2: if $u_i > P_{max}$ then
3:     $P_{el,i} = P_{max}$
4: else if $u_i < P_{min}$ then
5:     $P_{el,i} = P_{min}$
6: else
7:     $P_{el,i} = u_i$
8: end if
where $T_{set,i}$ is the set point temperature, $k_P$ is the proportional factor, $k_I$ is the integral factor, $P_{max}$ is the maximum power, $P_{min}$ is the minimum power, $u_i$ is the output signal of the PI controller before saturation, and $P_{el,i}$ is the output signal after saturation that is used to control the electrical power of the chiller. In the reference case, $T_{set,i}$ is constant. In the RL and MILP scenarios, $T_{set,i}$ can vary between a lower boundary $T_{lb}$ and an upper boundary $T_{ub}$. The resulting flexibility in temperature associated with the thermal capacity $C_H$ is used for load shifting. The model parameters are shown in Table 1.
The specific thermal capacity $c_{cheese}$ is taken from the literature [25], the energy efficiency ratio $\beta$ is a yearly average based on historical data, the temperature band is defined from 0°C to 5°C, and the remaining parameters are estimated based on internal data of the company. The building model, which serves as the environment for the RL algorithm, was implemented in Python using Gymnasium [53].
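To make the simulation setup concrete, the following sketch outlines how such a Gymnasium environment could look, combining Equation (6) with the PI controller of Algorithm 1. The numerical values follow Table 1 and the state layout follows Section 2.2, while the class name, the reset initialization, and the handling of the price and load arrays are our own assumptions rather than the authors' implementation.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class WarehouseEnv(gym.Env):
    """Sketch of the warehouse TES environment (one episode = 24 hourly steps)."""
    T_LB, T_UB = 0.0, 5.0            # °C temperature band
    C_H = 4360e6                     # J/K thermal capacity
    BETA = 4.938                     # energy efficiency ratio
    K_P, K_I = 5e5, 2.0              # W/K, W/(K s)
    P_MIN, P_MAX = 0.0, 202511.0     # W
    DT = 60.0                        # s, PI controller sampling time
    SUB_STEPS = 60                   # PI executions per 1 h environment step

    def __init__(self, prices_scaled, q_heat_scaled, q_heat_watt):
        self.prices = prices_scaled      # p*_i, 24 values scaled to [1, 10]
        self.q_heat_pct = q_heat_scaled  # scaled load, 24 values in % of storage
        self.q_heat_w = q_heat_watt      # heat load in W, 24 hourly values
        self.action_space = spaces.Discrete(101)   # SOC_set in {0, ..., 100}
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(50,))

    def _obs(self):
        soc = 100.0 * (self.T_UB - self.T_H) / (self.T_UB - self.T_LB)
        price, load = np.zeros(24), np.zeros(24)    # unknown forecast values -> 0
        price[: 24 - self.i] = self.prices[self.i:]
        load[: 24 - self.i] = self.q_heat_pct[self.i:]
        return np.concatenate(([soc, 24 - self.i], price, load)).astype(np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.i = 0
        self.T_H = self.np_random.uniform(self.T_LB, self.T_UB)  # random initial SOC
        # zero integral history: previous set point equal to the current temperature
        self.T_H_prev, self.T_set_prev, self.u_prev = self.T_H, self.T_H, 0.0
        return self._obs(), {}

    def step(self, action):
        # translate the set state of charge into a set point temperature
        T_set = self.T_LB + (100 - action) / 100.0 * (self.T_UB - self.T_LB)
        p_sum = 0.0
        for _ in range(self.SUB_STEPS):
            # Algorithm 1: PI controller in velocity form with saturation
            u = (self.K_P * (self.T_H - T_set)
                 + (self.K_I * self.DT - self.K_P) * (self.T_H_prev - self.T_set_prev)
                 + self.u_prev)
            P_el = min(max(u, self.P_MIN), self.P_MAX)
            self.T_H_prev, self.T_set_prev, self.u_prev = self.T_H, T_set, u
            # Equation (6): discretized energy balance
            self.T_H += (self.q_heat_w[self.i] - P_el * self.BETA) * self.DT / self.C_H
            p_sum += P_el
        reward = -(p_sum / self.SUB_STEPS) * self.prices[self.i]   # negative scaled cost
        self.i += 1
        terminated = self.i >= 24
        return self._obs(), reward, terminated, False, {}
```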

2.2. Reinforcement Learning

Figure 2 shows the general principle of RL.
In RL, the problem is framed as a learner (agent) interacting with an environment. The agent interacts with the environment by selecting actions and then receives a reward and a state from the environment. The agent is only aware of the state and a set of possible actions and, based on these, selects an action according to its policy. One iteration of this complete process is called a step, and all steps together are called an episode. The cumulative reward over the complete episode is called the return. Maximizing the expected return is the goal of the agent. In Q-learning, the agent estimates the action-value function denoted as $Q(S, A)$. This function gives the expected return when starting in a state $S$, taking an action $A$, and following a policy $\pi$ thereafter. In Q-learning, the action-value function $Q(S, A)$ is estimated via bootstrapping and trained via:
$Q(S, A) \leftarrow Q(S, A) + \alpha\left[R + \gamma \max_a Q(S', a) - Q(S, A)\right]$ (7)
where $\alpha$ is the learning rate and $\gamma$ is a discount factor. In DQL, the action-value function is approximated via a neural network $Q(S, A; \theta)$ with parameter set $\theta$. Known challenges of using function approximation via neural networks in Q-learning are moving targets and maximization bias. These issues are circumvented in DDQL, where two neural networks are used to stabilize the training process. The policy network $Q_\pi$ is used to select the best action given the current state, while the target network $Q_t$ is used for estimating the value of the following action. The training process is based on mini-batch learning using memory replay [27]. This technique helps to break correlations of state-action sequences while increasing stability [27]. In contrast to Mnih et al. [27], who use periodical copies of the policy network as target network, and van Hasselt et al. [29], who update the networks symmetrically by switching their roles, the presented algorithm uses soft updates as in Lillicrap et al. [54]. The applied DDQL algorithm based on [27,29] is shown in Algorithm 2.
Algorithm 2: Double Deep Q-Learning - Training
1: Initialize policy network $Q_\pi$ with random weights $\theta_\pi$
2: Initialize target network $Q_t$ with random weights $\theta_t$
3: Initialize replay buffer $M$ as empty buffer
4: loop for each episode:
5:     Reset environment, observe $S$
6:     loop for each step of episode:
7:         With probability $\epsilon$ select a random action $A$
8:         Otherwise select $A = \arg\max_a Q_\pi(S, a; \theta_\pi)$
9:         Decrease $\epsilon$
10:        Take action $A$, observe $R$, $S'$
11:        Store transition $(S, A, R, S')$ in $M$
12:        Sample random mini-batch $(s, a, r, s')$ of transitions from $M$
13:        Set $y = \begin{cases} R, & \text{if } S' \text{ is a terminal state} \\ R + \gamma\, Q_t\big(S', \arg\max_a Q_\pi(S', a; \theta_\pi); \theta_t\big), & \text{else} \end{cases}$
14:        Perform a gradient descent step on the policy network using $L_\delta(y, Q_\pi(S, A))$
15:        Soft update the weights of the target network $Q_t$
16:        $S \leftarrow S'$
17:    end loop
18: end loop
First, the neural networks are initialized with random starting weights ($\theta_\pi$ and $\theta_t$) and an empty replay buffer $M$ is created. At the start of every episode, the environment is reset. In lines 7–9, exploration (selecting a random action) is performed with a probability of $\epsilon$; otherwise, exploitation is applied, which means applying the best action according to the current policy (defined by the policy network). To improve the policy $\pi$, exploration is essential. As a consequence, the learning process starts with a high exploration rate and later refines the policy using less exploration. The exploration rate is scheduled according to:
$\epsilon = \epsilon_{end} + (\epsilon_{start} - \epsilon_{end})\, e^{-N_{steps} / d_\epsilon}$ (8)
where $\epsilon$ is the current threshold for exploration, $\epsilon_{start}$ is the initial exploration rate, $\epsilon_{end}$ is the final exploration rate, $d_\epsilon$ is the decay rate, and $N_{steps}$ is the number of steps already performed. In line 10, the action is performed in the environment and the reward $R$ and the next state $S'$ are observed. The transition $(S, A, R, S')$ is stored in the replay buffer, and random samples from the replay buffer are used to train the policy network $Q_\pi$. As a measure of error in the training of the policy network's weights, the Huber loss $L_\delta$ is used:
$L_\delta(y, Q_\pi(S, A)) = \begin{cases} \frac{1}{2}\,(y - Q_\pi(S, A))^2, & \text{for } |y - Q_\pi(S, A)| \le \delta \\ \delta\left(|y - Q_\pi(S, A)| - \frac{1}{2}\,\delta\right), & \text{otherwise} \end{cases}$ (9)
where $\delta$ is a parameter of the Huber loss function. The Huber loss acts as the mean squared error for small errors $(y - Q_\pi(S, A))$ and as the mean absolute error for larger errors and is therefore more robust to outliers. This loss is used to perform a stochastic gradient descent step using the Adam optimizer to update the weights of the policy network $Q_\pi$. The weights of the target network $Q_t$ are updated via soft updates:
$\theta_t \leftarrow \tau\,\theta_\pi + (1 - \tau)\,\theta_t$ (10)
where τ is the target network’s update rate.
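For illustration, the following PyTorch sketch condenses one training iteration of Algorithm 2: the decayed epsilon-greedy action selection of Equation (8), the double-Q target of line 13, the Huber loss of Equation (9), and the soft target update of Equation (10). The layer sizes and the hyperparameters used below follow Table 2 (the output layer is left linear here, as is common for Q-networks); the replay-buffer handling and tensor shapes are simplified assumptions, not the authors' code.

```python
import math
import random
import torch
import torch.nn as nn

def make_net(n_obs=50, n_actions=101):
    # three fully connected layers as listed in Table 2
    return nn.Sequential(nn.Linear(n_obs, 256), nn.ReLU(),
                         nn.Linear(256, 256), nn.ReLU(),
                         nn.Linear(256, n_actions))

policy_net, target_net = make_net(), make_net()   # lines 1-2: random initial weights
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-4)
loss_fn = nn.SmoothL1Loss()                       # Huber loss with delta = 1
GAMMA, TAU = 0.999, 0.005
EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 2000

def select_action(state, n_steps):
    # Equation (8): exponentially decayed epsilon-greedy exploration
    eps = EPS_END + (EPS_START - EPS_END) * math.exp(-n_steps / EPS_DECAY)
    if random.random() < eps:
        return random.randrange(101)
    with torch.no_grad():
        return int(policy_net(state.unsqueeze(0)).argmax(dim=1).item())

def optimize(batch):
    # batch: tensors s (B, 50), a (B,) int64, r (B,), s_next (B, 50), done (B,) in {0, 1}
    s, a, r, s_next, done = batch
    q_sa = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # double-Q target (Algorithm 2, line 13): the policy net picks the action,
        # the target net estimates its value; y = r for terminal transitions
        a_next = policy_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_next).squeeze(1)
        y = r + GAMMA * q_next * (1.0 - done)
    loss = loss_fn(q_sa, y)                       # Equation (9)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                         # Equation (10): soft target update
        for p_t, p_p in zip(target_net.parameters(), policy_net.parameters()):
            p_t.mul_(1 - TAU).add_(TAU * p_p)
```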
This standard RL algorithm can be applied to any problem that can be abstracted into an agent and an environment, where the set of actions is discrete and the state can be discrete or continuous. The state should fulfill the Markov property, i.e., it should include all information about the past agent-environment interaction that makes a difference in the future. For physical systems in a linear state space representation, the Markov property is fulfilled.
In the following paragraphs, this algorithm is applied to the DR problem in an industrial warehouse. Here, the controller of the refrigeration system is the agent and the industrial warehouse is the environment. As an action, the agent controls the set state of charge ($SOC$) of the TES, which is inversely related to the set temperature of the industrial warehouse. The possible actions are discrete (0–100), where 0 represents the lowest and 100 the highest set state of charge of the TES controlled by the PI controller. The operation is always optimized for one day, so one episode consists of 24 time steps ($\Delta t_{step} = 1$ h). The state of the environment is given by a set containing the state of charge, the number of remaining time steps of the day, the electricity price of the next 24 steps, and the predicted thermal load $\dot{Q}_{heat,i}$ of the next 24 time steps. The electricity price and the thermal load are assumed to be known for one day, i.e., for 24 steps. Electricity spot-market prices are usually published one day in advance, so this is a realistic scenario. After the first step, only the next 23 steps are known, and so forth. Unknown values are set to 0. To stabilize the learning process, the variables defining the state are scaled. The temperature is used to calculate the state of charge of the energy storage in %:
$SOC_i = 100\,\frac{T_{ub} - T_i}{T_{ub} - T_{lb}}$ (11)
The prices $p$, which are used to calculate the energy costs and the reward, are scaled between 1 and 10 on a daily basis using:
$p_i^* = 9\,\frac{p_i - \min(p)}{\max(p) - \min(p)} + 1$ (12)
Scaled values are indicated with an asterisk (*). Prices are scaled for three reasons: 1) the values of the state should all be in a similar range; 2) electricity prices fluctuate strongly during the year, and the neural network should learn from the relative prices within a day; and 3) as a consequence, the RL agent can handle price signals significantly higher or lower than those seen during training. The scaled prices are always greater than or equal to one, resulting in a positive energy price. If the price could be 0 €/kWh, the RL agent could increase the energy consumption during these time slots for free.
The thermal load is scaled to a percentage of the energy storage capacity via:
$\dot{Q}_{heat,i}^* = 100\,\frac{\dot{Q}_{heat,i}\,\Delta t}{C_H\,(T_{ub} - T_{lb})}$ (13)
Combining Equations (11)–(13) with $(24 - i)$ to indicate the remaining time steps of a day/episode results in the state $S_i$:
$S_i = \left[SOC_i,\ (24 - i),\ p_i^*, \dots, p_{i+23}^*,\ \dot{Q}_{heat,i}^*, \dots, \dot{Q}_{heat,i+23}^*\right]$ (14)
Note that if the electricity price or the thermal load for the forecast is not known, a value of zero is used. The action space is discrete (0–100), representing the resulting set state of charge ($SOC_{set}$) for the PI controller. The set point temperature can be calculated from the set value of the state of charge by:
$T_{set} = T_{lb} + \frac{100 - SOC_{set}}{100}\,(T_{ub} - T_{lb})$ (15)
The environment is implemented using Equation (6) in combination with the PI controller. Note that while there are 24 steps per day, the PI controller logic is executed multiple times (60 times per time step) during a single simulation step of the environment. Then, the average electrical power $P_{el,i}$ during that step is used to calculate the reward $R$:
$R_i = -P_{el,i}\,p_i^*$ (16)
After 24 steps, the terminated flag is set to true and the episode is finished. Algorithm 2 described above is used to train the agent. During training, a random initial state of charge is used in combination with a randomly selected price signal and a randomly selected load signal from the training set. In total, 2000 training episodes are used. During operation, the greedy action (exploitation) is always applied.
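As a compact illustration of the scaling in Equations (11)–(13), the state vector of Equation (14), and the reward of Equation (16), a minimal Python sketch is given below; the function names are ours, and the step length $\Delta t$ in the load scaling is assumed to be the 1 h environment step.

```python
import numpy as np

T_LB, T_UB = 0.0, 5.0       # °C temperature band
C_H = 4360e6                # J/K thermal capacity (Table 1)
DT_STEP = 3600.0            # s, assumed 1 h environment step for the load scaling

def soc(T_H):
    # Equation (11): state of charge of the TES in %
    return 100.0 * (T_UB - T_H) / (T_UB - T_LB)

def scale_prices(p):
    # Equation (12): daily scaling of the price signal to the range [1, 10]
    p = np.asarray(p, dtype=float)
    return 9.0 * (p - p.min()) / (p.max() - p.min()) + 1.0

def scale_heat_load(q_heat):
    # Equation (13): thermal load as a percentage of the storage capacity
    return 100.0 * np.asarray(q_heat, dtype=float) * DT_STEP / (C_H * (T_UB - T_LB))

def build_state(i, T_H, p_scaled, q_scaled):
    # Equation (14): SOC, remaining steps, and 24 h forecasts (unknown values -> 0)
    price, load = np.zeros(24), np.zeros(24)
    price[: 24 - i] = p_scaled[i:24]
    load[: 24 - i] = q_scaled[i:24]
    return np.concatenate(([soc(T_H), 24 - i], price, load))

def reward(P_el_avg, p_scaled_i):
    # Equation (16): negative scaled energy cost of the current step
    return -P_el_avg * p_scaled_i
```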
The parameters of the RL algorithm can be seen in Table 2.
The RL algorithm was implemented in Python using PyTorch [55].

2.3. Mixed Integer Linear Programming

To validate the performance of RL, MILP is used as a benchmark. MILP is ideally suited for this purpose because the global optimum of the underlying linear problem can be determined. The objective is to minimize the energy costs, and the decision variables are the set point temperatures of the warehouse. The model of the warehouse from Section 2.1 and the PI controller from Algorithm 1 are formulated as constraints in the MILP. While the model of the warehouse is linear, the PI controller has a non-linear effect given by the saturation that limits the controller's output signal. Modeling this effect introduces binary variables to the formulation and, thereby, drastically increases the computation time. Also, the PI controller reacts every minute to changes in the warehouse temperature, leading to an optimization problem for one day with 7201 continuous and 2880 binary variables. The MILP problem was implemented in Python using Gurobi 11.0 [56]. The complete MILP formulation and its detailed description can be found in Appendix A.

3. Results and Discussion

The trained RL agent is compared with a MILP optimization on the test set, which spans from May 2022 to April 2023. Both RL and MILP are used for MPC, always optimizing the operation for a full day at midnight. The currently employed PI controller is simulated and serves as a reference scenario. The electricity price and the predicted heating load are assumed to be known for the current day. The results of this case study are shown in Table 3.
As expected, MILP achieves the highest cost reductions, as it finds the global optimum. However, while MILP is assumed to have perfect knowledge of the complete system dynamics, RL is model-free. In light of this, RL proves to be a promising solution, reaching cost reductions of 17.57% compared to 18.65% reached by MILP. The detailed optimization for three exemplary days can be seen in Figure 3.
Both RL and MILP use periods with low energy prices for precooling the warehouse and operate close to the upper boundary $T_{ub}$ of the set temperature. The use of the thermal capacity as energy storage can be seen in the second subplot. The capacity of the industrial warehouse including the stored cheese is so high that the complete temperature band from $T_{lb}$ to $T_{ub}$ is not fully used on these exemplary days. The results for the complete year can be seen in Figure 4.
Comparing the weekly savings in Figure 4a with the electricity price in Figure 4c shows that higher prices lead to a higher potential for cost savings. In 2023, when the energy prices were significantly lower than in 2022, the weekly absolute energy savings were also lower. Nevertheless, Figure 4b shows that the relative energy savings remain high during the complete test period. The only relevant outlier is the Christmas week, with no production, a low cooling demand due to cold ambient temperatures, and low energy prices. Therefore, the energy costs are low, and a typical reduction at such low energy costs results in high relative (Figure 4b) but low absolute (Figure 4a) energy savings. All in all, Figure 4 shows that both algorithms work for high and low energy prices as well as for high and low fluctuations. Figure 5 shows the energy price for the test and the training period. It can be seen that the prices in the training period are significantly different from those in the test period, in particular lower and with fewer fluctuations. As shown, the presented RL algorithm also works for prices differing from those in the training data set, which can be explained by the price scaling described in the methods. This demonstrates the ability of RL to adapt to new states not encountered by the algorithm before.

3.1. RL Training Process

In this subsection, the RL training process is investigated. The training process for the parameters from Table 2 is shown in Figure 6.
The RL agent improves quickly, achieving decent results after 250 training episodes. The high fluctuation of the single-episode results is caused by the random initialization of the state of charge. In some episodes, the energy storage starts full and ends empty, so no additional energy is needed, resulting in a reward of 0 €. Still, different start values are beneficial for training. To further investigate the training process, different values for the number of training episodes and the exploration decay rate $d_\epsilon$ are compared in Figure 7. The number of training episodes and $d_\epsilon$ are tested with values from 100 to 10000.
Figure 7 shows that the training process converges and that good results can be achieved after 250 episodes. However, longer training periods improve the results even further on average. While 250 episodes yield an average cost reduction of 16.94%, 10000 training episodes yield an average cost reduction of 17.87%, and the deviation of the results is much smaller.

3.2. Computation Time

MILP is capable of finding cost-optimal solutions; however, it can be computationally intensive due to the large number of variables involved. The original problem formulation contains 7201 continuous variables and 2880 binary variables. After applying Gurobi's presolve, the number of continuous variables is reduced to 5070, while the number of binary variables remains unchanged. Despite this reduction, the average computation time per optimization step is 19 s, which is reasonable for complex control tasks. In contrast, the RL agent operates significantly faster, with an average computation time of 2 ms per optimization step (excluding the training phase). Training the RL model requires 306 s (about 5 min) for 2000 episodes. As a result, the total runtime, including both training and operation, is 307 s, which is substantially less than the total runtime for MILP of 6963 s (116 min). While MILP provides competitive results with a manageable average optimization time of 19 s, the RL approach offers significantly faster performance, being approximately 23 times faster than MILP. Note that all runtime tests were conducted on a 2022 MacBook Pro M2.
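For reference, the following sketch shows one way the per-step inference time of a trained policy network could be measured; the timing values reported above refer to the authors' hardware, and the helper function below is our own illustration.

```python
import time
import torch

def mean_inference_time(policy_net, state, repeats=1000):
    # average wall-clock time of one greedy action selection (exploitation only)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(repeats):
            policy_net(state.unsqueeze(0)).argmax(dim=1)
        return (time.perf_counter() - start) / repeats
```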

3.3. Applicability in Practice

The proposed RL approach is designed to be versatile, making it suitable for a wide range of DR applications, such as managing energy storage systems like batteries, TES systems, and boilers. However, in practical scenarios, collecting data over multiple months, i.e., at least hundreds of episodes each representing a full day, can be inefficient and impractical. To overcome this, transfer learning can be employed. With this method, the agent is pre-trained in simulated environments and later deployed to real-world applications. This not only speeds up the learning process but also enhances safety and system stability by allowing early training phases to exclude extreme actions and critical states.
Another key aspect of the RL framework is its integration with existing control systems. Instead of directly controlling the cooling power, the RL algorithm adjusts the set-point temperature of an existing PI controller. This indirect control improves overall system stability and makes the solution easier to implement. Directly controlling chiller power is often impractical, making this approach more feasible for real-world applications.
The RL algorithm's computation is efficient, requiring only 2 ms per optimization. This makes it ideal for edge computing environments, such as Internet of Things (IoT) devices, where fast decision-making is essential. Its compatibility with microcontrollers that support Python, combined with the use of open-source libraries like PyTorch, further reduces implementation costs by eliminating the need for expensive proprietary software licenses. In contrast, traditional approaches like MILP would demand high-performance computing resources and expensive commercial solvers, such as Gurobi, making RL a more accessible and cost-effective alternative for DR applications.

4. Conclusions

We proposed a machine learning-based method for load shifting in a food processing plant. The food processing plant is modeled based on a transient energy balance, and the thermal capacities of the stored food and the building are used as flexibility for load shifting of the industrial refrigeration system. The load is shifted by optimizing a PI controller's set point temperature, thereby ensuring safe system states. To optimize load shifting, an RL algorithm is compared with a state-of-the-art MPC controller based on MILP. As the RL algorithm, DDQL is applied. The simulation study results show cost savings in an RTP-driven scenario of 17.57% via model-free RL, compared to 18.65% via MILP, which assumes perfect knowledge of the model. This demonstrates that RL is capable of providing good solutions without prior knowledge of the model. RL also proves to be computationally highly efficient. Even when accounting for both training and operation, the RL agent is still 23 times faster than the MILP method used for comparison. Training the agent needs to be done only once; afterwards, its operation is fast, with an average computation time of only 2 ms, compared to 19 s for solving the MILP. Additionally, the RL agent is based on open-source code, not requiring any expensive commercial licenses.
The computational efficiency and independence of expensive licensing make RL a promising tool to optimize energy management problems, especially for edge computing on IoT devices. Due to the adaptive and self-learning capabilities of RL, this method can also be transferred to various energy management applications.

Author Contributions

Conceptualization, P.W., S.H., E.E., M.K. and P.K.; methodology, P.W., S.H., E.E. and P.K.; software, P.W.; validation, P.W., P.K. and S.H.; formal analysis, P.W., S.H., E.E. and P.K.; investigation, P.W.; resources, P.K.; data curation, P.W.; writing—original draft preparation, P.W.; writing—review and editing, P.W., S.H., E.E., M.K. and P.K.; visualization, P.W. and E.E.; supervision, M.K. and P.K.; project administration, P.K.; funding acquisition, P.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Austrian Federal Ministry for Digital and Economic Affairs, the National Foundation for Research, Technology, and Development as well as the Christian Doppler Research Association.

Institutional Review Board Statement

Not applicable

Informed Consent Statement

Not applicable

Data Availability Statement

The datasets presented in this article are not readily available because they include confidential data from our project partner Rupp Austria GmbH. Requests to access the datasets should be directed to peter.kepplinger@fhv.at.

Acknowledgments

The authors are grateful to the project partner Rupp Austria GmbH for providing the data and all the fruitful discussions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DR Demand response
DDQL Double deep Q-learning
DDPG Deep deterministic policy gradient
DQL Deep Q-learning
DQN Deep Q-networks
DRL Deep reinforcement learning
DSM Demand side management
EXAA Energy exchange Austria
HVAC Heating, ventilation and air conditioning
IoT Internet of things
LP Linear programming
MILP Mixed integer linear programming
MINLP Mixed integer nonlinear programming
ML Machine learning
MPC Model predictive control
PI Proportional integral
PPO Proximal policy optimization
PV Photovoltaic
RL Reinforcement learning
RTP Real-time pricing
SAC Soft actor-critic
TES Thermal energy storage
TOU Time-of-use

Appendix A. MILP Formulation

The following decision variables are defined for the model:
  • $T_{set,p}$ is the set point temperature of the warehouse during the time period $p$.
  • $T_{H,t}$ is the warehouse temperature at the time point $t$.
  • $u_p$ is the output signal of the PI controller before saturation during the time period $p$.
  • $u_{tmp,p}$ is a helper variable to calculate the saturation during the time period $p$.
  • $b_{1,p}$ and $b_{2,p}$ are binary variables to calculate the saturation during the time period $p$.
  • $P_{el,p}$ is the electrical power consumption of the industrial refrigeration system during the time period $p$.
Note that $T_{set,p}$ is the actual decision variable and the remaining variables are helper variables introduced to improve the readability of the formulation. Also, $T_{set,p}$ is a continuous variable in the MILP formulation to improve the computation time. Therefore, the MILP problem has slightly more flexibility available for load shifting than the RL problem, where the set point temperature can only take discrete values. The following time series are needed as inputs:
  • $\pi_p$ is the price signal during the time period $p$.
  • $\dot{Q}_{heat,p}$ is the heat flow rate of the load during the time period $p$.
The following parameters are needed additionally:
  • $\Delta t$ is the length of a time period.
  • $T_{H,start}$ is the initial warehouse temperature at the time point 0.
  • $\beta$ is the energy efficiency ratio of the industrial refrigeration system.
  • $k_P$ is the proportional factor of the controller.
  • $k_I$ is the integral factor of the controller.
  • $P_{min}$ is the minimum electrical power.
  • $P_{max}$ is the maximum electrical power.
  • $T_{lb}$ is the minimum set point temperature.
  • $T_{ub}$ is the maximum set point temperature.
  • $N$ is the number of time periods.
  • $M_1$ and $M_2$ are big-M constants.
The following sets are defined:
  • $P$ is a set of indices for every time period.
  • $\tau$ is a set of indices for every time point.
  • $I$ is a set of indices for every hour of a day.
  • $J$ is a set of indices for every minute in an hour.
The optimization problem minimizing the total costs can be written as:
obj: $\min_{T_{set}} \sum_{p=0}^{N-1} P_{el,p}\,\pi_p\,\Delta t$ (A1)
subject to:
$P = \{0, \dots, N-1\}$ (A2)
$\tau = \{0, \dots, N\}$ (A3)
$T_{H,0} = T_{H,start}$ (A4)
$\forall t \in \tau \setminus \{0\}: \quad T_{H,t} = T_{H,t-1} + \frac{1}{C_H}\left(\dot{Q}_{heat,t-1} - P_{el,t-1}\,\beta\right)\Delta t$ (A5)
$\forall t \in \tau: \quad T_{H,t} \le T_{ub}$ (A6)
$\forall t \in \tau: \quad T_{H,t} \ge T_{lb}$ (A7)
$u_0 = k_P\,(T_{H,0} - T_{set,0})$ (A8)
$\forall p \in P \setminus \{0\}: \quad u_p = k_P\,(T_{H,p} - T_{set,p}) + (k_I\,\Delta t - k_P)\,(T_{H,p-1} - T_{set,p-1}) + u_{p-1}$ (A9)
$\forall p \in P: \quad u_{tmp,p} \le u_p$ (A10)
$\forall p \in P: \quad u_{tmp,p} \le P_{max}$ (A11)
$\forall p \in P: \quad u_{tmp,p} \ge u_p - M_1\,b_{1,p}$ (A12)
$\forall p \in P: \quad u_{tmp,p} \ge P_{max} - M_1\,(1 - b_{1,p})$ (A13)
$\forall p \in P: \quad P_{el,p} \ge u_{tmp,p}$ (A14)
$\forall p \in P: \quad P_{el,p} \ge P_{min}$ (A15)
$\forall p \in P: \quad P_{el,p} \le u_{tmp,p} + M_2\,b_{2,p}$ (A16)
$\forall p \in P: \quad P_{el,p} \le P_{min} + M_2\,(1 - b_{2,p})$ (A17)
$\forall p \in P: \quad T_{set,p} \ge T_{lb}$ (A18)
$\forall p \in P: \quad T_{set,p} \le T_{ub}$ (A19)
$I = \{0, \dots, 23\}$ (A20)
$J = \{0, \dots, 59\}$ (A21)
$\forall i \in I,\ j \in J: \quad T_{set,60i} = T_{set,60i+j}$ (A22)
$\forall p \in P,\ t \in \tau: \quad T_{set,p},\ T_{H,t},\ u_p,\ u_{tmp,p},\ P_{el,p} \in \mathbb{R}$ (A23)
$\forall p \in P: \quad b_{1,p},\ b_{2,p} \in \{0, 1\}$ (A24)
where Equation (A1) is the objective function minimizing the total energy costs. The variables are split into state variables (at a certain time point) and input variables (during a certain time period). Equation (A2) defines a set of indices for every time period and Equation (A3) defines a set of indices for every time point. The start temperature of the warehouse is defined in Equation (A4) and the remaining temperatures are calculated via an energy balance in Equation (A5). Equations (A6)–(A7) introduce additional cutting planes, which are not considered in the RL scenario, designed to enhance computational efficiency and accelerate the solution process. The remaining Equations (A8)–(A24) describe the PI controller with saturation. Equation (A8) is the initial condition of the PI controller, assuming that the integral error of the historic values is zero, and, therefore, only the proportional factor $k_P$ is needed for the first controller output $u_0$. Equation (A9) is used to calculate the controller output signal $u_p$ of the remaining time steps depending on the proportional factor $k_P$ and the integral factor $k_I$. Equations (A10)–(A17) represent the saturation element. Saturation is used to limit the output signal of the controller between $P_{min}$ and $P_{max}$. For the implementation, the big-M method is used. The set point temperature $T_{set,p}$ is limited in Equations (A18)–(A19). The PI controller calculates a new output signal $u_p$ at every time point. To reduce high-frequency output signals, $T_{set,p}$ can only be changed every hour, so Equations (A20)–(A22) are used to ensure that all set points within each hour are the same. This could also be done by defining a separate set of indices with a different time period, which would decrease the number of variables; however, for readability reasons and because simple variable reductions are done by the solver anyway, this was not applied. The final optimization problem consists of $5N + 1$ continuous and $2N$ binary variables. For one simulated day with $N = 1440$, this results in 7201 continuous and 2880 binary variables.
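For illustration, the following gurobipy sketch shows how the core of this formulation could be set up: the objective (A1), the energy balance (A5), the big-M saturation constraints (A10)–(A17), and the hourly set-point coupling (A22). The price and load arrays as well as the big-M values are placeholder assumptions, so the sketch demonstrates the structure rather than reproducing the authors' exact implementation.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

N = 1440                                   # one day with 1 min time periods
dt, C_H, beta = 60.0, 4360e6, 4.938        # s, J/K, energy efficiency ratio
k_P, k_I = 5e5, 2.0                        # W/K, W/(K s)
P_min, P_max = 0.0, 202511.0               # W
T_lb, T_ub, T_start = 0.0, 5.0, 2.5        # °C
M1 = M2 = 10 * P_max                       # big-M values, assumed sufficiently large

# placeholder input data: day-ahead prices in EUR/Ws and a constant heat load in W
pi_p = np.repeat(np.random.default_rng(0).uniform(50, 150, 24), 60) / 3.6e9
Q_heat = np.full(N, 60_000.0)

m = gp.Model("warehouse_dr")
T_set = m.addVars(N, lb=T_lb, ub=T_ub, name="T_set")               # (A18)-(A19)
T_H = m.addVars(N + 1, lb=T_lb, ub=T_ub, name="T_H")               # (A6)-(A7)
u = m.addVars(N, lb=-GRB.INFINITY, name="u")
u_tmp = m.addVars(N, lb=-GRB.INFINITY, name="u_tmp")
P_el = m.addVars(N, lb=-GRB.INFINITY, name="P_el")
b1 = m.addVars(N, vtype=GRB.BINARY, name="b1")
b2 = m.addVars(N, vtype=GRB.BINARY, name="b2")

m.setObjective(gp.quicksum(P_el[p] * pi_p[p] * dt for p in range(N)), GRB.MINIMIZE)  # (A1)

m.addConstr(T_H[0] == T_start)                                                       # (A4)
m.addConstrs(T_H[t] == T_H[t - 1] + (Q_heat[t - 1] - P_el[t - 1] * beta) * dt / C_H
             for t in range(1, N + 1))                                               # (A5)
m.addConstr(u[0] == k_P * (T_H[0] - T_set[0]))                                       # (A8)
m.addConstrs(u[p] == k_P * (T_H[p] - T_set[p])
             + (k_I * dt - k_P) * (T_H[p - 1] - T_set[p - 1]) + u[p - 1]
             for p in range(1, N))                                                   # (A9)
# saturation: u_tmp = min(u, P_max) and P_el = max(u_tmp, P_min) via big-M
m.addConstrs(u_tmp[p] <= u[p] for p in range(N))                                     # (A10)
m.addConstrs(u_tmp[p] <= P_max for p in range(N))                                    # (A11)
m.addConstrs(u_tmp[p] >= u[p] - M1 * b1[p] for p in range(N))                        # (A12)
m.addConstrs(u_tmp[p] >= P_max - M1 * (1 - b1[p]) for p in range(N))                 # (A13)
m.addConstrs(P_el[p] >= u_tmp[p] for p in range(N))                                  # (A14)
m.addConstrs(P_el[p] >= P_min for p in range(N))                                     # (A15)
m.addConstrs(P_el[p] <= u_tmp[p] + M2 * b2[p] for p in range(N))                     # (A16)
m.addConstrs(P_el[p] <= P_min + M2 * (1 - b2[p]) for p in range(N))                  # (A17)
# the set point may only change once per hour
m.addConstrs(T_set[60 * i] == T_set[60 * i + j]
             for i in range(24) for j in range(60))                                  # (A22)

m.optimize()
```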

References

  1. Gerres, T.; Chaves Ávila, J. P.; Llamas, P. L.; San Román, T. G. A Review of Cross-Sector Decarbonisation Potentials in the European Energy Intensive Industry. Journal of Cleaner Production 2019, 210, 585–601. [Google Scholar] [CrossRef]
  2. Clairand, J.-M.; Briceno-Leon, M.; Escriva-Escriva, G.; Pantaleo, A. M. Review of Energy Efficiency Technologies in the Food Industry: Trends, Barriers, and Opportunities. IEEE Access 2020, 8, 48015–48029. [Google Scholar] [CrossRef]
  3. Panda, S.; Mohanty, S.; Rout, P. K.; Sahu, B. K.; Parida, S.; Samanta, I. S.; Bajaj, M.; Piecha, M.; Blazek, V.; Prokop, L. A Comprehensive Review on Demand Side Management and Market Design for Renewable Energy Support and Integration. Energy Reports 2023, 10, 2228–2250. [Google Scholar] [CrossRef]
  4. Siddiquee, S. M. S.; Howard, B.; Bruton, K.; Brem, A.; O’Sullivan, D. T. J. Progress in Demand Response and It’s Industrial Applications. Frontiers in Energy Research 2021, 9. [Google Scholar] [CrossRef]
  5. Morais, D.; Gaspar, P. D.; Silva, P. D.; Andrade, L. P.; Nunes, J. Energy Consumption and Efficiency Measures in the Portuguese Food Processing Industry. Journal of Food Processing and Preservation 2022, 46(8), e14862. [Google Scholar] [CrossRef]
  6. Koohi-Fayegh, S.; Rosen, M. A. A Review of Energy Storage Types, Applications and Recent Developments. Journal of Energy Storage 2020, 27, 101047. [Google Scholar] [CrossRef]
  7. Brok, N.; Green, T.; Heerup, C.; Oren, S. S.; Madsen, H. Optimal Operation of an Ice-Tank for a Supermarket Refrigeration System. Control Engineering Practice 2022, 119, 104973. [Google Scholar] [CrossRef]
  8. Hovgaard, T. G.; Larsen, L. F. S.; Edlund, K.; Jørgensen, J. B. Model Predictive Control Technologies for Efficient and Flexible Power Consumption in Refrigeration Systems. Energy 2012, 44(1), 105–116. [Google Scholar] [CrossRef]
  9. Hovgaard, T. G.; Larsen, L. F. S.; Skovrup, M. J.; Jørgensen, J. B. Optimal Energy Consumption in Refrigeration Systems - Modelling and Non-Convex Optimisation. The Canadian Journal of Chemical Engineering 2012, 90(6), 1426–1433. [Google Scholar] [CrossRef]
  10. Weerts, H. H. M.; Shafiei, S. E.; Stoustrup, J.; Izadi-Zamanabadi, R. Model-Based Predictive Control Scheme for Cost Optimization and Balancing Services for Supermarket Refrigeration Systems. IFAC Proceedings Volumes 2014, 47(3), 975–980. [Google Scholar] [CrossRef]
  11. Shafiei, S. E.; Knudsen, T.; Wisniewski, R.; Andersen, P. Data-driven Predictive Direct Load Control of Refrigeration Systems. IET Control Theory & Applications 2015, 9 (7), 1022–1033. [CrossRef]
  12. Shoreh, M. H.; Siano, P.; Shafie-khah, M.; Loia, V.; Catalão, J. P. S. A Survey of Industrial Applications of Demand Response. Electric Power Systems Research 2016, 141, 31–49. [Google Scholar] [CrossRef]
  13. Goli, S.; McKane, A.; Olsen, D. Demand Response Opportunities in Industrial Refrigerated Warehouses in California. ACEEE Summer Study on Energy Efficiency in Industry 2011, 1–14. Lawrence Berkeley National Laboratory. Retrieved from https://escholarship.
  14. Akerma, M.; Hoang, H. M.; Leducq, D.; Flinois, C.; Clain, P.; Delahaye, A. Demand Response in Refrigerated Warehouse. In 2018 IEEE International Smart Cities Conference (ISC2); 2018; pp 1–5. [CrossRef]
  15. Akerma, M.; Hoang, H.-M.; Leducq, D.; Delahaye, A. Experimental Characterization of Demand Response in a Refrigerated Cold Room. International Journal of Refrigeration 2020, 113, 256–265. [Google Scholar] [CrossRef]
  16. Ma, K.; Hu, G.; Spanos, C. J. A Cooperative Demand Response Scheme Using Punishment Mechanism and Application to Industrial Refrigerated Warehouses. IEEE Transactions on Industrial Informatics 2015, 11(6), 1520–1531. [Google Scholar] [CrossRef]
  17. Khorsandnejad, E.; Malzahn, R.; Oldenburg, A.-K.; Mittreiter, A.; Doetsch, C. Analysis of Flexibility Potential of a Cold Warehouse with Different Refrigeration Compressors. Energies 2024, 17(1), 85. [Google Scholar] [CrossRef]
  18. Akerma, M.; Hoang, H. M.; Leducq, D.; Flinois, C.; Clain, P.; Delahaye, A. Demand Response in Refrigerated Warehouse. In 2018 IEEE International Smart Cities Conference (ISC2), Kansas City, MO, USA, 2018; pp. 1–5. [CrossRef]
  19. Chen, C.; Sun, H.; Shen, X.; Guo, Y.; Guo, Q.; Xia, T. Two-Stage Robust Planning-Operation Co-Optimization of Energy Hub Considering Precise Energy Storage Economic Model. Applied Energy 2019, 252 (C), 1–1. [Google Scholar] [CrossRef]
  20. Giordano, L.; Furlan, G.; Puglisi, G.; Cancellara, F. A. Optimal Design of a Renewable Energy-Driven Polygeneration System: An Application in the Dairy Industry. Journal of Cleaner Production 2023, 405, 136933. [Google Scholar] [CrossRef]
  21. Pazmiño-Arias, A.; Briceño-León, M.; Clairand, J.-M.; Serrano-Guerrero, X.; Escrivá-Escrivá, G. Optimal Scheduling of a Dairy Industry Based on Energy Hub Considering Renewable Energy and Ice Storage. Journal of Cleaner Production 2023, 429, 139580. [Google Scholar] [CrossRef]
  22. Cirocco, L.; Pudney, P.; Riahi, S.; Liddle, R.; Semsarilar, H.; Hudson, J.; Bruno, F. Thermal Energy Storage for Industrial Thermal Loads and Electricity Demand Side Management. Energy Conversion and Management 2022, 270, 116190. [Google Scholar] [CrossRef]
  23. Saffari, M.; de Gracia, A.; Fernández, C.; Belusko, M.; Boer, D.; Cabeza, L. F. Optimized Demand Side Management (DSM) of Peak Electricity Demand by Coupling Low Temperature Thermal Energy Storage (TES) and Solar PV. Applied Energy 2018, 211, 604–616. [Google Scholar] [CrossRef]
  24. Wohlgenannt, P.; Huber, G.; Rheinberger, K.; Preißinger, M.; Kepplinger, P. Modelling of a Food Processing Plant for Industrial Demand Side Management. In HEAT POWERED CYCLES 2021 Conference Proceedings, Bilbao, Spain, 10–13 April. [CrossRef]
  25. Wohlgenannt, P.; Huber, G.; Rheinberger, K.; Kolhe, M.; Kepplinger, P. Comparison of Demand Response Strategies Using Active and Passive Thermal Energy Storage in a Food Processing Plant. Energy Reports 2024, 12, 226–236. [Google Scholar] [CrossRef]
  26. Zhang, Q.; Grossmann, I. E. Enterprise-Wide Optimization for Industrial Demand Side Management: Fundamentals, Advances, and Perspectives. Chemical Engineering Research and Design 2016, 116, 114–131. [Google Scholar] [CrossRef]
  27. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; Hassabis, D. Human-Level Control through Deep Reinforcement Learning. Nature 2015, 518(7540), 529–533. [Google Scholar] [CrossRef] [PubMed]
  28. Watkins, C. J. C. H.; Dayan, P. Q-Learning. Mach Learn 1992, 8(3), 279–292. [Google Scholar] [CrossRef]
  29. van Hasselt, H.; Guez, A.; Silver, D. Deep Reinforcement Learning with Double Q-Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100. [Google Scholar] [CrossRef]
  30. Vázquez-Canteli, J. R.; Nagy, Z. Reinforcement Learning for Demand Response: A Review of Algorithms and Modeling Techniques. Applied Energy 2019, 235, 1072–1089. [Google Scholar] [CrossRef]
  31. Yu, L.; Qin, S.; Zhang, M.; Shen, C.; Jiang, T.; Guan, X. A Review of Deep Reinforcement Learning for Smart Building Energy Management. IEEE Internet of Things Journal 2021, 8(15), 12046–12063. [Google Scholar] [CrossRef]
  32. Lazic, N.; Boutilier, C.; Lu, T.; Wong, E.; Roy, B.; Ryu, M.; Imwalle, G. Data Center Cooling Using Model-Predictive Control. In Advances in Neural Information Processing Systems; Curran Associates, Inc., 2018; Vol. 31. Available online: https://proceedings.neurips.cc/paper/2018/file/059fdcd96baeb75112f09fa1dcc740cc-Paper.pdf.
  33. Part 2: Kinds of RL Algorithms — Spinning Up documentation. Available online: https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html (accessed on day month year).
  34. Afroosheh, S.; Esapour, K.; Khorram-Nia, R.; Karimi, M. Reinforcement Learning Layout-Based Optimal Energy Management in Smart Home: AI-Based Approach. IET Generation, Transmission & Distribution 2024. [CrossRef]
  35. Lissa, P.; Deane, C.; Schukat, M.; Seri, F.; Keane, M.; Barrett, E. Deep Reinforcement Learning for Home Energy Management System Control. Energy and AI 2021, 3, 100043. [Google Scholar] [CrossRef]
  36. Liu, Y.; Zhang, D.; Gooi, H. B. Optimization Strategy Based on Deep Reinforcement Learning for Home Energy Management. CSEE Journal of Power and Energy Systems 2020, 6(3), 572–582. [Google Scholar] [CrossRef]
  37. Peirelinck, T.; Hermans, C.; Spiessens, F.; Deconinck, G. Double Q-Learning for Demand Response of an Electric Water Heater. In 2019 IEEE PES Innovative Smart Grid Technologies Europe (ISGT-Europe); 2019; pp 1–5. [CrossRef]
  38. Jiang, Z.; Risbeck, M. J.; Ramamurti, V.; Murugesan, S.; Amores, J.; Zhang, C.; Lee, Y. M.; Drees, K. H. Building HVAC Control with Reinforcement Learning for Reduction of Energy Cost and Demand Charge. Energy and Buildings 2021, 239, 110833. [Google Scholar] [CrossRef]
  39. Brandi, S.; Piscitelli, M. S.; Martellacci, M.; Capozzoli, A. Deep Reinforcement Learning to Optimise Indoor Temperature Control and Heating Energy Consumption in Buildings. Energy and Buildings 2020, 224, 110225. [Google Scholar] [CrossRef]
  40. Brandi, S.; Fiorentini, M.; Capozzoli, A. Comparison of Online and Offline Deep Reinforcement Learning with Model Predictive Control for Thermal Energy Management. Automation in Construction 2022, 135, 104128. [Google Scholar] [CrossRef]
  41. Coraci, D.; Brandi, S.; Capozzoli, A. Effective Pre-Training of a Deep Reinforcement Learning Agent by Means of Long Short-Term Memory Models for Thermal Energy Management in Buildings. Energy Conversion and Management 2023, 291, 117303. [Google Scholar] [CrossRef]
  42. Han, G.; Lee, S.; Lee, J.; Lee, K.; Bae, J. Deep-Learning- and Reinforcement-Learning-Based Profitable Strategy of a Grid-Level Energy Storage System for the Smart Grid. Journal of Energy Storage 2021, 41, 102868. [Google Scholar] [CrossRef]
  43. Lu, R.; Hong, S. H. Incentive-Based Demand Response for Smart Grid with Reinforcement Learning and Deep Neural Network. Applied Energy 2019, 236, 937–949. [Google Scholar] [CrossRef]
  44. Muriithi, G.; Chowdhury, S. Deep Q-Network Application for Optimal Energy Management in a Grid-Tied Solar PV-Battery Microgrid. The Journal of Engineering 2022, 2022(4), 422–441. [Google Scholar] [CrossRef]
  45. Brandi, S.; Coraci, D.; Borello, D.; Capozzoli, A. Energy Management of a Residential Heating System Through Deep Reinforcement Learning. In Sustainability in Energy and Buildings 2021; Littlewood, J. R., Howlett, R. J., Jain, L. C., Smart Innovation, Systems and Technologies, Eds.; Springer Nature Singapore: Singapore, 2022. [Google Scholar] [CrossRef]
  46. Brandi, S.; Gallo, A.; Capozzoli, A. A Predictive and Adaptive Control Strategy to Optimize the Management of Integrated Energy Systems in Buildings. Energy Reports 2022, 8, 1550–1567. [Google Scholar] [CrossRef]
  47. Silvestri, A.; Coraci, D.; Brandi, S.; Capozzoli, A.; Borkowski, E.; Köhler, J.; Wu, D.; Zeilinger, M. N.; Schlueter, A. Real Building Implementation of a Deep Reinforcement Learning Controller to Enhance Energy Efficiency and Indoor Temperature Control. Applied Energy 2024, 368. [Google Scholar] [CrossRef]
  48. Gao, G.; Li, J.; Wen, Y. DeepComfort: Energy-Efficient Thermal Comfort Control in Buildings Via Reinforcement Learning. IEEE Internet of Things Journal 2020, 7(9), 8472–8484. [Google Scholar] [CrossRef]
  49. Opalic, S. M.; Palumbo, F.; Goodwin, M.; Jiao, L.; Nielsen, H. K.; Kolhe, M. L. COST-WINNERS: COST Reduction WIth Neural NEtworks-Based Augmented Random Search for Simultaneous Thermal and Electrical Energy Storage Control. Journal of Energy Storage 2023, 72. [Google Scholar] [CrossRef]
  50. Azuatalam, D.; Lee, W.-L.; de Nijs, F.; Liebman, A. Reinforcement Learning for Whole-Building HVAC Control and Demand Response. Energy and AI 2020, 2, 100020. [Google Scholar] [CrossRef]
  51. Li, Z.; Sun, Z.; Meng, Q.; Wang, Y.; Li, Y. Reinforcement Learning of Room Temperature Set-Point of Thermal Storage Air-Conditioning System with Demand Response. Energy and Buildings 2022, 259, 111903. [Google Scholar] [CrossRef]
  52. DAY-AHEAD PREISE. Available online: https://markttransparenz.apg.at/de/markt/Markttransparenz/ Uebertragung/EXAA-Spotmarkt (accessed on day month year).
  53. Gymnasium Version 0.29.1. Available online: https://pypi.org/project/gymnasium/ (accessed on day month year).
  54. Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. arXiv 2019. [CrossRef]
  55. Pytorch Version 2.1.1. Available online: https://pytorch.org (accessed on day month year).
  56. Gurobi Version 11.0. Available online: https://www.gurobi.com (accessed on day month year).
Figure 1. Scheme of the processed food plant showing all energy flows considered.
Figure 2. Schematic of agent-environment interface in RL.
Figure 3. Results for load shifting over three consecutive example days, comparing RL and MILP with the reference scenario. (a) illustrates the refrigeration system's electrical power consumption. (b) shows the cooling hall temperature variations, where 0°C corresponds to a fully charged TES and 5°C represents an empty TES state. (c) displays the electricity price profile over the same period, illustrating the price-driven adjustments in system operation.
Figure 4. (a) shows the weekly energy savings via RL and MILP, (b) shows the relative weekly energy savings, and (c) the EXAA spot market price for the test period (May 2022 to April 2023).
Figure 5. EXAA spot market price from May 2020 to April 2023, where the testing period is shaded in grey.
Figure 6. RL training process showing the return of single episodes and the moving average with a window length of 100. The return is proportional to the negative energy costs; the absolute value is irrelevant and only chosen as an appropriate scale for the training process. Maximizing the negative energy costs is equivalent to minimizing the energy costs.
Figure 7. RL training process showing the cost reduction and runtime of different training period lengths.
Table 1. Model Parameters.
Parameter Value
$C_H$ 4360 MJ/K
$m_{building}$ 500 t
$m_{cheese}$ 1260 t
$c_{building}$ 480 J/(kg K)
$c_{cheese}$ 3270 J/(kg K)
$\beta$ 4.938
$P_{max}$ 202511 W
$P_{min}$ 0 W
$T_{ub}$ 5 °C
$T_{lb}$ 0 °C
$\Delta t$ 60 s
$k_P$ 500000 W/K
$k_I$ 2 W/(K s)
Table 2. RL Parameters.
Parameter Value
Training episodes 2000
Batch size 1250
Memory buffer size 10000
Update rate $\tau$ 0.005
Adam learning rate 1e-4
Initial exploration rate $\epsilon_{start}$ 0.9
End exploration rate $\epsilon_{end}$ 0.05
Exploration decay rate $d_\epsilon$ 2000
Discount factor $\gamma$ 0.999
Neural net layers 3
Layer 1 (50, 256), ReLU activation
Layer 2 (256, 256), ReLU activation
Layer 3 (256, 101), ReLU activation
Huber loss parameter $\delta$ 1
Table 3. Optimization results for one year comparing RL and MILP.
Optimization Costs (€) Savings (€) Relative savings (%) Specific costs (€/MWh)
RL 116831 24911 17.57 208.00
MILP 115301 26441 18.65 205.30
Reference 141742 - - 252.10