Preprint
Article

This version is not peer-reviewed.

Opponent-Aware Hierarchical Reinforcement Learning for High-Dimensional Day-Ahead Bidding of Virtual Power Plants

Submitted: 03 June 2026

Posted: 04 June 2026


Abstract
Virtual power plants (VPPs) are emerging as flexible market participants that aggregate distributed renewable generation, energy storage systems, controllable loads, and conventional units. In day-ahead electricity markets, a VPP must submit a 24-hour bidding trajectory before the operating day, which creates a high-dimensional continuous decision-making problem with strong temporal coupling and incomplete opponent information. Conventional single-layer reinforcement learning methods may suffer from unstable training, inefficient exploration, and large reward fluctuations when they learn such high-dimensional bidding strategies directly. To address these challenges, this paper proposes an opponent-aware hierarchical reinforcement learning framework for high-dimensional day-ahead VPP bidding. The proposed method decomposes the original 24-dimensional bidding action into two coordinated levels: an upper-level TD3 agent generates the daily base bidding trajectory, and a lower-level PPO agent performs period-wise bid refinement around the upper-level reference. To make better use of market-clearing feedback, a state-aware mechanism converts cleared-power degradation into structured learning signals. In addition, an opponent probability prediction model approximates the bounded-rational bidding behavior of competing participants, thereby providing a more stable competitive environment under incomplete information. Case studies on an IEEE 30-bus system with multiple VPPs and conventional generators demonstrate that the proposed TD3+PPO framework outperforms single-layer PPO and TD3 benchmarks. The proposed method achieves an average reward of 4.98, reduces the reward variance to 5×10⁻⁶, and reaches stable convergence within approximately 19,000 training steps. The results indicate that hierarchical action decomposition, state-aware feedback, and opponent probability modeling can improve the profitability, convergence stability, and strategic adaptability of VPP bidding in high-dimensional day-ahead electricity markets.
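The two-level action decomposition described in the abstract can be illustrated with a minimal Python sketch. It shows only how a 24-period base trajectory (the role of the upper-level agent) and a period-wise adjustment (the role of the lower-level agent) might be combined into one submitted bid vector. The additive form, the function name `compose_bids`, and the parameters `max_adjust`, `bid_floor`, and `bid_cap` are assumptions made for illustration; the abstract does not specify how the two levels are combined, and no learning is performed here.

```python
from typing import List, Sequence

HOURS = 24  # one bid per hourly period of the operating day


def compose_bids(
    base: Sequence[float],
    refinement: Sequence[float],
    max_adjust: float,
    bid_floor: float,
    bid_cap: float,
) -> List[float]:
    """Combine a base trajectory with bounded period-wise refinements.

    Assumed (hypothetical) composition rule:
        bid[t] = clip(base[t] + max_adjust * clip(refinement[t], -1, 1),
                      bid_floor, bid_cap)
    """
    if len(base) != HOURS or len(refinement) != HOURS:
        raise ValueError("base and refinement must each have 24 entries")

    bids = []
    for base_bid, raw_adjust in zip(base, refinement):
        # Keep the lower-level output inside [-1, 1] so it stays "around"
        # the upper-level reference rather than replacing it.
        adjust = max(-1.0, min(1.0, raw_adjust))
        bid = base_bid + max_adjust * adjust
        # Enforce the (assumed) market bid limits.
        bids.append(max(bid_floor, min(bid_cap, bid)))
    return bids


# Illustrative inputs, not data from the paper.
base = [30.0] * HOURS          # stand-in for the upper-level output
refinement = [0.0] * HOURS     # stand-in for the lower-level output
refinement[18] = 1.0           # raise the bid in one evening period
refinement[3] = -2.0           # out of range, clamped to -1.0

bids = compose_bids(base, refinement, max_adjust=5.0,
                    bid_floor=0.0, bid_cap=100.0)
```

Under this assumed rule, the lower level explores only a band of width `2 * max_adjust` around each base bid, which is one way to picture why refining around a reference could be easier to learn than searching the full 24-dimensional space directly.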
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.


© 2026 MDPI (Basel, Switzerland) unless otherwise stated