2. Materials & Methods
In our refined approach to mastering the Ms. Pac-Man game, we synergize deep reinforcement learning with sophisticated optimization strategies. Central to our system is the DQN Agent, leveraging a Q-Network—a convolutional neural network adept at discerning the most advantageous actions in the game scenario. This is supported by the Replay Buffer, an integral feature that archives and re-examines previous gameplay, thereby ensuring a robust and progressive learning journey.
Our technique is further refined by combining the Snake Optimization Algorithm (SOA) with Energy Valley Optimization (EVO), two population-based metaheuristics that we merge using genetic-style operators (crossover and mutation). These methods are integrated to optimize critical hyperparameters, enhancing the overall efficiency of our system. With the merged method, which we call the Energy Serpent Optimizer (ESO), we optimize not only the hyperparameters but also the training loop: the DQN Agent parameters are updated continuously until the best ones are identified, as indicated by the reward feedback from the environment. The interplay of these elements yields a comprehensive solution to the challenges posed by the Ms. Pac-Man gaming environment. Each component, from the Q-Network and Replay Buffer to the ESO optimization, works in concert to refine our strategy, as elaborated in the subsequent sections.
Figure 1.
General Flowchart.
2.1. Setup and Environment Preparation
Building a strong foundation is critical to the success of our Ms. Pac-Man project. This section describes the preliminary steps taken to establish a reliable and effective setup. The procedure begins with installing the required libraries and packages. These not only allow our procedures to run smoothly but also provide utility functions needed in the later phases of the experiment. Using the most recent versions of these tools helps ensure compatibility and efficiency.
Next, we used OpenAI's Gym toolkit to set up the Ms. Pac-Man game environment. In reinforcement learning, Gym is well known for its standardized interface, which makes it easier to interact with the game, process observations, and execute actions. This standardization is key to ensuring that our experiments are replicable and align with broader research efforts. Following this, we defined the key parameters of our experiment, including the action space (possible movements for Ms. Pac-Man), the observation space (how the game’s state is represented), and the reward structure. These parameters are critical because they inform the learning process of our DQN agent, guiding its decision-making and strategy development.
In sum, the setup and environment preparation stage lays the groundwork for our experiment. This preparation ensures that the later stages, including the training of the Q-Network and the application of the Snake Optimization Algorithm for hyperparameter tuning, are built on a well-defined and robust platform.
2.2. Design and Functionality of the Q-Network
At the core of our approach to mastering Ms. Pac-Man is the Q-Network, a CNN specifically designed for this task. Because the Q-Network is responsible for computing Q-values for various game states, it is an essential component of our reinforcement learning approach. These Q-values represent the expected future rewards for taking particular actions in given states, and estimating them accurately is essential for directing the agent toward actions that maximize reward over time.
Our Q-Network is tailored to the specific requirements of Atari-style games like Ms. Pac-Man. It uses convolutional layers to effectively interpret the pixel-based visual inputs of the game. These layers excel in identifying and understanding spatial relationships and patterns, such as the maze layout, ghost positions, and pellet locations in Ms. Pac-Man. The network processes this visual information and feeds it through additional layers, culminating in a dense layer that produces Q-values for every possible action available to the agent.
Training the Q-Network involves using a loss function, typically Mean Squared Error (MSE), to measure the difference between the predicted and target Q-values. These target values are calculated using the Bellman equation, a cornerstone concept in reinforcement learning that relates current Q-values to future rewards and the maximal Q-values of subsequent states. The learning process of the Q-Network is dynamic and ongoing. As the agent interacts with the game environment and gathers new experiences, the network is continuously updated, refining its Q-value predictions and enhancing decision-making capabilities.
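The relationship between the Bellman target and the MSE loss can be illustrated with a small worked example. This is a minimal sketch in plain Python rather than the authors' implementation; the function names, the discount value, and the numbers are illustrative assumptions.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """Bellman target: r + gamma * max_a' Q(s', a').

    When the episode has ended there is no next state to bootstrap from,
    so the target is just the immediate reward.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def mse_loss(predicted, targets):
    """Mean squared error between predicted and target Q-values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(predicted)


# Two made-up transitions: one mid-game, one that ends the episode.
targets = [
    td_target(reward=10.0, next_q_values=[1.0, 2.0, 0.5], done=False, gamma=0.9),
    td_target(reward=-5.0, next_q_values=[3.0, 4.0, 1.0], done=True, gamma=0.9),
]
predicted = [11.0, -4.0]  # hypothetical Q-Network outputs for the actions taken
loss = mse_loss(predicted, targets)
```

Here the first target is 10 + 0.9 × 2 = 11.8 and the second is simply -5, so the loss is (0.8² + 1²) / 2 = 0.82.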
Delving deeper into the Q-Network’s structure, we have devised an architecture that aligns with the demands of Ms. Pac-Man. The network consists of two primary sections: a convolutional segment and a fully connected (dense) segment. The convolutional part includes three layers. The first layer processes raw pixel inputs using 32 filters of size 8x8 and a stride of 4, followed by a Rectified Linear Unit (ReLU) activation function for non-linearity. The second layer uses 64 filters of size 4x4 with a stride of 2, and the third layer employs 128 filters of size 3x3 with a stride of 1, each accompanied by a ReLU activation. These layers adeptly capture complex spatial patterns and hierarchies in the game’s visuals.
Following the convolutional stage, the processed data is flattened and directed to the dense segment. This begins with a linear layer connecting the convolutional output to 512 neurons, with a subsequent ReLU activation. A dropout layer with a 0.5 probability is included to prevent overfitting and add regularization. The information then passes through 256 neurons and finally to a linear layer that outputs the Q-values for each action available to the agent in the game.
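To make the hand-off from the convolutional segment to the 512-neuron layer concrete, the flattened feature count can be derived from the layer parameters given above. The sketch below assumes an 84x84 input frame and no padding; the text states neither, so both are assumptions made only for illustration.

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (filters, kernel, stride) for the three convolutional layers described above.
CONV_LAYERS = [(32, 8, 4), (64, 4, 2), (128, 3, 1)]


def flattened_features(input_size=84):
    """Number of features entering the first dense layer.

    input_size=84 is an assumed preprocessing choice, not a value
    stated in the text.
    """
    size = input_size
    channels = 0
    for channels, kernel, stride in CONV_LAYERS:
        size = conv_out(size, kernel, stride)
    # After the loop, `channels` holds the filter count of the last layer.
    return channels * size * size


features = flattened_features()
```

Under these assumptions the spatial size shrinks from 84 to 20, then 9, then 7, so the first linear layer would map 128 × 7 × 7 = 6272 features to 512 neurons. A different input resolution changes this number.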
This architecture, with its blend of convolutional and dense layers, provides a comprehensive system to process visual inputs, discern complex patterns within the game, and convert these insights into actionable strategies for optimal game performance.
2.3. Implementing Experience Replay for Stable Learning
Experience replay stands as a critical component in our reinforcement learning strategy, especially in addressing challenges like temporal correlations and the evolving nature of data in such environments. This approach deviates from the traditional method of learning directly from consecutive experiences, which could lead to correlated data and unstable learning paths. The core concept of experience replay involves storing individual experiences, or transitions, and later revisiting them randomly. These transitions are composed of tuples containing the current state, the action taken, the resultant reward, the following state, and an indicator of whether the game concluded after the action. These tuples are then stored in a Replay Buffer, a repository that serves as a memory bank, continuously filled as the agent interacts with the game.
When it’s time for the agent to learn or update its Q-Network, it doesn’t solely depend on recent experiences. Instead, it randomly selects a batch of experiences from the Replay Buffer. This approach helps in breaking the chain of closely correlated experiences and incorporates older but valuable memories into the learning process. The result is a more stable and effective learning trajectory.
The Replay Buffer also allows for the repeated utilization of data, enabling the agent to learn multiple times from a single experience. This feature is particularly beneficial in complex environments like Ms. Pac-Man, where some crucial but infrequent experiences hold significant learning potential. In essence, experience replay, facilitated through the Replay Buffer, introduces a level of randomness and diversity to the learning process. This approach effectively counters issues such as temporal correlation, enriching the agent’s learning with a wide array of past interactions. This mechanism, in conjunction with the Q-Network, forms a solid foundation for our deep reinforcement learning methodology in mastering Ms. Pac-Man.
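The storage and random sampling described in this section can be sketched with the standard library alone. The class and field names below are illustrative, and the fixed capacity with oldest-first eviction is an assumption, since the text does not say how the buffer is bounded.

```python
import random
from collections import deque, namedtuple

# One stored experience: the five-element tuple described in the text.
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once capacity is reached
        # (assumed eviction policy).
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        # Raises ValueError if fewer than batch_size transitions are stored.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```

A training step would typically wait until `len(buffer)` is at least the batch size before calling `sample`.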
2.4. Role and Functions of the DQN Agent
The DQN Agent is pivotal in orchestrating the complex interactions between the agent and the Ms. Pac-Man game environment. It carries the dual responsibility of deciding the agent’s actions and training the Q-Network based on the outcomes of these actions. Action selection within the DQN Agent operates on a principle balancing exploration and exploitation. Initially, when the agent’s knowledge of the environment is limited, and its Q-values are unrefined, the agent emphasizes exploration. This is typically managed through an epsilon-greedy strategy. The agent randomly selects actions with a probability defined by epsilon to explore the environment and relies on the highest Q-value actions for the remaining decisions, exploiting its current knowledge. As the agent’s familiarity with the environment improves, and the reliability of its Q-values increases, epsilon is progressively reduced, tilting the balance towards exploitation.
Following action execution, the DQN Agent plays a crucial role in training the Q-Network. It leverages the Replay Buffer, sampling a random batch of experiences to compute target Q-values based on the rewards received and the projected Q-values of subsequent states. The goal is to minimize the gap between these target Q-values and those predicted by the Q-Network, usually using gradient descent or its variants as the optimization method. This ongoing cycle of action selection, experience accumulation, and network updating constitutes the learning loop that gradually enhances the agent’s game strategy and understanding.
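The epsilon-greedy rule and its gradual reduction can be sketched as follows. The multiplicative decay schedule and the constants 0.995 and 0.01 are assumptions for illustration; the text says only that epsilon is progressively reduced.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: a random action with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, decay=0.995, epsilon_min=0.01):
    """Multiplicative decay with a floor (schedule and constants are assumed)."""
    return max(epsilon_min, epsilon * decay)
```

With `epsilon=0.0` the function always returns the greedy action, which corresponds to the purely exploitative mode used for evaluation.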
A key element in this process is the periodic update of target Q-values. Rather than continuously adjusting them, which could lead to instability, these updates are spaced out to ensure a more consistent and stable learning path.
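One common way to space out these updates is to keep a frozen copy of the network weights for computing targets and to refresh it every fixed number of training steps. The sketch below assumes that mechanism; the class name, the `interval` parameter, and the dictionary stand-in for network weights are hypothetical, and the text does not state the update interval used.

```python
import copy


class TargetSync:
    """Keeps a frozen copy of the online weights, refreshed every `interval` steps."""

    def __init__(self, online_weights, interval):
        self.interval = interval
        self.steps = 0
        # Deep copy so later changes to the online weights do not leak through.
        self.target_weights = copy.deepcopy(online_weights)

    def step(self, online_weights):
        """Call once per training step; returns True when the copy was refreshed."""
        self.steps += 1
        if self.steps % self.interval == 0:
            self.target_weights = copy.deepcopy(online_weights)
            return True
        return False
```

Between refreshes, targets are computed from `target_weights`, so they stay fixed while the online network is trained.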
In summary, the DQN Agent is central to our approach, effectively bridging exploration and learning. It adeptly navigates the Ms. Pac-Man environment, adapting its strategies and continually refining the Q-Network. This dynamic interplay of actions, feedback, and learning equips the agent with the necessary capabilities to master the game environment.
2.5. Execution and Maintenance of the Training Loop
The training loop marks the dynamic phase where our reinforcement learning strategy is actualized. In this iterative process, the DQN Agent interacts with the Ms. Pac-Man game environment, making decisions, assimilating feedback, storing experiences, and iteratively improving its decision-making capabilities by updating the Q-Network. This section explores the intricate aspects and workflow of the training loop.
This loop encompasses numerous episodes, each representing a full game of Ms. Pac-Man from start to end. At the beginning of each episode, the game environment is reset, presenting a new challenge for the agent. During the episode, the agent observes the current state, decides on an action influenced by its epsilon-greedy policy, and implements that action in the game. Subsequently, the game environment transitions to a new state and awards a reward based on the action’s effectiveness. These elements – the state transition, chosen action, received reward, and the game’s end status – are recorded as an experience in the Replay Buffer.
Once a sufficient number of experiences are collected, the agent utilizes a randomly selected batch from the Replay Buffer for Q-Network training. This training involves adjusting the network’s weights to reduce the discrepancy between the predicted Q-values and the target Q-values. This feedback-driven approach ensures that the agent continuously hones its strategy for progressively improved performance.
A crucial aspect of the training loop is the conservation of the Q-Network’s weights. Given that the agent may go through countless iterations, it’s vital to periodically save these weights. These saved checkpoints serve several purposes. Firstly, they provide a fallback in case of disruptions, preventing complete loss of progress. Secondly, they allow for periodic assessment of the agent’s performance, offering insights into its learning progression. Finally, saved weights can be used for transferring learning to similar tasks or for further refinement in future tasks.
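Periodic checkpointing might look like the sketch below. It uses `pickle` and a plain dictionary as a stand-in for network weights; a deep learning framework would normally supply its own serialization, and the file naming scheme and helper names here are assumptions.

```python
import os
import pickle
import tempfile


def save_checkpoint(weights, episode, directory):
    """Write a checkpoint atomically so an interrupted save cannot corrupt it."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, f"checkpoint_{episode:06d}.pkl")
    # Write to a temporary file first, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump({"episode": episode, "weights": weights}, handle)
    os.replace(tmp_path, path)
    return path


def load_checkpoint(path):
    """Restore the dictionary written by save_checkpoint."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Including the episode number in the file name keeps earlier checkpoints available, which supports the periodic performance assessment mentioned above.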
In essence, the training loop is where the agent’s interaction with the environment and consequent learning converts into practical gameplay strategy. Through ongoing engagement, feedback analysis, and adaptive learning, the agent transitions from a novice to an adept Ms. Pac-Man player, all while ensuring that its accumulated knowledge is regularly saved for future use and evaluation.
2.6. Performance Evaluation of the Agent
Following the intensive training loop, assessing the agent’s performance is crucial. This evaluation not only verifies the agent’s proficiency in the Ms. Pac-Man environment but also identifies potential areas for improvement. The evaluation process is structured to precisely measure the agent’s effectiveness.
The first step in this evaluation is to switch the agent to a purely exploitative mode, which involves disabling the random action selection feature (i.e., setting epsilon to zero in the epsilon-greedy policy). This mode forces the agent to solely depend on its acquired knowledge, choosing actions based entirely on the Q-values provided by the Q-Network. This approach offers an accurate representation of the agent’s learning and its application skills.
In this phase, the agent engages in a set number of episodes, mirroring the training process but with two key distinctions. First, there is no learning or adjustment of the Q-Network’s weights based on the agent’s actions. Second, detailed records are kept of every action, state transition, and reward received. The primary indicator of the agent’s performance is the total cumulative reward accumulated in each episode.
Nevertheless, relying on a single metric might not fully encapsulate the agent’s abilities. Therefore, additional metrics are considered. These could include the number of levels completed, the average number of ghosts eaten per episode, or the frequency of bonus fruit captures. These additional metrics offer a more nuanced view of the agent’s strategic gameplay. Given the inherent randomness in games like Ms. Pac-Man, where ghost behaviors and fruit appearances vary, it’s important to average the agent’s performance across multiple episodes. This approach smooths out random fluctuations and provides a more consistent measure of the agent’s true capabilities.
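Averaging over episodes can be sketched with the standard library. The function name is illustrative and the scores are made-up placeholders, not results from this study; reporting the standard deviation alongside the mean is an added suggestion rather than something the text specifies.

```python
from statistics import mean, stdev


def summarize_episodes(episode_rewards):
    """Mean and sample standard deviation of per-episode total rewards."""
    # stdev needs at least two data points; report 0.0 for a single episode.
    spread = stdev(episode_rewards) if len(episode_rewards) > 1 else 0.0
    return {
        "episodes": len(episode_rewards),
        "mean": mean(episode_rewards),
        "std": spread,
    }


# Placeholder scores from five hypothetical greedy (epsilon = 0) episodes.
summary = summarize_episodes([100.0, 200.0, 300.0, 400.0, 500.0])
```

The spread indicates how much of the difference between two agents could be explained by episode-to-episode randomness alone.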
During the evaluation stage, visual aids frequently support the quantitative analysis. Heatmaps showing the agent's most-traveled paths through the maze or graphs showing the trajectory of accumulated rewards over episodes can provide important visual insights on the agent's strategic routines and behaviors.
2.7. Visualizing the Agent’s Gameplay
Seeing our reinforcement learning agent's gameplay in Ms. Pac-Man provides a clearer understanding of our model's efficacy. Comparable to seeing a human player navigate a maze, this visual method offers an interesting and thorough narrative of the agent's tactics, difficulties, and opportunities for development.
Through its interactions with the Ms. Pac-Man environment, the agent reveals more than simple maze navigation: its learned habits, threat assessments, and pivotal decision-making moments all become visible. This visual representation lets us analyze and understand the agent's path in detail.
To do this, we created a simulated screen that records the agent playing in real time, capturing every move, narrow escape, and power pellet use. This visual record serves several important functions. First, it is a direct, easy-to-use method of validation. It allows researchers, developers, enthusiasts, and other stakeholders to see the results of the training process in a dynamic and tangible way. It answers critical questions about the agent’s strategy: How effectively does it navigate through tight spots? Does it use power pellets strategically to pursue ghosts, or does it focus on clearing the maze? How does it respond to the sudden appearance of bonus fruits?
Additionally, this visual representation aids in debugging and fine-tuning the agent’s behavior. Anomalies or patterns of suboptimal decisions, which might be less apparent in numerical data, become immediately obvious in a visual format. This can lead to quicker and more intuitive adjustments and improvements.
Finally, these visual recordings have significant outreach potential. They can be shared with the broader community, included in presentations, or utilized in educational settings to demonstrate the principles of deep reinforcement learning in a relatable and engaging way.
In conclusion, visualizing the agent’s gameplay not only brings the abstract elements of deep reinforcement learning to life but also serves as a critical tool for validation, refinement, and education. It provides a vivid depiction of the agent’s abilities, offers valuable insights for improvements, and underscores the effectiveness of deep reinforcement learning in navigating complex environments like Ms. Pac-Man.
2.8. Proposed Optimization Algorithm for Hyperparameter Tuning
In our innovative framework, we synergize the robust methodologies of the Snake Optimization Algorithm (SOA) and the Energy Valley Optimization (EVO) into a unified approach, aptly named the Energy Serpent Optimizer (ESO). By leveraging these two algorithms, we aim to finely tune key hyperparameters, especially the learning rate (lr) and the discount factor (gamma), which are crucial for the DQN Agent’s learning efficacy and strategic foresight.
Snake Optimization (SO, referred to in this paper as SOA) is an intelligent optimization algorithm conceptualized by Hashim et al., which draws inspiration from the natural behaviors of snakes, particularly their feeding, fighting, and mating patterns. The uniqueness of this algorithm lies in its emulation of the complex survival strategies of snakes. Distinct from other metaheuristic algorithms, SOA divides the population into males and females. It starts from randomly generated populations and incorporates the influence of temperature, a critical factor for cold-blooded animals like snakes, on feeding and mating behavior. The algorithm operates in two phases: the exploration phase, where snakes search randomly for food in an environment with inadequate food supply, and the exploitation phase, where enough food is available to guide the search behavior. Both phases are mathematically modeled, with specific equations governing the positions of male and female individuals in each phase. The algorithm also features fight and mating modes, triggered by environmental conditions, particularly temperature, which add to its complexity and efficacy in optimization.
Energy Valley Optimization, on the other hand, is inspired by the principles of particle physics, particularly the behavior of subatomic particles. It revolves around the concept of stability and decay of particles. In the universe, most particles are unstable, tending to emit energy and transform into more stable forms. This optimization method focuses on the concept of an 'energy valley,' which is a metaphorical representation of the state where particles are in their most stable form, bound by optimal levels of neutrons (N) and protons (Z). The principle underlying this optimization technique is that particles aim to increase their stability by adjusting their N/Z ratio, moving towards this energy valley or stability band. This concept is particularly crucial for understanding the stability of heavier particles, which require a higher N/Z ratio for stability. The optimization process in Energy Valley Optimization mimics this natural tendency of particles, leveraging the idea of stability and transformation to guide the search towards optimal solutions in a given problem space. It's an innovative approach that applies the fundamental aspects of particle physics to the realm of algorithmic optimization.
The implementation begins with the SOA initializing a diverse population of 'snakes', each representing a unique set of hyperparameters. These snakes navigate a metaphorical landscape that mirrors the DQN Agent's performance under varied hyperparameter configurations. Concurrently, the EVO introduces a separate population that undergoes a similar evaluation, where each individual's performance in managing the game character is assessed.
The next phase involves the fusion of SOA and EVO methodologies, where the top-performing snakes from the SOA are combined with the leading individuals from the EVO population. This integration allows for the crossover of robust hyperparameters, creating offspring that are potentially superior to their predecessors. Mutation is applied to both populations, introducing variability and ensuring a broad search across the hyperparameter space.
As the evolutionary process progresses across generations, both SOA and EVO work in tandem to refine the hyperparameters towards an optimal set that promises enhanced performance in the Ms. Pac-Man environment. This iterative process continues until a convergence criterion is met or a predetermined number of generations have passed.
The optimized hyperparameters, a product of this dual-algorithm approach, are then used to configure the DQN Agent. The agent undergoes rigorous training within the game environment, with its performance tracked and optimized over numerous episodes. To validate the effectiveness of the ESO-tuned hyperparameters, an extensive evaluation is conducted. This involves not only a quantitative assessment of the rewards but also a qualitative analysis through visualized gameplay, offering insights into the agent’s decision-making and strategic gameplay. The detailed steps are shown in Figure 2.
Figure 2.
Energy Serpent Optimizer pseudocode.
The Energy Serpent Optimizer (ESO), as shown in the Figure 3 pseudocode, is an evolutionary algorithm tailored for optimizing the learning rate and discount factor in reinforcement learning models. It commences with establishing a virtual environment where a population of serpents, each representing a unique set of hyperparameters, is initialized. As the algorithm progresses through generations, each serpent is evaluated on how effectively its hyperparameters perform in the given environment, with performance typically gauged by the agent's reward accumulation.
Serpents are ranked by their fitness, and the ones with superior hyperparameter configurations are identified as the most fit. These leading serpents then undergo a breeding process where genetic operations such as crossover and mutation are applied. Crossover involves blending the hyperparameters of two parent serpents to generate offspring, while mutation introduces random changes to these offspring, ensuring diversity and aiding in the exploration of the hyperparameter space.
This process of evaluation, selection, breeding, and mutation continues over a series of generations, steadily honing the hyperparameters. The algorithm seeks to replace less fit serpents with more promising offspring, iteratively pushing the entire population towards optimal hyperparameter combinations.
The ESO concludes its run after a pre-defined number of generations, at which point it presents the optimal learning rate and discount factor, reflecting the most effective hyperparameter set discovered. This culmination represents a refined approach to training reinforcement learning agents, ensuring they are primed for maximum efficiency in complex decision-making environments.
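The evaluate, rank, breed, and mutate cycle described above can be sketched as a generic evolutionary loop over (learning rate, discount factor) pairs. This is not the authors' pseudocode: it omits the temperature and food mechanics of SOA and the stability mechanics of EVO, and the search bounds, population size, mutation scales, and fitness function are all assumptions. In particular, `toy_fitness` is a stand-in for the mean reward of a DQN Agent trained with the given hyperparameters.

```python
import math
import random

LR_BOUNDS = (1e-5, 1e-2)      # assumed search range for the learning rate
GAMMA_BOUNDS = (0.90, 0.999)  # assumed search range for the discount factor


def clip(value, bounds):
    return min(max(value, bounds[0]), bounds[1])


def random_serpent(rng):
    # Learning rate sampled log-uniformly, gamma uniformly.
    lr = 10 ** rng.uniform(math.log10(LR_BOUNDS[0]), math.log10(LR_BOUNDS[1]))
    return (clip(lr, LR_BOUNDS), rng.uniform(*GAMMA_BOUNDS))


def crossover(a, b, rng):
    # Blend each hyperparameter of the two parents with a random weight.
    w = rng.random()
    return (w * a[0] + (1 - w) * b[0], w * a[1] + (1 - w) * b[1])


def mutate(serpent, rng, rate=0.2):
    # Randomly perturb each hyperparameter, then clip back into bounds.
    lr, gamma = serpent
    if rng.random() < rate:
        lr *= 10 ** rng.gauss(0.0, 0.2)
    if rng.random() < rate:
        gamma += rng.gauss(0.0, 0.01)
    return (clip(lr, LR_BOUNDS), clip(gamma, GAMMA_BOUNDS))


def evolve(fitness, pop_size=20, generations=30, elite=5, seed=0):
    rng = random.Random(seed)
    population = [random_serpent(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Rank by fitness and keep the fittest serpents as parents.
        parents = sorted(population, key=fitness, reverse=True)[:elite]
        # Replace the less fit serpents with mutated offspring.
        children = [
            mutate(crossover(*rng.sample(parents, 2), rng), rng)
            for _ in range(pop_size - elite)
        ]
        population = parents + children
    return max(population, key=fitness)


def toy_fitness(serpent):
    # Stand-in objective peaking at lr = 1e-3, gamma = 0.99 (arbitrary choice).
    lr, gamma = serpent
    return -((math.log10(lr) + 3.0) ** 2) - ((gamma - 0.99) * 100) ** 2


best_lr, best_gamma = evolve(toy_fitness)
```

In a real run each fitness evaluation would require training and evaluating an agent, so the population size and generation count would be constrained by the available compute.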
In summary, the combined ESO approach for hyperparameter tuning is a dynamic and exploratory journey that strategically adjusts the parameters critical to the learning process. This synergy ensures that the DQN Agent is equipped with the best possible strategy, enabling it to navigate the Ms. Pac-Man maze with enhanced proficiency and intelligence.