2. Meet The Players
The first player is the QLearningPlayer, a Reinforcement Learning model that makes decisions based on real-time Q-value updates. This model is fitting because Q-Learning allows the player to balance exploration (moving randomly) and exploitation (comparing Q-values) and eventually converge to an optimal policy (C or D) based on the reward given per round. Q-values are updated and calculated using the Bellman equation:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{b} Q(s', b) - Q(s, a) \right]$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the reward, $s'$ is the next state, and $b$ is the next action. In the implementation of the game, the player class records the current state with each reward and updates its Q-values after each move. $\alpha$ is set to 0.1, $\gamma$ to 0.9, and the exploration rate is set to 0.6 with a decay rate of 0.99 per round. These values are set somewhat arbitrarily but enable the model to learn and to decrease exploration as it is trained. The Q-values for the moves C and D are tracked in a HashMap with a double array, and the player collaborates or defects depending on which Q-value is higher.
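To make the update concrete, here is a minimal Java sketch of such a player under the parameters above; the method names (`chooseMove`, `update`) and overall structure are illustrative assumptions rather than the project's exact code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class QLearningPlayer {
    private final Map<String, double[]> qTable = new HashMap<>(); // state -> [Q(C), Q(D)]
    private final double alpha = 0.1;   // learning rate
    private final double gamma = 0.9;   // discount factor
    private double epsilon = 0.6;       // exploration rate, decays each round
    private final Random rng = new Random();

    /** Returns 0 for C or 1 for D, using epsilon-greedy action selection. */
    public int chooseMove(String state) {
        epsilon *= 0.99; // decay exploration every round
        double[] q = qTable.computeIfAbsent(state, s -> new double[2]);
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(2);       // explore: move randomly
        }
        return q[0] >= q[1] ? 0 : 1;     // exploit: higher Q-value wins
    }

    /** Q-learning (Bellman) update after observing the reward and the next state. */
    public void update(String state, int action, double reward, String nextState) {
        double[] q = qTable.computeIfAbsent(state, s -> new double[2]);
        double[] next = qTable.computeIfAbsent(nextState, s -> new double[2]);
        double maxNext = Math.max(next[0], next[1]);
        q[action] += alpha * (reward + gamma * maxNext - q[action]);
    }
}
```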
In the first model, the reward system accounts only for the player’s personal reward (values of 1, 3, or 5). This can lead the player to converge to defection to maximize its own reward of 5. To explore the difference between personal reward and cumulative reward, a second model called the JointPQLearningPlayer was created. This player considers the joint state of both players’ moves (e.g., CC if both players collaborated) as opposed to the traditional single-move state (C if the opponent collaborated). Additionally, the reward this model feeds into the Bellman update is the cumulative reward, which takes values of 5, 6, or 2 for the states CD or DC, CC, and DD respectively.
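The cumulative reward can be read straight off the joint state; a small helper along these lines (the class and method names are hypothetical) captures the values quoted above:

```java
public final class Payoffs {
    /** Cumulative reward for a joint state, matching the values above:
     *  CC -> 3 + 3 = 6, DD -> 1 + 1 = 2, CD or DC -> 5. */
    static double jointReward(String jointState) {
        switch (jointState) {
            case "CC": return 6.0; // mutual collaboration
            case "DD": return 2.0; // mutual defection
            default:   return 5.0; // CD or DC: one player exploits the other
        }
    }
}
```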
The next strategy is the BayesianInferencePlayer (BIP). This model estimates the probability of each of the opponent’s moves and uses those estimates to structure the BIP’s next move:

$$P(C) = \frac{n_C}{t}, \qquad P(D) = \frac{n_D}{t}$$

where $P(C)$ and $P(D)$ represent the BIP’s probabilities of C and D respectively, $n_C$ and $n_D$ represent the number of times the opponent has played C and D, and $t$ represents the total number of moves. For example, if the model predicts that there is an 80% chance that the opponent will play C, there is also an 80% chance that the BIP will play C. The values of $n_C$ and $n_D$ are initialized to 1 and $t$ to 2 (Laplace smoothing) to avoid dividing by zero.
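A minimal sketch of this player in Java, assuming the same counting scheme described above (the method names are illustrative):

```java
import java.util.Random;

public class BayesianInferencePlayer {
    // Laplace smoothing: start the counts at 1 and the total at 2 to avoid dividing by zero.
    private double nC = 1.0, nD = 1.0, t = 2.0;
    private final Random rng = new Random();

    /** Plays C with the same probability the opponent is predicted to play C. */
    public char chooseMove() {
        double pC = nC / t;
        return rng.nextDouble() < pC ? 'C' : 'D';
    }

    /** Record the opponent's last move to refine the estimate. */
    public void observeOpponent(char move) {
        if (move == 'C') nC++; else nD++;
        t++;
    }
}
```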
But what if we took the base idea behind the BIP a step further? The next strategy is the PatternLearningPlayer (PLP), which makes predictions based on the history of the opponent’s moves. This player uses a simple n-gram model for sequence prediction and is visually depicted in Figure 1.
With a history length of 3, the PLP moves randomly for the first 3 rounds to collect data on the opponent’s strategy. Each pattern is stored in a HashMap, and the model can then predict, for example, whether the opponent will play C or D after a history of CCD, based on the number of times the opponent’s next move has been C or D from that state. The player then plays whichever move it predicts the opponent will make.
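Here is a sketch of how such an n-gram lookup might work in Java; the sliding-window bookkeeping and tie-breaking rule are assumptions, since the original only specifies the HashMap of pattern counts:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class PatternLearningPlayer {
    private static final int HISTORY = 3;
    // Maps a 3-move opponent history (e.g. "CCD") to counts of the next move: [C, D].
    private final Map<String, int[]> counts = new HashMap<>();
    private final Deque<Character> history = new ArrayDeque<>();
    private final Random rng = new Random();

    public char chooseMove() {
        if (history.size() < HISTORY) {
            return rng.nextBoolean() ? 'C' : 'D'; // not enough data yet: move randomly
        }
        int[] c = counts.get(key());
        if (c == null || c[0] == c[1]) {
            return rng.nextBoolean() ? 'C' : 'D'; // unseen or tied pattern
        }
        return c[0] > c[1] ? 'C' : 'D';           // play the opponent's predicted move
    }

    public void observeOpponent(char move) {
        if (history.size() == HISTORY) {
            int[] c = counts.computeIfAbsent(key(), k -> new int[2]);
            c[move == 'C' ? 0 : 1]++;             // count what followed this pattern
            history.removeFirst();                // slide the window forward
        }
        history.addLast(move);
    }

    private String key() {
        StringBuilder sb = new StringBuilder();
        for (char m : history) sb.append(m);
        return sb.toString();
    }
}
```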
The next strategy is a game-theoretic strategy called TitForTat. This player simply repeats whatever move the opponent made in the previous round. The strategy is especially interesting because it can “train” other models by giving them a taste of their own medicine: a punishment if they previously defected and a reward if they previously cooperated. Additionally, a reverse Tit-for-Tat (revTitForTat) player was created that plays the opposite of the opponent’s previous move.
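The rule fits in a few lines; this sketch assumes the conventional cooperative opening move, which the description above does not spell out:

```java
public class TitForTatPlayer {
    private char lastOpponentMove = 'C'; // assumed: open with collaboration

    public char chooseMove() {
        return lastOpponentMove;         // mirror the opponent's previous move
        // revTitForTat would instead return the opposite move here.
    }

    public void observeOpponent(char move) {
        lastOpponentMove = move;
    }
}
```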
The next player is the Upper Confidence Bound (UCB) Player, which computes a bound for each move C or D and plays whichever move has the larger bound. This is a reinforcement learning algorithm that balances exploration and exploitation with the following equation:

$$UCB_t(a) = \bar{x}_a + \sqrt{\frac{2 \ln t}{n_a}}$$

where $\bar{x}_a$ is the mean reward for action $a$, $t$ is the total number of times all actions have been played, and $n_a$ is the number of times action $a$ has been played. The term $\bar{x}_a$ represents exploitation, while the confidence interval term $\sqrt{2 \ln t / n_a}$ represents exploration and ensures that less-played actions get the chance to be played. By inflating the confidence interval of less-played actions, the UCB player makes sure each action is explored enough before it settles on C or D, rather than converging prematurely to a sub-optimal choice.
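A sketch of the selection rule in Java, assuming the standard UCB1 form of the bound shown above (the exact constant inside the square root is an assumption, since the original formula was not reproduced):

```java
public class UCBPlayer {
    private final double[] totalReward = new double[2]; // index 0 = C, 1 = D
    private final int[] plays = new int[2];             // times each action was played
    private int t = 0;                                  // total plays across both actions

    /** Returns 0 for C or 1 for D, whichever has the larger upper confidence bound. */
    public int chooseMove() {
        for (int a = 0; a < 2; a++) {
            if (plays[a] == 0) return a;                // play each action once first
        }
        double best = Double.NEGATIVE_INFINITY;
        int bestAction = 0;
        for (int a = 0; a < 2; a++) {
            double mean = totalReward[a] / plays[a];    // exploitation term
            double bound = mean + Math.sqrt(2.0 * Math.log(t) / plays[a]); // + exploration
            if (bound > best) { best = bound; bestAction = a; }
        }
        return bestAction;
    }

    public void recordReward(int action, double reward) {
        totalReward[action] += reward;
        plays[action]++;
        t++;
    }
}
```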
Another game-theoretic player was added to test the limits of the models’ cooperation levels. The Grudge Player begins with collaboration, but once the opponent defects, the Grudge Player defects twice before going back to collaboration.
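One way to track the two-round grudge is a simple counter, as in this sketch; whether a fresh defection during the punishment resets the counter is an assumption the original leaves open:

```java
public class GrudgePlayer {
    private int punishmentsLeft = 0; // rounds of retaliation still owed

    public char chooseMove() {
        if (punishmentsLeft > 0) {
            punishmentsLeft--;
            return 'D';              // hold the grudge
        }
        return 'C';                  // otherwise collaborate
    }

    public void observeOpponent(char move) {
        if (move == 'D') punishmentsLeft = 2; // defect twice after a defection
    }
}
```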
The next player is the PatternPlayer. Rather than choosing C or D with a mathematical expression, this player moves according to a predetermined pattern. The numerical pattern is 01121220: on a zero the player plays C, on a one the player plays D, and when a two is encountered, those roles swap. One complete iteration of the pattern is sketched below.
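A minimal sketch of stepping through the pattern in Java; since the original does not spell out whether a 2 itself produces a move, this version assumes a 2 only flips the 0/1 mapping and the move comes from the next symbol:

```java
public class PatternPlayer {
    private static final int[] PATTERN = {0, 1, 1, 2, 1, 2, 2, 0};
    private int index = 0;
    private boolean swapped = false; // true after an odd number of 2s

    /** Steps through the pattern; 0 and 1 map to C and D (or D and C once swapped). */
    public char chooseMove() {
        int symbol = PATTERN[index];
        index = (index + 1) % PATTERN.length;
        if (symbol == 2) {
            swapped = !swapped;      // a 2 flips the meaning of 0 and 1...
            return chooseMove();     // ...and the move is read from the next symbol
        }
        boolean playC = (symbol == 0) != swapped;
        return playC ? 'C' : 'D';
    }
}
```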
Finally, a few other basic player models were created to diversify the player pool: a player that always collaborates, a player that always defects, a player that alternates between the two, and one that moves completely randomly.