Preprint Article

Learning Optimal Strategies in a Duel Game

Submitted: 09 November 2024; Posted: 11 November 2024


Abstract
We study a duel game in which each player has partial knowledge of the game parameters. We present a method by which, in the course of repeated plays, each player estimates the missing parameters and consequently learns his optimal strategy.

Introduction

We study a duel game in which certain game parameters (related to the players’ kill probabilities) are unknown and present a method by which each player can estimate the opponent’s parameters and consequently learn his optimal strategy.
The duel game with which we are concerned has been presented in [1] and a similar one in [2]. The game is a variation of duels and, more generally, games of timing studied in the literature [3,4,5].
The paper is organized as follows. In Section 1 we present the rules of the game. In Section 2 we solve the game under the assumption of complete information. In Section 3 we present an algorithm for solving the game when the players have incomplete information. In Section 4 we evaluate the algorithm by numerical experiments. Finally, in Section 5 we summarize our results and present our conclusions.

1. Game Description

The duel with which we are concerned is played between players $P_1$ and $P_2$, under the following rules.
  • It is played in discrete rounds (time steps) $t = 1, 2, \ldots$.
  • In the first turn, the players are at distance D.
  • $P_1$ (resp. $P_2$) plays on odd (resp. even) rounds.
  • On his turn, each player has two choices: (i) he can shoot his opponent or (ii) he can move one step forward, reducing their distance by one.
  • If $P_n$ shoots, he has a kill probability $p_n(d)$ of hitting (and killing) his opponent, where $d$ is their current distance. If he misses, the opponent can walk right next to him and shoot him for a certain kill.
  • Each player's payoff is $+1$ if he kills the opponent and $-1$ if he shoots and misses (in which case he is certain to be killed).
For $n \in \{1, 2\}$, we will denote by $x_n(t)$ the position of $P_n$ at round $t$. The starting positions are $x_1(0) = 0$ and $x_2(0) = D$, with $D = 2N$, $N \in \mathbb{N}$. The distance between the players at time $t$ is
$$d = |x_1(t) - x_2(t)|.$$
For $n \in \{1, 2\}$, the kill probability is a decreasing function $p_n : \{1, 2, \ldots, D\} \to [0, 1]$ with $p_n(1) = 1$. It is convenient to describe the kill probabilities as vectors:
$$p_n = (p_{n,1}, \ldots, p_{n,D}) = (p_n(1), \ldots, p_n(D)).$$
This duel can be modeled as an extensive-form game or tree game. The game tree is a directed graph $G = (V, E)$ with vertex set
$$V = \{1, 2, \ldots, 2D\},$$
where
  • the vertex $v = d \in \{1, 2, \ldots, D\}$ corresponds to a game state in which the players are at distance $d$, and
  • the vertex $v = d + D \in \{D+1, D+2, \ldots, 2D\}$ is a terminal vertex, in which the "active" player has fired at his opponent.
The edges correspond to state transitions; it is easy to see that the edge set is
$$E = \{(1, D+1), (2, 1), (2, D+2), (3, 2), (3, D+3), \ldots, (D, D-1), (D, 2D)\}.$$
An example of the game tree, for $D = 6$, appears in Figure 1. The circular (resp. square) vertices are the ones in which $P_1$ (resp. $P_2$) is active, and the rhombic vertices are the terminal ones.
To complete the description of the game, we will define the expected payoff for the terminal vertices. Note that the terminal vertex $d + D$ is the child of the nonterminal vertex $d$, in which:
  • The distance of the players is $d$ and, assuming $P_n$ to be the active player, his probability of hitting his opponent is $p_{n,d}$.
  • The active player is $P_1$ (resp. $P_2$) iff $d$ is even (resp. odd).
Keeping the above in mind, we see that the payoff (to $P_1$) of vertex $d + D$ is
$$\forall d \in \{1, \ldots, D\}: \quad Q(d+D) = \begin{cases} p_{1,d} \cdot 1 + (1 - p_{1,d}) \cdot (-1) = 2p_{1,d} - 1 & \text{when } d \text{ is even}; \\ p_{2,d} \cdot (-1) + (1 - p_{2,d}) \cdot 1 = 1 - 2p_{2,d} & \text{when } d \text{ is odd}. \end{cases}$$
The payoff to $P_2$ at vertex $d + D$ is $-Q(d+D)$. This completes the description of the duel.

2. Solution with Complete Information

It is easy to solve the above duel when each player knows $D$ and both $p_1$ and $p_2$. We construct the game tree as described in Section 1 and solve it by backward induction. Since the method is standard, we simply give an example of its application. Suppose that
$$\forall n, d: \quad p_{n,d} = \begin{cases} 1 & \text{when } d = 1, \\ \min\left(1, \dfrac{c_n}{d^{k_n}}\right) & \text{when } d > 1. \end{cases}$$
We take $c_1 = 1$, $k_1 = 1$, $c_2 = 1$, $k_2 = \frac{1}{2}$. The resulting kill probabilities are listed in Table 1.
The game tree with terminal payoffs is illustrated in Figure 2.
By the standard backward induction procedure we select the optimal action at each vertex and also compute the values of the nonterminal vertices. These are indicated in Figure 3 (optimal actions correspond to thick edges). We see that the game value is $-0.1547$, attained by $P_2$ shooting when the players are at distance 3 (which happens in round 4).
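To make the backward-induction computation concrete, here is a minimal Python sketch for this example. The function names and code organization are ours (not from the paper); the recursion simply compares the payoff of firing at distance $d$ with the value of the child vertex at distance $d-1$, maximizing for $P_1$ and minimizing for $P_2$.

```python
import numpy as np

def kill_prob(c, k, D):
    """Kill probability vector p_d = min(1, c / d**k) for d = 1..D (so p_1 = 1)."""
    d = np.arange(1, D + 1)
    return np.minimum(1.0, c / d**k)

def solve_duel(p1, p2, D):
    """Backward induction over distances d = 1..D.
    V[d]     = value (payoff to P1) of the nonterminal vertex at distance d.
    shoot[d] = True if the active player's optimal action at distance d is to shoot."""
    V = np.zeros(D + 1)                 # index 0 unused
    shoot = np.zeros(D + 1, dtype=bool)
    for d in range(1, D + 1):
        if d % 2 == 0:                  # d even: P1 is active and maximizes
            Q_shoot = 2 * p1[d - 1] - 1
            V[d] = max(Q_shoot, V[d - 1])
            shoot[d] = Q_shoot >= V[d - 1]
        else:                           # d odd: P2 is active and minimizes P1's payoff
            Q_shoot = 1 - 2 * p2[d - 1]
            if d == 1:                  # vertex 1 has only the terminal child
                V[d], shoot[d] = Q_shoot, True
            else:
                V[d] = min(Q_shoot, V[d - 1])
                shoot[d] = Q_shoot <= V[d - 1]
    return V, shoot

D = 6
p1 = kill_prob(1.0, 1.0, D)   # c1 = 1, k1 = 1
p2 = kill_prob(1.0, 0.5, D)   # c2 = 1, k2 = 1/2
V, shoot = solve_duel(p1, p2, D)
print(V[D])                   # game value, approximately -0.1547
```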
We next present a proposition which characterizes each player's optimal strategy in terms of a shooting threshold.¹ In the following we use the standard notation by which "$-n$" denotes the "other player", i.e., $p_{-1} = p_2$ and $p_{-2} = p_1$.²
Theorem 1. 
We define for $n \in \{1, 2\}$ the shooting criterion vectors $K_n = (K_{n,1}, \ldots, K_{n,D})$, where
$$K_{n,1} = 1 \quad \text{and, for } d \ge 2: \quad K_{n,d} = p_{n,d} + p_{-n,d-1}.$$
Then the optimal strategy for $P_n$ is to shoot as soon as it is his turn and the distance of the players is less than or equal to the shooting threshold $d_n$, where
$$d_1 = \max\{d : K_{1,d} \ge 1\}, \qquad d_2 = \max\{d : K_{2,d} \ge 1\}.$$
Proof. 
Suppose the players are at distance $d$ and the active player is $P_n$.
  • If $P_{-n}$ will not shoot in the next round, when their distance will be $d-1$, then $P_n$ must also not shoot in the current round, because he will have a higher kill probability in his next turn, when they will be at distance $d-2$.
  • If $P_{-n}$ will shoot in the next round, when their distance will be $d-1$, then $P_n$ should shoot iff $p_{n,d}$ (his kill probability now) is higher than $1 - p_{-n,d-1}$ ($P_{-n}$'s miss probability in the next round). In other words, $P_n$ must shoot iff
    $$p_{n,d} \ge 1 - p_{-n,d-1}$$
    or, equivalently, iff
    $$K_{n,d} = p_{n,d} + p_{-n,d-1} \ge 1.$$
Hence we can reason as follows.
  • At vertex 1, $P_2$ is active and his only choice is to shoot.
  • At vertex 2, $P_1$ is active and he knows $P_2$ will certainly shoot in the next round. Hence $P_1$ will shoot iff he has an advantage, i.e., iff
    $$Q_1(2) \ge Q_1(1) \iff 2p_{1,2} - 1 \ge 1 - 2p_{2,1} \iff p_{1,2} + p_{2,1} \ge 1.$$
    This is equivalent to $K_{1,2} \ge 1$ and, since $p_{2,1} = 1$, it will always be true.
  • Hence at vertex 3, $P_2$ is active and he knows that $P_1$ will certainly shoot in the next round (at vertex 2). So $P_2$ will shoot iff
    $$Q_1(3) \le Q_1(2) \iff 1 - 2p_{2,3} \le 2p_{1,2} - 1 \iff p_{2,3} + p_{1,2} \ge 1,$$
    which is equivalent to $K_{2,3} \ge 1$. Also, if $K_{2,3} < 1$, then $P_1$ will know, when the game is at vertex 4, that $P_2$ will not shoot when at 3. So $P_1$ will not shoot when at 4. But then, when at 5, $P_2$ knows that $P_1$ will not shoot when at 4. Continuing in this manner we see that $K_{2,3} < 1$ implies that firing will take place exactly at vertex 2.
  • On the other hand, if $K_{2,3} \ge 1$, then $P_1$ knows when at 4 that $P_2$ will shoot in the next round. So, when at 4, $P_1$ should shoot iff $K_{1,4} \ge 1$. If, on the other hand, $K_{1,4} < 1$, then $P_1$ will not shoot when at 4 and $P_2$ will shoot when at 3.
  • We continue in this manner for increasing values of $d$. Since both $K_{1,d}$ and $K_{2,d}$ are decreasing in $d$, there will exist a maximum value $d_n$ (it could equal $D$) at which $K_{n,d}$ is at least one and $P_n$ is active; then $P_n$ must shoot as soon as the game reaches or passes vertex $d_n$ and he "has the action".
This completes the proof.    □
Returning to our previous example, we compute the vectors $K_n$ for $n \in \{1, 2\}$ and list them in Table 2.
Table 2. Shooting Criterion

d          1      2      3      4      5      6
Round      6      5      4      3      2      1
K_{1,d}    1.000  1.500  1.040  0.827  0.700  0.614
K_{2,d}    1.000  1.707  1.077  0.833  0.697  0.608
For $P_1$ the shooting criterion is last satisfied when the distance is $d = 3$; this happens at round 4, in which $P_1$ is inactive, so he should shoot at round 5. However, for $P_2$ the shooting criterion is also last satisfied at distance $d = 3$ and round 4, in which $P_2$ is active; so he should shoot at round 4. This is the same result we obtained with backward induction.
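As a quick sanity check of Theorem 1, the following Python sketch computes the criterion vectors $K_n$ and the shooting thresholds for the kill probabilities of Table 1 and reproduces Table 2. The helper names are ours, not the paper's.

```python
import numpy as np

def shooting_criterion(p_self, p_other):
    """K_{n,d} = p_{n,d} + p_{-n,d-1} for d >= 2, with K_{n,1} = 1 (Theorem 1)."""
    K = np.empty_like(p_self)
    K[0] = 1.0                             # d = 1
    K[1:] = p_self[1:] + p_other[:-1]
    return K

def shooting_threshold(p_self, p_other):
    """Largest distance d at which K_{n,d} >= 1."""
    K = shooting_criterion(p_self, p_other)
    return int(np.max(np.nonzero(K >= 1)[0]) + 1)   # +1 because index 0 is d = 1

d = np.arange(1, 7)
p1 = np.minimum(1.0, 1.0 / d)              # c1 = 1, k1 = 1
p2 = np.minimum(1.0, 1.0 / d**0.5)         # c2 = 1, k2 = 1/2
print(np.round(shooting_criterion(p1, p2), 3))   # [1. 1.5 1.04 0.827 0.7 0.614]
print(shooting_threshold(p1, p2), shooting_threshold(p2, p1))  # 3 3
```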

3. Solution with Incomplete Information

We will now present an approach to the solution of the duel when each player has incomplete knowledge of his opponent's kill probability function. More specifically, we assume that the players' kill probabilities have, for all $d$, the form $f(d; \theta)$, where $\theta$ is a parameter vector. Assuming $P_1$ has $\theta = \theta_1$ and $P_2$ has $\theta = \theta_2$, we have
$$\forall d: \quad p_{1,d} = f(d; \theta_1), \qquad p_{2,d} = f(d; \theta_2).$$
Furthermore, we assume that both players know the general form $f(d; \theta)$ and each $P_n$ knows his own parameter vector $\theta_n$ but not the $\theta_{-n}$ of his opponent.
Obviously in this case the players cannot perform the computations of either backward induction or the shooting criterion. Consequently we will propose an “exploration-and-exploitation” approach. In other words, under the assumption that multiple duels will be played, each player initially adopts a random strategy and, collecting information from played games, he gradually builds and refines an estimate of his optimal strategy.
We now give a more detailed description of our approach through Algorithm 1, presented below in pseudocode.
Algorithm 1: Learning the Optimal Duel Strategy

 1: Input: duel parameters $D$, $\theta_1$, $\theta_2$; learning parameters $\lambda$, $\sigma^0$; number of plays $R$
 2: $p_1$ = CompKillProb($\theta_1$)
 3: $p_2$ = CompKillProb($\theta_2$)
 4: Randomly initialize parameter estimates $\theta_1^0$, $\theta_2^0$
 5: for $r \in \{1, 2, \ldots, R\}$ do
 6:     $p_1^r$ = CompKillProb($\theta_1^{r-1}$)
 7:     $p_2^r$ = CompKillProb($\theta_2^{r-1}$)
 8:     $d_1^r$ = CompShootDist($p_1$, $p_2^r$)
 9:     $d_2^r$ = CompShootDist($p_1^r$, $p_2$)
10:     $\sigma^r = \sigma^{r-1} / \lambda$
11:     $X$ = PlayDuel($p_1$, $p_2$, $d_1^r$, $d_2^r$, $\sigma^r$, $X$)
12:     $(\hat{p}_1^r, \hat{p}_2^r)$ = EstKillProb($X$)
13:     $\theta_1^r$ = EstPars($\hat{p}_1^r$, 1)
14:     $\theta_2^r$ = EstPars($\hat{p}_2^r$, 2)
15: end for
16: return $d_1^R$, $d_2^R$, $\theta_1^R$, $\theta_2^R$
The following remarks explain the operation of the algorithm.
  • In line 1 the algorithm takes as input: (i) the duel parameters $D$, $\theta_1$, $\theta_2$, (ii) two learning parameters $\lambda$, $\sigma^0$ and (iii) the number $R$ of duels used in the learning process.
  • Then, in lines 2-3, the true kill probability $p_n$ (for $n \in \{1, 2\}$) is computed by the function CompKillProb($\theta_n$), which simply computes
    $$\forall n, d: \quad p_n(d) = f(d; \theta_n).$$
    We emphasize that these are the true kill probabilities.
  • In line 4 initial, randomly selected parameter vector estimates $\theta_1^0$, $\theta_2^0$ are generated.
  • Then the algorithm enters the loop of lines 5-15 (executed for R iterations) which constitutes the main learning process.
    (a) In lines 6-7 we compute new estimates $p_n^r$ of the kill probabilities, by the function CompKillProb, based on the parameter estimates $\theta_n^{r-1}$:
        $$\forall n, d: \quad p_n^r(d) = f(d; \theta_n^{r-1}).$$
        We emphasize that these are estimates of the kill probabilities, based on the parameter estimates $\theta_n^{r-1}$.
    (b) In lines 8-9 we compute new estimates $d_n^r$ of the shooting thresholds, by the function CompShootDist. For $P_n$, this is achieved by computing the shooting criterion $K_n$ using the (known to $P_n$) $p_n$ and the (estimated by $P_n$) $p_{-n}^r$.
    (c) In line 10 the standard deviation parameter is updated to $\sigma^r = \sigma^{r-1} / \lambda$; since $\lambda > 1$, $\sigma^r$ decreases with $r$.
    (d) In line 11 the result of the duel is computed by the function PlayDuel. This is achieved as follows:
        • For $n \in \{1, 2\}$, $P_n$ selects a random shooting distance $\hat{d}_n$ from the discrete normal distribution [6] with mean $d_n^r$ and standard deviation $\sigma^r$.
        • With both $\hat{d}_1$, $\hat{d}_2$ selected, it is clear which player will shoot first; the outcome of the shot (hit or miss) is a Bernoulli random variable with success probability $p_{n,\hat{d}_n}$, where $P_n$ is the shooting player. Note that $p_n$ is the true kill probability.
        The result is stored in a table $X$, which contains the data (shooting distance, shooting player, hit or miss) of every duel played up to the $r$-th iteration.
    (e) In line 12 the entire game record $X$ is used by EstKillProb($X$) to obtain empirical estimates $\hat{p}_1^r$, $\hat{p}_2^r$ of the kill probabilities. These estimates are as follows:
        $$\forall n, \ \forall d \in D_n: \quad \hat{p}_{n,d}^r = \frac{\sum_{r' \in R_{n,d}} Z_{r'}}{|R_{n,d}|},$$
        where
        $$D_n = \{d : P_n \text{ may shoot from distance } d\}, \qquad R_{n,d} = \{r' : \text{in the } r'\text{-th game } P_n \text{ actually shot from distance } d\},$$
        $$Z_{r'} = \begin{cases} 1 & \text{if the shot in the } r'\text{-th game hit the target}, \\ 0 & \text{if the shot in the } r'\text{-th game missed the target}. \end{cases}$$
    (f) In lines 13-14 the function EstPars uses a least squares algorithm to find (only for the $P_n$ who currently has the action) the $\theta_n^r$ values which minimize the squared error
        $$J(\theta_n) = \sum_{d \in D_n} \left( f(d; \theta_n) - \hat{p}_{n,d}^r \right)^2.$$
    (g) In line 16 the algorithm returns the final estimates of the optimal shooting distances $d_1^R$, $d_2^R$ and of the parameters $\theta_1^R$, $\theta_2^R$.
Note that repeated division by $\lambda > 1$ (i.e., multiplication by $1/\lambda \in (0,1)$) results in $\lim_{r \to \infty} \sigma^r = 0$. Hence, while in the initial iterations of the learning process the players essentially use random shooting thresholds (exploration), with standard deviation $\sigma^r$, it is hoped that, as $r$ increases and the standard deviation $\sigma^r$ of the used shooting thresholds goes to zero, the estimates of the kill probabilities and shooting thresholds will converge to their optimal values (exploitation). This is actually corroborated by the experiments we present in the next section.
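The following Python sketch shows one way Algorithm 1 might be realized for the Group A kill probabilities of Section 4 ($p_{n,d} = \min(1, c_n/d^{k_n})$). It is a simplified rendering under stated assumptions: the discrete normal sampling of [6] is approximated by rounding a continuous normal draw, the least-squares step uses scipy.optimize.curve_fit as a stand-in for the paper's fitting routine, and all function names (comp_kill_prob, play_duel, estimate_parameters, learn) are ours, not the paper's. The default arguments mirror the representative Group A run of Section 4.2 ($D = 10$, $R = 1500$, $\sigma^0 = 6D$).

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def comp_kill_prob(theta, D):
    """CompKillProb: p_d = min(1, c / d**k) for d = 1..D."""
    c, k = theta
    d = np.arange(1, D + 1)
    return np.minimum(1.0, c / d**k)

def comp_shoot_dist(p_self, p_other):
    """CompShootDist: shooting threshold of Theorem 1."""
    K = np.concatenate(([1.0], p_self[1:] + p_other[:-1]))
    return int(np.max(np.nonzero(K >= 1)[0]) + 1)    # array index i corresponds to d = i + 1

def play_duel(p1, p2, d1, d2, sigma, D):
    """PlayDuel: each player draws a noisy shooting distance around his threshold estimate
    (rounded normal draw, standing in for the discrete normal of [6]); walking the rounds
    from distance D downwards, the first active player at or below his sampled threshold
    fires.  Returns (shooter, distance, hit)."""
    dd = {1: int(np.clip(round(rng.normal(d1, sigma)), 1, D)),
          2: int(np.clip(round(rng.normal(d2, sigma)), 1, D))}
    for d in range(D, 0, -1):
        active = 1 if d % 2 == 0 else 2              # P1 is active at even distances
        if d <= dd[active]:
            p = (p1 if active == 1 else p2)[d - 1]
            return active, d, bool(rng.random() < p)

def estimate_parameters(records, n, theta_prev):
    """EstKillProb + EstPars: empirical hit frequency per distance, then least squares."""
    shots = [(d, hit) for who, d, hit in records if who == n]
    if len(shots) < 3:
        return theta_prev                            # not enough data yet
    dists = sorted({d for d, _ in shots})
    freq = [np.mean([h for d2, h in shots if d2 == d]) for d in dists]
    model = lambda d, c, k: np.minimum(1.0, c / d**k)
    try:
        theta, _ = curve_fit(model, np.array(dists, float), np.array(freq),
                             p0=theta_prev, bounds=([0.01, 0.01], [10.0, 10.0]))
        return theta
    except RuntimeError:
        return theta_prev

def learn(theta1, theta2, D=10, R=1500, lam=1.01, sigma0=60.0):
    p1, p2 = comp_kill_prob(theta1, D), comp_kill_prob(theta2, D)    # true kill probabilities
    est1, est2 = rng.uniform(0.1, 3.0, 2), rng.uniform(0.1, 3.0, 2)  # random initial estimates
    sigma, records = sigma0, []
    for r in range(1, R + 1):
        d1 = comp_shoot_dist(p1, comp_kill_prob(est2, D))   # P1: true own p, estimated opponent p
        d2 = comp_shoot_dist(p2, comp_kill_prob(est1, D))
        sigma /= lam
        records.append(play_duel(p1, p2, d1, d2, sigma, D))
        est1 = estimate_parameters(records, 1, est1)
        est2 = estimate_parameters(records, 2, est2)
    return d1, d2, est1, est2

print(learn(theta1=(1.0, 0.5), theta2=(1.0, 1.0)))
```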

4. Experiments

In this section we present numerical experiments to evaluate our approach.

4.1. Experiments Setup

In the following subsections we present several experiment groups, all of which share the same structure. Each experiment group corresponds to a particular form of the kill probability functions. In each case, the kill probability parameters, along with the initial player distance $D$, are the game parameters. For each choice of game parameters we proceed as follows.
First we select the learning parameters $\lambda$, $\sigma^0$ and the number of learning steps $R$. These, together with the game parameters, are the experiment parameters. Then we select a number $J$ of estimation experiments to run for each choice of experiment parameters. For each of the $J$ experiments we compute the following quantities.
  • The relative error of the final kill probability parameter estimates. For a given parameter $\theta_{n,i}$, this error is defined to be
    $$\Delta\theta_{n,i} = \frac{\theta_{n,i} - \theta_{n,i}^R}{\theta_{n,i}}.$$
  • The relative error of the shooting threshold estimates. Letting $d_n^R$ be the estimate of the shooting threshold based on the true kill probability vector $p_n$ and the kill probability vector estimate $p_{-n}^R$, this error is defined to be
    $$\Delta d_n = \frac{d_n - d_n^R}{d_n}.$$
  • The relative error of the optimal payoff estimates. Letting $Q_n^R$ be the estimate of the optimal payoff (computed from the estimated shooting thresholds $d_1^R$, $d_2^R$), this error is defined to be
    $$\Delta Q_n = \begin{cases} \dfrac{Q_n - Q_n^R}{Q_n} & \text{if } Q_n \ne 0, \\ 0 & \text{if } Q_n = 0 \text{ and } Q_n^R = Q_n, \\ 1 & \text{if } Q_n = 0 \text{ and } Q_n^R \ne Q_n. \end{cases}$$
    Note that $\Delta Q_2 = \Delta Q_1$, because the game is zero-sum; a small code rendering of this definition is given after the list.
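The case analysis in the last definition is easy to get wrong, so we record it as a small Python helper; this is our own rendering, not code from the paper.

```python
def relative_payoff_error(Q_true: float, Q_est: float) -> float:
    """Relative payoff error Delta Q_n, with the special cases for Q_n = 0."""
    if Q_true != 0:
        return (Q_true - Q_est) / Q_true
    return 0.0 if Q_est == Q_true else 1.0
```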

4.2. Experiment Group A

In this group the kill probability function has the form:
$$\forall n \in \{1, 2\}: \quad p_{n,d} = \min\left(\frac{c_n}{d^{k_n}}, 1\right).$$
Let us look at the final results of a representative run of the learning algorithm. With $c_1 = 1$, $k_1 = 0.5$, $c_2 = 1$, $k_2 = 1$ and $D = 10$, we run the learning algorithm with $R = 1500$, $\sigma^0 = 6D$ and for three different values $\lambda \in \{1.001, 1.01, 1.05\}$. In Figure 4 we plot the logarithm (with base 10) of the relative payoff error $\Delta Q_1 + \epsilon$ (we have added $\epsilon = 10^{-3}$ to deal with the logarithm of zero error). The three curves plotted correspond to the $\lambda$ values 1.001, 1.01, 1.05. We see that, for all $\lambda$ values, the algorithm achieves zero relative error; in other words, it learns the optimal strategy for both players. Furthermore, convergence is achieved by the 1500-th iteration of the algorithm (1500-th duel played), as seen by the achieved logarithm value $-3$ (recall that we have added $\epsilon = 10^{-3}$ to the error, hence the true error is zero). Convergence is fastest for the largest $\lambda$ value, i.e., $\lambda = 1.05$, and slowest for the smallest value $\lambda = 1.001$.
Figure 4. Plot of the logarithmic relative error $\log_{10} \Delta Q_1$ of $P_1$'s payoff for a representative run of the learning process.
In Figure 5 we plot the logarithmic relative errors $\log_{10} \Delta d_n$. These also, as expected, have converged to zero by the 1500-th iteration.
Figure 5. Plot of the logarithmic relative errors $\log_{10} \Delta d_1$, $\log_{10} \Delta d_2$ for a representative run of the learning process.
The fact that the estimates of the optimal shooting thresholds and strategies achieve zero error does not imply that the same is true of the kill probability parameter estimates. In Figure 6 we plot the relative errors $\Delta c_1$, $\Delta k_1$.
Figure 6. Plot of the relative parameter errors $\Delta c_1$, $\Delta k_1$ for a representative run of the learning process.
It can be seen that these errors do not converge to zero; in fact, for $\lambda \in \{1.01, 1.05\}$ the errors converge to fixed nonzero values, which indicates that the algorithm obtains wrong estimates. However, the error is sufficiently small to still result in zero-error estimates of the shooting thresholds. The picture is similar for the errors $\Delta c_2$, $\Delta k_2$, hence their plots are omitted.
In the above we have given results for a particular run of the learning algorithm. This was a successful run, in the sense that it obtained zero-error estimates of the optimal strategies (and shooting thresholds). However, since our algorithm is stochastic, it is not guaranteed that every run will result in zero-error estimates. To better evaluate the algorithm, we have run it $J = 10$ times and averaged the obtained results. In particular, in Figure 7 we plot the average of ten curves of the type plotted in Figure 4. Note that now we plot the curve for $R = 5000$ plays of the duel.
Figure 7. Plot of $Q_1$ ($P_1$'s payoff) for a representative run of the learning process.
Several observations can be made regarding Figure 7.
  • For the smallest $\lambda$ value, namely $\lambda = 1.001$, the respective curve reaches $-3$ at $r = 4521$. This corresponds to zero average error, which means that, in some algorithm runs, it took more than 4500 iterations (duel plays) to reach zero error.
  • For $\lambda = 1.01$, all runs of the algorithm reached zero error by $r = 656$ iterations.
  • Finally, for $\lambda = 1.05$, the average error never reached zero; in fact 3 out of 10 runs converged to nonzero-error estimates, i.e., to non-optimal strategies.
The above observations corroborate a fact well known in the study of reinforcement learning [7]. Namely, a small learning rate (in our case a small $\lambda$) results in a higher probability of converging to the true parameter values, but also in slower convergence. This can be explained as follows: a small $\lambda$ results in higher $\sigma^r$ values for a large proportion of the duels played by the algorithm, i.e., in more extensive exploration, which however results in slower exploitation (convergence).³
We conclude this group of experiments by running the learning algorithm for various combinations of game parameters; for each combination we record the average error attained at the end of the algorithm (i.e., at $r = R = 5000$). The results are summarized in the following tables.
Table 3. Values of final average relative error $\Delta Q_1$ for $c_2 = 1$, $k_2 = 1$ and various values of $c_1$, $k_1$, $\lambda$. $D$ is fixed at $D = 10$.

             λ = 1.001              λ = 1.01               λ = 1.05
c_1 \ k_1    0.50   1.00   1.50     0.50   1.00   1.50     0.50   1.00   1.50
1.00         0.000  0.000  0.000    0.000  0.100  0.482    0.224  0.500  1.207
1.50         0.000  0.000  0.000    0.059  0.049  1.748    0.215  0.480  3.777
2.00         0.000  0.000  0.048    0.074  0.199  0.097    0.144  0.980  0.072
Table 4. Round at which $\Delta Q_1$ converged to zero for all $J = 10$ sessions, for $c_2 = 1$, $k_2 = 1$ and various values of $c_1$, $k_1$, $\lambda$. $D$ is fixed at $D = 10$. If $\Delta Q_1$ did not converge for all sessions, we note for how many sessions it converged.

             λ = 1.001              λ = 1.01               λ = 1.05
c_1 \ k_1    0.50   1.00   1.50     0.50   1.00   1.50     0.50   1.00   1.50
1.00         4521   1456   2983     656    9/10   8/10     7/10   5/10   5/10
1.50         3238   2939   1754     7/10   9/10   9/10     0/10   8/10   6/10
2.00         4153   423    8/10     2/10   9/10   6/10     3/10   4/10   6/10
From Table 3 and Table 4 we see that for $\lambda = 1.001$ almost all learning sessions conclude with zero $\Delta Q_1$, while increasing the value of $\lambda$ results in more sessions concluding with nonzero error estimates. Furthermore, we observe that, when the average $\Delta Q_1$ converges to zero for multiple values of $\lambda$, the convergence is faster for larger $\lambda$. These results highlight the trade-off between exploration and exploitation discussed above.
Finally, in Table 5 we see how many learning sessions were run and how many converged to a zero error estimate $\Delta Q_1$ for different values of $D$ and $\lambda$.
Table 5. Fraction of learning sessions that converged to $\Delta Q_1 = 0$ for different values of $D$ and $\lambda$.

D        λ = 1.001    λ = 1.01    λ = 1.05
8        163/170      144/170     89/170
10       165/170      139/170     81/170
12       161/170      123/170     68/170
14       164/170      134/170     73/170
total    653/680      540/680     311/680
Table 6. Values of final average relative error $\Delta Q_1$ for $d_{21} = 1$, $d_{22} = D$ and various values of $d_{11}$, $d_{12}$, $\lambda$. $D$ is fixed at $D = 10$.

d_{11}   d_{12}    λ = 1.001     λ = 1.01      λ = 1.05
1        D/2       0.000         0.233         1.000
1        2D/3      0.000         0.000         0.799
1        D         6·10^{-16}    2·10^{-16}    0.800
D/3      2D/3      0.000         0.000         2.799
D/3      D         0.000         0.000         0.777
D/2      D         0.000         0.000         0.480
Table 7. Round at which $\Delta Q_1$ converged to zero for all $J = 10$ sessions, for $d_{21} = 1$, $d_{22} = D$ and various values of $d_{11}$, $d_{12}$, $\lambda$. $D$ is fixed at $D = 10$. If $\Delta Q_1$ did not converge for all sessions, we note for how many sessions it converged.

d_{11}   d_{12}    λ = 1.001    λ = 1.01    λ = 1.05
1        D/2       287          7/10        0/10
1        2D/3      812          333         9/10
1        D         7/10         9/10        9/10
D/3      2D/3      326          139         5/10
D/3      D         225          397         5/10
D/2      D         711          464         6/10
Table 8. Fraction of learning sessions that converged to $\Delta Q_1 = 0$ for different values of $D$ and $\lambda$.

D        λ = 1.001    λ = 1.01    λ = 1.05
8        76/110       80/110      49/110
10       97/110       95/110      56/110
12       94/110       78/110      39/110
14       100/110      84/110      40/110
total    367/440      337/440     184/440

4.3. Experiment Group B

In this group the kill probability function is piecewise linear:
$$p_{n,d} = \begin{cases} 1 & \text{when } d \in [1, d_{n1}], \\ -\dfrac{1}{d_{n2} - d_{n1}}\, d + \dfrac{d_{n2}}{d_{n2} - d_{n1}} & \text{when } d \in [d_{n1}, d_{n2}], \\ 0 & \text{when } d \in [d_{n2}, D]. \end{cases}$$
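A short Python rendering of this form (the function name is ours, not the paper's) makes the three regimes explicit: the linear term equals $(d_{n2} - d)/(d_{n2} - d_{n1})$, so clipping it to $[0, 1]$ reproduces all three cases at once.

```python
import numpy as np

def piecewise_linear_kill_prob(d_n1, d_n2, D):
    """Group B kill probability: 1 on [1, d_n1], linear down to 0 at d_n2, 0 afterwards."""
    d = np.arange(1, D + 1, dtype=float)
    return np.clip((d_n2 - d) / (d_n2 - d_n1), 0.0, 1.0)

# Example with D = 8 and thresholds d_n1 = D/3, d_n2 = D (one of the combinations in the tables):
print(np.round(piecewise_linear_kill_prob(8 / 3, 8.0, 8), 3))
```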
Let us look again at the final results of a representative run of the learning algorithm. With $d_{11} = D/3$, $d_{12} = D$, $d_{21} = 1$, $d_{22} = D$ and $D = 8$, we run the learning algorithm with $R = 500$, $\sigma^0 = 6D$ and for the values $\lambda \in \{1.001, 1.01, 1.05\}$. In Figure 8 we plot the logarithm (with base 10) of the relative payoff error $\Delta Q_1 + \epsilon$. We see results similar to those of Group A: for all $\lambda$ values the algorithm achieves zero relative error. Convergence is achieved by the 300-th iteration of the algorithm and is fastest for $\lambda = 1.05$ and slowest for $\lambda = 1.001$.
Figure 8. Plot of the logarithmic relative error $\log_{10} \Delta Q_1$ of $P_1$'s payoff for a representative run of the learning process.
In Figure 9 we plot the logarithmic relative errors $\log_{10} \Delta d_n$. These also, as expected, have converged to zero by the 300-th iteration.
Figure 9. Plot of the logarithmic relative errors $\log_{10} \Delta d_1$, $\log_{10} \Delta d_2$ for a representative run of the learning process.
As in Group A, the fact that the estimates of the optimal shooting thresholds and strategies achieve zero error does not imply that the same is true for the kill probability parameter estimates. In Figure 10 we plot the relative errors $\Delta d_1$ and $\Delta d_2$. The relative error $\Delta d_2$ converges to zero quickly, but the relative error $\Delta d_1$ does not converge to zero for $\lambda = 1.05$; instead it converges to a fixed nonzero value, which indicates that the algorithm obtains wrong estimates. However, the error is sufficiently small to still result in zero-error estimates of the shooting thresholds.
Figure 10. Plot of the relative parameter errors $\Delta d_1$, $\Delta d_2$ for a representative run of the learning process.
As in Group A, to better evaluate the algorithm, we have run it $J = 10$ times and averaged the obtained results. In particular, in Figure 11 we plot the average of ten curves of the type plotted in Figure 4. Note that now we plot the curve for $R = 500$ plays of the duel.
Figure 11. Plot of $Q_1$ ($P_1$'s payoff) for a representative run of the learning process.
We again run the learning algorithm for various combinations of game parameters and record the average error attained at the end of the algorithm (again at $r = R = 5000$) for each combination. The results are summarized in Table 6, Table 7 and Table 8.
From Table 6 and Table 7, we observe that for most parameter combinations, all learning sessions conclude with a zero Δ Q 1 for the smaller λ values. However, for λ = 1.05 , the algorithm fails to converge and exhibits a high relative error. Notably, increasing λ from 1.001 to 1.01 generally accelerates convergence, although this is not guaranteed in every case. In one instance, a higher λ (specifically 1.01 ) leads to a slower average convergence of Δ Q 1 to zero, indicating the importance of initial random shooting choices. We also observe that results for the highest λ are suboptimal, with the algorithm failing to converge in varying numbers of sessions, such as in 1 out of 10 or even 5 out of 10 cases. Notably, when convergence does occur, it happens relatively quickly, with sessions typically completing in under 1000 iterations. These findings also highlight the trade-off between exploration and exploitation, as discussed earlier.
Finally, in Table 8 we see how many learning sessions were run and how many converged to a zero error estimate $\Delta Q_1$ for different values of $D$ and $\lambda$.

5. Discussion

We proposed an algorithm for estimating the unknown game parameters and the optimal strategies through the course of repeated plays. We tested the algorithm for two models of the kill probability function and found that it converged for the majority of tests. Furthermore, we observed the established relationship between higher learning rates and reduced convergence quality, underscoring the trade-off between learning speed and stability in convergence. Future research could investigate additional models of the accuracy probability function, including scenarios where the two players employ distinct models, and work toward establishing theoretical bounds on the algorithm’s probabilistic convergence.

Author Contributions

All authors contributed equally to all parts of this work, namely conceptualization, methodology, software, validation, formal analysis, writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Polak, Ben. "Backward Induction: Reputation and Duels." Open Yale Courses. https://oyc.yale.edu/economics/econ-159/lecture-16.
  2. Prisner, Erich. Game Theory Through Examples. Vol. 46. American Mathematical Society (2014).
  3. Fox, Martin, and George S. Kimeldorf. "Noisy duels." SIAM Journal on Applied Mathematics, vol. 17, pp. 353-361 (1969).
  4. Garnaev, Andrey. "Games of Timing." In Search Games and Other Applications of Game Theory. Berlin, Heidelberg: Springer, pp. 81-120 (2000).
  5. Radzik, T. "Games of timing related to distribution of resources." Journal of Optimization Theory and Applications, vol. 58, pp. 443-471 (1988).
  6. Roy, Dilip. "The discrete normal distribution." Communications in Statistics - Theory and Methods, vol. 32, pp. 1871-1883 (2003).
  7. Ishii, Shin, Wako Yoshida, and Junichiro Yoshimoto. "Control of exploitation-exploration meta-parameter in reinforcement learning." Neural Networks, vol. 15, pp. 665-687 (2002).
  8. Geman, Stuart, and Donald Geman. "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, pp. 721-741 (1984).
¹ This proposition is stated informally in [1].
² We use the same notation for several other quantities, as will be seen in the sequel.
³ An analogous phenomenon occurs in connection with the "temperature parameter" T in simulated annealing [8] and many other learning algorithms.
Figure 1. Game tree example.
Figure 2. Game tree example with values of terminal vertices.
Figure 3. Game tree example with values of all vertices.
Table 1. Kill probabilities

d          1      2      3      4      5      6
p_{1,d}    1.000  0.500  0.333  0.250  0.200  0.166
p_{2,d}    1.000  0.707  0.577  0.500  0.447  0.408