Preprint Article

Learning Optimal Strategies in a Duel Game

Submitted: 09 November 2024; Posted: 11 November 2024


Abstract
We study a duel game in which each player has partial knowledge of the game parameters. We present a method by which, in the course of repeated plays, each player estimates the missing parameters and consequently learns his optimal strategy.

Introduction

We study a duel game in which certain game parameters (related to the players’ kill probabilities) are unknown and present a method by which each player can estimate the opponent’s parameters and consequently learn his optimal strategy.
The duel game with which we are concerned has been presented in [1] and a similar one in [2]. The game is a variation of duels and, more generally, games of timing studied in the literature [3,4,5].
The paper is organized as follows. In Section 1 we present the rules of the game. In Section 2 we solve the game under the assumption of complete information. In Section 3 we present an algorithm for solving the game when the players have incomplete information. In Section 4 we evaluate the algorithm by numerical experiments. Finally, in Section 5 we summarize our results and present our conclusions.

1. Game Description

The duel with which we are concerned is played between players $P_1$ and $P_2$, under the following rules.
  • It is played in discrete rounds (time steps) $t = 1, 2, \ldots$.
  • In the first turn, the players are at distance D.
  • $P_1$ (resp. $P_2$) plays on odd (resp. even) rounds.
  • On his turn, each player has two choices: (i) he can shoot his opponent or (ii) he can move one step forward, reducing their distance by one.
  • If $P_n$ shoots, he has a kill probability $p_n(d)$ of hitting (and killing) his opponent, where $d$ is their current distance. If he misses, the opponent can walk right next to him and shoot him for a certain kill.
  • Each player's payoff is $+1$ if he kills the opponent and $-1$ if he shoots and misses (in which case he is certain to be killed).
For $n \in \{1, 2\}$, we will denote by $x_n(t)$ the position of $P_n$ at round $t$. The starting positions are $x_1(0) = 0$ and $x_2(0) = D$, with $D = 2N$, $N \in \mathbb{N}$. The distance between the players at time $t$ is
$$d = |x_1(t) - x_2(t)|.$$
For $n \in \{1, 2\}$, the kill probability is a decreasing function $p_n : \{1, 2, \ldots, D\} \to [0, 1]$ with $p_n(1) = 1$. It is convenient to describe the kill probabilities as vectors:
$$p_n = (p_{n,1}, \ldots, p_{n,D}) = (p_n(1), \ldots, p_n(D)).$$
This duel can be modeled as an extensive-form game or tree game. The game tree is a directed graph $G = (V, E)$ with vertex set
$$V = \{1, 2, \ldots, 2D\},$$
where
  • the vertex $v = d \in \{1, 2, \ldots, D\}$ corresponds to a game state in which the players are at distance $d$, and
  • the vertex $v = d + D \in \{D+1, D+2, \ldots, 2D\}$ is a terminal vertex, in which the "active" player has fired at his opponent.
The edges correspond to state transitions; it is easy to see that the edge set is
$$E = \{(1, D+1), (2, 1), (2, D+2), (3, 2), (3, D+3), \ldots, (D, D-1), (D, 2D)\}.$$
An example of the game tree, for $D = 6$, appears in Figure 1. The circular (resp. square) vertices are the ones in which $P_1$ (resp. $P_2$) is active, and the rhombic vertices are the terminal ones.
To complete the description of the game, we will define the expected payoff for the terminal vertices. Note that the terminal vertex $d + D$ is the child of the nonterminal vertex $d$, in which:
  • The distance of the players is $d$ and, assuming $P_n$ to be the active player, his probability of hitting his opponent is $p_{n,d}$.
  • The active player is $P_1$ (resp. $P_2$) iff $d$ is even (resp. odd).
Keeping the above in mind, we see that the payoff (to $P_1$) of vertex $d + D$ is
$$\forall d \in \{1, \ldots, D\}: \quad Q(d+D) = \begin{cases} p_{1,d} \cdot 1 + (1 - p_{1,d}) \cdot (-1) = 2p_{1,d} - 1 & \text{when } d \text{ is even}; \\ p_{2,d} \cdot (-1) + (1 - p_{2,d}) \cdot 1 = 1 - 2p_{2,d} & \text{when } d \text{ is odd}. \end{cases}$$
The payoff to $P_2$ at vertex $d + D$ is $-Q(d+D)$. This completes the description of the duel.

2. Solution with Complete Information

It is easy to solve the above duel when each player knows $D$ and both $p_1$ and $p_2$. We construct the game tree as described in Section 1 and solve it by backward induction. Since the method is standard, we simply give an example of its application. Suppose that
$$\forall n, d: \quad p_{n,d} = \begin{cases} 1 & \text{when } d = 1, \\ \min\left(1, \dfrac{c_n}{d^{k_n}}\right) & \text{when } d > 1. \end{cases}$$
We take $c_1 = 1$, $k_1 = 1$, $c_2 = 1$, $k_2 = \frac{1}{2}$. The resulting kill probabilities are listed in Table 1.
The game tree with terminal payoffs is illustrated in Figure 2.
By the standard backward induction procedure we select the optimal action at each vertex and also compute the values of the nonterminal vertices. These are indicated in Figure 3 (optimal actions correspond to thick edges). We see that the game value is $-0.1547$, attained by $P_2$ shooting when the players are at distance 3 (which happens in round 4).
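To make the backward-induction computation concrete, here is a minimal Python sketch for this example. The function names and code organization are ours (not from the paper); the recursion simply compares the payoff of firing at distance $d$ with the value of the child vertex at distance $d-1$, maximizing for $P_1$ and minimizing for $P_2$.

```python
import numpy as np

def kill_prob(c, k, D):
    """Kill probability vector p_d = min(1, c / d**k) for d = 1..D (so p_1 = 1)."""
    d = np.arange(1, D + 1)
    return np.minimum(1.0, c / d**k)

def solve_duel(p1, p2, D):
    """Backward induction over distances d = 1..D.
    V[d]     = value (payoff to P1) of the nonterminal vertex at distance d.
    shoot[d] = True if the active player's optimal action at distance d is to shoot."""
    V = np.zeros(D + 1)                 # index 0 unused
    shoot = np.zeros(D + 1, dtype=bool)
    for d in range(1, D + 1):
        if d % 2 == 0:                  # d even: P1 is active and maximizes
            Q_shoot = 2 * p1[d - 1] - 1
            V[d] = max(Q_shoot, V[d - 1])
            shoot[d] = Q_shoot >= V[d - 1]
        else:                           # d odd: P2 is active and minimizes P1's payoff
            Q_shoot = 1 - 2 * p2[d - 1]
            if d == 1:                  # vertex 1 has only the terminal child
                V[d], shoot[d] = Q_shoot, True
            else:
                V[d] = min(Q_shoot, V[d - 1])
                shoot[d] = Q_shoot <= V[d - 1]
    return V, shoot

D = 6
p1 = kill_prob(1.0, 1.0, D)   # c1 = 1, k1 = 1
p2 = kill_prob(1.0, 0.5, D)   # c2 = 1, k2 = 1/2
V, shoot = solve_duel(p1, p2, D)
print(V[D])                   # game value, approximately -0.1547
```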
We next present a proposition which characterizes each player's optimal strategy in terms of a shooting threshold.¹ In the following we use the standard notation by which "$-n$" denotes the "other player", i.e., $p_{-1} = p_2$ and $p_{-2} = p_1$.²
Theorem 1. 
We define for $n \in \{1, 2\}$ the shooting criterion vectors $K_n = (K_{n,1}, \ldots, K_{n,D})$, where
$$K_{n,1} = 1 \quad \text{and, for } d \ge 2: \quad K_{n,d} = p_{n,d} + p_{-n,d-1}.$$
Then the optimal strategy for $P_n$ is to shoot as soon as it is his turn and the distance of the players is less than or equal to the shooting threshold $d_n$, where
$$d_1 = \max\{d : K_{1,d} \ge 1\}, \qquad d_2 = \max\{d : K_{2,d} \ge 1\}.$$
Proof. 
Suppose the players are at distance $d$ and the active player is $P_n$.
  • If $P_{-n}$ will not shoot in the next round, when their distance will be $d-1$, then $P_n$ must also not shoot in the current round, because he will have a higher kill probability in his next turn, when they will be at distance $d-2$.
  • If $P_{-n}$ will shoot in the next round, when their distance will be $d-1$, then $P_n$ should shoot iff $p_{n,d}$ (his kill probability now) is higher than $1 - p_{-n,d-1}$ ($P_{-n}$'s miss probability in the next round). In other words, $P_n$ must shoot iff
    $$p_{n,d} \ge 1 - p_{-n,d-1}$$
    or, equivalently, iff
    $$K_{n,d} = p_{n,d} + p_{-n,d-1} \ge 1.$$
Hence we can reason as follows.
  • At vertex 1, $P_2$ is active and his only choice is to shoot.
  • At vertex 2, $P_1$ is active and he knows $P_2$ will certainly shoot in the next round. Hence $P_1$ will shoot iff he has an advantage, i.e., iff
    $$Q_1(2) \ge Q_1(1) \iff 2p_{1,2} - 1 \ge 1 - 2p_{2,1} \iff p_{1,2} + p_{2,1} \ge 1.$$
    This is equivalent to $K_{1,2} \ge 1$ and, since $p_{2,1} = 1$, it will always be true.
  • Hence at vertex 3, $P_2$ is active and he knows that $P_1$ will certainly shoot in the next round (at vertex 2). So $P_2$ will shoot iff
    $$Q_1(3) \le Q_1(2) \iff 1 - 2p_{2,3} \le 2p_{1,2} - 1 \iff p_{2,3} + p_{1,2} \ge 1,$$
    which is equivalent to $K_{2,3} \ge 1$. Also, if $K_{2,3} < 1$, then $P_1$ will know, when the game is at vertex 4, that $P_2$ will not shoot when at 3. So $P_1$ will not shoot when at 4. But then, when at 5, $P_2$ knows that $P_1$ will not shoot when at 4. Continuing in this manner we see that $K_{2,3} < 1$ implies that firing will take place exactly at vertex 2.
  • On the other hand, if $K_{2,3} \ge 1$, then $P_1$ knows when at 4 that $P_2$ will shoot in the next round. So, when at 4, $P_1$ should shoot iff $K_{1,4} \ge 1$. If, on the other hand, $K_{1,4} < 1$, then $P_1$ will not shoot when at 4 and $P_2$ will shoot when at 3.
  • We continue in this manner for increasing values of $d$. Since both $K_{1,d}$ and $K_{2,d}$ are decreasing in $d$, there will exist a maximum value $d_n$ (it could equal $D$) at which $K_{n,d}$ is at least one and $P_n$ is active; then $P_n$ must shoot as soon as the game reaches or passes vertex $d_n$ and he "has the action".
This completes the proof.    □
Returning to our previous example, we compute the vectors $K_n$ for $n \in \{1, 2\}$ and list them in Table 2.
Table 2. Shooting Criterion

d          1      2      3      4      5      6
Round      6      5      4      3      2      1
K_{1,d}    1.000  1.500  1.040  0.827  0.700  0.614
K_{2,d}    1.000  1.707  1.077  0.833  0.697  0.608
For $P_1$ the shooting criterion is last satisfied when the distance is $d = 3$; this happens at round 4, in which $P_1$ is inactive, so he should shoot at round 5. However, for $P_2$ the shooting criterion is also last satisfied at distance $d = 3$ and round 4, in which $P_2$ is active; so he should shoot at round 4. This is the same result we obtained with backward induction.
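As a quick sanity check of Theorem 1, the following Python sketch computes the criterion vectors $K_n$ and the shooting thresholds for the kill probabilities of Table 1 and reproduces Table 2. The helper names are ours, not the paper's.

```python
import numpy as np

def shooting_criterion(p_self, p_other):
    """K_{n,d} = p_{n,d} + p_{-n,d-1} for d >= 2, with K_{n,1} = 1 (Theorem 1)."""
    K = np.empty_like(p_self)
    K[0] = 1.0                             # d = 1
    K[1:] = p_self[1:] + p_other[:-1]
    return K

def shooting_threshold(p_self, p_other):
    """Largest distance d at which K_{n,d} >= 1."""
    K = shooting_criterion(p_self, p_other)
    return int(np.max(np.nonzero(K >= 1)[0]) + 1)   # +1 because index 0 is d = 1

d = np.arange(1, 7)
p1 = np.minimum(1.0, 1.0 / d)              # c1 = 1, k1 = 1
p2 = np.minimum(1.0, 1.0 / d**0.5)         # c2 = 1, k2 = 1/2
print(np.round(shooting_criterion(p1, p2), 3))   # [1. 1.5 1.04 0.827 0.7 0.614]
print(shooting_threshold(p1, p2), shooting_threshold(p2, p1))  # 3 3
```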

3. Solution with Incomplete Information

We will now present an approach to the solution of the duel when each player has incomplete knowledge of his opponent's kill probability function. More specifically, we assume that the players' kill probabilities have, for all $d$, the form $f(d; \theta)$, where $\theta$ is a parameter vector. Assuming $P_1$ has $\theta = \theta_1$ and $P_2$ has $\theta = \theta_2$, we have
$$\forall d: \quad p_{1,d} = f(d; \theta_1), \qquad p_{2,d} = f(d; \theta_2).$$
Furthermore, we assume that both players know the general form $f(d; \theta)$ and each $P_n$ knows his own parameter vector $\theta_n$ but not the $\theta_{-n}$ of his opponent.
Obviously in this case the players cannot perform the computations of either backward induction or the shooting criterion. Consequently we will propose an “exploration-and-exploitation” approach. In other words, under the assumption that multiple duels will be played, each player initially adopts a random strategy and, collecting information from played games, he gradually builds and refines an estimate of his optimal strategy.
We now give a more detailed description of our approach through Algorithm 1, presented below in pseudocode.
Algorithm 1: Learning the Optimal Duel Strategy

 1: Input: duel parameters $D$, $\theta_1$, $\theta_2$; learning parameters $\lambda$, $\sigma^0$; number of plays $R$
 2: $p_1$ = CompKillProb($\theta_1$)
 3: $p_2$ = CompKillProb($\theta_2$)
 4: Randomly initialize parameter estimates $\theta_1^0$, $\theta_2^0$
 5: for $r \in \{1, 2, \ldots, R\}$ do
 6:     $p_1^r$ = CompKillProb($\theta_1^{r-1}$)
 7:     $p_2^r$ = CompKillProb($\theta_2^{r-1}$)
 8:     $d_1^r$ = CompShootDist($p_1$, $p_2^r$)
 9:     $d_2^r$ = CompShootDist($p_1^r$, $p_2$)
10:     $\sigma^r = \sigma^{r-1} / \lambda$
11:     $X$ = PlayDuel($p_1$, $p_2$, $d_1^r$, $d_2^r$, $\sigma^r$, $X$)
12:     $(\hat{p}_1^r, \hat{p}_2^r)$ = EstKillProb($X$)
13:     $\theta_1^r$ = EstPars($\hat{p}_1^r$, 1)
14:     $\theta_2^r$ = EstPars($\hat{p}_2^r$, 2)
15: end for
16: return $d_1^R$, $d_2^R$, $\theta_1^R$, $\theta_2^R$
The following remarks explain the operation of the algorithm.
  • In line 1 the algorithm takes as input: (i) the duel parameters $D$, $\theta_1$, $\theta_2$, (ii) two learning parameters $\lambda$, $\sigma^0$ and (iii) the number $R$ of duels used in the learning process.
  • Then, in lines 2-3, the true kill probability $p_n$ (for $n \in \{1, 2\}$) is computed by the function CompKillProb($\theta_n$), which simply computes
    $$\forall n, d: \quad p_n(d) = f(d; \theta_n).$$
    We emphasize that these are the true kill probabilities.
  • In line 4 initial, randomly selected parameter vector estimates $\theta_1^0$, $\theta_2^0$ are generated.
  • Then the algorithm enters the loop of lines 5-15 (executed for R iterations) which constitutes the main learning process.
    (a) In lines 6-7 we compute new estimates $p_n^r$ of the kill probabilities, by the function CompKillProb, based on the parameter estimates $\theta_n^{r-1}$:
        $$\forall n, d: \quad p_n^r(d) = f(d; \theta_n^{r-1}).$$
        We emphasize that these are estimates of the kill probabilities, based on the parameter estimates $\theta_n^{r-1}$.
    (b) In lines 8-9 we compute new estimates $d_n^r$ of the shooting thresholds, by the function CompShootDist. For $P_n$, this is achieved by computing the shooting criterion $K_n$ using the (known to $P_n$) $p_n$ and the (estimated by $P_n$) $p_{-n}^r$.
    (c) In line 10 the standard deviation parameter is updated to $\sigma^r = \sigma^{r-1} / \lambda$; since $\lambda > 1$, $\sigma^r$ decreases with $r$.
    (d) In line 11 the result of the duel is computed by the function PlayDuel. This is achieved as follows:
        • For $n \in \{1, 2\}$, $P_n$ selects a random shooting distance $\hat{d}_n$ from the discrete normal distribution [6] with mean $d_n^r$ and standard deviation $\sigma^r$.
        • With both $\hat{d}_1$, $\hat{d}_2$ selected, it is clear which player will shoot first; the outcome of the shot (hit or miss) is a Bernoulli random variable with success probability $p_{n,\hat{d}_n}$, where $P_n$ is the shooting player. Note that $p_n$ is the true kill probability.
        The result is stored in a table $X$, which contains the data (shooting distance, shooting player, hit or miss) of every duel played up to the $r$-th iteration.
    (e) In line 12 the entire game record $X$ is used by EstKillProb($X$) to obtain empirical estimates $\hat{p}_1^r$, $\hat{p}_2^r$ of the kill probabilities. These estimates are as follows:
        $$\forall n, \ \forall d \in D_n: \quad \hat{p}_{n,d}^r = \frac{\sum_{r' \in R_{n,d}} Z_{r'}}{|R_{n,d}|},$$
        where
        $$D_n = \{d : P_n \text{ may shoot from distance } d\}, \qquad R_{n,d} = \{r' : \text{in the } r'\text{-th game } P_n \text{ actually shot from distance } d\},$$
        $$Z_{r'} = \begin{cases} 1 & \text{if the shot in the } r'\text{-th game hit the target}, \\ 0 & \text{if the shot in the } r'\text{-th game missed the target}. \end{cases}$$
    (f) In lines 13-14 the function EstPars uses a least squares algorithm to find (only for the $P_n$ who currently has the action) the $\theta_n^r$ values which minimize the squared error
        $$J(\theta_n) = \sum_{d \in D_n} \left( f(d; \theta_n) - \hat{p}_{n,d}^r \right)^2.$$
    (g) In line 16 the algorithm returns the final estimates of the optimal shooting distances $d_1^R$, $d_2^R$ and of the parameters $\theta_1^R$, $\theta_2^R$.
Note that repeated division by $\lambda > 1$ (i.e., multiplication by $1/\lambda \in (0,1)$) results in $\lim_{r \to \infty} \sigma^r = 0$. Hence, while in the initial iterations of the learning process the players essentially use random shooting thresholds (exploration), with standard deviation $\sigma^r$, it is hoped that, as $r$ increases and the standard deviation $\sigma^r$ of the used shooting thresholds goes to zero, the estimates of the kill probabilities and shooting thresholds will converge to their optimal values (exploitation). This is actually corroborated by the experiments we present in the next section.
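The following Python sketch shows one way Algorithm 1 might be realized for the Group A kill probabilities of Section 4 ($p_{n,d} = \min(1, c_n/d^{k_n})$). It is a simplified rendering under stated assumptions: the discrete normal sampling of [6] is approximated by rounding a continuous normal draw, the least-squares step uses scipy.optimize.curve_fit as a stand-in for the paper's fitting routine, and all function names (comp_kill_prob, play_duel, estimate_parameters, learn) are ours, not the paper's. The default arguments mirror the representative Group A run of Section 4.2 ($D = 10$, $R = 1500$, $\sigma^0 = 6D$).

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def comp_kill_prob(theta, D):
    """CompKillProb: p_d = min(1, c / d**k) for d = 1..D."""
    c, k = theta
    d = np.arange(1, D + 1)
    return np.minimum(1.0, c / d**k)

def comp_shoot_dist(p_self, p_other):
    """CompShootDist: shooting threshold of Theorem 1."""
    K = np.concatenate(([1.0], p_self[1:] + p_other[:-1]))
    return int(np.max(np.nonzero(K >= 1)[0]) + 1)    # array index i corresponds to d = i + 1

def play_duel(p1, p2, d1, d2, sigma, D):
    """PlayDuel: each player draws a noisy shooting distance around his threshold estimate
    (rounded normal draw, standing in for the discrete normal of [6]); walking the rounds
    from distance D downwards, the first active player at or below his sampled threshold
    fires.  Returns (shooter, distance, hit)."""
    dd = {1: int(np.clip(round(rng.normal(d1, sigma)), 1, D)),
          2: int(np.clip(round(rng.normal(d2, sigma)), 1, D))}
    for d in range(D, 0, -1):
        active = 1 if d % 2 == 0 else 2              # P1 is active at even distances
        if d <= dd[active]:
            p = (p1 if active == 1 else p2)[d - 1]
            return active, d, bool(rng.random() < p)

def estimate_parameters(records, n, theta_prev):
    """EstKillProb + EstPars: empirical hit frequency per distance, then least squares."""
    shots = [(d, hit) for who, d, hit in records if who == n]
    if len(shots) < 3:
        return theta_prev                            # not enough data yet
    dists = sorted({d for d, _ in shots})
    freq = [np.mean([h for d2, h in shots if d2 == d]) for d in dists]
    model = lambda d, c, k: np.minimum(1.0, c / d**k)
    try:
        theta, _ = curve_fit(model, np.array(dists, float), np.array(freq),
                             p0=theta_prev, bounds=([0.01, 0.01], [10.0, 10.0]))
        return theta
    except RuntimeError:
        return theta_prev

def learn(theta1, theta2, D=10, R=1500, lam=1.01, sigma0=60.0):
    p1, p2 = comp_kill_prob(theta1, D), comp_kill_prob(theta2, D)    # true kill probabilities
    est1, est2 = rng.uniform(0.1, 3.0, 2), rng.uniform(0.1, 3.0, 2)  # random initial estimates
    sigma, records = sigma0, []
    for r in range(1, R + 1):
        d1 = comp_shoot_dist(p1, comp_kill_prob(est2, D))   # P1: true own p, estimated opponent p
        d2 = comp_shoot_dist(p2, comp_kill_prob(est1, D))
        sigma /= lam
        records.append(play_duel(p1, p2, d1, d2, sigma, D))
        est1 = estimate_parameters(records, 1, est1)
        est2 = estimate_parameters(records, 2, est2)
    return d1, d2, est1, est2

print(learn(theta1=(1.0, 0.5), theta2=(1.0, 1.0)))
```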

4. Experiments

In this section we present numerical experiments to evaluate our approach.

4.1. Experiments Setup

In the following subsections we present several experiment groups, all of which share the same structure. Each experiment group corresponds to a particular form of the kill probability functions. In each case, the kill probability parameters, along with the initial player distance $D$, are the game parameters. For each choice of game parameters we proceed as follows.
First we select the learning parameters $\lambda$, $\sigma^0$ and the number of learning steps $R$. These, together with the game parameters, are the experiment parameters. Then we select a number $J$ of estimation experiments to run for each choice of experiment parameters. For each of the $J$ experiments we compute the following quantities.
  • The relative error of the final kill probability parameter estimates. For a given parameter $\theta_{n,i}$, this error is defined to be
    $$\Delta\theta_{n,i} = \frac{\theta_{n,i} - \theta_{n,i}^R}{\theta_{n,i}}.$$
  • The relative error of the shooting threshold estimates. Letting $d_n^R$ be the estimate of the shooting threshold based on the true kill probability vector $p_n$ and the kill probability vector estimate $p_{-n}^R$, this error is defined to be
    $$\Delta d_n = \frac{d_n - d_n^R}{d_n}.$$
  • The relative error of the optimal payoff estimates. Letting $Q_n^R$ be the estimate of the optimal payoff (computed from the estimated shooting thresholds $d_1^R$, $d_2^R$), this error is defined to be
    $$\Delta Q_n = \begin{cases} \dfrac{Q_n - Q_n^R}{Q_n} & \text{if } Q_n \ne 0, \\ 0 & \text{if } Q_n = 0 \text{ and } Q_n^R = Q_n, \\ 1 & \text{if } Q_n = 0 \text{ and } Q_n^R \ne Q_n. \end{cases}$$
    Note that $\Delta Q_2 = \Delta Q_1$, because the game is zero-sum; a small code rendering of this definition is given after the list.
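The case analysis in the last definition is easy to get wrong, so we record it as a small Python helper; this is our own rendering, not code from the paper.

```python
def relative_payoff_error(Q_true: float, Q_est: float) -> float:
    """Relative payoff error Delta Q_n, with the special cases for Q_n = 0."""
    if Q_true != 0:
        return (Q_true - Q_est) / Q_true
    return 0.0 if Q_est == Q_true else 1.0
```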

4.2. Experiment Group A

In this group the kill probability function has the form:
$$\forall n \in \{1, 2\}: \quad p_{n,d} = \min\left(\frac{c_n}{d^{k_n}}, 1\right).$$
Let us look at the final results of a representative run of the learning algorithm. With $c_1 = 1$, $k_1 = 0.5$, $c_2 = 1$, $k_2 = 1$ and $D = 10$, we run the learning algorithm with $R = 1500$, $\sigma^0 = 6D$ and for three different values $\lambda \in \{1.001, 1.01, 1.05\}$. In Figure 4 we plot the logarithm (with base 10) of the relative payoff error $\Delta Q_1 + \epsilon$ (we have added $\epsilon = 10^{-3}$ to deal with the logarithm of zero error). The three curves plotted correspond to the $\lambda$ values 1.001, 1.01, 1.05. We see that, for all $\lambda$ values, the algorithm achieves zero relative error; in other words, it learns the optimal strategy for both players. Furthermore, convergence is achieved by the 1500-th iteration of the algorithm (1500-th duel played), as seen by the achieved logarithm value $-3$ (recall that we have added $\epsilon = 10^{-3}$ to the error, hence the true error is zero). Convergence is fastest for the largest $\lambda$ value, i.e., $\lambda = 1.05$, and slowest for the smallest value $\lambda = 1.001$.
Figure 4. Plot of the logarithmic relative error $\log_{10} \Delta Q_1$ of $P_1$'s payoff for a representative run of the learning process.
In Figure 5 we plot the logarithmic relative errors $\log_{10} \Delta d_n$. These also, as expected, have converged to zero by the 1500-th iteration.
Figure 5. Plot of the logarithmic relative errors $\log_{10} \Delta d_1$, $\log_{10} \Delta d_2$ for a representative run of the learning process.
The fact that the estimates of the optimal shooting thresholds and strategies achieve zero error does not imply that the same is true of the kill probability parameter estimates. In Figure 6 we plot the relative errors $\Delta c_1$, $\Delta k_1$.
Figure 6. Plot of the relative parameter errors $\Delta c_1$, $\Delta k_1$ for a representative run of the learning process.
It can be seen that these errors do not converge to zero; in fact, for $\lambda \in \{1.01, 1.05\}$ the errors converge to fixed nonzero values, which indicates that the algorithm obtains wrong estimates. However, the error is sufficiently small to still result in zero-error estimates of the shooting thresholds. The picture is similar for the errors $\Delta c_2$, $\Delta k_2$, hence their plots are omitted.
In the above we have given results for a particular run of the learning algorithm. This was a successful run, in the sense that it obtained zero-error estimates of the optimal strategies (and shooting thresholds). However, since our algorithm is stochastic, it is not guaranteed that every run will result in zero-error estimates. To better evaluate the algorithm, we have run it $J = 10$ times and averaged the obtained results. In particular, in Figure 7 we plot the average of ten curves of the type plotted in Figure 4. Note that now we plot the curve for $R = 5000$ plays of the duel.
Figure 7. Plot of $Q_1$ ($P_1$'s payoff) for a representative run of the learning process.
Several observations can be made regarding Figure 7.
  • For the smallest $\lambda$ value, namely $\lambda = 1.001$, the respective curve reaches $-3$ at $r = 4521$. This corresponds to zero average error, which means that, in some algorithm runs, it took more than 4500 iterations (duel plays) to reach zero error.
  • For $\lambda = 1.01$, all runs of the algorithm reached zero error by $r = 656$ iterations.
  • Finally, for $\lambda = 1.05$, the average error never reached zero; in fact 3 out of 10 runs converged to nonzero-error estimates, i.e., to non-optimal strategies.
The above observations corroborate a fact well known in the study of reinforcement learning [7]. Namely, a small learning rate (in our case a small $\lambda$) results in a higher probability of converging to the true parameter values, but also in slower convergence. This can be explained as follows: a small $\lambda$ results in higher $\sigma^r$ values for a large proportion of the duels played by the algorithm, i.e., in more extensive exploration, which however results in slower exploitation (convergence).³
We conclude this group of experiments by running the learning algorithm for various combinations of game parameters; for each combination we record the average error attained at the end of the algorithm (i.e., at $r = R = 5000$). The results are summarized in the following tables.
Table 3. Values of final average relative error $\Delta Q_1$ for $c_2 = 1$, $k_2 = 1$ and various values of $c_1$, $k_1$, $\lambda$. $D$ is fixed at $D = 10$.

             λ = 1.001              λ = 1.01               λ = 1.05
c_1 \ k_1    0.50   1.00   1.50     0.50   1.00   1.50     0.50   1.00   1.50
1.00         0.000  0.000  0.000    0.000  0.100  0.482    0.224  0.500  1.207
1.50         0.000  0.000  0.000    0.059  0.049  1.748    0.215  0.480  3.777
2.00         0.000  0.000  0.048    0.074  0.199  0.097    0.144  0.980  0.072
Table 4. Round at which $\Delta Q_1$ converged to zero for all $J = 10$ sessions, for $c_2 = 1$, $k_2 = 1$ and various values of $c_1$, $k_1$, $\lambda$. $D$ is fixed at $D = 10$. If $\Delta Q_1$ did not converge for all sessions, we note for how many sessions it converged.

             λ = 1.001              λ = 1.01               λ = 1.05
c_1 \ k_1    0.50   1.00   1.50     0.50   1.00   1.50     0.50   1.00   1.50
1.00         4521   1456   2983     656    9/10   8/10     7/10   5/10   5/10
1.50         3238   2939   1754     7/10   9/10   9/10     0/10   8/10   6/10
2.00         4153   423    8/10     2/10   9/10   6/10     3/10   4/10   6/10
From Table 3 and Table 4 we see that for $\lambda = 1.001$ almost all learning sessions conclude with zero $\Delta Q_1$, while increasing the value of $\lambda$ results in more sessions concluding with nonzero error estimates. Furthermore, we observe that, when the average $\Delta Q_1$ converges to zero for multiple values of $\lambda$, the convergence is faster for larger $\lambda$. These results highlight the trade-off between exploration and exploitation discussed above.
Finally, in Table 5 we see how many learning sessions were run and how many converged to a zero error estimate $\Delta Q_1$ for different values of $D$ and $\lambda$.
Table 5. Fraction of learning sessions that converged to $\Delta Q_1 = 0$ for different values of $D$ and $\lambda$.

D        λ = 1.001    λ = 1.01    λ = 1.05
8        163/170      144/170     89/170
10       165/170      139/170     81/170
12       161/170      123/170     68/170
14       164/170      134/170     73/170
total    653/680      540/680     311/680
Table 6. Values of final average relative error $\Delta Q_1$ for $d_{21} = 1$, $d_{22} = D$ and various values of $d_{11}$, $d_{12}$, $\lambda$. $D$ is fixed at $D = 10$.

d_{11}   d_{12}    λ = 1.001     λ = 1.01      λ = 1.05
1        D/2       0.000         0.233         1.000
1        2D/3      0.000         0.000         0.799
1        D         6·10^{-16}    2·10^{-16}    0.800
D/3      2D/3      0.000         0.000         2.799
D/3      D         0.000         0.000         0.777
D/2      D         0.000         0.000         0.480
Table 7. Round at which $\Delta Q_1$ converged to zero for all $J = 10$ sessions, for $d_{21} = 1$, $d_{22} = D$ and various values of $d_{11}$, $d_{12}$, $\lambda$. $D$ is fixed at $D = 10$. If $\Delta Q_1$ did not converge for all sessions, we note for how many sessions it converged.

d_{11}   d_{12}    λ = 1.001    λ = 1.01    λ = 1.05
1        D/2       287          7/10        0/10
1        2D/3      812          333         9/10
1        D         7/10         9/10        9/10
D/3      2D/3      326          139         5/10
D/3      D         225          397         5/10
D/2      D         711          464         6/10
Table 8. Fraction of learning sessions that converged to $\Delta Q_1 = 0$ for different values of $D$ and $\lambda$.

D        λ = 1.001    λ = 1.01    λ = 1.05
8        76/110       80/110      49/110
10       97/110       95/110      56/110
12       94/110       78/110      39/110
14       100/110      84/110      40/110
total    367/440      337/440     184/440

4.3. Experiment Group B

In this group the kill probability function is piecewise linear:
$$p_{n,d} = \begin{cases} 1 & \text{when } d \in [1, d_{n1}], \\ -\dfrac{1}{d_{n2} - d_{n1}}\, d + \dfrac{d_{n2}}{d_{n2} - d_{n1}} & \text{when } d \in [d_{n1}, d_{n2}], \\ 0 & \text{when } d \in [d_{n2}, D]. \end{cases}$$
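A short Python rendering of this form (the function name is ours, not the paper's) makes the three regimes explicit: the linear term equals $(d_{n2} - d)/(d_{n2} - d_{n1})$, so clipping it to $[0, 1]$ reproduces all three cases at once.

```python
import numpy as np

def piecewise_linear_kill_prob(d_n1, d_n2, D):
    """Group B kill probability: 1 on [1, d_n1], linear down to 0 at d_n2, 0 afterwards."""
    d = np.arange(1, D + 1, dtype=float)
    return np.clip((d_n2 - d) / (d_n2 - d_n1), 0.0, 1.0)

# Example with D = 8 and thresholds d_n1 = D/3, d_n2 = D (one of the combinations in the tables):
print(np.round(piecewise_linear_kill_prob(8 / 3, 8.0, 8), 3))
```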
Let us look again at the final results of a representative run of the learning algorithm. With $d_{11} = D/3$, $d_{12} = D$, $d_{21} = 1$, $d_{22} = D$ and $D = 8$, we run the learning algorithm with $R = 500$, $\sigma^0 = 6D$ and for the values $\lambda \in \{1.001, 1.01, 1.05\}$. In Figure 8 we plot the logarithm (with base 10) of the relative payoff error $\Delta Q_1 + \epsilon$. We see results similar to those of Group A: for all $\lambda$ values the algorithm achieves zero relative error. Convergence is achieved by the 300-th iteration of the algorithm and is fastest for $\lambda = 1.05$ and slowest for $\lambda = 1.001$.
Figure 8. Plot of the logarithmic relative error $\log_{10} \Delta Q_1$ of $P_1$'s payoff for a representative run of the learning process.
In Figure 9 we plot the logarithmic relative errors $\log_{10} \Delta d_n$. These also, as expected, have converged to zero by the 300-th iteration.
Figure 9. Plot of the logarithmic relative errors $\log_{10} \Delta d_1$, $\log_{10} \Delta d_2$ for a representative run of the learning process.
As in Group A, the fact that the estimates of the optimal shooting thresholds and strategies achieve zero error does not imply that the same is true for the kill probability parameter estimates. In Figure 10 we plot the relative errors $\Delta d_1$ and $\Delta d_2$. The relative error $\Delta d_2$ converges to zero quickly, but the relative error $\Delta d_1$ does not converge to zero for $\lambda = 1.05$; instead it converges to a fixed nonzero value, which indicates that the algorithm obtains wrong estimates. However, the error is sufficiently small to still result in zero-error estimates of the shooting thresholds.
Figure 10. Plot of the relative parameter errors $\Delta d_1$, $\Delta d_2$ for a representative run of the learning process.
As in Group A, to better evaluate the algorithm, we have run it $J = 10$ times and averaged the obtained results. In particular, in Figure 11 we plot the average of ten curves of the type plotted in Figure 4. Note that now we plot the curve for $R = 500$ plays of the duel.
Figure 11. Plot of $Q_1$ ($P_1$'s payoff) for a representative run of the learning process.
We again run the learning algorithm for various combinations of game parameters and record the average error attained at the end of the algorithm (again at $r = R = 5000$) for each combination. The results are summarized in Table 6, Table 7 and Table 8.
From Table 6 and Table 7, we observe that for most parameter combinations, all learning sessions conclude with a zero Δ Q 1 for the smaller λ values. However, for λ = 1.05 , the algorithm fails to converge and exhibits a high relative error. Notably, increasing λ from 1.001 to 1.01 generally accelerates convergence, although this is not guaranteed in every case. In one instance, a higher λ (specifically 1.01 ) leads to a slower average convergence of Δ Q 1 to zero, indicating the importance of initial random shooting choices. We also observe that results for the highest λ are suboptimal, with the algorithm failing to converge in varying numbers of sessions, such as in 1 out of 10 or even 5 out of 10 cases. Notably, when convergence does occur, it happens relatively quickly, with sessions typically completing in under 1000 iterations. These findings also highlight the trade-off between exploration and exploitation, as discussed earlier.
Finally, in Table 8 we see how many learning sessions were run and how many converged to a zero error estimate $\Delta Q_1$ for different values of $D$ and $\lambda$.

5. Discussion

We proposed an algorithm for estimating the unknown game parameters and the optimal strategies through the course of repeated plays. We tested the algorithm for two models of the kill probability function and found that it converged for the majority of tests. Furthermore, we observed the established relationship between higher learning rates and reduced convergence quality, underscoring the trade-off between learning speed and stability in convergence. Future research could investigate additional models of the accuracy probability function, including scenarios where the two players employ distinct models, and work toward establishing theoretical bounds on the algorithm’s probabilistic convergence.

Author Contributions

All authors contributed equally to all parts of this work, namely conceptualization, methodology, software, validation, formal analysis, writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Polak, Ben. "Backward Induction: Reputation and Duels." Open Yale Courses. https://oyc.yale.edu/economics/econ-159/lecture-16.
  2. Prisner, Erich. Game Theory Through Examples. Vol. 46. American Mathematical Society (2014).
  3. Fox, Martin, and George S. Kimeldorf. "Noisy duels." SIAM Journal on Applied Mathematics, vol. 17, pp. 353-361 (1969).
  4. Garnaev, Andrey. "Games of Timing." In Search Games and Other Applications of Game Theory. Berlin, Heidelberg: Springer, pp. 81-120 (2000).
  5. Radzik, T. "Games of timing related to distribution of resources." Journal of Optimization Theory and Applications, vol. 58, pp. 443-471 (1988).
  6. Roy, Dilip. "The discrete normal distribution." Communications in Statistics - Theory and Methods, vol. 32, pp. 1871-1883 (2003).
  7. Ishii, Shin, Wako Yoshida, and Junichiro Yoshimoto. "Control of exploitation-exploration meta-parameter in reinforcement learning." Neural Networks, vol. 15, pp. 665-687 (2002).
  8. Geman, Stuart, and Donald Geman. "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, pp. 721-741 (1984).
¹ This proposition is stated informally in [1].
² We use the same notation for several other quantities, as will be seen in the sequel.
³ An analogous phenomenon occurs in connection with the "temperature parameter" T in simulated annealing [8] and many other learning algorithms.
Figure 1. Game tree example.
Figure 2. Game tree example with values of terminal vertices.
Figure 3. Game tree example with values of all vertices.
Table 1. Kill probabilities

d          1      2      3      4      5      6
p_{1,d}    1.000  0.500  0.333  0.250  0.200  0.166
p_{2,d}    1.000  0.707  0.577  0.500  0.447  0.408