UCB with an optimal inequality

Upper confidence bound (UCB) multi-armed bandit algorithms typically rely on concentration inequalities (such as Hoeffding's inequality) for the construction of the upper confidence bound. Intuitively, the tighter the bound, the more appropriately the respective arm is judged for selection. Hence we derive and utilise an optimal inequality. Usually the sample mean (and sometimes the sample variance) of previous rewards is the information used in the bounds which drive the algorithm, but intuitively, the more information that is taken from the previous rewards, the tighter the bound can be. Hence our inequality explicitly incorporates the value of each and every past reward into the upper bound expression which drives the method. We show how this UCB method fits into the broader scope of other information-theoretic UCB algorithms but, unlike them, is free from assumptions about the distribution of the data. We conclude by reporting some already established regret information, and give some numerical simulations to demonstrate the method's effectiveness.


Introduction
The multi-armed bandit problem (MAB) is a classic example of a scenario that encodes the general dynamic trade-off between exploration and exploitation in artificial intelligence systems.
The scenario consists of a repeated single decision, whose various choices lead the learner to stochastic rewards. Over time more information is gathered about the rewards of the decisions that are made, until eventually the learner is in a position to choose the machine that yields the highest average reward, achieving the greatest long-term expected payoff. The visceral picture of the situation is of a gambler choosing which poker machine is best to play over some time frame.
There are several methods of learning in this situation, such as the classic ε-greedy algorithm (and other ε-type methods), the Softmax technique (aka Boltzmann exploration), Pursuit algorithms, Thompson sampling and Reinforcement Comparison (aka Gradient) methods; but we focus specifically on the UCB family of bandit algorithms.
Upper confidence bound (UCB) bandit algorithms are a popular class of methods where the bandit arm (i.e. decision) with the highest upper confidence bound (for some confidence level) on its mean is selected each turn (Lai and Robbins, 1985). Typical methods of generating confidence bounds rely on various concentration inequalities, which function as methods of generating conservative confidence intervals that inform the selection process.
One of the interesting developments among UCB methods is the improvement made possible by incorporating tighter concentration inequalities. Classic concentration inequalities, such as Hoeffding's inequality, give confidence interval widths on the mean based simply on the number of reward samples taken and the size of the support of the data, whereas some newer concentration inequalities incorporate further sample information (such as the sample variance) to tighten the interval.

Making an optimal concentration inequality
Historical UCB algorithms have relied on concentration inequalities such as Hoeffding's inequality. These concentration inequalities can be interpreted as analytic unconditioned probability statements about the relationship between sample statistics and population statistics, or alternatively as statements about sample statistics conditioned on population statistics, but not conversely (as is sometimes misunderstood) as statements about population statistics conditioned on sample statistics (see Appendix C). We can see in Appendix B (or in derivations elsewhere) that the derivation of Hoeffding's inequality begins with an assumption about the mean, and then proceeds to develop probabilistic bounds on the sample mean; this is a legitimate process of developing bounds on the sample mean conditioned on the assumed population mean.
To develop an optimal concentration inequality to replace Hoeffding's inequality in UCB algorithms, it is therefore legitimate to ask the same question that Hoeffding's inequality answers: for a specific possible mean of the data distribution, what is the maximum probability of receiving the relevant sample statistics? We answer this question by solving for the maximising probability density function directly.

An optimal concentration inequality for finite values
For a specific possible mean of the data distribution, what is the maximum probability of receiving the relevant sample statistics?
We begin with the assumption that the distribution of the data only features a finite number of values. And since we are interested in using more than just the sample mean (and sample variance) we ask the question about receiving all sample information.
Our question therefore becomes: what probabilities can the values in our distribution have (consistent with a specified distribution mean) to maximise the probability of receiving all the samples that we did (with their values and multiplicities)? This question is fundamentally a task of maximising an Empirical Likelihood subject to a constraint, and hence an optimisation problem.
For this problem, the qualities of its solution can be intuited, particularly that non-zero probability must be allocated to all the values of the samples (or otherwise the likelihood would straightforwardly be zero), and then perhaps one other extreme point may have some probability to balance the distribution with the specified mean most effectively. A graph illustrating this intuition is shown in Figure 1.
In Theorem 4 of Appendix A we solve the optimisation problem analytically to show exactly what probabilities the values have, and when and where the other extreme point occurs. This process of maximising an empirical likelihood is a well known statistical technique, called the Empirical Likelihood method, and our derivation is reminiscent of this classic technique; see Owen (1988).

The optimal concentration inequality for finite support
By maximising the empirical likelihood of receiving the samples, we saw that there was an upper limit on the number of values in the distribution that will be allocated non-zero probability, specifically the number of sample values plus one. And this fact remains true irrespective of how many values we limit our distribution to support, and irrespective of how those values are arranged or how densely they are packed. Therefore we propose that the same behaviour occurs in the limit of continuous distributions.

Figure 1: For distributions bounded between −2 and 2, with a specified mean µ, the distribution that maximises the likelihood of drawing sample values a_0, …, a_5 (each with single multiplicity), including the unsampled value 2. We also see the hyperbolic weighting of probabilities centred at 2.

Figure 2: For distributions bounded between −2 and 2, and with the same sample values a_0, …, a_5 as in Figure 1, the maximum probability that the mean µ = x consistent with those sample values, per Theorem 1. We see that the maximum height (≈ 0.0154) is at the sample mean μ̂ and is very small, and that there is a slightly heavier right-side tail, as the data samples are left-heavy.

Hence the continuous distribution that maximises the likelihood of drawing specific samples (for a specific distribution mean) is given by a finite summation of delta functions.
If we assume that the distribution only contains values in a specific range (i.e. has finite support), then the maximising likelihood is given by the following theorem (see Appendix A for the full derivation):

Theorem 1 Over all distributions (bounded between d− and d+) that have mean µ = x, the likelihood of drawing samples with values s_1, s_2, s_3, …, s_n with multiplicities n_1, n_2, n_3, …, n_n is least upper bounded by a positive p_max(s_1..n, n_1..n | µ = x); this bound takes one form if the sample mean equals the mean (i.e. μ̂ = Σ_i n_i s_i / Σ_i n_i = µ), and another form otherwise.

As this function p_max is parameterised by the mean value µ = x, it is possible to solve and plot it against x. An example plot against x is shown in Figure 2, where we can see that the probability profile of the mean indicates that it is most likely about the sample mean, which is what we would expect.
One inconvenience of using this inequality in UCB methods is that the maximum height of the probability profile for µ does not reach 1. In the traditional case of Hoeffding's inequality it does reach a maximum of 1, since it is always perfectly possible that the data distribution has a variance of zero, in which case the mean and sample mean must coincide. Our inequality, by contrast, considers the spread of sample points more directly, and the probability that there is any spread in the sample points and (or given that) the mean has any particular value (or lies in any range of values) must be less than one.
However for practical purposes (and this is quite optional) it is convenient to do some scaling to make the probability profile have a height of 1, and we incorporate this scaling in the subsequent Theorem 4. Additionally, we are primarily interested in the region where the mean exceeds the sample mean (µ ≥ μ̂) by some factor, and it is easy to notice that the maximum point of the profile corresponds to the case µ = μ̂, and that for µ > μ̂ the function is decreasing. And so we convert our function into an inequality over a range of mean values greater than a parameter x, with µ ≥ x ≥ μ̂:¹

p_max(s_1..n, n_1..n, µ ≥ x) = ∫_{y≥x} p_max(s_1..n, n_1..n | µ = y) p(µ = y) dy
≤ max_{y≥x} p_max(s_1..n, n_1..n | µ = y) = p_max(s_1..n, n_1..n | µ = x)

Thus we have converted our inequality, which was conditioned on a specific value of the mean, into an unconditioned inequality over the range of mean values that we are interested in. Thus we can restate our theorem for the unit-scaled probability over a range of mean values, which coincidentally doesn't need a lower bound anymore:

Theorem 2 Over all distributions bounded above by d+, that have mean µ ≥ x, the unit-scaled likelihood of drawing samples with values s_1, s_2, s_3, …, s_n with multiplicities n_1, n_2, n_3, …, n_n, given that x > μ̂, is least upper bounded by a p_scaled_max(s_1..n, n_1..n, µ ≥ x), where G is constrained such that G > max_i(s_i − µ).

¹ In the same line of reasoning as in Appendix C.

Working with this measure p_scaled_max in the context of UCB suggests an algorithm like Algorithm 1:

Algorithm 1 UCB operating with p_scaled_max (UCBOI)
Require: A decreasing function g : N → (0, 1)
1: Pull each of arms {1, …, K} once,
2: thus setting the number of pulls for each arm a, N_a(t) = 1
3: for t = K to T − 1 do
4:   for each arm a compute its average historic reward μ̂_a,
5:   and compute the x_a > μ̂_a such that p_scaled_max(…, µ ≥ x_a) = g(t)
6:   pull an arm A_{t+1} ∈ arg max_{a∈{1,…,K}} x_a
7: end for

Let us give an example application:
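To make the control flow of Algorithm 1 concrete, here is a minimal Python sketch of the loop. The names `pull` and `index_fn` are hypothetical stand-ins: `pull` is any reward interface, and `index_fn` is whatever solver returns the x_a satisfying p_scaled_max(…, µ ≥ x_a) = g(t), whose concrete form depends on Theorem 2 (the Bernoulli case of the next section admits one explicit instance).

```python
import math
import random

def ucboi(pull, K, T, index_fn, g):
    """Skeleton of Algorithm 1 (UCBOI).

    pull(a)            -- hypothetical interface returning a reward for arm a
    index_fn(rs, g_t)  -- hypothetical solver returning the x_a > mean(rs)
                          with p_scaled_max(rs, mu >= x_a) = g_t (Theorem 2)
    g                  -- a decreasing function from pull counts into (0, 1)
    """
    rewards = {a: [pull(a)] for a in range(K)}   # lines 1-2: pull each arm once
    for t in range(K, T):                        # line 3
        indices = [index_fn(rewards[a], g(t)) for a in range(K)]  # lines 4-5
        a_next = max(range(K), key=lambda a: indices[a])          # line 6
        rewards[a_next].append(pull(a_next))
    return rewards
```

This sketch fixes only the loop structure; any index solver with the stated interface can be plugged in.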

The case of Bernoulli data
If we are dealing with Bernoulli data then things become a bit simpler, and Theorem 4 reduces to:

Theorem 3 For Bernoulli distributions that have mean µ ≥ x, the scaled likelihood of drawing n_0 samples of value 0 and n_1 samples of value 1, if x > μ̂ (with N = n_0 + n_1 and μ̂ = n_1/N), is least upper bounded by:

(x/μ̂)^{n_1} ((1 − x)/(1 − μ̂))^{n_0}

Proof For s_0, s_1 = 0, 1, solving for G in terms of x and μ̂, which is put into Equation 3.
As Algorithm 1 runs with a process of finding x such that p_scaled_max(…, µ ≥ x) = g(t), this is much the same thing as finding x such that:

N [μ̂ log(μ̂/x) + (1 − μ̂) log((1 − μ̂)/(1 − x))] = log(1/g(t))

And this expression might look slightly familiar… as the left-hand side is N times the Kullback-Leibler divergence between the Bernoulli distribution suggested by μ̂ and that consistent with µ = x. And if we ran Algorithm 1 with Bernoulli data and g(t) = (t(log(t))³)⁻¹ for t ≥ 3, with g(1) = g(2) = g(3), we would be running a process equivalent to an instance of Cappé et al. (2013)'s KL-UCB algorithm, which has been demonstrated to be quite performant. If we denote the divergence between Bernoulli distributions with means µ and µ′ as:

kl(µ, µ′) = µ log(µ/µ′) + (1 − µ) log((1 − µ)/(1 − µ′))

and if µ* is the maximum mean of the arms (i.e. the optimal one), then the regret of the algorithm instance can be characterised by Cappé et al.'s proof that the expected number of pulls of any suboptimal arm a is bounded by:

E[N_a(T)] ≤ (log T)/kl(µ_a, µ*) (1 + o(1))
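In the Bernoulli case the index computation of Algorithm 1 is thus explicit: since kl(μ̂, x) is increasing in x on [μ̂, 1), the x solving N · kl(μ̂, x) = log(1/g(t)) can be found by bisection. A minimal Python sketch (the function names are our own):

```python
import math

def kl_bern(p, q):
    """kl(p, q) between Bernoulli(p) and Bernoulli(q), clamped away from 0/1."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def bernoulli_index(mu_hat, N, g_t):
    """Solve N * kl(mu_hat, x) = log(1/g_t) for x >= mu_hat by bisection."""
    target = math.log(1.0 / g_t) / N
    lo, hi = mu_hat, 1.0
    for _ in range(60):            # kl(mu_hat, .) is increasing on [mu_hat, 1)
        mid = 0.5 * (lo + hi)
        if kl_bern(mu_hat, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With g(t) = (t(log t)³)⁻¹ this index reproduces the KL-UCB index of Cappé et al. (2013) for Bernoulli rewards.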

A Kullback-Leibler reformulation
The occurrence of the Kullback-Leibler divergence is no coincidence: if we rearrange Equation 3 we obtain an expression which is the Kullback-Leibler divergence between our derived distribution that maximises the likelihood of the samples given a specific mean x, and the distribution that is most consistent with those samples, maximising the likelihood irrespective of a specific mean (which occurs when μ̂ = µ).
This then raises the question of exactly where our algorithm and KL-UCB coincide, and the answer is that we believe our algorithm is an unexplored special instance of their more abstract and general framework. In their framework they consider a mapping (or projection operator) Π_D which maps the empirical distribution of the historic rewards of a given arm, ν̂_a(t) = (1/N_a(t)) Σ_{s=1}^{t} δ_{Y_s} 1_{A_s=a} (a sum of Dirac deltas corresponding to rewards), to an element of the class D, which is the specific class of distributions that they are considering and comparing via Kullback-Leibler divergences (and which potentially needs appropriate selection). In the application cases that they consider, this mapping mostly occurs via the sample mean, with loss of potential information. Whereas we have created a rather unique case where this mapping can effectively be the identity, and is therefore lossless, as it utilises all available information.
Not only is our algorithm a lossless instance of the KL-UCB framework, but it is also free from any assumptions about the distribution of the data (except that it is bounded above), whereas all other promoted cases of KL-UCB that are developed into computable algorithms assume that the data distribution must be one of a parameterised family of distributions (e.g. Bernoulli, Exponential, Poisson, etc.) within which the Kullback-Leibler divergences can be computed and expressed. This is a restriction which we seem to have bypassed.
However, since our UCB algorithm is an instance of KL-UCB, it also inherits some favourable regret bounds; in particular, regret bounds can be derived from the fact that the number of pulls of a sub-optimal arm a is bounded by:

E[N_a(T)] ≤ (log T)/K_inf(ν_a, µ*) (1 + o(1))

where K_inf(ν, µ) = inf{KL(ν, ν′) : ν′ ∈ D and E(ν′) > µ} reads as the infimum of Kullback-Leibler divergences between the arm distribution ν and distributions in the model D that have expectations larger than µ, which in our case is given by Equation 5. This expression has been identified as a formula for lower and upper bounds on the optimal asymptotic regret of UCB algorithms; see Maillard et al. (2011). Additionally there are some other regret expressions that our algorithm tacitly inherits, such as those given by Garivier et al. (2018), as the analysis and performance of information-theoretic UCB algorithms is an active area of research. However, in order to witness how effective our algorithm is in practice, it is simple enough to run some experiments.

Concluding experiments
Let us conclude with some numerical experiments. In Figure 3 we plot the average cumulative regret over bandit simulations with 5 arms. We considered:
• a UCB1 algorithm (developed in Auer et al. (2002)), which chooses the arm with the highest sample mean of its n historic rewards plus √(2 log(t)/n);
• an untuned UCB-V algorithm (given in Audibert et al. (2009)), which chooses the arm with the highest sample mean plus √(2σ̂² log(t)/n) + log(t)/n (where σ̂² is the biased sample variance of its past rewards);
• our own Algorithm 1 with g(t) = (t(log(t))³)⁻¹; and
• a method of random arm pulling.
In Figure 3 we considered different distributions of rewards for the arms. Particularly, we considered:
1. Beta distributed rewards, where the α and β parameters were uniformly randomly generated between 0.1 and 3.1;
2. uniformly distributed rewards, where the two poles of the reward were uniformly generated within the interval [0, 1];
3. Triangularly distributed rewards, where the mode of the triangle was randomly generated within the interval [0, 1] and its bounds were 0 and 1;
4. Trapezoidally distributed rewards, where the data spans the range [0, 1] and the density is flat between two poles uniformly generated within the range [0, 1].
From these graphs we can see our method performs well across a range of possible data.
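As a sketch of how such experiments can be run, the following Python fragment compares UCB1 against random pulling on Beta-distributed arms. It is a simplified stand-in for the setup of Figure 3; the horizon, run count and seeding here are illustrative choices of ours, not those of the paper.

```python
import math
import random

def simulate(policy, K=5, T=2000, runs=20, seed=0):
    """Average cumulative (pseudo-)regret of a policy over repeated runs,
    with Beta(alpha, beta) reward arms, alpha, beta ~ U[0.1, 3.1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        params = [(rng.uniform(0.1, 3.1), rng.uniform(0.1, 3.1)) for _ in range(K)]
        means = [al / (al + be) for al, be in params]
        best = max(means)
        counts, sums = [0] * K, [0.0] * K
        for t in range(T):
            a = policy(t, counts, sums)
            counts[a] += 1
            sums[a] += rng.betavariate(*params[a])
            total += best - means[a]          # per-step expected regret
    return total / runs

def ucb1(t, counts, sums):
    """UCB1 index: sample mean plus sqrt(2 log(t) / n)."""
    for a, na in enumerate(counts):
        if na == 0:
            return a                          # initial round: pull each arm once
    return max(range(len(counts)),
               key=lambda a: sums[a] / counts[a]
                             + math.sqrt(2.0 * math.log(t) / counts[a]))

def random_policy(t, counts, sums):
    return random.randrange(len(counts))
```

Any of the other policies (UCB-V, Algorithm 1) can be dropped in by supplying a function with the same `(t, counts, sums)` signature (extended with reward histories where needed).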

Appendix A. Derivation of the optimal bound on the mean
In this section we will show, with one (rather large) proof, that among distributions constrained to finite values, the maximising probability of drawing a set of samples together with a specific mean µ has a concise statement. The proof encodes the intuition that probability will be assigned to each of the sample values in proportion to their multiplicity, weighted so as to be consistent with the specified µ, in addition to possibly allocating probability to at most one other extreme unsampled value; the image of this dynamic is easily seen in Figure 1, but the proof is required and is here. Since the dynamic that probability is allocated only to value points which are sampled (plus potentially one other point) holds true irrespective of how many other points exist in the distribution or how densely they are packed, we also assert that this same dynamic occurs in the limit of continuous distributions.
We conduct the optimisation explicitly here, to yield a calculation methodology for the maximum likelihood as a function of the specified mean and the sample values with their multiplicities:

Theorem 4 Supposing that the distribution of our data has a mean µ = 0 and takes unique finite values a_1, a_2, a_3, …, a_n, and we sample those values with multiplicities n_1, n_2, n_3, …, n_n (some of which are non-zero, N = Σ_{i=1}^{n} n_i), then the probabilities α_1, α_2, α_3, …, α_n of each of those values which maximise the likelihood of our having sampled per those multiplicities are given by:

α_i = (n_i/N) · G/(G − a_i)

and hence the likelihood of our sampling is:

(N!/(n_1! ⋯ n_n!)) Π_i [(n_i/N) · G/(G − a_i)]^{n_i}

where there is a point G which is constrained to be in the region G > max(0, max_i a_i) or G < min(0, min_i a_i), and where G is either equal to a data value a_j where n_j = 0 and where α_j = 1 − Σ_i α_i ≥ 0, or satisfies Σ_i n_i a_i/(a_i − G) = 0. (If there is no qualifying G value, the maximum likelihood is zero.) All other values α_k with n_k = 0 (except k = j if it applies) are zero. In this way there are no more non-zero values among α_{i=1,…,n} than there are non-zero multiplicities among n_{i=1,…,n}, plus (potentially) one.
Proof The probability of drawing sample values a_{1…n} with multiplicities n_{1…n} is given by:

P = (N!/(n_1! ⋯ n_n!)) Π_{i=1}^{n} α_i^{n_i}

which is the optimisation objective function we seek to maximise (it is the probability of drawing the samples in any specific sequence, multiplied by the combinatorial number of ways that sequence of draws could occur), subject to the constraints that the population has a mean of zero, that the probabilities add to one, and that the probabilities are between zero and one:

Σ_i α_i a_i = 0,  Σ_i α_i = 1,  0 ≤ α_i ≤ 1

Now the constraint that each of the probabilities is less than one is straightforwardly redundant (as the probabilities are non-negative and add to one). To make the constraint that the probabilities are non-negative redundant, we set β_i² = α_i and use β_i in the algebra instead. And since N and the n_i are constant, to make the optimisation more tractable we consider maximising the substitute function:

Σ_i n_i log(β_i²)

subject to the re-written constraints that:

Σ_i β_i² a_i = 0 and Σ_i β_i² = 1    (6)

If λ_1 and λ_2 are the Lagrange multipliers associated with these two constraints respectively, then the KKT conditions associated with this optimisation problem are:

for each i:  n_i/β_i = λ_1 a_i β_i + λ_2 β_i

By assuming that for all sample points with non-zero multiplicity (i.e. n_i > 0) the corresponding β_i ≠ 0 (since if such a β_i = 0 for multiplicity points then the objective would trivially be zero), we get:

β_i² = n_i/(λ_1 a_i + λ_2)

Therefore, eliminating the possibility of imaginary β_i (and arbitrarily assuming the β_i are positive) gives:

for all n_i > 0:  β_i = √(n_i/(λ_1 a_i + λ_2)) and λ_1 a_i + λ_2 > 0
for all n_i = 0:  β_i = 0 or λ_1 a_i + λ_2 = 0

And since each of the a_i are distinct, at most one of the β_i for which n_i = 0 can be non-zero. And this demonstrates the third clause of our theorem (that there are at most as many non-zero α's as there are non-zero multiplicities, plus one). Now there is a split in the path, according to whether λ_1 is zero or not.
Case 1: (If λ_1 = 0) — then we have an easy time: λ_2 > 0 (as we assume there are some points with non-zero multiplicity) and hence β_i = 0 for any points with zero multiplicity, while those with positive multiplicity have β_i² = n_i/λ_2. Substituting into our constraints (6) gives:

λ_2 = N and hence μ̂ = Σ_{i=1}^{n} a_i n_i / N = 0

which identifies the convenient case where the average of our samples and the population mean coincide; hence the constraint that the mean is zero bears no pressure on the solution, which would be the result if that constraint did not exist (in this way this solution is maximal), and where:

α_i = n_i/N    (7)

End Case 1

However, if λ_1 ≠ 0 then we can sensibly substitute a ratio G between our Lagrange multipliers as λ_2 = −λ_1 G, and hence:

for all n_i > 0:  β_i² = n_i/(λ_1 (a_i − G)) and λ_1 (a_i − G) > 0    (8)
for all n_i = 0:  β_i = 0 or a_i = G

from which it is easy to see that if λ_1 > 0 then G < min_{i: n_i>0} a_i, and if λ_1 < 0 then G > max_{i: n_i>0} a_i. Now there is a second split in the path, according to whether there exists a non-zero β_j where n_j = 0, or not:

Case 2: (If λ_1 ≠ 0 and there exists an a_j where n_j = 0 and β_j > 0) — then G = a_j, and we can re-write our constraints (6) as:

Σ_i n_i a_i/(λ_1 (a_i − a_j)) + β_j² a_j = 0 and Σ_i n_i/(λ_1 (a_i − a_j)) + β_j² = 1

Subtracting the first from a_j times the second gives λ_1 a_j = −N. From this fact it must be true that a_j ≠ 0 and also that λ_1 and a_j must have opposite sign; hence we can also propagate the earlier constraint, that if a_j < 0 then a_j < min_{i: n_i>0} a_i, and if a_j > 0 then a_j > max_{i: n_i>0} a_i. Now we can substitute our relation between λ_1 and a_j back into (8) to get:

β_j = √(1 − Σ_i n_i a_j/(N(a_j − a_i)))    (9)

The requirement that β_j is not complex leads to real constraints on what could count as a plausible candidate a_j. But supposing all these requirements are met for a candidate a_j (per our hypothesis in Case 2), then our solution is given by:

α_i = n_i a_j/(N(a_j − a_i)),  α_j = β_j²    (10)

End Case 2
And by exclusion between Cases 1 and 2, there is only the remaining case:

Case 3: (If λ_1 ≠ 0 and there does not exist an a_j where n_j = 0 and β_j > 0) — in this context the only non-zero β_i are those which have some multiplicity n_i > 0, and we can re-write our second constraint from (6) as:

Σ_i n_i/(λ_1 (a_i − G)) = 1 and therefore λ_1 = Σ_i n_i/(a_i − G)

From this it is clear that if G < min_{i: n_i>0} a_i then λ_1 > 0, and also that if G > max_{i: n_i>0} a_i then λ_1 < 0, and so our earlier constraint is redundant. Thus we can re-write our first constraint from (6) as:

Σ_i n_i a_i/(a_i − G) = 0    (11)

In this context a suitable G may or may not exist, and solving for it is a bit more difficult, but insofar as it does exist we can progress. Consequently:

α_i = [n_i/(a_i − G)] / [Σ_k n_k/(a_k − G)]    (12)

End Case 3

From these cases we notice that the results of the three cases overlap at their limits. Particularly, if μ̂ = Σ_i a_i n_i / N = 0 = µ (i.e. if the sample mean is equal to the mean) then the conditions of Case 1 are satisfied, the condition of Case 3 (Equation 11) cannot be satisfied by any real G but is satisfied in the limit that G goes to infinity (positive or negative), at which point λ_1 in Case 3 equals λ_1 in Case 1, and consequently the solutions of Cases 1 and 3 (Equations 7 and 12) are the same. Equally, if G → ∞ then β_j of Case 2 is zero (per Equation 9 with a_j = G), and the solutions of Case 2 (Equation 10) match those of Cases 3 and 1. Thus if the mean of the samples is zero, then all three cases overlap (or at least their limits do), and the solution is comfortably given by Equation (7), or equivalently by Equations 10 or 12 with a_j, G → ∞.
We also see that Cases 2 and 3 overlap in a particularly interesting way: Case 3 is Case 2 in the context where a hypothetical point a* = G is selected so as to be allocated probability zero. So, for instance, if the hypothetical β* in Equation (9) is zero, then the associated a* = G satisfies the requirement of Case 3's Equation 11, and then the expressions of maximum probability for Cases 2 and 3 coincide (Equations 10 and 12).
All of this simply identifies that Cases 1 and 3 are actually just interesting limit instances of Case 2, and that Case 2's Equation 10 constitutes the more general expression of the solution; all of these considerations come to yield the theorem statement.
This completes the proof.
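The closed form of Theorem 4 can be sanity-checked numerically. The snippet below assumes our reading of the (partly garbled) statement, namely α_i = (n_i/N) · G/(G − a_i) with the leftover mass placed on the extra point G itself (the a_j of Case 2); with that reading, the sum-to-one and mean-zero constraints hold automatically for any admissible G:

```python
def alphas(a, n, G):
    """Candidate maximising probabilities per our reading of Theorem 4:
    alpha_i = (n_i / N) * G / (G - a_i), leftover mass on the extra point G."""
    N = sum(n)
    al = [(ni / N) * G / (G - ai) for ai, ni in zip(a, n)]
    return al, 1.0 - sum(al)

# samples -1, -1, 0.5 (mean-zero setting; sample mean -0.5 < 0, so G > 0)
a, n, G = [-1.0, 0.5], [2, 1], 2.0
al, aG = alphas(a, n, G)
assert abs(sum(al) + aG - 1.0) < 1e-12                          # sums to one
assert abs(sum(x * p for x, p in zip(a, al)) + G * aG) < 1e-12  # mean is zero
```

Perturbing (α_1, α_2, α_G) along any direction that preserves both constraints does not increase the product Π_i α_i^{n_i}, consistent with the claimed optimality.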
In this theorem statement we have a G that is left uncharacterised; we will now show that the G that maximises the likelihood is unique.

A.1. Characterising the G
In Theorem 4 we have a total expression of the solution conditioned on the selection of an appropriate G value; hence our analysis turns around the function that determines the existence of an appropriate G, specifically the function:

f(x) = Σ_i n_i a_i/(a_i − x)    (13)

And we are interested in characterising the points in the regions x > max(0, max_i(a_i)) and x < min(0, min_i(a_i)). For simplicity we will only characterise the positive region x > max(0, max_i(a_i)), in the knowledge that it also implicitly characterises the negative region by symmetry (i.e. the logic for the negative region simply follows by reversing the signs of the a_i and x in every step).
Theorem 5 Consider function (13) for arbitrary finite numbers a_i and corresponding non-negative integer multiplicities n_i (some of which we assume form non-zero pairs), in the region x > max(0, max_i(a_i)). If there exists a positive a_i, then there exists exactly one finite zero-crossing point of f iff Σ_i a_i n_i < 0, and only points x greater than this zero-crossing point will have positive f(x); if Σ_i a_i n_i = 0 then f(x) will cross in the limit x → +∞; and in all other cases f(x) will be negative. Otherwise, if there does not exist a positive a_i, all points x in the region will have positive f(x).
Proof Supposing that there exists a positive a_i, there exists a maximally positive a_k, and thence we are constrained to x > a_k. In this case we consider the function g(x) = x f(x) and identify that g will have the same zero-crossing points as f and will have the same sign as f in our region of interest (x > a_k > 0). We hence aim to show that if Σ_i a_i n_i < 0 then g has one zero-crossing point and is positive for all points greater than it, and that if Σ_i a_i n_i ≥ 0 then there is no finite zero-crossing point and no positive points in the region. We begin by noting that g is monotonically increasing, since

g′(x) = Σ_i n_i a_i²/(a_i − x)² ≥ 0,

and that in our region of interest g begins at −∞ and goes to −Σ_i a_i n_i, i.e.

lim_{x→a_k⁺} g(x) = −∞ and lim_{x→+∞} g(x) = −Σ_i a_i n_i.

The result of Theorem 5 (applied both positively and negatively) informs us that in the context of our big Theorem 4 there are 5 cases:
1. The sample mean μ̂ ∝ Σ_i a_i n_i = 0, and therefore the only satisfying G is G → ∞.
2. The sample mean μ̂ ∝ Σ_i a_i n_i < 0 and there does exist a positive a_k; hence there is a minimum satisfying G which is positive, and all points more positive than it also satisfy.
3. The sample mean μ̂ ∝ Σ_i a_i n_i < 0 and there does not exist a positive a_k; hence all positive G values satisfy.
4. The sample mean μ̂ ∝ Σ_i a_i n_i > 0 and there does exist a negative a_k; hence there is a maximum satisfying G which is negative, and all points more negative than it also satisfy.
5. The sample mean μ̂ ∝ Σ_i a_i n_i > 0 and there does not exist a negative a_k; hence all negative G values satisfy.
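Reading Equation (13) as f(x) = Σ_i n_i a_i/(a_i − x) (our reconstruction of the garbled formula), the minimum satisfying G of case 2 in the list above is the zero-crossing of f, which bisection finds easily once the sign conditions of Theorem 5 hold:

```python
def f(x, a, n):
    """Equation (13), read as f(x) = sum_i n_i * a_i / (a_i - x)."""
    return sum(ni * ai / (ai - x) for ai, ni in zip(a, n))

def min_satisfying_G(a, n, iters=80):
    """Case 2 above: sum_i a_i n_i < 0 with some a_i > 0, so f has exactly
    one zero-crossing right of max(0, max(a_i)); bisect for it."""
    assert sum(ai * ni for ai, ni in zip(a, n)) < 0 and max(a) > 0
    lo = max(0.0, max(a)) + 1e-9   # f -> -infinity just right of the largest a_i
    hi = lo + 1.0
    while f(hi, a, n) < 0:         # f tends to a positive limit, so expand right
        hi = lo + 2.0 * (hi - lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid, a, n) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, with sample values −1 (twice) and 0.5 (once), f(x) = 2/(1 + x) − 0.5/(x − 0.5), whose unique crossing in the region is at x = 1.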
The last piece of the puzzle is to prove that, where there exists a selection of satisfying values of G, it always results in a greater likelihood (i.e. is more optimal) to choose the most extreme one (positive or negative with respect to zero). In the case where μ̂ ∝ Σ_i a_i n_i = 0 it is trivial that G → ∞ is the most extreme possible point, however it remains to be proven for the other cases. We will explicitly treat the case where Σ_i a_i n_i < 0, and hence where there is a continuous range of positive satisfying G with G > max(0, max_{i: n_i>0} a_i), per the requirements of Theorem 4; we do this in the knowledge that it also implicitly characterises the negative region by symmetry.
Theorem 6 In the context of Theorem 4, where there is a selection of appropriate positive G values, the maximum likelihood always occurs with the selection of the most positive one.
Proof In the statement of Theorem 4 the maximum likelihood is given by:

p_max(G) = (N!/(n_1! ⋯ n_n!)) Π_i [(n_i/N) · G/(G − a_i)]^{n_i}

It suffices to show that the derivative of this with respect to G is non-negative. We show this via the substitute function log p_max(G), whose derivative is:

d/dG log p_max(G) = Σ_i n_i (1/G − 1/(G − a_i)) = f(G)/G

which is non-negative on the satisfying region (where G > 0 and f(G) ≥ 0).

Taking Theorem 4, combining it with the 5 cases that result from Theorem 5 and the consideration of Theorem 6, taking the limit of continuous functions, and offsetting for a non-zero mean µ, gives the result of Theorem 1.

Appendix B. Summary Derivation of Hoeffding's inequality
We begin with a lemma that captures the root of all Chernoff bounds, which is then directly used in the derivation of Hoeffding's inequality:

Lemma 7 (Chernoff Bound) If μ̂ is the sample mean of n independent and identically distributed samples of a random variable X, then for any s > 0 and t:

P(μ̂ ≥ t) = P(exp(s Σ_i X_i) ≥ exp(snt)) ≤ E[exp(sX)]^n exp(−snt)

using Markov's inequality and the i.i.d. property of the samples respectively.
Theorem 8 (Hoeffding's inequality for mean zero) Let X be a real-valued random variable bounded a ≤ X ≤ b, with a mean µ of zero. Then for D = b − a and any t > 0, the mean μ̂ of n independent samples of X is probability bounded by:

P(μ̂ ≥ t) ≤ exp(−2nt²/D²)

Proof To prove Hoeffding's inequality we develop an upper bound for E[exp(sX)]. If we assume the variable X has a probability density function f(x), then since exp(sx) is convex we can linearise it on [a, b] as:

exp(sx) ≤ ((b − x)/(b − a)) exp(sa) + ((x − a)/(b − a)) exp(sb)

Using the fact that the mean µ = ∫_a^b f(x) x dx = 0, this gives:

E[exp(sX)] ≤ (1/(sb − sa)) (sb exp(sa) − sa exp(sb))

Given the fact that for any κ > 0, γ < 0:

κ exp(γ) − γ exp(κ) ≤ (κ − γ) exp((κ − γ)²/8)    (15)

thus (with κ = sb and γ = sa):

E[exp(sX)] ≤ exp((1/8) s²(b − a)²)

Applying our Chernoff bound Lemma 7 we get:

P(μ̂ ≥ t) ≤ exp((1/8) s²(b − a)² n − snt)

and minimising with respect to s (at s = 4t/D²) yields the required result.
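Theorem 8 is easy to check by Monte Carlo. The sketch below samples a mean-zero uniform variable on [−1, 1] (so D = 2, an example distribution of our choosing) and compares the empirical exceedance frequency with the bound exp(−2nt²/D²):

```python
import math
import random

def hoeffding_check(n=30, t=0.2, trials=20000, seed=1):
    """Monte Carlo check of Theorem 8 with X uniform on [-1, 1] (mean zero,
    D = 2): empirical P(sample mean >= t) versus exp(-2 n t^2 / D^2)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials)
               if sum(rng.uniform(-1.0, 1.0) for _ in range(n)) / n >= t)
    return hits / trials, math.exp(-2.0 * n * t * t / 4.0)
```

The empirical frequency sits well below the bound, which is loose here precisely because it ignores the small spread of the uniform distribution.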
At first glance the most limiting feature of the derivation given is the requirement that the mean is zero; however this is ultimately immaterial, and was used intentionally to simplify the derivation, because any data distribution can be shifted such that its expectation becomes zero, leaving D unchanged. Hence:

Theorem 9 (Hoeffding's inequality) Let X be a real-valued random variable bounded a ≤ X ≤ b. Then for D = b − a and any t > 0, the mean μ̂ of n independent samples of X is probability bounded by:

P(μ̂ − µ ≥ t) ≤ exp(−2nt²/D²)    (16)

However we can do without the simplifications to develop a more powerful inequality:

Theorem 10 Let X be a real-valued random variable bounded a ≤ X ≤ b, with a mean µ of zero. Then for t > 0, the mean μ̂ of n independent samples of X is probability bounded by:

P(μ̂ ≥ t) ≤ (b exp(sa) − a exp(sb))^n exp(−snt) (b − a)^{−n} at s = (1/(b − a)) log(b(a − t)/(a(b − t)))

Proof Similar to the proof of Theorem 8, we follow the same steps except that we do not apply Equation 15, leading to:

P(μ̂ ≥ t) ≤ (b exp(sa) − a exp(sb))^n exp(−snt) (b − a)^{−n}

and minimising with respect to s occurs at s = (1/(b − a)) log(b(a − t)/(a(b − t))), which is the required result.
This concentration inequality is more powerful but more difficult to manipulate; it is also more commonly stated for a variable X with non-zero mean, bounded 0 ≤ X ≤ 1. See Chvátal (1979); Hoeffding (1963):

Theorem 11 (also called Hoeffding's inequality) Let X be a real-valued random variable bounded 0 ≤ X ≤ 1, with mean µ. Then for t > 0, the mean μ̂ of n independent samples of X is probability bounded by:

P(μ̂ − µ ≥ t) ≤ [(µ/(µ + t))^{µ+t} ((1 − µ)/(1 − µ − t))^{1−µ−t}]^n

Proof Follows by substituting a = −µ and b = 1 − µ.
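To see how much tighter this form is than the basic form of Theorem 9, one can compare the two bounds numerically; the KL-form expression below is our reconstruction of the Theorem 11 statement:

```python
import math

def hoeffding_basic(n, t):
    """Theorem 9 bound for X in [0, 1] (so D = 1)."""
    return math.exp(-2.0 * n * t * t)

def hoeffding_kl(mu, n, t):
    """Our reconstruction of the Theorem 11 bound (the Chernoff/KL form)."""
    q = mu + t
    return ((mu / q) ** q * ((1.0 - mu) / (1.0 - q)) ** (1.0 - q)) ** n
```

For example, with µ = 0.1, n = 20 and t = 0.2 the KL form gives roughly 0.05 against roughly 0.20 for the basic form; the advantage grows as µ moves away from 1/2.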
distributed (values of −1 and 1 with equal probability) on a hidden coin toss: supposing we take three samples and receive a sample mean of −1/3, we should then be absolutely certain that the distribution is Rademacher and that the mean is zero, even though a naive application of Hoeffding's inequality would suggest only that P(µ − μ̂ ≥ 1/3) ≤ exp(−1/6) ≈ 0.85. Concentration inequalities gain validity in UCB primarily by being unconditioned probability statements between population statistics and sample statistics, which can be used to give expected regret bounds prior to the occurrence of sample rewards. The intention of this paper is to take the concept of what Hoeffding's inequality is, in the context of UCB, and make it a tad stronger by using more than just the sample mean.