Stratified Finite Empirical Bernstein Sampling

We derive a concentration inequality for the uncertainty in the mean computed by stratified random sampling, and provide an online sampling method based on this inequality. Our concentration inequality is versatile and considers a range of factors including: the data ranges, weights, sizes of the strata, the number of samples taken, the estimated sample variances, and whether strata are sampled with or without replacement. Sequentially choosing samples to minimize this inequality leads to a online method for choosing samples from a stratified population. We evaluate and compare the effectiveness of our method against others for synthetic data sets, and also in approximating the Shapley value of cooperative games. Results show that our method is competitive with the performance of Neyman sampling with perfect variance information, even without having prior information on strata variances. We also provide a multidimensional extension of our inequality and discuss future applications.


Introduction
Stratified sampling is a statistical method for estimating the mean of a population by partitioning it into mutually exclusive subgroups, or strata, and applying a sampling estimator to each stratum, before weighting and combining these estimates to form an estimate of the population mean. For example, to poll the population of a country's support for a particular government policy, we can selectively poll the different demographic regions within the country. For instance, if we know that regions A,B and C contain 10%, 40% and 50% of the population, and sampling shows support levels for a policy of 2%, 70% and 30%, respectively, then we can estimate that 43.2% of the total population supports the policy.
Stratified sampling can lead to improved reliability in estimation under certain conditions, such as: when the population is easily divided into strata, in which there is less variance in each stratum than across them all; when the size of the strata are known, and; when sampling selectively from each strata is possible [1,2]. If it is possible to sample selectively between the strata, then there is a further question of how to conduct that sampling most efficiently.
In this paper we propose a process of sampling in order to maximally reduce the uncertainty in the population mean estimate, and to do this we develop a expression associated with that uncertainty. The expression takes the form of a concentration inequality, specifically, a stratified empirical Berstein bound (SEBB), developed under the assumption that the data values have bounded support. This inequality considers factors such as: the sizes of all the strata and the proportion of each that are sampled; the sample variances of the samples from each of the strata; the differences in the range of data of each strata; any additional importance weightings on the strata, and; whether any (or all) of the strata are sampled with or without replacement.
Using this inequality, we then propose an online method for sequentially sampling in order to maximally reduce this bound at each iteration, called the stratified empirical Berstein method (SEBM). By numerically evaluating the SEBM, we demonstrate its value in sampling and estimation applications. Moreover, we show that it can assist in computational tasks -particularly those that involve the calculation of expectation values, as sampling is a straightforward way of approximating them. Specifically, in this work, we consider the calculation of the Shapley value, a solution concept from cooperative game theory, as a computational task to which we can apply our sampling method and demonstration of its utility. Taken together, these results show that SEBM is competitive with the widely-used Neyman sampling method, that assumes perfect variance information, even when SEBM does not have prior information on strata variances.
The remainder of the paper is as follows. The next section reviews background material and sets the context of the paper. Section 3 provides several lemmas that are components of our derivation. Building on this, in Section 4 we provide the full derivation of our concentration inequality, SEBB, and online sampling method, SEBM, which are the main technical contributions of the paper. For evaluation, Section 5 examines the effectiveness of SEBM in the context of synthetic data sets, while Section 6 evaluates its performance on the task of approximating the Shapley value via sampling estimation. Section 7 discusses the results of these two sets of numerical evaluations, and analyze the reasons for the effectiveness of our sampling method. Finally, Section 8 gives an extension of SEBB and SEBM to multidimensional data. Section 9 concludes.
Stratified sampling is often conducted as a two-stage process, particularly when it is unclear how to stratify the population, or how large the resulting strata would be. Under this process, the first stage consists of sampling the population uniformly at random, and collecting the values of readily observable auxiliary variables, in order to estimate the sizes of potential strata by those variables. In the second stage, the strata are chosen and sampled with respect to the information gathered in the first stage, and the total population estimate is computed [9].
One well-known method of estimating efficient strata sizes is via inverse probability weighting methods such as the Horvitz-Thompson estimator [10]. However, these estimators are sometimes seen to perform quite badly in practice [11,12]. Moreover, these estimator do not directly address the prior problem of how to optimally break the population into strata based on the values of the auxiliary variables [3,13,14].
In other situations, the strata and their sizes are naturally given, or the first stage of the two-stage process above may be assumed to have been conducted ideally. In these situation, there exists a further problem of how to allocate the finite number of second-stage samples between the strata. For instance, one could choose to sample equally between strata, proportional to the sizes of the strata, or proportional to the variance of the strata. The last option is often considered in theory and practice, and is called Neyman allocation (sometimes called 'optimum' allocation) [1,2]. This approach assumes knowledge of the variances of the strata; however, in practice it is often the case that strata variance can only be estimated, either as part of the first stage or as the sampling proceeds [15,16].
Finally, even once the samples are taken from the various strata, there is still the question of how to compute appropriate confidence bounds on the final estimate. In the voting verification context, there exist some specialized bounds [5,17], but in the more general case there is some degree of discussion of what bounds should be used [4]. The confidence bounds that are derived critically depend on what assumptions are made about the underlying populations.
In particular, Hoeffding's inequality [18] is used as a bound under the assumption that the underlying population has bounded data support, or are drawn from a finite interval [4,17]. Hoeffding's inequality can be used to produce a very conservative confidence interval that is sensitive only to the width of the data value bounds and the number of samples taken, and it is also most directly suitable for sampling with replacement. In contrast, other concentration inequalities, such as Chebychev's inequality, are sensitive to the sample variance but not the width of the data.
Recently, Maurer and Pontil [19] developed a bound, which they named as an empirical Bernstein bound (EBB), as a concentration of measure for the sample mean of a single (unstratified) population (some similar bounds being published about that time, [20,21]). EBBs are is sensitive to the data width and sample variance. They have replaced Hoeffding's inequality in a number of computational applications [22][23][24][25]. The derivation of Maurer and Pontil's EBB has been extended to entropic [26] and Chernoff concentration inequalities, which are combined using union bounds.
Beyond this, sampling without replacement offers the opportunity to further tighten the bounds over the sampling-with-replacement case. For example, the refinement that is possible was first demonstrated by Serfling [27] with a martingale argument. More recently, Bardenet and Maillard [28] improved on Serfling's result with a reverse martingale argument, and created an EBB suitable to the case of sampling without replacement.
Our key observation is that the components of these analyses can be combined together to create a closed form analytical concentration inequality tailored for stratified random sampling, which is a primary subject of this paper.

Preliminaries
We now state nine lemmas across the next three sections, which we use later to derive our stratified empirical Bernstein bound (SEBB) and method (SEBM). Specifically, in Section 3.1, we provide three lemmas, which show how probability bounds are related to the moment generating function, how probability bounds can be combined, and a useful algebraic relationship regarding the sample variance. Then, in Section 3.2, we provide three bounds on the moment generating functions of random variables. Last, in Section 3.3, we derive three lemmas that relate the moment generating function to the sample means of random variables with and without replacement.

Fundamental results
The first lemma is an often-used and rather weak result used to fuse simple statements of probability: Lemma 3.1 (Probability Union). For any random variables a, b, c: This relationship is a well known and useful tool for settings where the probability relationship between a and c is unknown but the relationship between a and some b, and also between that b and c is known. The next lemma is a straightforward result of algebra that relates the sample squares about the mean to the sample variance.
Lemma 3.2 (Variance Relation). Let X be a random variable with mean µ. For n samples of X, {x k } k=1,...,n , the sample mean,μ = 1 n k x k , biased sample variance, σ 2 = 1 n k (x k −μ) 2 , and average of sample squares about the mean, This result is used later to create bounds for the sample variance from bounds on the sample squares about the mean. In order to create such probability bounds, we make repeated use of the next, which extends directly from Markov's inequality and encompasses a range of inequalities called Chernoff bounds: . For a random variable X, and for any s > 0 and t: Many well-known inequalities follow from upper bounds for E [exp(sX)], also known as the moment generating function.

Bounds on the Moment Generating Function
The next three lemmas give three of these upper bounds for moment generating functions, from which we create our probability inequalities of interest. The first is well known and sometimes called Hoeffding's Lemma [18] and is stated here without proof: Lemma 3.4 (Hoeffding's Lemma). For a random variable X that is of finite support on the interval a ≤ X ≤ b, with width D = b − a, and for any s > 0: The second lemma is very much like Hoeffding's Lemma, except it involves information about the variance of the random variable. The proof of this result is included because it is useful in explaining our own approach.
Lemma 3.5. For a random variable X that is bounded on an interval a ≤ X ≤ b with D = b − a and variance σ 2 , and any s > 0: Proof. We assume without loss of generality that X is centered to have a mean of zero. Then we construct an upper bound for E [exp(sX)] in terms of D by a parabola over exp(sX) for the permitted values of X.
There exists an α, β, γ (see AppendixA) such that αs 2 X 2 + βsX + γ ≥ exp(sX), and for all a ≤ X ≤ b: Where it follows that: The expression in (1) is monotonically increasing with b, and D > b, therefore: Given that for any κ, x ≥ 0, that: Thus letting κ = σ 2 D 2 and x = s(D + σ 2 /D) if follows that: We note that this process of fitting a parabola over the exponential function bears a strong conceptual relationship with a famous bound developed by Hoeffding [18] and Bennett [29].
The third bound that we present, on the moment generating function, is similar again, however this time we consider the random variable X 2 instead of X. Lemma 3.6. Let X be a random variable of finite support on an interval a ≤ X ≤ b, Then for any q > 0: Proof. We assume without loss of generality (and for ease of presentation) that X is centered to have a mean of zero. We construct an upper bound for E exp(−qX 2 ) in terms of D by a parabola over exp(−qX 2 ) for the permitted values of X.
The three inequalities above, Lemmas 3.4,3.5 and 3.6, are used in the derivation of our stratified sampling concentration inequality in Section 4.

The Moment Generating Function of Sample Means
In order to use the previous bonds on the moment generating function we need an inequality relating the moment generating function of a random variable, with the moment generating function of the average of samples of that random variable. To do this we state three further inequalities, where the first one (Lemma 3.7) is most appropriate for sampling with replacement, and the second and third (Lemma 3.8 and Lemma 3.9) can optionally be used in the context of sampling without replacement. Lemma 3.7 (Replacement Bound). Let X be a random variable that is bounded a ≤ X ≤ b with a mean of zero, with D = b − a and variance σ 2 . Let χ m = 1 m m i=1 X i be the average of m independently drawn (with replacement) samples of this random variable. If there exists an α, β ≥ 0 such that for any s > 0 that E[exp(sX)] ≤ exp((αD 2 + βσ 2 )s 2 ) then: where Ω n m = Ψ n m = 1 m Proof. By the independence of samples, we have: Thus: These inequalities are sufficient for all the further derivations that we conduct. However, for the case of sampling without replacement, there is an alternative result that can be directly substituted, given in Lemma 3.9, below, which can be tighter in certain cases. We give its form and derivation, which is included for completeness but is not part of the main results presented in the paper. Before this, particular note must be made that the inequality above, Lemma 3.7 can be used in the context of either sampling with or without replacement. In contrast, Lemma 3.9 can only be used when sampling without replacement. This distinction was shown to be true by Hoeffding [18], and is stated here without proof: Lemma 3.8 (Hoeffding's reduction). let X = (x 1 , . . . , x n ) be a finite population of n real points, let X 1 , . . . , X n denote a random sample without replacement from X and Y 1 , . . . , Y n denote a random sample with replacement from X. If f : R → R is continuous and convex, then: we now state an inequality regarding the moment generating function of the average of samples taken specifically without replacement.
When the sampling takes place without replacement the inequality of Lemma 3.7 can potentially be improved to take advantage of the finite size of the population. This inequality extends an important martingale inequality from [28]: and has a mean of zero and variance σ 2 = 1 n n i=1 x i , denote X 1 , X 2 , . . . , X n the random variables corresponding to the data sequentially drawn randomly without replacement, and χ m the average of the first m of them. If for any random variable Z with a mean of zero such that a ≤ Z ≤ b and D = b − a, with variance σ 2 Z that there exists an α, β ≥ 0 such that for any s > 0 that E[exp(sZ)] ≤ exp((αD 2 + βσ 2 Z )s 2 ) then: Proof. Observe that: Then because: we also have that: by repeated application of the Law of total expectation. Since: then χ k+1 − X k+1 is a random variable with a mean of zero bounded within width D, and it also has a variance given by: by application of Lemma 3.2. Therefore: This martingale result relates the moment generating function bound of the average of finite variables relative to their mean, to the moment generating function bounds of the differences of the incremental averages relative to their mean. We note that this result could be made much stronger by working around the use of Equation (4), but this comes at a cost of increased mathematical complexity.
Since Lemmas 3.9 and 3.7 share a common form, and because of Hoeffding's reduction (Lemma 3.8), all the derivations that follow that invoke Lemma 3.7 have direct analogues using Lemma 3.9 for the context of sampling without replacement. Note, however, that the bound without replacement (Lemma 3.9) may or may not be tighter than the bound with replacement (Lemma 3.7). However, the process of substituting one for the other can be done judiciously on a case-by-case basis to create the tightest possible bound. All the numerical results in this paper (relevant to sampling without replacement) have been produced with this judicious choice conducted.

The Stratified Finite Empirical Bernstein Bound and Sampling Method
This section contains our main results: we derive a novel probability bound for the error of the stratified random sampling estimate, and use it to define a sequential stratified sampling algorithm. Before this, we begin by precisely defining the context of our derivations, to which our bound applies.
Definition 4.1 (Problem context). Let a population consist of n number of strata of finite data points, where n i is the number of data points in the ith stratum. All values in a stratum are bound within a finite support of width D i . Denote the mean and variance of the ith stratum µ i and σ 2 i , respectively. In this context, denote values randomly and sequentially drawn (with or without) replacement by X i,1 , X i,2 , . . . , X i,ni . Then, for the first m i of these sample: . We are interested in the average of the means of the strata as weighted by constant positive factors {τ i } i∈{1,...,n} . Throughout our derivation, we temporarily use arbitrary positive variables {θ i } i∈{1,...,n} .
Given this context, the following two sections contain the derivation of the stratified empirical Berstein bound (SEBB) and the sequential sampling method (SEBM), respectively.

Bound derivation
The bound is now developed in four theorems, which build on each other in sequence: (1) Theorem 4.2 develops a concentration inequality for the error in the stratified population mean estimate n i=1 τ i χ i,mi in the context of variance information. We begin with an expression for a probability bound on the absolute error of the weighted stratified sample means about the weighted strata means, which we call a variance-assisted SEBB (stratified empirical Bernstein bound).
Theorem 4.2 (Variance-assisted SEBB). Assuming the context given in Definition 4.1, and let Ω ni mi and Ψ ni mi be given as in Lemma 3.7, then: Proof. Applying Lemma 3.3: by independence of the sampling between the strata. This form is sufficient for Lemma 3.7 with Lemma 3.5 to apply, resulting in a double-sided tail bound: Minimizing with respect to s and rearranging gives result.
In most cases, the weights τ i can be considered as the probability weights τ i = n i /( n j=1 n j ), and in this context this probability bound can be used as-is for a measure of uncertainty in stratified random sampling if the true variances (or alternatively, upper bounds on the true variances) of the strata are known. However, in other contexts, the weighted sum of variances must be estimated from the data collected, and to include this factor we develop and incorporate a probability bound for the estimate of the sum of variances (as weighted by arbitrary θ i ), as follows.
Proof. To create a probability bound for the sum of variances (weighted by arbitrary positive θ i ), we consider the average square of samples about the strata means. Applying Lemma 3.3 gives: by independence of the sampling between the strata. This is sufficient for Lemma 3.7 with Lemma 3.6 to apply, giving: Minimizing with respect to s, rearranging, and applying Lemma'3.2 gives result.
This inequality gives the probability bound between the weighted variances of the strata, the weighted (biased) sample variances and the weighted square error of the sample means. Although the weighted square error of the sample means may go to zero quickly as additional samples are taken, we nonetheless develop another probability bound to incorporate specific consideration of it.
Proof. We consider the weighted square error of the sample means: such that r i = r, by independence of the sampling and probability complementarities. This is sufficient for us to apply Lemma 3.3 together with Lemmas 3.7 and 3.4, giving: Next, choosing r i to minimize this expression gives: Thus: Using log(1 − (1 − exp(x)) n ) ≤ x + log(n) for negative x, and rearranging, gives result.
This theorem bounds the weighted square error of the sample means. In the next, and final, step we combine the inequalities of Equations (5), (6) and (7) together, to complete our derivation of the SEBB. where: Proof. By widening the bound of Equation (6) we get: Completing the square gives for n i=1 θ i σ 2 i gives: Combining with Equation (7) with a union bound (Lemma 3.1) gives: which is a bound for the weighted sum variances in terms of the sample variances. Letting θ i = 1 2 τ 2 i Ψ ni mi and combining with (5) with a union bound (Lemma 3.1), and then assigning r = t = y = p/3 and rewriting in terms of unbiased sample variance, gives the result.
This completes the derivation. In Equation (8) of Theorem 4.5, we have a concentration inequality for the sum of weighted strata sample mean errors relative to the sample variances. In this context, the weights τ i are flexible but would naturally be probability weights proportional to strata size, τ i = n i /( n j=1 n j ), in which case the inequality provides a concentration of measure in stratified random sampling. Based on this bound, we proceed to propose an online process of sequentially choosing samples from the strata in order maximally minimize it.

Sequential Sampling Using the Stratified Empirical Berstein Method
We introduce a method of sampling, the stratified empirical Bernstein method (SEBM) which sequentially minimizes the bound in Theorem 4.5 (SEBB). Pseudocode for the calculation of the bound and the process of sampling to minimize it, is given in Algorithm 1.
Specifically, Algorithm 1 is a repetitive process involving a scan through the possible strata and then the selection of one stratum to sample from to minimize the SEBB under mild assumptions. The process of scanning involves calculating the confidence bound width (SEBB) that would result if an additional sample were to be taken from that stratum without changing its sample variance (line numbers 5-17 in Algorithm 1). The stratum that yields the smallest confidence bound width in the context of an additional sample is then selected (line [18][19][20][21] and sampled (line 24), the sample variance of that stratum is updated (line 26); this process repeats until the maximum sample budget is reached (per the outer loop, line 1). In this way the process attempts to iteratively minimize the SEBB in expectation with each additional sample taken; and hence lead to potentially greater accuracy in stratified sampling as a result.
We note that computing the SEBB requires the sample variances of all the strata having been calculated. Accordingly, Algorithm 1 must be initialized with at least two samples from each stratum so that sample variance can be calculated. This is a standard requirement of the many reinforcement learning algorithms that use variance in their sampling policies.
Algorithm 1 describes a process specific to sampling without replacement and involves the calculation of the SEBB with the tightest possible uses of Lemmas 3.9 and 3.7. In particular, for any stratum i that is sampled without replacement, any specific bound with an associated Ω ni mi and Ψ ni mi may be substituted forΩ ni mi and Ψ ni mi to potentially tighten the bound, and this corresponds to choice of Lemma 3.9 or Lemma 3.7 in the bound's derivation. Since the SEBB is a composition of such bounds with such choices throughout, there is a structure of valid pairs of substitutions Ω, Ψ forΩ,Ψ in the optimal calculation of the SEBB, which is shown in the steps 8-15 of Algorithm 1. The equivalent algorithm for sampling with replacement simply is the same algorithm altered by replacing all use ofΩ,Ψ with Ω, Ψ.
In the next Section 5 we evaluate the performance of SEBM with other methods in the context of synthetic data.  boundwidth ← log(6/p) min j (d j + ( c j + a j + b j + b j ) 2 )

Numerical Evaluation
In this section assess the value of SEBM as an online method of sampling from stratified data. First we outline the benchmark algorithms used to evaluate our method's performance. Then in Section 5.2 we describe two synthetic data sets and report the distribution of errors under our method and the benchmarks. Following this, in Section 6, we evaluate our method in an example application -that of calculating the Shapley value of a cooperative game. Discussion and analysis of all the numerical results is left to Section 7.

Benchmarks algorithms
In the numerical evaluations, we compare the following sampling methods: • SEBM (Stratified empirical Bernstein method, without replacement): our SEBM method (per Algorithm 1) of iteratively choosing samples from strata to minimize the SEBB, given in Equation (8). An initial sample of two data points from each strata is used to initialize the sample variances of each, with additional samples made to maximally minimize the inequality at each step. All samples are drawn without replacement. • SEBM-W (Stratified empirical Bernstein method with replacement): as above, with the exception that all samples are drawn with replacement, and consequently the inequality does not utilize the martingale inequality given in Lemma 3.9. • Sim (Simple random sampling, without replacement): simple random sampling from the population irrespective of strata without replacement. Note that the last three methods (Ney,Ney-W and SEBM*) assume and utilize prior perfect knowledge of the variance of each of the strata, and that for methods SEBM, SEBM-W and SEBM* (which use Equations (8) and (5)) we selected for minimising a 50% confidence interval (i.e. constant p = 0.5 and t = 0.5). Also note that these methods provide comparisons of different algorithm factors, such as the dynamics of sampling: with and without replacement; with stratification and without; between our method and Neyman sampling, and; with and without perfect knowledge of stratum variances. For these methods, we consider the effectiveness against beta distributed data and for a case of uniform-and-Bernoulli data.

Synthetic Data
The first way we demonstrate the efficacy of our method is to generate sets of synthetic data, and then numerically examine the distribution of errors generated by different methods of choosing finite sequences of samples. In this section, we described the two types of synthetic data sets used in this evaluation, namely: (i) beta distributed stratum data, which are intended reflect real-world data, and (ii) a particular form of uniform and Bernoulli distributed stratum data, in which our sampling method (SEBM) performs poorly.

Beta-Distributed Data
The first pool of synthetic data have between 5 and 21 strata, with the number of strata drawn with uniform probability, and each strata sub-population has sizes ranging from 10 to 201, also drawn uniformly. The data values in each strata are drawn from beta distributions, with classic probability density function: with α and β parameters drawn uniformly between 0 and 4 for each stratum, and Γ is gamma function. Figure 1 compares the distribution of absolute error achieved by each of the sampling methods over 5000 rounds of these data sets. Each panel presents the results that the methods achieve for a given budget of samples, expressed as a multiple of the number of strata (noting that data sets where the sampling budget exceeded the volume of data were excluded). From the plots in Figure 1, we can see that our sampling technique (SEBM and SEBM-W) performs comparably to Neyman sampling (Ney and Ney-W) despite not having access to knowledge of stratum variances. Also, there is a notable similarity between SEBM* and SEBM. As expected, sampling without replacement always performs better than sampling with replacement for the same method, and this difference is magnified as the number of samples grows large in comparison to the population size. Finally, simple random sampling almost always performs worst, because it fails to take advantage of any variance information. These results and their interpretation are discussed and detailed in Section 7 along with results from the other test cases discussed below.

A Uniform and Bernoulli Distribution
The aim of this section is to examine cases of distributions in which our sampling method (SEBM) performs poorly, particularly compared to Neyman sampling (Ney). For this purpose, a data set with two strata is generated, with each stratum containing 1000 points. The data in the first stratum is uniform continuous data in range [0, 1], while the data in the second is Bernoulli distributed, with all zeros except for a specified small number, a, of successes with value 1. For this problem, we conduct stratified random sampling with a budget of 300 samples, comparing the SEBM*, SEBM and Ney methods. The average error returned by the methods across 20,000 realizations of this problem, plotted against the number of successes a, are shown as a graph in Figure 2.
This figure demonstrates that SEBM and SEBM* perform poorly when the strata contain only very small numbers of successes. This under-performance is not simply a result of the SEBM method oversampling in a process of learning the stratum variances, as the under-performance is present in SEBM* as well. The reasons for this under-performance are discussed in conjunction with other results in more detail in

Shapley Value Approximation
The Shapley value is a cornerstone measure in cooperative game theory. It is an axiomatic approach to allocating a divisible reward or cost between participants where there is a clearly defined notion of how much surplus or profit a group or "coalition" of participants could achieve by themselves [31]. It has many applications, including analyzing the power of voting blocks in weighted voting games [32], in cost and surplus division problems [33,34], and as a measure of network centrality [35]. Formally, a cooperative game, N, v ∈ G N , comprises a set of n players, N = {1, 2, . . . , n}, and a characteristic function, v : S ⊂ N → R, which is a function specifying the reward which can be achieved if a subset of the players S ⊂ N cooperate, where v(∅) = 0. In this context the Shapley value ϕ is a unique mapping from cooperative games to the player rewards G N → R n which satisfies axioms: Specifically, the Shapley value is expressed as: That is, under the Shapley value each player is afforded their average marginal contribution across every possible sequence of player join orderings. Or, if v i,k is the average marginal contribution which player i can make across coalitions of size k: then the Shapley value can be expressed as an average: Though the Shapley value is conceptually simple, its use is hampered by the fact that its total expression involves exponentially many evaluations of the characteristic function (there are n × 2 n−1 possible marginal contributions between n players). However, since the Shapley value is expressible as an average over averages by Equation (11), it is possible to approximate these averages via sampling techniques, and these averages are naturally stratified by coalition size. In previously published literature, other techniques have been used to allocate samples in this context, particularly simple sampling [36], Neyman allocation [16,37], and allocation to minimize Hoeffding's inequality [38].
We assess the benefits of using our bound by comparing its performance to the approaches above in the context of some example cooperative games, with results analyzed in Section 7. The example games are described below: For each game, we compute the exact Shapley value, and then the average absolute errors in the approximated Shapley value for a given budget m of marginalcontribution samples across multiple computational runs. The results are shown in Table 1, where the average absolute error in the Shapley value is computed by sampling with Maleki's method [38] is denoted e M a , e sim is Castro's stratified simple sampling method [36], e Ca is Castro's Neyman sampling method [37], and e SEBM is the error associated with our method, SEBM. The results in Table 1 show that our method performs well across the benchmarks. A discussion of all of the results is considered in the next section.

Discussion
In this section we give considerations to the numerical results of the paper. In general, from the results across Figures 1 and 2 and Table 1, the main observation is that our sampling technique, SEBM or SEBM-W, performs competitively to Neyman sampling (Ney or Ney-W). This is despite SEBM not having access to knowledge of strata variances, which the Neyman methods do. If instead we compare SEBM* to Ney, which both have access to strata variances, then both methods use the same information and the results are even stronger for our method. The reasons for this performance are interesting, and we now discuss them in more detail.
From Figure 1 we observe that sampling without replacement always performs better than sampling with replacement for the same method. This improvement is magnified as the number of samples grows large relative to the size of the population. At the same time, simple random sampling almost always performs worst, because it fails to take advantage of any variance information. These results are as expected.
Next, note that the results of Figure 1 show that there is a mid-range of sample sizes where choosing a different method can even have a greater impact on sampling efficiency and rate of average error reduction than the difference between sampling with or without replacement. This is an important insight, as sampling real-world data (e.g. surveys, polling, destructive testing, etc) can be an expensive and slow process. Accordingly an appropriate method of choosing numbers of samples can lead to a material difference in cost for the same accuracy. There is also a slight decrease in the performance of SEBM* in comparison with Ney in the case of high number of samples and sampling without replacement, as illustrated in Figure 1. This indicates that the use of sub-optimal equation 4 in the derivation of Lemma 3.9 might have some negative effect, by distorting the shape of the functions that the sampling processes then minimizes.
Furthermore, if the data features very rare events, then SEBM and SEBM* seem to perform in a manner less than ideal. These condition are illustrated in Figure 2, where the more rare the Bernoulli variable successes, the worse our methods perform in comparison with Neyman sampling (Ney). This shortcoming can be partly explained by noting that SEBM unnecessarily wastes samples on the Bernoulli stratum of rare events in the process of learning that the variance is almost zero, whereas Ney can avoid this because it has prior knowledge of the variances to begin with. As such, this factor explains the difference between the performance of SEBM and SEBM* in the context of Figure 1 and also in Figure 2. What is surprising is how small the difference in performance between SEBM and SEBM* is. This indicates that as additional samples are taken, the original uncertainty about the strata variances have less and less effect upon the total numbers of samples that are eventually drawn from each of the strata.
However, the performance difference between SEBM* and Ney in Figure 2 is not explained by this argument, as they use the same information. Instead, the reason for this difference in performance is found by considering the simplifying approximation of Equation (2) in the derivation of Lemma 3.5. Specifically, (2) introduces a particular distortion into the shape of Equation (8) which our sampling seeks to minimize. Figure 3 illustrates how the approximation (2) loosens the bound with respect to the variance. Observe that when the variances are very small that Equation (3) overly loosens the bounds, causing oversampling of strata with very small variances. It appears that this factor is at play in the under-performance shown in Figure 2 and also the slight under-performance of our method in the Voting Game in Table 1b. We note that there may be other corner-cases where our method also under-performs.
In comparison to existing approaches to approximating the Shapley value, our sampling method shows improved performance on almost all accounts, as shown in Table 1. This was particularly the case in the context of large sample budgets, as our method (SEBM, with error e SEBM ) sampled without replacement, while the other methods sampled with replacement. However it would be remiss not to mention the computational overhead of iteratively minimizing (one sample at a time) our inequality in the context of our simple example games. This overhead can be a significant drawback, however on more complicated games such as where the characteristic function is slower to calculate, any overhead associated with the sampling choice is expected to be much less relevant. We also note that our method's performance could potentially be further improved by selecting more refined D i values for our example games.
One primary limitation of our method is that it rests on assumption of known data widths D i (and in the case of sampling-without-replacement, also on strata sizes N i ), which may not be exactly known in practice. One way to overcome this may be to use our method with a reliable overestimate these parameters (by expert opinion or otherwise). This approximation or estimation may itself open consideration of other probability bounds and/or sampling methods, however we have not investigated this line of inquiry.
In practice, it might also be advisable to run our method with an underestimate of the data widths, as the sampling process is fundamentally sensitive the the shape of the inequality and not necessarily its magnitude or accuracy as a bound.

Multidimensional Extension
Our method of choosing samples can be extended to multidimensional data by a simple modification. Specifically, instead of considering data that is single-valued, we consider data points that are vectors.
Formally, for n strata of finite data points which are all vectors of size M , let n i be the number of data points in the ith stratum. Let the data in the ith stratum have a mean vector values µ i (with µ i,j for the jth component of the vector), which are value bounded within a finite width D i,j , and have vector value variances σ 2 i,j . Given this, let X i,1 , X i,2 , . . . , X i,ni (where X i,k,j is the jth component, of the kth vector from stratum i) be vector random variables corresponding to those data values randomly and sequentially drawn (with or without) replacement. Denote the average of the first m i of these random variables from the ith stratum by χ i,mi = 1 mi mi k=1 X i,k (with χ i,mi,j being the jth component of that vector average). Letσ 2 i,j = i mi−1 mi k=1 (X i,k,j − χ i,mi,j ) 2 be the unbiased sample variance of the first m i of these random variables in the jth component. As before, we assume weights τ i for each stratum. In this context we have the following theorem: Theorem 8.1 (Vector SEBM bound). In the context above, then with Ω ni mi , Ψ ni mi per Lemma 3.7: where: Proof. Squaring (8) and applying it specifically to the jth component of all the vectors gives: Taking a series of union bounds (Lemma 3.1) over j gives result.
The left hand side of the inequality in (12) is the square Euclidean distance between our weighted stratified sample vector estimate n i=1 τ i χ i,mi and the true mean stratified vector n i=1 τ i µ i . In this context, an example sampling process might consist of sampling to maximally minimize the right hand side of the inequality (similar to our SEBM process, described in Section 4.2). This formulation can be applied to more involved computational tasks that involve sampling data with multiple features or auxiliary variables.

Conclusions and Future Work
The derivation of our inequality extends from consideration of Chernoff bounds and probability unions in a similar vein to other EBB derivations [19,28]. However, the bounds on the moment generating functions that we developed in Section 3 use loosening approximations, and hence stronger and/or more representative bounds could be developed at the cost of greater mathematical complexity. Alternatively, an approach utilizing entropic [39] or Efron-Stein inequalities [40] could result in different and potentially tighter results.
We now consider prospective applications of our method. First, the approach derived in this paper was motivated by the problem of approximating the Shapley value of cooperative games defined over complicated optimization problems (i.e. with characteristic functions given by the solution to non-trivial optimization problems). Two examples of this are the problems of pricing (i) logistics, involving solutions to the travelling salesman problem [33], and (ii) electricity networks, which requires solving optimzation problems that incorporate the power flow equations [16,34,41,42]. Focusing on electricity networks in particular: these are complicated technical system used to transport electrical power from generators to loads, subject to the non-linear physical and operational constraints of the system's components. With the emergence of new technologies, electricity is now generated, monitored and used on neighborhood distribution networks by devices that are increasingly responsive and interconnected to the network. Because of this, there are various emerging schemes of how a future distribution-network energy market platform might be designed. Within this context, the Shapley value has been proposed as a fair mechanism for the allocation of resources and costs on such networks. The Shapley value has been considered in different ways as a mechanism for pricing demand response [16], demand or load [34], supply or generation [41], and potentially all simultaneously [42]. Although computing the Shapley value exactly is impractical in these contexts sample-based approximations are a promising avenue for implementing Shapely value pricing schemes in real-world electricity systems.
Second, a potential use of our stratified sampling method is in improving the performance of stochastic gradient decent (SGD) methods for training neural networks [43]. Neural networks are trained by iteratively refining their parameters -the weights and biases of the network -against a cost function of the network's performance against training data. One common method of training neural networks is gradient decent (GD). In each iteration of GD, the derivative of how much a change in any parameter would influence the average performance of the network across the training data is calculated as a gradient vector. Once this vector is calculated, each network parameter takes a small step in the direction of this gradient, to incrementally increase the performance of the network, and through many of these steps the network becomes trained.
However in many cases, the entire corpus of training data is not used in each iteration but only a fraction of the corpus is sampled (as a 'batch' or 'minibatch'), and the average gradient vector of improved performance across the samples of the batch is calculated as an approximation of the true gradient vector. This iterative process has been called SGD, where one of the hyperparameters is the size of the batches, see [44,45]. In the context of supervised learning, each element of the training data is labelled with the desired output of the neural network for it, and these labels can serve to naturally stratify the training data; or the data can be stratified by other means too [46][47][48]. In this setting, Equation 12 may be used to choose between samples of labelled training data, in order to sample batches that more-efficiently estimate of the performance gradient, and hence improve the efficiency of neural network training procedure. This idea of 'smart sampling' for neural network training is not particularly new, and our method is potentially compatible with other performance-enhancing techniques in the literature on neural networks [49,50].
Sourcecode for all the experiments in this paper are available at: https://github.com/Markopolo141/Stratified Empirical Bernstein Sampling