1. Introduction
The original KernelSHAP paper by Lundberg and Lee [1] from 2017 was a landmark success for interpretable machine learning. Its primary achievement was making the Shapley value [2] — a theoretically optimal but computationally intractable [3,4,5] concept from cooperative game theory [6] — practical for explaining machine learning models. Lundberg and Lee [1] discovered that the exact Shapley values for a prediction can be recovered as the solution to a specially weighted least squares regression problem. By sampling a manageable number of coalitional inputs and solving this approximate regression, KernelSHAP provided a single, consistent metric for feature importance. The accompanying open-source library shap [1] fueled its widespread use, making it a predominant standard for model interpretation, particularly in the model-agnostic setting.
The follow-up work on Unbiased KernelSHAP by Covert and Lee [7] represents a success in methodological rigor. It addresses the problem that the original KernelSHAP algorithm has not been proven to be unbiased and does not provide uncertainty estimates. While not as publicly visible as the initial breakthrough, this refinement solidified SHAP's foundation as a robust, statistically sound tool.
According to Lundberg and Lee [1], KernelSHAP was developed without knowledge of prior articles on cooperative game theory such as Charnes et al. [8] and Ruiz et al. [9], who decades earlier had already derived the precise weighting scheme that makes a least-squares regression yield the Shapley value when complete data is available. This is a fascinating case of parallel innovation across disciplines. Cooperative game theory, with the Shapley value [2] as its most important solution concept, had long been a mature and applied field providing rigorous solutions to real-world problems of fair division and coalitional analysis, from allocating airport landing fees among airlines based on runway use [10] to measuring the voting power of political blocs in a legislature [11,12,13]. Yet the field of machine learning, increasingly reliant on inscrutable black-box models, had not viewed its own predictions as precisely such a system — one where input features act as cooperating agents whose joint effort produces a model's output. Lundberg and Lee's pivotal contribution [1] was to bridge this conceptual gap. They recognized that the abstract "game" defined by a model's prediction function is a natural domain for Shapley's theory. In doing so, they inadvertently reinvented a specific computational tool — the weighted least-squares approximation — from the game theory literature, but with a transformative new purpose: not to analyze economic coalitions, but to explain the inner logic of artificial intelligence.
Whereas KernelSHAP [1] and its unbiased variant [7] are very widely used, the methods from Benati et al. [14] — which are equally based on the least squares formulation of the Shapley value — have received comparatively little attention. This article represents, to our knowledge, their first application in explainable artificial intelligence. We adapt the ideas from Benati et al. [14] as general TU game approximation algorithms, terming their weighted sampling strategy LSS (Least Squares Sampling). Regarding their proposal for stratification, we differentiate S-LSS, an algorithm without sample reuse across strata, from SRS-LSS, an algorithm with sample reuse across strata as suggested in Benati et al. [14].
The objective of this paper is neither to introduce a novel Shapley value estimator nor to perform a broad comparison of approximation algorithms. Instead, the contribution of this work is a detailed structural, theoretical, and algorithmic analysis of the methods LSS, S-LSS and SRS-LSS from Benati et al. [14]. We compare the variances of the LSS and S-LSS estimators in detail. We point out how sample reuse across strata may introduce non-zero covariance terms between strata for SRS-LSS. We prove that LSS and UKS approximate the same underlying problem, thereby only differing in their respective sampling strategies. Finally, we test how the unbiased Shapley estimators LSS, S-LSS and SRS-LSS perform in comparison to KernelSHAP and UKS for both classical cooperative games and real-world applications from interpretable machine learning.
The remainder of this article is structured as follows. In Section 2, we summarize basic ideas from cooperative game theory, including linear solution concepts and the Shapley value, and introduce the BShap (Baseline Shapley) model for interpretable machine learning. Section 3 briefly reviews Monte Carlo methods, along with stratified sampling and importance sampling for variance reduction. A summary of results on approximating linear solution concepts by means of importance sampling on the coalition space from our previous work [15] is presented in Section 4. The central developments of this paper are detailed in Section 5. We formally introduce the LSS, S-LSS and SRS-LSS estimators, compare the variances of LSS and S-LSS, investigate the emergence of non-zero covariance terms between strata for SRS-LSS, and clarify the similarities and differences between LSS and UKS. The empirical performance of the algorithms introduced in Section 5 is analyzed in Section 6 using two types of cooperative games and three real-world explainability scenarios, thereby numerically substantiating our previous analytical claims. The paper concludes in Section 7 with a summary and recommendations.
4. Approximating Linear Solution Concepts via Importance Sampling
The exact calculation of a linear solution concept, i.e., (1), for a TU game normally requires summing up terms whose number grows exponentially in the number of players n. Therefore, approximation algorithms are needed to estimate these values in real-world situations, especially in the context of applications in interpretable machine learning [1,17,18,19], where one can typically not exploit any special structure of the underlying game for computing Shapley values exactly. In this section, we briefly summarize some ideas from [14] and the general framework for importance sampling on the coalition space for linear solution concepts which we recently proposed in [15] and will apply in Section 5.
Benati et al. [14] consider a uniform sampling strategy on the coalition space for approximating linear solution concepts, i.e., they simply adopt the uniform distribution over all coalitions. We regard this as the crude Monte Carlo method on the coalition space. Although sampling subsets from the uniform distribution is both straightforward and unbiased, it is obviously not an optimal — or even recommendable — sampling strategy for all linear solution concepts or problem settings. For example, when approximating the Shapley value, small and large coalitions carry large weights and thus exert a strong influence on the estimator. However, when sampling uniformly from the coalition space, only few such coalitions are drawn, since their number is small compared to that of mid-sized coalitions, resulting in a less accurate estimator.
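To see how rarely uniform sampling hits the heavily weighted coalition sizes, consider the following minimal Python sketch (the numbers are illustrative and not taken from the paper): under uniform sampling, a coalition of size s is drawn with probability $\binom{n}{s}/2^n$, which is vanishingly small at the extremes.

```python
from math import comb

# Probability of drawing a coalition of size s under uniform sampling
# on the coalition space of n players: C(n, s) / 2^n.
n = 20
for s in (0, 1, n // 2, n - 1, n):
    print(f"size {s:2d}: P = {comb(n, s) / 2**n:.2e}")
# The extreme sizes, which the Shapley weighting emphasizes, are drawn
# orders of magnitude less often than the mid-sized ones.
```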
To enable more efficient sampling, we apply the importance sampling technique from Subsection 3.3. By appropriately reweighting the estimator, importance sampling allows us to draw samples from a non-uniform, user-defined distribution while maintaining an unbiased estimate of the underlying linear solution concept.
Theorem 1 ([15]). For all $i \in N$, let $q$ be a probability distribution on the coalition space with $q(S) > 0$ whenever $w_i(S) \neq 0$, and let $S_1, \dots, S_\tau$ with $S_t \sim q$ be a sample of size $\tau$ generated by sampling with replacement according to $q$. Then,
$$\hat{\phi}_i = \frac{1}{\tau} \sum_{t=1}^{\tau} \frac{w_i(S_t)\, v(S_t)}{q(S_t)}$$
is an importance sampling estimator of the linear solution concept $\phi_i(v) = \sum_{S \subseteq N} w_i(S)\, v(S)$.
The following proposition connects our importance sampling estimator from Theorem 1, established in Pollmann and Staudacher [15], to a finding from Benati et al. [14], p. 95.
Proposition 1. The importance sampling estimator from Theorem 1 has the following properties:
- (a) It is unbiased, i.e., $\mathbb{E}\big[\hat{\phi}_i\big] = \phi_i(v)$.
- (b) Its variance is $\mathrm{Var}\big(\hat{\phi}_i\big) = \frac{1}{\tau}\Big(\sum_{S \subseteq N} \frac{w_i(S)^2\, v(S)^2}{q(S)} - \phi_i(v)^2\Big)$.
- (c) It is consistent in probability, i.e., $\hat{\phi}_i \xrightarrow{P} \phi_i(v)$ as $\tau \to \infty$.
We note that Theorem 1 and Proposition 1 subsume both the crude Monte Carlo method (setting $q$ to the uniform distribution on the coalition space) and stratified sampling. The latter is covered because any stratum estimator for a linear solution concept can be written in the form of (16), allowing direct application of these results.
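As a concrete illustration, the following Python sketch implements the generic estimator from Theorem 1. The function names, the interface, and the uniform proposal are our own illustrative choices for this sketch, not part of the original algorithms.

```python
import random

def importance_sampling_estimate(v, w_i, sample_q, prob_q, tau, rng=random):
    """Sketch of the estimator in Theorem 1: estimate
    phi_i(v) = sum_S w_i(S) * v(S) from tau coalitions drawn i.i.d.
    from a proposal q, where prob_q(S) > 0 wherever w_i(S) != 0."""
    total = 0.0
    for _ in range(tau):
        S = sample_q(rng)
        total += w_i(S) * v(S) / prob_q(S)  # reweighting keeps the estimate unbiased
    return total / tau

# Crude Monte Carlo is the special case of a uniform proposal, q(S) = 2**-n:
n = 4
def sample_uniform(rng):
    return frozenset(i for i in range(n) if rng.random() < 0.5)

def prob_uniform(S):
    return 2.0 ** -n
```

With `w_i` set to the weights of the linear solution concept at hand and `prob_q` matching the actual sampling distribution, choosing the uniform proposal above recovers the crude Monte Carlo method, while a non-uniform proposal yields a genuine importance sampling estimator.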
6. Empirical Results
We validate our results by applying the algorithms introduced in Subsection 5.2 to approximate Shapley values for airport games, weighted voting games, and three real-world interpretable machine learning problems.
The algorithms introduced in Subsection 5.2 were implemented in Python. Both the implementations and the test problems can be freely accessed via the first author's GitHub repository.
We consistently assume that all players’ Shapley values must be estimated. Although not universal, this is common in explainable machine learning, where one seeks to explain a prediction using all features.
Before proceeding, we note that each algorithm uses different parameters to determine the number of sampled coalitions and thus evaluations of the characteristic function v. For fair comparisons, we introduce a unified sample budget T, ensuring that each algorithm performs T evaluations of v, up to negligible rounding errors. We now express the parameters of each algorithm from Subsection 5.2 in terms of T.
For LSS (Algorithm 1) and UKS (see the end of Subsection 5.2), this is a one-to-one mapping, i.e., the number of sampled coalitions equals T. Note that we ignore the single evaluation of v needed for calculating the initial term, since it is negligible in comparison to the rounding errors of the other algorithms, especially for large T.
S-LSS (Algorithm 2) divides the sample space based on the distinct values of the coalition size s for each player. We use the heuristically motivated proportional sample allocation from (58) to obtain the per-stratum sample sizes.
Finally, SRS-LSS (Algorithm 3) divides the sample space into distinct sets as well. For simplicity, and noting that alternative sample allocation schemes across strata may provide better results, we split the sample budget equally across these sets.
We conclude that the deviation from the true total sample budget T is 1 for LSS and UKS, while for S-LSS and SRS-LSS the deviations are bounded from above by small terms resulting from the ceiling operations in the allocation. We consider these deviations to be negligible in our subsequent analysis, in particular for large T.
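The sketch below mirrors this bookkeeping in Python under our stated assumptions: `allocate_equal` implements the equal split we assume for SRS-LSS, while `allocate_proportional` takes arbitrary stand-in weights, since the heuristic rule (58) is not reproduced here.

```python
import math

def allocate_equal(T, num_strata):
    """Equal per-stratum allocation with ceiling; returns the sizes and
    the (small) deviation of the realized budget from T."""
    per_stratum = math.ceil(T / num_strata)
    sizes = [per_stratum] * num_strata
    return sizes, sum(sizes) - T

def allocate_proportional(T, weights):
    """Proportional allocation with ceiling; `weights` is a stand-in for
    the heuristic rule (58), which is not reproduced in this sketch."""
    total = sum(weights)
    sizes = [math.ceil(T * w / total) for w in weights]
    return sizes, sum(sizes) - T

sizes, deviation = allocate_equal(T=10_000, num_strata=99)
print(sizes[0], deviation)  # ceiling makes the realized budget exceed T slightly
```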
Note that for SRS-LSS (Algorithm 3), it is not guaranteed that the algorithm runs successfully in the sense that every stratum receives at least one sample; compare Proposition 4. In our mean squared error comparisons, we mandate for any sample budget T that at least half of all runs must be successful for the results to be displayed in the final figure.
With these definitions established, we proceed with our experiments.
Figure 5 confirms the theoretical variances of LSS and S-LSS presented in Propositions 2 and 3, respectively. Moreover, it underscores our results from Theorem 2, where we demonstrated that either algorithm might achieve the smaller variance, depending on the underlying cooperative game. Furthermore, we validate that LSS and UKS perform as expected with respect to the theoretical variances presented in Propositions 2 and 11: Figure 5 clearly illustrates that the derived theoretical variances are accurate and that, depending on the specific problem, either LSS or UKS may slightly outperform the other, as one sampling strategy might be better suited to the given task than the other; see Theorem 4.
As for mean squared error comparisons, we first look at an airport game with 100 players specified in Castro et al. [31] and compare the approximation methods from Subsection 5.2 in Figure 6. Note that the graph for SRS-LSS begins only at a sample size of 50,000, as Proposition 4 fails too frequently with smaller sample budgets. With $n = 100$ players, guaranteeing at least one sample per stratum becomes more challenging.
Figure 8 compares mean squared errors for our Monte Carlo estimators from Subsection 5.2 for a weighted voting game with 150 players and uniformly distributed weights, which was previously employed in the software EPIC [22,23]. It can be found on the GitHub page of the second author via https://github.com/jhstaudacher/EPIC/blob/master/test_cases/uniform/uniform.n150.q35951.csv. Note that the graph for SRS-LSS begins only at a sample size of 150,000, as Proposition 4 fails too frequently with smaller sample budgets. With $n = 150$ players, guaranteeing at least one sample per stratum becomes even more challenging than in the airport game from Figure 6.
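For reference, both benchmark game classes admit one-line characteristic functions. The sketch below uses the standard textbook definitions; the concrete cost vector from Castro et al. [31] and the weight/quota instance from the EPIC test case are not reproduced here.

```python
def airport_game(costs):
    """Airport game: a coalition's cost is the largest runway cost
    among its members (standard definition)."""
    def v(S):
        return max((costs[i] for i in S), default=0.0)
    return v

def weighted_voting_game(weights, quota):
    """Weighted voting game: a coalition wins (v = 1) iff its
    total weight reaches the quota."""
    def v(S):
        return 1.0 if sum(weights[i] for i in S) >= quota else 0.0
    return v

# Toy instances; the paper's instances have 100 and 150 players, respectively.
v_airport = airport_game([1.0, 2.0, 3.0])
v_voting = weighted_voting_game([4, 2, 1], quota=5)
print(v_airport({0, 2}), v_voting({0, 1}))  # 3.0 1.0
```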
Figure 9 uses the standard diabetes dataset (442 patients, 10 baseline features), where the target is disease progression after one year. We train a Gradient Boosting Regressor and evaluate the approximation methods from Subsection 5.2 by their mean squared error against exact reference Shapley values.
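Below is a minimal sketch of this setup, assuming scikit-learn's bundled diabetes data and default hyperparameters (the paper's exact configuration may differ); the value function follows the BShap idea from Section 2, with the feature mean as an assumed baseline. The setups for Figures 10 and 11 are analogous.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(return_X_y=True)                 # 442 patients, 10 features
model = GradientBoostingRegressor(random_state=0).fit(X, y)

baseline = X.mean(axis=0)  # assumed BShap baseline; other choices are possible

def make_value_function(x):
    """Game to explain the prediction for x: v(S) evaluates the model with
    the features outside S replaced by the baseline."""
    def v(S):
        z = baseline.copy()
        idx = list(S)
        z[idx] = x[idx]
        return float(model.predict(z.reshape(1, -1))[0])
    return v

v = make_value_function(X[0])
print(v(set()), v(set(range(10))))  # baseline prediction vs. full prediction
```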
In Figure 10, the California housing dataset (20,640 entries, 8 features) is used to predict the median house value in hundreds of thousands of dollars. Using an MLP Regressor, we compare the approximation methods by their mean squared error on Shapley value estimation against exact references.
Figure 11 uses the classic wine dataset (178 instances, 13 features, 3 wine classes). We train a Random Forest Classifier and evaluate the approximation methods from Subsection 5.2 via mean squared error in estimating Shapley values for predicting the probability of class 0 only.
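The classification case differs from the two regression setups only in its value function, which targets a single class probability. A minimal sketch, again assuming scikit-learn's bundled data, default hyperparameters, and a mean baseline:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)                     # 178 instances, 13 features
model = RandomForestClassifier(random_state=0).fit(X, y)
baseline = X.mean(axis=0)                             # assumed BShap baseline

def make_value_function(x):
    """v(S) is the predicted probability of class 0 with the features
    outside S replaced by the baseline."""
    def v(S):
        z = baseline.copy()
        idx = list(S)
        z[idx] = x[idx]
        return float(model.predict_proba(z.reshape(1, -1))[0, 0])
    return v

v = make_value_function(X[0])
```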
Let us succinctly summarize the comparisons of mean squared errors of all algorithms for approximating the Shapley value from Subsection 5.2 in Figure 6, Figure 7, Figure 8, Figure 9, Figure 10 and Figure 11. As expected, LSS and UKS perform more or less equally for all six test problems. SRS-LSS outperforms the other three provably unbiased algorithms LSS, S-LSS and UKS for all test problems. This observation is consistent with the results shown in Figure 1 of Benati et al. [14], where only LSS and SRS-LSS were compared. Therefore, we conclude that the covariance terms established in Theorem 3 do not significantly affect the overall variances of the individual players' Shapley value estimators negatively. S-LSS performs worst for five test problems, with the notable exception of the airport game with 100 players in Figure 6, where S-LSS even converges faster than KernelSHAP (KS). Not surprisingly, KS consistently converges faster than UKS and LSS. For the three large cooperative games in Figure 6, Figure 7 and Figure 8, SRS-LSS converges faster than KS, whereas that comparison reverses for the machine learning tasks in Figure 9, Figure 10 and Figure 11.
7. Summary, Conclusions and Outlook
The celebrated KernelSHAP approach by Lundberg and Lee [1] as well as its later refinement Unbiased KernelSHAP (UKS) proposed by Covert and Lee [7] compute the Shapley value via a least squares optimization problem. While these two algorithms are extremely well established in the machine learning community, the methods from the paper by Benati et al. [14] — which are also based on the least squares formulation of the Shapley value — have received fairly little attention. To our knowledge, we are the first to apply them in the context of explainable artificial intelligence. We formulate the ideas from Benati et al. [14] as approximation algorithms for general TU games. We refer to their weighted sampling strategy as LSS (Least Squares Sampling). As for their approach towards stratification, we distinguish S-LSS, with no reuse of samples across strata, from SRS-LSS, with sample reuse across strata as proposed in Benati et al. [14].
As a result of our thorough and detailed analysis of LSS, S-LSS and SRS-LSS, we presented three key findings.
First, in Proposition 5, we showed that S-LSS as proposed in [14] is not a valid stratified variant of LSS in the sense of the definition provided in Subsection 3.2, i.e., for S-LSS the strata overlap. As a consequence, S-LSS might reduce the variance of the obtained estimator, but it could also lead to an increase in comparison to LSS; see Theorem 2.
Second, in Theorem 3, we showed that the SRS-LSS approach proposed by Benati et al. [14] introduces covariance terms between stratum estimators, making its theoretical variance difficult to analyze. Although empirical results suggest that the variance is significantly reduced in comparison to LSS and S-LSS, a theoretical analysis remains an open research question.
Third, in Theorem 4, we established that LSS and UKS are importance sampling estimators in the sense of Theorem 1, addressing the same underlying problem but differing in their respective sampling strategies. Therefore, neither LSS nor UKS is superior to the other. Which algorithm's variance is smaller depends on the underlying problem and on whether the respective sampling strategy suits it.
As noted in the introduction, this paper's aim is neither to propose a new Shapley value estimator nor to compare numerous approximation algorithms. Rather, the focus is on providing structural, theoretical, and algorithmic insight into the methods from Benati et al. [14]. While antithetic sampling — as discussed, for instance, in [32] — could be integrated into the SRS-LSS algorithm, this would have vastly exceeded the scope of this study. In the future, it could be worthwhile to investigate whether algorithmic approaches comparable to sophisticated stratification strategies [28,33,34] could be successfully incorporated to enhance the performance of SRS-LSS. Our theoretical and numerical results clearly suggest that the unbiased SRS-LSS estimator merits more attention. Another, perhaps even more pressing, question is why SRS-LSS outperforms KernelSHAP for weighted voting games and airport games, while KernelSHAP exhibits superior performance for our test problems from interpretable machine learning. We have yet to identify the properties of the characteristic function responsible for this phenomenon.
Figure 1. Let $(N, v)$ be a cooperative game with $n = 2$ players (characteristic function values as given in the figure). The player 1 and player 2 axes should be interpreted in a discrete way such that 0 denotes the exclusion and 1 specifies the inclusion of a player. The blue plane is the solution to problem (20) with m being defined as in (25) and the weights being obtained from (26). As a result, the coefficients defining this plane are the Shapley values of the game.
Figure 2. Theoretical variance comparison of LSS and S-LSS evaluated on the weighted voting game defined by (7). The sample budget is fixed, and the sample sizes of S-LSS are distributed according to (58), with ceiling operations applied whenever the result is not an integer. The crosses denote the mean variance across all players' Shapley values, while the dots specify the variance for an individual player.
Figure 3. The different sampling distributions of LSS and UKS.
Figure 4. Theoretical variance comparison of LSS and UKS evaluated on the weighted voting games defined by (7). The number of evaluations of v is fixed. The crosses represent the mean variance across all players' Shapley values, while the dots denote the variance for an individual player.
Figure 5. Empirical variance validation of the Shapley value estimators obtained via LSS, S-LSS and UKS evaluated on the weighted voting games defined by (7). The overall sample budget is fixed. The crosses represent the theoretical variances, while the dots denote the empirical variances, obtained over 5000 runs.
Figure 6. Empirical mean squared error comparison of the Shapley value estimates obtained via LSS, S-LSS, SRS-LSS, KS and UKS, evaluated on an airport game with 100 players. The mean squared errors were averaged over 250 runs.
Figure 7. Empirical mean squared error comparison of the Shapley value estimates obtained via LSS, S-LSS, SRS-LSS, KS and UKS, evaluated on a weighted voting game with 50 players. The mean squared errors were averaged over 250 runs.
Figure 8. Empirical mean squared error comparison of the Shapley value estimates obtained via LSS, S-LSS, SRS-LSS, KS and UKS, evaluated on a weighted voting game with 150 players. The mean squared errors were averaged over 250 runs.
Figure 9. Empirical mean squared error comparison of the Shapley value estimates obtained via LSS, S-LSS, SRS-LSS, KS and UKS, evaluated on the diabetes example. The mean squared errors were averaged over 250 runs.
Figure 10. Empirical mean squared error comparison of the Shapley value estimates obtained via LSS, S-LSS, SRS-LSS, KS and UKS, evaluated on the California housing example. The mean squared errors were averaged over 250 runs.
Figure 11. Empirical mean squared error comparison of the Shapley value estimates obtained via LSS, S-LSS, SRS-LSS, KS and UKS, evaluated on the wine example. The mean squared errors were averaged over 250 runs.
Table 1. Population and estimator variances for all strata of the S-LSS estimator.
| s | j | elements | population var. | estimator var. |
|---|---|----------|-----------------|----------------|
| 1 | 1 |          | 0               | 0               |
| 1 | 2 |          | 0               | 0               |
| 1 | 3 |          | 0               | 0               |
| 2 | 1 |          | 0               | 0               |
| 2 | 2 |          |                 |                 |
| 2 | 3 |          |                 |                 |
Table 2. All valid realizations of the sample (order ignored), together with their respective probabilities as well as the corresponding resulting estimator values. The last row shows the expected values over all valid realizations.