This section provides the required background information and introduces the notation used throughout this work.
Section 2.1 discusses the Blackwell order and its special case for binary targets, the zonogon order, which will be used for operational interpretations and for the representation of f-information in its decomposition.
Section 2.2 discusses the PID framework of Williams and Beer [1] and the relation between a decomposition based on the redundancy lattice and one based on the synergy lattice. We also demonstrate the unintuitive behavior of the original decomposition measure, which is resolved by our proposal in Section 3.
Section 2.3 provides the definitions of f-information, Rényi-information, and Bhattacharyya-information considered in this work, which are later used to demonstrate the transformation of decomposition results between measures.
2.1. Blackwell and Zonogon Order
Definition 1 (Channel). A channel from T to S represents a garbling of the input variable T that results in the variable S. Within this work, we represent an information channel μ as a (row-)stochastic matrix, where each element is non-negative and all rows sum to one.
For the context of this work, we consider a variable S to be the observation of the output of an information channel μ from the target variable T, such that the corresponding channel can be obtained from their conditional probability distribution, as shown in Equation 1, where the rows of μ are indexed by the states of T and its columns by the states of S.
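To make this representation concrete, the following minimal sketch (our own illustration, with example values that are not taken from the paper) builds a channel matrix μ from a joint distribution of T and S by normalizing each row by the target marginal:

```python
import numpy as np

# Joint distribution p(T, S): rows indexed by states of T, columns by states of S.
p_joint = np.array([[0.30, 0.10, 0.10],
                    [0.05, 0.15, 0.30]])

# The channel mu = P(S | T) is obtained by dividing each row of the joint
# distribution by the corresponding marginal p(T); every row then sums to one.
mu = p_joint / p_joint.sum(axis=1, keepdims=True)

assert np.allclose(mu.sum(axis=1), 1.0)   # row-stochastic, as in Definition 1
print(mu)
```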
Notation 2 (Binary input channels). Throughout this work, we reserve the symbol κ for binary input channels, meaning κ signals a stochastic matrix with two rows (one per state of the binary input) and one column per output state. We write κ_i for the i-th column of this matrix.
Definition 2 (More informative [15,19]).
An information channel is more informative than another channel if, for any decision problem involving a set of actions and a reward function that depends on the chosen action and the state of the variable T, an agent with access to the first channel's output can always achieve an expected reward at least as high as an agent with access to the second channel's output.
Definition 3 (Blackwell order [15,19]).
The Blackwell order is a preorder of channels. A channel is Blackwell superior to another channel if we can pass its output through a second channel λ to obtain a channel equivalent to the other, as shown in Equation 2.
Blackwell [19] showed that a channel is more informative if and only if it is Blackwell superior. Bertschinger and Rauh [15] showed that the Blackwell order does not form a lattice for channels whose target variable has more than two states, since the ordering does not provide unique meet and join elements. However, binary target variables are a special case where the Blackwell order is equivalent to the zonogon order (discussed next) and does form a lattice [15].
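As an illustration of Definition 3, the following sketch (our own, not part of the original framework) checks whether one binary input channel is Blackwell superior to another by searching for a row-stochastic garbling λ with κ1 λ = κ2, posed as a linear feasibility problem:

```python
import numpy as np
from scipy.optimize import linprog

def blackwell_superior(kappa1, kappa2):
    """Return True if a row-stochastic lam exists with kappa1 @ lam == kappa2,
    i.e. if kappa1 is Blackwell superior to kappa2 (Definition 3 / Equation 2)."""
    kappa1 = np.asarray(kappa1, float)
    kappa2 = np.asarray(kappa2, float)
    n1, n2 = kappa1.shape[1], kappa2.shape[1]
    A_eq, b_eq = [], []
    # (kappa1 @ lam)[t, s] must equal kappa2[t, s] for every target state t and output s.
    for t in range(kappa1.shape[0]):
        for s in range(n2):
            row = np.zeros(n1 * n2)
            row[np.arange(n1) * n2 + s] = kappa1[t]
            A_eq.append(row)
            b_eq.append(kappa2[t, s])
    # Every row of the garbling lam must sum to one.
    for j in range(n1):
        row = np.zeros(n1 * n2)
        row[j * n2:(j + 1) * n2] = 1.0
        A_eq.append(row)
        b_eq.append(1.0)
    res = linprog(c=np.zeros(n1 * n2), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=[(0, 1)] * (n1 * n2), method="highs")
    return res.success   # feasible -> a valid garbling lam exists

# A channel is always Blackwell superior to a garbled version of itself:
kappa = np.array([[0.8, 0.2], [0.3, 0.7]])
lam = np.array([[0.9, 0.1], [0.2, 0.8]])
print(blackwell_superior(kappa, kappa @ lam))   # True
```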
Definition 4 (Zonogon [15]).
The zonogon of a binary input channel κ is defined as the Minkowski sum of the collection of vector segments spanned by its columns, as shown in Equation 3. The zonogon can equivalently be defined as the image of the unit cube under the linear map κ.
The zonogon is a centrally symmetric convex polygon, and its perimeter is spanned by the channel's column vectors. Figure 2 shows an example of a binary input channel and its corresponding zonogon.
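The boundary of such a zonogon can be computed directly from the column vectors. The following sketch is our own and assumes, as a convention, that the first row of κ corresponds to the TPR axis and the second row to the FPR axis (see the decision-theoretic interpretation below):

```python
import numpy as np

def zonogon_upper_boundary(kappa):
    """Vertices of the upper boundary of the zonogon Z(kappa) of a 2 x n binary
    input channel, running from (0, 0) to (1, 1). The full zonogon is this chain
    together with its point reflection through the centre (0.5, 0.5)."""
    cols = np.asarray(kappa, float).T                        # one 2D segment per output
    order = np.argsort(-np.arctan2(cols[:, 0], cols[:, 1]))  # steepest segments first
    return np.vstack([[0.0, 0.0], np.cumsum(cols[order], axis=0)])

kappa = np.array([[0.7, 0.2, 0.1],    # assumed P(s | T = 1)
                  [0.1, 0.3, 0.6]])   # assumed P(s | T = 0)
print(zonogon_upper_boundary(kappa))  # ends at (1, 1) because the rows sum to one
```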
Definition 5 (Zonogon sum).
The addition of two zonogons corresponds to their Minkowski sum as shown in Equation 4.
Definition 6 (Zonogon order [15]).
A zonogon is zonogon superior to another if and only if it contains the other as a subset.
Bertschinger and Rauh [15] showed that for binary input channels, the zonogon order is equivalent to the Blackwell order and forms a lattice (Equation 5). In the remainder of this work, we only discuss binary input channels, such that the orderings of Definitions 2, 3, and 6 are equivalent and can be thought of as the subset relation between zonogons.
To obtain an interpretation of what a channel zonogon represents, we can consider a binary decision problem: predicting the state of a binary target variable T using the output of the channel κ. Any decision strategy for obtaining a binary prediction can be fully characterized by its resulting pair of True-Positive Rate (TPR) and False-Positive Rate (FPR), as shown in Equation 6. Therefore, a channel zonogon provides the set of all achievable (TPR, FPR)-pairs for the given channel [18,20]. This can also be seen from Equation 3, where the unit cube represents all possible first columns of the decision strategy. The first column of the strategy fully determines the second, since each row has to sum to one. As a result, the product of κ with this first column provides the (TPR, FPR)-pair of the decision strategy, and the definition of Equation 3 yields all achievable (TPR, FPR)-pairs for predicting the state of a binary target variable. Since this will be helpful for operational interpretations, we label the axes of zonogon plots accordingly, as shown in Figure 2, and refer to regions within this plot as reachable decision regions:
Definition 7 (Reachable decision region).
A reachable decision region for a binary decision problem is a set of achievable (TPR,FPR) performance pairs and can be visualized in a TPR/FPR-plot such as Figure 2.
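The composition described above can be illustrated in a few lines. The example channel, the deterministic strategy, and the convention that the first row corresponds to the positive target state are our own choices for this sketch:

```python
import numpy as np

# Binary input channel with three outputs (rows: target states, columns: outputs).
kappa = np.array([[0.7, 0.2, 0.1],    # assumed P(s | T = 1)
                  [0.1, 0.3, 0.6]])   # assumed P(s | T = 0)

# Deterministic decision strategy: predict "positive" after the first two outputs.
lam = np.array([[1.0, 0.0],
                [1.0, 0.0],
                [0.0, 1.0]])

# The first column of the composed channel kappa @ lam is the (TPR, FPR) pair;
# the second column is determined by the first since every row sums to one.
tpr, fpr = (kappa @ lam)[:, 0]
print(tpr, fpr)    # 0.9 0.4 -> one achievable point of the zonogon Z(kappa)
```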
Notation 3 (Channel lattice). We use the notation for the meet element of binary input channels under the Blackwell order and for their join element. We use the notation for the top element of binary input channels under the Blackwell order and for the bottom element.
For binary input channels, the meet element of the Blackwell order corresponds to the intersection of the zonogons, and the join element corresponds to the convex hull of their union. Equation 7 describes this for an arbitrary number of channels.
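The following sketch illustrates this for two example zonogons given directly by their vertices (using shapely; the example channels and the (FPR, TPR) axis convention are our own choices, not taken from the paper):

```python
from shapely.geometry import Polygon

# Two zonogons of 2 x 2 binary input channels, given by their vertices in the
# (FPR, TPR) plane; both are centrally symmetric around (0.5, 0.5).
z1 = Polygon([(0, 0), (0.2, 0.8), (1, 1), (0.8, 0.2)])
z2 = Polygon([(0, 0), (0.6, 0.9), (1, 1), (0.4, 0.1)])

meet = z1.intersection(z2)         # zonogon of the Blackwell meet
join = z1.union(z2).convex_hull    # zonogon of the Blackwell join
print(meet.area, join.area)
```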
2.2. Partial Information Decomposition
The commonly used framework for PIDs was introduced by Williams and Beer [1]. A PID is computed with respect to a particular random variable about which we would like to gain information, called the target, and aims to identify from which of the variables we have access to, called visible variables, this information can be obtained. Therefore, this section considers sets of variables that represent their joint distribution.
Notation 4. Throughout this work, we use the notation T for the target variable and for the set of visible variables. We use the notation for the power set of , and for its power set without the empty set.
The filter used for obtaining the set of atoms (Equation 8) removes sets that would be equivalent to other elements. This is required to obtain a lattice from the following two ordering relations:
Definition 9 (Redundancy-/Gain-lattice [
1]).
The redundancy lattice is obtained by applying the ordering relation of Equation 9 to all atoms .
The redundancy lattice for three visible variables is visualized in Figure 3a. On this lattice, we can think of an atom as representing the information that can be obtained from all of its sources about the target T (their redundancy or informational intersection). For example, an atom with two sources represents on the redundancy lattice the information that is contained in both of its sources about T. If both sources contain a common variable, their redundancy contains at least the information of that variable, and the atom consisting of that single variable is considered a predecessor. Therefore, the ordering indicates an informational subset relation for the redundancy of atoms, and the information that is represented by an atom increases as we move up. The up-set of an atom on the redundancy lattice indicates the information that is lost when losing all of its sources. Considering the example from above, if we lose access to both of the atom's sources, then we lose access to all atoms in its up-set.
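For illustration, the following sketch encodes atoms as sets of sources (each source a set of variable indices) and checks the ordering relation of the redundancy lattice. The set-based form used here is the standard Williams and Beer ordering, which we assume matches Equation 9:

```python
def redundancy_leq(alpha, beta):
    """alpha precedes beta on the redundancy lattice (assumed form of Equation 9):
    every source of beta is a superset of some source of alpha."""
    return all(any(a <= b for a in alpha) for b in beta)

# Atoms encoded as frozensets of sources; sources are frozensets of variable indices.
a_12 = frozenset({frozenset({1}), frozenset({2})})    # atom {1}{2}
a_1  = frozenset({frozenset({1})})                    # atom {1}
print(redundancy_leq(a_12, a_1))   # True: {1}{2} is a predecessor of {1}
print(redundancy_leq(a_1, a_12))   # False
```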
Definition 10 (Synergy-/Loss-lattice [
21]).
The synergy lattice is obtained by applying the ordering relation of Equation 10 to all atoms .
The synergy lattice for three visible variables is visualized in Figure 3b. On this lattice, we can think of an atom as representing the information that is contained in none of its sources (information outside their union). For example, an atom with two sources represents on the synergy lattice the information that is obtained from neither of its two sources about T. The ordering again indicates the expected subset relation: the information that is obtained from neither source is fully contained in the information that cannot be obtained from the first source alone, and thus the atom consisting of only the first source is a predecessor of the two-source atom.
With an intuition for both ordering relations in mind, we can see how the filter in the construction of atoms (Equation 8) removes sets that would be equivalent to another atom: a collection in which one source is a proper subset of another is removed from the power set of sources, since it would be equivalent to the atom obtained by dropping the superset source under the ordering of the redundancy lattice, and equivalent to the atom obtained by dropping the subset source under the ordering of the synergy lattice.
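This filter can be sketched as an antichain condition over collections of sources. The encoding below is our own, and the condition (no source is a proper subset of another source of the same collection) is the form of Equation 8 we assume:

```python
from itertools import combinations

def nonempty_subsets(items):
    items = list(items)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

def atoms(visible_variables):
    """Assumed form of Equation 8: collections of sources in which no source
    is a proper subset of another source of the same collection."""
    sources = nonempty_subsets(visible_variables)
    return [collection for collection in nonempty_subsets(sources)
            if not any(a < b for a in collection for b in collection)]

print(len(atoms({1, 2})))      # 4 atoms for two visible variables
print(len(atoms({1, 2, 3})))   # 18 atoms for three visible variables
```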
Notation 5 (Redundancy/Synergy lattices). We use the notation for the join and meet operators on the redundancy lattice, and for the join and meet operators on the synergy lattice. We use the notation for the top and for the bottom atom on the redundancy lattice, and and for the top and bottom atom on the synergy lattice. For an atom α, we use the notation for its down-set, for its strict down-set, and for its cover set. These definitions will only appear in the Möbius inverse of a function that is directly associated with either the synergy or redundancy lattice such that there is no ambiguity about which ordering relation has to be considered.
The redundant, unique, or synergetic information (partial contributions) can be calculated based on either lattice. They are obtained by quantifying each atom of the redundancy or synergy lattice with a cumulative measure that increases as we move up in the lattice. The partial contributions are then obtained in a second step from a Möbius inverse.
Definition 11 ([Cumulative] redundancy measure [
1]).
A redundancy measure is a function that assigns a real value to each atom of the redundancy lattice. It is interpreted as a cumulative information measure that quantifies the redundancy between all sources of an atom about the target T.
Definition 12 ([Cumulative] loss measure [
21]).
A loss measure is a function that assigns a real value to each atom of the synergy lattice. It is interpreted as a cumulative measure that quantifies the information about T that is provided by none of the sources of an atom.
To ensure that a redundancy measure actually captures the desired concept of redundancy, Williams and Beer [
1] defined three axioms that a measure
should satisfy. For the synergy lattice, we consider the equivalent axioms discussed by Chicharro and Panzeri [
21]:
Axiom 1 (Commutativity [
1,
21]).
Invariance in the order of sources (σ permuting the order of indices):
Axiom 2 (Monotonicity [
1,
21]).
Additional sources can only decrease the redundant information (redundancy lattice). Additional sources can only decrease the information that is contained in none of the sources (synergy lattice).
Axiom 3 (Self-redundancy [
1,
21]).
For a single source, the redundancy equals the mutual information of that source with the target. For a single source, the information loss equals the difference between the total available mutual information and the mutual information of the considered source with the target.
The first axiom states that an atom's redundancy and information loss should not depend on the order of its sources. The second axiom states that adding sources to an atom can only decrease the redundancy of all its sources (redundancy lattice) and can only decrease the information obtainable from none of its sources (synergy lattice). The third axiom ties both measures to mutual information and ensures that the bottom element of each lattice is quantified to zero.
Once a lattice with a corresponding cumulative measure is defined, we can use the Möbius inverse to compute the partial contribution of each atom. This partial information can be visualized as a partial area in a Venn diagram (see Figure 1a) and corresponds to the desired redundant, unique, and synergetic contributions. However, the same atom represents different partial contributions on each lattice: as visualized for the case of two visible variables in Figure 1, the unique information of one visible variable is represented by different atoms on the redundancy lattice and on the synergy lattice.
Definition 13 (Partial information [
1,
21]).
Partial information corresponds to the Möbius inverse of the corresponding cumulative measure on the respective lattice.
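Definition 13 can be sketched as a bottom-up computation over the chosen lattice. The helper below is our own; `cumulative` stands for any redundancy or loss measure and `leq` for the corresponding ordering relation (for example, `redundancy_leq` from the sketch above):

```python
def partial_information(cumulative, leq, lattice_atoms):
    """Moebius inverse on a finite lattice: the partial contribution of an atom
    is its cumulative value minus the partial contributions of its strict down-set."""
    lattice_atoms = list(lattice_atoms)

    def strict_down_set(atom):
        return [b for b in lattice_atoms if b != atom and leq(b, atom)]

    partial = {}
    # Processing atoms by increasing down-set size visits all predecessors first.
    for atom in sorted(lattice_atoms, key=lambda a: len(strict_down_set(a))):
        partial[atom] = cumulative(atom) - sum(partial[b] for b in strict_down_set(atom))
    return partial
```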
Remark 2. Using the Möbius inverse for defining partial information enforces an inclusion-exclusion relation in that all partial information contributions have to sum to the corresponding cumulative measure. Kolchinsky [14] argues that an inclusion-exclusion relation should not be expected to hold for PIDs and proposes an alternative decomposition framework. In that case, the partial contributions (unique/redundant/synergetic information) are no longer expected to sum to the total amount of available information.
Property 1 (Local positivity, non-negativity [
1]).
A partial information decomposition satisfies non-negativity or local positivity if its partial information contributions are always non-negative, as shown in Equation 12.
The non-negativity property is important if we assume an inclusion-exclusion relation since it states that the unique, redundant, or synergetic information cannot be negative. If an atom
provides a negative partial contribution in the framework of Williams and Beer [
1], then this may indicate that we over-counted some information in its down-set.
Remark 3. Several additional axioms and properties have been suggested since the original proposal of Williams and Beer [1], such as target monotonicity and target chain rule [4]. However, this work will only consider the axioms and properties of Williams and Beer [1]. To the best of our knowledge, no other measure since the original proposal (discussed below) has been able to satisfy these properties for an arbitrary number of visible variables while ensuring an inclusion-exclusion relation for their partial contributions.
It is possible to convert between both representations due to a lattice duality:
Definition 14 (Lattice duality and dual decompositions [
21]).
Let a redundancy lattice with an associated measure and a synergy lattice with an associated measure be given. The two decompositions are said to be dual if and only if the down-set on one lattice corresponds to the up-set on the other, as shown in Equation 13.
Williams and Beer [1] proposed the measure I_min, as shown in Equation 14, to be used as a measure of redundancy and demonstrated that it satisfies the three required axioms and local positivity. They define redundancy (Equation 14b) as the expected value of the minimum specific information (Equation 14a).
Remark 4. Throughout this work, we use the term 'target pointwise information' or simply 'pointwise information' to refer to 'specific information'. This avoids confusion when naming the corresponding binary input channels in Section 3.
To the best of our knowledge, this measure is the only existing non-negative decomposition that satisfies all three axioms listed above for an arbitrary number of visible variables while providing an inclusion-exclusion relation of partial information.
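For concreteness, the following sketch computes I_min for sources given as joint distributions with the target. It is our own implementation of Equation 14 as we read it (the expectation over target states of the minimum pointwise information, using the natural logarithm), not code from the original authors:

```python
import numpy as np

def pointwise_information(p_joint, t):
    """Pointwise (specific) information of a source about the target state t,
    for a joint distribution p_joint[t, a] over (T, A)."""
    p_t = p_joint.sum(axis=1)
    p_a = p_joint.sum(axis=0)
    value = 0.0
    for a in range(p_joint.shape[1]):
        if p_joint[t, a] > 0:
            p_a_given_t = p_joint[t, a] / p_t[t]
            p_t_given_a = p_joint[t, a] / p_a[a]
            value += p_a_given_t * (np.log(p_t_given_a) - np.log(p_t[t]))
    return value

def i_min(source_joints):
    """Expected minimum pointwise information over the sources; all joint
    distributions must share the same target marginal."""
    p_t = source_joints[0].sum(axis=1)
    return sum(p_t[t] * min(pointwise_information(pj, t) for pj in source_joints)
               for t in range(len(p_t)))
```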
However, the measure I_min could be criticized for not providing a notion of distinct information due to its use of a pointwise minimum (for each target state) over the sources. This leads to the question of distinguishing "the same information and the same amount of information" [3,4,5,6]. We can use the definition through a pointwise minimum (Equation 14) to construct examples of unexpected behavior: consider, for example, a uniform binary target variable T and two visible variables obtained as the outputs of the channels visualized in Figure 4. The channels are constructed to be equivalent for both target states and to provide access to distinct decision regions while ensuring a constant pointwise information.
Even though our ability to predict the target variable significantly depends on which of the two indicated channel outputs we observe (blue or green in Figure 4), the measure I_min concludes full redundancy between them. We consider this behavior undesirable and, as discussed in the literature, caused by an underlying failure to distinguish the same information. To resolve this issue, we will present a representation of f-information in Section 3.1 that allows the use of all (TPR, FPR)-pairs for each state of the target variable to represent a distinct notion of uncertainty.
2.3. Information Measures
This section discusses two generalizations of mutual information for discrete random variables based on f-divergences and Rényi divergences [22,23]. While mutual information has interpretational significance in channel coding and data compression, other f-divergences are significant in parameter estimation, high-dimensional statistics, and hypothesis testing [7, p. 88], and Rényi divergences can be found, among others, in privacy analysis [8]. Finally, we introduce Bhattacharyya information to demonstrate that it is possible to chain decomposition transformations in Section 3.3. All definitions in this section only consider the case of discrete random variables, which suffices for the context of this work.
Definition 15 (f-divergence [22]).
Let f be a function that satisfies the following three properties. By convention, we understand that 0 f(0/0) = 0 and 0 f(a/0) = a lim_{x→∞} f(x)/x for a > 0. For any such function f and two discrete probability distributions P and Q over the event space, the f-divergence for discrete random variables is defined as shown in Equation 15.
Notation 6. Throughout this work, we reserve the name f for functions that satisfy the required properties for an f-divergence of Definition 15.
An f-divergence quantifies a notion of dissimilarity between two probability distributions P and Q. Key properties of f-divergences are their non-negativity, their invariance under bijective transformations, and the fact that they satisfy a data-processing inequality [7, p. 89]. A list of commonly used f-divergences is shown in Table 1. Notably, the continuation at α = 1 of both the Hellinger divergence and the α-divergence results in the KL-divergence [24].
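The following sketch evaluates Equation 15 for discrete distributions. The code is our own, and the generator functions shown are common textbook choices that may differ from Table 1 by the constant-shift degree of freedom discussed below:

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) f(p(x) / q(x)) for discrete distributions.
    Terms with q(x) = 0 are omitted here for simplicity, which is only exact
    for generators with lim_{x -> inf} f(x)/x = 0 or when P is dominated by Q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    total = 0.0
    for pi, qi in zip(p, q):
        if qi > 0 and pi > 0:
            total += qi * f(pi / qi)
        elif qi > 0:                       # p(x) = 0: use the limit f(0+)
            total += qi * f(1e-12)         # crude numerical stand-in for the limit
    return total

kl = lambda x: x * np.log(x)               # Kullback-Leibler generator
tv = lambda x: 0.5 * np.abs(x - 1)         # total variation generator
print(f_divergence([0.5, 0.5], [0.9, 0.1], kl))
print(f_divergence([0.5, 0.5], [0.9, 0.1], tv))
```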
The generator function of an f-divergence is not unique, since f(x) and f(x) + c(x − 1) define the same divergence for any real constant c [7]. As a result, the considered α-divergence is a linear scaling of the Hellinger divergence of the same order, as shown in Equation 16.
Definition 16 (
f-information [
7]).
An f-information is defined based on an f-divergence from the joint distribution of two discrete random variables and the product of their marginals as shown in Equation 17.
Definition 17 (f-entropy). A notion of f-entropy for a discrete random variable is obtained from its self-information, i.e., the f-information of the variable with itself.
Notation 7. Using the KL-divergence results in the definitions of mutual information and Shannon entropy. Therefore, we use the notation I for mutual information (KL-information) and H (KL-entropy) for the Shannon entropy.
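Equation 17 can be sketched on top of the f_divergence helper from above (the code is again our own and reuses that helper, so it is not fully standalone):

```python
import numpy as np

def f_information(p_joint, f):
    """I_f(T; S) = D_f(P_{T,S} || P_T x P_S) for a discrete joint distribution,
    reusing the f_divergence sketch from above."""
    p_joint = np.asarray(p_joint, float)
    product = p_joint.sum(axis=1, keepdims=True) * p_joint.sum(axis=0, keepdims=True)
    return f_divergence(p_joint.ravel(), product.ravel(), f)

kl = lambda x: x * np.log(x)
p_joint = np.array([[0.30, 0.10, 0.10],
                    [0.05, 0.15, 0.30]])
print(f_information(p_joint, kl))   # mutual information I(T; S) in nats
```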
The remaining part of this section defines Rényi- and Bhattacharyya-information to highlight that they can be represented as invertible transformations of Hellinger-information. This will be used in Section 3.3 to transform the decomposition of Hellinger-information into a decomposition of Rényi- and Bhattacharyya-information.
Remark 5. We could similarly choose to represent Rényi divergence as a transformation of the α-divergence. A linear scaling of the considered f-divergence will, however, not affect our later results (see Section 3.3).
Definition 18 (Rényi divergence [23]).
Let P and Q be two discrete probability distributions over the event space; then the Rényi divergence of order α is defined as shown in Equation 18 for α ∈ (0, 1) ∪ (1, ∞) and extended to the remaining orders by continuation.
Notably, the continuation of the Rényi divergence at α = 1 also equals the KL-divergence [7, p. 116]. The Rényi divergence can be expressed as an invertible transformation of the Hellinger divergence of the same order (see Equation 18) [24].
Definition 19 (Rényi-information [
7]).
Rényi-information is defined analogously to f-information, as shown in Equation 19, and corresponds to an invertible transformation of Hellinger-information.
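The invertible transformation referenced in Definition 19 can be written down explicitly. The one-liner below assumes the standard relation D_α = log(1 + (α − 1) H_α) / (α − 1) between Rényi and Hellinger divergences of order α (cf. [24]), which we take to be the form of Equations 18 and 19:

```python
import numpy as np

def renyi_from_hellinger(hellinger_value, alpha):
    """Map a Hellinger divergence/information of order alpha to the Renyi
    divergence/information of the same order (assumed form of Equation 18/19)."""
    return np.log1p((alpha - 1.0) * hellinger_value) / (alpha - 1.0)

print(renyi_from_hellinger(0.25, 2.0))   # order 2: log(1 + chi-square-type value)
```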
Finally, we consider the Bhattacharyya distance (Definition 20), which is a linear scaling of a special case of the Rényi divergence (Equation 20) [24]. It is applied, among others, in signal processing [25] and coding theory [26]. The corresponding information measure (Equation 21) is, like the distance itself, a scaling of a special case of Rényi-information.
Definition 20 (Bhattacharyya distance [
27]).
Let P and Q be two discrete probability distributions over the event space; then the Bhattacharyya distance is defined as shown in Equation 20.
Definition 21 (Bhattacharyya-information).
Bhattacharyya-information is defined analogously to f-information, as shown in Equation 21.
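As a closing sketch (our own code), the Bhattacharyya distance of Definition 20 and the corresponding information measure can be computed directly from their standard forms; we assume that Equation 21 applies the distance to the joint distribution and the product of the marginals, analogous to Equation 17:

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """BD(P, Q) = -ln sum_x sqrt(p(x) q(x)), i.e. one half of the Renyi
    divergence of order 1/2 (Definition 20 / Equation 20)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.log(np.sum(np.sqrt(p * q)))

def bhattacharyya_information(p_joint):
    """Assumed form of Equation 21: the Bhattacharyya distance between the
    joint distribution and the product of its marginals."""
    p_joint = np.asarray(p_joint, float)
    product = p_joint.sum(axis=1, keepdims=True) * p_joint.sum(axis=0, keepdims=True)
    return bhattacharyya_distance(p_joint.ravel(), product.ravel())

p_joint = np.array([[0.30, 0.10, 0.10],
                    [0.05, 0.15, 0.30]])
print(bhattacharyya_information(p_joint))
```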