Introduction
The Shannon entropy H has been fully characterized by different sets of conditions [1,2,3,4]. The conditions themselves have been investigated deeply by [5,6], for instance. All characterizations use variants of the so-called chain rule, which can be summarized as follows. Let p and q be discrete probability distributions, and consider the distribution obtained by extending p at one position by q, i.e., by splitting the probability of that position proportionally to q. The chain rule, Equation (1), states that the entropy of the extended distribution equals the entropy of p plus the probability of the extended position times the entropy of q.
The construction in Equation (1) can be iterated, i.e., p is extended at several positions with different distributions; the case of two positions is given in Equation (2). The most elementary version of the chain rule is due to [2], where q in Equation (1) is restricted to a Bernoulli distribution. [7] rely on the full version of Equation (1), where q is arbitrary. Since the full version allows an immediate interpretation in various areas of application, they consider Equation (1) as preferable from a didactical point of view. Often, the completely iterated version, Equation (3), is taken [4], since this version allows the practical interpretation that first an alphabet is chosen with a certain probability and subsequently a letter within this alphabet is chosen according to the corresponding conditional distribution.
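To fix ideas, the following sketch checks numerically that extending p at a position i by q adds the i-th probability times the entropy of q to the Shannon entropy; this is our reading of the full chain rule in Equation (1), and the function names and the insertion position are illustrative only. The last two lines show Faddeev's elementary version, where q has only two atoms.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in nats, ignoring zero probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def extend(p, q, i):
    """Extend p at position i by q: replace p_i by (p_i*q_1, ..., p_i*q_m)."""
    p = list(p)
    return np.array(p[:i] + [p[i] * qj for qj in q] + p[i + 1:])

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))          # an arbitrary distribution p
q = rng.dirichlet(np.ones(3))          # an arbitrary distribution q
i = 1

lhs = shannon_entropy(extend(p, q, i))
rhs = shannon_entropy(p) + p[i] * shannon_entropy(q)
print(np.isclose(lhs, rhs))            # True: full chain rule in the sense of Equation (1)

# Faddeev's elementary version: q restricted to a Bernoulli distribution
b = [0.3, 0.7]
print(np.isclose(shannon_entropy(extend(p, b, 0)),
                 shannon_entropy(p) + p[0] * shannon_entropy(b)))
```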
[8] give the first algebraic approach, namely in terms of information loss. The latter is defined as H(p) - H(q), where the probability distribution q is a function of p; hence, the information loss quantifies the change in entropy under the transformation. The information loss is uniquely characterized by a number of properties, including being a convex linear map, which can be considered as a restatement of Faddeev's chain rule.
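The convex-linearity property can be illustrated numerically. The sketch below is our own minimal reading of the setting in [8], with illustrative variable names: a map pushes p forward to q by merging atoms, and the loss H(p) - H(q) of a λ-mixture of two such maps is the λ-mixture of the individual losses, because the entropy contribution of the binary mixture itself cancels.

```python
import numpy as np

def H(p):
    """Shannon entropy in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def loss(p, q):
    """Information loss of a map pushing p forward to q."""
    return H(p) - H(q)

def mixture(lam, p1, p2):
    """lam * p1 combined with (1 - lam) * p2 on the disjoint union of the supports."""
    return np.concatenate([lam * np.asarray(p1, dtype=float),
                           (1 - lam) * np.asarray(p2, dtype=float)])

# p is pushed to q by merging its first two atoms; likewise p2 to q2.
p,  q  = [0.2, 0.3, 0.5],      [0.5, 0.5]
p2, q2 = [0.1, 0.1, 0.4, 0.4], [0.2, 0.8]
lam = 0.25

lhs = loss(mixture(lam, p, p2), mixture(lam, q, q2))
rhs = lam * loss(p, q) + (1 - lam) * loss(p2, q2)
print(np.isclose(lhs, rhs))    # True: the information loss behaves convex linearly
```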
Common to all versions of the chain rule is that they describe situations where canonically at least two different distributions, p and q, are involved. [6] (Definition 3.1.1) introduce an alternative notation of Faddeev's condition, which formally considers only one distribution but nevertheless relies on computing the Shannon entropy of two distributions, namely p and a Bernoulli distribution, to state the chain rule.
The Rényi entropy generalizes the Shannon entropy and can be fully characterized via quasi-arithmetic means [5] (Section 5). Another approach to characterizing the Rényi entropy relies on a transform. The order-α information energy S, with Onicescu's information energy as a special case [9], is, up to a multiplicative constant, the sum of the α-powers of the probabilities. [10] gives a full characterization by means of a chain rule that is similar to Faddeev's variant, where q is a Bernoulli distribution and the constant depends on α only. [10] also generalizes this equation by introducing weighted sums. [4] (Theorem 4.5.1) characterizes the information energy by a formula that is analogous to the fully iterated version of the chain rule, Equation (3).
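To make the quantities of this paragraph concrete: the order-α information energy is, up to a multiplicative constant, the sum of the α-powers of the probabilities, with Onicescu's information energy as the case α = 2, and the Rényi entropy of order α is obtained from the same power sum. The sketch below uses our own names and omits the multiplicative constant.

```python
import numpy as np

def power_sum(p, alpha):
    """Sum of the alpha-powers of the probabilities, sum_i p_i**alpha."""
    p = np.asarray(p, dtype=float)
    return np.sum(p[p > 0] ** alpha)

def onicescu_energy(p):
    """Onicescu's information energy: the power sum with alpha = 2."""
    return power_sum(p, 2.0)

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha != 1), in nats."""
    return np.log(power_sum(p, alpha)) / (1.0 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
print(onicescu_energy(p))      # 0.34375
print(renyi_entropy(p, 2.0))   # -log(0.34375), roughly 1.07
print(renyi_entropy(p, 0.5))   # roughly 1.30
```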
[11] deals with a rather general approach to the entropy, which is based on the additivity assumption of the entropy for independent systems. Under certain conditions, called scale invariance there, the entropy is unique. This result, however, is not applicable here, since it is closely related to scalar real numbers and not to probability distributions. Yet, the characterizations given here are in this spirit. To be specific, a particular choice of q in Equation (1) yields, by Equation (2), a proportionality relation for the Shannon entropy. Theorem 1 in Section 1 states that the chain rule in Khinchin's characterization [1,2] can be split into two canonical components, namely the additivity property and this proportionality. The characterization of the Rényi entropy is analogous to the characterization of the Shannon entropy, except that the functional Equation (4) reflects the proportionality relation. A related characterization also exists for the min-entropy, cf. Theorem 3. All technical parts are postponed to Section 2. Section 3 concludes with some comments on our approach.
1. Results
We denote ordered sets as introduced in Equation (7); as we will address only the very first elements of an ordered set, the intuitive, but sloppy, notation on the right-hand side of Equation (7) is sufficient. Denote by P the set of all discrete probability distributions and by P_n the set of distributions with at most n atomic events that have a probability greater than zero. We denote such a probability distribution by p and the uniform distribution on n atoms by u_n.
Reordering the elements in Equation (5) will turn out to be very useful in the proofs; to this end, we define the operator in Equation (8) and adopt the accompanying convention.
Theorem 1. Let H be a function. Then, the following two assertions are equivalent:
- (1) H is the Shannon entropy, up to a positive multiplicative constant.
- (2) H has the following properties:
  - (a) H is a continuous function in the topology of convergence in distribution;
  - (b) H is invariant under any permutation π of the atoms;
  - (c) H is unchanged when atoms of probability zero are appended;
  - (d) on each P_n, H is maximal at the uniform distribution u_n;
  - (e) H(p) is finite if p is a non-degenerate geometric distribution;
  - (f) H is additive for independent systems;
  - (g) some function f exists such that, for all admissible distributions, the recursion in Equation (9) holds.
Conds. 2a-2g can be regarded as interpretable, hence intrinsic, properties, cf. [5] (Section 1.2). For this reason, we prefer Cond. 2d over a monotonicity assumption on H. Note that Conds. 2a-2d are precisely Khinchin's conditions [1] as given in [2], except that the chain rule is replaced by Conds. 2e-2g.
Remark 1. Note that f is not unique, since it is undetermined at 1. A convenient choice is to take f(1) = 1.
Remark 2. Our approach has a simple, but nice, implication: an identity in which the Kullback-Leibler divergence appears.
In contrast to our characterization of the Shannon entropy, our analogous characterization of the Rényi entropy needs stronger assumptions on the unknown function f.
Definition 1 (Def. 2.3.2 in [12]). A function f defined on an open set U ⊆ R is called real analytic if for each y ∈ U a neighborhood of y exists such that, for all x in this neighborhood, there is an absolutely and uniformly convergent series representation of the form f(x) = ∑_{k=0}^∞ a_k (x - y)^k.
In the following, we call a function smooth if it can be extended to a real analytic function on a domain that includes the closure of its domain of definition.
Theorem 2. Let H be a function. Then, the following two assertions are equivalent:
- (1) H is the Rényi entropy of some order α, up to a positive multiplicative constant.
- (2) H satisfies Conds. 2a-2f of Theorem 1, and a value and a smooth function f exist such that, for all admissible distributions, Equation (10) holds.
Since Theorem 2 gives a characterization only up to a multiplicative constant, the missing constant is typically determined by the requirement that the Shannon entropy is canonically obtained in the limit α → 1. The limit of the Rényi entropy as α → ∞ is called the min-entropy and will be considered next. Two major differences to the two previous theorems exist: First, the function f appears on the left-hand side of the proportionality relation. Second, two characterizing equations are necessary. Unfortunately, the nice interpretability of Equation (9) reduces in Equation (10) and even further in Equation (11) and its companion equation.
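The two limits used here are standard: as α → 1 the Rényi entropy tends to the Shannon entropy, and as α → ∞ it tends to the min-entropy -log max_i p_i. A quick numerical check (illustrative names only):

```python
import numpy as np

def renyi(p, alpha):
    """Renyi entropy of order alpha (alpha != 1), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def min_entropy(p):
    return -np.log(np.max(np.asarray(p, dtype=float)))

p = [0.5, 0.25, 0.125, 0.125]
print(abs(renyi(p, 1.0001) - shannon(p)) < 1e-3)     # alpha -> 1 recovers the Shannon entropy
print(abs(renyi(p, 200.0) - min_entropy(p)) < 1e-2)  # alpha -> infinity gives the min-entropy
```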
Theorem 3. Let H be a function. Then, the following two assertions are equivalent:
- (1) H is the min-entropy, up to a positive multiplicative constant.
- (2) H satisfies Conds. 2a-2c, 2e and 2f of Theorem 1. Furthermore, an additional normalization holds, and a smooth, monotone function f exists such that, for all admissible distributions, Equation (11) and its companion equation hold.
2. Proofs
2.1. Auxiliary Results
For the reader’s convenience, we first repeat several known functional equations and properties.
Lemma 1. Let H be the Shannon entropy or the Rényi entropy. Then, H is maximal on P_n at the uniform distribution u_n. See, for instance, Lemma 2.2.4 and Remark 4.4.4 in [4] for a proof.
Lemma 2. Let H be a function fulfilling Conds. 2d and 2f of Theorem 1. Then, H(u_n) = c log n for all n and some constant c.
Proof. The inclusion u_n ∈ P_{n+1} and Cond. 2d immediately imply the monotonicity H(u_n) ≤ H(u_{n+1}). The claim then follows from Cond. 2f and Implication 1 of Theorem V in [13]. □
Lemma 3 (Identity theorem, Corollary 1.2.7 in [14]). If f and g are real analytic functions on an open interval U and if there is a sequence of distinct points in U with a limit point in U such that f and g coincide on the sequence, then f = g on U.
The proof for the Shannon entropy needs the uniqueness of the solution of the functional Equation (13) below. We state two lemmas for the same purpose. The first needs the strong additional assumption that the solution is real analytic. The lemma itself remains unused, but the ideas behind the proof are used several times. The second lemma states the version used in Theorem 1; see Theorem 6.4.8 in [6] for a condition similar to Equation (16).
Lemma 4. Let f be a real-valued function on a domain that includes (0, 1). The following two assertions are equivalent:
- (1) f is real analytic and Equation (13) holds for some admissible parameter;
- (2) f is the identity.
Proof. By Equation 0.266 in [15], the identity satisfies Equation (13). Assume now that Equation (13) holds. Then, Equation (14) follows, since Equation (15) holds. Since the points involved converge, Lemma 3 yields that f is uniquely determined by the knowledge of a single suitable value. □
Lemma 5. Let f be a real-valued function on (0, 1). The following two assertions are equivalent:
- (1) Equation (13) holds, f is continuous, and Equation (16) holds for all admissible arguments;
- (2) f is the identity.
Proof. Equation (16) yields …. Equations (15) and (16) imply that … for all … and …. Let … for …. It suffices to show that D is dense in …. Assume that it is not, i.e., an interval … exists with …. Since …, we may assume without loss of generality that …. Then, …. Consider the map f on the open intervals of …, and let … be the n-times iterated application of f, i.e., … and …. By assumption, f maps … to either … or …. By construction, if …, then …. If eventually …, then … always holds. Let … be the length of an interval …, i.e., …. If …, then …. Hence, … finally contains … or its length is strictly monotonically increasing. If the latter were true, the limit … would exist as …. In the limit, we would have …, a contradiction to …. □
2.2. Properties of the Operator
We will use the operator defined in Equation (8) iteratively; for instance, …. Note that the terms in general do not commute, e.g., … and …. The proofs of the theorems in Section 1 rely on the properties of various probability distributions. We denote by … the Bernoulli distribution, i.e., …, and by … the geometric distribution, i.e., …. The geometric distribution plays a dominant role in all of the subsequent proofs. Its importance has long been known in information theory: it maximizes the Shannon entropy among discrete probability distributions with given mean [16] (Theorem 5.8) and minimizes the min-entropy under fixed variance among all discrete log-concave random variables [17] (Theorem 1). Finally, we denote by … equality up to inserting zeros and reordering.
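The maximum-entropy property cited from [16] can be checked numerically: among distributions on the non-negative integers with a fixed mean, the geometric distribution attains the largest Shannon entropy. The sketch below compares a geometric with a Poisson distribution of the same mean; the parameterization p_k = (1 - a)a^k and the truncation are our own choices and may differ from the notation used in this paper.

```python
import numpy as np
from math import exp, factorial

def shannon(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

mean, K = 4.0, 100                          # common mean and truncation point
a = mean / (1.0 + mean)                     # geometric on {0,1,2,...} with mean a/(1-a) = 4
geom    = np.array([(1 - a) * a ** k for k in range(K)])
poisson = np.array([exp(-mean) * mean ** k / factorial(k) for k in range(K)])

print(shannon(geom), shannon(poisson))      # roughly 2.50 vs 2.09 nats
print(shannon(geom) > shannon(poisson))     # True: geometric maximizes entropy at fixed mean
```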
Lemma 6. Let … and …. Then, …. In particular, for …, we have …. Furthermore, for … and …, ….
Proof. We have …, since this is true for … and …, by induction. Hence, …. Now, … for …, implying Equation (19). Further, we have …, so that Equation () holds by Equation (19). The other equalities are immediate. □
2.3. Proof of Theorem 1
Assume that H is the Shannon entropy. Then, … for …, and all the other assertions in the second condition hold obviously when choosing f to be the identity.
Now, assume that the second condition holds. By Cond. 2a, Cond. 2b holds for all …. Cond. 2f and Equation () yield …. It is easy to check that …. The function F is finite on … due to Cond. 2e, Cond. 2f, and Equations () and (30), i.e., …. Replacing a by … in the equation above yields …. Now, … by Equations (21) and (30). Equations (), (9) and (30) yield …. Since … by Equation (32) and Cond. 2e, we immediately get that …. Equation (31) now reads …. Plugging … into Equation (9) shows that f is continuous on …. Equation (34) and Lemma 5 yield that … and … for …. Equation (9) implies that …, so that f is the identity on …. Lemma 2 implies that H(u_n) = c log n for some c. Henceforth, we may assume that c = 1. Cond. 2f and Equations (19) and (30) deliver …. This implies …, so that, by Equation (32), …. Due to the recursion Formula (35) and Equation (32), it suffices to show that the above formula extends to …, to finish the proof. By Equations () and (30), we get …, so that … by iteration and by Equation (32), respectively. On the other hand, by Equations () and (30), …, so that … and, by Equation (32), …. Cond. 2a completes the proof.
2.4. Proof of Theorem 2
Assume that H is the Rényi entropy. Let …, i.e., …. Then, …, so that ….
Conversely, assume that the second condition holds. Since … and S is strictly monotone in H, we can choose the sign of … such that S is increasing in H and …. Furthermore, S is continuous. If …, then …. Hence, … for …, as otherwise, by Equation (21), we would get a contradiction to Cond. 2e. It follows that … is continuous on …. Equation (10) implies … for all …, so that …. Cond. 2f and Equation () yield … and thus …. Cond. 2f in Theorem 1 reads …. By Cond. 2a, Cond. 2b holds for all …. It is easy to check by induction that Equation (10) implies …. Henceforth, we assume that …. Cond. 2e and Equation () yield … as …. Since … is assumed to be finite, the function G given by … is finite by Equation (38). Hence, by Equations (19), (36) and (37), …. Equation (39) implies that …, so that …, and Equation (39) reads …. In analogy to Equation (33), we get from Equation () that …, so that, by Equation (41), …, i.e., as …, …. Since … for all …, it follows that … for all …. Now, Equation (40) reads …. Since …, the function f cannot take the value 1 on …. Due to the continuity of f, we may summarize that …. Lemma 2 implies that H(u_n) = c log n for some c. Then, Equation (43) gives …. Furthermore, taking the idea from Equation (), …, and, cf. Equation (), …, so that, for …, we have …. The continuity of S and Equations (43) and (42) imply that H is uniquely determined if f and c are unique; so, it remains to determine them. In line with the idea from Equation (), Equations (43) and () yield …. By Equations (28) and (10), we have …. Equation (46) yields …, i.e., …. Since … and … for …, we get for … that …, or …. Since … converges to 1 and f is assumed to be smooth, Lemma 3 and the proof of Lemma 4 yield the uniqueness of f, and … for … and some …. Letting …, Equations (44) and (45) yield …. Since the left-hand side is independent of k, the equation must hold.
2.5. Proof of Theorem 3
Assume that H is the min-entropy and let f be the identity. Then, Equations (11) and () obviously hold.
Conversely, let the second condition of the theorem hold. Cond. 2f and Equation () imply … and thus …. Let … and …. Equation () yields … by iteration. Equation () further yields …, and f has to be monotonically increasing. Assume … for some …. Then, … for all … and, by Equation (11), …. Letting …, we get …, which contradicts Cond. 2e. Now, Equation (47) yields …, so that …. Henceforth, we assume that …. Since …, we have, by Equation (), … and …. Equation (11) yields …, where the limit … exists since f is monotonically increasing and thus implies …. Hence, by Equation (19), …; in particular, …. Since …, it follows from Equation (51) that …. Thus, the recursion Formula (50) determines H uniquely as soon as f is unique. Equations (52) and (48) yield …, so that, in analogy to Equation (14), …. Now, Equations (50), () and (52) yield …. On the other hand, Equation () and iterating Equation (50) yield …. In particular, …, and … is necessarily 1. We have …. Equations (49) and (54) and the monotonicity of f entail …. In particular, …. Equation (53) delivers …. Hence, for …, we have … and … as …. Now, Lemma 3 yields the uniqueness of f.
3. Discussion & Conclusions
The characterization given here is not only novel with respect to splitting the chain rule into its natural components; it is also novel in the sense that only a rather weak relationship is required for the chain rule, i.e., the proportionality factor itself is not given but arises naturally. Furthermore, the recursion formula has the advantage that it allows an interpretation as an intrinsic property, since only a single distribution is involved and since recursion formulae in general may allow additional practical interpretations such as self-similarity or scale invariance.
In comparison to Faddeev’s approach, our proofs emphasize the role of the geometric distribution, which arises as the infinite iteration of the Bernoulli distribution. The uniqueness proof for the Shannon entropy uses almost only the resulting proportionality relation of the geometric distribution, simplifying Faddeev’s approach. Further, while Equation (2) is a rule, proportionality can be considered a law. This yields a deeper understanding of the Shannon entropy and may justify broader applications outside telecommunication.
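The role of the geometric distribution as the infinite iteration of the Bernoulli distribution can be made explicit with the Shannon chain rule: splitting off the first atom of g_a = ((1 - a), (1 - a)a, (1 - a)a^2, ...) leaves a rescaled copy of g_a, so that H(g_a) = H(1 - a, a) + a H(g_a), i.e., (1 - a) H(g_a) = H(1 - a, a). This proportionality, written in our own parameterization, which may differ from the one used in the paper, is checked numerically below.

```python
import numpy as np

def shannon(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

a, K = 0.6, 200                                        # parameter and truncation point
geom = np.array([(1 - a) * a ** k for k in range(K)])  # ((1-a), (1-a)a, (1-a)a^2, ...)
bern = np.array([1 - a, a])

H_g = shannon(geom)
# One application of the chain rule: split off the first atom; the rest is a * geom again.
print(np.isclose(H_g, shannon(bern) + a * H_g))        # True
# Equivalent proportionality: (1 - a) * H(geom) = H(Bernoulli(a))
print(np.isclose((1 - a) * H_g, shannon(bern)))        # True
```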
The proofs rely strongly on distributions that generalize the uniform distribution and the geometric distribution, respectively. Possibly, these distributions play or will play a role also in other areas.
Although the characterization and the proofs of the Shannon and the Rényi entropy follow the same scheme, they show remarkable differences. For instance, the generalization of the geometric distribution in Equation (17) does not play a role in the case of the Shannon entropy. Also, the functional characterization of the Shannon entropy in Equation (9) is not the limit of the characterization of the Rényi entropy in Equation (10). However, Equation (9) might be understood as the derivative of Equation (10) in the following sense, cf. [18]. Assume that f in Equation (10) is differentiable. Then, …. For …, we define …. Then, we get ….
Another interpretation of our results is that S is not a natural transformation of the Rényi entropy, in contrast to …. This definition bridges to the generalized information function [5] (Section 6.2) and to the definite functions, cf. [19] (Chapter 3, Corollary 3.3 and proof of Theorem 2.2; Chapter 6, Example 5.16). Indeed, Equation (10) can be rewritten in full analogy to Equation (9), namely as …, where …. Note that the interpretation … is also possible.
Several technical assumptions had to be included in the characterization of the Rényi entropy, such as the condition that f be real analytic. Although we consider this a minor point, we hope that these assumptions can be dropped in the future.
Author Contributions
Conceptualization and methodology, M.S.; investigation, validation and resources, C.D.; writing—original draft, M.S.; writing—review and editing, C.D.; all authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Acknowledgments
The authors thank Christopher Dörr for many hints and comments. They also thank Thomas Reichelt for valuable hints.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Khinchin, A. The concept of entropy in the theory of probability. Uspekhi Matem. Nauk 1953, 8, 3–20.
- Faddeev, D. On the concept of entropy of a finite probabilistic scheme. Uspekhi Matem. Nauk 1956, 11, 227–231.
- Rényi, A. On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability: Contributions to the Theory of Statistics; University of California Press: Berkeley, CA, USA, 1961; pp. 547–562.
- Leinster, T. Entropy and Diversity: The Axiomatic Approach; Cambridge University Press: Cambridge, UK, 2021.
- Aczél, J.; Daróczy, Z. On Measures of Information and Their Characterizations; Academic Press: New York, NY, USA, 1975.
- Ebanks, B.; Sahoo, P.K.; Sander, W. Characterizations of Information Measures; World Scientific: Singapore, 1998.
- Carcassi, G.; Aidala, C.; Barbour, J. Variability as a better characterization of Shannon entropy. Eur. J. Phys. 2021, 42, 045102.
- Baez, J.; Fritz, T.; Leinster, T. A characterization of entropy in terms of information loss. Entropy 2011, 13, 1945–1957.
- Onicescu, O. Energie informationnelle. C. R. Acad. Sci. Paris 1966, 263, 841–842.
- Pardo, L. Order-α weighted information energy. Inform. Sci. 1986, 40, 155–164.
- Schlather, M. An algebraic generalization of the entropy and its application to statistics. arXiv 2024, arXiv:2404.05854.
- Krantz, S. Function Theory of Several Complex Variables, 2nd ed.; AMS: Pacific Grove, CA, USA, 1992.
- Erdős, P. On the distribution function of additive functions. Ann. Math. 1946, 47, 1–20.
- Krantz, S.; Parks, H. A Primer of Real Analytic Functions; Birkhäuser: Boston, MA, USA, 2002.
- Gradshteyn, I.; Ryzhik, I. Table of Integrals, Series, and Products, 6th ed.; Academic Press: London, UK, 2000.
- Conrad, K. Probability Distributions and Maximum Entropy; Technical Report; University of Connecticut. Available online: https://kconrad.math.uconn.edu/blurbs/analysis/entropypost.pdf (accessed on 14 November 2024).
- Jakimiuk, J.; Murawski, D.; Nayar, P.; Słobodianiuk, S. Log-concavity and discrete degrees of freedom. Discrete Math. 2024, 347, 114020.
- Życzkowski, K. Rényi extrapolation of Shannon entropy. Open Syst. Inf. Dyn. 2003, 10, 297–310.
- Berg, C.; Christensen, J.P.R.; Ressel, P. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions; Springer: New York, NY, USA, 1984.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).