Preprint (Article). This version is not peer-reviewed.

An Intrinsic Characterization of Shannon's and Rényi's Entropy

A peer-reviewed article of this preprint also exists.

Submitted: 15 November 2024. Posted: 19 November 2024.


Abstract
All characterizations of Shannon's entropy include the so-called chain rule, a formula on a hierarchically structured probability distribution that is based on at least two elementary distributions. We show that the chain rule can be split into two natural components: the well-known additivity of the entropy for cross-products, and a variant of the chain rule that involves only a single elementary distribution. The latter is given as a proportionality relation and hence allows a vague interpretation as self-similarity, hence as an intrinsic property of the Shannon entropy. A similar characterization is given for the Rényi entropy and the min-entropy.

Introduction

The Shannon entropy H has been fully characterized by different sets of conditions [1,2,3,4]. The conditions themselves have been investigated deeply by [5,6], for instance. All characterizations use variants of the so-called chain rule, which can be summarized as follows. Let p = {p_1, ..., p_n} and q = {q_1, ..., q_m} be discrete probability distributions, and
p^{k,q} = {p_1, ..., p_{k-1}, p_k q_1, ..., p_k q_m, p_{k+1}, ..., p_n}.
The chain rule states that
H(p^{k,q}) = H(p) + p_k H(q).    (1)
The construction in Equation (1) can be iterated, i.e., p is extended at several positions k_i with different distributions q^{(i)}. In case of two positions, this is
p^{k_1, q^{(1)}; k_2, q^{(2)}} := (p^{k_2, q^{(2)}})^{k_1, q^{(1)}},    (2)
where k_1 < k_2. The most elementary version of the chain rule is due to [2], where q in Equation (1) is restricted to a Bernoulli distribution. [7] rely on the full version of Equation (1), where q is arbitrary. Since the full version allows an immediate interpretation in various areas of application, they consider Equation (1) as preferable from a didactical point of view. Often, the completely iterated version,
H(p^{1, q^{(1)}; ...; n, q^{(n)}}) = H(p) + Σ_{k=1}^{n} p_k H(q^{(k)}),    (3)
is taken [4], since this version allows the practical interpretation that first an alphabet α_k is chosen with probability p_k and subsequently a letter within α_k is chosen according to q^{(k)}.
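The chain rule and its fully iterated version can be checked numerically. The following sketch uses natural logarithms; the helper names (`shannon`, `extend`) are ours, not taken from the cited references.

```python
import math

def shannon(p):
    """Shannon entropy with natural logarithm, using the convention 0*log 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def extend(p, k, q):
    """The distribution p^{k,q}: replace the k-th atom p_k by p_k q_1, ..., p_k q_m (k is 1-based)."""
    return p[:k - 1] + [p[k - 1] * x for x in q] + p[k:]

p = [0.5, 0.3, 0.2]
q = [0.6, 0.4]

# chain rule (1): H(p^{k,q}) = H(p) + p_k H(q)
assert abs(shannon(extend(p, 2, q)) - (shannon(p) + p[1] * shannon(q))) < 1e-12

# fully iterated version (3): extending from the last position backward keeps indices valid
qs = [[0.7, 0.3], [0.6, 0.4], [0.5, 0.5]]
full = list(p)
for k in range(len(p), 0, -1):
    full = extend(full, k, qs[k - 1])
assert abs(shannon(full) - (shannon(p) + sum(pk * shannon(qk) for pk, qk in zip(p, qs)))) < 1e-12
```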
[8] give the first algebraic approach, namely in terms of information loss. The latter is defined as F_{p,q} = H(p) - H(q), where the probability distribution q is a function of p; hence F_{p,q} quantifies the change in entropy under the transformation. The information loss is uniquely characterized by a number of properties, including convex linearity of F, i.e., the loss of a convex combination λp ⊕ (1-λ)p' equals λ F_{p,q} + (1-λ) F_{p',q'}, which can be considered as a restatement of Faddeev's chain rule.
Common to all versions of the chain rule is that they describe situations where canonically at least two different distributions, p and q, are involved. [6, Definition 3.1.1] introduces an alternative notation of Faddeev's condition, which formally considers only one distribution but nevertheless relies on computing the Shannon entropy of two distributions, namely p and a Bernoulli distribution, to state the chain rule.
The Rényi entropy generalizes the Shannon entropy and can be fully characterized via quasi-arithmetic means [5, Section 5]. Another approach to characterize the Rényi entropy relies on a transform. The order-α information energy S, with Onicescu's information energy as a special case [9], is, up to a multiplicative constant, the sum of the α-powers of the probabilities. [10] gives a full characterization by means of a chain rule that is similar to Faddeev's variant,
S(p^{k,q}) = S(p) - p_k^α (const - S(q)),
where q is a Bernoulli distribution and the constant depends on α only. [10] also generalizes this equation by introducing weighted sums. [4, Theorem 4.5.1] characterizes the information energy by a formula that is analogous to the fully iterated version of the chain rule (3).
[11] deals with a rather general approach to the entropy, which is based on the additivity assumption of the entropy for independent systems. Under certain conditions, called scale invariance there, the entropy is unique. This result, however, is not applicable here, since it is closely related to scalar real numbers and not to probability distributions. Yet, the characterizations given here are in this spirit. To be specific, let q = p in Equation (1). Then, we get
p^{k,p} = {p_1, ..., p_{k-1}, p_k p_1, ..., p_k p_n, p_{k+1}, ..., p_n},    (5)
and hence, by Equation (1), a proportionality relation for the Shannon entropy,
H(p^{k,p}) = (1 + p_k) H(p).    (4)
Theorem 1 in Section 1 states that the chain rule in Khinchin's characterization [1,2] can be split into two canonical components, the additivity property and the proportionality between H(p^{k,p}) and H(p). The characterization of the Rényi entropy is in analogy to the characterization of the Shannon entropy, except that the functional equation
S(p^{k,p}) = (1 + p_k^α) S(p) - p_k^α,    (6)
instead of Equation (4), reflects the proportionality relation.
reflects the proportionality relation. A related characterization also exists for the min-entropy, cf. Theorem 3. All technical parts are postponed to Section 2. Section 3 finalizes with some comments on our approach.

1. Results

We denote by {·} an ordered set and let
a {p_1, p_2, ...} = {a p_1, a p_2, ...},  a ≥ 0,
{p_1, p_2, ...} ⊎ {q_1, q_2, ...} = {p_1, p_2, ..., q_1, q_2, ...}.    (7)
As we will address only the very first elements of an ordered set, the intuitive but sloppy notation on the right-hand side of Equation (7) is sufficient.
Denote by P the set of all discrete probability distributions and by P_n ⊆ P the set of distributions with at most n atomic events that have a probability greater than zero. We denote such a probability distribution by {p_1, ..., p_n} ∈ P_n. The uniform distribution is denoted by U_n = {1/n, ..., 1/n}.
Reordering the elements in Equation (5) will turn out to be very useful in the proofs. We define for p ∈ P
p_{k̂} = {p_i : i ≠ k},   ∗_k p = p_k p ⊎ p_{k̂}.    (8)
For instance,
∗_3 {p_1, p_2, p_3} = {p_3 p_1, p_3 p_2, p_3 p_3, p_1, p_2}.
We use the convention 0 log 0 = 0 .
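The operator of Equation (8) has a direct list-based implementation (a sketch; the function name `star` is ours):

```python
def star(k, p):
    """*_k p = (p_k * p) followed by p without its k-th entry (k is 1-based), cf. Equation (8)."""
    return [p[k - 1] * x for x in p] + [x for i, x in enumerate(p, start=1) if i != k]

p = [0.2, 0.3, 0.5]
out = star(3, p)
# matches the worked example: {p_3 p_1, p_3 p_2, p_3 p_3, p_1, p_2}
assert out == [0.5 * 0.2, 0.5 * 0.3, 0.5 * 0.5, 0.2, 0.3]
# *_k p is again a probability distribution: total mass p_k * 1 + (1 - p_k) = 1
assert abs(sum(out) - 1.0) < 1e-12
```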
Theorem 1.
Let H : P → [0, ∞] be a function. Then the following two assertions are equivalent:
(1)
H is the Shannon entropy,
H(p) = -Σ_{p_i ∈ p} p_i log p_i,
up to a positive multiplicative constant.
(2)
H has the following properties:
(a)
H is a continuous function in the topology of convergence in distribution;
(b)
H({p_1, ..., p_n}) = H({p_{π(1)}, ..., p_{π(n)}}) for any permutation π and {p_1, ..., p_n} ∈ P_n, n ∈ N;
(c)
H({p_1, ..., p_n}) = H({0, p_1, ..., p_n}) for all {p_1, ..., p_n} ∈ P_n, n ∈ N;
(d)
H(p) ≤ H(U_n) < ∞ for all p ∈ P_n, n ∈ N;
(e)
H(p) ∈ (0, ∞) if p is a non-degenerate geometric distribution;
(f)
H(p × q) = H(p) + H(q) for all p, q ∈ P;
(g)
some function f : [0, 1] → [0, ∞) exists such that for all p ∈ P we have
H(∗_1 p) = (1 + f(p_1)) H(p).    (9)
Conds. 2a-2g can be regarded as interpretable, hence intrinsic properties, cf. [5, Section 1.2]. For this reason we prefer Cond. 2d over a monotonicity assumption on H. Note that Conds. 2a-2d are precisely Khinchin's conditions [1] as given in [2], except for replacing the chain rule by Conds. 2e-2g.
Remark 1.
Note that f is not unique since it is undetermined at 1. A convenient choice is to take f(1) = lim_{a→1} f(a).
Remark 2.
Our approach has a simple but nice implication. Let p, q ∈ P_n, n ∈ N. Then,
KL(∗_1 p, ∗_1 q) = (1 + p_1) KL(p, q),
where KL denotes the Kullback-Leibler divergence.
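Remark 2 follows from a one-line computation and can be checked numerically (a sketch with natural logarithms; `kl` and `star1` are our names, and q_i > 0 is assumed wherever p_i > 0):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence, with the convention 0*log(0/q) = 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def star1(p):
    """*_1 p = (p_1 * p) followed by p without its first entry."""
    return [p[0] * x for x in p] + p[1:]

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
# Remark 2: KL(*_1 p, *_1 q) = (1 + p_1) KL(p, q)
assert abs(kl(star1(p), star1(q)) - (1 + p[0]) * kl(p, q)) < 1e-12
```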
In contrast to our characterization of the Shannon entropy, our analogous characterization of the Rényi entropy needs stronger assumptions on the unknown function f.
Definition 1
(Def 2.3.2 in [12]). A function f : U → R, U ⊆ R, is called real analytic if for each y ∈ U a neighborhood U_y of y exists such that for all x ∈ U_y there is an absolutely and uniformly convergent series representation of the form f(x) = Σ_k a_k (x - y)^k.
In the following, we call a function f : [0, 1] → [0, ∞) smooth if ∏_{k=0}^∞ (1 + f(a^{2^k})) < ∞ for all a ∈ (0, 1) and if it can be extended to a real analytic function on a domain including (0, 1].
Theorem 2.
Let H : P → [0, ∞] be a function. Then, the following two assertions are equivalent:
(1)
H is the Rényi entropy,
H(p) = (1/(1 - α)) log Σ_{p_i ∈ p} p_i^α,
for some α ∈ (0, ∞) ∖ {1}, up to a positive multiplicative constant.
(2)
H satisfies Conds. 2a-2f of Theorem 1. A value γ ∈ R ∖ {0} and a smooth function f : [0, 1] → [0, ∞) exist such that, for S = e^{γH}, we have
S(∗_1 p) = (1 + f(p_1)) S(p) - f(p_1)  for all p ∈ P.    (10)
Since Theorem 2 gives a characterization only up to a multiplicative constant, the missing constant is typically determined by the requirement that the Shannon entropy is obtained in the limit α → 1. The limit of the Rényi entropy as α → ∞ is called the min-entropy and will be considered next. Two major differences to the two previous theorems exist: First, the function f appears on the left-hand side of the proportionality relation. Second, two characterizing equations are necessary. Unfortunately, the nice interpretability of Equation (9) reduces in Equation (10) and even further in Equations (11) and (12).
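For the Rényi entropy itself, the transform S = e^{(1-α)H} equals the α-power sum, and Equation (10) holds with f(a) = a^α; this can be confirmed numerically (helper names ours):

```python
import math

def renyi(p, alpha):
    """Renyi entropy of order alpha (natural logarithm)."""
    return math.log(sum(x ** alpha for x in p)) / (1.0 - alpha)

def star1(p):
    """*_1 p = (p_1 * p) followed by p without its first entry."""
    return [p[0] * x for x in p] + p[1:]

p = [0.5, 0.3, 0.2]
for alpha in (0.5, 2.0, 5.0):
    S = lambda r: math.exp((1.0 - alpha) * renyi(r, alpha))  # S = e^{(1-alpha)H} = sum_i p_i^alpha
    f_p1 = p[0] ** alpha                                     # f(a) = a^alpha
    # Equation (10): S(*_1 p) = (1 + f(p_1)) S(p) - f(p_1)
    assert abs(S(star1(p)) - ((1.0 + f_p1) * S(p) - f_p1)) < 1e-12
```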
Theorem 3.
Let H : P → [0, ∞] be a function. Then, the following two assertions are equivalent:
(1)
H is the min-entropy,
H(p) = -log max_{p_i ∈ p} p_i.
(2)
H satisfies Conds. 2a-2c, 2e and 2f of Theorem 1. Furthermore, H(p) < ∞ for p ∈ P_n, n ∈ N. A smooth, monotone function f : [0, 1] → [0, ∞) exists such that, for S = e^{-H} and all p ∈ P ∖ {{1}}, we have
max{f(p_1), S(∗_1 p)} = S(p),    (11)
(1 + f(p_1)) S((∗_1 p)_{1̂} / (1 - p_1^2)) = S(p_{1̂} / (1 - p_1)).    (12)
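Taking S(p) = e^{-H(p)} = max_i p_i and f the identity, both characterizing relations can be verified numerically; the code states the two equations in our reading of the theorem (helper names ours):

```python
def star1(p):
    """*_1 p = (p_1 * p) followed by p without its first entry."""
    return [p[0] * x for x in p] + p[1:]

S = max            # S(p) = e^{-H(p)} = max_i p_i for the min-entropy
f = lambda a: a    # f taken as the identity

for p in ([0.5, 0.3, 0.2], [0.2, 0.8], [0.9, 0.1]):
    # Equation (11), in our reading: max{ f(p_1), S(*_1 p) } = S(p)
    assert abs(max(f(p[0]), S(star1(p))) - S(p)) < 1e-12
    # Equation (12): (1 + f(p_1)) S((*_1 p) without its first atom, renormalized)
    #              = S(p without p_1, renormalized)
    lhs_dist = [x / (1.0 - p[0] ** 2) for x in star1(p)[1:]]
    rhs_dist = [x / (1.0 - p[0]) for x in p[1:]]
    assert abs((1.0 + f(p[0])) * S(lhs_dist) - S(rhs_dist)) < 1e-12
```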

2. Proofs

2.1. Auxiliary Results

For the reader’s convenience, we first repeat several known functional equations and properties.
Lemma 1.
Let H be the Shannon entropy or the Rényi entropy. Then, H(U_n) is maximal on P_n.
See, for instance, Lemma 2.2.4 and Remark 4.4.4 in [4] for a proof.
Lemma 2.
Let H : P [ 0 , ] be a function fullfilling Conds. 2d and 2f of Theorem 1. Then, H ( U n ) = c log n for all n N and some c > 0 .
Proof. 
The inclusion P_n ⊆ P_{n+1} and Cond. 2d immediately imply monotonicity, i.e.,
H(U_n) ≤ H(U_{n+1}),  n ∈ N.
The claim then follows from Cond. 2f and Implication 1 of Theorem V in [13].
Lemma 3
(Identity theorem, Corollary 1.2.7 in [14]). If f and g are real analytic functions on an open interval U ⊆ R and if there is a sequence of distinct points x_1, x_2, ... in U with lim_{n→∞} x_n ∈ U such that f(x_n) = g(x_n) for all n ∈ N, then
f(x) = g(x) for all x ∈ U.
The proof for the Shannon entropy needs the uniqueness of the solution of the functional Equation (13) below. We state two lemmas for the same purpose. The first needs the strong additional assumption that the solution is real analytic; the lemma itself remains unused, but the ideas behind its proof are used several times. The second lemma states the version used in Theorem 1; see Theorem 6.4.8 in [6] for a condition similar to Equation (16).
Lemma 4.
Let f be a real-valued function on a domain that includes [0, 1). The following two assertions are equivalent:
(1)
f(a_0) = a_0 for some a_0 ∈ (0, 1), f is real analytic, and
(1 - f(a)) ∏_{k=0}^∞ (1 + f(a^{2^k})) = 1,  a ∈ (0, 1);    (13)
(2)
f is the identity.
Proof. 
By Equation 0.266 in [15], the identity satisfies Equation (13). Assume now that Equation (13) holds. Then,
f(a^2) = f^2(a),    (14)
since
1 = (1 - f(a)) (1 + f(a)) ∏_{k=1}^∞ (1 + f(a^{2^k})) = (1 - f(a)) (1 + f(a)) / (1 - f(a^2)).
It follows that
f(a^{2^k}) = (f(a))^{2^k},  a ∈ (0, 1), k ∈ Z.    (15)
Since a^{2^k} → 0 as k → ∞, Lemma 3 yields that f is uniquely determined by the knowledge of a single value f(a_0) with a_0 ∈ (0, 1). □
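For f the identity, the product in Equation (13) telescopes: (1 - a)(1 + a)(1 + a^2)(1 + a^4)⋯ = lim_n (1 - a^{2^n}) = 1, which is the cited Equation 0.266 in [15]. A quick numerical check (the truncation level is arbitrary):

```python
# Telescoping check of (1 - a) * prod_{k >= 0} (1 + a^(2^k)) = 1
a = 0.7
prod = 1.0
term = a             # holds a^(2^k), starting with k = 0
for _ in range(60):  # a^(2^k) underflows to zero after a few squarings
    prod *= 1.0 + term
    term *= term     # (a^(2^k))^2 = a^(2^(k+1))
assert abs((1.0 - a) * prod - 1.0) < 1e-12
```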
Lemma 5.
Let f be a real-valued function on (0, 1). The following two assertions are equivalent:
(1)
Equation (13) holds, f is continuous, and for all a ∈ (0, 1) we have
f(a) + f(1 - a) = 1;    (16)
(2)
f is the identity.
Proof. 
Equation (16) yields f(1/2) = 1/2. Equations (15) and (16) imply that f((1 - a^{2^k})^{2^r}) = (1 - f^{2^k}(a))^{2^r} for all a ∈ (0, 1) and k, r ∈ Z. Let T(A) = {(1 - a^{2^k})^{2^r} : a ∈ A, r, k ∈ Z} for A ⊆ (0, 1), D_1 = T({1/2}), D_{n+1} := D_n ∪ T(D_n) and D = ∪_n D_n. Since f is continuous and determined on D, it suffices to show that D is dense in (0, 1). Assume that it is not, i.e., an interval (a, b) ⊆ (0, 1) exists with (a, b) ∩ D = ∅. Since 1/2 ∈ D, we may assume without loss of generality that (a, b) ⊆ (1/2, 1). Then,
(a^2, b^2) ∩ D = (1 - b^2, 1 - a^2) ∩ D = ∅.
Consider the map f̃ on the open intervals of (0, 1),
f̃((a, b)) = (a^2, b^2) if a ≥ 1/2 and b^2 > 1/2;  (1 - b^2, 1 - a^2) if a ≥ 1/2 and b^2 ≤ 1/2;  (a, b) if a < 1/2.
Let f̃^{(n)} be the n-fold iterative application of f̃, i.e.,
f̃^{(n)} = f̃ ∘ ⋯ ∘ f̃,
and (c, d) = f̃^{(n-1)}((a, b)). Per assumption, f̃ maps (a, b) ⊆ (1/2, 1) to either (a^2, b^2) or (1 - b^2, 1 - a^2). Per construction, if f̃^{(n)}((a, b)) = (1 - d^2, 1 - c^2), then f̃^{(n+1)}((a, b)) ≠ (1 - d^2, 1 - c^2). If eventually f̃^{(n)}((a, b)) = f̃^{(n+1)}((a, b)) = (c^2, d^2), then 1/2 ∈ (c^2, d^2) always holds. Let |(a, b)| be the length of an interval (a, b), i.e., b - a. If a ≥ 1/2, then
|f̃((a, b))| = b^2 - a^2 = (b + a)(b - a) > b - a.
Hence, f̃^{(n+1)}((a, b)) finally contains 1/2, or its length is strictly monotonously increasing. If the latter were true, the limit of the lengths would exist as |f̃^{(n)}((a, b))| ≤ 1/2; in the limit we would have b^2 - a^2 = b - a, a contradiction to b > a ≥ 1/2. □

2.2. Properties of the Operator ∗_k

We will use the operator ∗_k, defined in Equation (8), iteratively. For instance,
∗_2 ∗_k p = (p_2 p_k)(p_k p ⊎ p_{k̂}) ⊎ (p_k p ⊎ p_{k̂})_{2̂},  1 ≤ k ≤ n, n ≥ 2,
∗_2 ∗_3 {p_1, p_2, p_3} = ∗_2 {p_3 p_1, p_3 p_2, p_3 p_3, p_1, p_2}
= {p_3 p_2 p_3 p_1, p_3 p_2 p_3 p_2, p_3 p_2 p_3 p_3, p_3 p_2 p_1, p_3 p_2 p_2, p_3 p_1, p_3 p_3, p_1, p_2}
= {p_1 p_2 p_3^2, p_2^2 p_3^2, p_2 p_3^3, p_1 p_2 p_3, p_2^2 p_3, p_1 p_3, p_3^2, p_1, p_2}.
Note that the terms in general do not commute, e.g., (p_k p ⊎ p_{k̂})_{k̂} ≠ (p_{k̂} ⊎ p_k p)_{k̂} and ∗_i ∗_k p ≠ ∗_k ∗_i p. The proofs of the theorems in Section 1 rely on the properties of various probability distributions. We denote by b_a the Bernoulli distribution, i.e.,
b_a = {a, 1 - a},  a ∈ (0, 1),
and by g_a the geometric distribution, i.e.,
g_a = {1 - a, (1 - a)a, (1 - a)a^2, ...},  a ∈ (0, 1).
For n ∈ N, let
g_{n,a} = ((1 - a)/(1 - a^n)) {1, a, a^2, ..., a^{n-1}},  a ∈ [0, 1),
G_{n,a} = {a^n, 1 - a, (1 - a)a, ..., (1 - a)a^{n-1}},  a ∈ [0, 1],    (17)
U_{k,n} = {1 - k/n, 1/n, ..., 1/n}  (k entries equal to 1/n),  0 ≤ k < n.
The geometric distribution plays a dominant role in all of the subsequent proofs. Its importance is well known in information theory: it maximizes the Shannon entropy among discrete probability distributions with given mean [16, Theorem 5.8] and minimizes the min-entropy under fixed variance among all discrete log-concave random variables in Z [17, Theorem 1]. Denote by ≐ equality up to inserting zeros and reordering.
Lemma 6.
Let p ∈ P and ∗_1^∞ p := lim_{n→∞} ∗_1^n p. Then,
∗_1^∞ p ≐ g_{p_1} × p_{1̂}/(1 - p_1),  p_1 < 1,    (19)
{1} ≐ {1} × {1}.    (20)
In particular, for a ∈ (0, 1), n ∈ N and 0 < k < n, we have
∗_1^∞ b_a ≐ g_a,    (21)
∗_1^∞ g_a ≐ g_{1-a} × g_a,    (22)
∗_1^∞ G_{n,a} ≐ g_{a^n} × g_{n,a} ≐ g_a ≐ G_{∞,a},    (23)
∗_2^∞ G_{n,a} ≐ g_{1-a} × G_{n-1,a},    (24)
∗_1^∞ U_{k,n} ≐ g_{1-k/n} × U_k,    (25)
∗_2^∞ U_{k,n} ≐ g_{1/n} × U_{k-1,n-1},    (26)
∗_3^∞ ∗_2 b_a ≐ g_a × b_a.    (27)
Furthermore, for a ∈ (0, 1) and n ∈ N,
∗_1 G_{n,a} ≐ G_{2n,a},    (28)
(G_{n,a})_{2̂} = a G_{n-1,a},  (1 - a^n) g_{n,a} = (G_{n,a})_{1̂},  U_{n-1,n} = U_n.
Proof. We have
∗_1^n p = {p_1^{2^n}} ⊎ ⨄_{k ∈ {2^n - 1, ..., 0}} p_1^k p_{1̂},
since this is true for n = 1 and
∗_1^n p = ∗_1 (∗_1^{n-1} p) = ∗_1 ({p_1^{2^{n-1}}} ⊎ ⨄_{k ∈ {2^{n-1} - 1, ..., 0}} p_1^k p_{1̂})
= p_1^{2^{n-1}} ({p_1^{2^{n-1}}} ⊎ ⨄_{k ∈ {2^{n-1} - 1, ..., 0}} p_1^k p_{1̂}) ⊎ ⨄_{k ∈ {2^{n-1} - 1, ..., 0}} p_1^k p_{1̂}
= {p_1^{2^n}} ⊎ ⨄_{k ∈ {2^n - 1, ..., 2^{n-1}}} p_1^k p_{1̂} ⊎ ⨄_{k ∈ {2^{n-1} - 1, ..., 0}} p_1^k p_{1̂}
by induction. Hence,
∗_1^∞ p ≐ ⨄_{k ∈ {..., 1, 0}} p_1^k p_{1̂}.
Now, p_1^k p_{1̂} = ((1 - p_1) p_1^k) · p_{1̂}/(1 - p_1) for p_1 < 1, implying Equation (19). Further, we have
∗_2 b_a = (1 - a) b_a ⊎ {a},
so that Equation (27) holds by Equation (19). The other equalities are immediate. □
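Equation (19) can be illustrated numerically: after finitely many iterations of ∗_1, the resulting atoms coincide, up to ordering, with the atoms p_1^k p_i of g_{p_1} × p_{1̂}/(1 - p_1) (a sketch; helper names and truncation levels are ours):

```python
def star1(p):
    """*_1 p = (p_1 * p) followed by p without its first entry."""
    return [p[0] * x for x in p] + p[1:]

p = [0.4, 0.45, 0.15]
q = list(p)
for _ in range(14):   # approximate *_1^infinity p by iterating *_1
    q = star1(q)

# atoms of g_{p_1} x p_hat1/(1 - p_1): (1 - p_1) p_1^k * p_i/(1 - p_1) = p_1^k p_i, i >= 2
product_atoms = sorted(p[0] ** k * pi for k in range(2 ** 14) for pi in p[1:])
iterated_atoms = sorted(x for x in q if x > 0)
# up to the vanishing atom p_1^(2^14), both lists carry the same masses; compare the largest ones
for a, b in zip(iterated_atoms[-25:], product_atoms[-25:]):
    assert abs(a - b) < 1e-12
```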

2.3. Proof of Theorem 1

Assume that H is the Shannon entropy. Then,
H(g_a) = -Σ_{k=0}^∞ (1 - a) a^k log((1 - a) a^k) = -(a log a + (1 - a) log(1 - a))/(1 - a) ∈ (0, ∞)
for a ∈ (0, 1). All the other assertions in the second condition obviously hold when choosing f to be the identity. Now, assume that the second condition holds. By Cond. 2a, Cond. 2b holds for all p ∈ P. Cond. 2f and Equation (20) yield H({1}) = 0. It is easy to check that
H(∗_1^n p) = H(p) ∏_{k=0}^{n-1} (1 + f(p_1^{2^k})).
Thus,
H(∗_1^∞ p) = H(lim_{n→∞} ∗_1^n p) = lim_{n→∞} H(∗_1^n p) = H(p) F(p_1)
with
F(a) = ∏_{k=0}^∞ (1 + f(a^{2^k})),  a ∈ (0, 1).    (30)
The function F is finite on (0, 1) due to Cond. 2e, Cond. 2f, and Equations (22) and (30), i.e.,
H(g_a) F(1 - a) = H(g_{1-a}) + H(g_a).
It follows that
F(1 - a) - 1 = H(g_{1-a}) / H(g_a),  a ∈ (0, 1).
Replacing a by 1 - a in the equation above and multiplying the two identities yields
(F(1 - a) - 1)(F(a) - 1) = 1,  a ∈ (0, 1).    (31)
Now,
H(b_a) F(a) = H(g_a),    (32)
by Equations (21) and (30). Equations (27), (9) and (30) yield
H(b_a) (1 + f(1 - a)) F(a) = H(g_a) + H(b_a).    (33)
Since H(b_a) > 0 by Equation (32) and Cond. 2e, we immediately get that
f(1 - a) F(a) = 1,  a ∈ (0, 1).    (34)
Equation (31) now writes f(a) + f(1 - a) = 1. Plugging p = b_a into Equation (9) shows that f is continuous on (0, 1). Equation (34) and Lemma 5 yield that f(a) = a and F(a) = 1/(1 - a) for a ∈ (0, 1). Equation (9) implies that f(0) = 0, so that f is the identity on [0, 1). Lemma 2 implies that H(U_n) = c log n for some c > 0. Henceforth, we may assume that c = 1. Cond. 2f and Equations (19) and (30) deliver
H(p) F(p_1) = H(g_{p_1}) + H(p_{1̂}/(1 - p_1)),  p_1 < 1.    (35)
This implies
log n · F(1/n) = H(U_n) F(1/n) = H(g_{1/n}) + H(U_{n-1}) = H(g_{1/n}) + log(n - 1),  n ≥ 2,
so that, by Equation (32),
H(b_{1/n}) = log n - log(n - 1)/F(1/n) = -(1/n) log(1/n) - (1 - 1/n) log(1 - 1/n),  n ≥ 2.
Due to the recursion formula (35) and Equation (32), it suffices to show that the above formula extends to H(b_a) = -a log a - (1 - a) log(1 - a), a ∈ (0, 1), to finish the proof. By Equations (26) and (30), we get
H(U_{k,n}) F(1/n) = H(g_{1/n}) + H(U_{k-1,n-1}),
so that
H(U_{k,n}) = Σ_{i=0}^{k-1} H(g_{1/(n-i)}) / ∏_{j=n-i}^{n} F(1/j) = (1/n) Σ_{i=0}^{k-1} (n - i) H(b_{1/(n-i)})
by iteration and by Equation (32), respectively. On the other hand, by Equations (25) and (30),
H(U_{k,n}) F(1 - k/n) = H(g_{1-k/n}) + log(k),
so that
H(g_{1-k/n}) + log(k) = (1/k) Σ_{i=0}^{k-1} (n - i) (-(1/(n-i)) log(1/(n-i)) - (1 - 1/(n-i)) log(1 - 1/(n-i)))
= (1/k) Σ_{i=0}^{k-1} ((n - i) log(n - i) - (n - i - 1) log(n - i - 1)) = (1/k) (n log n - (n - k) log(n - k)),
and, by Equation (32),
H(b_{k/n}) = -(k/n) log(k/n) - (1 - k/n) log(1 - k/n),  n ≥ 2, k < n.
Cond. 2a finalizes the proof.

2.4. Proof of Theorem 2

Assume that H is the Rényi entropy. Let
S(p) = e^{(1-α) H(p)},
i.e., S(p) = Σ_{p_i ∈ p} p_i^α. Then,
S(∗_1 p) = p_1^α Σ_i p_i^α + Σ_i p_i^α - p_1^α,
so that f(r) = r^α satisfies Equation (10). Conversely, assume that the second condition holds. Since H ≥ 0 and S is strictly monotone in H, we can choose the sign of γ such that S is increasing in H and S ≥ 1. Furthermore, S is continuous. If S(p) = 1, then 1 = S(∗_1 p) = S(∗_1^2 p) = ⋯ = S(∗_1^∞ p). Hence, S(b_a) > 1 for a ∈ (0, 1), as otherwise, by Equation (21), we would get a contradiction to Cond. 2e. It follows that f(a) = (S(∗_1 b_a) - S(b_a))/(S(b_a) - 1) is continuous on (0, 1). Equation (10) applied to {0} ⊎ p implies f(0)(S({0} ⊎ p) - 1) = 0 for all p ∈ P, so that f(0) = 0. Cond. 2f and Equation (20) yield H({1}) = 0 and thus S({1}) = 1. Cond. 2f in Theorem 1 reads
S(p × q) = S(p) S(q).    (36)
By Cond. 2a, Cond. 2b holds for all p ∈ P. It is easy to check by induction that Equation (10) implies
S(∗_1^n p) = S(p) ∏_{k=0}^{n-1} (1 + f(p_1^{2^k})) - Σ_{ℓ=0}^{n-1} f(p_1^{2^ℓ}) ∏_{k=ℓ+1}^{n-1} (1 + f(p_1^{2^k})).    (37)
Henceforth, we assume that a ∈ (0, 1). Cond. 2e and Equation (22) yield
∞ > S(g_a) S(g_{1-a}) = lim_{n→∞} (S(g_{1-a}) ∏_{k=0}^{n-1} (1 + f(a^{2^k})) - Σ_{ℓ=0}^{n-1} f(a^{2^ℓ}) ∏_{k=ℓ+1}^{n-1} (1 + f(a^{2^k}))).    (38)
Since F(a) = ∏_{k=0}^∞ (1 + f(a^{2^k})) is assumed to be finite, the function G given by
G(a) = Σ_{ℓ=0}^∞ f(a^{2^ℓ}) F(a^{2^{ℓ+1}})
is finite by Equation (38).
is finite by Equation (38). Hence, by Equations (19), (36) and (37),
S ( p ) F ( p 1 ) G ( p 1 ) = S ( g p 1 ) S p 1 ^ 1 p 1 , p 1 < 1 .
Equation (39) implies that
S ( g a ) F ( 1 a ) G ( 1 a ) = S ( g 1 a ) S ( g a ) ,
so that G ( a ) = ( F ( a ) S ( g a ) ) S ( g 1 a ) and Equation (39) reads
S ( p ) F ( p 1 ) ( F ( p 1 ) S ( g p 1 ) ) S ( g 1 p 1 ) = S ( g p 1 ) S p 1 ^ 1 p 1 , p 1 < 1 .
In particular,
S ( b a ) F ( a ) ( F ( a ) S ( g a ) ) S ( g 1 a ) = S ( g a ) .
In analogy to Equation (33) we get from Equation () that
S ( b a ) ( 1 + f ( 1 a ) ) f ( 1 a ) F ( a ) ( F ( a ) S ( g a ) ) S ( g 1 a ) = S ( g a ) S ( b a ) ,
so that, by Equation (41),
S ( b a ) ( 1 + f ( 1 a ) ) f ( 1 a ) F ( a ) S ( b a ) F ( a ) + S ( g a ) = S ( g a ) S ( b a ) ,
i.e., as S ( b a ) > 1 ,
f ( 1 a ) F ( a ) = S ( g a ) .
Since S ( g a ) > 1 for all a ( 0 , 1 ) it follows that f ( a ) , F ( a ) 0 for all a ( 0 , 1 ) . Now, Equation (40) reads
S ( p ) ( 1 f ( 1 p 1 ) ) S ( g 1 p 1 ) = f ( 1 p 1 ) S p 1 ^ 1 p 1 .
Since
1 < S ( b a ) = ( 1 f ( a ) ) S ( g a ) + f ( a ) ,
the function f cannot take the value 1 on ( 0 , 1 ) . Due to the continuity of f we may summarize that
f ( 0 ) = 0 , f ( a ) ( 0 , 1 ) for a ( 0 , 1 ) .
Lemma 2 implies that S(U_n) = n^c for some c > 0. Then, Equation (43) gives
(1 - f(1 - 1/n)) S(g_{1-1/n}) = n^c - f(1 - 1/n)(n - 1)^c.    (44)
Furthermore, taking the idea from Equation (25),
S(U_{k,n}) = (1 - f(k/n)) S(g_{k/n}) + f(k/n) k^c    (45)
and, cf. Equation (26),
S(U_{k,n}) = (1 - f(1 - 1/n)) S(g_{1-1/n}) + f(1 - 1/n) S(U_{k-1,n-1})
= n^c - f(1 - 1/n) ((n - 1)^c - S(U_{k-1,n-1}))
= n^c - f(1 - 1/n) f(1 - 1/(n - 1)) ((n - 2)^c - S(U_{k-2,n-2}))
= n^c + (1 - (n - k)^c) ∏_{j=0}^{k-1} f(1 - 1/(n - j)),
so that, for 0 ≤ k < n, we have
(1 - f(k/n)) S(g_{k/n}) = n^c + (1 - (n - k)^c) ∏_{j=0}^{k-1} f(1 - 1/(n - j)) - f(k/n) k^c.
The continuity of S and Equations (43) and (42) imply that H is uniquely determined if f and c are unique. So, it remains to determine them. In line with the idea from Equation (24), Equations (43) and (24) yield
S(G_{n,a}) = (1 - f(a)) S(g_a) + f(a) S(G_{n-1,a}) = (1 - f(a)) S(g_a) Σ_{k=0}^{n-1} f^k(a) + f^n(a) = f^n(a) + (1 - f^n(a)) S(g_a).    (46)
By Equations (28) and (10) we have
S(G_{2n,a}) = (1 + f(a^n)) S(G_{n,a}) - f(a^n).
Equation (46) yields
f^{2n}(a) + (1 - f^{2n}(a)) S(g_a) = (1 + f(a^n)) (f^n(a) + (1 - f^n(a)) S(g_a)) - f(a^n),
i.e.,
(f(a^n) - f^n(a)) (1 - f^n(a)) (S(g_a) - 1) = 0.
Since S(g_a) > 1 and f < 1 for a ∈ (0, 1), we get for n = 2 that
f^2(a) = f(a^2),  a ∈ (0, 1),
or, f(a^{2^k}) = f^{2^k}(a), k ∈ Z. Since (a^{2^{-k}})_{k ∈ N} converges to 1 and f is assumed to be smooth, Lemma 3 and the proof of Lemma 4 yield the uniqueness of f, and f(a) = a^α for a ∈ [0, 1] and some α > 0. Letting n = 2k, Equations (44) and (45) yield
S(b_{1/2}) = (2k)^c + (1 - k^c) ∏_{j=0}^{k-1} f(1 - 1/(2k - j)) - f(1/2) k^c + f(1/2) = (2k)^c + (1 - k^c) 2^{-α} - 2^{-α} k^c + 2^{-α}.
Hence,
2^α S(b_{1/2}) = (2^{α+c} - 2) k^c + 2.
Since the left-hand side is independent of k, the equation c = 1 - α must hold.

2.5. Proof of Theorem 3

Assume H is the min-entropy and let f be the identity. Then, Equations (11) and (12) obviously hold. Conversely, let the second condition of the theorem hold. Cond. 2f and Equation (20) imply H({1}) = 0 and thus S({1}) = 1. Let
∗̂_i^n p = (∗_i^n p)_{î} / (1 - p_i^{2^n}),
∗̂_i^0 p = p_{î}/(1 - p_i) and S = e^{-H}. Equation (12) yields
S(∗̂_1^0 p) = S(∗̂_1^n p) ∏_{k=0}^{n-1} (1 + f(p_1^{2^k})) = S(g_{2^n, p_1}) S(∗̂_1^0 p) ∏_{k=0}^{n-1} (1 + f(p_1^{2^k})),  p_1 < 1,    (47)
by iteration. Equation (12) also yields f(0) = 0, and f has to be monotonously increasing. Assume f(a) = 0 for some a ∈ (0, 1]. Then, f(p_1) = 0 for all 0 ≤ p_1 ≤ a and, by Equation (11),
S(p) = S(∗_1 p) = S(∗_1^2 p) = ⋯ = S(∗_1^∞ p) = S(g_{p_1}) S(∗̂_1^0 p),  p_1 ≤ a, p_1 < 1.
Letting p = g_{1-a}, we get S(g_a) = 1, which contradicts Cond. 2e. Now, Equation (47) yields
S(g_{2^n, a}) ∏_{k=0}^{n-1} (1 + f(a^{2^k})) = 1,
so that
S(g_a) = 1 / ∏_{k=0}^∞ (1 + f(a^{2^k})) ∈ (0, 1),  a ∈ (0, 1).    (48)
Henceforth, we assume that a ∈ (0, 1). Since ∗̂_1 b_a ≐ b_{1/(1+a)}, we have, by Equation (12),
(1 + f(a)) S(b_{1/(1+a)}) = 1,    (49)
and
S(b_a) = (1 + f((1 - a)/a))^{-1} if a ≥ 1/2,  and  S(b_a) = (1 + f(a/(1 - a)))^{-1} if a < 1/2.
Equation (11) yields, by iteration,
max{Φ(p_1), S(∗_1^∞ p)} = S(p),
where Φ(a) := max_{k ∈ N_0} f(a^{2^k}); the maximum exists since f is monotonously increasing, and thus Φ = f. Hence, by Equation (19),
S(p) = max{S(∗_1^∞ p), f(p_1)} = max{S(g_{p_1}) S(∗̂_1^0 p), f(p_1)},    (50)
in particular,
S(g_{1-a}) = max{S(g_a) S(g_{1-a}), f(a)}.    (51)
Since S(g_a) < 1, it follows from Equation (51) that
S(g_a) = f(1 - a).    (52)
Thus, the recursion formula (50) determines H uniquely as soon as f is unique. Equations (52) and (48) yield
f(1 - a) ∏_{k=0}^∞ (1 + f(a^{2^k})) = 1,
so that, in analogy to Equation (14),
(1 + f(a)) f(1 - a) = (1 + f(a)) f(1 - a) f(1 - a^2) ∏_{k=1}^∞ (1 + f(a^{2^k})) = f(1 - a^2).    (53)
Now, Equations (50), (23) and (52) yield
S(G_{n,a}) = max{S(∗_1^∞ G_{n,a}), f(a^n)} = max{S(g_a), f(a^n)} = max{f(1 - a), f(a^n)}.
On the other hand, by Equation (24) and iterating Equation (50), it follows that
S(G_{n,a}) = max{f(1 - a), S(g_{1-a}) S(G_{n-1,a})} = max{f(1 - a), S^n(g_{1-a})} = max{f(1 - a), f^n(a)}.
Hence,
f(a^n) = f^n(a)  if 1 - a < a^n.    (54)
In particular,
f(a^2) = f^2(a)  for a > (√5 - 1)/2,
and lim_{a→1} f(a) is necessarily 1. We have
S(b_a) = max{S(∗_2 b_a), f(1 - a)} = max{S(g_a) S(b_a), f(a), f(1 - a)} = max{f(a), f(1 - a)}.
Equations (49) and (54) and the monotonicity of f entail
f(a) (1 + f(1/a - 1)) = 1,  a ≥ 1/2.
In particular, f(1/2) = 1/2. Equation (53) delivers f(3/4) = 3/4 > (√5 - 1)/2. Hence, for a = 3/4, we have f(a^{2^{-k}}) = a^{2^{-k}} and a^{2^{-k}} → 1 as k → ∞. Now, Lemma 3 yields the uniqueness of f.

3. Discussion & Conclusions

The characterization given here is not only novel with respect to splitting the chain rule into its natural components; it is also novel in the sense that only a rather weak relationship is required for the chain rule, i.e., the proportionality factor itself is not given but arises naturally. Furthermore, the recursion formula has the advantage that it allows an interpretation as an intrinsic property, since only a single distribution is involved and since recursion formulae in general may allow additional practical interpretations such as self-similarity or scale invariance.
In comparison to Faddeev's approach, our proofs emphasize the role of the geometric distribution g_a, which arises as the infinite iteration of the Bernoulli distribution b_a, a ∈ (0, 1), i.e., ∗_1^∞ b_a ≐ g_a. The uniqueness proof for the Shannon entropy uses almost only the resulting proportionality relation of the geometric distribution, simplifying Faddeev's approach. Further, while Equation (2) is a rule, proportionality can be considered a law. This yields a deeper understanding of the Shannon entropy and may justify a broader application outside telecommunication.
The proofs rely strongly on distributions that generalize the uniform distribution and the geometric distribution, respectively. Possibly, these distributions play or will play a role also in other areas.
Although the characterization and the proofs of the Shannon and the Rényi entropy follow the same scheme, they show remarkable differences. For instance, the generalization G_{n,a} of the geometric distribution, see Equation (17), does not play a role in case of the Shannon entropy. Also, the functional characterization in Equation (9) of the Shannon entropy is not the limit of the characterization in Equation (10) of the Rényi entropy. However, Equation (9) might be understood as the derivative of Equation (10) in the following sense, cf. [18]. Assume that f in Equation (10) is differentiable. Then,
(d/dα) S(∗_1 p) = (d/dα) ((1 + f(p_1)) S(p) - f(p_1)).
For α = 1, we define S(p) = lim_{α→1} Σ_{p_i ∈ p} p_i^α. Then, we get
(d/dα) S(∗_1 p) = (1 + f(p_1)) (d/dα) S(p),
since the term f′(p_1)(S(p) - 1) vanishes at α = 1.
Another interpretation of our results is that S is not a natural transformation of the Rényi entropy, in contrast to
T(p) = S(p) - S({1}).
This definition bridges to the generalized information function [5, Section 6.2] and to the positive and negative definite functions, cf. [19], Chapter 3, Corollary 3.3 and proof of Theorem 2.2, and Chapter 6, Example 5.16. Indeed, Equation (10) can be rewritten in full analogy to Equation (9), namely as
T(∗_1 p) = (1 + f(p_1)) T(p),
where T(∗_1 p) = S(∗_1 p) - S({1}). Note that also the interpretation T(∗_1 p) = S(∗_1 p) - S(∗_1 {1}) is possible.
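That T obeys the same clean proportionality as the Shannon entropy in Equation (9) can be checked numerically for the Rényi case (a sketch; names ours):

```python
def star1(p):
    """*_1 p = (p_1 * p) followed by p without its first entry."""
    return [p[0] * x for x in p] + p[1:]

alpha = 2.0
S = lambda r: sum(x ** alpha for x in r)  # S = e^{(1-alpha)H} for the Renyi entropy
T = lambda r: S(r) - 1.0                  # T(p) = S(p) - S({1}), and S({1}) = 1

p = [0.5, 0.3, 0.2]
f_p1 = p[0] ** alpha                      # f(a) = a^alpha
# T(*_1 p) = (1 + f(p_1)) T(p)
assert abs(T(star1(p)) - (1.0 + f_p1) * T(p)) < 1e-12
```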
Several technical assumptions had to be included in the characterization of the Rényi entropy, such as the condition that f be real analytic. Although we consider this restriction minor, we hope that these assumptions can be dropped in the future.

Author Contributions

Conceptualization and methodology, M.S.; investigation, validation and resources, C.D.; writing—original draft, M.S.; writing—review and editing, C.D.; all authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors thank Christopher Dörr for many hints and comments. They also thank Thomas Reichelt for valuable hints.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Khinchin, A. The concept of entropy in the theory of probability. Uspekhi Matem. Nauk 1953, 8, 3–20.
  2. Faddeev, D. On the concept of entropy of a finite probabilistic scheme. Uspekhi Matem. Nauk 1956, 11, 227–231.
  3. Rényi, A. On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability: Contributions to the Theory of Statistics; University of California Press, 1961; pp. 547–562.
  4. Leinster, T. Entropy and Diversity: The Axiomatic Approach; Cambridge University Press, 2021.
  5. Aczél, J.; Daróczy, Z. On Measures of Information and Their Characterizations; Academic Press, 1975.
  6. Ebanks, B.; Sahoo, P.K.; Sander, W. Characterizations of Information Measures; World Scientific, 1998.
  7. Carcassi, G.; Aidala, C.; Barbour, J. Variability as a better characterization of Shannon entropy. Europ. J. Phys. 2021, 42, 045102.
  8. Baez, J.; Fritz, T.; Leinster, T. A characterization of entropy in terms of information loss. Entropy 2011, 13, 1945–1957.
  9. Onicescu, O. Energie informationnelle. C. R. Acad. Sci. Paris 1966, 263, 841–842.
  10. Pardo, L. Order-α weighted information energy. Information Sci. 1986, 40, 155–164.
  11. Schlather, M. An algebraic generalization of the entropy and its application to statistics. arXiv:2404.05854, 2024.
  12. Krantz, S. Function Theory of Several Complex Variables, 2nd ed.; AMS: Pacific Grove, Calif., 1992.
  13. Erdös, P. On the distribution function of additive functions. Ann. Math. 1946, 47, 1–20.
  14. Krantz, S.; Parks, H. A Primer of Real Analytic Functions; Birkhäuser Boston, 2002.
  15. Gradshteyn, I.; Ryzhik, I. Table of Integrals, Series, and Products, 6th ed.; Academic Press: London, 2000.
  16. Conrad, K. Probability Distributions and Maximum Entropy. Technical report, University of Connecticut. URL https://kconrad.math.uconn.edu/blurbs/analysis/entropypost.pdf. Accessed 14 Nov. 2024.
  17. Jakimiuk, J.; Murawski, D.; Nayar, P.; Słobodianiuk, S. Log-concavity and discrete degrees of freedom. Discr. Math. 2024, 347, 114020.
  18. Życzkowski, K. Rényi extrapolation of Shannon entropy. Open Sys. & Inform. Dyn. 2003, 10, 297–310.
  19. Berg, C.; Christensen, J.P.R.; Ressel, P. Harmonic Analysis on Semigroups: Theory of Positive Definite and Related Functions; Springer: New York, 1984.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.