On the Estimation of Discrete Distributions

Abstract
The aim of this work is to base the estimation of probabilities associated with events on Bayes’ theorem, starting from Jeffreys’ non-informative prior.

1. Introduction

Let us consider a finite set of $n$ situations that may occur according to a distribution $(p_k)_{k=0}^{n-1}$. The probability of each situation is generally determined by repeating the experiment independently and using Laplace's operational definition to obtain the estimators
$$\hat{p}_k = \frac{\text{number of occurrences of situation } k}{\text{number of repetitions of the experiment}}$$
In addition, the central limit theorem provides an estimate of the difference between the true value and the estimate [1].
This empirical method generally works well, but it raises a few problems. On the one hand, the estimator and its uncertainty depend on the number of repetitions and give good results only when this number is sufficiently large, which is a rather vague requirement. On the other hand, the definition is difficult to generalize when the situation itself is not observed directly: in practice, a situation is often identified with a measuring instrument that produces a noisy result.
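As a minimal illustration of this operational definition (not part of the original text, and with names of our own choosing), the following C++ sketch computes the empirical frequencies and the usual CLT standard error $\sqrt{\hat{p}_k (1-\hat{p}_k)/N}$ from a record of observed situations.

#include <cmath>
#include <cstdio>
#include <vector>

// Empirical (Laplace) estimate of each p_k and its CLT standard error,
// from a record of observed situations in {0, ..., n-1}.
int main() {
    const int n = 3;                                  // number of situations
    std::vector<int> obs = {0, 2, 1, 0, 0, 2, 1, 0, 2, 2};
    std::vector<int> count(n, 0);
    for (int k : obs) count[k]++;
    const double N = static_cast<double>(obs.size());
    for (int k = 0; k < n; ++k) {
        double p = count[k] / N;                      // relative frequency
        double u = std::sqrt(p * (1.0 - p) / N);      // CLT standard error
        std::printf("p_%d = %.3f +/- %.3f\n", k, p, u);
    }
    return 0;
}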
The aim of this work is to base the estimation of probabilities associated with events on Bayes’ theorem, starting from Jeffreys’ non-informative priors.

2. Ingredients

2.1. Elements of Bayesian Theory of Statistical Decision-Making

This theory first distinguishes two spaces: the space of states $\theta$ and the space of observations $x$ [3]. Each observation is assumed to be made in the presence of an unknown state. The measurement model is the stochastic relation linking the state to the result of the measurement: for a given state, we specify a transition probability¹ $p(x \mid \theta)$, which gives the distribution of the observations $x$ given $\theta$. Given a prior distribution $p(\theta)$ over the states and an observation $x$, Bayes' theorem gives the posterior distribution over the states
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int d\theta\; p(x \mid \theta)\, p(\theta)}$$
When the state space is a differentiable manifold and the measurement model is twice differentiable with respect to $\theta$, we define a prior distribution called Jeffreys' prior as follows [2]. First, we form the Fisher information matrix
$$I_{ij}(\theta) = -\int dx\; p(x \mid \theta)\,\frac{\partial^2 \ln p(x \mid \theta)}{\partial \theta_i\, \partial \theta_j}$$
Jeffreys' prior is then defined as
$$p_J(\theta)\, d\theta = \sqrt{\det I(\theta)}\; d\theta$$
A fundamental property of this distribution is its invariance. Indeed, it can be shown that p J ( θ ) d θ does not depend on the choice of coordinate system.
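As a standard illustration (not worked out in the text), consider a single Bernoulli trial with parameter $\theta$, i.e. $p(x \mid \theta) = \theta^x (1-\theta)^{1-x}$ for $x \in \{0, 1\}$. Then
$$I(\theta) = \frac{1}{\theta(1-\theta)}, \qquad p_J(\theta)\, d\theta \propto \frac{d\theta}{\sqrt{\theta(1-\theta)}},$$
which is the Beta(1/2, 1/2) distribution; the change of variable $\theta = \sin^2\varphi$ makes it uniform in $\varphi$, prefiguring the hypersphere parameterization used below.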

2.2. Hyperspheres

The estimators constructed in this work make use of unit hyperspheres. Let us recall some of their properties [4].
In the Euclidean space $\mathbb{R}^n$, the unit hypersphere of dimension $n-1$ is the differentiable manifold
$$S^{n-1} = \Bigl\{\, \psi \in \mathbb{R}^n \;\Big|\; \sum_k \psi_k^2 = 1 \,\Bigr\}$$
The use of the letter $\psi$ will be justified below. The usual spherical coordinates are given by
$$\begin{aligned}
\psi_0 &= \sin\theta_{n-2}\,\sin\theta_{n-3}\cdots\sin\theta_1\,\sin\theta_0\\
\psi_1 &= \sin\theta_{n-2}\,\sin\theta_{n-3}\cdots\sin\theta_1\,\cos\theta_0\\
\psi_2 &= \sin\theta_{n-2}\,\sin\theta_{n-3}\cdots\cos\theta_1\\
&\;\;\vdots\\
\psi_{n-1} &= \cos\theta_{n-2}
\end{aligned}$$
and the normalized measure invariant under $SO(n)$ is given by
$$d\mu(\psi) = \frac{1}{A_{n-1}}\,\sin^{n-2}\theta_{n-2}\,\sin^{n-3}\theta_{n-3}\cdots\sin\theta_1\; d\theta_0\, d\theta_1 \cdots d\theta_{n-2}$$
where
$$A_{n-1} = \frac{2\pi^{n/2}}{\Gamma(n/2)}, \qquad \Gamma(1/2) = \sqrt{\pi}$$
Let $\mathbf{m} = (m_0, \dots, m_{n-1})$ be a sequence of non-negative integers. The even moments relative to this measure are given by
$$H(\mathbf{m}) = \int \psi_0^{2m_0} \cdots \psi_{n-1}^{2m_{n-1}}\; d\mu(\psi)$$
We can show that the function $H$ above is a generalization of the beta function:
$$H(\mathbf{m}) = \frac{\Gamma(n/2)}{\pi^{n/2}}\;\frac{\prod_k \Gamma(m_k + 1/2)}{\Gamma\!\bigl(\sum_k (m_k + 1/2)\bigr)}$$
The normalization factor is chosen so that $H(0) = 1$.
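For numerical work it is convenient to evaluate $H$ in log space to avoid overflow of the Gamma functions. The following C++ sketch (our own illustration; the function name is arbitrary) implements the closed form above with std::lgamma.

#include <cmath>
#include <vector>

// H(m) = Gamma(n/2) / pi^(n/2) * prod_k Gamma(m_k + 1/2) / Gamma(sum_k (m_k + 1/2)),
// evaluated in log space for numerical stability.
double H(const std::vector<int>& m) {
    const double PI = 3.14159265358979323846;
    const int n = static_cast<int>(m.size());
    double sum = 0.0;     // sum_k (m_k + 1/2)
    double logNum = 0.0;  // sum_k ln Gamma(m_k + 1/2)
    for (int mk : m) {
        sum += mk + 0.5;
        logNum += std::lgamma(mk + 0.5);
    }
    double logH = std::lgamma(0.5 * n) - 0.5 * n * std::log(PI)
                + logNum - std::lgamma(sum);
    return std::exp(logH);
}

For example, H({0, 0, 0}) returns 1, and H({1, 0, 0}) returns 1/3, the uniform prior probability of observing one of three situations.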

3. Probability Estimators for Known Situations

The problem is to estimate a distribution $p_k$. This distribution can be viewed as a point of the $(n-1)$-dimensional simplex
$$\Delta^{n-1} = \Bigl\{\, \mathbf{p} \;\Big|\; 0 \le p_k \le 1,\; \sum_k p_k = 1 \,\Bigr\}$$
This simplex will be the state space for a measurement model whose observations are the situations. In a canonical way, the probability of obtaining situation $k$ given that the distribution is $\mathbf{p}$ is obviously
$$p(k \mid \mathbf{p}) = p_k$$
This canonical measurement model is twice differentiable with respect to $\mathbf{p}$. We can therefore calculate the associated Jeffreys prior $p_J(\mathbf{p})$. It was stated above that this distribution is independent of the parameterization of the simplex. Let us choose the new coordinates
$$\psi_k = \sqrt{p_k}$$
Since $\sum_k \psi_k^2 = \sum_k p_k = 1$, we obtain a measurement model whose state space is the positive part of the hypersphere $S^{n-1}$:
$$p(k \mid \psi) = \psi_k^2$$
Now, it turns out that Jeffreys' prior for this model is precisely the invariant measure $\mu$ introduced above:
$$dp_J(\psi) = d\mu(\psi)$$
This result allows us to apply Bayes' theorem to a sequence of independent observations. Let
$$\mathbf{i} = (i_0, \dots, i_{t-1})$$
be such a sequence of length $t$; the posterior on the sphere $S^{n-1}$ given $\mathbf{i}$ is then
$$dp(\psi \mid \mathbf{i}) = \frac{\psi_{i_0}^2 \cdots \psi_{i_{t-1}}^2\; d\mu(\psi)}{\int \psi_{i_0}^2 \cdots \psi_{i_{t-1}}^2\; d\mu(\psi)}$$
Introducing the function $m$ giving the multiplicity of an index $k$ in the sequence $\mathbf{i}$,
$$m(\mathbf{i})_k = \operatorname{Card}\{\, s \mid i_s = k \,\}$$
we obtain (omitting the sequence $\mathbf{i}$ from the notation for brevity)
$$dp(\psi \mid \mathbf{i}) = \frac{\prod_k \psi_k^{2 m_k}\; d\mu(\psi)}{H(m(\mathbf{i}))}$$
The probability of occurrence of a situation $k$ can be estimated by the posterior mean,
$$\hat{p}_k = E(p_k) = \frac{\int \psi_k^2\, \psi_{i_0}^2 \cdots \psi_{i_{t-1}}^2\; d\mu(\psi)}{H(m(\mathbf{i}))}$$
The associated uncertainties will be given by the covariance matrix, built from the second moments
$$\widehat{p_k p_l} = E(p_k p_l) = \frac{\int \psi_k^2\, \psi_l^2\, \psi_{i_0}^2 \cdots \psi_{i_{t-1}}^2\; d\mu(\psi)}{H(m(\mathbf{i}))}$$
Let us define $e_k$ as the sequence of integers that is zero everywhere except at the $k$-th position, where it equals 1. It is easy to see that
$$\hat{p}_k = \frac{H(m(\mathbf{i}) + e_k)}{H(m(\mathbf{i}))}$$
$$\widehat{p_k p_l} = \frac{H(m(\mathbf{i}) + e_k + e_l)}{H(m(\mathbf{i}))}$$
The fact that $\sum_h m_h = t$, together with the properties of the Gamma function [5], immediately gives
$$\hat{p}_k = \frac{m_k + 1/2}{t + n/2}$$
$$\widehat{p_k^2} = \frac{(m_k + 1/2)(m_k + 3/2)}{(t + n/2)(t + n/2 + 1)}$$
$$\widehat{p_k p_l} = \frac{(m_k + 1/2)(m_l + 1/2)}{(t + n/2)(t + n/2 + 1)} \qquad (k \ne l)$$
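The closed forms above are immediate to evaluate from the multiplicities. The C++ sketch below (our own illustration, with hypothetical names) returns the posterior means $E(p_k)$ and second moments $E(p_k p_l)$ for a given vector of counts.

#include <vector>

// Posterior moments from the multiplicities m_k of the observed sequence,
// with t = sum_k m_k and n = number of situations:
//   E(p_k)     = (m_k + 1/2) / (t + n/2)
//   E(p_k p_l) = (m_k + 1/2)(m_l + 1/2 + [k == l]) / ((t + n/2)(t + n/2 + 1))
struct Moments {
    std::vector<double> mean;                 // E(p_k)
    std::vector<std::vector<double>> second;  // E(p_k p_l)
};

Moments posteriorMoments(const std::vector<int>& m) {
    const int n = static_cast<int>(m.size());
    double t = 0.0;
    for (int mk : m) t += mk;
    const double d1 = t + 0.5 * n;
    const double d2 = d1 + 1.0;
    Moments mo;
    mo.mean.resize(n);
    mo.second.assign(n, std::vector<double>(n, 0.0));
    for (int k = 0; k < n; ++k) {
        mo.mean[k] = (m[k] + 0.5) / d1;
        for (int l = 0; l < n; ++l) {
            double diag = (k == l) ? 1.0 : 0.0;   // gives (m_k + 3/2) on the diagonal
            mo.second[k][l] = (m[k] + 0.5) * (m[l] + 0.5 + diag) / (d1 * d2);
        }
    }
    return mo;
}

The covariance matrix of the distribution is then obtained as $E(p_k p_l) - E(p_k)\,E(p_l)$.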

4. Self-Consistent Random Sequences

Let $\mathbf{i}$ be a sequence of length $t-1$. Equation (15) therefore gives the Bayesian estimates of the $p_k$ if we have observed the sequence $\mathbf{i}$, which we will also denote by
$$\hat{p}_k = p_J(k \mid i_{t-2}, i_{t-3}, \dots, i_0)$$
At step $t$, let us randomly draw a new situation $i_{t-1}$ according to these $\hat{p}_k$, and form a new sequence by concatenation: $(\mathbf{i}, i_{t-1})$. Nothing prevents us from repeating the process on this new sequence.
We say that a sequence $\mathbf{i}$ is constructed in a self-consistent manner if
  • $i_0$ is drawn uniformly among the $n$ situations;
  • at each subsequent step, $i_{s-1}$ is drawn according to the probability $p_J(k \mid i_{s-2}, \dots, i_0)$.
We note that the probability of obtaining a sequence $\mathbf{i} = (i_0, \dots, i_{t-1})$ in this way is
$$p_{ac}(\mathbf{i}) = p_J(i_{t-1} \mid i_{t-2}, \dots, i_0)\; p_J(i_{t-2} \mid i_{t-3}, \dots, i_0) \cdots p_J(i_1 \mid i_0)\; p_J(i_0)$$
Now, it turns out that this probability is given by the function $H \circ m$:
$$p_{ac}(\mathbf{i}) = H(m(\mathbf{i}))$$
Since no additional assumption is made, we will call the probability $p_{ac}$ the (non-informative) self-consistent prior on sequences $\mathbf{i}$ of given length $t$. Sampling according to this prior will prove useful later for estimation in uncertain situations.
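Simulating a self-consistent sequence is straightforward, since by the results of the previous section the conditional probabilities reduce to $p_J(k \mid i_{s-1}, \dots, i_0) = (m_k + 1/2)/(s + n/2)$, where $m_k$ counts the occurrences of $k$ among the first $s$ draws (the uniform draw of $i_0$ being the case $s = 0$). The C++ sketch below is our own illustration of this rule.

#include <random>
#include <vector>

// Draw a self-consistent sequence of length t over n situations:
// each new situation is drawn with probability
// p_J(k | prefix) = (m_k + 1/2) / (s + n/2),
// where m_k counts the occurrences of k in the prefix of length s.
std::vector<int> selfConsistentSequence(int n, int t, std::mt19937& gen) {
    std::vector<int> seq;
    std::vector<double> m(n, 0.0);
    for (int s = 0; s < t; ++s) {
        std::vector<double> w(n);
        for (int k = 0; k < n; ++k) w[k] = (m[k] + 0.5) / (s + 0.5 * n);
        std::discrete_distribution<int> draw(w.begin(), w.end());
        int k = draw(gen);
        seq.push_back(k);
        m[k] += 1.0;
    }
    return seq;
}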

Note

This way of generating a random sequence of numbers provides a possible answer to the question: Given a sequence, what is its probability of occurrence? It can be easily seen that the sequences with maximum self-consistent probability are also those with minimum entropy.
Incidentally, for sequences of given length $t$, we have
$$\sum_{\mathbf{i}} H(m(\mathbf{i})) = 1$$
From relations (18) and (19), we also have
$$H(m(\mathbf{i}) + e_k) = p_J(k \mid \mathbf{i})\; H(m(\mathbf{i}))$$
$$H(m(\mathbf{i}) + e_k + e_l) = p_J(k \mid \mathbf{i})\; p_J(l \mid (\mathbf{i}, k))\; H(m(\mathbf{i}))$$

5. Generalization to Fuzzy Situations

In reality, situations are often measured by noisy instruments. Instead of knowing the situation at step $s$, the observer only has a measurement result $x$ from a measurement model $p(x \mid k)$. For a given $x$, the vector $k \mapsto p(x \mid k)$, indexed by the situations, is often called the likelihood function. For a sequence of steps $s = 0, \dots, t-1$, the observations² $(x_0, \dots, x_{t-1})$ give a likelihood matrix whose row index is the time and whose column index is the situation:
$$\ell_s(k) = p(x_s \mid k)$$
Let us return to the situation on the hypersphere. Suppose that at time $s$ the prior probability on $S^{n-1}$ is $d\eta_s(\psi)$. The appearance of a result $x_s$ for the measurement model $p(x \mid k)$, combined with the canonical model $p(k \mid \psi) = \psi_k^2$, gives, by Bayes' theorem, a posterior that will be taken as the prior at the next step:
$$d\eta_{s+1}(\psi) = \frac{\sum_k \ell_s(k)\, \psi_k^2\; d\eta_s(\psi)}{\int \sum_k \ell_s(k)\, \psi_k^2\; d\eta_s(\psi)}$$
We can therefore write
$$d\eta_t(\psi) = \frac{\sum_{i_0} \cdots \sum_{i_{t-1}} \ell_0(i_0) \cdots \ell_{t-1}(i_{t-1})\, \psi_{i_0}^2 \cdots \psi_{i_{t-1}}^2\; d\mu(\psi)}{\int \sum_{i_0} \cdots \sum_{i_{t-1}} \ell_0(i_0) \cdots \ell_{t-1}(i_{t-1})\, \psi_{i_0}^2 \cdots \psi_{i_{t-1}}^2\; d\mu(\psi)}$$
Let
$$\ell(\mathbf{i}) = \prod_s \ell_s(i_s)$$
and use the definition of the function $H$ to obtain the estimators at $t-1$:
$$E(p_k) = \frac{\sum_{\mathbf{i}} \ell(\mathbf{i})\, H(m(\mathbf{i}) + e_k)}{\sum_{\mathbf{i}} \ell(\mathbf{i})\, H(m(\mathbf{i}))}$$
$$E(p_k p_l) = \frac{\sum_{\mathbf{i}} \ell(\mathbf{i})\, H(m(\mathbf{i}) + e_k + e_l)}{\sum_{\mathbf{i}} \ell(\mathbf{i})\, H(m(\mathbf{i}))}$$
which can also be written as
$$E(p_k) = \frac{\sum_{\mathbf{i}} p_J(k \mid \mathbf{i})\, \ell(\mathbf{i})\, H(m(\mathbf{i}))}{\sum_{\mathbf{i}} \ell(\mathbf{i})\, H(m(\mathbf{i}))}$$
$$E(p_k p_l) = \frac{\sum_{\mathbf{i}} p_J(k \mid \mathbf{i})\, p_J(l \mid (\mathbf{i}, k))\, \ell(\mathbf{i})\, H(m(\mathbf{i}))}{\sum_{\mathbf{i}} \ell(\mathbf{i})\, H(m(\mathbf{i}))}$$
The preceding equations suggest a new use of Bayes' theorem. Indeed, given the self-consistent prior $p_{ac}$ on sequences of length $t$ and the product measurement model giving the likelihood function $\ell$, we can form the posterior on the sequences
$$p_{ac}(\mathbf{i} \mid \ell) = \frac{\ell(\mathbf{i})\, p_{ac}(\mathbf{i})}{\sum_{\mathbf{j}} \ell(\mathbf{j})\, p_{ac}(\mathbf{j})}$$
and we obtain
$$E(p_k) = \sum_{\mathbf{i}} p_J(k \mid \mathbf{i})\, p_{ac}(\mathbf{i} \mid \ell)$$
$$E(p_k p_l) = \sum_{\mathbf{i}} p_J(k \mid \mathbf{i})\, p_J(l \mid (\mathbf{i}, k))\, p_{ac}(\mathbf{i} \mid \ell)$$
Equations (30) and (31) are sums that can be evaluated by Monte Carlo (MC). The posterior $p_{ac}(\mathbf{i} \mid \ell)$ can be sampled as follows:
  • draw $i_0$ according to the weights $\ell_0(k)$;
  • draw $i_1$ according to the weights $\ell_1(k)\, p_J(k \mid i_0)$;
  • …
  • draw $i_{t-1}$ according to the weights $\ell_{t-1}(k)\, p_J(k \mid i_{t-2}, \dots, i_0)$.
A sample of $r$ sequences $\mathbf{i}^{(s)}$, $s = 1, \dots, r$, of this type gives the MC estimators
$$\widehat{E(p_k)} = \frac{1}{r} \sum_s p_J(k \mid \mathbf{i}^{(s)})$$
$$\widehat{E(p_k p_l)} = \frac{1}{r} \sum_s p_J(k \mid \mathbf{i}^{(s)})\, p_J(l \mid (\mathbf{i}^{(s)}, k))$$
as well as the MC uncertainties
$$u_{\widehat{E(p_k)}} = \sqrt{\frac{1}{r-1} \left( \frac{1}{r} \sum_s p_J(k \mid \mathbf{i}^{(s)})^2 - \widehat{E(p_k)}^{\,2} \right)}$$
$$u_{\widehat{E(p_k p_l)}} = \sqrt{\frac{1}{r-1} \left( \frac{1}{r} \sum_s p_J(k \mid \mathbf{i}^{(s)})^2\, p_J(l \mid (\mathbf{i}^{(s)}, k))^2 - \widehat{E(p_k p_l)}^{\,2} \right)}$$
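The note below mentions a C++ simulation; the following sketch is our own minimal rendering of the sampling scheme just described (all names are ours). Given the likelihood matrix ell[s][k], it draws r sequences with the weights listed above and returns the MC estimators of E(p_k) together with their uncertainties.

#include <cmath>
#include <random>
#include <vector>

// Monte Carlo estimation of E(p_k) for fuzzy observations.
// ell[s][k] = likelihood of situation k at time s; r = number of sampled sequences.
// Each sequence is drawn step by step with weights ell_s(k) * p_J(k | prefix),
// where p_J(k | prefix) = (m_k + 1/2) / (s + n/2).
struct McEstimate {
    std::vector<double> mean;  // MC estimate of E(p_k)
    std::vector<double> unc;   // MC uncertainty on that estimate
};

McEstimate estimateFuzzy(const std::vector<std::vector<double>>& ell, int r, std::mt19937& gen) {
    const int t = static_cast<int>(ell.size());
    const int n = static_cast<int>(ell[0].size());
    std::vector<double> sum(n, 0.0), sum2(n, 0.0);
    for (int rep = 0; rep < r; ++rep) {
        std::vector<double> m(n, 0.0);  // multiplicities of the sampled sequence
        for (int s = 0; s < t; ++s) {
            std::vector<double> w(n);
            for (int k = 0; k < n; ++k)
                w[k] = ell[s][k] * (m[k] + 0.5) / (s + 0.5 * n);
            std::discrete_distribution<int> draw(w.begin(), w.end());
            m[draw(gen)] += 1.0;
        }
        for (int k = 0; k < n; ++k) {   // accumulate p_J(k | sampled sequence)
            double pj = (m[k] + 0.5) / (t + 0.5 * n);
            sum[k] += pj;
            sum2[k] += pj * pj;
        }
    }
    McEstimate res;
    res.mean.resize(n);
    res.unc.resize(n);
    for (int k = 0; k < n; ++k) {
        res.mean[k] = sum[k] / r;
        res.unc[k] = std::sqrt((sum2[k] / r - res.mean[k] * res.mean[k]) / (r - 1.0));
    }
    return res;
}

The second-moment estimators $\widehat{E(p_k p_l)}$ are obtained in the same way by also accumulating $p_J(k \mid \mathbf{i}^{(s)})\, p_J(l \mid (\mathbf{i}^{(s)}, k))$ for each sampled sequence.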
Note
Simulating this sampling is very easy, for example in C++, and gives excellent results for likelihood matrices with 3 situations and 10,000 observation times, with $r$ of around one million. The advantage of sampling from the posterior is that it directly locates the sequences that contribute significantly to the overall sum. Recall that the sums (30) and (31) contain $n^t$ terms.

6. Conclusions

Replacing $p_k$ with $\psi_k = \sqrt{p_k}$ makes Jeffreys' prior uniform. In general, probabilistic quantities can be parameterized by their square roots, which is their natural expression. This phenomenon is found in quantum mechanics [6], where the probability is expressed as the squared modulus of a wave function, hence the choice of the Greek letter $\psi$ for the points of the hypersphere.
The generalization of beta functions given by the function H bridges the gap between continuous calculus on wave functions and sums over sequences, which are computable by MC.
An application of this work is under development for stationary Markov processes.

7. Summary

7.1. Seen from $\psi \in S^{n-1}$

$$d\eta_0(\psi) = d\mu(\psi) \quad (\text{invariant})$$
$$d\eta_t(\psi) = d\eta_{t-1}(\psi \mid \ell_{t-1}) = \frac{\sum_k \ell_{t-1}(k)\, \psi_k^2\; d\eta_{t-1}(\psi)}{\int \sum_k \ell_{t-1}(k)\, \psi_k^2\; d\eta_{t-1}(\psi)}$$
$$E_t(p_k) = \int \psi_k^2\; d\eta_t(\psi)$$
$$E_t(p_k p_l) = \int \psi_k^2\, \psi_l^2\; d\eta_t(\psi)$$

7.2. Seen from $\mathbf{i} \in \{0, \dots, n-1\}^{\mathbb{N}}$

$$\mathbf{i}_t = (i_0, \dots, i_{t-1})$$
$$p_{ac}^{\,t}(\mathbf{i}) = p_J(i_{t-1} \mid \mathbf{i}_{t-1})\; p_{ac}^{\,t-1}(\mathbf{i})$$
$$\ell^t(\mathbf{i}) = \prod_{s=0}^{t-1} \ell_s(i_s)$$
$$p_{ac}^{\,t}(\mathbf{i} \mid \ell) = \frac{\ell^t(\mathbf{i})\; p_{ac}^{\,t}(\mathbf{i})}{\sum_{\mathbf{i}_t} \ell^t(\mathbf{i})\; p_{ac}^{\,t}(\mathbf{i})}$$
$$E_t(p_k) = \sum_{\mathbf{i}_t} p_J(k \mid \mathbf{i}_t)\; p_{ac}^{\,t}(\mathbf{i} \mid \ell)$$
$$E_t(p_k p_l) = \sum_{\mathbf{i}_t} p_J(l \mid \mathbf{i}_t, k)\; p_J(k \mid \mathbf{i}_t)\; p_{ac}^{\,t}(\mathbf{i} \mid \ell)$$

References

  1. Chung, K.L. A Course in Probability Theory; Academic Press, 1968; Chapter 7.
  2. Bernardo, J.M.; Smith, A.F.M. Bayesian Theory; Wiley, 1994; pp. 314, 358.
  3. Laedermann, J.-P. Théorie bayésienne de la décision statistique et mesure de la radioactivité. Thèse de doctorat, UNIL, 2003.
  4. Hyperspherical coordinates. Available online: http://en.wikipedia.org/wiki/N-sphere.
  5. Gradshteyn, I.S.; Ryzhik, I.M. Table of Integrals, Series, and Products; Jeffrey, A., Ed.; Academic Press, 2000; Section 6.4.
  6. Piron, C. Quantum Mechanics; Presses polytechniques et universitaires romandes, 1998.
¹ The use of the Roman font p denotes a generic probability, the underlying spaces being identifiable from the variables used.
² The measurement model may vary at each step.