On the Estimation of Discrete Distributions

Abstract
The aim of this work is to base the estimation of probabilities associated with events on Bayes’ theorem, starting from Jeffreys’ non-informative prior.

1. Introduction

Let us consider a finite set of $n$ situations that may occur according to a distribution $(p_k)_{k=0}^{n-1}$. The probability of each situation is generally determined by repeating the experiment independently and using Laplace's operational definition to obtain the estimators
$$\hat{p}_k = \frac{\text{number of occurrences of situation } k}{\text{number of repetitions of the experiment}}$$
In addition, the central limit theorem provides an estimate of the difference between the true value and the estimate [1].
This empirical method generally works well, but it raises a few problems. On the one hand, the estimator and its uncertainty depend on the number of repetitions and give good results only when this number is sufficiently large, which is a rather vague requirement. On the other hand, the definition is difficult to generalize when the situation itself is not observed directly: in practice, a situation is often identified with a measuring instrument that produces a noisy result.
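As a minimal illustration of this operational definition (not part of the original text, and with names of our own choosing), the following C++ sketch computes the empirical frequencies and the usual CLT standard error $\sqrt{\hat{p}_k (1-\hat{p}_k)/N}$ from a record of observed situations.

#include <cmath>
#include <cstdio>
#include <vector>

// Empirical (Laplace) estimate of each p_k and its CLT standard error,
// from a record of observed situations in {0, ..., n-1}.
int main() {
    const int n = 3;                                  // number of situations
    std::vector<int> obs = {0, 2, 1, 0, 0, 2, 1, 0, 2, 2};
    std::vector<int> count(n, 0);
    for (int k : obs) count[k]++;
    const double N = static_cast<double>(obs.size());
    for (int k = 0; k < n; ++k) {
        double p = count[k] / N;                      // relative frequency
        double u = std::sqrt(p * (1.0 - p) / N);      // CLT standard error
        std::printf("p_%d = %.3f +/- %.3f\n", k, p, u);
    }
    return 0;
}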
The aim of this work is to base the estimation of probabilities associated with events on Bayes’ theorem, starting from Jeffreys’ non-informative priors.

2. Ingredients

2.1. Elements of Bayesian Theory of Statistical Decision-Making

This theory first distinguishes two spaces: the space of states $\theta$ and the space of observations $x$ [3]. Each observation is assumed to be made in the presence of an unknown state. The measurement model is the stochastic relation linking the state to the result of the measurement: for a given state, we specify a transition probability¹ $p(x \mid \theta)$, which gives the distribution of the observations $x$ given $\theta$. Given a prior distribution $p(\theta)$ over the states and an observation $x$, Bayes' theorem gives the posterior distribution over the states
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int d\theta\; p(x \mid \theta)\, p(\theta)}$$
When the state space is a differentiable manifold and the measurement model is twice differentiable with respect to $\theta$, we define a prior distribution called Jeffreys' prior as follows [2]. First, we form the Fisher information matrix
$$I_{ij}(\theta) = -\int dx\; p(x \mid \theta)\,\frac{\partial^2 \ln p(x \mid \theta)}{\partial \theta_i\, \partial \theta_j}$$
Jeffreys' prior is then defined as
$$p_J(\theta)\, d\theta = \sqrt{\det I(\theta)}\; d\theta$$
A fundamental property of this distribution is its invariance. Indeed, it can be shown that p J ( θ ) d θ does not depend on the choice of coordinate system.
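As a standard illustration (not worked out in the text), consider a single Bernoulli trial with parameter $\theta$, i.e. $p(x \mid \theta) = \theta^x (1-\theta)^{1-x}$ for $x \in \{0, 1\}$. Then
$$I(\theta) = \frac{1}{\theta(1-\theta)}, \qquad p_J(\theta)\, d\theta \propto \frac{d\theta}{\sqrt{\theta(1-\theta)}},$$
which is the Beta(1/2, 1/2) distribution; the change of variable $\theta = \sin^2\varphi$ makes it uniform in $\varphi$, prefiguring the hypersphere parameterization used below.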

2.2. Hyperspheres

The estimators constructed in this work make use of unit hyperspheres. Let us recall some of their properties [4].
In the Euclidean space $\mathbb{R}^n$, the unit hypersphere of dimension $n-1$ is the differentiable manifold
$$S^{n-1} = \Bigl\{\, \psi \in \mathbb{R}^n \;\Big|\; \sum_k \psi_k^2 = 1 \,\Bigr\}$$
The use of the letter $\psi$ will be justified below. The usual spherical coordinates are given by
$$\begin{aligned}
\psi_0 &= \sin\theta_{n-2}\,\sin\theta_{n-3}\cdots\sin\theta_1\,\sin\theta_0\\
\psi_1 &= \sin\theta_{n-2}\,\sin\theta_{n-3}\cdots\sin\theta_1\,\cos\theta_0\\
\psi_2 &= \sin\theta_{n-2}\,\sin\theta_{n-3}\cdots\cos\theta_1\\
&\;\;\vdots\\
\psi_{n-1} &= \cos\theta_{n-2}
\end{aligned}$$
and the normalized measure invariant under $SO(n)$ is given by
$$d\mu(\psi) = \frac{1}{A_{n-1}}\,\sin^{n-2}\theta_{n-2}\,\sin^{n-3}\theta_{n-3}\cdots\sin\theta_1\; d\theta_0\, d\theta_1 \cdots d\theta_{n-2}$$
where
$$A_{n-1} = \frac{2\pi^{n/2}}{\Gamma(n/2)}, \qquad \Gamma(1/2) = \sqrt{\pi}$$
Let $\mathbf{m} = (m_0, \dots, m_{n-1})$ be a sequence of non-negative integers. The even moments relative to this measure are given by
$$H(\mathbf{m}) = \int \psi_0^{2m_0} \cdots \psi_{n-1}^{2m_{n-1}}\; d\mu(\psi)$$
We can show that the function $H$ above is a generalization of the beta function:
$$H(\mathbf{m}) = \frac{\Gamma(n/2)}{\pi^{n/2}}\;\frac{\prod_k \Gamma(m_k + 1/2)}{\Gamma\!\bigl(\sum_k (m_k + 1/2)\bigr)}$$
The normalization factor is chosen so that $H(0) = 1$.
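For numerical work it is convenient to evaluate $H$ in log space to avoid overflow of the Gamma functions. The following C++ sketch (our own illustration; the function name is arbitrary) implements the closed form above with std::lgamma.

#include <cmath>
#include <vector>

// H(m) = Gamma(n/2) / pi^(n/2) * prod_k Gamma(m_k + 1/2) / Gamma(sum_k (m_k + 1/2)),
// evaluated in log space for numerical stability.
double H(const std::vector<int>& m) {
    const double PI = 3.14159265358979323846;
    const int n = static_cast<int>(m.size());
    double sum = 0.0;     // sum_k (m_k + 1/2)
    double logNum = 0.0;  // sum_k ln Gamma(m_k + 1/2)
    for (int mk : m) {
        sum += mk + 0.5;
        logNum += std::lgamma(mk + 0.5);
    }
    double logH = std::lgamma(0.5 * n) - 0.5 * n * std::log(PI)
                + logNum - std::lgamma(sum);
    return std::exp(logH);
}

For example, H({0, 0, 0}) returns 1, and H({1, 0, 0}) returns 1/3, the uniform prior probability of observing one of three situations.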

3. Probability Estimators for Known Situations

The problem is to estimate a distribution $p_k$. This distribution can be viewed as a point of the $(n-1)$-dimensional simplex
$$\Delta^{n-1} = \Bigl\{\, \mathbf{p} \;\Big|\; 0 \le p_k \le 1,\; \sum_k p_k = 1 \,\Bigr\}$$
This simplex will be the state space for a measurement model whose observations are the situations. In a canonical way, the probability of obtaining situation $k$ given that the distribution is $\mathbf{p}$ is obviously
$$p(k \mid \mathbf{p}) = p_k$$
This canonical measurement model is twice differentiable with respect to $\mathbf{p}$. We can therefore calculate the associated Jeffreys prior $p_J(\mathbf{p})$. It was stated above that this distribution is independent of the parameterization of the simplex. Let us choose the new coordinates
$$\psi_k = \sqrt{p_k}$$
Since $\sum_k \psi_k^2 = \sum_k p_k = 1$, we obtain a measurement model whose state space is the positive part of the hypersphere $S^{n-1}$:
$$p(k \mid \psi) = \psi_k^2$$
Now, it turns out that Jeffreys' prior for this model is precisely the invariant measure $\mu$ introduced above:
$$dp_J(\psi) = d\mu(\psi)$$
This result allows us to apply Bayes' theorem to a sequence of independent observations. Let
$$\mathbf{i} = (i_0, \dots, i_{t-1})$$
be such a sequence of length $t$; the posterior on the sphere $S^{n-1}$ given $\mathbf{i}$ is then
$$dp(\psi \mid \mathbf{i}) = \frac{\psi_{i_0}^2 \cdots \psi_{i_{t-1}}^2\; d\mu(\psi)}{\int \psi_{i_0}^2 \cdots \psi_{i_{t-1}}^2\; d\mu(\psi)}$$
Introducing the function $m$ giving the multiplicity of an index $k$ in the sequence $\mathbf{i}$,
$$m(\mathbf{i})_k = \operatorname{Card}\{\, s \mid i_s = k \,\}$$
we obtain (omitting the sequence $\mathbf{i}$ from the notation for brevity)
$$dp(\psi \mid \mathbf{i}) = \frac{\prod_k \psi_k^{2 m_k}\; d\mu(\psi)}{H(m(\mathbf{i}))}$$
The probability of occurrence of a situation $k$ can be estimated by the posterior mean,
$$\hat{p}_k = E(p_k) = \frac{\int \psi_k^2\, \psi_{i_0}^2 \cdots \psi_{i_{t-1}}^2\; d\mu(\psi)}{H(m(\mathbf{i}))}$$
The associated uncertainties will be given by the covariance matrix, built from the second moments
$$\widehat{p_k p_l} = E(p_k p_l) = \frac{\int \psi_k^2\, \psi_l^2\, \psi_{i_0}^2 \cdots \psi_{i_{t-1}}^2\; d\mu(\psi)}{H(m(\mathbf{i}))}$$
Let us define $e_k$ as the sequence of integers that is zero everywhere except at the $k$-th position, where it equals 1. It is easy to see that
$$\hat{p}_k = \frac{H(m(\mathbf{i}) + e_k)}{H(m(\mathbf{i}))}$$
$$\widehat{p_k p_l} = \frac{H(m(\mathbf{i}) + e_k + e_l)}{H(m(\mathbf{i}))}$$
The fact that $\sum_h m_h = t$, together with the properties of the Gamma function [5], immediately gives
$$\hat{p}_k = \frac{m_k + 1/2}{t + n/2}$$
$$\widehat{p_k^2} = \frac{(m_k + 1/2)(m_k + 3/2)}{(t + n/2)(t + n/2 + 1)}$$
$$\widehat{p_k p_l} = \frac{(m_k + 1/2)(m_l + 1/2)}{(t + n/2)(t + n/2 + 1)} \qquad (k \ne l)$$
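The closed forms above are immediate to evaluate from the multiplicities. The C++ sketch below (our own illustration, with hypothetical names) returns the posterior means $E(p_k)$ and second moments $E(p_k p_l)$ for a given vector of counts.

#include <vector>

// Posterior moments from the multiplicities m_k of the observed sequence,
// with t = sum_k m_k and n = number of situations:
//   E(p_k)     = (m_k + 1/2) / (t + n/2)
//   E(p_k p_l) = (m_k + 1/2)(m_l + 1/2 + [k == l]) / ((t + n/2)(t + n/2 + 1))
struct Moments {
    std::vector<double> mean;                 // E(p_k)
    std::vector<std::vector<double>> second;  // E(p_k p_l)
};

Moments posteriorMoments(const std::vector<int>& m) {
    const int n = static_cast<int>(m.size());
    double t = 0.0;
    for (int mk : m) t += mk;
    const double d1 = t + 0.5 * n;
    const double d2 = d1 + 1.0;
    Moments mo;
    mo.mean.resize(n);
    mo.second.assign(n, std::vector<double>(n, 0.0));
    for (int k = 0; k < n; ++k) {
        mo.mean[k] = (m[k] + 0.5) / d1;
        for (int l = 0; l < n; ++l) {
            double diag = (k == l) ? 1.0 : 0.0;   // gives (m_k + 3/2) on the diagonal
            mo.second[k][l] = (m[k] + 0.5) * (m[l] + 0.5 + diag) / (d1 * d2);
        }
    }
    return mo;
}

The covariance matrix of the distribution is then obtained as $E(p_k p_l) - E(p_k)\,E(p_l)$.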

4. Self-Consistent Random Sequences

Let $\mathbf{i}$ be a sequence of length $t-1$. Equation (15) therefore gives the Bayesian estimates of the $p_k$ if we have observed the sequence $\mathbf{i}$, which we will also denote by
$$\hat{p}_k = p_J(k \mid i_{t-2}, i_{t-3}, \dots, i_0)$$
At step $t$, let us randomly draw a new situation $i_{t-1}$ according to these $\hat{p}_k$, and form a new sequence by concatenation: $(\mathbf{i}, i_{t-1})$. Nothing prevents us from repeating the process on this new sequence.
We say that a sequence $\mathbf{i}$ is constructed in a self-consistent manner if
  • $i_0$ is drawn uniformly among the $n$ situations;
  • at each subsequent step, $i_{s-1}$ is drawn according to the probability $p_J(k \mid i_{s-2}, \dots, i_0)$.
We note that the probability of obtaining a sequence $\mathbf{i} = (i_0, \dots, i_{t-1})$ in this way is
$$p_{ac}(\mathbf{i}) = p_J(i_{t-1} \mid i_{t-2}, \dots, i_0)\; p_J(i_{t-2} \mid i_{t-3}, \dots, i_0) \cdots p_J(i_1 \mid i_0)\; p_J(i_0)$$
Now, it turns out that this probability is given by the function $H \circ m$:
$$p_{ac}(\mathbf{i}) = H(m(\mathbf{i}))$$
Since no additional assumption is made, we will call the probability $p_{ac}$ the (non-informative) self-consistent prior on sequences $\mathbf{i}$ of given length $t$. Sampling according to this prior will prove useful later for estimation in uncertain situations.
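Simulating a self-consistent sequence is straightforward, since by the results of the previous section the conditional probabilities reduce to $p_J(k \mid i_{s-1}, \dots, i_0) = (m_k + 1/2)/(s + n/2)$, where $m_k$ counts the occurrences of $k$ among the first $s$ draws (the uniform draw of $i_0$ being the case $s = 0$). The C++ sketch below is our own illustration of this rule.

#include <random>
#include <vector>

// Draw a self-consistent sequence of length t over n situations:
// each new situation is drawn with probability
// p_J(k | prefix) = (m_k + 1/2) / (s + n/2),
// where m_k counts the occurrences of k in the prefix of length s.
std::vector<int> selfConsistentSequence(int n, int t, std::mt19937& gen) {
    std::vector<int> seq;
    std::vector<double> m(n, 0.0);
    for (int s = 0; s < t; ++s) {
        std::vector<double> w(n);
        for (int k = 0; k < n; ++k) w[k] = (m[k] + 0.5) / (s + 0.5 * n);
        std::discrete_distribution<int> draw(w.begin(), w.end());
        int k = draw(gen);
        seq.push_back(k);
        m[k] += 1.0;
    }
    return seq;
}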

Note

This way of generating a random sequence of numbers provides a possible answer to the question: Given a sequence, what is its probability of occurrence? It can be easily seen that the sequences with maximum self-consistent probability are also those with minimum entropy.
Incidentally, for sequences of given length $t$, we have
$$\sum_{\mathbf{i}} H(m(\mathbf{i})) = 1$$
From relations (18) and (19), we also have
$$H(m(\mathbf{i}) + e_k) = p_J(k \mid \mathbf{i})\; H(m(\mathbf{i}))$$
$$H(m(\mathbf{i}) + e_k + e_l) = p_J(k \mid \mathbf{i})\; p_J(l \mid (\mathbf{i}, k))\; H(m(\mathbf{i}))$$

5. Generalization to Fuzzy Situations

In reality, situations are often measured by noisy instruments. Instead of knowing the situation at step $s$, the observer only has a measurement result $x$ from a measurement model $p(x \mid k)$. For a given $x$, the vector $k \mapsto p(x \mid k)$, indexed by the situations, is often called the likelihood function. For a sequence of steps $s = 0, \dots, t-1$, the observations² $(x_0, \dots, x_{t-1})$ give a likelihood matrix whose row index is the time and whose column index is the situation:
$$\ell_s(k) = p(x_s \mid k)$$
Let us return to the situation on the hypersphere. Suppose that at time $s$ the prior probability on $S^{n-1}$ is $d\eta_s(\psi)$. The appearance of a result $x_s$ for the measurement model $p(x \mid k)$, combined with the canonical model $p(k \mid \psi) = \psi_k^2$, gives, by Bayes' theorem, a posterior that will be taken as the prior at the next step:
$$d\eta_{s+1}(\psi) = \frac{\sum_k \ell_s(k)\, \psi_k^2\; d\eta_s(\psi)}{\int \sum_k \ell_s(k)\, \psi_k^2\; d\eta_s(\psi)}$$
We can therefore write
$$d\eta_t(\psi) = \frac{\sum_{i_0} \cdots \sum_{i_{t-1}} \ell_0(i_0) \cdots \ell_{t-1}(i_{t-1})\, \psi_{i_0}^2 \cdots \psi_{i_{t-1}}^2\; d\mu(\psi)}{\int \sum_{i_0} \cdots \sum_{i_{t-1}} \ell_0(i_0) \cdots \ell_{t-1}(i_{t-1})\, \psi_{i_0}^2 \cdots \psi_{i_{t-1}}^2\; d\mu(\psi)}$$
Let
$$\ell(\mathbf{i}) = \prod_s \ell_s(i_s)$$
and use the definition of the function $H$ to obtain the estimators at $t-1$:
$$E(p_k) = \frac{\sum_{\mathbf{i}} \ell(\mathbf{i})\, H(m(\mathbf{i}) + e_k)}{\sum_{\mathbf{i}} \ell(\mathbf{i})\, H(m(\mathbf{i}))}$$
$$E(p_k p_l) = \frac{\sum_{\mathbf{i}} \ell(\mathbf{i})\, H(m(\mathbf{i}) + e_k + e_l)}{\sum_{\mathbf{i}} \ell(\mathbf{i})\, H(m(\mathbf{i}))}$$
which can also be written as
$$E(p_k) = \frac{\sum_{\mathbf{i}} p_J(k \mid \mathbf{i})\, \ell(\mathbf{i})\, H(m(\mathbf{i}))}{\sum_{\mathbf{i}} \ell(\mathbf{i})\, H(m(\mathbf{i}))}$$
$$E(p_k p_l) = \frac{\sum_{\mathbf{i}} p_J(k \mid \mathbf{i})\, p_J(l \mid (\mathbf{i}, k))\, \ell(\mathbf{i})\, H(m(\mathbf{i}))}{\sum_{\mathbf{i}} \ell(\mathbf{i})\, H(m(\mathbf{i}))}$$
The preceding equations suggest a new use of Bayes' theorem. Indeed, given the self-consistent prior $p_{ac}$ on sequences of length $t$ and the product measurement model giving the likelihood function $\ell$, we can form the posterior on the sequences
$$p_{ac}(\mathbf{i} \mid \ell) = \frac{\ell(\mathbf{i})\, p_{ac}(\mathbf{i})}{\sum_{\mathbf{j}} \ell(\mathbf{j})\, p_{ac}(\mathbf{j})}$$
and we obtain
$$E(p_k) = \sum_{\mathbf{i}} p_J(k \mid \mathbf{i})\, p_{ac}(\mathbf{i} \mid \ell)$$
$$E(p_k p_l) = \sum_{\mathbf{i}} p_J(k \mid \mathbf{i})\, p_J(l \mid (\mathbf{i}, k))\, p_{ac}(\mathbf{i} \mid \ell)$$
Equations (30) and (31) are sums that can be evaluated by Monte Carlo (MC). The posterior $p_{ac}(\mathbf{i} \mid \ell)$ can be sampled as follows:
  • draw $i_0$ according to the weights $\ell_0(k)$;
  • draw $i_1$ according to the weights $\ell_1(k)\, p_J(k \mid i_0)$;
  • …
  • draw $i_{t-1}$ according to the weights $\ell_{t-1}(k)\, p_J(k \mid i_{t-2}, \dots, i_0)$.
A sample of $r$ sequences $\mathbf{i}^{(s)}$, $s = 1, \dots, r$, of this type gives the MC estimators
$$\widehat{E(p_k)} = \frac{1}{r} \sum_s p_J(k \mid \mathbf{i}^{(s)})$$
$$\widehat{E(p_k p_l)} = \frac{1}{r} \sum_s p_J(k \mid \mathbf{i}^{(s)})\, p_J(l \mid (\mathbf{i}^{(s)}, k))$$
as well as the MC uncertainties
$$u_{\widehat{E(p_k)}} = \sqrt{\frac{1}{r-1} \left( \frac{1}{r} \sum_s p_J(k \mid \mathbf{i}^{(s)})^2 - \widehat{E(p_k)}^{\,2} \right)}$$
$$u_{\widehat{E(p_k p_l)}} = \sqrt{\frac{1}{r-1} \left( \frac{1}{r} \sum_s p_J(k \mid \mathbf{i}^{(s)})^2\, p_J(l \mid (\mathbf{i}^{(s)}, k))^2 - \widehat{E(p_k p_l)}^{\,2} \right)}$$
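The note below mentions a C++ simulation; the following sketch is our own minimal rendering of the sampling scheme just described (all names are ours). Given the likelihood matrix ell[s][k], it draws r sequences with the weights listed above and returns the MC estimators of E(p_k) together with their uncertainties.

#include <cmath>
#include <random>
#include <vector>

// Monte Carlo estimation of E(p_k) for fuzzy observations.
// ell[s][k] = likelihood of situation k at time s; r = number of sampled sequences.
// Each sequence is drawn step by step with weights ell_s(k) * p_J(k | prefix),
// where p_J(k | prefix) = (m_k + 1/2) / (s + n/2).
struct McEstimate {
    std::vector<double> mean;  // MC estimate of E(p_k)
    std::vector<double> unc;   // MC uncertainty on that estimate
};

McEstimate estimateFuzzy(const std::vector<std::vector<double>>& ell, int r, std::mt19937& gen) {
    const int t = static_cast<int>(ell.size());
    const int n = static_cast<int>(ell[0].size());
    std::vector<double> sum(n, 0.0), sum2(n, 0.0);
    for (int rep = 0; rep < r; ++rep) {
        std::vector<double> m(n, 0.0);  // multiplicities of the sampled sequence
        for (int s = 0; s < t; ++s) {
            std::vector<double> w(n);
            for (int k = 0; k < n; ++k)
                w[k] = ell[s][k] * (m[k] + 0.5) / (s + 0.5 * n);
            std::discrete_distribution<int> draw(w.begin(), w.end());
            m[draw(gen)] += 1.0;
        }
        for (int k = 0; k < n; ++k) {   // accumulate p_J(k | sampled sequence)
            double pj = (m[k] + 0.5) / (t + 0.5 * n);
            sum[k] += pj;
            sum2[k] += pj * pj;
        }
    }
    McEstimate res;
    res.mean.resize(n);
    res.unc.resize(n);
    for (int k = 0; k < n; ++k) {
        res.mean[k] = sum[k] / r;
        res.unc[k] = std::sqrt((sum2[k] / r - res.mean[k] * res.mean[k]) / (r - 1.0));
    }
    return res;
}

The second-moment estimators $\widehat{E(p_k p_l)}$ are obtained in the same way by also accumulating $p_J(k \mid \mathbf{i}^{(s)})\, p_J(l \mid (\mathbf{i}^{(s)}, k))$ for each sampled sequence.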
Note
Simulating this sampling is very easy, for example in C++, and gives excellent results for likelihood matrices with 3 situations and 10,000 observation times, with $r$ of around one million. The advantage of sampling from the posterior is that it directly locates the sequences that contribute significantly to the overall sum. Recall that the sums (30) and (31) contain $n^t$ terms.

6. Conclusions

Replacing $p_k$ with $\psi_k = \sqrt{p_k}$ makes Jeffreys' prior uniform. In general, probabilistic quantities can be parameterized by their square roots, which is their natural expression. This phenomenon is found in quantum mechanics [6], where the probability is expressed as the squared modulus of a wave function, hence the choice of the Greek letter $\psi$ for the points of the hypersphere.
The generalization of beta functions given by the function H bridges the gap between continuous calculus on wave functions and sums over sequences, which are computable by MC.
An application of this work is under development for stationary Markov processes.

7. Summary

7.1. Seen from $\psi \in S^{n-1}$

$$d\eta_0(\psi) = d\mu(\psi) \quad (\text{invariant})$$
$$d\eta_t(\psi) = d\eta_{t-1}(\psi \mid \ell_{t-1}) = \frac{\sum_k \ell_{t-1}(k)\, \psi_k^2\; d\eta_{t-1}(\psi)}{\int \sum_k \ell_{t-1}(k)\, \psi_k^2\; d\eta_{t-1}(\psi)}$$
$$E_t(p_k) = \int \psi_k^2\; d\eta_t(\psi)$$
$$E_t(p_k p_l) = \int \psi_k^2\, \psi_l^2\; d\eta_t(\psi)$$

7.2. Seen from $\mathbf{i} \in \{0, \dots, n-1\}^{\mathbb{N}}$

$$\mathbf{i}_t = (i_0, \dots, i_{t-1})$$
$$p_{ac}^{\,t}(\mathbf{i}) = p_J(i_{t-1} \mid \mathbf{i}_{t-1})\; p_{ac}^{\,t-1}(\mathbf{i})$$
$$\ell^t(\mathbf{i}) = \prod_{s=0}^{t-1} \ell_s(i_s)$$
$$p_{ac}^{\,t}(\mathbf{i} \mid \ell) = \frac{\ell^t(\mathbf{i})\; p_{ac}^{\,t}(\mathbf{i})}{\sum_{\mathbf{i}_t} \ell^t(\mathbf{i})\; p_{ac}^{\,t}(\mathbf{i})}$$
$$E_t(p_k) = \sum_{\mathbf{i}_t} p_J(k \mid \mathbf{i}_t)\; p_{ac}^{\,t}(\mathbf{i} \mid \ell)$$
$$E_t(p_k p_l) = \sum_{\mathbf{i}_t} p_J(l \mid \mathbf{i}_t, k)\; p_J(k \mid \mathbf{i}_t)\; p_{ac}^{\,t}(\mathbf{i} \mid \ell)$$

References

  1. Chung, K.L. A Course in Probability Theory; Academic Press, 1968; Chapter 7.
  2. Bernardo, J.M.; Smith, A.F.M. Bayesian Theory; Wiley, 1994; pp. 314, 358.
  3. Laedermann, J.-P. Théorie bayésienne de la décision statistique et mesure de la radioactivité. Thèse de doctorat, UNIL, 2003.
  4. Hyperspherical coordinates. Available online: http://en.wikipedia.org/wiki/N-sphere.
  5. Gradshteyn, I.S.; Ryzhik, I.M. Table of Integrals, Series, and Products; Jeffrey, A., Ed.; Academic Press, 2000; Section 6.4.
  6. Piron, C. Quantum Mechanics; Presses polytechniques et universitaires romandes, 1998.
¹ The use of the Roman font p denotes a generic probability, the underlying spaces being identifiable from the variables used.
² The measurement model may vary at each step.