Preprint · Article · This version is not peer-reviewed.

Information-Weight Interpretation of Probability: A Novel Information-Theoretic Perspective

Submitted: 19 August 2025 | Posted: 19 August 2025
Abstract
This paper proposes a novel information-theoretic perspective, where probability is interpreted as information weight and probability density is interpreted as information density. We define “information” as the state of a discrete system or the value of a continuous system. We model discrete systems using multisets. From this, we define information weight as the relative multiplicity of each state with respect to the size of the multiset. By relating a discrete system (multiset) to a discrete random variable, we establish a one‑to‑one correspondence between information weight and conventional probability. We then extend this framework to continuous systems. This information-theoretic perspective, distinct from personal belief, propensity, or frequency-based interpretations, emphasizes the contribution of probability to the informational structure of a system.

1. Introduction

Probability is virtually ubiquitous, playing a role in almost all scientific disciplines (Hájek, 2002). Mathematically, probability is defined as a function satisfying Kolmogorov’s axioms, providing a framework for quantifying uncertainty. However, as D’Agostini (2017) argues, probability is not just a number between 0 and 1 that satisfies the axioms, “because such a ‘definition’ says nothing about what we are talking about.” Probability requires a practical interpretation to connect mathematical formalism to real-world contexts. The question of what probability represents in real-world contexts has sparked decades of philosophical and scientific debate and remains an open issue (Galavotti, 2017). For comprehensive surveys of these interpretive debates, readers may refer to Bunge (1981), Seidenfeld (2015), D’Agostini (2017), Hennig (2024), and Spiegelhalter (2024).
Existing interpretations of probability generally fall into two categories: subjective and objective. In the subjective interpretation (or Bayesian interpretation), probability reflects personal beliefs about a parameter (treated as a constant) or hypothesis, often referred to as the “degree of belief” (e.g., D’Agostini, 2017; Willink, 2025). In contrast, objective interpretations view probability as an inherent physical property of a system. Objective interpretations include frequency interpretation and propensity interpretation. The frequency interpretation defines probability as the limiting relative frequency in repeated trials. The propensity interpretation describes probability as a system’s tendency to produce an outcome, and therefore probability can be called the “degree of propensity” (Huang, 2023). While both subjective and objective interpretations provide valuable insights, they often fail to fully capture the intuitive ways people perceive probability in terms of information and informativeness.
In this paper, we propose an information-theoretic perspective, which defines probability as “information weight” to align with the intuitive and familiar concept “weight”. The following sections are structured as follows. Section 2 describes the proposed information-theoretic perspective on probability and probability density. Section 3 describes the relations of the proposed perspective with Shannon information theory (Shannon, 1948) and informity theory (Huang, 2025). Section 4 introduces information-theoretic terminology for probability concepts. Section 5 provides discussion and conclusion.

2. The Proposed Information-Theoretic Perspective on Probability and Probability Density

2.1. Discrete Systems

In the context of this paper, a discrete system is a system with a countable number of states. The state variable of a discrete system changes only at specific, distinct points (states). For example, tossing a coin is a discrete system with two specific, distinct states: "heads" and "tails."
Let $X$ denote the state variable. Suppose $X$ has $n$ distinct states $\{x_1, \ldots, x_n\}$, which form its state space. Each state $x_i$ occurs $m_i$ times. Mathematically, this discrete system can be modeled by a multiset $(X, m)$: $\{x_1^{m_1}, x_2^{m_2}, \ldots, x_n^{m_n}\}$, where $X = \{x_1, \ldots, x_n\}$ is the underlying set of distinct elements (states), $n$ is the number of elements, $m_i$ is the multiplicity, i.e., the number of times each element $x_i$ appears in the multiset, and $\sum_{j=1}^{n} m_j$ is the size (i.e., the total multiplicity) of the multiset.
Definition 1.
Each state $x_i$ is a piece of information about the state variable $X$ (or the discrete system). For instance, the state “heads” or “tails” is a piece of information about the coin-tossing system.
Definition 2.
The multiplicity $m_i$ is defined as the information content of $x_i$, and the size of the multiset $\sum_{j=1}^{n} m_j$ is defined as the total information content of the discrete system.
Definition 3.
The information weight of $x_i$, denoted by $W(x_i)$, is defined as the ratio of $m_i$ to $\sum_{j=1}^{n} m_j$:

$$W(x_i) = \frac{m_i}{\sum_{j=1}^{n} m_j}. \quad (1)$$

That is, information weight is the relative information content of $x_i$. Note that the sum of the information weights of all the states of the discrete system is

$$\sum_{i=1}^{n} \frac{m_i}{\sum_{j=1}^{n} m_j} = 1. \quad (2)$$

That is, the total relative information content of the discrete system is 1.
Furthermore, the information weight function, denoted by $W_X$, represents the distribution of information weight across all states of the discrete system.
Definition 4.
Assume that the discrete system has a mean state and that each state $x_i$ is numerical (e.g., measurement errors are real numbers). Let $\bar{x}$ denote the mean state, defined as

$$\bar{x} = \frac{\sum_{i=1}^{n} m_i x_i}{\sum_{j=1}^{n} m_j} = \sum_{i=1}^{n} W(x_i)\, x_i. \quad (3)$$

Therefore, the mean state $\bar{x}$ is a weighted average of all states $x_i$ of the discrete system, which explains why we term $W(x_i)$ the information weight.
Now, we relate the discrete system modeled by the multiset to a discrete random variable $X$ with a probability mass function (PMF) $P_X$. Each state $x_i$ in the discrete system corresponds to a possible outcome of $X$, and $P(x_i)$ represents its probability. From this, it is easy to see that the probability $P(x_i)$ is equal to the information weight $W(x_i)$:

$$P(x_i) = W(x_i) = \frac{m_i}{\sum_{j=1}^{n} m_j}, \quad (4)$$

and

$$\sum_{i=1}^{n} P(x_i) = \sum_{i=1}^{n} \frac{m_i}{\sum_{j=1}^{n} m_j} = 1. \quad (5)$$

Equation (4) establishes the equivalence between information weight and probability, while Eq. (5) confirms the normalization of probabilities. Therefore, the PMF $P_X$ and the information weight function $W_X$ are equivalent, each representing the informational structure of a discrete system or discrete distribution.
The information weight $W(x_i)$ quantifies how strongly a state contributes to the system’s informational structure: the higher the information weight, the more prevalent or predictable the state (or outcome). For example, consider a fair six-sided die rolled 60 times, with outcomes {1, 2, 3, 4, 5, 6} and multiplicities {10, 10, 10, 10, 10, 10}; the information weight of each outcome is $W(x_i) = 10/60 \approx 0.167$. Each outcome therefore has equal information weight, reflecting a uniform distribution in which no single outcome dominates the system’s informational structure.
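The die example can be checked numerically. Below is a minimal Python sketch (the function name `information_weights` is illustrative, not from the paper) that computes each state's information weight as its multiplicity divided by the multiset size, and the mean state as the weight-averaged state:

```python
from collections import Counter

def information_weights(multiset):
    """Information weight of each state: multiplicity / size of the multiset."""
    counts = Counter(multiset)      # multiplicities m_i of each distinct state
    total = sum(counts.values())    # total information content (multiset size)
    return {state: m / total for state, m in counts.items()}

# Fair six-sided die rolled 60 times: each outcome appears 10 times.
rolls = [face for face in range(1, 7) for _ in range(10)]
weights = information_weights(rolls)
print(weights[1])       # 10/60 ≈ 0.167

# Mean state as the weighted average of states, with information weights as weights.
mean_state = sum(w * x for x, w in weights.items())
print(mean_state)       # ≈ 3.5 for the fair die
```

The weights sum to 1 (up to floating point), mirroring the normalization of a PMF.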

2.2. Continuous Systems

In the context of this paper, a continuous system is a system where its state variable varies continuously, taking any value within a range. For example, sediment particle sizes, ranging from pebbles to microscopic clay, form a continuous system (i.e., a continuous distribution of particle sizes).
Let $Y$ denote the state variable, taking values $y \in [a, b]$, where $[a, b]$ is a continuous state space. To extend the discrete framework, we divide $[a, b]$ into $n(\Delta y)$ small bins of size $\Delta y = \frac{b-a}{n(\Delta y)}$; each bin $i$ ($i = 1, 2, \ldots, n(\Delta y)$) contains the element $\delta_i = [y_i - \frac{\Delta y}{2},\, y_i + \frac{\Delta y}{2}]$ centered at $y_i$. Then, this discretized continuous system can be modeled by a multiset $(Y, m)$: $\{\delta_1^{m_1(\Delta y)}, \delta_2^{m_2(\Delta y)}, \ldots, \delta_{n(\Delta y)}^{m_{n(\Delta y)}(\Delta y)}\}$, where $Y = \{\delta_1, \ldots, \delta_{n(\Delta y)}\}$ is the underlying set of distinct elements, $m_i(\Delta y)$ is the multiplicity, i.e., the number of data points that fall into bin $i$, and $\sum_{j=1}^{n(\Delta y)} m_j(\Delta y)$ is the size of the multiset. This multiset corresponds to a histogram commonly used to visualize a set of continuous data, with $\delta_i$ on the x-axis and $m_i(\Delta y)$ on the y-axis, such as data for sediment particle sizes. A smaller bin size $\Delta y$ increases the bin number $n(\Delta y)$ but reduces the multiplicity $m_i(\Delta y)$ of each $\delta_i$, thereby refining the histogram’s approximation to the continuous distribution.
Definition 5.
Each value $y_i \in [a, b]$ is a piece of information about the continuous state variable $Y$. For example, the sediment particle size $y_i = 0.003$ mm is a piece of information about $Y$.
Definition 6.
The multiplicity $m_i(\Delta y)$ divided by the bin size $\Delta y$ is the information content per unit state space, and $\sum_{j=1}^{n(\Delta y)} m_j(\Delta y)$ is the total information content of the discretized continuous system. Then, the information weight per unit state space is given by

$$\frac{W(\delta_i)}{\Delta y} = \frac{m_i(\Delta y)/\Delta y}{\sum_{j=1}^{n(\Delta y)} m_j(\Delta y)}. \quad (6)$$
Definition 7.
The information density at $y_i$, denoted by $D(y_i)$, is defined as

$$D(y_i) = \lim_{\Delta y \to 0} \frac{W(\delta_i)}{\Delta y} = \lim_{\Delta y \to 0} \frac{m_i(\Delta y)/\Delta y}{\sum_{j=1}^{n(\Delta y)} m_j(\Delta y)}, \quad (7)$$

which reflects the informational contribution per unit state space to the informational structure of the continuous system. Furthermore, the information density function, denoted by $D(Y)$, represents the distribution of information density across the values $y \in [a, b]$.
Now, we relate the continuous system to a continuous random variable $Y$ with probability density function (PDF) $p_Y$ with support $[a, b]$. Then, it is easy to see that the probability density $p(y_i)$ equals the information density $D(y_i)$:

$$p(y_i) = D(y_i) = \lim_{\Delta y \to 0} \frac{m_i(\Delta y)/\Delta y}{\sum_{j=1}^{n(\Delta y)} m_j(\Delta y)}, \quad (8)$$

with the normalization

$$\int_a^b p(y)\, dy = \lim_{\Delta y \to 0} \sum_{i=1}^{n(\Delta y)} p(y_i)\, \Delta y = 1. \quad (9)$$

Thus, the PDF $p_Y$ and the information density function $D(Y)$ are equivalent, representing the informational structure of a continuous system.
Moreover, the information weight of an interval $[y_1, y_2]$ is given by

$$W(y_1, y_2) = \int_{y_1}^{y_2} D(y)\, dy, \quad (10)$$

which corresponds to the coverage probability of the interval.
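The limiting construction above can be approximated numerically: bin a sample, divide each bin's multiplicity by the bin size and the total count, and compare against the known density. A minimal sketch follows (the name `information_density` is illustrative, not from the paper):

```python
import random

def information_density(samples, a, b, n_bins):
    """Approximate information density on [a, b]: (m_i / Δy) / Σ m_j per bin."""
    dy = (b - a) / n_bins
    counts = [0] * n_bins
    for y in samples:
        i = min(int((y - a) / dy), n_bins - 1)   # clamp right edge into last bin
        counts[i] += 1
    total = len(samples)
    centers = [a + (i + 0.5) * dy for i in range(n_bins)]
    density = [m / (dy * total) for m in counts]
    return centers, density, dy

random.seed(0)
samples = [random.uniform(0.0, 2.0) for _ in range(100_000)]  # true density: 0.5 on [0, 2]
centers, density, dy = information_density(samples, 0.0, 2.0, 20)

# Information weight of the whole support = integral of the density ≈ 1.
total_weight = sum(d * dy for d in density)
print(total_weight)     # ≈ 1.0
print(density[0])       # ≈ 0.5, the uniform density value
```

Shrinking the bin size (more bins, more samples) refines the histogram's approximation to the continuous information density, exactly as the limit construction describes.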

3. Relations with Shannon Information Theory and Informity Theory

This section explores how the information-weight interpretation relates to Shannon information theory (Shannon, 1948) and informity theory (Huang, 2025), highlighting correspondences and distinctions.

3.1. Relations with Shannon Information Theory

In discrete systems, each state $x_i$ has information weight $W(x_i) = P(x_i)$, representing its relative information content. This should not be confused with the Shannon information content, defined as (Shannon, 1948)

$$I_S(x_i) = -\log_2 P(x_i), \quad (11)$$

which is the logarithm of the probability’s reciprocal. We term that reciprocal

$$R(x_i) = \frac{1}{P(x_i)} = \frac{1}{W(x_i)} \quad (12)$$

the information rarity. $R(x_i)$ quantifies how uncommon the state $x_i$ is relative to the total occurrence of all states in the discrete system. Notably, information rarity and information weight are inversely related: as information rarity increases, information weight decreases, and vice versa.
In the multiset framework, information rarity can be expressed as

$$R(x_i) = \frac{\sum_{j=1}^{n} m_j}{m_i}. \quad (13)$$

Then, the Shannon information content can be written in terms of information rarity:

$$I_S(x_i) = \log_2 R(x_i). \quad (14)$$

Accordingly, the discrete entropy can be written as

$$H(X) = \sum_{i=1}^{n} P(x_i) \log_2 R(x_i), \quad (15)$$

which means that the discrete entropy is the expectation of the logarithm of information rarity.
Intuitively, information rarity is related to uncertainty. Higher rarity indicates greater uncertainty, explaining Shannon entropy as a measure of the uncertainty of a system.
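This reading of entropy as expected log-rarity can be sketched in a few lines of Python (the function name is illustrative). A uniform system makes every state equally rare and yields maximal entropy; concentrating weight on one state lowers it:

```python
import math

def entropy_via_rarity(weights):
    """Discrete entropy as the expectation of log2 of information rarity R = 1/P."""
    return sum(p * math.log2(1.0 / p) for p in weights if p > 0)

# Uniform system: all six states equally rare (R = 6 for each).
print(entropy_via_rarity([1/6] * 6))   # log2(6) ≈ 2.585 bits

# Skewed system: one heavy (common) state, hence lower uncertainty.
print(entropy_via_rarity([0.9, 0.02, 0.02, 0.02, 0.02, 0.02]))
```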
Moreover, in continuous systems, the differential entropy (also called continuous entropy) can be written as

$$H(Y) = -\int_a^b p(y) \log_2 p(y)\, dy = \int_a^b p(y) \log_2 R(y)\, dy, \quad (16)$$

where $R(y) = 1/p(y)$ can be called the information rarity density. Thus, the differential entropy is the expectation of the logarithm of the information rarity density.

3.2. Relations with Informity Theory

Informity theory, proposed by Huang (2025), is an information-theoretic framework where probability is viewed as relative information content, and informity quantifies the mean relative information content of a system. The proposed information-weight interpretation operationalizes relative information content using the multiset framework, equating probability, information weight, and relative information content: $P(x_i) = W(x_i) = \frac{m_i}{\sum_{j=1}^{n} m_j}$. Accordingly, the discrete informity $\beta_X$ can be expressed in terms of information weight:

$$\beta_X = \sum_{i=1}^{n} P(x_i)^2 = \sum_{i=1}^{n} P(x_i)\, W(x_i) = E[W_X]. \quad (17)$$
That is, the discrete informity is the expectation of the information weight.
Moreover, the continuous informity $\beta_Y$ can be expressed in terms of information density:

$$\beta_Y = \int_a^b p(y)^2\, dy = \int_a^b p(y)\, D(y)\, dy = E[D(Y)]. \quad (18)$$
That is, the continuous informity is the expectation of the information density.
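For the discrete case, the identity between the sum of squared probabilities and the expected information weight can be verified directly. A minimal sketch (illustrative function name, following the definition of discrete informity stated above):

```python
def discrete_informity(weights):
    """Discrete informity: Σ P(x_i)^2, i.e., the expected information weight E[W_X]."""
    return sum(p * p for p in weights)

print(discrete_informity([1/6] * 6))   # 1/6 ≈ 0.167 for a uniform six-state system
print(discrete_informity([1.0]))       # 1.0 for a deterministic (one-state) system
```

Note that informity moves opposite to entropy: the uniform system has the lowest informity and the highest entropy, while the deterministic system has informity 1 and entropy 0.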

4. Information-Theoretic Terminology for Probability Concepts

This section introduces terminology for probability concepts within the proposed information-theoretic framework, building on the information-weight and information-density interpretations established in Sections 2.1 and 2.2. By comparing the information-theoretic perspective with the Bayesian and propensity perspectives, we define “degree of information,” “information spectrum,” and “information interval” to reflect the philosophical emphasis on the informational structure of a system, contrasting these with subjective belief and causal propensity.
In various perspectives on probability, the mathematical concepts of probability, probability distribution, and probability interval are assigned distinct terminologies that reflect their philosophical underpinnings. In the Bayesian perspective, probability is termed the “degree of belief” (e.g., D’Agostini, 2017; Willink, 2025), the probability distribution can be named the “belief spectrum,” and the probability interval is often known as the “credible interval.” In the propensity perspective, these are referred to as the “degree of propensity,” “propensity spectrum,” and “propensity interval,” respectively (Huang, 2023). Similarly, in the proposed information-theoretic perspective, probability can be called the "degree of information," the probability distribution can be termed the "information spectrum," and the probability interval can be called the "information interval." These terminologies are not mere labels; they align with distinct philosophical views. The "degree of belief" emphasizes subjective confidence in a parameter or hypothesis, the "degree of propensity" highlights the causal tendency of an outcome, and the "degree of information" underscores the informational weight of a state (or outcome). Similarly, the "belief spectrum” describes the distribution of personal beliefs, the "propensity spectrum” describes the distribution of propensity, and the "information spectrum" describes the distribution of information weight or density.
Furthermore, from the proposed information-theoretic perspective, a probability interval with a specified coverage probability can be termed the “information interval” with a specified “degree of information,” representing a range of values supported by an information weight equal to the coverage probability. For example, consider the sample mean (statistic) $\bar{Y}$ of $n$ observations drawn from a normal distribution $Y \sim N(\mu, \sigma^2)$, where $\mu$ is the mean and $\sigma$ is the standard deviation. Its sampling distribution, $N(\mu, \frac{\sigma^2}{n})$, serves as the information spectrum, describing the informational structure of the continuous system. The information interval covering sample means supported by an information weight of 0.95 is given by

$$\left(\mu - 1.96\frac{\sigma}{\sqrt{n}},\; \mu + 1.96\frac{\sigma}{\sqrt{n}}\right), \quad (19)$$

which is equivalent to the probability interval with a coverage probability of 0.95.
When $\mu$ and $\sigma$ are unknown, the sample mean $\bar{y}$ estimates $\mu$ and $s/c_{4,n}$ estimates $\sigma$, where $s$ is the sample standard deviation and $c_{4,n}$ is the bias-correction factor for $s$. This factor is given by (e.g., Wadsworth, 1989)

$$c_{4,n} = \sqrt{\frac{2}{n-1}}\, \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-1}{2}\right)}, \quad (20)$$

where $\Gamma(\cdot)$ denotes the Gamma function.
Using these estimates, we can obtain an empirical information spectrum for the sample mean $\bar{Y}$, which follows a normal distribution $N\!\left(\bar{y}, \frac{s^2}{c_{4,n}^2\, n}\right)$ (Huang, 2018). Then, the empirical information interval covering sample means supported by a nominal information weight of 0.95 (i.e., a nominal coverage probability of 0.95) can be constructed from this empirical information spectrum. It is given by

$$\left(\bar{y} - 1.96\frac{s}{c_{4,n}\sqrt{n}},\; \bar{y} + 1.96\frac{s}{c_{4,n}\sqrt{n}}\right). \quad (21)$$

The half-width of this empirical information interval, $1.96\frac{s}{c_{4,n}\sqrt{n}}$, corresponds to the unbiased expanded uncertainty in measurement uncertainty analysis (e.g., Huang, 2018).
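The construction above is straightforward to implement with only the standard library. In the sketch below, `c4` and `information_interval` are illustrative names, the sample data are invented for demonstration, and the fixed 1.96 factor corresponds to the nominal information weight of 0.95:

```python
import math

def c4(n):
    """Bias-correction factor c_{4,n} = sqrt(2/(n-1)) * Γ(n/2) / Γ((n-1)/2)."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def information_interval(sample):
    """Empirical information interval for the sample mean, nominal weight 0.95."""
    n = len(sample)
    ybar = sum(sample) / n                                            # estimates μ
    s = math.sqrt(sum((y - ybar) ** 2 for y in sample) / (n - 1))     # sample std dev
    half_width = 1.96 * s / (c4(n) * math.sqrt(n))  # unbiased expanded uncertainty
    return ybar - half_width, ybar + half_width

print(c4(10))    # ≈ 0.9727 for n = 10
lo, hi = information_interval([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9])
print(lo, hi)    # interval centered at the sample mean 10.0
```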

5. Discussion and Conclusion

The proposed information-weight interpretation of probability is intuitive because it maps probability onto the familiar concept of weight. This aligns with D’Agostini’s (2017) view that we can understand probability statements by mapping them in “some ‘categories’ of our mind, as we do with space and time.” For example, stating “the probability of rain tomorrow is 0.8” can be rephrased as “raining tomorrow has an information weight of 0.8.” Thus, probability, as information weight, can be viewed as a “property” of a piece of information (e.g., a state, outcome, or hypothesis). This is analogous to the way “strength” is a physical property of a material, as in “AISI 302 stainless steel has a tensile strength of 585 MPa.”
The proposed information-weight interpretation of probability provides a novel philosophical perspective. Rather than linking probability to personal beliefs (Bayesian), system tendencies (propensity), or long-run frequencies (frequentist), this framework emphasizes probability as the informational weight that supports a state, outcome, or hypothesis and reflects its contribution to the informational structure of a system. In addition, this interpretation holds regardless of how probabilities are derived, whether from professional judgment, physical models, or empirical frequencies, making it independent of probability assignment methods. Moreover, by defining probability distributions as information spectra and probability intervals as information intervals, this framework unifies the treatment of probability and probability density under a single information-theoretic perspective.

References

  1. Bunge, M. (1981). Four concepts of probability. Applied Mathematical Modelling, 5(5), 306–312.
  2. D'Agostini, G. (2017). Probability, propensity and probability of propensities (and of probabilities). AIP Conference Proceedings, 1853(1), 030001.
  3. Galavotti, M. C. (2017). The interpretation of probability: still an open issue? Philosophies, 2(3), 20.
  4. Hájek, A. (2002). Interpretations of probability. The Stanford Encyclopedia of Philosophy (Winter 2023 Edition), Edward N. Zalta & Uri Nodelman (eds.). https://plato.stanford.edu/entries/probability-interpret/
  5. Hennig, C. (2024). Probability models in statistical data analysis: uses, interpretations, frequentism-as-model. In B. Sriraman (Ed.), Handbook of the History and Philosophy of Mathematical Practice (pp. 1411–1458). Cham: Springer.
  6. Huang, H. (2018). A unified theory of measurement errors and uncertainties. Measurement Science and Technology, 29, 125003.
  7. Huang, H. (2023). A propensity-based framework for measurement uncertainty analysis. Measurement, 112693.
  8. Huang, H. (2025). The theory of informity: a novel probability framework. Bulletin of Taras Shevchenko National University of Kyiv, Physics and Mathematics, 80(1), 53–59.
  9. Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423, 623–656.
  10. Spiegelhalter, D. (2024). Why probability probably doesn't exist (but it is useful to act like it does). Nature, 636(8043), 560–563.
  11. Seidenfeld, T. (2015). Probability theory: interpretations. International Encyclopedia of the Social & Behavioral Sciences (2nd ed.), 37–42.
  12. Wadsworth, H. M., Jr. (1989). Summarization and interpretation of data. In H. M. Wadsworth Jr. (Ed.), Handbook of Statistical Methods for Engineers and Scientists (pp. 2.1–2.21). McGraw-Hill.
  13. Willink, R. (2025). On the role of probability in science, analytical measurement and QUAM. Accreditation and Quality Assurance, 30, 245–252.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.