Preprint
Article

This version is not peer-reviewed.

The Information Dynamics of Generative Diffusion

Submitted: 01 January 2026
Posted: 05 January 2026


Abstract
Generative diffusion models have emerged as a powerful class of models in machine learning, yet a unified theoretical understanding of their operation is still developing. This paper provides an integrated perspective on generative diffusion by connecting the information-theoretic, dynamical, and thermodynamic aspects. We demonstrate that the rate of conditional entropy production during generation (i.e. the generative bandwidth) is directly governed by the expected divergence of the score function's vector field. This divergence, in turn, is linked to the branching of trajectories and generative bifurcations, which we characterize as symmetry-breaking phase transitions in the energy landscape. Beyond ensemble averages, we demonstrate that symmetry-breaking decisions are revealed by peaks in the variance of pathwise conditional entropy, capturing heterogeneity in how individual trajectories resolve uncertainty. Together, these results establish generative diffusion as a process of controlled, noise-induced symmetry breaking, in which the score function acts as a dynamic nonlinear filter that regulates both the rate and variability of information flow from noise to data.

1. Introduction

Generative diffusion models have rapidly become one of the most successful frameworks for high-dimensional generative modeling. They were introduced in Sohl-Dickstein et al. [1] in analogy with stochastic thermodynamics. Several works elucidated the theoretical foundations of the method [2,3,4] and their practical implementation procedures [3,5]. Despite these efforts, a unified conceptual understanding of their behavior is still emerging. Several perspectives on information theory, stochastic thermodynamics, and the statistical physics of symmetry breaking have each shed light on different aspects of diffusion models, but their interrelations remain fragmented. The purpose of this perspective paper is to integrate these viewpoints into a single coherent theoretical picture.
Our central thesis is that generation in diffusion models proceeds through a sequence of noise-driven symmetry-breaking transitions. These transitions determine when and how the model commits to a specific generative outcome, structure the flow of information, regulate entropy production, and shape the geometry of trajectories in state space. We refer to this synthesis as the information thermodynamics of generative diffusion.

Information Theory and Entropy-Based Perspectives

A growing line of work has examined diffusion models from the standpoint of information theory, focusing especially on how information about the clean sample $x_0$ is progressively revealed as noise is removed. Recent works have proposed information-theoretic decompositions of diffusion dynamics [6,7] and have explored the role of conditional entropy in designing improved training and sampling schedules [8,9]. Furthermore, Franzese et al. [10] show how information-theoretic tools reveal the mechanisms by which latent abstractions guide generation. These approaches treat diffusion as a sequential information transfer process and highlight that the effectiveness of generation depends on how rapidly uncertainty about $x_0$ can be reduced. Central to these results is the observation that the conditional entropy rate is directly linked to geometric quantities such as the divergence of the score and the curvature of the log-density. This suggests that information flow is deeply connected to the underlying dynamical and geometric structure of the generative process.

Phase Transitions, Associative Memories, and Symmetry Breaking

Parallel developments in statistical physics have revealed that diffusion models exhibit noise-driven symmetry-breaking events, where the score field undergoes bifurcations and the generative trajectories split into distinct modes [11,12]. High-dimensional analyses have linked these transitions to mean-field phase transitions [13] and to dynamical behavior captured by stochastic localization [14,15,16,17]. These bifurcations correlate with sharp changes in the Hessian of the log-density, revealing a connection between symmetry breaking and information geometry. Similar mechanisms have been studied in hierarchical generative settings [18,19] and in analyses of memorization, mode formation, and semantic emergence [20,21,22,23]. Generative diffusion models have also been directly connected to modern Hopfield networks and other associative memory networks [24,25,26,27], where generalization has been associated with the emergence of spurious states [28]. Across these domains, the key unifying insight is that the Hessian (or score Jacobian) mediates both stability of generative trajectories and the structure of the data manifold.

Thermodynamics and the Role of Inferential Entropy

The connection between diffusion models and stochastic thermodynamics was first made explicit in Sohl-Dickstein et al. [1], motivating a thermodynamic view of generation. Furthermore, this connection was strengthened with a mathematical framework based on stochastic differential equations (SDEs) formulated in Song et al. [2], Rombach et al. [29] and is central to the modern understanding of diffusion models. However, the notion of entropy that is commonly used in stochastic thermodynamics [30,31] measures the irreversibility of the forward process. While such quantities yield elegant speed-accuracy tradeoffs [32], they characterize the evolution of the distribution of trajectories rather than the uncertainty relevant for generating a single sample. Instead, we argue that what matters during generation is the uncertainty about the clean sample $x_0$.
For this reason, it is more natural to study the conditional entropy $H[x_0 \mid x_t]$ and, at the trajectory level, the pathwise conditional entropy $h_t(x_t)$, whose fluctuations capture the temporary multimodality experienced by individual generative paths. As in stochastic thermodynamics, such pathwise entropies can increase along single trajectories, even when the average conditional entropy decreases. We show that these fluctuations reveal symmetry-breaking events; the model becomes momentarily undecided among competing hypotheses for $x_0$, leading to spikes in conditional entropy variance and amplified sensitivity to noise.

Our Perspective

We unify these threads by showing that information flow, symmetry breaking, and dynamical instability are different manifestations of the same underlying mechanism governing diffusion models. In particular, we show that:
  • Entropy as a detector of symmetry breaking: Symmetry-breaking transitions are accompanied by pronounced changes in ensemble-level information measures. In particular, the conditional entropy rate $\dot{H}[x_0 \mid x_t]$ exhibits sharp peaks around bifurcation points, reflecting an increased sensitivity of the generative process to noise. These peaks provide direct information-theoretic signatures of symmetry-breaking transitions.
  • Noise-driven decisions: These information-theoretic signatures arise when the score field becomes weak along a low-curvature direction, temporarily losing its ability to suppress stochastic fluctuations. In such regimes, noise plays an active role in selecting which generative branch the trajectory follows, effectively making the generative decision.
  • Path divergence via Lyapunov instability: The same loss of curvature is reflected in the Jacobian of the score, whose spectrum develops positive eigenvalues along the unstable directions. As a result, nearby generative trajectories diverge exponentially, leading to macroscopic separation between paths that correspond to different generative outcomes.
  • Non-monotonic pathwise entropy: At the level of individual trajectories, this divergence manifests as heterogeneous resolution of uncertainty about $x_0$. Consequently, the pathwise conditional entropy need not evolve monotonically along single paths and may transiently increase, reflecting temporary ambiguity during the symmetry-breaking decision process. As a result, the variance of the pathwise conditional entropies peaks around these decision points.
Together, these results establish a conceptual framework in which entropy production, posterior geometry, and dynamical stability are unified through the lens of noise-driven symmetry breaking. This perspective clarifies the mechanisms by which diffusion models transform noise into structured data and highlights the central role of symmetry in shaping generative dynamics.

2. Information Theory

We start by presenting an introduction to the information theory of sequential generative modeling, which will open the door to the analysis of generative diffusion.
Consider a game of Twenty Questions where an interrogator player may ask an "oracle" player twenty binary questions about a set, in order to gradually reveal the identity of a predetermined element $y^*$ in a finite set $\Omega = \{y_1, \dots, y_{N_0}\}$ with $N_0$ elements. We denote the size of the set $\Omega_{j-1}(a_{1:j-1})$ of elements still possible after $j-1$ questions as $N_{j-1}(a_{1:j-1})$. The answer $a_j$ to the $j$-th question $q_j$ then divides the set into two possible subsets with sizes $N_j^1(a_{1:j-1}) = N_j(a_j = 1, a_{1:j-1})$ and $N_j^0(a_{1:j-1}) = N_{j-1}(a_{1:j-1}) - N_j(a_j = 1, a_{1:j-1})$. Assuming a fixed set of questions, the expected uncertainty experienced by the player after the $j$-th question can be quantified by the conditional entropy:
$$H(y^* \mid a_{1:j}) = -\mathbb{E}_{y^*, a_{1:j}}\!\left[\log_2 p(y^* \mid a_{1:j})\right] = \mathbb{E}_{a_{1:j}}\!\left[\log_2 N_j(a_{1:j})\right]$$
where y * is sampled uniformly from Ω . Under these conditions, the expected entropy reduction associated to a given question is given by
$$\Delta H_j = \mathbb{E}_{a_{1:j}}\!\left[\log_2 N_{j-1} - \left(\frac{N_j^0}{N_{j-1}}\log_2 N_j^0 + \frac{N_j^1}{N_{j-1}}\log_2 N_j^1\right)\right],$$
where we left the dependence on the set of answers implicit to unclutter the notation. It is easy to see that the maximum bit rate is 1, which is achieved when $N_j^0 = N_j^1 = N_{j-1}/2$. Assuming that 20 questions are enough to fully identify the value of $y^*$, we can encode each $y$ in the string of binary values $a_{1:20}$, which makes clear that the question answering process consists of gradually filling in this string. Using the language of generative diffusion, we can re-frame this process in terms of a ’forward’ process, where the string $a_{1:20}$ corresponding to an element of $\Omega$ is sampled in advance and then transmitted to the $j$-th ’time point’ through the following non-injective forward process
$$R_j(a_{1:20}) = a_{1:j},$$
which deterministically suppresses information by masking the values of the string. The solution of a Twenty Questions game can then be seen as inverting this ’forward process’. Note that the forward process leads to a sequence of monotonically non-decreasing marginal conditional entropies, $H(y^* \mid a_{1:j}) \le H(y^* \mid a_{1:j-1})$, which is a fundamental feature of a forward process in diffusion models that captures the fact that information is lost by the forward transformation.
Now consider the case where a lazy oracle forgot to select a word in advance and decides instead to answer the questions at random under the probability determined by the sizes $N_j^0$ and $N_j^1$, which we assume to be fixed given the questions. Strikingly, this reformulation does not make any observable difference from the point of view of the interrogator, as each (randomly sampled) answer equally reduces the space of possible words and results in the same entropy reduction, until a final guess can be offered. Therefore, the game of Twenty Questions with a random oracle can be interpreted as a sequential generative process where the state at ’time’ $j$ is given by a binary string $a_{1:j}$ with Markov transition probabilities
$$p(a_{j+1} = 0 \mid a_{1:j}) = \frac{N_{j+1}^0(a_{1:j})}{N_j(a_{1:j})}$$
The conditional entropy rate $\Delta H_j$ determines how much information is transferred from ’time’ $j$ to the final generation.
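The random-oracle game can be simulated directly. The sketch below (our own illustration, not part of the original analysis; all names and the questioning strategy are ours) plays the game with a simple balanced splitting rule, tracks the conditional entropy $\log_2 N_j$ after every answer, and recovers an information transfer of roughly one bit per balanced question.

```python
import numpy as np

rng = np.random.default_rng(0)

N0 = 1024                      # size of the initial set Omega
candidates = np.arange(N0)     # elements still compatible with the answers so far

entropies = [np.log2(len(candidates))]
for j in range(20):
    # A (nearly) balanced binary question: "is y* in the first half of the remaining candidates?"
    subset = candidates[: len(candidates) // 2]
    # Lazy oracle: answer at random with probability proportional to the subset sizes
    p_yes = len(subset) / len(candidates)
    answer_yes = rng.random() < p_yes
    candidates = subset if answer_yes else np.setdiff1d(candidates, subset)
    entropies.append(np.log2(len(candidates)))

rates = -np.diff(entropies)    # entropy reduction Delta H_j per question, in bits
print(rates)                   # roughly one bit per question until the element is identified
```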
As we shall see, the reverse diffusion process can be seen as analogous to this ’generative game’, with the score function playing the part of the interrogator and the noise $\epsilon_t$ playing the role of the oracle. Like the interrogator in the generative Twenty Questions game, the score function can reduce the information transfer by tilting the probabilities of the stochastic increments away from uniformity, which reduces the impact of the noise. This phenomenon is related to the divergence of the vector field induced by the score function, which causes amplification of small perturbations during the generative dynamics. We will also see that the phenomenon is connected to the branching of paths of fixed-points of the score and consequently to the phenomenon of generative phase transitions and spontaneous symmetry breaking [11].

2.1. Score-Matching Generative Diffusion Models

The sequential generation example outlined above is analogous to the masked diffusion models [33,34]. On the other hand, score-matching generative diffusion models are continuous-time sequential generative models where the forward process is given by a diffusion process such as
$$\mathrm{d}x_t = \nu(t)\,\mathrm{d}W_t,$$
which is initialized with the data source p ( x 0 ) = ρ ( y ) . Generation in score-matching diffusion consists of integrating the ’reverse equation’ [2]:
$$\mathrm{d}x_t = \nu^2(t)\,\nabla\log p_t(x_t)\,\mathrm{d}t + \nu(t)\,\mathrm{d}W_t,$$
where, for notational simplicity, we restrict our attention to the forward process given in Eq. 5. Note that Eq. 6 must be integrated backward with initial condition determined by the stationary distribution of the forward process.
The fundamental mathematical object that determines the reverse dynamics is the score function, which in this case can be expressed as $\nabla\log p_t(x_t) = \mathbb{E}_{y \mid x_t}\!\left[\frac{y - x_t}{\sigma^2(t)}\right]$, where $\sigma^2(t) = \int_0^t \nu^2(\tau)\,\mathrm{d}\tau$ is the total variance of the noise at time $t$ and the expectation is taken with respect to the conditional distribution $p(y \mid x_t) \propto p(x_t \mid y)\,\rho(y)$. This expression can be further simplified by noticing that $x_t = y + \sigma(t) z_t$:
$$\nabla\log p_t(x_t) = -\frac{\mathbb{E}\!\left[z_t \mid x_t\right]}{\sigma(t)},$$
where $z_t$ is a standard normal vector. In other words, the score is the negative of the average (rescaled) noise and it therefore provides the optimal (infinitesimal) denoising direction.
In dynamical terms, the score function determines the vector field (i.e. the drift) that guides the generative paths towards the distribution of the data.
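For a finite dataset the posterior average defining the score can be evaluated in closed form. The snippet below is a minimal sketch of this computation (the helper names and the variance-exploding schedule $\sigma(t) = t$ are our own illustrative choices, not prescriptions from the paper): the score is the displacement from $x_t$ to the posterior mean $\mu(x_t)$, rescaled by $\sigma^2(t)$.

```python
import numpy as np

def score(x, t, data, sigma):
    """Exact score of the Gaussian-smoothed empirical distribution at time t.

    data: (N, d) array of datapoints y_i; sigma: callable returning sigma(t).
    Returns (mu(x) - x) / sigma(t)**2, where mu(x) = E[y | x_t = x] is the posterior mean.
    """
    s2 = sigma(t) ** 2
    # Posterior weights over datapoints, proportional to exp(-||x - y_i||^2 / (2 sigma^2))
    logw = -np.sum((x[None, :] - data) ** 2, axis=1) / (2.0 * s2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    mu = w @ data
    return (mu - x) / s2

# Illustrative usage: four points in the plane and a variance-exploding schedule sigma(t) = t
data = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
print(score(np.array([0.3, 0.1]), t=0.8, data=data, sigma=lambda t: t))
```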
In practice, a normalized score network $s(x_t; \theta)$ should be trained to minimize the rescaled score-matching loss:
$$\mathcal{L}_{\mathrm{sm}}(\theta, t) = \mathbb{E}_{x_t}\!\left[\left\|\sigma(t)\,\nabla\log p_t(x_t) - s(x_t;\theta)\right\|^2\right]$$
This loss function cannot be computed directly because the true score is not available. However, Eq. 8 can be re-written using Eq. 7 and expanding the square:
$$\mathcal{L}_{\mathrm{sm}}(\theta, t) = \mathbb{E}_{x_t}\!\left[\left\|\mathbb{E}\!\left[z_t \mid x_t\right] + s(x_t;\theta)\right\|^2\right] = \mathbb{E}_{z_t, y}\!\left[\left\|z_t + s(y + \sigma(t) z_t;\theta)\right\|^2\right] - \mathbb{E}_{z_t, y}\!\left[\left\|z_t + \sigma(t)\,\nabla\log p_t(y + \sigma(t) z_t)\right\|^2\right].$$
Note that the second term is constant in θ , which means that the gradient solely depends on the denoising loss:
$$\mathcal{L}_{\mathrm{d}}(\theta, t) = \mathbb{E}_{z_t, y}\!\left[\left\|z_t + s(y + \sigma(t) z_t;\theta)\right\|^2\right].$$
The constant term
$$C_t = \mathbb{E}_{z_t, y}\!\left[\left\|z_t + \sigma(t)\,\nabla\log p_t(y + \sigma(t) z_t)\right\|^2\right]$$
is of high importance for our current purposes. It quantifies the loss of the denoiser obtained from the score function. This is therefore the unavoidable part of the denoising error that is still present given a perfectly trained network. With a few manipulations, it is possible to show that this term is in fact equal to the variance of the posterior denoising distribution:
$$C_t = \mathbb{E}_{y, x_t}\!\left[\mathrm{var}(y \mid x_t)\right],$$
which allows us to interpret this term as a measure of uncertainty at time t on the final outcome of the generative trajectory.
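The irreducible term $C_t$ can be estimated by Monte Carlo once the exact score of a toy dataset is available, since the optimal noise predictor is $s^*(x_t) = \sigma(t)\nabla\log p_t(x_t) = -\mathbb{E}[z_t \mid x_t]$. The sketch below is our own illustration (it assumes the same four-point dataset and $\sigma(t)=t$ schedule used in the previous snippet; all names are ours) and averages the residual noise-prediction error over forward samples.

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])

def optimal_noise_predictor(x, s2):
    # s*(x) = sigma * grad log p_t(x) = -E[z_t | x_t = x]
    logw = -np.sum((x[None, :] - data) ** 2, axis=1) / (2.0 * s2)
    w = np.exp(logw - logw.max()); w /= w.sum()
    return (w @ data - x) / np.sqrt(s2)

def irreducible_error(t, n=4000):
    """Monte-Carlo estimate of C_t = E || z_t + sigma(t) grad log p_t(x_t) ||^2 for sigma(t) = t."""
    s2 = t ** 2
    y = data[rng.integers(len(data), size=n)]
    z = rng.standard_normal(y.shape)
    x = y + np.sqrt(s2) * z
    residual = np.array([z[i] + optimal_noise_predictor(x[i], s2) for i in range(n)])
    return np.mean(np.sum(residual ** 2, axis=1))

print([round(irreducible_error(t), 3) for t in (0.2, 1.0, 3.0)])
# The irreducible error vanishes at low noise and is largest at intermediate noise levels.
```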

3. Generative Information Transfer in Score Matching Diffusion

To characterize the generative information transfer we need to compute the conditional entropy rate $\dot{H}[y \mid x_t]$, which is the analogue of the discrete entropy reduction we gave in Eq. 2. The conditional entropy is defined as
$$H[y \mid x_t] = -\mathbb{E}_{y, x_t}\!\left[\log p(y \mid x_t)\right]$$
To find the entropy rate, we can take the temporal derivative of Eq. 13 and use the Fokker-Planck equation, which in our case is just the heat equation:
$$\partial_t p_t(x_t) = \frac{1}{2}\nu^2(t)\,\nabla^2 p_t(x_t).$$
Using integration by parts, this results in
$$\dot{H}[y \mid x_t] = \frac{\nu^2(t)}{2}\left(\mathbb{E}_{p(x_t, x_0)}\!\left[\left\|\nabla\log p(x_t \mid x_0)\right\|^2\right] - \mathbb{E}_{p_t(x_t)}\!\left[\left\|\nabla\log p_t(x_t)\right\|^2\right]\right) = \frac{\nu^2(t)}{2}\left(\frac{D}{\sigma^2(t)} - \mathbb{E}_{p_t(x_t)}\!\left[\left\|\nabla\log p_t(x_t)\right\|^2\right]\right),$$
where D is the dimensionality of the ambient space. From this formula, we can see that the maximal bandwidth is reached when the Euclidean norm of the score function is minimized.
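This formula can be checked numerically for a toy dataset where the score is known exactly. The sketch below is our own illustration (it assumes the variance-exploding schedule $\sigma(t)=t$, for which $\nu^2(t) = \mathrm{d}\sigma^2/\mathrm{d}t = 2t$): it estimates $\dot{H}[y \mid x_t]$ from the expected squared score norm and shows that the rate peaks at intermediate noise levels, where several datapoints compete for the posterior.

```python
import numpy as np

rng = np.random.default_rng(2)
data = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
D = data.shape[1]

def score(x, s2):
    logw = -np.sum((x[None, :] - data) ** 2, axis=1) / (2.0 * s2)
    w = np.exp(logw - logw.max()); w /= w.sum()
    return (w @ data - x) / s2

def entropy_rate(t, n=5000):
    """H_dot[y | x_t] = (nu^2 / 2) (D / sigma^2 - E ||score||^2), with sigma(t) = t, nu^2(t) = 2t."""
    s2, nu2 = t ** 2, 2.0 * t
    y = data[rng.integers(len(data), size=n)]
    x = y + np.sqrt(s2) * rng.standard_normal((n, D))
    mean_sq_norm = np.mean([np.sum(score(xi, s2) ** 2) for xi in x])
    return 0.5 * nu2 * (D / s2 - mean_sq_norm)

for t in (0.3, 0.7, 1.0, 2.0, 4.0):
    print(t, round(entropy_rate(t), 4))
# The rate is small at low noise, peaks where the posterior mixes several datapoints,
# and decays again at large noise levels.
```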

3.1. Score Norm and Posterior Concentration

To gain some insight into the significance of the squared norm and the expression for the conditional entropy we will consider the following case. We assume a discrete data distribution $p_0(y) = \frac{1}{N}\sum_{i=1}^N \delta(y - y_i)$ with empirical mean equal to zero.
At time t, the marginal density is given by a Gaussian smoothing of the data,
$$p_t(x) = \frac{1}{N}\sum_{i=1}^N \varphi_{\sigma(t)}(x - y_i),$$
where $\varphi_{\sigma(t)}$ denotes an isotropic Gaussian with variance $\sigma^2(t)$. The posterior distribution over datapoints is
$$p(y_i \mid x_t) = \frac{\varphi_{\sigma(t)}(x_t - y_i)}{\sum_{k=1}^N \varphi_{\sigma(t)}(x_t - y_k)}.$$
The score function can then be written as the posterior average
$$\nabla\log p_t(x_t) = \mathbb{E}_{y \mid x_t}\!\left[\frac{y - x_t}{\sigma^2(t)}\right] = \frac{1}{\sigma^2(t)}\left(\mu(x_t) - x_t\right), \qquad \mu(x_t) := \mathbb{E}\!\left[y \mid x_t\right].$$
We now assume that the data vectors satisfy
$$y_i \cdot y_j \approx 0 \quad (i \neq j), \qquad \|y_i\|^2 \approx R^2,$$
i.e. datapoints are approximately orthogonal and lie at a common distance R from the mean. Under this assumption, the squared norm of the posterior mean simplifies to
$$\|\mu(x_t)\|^2 = \Big\|\sum_{i=1}^N p(y_i \mid x_t)\, y_i\Big\|^2 \approx R^2 \sum_{i=1}^N p(y_i \mid x_t)^2.$$
Taking expectations with respect to p t ( x t ) , we obtain
$$\mathbb{E}_{x_t}\!\left[\left\|\nabla\log p_t(x_t)\right\|^2\right] = \frac{1}{\sigma^4(t)}\left(\mathbb{E}_{x_t}\!\left[\|\mu(x_t)\|^2\right] - 2\,\mathbb{E}_{x_t}\!\left[x_t \cdot \mu(x_t)\right] + \mathbb{E}_{x_t}\!\left[\|x_t\|^2\right]\right).$$
The first term captures the data-dependent structure of the score and, using Eq. (20), can be written as
$$\mathbb{E}_{x_t}\!\left[\|\mu(x_t)\|^2\right] \approx R^2\, \mathbb{E}_{x_t}\!\left[\sum_{i=1}^N p(y_i \mid x_t)^2\right].$$
The quantity $\sum_i p(y_i \mid x_t)^2$ measures the concentration of the posterior over datapoints. It satisfies $1/N \le \sum_i p(y_i \mid x_t)^2 \le 1$, interpolating between a fully diffuse posterior and complete concentration on a single datapoint.
The remaining two terms in Eq. (21) can be estimated explicitly under the forward model $x_t = y + \sigma(t) z$, where $z \sim \mathcal{N}(0, I)$ is independent of $y$. We have
$$\mathbb{E}_{x_t}\!\left[x_t \cdot \mu(x_t)\right] = \mathbb{E}_{x_t}\!\left[x_t \cdot \mathbb{E}[y \mid x_t]\right] = \mathbb{E}_{x_t, y}\!\left[x_t \cdot y\right] = \mathbb{E}\!\left[\|y\|^2\right] \approx R^2,$$
$$\mathbb{E}_{x_t}\!\left[\|x_t\|^2\right] = \mathbb{E}\!\left[\|y\|^2\right] + \sigma^2(t)\,\mathbb{E}\!\left[\|z\|^2\right] \approx R^2 + D\,\sigma^2(t),$$
where D denotes the ambient dimensionality.
Substituting Eqs. (22), (23), and (24) into Eq. (21) yields
$$\mathbb{E}_{x_t}\!\left[\left\|\nabla\log p_t(x_t)\right\|^2\right] \approx \frac{R^2}{\sigma^4(t)}\left(\mathbb{E}_{x_t}\!\left[\sum_{i=1}^N p(y_i \mid x_t)^2\right] - 1\right) + \frac{D}{\sigma^2(t)}.$$
The second term coincides with the expected squared norm of the score of the forward Gaussian kernel and therefore represents a data-independent baseline contribution. The first term encodes the deviation from pure diffusion induced by the structure of the dataset and depends solely on the posterior distribution over datapoints.
Using the bound $1/N \le \sum_{i=1}^N p(y_i \mid x_t)^2 \le 1$, we obtain the inequality
$$-\frac{(N-1)\,R^2}{N\,\sigma^4(t)} \le \frac{R^2}{\sigma^4(t)}\left(\mathbb{E}_{x_t}\!\left[\sum_{i=1}^N p(y_i \mid x_t)^2\right] - 1\right) \le 0.$$
As a consequence, the expected squared norm of the score is always bounded above by the forward kernel contribution, ensuring that the marginal entropy remains a monotonically increasing function of time.
Further insight can be gained by rewriting the purity term as
$$\mathbb{E}_{x_t}\!\left[\sum_{i=1}^N p(y_i \mid x_t)^2\right] = \frac{1}{N}\sum_{i=1}^N \int p(y_i \mid x_t)\, p(x_t \mid y_i)\,\mathrm{d}x_t.$$
This expression makes explicit that the deviation from the diffusion baseline is controlled by the overlap of the forward kernels. If, at time t, the datapoints have effectively merged into m indistinguishable groups (with identical posteriors), the purity evaluates to m / N , yielding
$$\mathbb{E}_{x_t}\!\left[\sum_{i=1}^N p(y_i \mid x_t)^2\right] - 1 = \frac{m - N}{N}.$$
Therefore, increasing mixing among datapoints (smaller m) makes the data-dependent term more negative, reducing the expected score norm and increasing the conditional entropy rate.
This result allows us to interpret the magnitude of the score vector as a quantitative estimate of uncertainty in the denoising process: when multiple datapoints are compatible with the noisy state x t , posterior averaging suppresses the score, leading to enhanced entropy production (equation 15). As we shall see in the rest of the paper, we can associate peaks in the entropy rates with symmetry-breaking bifurcations that correspond to noise-induced ’choices’ between possible data points.
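The posterior "purity" $\mathbb{E}_{x_t}[\sum_i p(y_i \mid x_t)^2]$ can also be estimated directly by Monte Carlo. In the sketch below (our own illustration, with the same toy dataset and $\sigma(t)=t$ schedule as before), the purity interpolates from values near 1 at small noise, through intermediate values corresponding to partially merged groups of datapoints, down to $1/N$ when all datapoints are mixed.

```python
import numpy as np

rng = np.random.default_rng(3)
data = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])   # N = 4

def posterior(x, s2):
    logw = -np.sum((x[None, :] - data) ** 2, axis=1) / (2.0 * s2)
    w = np.exp(logw - logw.max())
    return w / w.sum()

def purity(t, n=5000):
    """Monte-Carlo estimate of E_{x_t}[ sum_i p(y_i | x_t)^2 ] for sigma(t) = t."""
    s2 = t ** 2
    y = data[rng.integers(len(data), size=n)]
    x = y + np.sqrt(s2) * rng.standard_normal(y.shape)
    return float(np.mean([np.sum(posterior(xi, s2) ** 2) for xi in x]))

for t in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(t, round(purity(t), 3))
# Drifts from ~1 (posterior concentrated on one datapoint) towards 1/N = 0.25 (fully mixed).
```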

3.2. Conditional Entropy Production as Optimal Error

The conditional entropy rate quantifies the instantaneous generative information transfer at any given moment in time. It can be shown (see [8]) that this quantity is closely connected to the optimal denoising squared error, which is the variance of the denoising distribution:
$$\dot{H}[y \mid x_t] = \frac{1}{2}\,\frac{\nu^2(t)}{\sigma^4(t)}\,\mathbb{E}_{y, x_t}\!\left[\mathrm{var}(y \mid x_t)\right].$$
Intuitively, this means that the information rate is directly related to the denoising uncertainty at a given time.
Using this relation, we can now re-express the denoising score matching formula in Eq. 9 in terms of the conditional entropy rate:
$$\mathbb{E}_{x_t}\!\left[\left\|\mathbb{E}\!\left[z_t \mid x_t\right] + s(x_t;\theta)\right\|^2\right] + \frac{2\,\sigma^4(t)}{\nu^2(t)}\,\dot{H}[y \mid x_t] = \mathbb{E}_{z_t, y}\!\left[\left\|z_t + s(y + \sigma(t) z_t;\theta)\right\|^2\right],$$
which implies that the entropy rate can be estimated from the training loss if we assume that the network is well-trained.

3.3. Generative Bandwidth

It is insightful to investigate under what circumstances the score-matching diffusion model can achieve the maximum possible generative bandwidth. From equation 15, it is clear that this happens when $\mathbb{E}\!\left[\left\|\nabla\log p_t(x_t)\right\|^2\right] = 0$, which in turn is obtained if the score vanishes almost everywhere.
To realize this situation, we can consider a data distribution $p_h(y)$ given by a centered multivariate normal with variance $h^2$. In this case, the score function is just:
$$\nabla\log p_t(x_t) = -\frac{x_t}{\sigma^2(t) + h^2},$$
which vanishes everywhere in the limit $h \to \infty$, giving the maximum entropy rate:
$$\dot{H}[x_0 \mid x_t] = \frac{1}{2}\,\frac{D\,\nu^2(t)}{\sigma^2(t)}.$$
This corresponds to a setting where the particles are free to diffuse, since every possible generation is equally likely. From this, we can conclude that the score function has the negative role of suppressing fluctuations along ’unwanted directions’ to preserve the statistics of the data, and that peaks in the information transfer come from periods where noise fluctuations are not suppressed. Note that the maximum bandwidth scales with the dimensionality $D$.
Now consider the case where the distribution of the data is a centered Gaussian in a $D_{\text{data}}$-dimensional subspace with $D_{\text{data}} \ll D$. In this case, the expected norm of the score decomposes as follows
$$\mathbb{E}\!\left[\left\|\nabla\log p_t(x_t)\right\|^2\right] = \frac{D_{\text{data}}}{\sigma^2(t) + h^2} + \frac{D - D_{\text{data}}}{\sigma^2(t)} \;\approx\; \frac{D - D_{\text{data}}}{\sigma^2(t)} \quad (h \to \infty),$$
which leads to the entropy rate
$$\dot{H}[x_0 \mid x_t] = \frac{1}{2}\,\frac{D_{\text{data}}\,\nu^2(t)}{\sigma^2(t)}.$$
In this case, the score function suppresses entropy production in the subspace orthogonal to the data and therefore acts as a linear analog filter. Note that the entropy rate is zero when $D_{\text{data}}$ is equal to zero, since the whole distribution is then collapsed onto a single point and no ’decision’ needs to be made.
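The subspace example can be verified with a couple of lines. The sketch below is our own check (it again assumes $\sigma(t)=t$, hence $\nu^2(t)=2t$): it plugs the decomposed score norm into the entropy-rate formula and compares it with the asymptotic bandwidth $\frac{1}{2} D_{\text{data}}\, \nu^2(t)/\sigma^2(t)$ reached for $h \gg \sigma(t)$.

```python
import numpy as np

def subspace_entropy_rate(t, D, D_data, h):
    """H_dot for centered Gaussian data of variance h^2 on a D_data-dimensional subspace,
    using the decomposed score norm and the schedule sigma(t) = t, nu^2(t) = 2t."""
    s2, nu2 = t ** 2, 2.0 * t
    expected_sq_norm = D_data / (s2 + h ** 2) + (D - D_data) / s2
    return 0.5 * nu2 * (D / s2 - expected_sq_norm)

# For h >> sigma(t) the rate approaches (1/2) * D_data * nu^2(t) / sigma^2(t):
t, D, D_data, h = 1.0, 100, 10, 50.0
print(subspace_entropy_rate(t, D, D_data, h), 0.5 * D_data * (2.0 * t) / t ** 2)
```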

4. Statistical Physics, Order Parameters and Phase Transitions

In this section, we will connect the information-theoretic concepts we outlined above with concepts from statistical physics such as order parameters, phase transitions and spontaneous symmetry breaking. We will start by studying the paths of fixed-points of the score function and use them to track ’generative decisions’ (i.e. bifurcations) along the denoising trajectories. As we will see, the stability of these fixed-point paths is regulated by the Jacobian of the score and is deeply connected with the conditional entropy production.

4.1. Branching Paths of Fixed-Points and Spontaneous Symmetry Breaking

The fixed-points of the score function are defined by the equation:
$$\nabla\log p_t(x_t^*) = 0.$$
We denote the set of fixed-points at time $t$ as $\Psi_t$. The solutions of this fixed-point equation can be organized in a set $\Omega$ of piecewise continuous paths $\gamma : \mathbb{R}_+ \to \mathbb{R}^d$. To remove ambiguities, we assume that, if $\gamma(\tau)$ is discontinuous at $\tau_0$, then the one-sided limit exists and $\gamma(\tau_0)$ is equal to $\lim_{t \to \tau_0^+}\gamma(t) = \arg\min_{x \in \Psi_{\tau_0}} \lim_{t \to \tau_0^+}\left\|x - \gamma(t)\right\|$. We know that $\lim_{t \to \infty}\gamma(t) = 0$ for all paths, since the zero vector is the only fixed-point of the score of the asymptotic Gaussian distribution. Any two paths $\gamma_1(t)$ and $\gamma_2(t)$ can be proven to overlap beyond a finite time, meaning that $\gamma_1(t) = \gamma_2(t)$ for $t \ge \tau_{1,2}$ with $\tau_{1,2} \in \mathbb{R}_+$ (this follows from the results in [35,36,37] on the number of modes of mixtures of normal distributions). We refer to $\tau_{1,2}$ as the branching time of the two paths. The branching time of two paths of fixed points can roughly be interpreted as a decision time in the generative process, where the sample will be ’pushed’ by the noise into either one or the other path during the reverse dynamics. It is therefore insightful to study the behavior of the paths at the branching times. In general, this can happen through a discontinuous jump in a path $\gamma(t)$. Perhaps more interestingly, two paths can also branch continuously at a finite time. This can be studied by analyzing the Jacobian matrix of the score function:
$$J_t(x_t^*) = \nabla\nabla^{T}\log p_t(x_t^*).$$
We call a path point $\gamma(t)$ stable at time $t$ if $J_t(\gamma(t))$ is negative-definite. We say that the path is stable if this is true for all $t \in \mathbb{R}_+$ except for a countable set of time points $t_j$ where the Jacobian is negative semi-definite. Now consider two stable paths $\gamma_1(t)$ and $\gamma_2(t)$ that branch continuously at time $\tau_{1,2}$. Given the asymptotic separation vector
$$v_{1,2} = \lim_{t \to \tau_{1,2}^-} \frac{\gamma_2(t) - \gamma_1(t)}{\left\|\gamma_2(t) - \gamma_1(t)\right\|},$$
it can be shown that $v^T J_t(\gamma_1(t))\, v < 0$ in a finite interval $(\tau_{1,2}, \tau_{1,2} + \epsilon)$ and that
$$\lim_{t \to \tau_{1,2}^+} v^T J_t(\gamma_1(t))\, v = 0,$$
which implies that the second directional derivative $D_v^2 \log p_t(x_t)$ along $v$ vanishes at the branching point.
Consider now a generative diffusion with an initial distribution given as
$$p_0(y) = \frac{1}{K}\sum_{j=1}^K \delta(y - y^{(j)}),$$
with $K$ distinct data-points $y^{(j)} \in \mathbb{R}^d$. In this case, there are exactly $K$ distinct stable fixed-point paths $\gamma_j(t)$, with $\gamma_j(0) = y^{(j)}$. Again, any two paths branch at a finite time $\tau_{j,k}$. For a given $t$, we can partition the set of data-points into equivalence classes, where two data-points $y^{(j)}$ and $y^{(k)}$ share the same class if their associated paths coincide at $t$. Importantly, each equivalence class corresponds to an individual fixed-point, which allows us to associate each fixed-point $x^* \in \Psi_t$ with a subset of data-points that are, using colorful language, fused together. More precisely, we can express the fixed-points as weighted averages of data-points obtained by solving the self-consistency equation:
$$x^* = \sum_{j=1}^K w_j(x^*)\, y^{(j)},$$
where
$$w_j(x) = \frac{\exp\!\left[\left(-\|y^{(j)}\|^2/2 + x^T y^{(j)}\right)/\sigma^2(t)\right]}{\sum_{k=1}^K \exp\!\left[\left(-\|y^{(k)}\|^2/2 + x^T y^{(k)}\right)/\sigma^2(t)\right]}.$$
Note that this average has non-zero weight on all data points, which is why we cannot find the location of the fixed-point solely based on its equivalence class. However, usually the weights corresponding to data-points in the equivalence class will be substantially larger than the other weights and will therefore dominate the average. In summary, we can interpret the set of fixed-points as a decision tree where each branching point roughly coincides with a split between two sets of data points.
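The self-consistency equation can be solved numerically by simple fixed-point iteration, which makes the branching tree of fixed-point paths easy to trace. The sketch below is our own illustration (a hypothetical four-point dataset; names and the schedule are ours): it warm-starts each path at a datapoint and slowly increases $\sigma(t)$, so that the paths progressively merge and eventually collapse onto the origin.

```python
import numpy as np

data = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])   # hypothetical K = 4 dataset

def weights(x, s2):
    # w_j(x) proportional to exp[ (-||y_j||^2 / 2 + x . y_j) / sigma^2 ]
    logw = (-0.5 * np.sum(data ** 2, axis=1) + data @ x) / s2
    w = np.exp(logw - logw.max())
    return w / w.sum()

def fixed_point(x0, s2, n_iter=500):
    """Iterate the self-consistency equation x = sum_j w_j(x) y_j until (approximate) convergence."""
    x = x0.copy()
    for _ in range(n_iter):
        x = weights(x, s2) @ data
    return x

# Trace the fixed-point paths: start each path at a datapoint and slowly raise sigma(t),
# warm-starting each solve at the previous solution so that the stable branch is followed.
paths = {j: [data[j].copy()] for j in range(len(data))}
for sigma in np.linspace(0.05, 3.0, 60):
    for j in range(len(data)):
        paths[j].append(fixed_point(paths[j][-1], sigma ** 2))

print(np.round(paths[0][1], 3), np.round(paths[0][-1], 3))   # starts near y_0, ends near the origin
```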
An example of spontaneous symmetry breaking happens when the generative path needs to ’decide’ between two isolated data-points. Consider again the mixture of deltas case and two neighboring data-points $y_1 = v$ and $y_2 = -v$. If the distance between the center of mass of these two points and the nearest external data-point is much larger than $\sigma(t)$, there will be a fixed point approximately located along the line segment connecting the two points. In these conditions, we can consider the fixed-point equation restricted to the projection on $v$:
$$x^*_v = \tanh\!\left(\frac{x^*_v + \phi(x^*_v, t)}{\sigma^2(t)}\right),$$
where $\phi(x^*_v, t)$ encapsulates the interference due to all other data-points, which, in this example, we assume to be small relative to the norm of the separation vector:
$$\phi(x^*_v, t) = \frac{\sigma^2(t)}{2}\,\log\!\left(e^{\,x^*_v/\sigma^2(t)} + \sum_{k \neq 1,2}^{K} y^{(k)}_v\, e^{\left(-\|y^{(k)}\|^2/2 + x_v\, y^{(k)}_v\right)/\sigma^2(t)}\right) - x^*_v.$$
If we approximate the interference function with a constant $\phi$ using a zeroth-order Taylor expansion, Eq. 40 becomes the self-consistency equation of a Curie-Weiss model of magnetism, with temperature $T = \sigma^2(t)$ and external magnetic field $\phi$. The solutions of this equation can be visualized as intersection points between a straight line and a hyperbolic tangent (see [12] and [11] for a detailed analysis). When $\phi$ is finite, the system transitions discontinuously from one to two fixed-points, which corresponds to a first-order phase transition in the magnetic system. However, the size of the discontinuity vanishes when $\phi = 0$, when there is an exact symmetry between the two data-points (see Figure 1). This gives rise to a so-called critical phase transition, where a single fixed-point at $x^* = 0$ continuously splits into two paths $x_1(t)$ and $x_2(t)$ with $x_{1,2}(t) \sim \pm(t_c - t)^{1/2}$ as $t \to t_c^-$. The loss of stability of the fixed-point at the origin corresponds to the vanishing of the quadratic well around the point:
$$\frac{\partial^2}{\partial x_v^2}\log p_{t_c}(x^*_{t_c}) = 0,$$
where, in this case, $x^*_t = 0$ for $t < t_c$. The analysis we just carried out involves the breaking of the permutation symmetry between two isolated data-points. On the other hand, if the symmetry is broken along all directions, as in the case where the data manifold is a sphere centered at $x_t^*$, Eq. 41 implies that
$$\mathrm{Tr}\!\left[\nabla\nabla^T \log p_{t_c}(x^*_{t_c})\right] = \nabla\cdot\nabla\log p_{t_c}(x^*_{t_c}) = 0.$$
Therefore, the change in the stability condition can be reformulated as the local vanishing (or suppression, in a less symmetric case) of the divergence of the vector field that drives the generative dynamics. The transition from the super-critical ($t > t_c$) to the sub-critical ($t < t_c$) phase then corresponds to a sign change in the divergence of the vector field (i.e. the score) in the spherically-symmetric case, or a sign change of the divergence restricted to a sub-space in the general case, with the sub-critical regime being characterized by positive eigenvalues of the Jacobi matrix that lead to divergent local trajectories (see Figure 2).
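The Curie-Weiss reduction can be explored numerically by locating the solutions of the scalar self-consistency equation on a grid. In the sketch below (our own illustration, assuming $\phi = 0$ and $\|v\| = 1$), the single fixed-point at the origin splits into three solutions as $\sigma^2(t)$ crosses the critical value, reproducing the pitchfork structure of the symmetric phase transition.

```python
import numpy as np

def curie_weiss_fixed_points(s2, phi=0.0, n_grid=20000):
    """Locate the solutions of x = tanh((x + phi) / sigma^2) by detecting sign changes on a grid."""
    x = np.linspace(-1.5, 1.5, n_grid)
    f = x - np.tanh((x + phi) / s2)
    crossings = np.where(np.diff(np.sign(f)) != 0)[0]
    return x[crossings]

for s2 in (1.5, 1.0, 0.8, 0.4):
    print(s2, np.round(curie_weiss_fixed_points(s2), 3))
# With phi = 0, the single solution x* = 0 splits into three (two stable, one unstable)
# as the effective temperature sigma^2(t) drops below the critical value sigma^2 = 1.
```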

5. Dynamics of the Generative Trajectories

Around a point $x^*$, the local behavior of the generative trajectories under the deterministic ODE flow dynamics can be characterized by its Lyapunov exponent, which quantifies the separation rate of infinitesimally close trajectories. In particular, the minimal local exponent for a perturbation along a unit vector $w$ can, in the reverse dynamics, be defined as
$$l_w(t, x) = \lim_{\tau \to \infty}\lim_{\|w\| \to 0} \frac{1}{\tau}\log\frac{\left\|x_{t-\tau}(x_t + w) - x_{t-\tau}(x_t)\right\|}{\|w\|},$$
where $x_{t-\tau}(x_t + w)$ denotes a deterministic generative trajectory with the perturbed initial condition. Note that in reality $\tau$ cannot tend to infinity in the generative dynamics, since time is only defined up to 0. However, we will still consider this limit, since we are only interested in the local asymptotic behavior of the linearized dynamics around a bifurcation point. Under the reverse dynamics, when the Lyapunov exponent along $w$ is positive, infinitesimal perturbations are amplified exponentially (at least locally) by the generative dynamics.
We can use the notion of minimal local Lyapunov exponent to formalize the phenomenon of local divergence of trajectories after a spontaneous symmetry-breaking event at $t_c$. To study the local sub-critical behavior, we consider the linearization of the dynamics around the unstable fixed point for $t < t_c$ and $t_c - t = \epsilon$:
$$l_w(t_c - \epsilon, x) = \lim_{\tau \to \infty}\lim_{\|w\| \to 0} \frac{1}{\tau}\log\frac{\left\|e^{\tau J_{t_c - \epsilon}(x^*_{t_c - \epsilon})}\, w\right\|}{\|w\|} = \lambda_{\min}(x^*_{t_c - \epsilon}, t_c - \epsilon),$$
where $\lambda_{\min}(x^*_{t_c - \epsilon}, t_c - \epsilon)$ is the smallest of the eigenvalues of the Jacobi matrix whose eigenvectors overlap with $w$. In the immediate sub-critical phase of a symmetry-breaking phase transition, we know that there is a non-empty sub-space spanned by the eigenvectors of the Jacobian corresponding to positive eigenvalues. Therefore, perturbations along this unstable eigen-space will be exponentially amplified by the generative dynamics. In the stochastic case, this can be seen as a critical ’macroscopic amplification’ of the infinitesimal noise input, where the noise breaks the symmetry of the generative model. In the deterministic dynamics, the symmetry is instead broken by the amplification of small differences between the generative trajectories.
In general, we will refer to the spectrum of Jacobian eigenvalues $\lambda_j(x_t, t)$ as the local Lyapunov spectrum. As we shall see, this spectrum can be directly related to the conditional entropy production.
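For a discrete dataset the Jacobian of the score has the closed form $J_t(x) = \mathrm{Cov}(y \mid x_t = x)/\sigma^4(t) - I/\sigma^2(t)$, which follows from differentiating the posterior-mean expression of the score. The sketch below is our own illustration (two hypothetical datapoints; all names are ours): it evaluates the local Lyapunov spectrum at the midpoint between the two points and shows the eigenvalue along the separation direction changing sign at the symmetry-breaking noise level.

```python
import numpy as np

data = np.array([[1.0, 0.0], [-1.0, 0.0]])    # two hypothetical datapoints along the first axis

def score_jacobian(x, s2):
    """J_t(x) = Cov(y | x_t = x) / sigma^4 - I / sigma^2 for a discrete data distribution."""
    logw = -np.sum((x[None, :] - data) ** 2, axis=1) / (2.0 * s2)
    w = np.exp(logw - logw.max()); w /= w.sum()
    mu = w @ data
    cov = (data - mu).T @ ((data - mu) * w[:, None])
    return cov / s2 ** 2 - np.eye(len(x)) / s2

# Local Lyapunov spectrum at the central fixed point x = 0 for decreasing noise levels
for sigma in (1.4, 1.0, 0.8, 0.6):
    eig = np.linalg.eigvalsh(score_jacobian(np.zeros(2), sigma ** 2))
    print(sigma, np.round(eig, 3))
# The eigenvalue along the separation direction crosses zero near sigma = 1, marking the
# loss of stability of the central fixed point and the onset of exponential path divergence.
```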

5.1. The Global Perspective on Generative Bifurcations

In the previous sections, we characterized the generative dynamics of diffusion models by studying the associated paths of fixed-points in terms of their stability and bifurcations, which led us to establish formal connections with the statistical physics of phase transitions and symmetry breaking. However, in high dimensions, small volumes around a fixed-point have a vanishingly low probability of being visited. In fact, due to the dispersive effect of the noise, the generative trajectories are concentrated on fixed-variance shells around the fixed points. More formally, this set of "typical" points forms tubular neighborhoods of the set of fixed-points (see Figure 1). It is therefore unclear how a bifurcation in a path of fixed-points affects the behavior of the generative trajectories, since the analysis we presented in the previous sections was purely local.
To gain insight into the global behavior of the typical generative trajectories, we can study the expected divergence of the vector field at time t
$$\mathrm{div}(t) = \mathbb{E}_{x_t}\!\left[\nabla\cdot\nabla\log p_t(x_t)\right] = \mathbb{E}_{x_t}\!\left[\mathrm{Tr}\,\nabla\nabla^T\log p_t(x_t)\right].$$
If $\mathrm{div}(t)$ is negative, the separation between the generative trajectories will, on average, be contracted by the generative dynamics. The simplest example of this contractive behavior can be studied by considering a data distribution with a single point: $p_0(y) = \delta(y - c)$. In this case, all trajectories converge to $c$ for $t \to 0$, and we have
$$\mathrm{div}_1(t) = -\frac{D}{\sigma^2(t)},$$
where $D$ is the dimensionality of the space. In the reverse dynamics, the negative sign implies a stable dynamics where the particles ’fall’ towards the data points.
In the general case, this quantity can be identified with the "trivial component" of the expected divergence, since it does not depend on the data but only on the forward process. It can be expressed as
$$\mathrm{div}_1(t) = \mathbb{E}_{x_t, y}\!\left[\mathrm{Tr}\,\nabla\nabla^T\log p_t(x_t \mid y)\right].$$
We can therefore study the purely data-dependent part of the expected divergence by subtracting this "trivial component":
$$\Delta\mathrm{div}(t) = \mathrm{div}(t) - \mathrm{div}_1(t).$$
Intuitively, $\Delta\mathrm{div}(t)$ encodes the separation of the typical trajectories in the reverse process due to bifurcations in the generative process, which mirrors the local analysis we carried out in the previous sections at the level of the fixed-points.
Using integration by parts, it is straightforward to connect the expected divergence with the conditional entropy rate
$$\dot{H}[y \mid x_t] = \frac{\nu^2(t)}{2}\,\Delta\mathrm{div}(t).$$
Therefore, the expected data-dependent divergence of the generative trajectories directly determines the conditional entropy rate. From this identity, we can immediately deduce that $\Delta\mathrm{div}(t)$ is non-negative and consequently that $\mathrm{div}(t) \ge \mathrm{div}_1(t)$.
We can also show that the marginal entropy is produced by the expected divergence
$$\dot{H}[x_t] = -\frac{\nu^2(t)}{2}\,\mathrm{div}(t),$$
which implies that $\mathrm{div}(t) \le 0$, since the marginal entropy is a monotonically increasing function of $t$ under our forward process. This reflects the fact that the forward process always leads to a dispersion of the trajectories, regardless of the nature of the initial distribution. From this, we can conclude that the maximum bandwidth is achieved when
$$\mathrm{div}(t) = \mathbb{E}_{x_t}\!\left[\mathrm{Tr}\,\nabla\nabla^T\log p_t(x_t)\right] \approx 0.$$
This gives us a clear connection between the local vanishing of the Jacobian in spontaneous symmetry breaking (Eq. 42) and the expected vanishing that corresponds to saturation of the generative bandwidth.
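The expected divergence and the resulting entropy rate can be estimated by averaging the trace of the score Jacobian over forward samples. The sketch below is our own illustration (same toy dataset and $\sigma(t)=t$ schedule as in the earlier snippets; the trace formula uses the closed-form posterior covariance and all names are ours): it computes $\dot{H}[y \mid x_t] = \frac{\nu^2(t)}{2}\,\Delta\mathrm{div}(t)$ and should approximately reproduce the score-norm estimate given above.

```python
import numpy as np

rng = np.random.default_rng(4)
data = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
D = data.shape[1]

def trace_jacobian(x, s2):
    # Tr J_t(x) = Tr Cov(y | x_t = x) / sigma^4 - D / sigma^2 for the toy dataset
    logw = -np.sum((x[None, :] - data) ** 2, axis=1) / (2.0 * s2)
    w = np.exp(logw - logw.max()); w /= w.sum()
    mu = w @ data
    tr_cov = np.sum(w * np.sum((data - mu) ** 2, axis=1))
    return tr_cov / s2 ** 2 - D / s2

def entropy_rate_from_divergence(t, n=5000):
    """H_dot[y | x_t] = (nu^2 / 2) * (div(t) - div_1(t)), with sigma(t) = t and nu^2(t) = 2t."""
    s2, nu2 = t ** 2, 2.0 * t
    y = data[rng.integers(len(data), size=n)]
    x = y + np.sqrt(s2) * rng.standard_normal((n, D))
    div_t = np.mean([trace_jacobian(xi, s2) for xi in x])    # expected divergence of the score
    div_1 = -D / s2                                          # data-independent baseline
    return 0.5 * nu2 * (div_t - div_1)

for t in (0.3, 0.7, 1.0, 2.0, 4.0):
    print(t, round(entropy_rate_from_divergence(t), 4))
```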

5.2. Information Geometry

The derivation in the previous sub-section suggests a deep connection between the information production and the geometry of the data manifold. We can further analyze this connection by using concepts from information geometry [38]. The key connection is that the conditional entropy rate is in fact just the expected value of the trace of the Fisher information matrix, which can be defined as follows:
$$I_t(x_t) = -\mathbb{E}_{y \mid x_t}\!\left[\nabla\nabla^T\log p(y \mid x_t)\right].$$
This quantity measures the sensitivity of the posterior distribution $p(y \mid x_t)$ to changes in $x_t$ and can be interpreted as a natural metric tensor on the variable $x_t$. Using Bayes’ theorem and our simplified forward process, the expression can be rewritten as
$$I_t(x_t) = \sigma^{-2}(t)\left(I + \sigma^2(t)\, J(x_t)\right),$$
where $J(x_t) = \nabla\nabla^T\log p_t(x_t)$ is the Jacobian of the score.
Geometric information such as the manifold dimensionality is encoded in the spectrum of this matrix [22,39,40,41]. The Fisher information metric provides information on the (local) manifold structure of the data $y$ as seen through the lens of the noisy state $x_t$. This is easy to see in the case where the data is Gaussian with covariance matrix $\Sigma_0$, which gives the formula
$$I_t = \sigma^{-2}(t)\, I - \left(\Sigma_0 + \sigma^2(t)\, I\right)^{-1}.$$
When $y$ is supported on a $D_{\text{data}}$-dimensional manifold, the (degenerate) eigenvalue $\lambda_{\perp}$ corresponding to the orthogonal complement is equal to zero. On the other hand, in the flat limit, the tangent eigenvalues become equal to those of $\Sigma_0^{-1}$. This implies that the dimensionality of the manifold is given by the dimensionality of the eigenspace corresponding to the eigenvalue $\lambda = \sigma^{-2}(t)$. In the general case, the eigen-decomposition of $I_t(x_t)$ characterizes the local tangent structure of the manifold [40,41].
We can now use these expressions to cast light on the geometry of entropy production. The conditional entropy rate is directly related to the trace of the Fisher information matrix:
$$\dot{H}[y \mid x_t] = \frac{1}{2}\nu^2(t)\,\mathbb{E}_{x_t}\!\left[\mathrm{Tr}\, I_t(x_t)\right],$$
which reduces to Eq. 34 in the linear manifold case we just considered. From this perspective, it is clear that the reduction in bandwidth is the result of the suppression of the eigenvalues of $I_t(x_t)$. This can also be seen in the general case by re-expressing the entropy rate in terms of the expected eigenvalues of the Jacobi matrix:
$$\dot{H}[y \mid x_t] = \frac{\nu^2(t)}{2\,\sigma^2(t)}\left(D + \sigma^2(t)\sum_j \mathbb{E}\!\left[\lambda_j(x_t)\right]\right).$$
This equation shows that the entropy production is directly regulated by the spectrum of expected local Lyapunov exponents, as studied in our local analysis.
We can better understand this formula by rewriting it as follows:
$$\dot{H}[y \mid x_t] = \frac{\nu^2(t)}{2}\sum_j\left(\frac{1}{\sigma^2(t)} + \mathbb{E}\!\left[\lambda_j(x_t)\right]\right).$$
From this, we can see that the conditional entropy production in an eigenspace is fully suppressed when $\mathbb{E}\!\left[\lambda_j(x_t)\right] = -1/\sigma^2(t)$, which is the eigenvalue of the Jacobian of the conditional score under the isotropic forward process.
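For Gaussian data restricted to a linear subspace, the Fisher information matrix and its spectrum can be computed exactly, and counting the eigenvalues close to $\sigma^{-2}(t)$ recovers the manifold dimension. The sketch below is a minimal check under these assumptions (the subspace construction and all names are ours).

```python
import numpy as np

def fisher_spectrum(sigma2, Sigma0):
    """Eigenvalues of I_t = sigma^{-2} I - (Sigma0 + sigma^2 I)^{-1} for Gaussian data."""
    D = Sigma0.shape[0]
    I_t = np.eye(D) / sigma2 - np.linalg.inv(Sigma0 + sigma2 * np.eye(D))
    return np.linalg.eigvalsh(I_t)

# Gaussian data supported on a 3-dimensional subspace of a 10-dimensional ambient space
rng = np.random.default_rng(5)
U = np.linalg.qr(rng.standard_normal((10, 3)))[0]      # orthonormal basis of the subspace
Sigma0 = U @ np.diag([4.0, 4.0, 4.0]) @ U.T

eig = fisher_spectrum(sigma2=0.01, Sigma0=Sigma0)
print(int(np.sum(np.isclose(eig, 1.0 / 0.01, rtol=0.05))))   # 3 eigenvalues close to sigma^{-2}
```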

6. A Stochastic Thermodynamic Perspective

A central question in generative diffusion is how uncertainty about the clean sample $x_0$ is resolved as the model evolves from the noisy state $x_t$ toward the data manifold. As argued throughout this paper, the appropriate notion of inferential uncertainty is the previously discussed conditional entropy $H[x_0 \mid x_t]$ and, more fundamentally, its pathwise realization. The study of pathwise entropy is naturally motivated by ideas from stochastic thermodynamics. However, we believe that the commonly used entropy in stochastic thermodynamics [30,31] is not the correct quantity for understanding generative dynamics. It measures the irreversibility of the forward diffusion, not the uncertainty relevant to generating a single outcome.
For a given point on the trajectory $x_t$, we define its path-dependent conditional entropy as
$$h_t(x_t) = -\int p(x_0 \mid x_t)\log p(x_0 \mid x_t)\,\mathrm{d}x_0.$$
This quantity measures the uncertainty experienced along a single generative path. Its expectation is the usual conditional entropy,
$$\mathbb{E}\!\left[h_t(x_t)\right] = H[x_0 \mid x_t],$$
but its fluctuations encode a structure that is invisible to marginal entropies. In particular, as illustrated in Figure 3, the pathwise conditional entropy $h_t(x_t)$ can locally increase along individual generative trajectories even as the mean conditional entropy decreases, a behavior reminiscent of entropy fluctuations in stochastic thermodynamics. Such effects do not arise in autoregressive models, where each generation step reduces uncertainty about the final sequence by revealing one token, since $H[x_{i+1:n} \mid x_{1:i}] \le H[x_{i+1:n} \mid x_{1:i}] + H[x_i \mid x_{1:i-1}] = H[x_{i:n} \mid x_{1:i-1}]$. Whether these entropy fluctuations in diffusion-based generation have any practical advantage, however, remains an open question.
To expose this dynamical heterogeneity, we consider the variance of the pathwise conditional entropy,
$$V_h(t) := \mathrm{Var}\!\left[h_t(x_t)\right].$$
Figure 3. An example of conditional entropies and corresponding paths for a dataset of four data points.
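The pathwise conditional entropy and its variance can be simulated directly for a discrete toy dataset by integrating the reverse SDE with the exact score and evaluating the entropy of the posterior over datapoints along each trajectory. The sketch below is our own illustration (it assumes $\sigma(t)=t$, an Euler-Maruyama reverse sampler, and a Gaussian initialization at the largest time): the variance vanishes at both ends of the trajectory and peaks in between, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(6)
data = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])

def posterior(x, s2):
    logw = -np.sum((x[None, :] - data) ** 2, axis=1) / (2.0 * s2)
    w = np.exp(logw - logw.max())
    return w / w.sum()

def entropy(w):
    return -np.sum(w * np.log(np.clip(w, 1e-12, None)))

# Reverse-time Euler-Maruyama with the exact score, sigma(t) = t, nu^2(t) = 2t
T, t_min, n_steps, n_paths = 4.0, 0.05, 400, 200
ts = np.linspace(T, t_min, n_steps)
dt = ts[0] - ts[1]
x = T * rng.standard_normal((n_paths, 2))          # approximate marginal at the largest time

entropy_variance = []
for t in ts:
    s2, nu2 = t ** 2, 2.0 * t
    h = np.array([entropy(posterior(xi, s2)) for xi in x])       # pathwise conditional entropies
    entropy_variance.append(h.var())
    scores = np.array([(posterior(xi, s2) @ data - xi) / s2 for xi in x])
    x = x + nu2 * scores * dt + np.sqrt(nu2 * dt) * rng.standard_normal(x.shape)

print(round(max(entropy_variance), 3), round(entropy_variance[0], 3), round(entropy_variance[-1], 3))
# The variance is near zero at the start and the end of the reverse process and peaks in between.
```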

6.1. Variance of Pathwise Conditional Entropy as a Signature of Symmetry Breaking

We find that this variance captures symmetry-breaking transitions (see Figure 3). To gain a better understanding, we explore the general behavior of the variance and demonstrate a connection with the speciation time [20].
Two limits are immediate. At very early times, the noise scale is negligible compared to the curvature of the data manifold. The posterior $p(x_0 \mid x_t)$ is effectively confined to the local tangent plane, behaving as an isotropic Gaussian whose shape is determined solely by the intrinsic dimension $D_{\text{data}}$ (we assume that the dimension is uniform across the manifold). Because the entropy of this Gaussian depends on $D_{\text{data}}$ and $t$ but is insensitive to the specific location on the manifold,
$$h_t(x_t) \approx \mathrm{const} \quad\Longrightarrow\quad V_h(t) \approx 0.$$
At very late times, the diffusion has effectively mixed the data distribution: $x_t$ carries little discriminative information about the origin $x_0$ and the posterior becomes approximately independent of $x_t$, again implying
$$h_t(x_t) \approx \mathrm{const} \quad\Longrightarrow\quad V_h(t) \approx 0.$$
Thus, nontrivial variance can only arise in an intermediate regime where different trajectories resolve uncertainty in different ways.
Furthermore, near a bifurcation/decision time $t_c$, the ensemble contains a substantial fraction of trajectories that are already decisively committed to a branch and a substantial fraction that remain ambiguous. In this regime, $h_t(x_t)$ becomes broadly distributed (some paths yield low entropy, others high entropy), and $V_h(t)$ is therefore maximized.

6.1.1. Connection with the Speciation Time

As already hinted, the variance of the pathwise conditional entropy can be used to locate the speciation time for Gaussian-mixture data in the sense of Biroli et al. [20]. We provide a short argument for why $\mathrm{Var}\!\left[h_t(x_t)\right]$ peaks at the speciation crossover by using a two-region picture of Biroli et al. [20]: points where the class is effectively determined versus maximally mixed. A fully rigorous derivation is possible, but we focus on the essential mechanism and keep the argument streamlined to make the discussion clear and self-contained.
Recall that Biroli et al. [20] define the speciation time $t_S$ as the crossover at which (viewed in the forward noising process) the injected noise blurs the principal “class” direction so that class identity becomes hard to infer; equivalently, by time-reversal, it is the time in the backward process at which trajectories start to commit to one of the classes.
We start by noticing that a single Gaussian has $V(t) = 0$. If $p_0(x_0) = \mathcal{N}(\mu, \Sigma_0)$ and the forward kernel $q_t(x_t \mid x_0)$ is Gaussian, then $p(x_0 \mid x_t)$ is Gaussian with covariance $\Sigma_{0|t}$ independent of $x_t$, so $h_t(x_t) = \frac{1}{2}\log\det\!\left(2\pi e\, \Sigma_{0|t}\right)$ is constant and the variance vanishes.
Let $p_0(x_0) = \sum_{z=1}^K \pi_z\, \mathcal{N}(\mu_z, \Sigma_0)$ with latent index $z \in \{1, \dots, K\}$, and let $q_t(x_t \mid x_0)$ be the (Gaussian) forward noising kernel. Define the pathwise conditional entropy
$$h_t(x_t) \equiv -\int p(x_0 \mid x_t)\log p(x_0 \mid x_t)\,\mathrm{d}x_0, \qquad V(t) \equiv \mathrm{Var}_{x_t \sim p_t}\!\left[h_t(x_t)\right].$$
Now, assume that the mixture is well-separated. For each fixed t, the posterior admits the standard separated-mixture approximation
$$p(x_0 \mid x_t) = \sum_{z=1}^K w_z(x_t)\, p(x_0 \mid x_t, z), \qquad w_z(x_t) = p(z \mid x_t),$$
where $p(x_0 \mid x_t, z)$ is Gaussian, hence $H\!\left[p(x_0 \mid x_t, z)\right] = h_G(t)$ for all $z$. Using the entropy decomposition for (nearly) disjoint mixtures then yields
$$h_t(x_t) \approx h_G(t) + H\!\left[z \mid X_t = x_t\right],$$
so the fluctuations of $h_t(x_t)$ are governed by those of the discrete uncertainty $H(z \mid x_t)$.
Let $w_z(x_t) = p(z \mid x_t)$ be the posterior class weights and fix a small $\varepsilon$. For each $t$ define the two subsets
$$A_t \equiv \left\{x_t : \max_z w_z(x_t) \ge 1 - \varepsilon\right\}, \qquad B_t \equiv \left\{x_t : \left\|w(\cdot \mid x_t) - \pi\right\|_1 \le \varepsilon\right\},$$
and write $\alpha(t) \equiv p_t(A_t)$ and $\beta(t) \equiv p_t(B_t)$. The dynamical-regimes setting precisely corresponds to the statement that, for the mixture-of-Gaussians class considered in Biroli et al. [20], the regions $A_t$ and $B_t$ carry most of the mass for times outside a narrow window around the speciation time, i.e. $p_t(A_t) \approx 1$ or $p_t(B_t) \approx 1$ except near $t \approx t_S$.
On $A_t$, the posterior is nearly one-hot, while on $B_t$, the posterior is close to $\pi$. Hence, neglecting the small boundary region $C_t \equiv (A_t \cup B_t)^c$, the random variable $H(z \mid x_t)$ is approximately:
$$H(z \mid x_t) \approx \begin{cases} 0 & x_t \in A_t, \\ H(\pi) & x_t \in B_t. \end{cases}$$
As a consequence, the variance admits the approximate two-point expression
$$V(t) \approx \mathrm{Var}_{x_t}\!\left[H(z \mid x_t)\right] \approx \alpha(t)\,\beta(t)\,\big(H(\pi) - 0\big)^2,$$
where we used that, when $C_t$ is negligible, $\beta(t) \approx 1 - \alpha(t)$ and the variance of a two-point mixture is the product of the two masses times the squared gap. Equation (63) makes the mechanism transparent: the variance is small when essentially all points are class-diagnostic ($\alpha(t) \approx 1$) or essentially all points are well-mixed ($\alpha(t) \approx 0$), and it is largest when the population splits nontrivially.
To connect this directly to the dynamical-regimes diagnostics, note that the cloning “same-class” probability can be written as
$$P(t) \equiv \mathbb{E}_{x_t \sim p_t}\!\left[\sum_{z=1}^K w_z(x_t)^2\right].$$
Pre-speciation, $w(\cdot \mid x_t)$ is almost one-hot for typical $x_t$, so $P(t) \approx 1$ and thus $\alpha(t) \approx 1$, implying $V(t) \approx 0$. Post-speciation (well-mixed), $w(\cdot \mid x_t) \approx \pi$ for typical $x_t$, so $P(t) \approx \sum_z \pi_z^2$ and thus $\alpha(t) \approx 0$, again implying $V(t) \approx 0$. Since $P(t)$ varies continuously with $t$ (it is an expectation of a bounded, smooth functional of the time-marginal $p_t$ under the Gaussian kernel), it must interpolate continuously between these limiting values; correspondingly, the mass $\alpha(t)$ must continuously move from 1 to 0. Therefore the product $\alpha(t)\,(1 - \alpha(t))$ necessarily becomes $\Omega(1)$ in the crossover window, and by (63) $V(t)$ must develop a peak there. Under the dynamical-regimes picture, where the transition in $P(t)$ sharpens with dimension/separation, this peak concentrates around the speciation time $t_S$.
For more general distributions (e.g. strongly non-Gaussian components), $\mathrm{Var}\!\left[h_t(x_t)\right]$ can also exhibit an additional early-time peak associated with rapid local “Gaussianization” of $p_t$ under the forward kernel. This peak is absent for the ideal Gaussian case and is suppressed when each component is close to Gaussian.
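The two-region mechanism can be illustrated numerically for a one-dimensional two-component mixture. The sketch below is a rough illustration under our own choices of component variance, threshold $\varepsilon$, and schedule (the two estimates are not expected to match closely for a crude $\varepsilon$): it compares the direct Monte-Carlo variance of $H(z \mid x_t)$ with the two-point approximation $\alpha(t)\,\beta(t)\,H(\pi)^2$, and both vanish in the early and late regimes while developing a bump in the crossover window.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, pi = np.array([-1.0, 1.0]), np.array([0.5, 0.5])   # two 1-D components with variance 0.01
eps = 0.1

def class_posterior(x, s2):
    # w_z(x_t) for components N(mu_z, 0.01) pushed through the Gaussian forward kernel
    var = 0.01 + s2
    logw = -(x - mu) ** 2 / (2.0 * var) + np.log(pi)
    w = np.exp(logw - logw.max())
    return w / w.sum()

def compare(t, n=20000):
    s2 = t ** 2
    z = rng.integers(2, size=n)
    x = mu[z] + np.sqrt(0.01) * rng.standard_normal(n) + np.sqrt(s2) * rng.standard_normal(n)
    W = np.array([class_posterior(xi, s2) for xi in x])
    H = -np.sum(W * np.log(np.clip(W, 1e-12, None)), axis=1)
    alpha = np.mean(W.max(axis=1) >= 1 - eps)              # mass of the class-diagnostic region A_t
    beta = np.mean(np.abs(W - pi).sum(axis=1) <= eps)      # mass of the well-mixed region B_t
    return H.var(), alpha * beta * np.log(2) ** 2          # H(pi) = log 2 for pi = (1/2, 1/2)

for t in (0.2, 1.0, 3.0, 10.0):
    direct, approx = compare(t)
    print(t, round(direct, 3), round(approx, 3))
```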

7. Discussion & Conclusions

This paper has presented a unified framework that connects the dynamics, information theory, and statistical physics of generative diffusion. We have shown that the generative process is governed by the conditional entropy rate, which is directly tied to the expected divergence of the score function’s vector field and, equivalently, to the expected squared norm of the score. This quantity captures how uncertainty about the clean sample is resolved during denoising and reveals when the score is suppressed, allowing noise to drive the dynamics. In this view, the branching of generative trajectories arises from noise-induced symmetry-breaking transitions that occur when multiple datapoints remain compatible with the noisy state, and the model is forced to commit to a specific outcome.
By analyzing the fixed points of the score function and their stability, we showed that these generative decisions are formalized as bifurcations of the score field, which can be mapped onto classical symmetry-breaking phase transitions such as those described by mean-field models like the Curie–Weiss magnet. Peaks in the conditional entropy rate coincide with these bifurcation points, marking moments of maximal posterior mixing and heightened sensitivity to noise, where small fluctuations determine the generative branch taken by the system.
Our results also clarify the relationship between generative diffusion and stochastic thermodynamics. While stochastic thermodynamic entropy characterizes the irreversibility of the process, the conditional entropy studied here captures the inferential uncertainty relevant to generating a single sample. At the trajectory level, the pathwise conditional entropy and its variance reveal heterogeneity in how different generative paths resolve uncertainty, with variance peaks emerging precisely during symmetry-breaking events. From this perspective, entropy fluctuations are not incidental but constitute an information-theoretic signature of generative decisions.
In conclusion, generative diffusion can be understood as a dynamical system that progressively breaks symmetries in the energy landscape while regulating the flow of information through posterior mixing. The score function acts as a dynamic filter that suppresses noise along resolved directions while leaving unresolved directions weakly constrained, thereby controlling the generative bandwidth. This perspective provides a coherent explanation of how diffusion models transform noise into structured data and connects the learning dynamics of modern generative models to fundamental principles of information theory and statistical physics.
Beyond conceptual unification, this framework suggests practical implications for model design and analysis. Because entropy production and posterior mixing are directly linked to the score norm, they offer principled signals for identifying critical periods of high information transfer, motivating adaptive training and sampling strategies that target generative decision points [8]. More broadly, the information-thermodynamic perspective developed here provides a natural language for studying memorization, mode formation, and generalization, and may guide the development of future generative models that explicitly leverage controlled symmetry breaking to represent hierarchical and semantic structure.

Author Contributions

Conceptualization, D.S. and L.A.; methodology, D.S. and L.A.; software, D.S. and L.A.; validation, D.S. and L.A.; formal analysis, D.S. and L.A.; investigation, D.S. and L.A.; data curation, D.S. and L.A.; writing—original draft preparation, D.S. and L.A.; writing—review and editing, D.S.; visualization, D.S. and L.A.; project administration, L.A.; funding acquisition, L.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sohl-Dickstein, J.; Weiss, E.A.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. arXiv preprint arXiv:1503.03585 2015.
  2. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. arXiv preprint arXiv:2011.13456 2021.
  3. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Advances in Neural Information Processing Systems 2020, 33, 6840–6851.
  4. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. arXiv preprint arXiv:2010.02502 2022.
  5. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. arXiv preprint arXiv:2105.05233 2021.
  6. Kong, X.; Brekelmans, R.; Ver Steeg, G. Information-Theoretic Diffusion. arXiv preprint arXiv:2302.03792 2023.
  7. Kong, X.; Liu, O.; Li, H.; Yogatama, D.; Ver Steeg, G. Interpretable Diffusion via Information Decomposition. arXiv preprint arXiv:2310.07972 2023.
  8. Stancevic, D.; Handke, F.; Ambrogioni, L. Entropic Time Schedulers for Generative Diffusion Models. arXiv preprint arXiv:2504.13612 2025.
  9. Dieleman, S.; Sartran, L.; Roshannai, A.; Savinov, N.; Ganin, Y.; Richemond, P.H.; Doucet, A.; Strudel, R.; Dyer, C.; Durkan, C.; et al. Continuous Diffusion for Categorical Data. arXiv preprint arXiv:2211.15089 2022.
  10. Franzese, G.; Martini, M.; Corallo, G.; Papotti, P.; Michiardi, P. Latent Abstractions in Generative Diffusion Models. Entropy 2025, 27, 371. [CrossRef]
  11. Raya, G.; Ambrogioni, L. Spontaneous Symmetry Breaking in Generative Diffusion Models. arXiv preprint arXiv:2305.19693 2023. [CrossRef]
  12. Ambrogioni, L. The Statistical Thermodynamics of Generative Diffusion Models: Phase Transitions, Symmetry Breaking and Critical Instability. arXiv preprint arXiv:2310.17467 2024. [CrossRef]
  13. Biroli, G.; Mézard, M. Generative Diffusion in Very Large Dimensions. Journal of Statistical Mechanics: Theory and Experiment 2023, 2023, 093402. [CrossRef]
  14. Alaoui, A.E.; Montanari, A.; Sellke, M. Sampling from Mean-Field Gibbs Measures via Diffusion Processes. arXiv preprint arXiv:2310.08912 2023. [CrossRef]
  15. Huang, B.; Montanari, A.; Pham, H.T. Sampling from Spherical Spin Glasses in Total Variation via Algorithmic Stochastic Localization. arXiv preprint arXiv:2404.15651 2024.
  16. Montanari, A. Sampling, Diffusions, and Stochastic Localization. arXiv preprint arXiv:2305.10690 2023.
  17. Benton, J.; De Bortoli, V.; Doucet, A.; Deligiannidis, G. Nearly d-Linear Convergence Bounds for Diffusion Models via Stochastic Localization. In Proceedings of the International Conference on Learning Representations, 2024.
  18. Sclocchi, A.; Favero, A.; Wyart, M. A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data. Proceedings of the National Academy of Sciences 2025, 122, e2408799121. [CrossRef]
  19. Sclocchi, A.; Favero, A.; Levi, N.I.; Wyart, M. Probing the Latent Hierarchical Structure of Data via Diffusion Models. Journal of Statistical Mechanics: Theory and Experiment 2025, 2025, 084005. [CrossRef]
  20. Biroli, G.; Bonnaire, T.; de Bortoli, V.; Mézard, M. Dynamical Regimes of Diffusion Models. Nature Communications 2024, 15. [CrossRef]
  21. Bonnaire, T.; Urfin, R.; Biroli, G.; Mézard, M. Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training. arXiv preprint arXiv:2505.17638 2025.
  22. Achilli, B.; Ambrogioni, L.; Lucibello, C.; Mézard, M.; Ventura, E. Memorization and Generalization in Generative Diffusion under the Manifold Hypothesis. Journal of Statistical Mechanics: Theory and Experiment 2025, 2025, 073401. [CrossRef]
  23. Achilli, B.; Ventura, E.; Silvestri, G.; Pham, B.; Raya, G.; Krotov, D.; Lucibello, C.; Ambrogioni, L. Losing Dimensions: Geometric Memorization in Generative Diffusion. arXiv preprint arXiv:2410.08727 2024.
  24. Ambrogioni, L. In Search of Dispersed Memories: Generative Diffusion Models Are Associative Memory Networks. Entropy 2024, 26, 381. [CrossRef]
  25. Hoover, B.; Strobelt, H.; Krotov, D.; Hoffman, J.; Kira, Z.; Chau, D.H. Memory in Plain Sight: A Survey of the Uncanny Resemblances Between Diffusion Models and Associative Memories, 2023. Associative Memory & Hopfield Networks in 2023.
  26. Hess, J.; Morris, Q. Associative Memory and Generative Diffusion in the Zero-Noise Limit. arXiv preprint arXiv:2506.05178 2025.
  27. Jeon, D.; Kim, D.; No, A. Understanding Memorization in Generative Models via Sharpness in Probability Landscapes. arXiv preprint arXiv:2412.04140 2024.
  28. Pham, B.; Raya, G.; Negri, M.; Zaki, M.J.; Ambrogioni, L.; Krotov, D. Memorization to Generalization: Emergence of Diffusion Models from Associative Memory. arXiv preprint arXiv:2505.21777 2025.
  29. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv preprint arXiv:2112.10752 2022.
  30. Premkumar, A. Neural Entropy. arXiv preprint arXiv:2409.03817 2024.
  31. Seifert, U. Entropy production along a stochastic trajectory and an integral fluctuation theorem. Physical review letters 2005, 95, 040602. [CrossRef]
  32. Ikeda, K.; Uda, T.; Okanohara, D.; Ito, S. Speed-accuracy relations for diffusion models: Wisdom from nonequilibrium thermodynamics and optimal transport. Physical Review X 2025, 15, 031031.
  33. Lou, A.; Meng, C.; Ermon, S. Discrete Diffusion Language Modeling by Estimating the Ratios of the Data Distribution. arXiv preprint arXiv:2305.14627 2023.
  34. Sahoo, S.; Arriola, M.; Schiff, Y.; Gokaslan, A.; Marroquin, E.; Chiu, J.; Rush, A.; Kuleshov, V. Simple and Effective Masked Diffusion Language Models. Advances in Neural Information Processing Systems 2024, 37, 130136–130184.
  35. Carreira-Perpinan, M.A. Mode-finding for mixtures of Gaussian distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002, 22, 1318–1323. [CrossRef]
  36. Carreira-Perpinán, M.A.; Williams, C.K. On the number of modes of a Gaussian mixture. In Proceedings of the International Conference on Scale-Space Theories in Computer Vision, 2003.
  37. Améndola, C.; Engström, A.; Haase, C. Maximum number of modes of Gaussian mixtures. Information and Inference: A Journal of the IMA 2020, 9, 587–600. [CrossRef]
  38. Amari, S.i. Information Geometry and Its Applications; Vol. 194, Springer, 2016.
  39. Kadkhodaie, Z.; Guth, F.; Simoncelli, E.P.; Mallat, S. Generalization in diffusion models arises from geometry-adaptive harmonic representations. arXiv preprint arXiv:2310.02557 2023.
  40. Stanczuk, J.P.; Batzolis, G.; Deveney, T.; Schönlieb, C.B. Diffusion Models Encode the Intrinsic Dimension of Data Manifolds. In Proceedings of the International Conference on Machine Learning, 2024.
  41. Ventura, E.; Achilli, B.; Silvestri, G.; Lucibello, C.; Ambrogioni, L. Manifolds, Random Matrices and Spectral Gaps: The Geometric Phases of Generative Diffusion. arXiv preprint arXiv:2410.05898 2024.
Figure 1. Conditional entropy profile (left) and paths of fixed-points (right) for a mixture of two delta distributions. The color in the background visualizes the (log)density of the process.
Figure 2. Stability and instability of trajectories in different parts of a symmetry-breaking potential. Generative branching is associated with divergent trajectories.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.