Preprint
Article

This version is not peer-reviewed.

DNA Methylation Entropy as a Marker of Epigenetic Rejuvenation in Mouse Embryogenesis

Submitted:

28 June 2026

Posted:

30 June 2026


Abstract
Background/Objectives: DNA methylation dynamics are closely linked to ageing. While epigenetic clocks predict biological age from average methylation levels, they may miss structural rearrangements. The "epigenetic rejuvenation" hypothesis posits that embryonic biological age decreases toward a minimum ("ground zero") at gastrulation. This study tested whether DNA methylation entropy decreases from stage E4.5 to E6.5. Methods: Publicly available single-cell nucleosome, methylation and transcription sequencing (scNMT-seq) data (GSE121690; 364 mouse cells) were analysed. Five entropy measures were calculated: Shannon, Renyi, Tsallis, Lempel-Ziv (LZ) complexity, and local gradient entropy. Persistent entropy (PE) and topological analysis (Rips complex, persistence diagrams H₀ and H₁) were computed. A support vector machine (SVM) classifier was trained to distinguish developmental stages. Results: Mean Shannon entropy decreased from 0.841 (E4.5) to 0.805 (E6.5; p < 0.01). All five entropy measures and PE decreased significantly. LZ complexity showed the largest reduction (−28.4% for raw values; −11.4% after length normalisation, for both binary and ternary encodings; p < 0.001). Ternary LZ correlated with Shannon entropy (r = 0.71). PE decreased from 15.91 to 14.89 (p = 0.01). Topological analysis revealed a qualitative reorganisation: E6.5 showed stable H₁-cycles (9 intervals) that were largely absent at E4.5 (2 intervals), reflecting lineage-specific diversification. Comprehensive characterisation with balanced subsampling revealed a significant decrease in max persistence (0.446 → 0.218, p = 0.03) and total persistence (2.170 → 1.070, p < 0.001), indicating a transition from a single dominant cycle to multiple smaller, evenly distributed cycles. Regional entropy and disorder decreased (RE: −25.5%, RD: −27.4%, p < 10⁻¹³). The SVM achieved 93.4% accuracy (AUC = 0.981). Conclusions: The decrease in DNA methylation entropy from E4.5 to E6.5 supports epigenetic rejuvenation as an approach to "ground zero."

1. Introduction

The fundamental question of the biology of ageing is why new organisms do not inherit the age of their parent germ cells. Somatic cells age by accumulating damage, whereas the germline is “reset” in each generation [4,5]. Kerepesi and co-authors experimentally confirmed this by using epigenetic clocks to show that the biological age of the embryo decreases during early development, reaching a minimum at the gastrulation stage [3,6]. This point was termed “ground zero” — the beginning of organismal life and ageing [1,2].
Epigenetic clocks based on the average methylation level have become the gold standard for assessing biological age. They predict the development of age-associated diseases, including Alzheimer’s disease and cardiovascular pathologies [2]. However, the mean value is not a sufficient statistic for describing the methylation distribution, since it does not account for correlations between CpG sites and cannot distinguish functionally different epigenetic states characterised by the same methylation level. For example, at the same average methylation (0.5), the patterns “11110000” and “10100101” are indistinguishable by the mean value. Shannon entropy, based on state frequencies, also does not distinguish these patterns (both yield an entropy of 1.0 bit). However, LZ complexity is sensitive to the order of elements: for “11110000” it is considerably lower than for “10100101”, reflecting the presence of ordered blocks. This illustrates the necessity of combining statistical and algorithmic measures for a complete description of the epigenetic landscape. Recent studies have shown that methylation entropy may serve as an additional biomarker of ageing: Chan et al. (2025) demonstrated that on the basis of entropy at 3000 CpG loci in buccal cells, chronological age can be predicted with an error of ~5 years, with entropy changes proving to be locus-specific and bidirectional — entropy of a number of regions decreased with age, whereas that of others increased [8]. Experiments on editing individual CpG sites (Liesenfelder et al., 2025) showed that modification of a single locus can affect the methylation of hundreds of linked regions, indicating the non-local character of the epigenetic network [7]. Singh (2025) proposed a theoretical justification for the link between methylation entropy and chromatin phase separation via the Flory–Huggins parameter [9]. 
Such non-local influence of one CpG site on hundreds of others points to the network organization of the epigenetic landscape and confirms that DNA methylation is not merely a set of independent marks but a complex information system with long-range correlations.
One way to conceptualize this complexity is to view the epigenetic code as a language. The methylation level is the presence/absence of a “letter”, and entropy is a measure of the randomness of the text. High entropy is meaningless noise; low entropy is a structured message (“blank slate”). The aim of this work is to test whether methylation entropy decreases during natural rejuvenation (from E4.5 to E6.5), and to evaluate the possibility of classifying developmental stages on the basis of entropy and structural characteristics of DNA methylation using machine learning methods.

2. Materials and Methods

2.1. Data

Publicly available scNMT-seq (single-cell methylation and accessibility) data from the NCBI GEO database (GSE121690) [10] were used. The data contain 364 mouse cells at developmental stages E4.5 (267 cells) and E6.5 (97 cells). Each file contains information on CpG-site methylation levels in the format: chr, pos, met_reads, nonmet_reads, rate (methylation level from 0 to 1). scNMT-seq is a method enabling simultaneous profiling of DNA methylation, chromatin accessibility, and the transcriptome in a single cell [11]. Transcriptomic data from the same scNMT-seq dataset were used to compute Pearson correlations between persistent entropy and the expression of DNA repair genes. Correlation p-values were adjusted for multiple testing using the Benjamini–Hochberg method (q-values reported).
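As an illustration of the per-cell file layout described above, the following minimal sketch reads such a table into a list of records. It assumes a tab-separated file with a header row naming the five columns; the delimiter, the header row, and the function name `read_methylation_table` are assumptions made for the example, not details confirmed for GSE121690.

```python
import csv
import io

def read_methylation_table(handle):
    """Read rows of (chr, pos, met_reads, nonmet_reads, rate) from a text handle.

    Assumes tab-separated columns and a header row (both are assumptions).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    records = []
    for row in reader:
        records.append({
            "chr": row["chr"],
            "pos": int(row["pos"]),
            "met_reads": int(row["met_reads"]),
            "nonmet_reads": int(row["nonmet_reads"]),
            "rate": float(row["rate"]),
        })
    return records

# Tiny in-memory stand-in for one cell's file; the values are made up.
example = io.StringIO(
    "chr\tpos\tmet_reads\tnonmet_reads\trate\n"
    "1\t3000827\t1\t0\t1.0\n"
    "1\t3001007\t0\t2\t0.0\n"
)
cells = read_methylation_table(example)
rates = [rec["rate"] for rec in cells]  # per-site methylation levels in [0, 1]
```

The resulting `rates` list is the kind of per-cell vector that the entropy measures in Section 2.2 take as input.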

2.2. Entropy Measures

For each cell, the vector of CpG-site methylation levels (rate values from 0 to 1) was discretised using a histogram with a fixed number of bins $B$ over the range $[0, 1]$. For the resulting probability distribution $P = \{p_1, p_2, \ldots, p_B\}$, where $p_i$ is the fraction of CpG sites falling into the $i$-th bin, the following measures were calculated. The choice of $B = 20$ bins is based on a trade-off between distribution resolution and statistical stability of estimates: with an average of approximately 400 000 CpG sites per cell, each bin contains on average roughly 20 000 sites, ensuring reliable probability estimation and being consistent with previous work on methylation entropy [8].
Shannon entropy, a classical diversity measure characterising the uncertainty of a distribution:
$$H_S = -\sum_{i=1}^{B} p_i \log_2 p_i$$
Renyi entropy (of order $\alpha$), a generalisation of Shannon entropy that, at $\alpha = 2$, gives greater weight to frequently occurring patterns, suppressing the contribution of rare events (collision entropy):
$$H_2 = -\log_2 \sum_{i=1}^{B} p_i^{2}.$$
Tsallis entropy (with parameter $q = 2$), a non-additive measure used for analysing complex systems with long-range interactions:
$$S_2 = \frac{1 - \sum_{i=1}^{B} p_i^{2}}{q - 1}.$$
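The three histogram-based measures can be sketched as follows. This is a minimal illustration assuming equal-width bins on [0, 1] with the value 1.0 assigned to the last bin; the binning convention and the function names are assumptions, not the authors' code.

```python
import math

def histogram_probs(rates, n_bins=20):
    """Fraction of values falling into each of n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for r in rates:
        idx = min(int(r * n_bins), n_bins - 1)  # r == 1.0 goes into the last bin
        counts[idx] += 1
    total = len(rates)
    return [c / total for c in counts]

def shannon_entropy(p):
    """H_S = -sum p_i log2 p_i, skipping empty bins."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def renyi2_entropy(p):
    """Collision entropy: H_2 = -log2 sum p_i^2."""
    return -math.log2(sum(pi * pi for pi in p))

def tsallis_entropy(p, q=2):
    """S_q = (1 - sum p_i^q) / (q - 1)."""
    return (1 - sum(pi ** q for pi in p)) / (q - 1)

# Toy cell: half of the sites unmethylated, half fully methylated.
probs = histogram_probs([0.0, 0.0, 1.0, 1.0])
h_shannon = shannon_entropy(probs)
h_renyi = renyi2_entropy(probs)
s_tsallis = tsallis_entropy(probs)
```

For this two-bin toy distribution the Shannon and Renyi values coincide at 1 bit and the Tsallis value is 0.5, which is a quick sanity check on the formulas.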
LZ complexity (Lempel-Ziv complexity) [12] — a measure of algorithmic complexity that estimates the number of unique substrings in a sequence. Unlike statistical measures (Shannon, Renyi, and Tsallis entropy), which operate on probability distributions, LZ complexity characterises structural diversity and the degree of pattern repetitiveness in a specific realisation of a sequence.
LZ complexity calculation algorithm
For the calculation, the original vector of methylation levels $r = (r_1, r_2, \ldots, r_N)$, where $r_j \in [0, 1]$, was binarised with a threshold of 0.5:
$$x_j = \begin{cases} 1, & r_j > 0.5 \\ 0, & r_j \le 0.5 \end{cases}$$
Note that CpG-site methylation levels represent a continuous stochastic process with discrete measurements (typically 5 levels: 0, 0.25, 0.5, 0.75, 1.0). However, for the present study, aimed at identifying qualitative patterns (entropy reduction, stage-specific correlations with repair genes), the use of a deterministic approximation (binarisation with a threshold of 0.5) is a justified and standard approach in the analysis of scNMT-seq data. Binarisation serves as a tool for transitioning to a discrete representation for the calculation of algorithmic complexity (LZ) and topological measures. As a result, a binary string $x = (x_1, x_2, \ldots, x_N)$, where $x_j \in \{0, 1\}$, is obtained. LZ complexity is defined in the process of lexicographic parsing of the string according to the following rule:
Let $x = (x_1, x_2, \ldots, x_N)$ be the string under analysis. The LZ complexity $c(x)$ equals the number of steps required to construct the entire string by sequentially adding new substrings that have not previously occurred in the already constructed part.
LZ parsing algorithm (classical Lempel-Ziv algorithm):
  • Initialisation: $c = 1$, current string $Q = x_1$ (the first character).
  • For each subsequent symbol $x_i$ (starting from $i = 2$):
    A trial string $Q' = Q + x_i$ is formed (concatenation of the current string and the new symbol).
    If $Q'$ does not occur as a substring in the already constructed part of the string (i.e., in the concatenation of all previously extracted unique blocks), then:
    • $c = c + 1$ (a new unique block is registered);
    • $Q = x_i$ (a new current string is started from this symbol).
    Otherwise (if $Q'$ has already occurred):
    • $Q = Q'$ (the current string is extended without incrementing the counter).
  • If after processing all symbols $Q$ is not empty, then $c = c + 1$.
By the “already constructed part of the string” we mean the concatenation of all unique substrings (blocks) that were extracted at previous steps of the algorithm. It is against this accumulated string that the trial extension $Q'$ is compared. If $Q'$ has already occurred as a contiguous substring in the accumulated string, then it is not new, and we simply extend $Q$. If it has not occurred, a new block is fixed and the next one is begun. This is the standard implementation of LZ parsing.
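The parsing idea can be sketched in Python. The version below uses the common exhaustive-history (LZ76) formulation, in which each new block is the shortest substring that cannot be copied from the text preceding its last symbol. It is an assumption that this matches the authors' implementation in detail, and block counts may differ by a constant from the initialisation convention described above. The function names are illustrative.

```python
import math
from collections import Counter

def lz76_complexity(s):
    """Count LZ76 blocks: each block is the shortest substring starting at the
    current position that does not occur in the text preceding its last symbol."""
    n = len(s)
    i = 0        # start index of the current block
    blocks = 0
    while i < n:
        length = 1
        # Extend the block while it can still be copied from earlier text.
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        blocks += 1
        i += length
    return blocks

def symbol_entropy(s):
    """Shannon entropy (bits) of the symbol frequencies in s (order is ignored)."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

ordered = "11110000"   # two long blocks
mixed = "10100101"     # same symbol frequencies, irregular order
print(symbol_entropy(ordered), symbol_entropy(mixed))    # 1.0 1.0
print(lz76_complexity(ordered), lz76_complexity(mixed))  # 3 4
```

For these two eight-symbol strings the frequency-based entropy is identical, while the block count is lower for the ordered string. On strings this short the gap is small; it widens with sequence length.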
Ternary LZ complexity. To account for hemimethylated states (rate ≈ 0.5), which may reflect epigenetic plasticity, we additionally computed LZ complexity using a ternary alphabet. Methylation rates were discretised as follows:
$$x_j = \begin{cases} 0, & r_j < 0.4 \\ 1, & 0.4 \le r_j \le 0.6 \\ 2, & r_j > 0.6 \end{cases}$$
The value 1 represents the hemimethylated / uncertain state. Normalisation was performed as:
$$LZ_{norm} = \frac{LZ}{N / \log_k(N)}$$
where $k = 3$ for the ternary alphabet (and $k = 2$ for the binary one). This normalisation ensures comparability across cells with different numbers of CpG sites. The use of ternary encoding preserves information about partial methylation, which is lost in binarisation, and provides a more biologically realistic estimate of algorithmic complexity.
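A minimal sketch of the ternary encoding and the length normalisation is given below. The block counter is a generic LZ76 parser used as a stand-in for the authors' implementation, and all function names are hypothetical.

```python
import math

def encode_ternary(rates, low=0.4, high=0.6):
    """Map methylation rates to symbols '0' (< low), '1' (low..high), '2' (> high)."""
    symbols = []
    for r in rates:
        if r < low:
            symbols.append("0")
        elif r <= high:
            symbols.append("1")   # hemimethylated / uncertain state
        else:
            symbols.append("2")
    return "".join(symbols)

def lz76_complexity(s):
    """Generic LZ76 block count (assumed stand-in for the parser in the text)."""
    n, i, blocks = len(s), 0, 0
    while i < n:
        length = 1
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        blocks += 1
        i += length
    return blocks

def normalised_lz(s, k):
    """LZ_norm = LZ / (N / log_k N) for an alphabet of size k."""
    n = len(s)
    if n < 2:
        raise ValueError("need at least two symbols to normalise")
    return lz76_complexity(s) / (n / math.log(n, k))

encoded = encode_ternary([0.0, 0.25, 0.5, 0.75, 1.0])
score = normalised_lz(encoded, k=3)
```

Passing `k=2` together with a binary string gives the binary counterpart of the same normalisation.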
Interpretation. LZ complexity reflects the presence of repeating structures in a sequence. High values correspond to complex, poorly compressible sequences with a large number of unique patterns. Low values correspond to sequences with a high degree of repetitiveness (well compressible). In the context of DNA methylation, this means that LZ complexity estimates how irregularly methylated and unmethylated sites alternate along the genome. Ordered patterns (e.g., long blocks of 000…000111…111) are characterised by low LZ complexity, whereas irregular, aperiodic intermixing of methylated and unmethylated sites yields high LZ complexity. A strictly periodic alternation such as 010101… is itself highly compressible and therefore also has low LZ complexity.
Distinction from other entropy measures. If Shannon, Renyi, and Tsallis entropy assess the probability distribution of states (ignoring the order of occurrence), then LZ complexity accounts for sequence and is sensitive to the spatial organisation of patterns.
Spatial entropy (entropy of local gradients) is a measure intended to account for the order of methylation levels along the genome. It was calculated as the entropy of the variations of the smoothed histogram of the methylation distribution:
$$S_{spatial} = -\sum_{j=1}^{B-1} \Delta_j \log_2(\Delta_j + \varepsilon)$$
where $\Delta_j = |\bar{p}_{j+1} - \bar{p}_j|$, $\bar{p}_j$ is the histogram value smoothed by a moving average with window 3, and $\varepsilon = 10^{-10}$ is a small correction avoiding the logarithm of zero. This measure is sensitive to local changes in methylation density, characterising the degree of “intermixing” of methylated and unmethylated regions.
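The gradient-based measure can be sketched as below. The handling of the histogram edges (a truncated window at both ends) and the use of absolute differences are assumptions, since the text does not specify them. The function names are illustrative.

```python
import math

def moving_average(p, window=3):
    """Centred moving average; the window is truncated at the edges (assumption)."""
    half = window // 2
    smoothed = []
    for j in range(len(p)):
        lo, hi = max(0, j - half), min(len(p), j + half + 1)
        smoothed.append(sum(p[lo:hi]) / (hi - lo))
    return smoothed

def spatial_gradient_entropy(p, eps=1e-10):
    """S = -sum_j d_j * log2(d_j + eps), with d_j = |p_bar[j+1] - p_bar[j]|."""
    smooth = moving_average(p)
    deltas = [abs(smooth[j + 1] - smooth[j]) for j in range(len(smooth) - 1)]
    return -sum(d * math.log2(d + eps) for d in deltas)

flat = spatial_gradient_entropy([0.05] * 20)            # uniform histogram
spiked = spatial_gradient_entropy([1.0] + [0.0] * 19)   # all mass in one bin
```

A flat histogram has no gradients and gives a value of essentially zero, whereas a histogram with a sharp spike gives a positive value.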
Persistent entropy (PE) is a topological complexity measure based on analysis of the persistence diagram $H_0$ constructed from the binarised methylation sequence (threshold 0.5). For the calculation, one-dimensional cubical filtration implemented in the GUDHI library [13] is used. Let there be $m$ finite intervals (connected components) with lifetimes (persistences) $l_1, l_2, \ldots, l_m$. The total sum of persistences is $L = \sum_{i=1}^{m} l_i$, and the normalised weights are $p_i = l_i / L$. Then persistent entropy is computed by the Shannon formula:
$$PE = -\sum_{i=1}^{m} p_i \log_2 p_i.$$
PE characterises the diversity of scales of structural elements in the methylation distribution: low values correspond to the dominance of extended homogeneous regions, high values to the presence of blocks of varying sizes. Unlike statistical entropies, PE accounts for spatial order and multiscale block structure, and unlike LZ complexity, it distinguishes the contribution of short and long blocks. For example, for the sequence 000111000111, the two blocks of methylated sites have lengths (persistences) 3 and 3 (the blocks are completely isolated). If the blocks were of different lengths, say 3 and 6, PE would reflect this difference. For equal blocks (3, 3), PE = 1.0 bit (the maximum for two blocks). With the dominance of one long block (e.g., 1 and 8), PE would be lower (≈ 0.50 bit). Thus PE distinguishes situations where short and long blocks make different contributions to structural complexity. For H₁ persistence diagrams, which capture cyclic structures, PE was computed analogously using the lifespans of H₁ intervals.
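The persistent entropy formula itself is easy to check on small inputs. The sketch below takes a list of interval lifetimes as given; how those lifetimes are obtained (the cubical filtration in GUDHI) is outside the sketch, and the function name is illustrative.

```python
import math

def persistent_entropy(lifetimes):
    """PE = -sum p_i log2 p_i, with p_i = l_i / sum(l) over positive lifetimes."""
    positive = [l for l in lifetimes if l > 0]
    total = sum(positive)
    return -sum((l / total) * math.log2(l / total) for l in positive)

pe_equal = persistent_entropy([3, 3])      # two equal blocks
pe_unequal = persistent_entropy([3, 6])    # moderately unequal blocks
pe_dominant = persistent_entropy([1, 8])   # one dominant block
print(round(pe_equal, 3), round(pe_unequal, 3), round(pe_dominant, 3))
# 1.0 0.918 0.503
```

The value falls from its two-interval maximum of 1 bit as one interval comes to dominate the total persistence.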

2.3. Statistical Analysis

Comparison of entropy between stages was performed using a parametric t-test and a non-parametric Mann–Whitney U-test. The joint use of both tests allows the significance of differences to be confirmed regardless of whether the assumption of normality of distributions holds. Correlation analysis was performed using Pearson’s correlation coefficient. To control for multiple comparisons in the analysis of six entropy measures, false discovery rate (FDR) correction by the Benjamini–Hochberg method was applied. $P$-values were ordered in ascending order: $p_1 \le p_2 \le \ldots \le p_m$, where $m = 6$ is the number of tested hypotheses. Adjusted $q$-values were computed by the formula
$$q_i = \min_{j \ge i} \min\left(\frac{p_j \, m}{j},\; 1\right),$$
which enforces a monotonically non-decreasing sequence $q_i$. Differences were considered statistically significant at $q < 0.05$.
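The Benjamini–Hochberg adjustment can be sketched in a few lines. This is a generic implementation of the step-up formula, written for illustration rather than taken from the study's code; the function name and the input p-values are made up for the example.

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the running minimum of p * m / rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        qvals[idx] = running_min
    return qvals

example_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Walking from the largest p-value downwards implements the inner minimum over j ≥ i, so the q-values never decrease as the p-values increase.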
Reproducibility. Calculations were performed in Python 3.10 using the libraries scikit-learn (1.2), GUDHI (3.8), imbalanced-learn (0.10). GridSearch parameters for SVM: $C \in \{0.1, 1, 10, 100\}$, $\mathrm{gamma} \in \{\text{scale}, 0.1, 1\}$. Additional details are provided in the repository (Appendix A).

3. Results

To assess changes in the epigenetic landscape, five different measures based on information theory and algorithmic complexity were computed. The dynamics of the five main indicators are presented in Figure 1. Despite differences in absolute values and scales, all measures demonstrate a consistent decreasing trend from the early stage E4.5 to the gastrulation stage E6.5.

3.1. Shannon Entropy Dynamics

The mean Shannon entropy was 0.841 at stage E4.5 and 0.805 at stage E6.5 (Table 1). The decrease in entropy (~4.3%) is statistically significant (t-test: p = 0.0028; Mann–Whitney U-test: p = 0.0003).
Similar results were obtained for the other measures: all five entropic measures and persistent entropy demonstrate a statistically significant decrease (Table 2). After FDR correction using the Benjamini–Hochberg procedure, all differences remained significant at q < 0.01 (lowest q = 0.0035 for LZ complexity, highest q = 0.0097 for persistent entropy), confirming the robustness of the observed entropy decrease. Persistent entropy decreased from 15.91 ± 3.20 (E4.5) to 14.89 ± 3.67 (E6.5); the difference is statistically significant (t-test: $p = 0.010$; Mann–Whitney U-test: $p = 2 \times 10^{-6}$).
Figure 2 presents the dynamics of Shannon entropy as a U-shaped curve, clearly demonstrating the decrease in entropy from stage E4.5 to stage E6.5.

3.2. Comparison of Entropic Measures

Correlation analysis showed that the Shannon, Renyi, Tsallis, and spatial entropy measures produce practically identical results (correlation coefficients 0.995–1.000). LZ complexity demonstrated low correlation with the other measures (~0.13), indicating that it captures a different aspect of epigenetic complexity (Table 3).
Robustness check: binary vs. ternary LZ complexity. To test whether the observed decrease in LZ complexity depends on the choice of binarisation threshold, we recomputed LZ complexity using a ternary alphabet (0, 0.5, 1) with normalisation. The results were virtually identical:
  • Binary LZ (normalised): E4.5 mean = 1.165 ± 0.187, E6.5 mean = 1.032 ± 0.253 (−11.4%, p < 0.001)
  • Ternary LZ (normalised): E4.5 mean = 0.735 ± 0.118, E6.5 mean = 0.651 ± 0.160 (−11.4%, p < 0.001)
The identical relative decrease (−11.4%) and highly significant p-values for both encodings demonstrate that the ordering of the epigenetic landscape is a robust phenomenon, independent of the discretisation scheme. Notably, the correlation of ternary LZ with Shannon entropy increased dramatically from r = 0.13 (binary) to r = 0.71 (ternary), indicating that the ternary approach captures information more consistent with statistical entropy while retaining its structural sensitivity. This methodological improvement supports the biological relevance of hemimethylated states in epigenetic regulation.
Note that the raw LZ values (Table 4) showed a larger relative decrease (−28.4%) than the normalised values (−11.4%) because normalisation accounts for differences in sequence length between cells. Both raw and normalised metrics demonstrate consistent and highly significant stage-dependent changes.

3.3. Comparison of Central Tendency and Variability Dynamics

To assess changes in the epigenetic landscape, not only medians (central tendency) but also interquartile ranges (IQR), characterizing the variability of each measure between cells, were analyzed (Table 4).
All measures demonstrate a statistically significant decrease in median from stage E4.5 to E6.5 ( p < 0.01 for Shannon entropy), confirming the main hypothesis of epigenetic landscape ordering as the system approaches “ground zero.”
However, the behaviour of interquartile ranges divides the measures into two groups. Shannon entropy, Renyi entropy, Tsallis entropy, and Spatial gradient entropy demonstrate increased variability at stage E6.5 ( Δ IQR from +11.6% to +19.5%). An increase in IQR accompanied by a decrease in median may reflect either increased inter-cellular variability or a change in the shape of the distribution (e.g., emergence of asymmetry or subpopulations).
In contrast, LZ complexity maintains a stable interquartile range ( Δ IQR = –3.5%). At the same time, its median decreases most dramatically (–28.4%). Thus, despite considerable ordering of the algorithmic structure of methylation (strong decrease in median), the variability of this indicator between cells remains practically unchanged.
This discrepancy confirms that LZ complexity captures a different aspect of epigenetic information compared with statistical entropic measures. Whereas statistical measures are sensitive to the diversity of the methylation distribution, LZ complexity evaluates the presence of repeating structures and the algorithmic compressibility of the sequence. The stability of IQR for LZ complexity against the backdrop of increased variability of statistical measures suggests that the ordering process from E4.5 to E6.5 is accompanied by changes in the shape of the entropy distribution between cells, requiring further investigation to understand the biological mechanisms underlying this phenomenon.

3.4. Machine Learning for Stage Classification

To evaluate the possibility of quantitatively distinguishing early (E4.5) and late (E6.5) stages of mouse embryogenesis based on entropic and structural characteristics of DNA methylation, an additional study was conducted using machine learning methods. The following tools, methods, and data were employed:
  • Classifiers: Logistic Regression, SVM with radial basis function (RBF) kernel, Random Forest.
  • Data preprocessing:
    • Feature scaling (StandardScaler).
    • Class balancing using SMOTE (synthetic generation of the minority class E6.5). The use of SMOTE on a sample of this size (364 cells, of which 97 are E6.5) is associated with a risk of overfitting due to generation of synthetic samples in underrepresented regions of the feature space. The resulting metrics should be regarded as an upper bound on classification quality, requiring confirmation on independent data.
    • Stratified train–test split: 75% training, 25% testing.
  • Classifier parameter optimisation: Grid Search with 5-fold cross-validation based on the ROC-AUC criterion.
  • Features. A total of 15 features were included in the analysis, grouped into three categories as shown in Table 5.
Table 5. Categories of features.
| Category | Features | Number of features in category |
| --- | --- | --- |
| Entropy measures | Shannon entropy, Renyi entropy ($\alpha = 2$), Tsallis entropy ($q = 2$), LZ complexity, Spatial gradient entropy | 5 |
| Basic statistics | Mean methylation level (mean_methylation), standard deviation (std_methylation), number of CpG sites (n_CpG_sites), coefficient of variation (cv_methylation) | 4 |
| Structural characteristics | Expected length of methylated blocks (expected_run_length_1), expected length of unmethylated blocks (expected_run_length_0), ratio of block lengths (run_length_ratio), spatial organization (spatial_organization), LZ/Shannon ratio (lz_shannon_ratio), spatial gradient entropy/Shannon ratio (spatial_gradient_shannon_ratio) | 6 |
The properties of basic statistics are presented in Table 6.
Structural characteristics (run-length encoding) are based on the binarised sequence $x_j \in \{0, 1\}$:
$$x_j = \begin{cases} 1, & r_j > 0.5 \\ 0, & r_j \le 0.5 \end{cases}$$
The properties of these characteristics are given in Table 7.
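As a rough illustration of the run-length features, the sketch below computes mean block lengths from a binarised sequence. It assumes that the "expected" run length is the arithmetic mean over runs and that the ratio is methylated over unmethylated; the exact definitions belong to Table 7 and may differ.

```python
from itertools import groupby

def run_length_features(bits):
    """Mean lengths of runs of 1s and 0s in a binary sequence, plus their ratio."""
    runs = {0: [], 1: []}
    for value, group in groupby(bits):
        runs[value].append(sum(1 for _ in group))  # length of this run
    mean_1 = sum(runs[1]) / len(runs[1]) if runs[1] else 0.0
    mean_0 = sum(runs[0]) / len(runs[0]) if runs[0] else 0.0
    ratio = mean_1 / mean_0 if mean_0 > 0 else float("inf")
    return {
        "expected_run_length_1": mean_1,
        "expected_run_length_0": mean_0,
        "run_length_ratio": ratio,
    }

balanced = run_length_features([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])
skewed = run_length_features([1, 1, 1, 1, 0, 0])
```

The dictionary keys reuse the feature names from Table 5 so that the output can be matched to the feature list.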
Spatial characteristics are obtained directly from the values of entropic parameters; the properties of these characteristics are presented in Table 8.
The classification results are presented in Table 9.
Feature importance estimated using Random Forest is given in Table 10.
It should be noted that due to class imbalance (73.4% E4.5 cells), a naive classifier always predicting E4.5 would achieve an accuracy of 73.4%. The SVM result of 93.4% substantially exceeds this baseline, and the high recall for E6.5 (91.7%) confirms that the model does not ignore the minority class.
It is important to note that feature importance estimation in Random Forest under strong multicollinearity (correlation Shannon/Renyi/Tsallis/Spatial gradient > 0.995) may be biased: correlated features compete for predictive contribution, which deflates their individual importance. Therefore, the dominance of structural features in Table 10 reflects not only their biological significance but also the absence of redundant correlation within this group.
Based on the conducted analysis, the following conclusions can be drawn:
  • SVM with RBF kernel demonstrated the best results among all tested models (Accuracy = 93.4%, AUC = 0.981), indicating near-perfect separation of stages E4.5 and E6.5 based on the proposed feature set.
  • Structural characteristics of methylation, in particular the expected length of methylated blocks (expected_run_length_1), proved to be the most informative features, surpassing entropic measures in importance.
  • The coefficient of variation (cv_methylation) and mean methylation level (mean_methylation) are among the top five most important features, indicating the significance of not only diversity but also the central tendency of the methylation mark distribution.
  • The high sensitivity of the model to E6.5 cells (Recall = 91.7%) confirms that the proposed feature set reliably identifies cells at the gastrulation stage — the point of epigenetic rejuvenation “ground zero.”
  • The obtained results demonstrate the fundamental possibility of using the structural-entropic approach for developing diagnostic algorithms for assessing cell developmental stage and, potentially, for evaluating the efficacy of rejuvenating interventions.

3.5. Topological Analysis of the Multidimensional Space of Entropic Features

To assess the global structure of cell distribution in the five-dimensional space of entropic measures (Shannon, Renyi, Tsallis, LZ complexity, Spatial gradient entropy), we applied persistent homology methods [13]. From the matrix of Euclidean distances between all 364 cells, a Rips complex was constructed, and persistence diagrams were computed in dimensions H₀ (connected components) and H₁ (cyclic structures: one-dimensional loops that cannot be contracted to a point).
In the combined E4.5+E6.5 sample, 375 H₀ intervals and 82 H₁ intervals were obtained; PE for H₁ was 5.16. In the separate analysis of stages (Table 11), E4.5 was characterised by 276 H₀ and only 2 H₁ intervals (PE H₁ = 0.29), while E6.5 showed 96 H₀ and 9 H₁ intervals (PE H₁ = 2.25). The emergence of H₁-cycles at E6.5 indicates the appearance of non-trivial topological structure in the distribution of entropic features, consistent with the formation of different cell lineages (epiblast, endoderm, mesoderm), each occupying its own region in the space of entropic parameters.
Footnote to Table 11. Unbalanced sample sizes: n=267 (E4.5) vs n=97 (E6.5). See Table 12 for balanced subsampling
Null model validation. To assess whether the observed H₁-cycles are artefacts of random fluctuations, we performed two permutation-based null model analyses. First, column-wise permutation (shuffling feature values independently within each column) revealed that the real persistent entropy (PE) for H₁ was consistently lower than the null distribution for both stages (p = 1.0). This counter-intuitive result indicates that the epigenetic landscape is more structured than a random point cloud — the observed cyclic structures are not noise but reflect genuine, regular patterns of cell organization.
Second, stratified label permutation (shuffling stage assignments while preserving group sizes: 267 E4.5 and 97 E6.5) showed that neither the difference in PE H1 (p = 0.78) nor the difference in the number of H₁ intervals (p = 0.91) was statistically significant. This indicates that the quantitative differences between stages are primarily driven by the imbalance in sample size, rather than by a biological difference in topological complexity.
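The label-permutation logic can be sketched generically. In the study the group statistic is a topological quantity (such as PE of H₁) computed on each group of cells; in the sketch any function of a group of values stands in for it, and the mean is used purely for demonstration. The function name, the number of permutations, and the seed are assumptions.

```python
import random
from statistics import mean

def label_permutation_pvalue(group_a, group_b, statistic, n_perm=1000, seed=0):
    """Two-sided permutation p-value for |statistic(a) - statistic(b)|,
    shuffling group labels while preserving the two group sizes."""
    rng = random.Random(seed)
    observed = abs(statistic(group_a) - statistic(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        permuted = abs(statistic(pooled[:n_a]) - statistic(pooled[n_a:]))
        if permuted >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

separated = label_permutation_pvalue([0.0] * 20, [10.0] * 20, mean)
identical = label_permutation_pvalue([1, 2, 3, 4] * 5, [1, 2, 3, 4] * 5, mean)
```

Because group sizes are preserved in every shuffle, any part of the observed difference that arises purely from unequal sample sizes is reproduced in the null distribution.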
Comprehensive topological characterisation. To fully characterise the H₁-cycles, we computed seven topological metrics, comparing E6.5 with balanced subsamples of E4.5 matched for sample size (n = 97). The results are summarised in Table 12.
The number of H₁ cycles did not differ between stages (E4.5 balanced: 16.87 ± 2.59; E6.5: 16; p = 0.85), confirming that the appearance of cyclic structures is not an artefact of sample size. Similarly, normalised persistent entropy, which measures the evenness of cycle size distribution, remained unchanged (E4.5: 0.855 ± 0.035; E6.5: 0.882; p = 0.45). However, maximum persistence decreased from 0.444 ± 0.102 to 0.218 (p = 0.03), and average persistence decreased from 0.130 ± 0.021 to 0.067 (p < 0.001). Total persistence — the sum of all cycle lifetimes — decreased from 2.170 ± 0.353 to 1.070 (p < 0.001).
Interpretation. The topological signature of gastrulation is not an increase in the number of cycles or a change in their size distribution, but a reduction in cycle lifetimes. At E4.5, cycles are longer-lived, with one dominant structure (max persistence 0.446). At E6.5, cycles are shorter-lived (avg persistence 0.067), and the dominant structure disappears (max persistence 0.218). We interpret this as an indication that the epigenetic landscape at E4.5 is characterised by stable, long-lasting topological features, while at E6.5 these features become more transient. The number of cycles remains constant (~17 → ~16), but the cycles themselves are less persistent. This pattern (stable count but reduced lifetime) is consistent with a transition from a rigid, homogeneous landscape to a more dynamic, diversified one, where cells move through distinct regions of the feature space corresponding to different lineages.
Notably, persistent entropy at stage E6.5 showed significant positive correlations with the expression of the DNA repair genes Xrcc3 (r = 0.358, q = 0.002), Rad17 (r = 0.279, q = 0.020), and H2ax (r = 0.259, q = 0.024); the correlation with Gadd45a (r = 0.156, q = 0.222) did not reach significance. This suggests a link between topological complexity of the methylation landscape and DNA damage response activity during gastrulation.

3.6. Control for a Possible Confounder Due to Cell-Type Heterogeneity

Comparison of stages E4.5 and E6.5 is accompanied not only by changes in developmental timing but also by the appearance of differentiated cell lineages (mesoderm, endoderm, ectoderm). To rule out the possibility that the observed decrease in entropy is explained exclusively by changes in cell composition, we performed two control analyses using cell-type annotations (lineage10x) from the original study by Argelaguet et al. (2019) [10].
Analysis within a single cell type (epiblast). Epiblast is the only population present in substantial numbers at both stages (E4.5: 188 cells; E6.5: 55 cells). Mean Shannon entropy in epiblast decreased from 0.8289 (E4.5) to 0.7679 (E6.5). The difference is highly significant (t-test: $p = 2 \times 10^{-5}$, Mann–Whitney U-test: $p < 10^{-4}$). Thus, the effect persists even under strict control of cell type.
Regression analysis with a “cell type” covariate. For all cells with known identity ( n = 348 ), a simple linear regression model was constructed: Shannon entropy ~ stage + cell_type. Or, more formally: to quantitatively assess the contribution of developmental stage to changes in DNA methylation entropy, taking into account the possible confounding effect of cell type, a linear regression model of the following form was constructed:
$$H_i = \beta_0 + \beta_1\, I(\mathrm{stage}_i = \mathrm{E6.5}) + \sum_{k=1}^{K-1} \gamma_k\, I(\mathrm{cell\_type}_i = c_k) + \varepsilon_i,$$
where:
  • $H_i$: the value of Shannon entropy (bits) for the $i$-th cell;
  • $\beta_0$: the intercept, corresponding to the mean entropy level in the reference group (epiblast cells at stage E4.5);
  • $\beta_1$: the coefficient for the stage dummy variable, reflecting the change in entropy at stage E6.5 relative to E4.5 after controlling for cell type;
  • $I(\mathrm{stage}_i = \mathrm{E6.5})$: the stage indicator, equal to 1 for cells at stage E6.5 and 0 for cells at stage E4.5;
  • $\gamma_k$: the coefficients for the cell-type dummy variables $I(\mathrm{cell\_type}_i = c_k)$, where $c_k$ is the $k$-th non-reference cell type (Primitive_Streak, Primitive_endoderm, Visceral_endoderm); epiblast was chosen as the reference type;
  • $K$: the total number of cell types ($K = 4$ in the analysis);
  • $\varepsilon_i$: random error, assumed to be independent and normally distributed with zero mean and constant variance, $\varepsilon_i \sim N(0, \sigma^2)$.
The model was estimated by ordinary least squares, and the statistical significance of the coefficient β₁ was assessed using the t-test. A significant negative β₁ indicates an independent effect of stage that cannot be explained by differences in cell composition. The coefficient for stage (E6.5 relative to E4.5) was −0.052 (SE 0.012, p < 0.001; Table 13), confirming the independent influence of developmental stage after accounting for cellular heterogeneity. Coefficients for individual cell types (Primitive_Streak, Primitive_endoderm, Visceral_endoderm) are also given in Table 13. Both analyses show that the decrease in methylation entropy is not an artefact of differentiation but reflects real epigenetic ordering as the embryo approaches “ground zero”.
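The dummy-coded regression described above can be sketched in a few lines of pure Python. This is a minimal illustration, not the authors' pipeline: the entropy values in `toy_cells` are invented and noiseless (constructed with a stage effect of −0.05), and the helper names `design_row`, `solve`, and `fit_ols` are hypothetical. Epiblast is the reference type, so it has no dummy column.

```python
# Toy OLS fit of entropy ~ stage + cell_type with dummy coding.
# All entropy values below are invented for illustration; they are NOT the study data.

# Non-reference cell types (epiblast is the reference and gets no dummy column).
OTHER_TYPES = ["Primitive_Streak", "Primitive_endoderm", "Visceral_endoderm"]


def design_row(stage, cell_type):
    """Return [1, I(stage = E6.5), I(type = c_1), ..., I(type = c_{K-1})]."""
    stage_dummy = 1.0 if stage == "E6.5" else 0.0
    type_dummies = [1.0 if cell_type == t else 0.0 for t in OTHER_TYPES]
    return [1.0, stage_dummy] + type_dummies


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(cells):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    x = [design_row(stage, cell_type) for stage, cell_type, _ in cells]
    y = [entropy for _, _, entropy in cells]
    p = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * h for row, h in zip(x, y)) for i in range(p)]
    return solve(xtx, xty)


# (stage, cell type, Shannon entropy): noiseless toy data, true stage effect -0.05
toy_cells = [
    ("E4.5", "Epiblast", 0.83), ("E4.5", "Epiblast", 0.83),
    ("E6.5", "Epiblast", 0.78), ("E6.5", "Epiblast", 0.78),
    ("E4.5", "Primitive_endoderm", 0.86),
    ("E6.5", "Primitive_Streak", 0.80),
    ("E6.5", "Visceral_endoderm", 0.75),
]

beta = fit_ols(toy_cells)
print("intercept (epiblast, E4.5):", round(beta[0], 4))
print("stage coefficient beta_1:", round(beta[1], 4))
```

Because epiblast is observed at both stages, the stage coefficient is identifiable even though the other cell types occur at one stage only. A real analysis would also need standard errors and a t-test for β₁, which this sketch omits.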

3.7. Regional Entropy and Disorder Dynamics

To further characterize the epigenetic landscape at different genomic scales, we computed regional entropy (RE) and regional disorder (RD) following the approach of Bertucci-Richter et al. [14]. For each cell, the genome was divided into windows of 200 bp, 1 kb, and 5 kb; RE (Shannon entropy of methylation states within each window) and RD (proportion of disordered neighbour pairs within each window) were averaged across all windows to obtain cell-level values.
Strikingly, both RE and RD decreased significantly from E4.5 to E6.5 across all window sizes (Table 14):
These results show that, at the single-cell level, regional disorder — like global entropy — decreases during gastrulation. This contrasts with the findings of Bertucci-Richter et al. [14], who reported an increase in RE/RD during mouse gastrulation using bulk RRBS data. The resolution of this apparent contradiction is discussed in Section 4.5.
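The cell-level RE and RD computation described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: methylation states are already binarised (0/1), windows are fixed genomic bins of `window_bp` base pairs, and windows with fewer than `min_sites` CpGs are skipped. The function names and the `min_sites` threshold are hypothetical and are not taken from the original code or from [14].

```python
from collections import defaultdict
from math import log2


def window_entropy(states):
    """Shannon entropy (bits) of the binary methylation states in one window."""
    p = sum(states) / len(states)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def window_disorder(states):
    """Proportion of neighbouring CpG pairs whose states disagree."""
    pairs = list(zip(states, states[1:]))
    return sum(a != b for a, b in pairs) / len(pairs)


def cell_re_rd(sites, window_bp=200, min_sites=2):
    """Average RE and RD over all usable windows of one cell.

    sites: iterable of (genomic position, binary state) for a single chromosome.
    """
    windows = defaultdict(list)
    for pos, state in sorted(sites):
        windows[pos // window_bp].append(state)
    usable = [w for w in windows.values() if len(w) >= min_sites]
    re = sum(window_entropy(w) for w in usable) / len(usable)
    rd = sum(window_disorder(w) for w in usable) / len(usable)
    return re, rd


# Two invented cells covering the same eight CpG positions.
ordered_cell = [(10, 1), (50, 1), (120, 1), (190, 1),
                (210, 0), (260, 0), (300, 0), (390, 0)]
disordered_cell = [(10, 1), (50, 0), (120, 1), (190, 0),
                   (210, 0), (260, 1), (300, 0), (390, 1)]

print("ordered cell (RE, RD):", cell_re_rd(ordered_cell))
print("disordered cell (RE, RD):", cell_re_rd(disordered_cell))
```

In the toy data both cells have the same mean methylation (0.5), yet the ordered cell has zero regional entropy and disorder while the alternating cell reaches the maximum of both, which is the kind of difference RE and RD are meant to capture.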

4. Discussion

4.1. Confirmation of the Hypothesis

The results support the hypothesis that DNA methylation entropy decreases as the embryo approaches “ground zero”, the point of minimum biological age in embryogenesis. This is consistent with the concept of epigenetic rejuvenation as a process of ordering the epigenetic landscape: at the early E4.5 stage, high diversity of methylation patterns (“noise”) is observed, whereas by the gastrulation stage E6.5, cells attain a more ordered state (“blank slate”). Our results agree with a recent biophysical model, according to which decreasing entropy corresponds to an increase in the Flory–Huggins parameter characterising chromatin phase separation [9]. The approach is further supported by the classification of stages E4.5 and E6.5 from 15 features, including entropic and structural characteristics of methylation (Section 3.4). The best model, an SVM with a radial kernel, achieved an accuracy of 93.4% and an AUC of 0.981 (Table 9), indicating strong separation of the stages. The most important features, ranked by Random Forest importance (Table 10), were structural characteristics: length of methylated blocks (expected_run_length_1), coefficient of variation (cv_methylation), and mean methylation level (mean_methylation).

4.2. Interpretation of LZ Complexity

The low correlation of binary LZ complexity with the other entropic measures (r ≈ 0.13; Table 3) indicates that this measure captures a different aspect of epigenetic information, related to algorithmic compressibility and the presence of repetitive structures, unlike statistical measures of diversity. Figure 3 shows the dynamics of LZ complexity. Like Shannon entropy, this measure decreases from stage E4.5 to stage E6.5, confirming the general trend of ordering of the epigenetic landscape. Unlike the other measures, the interquartile range of LZ complexity remains nearly unchanged at stage E6.5 (−3.5%; Table 4), indicating the preservation of variability in the algorithmic structure of methylation at the “ground zero” point.
A crucial methodological question is whether the observed decrease in LZ complexity is an artefact of binary discretisation. To address this, we repeated the analysis using a ternary alphabet that preserves hemimethylated states (0.5). The decrease remained highly significant (p < 0.001), although it was smaller in magnitude (−11.4% for the ternary encoding versus −28.4% for the binary encoding; Table 4). This robustness confirms that the epigenetic ordering from E4.5 to E6.5 is not an artefact of threshold choice. Moreover, the ternary LZ showed substantially higher correlation with Shannon entropy (r = 0.71 vs. r = 0.13 for binary), indicating that accounting for partial methylation states aligns algorithmic complexity more closely with statistical entropy. This finding supports the biological relevance of hemimethylation as a distinct epigenetic state rather than a technical artefact, consistent with the role of 5-hydroxymethylcytosine in epigenetic reprogramming [6].
It is important to emphasise that all investigated complexity measures (Shannon entropy, Renyi entropy (α = 2), Tsallis entropy (q = 2), LZ complexity, and Spatial gradient entropy) demonstrate a statistically significant decrease from stage E4.5 to stage E6.5 (t-test, p < 0.01 for each measure). At the same time, binary LZ complexity stands out from the rest: its median decreases most markedly (−28.4% versus −2.2% to −11.0% for the other four measures), and its t-test p-value is the smallest (0.0006; Table 2).
This suggests that LZ complexity possesses enhanced sensitivity to the processes of ordering of the epigenetic landscape at the rejuvenation point. The coordinated change of all five measures, despite differences in the behaviour of variability, supports the main hypothesis that approaching “ground zero” is accompanied by ordering of the epigenetic landscape.
The ternary LZ approach bridges the gap between algorithmic and statistical measures. While binary LZ correlated poorly with Shannon entropy (r = 0.13), ternary LZ showed substantially higher correlation (r = 0.71). This indicates that the distinction between LZ complexity and Shannon entropy is partially an artefact of binary discretisation: when hemimethylated states are preserved, algorithmic complexity captures information largely consistent with statistical entropy. However, the remaining difference (r = 0.71, not 1.0) reflects the sensitivity of LZ complexity to the order of methylation states along the genome — a dimension that Shannon entropy, based solely on state frequencies, cannot capture. Thus, ternary LZ offers a structural complement to Shannon entropy rather than a fully independent measure.
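To make the binary versus ternary comparison concrete, the sketch below counts Lempel-Ziv (1976) phrases [12] for the same methylation profile under both encodings. It is an illustration under assumptions rather than the authors' implementation: the cut-offs in `encode_binary` and `encode_ternary`, the normalisation c · log_k(n) / n, and the toy `levels` profile are all hypothetical choices.

```python
from math import log


def lz76(s):
    """Number of phrases in the Lempel-Ziv (1976) exhaustive parsing of string s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the phrase while it can be copied from earlier text (overlap allowed).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def encode_binary(levels, threshold=0.5):
    """Assumed rule: levels at or above the threshold count as methylated."""
    return "".join("1" if x >= threshold else "0" for x in levels)


def encode_ternary(levels):
    """Assumed rule: keep intermediate (hemimethylated) levels as a third symbol."""
    return "".join("0" if x < 1 / 3 else "2" if x > 2 / 3 else "1" for x in levels)


def normalised_lz(s, alphabet_size):
    """One common normalisation, c * log_k(n) / n; assumed, not confirmed by the source."""
    n = len(s)
    return lz76(s) * (log(n) / log(alphabet_size)) / n


# Invented per-CpG methylation levels for one cell.
levels = [0.0, 0.5, 1.0, 0.5, 0.0, 0.5, 1.0, 0.5, 0.0, 1.0, 1.0, 0.5]

binary, ternary = encode_binary(levels), encode_ternary(levels)
print(binary, lz76(binary), round(normalised_lz(binary, 2), 3))
print(ternary, lz76(ternary), round(normalised_lz(ternary, 3), 3))
```

The parser reproduces the worked example in Lempel and Ziv's paper, where the sequence 0001101001000101 decomposes into six phrases. Because LZ76 depends on the order of symbols and not only on their frequencies, two sequences with identical Shannon entropy can differ in complexity, which is the residual difference discussed above.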

4.3. Topological Complexity of the Space of Entropic Features

Application of persistent homology to the five-dimensional point cloud constructed from the entropic measures of all cells revealed a qualitative difference between stages. At E4.5, the distribution of cells contains virtually no cycles (H₁ are almost absent, PE H₁ = 0.29, only 2 intervals), confirming the view of the early embryo as a collection of epigenetically homogeneous cells. At E6.5, conversely, stable H₁-cycles appear (PE H₁ = 2.25, 9 intervals), reflecting the formation of several distinguishable epigenetic clusters.
Null model validation confirmed that the H₁-cycles are genuine topological features rather than random fluctuations: column-wise permutation showed that the real data are significantly more structured than random (p = 1.0). However, the quantitative differences in PE H₁ (p = 0.78) and the number of intervals (p = 0.91) were not significant when accounting for sample size imbalance through stratified label permutation.
Comprehensive characterisation using balanced subsampling (Table 12) revealed a more nuanced picture: the number of cycles did not change (~17 → 16), but their persistence dropped sharply. Normalised PE showed no significant change (0.855 ± 0.035 → 0.882, p = 0.45), whereas maximum persistence fell from 0.444 ± 0.102 to 0.218 (p = 0.03), average persistence from 0.130 ± 0.021 to 0.067 (p < 0.001), and total persistence from 2.170 ± 0.353 to 1.070 (p < 0.001). The halving of maximum persistence points to the loss of a dominant, long-lived cycle present at E4.5, while the stability of normalised PE (p = 0.45) indicates that the evenness of the cycle size distribution is largely preserved. What changes is therefore not the diversity of cycle sizes but their lifetime, consistent with a transition from stable, long-lived structures to transient, lineage-specific configurations.
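The quantities compared in Table 12 can all be derived from a list of (birth, death) intervals. The sketch below is illustrative only: the intervals are invented, a base-2 logarithm is assumed for persistent entropy, and `persistence_summary` is a hypothetical helper rather than a GUDHI function. It shows why normalised PE can stay constant while maximum and total persistence fall: PE depends only on the relative lifetimes, so shortening every cycle by the same factor leaves it unchanged.

```python
from math import log2


def persistence_summary(intervals):
    """Summary statistics of a persistence diagram given as (birth, death) pairs."""
    lifetimes = [death - birth for birth, death in intervals if death > birth]
    total = sum(lifetimes)
    probs = [life / total for life in lifetimes]
    pe = -sum(p * log2(p) for p in probs)  # persistent entropy (bits, assumed base 2)
    n = len(lifetimes)
    return {
        "n_intervals": n,
        "persistent_entropy": pe,
        "normalised_pe": pe / log2(n) if n > 1 else 0.0,
        "max_persistence": max(lifetimes),
        "average_persistence": total / n,
        "total_persistence": total,
    }


# Invented H1 intervals: the second diagram has every lifetime halved.
long_lived = [(0.0, 0.4), (0.0, 0.2), (0.0, 0.1), (0.0, 0.1)]
short_lived = [(birth, death / 2) for birth, death in long_lived]

before = persistence_summary(long_lived)
after = persistence_summary(short_lived)
for key in ("normalised_pe", "max_persistence", "total_persistence"):
    print(key, round(before[key], 3), "->", round(after[key], 3))
```

In this toy case the number of intervals and the normalised PE are identical in both diagrams, while maximum and total persistence are halved, which mirrors the qualitative pattern reported in Table 12.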
This topological transition is the mathematical fingerprint of structured diversification. At E4.5, cells occupy a relatively homogeneous region of the entropic feature space, dominated by one large-scale structure. At E6.5, this structure dissolves, and cells spread into multiple distinct regions, each corresponding to a different lineage. The decrease in global entropy and the topological reorganisation together describe a process in which the genome becomes more ordered at the whole-cell level while simultaneously diversifying into lineage-specific configurations.

4.4. Global Entropy Versus Regional Disorder: Reconciling Scales

A recent study by Bertucci-Richter et al. [14] reported that epigenetic clocks based on regional entropy (RE) and regional disorder (RD) show an increase in predicted age during mouse gastrulation, while clocks based on mean methylation (CpG, RM) show a decrease corresponding to “ground zero”. At first glance, this appears to contradict our finding that global Shannon entropy decreases from E4.5 to E6.5.
However, this apparent contradiction arises from a fundamental difference in the scale of analysis. Our global entropy is computed genome-wide per cell, capturing the distribution of methylation states across all CpG sites. The RE and RD metrics of Bertucci-Richter et al. operate on regional scales (typically 200-bp windows) and capture local disorder, that is, the variability of methylation within each region.
These two phenomena are not contradictory but complementary:
  • Global entropy decrease reflects consolidation of the genome-wide methylation landscape. As the embryo approaches “ground zero”, the distribution of methylation states across the entire genome becomes more concentrated (lower entropy), consistent with the “blank slate” hypothesis.
  • The regional disorder increase reported in [14] reflects the emergence of cell-type-specific epigenetic patterns. During gastrulation, different genomic regions acquire distinct methylation patterns as cells commit to specific lineages (epiblast, mesoderm, endoderm).
This interpretation is directly supported by our topological analysis (Section 3.5). The emergence of stable H₁-cycles at stage E6.5 — absent at E4.5 — indicates that the epigenetic landscape is not simply homogenising but is undergoing structured diversification. Cells at E6.5 do not collapse into a single identical state; rather, they occupy distinct regions in the entropic feature space, corresponding to different cell lineages. Global entropy decreases while topological complexity increases, reflecting the formation of an ordered but diverse epigenetic landscape.
Thus, global entropy and regional disorder capture distinct, complementary aspects of epigenetic reorganisation during development. Our findings of global entropy reduction and the appearance of H₁-cycles together paint a picture of epigenetic rejuvenation as a process of ordered diversification: the genome becomes more structured at the global level while regional heterogeneity emerges in a controlled, lineage-specific manner.

4.5. Regional Disorder Dynamics: Reconciling with Bertucci-Richter et al. (2024)

Bertucci-Richter et al. [14] reported that regional entropy (RE) and regional disorder (RD) increase during mouse gastrulation, while clocks based on mean methylation (CpG, RM) show a decrease corresponding to “ground zero”. Our findings, based on scNMT-seq data, show the opposite: cell-averaged RE and RD decrease significantly from E4.5 to E6.5 (RE: −25.5%, RD: −27.4%, p < 10⁻¹³).
This apparent contradiction can be resolved by considering the level of aggregation:
  • Bertucci-Richter et al. used bulk RRBS data from whole embryos. The observed increase in RE/RD reflects the emergence of inter-cellular heterogeneity: as different cell lineages (epiblast, mesoderm, endoderm) appear, the average methylation signal across the bulk sample becomes more diverse.
  • Our analysis uses single-cell scNMT-seq data, allowing us to compute cell-averaged RE/RD, where regional values (200-bp windows) are averaged across all windows within each cell. The decrease we observe reflects consolidation of methylation patterns within each cell: each cell becomes epigenetically more ordered as it commits to a specific lineage.
  • This is consistent with our topological analysis (Section 3.5): the emergence of H₁-cycles at E6.5 indicates that the epigenetic landscape is not homogenising but rather structuring into distinct, ordered compartments. Within-cell entropy decreases while between-cell diversity increases, reflecting a transition from a homogeneous “noisy” state to a structured “ordered” state with multiple cell types.
Thus, the findings of Bertucci-Richter et al. and ours are complementary: theirs capture the between-cell dimension of epigenetic reorganisation (an increase in inter-cellular heterogeneity), while ours capture the within-cell dimension (a decrease in intra-cellular disorder; Section 3.7). The within-cell decrease is precisely what would be expected if each cell consolidates its epigenetic landscape as it commits to a specific lineage. The two results are therefore two sides of the same coin, ordered diversification, and both are needed to understand the full picture of epigenetic rejuvenation.
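The aggregation argument above can be illustrated with a deliberately simple toy model. The sketch below is not a model of RRBS read-level RE/RD as computed in [14]; it only shows that entropy averaged within cells and entropy of the pooled signal can move in opposite directions. The cell profiles and the function names are invented for illustration.

```python
from math import log2


def binary_entropy(states):
    """Shannon entropy (bits) of a list of binary methylation states."""
    p = sum(states) / len(states)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def within_cell_entropy(cells):
    """Single-cell view: entropy is computed per cell, then averaged."""
    return sum(binary_entropy(cell) for cell in cells) / len(cells)


def pooled_entropy(cells):
    """Bulk-like view: states from all cells are pooled before computing entropy."""
    return binary_entropy([state for cell in cells for state in cell])


# One genomic window with four CpGs, four cells per stage (invented values).
early = [[1, 1, 1, 0]] * 4                    # identical, slightly noisy cells
late = [[1, 1, 1, 1]] * 2 + [[0, 0, 0, 0]] * 2  # two fully ordered lineages

for name, cells in (("early", early), ("late", late)):
    print(name,
          "within-cell:", round(within_cell_entropy(cells), 3),
          "pooled:", round(pooled_entropy(cells), 3))
```

In the toy data the within-cell entropy falls to zero at the late stage because every cell is internally ordered, while the pooled entropy rises because the two lineages carry opposite patterns.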

4.6. Limitations of the Study

The study has several limitations: (1) the analysed dataset lacks methylation data for stages E5.5, E6.75, and E7.5, which precludes testing the right branch of the U-shaped curve; (2) LZ complexity requires separate in-depth analysis; (3) despite the high classification accuracy (93.4%), clinical application requires simplification of the feature set and additional validation on independent samples. One possible confounder, the change in cell composition across stages, was controlled for, and the decrease in entropy was confirmed not to be attributable to differentiation (Section 3.6).

5. Conclusions

DNA methylation entropy decreases statistically significantly from the early stage E4.5 to the gastrulation stage E6.5 (t-test, p < 0.01 for Shannon entropy, Renyi entropy, Tsallis entropy, LZ complexity, and Spatial gradient entropy). The most pronounced and statistically robust changes were observed for LZ complexity: −28.4% (binary) and −11.4% (ternary), both with p < 0.001. This consistent sensitivity highlights its potential for tracking the epigenetic reorganisation that accompanies the approach to “ground zero”. The consistency of results across binary and ternary encodings confirms the robustness of this finding. Persistent entropy also decreases significantly (from 15.91 ± 3.20 at E4.5 to 14.89 ± 3.67 at E6.5; t-test, p = 0.010), confirming the overall dynamics of ordering of the epigenetic landscape. Based on 15 features, including entropic and structural characteristics of methylation, an SVM model with a radial kernel was trained, achieving classification of stages E4.5 and E6.5 with an accuracy of 93.4% and an AUC of 0.981, confirming the diagnostic potential of the approach.
The results correspond to the approach to “ground zero”, the point of minimum biological age in embryogenesis, and support the hypothesis of a link between decreasing entropy and epigenetic rejuvenation. Topological analysis revealed a qualitative reorganisation of the entropic feature space: at E6.5, H₁-cycles emerge (9 intervals) that are largely absent at E4.5 (2 intervals). Comprehensive characterisation showed a significant decrease in maximum persistence (0.444 → 0.218, p = 0.03) and highly significant decreases in average persistence (0.130 → 0.067, p < 0.001) and total persistence (2.170 → 1.070, p < 0.001) when comparing E6.5 with balanced subsamples of E4.5. Normalised PE did not change significantly (0.855 → 0.882, p = 0.45). While null model validation confirmed that the data are significantly more structured than random, the quantitative differences between stages were sensitive to sample size imbalance. We interpret these findings as suggestive of structured diversification, a process in which the epigenetic landscape becomes more ordered at the global level while diversifying into lineage-specific configurations. Consistent with this, regional entropy and disorder at the single-cell level decrease from E4.5 to E6.5 (RE: −25.5%, RD: −27.4%, p < 10⁻¹³), while global entropy also decreases. Together, these findings paint a picture of epigenetic rejuvenation as a process of ordered consolidation at the whole-genome scale, accompanied by lineage-specific diversification.
The proposed structuro-entropic framework may serve as a basis for developing algorithms for assessing cell developmental stage and, potentially, for evaluating the efficacy of rejuvenating interventions, consistent with modern biophysical models of epigenetic rejuvenation [9].

Author Contributions

Conceptualization, A.T., A.B.; methodology, A.T.; software, A.T.; validation, A.T. and A.B.; formal analysis, A.T.; investigation, A.T.; resources, A.B.; data curation, A.A.; writing—original draft preparation, A.T.; writing—review and editing, A.B.; supervision, A.A.; project administration, A.A., A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was conducted without financial support from governmental, commercial, or non-profit organizations.

Data Availability Statement

The Python source code for the project is available in the repository https://github.com/andytimoffilim/Epigenetics_Entropy. Data were retrieved from the NCBI Gene Expression Omnibus (GEO accession: GSE121690).

Acknowledgments

The authors thank the administration of LLC “Center for AI for SCO+ Countries”, Saint Petersburg, for providing computational resources and organizational support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Horvath, S. DNA methylation age of human tissues and cell types. Genome Biol. 2013, 14(10), R115.
  2. Kerepesi, C.; Gladyshev, V.N. Epigenetic clocks as biomarkers of aging. Nat. Rev. Genet. 2023, 24(5), 309–324.
  3. Kerepesi, C.; Zhang, B.; Lee, S.G.; Trapp, A.; Gladyshev, V.N. Epigenetic clock analysis in mice as a function of age and germline rejuvenation. Sci. Adv. 2021, 7(15), eabg6088.
  4. Guo, F.; et al. Active and passive demethylation of male and female pronuclear DNA in the mammalian zygote. Cell Stem Cell 2014, 15(4), 447–459.
  5. Reik, W.; Dean, W.; Walter, J. Epigenetic reprogramming in mammalian development. Science 2001, 293(5532), 1089–1093.
  6. Tahiliani, M.; et al. Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science 2009, 324(5929), 930–935.
  7. Liesenfelder, S.; Elsafi Mabrouk, M.H.; Iliescu, J.; et al. Epigenetic editing at individual age-associated CpGs affects the genome-wide epigenetic aging landscape. Nat. Aging 2025, 5, 997–1009.
  8. Chan, J.; Rubbi, L.; Pellegrini, M. DNA methylation entropy is a biomarker for aging. Aging-US 2025, 17(6).
  9. Singh, P.P. A Biophysics of Epigenetic Rejuvenation. Cells 2025, 14(16), 1249.
  10. Argelaguet, R.; Clark, S.J.; Mohammed, H.; Stapel, L.C.; et al. Multi-omics profiling of mouse gastrulation at single-cell resolution. Nature 2019, 576(7787), 487–491.
  11. Clark, S.J.; Argelaguet, R.; Kapourani, C.A.; et al. scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 2018, 9, 781.
  12. Lempel, A.; Ziv, J. On the complexity of finite sequences. IEEE Trans. Inf. Theory 1976, 22(1), 75–81.
  13. Maria, C.; et al. The GUDHI library: persistent homology and beyond. 27th International Conference on Supercomputing (ICSC), 2014; pp. 167–176.
  14. Bertucci-Richter, E.M.; Shealy, E.P.; Parrott, B.B. Epigenetic drift underlies epigenetic clock signals, but displays distinct responses to lifespan interventions, development, and cellular dedifferentiation. Aging 2024, 16(2), 1002–1020.
Figure 1. Summary dynamics of five entropic measures during mouse embryogenesis. All indicators — Shannon entropy, Renyi entropy ( α = 2 ), Tsallis entropy ( q = 2 ), LZ complexity, and Spatial gradient entropy — demonstrate a consistent decrease from stage E4.5 to stage E6.5 (red markers indicate medians). Data correspond to 364 cells (E4.5, n = 267 ; E6.5, n = 97 ).
Figure 2. Dynamics of Shannon entropy during mouse embryogenesis. The boxplot shows the distribution of entropy values at stages E4.5 ( n = 267 ) and E6.5 ( n = 97 ). Red markers correspond to medians. The decrease in entropy (~4.3%) is statistically significant (t-test, p = 0.0028 ; Mann–Whitney U-test, p = 0.0003 ).
Figure 3. LZ Complexity Dynamics During Early Gastrulation (E4.5 to E6.5).
Table 1. Shannon entropy statistics by stage.
| Stage | Mean | Median | Standard deviation | n |
| --- | --- | --- | --- | --- |
| E4.5 | 0.8407 | 0.8434 | 0.0999 | 267 |
| E6.5 | 0.8047 | 0.7890 | 0.1038 | 97 |
Table 2. Statistical significance of differences between stages E4.5 and E6.5.
| Metric | p-value (t-test) | p-value (Mann–Whitney) | q-value |
| --- | --- | --- | --- |
| Shannon entropy | 0.0028 | 0.0003 | 0.004332 |
| Renyi entropy (α = 2) | 0.0038 | 0.0003 | 0.004576 |
| Tsallis entropy (q = 2) | 0.0029 | 0.0003 | 0.004332 |
| LZ complexity | 0.0006 | 0.0009 | 0.003482 |
| Spatial gradient entropy | 0.0028 | 0.0003 | 0.004332 |
| Persistent entropy (PE) | 0.0097 | 0.000002 | 0.009728 |
Table 3. Correlations between entropic measures.
| | Shannon | Renyi (α = 2) | Tsallis (q = 2) | LZ (binary) | LZ (ternary, norm.) | Spatial gradient entropy |
| --- | --- | --- | --- | --- | --- | --- |
| Shannon | 1.000 | 0.995 | 0.999 | 0.131 | 0.712 | 1.000 |
| Renyi (α = 2) | 0.995 | 1.000 | 0.997 | 0.130 | 0.711 | 0.995 |
| Tsallis (q = 2) | 0.999 | 0.997 | 1.000 | 0.131 | 0.712 | 0.999 |
| LZ (binary) | 0.131 | 0.130 | 0.131 | 1.000 | | 0.131 |
| LZ (ternary, norm.) | 0.712 | 0.711 | 0.712 | | 1.000 | 0.712 |
| Spatial gradient entropy | 1.000 | 0.995 | 0.999 | 0.131 | 0.712 | 1.000 |
Table 4. Dynamics of medians and interquartile ranges of entropic measures.
| Metric | Median (E4.5) | Median (E6.5) | Δ Median | IQR (E4.5) | IQR (E6.5) | Δ IQR |
| --- | --- | --- | --- | --- | --- | --- |
| Shannon entropy | 0.8434 | 0.7890 | −6.5% | 0.1278 | 0.1528 | +19.5% |
| Renyi entropy (α = 2) | 0.7260 | 0.6461 | −11.0% | 0.1966 | 0.2194 | +11.6% |
| Tsallis entropy (q = 2) | 0.3954 | 0.3610 | −8.7% | 0.0821 | 0.0958 | +16.7% |
| LZ complexity (binary, raw) | 49942 | 35759 | −28.4% | 42531.5 | 41055.0 | −3.5% |
| LZ complexity (ternary, norm.) | 0.7350 | 0.6510 | −11.4% | 0.1597 | 0.1605 | +0.5% |
| Spatial gradient entropy | 0.8094 | 0.7913 | −2.2% | 0.0426 | 0.0509 | +19.5% |
Table 6. Basic statistics.
| Feature | Description | Formula |
| --- | --- | --- |
| mean_methylation | Mean methylation level across all CpG sites in a cell | $\bar{r} = \frac{1}{N}\sum_{j=1}^{N} r_j$, where $r_j$ is the methylation level of the $j$-th site and $N$ is the number of CpG sites |
| std_methylation | Standard deviation of methylation levels | $\sigma = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(r_j - \bar{r}\right)^2}$ |
| n_CpG_sites | Total number of CpG sites in a cell | $N$ |
| cv_methylation | Coefficient of variation (normalised spread) | $CV = \sigma / \bar{r}$ |
Table 7. Properties of structural characteristics.
| Feature | Description | Formula |
| --- | --- | --- |
| expected_run_length_1 | Expected length of continuous blocks of ones (methylated regions) | For a Bernoulli process with probability $p = \bar{r}$: $E[\mathrm{run}_1] = \frac{1}{1-p}$, where $\mathrm{run}_1$ is a continuous block of consecutive methylated sites (value 1 after binarisation) |
| expected_run_length_0 | Expected length of continuous blocks of zeros (unmethylated regions) | $E[\mathrm{run}_0] = \frac{1}{p}$, where $\mathrm{run}_0$ is a continuous block of consecutive unmethylated sites (value 0) |
| run_length_ratio | Ratio of expected lengths of methylated and unmethylated blocks | $R = \frac{E[\mathrm{run}_1]}{E[\mathrm{run}_0]} = \frac{p}{1-p}$ |
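The run-length formulas in Table 7 can be checked numerically. The sketch below is illustrative: `expected_run_lengths` implements the Bernoulli expressions from the table, while `observed_mean_run_lengths` is a hypothetical helper (not described in the source) that measures the actual mean block lengths in a binarised sequence for comparison. The toy sequence is invented.

```python
from itertools import groupby


def expected_run_lengths(p):
    """Bernoulli expectations from Table 7 for methylation probability p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    run_1 = 1.0 / (1.0 - p)   # expected length of a methylated block
    run_0 = 1.0 / p           # expected length of an unmethylated block
    return run_1, run_0, p / (1.0 - p)


def observed_mean_run_lengths(states):
    """Mean observed block lengths (methylated, unmethylated) in a 0/1 sequence."""
    runs = {0: [], 1: []}
    for value, group in groupby(states):
        runs[value].append(len(list(group)))
    return sum(runs[1]) / len(runs[1]), sum(runs[0]) / len(runs[0])


# Invented binarised methylation states along the genome; mean methylation is 0.75.
states = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1]
p = sum(states) / len(states)

print("expected (run_1, run_0, ratio):", expected_run_lengths(p))
print("observed (run_1, run_0):", observed_mean_run_lengths(states))
```

Under the Bernoulli assumption all three features are functions of the mean methylation level alone, so a gap between expected and observed block lengths would indicate spatial clustering that the formula does not capture.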
Table 8. Properties of spatial characteristics.
| Feature | Description | Formula |
| --- | --- | --- |
| spatial_organization | Measure of spatial ordering of methylation | $S_{org} = 1 - \frac{H}{H_{max}}$, where $H$ is the Shannon entropy and $H_{max}$ is the maximum entropy at 20 bins |
| lz_shannon_ratio | Ratio of algorithmic complexity to statistical entropy | $R_{LZ/S} = \frac{LZ}{H}$, where $LZ$ is the LZ complexity and $H$ is the Shannon entropy |
| spatial_gradient_entropy_shannon_ratio | Ratio of spatial entropy to Shannon entropy | $R_{C/S} = \frac{S_{spatial}}{H}$, where $S_{spatial}$ is the local gradient entropy and $H$ is the Shannon entropy |
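As an illustration of the spatial_organization feature in Table 8, the sketch below computes a Shannon entropy from a 20-bin histogram of methylation levels and normalises it by the maximum entropy, log₂(20). How the original analysis bins the data is not stated here, so the equal-width binning on [0, 1], the base-2 logarithm, and the function names are assumptions.

```python
from math import log2

N_BINS = 20  # Table 8 refers to the maximum entropy at 20 bins


def shannon_entropy_binned(levels, n_bins=N_BINS):
    """Shannon entropy (bits) of methylation levels histogrammed on [0, 1]."""
    counts = [0] * n_bins
    for level in levels:
        counts[min(int(level * n_bins), n_bins - 1)] += 1
    total = len(levels)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def spatial_organization(levels, n_bins=N_BINS):
    """S_org = 1 - H / H_max, with H_max = log2(n_bins)."""
    return 1.0 - shannon_entropy_binned(levels, n_bins) / log2(n_bins)


# Invented profiles: fully methylated versus levels spread evenly over all bins.
uniform_state = [1.0] * 10
spread_out = [(i + 0.5) / N_BINS for i in range(N_BINS)]

print("single state:", spatial_organization(uniform_state))
print("evenly spread:", round(spatial_organization(spread_out), 6))
```

A profile confined to one bin gives the maximum value of 1, and a profile spread evenly over all 20 bins gives a value of approximately 0, so the feature increases as the distribution of methylation levels becomes more concentrated.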
Table 9. Classification results for stages E4.5 vs. E6.5 (15 features, SMOTE, Grid Search).
| Model | Accuracy | Precision | Recall (E6.5) | F1-score (E6.5) | AUC-ROC |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | 0.879 | 0.710 | 0.917 | 0.800 | 0.947 |
| SVM (RBF) | 0.934 | 0.846 | 0.917 | 0.880 | 0.981 |
| Random Forest | 0.791 | 0.581 | 0.750 | 0.655 | 0.873 |
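The F1-scores in Table 9 follow from the reported precision and recall as their harmonic mean, F1 = 2PR / (P + R). The short check below recomputes them from the rounded table values; the dictionary and function names are illustrative, and the precision column is assumed to refer to the E6.5 class.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 9.
table_9 = {
    "Logistic Regression": (0.710, 0.917, 0.800),
    "SVM (RBF)": (0.846, 0.917, 0.880),
    "Random Forest": (0.581, 0.750, 0.655),
}

for model, (precision, recall, reported) in table_9.items():
    recomputed = round(f1_score(precision, recall), 3)
    print(f"{model}: recomputed F1 = {recomputed}, reported F1 = {reported}")
```

For all three models the recomputed value agrees with the reported F1-score to three decimal places.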
Table 10. Feature importance by contribution to the Random Forest model.
| Rank | Feature | Importance |
| --- | --- | --- |
| 1 | expected_run_length_1 (length of methylated blocks) | 0.122 |
| 2 | cv_methylation (coefficient of variation) | 0.110 |
| 3 | expected_run_length_0 (length of unmethylated blocks) | 0.106 |
| 4 | run_length_ratio (ratio of block lengths) | 0.106 |
| 5 | mean_methylation (mean methylation level) | 0.104 |
| 6 | lz_complexity (LZ complexity) | 0.079 |
| 7 | lz_shannon_ratio | 0.069 |
| 8 | n_CpG_sites | 0.061 |
Table 11. Persistent entropy (PE) for H₀ and H₁ by stage (Rips complex).
| Sample | H₀ intervals | H₁ intervals | PE H₀ | PE H₁ |
| --- | --- | --- | --- | --- |
| E4.5 (n = 267) | 276 | 2 | 4.12 | 0.29 |
| E6.5 (n = 97) | 96 | 9 | 3.89 | 2.25 |
| E4.5 + E6.5 | 375 | 82 | 5.73 | 5.16 |
Table 12. Topological reorganisation of the epigenetic landscape from E4.5 to E6.5 (balanced subsampling).
| Aspect | E4.5 (balanced) | E6.5 (real) | Change | p-value | Interpretation |
| --- | --- | --- | --- | --- | --- |
| Number of cycles | 16.87 ± 2.59 | 16 | −5.2% | 0.85 | Unchanged |
| Normalised PE | 0.855 ± 0.035 | 0.882 | +3.2% | 0.45 | Unchanged |
| Max persistence | 0.444 ± 0.102 | 0.218 | −51.0% | 0.03 | Decreases |
| Average persistence | 0.130 ± 0.021 | 0.067 | −48.6% | < 0.001 | Decreases |
| Total persistence | 2.170 ± 0.353 | 1.070 | −50.7% | < 0.001 | Decreases |
Table 13. Results of the control analysis for cell-type confounder.
| Analysis | E4.5 | E6.5 | Statistic | p-value |
| --- | --- | --- | --- | --- |
| Shannon entropy in epiblast (mean) | 0.8289 | 0.7679 | t = 4.28 | 2 × 10⁻⁵ |
| Stage coefficient in regression, β₁ (SE) | | | −0.052 (0.012) | < 0.001 |
Table 14. Regional entropy (RE) and regional disorder (RD) dynamics.
| Metric | Window | E4.5 (mean ± SD) | E6.5 (mean ± SD) | Δ | p-value |
| --- | --- | --- | --- | --- | --- |
| RE | 200 bp | 0.480 ± 0.130 | 0.357 ± 0.133 | −25.5% | < 10⁻¹³ |
| RD | 200 bp | 0.300 ± 0.079 | 0.218 ± 0.083 | −27.4% | < 10⁻¹⁵ |
| RE | 1 kb | 0.590 ± 0.161 | 0.441 ± 0.169 | −25.2% | < 10⁻¹² |
| RD | 1 kb | 0.316 ± 0.080 | 0.226 ± 0.090 | −28.5% | < 10⁻¹⁷ |
| RE | 5 kb | 0.690 ± 0.169 | 0.559 ± 0.185 | −19.1% | < 10⁻⁹ |
| RD | 5 kb | 0.320 ± 0.073 | 0.235 ± 0.088 | −26.8% | < 10⁻¹⁸ |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
