Submitted: 28 June 2026
Posted: 30 June 2026
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Data
2.2. Entropy Measures
- Initialisation: the block counter $c$ is initialised and the current string is set to $w = s_1$ (the first character). In this phase we begin with element $s_1$.
- For each subsequent symbol $s_i$ (starting from $i = 2$):
  - A trial string $w' = w\,s_i$ is formed (concatenation of the current string and the new symbol).
  - If $w'$ does not occur as a substring in the already constructed part of the string (i.e., in the concatenation of all previously extracted unique blocks), then:
    - $c \leftarrow c + 1$ (a new unique block is registered);
    - $w \leftarrow s_i$ (a new current string is started from this symbol).
  - Otherwise (if $w'$ has already occurred):
    - $w \leftarrow w'$ (the current string is extended without incrementing the counter).
- If after processing all symbols $w$ is not empty, then $c \leftarrow c + 1$.
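
The parsing steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `lz_block_count` is hypothetical, the counter is assumed to start at 0, and the block added to the history is assumed to be the current string as it stood before the new symbol arrived.

```python
def lz_block_count(sequence: str) -> int:
    """Count blocks using the parsing steps described in the list above.

    Assumptions (not stated in the text): the counter starts at 0, and the
    registered block is the current string before the new symbol is added.
    """
    if not sequence:
        return 0
    count = 0
    history = ""           # concatenation of all registered blocks
    current = sequence[0]  # current string, initialised with the first symbol
    for symbol in sequence[1:]:
        trial = current + symbol
        if trial not in history:
            count += 1           # a new unique block is registered
            history += current
            current = symbol     # restart the current string from this symbol
        else:
            current = trial      # extend without incrementing the counter
    if current:
        count += 1               # account for the trailing, unregistered string
    return count


if __name__ == "__main__":
    for s in ("0", "01", "0101", "0000"):
        print(s, lz_block_count(s))
```

Searching the history with `in` is quadratic in the sequence length, which is fine for illustration but would likely need a suffix-based structure for genome-scale binarised vectors.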
2.3. Statistical Analysis
3. Results
3.1. Shannon Entropy Dynamics
3.2. Comparison of Entropic Measures
- Binary LZ (normalised): E4.5 mean = 1.165 ± 0.187, E6.5 mean = 1.032 ± 0.253 (−11.4%, p < 0.001)
- Ternary LZ (normalised): E4.5 mean = 0.735 ± 0.118, E6.5 mean = 0.651 ± 0.160 (−11.4%, p < 0.001)
3.3. Comparison of Central Tendency and Variability Dynamics
3.4. Machine Learning for Stage Classification
- Classifiers: Logistic Regression, SVM with radial basis function (RBF) kernel, Random Forest.
- Data preprocessing:
  - Feature scaling (StandardScaler).
  - Class balancing using SMOTE (synthetic generation of samples for the minority class E6.5). Applying SMOTE to a sample of this size carries a risk of overfitting, because synthetic samples are generated in underrepresented regions of the feature space. The resulting metrics should be regarded as an upper bound on classification quality and require confirmation on independent data.
  - Stratified train–test split: 75% training, 25% testing.
- Classifier parameter optimisation: Grid Search with 5-fold cross-validation based on the ROC-AUC criterion.
- Features. A total of 15 features were included in the analysis, grouped into three categories as shown in Table 5.
| Category | Features | Number of Features in Category |
|---|---|---|
| Entropy measures | Shannon entropy, Renyi entropy (α=2), Tsallis entropy (q=2), LZ complexity, Spatial gradient entropy | 5 |
| Basic statistics | Mean methylation level (mean_methylation), standard deviation (std_methylation), number of CpG sites (n_CpG_sites), coefficient of variation (cv_methylation) | 4 |
| Structural characteristics | Expected length of methylated blocks (expected_run_length_1), expected length of unmethylated blocks (expected_run_length_0), ratio of block lengths (run_length_ratio), spatial organization (spatial_organization), LZ/Shannon ratio (lz_shannon_ratio), spatial gradient entropy/Shannon ratio (spatial_gradient_shannon_ratio) | 6 |
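
To make the caution about SMOTE concrete, the following sketch shows the core idea of SMOTE-style oversampling: each synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbours. It is a simplified pure-Python illustration, not the imbalanced-learn implementation; the function name `smote_like`, the parameter `k`, and the toy coordinates are all hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


if __name__ == "__main__":
    # Toy two-feature minority class (hypothetical values).
    toy_minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote_like(toy_minority, n_new=5):
        print(point)
```

Because every synthetic point is a convex combination of two real samples, the oversampled class adds no information beyond the original cells. Generating synthetic points from the training portion only keeps interpolated copies of test cells out of the evaluation.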
- SVM with RBF kernel achieved the best results among the tested models (Accuracy = 93.4%, AUC = 0.981), indicating strong separation of stages E4.5 and E6.5 based on the proposed feature set.
- Structural characteristics of methylation, in particular the expected length of methylated blocks (expected_run_length_1), proved to be the most informative features, surpassing entropic measures in importance.
- The coefficient of variation (cv_methylation) and mean methylation level (mean_methylation) are among the top five most important features, indicating the significance of not only diversity but also the central tendency of the methylation mark distribution.
- The high sensitivity of the model to E6.5 cells (Recall = 91.7%) indicates that the proposed feature set reliably identifies cells at the gastrulation stage, the proposed “ground zero” of epigenetic rejuvenation.
- The obtained results demonstrate the fundamental possibility of using the structural-entropic approach for developing diagnostic algorithms for assessing cell developmental stage and, potentially, for evaluating the efficacy of rejuvenating interventions.
3.5. Topological Analysis of the Multidimensional Space of Entropic Features
3.6. Control for a Possible Confounder Due to Cell-Type Heterogeneity
- $H_i$: the value of Shannon entropy (bits) for the $i$-th cell;
- $\beta_0$: the intercept, corresponding to the mean entropy level in the reference group (epiblast cells at stage E4.5);
- $\beta_1$: the coefficient for the stage dummy variable, reflecting the change in entropy at stage E6.5 relative to E4.5 after controlling for cell type;
- $\mathrm{Stage}_i$: the indicator variable, equal to 1 for cells at stage E6.5 and 0 for E4.5;
- $\gamma_k$: coefficients for the dummy variables of cell types; $\mathrm{CellType}_{ik}$ indicates the $k$-th cell type (Primitive_Streak, Primitive_endoderm, Visceral_endoderm); epiblast was chosen as the reference type;
- $K$: the total number of cell types ($K = 4$ in the analysis);
- $\varepsilon_i$: random error, assumed to be independent and normally distributed with zero mean and constant variance $\sigma^2$.
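
The definitions above are consistent with a linear model of the following form. This is a reconstruction from the listed terms, and the symbol names are assumptions: $H_i$ is the Shannon entropy of cell $i$, $\mathrm{Stage}_i$ the E6.5 indicator, $\mathrm{CellType}_{ik}$ the dummy variable for the $k$-th non-reference cell type, and $K$ the total number of cell types.

```latex
H_i = \beta_0 + \beta_1\,\mathrm{Stage}_i
      + \sum_{k=1}^{K-1} \gamma_k\,\mathrm{CellType}_{ik}
      + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
```

Under this form, $\beta_1$ is the stage effect for cells of the same type, which is the quantity of interest when checking whether the entropy shift is explained by changing cell-type composition alone.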
3.7. Regional Entropy and Disorder Dynamics
4. Discussion
4.1. Confirmation of the Hypothesis
4.2. Interpretation of LZ Complexity
4.3. Topological Complexity of the Space of Entropic Features
4.4. Global Entropy Versus Regional Disorder: Reconciling Scales
- Global entropy decrease reflects consolidation of the genome-wide methylation landscape. As the embryo approaches “ground zero”, the distribution of methylation states across the entire genome becomes more concentrated (lower entropy), consistent with the “blank slate” hypothesis.
- Regional disorder increase reflects the emergence of cell-type-specific epigenetic patterns. During gastrulation, different genomic regions acquire distinct methylation patterns as cells commit to specific lineages (epiblast, mesoderm, endoderm).
4.5. Regional Disorder Dynamics: Reconciling with Bertucci-Richter (2024)
- Bertucci-Richter et al. used bulk RRBS data from whole embryos. The observed increase in RE/RD reflects the emergence of inter-cellular heterogeneity: as different cell lineages (epiblast, mesoderm, endoderm) appear, the average methylation signal across the bulk sample becomes more diverse.
- Our analysis uses single-cell scNMT-seq data, allowing us to compute cell-averaged RE/RD, where regional values (200-bp windows) are averaged across all windows within each cell. The decrease we observe reflects consolidation of methylation patterns within each cell — each cell becomes epigenetically more ordered as it commits to a specific lineage.
- This is consistent with our topological analysis (Section 3.5): the emergence of H₁-cycles at E6.5 indicates that the epigenetic landscape is not homogenising but rather structuring into distinct, ordered compartments. Global entropy decreases while regional diversity increases, reflecting a transition from a homogeneous “noisy” state to a structured “ordered” state with multiple cell types.
4.6. Limitations of the Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Horvath, S. DNA methylation age of human tissues and cell types. Genome Biol. 2013, 14(10), R115.
- Kerepesi, C.; Gladyshev, V.N. Epigenetic clocks as biomarkers of aging. Nat. Rev. Genet. 2023, 24(5), 309–324.
- Kerepesi, C.; Zhang, B.; Lee, S.G.; Trapp, A.; Gladyshev, V.N. Epigenetic clock analysis in mice as a function of age and germline rejuvenation. Sci. Adv. 2021, 7(15), eabg6088.
- Guo, F.; et al. Active and passive demethylation of male and female pronuclear DNA in the mammalian zygote. Cell Stem Cell 2014, 15(4), 447–459.
- Reik, W.; Dean, W.; Walter, J. Epigenetic reprogramming in mammalian development. Science 2001, 293(5532), 1089–1093.
- Tahiliani, M.; et al. Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science 2009, 324(5929), 930–935.
- Liesenfelder, S.; Elsafi Mabrouk, M.H.; Iliescu, J.; et al. Epigenetic editing at individual age-associated CpGs affects the genome-wide epigenetic aging landscape. Nat. Aging 2025, 5, 997–1009.
- Chan, J.; Rubbi, L.; Pellegrini, M. DNA methylation entropy is a biomarker for aging. Aging-US 2025, 17(6).
- Singh, P.P. A Biophysics of Epigenetic Rejuvenation. Cells 2025, 14(16), 1249.
- Argelaguet, R.; Clark, S.J.; Mohammed, H.; Stapel, L.C.; et al. Multi-omics profiling of mouse gastrulation at single-cell resolution. Nature 2019, 576(7787), 487–491.
- Clark, S.J.; Argelaguet, R.; Kapourani, C.A.; et al. scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 2018, 9, 781.
- Lempel, A.; Ziv, J. On the complexity of finite sequences. IEEE Trans. Inf. Theory 1976, 22(1), 75–81.
- Maria, C.; et al. The GUDHI library: persistent homology and beyond. 27th International Conference on Supercomputing (ICSC), 2014; pp. 167–176.
- Bertucci-Richter, E.M.; Shealy, E.P.; Parrott, B.B. Epigenetic drift underlies epigenetic clock signals, but displays distinct responses to lifespan interventions, development, and cellular dedifferentiation. Aging 2024, 16(2), 1002–1020.



| Stage | Mean | Median | Standard Deviation | n |
|---|---|---|---|---|
| E4.5 | 0.8407 | 0.8434 | 0.0999 | 267 |
| E6.5 | 0.8047 | 0.7890 | 0.1038 | 97 |
| Metric | p-value (t-test) | p-value (Mann-Whitney) | q-value |
|---|---|---|---|
| Shannon entropy | 0.0028 | 0.0003 | 0.004332 |
| Renyi entropy (α=2) | 0.0038 | 0.0003 | 0.004576 |
| Tsallis entropy (q=2) | 0.0029 | 0.0003 | 0.004332 |
| LZ complexity | 0.0006 | 0.0009 | 0.003482 |
| Spatial gradient entropy | 0.0028 | 0.0003 | 0.004332 |
| Persistent entropy (PE) | 0.0097 | 0.000002 | 0.009728 |
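
The q-values in the table above appear consistent with a Benjamini–Hochberg adjustment of the t-test p-values across the six metrics, although the correction method is not named in this section, so that reading is an assumption. A minimal sketch of the adjustment, using the rounded p-values from the table (the function name `benjamini_hochberg` is illustrative):

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Rounded t-test p-values as reported in the table above.
metrics = ["Shannon", "Renyi", "Tsallis", "LZ", "Spatial gradient", "Persistent"]
p_ttest = [0.0028, 0.0038, 0.0029, 0.0006, 0.0028, 0.0097]

if __name__ == "__main__":
    for name, q in zip(metrics, benjamini_hochberg(p_ttest)):
        print(f"{name}: q = {q:.5f}")
```

Small differences from the tabulated q-values are expected here because the inputs are p-values rounded to four decimals.
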
| Metric | Shannon | Renyi (α=2) | Tsallis (q=2) | LZ (binary) | LZ (ternary, norm.) | Spatial gradient entropy |
|---|---|---|---|---|---|---|
| Shannon | 1.000 | 0.995 | 0.999 | 0.131 | 0.712 | 1.000 |
| Renyi (α=2) | 0.995 | 1.000 | 0.997 | 0.130 | 0.711 | 0.995 |
| Tsallis (q=2) | 0.999 | 0.997 | 1.000 | 0.131 | 0.712 | 0.999 |
| LZ (binary) | 0.131 | 0.130 | 0.131 | 1.000 | — | 0.131 |
| LZ (ternary, norm.) | 0.712 | 0.711 | 0.712 | — | 1.000 | 0.712 |
| Spatial gradient entropy | 1.000 | 0.995 | 0.999 | 0.131 | 0.712 | 1.000 |
| Metric | Median (E4.5) | Median (E6.5) | Δ Median | IQR (E4.5) | IQR (E6.5) | Δ IQR |
|---|---|---|---|---|---|---|
| Shannon entropy | 0.8434 | 0.7890 | −6.5% | 0.1278 | 0.1528 | +19.5% |
| Renyi entropy (α=2) | 0.7260 | 0.6461 | −11.0% | 0.1966 | 0.2194 | +11.6% |
| Tsallis entropy (q=2) | 0.3954 | 0.3610 | −8.7% | 0.0821 | 0.0958 | +16.7% |
| LZ complexity (binary, raw) | 49942 | 35759 | −28.4% | 42531.5 | 41055.0 | −3.5% |
| LZ complexity (ternary, norm) | 0.7350 | 0.6510 | −11.4% | 0.1597 | 0.1605 | +0.5% |
| Spatial gradient entropy | 0.8094 | 0.7913 | −2.2% | 0.0426 | 0.0509 | +19.5% |
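
For orientation, the three distribution-based entropies can be sketched for a binarised methylation vector with methylated fraction $p$. This assumes a two-state distribution and base-2 logarithms, neither of which is confirmed by the surviving text; the function name `binary_entropies` is hypothetical. Under these assumptions, a fraction near 0.27 gives values roughly in line with the E4.5 medians in the table above.

```python
import math


def binary_entropies(p: float):
    """Shannon (bits), Renyi (alpha=2, base 2) and Tsallis (q=2) entropies
    of a two-state distribution (p, 1 - p)."""
    probs = [p, 1.0 - p]
    shannon = -sum(x * math.log2(x) for x in probs if x > 0)
    sum_sq = sum(x * x for x in probs)
    renyi2 = -math.log2(sum_sq)  # (1 / (1 - alpha)) * log2(sum p^alpha), alpha = 2
    tsallis2 = 1.0 - sum_sq      # (1 - sum p^q) / (q - 1), q = 2
    return shannon, renyi2, tsallis2


if __name__ == "__main__":
    for p in (0.5, 0.27, 0.1):
        h, r, t = binary_entropies(p)
        print(f"p={p}: Shannon={h:.4f}, Renyi2={r:.4f}, Tsallis2={t:.4f}")
```

All three quantities are monotone functions of $|p - 0.5|$ in this setting, which would explain the near-unit correlations among them.
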
| Feature | Description | Formula |
|---|---|---|
| mean_methylation | Mean methylation level across all CpG sites in a cell | $\mu = \frac{1}{N}\sum_{i=1}^{N} m_i$, where $m_i$ is the methylation level of the $i$-th site and $N$ is the number of CpG sites |
| std_methylation | Standard deviation of methylation levels | $\sigma$, the standard deviation of the $m_i$ |
| n_CpG_sites | Total number of CpG sites in a cell | $N$ |
| cv_methylation | Coefficient of variation (normalized spread) | $\mathrm{CV} = \sigma / \mu$ |
| Feature | Description | Formula |
|---|---|---|
| expected_run_length_1 | Expected length of continuous blocks of ones (methylated regions) | For a Bernoulli process with probability $p$ of a methylated site: $E[L_1] = \frac{1}{1-p}$, where $L_1$ is the length of a continuous block of consecutive methylated sites (value 1 after binarisation). |
| expected_run_length_0 | Expected length of continuous blocks of zeros (unmethylated regions) | $E[L_0] = \frac{1}{p}$, where $L_0$ is the length of a continuous block of consecutive unmethylated sites (value 0). |
| run_length_ratio | Ratio of expected lengths of methylated and unmethylated blocks | $E[L_1] / E[L_0]$ |
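
The run-length features can be illustrated in two ways: empirically, by averaging the lengths of consecutive runs in a binarised vector, and analytically, using the standard result that runs in a Bernoulli process are geometrically distributed. The sketch below shows both. It assumes that `p` denotes the probability of a methylated site, and the function names are hypothetical.

```python
from itertools import groupby


def mean_run_lengths(bits):
    """Empirical mean length of runs of 0s and of 1s in a binary sequence."""
    runs = {0: [], 1: []}
    for value, group in groupby(bits):
        runs[value].append(sum(1 for _ in group))
    return {v: (sum(r) / len(r) if r else 0.0) for v, r in runs.items()}


def bernoulli_expected_runs(p):
    """Expected run lengths for independent sites with P(methylated) = p,
    valid for 0 < p < 1 (geometric distribution of run lengths)."""
    return {1: 1.0 / (1.0 - p), 0: 1.0 / p}


if __name__ == "__main__":
    toy = [1, 1, 0, 1, 1, 1, 0, 0]  # hypothetical binarised CpG states
    print(mean_run_lengths(toy))
    print(bernoulli_expected_runs(0.5))
```

Under the Bernoulli assumption, both expected lengths and their ratio are functions of $p$ alone. A gap between the empirical and Bernoulli values would therefore indicate spatial correlation between neighbouring sites.
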
| Feature | Description | Formula |
|---|---|---|
| spatial_organization | Measure of spatial ordering of methylation | A function of $H$ and $H_{\max}$, where $H$ is the Shannon entropy and $H_{\max}$ is the maximum entropy at 20 bins |
| lz_shannon_ratio | Ratio of algorithmic complexity to statistical entropy | $C_{\mathrm{LZ}} / H$, where $C_{\mathrm{LZ}}$ is the LZ complexity and $H$ is the Shannon entropy |
| spatial_gradient_entropy_shannon_ratio | Ratio of spatial entropy to Shannon entropy | $H_{\mathrm{grad}} / H$, where $H_{\mathrm{grad}}$ is the local gradient entropy and $H$ is the Shannon entropy |
| Model | Accuracy | Precision | Recall (E6.5) | F1-score (E6.5) | AUC-ROC |
|---|---|---|---|---|---|
| Logistic Regression | 0.879 | 0.710 | 0.917 | 0.800 | 0.947 |
| SVM (RBF) | 0.934 | 0.846 | 0.917 | 0.880 | 0.981 |
| Random Forest | 0.791 | 0.581 | 0.750 | 0.655 | 0.873 |
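
As a quick internal consistency check, the F1 column in the table above can be recomputed from the reported precision and recall using the harmonic-mean definition. The sketch below uses only the tabulated values; the helper name `f1_from` is illustrative.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1-score as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the E6.5 class, copied from the table.
reported = {
    "Logistic Regression": (0.710, 0.917, 0.800),
    "SVM (RBF)": (0.846, 0.917, 0.880),
    "Random Forest": (0.581, 0.750, 0.655),
}

if __name__ == "__main__":
    for model, (precision, recall, f1_reported) in reported.items():
        f1 = f1_from(precision, recall)
        print(f"{model}: recomputed F1 = {f1:.3f}, reported = {f1_reported:.3f}")
```
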
| Rank | Feature | Importance |
|---|---|---|
| 1 | expected_run_length_1 (length of methylated blocks) | 0.122 |
| 2 | cv_methylation (coefficient of variation) | 0.110 |
| 3 | expected_run_length_0 (length of unmethylated blocks) | 0.106 |
| 4 | run_length_ratio (ratio of block lengths) | 0.106 |
| 5 | mean_methylation (mean methylation level) | 0.104 |
| 6 | lz_complexity (LZ complexity) | 0.079 |
| 7 | lz_shannon_ratio | 0.069 |
| 8 | n_CpG_sites | 0.061 |
| Sample | Intervals (H₀) | Intervals (H₁) | PE (H₀) | PE (H₁) |
|---|---|---|---|---|
| E4.5 (n=267) | 276 | 2 | 4.12 | 0.29 |
| E6.5 (n=97) | 96 | 9 | 3.89 | 2.25 |
| E4.5+E6.5 | 375 | 82 | 5.73 | 5.16 |
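
Persistent entropy is the Shannon entropy of the normalised lengths of the intervals in a persistence diagram. The sketch below assumes base-2 logarithms, which appears compatible with the table above (for example, a PE of 2.25 for 9 intervals exceeds ln 9 ≈ 2.20), and it assumes that only finite intervals are included. The function names and the toy diagram are hypothetical.

```python
import math


def persistent_entropy(intervals):
    """Shannon entropy (base 2) of normalised lifetimes of finite
    (birth, death) intervals; zero-length intervals are ignored."""
    lengths = [death - birth for birth, death in intervals if death > birth]
    total = sum(lengths)
    if total == 0:
        return 0.0
    return -sum((l / total) * math.log2(l / total) for l in lengths)


def normalised_persistent_entropy(intervals):
    """Persistent entropy divided by its maximum, log2(number of intervals)."""
    n = sum(1 for birth, death in intervals if death > birth)
    if n < 2:
        return 0.0
    return persistent_entropy(intervals) / math.log2(n)


if __name__ == "__main__":
    toy_diagram = [(0.0, 0.4), (0.1, 0.2), (0.3, 0.35), (0.0, 0.1)]
    print(persistent_entropy(toy_diagram))
    print(normalised_persistent_entropy(toy_diagram))
```

The normalised form is bounded by 1 and is reached only when all intervals have equal lifetime, so it can be compared across samples with different numbers of intervals.
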
| Aspect | E4.5 (balanced) | E6.5 (real) | Change | p-value | Interpretation |
|---|---|---|---|---|---|
| Number of cycles | 16.87 ± 2.59 | 16 | −5.2% | 0.85 | Unchanged |
| Normalised PE | 0.855 ± 0.035 | 0.882 | +3.2% | 0.45 | Unchanged |
| Max persistence | 0.444 ± 0.102 | 0.218 | −51.0% | 0.03 | Decreases |
| Average persistence | 0.130 ± 0.021 | 0.067 | −48.6% | < 0.001 | Decreases |
| Total persistence | 2.170 ± 0.353 | 1.070 | −50.7% | < 0.001 | Decreases |
| Analysis | E4.5 | E6.5 | Statistic | p-value |
|---|---|---|---|---|
| Shannon entropy in epiblast (mean) | 0.8289 | 0.7679 | ||
| Stage coefficient in regression (SE) | — | — | –0.052 (0.012) | <0.001 |
| Metric | Window | E4.5 (mean) | E6.5 (mean) | Δ | p-value |
|---|---|---|---|---|---|
| RE | 200 bp | 0.480 ± 0.130 | 0.357 ± 0.133 | −25.5% | < 10⁻¹³ |
| RD | 200 bp | 0.300 ± 0.079 | 0.218 ± 0.083 | −27.4% | < 10⁻¹⁵ |
| RE | 1 kb | 0.590 ± 0.161 | 0.441 ± 0.169 | −25.2% | < 10⁻¹² |
| RD | 1 kb | 0.316 ± 0.080 | 0.226 ± 0.090 | −28.5% | < 10⁻¹⁷ |
| RE | 5 kb | 0.690 ± 0.169 | 0.559 ± 0.185 | −19.1% | < 10⁻⁹ |
| RD | 5 kb | 0.320 ± 0.073 | 0.235 ± 0.088 | −26.8% | < 10⁻¹⁸ |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).