Submitted: 31 March 2026
Posted: 01 April 2026
Abstract
Keywords:
1. Introduction
1.1. Epigenomic Regulation and Biomarkers in Human Disease
1.2. Rapid Expansion of Epigenomic Research
1.3. AI and Literature-Level Analysis of Biomedical Research
1.4. Toward AI-Driven Health Monitoring Using Epigenomic Biomarkers
2. Epigenomic Biomarkers and Chromatin Variability in Human Health
2.1. Biological and Clinical Significance of DNA Methylation
2.2. DNA Methylation Biomarkers in Aging and Health Monitoring
2.3. Translational Perspectives and the Need for Literature-Level Synthesis
2.4. Modeling Chromatin Variability from DNA Sequences

3. AI and Text Mining in Biomedical Literature
3.1. AI Applications in Epigenomic Biomarker Research
3.2. Limitations of Current Biomarker Studies
3.3. Evolution of Biomedical Literature Mining
3.4. Full-Text Mining for Knowledge Discovery
3.5. Literature-Level Insights for Epigenomic Biomarker Discovery
4. Literature Mining Framework for Epigenomic Biomarker Discovery
4.1. Data Sources and Corpus Construction
4.2. Query Pattern Design for Epigenomic Biomarker Retrieval
- Baseline queries were constructed to achieve high recall by broadly capturing literature related to epigenetic biomarkers across diverse disease contexts.
- Extended queries incorporated additional constraints, such as disease categories, mechanistic descriptors (e.g., hypermethylation or hypomethylation), clinical cohort indicators, and machine-learning–related terminology.
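The two query tiers can be pictured as simple string composition. The snippet below is a minimal sketch rather than the retrieval code used in this study: the helper names `quote`, `any_of`, and `all_of` are hypothetical, and the boolean syntax simply mirrors the query patterns tabulated in this article.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def any_of(terms):
    """Join alternative terms with OR inside one parenthesized clause."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


def all_of(*clauses):
    """Join already-formed clauses with AND."""
    return " AND ".join(clauses)


# Baseline query: broad, recall-oriented.
baseline = all_of(any_of(["epigenetic", "DNA methylation"]), "biomarker")

# Extended queries: the same idea with one extra constraint each.
mechanism = all_of(any_of(["hypermethylation", "hypomethylation"]), "biomarker")
disease_restricted = all_of(baseline, any_of(["cancer", "tumor"]))

print(baseline)
print(mechanism)
print(disease_restricted)
```

Keeping the constraint clauses separate from the baseline makes it easy to compare how many records each added restriction removes.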
4.3. Text Mining and Analytical Pipeline
5. Characteristics of the Epigenomic Literature Corpus
5.1. Corpus Scope and Research Coverage
5.2. Textual Composition and Data Preprocessing
5.3. Corpus Scale and Analytical Implications
6. Major Trends Identified Through Literature-Level Analysis
6.1. Experimental Techniques and Research Design Patterns
| Corpus statistic | Value |
| Total no. of words | 62,360,028 |
| No. of articles retrieved | 6,152 |
| Mean no. of words per article | 10,136.5 |
| Level | Description | Representative Examples |
| Journal Level (Frequency) | Journals in which epigenomic experimental method keywords most frequently appear | The Journal of Biological Chemistry (111); Physical Review B (102); The Cochrane Database of Systematic Reviews (88) |
| Journal–Keyword Association | Frequently co-occurring keywords within selected journals | Science (predictions, Computationall); Gene (5C); PNAS (predictions) |
| Document Level | Documents containing multiple epigenomic experimental method keywords | Epi_file-10000085.txt (5C, Hi-C, DNase-seq, ATAC-seq, ChIP-seq, RNA-seq); Epi_file-10000364.txt (DNase-seq, ATAC-seq, ChIP-seq, RNA-seq) |
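The document-level row can be illustrated with a small keyword scan. This is a sketch under stated assumptions, not the pipeline used to build the table: the keyword list is copied from the assay names mentioned in this article, while the function names, the `min_hits` threshold, and the toy documents are hypothetical.

```python
import re

# Assay keywords as they appear in this article; the list is illustrative.
ASSAY_KEYWORDS = ["5C", "Hi-C", "DNase-seq", "ATAC-seq", "ChIP-seq", "RNA-seq", "WGBS"]


def assays_in(text):
    """Return the assay keywords found in `text`, in list order."""
    found = []
    for keyword in ASSAY_KEYWORDS:
        # The look-arounds stop "5C" from matching inside "25C" or "5C-like";
        # as a side effect, prefixed forms such as "scATAC-seq" are not counted.
        pattern = r"(?<![\w-])" + re.escape(keyword) + r"(?![\w-])"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(keyword)
    return found


def multi_assay_documents(docs, min_hits=2):
    """Map file name -> matched keywords for files with at least `min_hits` keywords."""
    result = {}
    for name, text in docs.items():
        hits = assays_in(text)
        if len(hits) >= min_hits:
            result[name] = hits
    return result


# Toy documents, not files from the corpus.
docs = {
    "toy_a.txt": "We profiled cells with ATAC-seq, ChIP-seq and RNA-seq at 25C.",
    "toy_b.txt": "Only RNA-seq was performed.",
}
multi = multi_assay_documents(docs)
print(multi)
```

Short keywords such as 5C are the main source of false positives in this kind of scan, which is why the boundary checks are stricter than a plain word boundary.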

6.2. Topic Structure of Epigenomic Research
| Cluster | Number of groups | Cluster words |
| Cluster 0 | 427 | Plant, genome, assembly, species, sequence, chromosome |
| Cluster 1 | 1513 | patients, disease, variants, minimal, Children |
| Cluster 2 | 620 | Methylation, DNA, DNA, CpG, methylated, epigenetic, site, ages |
| Cluster 3 | 1119 | mice, RNA, mRNA, p, supplementary, h, mM, N |
| Cluster 4 | 824 | Cancer, tumor, patients, immune, Breast, survival, Breast, p |
| Cluster 5 | 901 | Chromatin, enhancer, peak, genome, transcription, DNA, binding, accessibility |
| Cluster 6 | 733 | RNA, Cancer, DNA, patients, disease |

6.3. Contextual Stability of Epigenomic Terminology
| Analysis Type | Focus | Description |
| Genetic Loci Concordance | Chromosomal loci patterns | Concordance contexts showing disease associations and oncogene references near loci expressions (e.g., 18q21.33, 12p13.33, 8q24.21) |
| Assay Keyword Concordance | Epigenomic experimental methods | Contextual usage of assay-related terms (e.g., 5C, DNase-seq, WGBS, RNA-seq) within methodological descriptions |
| Loci Before/After Patterns | Lexical proximity around loci | Words surrounding loci expressions indicate genomic positioning and disease relevance |
| Chromatin Before/After Patterns | Chromatin-related contexts | Surrounding terms highlight genomic annotation, structural regions, and regulatory elements |
| Assay Before/After Patterns | Experimental method contexts | Assay keywords frequently co-occur with regulatory, profiling, and binding-related terminology |
| Methylation Before/After Patterns | Epigenetic modification context | Methylation-related terms appear alongside regulatory, silencing, and modification-related expressions |
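The concordance and before/after analyses summarized above follow the keyword-in-context idea: find each target expression and collect the words on either side. The sketch below is illustrative only; the locus regular expression, the window size, the whitespace tokenization, and the toy sentence are assumptions, not the settings used in this study.

```python
import re
from collections import Counter

# Cytogenetic band pattern such as 8q24.21 or Xp11.2 (illustrative, not exhaustive).
LOCUS = r"(?:[1-9]|1\d|2[0-2]|X|Y)[pq]\d{1,2}(?:\.\d{1,2})?"
PUNCT = ".,;:()[]"


def concordance(text, pattern, window=3):
    """Return (before, keyword, after) rows for every token matching `pattern`."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        core = token.strip(PUNCT)
        if re.fullmatch(pattern, core):
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            rows.append((before, core, after))
    return rows


def neighbor_counts(rows):
    """Count lowercased context words appearing before and after the keyword."""
    before, after = Counter(), Counter()
    for left, _, right in rows:
        before.update(w.strip(PUNCT).lower() for w in left if w.strip(PUNCT))
        after.update(w.strip(PUNCT).lower() for w in right if w.strip(PUNCT))
    return before, after


# Toy sentence, not text from the corpus.
text = ("A risk locus at 8q24.21 was associated with disease, "
        "and 18q21.33 showed a similar signal.")
rows = concordance(text, LOCUS, window=2)
before, after = neighbor_counts(rows)
for left, keyword, right in rows:
    print(" ".join(left), "[", keyword, "]", " ".join(right))
```

The same two functions apply to assay, chromatin, or methylation terms by swapping the pattern.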
6.4. Temporal Evolution of Epigenomic and DNA Methylation Research

7. Integrated Interpretation of Literature-Level Text Mining Results
7.1. Integrated Framework of Epigenomic Literature Patterns
| Text Mining Component | Core Literature-Level Insight | Link to Chromatin Variability Perspective | Extraction Logic |
| Epigenetic mechanism | Mechanistic basis of epigenetic regulation | Biological foundation of chromatin variability | Predefined epigenetic mechanism keywords extracted via case-insensitive regular expression–based pattern matching applied to full-text corpus |
| Chromatin variability expressions | Linguistic operationalization of variability concept | Conceptual validation of chromatin variability | Predefined variability-related expressions identified through case-insensitive keyword-based pattern matching in full text |
| Gene–epigenetic relationships | Gene-specific regulatory modulation patterns | Functional instantiation of variability | Sentence-level segmentation followed by identification of co-occurrence between gene names and DNA methylation-related expressions |
| Cell-type specificity | Context-dependent regulatory variation | Contextual interpretation of chromatin variability | Predefined cell-type terminology matched within full-text corpus to identify cell-specific epigenetic contexts |
| Disease association | Clinical relevance of epigenetic regulation | Translational dimension of variability | Co-occurrence detection of predefined disease terms and epigenetic expressions within full text |
| Temporal dynamics | Time-dependent regulatory changes | Dynamic nature of chromatin variability | Identification of predefined temporal keywords to detect longitudinal epigenetic contexts |
| Environmental effects | Non-genetic influences on epigenetic regulation | Environmental modulation of variability | Detection of predefined environmental terms within full-text corpus |
| Semantic network structure | Conceptual connectivity across literature | Systems-level interpretation | Construction of co-occurrence networks by linking concept nodes appearing within shared textual contexts |
| Literature variability score | Quantitative prominence of variability discourse | Independent validation indicator | Frequency aggregation of predefined variability-related keywords across corpus |
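Three of the extraction-logic entries in this table (sentence-level gene–methylation co-occurrence, the variability score, and the co-occurrence network) can be sketched in a few lines. Everything named here is an assumption for illustration: the gene list, the variability terms, the simple sentence splitter, and the per-token normalization are hypothetical and are not taken from the study's code.

```python
import re
from collections import Counter
from itertools import combinations

GENES = ["BRCA1", "MLH1", "TP53"]  # hypothetical gene dictionary
VARIABILITY_TERMS = ["chromatin variability", "epigenetic variability"]  # hypothetical
METHYLATION_RE = re.compile(r"\b(?:hyper|hypo)?methylat(?:ion|ed)\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def gene_methylation_pairs(text):
    """Return (gene, sentence) for sentences mentioning both a gene and methylation."""
    hits = []
    for sentence in split_sentences(text):
        if METHYLATION_RE.search(sentence):
            for gene in GENES:
                if re.search(rf"\b{re.escape(gene)}\b", sentence):
                    hits.append((gene, sentence))
    return hits


def variability_score(text):
    """Variability-term matches divided by whitespace token count (assumed normalization)."""
    n_tokens = len(text.split())
    if n_tokens == 0:
        return 0.0
    lowered = text.lower()
    return sum(lowered.count(term) for term in VARIABILITY_TERMS) / n_tokens


def cooccurrence_edges(contexts):
    """Count concept pairs that share a context (e.g. a sentence or paragraph)."""
    edges = Counter()
    for concepts in contexts:
        for a, b in combinations(sorted(set(concepts)), 2):
            edges[(a, b)] += 1
    return edges


# Toy text, not a passage from the corpus.
text = ("MLH1 promoter hypermethylation was frequent. TP53 was mutated. "
        "Chromatin variability increased with age.")
pairs = gene_methylation_pairs(text)
score = variability_score(text)
edges = cooccurrence_edges([["methylation", "cancer"], ["cancer", "methylation", "aging"]])
print(pairs, score, edges.most_common(1))
```

Restricting co-occurrence to the sentence level is the stricter choice; document-level co-occurrence, as listed for disease association, would link many more pairs.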
7.2. Key Biomarker Insights from Literature Mining
7.3. Chromatin Variability as a Dynamic Health Indicator
7.4. Literature-Derived Biomarker Prioritization
8. Translating Epigenomic Biomarkers into AI-Driven Health Monitoring
8.1. From Literature-Level Insights to Health Monitoring Architecture
8.2. Multi-Layer Architecture for AI-Driven Health Monitoring
- Molecular baseline layer
- Digital physiological layer
- Multimodal AI integration layer
- Predictive monitoring layer
8.3. Multi-Timescale Health Monitoring Through Epigenomic Biomarkers
9. Integration for AI-Driven Epigenomic Health Monitoring
9.1. System-Level Architecture of AI-Driven Epigenomic Health Monitoring
- Molecular baseline layer: This layer establishes individualized biological reference profiles using epigenomic biomarkers derived from DNA methylation data. These profiles may encode biological age acceleration, chromatin variability, and disease-associated epigenetic signatures.
- Digital physiological layer: This layer captures continuous physiological data streams from wearable devices and IoT-based sensors, including cardiovascular dynamics, sleep quality, activity patterns, and metabolic indicators.
- Multimodal integration layer: In this layer, machine learning models integrate heterogeneous data sources to infer latent health states (Miotto et al., 2018; Rajkomar et al., 2018). Approaches such as deep neural networks, multimodal representation learning, and probabilistic modeling enable the fusion of molecular and physiological data.
- Predictive monitoring layer: This final layer translates integrated data into actionable outputs, including early disease risk prediction, anomaly detection, adaptive health recommendations, and clinical decision support.
9.2. Data Integration and Multimodal Analytical Strategies
- Multimodal representation learning: Learning shared latent representations that capture interactions between molecular and physiological signals.
- Bayesian hierarchical modeling: Incorporating epigenomic features as prior distributions that constrain the interpretation of dynamic physiological data.
- Deep learning architectures: Integrating heterogeneous inputs through attention mechanisms, recurrent neural networks, or transformer-based models to capture both temporal dynamics and cross-modal relationships (Miotto et al., 2018; Rajkomar et al., 2018).
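The Bayesian strategy listed above can be illustrated with the simplest conjugate case: a normal prior on a person's latent physiological level, centered using an epigenomic feature, and updated with noisy wearable readings. This is a toy sketch, not a validated model. The linear shift by epigenetic age acceleration, the choice of resting heart rate as the signal, every numeric value, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Normal:
    mean: float
    var: float


def epigenomic_prior(population_mean, population_var, age_acceleration, shift_per_year):
    """Prior on the latent level, shifted linearly by age acceleration (assumed form)."""
    return Normal(population_mean + shift_per_year * age_acceleration, population_var)


def update(prior, readings, noise_var):
    """Normal-normal conjugate update with known observation noise variance."""
    n = len(readings)
    if n == 0:
        return prior
    precision = 1.0 / prior.var + n / noise_var
    mean = (prior.mean / prior.var + sum(readings) / noise_var) / precision
    return Normal(mean, 1.0 / precision)


# Hypothetical numbers: resting heart rate in beats per minute.
prior = epigenomic_prior(population_mean=60.0, population_var=25.0,
                         age_acceleration=2.0, shift_per_year=0.5)
posterior = update(prior, readings=[64.0, 66.0, 65.0], noise_var=9.0)
print(prior, posterior)
```

With few readings the posterior stays close to the epigenomically informed prior; as wearable data accumulate, the readings dominate and the posterior variance shrinks.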
9.3. Future Directions for Personalized and Predictive Health Monitoring
10. Future Directions: Toward Integrative Modeling of SNPs, Chromatin Variability, and AI-Driven Health Monitoring
10.1. Unresolved Relationships Between Genetic Variation and Chromatin Variability
10.2. Integrative Multi-Omics and Multi-Scale Data Modeling
10.3. Literature-Level Knowledge Integration for Biomarker Prioritization
10.4. Toward AI-Driven Multi-Timescale Health Monitoring Frameworks
10.5. Toward Hypothesis-Driven and Translational Epigenomic Research
11. Conclusion
Author Contributions
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Glossary
| Term | Description |
| ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) | A high-throughput sequencing method used to assess chromatin accessibility by detecting open regions of the genome, enabling the identification of regulatory elements such as enhancers and promoters. |
| Biomarker | A measurable biological indicator that reflects physiological states, disease processes, or responses to therapeutic interventions, widely used in diagnosis, prognosis, and health monitoring. |
| ChIP-seq (Chromatin Immunoprecipitation sequencing) | A sequencing-based technique used to analyze protein–DNA interactions, particularly transcription factor binding and histone modifications, across the genome. |
| Chromatin | A complex of DNA and proteins, primarily histones, that organizes the genome within the nucleus and regulates gene expression through structural and chemical modifications. |
| Chromatin Variability | A measure of how chromatin states change across different cell types, tissues, or conditions, reflecting the dynamic regulatory potential of genomic regions. |
| Concordance Analysis | A text mining method that examines the contextual usage of specific terms or phrases across documents to identify recurring patterns and semantic structures. |
| Corpus (Literature Corpus) | A large, structured collection of text documents used for computational analysis, typically consisting of scientific articles in literature mining studies. |
| DNA Methylation | An epigenetic modification involving the addition of a methyl group to cytosine residues, typically at CpG sites, influencing gene expression without altering the DNA sequence. |
| Epigenetics | The study of heritable and reversible changes in gene expression that occur without alterations in the underlying DNA sequence, often mediated by chemical modifications of DNA and histones. |
| Epigenomic Biomarker | A biomarker derived from epigenetic features, such as DNA methylation or chromatin state, used to indicate disease status, biological age, or environmental exposure. |
| Epigenome | The complete set of epigenetic modifications across the genome, including DNA methylation, histone modifications, and chromatin accessibility patterns. |
| Full-Text Mining | A computational approach that analyzes the entire content of documents, including abstracts and main text, to extract deeper contextual and structural information. |
| IoT (Internet of Things) | A network of interconnected devices equipped with sensors and communication capabilities, used in health monitoring systems to collect real-time physiological data. |
| Latent Dirichlet Allocation (LDA) | A probabilistic topic modeling algorithm used to identify latent thematic structures within large text corpora by grouping words into topics. |
| Machine Learning | A subset of AI that enables systems to learn from data and improve performance on specific tasks without explicit programming, widely applied in biomedical data analysis. |
| Multi-omics Integration | The combined analysis of multiple types of biological data (e.g., genomics, epigenomics, transcriptomics) to provide a comprehensive understanding of biological systems. |
| Next-Generation Sequencing (NGS) | High-throughput sequencing technologies that enable rapid and large-scale analysis of DNA and RNA, forming the basis of modern epigenomic studies. |
| Topic Modeling | A text mining technique used to discover hidden thematic structures in large document collections by identifying clusters of co-occurring words. |
| Wearable Devices | Electronic devices worn on the body that continuously collect physiological and behavioral data, such as heart rate, activity levels, and sleep patterns. |
| WGBS (Whole-Genome Bisulfite Sequencing) | A sequencing technique that provides genome-wide, single-base resolution maps of DNA methylation patterns. |
References
- Jones, P.A.; Baylin, S.B. The fundamental role of epigenetic events in cancer. Nat. Rev. Genet. 2002, 3, 415–428.
- Bird, A. Perceptions of epigenetics. Nature 2007, 447, 396–398.
- Portela, A.; Esteller, M. Epigenetic modifications and human disease. Nat. Biotechnol. 2010, 28, 1057–1068.
- Lim, I.; Tan, J.; Alam, A.; Idrees, M.; Brenan, P.A.; Coletta, R.D.; Kujan, O. Epigenetics in the diagnosis and prognosis of head and neck cancer: A systematic review. J. Oral Pathol. Med. 2024, 53, 90–106.
- Burkitt, K. Role of DNA methylation profiles as potential biomarkers and novel therapeutic targets in head and neck cancer. Cancers 2023, 15, 4685.
- Villicaña, S.; Castillo-Fernandez, J.; Hannon, E.; Christiansen, C.; Tsai, P.-C.; Maddock, J.; Kuh, D.; Suderman, M.; Power, C.; Relton, C.; et al. Genetic impacts on DNA methylation help elucidate regulatory genomic processes. Genome Biol. 2023, 24, 11.
- Feehley, T.; O’Donnell, C.W.; Mendlein, J.; Karande, M.; McCauley, T. Drugging the epigenome in the age of precision medicine. Clin. Epigenetics 2023, 15, 6.
- Nadiger, N.; Veed, J.K.; Chinya Nataraj, P.; Mukhopadhyay, A. DNA methylation and type 2 diabetes: A systematic review. Clin. Epigenetics 2024, 16, 67.
- Clark, S.J.; Lee, H.J.; Smallwood, S.A.; Kelsey, G.; Reik, W. Single-cell epigenomics: Powerful new methods for understanding gene regulation and cell identity. Genome Biol. 2016, 17, 72.
- Wang, K.; Liu, H.; Hu, Q.; Wang, L.; Liu, J.; Zheng, Z.; Liu, G.H. Epigenetic regulation of aging: Implications for interventions of aging and diseases. Signal Transduct. Target. Ther. 2022, 7, 374.
- Bao-Caamano, A.; Costa-Fraga, N.; Cayrefourcq, L.; Jácome, M.A.; Rodriguez-Casanova, A.; Muinelo-Romay, L.; Díaz-Lagares, A. Epigenomic analysis reveals a unique DNA methylation program of metastasis-competent circulating tumor cells in colorectal cancer. Sci. Rep. 2023, 13, 15401.
- Novoa, J.; Chagoyen, M.; Benito, C.; Moreno, F.J.; Pazos, F. Pmidigest: Interactive review of large collections of PubMed entries to distill relevant information. Genes 2023, 14, 942.
- Smalheiser, N.R.; Fragnito, D.P.; Tirk, E.E. Anne O’Tate: Value-added PubMed search engine for analysis and text mining. PLoS ONE 2021, 16, e0248335.
- Zhao, S.; Su, C.; Lu, Z.; Wang, F. Recent advances in biomedical literature mining. Brief. Bioinform. 2021, 22, bbaa057.
- Ye, Z.; Tafti, A.P.; He, K.Y.; Wang, K.; He, M.M. Sparktext: Biomedical text mining on big data framework. PLoS ONE 2016, 11, e0162721.
- Comeau, D.C.; Wei, C.H.; Dogan, R.I.; Lu, Z. PMC text mining subset in BioC: 2.3 million full text articles and growing. arXiv 2018, arXiv:1804.05957.
- Westergaard, D.; Stærfeldt, H.H.; Tønsberg, C.; Jensen, L.J.; Brunak, S. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Comput. Biol. 2018, 14, e1005962.
- Cohen, K.B.; Johnson, H.L.; Verspoor, K.; Roeder, C.; Hunter, L.E. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinform. 2010, 11, 492.
- Tudor, C.O.; Ross, K.E.; Li, G.; Vijay-Shanker, K.; Wu, C.H.; Arighi, C.N. Construction of phosphorylation interaction networks by text mining of full-length articles using the eFIP system. Database 2015, 2015, bav020.
- Islamaj Doğan, R.; Kim, S.; Chatr-Aryamontri, A.; Chang, C.S.; Oughtred, R.; Rust, J.; Tyers, M. The BioC-BioGRID corpus: Full text articles annotated for curation of protein–protein and genetic interactions. Database 2017, 2017, baw147.
- Aria, M.; Cuccurullo, C. bibliometrix: An R-tool for comprehensive science mapping analysis. J. Informetr. 2017, 11, 959–975.
- Donthu, N.; Kumar, S.; Mukherjee, D.; Pandey, N.; Lim, W.M. How to conduct a bibliometric analysis: An overview and guidelines. J. Bus. Res. 2021, 133, 285–296.
- Chen, C. Science mapping: A systematic review of the literature. J. Data Inf. Sci. 2017, 2, 1–40.
- Van Eck, N.J.; Waltman, L. Visualizing bibliometric networks. In Measuring Scholarly Impact: Methods and Practice; Springer: Cham, Switzerland, 2014; pp. 285–320.
- Börner, K.; Chen, C.; Boyack, K.W. Visualizing knowledge domains. Annu. Rev. Inf. Sci. Technol. 2003, 37, 179–255.
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT 2019; pp. 4171–4186.
- Wei, C.H.; Kao, H.Y.; Lu, Z. PubTator: A web-based text mining tool for assisting biocuration. Nucleic Acids Res. 2013, 41, W518–W522.
- Johnson, D.S.; Mortazavi, A.; Myers, R.M.; Wold, B. Genome-wide mapping of in vivo protein–DNA interactions. Science 2007, 316, 1497–1502.
- Comeau, D.C.; Wei, C.H.; Islamaj Doğan, R.; Lu, Z. PMC text mining subset in BioC: About three million full-text articles and growing. Bioinformatics 2019, 35, 3533–3535.
- Horvath, S. DNA methylation age of human tissues and cell types. Genome Biol. 2013, 14, R115.
- Hannum, G.; Guinney, J.; Zhao, L.; Zhang, L.; Hughes, G.; Sadda, S.; Zhang, K. Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol. Cell 2013, 49, 359–367.
- Levine, M.E.; Lu, A.T.; Quach, A.; Chen, B.H.; Assimes, T.L.; Bandinelli, S.; Horvath, S. An epigenetic biomarker of aging for lifespan and healthspan. Aging 2018, 10, 573–591.
- Lu, A.T.; Quach, A.; Wilson, J.G.; Reiner, A.P.; Aviv, A.; Raj, K.; Horvath, S. DNA methylation GrimAge strongly predicts lifespan and healthspan. Aging 2019, 11, 303–327.
- Buenrostro, J.D.; Giresi, P.G.; Zaba, L.C.; Chang, H.Y.; Greenleaf, W.J. Transposition of native chromatin for fast and sensitive epigenomic profiling. Nat. Methods 2013, 10, 1213–1218.
- Meissner, A.; Gnirke, A.; Bell, G.W.; Ramsahoye, B.; Lander, E.S.; Jaenisch, R. Reduced representation bisulfite sequencing for DNA methylation analysis. Nucleic Acids Res. 2005, 33, 5868–5877.
- Lister, R.; Pelizzola, M.; Dowen, R.H.; Hawkins, R.D.; Hon, G.; Tonti-Filippini, J.; Ecker, J.R. Human DNA methylomes at base resolution. Nature 2009, 462, 315–322.
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model. Bioinformatics 2020, 36, 1234–1240.
- Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Poon, H. Domain-specific language model pretraining. ACM Trans. Comput. Healthc. 2021, 3, 1–23.
- Xu, Z. DNA methylation-based health predictors. Epigenomics 2025, 17, 1083–1090.
- Kiselev, I.S.; Baulina, N.M.; Favorova, O.O. Epigenetic clock. Biochemistry (Moscow) 2025, 90, S356–S372.
- Chen, Y.; Cheng, X.; Ji, S. DNA methylation and prediction of biological age. Front. Mol. Biosci. 2025, 12, 1734464.
- Horvath, S.; Raj, K. DNA methylation-based biomarkers and aging. Nat. Rev. Genet. 2018, 19, 371–384.
- Martínez-Iglesias, O.; Naidoo, V.; Corzo, L.; Pego, R.; Seoane, S.; Rodríguez, S.; Cacabelos, R. DNA methylation as a biomarker for disease outcome. Genes 2023, 14, 365.
- Janovska, J.; Nixdorff, U.N.; Voicehovska, J.V. DNA methylation as a biomarker for cardiometabolic risk. Eur. J. Prev. Cardiol. 2023, 30, zwad125.
- Kim, H.; Wang, X.; Jin, P. Developing DNA methylation-based diagnostic biomarkers. J. Genet. Genomics 2018, 45, 87–97.
- Sahoo, K.; Lingasamy, P.; Khatun, M.; Sudhakaran, S.L.; Salumets, A.; Sundararajan, V.; Modhukur, V. Artificial intelligence in cancer epigenomics. Epigenetics Chromatin 2025, 18, 35.
- Levy, J.J.; Diallo, A.B.; Saldias Montivero, M.K.; Gabbita, S.; Salas, L.A.; Christensen, B.C. Aging prediction with AI-based epigenetic clocks. Epigenomics 2025, 17, 49–57.
- Kalyakulina, A.; Yusipov, I.; Trukhanov, A.; Franceschi, C.; Moskalev, A.; Ivanchenko, M. EpInflammAge. Int. J. Mol. Sci. 2025, 26, 6284.
- Lee, K.E.; Park, H.S. Preliminary testing for the Markov property of the fifteen chromatin states of the Broad Histone Track. Bio-Med. Mater. Eng. 2015, 26, S1917–S1927.
- Lent, H.; Lee, K.E.; Park, H.S. Building the frequency profile of the core promoter element patterns in the three ChromHMM promoter states at 200 bp intervals: A statistical perspective. Genom. Inform. 2015, 13, 152.
- Feinberg, A.P. Phenotypic plasticity and the epigenetics of human disease. Nature 2007, 447, 433–440.
- Baylin, S.B.; Jones, P.A. A decade of exploring the cancer epigenome. Nat. Rev. Cancer 2011, 11, 726–734.
- Esteller, M. Epigenetics in cancer. N. Engl. J. Med. 2008, 358, 1148–1159.
- Ioannidis, J.P.A. Why most clinical research is not useful. PLoS Med. 2016, 13, e1002049.
- Ernst, J.; Kellis, M. ChromHMM: Automating chromatin-state discovery. Nat. Methods 2012, 9, 215–216.
- Kundaje, A.; Meuleman, W.; Ernst, J.; et al. Integrative analysis of 111 reference human epigenomes. Nature 2015, 518, 317–330.
- Hearst, M.A. Untangling text data mining. ACL 1999.
- Feldman, R.; Sanger, J. The Text Mining Handbook; Cambridge University Press: Cambridge, UK, 2007.
- Khabsa, M.; Giles, C.L. The number of scholarly documents on the public web. PLoS ONE 2014, 9, e93949.
- West, J.D.; Jacquet, J.; King, M.M.; Correll, S.J.; Bergstrom, C.T. The role of text in scientific article analysis. J. Informetr. 2013, 7, 487–499.
- Griffiths, T.L.; Steyvers, M. Finding scientific topics. Proc. Natl. Acad. Sci. USA 2004, 101 (Suppl. 1), 5228–5235.
- Blei, D.M. Probabilistic topic models. Commun. ACM 2012, 55, 77–84.
- Topol, E.J. High-performance medicine. Nat. Med. 2019, 25, 44–56.
- Esteva, A.; Robicquet, A.; Ramsundar, B.; Kuleshov, V.; DePristo, M.; Chou, K.; Cui, C.; Corrado, G.; Thrun, S.; Dean, J. Guide to deep learning in healthcare. Nat. Med. 2019, 25, 24–29.
- Miotto, R.; Li, L.; Kidd, B.A.; Dudley, J.T. Deep learning for healthcare. Brief. Bioinform. 2018, 19, 1236–1246.
- Rajkomar, A.; Dean, J.; Kohane, I. Scalable deep learning with EHR. npj Digit. Med. 2018, 1, 18.
- Ioannidis, J.P.A. Why most published research findings are false. PLoS Med. 2005, 2, e124.
- Beam, A.L.; Kohane, I.S. Big data and machine learning in healthcare. JAMA 2018, 319, 1317–1318.



| Objective | Query Pattern | Application Context | Notes |
| General biomarker screening | (epigenetic OR “DNA methylation”) AND biomarker | Broad initial corpus retrieval | High recall |
| Diagnostic biomarker | (DNA methylation OR histone) AND “diagnostic biomarker” | Early detection marker identification | Strong clinical relevance |
| Prognostic biomarker | (epigenetic OR methylation) AND prognostic AND survival | Survival analysis–related studies | Frequently linked to TCGA datasets |
| Signature discovery | (epigenetic AND signature) AND cancer | Multi-gene biomarker identification | Often includes machine learning studies |
| Specific histone modification | (H3K27ac OR H3K4me3) AND biomarker | Histone modification–based markers | Enhancer-associated biomarkers |
| CpG marker | (“CpG island methylation”) AND biomarker | Promoter silencing markers | Tumor suppressor gene studies |
| Chromatin accessibility biomarker | (ATAC-seq OR “chromatin accessibility”) AND biomarker | Regulatory biomarker discovery | Increasing in recent studies |
| Multi-omics biomarker | (methylation AND expression) AND biomarker | Integrative inverse-correlation markers | Multi-omics integration studies |
| Liquid biopsy biomarker | (“circulating DNA methylation”) AND biomarker | Non-invasive diagnostic biomarkers | High translational value |
| Drug response biomarker | (epigenetic AND “drug response”) AND biomarker | Therapy response markers | Precision medicine applications |
| Aging biomarker | (“epigenetic clock” OR “methylation age”) AND biomarker | Aging and biological age studies | Includes Horvath clock–based models |
| Neuro-epigenetic biomarker | (“DNA methylation” AND brain) AND biomarker | Neurodegenerative disease markers | Alzheimer’s disease research |
| Immune-related epigenetic biomarker | (epigenetic AND immune) AND biomarker | Immunotherapy-related biomarkers | Checkpoint response studies |
| Enhancer biomarker | (“super-enhancer” OR enhancer) AND epigenetic biomarker | Transcriptional regulation markers | Cancer subtype characterization |
| Transcription factor regulation biomarker | (“ChIP-seq” AND “transcription factor”) AND biomarker | Regulatory network biomarkers | Often linked to ENCODE data |
| Strategy | Query Pattern | Intended Effect |
| Disease restriction | (“epigenetic biomarker”) AND (cancer OR tumor) | Reduces domain noise by restricting disease scope |
| Mechanism emphasis | (hypermethylation OR hypomethylation) AND biomarker | Focuses on mechanistically characterized biomarkers |
| Clinical relevance | (“epigenetic biomarker”) AND cohort | Prioritizes patient-based or population-level studies |
| Machine learning–based discovery | (epigenetic AND “machine learning”) AND biomarker | Identifies computationally derived biomarker signatures |
| Validation-focused studies | (“epigenetic biomarker”) AND validation | Retrieves studies emphasizing reproducibility and independent validation |
| year_bin | n_docs | var_doc_rate | var_score_norm_mean |
| 1990 | 15 | 0 | 0.000413 |
| 1995 | 19 | 0.105263 | 0.000402 |
| 2000 | 47 | 0 | 0.00057 |
| 2005 | 67 | 0.044776 | 0.000686 |
| 2010 | 392 | 0.07398 | 0.000714 |
| 2015 | 1525 | 0.086557 | 0.000623 |
| 2020 | 3773 | 0.048237 | 0.000618 |
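A table of this shape can be produced by grouping documents into five-year bins and averaging per-document indicators. The sketch below assumes, without confirmation from the source, that `var_doc_rate` is the fraction of documents in a bin containing at least one variability expression (for instance, 2/19 ≈ 0.105263 would reproduce the 1995 row) and that `var_score_norm_mean` is the mean of a per-document normalized score. The function names and toy records are hypothetical.

```python
from collections import defaultdict


def year_bin(year, width=5):
    """Floor a publication year to the start of its bin, e.g. 1997 -> 1995."""
    return (year // width) * width


def summarize(docs, width=5):
    """docs: iterable of (year, has_variability_term, normalized_score) tuples."""
    bins = defaultdict(lambda: {"n_docs": 0, "n_var": 0, "score_sum": 0.0})
    for year, has_var, score in docs:
        entry = bins[year_bin(year, width)]
        entry["n_docs"] += 1
        entry["n_var"] += 1 if has_var else 0
        entry["score_sum"] += score
    rows = []
    for start in sorted(bins):
        entry = bins[start]
        rows.append({
            "year_bin": start,
            "n_docs": entry["n_docs"],
            "var_doc_rate": entry["n_var"] / entry["n_docs"],
            "var_score_norm_mean": entry["score_sum"] / entry["n_docs"],
        })
    return rows


# Toy records, not documents from the corpus.
docs = [(1995, True, 0.001), (1997, False, 0.0), (2003, False, 0.0005)]
rows = summarize(docs)
for row in rows:
    print(row)
```

Because early bins hold very few documents, their rates move sharply with a single article, so they are best read alongside `n_docs`.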
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).