Submitted: 17 January 2026
Posted: 19 January 2026
Abstract
Keywords:
1. Introduction
2. Methodology
2.1. Literature Search Strategy
- Primary: Bioinformatics, Genomics, Deep Learning, State Space Models, Precision Medicine.
- Secondary: Genomic reproducibility, Double-Cut-and-Join (DCJ), Byte-Pair Encoding (BPE), Long-range genomic dependencies, T2T assembly, Caduceus, Mamba.
2.2. Selection Criteria
- Temporal Relevance: Preference was given to studies published between 2012 and 2025 to capture the rise of high-throughput sequencing and the deep learning revolution.
- Thematic Alignment: Works were included if they addressed specific methodological challenges such as sequence tokenization, algorithmic complexity in comparative genomics, or the scalability of neural architectures to ultra-long DNA sequences.
- Impact and Reliability: Selection favored peer-reviewed journals (e.g., Nature, Science, Bioinformatics) and high-impact benchmarking papers from established consortia such as GIAB and MAQC.
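The three criteria above can be read as a simple screening checklist. The sketch below is illustrative only: the `Study` record, its field names, and the topic labels are hypothetical and do not describe the authors' actual screening tooling. Because the text states the criteria as preferences rather than hard cut-offs, the function counts how many criteria a study meets instead of rejecting it outright.

```python
from dataclasses import dataclass

# Hypothetical record for a candidate study; all field names are assumptions.
@dataclass(frozen=True)
class Study:
    title: str
    year: int
    topics: frozenset      # free-form topic labels assigned during screening
    source: str            # journal name or consortium acronym
    peer_reviewed: bool

# Illustrative labels for the themes named under "Thematic Alignment".
TARGET_TOPICS = frozenset({
    "sequence-tokenization",
    "comparative-genomics-complexity",
    "long-sequence-scalability",
})

# Consortia named under "Impact and Reliability".
CONSORTIA = frozenset({"GIAB", "MAQC"})

def criteria_met(study, first_year=2012, last_year=2025):
    """Count how many of the three stated criteria a study satisfies (0 to 3)."""
    in_window = first_year <= study.year <= last_year
    on_topic = bool(study.topics & TARGET_TOPICS)
    reliable = study.peer_reviewed or study.source in CONSORTIA
    return sum((in_window, on_topic, reliable))

if __name__ == "__main__":
    example = Study(
        title="Hypothetical tokenization benchmark",
        year=2021,
        topics=frozenset({"sequence-tokenization"}),
        source="Bioinformatics",
        peer_reviewed=True,
    )
    print(criteria_met(example))
```

A count rather than a boolean leaves room for older foundational works or preprints to be retained when they satisfy the remaining criteria.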
2.3. Data Extraction and Thematic Synthesis
- Foundational Robustness: Investigating the "reproducibility crisis" in bioinformatics and the mathematical frameworks for evolutionary distance.
- Representation Learning: Evaluating how DNA is tokenized and processed by Large Language Models (LLMs).
- Architectural Evolution: Contrasting traditional architectures (CNN, RNN, Transformers) with emergent linear-time models (SSMs).
- Clinical Integration: Synthesizing how computational advances translate into precision medicine and diagnostic tools.
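Of the four themes above, Representation Learning lends itself most directly to a small example. The following is a minimal sketch of Byte-Pair Encoding (BPE) merge learning applied to a DNA string, assuming a single in-memory sequence and single-nucleotide starting tokens. The function name `learn_bpe_merges` is hypothetical, and the code is not taken from any of the reviewed models or tokenizer libraries.

```python
from collections import Counter

def learn_bpe_merges(sequence, num_merges):
    """Learn up to `num_merges` BPE merges from one DNA string.

    Returns the ordered list of merged pairs and the final tokenization.
    """
    tokens = list(sequence)  # start from single-nucleotide tokens
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs (overlapping occurrences are counted,
        # a common simplification in toy BPE implementations).
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        if pair_counts[best] < 2:
            break  # no pair repeats, so further merges add nothing
        # Replace each non-overlapping occurrence of the best pair, left to right.
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        merges.append(best)
    return merges, tokens

if __name__ == "__main__":
    merges, tokens = learn_bpe_merges("ACGACGACGT", num_merges=2)
    print(merges)
    print(tokens)
```

Because each merge targets the most frequent adjacent pair, repetitive regions tend to be absorbed into long tokens early, which is one reason repeat content can dominate a learned genomic vocabulary.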
2.4. Quality Assessment
3. Conceptual Foundations and Methodological Advances in Bioinformatics and Genomics
3.1. The Role of Reproducibility in Genomic Research
3.2. Comparative Genomics: Quantifying Evolutionary Distances and Reconstruction
3.3. Tokenization and Representation Learning in Genomic Language Models
3.4. Deep Learning Architectures in Bioinformatics
3.5. Architecture Taxonomy and Applications
4. Advanced Computational Models for Long-Range Genomic Analysis
4.1. Modeling Long-Range Dependencies
4.2. Benchmarking and Evaluation
4.3. Integrating Bioinformatics Pathways into Health Applications
5. Conclusions
Data Availability
Ethical Statement
Funding
Conflict of Interest
References
- Icer Baykal, P. et al. Genomic reproducibility in the bioinformatics era. arXiv 2023, arXiv:2308.09558.
- Aganezov, S.; Alekseyev, M.A. On pairwise distances and median score of three genomes under DCJ. BMC Bioinformatics 2012, 13, S1.
- Popova, M. et al. When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes. arXiv 2025, arXiv:2505.08918.
- Min, S. et al. Deep learning in bioinformatics. Brief. Bioinform. 2017, 18, 851–869.
- Popov, M. et al. Leveraging State Space Models in Long Range Genomics. arXiv 2025, arXiv:2504.06304.
- Stephens, Z.D. et al. Big Data: Astronomical or Genomical? PLoS Biol. 2015, 13, e1002195.
- Nurk, S. et al. The complete sequence of a human genome. Science 2022, 376, 44–53.
- Shendure, J. et al. DNA sequencing at 40: past, present and future. Nature 2017, 550, 345–353.
- Lander, E.S. Initial impact of the sequencing of the human genome. Nature 2011, 470, 187–197.
- Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 2016, 533, 452–454.
- Luscombe, N.M. et al. What is bioinformatics? A proposed definition and overview of the field. Methods Inf. Med. 2001, 40, 346–358.
- Birney, E. The impact of genomics on 21st century medicine. Cold Spring Harb. Mol. Case Stud. 2019, 5, a004317.
- Collins, F.S.; Varmus, H. A new initiative on precision medicine. N. Engl. J. Med. 2015, 372, 793–795.
- Topol, E.J. High-performance medicine: the convergence of human and artificial intelligence. Nat. Med. 2019, 25, 44–56.
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752.
- LeCun, Y. et al. Deep learning. Nature 2015, 521, 436–444.
- Sandve, G.K. et al. Ten simple rules for reproducible computational research. PLoS Comput. Biol. 2013, 9, e1003285.
- Mangul, S. et al. Systematic visualization of the reproducibility of published genomic surveys. Nat. Methods 2019, 16, 11–12.
- Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 2018, 36, 983–987.
- Li, H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 2014, 30, 2843–2851.
- Zook, J.M. et al. Integrating human sequence data sets provides a benchmark of West African ancestry. Sci. Data 2019, 6, 287.
- Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 2017, 35, 316–319.
- Altschul, S. et al. Basic local alignment search tool. J. Mol. Biol. 1990, 215, 403–410.
- Yancopoulos, S. et al. Efficient distance calculation and finite chromosome phylogeny under unrestricted genome rearrangements. Bioinformatics 2005, 21, 3340–3346.
- Bourque, G. Comparative genomics and genome evolution. Curr. Opin. Genet. Dev. 2009, 19, 507–512.
- Fertin, G. et al. Combinatorics of Genome Rearrangements. 1st ed.; MIT Press: Cambridge, MA, USA, 2009; pp. 1–330.
- Ji, Y. et al. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 2021, 37, 2112–2120.
- Dalla-Torre, H. et al. The Nucleotide Transformer: Building and evaluating robust foundation models for human genomics. bioRxiv 2023, 2023.01.11.523679.
- Nguyen, E. et al. HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. In Advances in Neural Information Processing Systems; Curran Associates: Red Hook, NY, USA, 2023; Vol. 36.
- Thomas, C. et al. Evo: DNA foundation modeling from molecular to genome scale. bioRxiv 2024, 2024.02.27.582234.
- Alipanahi, B. et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 2015, 33, 831–838.
- Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021, 596, 583–589.
- Eraslan, G. et al. Deep learning: new computational modelling techniques for genomics. Nat. Rev. Genet. 2019, 20, 389–403.
- Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems; Curran Associates: Red Hook, NY, USA, 2019; Vol. 32.
- Wang, Y. et al. The applications of deep learning in multi-omics data integration. Brief. Bioinform. 2021, 22, bbab154.
- Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009, 326, 289–293.
- Schiff, P. et al. Caduceus: Bi-directional Equivariant Long-range DNA Sequence Modeling. arXiv 2024, arXiv:2403.03230.
- Zhou, J.; Troyanskaya, O.G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 2015, 12, 931–934.
- Hamburg, M.A.; Collins, F.S. The Path to Personalized Medicine. N. Engl. J. Med. 2010, 363, 301–304.
- Hasin, Y. et al. Multi-omics strategies and data integration. Genome Biol. 2017, 18, 83.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).