Submitted: 08 June 2025; Posted: 09 June 2025
Abstract
Keywords:
1. Introduction
2. The Sandbox: Creating Our Learning Environment
3. From Letters to Language: The Core Concepts of AI Models
3.1. The Language of a Model: Tokens, Vectors, and Embeddings
- A Token is a single, discrete unit of our input [4]. For our purposes, each DNA base (‘A’, ‘C’, etc.) is a token. In human language, a token might be a word or a punctuation mark. We first assign a simple numerical ID to each token (e.g., ‘A’ becomes 0, ‘C’ becomes 1).
- A Vector is simply a list of numbers. While a single number (like ‘0’ for ‘A’) is a start, it is not very descriptive. A vector allows us to represent our token with a much richer “definition”. For example, we might represent ‘A’ not just as ‘0’, but as a list of 32 numbers, like `[0.12, -0.45, 0.89, ...]`. This list is a vector.
- An Embedding is the process of creating these rich vector definitions. It is the crucial step where the model learns the best vector representation for each token based on how it is used in the data. A good embedding will learn that ‘A’ and ‘G’ (both purines) are more similar to each other than ‘A’ and ‘T’ are, and this similarity will be reflected in their vectors. In essence, an embedding is a learned, multi-dimensional dictionary that translates simple tokens into meaningful mathematical objects.
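To make these three ideas concrete, the short sketch below shows how DNA bases can be tokenized into integer IDs and then passed through a learnable Keras embedding layer. This is a minimal illustration, not the full `transformer_model.py` of Appendix A.2, and the 32-dimensional embedding size is simply an assumed example.

```python
import numpy as np
import tensorflow as tf

# Tokens: each DNA base receives a simple integer ID (assumed vocabulary and order).
vocab = {"A": 0, "C": 1, "G": 2, "T": 3}

def tokenize(seq):
    """Convert a DNA string into a list of integer token IDs."""
    return [vocab[base] for base in seq]

token_ids = np.array([tokenize("ACGTAC")])   # shape: (1, 6)

# Embedding: a learned lookup table that turns each token ID into a vector.
embedding = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=32)

vectors = embedding(token_ids)               # shape: (1, 6, 32)
print(vectors.shape)                         # each base is now a 32-number vector
```

Before training, the 32 numbers attached to each base are essentially random; during training, the embedding layer's weights are adjusted so that tokens used in similar contexts end up with similar vectors.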
3.2. What is “Deep Learning”? Building with Layers
3.3. A Note on How Transformers Differ
4. The Blueprint: Architecting the Model’s “Brain”
4.1. The Input System: Understanding What and Where
4.2. The Core Engine: The Transformer Block
4.3. Final Assembly
5. The Engine of Learning: How a Model Improves
5.1. The Loss Function: Knowing How Wrong You Are
5.2. The Optimizer: Taking a Step in the Right Direction
5.3. The Metric: Reporting Your Progress to the World
6. The Orchestra Conductor: The Training Pipeline in Action
6.1. The Recipe for Training
- The Optimizer (`adam`): This is the engine that drives the learning. It’s an efficient algorithm for adjusting the model’s internal parameters to improve performance.
- The Loss Function (`sparse_categorical_crossentropy`): This is the function the model uses to measure its own mistakes. After each prediction, it calculates a “loss” score that quantifies how far its prediction was from the true answer. The goal of training is to make this score as low as possible.
- The Metrics (`accuracy`): This is the human-understandable score we use to judge the model’s performance—in this case, the percentage of bases it predicts correctly.
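As a hedged illustration, the snippet below shows how this recipe is expressed in Keras. The miniature model here is only a placeholder so the example runs on its own; the real architecture lives in `transformer_model.py`, but the `compile()` call with these three ingredients follows the same pattern.

```python
import tensorflow as tf

# A stand-in model: any Keras model that outputs one probability per vocabulary
# entry (here, the four DNA bases) at each position would be compiled the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=4, output_dim=32),
    tf.keras.layers.Dense(4, activation="softmax"),
])

model.compile(
    optimizer="adam",                        # the engine that adjusts the weights
    loss="sparse_categorical_crossentropy",  # measures how wrong each prediction is
    metrics=["accuracy"],                    # human-readable fraction of correct bases
)
```

Once compiled, training reduces to a call along the lines of `model.fit(...)`, supplied with the integer-encoded input sequences and their true target bases.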
6.2. The Learning Loop
Interpreting the Numbers: From Random Guessing to Competence
6.3. The Payoff: From Trained Weights to Final Predictions
6.3.1. The Mechanism of Prediction: A Look Inside the Trained Model
- During Training: The `model.fit()` process is the act of meticulously turning each of these dials, guided by the optimizer, until the machine consistently transforms inputs into the correct outputs. The final positions of all these dials are the learned weights.
- During Inference: When we use the model for prediction, all these dials are locked in place. The model is no longer learning or adjusting. When we provide a new input sequence (which has been converted into a numerical vector), it enters this fixed machine. At each layer, the input signal is transformed by a series of mathematical operations (primarily multiplications and additions) with the layer’s fixed weights. This transformed signal flows through the entire network, from one layer to the next, until it reaches the final output layer. This layer then produces the ultimate result: a list of probabilities for each possible character in our vocabulary. The character with the highest probability is the model’s final prediction.
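A minimal sketch of this inference step is shown below. The file name `trained_model.keras` and the toy input are hypothetical stand-ins; the point is that prediction is simply a forward pass through fixed weights followed by an argmax over the output probabilities.

```python
import numpy as np
import tensorflow as tf

# Load a previously trained model; the file name here is a hypothetical example.
model = tf.keras.models.load_model("trained_model.keras")

# A new input sequence, already converted to integer token IDs (A=0, C=1, G=2, T=3).
new_sequence = np.array([[0, 1, 2, 3, 0, 1]])

# The fixed weights transform the input into per-position probabilities over the
# vocabulary; no learning or weight adjustment happens during this call.
probabilities = model.predict(new_sequence)

# The predicted base at each position is the one with the highest probability.
predicted_ids = probabilities.argmax(axis=-1)
print(predicted_ids)
```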
6.3.2. The Final Artifact and Its Use
7. Placing Our Tool on the Map: A Comparative Analysis
7.1. In Contrast to Classical Bioinformatics
7.2. In Contrast to Other Machine Learning Models
8. Conclusion: A Foundational Toolkit for the Modern Biologist
- The Materials: How the raw language of biology—the sequences of ‘A’, ‘C’, ‘G’, and ‘T’—is translated into the native language of a neural network through the essential concepts of tokens, vectors, and learned embeddings.
- The Blueprint: How a sophisticated architecture like the Transformer is not a monolith, but an elegant assembly of modular layers, with the self-attention mechanism at its core, enabling the model to understand the crucial element of context within a sequence.
- The Physics: How a model truly learns, demystifying the dynamic process where the guiding signal of a loss function and the methodical work of an optimizer iteratively adjust the model’s internal weights, transforming it from a random guesser into a competent predictor.
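For readers who want to connect "The Blueprint" back to code, the following sketch implements the scaled dot-product self-attention of Vaswani et al. [2] in plain NumPy. It is a deliberate simplification (no multiple heads, no learned projection matrices) with toy dimensions, but it shows how every position in a sequence draws context from every other position.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position mixes information from all others, weighted by relevance
    (a softmax over scaled dot products), yielding a context-aware representation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V

# Toy example: 6 positions (one per base) with 4-dimensional embeddings.
x = np.random.rand(6, 4)
output = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V = x
print(output.shape)                              # (6, 4)
```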
Funding
Acknowledgments
Conflicts of Interest
Appendix A.1: ‘simulate_data_v2.py’
Appendix A.2: `transformer_model.py`
Appendix A.3: `train.py`
Appendix B.1 - The Learning Process: Training Log
Appendix B.2 - The Final Result: Generating and Viewing Predictions
References
1. Friedman, R. The Viral Chase: Outsmarting Evolution with Data Trees and AI Predictions. Preprints 2025.
2. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Advances in Neural Information Processing Systems 2017, 30.
3. IUPAC codes. Available online: https://genome.ucsc.edu/goldenPath/help/iupac.html (accessed on 7 June 2025).
4. Friedman, R. Tokenization in the Theory of Knowledge. Encyclopedia 2023, 3, 380–386.
5. Chollet, F. Deep Learning with Python, 2nd ed.; Manning Publications Co.: Shelter Island, NY, USA, 2021.
6. Gauthier, J.; Vincent, A.T.; Charette, S.J.; Derome, N. A brief history of bioinformatics. Briefings in Bioinformatics 2019, 20, 1981–1996.
7. Ben-Hur, A.; Ong, C.S.; Sonnenburg, S.; Schölkopf, B.; Rätsch, G. Support Vector Machines and Kernels for Computational Biology. PLoS Computational Biology 2008, 4, e1000173.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
