Submitted: 05 June 2025
Posted: 05 June 2025
Abstract
Keywords:
1. Introduction
2. The Fast-Changing World of Viruses and the Tools We Use to Study Them
- Scientists first isolate the mRNA from the virus. Often, they grab onto a unique "tail" (called a poly-A tail) that most mRNA molecules have.
- An enzyme called reverse transcriptase then copies the RNA into a single strand of complementary DNA (cDNA). The original RNA strand is then removed, and another enzyme builds a second DNA strand, resulting in a stable, double-stranded DNA (ds-cDNA) copy of the original RNA message.
- This ds-cDNA can then be inserted into tiny biological carriers (like plasmids or bacteriophages, which are viruses that infect bacteria) that can make many copies of it inside host cells like Escherichia coli. All these copied DNA pieces together form the cDNA library [7].
- An "ori" (origin of replication), which is like the "start" signal for making copies.
- An antibiotic-resistance gene. This is a neat trick: if you grow bacteria in the presence of an antibiotic, only the bacteria that have successfully taken up your plasmid (and its resistance gene) will survive, making them easy to find.
3. Bringing Viruses to Life: Building and Rebuilding Genomes
4. Super-Precise Editing and Watching Evolution Happen Live
Transition: From Understanding Viruses to Tracking Them on a Global Scale
5. Keeping Tabs on Viruses Worldwide: The UShER Project and Mutation Trees
5.1. Smart Data Storage: The MAT Format and Protocol Buffers
5.2. How UShER Works: Pre-Processing, Placement, and the Handy matUtils Toolkit
- Pre-processing: Think of this as getting the main family tree ready. UShER takes existing viral family trees and their genetic data and cleverly organizes them into that super-compact MAT format we just talked about. It even figures out the likely genetic makeup of the ancestors in the tree. This initial prep work makes everything that comes next much faster and more efficient.
- Placement: This is where the action happens with new virus samples. As new viral genomes are sequenced (say, from new patient samples), UShER rapidly figures out the best spot to add them to the existing, optimized tree. It does this by calculating the fewest genetic changes needed to connect the new sample to the tree. This means the global viral family tree is constantly and quickly updated. Being able to do this over and over again is crucial for watching how a virus is spreading and evolving, almost in real-time – a vital tool for public health officials.
- Get quick summaries of the tree (like how many branches or samples it contains).
- Pull out smaller sections of the tree for a closer look at specific outbreaks or variants.
- Convert the MAT data into more familiar formats (like Newick for trees or VCF for mutation lists) so it can be used with other scientific software.
- It even has advanced features, like helping to label new viral groups (clades) as they emerge or checking how confident UShER is about where a new sample fits in the tree.
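The placement step described above can be sketched in a few lines. Below is a hypothetical, pared-down Python illustration of the parsimony principle (attach a new sample where the fewest extra mutations are implied). It is not UShER's actual algorithm, which operates on the full MAT and evaluates placements along branches, not just at existing nodes; the tree encoding and node names are invented for this example.

```python
# Illustrative sketch of parsimony-style placement (not UShER's real implementation).
# The tree is stored as {node: (parent, {mutations on the branch leading to node})}.

def genotype(tree, node):
    """Set of mutations accumulated from the root down to `node`."""
    muts = set()
    while node is not None:
        parent, branch_muts = tree[node]
        muts |= branch_muts
        node = parent
    return muts

def place_sample(tree, sample_muts):
    """Return the node whose genotype differs from the sample by the fewest changes."""
    best_node, best_cost = None, float("inf")
    for node in tree:
        # Symmetric difference counts mutations present in one but not the other
        cost = len(genotype(tree, node) ^ set(sample_muts))
        if cost < best_cost:
            best_node, best_cost = node, cost
    return best_node, best_cost

# Toy tree: root -> A (C100T) -> B (G200A); root -> C (T300G)
tree = {
    "root": (None, set()),
    "A": ("root", {"C100T"}),
    "B": ("A", {"G200A"}),
    "C": ("root", {"T300G"}),
}

# A new sample carrying C100T and G200A attaches at B with zero extra changes
print(place_sample(tree, ["C100T", "G200A"]))  # -> ('B', 0)
```

Repeating this cheap comparison for each incoming genome is what lets the global tree be updated continuously rather than rebuilt from scratch.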
5.3. Seeing the Big Picture: Visualizing Massive Trees with Taxonium
Transition: From Powerful Tools to Practical Steps
6. Getting Your Hands Dirty: A Quick Guide to Working with MAT Files
- Set Up Shop: First, a researcher would set up their digital workspace. This usually means connecting to their cloud storage (like Google MyDrive) to keep files organized. Then, they'd install the necessary bioinformatics software. A handy tool called Conda [34] often helps manage these software installations, including the UShER toolkit (which, as we learned, contains the useful matUtils commands).
- Get the Data: Once the virtual lab is ready, they can download the latest public MAT file. Remember, this file is in that super-efficient Protocol Buffer (.pb) format.
- Put matUtils to Work: With the MAT file downloaded, the matUtils commands become the researcher's best friend. These commands can "unpack" the compressed binary file and pull out all sorts of information. For example:
- matUtils summary can quickly give basic facts about the tree (like how many viruses are in it).
- matUtils extract is really flexible. It can be used to grab just a specific part of the big tree for a closer look, or to convert the MAT data into other common formats that different software programs can understand (like Newick files for tree structures or VCF files for lists of mutations).
Transition: Peering into the Future with a New Kind of AI
7. Crystal Ball Computing: Using AI (Transformers) to Predict Viral Evolution
- Predicting which new viral variant might become dominant.
- Forecasting which mutations are likely to pop up on specific branches of the viral family tree.
- Identifying changes that could help a virus escape our immune system or become resistant to treatments (this is called antigenic drift or immune escape).
- Even helping scientists design hypothetical new virus sequences with desired features – perhaps for making safer vaccines (e.g., a version with low ability to cause disease but still triggers immunity).
7.1. How Transformers Get Smart About Viral Genes
- Learning to "Translate" Old Viruses into New Ones (Seq2Seq Models [37]): Imagine you have the genetic sequence of an ancestor virus and the sequence of one of its direct descendants. This approach treats the problem like translating one language into another. The ancestor sequence is the "input sentence," and the descendant (or "future") sequence is the "translated output." The Transformer's "encoder" part reads the entire ancestor sequence, building a deep understanding of every genetic letter and its context. Its special "self-attention" ability lets it potentially consider the whole genome at once when looking at any single spot. Then, its "decoder" part uses this understanding to build the descendant sequence, one genetic letter at a time. As it predicts each new letter, it can look back at the ancestor sequence and the part of the descendant sequence it has already built. This helps the AI learn the "rules" of how viruses change – which spots are likely to mutate, and how different mutations might influence each other (remember epistasis, where one mutation's effect depends on others?).
- Creating Brand New, Believable Virus Sequences (Generative Models): Some Transformers are designed to be creative. They can learn the underlying "grammar" or patterns of real viral sequences. Once trained, these generative models can dream up completely new viral sequences that, while novel, still look and behave like plausible, real-world viruses. Scientists can use these models to explore the vast universe of potential future variants or even to help design artificial sequences for things like new vaccine candidates.
- Predicting a Virus's Success (Fitness or Escape Potential): Transformers can also be trained like judges. You feed them a viral sequence (or just a key part, like the spike protein of SARS-CoV-2), and they output a score. This score could predict how "fit" the virus is (how well it can survive and spread), its potential to outgrow other variants, or its likelihood of dodging our immune defenses. This directly tackles the big question: which new viral versions are most likely to succeed in a population? To learn this, the AI is often trained on real-world data, like results from those Deep Mutational Scanning experiments we talked about, or how common different variants are in actual outbreaks.
- Uncovering Hidden Teamwork Between Mutations (Modeling Epistasis): Remember how Transformers can potentially pay "attention" to the whole sequence at once? This makes them exceptionally good at spotting and understanding complex, long-distance relationships between different mutations across a virus's entire genome. This is crucial for modeling epistasis, where the impact of one mutation is tied to whether other specific mutations are also present or absent – like a complex team play in sports.
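To make the "self-attention" idea concrete, here is a toy, pure-Python version of scaled dot-product attention over a four-letter sequence. The one-hot embeddings and the lack of learned query/key/value projections are deliberate simplifications; real Transformers learn these vectors. What the sketch does show faithfully is the key property: every position scores its relationship to every other position in a single step, no matter how far apart they are.

```python
# Toy scaled dot-product attention over a short nucleotide sequence.
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(embeddings):
    """For each position, return attention weights over all positions.
    Queries and keys are the raw embeddings here (no learned projections)."""
    d = len(embeddings[0])
    weights = []
    for q in embeddings:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in embeddings]
        weights.append(softmax(scores))
    return weights

# One-hot embeddings for the sequence A, C, G, A
embed = {'A': [1, 0, 0, 0], 'C': [0, 1, 0, 0], 'G': [0, 0, 1, 0], 'T': [0, 0, 0, 1]}
seq = "ACGA"
w = self_attention([embed[n] for n in seq])

# Position 0 ('A') attends most strongly to itself and to position 3 (the other 'A'),
# regardless of the distance between them; this is what makes epistasis learnable.
print([round(x, 2) for x in w[0]])  # -> [0.31, 0.19, 0.19, 0.31]
```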
7.2. Teaching AI About Viral Family Trees and Handling Super-Long Genetic Codes
- Learning from Ancestors and Descendants (Implicit Integration): This is the most common trick. Scientists use the family tree to pick out pairs of viruses where one is a direct ancestor of the other. Each of these "parent-child" sequence pairs becomes a lesson for the AI. The AI sees thousands, or even millions, of these real-life evolutionary steps and gradually learns the "rules" of how viruses change from one generation to the next directly from the tree's structure.
- Giving the AI Extra Clues from the Tree (Feature Engineering): Think of this like giving the AI some extra notes about each virus's place in the family. Scientists can calculate various numbers from the tree for each virus: how long its branch is (which can represent time or the number of mutations since its parent), how big its particular family group (clade) is, what its reconstructed ancestor might have looked like, or how closely related it is to important reference viruses. These numerical clues can then be fed into the Transformer along with the genetic sequence itself, giving the AI more context.
- Teaming Up with Other AI: Transformers + Graph Neural Networks (Hybrid Models): This is a more advanced strategy. It involves combining Transformers with another type of AI called Graph Neural Networks (GNNs) [38]. GNNs are superstars at learning from network-like structures – and a family tree is a perfect example! The GNN can first create a smart summary (an "embedding") for each virus based on its position in the tree. This tree-based summary is then combined with the virus's genetic sequence information and fed into the Transformer. It's a cutting-edge approach that really tries to get the best of both worlds.
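As a concrete (and simplified) illustration of the feature-engineering idea, the sketch below computes three of the clues mentioned above, branch length, clade size, and distance from the root, from a toy parent-pointer tree. The tree encoding, node names, and feature names are invented for this example; a real pipeline would derive them from a MAT or Newick file.

```python
# Hypothetical tree-derived features for each virus (feature engineering sketch).
# The tree is stored as {node: (parent, number_of_mutations_on_branch)}.

def tree_features(tree):
    # Build a child list so we can count descendants
    children = {n: [] for n in tree}
    for node, (parent, _) in tree.items():
        if parent is not None:
            children[parent].append(node)

    def clade_size(node):
        # Number of leaves (sampled viruses) under `node`
        if not children[node]:
            return 1
        return sum(clade_size(c) for c in children[node])

    def root_distance(node):
        # Total mutations accumulated from the root down to `node`
        d = 0
        while node is not None:
            parent, branch_len = tree[node]
            d += branch_len
            node = parent
        return d

    return {n: {"branch_len": tree[n][1],
                "clade_size": clade_size(n),
                "root_distance": root_distance(n)} for n in tree}

tree = {
    "root": (None, 0),
    "A": ("root", 2),
    "B": ("A", 1),
    "C": ("A", 3),
    "D": ("root", 1),
}
feats = tree_features(tree)
print(feats["C"])  # -> {'branch_len': 3, 'clade_size': 1, 'root_distance': 5}
```

Vectors like these can then be concatenated with (or added alongside) the sequence embedding that the Transformer receives.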
7.3. Dealing with DNA Overload: Making Transformers Work for Long Viral Genomes
- Chopping it Up (Chunking / Sliding Window): One straightforward idea is to divide the long genome into smaller, more manageable, often overlapping pieces (say, 512 or 1024 letters at a time). Standard Transformers can then work on these smaller chunks. It's simpler to set up and can use less memory for each chunk. The downside? The AI might miss the "big picture" connections that span across the entire genome, and the boundaries between chunks can sometimes cause minor issues.
- Smarter Transformers for Long Reads (Specialized Long-Sequence Transformers): Researchers have also designed new types of Transformers specifically for handling long sequences [40,41]. You might hear names like Longformer, Reformer, Performer, or BigBird [42]. These models use clever tricks to make the attention mechanism more efficient (reducing the computational load to something like O(L log L) or even O(L)). This means they can look at much longer sequences and still pay attention across the entire genome, which is really important for catching those tricky long-distance genetic interactions (epistasis). If the computer power is available, these specialized models are generally preferred over simple chunking, even though they can still be demanding on resources and might need more careful fine-tuning [40].
- Zooming in on Key Genes (Focus on Specific Genes/Proteins): Often, instead of tackling the whole genome, researchers will focus the AI on just one or a few key viral genes or the proteins they code for. For example, with SARS-CoV-2, a lot of attention is on the Spike protein (which is about 3,800 letters long) [43]. These shorter segments are much easier for standard Transformers to handle and often contain the mutations that are most important for how the virus spreads and makes us sick.
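The chunking strategy above is easy to sketch. Below is a minimal sliding-window splitter; the chunk size and overlap values are arbitrary choices for illustration, and the overlap between neighboring chunks is what softens the boundary problem mentioned earlier.

```python
# Minimal sliding-window chunker for long genomes (the "chopping it up" strategy).

def chunk_genome(seq, chunk_size=512, overlap=64):
    """Split `seq` into chunks of `chunk_size` letters, each sharing `overlap`
    letters with its predecessor; the final chunk may be shorter."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(seq), step):
        chunks.append(seq[start:start + chunk_size])
        if start + chunk_size >= len(seq):
            break  # this chunk already reaches the end of the sequence
    return chunks

genome = "ACGT" * 300               # 1,200-letter toy genome
chunks = chunk_genome(genome)
print(len(chunks), len(chunks[0]))  # -> 3 512
```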
7.4. Teaching AI the ABCs (or ACGTs) of Viral Genetic Code
- 'A' might become 0
- 'C' might become 1
- 'G' might become 2
- 'T' might become 3
- An 'N' (for an unknown base or a gap in the sequence) or a special padding symbol might become 4.
- Padding: Not all viral sequences (or chunks of sequences) will be the exact same length. To feed them to the AI efficiently, shorter sequences are "padded" out with a special padding token (like our 'N' or code 4) until they all reach a standard maximum length.
- Attention Masks: Because we've added these padding tokens, we need to tell the AI which parts of the sequence are real genetic data and which parts are just padding. An "attention mask" does this – it's like giving the AI a note saying, "Pay attention to these tokens, but ignore these other ones."
Funding
Conflicts of Interest
Acknowledgments
Appendix A. Processing of SARS-CoV-2 Mutation Data
# 1. Mount Google MyDrive in Google Colab
from google.colab import drive
import os

# Mount Google MyDrive for use in Google Colab
# In Colab: activate all allowable permissions for access to MyDrive
# to bypass any authentication error
drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive')

# 2. Installation of Conda in Google Colab
!pip install -q condacolab
import condacolab; condacolab.install()

# Initialize the shell and restart the kernel
# Colab is expected to restart and may report a "crash" in its log
!conda init bash

# Verify installation
!conda --version

# 3. Installation of the UShER toolkit via Conda
# See documentation for other setup options:
# https://usher-wiki.readthedocs.io/en/latest/Installation.html

# Create a new environment for UShER
!conda create -n usher-env  # python=3.10, if installed, to support BTE library

# Activate the new environment
!conda activate usher-env

# Set up channels
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge

# Install package
!conda install -q usher

# 4. Download the latest UShER Mutation-Annotated Tree (MAT) data (.pb file; compressed)
!wget http://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/public-latest.all.masked.pb.gz

# Uncompress the MAT data file (-f parameter will force a file overwrite)
!gunzip -f public-latest.all.masked.pb.gz

# Export summary data associated with the MAT file (e.g., --clades, --node-stats,
# --mutations, --samples, --get-all)
!matUtils summary --input-mat public-latest.all.masked.pb --clades clades.tsv
# !matUtils summary --input-mat public-latest.all.masked.pb --samples samples.txt

# 5. Obtain mutation data for each node of the subtree
# If any issues arise, verify that public-latest.all.masked.pb is in the current
# working directory.
# Replace "YOUR_CLADE_OF_INTEREST" with the actual clade name, e.g., "20H (Beta)"
# May replace "mutations_for_clade.txt" with another output filename
# Tested with SARS-CoV-2 clade `20H (Beta)` (10179 samples); if scaling up to larger
# clades, note the full SARS-CoV-2 dataset is ~800x as large
!matUtils extract \
  --input-mat public-latest.all.masked.pb \
  --clade "YOUR_CLADE_OF_INTEREST" \
  --all-paths mutations_for_clade.txt

# Explanation of the command:
# `--input-mat public-latest.all.masked.pb`: Specifies the input MAT file.
# `--clade "YOUR_CLADE_OF_INTEREST"`: Focuses the extraction on the members of the
#   named clade. This name must exactly match a clade name present in the MAT file's
#   metadata. May specify multiple clade names as a comma-delimited list. Add double
#   quotes to names with spaces.
# `--all-paths mutations_for_clade.txt`: This crucial option tells `matUtils` to
#   output the mutations along each path from the clade's common ancestor to every
#   sample and internal node within that clade. The output is saved to
#   `mutations_for_clade.txt`. The list is created in depth-first traversal order.

# Output Format:
# The output file (`mutations_for_clade.txt`) will typically list each node (internal
# nodes often labeled like `node_X:`) or sample (e.g.,
# `Country/SampleID/Date|Accession|Date:`) followed by the mutations inferred to have
# occurred on the branch immediately leading to it. For example:
#   node_1: G15910T
#   Sample/ID/Date|Accession|Date: C1191T,C11674T
#   node_2: T13090C
# This detailed mutation information is invaluable for understanding the specific
# evolutionary changes within a lineage and can serve as input for further analyses,
# including preparing data for training predictive models like Transformers.

"""
# (Optional) Convert VCF formatted file to Fasta formatted sequence data
# The vcf2fasta binary fails to run in Colab or WSL on Windows 11; untested in other
# environments.

# Installation of VCF library
!conda install -q vcflib
!echo "Current working directory: $PWD"

# Download reference sequence for reconstruction of Fasta sequences from VCF file
!wget https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/chromosomes/NC_045512v2.fa.gz
!gunzip -f NC_045512v2.fa.gz

# The VCF file contains lists of variants in the nucleotide sequence of the genotypes
# and depends on a Fasta formatted reference sequence to reconstruct full sequences
# (specified by --reference).
!vcfindex my_clade.vcf > my_clade_idx.vcf
!vcf2fasta --reference NC_045512v2.fa my_clade_idx.vcf
"""
Appendix B. Nucleotide Tokenization Code for Transformers (Proof of Concept)
# Import the PyTorch library
import torch

# Define a simple vocabulary mapping
nuc_to_id = {'A': 0, 'C': 1, 'G': 2, 'T': 3, 'N': 4}  # 'N' for unknown/gap; map to padding or UNK
id_to_nuc = {0: 'A', 1: 'C', 2: 'G', 3: 'T', 4: 'N'}

# Define special tokens and their IDs
PAD_TOKEN_ID = 4   # Using 'N' as padding/unknown for simplicity
MAX_SEQ_LEN = 512  # Your chosen chunk size for Transformer input

def encode_sequence(seq_str):
    # Converts a nucleotide sequence string to a list of integer IDs, converting all
    # nucleotides to uppercase and mapping unknown characters to PAD_TOKEN_ID
    return [nuc_to_id.get(nuc.upper(), PAD_TOKEN_ID) for nuc in seq_str]

def prepare_chunk_for_transformer(chunk_str):
    # Encodes, pads, and creates an attention mask for a single sequence chunk,
    # preparing it for input into a Transformer model
    encoded_ids = encode_sequence(chunk_str)

    # Pad the sequence to MAX_SEQ_LEN
    padding_length = MAX_SEQ_LEN - len(encoded_ids)
    input_ids = encoded_ids + [PAD_TOKEN_ID] * padding_length

    # Create attention mask (1 for real tokens, 0 for padding)
    attention_mask = [1] * len(encoded_ids) + [0] * padding_length

    # Convert to PyTorch tensors
    input_ids_tensor = torch.tensor(input_ids, dtype=torch.long)
    attention_mask_tensor = torch.tensor(attention_mask, dtype=torch.long)
    return input_ids_tensor, attention_mask_tensor

# Example usage (for one ancestor-descendant pair)
# The full ancestor_chunk string is abbreviated below
ancestor_chunk = "GTACGTACGTACGTACGTACGTAC...gtacgtacgtacgtacgtacgtacgtacgtacgtac"
input_ids, attention_mask = prepare_chunk_for_transformer(ancestor_chunk)
References
- Lowen, A. C. (2017) Constraints, Drivers, and Implications of Influenza A Virus Reassortment. Annual Review of Virology, 4, 105-21. [CrossRef]
- Domingo, E., Martin, V., Perales, C., Grande-Pérez, A., García-Arriaza, J., & Arias, A. (2006) Viruses as quasispecies: biological implications. Current Topics in Microbiology and Immunology, 299, 51-82. [CrossRef]
- Hay, A.J., Gregory, V., Douglas, A.R., & Lin, Y.P. (2001) The evolution of human influenza viruses. Philosophical Transactions of the Royal Society B: Biological Sciences. 356, 1861-70. [CrossRef]
- Rougeon, F., Kourilsky, P., & Mach, B. (1975) Insertion of a rabbit β-globin gene sequence into an E. coli plasmid. Nucleic Acids Research, 2, 2365-78. [CrossRef]
- Temin, H. M., & Mizutani, S. (1970) RNA-dependent DNA polymerase in virions of Rous sarcoma virus. Nature, 226, 1211-13. [CrossRef]
- Baltimore, D. (1970) RNA-dependent DNA polymerase in virions of RNA tumour viruses. Nature, 226, 1209-11. [CrossRef]
- Sambrook, J., & Russell, D.W. (2001) Molecular Cloning: A Laboratory Manual (3rd ed.). Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, USA.
- Kosuri, S., & Church, G. (2014) Large-scale de novo DNA synthesis: technologies and applications. Nature Methods, 11, 499–507. [CrossRef]
- Soares, M. B., Bonaldo, M. F., Jelene, P., Su, L., Lawton, L., & Efstratiadis, A. (1994) Construction and characterization of a normalized cDNA library. Proceedings of the National Academy of Sciences USA, 91, 9228-32. [CrossRef]
- Cohen, S. N., Chang, A. C. Y., Boyer, H. W., & Helling, R. B. (1973) Construction of Biologically Functional Bacterial Plasmids In Vitro. Proceedings of the National Academy of Sciences USA, 70, 3240-44. [CrossRef]
- Casali, N., & Preston, A. (Eds.) (2008) E. coli Plasmid Vectors: Methods and Applications (Vol. 235). Humana Press, Totowa, NJ, USA.
- Geisbert, T. W., & Feldmann, H. (2011) Recombinant Vesicular Stomatitis Virus–Based Vaccines Against Ebola and Marburg Virus Infections. The Journal of Infectious Diseases, 204(Supplement 3), S1075-S1081. [CrossRef]
- Cello, J., Paul, A. V., & Wimmer, E. (2002) Chemical Synthesis of Poliovirus cDNA: Generation of Infectious Virus in the Absence of Natural Template. Science, 297, 1016-18. [CrossRef]
- Taubenberger, J. K., Reid, A. H., Lourens, R. M., Wang, R., Jin, G., & Fanning, D. G. (2005) Characterization of the 1918 influenza virus polymerase genes. Nature, 437, 889-93. [CrossRef]
- Tumpey, T. M., Basler, C. F., Aguilar, P. V., Zeng, H., Solórzano, A., Swayne, D. E., ... & García-Sastre, A. (2005) Characterization of the Reconstructed 1918 Spanish Influenza Pandemic Virus. Science, 310, 77-80. [CrossRef]
- Doudna, J. A., & Charpentier, E. (2014) The new frontier of genome engineering with CRISPR-Cas9. Science, 346, 1258096. [CrossRef]
- Mojica, F. J. M., Diez-Villasenor, C., Garcia-Martinez, J., & Soria, E. (2005) Intervening Sequences of Regularly Spaced Prokaryotic Repeats Derive from Foreign Genetic Elements. Journal of Molecular Evolution, 60, 174-82. [CrossRef]
- Barrangou, R., Fremaux, C., Deveau, H., Richards, M., Boyaval, P., Moineau, S., ... & Horvath, P. (2007) CRISPR Provides Acquired Resistance Against Viruses in Prokaryotes. Science, 315, 1709-12. [CrossRef]
- Jinek, M., Chylinski, K., Fonfara, I., Hauer, M., Doudna, J. A., & Charpentier, E. (2012) A Programmable Dual-RNA–Guided DNA Endonuclease in Adaptive Bacterial Immunity. Science, 337, 816-21. [CrossRef]
- Cong, L., Ran, F. A., Cox, D., Lin, S., Barretto, R., Habib, N., ... & Zhang, F. (2013) Multiplex Genome Engineering Using CRISPR/Cas Systems. Science, 339, 819-23. [CrossRef]
- Adli, M. (2018) The CRISPR tool kit for genome editing and beyond. Nature Communications, 9, 1911. [CrossRef]
- Jiang, F., & Doudna, J. A. (2017) CRISPR-Cas9 Structures and Mechanisms. Annual Review of Biophysics, 46, 505-29. [CrossRef]
- Elena, S. F., & Sanjuán, R. (2007) Virus Evolution: Insights from an Experimental Approach. Annual Review of Ecology, Evolution, and Systematics, 38, 27-52. [CrossRef]
- Fowler, D. M., & Fields, S. (2014) Deep mutational scanning: a new style of protein science. Nature Methods, 11, 801-7. [CrossRef]
- Meini, M.R., Tomatis, P. E., Weinreich, D. M., & Vila, A. J. (2015) Quantitative Description of a Protein Fitness Landscape Based on Molecular Features. Molecular Biology and Evolution, 32, 1774-87. [CrossRef]
- Burton T. D., & Eyre, N. S. (2021) Applications of Deep Mutational Scanning in Virology. Viruses, 13, 1020. [CrossRef]
- Starr, T. N., Greaney, A. J., Hilton, S. K., Ellis, D., Crawford, K. H., Dingens, A. S., ... & Bloom, J. D. (2020) Deep Mutational Scanning of SARS-CoV-2 Receptor Binding Domain Reveals Constraints on Folding and ACE2 Binding. Cell, 182, 1295-1310.e20. [CrossRef]
- Turakhia, Y., Thornlow, B., Hinrichs, A. S., De Maio, N., Gozashti, L., Lanfear, R., ... & Corbett-Detig, R. (2021) Ultrafast Sample placement on Existing tRees (UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic. Nature Genetics, 53, 809-16. [CrossRef]
- Ultrafast Sample Placement on Existing Trees. Available online: https://github.com/yatisht/usher (accessed on 4 June 2025).
- Protocol Buffers Documentation. Available online: https://protobuf.dev (accessed on 4 June 2025).
- Sanderson, T. (2022) Taxonium, a web-based tool for exploring large phylogenetic trees. Elife, 11. [CrossRef]
- Taxonium documentation. Available online: https://docs.taxonium.org (accessed on 4 June 2025).
- Bisong, E. (2019) Google Colaboratory. In: Building Machine Learning and Deep Learning Models on Google Cloud Platform. Apress, Berkeley, CA, USA. [CrossRef]
- Conda: A system-level, binary package and environment manager running on all major operating systems and platforms. Available online: https://github.com/conda/conda (accessed on 4 June 2025).
- A Python Suite for Evolutionary and Comparative Genomics. Available online: https://github.com/bob-friedman/EvolCat-Python (accessed on 4 June 2025).
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017) Attention is All you Need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762v7.
- Yin, X., & Wan, X. (2022, May) How do seq2seq models perform on end-to-end data-to-text generation? In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 7701-10). [CrossRef]
- Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., & Yu, P. S. (2020) A Comprehensive Survey on Graph Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 32, 4-24. https://arxiv.org/abs/1901.00596.
- Naqvi, A. A. T., Fatima, K., Mohammad, T., Fatima, U., Singh, I. K., Singh, A., ... & Hassan, M. I. (2020) Insights into SARS-CoV-2 genome, structure, evolution, pathogenesis and therapies: Structural genomics approach. Biochimica et Biophysica Acta (BBA)-Molecular Basis of Disease, 1866, 165878. [CrossRef]
- Huang, Y., Xu, J., Lai, J., Jiang, Z., Chen, T., Li, Z., ... & Zhao, P. (2023) Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey. arXiv, arXiv:2311.12351. https://arxiv.org/abs/2311.12351v2.
- Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., ... & Ahmed, A. (2020) Big Bird: Transformers for Longer Sequences. Advances in Neural Information Processing Systems, 33, 17283-97. https://arxiv.org/abs/2007.14062v2.
- Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., ... & Rush, A. M. (2019) HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv, arXiv:1910.03771. https://arxiv.org/abs/1910.03771.
- Zhang, J., Xiao, T., Cai, Y., & Chen, B. (2021) Structure of SARS-CoV-2 spike protein. Current Opinion in Virology, 50, 173-82. [CrossRef]
- Friedman, R. (2023) Tokenization in the Theory of Knowledge. Encyclopedia, 3, 380-86. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
