Submitted: 22 May 2026
Posted: 26 May 2026
Abstract
Keywords:
Introduction
Search Strategy and Scope
Phage Sequence Data and Databases
Phage Identification and Prophage Detection
Genome Assembly, Quality Assessment and Comparative Genomics
Gene Annotation and Functional Prediction
Taxonomy and Classification
Lifestyle Prediction
Defense and Counter-Defense System Prediction
Host Prediction and Phage-Bacteria Interaction
Proposed Workflows for Phage Characterisation
Benchmarking, Current Limitations, and Future Directions
Conclusions
Funding
Data Availability Statement
Conflicts of Interest
References
- Camargo, A. P. et al. IMG/VR v4: an expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata. Nucleic Acids Res. 51, D733–D743 (2022). [CrossRef]
- Cook, R. et al. INfrastructure for a PHAge REference Database: Identification of Large-Scale Biases in the Current Collection of Cultured Phage Genomes. PHAGE Ther. Appl. Res. 2, 214–223 (2021). [CrossRef]
- Roux, S., Enault, F., Hurwitz, B. L. & Sullivan, M. B. VirSorter: mining viral signal from microbial genomic data. PeerJ 3, e985 (2015). [CrossRef]
- Bin Jang, H. et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nat. Biotechnol. 37, 632–639 (2019). [CrossRef] [PubMed]
- Ren, J., Ahlgren, N. A., Lu, Y. Y., Fuhrman, J. A. & Sun, F. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5, (2017). [CrossRef]
- Shang, J. PhaGCN: Phage taxonomic classification with graph convolutional networks. Bioinformatics https://github.com/KennthShang/PhaGCN (2021).
- Camargo, A. P. et al. Identification of mobile genetic elements with geNomad. Nat. Biotechnol. 42, 1303–1312 (2023). [CrossRef]
- Bouras, G. et al. Protein structure-informed bacteriophage genome annotation with Phold. Nucleic Acids Res. 54, (2026). [CrossRef]
- Wendling, C. C., Vasse, M. & Wielgoss, S. Phage quest: a beginner’s guide to explore viral diversity in the prokaryotic world. Brief. Bioinform. 26, bbaf449 (2025). [CrossRef] [PubMed]
- Brister, J. R., Ako-adjei, D., Bao, Y. & Blinkova, O. NCBI Viral Genomes Resource. Nucleic Acids Res. 43, D571–D577 (2014). [CrossRef]
- Russell, D. A. & Hatfull, G. F. PhagesDB: the actinobacteriophage database. Bioinformatics 33, 784–786 (2016). [CrossRef] [PubMed]
- ICTV. ICTV Virus Metadata Resource (VMR) (2022). [CrossRef]
- Bolduc, B. et al. Machine learning enables scalable and systematic hierarchical virus taxonomy. Nat. Biotechnol. 1–10 (2025). [CrossRef]
- Ho, S. F. S., Wheeler, N. E., Millard, A. D. & van Schaik, W. Gauge your phage: benchmarking of bacteriophage identification tools in metagenomic sequencing data. Microbiome 11, (2023). [CrossRef]
- Amgarten, D., Braga, L. P. P., da Silva, A. M. & Setubal, J. C. MARVEL, a Tool for Prediction of Bacteriophage Sequences in Metagenomic Bins. Front. Genet. 9, (2018). [CrossRef]
- Ren, J. et al. Identifying viruses from metagenomic data using deep learning. Quant. Biol. 8, 64–77 (2020). [CrossRef]
- Auslander, N., Gussow, A. B., Benler, S., Wolf, Y. I. & Koonin, E. V. Seeker: alignment-free identification of bacteriophage genomes by deep learning. Nucleic Acids Res. 48, e121–e121 (2020). [CrossRef]
- Fang, Z. et al. PPR-Meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning. GigaScience 8, (2019). [CrossRef] [PubMed]
- Antipov, D., Raiko, M., Lapidus, A. & Pevzner, P. A. Metaviral SPAdes: assembly of viruses from metagenomic data. Bioinformatics 36, 4126–4129 (2020). [CrossRef] [PubMed]
- Kieft, K., Zhou, Z. & Anantharaman, K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome 8, (2020). [CrossRef]
- Guo, J. et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, (2021). [CrossRef] [PubMed]
- Jurtz, V. I., Villarroel, J., Lund, O., Larsen, M. V. & Nielsen, M. MetaPhinder—Identifying Bacteriophage Sequences in Metagenomic Data Sets. PLOS ONE 11, e0163111 (2016). [CrossRef]
- Shang, J., Tang, X., Guo, R. & Sun, Y. Accurate identification of bacteriophages from metagenomic data using Transformer. Brief. Bioinform. 23, (2022). [CrossRef] [PubMed]
- Bai, Z. et al. Identification of bacteriophage genome sequences with representation learning. Bioinformatics 38, 4264–4270 (2022). [CrossRef]
- Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021). [CrossRef]
- Akhter, S., Aziz, R. K. & Edwards, R. A. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res. 40, e126–e126 (2012). [CrossRef] [PubMed]
- Arndt, D. et al. PHASTER: a better, faster version of the PHAST phage search tool. Nucleic Acids Res. 44, W16–W21 (2016). [CrossRef]
- Wishart, D. S. et al. PHASTEST: faster than PHASTER, better than PHAST. Nucleic Acids Res. 51, W443–W450 (2023). [CrossRef] [PubMed]
- Gauthier, C. H. et al. DEPhT: a novel approach for efficient prophage discovery and precise extraction. Nucleic Acids Res. 50, e75–e75 (2022). [CrossRef]
- Sirén, K. et al. Rapid discovery of novel prophages using biological feature engineering and machine learning. NAR Genomics Bioinforma. 3, lqaa109 (2021). [CrossRef]
- Wu, S., Fang, Z., Tan, J., Li, M. & Wang, C. Benchmarking computational tools for virus identification in metagenomes across biomes. Microbiome 12, 215 (2024).
- Nayfach, S. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat. Biotechnol. 39, 578–585 (2020). [CrossRef] [PubMed]
- Mallawaarachchi, V. et al. Phables: from fragmented assemblies to high-quality bacteriophage genomes. Bioinformatics 39, (2023). [CrossRef]
- Bankevich, A. et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J. Comput. Biol. 19, 455–477 (2012). [CrossRef]
- Antipov, D., Rayko, M., Kolmogorov, M. & Pevzner, P. A. viralFlye: assembling viruses and identifying their hosts from long-read metagenomics data. Genome Biol. 23, (2022). [CrossRef] [PubMed]
- Chen, L. & Banfield, J. F. COBRA improves the completeness and contiguity of viral genomes assembled from metagenomes. Nat. Microbiol. 9, 737–750 (2024). [CrossRef]
- Kieft, K., Adams, A., Salamzade, R., Kalan, L. & Anantharaman, K. vRhyme enables binning of viral genomes from metagenomes. Nucleic Acids Res. 50, e83 (2022). [CrossRef]
- Zolfo, M. et al. Detecting contamination in viromes using ViromeQC. Nat. Biotechnol. 37, 1408–1412 (2019). [CrossRef]
- Gilchrist, C. L. M. & Chooi, Y.-H. clinker & clustermap.js: automatic generation of gene cluster comparison figures. Bioinformatics 37, 2473–2475 (2021). [CrossRef] [PubMed]
- Hatfull, G. F. & Hendrix, R. W. Bacteriophages and their genomes. Curr. Opin. Virol. 1, 298–303 (2011). [CrossRef]
- Hatfull, G. F. Dark Matter of the Biosphere: the Amazing World of Bacteriophage Diversity. J. Virol. 89, 8107–8110 (2015). [CrossRef]
- McNair, K., Zhou, C., Dinsdale, E. A., Souza, B. & Edwards, R. A. PHANOTATE: a novel approach to gene identification in phage genomes. Bioinformatics 35, 4537–4542 (2019). [CrossRef]
- Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, (2010). [CrossRef] [PubMed]
- Besemer, J., Lomsadze, A. & Borodovsky, M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29, 2607–2618 (2001). [CrossRef] [PubMed]
- Kelley, D. R., Liu, B., Delcher, A. L., Pop, M. & Salzberg, S. L. Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Res. 40, e9 (2012). [CrossRef]
- Terzian, P. et al. PHROG: families of prokaryotic virus proteins clustered using remote homology. NAR Genomics Bioinforma. 3, (2021). [CrossRef] [PubMed]
- Bouras, G. et al. Pharokka: a fast scalable bacteriophage annotation tool. Bioinformatics 39, (2022). [CrossRef]
- Ecale Zhou, C. L. et al. MultiPhATE2: code for functional annotation and comparison of phage genomes. G3 GenesGenomesGenetics 11, (2021). [CrossRef]
- Shaffer, M. et al. DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic Acids Res. 48, 8883–8900 (2020). [CrossRef]
- Zimmermann, L. et al. A Completely Reimplemented MPI Bioinformatics Toolkit with a New HHpred Server at its Core. J. Mol. Biol. 430, 2237–2243 (2018). [CrossRef]
- Grazziotin, A. L., Koonin, E. V. & Kristensen, D. M. Prokaryotic Virus Orthologous Groups (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic Acids Res. 45, D491–D498 (2016). [CrossRef]
- Cantu, V. A. et al. PhANNs, a fast and accurate tool and web server to classify phage structural proteins. PLOS Comput. Biol. 16, e1007845 (2020). [CrossRef]
- PhageScanner: a reconfigurable machine learning framework for bacteriophage genomic and metagenomic feature annotation. Front. Microbiol. (2024). https://www.frontiersin.org/journals/microbiology/articles/10.3389/fmicb.2024.1446097/full.
- Flamholz, Z. N., Biller, S. J. & Kelly, L. Large language models improve annotation of prokaryotic viral proteins. Nat. Microbiol. 9, 537–549 (2024). [CrossRef] [PubMed]
- Boulay, A., Leprince, A., Enault, F., Rousseau, E. & Galiez, C. Empathi: embedding-based phage protein annotation tool by hierarchical assignment. Nat. Commun. 16, 9114 (2025). [CrossRef] [PubMed]
- Guan, J. et al. GOPhage: protein function annotation for bacteriophages by integrating the genomic context. Brief. Bioinform. 26, bbaf014 (2025). [CrossRef]
- Heinzinger, M. et al. Bilingual language model for protein sequence and structure. NAR Genomics Bioinforma. 6, (2024). [CrossRef]
- van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2023). [CrossRef]
- Mishra, P. M., Verma, N. C., Rao, C., Uversky, V. N. & Nandi, C. K. Intrinsically disordered proteins of viruses: Involvement in the mechanism of cell regulation and pathogenesis. Prog. Mol. Biol. Transl. Sci. 174, 1–78 (2020).
- Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024). [CrossRef]
- Turner, D. et al. Abolishment of morphology-based taxa and change to binomial species names: 2022 taxonomy update of the ICTV bacterial viruses subcommittee. Arch. Virol. 168, (2023). [CrossRef] [PubMed]
- Meier-Kolthoff, J. P. & Göker, M. VICTOR: genome-based phylogeny and classification of prokaryotic viruses. Bioinformatics 33, 3396–3404 (2017). [CrossRef]
- Nishimura, Y. et al. ViPTree: the viral proteomic tree server. Bioinformatics 33, 2379–2380 (2017). [CrossRef]
- Moraru, C., Varsani, A. & Kropinski, A. M. VIRIDIC—A Novel Tool to Calculate the Intergenomic Similarities of Prokaryote-Infecting Viruses. Viruses 12, 1268 (2020). [CrossRef] [PubMed]
- Millard, A. et al. taxMyPhage: Automated Taxonomy of dsDNA Phage Genomes at the Genus and Species Level. PHAGE Ther. Appl. Res. 6, 5–11 (2025). [CrossRef] [PubMed]
- Mayne, R., Aiewsakun, P., Turner, D., Adriaenssens, E. M. & Simmonds, P. GRAViTy-V2: a grounded viral taxonomy application. NAR Genomics Bioinforma. 6, lqae183 (2024). [CrossRef]
- Shang, J., Jiang, J. & Sun, Y. Bacteriophage classification for assembled contigs using graph convolutional network. Bioinformatics 37, i25–i33 (2021). [CrossRef]
- Shang, J., Peng, C., Liao, H., Tang, X. & Sun, Y. PhaBOX: a web server for identifying and characterizing phage contigs in metagenomic data. Bioinforma. Adv. 3, (2023). [CrossRef]
- Smug, B. J., Szczepaniak, K., Rocha, E. P. C., Dunin-Horkawicz, S. & Mostowy, R. J. Ongoing shuffling of protein fragments diversifies core viral functions linked to interactions with bacterial hosts. Nat. Commun. 14, (2023). [CrossRef] [PubMed]
- Hockenberry, A. J. & Wilke, C. O. BACPHLIP: predicting bacteriophage lifestyle from conserved protein domains. PeerJ 9, e11396 (2021). [CrossRef]
- McNair, K., Bailey, B. A. & Edwards, R. A. PHACTS, a computational approach to classifying the lifestyle of phages. Bioinformatics 28, 614–618 (2012). [CrossRef] [PubMed]
- Wu, S. et al. DeePhage: distinguishing virulent and temperate phage-derived sequences in metavirome data with a deep learning approach. GigaScience 10, (2021). [CrossRef]
- Shang, J., Tang, X. & Sun, Y. PhaTYP: predicting the lifestyle for bacteriophages using BERT. Brief. Bioinform. 24, (2022). [CrossRef] [PubMed]
- Zhang, Y., Mao, M., Zhang, R., Liao, Y.-T. & Wu, V. C. H. DeepPL: A deep-learning-based tool for the prediction of bacteriophage lifecycle. PLOS Comput. Biol. 20, e1012525 (2024). [CrossRef] [PubMed]
- Juhász, J. et al. ProkBERT PhaStyle: accurate phage lifestyle prediction with pretrained genomic language models. Bioinforma. Adv. 5, (2024). [CrossRef]
- Zhou, Z. et al. DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome. in (2024). doi:10.48550/arXiv.2306.15006.
- Tesson, F. et al. Systematic and quantitative view of the antiviral arsenal of prokaryotes. Nat. Commun. 13, (2022). [CrossRef] [PubMed]
- Biswas, A., Staals, R. H. J., Morales, S. E., Fineran, P. C. & Brown, C. M. CRISPRDetect: A flexible algorithm to define CRISPR arrays. BMC Genomics 17, (2016). [CrossRef]
- Couvin, D. et al. CRISPRCasFinder, an update of CRISRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins. Nucleic Acids Res. 46, W246–W251 (2018). [CrossRef]
- Russel, J., Pinilla-Redondo, R., Mayo-Muñoz, D., Shah, S. A. & Sørensen, S. J. CRISPRCasTyper: Automated Identification, Annotation, and Classification of CRISPR-Cas Loci. CRISPR J. 3, 462–469 (2020). [CrossRef]
- Payne, L. J. et al. Identification and classification of antiviral defence systems in bacteria and archaea with PADLOC reveals new system types. Nucleic Acids Res. 49, 10868–10878 (2021). [CrossRef]
- Payne, L. J. et al. PADLOC: a web server for the identification of antiviral defence systems in microbial genomes. Nucleic Acids Res. 50, W541–W550 (2022). [CrossRef]
- Millman, A. et al. An expanded arsenal of immune systems that protect bacteria from phages. Cell Host Microbe 30, 1556-1569.e5 (2022). [CrossRef] [PubMed]
- Yi, H. et al. AcrFinder: genome mining anti-CRISPR operons in prokaryotes and their viruses. Nucleic Acids Res. 48, W358–W365 (2020). [CrossRef] [PubMed]
- Wen, Y., Zhang, F. & Jiang, Y. AcaFinder: genome mining anti-CRISPR-associated genes. mSystems 8, e00981-22 (2023).
- Eitzinger, S. et al. Machine learning predicts new anti-CRISPR proteins. Nucleic Acids Res. 48, 4698–4708 (2020). [CrossRef]
- Li, Y. et al. AcrNET: predicting anti-CRISPR with deep learning. Bioinformatics 39, (2023). [CrossRef]
- DeWeirdt, P. C., Mahoney, E. M. & Laub, M. T. DefensePredictor: A Machine Learning Model to Discover Novel Prokaryotic Immune Systems. (2025). [CrossRef]
- Villarroel, J. et al. HostPhinder: A Phage Host Prediction Tool. Viruses 8, 116 (2016). [CrossRef] [PubMed]
- Galiez, C., Siebert, M., Enault, F., Vincent, J. & Söding, J. WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics 33, 3113–3114 (2017). [CrossRef]
- Zhang, R. et al. SpacePHARER: sensitive identification of phages from CRISPR spacers in prokaryotic hosts. Bioinformatics 37, 3364–3366 (2021). [CrossRef]
- Coutinho, F. H. et al. RaFAH: Host prediction for viruses of Bacteria and Archaea based on protein content. Patterns 2, 100274 (2021). [CrossRef]
- Shang, J. & Sun, Y. CHERRY: a Computational metHod for accuratE pRediction of virus–pRokarYotic interactions using a graph encoder–decoder model. Brief. Bioinform. 23, (2022). [CrossRef]
- Shang, J. & Sun, Y. Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning. BMC Biol. 19, 250 (2021). [CrossRef]
- Amgarten, D., Iha, B. K. V., Piroupo, C. M., da Silva, A. M. & Setubal, J. C. vHULK, a New Tool for Bacteriophage Host Prediction Based on Annotated Genomic Features and Neural Networks. PHAGE 3, 204–212 (2022). [CrossRef]
- Zielezinski, A., Deorowicz, S. & Gudyś, A. PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences. Bioinformatics 38, 1447–1449 (2021). [CrossRef] [PubMed]
- Zhou, F. et al. PHISDetector: A Tool to Detect Diverse In Silico Phage–Host Interaction Signals for Virome Studies. Genomics Proteomics Bioinformatics 20, 508–523 (2022). [CrossRef]
- Roux, S. et al. iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria. PLOS Biol. 21, e3002083 (2023). [CrossRef] [PubMed]
- Liu, F., Zhao, Z. & Liu, Y. PHPGAT: predicting phage hosts based on multimodal heterogeneous knowledge graph with graph attention network. Brief. Bioinform. 26, (2024). [CrossRef]
- Gonzales, M. E. M., Ureta, J. C. & Shrestha, A. M. S. PHIStruct: improving phage–host interaction prediction at low sequence similarity settings using structure-aware protein embeddings. Bioinformatics 41, btaf016 (2025). [CrossRef]
- Chen, Q. et al. MoEPH: an adaptive fusion-based LLM for predicting phage-host interactions in health informatics. Front. Microbiol. 16, (2025). [CrossRef] [PubMed]
- Klein-Sousa, V., Roa-Eguiara, A., Kielkopf, C. S., Sofos, N. & Taylor, N. M. I. RBPseg: Toward a complete phage tail fiber structure atlas. Sci. Adv. 11, (2025). [CrossRef]
- Boeckaerts, D. et al. Prediction of Klebsiella phage-host specificity at the strain level. Nat. Commun. 15, (2024). [CrossRef]
- Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017). [CrossRef]
- Köster, J. & Rahmann, S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 28, 2520–2522 (2012). [CrossRef] [PubMed]
- Dip, S. A. et al. Large language model agents for biological intelligence across genomics, proteomics, spatial biology, and biomedicine. Brief. Bioinform. 27, bbag110 (2026). [CrossRef] [PubMed]
- Shang, J. et al. From genomic signals to prediction tools: a critical feature analysis and rigorous benchmark for phage–host prediction. Brief. Bioinform. 26, (2025). [CrossRef]
- Grigson, S. R., Bouras, G., Dutilh, B. E., Olson, R. D. & Edwards, R. A. Computational function prediction of bacteria and phage proteins. Microbiol. Mol. Biol. Rev. 89, (2025). [CrossRef]
- Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023). [CrossRef] [PubMed]
- Hayes, T. et al. Simulating 500 million years of evolution with a language model. Science 387, 850–858 (2025). [CrossRef]
- Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (CASP)—Round XIV. Proteins Struct. Funct. Bioinforma. 89, 1607–1617 (2021). [CrossRef]
- Meyer, F. et al. Critical Assessment of Metagenome Interpretation: the second round of challenges. Nat. Methods 19, 429–440 (2022). [CrossRef]
- Gaborieau, B. et al. Prediction of strain level phage–host interactions across the Escherichia genus using only genomic information. Nat. Microbiol. 9, 2847–2861 (2024). [CrossRef]
- King, S. H. et al. Generative design of novel bacteriophages with genome language models. (2025). [CrossRef]



| Databases | |||||||
| No. | Tool | Year | Description | Key feature | Citations (Apr 2026) | Availability | URL |
| T1 | NCBI Viral Genomes Resource | 2015 | Curated complete viral genomes; RefSeq records | INSDC-linked reference standard | 777 | Web | https://www.ncbi.nlm.nih.gov/genome/viruses/ |
| T2 | PhagesDB | 2017 | Actinobacteriophage database (SEA-PHAGES) | >30,000 entries; >5,000 sequenced genomes as of 2026 | 540 | Web | https://phagesdb.org/ |
| T3 | INPHARED | 2021 | Automated curation of complete phage genomes | GitHub-distributed; automated updates; revealed 75% sampling bias | 371 | GitHub | https://github.com/RyanCook94/inphared |
| T4 | ICTV VMR | 2022 | Exemplar virus taxonomy reference | Official nomenclature alignment | 1,504 | Web | https://ictv.global/vmr |
| T5 | IMG/VR v4 | 2023 | Uncultivated virus genome repository | >15 million genomes; 6-fold increase over v3 | 446 | Web | https://img.jgi.doe.gov/vr/ |
| Identification and detection | |||||||
| No. | Tool | Year | Approach | Key metric / feature | Citations (Apr 2026) | Availability | URL |
| T6 | VirSorter | 2015 | Hallmark gene search + enrichment metrics | >95% recall on contigs ≥10 kb | 1,182 | GitHub | https://github.com/simroux/VirSorter |
| T7 | MetaPhinder | 2016 | BLAST integration across reference genomes | Sequence-level classification | 84 | GitHub + Web | https://github.com/vanessajurtz/MetaPhinder |
| T8 | VirFinder | 2017 | Logistic regression on k-mer frequencies | 78× higher TPR than VirSorter at 1 kb | 701 | GitHub | https://github.com/jessieren/VirFinder |
| T9 | MARVEL | 2018 | Random Forest on genomic features | Extended RF-based detection | 191 | GitHub | https://github.com/LaboratorioBioinformatica/MARVEL |
| T10 | PPR-Meta | 2019 | Deep learning | Three-way classification (phage/plasmid/chromosome) | 200 | GitHub | https://github.com/zhenchengfang/PPR-Meta |
| T11 | DeepVirFinder | 2020 | Alignment-free CNN classifier | AUROC 0.93–0.98 for 300–3,000 bp | 691 | GitHub | https://github.com/jessieren/DeepVirFinder |
| T12 | Seeker | 2020 | LSTM on raw DNA | Alignment-free; no feature engineering | 137 | GitHub | https://github.com/gussow/seeker |
| T13 | VIBRANT | 2020 | Neural network + protein annotation | Automated recovery, annotation, curation | 1,128 | GitHub | https://github.com/AnantharamanLab/VIBRANT |
| T14 | Kraken2 | 2019 | k-mer exact matching | Precision 0.96 in mock community (Ho et al. benchmark) | 7,218 | GitHub | https://github.com/DerrickWood/kraken2 |
| T15 | VirSorter2 | 2021 | Multi-classifier framework | F1 > 0.8 DNA and RNA virus detection | 1,279 | GitHub | https://github.com/jiarong/VirSorter2 |
| T16 | PhaMer | 2022 | Transformer on protein tokens | 27% F1 improvement on real metagenomic data | 65 | GitHub + Web | https://github.com/KennthShang/PhaMer |
| T17 | INHERIT | 2022 | DNABERT-style transformer | Representation learning for phage genomes | 29 | GitHub | https://github.com/Celestial-Bai/INHERIT |
| T18 | geNomad | 2023 | IGLOO encoder + CRF for proviruses | MCC 95.3%; virus ID + taxonomy + annotation in one framework | 884 | GitHub | https://github.com/apcamargo/genomad |
| Prophage detection | |||||||
| No. | Tool | Year | Approach | Key metric / feature | Citations (Apr 2026) | Availability | URL |
| T19 | PhiSpy | 2012 | Similarity + composition (7 genomic features) | 94% prediction success; 0.66% FPR across 50 genomes | 593 | GitHub + Web | https://github.com/linsalrob/PhiSpy |
| T20 | PHASTER | 2016 | Curated database search (web server) | Widely used prophage web server | 3,855 | Web | https://phaster.ca/ |
| T21 | DEPhT | 2022 | Multimodal approach (3 run modes) | Improved prophage boundary determination | 38 | GitHub | https://github.com/chg60/DEPhT |
| T22 | PHASTEST | 2023 | Updated PHASTER pipeline | 31% faster; higher sensitivity than PHASTER | 554 | Web | https://phastest.ca/ |
| T23 | PhageBoost | 2021 | Machine-learning-driven (evaluates viral genomic architecture) | Detects highly divergent prophages; fast, for high-throughput WGS | 74 | GitHub | https://github.com/ku-cbd/PhageBoost |
| Genome assembly and comparative genomics | |||||||
| No. | Tool | Year | Approach | Key metric / feature | Citations (Apr 2026) | Availability | URL |
| T24 | SPAdes | 2012 | de Bruijn graph assembly | Foundation for metaSPAdes/metaviralSPAdes | 27,729 | GitHub + Web | https://github.com/ablab/spades |
| T25 | ViromeQC | 2019 | Sample-level contamination assessment | Quantifies non-viral contamination in VLP viromes | 115 | GitHub | https://github.com/SegataLab/viromeqc |
| T26 | metaviralSPAdes | 2020 | Virus-specific SPAdes extension | Viral subgraph ID + completeness assessment; also used for contig verification | 320 | GitHub | https://github.com/ablab/spades |
| T27 | VIRIDIC | 2020 | Nucleotide intergenomic similarity | ICTV-recommended algorithm for genus/species demarcation (also used in Taxonomy) | 819 | GitHub + Web | https://github.com/CristinaMoraru/VIRIDIC |
| T28 | CheckV | 2021 | Genome quality assessment | 5 quality tiers; 76,262 reference genomes; adopted by IMG/VR | 1,790 | Bitbucket | https://bitbucket.org/berkeleylab/checkv |
| T29 | clinker | 2021 | Gene cluster comparison visualisation | Automated comparison figures | 1,393 | GitHub | https://github.com/gamcil/clinker |
| T30 | pyGenomeViz | 2022 | Genome comparison visualisation (Python) | Annotated comparative genomics figures | N/A | GitHub | https://github.com/moshi4/pyGenomeViz |
| T31 | viralFlye | 2022 | Long-read viral assembly | Long-read metagenomics support | 32 | GitHub | https://github.com/Dmitry-Antipov/viralFlye |
| T32 | vRhyme | 2022 | Viral contig binning | Coverage + nucleotide composition signals | 113 | GitHub | https://github.com/AnantharamanLab/vRhyme |
| T33 | Phables | 2023 | Flow decomposition on assembly graphs | 49% more high-quality genomes than existing tools | 42 | GitHub | https://github.com/Vini2/phables |
| T34 | COBRA | 2024 | Paired-end extension of incomplete assemblies | Improves completeness and contiguity | 46 | GitHub | https://github.com/linxingchen/cobra |
| Gene annotation and functional prediction | |||||||
| No. | Tool | Year | Approach | Key metric / feature | Citations (Apr 2026) | Availability | URL |
| T35 | Prodigal | 2010 | Prokaryotic gene recognition | Fast; widely used general gene caller | 13,001 | GitHub | https://github.com/hyattpd/Prodigal |
| T36 | pVOGs | 2017 | Prokaryotic virus orthologous groups | Viral marker gene detection; integrated into many pipelines | 414 | Web | https://ftp.ncbi.nlm.nih.gov/pub/kristensen/pVOGs/ |
| T37 | HHpred | 2018 | Profile-profile HMM comparison | Remote homology detection | 2,719 | Web | https://toolkit.tuebingen.mpg.de/tools/hhpred |
| T38 | PHANOTATE | 2019 | Dynamic programming for phage ORFs | Finds genes missed by Prodigal/GeneMarkS/Glimmer; handles overlapping frames | 319 | GitHub | https://github.com/deprekate/PHANOTATE |
| T39 | DRAM-v | 2020 | Metabolic pathway annotation | Identifies auxiliary metabolic genes (AMGs) | 1,097 | GitHub | https://github.com/WrightonLabCSU/DRAM |
| T40 | PhANNs | 2020 | ANN structural protein classifier | F1 = 0.875 across 10 structural classes | 113 | GitHub + Web | https://github.com/Adrian-Cantu/PhANNs |
| T41 | PHROGs | 2021 | HMM protein clustering (38,880 clusters) | 50.6% functional annotation across 17,473 reference viruses | 425 | Web | https://phrogs.lmge.uca.fr/ |
| T42 | MultiPhATE2 | 2021 | Parallel multi-gene-finder pipeline | Runs multiple gene finders simultaneously | 24 | GitHub | https://github.com/carolzhou/multiPhATE2 |
| T43 | Pharokka | 2023 | Integrated pipeline (PHANOTATE + PHROGs) | Standard pipeline; <5 min for 50 kb genome | 519 | GitHub | https://github.com/gbouras13/pharokka |
| T44 | VPF-PLM | 2024 | Protein language model annotation | +29% annotated ocean virome protein families | 95 | GitHub | https://github.com/kellylab/viral-protein-function-plm |
| T45 | ProstT5 | 2024 | Protein sequence → 3Di structural alphabet | Bilingual sequence-structure translation | 364 | GitHub | https://github.com/mheinzinger/ProstT5 |
| T46 | Foldseek | 2024 | Ultrafast structural search | 4-5 orders of magnitude faster than Dali/TM-align at comparable sensitivity | 2,365 | GitHub + Web | https://github.com/steineggerlab/foldseek |
| T47 | Empathi | 2025 | Hierarchical protein embeddings | 2× on environmental viromes (EnVhogDB); 3× on cultured phages | 6 | HuggingFace | https://huggingface.co/AlexandreBoulay/EmPATHi |
| T48 | GOPhage | 2025 | Genomic context + transformer embeddings | +6.78% accuracy on divergent proteins | 9 | GitHub | https://github.com/jiaojiaoguan/GOPhage |
| T49 | Phold | 2026 | Structure-informed (ProstT5 → Foldseek) | >50% gene annotation vs ~35% homology-only | 30 | GitHub | https://github.com/gbouras13/phold |
| Taxonomy and classification | |||||||
| No. | Tool | Year | Approach | Key metric / feature | Citations (Apr 2026) | Availability | URL |
| T50 | VICTOR | 2017 | Genome-BLAST Distance Phylogeny | Automated species/genus demarcation | 699 | Web | https://victor.dsmz.de/ |
| T51 | ViPTree | 2017 | Genome-wide tBLASTx proteomic trees | Viral proteomic tree server | 1,004 | Web | https://www.genome.jp/viptree/ |
| T52 | vConTACT2 | 2019 | Gene-sharing networks | 96% genus-level ICTV agreement (pre-2022 framework) | 990 | Bitbucket | https://bitbucket.org/MAVERICLab/vcontact2 |
| T53 | PhaGCN | 2021 | Graph convolutional network + CNN | Semi-supervised classification | 143 | GitHub | https://github.com/KennthShang/PhaGCN |
| T54 | PhaGCN2 | 2023 | Extended PhaGCN (DNA + RNA viruses) | 89.30% recall; 83.91% precision; applied to GPD and GOV2.0 datasets | 112 | GitHub + Web | https://github.com/KennthShang/PhaGCN2.0 |
| T55 | PhaBOX | 2023 | Integrated platform (PhaMer + PhaGCN + CHERRY + PhaTYP) | Unified ID + taxonomy from metagenomes (also used in Host prediction) | 117 | GitHub + Web | https://github.com/KennthShang/PhaBOX |
| T56 | GRAViTy-V2 | 2024 | Composite generalised Jaccard distances | Genome relationship analysis | 10 | GitHub | https://github.com/Mayne941/gravity2 |
| T57 | taxMyPhage | 2025 | Automated genus/species classification | Aligned with current ICTV revisions; dsDNA phages | 76 | GitHub | https://github.com/amillard/tax_myPHAGE |
| T58 | vConTACT3 | 2025 | ML-based hierarchical classification | >95% ICTV agreement (97.6% genus, 98.7% subfamily, 100% family/order) | 6 | Bitbucket | https://bitbucket.org/MAVERICLab/vcontact3 |
| Lifestyle prediction | |||||||
| No. | Tool | Year | Approach | Key metric / feature | Citations (Apr 2026) | Availability | URL |
| T59 | PHACTS | 2012 | Random Forest on protein similarity | 99% precision (confident predictions); 88% overall sensitivity | 305 | GitHub | https://github.com/deprekate/PHACTS |
| T60 | BACPHLIP | 2021 | HMM domains + Random Forest | 98.3% accuracy on 423 independent test phages | 294 | GitHub | https://github.com/adamhockenberry/bacphlip |
| T61 | DeePhage | 2021 | CNN on one-hot encoded DNA | 89% accuracy; classifies contigs ≥100 bp | 113 | GitHub | https://github.com/shufangwu/DeePhage |
| T62 | PhaTYP | 2023 | BERT pre-trained + fine-tuned | Outperforms prior methods on short contigs | 206 | GitHub + Web | https://github.com/KennthShang/PhaTYP |
| T63 | DeepPL | 2024 | NLP on nucleotide sequences | 94.65% accuracy | 10 | GitHub | https://github.com/Wu-Microbiology/DeepPL |
| T64 | ProkBERT PhaStyle | 2025 | Genomic language models (21–26M params) | BA 0.88–0.93; MCC 0.75–0.86 on 500 bp fragments | 1 | GitHub | https://github.com/nbrg-ppcu/PhaStyle |
| Anti-phage defense system detection | |||||||
| No. | Tool | Year | Approach | Key metric / feature | Citations (Apr 2026) | Availability | URL |
| T65 | CRISPRDetect | 2016 | Array detection + boundary refinement | CRISPR array identification | 393 | GitHub + Web | https://github.com/davidchyou/CRISPRDetect_2.4 |
| T66 | CRISPRCasFinder | 2018 | Integrated array + Cas protein ID | Combined array and Cas detection | 1,561 | GitHub + Web | https://github.com/dcouvin/CRISPRCasFinder |
| T67 | AcrFinder | 2020 | Homology + guilt-by-association + self-targeting spacers | Anti-CRISPR operon mining | 80 | GitHub + Web | https://github.com/HaidYi/acrfinder |
| T68 | AcRanker | 2020 | XGBoost ranking | Identified AcrIIA20 and AcrIIA21 | 119 | GitHub | https://github.com/amina01/AcRanker |
| T69 | CRISPRCasTyper | 2020 | Automated subtype classification | 98.6% accuracy | 303 | GitHub + Web | https://github.com/Russel88/CRISPRCasTyper |
| T70 | PADLOC | 2021/22 | HMM + system completeness validation | Web server; customisable classifications | 253 | GitHub + Web | https://github.com/padlocbio/padloc |
| T71 | DefenseFinder | 2022 | HMM profiles + MacSyFinder rule engine | 60 families; 151 subtypes across 21,000 genomes | 775 | GitHub + Web | https://github.com/mdmparis/defense-finder |
| T72 | AcrNET | 2023 | Deep learning anti-CRISPR prediction | Beyond homology-based methods | 24 | GitHub | https://github.com/banma12956/AcrNET |
| T73 | AcaFinder | 2023 | Aca gene detection | Independent signal for novel anti-CRISPR loci | 22 | GitHub | https://github.com/boweny920/AcaFinder |
| T74 | DefensePredictor | 2025 | Protein language model embeddings | 45 novel defense systems validated across 69 *E. coli* strains | N/A | GitHub | https://github.com/Alextianyf/DefensePredictor |
| Host prediction | |||||||
| No. | Tool | Year | Approach | Key metric / feature | Citations (Apr 2026) | Availability | URL |
| T75 | HostPhinder | 2016 | k-mer similarity to reference DB | First dedicated host prediction tool | 193 | GitHub + Web | https://github.com/julvi/HostPhinder |
| T76 | WIsH | 2017 | Markov models on host genomes | Up to 63% genus accuracy; 100× faster than alignment | 317 | GitHub | https://github.com/soedinglab/WIsH |
| T77 | RaFAH | 2021 | Random Forest on 43,644 protein clusters | Consistent across RefSeq/SAG/metagenomic benchmarks | 145 | GitHub | https://github.com/felipehcoutinho/RaFAH |
| T78 | HostG | 2021 | Graph convolutional network (semi-supervised) | GCN-based host prediction | 59 | GitHub | https://github.com/KennthShang/HostG |
| T79 | SpacePHARER | 2021 | Protein-level CRISPR spacer matching | 1.4–4× sensitivity over BLASTN at metagenomic scale | 109 | GitHub | https://github.com/soedinglab/spacepharer |
| T80 | CHERRY | 2022 | Knowledge graph + graph convolutional encoder | Improved species-level accuracy | 105 | GitHub + Web | https://github.com/KennthShang/CHERRY |
| T81 | PHIST | 2022 | k-mer-based alignment-free | +3–20 pp species accuracy; laptop-scale runtime | 59 | GitHub | https://github.com/refresh-bio/PHIST |
| T82 | PHISDetector | 2022 | Unified CRISPR/prophage/similarity platform | Single platform for multiple interaction signals | 52 | GitHub + Web | https://github.com/HIT-ImmunologyLab/PHISDetector |
| T83 | vHULK | 2022 | Neural network on viral protein family scores | Annotated genomic features input | 36 | GitHub | https://github.com/LaboratorioBioinformatica/vHULK |
| T84 | iPHoP | 2023 | Integrated ensemble (homology + CRISPR + k-mer + ML) | 1.5–13× more predictions at equivalent FDR; >300 GB database | 413 | Bitbucket | https://bitbucket.org/srouxjgi/iphop |
| T85 | PhageHostLearn | 2024 | ESM-2 embeddings of RBP + receptor sequences | ROC AUC 81.8%; strain-level for *Klebsiella* | 60 | GitHub | https://github.com/dimiboeckaerts/PhageHostLearn |
| T86 | PHPGAT | 2025 | Graph attention on heterogeneous knowledge graphs | Multimodal phage-host interaction prediction | 14 | GitHub | https://github.com/ZhaoZMer/PHPGAT |
| T87 | PHIStruct | 2025 | SaProt protein structure embeddings | +7–9% over sequence-only for divergent phages | 20 | GitHub | https://github.com/bioinfodlsu/PHIStruct |
| T88 | MoEPH | 2025 | Gated Mixture-of-Experts (transformer + statistical) | Combines PLM embeddings with statistical descriptors | 0 | N/A | N/A |
| T89 | RBPseg | 2025 | ESMFold + structural domain ID | First large-scale phage tail fibre structure atlas | 3 | GitHub | https://github.com/VKleinSousa/RBPseg |
| Integrated pipelines and structural resources | |||||||
| No. | Tool | Year | Description | Key feature | Citations (Apr 2026) | Availability | URL |
| T90 | BFVD | 2025 | Big Fantastic Virus Database | 351,242 predicted viral protein structures; >62% novel | 78 | GitHub + Web | https://bfvd.foldseek.com/ |
| T91 | Sphae | 2025 | Snakemake workflow wrapping 12 tools | End-to-end processing in <10 min | 14 | GitHub | https://github.com/linsalrob/sphae |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
