Submitted: 02 July 2024
Posted: 02 July 2024
Abstract
Keywords:
1. Introduction
2. Material and Methods
2.1. Preprocessing of Input Data
Input from RefSeq
Polyproteins
2.2. Creation of the First Layer Clusters - VOGs
Functional annotation
Naming
2.3. Creation of the Second Layer Clusters - VFAMs
Clustering using MCL
2.4. Creation of the Third Layer Clusters - VFOLDs
2.5. Quality Assessment of the Clustering Results
3. Results
Database
Content
3.1. Quality Assessment
Comparison with similar databases
3.2. Availability
Webpage
Available files
4. Discussion and Application Examples
Limitations
Support for bioinformatic workflows
Usage for metagenome analysis
5. Conclusions





| Layer | Strict | Medium | Low |
|---|---|---|---|
| vOG | 38562 | 45613 | 48627 |
| vFAM | 28500 | 32546 | 33951 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).