Submitted: 19 April 2024
Posted: 23 April 2024
Abstract
Keywords:
Introduction
Operating System Requirements
Conventions
Background Knowledge
Protocol 1: Installing PhyKIT and Syntax for Usage
Installing PhyKIT
- Install using PIP (Preferred Installer Program)
- # install
- $ pip install phykit
- Install using Conda
- # install
- $ conda install bioconda::phykit
- Install from source
- # download
- $ git clone https://github.com/JLSteenwyk/PhyKIT.git
- $ cd PhyKIT/
- # install
- $ make install
- Install in a virtual environment using PIP
- # create a virtual environment
- $ python -m venv venv_phykit
- # activate the virtual environment
- $ source venv_phykit/bin/activate
- # install
- $ pip install phykit
- # deactivate the virtual environment when you are done using PhyKIT
- $ deactivate
- Install in a virtual environment using Conda
- # create a virtual environment
- $ conda create -n venv_phykit
- # activate the virtual environment
- $ conda activate venv_phykit
- # install
- $ conda install -n venv_phykit bioconda::phykit
- # deactivate environment when you are done using PhyKIT
- $ conda deactivate
- Update installation using PIP
- # The “-U” is short for “--upgrade”
- $ pip install phykit -U
- Update installation using Conda
- # This is the same command used for installation
- $ conda install -n venv_phykit bioconda::phykit
- Activate environment before using PhyKIT
- If PhyKIT has been installed in an environment, it must be activated when using PhyKIT.
- # If installed using PIP activate the environment
- $ source venv_phykit/bin/activate
- # If installed using Conda, activate the environment
- $ conda activate venv_phykit
- The PhyKIT help menu
- $ phykit -h
- alignment_length: calculates alignment length;
- gc_content: calculates GC content of a nucleotide FASTA file or of individual entries thereof;
- pairwise_identity: calculates average pairwise identity among sequences in an alignment file. This is a proxy for evolutionary rate;
- relative_composition_variability: calculates relative composition variability in an alignment; and
- a complete list of alignment-based functions, including detailed explanations of each, is available here: https://jlsteenwyk.com/PhyKIT/usage/index.html#alignment-based-functions
- bipartition_support_stats: calculates summary statistics for bipartition support;
- degree_of_violation_of_a_molecular_clock: reports the degree of violation of the molecular clock;
- evolutionary_rate: reports a tree-based estimation of evolutionary rate for a gene;
- prune_tree: prunes taxa from a phylogeny; and
- a complete list of tree-based functions, including detailed explanations of each, is available here: https://jlsteenwyk.com/PhyKIT/usage/index.html#tree-based-functions
- saturation: calculates saturation by examining the slope of patristic distances and uncorrected distances;
- treeness_over_rcv: calculates treeness divided by relative composition variability (rcv), treeness, and rcv; and
- a complete list of alignment- and tree-based functions, including detailed explanations of each, is available here: https://jlsteenwyk.com/PhyKIT/usage/index.html#alignment-and-tree-based-functions
- # Call the help message of a specific function
- $ phykit alignment_length -h
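To make the gc_content description concrete: GC content is the share of G and C characters among the nucleotides in a file. The following sketch reproduces that arithmetic with awk on a made-up two-sequence file (toy.fa is hypothetical). It counts only unambiguous A, C, G, and T characters, which is an assumption and may not match how PhyKIT treats gaps or ambiguity codes, so it is best used as a rough cross-check only.

```shell
# Toy nucleotide FASTA (hypothetical data)
workdir=$(mktemp -d)
cat > "$workdir/toy.fa" <<'EOF'
>seq1
ATGC
>seq2
GGCC
EOF

# Count G/C and A/C/G/T characters on sequence lines only.
# gsub() with "&" replaces each match by itself and returns the match count.
gc=$(awk '!/^>/ {
        seq = toupper($0)
        total += gsub(/[ACGT]/, "&", seq)
        gc    += gsub(/[GC]/,   "&", seq)
     }
     END { if (total > 0) printf "%.4f", gc / total }' "$workdir/toy.fa")
echo "GC content: $gc"
```

In the toy file, 6 of the 8 nucleotides are G or C, so the sketch reports 0.7500.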
PhyKIT Syntax
- # Description of PhyKIT syntax
- $ phykit <command> <arguments> [optional arguments]
- # Note, optional arguments will always have square brackets around them
- # Calculate alignment length
- $ phykit alignment_length input.fa
- # Calculate alignment length using the aln_len alias
- $ phykit aln_len input.fa
- # Calculate alignment length using the al alias
- $ phykit al input.fa
- # Calculate alignment length using the full function name
- $ pk_alignment_length input.fa
- # Calculate alignment length using the aln_len alias
- $ pk_aln_len input.fa
- # Calculate alignment length using the al alias
- $ pk_al input.fa
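Because every function shares this syntax, a command can be repeated over many files with an ordinary shell loop. The sketch below is one way to tabulate alignment lengths. The directory, the toy alignments, and the output file alignment_lengths.tsv are hypothetical, and the loop writes NA when phykit is not found on the PATH.

```shell
# Hypothetical directory holding two toy alignments
aln_dir=$(mktemp -d)
printf '>a\nATG-\n>b\nATGC\n' > "$aln_dir/gene1.fa"
printf '>a\nATGCAA\n>b\nATGCAT\n' > "$aln_dir/gene2.fa"

# One row per alignment: file name, then alignment length
out="$aln_dir/alignment_lengths.tsv"
: > "$out"
for aln in "$aln_dir"/*.fa; do
    if command -v phykit >/dev/null 2>&1; then
        len=$(phykit alignment_length "$aln")
    else
        len="NA"   # PhyKIT is not installed or its environment is not active
    fi
    printf '%s\t%s\n' "$(basename "$aln")" "$len" >> "$out"
done
cat "$out"
```

The same loop works with any alias (for example, aln_len or al) in place of the full function name.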
Summary Statistics and the Verbose Option
Requesting New Functions
Protocol 2: Constructing a Phylogenomic Supermatrix
Data Acquisition
Orthology Inference
Multiple Sequence Alignment and Trimming
- # Alignment with MAFFT
- $ mafft --auto input.fa > output.fa
- # Alignment with MUSCLE5
- $ muscle5 -align input.fa -output output.fa
- # Alignment with Clustal-Omega
- $ clustalo -i input.fa -o output.fa
- # Trimming with ClipKIT
- $ clipkit input.fa -o output.fa
- # Trimming with trimAl
- $ trimal -in input.fa -out output.fa -gappyout
- # Thread nucleotide sequences onto a protein alignment
- $ pk_thread_dna -p protein_alignment.faa -n nucleotide_sequences.fna
- # Trim the protein alignment with ClipKIT and write a log file
- $ clipkit output.fa -o output.fa --log
- # Thread nucleotide sequences onto a trimmed protein alignment
- $ pk_thread_dna -p protein_alignment.faa -n nucleotide_sequences.fna -l clipkit.log
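Threading presumes that each nucleotide sequence encodes the protein sequence of the same name. A quick sanity check before running pk_thread_dna is therefore to confirm that every nucleotide sequence is three times as long as the corresponding ungapped protein sequence. The sketch below performs that check with awk on toy files. The sequences are made up, sp2 is deliberately one codon short, and terminal stop codons are ignored, which is a simplifying assumption.

```shell
workdir=$(mktemp -d)
printf '>sp1\nMK-L\n>sp2\nMKAL\n' > "$workdir/protein_alignment.faa"
printf '>sp1\nATGAAACTG\n>sp2\nATGAAAGCA\n' > "$workdir/nucleotide_sequences.fna"

# Report "<id> <residues> <nucleotides> <OK|MISMATCH>" for every protein entry
report=$(awk '
    /^>/ { id = substr($1, 2); next }
    FNR == NR { gsub(/[-?]/, ""); prot[id] += length($0); next }  # first file: ungapped residues
              { nuc[id] += length($0) }                           # second file: nucleotides
    END {
        for (id in prot)
            print id, prot[id], nuc[id] + 0, (nuc[id] == 3 * prot[id] ? "OK" : "MISMATCH")
    }' "$workdir/protein_alignment.faa" "$workdir/nucleotide_sequences.fna" | sort)
echo "$report"
```

Entries reported as MISMATCH are worth inspecting before threading, for example for partial sequences or names that differ between the two files.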
- # Recode alignments
- $ pk_aln_recoding input.fa -c <recoding scheme>
- # <recoding scheme> can either be one of the eight available
- # coding schemes or a file that has the custom coding scheme
- # Specify custom recoding scheme
- $ cat custom_recoding_scheme.txt
- A
- G
- T
- C
Constructing a Concatenated Supermatrix
- # Create concatenation matrix
- $ pk_create_concat -a alignment_list.txt -p output_prefix
- # First five lines of the alignment_list.txt file
- $ head -n 5 alignment_list.txt
- Alignment0.fa
- Alignment1.fa
- Alignment2.fa
- Alignment3.fa
- Alignment4.fa
- output_prefix.fa: the concatenated supermatrix
- output_prefix.partition: a description of partition boundaries in RAxML-style format
- output_prefix.occupancy: a description of taxon occupancy per partition, including a detailed list of which taxa are present or missing
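The list of alignments can be generated rather than typed by hand. A minimal sketch, assuming all alignments sit in one directory and share the Alignment*.fa naming shown above (the directory and toy sequences are hypothetical):

```shell
# Hypothetical directory with three toy alignments
aln_dir=$(mktemp -d)
for i in 0 1 2; do
    printf '>speciesA\nATG\n>speciesB\nATG\n' > "$aln_dir/Alignment$i.fa"
done

# Work inside the alignment directory so the list holds bare file names
cd "$aln_dir" || exit 1
printf '%s\n' Alignment*.fa | sort > alignment_list.txt

# Stop early if any listed file is missing or empty
while IFS= read -r aln; do
    [ -s "$aln" ] || { echo "missing or empty: $aln" >&2; exit 1; }
done < alignment_list.txt

n_alignments=$(wc -l < alignment_list.txt | tr -d ' ')
echo "$n_alignments alignments listed"
```

The resulting alignment_list.txt can then be passed to pk_create_concat with the -a argument.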
- # First five lines of the idmap file for renaming FASTA entries
- $ head -n 5 idmap.txt
- speciesA|gene043 speciesA
- speciesB|gene367 speciesB
- speciesC|gene589 speciesC
- speciesD|gene251 speciesD
- speciesE|gene417 speciesE
- # Renaming FASTA entries
- $ pk_rename_fasta input.fa -i idmap.txt [-o output.fa]
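When FASTA names follow a regular pattern, the idmap file can be derived from the headers themselves. The sketch below assumes headers of the form species|gene without spaces, as in the example above, and writes the two columns separated by a single space. Both the toy sequences and the delimiter are assumptions to adapt to the data at hand.

```shell
workdir=$(mktemp -d)
printf '>speciesA|gene043\nMKL\n>speciesB|gene367\nMKV\n' > "$workdir/input.fa"

# Column 1: current FASTA name; column 2: text before the first "|"
grep '^>' "$workdir/input.fa" \
    | sed 's/^>//' \
    | awk -F'|' '{ print $0, $1 }' > "$workdir/idmap.txt"
cat "$workdir/idmap.txt"

# New names should be unique, otherwise entries would collide after renaming
dups=$(cut -d' ' -f2 "$workdir/idmap.txt" | sort | uniq -d)
[ -z "$dups" ] || echo "duplicate new names: $dups" >&2
```

A duplicate in the second column usually means that a species contributes more than one sequence to the file, which is worth resolving before concatenation.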
Constructing a Dataset for Two-Step Coalescence
- # ModelTest-NG
- $ modeltest-ng -i input.fa -d aa
- # IQ-TREE
- $ iqtree -s input.fa -m MF
- # RAxML-NG
- $ raxml-ng --msa prot21.fa --model LG+G4 --prefix output_prefix --bs-trees 100
- # IQ-TREE
- $ iqtree -s example.phy -m LG+G4 -bb 1000
- # Note: LG+G4 should be replaced by the best-fitting substitution model.
- # Collapse branches with bipartition support values less than 75
- $ pk_collapse input.tre -s 75 [-o output.tre]
- # First five lines of the idmap file for renaming tree tips
- $ head -n 5 idmap.txt
- speciesA|gene043 speciesA
- speciesB|gene367 speciesB
- speciesC|gene589 speciesC
- speciesD|gene251 speciesD
- speciesE|gene417 speciesE
- # Renaming tips in a phylogenetic tree
- $ pk_rename_tree input.tre -i idmap.txt [-o output.tre]
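Summary methods such as ASTRAL (Mirarab et al., 2014) read all single-gene trees from one file, with one Newick tree per line. A minimal sketch for gathering the collapsed and renamed trees, assuming they are stored as *.tre files in one directory (the toy trees and the output name gene_trees.nwk are hypothetical):

```shell
tree_dir=$(mktemp -d)
echo '((A:0.1,B:0.2)90:0.05,(C:0.1,D:0.3)60:0.02);' > "$tree_dir/gene1.tre"
echo '((A:0.2,C:0.1)80:0.04,(B:0.1,D:0.2)95:0.03);' > "$tree_dir/gene2.tre"

# One Newick string per line; tr removes any line breaks inside a tree file
gene_trees="$tree_dir/gene_trees.nwk"
: > "$gene_trees"
for tre in "$tree_dir"/*.tre; do
    tr -d '\n\r' < "$tre" >> "$gene_trees"
    echo >> "$gene_trees"
done

# Every line should end with the Newick terminator ";"
n_trees=$(grep -c ';$' "$gene_trees")
echo "$n_trees gene trees collected"
```

If the number of collected trees differs from the number of input files, one of the tree files is likely empty or truncated.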
Protocol 3: Detecting Anomalies in Orthology Relationships
Hidden Paralogy and Clan Check
- # Monophyly check
- $ pk_monophyly_check input.tre list_of_taxa.txt
- # The format of a clade file
- $ cat clades.txt
- T6 T7 T8
- T10 T11 T12
- # Clan check
- $ pk_hidden_paralogy_check input.tre -c clades.txt
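Taxa listed in clades.txt that are missing from a gene tree cannot contribute to the check, so it can help to find out beforehand which names are absent. The following sketch extracts tip labels from a Newick file with standard text tools and compares them with the clade file. It assumes unquoted labels without spaces, parentheses, commas, or colons, and the toy tree deliberately lacks T12.

```shell
workdir=$(mktemp -d)
echo '((T6:0.1,T7:0.2)95:0.05,(T8:0.1,(T10:0.1,T11:0.3)70:0.02)88:0.01);' > "$workdir/input.tre"
printf 'T6 T7 T8\nT10 T11 T12\n' > "$workdir/clades.txt"

# Tip labels: drop internal labels and lengths after ")", split on "(" "," ";",
# then strip branch lengths
tr -d ' \n\r' < "$workdir/input.tre" \
    | sed 's/)[^,();]*//g' \
    | tr '(,;' '\n\n\n' \
    | sed 's/:.*//' \
    | grep -v '^$' | sort -u > "$workdir/tips.txt"

# Taxa named in clades.txt, one per line
tr -s ' \t' '\n\n' < "$workdir/clades.txt" | grep -v '^$' | sort -u > "$workdir/clade_taxa.txt"

# Taxa requested in a clade but absent from the tree
missing=$(comm -23 "$workdir/clade_taxa.txt" "$workdir/tips.txt")
echo "absent from tree: ${missing:-none}"
```

For the toy inputs the sketch reports T12, the only taxon in clades.txt that does not occur in the tree.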
Spurious Homolog Detection
- # Identify putatively spurious homologs/orthologs
- $ pk_spurious_seq input.tre [-f 20]
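The optional -f argument sets a multiplier. Assuming, as an illustration rather than a statement of PhyKIT's exact implementation, that a sequence is flagged when its terminal branch is at least f times the median of all branch lengths, the arithmetic can be sketched as follows. The toy tree is hypothetical and tip E carries a deliberately long branch.

```shell
workdir=$(mktemp -d)
echo '((A:0.10,B:0.12)90:0.05,(C:0.11,(D:0.09,E:3.00)70:0.04)88:0.06);' > "$workdir/input.tre"
factor=20

# Median of all branch lengths (terminal and internal)
median=$(tr -d ' \n\r' < "$workdir/input.tre" | tr '(),;' '\n\n\n\n' \
    | sed -n 's/.*:\(.*\)/\1/p' | LC_ALL=C sort -n \
    | awk '{ v[NR] = $1 }
           END { if (NR % 2) print v[(NR + 1) / 2]
                 else print (v[NR / 2] + v[NR / 2 + 1]) / 2 }')

# Terminal branches only: strip internal labels after ")", then test "tip:length" pairs
flagged=$(tr -d ' \n\r' < "$workdir/input.tre" | sed 's/)[^,();]*//g' | tr '(,;' '\n\n\n' \
    | awk -F: -v m="$median" -v f="$factor" 'NF == 2 && $2 >= f * m { print $1 }')
echo "median=$median flagged=${flagged:-none}"
```

Here the median branch length is 0.095, so the threshold at a factor of 20 is 1.9 and only tip E (branch length 3.00) exceeds it.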
Protocol 4: Quantifying Biases in Phylogenomic Data Matrices and Related Measures
Measuring Bias at the Level of Taxa
Phylogenomic Subsampling Using the Information Content of Alignments
Phylogenomic Subsampling Using the Information Content in Phylogenetic Trees
Combining the Information Content in Alignments and Trees for Phylogenomic Subsampling
Subsampling for Time Tree Analysis
Measuring Bias at the Level of Sites
Protocol 5: Identifying Polytomies
Protocol 6: Gene-Gene Coevolution as a Genetic Screen
Commentary
Related Tools
- ClipKIT, an alignment trimming software (Steenwyk et al., 2020);
  - Documentation: https://jlsteenwyk.com/ClipKIT/
  - Source code: https://github.com/JLSteenwyk/ClipKIT
- BioKIT, a broadly applicable toolkit for broad genomic analysis (Steenwyk et al., 2022a);
  - Documentation: https://jlsteenwyk.com/BioKIT/
  - Source code: https://github.com/JLSteenwyk/BioKIT
- OrthoSNAP, an algorithm to identify single-copy orthologs nested within larger multi-copy gene families (Steenwyk et al., 2022b);
  - Documentation: https://jlsteenwyk.com/orthosnap/
  - Source code: https://github.com/JLSteenwyk/orthosnap
- orthofisher, software for sequence similarity search using Hidden Markov Models (Steenwyk and Rokas, 2021);
  - Documentation: https://jlsteenwyk.com/orthofisher/
  - Source code: https://github.com/JLSteenwyk/orthofisher
- treehouse, a graphical user interface tool for pruning large phylogenies (Steenwyk and Rokas, 2019);
  - Documentation and source code: https://github.com/JLSteenwyk/treehouse
- ggpubfigs, a ggplot2 extension (https://ggplot2.tidyverse.org/) for making colorblind-friendly and publication-quality scientific figures.
  - Documentation and source code: https://github.com/JLSteenwyk/ggpubfigs
Troubleshooting
Future Directions
Acknowledgements
Glossary
| Bootstrap replicates | In the context of phylogenetics, each replicate is a resampling (with replacement) of sites from the full alignment to generate an alignment of equal size; these replicates are then used to reinfer a phylogeny and evaluate support for the phylogeny inferred using the full alignment. |
| Concatenation | The phylogenomic method of combining sequences from multiple loci into a single sequence for each species and using the resulting supermatrix for species tree inference. |
| Hidden paralogy | Asymmetric loss of paralogs in some lineages, leading to mistaken identification of paralogs as orthologs. |
| Long branch attraction | A phylogenetic artifact where rapidly evolving taxa/lineages are erroneously inferred to be closely related. |
| Multispecies coalescence | The phylogenomic method of using single-locus phylogenies, which may differ from each other, to infer a species tree. |
| Orthologs or orthologous genes | Genes in different species that originated from a common ancestor by speciation. |
| Orthology inference | Identifying genes among organisms that evolved from a common ancestral gene. |
| Paralogs or paralogous genes | Genes that are related by duplication. |
| Phylogenomic subsampling | The process of selecting a subset of a complete phylogenomic data matrix to reconstruct phylogenetic trees, often aiming to reduce noise and improve signal or evaluating the stability of the inferred phylogeny. |
| Radiation events | Rapid speciation events that result in a succession of short internal branches in a phylogeny. |
| Single-copy orthologs | Genes that are present as a single copy in the genome across a set of taxa and that originated from speciation events. |
| Spurious ortholog inference | Incorrect identification of genes as orthologous, often due to errors in sequence analysis or interpretation. |
References
- Aberer, A. J., Krompass, D., and Stamatakis, A. 2013. Pruning Rogue Taxa Improves Phylogenetic Accuracy: An Efficient Algorithm and Webservice. Systematic Biology 62:162–166.
- Behnel, S., Bradshaw, R., Citro, C., Dalcin, L., Seljebotn, D. S., and Smith, K. 2011. Cython: The Best of Both Worlds. Computing in Science & Engineering 13:31–39.
- Bergsten, J. 2005. A review of long-branch attraction. Cladistics 21:163–193.
- Bjornson, S., Upham, N., Verbruggen, H., and Steenwyk, J. 2023. Phylogenomic Inference, Divergence-Time Calibration, and Methods for Characterizing Reticulate Evolution. Biology and Life Sciences. Available at: https://www.preprints.org/manuscript/202309.0905/v1 [Accessed September 25, 2023].
- Brunette, G. J., Jamalruddin, M. A., Baldock, R. A., Clark, N. L., and Bernstein, K. A. 2019. Evolution-based screening enables genome-wide prioritization and discovery of DNA repair genes. Proceedings of the National Academy of Sciences 116:19593–19599.
- Capella-Gutiérrez, S., Silla-Martínez, J. M., and Gabaldón, T. 2009. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972–1973.
- Caurcel, C., Laetsch, D. R., Challis, R., Kumar, S., Gharbi, K., and Blaxter, M. 2021. MolluscDB: a genome and transcriptome database for molluscs. Philosophical Transactions of the Royal Society B: Biological Sciences 376:20200157.
- Chen, M.-Y., Liang, D., and Zhang, P. 2017. Phylogenomic Resolution of the Phylogeny of Laurasiatherian Mammals: Exploring Phylogenetic Signals within Coding and Noncoding Sequences. Genome Biology and Evolution 9:1998–2012.
- Chen, W., Lee, M.-K., Jefcoate, C., Kim, S.-C., Chen, F., and Yu, J.-H. 2014. Fungal Cytochrome P450 Monooxygenases: Their Distribution, Structure, Functions, Family Expansion, and Evolutionary Origin. Genome Biology and Evolution 6:1620–1634.
- Clark, N. L., Alani, E., and Aquadro, C. F. 2012. Evolutionary rate covariation reveals shared functionality and coexpression of genes. Genome Research 22:714–720.
- Cock, P. J. A., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., et al. 2009. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25:1422–1423.
- Coombe, L., Warren, R. L., Wong, J., Nikolic, V., and Birol, I. 2023. ntLink: A Toolkit for De Novo Genome Assembly Scaffolding and Mapping Using Long Reads. Current Protocols 3:e733.
- Darriba, D., Posada, D., Kozlov, A. M., Stamatakis, A., Morel, B., and Flouri, T. 2020. ModelTest-NG: A New and Scalable Tool for the Selection of DNA and Protein Evolutionary Models. Molecular Biology and Evolution 37:291–294.
- Darriba, D., Taboada, G. L., Doallo, R., and Posada, D. 2012. jModelTest 2: more models, new heuristics and parallel computing. Nature Methods 9:772–772.
- Darriba, D., Taboada, G. L., Doallo, R., and Posada, D. 2011. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics 27:1164–1165.
- Delsuc, F., Brinkmann, H., and Philippe, H. 2005. Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics 6:361–375.
- Edgar, R. C. 2022. Muscle5: High-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny. Nature Communications 13:6968.
- Edwards, S. V. 2016. Phylogenomic subsampling: a brief review. Zoologica Scripta 45:63–74.
- Eisen, J. A. 1998. Phylogenomics: Improving Functional Predictions for Uncharacterized Genes by Evolutionary Analysis. Genome Research 8:163–167.
- Embley, M., Der Giezen, M. V., Horner, D. S., Dyal, P. L., and Foster, P. 2003. Mitochondria and hydrogenosomes are two forms of the same fundamental organelle. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 358:191–203.
- Eme, L., Tamarit, D., Caceres, E. F., Stairs, C. W., De Anda, V., Schön, M. E., Seitz, K. W., Dombrowski, N., Lewis, W. H., Homa, F., et al. 2023. Inference and reconstruction of the heimdallarchaeial ancestry of eukaryotes. Nature 618:992–999.
- Emms, D. M., and Kelly, S. 2019. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology 20:238.
- Fernández, R., Gabaldón, T., and Dessimoz, C. 2019. Orthology: definitions, inference, and impact on species phylogeny inference. Available at: https://arxiv.org/abs/1903.04530 [Accessed May 25, 2023].
- Fernández, R., Tonzo, V., Simón Guerrero, C., Lozano-Fernandez, J., Martínez-Redondo, G. I., Balart-García, P., Aristide, L., Eleftheriadi, K., and Vargas-Chávez, C. 2022. MATEdb, a data repository of high-quality metazoan transcriptome assemblies to accelerate phylogenomic studies. Peer Community Journal 2:e58.
- Foster, P. G., Schrempf, D., Szöllősi, G. J., Williams, T. A., Cox, C. J., and Embley, T. M. 2022. Recoding Amino Acids to a Reduced Alphabet may Increase or Decrease Phylogenetic Accuracy. Systematic Biology:syac042.
- Gatesy, J., Meredith, R. W., Janecka, J. E., Simmons, M. P., Murphy, W. J., and Springer, M. S. 2017. Resolution of a concatenation/coalescence kerfuffle: partitioned coalescence support and a robust family-level tree for Mammalia. Cladistics 33:295–332.
- Giacomelli, M., Rossi, M. E., Lozano-Fernandez, J., Feuda, R., and Pisani, D. 2022. Resolving tricky nodes in the tree of life through amino acid recoding. iScience 25:105594.
- Green, R. E., Braun, E. L., Armstrong, J., Earl, D., Nguyen, N., Hickey, G., Vandewege, M. W., St. John, J. A., Capella-Gutiérrez, S., Castoe, T. A., et al. 2014. Three crocodilian genomes reveal ancestral patterns of evolution among archosaurs. Science 346:1254449.
- Han, Y., and Molloy, E. K. 2023. Improving quartet graph construction for scalable and accurate species tree estimation from gene trees. Genome Research:genome;gr.277629.122v2.
- Harris, C. R., Millman, K. J., Van Der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., et al. 2020. Array programming with NumPy. Nature 585:357–362.
- Hernandez, A. M., and Ryan, J. F. 2021. Six-State Amino Acid Recoding is not an Effective Strategy to Offset Compositional Heterogeneity and Saturation in Phylogenetic Analyses. Systematic Biology 70:1200–1212.
- Jarvis, E. D., Mirarab, S., Aberer, A. J., Li, B., Houde, P., Li, C., Ho, S. Y. W., Faircloth, B. C., Nabholz, B., Howard, J. T., et al. 2014. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346:1320–1331.
- Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., Von Haeseler, A., and Jermiin, L. S. 2017. ModelFinder: fast model selection for accurate phylogenetic estimates. Nature Methods 14:587–589.
- Kapli, P., Yang, Z., and Telford, M. J. 2020. Phylogenetic tree building in the genomic age. Nature Reviews Genetics 21:428–444.
- Katoh, K., and Standley, D. M. 2013. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution 30:772–780.
- Kosiol, C., Goldman, N., and H. Buttimore, N. 2004. A new criterion and method for amino acid classification. Journal of Theoretical Biology 228:97–106.
- Kozlov, A. M., Darriba, D., Flouri, T., Morel, B., and Stamatakis, A. 2019. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics 35:4453–4455.
- Li, Y., Steenwyk, J. L., Chang, Y., Wang, Y., James, T. Y., Stajich, J. E., Spatafora, J. W., Groenewald, M., Dunn, C. W., Hittinger, C. T., et al. 2021. A genome-scale phylogeny of the kingdom Fungi. Current Biology 31:1653-1665.e5.
- Li, Z., De La Torre, A. R., Sterck, L., Cánovas, F. M., Avila, C., Merino, I., Cabezas, J. A., Cervera, M. T., Ingvarsson, P. K., and Van De Peer, Y. 2017. Single-Copy Genes as Molecular Markers for Phylogenomic Studies in Seed Plants. Genome Biology and Evolution 9:1130–1147.
- Liu, H., Steenwyk, J. L., Zhou, X., Schultz, D. T., Kocot, K. M., Shen, X.-X., Rokas, A., and Li, Y. 2023. A genome-scale Opisthokonta tree of life: toward phylogenomic resolution of ancient divergences. Evolutionary Biology.
- Liu, L., Zhang, J., Rheindt, F. E., Lei, F., Qu, Y., Wang, Y., Zhang, Y., Sullivan, C., Nie, W., Wang, J., et al. 2017. Genomic evidence reveals a radiation of placental mammals uninterrupted by the KPg boundary. Proceedings of the National Academy of Sciences 114.
- Manni, M., Berkeley, M. R., Seppey, M., and Zdobnov, E. M. 2021. BUSCO: Assessing Genomic Data Quality and Beyond. Current Protocols 1:e323.
- Martijn, J., Schön, M. E., Lind, A. E., Vosseberg, J., Williams, T. A., Spang, A., and Ettema, T. J. G. 2020. Hikarchaeia demonstrate an intermediate stage in the methanogen-to-halophile transition. Nature Communications 11:5490.
- Martín-Durán, J. M., Ryan, J. F., Vellutini, B. C., Pang, K., and Hejnol, A. 2017. Increased taxon sampling reveals thousands of hidden orthologs in flatworms. Genome Research 27:1263–1272.
- Martínez-Redondo, G. I., Vargas-Chávez, C., Eleftheriadi, K., Benítez-Álvarez, L., Vázquez-Valls, M., and Fernández, R. 2024. MATEdb2, a collection of high-quality metazoan proteomes across the Animal Tree of Life to speed up phylogenomic studies.
- Minh, B. Q., Schmidt, H. A., Chernomor, O., Schrempf, D., Woodhams, M. D., von Haeseler, A., and Lanfear, R. 2020. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution 37:1530–1534.
- Mirarab, S., Reaz, R., Bayzid, Md. S., Zimmermann, T., Swenson, M. S., and Warnow, T. 2014. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics 30:i541–i548.
- Mongiardino Koch, N. 2021. Phylogenomic Subsampling and the Search for Phylogenetically Reliable Loci. Molecular Biology and Evolution 38:4025–4038.
- Mulhair, P. O., McCarthy, C. G. P., Siu-Ting, K., Creevey, C. J., and O’Connell, M. J. 2022. Filtering artifactual signal increases support for Xenacoelomorpha and Ambulacraria sister relationship in the animal tree of life. Current Biology:S0960982222016840.
- Ocaña-Pallarès, E., Williams, T. A., López-Escardó, D., Arroyo, A. S., Pathmanathan, J. S., Bapteste, E., Tikhonenkov, D. V., Keeling, P. J., Szöllősi, G. J., and Ruiz-Trillo, I. 2022. Divergent genomic trajectories predate the origin of animals and fungi. Nature 609:747–753.
- One Thousand Plant Transcriptomes Initiative 2019. One thousand plant transcriptomes and the phylogenomics of green plants. Nature 574:679–685.
- Parey, E., Louis, A., Montfort, J., Bouchez, O., Roques, C., Iampietro, C., Lluch, J., Castinel, A., Donnadieu, C., Desvignes, T., et al. 2023. Genome structures resolve the early diversification of teleost fishes. Science (New York, N.Y.) 379:572–575.
- Philippe, H., Brinkmann, H., Lavrov, D. V., Littlewood, D. T. J., Manuel, M., Wörheide, G., and Baurain, D. 2011. Resolving Difficult Phylogenetic Questions: Why More Sequences Are Not Enough. PLoS Biology 9:e1000602.
- Philippe, H., Derelle, R., Lopez, P., Pick, K., Borchiellini, C., Boury-Esnault, N., Vacelet, J., Renard, E., Houliston, E., Quéinnec, E., et al. 2009. Phylogenomics Revives Traditional Views on Deep Animal Relationships. Current Biology 19:706–712.
- Philippe, H., Vienne, D. M. D., Ranwez, V., Roure, B., Baurain, D., and Delsuc, F. 2017. Pitfalls in supermatrix phylogenomics. European Journal of Taxonomy. Available at: http://www.europeanjournaloftaxonomy.eu/index.php/ejt/article/view/407 [Accessed December 5, 2023].
- Phillips, M. J., Delsuc, F., and Penny, D. 2004. Genome-Scale Phylogeny and the Detection of Systematic Biases. Molecular Biology and Evolution 21:1455–1458.
- Phillips, M. J., and Penny, D. 2003. The root of the mammalian tree inferred from whole mitochondrial genomes. Molecular Phylogenetics and Evolution 28:171–185.
- Raghavan, V., Kraft, L., Mesny, F., and Rigerte, L. 2022. A simple guide to de novo transcriptome assembly and annotation. Briefings in Bioinformatics 23:bbab563.
- Rodríguez-Ezpeleta, N., Brinkmann, H., Burger, G., Roger, A. J., Gray, M. W., Philippe, H., and Lang, B. F. 2007. Toward Resolving the Eukaryotic Tree: The Phylogenetic Positions of Jakobids and Cercozoans. Current Biology 17:1420–1425.
- Salichos, L., and Rokas, A. 2013. Inferring ancient divergences requires genes with strong phylogenetic signals. Nature 497:327–331.
- Sayyari, E., and Mirarab, S. 2018. Testing for Polytomies in Phylogenetic Species Trees Using Quartet Frequencies. Genes 9:132.
- Schultz, D. T., Haddock, S. H. D., Bredeson, J. V., Green, R. E., Simakov, O., and Rokhsar, D. S. 2023. Ancient gene linkages support ctenophores as sister to other animals. Nature. Available at: https://www.nature.com/articles/s41586-023-05936-6 [Accessed May 21, 2023].
- Shen, X.-X., Opulente, D. A., Kominek, J., Zhou, X., Steenwyk, J. L., Buh, K. V., Haase, M. A. B., Wisecaver, J. H., Wang, M., Doering, D. T., et al. 2018. Tempo and Mode of Genome Evolution in the Budding Yeast Subphylum. Cell 175:1533-1545.e20.
- Shen, X.-X., Salichos, L., and Rokas, A. 2016. A Genome-Scale Investigation of How Sequence, Function, and Tree-Based Gene Properties Influence Phylogenetic Inference. Genome Biology and Evolution 8:2565–2580.
- Sierra-Patev, S., Min, B., Naranjo-Ortiz, M., Looney, B., Konkel, Z., Slot, J. C., Sakamoto, Y., Steenwyk, J. L., Rokas, A., Carro, J., et al. 2023. A global phylogenomic analysis of the shiitake genus Lentinula. Proceedings of the National Academy of Sciences 120:e2214076120.
- Sievers, F., and Higgins, D. G. 2018. Clustal Omega for making accurate alignments of many protein sequences: Clustal Omega for Many Protein Sequences. Protein Science 27:135–145.
- Siu-Ting, K., Torres-Sánchez, M., San Mauro, D., Wilcockson, D., Wilkinson, M., Pisani, D., O’Connell, M. J., and Creevey, C. J. 2019. Inadvertent Paralog Inclusion Drives Artifactual Topologies and Timetree Estimates in Phylogenomics. Molecular Biology and Evolution 36:1344–1356.
- Smith, S. A., Brown, J. W., and Walker, J. F. 2018. So many genes, so little time: A practical approach to divergence-time estimation in the genomic era. PLOS ONE 13:e0197433.
- Steenwyk, J., and King, N. 2023. From Genes to Genomes: Opportunities and Challenges for Synteny-based Phylogenies. Preprints.
- Steenwyk, J. L., Balamurugan, C., Raja, H. A., Gonçalves, C., Li, N., Martin, F., Berman, J., Oberlies, N. H., Gibbons, J. G., Goldman, G. H., et al. 2024. Phylogenomics reveals extensive misidentification of fungal strains from the genus Aspergillus. Microbiology Spectrum:e03980-23.
- Steenwyk, J. L., Buida, T. J., Gonçalves, C., Goltz, D. C., Morales, G., Mead, M. E., LaBella, A. L., Chavez, C. M., Schmitz, J. E., Hadjifrangiskou, M., et al. 2022a. BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data. Genetics 221:iyac079.
- Steenwyk, J. L., Buida, T. J., Labella, A. L., Li, Y., Shen, X.-X., and Rokas, A. 2021. PhyKIT: a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data. Bioinformatics 37:2325–2331.
- Steenwyk, J. L., Buida, T. J., Li, Y., Shen, X.-X., and Rokas, A. 2020. ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference. PLOS Biology 18:e3001007.
- Steenwyk, J. L., Goltz, D. C., Buida, T. J., Li, Y., Shen, X.-X., and Rokas, A. 2022b. OrthoSNAP: A tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees. PLOS Biology 20:e3001827.
- Steenwyk, J. L., Li, Y., Zhou, X., Shen, X.-X., and Rokas, A. 2023a. Incongruence in the phylogenomics era. Nature Reviews Genetics.
- Steenwyk, J. L., Phillips, M. A., Yang, F., Date, S. S., Graham, T. R., Berman, J., Hittinger, C. T., and Rokas, A. 2022c. An orthologous gene coevolution network provides insight into eukaryotic cellular and genomic structure and function. Science Advances 8:eabn0105.
- Steenwyk, J. L., and Rokas, A. 2021. orthofisher: a broadly applicable tool for automated gene identification and retrieval. G3 GenesGenomesGenetics 11:jkab250.
- Steenwyk, J. L., and Rokas, A. 2019. Treehouse: a user-friendly application to obtain subtrees from large phylogenies. BMC Research Notes 12:541.
- Steenwyk, J. L., Rokas, A., and Goldman, G. H. 2023b. Know the enemy and know yourself: Addressing cryptic fungal pathogens of humans and beyond. PLOS Pathogens 19:e1011704.
- Struck, T. H. 2014. TreSpEx–-Detection of Misleading Signal in Phylogenetic Reconstructions Based on Tree Information. Evolutionary Bioinformatics 10:EBO.S14239.
- Susko, E., and Roger, A. J. 2021. Long Branch Attraction Biases in Phylogenetics. Systematic Biology 70:838–843.
- Susko, E., and Roger, A. J. 2007. On Reduced Amino Acid Alphabets for Phylogenetic Inference. Molecular Biology and Evolution 24:2139–2150.
- Tan, G., Muffato, M., Ledergerber, C., Herrero, J., Goldman, N., Gil, M., and Dessimoz, C. 2015. Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference. Systematic Biology 64:778–791.
- Telford, M. J., Lowe, C. J., Cameron, C. B., Ortega-Martinez, O., Aronowicz, J., Oliveri, P., and Copley, R. R. 2014. Phylogenomic analysis of echinoderm class relationships supports Asterozoa. Proceedings of the Royal Society B: Biological Sciences 281:20140479.
- Thornton, J. W., and DeSalle, R. 2000. Gene Family Evolution and Homology: Genomics Meets Phylogenetics. Annual Review of Genomics and Human Genetics 1:41–73.
- Turnbull, R., Steenwyk, J. L., Mutch, S. J., Scholten, P., Salazar, V. W., Birch, J. L., and Verbruggen, H. 2023. OrthoFlow: phylogenomic analysis and diagnostics with one command. In Review. Available at: https://www.researchsquare.com/article/rs-3699210/v1 [Accessed December 7, 2023].
- Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., et al. 2020. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17:261–272.
- Waterhouse, R. M., Seppey, M., Simão, F. A., Manni, M., Ioannidis, P., Klioutchnikov, G., Kriventseva, E. V., and Zdobnov, E. M. 2018. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Molecular Biology and Evolution 35:543–548.
- Zhang, C., Scornavacca, C., Molloy, E. K., and Mirarab, S. 2020. ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Molecular Biology and Evolution 37:3292–3307.
- Zhao, D., Liu, J., and Yu, T. 2023. Protocol for transcriptome assembly by the TransBorrow algorithm. Biology Methods and Protocols 8:bpad028.

| Function Name | Function Alias(es) | Description | |
| Alignment-based functions | alignment_length | aln_len; al | Calculate alignment length. |
| alignment_length_no_gaps | aln_len_no_gaps; alng | Calculate alignment length excluding sites with gaps. | |
| alignment_recoding | aln_recoding; recode | Recode alignments using reduced character states. | |
| | column_score | cs | Calculates column score. Column score is an accuracy metric for a multiple alignment relative to a reference alignment. |
| | compositional_bias_per_site | comp_bias_per_site; cbps | Calculates compositional bias per site in an alignment. Site-wise chi-squared tests are conducted in an alignment to detect compositional biases. |
| | create_concatenation_matrix | create_concat; cc | Create a concatenated alignment file. This function is used to help in the construction of multi-locus data matrices. |
| | evolutionary_rate_per_site | evo_rate_per_site; erps | Estimate evolutionary rate per site. Values may range from 0 (slow evolving; no diversity at the given site) to 1 (fast evolving; all characters appear only once). |
| | faidx | get_entry; ge | Extracts a sequence entry from a FASTA file. |
| | gc_content | gc | Calculate GC content of a FASTA file. |
| | pairwise_identity | pairwise_id; pi | Calculate the average pairwise identity among sequences. Pairwise identities can be used as proxies for the evolutionary rate of sequences. |
| | parsimony_informative_sites | pis | Calculate the number and percentage of parsimony-informative sites in an alignment. |
| | relative_composition_variability | rel_comp_var; rcv | Calculate RCV (relative composition variability) for an alignment. Lower RCV values are thought to be desirable because they represent a lower composition bias in an alignment. |
| | relative_composition_variability_taxon | rel_comp_var_taxon; rcvt | Calculate RCVT (relative composition variability, taxon) for an alignment. RCVT is the relative composition variability metric for individual taxa. Lower RCVT values are more desirable because they indicate a lower composition bias for a given taxon in an alignment. |
| | rename_fasta_entries | rename_fasta | Rename entries in a FASTA file. Note that the input FASTA file does not need to be a multiple sequence alignment. |
| | sum_of_pairs_score | sops; sop | Calculates sum-of-pairs score. Sum-of-pairs is an accuracy metric for a multiple alignment relative to a reference alignment. |
| | thread_dna | pal2nal; p2n | Thread DNA sequences onto a protein alignment to create a codon-based alignment. |
| | variable_sites | vs | Calculate the number and percentage of variable sites in an alignment. |
| Tree-based functions | bipartition_support_stats | bss | Calculate summary statistics for bipartition support. |
| | branch_length_multiplier | blm | Multiply branch lengths in a phylogeny by a given factor. This can help modify reference trees when conducting simulations or other analyses. |
| | collapse_branches | collapse; cb | Collapse branches on a phylogeny according to bipartition support. Bipartitions are collapsed if their support is less than the user-specified value. |
| | covarying_evolutionary_rates | cover | Quantify the degree of coevolution between two single-gene trees. |
| | degree_of_violation_of_a_molecular_clock | dvmc | Calculate degree of violation of a molecular clock (or DVMC) in a phylogeny. Lower DVMC values are thought to be desirable because they are indicative of a lower degree of violation in the molecular clock assumption. |
| | evolutionary_rate | evo_rate | Calculate a tree-based estimation of the evolutionary rate of a gene. |
| | hidden_paralogy_check | clan_check | Scan tree for evidence of hidden paralogy. Specifically, this method will examine if a set of well-known monophyletic taxa are, in fact, exclusively monophyletic. |
| | internal_branch_stats | ibs | Calculate summary statistics for internal branch lengths in a phylogeny. Internal branch lengths can be useful for phylogeny diagnostics. |
| | internode_labeler | il | Appends numerical identifiers to bipartitions in place of support values. This is helpful for pointing to specific internodes in supplementary files or otherwise. |
| | last_common_ancestor_subtree | lca_subtree | Obtains a subtree from a phylogeny by getting the last common ancestor of a list of taxa. |
| | long_branch_score | lb_score; lbs | Calculate long branch scores in a phylogeny. Lower long branch scores are thought to be desirable because they are indicative of taxa or trees that likely do not have issues with long branch attraction. |
| | monophyly_check | is_monophyletic | Determine whether a set of taxa is exclusively monophyletic. If other taxa are in the same clade, the lineage is not considered exclusively monophyletic. |
| | nearest_neighbor_interchange | nni | Generate all nearest neighbor interchange moves for a binary rooted tree. |
| | patristic_distances | pd | Calculate summary statistics among patristic distances in a phylogeny. Patristic distances are all tip-to-tip distances in a phylogeny. |
| | polytomy_test | polyt_test; polyt; ptt | Conduct a polytomy test for three clades in a phylogeny. Polytomy tests can be used to identify putative radiations as well as identify well-supported alternative topologies. |
| | print_tree | print; pt | Print ASCII tree of input phylogeny. |
| | prune_tree | prune | Prune tips from a phylogeny. |
| | rename_tree | rename_tips | Renames tips in a phylogeny. |
| | robinson_foulds_distance | rf_distance; rf_dist; rf | Calculate Robinson-Foulds distance between two trees. Low Robinson-Foulds distances reflect greater similarity between two phylogenies. This function prints out two values, the plain Robinson-Foulds value and the normalized Robinson-Foulds value, which are separated by a tab. |
| | root_tree | root; rt | Roots phylogeny using user-specified taxa. |
| | spurious_sequence | spurious_seq; ss | Identifies potentially spurious sequences and reports tips in the phylogeny that could possibly be removed from the associated multiple sequence alignment. PhyKIT does so by identifying and reporting long terminal branches, defined as branches that are equal to or greater than 20 times the median length of all branches. |
| | terminal_branch_stats | tbs | Calculate summary statistics for terminal branch lengths in a phylogeny. |
| | tip_labels | tree_labels; labels; tl | Prints the tip labels (or names) of a phylogeny. |
| | tip_to_tip_distance | t2t_dist; t2t | Calculate distance between two tips (or leaves) in a phylogeny. |
| | tip_to_tip_node_distance | t2t_node_dist; t2t_nd | Calculate distance between two tips (or leaves) in a phylogeny. Distance is measured by the number of nodes between one tip and another. |
| | total_tree_length | tree_len | Calculate total tree length, which is the sum of all branch lengths. |
| | treeness | tness | Calculate treeness statistic for a phylogeny. Higher treeness values are thought to be desirable because they represent a higher signal-to-noise ratio. |
| Alignment- and tree-based functions | saturation | sat | Calculate saturation for a given tree and alignment. Saturation is defined as sequences in multiple sequence alignments that have undergone numerous substitutions such that the distances between taxa are underestimated. |
| | treeness_over_rcv | toverr; tor | Calculate treeness/RCV for a given alignment and tree. Higher treeness/RCV values are thought to be desirable because they harbor a high signal-to-noise ratio and are least susceptible to composition bias. |
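
Two of the alignment-based functions listed above, variable_sites and parsimony_informative_sites, report simple per-column counts. The following Python sketch illustrates the underlying definitions only; it is not PhyKIT's implementation. It assumes the standard definitions (a variable site has at least two distinct characters; a parsimony-informative site has at least two characters that each occur at least twice), and the set of ignored characters (gaps and ambiguity codes) and the names `IGNORED` and `summarize` are assumptions made for illustration.

```python
from collections import Counter

# Characters skipped when classifying a site (an assumption of this sketch).
IGNORED = set("-?NX")


def site_columns(alignment):
    """Return an iterator over alignment columns (one tuple per site)."""
    sequences = [seq.upper() for seq in alignment.values()]
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("alignment must contain sequences of equal length")
    return zip(*sequences)


def is_variable(column):
    """A site is variable if it holds at least two distinct characters."""
    counts = Counter(ch for ch in column if ch not in IGNORED)
    return len(counts) >= 2


def is_parsimony_informative(column):
    """At least two characters must each occur at least twice."""
    counts = Counter(ch for ch in column if ch not in IGNORED)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def summarize(alignment):
    """Count variable and parsimony-informative sites and their percentages."""
    columns = list(site_columns(alignment))
    length = len(columns)
    variable = sum(is_variable(col) for col in columns)
    informative = sum(is_parsimony_informative(col) for col in columns)
    return {
        "length": length,
        "variable_sites": variable,
        "pct_variable": 100 * variable / length if length else 0.0,
        "parsimony_informative_sites": informative,
        "pct_parsimony_informative": 100 * informative / length if length else 0.0,
    }


if __name__ == "__main__":
    # Toy alignment: sites 2, 4, and 5 vary; only sites 2 and 4 are informative.
    toy = {"t1": "ACGTA", "t2": "ACGAA", "t3": "ATGAA", "t4": "ATGTC"}
    print(summarize(toy))
```

In the toy alignment, the last site is variable but not parsimony informative because the character C appears only once.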

| Recoding Scheme Name | Nucleotides or Amino Acids | Description | Reference |
| RY-nucleotide | Nucleotides | Two characters; one character for purines and another for pyrimidines | (Phillips et al., 2004) |
| SandR-6 | Amino Acids | Six characters; based on the JTT substitution matrix | (Susko and Roger, 2007) |
| KGB-6 | Amino Acids | Six characters; based on the WAG substitution matrix | (Kosiol et al., 2004) |
| Dayhoff-6 | Amino Acids | Six characters; based on the Dayhoff (or PAM250) matrix | (Embley et al., 2003) |
| Dayhoff-9 | Amino Acids | Nine characters; based on the Dayhoff (or PAM250) matrix | (Hernandez and Ryan, 2021) |
| Dayhoff-12 | Amino Acids | Twelve characters; based on the Dayhoff (or PAM250) matrix | (Hernandez and Ryan, 2021) |
| Dayhoff-15 | Amino Acids | Fifteen characters; based on the Dayhoff (or PAM250) matrix | (Hernandez and Ryan, 2021) |
| Dayhoff-18 | Amino Acids | Eighteen characters; based on the Dayhoff (or PAM250) matrix | (Hernandez and Ryan, 2021) |
| <file path> | Either | Custom recoding scheme specified using a two-column file. The first column is the recoded character and the second is the character in the alignment. | NA |
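
To make the recoding idea concrete, the sketch below applies RY recoding (purines A and G become R; pyrimidines C and T become Y) and reads a custom two-column scheme in the column order described in the table (recoded character first, alignment character second). It is an illustration rather than PhyKIT's code: the whitespace delimiter, the choice to leave unlisted characters such as gaps unchanged, and the function names `read_scheme` and `recode` are assumptions.

```python
import io

# RY recoding: one character for purines, another for pyrimidines.
RY_SCHEME = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def read_scheme(handle):
    """Parse a two-column scheme: recoded character, then alignment character.

    Columns are assumed to be whitespace-delimited (an assumption of this sketch).
    """
    scheme = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected two columns, got: {line!r}")
        recoded, original = fields
        scheme[original.upper()] = recoded
    return scheme


def recode(sequence, scheme):
    """Recode a sequence; characters absent from the scheme are left unchanged."""
    return "".join(scheme.get(ch, ch) for ch in sequence.upper())


if __name__ == "__main__":
    print(recode("ACGT-ACGT", RY_SCHEME))  # RYRY-RYRY

    # A custom scheme file equivalent to RY recoding, held in memory here.
    custom_file = io.StringIO("R\tA\nR\tG\nY\tC\nY\tT\n")
    custom = read_scheme(custom_file)
    print(recode("ACGT-ACGT", custom) == recode("ACGT-ACGT", RY_SCHEME))  # True
```

Passing a file path in place of a named scheme would correspond to opening that file and handing it to a parser such as `read_scheme`.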

| Feature being subsampled | Metric for subsampling | PhyKIT function | Higher/Lower values are better |
| Taxa | Relative composition variability, taxon (RCVT) | relative_composition_variability_taxon; rel_comp_var_taxon; rcvt | Lower |
| Taxa or Gene | Long branch score | long_branch_score; lb_score; lbs | Lower |
| Gene | Alignment length | alignment_length; aln_len; al | Higher |
| | Alignment length, no gaps | alignment_length_no_gaps; aln_len_no_gaps; alng | Higher |
| | Pairwise identity | pairwise_identity; pairwise_id; pi | Context dependent |
| | Relative composition variability | relative_composition_variability; rel_comp_var; rcv | Lower |
| | Variable sites | variable_sites; vs | Higher |
| | Average (or median) bipartition support value | bipartition_support_stats; bss | Higher |
| | Evolutionary rate | evolutionary_rate; evo_rate | Context dependent |
| | Total tree length | total_tree_length; tree_len | Context dependent |
| | Treeness | treeness; tness | Higher |
| | Saturation | saturation; sat | Higher |
| | Treeness / Relative composition variability | treeness_over_rcv; toverr; tor | Higher |
| | Degree of violation of the molecular clock | degree_of_violation_of_a_molecular_clock; dvmc | Lower |
| Sites | Compositional bias | compositional_bias_per_site; comp_bias_per_site; cbps | Lower |
| | Evolutionary rate | evolutionary_rate_per_site; evo_rate_per_site; erps | Lower |
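
Once a metric from the table above has been computed for every gene (or taxon, or site), subsampling amounts to ranking by that metric in the direction the last column indicates and retaining the best-scoring fraction. The following Python sketch shows one way this could be done; the gene names, metric values, retained fraction, and the function name `subsample` are hypothetical and are not produced by PhyKIT.

```python
import math


def subsample(scores, fraction, higher_is_better):
    """Return the names of the best-scoring items, keeping the given fraction.

    scores: mapping of item name (e.g., gene) to its metric value.
    fraction: proportion of items to retain, in the interval (0, 1].
    higher_is_better: True for metrics such as treeness, False for RCV.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    return ranked[:keep]


if __name__ == "__main__":
    # Hypothetical per-gene values, e.g., collected from separate PhyKIT runs.
    rcv = {"gene_a": 0.12, "gene_b": 0.31, "gene_c": 0.08, "gene_d": 0.22}
    treeness = {"gene_a": 0.41, "gene_b": 0.18, "gene_c": 0.35, "gene_d": 0.52}

    # Lower RCV is better, so rank in ascending order.
    print(subsample(rcv, 0.5, higher_is_better=False))      # ['gene_c', 'gene_a']
    # Higher treeness is better, so rank in descending order.
    print(subsample(treeness, 0.5, higher_is_better=True))  # ['gene_d', 'gene_a']
```

For metrics marked "Context dependent", the ranking direction has to be chosen by the analyst for the question at hand.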

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).