A roadmap to infer synteny-based phylogenies.
Taxon sampling/selection. Taxon sampling influences numerous downstream steps, such as orthology inference. Generally, the denser the taxon sampling, the better [
4,
85]. Fortunately, a growing number of chromosome-level or highly contiguous genome assemblies are publicly available for download and analysis. However, representatives of undersampled lineages may require new genome sequencing. Taxon sampling should therefore be guided by the phylogenetic question at hand. For example, determining evolutionary relationships among vertebrates does not require taxon sampling among fungi; in fact, sparse sampling of distantly related taxa may introduce long branches and contribute to long-branch attraction artifacts [
91,
92].
Long-read sequencing and chromosomal conformation analyses. Much like traditional phylogenomics based on collections of multiple sequence alignments, synteny-based phylogenomics starts with data acquisition. Unlike multiple sequence alignment-based phylogenomics, however, it requires high-quality genomes (ideally assembled accurately from telomere to telomere on all chromosomes). The state of the art for genome assembly requires long-read sequencing (e.g., Oxford Nanopore or PacBio) [
93,
94], which in turn requires high molecular weight DNA from each organism to be sequenced. For more complex genomes, chromosomal interactions detected by Hi-C provide an additional line of evidence for subsequent steps, namely genome assembly [
95].
Genome assembly. With long-read sequences and chromosomal conformation data in hand, the next step for synteny-based phylogenomics is to generate an accurate genome assembly for each species to be analyzed. Poor genome assembly quality can be a source of error when detecting synteny [
96] and, in turn, introduce errors in synteny-based phylogenomics. While there is no broadly accepted definition of a “high-quality” assembly, researchers should consider three important metrics—completeness, contiguity, and accuracy. Completeness can be assessed by comparing inferred gene content with expectations from transcriptome sequences and the presence/absence of nearly universal single-copy orthologs [
97]. Incomplete genomes may be difficult to incorporate into synteny-based phylogenomics and may necessitate further efforts to improve the original genome assembly. When highly contiguous genomes are difficult to achieve, macrosyntenic blocks that are broken up across several scaffolds should be removed from the data matrix. Alternatively, microsyntenies may be more appropriate to use because they are more likely to be preserved, even in a discontiguous genome assembly. Examining assembly accuracy is difficult without physical mapping data from, for example, fluorescence in situ hybridization or optical maps [
98]. However, these data can be useful not only to validate but also to improve genome assembly quality, even helping achieve near-complete genomes [
98]. Of note, other measures of assembly quality, such as degree of contamination, should be taken into account, particularly when loss of synteny is inferred.
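For instance, contiguity can be summarized with only a few lines of code. The minimal sketch below (Python, using Biopython and a hypothetical file name) computes the number of contigs, total assembly size, and N50 from an assembly FASTA file; completeness and accuracy, by contrast, require dedicated data and tools.

from Bio import SeqIO  # Biopython

def contiguity_stats(fasta_path):
    # Contig lengths, longest first (assumes at least one sequence in the file)
    lengths = sorted((len(record.seq) for record in SeqIO.parse(fasta_path, "fasta")),
                     reverse=True)
    total = sum(lengths)
    # N50: length of the shortest contig in the smallest set of contigs covering >= 50% of bases
    running, n50 = 0, None
    for length in lengths:
        running += length
        if running >= total / 2:
            n50 = length
            break
    return {"num_contigs": len(lengths), "total_bp": total,
            "largest_contig": lengths[0], "N50": n50}

print(contiguity_stats("assembly.fasta"))  # hypothetical assembly file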
Genome annotation. To detect syntenic blocks across the resulting set of genomes, the relative positions of orthologous genes are often used [
67,
71]. Thus, phylogeneticists must predict gene boundaries accurately to prevent, for example, erroneously combining two genes into a single gene model or missing genes entirely (
Figure 4). Many phylogenomic studies rely on genomes annotated using different methods, but recent studies have shown that the outputs of different gene annotation methods can vary substantially [
99]. A troubling result of comparing genomes annotated using different annotation methods is the artifactual inflation of the number of unique or lineage-specific genes [
99]. Therefore, applying a single high-quality annotation method trained on each organism, or using methods that combine the results of multiple gene annotation algorithms, like EVidenceModeler [
100], may prove helpful. Moreover, incorporating transcriptomic reads will help refine and provide evidence for gene boundary predictions [
101].
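As a quick screen for annotation discrepancies, per-sequence gene counts from two annotation methods applied to the same assembly can be compared directly; the minimal sketch below uses hypothetical GFF3 file names, and large differences flag contigs with potentially split, fused, or missing gene models.

from collections import Counter

def gene_counts(gff3_path):
    counts = Counter()
    with open(gff3_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2] == "gene":
                counts[fields[0]] += 1  # key on the sequence ID (GFF3 column 1)
    return counts

counts_a = gene_counts("methodA.gff3")  # hypothetical annotations of the same assembly
counts_b = gene_counts("methodB.gff3")
for seq_id in sorted(set(counts_a) | set(counts_b)):
    if counts_a[seq_id] != counts_b[seq_id]:
        print(f"{seq_id}\tmethodA={counts_a[seq_id]}\tmethodB={counts_b[seq_id]}")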
Orthology inference. The resulting gene predictions are subsequently used to infer orthologous relationships among genes (
Figure 4). These relationships are typically inferred from all-vs-all sequence similarity information [
102]. Researchers face several challenges during orthology inference, stemming from both analytical and biological sources of error [
4,
103]. Analytical errors may stem from genes that are genuinely encoded in an organism's genome but absent from its annotation. Other errors may arise from complex evolutionary histories, such as gene duplication and loss, convergence, or saturation [
4,
104].
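As a simplified illustration of similarity-based orthology inference, the sketch below calls putative one-to-one orthologs as reciprocal best hits from all-vs-all BLAST or DIAMOND tabular output (outfmt 6); the file names are hypothetical, and dedicated orthology inference tools handle gene duplication and loss more completely.

def best_hits(tabular_path):
    best = {}  # query -> (subject, bit score)
    with open(tabular_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            query, subject, bitscore = fields[0], fields[1], float(fields[11])
            if query not in best or bitscore > best[query][1]:
                best[query] = (subject, bitscore)
    return best

a_vs_b = best_hits("speciesA_vs_speciesB.tsv")  # hypothetical all-vs-all search results
b_vs_a = best_hits("speciesB_vs_speciesA.tsv")
reciprocal_best_hits = [(gene_a, gene_b) for gene_a, (gene_b, _) in a_vs_b.items()
                        if b_vs_a.get(gene_b, ("",))[0] == gene_a]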
Alternatively, whole-genome alignment methods, like Progressive Cactus and SibeliaZ [
105,
106], may overcome potential errors stemming from gene annotation errors. One major innovation offered by Progressive Cactus is that it allows reference-free multiple genome alignment (ameliorating reference-based bias) and detecting multicopy orthology relationships, rather than only single-copy orthology [
106]. Furthermore, Progressive Cactus can handle large datasets, such as 600 or more animal genomes.
Establishing best practices in synteny detection. Typically, the distributions of gene orthologs along chromosomes in different species are used to detect potential syntenic blocks. Therefore, differences in the quality of ortholog prediction and in the density of detected syntenic orthologs will profoundly shape the accuracy of syntenic block detection. Both factors, the accuracy of ortholog detection and the density of syntenic orthologs, will likely decline when comparing genomes separated by long evolutionary timescales.
Care must be taken, therefore, in the selection of software and analysis parameters [
96]. Two key parameters are the minimum number and density of genes necessary to define orthologous syntenic blocks. Higher thresholds are expected to result in more conservative estimates of syntenic blocks (i.e., fewer false positives), but at the cost of recovering fewer syntenic blocks to analyze. Several software packages facilitate synteny detection, including MCScanX, SynChro, and syntenet [
58,
59,
60]. Notably, each employs a different methodology; for example, SynChro identifies pairwise syntenies using reciprocal best BLAST hits of protein sequence similarity, whereas MCScanX detects synteny blocks across two or more genomes [
58,
59]. MCScanX also provides utilities to classify syntenic blocks by putative evolutionary origin, such as whole genome duplication or tandem duplication. Although these algorithms vary in efficacy, genome contiguity appears to be a major driver of error, underscoring the importance of obtaining highly contiguous genome assemblies [
96].
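To make the roles of the minimum block-size and gap thresholds concrete, the deliberately simplified sketch below chains ortholog anchors (hypothetical pairs of gene-order indices in two genomes) into blocks; dedicated tools such as MCScanX implement more sophisticated collinearity chaining.

def naive_blocks(anchors, min_genes=5, max_gap=10):
    # anchors: list of (gene-order index in genome A, gene-order index in genome B)
    anchors = sorted(anchors)            # order anchors along genome A
    blocks, current = [], [anchors[0]]   # assumes a non-empty anchor list
    for prev, cur in zip(anchors, anchors[1:]):
        if cur[0] - prev[0] <= max_gap and abs(cur[1] - prev[1]) <= max_gap:
            current.append(cur)          # extend the running block
        else:
            if len(current) >= min_genes:
                blocks.append(current)   # close a block that passes the size threshold
            current = [cur]
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks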
To determine how much of the genome is captured during detection, synteny coverage, the summed length of syntenic blocks divided by genome size, can be calculated [
96]. Synteny coverage may differ between genomes owing to biological phenomena, such as variation in genome size and content, or to analytical factors, such as parameters that relax the definition of a syntenic block; thus, it will be important to report synteny coverage for individual genomes as well as summary statistics across them. Ideally, synteny coverage will be high, spanning nearly the entire genome, for closely related organisms. However, synteny coverage may be reduced by the threshold applied for detecting synteny, the rate of evolution among chromosomes, the rate of evolution of local gene order, and the evolutionary distance between the species analyzed.
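For example, synteny coverage can be tabulated per genome in a few lines of code; the block lengths and genome sizes below are hypothetical placeholders.

block_lengths = {"speciesA": [1_200_000, 850_000, 430_000],      # syntenic block lengths (bp)
                 "speciesB": [980_000, 760_000]}
genome_sizes = {"speciesA": 12_000_000, "speciesB": 11_500_000}  # assembly sizes (bp)

coverage = {genome: sum(block_lengths[genome]) / genome_sizes[genome]
            for genome in genome_sizes}
for genome, fraction in coverage.items():
    print(f"{genome}\t{fraction:.1%}")  # summed block length divided by genome size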
Accounting for sources of phylogenomic error. Diverse factors can lead to erroneous species tree inference. Although these are well studied in analyses of multiple sequence alignments [
4,
103,
107], they are underexplored for synteny-based phylogenomics. Here we discuss potential sources of error for synteny analysis and methods for taking them into account.
Saturation. In nucleotide and amino acid sequence evolution, repeated substitution of a site away from, and sometimes back to, an ancestral state is described as “saturation,” and this phenomenon can mask evolutionary history. Saturation may also occur during synteny evolution, whereby multiple sequential rearrangements may obscure the stepwise evolution of syntenic blocks. To overcome saturation, one solution may be to purge data matrices of rapidly evolving syntenic blocks, whose evolutionary history may be harder to trace.
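A minimal sketch of such a filtering step is shown below; it assumes a hypothetical per-block rate proxy (e.g., the total branch length of the tree inferred from each block) and simply removes the fastest-evolving fraction.

def purge_fast_blocks(block_rates, drop_fraction=0.10):
    # block_rates: dict mapping block ID -> rate proxy (higher values = faster evolution)
    ranked = sorted(block_rates, key=block_rates.get)   # slowest to fastest
    n_keep = int(len(ranked) * (1 - drop_fraction))
    return set(ranked[:n_keep])                         # block IDs retained for analysis

retained = purge_fast_blocks({"block_001": 0.8, "block_002": 3.1, "block_003": 1.2})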
Incomplete lineage sorting. The random sorting of ancestral polymorphisms can lead to genealogies that differ from the species tree, especially during rapid radiation events [
108,
109]. Incomplete lineage sorting among structural variants may also be a source of synteny-based phylogenomic error. Incomplete lineage sorting among gene trees is particularly prevalent during radiation events and in large populations [
108,
110]. Given that genome rearrangement can occur rapidly in a population [
111,
112], some structural variants may remain polymorphic across speciation events and sort incompletely among descendant lineages; that is, they may be subject to incomplete lineage sorting. Determining the prevalence (if any) of incomplete lineage sorting among structural variants will clarify whether it is a source of incongruence.
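One prerequisite for such an assessment is quantifying how often per-block genealogies conflict with the species tree; the sketch below does so with Robinson-Foulds distances using DendroPy. The file names are hypothetical, all trees are assumed to contain the same taxa, and topological conflict alone cannot distinguish incomplete lineage sorting from other sources of discordance.

import glob
import dendropy
from dendropy.calculate import treecompare

taxa = dendropy.TaxonNamespace()  # shared namespace makes the trees comparable
species_tree = dendropy.Tree.get(path="species_tree.nwk", schema="newick",
                                 taxon_namespace=taxa)
block_tree_files = sorted(glob.glob("block_trees/*.nwk"))
discordant = 0
for path in block_tree_files:
    block_tree = dendropy.Tree.get(path=path, schema="newick", taxon_namespace=taxa)
    # A nonzero Robinson-Foulds (symmetric difference) distance indicates topological conflict
    if treecompare.symmetric_difference(species_tree, block_tree) > 0:
        discordant += 1
print(f"{discordant}/{len(block_tree_files)} block trees conflict with the species tree")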
Reticulate evolution. Reticulate evolution refers to the non-vertical inheritance of loci, such as through horizontal gene transfer and introgression/hybridization, resulting in loci whose evolutionary histories deviate from a strictly bifurcating tree model [
113,
114,
115]. This issue will affect lineages to varying degrees; for example, horizontal gene transfer occurs more frequently among bacteria and archaea than among many eukaryotic lineages [
116,
117]. Similarly, hybridization is common among plant lineages [
118,
119,
120], and has also been observed in other lineages like animals and fungi [
114,
121,
122,
123].
The non-vertical acquisition of loci may interfere with the detection of otherwise conserved syntenic regions [
124]. In the case of horizontal gene transfer, synteny analysis could suggest an erroneous phylogenetic placement of a lineage; for example, analysis of the horizontally acquired bacterial siderophore gene cluster in yeast [
125] would suggest a close affinity between yeast and bacteria, a hypothesis that is incontrovertibly refuted. Loci with signatures of horizontal gene transfer can be pruned from a data matrix. However, in some cases, horizontally acquired loci that undergo vertical inheritance may be helpful markers for synteny-based phylogenomics [
126].
Modeling syntenic changes. In standard molecular phylogenetics, substitution models approximate the evolutionary process of transitions between character states. These models vary in complexity and ability to capture biological reality [
127,
128,
129]. Analogous substitution models for syntenic data have, to our knowledge, yet to be developed. However, structural variants can segregate among human populations [
40] and recent developments of reference-free pangenomes may help facilitate their detection and illuminate their evolutionary dynamics [
130], paving the way for creating models that capture exchange rates between syntenic states. The empirical determination of best practices for model selection will be important for future studies. Assuming overfitting is not an issue, highly parameterized models may be appropriate for synteny-based tree inference.
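As a purely illustrative sketch of what such a model could look like, the code below specifies a two-state continuous-time Markov model for the presence or absence of a syntenic block, analogous to Mk-type models used for other discrete characters; the gain and loss rates are arbitrary placeholders, and no synteny-specific model of this form has yet been established.

import numpy as np
from scipy.linalg import expm

gain, loss = 0.2, 1.0                 # instantaneous rates of block gain and loss (assumed)
Q = np.array([[-gain,  gain],         # state 0 = block absent
              [ loss, -loss]])        # state 1 = block present; rows sum to zero
P = expm(Q * 0.5)                     # transition probabilities over a branch of length 0.5
print(P)                              # P[i, j] = probability of ending in state j given state i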
Other potential sources of error. Several other sources of error may come into play. For example, although few examples of convergent evolution in genome structure are known [
131,
132,
133], they nonetheless demonstrate how independent rearrangements that result in the same structure could contribute to error in synteny-based phylogenomics. Specifically, the currently accepted evolutionary relationships among the major rodent clades of Hystricomorpha (e.g., capybaras and naked mole-rats), Sciuromorpha (e.g., squirrels and marmots), and Myomorpha (e.g., rats and mice) indicate that Hystricomorpha diverged first and that Sciuromorpha and Myomorpha are sister lineages [
132]. However, independent splitting events in the ortholog of human 3p21.31 in the Hystricomorpha (e.g., capybaras) and Sciuromorpha (e.g., squirrels) lineages would incorrectly suggest a sister relationship between these two lineages [
132]. Other sources of error may include an insufficient number of syntenic blocks and intraspecies heterogeneity in karyotype and chromosome structure due to, for example, Robertsonian chromosome fusions and copy number variants [
72,
111].
For phylogenomic analyses based on collections of multiple sequence alignments, researchers have demonstrated that not all loci carry equal phylogenetic information. For example, genes displaying a clock-like pattern of evolution have often been favored for divergence time analysis [
134,
135,
136]. Measures have been developed to quantify the information encoded in multiple sequence alignments and phylogenetic trees inferred from them. Fortunately, some methods may be easily adapted to synteny data. For example, treeness—a measure of signal-to-noise based on the proportion of tree distance observed among internal branches [
137]—may help identify syntenic blocks with robust phylogenetic signal. Similarly, rogue taxa, organisms with unstable phylogenetic placement among syntenic blocks, can be pruned from a data matrix [
86]. Developing methods to measure the phylogenetic informativeness of different syntenic blocks will help increase signal-to-noise ratios among datasets and aid in refining their usage and interpretation within phylogenomic analyses.
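For instance, treeness can be computed directly from the phylogeny inferred from a syntenic block, as sketched below with DendroPy for a hypothetical per-block tree file; blocks with low treeness could then be down-weighted or removed.

import dendropy

tree = dendropy.Tree.get(path="block_0001.treefile", schema="newick")  # hypothetical file
internal_length, total_length = 0.0, 0.0
for edge in tree.preorder_edge_iter():
    if edge.head_node is tree.seed_node:      # skip the root's null edge
        continue
    length = edge.length or 0.0
    total_length += length
    if not edge.head_node.is_leaf():          # internal branch
        internal_length += length
treeness = internal_length / total_length if total_length > 0 else float("nan")
print(f"treeness = {treeness:.3f}")  # proportion of total tree length on internal branches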