Large-scale phylogenomic analysis provides new insights into the phylogeny of the order Gadiformes and evolution of freshwater gadiform species burbot (Lota lota)

Our understanding of phylogenetic relationships among Gadiformes fish is obtained through the analysis of a small number of genes, but uncertainty remains around critical nodes. A series of phylogenetic controversial exists at the suborder, family, subfamily, and species levels. A total of 1105 orthologous exon sequences and translated amino acid sequences from 36 genomes and 12 transcriptomes covering 33 species were applied to investigate the phylogenetic relationships within Gadiformes and address these problems. Phylogenetic trees reconstructed with the amino acid data set using different tree-building methods (RAxML and MrBayes) showed consistent topology. The monophyly of Gadifromes was confirmed in our study. However, the three suborders Muraenolepidoidei, Macrouroidei, and Gadoidei were not well recovered by our phylogenomic study, rejecting the validity of suborder Muraenolepidoidei. Four major lineages were revealed in this study. The family Bregmacerotidae forming clade I was the basal lineage of Gadiformes. The family Merluciidae formed clade II. Clade III contained families Melanonidae, Muraenolepididae, Macrouridae (with subfamilies Trachyrincinae, Macrourinae, and Bathygadinae), and Moridae. Clade IV contained at least three families of suborder Gadoidei, i.e., Gadidae, Phycidae, and Ranicipitidae. The subspecies of Lota lota from Amur River were confirmed, indicating that exon markers were a valid high-resolution method for delimiting subspecies or distinct lineages within species level. The PSMC analysis of different populations of L. lota suggests a continuous decline since 2 Myr.


Introduction
The order Gadiformes includes some of the most important commercial fish in the world, such as cod, hake, and haddock, accounting for approximately a quarter of the world's total marine fish catch [1]. Currently, Gadiformes contains 555 described species in 75 genera and 9 families [1]. Given the commercial and ecological importance of Gadiformes, a longstanding interest in their phylogenetic relationships exists. Beginning with the established by Müller [2], different hypotheses on the relationships of Gadiformes relying on morphological and molecular data have been put forward over the decades. The number of suborders, families, and subfamilies in this order has changed remarkably. The classification of suborders and families in Gadiformes is controversial. Currently, different authors recognized 11-14 families and 2-4 suborders [1,3]. Recently, a consensus of these classifications is the recognition of three suborders, i.e., Muraenolepidoidei, Gadoidei, and Macrouroidei [4]. However, gadiforms especially the Gadoidei are probably one of the best studied groups of all marine fish worldwide because of their importance to fisheries, but their phylogenetic relationship remains vague [4]. Three authors have raised quite different phylogenetic hypotheses for Gadiformes on the basis of the usage of different morphological characters, such as characters of muscles [18]. Recent studies reveal that genome-scale information can unambiguously resolve incongruences in molecular phylogenies. However, the massive data sets generated by high-throughput sequencing especially whole genome data sets require complex analytical methods to confirm the most informative orthologous loci and appropriate tree reasoning methods. Many types of coding and noncoding genomic loci, such as ultraconserved elements (UCEs), introns, or conserved nonexonic elements (CNEEs), are available for reconstructing phylogeny [19], but exons are usually selected to reconstruct fish phylogeny at large scales because of easy alignment and conservation among species [20]. Malmstrøm et al., [21] have performed the phylogenomic analyses of teleost fishes (including 27 Gadiformes fish species) by using a data set of 567 exon orthologs from 111 genes. However, genetic markers are normally selected using similarity without applying stringent analyses to determine the orthology beyond similarity-based criteria. The gene genealogy interrogation can be used to mitigate the effects of gene tree estimation error and filtered paralogs [18]. A new set of exons (including 1105 exons) successfully used for fish phylogenetics is obtained through whole-genome sequencing and gene genealogy interrogation to determine orthology, thereby avoiding paralogy. Hughes et al., [22] have applied this genome-scale data set, conducted the large-scale phylogenomic analysis of 303 fish species, and settled some longstanding uncertainties (such as the branching order at the base of teleosts and among early euteleosts and the sister lineage to the acanthomorph and percomorph radiations). The above studies have confirmed the validity of exon markers for phylogenetic analysis among species, but the validity of this method for delimiting subspecies or distinct lineages within species is unknown.
In this study, we have conducted a comprehensive molecular phylogenetic study of the order Gadiformes, including available genomic and transcriptomic data, by using genome-scale orthologs conserved exon markers. L. lota samples from the Irtysh River, China; Amuer River, Chana; and Great Lakes, Canada are included in this study. First, we use phylogenomic analysis to resolve the longstanding controversy regarding the relationships of Gadiformes fish. Second, the status of new subspecies in L. lota from Amuer River is checked using the genome-wide locus.

Results
A total of 36 genomes and 12 transcriptomes, including two newly sequenced genomes from L. lota that represented Irtysh River (China) and Great Slave Lake (Canada) populations (Table 1), were used in this study. For each new sample, a single paired-end library was sequenced to 49.5-56.5× coverage on the Illumina NovaSeq 6000 platform. Genome sequences were assembled using the SOAPdenovo2, resulting in N50 config sizes between 1.8 and 3.2 kb (Table 2). On average, we recovered 48% of the conserved genes of the Actinopterygii data set (3640 genes) included in the BUSCO analysis. Collectively, these results indicated a middle degree of gene space completeness for gene detection in our partial draft genome assemblies. The Trinity software was applied to the de novo assembly of transcriptome data, and 106 084-353 261 unigenes were generated with N50 varying from 319 to 2285. The transcriptome assembly information is shown in Table 1.
We used a set of 1105 single-copy conserved exons for phylogenomic analyses to establish gadiform relationships firmly. Exon alignments for these eight model species were used to obtain HMMs and search for each locus in 36 genomes and 12 transcriptomes, covering 33 species. Significant hits were added to each original alignment to create an exon sequence database for 48 individuals. The concatenated alignment of 1105 exons produced a data matrix with 563 139 bp (187 713 amino acids) for 33 species.
We compared the levels of among-and within-species variations for exon and amino acid data sets (Tables S1 and S2). Tables 3 and 4 showed the genetic distances for selected species (i.e., Gadus morhua, G. chalcogrammus, and L. lota), which were calculated as absolute pairwise DNA and amino acid differences, and related genetic distances. For the pairwise comparisons within species, the ranges of nucleotide and amino acid differences within Atlantic cod (G. morhua) were 2226-2779 (Table 3) and 716-920 (Table 4), respectively, with averages of 2530.25 and 804.5, respectively. The range of nucleotide differences within L. lota showed high values, which varied from 438 to 2841. Nucleotide differences within the Amur River population of L. lota varied from 438 to 1820 (average = 923). The smallest intergeneric difference was observed between Gadus chalcogrammus and G. morhua (average nucleotide differences, 2530.26; amino acid differences, 804.5), and this finding was similar to the within-species difference between Amur River and EB lineages of L. lota (nucleotide substitutions, 1319-2841; amino acid differences, 413-1060).
The phylogenetic analyses reconstructed from amino acid sequences data set by using two tree-building methods, including RAxML and MrBayes, all recovered the same topology (Figures 2 and 3). Our results suggested that Gadiformes could be split into four major lineages that generally corresponded to the families Bregmacerotidae and Merlucciidae and suborders Macrouroidei and Gadoidei. A sister-group relationship between Bregmacerotidae and all other gadiform families was reported. The hypothesis that proposed Muraenolepididae as the independent suborder Muraenolepidoidei and first diverging family was not supported by our analyses. The family Muraenolepididae was clustered into clade III and had a sister-group relationship with Trachyrincinae (Trachyrincus scabrous and Trachyrincus murrayi). Phylogenetic analyses placed Merlucciidae as the second branching lineage of Gadiformes (clade II). Clade III contained the families Muraenolepididae, Melanonidae, Moridae, and Macrouridae and subfamily Trachyrincinae, including 10 species. This clade covered the previous suborders Muraenolepidoidei and Macrouroidei and two families of suborder Gadoidei (i.e., Trachyrincinae and Moridae). Clade IV included the other families within suborder Gadoidei. The families Gadidae and Phycidae were recovered as paraphyletic groups. Three species (i.e., L. lota, M. molva, and Brosme brosme) in the subfamily Lotinae were analyzed. However, subfamily Lotinae was not recovered as a monophyletic group. M. molva and B. brosme were more closely related to Gadinae than to Lota.
Assuming a generation length of four years and mutation rate of 0.6 × 10−8, the demographic histories of L. lota encompassed the period from 4 million years ago (Mya) to 10 thousand years ago (Kya) (Figure 4). The effective size (Ne) of all populations peaked at 2 Mya and subsequently declined. The Great Slave Lake population showed two phases of decline at 2 Mya to 0.4 Mya and 30 Kya to 12 Kya, a stable phase from 0.4 Mya to 30 Kya, and pronounced expansion since 12 Kya. High similarities of demographic histories of L. lota from Germany, Irtysh River, and Amur River were observed. The first decline of these three populations was from 2 Mya to 0.4 Mya. The second decline of three populations varied remarkably (Germany and Irtysh River: 0.2 Mya to 13 Kya, Amur River: 60 Kya to 10 Kya). The Ne values of Germany and Irtysh River showed slight expansion since 13 Kya.      Table 3. Pairwise numbers of nucleotide differences (below) and genetic distance (above) based on the exon sequences of three species.  32723  33854  26965  29268  23740  2081  1694  1319  1648  1018  548  697  499  1352  458  438  0.0020   LL-L9  31357  32546  25919  27999  22589  2291  1913  1587  1891  1336  773  1040  706  1397  811  760  782   Table 4. Pairwise number of amino acid differences (below) and genetic distances (above) based on the amino acid sequences of three species.  10192  10567  8135  8743  6970  817  659  531  669  467  262  375  247  542  281  276  255 3. Discussion

Phylogenetic relationships of order Gadiformes
In this study, we have tried to construct the phylogeny of Gadiformes at the genome level to solve major phylogenetic questions among gadiform fish. The monophyly of Gadiformes is confirmed in our study. However, most formerly described suborders in Gadiformes are not supported by our phylogenomic study.
Results from our genome data set and Malmstrøm et al., [21] suggest that Gadiformes can be split into four major lineages that do not correspond to the three previously described suborders, namely, Muraenolepidoidei, Macrouroidei, and Gadoidei. The suborder Muraenolepidoidei is first established by Svetovidov [23] on the basis of the peculiar structure of the pectoral girdle. The suborder Muraenolepidoidei is thought to represent an ancient lineage of Gadiformes [24]. By contrast, other authors have assigned muraenolepids as a family to the suborder Gadoidei [3]. A recent study has found that its placement as basal lineage of Gadiformes is weakly supported [4]. In our result, the placement of family Muraenolepidoidei as sister group to Trachyrincinae is strongly supported. Further studies should clarify if Muraenolepididae represents the most basal branch in the Gadiformes phylogeny.
Clade I only contains the family Bregmacerotidae, which is the basal lineage of Gadiformes in our study. The phylogenetic position of Bregmacerotidae has been included in the gadoids or regarded as the suborder Bregmacerotoidei [25]. Roa-Varón and Ortí (2009) have reported a discrepancy between mtDNA and nuclear DNA for the placement of Bregmacerotidae. Mitochondrial DNA data show the close relationship of Bregmacerotidae with Macrourinae. The combined data set of mtDNA and nuclear genes in Roa-Varón and Ortí (2009) showed the family Bregmacerotidae in a monophyletic clade with Ranicipitidae in the suborder Gadoidei. Our analyses suggest the position of Bregmacerotidae as the sister group of all other gadiforms. This family may represent the most basal branch in the Gadiformes phylogeny. The unexpected depth of this divergence suggests that Bregmacerotidae can be ranked at the suborder level. However, the long branches separating Bregmaceros from the other taxa in all analyses suggest a rapid evolutionary rate of changes that can saturate the phylogenetic signal of genes. This long branch may mislead the interpretation of phylogenetic information. Similar long branches of Bregmacerotidae are also reported in mtDNA genes [4]. Further study should clarify if Bregmacerotidae represents the most basal branch in the Gadiformes phylogeny.
The family Merluciidae forms clade II in our study. Several taxonomic revisions about the placement of Merluciidae have been published [26][27]. The controversy focuses on whether Merlucciidae is a separate family within Gadoidei. Our genomic data presents new evidence for considering Merlucciidae as a new independent suborder within gadiforms, representing the basal branch in the Gadiformes phylogeny. This result is anticipated at least in part by Betancur-R et al., [8], who have proposed that the family Merluciidae is the most basal suborder.
Clade III contains some families from suborders Macrouroidei and Gadoidei. These families include Melanonidae, Muraenolepididae, Macrouridae (with subfamilies Trachyrincinae, Macrourinae, and Bathygadinae), and Moridae. Three of four subfamilies in Macrouridae are included in the present study, and three monophyletic subclades are recovered in clade III. However, the monophyly of Macrouridae is not well supported, thereby rejecting the monophyly of this family reported in Betancur-R et al., [8] on the basis of the mtDNA genome. Only subfamilies Macrourinae and Bathygadinae are sister groups within Macrouridae. The monophyly relationship between Macrourinae and Bathygadinae is also supported by morphological study and molecular markers [4,24,28]. Another nominal subfamily Trachyrincinae belonging to Macrouroidei seems to be distantly related and clusters with Muraenolepididae. The subfamily Trachyrincinae and families Melanonidae and Moridae are placed in the suborder Gadoidei [4]. In agreement with our study, Cohen [9] adds Moridae and Trachyrincidae in the suborder Macrouroidei. The placement of the family Melanonidae is largely incongruent among previous studies. The phylogenetic position of this family is controversial. The family Melanonidae is included in the suborder Gadoidei or forms an independent suborder as the sister group of all Gadiformes [3,6]. Melanonidae as an independent family within Macrouroidei is strongly supported in our study.
Clade IV contains at least three families of the suborder Gadoidei, i.e., Gadidae, Phycidae, and Ranicipitidae. The suborder Gadoidei is not well recovered by the genomic data set and has been defined by two principal clades in Roa-Varón and Ortí [4]. The first clade, which contains Moridae and two subfamilies of Macrouridae (i.e., Trachyrincinae and Macrouroidinae), is moved to clade III in our study. The second clade, which includes Merlucciidae, Melanonidae, Euclichthyidae (lack of sample in our study), Ranicipitidae, Bregmacerotidae, and Gadidae, forms clade IV of our results. The results obtained with our genomic data set suggest that the suborder Gadoidei contains at least three families, namely, Gadidae, Phycidae, and Ranicipitidae. The family Gadidae contains some commercially important species and has attracted many studies. Previous morphological studies prove the validity of the family Gadidae. Our data support a monophyletic clade for taxa in the family Gadidae with high bootstrap values. However, two subfamilies Lotinae and Gaidropsarinae are not recovered as monophyletic groups. The three species of the subfamily Lotinae (i.e., L. lota, M. molva, and B. brosme) are included in the present study. M. molva and B. brosme in the present study are more closely related to Gadinae than to L. lota. A previous study has shown the nonmonophyletic relationship between L. lota and M. molva and suggested to clarify the status of Lotinae by including B. brosme to cover all species in the subfamily. Notably, the different positions of L. lota are revealed and compared with those observed by Roa-Varon and Ortí [4]. Roa-Varon and Ortí [4] have revealed that L. lota is more closely related to Gadinae than to M. molva. However, our result indicates the early divergence for Lota genus and a relatively distant relationship to Gainae, thereby revising previous finding. Therefore, M. molva and B. brosme may be moved to the subfamily Gadinae. Additionally, Gadiculus argenteus, which represents the subfamily Gaidropsarinae, is the sister group to Trisopterus and Micromesistius from the subfamily Gadinae. This finding does not recover the subfamily Gaidropsarinae and suggests its inclusion to Gadinae. The status of the subfamily Gaidropsarinae needs revision. More taxa from the subfamily Gaidropsarinae are needed for further study.

Subspecies status of L. lota
Previous morphological and molecular marker studies on the number of subspecies present in L. lota have been controversial [12]. One possible new species or subspecies is confirmed in our genomic study. Hubbs and Schultz [10] have raised the idea of three subspecies on the basis of morphological variation, namely, L. lota lota, L. lota leptura, and L. lota maculosa. The validity of L. lota lota and L. lota maculosa is confirmed using mtDNA markers [12]. The L. lota lota (EB lineage) taxon lives in Eurasia and Canada, and the L. lota maculosa (NA lineage) taxon lives in southern Canada and United States. New subspecies of L. lota from Amur River, China first described by Fang et al., [13] is confirmed in our genomic data. The average divergence between Amur River and other burbot lineages is within the range of interspecific divergence between G. chalcogrammus and G. morhua, supporting the status of Amur River as subspecies or even new species. The evolutionary mtDNA lineage of burbot already exists in the Early Pliocene with PSMC results, rendering sufficient time for speciation. Considering the worldwide distribution, more subspecies or species may exist in L. lota.
The results of our study provide novel insights into the demographic history of burbot populations. On the basis of the PSMC results, the demographic history of L. lota is estimated back up to 4 Mya. This time frame is consistent with that of previous studies. mtDNA markers and fossil records provide evidence that the genus inhabits freshwater as long as 5 Mya [12,14]. The PSMC can be used to infer the timing of divergence among distinct populations under a wide range of demographic scenarios. Our PSMC-based estimates of divergence times (about 2 Myr) among the lineages of L. lota are similar to previous estimates produced using different molecular markers (about 1.1-1.9 Mya) [12]. A striking difference in demographic histories is observed between the Great Slave Lake population and other populations. The Great Slave Lake population shows gradual and sudden expansion after the last glaciation. This finding corresponds well with the strong signal of ice age in the North America. Other populations show long-term demographic contractions until the end of last glacial. The Irtysh River and Germany populations show the most similar demographic histories, reflecting exposure to similar climatic history.

Genomic and transcriptomic data set collection
Genomic and transcriptomic data for 40 species representing 8 gadiform families and 2 outgroup species (i.e., Cyttopsis rosea and Zeus faber) from the order Zeiformes were retrieved from the NCBI database (Table 1). A new genome assembly of L. lota from Amur River, China was included in this study [29].

New genome sequencing and assembly
Two genome assemblies of L. lota were newly sequenced in this study. Samples were collected from Irtysh River, China and Great Slave Lake, Canada, representing the subspecies L. lota lota and L. lota maculosa, respectively. The subspecies were identified using the mtDNA cytb gene extracted from the draft genome assembly data. Genomic DNA was obtained from the muscle in this study. A single paired-end library with an insert size of ∼300 bp was created for each sample by using the Illumina protocol. Two samples were sequenced (2 × 150 bp) to about 50× coverage on the Illumina NovaSeq 6000 platform, and sequences were assembled using the SOAPdenovo2 (https://github.com/aquaskyline/SOAPdenovo2). The draft genome assembly quality in terms of gene space completeness was assessed using the BUSCO 5.0 [30] (Table 2). Raw sequencing data for L. lota genome were deposited at the Sequence Read Archive (SRR**).

Transcriptome de novo assembly
Twelve raw transcriptome data, including eight L. lota individuals from the Amur River and one L. lota individual from the Irtysh River, China, were collected from the Sequence Read Archive database of NCBI (SRR13435971-SRR13435978) and reference [31], respectively. The raw RNA-seq data were cleaned by removing reads with sequencing adapters, unknown nucleotides (N ratio > 10%), and low-quality reads (quality scores < 20) by using the Trimmomatic 0.36. The software Trinity (Version 2.4.0) [32] software was used for the transcriptome de novo assembly of each individual, and the parameters were as follows: --genome_guided_max_intron 10000.

Exon captures and phylogenetic inference
A total of 36 Gadiformes genomes and 12 transcriptomes, representing the major families described in Roa-Varón [4], were included for phylogenomic analysis. The single-copy orthology locus was searched in accordance with that published by Hughes et al. [22]. A data set of 1105 single-copy conserved exon markers > 200 bp was obtained from Hughes et al. [22] by comparing eight well-annotated fish genomes. These exon markers passed the paralogy filtering through the gene genealogy interrogation.
The nHMMER program within the HMMER (v3.3.2) [33] was used to search against each study species data set through the hidden Markov model (HMM). High-quality single-copy conserved nuclear coding sequences of each research species were aligned and spliced into single-nucleotide sequences by using the MACSE software (v2.05) [34] on the basis of their amino acid translation. Gene and translated amino acid alignments were concatenated using SCaFoS [35]. Amino acid sequences for phylogenetic analysis can reduce the confounding effect of base-composition heterogeneity among taxa, a biasing factor shown to be extensive among fishes. The maximum likelihood (ML) and Bayesian inference were used for phylogenetic analyses. The protest 3 [36] was applied to estimate the best-fit model for the amino acid data set. A total of 100 bootstrap replicates were obtained for the optimal PROTGAMMAIJTT substitution model of all concatenated nucleotide and protein sequences in the RAxML v 8.2.12, [37] for the ML tree. The MrBayes analysis was conducted in the MrBayes 3.2.6 [38] with 1 × 105 metropolis-coupled Markov Chain Monte Carlo (MCMCMC) generations. Sampling was done every 100 generations, and the first 25% of the generations were discarded as burn-in. 4.5. History analysis of L. lota The demographic history for L. lota was analyzed using the Pairwise Sequentially Markovian Coalescent (PSMC) model as implemented in the PSMC package v0.6.5 [39]. This software package could infer the population size history from one diploid sequence by using the PSMC model. The whole-genome diploid consensus sequence was generated as follows. The "BWA-MEM" algorithm from the bwa v0.7.17 [40] was used to map the fastq data to the assembled genome of L. lota. The SAMtools v0.1.19 [41] was used to generate the diploid consensus with default settings (https://github.com/lh3/psmc). The default settings of the PSMC were adopted. Mutation rates were estimated along the branches of the phylogeny reported in Malmstrom et al., [37], and generation times were set to four years [9].
Supplementary Materials: Table S1 Pairwise number of nucleotide differences (below) and genetic distance (above) based on the exon sequences for all species. Table S2 Pairwise number of amino acid differences (below) and JTT distances (above) based on the amino acid sequences for all species.  Institutional Review Board Statement: Ethical review and approval were waived for this study, due to no live fishes used.
Data Availability Statement: Raw sequencing data for L. lota genome were deposited at the Sequence Read Archive (SRR**).