Speciation in stickleback facilitated by admixture – where is the evidence?

Where genetic variation promoting speciation originates is a crucial question in evolutionary genomics. In a recent article, Marques et al. (2019) seek to address this question in lake and stream threespine stickleback fish from the Lake Constance (hereafter LC) basin in Central Europe. Based on population genetic methods, they conclude that incipient speciation between lake and stream stickleback was facilitated by the mixing of genetic variation from old lineages evolved in isolation (i.e., admixture following secondary contact). In this comment, I discuss conceptual and methodological problems and unrecognized conflicts with existing evidence that cast doubt on Marques et al.’s conclusion.


Phylogeny of European freshwater threespine stickleback
The phylogenetic analysis involves stickleback samples from 39 localities in and around central Europe, most of which are represented by two individuals (69 individuals in total; Supplementary Table 1). Because of strong adaptive genetic divergence between marine and freshwater stickleback populations, all 38 European localities concern freshwater habitat; only a single locality (CLU; Cluxewe Estuary, Vancouver Island, Canada) included as outgroup represents saltwater habitat (anadromous marine stickleback). All populations are at least potentially natural. The only exceptions are the populations CHE and SAS, which originate from human introduction to the Lake Geneva basin around 1900 (Fatio 1882;Bertin 1925).
The sequence data underlying this analysis are Sbf1 or Pst1 enzyme restriction siteassociated DNA (RAD) sequences generated specifically for this study, or retrieved from published investigations (Roesti et al. 2012b(Roesti et al. , 2014(Roesti et al. , 2015Ferchaud & Hansen 2016;Marques et al. 2016;Fang et al. 2018). In the latter case, individuals with high read coverage were given priority. The full data set is described in detail in Supplementary Table 1.
To obtain single-nucleotide polymorphisms (SNPs) for phylogenetic analysis, all raw fastq files were initially filtered for reads starting with the exact Sbf1 restriction residual (TGCAGG; Sbf1 restriction sites are covered by the Pst1 enzyme too), and the reads were trimmed to 70 base pairs (bp). The fastq data thus obtained were aligned to the Glazer et al.
2015 threespine stickleback reference genome assembly with Novoalign v3.00 (http://www.novocraft.com/products/novoalign), using the alignment parameters from Roesti et al. 2012a (key settings: -t180 -g40 -x15). The alignments were then converted to BAM format and accessed with the R package Rsamtools (Morgan et al. 2017). At each RAD locus in each individual, haploid genotyping was performed by retrieving the leading haplotype, defined as the single sequence exhibiting the highest read count among all unique sequences present at the RAD locus. RAD loci at which this leading haplotype did not occur in at least two copies, or exhibiting an excessive read depth beyond 4.5 times the expected read depth across all genome-wide RAD loci (estimated by the total number of reads divided by the total number of RAD loci), were excluded from analysis. This haploid genotyping approach was chosen because it avoids potential bias in the identification of heterozygous positions and is therefore highly reliable. RAD loci successfully genotyped in every single individual were then used for SNP detection (hence, the final SNP data set contained no missing data). I accepted SNPs along a RAD locus only if they were at least 8 bp away from the previous polymorphic position, thus avoiding pseudo-SNPs arising from indels. For each individual, the nucleotides present at all SNPs were concatenated to a single string, and these strings combined across all 69 individuals in a single fasta file. Overall, the fasta file contained genotype information from 7,121 SNPs from 4,429 genome-wide RAD loci shared among all individuals. As a robustness check, the above SNP detection and genotyping protocol was repeated using more stringent criteria: for an individual, the leading haplotype at a RAD locus was accepted only if present in at least five (as opposed to two) copies, and the minimum spacing threshold for SNPs located on the same RAD locus was increased from 8 to 12 bp. This latter approach, yielding 797 SNPs from 563 RAD loci, produced a very similar phylogenetic tree topology leading to the same conclusions (details not presented). Both fasta files are provided as Supplementary Data 1 and 2.
Analyses based on the above fasta files were carried out using the R packages ape (Paradis & Schliep 2018) and phangorn (Schliep 2011). For the phylogeny, I first determined the most appropriate substitution model ('GTR+G+I'), estimated the maximum likelihood tree, and visualized this tree as phylogram. In addition, genetic similarity among the individuals was quantified by ordination of a genetic distance matrix using Principal Coordinates Analysis (PCoA). Individuals were plotted along the first two PCoA axes.

Genetic diversity
Genetic diversity within populations was quantified based on the haploid individual-level SNP genotype data generated as described for the phylogenetic analysis (minimum leading haplotype coverage = 2; minimum SNP spacing = 8 bp). Diversity was expressed by the proportion of SNPs at which two (haploid) individuals from a given population exhibited a distinct nucleotide. This metric, representing a close analog of within-population heterozygosity, required genotype data from two individuals from each population, thus excluding all study populations for which sequence data from only a single individual was available (Supplementary Table 1). Because genetic diversity is typically greater in marine stickleback populations -considered large and well-connected -than in derived freshwater populations (Mäkinen et al. 2006;Hohenlohe et al. 2010;Catchen et al. 2013;Haenel et al. 2019), I included samples from two marine localities (NOR, Norsminde Fjord, North Sea, Ferchaud & Hansen 2016;PRI, Primorsk, Baltic Sea, Fang et al. 2018) as a robustness check.
Overall, this analysis used genotype data from 7,247 SNPs located on 4,474 RAD loci, and considered 32 localities. For visualization, genetic diversity was standardized by using the sample with the greatest diversity (CLU; Pacific marine stickleback) as reference. The analysis of genetic diversity was repeated using data generated with more stringent thresholds for genotyping (minimum leading haplotype coverage = 5; minimum SNP spacing = 12 bp), which produced very similar results supporting the same conclusions.  Ocean, hence this sample may represent anadromous marine fish as well). The expected high diversity observed in marine samples, particularly in the most ancestral one from the Pacific (Fang et al. 2018), confirms the soundness of the metric used for quantification. Note that the populations from the Lake Constance basin (dark blue) do not display elevated genetic diversity relative to other European freshwater stickleback, contrary to expectations based on recent admixture between ancient lineages within this basin.