The comparative analysis of the DNA repeat composition among Cannabis sativa L., Humulus lupulus L. and Humulus japonicus Siebold & Zucc. with heteromorphic sex chromosomes

Abstract: Heteromorphic sex chromosomes are rare in plants. They have been observed in only 47 species from phylogenetically distant families, suggesting that sex chromosomes evolved independently in these species. DNA repeat sequences have been shown to be one of the major factors driving sex chromosome evolution, and the accumulation or elimination of repetitive DNA elements is closely linked with the formation of differences between the sex chromosomes. The goal of this study was to characterize the transposon composition in male and female plants of Cannabis sativa L., Humulus lupulus L. and Humulus japonicus Siebold & Zucc. For the first time, the male and female genomes of H. japonicus as well as the male genomes of H. lupulus and C. sativa were sequenced (no open data were previously available for them). The genome-wide sequencing data were analyzed using RepeatExplorer2 and the authors' scripts. The results suggest that the accumulation of Ty3-gypsy elements may be associated with speciation in the Cannabaceae family, which is contrary to the theory of speciation through whole-genome duplication. Moreover, sex-specific DNA repeat clusters were found in C. sativa and H. japonicus. The analysis also revealed that the concentration of Tekay, Retand and Ikeros repeats in the Y chromosome of C. sativa is lower than in the X chromosome, whereas the Angela concentration is higher in the Y chromosome.


Introduction
Most animals are dioecious and have heteromorphic sex chromosomes. In contrast, dioecy is extremely rare in plants. Nearly 6% of land plants have separate sexes, whereas about 90% of plant species are hermaphrodites [1]. Moreover, separate sexes evolved many times independently, but the mechanisms of sex evolution and determination have been studied in only a few species. The divergence between males and females was achieved through the formation of the sex-determining region (SDR) [2,3]. The SDR is located on the sex chromosomes and is usually surrounded by many repetitive sequences. The SDR may occupy a large or a small part of the sex chromosome. When it is small, the chromosomes are homomorphic and contain only a small non-recombining region. Large SDRs surrounded by repetitive sequences cause chromosomal heteromorphism and a large region of recombination arrest. Visually homomorphic sex chromosomes do not have strong morphological differences, and it was suggested that these chromosomes are at the early stages of sex chromosome evolution [4]. Sex chromosomes evolve in several steps: emergence of the SDR, recombination arrest, repeat accumulation, gene degeneration, and deletion of the inactive part of the chromosome. Repetitive sequences are key players in sex chromosome evolution [5][6][7].
Plant genomes are characterized by large amounts of repetitive DNA, referred to as the repeatome. This fraction is known as one of the major sources of changes in genome size [8][9][10][11]. Repeats can be dispersed across all chromosomes (the usual situation for transposable elements), or they can be accumulated in one place (like most satellite repeats) [12][13][14][15][16]. Repeats are often located in functionally meaningful parts of the genome [17][18][19][20][21]. There are several main types of repetitive DNA: mobile elements (MEs), ribosomal DNA, satellites and organelle repeats. Repeats can differ in copy number and in motif length. The largest part of the genome usually consists of transposable elements (TEs) [22]. TEs are represented in genomes by two classes. Class I comprises retrotransposons (REs), which use reverse transcriptase to copy themselves from one part of the genome to another [23,24]. This class is divided into two subclasses: (1) LTR (long terminal repeat) elements and (2) autonomous and non-autonomous non-LTR elements. Class II elements, known as transposons, move by a cut-and-paste mechanism [23]. However, subclasses of transposons differ in the mechanism of excision and insertion: true cut-and-paste (terminal inverted repeats, TIRs), rolling circle (Helitron), and self-synthesizing [8,24,25].
LTR elements are known as the major group in plants. They range in size from several hundred bp up to 25 thousand bp. LTR elements are usually flanked by a 4-6 bp target site duplication (TSD) and contain open reading frames for GAG, a structural protein for virus-like particles, and for POL. POL consists of AP or PROT (aspartic proteinase), RT (reverse transcriptase), RH (RNase H) and INT (DDE integrase). This type of TE is divided into two superfamilies, Gypsy and Copia, which differ in the order of the RT and INT domains: GAG-PROT-INT-RT-RH for Copia and GAG-PROT-RT-RH-INT for Gypsy [26][27][28][29].
The Cannabaceae family contains three species of undoubted agricultural significance. All three are dioecious, and they differ in genome size and sex chromosome organization. Cannabis sativa L. (2n=20), also known as hemp, has a haploid genome size of 818 Mb in females and 843 Mb in males [35]. The Y chromosome is larger than the X chromosome [35][36][37]. Humulus lupulus L. (2n=20), also known as hop, has a haploid genome size of 2960 Mb in females and 2730 Mb in males [25]. The X chromosome is larger than the Y chromosome [25]. Humulus japonicus Siebold & Zucc. (Japanese hop, ♀2n=16; ♂2n=17) has a haploid genome size of 1568 Mb in females and 1722 Mb in males [35]. Its sex chromosome system contains three different chromosomes: X, which is larger than Y1, and Y1, which is larger than Y2 [38]. Differences in the distribution of species-specific subtelomeric repeats in the sex chromosomes were revealed in a range of fluorescence in situ hybridization (FISH) experiments [36,38][39][40]. Thus, the Cannabaceae family is a good model for investigating plant sex chromosome evolution.
In this study, the genomes of C. sativa, H. lupulus and H. japonicus were analyzed, and the DNA repeat compositions were compared between species and between male and female samples. This research may become a starting point for further analysis of the sex-evolution model based on repeat accumulation in this family and other plant families.
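The distinction between the two superfamilies by domain order can be illustrated with a minimal Python sketch. The function name and the list-based representation of an element are assumptions made for illustration only; real annotation tools work on protein domain hits, not on ready-made lists.

```python
# Domain orders as given in the text for the two LTR superfamilies.
COPIA_ORDER = ["GAG", "PROT", "INT", "RT", "RH"]
GYPSY_ORDER = ["GAG", "PROT", "RT", "RH", "INT"]


def classify_superfamily(domains):
    """Classify an LTR element by the relative order of its RT and INT domains.

    `domains` is a hypothetical list of domain names in 5'->3' order.
    Returns 'Copia' if INT precedes RT, 'Gypsy' if RT precedes INT,
    and 'unknown' if either domain is missing.
    """
    if "RT" not in domains or "INT" not in domains:
        return "unknown"
    if domains.index("INT") < domains.index("RT"):
        return "Copia"
    return "Gypsy"


if __name__ == "__main__":
    print(classify_superfamily(COPIA_ORDER))
    print(classify_superfamily(GYPSY_ORDER))
```

Only the relative position of RT and INT is used here, because that is the feature the text names as distinguishing the two superfamilies.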

Sequencing and processing of data, search of the DNA repeats and their analysis
Sequencing was carried out following the standard Illumina protocol, with one run per plant (150 bp fragment length). The coverages were as follows: C. sativa, female: 1.5; C. sativa, male: 1.46. For preprocessing of the raw sequence data, the FASTA files were interlaced and sampled. The number of reads used for analysis was 1000000000 per sample. The sampled and interlaced input file was generated with a custom Python 3 script written by I. Kirov [41].
The preprocessed data were used to construct graph-based clusters with RepeatExplorer2 [42,43], using a custom database and strict parameters: at least 90% similarity over at least 54% of the read length. Cluster information was collected from these virtual graphs. The abundance of each cluster was calculated from the proportion of reads assigned to it.
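The sampling and interlacing step can be sketched as follows. This is not the script cited as [41]; it is a minimal illustration that assumes forward and reverse reads arrive as two FASTA streams in the same order. The function names and the fixed random seed are hypothetical.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def sample_and_interlace(forward, reverse, n_pairs, seed=0):
    """Randomly pick n_pairs read pairs and return them interlaced.

    `forward` and `reverse` are lists of (header, sequence) records in
    matching order. The output alternates R1, R2, R1, R2, ... so that
    mates stay adjacent.
    """
    pairs = list(zip(forward, reverse))
    if n_pairs > len(pairs):
        raise ValueError("requested more pairs than are available")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    interlaced = []
    for r1, r2 in rng.sample(pairs, n_pairs):
        interlaced.extend([r1, r2])
    return interlaced
```

Sampling whole pairs rather than single reads keeps each read next to its mate in the interlaced file.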
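The two quantities mentioned in this paragraph, the similarity criterion and the cluster abundance, can be written down as a short sketch. This is one reading of the stated thresholds, not RepeatExplorer2's internal code; the function names and the read-to-cluster dictionary are hypothetical.

```python
from collections import Counter


def passes_overlap_filter(identity_pct, overlap_len, read_len,
                          min_identity=90.0, min_fraction=0.54):
    """Return True if a read-to-read hit meets both thresholds from the text.

    Assumption: the 54% criterion is applied to the length of a single read.
    """
    return identity_pct >= min_identity and overlap_len >= min_fraction * read_len


def cluster_abundance(read_to_cluster, total_reads):
    """Return {cluster_id: percent of all analysed reads}.

    `read_to_cluster` maps read identifiers to cluster identifiers;
    reads that were not clustered are simply absent from the mapping.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    counts = Counter(read_to_cluster.values())
    return {cluster: 100.0 * n / total_reads for cluster, n in counts.items()}
```

Dividing by the total number of analysed reads, rather than by the number of clustered reads, makes the abundances comparable across samples with different repeat content.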

Results
3.1. The comparison of the DNA repeat combinations among C. sativa, H. lupulus and H. japonicus.
First, the proportion of DNA repeats in the genomes of the studied species was calculated as the ratio of reads containing repeats to the total number of reads. The three species differ in this indicator. In C. sativa, 80% of reads include repetitive DNA and 60% of reads are in the top clusters. In H. japonicus and H. lupulus, the proportion of reads with DNA repeats was 95% (69% of reads are in the top clusters) and 88% (62% of reads are in the top clusters), respectively.
Next, the proportions of the different DNA repeat groups were analyzed for each of the studied species (Figure 1). The analysis revealed that LTRs have the largest proportion in all cases (71.29%-89.31%). In H. japonicus, the proportions of the other DNA repeat groups were considerably lower (in the range of 1.24%-3.21%). In C. sativa, three groups had similar proportions (mobile_elements_Class_I, mobile_elements_Class_II, and unknown repeats), whereas two groups each accounted for about 10% (satellites and plastid repeats).
Since the majority of the reads were assigned to the LTR group, the next step was a comparative analysis of all LTR families in each species separately. Figure 2 demonstrates that the species differ greatly in the type of retrotransposons. The majority (almost half) of the reads for C. sativa belong to different families of the Ty1-copia type. In H. lupulus, the largest parts of the LTR compartment are Ty3-gypsy/OTA/Tat/Retand and Ty3-gypsy/Tekay. In H. japonicus, the majority of reads also belong to Ty3-gypsy families: Ty3-gypsy/OTA/Tat/Retand, Ty3-gypsy/Tekay and Ty3-gypsy/Athila. However, the proportions of these repeats differed significantly between the two species of Humulus.
In the next step of the analysis, the proportions of DNA repeats in the genomes of the studied species were calculated separately for male and female samples. The results are presented in Figure S2. No significant differences in repeat combinations and proportions between male and female samples were observed in any case.
A detailed analysis of LTRs was also conducted for male and female samples (Figure S3). The comparison did not reveal significant differences in Ty1-copia and Ty3-gypsy proportions between samples of different sexes.
To find significant differences between species as well as between male and female plants, an analysis of LTR families was carried out. The number of LTR families differs between species. Altogether, twelve families were observed in the Cannabaceae family: (1) six families were observed in all three species (Angela, Galadriel, Tekay, Athila, Retand, SIRE); (2) CRM and TAR were present in both species of the genus Humulus; (3) Ikeros was common to C. sativa and H. lupulus; (4) Bianca and Tork were present in C. sativa only; (5) the Ale family was present in H. lupulus only. Significant sex-specific differences were observed for C. sativa: Tekay, Retand and Ikeros repeats showed larger fractions in females and Angela in males (Figure 3A). For H. japonicus, the results were obtained from only two sequenced plants, and a high-quality comparison of the sex specificity of this species' LTR families needs more NGS data. However, it can be preliminarily concluded that Retand has a larger proportion in the female plant, and Athila, Tekay and SIRE in the male plant (Figure 3C). Unlike these two species, no sex-specific differences among LTR families were observed in H. lupulus (Figure 3B).
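The grouping of the twelve LTR families by the species in which they occur can be reproduced with simple set operations. The per-species sets below are derived from the five groups listed in the paragraph above; the function name is hypothetical and the sketch is only an illustration of the bookkeeping, not of the analysis pipeline.

```python
# LTR families per species, reconstructed from the five groups in the text.
COMMON = {"Angela", "Galadriel", "Tekay", "Athila", "Retand", "SIRE"}
families = {
    "C. sativa": COMMON | {"Ikeros", "Bianca", "Tork"},
    "H. lupulus": COMMON | {"CRM", "TAR", "Ikeros", "Ale"},
    "H. japonicus": COMMON | {"CRM", "TAR"},
}


def group_by_presence(families_by_species):
    """Group families by the exact set of species in which they occur.

    Returns a dict mapping a frozenset of species names to the set of
    families found in precisely those species.
    """
    groups = {}
    all_families = set().union(*families_by_species.values())
    for family in all_families:
        key = frozenset(
            species
            for species, present in families_by_species.items()
            if family in present
        )
        groups.setdefault(key, set()).add(family)
    return groups


if __name__ == "__main__":
    for species_set, fams in group_by_presence(families).items():
        print(sorted(species_set), sorted(fams))
```

Using the exact species set as the key separates families shared by all three species from those shared by only two, which is the distinction the five groups rely on.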

Discussion
The first significant result of this study is updated data on the genome structure of three Cannabaceae species: C. sativa, H. lupulus and H. japonicus. It was previously reported that the C. sativa genome contains 64%-65% repetitive sequences, of which the largest part was LTR repeats (about 50%), rDNA accounted for ~3% of the repeatome, and a large part (43%) of the repeatome consisted of low-complexity sequences [44]. Our results differ from the previous ones: we found that the genome contains 80% repetitive sequences. LTR elements make up 75% of the repeatome, and the rDNA amount is 3%-4%. According to Natsume et al. (2015), the H. lupulus genome contains ~60% repetitive sequences, about 93% of which were identified as LTRs [45]. Padgitt-Cobb et al. (2021) estimated the size of the repeatome as "above 70%", of which 76.5% were LTRs [46]. Our results suggest that 88% of the H. lupulus genome consists of repeats, which differs from the previous reports. At the same time, we found that LTRs make up 81% of the repeatome. This value lies between the values obtained by Natsume's and Padgitt-Cobb's groups. Since the H. japonicus genome was sequenced for the first time in this study, the genome component ratios were also obtained for the first time. The analysis revealed that the H. japonicus repeatome occupies a larger share of the genome (95%) than the C. sativa and H. lupulus repeatomes occupy in their genomes. The LTR proportion was also the highest among these three species (89.31%).
It was shown that the divergence of Humulus and Cannabis occurred around 27 million years ago [36]. The C. sativa genome is about three times smaller than that of H. lupulus, and the H. japonicus genome is about two times smaller than that of H. lupulus [35,47,48]. It was previously suggested that speciation occurred through whole-genome duplication [46]. However, the detailed analysis of LTRs suggests that it was achieved through the accumulation of Ty3-gypsy repeats. C. sativa has ~38% repeats of the Ty3-gypsy type, H. lupulus has ~68% and H. japonicus has ~90-92%. Such accumulation of one particular type of repeat contradicts the hypothesis of whole-genome duplication. Therefore, the proliferation of Ty3-gypsy elements might have caused speciation, as was shown for Capsicum spp. [49].
In the analysis of the LTR families among the three studied species, five groups were observed. The first group included the families present in all three species (Angela, Galadriel, Tekay, Athila, Retand, SIRE). A similar set of common LTR families is usually found in closely related species. For example, Borreda et al. (2019) reported a set of LTR families (Retrofit, Oryco, SIRE, Tork, CRM, Reina, Del, Galadriel, Athila, and Tat) that are common to Citrus spp. [50]. Members of this set such as SIRE, Galadriel, and Athila were also found to be common to Cannabaceae spp., which may be linked with their higher activity in comparison with the others. The second group of Cannabaceae LTR families included CRM and TAR, which were present in both species of the genus Humulus. Apparently, their activity started after the divergence of Humulus and Cannabis. Similar conclusions were drawn by Wawrzynski et al. (2008), who compared the activity of transposons after the divergence of Glycine max (L.) Merr. and G. tomentella Hayata [51]. The third group of Cannabaceae LTRs consists of one family (Ikeros), which is found in C. sativa and H. lupulus. The absence of Ikeros activity in the H. japonicus genome may be explained by an independent start of Ikeros activity in the C. sativa and H. lupulus genomes after the divergence of H. lupulus and H. japonicus (3.74-6.78 MYA) [36,52]. This result is consistent with the idea of the independent fate of TEs during genome differentiation, which was shown, for example, in analyses of genome evolution in Nicotiana and Capsicum spp. [53][54][55]. Finally, this idea can explain the fourth and fifth groups of Cannabaceae LTRs, which were specific to C. sativa (Bianca and Tork) and H. lupulus (Ale), respectively.
In the analysis of the LTR combinations in male and female samples, significant sex-specific differences were found in C. sativa. The Tekay, Retand and Ikeros families have larger fractions in females and Angela in males. Similar results were reported for Spinacia oleracea L. (with homomorphic sex chromosomes) and S. tetrandra Stev. (with heteromorphic sex chromosomes) [56]. Li et al. (2021) showed that the proportion of Ale and Retand was higher in the female genome than in the male genome, while the proportion of Angela, Bianca, CRM, Tekay, LINE, EnSpm_CACTA, and rDNA was higher in the male genome than in the female genome of S. oleracea. At the same time, Li's group indicated that in S. tetrandra the proportion of Angela, Athila, Ogre, CRM, Tekay, and EnSpm_CACTA was higher in the female genome than in the male genome, and the proportion of rDNA was higher in the male genome than in the female genome. Thus, C. sativa has a higher proportion of Retand in females, like females of S. oleracea, and of Tekay, like females of S. tetrandra and males of S. oleracea. In addition, the male genome of C. sativa shows a higher proportion of Angela, like males of S. oleracea and females of S. tetrandra. The Ikeros repeats were not described in the comparison of the male and female genomes of Spinacia spp., and their different proportions in males and females may be a feature of the C. sativa genome.
In general, all the differences noted above in the proportions of repeats between the male and female genomes of C. sativa, S. oleracea, and S. tetrandra can be associated with different concentrations of these repeats in the Y chromosome. Indeed, if the concentration of a repeat is lower in the Y chromosome than in the X chromosome, then the proportion of this repeat in the female genome will be higher than in the male, since the female genome contains two X chromosomes. Conversely, if the concentration of a repeat is higher in the Y chromosome than in the X chromosome, then the proportion of this repeat in the male genome with XY sex chromosomes will be higher.
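This reasoning can be checked with a toy calculation. The sketch below assumes a diploid genome made of two autosome sets plus either XX or XY, with each chromosome described by a size and a repeat density. All numbers used with it are hypothetical, and the function names are introduced for illustration only.

```python
def genome_proportion(chromosomes):
    """Return the overall repeat fraction of a genome.

    `chromosomes` is a list of (size_mb, repeat_density) tuples, where
    repeat_density is the fraction of that chromosome made up of the repeat.
    """
    total_size = sum(size for size, _ in chromosomes)
    repeat_size = sum(size * density for size, density in chromosomes)
    return repeat_size / total_size


def female_male_proportions(autosomes, x_chr, y_chr):
    """Return (female, male) repeat fractions for an XX/XY system.

    `autosomes` stands for one haploid autosome set; it is counted twice
    in both sexes. Females carry X + X, males carry X + Y.
    """
    female = genome_proportion([autosomes, autosomes, x_chr, x_chr])
    male = genome_proportion([autosomes, autosomes, x_chr, y_chr])
    return female, male


if __name__ == "__main__":
    autosomes = (700.0, 0.05)   # hypothetical size (Mb) and repeat density
    x_chr = (100.0, 0.10)       # hypothetical X chromosome
    y_depleted = (100.0, 0.04)  # repeat less concentrated on Y than on X
    y_enriched = (100.0, 0.20)  # repeat more concentrated on Y than on X
    print(female_male_proportions(autosomes, x_chr, y_depleted))
    print(female_male_proportions(autosomes, x_chr, y_enriched))
```

When X and Y are given the same size, as in this usage example, the sign of the female-male difference depends only on whether the repeat density is higher on X or on Y. With chromosomes of unequal size the autosomal density also enters the comparison, so the sketch should be read as an illustration of the argument rather than as a general proof.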
In summary, this study is a new step in the investigation of genome divergence and sex chromosome formation among dioecious species of the Cannabaceae family. The results can serve as a basis for further analysis of the sex-evolution model based on repeat accumulation in this family and other plant families.
Supplementary Materials: The following are available online at www.mdpi.com/xxx/s1, Figure S1: The DNA isolation protocol; Figure S2: The ratios of the DNA repeat types in the male and female genomes of C. sativa, H. lupulus and H. japonicus; Figure S3: The ratios of the LTR superfamilies in the male and female genomes of C. sativa, H. japonicus and H. lupulus.
Author Contributions: O.V.R. and G.I.K. conceived and designed the investigation; O.V.R. collected the plant material and isolated DNA; Y.V.B. processed sequencing data, searched for repeats, analyzed the results, and wrote the text of the paper; O.S.A. analyzed the results, formulated the discussion and wrote the text of the paper.
Funding: This work was financially supported by a grant from the Russian Foundation for Basic Research, agreement № 20-316-70018\19, dated 19.11.2019.