Whole chloroplast genome characterization and comparison of two sympatric species in genus Hippophae (Elaeagnaceae)

1 Key Laboratory of Tree Breeding and Cultivation, State Forestry Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing, China; wangluoyun14@163.com (L.W.), 18810576992@163.com (J.W.), hecy@caf.ac.cn (C.H.)
2 Collaborative Innovation Center of Sustainable Forestry in Southern China, Nanjing Forestry University, Nanjing, China
*Correspondence: zhangjg@caf.ac.cn (J.Z.), zengyf@caf.ac.cn (Y.Z.); Tel.: 86-10-62889601 (J.Z.), 86-10-62888786 (Y.Z.)

Abstract: Hippophae is a genus of woody plants with ecological, economic and social benefits. In this study, we assembled and annotated the chloroplast genomes of sympatric Hippophae gyantsensis and H. rhamnoides subsp. yunnanensis. Their full lengths are 155260 and 156415 bp, respectively. Each of them has 131 genes, comprising 85 protein-coding genes, 8 ribosomal RNA genes and 38 transfer RNA genes. After comparing the chloroplast genomes, we found 1302 base difference loci, of which 63.29% are located in intergenic regions or intron sequences and 36.71% are located in coding sequences. The SSC region has the highest mutation rate, followed by the LSC region; the IR regions have the lowest mutation rate. Among the protein-coding genes, three had a ratio of nonsynonymous to synonymous substitutions (Ka/Ks) >1 (but P values were non-significant) and 66 had Ka/Ks <1 (46 were significant). In general, the chloroplast protein-coding genes may be subject to purifying selection. Among the H. gyantsensis and H. rhamnoides subsp. yunnanensis chloroplast protein-coding genes, there are 20 and 16 optimal codons, respectively. Most of the optimal codons end with A or U, which indicates a significant AT preference. This study provides an important reference for studies on the general characteristics and evolution of the Hippophae chloroplast genome.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 26 October 2018 doi:10.20944/preprints201810.0638.v1


Introduction:
Chloroplasts are vital sites for green plants and algae to conduct energy conversion and photosynthesis. A chloroplast is a semiautonomous cellular organelle that has relatively independent genetic material, which is called chloroplast DNA. As chloroplast DNA is maternally inherited in most angiosperms [1] and has a relatively stable genetic structure, it has attracted broad attention from biologists.
Studies on chloroplast genomes have become increasingly important in recent years.
Most chloroplast genomes of plants have a circular structure that includes one large single copy (LSC) region, two inverted repeat (IR) regions and one small single copy (SSC) region. The full length of a chloroplast genome is 120-2500 kb, with 110-130 genes [2,3]. With the continuous development of molecular biology, especially the development of large-scale high-throughput sequencing techniques, research on chloroplast genomes is gradually deepening [4].
Hippophae plants belong to the family Elaeagnaceae. They are usually dioecious deciduous shrubs or small trees and are widespread in many countries in Asia and Europe. The root system of Hippophae plants can form root nodules, which fix nitrogen and can improve soil fertility. Hippophae plants also have strong environmental adaptability: they can tolerate low-temperature sandy environments as well as high-temperature, saline-alkaline, dry and humid environments. Thus, they can be utilized for ecological restoration and soil protection [5,6].
According to studies, in the Loess Plateau region, Hippophae forests can reduce direct surface runoff by 80%, water erosion of the surface soil by 75% and wind erosion by 85% [7]. Both Hippophae fruits and leaves contain multiple ingredients that have beneficial effects for humans, including cardiovascular protection, immunity augmentation, and inhibition of cancer and other tumors, so they are used as health food and medicine [8]. Hence, a wide range of research has been carried out on Hippophae plants. Currently, there are 7 recognized species and 11 recognized subspecies of Hippophae [9]. As the country with the most abundant Hippophae resources, China has 7 species and 5 subspecies of Hippophae. Many studies have already analyzed Hippophae interspecific differentiation and genetic diversity using traditional methods such as those involving amplified fragment length polymorphism (AFLP), simple sequence repeats (SSRs), internal transcribed spacers (ITSs) and chloroplast trnL-F and trnS-G sequences [10,11]. However, there have been very few studies conducted at the Hippophae chloroplast genome level [12], and there are no published studies comparing the chloroplast genomes of sympatric species of Hippophae. H. gyantsensis and H. rhamnoides subsp. yunnanensis are found on the Qinghai-Tibetan Plateau, which is an area that is sensitive to climate change [13]. Studies show that alpine environments may change the chloroplast microstructure of plants [14], so conducting research on the chloroplast genomes of plants in this area is of great importance.
In this study, we performed whole-genome shotgun sequencing of H. gyantsensis and H. rhamnoides subsp. yunnanensis using high-throughput sequencing, and then de novo assembled, annotated and systematically compared the complete chloroplast genomes of these two Hippophae taxa. From this we obtained some characteristics of the chloroplast genomes of the genus Hippophae. The results are expected to provide an important reference for future studies on the chloroplast genomes of Hippophae plants and even of angiosperms in plateau regions.

Basic characteristics of the chloroplast genomes
The full length of the H. gyantsensis chloroplast genome is 155260 bp, which is a little shorter than that of the H. rhamnoides subsp. yunnanensis chloroplast genome (156415 bp). In H. gyantsensis, the LSC region is 83026 bp, the SSC region is 18894 bp and each IR region is 26670 bp. The lengths of the H. rhamnoides subsp. yunnanensis LSC and SSC regions are 84078 bp and 19047 bp, respectively, which are both larger than those of H. gyantsensis.
However, the H. rhamnoides subsp. yunnanensis IR region (26648 bp) is shorter than that of H. gyantsensis. The GC content of the whole genome and of each region is similar in the two Hippophae species, and the IR region has the highest GC content. The coding regions of the two Hippophae chloroplast genomes (not including introns) differ in length by only 5 bp (Table 1).
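The region lengths reported above can be checked against the quadripartite structure (one LSC, one SSC and two IR copies). The following is a minimal sketch in Python; the function names `gc_content` and `quadripartite_length` are illustrative and are not part of any tool used in this study. Only the H. gyantsensis lengths quoted in the text are used.

```python
def gc_content(seq):
    """Return GC content (%) computed over unambiguous A/C/G/T bases only."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def quadripartite_length(lsc, ssc, ir):
    """Total genome length for one LSC, one SSC and two IR copies (bp)."""
    return lsc + ssc + 2 * ir


# Region lengths for H. gyantsensis as reported in the text (bp).
H_GYANTSENSIS_TOTAL = quadripartite_length(lsc=83026, ssc=18894, ir=26670)
print(H_GYANTSENSIS_TOTAL)  # 155260, matching the reported full length
```

The same `gc_content` helper could be applied separately to the LSC, SSC and IR sequences of an assembled genome to compare regions.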
Both of the chloroplast genomes have 131 functional genes, comprising 85 protein-coding genes, 38 transfer RNA (tRNA) genes and 8 ribosomal RNA (rRNA) genes (Table 1). Their specific classifications are shown in Table 2. Among them, 4 rRNA genes, 8 tRNA genes and 7 protein-coding genes are located in the IR regions (with two copies); 13 genes are located in the SSC region; and the rest are located in the LSC region (Figure 1). Notably, rps12 is a trans-splicing gene: its 5' end is located in the LSC region and there is one copy of the 3' end in each IR region. The two ycf1 genes have different lengths; the shorter segment is located in only one IR region, while the longer segment spans the other IR region and the SSC region.
There are 22 genes with introns; the clpP, rps12 and ycf3 genes have two introns and the rest have one intron. The lengths of introns in the two Hippophae chloroplast genomes are similar. The ndhA gene has the intron with the largest difference in length between the two species: the intron of H. gyantsensis is 20 bp longer than that of H. rhamnoides subsp. yunnanensis. Among all of the introns, the trnK-UUU gene has the longest intron; its length is 2485 and 2497 bp in H. gyantsensis and H. rhamnoides subsp. yunnanensis, respectively (Table 3).
The mononucleotide SSR loci of both Hippophae species are A/T repeats. All the dinucleotide SSR loci of the two species are AT repeats, except for one CT repeat locus in H. gyantsensis. The SSR loci of the two chloroplast genomes are mainly located in the non-coding regions. H. rhamnoides subsp. yunnanensis has 13 SSR loci and H. gyantsensis has 9 SSR loci located in the coding sequences; 6 of these occur at the same locations.

The coding regions of the two species are 92455 and 92450 bp long, and base difference loci amount to 5.17%.
According to the results from the sliding-window analysis of the complete chloroplast genomes, the largest difference between the two Hippophae species occurs in the SSC region, followed by the LSC region, and the IR regions have the smallest difference (Figure 2a). The coding regions exhibited similar results: the longer ycf1 segment located in the SSC region has the largest number of base difference loci per unit length, followed by the ndhF gene in the SSC region; the matK, rps16, rpoC2, ndhK and ndhC genes in the LSC region have a relatively high number of base difference loci per unit length (Pi value ≥0.15) (Figure 2b). The differences in the IR regions are all small (Pi values <0.05). Comparing Figure 2a with 2b, the locations with large base differences are usually in the intergenic regions.
Fig 2 legend: the X axis is the base position and the Y axis is the nucleotide polymorphism (Pi). a: base differences across the complete chloroplast genomes; b: base differences in the CDS regions of genes (not including introns).

The sequences of 19 protein-coding genes are completely the same between the two species, comprising 9 self-replication-related genes and 10 photosynthesis-related genes. There are 66 protein-coding genes with Ka/Ks <1, and 46 of them have Ka/Ks <<1 (P value ≤0.05).
After excluding genes with only 1 or 2 Ks or with only 1 Ka, there are 25 genes left, comprising 10 protein-coding genes that are related to photosynthesis, 11 protein-coding genes that are related to self-replication and 4 genes with other functions. Ka/Ks is >1 for the matK, rpsl and rpoA genes, though the statistic is non-significant (P value >0.05).

Comparison of the optimal codons of the chloroplast genomes
Based on the 53 genes of each of the two Hippophae species, and taking codons with RSCU >1 and ΔRSCU >0.08 to be the optimal codons, there are 20 optimal codons in the H. gyantsensis chloroplast genome and 16 in the H. rhamnoides subsp. yunnanensis chloroplast genome. Of these, 15 optimal codons are the same for the two species (Table 5). Regarding the amino acids encoded by the codons, besides tryptophan (which has only one codon and no codon usage preference), only aspartic acid and cysteine in H. gyantsensis, and phenylalanine, aspartic acid, cysteine and glycine in H. rhamnoides subsp. yunnanensis, have no optimal codons. Of all the optimal codons for the two species, only UUG (coding for leucine) ends with G; the rest all end with A or U, which indicates a significant AT preference.

Complete sequence cluster analysis of chloroplast genomes
We used the Bayesian, ML and MP methods to conduct a cluster analysis of the chloroplast genomes, with Z. jujuba as the outgroup. Elaeagnaceae formed a single branch. Furthermore, Elaeagnus macrophylla Thunb. and Elaeagnus mollis formed a branch with 100% support. H. rhamnoides subsp. yunnanensis first formed a small branch with H. rhamnoides, and then with H. gyantsensis (Figure 3). This shows that although the two Hippophae species are located in the same geographical region, their chloroplast genomes exhibit many interspecies differences.

Basic characteristics of the complete Hippophae chloroplast genomes
The full lengths of the chloroplast genomes of the two Hippophae species in this study and the published H. rhamnoides chloroplast genomes [12] are between 155260 … gyantsensis [18], so the chloroplast genomes of these two species represent the characteristics of Hippophae chloroplast genomes to a large degree.
Regarding the two Hippophae chloroplast genomes, the intergenic regions have the largest length difference, followed by the intron regions, while the length difference of the coding regions is only 5 bp. The total length difference is mainly caused by the LSC region, then the SSC region; the IR regions have the smallest length difference. SSR loci are mainly located in the intergenic regions and consist of AT repeats, probably because the GC content of the Hippophae chloroplast genomes is only 37%. The intron analysis indicated that the trnK-UUU gene has the longest intron, which is similar to results for many other species [19].
We found that the largest base difference between the H. rhamnoides subsp. yunnanensis and H. gyantsensis chloroplast genomes is in the SSC region, followed by the LSC region; the IR regions have the smallest base difference. The GC content of the SSC region is 30%, while it is 35% in the LSC region and 42% in the IR regions. This shows that base differences across the genome regions are negatively correlated with GC content. A study by Zheng et al. [20] shows that for angiosperm chloroplast genomes, the IR region has a high GC content because the rRNA genes in this region have a high GC content, while the low GC content of the SSC region is related to the NADH genes in this region.
The IR regions of most angiosperm species are highly conserved and have a relatively high GC content. This may be related to the presence of two copies.
When mutations occur, the IR regions of angiosperm chloroplast genomes can adjust via transposition in order to decrease the mutation ratio [21]. Additionally, we found that within the IR region, the coding region has a lower mutation rate than the noncoding region. There are two possible explanations for this. Firstly, genes in the coding region of the IR region may experience strong purifying selection together with a fast evolutionary rate; most of the mutations cannot be retained and are directly eliminated, so there is a lower base difference than in the noncoding region. Secondly, mutations in the coding region of the IR region may be influenced only by random drift, with a base mutation rate lower than that of the noncoding region, so there are fewer base difference loci than in the noncoding region. We found that purifying selection is not significant for protein-coding genes in the IR region, which supports the second explanation, namely that genes in the coding region of the IR region have a slower evolutionary rate.

Characteristics of Hippophae protein-coding genes
Among the protein-coding genes of the two Hippophae chloroplast genomes, the gene that has the largest base difference is the longer ycf1 segment. This gene is widespread in plant chloroplast genomes, but has been lost in herbaceous plants and cranberry plants [22]. Dong et al. [23] found that among 420 species of woody plants, 357 can be identified by the long ycf1 segment, which is better than using matK and rbcL.
The long ycf1 segment is followed in effectiveness by the ndhF gene (which encodes NADH dehydrogenase subunit F) due to its high mutation ratio; many studies have used this gene to analyze interspecies genetic diversity in plants [24,25]. Additionally, we found that in Hippophae, the ccsA gene also has a relatively high mutation rate, which conflicts with a former conclusion that the ccsA gene is widely conserved in photosynthetic plants [2]. The three genes above are completely or largely located in the SSC region. Additionally, the matK, rps16, rpoC2, ndhC and ndhK genes in the LSC region also have relatively high mutation rates.
All the abovementioned genes with high mutation rates are chloroplast self-replication-related genes or other functional genes, while the photosynthesis-related genes all have lower mutation rates because they need to be relatively strongly conserved to maintain photosynthesis as the main function of chloroplasts [2].
The Ka/Ks results show that, in general, chloroplast protein-coding genes may be subject to purifying selection. Among the 25 genes showing significant purifying selection, 17 are located in the LSC region and 8 in the SSC region. As the number of protein-coding genes in the LSC region (60) is far greater than that in the SSC region (12), the proportion of genes under significant purifying selection is higher in the SSC region (8 of 12) than in the LSC region (17 of 60). This illustrates the main reason why the SSC region has a larger base difference: compared with other chloroplast genome regions, the SSC region has the highest base mutation rate. Its significant purifying selection can ensure that the normal functions of related genes are not disrupted.

Optimal codons
Codon usage preference analysis is an important method to explore the genetic and evolutionary pathways of species [26]. The use of optimal codons is an important representation of codon preference. Research has shown that directional mutation and natural selection are the two main factors that influence codon usage preference [27].
There are 15 optimal codons shared by the H. rhamnoides subsp. yunnanensis and H. gyantsensis chloroplast genomes, which reflects the high commonality of codons in Hippophae plants. Except for one codon that ends with G, the rest all end with A or U, which reflects the significant AT preference. This is in accordance with many other analytical results on optimal codons in angiosperms [28,29]. The utilization of certain optimal codons is matched with the nucleotide content of the chloroplast genome intergenic regions [29]. Species with high AT content in the intergenic regions also have optimal codons with high AT content. According to our calculations, the AT content in the intergenic regions of both Hippophae chloroplast genomes is 67.4%. Usually, the bases in the intergenic regions are not affected by selection, so the high AT content in the intergenic regions probably occurs because the bases in these regions are more likely to mutate into A/T. This also indicates that directional mutation may be

Assembling complete chloroplast genomes
After shotgun sequencing, we used NOVOPlasty 2.7.0 to assemble the complete chloroplast genomes [30]. We downloaded the H. gyantsensis trnL-F sequence (KU304417) and the H. rhamnoides subsp. yunnanensis trnL-F sequence (KU304405) from the National Center for Biotechnology Information (NCBI) website and set them as the seed sequences. We then set K-mer to 39, type to chloro and genome range to 120000-200000, provided the locations of the forward and reverse read files and left the other parameters at their default values.
To ensure assembly accuracy, after comparing the complete assembled chloroplast genome sequences of H. rhamnoides subsp. yunnanensis and H. gyantsensis, we designed primers for suspicious areas that had large differences, and used the total DNA of the two species as the templates to conduct PCR amplification.
The PCR amplification products were sequenced by Beijing Tsingke Biological Technology Inc., and a comparative analysis was then conducted using these sequences and the assembled chloroplast genome sequences.
… [31]. When screening for base differences between the chloroplast genomes, we used the sliding-window method for the calculations, with a window length of 600 bp and a step size of 200 bp.
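The sliding-window calculation described above can be sketched as follows. This is a minimal illustration in Python, not the DnaSP implementation: it assumes two pre-aligned sequences of equal length, treats Pi for a pair of sequences as the proportion of differing sites, and skips alignment columns that contain a gap ("-"). The function name `sliding_window_pi` is hypothetical.

```python
def sliding_window_pi(seq_a, seq_b, window=600, step=200):
    """Return (window midpoint, Pi) pairs for two aligned sequences.

    For two sequences, Pi in a window is taken here as the number of
    differing sites divided by the number of gap-free sites compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        compared = 0
        diffs = 0
        for a, b in zip(seq_a[start:start + window], seq_b[start:start + window]):
            if a == "-" or b == "-":
                continue  # assumption: gapped columns are ignored
            compared += 1
            if a != b:
                diffs += 1
        pi = diffs / compared if compared else 0.0
        results.append((start + window // 2, pi))
    return results


# Toy alignment: 1000 bp with 6 substitutions inside the first window only.
toy_a = "A" * 1000
toy_b = "A" * 100 + "C" * 6 + "A" * 894
toy_result = sliding_window_pi(toy_a, toy_b)
```

With the default 600 bp window and 200 bp step, the toy alignment yields three windows, and only the first contains differences (6 of 600 sites).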

SSR loci analysis
SSR loci in the two chloroplast genomes were identified using MISA software. The minimum numbers of repeats for mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide units were 10, 5, 4, 3, 3 and 3, respectively [32]; the smallest distance between two SSR loci was set to 100 bp.
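The repeat thresholds above can be illustrated with a short regular-expression search. This is a simplified sketch in Python and not the MISA program itself: it reports perfect repeats only, skips motifs that are themselves repeats of a shorter unit (for example "AA" or "ATAT"), and does not implement the 100 bp minimum-distance rule. The names `MIN_REPEATS`, `is_primitive` and `find_ssrs` are illustrative.

```python
import re

# Minimum repeat counts per motif length, as stated in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(seq):
    """Return sorted (1-based start, motif, repeat count) for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(2)
            if is_primitive(motif):
                repeats = len(match.group(1)) // unit_len
                hits.append((match.start() + 1, motif, repeats))
    return sorted(hits)


toy_seq = "GC" + "A" * 12 + "GCGC" + "AT" * 6 + "CCG"
toy_ssrs = find_ssrs(toy_seq)
print(toy_ssrs)  # [(3, 'A', 12), (19, 'AT', 6)]
```

In the toy sequence, the 12-base A run passes the mononucleotide threshold of 10 and the six AT units pass the dinucleotide threshold of 5, while the two GC units do not.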

Nonsynonymous substitution (Ka)/synonymous substitution (Ks) analysis
We aligned the coding regions of the 85 protein-coding genes of H. rhamnoides subsp. yunnanensis and H. gyantsensis and calculated Ka and Ks using the NG method in KaKs_Calculator 2.0 software [33] with default parameter values. Thus, we obtained Ka/Ks results for the two Hippophae species.
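As a rough illustration of what synonymous and nonsynonymous differences mean, the sketch below classifies codon differences between two aligned coding sequences using the standard genetic code. It is deliberately much simpler than the NG method in KaKs_Calculator: it only counts differing codons and does not estimate the numbers of synonymous and nonsynonymous sites, correct for multiple substitutions, or test significance. The function name `count_codon_differences` is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_codon_differences(cds_a, cds_b):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Both sequences must be aligned in frame and have the same length.
    Codons containing gaps or ambiguous bases are skipped.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be in-frame and of equal length")
    syn = 0
    nonsyn = 0
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# GAT -> GAC keeps aspartic acid; TTA -> TTT changes leucine to phenylalanine.
toy_counts = count_codon_differences("ATGGATTTA", "ATGGACTTT")
print(toy_counts)  # (1, 1)
```

A full Ka/Ks estimate additionally divides these counts by the numbers of synonymous and nonsynonymous sites, which is why the ratio of raw counts is not the same as Ka/Ks.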

Optimal codon analysis
To ensure accuracy in the optimal codon analysis, we first removed the repetitive sequences among the protein-coding genes, then selected the gene sequences that have ATG as the start codon, have TAA, TAG or TGA as the stop codon, and are >300 bp long. There were 53 gene sequences for each of the two Hippophae species [34]. We used the effective number of codons (ENC) of each gene as the criterion to order the selected gene sequences, then selected the 5 genes from each end of the resulting ordering to construct high- and low-expression gene pools.
We utilized CodonW to obtain the relative synonymous codon usage (RSCU) value for the 53 genes and the ΔRSCU value between the high- and low-expression gene pools. Finally, we selected the codons with RSCU >1 and ΔRSCU >0.08 as the optimal codons for the Hippophae chloroplast genome [35].
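The selection rule above can be sketched as follows. This is a minimal Python illustration rather than the CodonW implementation: RSCU for a codon is its observed count divided by the mean count of all synonymous codons for the same amino acid, and ΔRSCU is assumed here to be the RSCU in the high-expression pool minus the RSCU in the low-expression pool. The names `rscu` and `optimal_codons` are illustrative, and stop codons are excluded.

```python
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# Synonymous codon families, excluding stop codons.
FAMILIES = {}
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        FAMILIES.setdefault(_aa, []).append(_codon)


def rscu(genes):
    """Return {codon: RSCU} pooled over a list of in-frame CDS strings."""
    counts = Counter()
    for gene in genes:
        gene = gene.upper()
        for i in range(0, len(gene) - 2, 3):
            codon = gene[i:i + 3]
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    values = {}
    for family in FAMILIES.values():
        total = sum(counts[c] for c in family)
        for c in family:
            values[c] = counts[c] * len(family) / total if total else 0.0
    return values


def optimal_codons(all_genes, high_pool, low_pool, min_rscu=1.0, min_delta=0.08):
    """Codons with RSCU > min_rscu overall and delta RSCU > min_delta."""
    overall = rscu(all_genes)
    high = rscu(high_pool)
    low = rscu(low_pool)
    return sorted(
        c for c in overall
        if overall[c] > min_rscu and high[c] - low[c] > min_delta
    )


toy_all = ["ATGGATGATGATGACTAA"]
toy_optimal = optimal_codons(toy_all, ["GATGATGATGAT"], ["GATGACGATGAC"])
print(toy_optimal)  # ['GAT']
```

Under this definition a single-codon amino acid such as tryptophan always has RSCU equal to 1 when present, so it can never pass the RSCU >1 threshold.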

Cluster analysis of chloroplast genomes
We searched for and downloaded the chloroplast genomes of Ziziphus jujuba (NC_030299), Elaeagnus macrophylla (NC_028066), Elaeagnus mollis (NC_036932) and Hippophae rhamnoides (NC_035548.1), and analyzed them along with the two chloroplast genomes that we assembled in this study; Z. jujuba was used as the outgroup. We aligned the genomes using BioEdit software. We utilized the maximum likelihood (ML) method, the maximum parsimony (MP) method with 1000 bootstrap replications and the Bayesian method (mcmc ngen = 1000, sumt burnin = 2500) in MEGA7.0 software [36] to construct a phylogenetic tree for cluster analysis.

Conclusion
Through comparison and analysis of the chloroplast genome sequences of H. rhamnoides subsp. yunnanensis and H. gyantsensis, we found that the length of the Hippophae chloroplast genome is in accordance with that of most model plants and that the genome has the typical four-section structure of angiosperm chloroplast genomes. The evolutionary rates of the Hippophae chloroplast genome regions are different, and the rate is negatively correlated with GC content. The SSC region has the highest evolutionary rate, with the highest mutation rate and the lowest GC content. The IR region has the lowest evolutionary rate, with the lowest mutation rate and the highest GC content. All index values for the LSC region lie between those of the other two regions. The self-replication-related genes and other functional genes of chloroplasts have relatively higher mutation rates, while the photosynthesis-related genes are more conserved.
Protein-coding genes in the SSC region are significantly affected by purification selection to ensure normal operation of related protein-coding genes under the high mutation rates in the SSC region.These conclusions provide an important reference

Fig 1 Physical map of the chloroplast genomes of H. gyantsensis and H. rhamnoides subsp. yunnanensis

Fig 3 Cluster analysis of the chloroplast genomes

Table 1 Comparison of the complete chloroplast genomes of Hippophae gyantsensis and H. rhamnoides subsp. yunnanensis
C-type cytochrome synthesis ccsA

Table 3 Characterization of introns in the H. gyantsensis and H. rhamnoides subsp. yunnanensis chloroplast genomes

Base differences between chloroplast genomes
After comparing the complete chloroplast genomes of the two Hippophae species, we found 288 insertion/deletion sites, 20 of which are >50 bp and mainly located in the LSC region. There are also 1302 base difference loci, 478 of which are in the coding regions (36.71%), while 824 are in the intergenic regions or introns (63.29%).

Fig 2 Sliding-window analysis of the chloroplast genomes of H. gyantsensis and H. rhamnoides subsp. yunnanensis

Ka/Ks results of protein-coding genes in chloroplast genomes
Selection pressure on a gene can be identified by calculating the ratio of nonsynonymous substitutions (Ka) to synonymous substitutions (Ks) in the codons of protein-coding genes. According to the Ka/Ks analysis of the protein-coding genes of the two Hippophae species, there are 472 substitution loci in total; 264 of them are Ks and 207 of them are Ka. Ka/Ks is 0.230 (<<1).

Table 4 Ka/Ks analysis of the CDS regions of the H. gyantsensis and H. rhamnoides subsp. yunnanensis chloroplast genomes

Table 5 Comparison of the optimal codons of the H. gyantsensis and H. rhamnoides subsp. yunnanensis chloroplast genomes
*: the optimal codon for one of the species.

… which are close to those of the model plant Arabidopsis thaliana, plants of the family Rosaceae (such as apple and pear) and plants of the family Salicaceae (such as Populus trichocarpa and Cathay poplar) [15-17]. Their genomes have the typical four-section structure of angiosperm chloroplast genomes [15-17]. The cluster analysis of the complete chloroplast genomes shows that H. rhamnoides subsp. yunnanensis formed a branch with H. rhamnoides with 100% support; it is more distantly related to the sympatric H. gyantsensis. Also, there is a complex lineage sorting process between H. rhamnoides subsp. yunnanensis and H. …

… the main factor for codon usage preference. Of course, natural selection may also have large effects, which needs to be studied further.

We collected leaves from one wild individual each of H. gyantsensis and H. rhamnoides subsp. yunnanensis in Gongbo'gyamda County, Linzhi City, Tibet Autonomous Region (altitude: 3600 m), in August 2017. We dried them with silica gel desiccant and then brought them back to the lab. For each species, an appropriate amount of dry leaves was selected and the cetyltrimethylammonium bromide (CTAB) method was used to extract the total DNA. Agarose gel electrophoresis was used to assess the quality of the DNA extraction. Then we sent the total DNA to Novogene Company for whole-genome shotgun sequencing, which involved HiSeq PE150 and a 350-bp DNA library. Finally, a 35 G sequence data set was obtained for each of the Hippophae species.

We used DOGMA (http://dogma.ccbb.utexas.edu/) to annotate the assembled chloroplast genomes, corrected the annotated results using Sequin 15.10 software, submitted both annotations to the NCBI database and obtained a serial number for …

Comparative analysis of the two Hippophae chloroplast genomes
DnaSP 5.0 was utilized to count the sequence polymorphisms of the H. …