The complete chloroplast genome sequences of four plant species, their SSR identification and phylogenetic analysis

Abstract: The chloroplast genome is conserved and stable, and can therefore be employed to resolve genotypes. Currently, published nuclear sequences and molecular markers fail to robustly differentiate several species from related taxa, including Machilus leptophylla, Hanceola exserta, Rubus bambusarum, and Rubus henryi. In this study, the chloroplast genomes of these four species were characterized, and their simple sequence repeats (SSRs) and phylogenetic positions were analyzed. The four chloroplast genomes were 152.624 kb, 153.296 kb, 156.309 kb, and 158.953 kb in length and contained 124, 130, 129, and 131 genes, respectively. Moreover, each chloroplast genome contained the four typical regions. Six classes of SSR were identified in the four chloroplast genomes, of which mononucleotide was the class with the most members. The repeat types varied within individual SSR classes. Phylogenetic trees indicated that M. leptophylla clustered with M. yunnanensis, and H. exserta was placed within the tribe Ocimeae. Additionally, R. bambusarum and R. henryi clustered together, although they did not belong to the same species given their differing SSR features. This research provides evidence for resolving these species and contributes new genetic information for further study.


Introduction
Machilus leptophylla is an evergreen broad-leaved tree in the family Lauraceae, mainly distributed in Zhejiang, Jiangxi, Hunan, Fujian, and other regions of China [1]. Because of its fast growth, attractive appearance, and high-quality wood, M. leptophylla has attracted increasing attention from commercial markets and researchers. The genus Machilus includes nearly 100 species distributed in tropical and subtropical East and South Asia [2]. The reported nuclear sequences and genomic markers have failed to resolve species within the genus [3]. To date, nine species of the genus Hanceola, distributed in southern China, have been identified based on morphological features [4]. However, whereas most species of Hanceola are perennial herbs, the newly discovered H. suffruticosa is woody with robust stems. Hence, it is challenging to identify the species of Hanceola by morphology alone. No nuclear sequences or chloroplast genomic markers have been reported for this genus at the species level.

Sequencing profiles and quality control
The morphology of M. leptophylla, H. exserta, R. bambusarum, and R. henryi is shown in Figure 1. Correct identification of the species ensured that appropriate samples were used for sequencing. The leaves of M. leptophylla have separate petioles, and several leaves share a common node. H. exserta has soft leaves with complex serration. R. bambusarum and R. henryi have usually been considered the same species owing to their highly similar plant morphology. There were nevertheless certain differences in leaf shape, even though both species exhibited okra-type leaves. The three-lobed leaves of R. bambusarum were much longer and narrower than those of R. henryi.

In total, 13,128,417, 11,399,665, 11,851,040, and 10,197,459 raw reads were generated from M. leptophylla, H. exserta, R. bambusarum, and R. henryi, respectively. After data filtering, 12,876,579, 11,348,725, 11,696,892, and 10,062,246 clean reads were obtained, respectively. For M. leptophylla, 3.94 G raw bases and 3.86 G clean bases were obtained, with an effective rate of 98.08%. Likewise, 3.40, 3.51, and 3.02 G clean bases were produced from H. exserta, R. bambusarum, and R. henryi, with effective rates of 99.55%, 98.70%, and 98.67%, respectively. Overall, the sequencing error rate for all four species was 0.03%, which is quite low. In detail, the Q20 and Q30 values for the four species were at least 97.78% and 93.65%, respectively, which indicates that the sequencing quality was high. The GC content of the four species ranged from 40.02% to 40.81%, which indicates that the base composition of the sequencing data was appropriate.
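The effective rates quoted above are consistent with the ratio of clean reads to raw reads. The following minimal sketch reproduces them from the read counts in the text; the function name `effective_rate` and the assumption that the rate is defined as clean reads divided by raw reads are ours, not stated by the authors.

```python
# Raw and clean read counts as reported in the text above.
READS = {
    "M. leptophylla": (13_128_417, 12_876_579),
    "H. exserta": (11_399_665, 11_348_725),
    "R. bambusarum": (11_851_040, 11_696_892),
    "R. henryi": (10_197_459, 10_062_246),
}


def effective_rate(raw: int, clean: int) -> float:
    """Percentage of raw reads kept after filtering (assumed definition)."""
    return 100.0 * clean / raw


# Effective rate per species, rounded to two decimals as in the text.
rates = {sp: round(effective_rate(raw, clean), 2) for sp, (raw, clean) in READS.items()}

for species, rate in rates.items():
    print(f"{species}: {rate:.2f}%")
```

Under this assumed definition the computed values agree with the 98.08%, 99.55%, 98.70%, and 98.67% reported for the four species.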

Figure 1 Profiles of the four species. Machilus leptophylla, Hanceola exserta, Rubus bambusarum, and Rubus henryi are shown in panels (A), (B), (C), and (D), respectively. The small inset at the top right of each panel shows the corresponding leaf morphology. Note: The quality scores are logarithmically linked to error probabilities; Q20 and Q30 denote base-call accuracies of 99% and 99.9%, respectively.
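The logarithmic link between quality scores and error probabilities mentioned in the note follows the standard Phred convention, Q = -10 log10(P). The sketch below illustrates the conversion in both directions; the helper names are hypothetical and are not taken from the sequencing pipeline used in this study.

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability P for a Phred quality score Q, with Q = -10 * log10(P)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality score Q for an error probability P."""
    return -10 * math.log10(p)


# Q20 and Q30 correspond to base-call accuracies of 99% and 99.9%.
for q in (20, 30):
    accuracy = 1 - phred_to_error(q)
    print(f"Q{q}: accuracy = {accuracy:.3%}")
```

By the same relation, the reported error rate of 0.03% corresponds to an average quality score of roughly Q35.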

Assembly and annotation of four species
The circular chloroplast genomes of M. leptophylla, H. exserta, R. bambusarum, and R. henryi were successfully assembled and annotated (Figure 2, Figure 3 (B)).
In total, 124, 130, 129, and 131 genes were characterized in the chloroplast genomes of M. leptophylla, H. exserta, R. bambusarum, and R. henryi, respectively. In M. leptophylla, the numbers of protein-coding, tRNA, and rRNA genes were 80, 36, and 8, respectively. The chloroplast genomes of H. exserta, R. bambusarum, and R. henryi contained 85, 84, and 84 protein-coding genes, respectively. Correspondingly, the numbers of tRNA genes in these three species were 37, 37, and 39, whereas each species had 8 rRNA genes. The numbers of genes containing introns in the four species were 22, 23, 22, and 22, respectively. Moreover, all four genomes possessed two genes with more than two introns (Figure 4).
Note: the genome structures include LSC, SSC, IRa, and IRb. LSC, SSC, IRa, IRb, Ml, He, Rb, and Rh stand for large single copy, small single copy, inverted repeat a, inverted repeat b, Machilus leptophylla, Hanceola exserta, Rubus bambusarum, and Rubus henryi, respectively.
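As a quick consistency check, the per-class gene counts given above can be summed and compared with the reported totals. This is only a sketch over the figures quoted in the text; the dictionary layout and the label "CDS" for coding-sequence genes are illustrative choices.

```python
# Gene counts per class, as quoted in the text above.
GENES = {
    "M. leptophylla": {"CDS": 80, "tRNA": 36, "rRNA": 8},
    "H. exserta": {"CDS": 85, "tRNA": 37, "rRNA": 8},
    "R. bambusarum": {"CDS": 84, "tRNA": 37, "rRNA": 8},
    "R. henryi": {"CDS": 84, "tRNA": 39, "rRNA": 8},
}

# Total gene numbers reported for each chloroplast genome.
REPORTED_TOTAL = {
    "M. leptophylla": 124,
    "H. exserta": 130,
    "R. bambusarum": 129,
    "R. henryi": 131,
}

gene_totals = {species: sum(counts.values()) for species, counts in GENES.items()}

for species, total in gene_totals.items():
    status = "matches" if total == REPORTED_TOTAL[species] else "differs from"
    print(f"{species}: {total} genes ({status} the reported total)")
```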

Identification of simple sequence repeats
A total of 82 SSRs were identified in the chloroplast genome of M. leptophylla (Table 2). Fewer SSRs were discovered in H. exserta, R. bambusarum, and R. henryi, with 56, 58, and 62, respectively. In all four chloroplast genomes, the number of SSR-containing sequences was one, and the number of sequences containing more than one SSR was also one. Additionally, the number of SSRs present in compound formation in the M. leptophylla chloroplast genome was 8, whereas those in H. exserta, R. bambusarum, and R. henryi were 6, 6, and 7, respectively. Six different repeat classes were identified across the species. In the chloroplast genome of M. leptophylla, the counts of mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide SSRs were 55, 12, 3, 9, 2, and 1, respectively. In the chloroplast genome of H. exserta, the corresponding counts were 35, 7, 2, 11, 0, and 1. In R. bambusarum and R. henryi, 42 and 46 mononucleotide SSRs were identified, respectively. Correspondingly, 8 and 9 dinucleotide SSRs were discovered, and the numbers of trinucleotide SSRs were 2 and 1. Additionally, six tetranucleotide SSRs were found in each of the chloroplast genomes of R. bambusarum and R. henryi. However, no pentanucleotide or hexanucleotide SSRs were identified in these two chloroplast genomes. Overall, mononucleotide SSRs were far more numerous than any other SSR class in all four species.
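The class tallies above can be checked against the reported SSR totals, and the share of mononucleotide SSRs can be derived from them. The sketch below uses only the counts quoted in the text; the variable names are illustrative.

```python
CLASSES = ("mono", "di", "tri", "tetra", "penta", "hexa")

# SSR counts per class, in the order of CLASSES, as quoted in the text above.
SSR_COUNTS = {
    "M. leptophylla": (55, 12, 3, 9, 2, 1),
    "H. exserta": (35, 7, 2, 11, 0, 1),
    "R. bambusarum": (42, 8, 2, 6, 0, 0),
    "R. henryi": (46, 9, 1, 6, 0, 0),
}

ssr_totals = {species: sum(counts) for species, counts in SSR_COUNTS.items()}

# Fraction of all SSRs in each genome that are mononucleotide repeats.
mono_share = {species: counts[0] / sum(counts) for species, counts in SSR_COUNTS.items()}

for species in SSR_COUNTS:
    print(f"{species}: {ssr_totals[species]} SSRs, "
          f"{mono_share[species]:.1%} mononucleotide")
```

The sums agree with the totals of 82, 56, 58, and 62 SSRs, and mononucleotide repeats account for more than 60% of the SSRs in every genome.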
The repeat types varied within the same SSR class (Figure 5). Among mononucleotide SSRs, type A/T was far more numerous than type C/G. SSRs of 10 to 17 consecutive A/T were found in the chloroplast genome of M. leptophylla, and mononucleotide SSRs of 10 to 14 repeated A/T were discovered in that of H. exserta. In the chloroplast genomes of R. bambusarum and R. henryi, A/T SSRs were mainly present as 10 to 12 nucleotide repeats. Regarding mononucleotide SSRs of type C/G, only a single SSR, of 10, 11, and 11 nucleotides, was identified in M. leptophylla, H. exserta, and R. henryi, respectively. For dinucleotide SSRs, five-repeat AG/CT and AT/AT SSRs were found in all four chloroplast genomes, and dinucleotide SSRs with 6 and 7 repeats were discovered in that of M. leptophylla. Three trinucleotide SSR types were identified across the four species: AAG/CTT, AAT/ATT, and ATC/ATG. Moreover, nine tetranucleotide SSR types were discovered among the species, i.e., AAAC/GTTT, AAAG/CTTT, AAAT/ATTT, AACT/AGTT, AATG/ATTC, AATT/AATT, ACAG/CTGT, ACAT/ATGT, and AGAT/ATCT. Additionally, the chloroplast genome of M. leptophylla comprised two pentanucleotide SSR types and one hexanucleotide SSR type: AAATC/ATTTG, AAATT/AATTT, and AAATAG/TTTCTC. The only hexanucleotide type in that of H. exserta was AAGATC/ATCTTG. Note: SSR stands for simple sequence repeat.
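To illustrate how repeat types such as A/T, AG/CT, or AAAT/ATTT can be derived from a sequence, the following sketch scans for perfect tandem repeats and labels each motif together with its reverse complement. It is not the tool used in this study, which this section does not name. The minimum repeat numbers in `MIN_REPEATS` are assumed thresholds chosen to match the smallest repeats mentioned above (10 for mononucleotides, 5 for dinucleotides), and the function names are hypothetical. Each class is scanned independently, and compound SSRs are not merged.

```python
# ASSUMED minimum repeat counts per motif length; not stated in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def _min_rotation(s: str) -> str:
    """Lexicographically smallest cyclic rotation of s."""
    return min(s[i:] + s[:i] for i in range(len(s)))


def canonical_motif(motif: str) -> str:
    """Label a motif together with its reverse complement, e.g. 'AG/CT'."""
    forward = _min_rotation(motif)
    reverse = _min_rotation(motif.translate(_COMPLEMENT)[::-1])
    first, second = sorted((forward, reverse))
    return f"{first}/{second}"


def _is_primitive(motif: str) -> bool:
    """False if the motif is itself a repeat of a shorter unit (e.g. 'ATAT')."""
    k = len(motif)
    return not any(
        motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0
    )


def find_ssrs(seq: str) -> list:
    """Return sorted (start, end, type, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if not set(motif) <= set("ACGT") or not _is_primitive(motif):
                i += 1
                continue
            # Extend the tandem array one motif copy at a time.
            j = i + k
            while seq[j:j + k] == motif:
                j += k
            repeats = (j - i) // k
            if repeats >= min_rep:
                hits.append((i, j, canonical_motif(motif), repeats))
                i = j  # continue after the repeat array
            else:
                i += 1
    return sorted(hits)


# Toy sequence with one mono-, one di-, and one tetranucleotide SSR.
example = "GGC" + "A" * 12 + "CCT" + "AG" * 5 + "TTC" + "AAAT" * 3 + "G"
ssr_hits = find_ssrs(example)
for start, end, ssr_type, repeats in ssr_hits:
    print(f"{ssr_type} x{repeats} at {start}-{end}")
```

Grouping a motif with its reverse complement is what makes, for example, a run of T on one strand count as the same A/T type as a run of A on the other.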

Phylogenetic analysis based on chloroplast genomes
The topological structure of the phylogenetic tree of 22 species, including Machilus leptophylla, is illustrated in Figure 6-

Chloroplast genomes possess characteristic structures
Most chloroplast genomes comprise four specific regions, i.e., a large single copy, a small single copy, and two inverted repeats [12]. The presence of all four regions commonly implies that the chloroplast retains its full set of biological functions. Our results showed that these four regions were present in M. leptophylla, H. exserta, R. bambusarum, and R. henryi. Moreover, each region was of regular length, which indicates that no long fragments had been lost. Usually, a chloroplast genome contains 120 to 130 genes and ranges from 107 to 218 kb in length [12]. In the four chloroplast genomes studied here, 124, 130, 129, and 131 genes were identified, and the whole genomes were 152,624 bp, 153,296 bp, 156,309 bp, and 158,953 bp in length, respectively. Our results were consistent with that report. It has been documented that one copy of the IR is missing in some taxa, such as members of the subfamily Papilionoideae that form the IR-lacking clade [17,18]. Evidently there are some exceptions, even though the sub-structures and gene counts of chloroplasts are relatively conserved and stable. Our results indicate that most of the identified genes lack introns. The number of genes containing introns was 22 to 23 across the four species, and all four chloroplast genomes had two genes that possessed more than two introns. Thus, the number of intron-containing chloroplast genes was at a typical level. However, the loss of introns from chloroplast genes has been reported in many plants, such as Cicer arietinum, Manihot esculenta, Bambusa sp., and Hordeum vulgare [19][20][21][22]. In chloroplast genomes, intron loss tends to occur in diverse plants such as Poaceae, Onagraceae, Oleaceae, and Pinus [23]. Our data suggest that the genera Machilus, Hanceola, and Rubus may be more stable with respect to intron content.

SSRs from chloroplast genomes provide essential genetic information
Simple sequence repeats are tandem repeats with units of 1 to 6 nucleotides found in the genomes of organisms [24]. Among species, and even among genotypes, the number of repeat units in the tandem array of an SSR motif may vary. A substantial number of SSRs are distributed throughout the genome, including organellar DNA [25]. In the model plants rice and Arabidopsis thaliana, SSRs were reported to be organized and altered within genic regions [26]. Generally, SSRs show a high mutation rate per locus per generation, locus specificity, intraspecific polymorphism, reproducibility, and multiallelism across taxa [27]. In our study, mononucleotide SSRs were the most abundant SSRs in the chloroplast genomes of M. leptophylla, H. exserta, R. bambusarum, and R. henryi. In the chloroplast genomes of single-petal (SP) and double-petal (DP) Jasminum sambac L. (Oleaceae), mononucleotide SSRs accounted for 62.71% (74/118) and 62.39% (73/117), respectively [28]. However, in the chloroplast genome of Rhus chinensis, mononucleotide SSRs accounted for 28.74%, which was less than the 60% of dinucleotide SSRs [29]. This suggests that the relative abundance of SSR classes in plant chloroplast genomes probably depends on the plant taxon. In nuclear genomes, mononucleotide SSRs also account for the largest proportion among the six classes (mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide). In the nuclear genome of Zanthoxylum bungeanum, mononucleotide repeats were the most abundant class, with 19,706 members, nearly four times the number of dinucleotide repeats (5,154) [30]. Similarly, mononucleotide repeats make up the highest proportion in the nuclear genome of Chinese jujube (Ziziphus jujuba) [31]. Moreover, mononucleotide was the most abundant SSR class in expressed sequence tags, for example in tobacco (Nicotiana tabacum L.).
The mononucleotide SSRs of the four chloroplast genomes were rich in A/T and rarely consisted of tandem G or C repeats, which is consistent with previous reports [32][33][34].

Comparison of chloroplast genomes offers a robust tool to study phylogenetic relationships and evolution among plant species
Complete chloroplast genome sequences provide insights into plant biology and diversity [10]. Within phylogenetic clades, chloroplast genomes have contributed significantly to phylogenetic studies of several plant families and to resolving evolutionary relationships [10]. Furthermore, chloroplast genome sequences have revealed considerable sequence and structural variation both within and between plant species. Information from chloroplast genomes is valuable for understanding plants in their environments and for promoting the breeding of closely related species [35,36]. The phylogenetic trees (Figure 6-1, Figure 6-2, and Figure 6-3) were constructed from three groups of complete chloroplast genome sequences. Overall, the topology of the trees in this study was highly consistent with the taxonomic relationships in the Taxonomy database of the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/taxonomy/?term= ). As an exception, in Figure 6-2 the species Nectandra angustifolia clustered with Machilus pauhoi and M. thunbergii, forming a new clade. This suggests that some taxonomic problems probably exist in the current genus Machilus or Nectandra. In Figure 6-3, R. bambusarum and R. henryi clustered together with robust bootstrap support. Hence, the phylogenetic tree alone did not answer the question of whether R. bambusarum and R. henryi are the same species. However, specific information from the chloroplast genomes, such as differences in the length of the complete circular genome and in the distribution and classes of SSRs, provides evidence to differentiate these two species. The chloroplast genome sequence thus offers a robust approach to resolving closely related species.

Sampling and DNA extraction
Fresh leaves of M. leptophylla were sampled from Zijingang Campus, Zhejiang University (120°51′32′′ E, 30°18′08′′ N). Those of H. exserta, R. bambusarum, and R. henryi were collected from Hangzhou Botany Garden (120°07′36′′ E, 30°15′15′′ N). The specimens were deposited at the Institute of Crop Sciences, Zhejiang University, under specimen codes LM001, LM002, LM003, and LM004, respectively. DNA extraction was performed as follows:
1) Weigh 80-150 mg of fresh sample and mix it with 800 µl of CTAB buffer.
2) Grind the mixture to a homogenate, and then vortex it for 3 minutes.
3) Place the tube containing the mixture in a water bath at 65 °C for 35 minutes.
4) Centrifuge the homogenate for 10 minutes at 13,000 rpm, and then transfer the supernatant into a new centrifuge tube.
5) Add 4 µl of RNase A working solution to each tube and incubate at 37 °C for 15 minutes.
6) Add phenol/chloroform/isoamyl alcohol (25:24:1) to the tubes so that the final volume is doubled.
7) Vortex to mix, and then centrifuge the tubes at 13,000 rpm for 2 minutes.
8) Transfer the upper layer of liquid into a new centrifuge tube.
9) Add a half volume of pre-cooled isopropanol and incubate in a freezer at -20 °C for 20 minutes.
10) Centrifuge the tubes at 13,000 rpm for 8 minutes, and then discard the supernatant without disturbing the pellet.
11) Wash the pellet with pre-cooled 70% ethanol and dry it in a laminar flow cabinet.
12) Add 50 µl of TE buffer to dissolve the DNA.
The quality of the total DNA was assessed with a NanoDrop Microvolume Spectrophotometer and Fluorometer (ThermoFisher Scientific, USA). Samples with OD260/OD280 values between 1.7 and 1.9 were retained for further study.

DNA sequence and raw data processing
The sequencing libraries were constructed with the TruSeq Library Construction Kit (Illumina, San Diego, CA, USA) according to the manufacturer's instructions. The total DNA samples were fragmented with a g-TUBE by centrifuging at 4000 rpm for 3 min, and then processed in order by end repair, adapter ligation, and exonuclease treatment. Sequencing was conducted on the Illumina HiSeq 2000 platform following the standard protocols at Tianjin Sequencing Center, Tianjin Novogene Technology Co., Ltd., China. A genomic shotgun library with an insert size of 150 bp was constructed, and more than three Giga bases of clean data were obtained. Adapter sequences, potential contamination, and low-quality bases in the raw data were removed with AdapterRemoval. The CLC quality trim tool was employed to obtain high-quality reads.

Chloroplast genome assembly and annotation
To identify the chloroplast sequences of M. leptophylla, the Illumina reads were mapped with BWA (version 0.7.17) [37] to the reference chloroplast sequence of M. balansae (KT348517) from the NCBI Organelle Genome Resources database (http://www.ncbi.nlm.nih.gov/genome/organelle/). Similarly, the chloroplast genome of Ocimum basilicum (KT348517) was used as the reference for H. exserta, and R. bambusarum and R. henryi shared the same reference, Rubus crataegifolius (NC_039704). The reads were assembled with SPAdes [38], and the assemblies were then polished with Pilon [39]. The order of the contigs was evaluated by collinearity analysis with MUMmer [40]. Subsequently, the start and end sites of the two inverted repeat sequences were identified by aligning the target and reference chloroplast genomes with BLAST [41]. All four chloroplast genomes were annotated with Dual Organellar GenoMe Annotator (DOGMA), followed by manual correction [42]. BLASTX, BLASTN, and tRNAscan-SE 1.21 were employed to identify putative protein-coding, rRNA, and tRNA genes [43,44]. The circular chloroplast genome maps were drawn with OrganellarGenomeDRAW [45].

Phylogenetic analysis
For phylogenetic analysis, 22 chloroplast genomes of representative species, including M. leptophylla, were selected, with that of Chimonanthus praecox (MT859152) serving as the out-group. Similarly, to determine the phylogenetic position of H. exserta, a total of 20 chloroplast genomes were analyzed, and Scutellaria kingiana (MN128389.1) was selected as the out-group. For R. bambusarum and R. henryi, a total of 24 chloroplast genomes were employed, with Euonymus schensianus (NC036019) used as the out-group. The chloroplast genomes were aligned using MAFFT (V7.407) [47], and the phylogenetic trees were then constructed by the maximum likelihood (ML) method with IQ-TREE (Version 1.7) [48]. Internal branch support was estimated with 1000 bootstrap replicates.
2) Six classes of SSR were identified in the four chloroplast genomes, of which mononucleotide was the class with the most members. In contrast, the trinucleotide, pentanucleotide, and hexanucleotide classes had only a few members. The repeat types varied within individual SSR classes.
3) Phylogenetic trees indicated that M. leptophylla clustered with M. yunnanensis within the genus Machilus, and H. exserta was placed within the tribe Ocimeae. Additionally, R. bambusarum and R. henryi clustered together, although they did not belong to one species given their differing SSR features.

Data Availability Statement:
The genome sequence data that support the findings are openly available in GenBank of NCBI at https://www.ncbi.nlm.nih.gov/. The accession numbers for M. leptophylla, H. exserta, R. bambusarum, and R. henryi are MW238421, MW238418, MW238419, and MW238420, respectively. The associated BioProject number is PRJNA722038.