Chloroplast genome of Phyllanthus emblica and Leptopus cordifolius : Comparative analysis and phylogenetic within family Phyllanthaceae

: Family Phyllanthaceae is one of the largest segregates of the eudicot order Malpighiales and its species are herb, shrub, and tree, which are mostly distributed in tropical regions. Certain taxonomic discrepancies exist at genus and family level. Here, we report chloroplast genomes of three Phyllanthaceae species— Phyllanthus emblica , Flueggea virosa , and Leptopus cordifolius— and compare them with six others previously reported Phyllanthaceae chloroplast genomes. The species of Phyllanthaceae displayed quadripartite structure, comprising inverted repeat regions (IRa and IRb) that separate large single copy (LSC) and small single copy (SSC) regions. The length of complete chloroplast genome ranged from 154,707 bp to 161,093 bp; LSC from 83,627 bp to 89,932 bp; IRs from 23,921 bp to 27,128 bp; and SSC from 17,424 bp to 19,441 bp. Chloroplast genomes contained 111 to 112 unique genes, including 77 to 78 protein-coding, 30 transfer RNA (tRNA), and 4 ribosomal RNA (rRNA) that showed similarities in arrangement. The number of protein-coding genes varied due to deletion/pseudogenization of rps 16 genes in Baccaurea ramiflora and Leptopus cordifolius . High variability was seen in number of oligonucleotide repeats while analysis of guanine-cytosine (GC) content, codon usage, amino acid frequency, simple sequence repeats analysis, synonymous and non-synonymous substitutions, and transition and transversion substitutions showed similarities in all Phyllanthaceae species. We detected a higher number of transition substitutions in the coding sequences than non-coding sequences. Moreover, the high number of transition substitutions was determined among the distantly related species in comparison to closely related species. Phylogenetic analysis shows the polyphyletic nature of the genus Phyllanthus which requires further verification. We also determined suitable polymorphic coding genes, including rpl 22, ycf 1, mat K, ndh F, and rps 15 which may be helpful for the reconstruction of the high-resolution phylogenetic tree of the family Phyllanthaceae using a large number of species in the future. Overall, the current study provides insight into chloroplast genome evolution in Phyllanthaceae. and 375X for Leptopus cordifolius . The Flueggea virosa which was assembled from the SRA data of NCBI show the average coverage depth of 69X. These three de novo assembled genomes, along with the previously reported genome, provided an opportunity to perform in depth comparative chloroplast genomics in family Phyllanthaceae.

The genus Leptopus Decne. is comprised of 9 species of herbs and shrubs that grow in open vegetation and in the forest industry [14]. The species of genus Leptopus are mostly distributed in Asia, including central, northern and southern China, India, Pakistan and Iran. However, some species also exist in America and Europe [14]. Leptopus cordifolius Decne. grows in wild and distributed locations in Pakistan, through north India and Nepal to West Himalaya [14].
The chloroplast is an important organelle that plays a vital role in photosynthesis. The chloroplast genome has a mostly quadripartite structure and consist of large single copy (LSC) region, small single copy (SSC) region, and inverted repeat regions (IRa and IRb) [15][16][17]. The size of the chloroplast genome of photosynthetic plants ranges from 107 KB to 218 KB and contain 120 to 130 genes [18]. Th chloroplast genome inherits maternally [19] in most angiosperms and paternally in some gymnosperms [20]. Many mutational events are reported in chloroplast genomes, including substitutions, insertion-deletions (InDels), repeats, and inversion [21][22][23]. The deletion of certain genes from the chloroplast genome or their transfer to nuclear genomes are also reported in several plant lineages, including the species of order Malpighiales to which family Phyllanthaceae belong [16,24,25]. The chloroplast genome is slowly evolving and lacks the meiotic recombination that is seen in the nuclear genome where homologous chromosomes exchange segments [26]. These properties make the chloroplast genome a suitable molecule for studies ranging from population genetics [27,28] to deep divergence of genus and family level [29][30][31][32], including plant evolution and phylogeny, providing deep insight into the evolution of the plant [33,34].
In the current study, the chloroplast genome of three species was reported and compared with six other species of the family Phyllanthaceae. The aims of the current study were to: (a) gain insight into chloroplast genome characteristics and structure; (b) elucidate the molecular evolution of the chloroplast genome; (c) gain insight into synonymous and non-synonymous substitutions and transition and transversion substitutions; (d) determine suitable polymorphic loci for phylogenetic inference; and (e) determine the phylogenetic position of the newly reported species within the family Phyllanthaceae.

Plant Collection and DNA extraction
The healthy leaves of Phyllanthus emblica and Leptopus cordifolius without any apparent disease symptoms were collected from the Mansehra district, Khyber Pakhtunkhwa, Pakistan.
These leaves were washed with distilled water and ethanol before drying with Silica. Whole genomic DNA was extracted from the Silica dried leaves using the DNeasy Plant Mini Kit from Qiagen. After confirmation of the quality and quantity on a 1% agarose gel and using a Thermo Scientific Nanodrop spectrophotometer, DNA was sent for sequencing with Novogene, Hong Kong.

Sequencing, de novo assembly, and annotation of chloroplast genome
The DNA of Phyllanthus emblica and Leptopus cordifolius was sequenced with HiSeq 2500 using paired end run with 150 bp short read size and 350 bp insert size. We also retrieved the 1.2 Gb raw reads of Flueggea virosa (Roxb. Ex Willd.) Royle (SRR7121487) from the Sequence Read Archive (SRA) of the National Center for Biotechnology which were sequenced by BGI-seq with 100 bp short reads. The whole genome shotgun was used for the de novo assembly of chloroplast genomes of the mentioned three species using NOVOPlasty.
Use of this method, provides a complete circular chloroplast genome including a Large Single Copy (LSC) region, Inverted repeat (IR) regions, and a Small Single Copy (SSC) region. The accuracy of the assembled genome and the coverage depth was confirmed by mapping short reads to its de novo assembled chloroplast genome by using Burrows-Wheeler Aligner (BWA) [35] and visualized in Tablet [36].
De novo assembled chloroplast genome of Phyllanthus emblica was annotated using GeSeq [37] whereas the tRNA genes were further confirmed by tRNAscan-SE v.2 [38] with default parameters. The annotations of protein-coding genes were further confirmed against homologous genes by blast search of National Center of Biotechnology information (NCBI).
The five column delimited table was formed with tools of GB2sequin [39] for submission to GenBank of NCBI. The assembled chloroplast genome was submitted to GenBank of NCBI.

Analysis of chloroplast genome features and inverted repeat contraction and expansion
The de novo assembled three chloroplast genomes of Phyllanthus emblica, Leptopus cordifolius, and Flueggea virosa were compared with previously reported species (mentioned in Table 1). The length of complete chloroplast genome, LSC, SSC, and IRs were compared among the Phyllanthaceae, including gene content, intron content, and GC content of each part in Geneious R8.1 [40]. We also compared the gene arrangement among species of Phyllanthaceae using Mauve alignment [41] integrated in Geneious R8.1. The contraction and expansion of inverted repeat regions were visualized using IRscope [42].

Analysis of codon usage, amino acid frequency and prediction of RNA editing sites
The relative synonymous codon usage (RSCU) and amino acid frequency of each species was determined using Geneious R8.1 [40]. The RSCU value of each species were shown with the help of a heatmap using TBtool [43].

Synonymous, non-synonymous, transition, and transversion substitutions analysis
The synonymous (Ks) and non-synonymous (Ka) substitutions were analyzed in TBtool [43] for 77 protein-coding genes using Baccaurea ramiflora Lour. as reference for all other species as this species lies basal to current species following Abdullah et al. [15]. Each gene selection was predicted following Henriquez et al. [44] by considering ration of Ka/Ks <1 purifying selection, Ka/Ks = 1 neutral selection, and Ka/Ks >1 positive selection. The transition and transversion substitutions of complete genome and protein-coding genes were determined by comparing closely related species and far related species. The whole genome was aligned using MAFFT, whereas the protein-coding sequences of each species were concatenated, except for ycf1 and aligned using MAFFT alignment [45]. For closely related species, the genome of

Repeats analyses
MIcroSAtellite (MISA) [46] was used for the determination of simple sequence repeats (SSRs) with the maximum threshold of 10 for mononucleotide SSRs, 5 for dinucleotide SSRs, and 4 for tri-, and 3 for tetra-, penta-, and hexanucleotide SSRs. REPuter [47] was used for the determination of four types of oligonucleotide repeats including forward (F), complementary (C), reverse (R), and palindromic (P). The parameters, such as repeat size ≥ 30 with at least 90% similarities, were adjusted.

Polymorphism of protein-coding genes
To determine the extent of polymorphism of protein-coding genes for further phylogenetic studies, we extracted all the protein-coding genes of each species. The sequence of each gene was multiple aligned using Geneious R8.1 and analyzed in DnaSP v.6 [48]

Reconstruction of phylogenetic tree
For inferring of phylogeny, the sequences of 76 protein-coding genes (ycf1 not included) of each species were extracted and concatenated in Geneious R8.1 [49]. These concatenated sequences of all the 10 species including Linum usitatissimum L. as outgroup from the family Pinaceae were multiple aligned using Multiple Alignment Fast Fourier Transform (MAFFT) [50] extension in Geneious R8.1. The maximum likelihood tree was reconstructed with RAxML [51] using CIPRES gateway [52]. The Interactive Tree of Life [53] was used to improve the presentation of the tree.

Chloroplast genome features and organization
HiSeq 2500 produced about 10.52 GB data with 31.4 million reads for Phyllanthus emblica and 7.85 GB data with 22.6 million reads for Leptopus cordifolius. The chloroplast genomes assembled from the data exhibited high average coverage depth of 855x for Phyllanthus emblica and 375X for Leptopus cordifolius. The Flueggea virosa which was assembled from the SRA data of NCBI show the average coverage depth of 69X. These three de novo assembled genomes, along with the previously reported genome, provided an opportunity to perform in depth comparative chloroplast genomics in family Phyllanthaceae.
The gene content and organization of the chloroplast genomes of family Phyllanthaceae are provided in Table 1, Table 2 The chloroplast genomes had 111 to 112 unique genes which showed high similarities in the arrangement within the genomes and found no inversion events related to gene rearrangement as predicted by Mauve alignment (Figure 2). Among these 112 genes, 77 to 78 were proteincoding genes, 30 transfer RNA (tRNA) genes, and 4 ribosomal RNA (rRNA) genes (Table 1).
Moreover, 16 to 17 genes were also duplicated in IR regions, including 5 to 6 protein-coding genes, 7 tRNA genes, and 4 rRNA genes. Here, we exclude two pseudogenes of ycf1 Ψ and rps19 Ψ , which left at the junctions of LSC/IRa and SSC/IRa (  Table 2. We found 17 to 18 intron-containing genes, including 11 to 12 protein-coding genes and 6 tRNA genes. The intron was found in the atpF gene of Baccaurea ramiflora only while found absent in all other species. Except clpP and ycf3 that contained two introns, all other genes contained one intron (Table 1). Among intron-containing genes, 5 genes (3 protein-coding genes and 2 tRNA genes) were also duplicated in IRs regions.  Figure 3). So, the IR contraction and expansion was not responsible for complete duplication of a functional copy of a gene, but it leads to the generation of pseudo-copies/copy of rps19, rpl2 and ycf1.

Codon usage and amino-acid frequency
Codon usage analysis was interpreted in terms of RSCU values. The analysis showed that most amino acids were encoded from those codons for which the 3′ ended with A/T (having RSCU > 1) instead of with C/G (having RSCU < 1) (Figure 4). The amino acid frequency analysis revealed that leucine was the most encoded amino acid while the cysteine was the rarest ( Figure   5). Codon-usage analysis and amino-acid frequency show high similarities in all species of the family Phyllanthaceae.

Analysis of simple sequence repeats and oligonucleotide repeats
We analyzed simple sequence repeats (SSRs) and oligonucleotide repeats. The analysis of MISA revealed 630 SSRs made of A/T motifs in all species (Table S1). We found all six types of SSRs based on the motif types including mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide. However, SRRs were predominantly mononucleotide, followed by dinucleotide (Figure 6a). Most of the SSRs exist in the LSC region, followed by the SSC, and then by the IR region (Figure 6b). Most of the SSRs (84) were predicted in Phyllanthus emblica (China) and the lowest SRRs (51) were predicted in  (Figure 6d). The LSC contained a high number of repeats in comparison to SSC and IR, whereas some repeats were also shared between regions of chloroplast genome including LSC/IR, LSC/SSC, and SSC/IR (Figure 6e).

Synonymous and non-synonymous substitutions
The synonymous (Ks) and non-synonymous (Ka) substitutions and their ratio (Ka/Ks) revealed high similarities among the species of Phyllanthaceae (Table S2)  whereas the recorded lower Ts/Tv ratio in complete genome than coding regions revealed that this may be due to inclusion of intergenic spacer region in which higher transversion substitutions take place compared to transition substitutions.

Intra species variations in chloroplast genome of Phyllanthus emblica
We determined the intra species variations in the chloroplast genome of Phyllanthus emblica that belong to two different countries-Pakistan and China. The analysis predicted 257 SNPs in the whole chloroplast genome, of which 79 belong to coding regions. We also predicted about 108 insertions-deletions (InDels) events with an average length of 5.324 in which 3 Indels belong to coding regions. We have not predicted any InDels or SNPs in genes of transfer RNAs and ribosomal RNAs.

Polymorphism of protein-coding genes
The polymorphism of all protein-coding genes was accessed which helped us to identify the suitable polymorphic loci for future phylogenetic inference of family Phyllanthaceae (Table   S3). We selected fifteen genes of ≥ 200 bp. We also accessed the missing data produced by the high polymorphic regions due to Indels to provide information about the suitability of these loci. These loci include rpl22, ycf1, matK, ndhF, rps15 etc. as shown in table 3.

Phylogenetic analysis
The phylogenetic analysis based on 75 coding sequences resolved the relationship of the  In this study, we employed whole genome shotgun sequencing to assemble complete chloroplast genomes with a high coverage depth. The conventional method of chloroplast genome sequencing was isolation or enrichment of the chloroplast genome by amplification with long-range PCR after DNA extraction [54,55]. The advancement of Next generation sequencing technologies makes it feasible to assemble chloroplast genome from whole genome shotgun sequences due to the generation of high parallel sequencing data [56]. Several other studies also reported de novo assembled chloroplast genomes from whole genome shotgun reads [57,58].
The angiosperm chloroplast genome is conserved in gene content and gene order. Both introns in the protein-coding genes and tRNA genes are conserved. Yet, the loss of some genes or introns is also reported in some species [16,18,33,59]. The chloroplast genome features of the species of family Phyllanthaceae were similar to each other with few differences. The intron of atpF gene was deleted in all other species except Baccaurea ramiflora, which was present at a basal point in the current species. The rps16 gene was found to be non-functional in two species-Baccaurea ramiflora (deletion of exon 1) and Leptopus cordifolius (on few fragments exits). The deletion of rps16 was also reported in other species of the order Malphighiales, including families Araceae [58,61,62], Malvaceae [15], and Malpighiaceae [16]. The infA gene has a vital function as a translation initiation factor. The lack of the complete gene or presence of a pseudogene revealed the existence of a functional copy or transferring of infA gene to the nuclear genome [33,63].
Inverted repeat contraction and expansion leads to the generation of both a pseudo copy and a functional copy of a gene, changing duplicated genes to a single copy gene due to transfer of IRs to single copy regions (LSC or SSC), or conversion of a single copy gene to duplicated gene due to transfer from single copy regions (LSC or SSC) to IRs, as reported previously [15,32,57,62]. In the current study, the species of family Phyllanthaceae were also found to be conserved in terms of contraction and expansion of IRs. However, the generation of a pseudo copy of ycf1, rpl2, and rps19 remained similar to other angiosperms [64,65]. The duplication of a complete functional gene such as of ycf1, rps15 and rps19 that is seen in other angiosperms was not detected here [15,57,62].

Simple sequence repeats and oligonucleotide repeat analyses
Simple sequence repeats (SSRs) and oligonucleotide repeats were also analyzed in the chloroplast genome. We observed the highest number of SSRs in the LSC region as compared to SSC and IR regions. Most of the mononucleotide SSRs were A/T whereas for dinucleotide SSRs AT/TA SSRs were more abundant. Most of the SSRs were mononucleotide, followed by dinucleotide SSRs, and then by trinucleotide SSRs. These findings are consistent with previous studies [66][67][68][69][70][71]. However, a higher number of trinucleotides, compared to dinucleotides, has also been reported, [15,55,60,72]. The SSR marker identified in the current study can serve as a resource for population genetics studies of the species of family Phyllanthaceae. We found forward, reverse, complementary, and palindromic repeats in chloroplast genomes using REPuter. These oligonucleotide repeats originate InDels and substitutions [73][74][75] and can be used as proxy for identification of mutational hotspot regions [61,74,76] Higher transition substitutions occur within in protein-coding genes as compared to non-coding regions. Moreover, the transition rate was also high in far related species compared to closely related species. The transfer RNA and ribosomal RNA genes are conserved within species and GC rich, as observed here and reported previously [55,74,80,81]. Therefore, the non-coding part has less GC content than coding sequences as shown in Table 2. The higher Ts/Tv ratio was also observed in the protein-coding sequences of family Araceae [32,58,82], whereas the low Ts/Tv ratio (up to 1) was reported based on the comparison of chloroplast genomes among closely related species, such as Asteraceae [76], Solanaceae [69,70], and Malvaceae [15,54].
The high GC content of coding sequences (up to 37%) compared to non-coding regions (up to 32.5%) might be the reason for higher rate of transition and less transversion substitutions in coding sequences compared to non-coding regions as reported for the chloroplast genome of maize and rice [83]. Another study showed that the ratio of Ts/Tv effected from the function of regional and flanking base composition and suggested the codon usage bias of chloroplast genomes as a possible reason [84] given that codons ending with A/T at the 3′ end showed a higher abundance and high encoding efficacy as reported in the current study and in the previous reports of angiosperms [15,44,62,74,76,85]. The current study, together with previous reports [32,44,58], also shows that most coding sequence substitutions are synonymous instead of non-synonymous to avoid integration of a new amino acid to protein sequences. Recent studies support that most of the synonymous substitutions are linked to transition substitutions and non-synonymous substitutions to transversion substitutions [62,81] as most are synonymous in the protein-coding sequences. This might also be a reason for the higher ratio of Ts/Tv in protein-coding sequences. Further study is required, however, to determine why the Ts/Tv ratio is higher in far diverse species in comparison to closely related species.

Phylogenetics analysis and suitable polymorphic loci for further phylogenetic inference
The phylogenetic analysis showed the polyphyletic nature of the genus Phyllanthus similar to recently obtained results using nuclear (ITS, PHYC) and chloroplast (matK, accD-psaI, trnS-trnG) markers along with morphological data [86]. Hence, the author suggested the division of the large genus Phyllanthus to nine small genera instead of combining the embedded genera to Phyllanthus. Our study was based on 75 protein-coding sequences, which supports the polyphyletic nature of the genus Phyllanthus with a limited number of species. Further sequencing will thus be required to gain broad insight into the phylogenetics of Phyllanthus and the tribe Phyllantheae. The other species show the similar phylogenetic position as reported previously [4,87]. Certain discrepancies still exist in the phylogeny of the family Phyllanthaceae, which warrant further elaboration [3,4,87]. Here, we identified 15 suitable high polymorphic regions from protein-coding sequences based on the comparative analysis of chloroplast genomes. These will serve as suitable markers for quality phylogenetic inference of the family Phyllanthaceae as lineage-specific molecular markers are more authentic, robust, and cost-effective [18,34,76,88,89].
In conclusion, our study provides insight into the molecular evolution of the chloroplast genome and sheds light on the deletion/pseudogenization of rps16 in family Phyllanthaceae for the first time. Our study shows the higher transition rate in coding sequences as compared to non-coding sequences and the polyphyletic nature of the genus Phyllanthus. We also determined suitable polymorphic loci, which may be helpful for the broad phylogenetic inference of the family Phyllanthaceae in the future.
Data availability: The newly assembled chloroplast genomes are submitted to National Center for Biotechnology under accession numbers BK059210 (Flueggea virosa), MZ424188 (Leptopus cordifolius), and MN122078 (Phyllanthus emblica). All the analyzed data are available in the main manuscript or as supplementary materials.          Total 131 * Gene with one intron, ** Gene with two introns, a Gene with two copies, £ Gene pseudo in some species, $ single copy exist of rps19 exist in Glochiodion chodoense, ££ gene with single functional copy in Sauropus spatulifolius, a This also include pseudogene or truncated copy of the functional gene; The bracket indicates the total number of genes; *Sequence and assembled in current study, ** assembled from raw reads of NCBI in current study;