Mining and Characterizing the SSR Markers for Black Rice Using the Illumina Sequencing Platform

Research on black rice has gained prominence in recent times due to its high nutritive value, curative effects, and antioxidant properties. However, its poor agronomic traits, including low yield, necessitate the incorporation of the coloured-grain trait into elite varieties through plant breeding techniques. SSR markers play an important role in plant identification and breeding. Here, we describe the generation of a reference-based transcriptome, the annotation of the transcriptome datasets, and a large set of simple sequence repeat (SSR) markers derived from black rice. In all, 28,664 SSRs were predicted in the 34,978 expressed transcripts, of which 16,996 (48.59%) contained SSRs, and 7,068 (20.20%) transcripts had more than one SSR. The identified SSRs were dominated by tri-nucleotide and tetra-nucleotide repeats, representing about 54.11% and 33.31% of total SSRs, respectively. Validation of selected markers associated with the anthocyanin trait across different black rice accessions established the reliability of the process used for mining SSR markers. The SSR markers identified in this study could be used to select varieties with desired traits and to investigate the genetic mechanism underlying anthocyanin accumulation in the pericarp of black rice. Furthermore, the findings of this study may prove beneficial in future genetic diversity studies, primer development, and selective breeding programs.


Introduction
Black rice constitutes a group of speciality rice varieties that are glutinous, rich in nutrients and anthocyanin, and cultivated in many countries across Asia. The black colouration of the pericarp is attributed to a class of flavonoid pigments known as anthocyanins [1]. Anthocyanins are the primary pigments in grains, flowers, fruits, and vegetables showing blue, purple, and red pigmentation. It is estimated that there are more than 400 naturally occurring anthocyanins [2,3], arising from different glycosylation and acylation patterns, of which cyanidin-3-O-glucoside is the most abundant, followed by peonidin-3-O-glucoside. Black rice is also regarded by many as a panacea because of its high nutritive value and curative effects, coupled with the beneficial properties of flavonoids, which not only act as antioxidants [4,5] and anti-inflammatory agents [6,7] but have also been linked to anti-carcinogenic properties [8]. Black rice may also help in the prevention or cure of diseases caused by vitamin A and B deficiencies [9,10]. There are more than 200 black rice varieties in the world [11], with the highest production reported in China, followed by Sri Lanka, Indonesia, India, and the Philippines.
Despite the demand for black rice, its availability is limited by factors such as low yield, lodging susceptibility, and long crop duration, which make farmers less interested in cultivating it. For its improvement through the incorporation of the coloured-pericarp trait into elite varieties, the use of genotypic markers would therefore be more beneficial than conventional approaches [12]. SSR markers are an ideal choice because of the large number of markers available, and they have been used for genetic variability assessment, molecular characterization, genotypic identification, and population structure estimation in rice [13,14]. Besides diversity studies [15], SSR markers have also been used extensively for the construction of linkage maps [16][17][18] and QTL analysis [19] in rice. Genic SSRs are located in the coding regions of the genome, are relatively easy and inexpensive to develop, and are highly transferable to related taxa [20,21]. Although SSRs spanning the linkage groups in different rice varieties, as well as transcriptome studies, have been reported in black rice, there are, to our knowledge, no reports to date on the development of genic SSR markers from expressed gene sequences that could be used for trait-specific selection of populations during a breeding programme.
In the present study, we generated a large number of gene-based SSR markers from the transcriptome sequences of the black rice variety Chakhao poreitin using the Illumina HiSeq sequencing platform. The sequences containing SSR motifs were functionally characterized and annotated according to Gene Ontology (GO) terms, Eukaryotic Orthologous Groups (KOG), and KEGG pathways. Subsequently, a large set of genic SSR markers was developed for black rice from the transcriptome sequences. These molecular markers may prove beneficial in genetic diversity studies, primer development, and selective breeding programs.

Plant Materials
The black rice variety Chakhao poreitin and the white rice variety IR-64 were sampled from the Regional Agricultural Research Station (RARS), Shillongoni, Assam Agricultural University, Jorhat, India. The plants were maintained in a polyhouse under irrigated conditions. Immature seeds were collected 6 days after heading (6 DAH). The immature seeds with intact pericarp were dehusked and stored in RNAlater (Catalog no. R0901-500ML) until further processing. For the validation of the candidate SSR markers, four additional black rice landraces were collected from different regions of Assam, India.

Illumina Sequencing, Quality Control, and Assembly
Reference-based transcriptome sequencing was performed on a total of four rice samples, comprising two biological replicates each of black rice and white rice [36]. The raw data were checked for quality using FastQC and preprocessed with Cutadapt to remove adapter sequences and low-quality bases (< Q30). HISAT, a fast and sensitive spliced alignment program, was used to align the high-quality reads to the reference genome with default parameters. Reads were classified into aligned reads (those that align to the reference genome) and unaligned reads. Cufflinks was used to estimate transcript abundance as normalized read counts in the form of FPKM (fragments per kilobase of transcript per million mapped fragments) values, a unit of gene/transcript expression. Cuffdiff was used to identify differentially expressed transcripts and to categorize them as up-, down-, or neutrally regulated based on their log2 fold-change values. Enrichment analysis of differentially expressed genes was performed using the PlantRegMap tool.
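As an illustration of the quantities used in this step, the following minimal Python sketch computes an FPKM value from raw counts and assigns a transcript to the up-, down-, or neutrally regulated class from its log2 fold change. It is not the Cufflinks or Cuffdiff implementation, which fit statistical models to the read data; the function names, the pseudocount, and the cut-off of |log2 fold change| >= 1 are assumptions made for this example and are not stated in the study.

```python
import math


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def classify_regulation(fpkm_test, fpkm_control, threshold=1.0, pseudocount=1.0):
    """Return (log2 fold change, label) for one transcript.

    threshold and pseudocount are illustrative assumptions; the pseudocount
    only avoids division by zero for unexpressed transcripts.
    """
    log2fc = math.log2((fpkm_test + pseudocount) / (fpkm_control + pseudocount))
    if log2fc >= threshold:
        label = "up"
    elif log2fc <= -threshold:
        label = "down"
    else:
        label = "neutral"
    return log2fc, label


if __name__ == "__main__":
    # Hypothetical transcript: 500 fragments, 2,000 bp long, 20 million mapped fragments
    print(fpkm(500, 2000, 20_000_000))
    # Hypothetical FPKM values in black rice (test) and white rice (control)
    print(classify_regulation(31.0, 7.0))
```

With a cut-off of 1, a transcript is called up- or down-regulated only when its expression at least doubles or halves between the two samples.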

SSR Prediction, Annotation, and Primer Design
MISA (MIcroSAtellite) (http://pgrc.ipk-gatersleben.de/misa) and SAMtools were employed to mine SSRs and to identify the sequences that contained them. The SSR-containing sequences were aligned using BLASTX against protein databases, including the NCBI nr database, the Kyoto Encyclopedia of Genes and Genomes (KEGG), and Eukaryotic Orthologous Groups (KOG), with an E-value threshold of 1.0E-5. Blast2GO software was used to obtain GO annotations. Primers for the SSRs were designed using BatchPrimer3 (https://probes.pw.usda.gov/batchprimer3/), a comprehensive web-based primer design program, with primer lengths of 16-22 bases, GC content of 40%-60%, annealing temperature of 40 °C-60 °C, and PCR product size of 100 to 600 bp.
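The kind of search performed in SSR mining can be sketched with a short regular-expression scan. The example below is a simplified illustration and not the MISA code itself; the minimum repeat numbers (10 for mono-, 6 for di-, and 5 for tri- to hexanucleotide motifs) are assumed, commonly used settings, since the thresholds applied in this study are not stated, and compound SSRs are not handled.

```python
import re

# Minimum number of repeats per motif length (assumed values, not from the study)
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return a sorted list of (1-based start, motif, repeat count) for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least (min_rep - 1) further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit,
            # e.g. "AA" or "ATAT"; these are reported under the shorter unit
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((match.start() + 1, motif, len(match.group(0)) // unit_len))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical transcript fragment with one tri- and one dinucleotide SSR
    example = "ATGCA" + "CGG" * 6 + "TTCA" + "AT" * 7 + "GC"
    for start, motif, n in find_ssrs(example):
        print(start, motif, n)
```

Scanning each motif length separately keeps the thresholds explicit, at the cost of not merging adjacent repeats into compound SSRs as a dedicated tool would.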

Validation of SSR Markers
SSR markers, except those with mononucleotide repeats, were selected randomly for validation in the laboratory. The primers were designed using the Primer-BLAST tool, and DNA was extracted by the CTAB method. The quality of the extracted DNA was assessed by electrophoresis on a 0.7% agarose gel. The isolated DNA was quantified by measuring the absorbance at 260 nm in a UV-visible spectrophotometer (Thermo Scientific). Polymerase chain reaction (PCR) was carried out in a reaction volume of 10 µL containing 5 µL of 2× Taq PCR Master Mix, 0.2 µM of each primer, 1 µL of template DNA, and 3.6 µL of ddH2O. All amplifications were carried out in a SimpliAmp™ Thermal Cycler (Applied Biosystems, Carlsbad, CA, USA) as follows: denaturation at 94 °C for 5 min, followed by 30 cycles of 94 °C for 50 s, the marker-specific annealing temperature (Tm) for 30 s, and 72 °C for 40 s, with a final extension at 72 °C for 5 min. PCR-amplified products were electrophoresed on a 3.5% agarose gel. A 1000 bp DNA ladder was used as a molecular marker to determine the approximate size of the fragments.
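The reaction composition given for the PCR implies the primer volumes, which can be checked with a short calculation. The sketch below assumes that the volume not accounted for by master mix, template, and water is split equally between the forward and reverse primers, and derives the corresponding primer stock concentration from C1V1 = C2V2; both the equal split and the resulting stock concentration are inferences for illustration, not values reported in the study.

```python
def primer_volume_and_stock(total_ul=10.0, master_mix_ul=5.0, template_ul=1.0,
                            water_ul=3.6, final_primer_um=0.2, n_primers=2):
    """Return (volume per primer in µL, implied primer stock concentration in µM)."""
    # Volume left for primers after the other reported components
    remaining_ul = total_ul - master_mix_ul - template_ul - water_ul
    # Assumption: the remainder is shared equally by the forward and reverse primers
    per_primer_ul = remaining_ul / n_primers
    # C1 * V1 = C2 * V2  ->  stock = final concentration * total volume / primer volume
    stock_um = final_primer_um * total_ul / per_primer_ul
    return round(per_primer_ul, 3), round(stock_um, 2)


if __name__ == "__main__":
    volume_ul, stock_um = primer_volume_and_stock()
    print(volume_ul, stock_um)
```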

Sequence Annotation of black rice containing SSR
A total of 16,996 SSR-containing sequences were obtained, of which 10,034 had significant BLAST hits. These sequences were annotated by KOG classification, KEGG (Kyoto Encyclopedia of Genes and Genomes) classification, and GO annotation, which further categorized them by molecular function, cellular component, and biological process. In the KOG functional classification, 5,384 genes were aligned to the KOG database. Apart from general function prediction, which showed the highest number of repeats, 10.04% of the repeats were involved in signal transduction mechanisms, followed by posttranslational modification, protein turnover, chaperones (8.43%) and intracellular trafficking, secretion, and vesicular transport (6.09%). A total of 8,473 (84.0%) SSR sequences were assigned to gene ontology (GO) terms on the basis of annotation against the NR (non-redundant) database. Within the cellular component category, which represented the major class, 4,800 SSR sequences were assigned to cell component (24.3%) and 3,620 to organelle component (18.9%). Only a few repeats were assigned to the other categories, with the lowest representation in virion (0.013%) and symplast (0.012%). Within the biological process category, the largest sub-category was metabolic process with 20,281 SSRs (21.0%), followed by cellular process with 18,363 SSRs (19.0%). Within the molecular function category, 15,837 and 15,475 SSRs were associated with binding activity (42.0%) and catalytic activity (41.0%), respectively, followed by 499 SSRs for transporter activity (4.73%), while only a few sequences were distributed among the other molecular function categories (Figure 1). Furthermore, a general overview of the expression pattern was visualized in a heat map (Figure 2), which provided an overall picture of the changes in gene expression.
KEGG was used to analyze the pathway annotations of the SSR sequences and to characterize the functional genes in black rice based on active biological pathways. In all, 901 SSR sequences were annotated to 130 pathways in the KEGG pathway database. According to the KEGG classification, the highest number of repeats was associated with the biosynthesis of antibiotics (11.09%), followed by purine metabolism (3.77%), while the other repeats were more or less equally distributed among the various KEGG pathways (Figure 3). A total of 2,168 (7.83%) sequences were also assigned enzyme codes according to KEGG; the majority of these sequences coded for hydrolases (39.25%), followed by transferases (37.03%), whereas ligases showed the lowest representation at 2.35% (Figure 4).

Frequency and distribution of SSRs in the black rice genome
A total of 28,664 SSRs were found in the 34,978 transcripts expressed in immature black rice pericarp. In all, 16,996 sequences contained SSRs, of which 10,034 had significant BLAST results, 7,068 had more than one SSR, and 4,044 contained compound SSRs (Table 1). The SSR types were not evenly distributed; the trinucleotide repeat motifs were the most abundant (15,512 or 62.89%), followed by tetra- (5,662 or 22.95%), mono- (4,446 or 18.02%), di- (2,879 or 11.67%), penta- (91 or 0.36%), and hexanucleotide (755 or 0.30%) repeat motifs. Among the individual motifs, A/T was the most abundant with 2,026 occurrences. The other three major motifs were CGC/GCG with 1,302 occurrences, CGG/GCC with 1,299 occurrences, and CCG/GGC with 1,231 occurrences (Figure 5a and 5b).
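A tabulation of SSRs by motif length, of the kind summarized in this section, can be sketched as follows. The motif list in the example is hypothetical toy data and the function name is an assumption; the code only shows how counts and percentages per repeat class are obtained from a list of detected motifs.

```python
from collections import Counter

# Repeat class names by motif length
CLASS_NAMES = {1: "mono", 2: "di", 3: "tri", 4: "tetra", 5: "penta", 6: "hexa"}


def motif_class_distribution(motifs):
    """Map each repeat class to (count, percentage of all SSRs in the list)."""
    counts = Counter(CLASS_NAMES[len(motif)] for motif in motifs)
    total = sum(counts.values())
    return {name: (n, round(100.0 * n / total, 2)) for name, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical motifs, one entry per detected SSR
    toy_motifs = ["CGG", "CGC", "AT", "A", "CCG", "AAAG", "CGG", "AG"]
    for name, (n, pct) in motif_class_distribution(toy_motifs).items():
        print(name, n, pct)
```

Because every percentage is taken over the same total, the class percentages computed this way sum to 100 apart from rounding.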

Development of polymorphic genic SSR markers and Validation of trait-linked functional SSR markers
In this study, a large number of high-quality unigene sequences were developed from immature black rice seeds via next-generation sequencing, and genic SSRs were identified. A whole-genome survey of an introgressed black rice line using DNA markers suggested that three regions on chromosomes 1, 3, and 4, namely Kala1, Kala3, and Kala4, were associated with black pigmentation [22]. In all, 61 SSR primer pairs were identified from the unigene sequences corresponding to the genic regions on linkage groups 1, 3, and 4 that bear the loci Kala1, Kala3, and Kala4. Of the 61 tested markers, 18 (29.5%) amplified non-specific products, 34 (55.7%) failed to amplify a DNA product, and only 6 (9.8%) showed polymorphism. Although all 6 markers showed polymorphism, only 3 exhibited distinct polymorphism for the Kala1, Kala3, and Kala4 loci across all the black rice accessions. Our results demonstrate that in-silico SSR marker development by transcriptome sequencing using NGS can be used to detect genic SSRs in rice and could supplement the extensive genomic resources available, depending on the trait of interest (Figure 5).
To validate the genic SSRs, we chose DNA markers associated with black pigmentation that correspond to the three loci Kala1, Kala3, and Kala4 [22]. Kala1 has been mapped to chromosome 1 and found to encode the DFR enzyme, while Kala3 and Kala4 correspond to MYB domain-containing and bHLH domain-containing transcription factors, respectively. As these three loci reportedly act as major determinants of pigment formation in the black rice pericarp [22], they were chosen as candidates for validation and for identifying genic SSR markers that show polymorphism in a genotype-independent manner. The genic SSRs were tested in 5 black rice landraces, namely Chakhao poreitin, Chakhao amubi, Kola-Dhan1, Kola-Dhan-2, and Burma Black. We successfully validated the polymorphism of 6 SSR markers: RM11343, RM11441, and RM11541 in the Kala1 locus, RM15139 in the Kala3 locus, and RM17305 and RM5714 in the Kala4 locus (Figure 6).

Discussion
SSR regions are more abundant in the non-coding regions of genomes; however, expressed regions may also harbour such repeat motifs [23][24][25]. Genic SSR markers are considered to have strong potential for genetic analysis and linkage map construction in crop species due to their specificity and high degree of conservation [26,27]. Polymorphism in genic regions, though less frequent, has a greater likelihood of identifying functional variability among genotypes. Black or purple rice is known for its high anthocyanin content. Genetic improvement of this trait through marker-assisted breeding requires the identification of SSR markers that reproducibly exhibit polymorphism across different varieties of black rice. Our study showed that transcriptome sequencing is a useful tool for unigene discovery and marker development in black rice. We developed a total of 10,034 genic SSR markers from the 16,996 SSR-containing unigene sequences in this study.
For perfect repeat motif types, tri-nucleotide repeats have been observed to have the highest frequency in many crops, including cotton, barley, wheat, maize, sorghum, rice, and peanut [28]. In our data, the SSR types were likewise not evenly distributed; the trinucleotide repeat motifs were the most abundant (15,512 or 62.89%), followed by tetra- (5,662 or 22.95%), mono- (4,446 or 18.02%), di- (2,879 or 11.67%), penta- (91 or 0.36%), and hexanucleotide (755 or 0.30%) repeat motifs. It has been emphasized that the frequency of SSRs is correlated with many factors, such as SSR detection criteria, dataset size, database-mining tools, and the species and materials studied [26,29,30]. The genic SSR markers with di-, tri-, tetra-, penta-, and hexanucleotide repeats were 8.7%, 60.9%, 26.1%, and 4.3%, respectively. In other crops such as wheat, not more than 6.25% of primers exhibit polymorphism between the parents of any individual mapping population, although 81.25% of detected EST-SSRs have been reported to exhibit polymorphism in 18 alien species [31]. In another report in peanut, 10.3% of EST-SSRs exhibited polymorphism among 22 cultivated peanut accessions and 88% were polymorphic among 16 wild peanut species [32]. The low amplification specificity and low polymorphism rate in our study could be due to several factors, such as large intronic regions between primers, unrecognized intron splice sites that disrupt priming sites, or simply sequencing errors. Among the individual motifs, A/T was the most abundant with 2,026 occurrences. The other three major motifs were CGC/GCG with 1,302 occurrences, CGG/GCC with 1,299 occurrences, and CCG/GGC with 1,231 occurrences. These data suggest that A/T motifs are abundant marker resources in the transcript sequences of black rice.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 7 March 2020 doi:10.20944/preprints202003.0119.v1
It has been reported that, among hundreds of types of repeat motifs, the (AG/CT)n di-nucleotide motifs show the highest frequency in other species. The least frequent motif in black rice was CG/CG, which was also rare in studies of plants such as coffee, wheat, and Rosaceae [28,29,33]; this differs from some cereal species, in which the (TA/TA)n motif was the most abundant of the repeat motifs.
To determine the level of polymorphism among our set of new genic SSR markers, we validated primer pairs. However, the universality of these markers for use as foreground markers in other black rice varieties in breeding programs for the trait of interest, in this case purple colouration of the pericarp, was found to be lacking. Similar reports in sesame indicated that the low polymorphism of SSR markers is likely due to its narrow genetic basis [34,35]. Our results indicate that large numbers of polymorphic SSR markers from transcriptome data sets are available for use in marker-assisted gene selection, QTL analysis, gene mapping, and comparative genome analysis.

Conclusions
We performed a reference-based transcriptome sequencing analysis of immature black rice (Chakhao) seed tissues using Illumina HiSeq technology. This paper reports the transcriptome characterization of black rice and provides a large number of SSR markers, together with their functional annotation. To our knowledge, this is the first report to develop SSR markers for black rice using a transcriptome sequencing method. These SSR markers will be useful for studying genetic diversity, QTL mapping, and marker-assisted breeding, and may be an efficient tool to accelerate molecular breeding in black rice.