Transcriptome analysis of the ark shell Scapharca 1 subcrenata : De novo assembly , identification of genes 2 and pathways involved in growth 3

19 To understand the molecular mechanism associated with growth variability in bivalves, the 20 Solexa/Illumina technology was employed to analyze the transcriptomic profiles of extreme 21 growth rate differences (fastVS. slow-growing individuals) in one full-sib family of the ark 22 shell Scapharca subcrenata. De novo assembly of S. subcrenata transcriptome yielded 23 276,082,016 raw reads, which were assembled into 98,502 unique transcripts by Trinity 24 strategy. A total of 6,357 differentially expressed genes (DEGs) were obtained between fast25 and slow-growing individuals, with 580 up-regulated expression and 5777 down-regulated 26 expression. Functional annotation revealed that the largest proportion of DEGs were 27 classified to the large or small subunit ribosomal protein, all of which showed significantly 28 lower expression levels in fast-growing group than those in slow-growing group. GO 29 enrichment analysis identified the maximum of DEGs to biological process, followed by 30 molecular function and cellular component. Most of the top enriched KEGG pathways were 31 related to energy metabolism, protein synthesis and degradation. These findings reveal the 32 Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 1 March 2018 doi:10.20944/preprints201803.0002.v1 © 2018 by the author(s). Distributed under a Creative Commons CC BY license. link between gene expression and contrasting phenotypes in ark shells, which support that 33 fast-growing individuals may be resulted from decreased energy requirements for metabolism 34 maintenance, accompanying with greater efficiency of protein synthesis and degradation in 35 bivalves. 36


Introduction
Growth is considered as a trait of great economic importance for cultured animals, which has a major influence on the profitability of food animal production.In aquatic markets, larger animals usually attain a higher price per unit of weight compared to smaller ones [1].
Thus, as for bivalve molluscs, growth is a highly desired trait in shellfish culture industry, and successful selection for fast growth is a key objective in aquaculture breeding programs.As reported, the growth trait of bivalves is indeterminate and has great variability, which is consistent with the different growth rate among different individuals under identical environmental conditions [2][3][4][5][6][7].Understanding the basis of underlying growth variability in bivalves is essential for the selective breeding of commercial species and improvement of bivalve production.
For bivalves, the inter-individual growth variability are highly complex processes that are regulated by both endogenous biological and exogenous environmental factors, called as "Nature and Nurture" [8].To date, extensive studies have been carried out to investigate the effects of environmental factors (i.e., temperature, salinity and food) on growth rates of bivalves [9][10][11].However, the relatively less attention has been paid to endogenous biological factors regulating their growth variability when they are exposed to identical controlled conditions.Previous studies about bivalve molluscs indicate that physiological differences are responsible for growth variability between fast and slow growers, and confirmed the existence of an endogenous physiological component for growth differentiation [12][13][14][15][16][17].
Recent years, molecular genetics have been used to identify a number of candidate genes and markers associated with growth traits in bivalves [18][19][20].Moreover, the in-depth analyses of the transcriptome of Pacific oyster Crassostrea gigas have identified the ribosomal proteins, which demonstrates that the efficiency of protein metabolism plays an important role in growth heterosis [21].Despite this, scant progress has been made on the molecular basis underlying endogenous components that regulate variability in growth rates of bivalves.
The ark shell Scapharca subcrenata is a commercially important bivalve species in Asian countries, such as China, Japan and Korea.Recently, natural resource of the ark shells has been declining dramatically, mainly due to the high demands in markets and their over-exploitation in the shallow coasts [22][23].As one of the important maricultural bivalve species, due to the lack of genetic and genomic information, the molecular mechanisms involved in their growth regulation are poorly understood.Recent advances in next-generation sequencing technologies allow for rapid generation of extensive genomic resources at affordable cost to perform omics level studies in bivalve molluscs [24][25][26].To better understand the molecular basis of underlying growth variability in the ark shell, our study investigates the transcriptome profiles and screens genome-wide differentially expressed genes (DEGs) associated with growth variability in one full-sib family of the ark shell reared under the uniform conditions.The aim of this study is to investigate the gene expression pattern underlying these extreme differences in growth rate among siblings, assist our understanding of the sophisticated processes of growth regulation in this species as well as other bivalves, and provide useful information for the selective breeding of bivalve species.

Sequence Analysis and De Novo Assembly
There was a total of 276,082,016 raw reads generated by Illumina sequencing for the three sequencing samples (Table 1).The raw reads used in this study have been deposited in the NCBI SRA database (accession number: SRP067102).To ensure the data quality, raw reads were filtered by removing reads containing adapters, ambiguous nucleotides and low-quality reads.After filtering, there were a total of 264,369,458 clean reads remained and 26,415,046,707 clean bases produced, with the ratio of valid bases were more than 95.5%.
For the three samples, Q30 percentages were detected to have 89.01%,89.18%, and 88.95%, respectively.The average percentage of GC content for the clean reads is 37.67%, ranging from 37% to 38%.These results suggest that the present sequencing is necessary for accurate determination of the expression level of genes in all the samples.Overall, the clean reads had been assembled into 98,502 unique transcripts.The mean length of transcripts is 994 nucleotides, ranging from 301 to 36,344 nucleotides, with N50 of 1,474.The length-frequency for unigenes was unevenly distributed, with short reads more likely to be assembled into unigenes (data not shown).There were 70.38% of short reads (<=1000 nt) were assembled into unigenes, while less than one-sixth (11.30%) of long reads (> 2000 nt) were assembled into unigenes.

Functional Annotation of Unigenes
The highest percentage of unigenes annotated in the NR database accounted for 33.19% of all unigenes, followed by 30.58% annotated in the SWISSPROT database and 25.63% in KOG database.In contrast, there were only 13.04% of unigenes annotated in the KEGG database.For NR annotations, there were 67.45% (22,055 unigenes) of all annotated unigenes observed to have strong homology with E-value < 10 -30 , while there were only 15.60% unigenes annotated in NR with E-value between 10 -15 and 10 -5 .The maximum number of the annotations (37.32%; 12,204) is homologous to the proteins from the Pacific oyster Crassostrea gigas, followed by 7.27% from Lingula anatina and 6.45% from Capitella teleta.
A total of 28663 annotated unigenes were assigned to the 25 ortholog group in KOG database (Fig. 2).More than 30% of the annotated unigenes were grouped into two subcategories: (R) General function prediction only (17.10%), and (T) Signal transduction mechanisms (16.37%).In contrast, less than 10% of the unigenes were distributed unevenly in the remaining subcategories.For GO annotation, a total of 25,246 unigenes are assigned to three main functional categories, including biological process (BP), cellular component (CC), and molecular function (MF).For the category of biological process, there were more than 50% unigenes involved in the subcategories of metabolic process, cellular process, and single-organism process.For the category of cellular component, more than 25% genes were involved in the subcategories of cell and cell part.For the category of molecular function, most genes were assigned to the subcategories of binding (61.39%) and catalytic activity (40.38%).The biological pathways determined by KEGG analysis yielded 12,847 unigenes assigned to 377 different pathways, which were classified into five specific pathway groups, including metabolism, genetic information processing, environmental information processing, cellular processes, and organismal systems.

Analysis of differentially expressed genes in fast-and slow-growing groups
The count of the expressed tags was separately calculated for each library and used to estimate the gene expression level and the differences between fast-and slow-growing groups.
We compared the expression level between fast-and slow-growing groups, and detected a total of 6,357 differentially expressed genes (DEGs), with the filter criteria of fold changes at least 2-fold up-or down-regulated with the p-value < 0.05 (Figure 3).The MA-plot-based method was used to visualize the gene expression differences between the two groups [27].
Of the 6,357 DEGs, 580 showed significantly up-regulated expression and 5,777 showed down-regulated expression in the slow VS.fast groups.The gene expression was compared between fast-and slow-growing groups, exhibiting a maximum upregulation of 8.6-fold and a maximum down-regulation of 9.3-fold, in term of log2 (fold change).
To narrow down the considerable number of differentially expressed genes, a more strict algorithm with FDR < 0.05 was used to screen the DEGs between the two groups.The comparison of gene expression identified 914 DEGs, with log2 (fold change) ranged from -9.3 to 8.6 (Supplement Table 1).Among these DEGs, 68 showed significantly up-regulated expression, while 846 were observed to have down-regulated expression in the fast-growing group.The largest number of the DEGs (82; 9.0%) were related to large or small subunit ribosomal protein, which showed down-regulated pattern in the fast-growing group compared to the slow one.In contrast, the other DEGs were associated with a variety of genes, such as trypsin, pancreatic elastase, carboxypeptidase, acetylcholinesterase, notch, collagen, and cofilin.
The eukaryotic ribosome, composed of four rRNAs and 79-80 ribosomal proteins, is responsible for protein synthesis and its translational regulation, which controls cell growth, division and development [28][29][30].Despite this, the contribution of ribosomal proteins to ribosome assembly and function is often not well understood.In the present study, a large number of DEG identified as ribosomal proteins suggest their important roles for protein metabolism in the inter-individual growth variability.The findings are consistent with previous results that showed many of the annotated sequences with matches to ribosomal proteins, which support the efficiency of protein metabolism responsible for playing a role in growth heterosis [21].

GO and KEGG pathway enrichment analysis of DEGs
The enrichment analysis of DEGs was used to explore the functional roles of DEGs in growth regulation of S. subcrenata.The enrichment analysis for these DEGs revealed that they were assigned to 1,054 GO terms with the FDR lower than 0.05.Of these enriched GO terms, the top 10 were selected according to the values of log10 (p-value) for each main GO category (Fig. 4).For the category of biological process, the enriched DEGs were mainly associated with macrophage activation, cytoneme assembly, protein synthesis and catabolism, muscle growth, and cell-cell signaling.For the category of molecular function, the enriched DEGs were largely related to the activities of enzymes, such as lipase, transaminase, carboxypeptidase, endopeptidase and metallocarboxypeptidase.For the category of cellular component, the enriched DEGs were referred to extracellular exosome, nuclear speck, lipid particle, cytoplasm, proteasome complex, mitochondrial respiratory chain, and endoplasmic reticulum.
The DEGs were also subjected to KEGG pathway enrichment.As a result, DEGs were annotated to 64 signaling pathways with FDR < 0.05.The most enriched pathways (FDR < 0.01; DEGs number > 10) including 1613 DEGs in 29 pathways (Table 2).The enriched pathways were mainly classified into three groups, including energy metabolism (oxidative phosphorylation, citrate cycle, carbon metabolism, transport and catabolism, and fatty acid metabolism), protein synthesis and degradation (amino acid biosynthesis, amino acid degradation, proteasome, protein processing, protein export, protein digestion and absorption), and others (nervous system, endocrine system, and immune system).
The citrate cycle (TCA) and oxidative phosphorylation are central biochemical pathways in cellular energy metabolism [31].Among the DEGs detected in this study, the largest proportion of 30.25% (488 DEGs) was significantly enriched in metabolism-related pathways.
Oxidative phosphorylation and TCA are two closely linked pathways involved in oxidation of nutrients for producing usable chemical energy in the form of ATP, which take place in mitochondria of eukaryotic cells [32].For almost all aerobic organisms, oxidative phosphorylation is the process in which energy is formed [33].In the present study, oxidative phosphorylation is identifed as the top enriched pathway, which includes 88 transcripts with the lowest FDR value of 1.80E-11.Functional annotation in the pathway of oxidative phosphorylation reveals that DEGs are mainly associated with mitochondrial genes, such as NADH dehydrogenase, NADH-ubiquinone oxidoreductase, cytochrome c oxidase, mitochondrial F1F0-ATP synthase, and cytochrome b.Citrate cycle, also known as TCA cycle, is an important pathway used by all aerobic organisms to generate energy for the oxidation of carbohydrates and fatty acids, which supplies NADH for use in the oxidative phosphorylation and other metabolic processes [34].In the pathway of citrate cycle, 33 DEGs are significantly enriched, which encode for succinate dehydrogenase, ATP-citrate lyase, isocitrate dehydrogenase, phosphoenolpyruvate carboxykinase, pyruvate carboxylase, citrate synthase, etc.The enriched metabolism-related pathways in this study are consistent with previous studies, which indicate that metabolic efficiencies are responsible for growth rate differences in clam R. philippinarum [17], and oyster C. gigas [35].For instance, high growth rates of clams are achieved through a combination of faster feeding and higher digestive performance, which result in the increased growth efficiency, mainly based on reduced metabolic costs of growth [17].
Protein is a dietary component essential for nutritional homeostasis in animals, and ingested protein undergoes a complex series of degradative processes.Protein synthesis is one of the dominant ATP-consuming processes in both of vertebrates and invertebrates [8,36,37].In the present study, a total of 317 DEGs (19.65%) are identified to be involved in protein synthesis and degradation.The findings indicate that there might be a dramatic change in efficiency of amino acid and protein metabolism between fast and slow growing individuals.Moreover, the pathway of proteasome had 44 DEGs, which mainly encoded for 20S and 26S proteasome.The 26S proteasome is responsible for the degradation of most ubiquitylated proteins by an ATP-dependent mechanism, relying on proper interactions between multiple subunits of the enzyme and between multiple modules [38].Additionally, the pathway of protein folding, sorting and degradation plays an essential role in invertebrate physiology as the site of nutrient digestion and absorption [37,39].For instance, the physiological bases of growth heterosis for oyster larvae are enhanced feeding rate and metabolic efficiency, which are potentially realized through different protein depositional efficiencies [37].Moreover, much of the differences in metabolic rates of fast-and slow-growing mussels could be ascribed to differences in protein turnover [39].Compared to slow-growing individuals, fast growers usually have higher energy gain rates coupled with lower metabolic costs of growth [15,17,21].Consistently, our findings of DEGs enrichment in protein metabolism also shed lights on that efficiency of protein synthesis and degradation may contribute to the growth rate differences in fast and slow growers in the ark shell.It is therefore suggested that relatively higher rates of protein digestion, absorption, processing in fast growers of bivalves result in higher energy gain rates and lower metabolic costs, compared to those in slow growers.
In conclusion, the patterns of gene expression may be related to physiological differences underlying growth variability, in accordance with previous reports on other bivalve species.
According to the evidences of gene expression in the contrasting individuals from a full-sib family, it is therefore suggested that fast growing individuals may be resulted from decreased energy requirements for metabolism maintenance, accompanying with greater efficiency of protein synthesis and degradation in ark shells.However, further study is needed to examine the physiological basis of extreme growth differences in the ark shell.

Sample prepare and collection
One full-sib family of the ark shell was produced in the hatchery in Jinzhou City through a stimulated spawning of one male and one female parents collected from a wild population.
The larvae and spats were cultivated in a single family tank.After two-month rearing, two groups were segregated by successive selection of spats chosen from the extremes of the sizes from the same family.The average shell length of fast-and slow-growing individuals were 4.120±0.58mm and1.50±0.25 mm, respectively.Both of fast and slow growers were maintained in the rearing container under the identical feeding and environmental conditions.
The spats were fed with a mixture of the algae of Isochrysis galbana and Nitzschia closterium supplied at a ration of 100,000-150,000 cells mL −1 .The spats from fast-and slow-growing groups were dissected and tissues were stored in RNAlater (Ambion) at -70°C for RNA extraction.

cDNA Library preparation and Solexa sequencing
Total RNA was extracted using Trizol according to the manufacturer's instructions.The RNA purity was evaluated by using Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA).The degradation and contamination of RNA was monitored on 1% agarose gels.
RNA samples from the fast-and slow-growing groups were pooled separately to make two sequencing cDNA libraries (fast VS. slow groups).In addition, RNA samples from five tissues, including adductor muscle, gill, mantle, hepatopancreas and foot, were pooled in equal amounts to generate a mixed sample for library construction.The three libraries for transcriptome analysis were prepared using Illumina's kits following manufacturer's recommendations and then sequenced on the Illumina HiSeq 2000 platform.

Unigene assembly and functional annotation
Raw reads obtained from Illumina sequencing were cleaned by removing adaptors and low quality reads before assembly.Paired-end reads were used for the gap filling of scaffolds to get the transcripts that could not be extended on either end.The longest transcript in a clustering unit was selected as unigene.Q20, Q30 and GC-content of the data were calculated to estimate the quality of data for the downstream analyses.The unigenes were further submitted to Blastn and Blastx searches with annotation against the databases, including the National Center for Biotechnology Information (NCBI) Nr database, the Swiss-Prot protein database, the Eukaryotic Orthologous Groups (KOG) protein database, and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database with a E-value cut-off of 10 -5 .Functional annotation by Gene Ontology (GO) terms was analyzed by Blast2GO (http://www.blast2go.com)software at default settings [40].

Differentially expressed genes (DEGs) and pathway enrichment analysis
The RPKM (reads per kilobase of exon model per million mapped reads) method was used to calculate the expression levels of the unigenes [41].To identify the genes with significantly different expression levels between fast-and slow-growing individuals, differential gene expression analysis was performed to uncover the gene expression profiles of the samples from fast-and slow-growing groups using the DESeq [42].The false discovery rate (FDR) was used to determine the threshold of the P values in multiple tests [43].The significant differences of gene expression between the two groups were determined with a cut-off threshold of P < 0.05 and log2|FoldChange| > 1. GO and KEGG enrichment analyses were performed using the hypergeometric distribution test to identify significantly enriched functional classification or metabolic pathways.The significantly enriched unigenes were selected based on the adjusted P values using an optimized FDR.The resulted DEGs were classified for the categories using the annotation of GO and KEGG pathways.

Conclusions
In conclusion, we report the first comprehensive transcript dataset of the transcriptome for the ark shell S. subcrenata.The identified and annotated transcripts will provide valuable genomic resources for the understanding of its biological characteristics.Comparison of the transcriptomes of fast-and slow-growing individuals derived from one full-sib family revealed many differentially expressed genes potentially related to growth variability in this species.GO and KEGG enrichment analysis shed lights on that metabolism, protein synthesis and degradation mainly contribute to their growth differences.These findings are in accordance with previous reports on growth regulation of other bivalve species, suggesting that the pattern of gene expression may be related to their physiological differences, with greater efficiencies of protein synthesis and degradation in fast growers.The present investigation highlights the molecular basis of growth variability in the ark shell, provides a few candidate genes for growth regulation in bivalves.Further analyses are thus needed to investigate the physiological traits of growth variability.The red dots represent the transcripts with significant expression between the two groups, while the black dots indicate no significant expression.

Fig. 1 .
Fig. 1.Histogram presentation of Gene Ontology (GO) classification in Scapharca subcrenata transcriptome.The y-axis on the right represents the number of genes in each category.The y-axis on the left indicates the percentage of genes in the main category.

Fig. 3 .
Fig. 3. Analysis of differentially expressed genes between the fast-and slow-growing groups.

Fig. 4 .
Fig. 4. Histogram presentation of enriched GO terms.The different expressed genes are summarized in three main categories: biological process (A), molecular function (B) and cellular component (C).

Table 1 .
Summary statistics for sequencing and data quality of RNA-Seq

Table 2 .
List of most enriched pathways for DEGs between the fast-and slow-growing groups