Computational approaches to functionally annotate long noncoding RNA (lncRNA)

Long noncoding RNAs (lncRNAs) are implicated in various genetic diseases and cancer, owing to their critical role in gene regulation. RNA sequencing is used to capture their transcripts from specific cell types or conditions. In some studies, lncRNA interactions with other biomolecules have also been captured, which can give clues to their mechanisms of action. Complementary in silico methods have been proposed to predict the noncoding nature of transcripts and to analyze available RNA interaction data. Here we provide a critical review of such methods and identify associated challenges. Broadly, these can be categorized as reference-based and reference-free (ab initio), with the former requiring a comprehensive annotated reference. The ab initio methods can make use of machine learning classifiers trained on features extracted from sequences, making them suitable for predicting novel transcripts, especially in non-model species. Machine learning approaches such as Logistic Regression, Support Vector Machines, Random Forest, and Deep Learning are commonly used. Initial approaches relied on basic sequence features to train the model, whereas the use of secondary structural features appears to be a promising approach for functional annotation. However, adding secondary structural features increases model complexity, demanding an algorithm that can handle it and considerably increasing the use of computational resources. Computational strategies that combine identification and functional annotation and can be easily customized are currently lacking. Such strategies would be of immense value in accelerating research on this class of RNAs.


lncRNA
Noncoding RNA (ncRNA) will not usually directly code for protein and were considered as junk regions in the DNA with little to no functional significance. Recently, they are starting to garner attention because of the realization that they play a vital role in genome regulation. Many of them are long noncoding RNA (lncRNA), which form 80% of the noncoding transcriptome [1,2] and generally act by interacting with other biomolecules during transcriptional, post-transcriptional, translational or post-translational phases in various mechanisms to affect gene activity [3,4,5]. These were initially thought to be merely transcriptional noise or high throughput sequencing artefacts [6,7]. A subset of lncRNA, known as macroRNA, were also reported to act as precursors to other small or long ncRNA [8,9]. There are also a few examples of lncRNA containing open reading frame (ORF) and having dual action of gene regulation and synthesising protein [10]. Many lncRNA show cross-species conservation [11,12,13], and their unstable nature and high turnover rate in nucleus indicate their role in rapid response to external stimuli [14]. A large number of human lncRNA bind to the polycomb group protein repressive complex (PRC) and chromatin modifying complexes strongly suggesting their role in epigenetic regulation [4].
The use of lncRNA as one of the main regulatory players is seen as a cost-effective model. This is because compared to proteins, RNAs are energetically less expensive to produce in the cell and can act locally without having to be transported between nucleus and cytoplasm. More importantly, lncRNA are highly cell and tissue specific; and a specific epigenetic regulation in response to a stimulus can be achieved without changing the transcriptional machinery [15]. Interestingly, an increase in the number of protein coding genes from invertebrates to mammals has been smaller in comparison to drastic increase in the complexity. A large number of annotated lncRNA in mammals, with their diverse roles, may help us understand the complexity of mammalian systems.
The total number of lncRNAs in the human genome is still not accurately known, as novel lncRNAs are identified constantly and the reference databases are still being updated. The number also varies widely across these databases because of differences in the definition of what constitutes a lncRNA. Conservative resources such as GENCODE v34 [16] list 17,960 lncRNAs, whereas LncBook reports as many as 270,000 transcripts [17]. Other database estimates lie in between; for example, the FANTOM CAT project reported 27,919 lncRNAs, of which 19,175 were shown to be functionally implicated [18]. Although the information on lncRNAs is most comprehensive for humans, they have also been identified in other animals and in plants. The GENCODE database includes annotations for human and mouse, while other databases such as ALDB (Domestic-Animal Long Noncoding RNA Database) contain annotations for domestic animals such as cow, pig and chicken [19]. Databases exclusively for plant genomes are also in progress (e.g. PlncDB [20]). Of the large number of lncRNA loci, only a few hundred have been functionally characterised. More lncRNAs are likely to be discovered, and the majority of the novel lncRNAs predicted by various methods are yet to be functionally annotated.
LncRNAs are involved in aging, neurological diseases and various types of cancer, as evidenced by multi-omics studies [21]. This makes them ideal candidates for studying the molecular pathways of disease and development. Both the biogenesis and the computational prediction of short ncRNAs (such as miRNAs) are now well established. However, the biogenic pathways of lncRNAs and the systematic prediction of their sequence and function are yet to be fully understood. This is attributed to their varying sequence lengths, genomic origins and functions, and to the fact that new lncRNAs are still emerging. Some of the known challenges in designing lncRNA annotation pipelines are a) incomplete cataloging of lncRNA features, b) the diversity of experimental methods used to generate transcript information, c) low expression, d) cell specificity, and e) the dynamics of lncRNA function in response to epigenetic signaling. Recent advances in Big Data processing, Machine Learning and Distributed Computing are helping to overcome these challenges, but progress in the field is still limited, and more efficient strategies to predict and functionally characterise lncRNAs are required.

Classification
LncRNAs can be antisense, intergenic, interleaved, or overlapping with protein-coding genes [22]. This classification is based on their biogenesis loci, as shown in Figure 1. Those that originate from a region of the genome lying between two coding genes are called Long Intergenic Noncoding RNAs, or lincRNAs for short. RNA strands complementary to mRNAs are called antisense RNAs. Antisense transcripts can occur with varying degrees of overlap, from none (divergent), to partial (terminal) and complete (nested). Likewise, there are other categories based on the location of transcription, such as intronic RNAs, which, as the name suggests, lie in the intronic region of another protein coding gene, and bidirectional RNAs, which are transcribed within 1 kb of a protein coding gene in the antisense direction. Other classifications also exist based on function, conservation, role in regulation or chromatin modeling, and have been previously reviewed [23,24].
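The positional categories described above can be illustrated with a small interval-based sketch. This is only a rough approximation under stated assumptions: the `Interval` type, the function name `classify_by_locus`, and the use of the gap between the two intervals (rather than the distance between transcription start sites) for the 1 kb bidirectional rule are all hypothetical simplifications introduced here, not the procedure of any tool reviewed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A stranded interval on one chromosome (0-based, end-exclusive)."""
    start: int
    end: int
    strand: str  # "+" or "-"


def overlaps(a, b):
    """True if the two intervals share at least one base."""
    return a.start < b.end and b.start < a.end


def classify_by_locus(lnc, gene, introns=(), bidirectional_window=1000):
    """Assign a positional class to `lnc` relative to one protein-coding `gene`.

    Simplification (assumption): strand is ignored for the intronic class, and
    the bidirectional class uses the gap between intervals, not TSS distance.
    """
    same_strand = lnc.strand == gene.strand
    if overlaps(lnc, gene):
        # Fully contained in one intron of the coding gene.
        if any(i.start <= lnc.start and lnc.end <= i.end for i in introns):
            return "intronic"
        return "overlapping (sense)" if same_strand else "antisense"
    # No overlap: measure the gap between the two intervals.
    gap = max(gene.start - lnc.end, lnc.start - gene.end)
    if not same_strand and gap <= bidirectional_window:
        return "bidirectional"
    return "intergenic"


gene = Interval(10_000, 20_000, "+")
print(classify_by_locus(Interval(30_000, 32_000, "+"), gene))
print(classify_by_locus(Interval(15_000, 16_000, "-"), gene))
```

A real annotation pipeline would compare each transcript against all neighbouring genes and their exon structure; the sketch handles a single gene to keep the category logic visible.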

Characteristics
The lncRNAs are usually arbitrarily defined as transcripts longer than 200 bp with no or low coding potential. The transcripts, however, may contain a 5' cap and a poly-A tail, and may be composed of multiple exons. They are short lived within the cell, and the level of expression varies across different cell types and subtypes. LncRNAs are more numerous but have lower abundance than mRNA within a cell. They are more tissue specific, along with high developmental state specificity and cell subtype specificity [25].
Figure 1: Classifications of LncRNA based on their site of origin
LncRNAs vary considerably in length from 200 bp to around 2000 kbp and therefore, are expected to fold into a variety of functional secondary (2D) structures through intramolecular base pairing. Therefore, it is very challenging to identify them conclusively and several efforts are currently underway. Generally conservation of these sequences are found to be moderate compared to their protein coding counterparts, however, the 2D structures are well conserved and so are the expression levels [26]. This suggests that the sequence in itself is less important than the secondary structure. The secondary structures of the lncRNA are thought to be modular, providing interaction sites for other DNA, RNA and proteins. With these interactions they epigenetically regulate the cellular biology, thereby forming a layer of genomic programming on top of the coding genes.

Role
LncRNAs play many roles and are majorly involved in regulation of proximal or overlapping protein coding genes, also known as cis-regulation. However in some cases they can also regulate genes in trans, which are further away on the same or on different chromosomes. They are found in different compartments of a cell like nucleus, nucleolus, cytoplasm, and even in the mitochondria, which mainly correlates to its mode of action. LncRNA do not function alone or in a single manner, they interact with other genes and proteins through complex pathways. More comprehensive functional reviews have been published previously [27,28]. Some of the well known functions of lncRNA are listed in Table 1.

Table 1 (columns: Role; Description; Examples; References)

References: [34]

Role: Decoys
Description: They can repress gene activity by binding to transcription factor complexes targeting DNA or RNA thereby directly affecting the transcription process. In addition, they affect the posttranscription processes by acting as a sponge or decoy for microRNAs which are involved in processes like mRNA cleavage, direct translational repression and/or mRNA destabilization [35]
Examples: LncRNA AK015322 promotes proliferation of spermatogonial stem cell C18-4 by acting as a decoy for microRNA-19b-3p.
References: [36,35]

Role: Coregulators
Description: lncRNA can co-regulate its nearby protein coding gene. This association is mostly positive resulting in enhanced expression and sometime negative, repressing the adjacent protein coding gene.
Examples: 1) Igf2as/PEG8 is very highly correlated with Igf2, which is in line with the observation where it is observed to be overexpressed in Wilms' Tumors together with Igf2. 2) Similarly there are many such coregulatory lncRNA-mRNA pairs like GM42937-ATP1a1, Wt1os-Wt1, GM3235-Fto.
References: [37]

Role: Pol II inhibitors
Description: In some cases lncRNAs are believed to suppress the ability of RNA polymerase II to perform transcription by binding to its core and forming preinitiation complexes at the promoter regions.
Examples: B2 RNA binds to Pol II resulting in lower levels of transcription.
References: [38]

Role: mRNA processing
Description: 1) lncRNAs are involved in various stages of mRNA processing like alternative splicing. A class of lncRNA called NAT (natural antisense transcripts) is one of the major contributors in alternative splicing which it performs by masking portions of the splicing regions because of their complementary nature. 2) They can also lead mRNA to degradation pathways. 3) LncRNAs can interact with RNA methyltransferases or demethylases and thus regulate mRNA expression post-transcriptionally.
Examples: 1) The FGFR2 gene in humans is alternatively spliced with respect to Exon IIIb-IIIc, is regulated by asFGFR2 lncRNA which is also a NAT. 2) The lncRNA 1/2-sbsRNA binds the Alu element of the 3'-UTR region on the target gene in the Staufen 1 (STAU1)-mediated mRNA decay (SMD) pathway. 3) lncRNA mediated reversible m6A methylation modification of mRNA have been reported in animals.
References: [39]

Role: Stability
Description: Post-transcriptional stability of the transcripts once in cytoplasm is maintained by their corresponding NATs by recruiting stabilizing factors, thus preventing the effect of destabilizing factors on those transcripts.
Examples: 1) Noncoding RNA activated by DNA damage (NORAD) controls the ability of RBMX to form complexes which subdue the instability in the genome. 2) lncRNA OCC-1 reduces the stability of the RNA binding protein HuR which stabilises many mRNA resulting in the inhibition of corresponding protein coding transcripts.
References: [40,41]

Role: Translation
Description: The NATs are found to affect the translation process thereby regulating gene expression by promoting or hindering translation of mRNA transcripts to protein. They do so by providing RNA binding motifs which interact with translation initiation complexes.
Examples: lncRNA GAS5 interacts with the Eukaryotic Translation Initiation Factor 4E to moderate c-Myc synthesis via translation.
References: [42]

Role: Extracellular vesicle packaging
Description: They can modulate gene expression in the environment when secreted as extracellular vesicles
Examples: Extracellular vesicle packaged HIF-1 alpha-stabilizing lncRNA from tumour associated macrophages regulates aerobic glycolysis of breast cancer cells.
References: [43,44]

Role: Precursor of small RNA
Description: lncRNA can be precursors of miRNAs through intracellular shearing, and RNAs can be processed to produce specific miRNAs regulating the expression of target genes.
Examples: 1) lncRNA H19 produces a precursor of miRNA675 that suppresses translation of insulin growth factor receptor (Igf1r). 2) lncRNAs MIR100HG hosts genes of miR-100 and miR-125b known to mediate cancer cell resistance 3) lncRNA MuLnc1 in plants is cleaved by mul-miR3954 producing secondary siRNA
References: [45,46]

Role: Encoding peptides
Description: Some lncRNA have been observed with potential to code for micropeptides which may be functional
Examples: Two known examples are from skeletal muscles: 1) LINC00948 in humans (and AK009351 in mice) encodes a conserved micropeptide (46aa) called myoregulin (MLN) 2) LncRNA LINC00961 encodes a conserved polypeptide (90aa), named 'small regulatory polypeptide of amino acid response' (SPAR)
References: [47,48]

The role of lncRNAs are not limited to the ones mentioned in Table 1. Novel lncRNAs and their roles in cell biology are still emerging due to limited knowledge about this class of RNAs and their functional mechanisms.

Mechanisms
The secondary and tertiary structures of lncRNAs also play a significant role in their mechanisms of action by creating binding sites for other biomolecules such as DNA, RNA and proteins. LncRNAs function by employing multiple mechanisms and actions, unlike their protein coding counterparts. Some of these mechanisms are listed below and illustrated in Figure 2.
Archetype 1: LncRNAs can act as scaffolds, bringing RBPs and TFs bound at different loci in the DNA secondary structure together at the promoter region to start transcription.
Archetype 2: LncRNAs can guide proteins such as chromatin remodelling factor complexes to their loci, playing a crucial role in epigenetic regulation.
Archetype 3: LncRNAs can interact with miRNAs, which are involved in post-transcriptional processes, by acting as a sponge or decoy.
Archetype 4: LncRNAs can form a three stranded nucleic acid structure called an R loop, which is a common structure regulating gene expression through various mechanisms.
Archetype 5: NATs are a type of lncRNA that maintain the stability of their corresponding coding transcripts post transcription.
Archetype 6: LncRNAs can form a triple helix, or triplex, structure with DNA; this structure interacts with proteins which can stabilize or unwind it.

RNA-Protein Interaction
Some lncRNAs interact with proteins called as RNA Binding Proteins (RBP) forming complexes which have significant roles in cellular pathways. Xist is a lncRNA whose functions are well researched, as it is primarily involved in X-chromosome inactivation and assists in differentiation of early pluripotent cells. This transcript interacts with 81 RBPs in a modular and developmentally controlled manner to coordinate chromatin spreading and silencing [49]. LncRNA is also known to interact with transcription factors via Transcription Factor Binding Site (TFBS) and in some cases share the same TFBS with their protein coding counterparts, indicating a direct role in regulating the transcription of coding genes [50]. Computational tools like GraphProt [51] and more recently, Heterogeneous Network Model based method [52] are available to model the binding preference of the RBPs.

RNA-RNA Interaction
LncRNAs interact with other RNAs such as mRNAs and/or small RNAs such as microRNAs (miRNAs). The miRNAs function by directly interacting with mRNAs to regulate gene expression [53]. When these small RNAs act in combination with lncRNAs, the tissue specificity of the lncRNAs can result in tissue differentiation and cancer development. One such interaction is the formation of the MLMI, or mRNA-lncRNA-miRNA, regulatory network in hepatocellular carcinoma, where miRNAs and lncRNAs are found to be differentially expressed [54]. Another category of interactions includes lncRNAs regulating mRNAs by alternative splicing, editing and stabilizing via direct base pairing [55]. RIblast [56] and IntaRNA [57] are in silico tools which can be used to predict RNA-RNA interactions.
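To make the idea of interaction via direct base pairing concrete, the sketch below scans a transcript for sites that are perfectly complementary to a miRNA seed (nucleotides 2 to 8). This is a deliberately naive illustration and an assumption of this example: it is not how RIblast or IntaRNA work, since those tools score interactions using accessibility and hybridization energies. The function names and the sequences are hypothetical.

```python
# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def normalise(seq):
    """Upper-case the sequence and convert DNA 'T' to RNA 'U'."""
    return seq.upper().replace("T", "U")


def reverse_complement(rna):
    """Reverse complement of an RNA sequence."""
    return rna.translate(COMPLEMENT)[::-1]


def seed_sites(mirna, target):
    """Return 0-based start positions in `target` that pair perfectly
    with the miRNA seed (nucleotides 2-8, a 7-mer)."""
    mirna, target = normalise(mirna), normalise(target)
    seed = mirna[1:8]
    site = reverse_complement(seed)
    positions = []
    i = target.find(site)
    while i != -1:
        positions.append(i)
        i = target.find(site, i + 1)  # allow overlapping matches
    return positions


# Hypothetical sequences, for illustration only.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
transcript = "AAACUACCUCGGGCUACCUCA"
print(seed_sites(mirna, transcript))
```

A transcript carrying many such sites for the same miRNA would be a candidate sponge, but perfect seed complementarity alone is weak evidence and would need support from expression and interaction data.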

RNA-DNA Interaction
LncRNAs can interact with chromatin through RNA:DNA interactions. Several mechanisms have been proposed for this interaction [58]. Two of the well known modes are listed below:
By forming RNA loops or R loops: An R loop is a three stranded nucleic acid structure involving an RNA:DNA hybrid and a non-template DNA strand (Figure 2). R-loops are a common occurrence in the genome and they are involved in many regulatory pathways. For example, 1) TERRA lncRNA induces DNA damage response elements by forming an R loop in the telomere region, 2) the R-loop formed by the transcription of VIM-AS1 lncRNA regulates the expression of the closest protein-coding gene, which is transcribed in antisense orientation with respect to the lncRNA, and 3) the FLC locus in Arabidopsis thaliana is repressed by a lncRNA called COLDAIR by the same mechanism [59]. R-loops are also believed to cause instability in the genome. To counter this, there are factors that prevent the formation of R loops; their failure might lead to replication stress, genome instability, chromatin alterations or gene silencing, which are generally associated with cancer and other genetic diseases [60]. There are computational tools like QmRLFS-finder [61] to predict and analyse R loop forming sequences, and dedicated databases like R-loopDB [62] for R-loop Forming Sequences (RLFS) and R-loops.
By forming a triple helix: An RNA-DNA triplex consists of a DNA double helix in combination with RNA as the third strand, which together form a stable triplex (Figure 2). The orientation of the third strand is important for function [63], and based on this orientation the triplex may be parallel or antiparallel. Some examples of triplex formation are: PRC2 recruitment by the lncRNA Fendrr; methyltransferase recruitment on the rRNA promoter; and the lncRNA Khps1, which recruits histone acetyltransferase (p300/CBP) at the sense SPHK1 gene and activates transcription. HOTAIR, MEG3 and PARTICLE are other examples of lncRNAs acting on promoters locally or distantly in epigenetic response. Some proteins like Argonaute may stabilize the triplex structure, while some helicases are believed to unwind or remodel it. These observations point to a functional interaction of these proteins with the triple helix [64]. The computational methods Triplexator [65] and LongTarget [65,66] can be used to study triple helix structures.

Data resources
There are many data resources available online containing information about lncRNA sequences, the RNA interactome and functional annotations. Some contain experimental output data, others contain manually curated data obtained from the literature, and a few provide data obtained by in silico prediction. Table 2 lists some of the well known lncRNA repositories.
A RIblast based prediction tool for lncRNA-lncRNA and lncRNA-mRNA interactions, also contains tissue-specific expression and subcellular localisation data for the lncRNAs.

LncRNA transcripts and interactome identification using sequencing
The experimental approaches to identify and annotate these lncRNAs are expensive, time consuming and often restricted to a specific experimental setup. cDNA library preparation followed by sequencing has been one of the traditional approaches for detecting lncRNAs. Earlier, microarray (tiling array) based identification was used to detect alternative splicing and discover polymorphisms [74]. High throughput sequencing (HTS) technology using short reads (50-300 bp) provided a revolutionary means for the systematic discovery of transcriptional units and has been effective in picking up disease-associated lncRNAs. The wet lab assays to capture the RNA interactome have previously been reviewed in depth [75,76]. Some of the most widely used sequencing techniques are discussed below.

Sequencing of RNA transcripts
SAGE, or Serial Analysis of Gene Expression, is one of the early methods that made it possible to detect novel transcripts [77]. cDNA is synthesized from the 3' end of mRNAs after cleaving using restriction enzymes. SAGE tags are concatenated, cloned and sequenced using Sanger sequencing [75]. Another similar approach is CAGE, or Cap Analysis of Gene Expression, where the 5' ends of mRNA sequences are extracted and cDNA is synthesized from them. The resulting cDNA is then sequenced using HTS techniques to obtain the sequences in the promoter regions of the mRNA, which can be used to determine the gene from which they originated [78]. Ultra deep whole RNA sequencing (RNA-seq) is one of the most popular methods to profile both the coding and noncoding transcriptome. The sequencing reads are aligned against the reference genome or assembled together in silico to obtain the transcripts [79]. However, the expression levels of noncoding transcripts are much lower than those of coding transcripts, and more targeted approaches have been implemented to capture lncRNA expression profiles [80]. Another challenge is to recover full length transcripts from short read high throughput sequencing, which is currently the more widely used sequencing approach. Upcoming long read sequencing techniques such as those from Oxford Nanopore [81] and Pacific Biosciences (PacBio) [82] can help recover longer isoforms. However, these techniques are currently limited to polyA containing transcripts only. In many cases the sample quantity is insufficient to make use of any of the earlier methods for sequencing. The single-cell techniques Smart-seq [83], DP-seq [84,83] and Quartz-seq [85] can be used in such cases. They can also be used in cases where cell to cell variation in gene expression needs to be quantified to study epigenetic modifications between individual cells [75]. LncRNAs can serve as precursors to other smaller RNAs which are processed upon degradation.
The techniques PARE-Seq [86], GMUCT [87], and degradome-seq [88] have all been used to map transcripts which are in the process of being degraded. Another technique, TIF-seq (transcript isoform sequencing), has been developed to detect both the 5' and 3' ends of transcripts [89]. Global run-on sequencing (GRO-seq) methods were developed [90] to detect nascent transcription. In order to assess the correlation of the half-lives of lncRNAs with their function, RNA decay rates can be measured using 5'-bromo-uridine immunoprecipitation chase-deep sequencing analysis (BRIC-seq) [91].
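As an illustration of how a half-life could be derived from decay measurements of this kind, the sketch below assumes simple first-order decay, N(t) = N0 * exp(-k t), so that the half-life is ln(2)/k. The first-order model, the function name `half_life` and the example values are assumptions made for this example and are not taken from the BRIC-seq protocol itself.

```python
import math


def half_life(times, abundances):
    """Estimate RNA half-life under an assumed first-order decay model.

    Fits ln(abundance) = ln(N0) - k * t by ordinary least squares and
    returns ln(2) / k, in the same time unit as `times`.
    """
    if len(times) != len(abundances) or len(times) < 2:
        raise ValueError("need at least two paired measurements")
    logs = [math.log(a) for a in abundances]  # abundances must be > 0
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    sxx = sum((t - mean_t) ** 2 for t in times)
    k = -sxy / sxx  # decay constant is the negative slope
    if k <= 0:
        raise ValueError("abundance does not decay over time")
    return math.log(2) / k


# Hypothetical chase time points (hours) and relative abundances.
print(half_life([0, 2, 4, 8], [100.0, 50.0, 25.0, 6.25]))
```

Real decay data are noisy and some transcripts do not follow a single exponential, so the fit quality should be inspected before half-lives are compared between transcripts.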

Sequencing of RNA-Protein Interactions
Several approaches exist to study the interactions between protein complexes and the RNAs they are bound to. Some of the commonly used ones are crosslinking immunoprecipitation sequencing (CLIP-Seq) [92,94], photoactivatable ribonucleoside enhanced crosslinking and immunoprecipitation sequencing (PAR-CLIP-Seq) [93] and RNA immunoprecipitation sequencing (RIP-Seq) [95].

Sequencing of RNA-Chromatin Interactions
LncRNA interacts with chromatin directly or indirectly via proteins. Chromatin isolation by RNA purification (ChIRP) can be used to identify association between a ncRNA and chromatin [96]. In this method cross-linked Protein-DNA-ncRNA complexes are obtained, which can then be separated into different components for further investigations. ncRNA can be quantified using qPCR and proteins are detected with western blots. Nucleic acid is subjected to sequencing to observe genomic regions associated with lncRNA. RNA antisense purification (RAP) [97] is another technology similar to ChIRP designed with longer ncRNA probes to reduce technical noise. Capture hybridization analysis of RNA targets (CHART) [98] protocol is conceptually similar to ChIRP and RAP with improvement in probe designs that are focused around ncRNA target region as opposed to whole length based probes used previously. Some of the newer techniques developed to capture RNA:DNA interactions are MARGI (Mapping RNA Genome Interactions) [98,99], iMARGI (in situ mapping of RNA-genome interactome) technique [100], GRID-seq (in situ global RNA interactions with DNA by deep sequencing), and RADICL-seq or RNA And DNA Interacting Complexes Ligated and Sequenced [101].

Sequencing of RNA-RNA Interactions
RAP-RNA [102] is a modified RAP protocol to capture RNA:RNA interactions. Direct RNA:RNA hybridization can also be captured by using cross-linking, ligation and sequencing of hybrids (CLASH) [103]. CLASH uses UV cross-linking, as opposed to the chemical cross-linking used in RAP-RNA, to reduce spurious protein-protein cross-linking in the capture.

LncRNA identification using computational approaches
In silico techniques to identify and annotate lncRNAs can be reference-based, in which case a model is trained on features extracted from a reference genome; LncRScan-SVM, COME and lncScore fall in this category. Their predictions are restricted to known reference transcripts. Reference-free, or ab initio, methods learn the annotation from the features in the input dataset and hence do not require a reference set. Many tools are available for the identification of noncoding RNAs across these two categories: CPC, CPC2, CNCI, lncRNA-MFDL, lncScore, LncADeep, DeepLNC, LncRNAnet, COME, CPAT, lncRScan-SVM, longdist, PLEK, FEElnc, lncRNA-ID and LncFinder. Details of all these tools can be found in Table 3.

Computational methods to study lncRNA reviewed here make use of Logistic Regression (LR), Support Vector Machines (SVM), Random Forest (RF), and Artificial Neural Networks (ANN), along with their Deep Learning (DL) variations. The use of LR makes the resulting model simple and generalizable, as observed in CPAT [105] and lncScore [111]. However, features must be selected carefully, because features without a clear association with the prediction, or features that are highly correlated with one another, might result in a poorly performing model [119]. If the noise is minimal, SVM can be a good choice, providing good performance and accuracy. lncRScan-SVM [108], CPC [104], CPC2 [113], CNCI [106], PLEK [107] and longdist [120] make use of this algorithm. These methods may not perform well for this particular problem, where the relationship between predictors and the target variable is complex (not linearly separable) and the number of data points is huge. RF, which is used in COME [114], lncRNA-ID [109] and FEElnc [115], is well suited for higher-dimensional data with large numbers of data points. It is also easy to parallelize the prediction process, and since each process operates on a subset of the data it will run faster. However, the model takes up a lot of memory for large, complex training datasets like the one required in this case. ANNs, including their many variations, have recently been emerging as the preferred choice because of their inherent ability to learn extremely nonlinear and complex relationships by themselves, to train on large numbers of observations, and to scale up the computing process. The models obtained from RF and ANN are difficult to interpret, and the rationale behind their predictions is difficult to explain. For example, a deep learning model is nothing but the weights and biases of the input, hidden, and output nodes.
Therefore, even a model that is very good at predicting coding potential may not help us much in understanding the underlying biological processes from just those seemingly random numbers. Although methods such as Garson's algorithm, Lek's profile, partial dependence plots and local interpretable model-agnostic explanations (LIME) address this issue to some extent, they can only be used to determine the relationship between the predictors and target variables [121], which is not sufficient to explain the complete biological process comprehensively given its complexity. Tools like lncRNA-MFDL [110], LncADeep [116], DeepLNC [112] and LncFinder [118] employ ANNs.
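The simplicity of LR noted above can be seen in a minimal sketch written with the Python standard library only. It is not the model used by CPAT or lncScore: the two features (an ORF coverage value and a second generic score), the toy training values, the learning rate and the function names are all hypothetical and chosen purely for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def coding_probability(x, w, b):
    """Predicted probability that a transcript is protein coding."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy data (hypothetical): [ORF coverage, second score]; label 1 = coding.
X = [[0.90, 0.80], [0.80, 0.70], [0.85, 0.90],
     [0.10, 0.20], [0.20, 0.10], [0.15, 0.30]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(X, y)
print(w, b, coding_probability([0.9, 0.8], w, b))
```

The fitted model consists of one weight per feature plus a bias, so the sign and size of each weight can be read directly, which is the interpretability advantage that RF and ANN models lack.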
These models can be trained on a number of lncRNA features which can be primarily categorised as sequential, structural or conservational. All models use sequential features, additionally, tools like lncRNA-MFDL and LncFinder also use structural features, and COME and lncRScan-SVM utilize conservation-based features to annotate lncRNA.
Sequential features can be extracted by directly parsing the nucleotide sequence of the transcript, which can be obtained from sequence files (in FASTA format). These consist of features such as transcript length, ORF related scores, exon related scores, GC content, CDS score, Fickett score, and k-mer pattern information. They are the simplest to extract but may have high variability between samples, which needs to be accounted for.
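A few of the sequential features listed above can be computed in a handful of lines. The sketch below covers transcript length, GC content, the longest forward-strand ORF and k-mer frequencies; the Fickett and CDS scores are omitted. The function names and the feature dictionary layout are assumptions of this example and do not reproduce the feature definitions of any specific tool.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G and C nucleotides in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def longest_orf_length(seq):
    """Length (nt, stop codon included) of the longest ATG...stop ORF
    over the three forward reading frames."""
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def kmer_frequencies(seq, k=3):
    """Relative frequency of every possible A/C/G/T k-mer."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)
    return {"".join(p): counts["".join(p)] / total
            for p in product("ACGT", repeat=k)}


def extract_features(seq, k=3):
    """Collect simple sequence features into one dictionary."""
    seq = seq.upper().replace("U", "T")
    orf = longest_orf_length(seq)
    features = {
        "length": len(seq),
        "gc_content": gc_content(seq),
        "longest_orf": orf,
        "orf_coverage": orf / len(seq) if seq else 0.0,
    }
    features.update({f"kmer_{m}": f for m, f in kmer_frequencies(seq, k).items()})
    return features


feats = extract_features("CCATGAAATAGGG")
print(feats["length"], feats["gc_content"], feats["longest_orf"])
```

Each transcript in a FASTA file would be passed through `extract_features` to build one row of a feature matrix for the classifiers discussed earlier.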
The mechanisms of action of lncRNAs involve structural interactions in the majority of their functions, thus their secondary structure is crucial in determining their categorization [26,122]. Moreover, conservation of secondary structure is higher than conservation of the sequence itself, making structural features more reliable. However, incorporating secondary structural features demands more time and resources. These features have previously been extracted using algorithms based on Stochastic Context-Free Grammar (SCFG) [53] or minimum free energy (MFE) [123].
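To give a sense of how secondary structure can be derived from sequence, and why it costs more than sequence parsing, the sketch below implements the Nussinov base-pair maximization algorithm. This is a swapped-in, much simpler stand-in and not an MFE method: real MFE folding uses experimentally derived thermodynamic energy parameters. The minimum hairpin loop size of 3 and the inclusion of G-U wobble pairs are assumptions of this example.

```python
# Allowed base pairs: Watson-Crick plus G-U wobble (assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return (max number of nested base pairs, dot-bracket structure).

    Dynamic programming over all subsequences takes O(n^3) time and
    O(n^2) memory, which is why long transcripts are costly to fold.
    """
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if n == 0:
        return 0, ""
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: position j is unpaired
            for k in range(i, j - min_loop):  # case: k pairs with j
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Traceback to recover one optimal structure.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return dp[0][n - 1], "".join(structure)


pairs, fold = nussinov("GGGAAAUCCC")
print(pairs, fold)
```

Features such as the fraction of paired bases or the number of stems could then be computed from the dot-bracket string, although for real analyses an energy-based folding program would be the more appropriate source of structures.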
Every tool performs well only over the subset of use cases, such as, for a specific species or specific type of transcripts, which is fundamentally dictated by its choice of features, data sources, and classifiers. But there is a general lack of approaches that can be used as a tool flexible enough to be extended by training on newer data and/or by adding new features.

Conclusion and Future Directions
We understand that lncRNAs make up a large part of the human transcriptome, and present diverse molecular structures (Figure 1), roles (Table 1), and mechanisms of action (Figure 2). As described in section 3, a large selection of experimental protocols exists to capture numerous aspects of lncRNA biogenesis and function. We also have a number of emerging reference datasets based on automated sequence annotation, experimental verification, disease association, and interaction studies (Table 2). Together these are a great resource for developing bioinformatic protocols to analyse the vast amount of accessible noncoding transcript data. The knowledge derived on the function of these RNAs and their potential applications in studying various disease and development pathways is of significance.
In addition to these developments, we also identify challenges and gaps in our comprehension of this important new class of RNAs. Inconsistency in annotation between different databases, incomplete or overlapping annotation of new transcripts, and the inherent technical noise of the sequencing platforms used to generate these data add to the intricacy. Owing to varying sequence length, dynamic response to stimuli, cell specificity, transcript instability, and multitudinal interaction mechanisms, a broad range of biological assays is needed to capture the different aspects of lncRNA activity. This further complicates the design of computational protocols to answer biological questions from such heterogeneous data and metadata. An ensemble of computational modeling approaches may be required to address unconventional questions from the transcriptomic data. Currently, a number of methods are available to determine whether a transcript is coding or noncoding by using computational coding potential criteria involving the measurement of several sequential features, as discussed above (Table 3). Noncoding RNA sequences generally operate by folding into secondary structures, which are important for their function. Some of the methods also exploit these structural features in the annotation process by measuring minimum free energy and looking at the conservation of these features. All these dimensions contribute substantially to determining the functional potential of these transcripts accurately. Another important aspect is understanding the mechanism of these RNAs through the prediction of their contact sites on the target biomolecules. None of the existing methods take advantage of interaction motif-based features to train their models. There is a need to include and comprehensively evaluate these features while building the models.
A scarce use of structural features in various pipelines is possibly due to our lack of understanding of how these RNA fold into secondary and tertiary structures and in turn dictate specific interactions with proteins or nucleic acid. A majority of the secondary structure methods used minimum free energy algorithms to find the optimum folding stage of an RNA. Longer RNA sequences can hypothetically fold into a large number of possible structural conformations and this complexity can deviate from its minimum free energy status. Newer approaches making use of convolutional neural networks and dynamic programming to solve the complex structures are becoming available [124]. These improvements in structure predictions can be leveraged to build future lncRNA annotation pipelines.
Sometimes the same transcript might act as a coding RNA in some circumstances and as a noncoding RNA in others [125,126]. Therefore, the co-factors affecting transcription need to be studied further to identify the conditions leading to gene expression. A comprehensive integrative and meta-analysis approach is useful in understanding the complex lncRNA mechanisms of action at multiple regulatory levels. It is very important to recognize that all these computational predictions have to be verified experimentally; however, to a large extent, they can reduce the search space for experiments.
Finally, since most of the computational methods are predicting lncRNAs with reasonable accuracy, many novel lncRNAs are being predicted. However, there is a gap between the detection of these noncoding transcripts and understanding their mechanisms. Hence, more efforts are to be made to annotate their functionality which is very useful in fully understanding the underlying cell biology.