Challenges in defining the functional, non-coding, expressed genome of pathogenic mycobacteria

A definitive transcriptome atlas for the non-coding expressed elements of pathogenic mycobacteria does not exist. Incomplete lists of non-coding transcripts can be obtained for some of the reference genomes (e.g. Mycobacterium tuberculosis H37Rv) but to what extent these transcripts have homologues in closely related species or even strains is not clear. This has implications for the analysis of transcriptomic data; non-coding parts of the transcriptome are often ignored in the absence of formal, reliable annotation. Here, we review the state of our knowledge of non-coding RNAs in pathogenic mycobacteria, emphasising the disparities in the information included in commonly used databases. We then proceed to review ways of combining computational solutions for predicting the noncoding transcriptome with experiments that can help refine and confirm these predictions.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 15 September 2021 doi:10.20944/preprints202109.0253.v1 © 2021 by the author(s). Distributed under a Creative Commons CC BY license.


Introduction
A definitive atlas of expressed non-coding elements in pathogenic mycobacteria does not exist. The lists available from databases and publications overlap only partially and are only available for the reference genomes of key representatives of the Mycobacterium tuberculosis complex (MTBC), such as Mycobacterium tuberculosis (Mtb) H37Rv. This gap in our knowledge impacts the successful analysis of the copious amounts of genomic and transcriptomic data that have become available in the last decade. For example, in the absence of a formal annotation of the non-coding transcriptome, the easiest and most common approach to call differential expression events is to largely, or entirely, ignore information that does not relate to regions currently annotated as coding (CDS); this issue is more acute in studies focusing on non-reference Mtb strains or their close relatives, where non-coding annotation is scarce or non-existent. In this commentary, inspired by our own struggles to compile a definitive atlas of 'non-coding' RNA (using the term here to represent regulatory RNAs such as short RNAs, antisense RNAs and the untranslated parts of mRNA transcripts) in pathogenic mycobacteria, we present a summary of the current information from publicly available sources, highlighting the existing gaps in the knowledge and the computational approaches used to attempt to uncover this less well understood part of the mycobacterial genome.
Why pathogenic mycobacteria and why non-coding RNA?
Prior to the COVID-19 pandemic, mycobacterial disease was the leading cause of death by a single pathogen, causing over 1.4 million deaths and infecting over 10 million people worldwide in 2019 (https://www.who.int/news-room/fact-sheets/detail/tuberculosis). The different members of the MTBC include both human-adapted (Mtb) and animal-adapted (Mycobacterium bovis, Mycobacterium caprae, among others) species, which show distinct host preferences (Brites et al., 2018). Each of these species is uniquely adapted to cause disease within its preferred hosts, and to navigate a complicated lifecycle, which requires rapid response to changing environmental conditions. Pathogenic bacteria have different programs for invasion, proliferation and survival in particular host environments. The pathogen can rapidly and transiently adapt to environmental changes brought about by host defences by adjusting the activity and stability of its transcripts through the parsimonious action of post-transcriptional regulation (Chakravarty & Massé, 2019).
Though the different members of the MTBC have different tropisms, involving specific virulence profiles and metabolic changes made in response to the host environment, nearly 99% of the genomic sequence is conserved among the MTBC members (Malone and Gordon, 2017). Minor variations, such as deletions and single nucleotide polymorphisms (SNPs), that distinguish the species of the MTBC, and the strains within a species, seem to have an outsized role in determining these preferences (Dinan et al., 2014; Malone et al., 2018; Chiner-Oms et al., 2019; Cheng et al., 2019). It is therefore logical to hypothesise that non-coding RNA (ncRNA) is involved at both the transcriptional and post-transcriptional levels of gene regulation, magnifying small genetic changes and creating the documented level of phenotypic diversity (Schwenk & Arnvig, 2018).
Advances made in recent years exploring the non-coding genome, especially in the model organisms, have shown how flexible and adaptive riboregulation can be. As a full description of ncRNAs is outside the scope of this commentary, we present instead a graphical summary of the main types, mechanisms of action and targets in Table 1. There are several recent and comprehensive reviews that describe different aspects of the constantly evolving roster of non-coding elements in bacterial genomes, but most of them focus on what has been discovered in the model organisms (Table 2). Mycobacteria are different in genome, physiology and lifestyle, and it appears that non-coding regulation in the MTBC does not use the same accessory proteins or have the same sequence signatures as the model systems. Indeed, efforts to find an Hfq or ProQ analog acting as an RNA chaperone in mycobacteria have so far been unsuccessful (Gerrick, 2018). These differences affect not only our ability to transfer knowledge from model organisms to the MTBC species, but also how applicable current experimental and computational methods are to discovering new regulators in mycobacteria. Only a handful of sRNAs have been functionally characterised in the mycobacteria literature (Table 3). In most cases, top-down approaches, such as differential expression studies and ChIP-seq (chromatin immunoprecipitation with sequencing), have been employed to discover Mtb sRNAs, such as the RNAP-associated Ms1 (Arnvig et al., 2011; Šiková et al., 2019) and the PhoP-regulated Mcr7 (Solans et al., 2014). It is curious that, even among the six well-characterised examples in Table 3, there is one (MrsI) that is not listed in the current official annotation of the reference H37Rv genome, available from the corresponding NCBI annotation (GFF) file (GCF_000195955.2_ASM19595v2_genomic.gff), most likely because it was a recent discovery.
This annotation file currently includes 20 features labelled as non-coding RNAs, 15 of which are listed in Arnvig et al. (2011) and 9 in DiChiara et al. (2010); 4 are listed in both. It also includes 10 "sequence features", which are annotated as fragments of putative small regulatory RNAs (8 matching information from DiChiara et al., 2010, and 2 matching information from Pelly et al., 2012), and two "misc RNA" features, a tmRNA and the ribonuclease P RNA. Although twenty or even thirty non-coding elements is almost certainly an underestimate of the total number of ncRNAs in Mtb, we note here that the corresponding E. coli reference genome annotation (GCF_000005845.2_ASM584v2_genomic.gff) currently contains 72 elements labelled as ncRNAs, suggesting that either functional non-coding elements are not very common in bacteria, or that, even for a well-studied organism, our understanding of non-coding regulation is incomplete.
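Counts such as those quoted above can be reproduced by tallying the feature-type column of a GFF file. The following is a minimal sketch in Python, assuming a standard nine-column, tab-separated GFF3 layout; the records and the type labels used in the example (for instance `ncRNA` and `sequence_feature`) are made up for illustration and may not match the labels used in the NCBI files.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count the values found in column 3 (feature type) of GFF3 records."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines and header/comment lines such as "##gff-version 3".
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A well-formed GFF3 record has exactly nine tab-separated columns.
        if len(fields) != 9:
            continue
        counts[fields[2]] += 1
    return counts


# Hypothetical records, for illustration only (not taken from a real genome).
example_records = [
    "##gff-version 3",
    "chr\tRefSeq\tgene\t1\t1524\t.\t+\t.\tID=gene0",
    "chr\tRefSeq\tCDS\t1\t1524\t.\t+\t0\tID=cds0;Parent=gene0",
    "chr\tRefSeq\tncRNA\t2000\t2100\t.\t-\t.\tID=rna0",
    "chr\tRefSeq\tncRNA\t3000\t3090\t.\t+\t.\tID=rna1",
    "chr\tRefSeq\tsequence_feature\t4000\t4050\t.\t+\t.\tID=sf0",
]
feature_counts = count_feature_types(example_records)
```

To apply the same function to a downloaded annotation, the lines of the opened file can be passed in place of `example_records`.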

Whereas functional characterisation is ultimately needed to create a reliable list of noncoding RNAs, homology to known families of RNAs from other organisms remains the most popular approach for predicting non-coding RNAs in the absence of experimental evidence.
The RNA families described in the RFAM database (Kalvari et al., 2021) derive from the application of covariance models (and where structure information is not available, Hidden Markov Models) representing meticulously curated multiple sequence and secondary structure alignments of homologous RNAs. RFAM thus represents some of the most reliable predictions for non-coding elements in genomes and its predictions for Mtb H37Rv are summarised in Table 4. As conservation of structure is at the heart of RFAM families, noncoding RNAs with few or no known relatives in other species, and those that do not fold into strongly conserved structures, are unlikely to be found in RFAM. Hence, this database too is likely to miss elements that are specific to a small number of pathogenic mycobacteria or that are too short to fold into a stable structure. In general, homology-based approaches to discovering novel non-coding elements will be limited in pathogenic mycobacteria as there are few closely-related genomes outside the phyla. One notable exception, 6C sRNA, is well-   (Lamichhane et al., 2013), could go some way towards eliminating this confusion.
Completing the non-coding transcript atlas: computational predictions from genomic and transcriptomic data
The most extensive lists of putative non-coding RNAs in mycobacteria are the result of computational predictions based on genomic or transcriptomic data (or sometimes both).
Genomics-based methods rely on the conservation of non-coding elements across several species and, like RFAM, are likely to miss elements specific to a small subset of the genus or unique to a species. Such comparative genomics methods are typically enhanced by the search for characteristic sequence features and other signals of regulatory RNAs such as promoters, terminator structures and transcription-factor binding sites. For example, SIPHT begins with conserved intergenic sequences (defined as the sequence between two annotated genes/ORFs on one strand) and looks for characteristic features of sRNAs in these regions, such as conserved promoters and rho-independent terminator motifs (Livny et al., 2008). Other genomics-based programs rely entirely on sequence features and genomic context (ignoring conservation). sRNAScanner determines intergenic sequences using genome annotation files and differentiates coding from non-coding sequences using position scoring matrices for sequence signals such as ribosome binding sites (RBS) and start codons (Sridhar et al., 2010). A recently published tool, the Pred-GsRNA feature of the PresRAT server, extracts intergenic sequences, also based on genome annotation, and excludes candidates that have an 8 nt sequence found to be depleted in known sRNAs. It scores each predicted sequence with weighted Minimum Free Energy scores for predicted paired and loop regions and scores for the predicted U-rich consensus sequences typical of intrinsic terminators (Kumar et al., 2020). Mycobacterial sRNAs also often lack the recognisable intrinsic terminator motifs at their 3' ends typical of chaperone protein-binding sRNAs (Moores et al., 2017; DiChiara et al., 2010; Arnvig et al., 2014). Furthermore, narrowing the search strictly to intergenic regions (with no annotated genes on either strand) effectively ignores sRNAs and asRNAs generated from coding or antisense regions.
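The consequence of the two definitions of "intergenic" mentioned above (between genes on one strand, versus free of genes on either strand) can be illustrated with a short Python sketch. This is not the implementation used by SIPHT, sRNAScanner or PresRAT; the function names, the 1-based inclusive coordinates and the toy gene list are all assumptions made for illustration.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def intergenic_regions(genes, genome_length, strand=None):
    """Return the gaps between annotated genes.

    genes: list of (start, end, strand), 1-based inclusive coordinates.
    strand=None  -> strict definition: no annotated gene on either strand.
    strand="+"/"-" -> gaps between the genes annotated on that strand only.
    """
    selected = [(s, e) for s, e, gene_strand in genes
                if strand is None or gene_strand == strand]
    gaps = []
    position = 1
    for start, end in merge_intervals(selected):
        if start > position:
            gaps.append((position, start - 1))
        position = max(position, end + 1)
    if position <= genome_length:
        gaps.append((position, genome_length))
    return gaps


# Hypothetical annotation: two overlapping genes on opposite strands.
genes = [(100, 400, "+"), (350, 700, "-"), (900, 1200, "+")]
strict_gaps = intergenic_regions(genes, 1500)
plus_gaps = intergenic_regions(genes, 1500, strand="+")
```

In this toy example, positions 401-700 appear in the plus-strand gaps but not in the strict ones: an asRNA transcribed there, opposite the minus-strand gene, would never be considered by a search restricted to strictly intergenic sequence.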
Transcriptomics-based methods are usually versions of sliding window approaches looking for abrupt increases and drops in the expression signal and using such changes to delineate the limits of putative non-coding elements. High-throughput RNA sequencing (RNA-seq) has uncovered a multitude of short transcripts from intergenic sequences, 5' and 3' UTRs and antisense to coding regions. Sorting the signal from the noise is the main challenge when using these data in non-coding RNA discovery. For example, sensitive methods are able to pick up expressed elements in regions of low read coverage; this signal may represent true low-abundance transcripts but it can also be the result of either technical or transcriptional noise (due to stochastic gene expression). The more sensitive computational methods will therefore inevitably over-predict putative non-coding elements. Ironically, high-depth sequencing has magnified this problem (Tarazona et al., 2011; Mao et al., 2015). Non-fragmented, size-selected libraries, where small transcripts remain intact, are superior for discerning between signal and noise for small RNA transcripts (Leonard et al., 2019; Wang et al., 2016). For all the reasons discussed above, detecting sRNAs expressed at low levels against a background of very strongly expressed coding genes remains a computational challenge.
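The trade-off between sensitivity and over-prediction described above can be sketched with a deliberately simplified caller. The Python example below applies a fixed per-base depth threshold to a coverage vector and reports contiguous runs above it; it is a stand-in for the general idea rather than the algorithm of any published tool, and the function name, thresholds and coverage values are hypothetical.

```python
def call_expressed_regions(coverage, min_depth, min_length):
    """Return (start, end) runs, 0-based and half-open, where the per-base
    coverage is at least min_depth for at least min_length positions."""
    regions = []
    run_start = None
    for position, depth in enumerate(coverage):
        if depth >= min_depth:
            if run_start is None:
                run_start = position  # abrupt rise: open a candidate
        elif run_start is not None:
            if position - run_start >= min_length:
                regions.append((run_start, position))  # drop: close it
            run_start = None
    # Close a run that extends to the end of the coverage vector.
    if run_start is not None and len(coverage) - run_start >= min_length:
        regions.append((run_start, len(coverage)))
    return regions


# Hypothetical coverage: a weak 6-nt blip, then a strong transcript whose
# signal tails off at the 3' end.
coverage = [0] * 5 + [3] * 6 + [0] * 4 + [40] * 10 + [12] * 5 + [0] * 5
strict_calls = call_expressed_regions(coverage, min_depth=10, min_length=5)
sensitive_calls = call_expressed_regions(coverage, min_depth=2, min_length=5)
```

With the higher threshold only the strong transcript is called; lowering the threshold also reports the weak blip, which may be a genuine low-abundance transcript or noise, and the coverage alone cannot tell the two apart.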
Here, we also suggest caution when using publicly available transcriptomic data, some of which dates back to the early use of RNA-seq technologies. In particular, sequencing of strand-specific cDNA libraries, where the information about which strand the transcript originates from is preserved, is invaluable to the discovery of new ncRNAs. Preservation of the strand information avoids mis-mapping asRNAs or other overlapping sRNAs that might otherwise be mapped to a coding gene on the opposite strand.
Many labs have developed their own computational pipelines and scripts to map RNA-seq data, normalise signals and identify ncRNA transcripts across the genome (Ami et al., 2020; Dejesus et al., 2017; Wang et al., 2016; Gómez-Lozano et al., 2014; Miotto et al., 2012), whereas others have carried out this process semi-manually (Arnvig et al., 2011). Progress in the field, and an easy comparison between approaches, has been hindered by the fact that few of the labs publishing computational predictions have made their code readily available.
In response to this challenge, several groups have created publicly-available prediction tools. Indeed, most programs need adjustment to their default parameters in order to respond to sequencing depth and signal abundance (Figure 1), but tuning these parameters can be more a matter of art than of science.

Figure 1. Challenges of predicting non-coding expressed elements from transcriptomic signal alone. Coverage views from two real and one hypothetical Mtb transcriptomic dataset: Illumina high-throughput sequencing datasets from Bioproject accession numbers PRJNA278760, sample SRR1917694 (A) and PRJNA390669, sample SRR5689230 (B); diagrammatic illustration of long-read sequencing coverage of the same region (C). The two RNA-seq samples differ in their sequencing depths: average, non-zero, sequencing depth for the region displayed is 55.6 for (A) and 312.8 for (B). The blue rectangle below the x-axis indicates the genomic region covered by DrrS (MTS1338), an annotated Mtb sRNA of 109 nt. This stable (and by far most abundant) form is cleaved from longer transcripts of 160-400+ nt found in Northern blots (Moores et al., 2017). Both RNA-seq datasets (A & B) display a gradual drop in coverage at the 3' end. In such cases, automatic computational prediction of the correct transcript length is challenging for any algorithm but here the prediction is further complicated by the fact that multiple overlapping transcripts of different length most likely co-exist in the data. Even in the deeply sequenced sample, where the presence of overlapping transcripts could be conjectured, most algorithms would call a single transcript, without additional knowledge of transcription start and termination sites for the refinement of computational predictions. In the absence of such additional data, long-read sequencing might be helpful: as illustrated in diagram (C), long reads whose starts and ends can be unambiguously defined should be helpful in identifying the presence of multiple overlapping transcripts expressed from a single locus. Image created using the Integrative Genomics Viewer (Robinson et al., 2011) and BioRender.com.
The more sophisticated among transcriptomics-based approaches use a combination of sources, such as TSS data or conservation across species, to reduce false positives. DETR'PROK is a Galaxy-based workflow, coordinating over 40 publicly-available Galaxy sequence comparison tools into a pipeline which streamlines the number of user-defined parameters.
However, there are still 14 different user inputs, most of which concern filtering to account for read depth and transcriptional noise (Toffano-Nioche et al., 2013). The recently published ANNOgesic suite of tools utilises multiple third-party software packages, as well as its own scripts, to analyse RNA-seq data and filter predictions. Although the suite includes an sRNAfinder module, using this module in isolation on user-generated alignment files requires specific file formats for the alignment (.wig) and several reference annotation files. Multiple levels of filtering are possible to identify bona fide ncRNAs, but such filtering requires downloading tools and databases such as RNAfold (Denman, 1993), BSRD (Li et al., 2013) and the NCBI nr protein database (NCBI Resource Coordinators, 2014). In the context of validating mycobacterial ncRNA predictions, such databases may be less relevant, given the lack of homology or shared sequence features between mycobacterial and other bacterial ncRNAs. Additionally, fine-tuning cut-off parameters to distinguish signal from noise is ultimately still up to the user. Somewhat surprisingly, the added complexity of such methods does not always translate into more accurate results: in limited comparisons between methods that use additional information and our own simpler, signal-only method, we found that our naïve approach performs comparatively well, most likely because more sophisticated methods often require more tuning of their parameters to take advantage of their added complexity (Ozuna et al., 2019). As the responsibility for parameter tuning is left to the user, methods with fewer parameters, such as Rockhopper, baerhunter or APERO, may be less error-prone and, ultimately, more appealing, especially to non-computational users looking for quick and easy-to-implement solutions. Rockhopper is an independent, Java-based tool designed for bacterial RNA-seq data (McClure et al., 2013).
To eliminate guesswork by the user to adjust for noise vs. signal, the program normalises for read counts using the upper quartile of non-zero gene expression values and generates a transcriptional map of the predicted non-coding elements. Baerhunter (Ozuna et al., 2019) and APERO (Leonard et al., 2019) are lighter tools to install, both written in R and requiring only the most commonly used BAM format alignment files and relevant reference annotations. Like that of Rockhopper, the output of baerhunter is a transcriptional map (in .gff format), and baerhunter can consolidate annotations from multiple samples. APERO exploits improvements in sequencing technology by requiring paired-end reads (where each fragment is sequenced from both ends, producing a pair of reads for each fragment) and optimising parameters for non-fragmented libraries. The output consists of a set of flat files of the predicted transcript 5' and 3' ends for each sample that can then be filtered for read counts and assembled into a genomic context.
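The idea behind upper-quartile normalisation, mentioned above for Rockhopper, can be sketched as follows. This Python example is an illustration of the general technique only and is not Rockhopper's code: each count is divided by the 75th percentile of the non-zero counts in the same sample. The function name, the gene identifiers and the counts are hypothetical, and the choice of the "inclusive" quantile method is an assumption.

```python
import statistics


def upper_quartile_normalise(counts):
    """Divide each count by the upper quartile of the non-zero counts.

    counts: dict mapping a feature id to its raw read count in one sample.
    Returns (normalised counts, upper quartile used as scaling factor).
    """
    non_zero = [value for value in counts.values() if value > 0]
    # quantiles(..., n=4) returns the three quartile cut points;
    # index 2 is the 75th percentile.
    upper_quartile = statistics.quantiles(non_zero, n=4, method="inclusive")[2]
    normalised = {name: value / upper_quartile for name, value in counts.items()}
    return normalised, upper_quartile


# Hypothetical counts: sample_b is the same library sequenced five times deeper.
sample_a = {"g1": 10, "g2": 20, "g3": 30, "g4": 40, "g5": 50, "g6": 0}
sample_b = {name: value * 5 for name, value in sample_a.items()}

normalised_a, uq_a = upper_quartile_normalise(sample_a)
normalised_b, uq_b = upper_quartile_normalise(sample_b)
```

After scaling, the two samples give identical values despite the five-fold difference in depth, which is what allows a single expression cut-off to be applied across libraries without manual adjustment by the user.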
Steps can be taken to lend support to computational predictions of sRNAs and 5' UTRs in mycobacteria. In a recent study to identify differentially expressed, verifiable sRNAs in M. tuberculosis, software predictions based on RNA-seq produced over 200 candidate sRNAs (Dejesus et al., 2017), 82 of which were differentially expressed by 6-fold in at least one experimental condition. In an alternative approach, we filtered the predicted sRNAs by comparing the 5' boundaries of these predictions with a compendium of published predicted TSSs (Cortes et al., 2013; Shell et al., 2015) and narrowed the original list to 67, all of which were also differentially expressed (Appendix 1). Mapping RNase cleavage sites in M. tuberculosis, as has been done for M. smegmatis, could also lend support to the existence of other sRNA candidates cleaved from longer transcripts or otherwise processed (Martini et al., 2019). No matter how comprehensive and complex the prediction algorithm, true validation of candidate ncRNAs requires experimental confirmation, such as using RACE (Rapid Amplification of cDNA Ends) (Frohman et al., 1988) to identify transcript boundaries, and Northern blots to confirm the existence and size(s) of actual RNA transcripts and to confirm expression of orthologous transcripts in related genomes. As computational tools become more specialised for the mycobacterial genome, laboratory resources can be more confidently directed to their predictions.
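A filter of the kind described above, which keeps only predictions whose 5' boundary lies close to a published TSS, might look like the following Python sketch. It is not the script used to produce Appendix 1: the function names, the tolerance of 10 nt and the candidate and TSS coordinates are assumptions chosen for illustration.

```python
def five_prime_end(start, end, strand):
    """The 5' end is the start coordinate on '+' and the end coordinate on '-'."""
    return start if strand == "+" else end


def filter_by_tss(predictions, tss_sites, tolerance=10):
    """Keep predictions whose 5' end lies within `tolerance` nt of a TSS
    mapped to the same strand.

    predictions: list of (name, start, end, strand).
    tss_sites: iterable of (position, strand).
    """
    supported = []
    for name, start, end, strand in predictions:
        boundary = five_prime_end(start, end, strand)
        if any(tss_strand == strand and abs(tss_position - boundary) <= tolerance
               for tss_position, tss_strand in tss_sites):
            supported.append(name)
    return supported


# Hypothetical candidates and TSS positions.
predictions = [
    ("cand1", 1000, 1100, "+"),
    ("cand2", 2000, 2150, "-"),
    ("cand3", 3000, 3080, "+"),
]
tss_sites = {(1003, "+"), (2148, "-"), (3080, "+"), (3000, "-")}
supported = filter_by_tss(predictions, tss_sites)
```

Here `cand3` is rejected because the only TSS at its 5' coordinate lies on the opposite strand, which illustrates why strand information must be carried through both the prediction and the TSS data.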
A further complication in defining the non-coding transcriptome is that putative non-coding elements predicted by computational algorithms may actually be (or contain) as yet unannotated ORFs; there is no way of asserting from the RNA-seq signal alone whether a transcript is coding or non-coding. Early ribosome profiling studies pointed to the presence of hundreds of small peptides encoded in the 5' UTR of mycobacterial transcripts (Shell et al., 2015), and more recent efforts have shown pervasive translation in Mtb, uncovering over 1000 novel ORFs (Smith et al., 2019). The majority of these were short ORFs with noncanonical features that would thus be missed by regular gene prediction algorithms. Although translation of these transcripts does not necessarily render them functional, they may constitute a pool of peptides that are available to use under the right conditions. The observation that leaderless transcripts are translated more efficiently under stress conditions (Sawyer et al., 2021) also points to the fact that mycobacterial non-canonical ORFs may play increasingly important roles in conditions of nutrient starvation or other stresses.
Can we improve the identification of non-coding elements in mycobacteria?
There is limited scope for improving the computational methods used to predict non-coding RNA from the currently available mycobacterial genomic and transcriptomic data. In our experience, the lack of both specificity and sensitivity of current methods can be accounted for by the signal (or absence of it) in the raw data. One problem is that, in a compact mycobacterial genome, overlapping signal from UTRs and ORFs may confuse algorithms and stop them from correctly predicting the limits of transcripts. In such cases, some level of manual curation is often needed, guided by visualisation on a genome viewer such as Artemis (Carver et al., 2012) or IGV (Robinson et al., 2011). Another source of problems is the use of short reads in the currently most popular sequencing protocol. Typical Illumina RNA-seq reads are 75-150 base pairs long and are mapped in overlapping segments, preferentially using paired ends, to infer a longer transcriptional unit. Many genes in mycobacteria, including those for sRNAs and asRNAs, are transcribed as polycistronic transcripts, where multiple sequential genes are transcribed into a single transcript. The individual overlapping transcripts of varying lengths are often difficult to detect with standard RNA-seq (Figure 1). The development of specialised RNA-seq methods, such as dRNA-seq (Sharma et al., 2010) to enrich for the 5' end of primary transcripts and map TSSs, and Term-seq (D. Dar et al., 2016) to find 3' termini, offers information that can be used to address the issue of overlapping signal from distinct transcripts. Moreover, ribosome profiling (Ingolia et al., 2009) will continue to be instrumental in resolving ambiguities in annotation of ORFs versus non-coding elements in untranslated regions (Shell et al., 2015; Smith et al., 2019).
Although such information can already be integrated in a subset of computational pipelines (Yu et al., 2018), the corresponding data is only available for a limited number of reference mycobacterial strains.
Perhaps the most promising new technologies for studying whole transcriptomes are those based on long-read sequencing. PacBio SMRT or Oxford Nanopore Technologies sequencing can achieve reads several thousand nucleotides long, resolving issues associated with errors in the assembly of short reads. The selection of primary RNA transcripts that have not been fragmented in cDNA library preparation makes it possible to reconstruct an entire transcriptome with a high level of confidence. In addition, the ability to sequence full polycistronic transcripts allows the surveying of dynamic changes to the structure of bacterial operons in response to a change in the conditions of growth (Yan et al., 2018). Nanopore sequencing goes one step further in making it possible to sequence native RNA molecules, allowing the detection of post-transcriptional modifications of individual nucleotides that would otherwise be lost during reverse transcription to cDNA (Grünberger et al., 2020). The technologies are still evolving, but with bioinformatics improvements to resolve technical issues of noise from the nanopore and saturation (Soneson et al., 2019), long-read sequencing will become a valuable tool for studying transcriptomes. The longer reads will certainly improve our mapping of 5' and 3' UTRs, as well as our understanding of the dynamic nature of bacterial transcriptional units, but discriminating coding from non-coding elements and sRNAs from UTRs, and identifying original versus processed or degraded transcripts, will remain a problem.
One, perhaps less obvious, way in which new sequencing technologies may prove instrumental in improving the prediction of non-coding RNAs is their role in improving the assembly of genomic sequences against which RNA-seq reads are mapped. Currently, Mtb transcript mapping relies on the genome of the laboratory-cultured strain, H37Rv, which shows considerable differences compared to clinical and field strains, or isolates, that have adapted to different environmental pressures (Shockey et al., 2019; O'Neill et al., 2015). As SNPs in promoter regions and small insertions/deletions may play a major role in regulating the expression of non-coding elements (Dinan et al., 2014), it is clear that using the correct genomic sequence is important when analysing transcriptomes of non-reference strains. Sequencing and assembly of potentially thousands more strains are being facilitated by technologies offering portable sequencing platforms, and we can expect the number of mycobacterial genomes available in public databases to increase manifold in the next few years as such methods grow in popularity.
Having the correct genomic sequence available is important but correct annotation is arguably just as important, given how many algorithms rely on the annotation of coding elements to make predictions of non-coding ones. Homologous predicted sRNAs are sometimes annotated as protein-coding in one genome and non-coding in another, and could, in fact, be dual-function sRNAs (Vanderpool et al., 2011). This is especially obvious when trying to compare non-coding elements or small ORFs in different lineages of Mtb (Arnvig & Young, 2012). To improve annotation efforts, the idea of assembling MTBC pangenomes that differentiate core genes (including non-coding ones) present in all lineages, from accessory genes present in a subset and unique genes present in only one strain or lineage, is an appealing one (Vernikos et al., 2015). Although members of the MTBC are assumed to share very high sequence identity, this assumption is rooted primarily in comparisons of reference sequences and less so in comparisons of circulating strains. For example, the sequencing of M. bovis strains that cannot be classified in current clonal complexes suggested that diversity within this species may be higher than previously thought (Zimpel et al., 2020). Pangenomic projects to date have primarily focussed on identifying differences in antibiotic susceptibility/resistance among clinical strains of Mtb (H. A. Dar et al., 2020; Rufai et al., 2020), but whole genome sequencing projects of clinical and field strains to assemble lineage-specific pangenomes for both human- and animal-adapted MTBC members would allow comparisons and provide a more accurate picture of the extent of riboregulation and its effect on host-specificity and other phenotypic differences (Zimpel et al., 2017).
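The core/accessory/unique partition described above reduces to counting in how many genomes each element is present. The Python sketch below illustrates this under the assumption that elements (coding or non-coding) have already been grouped into families with shared identifiers, which is the difficult part in practice; the genome names and family identifiers are made up.

```python
def classify_pangenome(presence):
    """Partition element families into core, accessory and unique sets.

    presence: dict mapping a genome name to the set of family ids found in it.
    core      -> present in every genome
    unique    -> present in exactly one genome (when more than one is compared)
    accessory -> everything in between
    """
    genome_count = len(presence)
    occurrences = {}
    for families in presence.values():
        for family in families:
            occurrences[family] = occurrences.get(family, 0) + 1
    core = {f for f, n in occurrences.items() if n == genome_count}
    unique = {f for f, n in occurrences.items() if n == 1 and genome_count > 1}
    accessory = set(occurrences) - core - unique
    return core, accessory, unique


# Hypothetical presence/absence data for three genomes.
presence = {
    "genome_A": {"fam1", "fam2", "fam3"},
    "genome_B": {"fam1", "fam2", "fam4"},
    "genome_C": {"fam1", "fam5"},
}
core, accessory, unique = classify_pangenome(presence)
```

The same bookkeeping applies whether the families are protein-coding genes or predicted ncRNAs, provided that the annotation of each genome has been produced in a comparable way.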

Conclusions
A definitive answer to the question, "How many non-coding RNA elements exist in mycobacterial genomes?", is not yet possible. Although several computational methods have been developed to support this area of research, our knowledge is currently limited by the availability and quality of raw data. We believe the key to constructing an atlas of the mycobacterial non-coding universe is recognising both the diversity of the individual genomes, and the dynamic nature of the corresponding transcriptomes. Integration of existing and new sequencing technologies and close collaboration between experimental and computational groups should allow us to progress faster towards this goal.

Funding information
J.S. is funded by a Bloomsbury Colleges PhD Studentship (LIDo programme).

Conflicts of interest
The authors declare that there are no conflicts of interest.