Preprint
Article

This version is not peer-reviewed.

Classifying Consensus Sequences Using Point-Set Representations

Submitted:

08 April 2026

Posted:

14 April 2026


Abstract
Consensus sequences at sites such as exon-intron boundaries or branch points are succinctly displayed with sequence logos. Implicit in this representation is a presumption of independence of nucleic acids at distinct sites; consequently, sequence logos fail to elicit features such as correlations between neighboring sites or sub-sequences, which may be crucial for hypothesis testing, especially in searching for principles underlying the nature of consensus sequences. We introduce a graphical approach, based on point-set representations, to display such secondary features. Probability distribution functions (PDFs) on these point-sets are used to highlight correlations at exon-intron boundaries and at branch points. Differences in PDFs at normal exon-exon boundaries and cancer fusion junctions, as well as differential correlations at cancer junctions, are evaluated and shown to have similar characteristics. The formation of cancer junctions appears to pass through a more restrictive selection than the creation of normal exon-exon junctions.

1. Motivation & Introduction

Our genome consists of two three-billion-long, pairwise-linked strands of the nucleic acids guanine (G), cytosine (C), adenine (A) and thymine (T), anchored by two sugar-phosphate backbones. In stable states, all complementary nucleic-acid pairs between the two strands are Watson-Crick pairs G-C and A-T [1], although there can be transient non-Watson-Crick pairings. When pre-messenger RNAs (pre-mRNA) are transcribed from the genome, thymine is replaced by uracil (U). Unlike in the genome, pairing between nucleic acids in RNA strings can be non-unique [2]. For example, multiple sub-sequences are found at exon-intron boundaries or branch points, as discussed below. The most likely collection of sub-sequences forms the consensus sequences, which are displayed using sequence logos [3,4,5,6], such as that shown in Figure 1, or modified versions including energy-normalized sequence logos [7] or log-odds sequence logos [8]. The presence of non-unique binding sequences may originate from smaller energy differentials between pairs of nucleic acids in RNA than those in DNA, as found in ab initio quantum mechanical calculations [9,10,11,12,13,14,15,16]. As seen from the figure, the site immediately to the right of the boundary (the +1 site) is always a G, while the next (+2) is a U in over 99% of the cases. Sequence logos presuppose independence between nucleic acids at distinct binding sites, and consequently fail to elucidate facets such as correlations between neighboring sites or between short neighboring sequences. This paper introduces an approach that aids in visualizing such characteristics.
Shortly after transcription, pre-mRNAs are spliced through a complex series of actions carried out in a 20 nm RNA-protein complex, the spliceosome [17,18,19,20,21]. Segments of RNA that are spliced out are referred to as introns and those retained as exons [17,22,23,24,25,26,27]. Typically, consecutive exons are fused to form messenger RNA (mRNA), although sporadically some exons are omitted, yielding alternatively spliced mRNA [28,29,30,31]. In addition, transcripts from different genes occasionally ligate via trans-splicing [32,33,34,35,36,37]. Some alternatively spliced and trans-spliced mRNAs are translated to proteins as well [28], thus enriching the proteome [38].
Parts of the machinery used in splicing are the small nuclear RNAs (snRNA) U1, U2, U4, U5, and U6. U1 initiates splicing by docking to an exon-intron boundary while U2 adheres to a downstream site, the branch point, which lies close to the 3′-end of the intron. The two appendages aid in selecting the intronic region to be spliced out. Machine learning algorithms can be trained to identify exon-intron boundaries and branch points with extremely high precision [39,40]. The current benchmark is SpliceAI [41,42,43], which requires training sequences as long as several thousand nucleic acids in order to perform the task efficiently. However, the recognition regions of U1 and U2, which play a crucial role in docking to the pre-mRNA, are only 10 nucleotides long [44], suggesting perhaps a mechanism in splicing that does not rely on pattern-recognition of long sequences. Such a mechanism may reveal physical principles underlying the preferential selections summarized in sequence logos.
Figure 1. A sequence logo representing the probability of finding nucleic acids at sites neighboring an exon-intron boundary. The two left-most sites within the intron are nearly always GU. The likelihoods at other sites are proportional to the heights of the symbols.
Unfortunately, sequence logos can only furnish frequencies of appearance of nucleic acids at sites of interest. Other facets, such as potential correlations between adjoining sites or between small neighboring sub-sequences, cannot be elicited from them. A well-known example is the compensation effect, which asserts that when the location 5 sites to the right of the exon-intron boundary (site +5) is not occupied by a G, the nucleotide immediately to the left of the boundary (site −1) is very likely a G [45,46,47]. Equivalently, if the nucleotide at the −1 site is not G, the +5 nucleotide is likely to be G. The ability to visualize and quantify such secondary facets in nucleic acid sequences can aid in unraveling principles underlying preferential selection of consensus sequences.
We introduce a new visual representation of nucleic acid neighborhoods, referred to as point-sets [48], that not only captures the probabilities at nucleic acid sites but also permits easy visualization of correlations between neighboring sites or sub-sequences. As applications, we highlight features that emerge from point-set representations at exon-intron boundaries and at branch points. Furthermore, we use point-sets to characterize differences between normal exon-exon boundaries and cancer fusion junctions.

2. Point-Set Representation

Consider a $(2M+1)$-long segment of a genomic sequence $\{S_n\}_{n=-M}^{M}$, where the $S_k$ are nucleic acids A, C, G or U, and $M$ is sufficiently large. We search for properties in the neighborhood of $S_0$. The representation introduced below is predicated on the assumption that sites in closer proximity to $S_0$ play a more significant role in the fate of a site, e.g., whether it becomes an exon-intron boundary or a branch point. As an example, suppose we wish to characterize the nature of the neighborhood of the GU dyad (hereafter expressed as |GU|) at the exon-intron boundary. Begin by assigning the integers $\{0, 1, 2, 3\}$ to $\{A, G, U, C\}$ [48]. We can search $N$-long sub-sequences ($N + 1 \ll M$) to the right (anterior) and to the left (posterior) of |GU|. The sub-sequences can be uniquely represented through a pair of numbers defined through the quaternary expansions
$$x = \sum_{i=1}^{N} \frac{1}{4^{i}}\, S_{2+i}, \qquad y = \sum_{i=1}^{N} \frac{1}{4^{i}}\, S_{-i}. \tag{1}$$
The coordinates $x$ and $y$ bear a one-to-one relationship to the (length $N$) anterior and posterior sub-sequences of |GU|, with more proximal sites contributing a higher weight. The set of points $(x, y)$ for a collection of exon-intron boundaries is the associated point-set. Below, we provide several examples of point-sets and highlight features that can be inferred from probability density functions $\rho(x, y)$, such as those in Figure 2.
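As a concrete illustration of the quaternary expansions, the following minimal sketch maps a written neighborhood such as CAG|GU|AAG to its point $(x, y)$. The function names `quaternary_coordinate` and `point` are hypothetical, and the sketch assumes that the posterior sub-sequence is supplied as written (left to right), so it is reversed to place the site nearest the dyad first. Exact fractions are used so that the one-to-one correspondence is not blurred by rounding.

```python
from fractions import Fraction

# Digit assignment stated in the text: A, G, U, C -> 0, 1, 2, 3.
DIGIT = {"A": 0, "G": 1, "U": 2, "C": 3}

def quaternary_coordinate(sites):
    """Sum of digit / 4**i over sites ordered nearest-first (i = 1, 2, ..., N)."""
    return sum(Fraction(DIGIT[s], 4 ** i) for i, s in enumerate(sites, start=1))

def point(posterior, anterior):
    """Map written sub-sequences, e.g. 'CAG' |GU| 'AAG', to the pair (x, y).

    Assumption: `posterior` is written left to right, so its last letter is
    the site nearest the dyad; it is reversed before expansion.
    """
    x = quaternary_coordinate(anterior)
    y = quaternary_coordinate(posterior[::-1])
    return x, y

x, y = point("CAG", "AAG")
print(x, y)  # 1/64 19/64
```

Here the anterior AAG gives $x = 0/4 + 0/16 + 1/64$, and the posterior CAG, read outward from the dyad as G, A, C, gives $y = 1/4 + 0/16 + 3/64$.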
Eqns. (1) imply that $0 \le x, y < 1$. Figure 2 shows the probability density function of the point-set at exon-intron junctions in $[0, 1] \times [0, 1]$. With $N = 3$, the domain is partitioned into $64 \times 64$ squares. To aid with visualization, the $x$ and $y$ ranges are partitioned into quartiles by thick white lines. The four quartiles in the $x$ direction correspond to sub-sequences whose leading anterior nucleic acid (the $i = 1$ term of $x$) is 0, 1, 2 or 3; i.e., sub-sequences with +1A, +1G, +1U and +1C, respectively. Each quartile is partitioned by thinner lines into sub-quartiles, which correspond to +2A, +2G, +2U and +2C. They, in turn, are quartered according to the nucleic acid at the +3 site. The corresponding partitioning of the $y$ axis yields primary quartiles −1A, −1G, −1U and −1C, and sub-partitions defined by the nucleic acids at sites −2 and −3. Note that the length of the sub-sequences, $N$, can be increased for higher resolution at the cost of squeezing individual domains.
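The partition into $4^N \times 4^N$ squares can be sketched as a simple binning step. In the minimal example below, each sub-sequence is converted to a base-4 integer, which equals the integer part of $4^N$ times the corresponding coordinate, and the fraction of junctions falling in each square is tallied. The names `cell_index` and `density`, the toy input, and the normalization (fractions summing to one) are assumptions made for illustration; they are not taken from the analysis behind Figure 2.

```python
from collections import Counter

# Digit assignment stated in the text: A, G, U, C -> 0, 1, 2, 3.
DIGIT = {"A": 0, "G": 1, "U": 2, "C": 3}

def cell_index(sites):
    """Base-4 integer with the nearest site as the most significant digit."""
    index = 0
    for s in sites:
        index = 4 * index + DIGIT[s]
    return index

def density(junctions):
    """Fraction of junctions in each occupied square.

    `junctions` is an iterable of (posterior, anterior) strings as written;
    the posterior string is reversed so that the nearest site comes first.
    Returns a dict {(ix, iy): fraction}; empty squares are omitted.
    """
    counts = Counter(
        (cell_index(anterior), cell_index(posterior[::-1]))
        for posterior, anterior in junctions
    )
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}

# Made-up toy neighborhoods, not real junction data.
toy = [("CAG", "AAG"), ("CAG", "AAG"), ("AAG", "GAG"), ("UUG", "GUA")]
rho = density(toy)
print(rho[(1, 19)])  # 0.5
```

Dividing a cell index by 16 (integer division) recovers the primary quartile, so the square (1, 19) lies in the +1A column and the −1G row.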

2.1. Exon-Intron Junctions

Assignment of an exon-intron boundary is initiated by the docking of a U1 snRNA, whose recognition region [47,49] is
U1: GUC|CA|ΨΨCAUA,
Ψ being a pseudo-uridine, a modified version of U. Up to 3 nucleic acids to the left of |GU| and 6 to its right may be used for docking [50]. Prior to hypothesizing on rules governing this attachment, it is necessary to classify as many features of consensus sequences as possible. We thus present the probability density function $\rho(x, y)$ of nucleic acid sequences in the immediate neighborhood of known exon-intron junctions in chromosomes 1–22. The input data contain 608,000 junctions given at the UCSC Genome Browser (http://genome.ucsc.edu, genome hg38) [51]. Coordinates $x$ and $y$ for each junction are evaluated using Eqn. (1), and the probability density function (PDF), shown in Figure 2, is computed. It should be noted that the PDF of the corresponding neighborhoods of arbitrary GU dyads is significantly different.
Figure 2. Probability density function at exon-intron junctions at (a) lower and (b) higher color gradation. The four quarters in the x-axis correspond to +1A, +1G, +1U, and +1C. The image on the left shows that the most frequent neighborhoods are NNG|GU|AAG, NNG|GU|GAG and CAG|GU|ANN, where N can be any nucleotide. The higher-resolution image on the right shows the next most frequent nucleic acid neighborhoods, which include AAG|GU|ANN, CAG|GU|GNN, NNN|GU|AAG and NNN|GU|GAG.
Figure 2(a) shows the probability density in its entire range [0, 100]. Two vertical high-density alleys, NNG|GU|AAG and NNG|GU|GAG, and a horizontal alley, CAG|GU|ANN, can be noted. Here N represents any nucleic acid. Figure 2(b) refines to the color range [0, 10], highlighting the next most dense sub-sequences, including AAG|GU|ANN, CAG|GU|GNN, NNN|GU|AAG and NNN|GU|GAG.
Correlations between neighboring sites of consensus sequences are easily extracted from the point-set representation. For example, when the −1 site is not a G (i.e., the first, third and fourth quartiles in the vertical direction), the most frequent anterior sequences are AAG and GAG. This not only reinforces the compensation effect (i.e., +5G) but also shows, in addition, that the +4 site is primarily A. Similarly, when the +5 site is not G (in particular, outside the two dominant vertical alleys), the −1 site is almost exclusively G. Note also that the posterior sequences AAG and CAG are more prevalent than others. Furthermore, recent investigations [16,47,52] have proposed that there are two distinct classes of junction neighborhoods, one of type NN|GU|RAG and the other AG|GU|NNN, where R is a purine (G or A). The two vertical lines in Figure 2(b) correspond to the former class and the horizontal dense region to the latter. We reiterate that such subtleties are impossible to deduce from sequence logos.
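Conditional statements of this kind can also be checked numerically on a list of junction neighborhoods. The sketch below estimates the fraction of junctions with G at +5 among those whose −1 site is not G. It assumes 3-long neighborhoods written as posterior|GU|anterior, so that the last posterior letter is site −1 and the third anterior letter is site +5; the helper `conditional_fraction` and the toy list are hypothetical and are not drawn from the UCSC data.

```python
def conditional_fraction(junctions, condition, event):
    """Fraction of junctions satisfying `event` among those satisfying `condition`.

    Returns None when no junction satisfies the condition.
    """
    selected = [j for j in junctions if condition(j)]
    if not selected:
        return None
    return sum(1 for j in selected if event(j)) / len(selected)

def minus1(junction):
    """Nucleotide at site -1: the last letter of the posterior sub-sequence."""
    return junction[0][-1]

def plus5(junction):
    """Nucleotide at site +5: the third letter of the anterior sub-sequence."""
    return junction[1][2]

# Made-up toy neighborhoods (posterior, anterior), not real junction data.
toy = [("CAG", "AAG"), ("CAG", "AUA"), ("AAG", "GUA"),
       ("CAA", "AAG"), ("UCU", "GAG"), ("ACC", "AAG"), ("GAU", "ACU")]

p = conditional_fraction(toy,
                         condition=lambda j: minus1(j) != "G",
                         event=lambda j: plus5(j) == "G")
print(p)  # 0.75
```

Swapping the two lambdas gives the converse estimate, the fraction of junctions with G at −1 among those lacking G at +5.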

2.2. Branch Points

The U2 snRNA, with the recognition region
U2: AΨGAΨGUG,
is the preliminary selector of the branch point, which precedes the polypyrimidine tract and sits 30 nucleotides from the end of the intron [53]. Its docking initiates a complex series of actions by several protein complexes to shed the intron. The branch point itself is not unique, being A in 79%, C in 9%, G in 4% and U in 8% of the 56,000 unique experimental samples given in Ref. [54]. Sequence logos for the four cases differ significantly, as seen from Figure 3.
Figure 4 shows the probability density functions of point-sets for the four cases and highlights the most frequent consensus sequences. The anterior (resp. posterior) 3-long sequences are those to the right (resp. left) of the branch point. With a branch point A, the consensus sequences have the forms CUN|A|UNN and CUN|A|CNN, while sequences NCU|C|CCN and NCU|C|CUN dominate when the branch point is C. The sequences NCU|G|ANN (resp. NNC|U|GAN) are most frequent when the branch point is G (resp. U). Correlations are also manifest. In Figure 4(a), the anterior sequences are primarily of types ACU or AUU when the −2 site is not U (i.e., outside the horizontal lines). When the branch point is U, i.e., Figure 4(d), the −1 site is dominantly C when the first two posterior sites are GA or AA or when the first three sites are GCC.

2.3. Normal Exon-Exon Junctions vs. Cancer Fusions

The splicing process terminates when successive exons (or, in the case of alternative splicing, an exon and one that is further away) are fused. Errors in alternative or trans-splicing can initiate hereditary diseases such as Parkinson’s, neurodegenerative diseases, cystic fibrosis, muscular dystrophy and some types of cancer [36,37,53,55,56,57,58,59,60]. Fusing of exons occurs through the docking of anterior and posterior ends to the U5 snRNA [52], whose recognition region is [47]
U5: ΨC$_\mathrm{m}$AΨUUUCCG$_\mathrm{m}$C,
where G$_\mathrm{m}$ and C$_\mathrm{m}$ are methylated forms of guanine and cytosine [61].
Figure 5 shows the sequence logos for junctions with successive exons and for cancer-inducing fusion junctions. Although the order of frequencies at sites −3, +2 and +3 differs between the groups, the frequencies themselves are not significantly different. Little can be inferred about differences between the two groups from sequence logos.
The corresponding point-sets are shown in Figure 6(a) and Figure 6(b). Once again, they are fairly similar, containing dominant densities at junctions NAG|NNN, with the highest densities at CAG|GNN, CAG|ANN, AAG|GNN, and GAG|GNN. Point-sets for the exon-exon junctions are derived using the data from the UCSC Genome Browser (http://genome.ucsc.edu, genome hg38) [51], while those for cancer junctions are extracted from the University of Texas, Houston Fusion Gene Annotation Database (https://compbio.uth.edu/FusionGDB2/) [62,63,64]. A similar point-set was found for fusion junctions from the University of Houston Human Genome Sequencing Center.
Figure 5. Sequence logos at (a) normal exon-exon junctions and (b) cancer fusion junctions. Although the order of some frequencies (−3, +2 and +3 sites) has changed, the frequencies themselves are not significantly different.
Figure 6. Point-sets at (a) normal exon-exon junctions and (b) cancer fusion junctions. In each case, the dominant forms of the junction are of type NAG|NNN, with the highest frequencies at CAG|ANN, CAG|GNN, AAG|GNN, etc.
Differences between the two distribution functions, shown in Figure 7, highlight changes prevalent in cancer fusions. The sub-groups are fairly scattered, but with large discords in types NAG|NNN and especially NAG|GNN.
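A difference map of this kind amounts to a cell-wise subtraction of two binned densities. The following minimal sketch assumes each density is stored as a dictionary from grid cells to fractions, with absent cells treated as zero. The function name `density_difference`, the sign convention (second argument minus first) and the toy values are illustrative assumptions, not the procedure used for Figure 7.

```python
def density_difference(rho_a, rho_b):
    """Cell-wise rho_b - rho_a over the union of occupied cells.

    Cells missing from either dictionary are treated as having zero density.
    """
    cells = set(rho_a) | set(rho_b)
    return {c: rho_b.get(c, 0.0) - rho_a.get(c, 0.0) for c in cells}

# Made-up toy densities keyed by (ix, iy) grid cells; not real data.
normal = {(1, 19): 0.5, (5, 16): 0.5}
fusion = {(1, 19): 0.25, (5, 16): 0.5, (20, 19): 0.25}

diff = density_difference(normal, fusion)
print(sorted(diff.items()))
# [((1, 19), -0.25), ((5, 16), 0.0), ((20, 19), 0.25)]
```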

2.4. Point-Sets to Visualize Correlations

Using differences between normal exon-exon junctions and cancer junctions, we illustrate how point-sets can be used to make inferences on higher-order correlations. As before, junctions are classified using 3-long anterior and posterior sub-sequences. If the junctions are joined without a preferential selection, the PDF of the synthesized sequence will be the product of the densities $\rho_X(x)$ and $\rho_Y(y)$ of the anterior and posterior sub-sequences; any difference would signify a favored or disfavored fusing of constituents. Thus, for each anterior and posterior sub-sequence of length $N = 3$, we evaluate
$$C(x, y) = \rho(x, y) - \rho_X(x) \cdot \rho_Y(y). \tag{2}$$
$C(x, y)$ for normal exon-exon junctions (Figure 8(a)) differs significantly from zero only when the posterior sequence is of the type AAG or CAG and at isolated sequences such as CUG|GUG. In contrast, differences in cancer fusion junctions exhibit a large scatter.
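The departure of a joint density from the product of its marginals can be computed directly on the binned grid. In the minimal sketch below, the marginals $\rho_X$ and $\rho_Y$ are obtained by summing a joint dictionary over rows and columns, and the difference is evaluated on every pair of occupied anterior and posterior indices. The function name `correlation` and the two toy distributions are assumptions for illustration; cell fractions are used in place of densities, which changes the result only by a constant factor.

```python
from collections import defaultdict

def correlation(rho):
    """Return C = rho(x, y) - rho_X(x) * rho_Y(y) on the occupied grid.

    `rho` maps (ix, iy) cells to fractions summing to one. The marginals are
    sums over the other index; cells absent from `rho` count as zero.
    """
    rho_x = defaultdict(float)
    rho_y = defaultdict(float)
    for (ix, iy), p in rho.items():
        rho_x[ix] += p
        rho_y[iy] += p
    return {
        (ix, iy): rho.get((ix, iy), 0.0) - px * py
        for ix, px in rho_x.items()
        for iy, py in rho_y.items()
    }

# Toy case 1: anterior and posterior indices chosen independently.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Toy case 2: each anterior index always pairs with one posterior index.
coupled = {(0, 0): 0.5, (1, 1): 0.5}

print(correlation(independent)[(0, 0)])  # 0.0
c = correlation(coupled)
print(c[(0, 0)], c[(0, 1)])  # 0.25 -0.25
```

Positive values mark pairings that occur more often than non-preferential joining would produce, and negative values mark pairings that occur less often.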
Observe the similarity of Figure 8(b) to Figure 7; sequences at cancer junctions are those whose frequency differs most significantly from non-preferential fusing of anterior and posterior segments. It is now left to establish causes for this observation.

3. Discussion

Although splicing of pre-mRNA is overall a highly complex process involving a co-transcriptional synthesis of the dynamic ribonucleoprotein complex spliceosome [17,65,66,67,68,69,70], some aspects, especially the docking of snRNAs U1, U2 and perhaps of U5, are potentially simpler to comprehend. The expectation is predicated on the small number of nucleic acids (10) within recognition regions of the snRNAs. However, even such small domains attach to multiple consensus sequences displayed succinctly via sequence logos [4,5,6].
One may wish to advance and test hypotheses on physical principles that underlie the choice of consensus sequences. Addressing such queries requires a comprehensive characterization of sequences, beyond the frequencies displayed in sequence logos. A well-known example is the compensation effect, which asserts a strong correlation between the presence of guanine at the +5 and −1 sites at an exon-intron boundary [45,46,47]. One goal of our investigations was to introduce visual means to display such secondary features.
Point-sets contain information on subsequences on either side of a nucleic acid site of interest, such as an exon-intron boundary or a branch point. Their form is predicated on the assumption that generally sites nearer to the location of interest will play a more significant role in its fate. As the length N of the anterior and posterior sequences is increased, a fractal structure of a point-set emerges [48]. For example, fractal characteristics of point-sets, such as fractal dimension [71], generalized dimensions [72] or the singularity spectra [73] can be used to quantify differences between sub-sequences belonging to exons and introns [74]. Furthermore, it is possible to train machine learning algorithms [75] to recognize sequence segments within exons and introns [74].
Point-sets at branch points clearly illustrate that the nucleic acid at the branch point determines the associated consensus sequence, see Figure 4. Similar differences may be present in distinct classes of splicing, such as normal splicing and self-splicing [76,77,78], but it is difficult to find a sufficiently large data set to implement a statistical analysis on the latter group.
Through examples, we showed how point-sets can be used to infer features of consensus sequences that were not accessible from sequence logos. Probability density functions and correlation functions, such as those introduced in Section 2, can be extremely useful in this context. In particular, it was possible to infer that cancer fusion junctions have a higher likelihood of selective fusing of anterior and posterior segments. The next steps in our program involve the search for physical principles that underlie the origins of statistical features elicited from point-sets.
The authors wish to thank Predrag Cvitanović, Michael Goldberg, Quentin Vicens and Andrew Török for insightful discussions.

References

  1. M. Goldberg, J. Fischer, L. Hood, L. Hartwell, C. Aquadro, L. Silver, and A. E. Reynolds. Genetics: From Genes to Genomes, 7th Edition. McGraw-Hill Publishing, 2021.
  2. X. Roca, A. R. Krainer, and I. C. Eperon. Pick one, but be quick: 5′ splice sites and the problems of too many choices. Genes and Development, 27:129–144, 2013. [CrossRef]
  3. T. D. Schneider and R. M. Stephens. Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, 18(20):6097–6100, 1990. [CrossRef]
  4. I. B. Rogozin and L. Milanesi. Analysis of donor splice sites in different eukaryotic organisms. Journal of Molecular Evolution, 45:50–59, 1997. [CrossRef]
  5. C. R. Sibley, L. Blazquez, and J. Ule. Lessons from non-canonical splicing. Nature Reviews Genetics, 17(7):407–421, 2016. [CrossRef]
  6. S. Hümmer, S. Borao, s. Guerra-Moreno, L. Cozzuto, E. Hidalgo, and J. Ayte. Cross talk between the upstream exon-intron junction and Prp2 facilitates splicing of non-consensus introns. Cell Reports, 37:109893, 2021. [CrossRef]
  7. C. T. Workman, Y. Yin, D. L. Corcoran, T. Ideker, G. D. Stormo, and P. V. Benos. enoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Research, 33:W389–W392, 2005. [CrossRef]
  8. Y.-K. Yu, J. A. Capra, A. Stojmirović, D. Landsman, and S. F. Altschul. Log-odds sequence logos. Bioinformatics, 31:324–331, 2014.
  9. J. Sponer, J. Leszczynski, and P. Hobza. Nature of nucleic acid-base stacking: Nonempirical ab initio and empirical potential characterization of 10 stacked base dimers. comparison of stacked and h-bonded base pairs. Journal of Chemical Physics, 100:5590–5596, 1996. [CrossRef]
  10. P. Jurecka, J. Sponer, J. Cerny, and P. Hobza. Benchmark database of accurate (MP2 and CCSD(T) complete basis set limit) interaction energies of small model complexes, DNA base pairs, and amino acid pairs. Physical Chemistry Chemical Physics, 8:1985–1993, 2006. [CrossRef]
  11. R. Olivia, L. Cavallo, and A. Tramontano. Accurate energies of hydrogen bonded nucleic acid base pairs and triplets in tRNA tertiary interactions. Nucleic Acids Research, 34:865–879, 2006. [CrossRef]
  12. C. A. Johnson, R. J. Bloomingdale, V. E. Ponnusamy, C. A. Tillinghast, B. M. Znosko, and M. Lewis. A computational model for predicting experimental RNA and DNA nearest-neighbor free energy rankings. Journal of Chemical Physics, 115:9244–9251, 2011. [CrossRef]
  13. E. A. Jolley, M. Lewis, and B. M. Znosko. A computational model for predicting experimental RNA nearest-neighbor free energy rankings: Inosine–Uridine pairs. Chemical Physics Letters, 639:157–160, 2015. [CrossRef]
  14. S. C. Leon, M. Prentiss, and M. Fyta. Binding energies of nucleobase complexes: Relevance to homology recognition of DNA. Physical Review E, 93:06210, 2016. [CrossRef]
  15. M.C. Hopfinger, C. C. Kirkpatrick, and B. M. Znosko. Predictions and analyses of RNA nearest neighbor parameters for modified nucleotides. Nucleic Acids Research, 48:8901–8913, 2020. [CrossRef]
  16. M.T. Parker, B.K. Soanes, J. Kusakina, A. Larrieu, K. Knop, et al. m6A modification of U6 snRNA modulates usage of two major classes of pre-mRNA 5′ splice site. eLife, 11:e78808, 2022.
  17. M. C. Wahl, C. L. Will, and R. Lührmann. The spliceosome: design principles of a dynamic RNP machine. Cell, 136(4):701–718, 2009. [CrossRef]
  18. C. L Will and R. Lührmann. Spliceosome structure and function. Cold Spring Harbor Perspectives in Biology, 3(7):a003707, 2011.
  19. Klemens J. Hertel. Spliceosomal Pre-mRNA Splicing Methods and Protocols. Methods in Molecular Biology, 1126. Humana Press, Totowa, NJ, 1st ed. 2014. edition, 2014.
  20. A. G. Matera and Z. Wang. A day in the life of the spliceosome. Nature Reviews Molecular Cell Biology, 15(2):108–121, 2014.
  21. E. C. Merkhofer, P. Hu, and T. L. Johnson. Introduction to co-transcriptional RNA splicing. Spliceosomal Pre-mRNA Splicing: Methods and Protocols, pages 83–96, 2014.
  22. W. Gilbert. Why genes in pieces? Nature, 271(5645):501–501, 1978. [CrossRef]
  23. N. K. Kadri, X. M. Mapel, and H. Pausch. The intronic branch point sequence is under strong evolutionary constraint in the bovine and human genome. Communications Biology, 4(1):1206, 2021. [CrossRef]
  24. E. L Lasda and T. Blumenthal. Trans-splicing. Wiley Interdisciplinary Reviews: RNA, 2(3):417–434, 2011.
  25. M. Hiller, Z. Zhang, R. Backofen, and S. Stamm. Pre-mRNA secondary structures influence exon recognition. PLoS Genetics, 3(11):e204, 2007. [CrossRef]
  26. M. Long and M. Deutsch. Intron exon structures of eukaryotic model organisms. Nucleic Acids Research, 27(15):3219–3228, 1999. [CrossRef]
  27. L. Zhu, Y. Zhang, W. Zhang, S. Yang, J.-Q. Chen, and D. Tian. Patterns of exon-intron architecture variation of genes in eukaryotic genomes. BMC Genomics, 10:1–12, 2009. [CrossRef]
  28. Y. Wang, J. Liu, B. O. Huang, Y.-M. Xu, J. Li, L.-F. Huang, J. Lin, J. Zhang, Q.-H. Min, W.-M. Yang, et al. Mechanism of alternative splicing and its regulation. Biomedical Reports, 3(2):152–158, 2015.
  29. N. Stepankiw, M. Raghavan, E. A. Fogarty, A. Grimson, and J. A. Pleiss. Widespread alternative and aberrant splicing revealed by lariat sequencing. Nucleic Acids Research, 43(17):8488–8501, 2015. [CrossRef]
  30. J. Ule and B. J. Blencowe. Alternative splicing regulatory networks: Functions, mechanisms, and evolution. Molecular Cell, 76:329–345, 2019. [CrossRef]
  31. L. E. Marasco and A. R. Kornblihtt. The physiology of alternative splicing. Nature Reviews Molecular Cell Biology, 24:242–254, 2023.
  32. C. E. Walsh. New paradigm for gene transfer: RNA trans-splicing and small interfering RNA as therapeutic strategies. Semin. Hematol., 41:297–302, 2004. [CrossRef]
  33. Y. Yang and C. E. Walsh. Spliceosome-mediated RNA trans-splicing. Molecular Therapy, 12(6):1006–1012, 2005. [CrossRef]
  34. T. A. Cooper, L. Wan, and G. Dreyfuss. RNA and disease. Cell, 136(4):777–793, 2009.
  35. C. J. Mcmanus, M. O. Duff, J. Eipper-Mains, and B. R. Graveley. Global analysis of trans-splicing in Drosophila. PNAS, 107(29):12975–12979, 2010. [CrossRef]
  36. M. M. Scotti and M. S. Swanson. RNA mis-splicing in disease. Nature Reviews Genetics, 17(1):19–32, 2016. [CrossRef]
  37. Wei Jiang and Liang Chen. Alternative splicing: Human disease and quantitative analysis from high-throughput sequencing. Computational and Structural Biotechnology Journal, 19:183–195, 2021. [CrossRef]
  38. R. Aebersold, J. N. Agar, I. J. Amster, M. S. Baker, C. R. Bertozzi, E. S. Boja, C. E. Costello, B. F. Cravatt, C. Fenselau, B. A. Garcia, et al. How many human proteoforms are there? Nature Chemical Biology, 14(3):206–214, 2018. [CrossRef]
  39. M. G. Reese, F. H. Eeckman, D. Kulp, and D. Haussler. Improved splice site detection in Genie. Journal of Computational Biology, 4(3):311–323, 1997. [CrossRef]
  40. J. Zuallaert, F. Godin, M. Kim, A. Soete, Y. Saeys, and W. De Neve. SpliceRover: interpretable convolutional neural networks for improved splice site prediction. Bioinformatics, 34:4180–4188, 2018. [CrossRef]
  41. K. Jaganathan, S. K. Panagiotopoulou, J. F. McRae, S. F. Darbandi, et al. Predicting splicing from primary sequence with deep learning. Cell, 176:535–548, 2019. [CrossRef]
  42. C. Janiesch, P. Zschech, and K. Heinrich. Machine learning and deep learning. Electronic Markets, 31:685–695, 2021.
  43. W. Jang, J. Park, H. Chae, and M. Kim. Comparison of in silico tools for splice-altering variant prediction using established spliceogenic variants: An end-users point of view. International Journal of Genomics, 2022:5265686, 2022. [CrossRef]
  44. C. van der Feltz and A. A. Hoskins. Structural and functional modularity of the U2 snRNP in pre-mRNA splicing. Critical Reviews in Biochemistry and Molecular Biology, 54:443–365, 2019.
  45. C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268:78–94, 1997.
  46. I. Carmel, S. Tai, I. Vig, and G. Ast. Comparative analysis detects dependencies among the 5′ splice-site positions. RNA, 10:828–840, 2004. [CrossRef]
  47. O.V. Artemyeva-Isman and A.C.G. Porter. Predicting splicing from primary sequence with deep learning. Frontiers in Genetics, 12:676971, 2021.
  48. E. Speakman and G. H. Gunaratne. On a kneading theory for gene-splicing. CHAOS, 34:043125, 2024. [CrossRef]
  49. Y. Iida and F. Sasaki. Recognition patterns for exon-intron junctions in higher organisms as revealed by a computer search. Journal of Biochemistry, 94:1731–1738, 1983. [CrossRef]
  50. M. Kramárek, P. Soucek, K. Réblova, L. K. Grodecká, and T. Freiberger. Splicing analysis of STAT3 tandem donor suggests non-canonical binding registers for U1 and U6 snRNAs. Nucleic Acids Research, 52:5959–5974, 2024. [CrossRef]
  51. G. Perez, G. P. Barber, A. Benet-Pages, J. Casper, H. Clawson, M. Diekhans, C. Fischer, J. N. Gonzalez, A. S. Hinrichs, C. M. Lee, L. R. Nassar, B. J. Raney, M. L. Speir, M. J. van Baren, C. J. Vaske, D. Haussler, W. J. Kent, and M. Haeussler. The UCSC Genome Browser database: 2025 update. Nucleic Acids Research, 53:D1243–D1249, 2025.
  52. M.T. Parker, S.M. Fica, and G.G. Simpson. RNA splicing: a split consensus reveals two major 5’ splice site classes. Open Biology, 15:240293, 2025. [CrossRef]
  53. A. Anna and G. Monika. Splicing mutations in human genetic disorders: examples, detection, and confirmation. Journal of Applied Genetics, 59:253–268, 2018. [CrossRef]
  54. T. R. Mercer, M. B. Clark, S. B. Andersen, M. E. Brunck, W. Haerty, J. Crawford, R. J. Taft, L. K. Nielsen, M. E. Dinger, and J. S. Mattick. Genome-wide discovery of human splicing branchpoints. Genome Research, 25:290–303, 2015. [CrossRef]
  55. N. A. Faustino and T. A. Cooper. Pre-mRNA splicing and human disease. Genes. Dev., 17:419–437, 2003.
  56. B. R. Graveley. The haplo-spliceo-transcriptome: common variations in alternative splicing in the human population. Trends in Genetics, 24:5–7, 2007. [CrossRef]
  57. R.-H. Fu, S.-P. Liu, S.-J. Huang, H.-J. Chen, P.-R. Chen, Y.-H. Lin, Y.-C. Ho, W.-L. Chang, C.-H. Tsai, W.-C. Shyu, et al. Aberrant alternative splicing events in Parkinson’s disease. Cell Transplantation, 22(4):653–661, 2013. [CrossRef]
  58. Katarzyna Chwalenia, Loryn Facemire, and Hui Li. Chimeric RNAs in cancer and normal physiology. Wiley Interdisciplinary Reviews: RNA, 8(6):e1427, 2017. [CrossRef]
  59. M. Montes, B. L. Sanford, D. F. Comiskey, and D. S. Chandler. RNA splicing and disease: animal models to therapies. Trends in Genetics, 35:68–87, 2018. [CrossRef]
  60. Y. Zhang, J. Qian, C. Gu, and Y. Yang. Alternative splicing and cancer: a systematic review. Signal Transduction and Targeted Therapy, 6(1):78, 2021. [CrossRef]
  61. H. Sun, K. Liu, and C. Yi. Regulation and functions of non-m6A mRNA modifications. Nature Reviews Molecular Cell Biology, 24:714–731, 2023. [CrossRef]
  62. P. Kim, S. Yoon, N. Kim, S. Lee, M. Ko, H. Lee, H. Kang, J. Kim, and S. Lee. ChimerDB 2.0 - a knowledge-base for fusion genes updated. Nucleic Acids Research, 38:D81–D85, 2010. [CrossRef]
  63. P. Kim and X. Zhou. FusionGDB: fusion gene annotation DataBase. Nucleic Acids Research, 47:D994–D1004, 2019. [CrossRef]
  64. P. Kim, H. Tan, J. Liu, H. Lee, H. Jung, H. Kumar, and X. Zhou. FusionGDB 2.0: fusion gene annotation updates aided by deep learning. Nucleic Acids Research, 50:D1221–D1230, 2022. [CrossRef]
  65. R. Wan, R. Bai, X. Zhan, and Y. Shi. How is precursor messenger RNA spliced by the spliceosome? Annual Review of Biochemistry, 89:333–358, 2020. [CrossRef]
  66. N. H. Gehring and J.-Y. Roignant. Anything but ordinary - emerging splicing mechanisms in eukaryotic gene regulation. Trends in Genetics, 37(4):355–372, 2021. [CrossRef]
  67. I. Beusch, B. Rao, M. K. Studer, T. Luhovska, V. Sukyte, S. Lei, J. Oses-Prieto, E. SeGraves, A. Burlingame, S. Jonas, and H. G. Madhani. Targeted high-throughput mutagenesis of the human spliceosome reveals its in vivo operating principles. Molecular Cell, 83:2578–2594, 2023. [CrossRef]
  68. M. E. Rogalska, C. Vivori, and J. Valcárcel. Regulation of pre-mRNA splicing: roles in physiology and disease, and therapeutic prospects. Nature Reviews Genetics, 24:251–269, 2023. [CrossRef]
  69. H. Shenasa and D. L. Bentley. Pre-mRNA splicing and its co-transcriptional connections. Trends in Genetics, 39(9):672–685, 2023. [CrossRef]
  70. X. Zhan, Y. Lu, and Y. Shi. Molecular basis for the activation of human spliceosome. Nature Communications, 15:6348–6357, 2024. [CrossRef]
  71. B. Mandelbrot. The Fractal Geometry of Nature. W. H. Freeman and Company, New York, 1977. [CrossRef]
  72. H. G. E. Hentschel and I. Procaccia. The infinite number of generalized dimensions of fractals and strange attractors. Physica D, 8:435–444, 1983. [CrossRef]
  73. T. C. Halsey, M. H. Jensen, L. P. Kadanoff, I. Procaccia, and B. I. Shraiman. Fractal measures and their singularities: The characterization of strange sets. Physical Review A, 33:1141–1151, 1986. [CrossRef]
  74. E. Speakman. Point Set Identification of Genetic Sequences. Doctoral Thesis, University of Houston, 2025.
  75. E. Alpaydin. Introduction to Machine Learning. The MIT Press, 2020.
  76. T. R. Cech. The chemistry of self-splicing RNA and RNA enzymes. Science, 236:1532–1539, 1987. [CrossRef]
  77. T. R. Cech. Self-splicing and enzymatic activity of an intervening sequence RNA from Tetrahymena. Bioscience Reports, 10:239–261, 1990. [CrossRef]
  78. A. M. Pyle. Group II intron self-splicing. Annual Review of Biophysics, 45:183–205, 2016. [CrossRef]
Figure 3. Sequence logos at branch points of nucleic acids A, C, G and U.
Figure 4. Consensus sequences depend on the nucleic acid at the branch point. (a) When it is A, the most frequent consensus sequences are CUN|A|UNN and CUN|A|CNN. (b) With C as the branch point, the primary consensus sequences take the forms NCU|C|CCN and NCU|C|CUN. (c) With G as the branch point, they take the form NCU|G|ANN, while (d) if it is U, the prevalent form is NNC|U|GAN.
Figure 7. The difference of probability densities between successive exon-exon junctions and cancer fusion junctions. On average, posterior sequences AAG (resp. CAG) occur at a higher (resp. lower) frequency in cancer fusions.
Figure 8. Differential correlations C(x, y) for (a) normal exon-exon junctions and (b) cancer fusion junctions.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated