ARTICLE | doi:10.20944/preprints202301.0039.v1
Subject: Biology, Other Keywords: somatic point mutations; non-coding RNA; biomarker discovery; driver genes; non-coding RNAs prioritization; health data analytics
Online: 4 January 2023 (02:22:00 CET)
Previous studies demonstrate the critical importance of non-coding RNAs that interface with chromatin-modifying machinery, resulting in promoter-enhancer-based gene regulation, and raise the possibility that many other enhancer-like RNAs may operate via similar mechanisms. Critically, more than 80% of the disease-linked variations identified in genome-wide studies are located in the non-coding regions of genomes, especially in non-coding RNAs, suggesting that non-coding RNAs are relevant to disease. Thus, a critical path forward for understanding the role of non-coding RNAs, especially long non-coding RNAs, is to understand the transcriptional regulation of genomic regions, especially non-coding regions. Here, we developed a user-friendly R package called SomaGene for studying and identifying enhancer-like non-coding RNAs with enriched somatic mutations in the cancer genome. SomaGene accepts different genomic variants (whole genome/exome somatic point mutations, structural variations, copy number variations) to identify those RNAs that are significantly mutated in diseases (e.g., cancer). It then uses multiple publicly available genomics and epigenetics datasets, including ENCODE epigenomics annotations, FANTOM5 tissue-specific expression profiles, disease-associated genome-wide association SNPs, and tissue-specific eQTL pairs, to identify those RNAs with potential enhancer function. As an R package, SomaGene gives cancer scientists the opportunity to study the roles of non-coding RNAs in different cancer genomes.
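To make the notion of a region "with enriched somatic mutations" concrete, the following is a minimal sketch of one common way to test enrichment: a one-sided binomial test against a uniform background mutation rate. It is written in Python for illustration only and is not SomaGene's actual method or API; the function names, the uniform-background assumption, and the toy numbers are all hypothetical.

```python
from math import exp, lgamma, log, log1p


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), assuming 0 < p < 1."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    total = 0.0
    for i in range(k, n + 1):
        # Work in log space so large binomial coefficients do not overflow.
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log1p(-p))
        total += exp(log_pmf)
    return min(total, 1.0)


def enrichment_pvalue(mutations_in_region, region_length,
                      total_mutations, genome_length):
    """One-sided p-value that a region carries more mutations than expected.

    Assumption (hypothetical, for illustration): mutations fall uniformly
    along the genome, so each mutation lands in the region with probability
    region_length / genome_length.
    """
    p_region = region_length / genome_length
    return binomial_upper_tail(mutations_in_region, total_mutations, p_region)


# Toy example: 5 of 100 mutations fall in a 1 kb region of a 1 Mb "genome",
# where only 0.1 mutations would be expected under the uniform background.
p_value = enrichment_pvalue(5, 1_000, 100, 1_000_000)
print(f"enrichment p-value: {p_value:.3g}")
```

A real analysis would need a background model that accounts for regional mutation-rate variation and would correct for testing many regions; the uniform background here is only a simplification.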
ARTICLE | doi:10.20944/preprints202111.0266.v1
Subject: Engineering, Biomedical & Chemical Engineering Keywords: Pan-Cancer; somatic point mutations; cancer subtyping; biomarker discovery; driver genes; personalized medicine; health data analytics
Online: 15 November 2021 (13:51:33 CET)
The advent of high-throughput sequencing has enabled researchers to systematically evaluate the genetic variations in cancer, resulting in the identification of many cancer-associated genes. Although cancers in the same tissue are widely categorized in the same group, they differ considerably in their mutational profiles. Hence, there is no “silver bullet” for the treatment of a cancer type. This reveals the importance of developing a pipeline to identify cancer-associated genes accurately and to re-classify patients with similar mutational profiles. Classification of cancer patients with similar mutational profiles may help discover subtypes of cancer patients who might benefit from specific treatment types. In this study, we propose a new machine learning pipeline that identifies protein-coding genes mutated in a significant portion of samples and uses them to identify cancer subtypes. We applied our pipeline to 12,270 samples collected from the International Cancer Genome Consortium (ICGC), covering 19 cancer types, and identified 17 different cancer subtypes. Comprehensive phenotypic and genotypic analysis indicates that the subtypes have distinguishable properties, including unique cancer-related signaling pathways, for most of which targeted treatment options are currently available. This new subtyping approach offers a novel opportunity for cancer drug development based on the mutational profile of patients. We also comprehensively study the causes of mutations among samples in each subtype by mining the mutational signatures, which provides important insight into their active molecular mechanisms. Some of the pathways we identified in most subtypes, including the cell cycle and the axon guidance pathways, are frequently observed in cancer. Interestingly, we also identified several mutated genes and different rates of mutation in multiple cancer subtypes.
In addition, our study of “gene-motif” suggests the importance of considering both the context of the mutations and the mutational processes in identifying cancer-associated genes. The source code for our proposed clustering pipeline and analysis is publicly available at: https://github.com/bcb-sut/Pan-Cancer.
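The two ideas the abstract combines, keeping genes mutated in a sizable fraction of samples and comparing patients by their mutational profiles, can be sketched roughly as follows. This is an illustrative Python sketch only, not the pipeline in the linked repository: the function names, the 50% frequency threshold, the use of Jaccard similarity, and the toy samples are all assumptions introduced here.

```python
from collections import defaultdict


def frequently_mutated_genes(sample_to_genes, min_fraction):
    """Return genes mutated in at least min_fraction of the samples."""
    counts = defaultdict(int)
    for genes in sample_to_genes.values():
        for gene in set(genes):
            counts[gene] += 1
    n_samples = len(sample_to_genes)
    return {g for g, c in counts.items() if c / n_samples >= min_fraction}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def profile_similarity(sample_to_genes, min_fraction=0.5):
    """Pairwise similarity of samples, restricted to frequently mutated genes.

    The threshold and the similarity measure are hypothetical choices made
    for this sketch.
    """
    keep = frequently_mutated_genes(sample_to_genes, min_fraction)
    profiles = {s: set(g) & keep for s, g in sample_to_genes.items()}
    samples = sorted(profiles)
    return {
        (s, t): jaccard(profiles[s], profiles[t])
        for i, s in enumerate(samples)
        for t in samples[i + 1:]
    }


# Toy, made-up samples: each sample maps to the set of genes mutated in it.
toy = {
    "s1": {"TP53", "KRAS"},
    "s2": {"TP53", "KRAS", "BRCA1"},
    "s3": {"TP53", "PIK3CA"},
    "s4": {"PIK3CA"},
}
similarity = profile_similarity(toy, min_fraction=0.5)
for pair, value in sorted(similarity.items()):
    print(pair, round(value, 3))
```

The resulting pairwise similarities could then be passed to any standard clustering method to group samples with similar mutational profiles; which method the authors use is not stated in the abstract.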