Submitted:
21 June 2026
Posted:
23 June 2026
Abstract

Keywords:
1. Introduction
- It surveys the data foundations for model training, covering the major omics data types (transcriptomics and proteomics), imaging inputs (radiology and digital pathology), single-cell data, and clinical label structures.
- It maps the hierarchical nature of cancer classification, from binary (tumor vs normal) models through molecular subtyping, stage and grade, and prognosis to tissue-of-origin models.
- It focuses on ML and DL architectures, including ensemble methods, convolutional networks for images, graph neural networks, and generative models.
- Finally, it examines open challenges, including label noise, cross-cohort generalization, and the absence of benchmarking frameworks.
2. Data Foundations for Tumor Modeling
2.1. Molecular Omics Data
2.1.1. Transcriptomic
2.1.2. Proteomic Data Sources
2.1.3. Others
2.2. Imaging & Radiomics Data Sources
2.3. Spatial & Single Cell
2.4. Clinical Labels and Biological Meaning
3. The Hierarchical Nature of Cancer Classification
3.1. Tumor vs Normal: The “Easy” Problem
3.2. Tumor Classification and Subtyping
3.3. Stage and Grade Prediction: The Hard Problem
3.4. Early Stage & Prognosis Prediction
3.5. Tissue-of-Origin Classification
4. Machine Learning Models Across the Hierarchy
4.1. Classical Machine Learning Approaches
4.2. Deep Learning Approaches
4.3. Graph Neural Networks
5. The Mechanics of Preprocessing: The Hidden Determinant of Performance
- Implementation of normalization and scaling in transcriptomics.
- Challenges of preprocessing in proteomics.
- Missing data in CPTAC studies, and how to select the most suitable preprocessing methods across modalities.
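As a rough illustration of the normalization and scaling step listed above, the following sketch applies a log2 transform followed by per-gene z-scoring to a small expression matrix. The function names, the pseudocount of 1, and the toy counts are assumptions made for this example only; they are not taken from the review or from any specific pipeline.

```python
import math
import statistics


def log2_transform(matrix, pseudocount=1.0):
    """Apply log2(x + pseudocount) to every value (rows = samples, columns = genes)."""
    return [[math.log2(value + pseudocount) for value in row] for row in matrix]


def zscore_columns(matrix):
    """Standardize each column (gene) to zero mean and unit variance."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    # Population SD; a constant gene gets SD 1.0 to avoid division by zero.
    sds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in matrix]


# Hypothetical raw counts: 3 samples x 3 genes.
counts = [[0, 3, 1023], [1, 7, 511], [3, 15, 255]]
logged = log2_transform(counts)
scaled = zscore_columns(logged)
```

In a real modeling workflow the per-gene means and standard deviations would normally be estimated on the training split only and then reused on held-out samples, so that no information leaks from the test set.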
5.2. Scaling vs Transformation: A Comparative View
5.3. Missing Data in CPTAC
6. Class Imbalance and the Myth of “Enough Normals”
6.1. The TCGA Normal Sample Problem
6.2. Synthetic Data Generation
7. Multimodal Fusion of Pathology and Omics
8. Clinical Impact of ML-Based Models
9. Open Challenges and Research Gaps
10. Conclusion
Author Contributions
Funding
Data Availability
Acknowledgments
Conflicts of interest
References
- Abascal, F; Acosta, R; Addleman, NJ; et al. Perspectives on ENCODE. Nature 2020, 583, 693–8. [Google Scholar] [CrossRef] [PubMed]
- Abbasi, AF; Sajjad, M; Asim, MN; et al. Multi-omics driven computational framework for cancer molecular subtype classification. Sci Rep 2025, 15, 44141. [Google Scholar] [CrossRef] [PubMed]
- Ahamad, MM; Aktar, S; Uddin, MJ; et al. Early-Stage detection of ovarian cancer based on clinical data using machine learning approaches. J Pers Med 2022, 12, 1211. [Google Scholar] [CrossRef] [PubMed]
- Ahmed, KT; Sun, J; Cheng, S; et al. Multi-omics data integration by generative adversarial network. Bioinformatics 2021, 38, 179–86. [Google Scholar] [CrossRef] [PubMed]
- Alharbi, F; Vakanski, A. Machine Learning Methods for Cancer Classification Using Gene Expression Data: A review. Bioengineering 2023, 10, 173. [Google Scholar] [CrossRef] [PubMed]
- Alharbi, WS; Rashid, M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 2022, 16, 26. [Google Scholar] [CrossRef] [PubMed]
- Aran, D; Camarda, R; Odegaard, J; et al. Comprehensive analysis of normal adjacent to tumor transcriptomes. Nat Commun 2017, 8, 1077. [Google Scholar] [CrossRef] [PubMed]
- Barbadikar, KM; Magar, ND; Hake, AA; et al. Transcriptomic Data Analysis. In Advanced Statistical Tools and Techniques for Biometrical Data Analysis; Rathod, S, Sailaja, B, Bandumula, N, et al., Eds.; ICAR Indian Institute of Rice Research: Hyderabad, 2023; pp. 207–50. https://icar-iirr.org/books/chapters/AdvancedStatistics_Ch12.pdf.
- Boys, EL; Liu, J; Robinson, PJ; et al. Clinical applications of mass spectrometry-based proteomics in cancer: Where are we? Proteomics 2023, 23, e2200238. [Google Scholar] [CrossRef] [PubMed]
- Bruno, PS; Arshad, A; Gogu, MR; et al. Post-Translational Modifications of Proteins Orchestrate All Hallmarks of Cancer. Life (Basel) 2025, 15, 126. [Google Scholar] [CrossRef] [PubMed]
- Carrillo-Perez, F; Ortuno, FM; Börjesson, A; et al. Performance comparison between multi-center histopathology datasets of a weakly-supervised deep learning model for pancreatic ductal adenocarcinoma detection. Cancer Imaging 2023, 23, 66. [Google Scholar] [CrossRef] [PubMed]
- Chen, P; Chang, D; Yen, H; et al. Radiomic Features at CT Can Distinguish Pancreatic Cancer from Noncancerous Pancreas. Radiol Imaging Cancer 2021, 3, e210010. [Google Scholar] [CrossRef] [PubMed]
- Chhikara, BS; Parang, K. Global Cancer Statistics 2022: The Trends Projection Analysis. https://digitalcommons.chapman.edu/pharmacy_articles/938/.
- Colacino, A; Soricelli, A; Ceccarelli, M; et al. Subtypes detection of papillary thyroid cancer from methylation assay via Deep Neural Network. Comput Struct Biotechnol J 2025, 27, 1809–17. [Google Scholar] [CrossRef] [PubMed]
- Cosmic. COSMIC Mutational Signatures. COSMIC. 2020. https://cancer.sanger.ac.uk/signatures/.
- De Zuani, M; Xue, H; Park, JS; et al. Single-cell and spatial transcriptomics analysis of non-small cell lung cancer. Nat Commun 2024, 15, 4388. [Google Scholar] [CrossRef] [PubMed]
- Dent, A; Diamandis, P. Integrating computational pathology and proteomics to address tumor heterogeneity. J Pathol 2022, 257, 445–53. [Google Scholar] [CrossRef] [PubMed]
- Eissa, NS; Khairuddin, U; Yusof, R. A hybrid metaheuristic-deep learning technique for the pan-classification of cancer based on DNA methylation. BMC Bioinformatics 2022, 23, 273. [Google Scholar] [CrossRef] [PubMed]
- Feng, Z; Zhao, Q; Ding, Y; et al. Identification an innovative classification and nomogram for predicting the prognosis of thyroid carcinoma patients and providing therapeutic schedules. J Cancer Res Clin Oncol 2023, 149, 14817–31. [Google Scholar] [CrossRef] [PubMed]
- GDC Docs. Bioinformatics Pipeline: Methylation analysis Pipeline. GDC Docs. n.d. https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Methylation_Pipeline/.
- Gutta, C; Morhard, C; Rehm, M. Applying a GAN-based classifier to improve transcriptome-based prognostication in breast cancer. PLoS Comput Biol 2023, 19, e1011035. [Google Scholar] [CrossRef] [PubMed]
- Han, E; Kwon, H; Jung, I. A review on multi-omics integration for aiding study design of large scale TCGA cancer datasets. BMC Genomics 2025, 26, 769. [Google Scholar] [CrossRef] [PubMed]
- Han, GH; Kim, H; Yun, H; et al. Developing a comprehensive molecular subgrouping model for cervical cancer using machine learning. Am J Cancer Res 2024, 14, 3186–97. [Google Scholar] [CrossRef] [PubMed]
- Heydari, AA; Davalos, OA; Zhao, L; et al. ACTIVA: realistic single-cell RNA-seq generation with automatic cell-type identification using introspective variational autoencoders. Bioinformatics 2022, 38, 2194–201. [Google Scholar] [CrossRef] [PubMed]
- Higgins, L; Gerdes, H; Cutillas, PR. Principles of phosphoproteomics and applications in cancer research. Biochem J 2023, 480, 403–20. [Google Scholar] [CrossRef] [PubMed]
- Hossain, SMM; Khatun, L; Ray, S; et al. Pan-cancer classification by regularized multi-task learning. Sci Rep 2021, 11, 24252. [Google Scholar] [CrossRef] [PubMed]
- Hsu, C; Askar, S; Alshkarchy, SS; et al. AI-driven multi-omics integration in precision oncology: bridging the data deluge to clinical decisions. Clin Exp Med 2025, 26, 29. [Google Scholar] [CrossRef] [PubMed]
- Hu, G; Zheng, Z; He, Y; et al. Integrated analysis of proteome and transcriptome profiling reveals Pan-Cancer-Associated pathways and molecular biomarkers. Mol Cell Proteomics 2025, 24, 100919. [Google Scholar] [CrossRef] [PubMed]
- Hu, R; Zhou, XJ; Li, W. Computational Analysis of High-Dimensional DNA Methylation Data for Cancer Prognosis. J Comput Biol 2022, 29, 769–81. [Google Scholar] [CrossRef] [PubMed]
- Hu, W; Zhang, Y; Mei, J; et al. Spatial transcriptomics in human biomedical research and clinical application. Curr Med 2023, 2, 1. [Google Scholar] [CrossRef]
- Javaid, MA; Shahzad, MS; Shehzad, HMF; et al. DSSCC net enhanced skin cancer classification using SMOTE Tomek and optimized convolutional neural network. Scientific Reports 2025, 15, 41554. [Google Scholar] [CrossRef] [PubMed]
- Jha, A; Quesnel-Vallieres, M; Wang, D; et al. Identifying common transcriptome signatures of cancer by interpreting deep learning models. Genome Biol 2022, 23, 117. [Google Scholar] [CrossRef] [PubMed]
- Ji, Y; Dutta, P; Davuluri, R. Deep multi-omics integration by learning correlation-maximizing representation identifies prognostically stratified cancer subtypes. Bioinform Adv 2023, 3, vbad075. [Google Scholar] [CrossRef] [PubMed]
- Jiang, W; Jaehnig, EJ; Liao, Y; et al. Illuminating the Dark Cancer Phosphoproteome Through a Machine-Learned Co-Regulation Map of 26,280 Phosphosites. bioRxiv 2024. [Google Scholar] [CrossRef]
- Jing, B; Chen, G; Yang, M; et al. Development of prediction model to estimate future risk of ovarian lesions: A multi-center retrospective study. Prev Med Rep 2023, 35, 102296. [Google Scholar] [CrossRef] [PubMed]
- Jones, S; Beyers, M; Shukla, M; et al. TULIP: an RNA-seq-based primary tumor type prediction tool using convolutional neural networks. Cancer Inform 2022, 21, 11769351221139491. [Google Scholar] [CrossRef] [PubMed]
- Khatun, R; Akter, M; Islam, MM; et al. Cancer Classification Utilizing Voting Classifier with Ensemble Feature Selection Method and Transcriptomic Data. Genes 2023, 14, 1802. [Google Scholar] [CrossRef] [PubMed]
- Langerud, J; Eilertsen, IA; Moosavi, SH; et al. Multiregional transcriptomics identifies congruent consensus subtypes with prognostic value beyond tumor heterogeneity of colorectal cancer. Nat Commun 2024, 15, 4342. [Google Scholar] [CrossRef] [PubMed]
- Lappalainen, I; Almeida-King, J; Kumanduri, V; et al. The European Genome-phenome Archive of human data consented for biomedical research. Nat Genet 2015, 47, 692–95. [Google Scholar] [CrossRef] [PubMed]
- Lasai, B; Ewout, S W; et al. The fundamental problem of risk prediction for individuals: health AI, uncertainty, and personalized medicine. arXiv. 2025. https://arxiv.org/abs/2506.17141. [CrossRef]
- Laycock, E; Weis, E; Sylvestre-Bouchard, A; et al. Analyzing clinical variables are indicative of uveal melanoma to determine how they affect decisions made by an artificial intelligence classifier. Can J Ophthalmol 2025, 60, 261–66. [Google Scholar] [CrossRef] [PubMed]
- Lee, D; Park, Y; Kim, S. Towards multi-omics characterization of tumor heterogeneity: a comprehensive review of statistical and machine learning approaches. Brief Bioinform 2021, 22, bbaa188. [Google Scholar] [CrossRef] [PubMed]
- Li, H; Han, Z; Sun, Y; et al. CGMega: explainable graph neural network framework with attention mechanisms for cancer gene module dissection. Nat Commun 2024, 15, 5997. [Google Scholar] [CrossRef] [PubMed]
- Li, Y; Dou, Y; Da Veiga Leprevost, F; et al. Proteogenomic data and resources for pan-cancer analysis. Cancer Cell 2023, 41, 1397–406. [Google Scholar] [CrossRef] [PubMed]
- Liang, Z; Li, J; He, S; et al. Thymoma habitat segmentation and risk prediction model using CT imaging and K-means clustering. Med Phys 2025, 52, e17892. [Google Scholar] [CrossRef] [PubMed]
- Lindgren, CM; Adams, DW; Kimball, B; et al. Simplified and Unified Access to Cancer Proteogenomic Data. J Proteome Res 2021, 20, 1902–10. [Google Scholar] [CrossRef] [PubMed]
- Liu, B; Liu, Y; Pan, X; et al. DNA methylation markers for Pan-Cancer prediction by deep learning. Genes 2019, 10, 778. [Google Scholar] [CrossRef] [PubMed]
- Ma, W; Wu, H; Chen, Y; et al. New techniques to identify the tissue of origin for cancer of unknown primary in the era of precision medicine: progress and challenges. Brief Bioinform 2024, 25. [Google Scholar] [CrossRef] [PubMed]
- Mertins, P; Mani, DR; Ruggles, KV; et al. Proteogenomics connects somatic mutations to signalling in breast cancer. Nature 2016, 534, 55–62. [Google Scholar] [CrossRef] [PubMed]
- Mohammed, M; Mwambi, H; Mboya, LB; et al. A stacking ensemble deep learning approach to cancer type classification based on TCGA data. Sci Rep 2021, 11, 15626. [Google Scholar] [CrossRef] [PubMed]
- Nguyen, L; Van Hoeck, A; Cuppen, E. Machine learning-based tissue of origin classification for cancer of unknown primary diagnostics using genome-wide mutation features. Nat Commun 2022, 13, 4013. [Google Scholar] [CrossRef] [PubMed]
- Ning, B; Chi, J; Meng, Q; et al. Accurate prediction of colorectal cancer diagnosis using machine learning based on immunohistochemistry pathological images. Sci Rep 2024, 14, 29882. [Google Scholar] [CrossRef] [PubMed]
- Noorbakhsh, J; Farahmand, S; Pour, AFN; et al. Deep learning-based cross-classifications reveal conserved spatial behaviors within tumor histological images. Nat Commun 2020, 11, 6367. [Google Scholar] [CrossRef] [PubMed]
- Panagopoulou, M; Karaglani, M; Manolopoulos, VG; et al. Deciphering the Methylation Landscape in Breast Cancer: Diagnostic and Prognostic Biosignatures through Automated Machine Learning. Cancers 2021, 13, 1677. [Google Scholar] [CrossRef] [PubMed]
- Park, SS; Noh, J; Kim, J; et al. Machine learning-based classification of adrenal tumors using clinical, hormonal, and body composition data. Eur J Endocrinol 2025, 193, 204–15. [Google Scholar] [CrossRef] [PubMed]
- Putra, AH; Salam, A. A comparative performance of SMOTE, ADASYN and random oversampling in machine learning models on prostate cancer dataset. Journal of Applied Informatics and Computing 2025, 9, 603–10. [Google Scholar] [CrossRef]
- Ramadhan, MA; Saragih, TH; Kartini, D; et al. A Comparative Analysis of SMOTE and ADASYN for Cervical Cancer Detection using XGBoost with MICE Imputation. Journal of Electronics Electromedical Engineering and Medical Informatics 2026, 8, 368–94. [Google Scholar] [CrossRef]
- Saadh, MJ; Ahmed, HH; Kareem, RA; et al. Advanced machine learning framework for enhancing breast cancer diagnostics through transcriptomic profiling. Discov Oncol 2025, 16, 334. [Google Scholar] [CrossRef] [PubMed]
- Sahu, PK; Khuntia, M; Choudhury, S; et al. Analysis of Class Imbalanced Brain Tumor Using Machine Learning Techniques. https://www.ijisae.org/index.php/IJISAE/article/view/5003.
- Salih, AM; Raisi-Estabragh, Z; Galazzo, IB; et al. A perspective on explainable artificial intelligence methods: SHAP and LIME. Adv Intell Syst 2024, 7, 202400304. [Google Scholar] [CrossRef]
- Sarkar, H; Lee, E; Lopez-Darwin, SL; et al. Deciphering normal and cancer stem cell niches by spatial transcriptomics: opportunities and challenges. Genes Dev 2024, 39, 64–85. [Google Scholar] [CrossRef] [PubMed]
- Saygili, ES; Elhassan, YS; Prete, A; et al. Machine Learning Based Survival Prediction Tool for adrenocortical carcinoma. J Clin Endocrinol Metab 2025, 110, e3185–92. [Google Scholar] [CrossRef] [PubMed]
- Silvestri, M; Vu, TN; Nichetti, F; et al. Comprehensive transcriptomic analysis to identify biological and clinical differences in cholangiocarcinoma. Cancer Med 2023, 12, 10156–68. [Google Scholar] [CrossRef] [PubMed]
- Stancl, P; Karlic, R. Machine learning for pan-cancer classification based on RNA sequencing data. Front Mol Biosci 2023, 10, 1285795. [Google Scholar] [CrossRef] [PubMed]
- Tang, Z; Kang, B; Li, C; et al. GEPIA2: an enhanced web server for large-scale expression profiling and interactive analysis. Nucleic Acids Research 2019, 47, W556–60. [Google Scholar] [CrossRef] [PubMed]
- Tian, F; Liu, D; Wei, N; et al. Prediction of tumor origin in cancers of unknown primary origin with cytology-based deep learning. Nat Med 2024b, 30, 1309–19. [Google Scholar] [CrossRef] [PubMed]
- Tian, S; Luo, M; Liao, X; et al. Integrated immunogenomic analysis of single-cell and bulk profiling reveals novel tumor antigens and subtype-specific therapeutic agents in lung adenocarcinoma. Comput Struct Biotechnol J 2024a, 23, 1897–911. [Google Scholar] [CrossRef] [PubMed]
- Tomczak, K; Czerwinska, P; Wiznerowicz, M. Review The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Wspolczesna Onkol 2015, 19, 68–77. [Google Scholar] [CrossRef] [PubMed]
- Tran, KA; Kondrashova, O; Bradley, A; et al. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Medicine 2021, 13, 152. [Google Scholar] [CrossRef] [PubMed]
- Vaida, M; Huang, Z. Multimodal graph neural networks in healthcare: a review of fusion strategies across biomedical domains. Front Artif Intell 2026, 8, 1716706. [Google Scholar] [CrossRef] [PubMed]
- Wang, C; He, Y; Zheng, J; et al. Dissecting order amidst chaos of programmed cell deaths: construction of a diagnostic model for KIRC using transcriptomic information in blood-derived exosomes and single-cell multi-omics data in tumor microenvironment. Front Immunol 2023, 14, 1130513. [Google Scholar] [CrossRef] [PubMed]
- Wang, C; Lye, X; Kaalia, R; et al. Deep learning and multi-omics approach to predict drug responses in cancer. BMC Bioinformatics 2022, 22, 632. [Google Scholar] [CrossRef] [PubMed]
- Wang, H; Zhang, Y; Zhang, D; et al. Novel cancer subtyping method guided by tumor-normal sample in latent space of transcriptomic variational autoencoder. Sci Rep 2025, 15, 26444. [Google Scholar] [CrossRef] [PubMed]
- Wang, J; Zhang, J; Dai, X; et al. Computational models for pan-cancer classification based on multi-omics data. Front Genet 2025, 16, 1667325. [Google Scholar] [CrossRef] [PubMed]
- Wang, J-H; Liu, C-Y; Min, Y-R; et al. Cancer Diagnosis by Gene-Environment Interactions via Combination of SMOTE-Tomek and Overlapped Group Screening Approaches with Application to Imbalanced TCGA Clinical and Genomic Data. Mathematics 2024, 12, 2209. [Google Scholar] [CrossRef]
- Wang, JM; Hong, R; Demicco, EG; et al. Deep learning integrates histopathology and proteogenomics at a pan-cancer level. Cell Rep Med 2023, 4, 101173. [Google Scholar] [CrossRef] [PubMed]
- Wang, Z; Gerstein, M; Snyder, M. RNA-seq: a revolutionary tool for transcriptomics. Nat Rev Genet 2008, 10, 57–63. [Google Scholar] [CrossRef] [PubMed]
- Wei, Q; Zhou, H; Hou, X; et al. Current status of and barriers to the treatment of advanced-stage liver cancer in China: a questionnaire-based study from the perspective of doctors. BMC Gastroenterol 2022, 22, 351. [Google Scholar] [CrossRef] [PubMed]
- Wei, Y; Deng, Y; Sun, C; et al. Deep learning with noisy labels in medical prediction problems: a scoping review. J Am Med Inform Assoc 2024, 31, 1596–607. [Google Scholar] [CrossRef] [PubMed]
- Weinstein, JN; Collisson, EA; Mills, GB; et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 2013, 45, 1113–20. [Google Scholar] [CrossRef] [PubMed]
- Wishart, DS; Tzur, D; Knox, C; et al. HMDB: the Human Metabolome Database. Nucleic Acids Research 2007, 35, D521–6. [Google Scholar] [CrossRef] [PubMed]
- Wu, J; Chen, Z; Xiao, S; et al. DeepMoIC: multi-omics data integration via deep graph convolutional networks for cancer subtype classification. BMC Genomics 2024, 25, 1209. [Google Scholar] [CrossRef] [PubMed]
- Wu, Y; Chen, M; Qin, Y. Anticancer drug response prediction integrating multi-omics pathway-based difference features and multiple deep learning techniques. PLoS Computational Biology 2025, 21, e1012905. [Google Scholar] [CrossRef]
- Xiao, J; Yu, X; Meng, F; et al. Integrating spatial and single-cell transcriptomics reveals tumor heterogeneity and intercellular networks in colorectal cancer. Cell Death Dis 2024, 15, 326. [Google Scholar] [CrossRef] [PubMed]
- Yang, Y; Mirzaei, G. Performance analysis of data resampling on class imbalance and classification techniques on multi-omics data for cancer classification. PLoS ONE 2024, 19, e0293607. [Google Scholar] [CrossRef] [PubMed]
- Yuan, T; Edelmann, D; Fan, Z; et al. Machine learning in the identification of prognostic DNA methylation biomarkers among patients with cancer: A systematic review of epigenome-wide studies. Artif Intell Med 2023, 143, 102589. [Google Scholar] [CrossRef] [PubMed]
- Yurekten, O; Payne, T; Tejera, N; et al. MetaboLights: open data repository for metabolomics. Nucleic Acids Res 2023, 52, D640–46. [Google Scholar] [CrossRef] [PubMed]
- Zeng, WZD; Glicksberg, BS; Li, Y; et al. Selecting precise reference normal tissue samples for cancer research using a deep learning approach. BMC Med Genomics 2019, 12, 21. [Google Scholar] [CrossRef] [PubMed]
- Zeng, Y; Wei, Z; Yu, W; et al. Spatial transcriptomics prediction from histology jointly through Transformer and graph neural networks. Brief Bioinform 2022, 23, bbac297. [Google Scholar] [CrossRef] [PubMed]
- Zhang, B; Wang, J; Wang, X; et al. Proteogenomic characterization of human colon and rectal cancer. Nature 2014, 513, 382–87. [Google Scholar] [CrossRef] [PubMed]
- Zhang, J; Che, Y; Liu, R; et al. Deep learning-driven multi-omics analysis: enhancing cancer diagnostics and therapeutics. Brief Bioinform 2025, 26. [Google Scholar] [CrossRef] [PubMed]
- Zhang, J; Chen, L. Clustering-based undersampling with random over sampling examples and support vector machine for imbalanced classification of breast cancer diagnosis. Computer Assisted Surgery 2019, 24, 62–72. [Google Scholar] [CrossRef] [PubMed]
- Zhang, X; Zhang, P; Ren, Q; et al. Integrative multi-omic and machine learning approach for prognostic stratification and therapeutic targeting in lung squamous cell carcinoma. BioFactors 2024, 51, e2128. [Google Scholar] [CrossRef] [PubMed]
- Zheng, Y; Gindra, RH; Green, EJ; et al. A Graph-Transformer for Whole Slide Image Classification. IEEE Trans Med Imaging 2022, 41, 3003–15. [Google Scholar] [CrossRef] [PubMed]


| Proteomic data source | Sample type | Feature dimensionality | Strengths | Limitations | Impact on ML tasks | References |
|---|---|---|---|---|---|---|
| CPTAC global proteomics | Tumor & matched normal tissue | Approx. 8000–12000 proteins | High depth; multi-omics integration; standardized pipelines | Bulk averaging, structured missingness | Good for tumor vs normal, weak for stage | Zhang et al. 2014, Mertins et al. 2016 |
| CPTAC phosphoproteomics | Tumor tissue | 20,000–40,000 phosphosites | Pathway and kinase activity insight | Extreme sparsity, batch sensitivity | Unstable models | Mertins et al. 2016, Jiang et al. 2024 |
| Serum/plasma proteomics | Blood | Hundreds–thousands | Non-invasive, early detection potential | Low tumor signal, high noise | Poor accuracy | Boys et al. 2023 |
| Targeted/IHC proteomics | Tissue sections | Dozens of proteins | High specificity, clinical familiarity | Limited scope, not discovery-oriented | Good for binary tasks | Ning et al. 2024 |
| Repository | Primary source | Methodology | Key metric | Reference |
|---|---|---|---|---|
| Genomics | | | | |
| COSMIC | Pan-Cancer Analysis of Whole Genomes (PCAWG) | SigProfiler software | 86 SBS mutational signatures (as of v3.4, Oct 2023) | Cosmic 2020 |
| PCAWG | Joint ICGC/TCGA initiative | Whole-genome sequencing (WGS) | 2,658 tumor-normal matched pairs across 38 cancer types | Cosmic 2020 |
| GDC (Genomics) | TCGA, ICGC, and other cancer programs | Harmonization workflow | High-throughput sequencing data (SNPs, CNVs, InDels, etc.) | GDC Docs n.d. |
| Epigenomics | | | | |
| GDC (Methylation) | 33 TCGA cancer types | SeSAMe correction | Beta value matrices (known CpG sites) | GDC Docs n.d. |
| ENCODE | Public research project | Chromatin state & TF binding sites | High-dimensional chromatin and binding site annotations | Abascal et al. 2020 |
| Metabolomics | | | | |
| MetaboLights | Global EMBL-EBI database | LC-MS/MS or NMR | 8,544 studies and 270,403+ samples (human plasma, serum, urine) | Yurekten et al. 2023 |
| HMDB / METLIN | Reference database | Spectral annotation | Reference spectral data for small molecules | Wishart et al. 2007 |
| TCGA cohort | Cancer type | Tumor samples | Normal samples | Tumor:normal ratio | GTEx tissue | GTEx normal samples |
|---|---|---|---|---|---|---|
| ACC | Adrenocortical carcinoma | 77 | 0 | 1:0 | Adrenal Gland | 128 |
| BLCA | Bladder Urothelial Carcinoma | 404 | 19 | 404:19 | Bladder | 9 |
| BRCA | Breast invasive carcinoma | 1085 | 112 | 155:16 | Breast | 179 |
| CESC | Cervical squamous cell carcinoma and endocervical adenocarcinoma | 306 | 3 | 102:1 | Cervix Uteri | 10 |
| CHOL | Cholangiocarcinoma | 36 | 9 | 4:1 | - | - |
| COAD | Colon adenocarcinoma | 275 | 41 | 275:41 | Colon | 308 |
| DLBC | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | 47 | 0 | 1:0 | Blood | 337 |
| ESCA | Esophageal carcinoma | 182 | 13 | 14:1 | Esophagus | 273 |
| GBM | Glioblastoma multiforme | 163 | 0 | 1:0 | Brain | 207 |
| HNSC | Head and neck squamous cell carcinoma | 519 | 44 | 519:44 | - | - |
| KICH | Kidney Chromophobe | 66 | 25 | 66:25 | Kidney | 28 |
| KIRC | Kidney renal clear cell carcinoma | 523 | 72 | 523:72 | Kidney | 28 |
| KIRP | Kidney renal papillary cell carcinoma | 286 | 32 | 143:16 | Kidney | 28 |
| LAML | Acute Myeloid Leukemia | 173 | 0 | 1:0 | Bone Marrow | 70 |
| LGG | Brain Lower Grade Glioma | 518 | 0 | 1:0 | Brain | 207 |
| LIHC | Liver hepatocellular carcinoma | 369 | 50 | 369:50 | Liver | 110 |
| LUAD | Lung adenocarcinoma | 483 | 59 | 483:59 | Lung | 288 |
| LUSC | Lung squamous cell carcinoma | 486 | 50 | 243:25 | Lung | 288 |
| MESO | Mesothelioma | 87 | 0 | 1:0 | - | - |
| OV | Ovarian serous cystadenocarcinoma | 426 | 0 | 1:0 | Ovary | 88 |
| PAAD | Pancreatic adenocarcinoma | 179 | 4 | 179:4 | Pancreas | 167 |
| PCPG | Pheochromocytoma and Paraganglioma | 182 | 3 | 182:3 | - | - |
| PRAD | Prostate adenocarcinoma | 492 | 52 | 123:13 | Prostate | 100 |
| READ | Rectum adenocarcinoma | 92 | 10 | 46:5 | Colon | 308 |
| SARC | Sarcoma | 262 | 2 | 131:1 | - | - |
| SKCM | Skin Cutaneous Melanoma | 461 | 1 | 461:1 | Skin | 557 |
| STAD | Stomach adenocarcinoma | 408 | 36 | 34:3 | Stomach | 175 |
| TGCT | Testicular Germ Cell Tumors | 137 | 0 | 1:0 | Testis | 165 |
| THCA | Thyroid carcinoma | 512 | 59 | 512:59 | Thyroid | 278 |
| THYM | Thymoma | 118 | 2 | 59:1 | Blood | 337 |
| UCEC | Uterine Corpus Endometrial Carcinoma | 174 | 13 | 174:13 | Uterus | 78 |
| UCS | Uterine Carcinosarcoma | 57 | 0 | 1:0 | Uterus | 78 |
| UVM | Uveal Melanoma | 79 | 0 | 1:0 | - | - |
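The tumor-to-normal ratios in the table above follow from dividing both counts by their greatest common divisor, and the same counts can be turned into inverse-frequency class weights. The sketch below is illustrative only: the helper names are hypothetical, and the weighting formula is the common "balanced" heuristic, total / (number of classes × class count), not something prescribed by the review.

```python
from math import gcd


def reduced_ratio(tumor, normal):
    """Return the tumor:normal ratio in lowest terms, or None without normals."""
    if normal == 0:
        return None  # ratio is undefined when a cohort has no normal samples
    divisor = gcd(tumor, normal)
    return (tumor // divisor, normal // divisor)


def balanced_class_weights(tumor, normal):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = tumor + normal
    return {"tumor": total / (2 * tumor), "normal": total / (2 * normal)}


# Counts taken from the BRCA and CESC rows of the table.
brca_ratio = reduced_ratio(1085, 112)
cesc_ratio = reduced_ratio(306, 3)
brca_weights = balanced_class_weights(1085, 112)
```

For cohorts with zero TCGA normals (for example ACC or GBM) the helper returns `None`, and class weights cannot be computed at all. This is the situation in which the GTEx normals listed in the last column become relevant.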
| Category | Technique | Description | References |
|---|---|---|---|
| Data-Level Methods | Random Undersampling | Deletes instances from the majority class to match the size of the minority class; effective for very large datasets but risks losing information. | Zhang and Chen 2019 |
| | Random Oversampling | Duplicates random instances from the minority class, which increases the risk of overfitting. | Putra and Salam 2025 |
| | SMOTE (Synthetic Minority Over-sampling Technique) | Generates new synthetic minority samples by interpolating between existing ones, rather than simply duplicating them. | Mohammed et al. 2021, Sahu et al. 2024 |
| | ADASYN (Adaptive Synthetic) | Focuses on generating synthetic data for minority samples that are harder to learn. | Sahu et al. 2024 |
| | Hybrid Resampling (e.g., SMOTE-Tomek) | Combines techniques: SMOTE-Tomek generates synthetic samples and then uses Tomek links to remove overlapping samples between classes, cleaning up the decision boundary. | Javaid et al. 2025, Wang et al. 2024 |
| Algorithm-Level / Ensemble Methods | Class Weights / Threshold Adjustment | Assigns a higher penalty (weight) to misclassifying the minority class during training, or adjusts the decision threshold to improve classification of the minority class. | Ramadhan et al. 2026 |
| | Balanced Random Forest | Resamples each bootstrap sample to be balanced before training the trees. | Putra and Salam 2025 |
| Other Approaches | Generative Adversarial Network | Generates synthetic bulk RNA profiles for breast cancer patients, reportedly yielding a considerable increase in classification accuracy. May also capture complex cross-modal interactions between mRNA and miRNA, increasing predictive power for survival. | Gutta et al. 2023, Ahmed et al. 2021 |
| | ACTIVA (Automated Cell-Type Identification and Vector Augmentation) | Framework designed to generate realistic synthetic single-cell RNA sequencing (scRNA-seq) data using rare kidney-cancer cell data. | Heydari et al. 2022 |
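To make the interpolation idea behind SMOTE in the table above concrete, here is a deliberately simplified, SMOTE-style sketch in plain Python. It is not the reference algorithm or any library implementation: the function names, the choice of `k`, and the toy minority samples are all assumptions for illustration.

```python
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: euclidean(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class samples with two features each.
minority_samples = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [3.0, 4.0]]
new_samples = smote_like(minority_samples, n_new=5)
```

Because each synthetic point lies on a segment between two real minority samples, it stays within the range already spanned by the minority class. This is also why such interpolation can blur class boundaries when the classes overlap, which hybrid variants such as SMOTE-Tomek try to address.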
| Fusion strategy | Integration stage | Strengths | Limitations | Implications for ML tasks | Reference |
|---|---|---|---|---|---|
| Early fusion | Raw / engineered features | Simple, flexible | Dominated by scaling, weak alignment | Poor performance when modalities are not directly comparable | Zhang et al. 2014, Mertins et al. 2016 |
| Intermediate fusion | Learned embeddings | Modality-aware integration | Sample-level mismatch | Moderate improvement; limited spatial reasoning | Mertins et al. 2016, Zheng et al. 2022 |
| Late fusion | Predictions | Robustness, simplicity | No cross-modal interaction | Limited benefit for early detection tasks | Mertins et al. 2016, Zheng et al. 2022 |
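The difference between the early and late fusion rows above can be sketched in a few lines. The example is a minimal illustration under stated assumptions: the function names, the feature values, and the per-modality probabilities are hypothetical, and intermediate fusion is omitted because it requires learned embeddings.

```python
def early_fusion(omics_features, image_features):
    """Early fusion: concatenate per-sample feature vectors before modeling."""
    return list(omics_features) + list(image_features)


def late_fusion(probabilities, weights=None):
    """Late fusion: weighted average of per-modality predicted probabilities."""
    if weights is None:
        weights = [1.0] * len(probabilities)
    weighted_sum = sum(p * w for p, w in zip(probabilities, weights))
    return weighted_sum / sum(weights)


# Hypothetical features for one sample: two omics values, one imaging value.
fused_vector = early_fusion([0.1, 0.2], [3.0])

# Hypothetical tumor probabilities from an omics model and a pathology model.
equal_vote = late_fusion([0.9, 0.5])
weighted_vote = late_fusion([0.9, 0.5], weights=[3.0, 1.0])
```

In the early-fusion vector the imaging value is an order of magnitude larger than the omics values, which is one way to read the "dominated by scaling" limitation in the table. Late fusion avoids that issue but only ever sees each model's final prediction, so it cannot learn cross-modal interactions.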
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).