Submitted: 22 June 2025
Posted: 23 June 2025
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Cancer Transcriptomics Datasets
2.2. Generation of Predictive Models
- Pi is the absolute distance between the estimated median of biomarker i of the group with the phenotype and the value of the biomarker in the unknown sample;
- Ni is the absolute distance between the estimated median of biomarker i of the control group (negative for phenotype) and the value of the biomarker in the unknown sample;
- Wi is the enrichment score of biomarker i on the group with the phenotype, and n is the total number of biomarkers in the model.
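The quantities defined above can be sketched in a few lines of Python. The scoring equation itself is not shown in this section, so only `biomarker_distances` follows directly from the definitions of Pi and Ni. The function `illustrative_score` is a hypothetical weighted combination chosen for illustration; its form, the function names, and the toy data layout (one row per sample, one column per biomarker) are all assumptions rather than the authors' formula.

```python
from statistics import median


def biomarker_distances(sample, positive_group, control_group):
    """Return (P, N): per-biomarker absolute distances between the sample
    value and the median of the phenotype group (P) and control group (N)."""
    P, N = [], []
    for i, value in enumerate(sample):
        pos_median = median(row[i] for row in positive_group)
        neg_median = median(row[i] for row in control_group)
        P.append(abs(pos_median - value))
        N.append(abs(neg_median - value))
    return P, N


def illustrative_score(P, N, W):
    """ASSUMPTION: a hypothetical weighted combination, not the paper's equation.

    Each biomarker contributes N_i / (P_i + N_i), which is 1 when the sample
    sits on the phenotype median and 0 when it sits on the control median.
    Contributions are averaged with the enrichment scores W_i as weights.
    """
    weighted_sum = 0.0
    for p, n, w in zip(P, N, W):
        total = p + n
        closeness = 0.5 if total == 0 else n / total  # tie when both medians coincide
        weighted_sum += w * closeness
    return weighted_sum / sum(W)
```

With this form, a sample lying exactly on the phenotype medians would score 1 and a sample on the control medians would score 0, whatever the weights.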
2.3. Classical Model Performance Metrics
2.4. Model Scoring Distribution Analysis (MSDA)
- fTNB1 is the fraction of true negatives at the first model score bin (B1), measuring the model’s reliability in correctly identifying negative cases at low scores.
- fTPB10 is the fraction of true positives in the last model score bin (B10), measuring the model’s reliability in correctly identifying positive cases at high scores.
- Ki is the number of observations (samples) in bin i, where bin i lies within a peripheral region of n bins and contains no false positives and no false negatives; such bins represent model scoring regions with high local performance.
- FPj and FNj are the numbers of false positives and false negatives, respectively, in bin j of m bins, where m = 9 (all bins except the one that contains the model’s classification cut-off threshold). These terms penalize failed predictions across the scoring space outside the cut-off region.
- Totalj is the total number of observations in bin j.
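The per-bin quantities listed above can be tabulated with a short Python sketch. It computes only the components; how they combine into a final MSDA value is not shown in this section and is deliberately left out. Several choices are assumptions: scores lie in [0, 1] and are split into 10 equal-width bins, a sample is called positive when its score is at or above the cut-off, the fractions for B1 and B10 are taken relative to the observations inside each bin, and the names `bin_index` and `msda_components` are hypothetical.

```python
def bin_index(score, n_bins=10):
    """Map a score in [0, 1] to a bin index 0..n_bins-1 (B1 is index 0)."""
    return min(int(score * n_bins), n_bins - 1)


def msda_components(scores, labels, cutoff=0.5, n_bins=10):
    """Tabulate per-bin outcome counts and the quantities defined in the text."""
    bins = [{"TP": 0, "TN": 0, "FP": 0, "FN": 0} for _ in range(n_bins)]
    for score, label in zip(scores, labels):
        predicted_positive = score >= cutoff  # ASSUMPTION: threshold rule
        if predicted_positive:
            kind = "TP" if label else "FP"
        else:
            kind = "FN" if label else "TN"
        bins[bin_index(score, n_bins)][kind] += 1

    totals = [sum(b.values()) for b in bins]  # Total_j for every bin
    cutoff_bin = bin_index(cutoff, n_bins)    # bin excluded from the penalty terms
    outside = [j for j in range(n_bins) if j != cutoff_bin]

    return {
        # ASSUMPTION: fractions are relative to the observations in that bin
        "fTN_B1": bins[0]["TN"] / totals[0] if totals[0] else 0.0,
        "fTP_B10": bins[-1]["TP"] / totals[-1] if totals[-1] else 0.0,
        "totals": totals,
        # observation counts of bins with no FP and no FN (candidates for K_i)
        "error_free": {j: totals[j] for j in outside
                       if bins[j]["FP"] + bins[j]["FN"] == 0},
        # (FP_j, FN_j) for the m = n_bins - 1 bins outside the cut-off bin
        "errors": {j: (bins[j]["FP"], bins[j]["FN"]) for j in outside},
    }
```

Which error-free bins count as "peripheral" is left to the caller, since the region size n is not specified here.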
2.5. Data Manipulations and Analysis
3. Results
3.1. Breast Cancer Models (BCM) Evaluation
3.2. Lung Cancer Model (LCM) Evaluation
3.3. Renal Cancer Models
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| MDPI | Multidisciplinary Digital Publishing Institute |
| MSDA | Model Scoring Distribution Analysis |
| ROC | Receiver Operating Characteristic |
| AUC | Area Under the Receiver Operating Characteristic Curve |
| MSD | Model Scoring Distribution |
| ML | Machine Learning |
| AI | Artificial Intelligence |
| BRCA | Breast Cancer Cell Type |
| LUSC | Lung Squamous Cell Carcinoma subtype |
| LUAD | Lung Adenocarcinoma Cell subtype |
| KICH | Chromophobe Renal Cell carcinoma subtype |
| KIRP | Kidney Renal Papillary cell carcinoma subtype |
| KIRC | Kidney Renal Clear cell carcinoma subtype |
| FPKM | Fragments Per Kilobase of transcript per Million mapped reads |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).