Submitted:
02 December 2025
Posted:
04 December 2025
Abstract

Keywords:
1. Introduction
- Comprehensive Task Coverage: Unlike previous reviews, this SLR simultaneously investigates three critical prediction tasks—diagnosis, risk assessment, and survival analysis—providing a unified perspective on the ML landscape.
- Exclusive Focus on Structured Data: A strict focus is maintained on models built primarily on structured, tabular data, offering a dedicated resource for this data modality and distinguishing this work from reviews centered on medical imaging.
- Rigorous and Transparent Methodology: The review adheres to the PRISMA 2020 guidelines (Page et al., 2021), employing a systematic search across five major databases with tailored search strings, resulting in a robust synthesis of 42 high-quality, peer-reviewed studies.
- Task-Oriented Synthesis: A novel, task-categorized analysis of the literature is presented, detailing the specific datasets, feature sets, and ML models that are most effective for diagnosis, risk, and survival prediction, respectively.
- Critical Analysis and Future Directions: Beyond summarization, a critical appraisal of current methodologies is provided, highlighting performance trends, identifying common pitfalls such as the lack of external validation, and outlining clear pathways for future research.
2. Related Works
3. Methodology
3.1. Research Questions
3.2. Eligibility Criteria
3.3. Search Strategy
- The search strings were used to retrieve the relevant papers. The initial search returned a total of 772 papers.
- The inclusion and exclusion criteria were applied to each paper, which reduced the set to 138 papers.
- After the abstract, methodology, and conclusion of each paper were reviewed, irrelevant papers were removed, leaving 38 papers.
- Snowballing was performed on the selected papers, bringing the number to 61.
- After duplicates were eliminated, 56 papers remained.
- Finally, all 56 papers were read in full and the quality assessment was applied, resulting in the final 42 papers used for data extraction and synthesis.
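The selection steps above can be tallied in a few lines. The following is a minimal sketch that only restates the stage counts reported in the list and computes the change between consecutive stages; the stage labels and the helper name `stage_changes` are illustrative and not part of the review protocol.

```python
# Stage counts copied from the selection steps listed above.
stages = [
    ("Initial search", 772),
    ("After inclusion/exclusion criteria", 138),
    ("After abstract/methodology/conclusion screening", 38),
    ("After snowballing", 61),
    ("After duplicate removal", 56),
    ("After full-text reading and quality assessment", 42),
]


def stage_changes(stages):
    """Return (label, count, change from previous stage) for each stage."""
    rows = []
    previous = None
    for label, count in stages:
        change = 0 if previous is None else count - previous
        rows.append((label, count, change))
        previous = count
    return rows


changes = stage_changes(stages)
for label, count, change in changes:
    # A positive change (snowballing) adds papers; a negative one removes them.
    print(f"{label:<50} {count:>4} {change:+5d}")
```

Laying the stages out this way makes it easy to see which step adds papers (snowballing) and which steps remove them.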
| Database Name | Search String | #Initial Papers | #Studies after IC/EC | After Screening Title/Abstract | Papers from Snowballing + QA | Duplicates Removed | Final Selection (Full-Text) |
|---|---|---|---|---|---|---|---|
| PubMed | ("Lung Cancer"[tiab] OR "Non-Small Cell Lung Cancer"[tiab] OR "lung neoplasm"[tiab] OR "pulmonary cancer"[tiab]) AND ("Detection"[tiab] OR "Prediction"[tiab] OR "Diagnosis"[tiab] OR "prognosis"[tiab] OR "early detection"[tiab] OR "risk assessment"[tiab]) AND ("ML"[tiab] OR "Machine Learning"[tiab] OR "AI"[tiab] OR "Artificial Intelligence"[tiab] OR "data mining"[tiab] OR "ensemble learning"[tiab] OR "Ensemble Stacking"[tiab] OR "Stacking"[tiab] OR "Bagging"[tiab] OR "Boosting"[tiab]) AND ("tabular data"[tiab] OR "structured data"[tiab] OR "clinical data"[tiab] OR "patient records"[tiab] OR "health records"[tiab] OR "EHR"[tiab] OR "electronic health records"[tiab]) NOT ("Image"[ti] OR "Imaging"[ti] OR "Medical Imaging"[ti] OR "histopathology"[ti] OR "segmentation"[ti] OR "MRI"[ti] OR "X-ray"[ti] OR "ultrasound"[ti] OR "CT"[ti] OR "tomography"[ti] OR "blood"[ti] OR "vitamin"[ti] OR "mechanism"[ti] OR "transmission"[ti] OR "prevention"[ti] OR "forecasting"[ti] OR clinical studies[ptype] OR Editorial[ptyp] OR Comment[ptyp] OR Case Reports[ptyp] OR "survey"[tiab] OR "case study"[ti] OR "review"[tiab] OR "preprints"[tiab]) AND English[lang] | 80 | 19 | 19 | 9 | 5 | 42 |
| IEEE | ((Lung Cancer OR Non-Small Cell Lung Cancer OR lung neoplasm OR pulmonary cancer) AND (Detection OR Prediction OR Diagnosis OR Prognosis OR Early Detection OR Risk Assessment) AND (Machine Learning OR ML OR Deep Learning OR AI OR Artificial Intelligence OR Data Mining OR Ensemble Learning OR Supervised Learning OR Decision Trees OR Random Forest OR SVM OR Logistic Regression OR Neural Networks OR Gradient Boosting OR XGBoost OR Ensemble Stacking OR Stacking OR Bagging OR Boosting) AND (Tabular Data OR Structured Data OR Clinical Data OR Electronic Health Records OR EHR OR Health Records OR Patient Records OR Patient Demographics) NOT (Image OR Imaging OR Medical Imaging OR Radiology OR Histopathology OR Segmentation OR MRI OR X-ray OR Ultrasound OR CT OR Tomography OR PET Scan OR Radiomics OR Blood OR Vitamin OR Mechanism OR Transmission OR Prevention OR Forecasting OR Trends OR Epidemiology OR Case Study OR Survey OR Review)) | 103 | 4 | 4 | – | – | – |
| Scopus | (TITLE-ABS-KEY("Lung Cancer" OR "Non-Small Cell Lung Cancer" OR "lung neoplasm" OR "pulmonary cancer")) AND (TITLE-ABS-KEY("Detection" OR "Prediction" OR "Diagnosis" OR "Prognosis" OR "Early Detection" OR "Risk Assessment")) AND (TITLE-ABS-KEY("Machine Learning" OR "ML" OR "AI" OR "Artificial Intelligence" OR "Data Mining" OR "Ensemble Learning" OR "Supervised Learning" OR "Ensemble Stacking" OR "Stacking" OR "Bagging" OR "Boosting")) AND (TITLE-ABS-KEY("Tabular Data" OR "Structured Data" OR "Clinical Data" OR "EHR" OR "Health Records" OR "Patient Records")) AND (NOT TITLE-ABS-KEY("Image" OR "Imaging" OR "Medical Imaging" OR "Radiology" OR "Histopathology" OR "Segmentation" OR "MRI" OR "X-ray" OR "Ultrasound" OR "CT" OR "Tomography" OR "PET Scan" OR "Radiomics" OR "Blood" OR "Vitamin" OR "Mechanism" OR "Transmission" OR "Prevention" OR "Forecasting" OR "Trends" OR "Epidemiology" OR "Case Study" OR "Survey" OR "Review" OR "preprints")) AND (LIMIT-TO(LANGUAGE, "English")) | 104 | 84 | 5 | – | – | – |
| ACM | ("Lung Cancer" OR "Non-Small Cell Lung Cancer" OR "lung neoplasm" OR "pulmonary cancer") AND ("Detection" OR "Prediction" OR "Diagnosis" OR "prognosis" OR "early detection" OR "risk assessment") AND ("ML" OR "Machine Learning" OR "AI" OR "Artificial Intelligence" OR "data mining" OR "ensemble learning" OR "Ensemble Stacking" OR "Stacking" OR "Bagging" OR "Boosting") AND ("tabular data" OR "structured data" OR "clinical data" OR "patient records" OR "health records" OR "EHR" OR "electronic health records") AND NOT ("Image" OR "Imaging" OR "Medical Imaging" OR "histopathology" OR "segmentation" OR "MRI" OR "X-ray" OR "ultrasound" OR "CT" OR "tomography" OR "blood" OR "vitamin" OR "mechanism" OR "transmission" OR "prevention" OR "forecasting" OR "clinical studies" OR "editorial" OR "comment" OR "case reports" OR "survey" OR "case study" OR "review" OR "preprints") | 6 | 0 | 0 | – | – | – |
| Science Direct | "Lung Cancer" AND ("Prediction" OR "Detection") AND ("Ensemble Learning" OR "Machine Learning") AND ("Tabular Data" OR "Structured Data" OR "Clinical Data") | 479 | 15 | 10 | – | – | – |
| Total | – | 772 | 138 | 38 | 9 | 5 | 42 |
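The search strings in the table share one structure: several concept blocks of OR-ed synonyms, joined with AND, followed by an exclusion block. As a rough sketch of that structure, the snippet below assembles a generic Boolean query from term lists. The helpers `or_block` and `build_query` are hypothetical, the term lists are a shortened subset of those in the table, and the output uses a plain `AND NOT` syntax rather than the field tags of any particular database.

```python
def or_block(terms):
    """Join quoted terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(include_blocks, exclude_terms):
    """AND the inclusion blocks together, then exclude one OR-block of terms."""
    query = " AND ".join(or_block(block) for block in include_blocks)
    if exclude_terms:
        query += " AND NOT " + or_block(exclude_terms)
    return query


# A shortened subset of the terms in the table, for illustration only.
include_blocks = [
    ["Lung Cancer", "Non-Small Cell Lung Cancer"],      # disease
    ["Prediction", "Diagnosis"],                        # task
    ["Machine Learning", "Ensemble Learning"],          # method
    ["Tabular Data", "Structured Data", "Clinical Data"],  # data modality
]
exclude_terms = ["Imaging", "MRI", "Review"]

query = build_query(include_blocks, exclude_terms)
print(query)
```

Each database then needs its own field qualifiers (for example title/abstract restrictions) applied on top of this skeleton, as the per-database strings in the table show.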

3.4. Selection Process
- Research Objective Clarity
  - Does the paper have any relevance to lung cancer prediction?
- Dataset Structure and Relevance
  - Does the paper contain a clear description of the dataset used, and is the dataset structured?
- Descriptive Methodology & Use of Machine Learning
  - Does the paper clearly describe the methodology used?
  - Does it include machine learning models?
- Performance Metrics Mentioned
  - Does the paper report the performance metrics of its proposed methodology?
- Preprocessing Methods Described
  - Does the paper contain details on the feature selection and preprocessing methods used?
- Dataset Availability and Details
  - Does the paper describe the availability of the dataset (e.g., public, private)?
- Comparisons Made
  - Does the paper make comparisons with other methods/models?
- Explainability and Interpretability Methods
  - Does the paper use any explainable AI technique to explain the model’s output?
- Challenges & Limitations
  - Does the paper discuss the limitations of its methodology?
- Recent Publication
  - Was the paper published within the last 5–7 years?
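One way such a checklist can be turned into a score is sketched below. The ten criterion names come from the list above, but the scoring scheme (yes = 1, partial = 0.5, no = 0) and the cutoff of 6.0 are assumptions made purely for illustration; the review does not state the weights or threshold it applied.

```python
CRITERIA = [
    "Research Objective Clarity",
    "Dataset Structure and Relevance",
    "Descriptive Methodology & Use of Machine Learning",
    "Performance Metrics Mentioned",
    "Preprocessing Methods Described",
    "Dataset Availability and Details",
    "Comparisons Made",
    "Explainability and Interpretability Methods",
    "Challenges & Limitations",
    "Recent Publication",
]

# Hypothetical scoring scheme: the review does not state its weights or cutoff.
ANSWER_SCORES = {"yes": 1.0, "partial": 0.5, "no": 0.0}
THRESHOLD = 6.0  # assumed cutoff, for illustration only


def quality_score(answers):
    """Sum the per-criterion scores; criteria without an answer count as 'no'."""
    return sum(ANSWER_SCORES[answers.get(c, "no")] for c in CRITERIA)


def passes(answers, threshold=THRESHOLD):
    """True when the total score reaches the (assumed) threshold."""
    return quality_score(answers) >= threshold


# A made-up paper: meets every criterion except explainability,
# and only partially discusses its limitations.
example = {c: "yes" for c in CRITERIA}
example["Explainability and Interpretability Methods"] = "no"
example["Challenges & Limitations"] = "partial"

print(quality_score(example), passes(example))
```

Keeping the answers per criterion, rather than only the total, makes it possible to report which criteria most often fail across the candidate papers.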
3.5. Data Collection
3.6. Data Synthesis
4. Results
RQ1: What is the status of machine learning models for the prediction of lung cancer from tabular datasets over the years?
RQ2: What are the key features (attributes) used in lung cancer prediction models?
RQ3: Which tabular datasets are frequently used in lung cancer prediction?
3.1. Which datasets are used for lung cancer diagnosis prediction?
3.2. Which datasets are used for lung cancer risk prediction?
3.3. Which datasets are used for lung cancer survival prediction?
RQ4: What preprocessing techniques are applied to the datasets before they are fed into the models?
RQ5: What are the most commonly used machine learning algorithms for lung cancer prediction on tabular datasets?
5.1. Which models are commonly used for diagnosis prediction?
5.2. Which models are commonly used for risk prediction?
5.3. Which models are commonly used for survival prediction?
RQ6: What are the evaluation metrics that are set to validate the model’s performance?
6.1. Which validation methods are used to evaluate the models?
6.2. Which performance metrics have been used most?
RQ7: What feature selection and dimensionality reduction techniques are used, and how do they affect the performance of the models?
| Technique Category | Methods | PS# |
|---|---|---|
| Filter Methods | ANOVA F-value, Mutual Information, Spearman’s rank correlation, Gain Ratio | PS12 (ANOVA), PS31 (Mutual Information), PS13 (Spearman), PS22 (Gain Ratio) |
| Wrapper Methods | LASSO regression, Recursive Feature Elimination (RFE) | PS24 (LASSO), PS12 (RFE), PS39 (RFE), PS41 (LASSO) |
| Embedded Methods | XGBoost/RF feature importance, SVM kernel coefficients | PS8 (XGB/RF), PS20 (SVM), PS28 (XGBoost), PS22 (RF), PS33 (ExtraTree), PS39 (CFS, ReliefF, CSO) |
| Clinical/Expert-Driven | Clinical relevance, expert-guided grouping | PS21 (clinical experience), PS19 (clinical relevance), PS32 (clinical-guided mapping), PS16 |
| Statistical/Univariate | Univariate analysis (e.g., logistic regression, Cochran-Mantel-Haenszel) | PS28 (univariate), PS24 (multivariate logistic regression), PS23 (multivariate analysis) |
| Meta-Heuristic/Optimization | Squirrel Search Algorithm (SSA), Genetic Algorithm (GA), Cuckoo Search | PS7 (SSA), PS39 (GA, Cuckoo Search) |
| Other | Backward stepwise selection, AIC-based selection | PS14 (backward stepwise), PS17 (AIC) |
| Not Specified | – | PS10, PS25, PS15, PS27, PS29, PS9, PS35, PS36, PS40 |
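To make the filter category concrete, the sketch below ranks features by Spearman's rank correlation with the target and keeps those above a cutoff. It is a standard-library illustration, not the procedure of any reviewed study: the feature names, the toy values, and the 0.5 cutoff are all made up.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def filter_features(features, target, min_abs_rho):
    """Keep features whose |rho| with the target reaches the cutoff."""
    scores = {name: spearman(col, target) for name, col in features.items()}
    kept = [name for name, rho in scores.items() if abs(rho) >= min_abs_rho]
    return kept, scores


# Toy, made-up data: not taken from any reviewed study.
features = {
    "pack_years": [0, 5, 10, 20, 30, 40],
    "age": [40, 60, 45, 70, 55, 65],
    "noise": [3, 1, 2, 3, 1, 2],
}
target = [0, 0, 0, 1, 1, 1]

kept, scores = filter_features(features, target, min_abs_rho=0.5)  # assumed cutoff
print(kept)
```

A filter of this kind scores each feature independently of any model, which is what separates it from the wrapper and embedded rows of the table.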
RQ8: What ensemble techniques are used to improve performance over traditional ML models?
RQ9: What techniques are used to interpret the results of the models?
| Category | Methods | PS# |
|---|---|---|
| Feature Importance | SHAP, LIME, Permutation Importance, RF/XGBoost Feature Importance | PS8, PS12, PS23, PS31, PS33, PS36, PS41, PS3, PS24, PS28, PS42, PS19 |
| Visualization Tools | Nomograms, Calibration Plots, t-SNE, Kaplan-Meier Curves, PDP, Beeswarm | PS24, PS13, PS15, PS11, PS16, PS23, PS28, PS33, PS41, PS27 |
| Model-Agnostic Methods | SHAP, LIME, Partial Dependence Plots (PDP) | PS12, PS23, PS36, PS41 |
| Statistical Analysis | Pearson’s/Heatmap Correlation, Odds Ratios, Cox Regression, Decision Curves | PS26, PS28, PS13, PS16, PS27 |
| Sensitivity Analysis | Occlusion Sensitivity, Input Perturbation | PS11, PS19 |
| Inherently Interpretable Models | Decision Trees, Bayesian Networks, Logistic Regression | PS3, PS18, PS35, PS28 |
| Attention Mechanisms | Transformer Attention Scores | PS32 |
| Cluster Analysis | k-means on Pathway Representations | PS32 |
| Example-Based Explanations | Case-Based Reasoning (e.g., recurrence prediction) | PS10 |
| Localization Techniques | Visual Bounding Boxes (e.g., infected regions) | PS34 |
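Permutation importance, listed in the table under Feature Importance, can be illustrated without any library: shuffle one feature column at a time and record how much accuracy drops. The sketch below assumes a stand-in rule-based "model" and made-up rows so that it runs on its own; the names `toy_model`, `pack_years`, and `noise` are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(labels)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = {}
    for name in rows[0]:
        drops = []
        for _ in range(n_repeats):
            column = [row[name] for row in rows]
            rng.shuffle(column)
            # Rebuild the rows with only this one feature permuted.
            shuffled = [dict(row, **{name: v}) for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances[name] = sum(drops) / n_repeats
    return importances


# Stand-in "model": a fixed rule, used only so the sketch runs without training.
def toy_model(row):
    return 1 if row["pack_years"] >= 20 else 0


# Made-up rows; "noise" is ignored by the rule, so shuffling it changes nothing.
rows = [
    {"pack_years": 0, "noise": 3}, {"pack_years": 5, "noise": 1},
    {"pack_years": 10, "noise": 2}, {"pack_years": 20, "noise": 3},
    {"pack_years": 30, "noise": 1}, {"pack_years": 40, "noise": 2},
]
labels = [0, 0, 0, 1, 1, 1]

importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

Because the procedure only needs predictions, it applies to any fitted model, which is why it is grouped with model-agnostic techniques such as SHAP and LIME.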
5. Performance Analysis
6. Discussion
7. Limitations and Future Works
8. Conclusion
Author Contributions: Towhidul Islam
Conflicts of Interest
Appendix A. Primary Studies
| PS# | Title | Reference |
|---|---|---|
| PS1 | Sex and Smoking Status Effects on the Early Detection of Early Lung Cancer in High-Risk Smokers Using an Electronic Nose | McWilliams et al. (2015) |
| PS2 | A deep learning approach for overall survival prediction in lung cancer with missing values | Caruso et al. (2024) |
| PS3 | A new tool to predict lung cancer based on risk factors | Ahmad and Mayya (2020) |
| PS4 | Benign-malignant classification of pulmonary nodules by low-dose spiral computerized tomography and clinical data with machine learning in opportunistic screening | Zheng et al. (2023) |
| PS5 | Machine learning application in personalised lung cancer recurrence and survivability prediction | Yang et al. (2022) |
| PS6 | Synergy between imputed genetic pathway and clinical information for predicting recurrence in early stage non-small cell lung cancer | Timilsina et al. (2023) |
| PS7 | A Heuristic Machine Learning-Based Optimization Technique to Predict Lung Cancer Patient Survival | Kukreja et al. (2023) |
| PS8 | A Machine Learning-Based Investigation of Gender-Specific Prognosis of Lung Cancers | Yeh et al. (2021) |
| PS9 | A Risk Model for Lung Cancer Incidence | Hoggart et al. (2012) |
| PS10 | An Artificial Intelligence-Based Tool for Data Analysis and Prognosis in Cancer Patients: Results from the Clarify Study | Torrente et al. (2022) |
| PS11 | Artificial Intelligence-Based Prediction of Lung Cancer Risk Using Nonimaging Electronic Medical Records: Deep Learning Approach | Yeh et al. (2021) |
| PS12 | Body composition radiomic features as a predictor of survival in patients with non-small cellular lung carcinoma: A multicenter retrospective study | Rozynek et al. (2024) |
| PS13 | Comparison of nomogram and machine-learning methods for predicting the survival of non-small cell lung cancer patients | Lei et al. (2022) |
| PS14 | Developing and Validating a Lung Cancer Risk Prediction Model: A Nationwide Population-Based Study | Rubin et al. (2023) |
| PS15 | Development and Validation of a Deep Learning Model for Non-Small Cell Lung Cancer Survival | She et al. (2020) |
| PS16 | Development of a "meta-model" to address missing data, predict patient-specific cancer survival and provide a foundation for clinical decision support | Baron et al. (2021) |
| PS17 | Development of a risk prediction model for lung cancer: The Japan Public Health Center-based Prospective Study | Charvat et al. (2018) |
| PS18 | Early Detection and Prevention of Cancer using Data Mining Techniques | Ramachandran et al. (2014) |
| PS19 | Exploring the efficacy of artificial neural networks in predicting lung cancer recurrence: a retrospective study based on patient records | Lorenc et al. (2023) |
| PS20 | Identification of non-small cell lung cancer with chronic obstructive pulmonary disease using clinical symptoms and routine examination: a retrospective study | Zhuan et al. (2023) |
| PS21 | Interpretable deep learning survival predictive tool for small cell lung cancer | Zhuan et al. (2023) |
| PS22 | Lung Cancer Risk Prediction with Machine Learning Models | Dritsas and Trigka (2022) |
| PS23 | Machine learning approaches for prediction of early death among lung cancer patients with bone metastases using routine clinical characteristics: An analysis of 19,887 patients | Cui et al. (2022) |
| PS24 | Machine learning predictive models and risk factors for lymph node metastasis in non-small cell lung cancer | Wu et al. (2024) |
| PS25 | Multi-Class Neural Networks to Predict Lung Cancer | Rajan et al. (2019) |
| PS26 | Performance of machine learning algorithms for lung cancer prediction: a comparative approach | Maurya et al. (2024) |
| PS27 | Prediction of lung cancer patient survival via supervised machine learning classification techniques | Lynch et al. (2017) |
| PS28 | Prediction of the 1-Year Risk of Incident Lung Cancer: Prospective Study Using Electronic Health Records from the State of Maine | Wang et al. (2019) |
| PS29 | Prognostic models in patients with non-small cell lung cancer using artificial neural networks in comparison with logistic regression | Hanai et al. (2003) |
| PS30 | Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data | Chen and Chen (2022) |
| PS31 | Single Modality vs. Multimodality: What Works Best for Lung Cancer Screening? | Sousa et al. (2023) |
| PS32 | Transformer-based deep learning model for the diagnosis of suspected lung cancer in primary care based on electronic health record data | Wang et al. (2024) |
| PS33 | An eXplainable machine learning framework for predicting the impact of pesticide exposure in lung cancer prognosis | V.R. and S.S. (2025) |
| PS34 | Attention-guided CenterNet deep learning approach for lung cancer detection | Dawood et al. (2025) |
| PS35 | Benchmarking prognosis methods for survivability – A case study for patients with contingent primary cancers | Makond et al. (2021) |
| PS36 | DeepXplainer: An interpretable deep learning based approach for lung cancer detection using explainable artificial intelligence | Wani et al. (2024) |
| PS37 | Detection of lung cancer metastasis from blood using L-MISC nanosensor: Targeting circulating metastatic cues for improved diagnosis | Premachandran et al. (2024) |
| PS38 | Early multi-cancer detection through deep learning: An anomaly detection approach using Variational Autoencoder | Sado et al. (2024) |
| PS39 | Effective multiple cancer disease diagnosis frameworks for improved healthcare using machine learning | Hsu et al. (2021) |
| PS40 | Lung cancer disease detection using service-oriented architectures and multivariate boosting classifier | Chandrasekar et al. (2022) |
| PS41 | Lung cancer survival period prediction and understanding: Deep learning approaches | Doppalapudi et al. (2021) |
| PS42 | Towards automatic forecasting of lung nodule diameter with tabular data and CT imaging | Ferreira et al. (2024) |
Appendix A.1
References
- Ahmad S. Ahmad and Ali M. Mayya. A new tool to predict lung cancer based on risk factors. Heliyon, 6(2):e03402, February 2020. ISSN 2405-8440.
- Fatimah Abdulazim Altuhaifa, Khin Than Win, and Guoxin Su. Predicting lung cancer survival based on clinical data using machine learning: A review. Comput Biol Med, 165:107338, October 2023. ISSN 1879-0534.
- Peter B Bach, Michael W Kattan, Mark D Thornquist, Mark G Kris, R Cameron Tate, Matt J Barnett, L-J Hsieh, and Colin B Begg. Variations in lung cancer risk among smokers. Journal of the National Cancer Institute, 95(6):470–478, 2003.
- Jason M. Baron, Ketan Paranjape, Tara Love, Vishakha Sharma, Denise Heaney, and Matthew Prime. Development of a "meta-model" to address missing data, predict patient-specific cancer survival and provide a foundation for clinical decision support. J Am Med Inform Assoc, 28(3):605–615, March 2021. ISSN 1527-974X.
- Camillo Maria Caruso, Valerio Guarrasi, Sara Ramella, and Paolo Soda. A deep learning approach for overall survival prediction in lung cancer with missing values. Computer Methods and Programs in Biomedicine, 254:108308, September 2024. ISSN 0169-2607. URL https://www.sciencedirect.com/science/article/pii/S016926072400302X.
- Urmila Chandran, Jenna Reps, Robert Yang, Anil Vachani, Fabien Maldonado, and Iftekhar Kalsekar. Machine Learning and Real-World Data to Predict Lung Cancer Risk in Routine Care. Cancer Epidemiol Biomarkers Prev, 32(3):337–343, March 2023. ISSN 1538-7755.
- Thaventhiran Chandrasekar, Sekar Kidambi Raju, Manikandan Ramachandran, Rizwan Patan, and Amir H. Gandomi. Lung cancer disease detection using service-oriented architectures and multivariate boosting classifier. Applied Soft Computing, 122:108820, June 2022. ISSN 1568-4946. URL https://www.sciencedirect.com/science/article/pii/S1568494622002253.
- Hadrien Charvat, Shizuka Sasazuki, Taichi Shimazu, Sanjeev Budhathoki, Manami Inoue, Motoki Iwasaki, Norie Sawada, Taiki Yamaji, Shoichiro Tsugane, and JPHC Study Group. Development of a risk prediction model for lung cancer: The Japan Public Health Center-based Prospective Study. Cancer Sci, 109(3):854–862, March 2018. ISSN 1349-7006.
- Anjun Chen and Drake O. Chen. Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data. Sci Rep, 12(1):17917, October 2022. ISSN 2045-2322.
- Yunpeng Cui, Xuedong Shi, Shengjie Wang, Yong Qin, Bailin Wang, Xiaotong Che, and Mingxing Lei. Machine learning approaches for prediction of early death among lung cancer patients with bone metastases using routine clinical characteristics: An analysis of 19,887 patients. Front Public Health, 10:1019168, 2022. ISSN 2296-2565.
- Hussain Dawood, Marriam Nawaz, Muhammad U. Ilyas, Tahira Nazir, and Ali Javed. Attention-guided CenterNet deep learning approach for lung cancer detection. Computers in Biology and Medicine, 186:109613, March 2025. ISSN 0010-4825. URL https://www.sciencedirect.com/science/article/pii/S0010482524016986.
- Shreyesh Doppalapudi, Robin G. Qiu, and Youakim Badr. Lung cancer survival period prediction and understanding: Deep learning approaches. International Journal of Medical Informatics, 148:104371, April 2021. ISSN 1386-5056. URL https://www.sciencedirect.com/science/article/pii/S1386505620319079.
- Elias Dritsas and Maria Trigka. Lung Cancer Risk Prediction with Machine Learning Models. Big Data and Cognitive Computing, 6(4):139, December 2022. ISSN 2504-2289. URL https://www.mdpi.com/2504-2289/6/4/139.
- Carlos A. Ferreira, Kiran Vaidhya Venkadesh, Colin Jacobs, Miguel Coimbra, and Aurélio Campilho. Towards automatic forecasting of lung nodule diameter with tabular data and CT imaging. Biomedical Signal Processing and Control, 96:106625, October 2024. ISSN 1746-8094. URL https://www.sciencedirect.com/science/article/pii/S1746809424006839.
- Taizo Hanai, Yasushi Yatabe, Yusuke Nakayama, Takashi Takahashi, Hiroyuki Honda, Tetsuya Mitsudomi, and Takeshi Kobayashi. Prognostic models in patients with non-small-cell lung cancer using artificial neural networks in comparison with logistic regression. Cancer Sci, 94(5):473–477, May 2003. ISSN 1347-9032.
- Clive Hoggart, Paul Brennan, Anne Tjonneland, Ulla Vogel, Kim Overvad, Jane Nautrup Østergaard, Rudolf Kaaks, Federico Canzian, Heiner Boeing, Annika Steffen, Antonia Trichopoulou, Christina Bamia, Dimitrios Trichopoulos, Mattias Johansson, Domenico Palli, Vittorio Krogh, Rosario Tumino, Carlotta Sacerdote, Salvatore Panico, Hendriek Boshuizen, H. Bas Bueno-de Mesquita, Petra H.M. Peeters, Eiliv Lund, Inger Torhild Gram, Tonje Braaten, Laudina Rodríguez, Antonio Agudo, Emilio Sanchez-Cantalejo, Larraitz Arriola, Maria-Dolores Chirlaque, Aurelio Barricarte, Torgny Rasmuson, Kay-Tee Khaw, Nicholas Wareham, Naomi E. Allen, Elio Riboli, and Paolo Vineis. A Risk Model for Lung Cancer Incidence. Cancer Prev Res (Phila), 5(6):834–846, June 2012. ISSN 1940-6207. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4295118/.
- Ching-Hsien Hsu, Xing Chen, Weiwei Lin, Chuntao Jiang, Youhong Zhang, Zhifeng Hao, and Yeh-Ching Chung. Effective multiple cancer disease diagnosis frameworks for improved healthcare using machine learning. Measurement, 175:109145, April 2021. ISSN 0263-2241. URL https://www.sciencedirect.com/science/article/pii/S0263224121001706.
- Yah Ru Juang, Lina Ang, and Wei Jie Seow. Predictive performance of risk prediction models for lung cancer incidence in Western and Asian countries: a systematic review and meta-analysis. Sci Rep, 15(1):4259, March 2025. ISSN 2045-2322.
- Hormuzd A Katki, Stephanie A Kovalchik, Christine D Berg, Li C Cheung, and Anil K Chaturvedi. Development and validation of risk models to select ever-smokers for CT lung cancer screening. JAMA, 315(21):2300–2311, 2016.
- Sonia Kukreja, Munish Sabharwal, Mohd Asif Shah, and D. S. Gill. A Heuristic Machine Learning-Based Optimization Technique to Predict Lung Cancer Patient Survival. Comput Intell Neurosci, 2023:4506488, 2023. ISSN 1687-5273.
- Haike Lei, Xiaosheng Li, Wuren Ma, Na Hong, Chun Liu, Wei Zhou, Hong Zhou, Mengchun Gong, Ying Wang, Guixue Wang, and Yongzhong Wu. Comparison of nomogram and machine-learning methods for predicting the survival of non-small cell lung cancer patients. Cancer Innov, 1(2):135–145, August 2022. ISSN 2770-9183.
- Andżelika Lorenc, Anna Romaszko-Wojtowicz, Łukasz Jaśkiewicz, Anna Doboszyńska, and Adam Buciński. Exploring the efficacy of artificial neural networks in predicting lung cancer recurrence: a retrospective study based on patient records. Transl Lung Cancer Res, 12(10):2083–2097, October 2023. ISSN 2218-6751.
- Chip M. Lynch, Behnaz Abdollahi, Joshua D. Fuqua, Alexandra R. de Carlo, James A. Bartholomai, Rayeanne N. Balgemann, Victor H. van Berkel, and Hermann B. Frieboes. Prediction of lung cancer patient survival via supervised machine learning classification techniques. Int J Med Inform, 108:1–8, December 2017. ISSN 1872-8243.
- Bunjira Makond, Kung-Jeng Wang, and Kung-Min Wang. Benchmarking prognosis methods for survivability – A case study for patients with contingent primary cancers. Computers in Biology and Medicine, 138:104888, November 2021. ISSN 0010-4825. URL https://www.sciencedirect.com/science/article/pii/S001048252100682X.
- Muntasir Mamun, Afia Farjana, Miraz Al Mamun, and Md Salim Ahammed. Lung cancer prediction model using ensemble learning techniques and a systematic review analysis. In 2022 IEEE World AI IoT Congress (AIIoT), pages 187–193, June 2022. URL https://ieeexplore.ieee.org/abstract/document/9817326.
- Satya Prakash Maurya, Pushpendra Singh Sisodia, Rahul Mishra, and Devesh Pratap Singh. Performance of machine learning algorithms for lung cancer prediction: a comparative approach. Sci Rep, 14(1):18562, August 2024. ISSN 2045-2322.
- Annette McWilliams, Parmida Beigi, Akhila Srinidhi, Stephen Lam, and Calum E. MacAulay. Sex and Smoking Status Effects on the Early Detection of Early Lung Cancer in High-Risk Smokers Using an Electronic Nose. IEEE Trans. Biomed. Eng., 62(8):2044–2054, August 2015. ISSN 0018-9294, 1558-2531. URL http://ieeexplore.ieee.org/document/7058387/.
- Matthew J Page, Joanne E McKenzie, Patrick M Bossuyt, Isabelle Boutron, Tammy C Hoffmann, Cynthia D Mulrow, Larissa Shamseer, Jennifer M Tetzlaff, Elie A Akl, Sue E Brennan, Roger Chou, Julie Glanville, Jeremy M Grimshaw, Asbjørn Hróbjartsson, Manoj M Lalu, Tianjing Li, Elizabeth W Loder, Evan Mayo-Wilson, Steve McDonald, Luke A McGuinness, Lesley A Stewart, James Thomas, Andrea C Tricco, Vivian A Welch, Penny Whiting, and David Moher. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ, 372:n71, 2021.
- Srilakshmi Premachandran, Ashok Kumar Dhinakaran, Sunit Das, Krishnan Venkatakrishnan, Bo Tan, and Mansi Sharma. Detection of lung cancer metastasis from blood using L-MISC nanosensor: Targeting circulating metastatic cues for improved diagnosis. Biosensors and Bioelectronics, 243:115782, January 2024. ISSN 0956-5663. URL https://www.sciencedirect.com/science/article/pii/S0956566323007248.
- Juliet Rani Rajan, A. Chilambu Chelvan, and J. Shiny Duela. Multi-Class Neural Networks to Predict Lung Cancer. J Med Syst, 43(7):211, May 2019. ISSN 1573-689X.
- P. Ramachandran, N. Girija, and T. Bhuvaneswari. Early Detection and Prevention of Cancer using Data Mining Techniques. International Journal of Computer Applications, 97(13):48–53, July 2014. URL https://ijcaonline.org/archives/volume97/number13/17069-7492/.
- Miłosz Rozynek, Zbisław Tabor, Stanisław Kłęk, and Wadim Wojciechowski. Body composition radiomic features as a predictor of survival in patients with non-small cellular lung carcinoma: A multicenter retrospective study. Nutrition, 120:112336, April 2024. ISSN 1873-1244.
- Katrine H. Rubin, Peter F. Haastrup, Anne Nicolaisen, Sören Möller, Sonja Wehberg, Sanne Rasmussen, Kirubakaran Balasubramaniam, Jens Søndergaard, and Dorte E. Jarbøl. Developing and Validating a Lung Cancer Risk Prediction Model: A Nationwide Population-Based Study. Cancers, 15(2):487, January 2023. ISSN 2072-6694. URL https://www.mdpi.com/2072-6694/15/2/487.
- Innocent Tatchum Sado, Louis Fippo Fitime, Geraud Fokou Pelap, Claude Tinku, Gaelle Mireille Meudje, and Thomas Bouetou Bouetou. Early multi-cancer detection through deep learning: An anomaly detection approach using Variational Autoencoder. Journal of Biomedical Informatics, 160:104751, December 2024. ISSN 1532-0464. URL https://www.sciencedirect.com/science/article/pii/S1532046424001692.
- Yunlang She, Zhuochen Jin, Junqi Wu, Jiajun Deng, Lei Zhang, Hang Su, Gening Jiang, Haipeng Liu, Dong Xie, Nan Cao, Yijiu Ren, and Chang Chen. Development and Validation of a Deep Learning Model for Non-Small Cell Lung Cancer Survival. JAMA Netw Open, 3(6):e205842, June 2020. ISSN 2574-3805.
- Rebecca L. Siegel, Kimberly D. Miller, Nikita Sandeep Wagle, and Ahmedin Jemal. Cancer statistics, 2023. CA A Cancer J Clinicians, 73(1):17–48, January 2023. ISSN 0007-9235, 1542-4863. URL https://acsjournals.onlinelibrary.wiley.com/doi/10.3322/caac.21763.
- Joana Vale Sousa, Pedro Matos, Francisco Silva, Pedro Freitas, Hélder P. Oliveira, and Tania Pereira. Single Modality vs. Multimodality: What Works Best for Lung Cancer Screening? Sensors (Basel), 23(12):5597, June 2023. ISSN 1424-8220.
- Martin C Tammemägi, Hormuzd A Katki, William G Hocking, Timothy R Church, Neil Caporaso, Paul A Kvale, Anil K Chaturvedi, Gerard A Silvestri, Thomas L Riley, John Commins, et al. Selection criteria for lung-cancer screening. New England Journal of Medicine, 368(8):728–736, 2013.
- Mohan Timilsina, Dirk Fey, Samuele Buosi, Adrianna Janik, Luca Costabello, Enric Carcereny, Delvys Rodríguez Abreu, Manuel Cobo, Rafael López Castro, Reyes Bernabé, Pasquale Minervini, Maria Torrente, Mariano Provencio, and Vít Nováček. Synergy between imputed genetic pathway and clinical information for predicting recurrence in early stage non-small cell lung cancer. Journal of Biomedical Informatics, 144:104424, August 2023. ISSN 1532-0464. URL https://www.sciencedirect.com/science/article/pii/S1532046423001454.
- María Torrente, Pedro A. Sousa, Roberto Hernández, Mariola Blanco, Virginia Calvo, Ana Collazo, Gracinda R. Guerreiro, Beatriz Núñez, Joao Pimentao, Juan Cristóbal Sánchez, Manuel Campos, Luca Costabello, Vit Novacek, Ernestina Menasalvas, María Esther Vidal, and Mariano Provencio. An Artificial Intelligence-Based Tool for Data Analysis and Prognosis in Cancer Patients: Results from the Clarify Study. Cancers (Basel), 14(16):4041, August 2022. ISSN 2072-6694.
- Nitha V.R. and Vinod Chandra S.S. An eXplainable machine learning framework for predicting the impact of pesticide exposure in lung cancer prognosis. Journal of Computational Science, 84:102476, January 2025. ISSN 1877-7503. URL https://www.sciencedirect.com/science/article/pii/S1877750324002692.
- Lan Wang, Yonghua Yin, Ben Glampson, Robert Peach, Mauricio Barahona, Brendan C. Delaney, and Erik K. Mayer. Transformer-based deep learning model for the diagnosis of suspected lung cancer in primary care based on electronic health record data. EBioMedicine, 110:105442, December 2024. ISSN 2352-3964. [CrossRef]
- Xiaofang Wang, Yan Zhang, Shiying Hao, Le Zheng, Jiayu Liao, Chengyin Ye, Minjie Xia, Oliver Wang, Modi Liu, Ching Ho Weng, Son Q. Duong, Bo Jin, Shaun T. Alfreds, Frank Stearns, Laura Kanov, Karl G. Sylvester, Eric Widen, Doff B. McElhinney, and Xuefeng B. Ling. Prediction of the 1-Year Risk of Incident Lung Cancer: Prospective Study Using Electronic Health Records from the State of Maine. J Med Internet Res, 21(5):e13260, May 2019. ISSN 1438-8871. [CrossRef]
- Niyaz Ahmad Wani, Ravinder Kumar, and Jatin Bedi. DeepXplainer: An interpretable deep learning based approach for lung cancer detection using explainable artificial intelligence. Computer Methods and Programs in Biomedicine, 243:107879, January 2024. ISSN 0169-2607. URL https://www.sciencedirect.com/science/article/pii/S016926072300545X.
- Bo Wu, Yihui Zhu, Zhuozheng Hu, Jiajun Wu, Weijun Zhou, Maoyan Si, Xiying Cao, Zhicheng Wu, and Wenxiong Zhang. Machine learning predictive models and risk factors for lymph node metastasis in non-small cell lung cancer. BMC Pulm Med, 24(1):526, October 2024. ISSN 1471-2466.
- Yang Yang, Li Xu, Liangdong Sun, Peng Zhang, and Suzanne S. Farid. Machine learning application in personalised lung cancer recurrence and survivability prediction. Computational and Structural Biotechnology Journal, 20:1811–1820, January 2022. ISSN 2001-0370. URL https://www.sciencedirect.com/science/article/pii/S2001037022001106.
- Marvin Chia-Han Yeh, Yu-Hsiang Wang, Hsuan-Chia Yang, Kuan-Jen Bai, Hsiao-Han Wang, and Yu-Chuan Jack Li. Artificial Intelligence-Based Prediction of Lung Cancer Risk Using Nonimaging Electronic Medical Records: Deep Learning Approach. J Med Internet Res, 23(8):e26256, August 2021. ISSN 1438-8871.
- Yansong Zheng, Jing Dong, Xue Yang, Ping Shuai, Yongli Li, Hailin Li, Shengyong Dong, Yan Gong, Miao Liu, and Qiang Zeng. Benign-malignant classification of pulmonary nodules by low-dose spiral computerized tomography and clinical data with machine learning in opportunistic screening. Cancer Medicine, 12(11):12050–12064, 2023. ISSN 2045-7634. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/cam4.5886.
- Bing Zhuan, Hong-Hong Ma, Bo-Chao Zhang, Ping Li, Xi Wang, Qun Yuan, Zhao Yang, and Jun Xie. Identification of non-small cell lung cancer with chronic obstructive pulmonary disease using clinical symptoms and routine examination: a retrospective study. Front Oncol, 13:1158948, July 2023. ISSN 2234-943X. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10419203/.




| Research Question | Rationale |
|---|---|
| 1. What is the status of machine learning models for the prediction of lung cancer from tabular datasets over the years? | To analyze the progress and trends in machine learning models for lung cancer prediction using tabular datasets. |
| 2. What are the key features (attributes) used in lung cancer prediction models? | To identify the most critical features that influence lung cancer prediction models, since selecting the right attributes can significantly enhance the performance. |
| 3. Which tabular datasets are frequently used in lung cancer prediction? 3.1. Which datasets are used for lung cancer diagnosis prediction? 3.2. Which datasets are used for lung cancer risk prediction? 3.3. Which datasets are used for lung cancer survival prediction? | To find out all available datasets used previously and categorize them based on the quality of the datasets. |
| 4. What preprocessing techniques are applied to the data before it is fed into the models? | To compare the preprocessing techniques used in past ML studies on lung cancer prediction and analyze their interrelationships. |
| 5. What are the most commonly used machine learning algorithms for lung cancer prediction on tabular datasets? 5.1. Which models are commonly used for diagnosis prediction? 5.2. Which models are commonly used for risk prediction? 5.3. Which models are commonly used for survival prediction? | To list the methods used in previous studies for predicting lung cancer. |
| 6. What evaluation metrics are used to validate the models' performance? 6.1. Which validation methods are used to evaluate the models? 6.2. Which performance metrics have been used most? | To explore how the ML methods were evaluated in previous studies and which evaluation metrics were used to determine the best-fit model. |
| 7. What feature selection and dimensionality reduction techniques are used to improve the performance of the models? | To identify the most effective feature selection and dimensionality reduction techniques that can enhance the performance of ML models by improving accuracy, reducing computational complexity, and minimizing the risk of overfitting. |
| 8. What ensemble techniques are used to boost performance over traditional ML models? | To explore how ensemble techniques can enhance the predictive performance of traditional ML models by combining multiple models to improve accuracy, robustness, and generalization. |
| 9. What techniques are used to interpret the results of the models? | To list the techniques used to interpret model results and to understand the patterns behind the models' performance. |

| Inclusion Criteria | Exclusion Criteria |
|---|---|
| IC1: Relevance to lung cancer prediction (diagnosis, risk, or survival) | Ex1: Not written in English |
| IC2: Use of tabular/structured datasets (including imaging-derived features in structured format) | Ex2: Book chapters, conference papers, review papers, PhD theses/dissertations |
| IC3: Machine learning model(s) used for prediction | Ex3: Full-text not available |
| IC4: Peer-reviewed journal articles | Ex4: Studies relying primarily on image analysis without structured clinical data |

| Identifier | SP Checklist | Score |
|---|---|---|
| SP1 | Research Objective Clarity | (+1) Yes / (0) No |
| SP2 | Dataset Structure and Relevance | (+1) Yes / (0) No |
| SP3 | Descriptive Methodology & use of Machine Learning | (+1) Yes / (0) No |
| SP4 | Performance Metrics Mentioned | (+1) Yes / (0) No |
| SP5 | Preprocessing Methods described | (+1) Yes / (0) No |
| SP6 | Dataset Availability and Details | (+1) Yes / (0) No |
| SP7 | Comparisons made | (+1) Yes / (0) No |
| SP8 | Explainability and interpretability methods | (+1) Yes / (0) No |
| SP9 | Challenges & Limitations | (+1) Yes / (0) No |
| SP10 | Recent Publication | (+1) Yes / (0) No |
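
The checklist above awards one point per satisfied item, so each study receives an integer quality score between 0 and 10. A minimal sketch of that tally follows. The function name `quality_score` and the example answers are hypothetical, and no inclusion cut-off is applied because the table does not state one.

```python
from typing import Dict

# Checklist identifiers SP1..SP10, matching the quality-assessment table.
CHECKLIST = [f"SP{i}" for i in range(1, 11)]


def quality_score(answers: Dict[str, bool]) -> int:
    """Return the number of checklist items answered 'Yes' (+1 each, 0 for 'No')."""
    missing = [sp for sp in CHECKLIST if sp not in answers]
    if missing:
        raise ValueError(f"unanswered checklist items: {missing}")
    return sum(1 for sp in CHECKLIST if answers[sp])


# Hypothetical study: every item satisfied except SP6 and SP8.
example = {sp: True for sp in CHECKLIST}
example["SP6"] = False
example["SP8"] = False
print(quality_score(example))  # 8
```

Requiring an explicit answer for every item keeps a missing assessment from silently counting as "No".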

| Extracted Data | Description | RQ |
|---|---|---|
| Metadata | Title, Authors, Publication Year, Journal/Conference | RQ1 |
| Dataset | Dataset Name, Dataset Source, Sample Size, Feature Types, Key Features Used | RQ2, RQ3 |
| Methods | ML Algorithms Used, Ensemble Methods Used, Feature Selection Methods, Dimensionality Reduction Methods, Preprocessing Techniques, Interpretability Methods | RQ4, RQ5, RQ7, RQ8, RQ9 |
| Evaluation | Performance Metrics, Best Performing Model | RQ6 |

| Category | Key Attributes | PS# | Count |
|---|---|---|---|
| Demographics | Age, Gender, Race, Marital Status, Education, Living Area, Socioeconomic Status (SES), Medicaid Coverage, Occupation. | PS8, PS21, PS24, PS26, PS10, PS13, PS28, PS23, PS15, PS27, PS22, PS11, PS18, PS29, PS14, PS17, PS9, PS33, PS36, PS40, PS35, PS1 | 22 |
| Clinical Parameters | T/N/M Stage, Tumor Size, Grade, Histology, Primary Site, Metastases (Brain/Liver), AJCC TNM Staging, Karnofsky Performance Scale (KPS), Follow-up Time, Survival Status, Chronic Diseases (COPD, Diabetes, CVDs). | PS8, PS21, PS24, PS10, PS13, PS20, PS19, PS28, PS16, PS23, PS15, PS27, PS22, PS11, PS18, PS29, PS14, PS41, PS35, PS1 | 20 |
| Lifestyle & Environment | Smoking (Pack-Years, Cessation), Alcohol Use, Air Pollution, Occupational Hazards (Asbestos, Silica), Diet, Obesity, Insecticide Use. | PS26, PS3, PS17, PS9, PS33, PS36, PS40, PS1 | 8 |
| Symptoms & Comorbidities | Coughing Blood, Shortness of Breath, Chest Pain, Wheezing, Fatigue, Weight Loss, Chronic Inflammation, Allergy, Snoring, Anemia, Frequent Cold. | PS26, PS3, PS25, PS20, PS28, PS18, PS36, PS40 | 8 |
| Treatment | Surgery (Type/Sequence), Chemotherapy, Radiotherapy, Immunotherapy, Molecular Therapy (EGFR/ALK), Medication Codes (ATC/WHO). | PS8, PS21, PS13, PS23, PS15, PS27, PS19, PS16, PS41, PS35 | 10 |
| Molecular/ Genetic | Biomarkers (EGFR, ALK, KRAS, BRAF), Gene Expression, SNPs, Nuclear Morphology (Size/Shape/Texture), Protein Levels (p27, p53). | PS10, PS7, PS16, PS9, PS38, PS29, PS36 | 7 |
| Imaging/ Diagnostics | CT Nodule Features (Size/Shape/Density), Emphysema Score, FEV1/FVC Ratio, Tumor Size, Goddard Score, PET-derived Features. | PS31, PS20, PS42, PS34, PS37, PS12, PS41 | 7 |
| Social/ Economic Factors | Low Income, Low Education, Medicaid Coverage, Occupational Exposures. | PS28, PS14, PS33, PS40 | 4 |
| Survival/ Outcomes | Overall Survival (OS), 1-Year Survival, Survival Time, Mortality Risk. | PS10, PS13, PS27, PS12, PS41, PS35, PS2, PS5, PS6 | 9 |
| Model/ Methodology-Specific Categories | Machine Learning Models (Transformers, XGBoost, ANNs), Genetic Pathways, Anomaly Detection, Survival Prediction Algorithms. | PS2, PS5, PS6, PS4, PS33, PS42, PS34, PS37, PS38, PS39 | 10 |

| Category | Key Datasets/ Examples |
|---|---|
| Population-Based/Registry | SEER (PS8, PS21, PS24, PS27, PS15, PS41), Danish registers (PS14), EPIC (PS9), JPHC (PS17), large oncologic DB (PS23) |
| Clinical/EHR Data | Flatiron Health EHR (PS16), Maine HIE (PS28), WSIC (PS32), Taiwan NHIRD (PS11, PS35), CLARO (PS2), CLARIFY (PS10) |
| Imaging and Radiology | NLST & LIDC (PS31, PS42), ACRIN NSCLC-DICOM-PET (PS12), LUNA-16/Kaggle (PS34, PS40) |
| Genomic/Multi-Omics | TCGA (PS5, PS6, PS38), GTEx (PS38), SLCG (PS6) |
| Synthetic/Sensor-Based/Specialized | Synthea (PS30), eNose (PS1), Pesticide & Lung Cancer (PS33), Blood nanosensor (PS37), Multi-cancer (PS39), Multi-modal (PS40) |

| Category | Key Datasets/ Examples |
|---|---|
| Population-Based/Registry | Danish national registers (PS14), EPIC (European Prospective Investigation into Cancer and Nutrition) (PS9), Japan Public Health Center Study Cohort (PS17) |
| Clinical/EHR Data | Taiwan National Health Insurance Research Database (NHIRD) (PS11), Lung Cancer Prediction Dataset (PS22), “A new tool to predict lung cancer based on risk factors” (PS3) |
| Synthetic/Sensor-Based | Synthea synthetic patient data (PS30) |

| Category | Key Datasets/ Examples |
|---|---|
| Population-Based/Registry | SEER (PS8, PS21, PS15, PS41), Large oncologic database (PS23) |
| Clinical/EHR Data | CLARIFY (PS10), Retrospective cohort from CUCH (PS13), Flatiron Health EHR (PS16), CLARO (PS2), NHIRD (PS35), Wisconsin Prognostic Lung Cancer subdirectory (PS7), Prognostic models (PS29) |
| Imaging and Radiology | ACRIN NSCLC-DICOM-PET (PS12) |
| Genomic/Multi-Omics | TCGA (PS5), TCGA & SLCG (PS6) |
| Synthetic/Sensor-Based/Specialized | Pesticide & Lung Cancer (PS33) |

| Data Preprocessing | PS# |
|---|---|
| Encoding and Transformation | PS8, PS21, PS20, PS41, PS1, PS36 |
| Normalization/Standardization | PS8, PS21, PS38, PS42, PS12, PS31 |
| Missing Data Handling | PS21, PS20, PS33, PS41 |
| Data Cleaning and Filtering | PS8, PS14, PS15 |
| Data Splitting and Validation | PS26, PS7, PS27, PS16 |
| Domain-Specific Preprocessing | PS12, PS31, PS34, PS37, PS10, PS13 |

| Machine Learning Model | PS# |
|---|---|
| Naïve Bayes (NBayes) | PS8, PS24, PS26, PS20, PS7, PS12, PS22, PS33, PS41, PS35 |
| Decision Tree (DTree) | PS8, PS26, PS13, PS3, PS7, PS12, PS23, PS27, PS22, PS18, PS5, PS33, PS39 |
| Random Forest (RF) | PS8, PS24, PS26, PS30, PS13, PS3, PS12, PS23, PS22, PS31, PS2, PS6, PS33, PS37, PS36, PS41 |
| k-Nearest Neighbor (KNN) | PS8, PS26, PS30, PS20, PS28, PS7, PS12, PS22 |
| Logistic Regression (LR) | PS8, PS26, PS10, PS13, PS20, PS32, PS12, PS23, PS22, PS29, PS14, PS6, PS33, PS35 |
| Support Vector Machine (SVM) | PS24, PS26, PS30, PS20, PS28, PS12, PS27, PS22, PS18, PS5, PS6, PS33, PS39, PS35 |
| Artificial Neural Network (ANN) | PS24, PS26, PS25, PS20, PS19, PS12, PS23, PS22, PS29, PS5, PS6, PS33, PS42, PS39, PS41, PS35 |
| Generalized Linear Model (GLM) | PS24 |
| Gradient Boosting | PS8, PS24, PS26, PS30, PS13, PS20, PS28, PS12, PS23, PS27, PS22, PS6, PS4, PS33, PS36, PS40, PS41 |
| Voting Classifier | PS26 |
| Survival Model | PS21, PS16, PS15, PS17, PS9, PS2, PS34 |
| Deep Learning | PS10, PS31 |
| Graph Machine Learning | PS10 |
| Kaplan-Meier estimator | PS10 |
| Fuzzy Inference System | PS18 |
| Transformer Based model | PS32 |
| Linear Regression | PS27, PS41 |
| K-Means Clustering | PS18 |
| LDA | PS1 |
| VAE | PS38 |
| CNN | PS11, PS36, PS41 |

| Model | PS# |
|---|---|
| Logistic Regression | PS20, PS26, PS32, PS35 |
| Support Vector Machine (SVM) | PS20, PS24, PS26, PS39 |
| Random Forest (RF) | PS24, PS26, PS31, PS36, PS37 |
| XGBoost | PS4, PS20, PS24, PS26, PS36 |
| Decision Trees | PS24, PS39, PS35 |
| K-Nearest Neighbors (KNN) | PS20, PS26 |
| Naive Bayes | PS20, PS24, PS26, PS36, PS35 |
| Neural Networks (ANN/MLP/CNN) | PS24, PS20, PS26, PS31, PS32, PS34, PS36, PS38, PS39, PS35, PS40 |
| Linear Discriminant Analysis | PS1, PS39 |
| Extra Tree | PS26 |
| AdaBoost | PS26 |
| CATBoost | PS36 |
| Bayesian Network | PS35 |
| Variational Autoencoder (VAE) | PS38 |

| Models | PS# |
|---|---|
| Logistic Regression | PS6, PS10, PS14, PS22, PS23, PS28, PS33, PS13 |
| Random Forest (RF) | PS3, PS6, PS22, PS23, PS28, PS33, PS13 |
| XGBoost | PS22, PS23, PS28, PS33, PS13 |
| Support Vector Machine (SVM) | PS6, PS22, PS28, PS33 |
| Decision Trees | PS3, PS22, PS23, PS33 |
| Gradient Boosting Machine | PS6, PS23, PS22 |
| Multi-Layer Perceptron (MLP) | PS6, PS33 |
| LightGBM | PS13 |
| Parametric Survival Models | PS9, PS17 |
| Bayesian Network | PS22 |
| LASSO | PS28 |
| Fuzzy Inference System | PS3 |
| Stochastic Gradient Descent | PS22 |
| J48/AdaBoost/Rotation Forest | PS22 |

| Models | PS# |
|---|---|
| Random Survival Forest | PS2, PS16 |
| Cox Proportional Hazards | PS10, PS15, PS16 |
| DeepSurv | PS15, PS21 |
| Logistic Regression | PS8, PS12, PS29, PS13, PS10 |
| Support Vector Machine (SVM) | PS5, PS8, PS12, PS27, PS41 |
| Random Forest (RF) | PS8, PS12, PS13 |
| XGBoost | PS8, PS13 |
| Decision Trees | PS5, PS7, PS8, PS12, PS27, PS13 |
| Neural Networks (Deep Learning) | PS5, PS25, PS29, PS41, PS10 |
| Gradient Boosting Machine | PS12, PS27, PS41 |
| Linear Regression | PS27, PS41 |
| K-Nearest Neighbors (KNN) | PS7, PS8, PS12 |
| Naive Bayes | PS7, PS12 |
| REPTree | PS7 |
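
Survival models such as those listed above are commonly compared with the concordance index (C-index). The following is a minimal sketch of Harrell's C-index for right-censored data in plain Python. The function name `concordance_index` and the toy follow-up times, event flags, and risk scores are illustrative assumptions, not values from any reviewed study. Pairs with tied follow-up times are skipped for simplicity.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[int],
    risks: Sequence[float],
) -> float:
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] == 1) strictly before subject j's follow-up time.
    The pair is concordant when the model gives i the higher risk score;
    tied risk scores count as half concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (illustrative only): months of follow-up, event flag, predicted risk.
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.6, 0.7, 0.2]
c_index = concordance_index(times, events, risks)
print(c_index)  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of survival times by predicted risk.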

| Validation Method | PS# |
|---|---|
| Holdout (Train-Test Split) | PS7, PS12, PS16, PS19, PS23, PS26, PS36 |
| Cross-Validation | PS5, PS26 (10-fold), PS27 (10-fold) |
| Temporal Validation | PS32 |
| External Validation | PS4 |
| Case-Control Sampling | PS32 |
| Stratified Sampling | PS12 |
| Not Specified | PS8, PS10, PS3, PS9, PS14, PS15, PS17, PS18, PS20, PS21, PS22, PS24, PS25, PS28, PS29, PS30, PS31, PS33, PS34, PS35, PS37, PS38, PS39, PS40, PS41, PS42, PS1, PS2, PS6, PS11, PS13 |
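
The two validation schemes reported most often in the table above are the holdout split and k-fold cross-validation. As a minimal, library-free sketch of how the two differ, the code below produces index splits for both. The function names, the 80/20 split fraction, and the fixed random seed are assumptions chosen for illustration and are not taken from the reviewed studies.

```python
import random
from typing import Iterator, List, Tuple


def holdout_split(
    n: int, test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle indices 0..n-1 once and reserve a fixed fraction for testing."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def k_fold_indices(
    n: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists so that every sample is tested exactly once."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


train_idx, test_idx = holdout_split(100)
print(len(train_idx), len(test_idx))  # 80 20

for fold_no, (train, test) in enumerate(k_fold_indices(25, k=10), start=1):
    print(fold_no, len(train), len(test))
```

Holdout gives a single performance estimate that depends on one random split, whereas k-fold averages over k splits at k times the training cost.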

| Performance Metric | PS# |
|---|---|
| Accuracy (Acc) | PS8, PS21, PS24, PS25, PS26, PS30, PS13, PS3, PS19, PS20, PS31, PS22, PS18, PS29, PS36, PS38, PS33 |
| AUC/ROC/AUROC | PS21, PS24, PS26, PS30, PS13, PS20, PS31, PS12, PS23, PS14, PS9, PS5, PS4, PS22, PS11, PS38, PS1 |
| Sensitivity (Recall/TPR) | PS8, PS21, PS24, PS26, PS30, PS13, PS3, PS19, PS20, PS31, PS12, PS23, PS11, PS18, PS36, PS37, PS33 |
| Specificity (TNR) | PS8, PS21, PS24, PS26, PS13, PS3, PS19, PS20, PS23, PS14, PS11, PS18, PS37, PS38, PS33 |
| Precision (PPV) | PS26, PS30, PS13, PS20, PS22, PS12, PS23, PS14, PS32, PS11, PS38, PS39, PS33 |
| F1-Score | PS8, PS26, PS13, PS20, PS22, PS23, PS32, PS38, PS33, PS39 |
| C-index (Concordance Index) | PS21, PS13, PS15, PS17 |
| Calibration Metrics | PS13 (calibration curves, slope), PS23 (Brier score, calibration slope), PS16 (Poisson slopes) |
| Hazard Ratios (HR) | PS10, PS28, PS15 |
| MCC (Matthews Correlation) | PS20, PS19, PS23 |
| NPV | PS20, PS23, PS14, PS11, PS32 |
| MAE/RMSE | PS7, PS27, PS18, PS42 |
| Survival-Specific Metrics | PS10 (survival probabilities), PS15 (Kaplan-Meier, log-rank), PS16 (time-dependent AUROC) |
| Specialized Metrics | PS34 (mAP), PS29 (Judgment Ratio), PS42 (VDT), PS37 (class-specific sensitivity/specificity) |
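
Most of the classification metrics in the table above derive from the four cells of a binary confusion matrix. The sketch below computes them with their standard definitions. The function name `classification_metrics` and the example counts are hypothetical and do not reproduce results from any reviewed study.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute common binary-classification metrics from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Return NaN rather than raising when a denominator is zero.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # recall / true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),     # positive predictive value
        "npv": ratio(tn, tn + fn),           # negative predictive value
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts for a 100-patient test set.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can be misleading on imbalanced cohorts, which is one reason studies also report sensitivity, specificity, and MCC.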

| Technique Category | Methods | PS# |
|---|---|---|
| Linear Methods | Principal Component Analysis (PCA) | PS33, PS37, PS39, PS1 |
| Non-Linear Methods | t-SNE (t-distributed Stochastic Neighbor Embedding) | PS11 (t-SNE), PS37 (t-SNE) |
| Regularization | LASSO (implicit via feature selection) | PS24, PS31 |
| Encoding/Scaling | One-Hot Encoding, Z-Score Normalization, Min-Max Scaling | PS2 (one-hot, Z-score), PS12 (robust scaling), PS36 (MinMaxScaler) |
| Attention/Transformer | Position embeddings, masking for missing data, attention mechanisms | PS32 (transformer attention), PS2 (masking), PS34 (CBAM attention) |
| Feature Grouping | Code/feature grouping (hierarchical or clinical relevance) | PS32 (Read codes → 450 groups), PS28 (33k → 346 → 118 features) |
| Autoencoders | Variational Autoencoder (VAE) | PS38 |
| Domain-Specific | Image normalization, Hounsfield Unit scaling, ablation studies | PS31 (resizing/normalization), PS42 (Hounsfield Units), PS42 (randomization-based ablation) |
| Not Specified | – | PS8, PS21, PS26, PS30, PS10, PS3, PS19, PS20, PS23, PS15, PS18, PS29, PS5, PS6, PS4, PS35, PS40 |

| Ensemble Technique | PS# |
|---|---|
| Random Forest | PS8, PS24, PS13, PS3, PS12, PS23, PS22, PS33, PS37 |
| XGBoost (Extreme Gradient Boosting) | PS24, PS26, PS30, PS13, PS28, PS23, PS4, PS33, PS40 |
| AdaBoost | PS26 (Ensemble_1), PS22 |
| LightGBM | PS13 |
| Voting Classifier | PS26 (Ensemble_2) |
| Weighted Average | PS31 (Late Fusion), PS16 (Meta-model), PS27 (Custom Ensemble) |
| Rotation Forest | PS22 |
| Hybrid Models | PS36 (ConvXGB), PS39 (DT-SVM) |
| Stacking | PS41 |
| Custom Ensemble | PS27 (weighted sum), PS41 (stacking) |

| PS# | Best Performance (Accuracy/Metric) | Best Model Name |
|---|---|---|
| PS8 | 90.75% | XGB |
| PS21 | 0.7181 (C-Index) | DeepSurv |
| PS24 | 0.81 (AUC) | GLM |
| PS26 | 92.86% | K-Nearest Neighbors |
| PS30 | 97.5% | XGBoost |
| PS10 | N/A | – |
| PS13 | 85% (60th Month) | Nomogram (overall performance) |
| PS3 | 93.33% | Lung Cancer Prediction Tool (LCPT) |
| PS25 | 100% | Multi-class Neural Networks |
| PS20 | 94.6% | Support Vector Machine (SVM) |
| PS31 | 0.8021 (AUC) | Full Intermediate Fusion (FIF) |
| PS19 | 89.9% (test) | MLP 9:17-7-1:1 architecture |
| PS32 | 0.924 (AUROC) | MedAlbert + LRC |
| PS28 | 0.881 (AUC) | XGBoost |
| PS7 | 98.78% | Naive Bayes + SSA |
| PS16 | 0.7–0.8 (AUROC) | Meta-model |
| PS12 | 98% | MLP with ANOVA F-value feature selection |
| PS23 | 0.82 (AUC) | Gradient Boosting Machine |
| PS15 | 0.74 (C statistics) | DeepSurv |
| PS27 | 15.30 (lowest RMSE) | Custom Ensemble |
| PS22 | 97.1% | Rotation Forest (RotF) |
| PS11 | 0.902 (AUC) | CNN model |
| PS18 | 99.866 | Decision Trees + k-means clustering |
| PS29 | 87% | ANN (vs. logistic regression) |
| PS14 | 81.0% (AUC) | Model B |
| PS17 | 0.793 (c-index) | Risk prediction model |
| PS9 | 0.843 (AUC) | Weibull hazard model (smoking info) |
| PS2 | 80.72 (c-index) | SHAP (for interpretability) |
| PS5 | 0.837 (AUC) | Decision Trees |
| PS6 | 0.75 (PR-AUC), 0.80 (ROC-AUC) | Model with SHAP for feature ranking |
| PS4 | 0.75 | XGBoost (via feature importance) |
| PS33 | 99% | XGBoost with SMOTE+ENN+PCA |
| PS42 | 0.99 mm (MAD: for 1-year) | Tabular data model with class weights |
| PS34 | 99.63% (LUNA-16) | Improved CenterNet (ResNet-34 + CBAM) |
| PS37 | 100% (Sensitivity) | Random Forest |
| PS36 | 97.43% | ConvXGB model |
| PS40 | 93.4% | MRRXGBDC technique |
| PS39 | 98.21% | MLP-NN with GA-CFS feature selection |
| PS38 | 95% (0.950) | Variational Autoencoder (VAE) |
| PS41 | 71.18% (classification accuracy) | ANN for classification |
| PS35 | 95.79% | NB |
| PS1 | 0.846 (AUC – Ex-Smoker Female) | CART & DFA |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).