1. Introduction
Lung cancer remains a leading cause of global cancer mortality, underscoring the critical need for advanced methods to predict its onset, identify its presence, and forecast patient outcomes (Siegel et al., 2023). The accurate prediction of lung cancer, encompassing diagnosis, risk assessment, and survival analysis, is a cornerstone for improving survival rates through timely screening and personalized intervention strategies. Clinical practice has long relied on traditional statistical models, such as the PLCOM2012 (Tammemägi et al., 2013), Bach (Bach et al., 2003), and LCRAT (Lung Cancer Risk Assessment Tool) (Katki et al., 2016) models, which utilize risk factors like age and smoking history to stratify populations and guide screening protocols. While these models provide a solid foundation, the emergence of machine learning (ML) offers a paradigm shift, enabling the capture of complex, non-linear patterns within high-dimensional data that traditional methods may overlook.
The proliferation of Electronic Health Records (EHRs) and large-scale clinical registries has created vast repositories of structured, tabular data, including demographics, clinical parameters, and treatment histories. The application of ML to these structured datasets presents a significant opportunity to develop more accurate and personalized predictive models for the entire lung cancer care continuum (Dritsas and Trigka, 2022). However, the existing body of literature is fragmented. Previous review papers, such as those by Mamun et al. (2022), Altuhaifa et al. (2023), and Juang et al. (2025), have provided valuable but narrow insights, often focusing on a single prediction task or including studies that are not exclusively based on ML or structured data. This fragmentation leaves a clear gap for a comprehensive synthesis that systematically analyzes ML applications across all major prediction tasks using structured data.
To address this gap, this systematic literature review (SLR) was conducted to provide a holistic overview of machine learning techniques for lung cancer prediction using structured clinical data. The specific contributions of this review are delineated as follows:
Comprehensive Task Coverage: Unlike previous reviews, this SLR simultaneously investigates three critical prediction tasks—diagnosis, risk assessment, and survival analysis—providing a unified perspective on the ML landscape.
Exclusive Focus on Structured Data: A strict focus is maintained on models built primarily on structured, tabular data, offering a dedicated resource for this data modality and distinguishing this work from reviews centered on medical imaging.
Rigorous and Transparent Methodology: The review adheres to the PRISMA 2020 guidelines (Page et al., 2021), employing a systematic search across five major databases with tailored search strings, resulting in a robust synthesis of 42 high-quality, peer-reviewed studies.
Task-Oriented Synthesis: A novel, task-categorized analysis of the literature is presented, detailing the specific datasets, feature sets, and ML models that are most effective for diagnosis, risk, and survival prediction, respectively.
Critical Analysis and Future Directions: Beyond summarization, a critical appraisal of current methodologies is provided, highlighting performance trends, identifying common pitfalls such as the lack of external validation, and outlining clear pathways for future research.
Scope of this review: This review examines peer-reviewed studies that develop or evaluate machine learning models for lung cancer prediction using structured (tabular) data. The included studies are focused on three distinct tasks: (1) Diagnosis prediction – identifying the presence of lung cancer; (2) Risk prediction – estimating an individual’s future likelihood of developing lung cancer; and (3) Survival prediction – forecasting patient outcomes such as survival time or mortality. Studies that incorporate imaging-derived or genomic features are included only if these features are integrated into a structured, tabular format alongside clinical data.
The remainder of this paper is organized as follows. Section 2 reviews related works and prior reviews. Section 3 details the SLR methodology. Section 4 presents the results structured by the research questions. Section 5 provides a synthesized performance analysis. Section 6 discusses the overarching findings, limitations, and implications. Section 7 outlines the limitations of this review and proposes future work. Finally, Section 8 concludes the paper.
2. Related Works
The application of machine learning to structured clinical data for lung cancer prediction has gained substantial momentum in recent years. Primary studies have demonstrated promising results across various prediction tasks. For instance, Chandran et al. (2023) developed ML models using electronic health records to identify high-risk patients, achieving an AUC of 0.76 with LASSO logistic regression. Similarly, Wu et al. (2024) compared six ML models for predicting lymph node metastasis in NSCLC, finding that Generalized Linear Models performed best (AUC = 0.810). Maurya et al. (2024) conducted a comparative analysis of twelve algorithms, reporting that K-Nearest Neighbors achieved 92.86% accuracy for lung cancer prediction. These studies collectively highlight the potential of ML approaches while revealing substantial heterogeneity in methodologies, datasets, and evaluation practices.
Several systematic reviews have attempted to synthesize this growing body of literature, though each exhibits specific limitations in scope or methodology. Mamun et al. (2022) focused on early-stage lung cancer prediction but included only ten studies, providing insufficient coverage of the field; their review also lacked transparency regarding dataset availability and search methodology. Altuhaifa et al. (2023) exclusively addressed survival prediction using clinical data, identifying Random Forest as the predominant model across 30 studies. However, their narrow focus on survival outcomes and use of only three databases limited the review’s comprehensiveness. Juang et al. (2025) conducted a broad examination of risk prediction models across Western and Asian populations but included numerous non-ML approaches and traditional statistical models, making it difficult to assess the specific contributions of machine learning techniques.
Additional reviews have addressed adjacent topics while leaving the core focus unexplored. For example, Zhuan et al. (2023) systematically reviewed machine learning for radiation pneumonitis prediction in lung cancer patients, demonstrating excellent diagnostic performance (AUC = 0.93) but focusing exclusively on this treatment-related complication rather than core prediction tasks. Other reviews have emphasized imaging-based approaches or combined multiple data modalities without specifically addressing the unique challenges and opportunities presented by structured data analysis.
This systematic literature review addresses several critical gaps in the existing review landscape. Unlike previous works, it provides comprehensive coverage across all three major prediction tasks—diagnosis, risk assessment, and survival analysis—while maintaining an exclusive focus on structured data approaches. The review employs a rigorous methodology following PRISMA guidelines, with systematic searches across five major databases and transparent reporting of inclusion criteria. By synthesizing 42 high-quality studies, it offers robust insights into optimal datasets, feature sets, model architectures, and validation practices specific to structured data applications. Furthermore, it provides critical analysis of methodological limitations and future directions, aspects often underemphasized in previous reviews. This comprehensive approach enables a holistic understanding of the current state and future potential of ML for lung cancer prediction using structured clinical data.
4. Results
RQ1: What is the status of machine learning models for the prediction of lung cancer from tabular datasets over the years?
Figure 2 shows the distribution of papers relevant to lung cancer prediction published over the last 5-7 years. Output was low in 2018 and 2019, at 2 papers per year. An upward trend followed, with counts doubling to 4 papers in 2020, then rising to 6 in 2021, 8 in 2022, and 14 in 2024. This signifies growing interest in, and advancement of, machine learning models with tabular datasets for lung cancer prediction.
The projected increase to 16 or more papers further underscores the growing emphasis on this field, suggesting continued innovation and prioritization of lung cancer prediction in academic and clinical research.
RQ2: What are the key features (attributes) used in lung cancer prediction models?
Table 6 shows that the key attributes used in lung cancer prediction models span diverse categories, reflecting a multifaceted approach to lung cancer diagnosis. Demographic features and clinical parameters are the most frequently used, appearing in 22 and 20 studies, respectively. Lifestyle factors and symptoms are moderately represented, each appearing in 8 studies, while treatment modalities and survival/outcome variables are included in 10 and 9 studies, respectively. Molecular/genetic biomarkers and imaging/diagnostic features are less common but emerging, and social/economic factors are the least represented, highlighting a potential gap in socioeconomic data integration.
RQ3: Which tabular datasets are frequently used in lung cancer prediction?
3.1. Which datasets are used for lung cancer diagnosis prediction?
RQ3 reveals a diverse array of tabular datasets used for lung cancer diagnosis prediction, summarized in Table 7. Population-based/registry datasets such as SEER, the Danish Registers, EPIC, JPHC, and large oncologic databases provide extensive data for lung cancer diagnosis prediction, while clinical and EHR datasets like Flatiron Health EHR, Maine HIE, WSIC, Taiwan NHIRD, CLARO, and CLARIFY provide detailed clinical patient records. In addition, imaging and radiology datasets such as NLST & LIDC, ACRIN NSCLC-DICOM-PET, and datasets from LUNA-16 and Kaggle contribute significantly to lung cancer diagnosis prediction. Genomic and multi-omics datasets like TCGA, GTEx, and SLCG provide molecular-level insights, while datasets from specialized sources or sensors, such as Synthea, eNose, Pesticide & Lung cancer, Blood nanosensor, Multi-cancer, and Multi-modal, offer innovative approaches to lung cancer diagnosis prediction.
3.2. Which datasets are used for lung cancer risk prediction?
Table 8 categorizes the datasets used for lung cancer risk prediction. Population-based/Registry datasets, such as the Danish National Registers, EPIC (European Prospective Investigation into Cancer and Nutrition), and Japan Public Health Center Study Cohort (JPHC), are widely used due to their large-scale, long-term follow-up nature. These datasets often include extensive demographic, lifestyle, and medical history data, making them valuable for studying lung cancer risk factors at a population level.
Similarly, Clinical/EHR (Electronic Health Record) data sources, such as the Taiwan National Health Insurance Research Database (NHIRD) and the Lung Cancer Prediction Dataset, provide real-world medical records, including diagnostic codes, prescriptions, and treatment histories. Such datasets enable machine learning models to predict lung cancer risk based on patient-specific clinical features.
In contrast, Synthetic/Sensor-Based datasets, such as Synthea Synthetic Patient Data, are less commonly used in lung cancer risk prediction. This suggests a research gap where more work is needed to explore how sensor-based data (e.g., wearable devices, environmental exposure tracking) and synthetic patient simulations can contribute to early detection and risk modeling.
3.3. Which datasets are used for lung cancer survival prediction?
Table 9 describes the datasets used for Lung Cancer Survival Prediction, categorized by data type and source. The inclusion of diverse dataset types reflects the review’s scope of including studies where data is ultimately used in a structured, tabular format for machine learning analysis.
Population-Based/Registry datasets, such as SEER (Surveillance, Epidemiology, and End Results) and Large Oncologic Databases, are widely used due to their extensive collection of long-term survival outcomes, patient demographics, and treatment histories. These datasets represent the core of structured data approaches, enabling large-scale epidemiological studies on lung cancer survival rates through their comprehensive tabular format.
Clinical/EHR datasets, including CLARIFY, NHIRD, a retrospective cohort from CUCH, Flatiron Health EHR, CLARO, and the Wisconsin Prognostic Lung Cancer Subdirectory, provide real-world patient records with treatment details, laboratory results, and disease progression. These datasets are crucial for developing personalized survival prediction models and align perfectly with the structured data focus of this review.
Imaging and Radiology datasets, such as ACRIN, NSCLC-DICOM-PET, warrant specific justification for inclusion. While these originate from medical imaging scans (e.g., CT, PET scans), they are included because the studies that use them typically extract quantitative, structured features (radiomics) such as texture, shape, and intensity metrics. These extracted features are then organized in tabular format alongside clinical variables, making them compatible with traditional machine learning algorithms designed for structured data.
Genomic and Multi-Omics datasets, such as TCGA (The Cancer Genome Atlas) and SLCG, provide molecular-level insights including gene expression, mutations, and epigenetic alterations. These datasets are included because they represent structured biological data that can be integrated with clinical variables in tabular format. This facilitates precision oncology approaches by linking genetic variations to lung cancer survival through machine learning models that operate on structured input.
Synthetic/Sensor-Based/Specialized datasets, such as the Pesticide & Lung Cancer dataset, are less common but provide valuable insights into environmental and occupational exposure-related risk factors. These datasets are included when the sensor data or synthetic features are represented in a structured format suitable for machine learning analysis, highlighting the need for further research on non-genetic influences on survival.
RQ4: What are the preprocessing techniques used to feed the data into the models?
Table 10 categorizes the key preprocessing techniques applied in the selected studies, illustrating a comprehensive approach to preparing data for modeling. The table identifies several essential methods: encoding and transformation, normalization/standardization, missing data handling, data cleaning and filtering, data splitting and validation, and domain-specific preprocessing.
In many studies, encoding and transformation techniques are employed to convert categorical data into numerical representations, which is essential for enabling machine learning algorithms to process diverse datasets effectively. Normalization and standardization techniques are similarly crucial; they scale the data so that each feature contributes equally during model training, improving stability and convergence. These steps are particularly important when dealing with variables that differ significantly in scale.
Handling missing data is another critical preprocessing step. The studies referenced in the table apply various strategies—such as imputation or exclusion—to mitigate the effects of incomplete data, thereby reducing bias and enhancing model reliability. Equally important are data cleaning and filtering processes that remove noise, errors, and outliers, ensuring that the input data is as accurate and consistent as possible.
The table also highlights the use of data splitting and validation, where datasets are partitioned into training, validation, and testing sets. This practice is fundamental for assessing model performance and ensuring that models generalize well to new, unseen data. Lastly, domain-specific preprocessing steps have been introduced in several studies to tailor the data preparation process to the unique characteristics of the specific dataset or application domain. These customized approaches utilize expert knowledge to extract nuanced patterns that might be overlooked by more generic preprocessing techniques.
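To make these steps concrete, the sketch below assembles a typical pipeline with scikit-learn: median/mode imputation, z-score standardization, one-hot encoding, and a stratified train/test split. It is a minimal illustration rather than a pipeline from any reviewed study; the column names and the `prepare` helper are hypothetical.

```python
# Minimal sketch of the preprocessing steps summarized in Table 10, using
# scikit-learn. Column names ("age", "smoking_status", ...) are hypothetical
# placeholders, not taken from any reviewed study.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "pack_years", "tumor_size_mm"]       # assumed features
categorical_cols = ["sex", "smoking_status", "histology"]   # assumed features

# Numeric branch: impute missing values, then standardize (z-score).
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: impute, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

def prepare(df: pd.DataFrame, target: str = "lung_cancer"):
    """Split first, then fit the transformer on training data only,
    so no information leaks from the test set into scaling/imputation."""
    X, y = df.drop(columns=[target]), df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    return preprocess.fit_transform(X_tr), preprocess.transform(X_te), y_tr, y_te
```

Fitting the transformer only on the training split, as in `prepare`, is what ties the imputation/scaling steps to the data-splitting practice discussed above.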
In summary, Table 10 illustrates a multifaceted preprocessing pipeline that enhances model performance by addressing different aspects of data quality and consistency. The careful combination of these techniques—from standard transformations and scaling to specialized, domain-driven adjustments—ensures that the data is robust, well-structured, and ready for effective model training. This comprehensive approach not only improves predictive accuracy but also sets the stage for future advancements through the standardization and automation of preprocessing pipelines.
RQ5: What are the most commonly used machine learning algorithms for lung cancer prediction on tabular datasets?
Figure 3 shows that Gradient Boosting, Artificial Neural Networks, Logistic Regression, and Naive Bayes are the most frequently used ML models, whereas Transformer-based models, Voting Classifiers, and Generalized Linear Models are the least used for lung cancer prediction.
Table 11 provides a detailed overview of the machine learning models employed across selected studies for lung cancer prediction using tabular data. The table shows a wide range of algorithms, from classical statistical methods to advanced deep learning architectures, highlighting the diversity of approaches in the field.

Many studies have favored probabilistic and ensemble-based approaches, with Naïve Bayes, Decision Trees, Random Forests, and Logistic Regression being among the most frequently applied methods. Naïve Bayes and Decision Trees are commonly used for their simplicity and interpretability, making them suitable for preliminary analysis and feature importance assessment. Random Forests and Gradient Boosting methods, on the other hand, are extensively adopted due to their robustness in handling high-dimensional data and their ability to mitigate overfitting through ensemble techniques. Support Vector Machines, k-Nearest Neighbors, and Artificial Neural Networks are also popular choices, often chosen for their flexibility in modeling complex, non-linear relationships. Logistic Regression remains a staple in the predictive modeling domain due to its efficiency and the straightforward interpretability of its coefficients.

Additionally, specialized models such as the Kaplan-Meier estimator, Survival Models, and Fuzzy Inference Systems indicate that some studies have focused on time-to-event data or uncertainty handling, further emphasizing the domain-specific challenges of lung cancer prediction. Emerging techniques, including deep learning approaches like CNNs and Transformer-based models, along with Graph Machine Learning and Variational Autoencoders (VAEs), signal a growing interest in leveraging advanced algorithms to capture intricate patterns within the data. The inclusion of models such as Linear Regression, K-Means Clustering, and LDA also points to exploratory analyses and dimensionality reduction efforts that precede more complex modeling.

Overall, Table 11 underscores the comprehensive utilization of a broad spectrum of machine learning techniques in lung cancer prediction. The selection of these models reflects a balance between traditional methods, which offer interpretability and simplicity, and modern, sophisticated algorithms, which promise enhanced predictive performance and the ability to model complex relationships in the data. This diversity not only caters to the varying nature of tabular data but also paves the way for future research aimed at integrating multiple approaches to improve model robustness and accuracy.
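The comparative pattern underlying Table 11, benchmarking several learners under one common protocol, can be expressed compactly. The sketch below is a minimal, hypothetical example using scikit-learn on synthetic data; it does not reproduce any reviewed study's setup.

```python
# Minimal sketch of the model-comparison pattern behind Table 11:
# several classical learners evaluated under identical cross-validation
# on a synthetic stand-in dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=400, n_features=15, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
for name, clf in models.items():
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```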
5.1. Which models are commonly used for diagnosis prediction?
Table 12 provides an overview of the machine learning models that have been commonly employed for diagnosis prediction across various studies. The table reflects a diverse selection of models ranging from traditional statistical approaches to more advanced ensemble and deep learning methods.
Logistic Regression is frequently used, as evidenced by its presence in several studies (PS20, PS26, PS32, PS35). Its popularity can be attributed to its efficiency and the clear interpretability of its coefficients, which are particularly valuable in a clinical context. Support Vector Machines (SVM) also appear prominently (PS20, PS24, PS26, PS39), offering robust classification through margin-based separation that helps in distinguishing between diagnostic categories effectively.
Ensemble methods such as Random Forest (PS24, PS26, PS31, PS36, PS37) and XGBoost (PS4, PS20, PS24, PS26, PS36) are widely applied. These models are favored for their ability to handle complex, non-linear relationships and reduce overfitting through the aggregation of multiple decision trees. Decision Trees (PS24, PS39, PS35) provide intuitive insights into the decision-making process, making them a useful tool for understanding diagnostic criteria.
Other models like k-Nearest Neighbors (PS20, PS26) and Naïve Bayes (PS20, PS24, PS26, PS36, PS35) are also in use, offering simplicity and speed for classification tasks based on proximity or probabilistic assumptions, respectively. Neural Networks, including various architectures such as ANN, MLP, and CNN (PS24, PS20, PS26, PS31, PS32, PS34, PS36, PS38, PS39, PS35, PS40), have gained traction for their ability to capture complex, high-dimensional patterns in diagnostic data.
Additionally, models like Linear Discriminant Analysis (PS1, PS39) serve a dual role in classification and dimensionality reduction. Specialized ensemble techniques, such as Extra Tree (PS26) and AdaBoost (PS26), have been explored to further improve predictive performance. Emerging methods such as CATBoost (PS36), Bayesian Networks (PS35), and Variational Autoencoders (PS38) illustrate the field’s ongoing efforts to integrate more advanced and nuanced algorithms into diagnostic prediction frameworks.
Overall, the distribution of models presented in Table 12 underscores a broad methodological spectrum. This diversity reflects the balance between the need for interpretable, reliable models and the pursuit of higher predictive accuracy through advanced machine learning techniques. The varying approaches also suggest that the choice of model may be influenced by specific dataset characteristics and the unique requirements of each diagnostic task.
5.2. Which models are commonly used for risk prediction?
Table 13 outlines the range of models used for risk prediction in lung cancer, revealing a preference for both traditional and advanced machine learning techniques. Logistic Regression is a staple in these studies, appearing in multiple works (PS6, PS10, PS14, PS22, PS23, PS28, PS33, PS13) due to its straightforward interpretability and efficiency in estimating risk probabilities. Ensemble methods also play a significant role, with Random Forest models frequently adopted (PS3, PS6, PS22, PS23, PS28, PS33, PS13) for their ability to manage non-linear relationships and reduce overfitting through aggregation of decision trees.
XGBoost, another ensemble method known for its gradient boosting framework, is similarly prevalent (PS22, PS23, PS28, PS33, PS13), offering high accuracy and robustness in handling complex data structures. Support Vector Machines are also utilized (PS6, PS22, PS28, PS33) to delineate risk boundaries in high-dimensional spaces, while Decision Trees provide an intuitive means for risk stratification (PS3, PS22, PS23, PS33).
Advanced boosting techniques such as Gradient Boosting Machines (PS6, PS23, PS22) further refine risk predictions by sequentially correcting errors made by previous models. Neural network-based approaches, represented by Multi-Layer Perceptrons (PS6, PS33), add another layer of sophistication by capturing intricate, non-linear patterns in the data. In addition, models like LightGBM (PS13) have been explored for their efficiency and scalability.
The table also indicates the use of specialized methods tailored to specific aspects of risk prediction, such as Parametric Survival Models (PS9, PS17), which incorporate time-to-event data, and Bayesian Networks (PS22) that model probabilistic relationships. Other notable approaches include LASSO (PS28) for feature selection, Fuzzy Inference Systems (PS3) to handle uncertainty, and optimization methods like Stochastic Gradient Descent (PS22). Finally, combinations of ensemble techniques like J48, AdaBoost, and Rotation Forest (PS22) underscore the experimental nature of research in this area.
Overall, the models employed for risk prediction reflect a balance between classic statistical methods and modern machine learning algorithms. This diverse methodological landscape not only accommodates the varied characteristics of clinical and tabular datasets but also supports the development of robust risk prediction frameworks that can be fine-tuned to specific clinical applications.
5.3. Which models are commonly used for survival prediction?
Table 14 provides an overview of the machine learning models that have been commonly employed for survival prediction across various studies. The table illustrates a diverse selection of models, ranging from traditional statistical methods to advanced deep learning techniques, each addressing the unique challenges posed by survival data.
Random Survival Forest appears in studies PS2 and PS16, showcasing its strength in handling censored data and capturing non-linear relationships through ensemble learning. The Cox Proportional Hazards model, featured in PS10, PS15, and PS16, remains a staple in survival analysis due to its interpretability and established framework for estimating hazard ratios.
Deep learning techniques have also made inroads in survival prediction. DeepSurv (PS15, PS21) and Neural Networks (Deep Learning) (PS5, PS25, PS29, PS41, PS10) illustrate the field’s move toward models that can uncover complex, high-dimensional patterns in survival outcomes. More traditional classification methods are still in use, with Logistic Regression (PS8, PS12, PS29, PS13, PS10) and Support Vector Machines (SVM) (PS5, PS8, PS12, PS27, PS41) being applied for their robustness and efficiency.
Additionally, ensemble methods such as Random Forest (PS8, PS12, PS13) and XGBoost (PS8, PS13) are favored for their ability to mitigate overfitting and manage intricate interactions within the data. Decision Trees (PS5, PS7, PS8, PS12, PS27, PS13) offer a clear, interpretable structure for decision-making, while Gradient Boosting Machine (PS12, PS27, PS41) and Linear Regression (PS27, PS41) provide further avenues for predictive modeling with varying degrees of complexity. Complementary models like K-Nearest Neighbors (PS7, PS8, PS12), Naive Bayes (PS7, PS12), and REPTree (PS7) round out the table, each contributing unique advantages in terms of speed, simplicity, or alternative tree-based strategies.
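As a concrete illustration of the survival-analysis baseline, the sketch below fits a Cox Proportional Hazards model with the lifelines package. It is a minimal example on the package's bundled Rossi dataset, which merely stands in for the clinical lung cancer cohorts used in the reviewed studies.

```python
# Minimal sketch of a Cox Proportional Hazards fit with the lifelines
# package, the classical baseline cited above (PS10, PS15, PS16). The
# bundled Rossi recidivism dataset is only an illustrative stand-in.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()  # columns: week (duration), arrest (event), covariates

cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")

cph.print_summary()             # hazard ratios with confidence intervals
print(cph.concordance_index_)   # C-index: discrimination on training data
```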
Overall, the distribution of models in Table 14 reflects a balance between leveraging traditional, interpretable approaches and embracing modern, computationally intensive methods, ensuring that survival prediction frameworks can be tailored to the specific characteristics and requirements of different datasets.
RQ6: What evaluation metrics are used to validate the models’ performance?
6.1. Which validation methods are used to evaluate the models?
Table 15 provides an overview of the various validation methods employed by the selected studies to assess model performance. The table reveals that the most commonly used method is the holdout approach (train-test split), which is applied in several studies (e.g., PS7, PS12, PS16, PS19, PS23, PS26, PS36). This method is popular for its simplicity and ease of implementation, offering a straightforward way to gauge model performance on unseen data. Cross-validation, including specific variants such as 10-fold cross-validation used in studies PS5, PS26, and PS27, is also frequently adopted. This technique provides a more robust performance estimate by repeatedly partitioning the data, thus mitigating the risk of overfitting and ensuring a more reliable assessment.
In addition, a few studies have employed temporal validation (PS32) to account for time-dependent patterns in the data, which is particularly relevant in scenarios where the temporal aspect is critical for predicting outcomes. External validation (PS4) further reinforces the generalizability of the models by testing them on completely independent datasets. Some studies have used case-control sampling (PS32) and stratified sampling (PS12) to ensure that the validation process maintains the inherent structure and distribution of the data.
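The two dominant schemes in Table 15 can be contrasted in a few lines of scikit-learn. The sketch below is illustrative only, using a synthetic dataset in place of study data.

```python
# Minimal sketch contrasting a holdout split with 10-fold cross-validation,
# the two most common schemes in Table 15.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0)

# Holdout: a single train/test partition; simple, but the estimate
# depends on which rows happen to land in the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
print("holdout accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))

# 10-fold CV: every row is tested exactly once across the folds,
# yielding a more stable performance estimate.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"10-fold AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```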
Notably, a significant number of studies did not specify the validation method used (as indicated by the extensive list of PS# where the method is not specified). This lack of detail may point to challenges in reporting standards or variability in experimental setups across different research works.
Overall, the validation strategies documented in Table 15 illustrate a range of approaches—from simple holdout methods to more rigorous cross-validation techniques—underscoring the importance of carefully selecting and clearly reporting validation methods to ensure that model performance is accurately and reliably assessed.
6.2. Which performance metrics have been used most?
Table 16 categorizes the various performance metrics used across the selected studies to evaluate lung cancer prediction models. Among these, accuracy remains the most widely adopted metric (PS8, PS21, PS24, PS25, PS26, PS30, PS13, PS3, PS19, PS20, PS31, PS22, PS18, PS29, PS36, PS38, PS33). This metric provides a straightforward measure of overall model correctness but can be misleading in imbalanced datasets where one class dominates.
To address this limitation, AUC/ROC (Area Under the Receiver Operating Characteristic Curve) is frequently reported (PS21, PS24, PS26, PS30, PS13, PS20, PS31, PS12, PS23, PS14, PS9, PS5, PS4, PS22, PS11, PS38, PS1). This metric is particularly useful for assessing model discrimination capabilities, as it evaluates the trade-off between sensitivity and specificity across different classification thresholds.
Sensitivity (Recall/True Positive Rate) and Specificity (True Negative Rate) are also extensively used (PS8, PS21, PS24, PS26, PS30, PS13, PS3, PS19, PS20, PS31, PS12, PS23, PS11, PS18, PS36, PS37, PS33 for Sensitivity; PS8, PS21, PS24, PS26, PS13, PS3, PS19, PS20, PS23, PS14, PS11, PS18, PS37, PS38, PS33 for Specificity). These metrics provide more granular insights into the model’s ability to correctly identify positive and negative cases, which is particularly important in medical applications where false negatives can have severe consequences.
Precision (Positive Predictive Value) and F1-Score are commonly used to balance false positives and false negatives (PS26, PS30, PS13, PS20, PS22, PS12, PS23, PS14, PS32, PS11, PS38, PS39, PS33 for Precision; PS8, PS26, PS13, PS20, PS22, PS23, PS32, PS38, PS33, PS39 for F1-Score). The F1-score is particularly useful in imbalanced datasets as it considers both precision and recall.
For survival analysis models, the C-index (Concordance Index) is often employed (PS21, PS13, PS15, PS17), as it measures the model’s ability to rank survival times correctly. Additionally, calibration metrics, such as calibration curves and Brier scores, have been used to assess how well predicted probabilities align with actual outcomes (PS13, PS23, PS16). Hazard Ratios (HR) (PS10, PS28, PS15) are another survival-specific metric that helps in understanding the relative risk between groups.
Other specialized metrics include Matthews Correlation Coefficient (MCC) (PS20, PS19, PS23), Negative Predictive Value (NPV) (PS20, PS23, PS14, PS11, PS32), Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) (PS7, PS27, PS18, PS42) for regression models, and survival-specific metrics like Kaplan-Meier curves and time-dependent AUROC (PS10, PS15, PS16).
Beyond these standard metrics, some studies have reported domain-specific measures, such as mean Average Precision (mAP) (PS34), Judgment Ratio (PS29), Volume Doubling Time (VDT) (PS42), and class-specific sensitivity/specificity (PS37), reflecting the unique challenges in evaluating lung cancer prediction models.
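For reference, the most common classification metrics in Table 16 can be derived directly from a confusion matrix, as the minimal sketch below illustrates on toy values (not data from any reviewed study).

```python
# Minimal sketch deriving the common Table 16 metrics from a confusion
# matrix; y_true and y_prob are toy values.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.3, 0.2, 0.8, 0.6, 0.4, 0.9, 0.2, 0.7, 0.3])
y_pred = (y_prob >= 0.5).astype(int)  # default decision threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)  # overall correctness
sensitivity = tp / (tp + fn)                   # recall / true positive rate
specificity = tn / (tn + fp)                   # true negative rate
precision   = tp / (tp + fp)                   # positive predictive value

print(accuracy, sensitivity, specificity, precision)
print("F1:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_prob))   # threshold-independent
```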
Overall, the selection of evaluation metrics varies depending on the study’s goals, with a combination of traditional classification metrics, survival-specific measures, and specialized domain-specific metrics providing a comprehensive assessment of model performance.
RQ7: What feature selection and dimensionality reduction techniques are used to impact the performance of the models?
Feature selection and dimensionality reduction play a crucial role in improving the performance of predictive models by eliminating irrelevant or redundant features, thereby enhancing computational efficiency and generalizability. The selected studies employ a diverse range of techniques to refine feature sets, categorized primarily into filter methods, wrapper methods, embedded methods, clinical/expert-driven selection, statistical/univariate analysis, meta-heuristic approaches, and other techniques.
Filter Methods, which independently assess feature importance based on statistical measures, are commonly used. Techniques such as ANOVA F-value (PS12), Mutual Information (PS31), Spearman’s rank correlation (PS13), and Gain Ratio (PS22) help identify the most informative features without considering the model’s impact. These methods are computationally efficient and scalable but may not capture complex feature interactions.
Wrapper Methods rely on iterative model training to evaluate feature subsets, often leading to more optimal selections at the cost of increased computational expense. LASSO regression (PS24, PS41) and Recursive Feature Elimination (RFE) (PS12, PS39) are widely used, with RFE iteratively removing the least important features based on model performance. These methods are particularly effective for high-dimensional datasets but require careful tuning to avoid overfitting.
Embedded Methods integrate feature selection directly into the learning process, making them computationally efficient while maintaining strong predictive power. Feature importance scores from XGBoost and Random Forest (PS8, PS22, PS28, PS33) and SVM kernel coefficients (PS20) are frequently utilized. Additionally, more specialized embedded techniques like Correlation-Based Feature Selection (CFS), ReliefF, and Cuckoo Search Optimization (CSO) (PS39) have been explored.
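To illustrate how these three families differ in practice, the sketch below applies one representative of each (a mutual-information filter, RFE as a wrapper, and Random Forest importances as an embedded method) using scikit-learn on synthetic data; it is a schematic example, not a reviewed study's pipeline.

```python
# Minimal sketch of the three main selection families discussed above,
# on a synthetic stand-in dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=0)

# Filter: rank features by mutual information, keep the top 10 (cf. PS31).
X_filter = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

# Wrapper: RFE drops the weakest features over repeated model refits
# (cf. PS12, PS39).
X_rfe = RFE(LogisticRegression(max_iter=1000),
            n_features_to_select=10).fit_transform(X, y)

# Embedded: selection falls out of training itself, e.g. Random Forest
# feature importances (cf. PS8, PS22).
importances = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
top10 = importances.argsort()[-10:]
```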
Clinical/Expert-Driven Feature Selection ensures that selected variables are relevant to domain knowledge. Studies employ clinical relevance assessment (PS19), expert-guided grouping (PS32), and clinical experience-based selection (PS21, PS16). These methods enhance interpretability and reliability but may be limited by subjective bias.
Statistical and Univariate Analysis approaches, including logistic regression-based feature selection (PS24), Cochran-Mantel-Haenszel test (PS28), and multivariate analysis (PS23), help in identifying significant predictors through hypothesis testing. These methods are commonly used in conjunction with other selection techniques for robustness.
Meta-Heuristic and Optimization-Based Methods leverage nature-inspired algorithms to explore feature subsets efficiently. The Squirrel Search Algorithm (SSA) (PS7), Genetic Algorithm (GA), and Cuckoo Search (PS39) are examples of heuristic techniques applied to optimize feature selection. While these approaches can provide highly optimized solutions, they are computationally intensive and require fine-tuning.
Other notable methods include Backward Stepwise Selection (PS14), which iteratively removes non-significant predictors, and AIC-Based Selection (PS17), which minimizes model complexity while maintaining explanatory power.
Despite the variety of feature selection techniques applied, several studies (PS10, PS25, PS15, PS27, PS29, PS9, PS35, PS36, PS40) do not explicitly specify their approach, indicating potential gaps in feature engineering documentation. The diversity of methods highlights the trade-offs between computational cost, model interpretability, and predictive performance, underscoring the need for tailored selection strategies based on dataset characteristics and modeling objectives.
Table 17. Feature Selection Techniques Used in Studies.

| Technique Category | Methods | PS# |
| --- | --- | --- |
| Filter Methods | ANOVA F-value, Mutual Information, Spearman’s rank correlation, Gain Ratio | PS12 (ANOVA), PS31 (Mutual Information), PS13 (Spearman), PS22 (Gain Ratio) |
| Wrapper Methods | LASSO regression, Recursive Feature Elimination (RFE) | PS24 (LASSO), PS12 (RFE), PS39 (RFE), PS41 (LASSO) |
| Embedded Methods | XGBoost/RF feature importance, SVM kernel coefficients | PS8 (XGB/RF), PS20 (SVM), PS28 (XGBoost), PS22 (RF), PS33 (ExtraTree), PS39 (CFS, ReliefF, CSO) |
| Clinical/Expert-Driven | Clinical relevance, expert-guided grouping | PS21 (clinical experience), PS19 (clinical relevance), PS32 (clinical-guided mapping), PS16 |
| Statistical/Univariate | Univariate analysis (e.g., logistic regression, Cochran-Mantel-Haenszel) | PS28 (univariate), PS24 (multivariate logistic regression), PS23 (multivariate analysis) |
| Meta-Heuristic/Optimization | Squirrel Search Algorithm (SSA), Genetic Algorithm (GA), Cuckoo Search | PS7 (SSA), PS39 (GA, Cuckoo Search) |
| Other | Backward stepwise selection, AIC-based selection | PS14 (backward stepwise), PS17 (AIC) |
| Not Specified | – | PS10, PS25, PS15, PS27, PS29, PS9, PS35, PS36, PS40 |
Dimensionality reduction is essential in machine learning as it converts high-dimensional data into a more manageable format while retaining important information.
Table 18 summarizes the dimensionality reduction methods used in the selected studies to evaluate model performance. The reviewed studies employ various techniques, categorized into linear methods, non-linear methods, regularization, encoding/scaling, attention-based methods, feature grouping, autoencoders, and domain-specific approaches.

Linear methods reduce the complexity of high-dimensional data by mapping it onto a lower-dimensional space while preserving key variance. Principal Component Analysis (PCA) (PS33, PS37, PS39, PS1) is a commonly used approach that transforms data into a set of orthogonal components ordered by the amount of variance they capture. By minimizing redundancy, PCA improves computational efficiency and retains the most relevant features, making it a valuable tool across various domains.

Non-linear methods are effective for handling complex feature interactions that linear techniques may overlook, making them valuable for high-dimensional datasets. t-Distributed Stochastic Neighbor Embedding (t-SNE) (PS11, PS37) reduces dimensionality while preserving local data structures; it is primarily used for visualization but can also support feature reduction by uncovering patterns and clusters.

Regularization-based methods incorporate feature selection into the learning process by applying penalties to the model’s complexity. LASSO (Least Absolute Shrinkage and Selection Operator) (PS24, PS31) enforces sparsity by penalizing the coefficients of less important features, thereby reducing dimensionality while preserving predictive accuracy.

Encoding and scaling methods transform categorical and numerical data into standardized forms, improving model performance and convergence. Techniques like one-hot encoding and z-score normalization (PS2) are commonly used to preprocess the features; robust scaling (PS12) and min-max scaling (PS36) have also been explored.

Attention and transformer-based methods use attention mechanisms to prioritize relevant features. Position embeddings (PS32) preserve spatial relationships, masking (PS2) handles missing data, and CBAM (Convolutional Block Attention Module) (PS34) improves feature selection by focusing on key features. These methods are commonly applied in transformer models for sequential data.

Feature grouping (PS32, PS28) reduces dimensionality by aggregating related features based on clinical or hierarchical relevance. For example, Read codes are grouped into 450 categories, reducing raw features from 33k to 118. This method simplifies high-dimensional data but relies on domain knowledge for effective grouping.

Variational Autoencoders (VAEs) (PS38) learn compressed representations by encoding input data into a lower-dimensional latent space and reconstructing it. This reduces dimensionality while preserving key variations, making them effective for tasks like image processing and unsupervised learning.

Finally, domain-specific dimensionality reduction techniques refine feature sets for specific domains, such as medical imaging. Methods like image normalization (PS31), Hounsfield unit scaling (PS42), and ablation studies (PS42) help preprocess and enhance data by resizing, normalizing, and removing features, improving model performance and interpretability.
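As a concrete illustration, the sketch below applies the two most frequently reported reduction styles, PCA as a linear projection and LASSO-style L1 regularization, using scikit-learn on synthetic stand-in data; parameter choices are illustrative rather than drawn from the reviewed studies.

```python
# Minimal sketch of two reduction styles from Table 18: PCA (linear
# projection) and L1-regularized sparsity, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=8, random_state=0)
X_std = StandardScaler().fit_transform(X)  # PCA assumes comparable scales

# Linear: keep enough orthogonal components to explain 95% of variance.
pca = PCA(n_components=0.95).fit(X_std)
X_reduced = pca.transform(X_std)
print("components kept:", pca.n_components_)

# Regularization: an L1 penalty drives uninformative coefficients to zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_std, y)
print("non-zero features:", (lasso.coef_ != 0).sum())
```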
Although a variety of dimensionality reduction techniques have been used, several studies (PS8, PS21, PS26, PS30, PS10, PS3, PS19, PS20, PS23, PS15, PS18, PS29, PS5, PS6, PS4, PS35, PS40) do not explicitly mention dimensionality reduction methods, highlighting potential gaps in feature engineering documentation.
The analysis reveals notable patterns in method selection across prediction tasks. For diagnosis prediction, clinical/expert-driven feature selection was predominant, reflecting the need for clinically interpretable models. In contrast, risk prediction models favored statistical and filter methods suitable for epidemiological datasets, while survival prediction utilized embedded methods compatible with time-to-event data. The substantial number of studies (approximately 30%) that did not specify their feature engineering approach represents a significant reproducibility concern in the literature.
RQ8: What ensemble techniques are performed to boost the performance over the traditional ML models?
Ensemble learning techniques are widely employed to improve the predictive accuracy and robustness of traditional machine learning models. By combining multiple classifiers, ensemble methods help mitigate individual model weaknesses, reduce variance, and enhance generalization.
The relative frequency of different ensemble techniques used across the studies is summarized in Figure 4. The graph clearly shows that XGBoost is the most predominantly employed ensemble method, underscoring its reputation for high performance and efficiency in the domain. Random Forest also demonstrates significant adoption, reflecting its robustness and effectiveness as a go-to bagging algorithm. Other boosting variants like AdaBoost and LightGBM see moderate but important usage. In contrast, more complex techniques such as Stacking, Voting, and Hybrid Models are less frequently applied, likely due to their increased computational complexity and implementation overhead, though they offer unique performance advantages as discussed in the subsequent sections. This distribution highlights a strong preference for powerful, yet relatively straightforward, boosting and bagging algorithms in the current literature on lung cancer prediction.
The studies listed in Table 19 employ various ensemble methods, including bagging, boosting, stacking, and hybrid models, to enhance lung cancer prediction.

Bagging-based approaches such as Random Forest (PS8, PS24, PS13, PS3, PS12, PS23, PS22, PS33, PS37) are among the most frequently applied ensemble techniques. Random Forest, which aggregates predictions from multiple decision trees, is particularly effective in handling high-dimensional data and reducing overfitting. Similarly, Rotation Forest (PS22), a variant of bagging that applies feature transformation to subsets of data before training individual classifiers, is also explored.

Boosting methods play a significant role in improving weak learners through iterative refinement. XGBoost (PS24, PS26, PS30, PS13, PS28, PS23, PS4, PS33, PS40), reported in some studies under its full name Extreme Gradient Boosting (PS24, PS40), is one of the most commonly used gradient boosting algorithms, known for its efficiency, high predictive power, and native handling of missing values. Other boosting techniques such as AdaBoost (PS26, PS22) and LightGBM (PS13) have also been utilized, offering alternative approaches to improve weak learners through weighted resampling.

Stacking and hybrid ensemble models provide an additional layer of performance enhancement. Stacking (PS41) combines multiple base learners by training a meta-model on their predictions, thereby leveraging the strengths of diverse algorithms. Hybrid models such as ConvXGB (PS36), which integrates Convolutional Neural Networks (CNN) with XGBoost, and DT-SVM (PS39), which combines Decision Trees and Support Vector Machines, demonstrate the potential of deep learning and classical ML fusion.

Voting and weighted combination strategies are also explored, as sketched below. A Voting Classifier (PS26, Ensemble 2) aggregates predictions from different models using either hard or soft voting mechanisms. Additionally, weighted averaging methods such as Late Fusion (PS31), Meta-model fusion (PS16), and Custom Ensemble techniques (PS27, PS41) offer alternative ways to optimize prediction by assigning different importance levels to individual classifiers.
Ensemble technique selection demonstrated clear task-dependent patterns. For diagnosis prediction, hybrid models like ConvXGB (PS36) that combine deep learning with traditional ensembles showed particular promise in handling complex feature interactions. Risk prediction models heavily favored XGBoost and Random Forest for their interpretability and robustness with traditional risk factors. Survival prediction uniquely employed meta-model approaches (PS16) that could handle censored data and temporal patterns.
The diverse range of ensemble approaches underscores their critical role in boosting predictive performance over traditional machine learning models. While bagging methods like Random Forest ensure robustness, boosting techniques such as XGBoost and AdaBoost provide better generalization. Meanwhile, hybrid and stacking models further refine accuracy by intelligently integrating multiple learning paradigms. These findings highlight the importance of selecting ensemble strategies tailored to specific dataset characteristics and application needs.
RQ9: What techniques are used to interpret the results of the models?
Interpreting machine learning models is crucial for validating their decision-making processes, improving transparency, and ensuring reliability in high-stakes applications such as healthcare. Various techniques are employed to explain model predictions, ranging from feature importance methods to statistical analysis and visualization tools.
The distribution of interpretation techniques, as shown in Figure 5, reveals a clear preference for post-hoc explanation methods over inherently interpretable models. Feature Importance (29%) and Visualization Tools (24%) together account for over half of all techniques used, underscoring the field’s reliance on methods that explain complex, high-performing models after they have been trained. The significant use of Model-Agnostic (10%) and Statistical Analysis (12%) methods further reinforces this trend, highlighting the need for flexible tools that can be applied across different model architectures and provide statistical validation. In contrast, inherently interpretable models themselves are employed in only 2% of cases, suggesting a trade-off where practitioners often prioritize predictive performance and then use post-hoc methods to instill trust. The remaining techniques, including Sensitivity Analysis, Attention Mechanisms, Cluster Analysis, and Example-Based Localization, while each representing a smaller fraction, are critical for specific data types (e.g., images, temporal data) and provide nuanced, complementary insights into model behavior.
Feature importance analysis is widely used to determine which variables most influence the model’s predictions. Methods such as SHAP (Shapley Additive Explanations), LIME (Local Interpretable Model-agnostic Explanations), and Permutation Importance are frequently applied (PS8, PS12, PS23, PS31, PS33, PS36, PS41, PS3, PS24, PS28, PS42, PS19). Additionally, built-in feature importance rankings from tree-based models like Random Forest (RF) and XGBoost provide further interpretability.
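Of these, permutation importance is the simplest to reproduce, as the minimal scikit-learn sketch below shows on synthetic data; SHAP and LIME follow the same post-hoc workflow through their respective packages. This is an illustrative example, not a reviewed study's analysis.

```python
# Minimal sketch of permutation importance, one of the feature-importance
# techniques listed above, on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data; the drop in score
# measures how much the model relies on that feature.
result = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```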
Visualization tools help in understanding model behavior and decision boundaries. Techniques such as Nomograms, Calibration Plots, t-SNE (t-Distributed Stochastic Neighbor Embedding), Kaplan-Meier Curves, Partial Dependence Plots (PDP), and Beeswarm Plots are employed across multiple studies (PS24, PS13, PS15, PS11, PS16, PS23, PS28, PS33, PS41, PS27). These tools enable researchers to analyze variable relationships, class separability, and prediction confidence.
Model-agnostic interpretation methods, including SHAP, LIME, and PDP, are applied in studies (PS12, PS23, PS36, PS41) to provide insights independent of specific model architectures. These techniques explain individual predictions, helping users understand how small changes in input features affect outputs.
Statistical analysis techniques such as Pearson’s Correlation, Heatmap Correlation, Odds Ratios, Cox Regression, and Decision Curve Analysis are employed (PS26, PS28, PS13, PS16, PS27) to validate model relationships and assess the clinical relevance of predicted risk factors.
Sensitivity analysis techniques, including Occlusion Sensitivity and Input Perturbation, are utilized (PS11, PS19) to examine how minor changes in input values influence predictions, revealing feature stability and robustness.
Inherently interpretable models like Decision Trees, Bayesian Networks, and Logistic Regression (PS3, PS18, PS35, PS28) are also used, offering transparency by design. These models provide direct interpretability without requiring additional post-hoc explanations.
Attention mechanisms, particularly Transformer Attention Scores (PS32), help visualize where deep learning models focus during predictions, enhancing interpretability for neural networks.
Cluster analysis techniques, such as k-means applied to pathway representations (PS32), allow researchers to identify natural groupings within data, shedding light on potential subpopulations.
Example-based explanations, such as Case-Based Reasoning (PS10), aid interpretability by linking new predictions to similar past cases, making model decisions more relatable to domain experts.
Localization techniques, such as Visual Bounding Boxes (PS34), are applied to highlight specific image regions contributing to a model’s decision, making deep learning models in medical imaging more interpretable.
Overall, these diverse interpretation methods enhance trust in machine learning models by providing clear explanations of their predictions. While feature importance and statistical methods offer numerical insights, visualization and attention mechanisms provide a more intuitive understanding, making model decisions transparent and actionable for domain experts.
Table 20. Interpretation Techniques Used in Studies.

| Category | Methods | PS# |
| --- | --- | --- |
| Feature Importance | SHAP, LIME, Permutation Importance, RF/XGBoost Feature Importance | PS8, PS12, PS23, PS31, PS33, PS36, PS41, PS3, PS24, PS28, PS42, PS19 |
| Visualization Tools | Nomograms, Calibration Plots, t-SNE, Kaplan-Meier Curves, PDP, Beeswarm | PS24, PS13, PS15, PS11, PS16, PS23, PS28, PS33, PS41, PS27 |
| Model-Agnostic Methods | SHAP, LIME, Partial Dependence Plots (PDP) | PS12, PS23, PS36, PS41 |
| Statistical Analysis | Pearson’s/Heatmap Correlation, Odds Ratios, Cox Regression, Decision Curves | PS26, PS28, PS13, PS16, PS27 |
| Sensitivity Analysis | Occlusion Sensitivity, Input Perturbation | PS11, PS19 |
| Inherently Interpretable Models | Decision Trees, Bayesian Networks, Logistic Regression | PS3, PS18, PS35, PS28 |
| Attention Mechanisms | Transformer Attention Scores | PS32 |
| Cluster Analysis | k-means on Pathway Representations | PS32 |
| Example-Based Explanations | Case-Based Reasoning (e.g., recurrence prediction) | PS10 |
| Localization Techniques | Visual Bounding Boxes (e.g., infected regions) | PS34 |
6. Discussion
This systematic review provides a structured synthesis of machine learning applications for lung cancer prediction using structured data, explicitly covering three distinct tasks—diagnosis, risk assessment, and survival analysis. Our findings reveal a field where the optimal ML approach is highly task-dependent, yet dominated by a consistent set of powerful techniques. Contrary to the broader AI trend, it is evident that while Deep Learning has gained popularity, traditional Machine Learning models, particularly ensemble methods, are preferred for prediction tasks on structured clinical data. Among these, Gradient Boosting (especially XGBoost) and Random Forest emerged as the most prevalent and high-performing models across the board, valued for their robustness with tabular data. Their application, however, is nuanced: for diagnosis, they are often part of complex pipelines with heavy feature engineering; for risk assessment, they provide a powerful upgrade to traditional statistical models; and for survival analysis, specialized ensembles are gaining traction against established benchmarks like Cox regression.
A key finding is the critical importance of task-specific data sources and predictive features. The SEER registry was a cornerstone for diagnosis and survival studies, while population registries (e.g., Danish, EPIC) were pivotal for risk prediction. The most influential predictive features were consistently demographics—with attributes such as Age, Gender, Race, and Smoking History being almost universal—and core clinical parameters like TNM stage and comorbidities. The widespread adoption of explainable AI (XAI) techniques, particularly SHAP and LIME, signifies a maturation in the field, enabling practitioners to understand the reasoning behind model predictions and fostering necessary trust for potential clinical integration.
However, this review also uncovers pressing methodological issues that threaten the translation of these models into clinical practice. The overwhelming reliance on holdout validation and the rarity of external validation cast doubt on the real-world robustness and generalizability of the reported high performances. Furthermore, while XAI tools are widely adopted, their use is often superficial, indicating a need for more standardized reporting and a stronger link between model interpretations and established clinical knowledge. Finally, the reliance on private datasets and frequently incomplete reporting of preprocessing and feature selection steps severely hampers reproducibility and independent verification, representing a significant barrier to scientific progress. By synthesizing evidence across the entire prediction continuum, this review not only maps the current landscape but also provides a foundational roadmap, highlighting that future work must prioritize robust validation, deepened interpretability, and enhanced reproducibility to develop truly clinically actionable ML tools.