Submitted: 23 March 2026
Posted: 25 March 2026
Abstract
Keywords:
1. Introduction

- We introduce MM-Care, a multi-modal deep learning framework for personalized treatment decision support in early-stage non-small-cell lung cancer (NSCLC) that integrates imaging, clinical, and genomic data.
- We propose an adaptive multi-modal fusion network based on Transformer cross-attention, which lets heterogeneous medical features interact and be weighted dynamically. In the ablation study, replacing it with simple concatenation lowers the 3-year OS C-index from 0.78 to 0.76 and the 3-year LC AUC from 0.83 to 0.81.
- We integrate an explainable decision report generation module into MM-Care that gives clinicians visual (Grad-CAM) and quantitative (SHAP value) insight into model predictions, with the aim of fostering trust and supporting patient communication.
2. Related Work
2.1. Multi-Modal Deep Learning for Cancer Prognosis and Decision Support
2.2. AI in Lung Cancer Treatment Optimization and Outcome Prediction
3. Method
3.1. Overall Framework Architecture

1. Multi-Modal Feature Extraction Module: Responsible for processing raw, heterogeneous patient data from each modality (imaging, clinical, genomic) into high-dimensional, semantically rich feature vectors.
2. Adaptive Multi-Modal Fusion Network: A core component designed to facilitate deep interaction and adaptive integration of features across different modalities, generating a unified and comprehensive patient representation.
3. Dual-Task Prognosis Prediction Head: Uses the fused patient representation to simultaneously predict critical prognostic outcomes (overall survival and local control) for both SBRT and surgical treatment arms.
4. Explainable Decision Report Generation Module: An integrated module that provides transparent, interpretable insights into the model’s predictions through visual heatmaps and quantitative feature importance analyses.
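
As a rough illustration of the adaptive fusion idea in item 2, the sketch below applies single-head scaled dot-product cross-attention in which one modality token (the query) attends over all three modality tokens. The 2-dimensional toy vectors, the omission of learned projection matrices, and the names `softmax` and `cross_attend` are assumptions made for illustration only; this is not the MM-Care implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(query, keys, values):
    """Scaled dot-product attention of one query vector over key/value vectors.

    Returns the attention weights and the weighted sum of the values.
    Learned query/key/value projections are omitted for brevity.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, fused

# Hypothetical 2-D modality embeddings (toy numbers, not from the paper).
f_img = [1.0, 0.0]
f_clin = [0.5, 0.5]
f_gen = [0.0, 1.0]
tokens = [f_img, f_clin, f_gen]

# The imaging token queries all modalities. The weights depend on the
# feature values themselves, so they differ from patient to patient.
weights, fused = cross_attend(f_img, tokens, tokens)
print([round(w, 3) for w in weights], [round(x, 3) for x in fused])
```

Because the weights are computed from the features rather than fixed in advance, this kind of mechanism can emphasise different modalities for different patients, which is the property the fusion network is described as exploiting.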
3.2. Multi-Modal Feature Extraction Module
3.2.1. Imaging Feature Extractor
3.2.2. Clinical Feature Encoder
3.2.3. Genomic Feature Embedder
3.3. Adaptive Multi-Modal Fusion Network
3.4. Dual-Task Prognosis Prediction Head
1. Overall Survival (OS) Prediction: The probability of overall survival at specific time points (e.g., 1, 3, and 5 years). This is typically modeled with a fully connected network followed by a survival regression layer or a Cox proportional hazards layer, which outputs probabilities or risk scores. The output is a vector of length k, where k is the number of time points.
2. Local Control (LC) Prediction: The probability of achieving local control (absence of tumor recurrence within the treated area) at specific time points (e.g., 1 and 3 years). This is typically formulated as a binary classification task for each time point, using a fully connected network followed by a sigmoid activation function to output probabilities. The output is a vector of length j, where j is the number of time points.
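
One way the two objectives above could be trained jointly is sketched below: a negative Cox partial log-likelihood on OS risk scores plus a binary cross-entropy on LC probabilities at a single time point. The risk-score form of the OS head, the equal task weighting (`lc_weight=1.0`), the absence of tie handling, and all toy values are assumptions for illustration; the paper's actual loss is not specified here.

```python
import math

def cox_neg_partial_log_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood, averaged over observed events.

    risk:  predicted log-risk score per patient (higher = worse prognosis)
    time:  follow-up time per patient
    event: 1 if death was observed, 0 if censored
    """
    loss, n_events = 0.0, 0
    for i in range(len(risk)):
        if event[i] == 1:
            # Risk set: patients still under observation at time[i].
            at_risk = [risk[j] for j in range(len(risk)) if time[j] >= time[i]]
            log_sum = math.log(sum(math.exp(r) for r in at_risk))
            loss -= risk[i] - log_sum
            n_events += 1
    return loss / max(n_events, 1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(logits, labels):
    """Mean binary cross-entropy after a sigmoid activation."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)

def dual_task_loss(risk, time, event, lc_logits, lc_labels, lc_weight=1.0):
    """Sum of the OS (Cox) loss and the weighted LC (classification) loss."""
    os_loss = cox_neg_partial_log_likelihood(risk, time, event)
    lc_loss = binary_cross_entropy(lc_logits, lc_labels)
    return os_loss + lc_weight * lc_loss

# Toy cohort of three patients (hypothetical values).
time = [1.0, 3.0, 5.0]
event = [1, 1, 0]
lc_logits = [-1.5, 1.0, 2.0]   # logits for local control at one time point
lc_labels = [0, 1, 1]

good = dual_task_loss([2.0, 0.5, -1.0], time, event, lc_logits, lc_labels)
bad = dual_task_loss([-1.0, 0.5, 2.0], time, event, lc_logits, lc_labels)
print(good, bad)
```

Risk scores that rank early deaths highest (`good`) yield a lower loss than the reversed ranking (`bad`), which is the behaviour a Cox-style objective is meant to reward.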
3.5. Explainable Decision Report Generation Module
1. Imaging-based Explanations (Grad-CAM): Using Gradient-weighted Class Activation Mapping (Grad-CAM), we generate saliency maps overlaid on the original CT images. Grad-CAM computes the gradient of the target prediction score (e.g., predicted low 5-year OS for SBRT) with respect to the feature maps of a specific convolutional layer. These gradients are then averaged to obtain neuron importance weights ($\alpha_k^c$) for each feature map $k$ and class $c$. The final localization map is a weighted sum of feature maps, passed through a ReLU to highlight positive contributions:

   $$L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$$

   where $A^k$ denotes the $k$-th feature map of the chosen layer. These heatmaps highlight the tumor regions, spatial patterns, and imaging features that contributed most to the model’s survival and local control predictions, giving clinicians interpretable visual evidence.
2. Feature Importance Analysis (SHAP Values): SHapley Additive exPlanations (SHAP) values quantify the contribution of each individual clinical indicator and genomic feature to the final prognostic prediction for a given patient. SHAP is rooted in cooperative game theory and attributes contributions fairly by averaging each feature's marginal contribution over all possible feature subsets. For a given prediction function $f$ and patient features $x$, the SHAP value for feature $i$ is calculated as:

   $$\phi_i(f, x) = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) \right]$$

   where $F$ is the set of all features, $S$ is a subset of features, and $f_S(x_S)$ is the model’s prediction using only features in set $S$. This analysis yields an individualized ranking of risk factors, indicating both the magnitude and direction (positive or negative impact on the predicted outcome) of each feature’s influence. Clinicians can therefore see which patient characteristics drive a particular outcome prediction for a given treatment.
3. Integrated Report: The module consolidates the quantitative predictions (e.g., 1-, 3-, and 5-year OS probabilities; 1- and 3-year LC probabilities), imaging explanations (Grad-CAM heatmaps), and feature importance analyses (SHAP value plots) into an interactive report. The report helps clinicians follow the model’s decision logic, identify key prognostic factors, and communicate prognostic information and personalized treatment recommendations to patients transparently. Its interactive design lets clinicians drill down into specific explanations or compare outcomes across treatment options.
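
The two explanation techniques above can be illustrated on toy inputs. The sketch below computes a Grad-CAM map from pre-computed feature maps and gradients, and exact Shapley values by enumerating every feature subset. The 2x2 maps, the toy risk function `toy_risk`, the feature names, and the choice of replacing absent features with a baseline value are all assumptions for illustration; practical SHAP implementations approximate this sum rather than enumerating it.

```python
from itertools import combinations
from math import factorial

def grad_cam(feature_maps, gradients):
    """ReLU of the alpha-weighted sum of feature maps (maps are lists of rows)."""
    cam = None
    for A, G in zip(feature_maps, gradients):
        cells = [g for row in G for g in row]
        alpha = sum(cells) / len(cells)  # global-average-pooled gradient
        weighted = [[alpha * a for a in row] for row in A]
        if cam is None:
            cam = weighted
        else:
            cam = [[c + w for c, w in zip(cr, wr)] for cr, wr in zip(cam, weighted)]
    return [[max(0.0, v) for v in row] for row in cam]

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating every subset S of F minus {i}.

    Features outside S take their baseline value, one common (assumed) way
    to evaluate the model using only the features in S.
    """
    names = list(x)
    n = len(names)

    def value(subset):
        return f({k: (x[k] if k in subset else baseline[k]) for k in names})

    phi = {}
    for i in names:
        others = [k for k in names if k != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for S in combinations(others, size):
                total += weight * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

# Toy Grad-CAM inputs: two 2x2 feature maps with constant gradients.
maps = [[[1, 0], [0, 1]], [[0, 2], [0, 0]]]
grads = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
cam = grad_cam(maps, grads)

# Toy risk model with an interaction term (hypothetical coefficients).
def toy_risk(z):
    return 0.4 * z["tumor_size_cm"] + 0.02 * z["age"] + 0.3 * z["egfr"] * z["tumor_size_cm"]

patient = {"tumor_size_cm": 3.0, "age": 70, "egfr": 1}
reference = {"tumor_size_cm": 2.0, "age": 65, "egfr": 0}
phi = shapley_values(toy_risk, patient, reference)
print(cam, phi)
```

The Shapley values sum to the difference between the patient's prediction and the reference prediction, which is the additivity property that makes per-feature contributions directly comparable in a report.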
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets
4.1.2. Evaluation Metrics
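
The result tables report a C-index for overall survival. For readers unfamiliar with the metric, a minimal sketch of Harrell's concordance index follows; the function name `concordance_index`, the tie handling (risk ties count as half), and the toy cohort are assumptions for illustration and may differ from the evaluation code actually used.

```python
def concordance_index(time, event, risk):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i had an observed event
    strictly before patient j's follow-up time. It is concordant when the
    model assigns patient i the higher risk score.
    """
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if event[i] == 1 and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy cohort (hypothetical): follow-up in years, event flag, predicted risk.
time = [1, 3, 5, 4]
event = [1, 1, 0, 1]
risk = [0.9, 0.6, 0.1, 0.7]
c_index = concordance_index(time, event, risk)
print(round(c_index, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported values between 0.69 and 0.83.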
4.1.3. Implementation Details
4.2. Comparison with Baseline Methods
1. Cox Proportional Hazards (Cox-PH) Regression: A traditional statistical model that uses only clinical features to predict survival risk.
2. Random Forest (RF): An ensemble machine learning method, trained on clinical features combined with handcrafted radiomic features extracted from CT images.
3. XGBoost (XGB): A gradient-boosting machine learning algorithm that uses clinical features, handcrafted radiomic features, and shallow deep learning features derived from a pre-trained 2D CNN (without fine-tuning).
4. Single-Modal 3D CNN (CT-only): A deep learning model that processes only CT imaging data (via a 3D CNN) to predict prognostic outcomes, serving as a baseline for single-modality deep learning.
5. Simple Multi-Modal Fusion (CT+Clinical, Concatenation): This method extracts deep features from CT images (using a 3D CNN) and clinical features (using an MLP), then concatenates these feature vectors before feeding them into a final prediction MLP. This represents a straightforward approach to multi-modal data integration.
4.3. Ablation Study
1. MM-Care w/o Adaptive Fusion: In this variant, the Adaptive Multi-Modal Fusion Network is replaced by a simple concatenation of the extracted imaging, clinical, and genomic features, followed by a single MLP layer for combination. This assesses the benefit of the Transformer-based cross-attention mechanism.
2. MM-Care w/o Genomic Features: This model operates only with imaging and clinical features, completely excluding genomic data from the feature extraction and fusion process. This highlights the contribution of genomic information when available.
3. MM-Care w/o Dual-Task Learning: Instead of a shared dual-task prediction head, separate, independent models are trained for Overall Survival and Local Control prediction, each using the fused multi-modal representation. This evaluates the benefits of joint learning across prognostic outcomes.
4. MM-Care w/o Spatial Attention (Imaging): The spatial attention mechanism within the Imaging Feature Extractor’s ResNet-3D architecture is removed, allowing us to quantify its impact on focusing on salient tumor characteristics.
4.4. Human Evaluation
4.5. Time-Dependent Prognostic Performance for Individual Treatments
4.6. Effectiveness of Adaptive Multi-Modal Fusion
1. Early Concatenation Fusion (ECF): The modality-specific imaging, clinical, and genomic feature vectors are directly concatenated and fed into a shared MLP for joint processing. This is effectively the "MM-Care w/o Adaptive Fusion" variant from the ablation study.
2. Late Prediction Fusion (LPF): Separate prognostic models are trained independently for each modality. Their individual prediction scores (e.g., survival probabilities) are then combined by simple averaging to yield a final ensemble prediction.
3. Weighted Feature Sum Fusion (WFSF): Features from each modality are linearly combined using pre-defined or globally optimized (but static) weights before being fed into the prediction head. This contrasts with MM-Care’s adaptive, context-dependent weighting.
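
The three comparison strategies differ mainly in where and how the modalities are combined, which the following sketch makes concrete. The function names, the toy feature vectors, the static weights, and the requirement that all modalities share one feature dimension for WFSF are assumptions for illustration; the downstream MLPs and per-modality models are omitted.

```python
def early_concat_fusion(f_img, f_clin, f_gen):
    """ECF: concatenate modality features; a shared MLP (omitted) would follow."""
    return f_img + f_clin + f_gen

def late_prediction_fusion(p_img, p_clin, p_gen):
    """LPF: average the probabilities predicted by per-modality models."""
    preds = [p_img, p_clin, p_gen]
    return sum(preds) / len(preds)

def weighted_feature_sum_fusion(features, weights):
    """WFSF: linear combination with static weights, identical for every patient."""
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]

# Hypothetical 2-D modality features and per-modality predictions.
f_img, f_clin, f_gen = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

ecf = early_concat_fusion(f_img, f_clin, f_gen)
lpf = late_prediction_fusion(0.8, 0.7, 0.6)
wfsf = weighted_feature_sum_fusion([f_img, f_clin, f_gen], [0.5, 0.3, 0.2])
print(ecf, lpf, wfsf)
```

In all three cases the combination rule is the same for every patient, whereas attention-based fusion derives its weights from each patient's own features.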
4.7. Personalized Treatment Recommendation Efficacy
4.8. Quantitative Interpretability Analysis
5. Conclusion


| Method | C-index (3-year OS) | AUC (3-year LC) | Accuracy (3-year LC) |
|---|---|---|---|
| Cox Regression (Clinical only) | 0.69 | 0.73 | 77.2% |
| Random Forest (Clinical + Radiomics) | 0.72 | 0.76 | 79.8% |
| XGBoost (Clinical + Radiomics + Shallow DL) | 0.74 | 0.78 | 81.5% |
| Single-Modal 3D CNN (CT-only) | 0.73 | 0.77 | 80.9% |
| Simple Multi-Modal Fusion (CT + Clinical) | 0.76 | 0.81 | 84.1% |
| Ours (MM-Care) | 0.78 | 0.83 | 86.3% |

| Model Variant | C-index (3-year OS) | AUC (3-year LC) | Accuracy (3-year LC) |
|---|---|---|---|
| MM-Care (Full Model) | 0.78 | 0.83 | 86.3% |
| MM-Care w/o Adaptive Fusion | 0.76 | 0.81 | 84.4% |
| MM-Care w/o Genomic Features | 0.77 | 0.82 | 85.5% |
| MM-Care w/o Dual-Task Learning | 0.77 | 0.82 | 85.8% |
| MM-Care w/o Spatial Attention (Imaging) | 0.77 | 0.82 | 85.1% |

| Metric | 1-year LC | 3-year LC |
|---|---|---|
| AUC | 0.88 | 0.83 |
| Accuracy | 89.1% | 86.3% |
| Sensitivity | 87.5% | 85.0% |
| Specificity | 90.2% | 87.1% |

| Metric | 1-year OS | 3-year OS | 5-year OS |
|---|---|---|---|
| C-index | 0.83 | 0.79 | 0.77 |
| Time-Dependent AUC | 0.88 | 0.83 | 0.80 |

| Metric | 1-year LC | 3-year LC |
|---|---|---|
| AUC | 0.90 | 0.85 |
| Accuracy | 90.5% | 87.9% |
| Sensitivity | 88.9% | 86.2% |
| Specificity | 91.8% | 89.1% |

| Fusion Strategy | C-index (3-year OS) | AUC (3-year LC) |
|---|---|---|
| ECF (Ablation Baseline) | 0.76 | 0.81 |
| LPF | 0.75 | 0.79 |
| WFSF | 0.76 | 0.80 |
| Ours (MM-Care Adaptive Fusion) | 0.78 | 0.83 |

| Recommendation Strategy | Recommendation Accuracy |
|---|---|
| Random Choice | 50.0% |
| Clinical Guidelines (Fixed Rules) | 71.5% |
| Expert Consensus (Pre-MM-Care) | 78.2% |
| Ours (MM-Care) | 84.7% |

| Modality | Feature | Average Absolute SHAP Value |
|---|---|---|
| Clinical | Tumor Size (cm) | 0.152 |
| Clinical | Age (years) | 0.138 |
| Clinical | FEV1 (% predicted) | 0.115 |
| Clinical | Charlson Comorbidity Index (CCI) | 0.098 |
| Clinical | Smoking History (Pack-years) | 0.075 |
| Genomic | EGFR Mutation Status | 0.165 |
| Genomic | KRAS Mutation Status | 0.141 |
| Genomic | TP53 Mutation Status | 0.122 |
| Genomic | STK11 Mutation Status | 0.089 |
| Genomic | PD-L1 Expression | 0.068 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).