Submitted: 19 June 2026
Posted: 22 June 2026
Abstract
Keywords:
1. Introduction
2. Background
2.1. Vision-Language Models
2.2. Segment Anything and Medical Variants
2.3. Text-Guided Segmentation Paradigm
3. Method Taxonomy
3.1. Text-Prompt Guided Segmentation
3.2. LLM-Embedded Architectures
3.3. Hybrid Frameworks

| Method | Year | Architecture | Modality | Datasets | Dice | Adaptation |
|---|---|---|---|---|---|---|
| SAM-Based Models | | | | | | |
| SAM [9] | 2023 | ViT-H + prompt enc. | Natural images | SA-1B (1B masks) | - | Pretrain |
| MedSAM [38] | 2024 | SAM + medical FT | Multi-modal | 1.5M med. image pairs | 0.85 | Full FT |
| SAM-Med2D [68] | 2023 | SAM + adapter | 10 modalities | 4.6M imgs, 19.7M masks | 0.83 | Adapter |
| SAM-Med3D [69] | 2023 | Native 3D SAM | Volumetric | 21K imgs, 131K masks | 0.78 | Full FT |
| SAMed [84] | 2023 | SAM + LoRA | Multi-organ CT | Synapse BTCV | 0.82 | LoRA (0.1%) |
| Med-SA [75] | 2025 | SAM + Adpt. + LoRA | 5 modalities | 17 tasks | 0.84 | Adpt. + LoRA |
| AdaptiveSAM [82] | 2024 | SAM + bias tuning | Surg., US, X-ray | Multiple | 0.81 | Bias tuning |
| SegVol [78] | 2024 | Volumetric SAM | CT | 200 organs, 96K vol. | 0.83 | Full FT |
| 3DSAM-adapter [83] | 2024 | SAM 2D→3D adapt. | CT (tumour) | LiTS, KiTS, pancreas CT | 0.86 | Adapter |
| SAM-OCTA [120] | 2025 | SAM + OCTA prompt tuning | Retinal OCTA | ROSE, OCTA-500 | 0.87 | Full FT |
| SAM2LoRA [88] | 2025 | SAM 2 + LoRA | Retinal fundus | 11 datasets | 0.93 | LoRA (<5%) |
| MedSAM2 [76] | 2025 | SAM 2 + medical FT | Image + video | Multi-modal | 0.86 | Full FT |
| MedSAM3 [87] | 2025 | SAM 3 + LoRA | Multi-modal | Concept-aware | 0.84 | LoRA |
| EmbedMedSAM [122] | 2025 | SAM embed. + edge optim. | Multi-modal | Resource-limited settings | 0.82 | Adapter |
| Text-Prompt Guided Models | | | | | | |
| LViT [20] | 2024 | U-Net + text fusion | Chest X-ray | QaTa-COV19 | 0.83 | Full FT |
| Cross-modal CR [95] | 2024 | Cross-modal recon. CLIP | CT, MRI | Multiple organ datasets | 0.84 | Full FT |
| CLIP-Driven UM [19] | 2023 | CLIP queries + Swin | Abdominal CT | BTCV, LiTS, KiTS | 0.86 | Full FT |
| Universal VLM [99] | 2024 | Extensible CLIP + dec. | Abdominal CT/MRI | BTCV, 15 organs | 0.87 | PEFT |
| ZePT [93] | 2024 | CLIP query disentangle | Pan-tumour CT | Multi-source | 0.77 | Self-prompt |
| BiomedParse [21] | 2025 | SEEM + GPT-4 harm. | 9 modalities | BiomedParseData (6M) | 0.94 | Full pretrain |
| BiomedParse-V [22] | 2025 | FVE + ISD module | CT, MRI, micro. | CVPR 2025 challenge | 0.86 | Full pretrain |
| MedSegX [100] | 2025 | Generalist FM + open vocab | Multi-modal | 100+ datasets | 0.85 | Full pretrain |
| SAT [92] | 2025 | CLIP + transf. dec. | Radiology | 70+ datasets | 0.84 | Full FT |
| LLM-Embedded Architectures | | | | | | |
| LISA [23] | 2024 | MLLM + 〈SEG〉 tok. | Nat. + reasoning | ReasonSeg, refCOCO | - | Full FT (LLM) |
| LISA++ [24] | 2024 | LISA + inst. reasoning | Nat. + medical | Extended ReasonSeg | - | Full FT |
| ChatRadio-Valuer [123] | 2025 | LLM + rad. impression dec. | Chest X-ray | Multi-inst. CXR | - | Full FT |
| MedPLIB [103] | 2025 | MLLM + SAM-Med2D | Multi-mod. med. | MeCoVQA | 0.81 | LoRA + Adpt. |
| M3D [25] | 2025 | 3D MLLM + decoder | 3D CT | M3D-Seg | 0.79 | Full FT |
| Show & Segment [124] | 2025 | In-context MLLM + dec. | Multi-modal med. | 12 diverse datasets | 0.83 | Zero-shot |
| Hybrid and Other Frameworks | | | | | | |
| MedCLIP-SAM [113] | 2024 | MedCLIP + SAM | Multi-modal | Multiple | 0.80 | Hybrid |
| SaLIP [114] | 2024 | SAM + CLIP cascade | Multi-modal | Multiple | 0.74 | Zero-shot |
| VILA-M3 [96] | 2025 | VLM + medical expert know. | Multi-modal | BTCV, LiTS, BraTS | 0.86 | PEFT |
| SegFM3D [79] | 2025 | 3D foundation model | Multi-modal 3D | Multi-source | 0.83 | Pretrain |
| Specialized Foundation Models | | | | | | |
| MoME (lesion) [125] | 2025 | Mixture of mod. experts | Brain MRI lesions | Multi-source MRI | 0.80 | Full FT |
| UniverSeg [126] | 2023 | Few-shot universal | 16 modalities | MegaMedical | 0.72 | Few-shot |
| GenSeg [127] | 2025 | Diffusion gen. + seg. | Multi-modal | Ultra low-data regimes | 0.81 | Hybrid gen. |
| SegMamba-V2 [128] | 2026 | Mamba SSM 3D long-range | Volumetric CT/MRI | Multi-organ 3D | 0.88 | Full FT |
| TotalSeg. [34] | 2023 | nnU-Net based | CT (104 structs.) | 1204 CTs | 0.94 | Full train |
| TotalSeg. MRI [35] | 2025 | Seq.-independent | MRI (multi-organ) | 616 MRI + 527 CT | 0.84 | Full train |
| BrainSegFounder [129] | 2024 | Self-sup. 3D ViT | Brain MRI | Multi-source neuroimaging | 0.91 | PEFT |
| SAMUS [80] | 2024 | SAM + US adapt. | Ultrasound | Multi-source US | 0.80 | Adapter |
| SegAnyBone [81] | 2024 | SAM + bone FT | MRI bones | Multi-seq. MRI | 0.82 | Full FT |
| Self-imp. FM [130] | 2025 | Generative FM + self-imp. | CT, MRI, X-ray | Multi-organ, multi-modal | 0.85 | Full pretrain |
| LCTfound [131] | 2026 | Lung CT ViT FM | Chest CT | LIDC-IDRI, NLM, LUNA16 | 0.89 | Full pretrain |
| Merlin [132] | 2026 | CT VLM + report gen. | Chest CT | Radiology reports + seg. | - | Full pretrain |
| Decipher-MR [133] | 2026 | 3D MRI VLM encoder | Multi-seq. MRI | Diverse MRI tasks | - | Full pretrain |
| CT-CLIP [54] | 2026 | Volumetric CLIP | Chest CT | CT-RATE (50K) | - | Pretrain |
| Deep Learning Baselines (CNN/Transformer, no text prompt) | | | | | | |
| Confidence-SS [134] | 2025 | CNN-Trans. semi-sup. | Skin lesion | ISIC 2016, PH2 | 0.91 | Semi-supervised |
| H-Self-Support [135] | 2026 | Hierarchical self-support | Brain MRI (tumour) | BraTS 2021 | 0.92 | Self-supervised |
| Dense Enc.-Dec. [102] | 2021 | CNN enc.-dec. skip conn. | Skin lesion | ISIC 2018, PH2 | 0.87 | Full FT |
| RD2A [2] | 2021 | Residual dense + ASPP | Brain MRI (tumour) | BraTS 2019 | 0.89 | Full FT |
| ScaleFusionNet [101] | 2025 | Trans. multi-scale FPN | Skin lesion | ISIC 2017/2018, PH2 | 0.90 | Full FT |
| UNet-Mamba [5] | 2025 | UNet + Mamba-like attn. | Multi-modal | ACDC, Synapse, polyp sets | 0.91 | Full FT |
4. Adaptation Strategies
4.1. Full Fine-Tuning
4.2. Parameter-Efficient Fine-Tuning
4.3. Prompt Engineering

| Strategy | Trainable params | GPU memory | Accuracy | Storage | Example |
|---|---|---|---|---|---|
| Full fine-tuning | 100% | Very high | Highest | Full model copy | MedSAM |
| LoRA | 0.1–1% | Low | Near best | Small adapter | SAMed |
| Adapter modules | 1–5% | Low | Strong | Small modules | Med-SA |
| Bias tuning | <0.5% | Very low | Good | Bias deltas only | AdaptiveSAM |
| Visual prompt tuning | <0.1% | Minimal | Variable | Prompt tokens | VPT variants |
| Prompt engineering | 0% | None | Variable | Text only | BiomedParse |
| Conv-LoRA | 0.5–2% | Low | Strong | Small modules | Conv-LoRA SAM |
| NAS-LoRA | 1–3% | Low | Strong | Searched arch | NAS-LoRA |
| DoRA | 0.2–1% | Low | Strong | Magnitude+dir. | DoRA variants |
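
The trainable-parameter percentages in the table above can be sanity-checked with a short calculation. The Python sketch below counts the weights that a LoRA update adds to a frozen backbone and reports the resulting trainable fraction. The layer sizes, the rank, the choice of adapted projections and the backbone parameter count are illustrative assumptions, not figures reported for SAMed or any other model listed here.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Weights added by one LoRA pair: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(n_frozen: int, n_trainable: int) -> float:
    """Share of all parameters that receive gradients."""
    return n_trainable / (n_frozen + n_trainable)


# Hypothetical ViT-style encoder (assumed values, for illustration only):
# 12 transformer blocks, hidden size 768, LoRA of rank 4 applied to the
# query and value projections of every block.
hidden, blocks, rank = 768, 12, 4
adapted_per_block = 2  # query and value projections

lora_params = blocks * adapted_per_block * lora_param_count(hidden, hidden, rank)
frozen_params = 90_000_000  # assumed size of the frozen backbone

frac = trainable_fraction(frozen_params, lora_params)
print(f"LoRA parameters: {lora_params}")
print(f"Trainable share: {100 * frac:.3f}%")
```

Under these assumptions the trainable share falls inside the 0.1–1% band that the table gives for LoRA. Raising the rank or adapting more projections scales the count linearly.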
5. Modality-Specific Models
5.1. Computed Tomography
5.2. Magnetic Resonance Imaging
5.3. Pathology
5.4. Chest Radiography
5.5. Ultrasound
6. Clinical Applications
6.1. Organ Segmentation
6.2. Tumor Segmentation
6.3. Radiotherapy Planning
7. Evaluation and Datasets
7.1. Evaluation Metrics
| Metric | Range | Property | Best Used For |
|---|---|---|---|
| Dice Score | [0, 1] | Overlap, small-structure sensitive | Volume overlap reporting |
| IoU (Jaccard) | [0, 1] | Conservative overlap | Detection metric comparison |
| Hausdorff Distance | [0, ∞) mm | Worst-case boundary error | Outlier sensitivity studies |
| 95% Hausdorff (HD95) | [0, ∞) mm | Percentile boundary error | Radiotherapy organs at risk |
| Average Surface Dist. | [0, ∞) mm | Mean boundary error | Boundary quality reporting |
| Normalized Surface Dist. | [0, 1] | Boundary within tolerance | Clinically acceptable boundary |
| Lesion-wise Dice | [0, 1] | Per-lesion overlap | Multi-focal disease |
| Sensitivity / Recall | [0, 1] | Detection rate | Screening applications |
| Specificity | [0, 1] | True-negative rate (1 − false-positive rate) | Tasks where false positives are costly |
| Recognition Accuracy | [0, 1] | Object presence detection | Text-prompted segmentation |
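
As a concrete illustration of the overlap and boundary metrics in the table above, the sketch below computes Dice, IoU and a Hausdorff-style distance in plain Python. It is a minimal example under stated assumptions: masks are flat sequences of 0/1, boundaries are given as lists of coordinates in millimetres, and the percentile variant uses a nearest-rank rule, which is only one of several conventions for HD95. The function names are hypothetical and are not taken from any toolkit cited in this survey.

```python
import math


def dice(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * inter / total


def iou(pred, target):
    """IoU (Jaccard) = |P ∩ T| / |P ∪ T| for flat binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return 1.0 if union == 0 else inter / union


def directed_distances(a, b):
    """For each point in a, the distance to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]


def hausdorff(a, b, percentile=100.0):
    """Symmetric Hausdorff distance between two non-empty point sets.

    percentile=100 gives the classical worst-case distance; percentile=95
    gives an HD95-style value (nearest-rank percentile, an assumed convention).
    """
    def pct(values):
        ordered = sorted(values)
        rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
        return ordered[rank - 1]

    return max(pct(directed_distances(a, b)), pct(directed_distances(b, a)))


# Toy 1D masks: three foreground pixels each, two of them shared.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 0, 1, 1, 1, 0]
d, j = dice(pred, target), iou(pred, target)
print(f"Dice = {d:.3f}, IoU = {j:.3f}")

# Toy boundary point sets (coordinates assumed to be in mm).
a = [(0, 0), (0, 1), (0, 2)]
b = [(0, 0), (0, 1), (3, 2)]
print(f"Hausdorff = {hausdorff(a, b):.1f} mm")
```

For a single pair of masks the two overlap scores are linked by Dice = 2·IoU / (1 + IoU), so IoU is never larger than Dice, which is consistent with the table describing IoU as the more conservative measure.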
7.2. Benchmark Datasets
8. Challenges and Future Directions
8.1. Prompt Dependence and Trustworthiness
8.2. Efficiency and 3D Scaling
8.3. Data Efficiency
8.4. Clinical Integration and Outlook

9. Conclusion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Isensee, F.; Jaeger, P.; Kohl, S.; Petersen, J.; Maier-Hein, K. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211.
- Ahmad, P.; Jin, H.; Qamar, S.; Zheng, R.; Saeed, A. RD2A: densely connected residual networks using ASPP for brain tumor segmentation. Multimed. Tools Appl. 2021, 80, 27069–27094.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the ICLR, 2021.
- Liu, Z.; Lin, Y.; Cao, Y.; et al. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the ICCV, 2021; pp. 10012–10022.
- Qamar, S.; Fazil, M.; Ahmad, P.; Khan, S.; Zamani, A.T. UNet with self-adaptive Mamba-like attention and causal-resonance learning for medical image segmentation. Sci. Rep. 2026, 16, 135.
- Hatamizadeh, A.; Nath, V.; Tang, Y.; Yang, D.; Roth, H.; Xu, D. Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. In Proceedings of the MICCAI BrainLes Workshop, 2021; pp. 272–284.
- Shamshad, F.; Khan, S.; Zamir, S.; et al. Transformers in medical imaging: A survey. Med. Image Anal. 2023, 88, 102802.
- Moor, M.; Banerjee, O.; Abad, Z.; et al. Foundation models for generalist medical artificial intelligence. Nature 2023, 616, 259–265.
- Kirillov, A.; Mintun, E.; Ravi, N.; et al. Segment anything. In Proceedings of the ICCV, 2023; pp. 4015–4026.
- Touvron, H.; Lavril, T.; Izacard, G.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
- Achiam, J.; Adler, S.; Agarwal, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
- Radford, A.; Kim, J.; Hallacy, C.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the ICML, 2021; pp. 8748–8763.
- Jia, C.; Yang, Y.; Xia, Y.; et al. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the ICML, 2021; pp. 4904–4916.
- Wang, Z.; Wu, Z.; Agarwal, D.; Sun, J. MedCLIP: Contrastive learning from unpaired medical images and text. In Proceedings of the EMNLP, 2022; pp. 3876–3887.
- Zhang, S.; Xu, Y.; Usuyama, N.; Xu, H.; Bagga, J.; Tinn, R.; Preston, S.; Rao, R.; Wei, M.; Valluri, N.; et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. NEJM AI 2025, 2, AIoa2400640.
- Huang, S.C.; Shen, L.; Lungren, M.; Yeung, S. GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the ICCV, 2021; pp. 3942–3951.
- Boecking, B.; Usuyama, N.; Bannur, S.; et al. Making the most of text semantics to improve biomedical vision-language processing. In Proceedings of the ECCV, 2022.
- Bannur, S.; Hyland, S.; Liu, Q.; et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the CVPR, 2023; pp. 15016–15027.
- Liu, J.; Zhang, Y.; Chen, J.N.; et al. CLIP-driven universal model for organ segmentation and tumor detection. In Proceedings of the ICCV, 2023; pp. 21152–21164.
- Li, Z.; Li, Y.; Li, Q.; Wang, P.; Guo, D.; Lu, L.; Jin, D.; Zhang, Y.; Hong, Q. LViT: Language meets vision transformer in medical image segmentation. IEEE Trans. Med. Imaging 2023, 43, 96–107.
- Zhao, T.; Gu, Y.; Yang, J.; et al. A foundation model for joint segmentation, detection, and recognition of biomedical objects across nine modalities. Nat. Methods 2025, 22, 166–176.
- Zhao, T.; Gu, Y.; Yang, J.; et al. BiomedParse-V: Scaling foundation model for universal text-guided volumetric biomedical image segmentation. In Proceedings of the CVPR Workshop MedSegFM, 2025.
- Lai, X.; Tian, Z.; Chen, Y.; Li, Y.; Yuan, Y.; Liu, S.; Jia, J. LISA: Reasoning segmentation via large language model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024; pp. 9579–9589.
- Yang, S.; Qu, T.; Lai, X.; et al. LISA++: An improved baseline for reasoning segmentation with large language model. arXiv 2023, arXiv:2312.17240.
- Bai, F.; Du, Y.; Huang, T.; Meng, M.; Zhao, B. M3D: Advancing 3D medical image analysis with multi-modal large language models. arXiv 2024, arXiv:2404.00578.
- Khan, W.; Leem, S.; See, K.; Wong, J.; Zhang, S.; Fang, R. A comprehensive survey of foundation models in medicine. IEEE Rev. Biomed. Eng. 2025.
- Wu, J.; Wang, Y.; Bai, H. Vision-language foundation model for 3D medical imaging. npj Artif. Intell. 2025, 1, 17.
- Sun, K.; Xue, S.; Sun, F.; Sun, H.; Luo, Y.; Wang, L.; Wang, S.; Guo, N.; Liu, L.; Zhao, T.; et al. Medical multimodal foundation models in clinical diagnosis and treatment: Applications, challenges, and future directions. Artif. Intell. Med. 2025, 103265.
- Zhao, Z.; Liu, Y.; Wu, H.; et al. CLIP in medical imaging: A comprehensive survey. arXiv 2023, arXiv:2312.07353.
- Lurz, D.; Neubig, L.; Kopp, M.; Kist, A. Foundation Models in Medical Image Segmentation. In Proceedings of the Bildverarbeitung für die Medizin 2026 (BVM 2026) Informatik aktuell; Springer Vieweg: Wiesbaden, 2026.
- Nikolov, S.; Blackwell, S.; Zverovitch, A.; et al. Clinically applicable segmentation of head and neck anatomy for radiotherapy: deep learning algorithm development and validation study. J. Med. Internet Res. 2021, 23, e26151.
- Vaassen, F.; Hazelaar, C.; Vaniqui, A.; et al. Evaluation of measures for assessing time-saving of automatic organ-at-risk segmentation in radiotherapy. Phys. Imaging Radiat. Oncol. 2020, 13, 1–6.
- Cardenas, C.; Yang, J.; Anderson, B.; Court, L.; Brock, K. Advances in auto-segmentation. Semin. Radiat. Oncol. 2019, 29, 185–197.
- Wasserthal, J.; Breit, H.C.; Meyer, M.; et al. TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiol. AI 2023, 5, e230024.
- Akinci D’Antonoli, T.; Berger, L.K.; Indrakanti, A.K.; Vishwanathan, N.; Weiss, J.; Jung, M.; Berkarda, Z.; Rau, A.; Reisert, M.; Küstner, T.; et al. TotalSegmentator MRI: Robust sequence-independent segmentation of multiple anatomic structures in MRI. Radiology 2025, 314, e241613.
- Lee, S.; Youn, J.; Kim, H.; Kim, M.; Yoon, S.H. CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images. Eur. Radiol. 2025, 35, 4374–4386.
- Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Proc. NeurIPS 2023, Vol. 36, 28541–28564.
- Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2024, 15, 654.
- Tu, T.; Azizi, S.; Driess, D.; Schaekermann, M.; Amin, M.; Chang, P.C.; Carroll, A.; Lau, C.; Tanno, R.; Ktena, I.; et al. Towards generalist biomedical AI. NEJM AI 2024, 1, AIoa2300138.
- Hu, E.; Shen, Y.; Wallis, P.; et al. LoRA: Low-rank adaptation of large language models. In Proceedings of the ICLR, 2022.
- Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; et al. Parameter-efficient transfer learning for NLP. In Proceedings of the ICML, 2019; pp. 2790–2799.
- Maier-Hein, L.; Reinke, A.; Godau, P.; Tizabi, M.D.; Buettner, F.; Christodoulou, E.; Glocker, B.; Isensee, F.; Kleesiek, J.; Kozubek, M.; et al. Metrics reloaded: recommendations for image analysis validation. Nat. Methods 2024, 21, 195–212.
- Reinke, A.; Tizabi, M.D.; Baumgartner, M.; Eisenmann, M.; Heckmann-Nötzel, D.; Kavur, A.E.; Rädsch, T.; Sudre, C.H.; Acion, L.; Antonelli, M.; et al. Understanding metric-related pitfalls in image analysis validation. Nat. Methods 2024, 21, 182–194.
- Taha, A.; Hanbury, A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med. Imaging 2015, 15, 29.
- Yeghiazaryan, V.; Voiculescu, I. Family of boundary overlap metrics for the evaluation of medical image segmentation. J. Med. Imaging 2018, 5, 015006.
- Lee, J.; Yoon, W.; Kim, S.; et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
- Alsentzer, E.; Murphy, J.; Boag, W.; et al. Publicly available clinical BERT embeddings. In Proceedings of the ClinicalNLP, 2019; pp. 72–78.
- Lin, W.; Zhao, Z.; Zhang, X.; et al. PMC-CLIP: Contrastive language-image pre-training using biomedical documents. In Proceedings of the MICCAI, 2023; pp. 525–536.
- Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. arXiv 2023, arXiv:2308.02463.
- Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. Towards generalist foundation model for radiology. Nat. Commun. 2025, 16, 7866.
- Saab, K.; Tu, T.; Weng, W.H.; Tanno, R.; Stutz, D.; Wulczyn, E.; Zhang, F.; Strother, T.; Park, C.; Vedadi, E.; et al. Capabilities of Gemini models in medicine. arXiv 2024, arXiv:2404.18416.
- Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H.; et al. Toward expert-level medical question answering with large language models. Nat. Med. 2025, 31, 754–762.
- Tanno, R.; Barrett, D.G.; Sellergren, A.; Ghaisas, S.; Dathathri, S.; See, A.; Welbl, J.; Lau, C.; Tu, T.; Azizi, S.; et al. Collaboration between clinicians and vision–language models in radiology report generation. Nat. Med. 2025, 31, 599–608.
- Hamamci, I.E.; Er, S.; Wang, C.; Almas, F.; Simsek, A.G.; Esirgun, S.N.; Dogan, I.; Durugol, O.F.; Hou, B.; Shit, S.; et al. Generalist foundation models from a multimodal dataset for 3D computed tomography. Nat. Biomed. Eng. 2026.
- Huang, Z.; Bianchi, F.; Yuksekgonul, M.; Montine, T.; Zou, J. A visual-language foundation model for pathology image analysis using medical Twitter. Nat. Med. 2023, 29, 2307–2316.
- Nguyen, H.G.; Lundström, O.; Olsson, J.; et al. Pathology Language-Image Pretraining (PLIP): A foundation model for pathology image analysis. Nat. Med. 2023.
- Ikezogwo, W.; Seyfioglu, M.; Ghezloo, F.; et al. Quilt-1M: One million image-text pairs for histopathology. In Proceedings of the NeurIPS Datasets, 2023.
- Chen, R.J.; Ding, T.; Lu, M.Y.; Williamson, D.F.; Jaume, G.; Song, A.H.; Chen, B.; Zhang, A.; Shao, D.; Shaban, M.; et al. Towards a general-purpose foundation model for computational pathology. Nat. Med. 2024, 30, 850–862.
- Vorontsov, E.; Bozkurt, A.; Casson, A.; Shaikovski, G.; Zelechowski, M.; Severson, K.; Zimmermann, E.; Hall, J.; Tenenholtz, N.; Fusi, N.; et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nat. Med. 2024, 30, 2924–2935.
- Sun, Y.; Zhu, C.; Zheng, S.; Zhang, K.; Sun, L.; Shui, Z.; Zhang, Y.; Li, H.; Yang, L. PathAsst: A generative foundation AI assistant towards artificial general intelligence of pathology. Proc. AAAI Conf. Artif. Intell. 2024, Vol. 38, 5034–5042.
- Wang, X.; Du, Y.; Yang, S.; et al. RetCCL: Clustering-guided contrastive learning for whole-slide image retrieval. Med. Image Anal. 2023, 83, 102645.
- Filiot, A.; Ghermi, R.; Olivier, A.; et al. Scaling self-supervised learning for histopathology with masked image modeling. medRxiv 2023.
- Caron, M.; Touvron, H.; Misra, I.; et al. Emerging properties in self-supervised vision transformers. In Proceedings of the ICCV, 2021; pp. 9650–9660.
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning robust visual features without supervision. Trans. Mach. Learn. Res. 2024.
- Li, Y.; Wang, H.; Duan, Y.; Li, X. CLIP surgery for better explainability with enhancement in open-vocabulary tasks. arXiv 2023, arXiv:2304.05653.
- Yang, X.; Chen, A.; PourNejatian, N.; et al. A large language model for electronic health records. npj Digit. Med. 2022, 5, 194.
- Kim, M.; Kim, Y.; Kang, H.J.; Seo, H.; Choi, H.; Han, J.; Kee, G.; Park, S.; Ko, S.; Jung, H.; et al. Fine-tuning LLMs with medical data: can safety be ensured? NEJM AI 2025, 2, AIcs2400390.
- Cheng, J.; Ye, J.; Deng, Z.; et al. SAM-Med2D. arXiv 2023, arXiv:2308.16184.
- Wang, H.; Guo, S.; Ye, J.; et al. SAM-Med3D: Towards general-purpose segmentation models for volumetric medical images. arXiv 2023, arXiv:2310.15161.
- Mazurowski, M.; Dong, H.; Gu, H.; Yang, J.; Konz, N.; Zhang, Y. Segment anything model for medical image analysis: an experimental study. Med. Image Anal. 2023, 89, 102918.
- Deng, R.; Cui, C.; Liu, Q.; et al. Segment anything model (SAM) for digital pathology: Assess zero-shot segmentation on whole slide imaging. arXiv 2023, arXiv:2304.04155.
- Huang, Y.; Yang, X.; Liu, L.; Zhou, H.; Chang, A.; Zhou, X.; Chen, R.; Yu, J.; Chen, J.; Chen, C.; et al. Segment anything model for medical images? Med. Image Anal. 2024, 92, 103061.
- He, S.; Bao, R.; Li, J.; Stout, J.; Bjornerud, A.; Grant, P.; Ou, Y. Computer-vision benchmark Segment-Anything Model (SAM) in medical images: Accuracy in 12 datasets. arXiv 2023, arXiv:2304.09324.
- Cheng, D.; Qin, Z.; Jiang, Z.; et al. SAM on medical images: A comprehensive study on three prompt modes. arXiv 2023, arXiv:2305.00035.
- Wu, J.; Wang, Z.; Hong, M.; Ji, W.; Fu, H.; Xu, Y.; Xu, M.; Jin, Y. Medical SAM Adapter: Adapting segment anything model for medical image segmentation. Med. Image Anal. 2025, 102, 103547.
- Ma, J.; Kim, S.; Li, F.; Baharoon, M.; Asakereh, R.; Lyu, H.; Wang, B. Segment anything in medical images and videos: Benchmark and deployment. arXiv 2024, arXiv:2408.03322.
- Sun, J.; Chen, K.; He, Z.; Ren, S.; He, X.; Liu, X.; Peng, C. Medical image analysis using improved SAM-Med2D: Segmentation and classification perspectives. BMC Med. Imaging 2024, 24, 245.
- Du, Y.; Bai, F.; Huang, T.; Zhao, B. SegVol: Universal and interactive volumetric medical image segmentation. Proc. NeurIPS 2024, Vol. 37, 110746–110783.
- He, Y.; Guo, P.; Tang, Y.; Myronenko, A.; Nath, V.; Xu, Z.; Yang, D.; Zhao, C.; Simon, B.; Belue, M.; et al. VISTA3D: A unified segmentation foundation model for 3D medical imaging. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 20863–20873.
- Lin, X.; Xiang, Y.; Zhang, L.; Yang, X.; Yan, Z.; Yu, L. SAMUS: Adapting segment anything model for clinically-friendly and generalizable ultrasound image segmentation. arXiv 2023, arXiv:2309.06824.
- Gu, H.; Colglazier, R.; Dong, H.; Zhang, J.; Chen, Y.; Yildiz, Z.; Chen, Y.; Li, L.; Yang, J.; Willhite, J.; et al. SegmentAnyBone: A universal model that segments any bone at any location on MRI. 2025.
- Paranjape, J.N.; Nair, N.G.; Sikder, S.; Vedula, S.S.; Patel, V.M. AdaptiveSAM: Towards efficient tuning of SAM for surgical scene segmentation. In Proceedings of the Annual Conference on Medical Image Understanding and Analysis, 2024; Springer; pp. 187–201.
- Gong, S.; Zhong, Y.; Ma, W.; Li, J.; Wang, Z.; Zhang, J.; Heng, P.A.; Dou, Q. 3DSAM-adapter: Holistic adaptation of SAM from 2D to 3D for promptable tumor segmentation. Med. Image Anal. 2024, 98, 103324.
- Zhang, K.; Liu, D. Customized segment anything model for medical image segmentation. arXiv 2023, arXiv:2304.13785.
- Gu, H.; Dong, H.; Yang, J.; Mazurowski, M.A. How to build the best medical image segmentation algorithm using foundation models: a comprehensive empirical study with segment anything model. 2024.
- Carion, N.; Gustafson, L.; Hu, Y.T.; Debnath, S.; Hu, R.; Suris, D.; Ryali, C.; Alwala, K.V.; Khedr, H.; Huang, A.; et al. SAM 3: Segment Anything with Concepts. arXiv 2025, arXiv:2511.16719.
- Liu, A.; Xue, R.; Cao, X.R.; Shen, Y.; Lu, Y.; Li, X.; Chen, Q.; Chen, J. MedSAM3: Delving into Segment Anything with Medical Concepts. arXiv 2025, arXiv:2511.19046.
- Mandal, S.; Karthikeyan, D.; Paldhe, M. SAM2LoRA: Composite Loss-Guided, Parameter-Efficient Finetuning of SAM2 for Retinal Fundus Segmentation. arXiv 2025, arXiv:2510.10288.
- Liang, F.; Wu, B.; Dai, X.; et al. Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proceedings of the CVPR, 2023; pp. 7061–7070.
- Wang, Z.; Lu, Y.; Li, Q.; et al. CRIS: CLIP-Driven referring image segmentation. In Proceedings of the CVPR, 2022; pp. 11686–11695.
- Zou, X.; Yang, J.; Zhang, H.; et al. Segment everything everywhere all at once. In Proceedings of the NeurIPS, 2023.
- Zhao, Z.; Zhang, Y.; Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. Large-vocabulary segmentation for medical images with text prompts. npj Digit. Med. 2025, 8, 493.
- Jiang, Y.; Huang, Z.; Zhang, R.; Zhang, X.; Zhang, S. ZePT: Zero-shot pan-tumor segmentation via query-disentangling and self-prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 11386–11397.
- Tanida, T.; Müller, P.; Kaissis, G.; Rückert, D. Interactive and explainable region-guided radiology report generation. In Proceedings of the CVPR, 2023; pp. 7433–7442.
- Huang, X.; Li, H.; Cao, M.; Chen, L.; You, C.; An, D. Cross-modal conditioned reconstruction for language-guided medical image segmentation. IEEE Trans. Med. Imaging 2024, 44, 1821–1835.
- Nath, V.; Li, W.; Yang, D.; Myronenko, A.; Zheng, M.; Lu, Y.; Liu, Z.; Yin, H.; Law, Y.M.; Tang, Y.; et al. VILA-M3: Enhancing vision-language models with medical expert knowledge. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 14788–14798.
- Lee, G.E.; Kim, S.; Cho, J.; Choi, S.; Choi, S.I. Text-guided cross-position attention for segmentation: Case of medical image. In Proceedings of the MICCAI, 2023; pp. 537–546.
- Adhikari, R.; Dhakal, M.; Thapaliya, S.; Poudel, K.; Bhandari, P.; Khanal, B. Synthetic boost: Leveraging synthetic data for enhanced vision-language segmentation in echocardiography. In Proceedings of the ASMUS Workshop MICCAI, 2023; pp. 89–99.
- Liu, J.; Zhang, Y.; Wang, K.; Yavuz, M.; Chen, X.; Yuan, Y.; Li, H.; Yang, Y.; Yuille, A.; Tang, Y.; et al. Universal and extensible language-vision models for organ segmentation and tumor detection from abdominal computed tomography. Med. Image Anal. 2024, 97, 103226.
- Wang, Y.; et al. A generalist foundation model and database for open-world medical image segmentation. Nat. Biomed. Eng. 2025.
- Qamar, S.; Qadri, S.F.; Alroobaea, R.; Alshmrani, G.M.; Fazil, M.; Jiang, R. ScaleFusionNet: transformer-guided multi-scale feature fusion for skin lesion segmentation. Sci. Rep. 2025, 15, 34393.
- Qamar, S.; Ahmad, P.; Shen, L. Dense encoder-decoder–based architecture for skin lesion segmentation. Cogn. Comput. 2021, 13, 583–594.
- Huang, X.; Shen, L.; Liu, J.; Shang, F.; Li, H.; Huang, H.; Yang, Y. Towards a multimodal large language model with pixel-level insight for biomedicine. Proc. AAAI Conf. Artif. Intell. 2025, Vol. 39, 3779–3787.
- Shi, Y.; Zhu, X.; Wang, K.; Hu, Y.; Guo, C.; Li, M.; Wu, J. Med-2E3: A 2D-enhanced 3D medical multimodal large language model. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2025; pp. 2754–2759.
- Chen, Z.; Varma, M.; Delbrouck, J.B.; Paschali, M.; Blankemeier, L.; Van Veen, D.; Valanarasu, J.M.J.; Youssef, A.; Cohen, J.P.; Reis, E.P.; et al. CheXagent: Towards a foundation model for chest X-ray interpretation. In Proceedings of the AAAI 2024 Spring Symposium on Clinical Foundation Models, 2024.
- Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment anything in images and videos. Proc. Int. Conf. Learn. Represent. 2025, Vol. 2025, 28085–28128.
- Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciosi, S.; Chute, C.; Kim, D.; Lungren, M.P.; Ng, A.Y.; Rajpurkar, P. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. Proc. AAAI Conf. Artif. Intell. 2019, Vol. 33, 590–597.
- Wu, J.; Xu, M. One-prompt to segment all medical images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 11302–11312.
- Ji, Z.; Lee, N.; Frieske, R.; et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38.
- Bai, Z.; Wang, P.; Xiao, T.; He, T.; Han, Z.; Zhang, Z.; Shou, M.Z. Hallucination of multimodal large language models: A survey. arXiv 2024, arXiv:2404.18930.
- He, S.; Nie, Y.; Chen, Z.; Cai, Z.; Wang, H.; Yang, S.; Chen, H. MedDr: Diagnosis-guided bootstrapping for large-scale medical vision-language learning. arXiv 2024, arXiv:2404.15127.
- Zhang, H.; Chen, Y.; Wang, Y.; et al. Mitigating hallucinations in radiology vision-language models through fact-grounded preference optimization. Nat. Commun. 2025, 16, 6489.
- Koleilat, T.; Asgariandehkordi, H.; Rivaz, H.; Xiao, Y. MedCLIP-SAM: Bridging text and image towards universal medical image segmentation. In Proceedings of the MICCAI, 2024; pp. 643–653. [Google Scholar]
- Aleem, S.; Wang, F.; Maniparambil, M.; Arazo, E.; Dietlmeier, J.; Curran, K.; O'Connor, N.E.; Little, S. Test-time adaptation with SaLIP: A cascade of SAM and CLIP for zero-shot medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 5184–5193. [Google Scholar]
- Sun, L.; Hwang, J.; Kuo, C.C.; Cha, S. SegRefiner: Towards model-agnostic segmentation refinement with discrete diffusion process. In Proceedings of the NeurIPS, 2023. [Google Scholar]
- Lu, S.; Chen, Y.; Chen, Y.; Li, P.; Sun, J.; Zheng, C.; Zou, Y.; Liang, B.; Li, M.; Jin, Q.; et al. General lightweight framework for vision foundation model supporting multi-task and multi-center medical image analysis. Nat. Commun. 2025. [Google Scholar] [CrossRef] [PubMed]
- Murali, A.; Zarin, F.; Meyer, A.; Mascagni, P.; Mutter, D.; Padoy, N. CycleSAM: Few-Shot Surgical Scene Segmentation with Cycle-and Scene-Consistent Feature Matching. arXiv 2024, arXiv:2407.06795. [Google Scholar]
- Wang, H.; Vasu, P.K.A.; Faghri, F.; Vemulapalli, R.; Farajtabar, M.; Mehta, S.; Rastegari, M.; Tuzel, O.; Pouransari, H. SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding. In Proceedings of the CVPR Workshop, 2024. [Google Scholar]
- Heidari, M.; Kazerouni, A.; Soltany, M.; Azad, R.; Aghdam, E.K.; Cohen-Adad, J.; Merhof, D. HiFormer: Hierarchical multi-scale representations using transformers for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023; pp. 1546–1556. [Google Scholar]
- Chen, X.; Wang, C.; Ning, H.; Li, S.; Shen, M. SAM-OCTA: Prompting segment-anything for OCTA image segmentation. Biomed. Signal Process. Control 2025, 106, 107698. [Google Scholar] [CrossRef]
- Lee, H.H.; Gu, Y.; Zhao, T.; Xu, Y.; Yang, J.; Usuyama, N.; Wong, C.; Wei, M.; Landman, B.A.; Huo, Y.; et al. Foundation models for biomedical image segmentation: A survey. arXiv 2024, arXiv:2401.07654. [Google Scholar]
- Zhang, Y.; Ye, F.; Yu, X.; Lian, X.; Jiang, T.; Yang, L.; Yang, L. Embedded framework for clinical medical image segment anything in resource limited healthcare regions. npj Digit. Med. 2025. [Google Scholar] [CrossRef] [PubMed]
- Zhong, T.; Zhao, W.; Zhang, Y.; Pan, Y.; Dong, P.; Jiang, Z.; Jiang, H.; Zhou, Y.; Kui, X.; Shang, Y.; et al. ChatRadio-Valuer: A chat large language model for generalizable radiology impression generation on multi-institution and multi-system data. IEEE Trans. Biomed. Eng. 2025. [Google Scholar]
- Gao, Y.; Liu, D.; Li, Z.; Li, Y.; Chen, D.; Zhou, M.; Metaxas, D.N. Show and segment: Universal medical image segmentation via in-context learning. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 20830–20840. [Google Scholar]
- Zhang, X.; Ou, N.; Basaran, B.D.; Visentin, M.; Qiao, M.; Gu, R.; Matthews, P.M.; Liu, Y.; Ye, C.; Bai, W. A foundation model for lesion segmentation on brain MRI with mixture of modality experts. IEEE Trans. Med. Imaging 2025. [Google Scholar]
- Butoi, V.; Ortiz, J.; Ma, T.; Sabuncu, M.; Guttag, J.; Dalca, A. UniverSeg: Universal medical image segmentation. In Proceedings of the ICCV, 2023; pp. 21438–21451. [Google Scholar]
- Zhang, L.; Jindal, B.; Alaa, A.; Weinreb, R.; Wilson, D.; Segal, E.; Zou, J.; Xie, P. Generative AI enables medical image segmentation in ultra low-data regimes. Nat. Commun. 2025. [Google Scholar] [CrossRef] [PubMed]
- Xing, Z.; Ye, T.; Yang, Y.; Cai, D.; Gai, B.; Wu, X.J.; Gao, F.; Zhu, L. SegMamba-V2: Long-range sequential modeling Mamba for general 3D medical image segmentation. IEEE Trans. Med. Imaging 2025. [Google Scholar]
- Cox, J.; Liu, P.; Stolte, S.E.; Yang, Y.; Liu, K.; See, K.B.; Ju, H.; Fang, R. BrainSegFounder: Towards 3D foundation models for neuroimage segmentation. Med. Image Anal. 2024, 97, 103301. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Wang, K.; Yu, Y.; Lu, Y.; Xiao, W.; Sun, Z.; Liu, F.; Zou, Z.; Gao, Y.; Yang, L.; et al. Self-improving generative foundation model for synthetic medical image generation and clinical applications. Nat. Med. 2025, 31, 609–617. [Google Scholar] [PubMed]
- Gao, Z.; Zhang, G.; Liang, H.; Liu, J.; Ma, L.; Wang, T.; Guo, Y.; Chen, Y.; Yan, Z.; Chen, X.; et al. A lung CT vision foundation model facilitating disease diagnosis and medical imaging. Nat. Commun. 2025. [Google Scholar] [CrossRef] [PubMed]
- Blankemeier, L.; Kumar, A.; Cohen, J.P.; Liu, J.; Liu, L.; Van Veen, D.; Gardezi, S.J.S.; Yu, H.; Paschali, M.; Chen, Z.; et al. Merlin: a computed tomography vision-language foundation model and dataset. Nature 2026. [Google Scholar] [CrossRef] [PubMed]
- Yang, Z.; DSouza, N.; Megyeri, I.; et al. Decipher-MR: a vision-language foundation model for 3D MRI representations. npj Digit. Med. 2026. [Google Scholar] [CrossRef] [PubMed]
- Qamar, S.; Alkhatarishi, M.; Alam, F.; Fazil, M. Confidence-weighted semi-supervised learning for skin lesion segmentation using hybrid CNN-Transformer networks. IEEE Access, 2026. [Google Scholar]
- Qamar, S.; Fazil, M.; Ashraf, Z. Bridging annotation gaps: Hierarchical self-support learning for brain tumor segmentation. Diagnostics 2026. [Google Scholar] [CrossRef] [PubMed]
- Fischer, M.; Bartler, A.; Yang, B. Prompt tuning for parameter-efficient medical image segmentation. Med. Image Anal. 2024, 91, 103024. [Google Scholar] [CrossRef] [PubMed]
- Sudre, C.; Li, W.; Vercauteren, T.; Ourselin, S.; Cardoso, M. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In Proceedings of the DLMIA Workshop, 2017; pp. 240–248. [Google Scholar]
- Ma, J.; Chen, J.; Ng, M.; et al. Loss odyssey in medical image segmentation. Med. Image Anal. 2021, 71, 102035. [Google Scholar] [CrossRef] [PubMed]
- Berman, M.; Triki, A.; Blaschko, M. The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure. In Proceedings of the CVPR, 2018; pp. 4413–4421. [Google Scholar]
- Khattak, M.; Wasim, S.; Naseer, M.; et al. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the ICCV, 2023; pp. 15144–15154. [Google Scholar]
- Liu, H.; Tam, D.; Muqeeth, M.; et al. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Proc. NeurIPS 2022, Vol. 35, 1950–1965. [Google Scholar] [CrossRef]
- Zhang, Q.; Chen, M.; Bukharin, A.; et al. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. In Proceedings of the ICLR, 2023. [Google Scholar]
- Jia, M.; Tang, L.; Chen, B.C.; et al. Visual prompt tuning. In Proceedings of the ECCV, 2022; pp. 709–727. [Google Scholar]
- Zhang, R.; Zhang, W.; Fang, R.; et al. Tip-Adapter: Training-free adaption of CLIP for few-shot classification. In Proceedings of the ECCV, 2022; pp. 493–510. [Google Scholar]
- Landman, B.; Xu, Z.; Iglesias, J.; et al. Multi-atlas labeling beyond the cranial vault—Workshop and challenge. In Proceedings of the MICCAI 2015 Workshop, 2015. [Google Scholar]
- Bilic, P.; Christ, P.; Li, H.; et al. The Liver Tumor Segmentation benchmark (LiTS). Med. Image Anal. 2023, 84, 102680. [Google Scholar] [CrossRef] [PubMed]
- Zhang, K.; Yu, J.; Yan, Z.; Liu, Y.; Adhikarla, E.; Fu, S.; Chen, J.; Devarakonda, C.; He, Y.; Kang, J.; et al. BiomedGPT: A generalist vision-language foundation model for diverse biomedical tasks. Nat. Med. 2024, 30, 3613–3623. [Google Scholar] [CrossRef] [PubMed]
- Khattak, M.U.; Kunhimon, S.; Naseer, M.; Khan, S.; Khan, F.S. Unimed-clip: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities. arXiv 2024, arXiv:2412.10372. [Google Scholar]
- Chen, W.; Liu, T.; Mei, H.; Luo, H.; Sun, X. SegMamba: Long-range sequential modeling Mamba for 3D medical image segmentation. In Proceedings of the MICCAI, 2024; pp. 578–588. [Google Scholar]
- Lian, Y.; Xie, Y.; Jiang, Y.; Wang, L.; Yu, H. A data-efficient 3D medical vision-language model using only a 2D encoder. Sci. Rep. 2026. [Google Scholar] [CrossRef] [PubMed]
- Bakas, S.; Akbari, H.; Sotiras, A.; et al. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 2017, 4, 170117. [Google Scholar] [CrossRef] [PubMed]
- Baid, U.; Ghodasara, S.; Mohan, S.; et al. The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv 2021, arXiv:2107.02314. [Google Scholar]
- Bernard, O.; Lalande, A.; Zotti, C.; et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Trans. Med. Imaging 2018, 37, 2514–2525. [Google Scholar] [CrossRef] [PubMed]
- Christensen, M.; Vukadinovic, M.; Yuan, N.; Ouyang, D. Vision-language foundation model for echocardiogram interpretation. Nat. Med. 2024, 30, 1481–1488. [Google Scholar] [CrossRef] [PubMed]
- Rahman, A.; Valanarasu, J.; Hacihaliloglu, I.; Patel, V. Ambiguous medical image segmentation using diffusion models. In Proceedings of the CVPR, 2023; pp. 11536–11546. [Google Scholar]
- Xing, Z.; Wan, L.; Fu, H.; Yang, G.; Zhu, L. Diff-UNet: A diffusion embedded network for volumetric segmentation. arXiv 2023, arXiv:2303.10326. [Google Scholar]
- Cheng, Z.; Wei, Q.; Zhu, H.; Wang, Y.; Qu, L.; Shao, W.; Zhou, Y. Unleashing the potential of SAM for medical adaptation via hierarchical decoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024; pp. 3511–3522. [Google Scholar]
- Dalmaz, O.; Yurt, M.; Çukur, T. ResViT: Residual vision transformers for multimodal medical image synthesis. IEEE Trans. Med. Imaging 2022, 41, 2598–2614. [Google Scholar] [CrossRef] [PubMed]
- Özbey, M.; Dalmaz, O.; Dar, S.; et al. Unsupervised medical image translation with adversarial diffusion models. IEEE Trans. Med. Imaging 2023, 42, 3524–3539. [Google Scholar] [CrossRef] [PubMed]
- Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
- Liu, J.; Yang, H.; Zhou, H.Y.; Xi, Y.; Yu, L.; Li, C.; Liang, Y.; Shi, G.; Yu, Y.; Zhang, S.; et al. Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining. In Proceedings of the MICCAI, 2024; pp. 615–625. [Google Scholar]
- Zhang, X.; Tan, R.T. Mamba as a bridge: Where vision foundation models meet vision language models for domain-generalized semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025; pp. 14527–14537. [Google Scholar]
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual state space model. In Proceedings of the NeurIPS, 2024. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
- Khalid, N.; Qayyum, A.; Bilal, M.; Al-Fuqaha, A.; Qadir, J. Privacy-preserving artificial intelligence in healthcare: Techniques and applications. Comput. Biol. Med. 2023, 158, 106848. [Google Scholar] [CrossRef] [PubMed]
- Sheller, M.; Edwards, B.; Reina, G.; et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci. Rep. 2020, 10, 12598. [Google Scholar] [CrossRef] [PubMed]
- Pati, S.; Baid, U.; Edwards, B.; et al. Federated learning enables big data for rare cancer boundary detection. Nat. Commun. 2022, 13, 7346. [Google Scholar] [CrossRef] [PubMed]
- Park, S.; Kim, G.; Kim, J.; Kim, B.; Ye, J. Federated split task-agnostic vision transformer for COVID-19 CXR diagnosis. Proc. NeurIPS 2021, Vol. 34, 24617–24630. [Google Scholar]
- Johnson, A.; Pollard, T.; Berkowitz, S.; et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317. [Google Scholar] [CrossRef] [PubMed]
- Zhao, W.; Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. RaTEScore: A metric for radiology report generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024; pp. 15004–15019. [Google Scholar]
- Pellegrini, C.; Özsoy, E.; Busam, B.; Wiestler, B.; Navab, N.; Keicher, M. RaDialog: Large Vision-Language Models for X-Ray Reporting and Dialog-Driven Assistance. In Proceedings of the Medical Imaging with Deep Learning, 2025. [Google Scholar]
- Alam, H.M.T.; Srivastav, D.; Kadir, M.A.; Sonntag, D. Towards interpretable radiology report generation via concept bottlenecks using a multi-agentic RAG. In Proceedings of the European Conference on Information Retrieval, 2025; Springer; pp. 201–209. [Google Scholar]
- Li, C.Y.; Chang, K.J.; Yang, C.F.; Wu, H.Y.; Chen, W.; Bansal, H.; Chen, L.; Yang, Y.P.; Chen, Y.C.; Chen, S.P.; et al. Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation. Nat. Commun. 2025, 16, 2258. [Google Scholar] [CrossRef] [PubMed]
- Soin, A.; Bhatu, N.; Mehta, R.; et al. CheXstray: Real-time multi-modal data concordance for drift detection in medical imaging AI. In Proceedings of the CHIL, 2022; pp. 152–167. [Google Scholar]
- Seyyed-Kalantari, L.; Zhang, H.; McDermott, M.; Chen, I.; Ghassemi, M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat. Med. 2021, 27, 2176–2182. [Google Scholar] [CrossRef] [PubMed]
- Glocker, B.; Jones, C.; Roschewitz, M.; Winzeck, S. Risk of bias in chest radiography deep learning foundation models. Radiol. Artif. Intell. 2023, 5, e230060. [Google Scholar] [CrossRef] [PubMed]
- Yi, P.H.; Bachina, P.; Bharti, B.; Garin, S.P.; Kanhere, A.; Kulkarni, P.; Li, D.; Parekh, V.S.; Santomartino, S.M.; Moy, L.; et al. Pitfalls and best practices in evaluation of AI algorithmic biases in radiology. Radiology 2025, 315, e241674. [Google Scholar] [CrossRef] [PubMed]
- Drukker, K.; Chen, W.; Gichoya, J.; et al. Toward fairness in artificial intelligence for medical image analysis: identification and mitigation of potential biases. J. Med. Imaging 2023, 10, 061104. [Google Scholar] [CrossRef]
- Acosta, J.N.; Falcone, G.J.; Rajpurkar, P.; Topol, E.J. Multimodal biomedical AI. Nat. Med. 2022, 28, 1773–1784. [Google Scholar] [CrossRef] [PubMed]
- Tu, T.; Schaekermann, M.; Palepu, A.; Saab, K.; Freyberg, J.; Tanno, R.; Wang, A.; Li, B.; Amin, M.; Cheng, Y.; et al. Towards conversational diagnostic artificial intelligence. Nature 2025, 642, 442–450. [Google Scholar] [CrossRef] [PubMed]
- Jha, D.; Durak, G.; Das, A.; Sanjotra, J.; Susladkar, O.; Sarkar, S.; Rauniyar, A.; Kumar Tomar, N.; Peng, L.; Li, S.; et al. Ethical framework for responsible foundational models in medical imaging. Front. Med. 2025, 12, 1544501. [Google Scholar] [CrossRef]
- Jiang, Y.; Du, Y.; Xiong, K.; Huang, K.; Li, T.; Li, Z.; Zhang, M.; Gan, X.; Li, Q.; Liang, J.; et al. Foundation model-guided multi-view semi-supervised CT segmentation of liver tumors in resource-constrained settings. npj Digit. Med. 2026, 9, 31. [Google Scholar] [CrossRef] [PubMed]
- Chen, H.; Cai, Y.; Wang, C.; Chen, L.; Zhang, B.; Han, H.; Guo, Y.; Ding, H.; Zhang, Q. Multi-organ foundation model for universal ultrasound image segmentation with task prompt and anatomical prior. IEEE Trans. Med. Imaging 2024, 44, 1005–1018. [Google Scholar] [CrossRef]
- Heller, N.; Sathianathen, N.; Kalapara, A.; et al. The KiTS19 challenge data: 300 kidney tumor cases with clinical context, CT semantic segmentations, and surgical outcomes. arXiv 2019, arXiv:1904.00445. [Google Scholar]
- Heller, N.; Isensee, F.; Trofimova, D.; et al. The KiTS21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase CT. arXiv 2023, arXiv:2307.01984. [Google Scholar]
- Heller, N.; Isensee, F.; Maier-Hein, K.; et al. The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 challenge. Med. Image Anal. 2021, 67, 101821. [Google Scholar] [PubMed]
- Wang, M.; Lin, T.; Lin, A.; Yu, K.; Peng, Y.; Wang, L.; Chen, C.; Zou, K.; Liang, H.; Chen, M.; et al. Common and rare fundus diseases identification using vision-language foundation model with knowledge of over 400 diseases. Nat. Commun. 2025, 16, 1325. [Google Scholar] [CrossRef]
- Wang, M.; Lin, T.; Lin, A.; Yu, K.; Peng, Y.; Wang, L.; Chen, C.; Zou, K.; Cheung, C.Y.; Pang, C.P.; et al. Enhancing diagnostic accuracy in rare and common fundus diseases with a knowledge-rich vision-language model. Nat. Commun. 2025. [Google Scholar] [CrossRef] [PubMed]
- Luo, L.; Kim, S.E.; Zhang, X.; Kernbach, J.M.; Kenia, R.; Acosta, J.N.; Nathanson, L.A.; Haimovich, A.D.; Rodman, A.; Goh, E.; et al. A clinical environment simulator for dynamic AI evaluation. Nat. Med. 2026, 1–8. [Google Scholar] [CrossRef]
- Cui, H.; Wang, C.; Maan, H.; Pang, K.; Luo, F.; Duan, N.; Wang, B. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods 2024, 21, 1470–1480. [Google Scholar] [CrossRef] [PubMed]
- Theodoris, C.; Xiao, L.; Chopra, A.; et al. Transfer learning enables predictions in network biology. Nature 2023, 618, 616–624. [Google Scholar] [CrossRef] [PubMed]
- Bedi, S.; Liu, Y.; Orr-Ewing, L.; Dash, D.; Koyejo, S.; Callahan, A.; Fries, J.A.; Wornow, M.; Swaminathan, A.; Lehmann, L.S.; et al. Testing and evaluation of health care applications of large language models: A systematic review. JAMA 2025, 333, 319–328. [Google Scholar] [CrossRef] [PubMed]
- Sallam, M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare 2023, 11, 887. [Google Scholar] [CrossRef] [PubMed]
- Eriksen, A.; Möller, S.; Ryg, J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI 2024, 1. [Google Scholar] [CrossRef]
- Yan, K.; Wang, X.; Lu, L.; Summers, R.M. DeepLesion: Automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J. Med. Imaging 2018, 5, 036501. [Google Scholar] [CrossRef] [PubMed]
- Hung, A.L.Y.; Zheng, H.; Zhao, K.; Du, X.; Pang, K.; Miao, Q.; Raman, S.S.; Terzopoulos, D.; Sung, K. CSAM: A 2.5D cross-slice attention module for anisotropic volumetric medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024; pp. 5923–5932. [Google Scholar]
- Teo, Z.L.; Thirunavukarasu, A.J.; Elangovan, K.; Cheng, H.; Moova, P.; Soetikno, B.; Nielsen, C.; Pollreisz, A.; Ting, D.S.J.; Morris, R.J.; et al. Generative artificial intelligence in medicine. Nat. Med. 2025. [Google Scholar] [CrossRef] [PubMed]



| Survey | Year | Focus | Coverage | Contribution / Limitation |
|---|---|---|---|---|
| Azad et al. | 2023 | Foundation models in medical imaging | General foundation models | Broad scope, predates most VLM segmentation work |
| Shamshad et al. | 2023 | Transformers in medical imaging | Transformer architectures | Pre-foundation-model era, no VLM focus |
| Zhao et al. | 2023 | CLIP in medical imaging | Classification and retrieval | Limited segmentation coverage |
| Khan et al. | 2025 | Foundation models in medicine | Comprehensive survey | Broad coverage, less depth on segmentation taxonomy |
| Awais et al. | 2025 | Foundation models in vision | Vision foundation models | Not medical-specific |
| Wu et al. | 2025 | VLFM for 3D medical imaging | Report generation focus | Limited segmentation scope |
| Lee et al. | 2025 | Foundation models for MIS | Zero-shot evaluation | Narrative review, no taxonomy of VLM methods |
| Lee et al. | 2025 | VLFM for medical imaging | Current practices | General review, no segmentation taxonomy |
| Wang et al. | 2025 | VLM systematic review | Meta-analysis | Statistical focus, limited methodology depth |
| Li et al. | 2025 | VLM in medical image analysis | All medical analysis tasks | Broad, less segmentation depth |
| Liu et al. | 2025 | SAM for medical segmentation | SAM-only focus | No VLM coverage |
| This work | 2026 | VLM FM for medical segmentation | Architectures, adaptation, modalities, applications | Comprehensive taxonomy with focus on 2025-2026 work |

| Dataset | Year | Modality | Size | Annotation Type | Primary Use |
|---|---|---|---|---|---|
| BTCV (Synapse) | 2015 | CT | 30 scans, 13 organs | Voxel mask | Multi-organ benchmark |
| AMOS22 | 2022 | CT, MRI | 500 CT + 100 MRI, 15 organs | Voxel mask | Multi-organ versatility |
| LiTS | 2023 | CT | 131 scans (liver+tumor) | Voxel mask | Liver tumor segmentation |
| KiTS19/21 | 2019/21 | CT | 300/489 scans | Voxel mask | Kidney tumor |
| FLARE 2022 | 2022 | CT | 2200 unlabeled + 50 labeled | Voxel mask | Low-resource segmentation |
| BraTS 2021 | 2021 | MRI | 2000+ multi-seq cases | Voxel mask | Brain tumor segmentation |
| ACDC | 2018 | MRI | 100 patients | Voxel mask | Cardiac chamber seg. |
| TotalSeg | 2023 | CT | 1228 scans, 104 structs | Voxel mask | Universal anatomy |
| TotalSeg-MRI | 2025 | MRI | 616 MRI + 527 CT, 80 structs | Voxel mask | MRI universal |
| MIMIC-CXR | 2019 | CXR | 377K images + reports | Image-text | VLP, classification |
| CheXpert | 2019 | CXR | 224K images + labels | Image labels | Pathology classification |
| PadChest | 2020 | CXR | 160K images + reports | Image-text | VLP |
| VinDr-CXR | 2022 | CXR | 18K with bounding boxes | Bounding box | Detection |
| NIH ChestXray14 | 2017 | CXR | 112K images | Image labels | Classification |
| MSD | 2022 | Multi | 10 segmentation tasks | Voxel mask | Generalization |
| ISIC | 2019 | Dermatology | 25K images | Mask + class | Skin lesion |
| HAM10000 | 2018 | Dermatology | 10K images | Class + mask | Skin lesion |
| BUSI | 2020 | Ultrasound | 780 images | Mask | Breast lesion |
| Quilt-1M | 2023 | Pathology | 1M image-text pairs | Text caption | Pathology VLP |
| GLaS | 2017 | Pathology | 165 H&E images | Mask | Gland segmentation |
| Camelyon16/17 | 2016/17 | Pathology | 400+/1000+ WSI | Mask + class | Metastasis detection |
| BiomedParseData | 2025 | 9 modalities | 6M image-mask-text | Mask + text | Text-prompted seg. |
| CT-RATE | 2024 | Chest CT | 50K volumes + reports | Volume + text | 3D VLP |
| LIDC-IDRI | 2011 | CT | 1018 lung nodule scans | Mask + class | Lung nodule |
| REFUGE2 | 2022 | Retinal fundus | 1200 images | Mask + class | Glaucoma assessment |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).