Preprint
Review

This version is not peer-reviewed.

Language-Guided Segmentation of Medical Images: A Review of Foundation Models

Submitted: 19 June 2026

Posted: 22 June 2026


Abstract
Vision-language foundation models have transformed medical image segmentation over the past three years. These models pair large image encoders with text prompts, so a single model can segment many anatomical structures, lesion types, and imaging modalities through natural language. This survey reviews vision-language foundation models designed for medical image segmentation. We describe the technical background from contrastive vision-language pretraining to the Segment Anything Model and its medical variants. We propose a three-part taxonomy that covers text-prompt guided models, large language model embedded architectures, and hybrid frameworks. We examine adaptation strategies such as full fine-tuning, Low-Rank Adaptation, adapters, and prompt engineering. We organize the literature by modality and cover computed tomography, magnetic resonance imaging, pathology, chest radiography, and ultrasound. We discuss clinical uses such as organ segmentation, tumor delineation, and radiotherapy planning. We summarize evaluation metrics and benchmark datasets. We identify four open challenges: prompt dependence, mask hallucination, slow volumetric inference, and limited annotated data. We close with a research roadmap for trustworthy deployment, multimodal pretraining, and clinical integration.

1. Introduction

Medical image segmentation provides the pixel-level basis for many clinical workflows, supporting diagnosis, treatment planning, surgical guidance, and quantitative disease monitoring. Convolutional encoder-decoder networks such as nnU-Net long served as the standard segmentation tool across imaging modalities [1,2], and vision transformers later added longer-range context [3,4]. Self-adaptive Mamba-like attention mechanisms extend this design space by combining state-space dynamics with transformer-style cross-attention [5]. These architectures improved accuracy on many benchmarks but still required task-specific training data [6,7]. The shift to foundation models, trained once on very large datasets and then adapted to many downstream tasks [8], has changed how segmentation systems are built. In computer vision, the Segment Anything Model showed that a single model can handle many object types with simple visual prompts [9], while large language models brought general-purpose text understanding [10,11]. Vision-language models bridge these two lines: Contrastive Language-Image Pretraining (CLIP) established the recipe of dual encoders trained with a contrastive loss on hundreds of millions of image-text pairs [12,13], which was later ported to medical imaging through MedCLIP, BiomedCLIP, GLoRIA, and BioViL [14,15,16,17,18]. Text prompts opened a new way to control segmentation: instead of bounding boxes or click points, a clinician can write a phrase such as “liver tumor in arterial phase CT” and obtain a mask. Early work in this text-prompted paradigm includes LViT and the CLIP-Driven Universal Model [19,20]; more recent systems such as BiomedParse and BiomedParse-V scale it to dozens of modalities and hundreds of object types [21,22], and others couple large language models with segmentation decoders to reason over complex queries [23,24,25]. The pace of work has accelerated, with hundreds of papers published in 2024 and 2025 alone [26,27,28]. 
This rapid growth motivates a focused survey. Earlier reviews cover foundation models in medicine broadly [26,28], report generation [27], transformers [7], or CLIP in medical imaging [29], and a 2026 review surveys medical-image-segmentation foundation models [30]; Table 1 compares these with our work. The narrower question of vision-language foundation models for medical segmentation, and its text-driven workflows rather than SAM’s visual prompts, has not yet received in-depth treatment. This survey fills that gap.
This survey makes three contributions. First, we provide the first comprehensive taxonomy of vision-language foundation models specific to medical image segmentation. Second, we map the architectural and training choices of more than fifty recent methods to their reported benchmark performance. Third, we identify open problems and candidate solutions around trustworthiness, efficiency, and clinical integration, with close attention to 2025–2026 work, which accounts for about thirty percent of the references cited here. The motivation is practical: manual segmentation of one CT scan can take thirty minutes to several hours, and radiotherapy contouring takes one to four hours per patient [31,32,33]. Foundation models can cut this cost by an order of magnitude while improving cross-institution consistency and enabling central updates [8,34,35]. Natural-language interfaces also lower the barrier to quantitative analysis [23,24,36], and open models such as MedSAM, BiomedCLIP, BiomedParse, and LLaVA-Med are central to this effort [15,21,37,38,39].
The survey is organized as follows (Figure 1). Section 2 gives technical background on vision-language models, the Segment Anything Model and its medical variants, and the text-guided paradigm. Section 3 presents a three-part taxonomy of text-prompt guided, LLM-embedded, and hybrid models. Section 4 covers adaptation strategies (full fine-tuning, parameter-efficient methods, and prompt engineering) [40,41]. Sections 5 and 6 review models by imaging modality and by clinical application, respectively. Section 7 summarizes evaluation metrics and datasets [42,43,44,45]. Section 8 discusses challenges and future directions, and Section 9 concludes.

2. Background

This section reviews the building blocks of vision-language foundation models for medical image segmentation. We start with general vision-language models, then describe the Segment Anything Model and its derivatives, and finally explain the text-guided segmentation paradigm that links these two lines of work. Figure 2 places the key models on a timeline from 2021 to 2026.

2.1. Vision-Language Models

Vision-language models learn aligned image-text representations through contrastive learning on image-caption pairs [12,13]. CLIP introduced this recipe at scale: separate image and text encoders map inputs into a shared space where a contrastive loss aligns matched pairs, enabling zero-shot classification [12]. Later designs explored tighter modality fusion and caption bootstrapping to clean web data, and instruction tuning produced assistants that answer free-form questions about images, the basis for later medical multimodal models.
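The contrastive objective described above can be illustrated with a minimal NumPy sketch. It assumes a batch in which row i of the image embeddings and row i of the text embeddings form a matched pair; the function names, the toy data, and the temperature of 0.07 are illustrative assumptions, not details taken from any specific model in this survey.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (N, N) scaled cosine similarities
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


# Toy data (assumption): "images" are noisy copies of their captions.
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = text + 0.01 * rng.normal(size=(4, 8))
shuffled = aligned[::-1]                          # deliberately mismatched pairs
loss_aligned = clip_style_loss(aligned, text)
loss_shuffled = clip_style_loss(shuffled, text)
```

In this toy setting the loss is lower when matched pairs sit on the diagonal of the similarity matrix than when the pairing is reversed, which is the signal that drives both encoders toward a shared space.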
Medical imaging needs specialized models because visual and textual distributions differ from web data [14,29]: images focus on anatomy and pathology, and text comes from reports with technical vocabulary [46,47]. MedCLIP decoupled image-text pairs to overcome the small size of paired medical data [14], GLoRIA introduced global-local contrastive learning to align text tokens with image regions [16], and BioViL/BioViL-T refined the approach for chest radiology, with BioViL-T modeling temporal change across follow-up scans [17,18]. BiomedCLIP extended the recipe to fifteen million biomedical image-text pairs from PubMed Central, covering radiology, pathology, and microscopy [15], and PMC-CLIP collected a related large-scale dataset [48]. LLaVA-Med adapted instruction tuning to biomedical figures and captions [37], and large language models were specialized for clinical question answering. RadFM and a later generalist radiology model targeted three-dimensional volumes [49,50]. Med-Gemini reached expert-level performance on several clinical benchmarks [39,51,52], clinician-VLM collaboration improved radiology report generation [53], and CT-CLIP brought contrastive pretraining to volumetric chest CT paired with structured reports [54]. Pathology has its own line of vision-language foundation models. PLIP trained on histopathology image-caption pairs from medical Twitter [55,56], and Quilt-1M assembled a larger image-text dataset from educational videos [57]. UNI and Virchow scaled image-only foundation models for whole-slide images [58,59], and H-optimus-0 is a recent open-source example. PathChat and PathAsst add conversational interfaces [60], while RetCCL and masked-image-modeling backbones support retrieval and self-supervised pretraining [61,62]. These models supply rich visual representations that can be coupled with segmentation heads.
Across these models, paired text supervision yields image features that transfer well to dense tasks such as segmentation, and many recent segmentation methods build on pretrained CLIP-style or self-supervised encoders rather than training from scratch [63,64]. A practical limitation for segmentation is that standard CLIP produces a single global feature with limited spatial structure [12]; CLIP Surgery and related designs expose spatial features for dense prediction [65]. The text side needs domain adaptation as well, because medical vocabulary and semantics differ substantially from web text [14,29]: domain-specific text encoders such as Bio_ClinicalBERT, BioBERT, and PubMedBERT represent medical text far better than generic CLIP encoders [14,46,47], and GatorTron specializes a large language model for electronic health records [66]. Multilingual support for non-English documentation remains an open need [51,67].

2.2. Segment Anything and Medical Variants

The Segment Anything Model (SAM) was a turning point for general-purpose segmentation [9]. Trained on over one billion masks across eleven million images, it accepts visual prompts (points, boxes, rough masks) and returns a mask. It combines a heavy Vision Transformer image encoder [3], a lightweight prompt encoder, and a small mask decoder, so image features can be cached and each new prompt decoded in near real time. SAM’s SA-1B data was generated by a model-in-the-loop process in which expert annotators refined model-suggested masks, an approach later reused to augment partially annotated medical data [9,38,68,69]. SAM transfers well to natural images, and 2023 zero-shot evaluations on medical data [70,71,72,73,74] found a consistent pattern: it works well on high-contrast objects with clear boundaries but struggles on small lesions, low-contrast structures, and 3D volumes, with chest radiography and dermatology outperforming abdominal CT and histopathology. The encoder also lacks medical-specific features and processes each slice independently, which motivated medical SAM variants [38,68,69,75].
MedSAM is the most influential of these efforts [38,76]. It collected more than 1.5 million image-mask pairs across ten modalities and fine-tuned SAM’s image encoder and mask decoder with bounding-box prompts, keeping only the prompt encoder frozen. It achieves strong Dice across radiology, pathology, and microscopy. SAM-Med2D scaled the dataset to roughly 4.6 million images and 19.7 million masks and added encoder adapter layers to bridge the natural-medical gap [68,77]. SAM-Med3D rebuilt the architecture for volumetric inputs with 3D positional encodings and attention, trained on more than twenty thousand 3D images [69]. SegVol supports more than two hundred categories with spatial and text prompts [78], and SegFM3D continues the 3D foundation-model line [79].
Several variants target specific needs: SAMUS adapts SAM to noisy ultrasound [80], SegmentAnyBone to multi-sequence bone MRI [81], AdaptiveSAM adds bias-tuning and text prompts for surgical scenes [82], and 3DSAM-adapter extends 2D SAM to promptable volumetric tumor segmentation [83], building on early task-specific adaptation [84]. Large experimental and empirical studies characterized how best to adapt SAM for medical use [70,85]. SAM 2 (2024) supports images and videos through a memory mechanism with an improved encoder and temporal reasoning; MedSAM2 extends it to medical images and short videos [76]. SAM 3 adds concept-aware segmentation by incorporating semantic knowledge into the prompt framework [86], MedSAM3 brings concept-aware prompting and parameter-efficient fine-tuning to the third generation [87], and SAM2LoRA reaches state-of-the-art retinal fundus segmentation while updating fewer than five percent of the parameters [88].
These models share a common limitation: they are built around visual prompts, so a clinician must still place points or boxes for each object of interest. This is workable for individual cases but does not scale to large studies or to natural-language-driven applications, the gap that the text-guided segmentation paradigm addresses.

2.3. Text-Guided Segmentation Paradigm

Text-guided segmentation produces a mask from an image and a natural-language description of the target, ranging from a class label such as “liver” to a clinical phrase such as “hyperdense lesion in the right lobe” or a free-form question. The model must translate the text into a spatial decision, a task at the intersection of vision-language alignment and dense prediction.
Three lines of work converged to make text-guided segmentation possible. Open-vocabulary segmentation in natural images uses CLIP text embeddings as classifiers for arbitrary concepts, with CLIPSeg and CRIS adapting the idea to dense prediction through small decoders and mask transformers [12,89,90]. Referring-expression segmentation, in models such as LAVT, ReSTR, and CRIS trained on RefCOCO-style data, segments objects described by short phrases by fusing visual and textual features early in a transformer. Reasoning segmentation then added world knowledge: LISA emits a special segmentation token whose hidden state decodes into a mask [23], LISA++ extended it with instance-level reasoning and multi-turn dialog [24], follow-ups generalized the paradigm to video and 3D data [25], and SEEM unified multiple prompt types in a single mask decoder [91]. Text-guided segmentation in medicine faces additional challenges: medical text is technical, with synonyms, abbreviations, and Latin equivalents, and 3D context does not transfer easily from 2D pretraining [20,21,22]. Several systems nonetheless demonstrate strong results: LViT injects text into a U-Net-like backbone [20], the CLIP-Driven Universal Model guides segmentation of twenty-five organs and six tumor types [19], BiomedParse handles segmentation, detection, and recognition for eighty-two object types across nine modalities from text [21], BiomedParse-V extends to volumetric data [22], SAT targets general text-prompted radiology [92], and ZePT performs zero-shot pan-tumour segmentation via CLIP-based query disentangling [93].
These systems show that text prompts can replace many manual prompts in medical segmentation. They also show that natural-language interfaces can support broader use cases such as report grounding, education, and clinical decision support [21,24,94]. The remainder of this survey looks more closely at how these models are built, how they are adapted to clinical data, and where they still fall short.
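The open-vocabulary idea in this subsection, using a text embedding as a classifier over dense image features, can be sketched in a few lines. This is a minimal illustration under stated assumptions: `pixel_features` and `text_embedding` stand in for the outputs of hypothetical image and text encoders that already share an embedding space, and the `scale` and `threshold` values are arbitrary.

```python
import numpy as np


def text_prompted_mask(pixel_features, text_embedding, threshold=0.5, scale=10.0):
    """Score every pixel against one text embedding and threshold the result.

    pixel_features: (H, W, D) dense image features (hypothetical encoder output).
    text_embedding: (D,) embedding of the prompt (hypothetical text encoder output).
    Returns a boolean mask and the per-pixel probability map, both (H, W).
    """
    feats = pixel_features / np.linalg.norm(pixel_features, axis=-1, keepdims=True)
    query = text_embedding / np.linalg.norm(text_embedding)
    cosine = feats @ query                          # (H, W) similarity map
    prob = 1.0 / (1.0 + np.exp(-scale * cosine))    # squash to (0, 1)
    return prob > threshold, prob


# Toy feature map (assumption): a 2 x 2 "liver" patch on a 4 x 4 background.
liver = np.array([1.0, 0.0, 0.0])
background = np.array([-0.2, 1.0, 0.0])
features = np.tile(background, (4, 4, 1))
features[:2, :2] = liver
mask, prob = text_prompted_mask(features, text_embedding=liver)
```

Real systems replace the dot product with a learned decoder, but the sketch shows why well-aligned text and pixel features are enough to turn a phrase into a spatial decision.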

3. Method Taxonomy

We propose a taxonomy with three categories: text-prompt guided models, which inject text embeddings into a segmentation network; large language model embedded architectures, in which a multimodal language model emits a token that is decoded into a mask; and hybrid frameworks, which loosely couple vision-language pretraining with a separate segmentation backbone. Figure 3 illustrates the taxonomy and Figure 4 compares the three architectural patterns. Many recent models combine elements of more than one category [21,95,96]; we assign each to its primary category by how text influences the final mask. Table 2 comprehensively compares fifty-three methods across six categories (SAM-based, text-prompt guided, LLM-embedded, hybrid frameworks, specialised foundation models, and deep-learning baselines); the three taxonomy categories are a subset, with the others included for context.

3.1. Text-Prompt Guided Segmentation

Text-prompt guided segmentation models use text embeddings as conditioning signals inside a segmentation network: a pretrained text encoder such as CLIP, BioClinicalBERT, or a domain-specific transformer [12,15,46,47] produces an embedding that is fused with image features at one or more network stages to yield a mask [95], with training on triples of image, mask, and text description.
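One common way to fuse a text embedding with image features is cross-attention, in which image tokens act as queries over the text tokens. The sketch below is a single-head NumPy illustration under assumed shapes; the projection matrices `w_q`, `w_k`, and `w_v` are random stand-ins for learned weights, and the function is not the fusion module of any particular model discussed here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_text_into_image(image_tokens, text_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens query the text tokens.

    image_tokens: (N, D) flattened image features; text_tokens: (T, D).
    Returns the text-conditioned image tokens (N, D) and the attention map (N, T).
    """
    q = image_tokens @ w_q                                   # (N, D)
    k = text_tokens @ w_k                                    # (T, D)
    v = text_tokens @ w_v                                    # (T, D)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, T)
    return image_tokens + attn @ v, attn                     # residual update


# Toy inputs (assumption): a flattened 4 x 4 feature map and five prompt tokens.
rng = np.random.default_rng(0)
d = 8
image_tokens = rng.normal(size=(16, d))
text_tokens = rng.normal(size=(5, d))
w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
fused, attn = fuse_text_into_image(image_tokens, text_tokens, w_q, w_k, w_v)
```

Each row of `attn` shows how strongly one image location attends to each prompt token, which is also a convenient quantity to visualize when debugging prompt sensitivity.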
LViT was one of the first medical models in this category [20]: a U-Net backbone with bottleneck cross-attention that fuses clinical-note text with image features, evaluated on chest X-ray COVID-19 lesions with clear gains over text-free baselines. TGANet introduced text-guided attention for polyp segmentation, and Lee et al. used text-guided cross-position attention to refine spatial attention [97]. Adhikari et al. combined synthetic images with vision-language segmentation in echocardiography [98]. CRIS, originally for natural images, was an early CLIP-driven referring segmentation model that influenced many medical adaptations [90]. The CLIP-Driven Universal Model scales text-guided segmentation: a frozen CLIP encoder produces per-class embeddings that serve as queries in the mask decoder, with a Swin Transformer image backbone and masked back-propagation to handle partially labeled data, yielding one model that covers twenty-five organs and six tumor types [4,19]. Liu et al. extended it with dynamic class addition [99]. BiomedParse generalizes the text-prompt guided paradigm to nine modalities and eighty-two object types [21]. Built on SEEM, its transformer decoder consumes image features and text embeddings [91], trained on six million GPT-4-harmonized triples from forty-five datasets [11], reaching median Dice above ninety percent and outperforming bounding-box methods on irregular shapes; an independent discriminator filters slices lacking the prompted object. BiomedParse-V extends this to volumetric data via fractal 2.5D encoding [22], MedSegX targets open-world unseen object types [100], and SAT trains text-prompted radiology segmentation on more than seventy CT/MRI/PET datasets, matching task-specific networks while supporting flexible prompts [92].
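Training on partially labeled data, as mentioned above for the CLIP-Driven Universal Model, requires that classes a dataset never annotated do not contribute to the loss. The following sketch shows one plausible reading of that idea as a class-masked binary cross-entropy; the function name and the toy tensors are assumptions, and the exact formulation in [19] may differ.

```python
import numpy as np


def masked_bce(pred_prob, target, labeled_classes, eps=1e-7):
    """Binary cross-entropy averaged only over classes that a dataset annotates.

    pred_prob, target: (C, H, W) per-class probabilities and binary masks.
    labeled_classes: (C,) boolean vector; False marks classes with no annotation,
    whose unknown targets must not contribute to the loss or its gradient.
    """
    p = np.clip(pred_prob, eps, 1.0 - eps)
    bce = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))  # (C, H, W)
    per_class = bce.mean(axis=(1, 2))                               # (C,)
    weights = labeled_classes.astype(float)
    return float((per_class * weights).sum() / max(weights.sum(), 1.0))


# Toy case (assumption): class 2 is unannotated, so its all-zero target is a placeholder.
pred = np.full((3, 4, 4), 0.9)
target = np.zeros((3, 4, 4))
target[:2] = 1.0
loss_masked = masked_bce(pred, target, np.array([True, True, False]))
loss_unmasked = masked_bce(pred, target, np.array([True, True, True]))
```

Without the mask, the confident prediction for the unannotated class is penalized as a false positive even though the structure may be present in the image, which is the failure mode the masking is meant to avoid.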
Several models add more sophisticated text fusion. Cross-Modal Conditioned Reconstruction trains the text-image fusion module by reconstructing masked image regions from text [95], and TGCA-PVT uses cross-position attention with a pyramid vision transformer backbone [97]. Transformer-guided multi-scale fusion (ScaleFusionNet) and dense encoder-decoder designs provide strong baselines for skin lesion and dermatological segmentation [101,102]. A practical advantage is deployment in text-driven workflows: a radiologist dictates a target and the model produces the mask, though mask quality depends on prompt quality. The next category adds reasoning through large language models.

3.2. LLM-Embedded Architectures

Large language model embedded architectures couple a multimodal large language model with a segmentation decoder: the language model handles the conversation and emits a special token whose hidden state encodes the segmentation target, which the decoder converts into a mask. This lets the system handle implicit queries that require reasoning, world knowledge, or multi-step instructions [23,24]. LISA established the embedding-as-mask paradigm [23]: a special <SEG> token is added to a multimodal language model, and its hidden embedding is passed to a SAM-style mask decoder alongside image features, trained end-to-end so the model answers reasoning queries (e.g., “which object can hold liquid”) with both text and a mask. LISA++ added instance-level reasoning and multi-turn dialog [24]. For biomedicine, several models adapt this paradigm. MedPLIB applies the LISA-style architecture with SAM-Med2D as the mask backbone and is trained on the MeCoVQA region-text dataset [103]. M3D and Med-2E3 bring three-dimensional reasoning segmentation to multimodal language models, the latter injecting 2D priors [25,104]. RadFM and CheXagent provide radiology backbones that can be coupled with segmentation heads [49,105], and SAM 2 adds a memory mechanism for image and video segmentation [106].
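The embedding-as-mask idea can be made concrete with a small sketch: find the position of the segmentation token in the generated sequence, project its hidden state into the decoder's feature space, and use it as a query over dense image features. Everything here is a simplified assumption, including the token id `SEG_TOKEN_ID`, the projection `proj`, and the use of a plain dot product in place of a SAM-style mask decoder.

```python
import numpy as np

SEG_TOKEN_ID = 32000  # hypothetical vocabulary id for the added <SEG> token


def seg_token_to_mask(token_ids, hidden_states, proj, pixel_features):
    """Decode a mask from the hidden state at the <SEG> position.

    token_ids: (L,) generated ids; hidden_states: (L, D_llm) last-layer states.
    proj: (D_llm, D_pix) projection into the decoder's feature space.
    pixel_features: (H, W, D_pix) dense image features.
    Returns None when the model did not emit <SEG>.
    """
    positions = np.flatnonzero(token_ids == SEG_TOKEN_ID)
    if positions.size == 0:
        return None
    query = hidden_states[positions[0]] @ proj   # (D_pix,)
    logits = pixel_features @ query              # (H, W)
    return logits > 0.0


# Toy inputs (assumption): a 2 x 2 feature map and a four-token output.
token_ids = np.array([1, 5, SEG_TOKEN_ID, 2])
hidden_states = np.zeros((4, 6))
hidden_states[2, 0] = 1.0                        # state at the <SEG> position
proj = np.eye(6)[:, :3]
pixel_features = np.array([[[1.0, 0, 0], [-1.0, 0, 0]],
                           [[-1.0, 0, 0], [1.0, 0, 0]]])
mask = seg_token_to_mask(token_ids, hidden_states, proj, pixel_features)
no_mask = seg_token_to_mask(np.array([1, 5, 2]), hidden_states[:3], proj, pixel_features)
```

Returning `None` when no segmentation token is emitted mirrors a design point that matters clinically: the language model can decline to segment rather than being forced to produce a mask.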
Two design choices matter. The first is whether the language model is frozen (preserving general knowledge) or fine-tuned (gaining medical precision but risking drift), with LoRA-based adaptation a common compromise [25,40,103]. The second is the fusion point, which ranges from a single token’s hidden state (LISA) to multiple tokens or the full output sequence [95,107]. Computational cost is a practical concern: running a multimodal language model per query can take several seconds, which is acceptable offline but limits interactive use and motivates distillation and lightweight backbones [67,103,108].
A further concern is hallucination: the language model can emit plausible but wrong text alongside the segmentation token, yielding an inaccurate mask or one that references an absent object [109,110]. Uncertainty calibration, fact-grounded preference optimization, and presence verification reduce these failures [22,111,112], which we revisit in Section 8.

3.3. Hybrid Frameworks

Hybrid frameworks loosely couple a pretrained vision-language model, which supplies semantic information from text, with a separate segmentation model that produces masks; the coupling can occur at the input, intermediate, or output stage [113,114].
Input-stage coupling converts text into prompts for a segmentation model: SaLIP cascades SAM and CLIP, generating candidate masks with SAM and scoring them against the text prompt with CLIP, requiring no extra training and supporting optional test-time adaptation [114]. Intermediate coupling injects vision-language features into the segmentation model. MedCLIP-SAM aligns medical images and text with MedCLIP, then uses the attention maps to generate point prompts for SAM and refines the resulting masks with text-aligned features [113]. Bio2Vol adapts BiomedParse to volumetric data through dual-rate sampling and cross-slice attention while preserving its pretrained 2D capabilities. Output-stage coupling validates or refines masks: CLIP text-image similarity filters false positives [21,22], language models generate explanations for review [94], and model-agnostic refiners fit into hybrid pipelines [115]. Unified lightweight frameworks can serve multiple clinical sites in multi-task, multi-center settings [116]. Overall, the hybrid approach reuses the strengths of existing pretrained models without retraining, allows modular updates as better components appear, and spans many modalities and tasks, at the cost of integration complexity and possible error propagation between components [106,117].
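A cascade in the spirit of the SAM-plus-CLIP pipelines described above can be sketched as a scoring step over candidate masks. The sketch assumes that a promptable segmenter has already produced the candidates and that a CLIP-style model has already embedded each masked crop and the text prompt; `select_mask_by_text` and the toy embeddings are hypothetical and do not reproduce the published SaLIP implementation.

```python
import numpy as np


def select_mask_by_text(candidate_masks, region_embeddings, text_embedding):
    """Pick the candidate mask whose region embedding best matches the text.

    candidate_masks: list of (H, W) boolean masks from a promptable segmenter.
    region_embeddings: (K, D) image-encoder embeddings of each masked crop.
    text_embedding: (D,) embedding of the text prompt.
    Both embedding inputs are stand-ins for a CLIP-style model's outputs.
    """
    regions = region_embeddings / np.linalg.norm(region_embeddings, axis=1, keepdims=True)
    query = text_embedding / np.linalg.norm(text_embedding)
    scores = regions @ query                     # (K,) cosine similarities
    best = int(np.argmax(scores))
    return candidate_masks[best], best, scores


# Toy candidates (assumption): three masks with hand-made embeddings.
masks = [np.zeros((4, 4), dtype=bool) for _ in range(3)]
masks[0][:2, :2] = True
masks[1][2:, 2:] = True
masks[2][:, 0] = True
region_emb = np.array([[1.0, 0.0, 0.0], [0.1, 1.0, 0.0], [0.0, 0.2, 1.0]])
prompt_emb = np.array([0.0, 1.0, 0.1])
chosen_mask, chosen_index, scores = select_mask_by_text(masks, region_emb, prompt_emb)
```

Because neither component is retrained, errors compound: if the segmenter never proposes the correct region, no amount of text scoring can recover it, which is the error-propagation cost noted above.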
Recent work explores tighter integration. SAM-CLIP merges SAM and CLIP parameters into one model handling segmentation and zero-shot classification [118], while others distill a teacher VLM into a student segmenter [89]. CycleSAM uses cycle-consistent feature matching for few-shot surgical scenes [117], HiFormer supplies hierarchical multi-scale transformer features [119], SAM-OCTA targets retinal OCTA [120], and VILA-M3 injects medical expert knowledge [96]. These approaches blur the line between hybrid and integrated architectures, as surveyed for biomedical segmentation by Lee et al. [121]. Many of the best systems on recent challenges use hybrid designs. The CVPR 2025 challenge on Foundation Models for Text-guided 3D Biomedical Image Segmentation was won by an enhanced version of BiomedParse that combines text-prompted segmentation with a separate volumetric encoder [22], and similar designs appear among the top entries to other recent challenges [79]. SegFM3D extends 3D foundation models to universal medical image segmentation [79], Wu and Xu demonstrated universal one-prompt medical image segmentation across organs and tumors [108], and Segment Any Medical Object proposes a generalist segmentation model that supports diverse anatomical structures and lesion types under a single framework.
The category boundaries are not sharp. BiomedParse is primarily text-prompt guided but adds a meta-object classifier that refuses absent prompts, an LLM-like trait without a full language model [21]; LISA is the prototypical LLM-embedded design yet decodes through a SAM-style mask head [23,24]; and hybrids such as SaLIP and MedCLIP-SAM combine SAM and CLIP without retraining [113,114]. A useful organizing axis is therefore where text enters the decision (Figure 4): inside a unified architecture, through a language-model token, or via a separate component. This choice in turn shapes model size, cost, and data needs: BiomedParse has roughly 100M parameters and trains on image-mask-text triples, whereas LISA adds a 7–13B language model and instruction-tuning data, and hybrids reuse pretrained components without joint training [21,23,92]. Table 2 lists these methods with their reported performance.
Figure 4. Architectural patterns for the three categories of vision-language segmentation models. Text-prompt guided models inject text embeddings into a unified architecture. LLM-embedded models route text through a multimodal language model that produces a segmentation token. Hybrid models couple separate vision-language and segmentation components.
Table 2. Comparison of vision-language and segmentation foundation models for medical images. The table covers fifty-three methods from 2021 to 2026, grouped into six categories. Best Dice values are illustrative and represent reported peak performance on representative benchmarks. A dash (-) indicates that a single Dice score was not reported. Bold denotes best reported value within a section.
| Method | Year | Architecture | Modality | Datasets | Dice | Adaptation |
| --- | --- | --- | --- | --- | --- | --- |
| SAM-Based Models | | | | | | |
| SAM [9] | 2023 | ViT-H + prompt enc. | Natural images | SA-1B (1B masks) | - | Pretrain |
| MedSAM [38] | 2024 | SAM + medical FT | Multi-modal | 1.5M med. image pairs | 0.85 | Full FT |
| SAM-Med2D [68] | 2023 | SAM + adapter | 10 modalities | 4.6M imgs, 19.7M masks | 0.83 | Adapter |
| SAM-Med3D [69] | 2023 | Native 3D SAM | Volumetric | 21K imgs, 131K masks | 0.78 | Full FT |
| SAMed [84] | 2023 | SAM + LoRA | Multi-organ CT | Synapse BTCV | 0.82 | LoRA (0.1%) |
| Med-SA [75] | 2025 | SAM + Adpt. + LoRA | 5 modalities | 17 tasks | 0.84 | Adpt. + LoRA |
| AdaptiveSAM [82] | 2024 | SAM + bias tuning | Surg., US, X-ray | Multiple | 0.81 | Bias tuning |
| SegVol [78] | 2024 | Volumetric SAM | CT | 200 organs, 96K vol. | 0.83 | Full FT |
| 3DSAM-adapter [83] | 2024 | SAM 2D→3D adapt. | CT (tumour) | LiTS, KiTS, pancreas CT | 0.86 | Adapter |
| SAM-OCTA [120] | 2025 | SAM + OCTA prompt tuning | Retinal OCTA | ROSE, OCTA-500 | 0.87 | Full FT |
| SAM2LoRA [88] | 2025 | SAM 2 + LoRA | Retinal fundus | 11 datasets | 0.93 | LoRA (<5%) |
| MedSAM2 [76] | 2025 | SAM 2 + medical FT | Image + video | Multi-modal | 0.86 | Full FT |
| MedSAM3 [87] | 2025 | SAM 3 + LoRA | Multi-modal | Concept-aware | 0.84 | LoRA |
| EmbedMedSAM [122] | 2025 | SAM embed. + edge optim. | Multi-modal | Resource-limited settings | 0.82 | Adapter |
| Text-Prompt Guided Models | | | | | | |
| LViT [20] | 2024 | U-Net + text fusion | Chest X-ray | QaTa-COV19 | 0.83 | Full FT |
| Cross-modal CR [95] | 2024 | Cross-modal recon. CLIP | CT, MRI | Multiple organ datasets | 0.84 | Full FT |
| CLIP-Driven UM [19] | 2023 | CLIP queries + Swin | Abdominal CT | BTCV, LiTS, KiTS | 0.86 | Full FT |
| Universal VLM [99] | 2024 | Extensible CLIP + dec. | Abdominal CT/MRI | BTCV, 15 organs | 0.87 | PEFT |
| ZePT [93] | 2024 | CLIP query disentangle | Pan-tumour CT | Multi-source | 0.77 | Self-prompt |
| BiomedParse [21] | 2025 | SEEM + GPT-4 harm. | 9 modalities | BiomedParseData (6M) | 0.94 | Full pretrain |
| BiomedParse-V [22] | 2025 | FVE + ISD module | CT, MRI, micro. | CVPR 2025 challenge | 0.86 | Full pretrain |
| MedSegX [100] | 2025 | Generalist FM + open vocab | Multi-modal | 100+ datasets | 0.85 | Full pretrain |
| SAT [92] | 2025 | CLIP + transf. dec. | Radiology | 70+ datasets | 0.84 | Full FT |
| LLM-Embedded Architectures | | | | | | |
| LISA [23] | 2024 | MLLM + <SEG> tok. | Nat. + reasoning | ReasonSeg, refCOCO | - | Full FT (LLM) |
| LISA++ [24] | 2024 | LISA + inst. reasoning | Nat. + medical | Extended ReasonSeg | - | Full FT |
| ChatRadio-Valuer [123] | 2025 | LLM + rad. impression dec. | Chest X-ray | Multi-inst. CXR | - | Full FT |
| MedPLIB [103] | 2025 | MLLM + SAM-Med2D | Multi-mod. med. | MeCoVQA | 0.81 | LoRA + Adpt. |
| M3D [25] | 2025 | 3D MLLM + decoder | 3D CT | M3D-Seg | 0.79 | Full FT |
| Show & Segment [124] | 2025 | In-context MLLM + dec. | Multi-modal med. | 12 diverse datasets | 0.83 | Zero-shot |
| Hybrid and Other Frameworks | | | | | | |
| MedCLIP-SAM [113] | 2024 | MedCLIP + SAM | Multi-modal | Multiple | 0.80 | Hybrid |
| SaLIP [114] | 2024 | SAM + CLIP cascade | Multi-modal | Multiple | 0.74 | Zero-shot |
| VILA-M3 [96] | 2025 | VLM + medical expert know. | Multi-modal | BTCV, LiTS, BraTS | 0.86 | PEFT |
| SegFM3D [79] | 2025 | 3D foundation model | Multi-modal 3D | Multi-source | 0.83 | Pretrain |
| Specialized Foundation Models | | | | | | |
| MoME (lesion) [125] | 2025 | Mixture of mod. experts | Brain MRI lesions | Multi-source MRI | 0.80 | Full FT |
| UniverSeg [126] | 2023 | Few-shot universal | 16 modalities | MegaMedical | 0.72 | Few-shot |
| GenSeg [127] | 2025 | Diffusion gen. + seg. | Multi-modal | Ultra low-data regimes | 0.81 | Hybrid gen. |
| SegMamba-V2 [128] | 2026 | Mamba SSM 3D long-range | Volumetric CT/MRI | Multi-organ 3D | 0.88 | Full FT |
| TotalSeg. [34] | 2023 | nnU-Net based | CT (104 structs.) | 1204 CTs | 0.94 | Full train |
| TotalSeg. MRI [35] | 2025 | Seq.-independent | MRI (multi-organ) | 616 MRI + 527 CT | 0.84 | Full train |
| BrainSegFounder [129] | 2024 | Self-sup. 3D ViT | Brain MRI | Multi-source neuroimaging | 0.91 | PEFT |
| SAMUS [80] | 2024 | SAM + US adapt. | Ultrasound | Multi-source US | 0.80 | Adapter |
| SegAnyBone [81] | 2024 | SAM + bone FT | MRI bones | Multi-seq. MRI | 0.82 | Full FT |
| Self-imp. FM [130] | 2025 | Generative FM + self-imp. | CT, MRI, X-ray | Multi-organ, multi-modal | 0.85 | Full pretrain |
| LCTfound [131] | 2026 | Lung CT ViT FM | Chest CT | LIDC-IDRI, NLM, LUNA16 | 0.89 | Full pretrain |
| Merlin [132] | 2026 | CT VLM + report gen. | Chest CT | Radiology reports + seg. | - | Full pretrain |
| Decipher-MR [133] | 2026 | 3D MRI VLM encoder | Multi-seq. MRI | Diverse MRI tasks | - | Full pretrain |
| CT-CLIP [54] | 2026 | Volumetric CLIP | Chest CT | CT-RATE (50K) | - | Pretrain |
| Deep Learning Baselines (CNN/Transformer, no text prompt) | | | | | | |
| Confidence-SS [134] | 2025 | CNN-Trans. semi-sup. | Skin lesion | ISIC 2016, PH2 | 0.91 | Semi-supervised |
| H-Self-Support [135] | 2026 | Hierarchical self-support | Brain MRI (tumour) | BraTS 2021 | 0.92 | Self-supervised |
| Dense Enc.-Dec. [102] | 2021 | CNN enc.-dec. skip conn. | Skin lesion | ISIC 2018, PH2 | 0.87 | Full FT |
| RD2A [2] | 2021 | Residual dense + ASPP | Brain MRI (tumour) | BraTS 2019 | 0.89 | Full FT |
| ScaleFusionNet [101] | 2025 | Trans. multi-scale FPN | Skin lesion | ISIC 2017/2018, PH2 | 0.90 | Full FT |
| UNet-Mamba [5] | 2025 | UNet + Mamba-like attn. | Multi-modal | ACDC, Synapse, polyp sets | 0.91 | Full FT |

4. Adaptation Strategies

Vision-language foundation models are pretrained on general or biomedical data and usually need further training to reach a specific clinical task, where the adaptation strategy governs accuracy, computational cost, and deployment flexibility. We discuss three main strategies, full fine-tuning, parameter-efficient fine-tuning, and prompt engineering, whose tradeoffs are summarized in Table 3 [40,41,136].

4.1. Full Fine-Tuning

Full fine-tuning updates all parameters and typically achieves the best accuracy when training data is sufficient [38,68]. MedSAM fine-tunes SAM’s image encoder and mask decoder on more than 1.5 million image-mask pairs, leaving only the lightweight prompt encoder frozen [38]. SAM-Med2D trains at a similar scale, on roughly 4.6 million images and 19.7 million masks, although it adds adapter layers to the encoder rather than updating every weight [68,77]. Common objectives include Dice, focal, generalized Dice, and Lovász-Softmax losses [137,138,139]. Full fine-tuning is costly: SAM’s ViT backbone exceeds six hundred million parameters [3], which demands large GPU memory and long training runs (MedSAM used twenty A100 GPUs for weeks; BiomedCLIP used sixteen) [15,38] and puts the approach beyond the reach of most groups. It also risks catastrophic forgetting of general knowledge [140] and produces a full model copy per task. Parameter-efficient fine-tuning addresses these issues by updating only a small fraction of parameters.
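Of the objectives listed above, the Dice loss is the simplest to write down. The sketch below is a minimal soft Dice loss for a single binary mask; the smoothing constant of 1.0 is a common convention used here as an assumption, and production code would typically average over classes and batch items.

```python
import numpy as np


def soft_dice_loss(pred_prob, target, smooth=1.0):
    """1 - Dice overlap between predicted probabilities and a binary mask."""
    intersection = (pred_prob * target).sum()
    denom = pred_prob.sum() + target.sum()
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)


# Toy masks (assumption): a 2 x 2 target inside a 4 x 4 image.
target = np.zeros((4, 4))
target[:2, :2] = 1.0

perfect = target.copy()
shifted = np.zeros((4, 4))
shifted[:2, 1:3] = 1.0            # same size, overlaps half of the target

loss_perfect = soft_dice_loss(perfect, target)
loss_shifted = soft_dice_loss(shifted, target)
```

Because the loss is normalized by the combined size of prediction and target, it stays informative for small structures, which is one reason overlap-based objectives are popular for lesions that occupy few pixels.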

4.2. Parameter-Efficient Fine-Tuning

Parameter-efficient fine-tuning (PEFT) updates only a small subset of parameters while freezing the rest, preserving original capabilities, reducing training memory, enabling many task-specific adapters, and often approaching full fine-tuning accuracy as in Figure 5 [40,41,141].
Low-Rank Adaptation (LoRA) is the most widely used PEFT method [40]: it adds two trainable low-rank matrices whose product approximates the weight update, typically under one percent of the parameters. SAMed applied LoRA to SAM’s image encoder and matched the state of the art on Synapse while updating only 0.1% of parameters [84], and SAM2LoRA applied it to both encoder and decoder of SAM 2 for retinal fundus segmentation, reaching Dice up to 0.93 with under five percent of parameters trained [88]. Adapter modules insert small trainable layers between frozen ones [41]: Medical SAM Adapter (Med-SA) combines adapters and LoRA with a Space-Depth Transpose for 2D/3D images and surpassed SAM across seventeen tasks and five modalities [75], while AdaptiveSAM tunes only bias terms and adds text-prompted segmentation [82]. Several variants extend LoRA. Conv-LoRA injects local convolutional inductive biases into the ViT encoder while preserving SAM’s segmentation knowledge, AdaLoRA allocates the rank budget adaptively across layers [142], and DoRA and NAS-LoRA refine the magnitude/direction decomposition and configuration search. Visual prompt tuning adds learnable input tokens and suits classification more than dense prediction [143], while CLIP-Adapter and Tip-Adapter attach small networks on frozen CLIP features [144]. Prompt tuning for medical segmentation was studied by Fischer et al. [136], and few-shot PEFT can be cheaper and stronger than in-context learning [141].
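To make the low-rank update concrete, here is a minimal pure-Python sketch of a LoRA-style linear layer and of the parameter overhead it adds to a single weight matrix. The names `lora_forward` and `lora_param_fraction`, and the scaling by `alpha / rank`, follow the common formulation but are assumptions here, not code from SAMed or any other cited system.

```python
def lora_param_fraction(d_out, d_in, rank):
    """Trainable LoRA parameters relative to one frozen d_out x d_in weight."""
    return rank * (d_out + d_in) / (d_out * d_in)

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x).

    W (d_out x d_in) stays frozen; only A (rank x d_in) and
    B (d_out x rank) would be trained.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Illustrative sizes only: a 1280 x 1280 projection adapted with rank 4.
fraction = lora_param_fraction(1280, 1280, 4)
```

For these illustrative sizes the adapter adds 4 × (1280 + 1280) = 10,240 parameters to a 1,638,400-parameter matrix, about 0.6%, which is consistent with the sub-one-percent budgets described above. Initializing `B` to zeros makes the adapted layer start out identical to the frozen one.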
A 2024 empirical study of PEFT for SAM across seventeen datasets and five modalities found that the right PEFT strategy slightly outperforms prior methods, that LoRA suits small datasets while adapters suit larger diverse ones, and that fine-tuning the mask decoder matters more than the encoder for medical tasks [85].

4.3. Prompt Engineering

Prompt engineering exploits in-context learning without updating parameters: the user designs prompts to elicit the desired behavior. It is the cheapest and most flexible adaptation, limited mainly by prompt quality and alignment with pretraining data.
For text-prompted models, prompt design matters: BiomedParse uses GPT-4 to harmonize descriptions across forty-five datasets, and specific prompts such as “glandular structure in colon pathology” reach median Dice 0.942 while vague prompts perform worse [21]. SAT-style models use structured anatomical, modality, and pathology prompts [92], and self-regulating prompts reduce forgetting during adaptation [140]. Visual prompt design also matters for SAM: point placement strongly affects quality, bounding boxes constrain better than points, and iterative prompting is most accurate but most interactive [70,73,74]. Combining text and visual prompts is increasingly common: SegVol shows their combination improves accuracy on complex anatomy [78], and AdaptiveSAM integrates text with bounding-box adaptation [82]. Chain-of-thought prompting has been adapted to vision-language models: LLaVASeg uses multi-step prompts that first reason about the target, then identify attributes, then produce the mask, which helps with complex queries but adds latency [24]. Prompt reliability is an active topic: different phrasings, synonyms, and abbreviations of the same target yield different masks [21,97], and negation is hard because contrastive models tend to ignore it [65]; prompt augmentation, ensembling, and semantic-aware design help [65,92].
The best strategy depends on resources, data, and deployment scenario. Full fine-tuning gives the best accuracy with abundant data and compute; PEFT supports many tasks from one base model and suits multi-task, cross-institutional, and continual deployment, since adapters can be switched or added without disturbing the frozen base and federated learning preserves privacy [8,56,72]; and prompt engineering suits rapid prototyping.
As summarized in Table 3, LoRA with rank 4–16 reaches accuracy within a few percent of full fine-tuning while training roughly 0.1–1% of the parameters, and hybrid recipes such as SAMed mix LoRA with selective full fine-tuning [78,84]. PEFT, whose variants are compared in Figure 5, additionally aids interpretability and reproducibility by making explicit which parameters are adapted; reported results should still include random seeds, hardware, and training duration.
Figure 5. Comparison of parameter-efficient fine-tuning methods for medical foundation models. Full fine-tuning updates all parameters but has the highest cost. LoRA, adapters, and prompt tuning offer increasingly lightweight alternatives with different accuracy-cost tradeoffs.
Table 3. Comparison of adaptation strategies for vision-language foundation models. The choice depends on dataset size, deployment scenario, and accuracy requirements.
Strategy | Trainable parameters | GPU memory | Accuracy | Storage | Example
Full fine-tuning | 100% | Very high | Highest | Full model copy | MedSAM
LoRA | 0.1–1% | Low | Near best | Small adapter | SAMed
Adapter modules | 1–5% | Low | Strong | Small modules | Med-SA
Bias tuning | <0.5% | Very low | Good | Bias deltas only | AdaptiveSAM
Visual prompt tuning | <0.1% | Minimal | Variable | Prompt tokens | VPT variants
Prompt engineering | 0% | None | Variable | Text only | BiomedParse
Conv-LoRA | 0.5–2% | Low | Strong | Small modules | Conv-LoRA SAM
NAS-LoRA | 1–3% | Low | Strong | Searched arch | NAS-LoRA
DoRA | 0.2–1% | Low | Strong | Magnitude+dir. | DoRA variants

5. Modality-Specific Models

Imaging modalities pose different challenges: CT is volumetric with strong organ contrast, MRI offers rich soft-tissue contrast across sequences, pathology yields gigapixel whole-slide images, chest radiography is fast but two-dimensional, and ultrasound is real-time but operator-dependent. We review vision-language foundation models per modality, emphasizing recent 2024–2025 work [34,35,36,105].

5.1. Computed Tomography

Computed tomography is the most common volumetric modality, varying in contrast phase, slice thickness, and protocol; vision-language models must handle volumetric input, anatomical variability across patients, and the relationships among adjacent organs.
The CLIP-Driven Universal Model uses CLIP text embeddings of organ names as mask-decoder queries to segment twenty-five organs and detect six tumor types, training on fourteen partially labeled CT datasets through masked back-propagation and reaching strong results on BTCV and LiTS [19,145,146]. Liu et al. extended it to handle dynamic addition of new classes without retraining [99]. TotalSegmentator covers 104 anatomical structures in CT using an nnU-Net backbone and a curated set of more than one thousand scans, with a 2025 update adding sequence-independent MRI segmentation [1,34,35]; though prompt-free, its taxonomy has informed many text-prompted models. SegVol adds text and spatial prompts for more than two hundred CT classes from ninety thousand unlabeled and six thousand labeled volumes [78], and UniverSeg generalizes across fifty-plus datasets and sixteen modalities via few-shot cross-attention with support sets rather than text [126]. CT-CLIP brought vision-language pretraining to volumetric chest CT, training on more than fifty thousand CT volumes with paired structured reports from CT-RATE, and Merlin applied report-supervised pretraining to abdominal CT; such models support zero-shot disease detection and report generation [54,132]. M3D and Med-2E3 extended multimodal language models to general 3D CT analysis [25,104], and BiomedGPT provides a generalist backbone covering CT [147]. Recent work targets specific CT tasks: one-prompt segmentation across modalities [108], a self-improving generative model trained on heterogeneous CT and MRI [130], and UniMed-CLIP for unified image-text pretraining across modalities including CT [148].
Three-dimensional context remains a critical challenge: 2D-pretrained models lose information slice by slice. BiomedParse-V uses fractal volumetric encoding [22], while SAM-Med3D, SegVol, and SegFM3D natively process 3D inputs [69,78,79], and LCTfound adds a 2026 lung-CT foundation model [131]. SegMamba and self-adaptive Mamba-like designs offer state-space alternatives that scale to long volumetric sequences [5,128,149], and a data-efficient 3D VLM using only a 2D encoder reaches competitive 3D understanding at lower cost [150]. Still, true 3D vision-language pretraining at the scale of 2D has not yet been achieved [26,27].
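One common workaround for the slice-by-slice limitation, assumed here for illustration rather than taken from any cited system, is to feed a 2D encoder a small stack of adjacent slices as channels (often called 2.5D input). The sketch below only computes which slice indices go into each stack; `slice_windows` and its `context` parameter are hypothetical names.

```python
def slice_windows(num_slices, context=1):
    """Indices of the (2 * context + 1)-slice window centred on each slice.

    Indices are clamped at the volume ends, so the first and last slices
    are repeated rather than padded with zeros.
    """
    windows = []
    for z in range(num_slices):
        window = [
            min(max(z + offset, 0), num_slices - 1)
            for offset in range(-context, context + 1)
        ]
        windows.append(window)
    return windows
```

With `context=1` each window has three slices, which conveniently matches the three input channels of RGB-pretrained encoders; larger contexts give more through-plane information at higher memory cost.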
CT also raises modality-specific issues. Contrast phase matters: models trained mainly on portal-venous data may underperform on other phases, motivating phase-aware conditioning [143]. Slice thickness and reconstruction kernel affect small-structure accuracy, and most models are trained on adult data, which makes PEFT valuable for pediatric use [78,130].

5.2. Magnetic Resonance Imaging

MRI offers excellent soft-tissue contrast through multiple sequences (T1, T2, FLAIR, DWI) that provide complementary information, so vision-language models must handle this multi-sequence nature. Brain MRI is one of the most studied applications. Swin UNETR and BrainSegFounder provide 3D backbones, the latter pretrained on large-scale neuroimaging to reduce labeling needs [6,129], and the BraTS series supplies multi-sequence benchmark data with more than two thousand labeled cases [151,152]. Residual-dense ASPP networks and hierarchical self-support learning improve multi-grade tumor segmentation under label scarcity [2,135]. Cardiac MRI is another active area: ACDC anchors chamber segmentation [153], and Christensen et al. introduced an echocardiogram vision-language model for cardiac function assessment [154]. SegmentAnyBone uses text prompts to segment bones across T1, T2, PD, and STIR sequences [81], and Med-2E3 extends the 2D-enhanced 3D MLLM to MRI [104]. Prostate MRI benefits from universal models such as UniverSeg and SAM-based PEFT with minimal labeled data [85,126]. Diffusion approaches have also been explored for ambiguous MRI segmentation, including Diff-UNet for volumetric data [155,156].
MRI poses specific challenges: intensities are not standardized across scanners, datasets are smaller than CT or chest X-ray, and multi-sequence input increases memory cost [7]. Most models still process sequences independently or as channels rather than modeling their relationships, which sequence-aware designs and sequence-conditioned prompts address, and DWI remains underrepresented in pretraining [29,58]. Still, text-prompted MRI segmentation is now practical: TotalSegmentator MRI gives sequence-independent segmentation [35], Decipher-MR provides 3D MRI representations for diverse tasks [133], and hierarchical SAM decoding improves fine-grained prediction [157]. Image-translation methods bridge missing sequences [158,159], and Mamba-based state-space architectures (U-Mamba, Swin-UMamba, VMamba) offer efficient alternatives for volumetric data [160,161,162,163,164].

5.3. Pathology

Pathology presents gigapixel whole-slide images, so models operate on patches and aggregate across the slide while handling this scale and the rich semantic vocabulary of pathology. PLIP was an early pathology vision-language model, trained on more than two hundred thousand image-text pairs from medical Twitter to support zero-shot patch classification and retrieval [55,56], and Virchow scaled this to clinical-grade pathology and rare-cancer detection [59]. UNI provides strong image-only features from hundreds of thousands of whole-slide images that can be coupled with text decoders [58], Quilt-1M supplies one million image-text pairs for pathology pretraining [57], and PathAsst offers a generative conversational assistant covering localization and segmentation [60]. Deng et al. evaluated SAM zero-shot on digital pathology [71]. For text-guided pathology segmentation, BiomedParse reaches a median Dice above ninety percent on glandular structures from prompts such as “glandular structure in colon pathology” and handles irregular cellular shapes that challenge bounding-box methods [21]. Hierarchical multi-scale transformers extend to high-resolution pathology [119], while RetCCL and masked-image-modeling backbones support retrieval and stronger features [61,62]. H-optimus-0 adds a recent open-source pathology backbone with strong downstream features [?], and Bio2Vol enables parameter-efficient adaptation of 2D pathology models to volumetric data.
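The patch-then-aggregate workflow described above starts with a tiling grid. The following sketch computes patch origins for a slide of a given pixel size; it is a generic illustration, and `tile_origins`, the patch size, and the stride are assumptions rather than settings from any cited pathology model.

```python
def tile_origins(width, height, patch, stride):
    """Top-left (x, y) origins of patches covering a width x height slide.

    The last row and column are shifted inward so every patch stays
    inside the image, at the cost of some extra overlap at the border.
    """
    def axis(size):
        if size <= patch:
            return [0]
        positions = list(range(0, size - patch + 1, stride))
        if positions[-1] != size - patch:
            positions.append(size - patch)
        return positions

    return [(x, y) for y in axis(height) for x in axis(width)]
```

A stride smaller than the patch size produces overlapping tiles, which is one way to reduce seams when per-patch masks are stitched back into a slide-level mask.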
Key challenges for pathology are the large, technical semantic vocabulary and stain and colour variation across institutions; privacy-preserving and federated learning offer routes to cross-institutional training without exposing patient data [165,166,167,168].

5.4. Chest Radiography

Chest radiography has the most mature vision-language ecosystem because large datasets pair images with reports: MIMIC-CXR alone contains more than 370,000 images with free-text reports [169], complemented by CheXpert, PadChest, and VinDr-CXR. Tanida et al. introduced interactive, region-guided report generation [94]. GLoRIA, BioViL, and BioViL-T refined contrastive pretraining for chest radiography, achieving zero-shot pathology detection comparable to radiologists [16,17,18], and RaTEScore provides a radiologist-aligned metric for the report side [170]. For chest X-ray segmentation, LViT uses report excerpts as text prompts for COVID-19 lesions [20], and CXR-LLAVA adds interpretation and grounded reporting [36]. CheXagent provides grounded report generation and a multi-task chest X-ray foundation model [105], while RaDialog, interpretable concept-bottleneck reporting, and ChatRadio-Valuer extend dialog and impression generation [123,171,172,173]. CheXstray supports drift detection in deployed imaging AI [174], and federated split vision transformers address cross-institutional COVID-19 CXR learning [168].
Bias is a particular concern: some chest X-ray models underperform on under-represented populations [175,176], foundation models are vulnerable to imaging artifacts [177], and mitigation strategies plus fact-grounded preference optimization aim to address these failures [112,178].

5.5. Ultrasound

Ultrasound is portable, real-time, and radiation-free but noisier and more operator-dependent than CT or MRI, with quality varying by probe position and gain, which makes it difficult for vision-language models.
SAMUS adapts SAM to clinical ultrasound with parameter-efficient tuning and ultrasound-specific data, outperforming general SAM on multiple tasks [80], while EchoCLIP brought contrastive vision-language pretraining to echocardiography, training on more than one million cardiac ultrasound videos with expert interpretations for strong zero-shot cardiac-function assessment [154]. For obstetric and abdominal ultrasound, recent models target specific structures using benchmark datasets such as BUSI (breast) and thyroid nodule sets. Adhikari et al. used diffusion-synthesized data to enhance echocardiography segmentation [98], and SAM-OCTA extends prompted segmentation to retinal angiography [120]. Ultrasound remains hard because large image-text datasets are scarce, image quality is operator-dependent, and real-time inference is often required. Even so, adapted SAM and vision-language models now give clinically useful ultrasound segmentation, with active work on portable, point-of-care integration [51,179,180].

6. Clinical Applications

Clinical applications span the diagnostic and therapeutic workflow. We focus on three high-impact areas (organ segmentation, tumor segmentation, and radiotherapy planning); others include surgical planning, image-guided procedures, and conversational diagnostic AI [8,179,180].

6.1. Organ Segmentation

Whereas traditional pipelines trained a separate model per organ and modality, text-prompted universal models now segment dozens to hundreds of organs from a single network: the CLIP-Driven Universal Model (25 organs) [19], SegVol (200+ categories with text and spatial prompts) [78], and the example-based UniverSeg [126], evaluated on benchmarks such as the Medical Segmentation Decathlon and AMOS. TotalSegmentator and its 2025 MRI extension are widely adopted in clinical and research workflows, reducing manual contouring [34,35], and extensible universal models handle dynamic class addition for evolving taxonomies [99]. For specific organs, flexible interfaces include SegmentAnyBone for multi-sequence bone MRI [81], self-improving generative models across modalities [130], MRI lesion models [125], and in-context universal segmentation [124], guided by emerging ethical frameworks [181].
A practical advantage is rapid deployment: a new task can often be addressed by writing a text prompt rather than training a model, though accuracy may trail a dedicated model when the class was unseen in pretraining [70].

6.2. Tumor Segmentation

Tumors vary widely in shape, size, location, and boundary clarity, which text-driven specification helps address. ZePT performs zero-shot pan-tumour segmentation via CLIP-based query disentangling and self-prompting [93], the CLIP-Driven Universal Model covers six tumor types [19], and BiomedParse handles multiple tumor types across nine modalities with strong results on irregular shapes [21]. Brain tumor segmentation is a long-standing application tracked by the BraTS series [151,152], with Swin UNETR and BrainSegFounder as 3D backbones [6,129]. Vision-language prompts can specify tumor type and sequence, and diffusion-based methods including Diff-UNet have been applied to ambiguous tumor boundaries [155,156]. For liver tumor segmentation on LiTS [146], foundation-model-guided semi-supervised CT segmentation works in resource-constrained settings [182], and confidence-weighted CNN-Transformer semi-supervision helps when labels are scarce [134]. The 3DSAM-adapter targets promptable volumetric tumors [83], while ultrasound foundation models and expert-knowledge VLMs extend coverage [96,183]. Kidney tumors use the KiTS benchmarks [184,185,186], where the state of the art combines foundation backbones with task-specific fine-tuning, and skin lesion work builds on ISIC and HAM10000 [101,102,134]. Ophthalmic VLMs covering hundreds of fundus diseases (and RetiZero) extend the paradigm to retinal pathology [120,187,188]. Tumor heterogeneity across patients, time points, and protocols remains a core difficulty; cross-modal contrastive losses and clinical metadata help [95], and clinical-environment simulators support rigorous evaluation [189]. Integration with single-cell or molecular foundation models such as scGPT and Geneformer could unlock new tumor-analysis modalities [190,191].

6.3. Radiotherapy Planning

Radiotherapy planning requires precise, consistent contouring of target volumes and organs at risk, which is time-consuming manually. Early deep-learning systems delineated head-and-neck organs at risk with dedicated per-anatomy models [31,32,33]; vision-language foundation models now offer a unified, text-driven alternative. Broad-coverage models such as TotalSegmentator and the CLIP-Driven Universal Model serve as starting points for organ-at-risk delineation and adapt to site-specific contouring with limited data [19,34], and in-context universal segmentation fits radiotherapy use cases [99,124].
Clinical adoption is gradual: regulatory validation across populations and the interpretability of foundation models remain concerns [172,181], even as some vendors incorporate foundation-model components into planning systems. Systematic reviews of healthcare LLM testing and reports of GPT-4 on complex cases inform deployment expectations [192,193,194].

7. Evaluation and Datasets

Evaluation must capture both region overlap and boundary accuracy. Standard metrics are the Dice score, Intersection over Union, and the 95th-percentile Hausdorff Distance, each with known limitations [42,43,44,45]; Table 4 lists their properties and typical use.

7.1. Evaluation Metrics

The Dice score (twice the intersection over the summed volumes) is the most reported overlap metric but is sensitive to class imbalance [44,137]. IoU (the Jaccard index) is a more conservative overlap measure [42]. Boundary metrics complement overlap: the 95th-percentile Hausdorff Distance (HD95) is widely used for organs at risk, while Average and Normalized Surface Distance report mean boundary error and the fraction of boundary within tolerance [44,45].
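The overlap and boundary metrics above can be written down in a few lines. This is a minimal sketch on sets of foreground voxel coordinates and lists of boundary points; the convention of returning 1.0 when both masks are empty and the pooled-percentile definition of HD95 are assumptions, since published toolkits differ on both.

```python
import math

def dice(pred, ref):
    """Dice = 2 |P ∩ R| / (|P| + |R|) for sets of foreground coordinates."""
    if not pred and not ref:
        return 1.0  # assumed convention for two empty masks
    return 2 * len(pred & ref) / (len(pred) + len(ref))

def iou(pred, ref):
    """Intersection over Union (Jaccard index) on the same sets."""
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)

def hd95(pred_boundary, ref_boundary):
    """95th-percentile symmetric Hausdorff distance (brute force).

    Both inputs are non-empty lists of coordinate tuples in physical
    units; nearest-neighbour distances from both directions are pooled.
    """
    def directed(source, target):
        return [min(math.dist(p, q) for q in target) for p in source]

    distances = sorted(
        directed(pred_boundary, ref_boundary)
        + directed(ref_boundary, pred_boundary)
    )
    index = max(math.ceil(0.95 * len(distances)) - 1, 0)
    return distances[index]
```

On any pair of masks the two overlap scores are related by Dice = 2·IoU / (1 + IoU), so IoU is never larger than Dice, which is why it reads as the more conservative measure.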
For multi-focal disease, lesion-wise Dice evaluates each lesion separately before aggregating, which is more meaningful than global Dice, and lesion-level sensitivity and specificity together with detection hits and false alarms per image suit screening [184,185,186].
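A simplified sketch of lesion-wise Dice follows. It labels 4-connected components in 2D masks, scores each reference lesion against the predicted components that touch it, and counts unmatched predictions as zero-score false alarms. Challenge definitions differ in connectivity, dilation, and size thresholds, so the matching rule here is an assumption for illustration only.

```python
def connected_components(mask):
    """4-connected components of a 2D binary mask (list of rows)."""
    rows, cols = len(mask), len(mask[0])
    seen, components = set(), []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                stack, component = [(r, c)], set()
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    component.add((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            stack.append((ny, nx))
                components.append(component)
    return components

def lesion_wise_dice(pred_mask, ref_mask):
    """Mean per-lesion Dice; unmatched predicted lesions score 0."""
    pred_components = connected_components(pred_mask)
    ref_components = connected_components(ref_mask)
    scores, matched = [], set()
    for ref in ref_components:
        overlap = set()
        for i, component in enumerate(pred_components):
            if component & ref:
                overlap |= component
                matched.add(i)
        scores.append(2 * len(overlap & ref) / (len(overlap) + len(ref)))
    scores += [0.0] * (len(pred_components) - len(matched))
    return sum(scores) / len(scores) if scores else 1.0
```

For example, with a two-voxel lesion that is half detected and a one-voxel lesion that is missed, the global Dice is 0.5 while the lesion-wise score drops to about 0.33, because the missed small lesion counts as much as the large one.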
Text-prompted segmentation needs additional measures: the median Dice across prompts captures consistency of response to paraphrases of the same target, object-recognition accuracy and the negative-prediction rate test whether the model correctly detects or refuses an absent object (BiomedParse uses a Kolmogorov–Smirnov test for invalid prompts), and RaTEScore evaluates the text output of LLM-embedded systems that produce both masks and reports [21,170].
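The paraphrase-consistency idea can be sketched as follows: score the mask obtained from each phrasing against the same reference and report the median together with the spread. The function name `prompt_consistency` and the use of max minus min as the spread are assumptions for this example, not a protocol defined in the cited work.

```python
from statistics import median

def dice(pred, ref):
    """Dice on sets of foreground voxel identifiers."""
    if not pred and not ref:
        return 1.0
    return 2 * len(pred & ref) / (len(pred) + len(ref))

def prompt_consistency(masks_by_prompt, reference):
    """Median Dice across paraphrases and the max-min spread.

    masks_by_prompt maps each prompt string to the predicted foreground
    set; a small spread means the model is robust to rephrasing.
    """
    scores = {p: dice(m, reference) for p, m in masks_by_prompt.items()}
    values = list(scores.values())
    return median(values), max(values) - min(values), scores
```

Reporting the per-prompt scores alongside the median makes it visible when a single phrasing, such as an abbreviation, is responsible for most of the variance.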
Multi-class scores can be aggregated by mean, frequency-weighted mean, or worst-case, each with different clinical implications. Maier-Hein et al. and Reinke et al. stress selecting metrics by clinical question and reporting all relevant numbers with uncertainty rather than a single headline value [42,43], and loss design (focal, generalized Dice, Lovász-Softmax) shapes how metrics behave during training [137,138,139].
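The three aggregation choices can give noticeably different headline numbers for the same per-class results. The sketch below uses hypothetical class names, scores, and voxel counts purely to show the arithmetic.

```python
def aggregate(scores, voxel_counts):
    """Aggregate per-class scores three ways.

    scores and voxel_counts are dicts keyed by class name; the
    frequency-weighted mean weights each class by its voxel count.
    """
    classes = list(scores)
    mean = sum(scores[c] for c in classes) / len(classes)
    total = sum(voxel_counts[c] for c in classes)
    weighted = sum(scores[c] * voxel_counts[c] for c in classes) / total
    worst = min(classes, key=lambda c: scores[c])
    return {
        "mean": mean,
        "frequency_weighted": weighted,
        "worst_case": (worst, scores[worst]),
    }

# Hypothetical per-class Dice values and voxel counts.
example = aggregate(
    {"liver": 0.96, "pancreas": 0.80, "tumor": 0.60},
    {"liver": 9000, "pancreas": 900, "tumor": 100},
)
```

In this made-up case the frequency-weighted mean (0.942) hides the weak tumor class that the unweighted mean (about 0.787) and the worst-case value (0.60) expose, which is why the choice carries clinical implications.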
Table 4. Common evaluation metrics for medical image segmentation. The choice of metric depends on the clinical question and the structure that is segmented.
Metric | Range | Property | Best Used For
Dice Score | [0, 1] | Overlap, small-structure sensitive | Volume overlap reporting
IoU (Jaccard) | [0, 1] | Conservative overlap | Detection metric comparison
Hausdorff Distance | [0, ∞) mm | Worst-case boundary error | Outlier sensitivity studies
95% Hausdorff (HD95) | [0, ∞) mm | Percentile boundary error | Radiotherapy organs at risk
Average Surface Dist. | [0, ∞) mm | Mean boundary error | Boundary quality reporting
Normalized Surface Dist. | [0, 1] | Boundary within tolerance | Clinically acceptable boundary
Lesion-wise Dice | [0, 1] | Per-lesion overlap | Multi-focal disease
Sensitivity / Recall | [0, 1] | Detection rate | Screening applications
Specificity | [0, 1] | True-negative rate | Specificity-critical tasks
Recognition Accuracy | [0, 1] | Object presence detection | Text-prompted segmentation

7.2. Benchmark Datasets

Table 5 summarizes widely used benchmark datasets by modality; below we highlight those most relevant to vision-language segmentation.
Abdominal CT relies on BTCV, AMOS, FLARE, LiTS, KiTS, and the Medical Segmentation Decathlon [145,146,184,185,186]. Brain and cardiac MRI use the BraTS series and ACDC [151,152,153], and echocardiography adds EchoNet/EchoCLIP [154]. Chest X-ray benefits from MIMIC-CXR, CheXpert, PadChest, NIH ChestX-ray14, and VinDr-CXR [169], while pathology draws on Quilt-1M, PMC-CLIP, GLaS, and CAMELYON [48,57,61,62]. Ultrasound and dermatology use BUSI, ISIC, and HAM10000.
For text-prompted segmentation specifically, BiomedParseData provides six million image-mask-description triples harmonized from forty-five datasets with GPT-4 [21], the CVPR 2025 challenge adds 3D benchmark data [22], SAT curates text-annotated radiology from over seventy datasets [92], and CT-RATE pairs chest CT volumes with reports [54,195]. General-domain sets such as COCO and Open Images also support pretraining, and clinical-environment simulators aid dynamic evaluation [189].
Data leakage between web-scale pretraining and public benchmarks is a growing concern, so studies should disclose pretraining data and evaluate on held-out clinical data [175,176]. Privacy-preserving and federated learning enable cross-institutional training without raw-data exchange [165,166,167], and harmonizing heterogeneous datasets requires consistent ontologies such as those used by BiomedParseData and SAT [21,92].

8. Challenges and Future Directions

Despite rapid progress, vision-language foundation models for medical segmentation face open challenges that also define the research agenda. We group them into four themes (prompt dependence and trustworthiness, efficiency and 3D scaling, data efficiency, and clinical integration) and pair each challenge with the directions most likely to address it. Figure 6 summarizes the challenges and roadmap. Cross-cutting concerns include privacy, fairness, regulatory approval, and interpretability [165,175,176,178,181].

8.1. Prompt Dependence and Trustworthiness

Vision-language models depend critically on text prompts: the same target can be phrased many ways (“liver tumor,” “hepatic mass,” “neoplasm involving segment VIII”), and different phrasings yield different masks. This makes evaluation prompt-dependent and complicates cross-institution use, and performance degrades further on non-English text [21,51,67,97]. Mitigations include prompt augmentation during training, ensemble prompting at inference, ontology-based semantic-aware prompt design, and self-regulating prompts that reduce forgetting during adaptation [21,92,140]. Two weaknesses persist: contrastive models tend to ignore negation (“liver without tumor” resembles “liver with tumor”) [65], and compositional prompts that combine several constraints are handled only partially, even by reasoning models such as LISA [23,24]. Closing the gap to rich clinical language (location, morphology, density, enhancement) and to multilingual use remains open, with BiomedParseData and CT-RATE moving in this direction [54,179].
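Ensemble prompting at inference can be sketched as a pixel-wise vote over the masks returned for several paraphrases of the same target. In the code below, `segment_fn` is a hypothetical stand-in for any text-prompted segmentation model, and the voting threshold is an assumed design choice rather than a setting from the cited systems.

```python
def ensemble_masks(masks, threshold=0.5):
    """Pixel-wise vote over binary masks given as flat lists of 0/1.

    A pixel is kept when at least `threshold` of the masks mark it as
    foreground, so an exact tie at 0.5 counts as foreground.
    """
    count = len(masks)
    length = len(masks[0])
    return [
        1 if sum(mask[i] for mask in masks) / count >= threshold else 0
        for i in range(length)
    ]

def segment_with_paraphrases(segment_fn, image, prompts, threshold=0.5):
    """Run segment_fn(image, prompt) for each prompt and vote.

    segment_fn is hypothetical: any callable returning a flat 0/1 mask.
    """
    return ensemble_masks([segment_fn(image, p) for p in prompts], threshold)
```

Each extra paraphrase costs one more forward pass, so the number of prompts trades robustness against the latency constraints discussed in Section 8.2.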
Hallucination is the most safety-critical issue: a model may segment absent objects, miss present ones, or produce wrong boundaries, with consequences ranging from unnecessary biopsy to incorrect radiotherapy dose [33,109,110,111,177]. Mitigations include presence verification (BiomedParse’s independent segmentation discriminator), fact-grounded preference optimization with physician feedback, diagnosis-guided bootstrapping, and confidence calibration via temperature scaling, ensembles, or Bayesian methods [22,42,112]. Reliable inference-time detection, through confidence, visual-textual consistency, or retrieval-augmented verification, remains unsolved yet essential for clinical use, alongside validation across populations and evolving regulatory frameworks [181,192,193,194].
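Of the calibration options listed above, temperature scaling is the simplest: a single scalar divides the logits before the sigmoid and is chosen on held-out data to minimize negative log-likelihood. The sketch below fits it by grid search; the candidate grid and function names are assumptions, and real pipelines usually optimize the temperature with a gradient-based method instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood after dividing logits by T."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)

def fit_temperature(logits, labels, candidates=None):
    """Pick the candidate temperature with the lowest held-out NLL."""
    if candidates is None:
        candidates = [0.5 + 0.1 * i for i in range(46)]  # 0.5 to 5.0
    return min(candidates, key=lambda t: nll(logits, labels, t))
```

A fitted temperature above 1 indicates an overconfident model. Scaling leaves the thresholded mask unchanged, since the sign of each logit is preserved, and only softens the confidence attached to it.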

8.2. Efficiency and 3D Scaling

Many clinical workflows require segmentation in seconds, yet vision-language models often have hundreds of millions to billions of parameters [3,15]; SAM’s largest encoder alone exceeds six hundred million parameters, and LLM-embedded models add a language model per query [9,23,103]. Distillation to smaller students (MobileSAM, FastSAM, EfficientSAM), quantization, one-prompt segmentation, model-agnostic refinement, and lightweight adapters reduce cost [63,64,108,115,143,144], while Mamba-style state-space models give linear-time alternatives for long volumetric sequences [128,149,160,161,163,164]. Three-dimensional context is the deeper bottleneck: 2D-pretrained models lose information slice by slice, and 2.5D cross-slice attention or native 3D models trade speed for accuracy [22,69,196]. Edge deployment at the point of care is enabled by mobile variants, EmbedMedSAM, efficient bedside inference, and two-stage screening-then-refinement pipelines [79,108,122,125,154]. Looking forward, native multimodal pretraining that includes volumetric data, time series, and structured metadata, exemplified by CT-CLIP, BiomedParse-V, and UniMed-CLIP, is needed but still lags 2D pretraining in scale [54,148,162], and genomic and molecular foundation models suggest broader integration [190,191].
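A back-of-the-envelope estimate shows why parameter count and numeric precision matter for deployment. The sketch counts only the memory needed to hold the weights; activations, which dominate for volumetric inputs, are deliberately left out, and the byte sizes per data type are the only assumptions.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gib(num_params, dtype="fp32"):
    """GiB needed to store the weights alone at the given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30

# Illustrative: an encoder with six hundred million parameters.
encoder_params = 600_000_000
footprint = {d: weight_memory_gib(encoder_params, d) for d in BYTES_PER_PARAM}
```

For six hundred million parameters this gives roughly 2.2 GiB in fp32, 1.1 GiB in fp16, and 0.6 GiB in int8, which illustrates why quantization and distilled students are attractive for edge devices.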

8.3. Data Efficiency

Annotated data is scarce because expert contouring is costly, variable, and constrained by privacy [165]. Vision-language and self-supervised pretraining on weakly labeled image-text pairs (PLIP, Quilt-1M, CT-RATE) and unlabeled images (DINO, DINOv2, masked image modeling) reduces this dependence [54,55,57,58,63,64,129]. Few-shot PEFT, hierarchical self-support learning, synthetic data from diffusion models (GenSeg) and modality translation, federated learning across institutions, and uncertainty-guided active learning all extend labeled data further, though synthetic data may inherit generator biases [42,127,135,141,144,158,159,166,167]. Even so, the gap between training-data scale and clinical need remains large, motivating data-efficient learning, cross-institution transfer, and drift monitoring [121,124,174].

8.4. Clinical Integration and Outlook

Clinical adoption requires operating within PACS, RIS, and EHR systems, respecting privacy, and integrating with decision support and reporting, with drift detection and clinical-environment simulators supporting safe operation [66,94,171,172,189]. Patient-specific priors and federated continual updates (with versioning and audit trails), multilingual models for global health, and hybrid pairings of general models with task-specific components and open releases (MedSAM, BiomedCLIP, BiomedParse, LLaVA-Med) will broaden access [10,15,21,37,38,67,113,118]. Emerging 2025–2026 directions include text-prompted volumetric models competitive with task-specific 3D networks, generative augmentation, reasoning over clinical implications, specialty-specific models, and integration of non-imaging data [22,123,187,197]. As the boundaries between segmentation, detection, classification, and generation blur, clinical impact will hinge on safety, fairness, equitable access, and evolving FDA/EMA oversight; generalist and conversational diagnostic AI and joint imaging-molecular models point toward foundation models becoming standard tools across biomedical research and practice [8,180].
Figure 6. Challenges and research roadmap for vision-language foundation models in medical image segmentation. The figure shows the four open challenges (prompt sensitivity, mask hallucination, inference speed, and limited annotated data), each paired with recent mitigations and longer-term roadmap directions, together with cross-cutting concerns spanning bias and fairness, interpretability, and clinical workflow integration.

9. Conclusion

This survey reviewed vision-language foundation models for medical image segmentation: the technical background, a three-part taxonomy (text-prompt guided, large language model embedded, and hybrid frameworks), adaptation strategies, and the literature organized by imaging modality and clinical application, together with evaluation metrics and benchmark datasets. Progress has been rapid: the two years from 2024 to 2026 produced more advances than the previous decade, with models such as BiomedParse, MedSAM2, and SAT now segmenting many modalities and structures from text prompts while LoRA and adapters make deployment practical. We identified four open challenges (prompt dependence, hallucination, inference speed, and data scarcity) and set out a research roadmap toward trustworthy, well-integrated deployment spanning multimodal pretraining and clinical integration. The clinical payoff is substantial: lower segmentation time and cost, consistent results across institutions, and new applications such as grounded radiology reporting and conversational decision support.

Author Contributions

S.Q. conceived and designed the study, conducted the literature review, developed the taxonomy and analytical framework, prepared all figures and tables, and wrote, edited, and approved the final manuscript. The author has read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable to this article, as no new datasets were generated or analyzed during the current study. All works discussed are available in the cited references.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Isensee, F.; Jaeger, P.; Kohl, S.; Petersen, J.; Maier-Hein, K. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [PubMed]
  2. Ahmad, P.; Jin, H.; Qamar, S.; Zheng, R.; Saeed, A. RD2A: densely connected residual networks using ASPP for brain tumor segmentation. Multimed. Tools Appl. 2021, 80, 27069–27094. [Google Scholar] [CrossRef]
  3. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the ICLR, 2021. [Google Scholar]
  4. Liu, Z.; Lin, Y.; Cao, Y.; et al. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the ICCV, 2021; pp. 10012–10022. [Google Scholar]
  5. Qamar, S.; Fazil, M.; Ahmad, P.; Khan, S.; Zamani, A.T. UNet with self-adaptive Mamba-like attention and causal-resonance learning for medical image segmentation. Sci. Rep. 2026, 16, 135. [Google Scholar] [CrossRef] [PubMed]
  6. Hatamizadeh, A.; Nath, V.; Tang, Y.; Yang, D.; Roth, H.; Xu, D. Swin UNETR: Swin transformers for semantic segmentation of brain tumors in MRI images. In Proceedings of the MICCAI BrainLes Workshop, 2021; pp. 272–284. [Google Scholar]
  7. Shamshad, F.; Khan, S.; Zamir, S.; et al. Transformers in medical imaging: A survey. Med. Image Anal. 2023, 88, 102802. [Google Scholar] [CrossRef] [PubMed]
  8. Moor, M.; Banerjee, O.; Abad, Z.; et al. Foundation models for generalist medical artificial intelligence. Nature 2023, 616, 259–265. [Google Scholar] [CrossRef] [PubMed]
  9. Kirillov, A.; Mintun, E.; Ravi, N.; et al. Segment anything. In Proceedings of the ICCV, 2023; pp. 4015–4026. [Google Scholar]
  10. Touvron, H.; Lavril, T.; Izacard, G.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  11. Achiam, J.; Adler, S.; Agarwal, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  12. Radford, A.; Kim, J.; Hallacy, C.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the ICML, 2021; pp. 8748–8763. [Google Scholar]
  13. Jia, C.; Yang, Y.; Xia, Y.; et al. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the ICML, 2021; pp. 4904–4916. [Google Scholar]
  14. Wang, Z.; Wu, Z.; Agarwal, D.; Sun, J. MedCLIP: Contrastive learning from unpaired medical images and text. In Proceedings of the EMNLP, 2022; pp. 3876–3887. [Google Scholar]
  15. Zhang, S.; Xu, Y.; Usuyama, N.; Xu, H.; Bagga, J.; Tinn, R.; Preston, S.; Rao, R.; Wei, M.; Valluri, N.; et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. NEJM AI 2025, 2, AIoa2400640. [Google Scholar] [CrossRef]
  16. Huang, S.C.; Shen, L.; Lungren, M.; Yeung, S. GLoRIA: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the ICCV, 2021; pp. 3942–3951. [Google Scholar]
  17. Boecking, B.; Usuyama, N.; Bannur, S.; et al. Making the most of text semantics to improve biomedical vision-language processing. In Proceedings of the ECCV, 2022. [Google Scholar]
  18. Bannur, S.; Hyland, S.; Liu, Q.; et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the CVPR, 2023; pp. 15016–15027. [Google Scholar]
  19. Liu, J.; Zhang, Y.; Chen, J.N.; et al. CLIP-driven universal model for organ segmentation and tumor detection. In Proceedings of the ICCV, 2023; pp. 21152–21164. [Google Scholar]
  20. Li, Z.; Li, Y.; Li, Q.; Wang, P.; Guo, D.; Lu, L.; Jin, D.; Zhang, Y.; Hong, Q. LViT: Language meets vision transformer in medical image segmentation. IEEE Trans. Med. Imaging 2023, 43, 96–107. [Google Scholar] [CrossRef]
  21. Zhao, T.; Gu, Y.; Yang, J.; et al. A foundation model for joint segmentation, detection, and recognition of biomedical objects across nine modalities. Nat. Methods 2025, 22, 166–176. [Google Scholar] [PubMed]
  22. Zhao, T.; Gu, Y.; Yang, J.; et al. BiomedParse-V: Scaling foundation model for universal text-guided volumetric biomedical image segmentation. In Proceedings of the CVPR Workshop MedSegFM, 2025. [Google Scholar]
  23. Lai, X.; Tian, Z.; Chen, Y.; Li, Y.; Yuan, Y.; Liu, S.; Jia, J. LISA: Reasoning segmentation via large language model. In Proceedings of the CVPR, 2024; pp. 9579–9589. [Google Scholar]
  24. Yang, S.; Qu, T.; Lai, X.; et al. LISA++: An improved baseline for reasoning segmentation with large language model. arXiv 2023, arXiv:2312.17240. [Google Scholar]
  25. Bai, F.; Du, Y.; Huang, T.; Meng, M.; Zhao, B. M3D: Advancing 3D medical image analysis with multi-modal large language models. arXiv 2024, arXiv:2404.00578. [Google Scholar]
  26. Khan, W.; Leem, S.; See, K.; Wong, J.; Zhang, S.; Fang, R. A comprehensive survey of foundation models in medicine. IEEE Rev. Biomed. Eng. 2025. [Google Scholar]
  27. Wu, J.; Wang, Y.; Bai, H. Vision-language foundation model for 3D medical imaging. npj Artif. Intell. 2025, 1, 17. [Google Scholar] [CrossRef]
  28. Sun, K.; Xue, S.; Sun, F.; Sun, H.; Luo, Y.; Wang, L.; Wang, S.; Guo, N.; Liu, L.; Zhao, T.; et al. Medical multimodal foundation models in clinical diagnosis and treatment: Applications, challenges, and future directions. Artif. Intell. Med. 2025, 103265. [Google Scholar] [CrossRef] [PubMed]
  29. Zhao, Z.; Liu, Y.; Wu, H.; et al. CLIP in medical imaging: A comprehensive survey. arXiv 2023, arXiv:2312.07353. [Google Scholar]
  30. Lurz, D.; Neubig, L.; Kopp, M.; Kist, A. Foundation Models in Medical Image Segmentation. In Proceedings of the Bildverarbeitung für die Medizin 2026 (BVM 2026) Informatik aktuell; Springer Vieweg: Wiesbaden, 2026. [Google Scholar] [CrossRef]
  31. Nikolov, S.; Blackwell, S.; Zverovitch, A.; et al. Clinically applicable segmentation of head and neck anatomy for radiotherapy: deep learning algorithm development and validation study. J. Med. Internet Res. 2021, 23, e26151. [Google Scholar] [CrossRef] [PubMed]
  32. Vaassen, F.; Hazelaar, C.; Vaniqui, A.; et al. Evaluation of measures for assessing time-saving of automatic organ-at-risk segmentation in radiotherapy. Phys. Imaging Radiat. Oncol. 2020, 13, 1–6. [Google Scholar] [PubMed]
  33. Cardenas, C.; Yang, J.; Anderson, B.; Court, L.; Brock, K. Advances in auto-segmentation. Semin. Radiat. Oncol. 2019, 29, 185–197. [Google Scholar] [CrossRef] [PubMed]
  34. Wasserthal, J.; Breit, H.C.; Meyer, M.; et al. TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiol. AI 2023, 5, e230024. [Google Scholar] [CrossRef] [PubMed]
  35. Akinci D’Antonoli, T.; Berger, L.K.; Indrakanti, A.K.; Vishwanathan, N.; Weiss, J.; Jung, M.; Berkarda, Z.; Rau, A.; Reisert, M.; Küstner, T.; et al. TotalSegmentator MRI: Robust sequence-independent segmentation of multiple anatomic structures in MRI. Radiology 2025, 314, e241613. [Google Scholar] [CrossRef] [PubMed]
  36. Lee, S.; Youn, J.; Kim, H.; Kim, M.; Yoon, S.H. CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images. Eur. Radiol. 2025, 35, 4374–4386. [Google Scholar] [CrossRef] [PubMed]
  37. Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Proc. NeurIPS 2023, Vol. 36, 28541–28564. [Google Scholar] [CrossRef]
  38. Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2024, 15, 654. [Google Scholar] [CrossRef] [PubMed]
  39. Tu, T.; Azizi, S.; Driess, D.; Schaekermann, M.; Amin, M.; Chang, P.C.; Carroll, A.; Lau, C.; Tanno, R.; Ktena, I.; et al. Towards generalist biomedical AI. NEJM AI 2024, 1, AIoa2300138. [Google Scholar] [CrossRef]
  40. Hu, E.; Shen, Y.; Wallis, P.; et al. LoRA: Low-rank adaptation of large language models. In Proceedings of the ICLR, 2022. [Google Scholar]
  41. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; et al. Parameter-efficient transfer learning for NLP. In Proceedings of the ICML, 2019; pp. 2790–2799. [Google Scholar]
  42. Maier-Hein, L.; Reinke, A.; Godau, P.; Tizabi, M.D.; Buettner, F.; Christodoulou, E.; Glocker, B.; Isensee, F.; Kleesiek, J.; Kozubek, M.; et al. Metrics reloaded: recommendations for image analysis validation. Nat. Methods 2024, 21, 195–212. [Google Scholar] [CrossRef] [PubMed]
  43. Reinke, A.; Tizabi, M.D.; Baumgartner, M.; Eisenmann, M.; Heckmann-Nötzel, D.; Kavur, A.E.; Rädsch, T.; Sudre, C.H.; Acion, L.; Antonelli, M.; et al. Understanding metric-related pitfalls in image analysis validation. Nat. Methods 2024, 21, 182–194. [Google Scholar] [CrossRef] [PubMed]
  44. Taha, A.; Hanbury, A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med. Imaging 2015, 15, 29. [Google Scholar] [CrossRef] [PubMed]
  45. Yeghiazaryan, V.; Voiculescu, I. Family of boundary overlap metrics for the evaluation of medical image segmentation. J. Med. Imaging 2018, 5, 015006. [Google Scholar] [CrossRef]
  46. Lee, J.; Yoon, W.; Kim, S.; et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [PubMed]
  47. Alsentzer, E.; Murphy, J.; Boag, W.; et al. Publicly available clinical BERT embeddings. In Proceedings of the ClinicalNLP, 2019; pp. 72–78. [Google Scholar]
  48. Lin, W.; Zhao, Z.; Zhang, X.; et al. PMC-CLIP: Contrastive language-image pre-training using biomedical documents. In Proceedings of the MICCAI, 2023; pp. 525–536. [Google Scholar]
  49. Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. arXiv 2023, arXiv:2308.02463. [Google Scholar]
  50. Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. Towards generalist foundation model for radiology. Nat. Commun. 2025, 16, 7866. [Google Scholar] [CrossRef] [PubMed]
  51. Saab, K.; Tu, T.; Weng, W.H.; Tanno, R.; Stutz, D.; Wulczyn, E.; Zhang, F.; Strother, T.; Park, C.; Vedadi, E.; et al. Capabilities of Gemini models in medicine. arXiv 2024, arXiv:2404.18416. [Google Scholar]
  52. Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H.; et al. Toward expert-level medical question answering with large language models. Nat. Med. 2025, 31, 754–762. [Google Scholar] [CrossRef]
  53. Tanno, R.; Barrett, D.G.; Sellergren, A.; Ghaisas, S.; Dathathri, S.; See, A.; Welbl, J.; Lau, C.; Tu, T.; Azizi, S.; et al. Collaboration between clinicians and vision–language models in radiology report generation. Nat. Med. 2025, 31, 599–608. [Google Scholar] [PubMed]
  54. Hamamci, I.E.; Er, S.; Wang, C.; Almas, F.; Simsek, A.G.; Esirgun, S.N.; Dogan, I.; Durugol, O.F.; Hou, B.; Shit, S.; et al. Generalist foundation models from a multimodal dataset for 3D computed tomography. Nat. Biomed. Eng. 2026. [Google Scholar]
  55. Huang, Z.; Bianchi, F.; Yuksekgonul, M.; Montine, T.; Zou, J. A visual-language foundation model for pathology image analysis using medical Twitter. Nat. Med. 2023, 29, 2307–2316. [Google Scholar] [CrossRef] [PubMed]
  56. Nguyen, H.G.; Lundström, O.; Olsson, J.; et al. Pathology Language-Image Pretraining (PLIP): A foundation model for pathology image analysis. Nat. Med. 2023. [Google Scholar]
  57. Ikezogwo, W.; Seyfioglu, M.; Ghezloo, F.; et al. Quilt-1M: One million image-text pairs for histopathology. In Proceedings of the NeurIPS Datasets, 2023. [Google Scholar]
  58. Chen, R.J.; Ding, T.; Lu, M.Y.; Williamson, D.F.; Jaume, G.; Song, A.H.; Chen, B.; Zhang, A.; Shao, D.; Shaban, M.; et al. Towards a general-purpose foundation model for computational pathology. Nat. Med. 2024, 30, 850–862. [Google Scholar] [CrossRef] [PubMed]
  59. Vorontsov, E.; Bozkurt, A.; Casson, A.; Shaikovski, G.; Zelechowski, M.; Severson, K.; Zimmermann, E.; Hall, J.; Tenenholtz, N.; Fusi, N.; et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nat. Med. 2024, 30, 2924–2935. [Google Scholar] [CrossRef] [PubMed]
  60. Sun, Y.; Zhu, C.; Zheng, S.; Zhang, K.; Sun, L.; Shui, Z.; Zhang, Y.; Li, H.; Yang, L. PathAsst: A generative foundation AI assistant towards artificial general intelligence of pathology. Proc. AAAI Conf. Artif. Intell. 2024, Vol. 38, 5034–5042. [Google Scholar] [CrossRef]
  61. Wang, X.; Du, Y.; Yang, S.; et al. RetCCL: Clustering-guided contrastive learning for whole-slide image retrieval. Med. Image Anal. 2023, 83, 102645. [Google Scholar] [CrossRef] [PubMed]
  62. Filiot, A.; Ghermi, R.; Olivier, A.; et al. Scaling self-supervised learning for histopathology with masked image modeling. medRxiv 2023. [Google Scholar] [CrossRef]
  63. Caron, M.; Touvron, H.; Misra, I.; et al. Emerging properties in self-supervised vision transformers. In Proceedings of the ICCV, 2021; pp. 9650–9660. [Google Scholar]
  64. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning robust visual features without supervision. Trans. Mach. Learn. Res. 2024. [Google Scholar]
  65. Li, Y.; Wang, H.; Duan, Y.; Li, X. CLIP surgery for better explainability with enhancement in open-vocabulary tasks. arXiv 2023, arXiv:2304.05653. [Google Scholar]
  66. Yang, X.; Chen, A.; PourNejatian, N.; et al. A large language model for electronic health records. npj Digit. Med. 2022, 5, 194. [Google Scholar] [CrossRef] [PubMed]
  67. Kim, M.; Kim, Y.; Kang, H.J.; Seo, H.; Choi, H.; Han, J.; Kee, G.; Park, S.; Ko, S.; Jung, H.; et al. Fine-tuning LLMs with medical data: can safety be ensured? NEJM AI 2025, 2, AIcs2400390. [Google Scholar] [CrossRef]
  68. Cheng, J.; Ye, J.; Deng, Z.; et al. SAM-Med2D. arXiv 2023, arXiv:2308.16184. [Google Scholar]
  69. Wang, H.; Guo, S.; Ye, J.; et al. SAM-Med3D: Towards general-purpose segmentation models for volumetric medical images. arXiv 2023, arXiv:2310.15161. [Google Scholar]
  70. Mazurowski, M.; Dong, H.; Gu, H.; Yang, J.; Konz, N.; Zhang, Y. Segment anything model for medical image analysis: an experimental study. Med. Image Anal. 2023, 89, 102918. [Google Scholar] [CrossRef] [PubMed]
  71. Deng, R.; Cui, C.; Liu, Q.; et al. Segment anything model (SAM) for digital pathology: Assess zero-shot segmentation on whole slide imaging. arXiv 2023, arXiv:2304.04155. [Google Scholar]
  72. Huang, Y.; Yang, X.; Liu, L.; Zhou, H.; Chang, A.; Zhou, X.; Chen, R.; Yu, J.; Chen, J.; Chen, C.; et al. Segment anything model for medical images? Med. Image Anal. 2024, 92, 103061. [Google Scholar] [CrossRef] [PubMed]
  73. He, S.; Bao, R.; Li, J.; Stout, J.; Bjornerud, A.; Grant, P.; Ou, Y. Computer-vision benchmark Segment-Anything Model (SAM) in medical images: Accuracy in 12 datasets. arXiv 2023, arXiv:2304.09324. [Google Scholar]
  74. Cheng, D.; Qin, Z.; Jiang, Z.; et al. SAM on medical images: A comprehensive study on three prompt modes. arXiv 2023, arXiv:2305.00035. [Google Scholar]
  75. Wu, J.; Wang, Z.; Hong, M.; Ji, W.; Fu, H.; Xu, Y.; Xu, M.; Jin, Y. Medical SAM Adapter: Adapting segment anything model for medical image segmentation. Med. Image Anal. 2025, 102, 103547. [Google Scholar] [CrossRef] [PubMed]
  76. Ma, J.; Kim, S.; Li, F.; Baharoon, M.; Asakereh, R.; Lyu, H.; Wang, B. Segment anything in medical images and videos: Benchmark and deployment. arXiv 2024, arXiv:2408.03322. [Google Scholar]
  77. Sun, J.; Chen, K.; He, Z.; Ren, S.; He, X.; Liu, X.; Peng, C. Medical image analysis using improved SAM-Med2D: Segmentation and classification perspectives. BMC Med. Imaging 2024, 24, 245. [Google Scholar] [CrossRef] [PubMed]
  78. Du, Y.; Bai, F.; Huang, T.; Zhao, B. SegVol: Universal and interactive volumetric medical image segmentation. Proc. NeurIPS 2024, Vol. 37, 110746–110783. [Google Scholar] [CrossRef]
  79. He, Y.; Guo, P.; Tang, Y.; Myronenko, A.; Nath, V.; Xu, Z.; Yang, D.; Zhao, C.; Simon, B.; Belue, M.; et al. VISTA3D: A unified segmentation foundation model for 3D medical imaging. In Proceedings of the CVPR, 2025; pp. 20863–20873. [Google Scholar]
  80. Lin, X.; Xiang, Y.; Zhang, L.; Yang, X.; Yan, Z.; Yu, L. SAMUS: Adapting segment anything model for clinically-friendly and generalizable ultrasound image segmentation. arXiv 2023, arXiv:2309.06824. [Google Scholar]
  81. Gu, H.; Colglazier, R.; Dong, H.; Zhang, J.; Chen, Y.; Yildiz, Z.; Chen, Y.; Li, L.; Yang, J.; Willhite, J.; et al. SegmentAnyBone: A universal model that segments any bone at any location on MRI. 2025. [Google Scholar] [CrossRef] [PubMed]
  82. Paranjape, J.N.; Nair, N.G.; Sikder, S.; Vedula, S.S.; Patel, V.M. AdaptiveSAM: Towards efficient tuning of SAM for surgical scene segmentation. In Proceedings of the Annual Conference on Medical Image Understanding and Analysis, 2024; Springer; pp. 187–201. [Google Scholar]
  83. Gong, S.; Zhong, Y.; Ma, W.; Li, J.; Wang, Z.; Zhang, J.; Heng, P.A.; Dou, Q. 3DSAM-adapter: Holistic adaptation of SAM from 2D to 3D for promptable tumor segmentation. Med. Image Anal. 2024, 98, 103324. [Google Scholar] [CrossRef] [PubMed]
  84. Zhang, K.; Liu, D. Customized segment anything model for medical image segmentation. arXiv 2023, arXiv:2304.13785. [Google Scholar]
  85. Gu, H.; Dong, H.; Yang, J.; Mazurowski, M.A. How to build the best medical image segmentation algorithm using foundation models: a comprehensive empirical study with segment anything model. 2024. [Google Scholar] [CrossRef] [PubMed]
  86. Carion, N.; Gustafson, L.; Hu, Y.T.; Debnath, S.; Hu, R.; Suris, D.; Ryali, C.; Alwala, K.V.; Khedr, H.; Huang, A.; et al. SAM 3: Segment Anything with Concepts. arXiv 2025, arXiv:2511.16719. [Google Scholar]
  87. Liu, A.; Xue, R.; Cao, X.R.; Shen, Y.; Lu, Y.; Li, X.; Chen, Q.; Chen, J. MedSAM3: Delving into Segment Anything with Medical Concepts. arXiv 2025, arXiv:2511.19046. [Google Scholar]
  88. Mandal, S.; Karthikeyan, D.; Paldhe, M. SAM2LoRA: Composite Loss-Guided, Parameter-Efficient Finetuning of SAM2 for Retinal Fundus Segmentation. arXiv 2025, arXiv:2510.10288. [Google Scholar]
  89. Liang, F.; Wu, B.; Dai, X.; et al. Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proceedings of the CVPR, 2023; pp. 7061–7070. [Google Scholar]
  90. Wang, Z.; Lu, Y.; Li, Q.; et al. CRIS: CLIP-Driven referring image segmentation. In Proceedings of the CVPR, 2022; pp. 11686–11695. [Google Scholar]
  91. Zou, X.; Yang, J.; Zhang, H.; et al. Segment everything everywhere all at once. In Proceedings of the NeurIPS, 2023. [Google Scholar]
  92. Zhao, Z.; Zhang, Y.; Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. Large-vocabulary segmentation for medical images with text prompts. npj Digit. Med. 2025, 8, 493. [Google Scholar] [CrossRef] [PubMed]
  93. Jiang, Y.; Huang, Z.; Zhang, R.; Zhang, X.; Zhang, S. ZePT: Zero-shot pan-tumor segmentation via query-disentangling and self-prompting. In Proceedings of the CVPR, 2024; pp. 11386–11397. [Google Scholar]
  94. Tanida, T.; Müller, P.; Kaissis, G.; Rückert, D. Interactive and explainable region-guided radiology report generation. In Proceedings of the CVPR, 2023; pp. 7433–7442. [Google Scholar]
  95. Huang, X.; Li, H.; Cao, M.; Chen, L.; You, C.; An, D. Cross-modal conditioned reconstruction for language-guided medical image segmentation. IEEE Trans. Med. Imaging 2024, 44, 1821–1835. [Google Scholar]
  96. Nath, V.; Li, W.; Yang, D.; Myronenko, A.; Zheng, M.; Lu, Y.; Liu, Z.; Yin, H.; Law, Y.M.; Tang, Y.; et al. VILA-M3: Enhancing vision-language models with medical expert knowledge. In Proceedings of the CVPR, 2025; pp. 14788–14798. [Google Scholar]
  97. Lee, G.E.; Kim, S.; Cho, J.; Choi, S.; Choi, S.I. Text-guided cross-position attention for segmentation: Case of medical image. In Proceedings of the MICCAI, 2023; pp. 537–546. [Google Scholar]
  98. Adhikari, R.; Dhakal, M.; Thapaliya, S.; Poudel, K.; Bhandari, P.; Khanal, B. Synthetic boost: Leveraging synthetic data for enhanced vision-language segmentation in echocardiography. In Proceedings of the ASMUS Workshop MICCAI, 2023; pp. 89–99. [Google Scholar]
  99. Liu, J.; Zhang, Y.; Wang, K.; Yavuz, M.; Chen, X.; Yuan, Y.; Li, H.; Yang, Y.; Yuille, A.; Tang, Y.; et al. Universal and extensible language-vision models for organ segmentation and tumor detection from abdominal computed tomography. Med. Image Anal. 2024, 97, 103226. [Google Scholar] [CrossRef] [PubMed]
  100. Wang, Y.; et al. A generalist foundation model and database for open-world medical image segmentation. Nat. Biomed. Eng. 2025. [Google Scholar] [CrossRef] [PubMed]
  101. Qamar, S.; Qadri, S.F.; Alroobaea, R.; Alshmrani, G.M.; Fazil, M.; Jiang, R. ScaleFusionNet: transformer-guided multi-scale feature fusion for skin lesion segmentation. Sci. Rep. 2025, 15, 34393. [Google Scholar] [CrossRef]
  102. Qamar, S.; Ahmad, P.; Shen, L. Dense encoder-decoder–based architecture for skin lesion segmentation. Cogn. Comput. 2021, 13, 583–594. [Google Scholar] [CrossRef]
  103. Huang, X.; Shen, L.; Liu, J.; Shang, F.; Li, H.; Huang, H.; Yang, Y. Towards a multimodal large language model with pixel-level insight for biomedicine. Proc. AAAI Conf. Artif. Intell. 2025, Vol. 39, 3779–3787. [Google Scholar] [CrossRef]
  104. Shi, Y.; Zhu, X.; Wang, K.; Hu, Y.; Guo, C.; Li, M.; Wu, J. Med-2E3: A 2D-enhanced 3D medical multimodal large language model. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2025; pp. 2754–2759. [Google Scholar]
  105. Chen, Z.; Varma, M.; Delbrouck, J.B.; Paschali, M.; Blankemeier, L.; Van Veen, D.; Valanarasu, J.M.J.; Youssef, A.; Cohen, J.P.; Reis, E.P.; et al. CheXagent: Towards a foundation model for chest X-ray interpretation. In Proceedings of the AAAI 2024 Spring Symposium on Clinical Foundation Models, 2024. [Google Scholar]
  106. Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment anything in images and videos. Proc. Int. Conf. Learn. Represent. 2025, Vol. 2025, 28085–28128. [Google Scholar]
  107. Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. Proc. AAAI Conf. Artif. Intell. 2019, Vol. 33, 590–597. [Google Scholar] [CrossRef]
  108. Wu, J.; Xu, M. One-prompt to segment all medical images. In Proceedings of the CVPR, 2024; pp. 11302–11312. [Google Scholar]
  109. Ji, Z.; Lee, N.; Frieske, R.; et al. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 1–38. [Google Scholar] [CrossRef]
  110. Bai, Z.; Wang, P.; Xiao, T.; He, T.; Han, Z.; Zhang, Z.; Shou, M.Z. Hallucination of multimodal large language models: A survey. arXiv 2024, arXiv:2404.18930. [Google Scholar]
  111. He, S.; Nie, Y.; Chen, Z.; Cai, Z.; Wang, H.; Yang, S.; Chen, H. MedDr: Diagnosis-guided bootstrapping for large-scale medical vision-language learning. arXiv 2024, arXiv:2404.15127. [Google Scholar]
  112. Zhang, H.; Chen, Y.; Wang, Y.; et al. Mitigating hallucinations in radiology vision-language models through fact-grounded preference optimization. Nat. Commun. 2025, 16, 6489. [Google Scholar]
  113. Koleilat, T.; Asgariandehkordi, H.; Rivaz, H.; Xiao, Y. MedCLIP-SAM: Bridging text and image towards universal medical image segmentation. In Proceedings of the MICCAI, 2024; pp. 643–653. [Google Scholar]
  114. Aleem, S.; Wang, F.; Maniparambil, M.; Arazo, E.; Dietlmeier, J.; Curran, K.; O'Connor, N.E.; Little, S. Test-time adaptation with SaLIP: A cascade of SAM and CLIP for zero-shot medical image segmentation. In Proceedings of the CVPR, 2024; pp. 5184–5193. [Google Scholar]
  115. Sun, L.; Hwang, J.; Kuo, C.C.; Cha, S. SegRefiner: Towards model-agnostic segmentation refinement with discrete diffusion process. In Proceedings of the NeurIPS, 2023. [Google Scholar]
  116. Lu, S.; Chen, Y.; Chen, Y.; Li, P.; Sun, J.; Zheng, C.; Zou, Y.; Liang, B.; Li, M.; Jin, Q.; et al. General lightweight framework for vision foundation model supporting multi-task and multi-center medical image analysis. Nat. Commun. 2025. [Google Scholar] [CrossRef] [PubMed]
  117. Murali, A.; Zarin, F.; Meyer, A.; Mascagni, P.; Mutter, D.; Padoy, N. CycleSAM: Few-Shot Surgical Scene Segmentation with Cycle-and Scene-Consistent Feature Matching. arXiv 2024, arXiv:2407.06795. [Google Scholar]
  118. Wang, H.; Vasu, P.K.A.; Faghri, F.; Vemulapalli, R.; Farajtabar, M.; Mehta, S.; Rastegari, M.; Tuzel, O.; Pouransari, H. SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding. In Proceedings of the CVPR Workshop, 2024. [Google Scholar]
  119. Heidari, M.; Kazerouni, A.; Soltany, M.; Azad, R.; Aghdam, E.K.; Cohen-Adad, J.; Merhof, D. HiFormer: Hierarchical multi-scale representations using transformers for medical image segmentation. In Proceedings of the WACV, 2023; pp. 1546–1556. [Google Scholar]
  120. Chen, X.; Wang, C.; Ning, H.; Li, S.; Shen, M. SAM-OCTA: Prompting segment-anything for OCTA image segmentation. Biomed. Signal Process. Control 2025, 106, 107698. [Google Scholar] [CrossRef]
  121. Lee, H.H.; Gu, Y.; Zhao, T.; Xu, Y.; Yang, J.; Usuyama, N.; Wong, C.; Wei, M.; Landman, B.A.; Huo, Y.; et al. Foundation models for biomedical image segmentation: A survey. arXiv 2024, arXiv:2401.07654. [Google Scholar]
  122. Zhang, Y.; Ye, F.; Yu, X.; Lian, X.; Jiang, T.; Yang, L.; Yang, L. Embedded framework for clinical medical image segment anything in resource limited healthcare regions. npj Digit. Med. 2025. [Google Scholar] [CrossRef] [PubMed]
  123. Zhong, T.; Zhao, W.; Zhang, Y.; Pan, Y.; Dong, P.; Jiang, Z.; Jiang, H.; Zhou, Y.; Kui, X.; Shang, Y.; et al. ChatRadio-Valuer: A chat large language model for generalizable radiology impression generation on multi-institution and multi-system data. IEEE Trans. Biomed. Eng. 2025. [Google Scholar]
  124. Gao, Y.; Liu, D.; Li, Z.; Li, Y.; Chen, D.; Zhou, M.; Metaxas, D.N. Show and segment: Universal medical image segmentation via in-context learning. In Proceedings of the CVPR, 2025; pp. 20830–20840. [Google Scholar]
  125. Zhang, X.; Ou, N.; Basaran, B.D.; Visentin, M.; Qiao, M.; Gu, R.; Matthews, P.M.; Liu, Y.; Ye, C.; Bai, W. A foundation model for lesion segmentation on brain MRI with mixture of modality experts. IEEE Trans. Med. Imaging 2025. [Google Scholar]
  126. Butoi, V.; Ortiz, J.; Ma, T.; Sabuncu, M.; Guttag, J.; Dalca, A. UniverSeg: Universal medical image segmentation. In Proceedings of the ICCV, 2023; pp. 21438–21451. [Google Scholar]
  127. Zhang, L.; Jindal, B.; Alaa, A.; Weinreb, R.; Wilson, D.; Segal, E.; Zou, J.; Xie, P. Generative AI enables medical image segmentation in ultra low-data regimes. Nat. Commun. 2025. [Google Scholar] [CrossRef] [PubMed]
  128. Xing, Z.; Ye, T.; Yang, Y.; Cai, D.; Gai, B.; Wu, X.J.; Gao, F.; Zhu, L. SegMamba-V2: Long-range sequential modeling Mamba for general 3D medical image segmentation. IEEE Trans. Med. Imaging 2025. [Google Scholar]
  129. Cox, J.; Liu, P.; Stolte, S.E.; Yang, Y.; Liu, K.; See, K.B.; Ju, H.; Fang, R. BrainSegFounder: Towards 3D foundation models for neuroimage segmentation. Med. Image Anal. 2024, 97, 103301. [Google Scholar] [CrossRef] [PubMed]
  130. Wang, J.; Wang, K.; Yu, Y.; Lu, Y.; Xiao, W.; Sun, Z.; Liu, F.; Zou, Z.; Gao, Y.; Yang, L.; et al. Self-improving generative foundation model for synthetic medical image generation and clinical applications. Nat. Med. 2025, 31, 609–617. [Google Scholar] [PubMed]
  131. Gao, Z.; Zhang, G.; Liang, H.; Liu, J.; Ma, L.; Wang, T.; Guo, Y.; Chen, Y.; Yan, Z.; Chen, X.; et al. A lung CT vision foundation model facilitating disease diagnosis and medical imaging. Nat. Commun. 2025. [Google Scholar] [CrossRef] [PubMed]
  132. Blankemeier, L.; Kumar, A.; Cohen, J.P.; Liu, J.; Liu, L.; Van Veen, D.; Gardezi, S.J.S.; Yu, H.; Paschali, M.; Chen, Z.; et al. Merlin: a computed tomography vision-language foundation model and dataset. Nature 2026. [Google Scholar] [CrossRef] [PubMed]
  133. Yang, Z.; DSouza, N.; Megyeri, I.; et al. Decipher-MR: a vision-language foundation model for 3D MRI representations. npj Digit. Med. 2026. [Google Scholar] [CrossRef] [PubMed]
  134. Qamar, S.; Alkhatarishi, M.; Alam, F.; Fazil, M. Confidence-weighted semi-supervised learning for skin lesion segmentation using hybrid CNN-Transformer networks. IEEE Access 2026. [Google Scholar]
  135. Qamar, S.; Fazil, M.; Ashraf, Z. Bridging annotation gaps: Hierarchical self-support learning for brain tumor segmentation. Diagnostics 2026. [Google Scholar] [CrossRef] [PubMed]
  136. Fischer, M.; Bartler, A.; Yang, B. Prompt tuning for parameter-efficient medical image segmentation. Med. Image Anal. 2024, 91, 103024. [Google Scholar] [CrossRef] [PubMed]
  137. Sudre, C.; Li, W.; Vercauteren, T.; Ourselin, S.; Cardoso, M. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. In Proceedings of the DLMIA Workshop, 2017; pp. 240–248. [Google Scholar]
  138. Ma, J.; Chen, J.; Ng, M.; et al. Loss odyssey in medical image segmentation. Med. Image Anal. 2021, 71, 102035. [Google Scholar] [CrossRef] [PubMed]
  139. Berman, M.; Triki, A.; Blaschko, M. The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure. In Proceedings of the CVPR, 2018; pp. 4413–4421. [Google Scholar]
  140. Khattak, M.; Wasim, S.; Naseer, M.; et al. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the ICCV, 2023; pp. 15144–15154. [Google Scholar]
  141. Liu, H.; Tam, D.; Muqeeth, M.; et al. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Proc. NeurIPS 2022, Vol. 35, 1950–1965. [Google Scholar] [CrossRef]
  142. Zhang, Q.; Chen, M.; Bukharin, A.; et al. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. In Proceedings of the ICLR, 2023. [Google Scholar]
  143. Jia, M.; Tang, L.; Chen, B.C.; et al. Visual prompt tuning. In Proceedings of the ECCV, 2022; pp. 709–727. [Google Scholar]
  144. Zhang, R.; Zhang, W.; Fang, R.; et al. Tip-Adapter: Training-free adaption of CLIP for few-shot classification. In Proceedings of the ECCV, 2022; pp. 493–510. [Google Scholar]
  145. Landman, B.; Xu, Z.; Iglesias, J.; et al. Multi-atlas labeling beyond the cranial vault—Workshop and challenge. In Proceedings of the MICCAI 2015 Workshop, 2015. [Google Scholar]
  146. Bilic, P.; Christ, P.; Li, H.; et al. The Liver Tumor Segmentation benchmark (LiTS). Med. Image Anal. 2023, 84, 102680. [Google Scholar] [CrossRef] [PubMed]
  147. Zhang, K.; Yu, J.; Yan, Z.; Liu, Y.; Adhikarla, E.; Fu, S.; Chen, J.; Devarakonda, C.; He, Y.; Kang, J.; et al. BiomedGPT: A generalist vision-language foundation model for diverse biomedical tasks. Nat. Med. 2024, 30, 3613–3623. [Google Scholar] [CrossRef] [PubMed]
  148. Khattak, M.U.; Kunhimon, S.; Naseer, M.; Khan, S.; Khan, F.S. UniMed-CLIP: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities. arXiv 2024, arXiv:2412.10372. [Google Scholar]
  149. Chen, W.; Liu, T.; Mei, H.; Luo, H.; Sun, X. SegMamba: Long-range sequential modeling Mamba for 3D medical image segmentation. In Proceedings of the MICCAI, 2024; pp. 578–588. [Google Scholar]
  150. Lian, Y.; Xie, Y.; Jiang, Y.; Wang, L.; Yu, H. A data-efficient 3D medical vision-language model using only a 2D encoder. Sci. Rep. 2026. [Google Scholar] [CrossRef] [PubMed]
  151. Bakas, S.; Akbari, H.; Sotiras, A.; et al. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 2017, 4, 170117. [Google Scholar] [CrossRef] [PubMed]
  152. Baid, U.; Ghodasara, S.; Mohan, S.; et al. The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv 2021, arXiv:2107.02314. [Google Scholar]
  153. Bernard, O.; Lalande, A.; Zotti, C.; et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Trans. Med. Imaging 2018, 37, 2514–2525. [Google Scholar] [CrossRef] [PubMed]
  154. Christensen, M.; Vukadinovic, M.; Yuan, N.; Ouyang, D. Vision-language foundation model for echocardiogram interpretation. Nat. Med. 2024, 30, 1481–1488. [Google Scholar] [CrossRef] [PubMed]
  155. Rahman, A.; Valanarasu, J.; Hacihaliloglu, I.; Patel, V. Ambiguous medical image segmentation using diffusion models. In Proceedings of the CVPR, 2023; pp. 11536–11546. [Google Scholar]
  156. Xing, Z.; Wan, L.; Fu, H.; Yang, G.; Zhu, L. Diff-UNet: A diffusion embedded network for volumetric segmentation. arXiv 2023, arXiv:2303.10326. [Google Scholar]
  157. Cheng, Z.; Wei, Q.; Zhu, H.; Wang, Y.; Qu, L.; Shao, W.; Zhou, Y. Unleashing the potential of SAM for medical adaptation via hierarchical decoding. In Proceedings of the CVPR, 2024; pp. 3511–3522. [Google Scholar]
  158. Dalmaz, O.; Yurt, M.; Çukur, T. ResViT: Residual vision transformers for multimodal medical image synthesis. IEEE Trans. Med. Imaging 2022, 41, 2598–2614. [Google Scholar] [CrossRef] [PubMed]
  159. Özbey, M.; Dalmaz, O.; Dar, S.; et al. Unsupervised medical image translation with adversarial diffusion models. IEEE Trans. Med. Imaging 2023, 42, 3524–3539. [Google Scholar] [CrossRef] [PubMed]
  160. Ma, J.; Li, F.; Wang, B. U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
  161. Liu, J.; Yang, H.; Zhou, H.Y.; Xi, Y.; Yu, L.; Li, C.; Liang, Y.; Shi, G.; Yu, Y.; Zhang, S.; et al. Swin-UMamba: Mamba-based UNet with ImageNet-based pretraining. In Proceedings of the MICCAI, 2024; pp. 615–625. [Google Scholar]
  162. Zhang, X.; Tan, R.T. Mamba as a bridge: Where vision foundation models meet vision language models for domain-generalized semantic segmentation. In Proceedings of the CVPR, 2025; pp. 14527–14537. [Google Scholar]
  163. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual state space model. In Proceedings of the NeurIPS, 2024. [Google Scholar]
  164. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  165. Khalid, N.; Qayyum, A.; Bilal, M.; Al-Fuqaha, A.; Qadir, J. Privacy-preserving artificial intelligence in healthcare: Techniques and applications. Comput. Biol. Med. 2023, 158, 106848. [Google Scholar] [CrossRef] [PubMed]
  166. Sheller, M.; Edwards, B.; Reina, G.; et al. Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci. Rep. 2020, 10, 12598. [Google Scholar] [CrossRef] [PubMed]
  167. Pati, S.; Baid, U.; Edwards, B.; et al. Federated learning enables big data for rare cancer boundary detection. Nat. Commun. 2022, 13, 7346. [Google Scholar] [CrossRef] [PubMed]
  168. Park, S.; Kim, G.; Kim, J.; Kim, B.; Ye, J. Federated split task-agnostic vision transformer for COVID-19 CXR diagnosis. Proc. NeurIPS 2021, 34, 24617–24630. [Google Scholar]
  169. Johnson, A.; Pollard, T.; Berkowitz, S.; et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317. [Google Scholar] [CrossRef] [PubMed]
  170. Zhao, W.; Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. RaTEScore: A metric for radiology report generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024; pp. 15004–15019. [Google Scholar]
  171. Pellegrini, C.; Özsoy, E.; Busam, B.; Wiestler, B.; Navab, N.; Keicher, M. RaDialog: Large Vision-Language Models for X-Ray Reporting and Dialog-Driven Assistance. In Proceedings of the Medical Imaging with Deep Learning, 2025. [Google Scholar]
  172. Alam, H.M.T.; Srivastav, D.; Kadir, M.A.; Sonntag, D. Towards interpretable radiology report generation via concept bottlenecks using a multi-agentic RAG. In Proceedings of the European Conference on Information Retrieval, 2025; Springer; pp. 201–209. [Google Scholar]
  173. Li, C.Y.; Chang, K.J.; Yang, C.F.; Wu, H.Y.; Chen, W.; Bansal, H.; Chen, L.; Yang, Y.P.; Chen, Y.C.; Chen, S.P.; et al. Towards a holistic framework for multimodal LLM in 3D brain CT radiology report generation. Nat. Commun. 2025, 16, 2258. [Google Scholar] [CrossRef] [PubMed]
  174. Soin, A.; Bhatu, N.; Mehta, R.; et al. CheXstray: Real-time multi-modal data concordance for drift detection in medical imaging AI. In Proceedings of the CHIL, 2022; pp. 152–167. [Google Scholar]
  175. Seyyed-Kalantari, L.; Zhang, H.; McDermott, M.; Chen, I.; Ghassemi, M. Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nat. Med. 2021, 27, 2176–2182. [Google Scholar] [CrossRef] [PubMed]
  176. Glocker, B.; Jones, C.; Roschewitz, M.; Winzeck, S. Risk of bias in chest radiography deep learning foundation models. Radiol. Artif. Intell. 2023, 5, e230060. [Google Scholar] [CrossRef] [PubMed]
  177. Yi, P.H.; Bachina, P.; Bharti, B.; Garin, S.P.; Kanhere, A.; Kulkarni, P.; Li, D.; Parekh, V.S.; Santomartino, S.M.; Moy, L.; et al. Pitfalls and best practices in evaluation of AI algorithmic biases in radiology. Radiology 2025, 315, e241674. [Google Scholar] [CrossRef] [PubMed]
  178. Drukker, K.; Chen, W.; Gichoya, J.; et al. Toward fairness in artificial intelligence for medical image analysis: identification and mitigation of potential biases. J. Med. Imaging 2023, 10, 061104. [Google Scholar] [CrossRef]
  179. Acosta, J.N.; Falcone, G.J.; Rajpurkar, P.; Topol, E.J. Multimodal biomedical AI. Nat. Med. 2022, 28, 1773–1784. [Google Scholar] [CrossRef] [PubMed]
  180. Tu, T.; Schaekermann, M.; Palepu, A.; Saab, K.; Freyberg, J.; Tanno, R.; Wang, A.; Li, B.; Amin, M.; Cheng, Y.; et al. Towards conversational diagnostic artificial intelligence. Nature 2025, 642, 442–450. [Google Scholar] [CrossRef] [PubMed]
  181. Jha, D.; Durak, G.; Das, A.; Sanjotra, J.; Susladkar, O.; Sarkar, S.; Rauniyar, A.; Kumar Tomar, N.; Peng, L.; Li, S.; et al. Ethical framework for responsible foundational models in medical imaging. Front. Med. 2025, 12, 1544501. [Google Scholar] [CrossRef]
  182. Jiang, Y.; Du, Y.; Xiong, K.; Huang, K.; Li, T.; Li, Z.; Zhang, M.; Gan, X.; Li, Q.; Liang, J.; et al. Foundation model-guided multi-view semi-supervised CT segmentation of liver tumors in resource-constrained settings. npj Digit. Med. 2026, 9, 31. [Google Scholar] [CrossRef] [PubMed]
  183. Chen, H.; Cai, Y.; Wang, C.; Chen, L.; Zhang, B.; Han, H.; Guo, Y.; Ding, H.; Zhang, Q. Multi-organ foundation model for universal ultrasound image segmentation with task prompt and anatomical prior. IEEE Trans. Med. Imaging 2024, 44, 1005–1018. [Google Scholar] [CrossRef]
  184. Heller, N.; Sathianathen, N.; Kalapara, A.; et al. The KiTS19 challenge data: 300 kidney tumor cases with clinical context, CT semantic segmentations, and surgical outcomes. arXiv 2019, arXiv:1904.00445. [Google Scholar]
  185. Heller, N.; Isensee, F.; Trofimova, D.; et al. The KiTS21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase CT. arXiv 2023, arXiv:2307.01984. [Google Scholar]
  186. Heller, N.; Isensee, F.; Maier-Hein, K.; et al. The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 challenge. Med. Image Anal. 2021, 67, 101821. [Google Scholar] [PubMed]
  187. Wang, M.; Lin, T.; Lin, A.; Yu, K.; Peng, Y.; Wang, L.; Chen, C.; Zou, K.; Liang, H.; Chen, M.; et al. Common and rare fundus diseases identification using vision-language foundation model with knowledge of over 400 diseases. Nat. Commun. 2025, 16, 1325. [Google Scholar] [CrossRef]
  188. Wang, M.; Lin, T.; Lin, A.; Yu, K.; Peng, Y.; Wang, L.; Chen, C.; Zou, K.; Cheung, C.Y.; Pang, C.P.; et al. Enhancing diagnostic accuracy in rare and common fundus diseases with a knowledge-rich vision-language model. Nat. Commun. 2025. [Google Scholar] [CrossRef] [PubMed]
  189. Luo, L.; Kim, S.E.; Zhang, X.; Kernbach, J.M.; Kenia, R.; Acosta, J.N.; Nathanson, L.A.; Haimovich, A.D.; Rodman, A.; Goh, E.; et al. A clinical environment simulator for dynamic AI evaluation. Nat. Med. 2026, 1–8. [Google Scholar] [CrossRef]
  190. Cui, H.; Wang, C.; Maan, H.; Pang, K.; Luo, F.; Duan, N.; Wang, B. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat. Methods 2024, 21, 1470–1480. [Google Scholar] [CrossRef] [PubMed]
  191. Theodoris, C.; Xiao, L.; Chopra, A.; et al. Transfer learning enables predictions in network biology. Nature 2023, 618, 616–624. [Google Scholar] [CrossRef] [PubMed]
  192. Bedi, S.; Liu, Y.; Orr-Ewing, L.; Dash, D.; Koyejo, S.; Callahan, A.; Fries, J.A.; Wornow, M.; Swaminathan, A.; Lehmann, L.S.; et al. Testing and evaluation of health care applications of large language models: A systematic review. JAMA 2025, 333, 319–328. [Google Scholar] [CrossRef] [PubMed]
  193. Sallam, M. ChatGPT utility in healthcare education, research, and practice: systematic review on the promising perspectives and valid concerns. Healthcare 2023, 11, 887. [Google Scholar] [CrossRef] [PubMed]
  194. Eriksen, A.; Möller, S.; Ryg, J. Use of GPT-4 to diagnose complex clinical cases. NEJM AI 2024, 1. [Google Scholar] [CrossRef]
  195. Yan, K.; Wang, X.; Lu, L.; Summers, R.M. DeepLesion: Automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J. Med. Imaging 2018, 5, 036501. [Google Scholar] [CrossRef] [PubMed]
  196. Hung, A.L.Y.; Zheng, H.; Zhao, K.; Du, X.; Pang, K.; Miao, Q.; Raman, S.S.; Terzopoulos, D.; Sung, K. CSAM: A 2.5D cross-slice attention module for anisotropic volumetric medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024; pp. 5923–5932. [Google Scholar]
  197. Teo, Z.L.; Thirunavukarasu, A.J.; Elangovan, K.; Cheng, H.; Moova, P.; Soetikno, B.; Nielsen, C.; Pollreisz, A.; Ting, D.S.J.; Morris, R.J.; et al. Generative artificial intelligence in medicine. Nat. Med. 2025. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Organization of this survey, showing the major sections and the relationships among background, methods, adaptation, modalities, applications, evaluation, and the combined challenges and future directions.
Figure 2. Timeline of key vision-language and segmentation foundation models from 2021 to 2026. General vision-language models (purple) laid the architectural foundations. Medical vision-language models (teal) adapted these to clinical data. Segmentation foundation models (coral) combined prompt-based interaction with medical domain knowledge. Radiology and report generation models (blue) extended these to clinical workflows. Years reflect first public release or peer-reviewed publication.
Figure 3. Taxonomy of vision-language foundation models for medical image segmentation. We split the literature into three categories: text-prompt guided segmentation models, LLM-embedded architectures, and hybrid frameworks.
Table 1. Comparison of recent surveys related to vision-language foundation models in medical imaging. Our work focuses specifically on segmentation and on the rapid developments in 2025 and 2026.
| Survey | Year | Focus | Coverage | Contribution / Limitation |
|---|---|---|---|---|
| Azad et al. | 2023 | Foundation models in medical imaging | General foundation models | Broad scope, predates most VLM segmentation work |
| Shamshad et al. | 2023 | Transformers in medical imaging | Transformer architectures | Pre-foundation-model era, no VLM focus |
| Zhao et al. | 2023 | CLIP in medical imaging | Classification and retrieval | Limited segmentation coverage |
| Khan et al. | 2025 | Foundation models in medicine | Comprehensive survey | Broad coverage, less depth on segmentation taxonomy |
| Awais et al. | 2025 | Foundation models in vision | Vision foundation models | Not medical-specific |
| Wu et al. | 2025 | VLFM for 3D medical imaging | Report generation focus | Limited segmentation scope |
| Lee et al. | 2025 | Foundation models for MIS | Zero-shot evaluation | Narrative review, no taxonomy of VLM methods |
| Lee et al. | 2025 | VLFM for medical imaging | Current practices | General review, no segmentation taxonomy |
| Wang et al. | 2025 | VLM systematic review | Meta-analysis | Statistical focus, limited methodology depth |
| Li et al. | 2025 | VLM in medical image analysis | All medical analysis tasks | Broad, less segmentation depth |
| Liu et al. | 2025 | SAM for medical segmentation | SAM-only focus | No VLM coverage |
| This work | 2026 | VLM FM for medical segmentation | Architectures, adaptation, modalities, applications | Comprehensive taxonomy with focus on 2025-2026 work |
Table 5. Summary of widely used benchmark datasets for medical image segmentation, organized by modality. Recent datasets such as BiomedParseData and CT-RATE specifically support vision-language pretraining and text-prompted segmentation.
| Dataset | Year | Modality | Size | Annotation Type | Primary Use |
|---|---|---|---|---|---|
| BTCV (Synapse) | 2015 | CT | 30 scans, 13 organs | Voxel mask | Multi-organ benchmark |
| AMOS22 | 2022 | CT, MRI | 500 CT + 100 MRI, 15 organs | Voxel mask | Multi-organ versatility |
| LiTS | 2023 | CT | 131 scans (liver+tumor) | Voxel mask | Liver tumor segmentation |
| KiTS19/21 | 2019/21 | CT | 300/489 scans | Voxel mask | Kidney tumor |
| FLARE 2022 | 2022 | CT | 2200 unlabeled + 50 labeled | Voxel mask | Low-resource segmentation |
| BraTS 2021 | 2021 | MRI | 2000+ multi-seq cases | Voxel mask | Brain tumor segmentation |
| ACDC | 2018 | MRI | 100 patients | Voxel mask | Cardiac chamber seg. |
| TotalSeg | 2023 | CT | 1228 scans, 104 structs | Voxel mask | Universal anatomy |
| TotalSeg-MRI | 2025 | MRI | 616 MRI + 527 CT, 80 structs | Voxel mask | MRI universal |
| MIMIC-CXR | 2019 | CXR | 377K images + reports | Image-text | VLP, classification |
| CheXpert | 2019 | CXR | 224K images + labels | Image labels | Pathology classification |
| PadChest | 2020 | CXR | 160K images + reports | Image-text | VLP |
| VinDr-CXR | 2022 | CXR | 18K with bounding boxes | Bounding box | Detection |
| NIH ChestXray14 | 2017 | CXR | 112K images | Image labels | Classification |
| MSD | 2022 | Multi | 10 segmentation tasks | Voxel mask | Generalization |
| ISIC | 2019 | Dermatology | 25K images | Mask + class | Skin lesion |
| HAM10000 | 2018 | Dermatology | 10K images | Class + mask | Skin lesion |
| BUSI | 2020 | Ultrasound | 780 images | Mask | Breast lesion |
| Quilt-1M | 2023 | Pathology | 1M image-text pairs | Text caption | Pathology VLP |
| GLaS | 2017 | Pathology | 165 H&E images | Mask | Gland segmentation |
| Camelyon16/17 | 2016/17 | Pathology | 400+/1000+ WSI | Mask + class | Metastasis detection |
| BiomedParseData | 2025 | 9 modalities | 6M image-mask-text | Mask + text | Text-prompted seg. |
| CT-RATE | 2024 | Chest CT | 50K volumes + reports | Volume + text | 3D VLP |
| LIDC-IDRI | 2011 | CT | 1018 lung nodule scans | Mask + class | Lung nodule |
| REFUGE2 | 2022 | Retinal fundus | 1200 images | Mask + class | Glaucoma assessment |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.