Submitted: 12 June 2026
Posted: 16 June 2026
Abstract
Keywords:
1. Introduction
- Universal MMFMs (Uni-MMFMs): Trained on massive, multi-institutional, multimodal datasets covering varied medical knowledge, these models emphasize reliable generalization and wide utility across diseases and tasks, acting as base platforms for downstream work.
- Modality-specific MMFMs (MS-MMFMs): Focused on a single medical data modality (e.g., radiological imaging, histopathology slides, or ultrasound imaging), these models are tuned for distinct characteristics to deliver better results in specific tasks.
- Organ-specific MMFMs (OS-MMFMs): Targeting specific organ systems (e.g., brain, heart, or eye) and their associated conditions, these models utilize organ-focused multimodal data to enhance organ-targeted diagnostic and prognostic tasks.
2. Background and Taxonomy
- Uni-MMFMs: Keywords were combined as (‘medical’ OR ‘healthcare’ OR ‘biomedical’) AND (‘foundation model’ OR ‘vision-language model’ OR ‘multimodal large language model’ OR ‘multimodal foundation model’).
- MS-/OS-MMFMs: Keywords were combined as (‘[modality-/organ-specific name]’) AND (‘foundation model’ OR ‘vision-language model’ OR ‘multimodal large language model’ OR ‘multimodal foundation model’), where [modality-/organ-specific name] is replaced with the specific modality or organ (e.g., ‘CT’).
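The two search strategies above share one pattern: a scope group joined by AND to a fixed group of model-type terms. A minimal Python sketch of that pattern follows. The helper names `or_group` and `build_query` are hypothetical, and the double-quote and uppercase AND/OR syntax is an assumption, since the text does not state which database syntax was used.

```python
from typing import Iterable

# Model-type terms shared by both search strategies (copied from the list above).
MODEL_TERMS = (
    "foundation model",
    "vision-language model",
    "multimodal large language model",
    "multimodal foundation model",
)

# Scope terms for the Uni-MMFM search (copied from the list above).
UNIVERSAL_TERMS = ("medical", "healthcare", "biomedical")


def or_group(terms: Iterable[str]) -> str:
    """Quote each term, join the terms with OR, and wrap them in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"


def build_query(scope_terms: Iterable[str]) -> str:
    """Hypothetical helper: AND a scope group with the shared model-type group."""
    return f"{or_group(scope_terms)} AND {or_group(MODEL_TERMS)}"


uni_query = build_query(UNIVERSAL_TERMS)
ct_query = build_query(["CT"])  # one MS-MMFM instance, using the 'CT' example
print(uni_query)
print(ct_query)
```

Swapping `["CT"]` for another modality or organ name yields the corresponding MS-/OS-MMFM query.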
- Uni-MMFMs trained with at least 500,000 images or image-text pairs are included. These models must integrate heterogeneous data distributions, and public resources (e.g., TCGA [19] with > 33,000 pathology image-report pairs, alongside MIMIC-CXR [20], CheXpert [21], and NIH ChestX-ray [22], collectively providing > 870,000 image-text pairs) suggest 500,000 samples as a benchmark threshold for ensuring disease diversity and cross-modal consistency.
- MS-/OS-MMFMs trained with at least 100,000 high-quality medical images or image-text pairs are included. This threshold is based on clinical validation: public datasets such as MIMIC-CXR [20] (378,000 image-text pairs) and CheXpert [21] (224,000 image-text pairs) demonstrate that 100,000 image-text pairs is the minimum viable scale for models to achieve radiologist-level diagnostic performance. Below this threshold, the model’s ability to generalize to rare lesions, equipment differences, and complex clinical presentations declines significantly.
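The size-based inclusion criteria above amount to a per-category threshold check. The sketch below illustrates it in Python. The `Candidate` record, the model names, and the sample counts are hypothetical placeholders rather than surveyed models, and the "high-quality" requirement for MS-/OS-MMFMs is not modeled because the text gives no measurable definition of it.

```python
from dataclasses import dataclass

# Minimum training-set sizes stated in the inclusion criteria above.
MIN_SAMPLES = {"Uni": 500_000, "MS": 100_000, "OS": 100_000}


@dataclass(frozen=True)
class Candidate:
    name: str       # hypothetical model name
    category: str   # "Uni", "MS", or "OS"
    n_samples: int  # images or image-text pairs used for training


def is_included(model: Candidate) -> bool:
    """Return True when the model meets the size threshold for its category."""
    return model.n_samples >= MIN_SAMPLES[model.category]


# Illustrative candidates only; the counts are made up for the example.
candidates = [
    Candidate("model-a", "Uni", 750_000),
    Candidate("model-b", "Uni", 300_000),
    Candidate("model-c", "MS", 120_000),
    Candidate("model-d", "OS", 99_999),
]
included = [m.name for m in candidates if is_included(m)]
print(included)
```

Under these placeholder counts, only the candidates at or above their category threshold are retained.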
- Vision-encoder based vision foundation models (VE-VFMs): they are built on a vision-encoder architecture (e.g., ViT [23] or CNN [24]) that maps raw pixels into dense feature embeddings. These models are trained via self-supervised contrastive learning [25] to maximize similarity between different augmented views of the same image while pushing different images apart.
- Vision encoder-decoder based vision foundation models (VED-VFMs): they are built on a vision encoder-decoder architecture (e.g., MAE [4] or VAE [26]), where the encoder compresses raw pixels into latent representations and the decoder reconstructs the original input. They are trained via self-supervised reconstruction objectives to minimize the discrepancy between original and reconstructed images.
- Multimodal-encoder vision-language foundation models (MME-VLFMs): they are built on a dual-stream encoder architecture (e.g., CLIP [5]) that projects raw pixels and natural language tokens into a shared latent space via separate encoders. They are trained via contrastive learning on massive image-text pairs to maximize the similarity of matched pairs while minimizing that of mismatched ones.
- Multimodal-encoder and language-decoder vision-language foundation models (MME-LD-VLFMs): they are built on an architecture integrating a pre-trained visual encoder with an LLM decoder (e.g., LLaVA [6]), projecting visual features as tokens into the LLM’s embedding space. They are trained on instruction-following data using a causal language modeling loss to generate text responses grounded in visual contexts.
- Multimodal encoder-decoder vision-language foundation models (MMED-VLFMs): they are built on a composite architecture integrating a vision encoder, a reasoning LLM, and a visual decoder. This framework processes visual-textual inputs to interpret semantic data and synthesize new images or masks, optimized via a multi-task objective combining language modeling and visual reconstruction/segmentation losses.
- Mixed-modal vision-language foundation models (MM-VLFMs) [27]: they are built on a unified architecture that processes and generates interleaved sequences of discrete text and visual tokens within a single transformer. Trained with an autoregressive next-token prediction objective, they handle diverse cross-modal tasks including visual question answering, captioning, and content creation.
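The contrastive objective described above for MME-VLFMs (matched image-text pairs pulled together, mismatched pairs pushed apart) can be illustrated with a symmetric cross-entropy over a similarity matrix. The following is a minimal pure-Python sketch under stated assumptions, not the implementation of any surveyed model: the function names are hypothetical, the temperature of 0.07 is an assumed default, embeddings are assumed to be non-zero vectors, and pair k of the image list is assumed to match pair k of the text list.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of one row of logits against the index of the matched pair."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def symmetric_contrastive_loss(
    image_emb: Sequence[Vector],
    text_emb: Sequence[Vector],
    temperature: float = 0.07,  # assumed value, not taken from the text
) -> float:
    """Average of image-to-text and text-to-image cross-entropy losses."""
    imgs = [normalize(v) for v in image_emb]
    txts = [normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    loss_i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    loss_t2i = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)


aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is near zero when each image embedding aligns with its own text embedding and grows when the pairing is swapped, which is the behavior the objective is meant to reward.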
3. Universal Multimodal Foundation Models
3.1. Overview of Uni-MMFMs
3.2. Challenges and Methods
4. Modality-Specific Multimodal Foundation Models
4.1. Pathology/WSI Multimodal Foundation Models
4.1.1. Overview of Pathology/WSI MMFMs
4.1.2. Challenges and Methods
4.2. X-ray Multimodal Foundation Models
4.2.1. Overview of X-ray MMFMs
4.2.2. Challenges and Methods
4.3. CT Multimodal Foundation Models
4.3.1. Overview of CT MMFMs
4.3.2. Challenges and Methods
4.4. Ultrasound Multimodal Foundation Models
4.4.1. Overview of Ultrasound MMFMs
4.4.2. Challenges and Methods
5. Organ-Specific Multimodal Foundation Models
5.1. Eye Multimodal Foundation Models
5.1.1. Overview of Eye MMFMs
5.1.2. Challenges and Methods
5.2. Other Multimodal Foundation Models
5.2.1. Challenges and Methods
6. Future Directions
6.1. Data Level
6.2. Architecture Level
6.3. User Demand Aspect
6.4. Technical Considerations from a Developer’s Perspective
7. Conclusions
References
- Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the opportunities and risks of foundation models. arXiv 2021, arXiv:2108.07258.
- Moor, M.; Banerjee, O.; Abad, Z.S.H.; Krumholz, H.M.; Leskovec, J.; Topol, E.J.; Rajpurkar, P. Foundation models for generalist medical artificial intelligence. Nature 2023, 616, 259–265.
- Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361.
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022; pp. 15979–15988.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, 2021; pp. 8748–8763.
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916.
- Zhang, S.; Metaxas, D. On the challenges and perspectives of foundation models for medical image analysis. Med. Image Anal. 2024, 91, 102996.
- Shrestha, P.; Amgain, S.; Khanal, B.; Linte, C.A.; Bhattarai, B. Medical vision language pretraining: A survey. arXiv 2023, arXiv:2312.06224.
- Khan, W.; Leem, S.; See, K.B.; Wong, J.K.; Zhang, S.; Fang, R. A Comprehensive Survey of Foundation Models in Medicine. IEEE Rev. Biomed. Eng. 2025, 1–22.
- He, Y.; Huang, F.; Jiang, X.; Nie, Y.; Wang, M.; Wang, J.; Chen, H. Foundation Model for Advancing Healthcare: Challenges, Opportunities and Future Directions. IEEE Rev. Biomed. Eng. 2025, 18, 172–191.
- Liu, C.; Jin, Y.; Guan, Z.; Li, T.; Qin, Y.; Qian, B.; Jiang, Z.; Wu, Y.; Wang, X.; Zheng, Y.F.; et al. Visual–language foundation models in medicine. Vis. Comput. 2024, 1–20.
- Sun, K.; Xue, S.; Sun, F.; Sun, H.; Luo, Y.; Wang, L.; Wang, S.; Guo, N.; Liu, L.; Zhao, T.; et al. Medical Multimodal Foundation Models in Clinical Diagnosis and Treatment: Applications, Challenges, and Future Directions. arXiv 2024, arXiv:2412.02621.
- Azad, B.; Azad, R.; Eskandari, S.; Bozorgpour, A.; Kazerouni, A.; Rekik, I.; Merhof, D. Foundational models in medical imaging: A comprehensive survey and future vision. arXiv 2023, arXiv:2310.18689.
- Huang, S.C.; Jensen, M.; Yeung-Levy, S.; Lungren, M.P.; Poon, H.; Chaudhari, A.S. Multimodal Foundation Models for Medical Imaging-A Systematic Review and Implementation Guidelines. medRxiv 2024, 2024–10.
- AlSaad, R.; Abd-Alrazaq, A.; Boughorbel, S.; Ahmed, A.; Renault, M.A.; Damseh, R.; Sheikh, J. Multimodal large language models in health care: applications, challenges, and future outlook. J. Med. Internet Res. 2024, 26, e59505.
- Awais, M.; Naseer, M.; Khan, S.; Anwer, R.M.; Cholakkal, H.; Shah, M.; Yang, M.H.; Khan, F.S. Foundation Models Defining a New Era in Vision: A Survey and Outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2245–2264.
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213.
- Ding, N.; Qin, Y.; Yang, G.; Wei, F.; Yang, Z.; Su, Y.; Hu, S.; Chen, Y.; Chan, C.M.; Chen, W.; et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat. Mach. Intell. 2023, 5, 220–235.
- Weinstein, J.N.; Collisson, E.A.; Mills, G.B.; Shaw, K.R.; Ozenberger, B.A.; Ellrott, K.; Shmulevich, I.; Sander, C.; Stuart, J.M. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 2013, 45, 1113–1120.
- Johnson, A.E.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.y.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317.
- Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. Proc. AAAI Conf. Artif. Intell. 2019, 33, 590–597.
- Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017; pp. 3462–3471.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016; pp. 770–778.
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020; pp. 9726–9735.
- Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
- Bansal, H.; Israel, D.; Zhao, S.; Li, S.; Nguyen, T.; Grover, A. MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants. arXiv 2024, arXiv:2412.12661.
- Liu, J.; Wang, Z.; Ye, Q.; Chong, D.; Zhou, P.; Hua, Y. Qilin-med-vl: Towards chinese large vision-language model for general healthcare. arXiv 2023, arXiv:2310.17956.
- Nguyen, D.; Nguyen, H.; Diep, N.; Pham, T.N.; Cao, T.; Nguyen, B.; Swoboda, P.; Ho, N.; Albarqouni, S.; Xie, P.; et al. Lvm-med: Learning large-scale self-supervised vision models for medical imaging via second-order graph matching. Adv. Neural Inf. Process. Syst. 2023, 36, 27922–27950.
- Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. Towards generalist foundation model for radiology. arXiv 2023, arXiv:2308.02463.
- Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Adv. Neural Inf. Process. Syst. 2023, 36, 28541–28564.
- Moor, M.; Huang, Q.; Wu, S.; Yasunaga, M.; Dalmia, Y.; Leskovec, J.; Zakka, C.; Reis, E.P.; Rajpurkar, P. Med-flamingo: a multimodal medical few-shot learner. In Proceedings of the Machine Learning for Health (ML4H). PMLR, 2023; pp. 353–367.
- Zhang, S.; Xu, Y.; Usuyama, N.; Xu, H.; Bagga, J.; Tinn, R.; Preston, S.; Rao, R.; Wei, M.; Valluri, N.; et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv 2023, arXiv:2303.00915.
- Lin, W.; Zhao, Z.; Zhang, X.; Wu, C.; Zhang, Y.; Wang, Y.; Xie, W. Pmc-clip: Contrastive language-image pre-training using biomedical documents. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, 2023; Springer; pp. 525–536.
- Liu, Z.; Tieu, A.; Patel, N.; Soultanidis, G.; Deyer, L.; Wang, Y.; Huver, S.; Zhou, A.; Mei, Y.; Fayad, Z.A.; et al. VIS-MAE: An Efficient Self-supervised Learning Approach on Medical Image Segmentation and Classification. In Proceedings of the International Workshop on Machine Learning in Medical Imaging; Springer, 2024; pp. 95–107.
- Zhang, K.; Zhou, R.; Adhikarla, E.; Yan, Z.; Liu, Y.; Yu, J.; Liu, Z.; Chen, X.; Davison, B.D.; Ren, H.; et al. A generalist vision–language foundation model for diverse biomedical tasks. Nat. Med. 2024, 1–13.
- Tu, T.; Azizi, S.; Driess, D.; Schaekermann, M.; Amin, M.; Chang, P.C.; Carroll, A.; Lau, C.; Tanno, R.; Ktena, I.; et al. Towards generalist biomedical AI. NEJM AI 2024, 1, AIoa2300138.
- Liu, X.; Yang, G.; Luo, Y.; Mao, J.; Zhang, X.; Gao, M.; Zhang, S.; Shen, J.; Wang, G. Expert-level vision-language foundation model for real-world radiology and comprehensive evaluation. arXiv 2024, arXiv:2409.16183.
- Zhao, T.; Gu, Y.; Yang, J.; Usuyama, N.; Lee, H.H.; Kiblawi, S.; Naumann, T.; Gao, J.; Crabtree, A.; Abel, J.; et al. A foundation model for joint segmentation, detection and recognition of biomedical objects across nine modalities. Nat. Methods 2025, 22, 166–176.
- Xu, L.; Sun, H.; Ni, Z.; Li, H.; Zhang, S. MedViLaM: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation. arXiv 2024, arXiv:2409.19684.
- Li, T.; Su, Y.; Li, W.; Fu, B.; Chen, Z.; Huang, Z.; Wang, G.; Ma, C.; Chen, Y.; Hu, M.; et al. GMAI-VL & GMAI-VL-5.5 M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI. arXiv 2024, arXiv:2411.14522.
- Chen, Z.; Pekis, A.; Brown, K. Advancing High Resolution Vision-Language Models in Biomedicine. arXiv 2024, arXiv:2406.09454.
- Chu, Y.; Zhang, Y.; Han, Z.; Yang, C.; Zhou, L.; Luo, G.; Gao, X. Improving Representation of High-frequency Components for Medical Foundation Models. arXiv 2024, arXiv:2407.14651.
- He, S.; Nie, Y.; Chen, Z.; Cai, Z.; Wang, H.; Yang, S.; Chen, H. Meddr: Diagnosis-guided bootstrapping for large-scale medical vision-language learning. arXiv E-Prints 2024, arXiv–2404.
- Yang, L.; Xu, S.; Sellergren, A.; Kohlberger, T.; Zhou, Y.; Ktena, I.; Kiraly, A.; Ahmed, F.; Hormozdiari, F.; Jaroensri, T.; et al. Advancing multimodal medical capabilities of Gemini. arXiv 2024, arXiv:2405.03162.
- Cui, H.; Mao, L.; Liang, X.; Zhang, J.; Ren, H.; Li, Q.; Li, X.; Yang, C. Biomedical visual instruction tuning with clinician preference alignment. Adv. Neural Inf. Process. Syst. 2024, 37, 96449–96467.
- Luo, L.; Chen, X.; Tang, B.; Chen, X.; Han, R.; Hu, C.; Li, Y.; Chen, T. Building Universal Foundation Models for Medical Image Analysis with Spatially Adaptive Networks. arXiv 2023, arXiv:2312.07630.
- Ye, Y.; Xie, Y.; Zhang, J.; Chen, Z.; Wu, Q.; Xia, Y. Continual Self-Supervised Learning: Towards Universal Multi-Modal Medical Data Representation Learning. Proc. 2024 IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) 2024, 11114–11124.
- Nath, V.; Li, W.; Yang, D.; Myronenko, A.; Zheng, M.; Lu, Y.; Liu, Z.; Yin, H.; Law, Y.M.; Tang, Y.; et al. Vila-m3: Enhancing vision-language models with medical expert knowledge. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 14788–14798.
- Chen, J.; Gui, C.; Ouyang, R.; Gao, A.; Chen, S.; Chen, G.H.; Wang, X.; Cai, Z.; Ji, K.; Wan, X.; et al. Towards injecting medical visual knowledge into multimodal llms at scale. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024; pp. 7346–7370.
- Bawazir, A.; Wu, K.; Li, W. Uni-Mlip: Unified self-supervision for medical vision language pre-training. arXiv 2024, arXiv:2411.15207.
- Liu, F.; Zhou, H.; Wang, K.; Yu, Y.; Gao, Y.; Sun, Z.; Liu, S.; Sun, S.; Zou, Z.; Li, Z.; et al. MetaGP: A generative foundation model integrating electronic health records and multimodal imaging for addressing unmet clinical needs. Cell Rep. Med. 2025, 6.
- Lin, T.; Zhang, W.; Li, S.; Yuan, Y.; Yu, B.; Li, H.; He, W.; Jiang, H.; Li, M.; Song, X.; et al. Healthgpt: A medical large vision-language model for unifying comprehension and generation via heterogeneous knowledge adaptation. arXiv 2025, arXiv:2502.09838.
- Xu, W.; Chan, H.P.; Li, L.; Aljunied, M.; Yuan, R.; Wang, J.; Xiao, C.; Chen, G.; Liu, C.; Li, Z.; et al. Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. arXiv 2025, arXiv:2506.07044.
- Wu, L.; Nie, Y.; He, S.; Zhuang, J.; Luo, L.; Mahboobani, N.; Vardhanabhuti, V.; Chan, R.C.K.; Peng, Y.; Rajpurkar, P.; et al. UniBiomed: A Universal Foundation Model for Grounded Biomedical Image Interpretation. arXiv 2025, arXiv:2504.21336.
- Chen, K.; Li, Y.; Zhu, X.; Zhang, W.; Hu, B. A vision-language model with multi-granular knowledge fusion in medical imaging. World Wide Web 2025, 28, 5.
- Nie, Y.; He, S.; Bie, Y.; Wang, Y.; Chen, Z.; Yang, S.; Chen, H. ConceptCLIP: Towards Trustworthy Medical AI via Concept-Enhanced Contrastive Langauge-Image Pre-training. arXiv 2025, arXiv:2501.15579.
- Liu, J.; Zhou, H.Y.; Huang, W.; Yang, H.; Song, D.; Tan, T.; Liang, Y.; Wang, S. BioVFM-21M: Benchmarking and Scaling Self-supervised Vision Foundation Models for Biomedical Image Analysis. In Proceedings of the International Workshop on Foundation Models for General Medical AI, 2025; Springer; pp. 23–33.
- Dai, W.; Chen, P.; Ekbote, C.; Liang, P.P. QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training. arXiv 2025, arXiv:2506.00711.
- Wang, R.; Yao, Q.; Jiang, Z.; Lai, H.; He, Z.; Tao, X.; Zhou, S.K. ECAMP: entity-centered context-aware medical vision language pre-training. Med. Image Anal. 2025, 103690.
- Lozano, A.; Sun, M.W.; Burgess, J.; Chen, L.; Nirschl, J.J.; Gu, J.; Lopez, I.; Aklilu, J.; Rau, A.; Katzer, A.W.; et al. Biomedica: An open biomedical image-caption archive, dataset, and vision-language models derived from scientific literature. Proc. Comput. Vis. Pattern Recognit. Conf. 2025, 19724–19735.
- Lu, Z.; Li, H.; Parikh, N.A.; Dillman, J.R.; He, L. RadCLIP: Enhancing Radiologic Image Analysis Through Contrastive Language–Image Pretraining. IEEE Transactions on Neural Networks and Learning Systems, 2025.
- Huang, X.; Shen, L.; Liu, J.; Shang, F.; Li, H.; Huang, H.; Yang, Y. Towards a multimodal large language model with pixel-level insight for biomedicine. Proc. AAAI Conf. Artif. Intell. 2025, 39, 3779–3787.
- Yu, H.; Yi, S.; Niu, K.; Zhuo, M.; Li, B. UMIT: Unifying Medical Imaging Tasks via Vision-Language Models. arXiv 2025, arXiv:2503.15892.
- Liu, F.; Zhu, T.; Wu, X.; Yang, B.; You, C.; Wang, C.; Lu, L.; Liu, Z.; Zheng, Y.; Sun, X.; et al. A medical multimodal large language model for future pandemics. npj Digit. Med. 2023, 6, 226.
- Huang, X.; Huang, H.; Shen, L.; Yang, Y.; Shang, F.; Liu, J.; Liu, J. A refer-and-ground multimodal large language model for biomedicine. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, 2024; Springer; pp. 399–409.
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I.; et al. Improving language understanding by generative pre-training. 2018.
- Wang, J.; Wang, K.; Yu, Y.; Lu, Y.; Xiao, W.; Sun, Z.; Liu, F.; Zou, Z.; Gao, Y.; Yang, L.; et al. Self-improving generative foundation model for synthetic medical image generation and clinical applications. Nat. Med. 2024, 1–9.
- Lu, S.; Liu, Z.; Liu, T.; Zhou, W. Scaling-up medical vision-and-language representation learning with federated learning. Eng. Appl. Artif. Intell. 2023, 126, 107037.
- Chen, Z.; Diao, S.; Wang, B.; Li, G.; Wan, X. Towards unifying medical vision-and-language pre-training via soft prompts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023; pp. 23403–23413.
- Wu, L.; Zhuang, J.; Chen, H. Voco: A simple-yet-effective volume contrastive learning framework for 3d medical image analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 22873–22882.
- Wang, L.; Wang, H.; Yang, H.; Mao, J.; Yang, Z.; Shen, J.; Li, X. Interpretable bilingual multimodal large language model for diverse biomedical tasks. arXiv 2024, arXiv:2410.18387.
- Zhang, K.; Yang, Y.; Yu, J.; Jiang, H.; Fan, J.; Huang, Q.; Han, W. Multi-Task Paired Masking With Alignment Modeling for Medical Vision-Language Pre-Training. IEEE Trans. Multimed. 2024, 26, 4706–4721.
- Zhu, X.; Hu, Y.; Mo, F.; Li, M.; Wu, J. Uni-med: a unified medical generalist foundation model for multi-task learning via connector-MoE. arXiv 2024, arXiv:2409.17508.
- Aeffner, F.; Zarella, M.D.; Buchbinder, N.; Bui, M.M.; Goodman, M.R.; Hartman, D.J.; Lujan, G.M.; Molani, M.A.; Parwani, A.V.; Lillard, K.; et al. Introduction to digital image analysis in whole-slide imaging: a white paper from the digital pathology association. J. Pathol. Inform. 2019, 10, 9.
- Lu, M.Y.; Chen, B.; Williamson, D.F.; Chen, R.J.; Ikamura, K.; Gerber, G.; Liang, I.; Le, L.P.; Ding, T.; Parwani, A.V.; et al. A foundational multimodal vision language AI assistant for human pathology. arXiv 2023, arXiv:2312.07814.
- Chen, R.J.; Ding, T.; Lu, M.Y.; Williamson, D.F.; Jaume, G.; Chen, B.; Zhang, A.; Shao, D.; Song, A.H.; Shaban, M.; et al. A general-purpose self-supervised model for computational pathology. arXiv 2023, arXiv:2308.15474.
- Huang, Z.; Bianchi, F.; Yuksekgonul, M.; Montine, T.; Zou, J. Leveraging medical twitter to build a visual–language foundation model for pathology ai. bioRxiv 2023, 2023–03.
- Ikezogwo, W.; Seyfioglu, S.; Ghezloo, F.; Geva, D.; Sheikh Mohammed, F.; Anand, P.K.; Krishna, R.; Shapiro, L. Quilt-1m: One million image-text pairs for histopathology. Adv. Neural Inf. Process. Syst. 2023, 36, 37995–38017.
- Vorontsov, E.; Bozkurt, A.; Casson, A.; Shaikovski, G.; Zelechowski, M.; Severson, K.; Zimmermann, E.; Hall, J.; Tenenholtz, N.; Fusi, N.; et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nat. Med. 2024, 30, 2924–2935.
- Lu, M.Y.; Chen, B.; Williamson, D.F.; Chen, R.J.; Liang, I.; Ding, T.; Jaume, G.; Odintsov, I.; Le, L.P.; Gerber, G.; et al. A visual-language foundation model for computational pathology. Nat. Med. 2024, 30, 863–874.
- Xu, H.; Usuyama, N.; Bagga, J.; Zhang, S.; Rao, R.; Naumann, T.; Wong, C.; Gero, Z.; González, J.; Gu, Y.; et al. A whole-slide foundation model for digital pathology from real-world data. Nature 2024, 630, 181–188.
- Ding, T.; Wagner, S.J.; Song, A.H.; Chen, R.J.; Lu, M.Y.; Zhang, A.; Vaidya, A.J.; Jaume, G.; Shaban, M.; Kim, A.; et al. A multimodal whole-slide foundation model for pathology. Nat. Med. 2025, 1–13.
- Ahmed, F.; Sellergren, A.; Yang, L.; Xu, S.; Babenko, B.; Ward, A.; Olson, N.; Mohtashamian, A.; Matias, Y.; Corrado, G.S.; et al. Pathalign: A vision-language model for whole slide images in histopathology. arXiv 2024, arXiv:2406.19578.
- Sun, Y.; Zhu, C.; Zheng, S.; Zhang, K.; Sun, L.; Shui, Z.; Zhang, Y.; Li, H.; Yang, L. Pathasst: A generative foundation ai assistant towards artificial general intelligence of pathology. Proc. AAAI Conf. Artif. Intell. 2024, 38, 5034–5042.
- Shaikovski, G.; Casson, A.; Severson, K.; Zimmermann, E.; Wang, Y.K.; Kunz, J.D.; Retamero, J.A.; Oakley, G.; Klimstra, D.; Kanan, C.; et al. Prism: A multi-modal generative foundation model for slide-level histopathology. arXiv 2024, arXiv:2405.10254.
- Dippel, J.; Feulner, B.; Winterhoff, T.; Milbich, T.; Tietz, S.; Schallenberg, S.; Dernbach, G.; Kunft, A.; Heinke, S.; Eich, M.L.; et al. RudolfV: a foundation model by pathologists for pathologists. arXiv 2024, arXiv:2401.04079.
- Ma, J.; Guo, Z.; Zhou, F.; Wang, Y.; Xu, Y.; Li, J.; Yan, F.; Cai, Y.; Zhu, Z.; Jin, C.; et al. Towards a generalizable pathology foundation model via unified knowledge distillation. arXiv 2024, arXiv:2407.18449.
- Wang, X.; Zhao, J.; Marostica, E.; Yuan, W.; Jin, J.; Zhang, J.; Li, R.; Tang, H.; Wang, K.; Li, Y.; et al. A pathology foundation model for cancer diagnosis and prognosis prediction. Nature 2024, 634, 970–978.
- Xu, Y.; Wang, Y.; Zhou, F.; Ma, J.; Jin, C.; Yang, S.; Li, J.; Zhang, Z.; Zhao, C.; Zhou, H.; et al. A multimodal knowledge-enhanced whole-slide pathology foundation model. arXiv 2024, arXiv:2407.15362.
- Yang, Z.; Wei, T.; Liang, Y.; Yuan, X.; Gao, R.; Xia, Y.; Zhou, J.; Zhang, Y.; Yu, Z. A foundation model for generalizable cancer diagnosis and survival prediction from histopathological images. Nat. Commun. 2025, 16, 2366.
- Chen, K.; Liu, M.; Yan, F.; Ma, L.; Shi, X.; Wang, L.; Wang, X.; Zhu, L.; Wang, Z.; Zhou, M.; et al. Cost-effective instruction learning for pathology vision and language analysis. Nat. Comput. Sci. 2025, 1–10.
- Sun, Y.; Si, Y.; Zhu, C.; Gong, X.; Zhang, K.; Chen, P.; Zhang, Y.; Shui, Z.; Lin, T.; Yang, L. Cpath-omni: A unified multimodal foundation model for patch and whole slide image analysis in computational pathology. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 10360–10371.
- Dai, D.; Zhang, Y.; Yang, Q.; Xu, L.; Shen, X.; Xia, S.; Wang, G. Pathologyvlm: a large vision-language model for pathology image understanding. Artif. Intell. Rev. 2025, 58, 1–19.
- Chen, Y.; Wang, G.; Ji, Y.; Li, Y.; Ye, J.; Li, T.; Hu, M.; Yu, R.; Qiao, Y.; He, J. Slidechat: A large vision-language assistant for whole-slide pathology image understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 5134–5143.
- Xiang, J.; Wang, X.; Zhang, X.; Xi, Y.; Eweje, F.; Chen, Y.; Li, Y.; Bergstrom, C.; Gopaulchan, M.; Kim, T.; et al. A vision–language foundation model for precision oncology. Nature 2025, 638, 769–778.
- Ding, J.; Ma, S.; Dong, L.; Zhang, X.; Huang, S.; Wang, W.; Zheng, N.; Wei, F. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv 2023, arXiv:2307.02486.
- Xu, S.; Yang, L.; Kelly, C.; Sieniek, M.; Kohlberger, T.; Ma, M.; Weng, W.H.; Kiraly, A.; Kazemzadeh, S.; Melamed, Z.; et al. Elixr: Towards a general purpose x-ray artificial intelligence system through alignment of large language models and radiology vision encoders. arXiv 2023, arXiv:2308.01317.
- Zhang, X.; Wu, C.; Zhang, Y.; Xie, W.; Wang, Y. Knowledge-enhanced visual-language pre-training on chest radiology images. Nat. Commun. 2023, 14, 4542.
- Lee, S.; Kim, W.J.; Chang, J.; Ye, J.C. LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation. In Proceedings of the Twelfth International Conference on Learning Representations.
- Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023; pp. 21315–21326.
- Pellegrini, C.; Özsoy, E.; Busam, B.; Wiestler, B.; Navab, N.; Keicher, M. Radialog: Large vision-language models for x-ray reporting and dialog-driven assistance. In Proceedings of the Medical Imaging with Deep Learning, 2025.
- Bannur, S.; Hyland, S.; Liu, Q.; Pérez-García, F.; Ilse, M.; Castro, D.C.; Boecking, B.; Sharma, H.; Bouzid, K.; Thieme, A.; et al. Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023; pp. 15016–15027.
- Liu, B.; Lu, D.; Wei, D.; Wu, X.; Wang, Y.; Zhang, Y.; Zheng, Y. Improving Medical Vision-Language Contrastive Pretraining With Semantics-Aware Triage. IEEE Trans. Med. Imaging 2023, 42, 3579–3589.
- Liu, C.; Cheng, S.; Chen, C.; Qiao, M.; Zhang, W.; Shah, A.; Bai, W.; Arcucci, R. M-flag: Medical vision-language pre-training with frozen language models and latent space geometry optimization. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer, 2023; pp. 637–647.
- Chen, X.; He, Y.; Xue, C.; Ge, R.; Li, S.; Yang, G. Knowledge boosting: Rethinking medical contrastive vision-language pre-training. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer, 2023; pp. 405–415.
- Yao, J.; Wang, X.; Song, Y.; Zhao, H.; Ma, J.; Chen, Y.; Liu, W.; Wang, B. Eva-x: A foundation model for general chest x-ray analysis with self-supervised learning. npj Digit. Med. 2025, 8, 678.
- Xu, L.; Ni, Z.; Sun, H.; Li, H.; Zhang, S. A foundation model for generalizable disease diagnosis in chest X-ray images. arXiv 2024, arXiv:2410.08861.
- Bluethgen, C.; Chambon, P.; Delbrouck, J.B.; Van Der Sluijs, R.; Połacin, M.; Zambrano Chaves, J.M.; Abraham, T.M.; Purohit, S.; Langlotz, C.P.; Chaudhari, A.S. A vision–language foundation model for the generation of realistic chest x-ray images. Nat. Biomed. Eng. 2025, 9, 494–506. [Google Scholar] [CrossRef] [PubMed]
- Müller, P.; Kaissis, G.; Rueckert, D. ChEX: Interactive localization and region description in chest X-rays. In Proceedings of the European Conference on Computer Vision; Springer, 2024; pp. 92–111. [Google Scholar]
- Chen, Z.; Varma, M.; Xu, J.; Paschali, M.; Van Veen, D.; Johnston, A.; Youssef, A.; Blankemeier, L.; Bluethgen, C.; Altmayer, S.; et al. A Vision-Language foundation model to enhance efficiency of chest x-ray interpretation. arXiv E-Prints 2024, arXiv–2401. [Google Scholar]
- Huang, W.; Li, C.; Zhou, H.Y.; Yang, H.; Liu, J.; Liang, Y.; Zheng, H.; Zhang, S.; Wang, S. Enhancing representation in radiography-reports foundation model: A granular alignment algorithm using masked contrastive learning. Nat. Commun. 2024, 15, 7620. [Google Scholar] [CrossRef] [PubMed]
- Kumar, Y.; Marttinen, P. Improving medical multi-modal contrastive learning with expert annotations. In Proceedings of the European Conference on Computer Vision; Springer, 2024; pp. 468–486. [Google Scholar]
- Zambrano Chaves, J.M.; Huang, S.C.; Xu, Y.; Xu, H.; Usuyama, N.; Zhang, S.; Wang, F.; Xie, Y.; Khademi, M.; Yang, Z.; et al. A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings. Nat. Commun. 2025, 16, 3108. [Google Scholar] [CrossRef] [PubMed]
- Wang, X.; Luo, J.; Wang, J.; Zhong, Y.; Zhang, X.; Wang, Y.; Bhatia, P.; Xiao, C.; Ma, F. Unity in diversity: Collaborative pre-training across multimodal medical sources. Proc. Conf. Assoc. Comput. Linguist. Meet. 2024, 2024, 3644. [Google Scholar] [CrossRef] [PubMed]
- Li, Q.; Yan, X.; Xu, J.; Yuan, R.; Zhang, Y.; Feng, R.; Shen, Q.; Zhang, X.; Wang, S. Anatomical structure-guided medical vision-language pre-training. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer, 2024; pp. 80–90. [Google Scholar] [CrossRef]
- Luo, H.; Zhou, Z.; Royer, C.; Sekuboyina, A.; Menze, B. Devide: Faceted medical knowledge for improved medical vision-language pre-training. arXiv 2024, arXiv:2404.03618. [Google Scholar]
- Liu, B.; Lu, Z.; Wang, Y. Towards medical vision-language contrastive pre-training via study-oriented semantic exploration. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024; pp. 4861–4870. [Google Scholar]
- Liang, X.; Hu, J.; Wang, D.; Ma, Z.; Zhao, L.; Li, R.; Wan, B.; Wang, Q. CheXPO: Preference Optimization for Chest X-ray VLMs with Counterfactual Rationale. arXiv 2025, arXiv:2507.06959. [Google Scholar]
- Lee, S.; Youn, J.; Kim, H.; Kim, M.; Yoon, S.H. CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images. Eur. Radiol. 2025, 1–13. [Google Scholar] [CrossRef]
- Deperrois, N.; Matsuo, H.; Ruipérez-Campillo, S.; Vandenhirtz, M.; Laguna, S.; Ryser, A.; Fujimoto, K.; Nishio, M.; Sutter, T.M.; Vogt, J.E.; et al. RadVLM: A multitask conversational vision-language model for radiology. arXiv 2025, arXiv:2502.03333. [Google Scholar]
- Li, M.; Meng, M.; Fulham, M.; Feng, D.D.; Bi, L.; Kim, J. Enhancing Medical Vision-Language Contrastive Learning via Inter-Matching Relation Modeling. IEEE Trans. Med. Imaging 2025, 44, 2463–2476. [Google Scholar] [CrossRef] [PubMed]
- Liu, C.; Cheng, S.; Shi, M.; Shah, A.; Bai, W.; Arcucci, R. IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-Training. IEEE Trans. Med. Imaging 2025, 44, 519–529. [Google Scholar] [CrossRef] [PubMed]
- Liang, X.; Li, X.; Li, F.; Jiang, J.; Dong, Q.; Wang, W.; Wang, K.; Dong, S.; Luo, G.; Li, S. MedFILIP: Medical Fine-Grained Language-Image Pre-Training. IEEE J. Biomed. Health Inform. 2025, 29, 3587–3597. [Google Scholar] [CrossRef] [PubMed]
- Chu, Y.W.; Zhang, K.; Malon, C.; Min, M.R. Reducing hallucinations of medical multimodal large language models with visual retrieval-augmented generation. arXiv 2025, arXiv:2502.15040. [Google Scholar]
- Zhang, Z.; Yu, Y.; Chen, Y.; Yang, X.; Yeo, S.Y. Medunifier: Unifying vision-and-language pre-training on medical data with vision generation task using discrete visual representations. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025; pp. 29744–29755. [Google Scholar]
- Wang, X.; Peng, Y.; Lu, L.; Lu, Z.; Bagheri, M.; Summers, R.M. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017; pp. 2097–2106. [Google Scholar] [CrossRef]
- Niu, C.; Lyu, Q.; Carothers, C.D.; Kaviani, P.; Tan, J.; Yan, P.; Kalra, M.K.; Whitlow, C.T.; Wang, G. Specialty-oriented generalist medical ai for chest ct screening. arXiv 2023, arXiv:2304.02649. [Google Scholar]
- Jin, Y.; Zhang, Y. Orthodoc: Multimodal large language model for assisting diagnosis in computed tomography. arXiv 2024, arXiv:2409.09052. [Google Scholar]
- Zhu, W.; Huang, H.; Tang, H.; Musthyala, R.; Yu, B.; Chen, L.; Vega, E.; O’Donnell, T.; Dehkharghani, S.; Frontera, J.A.; et al. 3D foundation AI model for generalizable disease detection in head computed tomography. arXiv 2025, arXiv:2502.02779. [Google Scholar]
- Gao, Z.; Zhang, G.; Liang, H.; Liu, J.; Ma, L.; Wang, T.; Guo, Y.; Chen, Y.; Yan, Z.; Chen, X.; et al. A Lung CT Foundation Model Facilitating Disease Diagnosis and Medical Imaging. medRxiv 2025, 2025–01. [Google Scholar]
- Pai, S.; Hadzic, I.; Bontempi, D.; Bressem, K.; Kann, B.H.; Fedorov, A.; Mak, R.H.; Aerts, H.J. Vision foundation models for computed tomography. arXiv 2025, arXiv:2501.09001. [Google Scholar]
- Xin, Y.; Ates, G.C.; Gong, K.; Shao, W. Med3dvlm: An efficient vision-language model for 3d medical image analysis. arXiv 2025, arXiv:2503.20047. [Google Scholar]
- Guo, X.; Chai, W.; Li, S.Y.; Wang, G. LLaVA-ultra: Large Chinese language and vision assistant for ultrasound. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024; pp. 8845–8854. [Google Scholar] [CrossRef]
- Jiang, Y.; Feng, C.M.; Ren, J.; Wei, J.; Zhang, Z.; Hu, Y.; Liu, Y.; Sun, R.; Tang, X.; Du, J.; et al. Privacy-preserving federated foundation model for generalist ultrasound artificial intelligence. arXiv 2024, arXiv:2411.16380. [Google Scholar]
- Meyer, A.; Murali, A.; Mutter, D.; Padoy, N. Ultrasam: a foundation model for ultrasound using large open-access segmentation datasets. arXiv 2024, arXiv:2411.16222. [Google Scholar]
- Jiao, J.; Zhou, J.; Li, X.; Xia, M.; Huang, Y.; Huang, L.; Wang, N.; Zhang, X.; Zhou, S.; Wang, Y.; et al. Usfm: A universal ultrasound foundation model generalized to tasks and organs towards label efficient image analysis. Med. Image Anal. 2024, 96, 103202. [Google Scholar] [CrossRef] [PubMed]
- Maani, F.; Saeed, N.; Saleem, T.; Farooq, Z.; Alasmawi, H.; Diehl, W.; Mohammad, A.; Waring, G.; Valappi, S.; Bricker, L.; et al. FetalCLIP: A visual-language foundation model for fetal ultrasound image analysis. arXiv 2025, arXiv:2502.14807. [Google Scholar]
- Ambsdorf, J.; Munk, A.; Llambias, S.; Christensen, A.N.; Mikolaj, K.; Balestriero, R.; Tolsgaard, M.G.; Feragen, A.; Nielsen, M. General methods make great domain-specific foundation models: A case-study on fetal ultrasound. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer, 2025; pp. 271–281. [Google Scholar] [CrossRef]
- Yao, J.; Wang, Y.; Lei, Z.; Wang, K.; Feng, N.; Dong, F.; Zhou, J.; Li, X.; Hao, X.; Shen, J.; et al. Multimodal GPT model for assisting thyroid nodule diagnosis and management. npj Digit. Med. 2025, 8, 245. [Google Scholar] [CrossRef] [PubMed]
- Kang, Q.; Lao, Q.; Gao, J.; Bao, W.; He, Z.; Du, C.; Lu, Q.; Li, K. URFM: a general Ultrasound Representation Foundation Model for advancing ultrasound image diagnosis. iScience 2025, 28. [Google Scholar] [CrossRef] [PubMed]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023; pp. 4015–4026. [Google Scholar]
- Zhou, Y.; Chia, M.A.; Wagner, S.K.; Ayhan, M.S.; Williamson, D.J.; Struyven, R.R.; Liu, T.; Xu, M.; Lozano, M.G.; Woodward-Court, P.; et al. A foundation model for generalizable disease detection from retinal images. Nature 2023, 622, 156–163. [Google Scholar] [CrossRef] [PubMed]
- Qiu, J.; Wu, J.; Wei, H.; Shi, P.; Zhang, M.; Sun, Y.; Li, L.; Liu, H.; Liu, H.; Hou, S.; et al. Visionfm: a multi-modal multi-task vision foundation model for generalist ophthalmic artificial intelligence. arXiv 2023, arXiv:2310.04992. [Google Scholar]
- Yu, K.; Zhou, Y.; Bai, Y.; Soh, Z.D.; Xu, X.; Goh, R.S.M.; Cheng, C.Y.; Liu, Y. Urfound: Towards universal retinal foundation models via knowledge-guided masked modeling. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer, 2024; pp. 753–762. [Google Scholar] [CrossRef]
- Shi, D.; Zhang, W.; Yang, J.; Huang, S.; Chen, X.; Xu, P.; Jin, K.; Lin, S.; Wei, J.; Yusufu, M.; et al. A multimodal visual–language foundation model for computational ophthalmology. npj Digit. Med. 2025, 8, 381. [Google Scholar] [CrossRef] [PubMed]
- Shi, D.; Zhang, W.; Chen, X.; Liu, Y.; Yang, J.; Huang, S.; Tham, Y.C.; Zheng, Y.; He, M. Eyefound: a multimodal generalist foundation model for ophthalmic imaging. arXiv 2024, arXiv:2405.11338. [Google Scholar]
- Pissas, T.; Márquez-Neila, P.; Wolf, S.; Zinkernagel, M.; Sznitman, R. Masked image modelling for retinal oct understanding. In Proceedings of the International Workshop on Ophthalmic Medical Image Analysis; Springer, 2024; pp. 115–125. [Google Scholar]
- Holland, R.; Leingang, O.; Bogunović, H.; Riedl, S.; Fritsche, L.; Prevost, T.; Scholl, H.P.; Schmidt-Erfurth, U.; Sivaprasad, S.; Lotery, A.J.; et al. Metadata-enhanced contrastive learning from retinal optical coherence tomography images. Med. Image Anal. 2024, 97, 103296. [Google Scholar] [CrossRef] [PubMed]
- Silva-Rodríguez, J.; Chakor, H.; Dolz, J.; Ayed, I.B.; et al. On the importance of expert knowledge to improve foundation models for retinal fundus images. In Proceedings of the Medical Imaging with Deep Learning, 2024. [Google Scholar]
- Du, J.; Guo, J.; Zhang, W.; Yang, S.; Liu, H.; Li, H.; Wang, N. Ret-clip: A retinal image foundation model pre-trained with clinical diagnostic reports. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer, 2024; pp. 709–719. [Google Scholar] [CrossRef]
- Yang, S.; Du, J.; Guo, J.; Zhang, W.; Liu, H.; Li, H.; Wang, N. ViLReF: an expert knowledge enabled vision-language retinal foundation model. arXiv 2024, arXiv:2408.10894. [Google Scholar]
- Wei, H.; Liu, B.; Zhang, M.; Shi, P.; Yuan, W. Visionclip: An med-aigc based ethical language-image foundation model for generalizable retina image analysis. arXiv 2024, arXiv:2403.10823. [Google Scholar]
- Li, Z.; Song, D.; Yang, Z.; Wang, D.; Li, F.; Zhang, X.; Kinahan, P.E.; Qiao, Y. VisionUnite: a Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 1–14. [Google Scholar] [CrossRef]
- Cai, Z.; Lin, L.; He, H.; Cheng, P.; Tang, X. Uni4Eye++: A General Masked Image Modeling Multi-Modal Pre-Training Framework for Ophthalmic Image Classification and Segmentation. IEEE Trans. Med. Imaging 2024, 43, 4419–4429. [Google Scholar] [CrossRef] [PubMed]
- Morano, J.; Fazekas, B.; Sükei, E.; Fecso, R.; Emre, T.; Gumpinger, M.; Faustmann, G.; Oghbaie, M.; Schmidt-Erfurth, U.; Bogunović, H. MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis. arXiv 2025, arXiv:2506.08900. [Google Scholar]
- Sun, Y.; Tan, W.; Gu, Z.; He, R.; Chen, S.; Pang, M.; Yan, B. A data-efficient strategy for building high-performing medical foundation models. Nat. Biomed. Eng. 2025, 1–13. [Google Scholar] [CrossRef]
- Li, Z.; Song, D.; Yang, Z.; Wang, D.; Li, F.; Zhang, X.; Kinahan, P.E.; Qiao, Y. VisionUnite: a Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 1–14. [Google Scholar] [CrossRef]
- Wang, J.; Zhao, S.; Luo, Z.; Zhou, Y.; Jiang, H.; Li, S.; Li, T.; Pan, G. CBraMod: A Criss-Cross Brain Foundation Model for EEG Decoding. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Jiang, W.; Zhao, L.; Lu, B.l. Large Brain Model for Learning Generic Representations with Tremendous EEG Data in BCI. In Proceedings of the Twelfth International Conference on Learning Representations, 2024.
- Jiang, W.; Wang, Y.; Lu, B.l.; Li, D. NeuroLM: A Universal Multi-task Foundation Model for Bridging the Gap between Language and EEG Signals. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Chen, Y.; Ren, K.; Song, K.; Wang, Y.; Wang, Y.; Li, D.; Qiu, L. EEGFormer: Towards Transferable and Interpretable Large-Scale EEG Foundation Model. In Proceedings of the AAAI 2024 Spring Symposium on Clinical Foundation Models.
- Christensen, M.; Vukadinovic, M.; Yuan, N.; Ouyang, D. Vision–language foundation model for echocardiogram interpretation. Nat. Med. 2024, 30, 1481–1488. [Google Scholar] [CrossRef] [PubMed]
- Kim, S.; Jin, P.; Song, S.; Chen, C.; Li, Y.; Ren, H.; Li, X.; Liu, T.; Li, Q. Echofm: Foundation model for generalizable echocardiogram analysis. IEEE Trans. Med. Imaging 2025. [Google Scholar]
- Vukadinovic, M.; Tang, X.; Yuan, N.; Cheng, P.; Li, D.; Cheng, S.; He, B.; Ouyang, D. EchoPrime: A multi-video view-informed vision-language model for comprehensive echocardiography interpretation. arXiv 2024, arXiv:2410.09704. [Google Scholar]
- Yan, S.; Yu, Z.; Primiero, C.; Vico-Alonso, C.; Wang, Z.; Yang, L.; Tschandl, P.; Hu, M.; Tan, G.; Tang, V.; et al. A general-purpose multimodal foundation model for dermatology. arXiv 2024, arXiv:2410.15038. [Google Scholar]
- Kim, C.; Gadgil, S.U.; DeGrave, A.J.; Omiye, J.A.; Cai, Z.R.; Daneshjou, R.; Lee, S.I. Transparent medical image AI via an image–text foundation model grounded in medical literature. Nat. Med. 2024, 30, 1154–1165. [Google Scholar] [CrossRef] [PubMed]








| Method | Organ | Date (YYYY.MM) | Parameters | Data format | Data quantity | Architecture |
|---|---|---|---|---|---|---|
| CBraMod [159] | Brain | 2025.04 | 4.0 M | EEG | 1M samples | MAE |
| LaBraM [160] | Brain | 2024.05 | 369 M | EEG | 2,500 hours | MAE |
| NeuroLM [161] | Brain | 2025.04 | 1.7 B | EEG-text | 25,000 hours | VAE |
| EEGFormer [162] | Brain | 2024.02 | - | EEG | 1.7T | Encoder-decoder |
| EchoCLIP [163] | Heart | 2024.04 | - | US video-text | 1M | MME-VLFM |
| EchoFM [164] | Heart | 2025.01 | - | US video | 286K | VED-VFM |
| EchoPrime [165] | Heart | 2024.01 | - | US video-text | 12M | MME-VLFM |
| PanDerm [166] | Skin | 2024.01 | - | Skin image | 2M | VED-VFM |
| MONET [167] | Skin | 2024.01 | - | Skin image-text | 105K | MME-VLFM |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).