Submitted: 09 July 2025
Posted: 11 July 2025
Abstract
Keywords:
1. Introduction
2. Background and Motivation
- Scanner and hardware differences: Variations in imaging hardware, such as differences in MRI or CT scanner manufacturers, magnetic field strengths, and imaging protocols, can lead to divergent data distributions even for the same anatomical regions [12].
- Institutional and demographic diversity: Patient populations differ across institutions in terms of age, ethnicity, disease prevalence, and comorbidities, influencing both image appearance and underlying pathology [13].
- Annotation practices and label noise: Disparities in labeling conventions, radiologist interpretations, and inter-observer variability introduce label noise and inconsistencies that exacerbate the challenge of robust learning.
- Temporal shifts: Changes in clinical practice, imaging protocols, or population health over time can introduce subtle or abrupt shifts in data distributions, complicating longitudinal generalization.
3. Taxonomy of Domain Adaptation and Generalization in Medical Imaging
3.1. Supervised and Semi-Supervised Domain Adaptation
3.2. Unsupervised Domain Adaptation
3.3. Domain Generalization
3.4. Self-Supervised and Multi-Modal Learning Approaches
3.5. Summary of Taxonomy
4. Foundation Models for Medical Imaging
4.1. Architectural Paradigms
4.2. Training Objectives and Strategies
- Contrastive Learning: Inspired by methods such as SimCLR and CLIP, medical contrastive models learn to pull together representations of images and their corresponding textual descriptions (e.g., radiology reports) while pushing apart unpaired examples. This framework has proven effective at capturing semantically rich, transferable features (a minimal sketch of the objective follows this list).
- Masked Image Modeling (MIM): Analogous to masked language modeling in NLP (e.g., BERT), MIM corrupts input images by masking out patches and trains the model to reconstruct the missing content [57]. This task encourages the model to learn contextual features and global anatomical priors [58] (see the masking sketch after this list).
- Image-Text Alignment: In multi-modal setups, models are trained to align paired image-text embeddings using cosine similarity or a contrastive loss [59]. This approach, exemplified by models such as MedCLIP and GLoRIA, enables zero-shot transfer to classification and retrieval tasks.
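To make the contrastive objective concrete, below is a minimal sketch of a symmetric CLIP-style InfoNCE loss over paired image and report embeddings. It assumes PyTorch and hypothetical `img_emb`/`txt_emb` tensors produced by separate image and text encoders; it illustrates the general recipe, not the exact loss of any specific model surveyed here.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) tensors from hypothetical image/text
    encoders; matched rows are positive pairs, all other rows negatives.
    """
    img_emb = F.normalize(img_emb, dim=-1)        # unit-length embeddings
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: image -> text and text -> image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Each image is pushed to be most similar to its own report among all reports in the batch, and vice versa. The masked-image-modeling objective can be sketched in the same spirit, assuming flattened image patches and placeholder `encoder`/`decoder` modules (the two-argument `decoder(latent, masked_idx)` signature is hypothetical); encoding only the visible patches follows the MAE-style design [57].

```python
import torch
import torch.nn.functional as F

def masked_image_modeling_loss(patches, encoder, decoder, mask_ratio=0.75):
    """MAE-style objective: hide a random subset of patches and
    reconstruct them from the visible context.

    patches: (batch, num_patches, patch_dim) flattened image patches.
    encoder/decoder: placeholder modules standing in for a ViT backbone.
    """
    b, n, _ = patches.shape
    num_masked = int(n * mask_ratio)
    # Random permutation per sample; the first `num_masked` indices are hidden.
    perm = torch.rand(b, n, device=patches.device).argsort(dim=1)
    masked_idx, visible_idx = perm[:, :num_masked], perm[:, num_masked:]
    batch_idx = torch.arange(b, device=patches.device).unsqueeze(1)
    visible = patches[batch_idx, visible_idx]  # context the model sees
    latent = encoder(visible)                  # encode visible patches only
    pred = decoder(latent, masked_idx)         # predict the hidden patches
    target = patches[batch_idx, masked_idx]
    return F.mse_loss(pred, target)            # loss on masked patches only
```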
4.3. Representative Foundation Models in Medical Imaging
- MedCLIP: A contrastive learning-based model that aligns medical images and radiology reports using transformer-based encoders. It enables zero-shot classification and report generation without requiring fine-tuning on downstream tasks [63].
- BioViL: The Biomedical Vision-Language Pretraining model leverages paired image-text datasets and contrastive losses to build general-purpose medical embeddings. It has demonstrated strong performance on retrieval, classification, and segmentation tasks.
- GLoRIA: The Global and Local Representation Alignment model learns fine-grained correspondence between regions in medical images and textual descriptions [64]. It enhances interpretability and supports few-shot learning across tasks.
- CheXzero: Built upon the CLIP framework, CheXzero trains a vision-language model on a large dataset of chest X-rays and free-text radiology reports [67]. It enables zero-shot classification and has shown performance competitive with radiologists on specific tasks (a zero-shot classification sketch follows this list).
- UNITER and Med-UniT: Inspired by image-text fusion models in general vision, these approaches integrate vision-language modeling into a single transformer architecture, supporting unified learning across multiple imaging modalities and clinical narratives.
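As a sketch of how such models perform zero-shot classification, the snippet below scores an image against text prompts built from class names. The pretrained `image_encoder`, `text_encoder`, and `tokenizer` callables are hypothetical stand-ins for a MedCLIP/CheXzero-style model, and the prompt template and class names are illustrative only.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, image_encoder, text_encoder, tokenizer,
                       class_names=("atelectasis", "cardiomegaly", "no finding")):
    """Score an image against text prompts; no task-specific fine-tuning."""
    prompts = [f"a chest x-ray showing {c}" for c in class_names]
    with torch.no_grad():
        img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)
        txt_emb = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)
    probs = (img_emb @ txt_emb.t()).softmax(dim=-1)  # similarity -> class scores
    return dict(zip(class_names, probs.squeeze(0).tolist()))
```

Because classes are specified as free text at inference time, new labels can be added without retraining; this is the mechanism behind the zero-shot results reported for models such as MedCLIP and CheXzero.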
4.4. Foundation Models as Implicit Domain Generalizers
4.5. Challenges and Limitations
- Data bias and representativeness: Even large-scale pretraining datasets may suffer from institutional or demographic biases that hinder equitable generalization [70].
- Compute and memory costs: Training and deploying large models requires substantial hardware resources, which may not be accessible to all research or clinical institutions [71].
- Lack of transparency: The interpretability and trustworthiness of foundation models remain underexplored, particularly in high-stakes clinical decision-making.
- Evaluation complexity: Robust, standardized benchmarks that assess generalization across diverse target domains are still emerging, making performance comparisons difficult [72].
4.6. Conclusion
5. Comparative Analysis of Foundation Models and Traditional Adaptation Methods
5.1. Quantitative Evaluation Across Domain Shifts
5.2. Data Requirements and Label Efficiency
5.3. Generalization Behavior and Robustness
5.4. Interpretability and Clinical Alignment
5.5. Computational Cost and Scalability
5.6. Summary of Comparative Insights
6. Open Challenges and Future Directions
6.1. Standardized Evaluation Protocols for Generalization
6.2. Efficient and Ethical Model Scaling
6.3. Bias, Fairness, and Equity in Foundation Models
6.4. Interpretable and Clinically Aligned Foundation Models
6.5. Cross-Modal and Multilingual Generalization
6.6. Lifelong Learning and Continual Domain Adaptation
6.7. Regulatory and Clinical Integration Challenges
6.8. Conclusion
7. Conclusion
References
- Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Wu, Z.; Chang, B.; Sun, X.; Xu, J.; Sui, Z. A survey on in-context learning. arXiv preprint arXiv:2301.00234 2022.
- Yi, H.; Qin, Z.; Lao, Q.; Xu, W.; Jiang, Z.; Wang, D.; Zhang, S.; Li, K. Towards general purpose medical ai: Continual learning medical foundation model. arXiv preprint arXiv:2303.06580 2023.
- Naseem, U.; Khushi, M.; Kim, J. Vision-Language Transformer for Interpretable Pathology Visual Question Answering. IEEE Journal of Biomedical and Health Informatics 2023, 27, 1681–1690. [CrossRef]
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. arXiv preprint arXiv:2111.06377 2021.
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning, 2020. [CrossRef]
- Dippel, J.; Feulner, B.; Winterhoff, T.; Schallenberg, S.; Dernbach, G.; Kunft, A.; Tietz, S.; Jurmeister, P.; Horst, D.; Ruff, L.; et al. RudolfV: A Foundation Model by Pathologists for Pathologists. arXiv preprint arXiv:2401.04079 2024.
- Pham, V.T.; Nguyen, T.P. Identification and localization covid-19 abnormalities on chest radiographs. In Proceedings of the International Conference on Artificial Intelligence and Computer Vision. Springer, 2023, pp. 251–261.
- Lu, M.Y.; Chen, R.J.; Wang, J.; Dillon, D.; Mahmood, F. Semi-supervised histology classification using deep multiple instance learning and contrastive predictive coding. arXiv preprint arXiv:1910.10825 2019.
- Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments, 2021. [CrossRef]
- Lu, M.Y.; Chen, B.; Zhang, A.; Williamson, D.F.K.; Chen, R.J.; Ding, T.; Le, L.P.; Chuang, Y.S.; Mahmood, F. Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19764–19775.
- Weinstein, J.N.; Collisson, E.A.; Mills, G.B.; Shaw, K.R.; Ozenberger, B.A.; Ellrott, K.; Shmulevich, I.; Sander, C.; Stuart, J.M. The cancer genome atlas pan-cancer analysis project. Nature genetics 2013, 45, 1113–1120.
- Boecking, B.; Usuyama, N.; Bannur, S.; Castro, D.C.; Schwaighofer, A.; Hyland, S.; Wetscherek, M.; Naumann, T.; Nori, A.; Alvarez-Valle, J.; et al. Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. In Proceedings of the Computer Vision – ECCV 2022; Avidan, S.; Brostow, G.; Cissé, M.; Farinella, G.M.; Hassner, T., Eds., Cham, 2022; pp. 1–21. [CrossRef]
- Ochi, M.; Komura, D.; Onoyama, T.; Shinbo, K.; Endo, H.; Odaka, H.; Kakiuchi, M.; Katoh, H.; Ushiku, T.; Ishikawa, S. Registered multi-device/staining histology image dataset for domain-agnostic machine learning models. Scientific Data 2024, 11, 330.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models, 2021, [arXiv:cs.CL/2106.09685].
- Dippel, J.; Feulner, B.; Winterhoff, T.; Schallenberg, S.; Dernbach, G.; Kunft, A.; Tietz, S.; Milbich, T.; Heinke, S.; Eich, M.L.; et al. RudolfV: A Foundation Model by Pathologists for Pathologists, 2024. [CrossRef]
- Zhou, H.; Gu, B.; Zou, X.; Li, Y.; Chen, S.S.; Zhou, P.; Liu, J.; Hua, Y.; Mao, C.; Wu, X.; et al. A survey of large language models in medicine: Progress, application, and challenge. arXiv preprint arXiv:2311.05112 2023.
- Wang, Z.; Wu, Z.; Agarwal, D.; Sun, J. Medclip: Contrastive learning from unpaired medical images and text. arXiv preprint arXiv:2210.10163 2022.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations, 2020.
- Huang, Z.; Bianchi, F.; Yuksekgonul, M.; Montine, T.J.; Zou, J. A visual–language foundation model for pathology image analysis using medical twitter. Nature medicine 2023, 29, 2307–2316.
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, 2021.
- Lin, W.; Zhao, Z.; Zhang, X.; Wu, C.; Zhang, Y.; Wang, Y.; Xie, W. PMC-CLIP: Contrastive Language-Image Pre-training Using Biomedical Documents. In Proceedings of the Medical Image Computing and Computer Assisted Intervention – MICCAI 2023; Greenspan, H.; Madabhushi, A.; Mousavi, P.; Salcudean, S.; Duncan, J.; Syeda-Mahmood, T.; Taylor, R., Eds., Cham, 2023; pp. 525–536. [CrossRef]
- Schaumberg, A.J.; Juarez-Nicanor, W.C.; Choudhury, S.J.; Pastrián, L.G.; Pritt, B.S.; Prieto Pozuelo, M.; Sotillo Sánchez, R.; Ho, K.; Zahra, N.; Sener, B.D.; et al. Interpretable multimodal deep learning for real-time pan-tissue pan-disease pathology search on social media. Modern Pathology 2020, 33, 2169–2185. [CrossRef]
- Song, X.; Xu, X.; Yan, P. General Purpose Image Encoder DINOv2 for Medical Image Registration. arXiv preprint arXiv:2402.15687 2024.
- Men, Y.; Fhima, J.; Celi, L.A.; Ribeiro, L.Z.; Nakayama, L.F.; Behar, J.A. DRStageNet: Deep Learning for Diabetic Retinopathy Staging from Fundus Images, 2023, [arXiv:eess.IV/2312.14891].
- Gu, Y.; Yang, J.; Usuyama, N.; Li, C.; Zhang, S.; Lungren, M.P.; Gao, J.; Poon, H. BiomedJourney: Counterfactual Biomedical Image Generation by Instruction-Learning from Multimodal Patient Journeys, 2023. [CrossRef]
- Schirris, Y.; Gavves, E.; Nederlof, I.; Horlings, H.M.; Teuwen, J. DeepSMILE: Contrastive self-supervised pre-training benefits MSI and HRD classification directly from H&E whole-slide images in colorectal and breast cancer. Medical Image Analysis 2022, 79, 102464.
- Gao, Y.; Xia, W.; Hu, D.; Gao, X. DeSAM: Decoupling Segment Anything Model for Generalizable Medical Image Segmentation, 2023. [CrossRef]
- Xie, Y.; Zhang, J.; Xia, Y.; Shen, C. Learning From Partially Labeled Data for Multi-Organ and Tumor Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023, 45, 14905–14919. [CrossRef]
- kaiko.ai; Aben, N.; de Jong, E.D.; Gatopoulos, I.; Känzig, N.; Karasikov, M.; Lagré, A.; Moser, R.; van Doorn, J.; Tang, F. Towards Large-Scale Training of Pathology Foundation Models, 2024. [CrossRef]
- Sambara, S.; Zhang, S.; Banerjee, O.; Acosta, J.; Fahrner, J.; Rajpurkar, P. RadFlag: A Black-Box Hallucination Detection Method for Medical Vision Language Models. arXiv preprint arXiv:2411.00299 2024.
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Advances in neural information processing systems 2023, 36, 34892–34916.
- Roth, B.; Koch, V.; Wagner, S.J.; Schnabel, J.A.; Marr, C.; Peng, T. Low-resource finetuning of foundation models beats state-of-the-art in histopathology. arXiv preprint arXiv:2401.04720 2024.
- Khan, W.; Leem, S.; See, K.B.; Wong, J.K.; Zhang, S.; Fang, R. A Comprehensive Survey of Foundation Models in Medicine. arXiv preprint arXiv:2406.10729 2024.
- Glocker, B.; Jones, C.; Roschewitz, M.; Winzeck, S. Risk of bias in chest radiography deep learning foundation models. Radiology: Artificial Intelligence 2023, 5, e230060.
- Chen, X.; Fan, H.; Girshick, R.; He, K. Improved Baselines with Momentum Contrastive Learning, 2020. [CrossRef]
- Zhu, W.; Chen, Y.; Nie, S.; Yang, H. SAMMS: Multi-modality Deep Learning with the Foundation Model for the Prediction of Cancer Patient Survival. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2023, pp. 3662–3668.
- Gutiérrez, J.D.; Rodriguez-Echeverria, R.; Delgado, E.; Rodrigo, M.S.; Sánchez-Figueroa, F. No More Training: SAM’s Zero-Shot Transfer Capabilities for Cost-Efficient Medical Image Segmentation. IEEE Access 2024, 12, 24205–24216. [CrossRef]
- Ma, D.; Taher, M.R.H.; Pang, J.; Islam, N.U.; Haghighi, F.; Gotway, M.B.; Liang, J. Benchmarking and Boosting Transformers for Medical Image Classification. In Proceedings of Domain Adaptation and Representation Transfer: 4th MICCAI Workshop, DART 2022, Held in Conjunction with MICCAI 2022, Singapore, September 22, 2022; 2022; Vol. 13542, pp. 12–22. [CrossRef]
- Yun, J.; Hu, Y.; Kim, J.; Jang, J.; Lee, S. Enhancing Whole Slide Pathology Foundation Models through Stain Normalization. arXiv preprint arXiv:2408.00380 2024.
- Shi, D.; Zhang, W.; Chen, X.; Liu, Y.; Yang, J.; Huang, S.; Tham, Y.C.; Zheng, Y.; He, M. EyeFound: A Multimodal Generalist Foundation Model for Ophthalmic Imaging, 2024, [arXiv:cs.CV/2405.11338].
- Jun, E.; Jeong, S.; Heo, D.W.; Suk, H.I. Medical Transformer: Universal Encoder for 3-D Brain MRI Analysis. IEEE transactions on neural networks and learning systems 2023, PP. [CrossRef]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.V.; Sung, Y.; Li, Z.; Duerig, T. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, 2021.
- Zhang, X.; Wu, C.; Zhao, Z.; Lin, W.; Zhang, Y.; Wang, Y.; Xie, W. PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering, 2023. [CrossRef]
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s, 2022.
- Rios-Martinez, C.; Bhattacharya, N.; Amini, A.P.; Crawford, L.; Yang, K.K. Deep self-supervised learning for biosynthetic gene cluster detection and product classification. PLOS Computational Biology 2023, 19, e1011162. [CrossRef]
- Cheng, J.; Ye, J.; Deng, Z.; Chen, J.; Li, T.; Wang, H.; Su, Y.; Huang, Z.; Chen, J.; Jiang, L.; et al. SAM-Med2D, 2023.
- Ma, J.; Guo, Z.; Zhou, F.; Wang, Y.; Xu, Y.; Cai, Y.; Zhu, Z.; Jin, C.; Jiang, Y.L.X.; Han, A.; et al. Towards A Generalizable Pathology Foundation Model via Unified Knowledge Distillation. arXiv preprint arXiv:2407.18449 2024.
- Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT Pre-Training of Image Transformers, 2022. [CrossRef]
- Shi, X.; Chai, S.; Li, Y.; Cheng, J.; Bai, J.; Zhao, G.; Chen, Y.W. Cross-modality Attention Adapter: A Glioma Segmentation Fine-tuning Method for SAM Using Multimodal Brain MR Images. arXiv preprint arXiv:2307.01124 2023.
- Ouyang, C.; Biffi, C.; Chen, C.; Kart, T.; Qiu, H.; Rueckert, D. Self-supervised learning for few-shot medical image segmentation. IEEE Transactions on Medical Imaging 2022, 41, 1837–1848.
- Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 2018.
- Wu, X.; Jiang, Y.; Xing, H.; Song, W.; Wu, P.; Cui, X.W.; Xu, G. ULS4US: universal lesion segmentation framework for 2D ultrasound images. Physics in Medicine and Biology 2023, 68. [CrossRef]
- Pan, H.; Guo, Y.; Deng, Q.; Yang, H.; Chen, Y.; Chen, J. Improving Fine-tuning of Self-supervised Models with Contrastive Initialization, 2022. [CrossRef]
- Brooks, T.; Holynski, A.; Efros, A.A. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18392–18402.
- Campanella, G.; Kwan, R.; Fluder, E.; Zeng, J.; Stock, A.; Veremis, B.; Polydorides, A.D.; Hedvat, C.; Schoenfeld, A.; Vanderbilt, C.; et al. Computational Pathology at Health System Scale–Self-Supervised Foundation Models from Three Billion Images. arXiv preprint arXiv:2310.07033 2023.
- Hamamci, I.E.; Er, S.; Almas, F.; Simsek, A.G.; Esirgun, S.N.; Dogan, I.; Dasdelen, M.F.; Wittmann, B.; Simsar, E.; Simsar, M.; et al. A foundation model utilizing chest CT volumes and radiology reports for supervised-level zero-shot detection of abnormalities. arXiv preprint arXiv:2403.17834 2024.
- Zimmermann, E.; Vorontsov, E.; Viret, J.; Casson, A.; Zelechowski, M.; Shaikovski, G.; Tenenholtz, N.; Hall, J.; Fuchs, T.; Fusi, N.; et al. Virchow 2: Scaling Self-Supervised Mixed Magnification Models in Pathology. arXiv preprint arXiv:2408.00738 2024.
- Wilson, P.F.; Gilany, M.; Jamzad, A.; Fooladgar, F.; To, M.N.N.; Wodlinger, B.; Abolmaesumi, P.; Mousavi, P. Self-supervised learning with limited labeled data for prostate cancer detection in high frequency ultrasound. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 2023.
- de Jong, E.D.; Marcus, E.; Teuwen, J. Current Pathology Foundation Models are unrobust to Medical Center Differences. arXiv preprint arXiv:2501.18055 2025.
- Ma, D.; Pang, J.; Gotway, M.B.; Liang, J. Foundation Ark: Accruing and Reusing Knowledge for Superior and Robust Performance. In Proceedings of the Medical Image Computing and Computer Assisted Intervention – MICCAI 2023; Greenspan, H.; Madabhushi, A.; Mousavi, P.; Salcudean, S.; Duncan, J.; Syeda-Mahmood, T.; Taylor, R., Eds., Cham, 2023; pp. 651–662. [CrossRef]
- Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21372–21383.
- Zhang, Y.; Gao, J.; Tan, Z.; Zhou, L.; Ding, K.; Zhou, M.; Zhang, S.; Wang, D. Data-centric foundation models in computational healthcare: A survey. arXiv preprint arXiv:2401.02458 2024.
- Felfeliyan, B.; Hareendranathan, A.; Kuntze, G.; Cornell, D.; Forkert, N.D.; Jaremko, J.L.; Ronsky, J.L. Self-Supervised-RCNN for Medical Image Segmentation with Limited Data Annotation, 2022. [CrossRef]
- Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Hou, L.; Clark, K.; Pfohl, S.; Cole-Lewis, H.; Neal, D.; et al. Towards Expert-Level Medical Question Answering with Large Language Models, 2023.
- Aben, N.; de Jong, E.D.; Gatopoulos, I.; Känzig, N.; Karasikov, M.; Lagré, A.; Moser, R.; van Doorn, J.; Tang, F.; et al. Towards Large-Scale Training of Pathology Foundation Models. arXiv preprint arXiv:2404.15217 2024.
- Wood, D.A.; Townend, M.; Guilhem, E.; Kafiabadi, S.; Hammam, A.; Wei, Y.; Al Busaidi, A.; Mazumder, A.; Sasieni, P.; Barker, G.J.; et al. Optimising brain age estimation through transfer learning: A suite of pre-trained foundation models for improved performance and generalisability in a clinical setting. Human Brain Mapping 2024, 45, e26625.
- Vaidya, A.; Zhang, A.; Jaume, G.; Song, A.H.; Ding, T.; Wagner, S.J.; Lu, M.Y.; Doucet, P.; Robertson, H.; Almagro-Perez, C.; et al. Molecular-driven Foundation Model for Oncologic Pathology. arXiv preprint arXiv:2501.16652 2025.
- Lin, W.; Zhao, Z.; Zhang, X.; Wu, C.; Zhang, Y.; Wang, Y.; Xie, W. PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents, 2023. [CrossRef]
- El-Nouby, A.; Touvron, H.; Caron, M.; Bojanowski, P.; Douze, M.; Joulin, A.; Laptev, I.; Neverova, N.; Synnaeve, G.; Verbeek, J.; et al. XCiT: Cross-Covariance Image Transformers, 2021, [arXiv:cs.CV/2106.09681].
- Niu, C.; Wang, G. Unsupervised contrastive learning based transformer for lung nodule detection. Physics in Medicine & Biology 2022, 67, 204001.
- Yan, K.; Cai, J.; Jin, D.; Miao, S.; Guo, D.; Harrison, A.P.; Tang, Y.; Xiao, J.; Lu, J.; Lu, L. SAM: Self-Supervised Learning of Pixel-Wise Anatomical Embeddings in Radiological Images. IEEE transactions on medical imaging 2022, 41, 2658–2669. [CrossRef]
- Mazurowski, M.A.; Dong, H.; Gu, H.; Yang, J.; Konz, N.; Zhang, Y. Segment Anything Model for Medical Image Analysis: an Experimental Study. Medical Image Analysis 2023, 89, 102918. [CrossRef]
- Chen, Z.; Xu, Q.; Liu, X.; Yuan, Y. UN-SAM: Universal Prompt-Free Segmentation for Generalized Nuclei Images. arXiv preprint arXiv:2402.16663 2024.
- Hu, X.; Xu, X.; Shi, Y. How to Efficiently Adapt Large Segmentation Model(SAM) to Medical Images, 2023.
- Lu, M.Y.; Chen, B.; Williamson, D.F.K.; Chen, R.J.; Ikamura, K.; Gerber, G.; Liang, I.; Le, L.P.; Ding, T.; Parwani, A.V.; et al. A Foundational Multimodal Vision Language AI Assistant for Human Pathology, 2023. [CrossRef]
- Nechaev, D.; Pchelnikov, A.; Ivanova, E. Hibou: A Family of Foundational Vision Transformers for Pathology. arXiv preprint arXiv:2406.05074 2024.
- Shweikh, Y.; Sekimitsu, S.; Boland, M.V.; Zebardast, N. The Growing Need for Ophthalmic Data Standardization. Ophthalmology Science 2023, 3.
- Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C.D.; Langlotz, C.P. Contrastive learning of medical visual representations from paired images and text. In Proceedings of the Machine Learning for Healthcare Conference. PMLR, 2022, pp. 2–25.
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision, 2024. [CrossRef]
- Yuan, H.; Hong, C. Foundation Model Makes Clustering A Better Initialization For Cold-Start Active Learning, 2024. [CrossRef]
- Zhang, Y.; Liu, H.; Hu, Q. TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation, 2021. [CrossRef]
- Jun, E.; Jeong, S.; Heo, D.W.; Suk, H.I. Medical Transformer: Universal Brain Encoder for 3D MRI Analysis, 2021, [arXiv:cs.CV/2104.13633].
- Dai, D.; Zhang, Y.; Xu, L.; Yang, Q.; Shen, X.; Xia, S.; Wang, G. Pa-llava: A large language-vision assistant for human pathology image understanding. arXiv preprint arXiv:2408.09530 2024.
- Hua, S.; Yan, F.; Shen, T.; Zhang, X. PathoDuet: Foundation Models for Pathological Slide Analysis of H&E and IHC Stains, 2023. [CrossRef]
- Zhou, G.; Mosadegh, B. Distilling Knowledge From an Ensemble of Vision Transformers for Improved Classification of Breast Ultrasound. Academic Radiology 2024, 31, 104–120. [CrossRef]
- Koohbanani, N.A.; Unnikrishnan, B.; Khurram, S.A.; Krishnaswamy, P.; Rajpoot, N. Self-path: Self-supervision for classification of pathology images with limited annotations. IEEE Transactions on Medical Imaging 2021, 40, 2845–2856.
- Ye, Y.; Zhang, J.; Chen, Z.; Xia, Y. DeSD: Self-Supervised Learning with Deep Self-Distillation for 3D Medical Image Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention – MICCAI 2022; Wang, L.; Dou, Q.; Fletcher, P.T.; Speidel, S.; Li, S., Eds., Cham, 2022; pp. 545–555. [CrossRef]
- Dominic, J.; Bhaskhar, N.; Desai, A.D.; Schmidt, A.; Rubin, E.; Gunel, B.; Gold, G.E.; Hargreaves, B.A.; Lenchik, L.; Boutin, R.; et al. Improving Data-Efficiency and Robustness of Medical Imaging Segmentation Using Inpainting-Based Self-Supervised Learning. Bioengineering 2023, 10, 207. [CrossRef]
- Ellis, D.; Srigley, J. Does standardised structured reporting contribute to quality in diagnostic pathology? The importance of evidence-based datasets. Virchows Archiv 2016, 468, 51–59.
- Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C.D.; Langlotz, C.P. Contrastive Learning of Medical Visual Representations from Paired Images and Text. In Proceedings of the 7th Machine Learning for Healthcare Conference. PMLR, 2022, pp. 2–25.
- Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International conference on machine learning. PMLR, 2019, pp. 2790–2799.
- Li, G.; Togo, R.; Ogawa, T.; Haseyama, M. Self-supervised learning for gastritis detection with gastric x-ray images. International Journal of Computer Assisted Radiology and Surgery 2023, 18, 1841–1848.
- Thawkar, O.; Shaker, A.; Mullappilly, S.S.; Cholakkal, H.; Anwer, R.M.; Khan, S.; Laaksonen, J.; Khan, F.S. Xraygpt: Chest radiographs summarization using medical vision-language models. arXiv preprint arXiv:2306.07971 2023.
- Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient Transformers: A Survey. ACM Comput. Surv. 2022, 55. [CrossRef]
- Deng, G.; Zou, K.; Ren, K.; Wang, M.; Yuan, X.; Ying, S.; Fu, H. SAM-U: Multi-box Prompts Triggered Uncertainty Estimation for Reliable SAM in Medical Image. In Proceedings of the Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 Workshops; Woo, J.; Hering, A.; Silva, W.; Li, X.; Fu, H.; Liu, X.; Xing, F.; Purushotham, S.; Mathai, T.S.; Mukherjee, P.; et al., Eds., Cham, 2023; pp. 368–377. [CrossRef]
- Xie, Y.; Zhang, J.; Shen, C.; Xia, Y. CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation, 2021. [CrossRef]
- Felfeliyan, B.; Forkert, N.D.; Hareendranathan, A.; Cornel, D.; Zhou, Y.; Kuntze, G.; Jaremko, J.L.; Ronsky, J.L. Self-supervised-RCNN for medical image segmentation with limited data annotation. Computerized Medical Imaging and Graphics 2023, 109, 102297.
- Lu, M.Y.; Chen, B.; Zhang, A.; Williamson, D.F.; Chen, R.J.; Ding, T.; Le, L.P.; Chuang, Y.S.; Mahmood, F. Visual language pretrained multiple instance zero-shot transfer for histopathology images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19764–19775.
- Zhu, W.; Chen, Y.; Nie, S.; Yang, H. SAMMS: Multi-modality Deep Learning with the Foundation Model for the Prediction of Cancer Patient Survival. IEEE Computer Society, 2023, pp. 3662–3668. [CrossRef]
- Shaikovski, G.; Casson, A.; Severson, K.; Zimmermann, E.; Wang, Y.K.; Kunz, J.D.; Retamero, J.A.; Oakley, G.; Klimstra, D.; Kanan, C.; et al. PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology, 2024, [arXiv:eess.IV/2405.10254].
- Lin, W.; Zhao, Z.; Zhang, X.; Wu, C.; Zhang, Y.; Wang, Y.; Xie, W. Pmc-clip: Contrastive language-image pre-training using biomedical documents. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 525–536.
- Walsh, J.; Othmani, A.; Jain, M.; Dev, S. Using U-Net network for efficient brain tumor segmentation in MRI images. Healthcare Analytics 2022, 2, 100098. [CrossRef]
- Gutiérrez, J.D.; Rodriguez-Echeverria, R.; Delgado, E.; Rodrigo, M.Á.S.; Sánchez-Figueroa, F. No More Training: SAM’s Zero-Shot Transfer Capabilities for Cost-Efficient Medical Image Segmentation. IEEE Access 2024.
- Yan, K.; Cai, J.; Jin, D.; Miao, S.; Guo, D.; Harrison, A.P.; Tang, Y.; Xiao, J.; Lu, J.; Lu, L. SAM: Self-supervised learning of pixel-wise anatomical embeddings in radiological images. IEEE Transactions on Medical Imaging 2022, 41, 2658–2669.
- Khan, M.O.; Afzal, M.M.; Mirza, S.; Fang, Y. How Fair are Medical Imaging Foundation Models? In Proceedings of the 3rd Machine Learning for Health Symposium. PMLR, 2023, pp. 217–231.
- Balestriero, R.; Ibrahim, M.; Sobal, V.; Morcos, A.; Shekhar, S.; Goldstein, T.; Bordes, F.; Bardes, A.; Mialon, G.; Tian, Y.; et al. A Cookbook of Self-Supervised Learning, 2023. [CrossRef]
- Hua, Y.; Yan, Z.; Kuang, Z.; Zhang, H.; Deng, X.; Yu, L. Symmetry-Aware Deep Learning for Cerebral Ventricle Segmentation With Intra-Ventricular Hemorrhage. IEEE Journal of Biomedical and Health Informatics 2022, 26, 5165–5176.
- Lu, J.; Yan, F.; Zhang, X.; Gao, Y.; Zhang, S. Pathotune: Adapting visual foundation model to pathological specialists. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2024, pp. 395–406.
Table 1. Quantitative evaluation of representative adaptation methods and foundation models across domain shifts (see Section 5.1); a minimal sketch of DANN's gradient-reversal mechanism follows this table.

| Model | Domain | Task | Performance (AUC/ACC/DSC) | Supervision |
|---|---|---|---|---|
| DANN (Ganin et al.) | ChestX-ray14 → CheXpert | Classification | 0.803 AUC | Full source, unlabeled target |
| CycleGAN (Zhu et al.) | MRI (GE) → MRI (Siemens) | Segmentation | 0.726 DSC | Full source, unlabeled target |
| IRM (Arjovsky et al.) | Multi-site CT | Classification | 0.781 ACC | Full source domains only |
| MedCLIP (Zhang et al.) | ChestX-ray14, MIMIC-CXR | Zero-shot Classification | 0.868 AUC | Paired image-text pretraining |
| BioViL (Boecking et al.) | RSNA, MIMIC, OpenI | Retrieval + Classification | 0.842 AUC | Contrastive multi-modal pretraining |
| CheXzero (Tiu et al.) | ChestX-ray14 → CheXpert | Zero-shot Classification | 0.881 AUC | Free-text reports |
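Among the traditional baselines in Table 1, DANN is the canonical adversarial approach to unsupervised domain adaptation (Section 3.2): a gradient-reversal layer trains the feature extractor to fool a domain discriminator, which encourages domain-invariant features. Below is a minimal PyTorch sketch; the `DANNHead` module and its dimensions are hypothetical illustrations, not the published architecture.

```python
from torch import nn
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies gradients by -lambda on the
    way back, so the feature extractor learns to *confuse* the domain
    classifier while the classifier learns to separate domains."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class DANNHead(nn.Module):
    """Minimal DANN-style domain discriminator on top of shared features."""
    def __init__(self, feat_dim=256, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.domain_clf = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, features):
        reversed_feats = GradReverse.apply(features, self.lamb)
        return self.domain_clf(reversed_feats)  # source-vs-target logits
```

During training, the domain loss from this head is added to the task loss on labeled source data; unlabeled target images contribute only through the domain term.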
Table 2. Summary of comparative insights: traditional adaptation methods versus foundation models (see Section 5.6); a minimal low-rank adapter (LoRA) sketch follows this table.

| Attribute | Traditional Methods | Foundation Models |
|---|---|---|
| Label Requirements | High (manual labeling) | Low (self/weak supervision) |
| Data Modality | Single (e.g., image only) | Multi-modal (image + text) |
| Domain Generalization | Limited (target-specific) | Strong (OOD robustness) |
| Interpretability | Low to moderate | High (text-guided attention) |
| Deployment Overhead | Low (smaller models) | High (initial training), low (adaptation) |
| Adaptability | Low (full retraining needed) | High (prompt/adapters) |
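The "Adaptability" row in Table 2 credits foundation models with cheap adaptation via prompts or adapters. As one concrete instance, low-rank adaptation (LoRA; Hu et al., cited above) freezes the pretrained weights and learns a small low-rank correction. The sketch below, assuming PyTorch, wraps an arbitrary linear layer; it is illustrative rather than a specific library's API.

```python
from torch import nn

class LoRALinear(nn.Module):
    """Wraps a frozen pretrained linear layer with a trainable low-rank
    update: y = W x + (alpha / r) * B(A x). Only A and B are trained."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False        # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as an exact no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Because `lora_b` is initialized to zero, the wrapped layer reproduces the pretrained model exactly at the start of fine-tuning, and only the rank-r factors, a small fraction of the total parameters, are updated.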
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
