Submitted:
23 October 2025
Posted:
24 October 2025
Abstract
Keywords:
1. Introduction
2. Methodological Approach
2.1. Data Preparation
2.2. Model Preparation
2.2.1. Large Vision-Language Models (LVLMs)
- LLaVA v1.6 Mistral-7B [42] pairs the CLIP ViT-L/14 encoder with Mistral-7B-Instruct as its language decoder. The Mistral backbone introduces Sliding Window Attention (SWA), which reduces attention complexity from $O(n^2)$ to $O(n \cdot w)$ for sequence length $n$ and window size $w$, and supports an extended context length. It also applies Grouped-Query Attention (32 query heads mapped to 8 key-value heads), lowering key-value cache memory at inference by about 75% while preserving generation quality. A two-layer MLP connector (20–50M parameters) projects 1024-dimensional vision features into the 4096-dimensional language space, enabling cross-modal integration.
- LLaVA v1.6 Vicuna-7B [43] substitutes Vicuna-7B as the language backbone while maintaining identical vision encoding and projection mechanisms. Vicuna-7B features 32 transformer layers with RMSNorm normalization and SwiGLU activation functions, supporting context lengths up to 4096 tokens. Both Mistral and Vicuna variants use a similar training protocol. The models first align visual features with language embeddings using filtered image-caption pairs, and subsequently undergo visual instruction tuning.
- LLaVA v1.5 with LLaMA-7B features a LLaMA-7B [44] decoder containing 32 transformer layers that use RMSNorm normalization, SwiGLU activation functions, and Rotary Position Embeddings (RoPE). The architecture supports a 2048-token context window and incorporates standard multi-head self-attention with 32 heads, each operating on 128-dimensional representations. Cross-modal alignment is achieved through a similar linear projection strategy, demonstrating that, with appropriate instruction tuning, complex fusion mechanisms are not required for effective performance.
- IDEFICS-9B Instruct [19] combines a CLIP ViT-H/14 vision encoder with LLaMA-7B [44]. Its distinguishing feature is the Perceiver Resampler module, containing approximately 250M trainable parameters, which compresses variable-length visual inputs into exactly 64 visual tokens regardless of input size. The Perceiver operates through learned latent queries that extract fixed-size representations via cross-attention, reducing computational complexity to $O(N \cdot M)$, where $N$ is the number of latent tokens (64 here) and $M$ is the number of visual features. This design fixes the number of visual tokens passed to the language model, independent of image dimensions. Additionally, the model integrates Gated Cross-Attention Dense layers (approximately 500M parameters) inserted every fourth transformer block, using gating to stabilize multimodal fusion during training.
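Several of the figures quoted in the bullets above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the connector is assumed to be two linear layers with biases (1024 → 4096 → 4096), and the attention costs count query-key pairs rather than exact FLOPs.

```python
def gqa_kv_cache_saving(num_query_heads: int, num_kv_heads: int) -> float:
    """Fraction of key-value cache saved versus one KV head per query head."""
    return 1.0 - num_kv_heads / num_query_heads


def mlp_connector_params(d_vision: int, d_language: int) -> int:
    """Parameter count of an ASSUMED two-layer MLP with biases:
    d_vision -> d_language -> d_language."""
    first_layer = d_vision * d_language + d_language
    second_layer = d_language * d_language + d_language
    return first_layer + second_layer


def full_attention_pairs(seq_len: int) -> int:
    """Query-key pairs for unrestricted self-attention (quadratic in length)."""
    return seq_len * seq_len


def sliding_window_pairs(seq_len: int, window: int) -> int:
    """Upper bound on query-key pairs when each token sees at most `window` tokens."""
    return seq_len * min(seq_len, window)


def perceiver_pairs(num_latents: int, num_visual_features: int) -> int:
    """Query-key pairs when latent queries cross-attend to visual features."""
    return num_latents * num_visual_features


# 32 query heads sharing 8 key-value heads, as described for the Mistral backbone.
print(gqa_kv_cache_saving(32, 8))        # 0.75

# 1024-dimensional vision features projected into a 4096-dimensional language space.
print(mlp_connector_params(1024, 4096))  # 20979712

# Illustrative sequence length and window size (assumed values, not from the paper).
print(full_attention_pairs(8192), sliding_window_pairs(8192, 4096))

# 64 latent queries attending to an assumed 1024 visual features.
print(perceiver_pairs(64, 1024))
```

Under these assumptions the connector comes to roughly 21M parameters, which sits at the lower end of the 20–50M range quoted above.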
2.2.2. Small Vision-Language Models (SVLMs)
- Qwen 2-VL [45] employs Native Dynamic Resolution (NaViT) processing, eliminating fixed-resolution constraints. The custom Qwen Vision Transformer (ViT) processes images at native resolutions, producing 4 to 16,384 visual tokens per image with 2D-RoPE positional encoding. A lightweight token merger aggregates spatially adjacent patches through MLP compression before language model fusion.
- Qwen 2.5-3B-Instruct [46] extends this architecture with window attention for computational efficiency and SwiGLU activations in vision MLPs, aligning encoder structure with modern language models. The PatchMerger component provides additional token compression through dedicated MLP sublayers. Notably, it implements Multimodal Rotary Position Embedding (M-RoPE), decomposing positions across temporal, height, and width dimensions for unified spatial-temporal reasoning. Both models support 32,768 token context windows, with Qwen 2.5-3B-Instruct offering enhanced positional encoding strategies optimized for variable-resolution multimodal processing.
- MoonDream2 [47] combines SigLIP-base vision encoding with the Phi-1.5 language model [48]. SigLIP replaces CLIP’s softmax-based contrastive loss with a pairwise sigmoid loss, eliminating global batch dependencies and improving zero-shot performance. The lightweight projection layer (per Table 2) requires minimal computational overhead while maintaining effective cross-modal alignment.
- SmolVLM-500M-Instruct [49] is the most compact model evaluated. It compresses visual tokens aggressively through a pixel shuffle strategy. The Idefics3Connector projects high-dimensional vision features (12,288-dimensional) into a 960-dimensional language space, achieving approximately a 9× reduction in token representation while maintaining captioning quality.
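Pixel-shuffle token compression can be pictured as a space-to-depth rearrangement: each r × r block of neighbouring visual tokens is concatenated into one wider token, so the token count falls by r² while the feature width grows by r². The following is a minimal pure-Python sketch on a toy grid. The function name, the grid size, the feature width, and the choice r = 3 (picked only to reproduce a 9× reduction) are all assumptions for illustration and do not describe the model's actual configuration.

```python
def pixel_shuffle(grid, r):
    """Space-to-depth: merge each r x r block of tokens into one wider token.

    `grid` is a list of rows, each row a list of tokens, each token a list of floats.
    """
    height, width = len(grid), len(grid[0])
    if height % r or width % r:
        raise ValueError("grid height and width must be divisible by r")
    merged = []
    for i in range(0, height, r):
        merged_row = []
        for j in range(0, width, r):
            token = []
            # Concatenate the features of every token in the r x r block.
            for di in range(r):
                for dj in range(r):
                    token.extend(grid[i + di][j + dj])
            merged_row.append(token)
        merged.append(merged_row)
    return merged


# Toy example: a 6 x 6 grid of 4-dimensional tokens (illustrative sizes only).
H, W, D, R = 6, 6, 4, 3
grid = [[[float(i * W + j)] * D for j in range(W)] for i in range(H)]
merged = pixel_shuffle(grid, R)

tokens_before = H * W
tokens_after = len(merged) * len(merged[0])
dim_after = len(merged[0][0])

print(tokens_before, tokens_after, dim_after)  # 36 4 36
```

A projection layer would then map the widened tokens into the language model's embedding space, which is the role the text attributes to the connector.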
2.2.3. Baseline Architectures
- CNN-Transformer Fusion uses a pre-trained CheXNet [34] (DenseNet121) as the CNN feature extractor. A transformer encoder further contextualizes the extracted embeddings, which are then refined by a tiny transformer decoder [21]. The decoder generates captions through multi-head self-attention and cross-attention with the visual features. The architecture capitalizes on the spatial sensitivity of CNNs and the global sequence modeling power of transformers, linked by a parameter-efficient feature fusion layer.
2.3. Training Environment Preparation
2.3.1. Adaptation Strategy Selection
- (1) Targeted LoRA focuses on core transformation layers: the query, key, and value projections within the attention mechanisms, the multimodal connector layers, and the gate, up, and down projections of the MLP. Output projections and embedding layers are left unchanged, so this configuration modifies only a small, selected subset of the model’s parameters.
- (2) Extended LoRA expands the adaptation to include the attention output projections and the fully connected layers of the MLP, increasing the number of trainable parameters.
- (3) Hybrid Strategy combines LoRA with full fine-tuning of the language model head and token embeddings, which substantially increases both the trainable parameter count and the training time.
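To make the difference in scale between the three strategies concrete, the sketch below counts trainable parameters. A rank-r LoRA adapter on a linear layer with input width d_in and output width d_out adds r · (d_in + d_out) parameters. Everything numeric here is an assumption chosen for illustration: LLaMA-7B-like dimensions (hidden size 4096, MLP width 11008, 32 layers, vocabulary 32,000), a two-layer connector, rank 16, and module names that follow common naming conventions rather than anything specified in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linear:
    name: str   # hypothetical module label
    d_in: int
    d_out: int


def lora_params(layer: Linear, rank: int) -> int:
    """Parameters added by a rank-`rank` adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (layer.d_in + layer.d_out)


# Assumed dimensions, for illustration only.
HIDDEN, MLP_WIDTH, LAYERS, VOCAB, VISION, RANK = 4096, 11008, 32, 32000, 1024, 16

ATTN_QKV = [Linear(n, HIDDEN, HIDDEN) for n in ("q_proj", "k_proj", "v_proj")]
ATTN_OUT = [Linear("o_proj", HIDDEN, HIDDEN)]
MLP = [
    Linear("gate_proj", HIDDEN, MLP_WIDTH),
    Linear("up_proj", HIDDEN, MLP_WIDTH),
    Linear("down_proj", MLP_WIDTH, HIDDEN),
]
CONNECTOR = [Linear("mm_proj_1", VISION, HIDDEN), Linear("mm_proj_2", HIDDEN, HIDDEN)]


def adapter_total(per_layer, once, rank=RANK, layers=LAYERS):
    """Adapters on `per_layer` modules repeat in every block; `once` modules appear once."""
    repeated = layers * sum(lora_params(m, rank) for m in per_layer)
    single = sum(lora_params(m, rank) for m in once)
    return repeated + single


targeted = adapter_total(ATTN_QKV + MLP, CONNECTOR)
extended = adapter_total(ATTN_QKV + ATTN_OUT + MLP, CONNECTOR)
# Hybrid: extended adapters plus fully trained token embeddings and LM head.
hybrid = extended + 2 * VOCAB * HIDDEN

for label, count in (("targeted", targeted), ("extended", extended), ("hybrid", hybrid)):
    print(f"{label:>8}: {count / 1e6:.1f}M trainable parameters")
```

Under these assumed dimensions the adapter-only strategies stay in the tens of millions of parameters, whereas unfreezing the embeddings and head adds several hundred million, which matches the qualitative ordering described above.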
2.3.2. Adaptive Full Fine-Tuning for Baseline Models
2.3.3. Training Configuration of Vision Language Models
2.3.4. Modality-Aware Prompting for SVLMs
2.4. Post-Training Evaluation
2.4.1. Relevance Metrics
- (1) Image-Caption Similarity: Computed by mapping the image and candidate caption into a shared embedding space using MedImageInsight [66], a medical VLM trained across X-ray, CT, MRI, OCT, and ultrasound modalities, and calculating their cosine similarity. This metric assesses whether the caption content aligns with the visual signal without requiring reference text.
- (2) BERTScore: Applied in its recall-oriented configuration with inverse document frequency weighting, following established best practices for caption evaluation [67]. Contextual embeddings were generated using a DeBERTa-XLarge-MNLI encoder [68], selected for its demonstrated correlation with human quality judgments.
- (3) ROUGE-1: F-measure calculated to assess lexical overlap through unigram matching between the candidate and reference captions [69].
- (4) BLEURT: Applied using the BLEURT-20 checkpoint to obtain learned quality scores that approximate human preferences for caption quality [70].
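Two of the relevance metrics above reduce to short formulas, sketched below in plain Python. This is a simplified illustration, not the evaluation code used in the study: the embeddings are placeholder vectors rather than MedImageInsight outputs, and the ROUGE-1 function uses lowercase whitespace tokenization with no stemming, which the reference implementation may handle differently.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram-overlap F-measure with clipped counts (simplified tokenization)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection clips repeated words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Placeholder embeddings standing in for image and caption vectors.
image_vec = [0.2, 0.7, 0.1]
caption_vec = [0.25, 0.6, 0.2]
print(round(cosine_similarity(image_vec, caption_vec), 3))

reference = "chest x-ray showing a right pleural effusion"
candidate = "chest x-ray showing a large right-sided pleural effusion"
print(round(rouge1_f(candidate, reference), 3))  # 0.8
```

In the caption pair above, six of the candidate's eight unigrams appear in the seven-word reference, giving precision 6/8, recall 6/7, and an F-measure of 0.8.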
2.4.2. Factuality Metrics
- (1) UMLS Concept F1: Quantified preservation of key medical entities through comparison of concept sets extracted from candidate and reference captions. Medical concepts were identified using MedCAT [71] and filtered according to clinically relevant semantic types specified in the MEDCON framework [72].
- (2) AlignScore: Generated consistency scores by evaluating factual claims in candidate captions against reference standards using a RoBERTa-based alignment model [73]. The metric decomposes captions into claims and aligns them with supporting evidence, producing averaged alignment scores.
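Once concepts have been extracted, a concept-level F1 is a set comparison. The sketch below assumes that extraction and semantic-type filtering have already happened upstream. The concept identifiers are made-up placeholders rather than real UMLS CUIs, and returning 1.0 when both sets are empty is a convention chosen here, not one stated in the text.

```python
def concept_f1(candidate_concepts, reference_concepts) -> float:
    """F1 between two sets of concept identifiers."""
    cand, ref = set(candidate_concepts), set(reference_concepts)
    if not cand and not ref:
        return 1.0  # assumed convention: nothing to find, nothing predicted
    true_positives = len(cand & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cand)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Placeholder identifiers standing in for extracted medical concepts.
reference = {"pleural_effusion", "right_side"}
candidate = {"pleural_effusion", "right_side", "large"}

print(round(concept_f1(candidate, reference), 3))  # 0.8
```

Here the candidate recovers both reference concepts (recall 1.0) but adds one unsupported concept (precision 2/3), so the score penalizes the extra finding without discarding the correct ones.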
3. Experimental Results
3.1. Optimal Adaptation Strategy Selection
3.2. Cross-Model Performance Comparison
3.3. Modality-Aware Prompting for Performance Enhancement in SVLMs
4. Discussion
4.1. The Parameter Efficiency Paradox
4.2. Performance Patterns Across Caption Complexity
4.3. MoonDream2: Bridging the Efficiency Gap
5. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Afshari Mirak, S.; Tirumani, S.H.; Ramaiya, N.; Mohamed, I. The growing nationwide radiologist shortage: current opportunities and ongoing challenges for international medical graduate radiologists. Radiology 2025, 314, e232625.
- Rawson, J.V.; Smetherman, D.; Rubin, E. Short-term strategies for augmenting the national radiologist workforce. American Journal of Roentgenology 2024, 222, e2430920.
- Smith-Bindman, R.; Miglioretti, D.L.; Larson, E.B. Rising use of diagnostic medical imaging in a large integrated health system. Health affairs 2008, 27, 1491–1502.
- Achour, N.; Zapata, T.; Saleh, Y.; Pierscionek, B.; Azzopardi-Muscat, N.; Novillo-Ortiz, D.; Morgan, C.; Chaouali, M. The role of AI in mitigating the impact of radiologist shortages: a systematised review. Health and Technology 2025, pp. 1–13.
- Dreyer, R.; Van der Merwe, C.; Nicolaou, M.; Richards, G. Assessing and comparing chest radiograph interpretation in the Department of Internal Medicine at the University of the Witwatersrand medical school, according to seniority. African journal of thoracic and critical care medicine 2023, 29, 12–17.
- Ejiga Peter, O.O.; Adeniran, O.T.; John-Otumu, A.M.; Khalifa, F.; Rahman, M.M. Text-Guided Synthesis in Medical Multimedia Retrieval: A Framework for Enhanced Colonoscopy Image Classification and Segmentation. Algorithms 2025, 18, 155.
- Beddiar, D.R.; Oussalah, M.; Seppänen, T. Automatic captioning for medical imaging (MIC): a rapid review of literature. Artificial intelligence review 2023, 56, 4019–4076.
- Adeniran, O.T.; Ojeme, B.; Ajibola, T.E.; Peter, O.O.E.; Ajala, A.O.; Rahman, M.M.; Khalifa, F. Explainable MRI-Based Ensemble Learnable Architecture for Alzheimer’s Disease Detection. Algorithms 2025, 18, 163.
- Reale-Nosei, G.; Amador-Domínguez, E.; Serrano, E. From vision to text: A comprehensive review of natural image captioning in medical diagnosis and radiology report generation. Medical Image Analysis 2024, 97, 103264.
- Li, T.; Wang, J.; Jin, L. Enhancing Visual Information Extraction with Large Language Models Through Layout-Aware Instruction Tuning. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer; 2024; pp. 276–289.
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 2022, 35, 23716–23736.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International conference on machine learning. PMLR; 2021; pp. 8748–8763.
- Bannur, S.; Hyland, S.; Liu, Q.; Perez-Garcia, F.; Ilse, M.; Castro, D.C.; Boecking, B.; Sharma, H.; Bouzid, K.; Thieme, A.; et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15016–15027.
- Moor, M.; Huang, Q.; Wu, S.; Yasunaga, M.; Dalmia, Y.; Leskovec, J.; Zakka, C.; Reis, E.P.; Rajpurkar, P. Med-flamingo: a multimodal medical few-shot learner. In Proceedings of the Machine Learning for Health (ML4H). PMLR; 2023; pp. 353–367.
- Wu, C.; Zhang, X.; Zhang, Y.; Hui, H.; Wang, Y.; Xie, W. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data. Nature Communications 2025, 16, 7866.
- Ryu, J.S.; Kang, H.; Chu, Y.; Yang, S. Vision-language foundation models for medical imaging: a review of current practices and innovations. Biomedical Engineering Letters 2025, pp. 1–22.
- Zhao, W.; Li, F.; Diao, Y.; Fan, P.; Chen, Z. Cap2Seg: leveraging caption generation for enhanced segmentation of COVID-19 medical images. Frontiers in Physics 2024, 12, 1439122.
- Rau, A.; Endo, M.; Aklilu, J.; Heo, J.; Saab, K.; Paderno, A.; Jopling, J.; Holsinger, F.C.; Yeung-Levy, S. Systematic evaluation of large vision-language models for surgical artificial intelligence. arXiv preprint arXiv:2504.02799 2025.
- Laurençon, H.; Saulnier, L.; Tronchon, L.; Bekman, S.; Singh, A.; Lozhkov, A.; Wang, T.; Karamcheti, S.; Rush, A.; Kiela, D.; et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems 2023, 36, 71683–71702.
- Kurz, C.F.; Merzhevich, T.; Eskofier, B.M.; Kather, J.N.; Gmeiner, B. Benchmarking vision-language models for diagnostics in emergency and critical care settings. npj Digital Medicine 2025, 8, 423.
- Hoque, M.; Hasan, M.R.; Emon, M.I.S.; Oluwafemi, E.P.O.; Rahman, M.M.; Khalifa, F. Comparative Analysis of Fine-Tuned Multimodal Models in Radiology Image Captioning. In Proceedings of the 2025 IEEE 4th International Conference on Computing and Machine Intelligence (ICMI); 2025; pp. 1–6.
- Busch, F.; Hoffmann, L.; Rueger, C.; van Dijk, E.H.; Kader, R.; Ortiz-Prado, E.; Makowski, M.R.; Saba, L.; Hadamitzky, M.; Kather, J.N.; et al. Current applications and challenges in large language models for patient care: a systematic review. Communications Medicine 2025, 5, 26.
- Shi, Y.; Shu, P.; Liu, Z.; Wu, Z.; Li, Q.; Liu, T.; Liu, N.; Li, X. MGH Radiology Llama: A Llama 3 70B Model for Radiology, 2024, [arXiv:cs.CL/2408.11848].
- Danish, S.; Sadeghi-Niaraki, A.; Khan, S.U.; Dang, L.M.; Tightiz, L.; Moon, H. A comprehensive survey of Vision-Language Models: Pretrained models, fine-tuning, prompt engineering, adapters, and benchmark datasets. Information Fusion 2025, p. 103623.
- Alsinglawi, B.; McCarthy, C.; Webb, S.; Fluke, C.; Saidy, N.T. A Lightweight Large Vision-language Model for Multimodal Medical Images, 2025, [arXiv:cs.CV/2504.05575].
- Mei, X.; Shun, J.; Chao, K. Efficient Fine-Tuning with Low-Rank Adaptation for Large-Scale AI Models. Available at SSRN 5173161 2024.
- Li, Y.; Ghahremani, M.; Wachinger, C. MedBridge: Bridging Foundation Vision-Language Models to Medical Image Diagnosis, 2025, [arXiv:cs.CV/2505.21698].
- Chen, S.; Gu, J.; Han, Z.; Ma, Y.; Torr, P.; Tresp, V. Benchmarking robustness of adaptation methods on pre-trained vision-language models. Advances in Neural Information Processing Systems 2023, 36, 51758–51777.
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W.; et al. Lora: Low-rank adaptation of large language models. ICLR 2022, 1, 3.
- Zanella, M.; Ben Ayed, I. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1593–1603.
- Hartsock, I.; Rasool, G. Vision-language models for medical report generation and visual question answering: A review. Frontiers in artificial intelligence 2024, 7, 1430984.
- Li, Y.; Lai, Z.; Bao, W.; Tan, Z.; Dao, A.; Sui, K.; Shen, J.; Liu, D.; Liu, H.; Kong, Y. Visual Large Language Models for Generalized and Specialized Applications, 2025, [arXiv:cs.CV/2501.02765].
- Zhao, Y.; Braytee, A.; Prasad, M. DualPrompt-MedCap: A Dual-Prompt Enhanced Approach for Medical Image Captioning, 2025, [arXiv:cs.CV/2504.09598].
- Rajpurkar, P.; Irvin, J.; Zhu, K.; Yang, B.; Mehta, H.; Duan, T.; Ding, D.; Bagul, A.; Langlotz, C.; Shpanskaya, K.; et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225 2017.
- Rückert, J.; Bloch, L.; Brüngel, R.; Idrissi-Yaghir, A.; Schäfer, H.; Schmidt, C.S.; Koitka, S.; Pelka, O.; Abacha, A.B.; G. Seco de Herrera, A.; et al. Rocov2: Radiology objects in context version 2, an updated multimodal image dataset. Scientific Data 2024, 11, 688.
- Bodenreider, O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research 2004, 32, D267–D270.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- Hasan, M.R. Transformer and Convolutional Neural Network Based Hybrid Approaches in Medical Image Classification, Caption Generation, and Retrieval Processes. Master’s thesis, Morgan State University, 2024.
- Nam, Y.; Kim, D.Y.; Kyung, S.; Seo, J.; Song, J.M.; Kwon, J.; Kim, J.; Jo, W.; Park, H.; Sung, J.; et al. Multimodal Large Language Models in Medical Imaging: Current State and Future Directions. Korean Journal of Radiology 2025, 26, 900.
- Van, M.H.; Verma, P.; Wu, X. On large visual language models for medical imaging analysis: An empirical study. In Proceedings of the 2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE). IEEE, 2024, pp. 172–176.
- OpenAI. CLIP ViT-L/14 Model. https://huggingface.co/openai/clip-vit-large-patch14, 2021. Accessed: 2024-05-28.
- Liu, H. LLaVA v1.6 Mistral 7B. https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b, 2023. Published: December 2023. Accessed: 2024-05-28.
- LMSYS. Vicuna 7B, Version 1.3. https://huggingface.co/lmsys/vicuna-7b-v1.3, 2023. Accessed: 2024-05-28.
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 2023.
- Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen technical report. arXiv preprint arXiv:2309.16609 2023.
- Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 2025.
- Korrapati, V. Moondream2: A small vision-language model. https://huggingface.co/vikhyatk/moondream2, 2024. Accessed: 2024-05-28.
- Li, Y.; Bubeck, S.; Eldan, R.; Del Giorno, A.; Gunasekar, S.; Lee, Y.T. Textbooks Are All You Need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463 2023.
- Marafioti, A.; Zohar, O.; Farré, M.; Noyan, M.; Bakouch, E.; Cuenca, P.; Zakka, C.; Allal, L.B.; Lozhkov, A.; Tazi, N.; et al. Smolvlm: Redefining small and efficient multimodal models. arXiv preprint arXiv:2504.05299 2025.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 2020.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; et al. Language models are unsupervised multitask learners. OpenAI blog 2019, 1, 9.
- Shinde, G.; Ravi, A.; Dey, E.; Sakib, S.; Rampure, M.; Roy, N. A Survey on Efficient Vision-Language Models. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2025, 15, e70036.
- Jin, F.; Zhang, J.; Zong, C. Parameter-efficient tuning for large language model without calculating its gradients. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 321–330.
- Li, M.; Jiang, Y.; Zhang, Y.; Zhu, H. Medical image analysis using deep learning algorithms. Frontiers in public health 2023, 11, 1273253.
- Liu, H.; Tam, D.; Muqeeth, M.; Mohta, J.; Huang, T.; Bansal, M.; Raffel, C.A. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems 2022, 35, 1950–1965.
- Prottasha, N.J.; Mahmud, A.; Sobuj, M.S.I.; Bhat, P.; Kowsher, M.; Yousefi, N.; Garibay, O.O. Parameter-efficient fine-tuning of large language models using semantic knowledge tuning. Scientific Reports 2024, 14, 30667.
- Al-Kababji, A.; Bensaali, F.; Dakua, S.P. Scheduling techniques for liver segmentation: Reducelronplateau vs onecyclelr. In Proceedings of the International conference on intelligent systems and pattern recognition. Springer; 2022; pp. 204–212.
- Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. Mixed Precision Training, 2018, [arXiv:cs.AI/1710.03740].
- Kalamkar, D.; Mudigere, D.; Mellempudi, N.; Das, D.; Banerjee, K.; Avancha, S.; Vooturi, D.T.; Jammalamadaka, N.; Huang, J.; Yuen, H.; et al. A Study of BFLOAT16 for Deep Learning Training, 2019, [arXiv:cs.LG/1905.12322].
- Dao, T. FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, 2023, [arXiv:cs.LG/2307.08691].
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization, 2019, [arXiv:cs.LG/1711.05101].
- Cai, L.; Gao, J.; Zhao, D. A review of the application of deep learning in medical image classification and segmentation. Annals of translational medicine 2020, 8, 713.
- Mao, A.; Mohri, M.; Zhong, Y. Cross-Entropy Loss Functions: Theoretical Analysis and Applications, 2023, [arXiv:cs.LG/2304.07288].
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. CoRR 2014, abs/1412.6980.
- Damm, H.; Pakull, T.M.; Becker, H.; Bracke, B.; Eryilmaz, B.; Bloch, L.; Brüngel, R.; Schmidt, C.S.; Rückert, J.; Pelka, O.; et al. Overview of ImageCLEFmedical 2025–medical concept detection and interpretable caption generation. CLEF, 2025.
- Codella, N.C.; Jin, Y.; Jain, S.; Gu, Y.; Lee, H.H.; Abacha, A.B.; Santamaria-Pang, A.; Guyman, W.; Sangani, N.; Zhang, S.; et al. Medimageinsight: An open-source embedding model for general domain medical imaging. arXiv preprint arXiv:2410.06542 2024.
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675 2019.
- He, P.; Liu, X.; Gao, J.; Chen, W. DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION. In Proceedings of the International Conference on Learning Representations, 2021.
- Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text summarization branches out, 2004, pp. 74–81.
- Sellam, T.; Das, D.; Parikh, A.P. BLEURT: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696 2020.
- Kraljevic, Z.; Searle, T.; Shek, A.; Roguski, L.; Noor, K.; Bean, D.; Mascio, A.; Zhu, L.; Folarin, A.A.; Roberts, A.; et al. Multi-domain clinical natural language processing with MedCAT: the medical concept annotation toolkit. Artificial intelligence in medicine 2021, 117, 102083.
- Yim, W.w.; Fu, Y.; Ben Abacha, A.; Snider, N.; Lin, T.; Yetisgen, M. Aci-bench: a novel ambient clinical intelligence dataset for benchmarking automatic visit note generation. Scientific data 2023, 10, 586.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 2019.
- Puth, M.T.; Neuhäuser, M.; Ruxton, G.D. Effective use of Spearman’s and Kendall’s correlation coefficients for association between two measured traits. Animal Behaviour 2015, 102, 77–84.
- Eden, S.K.; Li, C.; Shepherd, B.E. Nonparametric estimation of Spearman’s rank correlation with bivariate survival data. Biometrics 2022, 78, 421–434.
- Xie, Q.; Chen, Q.; Chen, A.; Peng, C.; Hu, Y.; Lin, F.; Peng, X.; Huang, J.; Zhang, J.; Keloth, V.; et al. Me-llama: Foundation large language models for medical applications. Research square 2024, pp. rs–3.

| Primary Modalities | Count | Percentage |
|---|---|---|
| CT | 40,913 | 35.1% |
| X-ray | 31,827 | 27.3% |
| MRI | 18,570 | 15.9% |
| Ultrasound | 17,147 | 14.7% |
| Other* | 8,178 | 7.0% |

| Model | Image Encoder | Text Decoder | Connector / Fusion | # Params (approx.) |
|---|---|---|---|---|
| IDEFICS 9B Instruct | CLIP ViT-H/14 (∼1.3B) | LLaMA-7B (∼7.0B) | Perceiver + gated cross-attn (∼0.75B) | 9.0B |
| LLaVA v1.6 Mistral 7B | CLIP ViT-L/14 (∼0.43B) | Mistral-7B (∼7.0B) | Projection + cross-attn (∼20–50M) | 7.6B |
| LLaVA v1.6 Vicuna 7B | CLIP ViT-L/14 (∼0.43B) | Vicuna-7B (∼6.7B) | Projection + cross-attn (∼20–50M) | 7.1B |
| LLaVA v1.5 with LLaMA 7B | CLIP ViT-L/14 (∼0.43B) | LLaMA-7B (∼6.7B) | Projection + cross-attn (∼20–50M) | 7.1B |
| Qwen 2.5-3B-Instruct | Custom ViT (∼0.3B) | Qwen LM (∼2.8B) | Projection + cross-attn (∼50M) | 3.1B |
| Qwen 2-VL | Custom ViT (∼0.3B) | Qwen LM (∼1.9B) | Projection + cross-attn (∼20M) | 2.2B |
| MoonDream2 | SigLIP-base (∼0.15B) | Phi-1.5 (∼1.7B) | Projection + cross-attn (∼10M) | 1.86B |
| SmolVLM-500M-Instruct | SigLIP-base (∼0.15B) | Tiny LM (∼0.35B) | Projection + cross-attn (∼10M) | 0.5B |
| VisionGPT2 | ViT (∼0.05B) | GPT2 small (∼0.12–0.15B) | Cross-attention (∼5–10M) | 0.21B |
| CNN–Transformer Fusion | Tiny CNN (∼10M) | Tiny Transformer (∼30–35M) | Direct feature fusion (∼3M) | 0.048B |

| Model | Params Trained | % of Total | Adapter Scaling | Target Modules |
|---|---|---|---|---|
| LVLMs | | | | |
| LLaVA-Mistral-7B | 40.1M | 0.53 | 2.0 | q,k,v + MLP |
| LLaVA-Vicuna-7B | 34.4M | 0.48 | 1.0 | q,k,v + MLP + mm_proj |
| LLaVA-1.5 | 84.6M | 1.18 | 1.0 | q,k,v,o + MLP + mm_proj |
| IDEFICS-9B | 22.0M | 0.24 | 2.0 | q,k,v only |
| SVLMs | | | | |
| Qwen-2.5-3B | 57.0M | 1.87 | 0.5 | All attention + MLP |
| Qwen-2-VL | 54.0M | 2.46 | 1.0 | All attention + MLP |
| MoonDream2 | 74.4M | 3.85 | 0.25 | All linear + proj |
| SmolVLM-500M | 41.7M | 8.34 | 2.0 | All linear |
| Baselines | | | | |
| VisionGPT2 | 210M | 100 | – | Progressive full FT |
| CNN-Transformer | 48M | 100 | – | Progressive full FT |

| Model | Parameters Trained (%) | Similarity | BERTScore | ROUGE | BLEURT | UMLS Concept F1 | AlignScore | Relevance | Factuality | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Large VLMs (LoRA Adapter) | | | | | | | | | | |
| LLaVA Mistral-7B | 0.53 | 0.870 | 0.628 | 0.251 | 0.315 | 0.154 | 0.081 | 0.516 | 0.118 | 0.317 |
| LLaVA Vicuna-7B | 0.48 | 0.830 | 0.625 | 0.245 | 0.314 | 0.142 | 0.076 | 0.504 | 0.109 | 0.306 |
| IDEFICS-9B | 0.24 | 0.781 | 0.621 | 0.229 | 0.296 | 0.128 | 0.070 | 0.482 | 0.099 | 0.290 |
| LLaVA-1.5 | 1.18 | 0.720 | 0.617 | 0.218 | 0.295 | 0.108 | 0.059 | 0.462 | 0.083 | 0.273 |
| Small VLMs (LoRA Adapter) | | | | | | | | | | |
| MoonDream2 | 3.85 | 0.757 | 0.586 | 0.216 | 0.303 | 0.120 | 0.066 | 0.466 | 0.093 | 0.279 |
| Qwen 2-VL | 2.46 | 0.570 | 0.518 | 0.160 | 0.238 | 0.074 | 0.109 | 0.372 | 0.091 | 0.232 |
| SmolVLM | 8.34 | 0.414 | 0.536 | 0.136 | 0.252 | 0.016 | 0.060 | 0.362 | 0.038 | 0.200 |
| Qwen-2.5 | 1.87 | 0.449 | 0.453 | 0.124 | 0.256 | 0.048 | 0.064 | 0.320 | 0.056 | 0.188 |
| Baselines (Full Finetune) | | | | | | | | | | |
| VisionGPT2 | All | 0.389 | 0.546 | 0.118 | 0.247 | 0.022 | 0.035 | 0.325 | 0.028 | 0.177 |
| CNN-Transformer | All | 0.399 | 0.414 | 0.044 | 0.277 | 0.018 | 0.030 | 0.284 | 0.024 | 0.154 |

| Model | Configuration | Similarity | BERTScore | ROUGE | BLEURT | UMLS F1 | AlignScore | Relevance Avg | Factuality Avg | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| SmolVLM | Base | 0.414 | 0.536 | 0.136 | 0.252 | 0.016 | 0.060 | 0.362 | 0.038 | 0.200 |
| SmolVLM | +Modality | 0.418 | 0.532 | 0.144 | 0.266 | 0.031 | 0.096 | 0.365 | 0.048 | 0.207 |
| SmolVLM | Change | +1.0% | -0.7% | +5.9% | +5.6% | +93.8% | +60.0% | +0.8% | +26.3% | +3.5% |
| Qwen 2-VL | Base | 0.570 | 0.518 | 0.160 | 0.238 | 0.074 | 0.109 | 0.372 | 0.091 | 0.232 |
| Qwen 2-VL | +Modality | 0.364 | 0.456 | 0.121 | 0.311 | 0.017 | 0.086 | 0.313 | 0.052 | 0.182 |
| Qwen 2-VL | Change | -36.1% | -12.0% | -24.4% | +30.7% | -77.0% | -21.1% | -15.9% | -42.9% | -21.6% |
| Qwen-2.5 | Base | 0.449 | 0.453 | 0.124 | 0.256 | 0.048 | 0.064 | 0.320 | 0.056 | 0.188 |
| Qwen-2.5 | +Modality | 0.502 | 0.461 | 0.122 | 0.268 | 0.032 | 0.065 | 0.351 | 0.049 | 0.200 |
| Qwen-2.5 | Change | +11.8% | +1.8% | -1.6% | +4.7% | -33.3% | +1.6% | +9.7% | -12.5% | +6.4% |

| Ground Truth | chestX-ray showing a right pleural effusion (green arrow) |
|---|---|
| LLaVA-Mistral-7B | ChestX-ray showing a right-sided pleural effusion (green arrow) |
| LLaVA-Vicuna-7B | ChestX-ray showing a large right-sided pleural effusion (green arrow) |
| LLaVA-1.5 | ChestX-ray showing a large right-sided pleural effusion |
| IDEFICS-9B | ChestX-ray showing a large right-sided pleural effusion (green arrow) |
| MoonDream2 | ChestX-ray showing a right-sided pleural effusion (arrow) |
| Qwen 2-VL | ChestX-ray of the patient shows an enlarged right-sided infiltrate in the right lower lobe (arrow) |
| SmolVLM | An arrow pointing to a mass in the patient’s chest on a chest x-ray |
| Qwen 2.5 | CT scan of the left kidney showing a renal cystic lesion (arrow) |
| VisionGPT2 | A chest X-ray showing a large right-sided pneumothorax |
| CNN-Transformer | Chestx ray in sagittal |

| Ground Truth | CTpulmonary embolus study showing pericardial effusion measuring 19.27 mm in this plane |
|---|---|
| LLaVA-Mistral-7B | CTpulmonary embolus study showing pericardial effusion measuring 19.27 mm in this plane |
| LLaVA-Vicuna-7B | CTpulmonary angiogram showing pericardial effusion measuring 19.27 mm in this plane |
| LLaVA-1.5 | CTpulmonary embolism showing a large right lower lobe pulmonary embolism |
| IDEFICS-9B | CTpulmonary embolism study showing [truncated/incomplete] |
| MoonDream2 | CT scan of the chest showing a large mass in the right upper lobe |
| Qwen 2-VL | Computed tomography scan showed a large right-sided mass in the right side |
| SmolVLM | A cross-section of a CT scan of the chest and abdomen, with a yellow square pointing to the 19.27 mm measurement |
| Qwen 2.5 | Axial T2 magnetic resonance imaging of patient |
| VisionGPT2 | PET-CT scan showing a large right hepatic cyst in the right hepatic lobe |
| CNN-Transformer | CT scan showing enlargement in liver lesion |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).