Preprint
Article

This version is not peer-reviewed.

Limitations in Chest X-Ray Interpretation by Vision-Capable Large Language Models, Gemini 1.0, Gemini 1.5 Pro, GPT-4 Turbo, and GPT-4o

Submitted:

23 December 2025

Posted:

24 December 2025

Abstract
Background/Objectives: A complete interpretation of a chest X-ray (CXR) requires accurate identification of lesion presence, diagnosis, location, size, and number. However, how well large language models with vision capabilities (vLLMs) perform these tasks remains uncertain. This study aimed to evaluate the image interpretation performance of vLLMs in the absence of clinical information. Methods: A total of 247 CXRs covering 13 diagnoses, including pulmonary edema, cardiomegaly, and lobar pneumonia, were evaluated using Gemini 1.0, Gemini 1.5 Pro, GPT-4 Turbo, and GPT-4o. The text outputs generated by the vLLMs were assessed for diagnostic accuracy and identification of key imaging features. Each interpretation was classified as fully correct, partially correct, or incorrect according to the criteria for complete interpretation. Results: When both fully and partially correct responses were counted as successful detections, vLLMs were most successful at identifying large, bilateral, or multiple lesions and large devices, such as acute pulmonary edema (53.8%), lobar pneumonia (55%), multiple malignancies (55%), massive pleural effusions (47.5%), and pacemakers (98.3%), with significant differences on the chi-square test. Feature descriptions varied among models, especially for posteroanterior and anteroposterior views and side markers, though central lines were partially recognized. Gemini 1.5 Pro (49.0%) performed best, followed by Gemini 1.0 (43.8%), GPT-4o (32.0%), and GPT-4 Turbo (20.0%). Conclusions: Although vLLMs were able to identify certain diagnoses and key imaging features, their limitations in detecting small lesions, recognizing laterality, reasoning through differential diagnoses, and using domain-specific expressions indicate that CXR interpretation without textual cues still requires further improvement.
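The scoring described in the abstract (counting fully and partially correct responses as successful detections, then comparing groups with a chi-square test) can be sketched roughly as follows. This is a minimal illustration, not the authors' analysis code: the grade labels, function names, and all counts are hypothetical placeholders and are not data from the study.

```python
from collections import Counter


def detection_rate(grades):
    """Share of interpretations graded fully or partially correct."""
    counts = Counter(grades)
    detected = counts["fully_correct"] + counts["partially_correct"]
    return detected / len(grades)


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Made-up grades for one hypothetical model (NOT study data).
example_grades = ["fully_correct", "partially_correct", "incorrect", "incorrect"]
rate = detection_rate(example_grades)

# Made-up 2 x 2 table: rows are two hypothetical models,
# columns are (detected, not detected) counts (NOT study data).
example_table = [[30, 10], [10, 30]]
stat, dof = chi_square_statistic(example_table)

print(f"detection rate = {rate:.2f}, chi-square = {stat:.1f}, dof = {dof}")
```

To turn the statistic into a p-value, one option is the chi-square survival function, for example `scipy.stats.chi2.sf(stat, dof)`. Whether the study used this exact procedure is not stated in the abstract.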
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2025 MDPI (Basel, Switzerland) unless otherwise stated