Preprint
Article

This version is not peer-reviewed.

Diagnostic Performance and Confidence Calibration of Large Language Models for Bone Tumor Radiographs

Submitted: 17 April 2026

Posted: 17 April 2026


Abstract
Background/Objectives: Large language models (LLMs) are increasingly applied to medical image interpretation; however, their diagnostic accuracy and reliability in musculoskeletal radiology remain uncertain. This study evaluates the diagnostic performance and confidence calibration of LLMs in detecting and classifying bone tumors on radiographs. Methods: This retrospective observational study analyzed a dataset of 257 radiographs with confirmed diagnoses obtained from Radiopaedia, including normal studies and a spectrum of benign and malignant bone tumors. Cases were selected to ensure representation across multiple tumor types. Three LLMs (ChatGPT 5.3, X-ray Interpreter GPT-4.1, and X-ray Interpreter Gemini) evaluated each image using a standardized prompt assessing abnormality detection, tumor detection, classification, and confidence. Outcomes included diagnostic accuracy, false positive abnormality rates, false negative rates, tumor hallucination rates, and confidence calibration. Results: Abnormality detection was high across models, with Gemini demonstrating the highest sensitivity (up to 100%). Tumor detection was strongest in lesions with characteristic features, including osteosarcoma and osteochondroma. False negative rates varied substantially, with GPT-4.1 demonstrating the highest rate (29.9%), followed by ChatGPT (24.8%) and Gemini (6.6%). Primary diagnostic accuracy was highest for osteosarcoma in GPT-4.1 (80%), while ChatGPT 5.3 performed best in benign lesions, including osteochondroma (84.6%) and non-ossifying fibroma (76.9%). Tumor subtype classification remained limited across all models and was poorest for Ewing sarcoma (0% in ChatGPT and GPT-4.1; 10.3% in Gemini). False positive abnormality rates were highest in GPT-4.1 (40.7%), followed by Gemini (25.9%) and ChatGPT (13.5%). Tumor hallucination occurred only in Gemini (12.3%). All models demonstrated confidence miscalibration, with higher confidence observed in incorrect predictions and in tumor-negative cases. Conclusions: LLMs demonstrate strong performance in detecting radiographic abnormalities but remain limited in tumor subtype classification, particularly for diagnostically challenging lesions such as Ewing sarcoma. Elevated false positive and false negative rates, along with systematic overconfidence—especially in GPT-4.1—highlight important limitations for clinical use. These findings support the role of LLMs as adjunctive tools rather than independent diagnostic systems.
Keywords: 
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.