Submitted: 10 October 2024
Posted: 12 October 2024
Abstract
Keywords:
1. Introduction
2. Methods
2.1. Models
2.2. Diagnosis and Automated Evaluation
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References


| Type | Diagnosis | Occurrences (GPT-4o) | Occurrences (Claude) |
|---|---|---|---|
| Hit | Kidney failure, unspecified | 217 | 216 |
| Hit | Diabetes mellitus (no mention of complication, type II or unspecified) | 132 | 137 |
| Hit | Acidosis | 129 | 126 |
| Hit | Congestive heart failure, unspecified | 125 | 128 |
| Miss | Dehydration | 8 | 5 |
| Miss | Diabetes | 8 | 3 |
| Miss | Hypertension | 4 | 2 |
| Miss | Hypotension | 2 | 5 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).