Submitted: 06 September 2024
Posted: 09 September 2024
Abstract
Keywords:
1. Introduction
2. Methods
2.1. Models
2.2. Diagnosis and Automated Evaluation
3. Results
4. Discussion
5. Conclusions
Conflicts of Interest
References


| Type | Diagnosis | Occurrences (GPT-4o) | Occurrences (Claude) |
|---|---|---|---|
| Hit | Kidney failure, unspecified | 217 | 216 |
| Hit | Diabetes mellitus (no mention of complication, type II or unspecified) | 132 | 137 |
| Hit | Acidosis | 129 | 126 |
| Hit | Congestive heart failure, unspecified | 125 | 128 |
| Miss | Dehydration | 8 | 5 |
| Miss | Diabetes | 8 | 3 |
| Miss | Hypertension | 4 | 2 |
| Miss | Hypotension | 2 | 5 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).