Submitted: 09 October 2024
Posted: 10 October 2024
Abstract
Keywords:
1. Introduction
2. Methods
2.1. Models
2.2. Diagnosis and Automated Evaluation
3. Results
4. Discussion
5. Conclusions
Funding
Author contributions
Data Availability Statement
Conflicts of Interest
References


| Type | Diagnosis | Occurrences (GPT-4o) | Occurrences (Claude) |
|---|---|---|---|
| Hit | Kidney failure, unspecified | 217 | 216 |
| Hit | Diabetes mellitus (no mention of complication, type II or unspecified) | 132 | 137 |
| Hit | Acidosis | 129 | 126 |
| Hit | Congestive heart failure, unspecified | 125 | 128 |
| Miss | Dehydration | 8 | 5 |
| Miss | Diabetes | 8 | 3 |
| Miss | Hypertension | 4 | 2 |
| Miss | Hypotension | 2 | 5 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).