Submitted: 15 October 2025
Posted: 16 October 2025
Abstract
Keywords:
1. Introduction
2. Methods
2.1. Study Design and Reporting Guidelines
2.2. Search Strategy
2.3. Inclusion and Exclusion Criteria
2.4. Study Selection, Data Extraction and Critical Appraisal
2.5. Synthesis
3. Results
| Study | ChatGPT Version | Evaluation Method | Metrics Used | Limitations Noted |
| --- | --- | --- | --- | --- |
| Dimitriadis et al. (2024) | GPT-4 | Expert assessment by cardiologists on accuracy and comprehensiveness | Accuracy (%), comprehensiveness | Less comprehensive; limited to observational design |
| King et al. (2024) (appropriateness) | GPT-3.5 and GPT-4 | Graded by board-certified cardiologists using 4-point scale (comprehensive to incorrect) | Appropriateness (%), reproducibility (%), comprehensive knowledge (%) | Occasional hallucinations* (1.9%); small sample of questions; no patient outcomes |
| Anaya et al. (2024) | GPT-3 | Blind assessment with PEMAT; readability calculators (Flesch-Kincaid, etc.) | PEMAT readability/actionability (%), grade level, word difficulty (%) | Longer responses at higher educational levels; lower actionability; observational only |
| Workman et al. (2024) | GPT-4 | Zero-shot learning with prompt engineering; compared to ML/rule-based baselines | Precision (%), recall (%), F1 score (%) | Reliance on synthetic snippets; no real EHR validation; prompt sensitivity |
| Kozaily et al. (2024) | GPT-3.5 | Expert evaluation by HF specialists; consistency across runs | Appropriateness (%), consistency (%) | Inadequacy in advanced topics; heterogeneity in comparators (Bard); small question set |
| King et al. (2024) (readability) | GPT-4 | Expert grading for accuracy/comprehensiveness; readability scores | Flesch-Kincaid grade level, accuracy (%), comprehensiveness increase (%) | Institutional materials bias; no long-term impact assessment |
| Bhupathi et al. (2024) | GPT-3.5 | Likert scales for accuracy/completeness; PEMAT for understandability | Accuracy score (mean/6), completeness (mean/3), PEMAT understandability (%) | Slightly lower accuracy for rarer conditions; no RCTs; potential data abundance bias |
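
Several of the metrics listed in the table above have standard closed-form definitions. As a minimal sketch (not code from any of the included studies), the Python below computes precision, recall and F1 from confusion counts, and the Flesch-Kincaid grade level from text counts. The function names and the example counts are illustrative assumptions; syllable counting is left to the caller because it depends on the readability tool used.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) as fractions from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade level formula from raw text counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Illustrative counts only; these are not data from the reviewed studies.
p, r = precision_recall(tp=90, fp=10, fn=10)
f1 = f1_score(p, r)
grade = flesch_kincaid_grade(words=100, sentences=10, syllables=150)
```

With these made-up inputs, precision, recall and F1 all come out at 0.9, and the text scores at about grade 6, the lower end of the 6th-7th grade range often targeted for patient education materials.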
4. Discussion
4.1. Key Findings and Thematic Synthesis
4.2. Implications for Heart Failure Care
4.3. Drawbacks and Limitations of ChatGPT in HF Care
4.4. Strengths and Limitations of the Evidence
4.5. Ethical Considerations
4.6. Future Directions
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Heidenreich PA, Albert NM, Allen LA, et al. Forecasting the impact of heart failure in the United States. Circ Heart Fail. 2013;6(3):606-619.
- Savarese G, Becher PM, Lund LH, et al. Global burden of heart failure: a comprehensive and updated review of epidemiology. Cardiovasc Res. 2023;119(6):1452-1469.
- Heidenreich PA, Albert NM, Allen LA, et al. Forecasting the impact of heart failure in the United States: a policy statement from the American Heart Association. Circ Heart Fail. 2013;6(3):606-619.
- Dunlay SM, Shah ND, Shi Q, et al. Lifetime costs of medical care after heart failure diagnosis. Circ Cardiovasc Qual Outcomes. 2011;4(1):68-75.
- Khan MS, Sreenivasan J, Lateef N, et al. Trends in 30- and 90-day readmission rates for heart failure. Circ Heart Fail. 2021;14(5):e008335.
- Heidenreich PA, Bozkurt B, Aguilar D, et al. 2022 AHA/ACC/HFSA guideline for the management of heart failure. Circulation. 2022;145(18):e895-e1032.
- McCullough PA, Mehta HS, Barker CM, et al. Mortality and guideline-directed medical therapy in real-world heart failure patients with reduced ejection fraction. Clin Cardiol. 2021;44(9):1192-1198.
- Ross JS, Chen J, Lin Z, et al. Recent national trends in readmission rates after heart failure hospitalization. Circ Heart Fail. 2010;3(1):97-103.
- Gautam N, Ghanta SN, Clausen A, et al. Contemporary applications of machine learning for device therapy in heart failure. JACC Heart Fail. 2022;10(9):603-622.
- Khan MS, Arshad MS, Greene SJ, et al. Artificial intelligence and heart failure: a state-of-the-art review. Eur J Heart Fail. 2023;25(9):1507-1525.
- Fabbri M, Yost KJ, Finney Rutten LJ, et al. Health literacy and outcomes in patients with heart failure: a prospective community study. Mayo Clin Proc. 2018;93(1):9-15.
- McDonagh TA, Metra M, Adamo M, et al. 2021 ESC Guidelines for the diagnosis and treatment of acute and chronic heart failure. Eur Heart J. 2021;42(36):3599-3726.
- Funk PF, Hoch CC, Knoedler S, et al. ChatGPT’s response consistency: a study on repeated queries of medical examination questions. Eur J Investig Health Psychol Educ. 2024;14(3):657-668.
- Moher D, Liberati A, Tetzlaff J, et al. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009;6(7):e1000097.
- Sterne JA, Hernán MA, Reeves BC, et al. ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ. 2016;355:i4919.
- Ghanta SN, Dahal R, Vellanki S, et al. Applications of ChatGPT in heart failure prevention, diagnosis, management, and research: a narrative review. Diagnostics (Basel). 2024;14(21):2372.
- Sharma A, Gupta R, Patel N, et al. Exploring the role of ChatGPT in cardiology: a systematic review of the current literature. Cureus. 2024;16(5):e59512.
- Sarraju A, Bruemmer D, van Iterson E, et al. Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model. JAMA. 2023;329(10):842-844.
- Harskamp RE, de Clercq L. Performance of ChatGPT as an AI-assisted decision support tool in medicine: a proof-of-concept study for interpreting symptoms and management of common cardiac conditions (AMSTELHEART-2). Acta Cardiol. 2024;79(3):358-366.
- Hirosawa T, Kawamura R, Harada Y, et al. ChatGPT-generated differential diagnosis lists for complex case-derived clinical vignettes: diagnostic accuracy evaluation. JMIR Med Inform. 2023;11:e48808.
- Roosan D, Padua P, Khan R, et al. Effectiveness of ChatGPT in clinical pharmacy and the role of artificial intelligence in medication therapy management. J Am Pharm Assoc (2003). 2024;64(2):422-428.
- Alanzi TM. Impact of ChatGPT on teleconsultants in healthcare: perceptions of healthcare experts in Saudi Arabia. J Multidiscip Healthc. 2023;16:2309-2321.
- Rawashdeh B, Kim J, AlRyalat SA, et al. ChatGPT and artificial intelligence in transplantation research: is it always correct? Cureus. 2023;15(7):e42150.
- Elyoseph Z, Hadar-Shoval D, Asraf K, Lvovsky M. ChatGPT outperforms humans in emotional awareness evaluations. Front Psychol. 2023;14:1199058.
- Guevara M, Chen S, Thomas S, et al. Large language models to identify social determinants of health in electronic health records. NPJ Digit Med. 2024;7(1):6.
- Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios. J Med Syst. 2023;47(1):33.
- Nakaya Y, Higaki A, Yamaguchi O. ChatGPT’s ability to classify virtual reality studies in cardiology. Eur Heart J Digit Health. 2023;4(2):141-142.
- Fijačko N, Prosen G, Abella BS, et al. Can novel multimodal chatbots such as Bing Chat Enterprise, ChatGPT-4 Pro, and Google Bard correctly interpret electrocardiogram images? Resuscitation. 2023;193:110009.
- Dimitriadis F, Alkagiet S, Tsigkriki L, et al. ChatGPT and patients with heart failure. Angiology. 2024.
- King RC, Samkari JS, Haquang J, et al. Appropriateness of ChatGPT in answering heart failure related questions. Heart Lung Circ. 2024;33(9):1314-1318.
- Anaya F, Prasad R, Bashour M, et al. Evaluating ChatGPT platform in delivering heart failure educational material: a comparison with the leading national cardiology institutes. Curr Probl Cardiol. 2024;49(10):102797.
- Workman TE, Leese J, Ouyang D, et al. ChatGPT-4 zero-shot learning extraction of heart failure signs and symptoms from electronic health records. Prog Cardiovasc Dis. 2024;87:47-49.
- Kozaily E, Geagea M, Akdogan ER, et al. Accuracy and consistency of online large language model-based artificial intelligence chat platforms in answering patients’ questions about heart failure. Int J Cardiol. 2024;408:132115.
- King RC, Srinivasan N, Peng Y, et al. Improving the readability of institutional heart failure-related patient education materials using GPT-4: observational study. JMIR Cardio. 2024;8:e48817.
- Bhupathi M, Mediboina A, Mehweish I, et al. Assessing information provided by ChatGPT: heart failure versus patent ductus arteriosus. Cureus. 2024;16(6):e60365.
- Temperley HC, O’Sullivan NJ, Mac Curtain BM, et al. Current applications and future potential of ChatGPT in radiology: a systematic review. J Med Imaging Radiat Oncol. 2024;68(3):250-260.
- Warkentin J, Ezekowitz JA, White J, et al. Care gaps in heart failure management: impact of subspecialty care on outcomes. J Card Fail. 2024.
- Saenger JA, Hunger J, Boss A, et al. Delayed diagnosis of a transient ischemic attack caused by ChatGPT. Wien Klin Wochenschr. 2024;136(7-8):236-238.
- McMurray JJV, Packer M, Desai AS, et al. Angiotensin–Neprilysin inhibition versus enalapril in heart failure. N Engl J Med. 2014;371(11):993-1004.
| Study | Sample Size | Intervention | Comparator | Key Outcomes |
| --- | --- | --- | --- | --- |
| Dimitriadis et al. (2024) | 47 questions | ChatGPT-4 | None (observational) | 100% accuracy in key areas such as lifestyle advice, medication mechanisms, and symptom management |
| King et al. (2024) (appropriateness) | 107 questions | GPT-3.5 and GPT-4 | GPT-3.5 vs. GPT-4 | 98.1% appropriateness for GPT-3.5, with occasional hallucinations* (1.9%) |
| Anaya et al. (2024) | 12 questions | ChatGPT-3 | Leading US institutes (ACC, AHA, HFSA) | 78% actionability score and a competitive 75% PEMAT readability score, but the lowest actionability among the compared materials |
| Workman et al. (2024) | 1999 snippets + 102 synthetic | ChatGPT-4 ZSL | ML and rule-based | 95% F1 score for ZSL, outperforming baselines |
| Kozaily et al. (2024) | 30 questions | ChatGPT-3.5 and Bard | ChatGPT vs. Bard | 90% appropriateness for ChatGPT |
| King et al. (2024) (readability) | 143 PEMs | GPT-4 | Original materials | Improved readability to 6th-7th grade level |
| Bhupathi et al. (2024) | 21 questions (10 HF, 11 PDA) | ChatGPT-3.5 | AHA/ACC guidelines | Mean accuracy 5.4/6, 83.75% PEMAT-P understandability |
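
The percentages in the table above can be related back to question counts with simple arithmetic. The sketch below assumes each percentage was computed over the full question set given in the Sample Size column, which the table does not state explicitly; the helper names are hypothetical and are not taken from any of the included studies.

```python
def implied_count(percent: float, total: int) -> int:
    """Nearest whole number of items that a reported percentage implies."""
    return round(percent / 100 * total)


def as_percent(count: int, total: int, digits: int = 1) -> float:
    """Convert a count back to a percentage, rounded as in the table."""
    return round(100 * count / total, digits)


# King et al. (2024) (appropriateness): 107 questions, 98.1% and 1.9% reported.
# Assumption: both percentages use all 107 questions as the denominator.
appropriate = implied_count(98.1, 107)
hallucinated = implied_count(1.9, 107)
round_trip = (as_percent(appropriate, 107), as_percent(hallucinated, 107))
```

Under that assumption, 98.1% and 1.9% of 107 questions correspond to roughly 105 and 2 responses, and converting those counts back reproduces the reported percentages.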
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).