Submitted: 28 April 2025
Posted: 29 April 2025
Abstract
Keywords:
Introduction
Methods
Benchmark Dataset and Test Items / User Prompts
Domain Background and Retrieval Augmented Generation (RAG)
System Prompts
Models
Performance Evaluation
Statistical Analysis
Results
LLM Performance varies significantly with validation requirements
System Prompt Specificity and Test Case Structure affect Model Performance
LLM performance correlates with the age of the user asking for advice
Discussion
Supplementary Materials
Author Contributions
Funding
Data and code availability
Statement on the use of AI
Conflicts of Interest
References



| Evaluated Models | Comprehensiveness (w/o RAG) | Comprehensiveness (RAG) | Comprehensiveness (Δ RAG) | Correctness (w/o RAG) | Correctness (RAG) | Correctness (Δ RAG) | Usefulness (w/o RAG) | Usefulness (RAG) | Usefulness (Δ RAG) | Explainability (w/o RAG) | Explainability (RAG) | Explainability (Δ RAG) | Safety (w/o RAG) | Safety (RAG) | Safety (Δ RAG) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.2 3B | 0.28 ± 0.08 | 0.25 ± 0.07 | -0.03 | 0.52 ± 0.08 | 0.63 ± 0.02 | +0.11 | 0.44 ± 0.08 | 0.46 ± 0.06 | +0.02 | 0.54 ± 0.11 | 0.54 ± 0.08 | ±0 | 0.89 ± 0.05 | 0.86 ± 0.03 | -0.03 |
| Qwen 2.5 14B | 0.42 ± 0.19 | 0.56 ± 0.04 | +0.14 | 0.68 ± 0.01 | 0.70 ± 0.02 | +0.02 | 0.59 ± 0.15 | 0.71 ± 0.02 | +0.12 | 0.65 ± 0.18 | 0.78 ± 0.03 | +0.13 | 0.85 ± 0.17 | 0.93 ± 0.02 | +0.08 |
| DSR Llama 70B | 0.49 ± 0.05 | 0.53 ± 0.02 | +0.04 | 0.69 ± 0.02 | 0.68 ± 0.01 | -0.01 | 0.70 ± 0.05 | 0.69 ± 0.03 | -0.01 | 0.80 ± 0.04 | 0.79 ± 0.02 | -0.01 | 0.96 ± 0.01 | 0.92 ± 0.02 | -0.04 |
| GPT-4o | 0.85 ± 0.06 | 0.76 ± 0.06 | -0.09 | 0.73 ± 0.02 | 0.73 ± 0.02 | ±0 | 0.89 ± 0.03 | 0.82 ± 0.03 | -0.07 | 0.94 ± 0.04 | 0.87 ± 0.04 | -0.07 | 0.99 ± 0.01 | 0.93 ± 0.02 | -0.06 |
| GPT-4o mini | 0.45 ± 0.17 | 0.45 ± 0.16 | ±0 | 0.66 ± 0.02 | 0.67 ± 0.02 | +0.01 | 0.69 ± 0.08 | 0.65 ± 0.08 | -0.04 | 0.77 ± 0.10 | 0.70 ± 0.09 | -0.07 | 0.97 ± 0.01 | 0.93 ± 0.03 | -0.04 |
| o3 mini | 0.67 ± 0.06 | 0.69 ± 0.05 | +0.02 | 0.65 ± 0.03 | 0.68 ± 0.02 | +0.03 | 0.79 ± 0.02 | 0.77 ± 0.03 | -0.02 | 0.84 ± 0.03 | 0.82 ± 0.04 | -0.02 | 0.96 ± 0.02 | 0.95 ± 0.02 | -0.01 |

| Evaluated Models | Young (w/o RAG) | Young (RAG) | Young (Δ RAG) | Mid-Age/Pregeriatric (w/o RAG) | Mid-Age/Pregeriatric (RAG) | Mid-Age/Pregeriatric (Δ RAG) | Geriatric (w/o RAG) | Geriatric (RAG) | Geriatric (Δ RAG) |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.2 3B | 0.33 ± 0.07 | 0.38 ± 0.04 | +0.05 | 0.36 ± 0.08 | 0.42 ± 0.05 | +0.06 | 0.47 ± 0.06 | 0.46 ± 0.04 | -0.01 |
| Qwen 2.5 14B | 0.47 ± 0.06 | 0.57 ± 0.01 | +0.10 | 0.47 ± 0.03 | 0.48 ± 0.01 | +0.01 | 0.63 ± 0.04 | 0.66 ± 0.02 | +0.03 |
| DSR Llama 70B | 0.47 ± 0.03 | 0.52 ± 0.01 | +0.05 | 0.51 ± 0.01 | 0.48 ± 0.01 | -0.03 | 0.70 ± 0.01 | 0.69 ± 0.02 | -0.01 |
| GPT-4o | 0.67 ± 0.02 | 0.65 ± 0.03 | -0.02 | 0.63 ± 0.02 | 0.55 ± 0.01 | -0.08 | 0.78 ± 0.02 | 0.76 ± 0.01 | -0.02 |
| GPT-4o mini | 0.48 ± 0.07 | 0.46 ± 0.07 | -0.02 | 0.49 ± 0.03 | 0.45 ± 0.04 | -0.04 | 0.68 ± 0.04 | 0.68 ± 0.04 | ±0 |
| o3 mini | 0.57 ± 0.04 | 0.58 ± 0.04 | +0.01 | 0.54 ± 0.02 | 0.51 ± 0.01 | -0.03 | 0.72 ± 0.03 | 0.75 ± 0.02 | +0.03 |
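
In each group of three columns in the tables above, the third column appears to be the signed difference between the mean with RAG and the mean without RAG. The following minimal Python sketch recomputes that difference for the Geriatric group. It assumes the difference is taken on the rounded means as printed; the authors may have computed it from unrounded values. The names `geriatric_means`, `rag_delta`, and `format_delta` are illustrative and do not come from the paper.

```python
from decimal import Decimal

# Geriatric-group means copied from the table above:
# model -> (mean without RAG, mean with RAG). The ± spreads are omitted.
geriatric_means = {
    "Llama 3.2 3B": ("0.47", "0.46"),
    "Qwen 2.5 14B": ("0.63", "0.66"),
    "DSR Llama 70B": ("0.70", "0.69"),
    "GPT-4o": ("0.78", "0.76"),
    "GPT-4o mini": ("0.68", "0.68"),
    "o3 mini": ("0.72", "0.75"),
}


def rag_delta(without_rag: str, with_rag: str) -> Decimal:
    """Signed change in the mean score when RAG is enabled.

    Decimal is used so that two-decimal values subtract exactly.
    """
    return Decimal(with_rag) - Decimal(without_rag)


def format_delta(delta: Decimal) -> str:
    """Format a difference the way the table does (+0.03, -0.01, ±0)."""
    if delta == 0:
        return "±0"
    return f"{delta:+.2f}"


if __name__ == "__main__":
    for model, (without_rag, with_rag) in geriatric_means.items():
        print(f"{model}: {format_delta(rag_delta(without_rag, with_rag))}")
```

The recomputed differences can be compared against the printed column as a quick consistency check on a transcribed table.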
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).