Submitted: 27 June 2026
Posted: 29 June 2026
Abstract
Keywords:
1. Introduction
2. What LLMs Are (and What They Are Not) in Clinical Care
2.1. What LLMs Can Do Well
Summarization and Rewriting at Different Literacy Levels
Extracting the “Clinical Story” from Notes
Generating Caregiver Instructions and Checklists (with Guardrails)
Triage and Documentation Assistance (Co-Pilot Role)
2.2. Where LLMs Fail in Dangerous Ways
Hallucinations and Fabricated Clinical Facts
Confident Incorrectness and “Authority Tone”
Inconsistent Performance Across Populations and Settings
Sensitivity to Prompt Injection and Unsafe Instructions in Downstream Workflows
3. Clinical Use Cases Across the Alzheimer’s Care Continuum
3.1. Caregiver-Facing Support Systems
3.2. Patient-Facing Conversational Agents (High Caution Zone)
3.3. Clinician Workflow Augmentation (Highest Near-Term Feasibility)
3.4. Risk Prediction and Clinical Text Intelligence
3.5. Research and Education Use Cases
4. Safety Challenges Unique to Alzheimer’s Care
4.1. Hallucinations, Omissions, and Factuality Failures in Care-Critical Content
4.2. Overconfidence, “Authority Tone,” and Over-Reliance in Cognitively Vulnerable Users
4.3. Prompt Injection and Adversarial Manipulation in Caregiver and Clinical Tools
4.4. Bias, Inequity, and Differential Performance Across Populations
4.5. Privacy, Confidentiality, and Sensitive Caregiver Narratives
4.6. Consent, Capacity, Accountability, and Emotional Safety
5. Implementation Pathways for LLMs in Alzheimer’s Care (from Pilots to Safe Deployment)
5.1. Defining the Clinical Role: “Assistive Infrastructure,” Not an Autonomous Clinician
5.2. Architecture Choices that Reduce Hallucinations: Grounding Through Retrieval-Augmented Generation
5.3. A Tiered Deployment Pathway that Fits Alzheimer’s Care Realities
5.4. Governance and Continuous Monitoring: Implementation as a Lifecycle Program
5.5. Security and Prompt Injection: A Non-Negotiable Requirement for Dementia-Facing Tools
5.6. Equity, Privacy, and Consent in an AD-Specific Workflow
5.7. A Realistic Endpoint: A Safe Alzheimer’s LLM Is a System, Not a Model
6. Future Directions
6.1. From Promising Prototypes to Pragmatic Clinical Evaluation
6.2. Choosing Endpoints that Reflect Alzheimer’s Realities
6.3. External Validation, Calibration, and Generalizability as Non-Negotiables
6.4. Adversarial Testing and Security-by-Design for Dementia-Facing Systems
6.5. Next-Generation Clinical Intelligence: Longitudinal Notes and Multi-Agent Forecasting
6.6. Multimodal Models and Richer AD Phenotyping
6.7. Aligning Innovation with Evolving Ethics and Governance Expectations
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Non-Medical Glossary (LLM/AI and Implementation Terms)
References
1. Chung, P. Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records. NEJM AI 2026, 3(1), AIdbp2500418.
2. Aguirre, A.; et al. Assessing the Quality of ChatGPT Responses to Dementia Caregivers’ Questions: Qualitative Analysis. JMIR Aging 2024, 7, e53019.
3. Saeidnia, H.R.; et al. Evaluation of ChatGPT’s responses to information needs and information seeking of dementia patients. Sci. Rep. 2024, 14(1), 10273.
4. Hasan, W.U.; et al. Empowering Alzheimer’s caregivers with conversational AI: a novel approach for enhanced communication and personalized support. npj Biomed. Innov. 2024, 1(1), 3.
5. Parmanto, B.; et al. A Reliable and Accessible Caregiving Language Model (CaLM) to Support Tools for Caregivers: Development and Evaluation Study. JMIR Form. Res. 2024, 8, e54633.
6. Li, R.; et al. CARE-AD: a multi-agent large language model framework for Alzheimer’s disease prediction using longitudinal clinical notes. npj Digit. Med. 2025, 8(1), 541.
7. Lee, R.W.; et al. Vulnerability of Large Language Models to Prompt Injection When Providing Medical Advice. JAMA Netw. Open 2025, 8(12), e2549963.
8. Zaretsky, J.; et al. Generative Artificial Intelligence to Transform Inpatient Discharge Summaries to Patient-Friendly Language and Format. JAMA Netw. Open 2024, 7(3), e240357.
9. Oliveira, J.D.; et al. Development and evaluation of a clinical note summarization system using large language models. Commun. Med. 2025, 5(1), 376.
10. Croxford, E.; et al. Evaluating clinical AI summaries with large language models as judges. npj Digit. Med. 2025, 8(1), 640.
11. Song, J.W.; et al. Large Language Model Assistant for Emergency Department Discharge Documentation. JAMA Netw. Open 2025, 8(10), e2538427.
12. McCoy, L.G. Assessment of Large Language Models in Clinical Reasoning: A Novel Benchmarking Study. NEJM AI 2025, 2(10), AIdbp2500120.
13. Pfohl, S.R.; et al. A toolbox for surfacing health equity harms and biases in large language models. Nat. Med. 2024, 30(12), 3590–3600.
14. Yang, Y.; et al. Unmasking and quantifying racial bias of large language models in medical report generation. Commun. Med. 2024, 4(1), 176.
15. Choukou, M.-A.; et al. Digital Health Technology to Support Health Care Professionals and Family Caregivers Caring for Patients With Cognitive Impairment: Scoping Review. JMIR Ment. Health 2023, 10, e40330.
16. Ruggiano, N.; et al. Chatbots to Support People With Dementia and Their Caregivers: Systematic Review of Functions and Quality. J. Med. Internet Res. 2021, 23(6), e25006.
17. Morales-de-Jesús, V.; et al. Conversational System as Assistant Tool in Reminiscence Therapy for People with Early-Stage of Alzheimer’s. Healthcare 2021, 9(8).
18. Chelli, M.; et al. Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis. J. Med. Internet Res. 2024, 26, e53164.
19. Klimova, B.; Kacetl, J. Ethical Considerations of AI Use by the Elderly. Int. J. Human–Computer Interact. 2025, 1–12.
20. Asgari, E.; et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digit. Med. 2025, 8(1), 274.
21. Goh, E.; et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw. Open 2024, 7(10), e2440969.
22. Tierney, A.; et al. Health Equity in the Era of Large Language Models. Am. J. Manag. Care 2025, 31, 112–117.
23. Chen, H.; et al. Large language models and global health equity: a roadmap for equitable adoption in LMICs. Lancet Reg. Health – West. Pac. 2025, 63.
24. Omar, M.; et al. Evaluating and addressing demographic disparities in medical large language models: a systematic review. Int. J. Equity Health 2025, 24(1), 57.
25. Ji, Y.; et al. Mitigating the risk of health inequity exacerbated by large language models. npj Digit. Med. 2025, 8(1), 246.
26. Solaiman, B. Legal and Ethical Considerations of Artificial Intelligence for Residents in Post-Acute and Long-Term Care. J. Am. Med. Dir. Assoc. 2024, 25(9), 105105.
27. Deusdad, B. Ethical implications in using robots among older adults living with dementia. Front. Psychiatry 2024, 15, 1436273.
28. Liu, S.; McCoy, A.B.; Wright, A. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. J. Am. Med. Inf. Assoc. 2025, 32(4), 605–615.
29. Yang, R.; et al. Retrieval-augmented generation for generative artificial intelligence in health care. npj Health Syst. 2025, 2(1), 2.
30. Raghu Subramanian, C.; Yang, D.A.; Khanna, R. Enhancing Health Care Communication With Large Language Models—The Role, Challenges, and Future Directions. JAMA Netw. Open 2024, 7(3), e240347.
31. Cresswell, K.; et al. Evaluating Artificial Intelligence in Clinical Settings-Let Us Not Reinvent the Wheel. J. Med. Internet Res. 2024, 26, e46407.
32. Cohen, J.P.; et al. Problems in the deployment of machine-learned models in health care. CMAJ 2021, 193(35), E1391–E1394.
33. Griot, M.; Vanderdonckt, J.; Yuksel, D. Implementation of large language models in electronic health records. PLoS Digit. Health 2025, 4(12), e0001141.
34. Vasey, B.; et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat. Med. 2022, 28(5), 924–933.
35. Liu, X.; et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med. 2020, 26(9), 1364–1374.
36. Rivera, S.C.; et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI Extension. BMJ 2020, 370, m3210.
37. Collins, G.S.; et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024, 385, e078378.
38. Huang, R.; et al. Evaluation and Bias Analysis of Large Language Models in Generating Synthetic Electronic Health Records: Comparative Study. J. Med. Internet Res. 2025, 27, e65317.
39. Ong, J.C.L.; et al. Ethical and regulatory challenges of large language models in medicine. Lancet Digit. Health 2024, 6(6), e428–e432.
40. Fareed, M.; et al. A systematic review of ethical considerations of large language models in healthcare and medicine. Front. Digit. Health 2025, 7, 1653631.
41. Dino, F.R.; et al. Ethics in digital phenotyping: considerations regarding Alzheimer’s disease, speech and artificial intelligence. J. Med. Ethics 2025.
42. Diaz, A.; et al. Informed consent in dementia research: how Public Involvement can contribute to addressing “old” and “new” challenges. Front. Dement. 2025, 4.
43. Soria, S.T. Patient autonomy in the context of digital health. Bioethics 2025, 39(5), 404–413.
44. Zhi-Xiang, L.; Lim, W.S.; Chan, E.-Y. Development and Validation of a Multidimensional Short Version Zarit Burden Interview (ZBI-9) for Caregivers of Persons With Cognitive Impairment. Alzheimer Dis. Assoc. Disord. 2023, 37(1).
45. Seng, B.K.; et al. Validity and reliability of the Zarit Burden Interview in assessing caregiving burden. Ann. Acad. Med. Singap. 2010, 39(10), 758–763.
46. Cummings, J.L. The Neuropsychiatric Inventory: assessing psychopathology in dementia patients. Neurology 1997, 48(5 Suppl 6), S10–S16.
47. Solaiman, B.; et al. A “True Lifecycle Approach” towards governing healthcare AI with the GCC as a global governance model. npj Digit. Med. 2025, 8(1), 337.
48. Jenkins, D.A.; et al. Continual updating and monitoring of clinical prediction models: time for dynamic prediction systems? Diagn. Progn. Res. 2021, 5(1), 1.
49. Davis, S.E.; et al. Detection of calibration drift in clinical prediction models to inform model updating. J. Biomed. Inf. 2020, 112, 103611.
50. Sahiner, B.; et al. Data drift in medical machine learning: implications and potential remedies. Br. J. Radiol. 2023, 96(1150), 20220878.
51. Subasri, V.; et al. Detecting and Remediating Harmful Data Shifts for the Responsible Deployment of Clinical AI Models. JAMA Netw. Open 2025, 8(6), e2513685.
52. Lea, A.S.; Jones, D.S. Mind the Gap - Machine Learning, Dataset Shift, and History in the Age of Clinical Algorithms. N. Engl. J. Med. 2024, 390(4), 293–295.
53. Mortensen, G.A.; Zhu, R. Early Alzheimer’s Detection Through Voice Analysis: Harnessing Locally Deployable LLMs via ADetectoLocum, a privacy-preserving diagnostic system. AMIA Jt. Summits Transl. Sci. Proc. 2025, 365–374.
54. Lee, B.; et al. Multimodal Alzheimer’s disease recognition from image, text and audio. Sci. Rep. 2025, 15(1), 29038.
55. Zhang, M.; et al. Multimodal LLM for enhanced Alzheimer’s Disease diagnosis: Interpretable feature extraction from Mini-Mental State Examination data. Exp. Gerontol. 2025, 208, 112812.




| Evaluation Domain | Core Questions | Primary Metrics / Outcomes | Alzheimer’s-Specific Considerations | Recommended Study Designs & Reporting Standards |
|---|---|---|---|---|
| Technical Validity | Does the system produce factually accurate, complete, and reproducible outputs? | Accuracy; hallucination rate; omission rate; factual verification score; calibration (for predictive models) | Omissions (e.g., missed delirium or fall precautions) may be as harmful as fabricated content; overconfident tone increases risk in cognitively vulnerable users | Blinded expert adjudication; structured hallucination/omission audits; external validation; TRIPOD+AI for prediction models (37) |
| Clinical Safety | Does use of the system avoid introducing new safety risks or delaying appropriate escalation? | Safety-critical error rate; escalation accuracy; unsafe advice frequency; clinician override rate | Caregivers may operationalize outputs as care plans; escalation failures (e.g., failure to recommend urgent evaluation) carry amplified harm | Controlled before–after studies; pragmatic trials; DECIDE-AI early-stage evaluation (34); CONSORT-AI for interventional trials (35) |
| Usability & Cognitive Alignment | Are outputs understandable, usable, and aligned with caregiver and patient cognitive capacity? | Readability scores; caregiver comprehension; System Usability Scale (SUS); time saved; satisfaction | Literacy variability, caregiver stress, fluctuating decisional capacity; simplified language must not sacrifice safety-critical detail | Mixed-method usability studies; caregiver-supervised feasibility trials; SPIRIT-AI for protocol transparency (36) |
| Equity & Fairness | Does performance remain stable across demographic, linguistic, and literacy subgroups? | Subgroup error rates; disparity in unsafe outputs; calibration across strata; qualitative cultural alignment assessment | Dementia diagnosis and caregiving burden already show disparities; uneven performance may exacerbate inequities | Stratified validation; fairness audits; predefined subgroup analyses (15, 16) |
| Privacy, Consent & Governance | Are sensitive narratives protected, and are role boundaries clearly defined? | Data minimization adherence; access control integrity; audit logs; consent documentation clarity | AD care includes sensitive home narratives and caregiver stress disclosures; patient capacity may fluctuate | Privacy impact assessments; governance review; role-based workflow testing (12, 17, 18) |
| Implementation & Monitoring | Does system performance remain stable over time and across care settings? | Post-deployment hallucination drift; update-related failure modes; user feedback trends; sustained adoption | Workflow shifts (ED discharge, hospitalization, community care) may induce domain drift; updates can introduce new risks | Continuous monitoring with periodic audits; time-split and multi-site validation; lifecycle governance frameworks (31, 32) |
| Patient-Facing Safety (if applicable) | Does direct interaction avoid confusion, dependency, or harmful advice? | Confusion/distress episodes; dependency indicators; unsafe directive frequency; escalation compliance | Cognitive vulnerability increases risk of emotional dependence or misinterpretation of authoritative tone | Short supervised feasibility trials; conservative scope limitation; predefined escalation triggers (18, 43) |
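
The Technical Validity and Equity & Fairness rows of the table name hallucination rate, omission rate, and subgroup disparity as metrics but do not define them. The following is a minimal sketch of one way such rates might be computed from expert-adjudicated outputs. The `AuditedOutput` schema, the function names, and the choice of a simple max-minus-min gap as the disparity measure are all assumptions made for illustration; they are not definitions taken from this article or from the cited reporting guidelines.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditedOutput:
    """One LLM output after expert adjudication (hypothetical schema)."""
    subgroup: str          # stratum label, e.g. language or literacy level
    n_statements: int      # statements present in the generated output
    n_unsupported: int     # statements judged fabricated or unsupported
    n_required_items: int  # safety-critical items the output should contain
    n_missing_items: int   # required items the output left out


def audit_rates(outputs):
    """Return (hallucination_rate, omission_rate) pooled over all outputs."""
    statements = sum(o.n_statements for o in outputs)
    required = sum(o.n_required_items for o in outputs)
    hallucination = (
        sum(o.n_unsupported for o in outputs) / statements if statements else 0.0
    )
    omission = (
        sum(o.n_missing_items for o in outputs) / required if required else 0.0
    )
    return hallucination, omission


def rates_by_subgroup(outputs):
    """Compute the same pair of rates separately for each subgroup."""
    groups = defaultdict(list)
    for o in outputs:
        groups[o.subgroup].append(o)
    return {name: audit_rates(items) for name, items in groups.items()}


def max_disparity(per_group, index):
    """Largest gap between subgroups for one rate (0 = hallucination, 1 = omission)."""
    values = [rates[index] for rates in per_group.values()]
    return max(values) - min(values) if values else 0.0


# Toy, made-up audit records used only to exercise the functions.
example = [
    AuditedOutput("group_a", 20, 1, 5, 0),
    AuditedOutput("group_a", 20, 1, 5, 1),
    AuditedOutput("group_b", 10, 2, 5, 2),
]
per_group = rates_by_subgroup(example)
print("pooled:", audit_rates(example))
print("by subgroup:", per_group)
print("omission disparity:", max_disparity(per_group, 1))
```

Pooling counts before dividing weights longer outputs more heavily than averaging per-output rates would. Either choice may be defensible, so an evaluation protocol would need to state which one it uses. Omissions are counted against a predefined list of required items because, as the table notes, a missed precaution may be as harmful as a fabricated statement.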

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).