Large Language Models in Alzheimer’s Care: Clinical Use Cases, Safety Challenges, and Implementation Pathways

Tursun Alkam; Ebrahim Tarshizi; Andrew H Van Benschoten

doi:10.20944/preprints202606.2060.v1

Submitted:

27 June 2026

Posted:

29 June 2026

You are already at the latest version

Abstract

Alzheimer’s disease (AD) care is increasingly communication-intensive, requiring sustained caregiver education, symptom monitoring, behavioral management, and coordination across fragmented clinical settings. Large language models (LLMs) can generate, summarize, and adapt natural language at scale, creating opportunities to support dementia care workflows, but they also introduce safety risks that are amplified in cognitively vulnerable populations. We provide a narrative synthesis of emerging applications of LLMs across the AD care continuum, highlight Alzheimer’s-specific safety challenges, and propose pragmatic implementation pathways for responsible clinical translation. LLM use cases cluster into caregiver-facing support, patient-facing conversational agents (high caution), clinician workflow augmentation (highest near-term feasibility), clinical text intelligence for risk prediction (early-stage), and research/education support. Key safety threats include hallucinations, omission of critical information, over-reliance, bias, privacy leakage, and prompt-injection vulnerabilities. Implementation is most defensible using grounded architectures (e.g., retrieval-augmented generation), tiered deployment, human-in-the-loop verification, continuous monitoring, and security testing.

Keywords:

Alzheimer’s disease

;

dementia care

;

large language models

;

generative AI

;

caregiver support

;

retrieval-augmented generation

;

clinical decision support

Subject:

Public Health and Healthcare - Primary Health Care

1. Introduction

Alzheimer’s disease (AD) care is increasingly shaped by long-term, communication-heavy interactions among patients, caregivers, and clinicians. Across the disease continuum, the most persistent challenges are not limited to diagnosis and pharmacologic decisions, but also include day-to-day symptom interpretation, caregiver coaching, planning for safety and functional decline, and coordinating care across fragmented settings. These needs generate large volumes of unstructured clinical text and caregiver narratives, while time constraints and workforce limitations often restrict the amount of individualized education and support that care teams can provide.

Large language models (LLMs) have emerged as a new class of general-purpose generative systems capable of producing fluent text, synthesizing long documents, and adapting explanations to different audiences. This functionality has immediate appeal in Alzheimer’s care because many unmet needs are “language problems” at scale: converting complex plans into caregiver-friendly steps, accelerating chart review, summarizing longitudinal trajectories, and supporting consistent education between visits. However, LLMs are also well known to generate incorrect or fabricated statements (“hallucinations”), and these errors can be difficult to detect because outputs often appear confident and clinically plausible. For this reason, factual reliability remains a primary barrier to safe clinical use, particularly when LLMs generate or transform patient-care documentation [1].

Early dementia-focused evaluations illustrate both the promise and limits of general-purpose conversational models. In a study assessing ChatGPT’s performance for dementia-related information needs, caregivers found the system more helpful for general, non-specialized questions than for clinically nuanced inquiries, reinforcing that caregiver-facing tools cannot be treated as substitutes for clinical guidance [2,3]. This gap has fueled increasing interest in Alzheimer’s-specific architectures that emphasize grounded answers rather than unconstrained generation. For example, retrieval-augmented caregiver systems such as ADQueryAid are designed to provide more reliable support by enriching LLM responses with curated AD knowledge and structured retrieval [4]. Similarly, caregiving-specialized models such as the Caregiving Language Model (CaLM) represent a pragmatic approach to domain adaptation using caregiving-specific knowledge resources to improve reliability and accessibility for caregiver tools [5].

In parallel, investigators are exploring higher-complexity applications in clinical text intelligence and prediction, including multi-agent approaches that analyze longitudinal notes to forecast Alzheimer’s risk trajectories. CARE-AD is one example of this emerging direction, proposing a multi-agent LLM framework for AD prediction using longitudinal clinical documentation [6]. While scientifically promising, such systems must be interpreted cautiously until they demonstrate external validation, calibration, and consistent performance across diverse healthcare settings and populations.

Beyond hallucinations, Alzheimer’s deployment also raises distinct and underappreciated safety threats related to system integration. Prompt injection, maliciously crafted text that manipulates model behavior, has been shown to meaningfully alter medical advice outputs in commercial LLMs, and is especially concerning for caregiver-facing tools that ingest external or user-provided content [7]. These risks reinforce the need to view LLM adoption in dementia care as a governed socio-technical intervention, not merely a software upgrade.

Accordingly, this narrative review synthesizes the emerging role of LLMs in Alzheimer’s care through three integrated lenses: (1) clinical use cases across the AD continuum, with emphasis on caregiver-facing support, patient-facing conversational agents, clinician workflow augmentation, and text-based prediction; (2) safety challenges unique to Alzheimer’s care, including hallucinations, over-reliance risks in cognitively vulnerable users, privacy constraints, bias, and prompt injection vulnerabilities; and (3) implementation pathways that prioritize grounded generation, conservative role design, and staged deployment with continuous monitoring. This approach aligns with the broader movement toward governance and lifecycle risk management for generative AI in health care, including guidance emphasizing ethics, accountability, and oversight.

Rather than arguing that LLMs are ready to replace clinical expertise, we propose that the most defensible near-term value in Alzheimer’s care lies in assistive infrastructure: tools that reduce documentation burden, improve clarity of caregiver instructions, and support navigation and continuity, while preserving human verification for care-critical decisions. Figure 1 provides a conceptual overview of LLM applications across the Alzheimer’s care continuum, summarizing key clinical use cases, safety risks, and implementation pathways that structure the remainder of this review.

2. What LLMs Are (and What They Are Not) in Clinical Care

LLMs are generative systems trained to predict and produce language, enabling them to write coherent text, summarize long documents, answer questions, and adapt explanations to different audiences. In clinical environments, their practical value is strongest for tasks that are fundamentally linguistic, documentation, translation of medical text into patient-friendly language, and synthesis of narrative information across fragmented records. At the same time, LLMs are not inherently reliable fact engines: they do not “know” clinical truth in the way clinicians do, and they can produce fluent but incorrect statements (including invented details) unless outputs are grounded and verified. This distinction is foundational for dementia contexts, where patients and caregivers may be more vulnerable to persuasive or authoritative language and where errors can propagate into real-world decisions.

In other words, LLMs can be powerful clinical communication amplifiers, but they should not be treated as autonomous clinical decision-makers. Their safest clinical identity is a co-pilot that drafts and organizes information under human oversight, rather than an agent that independently diagnoses, triages, or prescribes.

2.1. What LLMs Can Do Well

Summarization and Rewriting at Different Literacy Levels

One of the most mature and clinically relevant capabilities of LLMs is transforming complex medical text into language that is easier to understand. This matters in Alzheimer’s care because caregivers often become the operational “care team” after encounters and need clear instructions that are actionable at home. In a cross-sectional study of inpatient discharge summaries, an LLM was able to translate discharge documentation into patient-friendly language and formats that were significantly more readable and understandable than EHR versions, while the authors emphasized that real-world implementation requires improvements in accuracy, completeness, and safety, and should begin with physician review [8]. These findings support a practical use case for dementia care: using LLMs to draft caregiver-facing summaries (e.g., medication routines, safety recommendations, follow-up plans) that can then be verified and finalized by clinicians.

Extracting the “Clinical Story” from Notes

Clinical records are frequently fragmented, repetitive, and difficult to synthesize, especially for older adults with multimorbidity and frequent transitions. LLMs can help by converting “note bloat” into problem-oriented narratives, what happened, what changed, what matters now, and what the plan is, potentially reducing cognitive burden for clinicians during chart review. A 2025 evaluation describes the development and assessment of a clinical note summarization approach and highlights the motivation for using LLMs to support physician review workflows, while reinforcing that these tools complement rather than replace clinical judgment [9]. In parallel, recent work has proposed methods to evaluate real-world multi-document EHR summaries, reflecting a broader shift toward structured assessment of summary quality and factual reliability rather than anecdotal impressions [10]. For Alzheimer’s care, this “clinical story extraction” is particularly valuable when the care plan must be communicated across settings (clinic → ED → inpatient → post-acute → home) and when caregiver narratives are essential to contextualizing symptoms.

Generating Caregiver Instructions and Checklists (with Guardrails)

LLMs can draft caregiver instructions, routines, and checklists that reduce ambiguity and improve adherence to care plans, especially when outputs are constrained to supportive, non-prescriptive domains (home safety steps, routine planning, symptom tracking logs, appointment question lists). Early caregiver-facing evaluations show real promise but also reveal the boundary of safe reliability. In a Scientific Reports evaluation, caregivers rated ChatGPT as more responsive and helpful for general, non-specialized dementia information needs than for clinically specialized questions, suggesting that LLMs may assist with education and planning but are not dependable as standalone clinical advisors [3]. This pattern is highly relevant to Alzheimer’s care, where caregiver tools should prioritize grounded education, consistent tone, and escalation pathways rather than “open-ended medical advice.”

Triage and Documentation Assistance (Co-Pilot Role)

The highest near-term feasibility for LLMs in medicine is clinician-facing workflow augmentation, particularly drafting clinical text that clinicians can verify. In a comparative effectiveness study of emergency department discharge documentation, use of an on-site LLM assistant was associated with significantly reduced writing time and with LLM-assisted notes rated as more complete and clinically useful than manual notes [11]. This type of co-pilot workflow is directly transferable to AD-related ED visits and hospital discharges, where discharge communication must often serve caregivers and where incomplete or unclear instructions can increase preventable return visits and safety events. The crucial point is that in safe implementations, LLMs accelerate the first draft, they do not replace clinician responsibility for the final clinical content.

2.2. Where LLMs Fail in Dangerous Ways

Hallucinations and Fabricated Clinical Facts

The most widely recognized safety failure mode is hallucination: LLMs may generate information that is plausible in tone but unsupported or false in content. In clinical workflows, hallucinations can appear as invented diagnoses, incorrect medications, fabricated timelines, or erroneous clinical recommendations. A study has emphasized that inaccuracies in patient-care documents generated by LLMs remain a central barrier to safe clinical use, reinforcing that fact-checking and verification must be built into any deployment strategy [1]. In Alzheimer’s care, hallucinations are especially hazardous because caregivers may act on written instructions as if they were clinician-approved plans, and because patients may have reduced ability to question inaccuracies.

Confident Incorrectness and “Authority Tone”

LLMs can produce incorrect statements with confidence, giving outputs an “authority tone” that encourages over-trust. This is not merely a user-interface issue; it is a behavioral safety risk that can bias clinician judgment (automation bias) and mislead caregivers. A study has highlighted clinical reasoning concerns and the risk of over-reliance as LLMs become integrated into documentation, messaging, and clinical workflows [12]. In dementia contexts, this becomes more acute because caregivers may be overwhelmed and looking for definitive answers, while patients may be more vulnerable to persuasive conversational framing. The safest approach is therefore to constrain LLM roles to drafting and explanation while ensuring escalation rules and human oversight for care-critical decisions.

Inconsistent Performance Across Populations and Settings

LLM behavior is not uniform across patient populations, health literacy levels, language styles, and institutional contexts. This raises the risk that a system might provide safer, more complete, or more empathetic guidance for some groups than others, thereby reinforcing existing disparities in dementia care access and outcomes. A “toolbox” paper emphasizes that equity-related model failures are common and difficult to detect without intentional evaluation, and it proposes structured approaches to surface health equity harms and biases in LLMs used for health information needs [13]. Empirical work has also shown measurable bias patterns in health-oriented LLM outputs and decisions, underscoring that subgroup performance auditing cannot be optional if these tools are used at scale [14]. In Alzheimer’s care, this is particularly important because caregiver support needs are tightly coupled to socioeconomic resources, language access, and health literacy.

Sensitivity to Prompt Injection and Unsafe Instructions in Downstream Workflows

As LLMs become embedded in tools that retrieve external content or process user-provided documents, they become vulnerable to prompt injection, maliciously crafted text that can override system instructions and induce unsafe outputs. A quality improvement study demonstrated that commercial LLMs can be highly susceptible to prompt-injection attacks that alter medical advice and validated client-side “man-in-the-middle” injection as a realistic attack vector [7]. This threat is especially relevant in Alzheimer’s caregiver support systems, which may ingest community resources, pasted text, portal messages, or third-party documents. Without security-by-design safeguards (source allow-lists, input sanitization, constrained tool permissions, and adversarial testing), prompt injection becomes a direct clinical safety risk rather than a purely technical concern.

To operationalize these strengths and failure modes into a structured, lifecycle-oriented assessment strategy, we propose a staged evaluation framework that integrates technical validity, clinical safety, and socio-ethical governance domains (Table 1).

3. Clinical Use Cases Across the Alzheimer’s Care Continuum

Alzheimer’s care is uniquely “language intensive.” Patients and caregivers must repeatedly interpret symptoms, communicate evolving needs, coordinate fragmented services, and translate clinical recommendations into day-to-day routines. Digital health technologies have been explored for cognitive impairment care, but the literature remains relatively recent, heterogeneous, and uneven in quality, with persistent gaps in scalability and sustained engagement [15]. In this context, LLMs are attractive because they can deliver interactive dialogue, personalized explanations, and structured summaries, potentially reducing burden at multiple points across the AD trajectory. However, these systems must be framed as assistive infrastructure, not independent clinical authorities, given well-documented risks of misinformation and overconfidence in generated outputs, risks amplified in cognitively vulnerable populations [3,16]. As summarized in , use cases cluster into caregiver-facing support, patient-facing agents (high caution), clinician workflow augmentation, clinical text intelligence, and research/education applications, which we discuss below.

3.1. Caregiver-Facing Support Systems

Caregiver-facing applications represent one of the most clinically plausible near-term pathways for LLMs in Alzheimer’s care because they target high-frequency needs: learning how to respond to symptoms, planning daily routines, ensuring safety, and communicating effectively with clinicians and family. Digital caregiver support tools have shown potential to reduce burden and improve information access, but the landscape remains fragmented, suggesting room for more adaptive and scalable approaches [15]. Within this space, LLMs are particularly suited to education, question answering, and communication support, provided that responses are grounded in trusted resources and designed with clear safety boundaries.

A key proof-of-concept comes from an evaluation of ChatGPT for dementia-related information needs, where caregivers rated the system as generally responsive and helpful, especially for general or non-specialized questions, but noted weaker performance for more specialized, clinically nuanced inquiries [3]. This distinction is critical for Alzheimer’s care: many caregiver needs involve practical “how-to” questions and emotional support, but clinical decision-making (medication changes, red flags, differential diagnosis) demands higher reliability than current general-purpose systems consistently provide.

Beyond open-ended question and answer (Q&A), caregivers frequently need help with navigation of services, daily care planning, and stepwise routines (e.g., creating safety checklists, behavior logs, appointment preparation). Retrieval-augmented generation (RAG) is especially relevant here because it constrains LLM outputs using curated, domain-specific knowledge sources rather than unconstrained generation. ADQueryAid exemplifies this approach: it is a conversational system designed to support AD and related dementias (ADRD) caregivers, explicitly combining an LLM with ADRD knowledge resources via RAG to provide more informative, personalized support [3]. These architectures are particularly aligned with caregiver guidance tasks because they prioritize source-grounded responses and reduce the likelihood of unsupported claims.

A third direction is specialization, rather than relying exclusively on large general-purpose chatbots, investigators have explored caregiving-specific models tailored to the caregiving domain. The Caregiving Language Model (CaLM) illustrates how a targeted caregiving knowledge base, retrieval module, and model adaptation can support caregiver-oriented tools while potentially reducing computational burden compared with relying on the largest foundation models alone [5]. Together, these developments suggest that caregiver-facing LLM systems may be most successful when they (1) focus on high-frequency informational needs, (2) incorporate grounding via RAG, and (3) adopt caregiver-centered conversational policies that emphasize empathy, clarity, and escalation when risk is high.

Finally, LLMs can support communication templates that help caregivers organize and convey critical information. Practical examples include symptom timelines, medication reconciliation checklists, lists of questions for clinicians, and structured summaries for family coordination. Because caregiver–clinician communication is often fragmented across transitions, templated outputs can improve completeness and reduce misunderstandings, so long as caregivers remain the final editors and clinicians retain accountability for medical decisions [15].

3.2. Patient-Facing Conversational Agents (High Caution Zone)

Patient-facing conversational agents may appear appealing for companionship, cognitive engagement, and reminders; however, direct deployment for individuals with AD remains a high caution domain. A systematic review of commercially available dementia-related chatbots highlighted variability in quality, features, and content, underscoring that many systems are not designed with the rigorous safeguards required for cognitively vulnerable users [16]. In AD, the risks extend beyond standard medical AI concerns because impaired memory, fluctuating insight, and heightened suggestibility can increase the likelihood of confusion, distress, or inappropriate trust in confident-sounding outputs.

In the mild or early stages, conversational tools may have a role in structured companionship and social support, particularly when framed as engagement aids rather than sources of authoritative advice. Similarly, cognitive stimulation prompts represent a more defensible category when they are deliberately bounded (e.g., reminiscence-style conversation starters, guided storytelling, mood-supportive prompts). For example, a conversational system has been developed to support reminiscence therapy for people with Alzheimer’s disease by enabling personalized dialogue grounded in the patient’s life history and preferences, illustrating the feasibility of patient-tailored engagement experiences [17].

Patient-facing adherence reminders (medication timing, hydration prompts, appointments) may also be useful, but safest implementation typically involves caregiver mediation or supervision, especially as cognitive impairment progresses. The broader dementia chatbot literature suggests feasibility for supportive functions, yet it also highlights the need for careful adaptation to abilities and safety constraints [16]. Critically, patient-facing systems should remain assistive rather than directive: they must avoid medication changes, discourage self-management of urgent symptoms, and incorporate clear escalation instructions and caregiver involvement as default safeguards. Evidence from caregiver-facing evaluations further supports this caution, showing that even helpful general responses may be less reliable for clinically complex situations [3].

3.3. Clinician Workflow Augmentation (Highest Near-Term Feasibility)

Among all AD-related applications, clinician-facing workflow augmentation is arguably the most feasible near-term use case because outputs can be verified by professionals and integrated into existing accountability structures. Alzheimer’s care frequently requires synthesis of fragmented history across settings, reconciliation of complex medication regimens, and generation of caregiver-friendly explanations, tasks that are both time-consuming and documentation heavy. LLMs can support this burden through accelerated chart review, structured summaries, and drafting of patient-facing materials.

For chart review acceleration, LLMs can generate problem-oriented narratives that highlight key diagnoses, recent events, and care plans from lengthy documentation, supporting continuity across clinics, hospitals, and emergency encounters. A particularly practical use case is drafting and improving the readability of discharge documentation. In a JAMA Network Open comparative effectiveness study, use of an on-site LLM assistant was associated with reduced writing time for emergency department discharge notes compared with manual note-taking, without compromising documentation quality [11]. Separately, a cross-sectional study showed that LLMs can translate discharge summaries into more patient-friendly language that is significantly more readable and understandable than typical EHR discharge documentation, while still requiring physician review for safety and completeness [8].

In AD care specifically, these functions matter because caregivers often become the operational “care team,” and comprehension of discharge instructions and follow-up plans can directly influence safety outcomes. Patient-friendly summaries of diagnosis, prognosis, and next steps may improve caregiver understanding, especially when paired with grounded references to approved educational materials. However, clinician-facing LLM outputs must always be treated as drafts requiring verification, since even subtle inaccuracies or invented details can have downstream consequences. This verification principle becomes even more important when LLM tools offer visit note structuring or differential prompts, where plausible but unsupported suggestions could bias clinical reasoning or documentation [8,11].

3.4. Risk Prediction and Clinical Text Intelligence

Beyond communication and documentation, LLMs are increasingly being explored for clinical text intelligence, extracting predictive signals from unstructured narrative notes that may precede formal diagnostic coding. This direction is attractive in Alzheimer’s disease because prodromal cognitive and functional decline often emerges gradually and is documented inconsistently across years of clinical encounters. Longitudinal notes may contain subtle cues (caregiver concerns, functional changes, clinician impressions, behavioral symptoms) that structured variables alone fail to capture.

CARE-AD (Collaborative Analysis and Risk Evaluation for Alzheimer’s Disease) exemplifies this emerging approach: it is a multi-agent LLM-based framework designed to forecast Alzheimer’s disease onset by analyzing longitudinal electronic health record notes [6]. Multi-agent architectures are conceptually appealing because they can distribute subtasks (e.g., symptom extraction, risk reasoning, evidence aggregation) across specialized agents, potentially improving robustness compared with single-pass generation [6].

Nevertheless, prediction-focused LLM systems remain early-stage for real-world deployment. External validation across health systems, careful calibration, and subgroup performance auditing are essential because documentation style and access patterns vary widely. Even strong retrospective performance can fail when models encounter new populations, different note structures, or evolving clinical language. As a result, LLM-based prediction should currently be framed as an investigational tool that requires rigorous evaluation before it can influence patient-facing decisions [6].

3.5. Research and Education Use Cases

LLMs also support Alzheimer’s care indirectly through research and education workflows. Clinicians and researchers may use LLMs to rapidly summarize literature, draft educational handouts, and generate structured outlines for training or caregiver materials. In Alzheimer’s contexts, this could reduce time to create practical resources (e.g., safety planning handouts, “what to expect” guides, caregiver conversation scripts). However, LLM-based synthesis carries risks of overgeneralization, omission of limitations, and fabricated references, making verification mandatory.

A particularly salient concern is reference accuracy in scientific writing. In a large analysis of LLM-generated references, hallucination and citation inaccuracies were substantial, reinforcing that LLM outputs should not be treated as reliable citation engines and must be checked against primary sources [18]. For patient-facing education handouts, the safest pathway is to use LLMs as drafting assistants that rewrite or simplify vetted content, preferably in a retrieval-grounded workflow and reviewed by clinicians or trained staff before distribution [4,8].

4. Safety Challenges Unique to Alzheimer’s Care

LLM safety challenges are not uniform across clinical domains. In Alzheimer’s disease (AD), risk is amplified by cognitive vulnerability, frequent caregiver mediation, long disease trajectories, and high-stakes transitions (e.g., ED presentations, hospitalization, discharge planning). As LLMs are increasingly integrated into clinical and caregiver workflows, the central question is not whether they can generate fluent responses, but whether they can do so reliably, equitably, and safely, especially when users may have impaired memory, reduced insight, or heightened susceptibility to persuasive language [1,19]. As summarized in Supplementary Table S2, Alzheimer’s-specific risks cluster around factuality failures, over-reliance, equity, privacy/security, and adversarial vulnerabilities, which we discuss in detail below.

4.1. Hallucinations, Omissions, and Factuality Failures in Care-Critical Content

A defining limitation of LLMs is their tendency to produce hallucinations, statements that sound plausible but are not supported by the source input (or are simply false). In clinical contexts, hallucinations may appear as fabricated diagnoses, invented medication changes, incorrect timelines, or inappropriate care instructions. This is particularly dangerous in Alzheimer’s care because caregivers often rely on written information to implement safety plans and manage medications at home. It has been emphasized that factual inaccuracies in LLM-generated patient care documents remain a major barrier to safe clinical adoption, motivating the need for structured verification workflows rather than trust in fluent outputs [1].

Importantly, factuality risk is not limited to outright fabrication. A parallel problem is omission, missing critical details during summarization (e.g., delirium red flags, fall risk precautions, anticoagulation status, or recent medication adjustments). A recent clinical safety framework highlights hallucinations and omissions as distinct error modes that require deliberate evaluation strategies because both can cause harm even when text appears coherent and clinically “reasonable” [20].

AD care carries uniquely high risk because caregiver instructions and discharge summaries often function as the practical “care plan” at home. If an LLM inserts an incorrect medication dose, leaves out aspiration precautions, or fails to include clear delirium escalation guidance, serious harm may result, even when the language appears polished and clinically appropriate [1,20].

These safety domains interact within a broader socio-technical system that extends beyond the model itself, encompassing users, workflows, governance structures, and external information sources (Figure 2).

4.2. Overconfidence, “Authority Tone,” and Over-Reliance in Cognitively Vulnerable Users

LLMs frequently express uncertainty poorly, producing incorrect recommendations with confident, authoritative language. In general medicine, this creates risk of automation bias, humans trusting AI output over their own judgment or over contradictory evidence. In AD care, this risk is magnified because patients may have reduced capacity to appraise credibility, while caregivers may be overwhelmed and more likely to accept confident, step-by-step recommendations as safe “medical advice” [19].

Evidence from physician-facing contexts also supports caution: randomized evaluation has shown that LLM access can influence diagnostic reasoning and does not uniformly improve performance in a way that eliminates error risk, reinforcing that safe use requires structured workflows, verification, and clear role definition (assistive vs decision-making) [21].

In patient-facing use, emotional engagement can further intensify reliance. Ethical analyses in older-adult assistive technology emphasize risks such as dependency, manipulation, and diminished autonomy, concerns that become particularly salient when conversational systems are used as “companions” for individuals with dementia [19].

4.3. Prompt Injection and Adversarial Manipulation in Caregiver and Clinical Tools

As health systems deploy LLMs in tools that interface with external content (web sources, PDFs, guidelines, portals, EHR extracts), they become vulnerable to prompt-injection, malicious or hidden instructions that hijack model behavior and can induce unsafe recommendations. A quality improvement study demonstrated that commercial LLMs can be susceptible to prompt-injection attacks capable of altering medical advice, and it validated client-side “man-in-the-middle” injection as a realistic attack pathway [7].

This matters in Alzheimer’s care because caregiver-facing systems may retrieve resource information from community sites, support forums, or institutional portals, creating opportunities for hostile or misleading content to manipulate outputs. Prompt injection is therefore not a theoretical cybersecurity concern; it is a safety risk requiring adversarial testing, input sanitization, restricted tool permissions, and conservative autonomy (especially for patient/caregiver-facing deployment) [7].

4.4. Bias, Inequity, and Differential Performance Across Populations

LLM behavior can vary across demographic groups, language styles, and cultural contexts, creating the risk of inequitable care guidance. Alzheimer’s and dementia care already contains well-known disparities in diagnosis timing, care access, and caregiver support; LLMs could unintentionally reinforce these patterns if their outputs differ systematically by patient identity cues, dialect, or socioeconomic framing [22,23].

Recent research and reviews have systematically examined demographic disparities and biases in LLMs, including how bias is measured and what mitigation strategies are feasible [24]. Additionally, a toolbox-style approach has been proposed to surface and mitigate health equity harms in LLMs, underscoring that “fairness” must be treated as an explicit design and evaluation requirement rather than an afterthought [25].

In AD care, equity concerns become practical and immediate: caregiver education, symptom interpretation, behavioral guidance, and navigation support must be consistently safe and culturally appropriate, particularly for families with limited health literacy or non-native English proficiency [23,25].

4.5. Privacy, Confidentiality, and Sensitive Caregiver Narratives

Alzheimer’s care involves highly sensitive information: cognitive decline trajectories, functional impairment, home safety issues, caregiver strain, behavioral symptoms, and family conflict. LLMs used in messaging, coaching, or planning tools may inadvertently capture or expose details that patients would not have shared in clinical settings. Ethical and legal analyses in long-term and post-acute care emphasize that AI systems must be evaluated within the realities of caregiving environments, where privacy risks and downstream harms differ from typical outpatient documentation workflows [26].

For AD-focused LLM deployment, privacy safeguards are not optional add-ons; they are fundamental requirements. These include data minimization, strict access controls, clear retention policies, audit trails, and careful separation of caregiver notes from patient records when appropriate [19,26].

4.6. Consent, Capacity, Accountability, and Emotional Safety

A central challenge in Alzheimer’s care is that decision-making capacity can fluctuate or decline, making consent and accountability complex. Ethical analyses of AI in elderly care emphasize recurring concerns around autonomy, safety, privacy, and responsible use, issues that intensify in dementia when patients may be unable to evaluate tool limitations or recognize misinformation [19].

Additionally, conversational systems may produce emotionally inappropriate responses, reinforce misunderstandings, or contribute to distress, particularly when used as companions or quasi-therapeutic supports. Broader ethical work on older adults and dementia-focused assistive technologies highlights risks such as dependency, replacement of human care, and insufficient stakeholder involvement in assessing harms and benefits [27].

The most defensible framing, therefore, is that LLMs should support communication and planning within a governed care relationship (caregiver + clinician oversight), rather than function as an independent authority in patient-facing contexts [1,19].

5. Implementation Pathways for LLMs in Alzheimer’s Care (from Pilots to Safe Deployment)

AD is a high-stakes environment for language technologies because care is mediated through conversations, written instructions, and longitudinal narratives shaped by cognitive vulnerability, caregiver stress, and frequent transitions. Implementation success therefore depends less on fluent generation and more on the surrounding socio-technical system’s ability to prevent unsafe outputs, enable verification, protect privacy, and sustain reliability in real-world settings. Recent global guidance emphasizes that generative models in health care require explicit governance, transparency, and continuous risk management, particularly when outputs may influence clinical decisions or caregiver behavior. Supplementary Table S3 outlines a tiered roadmap that prioritizes feasibility and safety by gradually expanding user-facing autonomy. To clarify how these risk domains manifest in real-world deployment, Figure 3 maps common failure modes specific to Alzheimer’s care and links them to mitigation strategies embedded within the proposed governance model.

5.1. Defining the Clinical Role: “Assistive Infrastructure,” Not an Autonomous Clinician

A defensible strategy defines LLMs as assistive infrastructure rather than medical authorities, clarifying accountability in AD care. LLMs can draft, summarize, translate, and organize information, but should not independently determine diagnoses, recommend medication changes, or issue definitive triage instructions without human oversight. Because caregiver-mediated decision-making is common and patients may be less able to judge credibility, the most durable approach is “draft-and-verify”: the model generates a structured draft (summary, checklist, template), which a clinician or trained staff member validates before it becomes actionable. Even clinician-facing tools can shape reasoning and induce automation bias; integration should therefore prioritize verification workflows rather than assuming accuracy gains by default [21].

5.2. Architecture Choices that Reduce Hallucinations: Grounding Through Retrieval-Augmented Generation

A central lesson in medical LLM research is that larger models are not necessarily safer; reliability improves most when outputs are grounded in vetted sources. Retrieval-augmented generation (RAG) reduces hallucinations by retrieving evidence from curated knowledge bases and generating responses anchored to those materials. A systematic review and meta-analysis proposed development guidelines for biomedical RAG and emphasized source curation, retrieval quality, and evaluation procedures [28]. In AD care, grounding can be implemented through an approved knowledge layer (validated caregiver resources, delirium/fall protocols, medication counseling templates, institution-specific discharge instructions). However, RAG is a reliability enhancer rather than a guarantee: retrieval errors, incomplete sources, or misleading documents can still produce unsafe outputs. A recent perspective underscores that while RAG can improve reliability and personalization, it introduces new failure modes requiring evaluation and governance [29].

5.3. A Tiered Deployment Pathway that Fits Alzheimer’s Care Realities

Because risk differs across user groups, AD deployment should prioritize high-value, low-autonomy use cases first and reserve direct-to-patient systems for late-stage evaluation. The most practical entry point is clinician workflow augmentation, where outputs can be reviewed before affecting care. A comparative effectiveness study found that an on-site LLM assistant reduced emergency department discharge note writing time and produced notes rated more complete, correct, and clinically useful than manual notes [11], highly relevant for AD, where discharge communication often becomes the caregiver’s operational care plan. A second pathway is patient- and caregiver-friendly summaries: a cross-sectional study found LLMs could transform discharge summaries into more readable and understandable formats, while emphasizing that implementation requires stronger accuracy/safety and should begin with physician review [8]. Companion evidence indicates that readability gains can still coexist with omissions and hallucinations, reinforcing the need for validation and guardrails [30]. After clinician co-pilots are stable, caregiver-facing RAG tools with conservative scope limits may support education, routine planning, question preparation, and resource navigation while repeatedly reinforcing escalation pathways. Patient-facing conversational agents should remain the most restrictive tier due to higher risk of confusion, distress, and over-reliance; if deployed, they should be limited to low-risk engagement, avoid directive clinical advice, and ideally be mediated by caregiver oversight. This tiered logic can be visualized as a risk–autonomy matrix in which clinician-facing co-pilot applications occupy the lowest-risk quadrant, while direct-to-patient autonomous agents represent the highest-risk zone (Figure 4).

5.4. Governance and Continuous Monitoring: Implementation as a Lifecycle Program

Real-world deployment requires an operating model that anticipates drift, evolving user behavior, and failure modes not captured in pre-deployment testing. Clinical AI studies show models can fail when workflows, populations, or documentation patterns shift, making ongoing evaluation in the true performance environment essential [31,32]. An EHR-integrated implementation report found large-scale LLM integration feasible and capable of sustained use under privacy safeguards, but noted that user feedback volume can decline over time, making passive reporting insufficient as the primary safety signal [33], an important concern in AD care where near-misses may go unreported. Implementation science also stresses structured evaluation and reporting: DECIDE-AI supports early-stage live evaluation focused on usability and unintended consequences [34], while CONSORT-AI and SPIRIT-AI improve reporting rigor for trials and protocols involving AI components [35,36]. For prediction-oriented systems (e.g., note-based risk forecasting), TRIPOD+AI prioritizes transparency around development, validation, and performance claims [37].

5.5. Security and Prompt Injection: A Non-Negotiable Requirement for Dementia-Facing Tools

Security risks become clinical safety risks when LLM tools ingest external content or interact with downstream systems. Prompt injection can manipulate model behavior through malicious inputs and may bypass intended safeguards. A quality improvement study found commercial LLMs highly susceptible to prompt injection capable of inducing unsafe clinical advice and validated client-side “man-in-the-middle” injection as a realistic vector [7]. Because AD caregiver tools often interface with external resources or user-provided documents, prompt injection testing should be routine (and repeated after updates), retrieval sources should be allow-listed and sanitized, and tool permissions should be tightly restricted for caregiver- and patient-facing systems [7].

5.6. Equity, Privacy, and Consent in an AD-Specific Workflow

Even if an LLM is “accurate” on average, it can harm patients and families if it performs unevenly across populations, produces culturally misaligned guidance, or shifts error burden onto those with lower health literacy or limited digital access. Equity failures can be subtle and may amplify disparities, motivating structured approaches to measure bias in long-form health answers rather than relying on overall performance [13,24]. In AD care, caregiver differences in language proficiency, literacy, time constraints, and support access shape whether an LLM is helpful or confusing; subgroup evaluation should therefore be a deployment requirement, not an add-on [13,38]. Privacy and consent are equally critical because AD workflows contain sensitive caregiver narratives and longitudinal detail. Analyses of LLMs in health care highlight risks of disclosure and memorization when identifiable information is used without strict data minimization, access controls, and logging [39,40]. Consent challenges are magnified in dementia because capacity may be impaired or fluctuate; governance must define who is authorized (patient vs caregiver/proxy), what can be entered, and how outputs are presented to avoid substituting for clinician judgment [41,42]. Patient-facing tools should be deployed under conservative assumptions about vulnerability, with explicit escalation instructions and safeguards that prevent an authoritative medical voice [41,43].

5.7. A Realistic Endpoint: A Safe Alzheimer’s LLM Is a System, Not a Model

The safest pathway is incremental and systems-based: start with clinician co-pilots that reduce documentation burden and improve caregiver communication, then expand to caregiver-facing RAG tools with bounded scope, and only later consider limited patient-facing engagement. Evidence supporting feasibility (efficiency, readability) also reinforces that safety, completeness, verification, monitoring, and security testing are non-negotiable for deployment [7].

6. Future Directions

LLM research in AD care should be judged by real-world benefit, safety, and implementation feasibility rather than conversational fluency or isolated benchmarks. Because AD trajectories are long and care is fragmented, misinformation can be amplified by cognitive vulnerability, caregiver stress, and frequent care transitions. summarizes priority research gaps and links study designs to clinically meaningful outcomes and required safety metrics.

6.1. From Promising Prototypes to Pragmatic Clinical Evaluation

Most dementia-oriented LLM tools remain early-stage; publishable clinical evidence requires staged evaluation in real workflows. DECIDE-AI supports early-stage live evaluation emphasizing workflow factors, user interaction, and unintended consequences [34,35]. As tools mature, CONSORT-AI and SPIRIT-AI strengthen trial and protocol reporting by requiring transparent descriptions of system use and human interaction [35,36]. For prediction-focused tools, TRIPOD+AI provides updated reporting guidance emphasizing transparency and reproducibility [37]. Progress will be clearest when questions shift from “can it answer?” to “does it reduce caregiver burden, errors, or preventable transitions without increasing safety events?”

6.2. Choosing Endpoints that Reflect Alzheimer’s Realities

Translation is limited when studies rely on technical metrics but under-measure real outcomes. Future work should prioritize endpoints reflecting AD realities: caregiver strain, symptom burden, and preventable utilization. The Zarit Burden Interview (ZBI) is a standard caregiver-burden measure suited to caregiver-facing interventions [44,45]. BPSD often drive distress and escalation; the Neuropsychiatric Inventory (NPI) remains a key instrument for symptom profiles and caregiver distress [46]. Outcome frameworks should also measure harms (e.g., hallucinations in safety-critical guidance, escalation failures, equity gaps).

6.3. External Validation, Calibration, and Generalizability as Non-Negotiables

LLM systems are vulnerable to real-world degradation because documentation differs across systems and populations, especially in AD notes that blend caregiver narratives with evolving functional decline. Studies should prioritize external validation and calibration rather than internal performance alone; TRIPOD+AI provides a reporting foundation for these designs [37]. Multi-site/time-split validation, subgroup evaluation, and stress testing will be essential before any move toward higher-responsibility decision support.

6.4. Adversarial Testing and Security-by-Design for Dementia-Facing Systems

Because AD tools ingest external resources and messages, security is inseparable from safety. Prompt injection can manipulate outputs and induce unsafe recommendations; a quality improvement study supports incorporating adversarial testing into evaluation and governance [7]. Prompt injection resistance and retrieval safety should be treated as standard pre-flight requirements. Governance should also follow a lifecycle approach because risks change post-deployment: workflow shifts, transitions of care, and documentation changes can degrade reliability through drift [47,48,49,50]. Post-deployment surveillance, disciplined update management, and regression testing are essential because updates can introduce new failure modes [51,52].

6.5. Next-Generation Clinical Intelligence: Longitudinal Notes and Multi-Agent Forecasting

A promising frontier is extracting signals from longitudinal notes to forecast risk trajectories. CARE-AD proposes a multi-agent framework for AD prediction from longitudinal EHR notes [6], but such approaches remain proof-of-concept until external validation, calibration, and fairness auditing demonstrate stability. Local or hospital-contained systems may reduce exposure of sensitive narratives; ADetectoLocum illustrates this direction [53], though deployability does not substitute for clinical validation.

6.6. Multimodal Models and Richer AD Phenotyping

AD is a multimodal syndrome affecting cognition, behavior, function, and speech/motor patterns. Multimodal approaches combining text, audio, and imaging are emerging and technically feasible [54,55], but require external validation and demonstrations of clinical utility beyond classification accuracy.

6.7. Aligning Innovation with Evolving Ethics and Governance Expectations

Near-term impact will likely come from conservative, measurable interventions: clinician co-pilots, caregiver-facing grounded systems, and tightly constrained patient-facing engagement tools. Long-term frontiers (multi-agent forecasting, multimodal phenotyping) will depend on rigorous external validation and safety governance rather than algorithmic sophistication.

Supplementary Materials

The following supporting information can be downloaded at the website of this paper posted on Preprints.org.

Author Contributions

Conceptualization: T.A.; writing and original draft preparation: T.A.; writing, review and editing: E.T., and A.H.V.B.; All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The article processing charge will be waived by the journal.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable. This is a narrative review and no new datasets were generated or analyzed.

Acknowledgments

The authors thank the editorial office for the invitation to submit this manuscript and for waiving the article processing charge.

Conflicts of Interest

The authors declare no conflict of interest.

Non-Medical Glossary (LLM/AI and Implementation Terms)

Automation bias / over-reliance: Tendency to trust an AI output too much, even when it is wrong.

Bias (algorithmic): Systematic differences in AI performance or output quality across groups or contexts.

Calibration: How well the model’s confidence aligns with reality (e.g., “80% sure” should be correct ~80% of the time).

Context window: The amount of text an LLM can process at once.

Data drift / model drift: Changes over time in input data or model behavior that can reduce performance after deployment.

Fine-tuning: Additional training of a model on task- or domain-specific data to adapt its behavior.

Foundation model: Large pre-trained model that can be adapted to many tasks (LLMs are a common type).

General-purpose LLM: Broad model trained for many language tasks, not specialized to a single domain.

Grounding: Constraining outputs to trusted sources or evidence so responses are verifiable.

Hallucination: Fluent but incorrect or fabricated content generated by the model.

Human-in-the-loop: Workflow where a person reviews/edits/approves AI output before it is used.

Knowledge base: Curated collection of documents used to support retrieval and grounded responses.

Large language model (LLM): A neural network trained to generate and interpret text; used for summarization, drafting, and Q&A.

Multi-agent system: Architecture where multiple specialized “agents” (modules/models) collaborate (e.g., retrieval agent + safety checker).

Omission error: Leaving out important information even when other parts are correct.

Prompt: The instruction or input text provided to the model.

Prompt injection: Malicious or hidden instructions designed to override the system’s rules or constraints.

RAG (retrieval-augmented generation): System that retrieves relevant documents and uses them to ground the model’s response.

Red teaming: Adversarial testing to discover unsafe behaviors, vulnerabilities, or failure modes.

Scope limitation: Explicit constraints on what the model is allowed to do (e.g., “no medication advice”).

Safety-by-design: Building safeguards into the system (grounding, constraints, verification, monitoring) rather than relying on user vigilance.

Subgroup evaluation: Assessing performance across different groups or contexts to detect inequities.

Telemetry / audit logs: Recorded system interactions and outputs used for monitoring, quality review, and incident investigation.

Tool use (agentic workflows): When an LLM can call external tools (search, calculators, EHR functions), increasing capability and risk.

Verification: Human or automated checks that confirm factual correctness and completeness before outputs are acted upon.

Versioning: Tracking model, prompt, and knowledge-base changes over time to support reproducibility and safety review.

References

Chung, P. Verifying Facts in Patient Care Documents Generated by Large Language Models Using Electronic Health Records AIdbp2500418. NEJM AI 2026. 3, 1, p. [Google Scholar]
Aguirre, A.; et al. Assessing the Quality of ChatGPT Responses to Dementia Caregivers’ Questions: Qualitative Analysis. JMIR Aging 2024, 7, e53019. [Google Scholar] [CrossRef] [PubMed]
Saeidnia, H.R.; et al. Evaluation of ChatGPT’s responses to information needs and information seeking of dementia patients. Sci. Rep. 2024, 14(1), 10273. [Google Scholar] [PubMed]
Hasan, W.U.; et al. Empowering Alzheimer’s caregivers with conversational AI: a novel approach for enhanced communication and personalized support. npj Biomed. Innov. 2024, 1(1), 3. [Google Scholar] [CrossRef] [PubMed]
Parmanto, B.; et al. A Reliable and Accessible Caregiving Language Model (CaLM) to Support Tools for Caregivers: Development and Evaluation Study. JMIR Form. Res. 2024, 8, e54633. [Google Scholar] [CrossRef] [PubMed]
Li, R.; et al. CARE-AD: a multi-agent large language model framework for Alzheimer’s disease prediction using longitudinal clinical notes. npj Digit. Med. 2025, 8(1), 541. [Google Scholar] [CrossRef] [PubMed]
Lee, R.W.; et al. Vulnerability of Large Language Models to Prompt Injection When Providing Medical Advice. JAMA Netw. Open 2025, 8(12), e2549963–e2549963. [Google Scholar] [CrossRef] [PubMed]
Zaretsky, J.; et al. Generative Artificial Intelligence to Transform Inpatient Discharge Summaries to Patient-Friendly Language and Format. JAMA Netw. Open 2024, 7(3), e240357–e240357. [Google Scholar] [CrossRef] [PubMed]
Oliveira, J.D.; et al. Development and evaluation of a clinical note summarization system using large language models. Commun. Med. 2025. 5, 1, 376. [Google Scholar]
Croxford, E.; et al. Evaluating clinical AI summaries with large language models as judges. npj Digit Med. 2025. 8, 1, 640. [Google Scholar]
Song, J.W.; et al. Large Language Model Assistant for Emergency Department Discharge Documentation. JAMA Netw. Open 2025, 8(10), e2538427–e2538427. [Google Scholar] [CrossRef] [PubMed]
McCoy, L.G. Assessment of Large Language Models in Clinical Reasoning: A Novel Benchmarking Study p. AIdbp2500120. NEJM AI 2025. 2, 10. [Google Scholar]
Pfohl, S.R.; et al. A toolbox for surfacing health equity harms and biases in large language models. Nat. Med. 2024, 30(12), 3590–3600. [Google Scholar] [CrossRef] [PubMed]
Yang, Y.; et al. Unmasking and quantifying racial bias of large language models in medical report generation. Commun. Med. 2024, 4(1), 176. [Google Scholar] [CrossRef] [PubMed]
Choukou, M.-A.; et al. Digital Health Technology to Support Health Care Professionals and Family Caregivers Caring for Patients With Cognitive Impairment: Scoping Review. JMIR Ment. Health 2023, 10, e40330. [Google Scholar] [CrossRef] [PubMed]
Ruggiano, N.; et al. Chatbots to Support People With Dementia and Their Caregivers: Systematic Review of Functions and Quality. J. Med. Internet Res. 2021, 23(6), e25006. [Google Scholar] [CrossRef] [PubMed]
Morales-de-Jesús, V.; et al. Conversational System as Assistant Tool in Reminiscence Therapy for People with Early-Stage of Alzheimer’s. Healthcare 2021, 9(8). [Google Scholar] [CrossRef] [PubMed]
Chelli, M.; et al. Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis. J. Med. Internet Res. 2024, 26, e53164. [Google Scholar] [CrossRef] [PubMed]
Klimova, B.; Kacetl, J. Ethical Considerations of AI Use by the Elderly. Int. J. Human–Computer Interact. 2025, 1–12. [Google Scholar]
Asgari, E.; et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digit. Med. 2025, 8(1), 274. [Google Scholar] [CrossRef] [PubMed]
Goh, E.; et al. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw. Open 2024, 7(10), e2440969–e2440969. [Google Scholar] [PubMed]
Tierney, A.; et al. Health Equity in the Era of Large Language Models. Am. J. Manag. Care 2025, 31, 112–117. [Google Scholar] [CrossRef] [PubMed]
Chen, H.; et al. Large language models and global health equity: a roadmap for equitable adoption in LMICs. Lancet Reg. Health – West. Pac. 2025, 63. [Google Scholar]
Omar, M.; et al. Evaluating and addressing demographic disparities in medical large language models: a systematic review. Int. J. Equity Health 2025, 24(1), 57. [Google Scholar] [CrossRef] [PubMed]
Ji, Y.; et al. Mitigating the risk of health inequity exacerbated by large language models. npj Digit. Med. 2025, 8(1), 246. [Google Scholar] [CrossRef] [PubMed]
Solaiman, B. Legal and Ethical Considerations of Artificial Intelligence for Residents in Post-Acute and Long-Term Care. J. Am. Med. Dir. Assoc. 2024, 25(9), 105105. [Google Scholar] [PubMed]
Deusdad, B. Ethical implications in using robots among older adults living with dementia. Front Psychiatry 2024, 15, 1436273. [Google Scholar] [CrossRef] [PubMed]
Liu, S.; McCoy, A.B.; Wright, A. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. J. Am. Med. Inf. Assoc. 2025, 32(4), 605–615. [Google Scholar]
Yang, R.; et al. Retrieval-augmented generation for generative artificial intelligence in health care. npj Health Syst. 2025. 2, 1, 2. [Google Scholar]
Raghu Subramanian, C.; Yang, D.A.; Khanna, R. Enhancing Health Care Communication With Large Language Models—The Role, Challenges, and Future Directions. JAMA Netw. Open 2024, 7(3), e240347–e240347. [Google Scholar] [CrossRef] [PubMed]
Cresswell, K.; et al. Evaluating Artificial Intelligence in Clinical Settings-Let Us Not Reinvent the Wheel. J. Med. Internet Res. 2024, 26, e46407. [Google Scholar] [PubMed]
Cohen, J.P.; et al. Problems in the deployment of machine-learned models in health care. Cmaj 2021, 193(35), E1391–e1394. [Google Scholar] [CrossRef] [PubMed]
Griot, M.; Vanderdonckt, J.; Yuksel, D. Implementation of large language models in electronic health records. PLoS Digit Health 2025. 4, 12, e0001141. [Google Scholar]
Vasey, B.; et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat. Med. 2022, 28(5), 924–933. [Google Scholar] [CrossRef] [PubMed]
Liu, X.; et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat. Med. 2020, 26(9), 1364–1374. [Google Scholar] [CrossRef] [PubMed]
Rivera, S.C.; et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI Extension. BMJ 2020, 370, m3210. [Google Scholar] [PubMed]
Collins, G.S.; et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 2024, 385, e078378. [Google Scholar] [CrossRef] [PubMed]
Huang, R.; et al. Evaluation and Bias Analysis of Large Language Models in Generating Synthetic Electronic Health Records: Comparative Study. J. Med. Internet Res. 2025. 27, e65317. [Google Scholar]
Ong, J.C.L.; et al. Ethical and regulatory challenges of large language models in medicine. Lancet Digit. Health 2024, 6(6), e428–e432. [Google Scholar] [CrossRef] [PubMed]
Fareed, M.; et al. A systematic review of ethical considerations of large language models in healthcare and medicine. Front Digit Health 2025. 7, 1653631. [Google Scholar]
Dino, F.R.; et al. Ethics in digital phenotyping: considerations regarding Alzheimer’s disease, speech and artificial intelligence. J. Med. Ethics 2025. [Google Scholar] [CrossRef] [PubMed]
Diaz, A.; et al. Informed consent in dementia research: how Public Involvement can contribute to addressing “old” and “new” challenges. In Frontiers in Dementia; 2025; pp. 4–2025. [Google Scholar]
Soria, S.T. Patient autonomy in the context of digital health. Bioethics 2025, 39(5), 404–413. [Google Scholar] [CrossRef] [PubMed]
Zhi-Xiang, L.; Lim, W.S.; Chan, E.-Y. Development and Validation of a Multidimensional Short Version Zarit Burden Interview (ZBI-9) for Caregivers of Persons With Cognitive Impairment. Alzheimer Dis. Assoc. Disord. 2023, 37(1). [Google Scholar] [CrossRef] [PubMed]
Seng, B.K.; et al. Validity and reliability of the Zarit Burden Interview in assessing caregiving burden. Ann. Acad. Med. Singap. 2010, 39(10), 758–63. [Google Scholar] [CrossRef] [PubMed]
Cummings, J.L. The Neuropsychiatric Inventory: assessing psychopathology in dementia patients. Neurology 1997, 48((5) Suppl 6, S10–6. [Google Scholar] [PubMed]
Solaiman, B.; et al. A “True Lifecycle Approach” towards governing healthcare AI with the GCC as a global governance model. npj Digit Med. 2025, 8(1), 337. [Google Scholar] [CrossRef] [PubMed]
Jenkins, D.A.; et al. Continual updating and monitoring of clinical prediction models: time for dynamic prediction systems? Diagn. Progn. Res. 2021, 5(1), 1. [Google Scholar] [CrossRef] [PubMed]
Davis, S.E.; et al. Detection of calibration drift in clinical prediction models to inform model updating. J. BioMed Inf. 2020, 112, 103611. [Google Scholar] [CrossRef]
Sahiner, B.; et al. Data drift in medical machine learning: implications and potential remedies. Br. J. Radiol. 2023, 96(1150), 20220878. [Google Scholar] [CrossRef] [PubMed]
Subasri, V.; et al. Detecting and Remediating Harmful Data Shifts for the Responsible Deployment of Clinical AI Models. JAMA Netw. Open 2025, 8(6), e2513685–e2513685. [Google Scholar] [CrossRef] [PubMed]
Lea, A.S.; Jones, D.S. Mind the Gap - Machine Learning, Dataset Shift, and History in the Age of Clinical Algorithms. N Engl. J. Med. 2024, 390(4), 293–295. [Google Scholar] [PubMed]
Mortensen, G.A.; Zhu, R. Early Alzheimer’s Detection Through Voice Analysis: Harnessing Locally Deployable LLMs via ADetectoLocum, a privacy-preserving diagnostic system. AMIA Jt. Summits Transl. Sci. Proc. 2025, 365–374. [Google Scholar] [PubMed]
Lee, B.; et al. Multimodal Alzheimer’s disease recognition from image, text and audio. Sci. Rep. 2025, 15(1), 29038. [Google Scholar] [CrossRef] [PubMed]
Zhang, M.; et al. Multimodal LLM for enhanced Alzheimer’s Disease diagnosis: Interpretable feature extraction from Mini-Mental State Examination data. Exp. Gerontol. 2025. 208, 112812. [Google Scholar]

Figure 1. Tiered Deployment Model for Large Language Models (LLMs) in Alzheimer’s Care. This figure introduces a structured, risk-stratified deployment framework specifically tailored to cognitively vulnerable populations. The model delineates three progressive levels of implementation: Level 1 (Clinician Co-Pilot), in which LLMs function strictly as documentation and summarization support tools under continuous professional oversight; Level 2 (Caregiver Retrieval-Augmented Generation [RAG] Support), where responses are constrained to source-grounded informational assistance for caregivers; and Level 3 (Restricted Patient-Facing Engagement), representing tightly bounded conversational interactions with predefined guardrails and escalation triggers. A visual gradient illustrates increasing autonomy and corresponding risk exposure across tiers, while required oversight intensity decreases proportionally. The conceptual novelty lies in formalizing autonomy boundaries for dementia care, explicitly linking deployment context to vulnerability, supervision, and permissible system capabilities.

Figure 2. Socio-Technical Safeguard Architecture for LLM Implementation in Alzheimer’s Care. This diagram proposes a layered safety architecture integrating model-level safeguards (prompt constraints, retrieval grounding, hallucination detection, audit logging) with human and institutional oversight (clinician review loops, caregiver mediation, governance policies, and escalation pathways). The framework emphasizes that safety is an emergent property of coordinated human-AI interaction rather than a function of model performance alone. Unlike generic AI governance models, this architecture is adapted to dementia contexts, where impaired insight, caregiver mediation, and fluctuating decisional capacity require structured supervision. The bidirectional feedback loops depicted in the model illustrate continuous monitoring and iterative recalibration, underscoring a systems-based approach to deployment.

Figure 3. Failure Modes Specific to LLM Use in Alzheimer’s Care. This schematic identifies and contextualizes failure pathways particularly salient in dementia care settings. Hallucination may lead to caregiver misinterpretation and unnecessary emergency department utilization. Prompt injection or adversarial manipulation may generate unsafe medication advice or harmful instructions. Automation bias may contribute to clinician oversight failure, especially when AI-generated documentation or summaries are insufficiently scrutinized. By mapping upstream model vulnerabilities to downstream clinical consequences, this figure advances a context-specific taxonomy of failure modes in cognitively vulnerable populations and highlights the need for layered technical and human safeguards to mitigate care-critical risks.

Figure 4. Risk-Autonomy Matrix for LLM Applications in Dementia Care. This quadrant-based matrix conceptualizes LLM applications along two intersecting dimensions: degree of system autonomy and potential clinical harm. Applications such as documentation support occupy the low-autonomy/low-risk quadrant, whereas medication-related guidance or patient-facing conversational systems trend toward higher autonomy and risk. The model operationalizes proportional oversight principles by mapping safeguard intensity to quadrant location. Its novelty lies in translating abstract AI risk discourse into a deployment decision tool specific to dementia care, where patient vulnerability amplifies the consequences of automation-related errors.

Table 1. Multidimensional Evaluation Framework for Large Language Models in Alzheimer’s Care.

Evaluation Domain	Core Questions	Primary Metrics / Outcomes	Alzheimer’s-Specific Considerations	Recommended Study Designs & Reporting Standards
Technical Validity	Does the system produce factually accurate, complete, and reproducible outputs?	Accuracy; hallucination rate; omission rate; factual verification score; calibration (for predictive models)	Omissions (e.g., missed delirium or fall precautions) may be as harmful as fabricated content; overconfident tone increases risk in cognitively vulnerable users	Blinded expert adjudication; structured hallucination/omission audits; external validation; TRIPOD+AI for prediction models (37)
Clinical Safety	Does use of the system avoid introducing new safety risks or delaying appropriate escalation?	Safety-critical error rate; escalation accuracy; unsafe advice frequency; clinician override rate	Caregivers may operationalize outputs as care plans; escalation failures (e.g., failure to recommend urgent evaluation) carry amplified harm	Controlled before–after studies; pragmatic trials; DECIDE-AI early-stage evaluation (34); CONSORT-AI for interventional trials (35)
Usability & Cognitive Alignment	Are outputs understandable, usable, and aligned with caregiver and patient cognitive capacity?	Readability scores; caregiver comprehension; System Usability Scale (SUS); time saved; satisfaction	Literacy variability, caregiver stress, fluctuating decisional capacity; simplified language must not sacrifice safety-critical detail	Mixed-method usability studies; caregiver-supervised feasibility trials; SPIRIT-AI for protocol transparency (36)
Equity & Fairness	Does performance remain stable across demographic, linguistic, and literacy subgroups?	Subgroup error rates; disparity in unsafe outputs; calibration across strata; qualitative cultural alignment assessment	Dementia diagnosis and caregiving burden already show disparities; uneven performance may exacerbate inequities	Stratified validation; fairness audits; predefined subgroup analyses (15, 16)
Privacy, Consent & Governance	Are sensitive narratives protected, and are role boundaries clearly defined?	Data minimization adherence; access control integrity; audit logs; consent documentation clarity	AD care includes sensitive home narratives and caregiver stress disclosures; patient capacity may fluctuate	Privacy impact assessments; governance review; role-based workflow testing (12, 17, 18)
Implementation & Monitoring	Does system performance remain stable over time and across care settings?	Post-deployment hallucination drift; update-related failure modes; user feedback trends; sustained adoption	Workflow shifts (ED discharge, hospitalization, community care) may induce domain drift; updates can introduce new risks	Continuous monitoring with periodic audits; time-split and multi-site validation; lifecycle governance frameworks (31, 32)
Patient-Facing Safety (if applicable)	Does direct interaction avoid confusion, dependency, or harmful advice?	Confusion/distress episodes; dependency indicators; unsafe directive frequency; escalation compliance	Cognitive vulnerability increases risk of emotional dependence or misinterpretation of authoritative tone	Short supervised feasibility trials; conservative scope limitation; predefined escalation triggers (18, 43)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.