Submitted: 07 October 2025
Posted: 08 October 2025
Abstract
Keywords:
Introduction
- Medical expertise: To be able to critically evaluate the validity of LLM-generated output.
- Ethics expertise: To identify potential risks and to address instances of ethical dilemmas and violations of medical-ethical norms.
- Practical knowledge: Familiarity with LLMs, by informed and critical use.
- Theoretical knowledge: Knowledge of how LLMs operate, which allows for a more nuanced evaluation of LLM-generated content.
Objectives
Methods
Theoretical and Practical Framework
Model Responses
- Anticipating contextual focus in medical reasoning.
- Explaining “generic” or “textbook” responses.
- Understanding strengths and weaknesses in differential diagnosis.
- Explaining ambiguous or contradictory responses.
- Identifying hallucinations in unfamiliar scenarios.
Ethical Considerations
Results
Key Topic 1: Anticipating Contextual Focus in Medical Reasoning
Key Topic 2: Explaining “Generic” or “Textbook” Responses
Key Topic 3: Understanding Strengths and Weaknesses in Differential Diagnosis
Key Topic 4: Explaining Ambiguous or Contradictory Responses
Key Topic 5: Identifying Hallucinations in Unfamiliar Scenarios
Discussion
Conclusions
Conflict of Interest
References
- Klang E, Tessler I, Freeman R, Sorin V, Nadkarni GN. If Machines Exceed Us: Health Care at an Inflection Point. NEJM AI. 2024;1(10):AIP2400559.
- Ranji SR. Large Language Models—Misdiagnosing Diagnostic Excellence? JAMA Network Open. 2024;7(10):e2440901.
- McCoy LG, Ci Ng FY, Sauer CM, et al. Understanding and training for the impact of large language models and artificial intelligence in healthcare practice: a narrative review. BMC Medical Education. 2024;24(1):1096.
- Zhang L, Tashiro S, Mukaino M, Yamada S. Use of artificial intelligence large language models as a clinical tool in rehabilitation medicine: a comparative test case. Journal of Rehabilitation Medicine. 2023;55:jrm13373.
- Wu Y, Mao K, Zhang Y, Chen J. CALLM: Enhancing Clinical Interview Analysis Through Data Augmentation With Large Language Models. IEEE Journal of Biomedical and Health Informatics. 2024;28(12):7531-7542.
- Hamm RM. Physicians neglect base rates, and it matters. Behavioral and Brain Sciences. 1996;19(1):25-26.
- Fuellen G, Kulaga A, Lobentanzer S, et al. Validation requirements for AI-based intervention-evaluation in aging and longevity research and practice. Ageing Research Reviews. 2025;104:102617.
| GPT-o1 explanation for the key topic | Consensus.app explanation for the example | Quotes from the paper (Zhang et al.) | Synopsis |
|---|---|---|---|
| (1) Anticipating contextual focus in medical reasoning | |||
| “Transformers use attention mechanisms to "focus" on specific parts of the input sequence. In a medical dialogue, the model might concentrate heavily on a particular symptom, lab value, or part of a patient’s history. Understanding this can help clinicians and medical researchers appreciate why a model’s conclusion might hinge on early-mentioned or highly salient details rather than a broad synthesis of all information.” | “Understanding the architecture of transformer models, particularly the attention mechanism, can help anticipate how these models focus on specific parts of the input text when generating medical reasoning. For instance, in the context of rehabilitation medicine, ChatGPT-4 was able to generate comprehensive medical and rehabilitation prescriptions by focusing on relevant parts of the input case description, demonstrating its ability to prioritize contextually important information.” | “[the LLM] also demonstrated the ability to propose specified approaches for each rehabilitation therapy.” | LLMs can indeed prioritize specific characteristics of a case or patient, and complex scenarios can be addressed through the parallel operations of multiple attention heads. While this capability offers advantages in focusing on critical aspects, it may also have drawbacks, particularly when LLMs fail to correctly integrate and synthesize the resulting information. |
| (2) Explaining “generic” or “textbook” responses | |||
| “MLP layers in transformers combine information gathered by attention heads into higher-level abstractions. When operating on medical queries, these layers may rely on well-learned, “standardized” patterns from training data (e.g., common guidelines or textbook phrasing) instead of tailoring responses to unusual clinical nuances. Understanding the MLP’s integrating role explains why a model might revert to a generic standard-of-care response even when presented with a complex or unique patient scenario.” | “In the study on rehabilitation medicine, ChatGPT-4 produced broader and more general prescriptions that were consistent with textbook answers, indicating its reliance on learned generic medical knowledge.” | “Compared with standard answers, the large language model generated broader and more general prescriptions in terms of medical problems and management plans, rehabilitation problems and management plans, as well as rehabilitation goals.” | Relying on knowledge that is well represented in the training data (short-head knowledge) can result in “generic” or “textbook” responses. This raises concerns about the adequacy of such responses for atypical cases and patients. |
| (5) Identifying hallucinations in unfamiliar scenarios | |||
| “Transformers are trained on patterns within a certain data distribution. When confronted with rare conditions, novel treatments, or unusual clinical contexts, the model’s learned patterns may not apply. Attention could be misdirected, and the MLP layers might produce “hallucinated” content because they have no solid internal representation for the out-of-distribution input.” | “[...] while ChatGPT-4 made an error in the ICF category, it accurately generated ICF codes, highlighting the model's potential to hallucinate in less familiar contexts.” | “A thorough review of the standard clinical ICF code assigned by 2 PMR clinicians was then conducted, comparing it with the table produced by the GPT-4 model (Table II). The 3-digit codes generated by the LLM were accurate (...) However, an error was found when reviewing the case record in the body structures category (s730). The patient had had a stroke, and the original impairment should have been classified as affecting the right precentral gyrus (s110.1), as outlined in the case section. Instead, the table displayed the damage as being in “the upper extremity, left hand.”” | LLM responses may exhibit hallucinations when referring to “long tail” knowledge that is not well-represented in the training data. This is hypothesized to be the case for the “body structures category”. Then again, the LLM-generated explanations in this table are not necessarily correct either. A simpler hypothesis regarding the LLM failure is that it did not know or did not consider that the “body structures category” is supposed to refer to the primary site of damage (the brain), not to the secondary site (the hand). Any lack of knowledge regarding the reporting of ICF categories may thus be attributed to insufficient training data regarding this meta-level information. |
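
The attention-based explanations in the table above can be made concrete with a minimal sketch of scaled dot-product attention for a single query. It is an illustration only: the token list, the two-dimensional embeddings, and the query vector are hypothetical values chosen by hand, not quantities taken from GPT-4 or from the case in Zhang et al. The sketch shows how a query that aligns with one token concentrates most of the attention weight on that token.

```python
import math


def softmax(scores):
    """Convert raw scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Hypothetical tokens and hand-picked 2-d key vectors (illustrative only).
tokens = ["stroke", "precentral", "gyrus", "left", "hand"]
keys = [[2.0, 0.1], [0.2, 0.3], [1.5, 0.4], [0.1, 0.2], [0.3, 1.8]]

# Hypothetical query vector that happens to align with the key for "hand".
query = [0.2, 2.5]

weights = attention_weights(query, keys)
top_token = tokens[weights.index(max(weights))]

for token, weight in zip(tokens, weights):
    print(f"{token:<12}{weight:.3f}")
print("most attended token:", top_token)
```

In a real transformer the keys and queries are learned, there are many heads and layers, and attention weights alone do not fully explain an output, so this sketch should be read as intuition rather than as an account of the reported error.
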
| GPT-o1 explanation for the key topic | Consensus.app explanation for the example | Quotes from the paper (Wu et al.) | Synopsis |
|---|---|---|---|
| (3) Understanding strengths and weaknesses in differential diagnosis | |||
| “Attention layers help identify connections between symptoms and conditions, while MLP layers synthesize these into coherent outputs. Knowing this pipeline is useful when the model suggests a differential diagnosis. If the model posits an unusual condition, it might be because it latched onto a distinctive symptom that strongly correlated with that condition in its training data—even if that condition is clinically improbable.” | “The strengths of transformer models in differential diagnosis can be attributed to their ability to synthesize information from diverse sources, while weaknesses may arise from their lack of real-world clinical experience. The CALLM framework, for example, enhances clinical interview analysis by generating synthetic data that can improve diagnostic accuracy, showcasing the model's adaptability in learning from augmented datasets.” | “In automated mental health diagnosis, the scarcity and imbalance of clinical data pose considerable challenges for researchers, limiting the effectiveness of machine learning algorithms. To cope with this issue, this paper aims to introduce a novel clinical transcript data augmentation framework by leveraging large language models (CALLM). The framework follows a “patient-doctor role-playing” intuition to generate realistic synthetic data.” | A hypothesis about how LLMs handle differential diagnoses is that multi-head attention may be responsible for the matching of patient data to the sets of symptoms known for disease conditions, but this matching may ignore disease prevalence. Synthetic data may mitigate this weakness because researchers can generate examples following a data distribution they have under control, and provide these examples to the LLM. |
| (4) Explaining ambiguous or contradictory responses | |||
| “When the patient’s presentation is ambiguous or the prompts contain conflicting information, attention mechanisms may distribute focus across multiple, equally plausible interpretations. The MLP layers may fail to resolve these into a single, authoritative answer. Understanding this helps users interpret uncertain or oscillating responses as a reflection of the model’s internal struggle with ambiguity rather than mere randomness.” | “Ambiguities or contradictions in model outputs can often be traced back to the model's training data or the inherent complexity of medical language. The CALLM framework's use of a "Response-Reason" prompt engineering paradigm aims to generate diagnostically valuable transcripts, which can help mitigate such issues by providing clearer reasoning paths in the model's responses.” | “Our “Response-Reason” prompting approach guides LLMs in generating highly authentic clinical interview transcripts for mental disorder diagnosis. This augmentation is tailored to enhance the training dataset, facilitating both FSL[Few-Shot-Learning] and, in certain cases, ZSL[Zero-Shot-Learning].” “This technique [...] encouraged it to elucidate the rationale behind the responses, mirroring the profile and characteristics of a simulated patient.” | Contradictory responses can be attributed to ambiguous input, ambiguity within the training data, or ambiguity in the representation of knowledge by the trained model. Specialized prompting techniques may request that the reasoning path of the LLM be made more transparent, enhancing its reasoning capabilities along the way. |
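
The synopsis for key topic (3) notes that matching patient data to symptom sets may ignore disease prevalence. The following minimal sketch contrasts a ranking by symptom match alone with a ranking that weights the match by prevalence, following Bayes' rule. All condition names and probabilities are hypothetical and chosen only to make the reversal visible; they are not taken from Wu et al. or from any clinical source.

```python
# Hypothetical candidate diagnoses for one observed finding (illustrative only).
candidates = {
    "rare_condition": {"prevalence": 0.001, "p_finding_given_disease": 0.95},
    "common_condition": {"prevalence": 0.10, "p_finding_given_disease": 0.60},
}


def rank_by_match(conditions):
    """Rank by how well the finding fits each condition, ignoring prevalence."""
    return sorted(
        conditions,
        key=lambda name: conditions[name]["p_finding_given_disease"],
        reverse=True,
    )


def rank_by_posterior(conditions):
    """Rank by prevalence times likelihood, which is proportional to the posterior."""
    def unnormalized_posterior(name):
        entry = conditions[name]
        return entry["prevalence"] * entry["p_finding_given_disease"]

    return sorted(conditions, key=unnormalized_posterior, reverse=True)


print("match only:    ", rank_by_match(candidates))
print("with base rate:", rank_by_posterior(candidates))
```

With these assumed numbers, the rare condition ranks first on match alone, while the common condition ranks first once prevalence is taken into account. This is the base-rate effect described by Hamm for physicians, and the synopsis hypothesizes that a similar pattern may occur in LLMs.
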
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).