Background/Objectives: Clinical phenotyping from narrative electronic health records (EHRs) often relies on multi-stage pipelines with span-level extraction, ontology mapping, and aggregation, which are complex to develop and maintain. Large language models (LLMs) may enable direct note-level abstraction of clinically meaningful features without intermediate extraction steps. We evaluated whether an LLM can approximate human inter-rater agreement on note-level multiple sclerosis (MS) phenotyping.

Methods: We analyzed 100 de-identified MS neurology progress notes from a single academic medical center, each annotated for the presence or absence of 17 predefined neurological phenotype features (e.g., weakness, sensory, gait, pain, cognition, bladder). Two human annotators independently labeled all notes using a multi-label note-level framework in Prodigy, and discrepancies between the annotators were adjudicated to create a gold standard. The same notes were annotated in a zero-shot setting by an LLM (GPT-5) and by the dictionary-based Doc2Hpo system. We computed percent agreement, Cohen’s κ, and precision, recall, and F1 scores for each annotator relative to the gold standard.

Results: Human–human agreement was substantial across most phenotype domains, with Cohen’s κ typically between 0.61 and 0.84, and lower agreement for infrequent features such as spasticity and hyperreflexia. Agreement between the LLM and the human annotators was comparable to human inter-rater agreement across many features. Relative to the gold standard, the LLM showed recall equal to or higher than that of the human annotators for most phenotypes, with overall F1 scores similar to human performance, whereas Doc2Hpo demonstrated lower precision and recall.

Conclusions: In this note-level MS phenotyping task, the LLM achieved performance approaching expert human inter-rater agreement, with particularly strong recall across multiple phenotype domains.
These findings suggest that direct note-level abstraction of clinically meaningful phenotypes from narrative neurology notes is feasible and may offer a scalable complement to traditional span-oriented extract–map–aggregate pipelines for population-level phenotyping and downstream machine learning applications.
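The evaluation metrics named in the Methods can be computed per phenotype feature from the binary note-level labels. The sketch below is illustrative only (the function name and label encoding are assumptions, not the study's actual code): it takes the gold-standard labels and one annotator's labels for a single feature, encoded as 1 (present) or 0 (absent), and returns percent agreement, Cohen's κ with chance agreement estimated from marginal label frequencies, and precision, recall, and F1 treating the gold standard as reference.

```python
def agreement_metrics(gold, pred):
    """Per-feature agreement metrics for binary note-level labels.

    gold, pred: equal-length lists of 0/1 labels (1 = feature present).
    Illustrative sketch; not the study's actual evaluation code.
    """
    n = len(gold)
    assert n == len(pred) and n > 0

    # Observed agreement: fraction of notes with identical labels.
    po = sum(g == p for g, p in zip(gold, pred)) / n

    # Expected chance agreement from each rater's marginal frequencies.
    g1 = sum(gold) / n
    p1 = sum(pred) / n
    pe = g1 * p1 + (1 - g1) * (1 - p1)
    kappa = (po - pe) / (1 - pe) if pe < 1 else 1.0

    # Precision / recall / F1 with gold as the reference standard.
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"agreement": po, "kappa": kappa,
            "precision": precision, "recall": recall, "f1": f1}
```

Running this once per feature and once per annotator (human, LLM, Doc2Hpo) against the adjudicated gold standard yields the comparison reported in the Results.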