Background/Objectives: Clinical phenotyping from narrative electronic health records (EHRs) often relies on multi-stage pipelines with span-level extraction, ontology mapping, and aggregation, which are complex to develop and maintain. Large language models (LLMs) may enable direct note-level abstraction of clinically meaningful features without intermediate extraction steps. We evaluated whether an LLM can approximate human inter-rater agreement on note-level multiple sclerosis (MS) phenotyping.

Methods: We analyzed 100 de-identified MS neurology progress notes from a single academic medical center, each annotated for the presence or absence of 17 predefined neurological phenotype features (e.g., weakness, sensory, gait, pain, cognition, bladder). Two human annotators independently labeled all notes using a multi-label note-level framework in Prodigy, and discrepancies between the annotators were adjudicated to create a gold standard. The same notes were annotated in a zero-shot setting by an LLM (GPT-5) and by the dictionary-based Doc2Hpo system. We computed percent agreement, Cohen’s κ, and precision, recall, and F1 scores for each annotator relative to the gold standard.

Results: Human–human agreement was substantial across most phenotype domains, with Cohen’s κ typically between 0.61 and 0.84, and lower agreement for infrequent features such as spasticity and hyperreflexia. Agreement between the LLM and the human annotators was comparable to human inter-rater agreement across many features. Relative to the gold standard, the LLM showed recall equal to or higher than that of the human annotators for most phenotypes, with overall F1 scores similar to human performance, whereas Doc2Hpo demonstrated lower precision and recall.

Conclusions: In this note-level MS phenotyping task, the LLM achieved performance approaching expert human inter-rater agreement, with particularly strong recall across multiple phenotype domains.
These findings suggest that direct note-level abstraction of clinically meaningful phenotypes from narrative neurology notes is feasible and may offer a scalable complement to traditional span-oriented extract–map–aggregate pipelines for population-level phenotyping and downstream machine learning applications.
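The evaluation metrics named in the Methods can be computed per phenotype feature from the binary note-level labels. The sketch below is illustrative only (the function name and label encoding are assumptions, not the study's actual code): it takes the gold-standard labels and one annotator's labels for a single feature, encoded as 1 (present) or 0 (absent), and returns percent agreement, Cohen's κ with chance agreement estimated from marginal label frequencies, and precision, recall, and F1 treating the gold standard as reference.

```python
def agreement_metrics(gold, pred):
    """Per-feature agreement metrics for binary note-level labels.

    gold, pred: equal-length lists of 0/1 labels (1 = feature present).
    Illustrative sketch; not the study's actual evaluation code.
    """
    n = len(gold)
    assert n == len(pred) and n > 0

    # Observed agreement: fraction of notes with identical labels.
    po = sum(g == p for g, p in zip(gold, pred)) / n

    # Expected chance agreement from each rater's marginal frequencies.
    g1 = sum(gold) / n
    p1 = sum(pred) / n
    pe = g1 * p1 + (1 - g1) * (1 - p1)
    kappa = (po - pe) / (1 - pe) if pe < 1 else 1.0

    # Precision / recall / F1 with gold as the reference standard.
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"agreement": po, "kappa": kappa,
            "precision": precision, "recall": recall, "f1": f1}
```

Running this once per feature and once per annotator (human, LLM, Doc2Hpo) against the adjudicated gold standard yields the comparison reported in the Results.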