Evaluating the Reliability and Agreement of Rubric-Guided LLM Scoring Versus Human Grading Across Three University Courses

Howard Kim; Sung-Tae Lee; Jongwon Lee

doi:10.20944/preprints202605.1284.v1

Submitted: 19 May 2026

Posted: 19 May 2026

Abstract

This study examines whether a rubric-guided large language model (LLM) can approximate local human grading practice for text-based responses in three university courses. A total of 930 student responses from Prompt Engineering, Photoshop Design, and AI Video Production were scored by two human instructors and by ChatGPT using the same five-criterion analytic rubric (Accuracy, Logical Flow, Specificity, Quality, and Originality; 0.0–3.0 each; Total 0–15). Human consensus (HC) was defined as the mean of the two human scores and was treated as a pragmatic reference rather than ground truth. Pairwise agreement among H1, H2, AI, and HC was evaluated using ICC(3,1), Pearson correlations, mean absolute error (MAE), and Bland–Altman bias and limits of agreement (LoA); a course-specific held-out calibration analysis was additionally conducted. On the Total score, human–human agreement was strong (ICC = 0.819 [0.797, 0.839]). AI–H1 and AI–H2 Total-score agreement were ICC = 0.700 [0.666, 0.732] and 0.767 [0.739, 0.792], respectively, while AI–HC agreement was ICC = 0.763 [0.735, 0.789], with MAE = 1.603 and LoA = [−4.246, 4.045]. At the trait level, AI–HC ICCs exceeded H1–H2 ICCs for all five rubric dimensions, although Quality remained weakly defined in the human baseline. On a 70/30 held-out test split, a course-specific linear calibration modestly improved Total-score ICC from 0.774 to 0.782 and reduced MAE from 1.624 to 1.215, narrowing the LoA from [−4.290, 4.188] to [−3.157, 3.329]. However, threshold-adjacent agreement remained imperfect after calibration. The findings concern written responses only and support a conservative conclusion: rubric-guided LLM scoring can assist human grading under fixed local rubrics, but the current evidence supports calibrated human–AI co-grading rather than unsupervised replacement.
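The agreement statistics named in the abstract can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names and the toy scores are hypothetical, ICC(3,1) is computed with the standard two-way, consistency, single-rater formula for two raters, and the limits of agreement assume the conventional bias ± 1.96 × sample SD of the paired differences. Confidence intervals and the held-out calibration step are omitted.

```python
from statistics import mean, stdev


def icc_3_1(x, y):
    """ICC(3,1) for two raters: two-way mixed effects, consistency, single rater."""
    n, k = len(x), 2
    rows = list(zip(x, y))
    grand = mean(v for row in rows for v in row)
    row_means = [mean(row) for row in rows]
    col_means = [mean(x), mean(y)]
    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ms_rows = ss_rows / (n - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)


def mae(x, y):
    """Mean absolute error between two paired score lists."""
    return mean(abs(a - b) for a, b in zip(x, y))


def bland_altman(x, y):
    """Return (bias, (lower LoA, upper LoA)) for the differences x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)  # assumption: sample SD, 1.96 multiplier
    return bias, (bias - half_width, bias + half_width)


# Hypothetical Total scores (0-15) for six responses; not data from the study.
h1 = [10.0, 12.5, 8.0, 14.0, 6.5, 11.0]
h2 = [11.0, 12.0, 9.0, 13.5, 7.5, 10.0]
ai = [10.5, 13.0, 9.5, 14.5, 8.0, 11.5]

# Human consensus (HC) as the mean of the two human scores, as in the abstract.
hc = [(a + b) / 2 for a, b in zip(h1, h2)]

print("ICC(3,1) H1-H2:", icc_3_1(h1, h2))
print("ICC(3,1) AI-HC:", icc_3_1(ai, hc))
print("MAE AI-HC:", mae(ai, hc))
print("Bland-Altman AI-HC:", bland_altman(ai, hc))
```

Because ICC(3,1) measures consistency rather than absolute agreement, a constant offset between raters does not lower it, which is one reason to report MAE and Bland–Altman bias alongside it.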

Keywords: large language models; rubric-based grading; educational assessment; human–AI agreement; intraclass correlation coefficient; Bland–Altman analysis; text-response scoring; co-grading

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
