Preprint
Article

This version is not peer-reviewed.

Evaluating the Reliability and Agreement of Rubric-Guided LLM Scoring Versus Human Grading Across Three University Courses

Submitted: 19 May 2026

Posted: 19 May 2026


Abstract
This study examines whether a rubric-guided large language model (LLM) can approximate local human grading practice for text-based responses in three university courses. A total of 930 student responses from Prompt Engineering, Photoshop Design, and AI Video Production were scored by two human instructors and by ChatGPT using the same five-criterion analytic rubric (Accuracy, Logical Flow, Specificity, Quality, and Originality; 0.0–3.0 each; Total 0–15). Human consensus (HC) was defined as the mean of the two human scores and was treated as a pragmatic reference rather than ground truth. Pairwise agreement among H1, H2, AI, and HC was evaluated using ICC(3,1), Pearson correlations, mean absolute error (MAE), and Bland–Altman bias and limits of agreement (LoA); a course-specific held-out calibration analysis was additionally conducted. On the Total score, human–human agreement was strong (ICC = 0.819 [0.797, 0.839]). AI–H1 and AI–H2 Total-score agreement were ICC = 0.700 [0.666, 0.732] and 0.767 [0.739, 0.792], respectively, while AI–HC agreement was ICC = 0.763 [0.735, 0.789], with MAE = 1.603 and LoA = [−4.246, 4.045]. At the trait level, AI–HC ICCs exceeded H1–H2 ICCs for all five rubric dimensions, although Quality remained weakly defined in the human baseline. On a 70/30 held-out test split, a course-specific linear calibration modestly improved Total-score ICC from 0.774 to 0.782 and reduced MAE from 1.624 to 1.215, narrowing the LoA from [−4.290, 4.188] to [−3.157, 3.329]. However, threshold-adjacent agreement remained imperfect after calibration. The findings concern written responses only and support a conservative conclusion: rubric-guided LLM scoring can assist human grading under fixed local rubrics, but the current evidence supports calibrated human–AI co-grading rather than unsupervised replacement.
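The agreement statistics named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' analysis code. The function names, the sign convention for Bland-Altman differences (first rater minus second), the least-squares form of the linear calibration, and the clipping of calibrated totals to the 0 to 15 rubric range are all assumptions made here for illustration.

```python
import numpy as np


def icc3_1(ratings):
    """ICC(3,1): two-way mixed effects, consistency, single rater.

    `ratings` is an (n_responses, k_raters) array with no missing values.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Partition the total sum of squares into responses, raters, and residual.
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_total = np.sum((x - grand) ** 2)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


def mae(a, b):
    """Mean absolute error between two score vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean(np.abs(a - b)))


def bland_altman(a, b):
    """Return (bias, lower LoA, upper LoA) for the differences a - b.

    Assumption: limits of agreement are bias +/- 1.96 sample standard deviations.
    """
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias = float(d.mean())
    sd = float(d.std(ddof=1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def fit_linear_calibration(ai_train, ref_train):
    """Least-squares fit of ref ~ slope * ai + intercept on a training split."""
    slope, intercept = np.polyfit(
        np.asarray(ai_train, dtype=float), np.asarray(ref_train, dtype=float), 1
    )
    return float(slope), float(intercept)


def apply_calibration(ai_scores, slope, intercept, lo=0.0, hi=15.0):
    """Map raw AI totals onto the reference scale.

    Assumption: results are clipped to the rubric's Total range (0 to 15).
    """
    raw = slope * np.asarray(ai_scores, dtype=float) + intercept
    return np.clip(raw, lo, hi)
```

Under these assumptions, a course-specific calibration would fit `fit_linear_calibration` separately on each course's 70% training split, apply the resulting slope and intercept to that course's 30% held-out AI totals, and then recompute `icc3_1`, `mae`, and `bland_altman` against human consensus. Because ICC(3,1) measures consistency rather than absolute agreement, a constant offset between raters leaves it unchanged, which is why MAE and the Bland-Altman bias are reported alongside it.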
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated