This study examines whether a rubric-guided large language model (LLM) can approximate local human grading practice for text-based responses in three university courses. A total of 930 student responses from Prompt Engineering, Photoshop Design, and AI Video Production were scored by two human instructors (H1 and H2) and by ChatGPT (AI) using the same five-criterion analytic rubric (Accuracy, Logical Flow, Specificity, Quality, and Originality; 0.0–3.0 each; Total 0–15). Human consensus (HC) was defined as the mean of the two human scores and was treated as a pragmatic reference rather than ground truth. Pairwise agreement among H1, H2, AI, and HC was evaluated using ICC(3,1), Pearson correlations, mean absolute error (MAE), and Bland–Altman bias and limits of agreement (LoA); a course-specific held-out calibration analysis was also conducted. On the Total score, human–human agreement was strong (ICC = 0.819 [0.797, 0.839]). Total-score agreement was ICC = 0.700 [0.666, 0.732] for AI–H1 and ICC = 0.767 [0.739, 0.792] for AI–H2, while AI–HC agreement was ICC = 0.763 [0.735, 0.789], with MAE = 1.603 and LoA = [−4.246, 4.045]. At the trait level, AI–HC ICCs exceeded H1–H2 ICCs for all five rubric dimensions, although Quality remained weakly defined in the human baseline. On the held-out test portion of a 70/30 split, a course-specific linear calibration modestly improved Total-score ICC from 0.774 to 0.782, reduced MAE from 1.624 to 1.215, and narrowed the LoA from [−4.290, 4.188] to [−3.157, 3.329]. However, threshold-adjacent agreement remained imperfect after calibration. The findings concern written responses only and support a conservative conclusion: rubric-guided LLM scoring can assist human grading under fixed local rubrics, but the current evidence supports calibrated human–AI co-grading rather than unsupervised replacement.
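For readers who wish to reproduce the agreement statistics named above, the following is a minimal sketch of how ICC(3,1), MAE, Bland–Altman bias with LoA, and a linear calibration could be computed for two score vectors. It is illustrative only and is not the study's analysis code. The function names, the difference convention (first rater minus second), the 1.96 multiplier for the LoA, and the ordinary least-squares form of the calibration are all assumptions; the toy scores are hypothetical and are not study data.

```python
from statistics import mean, stdev


def icc_3_1(a, b):
    """ICC(3,1): two-way mixed effects, consistency, single rater, for k = 2 raters."""
    n, k = len(a), 2
    rows = list(zip(a, b))
    grand = mean(list(a) + list(b))
    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((mean(r) - grand) ** 2 for r in rows)
    ss_cols = n * sum((mean(c) - grand) ** 2 for c in (a, b))
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)


def mae(a, b):
    """Mean absolute error between two score vectors."""
    return mean([abs(x - y) for x, y in zip(a, b)])


def bland_altman(a, b):
    """Return (bias, (lower LoA, upper LoA)) for the differences a - b.

    Assumption: LoA = bias +/- 1.96 * SD of the differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)


def fit_linear_calibration(ai_train, hc_train):
    """Least-squares fit of hc ~ intercept + slope * ai on a training split."""
    mx, my = mean(ai_train), mean(hc_train)
    sxy = sum((x - mx) * (y - my) for x, y in zip(ai_train, hc_train))
    sxx = sum((x - mx) ** 2 for x in ai_train)
    slope = sxy / sxx
    return slope, my - slope * mx


def calibrate(ai_scores, slope, intercept):
    """Apply a fitted calibration to AI scores (e.g. the held-out test split)."""
    return [intercept + slope * x for x in ai_scores]


if __name__ == "__main__":
    # Hypothetical Total scores (0-15) for illustration only.
    ai = [9.0, 11.5, 7.0, 13.0, 10.0, 8.5]
    hc = [8.0, 12.0, 6.5, 12.5, 11.0, 7.0]
    slope, intercept = fit_linear_calibration(ai[:4], hc[:4])  # "train" part
    ai_cal = calibrate(ai[4:], slope, intercept)               # "test" part
    print(icc_3_1(ai, hc), mae(ai, hc), bland_altman(ai, hc))
    print(ai_cal)
```

Under a course-specific scheme such as the one described, `fit_linear_calibration` would be run once per course on that course's training responses, and the fitted slope and intercept applied only to the same course's held-out responses before the agreement statistics are recomputed.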