1. Introduction
Studies find that large language models (LLMs) can grade coding tasks [1], short answers to open questions [2,3], and student essays [4,5] with respectable performance. Grading open forms of assessment manually is time-consuming and tedious, yet such assessments are often better measures of student ability than multiple-choice items [2,6]. Using LLMs for grading would allow teachers to rely on open forms of assessment while saving considerable time [1,2]. Teachers could use this time to improve educational materials or to tutor students, making LLM grading a win-win for students and teachers alike. However, despite promising early findings, it remains unclear whether LLMs can grade accurately and fairly across assignments, courses, and programs, or whether they are only accurate for specific cases. Consequently, multiple authors advise against fully automating the grading process at this point [2,7,8]. This sentiment is mirrored by the European Union's artificial intelligence (AI) act, which classifies AI grading as "high risk" and mandates human oversight [9]. To address this, we introduce SURE (Selective Uncertainty-based Re-Evaluation), a human-in-the-loop pipeline that combines automated LLM grading with uncertainty-based flagging and human review.
Specifically, we propose repeatedly prompting LLMs to score the same student answer to obtain a distribution of candidate scores, from which to derive a predicted score (e.g., mean, median, or mode) and a certainty estimate (e.g., standard deviation or entropy). Any low-certainty scores (e.g., those falling below a threshold) can then be flagged and later graded manually by a teacher. Repeatedly sampling from LLMs and aggregating their outputs serves two purposes. First, aggregating results across prompts may improve grading accuracy: instead of relying on a single score, multiple samples might allow the model's judgments to converge toward a more reliable estimate. In line with this assumption, previous studies report that repeated prompting can sometimes improve LLM grading [10,11]. Second, by examining the variability across samples, we aim to quantify uncertainty and flag questionable grades. We assume that when a grading task falls well within the LLM's training distribution, it will consistently assign the correct score, or at least do so on average across repeated samples. In contrast, when a task is underrepresented, ambiguous, or absent from the training data, we expect greater variation in the scores, as the model may hallucinate or explore multiple plausible solutions rather than settling on a single, well-defined answer. This idea parallels the "self-consistency" approach introduced by Wang et al. [12], who showed that aggregating answers from multiple reasoning paths not only improves overall accuracy but that the level of agreement among samples can serve as a measure of uncertainty. In line with their findings, recent papers have successfully used self-consistency-based uncertainty metrics to improve LLM performance in question answering [13,14,15].
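The core of this procedure can be sketched in a few lines of Python. The score_with_llm function below is a hypothetical wrapper around a single grading prompt, and the modal score with its relative frequency stands in for the prediction and certainty estimate; this is an illustrative sketch, not the study's actual implementation.

```python
from collections import Counter

def sure_score(answer, rubric, score_with_llm, n_samples=10, threshold=0.7):
    """Repeatedly prompt an LLM, aggregate the scores, and flag low-certainty cases.

    score_with_llm(answer, rubric) -> one candidate score (hypothetical wrapper).
    Returns (predicted_score, certainty, flag_for_human_review).
    """
    samples = [score_with_llm(answer, rubric) for _ in range(n_samples)]
    counts = Counter(samples)
    predicted, n_modal = counts.most_common(1)[0]   # modal score as the prediction
    certainty = n_modal / n_samples                 # relative frequency of the mode
    return predicted, certainty, certainty < threshold
```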
Combining automated assessment with human review of uncertain cases [16] has been suggested for other high-risk applications, such as medical diagnosis [17] and financial fraud detection [18]. Similarly, Kortemeyer and Nöhl [8] evaluated a related procedure for grading: they obtained ten independent LLM-generated scores per student response, averaged these scores, and compared the mean to predictions from item response theory (IRT) to estimate grading uncertainty and identify responses requiring human review. They found that uncertainty-based thresholding improved LLM grading accuracy for physics exams [8].
The effectiveness of the SURE pipeline we propose depends critically on the diversity of LLM outputs that arises from repeated prompting. If repeated scores are always identical, uncertainty estimates become meaningless, and we cannot reliably distinguish between cases that are easy to grade and those that require human review. We explored several strategies for increasing output diversity by influencing the stochasticity and variability of LLM responses:
First, we varied temperature and top-p parameters to control token-level randomness: lower values make outputs more deterministic, while higher values encourage more varied responses [19,20,21]. For reasoning models that do not expose these parameters, we instead varied text verbosity, which affects the length of responses [22].
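For illustration, repeated grading calls with varied sampling parameters might look roughly as follows with the OpenAI Python client; the model name, prompt construction, and response handling are placeholders rather than the study's actual code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def sample_scores(prompt, n=10, temperature=1.0, top_p=1.0, model="gpt-4.1-nano"):
    """Collect n candidate replies by repeating the same grading prompt."""
    replies = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,  # higher values -> more token-level randomness
            top_p=top_p,
        )
        # Raw model reply; parsing it into a numeric score is omitted here
        replies.append(response.choices[0].message.content.strip())
    return replies
```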
Second, we introduced several prompt perturbations designed to elicit different reasoning paths without changing the prompt content. Specifically, we explored shuffling the order of rubric criteria, instructing LLMs to adopt different grader personas (e.g., strict vs. lenient), and prompting them in different languages. Critically, our goal here is not to investigate how LLM grading fares under different prompting conditions or in multilingual educational contexts; our aim is only to increase output variability. Prior work shows that LLMs are sensitive not only to a prompt's semantic content but also to its phrasing and presentation [23,24,25], and that leveraging such diversification can improve uncertainty estimation [26]. These effects may be especially pronounced in cases where the model has not converged on a stable reasoning path, and may help reveal those instances in which its grading is unreliable.
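A minimal sketch of such perturbations follows, with hypothetical persona wordings and languages standing in for the actual prompt variants used in the study.

```python
import random

# Illustrative persona and language variants (hypothetical wording, not the study prompts)
PERSONAS = [
    "You are a strict grader who only awards points for fully correct answers.",
    "You are a lenient grader who gives the benefit of the doubt where reasonable.",
]
LANGUAGES = ["English", "Dutch", "German"]

def perturb_prompt(rubric_criteria, answer, rng=random):
    """Build a grading prompt with shuffled rubric order, a random persona, and a random language."""
    criteria = list(rubric_criteria)
    rng.shuffle(criteria)                      # shuffle the rubric criteria
    persona = rng.choice(PERSONAS)             # strict vs. lenient grader persona
    language = rng.choice(LANGUAGES)           # prompt language
    rubric_text = "\n".join(f"- {c}" for c in criteria)
    return (
        f"{persona}\nRespond in {language}.\n"
        f"Score the following student answer against this rubric:\n{rubric_text}\n\n"
        f"Student answer:\n{answer}"
    )
```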
Third, we investigated LLM ensembles, aggregating the outputs of multiple models rather than relying on a single one, to increase output diversity and reduce model-specific biases. This approach builds on the idea of ensemble learning, where combining several imperfect predictors often yields more robust performance, as seen in methods such as bagging and random forests [27]. Similar ideas are now being explored for LLMs [28,29,30], and they might be particularly useful for estimating (un)certainty: because different LLMs are trained on distinct data and optimization objectives, their outputs might vary in informative ways when evaluating the same student response [14,15]. Aggregating these diverse perspectives might stabilize majority-voted scores, especially when certain models are better suited to specific response types. For example, in a methods course, a model fine-tuned for mathematical or formal reasoning (e.g., Minerva [31]) may be better suited for evaluating answers to quantitative questions, whereas a general instruction-tuned model may perform better when assessing conceptual responses about experimental design. While either model alone may be imperfect outside its area of specialization, aggregating their independent judgments could yield more stable majority-voted scores. Moreover, high agreement between multiple heterogeneous LLMs (i.e., high certainty) may serve as a strong indicator that the assigned score is reliable and does not require human review.
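One possible way to pool an ensemble, assuming a hypothetical score_fn wrapper that queries one model once; the paper's actual aggregation may differ in details.

```python
from collections import Counter

def ensemble_score(answer, rubric, models, score_fn, n_per_model=5):
    """Pool repeated scores from several LLMs and use agreement as a certainty estimate.

    score_fn(model, answer, rubric) -> one candidate score (hypothetical wrapper).
    """
    pooled = [
        score_fn(model, answer, rubric)
        for model in models
        for _ in range(n_per_model)
    ]
    counts = Counter(pooled)
    predicted, n_modal = counts.most_common(1)[0]
    certainty = n_modal / len(pooled)  # cross-model agreement on the modal score
    return predicted, certainty
```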
We investigated SURE using data from an introductory programming course for psychology students [1]. We previously scored student answers to coding exercises and open questions in that course with gpt-4o [32], and a qualitative inspection of model outputs revealed varied error patterns for both the LLM and the human graders in the course. The errors by gpt-4o included deviating from the rubrics, incorrectly penalizing messy or uncommon but correct solutions, failing at counting lines of code or interpreting plots, and interpreting rubrics too literally. Human graders, on the other hand, sometimes made careless mistakes such as overlooking rubric criteria or syntax errors [1]. Because of these earlier findings, we revised some of the rubrics to make them more explicit and easier to follow for both humans and LLMs. The revised rubrics and all other resources, such as prompts and code, are available on GitHub: https://github.com/lukekorthals/sure. Additionally, instead of relying on ground-truth scores derived from a single human rater, four of the authors independently graded student answers based on the revised rubrics to obtain a more robust reference for evaluating LLM-based grading with and without SURE.
1.1. Related Work
Automated grading has been researched for more than half a century, initially focusing on closed-form assessment formats such as multiple-choice and fill-in-the-blank items, where correctness can be determined through exact matching, predefined answer sets, or rule-based heuristics [33,34,35].
In parallel, researchers also investigated automated scoring of open-ended responses, exploring how aspects of writing quality and content understanding could be captured computationally. Early work relied on surface-level textual features [36], followed by approaches incorporating deeper semantic information such as semantic similarity [37], and more recently by deep neural networks that learn task-relevant representations directly from data [38,39,40]. Notably, several automated grading systems had already been developed and deployed in educational settings more than two decades ago [41,42,43].
Programming assignments represent a special case within automated grading research, as unit testing and static analysis enable automatic scoring for many tasks [44,45]. Critically, such methods cannot be used to grade all aspects of coding education, such as documentation or answers to conceptual questions [44]. This consideration applies to the course under investigation here, as it includes open questions about coding, data science, and psychology, and rubrics that often award partial credit for incomplete or incorrect code.
The ability of modern large language models (LLMs) to handle a wide range of tasks suggests a potential unification of automated grading across domains: unlike earlier approaches, LLMs can be prompted with natural language instructions and rubrics to assess essays [4], short answers [2,3,46], and programming assignments [1,47]. To improve alignment with human judgments, prior studies explored a range of techniques, including prompt engineering with rubric conditioning, few-shot prompting, and chain-of-thought reasoning [48,49,50], as well as retrieval-augmented generation [46,51] and task-specific fine-tuning [52]. Despite these advances, results consistently show sensitivity to prompt design, systematic biases, and non-trivial disagreement with human graders, leading to a broad consensus that LLM-based grading should be deployed with caution rather than as a fully automated replacement for human assessment [1,2,7,8,47].
Importantly, human grading itself is imperfect, with multiple raters frequently disagreeing [53,54] and human graders sometimes making errors that LLMs avoid [1]. Nevertheless, human scores remain the benchmark against which automated systems are evaluated because they reflect established educational practice and accountability structures [9].
3. Results
3.1. Interrater Reliability of Human Graders
We computed ICC(2,1) and fitted linear mixed-effects models to assess interrater reliability at both the level of individual question scores and aggregated assignment grades.
At the question level, reliability was excellent (ICC(2,1) 95% CI [.92, .93]). Variance decomposition from a cross-classified mixed-effects model showed that grader identity accounted for a negligible proportion of total variance. Assignment and student also accounted for only small proportions of variance, with most variability attributable to differences between questions and residual error.
At the grade level, reliability remained excellent (ICC(2,1) 95% CI [.83, .94]). Grader and student identity accounted for only a small proportion of variance, with assignment and residual error accounting for most variability.
To examine potential student-specific grading bias, we extended the grade-level mixed-effects model with a grader × student random effect. This model resulted in a singular fit, with the corresponding variance component estimated at zero, indicating no evidence that graders systematically evaluated specific students differently.
Overall, these results indicate that human grading was fair and highly reliable at both the question and grade levels, supporting the use of aggregated ground-truth scores derived from multiple graders as a baseline for evaluating LLM-based grading in this study.
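For reference, ICC(2,1) for long-format rating data can be computed with the pingouin package; the data frame below uses made-up values and assumed column names purely for illustration.

```python
import pandas as pd
import pingouin as pg

# Tiny illustrative long-format table: one row per (question, grader) score (made-up values)
scores_long = pd.DataFrame({
    "question_id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "grader":      ["g1", "g2", "g3"] * 4,
    "score":       [1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0],
})

icc = pg.intraclass_corr(
    data=scores_long, targets="question_id", raters="grader", ratings="score"
)
# ICC(2,1): two-way random effects, absolute agreement, single rater
print(icc.loc[icc["Type"] == "ICC2", ["ICC", "CI95%"]])
```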
3.2. Exploratory Findings on the Training Set
3.2.1. Descriptive Findings
Figure 1 shows how certainty thresholds were tuned by maximizing F1 scores for each of the 56 conditions. Most optimal thresholds (∆) lie between 0.6 and 0.85. The optimal thresholds and the average F1 trajectory for the ensemble condition (black triangles and line) are markedly shifted to the left and more peaked than those for the other models. For the test set we fixed the threshold at 0.7, the median optimal threshold across all conditions (dashed vertical line).
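One way such threshold tuning could be implemented, treating incorrect scores that get flagged as true positives; this is a sketch, and the exact objective and candidate grid used in the paper may differ.

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(certainty, is_correct, candidate_thresholds=np.arange(0.05, 1.01, 0.05)):
    """Pick the certainty threshold that maximizes F1 for flagging incorrect scores.

    certainty:  array of per-answer certainty values (e.g., modal-score agreement).
    is_correct: boolean array, True where the LLM score matches the ground truth.
    """
    needs_review = ~np.asarray(is_correct)         # positives = scores a human should fix
    best_t, best_f1 = None, -1.0
    for t in candidate_thresholds:
        flagged = np.asarray(certainty) < t        # flag everything below the threshold
        f1 = f1_score(needs_review, flagged, zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```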
Figure 2 shows the certainty of correct and incorrect scores and their relationship to tuned certainty thresholds for all conditions of a given LLM. It suggests that certainty is diagnostic of correctness: most correct scores cluster at 100% certainty (all iterations agree), while incorrect scores are more widely distributed at lower levels of certainty. The ensemble stands out with a distinctly bimodal distribution and concentrated thresholds, suggesting that mixing multiple LLMs can help separate cases suited for automated grading from those that aren’t.
Figure 3 displays the observed grading accuracy of different models, grading procedures, and prompting configurations, aggregated across students and questions for the first assignment. Different LLMs are displayed on the x-axis, and visual inspection suggests that the reasoning models (gpt-5-nano and gpt-oss-20b) and the ensemble clearly outperformed gpt-4.1-nano even when using only a single prompt (circles). Both majority-voting (squares) and SURE (triangles) appear to improve the accuracy of all LLMs, with gpt-oss-20b and the ensemble even reaching human-level accuracy (grey band) and gpt-4.1-nano achieving the greatest relative gains, potentially because more cases were flagged for this LLM. Effects of prompt perturbations (colors) are difficult to assess visually, but multilingual prompting seems to have hurt the accuracy of the three LLMs, particularly gpt-4.1-nano and especially for single-prompt grading. We also see that gpt-5-nano with a single prompt is on par with gpt-4o (dotted line) as used in Korthals et al. [1], while majority-voting improves accuracy beyond it. Based on visual inspection alone, Figure 1, Figure 2, and Figure 3 suggest that self-consistency-based [12] certainty estimation can work for weaker as well as stronger LLMs and that the performance of the proposed SURE procedure may be comparable to fully manual grading.
3.2.2. Grading Procedures and Diversification Strategies
We used Bayesian logistic regression to predict the log-odds of scoring student answers correctly based on the grading procedure, LLM, sampling parameters, and prompt perturbation techniques, including all meaningful two-way interactions (predictors that were varied together). The four MCMC chains, each with 1000 warmup and 1000 sampling iterations, mixed well (all R-hat values indicated good convergence).
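A model of this kind could be specified, for example, with the bambi formula interface; the sketch below uses synthetic data, simplified predictors, and assumed column names, and is not the study's actual model code.

```python
import numpy as np
import pandas as pd
import bambi as bmb

# Tiny synthetic stand-in for the real data: one row per scored student answer
rng = np.random.default_rng(0)
n = 400
grading_df = pd.DataFrame({
    "correct":   rng.integers(0, 2, n),  # 1 = LLM score matched the ground truth
    "procedure": rng.choice(["single", "majority", "sure"], n),
    "llm":       rng.choice(["gpt-4.1-nano", "gpt-5-nano", "gpt-oss-20b", "ensemble"], n),
    "student":   rng.choice([f"s{i}" for i in range(40)], n),
    "question":  rng.choice([f"q{i}" for i in range(10)], n),
})

# Logistic regression with random intercepts for students and questions
model = bmb.Model(
    "correct ~ procedure * llm + (1|student) + (1|question)",
    data=grading_df,
    family="bernoulli",
)
idata = model.fit(draws=1000, tune=1000, chains=4)  # ArviZ InferenceData with posterior draws
```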
Table 5 shows the posterior means and 95% HDIs of all coefficients. In the following, we only interpret those whose 95% HDI excludes zero.
At the intercept (majority-voting, gpt-4.1-nano, temp=0, topp=1, verb=0, shuf=0, pers=0, lang=0), the probability of scoring a student answer correctly is estimated to be about 83%. Relying only on a single prompt reduces that probability, while human-in-the-loop SURE increases it. Using gpt-5-nano, gpt-oss-20b, or the LLM ensemble instead of gpt-4.1-nano also increased the probability of scoring student answers correctly. These results at the level of individual student answers are consistent with the earlier visual inspection of grading accuracy at the assignment level (Figure 3): majority-voting and SURE are better than single-prompt grading, and the reasoning models (gpt-5-nano, gpt-oss-20b) and the ensemble outperformed gpt-4.1-nano.
We obtained negative coefficients for the interactions SURE : gpt-5-nano, SURE : gpt-oss-20b, and SURE : ensemble. This reflects that the relative performance gain from SURE is greater for gpt-4.1-nano than for the other LLMs, whose baseline accuracy (single-prompt / majority-voting) is already higher and closer to the ceiling.
We found a positive coefficient for the main effect of topp(llm=gpt-4.1-nano; temp=1) and a negative interaction for topp(llm=gpt-4.1-nano; temp=1) : single-prompt. This indicates that prompting gpt-4.1-nano with temperature and top_p set to 1 is beneficial, but only for majority-voting and SURE. Figure 3 clearly shows that majority-voting (squares) for gpt-4.1-nano with lower temperature and top_p is only slightly beneficial, while a large jump in accuracy can be seen for gpt-4.1-nano with higher temperature and top_p. Together with the regression results, this indicates that token-level variability may help stabilize majority-voted scores, potentially because more plausible scores are explored, whereas deterministic sampling gets stuck in a local minimum much like relying on a single prompt.
None of the prompt perturbation techniques improved the probability of scoring student answers correctly. On the contrary, we obtained negative coefficients for lang and for the interactions lang : single-prompt, lang : topp(llm=gpt-4.1-nano; temp=1), and lang : shuf. These indicate that multilingual prompting was detrimental, particularly when relying only on a single prompt, when simultaneously shuffling rubrics, and when using gpt-4.1-nano with increased token-level sampling variability.
We also found positive coefficients for the interactions lang : gpt-5-nano and lang : gpt-oss-20b, suggesting that multilingual prompting was less detrimental for the more recent reasoning models. This is in line with Figure 3, which clearly shows that multilingual prompting was very detrimental for single-prompt grading with gpt-4.1-nano but less so for the other LLMs and grading procedures.
Finally, for the random intercepts we found moderate variability for students and considerable variability for questions. This indicates that some students are easier to score than others, which raises concerns about potentially biased grading, and that LLMs are worse at scoring certain questions, which is in line with our earlier findings [1] and exactly what we want to address with SURE grading.
With respect to the research questions, this regression model together with Figure 2 and Figure 3 suggests that majority-voting improves fully automated grading (RQ1), that SURE improves accuracy over automated grading (RQ2), and that only higher temperature and top_p and ensembling were effective diversification strategies (improving grading accuracy) in our context (RQ3). Based on these results, we focused on four LLM configurations for all further analyses:
gpt-4.1-nano with temperature and top_p set to 1.0.
gpt-5-nano with default "medium" text_verbosity.
gpt-oss-20b with default "medium" text_verbosity.
ensemble based on the three selected LLM configurations.
3.2.3. Comparing Single-Prompt, Majority-Voting, SURE and Manual Grading
Accuracy and bias at the level of student answers. We fit a Bayesian logistic regression with random intercepts for students and questions to estimate the log-odds that each of the four human graders and human-in-the-loop SURE grading with four LLM graders (gpt-4.1-nano at temp=topp=1, gpt-5-nano with verb=1, gpt-oss-20b, and the ensemble; without prompt perturbations) would score student answers correctly. At first we obtained R-hat values around 1.05, so we increased the sampling to 2000 tuning and 2000 sampling iterations, after which convergence between the four MCMC chains improved. The model was fit without an intercept, which means that once the coefficients are transformed from log-odds to probabilities, each one directly represents that grader's estimated probability of assigning a correct score. Below we report estimated coefficients as log-odds but interpret the results at the probability level.
Figure 4 displays the posterior means and 95% HDIs (left panel) and the pairwise probability that a grader was more accurate than another (right panel). With an estimated probability of about 92% to score student answers correctly, grader 2 was the most accurate human grader, followed by grader 3, grader 1, and grader 4. Under human-in-the-loop SURE grading with tuned certainty thresholds, both gpt-oss-20b and the LLM ensemble reached accuracies comparable to the mid-range of human graders. In contrast, gpt-5-nano and particularly gpt-4.1-nano performed worse, with all human graders likely outperforming them. The random intercepts revealed moderate variability for students and considerable variability for questions, indicating that even the human graders and the LLMs under SURE were challenged by certain questions and student answers.
We fit a similar Bayesian linear regression to predict grading bias, with four MCMC chains making 1000 tuning and 1000 sampling draws each (all R-hat values indicated good convergence). This regression revealed a more pronounced difference between human graders and human-in-the-loop SURE grading: human grader 4 and grader 2 were unbiased (HDIs include zero), while grader 3 was underscoring and grader 1 was overscoring. In contrast, despite SURE, all LLM graders (gpt-oss-20b, the ensemble, gpt-4.1-nano, and gpt-5-nano) underscored students (negative bias) and were likely more biased than most human graders. Random intercepts for students showed low variability, while those for questions indicated moderate variability.
Figure 5.
Posterior estimates of grading bias for human graders and LLMs under SURE with tuned certainty thresholds in the training set. The left panel shows posterior means and 95% HDIs for grading bias (deviation from closest ground-truth). The right panel displays pairwise dominance probabilities, indicating for each row–column pair the posterior probability that the grader in the row is less biased (closer to zero) than the grader in the column.
Alignment at the level of assignment grades. The previous analysis was conducted at the level of individual student answers, for which differences between manual grading and SURE were relatively small. However, even small differences may accumulate at the level of assignment grades. To evaluate this, we computed target grade ranges (human and ground-truth grades) and the grade each student would have received under a given LLM and grading procedure.
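As a simple illustration of this alignment check (the function and variable names are our own, not from the study code):

```python
def grade_alignment(llm_grade, human_grades):
    """Compare an LLM-derived assignment grade to the range spanned by human graders.

    human_grades: grades that the human graders (and ground truth) assigned to one student.
    Returns (inside_target_range, deviation_from_nearest_boundary).
    """
    low, high = min(human_grades), max(human_grades)
    if low <= llm_grade <= high:
        return True, 0.0
    deviation = low - llm_grade if llm_grade < low else llm_grade - high
    return False, round(deviation, 2)

# Example: human graders gave 7.0, 7.5, and 8.0; the LLM pipeline gave 6.6
print(grade_alignment(6.6, [7.0, 7.5, 8.0]))  # (False, 0.4)
```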
Figure 6 reveals considerable disagreement between human graders (wide grey areas) in some cases but also perfect agreement in one case (a black horizontal line instead of a grey area). It also shows a pronounced negative grading bias for single-prompt grading (circles are often far below the grey target ranges), which is consistent with earlier findings using gpt-4o [1]. The plot also shows that majority-voting and SURE pull student grades closer to, or even inside, the target ranges for most students. Interestingly, there are also individual cases where majority-voting or SURE decreased the alignment with the human grade ranges. This reflects that single prompts can sometimes be "accidentally" right and that human (re)graders sometimes make mistakes and are less correct than LLMs, which is something we also observed in Korthals et al. [1]. The starkest difference emerged for gpt-4.1-nano: Table 6 shows that for single-prompt grading only a small proportion of grades fell inside the target range and the maximum grade deviation reached 1.9 grade points. In contrast, with SURE, a much larger proportion of grades fell inside the target ranges and the maximum deviation was only 0.4 grade points. We see a similar pattern for the other LLMs, with SURE reducing the maximum and median grade point deviations for all of them (Table 6). This finding suggests that SURE can produce grades matching human accuracy, particularly for stronger LLMs and the LLM ensemble.
Time savings from SURE. While the previous analysis indicates that SURE can achieve good alignment with human grades, this could be driven by a large flagging rate which would mean that SURE is largely manual grading anyway and offers no substantial time-savings. We assessed this by comparing the time spent for manual grading with the time graders would spend under SURE.
Table 7 shows that manually grading the first assignment took between 3 and 6.5 hours. In contrast, the SURE procedure would reduce manual grading time by up to 90%, making it highly efficient.
This table also shows that the time savings would be smaller for gpt-4.1-nano than for the other LLMs and the ensemble, reflecting the lower baseline accuracy and greater flagging rate for this LLM. Together with the large performance gains for gpt-4.1-nano, this provides further evidence that self-consistency-based uncertainty estimation is effective for both weaker non-reasoning models (gpt-4.1-nano) and reasoning models (gpt-5-nano, gpt-oss-20b).
3.2.4. Summary of Training Set Results
To summarize, in the training set we found evidence that self-consistency-based [12] (un)certainty can distinguish between cases that can be graded automatically and those that should be reviewed by a human grader. Notably, aggregating the outputs from several LLMs in an ensemble visibly improved the separability of incorrect and correct scores.
We also found that SURE based on optimal thresholds can result in greatly improved alignment with human graders both at the level of individual student answers and assignment grades, while saving teachers more than 80% of the manual grading time spent on this assignment.
However, these results were based on optimally set flagging thresholds tuned on an efficiency (F1 score) objective. Therefore, they might be overly optimistic estimates of the effectiveness and efficiency of the proposed SURE pipeline. In practice, thresholds would have to be set in advance, which might be problematic if the obtained certainties are highly assignment specific. Therefore, we simulated SURE with a fixed certainty threshold (0.7) in the test set (assignments 2-5). These assignments include more complex programming tasks (e.g., manipulating data, creating plots), which we previously found to be graded less accurately by gpt-4o [1] and which are therefore likely to include exactly the cases we want to identify and flag for human review.
3.3. Test Set Validation
3.3.1. Descriptive Findings
Figure 7 shows how correct and incorrect scores were distributed across certainty levels in the test set (assignments 2–5). As in the training set, most correct scores cluster at 100% certainty, indicating full agreement across repeated prompts. However, there are also more correct scores at lower certainty values, which increases the overlap between the distributions of correct and incorrect scores. This overlap implies more unwanted flags (false positives; blue bars to the left of the threshold) and more unflagged incorrect scores (false negatives; red bars to the right of the threshold). Once again, the ensemble stands out with a distinctly bimodal certainty distribution that is neatly separated by the fixed threshold at 0.7. In contrast, the same fixed threshold appears to be too low for the individual LLMs, resulting in many incorrect scores that remain unflagged.
Figure 8 shows assignment-level grading accuracy in the test set for all LLMs and grading procedures. As in the training set, SURE (triangles) and, to a lesser extent, majority-voting (squares) improve accuracy over single-prompt grading (circles) for all models. For all four assignments, SURE with the ensemble reaches human-level grading accuracy (grey band). In contrast, gpt-4.1-nano clearly reached lower grading accuracy for all assignments even with SURE, while gpt-oss-20b and gpt-5-nano approached human performance only for assignments 4 and 5. This likely reflects the inadequate thresholds, which left too many incorrect cases unflagged.
3.3.2. Comparing Single-Prompt, Majority-Voting, SURE and Manual Grading
Accuracy and bias at the level of student answers. As for the training set, we ran a Bayesian logistic regression to assess the accuracy of the four human graders and the four human-in-the-loop SURE grading procedures (gpt-4.1-nano, gpt-5-nano, gpt-oss-20b, ensemble) at the level of individual student answers in the test set (assignments 2–5). The model included random intercepts for students and questions and was fit without a global intercept, so that each coefficient directly represents the log-odds of assigning a correct score for a specific grader. We estimated the model with four MCMC chains, 1000 tuning and 1000 sampling iterations per chain, and all R-hat values indicated good convergence.
With an estimated probability of about 95% of correctly scoring student answers, human grader 3 was the most accurate grader. However, this estimate is based only on assignment 5 (the only test set assignment graded by grader 3), which makes it less generalizable than the estimates for the other human graders and the LLMs under SURE. Grader 3 is followed by grader 1, the ensemble, grader 4, grader 2, gpt-oss-20b, gpt-5-nano, and gpt-4.1-nano. Random intercepts for students showed little variability, while those for questions indicated moderate variability.
Figure 9 visualizes these results and highlights the relative ranking of graders in terms of pairwise dominance. The heatmap shows that grader 3 and grader 1 clearly outperform the two other human graders and all LLMs under SURE with very high posterior probability, and that grader 3 is likely more accurate than grader 1 as well. The ensemble occupies a distinct middle position: it is almost certainly more accurate than gpt-4.1-nano, gpt-5-nano, gpt-oss-20b, and grader 2, and is roughly comparable to grader 4 (pairwise probability close to 0.5). Together, the posterior means and pairwise dominance structure indicate that, in the test set, SURE with the ensemble achieves accuracy similar to mid-range human graders, whereas SURE was less effective for the individual LLMs and did not achieve accuracies comparable to manual grading.
For grading bias, we find very similar results and even more evidence that SURE with the LLM ensemble rivals human performance: with zero included in their 95% HDIs, human grader 4 and the ensemble may be considered unbiased. As in the training set, human grader 1 was the only grader with a positive bias (overscoring). All other graders were negatively biased (underscoring): human grader 3, gpt-oss-20b, gpt-5-nano, human grader 2, and gpt-4.1-nano. Random intercepts for students showed moderate variability, while those for questions indicated considerable variability.
In the accuracy analysis, human graders 1 and 3 were clearly the strongest performers, with the ensemble under SURE grading occupying a solid mid-range position. In terms of bias, however, the picture shifts: Figure 10 shows that the ensemble is much closer to zero than most human graders, with an HDI that includes zero and a posterior mean comparable to the nearly unbiased grader 4. In contrast, graders 1 and 3, despite being the most accurate, exhibit clear positive and negative bias respectively. The pairwise-dominance heatmap shows that the ensemble with SURE was very likely less biased than human graders 1, 2, and 3.
Alignment at the level of assignment grades. As for assignment 1, we also assessed the accuracy of grades by computing human and ground-truth target ranges and the grades students would receive under different LLM grading procedures. In contrast to the previous analysis at the level of student answers, where we used data across assignments, we did this separately for each of the four test set assignments. Because the previous analyses suggest that SURE achieved human-like performance only for the ensemble, for brevity we show only its results in Figure 11, while Table 8 reports performance metrics (percentage of grades inside target ranges, median and maximum grade point deviation from target range boundaries) for all LLMs.
The figure shows that, even under fully automated grading with majority-voting, the ensemble frequently produced grades that fall inside the target ranges; however, for some students this procedure resulted in severe underscoring, with maximum deviations of more than three grade points and median deviations below half a grade point. While a median deviation of less than half a grade point might be acceptable in practice, underscoring a student by more than three full grades is not. This highlights the importance of evaluating LLM grading not only at the level of averages but also at the level of individual students.
In contrast, SURE grading with the ensemble resulted in between 84% and 91% of grades falling inside the target ranges, with small maximum and median grade point deviations for all assignments (Table 8). Notably, for assignment 4, there were 15 of 46 students for whom all three human graders and the ensemble with SURE agreed perfectly.
For the individual LLMs, Table 8 shows a similar pattern: SURE reduced the severity of grading errors, lowering the maximum and median grade point deviations for gpt-4.1-nano, gpt-5-nano, and gpt-oss-20b. This indicates that even with suboptimal certainty thresholds, SURE can meaningfully improve alignment with human grading. However, the remaining deviations and the relatively low proportions of grades falling inside the human target ranges suggest that these models still produce errors too large to justify replacing manual grading in practice. These results highlight the benefit of using LLM ensembles for uncertainty-based flagging, but also call into question whether it is possible to set a proper flagging threshold in advance.
Time savings from SURE. Table 9 shows that the time savings achieved by SURE in the test set varied substantially across LLMs and assignments. Notably, the ensemble yielded comparatively modest time savings, typically between 26% and 58%, while the individual LLMs often saved considerably more time, in some cases exceeding 80%. This pattern is consistent with earlier results showing that the fixed certainty threshold (0.7) was well calibrated for the ensemble but too lenient for the individual LLMs: the weaker models produced many incorrect but high-certainty predictions that went unflagged, reducing the amount of manual regrading and thereby inflating time savings at the cost of lower accuracy. Conversely, the ensemble flagged a larger proportion of cases for review, which reduced automation but produced human-level accuracy.
Importantly, this outcome is not necessarily a limitation of the proposed SURE approach. For assignments that contain questions the LLMs struggle to grade reliably – such as those in assignments 2-5 – we explicitly want them to be flagged which necessarily results in more manual effort. At the same time, the table also shows that when the proportion of flagged responses becomes very large, the resulting time savings may be too small to justify deploying such a pipeline in practice. Thus, while the ensemble achieved the highest grading accuracy, it did so by relying more heavily on human review in the test set, illustrating the trade-off between efficiency and reliability inherent to certainty-based flagging.
3.4. Additional Exploratory Analyses
To assess whether SURE grading works for larger state-of-the-art models, we used gpt-5 to score all student answers in the test set. Additionally, we created a more diversified ensemble based on gpt-5, codestral-25.01, and llama-3.3-70b-instruct.
Figure 12 shows that certainty is not as clearly separated between correct and incorrect scores for these individual models, but ensembling them again yields a bimodal distribution similar to the previous results. This suggests that ensembling continues to be beneficial for separating correct from incorrect scores.
Figure 13 shows assignment-level grading accuracies for the three additional models and their ensemble in the test set. As before, human-in-the-loop grading with SURE (triangles) improves accuracy over fully automated single-prompt grading (circles) and majority-voting (squares) for all models. However, gpt-5 benefited only slightly from flagging, suggesting that its scores from repeated prompting are very consistent regardless of their agreement with the human-derived ground truth. Notably, the majority-voting accuracy of the ensemble is worse than that of gpt-5 alone, likely because the two much less accurate models dragged it down. However, with SURE the ensemble outperformed gpt-5 for all assignments, driven by a much larger flagging rate and consequently more human grading. These results suggest that pairing stronger and weaker models in more diverse ensembles might be beneficial for uncovering cases in which the stronger model is confidently incorrect, although at the cost of more human grading effort.
Like the descriptive findings, the results from the Bayesian analyses corroborate that SURE with the ensemble achieves accuracy (see Figure 14) and bias (see Figure 15) on par with mid-range human graders.
In contrast to the test set analysis with the original LLMs (gpt-4.1-nano, gpt-5-nano, gpt-oss-20b), where we evaluated assignment-level grade alignment, here we zoomed out further and assessed course-level grade alignment. Specifically, we computed course grades based on weighted assignment grades for the new LLMs and their ensemble under all three grading procedures, and compared them to the course grades calculated for the human graders and to the target range spanned by the minimal and maximal ground-truth or human course grades for each student.
Table 10 and Figure 16 show that SURE and ensembling markedly improved alignment, with about 96% of course grades falling into the target ranges and low maximum (0.3) and median (0.1) deviations on the Dutch 10-point grading scale. Similarly, gpt-5 with SURE achieved 86% of course grades inside the target ranges, with a maximum grade point deviation of 0.5 and a median of 0.1. Critically, SURE with gpt-5 saved about 93% of manual grading time, while the ensemble with SURE saved only about 40% compared to fully manual grading. This is driven by the respectable accuracy of gpt-5 under fully automated grading with a single prompt or majority-voting and its comparatively low flagging rate due to being very consistent across regrades. This indicates that stronger models may be able to achieve acceptable alignment even without human-in-the-loop review, which could eventually eliminate the need for human grading altogether, giving teachers more time to tutor students. However, the ensemble with SURE still outperformed gpt-5 in this regard and emerged as the most accurate grading procedure evaluated here. This finding again highlights the trade-off between grading accuracy and time savings when using certainty-based flagging. Critically, decision-makers at universities might be much more inclined to adopt human-in-the-loop approaches before fully automated grading becomes widely accepted.
Table 10.
Alignment of LLM grades with human grade target ranges under different procedures (SP = single-prompt, MV = majority-voting) for the weighted course grade.
| LLM | Grading Procedure | % in Target Range | Maximum Grade Deviation (grade points) | Median Grade Deviation (grade points) |
| --- | --- | --- | --- | --- |
| Overall Course Grade | | | | |
| gpt-5 | SP | 73.913 | 0.5 | 0.1 |
| | MV | 78.261 | 0.6 | 0.1 |
| | SURE | 86.957 | 0.5 | 0.1 |
| codestral-25.01 | SP | 0 | 3.2 | 1.9 |
| | MV | 0 | 2.9 | 1.85 |
| | SURE | 13.043 | 1.0 | 0.6 |
| llama-3.3-70b-instruct | SP | 10.87 | 1.2 | 0.4 |
| | MV | 17.391 | 1 | 0.4 |
| | SURE | 39.13 | 0.9 | 0.2 |
| new ensemble | MV | 23.913 | 0.9 | 0.2 |
| | SURE | 95.652 | 0.3 | 0.1 |
Table 11.
Manual grading time and manual regrading time after SURE (minutes) with time savings for assignments 2–5 with new models. 1
Model columns show regrading time (min) with time savings (%) after SURE.

| Grader | Manual (min) | gpt-5 | codestral-25.01 | llama-3.3-70b-instruct | New Ensemble |
| --- | --- | --- | --- | --- | --- |
| Assignments 2–5 cumulated | | | | | |
| Grader 1 | 748 | 52 (93%) | 372 (50%) | 111 (85%) | 432 (42%) |
| Grader 2 | 934 | 67 (93%) | 471 (50%) | 137 (85%) | 556 (40%) |
| Grader 3 ² | 298 | 12 (96%) | 51 (55%) | 51 (83%) | 181 (39%) |
| Grader 4 | 734 | 55 (93%) | 354 (52%) | 113 (85%) | 436 (41%) |
3.5. Summary of Results
Across both the training and test sets, our results support the core idea of the proposed SURE pipeline. Repeated prompting produced a certainty measure that was strongly diagnostic of correctness, and combining models in an ensemble yielded a distinctly bimodal certainty distribution that separated clearly between high- and low-confidence predictions. On the training set, tuning certainty thresholds for each condition showed that self-consistency–based flagging can substantially improve grading accuracy for non-reasoning (gpt-4.1-nano) and reasoning models (gpt-5-nano, gpt-oss-20b). With optimally chosen thresholds, SURE brought assignment-level accuracy and grade alignment close to or within the range of human graders, while reducing manual grading time by more than 80%.
The test set analysis, which used a single fixed threshold of 0.7 and omitted prompt perturbations, provides a more conservative but still encouraging picture. For the ensemble, the fixed threshold aligned well with its bimodal certainty distribution: SURE achieved human-level accuracy at the level of individual answers, near-unbiased grading, and high alignment with human assignment grades (84–91% of grades inside target ranges, with very small maximum and median deviations). However, these gains came with only moderate time savings (typically 26–58%), reflecting that many responses were still routed to human graders. For the individual LLMs, the same fixed threshold was too lenient, leading to more unflagged incorrect answers, lower accuracy, stronger negative bias, and larger grade deviations, even though SURE still improved performance relative to single-prompt and majority-voting baselines.
Together, these findings highlight a central trade-off of certainty-based flagging. When thresholds are well calibrated – most clearly for the ensemble – the pipeline can match mid-range human graders in accuracy and bias while still reducing manual effort. At the same time, the test set results show that thresholds are assignment- and model-sensitive: overly aggressive automation can save time but harms reliability, whereas conservative thresholds preserve human-level accuracy at the cost of smaller time savings.
The findings from the additional analyses based on a more diversified ensemble (gpt-5, codestral-25.01, llama-3.3-70b-instruct) corroborated the earlier results, with ensembling yielding a bimodal certainty distribution for which the fixed threshold was well calibrated. At the level of overall course grades, SURE resulted in 87% and 96% of grades falling into the target ranges for gpt-5 and the ensemble respectively, with very small maximum (0.3-0.5) and median (0.1) grade point deviations on the 10-point Dutch grading scale, and time savings of about 93% for gpt-5 and about 40% for the ensemble. Notably, even under fully automated grading, gpt-5 achieved very good alignment, which suggests human oversight might become less critical with larger reasoning models. In contrast, llama-3.3-70b-instruct performed considerably worse than gpt-oss-20b, indicating that the number of parameters alone does not drive performance. Instead, we suspect that reasoning capabilities are much more relevant for the ability of these LLMs to follow rubrics and score student answers accurately. Additionally, the particularly poor performance of codestral-25.01, a model trained for fast code generation, may indicate that reasoning and instruction following (i.e., sticking to the rubrics) are more important for grading than domain specialization.
4. Discussion
Large language models can grade open-ended assignments, yet they typically fail to reach human performance and are prone to biases, which limits their suitability for fully automated assessment. Here, we introduced SURE, a lightweight human-in-the-loop framework leveraging self-consistency and ensembling to flag cases for selective human regrading. SURE substantially improved alignment with ground-truth scores from four human graders while reducing overall grading effort. We found that uncertainty estimates based on prompt agreement are informative but unreliable for individual LLMs, whereas LLM ensembles yield more separable uncertainty distributions, which supported fixed-threshold flagging. As such, combining self-consistency with selective human oversight may offer a path toward more reliable and scalable AI-assisted grading.
Across conditions, grading student answers with a single prompt resulted in underscoring, which has been reported in prior studies on LLM grading [3,47,75,76]. Majority voting based on repeatedly scoring the same student answer reduced but did not eliminate this bias, which is consistent with self-consistency stabilizing LLM performance [12,13]. However, incorrect scores frequently received (near-)unanimous agreement across iterations, reflecting that LLMs can be confidently wrong [10,26]. This meant that, while informative, self-consistency-based uncertainty was limited in its ability to flag erroneous scores for regrading when used with individual models.
This problem was effectively mitigated by aggregating the outputs of multiple LLMs in ensembles, which resulted in markedly bimodal uncertainty distributions that more clearly separated correct from incorrect predictions. This result is consistent with recent work showing that cross-model disagreement exposes confidently incorrect LLM responses more effectively than within-model self-consistency alone [14,15]. While prior studies caution that ensembles of highly similar models may underestimate uncertainty [15], we found that an ensemble composed exclusively of OpenAI models (gpt-4.1-nano, gpt-5-nano, gpt-oss-20b) was already effective under SURE. This suggests that meaningful diversity can arise even among models from the same provider, likely due to differences in size, architecture, and reasoning capabilities rather than training data alone. However, the second, more heterogeneous ensemble (gpt-5, codestral-25.01, llama-3.3-70b-instruct) achieved even better alignment with human grading. While this gain may partly reflect the strong individual performance of gpt-5, the finding does point towards utilizing diverse ensembles. Notably, reasoning models such as gpt-oss-20b consistently outperformed non-reasoning models even when the latter were larger (llama-3.3-70b-instruct) or domain-specialized (codestral-25.01), indicating that reasoning capability is more important for LLM-based grading than parameter count or domain specialization. The strong performance of gpt-oss-20b is especially encouraging in this regard. As an open-source reasoning model, it combines competitive grading accuracy with lower deployment costs and improved data privacy, making it a practical candidate for educational settings. Notably, gpt-5 achieved very good performance even when scoring student answers based on a single prompt. However, large reasoning models also incur higher latency and computational cost, especially when prompted synchronously at scale. These trade-offs highlight that ensemble design for SURE grading must balance accuracy gains against cost, latency, and privacy constraints. Based on our findings, we recommend constructing SURE ensembles by prioritizing diversity in reasoning behavior rather than model size alone. A practical strategy might be to start with a small set of diverse, preferably open-source reasoning models, evaluate their performance in context, and only add larger or closed-source models if necessary.
Beyond ensembling, only high temperature and top-p sampling with gpt-4.1-nano emerged as an effective diversification strategy for improving grade alignment under SURE. Randomizing rubric order and instructing models to adopt grading personas (e.g., lenient or strict) had little effect, and multilingual prompting even reduced grading accuracy for gpt-4.1-nano under single-prompt conditions. However, this adverse effect was attenuated under majority voting and largely absent for gpt-5-nano and gpt-oss-20b, suggesting that earlier-generation models may be less consistent across languages than more recent reasoning models. Importantly, these findings are based on a single study and do not imply that multilingual prompting is generally ineffective as a diversification strategy. In fact, prior work has shown multilingual prompting to be more effective than temperature sampling or persona prompting for inducing useful diversity in question answering tasks [24]. Moreover, our results do not permit conclusions about LLM-based grading in multilingual educational settings as investigated by Grévisse [2]. Future work should examine multilingual prompting and other diversification strategies, such as dynamically sampling few-shot exemplars, in broader grading contexts.
SURE aligns closely with regulatory expectations for AI-supported assessment: under the EU AI Act, systems used for evaluating student performance are classified as high-risk applications, requiring meaningful human oversight and safeguards against systematic error [9]. SURE operationalizes this principle by using uncertainty estimates to determine when human intervention is warranted, allowing instructors to retain control over ambiguous or error-prone cases while benefiting from automation when model confidence is high. Several other human-in-the-loop grading frameworks have been proposed for similar reasons. These include iterative prompt refinement through cycles of human and LLM grading (CoTAL; [77]), escalation based on discrepancies between LLM scores and student self-evaluations (AVALON; [78]), and approaches that rely on psychometric modeling to derive uncertainty estimates [8]. Compared to these methods, SURE is comparatively lightweight: it requires no iterative alignment, no student input, and no explicit psychometric modeling, relying instead on self-consistency and cross-model agreement to identify cases that merit human review. This design allows large portions of an assignment to be graded automatically when confidence is high, while preserving instructor oversight where it is most needed.
Nevertheless, several limitations of our study warrant clarification and point towards future research opportunities. A first limitation concerns the use of human grading as the reference standard. Although interrater reliability in our study was high, consistent with prior work on rubric-based assessment [53,54], human grading is not infallible. In earlier work using data from the same course, we observed cases in which human graders made mistakes that LLMs avoided, such as overlooking rubric criteria or syntax errors [1]. It is therefore possible that some student answers in the present study were consistently misgraded by all human raters despite high agreement, highlighting that aggregated human scores represent a pragmatic benchmark rather than an absolute ground truth. Future work could address this limitation by incorporating more objective reference measures where available. In programming education, aspects of student solutions can often be evaluated using unit tests or static analysis [44,45]. Comparing LLM and human grading against such criteria would allow a more precise assessment of shared and complementary failure modes, and could further strengthen SURE in hybrid assessment settings that combine open-ended evaluation with automatically verifiable components.
A second limitation concerns the way human regrading was simulated. Regrading was performed by randomly sampling among available graders, which likely understates persistent rater-specific tendencies (e.g., strict vs. lenient grading) and may therefore make SURE appear less biased by partially averaging out grader effects. In practice, however, many courses rely on a single instructor or teaching assistant per submission, making such tendencies unavoidable. When multiple graders are involved, a common mitigation is to assign graders by question rather than by student, which improves consistency within items while preserving independent judgments across graders. In the context of SURE, such best practices should be maintained during regrading to avoid scoring some submissions more strictly or leniently than others.
Another limitation concerns how uncertainty was operationalized for flagging. We relied on prompt agreement, specifically the relative frequency of the modal score, as a simple and interpretable certainty measure. This performed well under the coarse 0–1 grading scheme used here, but may ignore other uncertainty cues in settings with more continuous scoring. Although we evaluated alternative distributional metrics and did not observe meaningful improvements in the present study, future work should examine alternative uncertainty metrics, some of which can be found in recent studies operationalizing self-consistency-based uncertainty quantification in various contexts [14,15,26]. Additionally, future work should explore whether uncertainty-based flagging is unbiased with respect to students. While we did not observe any obvious patterns, the literature on this topic is mixed, with Rodrigues et al. [79] reporting no bias in short answer grading but An et al. [80] reporting gender and racial biases in resume screening. In the context of SURE, it is possible that certain answer styles, languages, or other factors systematically increase or decrease the likelihood of being flagged, which could inadvertently introduce inequities in grading.
Relatedly, setting an appropriate flagging threshold poses another practical challenge. While thresholds can be tuned using historical data and objective functions such as the F1 score [8], such tuning requires additional data and effort and may therefore be impractical in many instructional settings. Instead, instructors may prefer simple heuristics such as regrading all cases with at least one deviating score across repeated evaluations (i.e., an aggressive threshold that flags anything below 100% certainty), regrading the lowest p percent of certainty values, or regrading a fixed proportion of submissions (e.g., half of an assignment). Such heuristics avoid reliance on calibration data and allow instructors to directly control the amount of human effort invested, but they require careful consideration of how much uncertainty can be tolerated for a given assessment context.
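The heuristics mentioned above could be implemented along the following lines; this is an illustrative sketch, and the threshold and percentage values are arbitrary examples.

```python
import numpy as np

def flag_below_threshold(certainty, threshold=0.7):
    """Flag every answer whose certainty falls below a fixed threshold."""
    return np.asarray(certainty) < threshold

def flag_any_disagreement(certainty):
    """Aggressive heuristic: flag any answer where repeated scores were not unanimous."""
    return np.asarray(certainty) < 1.0

def flag_lowest_percent(certainty, percent=20):
    """Flag the lowest `percent` percent of certainty values."""
    certainty = np.asarray(certainty)
    cutoff = np.percentile(certainty, percent)
    return certainty <= cutoff

# Example with made-up certainty values (fraction of repeated scores agreeing with the mode)
certainties = [1.0, 0.9, 0.6, 1.0, 0.5, 0.8]
print(flag_below_threshold(certainties))   # [False False  True False  True False]
print(flag_any_disagreement(certainties))  # [False  True  True False  True  True]
print(flag_lowest_percent(certainties, 25))
```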
Alternatively, we think a particularly promising extension lies in combining SURE with student self-grading, similar in spirit to the AVALON framework [78]. Instead of flagging uncertain cases for instructors, students could receive transparent grading reports that include rubric-based predicted scores, results from repeated LLM ensemble evaluations, and visualizations of score consistency (e.g., histograms of scores across runs). Low-certainty cases would thus be flagged not for teachers but for students, who could be asked to self-grade their work using the same criteria and indicate whether they agree with the model’s assessment. Submissions for which student self-grades and LLM-based grades diverge could then be deferred to instructors for review. Such a workflow could provide students with timely and detailed feedback, which is a powerful driver of learning and often neglected in higher-education settings [81,82]. Prior research has shown that students appreciate LLM-generated feedback and perceive it as helpful when it is timely and transparent [1,83,84,85], suggesting an opportunity to integrate assessment and feedback to support learning rather than merely assigning grades. Requiring students to engage with grading criteria and evaluate their own work may further enhance learning, as self-assessment has been shown to improve performance, metacognition, and self-regulated learning [86]. From a practical perspective, such a framework could be integrated into a learning management system such as Canvas [87], building on existing LLM-based grading and feedback integrations [1]. One potential drawback is that increased transparency may limit the reuse of identical assignment questions across years, but this may be an acceptable trade-off given the potential gains in feedback timeliness and transparency, and the opportunity for instructors to reinvest saved time in tutoring and individualized support.
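To make the proposed workflow concrete, the following sketch outlines the routing logic it implies. All names, return values, and the example threshold are hypothetical and would need to be adapted to a concrete implementation:

```python
def route_submission(llm_score, certainty, threshold, student_self_score=None):
    """Illustrative routing for a SURE + self-grading workflow (hypothetical).

    - High-certainty LLM scores are accepted automatically.
    - Low-certainty cases are first sent to the student with a transparent
      grading report; if the student's self-grade matches the LLM score, it is
      accepted, otherwise the case is deferred to the instructor.
    """
    if certainty >= threshold:
        return "accept_llm_score"
    if student_self_score is None:
        return "request_student_self_grade"
    if student_self_score == llm_score:
        return "accept_llm_score"
    return "defer_to_instructor"

# Example: an uncertain case where the student disagrees with the model
print(route_submission(llm_score=0, certainty=0.6, threshold=0.7, student_self_score=1))
# -> "defer_to_instructor"
```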
A final limitation concerns the generalizability of our findings. Our evaluation was based on data from a single, relatively small introductory programming course with a coarse grading scheme and a specific set of assignments, which limits the extent to which conclusions can be directly transferred to other domains, educational levels, or assessment formats. At the same time, this setting provides a demanding test case: the assignments combine open-ended questions and coding exercises, partial credit, and heterogeneous student solutions, and have been shown to expose both LLM and human grading errors in prior work [1]. Future research should therefore validate SURE across a broader range of contexts, including courses outside programming, finer-grained or analytic rubrics, different languages, and different task formats. We are currently collecting data from additional courses spanning multiple disciplines, languages, and assessment forms to this end, and we encourage other researchers to replicate and extend our findings in diverse educational settings. Beyond validating performance, such studies should investigate alternative diversification and ensembling strategies, varied uncertainty metrics, and escalation heuristics or data-driven flagging mechanisms, and assess which parts of SURE as presented here are most sensitive to contextual factors. Establishing how SURE can be adapted to diverse instructional settings will be essential for assessing its potential as a general, domain-agnostic framework for reliable and scalable AI-assisted grading that genuinely supports educators while maintaining the rigor and integrity of human judgment.
Figure 1.
Threshold tuning based on objective scores. Each ◇ indicates the threshold that maximized the objective for one of the 56 conditions. The dashed vertical line indicates the median threshold across all conditions, which we used as a fixed threshold for flagging in the test set.
Figure 2.
Certainty distributions for Correct and Incorrect scores across models. Proportions are normalized within each category, such that the bars for Correct and Incorrect each sum to one. Histograms use bins of 5% certainty. Dashed lines show tuned thresholds for the 56 conditions.
Figure 3.
Assignment-level grading accuracy across models and conditions in the training set. SURE (∆) consistently achieves the highest accuracies, with gpt-oss-20b and the ensemble reaching human performance (grey band).
Figure 4.
Posterior estimates of grading accuracy for human graders and LLMs under SURE with tuned certainty thresholds in the training set. The left panel shows posterior means and 95% HDIs for the probability (transformed log-odds) of scoring a student answer correctly. The right panel displays pairwise dominance probabilities, indicating for each row–column pair the posterior probability that the grader in the row is more accurate than the grader in the column.
Figure 6.
Alignment of assignment grades awarded by human graders (black vertical bars), the target ranges defined by the minimal and maximal human and ground truth grades (grey areas), fully automated LLM grades based on single-prompt (∘), majority-voting (□), and human-in-the-loop LLM grading with SURE (∆) in the training set. Alignment is markedly improved by SURE.
Figure 7.
Certainty distributions for Correct and Incorrect scores in the test set. Proportions are normalized within each category, such that the bars for Correct and Incorrect each sum to one. Histograms use bins of 5% certainty. The vertical dashed line marks the fixed certainty threshold used for flagging in the test set.
Figure 8.
Assignment-level grading accuracy in the test set. SURE (∆) consistently achieves the highest accuracies, with the ensemble reaching human performance (grey band).
Figure 9.
Posterior estimates of grading accuracy for human graders and LLMs under SURE with tuned certainty thresholds in the test set. The left panel shows posterior means and 95% HDIs for the probability (transformed log-odds) of scoring a student answer correctly. The right panel displays pairwise dominance probabilities, indicating for each row–column pair the posterior probability that the grader in the row is more accurate than the grader in the column.
Figure 10.
Posterior estimates of grading bias for human graders and LLMs under SURE with tuned certainty thresholds in the test set. The left panel shows posterior means and 95% HDIs for grading bias (deviation from closest ground-truth). The right panel displays pairwise dominance probabilities, indicating for each row–column pair the posterior probability that the grader in the row is less biased (closer to zero) than the grader in the column.
Figure 11.
Alignment of assignment grades awarded by human graders (black vertical bars), the target ranges defined by the minimal and maximal human and ground truth grades (grey areas), fully automated LLM grades based on single-prompt (∘), majority-voting (□), and human-in-the-loop LLM grading with SURE (∆) in the test set. Alignment is markedly improved by SURE.
Figure 12.
Certainty distributions for Correct and Incorrect scores in the test set using three additional models and their ensemble. Proportions are normalized within each category, such that the bars for Correct and Incorrect each sum to one. Histograms use bins of 5% certainty. The vertical dashed line marks the fixed certainty threshold used for flagging in the test set.
Figure 13.
Assignment-level grading accuracy of three additional models and their ensemble in the test set. SURE (∆) consistently achieves the highest accuracies, with the ensemble reaching human performance (grey band and black lines).
Figure 14.
Posterior estimates of grading accuracy for human graders, and three additional LLMs and their ensemble under SURE with certainty thresholds fixed at 0.7 in the test set. The left panel shows posterior means and 95% HDIs for the probability (transformed log-odds) of scoring a student answer correctly. The right panel displays pairwise dominance probabilities, indicating for each row–column pair the posterior probability that the grader in the row is more accurate than the grader in the column.
Figure 15.
Posterior estimates of grading bias for human graders, and three additional LLMs and their ensemble under SURE with certainty thresholds fixed at 0.7 in the test set. The left panel shows posterior means and 95% HDIs for grading bias (deviation from closest ground-truth). The right panel displays pairwise dominance probabilities, indicating for each row–column pair the posterior probability that the grader in the row is less biased (closer to zero) than the grader in the column.
Figure 16.
Alignment of overall course grades awarded by human graders (black vertical bars), the target range defined by the minimal and maximal human and ground truth grades (grey areas), fully automated LLM grades based on a single-prompt (∘) or majority-voting (□), and human-in-the-loop LLM grading with SURE (∆) using the three new models and their ensemble in the test set. Alignment is markedly improved by SURE.
Table 1.
Model configurations and diversification settings.

| LLM | temperature | top_p | text verbosity | shuffled rubrics | varied personas | varied languages | n conditions |
|---|---|---|---|---|---|---|---|
| Prompting conditions | | | | | | | |
| gpt-4.1-nano | 0 / 1 | 0.1 / 1 | - | no / yes | no / yes | no / yes | 24 |
| gpt-5-nano | - | - | low / medium | no / yes | no / yes | no / yes | 16 |
| gpt-oss-20b | - | - | medium | no / yes | no / yes | no / yes | 8 |
| Post-hoc conditions | | | | | | | |
| ensemble | 1 (gpt-4.1-nano) | 1 (gpt-4.1-nano) | medium (gpt-5-nano & gpt-oss-20b) | no / yes | no / yes | no / yes | 8 |
Table 2.
Illustrative grading procedure dataset: each row represents the outcome of a specific grading procedure applied to a student’s answer. Condition 1 illustrates all three grading procedures for prompting conditions. Condition 2000 illustrates ensemble conditions, for which single-prompt grading is missing, as it does not make practical sense for an ensemble.

| student | question | condition | procedure | correct | error |
|---|---|---|---|---|---|
| 1 | #R23 | 1 | SP | 0 | -0.5 |
| 1 | #R23 | 1 | MV | 0 | 0.25 |
| 1 | #R23 | 1 | SURE | 1 | 0 |
| 1 | #R23 | 2000 | MV | 0 | 0.25 |
| 1 | #R23 | 2000 | SURE | 1 | 0 |
Table 3.
Illustrative rows from the condition-level dataset. The first three rows show variations of temperature and rubric shuffling for gpt-4.1-nano; the last row shows an ensemble condition without prompt perturbations.

| condition | llm | temp | topp | verb | shuf | pers | lang |
|---|---|---|---|---|---|---|---|
| 1 | gpt-4.1-nano | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | gpt-4.1-nano | 1 | 1 | 0 | 0 | 0 | 0 |
| 3 | gpt-4.1-nano | 1 | 1 | 0 | 1 | 0 | 0 |
| 2000 | ensemble | 1 | 1 | 1 | 0 | 0 | 0 |
Table 4.
Illustrative grader dataset: each row shows the score a student answer received from a human grader (grader-1, grader-2, grader-3, grader-4) or the human-in-the-loop SURE protocol with a given LLM (gpt-4.1-nano, gpt-5-nano, gpt-oss-20b, ensemble). Values are for illustration purposes only and do not show real data.

| student | question | grader | correct | error |
|---|---|---|---|---|
| 1 | #R23 | grader-1 | 0 | 0.25 |
| 1 | #R23 | grader-2 | 0 | -0.5 |
| 1 | #R23 | grader-3 | 1 | 0 |
| 1 | #R23 | grader-4 | 1 | 0 |
| 1 | #R23 | gpt-4.1-nano | 0 | -0.75 |
| 1 | #R23 | gpt-5-nano | 0 | 0.25 |
| 1 | #R23 | gpt-oss-20b | 1 | 0 |
| 1 | #R23 | ensemble | 1 | 0 |
Table 5.
Regression Coefficients.

| Coefficient | Mean | 2.5% HDI | 97.5% HDI |
|---|---|---|---|
| HDI excludes zero | | | |
| Intercept | 1.601 | 1.199 | 1.980 |
| procedure[single-prompt] | -0.230 | -0.300 | -0.161 |
| procedure[SURE] | 1.311 | 1.219 | 1.400 |
| llm[gpt-5-nano] | 1.437 | 1.319 | 1.557 |
| llm[gpt-oss-20b] | 1.958 | 1.821 | 2.100 |
| llm[ensemble] | 1.989 | 1.830 | 2.147 |
| topp(llm=gpt-4.1-nano; temp=1) | 0.176 | 0.075 | 0.275 |
| languages | -0.080 | -0.167 | -0.002 |
| procedure[SURE] : llm[ensemble] | -0.742 | -0.880 | -0.601 |
| procedure[SURE] : llm[gpt-5-nano] | -0.632 | -0.751 | -0.513 |
| procedure[SURE] : llm[gpt-oss-20b] | -0.686 | -0.831 | -0.542 |
| topp(llm=gpt-4.1-nano; temp=1) : procedure[single-prompt] | -0.158 | -0.233 | -0.082 |
| languages : procedure[single-prompt] | -0.527 | -0.580 | -0.475 |
| languages : llm[gpt-5-nano] | 0.295 | 0.192 | 0.397 |
| languages : llm[gpt-oss-20b] | 0.283 | 0.160 | 0.397 |
| languages : shuffle_rubrics | -0.082 | -0.140 | -0.024 |
| languages : topp(llm=gpt-4.1-nano; temp=1) | -0.169 | -0.260 | -0.073 |
| 1\|student_sigma | 0.355 | 0.282 | 0.435 |
| 1\|question_sigma | 1.253 | 1.004 | 1.531 |
| HDI includes zero | | | |
| temp(llm=gpt-4.1-nano) | -0.002 | -0.100 | 0.099 |
| verb(llm=gpt-5-nano) | 0.067 | -0.076 | 0.195 |
| shuffle_rubrics | 0.067 | -0.013 | 0.152 |
| personalities | -0.003 | -0.086 | 0.077 |
| procedure[single-prompt] : llm[ensemble-3.5] | -0.018 | -1.987 | 1.846 |
| procedure[single-prompt] : llm[gpt-5-nano] | -0.098 | -0.189 | 0.004 |
| procedure[single-prompt] : llm[gpt-oss-20b] | 0.111 | -0.002 | 0.230 |
| temp(llm=gpt-4.1-nano) : procedure[single-prompt] | -0.003 | -0.072 | 0.075 |
| temp(llm=gpt-4.1-nano) : procedure[SURE] | 0.010 | -0.078 | 0.113 |
| temp(llm=gpt-4.1-nano) : shuffle_rubrics | -0.034 | -0.125 | 0.060 |
| temp(llm=gpt-4.1-nano) : personalities | 0.011 | -0.085 | 0.101 |
| temp(llm=gpt-4.1-nano) : languages | 0.021 | -0.076 | 0.111 |
| topp(llm=gpt-4.1-nano; temp=1) : procedure[SURE] | 0.035 | -0.065 | 0.130 |
| topp(llm=gpt-4.1-nano; temp=1) : shuffle_rubrics | -0.030 | -0.125 | 0.063 |
| topp(llm=gpt-4.1-nano; temp=1) : personalities | 0.000 | -0.099 | 0.088 |
| verb(llm=gpt-5-nano) : procedure[single-prompt] | -0.048 | -0.161 | 0.070 |
| verb(llm=gpt-5-nano) : procedure[SURE] | -0.035 | -0.175 | 0.112 |
| verb(llm=gpt-5-nano) : shuffle_rubrics | -0.081 | -0.196 | 0.037 |
| verb(llm=gpt-5-nano) : personalities | -0.062 | -0.173 | 0.062 |
| verb(llm=gpt-5-nano) : languages | 0.047 | -0.065 | 0.164 |
| shuffle_rubrics : procedure[SURE] | 0.032 | -0.032 | 0.099 |
| shuffle_rubrics : procedure[single-prompt] | -0.014 | -0.065 | 0.036 |
| shuffle_rubrics : llm[ensemble-3.5] | -0.127 | -0.269 | 0.022 |
| shuffle_rubrics : llm[gpt-5-nano] | 0.004 | -0.099 | 0.121 |
| shuffle_rubrics : llm[gpt-oss-20b] | -0.030 | -0.153 | 0.092 |
| shuffle_rubrics : personalities | -0.007 | -0.062 | 0.051 |
| personalities : procedure[SURE] | -0.009 | -0.077 | 0.053 |
| personalities : procedure[single-prompt] | 0.002 | -0.053 | 0.053 |
| personalities : llm[ensemble-3.5] | 0.123 | -0.023 | 0.260 |
| personalities : llm[gpt-5-nano] | 0.015 | -0.093 | 0.121 |
| personalities : llm[gpt-oss-20b] | 0.064 | -0.059 | 0.181 |
| personalities : languages | -0.023 | -0.080 | 0.037 |
| languages : procedure[SURE] | -0.036 | -0.106 | 0.022 |
| languages : llm[ensemble-3.5] | 0.101 | -0.046 | 0.240 |
| 1\|condition_sigma | 0.027 | 0.000 | 0.052 |
Table 6.
Alignment of LLM grades with human grade target ranges under different procedures for assignment 1.

| LLM | Grading Procedure | % in Target Range | Maximum Grade Deviation | Median Grade Deviation |
|---|---|---|---|---|
| Assignment 1 | | | | |
| gpt-4.1-nano | SP | 6.522 | 1.9 | 0.85 |
| | MV | 8.696 | 1.2 | 0.5 |
| | SURE | 60.870 | 0.4 | 0.1 |
| gpt-5-nano | SP | 19.565 | 1.1 | 0.40 |
| | MV | 47.826 | 0.7 | 0.1 |
| | SURE | 60.870 | 0.5 | 0.1 |
| gpt-oss-20b | SP | 54.348 | 0.7 | 0.1 |
| | MV | 52.174 | 0.6 | 0.1 |
| | SURE | 73.913 | 0.3 | 0 |
| ensemble | MV | 45.652 | 0.7 | 0.1 |
| | SURE | 73.913 | 0.4 | 0 |
Table 7.
Manual grading time and manual regrading time after SURE (minutes) with time savings for assignment 1.¹

| Grader | Manual (min) | gpt-4.1-nano regrading (min, % saved) | gpt-5-nano regrading (min, % saved) | gpt-oss-20b regrading (min, % saved) | Ensemble regrading (min, % saved) |
|---|---|---|---|---|---|
| Assignment 1 | | | | | |
| Grader 1 | 186 | 85 (54%) | 22 (88%) | 19 (90%) | 22 (88%) |
| Grader 2 | 195 | 95 (51%) | 25 (87%) | 24 (88%) | 26 (87%) |
| Grader 3 | 399 | 203 (49%) | 56 (87%) | 57 (86%) | 68 (83%) |
| Grader 4 | 238 | 115 (52%) | 30 (87%) | 33 (86%) | 38 (84%) |
Table 8.
Alignment of LLM grades with human grade target ranges under different procedures for assignments 2–5.

| LLM | Grading Procedure | % in Target Range | Maximum Grade Deviation | Median Grade Deviation |
|---|---|---|---|---|
| Assignment 2 | | | | |
| gpt-4.1-nano | SP | 4.444 | 2.8 | 1.1 |
| | MV | 8.889 | 2.8 | 1.0 |
| | SURE | 42.222 | 1.2 | 0.2 |
| gpt-5-nano | SP | 33.333 | 1.8 | 0.3 |
| | MV | 46.667 | 1.7 | 0.3 |
| | SURE | 73.333 | 1.2 | 0.1 |
| gpt-oss-20b | SP | 55.556 | 1.1 | 0.2 |
| | MV | 57.778 | 1.2 | 0.2 |
| | SURE | 75.556 | 1.0 | 0.1 |
| ensemble | MV | 55.556 | 1.4 | 0.2 |
| | SURE | 91.111 | 0.4 | 0.1 |
| Assignment 3 | | | | |
| gpt-4.1-nano | SP | 4.348 | 2.8 | 0.85 |
| | MV | 15.217 | 2.4 | 0.7 |
| | SURE | 71.739 | 0.6 | 0.1 |
| gpt-5-nano | SP | 8.696 | 2.7 | 0.8 |
| | MV | 13.043 | 1.8 | 0.4 |
| | SURE | 50 | 1.2 | 0.2 |
| gpt-oss-20b | SP | 26.087 | 1.5 | 0.3 |
| | MV | 32.609 | 1.7 | 0.3 |
| | SURE | 52.174 | 1.4 | 0.1 |
| ensemble | MV | 26.087 | 1.8 | 0.3 |
| | SURE | 89.13 | 0.3 | 0.1 |
| Assignment 4 | | | | |
| gpt-4.1-nano | SP | 4.348 | 3.1 | 1.55 |
| | MV | 2.174 | 2.5 | 1.3 |
| | SURE | 19.565 | 1.6 | 0.6 |
| gpt-5-nano | SP | 19.565 | 4.4 | 0.8 |
| | MV | 34.783 | 3.8 | 0.3 |
| | SURE | 73.913 | 1.9 | 0.0 |
| gpt-oss-20b | SP | 54.348 | 3.5 | 0.3 |
| | MV | 47.826 | 3.2 | 0.3 |
| | SURE | 60.87 | 1.3 | 0.0 |
| ensemble | MV | 54.348 | 3.2 | 0.3 |
| | SURE | 91.304 | 0.4 | 0.0 |
| Assignment 5 | | | | |
| gpt-4.1-nano | SP | 21.739 | 2.1 | 0.65 |
| | MV | 10.87 | 2.3 | 0.65 |
| | SURE | 54.348 | 0.9 | 0.2 |
| gpt-5-nano | SP | 32.609 | 2.6 | 0.45 |
| | MV | 47.826 | 1.2 | 0.2 |
| | SURE | 69.565 | 0.9 | 0.1 |
| gpt-oss-20b | SP | 58.696 | 1.5 | 0.2 |
| | MV | 76.087 | 0.8 | 0.1 |
| | SURE | 80.435 | 0.4 | 0.0 |
| ensemble | MV | 47.826 | 0.9 | 0.2 |
| | SURE | 84.783 | 0.4 | 0.0 |
Table 9.
Manual grading time and manual regrading time after SURE (minutes) with time savings for assignments 2–5.¹

| Grader | Manual (min) | gpt-4.1-nano regrading (min, % saved) | gpt-5-nano regrading (min, % saved) | gpt-oss-20b regrading (min, % saved) | Ensemble regrading (min, % saved) |
|---|---|---|---|---|---|
| Assignment 2 | | | | | |
| Grader 1 | 137 | 59 (57%) | 34 (75%) | 22 (84%) | 62 (55%) |
| Grader 2 | 224 | 102 (54%) | 50 (78%) | 30 (87%) | 95 (58%) |
| Grader 4 | 194 | 93 (52%) | 39 (80%) | 27 (86%) | 82 (58%) |
| Assignment 3 | | | | | |
| Grader 1 | 323 | 168 (48%) | 70 (78%) | 51 (84%) | 163 (50%) |
| Grader 2 | 380 | 214 (44%) | 91 (76%) | 67 (82%) | 201 (47%) |
| Grader 4 | 253 | 147 (42%) | 67 (74%) | 49 (81%) | 142 (44%) |
| Assignment 4 | | | | | |
| Grader 1 | 125 | 57 (54%) | 66 (47%) | 27 (78%) | 89 (29%) |
| Grader 2 | 145 | 65 (55%) | 87 (40%) | 30 (79%) | 107 (26%) |
| Grader 4 | 94 | 39 (59%) | 50 (47%) | 21 (78%) | 69 (27%) |
| Assignment 5 | | | | | |
| Grader 1 | 162 | 91 (44%) | 44 (73%) | 31 (81%) | 89 (45%) |
| Grader 2 | 185 | 96 (48%) | 58 (69%) | 37 (80%) | 99 (46%) |
| Grader 3 | 294 | 160 (46%) | 89 (70%) | 51 (83%) | 161 (45%) |
| Grader 4 | 192 | 99 (48%) | 55 (71%) | 36 (81%) | 105 (45%) |