Submitted:
22 May 2026
Posted:
25 May 2026
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Digital Writing Feedback Systems: From AWE to Generative Feedback
2.2. Feedback Constructs, Developmental Appropriateness, and Safety
2.3. Educator-Aligned Evaluation and Human-AI Feedback Workflows
3. Methodology
3.1. Study Design Overview
3.2. Participants and Roles
3.2.1. Study Design Contributors
3.2.2. Evaluator Participants
3.3. Dataset Construction
3.3.1. Dataset Summary
3.3.2. Data Organization, Preprocessing, and Quality Assurance
3.4. LLM Feedback Generation
3.4.1. Model Selection and Reproducibility
3.4.2. Prompt Design for Feedback Generation
3.5. Educator Evaluation Protocol
3.5.1. Phases 1-2: Early Stakeholder Engagement and Study Preparation (Months 1-9)
- The team clarified current feedback bottlenecks, such as turnaround time, limits on personalization at scale, and low revision uptake.
- We identified plausible roles for AI in the feedback workflow and delineated where human judgment must remain central (e.g., developmental priorities, grading, and high-stakes tasks).
- We determined year level and writing genres for evaluating LLM-generated feedback in context.
- The team agreed on constraints for any classroom-facing use, including anonymization requirements and the principle that students would not receive AI feedback without teacher mediation.
Phase 1: Initial Requirements Gathering (Months 1-3)
Phase 2: Study Design and Ethics Preparation (Months 4-9)
- Which year level should the writing samples target to maximize impact and generalizability?
- How should we structure the comparison between GPT-4 Turbo and tutor feedback to minimize bias?
- How can we ensure that the evaluation rubric is interpretable across different levels of teaching experience?
3.5.2. Phases 3-6: Rubric Development, Human Evaluation, and Data Cleaning for Analysis (Months 10-14)
Phase 3: Rubric Development and Calibration (Months 10-11)
Phase 4: Pilot Testing and Tool Iteration (Month 12)
Phase 5: Full Evaluation and Iterative Review (Months 13-14)
Phase 6: Data Integration and Analysis Preparation
3.6. Data Analysis
3.6.1. Quantitative Rubric Analysis
3.6.2. Qualitative Thematic Analysis
4. Results
4.1. Quantitative Findings
4.2. Qualitative Findings: Educator Perspectives
4.2.1. Clarity
C1: Clear and accessible language
C2: Accurate but overly advanced or dense wording
C3: Confusing or misleading explanations from both sources
4.2.2. Feasibility
F1: Manageable and scaffolded actions
F2/F3: Overload and lack of “How to” support
F4: Feasibility undermined by language or correctness, especially for GPT
4.2.3. Helpfulness
H1: Actionable and growth-oriented feedback
H2: Generic or low-value comments
H3/H4: Missing key issues and misalignment with student needs
4.2.4. Specificity
S1: Text-grounded specificity
S2/S4: Generic or missed opportunities, especially for tutors
S3: Specific but incorrect examples, mostly from GPT
4.2.5. Relevance
R1: Directly targeting core issues was rare
R2/R3: Over-emphasis on minor or partial issues
R4: Inaccurate or hallucinated relevance for GPT
4.2.6. Overall Effectiveness and Summary Judgments
O1: Surface-level improvement without conceptual growth
O2: Reliability and verification burden
O3: Comparative judgments between GPT and tutors
O4: AI as assistant, not replacement
5. Discussion and Educational Impact
5.1. Design Considerations for LLM-Generated Feedback
5.1.1. Consideration CR1: Balancing Accessible Language and Pedagogical Depth (C1–C2)
5.1.2. Consideration CR2: Moving beyond Surface Corrections to Conceptual Learning Goals (H1–H3, R1–R3, O1)
5.1.3. Consideration CR3: Trading Off Scalability and Specificity with Reliability and Verification Work (S1–S3, R4, O2)
5.1.4. Consideration CR4: Supporting Student Agency While Keeping Teachers in Control (F1–F4, H4, O3–O4)
5.2. Educational and Stakeholder Implications with Design Recommendations
5.2.1. Design Recommendations
5.3. Hybrid Human-AI Feedback Pipeline
5.4. Limitations
6. Conclusions
| 1 | Australian Curriculum: English, https://www.australiancurriculum.edu.au/. |
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Educator-Aligned Qualitative Codebook
| Code | Dimension | Label | Description with example paraphrase |
|---|---|---|---|
| CL1 | Clarity | Clear language | Straightforward, age-appropriate language. E.g., “Language is simple.” |
| CL2 | Clarity | Too advanced | Vocabulary/phrases too sophisticated for Year 5 (e.g., “coherence”). |
| CL3 | Clarity | Confusing | Internally inconsistent or contradictory explanations. |
| CL4 | Clarity | Undefined terms | Uses technical terms (e.g., “speech tags”) without definition. |
| CL5 | Clarity | Overlong | Content is clear but too dense for a weaker writer to process. |
| CL6 | Clarity | Generic | Language is clear but lacks draft-specific nuance or detail. |
| CL7 | Clarity | Redundant | The same idea is repeated several times without extra value. |
| CL8 | Clarity | Misaligned example | Examples do not match the point, hurting trust and clarity. |
| FE1 | Feasibility | Manageable | Actions are clearly signposted or paired with highlighted student text. |
| FE2 | Feasibility | Overwhelming | Too many points or expectations; the student is likely to feel overwhelmed. |
| FE3 | Feasibility | Vague | Requests lack concrete steps; student knows what to change but not how. |
| FE4 | Feasibility | Missing examples | No text-based demonstrations of what improved sentences might look like. |
| FE5 | Feasibility | Advanced language | Vocabulary (e.g., “convey”) is closer to teacher discourse than student language. |
| FE6 | Feasibility | Hallucinated issues | Feedback is based on wrong issues (e.g., claims there are no paragraphs). |
| FE7 | Feasibility | Surface-only | Changes are feasible but confined to minor edits (spelling, punctuation). |
| HE1 | Helpfulness | Growth-oriented | Concrete suggestions to improve the draft and long-term writing skills. |
| HE2 | Helpfulness | Low-value | Broad advice like “add more detail” without indicating where or how. |
| HE3 | Helpfulness | Missing key issue | A major learning need (e.g., tense control) is not addressed at all. |
| HE4 | Helpfulness | Technical focus | Limited to spelling/grammar; does not help with ideas or structure. |
| HE5 | Helpfulness | Misinterpreted needs | Encourages something educators see as problematic (e.g., praising repetition). |
| HE6 | Helpfulness | Local edits | Improves the specific piece but unlikely to develop broader competence. |
| HE7 | Helpfulness | Low scaffolding | Feedback is not broken down enough for a weak writer to act independently. |
| RE1 | Relevance | Targets real issues | Accurately identifies core problems (e.g., plot coherence, argument depth). |
| RE2 | Relevance | Peripheral issues | Focuses on low-stakes issues (e.g., spacing) while ignoring main problems. |
| RE3 | Relevance | Hallucination | Claims an error that is not present, leading to confusion. |
| RE4 | Relevance | Technical only | Relevant to technical correctness but misses content-level concerns. |
| RE5 | Relevance | Partial relevance | Broadly relevant, but the most important learning need is unaddressed. |
| RE6 | Relevance | Misleading praise | The “well done” section praises features that are actually absent. |
| SP1 | Specificity | Text-grounded | Cites specific sentences or phrases from the student’s text. |
| SP2 | Specificity | Generic strengths | Positive comments are vague with no supporting examples. |
| SP3 | Specificity | Specific/Generic mix | Improvement points are specific but praise remains generic. |
| SP4 | Specificity | No examples | Little or no quotation from text, even when suggesting key changes. |
| SP5 | Specificity | Specific but wrong | Cites an example that is misread, fabricated, or mislabeled. |
| SP6 | Specificity | Missed content | Could have drawn on plot details to make guidance specific but does not. |
| OE1 | Overall | Technical/Conceptual | Primarily improves surface correctness; fails to engage with conceptual goals. |
| OE2 | Overall | Missed priority | Fails to address what educators see as the key developmental priority. |
| OE3 | Overall | Verification required | Feedback needs substantial checking/rewriting before being shared. |
| OE4 | Overall | Over-praising | Too flattering, which may reduce student motivation to revise. |
| OE5 | Overall | Style mismatch | Uses American spelling or conventions not aligned with local curriculum. |
| OE6 | Overall | Worse than tutor | Evaluators explicitly judge tutor feedback as more effective overall. |
| OE7 | Overall | Better for weak | Evaluators judge GPT as more appropriate for very basic writers (simpler language). |
| OE8 | Overall | AI assistant | Useful starting point for teacher editing, but not a standalone tool. |
| OE9 | Overall | Hallucinations | Considered unusable because of frequent incorrect claims. |
Appendix B. Comparative Feedback Example
| Student draft (first draft) | Tutor feedback | GPT-4 Turbo feedback |
|---|---|---|
| I think that children should be able to earn pocket money, because they can use the money on snacks, video games and that can help them be happy because they can finally buy the things that they can’t afford. Children should be able to earn up to $2-5 by sweeping and $3-8 mopping. Why I think it’s good for kids to buy whatever they want is because they have their own freedom to use their money that they hardly earned and own. Why I don’t think that kids are able to buy whatever they want is because if you buy a subscription, then it would take money every month and there might be a chance that they can’t pay their subscription back. Then their parents will be angry. I strongly prefer letting children buy whatever they want because they own the money, and they care for what they bought because their parents should have told their child to care and that is my reason of why I think children can earn money and buy whatever they want like me. | Persuasive – “Pocket Money” (first draft). What you did well | What you did well |


| Phase | Guiding questions | Key outcomes |
|---|---|---|
| Pre-evaluation framing and study preparation (Months 1-9) | ||
| 1 | What are the main feedback bottlenecks in the current program? Where might AI realistically support faster, clearer, and more consistent feedback for primary learners? | Identified workload and consistency constraints; established shared expectations for age-appropriate feedback; and prioritized rubric dimensions, prompt structure, and initial boundary conditions for safe AI use. |
| 2 | How can we design a fair and feasible comparison under real tutor workload constraints? What safeguards are needed to protect student data and avoid over-claiming about AI effectiveness? | Confirmed Year 5 narrative and persuasive writing as the target context; secured ethics approval and consent; and defined a blind, within-script evaluation protocol with clear inclusion criteria. |
| Rubric development, educator evaluation, and analysis preparation (Months 10-14) | ||
| 3 | What makes feedback clear, specific, and developmentally appropriate for Year 5 writers? How can the evaluator’s interpretation be made consistent for comparative analysis? | Co-developed a six-dimensional rubric with plain-language definitions and annotated exemplars; strengthened shared interpretation and consistency across evaluators. |
| 4 | Are rubric labels and scale anchors clear in practice? Where do evaluators struggle or disagree? Does the evaluation template introduce friction or ambiguity? | Refined rubric wording and scale anchors; simplified the evaluation template; and fixed the GPT prompt configuration to ensure stable conditions for full evaluation. |
| 5 | How consistently can evaluators apply the rubric across scripts? What patterns distinguish tutor from AI feedback? Where does AI output require educator verification or correction? | Collected 56 paired tutor-GPT ratings per dimension; surfaced comparative strengths (e.g., clarity/feasibility) and weaknesses (e.g., specificity/relevance); and produced a high-quality dataset for quantitative and qualitative analyses. |
| 6 | How do we ensure data integrity and usability for analysis (scores, comments, and metadata) across multiple evaluator files? | Consolidated evaluator spreadsheets into a unified dataset; paired scores with comments; resolved formatting inconsistencies; and clarified ambiguous entries via follow-up with evaluators. |
| Metric | Student Drafts | Tutor Feedback | GPT Feedback |
|---|---|---|---|
| Count | 100 | 100 | 100 |
| Mean | 461.99 | 363.61 | 296.37 |
| Std. Dev. | 328.73 | 64.71 | 34.61 |
| Min | 39 | 227 | 214 |
| 25% | 270.50 | 318.50 | 271.75 |
| Median | 359.50 | 358.50 | 290.50 |
| 75% | 541.75 | 400.25 | 321.25 |
| Max | 1985 | 541 | 389 |
| Dimension | Description |
|---|---|
| Clarity | The feedback is clear, and the language is easy for the student to understand. |
| Specificity | The feedback addresses specific strengths and weaknesses within the writing. |
| Helpfulness | The feedback is practical and actionable, guiding the student toward specific improvements. |
| Feasibility | The feedback is understandable and manageable for the student. |
| Relevance | The feedback aligns with the student’s writing content. |
| Overall Effectiveness | The feedback supports the student’s writing development overall. |
| Analysis type | Methodology | Example output |
|---|---|---|
| Quantitative rubric comparison | Paired, within-script comparison of GPT vs. tutor Likert ratings on six criteria. Computed descriptive statistics, paired t-tests (Holm-Bonferroni), and paired effect sizes (d). | Helpfulness: tutors rated higher than GPT (mean difference = 0.43, d = 0.35, Holm-adjusted p = .07). |
| Educator-aligned thematic analysis | Reflexive thematic analysis of educators’ free-text comments using a structured codebook. Codes were synthesised into cross-cutting design considerations. | Themes included: “clear but shallow feedback”; “surface vs. conceptual goals”; “AI as assistant, not replacement”. |
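
The Holm-Bonferroni correction named in the quantitative row can be sketched as follows. This is a minimal illustration of the standard step-down procedure, not the authors' analysis script; the function name `holm_adjust` and the raw p-values are hypothetical and are not taken from the study.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k counted from 0) is multiplied by (m - k).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for six rubric dimensions (illustrative only).
raw = [0.63, 0.55, 0.011, 0.37, 0.10, 0.14]
adjusted = holm_adjust(raw)
print([round(p, 3) for p in adjusted])
```

With six comparisons, the smallest raw p-value is multiplied by six, which is why a raw value that looks significant in isolation can end up above .05 after adjustment.
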
| ID | Source | Cla. | Spe. | Hel. | Fea. | Rel. | Ove. | Overall comment (example) |
|---|---|---|---|---|---|---|---|---|
| 23 | Tutor | 4 | 3 | 4 | 3 | 4 | 4 | The student’s story is very lengthy. Feedback could have given more guidance on organising ideas, not just sentence-level issues. |
| 23 | GPT | 3 | 3 | 2 | 3 | 2 | 2 | Language is clear but focuses only on technical skills; misses structure and content issues, so would need teacher editing before use. |
| Dimension | Tutor M | Tutor SD | GPT-4 Turbo M | GPT-4 Turbo SD | Difference M | Difference SD | t | Holm-adjusted p | d |
|---|---|---|---|---|---|---|---|---|---|
| Clarity | 3.41 | 0.78 | 3.34 | 0.98 | 0.07 | 1.13 | 0.48 | 1.00 | 0.06 |
| Specificity | 2.93 | 0.76 | 2.82 | 1.01 | 0.11 | 1.33 | 0.60 | 1.00 | 0.08 |
| Helpfulness | 3.14 | 0.82 | 2.71 | 1.07 | 0.43 | 1.22 | 2.63 | .07 | 0.35 |
| Feasibility | 3.09 | 0.84 | 2.93 | 0.97 | 0.16 | 1.33 | 0.90 | 1.00 | 0.12 |
| Relevance | 3.16 | 0.85 | 2.86 | 1.14 | 0.30 | 1.37 | 1.65 | .52 | 0.22 |
| Overall effectiveness | 2.96 | 0.81 | 2.70 | 0.95 | 0.26 | 1.31 | 1.48 | .58 | 0.20 |
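
As a hedged cross-check of the table above, the t and d columns can be approximately recovered from the mean and standard deviation of the paired differences. The sketch assumes n = 56 pairs per dimension (the figure given in the phase table) and assumes d is the mean difference divided by the standard deviation of the differences; both assumptions appear consistent with the reported values, but the source does not state the formula explicitly. The function name `paired_summary` is hypothetical.

```python
import math

N_PAIRS = 56  # assumed from the "56 paired tutor-GPT ratings per dimension" outcome


def paired_summary(mean_diff, sd_diff, n=N_PAIRS):
    """Return (t, d) for a paired comparison from summary statistics."""
    standard_error = sd_diff / math.sqrt(n)
    t_stat = mean_diff / standard_error  # paired t statistic, df = n - 1
    d_z = mean_diff / sd_diff            # assumed effect-size definition
    return t_stat, d_z


# Helpfulness row: mean difference 0.43, SD of differences 1.22.
t, d = paired_summary(0.43, 1.22)
print(round(t, 2), round(d, 2))
```

For the Helpfulness row this gives values of roughly 2.64 and 0.35, in line with the tabulated 2.63 and 0.35; small discrepancies are expected because the inputs are already rounded to two decimals.
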
| Code | Dimension | Label | Description (with example paraphrase) |
|---|---|---|---|
| CL1 | Clarity | Clear, simple language | Feedback uses straightforward, age-appropriate language. E.g., “Language is simple; a Year 5 student could follow this.” |
| HE1 | Helpfulness | Actionable guidance | Feedback offers concrete, text-grounded suggestions. E.g., “Gives clear steps for how to reorganize ideas.” |
| SP5 | Specificity | Specific but wrong | Feedback cites a feature, but the example is misread or invented. E.g., “Criticizes capital letters used correctly.” |
| RE3 | Relevance | Hallucinated problem | Feedback identifies a problem not present in the draft. E.g., “Says there are no paragraphs when there clearly are.” |
| Dim. | Theme | GPT | Tutor | Total |
|---|---|---|---|---|
| Clarity | ||||
| C1 | Clear, accessible language | 16 | 5 | 21 |
| C2 | Accurate but too advanced / dense | 1 | 9 | 10 |
| C3 | Confusing or misleading explanation | 7 | 11 | 18 |
| Feasibility | ||||
| F1 | Manageable, scaffolded actions | 5 | 8 | 13 |
| F2 | Overload (too many actions) | 2 | 0 | 2 |
| F3 | Lacks concrete “how to” guidance | 10 | 15 | 25 |
| F4 | Feasibility harmed by language / errors | 8 | 3 | 11 |
| Helpfulness | ||||
| H1 | Actionable, growth-oriented feedback | 7 | 10 | 17 |
| H2 | Generic or low-value comments | 6 | 6 | 12 |
| H3 | Missing or purely technical focus | 10 | 3 | 13 |
| H4 | Misaligned to student needs | 12 | 9 | 21 |
| Specificity | ||||
| S1 | Text-grounded specificity | 8 | 11 | 19 |
| S2 | Generic, no examples | 12 | 24 | 36 |
| S3 | Specific but incorrect | 9 | 1 | 10 |
| S4 | Missed content-specificity | 8 | 12 | 20 |
| Relevance | ||||
| R1 | Directly targets core issues | 2 | 1 | 3 |
| R2 | Only minor / technical issues | 10 | 5 | 15 |
| R3 | Partial, missed opportunity | 4 | 14 | 18 |
| R4 | Inaccurate or hallucinated issues | 2 | 0 | 2 |
| Overall reflections | ||||
| O1 | Surface-focused, not conceptual | - | - | 11 |
| O2 | Reliability and verification burden | - | - | 10 |
| O3 | Comparative judgments | - | - | 27 |
| O4 | AI as assistant, not replacement | - | - | 8 |
| Stakeholder | Interpretation (from educator evidence) | Design implications |
|---|---|---|
| Educators & schools | Prioritize workload relief for routine checks while retaining authority over instructional focus, developmental appropriateness, tone, and curriculum alignment. Model output is acceptable only when verification is fast and responsibility remains with the educator. | Adopt an AI-as-draft, teacher-as-editor workflow with lightweight verification (e.g., check that each claim is supported by the draft; remove/replace misleading praise/criticism). Use rubric-linked prompt templates and maintain a shared “failure log” (unsupported claims, misalignment, inappropriate developmental judgments) to guide ongoing refinement. |
| Students & families | Value readable, encouraging feedback but may struggle to detect generic or incorrect suggestions. Trust depends on clarity about when AI is involved and what has been educator-checked. | Embed basic feedback literacy activities (e.g., identify evidence for a comment; revise generic advice into a concrete next step; compare tutor vs. AI feedback). Communicate when AI is used, what is educator-mediated, and how concerns can be raised; calibrate feedback depth to student readiness. |
| AI & tool designers | Need to scale feedback while maintaining controllability and alignment with local curricula, genres, and reading levels; specificity must be grounded to avoid misleading output. | Constrain generation with curriculum- and rubric-aligned templates; require each suggestion to reference a supporting text span. Provide controls for tone, reading level, and feedback depth; add automated checks for unsupported or off-topic claims before educator review. |
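
One way to approximate the "reference a supporting text span" recommendation in the table above is a pre-review check that flags quoted spans in the feedback that do not occur in the student draft. The sketch below is only an illustration under simple assumptions (exact substring matching after case and whitespace normalisation, straight or curly double quotes); the function names and the sample feedback string are hypothetical, and a deployed check would likely need fuzzier matching.

```python
import re


def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def unsupported_quotes(feedback, draft):
    """Return quoted spans in the feedback that do not appear in the draft."""
    # Capture spans inside straight ("...") or curly (“...”) double quotes.
    matches = re.findall(r'"([^"]+)"|“([^”]+)”', feedback)
    spans = [straight or curly for straight, curly in matches]
    draft_norm = normalise(draft)
    return [span for span in spans if normalise(span) not in draft_norm]


draft = ("Children should be able to earn up to $2-5 by sweeping "
         "and $3-8 mopping.")
# Hypothetical feedback: the first quote is grounded, the second is not.
feedback = ('You explain that children can "earn up to $2-5 by sweeping". '
            'Your example about "washing the car" needs more detail.')
flagged = unsupported_quotes(feedback, draft)
print(flagged)
```

Spans that are flagged would be passed to the educator for verification rather than removed automatically, which keeps the teacher-as-editor role described in the first row.
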
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).