Submitted:
15 November 2025
Posted:
19 November 2025
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Traditional AES Approaches
2.2. Neural and Deep Learning Models
2.3. LLMs in Automated Essay Scoring
2.4. Prompting Strategies in AES
2.5. Research Gap
3. Our Approach for Automated Essay Scoring
3.1. Dataset
- Lead (introductory segment),
- Position (explicit stance),
- Claim (supporting argument),
- Counterclaim (opposing viewpoint),
- Rebuttal (response to counterclaim),
- Evidence (facts/examples supporting claims),
- Concluding Summary (restatement of key arguments).
- persuade_2.0_human_scores_demo_id_github.csv, which contains full essay texts, holistic scores, demographic profiles, writing task types, source text information, and prompt metadata.
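The holistic-score file above can be read with the standard library alone. The sketch below is illustrative: the filename comes from the dataset description, but the column names (`full_text`, `holistic_essay_score`) are assumptions about the released CSV and should be checked against its header.

```python
import csv

# Minimal sketch: load essays and holistic scores from the PERSUADE 2.0
# CSV. The column names used here are assumptions, not confirmed field
# names from the release -- inspect the file header before relying on them.
def load_essays(path):
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [
            {"text": row.get("full_text", ""),
             "score": row.get("holistic_essay_score")}
            for row in reader
        ]
```

Reading with `csv.DictReader` rather than positional indexing keeps the loader robust to the extra demographic and prompt-metadata columns the file carries.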
3.2. Model Selection
3.3. Prompt Engineering
3.4. Evaluation Metrics
3.4.1. Exact Match (EM)
3.4.2. F1 Score
3.4.3. Mean Absolute Error (MAE)
3.4.4. Root Mean Squared Error (RMSE)
3.4.5. Average Absolute Deviation (AAD)
3.4.6. Pearson and Spearman Correlation
3.4.7. Cohen’s Kappa
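Several of the metrics above reduce to short closed-form computations over paired human/model scores. A minimal pure-Python sketch of EM, MAE, RMSE, and Pearson r follows (Spearman correlation and Cohen's Kappa are typically taken from libraries such as scipy and scikit-learn rather than re-implemented):

```python
import math

# Paired human scores (y) and model scores (p), both on the 1-6 scale.

def exact_match(y, p):
    # Fraction of essays where the model score equals the human score.
    return sum(a == b for a, b in zip(y, p)) / len(y)

def mae(y, p):
    # Mean absolute error between model and human scores.
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

def rmse(y, p):
    # Root mean squared error; penalizes large deviations more than MAE.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def pearson_r(y, p):
    # Linear correlation between model and human scores.
    n = len(y)
    my, mp = sum(y) / n, sum(p) / n
    cov = sum((a - my) * (b - mp) for a, b in zip(y, p))
    sy = math.sqrt(sum((a - my) ** 2 for a in y))
    sp = math.sqrt(sum((b - mp) ** 2 for b in p))
    return cov / (sy * sp)
```

Note that EM and MAE can disagree in ranking models: a model that is rarely exact but never far off can beat a model with many exact hits and a few large misses, which is why both are reported.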
4. Experimental Results
4.1. Prompt 1 Results
4.2. Prompt 2 Results
4.3. Prompt 3 Results
4.4. The Primacy of Prompt Design in AES
5. Conclusion and Future Work
References
- Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology 2025, 16, 1–72. [Google Scholar] [CrossRef]
- Stahl, M.; Biermann, L.; Nehring, A.; Wachsmuth, H. Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation. In Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024); Kochmar, E.; Bexte, M.; Burstein, J.; Horbach, A.; Laarmann-Quante, R.; Tack, A.; Yaneva, V.; Yuan, Z., Eds., Mexico City, Mexico, 2024; pp. 283–298.
- Zhang, T.; Jiang, Z.; Zhang, H.; Lin, L.; Zhang, S. MathMistake Checker: A Comprehensive Demonstration for Step-by-Step Math Problem Mistake Finding by Prompt-Guided LLMs. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025, Vol. 39, pp. 29730–29732.
- Pack, A.; Barrett, A.; Escalante, J. Large language models and automated essay scoring of English language learner writing: Insights into validity and reliability. Computers and Education: Artificial Intelligence 2024, 6, 100234. [Google Scholar] [CrossRef]
- Bu, J.; Ren, L.; Zheng, S.; Yang, Y.; Wang, J.; Zhang, F.; Wu, W. ASAP: A Chinese review dataset towards aspect category sentiment analysis and rating prediction. arXiv 2021, arXiv:2103.06605. [Google Scholar] [CrossRef]
- Crossley, S.; Tian, Y.; Baffour, P.; Franklin, A.; Benner, M.; Boser, U. A large-scale corpus for assessing written argumentation: PERSUADE 2.0. Assessing Writing 2024, 61, 100865. [Google Scholar] [CrossRef]
- Dikli, S. Automated Essay Scoring. The Turkish Online Journal of Distance Education 2006, 7. [Google Scholar]
- Leidner, D.E. Globalization, culture, and information: Towards global knowledge transparency. The Journal of Strategic Information Systems 2010, 19, 69–77. [Google Scholar] [CrossRef]
- Pérez-Rosas, V.; Kleinberg, B.; Lefevre, A.; Mihalcea, R. Automatic Detection of Fake News. In Proceedings of the 27th International Conference on Computational Linguistics; Bender, E.M.; Derczynski, L.; Isabelle, P., Eds., Santa Fe, New Mexico, USA, 2018; pp. 3391–3401.
- Kahng, M.; Tenney, I.; Pushkarna, M.; Liu, M.X.; Wexler, J.; Reif, E.; Kallarackal, K.; Chang, M.; Terry, M.; Dixon, L. LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models. IEEE Transactions on Visualization and Computer Graphics 2025, 31, 503–513. [Google Scholar] [CrossRef]
- Yang, K.; Raković, M.; Li, Y.; Guan, Q.; Gašević, D.; Chen, G. Unveiling the tapestry of automated essay scoring: A comprehensive investigation of accuracy, fairness, and generalizability. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024, Vol. 38, pp. 22466–22474.
- Su, J.; Yan, Y.; Fu, F.; Zhang, H.; Ye, J.; Liu, X.; Huo, J.; Zhou, H.; Hu, X. EssayJudge: A multi-granular benchmark for assessing automated essay scoring capabilities of multimodal large language models. arXiv 2025, arXiv:2502.11916. [Google Scholar]
- Staudemeyer, R. Understanding LSTM–a tutorial into long short-term memory recurrent neural networks. arXiv 2019, arXiv:1909.09586. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems 2017, 30.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 2019, pp. 4171–4186.
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020) 2020.
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
- Yavuz, F. Utilizing Large Language Models for EFL Essay Grading: An Examination of Reliability and Validity in Rubric-Based Assessments. British Journal of Educational Technology 2024. [Google Scholar] [CrossRef]
- Fiacco, J. Towards Extracting and Understanding the Implicit Rubrics in Transformer-Based Automated Essay Scoring. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023) 2023.
- Lundgren, M. Large Language Models in Student Assessment. arXiv 2024, arXiv:2406.16510. [Google Scholar] [CrossRef]
- Mansour, W.A.; et al. Can Large Language Models Automatically Score Essays? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), 2024.
- Wu, X.; Saraf, P.P.; Lee, G.G.; Latif, E.; Liu, N.; Zhai, X. Unveiling Scoring Processes: Dissecting the Differences between LLMs and Human Graders in Automatic Scoring. arXiv 2024. [Google Scholar] [CrossRef]
- Toulmin, S.E. The Uses of Argument; Cambridge University Press: Cambridge, UK, 1958. [Google Scholar]
- Nussbaum, E.M.; Kardash, C.M.; Graham, S. The Effects of Goal Instructions and Text on the Generation of Counterarguments During Writing. Journal of Educational Psychology 2005, 97, 157–169. [Google Scholar] [CrossRef]
- Stapleton, P.; Wu, Y.Y. Assessing the Quality of Arguments in Students’ Persuasive Writing: A Case Study Analysing the Relationship Between Surface Structure and Argumentative Quality. Journal of Second Language Writing 2015, 30, 1–11. [Google Scholar] [CrossRef]
- Touvron, H.; Martin, L.; Izacard, P.; et al. LLaMA 3.2: Instruction-Tuned Multilingual Generative Transformer. Meta AI, 2024. Available at: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct.
- DeepSeek AI. DeepSeek-R1: Open-Source Reasoning Model. DeepSeek AI, 2024. Available at: https://huggingface.co/deepseek-ai/DeepSeek-R1.
- Jiang, A.; et al. Mixtral of Experts: High-Accuracy Sparse MoE Model for Text Understanding. arXiv 2024, arXiv:2401.04088.
- Zhou, Y.; et al. Qwen2: Optimized for Low-Latency NLP Applications. Alibaba Cloud, 2023. Available at: https://huggingface.co/Qwen/Qwen2-7B.
- Zhou, Y.; et al. Qwen2.5: Enhanced for Numerical Accuracy and Rubric Alignment. Alibaba Cloud, 2024. Available at: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct.
- Zhang, W.; Litman, D. Automated essay scoring: A survey of the state of the art. International Journal of Artificial Intelligence in Education 2022. [Google Scholar]
- Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies 2011, 2, 37–63. [Google Scholar]
- Ramesh, D.; Sanampudi, S. An automated essay scoring systems: A systematic literature review. Artificial Intelligence Review 2022, 55, 249–289. [Google Scholar] [CrossRef] [PubMed]
- Sun, J. A survey of automated essay scoring. Neurocomputing 2025. [Google Scholar] [CrossRef]
- Attali, Y.; Burstein, J. Validity and reliability of automated essay scoring systems. Educational Testing Service Research Report 2016. [Google Scholar]
- Ben-Simon, A.; Bennett, R. Correlation between automated and human essay scoring. Applied Measurement in Education 2007, 20, 358–381. [Google Scholar]
- Cohen, J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960, 20, 37–46. [Google Scholar] [CrossRef]
| Essay ID | Prompt Text | Essay Excerpt | Discourse Elements | Score |
|---|---|---|---|---|
| 423A1CA112E2 | Do people use cell phones daily? | Modern humans today are always on their phone... | Lead, Claim | 3 |
| BC75783F96E3 | Mandatory recycling in cities? | Mandatory recycling programs are essential... | Lead, Claim | 4 |
| 74C8BC7417DE | Limit teen social media use? | Social media should be limited ... it causes mental health problems. | Claim, Evidence | 2 |
| A8445CABFECE | Banning homework—beneficial? | Banning homework would harm students’ ability to develop... | Counterclaim, Evidence | 3 |
| 6B4F7A0165B9 | Teach coding in elementary schools? | Introducing coding classes ... will prepare students for future careers. | Claim, Evidence | 4 |
| Provider | Model Name | Size | Description | Reference |
|---|---|---|---|---|
| Meta | LLaMA 3.2 3B | 3B | Instruction-tuned, efficient generative transformer. | [27] |
| DeepSeek | DeepSeek-R1 7B | 7B | Optimized for classification and reasoning tasks. | [28] |
| Mistral AI | Mistral 8×7B | 7B | High-accuracy model for nuanced text understanding. | [29] |
| Alibaba | Qwen2 7B | 7B | Streamlined for low-latency NLP applications. | [30] |
| Alibaba | Qwen2.5 7B | 7B | Enhanced for numerical accuracy and rubric alignment. | [31] |
| Prompt # | Type | Prompt Example (condensed) | Key Features |
|---|---|---|---|
| 1 | Rubric-aligned | “As an expert essay grader, assess the student’s response using the following criteria: Lead: Does the essay introduce the topic effectively? Position: Is the stance clear? ...Provide a score from 1 to 6. Output only the numerical score.” | Explicit definitions of each scoring dimension |
| 2 | Instruction-based | “You are an experienced English teacher tasked with grading a student’s argumentative essay. Assign a total score from 1 to 6 based on the following criteria: Lead, Position, Counterclaim, Rebuttal, Evidence, and Concluding Summary. Please provide only the final numerical score.” | Role assignment + named criteria (no definitions) |
| 3 | Instruction-based (minimal) | “Grade this student’s argumentative essay by evaluating lead, position, Counterclaim, Rebuttal, Evidence, and Concluding Summary. Assign a final grade from 1 to 6, reflecting how well the student constructs and supports their argument. Provide only the numerical score.” | No role; criteria named only; most concise |
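The three prompt variants share a common skeleton that differs mainly in role framing and how the criteria are presented. The sketch below assembles them from one template; the wording is condensed from the table above, and the function name and exact strings are illustrative rather than the precise prompts used in the experiments.

```python
# Illustrative sketch of the three prompt variants; the criteria names
# come from the PERSUADE 2.0 discourse elements, but the exact prompt
# strings here are condensed approximations, not the experimental text.
CRITERIA = ["Lead", "Position", "Claim", "Counterclaim",
            "Rebuttal", "Evidence", "Concluding Summary"]

def build_prompt(essay, variant):
    names = ", ".join(CRITERIA)
    if variant == 1:      # rubric-aligned: role + criteria with definitions
        head = ("As an expert essay grader, assess the student's response "
                "using the following criteria: " + names + ".")
    elif variant == 2:    # instruction-based: role + named criteria only
        head = ("You are an experienced English teacher tasked with grading "
                "a student's argumentative essay against: " + names + ".")
    else:                 # minimal: no role, criteria named only
        head = ("Grade this student's argumentative essay by evaluating: "
                + names + ".")
    return (head + " Assign a score from 1 to 6. "
            "Provide only the numerical score.\n\nEssay:\n" + essay)
```

Constraining the output to "only the numerical score" keeps the model's response machine-parseable, which is what makes the exact-match and error metrics in the next section directly computable.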
| Metric | LLaMA 3.2 3B | DeepSeek-R1 7B | Mistral 8×7B | Qwen2 7B | Qwen2.5 7B |
|---|---|---|---|---|---|
| Exact Match (EM) | 0.60 | 0.70 | 0.40 | 0.35 | 0.55 |
| F1 Score | 0.92 | 0.93 | 0.91 | 0.92 | 0.89 |
| MAE ↓ | 1.000 | 0.593 | 0.658 | 1.000 | 0.538 |
| RMSE ↓ | 1.140 | 0.629 | 0.835 | 1.304 | 0.734 |
| Pearson r ↑ | 0.863 | 0.834 | 0.657 | 0.371 | 0.672 |
| Spearman ↑ | 0.831 | 0.738 | 0.634 | 0.392 | 0.512 |
| Cohen’s κ ↑ | -0.06 | 0.06 | 0.16 | 0.07 | 0.14 |
| Precision | 0.85 | 0.93 | 0.89 | 0.85 | 1.00 |
| Recall | 1.00 | 0.93 | 0.94 | 1.00 | 0.80 |
| Essay Snippet | Human | LLaMA | DeepSeek | Mistral | Qwen2 | Qwen2.5 |
|---|---|---|---|---|---|---|
| New software has been created... | 5 | 6 | 5 | 3 | 5 | 6 |
| When you need advice?... | 4 | 4 | 3 | 4 | 5 | 4 |
| Cars—no driver... | 1 | 2 | 1 | 1 | 2 | 1 |
| Some may say they need cars... | 3 | 3 | 5 | 4 | 3 | 4 |
| Does the Electoral College work?... | 3 | 3 | 3 | 4 | 4 | 3 |
| Metric | LLaMA 3.2 3B | DeepSeek-R1 7B | Mistral 8×7B | Qwen2 7B | Qwen2.5 7B |
|---|---|---|---|---|---|
| Exact Match (EM) | 0.52 | 0.57 | 0.45 | 0.47 | 0.39 |
| F1 Score | 0.48 | 0.55 | 0.52 | 0.53 | 0.56 |
| MAE ↓ | 0.752 | 0.689 | 0.701 | 0.691 | 0.668 |
| RMSE ↓ | 0.904 | 0.802 | 0.835 | 0.824 | 0.793 |
| Pearson r ↑ | 0.442 | 0.503 | 0.487 | 0.491 | 0.526 |
| Spearman ↑ | 0.431 | 0.536 | 0.464 | 0.502 | 0.542 |
| Cohen’s κ ↑ | 0.31 | 0.37 | 0.34 | 0.35 | 0.38 |
| Precision | 0.46 | 0.53 | 0.50 | 0.51 | 0.54 |
| Recall | 0.44 | 0.50 | 0.48 | 0.49 | 0.51 |
| Essay Snippet | Human | LLaMA | DeepSeek | Mistral | Qwen2 | Qwen2.5 |
|---|---|---|---|---|---|---|
| New software has been created... | 5 | 3 | 5 | 2 | 3 | 6 |
| When you need advice?... | 4 | 4 | 3 | 4 | 5 | 3 |
| Cars—no driver... | 1 | 2 | 1 | 2 | 1 | 3 |
| Some may say they need cars... | 3 | 3 | 2 | 3 | 2 | 3 |
| Does the Electoral College work?... | 3 | 2 | 3 | 4 | 3 | 2 |
| Metric | LLaMA 3.2 3B | DeepSeek-R1 7B | Mistral 8×7B | Qwen2 7B | Qwen2.5 7B |
|---|---|---|---|---|---|
| Exact Match (EM) | 0.35 | 0.43 | 0.40 | 0.41 | 0.42 |
| F1 Score | 0.54 | 0.63 | 0.60 | 0.61 | 0.62 |
| MAE ↓ | 0.681 | 0.603 | 0.628 | 0.612 | 0.598 |
| RMSE ↓ | 0.852 | 0.726 | 0.755 | 0.743 | 0.714 |
| Pearson r ↑ | 0.508 | 0.578 | 0.536 | 0.371 | 0.582 |
| Spearman ↑ | 0.483 | 0.562 | 0.536 | 0.542 | 0.577 |
| Cohen’s κ ↑ | 0.43 | 0.49 | 0.47 | 0.46 | 0.48 |
| Precision | 0.54 | 0.59 | 0.57 | 0.56 | 0.58 |
| Recall | 0.52 | 0.56 | 0.54 | 0.54 | 0.56 |
| Essay Snippet | Human | LLaMA | DeepSeek | Mistral | Qwen2 | Qwen2.5 |
|---|---|---|---|---|---|---|
| New software has been created... | 5 | 4 | 5 | 4 | 5 | 4 |
| When you need advice?... | 4 | 2 | 3 | 4 | 5 | 4 |
| Cars—no driver... | 1 | 1 | 2 | 2 | 3 | 1 |
| Some may say they need cars... | 3 | 3 | 4 | 2 | 3 | 2 |
| Does the Electoral College work?... | 3 | 2 | 3 | 3 | 2 | 2 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).