Submitted: 18 December 2025
Posted: 19 December 2025
Abstract
Keywords:
1. Introduction

- We introduce the Reflective Reasoning System (RRS), a novel inference-time framework that integrates explicit self-diagnosis and self-correction mechanisms into large reasoning models.
- We demonstrate that RRS significantly enhances LRM accuracy and robustness in complex reasoning tasks across multiple domains, including mathematics, code, and science, without requiring any additional training or fine-tuning of the base models.
- We propose a prompting strategy leveraging special meta-cognitive tokens ([CRITIQUE] and [REVISE]) to effectively activate and guide an LRM’s inherent reflective capabilities, thereby improving its ability to identify and rectify internal reasoning errors.
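As a rough, hypothetical sketch of how such a prompting strategy could be wired up (not the authors' implementation: `generate` stands in for any LRM inference call, and the prompt wording is illustrative), the reflective cycle reads:

```python
# Minimal sketch of the RRS reflective cycle. `generate` is a placeholder for
# any LRM inference function; the prompt strings are illustrative, not the
# paper's actual templates.
def reflective_reasoning(generate, problem: str) -> str:
    # 1. Initial Reasoning Phase: a standard single pass over the problem.
    initial = generate(problem)
    # 2. Self-Diagnosis Phase: the [CRITIQUE] token cues the model to audit
    #    its own reasoning for logical gaps, miscalculations, or omissions.
    critique = generate(
        f"{problem}\n{initial}\n[CRITIQUE] Identify any errors in the reasoning above."
    )
    # 3. Self-Correction Phase: the [REVISE] token cues a corrected solution,
    #    conditioned on the problem, the first attempt, and the critique.
    revised = generate(
        f"{problem}\n{initial}\n{critique}\n[REVISE] Produce a corrected solution."
    )
    # 4. Final Answer Output.
    return revised
```

Since the cycle needs no gradients or weight updates, it applies to any base model at inference time, which is the sense in which RRS is training-free.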
2. Related Work
2.1. Enhancing Reasoning Capabilities in Large Language Models
2.2. Self-Correction and Reflective Mechanisms for Language Models
3. Method

3.1. Overview of Reflective Reasoning System (RRS)
3.2. The Reflective Reasoning Cycle
3.2.1. Initial Reasoning Phase
3.2.2. Self-Diagnosis Phase
3.2.3. Self-Correction Phase
3.2.4. Final Answer Output
3.3. Prompting Strategy and Meta-Cognitive Token Integration
4. Experiments
4.1. Models
- DeepSeek-R1-Distill-Qwen-1.5B
- DeepSeek-R1-Distill-Qwen-7B
- Qwen QwQ-32B
4.2. Datasets and Benchmarks
- Mathematical Reasoning:
  - AIME 2024 (AIME24)
  - AMC 2023 (AMC23)
  - Minerva-Math
- Code Generation:
  - LiveCodeBench (LiveCode)
- Scientific Reasoning:
  - OlympiadBench (Olympiad)
4.3. Implementation Details and Baselines
- BASE: This represents the standard, single-pass inference of the foundational LRM without any special prompting or reflective mechanisms. It serves as the lower bound for performance.
- CoD (Chain-of-Thought with Delimiters): Inspired by Chain-of-Thought prompting [40] and similar in spirit to AlphaOne [19], this method encourages the model to generate longer, more detailed reasoning paths. CoD does not use our ‘[CRITIQUE]’ and ‘[REVISE]’ tokens, but its structured prompting to guide reasoning makes it a competitive baseline for inference-time enhancements.
4.4. Quantitative Results
4.5. Analysis of Reflective Mechanisms
4.6. Human Evaluation
4.7. Performance Across Model Sizes
4.8. Ablation Study of RRS Components
- RRS (w/o Self-Correction): In this variant, the model undergoes the Initial Reasoning Phase and the Self-Diagnosis Phase (generating the critique C), but the final answer is taken directly from the initial reasoning. This isolates the effect of merely diagnosing potential errors without explicitly correcting them, quantifying the diagnostic capability rather than any correction-driven accuracy improvement. For Pass@1, this variant aligns with BASE by construction.
- RRS (Direct Revision): Here, the model performs the Initial Reasoning Phase and then proceeds directly to the Self-Correction Phase, skipping the explicit ‘[CRITIQUE]’ token and the generation of the critique C. Instead, the model is prompted to revise its initial output based on the original problem and its initial attempt. This variant assesses the value of an explicit, separate self-diagnosis step versus an implicit, combined diagnosis-and-correction process.
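Under the same illustrative assumptions used throughout (a generic `generate` inference call; prompt wording hypothetical, not the paper's templates), the two ablated variants differ from the full cycle only in which phases they run:

```python
def rrs_wo_self_correction(generate, problem: str) -> str:
    # Diagnose but do not correct: the critique is still produced (and can be
    # inspected for diagnostic quality), but the returned answer is the initial
    # attempt, so Pass@1 matches BASE by construction.
    initial = generate(problem)
    _critique = generate(
        f"{problem}\n{initial}\n[CRITIQUE] Identify any errors in the reasoning above."
    )
    return initial

def rrs_direct_revision(generate, problem: str) -> str:
    # Skip the explicit critique: revise directly from the initial attempt,
    # folding diagnosis and correction into one implicit step.
    initial = generate(problem)
    return generate(
        f"{problem}\n{initial}\n[REVISE] Produce a corrected solution."
    )
```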
4.9. Analysis of Error Types Corrected by RRS
- Logical Inconsistencies and Fallacies: In complex mathematical and scientific reasoning, LRMs often fall into traps of logical leaps or flawed deductions. RRS’s self-diagnosis phase frequently identified these instances, flagging statements like "The logical step from X to Y is not justified" or "This argument relies on an unstated assumption that may not hold." The subsequent self-correction then focused on establishing rigorous connections or rectifying the erroneous logic.
- Calculation Errors: Simple arithmetic or algebraic mistakes are common in LRM outputs, especially in multi-step problems. The ‘[CRITIQUE]’ phase in RRS was often able to pinpoint specific numerical errors, such as "Error in line 5: 3 * 7 is 21, not 24." This explicit identification allowed the ‘[REVISE]’ phase to recalculate and correct these precise points, leading to accurate final answers.
- Misinterpretation of Problem Constraints: Problems, particularly in code generation or competitive programming, often come with subtle constraints or edge cases that initial reasoning might overlook. RRS critiques often included statements like "Did I consider all edge cases for input N?" or "The problem specifies X, but my solution implicitly assumes Y." The self-correction phase then adjusted the code or reasoning to align with all problem requirements.
- Incomplete Reasoning or Omissions: Sometimes the initial reasoning might be truncated or skip crucial intermediate steps, leading to an unsupported or incorrect answer. The ‘[CRITIQUE]’ phase served to identify these gaps, prompting reflections such as "More detailed proof is needed for statement Z" or "I need to explicitly show how this step follows from previous ones." The self-correction then filled in these missing logical connections, resulting in a complete and verifiable solution.
- Sub-optimal Solutions (Refinement): Beyond outright errors, RRS also showed capabilities in identifying sub-optimal approaches. In code generation, for instance, a critique might suggest "This algorithm has quadratic complexity; a linear time solution might be possible." The ‘[REVISE]’ phase would then attempt to refactor the code for better efficiency or elegance.
5. Conclusion
References
- Mishra, S.; Mitra, A.; Varshney, N.; Sachdeva, B.; Clark, P.; Baral, C.; Kalyan, A. NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics, 2022a; pp. 3505–3523. [Google Scholar] [CrossRef]
- Zhou, Y.; Shen, J.; Cheng, Y. Weak to strong generalization for large language models with multi-capabilities. The Thirteenth International Conference on Learning Representations, 2025. [Google Scholar]
- Zhou, Y.; Li, X.; Wang, Q.; Shen, J. Visual In-Context Learning for Large Vision-Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, August 11–16, 2024; Association for Computational Linguistics, 2024; pp. 15890–15902.
- Liu, Y.; Yu, R.; Yin, F.; Zhao, X.; Zhao, W.; Xia, W.; Yang, Y. Learning quality-aware dynamic memory for video object segmentation. European Conference on Computer Vision, 2022; Springer; pp. 468–486. [Google Scholar]
- Liu, Y.; Bai, S.; Li, G.; Wang, Y.; Tang, Y. Open-vocabulary segmentation with semantic-assisted calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024a; pp. 3491–3500. [Google Scholar]
- Liu, Y.; Zhang, C.; Wang, Y.; Wang, J.; Yang, Y.; Tang, Y. Universal segmentation at arbitrary granularity with language instruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024b; pp. 3459–3469. [Google Scholar]
- Huang, J.; Huang, Y.; Liu, J.; Zhou, D.; Liu, Y.; Chen, S. Dual-Schedule Inversion: Training-and Tuning-Free Inversion for Real Image Editing. 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025; IEEE; pp. 660–669. [Google Scholar]
- Huang, Y.; Huang, J.; Liu, Y.; Yan, M.; Lv, J.; Liu, J.; Xiong, W.; Zhang, H.; Cao, L.; Chen, S. Diffusion model-based image editing: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025b. [Google Scholar]
- Huang, Y.; Huang, J.; Liu, J.; Yan, M.; Dong, Y.; Lv, J.; Chen, C.; Chen, S. Wavedm: Wavelet-based diffusion models for image restoration. IEEE Transactions on Multimedia 2024, Vol. 26, 7058–7073. [Google Scholar] [CrossRef]
- Zheng, L.; Tian, Z.; He, Y.; Liu, S.; Chen, H.; Yuan, F.; Peng, Y. Enhanced mean field game for interactive decision-making with varied stylish multi-vehicles. arXiv 2025, arXiv:2509.00981. [Google Scholar] [CrossRef]
- Lin, Z.; Tian, Z.; Lan, J.; Zhao, D.; Wei, C. Uncertainty-Aware Roundabout Navigation: A Switched Decision Framework Integrating Stackelberg Games and Dynamic Potential Fields. IEEE Transactions on Vehicular Technology 2025, 1–13. [Google Scholar] [CrossRef]
- Huang, S. AI-Driven Early Warning Systems for Supply Chain Risk Detection: A Machine Learning Approach. Academic Journal of Computing & Information Science 2025c, Vol. 8(No. 9), 92–107. [Google Scholar]
- Huang, S. Measuring Supply Chain Resilience with Foundation Time-Series Models. European Journal of Engineering and Technologies 2025, Vol. 1(No. 2), 49–56. [Google Scholar]
- Ren, L. Real-time Threat Identification Systems for Financial API Attacks under Federated Learning Framework. Academic Journal of Business & Management 2025, Vol. 7(No. 10), 65–71. [Google Scholar]
- Zhu, P.; Han, F.; Deng, H. Flexible multi-generator model with fused spatiotemporal graph for trajectory prediction. IET Conference Proceedings CP874 2023, Vol. 2023, IET, 417–422. [Google Scholar] [CrossRef]
- Zhu, P.; Zhao, S.; Han, F.; Deng, H. BEAVP: A Bidirectional Enhanced Adversarial Model for Video Prediction. 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG) 2024, 1–8. [Google Scholar]
- Zhu, P. An Empirical Comparative Study of Classical Dimensionality Reduction Methods: MDS, Isomap, and LLE. 2025. [Google Scholar] [CrossRef]
- Zhou, Y.; Geng, X.; Shen, T.; Tao, C.; Long, G.; Lou, J.-G.; Shen, J. Thread of thought unraveling chaotic contexts. arXiv 2023, arXiv:2311.08734. [Google Scholar] [CrossRef]
- Imani, S.; Du, L.; Shrivastava, H. MathPrompter: Mathematical Reasoning using Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 5: Industry Track); Association for Computational Linguistics, 2023; pp. 37–42. [Google Scholar] [CrossRef]
- Qiao, S.; Ou, Y.; Zhang, N.; Chen, X.; Yao, Y.; Deng, S.; Tan, C.; Huang, F.; Chen, H. Reasoning with Language Model Prompting: A Survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics, 2023; pp. 5368–5393. [Google Scholar] [CrossRef]
- Paranjape, B.; Michael, J.; Ghazvininejad, M.; Hajishirzi, H.; Zettlemoyer, L. Prompting Contrastive Explanations for Commonsense Reasoning Tasks. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Association for Computational Linguistics, 2021; pp. 4179–4192. [Google Scholar] [CrossRef]
- Wang, L.; Xu, W.; Lan, Y.; Hu, Z.; Lan, Y.; Lee, R. K.-W.; Lim, E.-P. Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics, 2023a; pp. 2609–2634. [Google Scholar] [CrossRef]
- Li, Z.; Jin, X.; Guan, S.; Li, W.; Guo, J.; Wang, Y.; Cheng, X. Search from History and Reason for Future: Two-stage Reasoning on Temporal Knowledge Graphs. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers); Association for Computational Linguistics, 2021; pp. 4732–4743. [Google Scholar] [CrossRef]
- Cao, S.; Shi, J.; Pan, L.; Nie, L.; Xiang, Y.; Hou, L.; Li, J.; He, B.; Zhang, H. KQA Pro: A Dataset with Explicit Compositional Programs for Complex Question Answering over Knowledge Base. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics, 2022; pp. 6101–6119. [Google Scholar] [CrossRef]
- Zhong, W.; Cui, R.; Guo, Y.; Liang, Y.; Lu, S.; Wang, Y.; Saied, A.; Chen, W.; Duan, N. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. In Findings of the Association for Computational Linguistics: NAACL 2024; Association for Computational Linguistics, 2024; pp. 2299–2314. [CrossRef]
- Mishra, S.; Finlayson, M.; Lu, P.; Tang, L.; Welleck, S.; Baral, C.; Rajpurohit, T.; Tafjord, O.; Sabharwal, A.; Clark, P.; Kalyan, A. LILA: A Unified Benchmark for Mathematical Reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2022b; pp. 5807–5832. [Google Scholar] [CrossRef]
- Tian, Z.; Lin, Z.; Zhao, D.; Zhao, W.; Flynn, D.; Ansari, S.; Wei, C. Evaluating scenario-based decision-making for interactive autonomous driving using rational criteria: A survey. arXiv 2025, arXiv:2501.01886. [Google Scholar] [CrossRef]
- Zhang, F.; Chen, H.; Zhu, Z.; Zhang, Z.; Lin, Z.; Qiao, Z.; Zheng, Y.; Wu, X. A survey on foundation language models for single-cell biology. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2025, 528–549. [Google Scholar]
- Zhang, F.; Liu, T.; Zhu, Z.; Wu, H.; Wang, H.; Zhou, D.; Zheng, Y.; Wang, K.; Wu, X.; Heng, P.-A. CellVerse: Do Large Language Models Really Understand Cell Biology? arXiv 2025b, arXiv:2505.07865. [Google Scholar]
- Zhang, F.; Liu, T.; Chen, Z.; Peng, X.; Chen, C.; Hua, X.-S.; Luo, X.; Zhao, H. Semi-supervised knowledge transfer across multi-omic single-cell data. Advances in Neural Information Processing Systems 2024, Vol. 37, 40861–40891.
- Sun, H.; Zhong, J.; Ma, Y.; Han, Z.; He, K. TimeTraveler: Reinforcement Learning for Temporal Knowledge Graph Forecasting. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2021; pp. 8306–8319. [Google Scholar] [CrossRef]
- Zuo, X.; Cao, P.; Chen, Y.; Liu, K.; Zhao, J.; Peng, W.; Chen, Y. Improving Event Causality Identification via Self-Supervised Representation Learning on External Causal Statement. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Association for Computational Linguistics, 2021; pp. 2162–2172. [Google Scholar] [CrossRef]
- Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N. A.; Khashabi, D.; Hajishirzi, H. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics, 2023b; pp. 13484–13508. [Google Scholar] [CrossRef]
- Hu, X.; Zhang, C.; Yang, Y.; Li, X.; Lin, L.; Wen, L.; Yu, P. S. Gradient Imitation Reinforcement Learning for Low Resource Relation Extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2021; pp. 2737–2746. [Google Scholar] [CrossRef]
- Wang, B.; Deng, X.; Sun, H. Iteratively Prompt Pre-trained Language Models for Chain of Thought. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022; Association for Computational Linguistics; pp. 2714–2730. [Google Scholar] [CrossRef]
- Dai, D.; Sun, Y.; Dong, L.; Hao, Y.; Ma, S.; Sui, Z.; Wei, F. Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers. In Findings of the Association for Computational Linguistics: ACL 2023; Association for Computational Linguistics, 2023; pp. 4005–4019. [Google Scholar] [CrossRef]
- Taranovsky, D. Reflective Cardinals. arXiv 2012, arXiv:1203.2270v5. [Google Scholar]
- Zhang, Y.; Li, Z.; Bao, Z.; Li, J.; Zhang, B.; Li, C.; Huang, F.; Zhang, M. MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, 2022; pp. 3118–3130. [Google Scholar] [CrossRef]
- Yang, W.; Lin, Y.; Li, P.; Zhou, J.; Sun, X. RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021; Association for Computational Linguistics; pp. 8365–8381. [Google Scholar] [CrossRef]
- Zhang, J.; Dong, R.; Wang, H.; Ning, X.; Geng, H.; Li, P.; He, X.; Bai, Y.; Malik, J.; Gupta, S.; Zhang, H. AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time. CoRR 2025c. [Google Scholar] [CrossRef]


Table: Quantitative results on DeepSeek-R1-Distill-Qwen-1.5B; deltas vs. BASE in parentheses.

| Model & Method | Benchmark | Pass@1 (%) | #Tk (avg. tokens) |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B (BASE) | AIME24 | 23.3 | 7280 |
| DeepSeek-R1-Distill-Qwen-1.5B (BASE) | AMC23 | 57.5 | 5339 |
| DeepSeek-R1-Distill-Qwen-1.5B (BASE) | Minerva-Math | 32.0 | 4935 |
| DeepSeek-R1-Distill-Qwen-1.5B (BASE) | LiveCode | 17.8 | 6990 |
| DeepSeek-R1-Distill-Qwen-1.5B (BASE) | Olympiad | 38.8 | 5999 |
| CoD | AIME24 | 30.0 (+6.7) | 6994 |
| CoD | AMC23 | 65.0 (+7.5) | 5415 |
| CoD | Minerva-Math | 29.0 (-3.0) | 4005 |
| CoD | LiveCode | 20.3 (+2.5) | 6657 |
| CoD | Olympiad | 40.6 (+1.8) | 5651 |
| Ours (RRS) | AIME24 | 31.5 (+8.2) | 7150 |
| Ours (RRS) | AMC23 | 66.5 (+9.0) | 5600 |
| Ours (RRS) | Minerva-Math | 30.0 (-2.0) | 4250 |
| Ours (RRS) | LiveCode | 21.0 (+3.2) | 6800 |
| Ours (RRS) | Olympiad | 41.8 (+3.0) | 5800 |
Table: Ablation of RRS components on DeepSeek-R1-Distill-Qwen-1.5B; deltas vs. BASE in parentheses.

| Model & Method | Benchmark | Pass@1 (%) | #Tk (avg. tokens) |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B (BASE) | AIME24 | 23.3 | 7280 |
| DeepSeek-R1-Distill-Qwen-1.5B (BASE) | AMC23 | 57.5 | 5339 |
| DeepSeek-R1-Distill-Qwen-1.5B (BASE) | Minerva-Math | 32.0 | 4935 |
| DeepSeek-R1-Distill-Qwen-1.5B (BASE) | LiveCode | 17.8 | 6990 |
| DeepSeek-R1-Distill-Qwen-1.5B (BASE) | Olympiad | 38.8 | 5999 |
| RRS (w/o Self-Correction) | AIME24 | 23.3 (+0.0) | 7050 |
| RRS (w/o Self-Correction) | AMC23 | 57.5 (+0.0) | 5400 |
| RRS (w/o Self-Correction) | Minerva-Math | 32.0 (+0.0) | 4990 |
| RRS (w/o Self-Correction) | LiveCode | 17.8 (+0.0) | 6850 |
| RRS (w/o Self-Correction) | Olympiad | 38.8 (+0.0) | 5850 |
| RRS (Direct Revision) | AIME24 | 27.5 (+4.2) | 6900 |
| RRS (Direct Revision) | AMC23 | 62.0 (+4.5) | 5480 |
| RRS (Direct Revision) | Minerva-Math | 29.5 (-2.5) | 4150 |
| RRS (Direct Revision) | LiveCode | 19.5 (+1.7) | 6700 |
| RRS (Direct Revision) | Olympiad | 40.5 (+1.7) | 5700 |
| Full RRS | AIME24 | 31.5 (+8.2) | 7150 |
| Full RRS | AMC23 | 66.5 (+9.0) | 5600 |
| Full RRS | Minerva-Math | 30.0 (-2.0) | 4250 |
| Full RRS | LiveCode | 21.0 (+3.2) | 6800 |
| Full RRS | Olympiad | 41.8 (+3.0) | 5800 |
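For reference, the Pass@1 figures in the tables above are presumably per-problem success rates averaged over the benchmark. A minimal sketch of that computation (the input format is assumed for illustration, not taken from the paper):

```python
def pass_at_1(results):
    """Estimate Pass@1 in percent.

    `results` is a list with one entry per problem; each entry is a list of
    booleans, one per sampled completion (True = correct). Pass@1 is the
    per-problem fraction of correct samples, averaged over all problems.
    """
    per_problem = [sum(r) / len(r) for r in results]
    return 100.0 * sum(per_problem) / len(per_problem)
```

For example, `pass_at_1([[True, False], [True, True]])` yields 75.0; with a single greedy sample per problem this reduces to plain accuracy.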
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.