Submitted: 08 August 2025
Posted: 12 August 2025
Abstract
Keywords:
I. Introduction
- HotpotQA [6]: A multi-hop question answering dataset that necessitates aggregating information from multiple documents to answer questions. This dataset is particularly suitable for assessing an LLM’s ability to process complex information and anchor its responses to diverse contexts.
- ELI5 (Explain Like I’m 5) [7]: A dataset designed for generating simplified explanations of complex concepts. We utilized its longer answer segments to evaluate the model’s capacity to maintain explanatory accuracy without introducing spurious information.
- We propose a novel multi-stage critique and refinement framework that enables LLMs to automatically enhance the factual consistency and contextual grounding of their generated responses.
- We demonstrate the effectiveness of leveraging LLMs themselves as sophisticated "Fact Verifier-Content Reviser" agents through zero-shot Chain-of-Thought prompting, eliminating the need for additional model training.
- Our method achieves state-of-the-art performance in improving factual consistency and context grounding on challenging question answering and explanation generation benchmarks, without requiring any fine-tuning of the underlying LLMs.
II. Related Work
A. Enhancing Factual Consistency and Grounding in Large Language Models
B. Self-Correction and Advanced Prompting Techniques for LLMs
III. Method
A. Problem Formulation
B. Multi-Stage Critique and Refinement Framework
C. Zero-Shot Prompting Strategies
- 1) Verification Steps Listing: The LLM is first explicitly instructed to outline the systematic steps it will undertake to verify the factual consistency and contextual grounding of the initial response against the source documents D. This encourages a methodical approach.
- 2) Error Identification and Classification: Subsequently, the LLM is prompted to pinpoint specific errors or ungrounded statements within the initial response. It is required to classify these errors by type (e.g., factual inaccuracy, ungrounded information, logical inconsistency) and, critically, to provide concrete, direct evidence from D to support its claims regarding the discrepancies. The output of this step constitutes the detailed critique report.
- 3) Response Revision: Finally, based on its self-generated critique and its comprehensive understanding of Q and D, the LLM is instructed to produce the final, revised response, which must address all identified issues and ensure factual accuracy and contextual relevance. A minimal sketch of this three-stage pipeline is given below.
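The three stages above can be realized purely through prompting at inference time. The snippet below is a minimal sketch of such a pipeline, assuming a generic `call_llm` wrapper around whatever chat-completion client is in use; the prompt wording is illustrative, not the exact templates used in this work.

```python
# Minimal sketch of the three-stage critique-and-refine loop.
# `call_llm` is an assumed wrapper around an LLM chat endpoint (e.g., an
# OpenAI-compatible API or a local Llama 3 server); prompt text is illustrative.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to the underlying LLM and return its reply."""
    raise NotImplementedError

def critique_and_refine(question: str, documents: list[str], initial_response: str) -> str:
    context = "\n\n".join(documents)

    # Stage 1: verification steps listing (zero-shot CoT planning).
    plan = call_llm(
        "You are a Fact Verifier. List, step by step, how you will check the "
        "response below for factual consistency and grounding in the documents.\n\n"
        f"Question: {question}\nDocuments:\n{context}\nResponse: {initial_response}"
    )

    # Stage 2: error identification and classification, with cited evidence.
    critique = call_llm(
        "Following your verification plan, identify every factual inaccuracy, "
        "ungrounded statement, or logical inconsistency in the response. For each "
        "error, state its type and quote supporting evidence from the documents.\n\n"
        f"Plan:\n{plan}\n\nQuestion: {question}\nDocuments:\n{context}\n"
        f"Response: {initial_response}"
    )

    # Stage 3: response revision that resolves every identified issue.
    revised = call_llm(
        "You are a Content Reviser. Rewrite the response so that it fixes every "
        "issue in the critique and stays fully grounded in the documents.\n\n"
        f"Critique:\n{critique}\n\nQuestion: {question}\nDocuments:\n{context}\n"
        f"Original response: {initial_response}"
    )
    return revised
```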
D. Data Preparation and Processing
IV. Experiments
A. Experimental Setup
- GPT-3.5-turbo: A widely used and capable model from OpenAI.
- Gemini Pro: Google’s advanced multimodal model, evaluated here for its text generation capabilities.
- Mixtral 8x7B: A high-quality sparse mixture-of-experts model, known for its efficiency and strong performance.
- Llama 3 70B: A powerful open-source model, serving as the foundation for our primary proposed method.
- Subscript s (Simple Prompting): This refers to a concise, direct prompt that implicitly requests the model to perform verification and correction without explicit intermediate reasoning steps, as described in Section III.C.
- Subscript b (Chain-of-Thought Guided Prompting): This denotes a more elaborate prompt designed to elicit multi-stage Chain-of-Thought (CoT) reasoning, guiding the model through explicit verification steps, detailed error identification, and structured revision, as detailed in Section III.C.
- HotpotQA: A multi-hop question answering dataset that requires models to aggregate information from multiple supporting documents to formulate an answer. This dataset is particularly effective for evaluating a model’s capacity to process complex, distributed information and anchor its responses firmly within the provided context, thereby minimizing ungrounded statements.
- ELI5 (Explain Like I’m 5): A question answering dataset focused on generating simplified explanations of complex concepts. We specifically utilized its longer answer segments to assess the models’ proficiency in maintaining explanatory accuracy and coherence without introducing spurious information or factual inaccuracies.
- Fact Consistency Score (FCS): Measures the factual accuracy of the generated response relative to the provided source documents. It quantifies the degree to which statements in the response are supported by or consistent with the ground truth information.
- Context Grounding Score (CGS): Evaluates how well the generated response is anchored to the provided context and question. It penalizes information that is irrelevant, ungrounded, or deviates from the scope defined by the input. A sketch of how scores of this kind can be computed appears below.
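The exact scoring procedure behind FCS and CGS is not reproduced here; the following is a minimal sketch of one common way such metrics are instantiated, scoring a response by the fraction of its sentences that a pretrained NLI model judges to be entailed by the source documents. The checkpoint `roberta-large-mnli`, the naive sentence splitting, and the simple averaging are illustrative assumptions rather than the authors' implementation.

```python
# Entailment-based consistency score (sketch, not the paper's exact metric).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

def consistency_score(response: str, documents: list[str]) -> float:
    """Fraction of response sentences entailed by the concatenated documents."""
    premise = " ".join(documents)
    claims = [s.strip() for s in response.split(".") if s.strip()]
    if not claims:
        return 0.0
    entailed = 0
    for claim in claims:
        enc = tokenizer(premise, claim, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits
        label = model.config.id2label[int(logits.argmax(dim=-1))]
        if label.upper() == "ENTAILMENT":
            entailed += 1
    return entailed / len(claims)
```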
B. Baselines
- Base LLM (No Correction): Represents the raw output of a large language model (e.g., Llama 3 70B) without any post-generation critique or refinement. This baseline highlights the inherent limitations of LLMs in terms of factual consistency and contextual grounding.
- Prior Work (Simple Self-Correction): Encompasses existing simple self-correction strategies that involve a single-pass or less structured prompting approach for minor revisions. This baseline reflects the current state-of-the-art in straightforward self-correction mechanisms.
- Other LLMs with simple (s) and CoT-guided (b) strategies: We also present the performance of GPT-3.5-turbo, Gemini Pro, and Mixtral 8x7B, each employing both simple and Chain-of-Thought prompting strategies. These serve as strong comparative models, illustrating the general applicability and benefits of our multi-stage approach across different LLM architectures.
C. Main Results
- Baseline Performance: The "Base LLM (No Correction)" shows the inherent limitations of raw LLM outputs, with relatively low FCS and CGS scores, highlighting the prevalence of factual hallucinations and ungrounded information.
- Existing Solutions: "Prior Work (Simple Self-Correction)" offers a modest improvement over the base LLM, indicating that simple correction mechanisms can provide some benefit, but significant room for improvement remains.
- CoT’s Advantage: A crucial observation is the consistent performance boost achieved by employing the Chain-of-Thought (CoT) guided prompting strategy compared to the simple prompting strategy for all evaluated models. For instance, GPT-3.5-turbo_b consistently outperforms GPT-3.5-turbo_s across all metrics and datasets. This empirically validates our hypothesis that enabling LLMs to perform explicit reasoning and structured verification steps significantly enhances their ability to identify and correct errors.
- Our Method’s Superiority: Our proposed "Ours" method, leveraging the powerful Llama 3 70B model, consistently achieves the best performance across all metrics and datasets. In particular, "Ours (Llama 3 70B_b)" sets new state-of-the-art results, with FCS scores of 0.81 on HotpotQA and 0.75 on ELI5, and similarly high CGS scores. This superior performance underscores the efficacy of our multi-stage critique and refinement framework when combined with a strong underlying LLM and the strategic application of CoT prompting. It demonstrates that guiding the LLM through a structured verification and revision process, without requiring additional training, is highly effective in mitigating factual hallucinations and improving contextual grounding.
D. Analysis of Prompting Strategies
- Explicit Reasoning Path: The CoT prompts compel the LLM to articulate its reasoning process step-by-step. This explicit decomposition of the task (e.g., identify verification steps, pinpoint errors, classify error types, provide evidence) allows the model to engage in deeper, more structured analysis of the initial response against the provided context.
- Enhanced Error Identification: By requiring the LLM to list specific errors and provide supporting evidence from the source documents, the strategy forces a more thorough and precise identification of factual inconsistencies and ungrounded statements. This reduces the likelihood of overlooking subtle errors.
- Targeted Correction: With a clear and detailed critique report generated in the preceding step, the LLM in its "Content Reviser" role can perform more targeted and accurate corrections. The explicit error types and evidence guide the revision process more effectively than a general instruction.
- Reduced Ambiguity: The structured nature of CoT prompts reduces ambiguity in the task instructions, leading to more reliable and consistent performance, especially for complex cases where factual nuances or multiple pieces of context need to be reconciled.
E. Human Evaluation
- Factual Accuracy: How well the response aligns with the facts presented in the source documents.
- Contextual Grounding: How well the response uses and stays relevant to the provided context.
- Fluency and Coherence: The readability, grammatical correctness, and logical flow of the response.
- Overall Quality: A holistic judgment of the response’s utility and correctness.
- Our method (Ours, Llama 3 70B_b) received significantly higher scores across all evaluation criteria, particularly in Factual Accuracy and Contextual Grounding, confirming its superior ability to mitigate hallucinations and ungrounded statements.
- While Base LLM outputs were often fluent, their low scores in factual accuracy and grounding highlight the critical need for correction mechanisms.
- Prior work showed improvement, but our multi-stage critique and refinement framework was perceived by human annotators as producing responses that are not only factually correct and well-grounded but also maintain high fluency and overall quality, making them highly reliable and useful.
F. Ablation Studies
- Impact of the Explicit Critique Report: "Ours (No Explicit Critique Report)" refers to a setup where the LLM is guided by a CoT prompt to perform verification and revision steps internally, but does not explicitly output the detailed critique as a separate, structured report before generating the refined response. While still performing well due to the internal CoT reasoning, its performance is slightly lower than the full framework (e.g., 0.79 FCS on HotpotQA vs. 0.81). This indicates that the act of explicitly generating the critique forces the LLM to formalize its error identification and evidence gathering, leading to a more robust and accurate subsequent revision. The structured critique acts as a concrete intermediate representation that the LLM can more reliably reference for refinement.
- Impact of Distinct Roles and Multi-Stage Process: "Ours (Single-Pass CoT, No Distinct Roles)" represents a scenario where a single, complex CoT prompt guides the LLM to perform verification and revision, but without explicitly instructing it to assume the distinct "Fact Verifier" and "Content Reviser" roles, and without a clear conceptual separation into critique and refinement stages as outlined in Section III.B. The performance drop (e.g., 0.76 FCS on HotpotQA vs. 0.81 for the full framework) highlights the importance of the multi-stage design. By defining clear roles and sequential steps, the framework decomposes a complex problem into manageable sub-problems, guiding the LLM to focus on specific tasks (verification, then revision) in a structured manner. This structured guidance reduces the cognitive load on the LLM, enabling it to execute each phase more effectively. A sketch contrasting these two ablated prompt configurations is given below.
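For concreteness, the two ablated configurations differ mainly in how the prompt is structured. The templates below sketch what such prompts could look like; the wording is assumed for illustration and is not taken from the paper.

```python
# Illustrative prompt variants for the two ablations; wording is assumed.

# (a) "No Explicit Critique Report": the model reasons through verification and
#     revision internally but is asked to output only the corrected response.
PROMPT_NO_REPORT = (
    "Think step by step: verify the response against the documents, note any "
    "factual or grounding errors to yourself, then output ONLY the corrected "
    "response.\n\nQuestion: {question}\nDocuments:\n{documents}\nResponse: {response}"
)

# (b) "Single-Pass CoT, No Distinct Roles": one combined instruction, with no
#     Fact Verifier / Content Reviser role framing and no staged structure.
PROMPT_SINGLE_PASS = (
    "Check the response below for errors using the documents and rewrite it so "
    "that it is accurate and grounded.\n\nQuestion: {question}\n"
    "Documents:\n{documents}\nResponse: {response}"
)

# Usage: PROMPT_NO_REPORT.format(question=q, documents=d, response=r)
```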
G. Qualitative Analysis and Error Correction Examples
H. Computational Cost Analysis
- Increased Input Context: Both the simple (s) and especially the CoT-guided (b) strategies require a larger input context than the Base LLM, because the self-correction prompts include the original question, source documents, and initial response, along with detailed instructions. The b strategy, with its comprehensive CoT prompt, naturally has a larger average input token count (750 tokens) than the s strategy (600 tokens) and the Base LLM (500 tokens).
- Higher Output Tokens for CoT: The most significant increase in token usage comes from the output of the Chain-of-Thought strategy. Since it explicitly generates the detailed critique before the refined response, its average output token count is considerably higher (300 tokens) than that of the s strategy (180 tokens) and the Base LLM (150 tokens). This additional output is the explicit reasoning process that underpins the improved performance.
- Trade-off between Performance and Cost: The "Ours (Llama 3 70B_b)" method, while achieving the highest performance, also incurs the highest computational cost, approximately 1.6 times that of the Base LLM in terms of total tokens. This represents a clear trade-off: the enhanced factual consistency and contextual grounding come at the expense of increased inference time and API costs (for commercial models).
- Practical Implications: For applications where extremely low latency is critical, a simpler approach, or even accepting the limitations of a Base LLM, might be considered. However, for use cases demanding high factual accuracy and reliability, such as knowledge base construction, automated content generation in sensitive domains, or critical decision support systems, the increased computational cost of our framework is a justifiable investment given the substantial gains in output quality. Future work could explore optimizations that reduce token usage while retaining the benefits of CoT reasoning, for example through more concise critique formats or distillation techniques. A back-of-the-envelope estimate of the cost overhead is sketched below.
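Using the average token counts from the cost table, the overhead of the full framework can be estimated directly; the per-token price below is a hypothetical placeholder used only to illustrate the trade-off.

```python
# Back-of-the-envelope token overhead from the cost table; the per-1k-token
# price is a hypothetical placeholder, not a quoted rate.
BASE_TOKENS = 650        # Base LLM: 500 input + 150 output
OURS_B_TOKENS = 1050     # Ours (Llama 3 70B_b): 750 input + 300 output

overhead = OURS_B_TOKENS / BASE_TOKENS          # ~1.62x total tokens
price_per_1k_tokens = 0.002                     # hypothetical USD per 1k tokens
cost_per_query = OURS_B_TOKENS / 1000 * price_per_1k_tokens

print(f"token overhead: {overhead:.2f}x, cost per query: ${cost_per_query:.4f}")
```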
V. Conclusion
References
- Zhou, Y.; Tao, W.; Zhang, W. Triple sequence generative adversarial nets for unsupervised image captioning. In Proceedings of the ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE, 2021; pp. 7598–7602.
- Wang, Q.; Hu, H.; Zhou, Y. MemoryMamba: Memory-augmented state space model for defect recognition. arXiv 2024, arXiv:2405.03673.
- Cao, M.; Dong, Y.; Cheung, J.C.K. Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22–27, 2022; Association for Computational Linguistics, 2022; pp. 3340–3354.
- Zhou, Y.; Song, L.; Shen, J. MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration. arXiv 2025, arXiv:2506.19835.
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 – December 9, 2022.
- Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 – November 4, 2018; Association for Computational Linguistics, 2018; pp. 2369–2380.
- Fan, A.; Jernite, Y.; Perez, E.; Grangier, D.; Weston, J.; Auli, M. ELI5: Long Form Question Answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 – August 2, 2019; Association for Computational Linguistics, 2019; Volume 1: Long Papers, pp. 3558–3567.
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. CoRR 2023.
- Talukdar, W.; Biswas, A. Improving Large Language Model (LLM) fidelity through context-aware grounding: A systematic approach to reliability and veracity. CoRR 2024.
- Zhou, Y.; Shen, T.; Geng, X.; Long, G.; Jiang, D. ClarET: Pre-training a Correlation-Aware Context-To-Event Transformer for Event-Centric Generation and Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); 2022; pp. 2559–2575.
- Zhou, Y.; Geng, X.; Shen, T.; Long, G.; Jiang, D. EventBERT: A pre-trained model for event correlation reasoning. In Proceedings of the ACM Web Conference 2022; 2022; pp. 850–859.
- Augenstein, I.; Baldwin, T.; Cha, M.; Chakraborty, T.; Ciampaglia, G.L.; Corney, D.P.A.; DiResta, R.; Ferrara, E.; Hale, S.; Halevy, A.Y.; et al. Factuality challenges in the era of large language models and opportunities for fact-checking. Nat. Mach. Intell. 2024, 852–863.
- Huang, Y.; Huang, J. A Survey on Retrieval-Augmented Text Generation for Large Language Models. CoRR 2024.
- Yang, L.; Chen, H.; Li, Z.; Ding, X.; Wu, X. Give us the Facts: Enhancing Large Language Models With Knowledge Graphs for Fact-Aware Language Modeling. IEEE Trans. Knowl. Data Eng. 2024, 3091–3110.
- Zhou, Y.; Geng, X.; Shen, T.; Pei, J.; Zhang, W.; Jiang, D. Modeling event-pair relations in external knowledge graphs for script reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; 2021.
- Ghosh, B.; Hasan, S.; Arafat, N.A.; Khan, A. Logical Consistency of Large Language Models in Fact-Checking. In Proceedings of the Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24–28, 2025; OpenReview.net, 2025.
- Kamalloo, E.; Dziri, N.; Clarke, C.L.A.; Rafiei, D. Evaluating Open-Domain Question Answering in the Era of Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9–14, 2023; Association for Computational Linguistics, 2023; pp. 5591–5606.
- Lee, N.; Ping, W.; Xu, P.; Patwary, M.; Fung, P.; Shoeybi, M.; Catanzaro, B. Factuality Enhanced Language Models for Open-Ended Text Generation. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 – December 9, 2022.
- Zhang, Q.; Wang, D.; Qian, H.; Li, Y.; Zhang, T.; Huang, M.; Xu, K.; Li, H.; Yan, L.; Qiu, H. Understanding the Dark Side of LLMs’ Intrinsic Self-Correction. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 – August 1, 2025; Association for Computational Linguistics, 2025; pp. 27066–27101.
- Yu, Z.; He, L.; Wu, Z.; Dai, X.; Chen, J. Towards Better Chain-of-Thought Prompting Strategies: A Survey. CoRR 2023.
- Yan, Y.; Jiang, J.; Liu, Y.; Cao, Y.; Xu, X.; Zhang, M.; Cai, X.; Shao, J. S3cMath: Spontaneous Step-Level Self-Correction Makes Large Language Models Better Mathematical Reasoners. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI-25, Philadelphia, PA, USA, February 25 – March 4, 2025; AAAI Press, 2025; pp. 25588–25596.
- Wang, Y.; Wu, Y.; Wei, Z.; Jegelka, S.; Wang, Y. A Theoretical Understanding of Self-Correction through In-context Alignment. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10–15, 2024.
- Zhou, Y.; Shen, J.; Cheng, Y. Weak to strong generalization for large language models with multi-capabilities. In Proceedings of the Thirteenth International Conference on Learning Representations; 2025.
- Sun, J.; Pan, Y.; Yan, X. Improving intermediate reasoning in zero-shot chain-of-thought for large language models with filter supervisor-self correction. Neurocomputing 2025, 129219.
- Li, L.; Chen, G.; Su, Y.; Chen, Z.; Zhang, Y.; Xing, E.P.; Zhang, K. Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models. CoRR 2024.
- Pan, L.; Saxon, M.; Xu, W.; Nathani, D.; Wang, X.; Wang, W.Y. Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. CoRR 2023.
- Huang, J.; Chen, X.; Mishra, S.; Zheng, H.S.; Yu, A.W.; Song, X.; Zhou, D. Large Language Models Cannot Self-Correct Reasoning Yet. In Proceedings of the Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7–11, 2024; OpenReview.net, 2024.
- Zhou, Y.; Li, X.; Wang, Q.; Shen, J. Visual In-Context Learning for Large Vision-Language Models. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11–16, 2024; Association for Computational Linguistics, 2024; pp. 15890–15902.
Main results: Fact Consistency Score (FCS) and Context Grounding Score (CGS) on HotpotQA and ELI5.

| Model/Method | HotpotQA (FCS) | HotpotQA (CGS) | ELI5 (FCS) | ELI5 (CGS) |
|---|---|---|---|---|
| Base LLM (No Correction) | 0.62 | 0.58 | 0.55 | 0.50 |
| Prior Work (Simple Self-Correction) | 0.68 | 0.65 | 0.60 | 0.57 |
| GPT-3.5-turbo_s | 0.69 | 0.67 | 0.61 | 0.59 |
| GPT-3.5-turbo_b | 0.72 | 0.70 | 0.65 | 0.62 |
| Gemini Pro_s | 0.71 | 0.69 | 0.64 | 0.61 |
| Gemini Pro_b | 0.75 | 0.73 | 0.68 | 0.65 |
| Mixtral 8x7B_s | 0.74 | 0.72 | 0.67 | 0.64 |
| Mixtral 8x7B_b | 0.77 | 0.75 | 0.70 | 0.68 |
| Ours (Llama 3 70B_s) | 0.78 | 0.76 | 0.71 | 0.69 |
| Ours (Llama 3 70B_b) | 0.81 | 0.80 | 0.75 | 0.73 |
Human evaluation: average annotator ratings per criterion.

| Model/Method | Factual Accuracy | Contextual Grounding | Fluency & Coherence | Overall Quality |
|---|---|---|---|---|
| Base LLM (No Correction) | 2.8 | 2.9 | 4.2 | 2.7 |
| Prior Work (Simple Self-Correction) | 3.5 | 3.4 | 4.3 | 3.3 |
| Ours (Llama 3 70B_b) | 4.6 | 4.5 | 4.5 | 4.4 |
Ablation study: FCS and CGS for the full framework and its ablated variants.

| Method | HotpotQA (FCS) | HotpotQA (CGS) | ELI5 (FCS) | ELI5 (CGS) |
|---|---|---|---|---|
| Ours (Llama 3 70B_b) - Full Framework | 0.81 | 0.80 | 0.75 | 0.73 |
| Ours (No Explicit Critique Report) | 0.79 | 0.77 | 0.72 | 0.70 |
| Ours (Single-Pass CoT, No Distinct Roles) | 0.76 | 0.74 | 0.69 | 0.67 |
Qualitative example of error correction by the proposed framework.

| Component | Content |
|---|---|
| Question (Q) | Who wrote the novel "The Hitchhiker’s Guide to the Galaxy" and what year was the first book published? |
| Source Documents (D) | Document 1: "The Hitchhiker’s Guide to the Galaxy is a comedy science fiction series created by Douglas Adams. Originally a 1978 radio comedy broadcast on BBC Radio 4, it was later adapted to other formats, including a ’trilogy of five’ books that sold over 15 million copies. The first book, also titled ’The Hitchhiker’s Guide to the Galaxy’, was published in 1979." Document 2: "Douglas Adams (1952-2001) was an English author, satirist, and dramatist. He is best known as the author of The Hitchhiker’s Guide to the Galaxy, which originated in 1978 as a BBC radio series and was adapted into a book series." |
| Initial Response from Base LLM | The novel "The Hitchhiker’s Guide to the Galaxy" was written by Douglas Adams, and the first book was published in 1978. |
| Critique from Ours | Error 1: Factual Hallucination (Year of Publication). Statement: "the first book was published in 1978." Evidence from D: Document 1 states: "The first book... was published in 1979." Reasoning: The initial response incorrectly states the publication year of the first book; the source clearly indicates 1979, not 1978. |
| Refined Response from Ours | The novel "The Hitchhiker’s Guide to the Galaxy" was written by Douglas Adams, and the first book was published in 1979. |
Computational cost: average token counts per query.

| Method | Avg. Input Tokens | Avg. Output Tokens | Avg. Total Tokens |
|---|---|---|---|
| Base LLM (No Correction) | 500 | 150 | 650 |
| Ours (Llama 3 70B_s) | 600 | 180 | 780 |
| Ours (Llama 3 70B_b) | 750 | 300 | 1050 |