Submitted: 12 December 2024 | Posted: 12 December 2024
Abstract
News summarization is a critical task in natural language processing (NLP) due to the increasing volume of information available online. Traditional extractive summarization methods often fail to capture the nuanced and contextual nature of news content, leading to growing interest in using large language models (LLMs) such as GPT-4 for more sophisticated, abstractive summarization. However, LLMs face challenges in maintaining factual consistency and accurately reflecting the core content of news articles. This research addresses these challenges by proposing a novel prompt engineering method designed to guide LLMs, specifically GPT-4, in generating high-quality news summaries. Our approach uses a multi-stage prompt framework that ensures comprehensive coverage of essential details and incorporates an iterative refinement process to improve summary coherence and relevance. To enhance factual accuracy, we include built-in validation mechanisms based on entailment-style checks and question-answering techniques. Experiments on a newly collected dataset of diverse news articles demonstrate the effectiveness of our approach, showing significant improvements in summary quality, coherence, and factual accuracy.
Keywords:
1. Introduction
- We propose a novel prompt engineering method that enhances the performance of LLMs in news summarization by ensuring comprehensive coverage of key details and iterative refinement of summaries.
- We introduce built-in validation mechanisms within the prompts to improve the factual consistency of the generated summaries, addressing a major challenge in LLM-based summarization.
- Our experiments on a newly collected dataset demonstrate the effectiveness of our approach, with evaluations showing significant improvements in summary quality, coherence, and factual accuracy.
2. Related Work
2.1. Large Language Models
2.2. News Summarization
3. Dataset Collection
3.1. Data Collection Process
- Source Selection: We identified and selected a set of trusted news sources known for their factual reporting and diverse coverage.
- Article Sampling: From each source, we randomly sampled news articles published within the last year to ensure recency and relevance.
- Content Filtering: To maintain quality, we filtered out articles that were too short (fewer than 200 words) or too long (more than 2000 words), keeping only articles of a length suitable for summarization.
- Manual Annotation: A team of annotators manually verified the selected articles to ensure they were well-written and free from significant factual errors or biases.
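The length filter in the collection pipeline above is straightforward to implement. The sketch below is illustrative (the function names are ours, not the paper's); only the 200- and 2000-word thresholds come from the text.

```python
def is_suitable_length(article_text: str, min_words: int = 200, max_words: int = 2000) -> bool:
    """Return True if the article falls inside the summarizable length band
    described in Section 3.1 (200-2000 words)."""
    n_words = len(article_text.split())
    return min_words <= n_words <= max_words


def filter_articles(articles):
    """Keep only articles whose body passes the length filter."""
    return [a for a in articles if is_suitable_length(a)]
```

Articles outside the band are dropped before manual annotation, so annotators only ever see candidates of a summarizable length.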
3.2. Evaluation Metrics: GPT-4 as a Judge
- Summary Generation: Using our prompt engineering method, we generate summaries for each news article in the dataset with GPT-4.
- Human Comparison: A subset of the generated summaries is compared against human-written summaries to establish a baseline of quality.
- Evaluation Criteria: GPT-4 evaluates the summaries based on three main criteria:
  - (a) Coherence: Assessing whether the summary is logically structured and easy to understand.
  - (b) Relevance: Determining if the summary accurately reflects the key points and important details of the original article.
  - (c) Factual Accuracy: Verifying that the information in the summary is correct and consistent with the source article.
- Scoring Mechanism: GPT-4 provides a score for each summary based on the aforementioned criteria, generating a comprehensive evaluation report that includes qualitative feedback and quantitative scores.
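The judge step can be sketched as a prompt-and-parse loop. This is a minimal illustration, not the paper's exact prompt: the wording of `JUDGE_PROMPT` and the `call_gpt4` stand-in (a placeholder for a real GPT-4 API call) are our assumptions; the three criteria follow Section 3.2.

```python
import re

# Hypothetical judge prompt; the three criteria mirror Section 3.2.
JUDGE_PROMPT = """You are an expert news editor. Score the summary below
against the source article on each criterion:
- Coherence: is the summary logically structured and easy to understand?
- Relevance: does it reflect the key points of the original article?
- Factual Accuracy: is every statement consistent with the source?

Article:
{article}

Summary:
{summary}

Reply with lines of the form "Criterion: score", then brief feedback."""

CRITERIA = ("Coherence", "Relevance", "Factual Accuracy")


def parse_scores(report: str) -> dict:
    """Extract 'Criterion: score' lines from the judge's free-text report."""
    scores = {}
    for criterion in CRITERIA:
        m = re.search(rf"{criterion}\s*:\s*([0-9.]+)", report)
        if m:
            scores[criterion] = float(m.group(1))
    return scores


def judge_summary(article: str, summary: str, call_gpt4) -> dict:
    """Format the judge prompt, query the model, and return numeric scores."""
    report = call_gpt4(JUDGE_PROMPT.format(article=article, summary=summary))
    return parse_scores(report)
```

Keeping the parsing separate from the model call makes the qualitative feedback in the report available alongside the numeric scores used in the result tables.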
4. Method
4.1. Motivation
4.2. Prompt Design
4.2.1. Initial Information Extraction:
- Prompt Template:
  Extract the main elements of the article:
  1. Who is involved?
  2. What happened?
  3. When did it happen?
  4. Where did it happen?
  5. Why did it happen?
  6. How did it happen?
- Input: The full text of the news article.
- Output: A structured list of key elements extracted from the article.
4.2.2. Summary Drafting:
- Prompt Template:
  Using the extracted elements, draft a concise summary of the article:
  - Combine the 'who', 'what', 'when', 'where', 'why', and 'how' into a coherent narrative.
  - Ensure the summary is clear and logically structured.
  - Limit the summary to 3-5 sentences.
- Input: The structured list of key elements.
- Output: A preliminary draft of the summary.
4.2.3. Refinement and Validation:
- Prompt Template:
  Refine the summary to ensure factual accuracy and coherence:
  - Verify all factual statements against the original article.
  - Improve the flow and readability of the summary.
  - Ensure that the summary accurately reflects the main points of the article.
- Input: The preliminary draft of the summary.
- Output: The final, refined summary.
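The three stages above chain together into a single pipeline: each stage's output is the next stage's input. The sketch below wires them up, with the prompt texts taken from Sections 4.2.1-4.2.3; `llm` is a stand-in for an actual GPT-4 call, and the exact string layout of the prompts is our assumption.

```python
EXTRACT_PROMPT = (
    "Extract the main elements of the article:\n"
    "1. Who is involved? 2. What happened? 3. When did it happen?\n"
    "4. Where did it happen? 5. Why did it happen? 6. How did it happen?\n\n"
    "Article:\n{article}"
)

DRAFT_PROMPT = (
    "Using the extracted elements, draft a concise summary of the article:\n"
    "- Combine the 'who', 'what', 'when', 'where', 'why', and 'how' into a coherent narrative.\n"
    "- Ensure the summary is clear and logically structured.\n"
    "- Limit the summary to 3-5 sentences.\n\n"
    "Elements:\n{elements}"
)

REFINE_PROMPT = (
    "Refine the summary to ensure factual accuracy and coherence:\n"
    "- Verify all factual statements against the original article.\n"
    "- Improve the flow and readability of the summary.\n"
    "- Ensure that the summary accurately reflects the main points of the article.\n\n"
    "Article:\n{article}\n\nDraft summary:\n{draft}"
)


def summarize(article: str, llm) -> str:
    """Run extraction, drafting, and refinement in sequence.

    `llm` is any callable mapping a prompt string to a model response string.
    """
    elements = llm(EXTRACT_PROMPT.format(article=article))
    draft = llm(DRAFT_PROMPT.format(elements=elements))
    return llm(REFINE_PROMPT.format(article=article, draft=draft))
```

Note that the refinement stage receives both the original article and the draft, which is what lets the model verify factual statements against the source rather than against its own draft.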
4.3. Input and Output
4.4. Significance and Effectiveness
5. Experiments
5.1. Experimental Setup
5.2. Results
5.3. Analysis
5.4. Validation of Effectiveness
5.5. Further Experimental Analysis
5.5.1. Performance Across Article Types
| Article Type | Method | Coherence | Relevance | Factual Accuracy |
|---|---|---|---|---|
| Breaking News | Base | 3.8 | 3.6 | 3.3 |
| Breaking News | CoT | 4.2 | 4.0 | 3.8 |
| Breaking News | Our Method | 4.7 | 4.5 | 4.6 |
| Feature Stories | Base | 3.4 | 3.2 | 3.0 |
| Feature Stories | CoT | 4.0 | 3.9 | 3.5 |
| Feature Stories | Our Method | 4.6 | 4.4 | 4.2 |
| Opinion Pieces | Base | 3.5 | 3.0 | 3.2 |
| Opinion Pieces | CoT | 4.1 | 3.8 | 3.5 |
| Opinion Pieces | Our Method | 4.4 | 4.3 | 4.1 |
5.5.2. Analysis of Prompt Variations
5.5.3. Human Evaluation and Qualitative Feedback
- Coherence: Reviewers noted that summaries generated by our method exhibited better logical flow and were easier to follow than those generated by the other methods. The multi-stage process for extracting and refining information was credited for improving narrative structure.
- Relevance: Experts praised the relevance of our summaries, particularly in longer, more complex articles like feature stories. They pointed out that the iterative refinement process helped to ensure that essential points were retained and clearly presented.
- Factual Accuracy: Reviewers were particularly impressed by the factual consistency of the summaries. They emphasized that our method's built-in validation mechanisms significantly reduced factual errors, an essential requirement in news summarization.
5.5.4. Impact of Iterative Refinement
6. Conclusion
| Method | Model | Coherence Score | Relevance Score | Factual Accuracy Score |
|---|---|---|---|---|
| Base | ChatGPT | 3.5 | 3.2 | 3.0 |
| CoT | ChatGPT | 4.0 | 3.8 | 3.5 |
| Our | ChatGPT | 4.5 | 4.2 | 4.1 |
| Base | GPT-4 | 4.0 | 3.8 | 3.5 |
| CoT | GPT-4 | 4.3 | 4.1 | 3.9 |
| Our | GPT-4 | 4.8 | 4.6 | 4.5 |
| Prompt Type | Coherence | Relevance | Factual Accuracy |
|---|---|---|---|
| Minimalist Prompting | 4.1 | 3.9 | 3.8 |
| Expanded Prompting | 4.6 | 4.3 | 4.2 |
| Our Method (Multi-Stage) | 4.8 | 4.6 | 4.5 |
| Method | Coherence | Relevance | Factual Accuracy |
|---|---|---|---|
| Base Method | 3.4 | 3.2 | 3.0 |
| CoT Method | 4.2 | 3.9 | 3.7 |
| Our Method | 4.7 | 4.5 | 4.6 |
| Method | Coherence | Relevance | Factual Accuracy |
|---|---|---|---|
| Without Refinement | 4.2 | 4.1 | 4.0 |
| With Refinement (Full Method) | 4.8 | 4.6 | 4.5 |
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).