Submitted: 06 October 2024
Posted: 08 October 2024
Abstract
Keywords:
1. Introduction
- Contextual Prompts: Providing background information relevant to the historical period and events, such as "Identify key entities in the following text from 18th-century France, considering the historical context of the French Revolution."
- Task-Specific Prompts: Tailored for specific tasks like NER or sentiment analysis, for example, "Extract all named entities related to geographic locations from this 19th-century British text."
- Iterative Refinement: Implementing a feedback loop where initial outputs are analyzed to refine the prompts further, ensuring continuous improvement in the model’s performance.
- We introduce a novel method for crafting context-aware and task-specific prompts to improve LLM performance on historical texts.
- We propose an iterative refinement process to continuously enhance the quality of prompts and model outputs.
- We demonstrate the effectiveness of instruction tuning on newly collected evaluation data, assessed with GPT-4 as an automated judge.
2. Related Work
2.1. Large Language Models
2.2. Instruction Tuning
3. Dataset
3.1. Instruction Tuning Dataset
- Historical Archives: Digitized archives from libraries and museums, including letters, diaries, and official documents from different historical periods.
- Newspaper Archives: Collections of historical newspapers that provide rich contextual information and reflect the vernacular of their time.
- Literary Works: Classical literature that captures the narrative styles and linguistic nuances of different eras.
3.2. Evaluation Dataset
- Unpublished Historical Documents: Manuscripts and letters from private collections and less accessible archives.
- Historical News Reports: Specific events and reports that were not included in the training data, providing a testbed for the model’s ability to generalize.
- Annotated Corpora: Historical texts annotated by historians for NER and sentiment analysis, offering a benchmark for comparison.
3.3. GPT-4 as Judge: Novel Evaluation Metrics
- Contextual Accuracy: GPT-4 evaluates whether the model’s output accurately reflects the historical context and language of the input text.
- Entity Recognition and Relevance: The evaluation focuses on the correct identification of historical entities and their relevance to the context provided.
- Linguistic Fidelity: The assessment considers the fidelity of the language used, ensuring that the output maintains the stylistic and grammatical conventions of the historical period.
- Holistic Understanding: GPT-4 judges the overall coherence and understanding of the text, beyond mere token-level accuracy.
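The four criteria above can be packed into a single rubric-based request to the judge model. The sketch below is our illustration of such a setup; the rubric wording, JSON reply schema, and function names are assumptions, since the exact judge prompt is not published here.

```python
import json

# Illustrative rubric for the four evaluation criteria described above.
CRITERIA = {
    "contextual_accuracy": "Does the output accurately reflect the historical context and language of the input?",
    "entity_recognition": "Are historical entities correctly identified and relevant to the given context?",
    "linguistic_fidelity": "Does the output preserve the stylistic and grammatical conventions of the period?",
    "holistic_understanding": "Is the output coherent, showing understanding beyond token-level accuracy?",
}

def build_judge_prompt(source_text: str, model_output: str) -> str:
    """Assemble a rubric-based evaluation prompt for an LLM judge."""
    rubric = "\n".join(f"- {name}: {question}" for name, question in CRITERIA.items())
    return (
        "Rate the model output on each criterion from 0 to 100 and reply "
        "with a JSON object keyed by criterion name.\n\n"
        f"Criteria:\n{rubric}\n\n"
        f"Source text:\n{source_text}\n\n"
        f"Model output:\n{model_output}"
    )

def parse_judge_reply(reply: str) -> dict:
    """Parse the judge's JSON reply into {criterion: score}, dropping unknown keys."""
    scores = json.loads(reply)
    return {k: float(v) for k, v in scores.items() if k in CRITERIA}
```

Asking for a structured JSON reply makes the judge's scores machine-readable, so the same rubric can be applied uniformly across the whole evaluation set.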
4. Method
4.1. Motivation
4.2. Crafting the Prompts
4.2.1. Contextual Prompts
"Identify the key entities in the following text from 18th-century France, considering the historical context of the French Revolution."
4.2.2. Task-Specific Prompts
"Extract all named entities related to geographic locations from this 19th-century British text."
4.3. Instruction Tuning with Prompts
- Data Collection: Historical texts are collected and preprocessed as described in the previous section.
- Prompting: Each text is paired with a relevant contextual or task-specific prompt.
- Fine-Tuning: The LLM is fine-tuned on the prompted data. This involves adjusting the model’s weights to optimize its performance on the given tasks.
- Iterative Refinement: The outputs are analyzed and used to refine the prompts further, creating a feedback loop that continually enhances the model’s understanding and performance.
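The data-collection, prompting, and refinement steps above can be sketched as follows. The record schema follows the common Alpaca-style instruction/input/output convention, and the naive refinement rule is our illustration; neither is the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str    # preprocessed historical passage (step 1)
    prompt: str  # contextual or task-specific instruction (step 2)
    target: str  # gold annotation, e.g. the expected entity list

def to_instruction_records(examples: list[Example]) -> list[dict]:
    """Pack paired examples into instruction-tuning records (step 3)."""
    return [
        {"instruction": ex.prompt, "input": ex.text, "output": ex.target}
        for ex in examples
    ]

def refine(prompt: str, error_notes: list[str]) -> str:
    """Naive refinement step (step 4): append clarifications for observed errors."""
    if not error_notes:
        return prompt
    return prompt + " Pay particular attention to: " + "; ".join(error_notes) + "."
```

In the feedback loop, error analysis of the fine-tuned model's outputs would produce the `error_notes` (e.g. "archaic place names"), and the refined prompts would then be used for the next tuning round.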
4.4. Significance and Benefits
- Improved Accuracy: The detailed prompts guide the model to understand and process historical context and terminologies accurately, leading to better performance in NLP tasks.
- Contextual Understanding: By immersing the model in the historical context, our method enhances its ability to interpret texts from different periods accurately.
- Flexibility and Adaptability: The iterative refinement process ensures that the prompts and the model continually improve, allowing for adaptability to various historical domains and tasks.
5. Experiments
5.1. Experimental Setup
5.2. Experimental Design
- Data Preparation: Historical texts were collected, preprocessed, and annotated for NER and information extraction tasks.
- Baseline Model Fine-Tuning: The baseline models (LLaMA 7B, LLaMA-2 7B, and Qwen 7B) were fine-tuned on the historical datasets without using our specialized prompts. This involved standard fine-tuning procedures using learning rate scheduling, early stopping, and regularization techniques to optimize performance.
- Instruction-Tuned Model Fine-Tuning: Our instruction-tuned model was fine-tuned using the same datasets but with the addition of context-aware and task-specific prompts. This fine-tuning process was iterative, where initial model outputs were analyzed and used to refine the prompts further.
- Evaluation: Both the baseline and instruction-tuned models were evaluated using predefined metrics on a separate test set of historical texts. The evaluation involved both automated metrics (contextual accuracy, entity recognition, linguistic fidelity, and holistic understanding) and human evaluations by experts in historical linguistics and digital humanities.
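The evaluation step aggregates per-example judge scores into per-metric averages over the test set. A minimal sketch of that aggregation, with illustrative metric and function names:

```python
from statistics import mean

# The four automated metrics named in the evaluation step above.
METRICS = (
    "contextual_accuracy",
    "entity_recognition",
    "linguistic_fidelity",
    "holistic_understanding",
)

def aggregate(per_example_scores: list[dict]) -> dict:
    """Average per-example judge scores (0-100) for each metric."""
    return {m: mean(s[m] for s in per_example_scores) for m in METRICS}
```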
5.3. Results
5.4. Analysis of Results
5.5. Ablation Studies
- No Contextual Prompts: The model was fine-tuned without the context-aware prompts to evaluate their impact on contextual understanding.
- No Task-Specific Prompts: The model was fine-tuned without the task-specific prompts to assess their importance in improving task performance.
- No Iterative Refinement: The model was fine-tuned using the initial set of prompts without iterative refinement to determine the value of the feedback loop.
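The three ablations above differ only in which component is disabled. One way to enumerate such variants is a small configuration object; the flag names below are our own shorthand for the components, not the authors' code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AblationConfig:
    contextual_prompts: bool = True
    task_specific_prompts: bool = True
    iterative_refinement: bool = True

# One config per ablation row, plus the full model as reference.
ABLATIONS = {
    "No Contextual Prompts": AblationConfig(contextual_prompts=False),
    "No Task-Specific Prompts": AblationConfig(task_specific_prompts=False),
    "No Iterative Refinement": AblationConfig(iterative_refinement=False),
    "Full Model": AblationConfig(),
}
```

Disabling one component at a time while holding the rest fixed isolates each component's contribution to the full model's scores.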
5.6. Human Evaluation
5.7. Validation of Effectiveness
5.8. Additional Insights
6. Conclusions
References
| Model | Contextual Accuracy | Entity Recognition | Linguistic Fidelity | Holistic Understanding |
|---|---|---|---|---|
| LLaMA 7B | 76.4% | 72.1% | 74.3% | 71.2% |
| LLaMA-2 7B | 78.5% | 75.4% | 76.9% | 73.5% |
| Qwen 7B | 77.2% | 73.8% | 75.6% | 72.8% |
| Our Method | 85.3% | 83.7% | 84.5% | 82.9% |
| Model Variation | Contextual Accuracy | Entity Recognition | Linguistic Fidelity | Holistic Understanding |
|---|---|---|---|---|
| No Contextual Prompts | 80.1% | 78.4% | 79.0% | 77.3% |
| No Task-Specific Prompts | 82.5% | 80.6% | 81.2% | 79.8% |
| No Iterative Refinement | 83.2% | 81.4% | 82.0% | 80.5% |
| Full Model (Ours) | 85.3% | 83.7% | 84.5% | 82.9% |
| Model | Contextual Accuracy | Entity Recognition | Linguistic Fidelity | Holistic Understanding |
|---|---|---|---|---|
| LLaMA 7B | 3.8 | 3.6 | 3.7 | 3.5 |
| LLaMA-2 7B | 4.0 | 3.9 | 4.0 | 3.8 |
| Qwen 7B | 3.9 | 3.7 | 3.8 | 3.7 |
| Our Method | 4.5 | 4.4 | 4.5 | 4.4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).