Submitted: 20 October 2024
Posted: 22 October 2024
Abstract
1. Introduction

Our main contributions are summarized as follows:
- We propose a novel two-stage fine-tuning process for biomedical large language models that reduces reliance on retrieval mechanisms by embedding domain-specific knowledge directly into the model.
- We introduce a self-reflection loss during task-specific fine-tuning that encourages the model to assess its own answers, improving accuracy and reasoning quality (see the sketch after this list).
- Extensive experiments on medical question-answering datasets demonstrate that our method outperforms existing state-of-the-art models in both accuracy and efficiency, making it more suitable for real-time clinical applications.
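The self-reflection loss is only outlined above, so the following is a minimal sketch of one plausible reading, assuming an auxiliary "reflection" score through which the model predicts whether its own answer is correct. The function name, tensor shapes, and the weight `lam` are our assumptions, not the paper's notation.

```python
# Hypothetical sketch of a self-reflection loss: a standard answer loss plus a
# term that trains the model to predict whether its own answer is correct.
import torch
import torch.nn.functional as F

def self_reflection_loss(answer_logits, answer_labels, reflect_logits, lam=0.5):
    """answer_logits:  (batch, num_choices) scores over answer options
    answer_labels:  (batch,) gold option indices
    reflect_logits: (batch,) model's confidence that its answer is correct
    lam:            assumed weight on the reflection term
    """
    # Task loss: ordinary cross-entropy over the answer choices.
    task_loss = F.cross_entropy(answer_logits, answer_labels)

    # Self-assessment target: 1.0 where the current prediction is correct.
    with torch.no_grad():
        correct = (answer_logits.argmax(dim=-1) == answer_labels).float()

    # Reflection loss: the model should predict its own correctness.
    reflect_loss = F.binary_cross_entropy_with_logits(reflect_logits, correct)
    return task_loss + lam * reflect_loss

# Toy usage: 4 questions, 5 answer choices, random tensors.
logits, labels, reflect = torch.randn(4, 5), torch.randint(0, 5, (4,)), torch.randn(4)
print(self_reflection_loss(logits, labels, reflect))
```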
2. Related Work
2.1. Large Language Models
2.2. Retrieval-Augmented Large Language Models
2.3. Medical Large Language Models
3. Method
3.1. Model Architecture
3.2. Two-Stage Fine-Tuning
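The two-stage process is described only at the level of the contribution list: a domain-adaptation stage that embeds biomedical knowledge into the weights, followed by task-specific fine-tuning with the self-reflection term. Below is a minimal organizational sketch under those assumptions; the data, toy model, and hyperparameters are placeholders, not the paper's setup.

```python
# Hypothetical two-stage outline; a toy linear model stands in for the LLM.
import torch
import torch.nn as nn

def run_stage(model, batches, loss_fn, lr, steps):
    """Generic inner loop shared by both stages."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(steps):
        inputs, targets = batches[step % len(batches)]
        loss = loss_fn(model(inputs), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

model = nn.Linear(16, 4)  # stand-in for the base language model

# Stage 1: domain-adaptive training on biomedical text (random toy data here),
# intended to embed domain knowledge directly into the model's weights.
domain = [(torch.randn(8, 16), torch.randint(0, 4, (8,)))]
model = run_stage(model, domain, nn.CrossEntropyLoss(), lr=1e-4, steps=10)

# Stage 2: task-specific fine-tuning on QA data; in the paper's method this is
# where the self-reflection term would be added to the objective (Section 1).
task = [(torch.randn(8, 16), torch.randint(0, 4, (8,)))]
model = run_stage(model, task, nn.CrossEntropyLoss(), lr=5e-5, steps=10)
```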
3.3. Self-Reflection Mechanism
3.4. Optimization Objective
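The objective itself is not reproduced in this outline. A form consistent with the contribution list is a weighted sum of the task loss and the self-reflection loss; the notation and the weight λ below are assumptions:

```latex
% Assumed combined objective: task cross-entropy plus a weighted
% self-reflection term, with lambda a tunable hyperparameter.
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{task}}
  + \lambda \, \mathcal{L}_{\mathrm{reflect}}, \qquad \lambda > 0
```

Here $\mathcal{L}_{\mathrm{task}}$ would be the cross-entropy over answer choices and $\mathcal{L}_{\mathrm{reflect}}$ the self-assessment term, matching the sketch in Section 1.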
3.5. Training Strategy
4. Experiments
4.1. Datasets and Experimental Setup

We evaluate on three medical QA benchmarks:
- MedQA: A dataset focusing on medical licensing examination questions.
- MedMCQA: A multiple-choice medical QA dataset covering a wide range of medical subjects.
- MMLU (Medical subset): A dataset that includes medical questions for evaluating general reasoning skills in the biomedical domain.
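All three benchmarks are multiple choice, so the headline metric reduces to exact-match accuracy over choice indices. A minimal evaluation sketch follows; the `predict` callable and the field names are hypothetical stand-ins, not an official dataset API.

```python
# Accuracy over multiple-choice QA items; `predict` maps (question, choices)
# to the index of the chosen option.
from typing import Callable, Sequence

def accuracy(examples: Sequence[dict],
             predict: Callable[[str, list], int]) -> float:
    """Fraction of items whose predicted choice index matches the gold index."""
    hits = sum(predict(ex["question"], ex["choices"]) == ex["answer_idx"]
               for ex in examples)
    return hits / len(examples)

# Toy usage with an always-pick-the-first-option baseline.
data = [{"question": "First-line treatment for anaphylaxis?",
         "choices": ["Epinephrine", "Aspirin", "Amoxicillin", "Insulin"],
         "answer_idx": 0}]
print(accuracy(data, lambda q, c: 0))  # 1.0 on this single toy item
```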
4.2. Comparative Results

Accuracy (%) on the three benchmarks; higher is better:
| Model | Params | MedQA | MedMCQA | MMLU (Med) |
|---|---|---|---|---|
| MedRAG-Refine (7B) | 7B | 44.8% | 43.5% | 55.1% |
| MedRAG-Refine (13B) | 13B | 50.1% | 45.6% | 59.3% |
| Self-BioRAG (7B) | 7B | 43.6% | 42.1% | 53.9% |
| RAG (7B) | 7B | 36.2% | 38.3% | 47.7% |
| LLaMA2 (7B) | 7B | 35.2% | 36.3% | 46.3% |
4.3. Ablation Study

Removing the self-reflection loss degrades accuracy on all three benchmarks:
| Model | MedQA | MedMCQA | MMLU (Med) |
|---|---|---|---|
| MedRAG-Refine (Full) | 44.8% | 43.5% | 55.1% |
| MedRAG-Refine (No Self-Reflect) | 41.5% | 40.3% | 50.6% |
4.4. Human Evaluation

Human ratings on three criteria; higher is better:
| Model | Accuracy | Relevance | Reasoning Quality |
|---|---|---|---|
| MedRAG-Refine (7B) | 92% | 88% | 90% |
| Self-BioRAG (7B) | 85% | 82% | 84% |
| RAG (7B) | 78% | 75% | 76% |
| LLaMA2 (7B) | 73% | 70% | 72% |
4.5. Discussion
5. Conclusion