Submitted:
24 October 2023
Posted:
26 October 2023
Abstract
Keywords:
1. Introduction
2. LLMs in Healthcare
| Area | Example |
|---|---|
| Text Summarization | Generate summary of medical records or medical history |
| Anonymization | Anonymize patient data for privacy and HIPAA compliance |
| Clinical Documentation | Generate patient discharge reports |
| Translation | Translate patient records from one language to another. Also, translate relevant medical information to other languages for patients |
| Medical Writing | Support medical research by assisting with medical writing |
| Triage | Guide patients to the right wards or departments |
| Consultation | Recommendations for patient self care |
3. Hallucinations in Healthcare AI
- Unreliable Sources: If the training data comes from a general, unvetted source, the model is likely to perpetuate commonly held misconceptions [16]
- Probabilistic Generation: Because text generation is probabilistic, recombining entirely reliable source texts can still produce false statements [10]
- Biased Training Data: Training on biased sources may lead the model to generate hallucinations [24]
- Insufficient Context: LLM output is conditioned on the prompt; a prompt lacking context can yield text with little or no correspondence to what the end user is looking for [10]
- Self-Contradictions: LLMs are weak at sequential reasoning, which can lead to self-contradictory output [16].
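The "Probabilistic Generation" point above can be made concrete with a toy example: a model trained only on true statements can still splice them into a false one, because it samples token by token from learned transition probabilities rather than retrieving whole facts. The sketch below uses a deliberately tiny bigram model as an illustration (the sentences and seeds are arbitrary assumptions, not data from any clinical system):

```python
import random
from collections import defaultdict

# Toy bigram language model trained only on true statements.
corpus = [
    "paris is the capital of france",
    "berlin is the capital of germany",
]

# Count bigram transitions, including sentence start/end markers.
transitions = defaultdict(list)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for a, b in zip(tokens, tokens[1:]):
        transitions[a].append(b)

def generate(seed=None):
    """Sample a sentence token by token from the bigram distribution."""
    rng = random.Random(seed)
    token, out = "<s>", []
    while True:
        token = rng.choice(transitions[token])
        if token == "</s>":
            return " ".join(out)
        out.append(token)

# Because "capital of" appears in both training sentences, sampling can
# splice them into a statement that never occurred in the training data,
# e.g. "paris is the capital of germany".
outputs = {generate(seed) for seed in range(50)}
print(outputs)
```

Every individual transition the model learned came from a true sentence; the falsehood emerges only from their probabilistic recombination, which is precisely the failure mode described in [10].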
4. Addressing the Problem of Hallucinations
4.1. Evaluating Hallucinations
4.2. Measuring Hallucinations
4.2.1. Human Evaluation
4.2.2. Automatic Evaluation
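One widely cited zero-resource approach, SelfCheckGPT (Manakul et al.), scores a candidate statement by how consistently it reappears when the model re-answers the same prompt: statements the model can reproduce across samples are likely grounded, while one-off statements are suspect. The sketch below illustrates only the idea; the token-overlap (Jaccard) score, the example sentences, and the `consistency_score` helper are simplifying assumptions of this sketch, not the published method:

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two sentences (periods stripped)."""
    sa = set(a.lower().replace(".", "").split())
    sb = set(b.lower().replace(".", "").split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def consistency_score(candidate: str, samples: list[str]) -> float:
    """Mean overlap between a candidate sentence and resampled outputs."""
    return sum(jaccard(candidate, s) for s in samples) / len(samples)

# Hypothetical resampled answers to the same prompt.
samples = [
    "Metformin is a first-line treatment for type 2 diabetes.",
    "Metformin is commonly used first for type 2 diabetes.",
    "First-line therapy for type 2 diabetes is usually metformin.",
]

supported = consistency_score(
    "Metformin is a first-line treatment for type 2 diabetes.", samples)
unsupported = consistency_score(
    "Metformin cures type 1 diabetes completely.", samples)

# The statement echoed across samples scores higher than the outlier.
print(supported, unsupported)
```

In practice the original work replaces lexical overlap with stronger scorers (e.g., entailment or question-answering based), but the sampling-and-agreement structure is the same.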
4.2.3. Evaluation Metrics
4.3. Mitigating Hallucinations
4.3.1. Human-In-The-Loop (HITL)
4.3.2. Algorithmic Corrections
4.3.3. Fine Tuning
4.3.4. Improving Prompts
4.4. Adversarial Training
4.5. Input Validation
4.6. Memory Augmentation
4.7. Model Choice
4.8. Benchmark Audits
5. Conclusion
References
- Yang, X., Chen, A., PourNejatian, N., Shin, H., Smith, K., Parisien, C., Compas, C., et al. GatorTron: A large clinical language model to unlock patient information from unstructured electronic health records. ArXiv Preprint ArXiv:2203.03540, 2022. [Google Scholar] [CrossRef]
- Kraljevic, Z., Bean, D., Shek, A., Bendayan, R., Hemingway, H. & Au, J. Foresight-Generative Pretrained Transformer (GPT) for Modelling of Patient Timelines using EHRs.
- Emsley, R. ChatGPT: these are not hallucinations–they’re fabrications and falsifications. Schizophrenia 2023, 9, 52. [Google Scholar] [CrossRef] [PubMed]
- Alkaissi, H. & McFarlane, S. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus 2023, 15. [Google Scholar] [CrossRef]
- Chung, H., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. ArXiv Preprint ArXiv:2210.11416, 2022. [Google Scholar] [CrossRef]
- Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Hou, L., Clark, K., Pfohl, S., Cole-Lewis, H., Neal, D., et al. Towards expert-level medical question answering with large language models. ArXiv Preprint ArXiv:2305.09617, 2023. [Google Scholar] [CrossRef]
- Devlin, J., Chang, M., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. ArXiv Preprint ArXiv:1810.04805, 2018. [Google Scholar] [CrossRef]
- Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V. & Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. ArXiv Preprint ArXiv:1910.13461. 2019. [CrossRef]
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances In Neural Information Processing Systems 2020, 33, 1877–1901. [Google Scholar]
- Salvagno, M., Taccone, F. & Gerli, A. Artificial intelligence hallucinations. Critical Care 2023, 27, 1–2. [Google Scholar] [CrossRef]
- Kumar, A., Agarwal, C., Srinivas, S., Feizi, S. & Lakkaraju, H. Certifying LLM Safety against Adversarial Prompting. ArXiv Preprint ArXiv:2309.02705, 2023. [Google Scholar] [CrossRef]
- Lee, P., Bubeck, S. & Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal Of Medicine 2023, 388, 1233–1239. [Google Scholar] [CrossRef] [PubMed]
- Lee, M. A mathematical investigation of hallucination and creativity in gpt models. Mathematics 2023, 11, 2320. [Google Scholar] [CrossRef]
- Jha, S., Jha, S., Lincoln, P., Bastian, N., Velasquez, A. & Neema, S. Dehallucinating large language models using formal methods guided iterative prompting. 2023 IEEE International Conference On Assured Autonomy (ICAA) 2023, 149–152. [Google Scholar] [CrossRef]
- Wu, Y., Zhao, Y., Hu, B., Minervini, P., Stenetorp, P. & Riedel, S. An efficient memory-augmented transformer for knowledge-intensive nlp tasks. ArXiv Preprint ArXiv:2210.16773, 2023. [Google Scholar] [CrossRef]
- Manakul, P., Liusie, A. & Gales, M. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. ArXiv Preprint ArXiv:2303.08896, 2023. [Google Scholar] [CrossRef]
- Falke, T., Ribeiro, L., Utama, P., Dagan, I. & Gurevych, I. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. Proceedings Of The 57th Annual Meeting Of The Association For Computational Linguistics, 2019; 2214–2220. [Google Scholar] [CrossRef]
- Lee, N., Ping, W., Xu, P., Patwary, M., Fung, P., Shoeybi, M. & Catanzaro, B. Factuality enhanced language models for open-ended text generation. Advances In Neural Information Processing Systems 2022, 35, 34586–34599. [Google Scholar]
- Lin, S., Hilton, J. & Evans, O. Truthfulqa: Measuring how models mimic human falsehoods. ArXiv Preprint ArXiv:2109.07958, 2021. [Google Scholar] [CrossRef]
- Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W., Koh, P., Iyyer, M., Zettlemoyer, L. & Hajishirzi, H. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. ArXiv Preprint ArXiv:2305.14251, 2023. [Google Scholar] [CrossRef]
- Lim, Z., Pushpanathan, K., Yew, S., Lai, Y., Sun, C., Lam, J., Chen, D., Goh, J., Tan, M., Sheng, B., et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine 2023, 95. [Google Scholar] [CrossRef]
- Liu, J., Zhou, P., Hua, Y., Chong, D., Tian, Z., Liu, A., Wang, H., You, C., Guo, Z., Zhu, L., et al. Benchmarking Large Language Models on CMExam – A Comprehensive Chinese Medical Exam Dataset. ArXiv Preprint ArXiv:2306.03030, 2023. [Google Scholar] [CrossRef]
- Kasai, J., Kasai, Y., Sakaguchi, K., Yamada, Y. & Radev, D. Evaluating gpt-4 and chatgpt on japanese medical licensing examinations. ArXiv Preprint ArXiv:2303.18027, 2023. [Google Scholar] [CrossRef]
- Mbakwe, A., Lourentzou, I., Celi, L., Mechanic, O. & Dagan, A. ChatGPT passing USMLE shines a spotlight on the flaws of medical education. PLOS Digital Health 2023, 2, e0000205. [Google Scholar] [CrossRef] [PubMed]
- Kung, T., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., Madriaga, M., Aggabao, R., Diaz-Candido, G., Maningo, J., et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digital Health 2023, 2, e0000198. [Google Scholar] [CrossRef] [PubMed]
- Hulman, A., Dollerup, O., Mortensen, J., Fenech, M., Norman, K., Støvring, H. & Hansen, T. ChatGPT-versus human-generated answers to frequently asked questions about diabetes: A Turing test-inspired survey among employees of a Danish diabetes center. Plos One 2023, 18, e0290773. [Google Scholar] [CrossRef] [PubMed]
- Chen, S., Kann, B., Foote, M., Aerts, H., Savova, G., Mak, R. & Bitterman, D. The utility of ChatGPT for cancer treatment information. MedRxiv 2023, 2023-03. [Google Scholar] [CrossRef]
- Zha, Y., Yang, Y., Li, R. & Hu, Z. AlignScore: Evaluating Factual Consistency with a Unified Alignment Function. ArXiv Preprint ArXiv:2305.16739, 2023. [Google Scholar] [CrossRef]
- Afzal, A., Vladika, J., Braun, D. & Matthes, F. Challenges in Domain-Specific Abstractive Summarization and How to Overcome Them. 15th International Conference On Agents And Artificial Intelligence, ICAART 2023; 2023; pp. 682–689. [Google Scholar]
- Mündler, N., He, J., Jenko, S. & Vechev, M. Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation. ArXiv Preprint ArXiv:2305.15852, 2023. [Google Scholar] [CrossRef]
- Griffin, A. Google shares plunge after its new ChatGPT AI competitor gives wrong answer to question. The Independent, 2023. Available online: https://www.independent.co.uk/tech/google-ai-bard-chatgpt-shares-b2278932.html (accessed on 28 August 2023).
- Turley, J. ChatGPT falsely accused me of sexually harassing my students. Can we really trust AI? USA Today 2023.
- Meskó, B. & Topol, E. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. Npj Digital Medicine 2023, 6, 120. [Google Scholar] [CrossRef]
- Stern, K., Qiu, Q., Weykamp, M., O’Keefe, G. & Brakenridge, S. Defining posttraumatic sepsis for population-level research. JAMA Network Open 2023, 6, e2251445–e2251445. [Google Scholar] [CrossRef]
- Dziri, N., Milton, S., Yu, M., Zaiane, O. & Reddy, S. On the origin of hallucinations in conversational models: Is it the datasets or the models? ArXiv Preprint ArXiv:2204.07931, 2022. [Google Scholar] [CrossRef]
- OpenAI. GPT-4. Available online: https://openai.com/research/gpt-4 (accessed on 18 August 2023).
- Shen, X., Chen, Z., Backes, M., Shen, Y. & Zhang, Y. "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. ArXiv Preprint ArXiv:2308.03825, 2023. [Google Scholar] [CrossRef]
- Wang, Y., Si, S., Li, D., Lukasik, M., Yu, F., Hsieh, C., Dhillon, I. & Kumar, S. Preserving In-Context Learning ability in Large Language Model Fine-tuning. ArXiv Preprint ArXiv:2211.00635, 2022. [Google Scholar] [CrossRef]
- Ahmad, M., Eckert, C. & Teredesai, A. Interpretable machine learning in healthcare. Proceedings Of The 2018 ACM International Conference On Bioinformatics, Computational Biology, And Health Informatics; 2018; pp. 559–560. [Google Scholar]
- Choi, J., Hickman, K., Monahan, A. & Schwarcz, D. ChatGPT goes to law school. Available at SSRN, 2023.
- Li, J., Dada, A., Kleesiek, J. & Egger, J. ChatGPT in Healthcare: A Taxonomy and Systematic Review. MedRxiv 2023, 2023-03. [Google Scholar] [CrossRef]
- Sanderson, K. GPT-4 is here: what scientists think. Nature 2023, 615, 773. [Google Scholar] [CrossRef] [PubMed]
- Borden, N. & Linklater, D. Hickam’s dictum. Western Journal Of Emergency Medicine: Integrating Emergency Care With Population Health 2013, 14. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).