Submitted: 29 June 2024. Posted: 02 July 2024.
Abstract
Keywords:
1. Introduction
- Accuracy: This metric reflects the score obtained on the benchmark for each category independently (normal answer and misleading answer). It is calculated by directly checking whether the answer is correct, in the same way for both categories. A higher score indicates better performance. The normal answer category is expected to score higher than the misleading answer category, as the model is less likely to hallucinate under normal conditions.
- Susceptibility: This metric indicates how likely the model is to be influenced by the misleading prompt. It is calculated as the ratio of the accuracy on normal answers to the accuracy on misleading answers. A higher score indicates greater susceptibility to hallucination and a lower capacity for self-correction once hallucination occurs during inference.
- Consistency: This metric measures the percentage of answers that are identical across both categories, regardless of whether the answer is correct. A higher consistency score indicates that the model is more certain of its answers and less likely to be influenced by noise from random sampling during inference (a computational sketch of the three metrics follows this list).
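As a rough computational sketch of how these three metrics can be obtained from per-question results (this is not the authors' released code; the function and its inputs below are hypothetical):

```python
def evaluate(normal_answers, misleading_answers, gold_answers):
    """Compute accuracy, susceptibility, and consistency as defined above.

    normal_answers / misleading_answers: model answers to the same questions
    under the normal and misleading prompt categories.
    gold_answers: the correct answers.
    """
    n = len(gold_answers)
    normal_acc = sum(a == g for a, g in zip(normal_answers, gold_answers)) / n
    misleading_acc = sum(a == g for a, g in zip(misleading_answers, gold_answers)) / n

    # Susceptibility: ratio of normal-answer accuracy to misleading-answer accuracy.
    susceptibility = normal_acc / misleading_acc if misleading_acc > 0 else float("inf")

    # Consistency: fraction of questions answered identically in both categories,
    # regardless of correctness.
    consistency = sum(a == b for a, b in zip(normal_answers, misleading_answers)) / n

    return {
        "normal_accuracy": 100 * normal_acc,
        "misleading_accuracy": 100 * misleading_acc,
        "susceptibility": susceptibility,
        "consistency": 100 * consistency,
    }
```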
2. Related Work
2.1. LLM Benchmarking
- Uncertainty evaluation: The softmax probability for each token is considered to assess the model’s confidence in its response.
- Self-consistency: The model generates its response multiple times to evaluate the consistency of its answers (a rough sketch of this strategy is given after this list).
- Multi-debate: After the model produces the correct answer, it is informed that the answer is incorrect, and its ability to maintain the correct response is evaluated.
- Fact-based metric: The model’s output is compared to factual reality to check for overlaps.
- Classifier-based metric: The model’s output and the correct information are evaluated for overlap using a separate NLP model.
- QA-based metric: The model’s answers to a set of questions are compared to the source content for consistency.
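To make the self-consistency strategy concrete, here is a minimal, hypothetical sketch (not any cited system's implementation); `generate` is assumed to be a sampling-based callable that returns one answer string per call:

```python
from collections import Counter

def self_consistency_score(generate, prompt, n_samples=10):
    """Sample several answers to the same prompt and measure how strongly they agree.

    `generate` is assumed to be a thin wrapper around an LLM API that returns
    one answer string per call with sampling enabled.
    """
    answers = [generate(prompt).strip().lower() for _ in range(n_samples)]
    majority_answer, count = Counter(answers).most_common(1)[0]
    # Low agreement with the majority answer signals uncertainty,
    # which is often associated with a higher risk of hallucination.
    return majority_answer, count / n_samples
```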
2.2. Current Hallucination Benchmarks
- TruthfulQA: This benchmark evaluates a model’s truthfulness when responding to questions that are frequently answered incorrectly due to false beliefs or misconceptions [23].
- HalluQA: This is a Chinese-specific benchmark featuring intentionally tricky questions that follow the same pattern as TruthfulQA. It includes complex knowledge-based questions written by graduate interns and subsequently filtered [24].
- HaluEval: This benchmark assesses a model’s ability to detect hallucinations in both generated and human-annotated samples [25].
2.3. Massive Multitask Language Understanding (MMLU)
3. DB-MMLU Dataset
3.1. Dataset Preparation
3.2. Evaluation Metrics
3.2.1. Accuracy
3.2.2. Susceptibility
3.2.3. Consistency
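For reference, a minimal formalization consistent with the metric definitions given in the Introduction (the notation is ours, not taken from the original text): for $N$ questions with gold answers $g_i$ and model answers $a_i^{(c)}$ under category $c \in \{\text{normal}, \text{misleading}\}$,

$$\mathrm{Acc}_c = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[a_i^{(c)} = g_i\right], \qquad \mathrm{Susceptibility} = \frac{\mathrm{Acc}_{\text{normal}}}{\mathrm{Acc}_{\text{misleading}}}, \qquad \mathrm{Consistency} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[a_i^{(\text{normal})} = a_i^{(\text{misleading})}\right].$$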
4. Experiments
4.1. Tested Models
4.2. Setup
5. Results and Analysis
5.1. Accuracy
5.2. Susceptibility
5.3. Consistency
6. Limitations and Future Work
7. Conclusions
Appendix A. DB-MMLU Details
Appendix A.1. Prompt Used To Generate Misleading Answers for the DB-MMLU Dataset


Appendix B. DB-MMLU Benchmark Results
Appendix B.1. Prompt Used for Testing

Appendix B.2. Accuracy of the Tested Models


Appendix B.3. Susceptibility of the Tested Models


Appendix B.4. Consistency of the Tested Models


Appendix B.5. Error Rates of the Tested Models

Appendix C. Result Analysis
Appendix C.1. Correlation between Normal Accuracy and Misleading Accuracy
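As a rough illustration of how such a correlation and its two-tailed p-value (with n − 2 = 8 degrees of freedom for the 10 tested models) can be computed — using SciPy here as an assumed tool, not necessarily the authors' actual analysis code — with the accuracy values taken from the results table later in this document:

```python
from scipy.stats import pearsonr

# Normal and misleading accuracy of the 10 tested models (see the Appendix B results table).
normal_acc = [34.88, 45.22, 68.01, 77.42, 52.07, 52.41, 51.84, 48.61, 62.72, 44.29]
misleading_acc = [18.92, 20.60, 41.81, 40.71, 31.40, 30.87, 25.17, 20.37, 52.36, 26.93]

# pearsonr reports a two-sided p-value based on a t-test with n - 2 = 8 degrees of freedom.
r, p_value = pearsonr(normal_acc, misleading_acc)
print(f"Pearson r = {r:.3f}, two-tailed p = {p_value:.4f}")
```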

Appendix C.2. Correlation between Susceptibility on DB-MMLU and TruthfulQA

Appendix C.3. Correlation between Susceptibility on DB-MMLU and Model Size and Normal Accuracy on DB-MMLU

Appendix C.4. Correlation between Susceptibility and Consistency on DB-MMLU

Appendix D. Case Studies

References
- Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y.T.; Li, Y.; Lundberg, S.; Nori, H.; Palangi, H.; Ribeiro, M.T.; Zhang, Y. Sparks of Artificial General Intelligence: Early experiments with GPT-4, 2023. arXiv:2303.12712 [cs].
- Mumtaz, U.; Ahmed, A.; Mumtaz, S. LLMs-Healthcare: Current applications and challenges of large language models in various medical specialties. Artificial Intelligence in Health 2024, 1, 16. [CrossRef]
- Dell’Acqua, F.; McFowland, E.; Mollick, E.R.; Lifshitz-Assaf, H.; Kellogg, K.; Rajendran, S.; Krayer, L.; Candelon, F.; Lakhani, K.R. Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. SSRN Electronic Journal 2023. [CrossRef]
- Xu, Z.; Jain, S.; Kankanhalli, M. Hallucination is Inevitable: An Innate Limitation of Large Language Models, 2024. arXiv:2401.11817 [cs].
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Chen, D.; Chan, H.S.; Dai, W.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys 2023, 55, 1–38. arXiv:2202.03629 [cs]. [CrossRef]
- Shahsavar, Y.; Choudhury, A. User Intentions to Use ChatGPT for Self-Diagnosis and Health-Related Purposes: Cross-sectional Survey Study. JMIR Human Factors 2023, 10, e47564. [CrossRef]
- Babb, M.; Koren, G.; Einarson, A. Treating pain during pregnancy. Canadian Family Physician 2010, 56, 25–27.
- Hong, G.; Gema, A.P.; Saxena, R.; Du, X.; Nie, P.; Zhao, Y.; Perez-Beltrachini, L.; Ryabinin, M.; He, X.; Fourrier, C.; Minervini, P. The Hallucinations Leaderboard – An Open Effort to Measure Hallucinations in Large Language Models, 2024. arXiv:2404.05904 [cs]. [CrossRef]
- Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; Liu, T. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions, 2023. arXiv:2311.05232 [cs].
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; Riedel, S.; Kiela, D. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2021. arXiv:2005.11401 [cs].
- Gou, Z.; Shao, Z.; Gong, Y.; Shen, Y.; Yang, Y.; Duan, N.; Chen, W. CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, 2024. arXiv:2305.11738 [cs].
- Gunasekar, S.; Zhang, Y.; Aneja, J.; Mendes, C.C.T.; Del Giorno, A.; Gopi, S.; Javaheripi, M.; Kauffmann, P.; de Rosa, G.; Saarikivi, O.; Salim, A.; Shah, S.; Behl, H.S.; Wang, X.; Bubeck, S.; Eldan, R.; Kalai, A.T.; Lee, Y.T.; Li, Y. Textbooks Are All You Need, 2023. arXiv:2306.11644 [cs]. [CrossRef]
- Beeching, E.; Fourrier, C.; Habib, N.; Han, S.; Lambert, N.; Rajani, N.; Sanseviero, O.; Tunstall, L.; Wolf, T. Open LLM Leaderboard, 2023.
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2023. arXiv:2201.11903 [cs].
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding, 2021. arXiv:2009.03300 [cs].
- Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; Choi, Y. HellaSwag: Can a Machine Really Finish Your Sentence?, 2019. arXiv:1905.07830 [cs]. [CrossRef]
- Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2020. arXiv:1905.00537 [cs].
- Saad-Falcon, J.; Fu, D.Y.; Arora, S.; Guha, N.; Ré, C. Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT, 2024. arXiv:2402.07440 [cs].
- Yue, X.; Ni, Y.; Zhang, K.; Zheng, T.; Liu, R.; Zhang, G.; Stevens, S.; Jiang, D.; Ren, W.; Sun, Y.; Wei, C.; Yu, B.; Yuan, R.; Sun, R.; Yin, M.; Zheng, B.; Yang, Z.; Liu, Y.; Huang, W.; Sun, H.; Su, Y.; Chen, W. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI, 2023. arXiv:2311.16502 [cs].
- Gandhi, K.; Fränken, J.P.; Gerstenberg, T.; Goodman, N.D. Understanding Social Reasoning in Language Models with Language Models, 2023. arXiv:2306.15448 [cs].
- Alonso, I.; Oronoz, M.; Agerri, R. MedExpQA: Multilingual Benchmarking of Large Language Models for Medical Question Answering, 2024. arXiv:2404.05590 [cs].
- Nie, A.; Zhang, Y.; Amdekar, A.; Piech, C.; Hashimoto, T.; Gerstenberg, T. MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks, 2023. arXiv:2310.19677 [cs].
- Lin, S.; Hilton, J.; Evans, O. TruthfulQA: Measuring How Models Mimic Human Falsehoods. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 3214–3252. [CrossRef]
- Cheng, Q.; Sun, T.; Zhang, W.; Wang, S.; Liu, X.; Zhang, M.; He, J.; Huang, M.; Yin, Z.; Chen, K.; Qiu, X. Evaluating Hallucinations in Chinese Large Language Models.
- Li, J.; Cheng, X.; Zhao, X.; Nie, J.Y.; Wen, J.R. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Singapore, 2023; pp. 6449–6464. [CrossRef]
- OpenAI.; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; et al. GPT-4 Technical Report, 2024. arXiv:2303.08774 [cs]. [CrossRef]
- Team, G.; Reid, M.; Savinov, N.; Teplyashin, D.; Lepikhin, D.; Lillicrap, T.; Alayrac, J.b.; Soricut, R.; Lazaridou, A.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. arXiv:2403.05530 [cs]. [CrossRef]
- Meta Llama 3.
- Qwen1.5-110B: The First 100B+ Model of the Qwen1.5 Series, 2024. Blog post.
- christopherthompson81. Examining LLM Quantization Impact.
1 The two-tailed probability is used, with 8 degrees of freedom.



| Model | Size (parameters) | MMLU 1 | TruthfulQA 1 |
|---|---|---|---|
| Gemma-1.1-2b-it | 2.51B | 37.65 | 45.82 |
| Phi-2 | 2.78B | 58.11 | 44.47 |
| Phi-3-mini-4k-instruct | 3.82B | 69.08 | 59.88 |
| Phi-3-medium-4k-instruct | 14B | 77.83 | 57.71 |
| Mistral-7b-instruct-v0.2 | 7.24B | 60.78 | 68.26 |
| Meta-Llama-3-8B-Instruct | 8.08B | 67.07 | 51.56 |
| DeciLM-7B-instruct | 7.04B | 60.24 | 49.75 |
| Aya-23-8B | 8.03B | - | - |
| Solar-10.7B-Instruct | 10.7B | 66.21 | 71.43 |
| StarChat2-15B-v0.1 | 16B | - | - |
| Parameter | Value |
|---|---|
| Temperature | 0.3 |
| Top P | 0.3 |
| Top K | 40 |
| Repeat penalty | 1.1 |
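For reference, a minimal sketch of applying these decoding parameters with the Hugging Face transformers API; the authors' actual inference stack is not specified here, so the library, model checkpoint, and prompt below are only assumptions for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "microsoft/Phi-3-mini-4k-instruct"  # one of the tested models, chosen as an example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder prompt; the benchmark prompt template is given in Appendix B.1.
inputs = tokenizer("Question: ...\nAnswer:", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.3,        # values from the setup table above
    top_p=0.3,
    top_k=40,
    repetition_penalty=1.1,
    max_new_tokens=64,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```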
| Model | Normal Accuracy (%) | Misleading Accuracy (%) | Susceptibility | Consistency (%) |
|---|---|---|---|---|
| Gemma-1.1-2b-it | 34.88 | 18.92 | 2.53 | 45.18 |
| Phi-2 | 45.22 | 20.60 | 2.39 | 38.05 |
| Phi-3-mini-4k-instruct | 68.01 | 41.81 | 1.63 | 53.60 |
| Phi-3-medium-4k-instruct | 77.42 | 40.71 | 1.90 | 48.29 |
| Mistral-7b-instruct-v0.2 | 52.07 | 31.40 | 1.65 | 44.38 |
| Meta-Llama-3-8B-Instruct | 52.41 | 30.87 | 1.75 | 46.46 |
| DeciLM-7B-instruct | 51.84 | 25.17 | 2.21 | 39.12 |
| Aya-23-8B | 48.61 | 20.37 | 2.65 | 35.36 |
| Solar-10.7B-Instruct | 62.72 | 52.36 | 1.20 | 68.26 |
| StarChat2-15B-v0.1 | 44.29 | 26.93 | 1.69 | 45.44 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
