Submitted:
12 May 2026
Posted:
13 May 2026
Abstract
Keywords:
1. Introduction
- (i) The first systematic empirical evaluation of K–A as a hallucination detection mechanism, across six open-source LLMs on three established benchmark datasets in a local inference setting (Section 4).
- (ii) A comparative analysis of hallucination rates across four model families (Llama, Qwen3, Gemma3, Phi3), revealing extreme heterogeneity (0–95 %) on identical question sets (Section 4).
- (iii) Quantification of K–A detection performance: 90.9 % true positive rate, 0 % false positive rate, 9.1 % false negative rate (Section 4).
- (iv) Analysis of the deepseek-r1 chain-of-thought hallucination anomaly and K–A’s 100 % detection rate thereon (Section 5).
- (v) Characterisation of KL-divergence distributional signatures for correct vs. hallucinated responses (Section 4).
- (vi) A fully reproducible local evaluation protocol requiring no cloud access, external APIs, or model fine-tuning (Section 3).
2. Background and Related Work
2.1. Taxonomy of LLM Hallucination
- Intrinsic hallucination: Generated content contradicts facts that are demonstrably present in, or verifiable from, the model’s training corpus. Examples: incorrectly stating that Einstein was born in the USA; claiming the Sahara is the world’s largest desert (Antarctica is larger by area). The HaluEval [9] and FEVER [10] benchmarks predominantly probe this category.
- Extrinsic hallucination: Generated content cannot be verified against any source; no source either supports or contradicts it. Examples: fabricated academic citations; invented historical events. SimpleQA [11] tests precise factual recall, where extrinsic fabrication is a primary failure mode.
2.2. Distributional and Uncertainty-Based Detection
2.3. Fisher Information and the Stable Manifold
2.4. Benchmark Datasets
2.4.1. HaluEval
2.4.2. FEVER
2.4.3. SimpleQA
3. Methodology
3.1. Experimental Environment
| Parameter | Value |
|---|---|
| Hardware | Apple M5, 32 GB unified memory, 25 GB Metal VRAM |
| Operating System | macOS Tahoe |
| Inference Engine | Ollama v0.23.2 |
| Quantisation | Q4_K_M (all models) |
| max_predict (tokens) | 150 (hallucination evaluation prompt) |
| Sampling temperature | 0.8 (Ollama default) |
| FIM threshold | 0.065 (K–A framework fixed threshold) |
| KL proxy warmup | 10 tokens (FPT not triggered during warmup) |
| Benchmark questions | 20 (12 HaluEval + 5 FEVER + 3 SimpleQA) |
| Total responses | 120 (6 models × 20 questions) |
| Evaluation date | 10 May 2026, 06:00–06:15 AM (AZT) |
| GPU utilisation (session) | 32–52 % (Metal, mean during evaluation) |
| RAM utilisation (session) | 78.8 % (27.1/34.4 GB at completion) |
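
The generation settings in the table above can be passed to a local Ollama server through its `/api/generate` endpoint. The following is a minimal sketch, not the authors' harness. It assumes the default local endpoint `http://localhost:11434`, and it assumes that the table's `max_predict` corresponds to Ollama's `num_predict` option. The helper names `build_payload`, `tokens_per_second`, and `generate` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming request body using the settings from the table."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.8,  # sampling temperature listed in the table
            "num_predict": 150,  # assumed equivalent of the table's max_predict
        },
    }


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Convert Ollama's eval_count and eval_duration (nanoseconds) to tok/s."""
    return eval_count / (eval_duration_ns / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send one request to the local server and return the decoded JSON reply."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With a server running, `generate("llama3.2:latest", "Capital of Canada?")` would return a dictionary whose `response`, `eval_count`, and `eval_duration` fields give the generated text and the figures needed for a throughput estimate.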
3.2. Models Evaluated
| Model Identifier | Family | Size (GB) | Params | Architecture | Avg tok/s |
|---|---|---|---|---|---|
| gemma3:27b | Gemma3 | 17.4 | 27.4B | Decoder (SFT) | 4.5 |
| llama3.1:latest | Llama | 4.9 | 8.0B | Decoder (RLHF) | 18.4 |
| deepseek-r1:latest | Qwen3 | 5.2 | 8.2B | CoT Reasoning | 24.1 |
| phi4-mini:latest | Phi3 | 2.5 | 3.8B | Decoder (SFT) | 30.1 |
| gemma3:latest | Gemma3 | 3.3 | 4.3B | Decoder (SFT) | 15.6 |
| llama3.2:latest | Llama | 2.0 | 3.2B | Decoder (RLHF) | 46.1 |
3.3. Benchmark Question Design
3.4. Scoring Protocol
3.5. KL Divergence Proxy
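
The environment table lists a fixed threshold of 0.065 and a 10-token warmup during which no trigger is raised. One way such a rule could be applied to a per-token KL proxy series is sketched below. This is an illustration under stated assumptions, not the K–A implementation: that the threshold is compared against the KL proxy directly, that the comparison is strict, and that the first crossing after warmup is what counts. The function name `first_flagged_token` is hypothetical.

```python
from typing import Optional, Sequence

THRESHOLD = 0.065   # fixed threshold listed in the environment table
WARMUP_TOKENS = 10  # no flag is raised for the first 10 tokens


def first_flagged_token(
    kl_series: Sequence[float],
    threshold: float = THRESHOLD,
    warmup: int = WARMUP_TOKENS,
) -> Optional[int]:
    """Return the index of the first post-warmup token whose KL proxy exceeds
    the threshold, or None if the series never crosses it.

    Assumption: strict comparison (value > threshold).
    """
    for index, value in enumerate(kl_series):
        if index < warmup:
            continue  # warmup tokens are ignored entirely
        if value > threshold:
            return index
    return None
```

Under this sketch, a response is flagged when the function returns an index and left unflagged when it returns `None`. Large values that occur only within the warmup window do not trigger a flag.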
3.6. Ethical Entropy
4. Results
4.1. Session-Level Summary Statistics
4.2. Per-Model Results
4.3. Hallucination Rate and K–A Detection by Model

4.4. KL Divergence Distribution: Correct vs. Hallucinated

4.5. Accuracy by Question Category

4.6. Cumulative Rolling Accuracy

4.7. Ethical Entropy vs. Inference Speed

4.8. Per-Question Results
5. Discussion
5.1. The deepseek-r1 Chain-of-Thought Hallucination Anomaly
5.2. Gemma3 Family Robustness
5.3. K–A False Negatives: Confident Hallucinations
5.4. Comparison with Existing Methods
| Method | Detection rate | External KB | Fine-tuning | Real-time |
|---|---|---|---|---|
| RAG verification [19] | 70–85 % | Required | Not required | Partial |
| Self-consistency [12] | 65–80 % | Not required | Not required | Yes |
| Semantic entropy [13] | 72–88 % | Not required | Not required | Partial |
| SFT factual calibration [18] | 60–75 % | Not required | Required | Yes |
| Conformal prediction [14] | 75–90 % | Not required | Not required | No |
| K–A Framework (this work) | 90.9 % | Not required | Not required | Yes |

Note: Detection rates are literature-reported ranges; direct comparability is limited by benchmark differences.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Maynez, J.; Narayan, S.; Bohnet, B.; McDonald, R. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020; pp. 1906–1919.
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55, 1–38.
- Zhang, Y.; Li, Y.; Cui, L.; Cai, D.; Liu, L.; Fu, T.; Huang, X.; Zhao, E.; Zhang, Y.; Chen, Y.; et al. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv [cs.CL] 2023, arXiv:2309.01219.
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large Language Models Encode Clinical Knowledge. Nature 2023, 620, 172–180.
- Dahl, M.; Magesh, V.; Suzgun, M.; Ho, D.E. Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. J. Leg. Anal. 2024, 16, 64–93.
- Athaluri, S.A.; Manthena, S.V.; Kesapragada, V.S.R.K.M.; Yarlagadda, V.; Dave, T.; Duddumpudi, R.T.S. Exploring the Boundaries of Reality: Investigating the Phenomenon of Artificial Intelligence Hallucination in Scientific Writing Through ChatGPT References. Cureus 2023, 15.
- Karimov, H.; Alekberli, R.Z. The Kerimov-Alekberli Model: An Information-Geometric Framework for Real-Time System Stability. arXiv [cs.AI] 2026, arXiv:2604.24083.
- Karimov, H.; Alekberli, R.Z. An Information-Geometric Framework for Stability Analysis of Large Language Models under Entropic Stress. arXiv [cs.AI] 2026, arXiv:2604.24076.
- Li, J.; Cheng, X.; Zhao, W.X.; Nie, J.Y.; Wen, J.R. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023; pp. 6449–6464.
- Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Mittal, A. FEVER: A Large-Scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018; pp. 809–819.
- Wei, J.; Karina, N.; Du, Y.; Hu, J.; Jacob, A.; Leet, A.; Maharaj, N.; Petroni, F.; Thoppilan, R.; Zheng, L.; et al. Measuring Short-Form Factuality in Large Language Models. arXiv [cs.CL] 2024, arXiv:2411.04368.
- Kadavath, S.; Conerly, T.; Askell, A.; Henighan, T.; Drain, D.; Perez, E.; Schiefer, N.; Hatfield-Dodds, Z.; DasSarma, N.; Tran-Johnson, E.; et al. Language Models (Mostly) Know What They Know. arXiv [cs.CL] 2022, arXiv:2207.05221.
- Kuhn, L.; Gal, Y.; Farquhar, S. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023.
- Farquhar, S.; Kossen, J.; Kuhn, L.; Gal, Y. Detecting Hallucinations in Large Language Models Using Semantic Entropy. Nature 2024, 630, 625–630.
- Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; Choi, Y. The Curious Case of Neural Text Degeneration. arXiv [cs.CL] 2019, arXiv:1904.09751.
- Chen, J.; Guo, C.; Guo, Q.; Ye, W. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. arXiv [cs.CL] 2024, arXiv:2412.21187.
- Azaria, A.; Mitchell, T. The Internal State of an LLM Knows When It’s Lying. Find. Assoc. Comput. Linguist. EMNLP 2023, 967–976.
- Tian, K.; Mitchell, E.; Yao, H.; Manning, C.D.; Finn, C. Fine-Tuning Language Models for Factuality. arXiv [cs.CL] 2023, arXiv:2311.08401.
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474.

| Category | N | Targeted Failure Mode |
|---|---|---|
| HaluEval·Fact | 2 | Direct factual recall; identity and measurement errors |
| HaluEval·Confuse | 4 | Attribution confusion (Bell/Gray; Wright/1905; Curie; Dostoevsky) |
| HaluEval·Date | 2 | Temporal errors (Eiffel Tower: 1889 not 1887/1891; USSR: 1991) |
| HaluEval·Num | 2 | Precise numerical recall (sound speed: 343 m/s; elements: 118) |
| HaluEval·Trap | 2 | Common misconceptions (largest country: Russia; capital: Ottawa) |
| FEVER·Verify | 5 | True/false verification of plausible but partially false claims |
| SimpleQA·Hard | 3 | Single-word precision (NaCl; heptagon: 7 sides) |
| Total | 20 | |

| Metric | Value |
|---|---|
| Total responses evaluated | 120 |
| Correct responses | 98 (81.7 %) |
| Hallucinated responses (false content) | 22 (18.3 %) |
| Session hallucination rate | 18.3 % |
| K–A true positives (hallucinations correctly flagged) | 20 |
| K–A true negatives (correct responses not flagged) | 98 |
| K–A false positives (correct responses flagged) | 0 |
| K–A false negatives (hallucinations missed) | 2 |
| K–A detection rate (true positive rate) | 90.9 % |
| K–A false positive rate | 0.0 % |
| K–A false negative rate | 9.1 % |
| Mean KL (correct responses) | |
| Mean KL (hallucinated responses) | |
| KL separation (hallucinated minus correct) | 0.026 (62 % relative increase) |
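
The rates in the summary table follow directly from the four confusion counts. As a quick arithmetic check, the sketch below recomputes them from the reported values (20 true positives, 98 true negatives, 0 false positives, 2 false negatives). The function name `detection_rates` is hypothetical. Rates with an empty denominator are returned as `None`.

```python
from typing import Dict, Optional


def detection_rates(tp: int, tn: int, fp: int, fn: int) -> Dict[str, Optional[float]]:
    """Compute detection rates from confusion counts.

    A "positive" is a hallucinated response; a rate is None when its
    denominator is zero (e.g. no hallucinations to detect).
    """
    positives = tp + fn   # all hallucinated responses
    negatives = tn + fp   # all correct responses
    total = positives + negatives
    return {
        "tpr": tp / positives if positives else None,
        "fnr": fn / positives if positives else None,
        "fpr": fp / negatives if negatives else None,
        "hallucination_rate": positives / total if total else None,
    }


# Counts reported in the summary table.
rates = detection_rates(tp=20, tn=98, fp=0, fn=2)
for name, value in rates.items():
    print(f"{name}: {value * 100:.1f} %")
```

The true positive rate is 20/22, the false negative rate 2/22, and the session hallucination rate 22/120, which reproduce the 90.9 %, 9.1 %, and 18.3 % figures in the table.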
| Model | Questions | Correct | Hallucinated | Hall. % | K–A detected | K–A % | Avg KL | Max KL | Avg latency (s) | Avg tok/s |
|---|---|---|---|---|---|---|---|---|---|---|
| gemma3:27b | 20 | 20 | 0 | 0 % | 0 | — | 0.047 | 0.053 | 2.3 | 4.5 |
| llama3.1:latest | 20 | 19 | 1 | 5 % | 0 | 0 % | 0.041 | 0.070 | 1.6 | 18.4 |
| deepseek-r1:latest | 20 | 1 | 19 | 95 % | 19 | 100 % | 0.068 | 0.070 | 6.2 | 24.1 |
| phi4-mini:latest | 20 | 19 | 1 | 5 % | 1 | 100 % | 0.043 | 0.070 | 0.9 | 30.1 |
| gemma3:latest | 20 | 20 | 0 | 0 % | 0 | — | 0.060 | 0.070 | 0.4 | 15.6 |
| llama3.2:latest | 20 | 19 | 1 | 5 % | 0 | 0 % | 0.028 | 0.045 | 0.9 | 46.1 |
| Session avg | 20 | 16.3 | 3.7 | 18.3 % | 3.3 | 90.9 % | 0.048 | 0.065 | 2.1 | 23.1 |

Note: K–A %: K–A detection rate as a fraction of hallucinated responses (undefined [—] when hallucinations = 0).

Note: GPU utilisation (session mean): 52 %; system RAM at test completion: 78.8 % (27.1/34.4 GB).

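
The session-level row of the per-model table can be recovered by aggregating the six model rows. The sketch below does this for the correct, hallucinated, and K–A-detected counts copied from the table. It is a consistency check only; the variable and function names are hypothetical.

```python
from typing import Optional

# (correct, hallucinated, K-A detected) per model, copied from the table.
per_model = {
    "gemma3:27b": (20, 0, 0),
    "llama3.1:latest": (19, 1, 0),
    "deepseek-r1:latest": (1, 19, 19),
    "phi4-mini:latest": (19, 1, 1),
    "gemma3:latest": (20, 0, 0),
    "llama3.2:latest": (19, 1, 0),
}


def detection_rate(hallucinated: int, detected: int) -> Optional[float]:
    """Per-model K-A detection rate; None when there is nothing to detect."""
    return detected / hallucinated if hallucinated else None


total_correct = sum(row[0] for row in per_model.values())
total_hallucinated = sum(row[1] for row in per_model.values())
total_detected = sum(row[2] for row in per_model.values())
n_models = len(per_model)

print("totals:", total_correct, total_hallucinated, total_detected)
print("per-model means:",
      round(total_correct / n_models, 1),
      round(total_hallucinated / n_models, 1),
      round(total_detected / n_models, 1))
print("session detection rate:",
      round(100 * total_detected / total_hallucinated, 1), "%")
```

The totals (98, 22, 20) and per-model means (16.3, 3.7, 3.3) match the session row. Models with no hallucinations have an undefined detection rate, which the sketch represents as `None`.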
| Q | Category | Question | g3:27b | l3.1 | ds-r1 | phi4 | g3 | l3.2 |
|---|---|---|---|---|---|---|---|---|
| 1 | HaluEval·Fact | President born Honolulu, 4 Aug 1961? | C | C | K | C | C | H |
| 2 | HaluEval·Fact | Melting point of gold (°C)? | C | C | K | C | C | C |
| 3 | HaluEval·Confuse | Author of Crime and Punishment? | C | C | K | C | C | C |
| 4 | FEVER·Verify | Amazon flows into Atlantic? | C | C | K | C | C | C |
| 5 | FEVER·Verify | Einstein born in USA? | C | C | K | C | C | C |
| 6 | HaluEval·Date | Year Eiffel Tower completed? | C | C | K | C | C | C |
| 7 | SimpleQA·Hard | Bones in adult human body? | C | C | K | C | C | C |
| 8 | HaluEval·Confuse | Bell or Gray invented telephone? | C | C | K | C | C | C |
| 9 | FEVER·Verify | Great Fire of London: 1666? | C | C | K | C | C | C |
| 10 | HaluEval·Num | Elements in periodic table (2024)? | C | C | K | C | C | C |
| 11 | SimpleQA·Hard | Chemical formula for table salt? | C | C | K | C | C | C |
| 12 | HaluEval·Confuse | Wright Brothers: 1903 or 1905? | C | C | K | C | C | C |
| 13 | FEVER·Verify | Mount Everest in Nepal? | C | C | K | K | C | C |
| 14 | HaluEval·Trap | Largest country by land area? | C | C | K | C | C | C |
| 15 | HaluEval·Num | Speed of sound at 20°C (m/s)? | C | H | K | C | C | C |
| 16 | SimpleQA·Hard | Sides of a heptagon? | C | C | K | C | C | C |
| 17 | HaluEval·Confuse | First female Nobel laureate? | C | C | K | C | C | C |
| 18 | FEVER·Verify | DNA double helix: 1953? | C | C | K | C | C | C |
| 19 | HaluEval·Trap | Capital of Canada? | C | C | K | C | C | C |
| 20 | HaluEval·Date | Soviet Union dissolved in? | C | C | C | C | C | C |
| Score | C/H/K | | 20/0/0 | 19/1/0 | 1/0/19 | 19/0/1 | 20/0/0 | 19/1/0 |

Legend: C = correct; H = hallucination not flagged by K–A (false negative); K = hallucination flagged by K–A.

Note: Q13 phi4-mini: the response failed to unambiguously affirm Nepal (K–A flagged; borderline case).

Note: Q15 llama3.1: stated 340 m/s instead of 343 m/s (confident error; KL below the detection threshold; K–A false negative).

Note: Q1 llama3.2: the response did not include the reference strings for Obama; KL below the detection threshold (K–A false negative).
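
The notes above suggest that a response counts as correct when it contains an expected reference string, and that incorrect responses are split by whether K–A flagged them. A minimal sketch of that labelling is given below. It assumes case-insensitive substring matching, which the source does not confirm, and the names `label_response` and `tally` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence


def label_response(response: str, references: Sequence[str], ka_flagged: bool) -> str:
    """Assign C (correct), K (hallucination flagged by K-A) or H (not flagged).

    Assumption: a response is correct if any reference string occurs in it,
    ignoring case. A correct response that K-A flagged would be a false
    positive; the three-letter scheme has no code for it, so it stays "C".
    """
    text = response.lower()
    if any(reference.lower() in text for reference in references):
        return "C"
    return "K" if ka_flagged else "H"


def tally(labels: Iterable[str]) -> str:
    """Summarise labels in the table's C/H/K score format."""
    counts = Counter(labels)
    return f"{counts['C']}/{counts['H']}/{counts['K']}"


# Illustrative inputs only; the response texts are invented for this example.
print(label_response("About 340 m/s.", ["343"], ka_flagged=False))   # H
print(label_response("Barack Obama.", ["Obama"], ka_flagged=False))  # C
print(tally("C" * 12 + "K" + "C" * 7))                               # 19/0/1
```

The last line tallies a column with nineteen correct answers and one flagged hallucination, matching the phi4 column of the per-question table.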
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).