Submitted: 04 June 2025
Posted: 06 June 2025
Abstract
Keywords:
1. Introduction
2. Diagnostic Tests
Test 1: Answer Selection
Test 2: Answer Refinement
Insights and Applications
3. Experiments
Benchmarked Metrics
Datasets and Models
Baselines
4. Results and Analysis
Uncertainty Metrics Fall Short in Answer Selection
Uncertainty Metrics Show Potential to Guide Answer Refinement
Descriptive vs. Prescriptive
5. Conclusions
Limitations
Use of AI Assistants
Appendix A. Implementation Details
Appendix B. Additional Results


Appendix C. Prompt Templates









| Metric | CommonsenseQA |  |  | MATH |  |  | TruthfulQA |  |  | TriviaQA |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | ACC | PWR | MRR | ACC | PWR | MRR | ACC | PWR | MRR | ACC | PWR | MRR |
| NPE | 0.500 | 0.604 | 0.723 | 0.197 | 0.615 | 0.391 | 0.529 | 0.660 | 0.726 | 0.578 | 0.526 | 0.744 |
| LNPE | 0.518 | 0.615 | 0.728 | 0.162 | 0.572 | 0.361 | 0.502 | 0.640 | 0.718 | 0.699 | 0.570 | 0.799 |
| SE | 0.622 | 0.706 | 0.789 | 0.460 | 0.729 | 0.597 | 0.595 | 0.707 | 0.765 | 0.719 | 0.562 | 0.827 |
| Lexical | 0.240 | 0.343 | 0.575 | 0.092 | 0.189 | 0.240 | 0.239 | 0.327 | 0.549 | 0.517 | 0.375 | 0.691 |
| VC_Neg | 0.401 | 0.481 | 0.668 | 0.216 | 0.309 | 0.387 | 0.347 | 0.432 | 0.630 | 0.535 | 0.419 | 0.717 |
| PTrue_Comp | 0.325 | 0.423 | 0.625 | 0.189 | 0.557 | 0.371 | 0.420 | 0.557 | 0.674 | 0.586 | 0.533 | 0.745 |
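As a point of reference for the entropy-based rows above (a minimal sketch, not the authors' implementation): NPE and LNPE are conventionally estimated from the token log-probabilities of several sampled answers, where NPE averages negative sequence log-likelihoods and LNPE first normalizes each sequence by its length. The function name and input format here are illustrative assumptions.

```python
def predictive_entropy(sampled_logprobs):
    """Monte-Carlo estimates of NPE and LNPE from sampled answers.

    sampled_logprobs: one list of per-token log-probabilities for each
    sampled answer, e.g. [[-0.5, -0.5], [-1.0]] for two samples.
    """
    # NPE: negative mean sequence log-likelihood across samples
    seq_lps = [sum(toks) for toks in sampled_logprobs]
    npe = -sum(seq_lps) / len(seq_lps)
    # LNPE: length-normalize each sequence log-likelihood before averaging
    ln_lps = [sum(toks) / len(toks) for toks in sampled_logprobs]
    lnpe = -sum(ln_lps) / len(ln_lps)
    return npe, lnpe
```

Higher values of either quantity indicate greater model uncertainty; LNPE removes the bias toward longer answers that raw NPE carries.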
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).