Preprint Article
This version is not peer-reviewed.

Rethinking Benchmark Comparability: A Survey of Reasoning Benchmarks for Large Language Models

Submitted: 11 May 2026
Posted: 13 May 2026


Abstract
As reasoning becomes a defining capability of large language models, reasoning benchmarks have moved to the center of evaluation. However, despite the rapid growth in the number of benchmarks and reported scores, benchmark results are often not directly comparable. This is because benchmarks may differ not only in the reasoning capabilities they target, but also in the conditions under which models are evaluated and the criteria used to assess success. To address this challenge, we present the first survey of reasoning benchmarks for large language models across three dimensions: Object, Setting, and Evaluation. Object defines the reasoning capability under examination. Setting specifies the conditions that shape model behavior. Evaluation determines how success is measured. We further introduce extended scenarios to account for special conditions. Based on this analysis, we identify two major weaknesses in current practice, namely heterogeneous benchmark objects and weakly justified settings, and derive practical guidance for benchmark selection, construction, and reporting, along with future directions for benchmark development. We hope this survey will help advance reasoning evaluation beyond score comparison alone toward benchmarks that are more interpretable, better justified, and easier to implement. A repository for the related papers is available at https://github.com/chenyuanTKCY/Awesome-Benchmarks-for-LLM-Reasoning.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.
