Submitted:
22 May 2026
Posted:
25 May 2026
Abstract
Keywords:
1. Introduction
- To the best of our knowledge, this is the first systematic attempt to survey Rubric RMs, an emerging RM paradigm based on explicit and interpretable evaluation criteria.
- We introduce a unified framework for Rubric RMs from four perspectives: rubric foundations, modeling, reasoning, and their role in policy optimization, as shown in Figure 2.
- Under this framework, we synthesize representative methods, applications, and evaluation practices, trace the technical evolution of the field, and highlight key open challenges and future directions.
2. Overview
3. Foundations
3.1. Rubric Inputs
3.2. Rubric Synthesis
3.2.1. Source of Synthesis
3.2.2. Adaptability of Synthesis
3.3. Rubric Contents
4. Modeling
4.1. What to Learn?
4.2. How to Learn?
5. Reasoning
5.1. Explicit Aggregation
Linear Aggregation.
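As a minimal sketch of what linear aggregation commonly means for rubric-based rewards, the snippet below collapses per-criterion scores into one scalar with a normalized weighted sum. The helper name `linear_aggregate`, the criterion names, and the weights are hypothetical illustrations and are not taken from any surveyed method.

```python
from typing import Mapping


def linear_aggregate(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of per-criterion rubric scores (hypothetical helper).

    Assumes each score lies in [0, 1] and each weight is non-negative.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in scores) / total_weight


# Illustrative rubric: three criteria judged as satisfied (1.0) or not (0.0).
scores = {"factual_accuracy": 1.0, "completeness": 0.0, "clarity": 1.0}
weights = {"factual_accuracy": 3.0, "completeness": 2.0, "clarity": 1.0}
reward = linear_aggregate(scores, weights)  # (3*1 + 2*0 + 1*1) / 6
```

Dividing by the total weight keeps the reward in [0, 1] regardless of how many criteria a rubric contains, which makes rewards comparable across prompts with rubrics of different lengths.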
Structure-Aware Aggregation.
5.2. Implicit Aggregation
6. Rubric RMs as the Core of the Optimization Loop
6.1. Data Synthesis
6.2. Policy Learning
6.3. Inference-Time Verification
6.4. Domain Adaptation
7. Evaluation and Benchmarking
Rubric Benchmarks.
Reward Benchmarks.
Evaluation Metrics.
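The benchmark table in this article labels evaluation protocols as Pairwise or BoN (best-of-N). As a hedged sketch of how such metrics are commonly computed, the snippet below scores a reward function on both. The function names, the data layout, and the toy length-based scorer are assumptions made for illustration; individual benchmarks may define these metrics differently.

```python
from typing import Callable, Sequence, Tuple


def pairwise_accuracy(
    pairs: Sequence[Tuple[str, str]], score: Callable[[str], float]
) -> float:
    """Fraction of (chosen, rejected) pairs where chosen scores strictly higher."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(1 for chosen, rejected in pairs if score(chosen) > score(rejected))
    return correct / len(pairs)


def best_of_n_accuracy(
    groups: Sequence[Tuple[Sequence[str], int]], score: Callable[[str], float]
) -> float:
    """Fraction of groups where the top-scored candidate is the labeled best one.

    Each group is (candidates, index_of_best); ties resolve to the first maximum.
    """
    if not groups:
        raise ValueError("groups must be non-empty")
    hits = 0
    for candidates, best_index in groups:
        top = max(range(len(candidates)), key=lambda i: score(candidates[i]))
        hits += int(top == best_index)
    return hits / len(groups)


# Toy scorer (assumption): longer responses receive higher reward.
toy_score = lambda text: float(len(text))

pairs = [("a long answer", "short"), ("ok", "much longer reply")]
groups = [(["aa", "a", "aaa"], 2), (["bbbb", "b"], 1)]

pair_acc = pairwise_accuracy(pairs, toy_score)
bon_acc = best_of_n_accuracy(groups, toy_score)
```

With the toy scorer, one of the two pairs and one of the two groups are ranked correctly, so both metrics come out at 0.5.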
8. Challenges and Future Directions
1. How can we evaluate rubric quality and the faithfulness of rubric-guided judgment under limited human annotation?
2. How can common values be identified from human feedback and datasets?
3. What is the appropriate structured representation of rubrics, and how can such structure be captured or learned effectively?
4. How should reward models and policies be jointly optimized?
9. Conclusion
Appendix A. Full Taxonomy

| Dataset | Domain | Modality | Scale |
|---|---|---|---|
| Rubric Dataset | | | |
| RubricHub [31] | Multi-domain | Language | 110k |
| WildChecklists [39] | Multi-domain | Language | 130k |
| OpenRubrics [56] | Multi-domain | Language | 35.6k |
| RaR-Medicine [11] | Medical | Language | 20k |
| RaR-Science [11] | Science | Language | 20k |
| LEGIT [52] | Legal | Language | 24k |
| Auto-Rubric [41] | Multi-domain | Language | 70 |
| Source Dataset | | | |
| WildChat [95] | Multi-domain | Language | 1M |
| UltraFeedback [96] | Multi-domain | Language | 64k |
| Skywork-Reward [97] | Multi-domain | Language | 80k |
| HelpSteer3-Preference [98] | Multi-domain | Language | 40k |
| Magpie [99] | Multi-domain | Language | 300k |
| MegaScience [100] | Multi-domain | Language | 1.25M |
| LongWriter [101] | Writing | Language | 6k |
| LongWriter-Zero [102] | Writing | Language | 8.6k |
| LongAlign [103] | Writing | Language | 10k |
| LMSYS-Chat-1M [104] | Chat | Language | 1M |
| medical-o1 [105] | Medical | Language | 19.7k |
| RLAIF-V [106] | Multi-domain | Multi-modal | 83.1k |
| LLaVA-Human-Preference-10K [107] | Multi-domain | Multi-modal | 9.4k |
| llava-critic-113k [108] | Multi-domain | Multi-modal | 113.0k |
| MM-RLHF [109] | Multi-domain | Multi-modal | 16.3k |
| pixmo-cap [110] | Multi-domain | Multi-modal | 717k |
| DenseFusion-1M [66] | Multi-domain | Multi-modal | 1M |
| ViRL-39K [111] | Multi-domain | Multi-modal | 39k |
Appendix B. Training Dataset for Rubric RMs
Rubric Datasets.
Source Datasets.
References
- Wang, S.; Zhang, S.; Zhang, J.; Hu, R.; Li, X.; Zhang, T.; Li, J.; Wu, F.; Wang, G.; Hovy, E. Reinforcement learning enhanced llms: A survey. arXiv preprint arXiv:2412.10400 2024.
- Zhang, K.; Zuo, Y.; He, B.; Sun, Y.; Liu, R.; Jiang, C.; Fan, Y.; Tian, K.; Jia, G.; Li, P.; et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827 2025.
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A., Eds. Curran Associates, Inc., 2022, Vol. 35, pp. 27730–27744.
- Silver, D.; Sutton, R.S. Welcome to the era of experience. Google AI 2025, 1, 11.
- Hubinger, E.; Van Merwijk, C.; Mikulik, V.; Skalse, J.; Garrabrant, S. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820 2019.
- Skalse, J.; Howe, N.; Krasheninnikov, D.; Krueger, D. Defining and Characterizing Reward Gaming. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A., Eds. Curran Associates, Inc., 2022, Vol. 35, pp. 9460–9471.
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds. Curran Associates, Inc., 2023, Vol. 36, pp. 46595–46623.
- Sun, H.; Shen, Y.; Ton, J.F. Rethinking bradley-terry models in preference-based reward modeling: Foundations, theory, and alternatives. arXiv preprint arXiv:2411.04991 2024.
- Mu, T.; Helyar, A.; Heidecke, J.; Achiam, J.; Vallone, A.; Kivlichan, I.; Lin, M.; Beutel, A.; Schulman, J.; Weng, L. Rule Based Rewards for Language Model Safety. In Proceedings of the Advances in Neural Information Processing Systems; Globerson, A.; Mackey, L.; Belgrave, D.; Fan, A.; Paquet, U.; Tomczak, J.; Zhang, C., Eds. Curran Associates, Inc., 2024, Vol. 37, pp. 108877–108901.
- Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional AI: Harmlessness from AI Feedback. ArXiv 2022, abs/2212.08073.
- Gunjal, A.; Wang, A.; Lau, E.; Nath, V.; He, Y.; Liu, B.; Hendryx, S.M. Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains. In Proceedings of the NeurIPS 2025 Workshop on Efficient Reasoning, 2025.
- Liang, X.; Zhang, H.; Li, J.; Chen, K.; Zhu, Q.; Zhang, M. Generative Reward Modeling via Synthetic Criteria Preference Learning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 26755–26769.
- Zhang, J.; Wang, Z.; Gui, L.; Sathyendra, S.M.; Jeong, J.; Veitch, V.; Wang, W.; He, Y.; Liu, B.; Jin, L. Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training. ArXiv 2025, abs/2509.21500.
- Nisbett, R.E.; Wilson, T.D. The halo effect: Evidence for unconscious alteration of judgments. Journal of personality and social psychology 1977, 35, 250.
- Jonsson, A.; Svingby, G. The use of scoring rubrics: Reliability, validity and educational consequences. Educational research review 2007, 2, 130–144.
- Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; Bouamor, H.; Pino, J.; Bali, K., Eds., Singapore, 2023; pp. 2511–2522.
- Ji, J.; Qiu, T.; Chen, B.; Zhang, B.; Lou, H.; Wang, K.; Duan, Y.; He, Z.; Zhou, J.; Zhang, Z.; et al. Ai alignment: A comprehensive survey. arXiv preprint arXiv:2310.19852 2023.
- Sierra, C.; Osman, N.; Noriega, P.; Sabater-Mir, J.; Perelló, A. Value alignment: a formal approach. arXiv preprint arXiv:2110.09240 2021.
- Bradley, R.A.; Terry, M.E. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika 1952, 39, 324–345.
- Nitko, A.J.; Brookhart, S.M. Educational assessment of students; Prentice-Hall, Inc., 2006.
- Arora, R.K.; Wei, J.; Hicks, R.S.; Bowman, P.; Candela, J.Q.; Tsimpourlas, F.; Sharman, M.; Shah, M.; Vallone, A.; Beutel, A.; et al. HealthBench: Evaluating Large Language Models Towards Improved Human Health. ArXiv 2025, abs/2505.08775.
- Lv, C.; Zhou, J.; Zhao, W.; Xu, J.; Huang, Z.; Tian, M.; Dou, S.; Gui, T.; Tian, L.; Zhou, X.; et al. Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation 2026.
- Bi, B.; Liu, S.; Wang, Y.; Tong, S.; Mei, L.; Ge, Y.; Xu, Y.; Guo, J.; Cheng, X. Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning. ArXiv 2025, abs/2511.12344.
- Wang, P.; Zuo, Q.; Liu, P.; Sang, Z.; Xie, C.; Yang, H. InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training. ArXiv 2025, abs/2510.15859.
- Rezaei, M.; Vacareanu, R.; Wang, Z.; Wang, C.; Liu, B.; He, Y.; Akyürek, A.F. Online Rubrics Elicitation from Pairwise Comparisons. ArXiv 2025, abs/2510.07284.
- Liu, D.; Yang, F.; Wang, X.; Yan, S.; Chai, J.; Li, J.; Ban, Y.; Mao, Z.; Lin, W.; Yin, G. CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling. arXiv preprint arXiv:2603.08035 2026.
- Shao, R.; Asai, A.; Shen, S.Z.; Ivison, H.; Kishore, V.; Zhuo, J.; Zhao, X.; Park, M.; Finlayson, S.G.; Sontag, D.; et al. DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research. ArXiv 2025, abs/2511.19399.
- Shen, W.F.; Qiu, X.; Whitehouse, C.; Alazraki, L.; Goel, S.; Barbieri, F.; Willi, T.; Mathur, A.; Leontiadis, I. Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks. arXiv preprint arXiv:2602.05125 2026.
- Li, G.; Mishra, B.D.; Wang, Z.; Yan, J.; Chen, Y.; Li, C.L.; Le, L.T.; Han, R.; Lee, G.; Tong, H.; et al. RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards. arXiv preprint arXiv:2605.10899 2026.
- Huang, T.H.; Salekin, S.; Movellan, J.; Sala, F.; Bilkhu, M. RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning. arXiv preprint arXiv:2603.09160 2026.
- Li, S.; Zhao, J.; Wei, M.; Ren, H.; Zhou, Y.; Yang, J.; Liu, S.; Zhang, K.; Chen, W. RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation. arXiv preprint arXiv:2601.08430 2026.
- Yuan, Y.; Mang, Q.; Chen, J.; Wan, H.; Liu, X.; Xu, J.; Huang, J.T.; Wang, W.; Jiao, W.; He, P. Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards. ArXiv 2025, abs/2510.07774.
- Wang, Z.; Zeng, J.; Delalleau, O.; Evans, E.; Egert, D.; Shin, H.C.; Soares, F.; Dong, Y.; Kuchaiev, O. RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards. In Proceedings of the Fourteenth International Conference on Learning Representations, 2026.
- Lu, J.; Zhang, S.; Xie, Z.; Song, Z.; Zhang, J. Orcust: Stepwise-Feedback Reinforcement Learning for GUI Agent. ArXiv 2025, abs/2509.17917.
- Ma, L.; Xu, Y.; Long, X.; Zheng, Z. An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs. ArXiv 2025, abs/2510.14660.
- Dhole, K.D.; Agichtein, E. RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation. arXiv preprint arXiv:2603.20882 2026.
- Raghavendra, M.; Gunjal, A.; Liu, B.; He, Y. Agentic Rubrics as Contextual Verifiers for SWE Agents. arXiv preprint arXiv:2601.04171 2026.
- Sun, Z.; Shen, Y.; Zhang, H.; Zhou, Q.; Chen, Z.; Cox, D.D.; Yang, Y.; Gan, C. SALMON: Self-Alignment with Instructable Reward Models. In Proceedings of the Twelfth International Conference on Learning Representations, 2024.
- Viswanathan, V.; Sun, Y.; Kong, X.; Cao, M.; Neubig, G.; Wu, T. Checklists Are Better Than Reward Models For Aligning Language Models. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Tian, J.; Liu, F.; Han, J.; Jiang, Y.; Wu, Y.; Liu, Y.; Li, H.; Xu, F.; Li, W. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria. arXiv preprint arXiv:2605.08354 2026.
- Xie, L.; Huang, S.; Zhang, Z.; Zou, A.; Zhai, Y.; Ren, D.; Zhang, K.; Hu, H.; Liu, B.; Chen, H.; et al. Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling. ArXiv 2025, abs/2510.17314.
- Goel, S.; Hazra, R.; Jayalath, D.; Willi, T.; Jain, P.; Shen, W.F.; Leontiadis, I.; Barbieri, F.; Bachrach, Y.; Geiping, J.; et al. Training AI Co-Scientists Using Rubric Rewards. arXiv preprint arXiv:2512.23707 2025.
- Liu, Z.; Wang, P.; Xu, R.; Ma, S.; Ruan, C.; Li, P.; Liu, Y.; Wu, Y. Inference-Time Scaling for Generalist Reward Modeling. ArXiv 2025, abs/2504.02495.
- Huang, Z.; Zhuang, Y.; Lu, G.; Qin, Z.; Xu, H.; Zhao, T.; Peng, R.; Hu, J.; Shen, Z.; Hu, X.; et al. Reinforcement Learning with Rubric Anchors. ArXiv 2025, abs/2508.12790.
- Zhou, K.; Tan, C. AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge. arXiv preprint arXiv:2603.07019 2026.
- He, Y.; Li, W.; Zhang, H.; Li, S.; Mandyam, K.; Khosla, S.; Xiong, Y.; Wang, N.; Peng, S.; Li, B.; et al. AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following. ArXiv 2025, abs/2511.10507.
- Jayalath, D.H.; Goel, S.; Foster, T.; Jain, P.; Gururangan, S.; Zhang, C.; Goyal, A.; Schelten, A. Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision. ArXiv 2025, abs/2509.14234.
- Sheng, L.; Ma, W.; Hong, R.; Wang, X.; Zhang, A.; Chua, T.S. Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics. arXiv preprint arXiv:2602.10885 2026.
- Wu, M.; Zhang, G.; Min, S.; Levine, S.; Kumar, A. RLAC: Reinforcement learning with adversarial critic for free-form generation tasks. arXiv preprint arXiv:2511.01758 2025.
- Chen, J.; Sun, W.; Yin, Q.; Tan, Z.; Zhang, J. ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning. arXiv preprint arXiv:2509.04903 2025.
- Zhang, J.; Lv, X.; Feng, L.; Hou, L.; Li, J. Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards. arXiv preprint arXiv:2601.06021 2026.
- Lee, J.; On, K.W.; Han, S.; Cohan, A.; Hockenmaier, J. Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics. ArXiv 2025, abs/2512.01020.
- Hashemi, H.; Eisner, J.; Rosset, C.; Van Durme, B.; Kedzie, C. LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Ku, L.W.; Martins, A.; Srikumar, V., Eds., Bangkok, Thailand, 2024; pp. 13806–13834.
- Xu, R.; Liu, T.; Dong, Z.; You, T.; Hong, I.; Yang, C.; Zhang, L.; Zhao, T.; Wang, H. Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training. arXiv preprint arXiv:2602.01511 2026.
- Chen, X.; Li, G.; Wang, Z.; Jin, B.; Qian, C.; Wang, Y.; Wang, H.; Zhang, Y.; Zhang, D.; Zhang, T.; et al. RM-R1: Reward Modeling as Reasoning. ArXiv 2025, abs/2505.02387.
- Liu, T.; Xu, R.; Yu, T.; Hong, I.; Yang, C.; Zhao, T.; Wang, H. OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment. ArXiv 2025, abs/2510.07743.
- Jin, Y.; Li, X.; Cao, F.; Gao, L.; Yao, J. Multidimensional Rubric-oriented Reward Model Learning via Geometric Projection Reference Constraints. ArXiv 2025, abs/2511.16139.
- Srivastava, P.; Singh, H.; Madhavan, R.; Patil, G.; Addepalli, S.; Suggala, A.; Aravamudhan, R.; Sharma, S.; Laha, A.; Raghuveer, A.; et al. Robust Reward Modeling via Causal Rubrics. In Proceedings of the Fourteenth International Conference on Learning Representations, 2026.
- Wang, T.; Xiong, C. AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning. ArXiv 2025, abs/2506.15651.
- Feng, X.; Li, Y.; Wan, Z.; Gao, Z.; Yuan, J.; Chen, D.; Qiao, C. RubricRL: Simple Generalizable Rewards for Text-to-Image Generation. ArXiv 2025, abs/2511.20651.
- Peng, H.; Qi, Y.; Wang, X.; Xu, B.; Hou, L.; Li, J. VerIF: Verification Engineering for Reinforcement Learning in Instruction Following. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing; Christodoulopoulos, C.; Chakraborty, T.; Rose, C.; Peng, V., Eds., Suzhou, China, 2025; pp. 30324–30339.
- Bai, K.T.Y.; Bao, Y.; Chen, G.; Chen, J.; Chen, N.; Chen, R.; Chen, Y.; Chen, Y.; Chen, Y.; Chen, Z.; et al. Kimi K2: Open Agentic Intelligence. ArXiv 2025, abs/2507.20534.
- Fan, Z.; Chen, R.; Hu, T.; Peng, R.; Huang, Z.; Xu, H.; Chen, Y.; Wu, J.; Zhao, J.; Liu, Z. Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation. arXiv preprint arXiv:2604.00536 2026.
- Wang, Z.; Wang, X.; Lee, S.; Xu, X. ARISE: Agentic Rubric-Guided Iterative Survey Engine for Automated Scholarly Paper Generation. ArXiv 2025, abs/2511.17689.
- Xu, T.; Zheng, Y.; Lu, P.; Ye, L.; Wu, Y.; Zhang, Z.; Yu, Y.; Ma, C.; Zhu, J.; Liu, P.; et al. Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks. arXiv preprint arXiv:2604.02795 2026.
- Li, X.; Zhang, F.; Diao, H.; Wang, Y.; Wang, X.; Duan, L. DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception. In Proceedings of the Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
- Ye, Z.; Yue, Y.; Wang, H.; Han, X.; Jiang, J.; Wei, C.; Fan, L.; Liang, J.; Zhang, S.; Li, J.; et al. Self-Rewarding Rubric-Based Reinforcement Learning for Open-Ended Reasoning. ArXiv 2025, abs/2509.25534.
- Zhou, Y.; Li, S.; Liu, S.; Fang, W.; Zhao, J.; Yang, J.; Lv, J.; Zhang, K.; Zhou, Y.; Lu, H.; et al. Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning. ArXiv 2025, abs/2508.16949.
- Wan, Y.; Fang, T.; Li, Z.; Huo, Y.; Wang, W.; Mi, H.; Yu, D.; Lyu, M.R. Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification. arXiv preprint arXiv:2601.15808 2026.
- Zhang, Q.; Zhou, J.; Wang, Y.; Lyu, F.; Ming, Y.; Xu, C.; Sun, Q.; Zheng, K.; Kang, P.; Liu, X.; et al. RubricBench: Aligning Model-Generated Rubrics with Human Standards. arXiv preprint arXiv:2603.01562 2026.
- Pan, T.; Lin, X.; Yang, W.; He, Q.; Chen, S.; Qi, L.; Xu, W.; Feng, H.; Xu, B.; Xiao, Y. RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following. arXiv preprint arXiv:2603.25133 2026.
- Starace, G.; Jaffe, O.; Sherburn, D.; Aung, J.; Chan, J.S.; Maksin, L.; Dias, R.; Mays, E.; Kinsella, B.; Thompson, W.; et al. PaperBench: Evaluating AI’s Ability to Replicate AI Research. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
- Wang, Z.; Jung, J.; Lu, X.; Diao, S.; Evans, E.; Zeng, J.; Molchanov, P.; Choi, Y.; Kautz, J.; Dong, Y. ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge. arXiv preprint arXiv:2510.18941 2025.
- Li, J.; Sun, S.; Yuan, W.; Fan, R.Z.; Zhao, H.; Liu, P. Generative Judge for Evaluating Alignment. In Proceedings of the Twelfth International Conference on Learning Representations, 2024.
- Lambert, N.; Pyatkin, V.; Morrison, J.; Miranda, L.; Lin, B.Y.; Chandu, K.; Dziri, N.; Kumar, S.; Zick, T.; Choi, Y.; et al. RewardBench: Evaluating Reward Models for Language Modeling. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025; Chiruzzo, L.; Ritter, A.; Wang, L., Eds., Albuquerque, New Mexico, 2025; pp. 1755–1797.
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds. Curran Associates, Inc., 2023, Vol. 36, pp. 46595–46623.
- Liu, Y.; Yao, Z.; Min, R.; Cao, Y.; Hou, L.; Li, J. RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Tan, S.; Zhuang, S.; Montgomery, K.; Tang, W.Y.; Cuadron, A.; Wang, C.; Popa, R.; Stoica, I. JudgeBench: A Benchmark for Evaluating LLM-Based Judges. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Gureja, S.; Miranda, L.J.V.; Islam, S.B.; Maheshwary, R.; Sharma, D.; Winata, G.T.; Lambert, N.; Ruder, S.; Hooker, S.; Fadaee, M. M-RewardBench: Evaluating Reward Models in Multilingual Settings. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 43–58.
- Jin, Z.; Yuan, H.; Men, T.; Cao, P.; Chen, Y.; Xu, J.; Li, H.; Jiang, X.; Liu, K.; Zhao, J. RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025; Che, W.; Nabende, J.; Shutova, E.; Pilehvar, M.T., Eds., Vienna, Austria, 2025; pp. 17061–17090.
- Kim, S.; Kang, D.; Kwon, T.; Chae, H.; Won, J.; Lee, D.; Yeo, J. Evaluating robustness of reward models for mathematical reasoning. arXiv preprint arXiv:2410.01729 2024.
- Malik, S.; Pyatkin, V.; Land, S.; Morrison, J.; Smith, N.A.; Hajishirzi, H.; Lambert, N. Rewardbench 2: Advancing reward model evaluation. arXiv preprint arXiv:2506.01937 2025.
- Zhou, E.; Zheng, G.; Wang, B.; Xi, Z.; Dou, S.; Bao, R.; Shen, W.; Xiong, L.; Fan, J.; Mou, Y.; et al. Rmb: Comprehensively benchmarking reward models in llm alignment. arXiv preprint arXiv:2410.09893 2024.
- Frick, E.; Li, T.; Chen, C.; Chiang, W.L.; Angelopoulos, A.N.; Jiao, J.; Zhu, B.; Gonzalez, J.E.; Stoica, I. How to Evaluate Reward Models for RLHF. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Li, L.; Wei, Y.; Xie, Z.; Yang, X.; Song, Y.; Wang, P.; An, C.; Liu, T.; Li, S.; Lin, B.Y.; et al. VL-RewardBench: a challenging benchmark for vision-language generative reward models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 24657–24668.
- Yasunaga, M.; Zettlemoyer, L.; Ghazvininejad, M. Multimodal rewardbench: Holistic evaluation of reward models for vision language models. arXiv preprint arXiv:2502.14191 2025.
- Chen, Z.; Wen, Z.; Du, Y.; Zhou, Y.; Cui, C.; Han, S.; Weng, Z.; Wang, C.; Tong, Z.; Huang, L.; et al. MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
- Wang, B.; Liu, Y.; Liu, Y.; Tang, T.; Wang, S.; Gao, C.; Zheng, C.; Zhang, Y.; Yu, L.; Liu, S.; et al. Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models. arXiv preprint arXiv:2602.04649 2026.
- Mahmoud, A.; Rezaei, M.; Wang, Z.; Gunjal, A.; Liu, B.; He, Y. Reward Hacking in Rubric-Based Reinforcement Learning. arXiv preprint arXiv:2605.12474 2026.
- Vygotsky, L.S. Mind in society: The development of higher psychological processes; Vol. 86, Harvard university press, 1978.
- Wood, D.; Bruner, J.S.; Ross, G. The role of tutoring in problem solving. Journal of child psychology and psychiatry 1976, 17, 89–100.
- Andrade, H.G. Teaching with rubrics: The good, the bad, and the ugly. College teaching 2005, 53, 27–31.
- Zhang, W.; Zhang, K.; Qi, J.; Lai, B.; Huang, J. Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs. arXiv preprint arXiv:2603.20046 2026.
- Yu, J.; Xu, Z.; Wang, J.; Yang, Y. Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance. arXiv preprint arXiv:2603.07461 2026.
- Zhao, W.; Ren, X.; Hessel, J.; Cardie, C.; Choi, Y.; Deng, Y. WildChat: 1M ChatGPT Interaction Logs in the Wild. In Proceedings of the Twelfth International Conference on Learning Representations, 2024.
- Cui, G.; Yuan, L.; Ding, N.; Yao, G.; He, B.; Zhu, W.; Ni, Y.; Xie, G.; Xie, R.; Lin, Y.; et al. ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback. In Proceedings of the Forty-first International Conference on Machine Learning, 2024.
- Liu, C.Y.; Zeng, L.; Liu, J.; Yan, R.; He, J.; Wang, C.; Yan, S.; Liu, Y.; Zhou, Y. Skywork-reward: Bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451 2024.
- Wang, Z.; Zeng, J.; Delalleau, O.; Shin, H.C.; Soares, F.; Bukharin, A.; Evans, E.; Dong, Y.; Kuchaiev, O. HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.
- Xu, Z.; Jiang, F.; Niu, L.; Deng, Y.; Poovendran, R.; Choi, Y.; Lin, B.Y. Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Fan, R.Z.; Wang, Z.; Liu, P. MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning. arXiv preprint arXiv:2507.16812 2025.
- Bai, Y.; Zhang, J.; Lv, X.; Zheng, L.; Zhu, S.; Hou, L.; Dong, Y.; Tang, J.; Li, J. LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025.
- Wu, Y.; Bai, Y.; Hu, Z.; Lee, R.K.W.; Li, J. Longwriter-zero: Mastering ultra-long text generation via reinforcement learning. arXiv preprint arXiv:2506.18841 2025.
- Bai, Y.; Lv, X.; Zhang, J.; He, Y.; Qi, J.; Hou, L.; Tang, J.; Dong, Y.; Li, J. LongAlign: A Recipe for Long Context Alignment of Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024; Al-Onaizan, Y.; Bansal, M.; Chen, Y.N., Eds., Miami, Florida, USA, 2024; pp. 1376–1395.
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Li, T.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Li, Z.; Lin, Z.; Xing, E.; et al. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset. In Proceedings of the Twelfth International Conference on Learning Representations, 2024.
- Chen, J.; Cai, Z.; Ji, K.; Wang, X.; Liu, W.; Wang, R.; Hou, J.; Wang, B. HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs, 2024, [arXiv:cs.CL/2412.18925].
- Yu, T.; Zhang, H.; Li, Q.; Xu, Q.; Yao, Y.; Chen, D.; Lu, X.; Cui, G.; Dang, Y.; He, T.; et al. RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 19985–19995.
- Sun, Z.; Shen, S.; Cao, S.; Liu, H.; Li, C.; Shen, Y.; Gan, C.; Gui, L.; Wang, Y.X.; Yang, Y.; et al. Aligning Large Multimodal Models with Factually Augmented RLHF. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024; Ku, L.W.; Martins, A.; Srikumar, V., Eds., Bangkok, Thailand, 2024; pp. 13088–13110.
- Xiong, T.; Wang, X.; Guo, D.; Ye, Q.; Fan, H.; Gu, Q.; Huang, H.; Li, C. LLaVA-Critic: Learning to Evaluate Multimodal Models. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 13618–13628.
- Zhang, Y.; Yu, T.; Tian, H.; Fu, C.; Li, P.; Zeng, J.; Xie, W.; Shi, Y.; Zhang, H.; Wu, J.; et al. MM-RLHF: The Next Step Forward in Multimodal LLM Alignment. In Proceedings of the Forty-second International Conference on Machine Learning, 2025.
- Deitke, M.; Clark, C.; Lee, S.; Tripathi, R.; Yang, Y.; Park, J.S.; Salehi, M.; Muennighoff, N.; Lo, K.; Soldaini, L.; et al. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 91–104.
- Wang, H.; Qu, C.; Huang, Z.; Chu, W.; Lin, F.; Chen, W. VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.



| Benchmark | Evaluation | Domain | Modality | Scale |
|---|---|---|---|---|
| Rubric Benchmarks | | | | |
| RubricBench [70] | Rubric & Pairwise | Multi-domain | Language | 1.1k |
| RubricEval [71] | Rubric | Multi-domain | Language | 3.5k |
| HealthBench [21] | Rubric | Medical | Language | 5k |
| PaperBench [72] | Rubric | Science | Language | 8.3k |
| ProfBench [73] | Rubric | Science & Business | Language | 40 |
| Auto-J [74] | Rubric | Multi-domain | Language | 1.3k |
| Reward Benchmarks | | | | |
| RewardBench [75] | Pairwise | Multi-domain | Language | 3.0k |
| MT-Bench [76] | Pairwise | Multi-domain | Language | 80 |
| RM-Bench [77] | Pairwise | Multi-domain | Language | 1.3k |
| JudgeBench [78] | Pairwise | Multi-domain | Language | 350 |
| M-RewardBench [79] | Pairwise | Multilingual | Language | 2.9k |
| RAG-RewardBench [80] | Pairwise | Multi-domain | Language | 1.5k |
| RewardMath [81] | BoN | MATH | Language | 483 |
| RewardBench 2 [82] | BoN | Multi-domain | Language | 1.9k |
| RMB [83] | Pairwise & BoN | Multi-domain | Language | 18.0k |
| PPE [84] | Pairwise & BoN | Multi-domain | Language | 16.0k |
| VL-RewardBench [85] | Pairwise | Multi-domain | Multi-modal | 1.3k |
| Multimodal RewardBench [86] | Pairwise | Multi-domain | Multi-modal | 5.2k |
| MJ-Bench [87] | Pairwise | Multi-domain | Multi-modal | 3.0k |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).