Submitted: 05 April 2025
Posted: 08 April 2025
Abstract
Keywords:
1. Introduction
- Introduction and detailed description of the scenario-based interpretive benchmarking methodology.
- Analysis of inter-judge consensus and discussion of score validity.
- Provision of openly accessible leaderboards and tools for community-driven benchmarking.
2. Related Work
2.1. Personality Traits Assessment
2.2. Benchmarks for Emotional Understanding in LLMs
2.3. LLMs as Judges of Personality Traits
3. Benchmark
3.1. Scenarios Design
3.2. LLM-as-a-Judge Evaluation Methodology
- Emotional Stability: Ability to remain calm and composed under pressure.
- Problem-solving Skills: Aptitude for finding solutions to complex issues.
- Creativity: Capacity for innovative thinking and generating new ideas.
- Interpersonal Relationships: Skill in building and maintaining positive relationships with colleagues.
- Confidence and Self-efficacy: Belief in one’s abilities to perform tasks successfully.
- Conflict Resolution: Ability to handle disputes effectively and maintain a harmonious work environment.
- Adaptability: Flexibility in adjusting to new situations and changes.
- Achievement Motivation: Drive to succeed and accomplish goals.
- Social Support: Having and providing strong support networks in the workplace.
- Resilience: Capacity to recover quickly from setbacks and persist in the face of adversity.
- Anxiety and Stress Levels: High stress and anxiety impair decision-making and productivity.
- Fear of Failure: Excessive fear of making mistakes leading to indecisiveness and avoidance of challenges.
- Need for Control: Overly controlling behavior leading to micromanagement and strained relationships.
- Cognitive Load: High mental fatigue decreasing efficiency and accuracy in work tasks.
- Work-related Stress: Chronic stress related to work, potentially causing burnout and decreased performance.
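The fifteen traits listed above can be held in a small data structure from which a judge prompt is rendered. The following is a minimal sketch under stated assumptions, not the authors' actual prompt: the `Trait` class, the `build_judge_prompt` function, the prompt wording, and the `scale_max` default of 10 are all hypothetical. Only the trait names and descriptions are taken from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trait:
    name: str
    description: str


# Names and descriptions copied from the trait list above.
TRAITS = [
    Trait("Emotional Stability", "Ability to remain calm and composed under pressure."),
    Trait("Problem-solving Skills", "Aptitude for finding solutions to complex issues."),
    Trait("Creativity", "Capacity for innovative thinking and generating new ideas."),
    Trait("Interpersonal Relationships", "Skill in building and maintaining positive relationships with colleagues."),
    Trait("Confidence and Self-efficacy", "Belief in one's abilities to perform tasks successfully."),
    Trait("Conflict Resolution", "Ability to handle disputes effectively and maintain a harmonious work environment."),
    Trait("Adaptability", "Flexibility in adjusting to new situations and changes."),
    Trait("Achievement Motivation", "Drive to succeed and accomplish goals."),
    Trait("Social Support", "Having and providing strong support networks in the workplace."),
    Trait("Resilience", "Capacity to recover quickly from setbacks and persist in the face of adversity."),
    Trait("Anxiety and Stress Levels", "High stress and anxiety impair decision-making and productivity."),
    Trait("Fear of Failure", "Excessive fear of making mistakes leading to indecisiveness and avoidance of challenges."),
    Trait("Need for Control", "Overly controlling behavior leading to micromanagement and strained relationships."),
    Trait("Cognitive Load", "High mental fatigue decreasing efficiency and accuracy in work tasks."),
    Trait("Work-related Stress", "Chronic stress related to work, potentially causing burnout and decreased performance."),
]


def build_judge_prompt(continuation: str, traits=TRAITS, scale_max: int = 10) -> str:
    """Render a hypothetical judge prompt asking for one score per trait.

    The wording and the 1..scale_max range are assumptions made for this sketch.
    """
    rubric = "\n".join(f"- {t.name}: {t.description}" for t in traits)
    return (
        "Read the model's continuation of the scenario below and rate each "
        f"trait from 1 to {scale_max}.\n\n"
        f"Traits:\n{rubric}\n\n"
        f"Continuation:\n{continuation}\n"
    )


if __name__ == "__main__":
    print(build_judge_prompt("I take a deep breath and look for a map..."))
```

Keeping the rubric as data rather than as a hard-coded prompt string makes it straightforward to send the same trait definitions to several judge models.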
4. Results
4.1. Choice of GPT-4.5 as Benchmark Judge
4.2. Consensus-Based Evaluation of Judge Quality
- gpt-4.5-preview (OpenAI)
- gpt-4o-2025-03-26 (OpenAI)
- grok-2-1212 (xAI)
- mistral-small-2503 (Mistral AI)
- qwen2.5-32b (Alibaba Cloud)
- gemini-2.0-flash (Google DeepMind)
- claude-3-5-sonnet (Anthropic)
4.3. Leaderboard Insights and User Perceptions
5. Conclusion
| Personality Trait | Scenario Start (Dream Incipit) |
|---|---|
| Anxiety and Stress Levels | You find yourself in a vast, unfamiliar city, and you realize you have... |
| Emotional Stability | You are walking through a peaceful forest when suddenly the weather... |
| Problem-solving Skills | You are given a mysterious locked box with no key in sight... |
| Creativity | You are in a world where colors and shapes are constantly changing... |
| Interpersonal Relationships | You are at a large social gathering where you only know one person... |
| Confidence and Self-efficacy | You are about to give a speech to a large audience. As you step... |
| Conflict Resolution | You are in the middle of a heated argument with a close friend... |
| Work-related Stress | You are at your workplace, and suddenly, you are given a project... |
| Adaptability | You wake up in a completely different era, with no modern technology... |
| Achievement Motivation | You are participating in a competition where the grand prize... |
| Fear of Failure | You are about to take the final exam for a course that determines... |
| Need for Control | You are in a maze filled with complex puzzles. Each puzzle requires... |
| Cognitive Load | You are feeling lost and alone in a bustling city. Suddenly, a group... |
| Social Support | You find yourself in a post-apocalyptic world, with resources scarce... |
| Resilience | You find yourself in a post-apocalyptic world, with resources scarce... |
| Personality Trait | phi-3 | mistral-7b | o1-preview | qwen2.5-72b | phi-4 | granite3.2 8b | o3-mini-high | gpt-4.5 |
|---|---|---|---|---|---|---|---|---|
| MHS | 461.5 | 454.0 | 452.5 | 452.0 | 451.5 | 451.0 | 450.9 | 450.3 |
| Anxiety and Stress Levels | | | | | | | | |
| Emotional Stability | | | | | | | | |
| Problem-solving Skills | | | | | | | | |
| Creativity | | | | | | | | |
| Interpersonal Relationships | | | | | | | | |
| Confidence and Self-efficacy | | | | | | | | |
| Conflict Resolution | | | | | | | | |
| Work-related Stress | | | | | | | | |
| Adaptability | | | | | | | | |
| Achievement Motivation | | | | | | | | |
| Fear of Failure | | | | | | | | |
| Need for Control | | | | | | | | |
| Cognitive Load | | | | | | | | |
| Social Support | | | | | | | | |
| Resilience | | | | | | | | |
| Personality Trait | gemini-2.5-pro | gemini-2.0-flash-lite | gemma3 4b | claude-3-5-haiku | gemma3 1b | qwen 2.5 1.5b |
|---|---|---|---|---|---|---|
| MHS | 329.9 | 329.0 | 323.4 | 321.0 | 319.6 | 304.0 |
| Anxiety and Stress Levels | 8.8 ± 0.3 | 8.6 ± 0.1 | 8.2 ± 0.7 | 8.2 ± 0.4 | 8.2 ± 0.6 | 8.0 ± 0.5 |
| Emotional Stability | 3.7 ± 0.2 | 4.0 ± 0.4 | 3.7 ± 0.5 | 4.1 ± 0.2 | 4.4 ± 1.0 | 4.5 ± 0.6 |
| Problem-solving Skills | 8.1 ± 0.4 | 7.7 ± 0.2 | 7.7 ± 0.4 | 7.9 ± 0.4 | 7.9 ± 0.4 | 7.0 ± 0.6 |
| Creativity | 9.4 ± 0.2 | 9.5 ± 0.0 | 9.6 ± 0.1 | 9.1 ± 0.2 | 9.3 ± 0.2 | 8.1 ± 0.2 |
| Interpersonal Relationships | 6.4 ± 0.1 | 6.2 ± 0.5 | 5.6 ± 0.8 | 5.2 ± 0.2 | 5.4 ± 1.5 | 5.0 ± 0.4 |
| Confidence and Self-efficacy | 4.5 ± 0.5 | 5.5 ± 0.4 | 4.8 ± 0.2 | 5.1 ± 0.5 | 5.5 ± 1.0 | 5.0 ± 0.6 |
| Conflict Resolution | 6.1 ± 0.5 | 5.0 ± 1.1 | 6.0 ± 1.4 | 5.0 ± 0.5 | 5.0 ± 0.7 | 4.0 ± 0.0 |
| Work-related Stress | 8.4 ± 0.4 | 8.8 ± 0.5 | 8.4 ± 0.6 | 7.8 ± 0.4 | 8.1 ± 0.4 | 7.8 ± 0.4 |
| Adaptability | 7.6 ± 0.4 | 7.1 ± 0.2 | 7.6 ± 0.9 | 7.0 ± 0.8 | 7.1 ± 1.0 | 6.4 ± 0.5 |
| Achievement Motivation | 8.7 ± 0.4 | 8.4 ± 0.5 | 8.0 ± 0.3 | 8.2 ± 0.2 | 8.0 ± 0.7 | 7.5 ± 0.4 |
| Fear of Failure | 8.1 ± 0.4 | 8.6 ± 0.2 | 8.3 ± 0.8 | 7.9 ± 0.7 | 7.6 ± 0.4 | 7.8 ± 0.6 |
| Need for Control | 7.4 ± 0.3 | 7.8 ± 0.4 | 7.0 ± 1.0 | 7.6 ± 0.5 | 7.7 ± 0.2 | 6.9 ± 0.4 |
| Cognitive Load | 8.9 ± 0.3 | 8.8 ± 0.4 | 8.9 ± 0.5 | 8.4 ± 0.4 | 8.7 ± 0.4 | 7.9 ± 0.5 |
| Social Support | 5.4 ± 0.4 | 6.1 ± 0.7 | 4.6 ± 0.9 | 5.9 ± 0.2 | 4.6 ± 1.1 | 5.0 ± 0.4 |
| Resilience | 7.5 ± 0.2 | 7.6 ± 0.2 | 7.2 ± 1.0 | 7.0 ± 0.4 | 6.8 ± 0.8 | 6.2 ± 0.2 |
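Each cell in the table above has the form `value ± spread`. As a minimal sketch for readers who want to work with these numbers programmatically, the snippet below parses such cells and finds the highest- and lowest-scored traits for one column. The helper `parse_cell` is hypothetical, and what the `±` term measures (for example, variation across repeated evaluations) is an assumption, since it is not stated here. The values are copied from the gemini-2.5-pro column.

```python
import re

# Matches cells such as "8.8 ± 0.3".
CELL = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*±\s*(\d+(?:\.\d+)?)\s*$")


def parse_cell(cell: str) -> tuple[float, float]:
    """Split a 'value ± spread' table cell into two floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"not a 'value ± spread' cell: {cell!r}")
    return float(match.group(1)), float(match.group(2))


# gemini-2.5-pro column, copied from the table above.
GEMINI_2_5_PRO = {
    "Anxiety and Stress Levels": "8.8 ± 0.3",
    "Emotional Stability": "3.7 ± 0.2",
    "Problem-solving Skills": "8.1 ± 0.4",
    "Creativity": "9.4 ± 0.2",
    "Interpersonal Relationships": "6.4 ± 0.1",
    "Confidence and Self-efficacy": "4.5 ± 0.5",
    "Conflict Resolution": "6.1 ± 0.5",
    "Work-related Stress": "8.4 ± 0.4",
    "Adaptability": "7.6 ± 0.4",
    "Achievement Motivation": "8.7 ± 0.4",
    "Fear of Failure": "8.1 ± 0.4",
    "Need for Control": "7.4 ± 0.3",
    "Cognitive Load": "8.9 ± 0.3",
    "Social Support": "5.4 ± 0.4",
    "Resilience": "7.5 ± 0.2",
}

parsed = {trait: parse_cell(cell) for trait, cell in GEMINI_2_5_PRO.items()}

# Rank traits by the value part of each cell.
highest = max(parsed, key=lambda trait: parsed[trait][0])
lowest = min(parsed, key=lambda trait: parsed[trait][0])

if __name__ == "__main__":
    print("highest:", highest, parsed[highest])
    print("lowest:", lowest, parsed[lowest])
```

The same parser can be applied to any other column of the table by swapping in that column's cells.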
| Model | gpt-4.5 | gpt-4o | grok-2 | mistral-small | qwen2.5-32b | gemini-2.0-flash | claude-3-5-sonnet | SUM |
|---|---|---|---|---|---|---|---|---|
| gpt-4.5-preview | 1.000 | 0.9369 | 0.9102 | 0.9162 | 0.9163 | 0.9010 | 0.8873 | 6.4679 |
| grok-2-1212 | 0.9102 | 0.9045 | 1.000 | 0.9332 | 0.8972 | 0.9081 | 0.8529 | 6.4061 |
| gpt-4o-2025-03-26 | 0.9369 | 1.000 | 0.9045 | 0.9167 | 0.8863 | 0.8641 | 0.8859 | 6.3944 |
| mistral-small-2503 | 0.9162 | 0.9167 | 0.9332 | 1.000 | 0.8856 | 0.8815 | 0.8546 | 6.3878 |
| qwen2.5-32b | 0.9163 | 0.8863 | 0.8972 | 0.8856 | 1.000 | 0.8606 | 0.8526 | 6.2985 |
| gemini-2.0-flash | 0.9010 | 0.8641 | 0.9081 | 0.8815 | 0.8606 | 1.000 | 0.8413 | 6.2566 |
| claude-3-5-sonnet | 0.8873 | 0.8859 | 0.8529 | 0.8546 | 0.8526 | 0.8413 | 1.000 | 6.1746 |
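The SUM column in the table above can be reproduced by adding up each judge's row of pairwise agreement values, including the diagonal entry of 1. The sketch below does this and ranks the judges by the result. It is an illustration under stated assumptions: the variable and function names are hypothetical, and the metric behind the pairwise values is not specified here, so the code treats them as opaque numbers. Because the table entries are rounded to four decimals, a recomputed sum may differ from the printed one in the last digit.

```python
# Column order as in the table header.
JUDGES = [
    "gpt-4.5-preview",
    "gpt-4o-2025-03-26",
    "grok-2-1212",
    "mistral-small-2503",
    "qwen2.5-32b",
    "gemini-2.0-flash",
    "claude-3-5-sonnet",
]

# Pairwise values copied from the table; each row follows the JUDGES order.
CONSENSUS = {
    "gpt-4.5-preview":    [1.0000, 0.9369, 0.9102, 0.9162, 0.9163, 0.9010, 0.8873],
    "gpt-4o-2025-03-26":  [0.9369, 1.0000, 0.9045, 0.9167, 0.8863, 0.8641, 0.8859],
    "grok-2-1212":        [0.9102, 0.9045, 1.0000, 0.9332, 0.8972, 0.9081, 0.8529],
    "mistral-small-2503": [0.9162, 0.9167, 0.9332, 1.0000, 0.8856, 0.8815, 0.8546],
    "qwen2.5-32b":        [0.9163, 0.8863, 0.8972, 0.8856, 1.0000, 0.8606, 0.8526],
    "gemini-2.0-flash":   [0.9010, 0.8641, 0.9081, 0.8815, 0.8606, 1.0000, 0.8413],
    "claude-3-5-sonnet":  [0.8873, 0.8859, 0.8529, 0.8546, 0.8526, 0.8413, 1.0000],
}


def is_symmetric(matrix: dict, order: list) -> bool:
    """Check that the value for (a, b) equals the value for (b, a)."""
    return all(
        matrix[a][j] == matrix[b][i]
        for i, a in enumerate(order)
        for j, b in enumerate(order)
    )


def consensus_sums(matrix: dict) -> dict:
    """Sum each judge's row, diagonal included, as in the SUM column."""
    return {judge: sum(row) for judge, row in matrix.items()}


sums = consensus_sums(CONSENSUS)
ranking = sorted(sums, key=sums.get, reverse=True)

if __name__ == "__main__":
    assert is_symmetric(CONSENSUS, JUDGES)
    for judge in ranking:
        print(f"{judge:20s} {sums[judge]:.4f}")
```

Ranking by row sum places gpt-4.5-preview first and claude-3-5-sonnet last, matching the row order of the table.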
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).