Submitted:
22 December 2025
Posted:
24 December 2025
Abstract
Keywords:
1. Introduction
- Establishing a unified theoretical framework. Addressing the fragmentation of MWM theory, this paper systematically clarifies the essential differences (the state space includes psychological attributes, and the observation space includes introspective signals) and connections (the physical world model serves as a “real-world anchor,” while the MWM acts as a “social extension”) between the MWM and the PWM. It further achieves a mathematical unification of the two via Predictive Coding [23,24], filling a theoretical gap in the field.
- Providing technical selection references. Following the evolutionary logic of “static text→high-order text→multimodal dynamic interaction”, this paper organizes 26 ToM benchmarks. By comparing 19 typical methods, it analyzes the technical characteristics of the Prompting and Model-Based paradigms. It also identifies two core integration pathways: “neural generation + symbolic verification” and “symbolic guidance + neural fine-tuning.”
- Identifying directions for practical deployment. Focusing on key technical bottlenecks such as the dynamic updating of mental states, multimodal information alignment, and the robustness of high-order reasoning, this paper also discusses ethical risks, including excessive reliance on anthropomorphism, potential privacy leakage from multimodal signals, and model biases caused by pre-training data. Several actionable future directions are proposed, providing guidance for practical research on the social intelligence of embodied AI agents.
2. Analysis of Difference and Unity between PWM and MWM

2.1. Difference Analysis

2.2. Unity Analysis
2.3. The Necessity of Constructing a MWM
3. Element Representation of MWM
- Structuralism [58] attempts to decompose subjective experience into its “atomic units” through introspection (with sensations and affections as the core representational elements), providing an initial framework for the structural analysis of the mind. However, owing to the subjectivity of introspection and the difficulty of standardizing experiential units, its methodology was gradually superseded by Folk Psychology [57]. Folk Psychology adopts “external logical propositions” as an approximate representation of the mind (with belief, desire, and intention as core elements), circumventing the limitations of introspection and directly supporting practices such as BDI agent design and human motivation analysis. Nevertheless, its presupposition of the mind’s “rational propositional nature” makes it difficult to accommodate the non-rational behaviors described in neuroscience, which involve no “belief” entities, thereby exposing the inherent flaws of the strong representational paradigm.
- Evolutionary Psychology [59] and Psychoanalysis [55] shift the research focus from “what the mind is” to “why the mind exists”. Evolutionary Psychology represents the mind as “adaptive modules” shaped by natural selection (with evolutionary adaptive modules and survival/reproductive motivations as core elements), offering functionalist explanations for criminal psychology and social behavior analysis. Yet the link between ancient environments and modern minds lacks empirical validation. Psychoanalysis [55], on the other hand, defines the mind as a closed “energy system” (with ego and libido as core elements), focusing on energy dynamics such as repression and catharsis, and serving applications like psychological counseling and advertising-based psychological intervention. However, the non-embodiment of the “unconscious mind” has exposed it to persistent criticism regarding its “inability to verify specific operational mechanisms”.
- Dimensional Emotion Theory [60] breaks through the constraints of discrete units, representing mental states as continuous coordinate points such as valence and arousal, which enables the quantitative characterization of emotional states. This approach is well-suited for scenarios including public opinion analysis and stress monitoring via wearable devices. Its limitation, however, lies in the inability to distinguish between complex emotions that are “close in coordinate values but fundamentally different in nature” (e.g., the overlap of “high arousal/negative valence” between anger and fear), thus revealing the inherent flaw of “dimensional oversimplification” in the strong representational paradigm.
- Cognitive Architectures [61] analogize the mind to an “information processing system”, with knowledge structures, declarative chunks, and production rules as the core representational elements (corresponding to “memory data” and “CPU rules”). This approach has successfully supported cognitive modeling for complex tasks such as human-computer interaction and driving. However, due to the “symbol grounding problem” (the inability to explain the correspondence between symbols and the real world), it struggles to handle ambiguous and creative mental activities, which has become a core bottleneck in symbolic cognitive research.
- Connectionism [62,63] abandons the presupposition of “monolithic concepts” and represents mental meaning as a “decentralized network pattern” (with weights and activation vectors as core elements). The logic of its distributed activation vectors has directly empowered the practice of deep learning (e.g., LLMs) and pattern recognition. Nevertheless, because its decision-making process relies on “subsymbolic network activation”, it has long been plagued by the problem of “extremely poor interpretability”, making it difficult to trace the formation path of a single decision.
- Embodied cognition and the predictive coding/free energy framework have further expanded the boundaries of representational carriers. Among them, Embodied Cognition [30,64] posits that the mind is not confined to the brain, but is “a product of the interaction between the body and the environment” (with sensorimotor schemas as the core element). It has provided support for scenarios such as virtual reality interaction design and rehabilitation training, yet it fails to explain the formation of disembodied cognition such as mathematics and abstract concepts. The Predictive Coding/Free Energy Principle [23,41] defines the mind as a “prediction machine”, where the core representational elements are not “input information”, but rather “prediction errors” and “priors”. Although it offers a new perspective for computational psychiatry (e.g., the interpretation of prediction errors in schizophrenia) and active inference AI, its excessive theoretical generalization—being able to “explain everything”—results in a lack of specific details regarding prediction regulation.
- Symbolic belief takes discrete symbolic texts and structured propositions as the core representational carriers of mental states. It downplays the architectural correlations or uncertainty quantification of elements, and achieves reasoning solely through the combination of symbolic units. This paradigm represents the direct technical implementation of the “strong representational schools” from an epistemological perspective (e.g., the propositional mind in Folk Psychology and the experience decomposition in Structuralism). The propositional descriptions of “belief and desire” in Folk Psychology are transformed into textual symbols such as “Agent A believes X” in benchmarks including BigToM [53] and ExploreToM [65]. The “experience atoms” in Structuralism, on the other hand, correspond to the structured knowledge graph symbols of “entity-attribute-state” in COKE [68].
- Probabilistic belief employs probability distributions as the carriers of mental states, realizing dynamic reasoning by quantifying uncertainty (e.g., “there is an 85% probability that the object is on the table”). Its logic differs fundamentally from that of discrete symbolic and architecture-based representations. This paradigm corresponds to the technical practice of the “Predictive Coding/Free Energy Principle” from an epistemological perspective. The “discrepancy between prediction and reality” in predictive coding is translated into “neuro-guided online probabilistic assistance” in NOPA [69], which updates the probability distribution of demand beliefs in real time based on user behavior. (A minimal sketch contrasting these representation carriers follows this list.)
- Distributed activation vectors use the hidden-layer activations of neural networks as the carriers of mental states, indirectly characterizing the mind through features such as vector clustering and similarity (e.g., activation clusters in LLM layers correspond to the beliefs of different agents), with no explicit discrete symbols. For instance, Zhu et al. [70] used linear probing to identify decodable belief representations in the attention-head activations of models such as Mistral-7B-Instruct and DeepSeek-LLM-7B-Chat. They then manipulated these identified neural representations directly during the reasoning process, guiding activations along specific directions to test whether the internal representations functionally affect the model’s ToM reasoning, thereby establishing a causal relationship between distributed activation vectors and social reasoning. In addition, the identified directions of belief representations were highly consistent across different social reasoning tasks, with the correlation coefficient of accuracy rates reaching 0.85–0.90, indicating that these neural representations may possess cross-task generalization capabilities.
- The BDI architecture organizes mental elements around the Belief-Desire-Intention (BDI) triad, and the correlations between elements directly serve the collaborative decision-making of multi-agent systems (e.g., deriving execution intentions from collaborative goals). The representation type encompassing belief, intention, goal, and emotion proposed by Fung et al. [5] in 2025 does not fall entirely under the BDI architecture, but can be categorized as an “extended BDI architecture”. Mozikov et al. [74] pointed out that existing evaluations of “safety and human alignment” for LLMs mostly rely on pure natural language benchmarks, which have significant limitations. Human decision-making is often driven by emotions (e.g., anger and joy can alter choices), yet previous studies have not systematically explored the impact of emotions on the decision-making logic and ethical tendencies of LLMs. Even aligned LLMs may exhibit irrational behaviors due to emotional biases (e.g., deception and a sharp drop in cooperation rates), which pose risks for scenarios requiring autonomous decision-making such as medical care and customer service.
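To make the first three representation carriers in Table 3 concrete, the following minimal sketch contrasts a symbolic proposition, a Bayesian-updated probability distribution, and a linear probe over a hidden activation vector. All names, numbers, and probe weights here are invented for exposition and are not taken from any cited implementation.

```python
# Illustrative sketch of three carriers of the same mental-state content.
import numpy as np

# 1) Symbolic belief: a discrete proposition, reasoned over by rule-based combination.
symbolic_belief = ("Anna", "believes", "apple_in_cabinet")

# 2) Probabilistic belief: a distribution over candidate object locations,
#    updated with Bayes' rule as new observations arrive.
prior = {"table": 0.85, "cabinet": 0.10, "fridge": 0.05}
likelihood = {"table": 0.2, "cabinet": 0.7, "fridge": 0.1}   # P(observation | location)
unnormalised = {loc: prior[loc] * likelihood[loc] for loc in prior}
z = sum(unnormalised.values())
posterior = {loc: p / z for loc, p in unnormalised.items()}

# 3) Distributed activation vector: a hidden-state vector whose belief content is
#    read out by a linear probe (weights w and bias b are placeholders, not trained values).
activation = np.random.randn(64)          # stand-in for an attention-head activation
w, b = np.random.randn(64), 0.0
believes_true = float(activation @ w + b) > 0

print(posterior, believes_true)
```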
4. Theory of Mind: from Static Representation to Dynamic Reasoning
4.1. Prompting Paradigm: Stimulating the Implicit ToM Capabilities of LLMs
- Utilizing memory streams to store all perceptual experiences, and dynamically retrieving the records of others’ behavior relevant to the current context through a three-dimensional weighted retrieval mechanism based on recency, importance, and relevance (a scoring sketch follows this list).
- Recursively synthesizing low-level observations into high-level reflections (e.g., “Sam may not know about the party”), thereby forming explicit inferences about others’ mental states.
- Generating socially coordinated behaviors (e.g., actively spreading information, extending invitations) based on these inferences.
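A minimal sketch of the weighted retrieval described in the first bullet above. The equal weights, the exponential decay constant, and the dictionary-based memory record are illustrative assumptions rather than the exact formulation used in the cited generative-agent work.

```python
import math

def retrieval_score(memory, query_embedding, now,
                    w_recency=1.0, w_importance=1.0, w_relevance=1.0, decay=0.995):
    """Score one memory record by recency, importance, and relevance.

    `memory` is assumed to be a dict with keys 'embedding' (list of floats),
    'importance' (0-10 score assigned at write time), and 'timestamp' (hours).
    """
    recency = decay ** (now - memory["timestamp"])            # newer -> closer to 1
    importance = memory["importance"] / 10.0                  # normalise to [0, 1]
    dot = sum(a * b for a, b in zip(memory["embedding"], query_embedding))
    na = math.sqrt(sum(a * a for a in memory["embedding"]))
    nb = math.sqrt(sum(b * b for b in query_embedding))
    relevance = dot / (na * nb + 1e-9)                        # cosine similarity
    return w_recency * recency + w_importance * importance + w_relevance * relevance

def retrieve(memories, query_embedding, now, k=3):
    """Return the top-k memories for the current context."""
    return sorted(memories, key=lambda m: retrieval_score(m, query_embedding, now),
                  reverse=True)[:k]
```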
- Two-shot CoT prompting (as shown in Figure 5(e)), which guides the model to mimic the reasoning pattern by presenting examples that incorporate intermediate reasoning steps.
- Step-by-step instructions, which explicitly require the model to decompose the reasoning process.
- In-context learning, which integrates the above two components to maximize reasoning quality.
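The sketch below assembles a prompt from the three components just listed. The demonstration stories, the wording, and the `build_prompt` helper are hypothetical placeholders, not the prompts used in the cited work.

```python
# Hypothetical prompt assembly for a two-shot CoT setup with step-by-step instructions.
COT_EXAMPLES = [
    {"story": "Mark puts his keys in the drawer and leaves. Lily moves them to the shelf.",
     "question": "Where will Mark look for his keys?",
     "reasoning": "Mark last saw the keys in the drawer and did not see Lily move them, "
                  "so his belief is outdated.",
     "answer": "In the drawer."},
    {"story": "Ben tells Ana the meeting moved to 3 pm. Ana was present for the change.",
     "question": "When does Ana think the meeting is?",
     "reasoning": "Ana directly observed the update, so her belief matches reality.",
     "answer": "At 3 pm."},
]

def build_prompt(story: str, question: str) -> str:
    parts = []
    for ex in COT_EXAMPLES:                                   # two-shot demonstrations
        parts.append(f"Story: {ex['story']}\nQuestion: {ex['question']}\n"
                     f"Reasoning: {ex['reasoning']}\nAnswer: {ex['answer']}\n")
    parts.append("Think step by step before answering.\n")    # step-by-step instruction
    parts.append(f"Story: {story}\nQuestion: {question}\nReasoning:")
    return "\n".join(parts)
```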
- The first stage is the Perspective-Taking Prompt, which guides the model to switch to the perspective of the target agent (e.g., “Now please act as Anna, recall where you placed the apple earlier, and note that you are unaware that John moved the apple”).
- The second stage is the Reasoning Prompt, which requires the model to answer ToM questions based on this perspective (e.g., “As Anna, where will you look for the apple first?”).
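A minimal sketch of this two-stage pipeline. `query_llm`, the prompt wording, and the function names are placeholders, assuming access to any chat-completion interface; they are not the original implementation.

```python
def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def perspective_then_reason(story: str, agent: str, question: str) -> str:
    # Stage 1: perspective-taking - keep only events the target agent could perceive.
    perspective_prompt = (
        f"Story: {story}\n"
        f"Rewrite the story keeping only the events that {agent} directly witnessed."
    )
    filtered_story = query_llm(perspective_prompt)

    # Stage 2: answer the ToM question from within that filtered perspective.
    reasoning_prompt = (
        f"You are {agent}. You know only the following:\n{filtered_story}\n"
        f"Question: {question}\nAnswer as {agent}."
    )
    return query_llm(reasoning_prompt)
```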
- It may be constrained by the inherent commonsense biases of large language models, which hinder the modeling of personalized beliefs that contradict commonsense knowledge.
- It requires manual annotation to evaluate the quality of BDI inference.
- It faces decision optimization issues where dialogues may terminate prematurely when confidence levels fail to reach a threshold value.
- First, it formalizes ToM tasks as a verifiable sequence of belief-state updates using Dynamic Epistemic Logic (DEL). Through the product update mechanism of DEL, it accurately models the belief evolution of multi-agent systems (a simplified sketch follows this list).
- Second, it trains a Process Belief Model (PBM) as a verifier, which is supervised and trained using process-level labels automatically generated by a DEL simulator, thereby acquiring the capability to evaluate the reliability of intermediate reasoning steps.
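The sketch below illustrates, in deliberately simplified form, the intuition behind the DEL-style update: an event changes the world state for everyone but changes belief states only for the agents who witness it, which is exactly how false beliefs arise. A faithful product update operates on Kripke models and event models, and the PBM verifier then scores each intermediate reasoning step against such simulated belief trajectories; neither is shown here.

```python
from dataclasses import dataclass, field

@dataclass
class BeliefState:
    possible_locations: set = field(default_factory=set)

def apply_event(beliefs, true_state, moved_to, witnesses):
    """Update the true world state and, only for witnesses, the belief states."""
    true_state["object"] = moved_to
    for agent, belief in beliefs.items():
        if agent in witnesses:
            belief.possible_locations = {moved_to}
        # non-witnesses keep their (now possibly false) belief

beliefs = {"Anna": BeliefState({"basket"}), "John": BeliefState({"basket"})}
world = {"object": "basket"}
apply_event(beliefs, world, moved_to="box", witnesses={"John"})
assert beliefs["Anna"].possible_locations == {"basket"}   # false belief preserved
```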
- Weak high-order reasoning capability: Most methods perform poorly on high-order reasoning tasks, with only a few covering second-order reasoning and above, and perspective confusion readily arises in high-order tasks.
- Insufficient depth of multimodal fusion: Only a limited number of approaches have attempted video modality integration, failing to achieve deep multimodal fusion, which makes it difficult to support embodied physical interactions.
- Poor generalizability: Performance relies heavily on the pre-trained knowledge of LLMs and manual symbol design, resulting in weak adaptability to unseen scenarios or cross-scenario applications.
- Narrow scenario and topic coverage: Most methods focus on specific interactions or a small number of social topics, making them unable to address the complex and ever-changing ToM demands of the real world.
4.2. Model-Based Inference Paradigm: Constructing Interpretable Mental Models
- Assumption refers to the hypotheses about the composition of mental states. The previous section analyzed the hypotheses of different psychological and cognitive science schools regarding the constituent elements of the mental world (Table 2), while most models and datasets usually adopt simplified representations (Table 3). These representations generally include environmental states, agent actions (including utterances), observations, and so on.
- Feedback essentially optimizes the strategy model by judging the error between the mental states output by the inverse planning model and the ground-truth mental states. The inverse planning model specifies which variables are used to infer the hypothesized mental states. In Bayesian inverse planning methods [39,42,43], the inverse planning model is derived from the strategy model.
- Strategy denotes the forward planning model that derives the probability distribution of a certain action from the hypothesized mental states. Traditional decision-making models usually need to be acquired via reinforcement learning. By contrast, BIP-ALM [82] draws on methods such as [96,97] and leverages LLMs for decision-making.
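The relationship between the forward strategy model and the inverse planning model described above can be written as a Bayesian posterior. The notation here is a generic assumption (mental-state hypothesis m, actions a_{1:T}, observations o_{1:T}, strategy model π) rather than the formulation of any single cited method:

```latex
% Posterior over hypothesized mental states m (e.g., goals, beliefs) given observed
% actions a_{1:T} and observations o_{1:T}; \pi is the forward strategy model.
P(m \mid a_{1:T}, o_{1:T}) \;\propto\; P(m)\,\prod_{t=1}^{T} \pi(a_t \mid m, o_{1:t})
```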
- The character net extracts cross-episode agent traits to form priors.
- The mental state net captures transient mental states of the current episode to form posteriors.
- The prediction net integrates the two to enable behavior prediction.
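A toy sketch of this three-network decomposition. The NumPy placeholder weights, feature sizes, and simple averaging are illustrative stand-ins for the learned networks in ToMNet [47], not a reproduction of them.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    return np.tanh(x @ w1) @ w2

# Placeholder weights; in ToMNet these are learned end-to-end.
W = {name: (rng.normal(size=(8, 16)), rng.normal(size=(16, 4)))
     for name in ("character", "mental", "prediction")}

def character_net(past_episodes):
    """Average past-episode features into a persistent character embedding (prior)."""
    return mlp(np.mean(past_episodes, axis=0), *W["character"])

def mental_net(current_steps):
    """Summarise the current episode into a transient mental-state embedding (posterior)."""
    return mlp(np.mean(current_steps, axis=0), *W["mental"])

def prediction_net(e_char, e_mental):
    """Combine both embeddings to predict a next-action distribution."""
    logits = mlp(np.concatenate([e_char, e_mental]), *W["prediction"])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

past = rng.normal(size=(3, 8))      # stand-in features from past episodes
current = rng.normal(size=(5, 8))   # stand-in features from the current episode
action_probs = prediction_net(character_net(past), mental_net(current))
```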
- First, it parses users’ natural language descriptions (e.g., “I want to drink water”) via a language model to obtain preliminary goal clues.
- Second, it observes users’ action sequences (e.g., “reaching toward the cabinet”) through visual sensors.
- Finally, based on Bayesian inverse planning, it calculates the matching probability between different candidate goals (e.g., “grabbing a cup”, “grabbing a bottle”) and the observed language-action information, and selects the goal with the highest probability as the inference result.
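A minimal sketch of this final inference step. The prior and the likelihood values attached to "grab_cup" and "grab_bottle" are invented for illustration; in practice they stand in for quantities a forward planner or an LLM-estimated policy would provide.

```python
def goal_posterior(candidate_goals, prior, likelihoods):
    """Combine a prior over goals with per-goal likelihoods of the observed
    language and actions, and normalise (Bayes' rule).

    `likelihoods[g]` stands in for P(observed actions, utterance | goal g).
    """
    unnorm = {g: prior[g] * likelihoods[g] for g in candidate_goals}
    z = sum(unnorm.values())
    return {g: p / z for g, p in unnorm.items()}

goals = ["grab_cup", "grab_bottle"]
prior = {"grab_cup": 0.5, "grab_bottle": 0.5}
# "I want to drink water" + reaching toward the cabinet: illustrative likelihoods only.
likelihoods = {"grab_cup": 0.7, "grab_bottle": 0.3}
posterior = goal_posterior(goals, prior, likelihoods)
best_goal = max(posterior, key=posterior.get)   # -> "grab_cup"
```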
- First, the multimodal information fusion module leverages Gemini 1.5 Pro to extract action sequences from videos, and employs GPT-4o to parse textual dialogues, fill gaps in visual perception, and reconstruct the initial environmental state.
- Second, the hypothesis parsing module generates hypothesis combinations regarding agents’ beliefs, social goals (help/hinder/independent), and their beliefs about others’ goals for each question option.
- Third, the inverse multi-agent planning module, built on the Interactive POMDP (I-POMDP) framework, performs Bayesian reasoning to calculate the posterior probability of each hypothesis. This is achieved by evaluating the likelihood of actions and utterances at each time step under the given hypotheses, with GPT-4o providing the policy (strategy) estimates. The probabilistic dependency graph of its mental state variables is illustrated in Figure 6(b).
- First, it leverages large language models to automatically convert natural language belief statements into symbolic expressions of epistemic logic.
- Second, it performs reasoning on observed agent behaviors through Bayesian inverse planning, jointly inferring a consistent distribution of goals, beliefs, and plans that explain the behaviors.
- Third, it uses epistemic logic to evaluate the truth value of belief statements against the inferred belief states. The dependency of its mental state variables is illustrated in Figure 6(d).
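A small sketch of the statement-evaluation step. The hard-coded parse, the `BeliefAtom` structure, and the 0.5 threshold are assumptions for exposition; the method above performs the language-to-logic conversion with an LLM and evaluates statements against the belief distribution inferred by Bayesian inverse planning.

```python
from typing import NamedTuple

class BeliefAtom(NamedTuple):
    agent: str
    proposition: str        # e.g., "in(apple, cabinet)"

def parse_statement(text: str) -> BeliefAtom:
    # Stand-in for LLM-based translation of natural language into epistemic logic.
    if text == "Anna believes the apple is in the cabinet":
        return BeliefAtom("Anna", "in(apple, cabinet)")
    raise ValueError("unparsed statement")

def evaluate(atom, inferred_beliefs, threshold=0.5):
    """True iff the inferred probability the agent assigns to the proposition
    exceeds a threshold (the threshold value is an assumption)."""
    return inferred_beliefs[atom.agent].get(atom.proposition, 0.0) > threshold

inferred = {"Anna": {"in(apple, cabinet)": 0.8, "in(apple, fridge)": 0.2}}
print(evaluate(parse_statement("Anna believes the apple is in the cabinet"), inferred))
```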
- Generating multiple natural language hypotheses about the target agent’s beliefs and intentions at each time step.
- Updating the weights of these hypotheses based on the likelihood of the agent’s actions, prioritizing the retention of more reasonable interpretations.
- Maintaining hypothesis diversity through resampling and rejuvenation to avoid particle degeneracy.
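A compact sketch of this sequential Monte Carlo loop. `generate` and `likelihood` stand in for the LLM calls used in the method above, and the particle count, resampling threshold, and rejuvenation rate are illustrative choices.

```python
import random

def trace_hypotheses(actions, generate, likelihood, n_particles=8, ess_threshold=0.5):
    """Maintain weighted natural-language hypotheses about an agent's mental state.

    `generate(t)` proposes a hypothesis string for time step t, and
    `likelihood(hypothesis, action)` scores how well it explains the observed action.
    """
    particles = [generate(0) for _ in range(n_particles)]
    weights = [1.0 / n_particles] * n_particles
    for t, action in enumerate(actions):
        # Re-weight each hypothesis by how well it explains the new action.
        weights = [w * likelihood(h, action) for w, h in zip(weights, particles)]
        z = sum(weights) or 1e-12
        weights = [w / z for w in weights]
        # Resample and rejuvenate when the effective sample size collapses.
        ess = 1.0 / sum(w * w for w in weights)
        if ess < ess_threshold * n_particles:
            particles = random.choices(particles, weights=weights, k=n_particles)
            particles = [generate(t) if random.random() < 0.25 else h for h in particles]
            weights = [1.0 / n_particles] * n_particles
    return list(zip(particles, weights))
```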
5. Evolution of ToM Evaluation Benchmarks
5.1. Early Benchmarks: Laying Foundations and Exposing Limitations
- It adopted a unified random generation mechanism to construct stories involving true beliefs, false beliefs, and second-order false beliefs, thereby eliminating idiosyncratic biases across different story types.
- It introduced interference elements such as irrelevant agent actions, distracting sentences about locations, and randomized action sequences to reduce data predictability.
- It mandated the generation of a full set of question types for each story, including Reality, Memory, first-order belief (e.g., “Where will Agent A look for the object?”), and second-order belief (e.g., “Where does Agent A think Agent B will look for the object?”). Moreover, it innovatively proposed an aggregate accuracy evaluation metric: a story is deemed successfully reasoned only if all of its associated questions are answered correctly (a computation sketch follows this list). This metric ensures that models truly distinguish between the objective state of the world and the subjective mental states of agents.
- It increases reasoning depth and exposes the shortcomings of models in high-order recursive reasoning.
- It incorporates complex social dynamics, making it more closely aligned with real interpersonal interaction scenarios.
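A minimal sketch of the aggregate accuracy metric mentioned above. The data layout (a list of per-question correctness flags for each story) is an assumption made for illustration.

```python
def aggregate_accuracy(stories):
    """A story counts as correct only if every one of its questions
    (reality, memory, first-order, second-order) is answered correctly."""
    solved = sum(all(question_correct for question_correct in story) for story in stories)
    return solved / len(stories)

# Three stories with four questions each; only the first is fully correct.
results = [[True, True, True, True], [True, True, False, True], [True, False, True, True]]
print(aggregate_accuracy(results))   # 0.333...
```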
5.2. Paradigm Shift: Multimodality and Dynamic Interaction
6. Conclusions
6.1. Core Challenges and Future Research Directions
- (1) Neural generation + symbolic verification—following the approach of Thought-tracing [84], LLMs (neural component) generate hypotheses about mental states, while Bayesian inference or logical rules (symbolic component) verify and update these hypotheses, balancing generative capacity with reasoning rigor (a schematic loop follows this list).
- (2) Symbolic guidance + neural fine-tuning—a symbolic mental model defines the core logic of ToM reasoning (e.g., belief update rules), which is then used to guide the fine-tuning of neural models (e.g., LLMs). This enables the model to maintain efficient reasoning while adhering to explicit cognitive rules. For example, researchers can use a symbolic model to define causal rules for false-belief updating (e.g., “failure to observe object movement → no belief update”), and construct a fine-tuning dataset based on these rules to train LLMs to follow them in ToM tasks, thereby avoiding the rule violation issue common in purely neural models.
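A schematic of pathway (1) as a control loop. The callback names are placeholders for an LLM hypothesis generator, a symbolic (logical or Bayesian) verifier, and a belief-update rule; they do not correspond to any particular published implementation.

```python
def neuro_symbolic_tom(observations, generate_hypotheses, symbolic_check, update):
    """Neural generation + symbolic verification as an iterative loop."""
    beliefs = []
    for obs in observations:
        candidates = generate_hypotheses(obs, beliefs)                  # neural generation
        verified = [h for h in candidates if symbolic_check(h, obs)]    # symbolic verification
        beliefs = update(beliefs, verified)                             # rule-governed update
    return beliefs
```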
6.2. Ethical Considerations
Conflicts of Interest
References
- Waisberg, Ethan; Ong, Joshua; Masalkhi, Mouayad; Zaman, Nasif; Sarker, Prithul; Lee, Andrew G; Tavakkoli, Alireza. Meta smart glasses—large language models and the future for assistive glasses for individuals with vision impairments. Eye 2024, 38(6), 1036–1038. [Google Scholar] [CrossRef]
- Duan, Jiafei; Yu, Samson; Tan, Hui Li; Zhu, Hongyuan; Tan, Cheston. A survey of embodied ai: From simulators to research tasks. IEEE Transactions on Emerging Topics in Computational Intelligence 2022, 6(2), 230–244. [Google Scholar] [CrossRef]
- Ha, David; Schmidhuber, Jürgen. World models. arXiv 2018, arXiv:1803.10122. [Google Scholar]
- Ding, Jingtao; Zhang, Yunke; Shang, Yu; Zhang, Yuheng; Zong, Zefang; Feng, Jie; Yuan, Yuan; Su, Hongyuan; Li, Nian; Sukiennik, Nicholas; et al. Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys 2025, 58(3), 1–38. [Google Scholar] [CrossRef]
- Fung, Pascale; Bachrach, Yoram; Celikyilmaz, Asli; Chaudhuri, Kamalika; Chen, Delong; Chung, Willy; Dupoux, Emmanuel; Gong, Hongyu; Jégou, Hervé; Lazaric, Alessandro; et al. Embodied ai agents: Modeling the world. arXiv 2025, arXiv:2506.22355. [Google Scholar] [CrossRef]
- Huang, Ming-Hui; Rust, Roland T. Engaged to a robot? the role of ai in service. Journal of Service Research 2021, 24(1), 30–41. [Google Scholar] [CrossRef]
- Cen, Jun; Yu, Chaohui; Yuan, Hangjie; Jiang, Yuming; Huang, Siteng; Guo, Jiayan; Li, Xin; Song, Yibing; Luo, Hao; Wang, Fan; et al. Worldvla: Towards autoregressive action world model. arXiv 2025, arXiv:2506.21539. [Google Scholar] [CrossRef]
- Schwamb, Karl B. Mental models: A survey. 1990. URL: citeseer.nj.nec.com/schwamb90mental.html.
- Craik, Kenneth James Williams. The nature of explanation; CUP Archive, 1967; volume 445. [Google Scholar]
- Forrester, Jay W. Counterintuitive behavior of social systems. Theory and decision 1971, 2(2), 109–140. [Google Scholar] [CrossRef]
- Smith, Kevin A.; Hamrick, Jessica B.; Sanborn, Adam N.; Battaglia, Peter W.; Gerstenberg, Tobias; Ullman, Tomer D; Tenenbaum, Joshua B. Intuitive physics as probabilistic inference.
- LeCun, Yann. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review 2022, 62(1), 1–62. [Google Scholar]
- Bardes, Adrien; Garrido, Quentin; Ponce, Jean; Chen, Xinlei; Rabbat, Michael; LeCun, Yann; Assran, Mahmoud; Ballas, Nicolas. Revisiting feature prediction for learning visual representations from video. arXiv 2024, arXiv:2404.08471. [Google Scholar] [CrossRef]
- Assran, Mido; Bardes, Adrien; Fan, David; Garrido, Quentin; Howes, Russell; Muckley, Matthew; Rizvi, Ammar; Roberts, Claire; Sinha, Koustuv; Zholus, Artem; et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv 2025, arXiv:2506.09985. [Google Scholar]
- Feng, Jie; Liu, Tianhui; Du, Yuwei; Guo, Siqi; Lin, Yuming; Li, Yong. Citygpt: Empowering urban spatial cognition of large language models. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, 2025; pp. 591–602. [Google Scholar]
- Bruce, Jake; Dennis, Michael D; Edwards, Ashley; Parker-Holder, Jack; Shi, Yuge; Hughes, Edward; Lai, Matthew; Mavalankar, Aditi; Steigerwald, Richie; Apps, Chris; et al. Genie: Generative interactive environments. Forty-first International Conference on Machine Learning, 2024. [Google Scholar]
- Jang, Joel; Ye, Seonghyeon; Lin, Zongyu; Xiang, Jiannan; Bjorck, Johan; Fang, Yu; Hu, Fengyuan; Huang, Spencer; Kundalia, Kaushil; Lin, Yen-Chen; et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv 2025, arXiv:2505.12705. [Google Scholar]
- Shi, Haojun; Ye, Suyu; Fang, Xinyu; Jin, Chuanyang; Isik, Leyla; Kuo, Yen-Ling; Shu, Tianmin. Muma-tom: Multi-modal multi-agent theory of mind. Proceedings of the AAAI Conference on Artificial Intelligence 2025, 39, 1510–1519. [Google Scholar] [CrossRef]
- Sarıtaş, Karahan; Tezören, Kıvanç; Durmazkeser, Yavuz. A systematic review on the evaluation of large language models in theory of mind tasks. arXiv 2025, arXiv:2502.08796. [Google Scholar] [CrossRef]
- Marchetti, Antonella; Manzi, Federico; Riva, Giuseppe; Gaggioli, Andrea; Massaro, Davide. Artificial intelligence and the illusion of understanding: A systematic review of theory of mind and large language models. In Cyberpsychology, Behavior, and Social Networking; 2025. [Google Scholar]
- Smallwood, Richard D; Sondik, Edward J. The optimal control of partially observable markov processes over a finite horizon. Operations research 1973, 21(5), 1071–1088. [Google Scholar] [CrossRef]
- Curtis, Aidan; Tang, Hao; Veloso, Thiago; Ellis, Kevin; Tenenbaum, Joshua B; Lozano-Pérez, Tomás; Kaelbling, Leslie Pack. Llm-guided probabilistic program induction for pomdp model estimation. Conference on Robot Learning, 2025; PMLR; pp. 3137–3184. [Google Scholar]
- Millidge, Beren; Seth, Anil; Buckley, Christopher L. Predictive coding: a theoretical and experimental review. arXiv 2021, arXiv:2107.12979. [Google Scholar]
- Spratling, Michael W. A review of predictive coding algorithms. Brain and cognition 2017, 112, 92–97. [Google Scholar] [CrossRef] [PubMed]
- Zhou, Xuhui; Liu, Jiarui; Yerukola, Akhila; Kim, Hyunwoo; Sap, Maarten. Social world models. arXiv 2025, arXiv:2509.00559. [Google Scholar] [PubMed]
- Zhang, Xiaoyuan; Huang, Yizhe; Ma, Chengdong; Chen, Zhixun; Ma, Long; Du, Yali; Zhu, Song-Chun; Yang, Yaodong; Feng, Xue. Social world model-augmented mechanism design policy learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems; 2025a.
- Xiang, Jiannan; Tao, Tianhua; Gu, Yi; Shu, Tianmin; Wang, Zirui; Yang, Zichao; Hu, Zhiting. Language models meet world models: Embodied experiences enhance language models. Advances in neural information processing systems 2023, 36, 75392–75412. [Google Scholar]
- Sutton, Richard S; Barto, Andrew G; et al. Reinforcement learning: An introduction; MIT press: Cambridge, 1998; volume 1. [Google Scholar]
- Premack, David; Woodruff, Guy. Does the chimpanzee have a theory of mind? Behavioral and brain sciences 1978, 1(4), 515–526. [Google Scholar] [CrossRef]
- Lakoff, George; Johnson, Mark. Metaphors we live by; University of Chicago press, 2024. [Google Scholar]
- Richens, Jonathan; Everitt, Tom; Abel, David. General agents need world models. In Forty-second International Conference on Machine Learning; 2025.
- Sakagami, Ryo; Lay, Florian S; Dömel, Andreas; Schuster, Martin J; Albu-Schäffer, Alin; Stulp, Freek. Robotic world models—conceptualization, review, and engineering best practices. Frontiers in Robotics and AI 2023, 10, 1253049. [Google Scholar] [CrossRef]
- Domjan, Michael. Domjan and Burkhard's The principles of learning and behavior; Thomson Brooks/Cole Publishing Co, 1993. [Google Scholar]
- Weger, Ulrich; Wagemann, Johannes; Meyer, Andreas. Introspection in psychology. European Psychologist, 2018. [Google Scholar]
- Wilson, Timothy D; Schooler, Jonathan W. Thinking too much: introspection can reduce the quality of preferences and decisions. Journal of personality and social psychology 1991, 60(2), 181. [Google Scholar] [CrossRef]
- Schwitzgebel, Eric. Introspection. In Stanford Encyclopedia of Philosophy; 2019. [Google Scholar]
- Bernstein, Daniel S; Givan, Robert; Immerman, Neil; Zilberstein, Shlomo. The complexity of decentralized control of markov decision processes. Mathematics of operations research 2002, 27(4), 819–840. [Google Scholar] [CrossRef]
- Nair, Ranjit; Tambe, Milind; Yokoo, Makoto; Pynadath, David; Marsella, Stacy. Taming decentralized pomdps: Towards efficient policy computation for multiagent settings. IJCAI 2003, 3, 705–711. [Google Scholar]
- Baker, Chris L; Jara-Ettinger, Julian; Saxe, Rebecca; Tenenbaum, Joshua B. Rational quantitative attribution of beliefs, desires and percepts in human mentalizing. Nature Human Behaviour 2017, 1(4), 0064. [Google Scholar] [CrossRef]
- Cinelli, Lucas Pinheiro; Marins, Matheus Araújo; da Silva, Eduardo Antúnio Barros; Netto, Sérgio Lima. Variational autoencoder. In Variational methods for machine learning with applications to deep networks; Springer, 2021; pp. 111–149. [Google Scholar]
- Friston, Karl. The free-energy principle: a unified brain theory? Nature reviews neuroscience 2010, 11(2), 127–138. [Google Scholar] [CrossRef] [PubMed]
- Baker, Chris L; Saxe, Rebecca; Tenenbaum, Joshua B. Action understanding as inverse planning. Cognition 2009, 113(3), 329–349. [Google Scholar] [CrossRef]
- Shum, Michael; Kleiman-Weiner, Max; Littman, Michael L; Tenenbaum, Joshua B. Theory of minds: Understanding behavior in groups through inverse planning. In Proceedings of the AAAI conference on artificial intelligence; 2019; Volume 33, pp. 6163–6170. [Google Scholar] [CrossRef]
- Wellman, Henry M; Carey, Susan; Gleitman, Lila; Newport, Elissa L; Spelke, Elizabeth S. The child’s theory of mind; The MIT Press, 1990. [Google Scholar]
- Gergely, György; Nádasdy, Zoltán; Csibra, Gergely; Bíró, Szilvia. Taking the intentional stance at 12 months of age. Cognition 1995, 56(2), 165–193. [Google Scholar] [CrossRef] [PubMed]
- Baron-Cohen, Simon; Leslie, Alan M; Frith, Uta. Does the autistic child have a “theory of mind”? Cognition 1985, 21(1), 37–46. [Google Scholar] [CrossRef]
- Rabinowitz, Neil; Perbet, Frank; Song, Francis; Zhang, Chiyuan; Ali Eslami, SM; Botvinick, Matthew. Machine theory of mind. International conference on machine learning, 2018; PMLR; pp. 4218–4227. [Google Scholar]
- Chang, Yupeng; Wang, Xu; Wang, Jindong; Wu, Yuan; Yang, Linyi; Zhu, Kaijie; Chen, Hao; Yi, Xiaoyuan; Wang, Cunxiang; Wang, Yidong; et al. A survey on evaluation of large language models. ACM transactions on intelligent systems and technology 2024, 15(3), 1–45. [Google Scholar] [CrossRef]
- Zhang, Jingyi; Huang, Jiaxing; Jin, Sheng; Lu, Shijian. Vision-language models for vision tasks: A survey. IEEE transactions on pattern analysis and machine intelligence 2024, 46(8), 5625–5644. [Google Scholar] [CrossRef] [PubMed]
- Taniguchi, Tadahiro; Ueda, Ryo; Nakamura, Tomoaki; Suzuki, Masahiro; Taniguchi, Akira. Generative emergent communication: Large language model is a collective world model. arXiv 2024, arXiv:2501.00226. [Google Scholar] [CrossRef]
- OpenAI. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Yang, An; Li, Anfeng; Yang, Baosong; Zhang, Beichen; Hui, Binyuan; Zheng, Bo; Yu, Bowen; Gao, Chang; Huang, Chengen; Lv, Chenxu; et al. Qwen3 technical report. arXiv 2025a, arXiv:2505.09388. [Google Scholar]
- Gandhi, Kanishk; Fränken, Jan-Philipp; Gerstenberg, Tobias; Goodman, Noah. Understanding social reasoning in language models with language models. Advances in Neural Information Processing Systems 2023, 36, 13518–13529. [Google Scholar]
- Georgeff, M.; Rao, A. Modeling rational agents within a BDI-architecture. In Proceedings of the 2nd International Conference on Knowledge Representation and Reasoning (KR’91); Morgan Kaufmann, 1991; pp. 473–484.
- De Masi, Franco. The ego and the id: Concepts and developments. The International Journal of Psychoanalysis 2023, 104(6), 1091–1100. [Google Scholar] [CrossRef] [PubMed]
- Trappenberg, Thomas. Fundamentals of computational neuroscience; OUP Oxford, 2009.
- Dennett, Daniel C. The intentional stance; MIT Press, 1989.
- Freedheim, Donald K; Weiner, Irving B. Handbook of psychology, history of psychology; John Wiley & Sons, 2012; volume 1. [Google Scholar]
- Confer, Jaime C; Easton, Judith A; Fleischman, Diana S; Goetz, Cari D; Lewis, David MG; Perilloux, Carin; Buss, David M. Evolutionary psychology: Controversies, questions, prospects, and limitations. American psychologist 2010, 65(2), 110. [Google Scholar] [CrossRef] [PubMed]
- Russell, James A. A circumplex model of affect. Journal of personality and social psychology 1980, 39(6), 1161. [Google Scholar] [CrossRef]
- Anderson, John R. Act: A simple theory of complex cognition. American psychologist 1996, 51(4), 355. [Google Scholar] [CrossRef]
- Rumelhart, David E; McClelland, James L; PDP Research Group; et al. Parallel distributed processing, Explorations in the microstructure of cognition: Foundations; The MIT press, 1986a; volume 1. [Google Scholar]
- Rumelhart, David E; Hinton, Geoffrey E; McClelland, James L; et al. A general framework for parallel distributed processing. Parallel distributed processing: Explorations in the microstructure of cognition 1986b, 1(45-76), 26. [Google Scholar]
- Barsalou, Lawrence W. Perceptual symbol systems. Behavioral and brain sciences 1999, 22(4), 577–660. [Google Scholar] [CrossRef] [PubMed]
- Sclar, Melanie; Yu, Jane; Fazel-Zarandi, Maryam; Tsvetkov, Yulia; Bisk, Yonatan; Choi, Yejin; Celikyilmaz, Asli. Explore theory of mind: Program-guided adversarial data generation for theory of mind reasoning. arXiv 2024, arXiv:2412.12175. [Google Scholar]
- Chen, Zhawnen; Wang, Tianchun; Wang, Yizhou; Kosinski, Michal; Zhang, Xiang; Fu, Yun; Li, Sheng. Through the theory of mind’s eye: Reading minds with multimodal video large language models. arXiv 2025a, arXiv:2406.13763. [Google Scholar]
- Xu, Lin; Hu, Zhiyuan; Zhou, Daquan; Ren, Hongyu; Dong, Zhen; Keutzer, Kurt; Ng, See Kiong; Feng, Jiashi. Magic: Investigation of large language model powered multi-agent in cognition, adaptability, rationality and collaboration. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; 2024a, pp. 7315–7332.
- Wu, Jincenzi; Chen, Zhuang; Deng, Jiawen; Sabour, Sahand; Meng, Helen; Huang, Minlie. Coke: A cognitive knowledge graph for machine theory of mind. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics; 2024; Volume 1, pp. 15984–16007. [Google Scholar]
- Puig, Xavier; Shu, Tianmin; Tenenbaum, Joshua B; Torralba, Antonio. Nopa: Neurally-guided online probabilistic assistance for building socially intelligent home assistants. In 2023 IEEE International Conference on Robotics and Automation (ICRA); 2023, IEEE; pp. 7628–7634.
- Zhu, Wentao; Zhang, Zhining; Wang, Yizhou. Language models represent beliefs of self and others. In Forty-first International Conference on Machine Learning.
- Fan, Xianzhe; Zhou, Xuhui; Jin, Chuanyang; Nottingham, Kolby; Zhu, Hao; Sap, Maarten. Somi-tom: Evaluating multi-perspective theory of mind in embodied social interactions. arXiv 2025. [Google Scholar]
- Li, Huao; Chong, Yu; Stepputtis, Simon; Campbell, Joseph P; Hughes, Dana; Lewis, Charles; Sycara, Katia. Theory of mind for multi-agent collaboration via large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023; pp. 180–192. [Google Scholar]
- Yang, Bo; Guo, Jiaxian; Iwasawa, Yusuke; Matsuo, Yutaka. Large language models as theory of mind aware generative agents with counterfactual reflection. arXiv 2025b, arXiv:2501.15355. [Google Scholar]
- Mozikov, Mikhail; Severin, Nikita; Bodishtianu, Valeria; Glushanina, Maria; Nasonov, Ivan; Orekhov, Daniil; Vladislav, Pekhotin; Makovetskiy, Ivan; Baklashkin, Mikhail; Lavrentyev, Vasily; et al. Eai: Emotional decision-making of llms in strategic games and ethical dilemmas. Advances in Neural Information Processing Systems 2024, 37, 53969–54002. [Google Scholar]
- Park, Joon Sung; O’Brien, Joseph; Cai, Carrie Jun; Morris, Meredith Ringel; Liang, Percy; Bernstein, Michael S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology; 2023, pp. 1–22.
- Moghaddam, Shima Rahimi; Honey, Christopher J. Boosting theory-of-mind performance in large language models via prompting. arXiv 2023, arXiv:2304.11490. [Google Scholar]
- Sclar, Melanie; Kumar, Sachin; West, Peter; Suhr, Alane; Choi, Yejin; Tsvetkov, Yulia. Minding language models’ (lack of) theory of mind: A plug-and-play multi-character belief tracker. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics; 2023; Volume 1, pp. 13960–13980. [Google Scholar]
- Wilf, Alex; Lee, Sihyun; Liang, Paul Pu; Morency, Louis-Philippe. Think twice: Perspective-taking improves large language models’ theory-of-mind capabilities. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics; 2024; Volume 1, pp. 8292–8308. [Google Scholar]
- Tudor Lică, Mircea; Shirekar, Ojas; Colle, Baptiste; Raman, Chirag. Mindforge: Empowering embodied agents with theory of mind for lifelong cultural learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems; 2025.
- Chan, Chunkit; Yim, Yauwai; Zeng, Hongchuan; Zou, Zhiying; Cheng, Xinyuan; Sun, Zhifan; Deng, Zheye; Chung, Kawai; Ao, Yuzhuo; Fan, Yixiang; et al. Xtom: Exploring the multilingual theory of mind for large language models. arXiv 2025, arXiv:2506.02461. [Google Scholar] [CrossRef]
- Wu, Yuheng; Xie, Jianwen; Zhang, Denghui; Xu, Zhaozhuo. Del-tom: Inference-time scaling for theory-of-mind reasoning via dynamic epistemic logic. arXiv 2025, arXiv:2505.17348. [Google Scholar]
- Jin, Chuanyang; Wu, Yutong; Cao, Jing; Xiang, Jiannan; Kuo, Yen-Ling; Hu, Zhiting; Ullman, Tomer; Torralba, Antonio; Tenenbaum, Joshua; Shu, Tianmin. Mmtom-qa: Multimodal theory of mind question answering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics; 2024; Volume 1, pp. 16077–16102. [Google Scholar]
- Ying, Lance; Zhi-Xuan, Tan; Wong, Lionel; Mansinghka, Vikash; Tenenbaum, Joshua. Grounding language about belief in a bayesian theory-of-mind. arXiv 2024, arXiv:2402.10416. [Google Scholar] [CrossRef]
- Kim, Hyunwoo; Sclar, Melanie; Zhi-Xuan, Tan; Ying, Lance; Levine, Sydney; Liu, Yang; Tenenbaum, Joshua B; Choi, Yejin. Hypothesis-driven theory-of-mind reasoning for large language models. arXiv 2025, arXiv:2502.11881. [Google Scholar]
- Zhang, Xuanming; Chen, Yuxuan; Yeh, Min-Hsuan; Li, Yixuan. Metamind: Modeling human social thoughts with metacognitive multi-agent systems. arXiv 2025b, arXiv:2505.18943. [Google Scholar]
- Zhang, Zhining; Jin, Chuanyang; Jia, Mung Yao; Shu, Tianmin. Autotom: Automated bayesian inverse planning and model discovery for open-ended theory of mind. In ICLR 2025 Workshop on Foundation Models in the Wild; 2025c.
- Frith, Chris; Frith, Uta. Theory of mind. Current biology 2005, 15(17), R644–R645. [Google Scholar] [CrossRef]
- Wellman, Henry M. Theory of mind: The state of the art. European Journal of Developmental Psychology 2018, 15(6), 728–755. [Google Scholar] [CrossRef]
- Sahoo, Pranab; Singh, Ayush Kumar; Saha, Sriparna; Jain, Vinija; Mondal, Samrat; Chadha, Aman. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv 2024, arXiv:2402.07927. [Google Scholar] [CrossRef]
- Bubeck, Sébastien; Chandrasekaran, Varun; Eldan, Ronen; Gehrke, Johannes; Horvitz, Eric; Kamar, Ece; Lee, Peter; Lee, Yin Tat; Li, Yuanzhi; Lundberg, Scott; et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv 2023, arXiv:2303.12712. [Google Scholar] [CrossRef]
- Zhu, Wentao; Zhang, Zhining; Wang, Yizhou. Language models represent beliefs of self and others. arXiv 2024, arXiv:2402.18496. [Google Scholar] [CrossRef]
- Goldman, Alvin I. Interpretation psychologized. Mind & Language 1989, 4(3), 161–185. [Google Scholar] [CrossRef]
- Goldman, Alvin I. Simulating minds: The philosophy, psychology, and neuroscience of mindreading; Oxford University Press, 2006. [Google Scholar]
- Wei, Jason; Wang, Xuezhi; Schuurmans, Dale; Bosma, Maarten; Xia, Fei; Chi, Ed; Le, Quoc V; Zhou, Denny; et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 2022, 35, 24824–24837. [Google Scholar]
- Wu, Yufan; He, Yinghui; Jia, Yilin; Mihalcea, Rada; Chen, Yulong; Deng, Naihao. Hi-tom: A benchmark for evaluating higher-order theory of mind reasoning in large language models. In Findings of the Association for Computational Linguistics EMNLP 2023; 2023; pp. 10691–10706. [Google Scholar]
- Huang, Wenlong; Abbeel, Pieter; Pathak, Deepak; Mordatch, Igor. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International conference on machine learning; 2022, PMLR; pp. 9118–9147.
- Li, Shuang; Puig, Xavier; Paxton, Chris; Du, Yilun; Wang, Clinton; Fan, Linxi; Chen, Tao; Huang, De-An; Akyürek, Ekin; Anandkumar, Anima; et al. Pre-trained language models for interactive decision-making. Advances in Neural Information Processing Systems 2022, 35, 31199–31212. [Google Scholar]
- Krojer, Benno; Komeili, Mojtaba; Ross, Candace; Garrido, Quentin; Sinha, Koustuv; Ballas, Nicolas; Assran, Mahmoud. A shortcut-aware video-qa benchmark for physical understanding via minimal video pairs. arXiv 2025, arXiv:2506.09987. [Google Scholar]
- Bordes, Florian; Garrido, Quentin; Kao, Justine T; Williams, Adina; Rabbat, Michael; Dupoux, Emmanuel. Intphys 2: Benchmarking intuitive physics understanding in complex synthetic environments. arXiv 2025, arXiv:2506.09849. [Google Scholar] [CrossRef]
- Foss, Aaron; Evans, Chloe; Mitts, Sasha; Sinha, Koustuv; Rizvi, Ammar; Kao, Justine T. Causalvqa: A physically grounded causal reasoning benchmark for video models. arXiv 2025, arXiv:2506.09943. [Google Scholar] [CrossRef]
- Chen, Delong; Chung, Willy; Bang, Yejin; Ji, Ziwei; Fung, Pascale. Worldprediction: A benchmark for high-level world modeling and long-horizon procedural planning. arXiv 2025b, arXiv:2506.04363. [Google Scholar]
- Le, Matthew; Boureau, Y-Lan; Nickel, Maximilian. Revisiting the evaluation of theory of mind through question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019, pp. 5872–5877.
- Kim, Hyunwoo; Sclar, Melanie; Zhou, Xuhui; Bras, Ronan; Kim, Gunhee; Choi, Yejin; Sap, Maarten. Fantom: A benchmark for stress-testing machine theory of mind in interactions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing; 2023, pp. 14397–14413.
- Street, Winnie; Siy, John Oliver; Keeling, Geoff; Baranes, Adrien; Barnett, Benjamin; McKibben, Michael; Kanyere, Tatenda; Lentz, Alison; Dunbar, Robin IM; et al. Llms achieve adult human performance on higher-order theory of mind tasks. arXiv 2024, arXiv:2405.18870. [Google Scholar] [CrossRef]
- Strachan, James WA; Albergo, Dalila; Borghini, Giulia; Pansardi, Oriana; Scaliti, Eugenio; Gupta, Saurabh; Saxena, Krati; Rufo, Alessandro; Panzeri, Stefano; Manzi, Guido; et al. Testing theory of mind in large language models and humans. Nature Human Behaviour 2024, 8(7), 1285–1295. [Google Scholar] [CrossRef]
- Xu, Hainiu; Zhao, Runcong; Zhu, Lixing; Du, Jinhua; He, Yulan. Opentom: A comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models. arXiv 2024b, arXiv:2402.06044. [Google Scholar]
- Kosinski, Michal. Evaluating large language models in theory of mind tasks. Proceedings of the National Academy of Sciences 2024, 121(45), e2405460121. [Google Scholar] [CrossRef] [PubMed]
- Xiao, Yang; Wang, Jiashuo; Xu, Qiancheng; Song, Changhe; Xu, Chunpu; Cheng, Yi; Li, Wenjie; Liu, Pengfei. Towards dynamic theory of mind: Evaluating llm adaptation to temporal evolution of human states. arXiv 2025, arXiv:2505.17663. [Google Scholar] [CrossRef]
- Lupu, Andrei; Willi, Timon; Foerster, Jakob. The decrypto benchmark for multi-agent reasoning and theory of mind. arXiv 2025, arXiv:2506.20664. [Google Scholar] [CrossRef]
- Xu, Zixiang; Wang, Yanbo; Huang, Yue; Ye, Jiayi; Zhuang, Haomin; Song, Zirui; Gao, Lang; Wang, Chenxi; Chen, Zhaorun; Zhou, Yujun; et al. Socialmaze: A benchmark for evaluating social reasoning in large language models. arXiv 2025, arXiv:2505.23713. [Google Scholar] [CrossRef]
- Thiyagarajan, Prameshwar; Parimi, Vaishnavi; Sai, Shamant; Garg, Soumil; Meirbek, Zhangir; Yarlagadda, Nitin; Zhu, Kevin; Kim, Chris. Unitombench: Integrating perspective-taking to improve theory of mind in llms. arXiv 2025, arXiv:2506.09450. [Google Scholar]
- Puig, Xavier; Shu, Tianmin; Li, Shuang; Wang, Zilin; Liao, Yuan-Hong; Tenenbaum, Joshua B; Fidler, Sanja; Torralba, Antonio. Watch-and-help: A challenge for social perception and human-ai collaboration. In International Conference on Learning Representations.
- Bharti, Shubham; Cheng, Shiyun; Rho, Jihyun; Zhang, Jianrui; Cai, Mu; Lee, Yong Jae; Rau, Martina; Zhu, Xiaojin. Chartom: A visual theory-of-mind benchmark for llms on misleading charts. arXiv 2025, arXiv:2408.14419. [Google Scholar]
- Li, Xinyang; Liu, Siqi; Zou, Bochao; Chen, Jiansheng; Ma, Huimin. From black boxes to transparent minds: Evaluating and enhancing the theory of mind in multimodal large language models. arXiv 2025, arXiv:2506.14224. [Google Scholar] [CrossRef]
- Huang, X Angelo; La Malfa, Emanuele; Marro, Samuele; Asperti, Andrea; Cohn, Anthony G; Wooldridge, Michael J. A notion of complexity for theory of mind via discrete world models. In Findings of the Association for Computational Linguistics EMNLP 2024; 2024; pp. 2964–2983. [Google Scholar]
- Navigli, Roberto; Conia, Simone; Ross, Björn. Biases in large language models: origins, inventory, and discussion. ACM Journal of Data and Information Quality 2023, 15(2), 1–21. [Google Scholar] [CrossRef]
- Wei, Kangda; Abdullah, Hasnat, Md; Huang, Ruihong. Mitigating gender bias via fostering exploratory thinking in llms. arXiv 2025, arXiv:2505.17217. [Google Scholar] [CrossRef]






| World Models | State Space | Observation Space | Action Space | Supported Behaviors | Theoretical Foundations |
|---|---|---|---|---|---|
| Physical | Objective states of the physical world (position, size, material, etc.) | Physical observations of the environment | Physical actions | Physical interactions | Traditional World Models [3,27], Reinforcement Learning [28] |
| Mental (Social) | Physical attributes + psychological attributes (beliefs, intentions, emotions, etc.) | External observations + introspective observations + memory | Physical actions + cognitive actions | Complex social behaviors (empathy, deception, norm enforcement, etc.) | ToM in cognitive science [29,30] |
| School Classification | Theoretical Perspective | Typical Representational Elements | Application Scenarios | Limitations |
|---|---|---|---|---|
| Folk Psychology [57] | Macroscopic approximate description: Assumes the mind consists of external-directed logical propositions | belief, desire, intention | BDI model agent design, motivation analysis | No “belief” entity in neuroscience; unable to explain irrational and unconscious behaviors. |
| Structuralism [58] | Attempts to decompose subjective experience into indivisible atoms via introspection | Sensations, Images, Affections | Basic psychophysics research, user interaction experience design | Unreliable introspection; “atomic” experiences vary across individuals and are difficult to standardize and verify. |
| Evolutionary Psychology [59] | The mind’s core consists of adaptive modules shaped by natural selection to solve ancient survival problems | Evolutionary adaptive modules, motivations, domain mechanisms | Criminal psychology, social psychology | Difficult to empirically validate the causal link between ancient environments and modern minds; weak explanatory power for individual differences. |
| Psychoanalysis [55] | The mind is a closed energy system, emphasizing repression, catharsis, and energy conservation | Id, Ego, Superego, Libido | Psychological counseling, advertising psychology | Difficult to experimentally verify the specific operation of the “unconscious” (i.e., unfalsifiable). |
| Dimensional Emotion Theory [60] | Mental states are points in a continuous coordinate system rather than discrete switches | Valence, Arousal | Public opinion analysis, stress monitoring via wearable devices | Unable to distinguish complex emotions that are close in coordinates but distinct in nature (e.g., anger and fear). |
| Cognitive Architectures [61] | The mind is an information processing system; core elements are memory-stored data and CPU-executed rules | Knowledge Structures, Production Rules, Goal Stack, Working Memory | Human-computer interaction, cognitive modeling of complex tasks (e.g., driving) | Symbol Grounding Problem; difficulty handling ambiguity and creativity. |
| Connectionism [62,63] | Decentralized with no single “concept”; meaning resides in distributed network patterns | Weights, Activation Vectors | Deep learning (LLMs), pattern recognition | Poor interpretability; difficult to trace how a specific decision emerges from weights. |
| Embodied Cognition [30,64] | The mind is not confined to the brain but arises from body-environment interactions; representation is bodily simulation | Sensorimotor Schemas | Virtual reality interaction design, rehabilitation training | Difficult to explain the formation of disembodied abstract concepts such as “mathematics” and “justice”. |
| Predictive Coding / Free Energy Principle [23,41] | The brain is a prediction machine; core elements are not inputs but discrepancies between predictions and reality | Prediction Error, Priors | Computational psychiatry, active inference AI | Overly grand; explains everything but lacks specific predictive details. |
| Taxonomy | Representation Way | Typical Method/Benchmark |
|---|---|---|
| Symbolic Belief | Takes discrete symbolic texts and structured propositions as the core representational carriers of mental states, without architectural organization or probabilistic modeling. | BigToM [53], ExploreToM [65], ToMLoc [66], MAgIC [67], COKE [68] |
| Probabilistic Belief | Represents mental states as probability distributions and reasons by quantifying and updating uncertainty, a logic distinct from that of symbolic representations and BDI-style architectures. | NOPA [69] |
| Distributed Activation Vector | Enables the model to learn distributed activation vectors of “agent preferences (e.g., preferring item A over B)” and “goal intentions” in hidden layers by observing agents’ “action sequences” (e.g., object retrieval in grid worlds). | ToMNet [47], LLM-Belief [70], VToM [66] |
| BDI Architecture | Organizes mental states around the BDI triad, with elements directly serving multi-agent collaborative decision-making. | SoMi-ToM [71], CoToMA [72], EmbodiedAI [5], ToM-Agent [73] |
| Taxonomy | Method | Year | Lev. | Base LLM | Reasoning Paradigm | Key Mechanism |
|---|---|---|---|---|---|---|
| ToM Prompting | Generative Agent [75] | 2023 | 1 | GPT-3.5-turbo | Retrieval-Enhanced Neural Language Model Reasoning | Retrieval-Based Storage and Recursive Reflection Generation of Memory Streams |
| | CoT-ToM [76] | 2023 | 1 | GPT-4/GPT-3.5 | Language-Guided Stepwise Reasoning | CoT prompting, In-context learning |
| | CoToMA [72] | 2023 | 3 | GPT-4/GPT-3.5 | Neural language-based reasoning | To structurally infer and understand others’ perspectives and intentions |
| | SymbolicToM [77] | 2023 | 2 | GPT-4/Llama-13B | Neuro-symbolic hybrid | Symbolic belief graph tracking, Witness-based knowledge propagation |
| | SimToM [78] | 2024 | 2 | GPT-4/Llama2-13B | Simulation-based reasoning | Perspective-taking filtering, Two-stage prompting |
| | COKE [68] | 2024 | 1 | Llama-2-7B/Mistral-7B | Neural-Symbolic Fusion | Cognitive Knowledge Graph, Chained Cognitive Reasoning |
| | MindForge [79] | 2025 | 2 | GPT-4/Llama-3.1-8B | Neural-Symbolic Fusion + Causal Reasoning | Natural language inter-agent communication, ToM causal template |
| | ToM-Agent [73] | 2025 | 2 | GPT-4/GPT-3.5 | Neuro-symbolic + Simulation-based reasoning | BDI tracking with confidence disentanglement, Counterfactual reflection |
| | XToM [80] | 2025 | 2 | GPT-4o/DeepSeek R1 | Neural language-based reasoning | Cross-Language Consistency Evaluation |
| | DEL-ToM [81] | 2025 | 4 | GPT-4o/Llama3.1-8B | Neural-Symbolic Fusion | PBM-based trace verification, inference-time scaling |
| | VToM [66] | 2025 | 1 | GPT-4o | Multimodal neural reasoning | Key frames retrieval |
| Model-based Inference | ToMNet [47] | 2018 | 1 | - | Implicit Reasoning via Neural Networks | character, mental state, and prediction networks |
| | PGM-Aware Agent [67] | 2024 | 3 | GPT-o1/Llama-2/Claude-2 | Neuro-Symbolic Fusion + Probabilistic Reasoning | PGM-LLM fusion, Two-hop understanding |
| | BIP-ALM [82] | 2024 | 1 | GPT-4/Video-Llama 2 | Neural-Symbolic Fusion + BIP | Bayesian inverse planning, Language model-accelerated likelihood estimation |
| | BToM-EL [83] | 2024 | 1 | unknown | Neural-Symbolic Fusion + BIP | Bayesian Inverse Planning, Cognitive Logic Evaluation, Natural Language-Logic Conversion |
| | LIMP [18] | 2025 | 2 | GPT-4o/Gemini 1.5 Pro | Neural-Symbolic Fusion + BIP | Bayesian Probabilistic Inference |
| | Thought-tracing [84] | 2025 | 2 | GPT-4o, DeepSeek R1, Qwen2.5 | Neural-Symbolic Fusion + Sequential Monte Carlo | Hypothesis generation and propagation, Action likelihood-based weighting |
| | MetaMind [85] | 2025 | 2 | GPT-4, Claude-3.5, DeepSeek V3/R1 | Neural-Symbolic Fusion + BIP | Human-like Social Reasoning via a Three-stage Metacognitive Cycle |
| | AutoToM [86] | 2025 | Any | GPT-4o, Llama 3.1 70B, Gemini 2.0 | Neural-Symbolic Fusion + BIP | Automated Bayesian Inverse Planning and agent model discovery |
| Evaluation Dimension | Prompting Paradigm | Model-Based Inference Paradigm |
|---|---|---|
| Implementation Complexity | No fine-tuning required, only prompt design needed | Requires construction of mathematical models and reasoning frameworks |
| Interpretability | Black-box reasoning relying on the implicit capabilities of LLMs | Explicit models with traceable reasoning processes |
| Robustness of High-Order Reasoning | Accuracy drops significantly in third-order and higher-order reasoning tasks | Can handle high-order scenarios via recursive reasoning frameworks |
| Generalizability | Relies on pre-trained data with poor performance on novel scenarios | Can adapt to new scenarios through model adjustments |
| Real-Time Performance | Fast inference speed of LLMs | High computational complexity (e.g., Bayesian reasoning) |
| Dataset | Year | Exp. | Int. | Modality | Task Target/Application Scene | Dataset Detail |
|---|---|---|---|---|---|---|
| ToMi [102] | 2019 | ✕ | Text | Synthetic narrative scenarios through question answering | six distinct question types: Reality, Memory, 1st- and 2nd-order beliefs of two agents | |
| BigToM [53] | 2023 | ✕ | Text | fictional stories for evaluating “percepts to beliefs”, “percepts to actions”, and “actions to beliefs” | 5,000 evaluations generated from 200 causal templates focusing on 6 conditions | |
| CoToMA [72] | 2023 | Text | multi-agent collaborative scenario in rescue missions | 3 question types (introspection, 1st-/2nd-order ToM) | | |
| FANToM [103] | 2023 | ✕ | Text | information-asymmetric multiparty conversation scenarios | 256 conversations with 10K questions across 6 types | |
| Hi-ToM [95] | 2023 | ✕ | Text | High-order ToM evaluation in fictional Sally-Anne-like stories | Average 26.47 lines with 5 agents/questions per story | |
| High-order ToM [104] | 2024 | ✕ | ✕ | Text | High-order ToM evaluation in fictional social interaction scenarios | 7 stories, 140 statements, orders 2-6 |
| ExploreToM [65] | 2024 | ✕ | Text | fictional narrative scenarios | 1,620 stories, 1st-/2nd-order and state-tracking questions | |
| MAgIC [67] | 2024 | ✕ | Text | ToM evaluation in multi-agent competitions | 103 competition cases across 5 scenarios with 7 metrics | |
| TestingToM [105] | 2024 | ✕ | ✕ | Text | Social cognition evaluation | ToM testing with novel items and variants |
| EAI [74] | 2024 | Text | ToM evaluation via existing ethical and game-theoretic scenarios | ETHICS, MoralChoice, and StereoSet subsets | ||
| COKE [68] | 2024 | ✕ | Text | ToM evaluation in daily social situations across five topics | 45,369 cognitive chains, 62,328 nodes, 1,200 situations, 5 topics, 4 tasks | |
| OpenToM [106] | 2024 | ✕ | Text | ToM evaluation in fictional stories with personified characters | 696 narratives, 13,708 questions covering location tracking, multi-hop reasoning, and attitude inference | |
| LLM-ToM [107] | 2024 | ✕ | ✕ | Text | False-belief understanding in fictional narrative scenarios | 40 tasks, 2 types, 8 scenarios and 16 prompts per task |
| DynToM [108] | 2025 | ✕ | Text | ToM evaluation in social interaction scenarios | 1,100 contexts, 5,500 scenarios, 78,100 questions, 4 question types | |
| Decrypto [109] | 2025 | Text | mental states reasoning while encrypting/decrypting messages | 10 human games, up to 8 turns per game | ||
| SocialMaze [110] | 2025 | ✕ | Text | Social deduction games, daily life interactions, and digital community | 70,000 total instances across 6 tasks in 3 scenarios | |
| UniToMBench [111] | 2025 | ✕ | ✕ | Text | fictional story scenarios and social interaction contexts | 1,025 hand-written scenarios with 8 TOMBENCH task categories |
| XToM [80] | 2025 | ✕ | ✕ | Text | Multilingual ToM evaluation across fictional stories | 300 stories/dialogues, 5 languages, 3 sub-tasks |
| Watch-And-Help [112] | 2020 | Graphs, Video, etc. | Social perception and collaboration evaluation in household scenarios | 1011 training tasks, 5 activity categories, 30 predicate types | | |
| NOPA [69] | 2023 | ✕ | Visual 3D | Embodied household multi-agent collaboration scenarios | 10 testing episodes, 40 human trials, household tasks in virtual homes | |
| MMToM-QA [82] | 2024 | ✕ | Video+Text | ToM evaluation in household activity scenarios | 134 videos, 600 questions, 7 question types, ,462 frames per video | |
| MuMAToM [18] | 2025 | ✕ | Video+Text | multi-agent embodied household collaboration scenarios | 4 apartment environments with 900 questions of 3 types | |
| ChARTOM [113] | 2025 | ✕ | ✕ | Image+Text | ToM evaluation in misleading visual interpretation scenarios | 15 manipulation groups, 30 charts in pairs, 5 chart types, 2 question types |
| SoMi-ToM [71] | 2025 | Video, Image, Text | Embodied multi-agent collaborative crafting scenarios | 35 videos, 363 images, 1225 questions across 3 inference tasks | ||
| ToMLoc [66] | 2025 | ✕ | ✕ | Video+Text | ToM evaluation in real-world social interaction scenarios | 1,403 videos, 8,076 questions, 4-choice format |
| GridToM [114] | 2025 | ✕ | Video+Text | ToM evaluation in 2D grid world multi-agent scenarios | 1,296 samples, 27 map layouts, 3 question types |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
