Cognitive Robotics Review: Advances and New Directions

Alejandra Ciria; Bruno Lara

doi:10.20944/preprints202606.1004.v1

Submitted: 12 June 2026

Posted: 15 June 2026

Abstract

Cognitive Robotics (CR) seeks to understand and model cognition through embodied artificial agents, guided by the principles of embodied cognition theories. In this framework, cognition emerges from the dynamic coupling between the body and the environment, and is fundamentally rooted in sensorimotor interaction rather than abstract symbolic processing. This review integrates advances in CR, considering its bio-inspired foundations, including predictive mechanisms and intrinsic motivation, which are complementary principles for developmental learning. Emerging directions in CR are examined to clarify their implications for the field, highlighting their contributions and identifying open challenges. In this context, foundation models, such as Vision–Language–Action models, are reframed as structured starting conditions based on perceptual and motor priors. These priors allow agents to bypass learning low-level control from scratch and to focus instead on learning an internal model together with object-based semantics, ultimately providing a sensorimotor base onto which linguistic symbols can be grounded. In parallel, the framework of Basal Cognition is introduced as a conceptual extension that reconceptualizes cognition as a multiscale, self-organizing adaptive process, suggesting that the central challenge for CR is to develop systems whose multiscale organization itself constitutes cognition and intelligence. These perspectives point toward novel lines of research and debate in CR. Foundation models can enable the study of increasingly complex, developmentally grounded learning, while Basal Cognition extends embodiment across scales, opening the door to artificial systems in which morphogenesis, adaptive behavior, and learning emerge from self-organizing multiscale dynamics.

Keywords: cognitive robotics; embodied cognition; prediction; intrinsic motivation; foundation models; Basal Cognition

Subject: Computer Science and Mathematics - Robotics

1. Introduction

Within cognitive science, Cognitive Robotics (CR) emerged at the intersection of embodied cognition and computational modeling, using artificial agents to both study and synthesize cognition [1]. In this framework, cognition is grounded in the body and the environment, shaped by the agent’s morphology and sensorimotor capacities, and situated within specific physical and social contexts [2]. CR adopts embodied systems to investigate how cognition emerges through interaction, with humanoid robots providing a paradigmatic case by enabling natural engagement with human environments while serving as experimental platforms for studying cognition [3]. CR thus departs from disembodied, knowledge-based approaches by treating intelligence as adaptive behavior grounded in sensorimotor learning rather than abstract symbol manipulation (Figure 1).

This review examines CR through its bio-inspired core principles for sensorimotor learning and emerging computational and theoretical approaches. It analyzes how foundation models can serve as structured starting conditions for scaffolding developmental learning, and advances a conceptual shift grounded in Basal Cognition that reframes cognition in terms of multiscale self-regulation and organizational autonomy.

2. Core Principles

Recent advances in CR are increasingly organized around two complementary principles: prediction and intrinsic motivation. These principles lie at the core of sensorimotor learning during development. Key challenges in CR are also discussed (see Box 1).

2.1. Prediction

A central claim in CR is that cognition is grounded in sensorimotor learning and development, with early experience shaping increasingly complex behavior. The theory of sensorimotor contingencies (SMC) formalizes this by defining perception as the mastery of action–perception relations [4]. CR implementations demonstrate that agents can acquire structured environmental representations by constructing repertoires of SMC within context-sensitive action spaces [5]. Although the learning of SMC may initially remain at the level of associations, it is central for enabling agents to generate predictions about the sensory consequences of their actions.

In CR, the learning of internal models for prediction and control remains central. Forward models predict the sensory consequences of actions, while inverse models compute the motor commands required to reach desired outcomes [6,7]; such internal models are also known as world models when they capture the environment from the agent’s perspective. These models support adaptive control through prediction error minimization [8] and have been implemented across domains ranging from imitation [9] to navigation [10]. Developmental approaches emphasize that internal models emerge through SMC learning and exploration rather than being pre-specified. Strategies for initializing learning include motor babbling, which relies on random action sampling to establish sensorimotor mappings; goal babbling, which biases exploration and learning toward desired outcomes [11]; and canalizing babbling, which guides exploration through reflex-inspired mechanisms analogous to the prereaching reflex in newborns [12].
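The relation between motor babbling, forward models, and inverse models can be made concrete with a minimal sketch. Everything in it is an assumption chosen for illustration and is not taken from the cited works: the "body" is a simulated planar two-link arm (`arm_endpoint`), the forward model is a linear readout over hand-picked trigonometric features, and the inverse model is a simple nearest-neighbour lookup over babbled samples.

```python
import numpy as np

def arm_endpoint(q, l1=1.0, l2=1.0):
    """Simulated body: planar two-link arm mapping joint angles to hand position."""
    x = l1 * np.cos(q[..., 0]) + l2 * np.cos(q[..., 0] + q[..., 1])
    y = l1 * np.sin(q[..., 0]) + l2 * np.sin(q[..., 0] + q[..., 1])
    return np.stack([x, y], axis=-1)

def features(q):
    """Fixed trigonometric features (an assumption); only the linear readout is learned."""
    q1, q12 = q[..., 0], q[..., 0] + q[..., 1]
    return np.stack([np.cos(q1), np.sin(q1), np.cos(q12), np.sin(q12)], axis=-1)

rng = np.random.default_rng(0)

# Motor babbling: issue random joint commands and record the noisy sensory outcome.
q_babble = rng.uniform(-np.pi, np.pi, size=(500, 2))
s_babble = arm_endpoint(q_babble) + rng.normal(0.0, 0.01, size=(500, 2))

# Forward model: least-squares mapping from motor features to sensory outcome.
W, *_ = np.linalg.lstsq(features(q_babble), s_babble, rcond=None)

def forward_model(q):
    """Predict the hand position that a joint command would produce."""
    return features(q) @ W

def inverse_model(target):
    """Crude inverse model: reuse the babbled command whose outcome was closest."""
    idx = np.argmin(np.linalg.norm(s_babble - target, axis=1))
    return q_babble[idx]

# Evaluate prediction error on unseen commands.
q_test = rng.uniform(-np.pi, np.pi, size=(100, 2))
prediction_error = np.linalg.norm(
    forward_model(q_test) - arm_endpoint(q_test), axis=1
).mean()

# Use the inverse model to reach a target inside the workspace.
target = np.array([1.0, 1.0])
q_cmd = inverse_model(target)
reach_error = np.linalg.norm(arm_endpoint(q_cmd) - target)
```

Goal babbling would differ mainly in how samples are collected: instead of drawing `q_babble` uniformly, the agent would sample targets in outcome space and refine the inverse model around them.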

Artificial agents have been used as experimental models to study the emergence of the Self, demonstrating that it arises developmentally from embodied sensorimotor interactions and layered predictive processes [13]. CR implementations show that agents can distinguish sensory situations produced by self-generated actions from those produced by external events using internal models [14]. A recent implementation of human–robot interaction proposed that the sense of agency, a main component of the Self, arises from a dynamic balance between top-down predictions and bottom-up sensory input, where prioritizing predictions enhances agency while prioritizing sensory input promotes adaptive behavior [15].
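A toy version of self/other attribution through an internal model is sketched below. It is not the method of the cited studies: the one-dimensional plant, its parameters (`TRUE_GAIN`, `NOISE_STD`), and the threshold rule in `attribute` are all hypothetical and chosen only to show how forward-model prediction error can separate self-generated from externally caused sensations.

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_GAIN, NOISE_STD = 0.8, 0.02  # hypothetical plant parameters

def sense(command, external=0.0):
    """Sensed displacement: effect of own action, plus any external push, plus noise."""
    return TRUE_GAIN * command + external + rng.normal(0.0, NOISE_STD)

# Learn a one-parameter forward model from self-generated movements only.
commands = rng.uniform(-1.0, 1.0, size=200)
outcomes = np.array([sense(c) for c in commands])
gain_hat = float(commands @ outcomes / (commands @ commands))
residual_std = float(np.std(outcomes - gain_hat * commands))

def attribute(command, observed, k=4.0):
    """Label an observation 'self' if the forward model explains it within k residual stds."""
    error = abs(observed - gain_hat * command)
    return "self" if error < k * residual_std else "external"

label_self = attribute(0.5, sense(0.5))                # no perturbation
label_ext = attribute(0.5, sense(0.5, external=0.5))   # pushed by the environment
```

The threshold `k` trades off false alarms against missed perturbations; a learned or adaptive precision estimate would play this role in a predictive processing account.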

Underlying many of these developments is the broader framework of predictive processing, in which perceptual and active inference minimize prediction error, a key driver of biological behavior [16,17,18]. In CR, this has led to the proposal of unified models that integrate perception, action, learning, and prediction within a single inferential process, with the Free Energy Principle (FEP) providing a formal mathematical account [19]. Active inference has been applied to humanoid robots to enable adaptive body perception and robust sensorimotor behaviors, such as reaching and object tracking, even under noise and model mismatch [20]. More recent work shows that active inference supports lifelong learning by continuously updating and expanding internal models to generate sensorimotor dynamics for gait control under changing body conditions [21]. Complementary approaches demonstrate that incorporating a self-prior enables embodied agents to develop a sense of Self by distinguishing between self-generated and externally caused sensory events [22]. Together, these developments point toward a unified view in which cognition emerges from predictive, data-efficient architectures grounded in internal models and continuous interaction [23].
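For readers unfamiliar with the formalism, the quantity minimized under the FEP can be written in its standard variational form. The notation is an assumption introduced here for illustration and is not taken from the works cited above: o denotes sensory observations, s hidden states, p(o, s) the agent's generative model, and q(s) an approximate posterior over hidden states.

```latex
F[q, o] = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
        = D_{\mathrm{KL}}\big[q(s) \,\|\, p(s \mid o)\big] - \ln p(o)
        \;\geq\; -\ln p(o)
```

Because the KL term is non-negative, F is an upper bound on surprise, -ln p(o). In this reading, perceptual inference adjusts q to lower F, whereas active inference selects actions that change o so that observations better match the model's predictions.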

2.2. Intrinsic Motivation

Intrinsic motivation provides a complementary mechanism for structuring learning in CR, enabling agents to expand their capabilities autonomously through internal signals such as novelty, prediction error, or learning progress [24]. Foundational work showed that agents can autonomously acquire structured skill repertoires by maximizing learning progress, by seeking situations where their predictions improve the most, and actively selecting experiences that are neither too easy nor too difficult, rather than exploring randomly or optimizing fixed goals [25]. Subsequent approaches improved learning efficiency by shifting exploration from action to goal spaces [26], with interest-driven mechanisms further guiding goal selection based on learning progress and supporting skill acquisition in complex domains [27].
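The idea of selecting experiences by learning progress can be sketched as follows. This is a simplified illustration and not the algorithm of the cited works: the windowed estimate in `learning_progress`, the epsilon-greedy rule in `choose_goal`, and the three toy goals in `GOALS` are assumptions made for the example.

```python
import numpy as np

def learning_progress(errors, window=10):
    """Decrease in mean prediction error between the two most recent windows."""
    if len(errors) < 2 * window:
        return 0.0  # not enough history yet; random exploration fills the gap
    older = np.mean(errors[-2 * window:-window])
    recent = np.mean(errors[-window:])
    return float(older - recent)

def choose_goal(histories, rng, epsilon=0.2):
    """Pick the goal with the highest learning progress, exploring with probability epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(histories)))
    return int(np.argmax([learning_progress(h) for h in histories]))

# Toy error model per goal: (initial error, decay per practice, error floor).
GOALS = [(0.05, 0.900, 0.05),   # already mastered: error stays at its floor
         (1.00, 0.995, 0.05),   # learnable: error shrinks with practice
         (1.00, 1.000, 1.00)]   # currently unlearnable: error never shrinks

rng = np.random.default_rng(0)
histories = [[] for _ in GOALS]
counts = np.zeros(len(GOALS), dtype=int)
for step in range(300):
    g = choose_goal(histories, rng)
    start, decay, floor = GOALS[g]
    error = max(floor, start * decay ** counts[g]) + rng.normal(0.0, 0.01)
    histories[g].append(error)
    counts[g] += 1

most_practiced = int(np.argmax(counts))
```

In this toy setting the agent concentrates practice on the goal whose error is actually decreasing, and largely ignores both the mastered and the unlearnable goal, which is the "neither too easy nor too difficult" behavior described above.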

Recent extensions integrate predictive coding and self-supervised learning into unified frameworks that support skill learning, prediction, and fault detection by learning internal models of sensorimotor repertoires with local learning rules [28]. Intrinsic motivation has also been used for autonomous navigation, enabling self-supervised exploration and adaptation [29], and has been combined with lifelong learning mechanisms that mitigate catastrophic forgetting [21], while hierarchical and hybrid representations enable dynamic planning in complex environments [30].

Integrating intrinsic motivation with deep reinforcement learning, using prediction error as a curiosity mechanism rather than heavily engineered reward functions, enables agents to learn complex locomotion and manipulation tasks from sparse rewards [31]. Without externally specified goals, systems such as Baby Sophia show that early sensorimotor behaviors, such as self-touch and hand regard, can emerge through intrinsically driven exploration, progressing from undirected movement to structured interaction [32]. Intrinsic motivation is also being extended to multi-agent settings, where shared intrinsic rewards allow groups of robots to coordinate exploration, improving efficiency and cooperation, suggesting that intrinsically motivated learning can scale from individual agents to collective systems [33].
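Using prediction error as a curiosity signal can be illustrated with a small sketch. It is a stand-in and not the architecture of the cited work: the class `CuriosityBonus`, its online linear forward model, and the learning rate are assumptions, whereas deep reinforcement learning systems typically use neural networks for the same role.

```python
import numpy as np

class CuriosityBonus:
    """Intrinsic reward equal to the squared error of an online linear forward model."""

    def __init__(self, state_dim, action_dim, lr=0.05, scale=1.0):
        self.W = np.zeros((state_dim, state_dim + action_dim))
        self.lr, self.scale = lr, scale

    def reward(self, state, action, next_state):
        x = np.concatenate([state, action])
        error = next_state - self.W @ x
        # Least-mean-squares update: transitions seen often become predictable.
        self.W += self.lr * np.outer(error, x)
        return self.scale * float(error @ error)

bonus = CuriosityBonus(state_dim=2, action_dim=1)

# Repeating the same transition makes it progressively less rewarding.
s, a, s_next = np.array([0.0, 1.0]), np.array([1.0]), np.array([0.5, 1.0])
familiar = [bonus.reward(s, a, s_next) for _ in range(50)]

# A transition the model has never seen yields a large bonus.
novel = bonus.reward(np.array([1.0, 0.0]), np.array([-1.0]), np.array([2.0, -1.0]))

# With sparse rewards, the extrinsic term is usually zero and curiosity drives exploration.
extrinsic = 0.0
total_reward = extrinsic + novel
```

The bonus is added to whatever extrinsic reward the task provides, so the same policy-learning algorithm can be used unchanged.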

Box 1. Key Challenges in Cognitive Robotics

Interoception and internal signals. CR implementations typically rely on exteroceptive information, leaving internal signals largely unexplored. Incorporating interoceptive signals such as energy expenditure [34] and artificial pain [35] may allow agents to regulate internal states and enable adaptive trade-offs between performance and system integrity. The challenge is not simply to incorporate these signals, but to reconceptualize control architectures so that internal regulation becomes a primary organizing principle of behavior.

Monitoring of prediction error dynamics (PED). Prediction error dynamics refer to the temporal evolution of discrepancies between predicted and actual sensory outcomes. As an agent learns, prediction error decreases over time, and this reduction has been associated with positive emotional valence [36,37]. PED can then be exploited for autonomous goal selection if agents monitor the temporal dynamics of prediction error and associate these dynamics with the goal, forming expectations about future performance. Expected PED can guide behavior, leading agents to preferentially select goals whose anticipated error reduction signals learning, while avoiding goals that are already mastered [38]. Although still underexplored in CR, integrating the monitoring of PED as a unified mechanism for prediction, emotional valence, and intrinsic motivation represents a key challenge, particularly in the context of the two directions advanced in this review. When combining CR with foundation models, this mechanism could scaffold the progressive acquisition of goal-relevant sensorimotor knowledge, bypassing the need to learn low-level control from scratch. By monitoring prediction error dynamics, agents could autonomously select goals whose complexity matches their current competence, starting from basic affordances and progressively moving toward more demanding goals as performance improves. Drawing inspiration from the principles of Basal Cognition, a CR system could monitor PED within and between organizational scales, signaling local and global performance and intrinsic goals grounded in viability and self-maintenance.

Scalability and architectural integration. Most approaches focus on isolated sensorimotor mechanisms rather than integrating multiple cognitive processes into coherent architectures. To achieve this integration, it is first crucial to understand how the temporal interplay between internal models, prediction error dynamics, and intrinsic drives structures experience to support memory and generalization. Second, the field lacks a principled account of how predictive and motivational mechanisms co-adapt across temporal and functional scales. On the predictive side, models operate at different time scales, from milliseconds in motor control to hours or days in event planning. Functionally, intrinsic motivation drives learning from sensorimotor contingencies to goal formation and selection. Integrating these mechanisms can lead to the progressive emergence of skills, goals, and abstractions, yet this integration remains largely unexplored. Worthy of special attention is the work in [39], which integrates predictive processing hierarchically, combining abstract goal-level representations with low-level sensorimotor dynamics and supporting the gradual extension of capabilities, including tool use, through the coordination of prediction and control. New platforms provide solid test beds for addressing this challenge. For example, MIMo v2 incorporates growth as an explicit dimension, modeling the transition from birth to early childhood through dynamic changes in morphology, actuation, perception, and sensorimotor delays, enabling more realistic accounts of how evolving constraints shape learning and the progressive organization of behavior [40].

3. New Directions

Emerging directions in CR are examined to clarify their implications for the field, highlighting their contributions, identifying open challenges, and pointing toward new lines of research and debate (see Figure 2).

3.1. Foundation Models

Deep learning has long supported advances in perception, control, and sensorimotor learning in embodied agents [1]. Foundation models are now being introduced into robotics either as modular components or as end-to-end Vision–Language–Action (VLA) models [41]. Representative examples include Octo, which enables flexible goal execution from language or an image across multiple robotic embodiments [42], and OpenVLA, which learns generalist policies transferable across platforms [43,44]. Complementary approaches bridge high-level reasoning and low-level control for manipulation and multi-step task execution [45,46,47]. These models demonstrate scalability and generalization, but also reveal tensions with CR’s core principles.

Foundation models are built on large-scale pretraining and pre-structured abstractions, which risks privileging top-down organization over emergent sensorimotor grounding, potentially diluting CR’s core commitments to embodiment, autonomy, and developmental learning [48]. Recent bio-inspired augmentations attempt to address this by incorporating structured memory, grounded reasoning, and intrinsic motivation [49], yet sensorimotor interaction remains treated as input to be processed rather than as the generative basis of learning, leaving grounding as a problem of model augmentation rather than an emergent property of embodied, developmental dynamics.

Instead of rejecting foundation models, we propose reinterpreting their role as structured starting conditions that constrain and scaffold learning, analogous to the embodied priors provided by evolution in biological systems. This idea is supported by evidence in biological agents showing that predictions can be embedded in the structure of sensory systems as context-independent constraints that guide perception and learning [50], and that locomotor behavior builds upon conserved inborn motor primitives progressively refined during development [51,52]. This perspective directly addresses a central limitation of data-driven approaches: the combinatorial space of environments, goals, tasks, and actions is effectively unbounded, making it infeasible to capture general intelligence through pretraining alone [52,53]. Rather than attempting to approximate this space exhaustively, our proposal reframes foundation models as perceptual and motor priors, while adaptive, goal-directed behavior emerges from sensorimotor interactions.

Focusing on perceptual and motor capacities rather than language-based training, foundation models can provide initial competencies, such as object detection, motor primitives, and policy structures, that function as inductive priors, scaffolding developmental learning and sensorimotor exploration, consistent with the “starting small” hypothesis [35]. As illustrated in Figure 2, foundation models can be integrated into CR frameworks as perceptual and motor priors, or structured starting conditions. For instance, a pre-trained model like Octo [42] can be subordinated to an intrinsically motivated architecture that monitors prediction errors to generate latent goal representations autonomously, allowing an embodied agent to bypass learning low-level control from scratch and focus instead on learning an internal model and object-based semantics that provide a sensorimotor base onto which linguistic symbols can be grounded.

3.2. Basal Cognition

Research in CR, particularly in evolutionary and swarm robotics, has consistently shown that adaptive behavior can emerge from embodied, distributed, and self-organizing processes [1]. The framework of Basal Cognition extends this view by positioning cognition as an emergent property of fundamental regulatory and adaptive dynamics across biological scales, from single and multicellular organisms to morphogenetic processes. Basal Cognition provides a unifying perspective that both reinterprets existing CR approaches and guides the development of future embodied artificial systems.

The Technological Approach to Mind Everywhere (TAME) formalizes this perspective by proposing that cognition emerges from strongly embodied, multiscale interactions across diverse substrates, including processes such as morphogenesis understood as collective intelligence arising from cellular dynamics, offering a unified account of adaptive, goal-directed behavior across biological and artificial systems [54]. Neural Cellular Automata (NCAs) provide concrete implementations of these principles, demonstrating how complex, robust patterns emerge from local interactions among minimal computational units, serving as artificial analogs of distributed cognition in multicellular systems, by parameterizing local update rules as a shared differentiable neural network optimized through end-to-end gradient-based learning [55].

NCAs reproduce morphogenesis-like behavior through learned local update rules, have been extended to investigate how bioelectrical dynamics guide regenerative processes [56], and exhibit robust, goal-directed dynamics without centralized control across domains ranging from robotic morphology regeneration [57] to abstract reasoning tasks such as ARC-AGI [58]. In parallel, hybrid architectures integrate self-organizing NCA dynamics with differentiable logic circuits [59] and Local Pattern Producing Networks for scalable, high-resolution pattern synthesis [60].
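The local update rule at the heart of an NCA can be sketched in a few lines. The sketch below shows only the forward dynamics and rests on several assumptions: the Sobel-based perception in `perceive`, the layer sizes, the `fire_rate`, and the toroidal grid are illustrative choices, and the weights are random rather than trained. In an actual NCA the same weights would be optimized end-to-end with automatic differentiation.

```python
import numpy as np

def perceive(state):
    """Each cell sees its own channels plus Sobel gradients of its 3x3 neighbourhood."""
    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float) / 8.0
    sobel_y = sobel_x.T
    padded = np.pad(state, ((1, 1), (1, 1), (0, 0)), mode="wrap")  # toroidal grid
    h, w, _ = state.shape
    gx = np.zeros_like(state)
    gy = np.zeros_like(state)
    for dy in range(3):
        for dx in range(3):
            patch = padded[dy:dy + h, dx:dx + w, :]
            gx += sobel_x[dy, dx] * patch
            gy += sobel_y[dy, dx] * patch
    return np.concatenate([state, gx, gy], axis=-1)

def nca_step(state, w1, b1, w2, rng, fire_rate=0.5):
    """One asynchronous update: the same small network is applied at every cell."""
    hidden = np.maximum(perceive(state) @ w1 + b1, 0.0)     # shared per-cell MLP with ReLU
    delta = hidden @ w2                                      # proposed residual update
    mask = rng.random(state.shape[:2] + (1,)) < fire_rate    # stochastic cell firing
    return state + delta * mask

rng = np.random.default_rng(0)
H, W, C, HIDDEN = 16, 16, 4, 32
w1 = rng.normal(0.0, 0.1, size=(3 * C, HIDDEN))   # untrained, random weights
b1 = np.zeros(HIDDEN)
w2 = rng.normal(0.0, 0.1, size=(HIDDEN, C))

state = np.zeros((H, W, C))
state[H // 2, W // 2, :] = 1.0                     # a single seed cell
for _ in range(3):
    state = nca_step(state, w1, b1, w2, rng)
```

Because every cell only reads its immediate neighbourhood, information from the seed can spread by at most one cell per step; any global pattern therefore has to arise from repeated local interactions, with no centralized controller.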

As NCAs are increasingly integrated with broader AI tools, principles from Basal Cognition are likely to be progressively incorporated into CR frameworks, bringing them closer to core principles such as intrinsic motivation and the prediction of the sensory consequences of action. A principled basis for this convergence is the proposal that cognition across biological and artificial substrates shares a dual invariant of remapping and navigation within embedding spaces enacted through scale-free iterative error minimization [61]. Crucially, in such systems, the mechanisms governing learning and inference are identical. The rules that reshape internal representations during learning also guide navigation within embedding spaces, unifying sensorimotor prediction and adaptive self-regulation within a substrate-independent framework.

A further implication concerns computational architecture: biological agents achieve robustness and lifelong adaptability in part through polycomputing (the capacity of the same substrate to compute different functions for different observers across scales simultaneously), suggesting that future CR systems may need to move beyond modular designs toward multiscale architectures where competence is distributed across levels of organization [62]. NCAs instantiate this principle concretely, implementing control as a fully distributed process in which local units interact through iterative updates to produce coordinated, goal-directed behavior without centralized computation. These ideas resonate with bio-inspired principles such as co-design of body and controller, prediction through interaction, and hierarchical distributed architectures, proposed to address current limitations in AI and embodied systems [63].

From this perspective, embodiment in CR can be understood as a multiscale phenomenon emerging from tightly coupled interactions between morphology, control, and continuous feedback. This view is further supported by recent work framing cognition as the dynamic transformation of embodied information through distributed, physically realized processes that support viability and self-maintenance [64]. Taken together, these ideas suggest that the central challenge for CR is to develop systems whose adaptive, multiscale organization itself constitutes cognition and intelligence (see Table 1).

4. Conclusions

CR holds the assumption that cognition emerges from the interaction between embodiment, prediction, and intrinsically motivated development. In this framework, prediction-based mechanisms and intrinsic motivation drive the acquisition of sensorimotor contingencies and the formation of internal models, and support lifelong learning. Future progress will depend on integrating these perspectives into coherent architectures that support lifelong learning and multiscale organization. Advancing CR requires understanding how predictive and intrinsic motivational mechanisms co-adapt over time and across levels of organization, structuring experience in ways that support memory, generalization, and the progressive emergence of skills, tasks, goals, and abstractions. Moving beyond goal-oriented performance toward systems capable of self-maintenance, lifelong learning and development, and autonomous goal formation remains a central challenge.

Recent developments introduce both opportunities and conceptual tensions. Foundation models offer powerful priors that can accelerate early-stage learning, but risk reintroducing disembodied, top-down representations if not properly integrated. Reinterpreting these models as structured starting conditions, providing perceptual and motor priors, is a promising path for incorporating them within developmental principles while preserving embodiment and grounding in sensorimotor interactions. Additionally, the perspective of Basal Cognition could further extend CR by reframing cognition as a multiscale property of self-organizing systems grounded in viability and adaptive regulation, challenging traditional boundaries of cognition and suggesting that intelligence arises from distributed, physically realized processes rather than centralized computation. These perspectives point toward novel lines of research and debate in CR.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of Generative AI and AI-Assisted Technologies

During the preparation of this work, the authors used ChatGPT (OpenAI) version GPT-5.3 in order to refine writing style and provide suggestions for greater concision. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the published article.

References

Cangelosi, A.; Asada, M. Cognitive robotics; MIT Press, 2022.
Pezzulo, G.; Barsalou, L. W.; Cangelosi, A.; Fischer, M. H.; McRae, K.; Spivey, M. J. Computational grounded cognition: a new alliance between grounded cognition and computational modeling. Front. Psychol. 2013, 3, 612.
Vernon, D.; Sandini, G. The importance of being humanoid. Int. J. Humanoid Robot. 2024, 21, 2350022.
O’Regan, J. K.; Noë, A. A sensorimotor account of vision and visual consciousness. Behav. Brain Sci. 2001, 24, 939–973.
Stefanini, E.; Lentini, G.; Grioli, G.; Catalano, M. G.; Bicchi, A. Exploring saliency for learning sensory-motor contingencies in loco-manipulation tasks. Robotics 2024, 13, 58.
Wolpert, D. M.; Ghahramani, Z.; Jordan, M. I. An internal model for sensorimotor integration. Science 1995, 269, 1880–1882.
Kawato, M. Internal models for motor control and trajectory planning. Curr. Opin. Neurobiol. 1999, 9, 718–727.
Wolpert, D. M.; Diedrichsen, J.; Flanagan, J. R. Principles of sensorimotor learning. Nat. Rev. Neurosci. 2011, 12, 739–751.
Demiris, Y.; Khadhouri, B. Hierarchical attentive multiple models for execution and recognition of actions. Robot. Auton. Syst. 2006, 54, 361–369.
Möller, R.; Schenck, W. Bootstrapping cognition from behavior—a computerized thought experiment. Cogn. Sci. 2008, 32, 504–542.
Rolf, M. Goal Babbling for an Efficient Bootstrapping of Inverse Models in High Dimensions. PhD thesis, Bielefeld University, Bielefeld, Germany, 2012.
Fedozzi, M. G.; Rea, F.; Sandini, G.; Triesch, J.; Sciutti, A. Canalizing babbling: Development-inspired goal sampling for visuo-motor learning. In 2025 IEEE International Conference on Development and Learning (ICDL); IEEE, 2025; pp. 1–6.
T. J. Prescott, K. Vogeley, A. Wykowska, Understanding the sense of self through robotics, Science Robotics 9 (95) (2024) eadn2733. ** The paper proposes that artificial agents can serve as experimental models to study the emergence of the sense of self by implementing embodied, predictive mechanisms. It shows that aspects of the self arise from sensorimotor interaction and layered predictive processes, linking bodily experience to higher-level cognition. Importantly, it lays a developmental line for the emergence of the self, in natural and artificial agents.
G. Schillaci, C.-N. Ritter, V. V. Hafner, B. Lara, Body representations for robot ego-noise modelling and prediction. Towards the development of a sense of agency in artificial agents, in: Proceedings of the Artificial Life Conference 2016, MIT Press, 2016, pp. 390–397.
W. Ohata, J. Tani, Characterizing the sense of agency in human–robot interaction based on the free energy principle, npj Complexity 2 (2025) 12. ** Within the free-energy framework, Sense of Agency in human–robot interaction is modeled as emerging from the balance between top-down predictions and bottom-up sensory input. Prioritizing top-down processes leads to a more self-driven behavior and higher perceived agency, while prioritizing sensory input leads to more adaptive behavior and reduced agency.
Friston, K. J. A theory of cortical responses. Philos. Trans. R. Soc. B Biol. Sci. 2005, 360, 815–836.
Friston, K. J.; Stephan, K. Free-energy and the brain. Synthese 2007, 159, 417–458.
Clark, A. Whatever next? predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci. 2013, 36, 181–204.
Ciria, A.; Schillaci, G.; Pezzulo, G.; Hafner, V. V.; Lara, B. Predictive processing in cognitive robotics: a review. Neural Comput. 2021, 33, 1402–1432.
Oliver, G.; Lanillos, P.; Cheng, G. An empirical study of active inference on a humanoid robot. IEEE Trans. Cogn. Dev. Syst. 2021, 14, 462–471.
Szadkowski, R. J.; Faigl, J. Lifelong active inference of gait control. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 19133–19144.
D. Kim, H. Kanazawa, Y. Kuniyoshi, Active inference with a self-prior in the mirror-mark task, arXiv preprint arXiv:2604.09673 (2026). ** Under the framework of active inference and the free energy principle, this paper proposes the learning of a self-prior as a model of the self. This is then used to discover a mark in the self in the mirror test. The proposal is implemented on a simulated artificial agent building a body-schema that guides action planning.
Taniguchi, T.; Murata, S.; Suzuki, M.; Ognibene, D.; Lanillos, P.; Ugur, E.; Jamone, L.; Nakamura, T.; Ciria, A.; Lara, B.; et al. World models and predictive coding for cognitive and developmental robotics: frontiers and challenges. Adv. Robot. 2023, 37, 780–806.
Rayyes, R. Intrinsic motivation learning for real robot applications. Front. Robot. AI 2023, 10, 1102438.
Oudeyer, P.-Y.; Kaplan, F.; Hafner, V. V. Intrinsic motivation systems for autonomous mental development. IEEE Trans. Evol. Comput. 2007, 11, 265–286.
Baranes, A.; Oudeyer, P.-Y. Active learning of inverse models with intrinsically motivated goal exploration in robots. Robot. Auton. Syst. 2013, 61, 49–73.
R. Rayyes, H. Donat, J. Steil, Efficient online interest-driven exploration for developmental robots, IEEE Transactions on Cognitive and Developmental Systems 14 (2022) 1367–1377. ** Pursuing open-ended learning, the authors propose an interest measure in which the system selects goals driven by maximizing expected progress, resulting in an adaptive sampling policy that improves sample efficiency and convergence of learned inverse/forward models in high-dimensional sensorimotor spaces. [CrossRef]
Mahajan, P.; Tang, M.; Li, T. E.; Havoutis, I.; Seymour, B. Neural associative skill memories for safer robotics and modelling human sensorimotor repertoires. Neural Comput. 2025, 1–27. [Google Scholar] [CrossRef] [PubMed]
D. de Tinguy, T. Verbelen, E. Gamba, B. Dhoedt, Zero-shot structure learning and planning for autonomous robot navigation using active inference, ArXiv abs/2510.09574 (2025). [CrossRef]
Priorelli, M.; Stoianov, I. P. Dynamic planning in hierarchical active inference. Neural Netw. Off. J. Int. Neural Netw. Soc. 2024, 185, 107075. [Google Scholar]
C. Schwarke, V. Klemm, M. v. d. Boon, M. Bjelonic, M. Hutter, Curiosity-driven learning of joint locomotion and manipulation tasks, in: J. Tan, M. Toussaint, K. Darvish (Eds.), Proceedings of The 7th Conference on Robot Learning, volume 229 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 2594–2610. URL: https://proceedings.mlr.press/v229/schwarke23a.html.
Zarifis, S.; Chalkiadakis, I.; Chardouveli, A.; Moutzouri, V.; Sotirchos, A.; Papadimitriou, K.; Filntisis, P.; Efthymiou, N.; Maragos, P.; Pastra, K. Baby sophia: A developmental approach to self-exploration through self-touch and hand regard. arXiv 2025, arXiv:2511.09727. [Google Scholar] [CrossRef]
Fu, H.; Liu, W.; Zhou, S. Intrinsic-motivation multi-robot social formation navigation with coordinated exploration. Eng. Appl. Artif. Intell. 2025, 159, 111740. [Google Scholar]
A. Augello, S. Gaglio, I. Infantino, U. Maniscalco, G. Pilato, F. Vella, Roboception and adaptation in a cognitive robot, Robotics and Autonomous Systems 164 (2023) 104400. * The paper introduces an artificial interoceptive system that encodes internal physical states of an agent (e.g., battery state, motor load) as perceptual signals integrated into the control loop. Using reinforcement learning, the robot adapts its behavior by optimizing task performance while regulating these internal variables, enabling self-preservation and context-aware action selection. [CrossRef]
M. Asada, A. Cangelosi, Reevaluating development and embodiment in robotics, Device 2 (2024). ** This perspective revisits development and embodiment in cognitive robotics in light of recent AI advances, proposing the “starting small” principle as a developmental alternative to big-data training paradigms and outlining key challenges for integrating foundation models with incremental, embodied, and socially grounded learning.
Kiverstein, J.; Miller, M.; Rietveld, E. The feeling of grip: novelty, error dynamics, and the predictive brain. Synthese 2019, 196, 2847–2869.
Joffily, M.; Coricelli, G. Emotional valence and the free-energy principle. PLoS Comput. Biol. 2013, 9, e1003094.
Schillaci, G.; Pico Villalpando, A.; Hafner, V. V.; Hanappe, P.; Colliaux, D.; Wintz, T. Intrinsic motivation and episodic memories for robot exploration of high-dimensional sensory spaces. Adapt. Behav. 2021, 29, 549–566.
Hiruma, H.; Ito, H.; Mori, H.; Ogata, T. Deep active visual attention for real-time robot motion generation: Emergence of tool-body assimilation and adaptive tool-use. IEEE Robot. Autom. Lett. 2022, 7, 8550–8557.
López, F. M.; Lenz, M.; Fedozzi, M. G.; Aubret, A.; Triesch, J. Mimo grows! simulating body and sensory development in a multimodal infant model. In 2025 IEEE International Conference on Development and Learning (ICDL); IEEE, 2025; pp. 1–6.
Y. Ma, Z. Song, Y. Zhuang, J. Hao, I. King, A survey on vision-language-action models for embodied ai, arXiv preprint arXiv:2405.14093 (2024). * This survey offers a structured taxonomy of Vision-Language-Action (VLA) models in embodied AI, highlighting their architectural components and key limitations in generalization and real-world adaptability, and serving as a central reference on the integration of foundation models into robotics.
O. M. Team; Ghosh, D.; Walke, H.; Pertsch, K.; Black, K.; Mees, O.; Dasari, S.; Hejna, J.; Kreiman, T.; Xu, C.; et al. Octo: An open-source generalist robot policy. arXiv 2024.
Kim, M. J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; Nair, S.; Rafailov, R.; Foster, E.; Lam, G.; Sanketi, P.; et al. Openvla: An open-source vision-language-action model. arXiv 2024.
O’Neill, A.; Rehman, A.; Maddukuri, A.; Gupta, A.; Padalkar, A.; Lee, A.; Pooley, A.; Gupta, A.; Mandlekar, A.; Jain, A.; et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In 2024 IEEE International Conference on Robotics and Automation (ICRA); IEEE, 2024; pp. 6892–6903.
K. Zhang, R. Xu, P. Ren, J. Lin, H. Wu, L. Lin, X. Liang, Robridge: A hierarchical architecture bridging cognition and execution for general robotic manipulation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 14590–14601.
Jin, C.; Tan, W.; Yang, J.; Liu, B.; Song, R.; Wang, L.; Fu, J. Alphablock: Embodied finetuning for vision-language reasoning in robot manipulation. arXiv 2023, arXiv:2305.18898.
Lykov, A.; Konenkov, M.; Gbagbe, K. F.; Litvinov, M.; Davletshin, D.; Fedoseev, A.; Cabrera, M. A.; Peter, R.; Tsetserukou, D. Cognitiveos: Large multimodal model based system to endow any type of robot with generative ai. In 2025 IEEE International Conference on Robotics and Automation (ICRA); IEEE, 2025; pp. 16256–16261.
Vernon, D. The future of research in cognitive robotics: Foundation models or developmental cognitive models? Adv. Robot. Res. 2025, e202500066.
L. Chen, S. M. Nguyen, Foundational models for robotics need to be made bio-inspired, 2025 IEEE International Conference on Advanced Robotics and its Social Impacts (ARSO) (2025) 126–133. * This work outlines a biologically inspired framework for advancing foundation models in robotics, emphasizing the integration of structured memory systems, grounded reasoning (e.g., embodied chain-of-thought), multimodal sensorimotor feedback, and self-motivated learning to support scalable, generalizable, and goal-directed robotic behavior.
Teufel, C.; Fletcher, P. C. Forms of prediction in the nervous system. Nat. Rev. Neurosci. 2020, 21, 231–242.
Dominici, N.; Ivanenko, Y.; Cappellini, G.; d’Avella, A.; Mondì, V.; Cicchese, M.; Fabiano, A.; Silei, T.; Paolo, A. D.; Giannini, C.; Poppele, R.; Lacquaniti, F. Locomotor primitives in newborn babies and their development. Science 2011, 334, 997–999.
Kuniyoshi, Y.; Yorozu, Y.; Suzuki, S.; Sangawa, S.; Ohmura, Y.; Terada, K.; Nagakubo, A. Emergence and development of embodied cognition: A constructivist approach using robots. Prog. Brain Res. 2007, 164, 425–445.
Sandini, G.; Sciutti, A.; Morasso, P. Mutual human-robot understanding for a robot-enhanced society: the crucial development of shared embodied cognition. Front. Artif. Intell. 2025, 8, 1608014.
Levin, M. Technological approach to mind everywhere: An experimentally-grounded framework for understanding diverse bodies and minds. Front. Syst. Neurosci. 2021, 16.
Mordvintsev, A.; Randazzo, E.; Niklasson, E.; Levin, M. Growing neural cellular automata. Distill 2020, 5, e23.
Hansali, S.; Pio-Lopez, L.; Lapalme, J. V.; Levin, M. The role of bioelectrical patterns in regulative morphogenesis: An evolutionary simulation and validation in planarian regeneration. IEEE Trans. Mol. Biol. Multi-Scale Commun. 2025, 11, 305–331.
B. Hartl, M. Levin, L. Pio-Lopez, Neural cellular automata: Applications to biology and beyond classical ai, Physics of Life Reviews 56 (2026) 94–108. ** This review surveys applications of neural cellular automata (NCAs) to biological modeling, including morphogenesis, regeneration, aging, bioelectricity, and molecular design, and argues for their relevance as models of multiscale biological systems composed of agential materials. It highlights NCAs as a form of collective, self-organizing AI and discusses their potential beyond classical approaches, along with current challenges and limitations.
K. Xu, R. Miikkulainen, Neural cellular automata for arc-agi, in: Artificial Life Conference Proceedings 37, volume 2025, MIT Press, 2025, p. 16.
P. Miotti, E. Niklasson, E. Randazzo, A. Mordvintsev, Differentiable logic cellular automata: From game of life to pattern generation, in: Artificial Life Conference Proceedings 37, volume 2025, MIT Press, 2025, p. 54.
Pajouheshgar, E.; Xu, Y.; Abbasi, A.; Mordvintsev, A.; Jakob, W.; Süsstrunk, S. Neural cellular automata: From cells to pixels. arXiv 2025, arXiv:2506.22899.
Hartl, B.; Pio-Lopez, L.; Fields, C.; Levin, M. Remapping and navigation of an embedding space via error minimization: a fundamental organizational principle of cognition in natural and artificial systems. arXiv 2026, arXiv:2601.14096.
J. Bongard, M. Levin, There’s plenty of room right here: Biological systems as evolved, overloaded, multiscale machines, Biomimetics 8 (2023) 110. ** This work argues for an observer-dependent, continuous view of biological and artificial systems, emphasizing that cognition and function emerge from multi-scale, entangled processes. It introduces the concept of “polycomputing,” where the same substrate simultaneously performs multiple computations, highlighting the tight coupling of form and function and the importance of understanding and controlling behavior across scales in both living and engineered systems.
Zador, A.; Fellous, J.-M.; Sejnowski, T.; Adam, G.; Aimone, J. B.; Akwaboah, A.; Aloimonos, Y.; Alonso, C. A.; Bartolozzi, C.; Bennington, M. J.; et al. Neuroai and beyond: Bridging between advances in neuroscience and artificial intelligence. arXiv 2026, arXiv:2604.18637.
Dodig-Crnkovic, G. De-anthropomorphizing the mind: life as a cognitive spectrum in a unified framework for biological minds. Front. Syst. Neurosci. 2026, 20, 1730097.

Figure 1. Playing chess was once considered a benchmark of machine intelligence, but it has proven to be a relatively simple task. The harder problem turned out to be developing agents capable of interacting with a dynamic environment. To approach this challenge, CR proposes that intelligence should emerge from continuous interaction with the environment rather than from the mastery of formal rule systems. CR departs from the disembodied, symbol-manipulation paradigm, arguing that higher cognitive functions such as reasoning and language are elaborations of a more fundamental capacity: adaptive behavior grounded in sensorimotor interactions.

Figure 2. New directions in Cognitive Robotics. Foundation Models are introduced as structured starting conditions that constrain and scaffold early learning by providing initial sensorimotor priors. Within Cognitive Robotics, prediction and intrinsic motivation drive developmental learning through continuous sensorimotor interactions. Basal Cognition extends this framework by reconceptualizing cognition as a multiscale, self-organizing process in which competence is distributed across levels of organization.

Table 1. Classical CR is contrasted with two complementary extensions proposed in this review: foundation models as structured starting conditions that constrain learning, and basal cognition as a framework that reconceptualizes cognition as multiscale self-organization grounded in embodiment and viability.

Dimension	Classical CR	CR + Foundation Models	CR + Basal Cognition
Embodiment	Morphology and sensorimotor capabilities shaped by and grounded during interaction with the environment	Perceptual and motor priors that provide the basic body skills	Single to multicellular interactions across scales, substrates, and environments
Learning	Internal models grounded in sensorimotor interactions	Structured starting conditions that constrain the search space	Multiscale self-organization and distributed feedback-driven dynamics
Prediction	Sensorimotor contingencies and internal models	Embedded predictions based on pretrained goal-image associations	Self expected states and distributed, multiscale expected dynamics
Intrinsic motivation	Driven by learning progress, curiosity, and novelty search	Added as an auxiliary mechanism to support goal-driven associations	Emerges from viability and self-maintenance
Cognition	Rooted in sensorimotor interactions with the environment, driving adaptive behavior	Perceptual and motor priors that scaffold sensorimotor grounding	Emergent property of fundamental regulatory and adaptive dynamics across scales

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.