Submitted: 08 April 2025
Posted: 08 April 2025
Abstract
Keywords:
1. Introduction
2. Background and Related Work
2.1. Reinforcement Learning
2.2. Neuroscience of Human RL
2.3. Piaget’s Theory of Cognitive Development
- Sensorimotor (0-2 years): Learning through sensory experiences and actions, with high exploration.
- Preoperational (2-7 years): Emergence of symbolic thinking, but with egocentric limitations.
- Concrete Operational (7-11 years): Logical reasoning about concrete events, with reduced egocentrism.
- Formal Operational (11+ years): Abstract and hypothetical reasoning, enabling complex problem-solving.
2.4. Human-Like RL Models
3. Theoretical Foundations of ARDNS-P
3.1. Dual-Memory System
- Short-term memory (Ms): Captures recent states with fast decay (αs).
- Long-term memory (Ml): Consolidates contextual information with slow decay (αl).
3.2. Variance-Modulated Plasticity
- η is the learning rate,
- r is the reward,
- b is a curiosity bonus,
- σ² is the reward variance,
- β and γ are hyperparameters,
- M is the combined memory.
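The update rule itself is not reproduced in this text, so the following is only one plausible form built from the symbols listed above. It assumes that the reward variance σ² damps the step size through a factor 1/(1+βσ²), that γ acts as a weight-decay term, and that the weights form a vector of the same length as M. All three are assumptions made for illustration. The running variance uses Welford's online algorithm.

```python
class RunningVariance:
    """Welford's online algorithm for the mean and population variance of rewards."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n > 0 else 0.0


def plasticity_step(weights, memory, r, b, sigma2,
                    eta=0.15, beta=0.1, gamma=0.01, clip=5.0):
    """Hypothetical variance-modulated update; not the paper's exact rule.

    The step shrinks as the reward variance sigma2 grows, gamma decays the
    weights, and the result is clipped to [-clip, clip].
    """
    scale = eta * (r + b) / (1.0 + beta * sigma2)
    return [max(-clip, min(clip, w + scale * m - gamma * w))
            for w, m in zip(weights, memory)]


stats = RunningVariance()
for reward in [1.0, 2.0, 3.0, 4.0]:
    stats.update(reward)

new_weights = plasticity_step([0.0, 0.0], [1.0, -1.0], r=1.0, b=0.0,
                              sigma2=stats.variance)
```

The defaults for η, β, γ, and the clipping bound are the values given in Section 4.2.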
3.3. Piagetian Developmental Stages
- Sensorimotor (0-400 episodes): High exploration (ϵ≥0.9), high learning rate (η×2), fast short-term memory decay (αs=0.7), slower long-term decay (αl=0.8), curiosity bonus = 8.0.
- Preoperational (401-800 episodes): Reduced exploration (ϵ≥0.6), learning rate (η×1.5), αs=0.8, αl=0.9, curiosity bonus = 6.0.
- Concrete Operational (801-1200 episodes): Further reduced exploration (ϵ≥0.33), learning rate (η×1.2), αs=0.85, αl=0.95, curiosity bonus = 3.0.
- Formal Operational (1201+ episodes): Minimal exploration (ϵ≥0.1), base learning rate (η), αs=0.9, αl=0.98, curiosity bonus = 1.0.
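The stage schedule above amounts to a lookup table keyed by episode number. A minimal sketch follows; the names `Stage`, `STAGES`, and `stage_for` are hypothetical, and the numeric values are copied from the list.

```python
from typing import NamedTuple

class Stage(NamedTuple):
    name: str
    last_episode: float      # inclusive upper bound of the stage
    epsilon_min: float       # lower bound on the exploration rate
    eta_multiplier: float    # factor applied to the base learning rate
    alpha_s: float           # short-term memory decay
    alpha_l: float           # long-term memory decay
    curiosity_bonus: float

STAGES = [
    Stage("sensorimotor", 400, 0.9, 2.0, 0.7, 0.8, 8.0),
    Stage("preoperational", 800, 0.6, 1.5, 0.8, 0.9, 6.0),
    Stage("concrete_operational", 1200, 0.33, 1.2, 0.85, 0.95, 3.0),
    Stage("formal_operational", float("inf"), 0.1, 1.0, 0.9, 0.98, 1.0),
]

def stage_for(episode):
    """Return the developmental stage that contains the given episode index."""
    if episode < 0:
        raise ValueError("episode must be non-negative")
    for stage in STAGES:
        if episode <= stage.last_episode:
            return stage
    raise ValueError("unreachable: the last stage is unbounded")

base_eta = 0.15
stage = stage_for(500)
effective_eta = base_eta * stage.eta_multiplier  # preoperational: base rate times 1.5
```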
3.4. Action Selection
3.5. Algorithm Flow
- Initialize: Start with episode e=0 and initial state s=(0,0).
- State Observation: Observe the current state s.
- Update Short-Term Memory (Ms): Update Ms using the state and stage-specific αs.
- Update Long-Term Memory (Ml): Update Ml using the state and stage-specific αl.
- Combine Memory (M=[Ms,Ml]): Optionally apply attention to combine Ms and Ml.
- Reward Prediction: Update reward statistics using the latest reward.
- Update Weights (W): Adjust weights based on the reward variance and state transition.
- Compute Action Values (V): Calculate action values using the combined memory.
- Scale Values: No scaling applied in this implementation.
- Compute Action Probability (p(a)): Use epsilon-greedy policy to determine action probabilities.
- Choose Action: Select the action based on p(a).
- Execute Action, Get (s′,r): Perform the action, observe the next state s′ and reward r.
- Adjust Parameters (Piaget Stage): Update stage-specific parameters (ϵ, η, αs, αl, curiosity bonus).
- Episode Done?: If the goal is reached or the maximum steps are exceeded, end the episode; otherwise, continue.
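The action-probability and action-choice steps in the flow above can be sketched with a standard epsilon-greedy rule. This is the textbook formulation, offered as an illustration under that assumption; the function names are hypothetical and tie-breaking simply picks the first best action.

```python
import random

def epsilon_greedy_probs(values, epsilon):
    """Spread epsilon uniformly over all actions; give the rest to the greedy one."""
    n = len(values)
    best = max(range(n), key=lambda a: values[a])
    probs = [epsilon / n] * n
    probs[best] += 1.0 - epsilon
    return probs

def choose_action(values, epsilon, rng):
    """Sample an action index according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(values, epsilon)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]

rng = random.Random(42)
action_values = [0.1, 0.5, 0.2, 0.0]   # one value per action: Up, Down, Left, Right
probs = epsilon_greedy_probs(action_values, epsilon=0.1)
action = choose_action(action_values, epsilon=0.1, rng=rng)
```

With ϵ=0.9, as in the sensorimotor stage, each of the four actions receives at least 0.225 probability, so the agent behaves almost uniformly at random.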
4. Methods
4.1. Environment Setup
- State: Agent’s (x,y) position, starting at (0,0).
- Goal: Position (9,9).
- Actions: Up, Down, Left, Right.
- Reward: +10 at the goal, -3 for hitting obstacles, otherwise a progress-based reward: −0.002+0.08⋅progress−0.015⋅distance.
- Obstacles: 5% of grid cells, updated every 100 episodes.
- Episode Limit: 400 steps.
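The reward rule above can be written as a small function. The text does not define "progress" or "distance", so this sketch assumes that distance is the Manhattan distance from the new position to the goal and that progress is the reduction in that distance produced by the move. The function names are hypothetical.

```python
GOAL = (9, 9)

def manhattan(a, b):
    """Manhattan (L1) distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def step_reward(prev_pos, new_pos, hit_obstacle, goal=GOAL):
    """Reward for one move, following the rule in Section 4.1.

    Assumed definitions: distance is measured after the move, and
    progress is how much closer the move brought the agent to the goal.
    """
    if new_pos == goal:
        return 10.0
    if hit_obstacle:
        return -3.0
    distance = manhattan(new_pos, goal)
    progress = manhattan(prev_pos, goal) - distance
    return -0.002 + 0.08 * progress - 0.015 * distance

first_step = step_reward((0, 0), (1, 0), hit_obstacle=False)
```

Under these assumptions, the first step from (0,0) to (1,0) yields −0.002 + 0.08·1 − 0.015·17 = −0.177.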
4.2. ARDNS-P Implementation
- Memory: Short-term (Ms, dimension 10), long-term (Ml, dimension 20).
- Hyperparameters: η=0.15, ηr=0.05, β=0.1, γ=0.01, τ=1.5, weight clipping at 5.0, curiosity factor = 18.0.
- Developmental Stages: As described in Section 3.3.
- Attention Mechanism: Optional attention mechanism to weigh Ms and Ml contributions, enabled by default.
4.3. DQN Baseline
4.4. Simulation Protocol
- Episodes: 20000.
- Random Seed: 42 for reproducibility.
- Metrics: Cumulative reward, steps to goal, goals reached, and reward variance. Metrics are averaged over the last 50 episodes for stability.
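Averaging a metric over the last 50 episodes can be sketched as follows. The helper name `summarize_last` is hypothetical, and the choice of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev

def summarize_last(values, window=50):
    """Mean and population standard deviation over the final `window` entries."""
    tail = values[-window:]
    return mean(tail), pstdev(tail)

# Illustrative data only: a synthetic per-episode metric, not results from the paper.
episode_rewards = [float(i) for i in range(100)]
avg, spread = summarize_last(episode_rewards)
```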
5. Python Implementation
- Developmental Stages: Defined for 20000 episodes: sensorimotor (0-400), preoperational (401-800), concrete (801-1200), formal (1201+).
- GridWorld Class: Uses np.array_equal for state comparisons to handle NumPy arrays correctly.
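The note on `np.array_equal` reflects a common NumPy pitfall: `==` on arrays compares element by element, so using the result directly in an `if` raises an error for multi-element arrays. A short illustration, with hypothetical variable names:

```python
import numpy as np

goal = np.array([9, 9])
state = np.array([9, 9])

elementwise = state == goal              # array([ True,  True]), not a single bool
reached = np.array_equal(state, goal)    # single truth value: same shape and elements

try:
    if state == goal:                    # ambiguous for arrays with more than one element
        pass
    ambiguous = False
except ValueError:
    ambiguous = True
```

Calling `(state == goal).all()` would also work for arrays of equal shape; `np.array_equal` additionally returns False, rather than broadcasting, when the shapes differ.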
6. Results
6.1. Quantitative Metrics
- Goals Reached:
  - ARDNS-P: 18381/20000 (91.9%)
  - DQN: 16675/20000 (83.4%)
- Mean Reward (last 50 episodes, estimated based on success rates):
  - ARDNS-P: 9.12±0.6147
  - DQN: 8.24±0.6147
- Steps to Goal (last 50 episodes, successful episodes):
  - ARDNS-P: 149.2±104.5
  - DQN: 178.5±104.5
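As a quick arithmetic check, the success percentages follow directly from the goal counts reported above. The variable names below are hypothetical; the counts are the ones listed.

```python
EPISODES = 20000
goals = {"ARDNS-P": 18381, "DQN": 16675}

# Success rate in percent, rounded to one decimal place as in the reported figures.
success_pct = {name: round(100 * count / EPISODES, 1) for name, count in goals.items()}

# Gap between the two methods in percentage points, computed from the raw counts.
gap_points = 100 * (goals["ARDNS-P"] - goals["DQN"]) / EPISODES
```

The gap comes to about 8.5 percentage points in favour of ARDNS-P.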
6.2. Graphical Analyses

7. Discussion
- Dual-Memory System: The short- and long-term memory components allow ARDNS-P to balance immediate and contextual information, contributing to its efficiency in navigation (Figure 2(b)).
- Developmental Stages: Piaget-inspired stages adapt exploration and learning rates over time, mimicking human cognitive development. The high exploration in the sensorimotor stage (episodes 0-400) facilitates initial learning, while later stages reduce exploration to exploit learned policies, contributing to the higher goal-reaching success rate.
- Variance-Modulated Plasticity: Adjusting learning rates based on reward uncertainty helps ARDNS-P achieve stability in its reward predictions, as seen in the decreasing variance (Figure 2(c)).
8. Conclusions and Future Work
- Reducing reward variability by incorporating probabilistic reward modeling.
- Extending ARDNS-P to more complex environments, such as 3D navigation or multi-agent settings.
- Incorporating additional human-like mechanisms, such as attention or hierarchical reasoning, to enhance adaptability.
- Validating the model against human behavioral data to better align with cognitive processes.
- Investigating the discrepancy in the learning curve to ensure consistency with the reported success rates.
References
- Badre, D., & Wagner, A. D. (2007). Left ventrolateral prefrontal cortex and the cognitive control of memory. Neuropsychologia, 45(13), 2883–2901.
- Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics, 6(5), 679–684.
- Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., & Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in Cognitive Sciences, 23(5), 408–422.
- Dayan, P. (1993). Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4), 613–624.
- Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138.
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
- Piaget, J. (1950). The Psychology of Intelligence. Routledge.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80(1), 1–27.
- Schultz, W. (2016). Dopamine reward prediction-error signalling: A two-component response. Nature Reviews Neuroscience, 17(3), 183–195.
- Singh, S., Lewis, R. L., Barto, A. G., & Sorg, J. (2010). Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2), 70–82.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- Tulving, E. (2002). Episodic memory: From mind to brain. Annual Review of Psychology, 53(1), 1–25.
- Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279–292.
- Yu, A. J., & Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4), 681–692.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).