Submitted: 17 April 2025
Posted: 18 April 2025
Abstract
Keywords:
1. Introduction
2. Background and Related Work
2.1. Reinforcement Learning
2.2. Neuroscience of Human RL
2.3. Piaget’s Theory of Cognitive Development
2.4. Quantum Reinforcement Learning
2.5. Human-like RL Models
3. Foundations of Quantum Computing for ARDNS-P-Quantum
3.1. Qubits and Quantum States
3.2. Superposition
3.3. Entanglement
3.4. Quantum Gates: Focus on RY Gates
3.5. Measurement and Quantum Circuits
3.6. Quantum Advantage in ARDNS-P-Quantum
4. Theoretical Foundations of ARDNS-P-Quantum
4.1. Dual-Memory System
- Short-term memory (Ms): Dimension 8, captures recent states with fast decay (αs).
- Long-term memory (Ml): Dimension 16, consolidates contextual information with slow decay (αl).
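For concreteness, the listing below sketches one way the dual-memory update could be organized. The paper specifies the dimensions and the fast/slow decay rates, but not the exact update rule, so the exponential-moving-average form, the state encodings, and the class and variable names here are assumptions for illustration only.

import numpy as np

class DualMemory:
    """Minimal sketch of the dual-memory system (assumed update rule)."""
    def __init__(self, dim_s=8, dim_l=16, alpha_s=0.5, alpha_l=0.05):
        self.M_s = np.zeros(dim_s)   # short-term memory, fast decay (alpha_s)
        self.M_l = np.zeros(dim_l)   # long-term memory, slow decay (alpha_l)
        self.alpha_s = alpha_s
        self.alpha_l = alpha_l

    def update(self, state_s, state_l):
        # state_s / state_l: state encodings matching each memory's dimension.
        # Exponential moving average toward the encoded state (assumption).
        self.M_s = (1 - self.alpha_s) * self.M_s + self.alpha_s * state_s
        self.M_l = (1 - self.alpha_l) * self.M_l + self.alpha_l * state_l
        return np.concatenate([self.M_s, self.M_l])   # combined M, dimension 24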
4.2. Variance-Modulated Plasticity
4.3. Piagetian Developmental Stages
- Sensorimotor (0-100 episodes): High exploration (ϵ=0.9), high learning rate (η=1.4), high curiosity bonus (b=2.0).
- Preoperational (101-200 episodes): Moderate exploration (ϵ=0.6), reduced learning rate (η=1.05), curiosity bonus (b=1.5).
- Concrete Operational (201-300 episodes): Lower exploration (ϵ=0.3), stable learning rate (η=0.84), curiosity bonus (b=1.0).
- Formal Operational (301+ episodes): Minimal exploration (ϵ=0.2), refined learning rate (η=0.7), curiosity bonus (b=1.0).
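The stage schedule above reduces to a simple lookup on the episode index. The sketch below uses the values listed in the text; the function name and dictionary layout are illustrative, not the paper's API.

def piaget_stage_params(episode):
    # Stage-specific exploration, learning rate, and curiosity bonus
    # (values taken from the list above).
    if episode <= 100:       # Sensorimotor
        return {"epsilon": 0.9, "eta": 1.4, "curiosity_bonus": 2.0}
    elif episode <= 200:     # Preoperational
        return {"epsilon": 0.6, "eta": 1.05, "curiosity_bonus": 1.5}
    elif episode <= 300:     # Concrete Operational
        return {"epsilon": 0.3, "eta": 0.84, "curiosity_bonus": 1.0}
    else:                    # Formal Operational
        return {"epsilon": 0.2, "eta": 0.7, "curiosity_bonus": 1.0}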
4.4. Quantum Action Selection
1. Quantum Register: A 2-qubit quantum register |q0q1⟩ is initialized in the state |00⟩.
2. Parameterization: The combined memory M (dimension 24) and the action weights Wa (shape 4×24) parameterize the RY rotation angles θi = ∑j Wa,i,j Mj.
3. Circuit Construction: For each qubit qi, an RY gate applies the rotation RY(θi), as defined in Section 3.4.
4. Measurement: A 2-bit classical register measures the qubits in the computational basis. The circuit is executed with 16 shots on Qiskit's AerSimulator, producing a probability distribution over the 4 basis states, which is mapped to the action probabilities p(ak).
5. Action Selection: An epsilon-greedy policy selects the action with the highest p(ak) with probability 1−ϵ, or a random action with probability ϵ. A minimal Qiskit sketch of this pipeline follows below.
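The sketch assumes that the two rotation angles come from the first two rows of Wa (mirroring the _create_action_circuit code in Section 7) and that the four basis states are indexed directly as actions; the function names are illustrative rather than the paper's actual API.

import numpy as np
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

def quantum_action_probs(M, W_a, shots=16):
    # theta_i = sum_j W_a[i, j] * M[j]; only the first two rows of W_a are
    # used here, one angle per qubit (an assumption for this illustration).
    thetas = [float(np.sum(W_a[i] * M)) for i in range(2)]

    qc = QuantumCircuit(2, 2)
    qc.ry(thetas[0], 0)
    qc.ry(thetas[1], 1)
    qc.measure([0, 1], [0, 1])           # computational-basis measurement

    backend = AerSimulator()              # noise-free simulation
    counts = backend.run(transpile(qc, backend), shots=shots).result().get_counts()

    # Map the basis states |00>, |01>, |10>, |11> to action probabilities p(a_k).
    probs = np.zeros(4)
    for bitstring, c in counts.items():
        probs[int(bitstring, 2)] = c / shots
    return probs

def choose_action(probs, epsilon, rng=None):
    # Epsilon-greedy: random action with probability epsilon, otherwise argmax.
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(probs)))
    return int(np.argmax(probs))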
5. Methods
5.1. Environment Setup
- State: Agent’s (x,y) position, starting at (0,0).
- Goal: (9,9).
- Actions: Up, down, left, right.
- Reward: +10 at the goal, -3 for obstacles, otherwise −0.001+0.1⋅progress−0.01⋅distance.
- Obstacles: 5% of cells, updated every 100 episodes.
- Episode Limit: 400 steps.
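To make the setup concrete, the listing below sketches a grid world with these properties. The class name, the non-terminal handling of obstacle cells, and the use of Manhattan distance for the "progress" and "distance" shaping terms are assumptions; the training loop would call reset_obstacles() every 100 episodes.

import numpy as np

class GridWorld:
    """Minimal 10x10 grid-world sketch matching the specification above."""
    def __init__(self, size=10, obstacle_frac=0.05, max_steps=400, seed=42):
        self.size, self.max_steps = size, max_steps
        self.goal = (size - 1, size - 1)
        self.rng = np.random.default_rng(seed)
        self.n_obstacles = int(obstacle_frac * size * size)
        self.reset_obstacles()

    def reset_obstacles(self):
        # Re-sample obstacle cells (called every 100 episodes by the loop).
        cells = [(x, y) for x in range(self.size) for y in range(self.size)
                 if (x, y) not in [(0, 0), self.goal]]
        idx = self.rng.choice(len(cells), self.n_obstacles, replace=False)
        self.obstacles = {cells[i] for i in idx}

    def reset(self):
        self.pos, self.steps = (0, 0), 0
        return self.pos

    def step(self, action):
        # 0: up, 1: down, 2: left, 3: right (mapping assumed).
        dx, dy = [(0, 1), (0, -1), (-1, 0), (1, 0)][action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        old_dist = abs(self.goal[0] - self.pos[0]) + abs(self.goal[1] - self.pos[1])
        new_dist = abs(self.goal[0] - x) + abs(self.goal[1] - y)
        self.pos, self.steps = (x, y), self.steps + 1
        if self.pos == self.goal:
            reward, done = 10.0, True
        elif self.pos in self.obstacles:
            reward, done = -3.0, False
        else:
            progress = old_dist - new_dist
            reward = -0.001 + 0.1 * progress - 0.01 * new_dist
            done = False
        return self.pos, reward, done or self.steps >= self.max_steps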
5.2. ARDNS-P-Quantum Implementation
- Memory: Ms (dimension 8), Ml (dimension 16), combined M (dimension 24).
- Hyperparameters: η=0.7, β=0.1, γ=0.01, ϵmin=0.2, ϵdecay=0.995, curiosity factor = 0.75.
- Quantum Circuit:
  - Qubits: 2 (⌈log2(4)⌉).
  - Gates: RY rotations parameterized by θi = ∑j Wa,i,j Mj.
  - Shots: 16, reduced from 32 in ARDNS-P to optimize runtime.
  - Simulator: Qiskit AerSimulator for noise-free simulation.
- Attention Mechanism: Enabled, using tanh for Ms and sigmoid for Ml to weigh memory contributions.
- Circuit Optimization: RY rotations are combined per qubit to minimize circuit depth, reducing quantum gate count by approximately 10% compared to unoptimized circuits.
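As a rough illustration of the attention step, the snippet below gates each memory by its own nonlinearity before concatenation. Element-wise gating is an assumption; only the choice of tanh for Ms and sigmoid for Ml is stated in the text.

import numpy as np

def combine_memory(M_s, M_l):
    # Attention-style weighting (assumed element-wise gating).
    attn_s = np.tanh(M_s)                    # weights for short-term memory
    attn_l = 1.0 / (1.0 + np.exp(-M_l))      # sigmoid weights for long-term memory
    return np.concatenate([attn_s * M_s, attn_l * M_l])   # dimension 8 + 16 = 24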
5.3. DQN Baseline
5.4. Simulation Protocol
- Episodes: 20000.
- Random Seed: 42.
- Metrics: Success rate, mean reward, steps to goal, reward variance (last 100 episodes).
- Hardware: Google Colab CPU (13GB RAM), ensuring accessibility for reproducibility.
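For reproducibility, the reported metrics can be computed along these lines; the helper and its argument names are illustrative, not taken from the paper's code.

import numpy as np

def summarize(rewards, steps, successes, window=100):
    # rewards/steps/successes: per-episode lists over the whole run.
    r = np.asarray(rewards[-window:])
    s = np.asarray(steps[-window:])
    ok = np.asarray(successes[-window:], dtype=bool)
    return {
        "success_rate": float(np.mean(successes)),            # over all episodes
        "mean_reward": (float(r.mean()), float(r.std())),      # last 100 episodes
        "steps_to_goal": (float(s[ok].mean()), float(s[ok].std())) if ok.any() else None,
        "reward_variance": float(r.var()),                     # last 100 episodes
    }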
6. Flowchart of ARDNS-P-Quantum Algorithm
1. Initialize: Start with episode e=0 and initial state s=(0,0); initialize the model parameters (Ms, Ml, weights Ws, Wl, Wa), the 2-qubit quantum circuit, and the stage-specific parameters (ϵ, η, αs, αl).
2. State Observation: Observe the current state s from the environment.
3. Update Short-Term Memory (Ms): Update Ms from the current state s using the stage-specific decay αs.
4. Update Long-Term Memory (Ml): Update Ml from the current state s using the stage-specific decay αl.
5. Combine Memory (M=[Ms Ml]): Concatenate Ms and Ml, optionally applying an attention mechanism that weighs their contributions with tanh for Ms and sigmoid for Ml.
6. Reward Prediction: Update the reward statistics (mean and variance) with the latest reward, maintaining a window of the 100 most recent rewards.
7. Construct Quantum Circuit: Build a 2-qubit quantum circuit with RY rotations parameterized by the combined memory M and the action weights Wa (θi = ∑j Wa,i,j Mj).
8. Measure Quantum Circuit: Execute the circuit with 16 shots on Qiskit's AerSimulator and measure the qubits to obtain the action probabilities p(ak).
9. Compute Action Probabilities (p(a)): Apply an epsilon-greedy policy to the measured probabilities: with probability 1−ϵ select the action with the highest p(ak), otherwise select a random action.
10. Choose Action: Select the action a according to p(a).
11. Execute Action, Get (s′, r): Perform the action a in the environment and observe the next state s′ and reward r.
12. Compute Curiosity Bonus: Calculate the curiosity bonus b from the novelty of the state s and its distance to the goal, scaled by the stage-specific curiosity factor.
13. Update Weights (W): Adjust the weights using the variance-modulated plasticity rule (Section 4.2).
14. Adjust Parameters (Piaget Stage): Update the stage-specific parameters (ϵ, η, αs, αl, curiosity bonus) according to the current episode and Piagetian stage.
15. Episode Done?: Check whether the goal has been reached (s′=(9,9)) or the maximum of 400 steps has been exceeded. If so, end the episode; otherwise set s←s′ and continue the loop.
A compact loop-level sketch of these steps is given below.
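In the sketch, agent and env expose hypothetical method names that mirror steps 1-15; they are not the paper's actual API, and the ordering of the statistics and weight updates is an assumption.

def run_episode(agent, env, episode):
    agent.set_stage_params(episode)             # step 14 (applied per episode here)
    s = env.reset()                             # steps 1-2
    done = False
    while not done:
        M = agent.update_memories(s)            # steps 3-5
        probs = agent.quantum_action_probs(M)   # steps 7-9
        a = agent.choose_action(probs)          # step 10
        s_next, r, done = env.step(a)           # step 11
        b = agent.curiosity_bonus(s_next)       # step 12
        agent.update_statistics(r + b)          # step 6
        agent.update_weights(r + b, M, a)       # step 13
        s = s_next                              # step 15
    return agent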

7. Python Implementation
- Quantum Integration:
  - The _create_action_circuit method constructs the quantum circuit using Qiskit's QuantumCircuit class, initializing a 2-qubit register and applying RY rotations parameterized by the combined memory M and weights Wa.
  - The _measure_circuit method executes the circuit with 16 shots using AerSimulator, mapping measurement outcomes to action probabilities.
  - Code snippet for circuit creation:
def _create_action_circuit(self, params, input_state):
    # Build the 2-qubit action-selection circuit on the register qr_a.
    circuit = QuantumCircuit(self.qr_a)
    # Truncate the combined memory to the width of one weight row.
    input_state = input_state[:len(params[0])]
    combined_angles = np.zeros(self.n_qubits_a)
    for i in range(self.n_qubits_a):
        # theta_i = sum_j W_a[i, j] * M[j]
        angle = np.sum(params[i] * input_state)
        combined_angles[i] += angle
    # One combined RY rotation per qubit keeps the circuit depth minimal.
    for i in range(self.n_qubits_a):
        circuit.ry(combined_angles[i], self.qr_a[i])
    return circuit
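The companion _measure_circuit method is not reproduced here; a sketch consistent with its description (16 shots on AerSimulator, counts mapped to action probabilities) might look as follows, with the basis-state-to-action mapping assumed.

from qiskit import ClassicalRegister, transpile
from qiskit_aer import AerSimulator

def _measure_circuit(self, circuit, shots=16):
    # Add a 2-bit classical register and measure in the computational basis.
    cr = ClassicalRegister(self.n_qubits_a)
    circuit.add_register(cr)
    circuit.measure(self.qr_a, cr)
    backend = AerSimulator()
    counts = backend.run(transpile(circuit, backend), shots=shots).result().get_counts()
    # Map the 4 basis states to action probabilities p(a_k).
    probs = [0.0] * (2 ** self.n_qubits_a)
    for bitstring, c in counts.items():
        probs[int(bitstring, 2)] = c / shots
    return probs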
- Developmental Stages: Defined for 20000 episodes: sensorimotor (0-100), preoperational (101-200), concrete (201-300), formal (301+), with stage-specific parameters for ϵ,η, and curiosity bonus.
- Visualization: Learning curves, steps to goal, and reward variance are smoothed with a Savitzky-Golay filter (window=1001, poly_order=2). Boxplots and histograms visualize reward distributions.
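The smoothing corresponds to SciPy's savgol_filter; a minimal usage sketch (with a placeholder reward series) is:

import numpy as np
from scipy.signal import savgol_filter

episode_rewards = np.random.default_rng(42).normal(size=20000)  # placeholder series
smoothed = savgol_filter(episode_rewards, window_length=1001, polyorder=2)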
8. Results
8.1. Quantitative Metrics
- Goals Reached:
  - ARDNS-P-Quantum: 19831/20000 (99.2%)
  - DQN: 16894/20000 (84.5%)
- Mean Reward (last 100 episodes):
  - ARDNS-P-Quantum: 9.1169 ± 0.6138
  - DQN: 3.2207 ± 12.9141
- Steps to Goal (last 100 episodes, successful episodes):
  - ARDNS-P-Quantum: 33.3 ± 9.4
  - DQN: 72.1 ± 85.2
- Simulation Time:
  - ARDNS-P-Quantum: 1034.5 s
  - DQN: 824.0 s
8.2. Graphical Analyses
- (a) Learning Curve: ARDNS-P-Quantum’s average reward rises to approximately 9 by episode 2500, stabilizing near 8-10, reflecting its 99.2% success rate. DQN fluctuates between -5 and 5, consistent with its 84.5% success rate and high reward variance.
- (b) Steps to Goal: ARDNS-P-Quantum reduces steps from around 200 to approximately 33 by episode 5000, while DQN stabilizes at around 72, highlighting the quantum-enhanced exploration’s impact on navigation efficiency.
- (c) Reward Variance: ARDNS-P-Quantum’s variance decreases from 1.8 to around 0.8, while DQN remains higher (1.2-1.5), indicating that quantum action selection stabilizes reward predictions.
- Boxplot: ARDNS-P-Quantum’s median reward is approximately 9, with a tight interquartile range and few outliers (down to -40), reflecting consistent performance. DQN’s median is around 8, with greater variability and outliers down to -60.
- Histogram: ARDNS-P-Quantum’s rewards peak at 10 (frequency approximately 14000), with a small tail of negative rewards. DQN’s distribution is bimodal, peaking at 10 (approximately 6000) and -20 (approximately 2000), indicating frequent failures.
9. Discussion
- Quantum Action Selection: The 2-qubit circuit with RY rotations leverages superposition to evaluate all actions simultaneously, enhancing exploration efficiency. This results in a high success rate and rapid convergence in steps to goal (33.3, approaching the optimal path length of approximately 18 in a 10x10 grid without obstacles).
- Resource Optimization: Reducing shots from 32 to 16 and combining RY rotations decreases circuit depth by approximately 10%, contributing to a 20.3% reduction in simulation time (1034.5s vs. 1297.8s in Gonçalves de Sousa, 2025).
- Dual-Memory System: Balances immediate and contextual learning, supporting efficient navigation.
- Developmental Stages: Adaptive exploration improves learning stability across episodes.
- Variance-Modulated Plasticity: Reduces reward variance, enhancing prediction stability.
10. Conclusions and Future Work
- Implementing quantum parallelization (e.g., batch processing in choose_action) to further reduce simulation time by 20-30%.
- Exploring variational quantum circuits to dynamically optimize (θi), potentially improving action selection accuracy.
- Extending to 3D or multi-agent environments to test scalability.
- Incorporating quantum attention mechanisms to enhance memory weighting.
- Validating against human behavioral data to align with cognitive processes.
References
- Badre, D., & Wagner, A. D. (2007). Left ventrolateral prefrontal cortex and the cognitive control of memory. Neuropsychologia, 45(13), 2883-2901.
- Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics, 6(5), 679-684.
- Botvinick, M., et al. (2019). Reinforcement learning, fast and slow. Trends in Cognitive Sciences, 23(5), 408-422.
- Dirac, P. A. M. (1958). The Principles of Quantum Mechanics (4th ed.). Oxford University Press.
- Dong, D., et al. (2008). Quantum reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, 38(5), 1207-1220.
- Einstein, A., Podolsky, B., & Rosen, N. (1935). Can quantum-mechanical description of physical reality be considered complete? Physical Review, 47(10), 777-780.
- Farhi, E., & Neven, H. (2018). Classification with quantum neural networks on near term processors. arXiv preprint arXiv:1802.06002.
- Gonçalves de Sousa, U. (2025). A novel framework for human-like reinforcement learning: ARDNS-P with Piagetian stages. Preprints.org.
- Grover, L. K. (1996). A fast quantum mechanical algorithm for database search. Proceedings of the 28th Annual ACM Symposium on Theory of Computing, 212-219.
- Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
- Nielsen, M. A., & Chuang, I. L. (2010). Quantum Computation and Quantum Information. Cambridge University Press.
- Piaget, J. (1950). The Psychology of Intelligence. Routledge.
- Qiskit. (2023). Qiskit: An open-source framework for quantum computing. Available online: https://qiskit.org.
- Schulman, J., et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80(1), 1-27.
- Singh, S., et al. (2009). Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Transactions on Autonomous Mental Development, 2(2), 70-82.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- Tulving, E. (2002). Episodic memory: From mind to brain. Annual Review of Psychology, 53(1), 1-25.
- Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4), 279-292.
- Yu, A. J., & Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4), 681-692.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).