Submitted: 06 May 2025
Posted: 09 May 2025
Abstract
Keywords:
1. Introduction
- (1) It generates agent-specific intermediate tasks using a variational evolutionary algorithm, enabling balanced strategy development.
- (2) It models co-evolution between agents and their environment [14], aligning task complexity with agents’ learning progress.
- (3) It improves training stability by dynamically adapting task difficulty to match agent skill levels.

2. Problem Statement

At each time step $t$, the agent observes the current state $s_t$ and selects an action $a_t$ based on its policy $\pi(a_t \mid s_t)$. The chosen action results in a transition to a new state $s_{t+1}$, determined by the environment’s transition dynamics $P(s_{t+1} \mid s_t, a_t)$, and an associated reward $r_t$ is obtained from the reward function $R(s_t, a_t)$. The sequence of states, actions, following states, and rewards over an episode of $T$ time steps forms the trajectory $\tau = (s_0, a_0, r_0, \ldots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$, where $T$ is either determined by the maximum episode length or by task-specific termination conditions. This outlines the process of reinforcement learning for a single agent, whose objective is to maximize the expected discounted return:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \gamma^{t}\, r_t\right] \tag{1}$$

In the cooperative multi-agent setting with $n$ agents, the global state $s_t$ of the system is composed of the joint states of all individual agents, denoted as $s_t = (s_t^1, s_t^2, \ldots, s_t^n)$. Correspondingly, the joint action $a_t$ at each time step is formed by the combination of the actions of all agents, i.e., $a_t = (a_t^1, a_t^2, \ldots, a_t^n)$. In the sparse-reward environment, reward signals emerge only when the system achieves specific predefined goal states, posing more significant challenges for agent collaboration and strategy optimization. Each agent $i$ receives an individual reward $r_t^i = R_i(s_t, a_t)$, where $R_i(s_t, a_t)$ represents the reward received by agent $i$ at time step $t$ given the state $s_t$ and joint action $a_t$. The overall goal of the multi-agent system (MAS) then becomes the sum of the individual objectives, denoted as $J(\boldsymbol{\pi}) = \sum_{i=1}^{n} J_i(\boldsymbol{\pi})$:

$$J(\boldsymbol{\pi}) = \sum_{i=1}^{n} J_i(\boldsymbol{\pi}) = \mathbb{E}_{\tau \sim \boldsymbol{\pi}}\left[\sum_{t=0}^{T-1} \gamma^{t} \sum_{i=1}^{n} r_t^{i}\right] \tag{2}$$

$$R_i(s_t, a_t) = \begin{cases} 1, & \text{if } s_{t+1} \in \mathcal{G} \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

where $\mathcal{G}$ denotes the set of predefined goal states.
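As a concrete illustration, the following minimal sketch computes the discounted joint return of Equation (2) from a recorded episode; the function name and the `(T, n)` reward-array layout are illustrative conventions, not anything specified in the paper.

```python
import numpy as np

def discounted_joint_return(rewards, gamma=0.99):
    """Compute the discounted return of Eq. (2) from per-step joint rewards.

    rewards: array of shape (T, n) -- the reward of each of the n agents
    at each of the T time steps.  In the sparse-reward setting of Eq. (3),
    most rows are all zeros; a row is nonzero only when a goal state is hit.
    """
    rewards = np.asarray(rewards, dtype=float)
    T = rewards.shape[0]
    discounts = gamma ** np.arange(T)      # gamma^t for t = 0 .. T-1
    per_step = rewards.sum(axis=1)         # sum over agents (inner sum of Eq. 2)
    return float(discounts @ per_step)     # outer discounted sum over time

# A 5-step, 2-agent episode where the goal is reached only at the last step:
episode = np.zeros((5, 2))
episode[-1] = 1.0                          # sparse terminal reward, as in Eq. (3)
print(discounted_joint_return(episode))    # 2 * 0.99^4 ≈ 1.921
```

Under Equation (3), almost every row of the reward array is zero, which is exactly why intermediate tasks are needed to provide a usable learning signal.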
3. Related Work
3.1. Curriculum Learning
3.2. Evolutionary Reinforcement Learning
4. Methodology
4.1. The Variational Individual-Perspective Evolutionary Operator
[Equation (4)]

[Equation (5)]

The $N$ tasks in the population are randomly divided into two groups $T_A$ and $T_B$. Then, we take $N/2$ task pairs $(t_i^A, t_i^B)$ from $T_A$ and $T_B$ to produce new children in the population:

[Equation (6)]

where $d_i$ represents the crossover direction for pair $i$. The calculations of $s_i$ and $d_i$ are shown below:

[Equation (7)]

[Equation (8)]
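As a concrete illustration of the operator, the sketch below implements one plausible reading of the pairing-and-crossover step, assuming tasks are encoded as real-valued parameter vectors, the direction $d_i$ points from one pair member to the other, and the step size $s_i$ is sampled from a Gaussian; the exact definitions of $s_i$ and $d_i$ in Equations (7) and (8) may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def crossover_population(tasks, sigma=0.1):
    """One generation of pairwise crossover over a task population.

    tasks: array of shape (N, D) with N even -- each row is one task's
    D-dimensional parameter vector (e.g., goal and obstacle positions).
    Returns N/2 children, one per random pair (t_i^A, t_i^B).
    """
    N = len(tasks)
    perm = rng.permutation(N)
    group_a = tasks[perm[: N // 2]]               # group T_A
    group_b = tasks[perm[N // 2 :]]               # group T_B
    d = group_b - group_a                         # assumed crossover direction d_i
    s = rng.normal(0.0, sigma, size=(N // 2, 1))  # assumed random step size s_i
    return group_a + s * d                        # child c_i = t_i^A + s_i * d_i

# Example: a population of 8 two-dimensional goal tasks.
population = rng.uniform(0.0, 10.0, size=(8, 2))
print(crossover_population(population).shape)     # -> (4, 2)
```

Sampling one scalar $s_i$ per pair rather than per dimension keeps each child on the line through its two parents, so new tasks interpolate between (or mildly extrapolate beyond) existing ones.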
4.2. Elite Prototype Fitness Evaluation
[Equation (9)]

[Equation (10)]
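Section 5.2 analyzes the fitness function’s behavior: it is sigmoid-shaped in the agents’ success rate, equals 0.5 when the success rate is exactly 0.5, and decays exponentially toward both extremes. The sketch below shows one functional form with exactly these properties, together with a rollout-based success-rate estimate; the specific form, the sharpness parameter `k`, the `task.rollout` interface, and the episode count are illustrative assumptions rather than the paper’s exact Equations (9) and (10).

```python
import numpy as np

def estimate_success_rate(policy, task, episodes=20):
    """Empirical success rate of `policy` on `task`.

    Assumes a hypothetical `task.rollout(policy)` that runs one episode
    and returns True if a goal state was reached (sparse success signal).
    """
    return sum(task.rollout(policy) for _ in range(episodes)) / episodes

def sigmoid_fitness(p, k=10.0):
    """Map a success rate p in [0, 1] to a task fitness value.

    f(0.5) == 0.5 for any k, and f decays exponentially as p approaches
    0 or 1, so tasks that are far too easy or far too hard score low.
    """
    s = 1.0 / (1.0 + np.exp(-k * (p - 0.5)))  # logistic sigmoid centered at 0.5
    return 2.0 * s * (1.0 - s)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"success rate {p:.2f} -> fitness {sigmoid_fitness(p):.3f}")
# 0.00 -> 0.013, 0.25 -> 0.140, 0.50 -> 0.500, 0.75 -> 0.140, 1.00 -> 0.013
```

Under this reading, tasks whose success rate sits near 0.5 receive the highest fitness and are the ones favored for survival into the next generation.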
5. Experiment
5.1. Main Result
- Vanilla MAPPO [35] – Direct training on the target task without intermediate tasks.
- POET [24] – Uses task evolution; implemented with the same setup as CCL for fairness.
- GC (Genetic Curriculum) [36] – An improved version of POET with enhanced task generation.
- GoalGAN [10] – Generates a curriculum of goals of intermediate difficulty using a generative adversarial network.
- VACL [43] – Applies variational methods to create robust intermediate tasks.
5.2. Ablation Studies
This improvement stems from the sigmoid function’s properties: as the agent’s success rate approaches 0 or 1, the task’s suitability to the agent’s abilities decreases exponentially. Specifically, when the success rate is exactly 0.5, the fitness value remains consistently at 0.5. This approach effectively integrates nonlinear elements into the success-rate distribution, enabling the fitness function to more accurately represent the relationship between task difficulty and the agent’s skill level.

6. Conclusion
References
- Abbass, H.; Petraki, E.; Hussein, A.; McCall, F.; Elsawah, S. A model of symbiomemesis: machine education and communication as pillars for human-autonomy symbiosis. Philos. Transactions Royal Soc. A 2021, 379, 20200364. [Google Scholar] [CrossRef] [PubMed]
- Perrusquía, A.; Yu, W.; Li, X. Multi-agent reinforcement learning for redundant robot control in task-space. Int. J. Mach. Learn. Cybern. 2021, 12, 231–241. [Google Scholar] [CrossRef]
- Rashid, T.; et al. Monotonic value function factorisation for deep multi-agent reinforcement learning. J. Mach. Learn. Res. 2020, 21, 1–51. [Google Scholar]
- Shalev-Shwartz, S.; Shammah, S.; Shashua, A. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv:1610.03295 (2016).
- Ng, A. Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML), 1999; Vol. 99, pp. 278–287. [Google Scholar]
- Hu, Y.; et al. Learning to utilize shaping rewards: A new approach of reward shaping. Adv. Neural Inf. Process. Syst. 2020, 33, 15931–15941. [Google Scholar]
- Ross, S.; Gordon, G.; Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2011; pp. 627–635. JMLR Workshop and Conference Proceedings.
- Duan, Y.; et al. One-shot imitation learning. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Florensa, C.; Held, D.; Wulfmeier, M.; Zhang, M.; Abbeel, P. Reverse curriculum generation for reinforcement learning. In Conference on robot learning, 482–495 (PMLR, 2017).
- Florensa, C.; Held, D.; Geng, X.; Abbeel, P. Automatic goal generation for reinforcement learning agents. In International conference on machine learning, 1515–1528 (PMLR, 2018).
- Bloembergen, D.; Tuyls, K.; Hennes, D.; Kaisers, M. Evolutionary dynamics of multi-agent learning: A survey. J. Artif. Intell. Res. 2015, 53, 659–697. [Google Scholar] [CrossRef]
- Buşoniu, L.; Babuška, R.; De Schutter, B. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1; 2010; pp. 183–221.
- Hernandez-Leal, P.; Kartal, B.; Taylor, M. E. A survey and critique of multiagent deep reinforcement learning. Auton. Agents Multi-Agent Syst. 2019, 33, 750–797. [Google Scholar] [CrossRef]
- Antonio, L. M.; Coello, C. A. C. Coevolutionary multiobjective evolutionary algorithms: Survey of the state-of-the-art. IEEE Transactions on Evol. Comput. 2017, 22, 851–865. [Google Scholar] [CrossRef]
- Kaelbling, L. P.; Littman, M. L.; Moore, A. W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285. [Google Scholar] [CrossRef]
- Dewey, D. Reinforcement learning and the reward engineering principle. In 2014 AAAI Spring Symposium Series, 2014.
- Booth, S.; et al. The perils of trial-and-error reward design: misdesign through overfitting and invalid task specifications. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023; Vol. 37, pp. 5920–5929. [Google Scholar] [CrossRef]
- Laud, A. D. Theory and Application of Reward Shaping in Reinforcement Learning. Ph.D. thesis, University of Illinois at Urbana-Champaign, 2004.
- Barto, A. G. Intrinsic motivation and reinforcement learning. In Intrinsically Motivated Learning in Natural and Artificial Systems; 2013; pp. 17–47.
- Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 2009; pp. 41–48. [Google Scholar]
- Rohde, D. L.; Plaut, D. C. Language acquisition in the absence of explicit negative evidence: How important is starting small? Cognition 1999, 72, 67–109. [Google Scholar] [CrossRef] [PubMed]
- Elman, J. L. Learning and development in neural networks: The importance of starting small. Cognition 1993, 48, 71–99. [Google Scholar] [CrossRef] [PubMed]
- Narvekar, S.; Sinapov, J.; Stone, P. Autonomous task sequencing for customized curriculum design in reinforcement learning. In IJCAI, 2536–2542 (2017).
- Wang, R.; Lehman, J.; Clune, J.; Stanley, K. O. Paired open-ended trailblazer (poet): Endlessly generating increasingly complex and diverse learning environments and their solutions. arXiv:1901.01753 (2019).
- Cobbe, K.; Klimov, O.; Hesse, C.; Kim, T.; Schulman, J. Quantifying generalization in reinforcement learning. In International conference on machine learning, 1282–1289 (PMLR, 2019).
- Ren, Z.; Dong, D.; Li, H.; Chen, C. Self-paced prioritized curriculum learning with coverage penalty in deep reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 2216–2226. [Google Scholar] [CrossRef]
- Wu, J.; et al. Portal: Automatic curricula generation for multiagent reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024; Vol. 38, pp. 15934–15942. [Google Scholar] [CrossRef]
- Wang, K.; Zhang, X.; Guo, Z.; Hu, T.; Ma, H. CSCE: Boosting LLM reasoning by simultaneous enhancing of causal significance and consistency. arXiv:2409.17174 (2024).
- Samvelyan, M.; et al. Maestro: Open-ended environment design for multi-agent reinforcement learning. arXiv:2303.03376 (2023).
- Parker-Holder, J.; et al. Evolving curricula with regret-based environment design. In International Conference on Machine Learning, 17473–17498 (PMLR, 2022).
- Beyer, H.-G.; Schwefel, H.-P. Evolution strategies – A comprehensive introduction. Nat. Comput. 2002, 1, 3–52. [Google Scholar] [CrossRef]
- Miconi, T.; Rawal, A.; Clune, J.; Stanley, K. O. Backpropamine: training self-modifying neural networks with differentiable neuromodulated plasticity. arXiv:2002.10585 (2020).
- Pagliuca, P.; Milano, N.; Nolfi, S. Efficacy of modern neuro-evolutionary strategies for continuous control optimization. Front. Robotics AI 2020, 7, 98. [Google Scholar] [CrossRef]
- Long, Q.; et al. Evolutionary population curriculum for scaling multi-agent reinforcement learning. arXiv:2003.10423 (2020).
- Yu, C.; et al. The surprising effectiveness of ppo in cooperative multi-agent games. Adv. Neural Inf. Process. Syst. 2022, 35, 24611–24624. [Google Scholar]
- Song, Y.; Schneider, J. Robust reinforcement learning via genetic curriculum. In 2022 International Conference on Robotics and Automation (ICRA), 5560–5566 (IEEE, 2022).
- Racaniere, S.; et al. Automated curricula through setter-solver interactions. arXiv:1909.12892 (2019).
- Cahill, A. Catastrophic forgetting in reinforcement-learning environments. Ph.D. thesis, Citeseer (2011).
- French, R. M. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 1999, 3, 128–135. [Google Scholar] [CrossRef]
- Lowe, R.; et al. Multi-agent actor-critic for mixed cooperative-competitive environments. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Baker, B.; et al. Emergent tool use from multi-agent autocurricula. arXiv:1909.07528 (2019).
- Vaswani, A.; et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 261–272. [Google Scholar]
- Chen, J.; et al. Variational automatic curriculum learning for sparse-reward cooperative multi-agent problems. Adv. Neural Inf. Process. Syst. 2021, 34, 9681–9693. [Google Scholar]
- Qi, X.; Zhang, Z.; Zheng, H.; et al. MedConv: Convolutions beat transformers on long-tailed bone density prediction. arXiv:2502.00631 (2025).
- Wang, K.; Zhang, X.; Guo, Z.; et al. CSCE: Boosting LLM reasoning by simultaneous enhancing of causal significance and consistency. arXiv:2409.17174 (2024).
- Liu, S.; Wang, K. Comprehensive review: Advancing cognitive computing through theory of mind integration and deep learning in artificial intelligence. In Proceedings of the 8th International Conference on Computer Science and Application Engineering, 2024; pp. 31–35.
- Zhang, X.; Wang, K.; Hu, T.; et al. Enhancing autonomous driving through dual-process learning with behavior and reflection integration. In ICASSP 2025 – IEEE International Conference on Acoustics, Speech and Signal Processing; IEEE: Seoul, 2025; pp. 1–5.
- Zou, B.; Guo, Z.; Qin, W.; et al. Synergistic spotting and recognition of micro-expression via temporal state transition. In ICASSP 2025 – IEEE International Conference on Acoustics, Speech and Signal Processing; IEEE: Seoul, 2025; pp. 1–5.
- Hu, T.; Zhang, X.; Ma, H.; et al. Autonomous driving system based on dual process theory and deliberate practice theory. Manuscript, 2025.
- Zhang, X.; Wang, K.; Hu, T.; et al. Efficient knowledge transfer in multi-task learning through task-adaptive low-rank representation. arXiv:2505.00009 (2025).
- Wang, K.; Ye, C.; Zhang, H.; et al. Graph-driven multimodal feature learning framework for apparent personality assessment. arXiv:2504.11515 (2025).


Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).













