A. Dataset
This study uses the OpenAI Gym Extended Environment Dataset as the primary dataset. The dataset is designed to simulate diverse dynamic interaction environments and provides a reproducible experimental platform for the autonomous exploration and knowledge accumulation of open-world agents. It includes multiple types of environmental tasks such as navigation, manipulation, reasoning, and interaction, covering both continuous control and discrete decision modes. Each environment in the dataset consists of a state space, action space, reward signals, and optional dynamic parameters. These elements enable the training of agents under various levels of complexity and uncertainty. The diversity and scalability of the dataset support multi-level tasks ranging from low-dimensional physical control to high-dimensional visual perception, offering a unified framework for evaluating the adaptability and generalization capability of agents.

In terms of data structure design, the dataset employs a unified specification for state representation and task description to ensure comparability and knowledge transfer across tasks. Each task instance comprises a time-step sequence of state-action-reward-transition tuples, accompanied by relevant environmental metadata. The data exhibit temporal continuity while incorporating the uncertainty factors inherent in the environment, such as noise perturbations, task switching, and ambiguous state transitions. This design offers realistic and intricate contexts for modeling self-exploration and long-term memory in agents. By adopting this structured representation, agents can learn dynamic causal relationships and task dependencies from the data, laying a solid foundation for subsequent knowledge accumulation and generalization.
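To make the data layout concrete, the following is a minimal sketch of how a task instance of this kind could be represented; the class and field names (Transition, TaskInstance, the metadata keys) are illustrative assumptions rather than the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

import numpy as np


@dataclass
class Transition:
    """One state-action-reward-transition step of a task trajectory."""
    state: np.ndarray        # unified state representation
    action: np.ndarray       # continuous vector or discrete action encoding
    reward: float            # external reward signal from the environment
    next_state: np.ndarray   # successor state after the transition
    done: bool               # episode termination flag


@dataclass
class TaskInstance:
    """A task instance: a time-step sequence plus environmental metadata."""
    task_id: str
    metadata: Dict[str, Any] = field(default_factory=dict)     # e.g. noise level, dynamics parameters
    trajectory: List[Transition] = field(default_factory=list)

    def append_step(self, state, action, reward, next_state, done):
        self.trajectory.append(
            Transition(np.asarray(state), np.asarray(action), float(reward),
                       np.asarray(next_state), bool(done))
        )
```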
In addition, the dataset provides scalable interfaces and customizable configuration mechanisms. Researchers can generate new task environments or hybrid task sequences at different levels of complexity to evaluate algorithm adaptability and robustness under non-stationary conditions. The open design of the dataset allows seamless integration with frameworks such as reinforcement learning, meta-learning, and multi-agent systems. It supports continuous learning and evolutionary research of agents in open-world settings. This design not only ensures reproducibility and scalability but also provides a reliable data foundation for exploring self-driven mechanisms and knowledge accumulation processes in intelligent agents.
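As a hedged illustration of how such configurable task generation might look in practice, the snippet below sketches a hybrid task sequence built on the standard Gym API; the environment IDs, complexity levels, and noise settings are hypothetical placeholders, and the extended environments are assumed to register themselves with Gym.

```python
import gym  # standard OpenAI Gym API; the extended environments are assumed to be registered with it

# Hypothetical task-sequence configuration: the IDs, complexity levels, and noise
# settings below are illustrative placeholders, not names defined by the dataset.
TASK_SEQUENCE = [
    {"env_id": "NavigateMaze-v0",    "complexity": 1, "noise_std": 0.00},
    {"env_id": "ManipulateBlock-v0", "complexity": 2, "noise_std": 0.05},
    {"env_id": "ReasonAndFetch-v0",  "complexity": 3, "noise_std": 0.10},
]


def iter_hybrid_sequence(task_configs):
    """Instantiate a non-stationary sequence of tasks from a list of configurations."""
    for cfg in task_configs:
        env = gym.make(cfg["env_id"])   # each task exposes the usual reset()/step() interface
        yield cfg, env
```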
B. Experimental Results
This paper first conducts a comparative experiment, and the experimental results are shown in Table 1.
The experimental results demonstrate that the proposed algorithm consistently outperforms all comparison methods across four key metrics, highlighting its clear advantages in self-exploration and knowledge accumulation in open-world environments. It achieves the lowest MUE value (0.648), indicating effective suppression of strategy fluctuations and exploration errors through hierarchical memory encoding and dynamic knowledge updating, which enhance stability and adaptability in complex state spaces. Superior performance on the MCS metric further shows that the agent maintains higher policy consistency and decision coherence across multiple tasks, benefiting from intrinsic motivation and structured knowledge modeling that reduce cognitive drift during task switching. The reduced BD metric confirms improved control of behavioral deviation by balancing exploration and exploitation via a composite reward objective and a dynamic attention mechanism linking short-term perception with long-term memory. Finally, the significantly higher TCR demonstrates enhanced learning efficiency, task adaptability, and knowledge reuse, enabling autonomous, goal-oriented decision-making without external supervision. In addition, a sensitivity analysis of the intrinsic motivation weight coefficient λ on exploration efficiency is conducted, with results presented in Figure 2.
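To make the composite objective referred to above concrete, the following is a minimal sketch under the assumption that the intrinsic term is a prediction-error (novelty) bonus; the function names are illustrative, and the exact intrinsic signal used in the paper may differ.

```python
import numpy as np


def composite_reward(r_ext: float, r_int: float, lam: float) -> float:
    """Composite objective: external task reward plus λ-weighted intrinsic motivation."""
    return r_ext + lam * r_int


def intrinsic_bonus(predicted_next_state: np.ndarray, next_state: np.ndarray) -> float:
    """Illustrative intrinsic signal: prediction error of a learned dynamics model.

    This is one common choice of novelty bonus; the paper's exact intrinsic
    term may differ.
    """
    return float(np.mean((predicted_next_state - next_state) ** 2))
```

The weight λ in this objective is the coefficient whose sensitivity is analyzed next.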
The results indicate that the intrinsic motivation weight λ strongly influences exploration efficiency (MUE): as λ increases from 0.1 to 0.7, MUE steadily decreases, showing that stronger intrinsic motivation promotes more stable exploration, lower uncertainty, and more effective information acquisition in unknown environments. When λ is small (e.g., 0.1 or 0.3), the agent relies heavily on external rewards and tends to converge to local optima, resulting in higher MUE. The best performance is achieved at λ = 0.7 (MUE = 0.648), where the agent balances autonomous exploration and goal-driven learning, yielding stable yet flexible behavior suited to open-world uncertainty. However, further increasing λ to 0.9 slightly degrades performance (MUE = 0.662), indicating that excessive intrinsic motivation can cause over-exploration and weaken task-oriented learning. These findings highlight the necessity of hierarchical intrinsic motivation regulation, demonstrating that efficient knowledge accumulation and stable policy evolution emerge only under a properly balanced motivation intensity; a schematic of this sweep is sketched below. In addition, the impact of the knowledge update rate η on behavioral deviation is evaluated, as shown in Figure 3.
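Before turning to the η results, the schematic below makes the λ sweep concrete; `train_and_eval` is a hypothetical stand-in for the full training-and-evaluation pipeline, and the grid lists the λ values discussed in the text.

```python
from typing import Callable, Dict

# λ values discussed in the Figure 2 analysis (the full grid may contain more points).
LAMBDA_GRID = [0.1, 0.3, 0.7, 0.9]


def sweep_intrinsic_weight(train_and_eval: Callable[[float], float]) -> Dict[float, float]:
    """Train with each intrinsic-motivation weight and record the resulting MUE.

    `train_and_eval` maps a λ value to the measured MUE and stands in for the
    full pipeline; lower MUE is better (0.648 at λ = 0.7 in the reported results).
    """
    return {lam: train_and_eval(lam) for lam in LAMBDA_GRID}
```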
The experimental results show that the knowledge update rate η has a significant impact on the variation of the behavioral deviation (BD) metric. As η increases from 0.1 to 0.7, BD shows a continuous downward trend, indicating that the agent maintains higher behavioral consistency and policy stability under a moderate update rate. When η is low, such as 0.1 or 0.3, the knowledge updating process is slow, and the agent’s memory structure cannot promptly reflect environmental changes. This leads to a mismatch between old knowledge and new experiences, resulting in higher behavioral deviation. Such a lag effect makes the agent prone to policy drift and decision fluctuation in open-world environments, limiting the continuity and adaptability of autonomous learning.
When η gradually increases to 0.7, the behavioral deviation of the model reaches its lowest point (BD = 0.667), indicating that knowledge accumulation and updating achieve a dynamic balance. This result verifies the effectiveness of the proposed structured knowledge representation and progressive updating mechanism in complex environments. Under an appropriate knowledge update rate, the agent can effectively integrate long-term memory with new observational information, forming a stable path of knowledge evolution. As a result, it maintains consistent decision behavior during task transfer and policy evolution. This balancing mechanism allows the agent to continuously optimize its cognitive structure in the face of environmental perturbations and task switching, demonstrating strong self-correction and adaptive abilities.
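One way to read the progressive updating mechanism described above is as an exponential-moving-average blend of long-term knowledge and the newest experience embedding; the sketch below is an interpretation under that assumption rather than the paper's exact update rule.

```python
import numpy as np


def update_knowledge(knowledge: np.ndarray, new_embedding: np.ndarray, eta: float) -> np.ndarray:
    """Progressive knowledge update: blend long-term memory with new observations.

    A small eta makes memory lag behind environmental changes (higher BD in Figure 3),
    a large eta lets short-term experience overwrite long-term knowledge, and a
    moderate eta (≈ 0.7 in the reported results) balances stability and plasticity.
    """
    return (1.0 - eta) * knowledge + eta * new_embedding
```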
When η further increases to 0.9, the BD metric rises slightly (0.689), suggesting that overly rapid knowledge updating weakens model stability. Excessively frequent knowledge reconstruction amplifies short-term experiences and causes long-term knowledge to be forgotten, leading to policy instability and cognitive drift. This indicates that, in open-world scenarios, the agent’s knowledge updating process should not aim for speed extremes but rather maintain a gradual adaptation rhythm to ensure the coherence and reliability of the knowledge system. In summary, a moderate η enables the agent to achieve an optimal balance between exploration and accumulation, providing a stable cognitive foundation for continuous learning and self-evolution. This paper also presents experimental results on the stability of MCS under different data distribution shift conditions, as shown in Figure 4.
The experimental results show that under different data distribution shift conditions, the agent’s mean consistency score (MCS) gradually decreases as the degree of shift increases. This indicates that changes in data distribution have a significant impact on the model’s stability and policy consistency. Here, No Shift, Mild Shift, Moderate Shift, and Severe Shift respectively correspond to 0%, 10%, 25%, and 40% perturbations applied to the original data distribution, providing a quantifiable scale of distribution changes. When the environment data remain stable (No Shift), the MCS reaches its highest value (0.721), suggesting that the agent can maintain a high level of behavioral consistency and policy robustness under a relatively fixed input distribution. However, when a mild perturbation occurs in the data distribution (Mild Shift), the MCS begins to decline. This shows that even slight distribution shifts impose adjustment pressure on the model’s policy, reflecting the direct challenge posed by environmental dynamics to model generalization.
As the distribution shift becomes more severe (Moderate Shift and Severe Shift), the MCS decreases further to 0.682 and 0.654, respectively. Although the model possesses certain self-regulation capabilities, once the difference between input feature distributions and historical knowledge representations exceeds a threshold, mismatches arise between internal knowledge structures and policy expressions, leading to reduced decision coherence. These experimental findings confirm the necessity of the proposed adaptive knowledge accumulation and dynamic policy updating mechanisms. Through continuous optimization of the knowledge structure and temporal memory correction, the model can effectively enhance its stability and consistency under distribution shifts, ensuring robust performance of open-world agents in dynamic environments.
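The following sketch shows how the four shift conditions could be instantiated and how a consistency-style score might be computed; the noise-based perturbation and the action-agreement measure are assumptions for illustration, not the paper's exact definition of MCS.

```python
import numpy as np

# Perturbation magnitudes corresponding to the four shift conditions.
SHIFT_LEVELS = {"No Shift": 0.00, "Mild Shift": 0.10, "Moderate Shift": 0.25, "Severe Shift": 0.40}


def perturb_states(states: np.ndarray, shift: float, rng: np.random.Generator) -> np.ndarray:
    """Shift the input distribution by adding noise scaled to the per-feature spread."""
    scale = shift * states.std(axis=0, keepdims=True)
    return states + rng.normal(0.0, 1.0, size=states.shape) * scale


def consistency_score(policy, states: np.ndarray, shifted_states: np.ndarray) -> float:
    """Illustrative consistency measure: the fraction of states on which the policy
    selects the same action for clean and shifted inputs (a stand-in for MCS).

    Assumes a deterministic policy returning discrete actions.
    """
    clean_actions = np.array([policy(s) for s in states])
    shifted_actions = np.array([policy(s) for s in shifted_states])
    return float(np.mean(clean_actions == shifted_actions))
```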