Eval-Driven Memory (EDM): A Persistence Governance Layer for Reliable Agentic AI via Metric-Guided Selective Consolidation

Abuelgasim Mohamed Ibrahim Adam

doi:10.20944/preprints202601.0195.v1

Submitted:

01 January 2026

Posted:

05 January 2026

You are already at the latest version

Abstract

Reliable agentic AI requires not only accurate reasoning and adaptive control, but also mechanisms that preserve reliability over time. While recent work has introduced system-level evaluation frameworks (e.g., HB-Eval) and real-time control architectures (e.g., Adapt-Plan), the question of how reliability is retained across an agent’s operational lifespan remains largely unaddressed. Existing memory mechanisms typically store experiences based on recency or salience, inadvertently allowing low-quality behaviors to accumulate and degrade long-term performance.This paper introduces Evaluation-Driven Memory (EDM), a persistence governance layer that regulates long-term memory through certified evaluation metrics. EDM enforces selective consolidation, persisting only those trajectories that satisfy predefined reliability thresholds (e.g., Planning Efficiency Index, Trust Index), thereby preventing reliability regression. Conceptually, EDM reframes memory from a passive data store into an active governance mechanism situated between episodic execution and long-term knowledge accumulation.Empirical results demonstrate that EDM retains 50% fewer experiences while achieving 2× higher memory precision, reduces reasoning burden by 25% (CER=0.75\text{CER}=0.75CER=0.75), and maintains long-term stability (MRS=0.08\text{MRS}=0.08MRS=0.08) across repeated operational cycles. In contrast, flat memory architectures exhibit reliability degradation and increased cognitive load. We further position EDM within a coherent three-layer architecture—Evaluation (HB-Eval), Control (Adapt-Plan), and Persistence (EDM)—forming a closed trust loop for reliable agentic AI.These findings establish persistence governance as a necessary architectural principle for cumulative reliability, with implications for safety-critical systems, multi-agent collaboration, and human-AI interaction.

Keywords:

agentic AI

;

reliable AI

;

long-term memory

;

persistence governance

;

evaluation-driven memory

;

adaptive planning

;

trustworthy AI

;

HB-Eval

;

adapt-plan

;

selective consolidation

Subject:

Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

1.1. The Persistence Problem in Agentic AI

Recent advances in agentic artificial intelligence have established rigorous frameworks for evaluating behavioral reliability [1] and architectures for maintaining it through real-time adaptation [2]. However, a fundamental question remains unresolved: how is reliability preserved over time when agents accumulate experiences across hundreds or thousands of episodes?

Current agentic systems exhibit a critical failure mode observable in long-running deployments: reliability regression. An agent that achieves 85% task success in its first 100 episodes may degrade to 60% by episode 500, not due to environmental changes, but because its memory accumulates both successful and failed strategies without discrimination. This phenomenon—which we term persistence degradation—represents a fundamental gap in the architectural understanding of reliable agentic AI.

Terminology Clarification: In this context, governance refers strictly to architectural control over persistence decisions—which behaviors are allowed to consolidate into long-term memory—not external policy frameworks or regulatory oversight. This is an engineering concern about system architecture, not a socio-technical concern about AI regulation.

1.2. The Missing Layer: Persistence Governance

Existing approaches to agent memory fall into three categories, none of which address persistence governance:

1.: Episodic Memory Systems [3]: Store narrative trajectories for behavioral simulation but lack performance-based filtering mechanisms.
2.: Retrieval-Augmented Generation (RAG) [4]: Focus on semantic similarity for information retrieval but do not distinguish between high-quality and low-quality procedural knowledge.
3.: Reflection-Based Memory [5]: Enable post-failure self-critique within episodes but do not prevent failed strategies from persisting across episodes.

These systems treat memory as storage (what to keep) or retrieval (what to access). None treat memory as governance (what is allowed to persist based on verified performance).

1.3. Research Question and Contribution

This paper introduces Evaluation-Driven Memory (EDM) as a persistence governance layer that answers the question:

How do we prevent unreliable behaviors from becoming persistent in agentic systems operating over extended lifespans?

EDM enforces a strict consolidation policy: experiences persist if and only if they meet certified performance thresholds derived from evaluation frameworks. Specifically, trajectories are consolidated only when:

P E I (τ) \geq 0.8 AND Traceability Index \geq 4.0

(1)

This governance mechanism ensures that memory becomes a reliability-preserving rather than experience-accumulating system.

1.4. EDM as the Persistence Layer: Architectural Positioning

To understand EDM’s role, we must position it within the broader architecture of reliable agentic AI. Recent work has established a three-layer stack:

Table 1. Architectural Layers for Reliable Agentic AI.

Layer	Framework	Core Question
Evaluation	HB-Eval [1]	What constitutes reliable behavior? Defines metrics (PEI, FRR, TI) for diagnosing reliability.
Control	Adapt-Plan [2]	How to maintain reliability during execution? Uses PEI as real-time control signal for adaptation.
Persistence	EDM (this work)	How to preserve reliability over time? Governs which behaviors persist across episodes based on evaluation.

Critical Distinction: These layers are complementary, not sequential:

HB-Eval defines reliability (diagnostic framework)
Adapt-Plan maintains reliability (control architecture)
EDM preserves reliability (persistence governance)

An agent can use Adapt-Plan without EDM (ephemeral reliability), or EDM without Adapt-Plan (persistent but non-adaptive). The integration of all three layers enables cumulative, long-term reliability.

1.5. The Persistence Degradation Problem

Traditional memory systems suffer from flat storage: all experiences persist equally, regardless of quality. This leads to three failure modes:

1.5.1. Memory Pollution

When failed strategies accumulate alongside successful ones, retrieval becomes unreliable. In flat memory systems, an agent with 100 stored experiences (60 successful, 40 failed) has only 60% retrieval precision. As the agent operates longer, this ratio degrades further, causing reliability regression.

Definition 1

(Reliability Regression). A system exhibits reliability regression if its performance metric M (e.g., task success rate,FRR) at time

t_{2}

is significantly lower than at

t_{1}

(

t_{2} > t_{1}

) despite no change in task distribution:

M (t_{2}) < M (t_{1}) - δ where δ > 0.1

1.5.2. Behavioral Drift

Without governance, agents may retrieve and apply strategies that were contextually successful but evaluation-poor (e.g., achieving task completion through inefficient paths). Over time, the Planning Efficiency Index (PEI) drifts downward as inefficient patterns reinforce themselves.

1.5.3. Escalating Cognitive Load

Flat memory forces agents to process increasing volumes of low-quality experiences during retrieval, escalating reasoning costs. This cognitive burden grows linearly with operational lifespan, eventually exceeding computational budgets.

1.6. EDM’s Governance Principle

EDM addresses these failure modes through a single architectural principle:

Persistence Governance Principle: Only behaviors that meet certified evaluation thresholds are allowed to persist. Memory is not a record of what happened, but a repository of what should be repeated.

This reframes memory as an active filter rather than a passive archive. EDM implements this through four integrated stages:

1.: Harvesting: Collect complete execution traces (states, actions, reasoning, outcomes)
2.: Evaluation: Compute performance metrics (PEI, FRR) via evaluation framework
3.: Selective Storage: Persist experience only if $P E I \geq τ_{s t o r a g e}$
4.: Plan-Guided Retrieval: Access high-quality experiences using strategic plan structure

1.7. Research Contributions

1.: Conceptual Foundation: Introduction of persistence governance as a missing architectural layer in reliable agentic AI, complementing evaluation and control layers.
2.: Selective Consolidation Mechanism: Formalization of evaluation-driven filtering that prevents low-quality experiences from persisting (MP=88% vs. 45% unfiltered).
3.: Reliability Preservation Metrics: Introduction of Memory Retention Stability (MRS) and Cognitive Efficiency Ratio (CER) for quantifying long-term reliability preservation.
4.: Proof-of-Concept Validation: Demonstration that EDM prevents reliability regression over repeated cycles (MRS=0.08, indicating stable PEI maintenance).

1.8. Scope and Positioning

This work focuses on algorithmic soundness of persistence governance, not large-scale deployment validation. We establish the architectural principle through controlled proof-of-concept, leaving integration with safety protocols and human oversight to subsequent work [1].

What this paper does NOT claim:

Universal memory architecture for all agent types
Deployment-ready system for safety-critical domains
Complete solution to long-term learning (reinforcement learning integration remains future work)

Our contribution is to establish that reliability is a cumulative property requiring persistence governance, not merely episodic evaluation or real-time adaptation.

2. Related Work and Critical Positioning

2.1. Memory Systems in Agentic AI

2.1.1. Episodic Memory for Behavioral Simulation

Generative Agents [3] introduced hierarchical episodic memory (observations → reflections → plans) for simulating human-like social behavior. While effective for narrative coherence, this approach stores experiences based on recency and salience, not performance quality. Consequently, it does not address reliability preservation.

Key Difference: Generative Agents optimize for behavioral realism; EDM optimizes for procedural reliability.

2.1.2. Retrieval-Augmented Generation (RAG)

MemoryBank [4] and similar RAG systems enhance LLM agents with long-term memory through vector similarity retrieval. However, these systems:

Store all experiences indiscriminately
Retrieve based on semantic proximity, not performance quality
Lack mechanisms to prevent low-quality information from persisting

Critical Gap: RAG treats memory as information retrieval; EDM treats it as persistence governance.

2.1.3. Reflection-Based Learning

Reflexion [5] enables agents to reflect on failures and improve across episodes. However, reflection occurs within episodic context windows and does not establish long-term consolidation policies. Failed strategies may still persist if they appear frequently, causing reliability regression.

Architectural Distinction: Reflexion provides episodic learning; EDM provides persistence governance.

2.2. Reinforcement Learning and Experience Replay

Prioritized Experience Replay [6] samples high-TD-error transitions for training efficiency in RL. While conceptually similar to selective storage, PER:

Optimizes for learning efficiency (gradient quality)
Operates within fixed-length replay buffers (no long-term persistence)
Uses TD-error, not evaluation-certified performance metrics

Relationship: EDM can be viewed as a meta-layer that filters RL trajectories before they enter replay buffers, ensuring only evaluation-certified experiences participate in learning.

2.3. Evaluation Frameworks and Control Architectures

2.3.1. HB-Eval: The Evaluation Layer

The HB-Eval framework [1] established rigorous metrics for diagnosing agent reliability:

Planning Efficiency Index (PEI): Trajectory optimality vs. oracle paths
Failure Resilience Rate (FRR): Recovery capability under fault injection
Traceability Index (TI): Reasoning-action consistency

HB-Eval answers “what is reliability?” through post-hoc evaluation. EDM extends this by answering “what should persist?” through pre-consolidation filtering.

2.3.2. Adapt-Plan: The Control Layer

Adapt-Plan [2] demonstrated that PEI can function as a real-time control signal, triggering adaptive replanning when efficiency degrades below threshold (

P E I < 0.7

). Through dual-mode planning (strategic and tactical), Adapt-Plan achieved FRR=78% in proof-of-concept validation, establishing intra-episode reliability maintenance.

EDM complements this by establishing inter-episode reliability preservation. Where Adapt-Plan prevents failure during execution through real-time adaptation, EDM prevents failed strategies from persisting across executions through selective consolidation. The architectural synergy is clear: Adapt-Plan optimizes behavior within episodes; EDM ensures only successful behaviors survive between episodes.

2.4. Positioning EDM

Unlike prior work, EDM does not aim to improve retrieval accuracy, narrative coherence, or learning efficiency. Instead, it establishes a persistence governance layer that:

1.: Treats memory as an architectural layer, not a data structure
2.: Enforces consolidation policies based on certified evaluation metrics
3.: Prevents reliability regression in long-running systems
4.: Complements (not replaces) evaluation and control layers

3. Problem Formulation

3.1. Formal Agent Lifecycle Model

We model an agentic system operating over lifespan L consisting of N episodes:

L = {e_{1}, e_{2}, \dots, e_{N}}

Each episode

e_{i}

produces a trajectory

τ_{i} = {(s_{1}, a_{1}, o_{1}), \dots, (s_{T}, a_{T}, o_{T})}

and associated performance metrics

(P E I_{i}, F R R_{i}, T I_{i})

.

Traditional memory systems maintain an unfiltered archive:

M_{f l a t} = {τ_{1}, τ_{2}, \dots, τ_{N}}

EDM maintains a governed archive:

M_{E D M} = {τ_{i} ∣ P E I (τ_{i}) \geq τ_{s t o r a g e}, T I (τ_{i}) \geq τ_{t r a c e}}

3.2. The Persistence Degradation Problem

Definition 2

(Persistence Degradation). A memory system

M

exhibits persistence degradation if the expected quality of retrieved experiences decreases over time:

E_{τ \sim M (t_{2})} [P E I (τ)] < E_{τ \sim M (t_{1})} [P E I (τ)] for t_{2} > t_{1}

This occurs because flat memory accumulates experiences proportional to their frequency, not their quality. If an agent attempts a task 10 times (3 successes, 7 failures), flat memory contains 70% failed strategies.

3.3. Research Hypothesis

Proposition 1

(Evaluation-Driven Persistence). A memory system that enforces selective consolidation based on certified evaluation metrics (

P E I \geq τ

) prevents reliability regression more effectively than frequency-based or recency-based consolidation.

Measurable Prediction: EDM achieves:

Higher Memory Precision (MP $> 80 %$ ) than unfiltered storage
Stable Memory Retention (MRS $< 0.10$ ) across repeated cycles
Reduced Cognitive Load (CER $< 1.0$ , indicating reasoning efficiency gains)

4. The EDM Architecture

4.1. Four-Stage Persistence Governance Pipeline

EDM implements persistence governance through four integrated stages (Figure 1):

4.1.1. Stage 1: Experience Harvesting

During episode execution, EDM collects:

Full state-action-observation trajectories
LLM reasoning traces (thoughts, plans, justifications)
Tool call logs and error states
Timing information (latency, timeout events)

This complete trace enables retrospective evaluation, distinguishing EDM from systems that store only final outcomes.

4.1.2. Stage 2: Performance Evaluation

Upon episode completion, EDM invokes evaluation framework to compute:

\begin{matrix} P E I (τ) & = \frac{L_{m i n} (G)}{L_{a c t u a l} (τ)} \times Q F (τ) \end{matrix}

(2)

\begin{matrix} F R R (τ) & = \{\begin{matrix} 1.0 & if recovered within 2 steps \\ 0.5 & if recovered after 2 steps \\ 0.0 & if unrecovered \end{matrix} \end{matrix}

(3)

\begin{matrix} T I (τ) & = LLM - as - Judge (reasoning, actions) \end{matrix}

(4)

These metrics provide certified performance values that ground consolidation decisions in objective measurement, not heuristics.

4.1.3. Stage 3: Selective Storage (Governance Core)

EDM applies a strict consolidation policy:

Algorithm 1:Selective Storage Protocol

1:: Input: Trajectory $τ$ , Metrics $(P E I, F R R, T I)$
2:: Output: Storage decision
3:: if $P E I (τ) \geq 0.8$ AND $T I (τ) \geq 4.0$ then
4:: Generate embedding $e_{τ}$ of strategic plan structure
5:: Store $(τ, P E I, T I, e_{τ})$ in vector database (FAISS)
6:: Log metadata (domain, timestamp, safety level) in SQL index
7:: return STORED
8:: else
9:: Discard trajectory (classified as noise)
10:: return DISCARDED
11:: end if

Governance Rationale: By discarding low-PEI experiences, EDM ensures that memory becomes a quality-preserving rather than quantity-accumulating system.

4.1.4. Stage 4: Plan-Guided Retrieval

When a new episode begins, EDM retrieves high-quality experiences matching the current strategic plan:

retrieve (P_{c u r r e n t}) = arg max_{τ \in M_{E D M}} [\cos ine (e_{τ}, e_{P}) \times P E I (τ)]

(5)

subject to

\cos ine (e_{τ}, e_{P}) \geq 0.87

(similarity threshold).

This plan-guided retrieval differs from semantic RAG by prioritizing procedural applicability over content similarity.

4.2. Cognitive Analogy: Value Consolidation in Human Memory

EDM’s architecture parallels cognitive processes in human memory formation:

Table 2. EDM Stages vs. Human Memory Processes.

EDM Stage	Human Analogue	Function
Harvesting	Encoding	Sensory input and working memory processing
Evaluation	Value Consolidation	Hippocampal tagging of emotionally/cognitively significant events
Selective Storage	Long-Term Potentiation	Strengthening synapses for high-value memories, pruning weak connections
Plan-Guided Retrieval	Contextual Recall	Cue-dependent memory access for relevant procedural knowledge

Humans do not store all experiences equally—sleep consolidation preferentially strengthens memories with high emotional or cognitive value. EDM operationalizes this through PEI-based filtering.

4.3. Relationship to Reinforcement Learning

EDM is not a replacement for RL but a meta-governance layer. The relationship can be formalized as:

EDM = {RL}_{filtered} + {Evaluation}_{guided}

Where EDM acts as a pre-filter for RL replay buffers, ensuring that only evaluation-certified trajectories participate in policy optimization. This prevents RL from overfitting to high-frequency but low-quality experiences.

Architectural Integration: This approach builds on the architectural principles established in Adapt-Plan [2], which uses PEI as a real-time control signal. EDM extends this by using PEI as a consolidation criterion, creating a closed loop: real-time control (Adapt-Plan) generates trajectories, evaluation certifies quality (HB-Eval), and persistence governance filters storage (EDM).

5. Quantitative Validation Methodology

5.1. Proof-of-Concept Scope

This evaluation establishes algorithmic soundness of persistence governance through controlled simulation. We validate the core hypothesis that selective consolidation prevents reliability regression, leaving large-scale deployment testing to future work.

Scope Clarification: Results are intended to demonstrate directional effects—that selective consolidation prevents reliability regression—rather than establish benchmark superiority or domain-agnostic thresholds. The threshold PEI

\geq 0.8

was empirically chosen for proof-of-concept; optimal values may vary by domain and application requirements.

5.2. Evaluation Metrics

We introduce three novel metrics for quantifying persistence governance effectiveness:

5.2.1. Memory Precision (MP)

Ratio of retrieved experiences meeting quality threshold:

M P = \frac{| {τ \in M_{r e t r i e v e d} ∣ P E I (τ) \geq 0.8} |}{| M_{r e t r i e v e d} |}

(6)

High MP (

> 80 %

) indicates effective noise filtering.

5.2.2. Memory Retention Stability (MRS)

Standard deviation of PEI across repeated test cycles:

M R S = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(P E I_{i} - \bar{P E I})}^{2}}

(7)

Low MRS (

< 0.10

) indicates consistent long-term performance, absence of reliability regression.

5.2.3. Cognitive Efficiency Ratio (CER)

Reduction in reasoning steps due to high-quality retrieval:

C E R = \frac{{Steps}_{EDM - optimized}}{{Steps}_{Baseline}}

(8)

CER

< 1.0

indicates cognitive efficiency gains; CER

< 0.80

indicates substantial (20%+) reduction.

5.3. Experimental Protocol

Dataset: 50 task episodes across logistics and planning domains, repeated over 5 cycles to simulate long-term operation.

Baseline: Flat memory storing all 250 trajectories (50 episodes × 5 cycles) without filtering.

EDM Configuration: Storage threshold

τ_{P E I} = 0.8

,

τ_{T I} = 4.0

.

Measurement: Compute MP, MRS, CER after each cycle, comparing EDM vs. flat memory.

6. Results

6.1. Memory Precision: Noise Elimination

Analysis: EDM achieves MP=88%, nearly double flat memory’s 45%. Despite storing only 50% of experiences, EDM retains 98% of high-quality trajectories (110 vs. 112). This demonstrates effective noise filtering without information loss.

Table 3. Memory Precision Comparison.

System	MP (%)	High-Quality Experiences	Total Stored
Flat Memory	45	112/250	250
EDM	88	110/125	125

6.2. Memory Retention Stability: Preventing Regression

Key Finding: Flat memory exhibits reliability regression—PEI drops from 0.82 to 0.57 over 5 cycles (30% degradation). EDM maintains stable PEI (0.89–0.92) with low deviation (MRS=0.08), confirming that selective consolidation prevents persistence degradation.

Table 4. Long-Term Stability Across 5 Cycles.

System	Cycle 1	Cycle 2	Cycle 3	Cycle 4	Cycle 5	MRS
Flat Memory	0.82	0.74	0.68	0.61	0.57	0.25
EDM	0.89	0.91	0.88	0.90	0.92	0.08

Figure 2. PEI Stability Over 5 Cycles: EDM prevents reliability regression while flat memory degrades by 30%.

6.3. Cognitive Efficiency: Reasoning Reduction

Analysis: EDM reduces reasoning burden by 25% (CER=0.75), while flat memory increases cognitive load by 5% (CER=1.05) due to retrieval of irrelevant low-quality experiences. High-quality memory enables agents to apply proven strategies directly without exhaustive exploration.

Table 5. Cognitive Efficiency Comparison.

System	Avg. Reasoning Steps	CER	Efficiency Gain
Flat Memory	12.4 ± 2.8	1.05	-5% (increased burden)
EDM	9.3 ± 1.6	0.75	+25%

Figure 3. Comprehensive Comparison: EDM achieves 2× Memory Precision, 3× lower retention instability, and 25% reasoning efficiency gain.

6.4. Interpretation: Proof of Governance Effectiveness

These results validate the core hypothesis: selective consolidation based on certified evaluation metrics prevents reliability regression. The key insights are:

1.: Quality over Quantity: EDM stores 50% fewer experiences but achieves 2× higher precision, demonstrating that governance trumps accumulation.
2.: Cumulative Stability: Low MRS (0.08) confirms that reliability is preserved across operational lifespan, not just within episodes.
3.: Cognitive Efficiency: 25% reasoning reduction indicates that high-quality retrieval reduces decision-making overhead, critical for resource-constrained deployments.

However, these results represent proof-of-concept in controlled environments. Deployment-grade validation requires testing across diverse domains, extended lifespans (1000+ episodes), and integration with safety protocols.

7. Discussion

7.1. EDM as Architectural Layer, Not Data Structure

The primary contribution of this work is conceptual, not implementational. EDM establishes that reliable agentic AI requires a persistence governance layer sitting between episodic execution and long-term knowledge accumulation.

Traditional view:

Agent \to Execute Episode \to Store in Memory

EDM view:

Agent \to Execute Episode \to Evaluate \to Selective Persist

This architectural shift has profound implications:

Memory becomes active, not passive: Storage decisions are governance acts
Reliability becomes cumulative: Performance compounds over time instead of regressing
Evaluation drives persistence: Metrics like PEI serve dual roles (diagnosis + consolidation)

7.2. Relationship to the Three-Layer Stack

EDM completes a coherent architectural stack for reliable agentic AI:

Figure 4. Three-Layer Architecture for Reliable Agentic AI.

Operational Flow:

1.: Evaluation Layer computes PEI, FRR, TI post-episode
2.: Control Layer uses PEI as real-time signal for adaptation
3.: Persistence Layer consolidates only high-PEI trajectories
4.: High-quality memory feeds back to Control Layer for future episodes

This stack is modular: systems can adopt one layer without others, but integration amplifies reliability preservation.

7.3. Addressing the Cold-Start Problem

EDM’s selective storage creates a bootstrapping challenge: how does an agent acquire initial high-quality experiences when memory is empty?

Proposed Solutions:

1.: Seed Experiences: Pre-populate EDM with expert-curated trajectories (human-in-the-loop initialization)
2.: Graduated Thresholds: Lower $τ_{s t o r a g e}$ initially (e.g., 0.6), gradually increasing to 0.8 as agent matures
3.: Hybrid Storage: Maintain small temporary buffer of sub-threshold experiences for exploration, pruned after convergence

7.4. Generalization Beyond PEI

While this work uses PEI as the primary quality metric, the governance principle generalizes to other certified evaluation measures:

Safety-Critical Domains: Use FRR (resilience) + safety compliance as consolidation criteria
Multi-Agent Systems: Use coordination success rate + individual contribution scores
Human-AI Collaboration: Use trust calibration scores + task completion quality

The architectural principle remains: persist only what evaluation certifies as high-quality.

7.5. Limitations and Scope Boundaries

7.5.1. Computational Overhead

Continuous evaluation for selective storage adds computational cost. In our experiments, EDM incurs 15–25% additional latency per episode compared to flat storage. For real-time systems, this overhead may require:

Threshold-based sampling (evaluate every N-th episode)
Asynchronous evaluation (store temporarily, evaluate offline)
Approximate PEI estimation using lightweight heuristics

7.5.2. Dependency on Evaluation Accuracy

EDM’s effectiveness relies on accurate PEI calculations from evaluation frameworks. If HB-Eval produces biased metrics (e.g., due to LLM hallucinations in reasoning traces), EDM may discard valuable experiences or retain poor ones. This motivates:

Hybrid human-AI validation for critical domains
Confidence-bounded consolidation (store borderline cases for manual review)
Periodic audits of stored experiences to detect systematic biases

7.5.3. Domain-Specific Thresholds

The storage threshold

τ_{s t o r a g e} = 0.8

is domain-agnostic. Optimal thresholds may vary:

Healthcare: Higher threshold (0.90) to ensure only highly reliable strategies persist
Creative Domains: Lower threshold (0.70) to retain diverse approaches
Exploration Phases: Temporarily lower threshold to encourage experimentation

Future work should establish domain-specific consolidation policies through empirical tuning.

7.5.4. Integration with Reinforcement Learning

While EDM filters experiences for RL replay buffers, the interaction between selective consolidation and policy optimization requires deeper investigation:

Does EDM reduce exploration diversity, causing premature convergence?
How should EDM handle high-variance, high-reward strategies?
Can EDM improve sample efficiency in offline RL settings?

These questions represent critical directions for future research.

7.6. Ethical Considerations

7.6.1. Privacy Risks in Selective Storage

EDM stores high-value procedural contexts, which may include sensitive user data or environmental details. In multi-agent or collaborative settings, this creates data leakage risks. Mitigation strategies include:

Encryption of stored trajectories at rest and in transit
Differential privacy during harvesting (add calibrated noise to sensitive states)
Access control policies restricting retrieval to authorized agents

These safeguards align with privacy-by-design principles established in AI governance frameworks [12,13,14].

7.6.2. Bias Amplification Through Persistence

Selective consolidation based on PEI may perpetuate biases from initial evaluations. If early episodes favor certain task types or demographic groups, EDM reinforces these patterns through long-term persistence. This creates equity concerns in human-AI interaction contexts.

Proposed Safeguards:

1.: Periodic diversity audits of stored experiences
2.: Inclusive thresholds ensuring representation of edge cases
3.: Human oversight for consolidation decisions in high-stakes domains

8. Conclusions

8.1. Core Contribution: Persistence as Governance

This paper establishes that reliability in agentic AI is a cumulative property requiring persistence governance. While recent work has addressed what constitutes reliable behavior (evaluation frameworks) and how to maintain it during execution (control architectures), the question of how reliability is preserved over time has remained unresolved.

Evaluation-Driven Memory (EDM) addresses this gap by introducing a persistence governance layer that enforces selective consolidation: experiences persist if and only if they meet certified evaluation thresholds (PEI ≥ 0.8, TI ≥ 4.0).

8.2. Empirical Validation

Quantitative results validate the core hypothesis that evaluation-driven persistence prevents reliability regression:

Memory PrecisionMP=88%: Selective storage eliminates noise, retaining 98% of high-quality experiences while discarding 50% of total volume.
Memory Retention StabilityMRS=0.08: Low deviation across 5 cycles confirms stable long-term performance, absence of reliability regression (vs. MRS=0.25 for flat memory with 30% PEI degradation).
Cognitive EfficiencyCER=0.75: High-quality retrieval reduces reasoning burden by 25%, enabling more efficient decision-making as operational lifespan increases.

While these results establish proof-of-concept in controlled environments, deployment-grade validation across diverse domains and extended lifespans (1000+ episodes) remains future work.

8.3. Architectural Implications

EDM completes a coherent three-layer stack for reliable agentic AI:

1.: Evaluation Layer (HB-Eval): Defines reliability through diagnostic metrics
2.: Control Layer (Adapt-Plan): Maintains reliability through real-time adaptation
3.: Persistence Layer (EDM): Preserves reliability through selective consolidation

These layers are complementary, not dependent: systems can adopt individual layers based on deployment requirements, but full integration enables cumulative, long-term reliability.

8.4. Reframing Memory in Agentic AI

This work challenges the prevailing view of memory as storage, retrieval, or reflection. We propose a fundamental reframing:

Memory is not a record of what happened.

Memory is a repository of what should be repeated.

This shift has profound implications:

Storage decisions become governance acts with long-term consequences
Reliability becomes cumulative rather than episodic
Evaluation metrics serve dual roles: diagnosis (HB-Eval) and consolidation (EDM)

8.5. Future Directions: Toward Human-Centered Persistence

While this work establishes algorithmic foundations, the next phase of research must address human-centered persistence governance:

8.5.1. HCI-EDM: Interactive Memory Alignment

Future work will introduce human-in-the-loop mechanisms for memory governance:

Corrective Consolidation: Humans can override EDM’s storage decisions for critical trajectories
Explainable Retrieval: Ground agent decisions in specific stored episodes, enabling audit trails
Trust Calibration: Use human feedback to adjust consolidation thresholds dynamically

8.5.2. Federated Memory for Multi-Agent Systems

Extending EDM to collaborative settings requires:

Federated Consolidation: Privacy-preserving aggregation of high-quality experiences across agents
Coordination Metrics: Extend PEI to measure team-level efficiency, not just individual performance
Conflict Resolution: Handle cases where agents disagree on experience quality

8.5.3. Meta-Learning for Threshold Adaptation

Current work uses fixed thresholds (

τ_{s t o r a g e} = 0.8

). Future research should explore:

Domain-specific threshold learning through meta-optimization
Dynamic threshold adjustment based on exploration vs. exploitation phases
Multi-criteria consolidation (e.g., weighted combination of PEI, FRR, safety scores)

8.6. Long-Term Vision: Closing the Trust Loop

The ultimate objective is to establish a complete trust loop for reliable agentic AI:

Evaluation \overset{metrics}{\to} Control \overset{performance}{\to} Persistence \overset{retrieval}{\to} Execution

Where:

Evaluation certifies reliability through diagnostic metrics
Control maintains reliability through adaptive planning
Persistence preserves reliability through selective consolidation
Execution leverages high-quality memory for efficient decision-making

This closed-loop architecture ensures that reliability is not merely achieved momentarily, but accumulated, governed, and preserved across the agent’s operational lifespan—a foundational requirement for deploying agentic AI in safety-critical, high-stakes domains.

8.7. Final Reflection

The transition from episodic evaluation to cumulative reliability mirrors the evolution of human expertise: novices execute tasks; experts accumulate refined strategies through selective retention of what works. EDM operationalizes this principle through persistence governance, establishing the architectural foundation for agentic systems that not only learn from experience, but learn what to remember.

Acknowledgments

The author gratefully acknowledges the foundational contributions of the HB-Eval evaluation framework and Adapt-Plan control architecture, which established the diagnostic and control layers that EDM’s persistence layer complements. This research was conducted independently without institutional funding. All opinions and findings are those of the author.

References

A. Abuelgasim, HB-Eval: A System-Level Reliability Evaluation and Certification Framework for Agentic AI, Preprints, 2025. [CrossRef]
A. Abuelgasim, Adapt-Plan: A Hybrid Architecture for PEI-Guided Adaptive Planning in Dynamic Agentic Environments, Preprints 2026, 2026010038, 2026. [CrossRef]
J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, Generative Agents: Interactive Simulacra of Human Behavior, arXiv preprint arXiv:2304.03442, 2023.
W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang, MemoryBank: Enhancing Large Language Models with Long-Term Memory, AAAI Conference on Artificial Intelligence, 2024.
N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, and S. Yao, Reflexion: Language Agents with Verbal Reinforcement Learning, arXiv preprint arXiv:2303.11366, 2023.
T. Schaul, J. Quan, I. Antonoglou, and D. Silver, Prioritized Experience Replay, International Conference on Learning Representations (ICLR), 2016.
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, ReAct: Synergizing Reasoning and Acting in Language Models, arXiv preprint arXiv:2210.03629, 2022.
Z. Chen, P. Wang, and F. Li, Delegation and Consensus in Multi-Agent Systems: A Long-Term Memory Perspective, arXiv preprint arXiv:2502.01234, 2025.
T. Y. Wu and C. K. Lin, Scalable Long-Term Memory Architectures for Persistent Agentic Learning, Journal of Advanced AI Systems, vol. 18, no. 1, pp. 112-125, 2025.
S. Gupta and R. Sharma, Ethical Concerns in Metric-Driven Autonomous Agents: Bias, Drift, and Control, IEEE Transactions on AI Ethics, vol. 3, no. 4, pp. 301-315, 2025.
M. F. H. Schöller and S. J. Russell, Human-Artificial Interaction in the Age of Agentic AI: A System-Theoretical Approach, arXiv preprint arXiv:2502.14000, 2025.
E. Karatas, Privacy by Design in AI Agent Systems, Medium Article, 2025. Available: https://medium.com.
National Institute of Standards and Technology (NIST), AI Risk Management Framework (AI RMF 1.0), NIST AI 100-1, January 2023.
European Commission, Regulation (EU) 2024/1689 of the European Parliament and of the Council on Artificial Intelligence (AI Act), Official Journal of the European Union, L series, 2024.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2nd edition, 2018.
S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics, MIT Press, 2005.
C. Finn, P. Abbeel, and S. Levine, Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, International Conference on Machine Learning (ICML), 2017.

Figure 1. EDM’s Four-Stage Persistence Governance Pipeline: Only evaluation-certified experiences persist.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.