Submitted:
03 April 2026
Posted:
07 April 2026
Abstract
Keywords:
1. Introduction
1.1. The Misplaced Attention Problem

1 [Practitioner Report] Chase, H. (2026). Context engineering our way to long-horizon agents. Sequoia Capital Podcast. (Non-peer-reviewed practitioner account.)
2 [Practitioner Report] Gupta, A. (2026). 2025 was agents. 2026 is agent harnesses. Medium. (Non-peer-reviewed practitioner account.)
3 [Practitioner Report] Zylon.ai (2026). What Is OpenClaw? A Practical Guide to the Agent Harness Behind the Hype. zylon.ai. (Non-peer-reviewed practitioner analysis.) Aethir (2026). The Rise of OpenClaw and AI Agents: GPU Demand Is Surging. aethir.com. (Non-peer-reviewed practitioner report.) NVIDIA GTC Blog (Mar 2026). NVIDIA GTC 2026: Live Updates on What’s Next in AI. blogs.nvidia.com. (Non-peer-reviewed industry blog.)

1.1.1. The Practitioner-Research Gap
1.1.2. The Convergence Moment
1.2. Scope and Contributions
1. A formal definition of agent harness with necessary and sufficient conditions, tested against edge cases that reveal the concept’s boundaries (Section 2).
2. A historical account tracing the concept’s emergence from three distinct technological lineages—software testing, reinforcement learning, and early LLM frameworks—and arguing that the harness represents a genuinely new architectural layer, not a simple extension of prior art (Section 3).
3. An empirically-grounded taxonomy of 22 representative systems, organized through a harness completeness matrix that maps each system to six core functional components (Section 4).
4. A systematic survey of six open technical challenges, analyzed for their root causes, their cross-component interdependencies, and their comparison across existing approaches (Section 5).
5. Identification of research gaps that the field’s modular treatment of agent subsystems has left unaddressed—gaps that are only visible when the harness is treated as a unified object of study (Sections 6–7).
1.3. Related Work and Positioning
1.3.1. The Fragmentation Problem in Existing Surveys
1.3.2. The Infrastructure Gap
1.3.3. What This Survey Does Differently
1.3.4. Explicit Positioning Against Adjacent Surveys
| Survey | Research Object | Analysis Level | Core Contribution | Overlap with This Survey |
| Wang et al. (2023) | Agent capabilities: memory, planning, tool use, action | Model-level decomposition | Foundational taxonomy of agent functional modules | Describes what components do; does not address runtime governance |
| Xi et al. (2023) | Agent application landscape | Application-level survey | Comprehensive map of agent deployment domains | Documents where agents are used; silent on execution infrastructure |
| Mohammadi et al. (2025) | Evaluation methodology and benchmark design | Benchmark-level analysis | Taxonomy of metrics, failure modes, evaluation protocols | Treats evaluation as external measurement; this survey treats it as an infrastructure component (V) |
| Guo et al. (2024) | Multi-agent coordination: roles, communication, collaboration | System-level survey | Taxonomy of multi-agent coordination patterns | Addresses how agents coordinate; this survey addresses what infrastructure makes coordination reliable |
| Gao et al. (2025) | Agent self-improvement: skill evolution, self-training, adaptive infrastructure | Model-evolution-level analysis | Taxonomy of how agents update capabilities over time | Addresses how agents evolve; this survey addresses what infrastructure makes agents run reliably |
| This survey | Agent execution harness: runtime governance infrastructure | Infrastructure-level analysis | Formal definition, historical lineage, completeness taxonomy, cross-cutting challenge analysis | — |
1.3.5. Empirical Evidence Base: Peer-Reviewed vs. Preprint Sources
2. Definitions and Conceptual Framework
2.1. The Definition Problem
3 [Practitioner Report] Gupta, A. (2026). 2025 was agents. 2026 is agent harnesses. Medium. (Non-peer-reviewed.)
4 [Practitioner Report] Schmid, P. (2026). The importance of agent harness in 2026. (Non-peer-reviewed.)
5 [Practitioner Report] Anthropic. (2026). Building effective agents. Anthropic Engineering Blog. (Non-peer-reviewed official documentation.)
2.2. Formal Definition
Definition 2.1 (Agent Harness). An agent harness is a software system that implements six runtime governance functions:
- E — Execution loop: Manages the observe-think-act cycle, including turn sequencing, termination conditions, and error recovery
- T — Tool registry: Maintains a typed, validated catalog of available tool interfaces; routes and monitors tool invocations
- C — Context manager: Governs what information enters the model’s context window across turns, including compaction, retrieval, and prioritization strategies
- S — State store: Persists task-relevant state across turns and, optionally, across sessions; provides recovery from partial failures
- L — Lifecycle hooks: Pre- and post-invocation interception points for authentication, logging, policy enforcement, and instrumentation
- V — Evaluation interface: Instruments execution to capture action trajectories, intermediate states, and success signals for offline analysis; its standardized hooks are what distinguish the V-component from general logging
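The six functions of Definition 2.1 can be sketched as minimal Python interfaces. This is an illustrative reduction, not the API of any surveyed system; every name below (`ToolRegistry`, `Harness.run`, the dict-shaped action protocol) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol


class ToolRegistry(Protocol):      # T: typed catalog, routed invocation
    def invoke(self, name: str, args: dict[str, Any]) -> Any: ...

class ContextManager(Protocol):    # C: governs what enters the window
    def assemble(self, state: dict[str, Any]) -> str: ...

class StateStore(Protocol):        # S: cross-turn persistence
    def get(self, key: str) -> Any: ...
    def put(self, key: str, value: Any) -> None: ...

@dataclass
class Harness:
    tools: ToolRegistry
    context: ContextManager
    state: StateStore
    pre_hooks: list[Callable[[str], str]] = field(default_factory=list)  # L
    trace: list[dict] = field(default_factory=list)                      # V

    def run(self, model: Callable[[str], dict], task: str,
            max_turns: int = 8) -> Any:
        """E-component: observe-think-act loop with turn sequencing
        and a termination condition (max_turns or a 'done' action)."""
        self.state.put("task", task)
        for turn in range(max_turns):
            prompt = self.context.assemble({"task": task, "turn": turn})
            for hook in self.pre_hooks:          # L: pre-invocation interception
                prompt = hook(prompt)
            action = model(prompt)               # think
            self.trace.append({"turn": turn, "action": action})  # V: trajectory
            if action.get("done"):
                return action.get("result")
            result = self.tools.invoke(action["tool"], action.get("args", {}))
            self.state.put(f"obs_{turn}", result)  # S: persist observation
        return None
```

Even this toy loop makes the definition's point concrete: E, T, C, S, L, and V are distinct seams, and each can be swapped independently of the others.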

2.2.1. The V-Component Design Space: From Logging to Evaluation Pipelines
2.3. Boundary Case Analysis
2.3.1. Framework Status and Validation Pathway
2.4. The Orthogonality Assumption and Its Limits

2.5. Situating the Harness in the Agent Stack
| Concept | Primary Role | Scope | Runtime? | Key Question |
| Framework | Construction primitives | Dev-time | No | How is the agent built? |
| Harness | Runtime governance | Runtime | Yes | How does the agent run reliably? |
| Platform | Organizational management | Both | Both | How are agents managed at scale? |
| Agent OS | Formal kernel services | Runtime | Yes | What are the minimal governance abstractions? |
| Eval harness | Assessment infrastructure | Test-time | Partial | How is agent behavior measured? |
| Task Type | E (Loop) | T (Tools) | C (Context) | S (State) | L (Lifecycle) | V (Eval) | Example Systems |
| Single-turn Q&A | Minimal | Optional | Minimal | × | × | Optional | ChatGPT, Claude.ai |
| Multi-step web research | Moderate | Required (web, search) | High | ∼ | ∼ | Optional | WebArena agents |
| Software engineering | High | Required (code exec, file) | High | Required | Required | Required | SWE-agent, OpenHands |
| Long-running personal assistant | High | Required (broad) | High | Required | Required | Optional | OpenClaw, MemGPT |
| Multi-agent collaboration | High | Required | High | Required | Required | Required | MetaGPT, AutoGen |
| Robotic/embodied task | High | Required (actuators) | Moderate | Required | Required | Required | RAI, embodied systems |
3. Historical Evolution
3.1. Three Lineages, One Synthesis
3.2. Thread 1 — Software Test Harnesses: The Governance Template
3.3. Thread 2 — Reinforcement Learning Environments: The Interface Standard
3.4. Thread 3 — The Early LLM Agent Frameworks: The Failure Mode Catalog
3.4.1. The 2023 Benchmark Infrastructure Emergence
3.5. The Harness Turn (2024–2026)
1. Prompt Engineering (2022–2024): The primary engineering lever was the text of the input prompt itself. Researchers and practitioners optimized by crafting better instructions, few-shot examples, and reasoning templates. The question was: “What text should we give the model to get better outputs?” This era produced chain-of-thought prompting, in-context learning, and instruction tuning as core methodologies.
2. Context Engineering (2025): As agents became longer-running, the binding constraint shifted from “what is the input?” to “what information should the model see?” This era focused on context management: what to inject on each turn, how to retrieve and compress memories, how to rank tool results by relevance, and how to handle context window saturation. Context engineering asks: “What structured information should we assemble and present to the model to guide its decisions?” This is when practitioners began systematizing memory retrieval, tool result formatting, and dynamic context management.
3. Harness Engineering (2026): As models became capable enough to handle long-running tasks but deployment reliability remained elusive, the engineering focus expanded to the full infrastructure wrapper. Harness engineering asks: “What governance, constraints, feedback loops, and execution controls must we design to make agent systems reliable?” The answer spans all six components (E, T, C, S, L, V) considered as an integrated whole. This era, represented by OpenAI’s Codex harness, Meta-Harness optimization, and LangChain’s DeepAgents, recognizes that model capability is necessary but insufficient—reliability emerges from the interaction of a capable model with a thoughtfully designed execution environment.
3.6. Why the Harness Concept Required All Three Lineages
4. Taxonomy of Agent Harness Systems
4.1. The Classification Problem
4.1.1. System Selection Methodology
4.1.2. Coding Methodology for the Completeness Matrix
4.2. Harness Completeness Matrix
| System | E | T | C | S | L | V | Security | MA | Category |
| DeepAgents | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | MicroVM | ✓ | Full-Stack |
| Claude Code | ✓ | ✓ | ✓ | ✓ | ✓ | ∼ | Sandbox | × | Full-Stack |
| OpenClaw | ✓ | ✓ | ✓ | ✓ | ✓ | ∼ | Container | ✓ | Full-Stack |
| DeerFlow | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Container | ✓ | Full-Stack |
| OpenHands | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | Container | ✓ | Full-Stack |
| AIOS | ✓ | ✓ | ✓ | ✓ | ✓ | ∼ | Process | ✓ | Full-Stack |
| SWE-agent | ✓ | ✓ | ∼ | ✓ | ∼ | ✓ | Container | × | Specialized |
| Browser-Use | ✓ | ✓ | ∼ | × | × | ∼ | MicroVM | × | Specialized |
| RAI | ✓ | ✓ | ∼ | ∼ | ∼ | ∼ | Physical | ∼ | Specialized |
| PortiaAI | ✓ | ✓ | ∼ | ✓ | ✓ | ∼ | Container | × | Specialized |
| TrustAgent | ✓ | ✓ | ∼ | ∼ | ✓ | ∼ | Policy | × | Specialized |
| LangGraph | ∼ | ∼ | × | × | × | × | None | ∼ | Framework |
| LlamaIndex | ∼ | ✓ | ✓ | ∼ | × | × | None | × | Framework/Module |
| AutoGen | ∼ | ∼ | ∼ | ∼ | ∼ | × | None | ✓ | Framework |
| Google ADK | ∼ | ∼ | ∼ | ∼ | ∼ | × | ∼ | ✓ | Framework |
| CrewAI | ∼ | ∼ | ∼ | ∼ | × | × | None | ✓ | Framework |
| MemGPT | × | × | ✓ | ✓ | × | × | ∼ | × | Module |
| MCP Servers | × | ✓ | × | × | × | × | ∼ | × | Module |
| Voyager | × | ∼ | × | ✓ | × | × | None | × | Module |
| HAL | ✓ | ∼ | ∼ | ✓ | ✓ | ✓ | VM | ✓ | Eval Infra |
| AgencyBench | ✓ | ∼ | ∼ | ✓ | ∼ | ✓ | Container | ✓ | Eval Infra |
| SkillsBench | ✓ | ∼ | ∼ | ∼ | ∼ | ✓ | Container | × | Eval Infra |
| Harbor (TB2) | ∼ | × | ∼ | ∼ | | | Container | × | Eval Infra |
4.3. Taxonomy
6 [Practitioner Report] Anthropic. (2026). Demystifying evals for AI agents. Anthropic Engineering Blog. (Non-peer-reviewed official documentation.)

OpenClaw introduces a skill marketplace architecture. DeerFlow implements a hierarchical multi-agent pipeline with explicit role separation. OpenHands provides the most comprehensive evaluation integration, making it the primary platform for agent research benchmarking in software engineering.
4.3.1. The Evaluation Infrastructure Gap: Twelve Systems and the State of the Art
4.3.2. Multi-Agent Harness Architecture Patterns
4.3.3. Case Study: OpenAI Codex Harness Engineering at Scale
4.4. What the Taxonomy Reveals

5. Core Technical Challenges
5.1. Introduction to Cross-Cutting Challenge Analysis
5.2. Sandboxing and Security
5.2.1. The Unique Threat Profile of Agent Harnesses
5.2.2. The Isolation Mechanism Design Space
| Mechanism | Startup Latency | Memory Overhead | Isolation Model | Typical Deployment |
| Process-only | ~10ms | +0 MB | None | Prototyping only |
| Docker/OCI | ~1s | +10-50 MB | Namespace isolation | Development, trusted agents |
| gVisor | ~2s | +50-100 MB | User-space syscall interception | Security-conscious deployments |
| MicroVM (Firecracker) | ~125ms | +5-15 MB | Hardware virtualization | Production multi-tenant |
| WebAssembly | ~10ms | +1-5 MB | Capability-based sandbox | Compute-constrained tasks |
| Attack Vector | Entry Point | Harness Component Exploited | Persistence | Example | Harness Mitigation |
| Direct prompt injection | User input | C (context assembly) | Session only | Jailbreak via user message | Input sanitization at ingress (L-hook) |
| Indirect injection via retrieval | Retrieved content | C (context injection) | Session only | Poisoned web page | Content provenance tagging (C + L) |
| Tool-mediated injection | Tool output | T (tool result persistence) | Multi-step | Malicious API response hijacking agent goal | Tool output validation before context injection (L-hook) |
| Memory poisoning | Persistent store write | S (long-term storage) | Cross-session | Injecting false beliefs into long-term memory | Write-time content validation (L-hook on S writes) |
| Capability escalation | Tool composition | T (registry) | Single session | File read + bash exec = arbitrary code | Tool composition policy enforcement |
| Sandbox escape | Execution environment | E (execution loop) | System-level | Container kernel exploit | MicroVM isolation; capability-based sandbox |
| Cross-agent injection | Agent-to-agent message | E (multi-agent coordination) | Multi-agent | Compromised subagent hijacks orchestrator | Message schema validation + agent identity verification |
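The "tool output validation before context injection" mitigation in the table can be sketched as an L-hook that screens T-component output before the C-component sees it. The patterns below are illustrative placeholders, not a vetted detection ruleset.

```python
import re

# Hypothetical injection markers; a real deployment would use a maintained,
# model-assisted classifier rather than a fixed regex list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"^system:", re.I | re.M),
]

def validate_tool_output(tool_name: str, output: str) -> str:
    """L-hook run between tool return and context injection.

    Flagged content is replaced with a neutral placeholder so the agent
    still observes that the tool returned something, without the payload.
    """
    for pattern in INJECTION_PATTERNS:
        if pattern.search(output):
            return f"[{tool_name}: output withheld - possible prompt injection]"
    return output
```

Pattern filtering alone is easy to evade; in practice it complements, rather than replaces, the content provenance tagging (C + L) listed for indirect injection.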
5.2.3. Why Container Isolation Is Structurally Insufficient
5.2.4. Defensive Architecture at the Harness Level
5.2.5. Prompt Injection: A Detailed Taxonomy
5.2.6. Root Cause Analysis: Why Security Remains Unsolved
5.2.7. The Execution Environment as Policy Materialization Layer
5.2.8. Environment State Management as Harness Reliability Function
5.2.9. Code Execution Environments as Harness Components
5.2.10. SandboxEscapeBench and the Limits of Container Isolation: Extended Analysis

5.3. Evaluation and Benchmarking
5.3.1. The Fundamental Evaluation Problem
5.3.2. The Benchmark Landscape
| Benchmark | Domain | # Tasks | Environment Type | Known Failure Modes |
| SWE-bench | Software engineering | 2,294 | Real GitHub repositories | Test flakiness; repository state drift |
| OSWorld | GUI interaction | 369 | Real computer OS | 28% false negative rate (Kang, 2025) |
| WebArena | Web navigation | 812 | Live websites | Website changes invalidate tasks |
| Mind2Web | Web understanding | 2,350 | Cached snapshots | No environment dynamics |
| GAIA | General reasoning | 466 | Synthetic + real | Limited agent-specific coverage |
| AgentBench | Multi-domain | 1,000+ | 8 environments | Environment non-determinism |
| WorkArena | Enterprise workflow | 33 | ServiceNow (live) | Enterprise data dependency |
| InterCode | Interactive coding | 200+ | Docker containers | Execution environment consistency |
| HAL | Multi-domain | 9 suites | Standardized VMs | $40K cost per full evaluation cycle |
| Terminal-Bench | CLI interaction | 100+ | Container sandbox | Narrow task distribution |
| Benchmark | Environment Reset | State Persistence | Multi-step Tracking | Reproducibility Mechanism | Harness Infra Complexity |
| SWE-bench | Git repository clone | × per task | Test suite execution | Docker + fixed commit hash | High |
| OSWorld | VM snapshot restore | × per task | GUI state tracking | VirtualBox snapshots | Very High |
| WebArena | Live site re-init | × per task | Navigation history | Self-hosted site instances | High |
| AgentBench | Per-environment reset | × per task | Action-observation log | Docker per environment | High |
| HAL | Standardized VM | ✓(cross-run) | LLM-aided log inspection | Centralized harness | Very High |
| AgencyBench | Per-task harness init | ✓(within task) | Tool call trace (90 avg) | Native ecosystem harnesses | Very High |
| GAIA | Synthetic + cached | × | Tool use sequence | Fixed test set | Low |
| InterCode | Docker container | × per task | REPL interaction trace | Docker + fixed image | Moderate |
5.3.3. The Unreliability Crisis
5.3.4. Three Distinct Root Causes of Evaluation Unreliability
5.3.5. Benchmark Design Principles for Harness-Aware Evaluation
5.3.6. Meta-Harness: Automated Harness Optimization as an Evaluation Methodology

5.4. Protocol and Interface Standardization
5.4.1. The Fragmentation Problem
5.4.2. Protocol Comparison
| Protocol | Scope | Transport | Auth | Adoption | Maintainer |
| MCP | Agent→Tool | JSON-RPC (stdio/HTTP) | OAuth (2025) | High | Anthropic |
| A2A | Agent→Agent | HTTPS + SSE | OAuth | Medium | Google |
| ACP | High-level dialogue | Various | Various | Low | IBM |
5.4.3. Empirical Comparison: MCP vs. A2A Performance Characteristics
5.4.4. MCP and A2A: Complementary Roles in the Harness Stack
| Dimension | MCP | A2A | ACP | OpenAI Function Calling | Anthropic tool_use |
| Communication model | Client-server (tool invocation) | Peer-to-peer (agent-to-agent) | Request-response (task delegation) | Unidirectional (model→tool) | Unidirectional (model→tool) |
| Stateful sessions | Roadmap (2026) | Native | Limited | × | × |
| Auth/permission layer | Roadmap (2026) | OAuth2 | Basic | API key only | API key only |
| Streaming support | Roadmap (2026) | ✓ | Limited | ✓ | ✓ |
| Tool discovery | ✓(schema registry) | ∼ (capability declaration) | × | × | × |
| Multi-step coordination | × | ✓ | ✓ | × | × |
| Production maturity | Early (2025–) | Emerging (2026–) | Emerging (2026–) | Mature | Mature |
| Primary harness role | T-component standard | E-component coordination | Cross-harness task delegation | T-component (closed) | T-component (closed) |
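To make the T-component column concrete: MCP carries tool invocations as JSON-RPC 2.0 messages, and the shape below follows the specification's `tools/call` request at the time of writing. The helper function itself is ours, not part of any SDK.

```python
import json

def mcp_tool_call(call_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request (JSON-RPC 2.0, sent over
    stdio or HTTP depending on the transport)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": call_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Example: ask an MCP server to invoke a file-reading tool.
request = mcp_tool_call(1, "read_file", {"path": "README.md"})
```

The contrast with A2A is visible even here: an MCP request names a tool and its arguments, whereas an A2A message addresses a peer agent and delegates a task.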
5.4.5. Root Causes of Protocol Fragmentation
6. State and Knowledge Management
6.1. Runtime Context Management
6.1.1. The Context Window Bottleneck

6.1.2. Context Management Strategies Comparison
| Strategy | Mechanism | Pros | Cons | Systems |
| Truncation | Keep most recent K tokens | Simple, fast | Loses old information | Baseline |
| Summarization | Compress older context | Preserves gist | Information loss | LangChain |
| Retrieval-augmented | Store all, retrieve relevant | Preserves everything | Retrieval quality depends on query | MemGPT, Zep |
| Knowledge graph | Structured entity/relationship storage | Semantic relationships | Construction cost | Zep TK |
| Memory-as-action | Agent actively manages memory | Task-optimal curation | Learning overhead | MemAct |
| Skill injection | Structured procedural packages | Targeted, efficient | Curation required | SkillsBench harnesses |
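The first two strategies in the table are often combined: keep the most recent turns verbatim and compress everything older into a single summary. A minimal sketch, with token counting approximated by word count for illustration:

```python
from typing import Callable

def fit_context(turns: list[str], budget: int,
                summarize: Callable[[list[str]], str]) -> list[str]:
    """Truncation + summarization hybrid: retain the newest turns that
    fit the budget; replace evicted older turns with one summary entry."""
    cost = lambda s: len(s.split())   # stand-in for a real tokenizer
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):      # walk newest-first
        if used + cost(turn) > budget:
            break
        kept.insert(0, turn)
        used += cost(turn)
    older = turns[: len(turns) - len(kept)]
    if older:
        kept.insert(0, summarize(older))  # gist of what was evicted
    return kept
```

The `summarize` callback is where the table's "Cons" column bites: whatever the summarizer drops is gone for good unless a retrieval tier (MemGPT-style) backs it up.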
6.1.3. Key Systems and Empirical Results
6.1.4. Long-Context Models and the Evolving C-Component Design Space
6.1.5. Retrieval-Augmented Context Management: Architecture and Limitations
6.1.6. Root Cause: Why Context Management Remains Unsolved
6.1.7. DeepAgents: Context Engineering Through Middleware Architecture
6.2. Tool Use as Core Harness Function
6.2.1. The Tool Registry as a Governance Object
6.2.2. The ToolBench Ecosystem and Large-Scale Tool Learning
6.2.3. Skill Libraries as S-T Component Integration
6.2.4. Tool Registry Design Patterns: A Structured Comparison
6.2.5. Key Harness Design Questions
| Question | Options | Tradeoffs |
| Tool registration | Static vs. dynamic | Static: predictable; Dynamic: flexible but security risk |
| Tool versioning | Strict vs. loose | Strict: reproducible; Loose: adaptive but fragile |
| Permission model | Per-tool vs. per-category vs. per-action | Granularity vs. usability |
| Error handling | Fail-fast vs. retry vs. fallback | Latency vs. robustness |
| Action space | Discrete-tool vs. code-as-action | Composability vs. interpretability |
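One point in this design space (static registration, per-category permissions, fail-fast error handling) can be sketched in a few lines. All names are hypothetical; no surveyed system exposes this exact API.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class ToolSpec:
    name: str
    category: str                    # e.g. "read", "write", "exec"
    fn: Callable[..., Any]

class GovernedToolRegistry:
    """Static registry with per-category permissions and fail-fast errors."""
    def __init__(self, allowed_categories: set[str]):
        self._tools: dict[str, ToolSpec] = {}
        self._allowed = allowed_categories

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def invoke(self, name: str, **kwargs: Any) -> Any:
        spec = self._tools.get(name)
        if spec is None:
            raise KeyError(f"unknown tool: {name}")       # fail-fast
        if spec.category not in self._allowed:
            raise PermissionError(
                f"{name}: category '{spec.category}' not permitted")
        return spec.fn(**kwargs)
```

Per-category permissioning is the middle row's compromise: coarser than per-action policies, but coarse enough that a human can actually audit the policy.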
6.2.6. Harness-Compatible Model Training and Fine-Tuning
6.2.7. Root Cause: Tool Governance Without Standards

6.2.8. Tool Management Infrastructure: The Registry as a Governance Component
6.2.9. Tool Failure as a Harness-Level Problem
6.2.10. Tool Security: The Registry as Attack Surface
6.2.11. MCP as Harness Protocol Infrastructure
6.2.12. Agent Skills: Workflow-Level Interoperability Beyond Tool-Level Protocols
6.3. Memory Management Architecture
6.3.1. Memory as Infrastructure, Not Capability
6.3.2. Architecture Comparison
| System | C-Component | S-Component | Key Innovation |
| MemGPT | Virtual paging | Disk storage | OS-inspired paging |
| Generative Agents | Context + retrieval | Memory stream | Reflection mechanism |
| Reflexion | Episodic buffer | Verbal critique store | Language-grounded self-improvement |
| MemoryBank | Retrieved context | Ebbinghaus-curve store | Forgetting curve update |
| Agent Workflow Memory | Workflow retrieval | Workflow store | Procedural memory induction |
| MemAct | Learnable policy | Action consequences | Memory-as-action |
| CORAL | MM + CO tools | External DB | Explicit tool interface |
| Zep TK | Retrieved context | Knowledge graph | Temporal relationships |
| OpenClaw | Session context | MEMORY.md files | Human-readable persistence |
6.3.3. Reflexion and the Episodic Buffer Design Pattern
| Architecture | Working Memory | Long-term Storage | Retrieval Mechanism | Context Cost | Security Isolation | Representative System |
| Flat context | In-window only | × | × (all in context) | O(1) per turn, O(n) total | None | Early ChatGPT plugins |
| Append-only log | In-window | External log | Chronological | O(n) per turn | None | Naive agent implementations |
| Hierarchical paging | Working set | Tiered (hot/warm/cold) | Priority scheduler | O(k) per turn (k = working set) | Tier-boundary enforcement | MemGPT, MemoryOS |
| Graph-structured | In-window summary | Knowledge graph | Semantic + relational | O(log n) per turn | Node-level ACLs possible | A-MEM, HippoRAG |
| Compression + gisting | Compressed summary | Original + gist index | Two-stage (gist → full) | O(c) per turn (c = compression ratio) | None | ReadAgent, Mem0 |
| Episodic + procedural | Working + episode buffer | Skill library + episode store | Quality-gated + relevance | O(k+p) per turn | Write-time validation | Voyager + Reflexion |
6.3.4. Memory–Security Coupling
6.3.5. The Memory Governance Contract: A Proposed Specification
6.3.6. Root Cause: Four Open Problems in Harness Memory
6.3.7. Compute Economics as a Context Management Constraint
6.3.8. The Harness as Memory Scheduler
6.3.9. Context Rot as a Harness Failure Mode
6.3.10. Memory as Security Attack Surface
6.3.11. The Absence of a Standard Memory Interface
| System | Memory Model | Scheduling Policy | Interface Portability | Write-time Security |
| MemGPT (Packer et al., 2023) | Two-tier paging | Agent-invoked function calls | Host-harness dependent | Absent |
| Generative Agents (Park et al., 2023) | Memory stream + reflection | Harness-scheduled reflection cycles | Non-portable | Absent |
| Voyager (Wang et al., 2023) | Executable skill library | Embedding-similarity retrieval | Non-portable | Absent |
| MemoryOS (BAI-LAB, 2025) | Three-tier hot/warm/cold | LRU-style priority eviction | Partial | Absent |
| A-MEM (2025) | Graph-structured Zettelkasten | Link-traversal retrieval | Non-portable | Absent |
| Mem0 (2025) | Async extraction middleware | Deduplication + consolidation | Partial (API layer) | Absent |
| AgentSys (2026) | Hierarchical isolated stores | Harness-enforced authorization | Non-portable | Write isolation |
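The table's columns suggest what a standard memory interface would have to cover: a memory model, a scheduling policy, and write-time security. The sketch below is a hypothetical proposal in that spirit, not an interface drawn from any listed system.

```python
from typing import Any, Callable, Protocol

class MemoryStore(Protocol):
    """Hypothetical portable memory interface: each table column maps
    to a method (model/retrieval), a policy (evict), or a hook (write)."""
    def write(self, content: str, metadata: dict[str, Any]) -> str: ...
    def retrieve(self, query: str, k: int) -> list[str]: ...
    def evict(self, budget: int) -> int: ...

class ValidatedStore:
    """In-memory reference with a write-time validation hook, addressing
    the 'Write-time Security: Absent' column."""
    def __init__(self, validator: Callable[[str, dict], bool]):
        self._items: dict[str, str] = {}
        self._validator = validator
        self._next = 0

    def write(self, content: str, metadata: dict[str, Any]) -> str:
        if not self._validator(content, metadata):
            raise ValueError("write rejected by validation hook")
        self._next += 1
        mem_id = f"m{self._next}"
        self._items[mem_id] = content
        return mem_id

    def retrieve(self, query: str, k: int) -> list[str]:
        # Naive substring relevance, for illustration only.
        hits = [c for c in self._items.values() if query.lower() in c.lower()]
        return hits[:k]

    def evict(self, budget: int) -> int:
        removed = 0
        while len(self._items) > budget:   # FIFO eviction as a stand-in policy
            self._items.pop(next(iter(self._items)))
            removed += 1
        return removed
```

A shared contract of this shape is what would let a MemGPT-style store and a graph-structured store swap under the same harness; today each system binds its own.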
7. Coordination and Planning
7.1. Planning and Reasoning Infrastructure
7.1.1. The Planning Loop as a Harness Governance Object
7.1.2. Planning State as Harness State
7.1.3. Search Budget and Harness Resource Governance
7.1.4. Planning Interface Design: The ACI as a Harness Responsibility

| System | Loop Control | State Persistence | Search Budget Enforcement | ACI Design |
| ReAct (Yao et al., 2023) | Linear T→A→O | None (stateless) | Step limit only | Minimal tool-call schema |
| Tree of Thoughts (Yao et al., 2023) | BFS/DFS over thought tree | Tree state in harness memory | Breadth/depth limits | Thought generation prompt |
| LATS (Zhou et al., 2023) | MCTS + reflection | Reflection buffer + tree state | Rollout count ceiling | Value fn + reflection prompt |
| Reflexion (Shinn et al., 2023) | Retry with critique | Episodic verbal critique buffer | Max episode count | Self-critique schema |
| Voyager (Wang et al., 2023) | Curriculum-guided episodes | Skill library + episode log | Episode count limit | Skill code spec + task prompt |
| Agent Q (Putta et al., 2024)† | MCTS over web actions | MCTS tree + DPO training pairs | Rollout budget / cost ceiling | Web action ACI (click/type/nav) |
| SWE-agent (Yang et al., 2024) | ReAct with ACI commands | File viewer state | Step ceiling | Custom ACI: search, edit, run |
| ExACT / R-MCTS (Song et al., 2024)† | R-MCTS + contrastive reflection | State cache for node rollback | Rollout + depth limits | Environment action set |
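The "Search Budget Enforcement" column above can be made concrete with a loop in which the harness, not the model, owns the budget. This is a sketch in the spirit of ReAct's linear T→A→O control row, not the published implementation; `SearchBudget` and the dict-shaped action protocol are our own devices.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SearchBudget:
    """Harness-enforced resource governance: a step ceiling plus a
    cost ceiling, whichever binds first."""
    max_steps: int
    max_cost: float
    steps: int = 0
    cost: float = 0.0

    def charge(self, step_cost: float) -> bool:
        self.steps += 1
        self.cost += step_cost
        return self.steps <= self.max_steps and self.cost <= self.max_cost

def react_loop(policy: Callable[[list], dict], act: Callable[[dict], Any],
               budget: SearchBudget) -> Any:
    """Linear thought -> action -> observation loop; the harness charges
    the budget on every step and halts the agent when it is exhausted."""
    history: list = []
    while True:
        action = policy(history)                      # think
        if not budget.charge(action.get("cost", 1.0)):
            return None                               # budget exhausted
        if action.get("final"):
            return action["answer"]
        obs = act(action)                             # act + observe
        history.append((action, obs))
```

Tree-structured planners (ToT, LATS, R-MCTS) replace the flat `history` with tree state persisted in the S-component, but the budget-charging seam stays at the harness.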
7.2. Multi-Agent Coordination as Harness Infrastructure
7.2.1. The New Governance Requirements of Multi-Agent Harnesses

| Pattern | Identity Model | Message Validation | State Consistency | Key Vulnerability |
| Role-Based | Fixed SOP assignment (MetaGPT) | Typed document handoffs | Atomic document-level | Cascade errors across pipeline stages |
| Market-Based | Dynamic registry (AutoGen) | Bid/offer with capability declarations | Agent-registry availability | Task starvation under biased assignment |
| Simulation | Persistent entity (Generative Agents) | Indirect via environment state | World-model conflict resolution | Shared state corruption from concurrent writes |
| Hierarchical | Permission tree (DeerFlow, DeepAgents) | Task specification + result validation | Permission-state propagation | Authority escalation via delegation bugs |
7.2.2. Protocol-Layer Standardization vs. Learned Topology
7.2.3. The Single-Agent Baseline Problem
7.2.4. Reliability and Security in Multi-Agent Harnesses
8. Emerging Topics and Research Directions
Group A: Immediate Community Priorities
8.0.1. Cross-Component Interaction Patterns
| Coupling Pattern | Components | Mechanism | Consequence of Independent Optimization |
| Retention–Security | C ↔ L | Longer context retention increases both task performance and adversarial content persistence | C-optimal retention window exceeds L-safe window; joint optimization required |
| Evaluation–Governance | V ↔ L | Both intercept execution stream; V for measurement, L for policy enforcement | Separate logging and governance layers; governance events missing from evaluation traces |
| Memory–Tool Composition | S ↔ T | Tools operating on persistent state are simultaneously tool invocations and state mutations | External state treated as opaque side effects; governance benefits of explicit tracking lost |
8.0.2. Observability and Debugging
8.0.3. Human-in-the-Loop Mechanisms
8.0.4. Cost and Compute Economics
8.0.5. Autonomy and Long-Running Deployments
8.0.6. Automated Harness Engineering
8.0.7. Natural-Language Harness Specification
Group B: Long-Term Research Agenda
8.0.8. Formal Verification and Behavioral Guarantees
8.0.9. Cross-Harness Portability and Unified Benchmarking
8.0.10. Protocol Bridging and Federated Interoperability
8.0.11. Long-Horizon Task Decomposition
8.0.12. Security Model Formalization
8.0.13. Tool Composition and Dependency Inference
8.0.14. Energy-Aware Infrastructure Design

| Direction | Group | Key Challenge | H-Components | Effort | “Solved” Milestone |
| Cross-component coupling | A | C–L, V–L, S–T interactions | All 6 | 2–3 yr | Joint optimization frameworks |
| Observability | A | Structured traces, replay | V, L, E | 2–3 yr | Deterministic replay of failed trajectories |
| Human-in-the-loop | A | Approval policy derivation | L, E | 1–2 yr | Task-level risk-based approval specs |
| Cost economics | A | Context accumulation costs | C, S, T | 2–3 yr | Formal cost models predicting CPT |
| Long-running autonomy | A | Skill library governance | S, L, E | 3–5 yr | Monitored self-improvement loops |
| Formal verification | B | Behavioral guarantees | E, L | 3–5 yr | Machine-checkable harness invariants |
| Cross-harness portability | B | Harness–model coupling | V, T, C, S | 2–3 yr | 100+ task suite on 3+ harnesses |
| Protocol bridging | B | MCP/A2A fragmentation | T, L | 1–2 yr | Universal adapter spec + reference impl. |
| Long-horizon decomposition | B | Harness-aware planning | E, C, S | 2–3 yr | Provably optimal decomposition algorithm |
| Security formalization | B | No formal security model | L, E, T | 4–6 yr | Bell-LaPadula analog for harnesses |
| Tool composition | B | Dependency inference | T, E | 2–3 yr | Auto-inferred composition templates |
| Energy-aware design | B | Energy budgets | E, C, S | 3–5 yr | 20%+ energy reduction vs. baseline |
8.0.15. Synthesis

9. Conclusion
9.1. The Infrastructure Thesis Revisited
9.2. Implications for Research Practice
9.3. Limitations and Future Scope
References
- Asai, A.; et al. Self-RAG: Learning to Retrieve, Generate, and Critique Through Self-Reflection. In Proceedings of the International Conference on Learning Representations (ICLR 2024), 2024. Peer-reviewed.
- Deng, X.; et al. Mind2Web: Towards a Generalist Agent for the Web. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2023), 2023. Peer-reviewed.
- Greshake, K.; et al. Not What You’ve Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. In Proceedings of the ACM Conference on Computer and Communications Security (CCS 2023), 2023. Peer-reviewed.
- Guo, T.; et al. Large Language Model Based Multi-Agents: A Survey of Progress and Challenges. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2024), 2024. Peer-reviewed.
- Hong, S.; et al. MetaGPT: Meta Programming for Multi-Agent Collaborative Framework. In Proceedings of the International Conference on Learning Representations (ICLR 2024), 2024. Peer-reviewed.
- Jimenez, C.; et al. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In Proceedings of the International Conference on Learning Representations (ICLR 2024), 2024. Peer-reviewed.
- Jain, N.; et al. R2E: Turning any Github Repository into a Programming Agent Environment. In Proceedings of the International Conference on Machine Learning (ICML 2024), 2024. Peer-reviewed.
- Kapoor, S.; Stroebl, B.; Kirgis, P.; et al. Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation. In Proceedings of the Fourteenth International Conference on Learning Representations (ICLR 2026), 2026, [arXiv:2510.11977]. Peer-reviewed.
- Lee, N.; et al. ReadAgent: Gist-Based Long-Context Navigation. In Proceedings of the International Conference on Machine Learning (ICML 2024), 2024, [arXiv:2402.09727]. Peer-reviewed.
- Li, G.; et al. CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2023), 2023. Peer-reviewed.
- Liu, X.; et al. AgentBench: Evaluating LLMs as Agents. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2023), 2023. Peer-reviewed.
- Ma, C.; et al. AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agents. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2024), 2024. Peer-reviewed.
- Mei, K.; et al. AIOS: LLM Agent Operating System. In Proceedings of the Conference on Language Modeling (COLM 2025), 2025, [arXiv:2403.16971]. Peer-reviewed.
- Mialon, G.; et al. GAIA: A Benchmark for General AI Assistants. arXiv preprint 2023, [arXiv:2311.12983]. Preprint.
- Mohammadi, M.; Li, Y.; Lo, J.; Yip, W. Evaluation and Benchmarking of LLM Agents: A Survey. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2025), 2025, [arXiv:2507.21504]. Peer-reviewed.
- Packer, C.; et al. MemGPT: Towards LLMs as Operating Systems. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2023), 2023. Peer-reviewed.
- Park, J.S.; et al. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI 2023), 2023. Peer-reviewed.
- Patil, S.; et al. Gorilla: Large Language Model Connected with Massive APIs. arXiv preprint 2023, [arXiv:2305.15334]. Peer-reviewed.
- Qin, Y.; et al. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs. In Proceedings of the International Conference on Learning Representations (ICLR 2024), 2024. Peer-reviewed.
- Qu, C.; et al. Tool Learning with Large Language Models: A Survey. Frontiers of Computer Science 2024, [arXiv:2405.17935]. Peer-reviewed.
- Schick, T.; et al. Toolformer: Language Models Can Teach Themselves to Use Tools. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2023), 2023, [arXiv:2302.04761]. Peer-reviewed.
- Shinn, N.; et al. Reflexion: Language Agents with Verbal Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2023), 2023, [arXiv:2303.11366]. Peer-reviewed.
- Wang, L.; et al. A Survey on Large Language Model Based Autonomous Agents. arXiv preprint 2023, [arXiv:2308.11432]. Peer-reviewed.
- Wang, G.; et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint 2023, [arXiv:2305.16291]. Peer-reviewed.
- Wang, X.; et al. Executable Code Actions Elicit Better LLM Agents (CodeAct). In Proceedings of the International Conference on Machine Learning (ICML 2024), 2024, [arXiv:2402.01030]. Peer-reviewed.
- Xi, Z.; et al. The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv preprint 2023, [arXiv:2309.07864]. Peer-reviewed.
- Xie, T.; Zhang, D.; Chen, J.; et al. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2024), 2024, [arXiv:2404.07972]. Peer-reviewed.
- Yang, J.; et al. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2024), 2024, [arXiv:2405.15793]. Peer-reviewed.
- Yang, J.; et al. InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2023), 2023, [arXiv:2306.14898]. Peer-reviewed.
- Yao, S.; et al. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the International Conference on Learning Representations (ICLR 2023), 2023, [arXiv:2210.03629]. Peer-reviewed.
- Yao, S.; et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2023), 2023. Peer-reviewed.
- Zhou, S.; et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. In Proceedings of the International Conference on Learning Representations (ICLR 2024), 2024. Peer-reviewed.
- Bai, Y.; et al. Constitutional AI: Harmlessness from AI Feedback, 2022, [arXiv:2212.08073]. Preprint, under review.
- Brockman, G.; et al. OpenAI Gym, 2016, [arXiv:1606.01540]. Preprint, under review.
- Carta, T.; et al. Gymnasium API for Multi-Turn LLM Tasks, 2024, [arXiv:2407.17032]. Preprint, under review.
- Chen, W.; et al. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors, 2023, [arXiv:2308.10848]. Preprint, under review.
- Drouin, A.; et al. WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?, 2024, [arXiv:2403.07718]. Preprint, under review.
- Fang, X.; et al. LoCoMo: Long-Context Conversation Benchmark for Agent Memory Evaluation, 2024, [arXiv:2402.17753]. Preprint, under review.
- Gao, H.; Geng, J.; Hua, W.; et al. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve, 2025, [arXiv:2507.21046]. Preprint, under review.
- Hao, S.; et al. Reasoning with Language Model is Planning with World Model (RAP), 2023, [arXiv:2305.14992]. Preprint, under review.
- He, B.; et al. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents, 2024, [arXiv:2410.09024]. Preprint, under review.
- Hua, W.; et al. TrustAgent: Towards Safe and Trustworthy LLM-Based Agents, 2024, [arXiv:2402.17091]. Preprint, under review.
- Huang, W.; et al. Inner Monologue: Embodied Reasoning Through Planning with Language Models, 2022, [arXiv:2207.05608]. Preprint, under review.
- Li, F. OpenClaw PRISM: A Zero-Fork, Defense-in-Depth Runtime Security Layer for Tool-Augmented LLM Agents, 2026, [arXiv:2603.11853]. Preprint, under review. Cited as first systematic published treatment of production runtime security for an open-source harness — review status pending.
- Li, K.; et al. AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts, 2026, [arXiv:2601.11044]. Preprint, under review.
- Li, X.; et al. SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks, 2026, [arXiv:2602.12670]. Preprint, under review.
- Liu, N.F.; et al. Lost in the Middle: How Language Models Use Long Contexts, 2023, [arXiv:2307.03172]. Preprint, under review.
- Lu, Y.; et al. ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities, 2024, [arXiv:2408.04682]. Preprint, under review.
- Pan, J.; et al. Interoperability of LLM Agent Communication Protocols: A Systematic Comparison of MCP, A2A, ACP, and ANP, 2025, [arXiv:2505.02279]. Preprint, under review.
- Patil, S.; et al. GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications, 2024, [arXiv:2404.06921]. Preprint, under review.
- PentestJudge Team. Judging Agent Behavior Against Operational Requirements, 2025, [arXiv:2508.02921]. Preprint, under review.
- Perez, F.; Ribeiro, I. Ignore Previous Prompt and Hijack the Conversation: Prompt Injection in LLMs, 2022, [arXiv:2211.09527]. Preprint, under review.
- Putta, S.; et al. Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents, 2024, [arXiv:2408.07199]. Preprint, under review.
- Qian, C.; et al. ChatDev: Communicative Agents for Software Development, 2023, [arXiv:2307.07924]. Preprint, under review.
- Ruan, Y.; et al. Identifying the Risks of LM Agents with an LM-Emulated Sandbox (ToolEmu), 2023, [arXiv:2309.15817]. Preprint, under review.
- Shen, Y.; et al. HuggingGPT: Solving AI Tasks with ChatGPT and Its Friends in Hugging Face, 2023, [arXiv:2303.17580]. Preprint, under review.
- Song, C.; et al. ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning, 2024, [arXiv:2410.02052]. Preprint, under review.
- Sumers, T.; et al. Cognitive Architectures for Language Agents (CoALA), 2023, [arXiv:2309.02427]. Preprint, under review.
- Vezhnevets, A.; et al. Generative Agent-Based Modeling with Actions Grounded in Physical, Social, or Digital Space Using Concordia, 2023, [arXiv:2312.03664]. Preprint, under review.
- Wang, G.; et al. Mixture-of-Agents Enhances Large Language Model Capabilities, 2024, [arXiv:2406.04692]. Preprint, under review.
- Wu, Q.; et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, 2023, [arXiv:2308.08155]. Preprint, under review.
- Wu, Z.; et al. Agent Workflow Memory, 2024, [arXiv:2409.07429]. Preprint, under review.
- Xu, B. AI Agent Systems: Architectures, Applications, and Evaluation, 2025, [arXiv:2601.01743]. Preprint, under review.
- Xu, J.; et al. TheAgentCompany: Benchmarking LLM Agents on Consequential Real-World Tasks, 2024, [arXiv:2412.14161]. Preprint, under review.
- Yang, J.; et al. AppAgent: Multimodal Agents as Smartphone Users, 2023, [arXiv:2312.13771]. Preprint, under review.
- Marchand, T.; O Cathain, A.; Wynne, J.; Giavridis, P.M.; Deverett, S.; Wilkinson, J.; Gwartz, J.; Coppock, H. SandboxEscapeBench: Quantifying Frontier LLM Capabilities for Container Sandbox Escape, 2026, [arXiv:2603.02277]. Preprint, under review.
- Yuan, Z.; et al. R-Judge: Benchmarking Safety Risk Awareness for LLM Agents, 2024, [arXiv:2401.10019]. Preprint, under review.
- Zeng, A.; et al. AgentTuning: Enabling Generalized Agent Abilities for LLMs, 2023, [arXiv:2310.12823]. Preprint, under review.
- Zhan, Q.; et al. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents, 2024, [arXiv:2403.02691]. Preprint, under review.
- Zhang, Z.; et al. A Survey on the Memory Mechanism of Large Language Model Based Agents, 2024, [arXiv:2404.13501]. Preprint, under review.
- Zhang, J.; et al. AFlow: Automated Agentic Workflow Generation, 2024, [arXiv:2410.10762]. Preprint, under review.
- Zhang, S.; et al. Agent-Oriented Planning: Formalizing Solvability, Completeness, and Non-Redundancy in Multi-Agent Task Decomposition, 2024, [arXiv:2410.02189]. Preprint, under review.
- Zhao, W.X.; et al. A Survey of Large Language Models, 2023, [arXiv:2303.18223]. Preprint, under review.
- Zhong, W.; et al. MemoryBank: Enhancing Large Language Models with Long-Term Memory, 2023, [arXiv:2305.10250]. Preprint, under review.
- De Chezelles, F.; et al. BrowserGym: A Gym Environment for Web Task Automation, 2024, [arXiv:2412.05467]. Preprint, under review.
- Mem0: Memory-Augmented AI Agents with Retrieval-Augmented Gisting, 2025, [arXiv:2504.19413]. Preprint, under review.
- MemoryOS: Operating System-Inspired Memory Management for Long-Horizon LLM Agents, 2025, [arXiv:2506.06326]. Preprint, under review.
- MemAct: Autonomous Context Curation for Long-Horizon Agentic Tasks, 2025, [arXiv:2510.12635]. Preprint, under review.
- Byzantine Fault Tolerance in Multi-Agent LLM Systems: Failure Propagation and Suppression, 2025, [arXiv:2511.10400]. Preprint, under review.
- Jia, Y.; Li, K. AutoTool: Graph-Based Tool Usage Index for Harness-Side Tool Selection, 2025, [arXiv:2511.14650]. Preprint, under review.
- Evo-Memory: Benchmarking Memory Update Policies in Streaming Agent Settings, 2025, [arXiv:2511.20857]. Preprint, under review.
- Securing MCP: Protocol-Level Threat Surface and Governance Controls, 2025, [arXiv:2511.20920]. Preprint, under review.
- A-MEM: Graph-Structured Zettelkasten Memory for LLM Agents, 2025, [arXiv:2502.12110]. Preprint, under review.
- Repo2Run: LLM-Based Docker Environment Construction for Reproducible Coding Agent Benchmarks, 2025, [arXiv:2502.13681]. Preprint, under review.
- Shi, Y.; et al. ToolHijacker: Supply-Chain Attacks on Agent Tool Registries, 2025, [arXiv:2504.19793]. Preprint, under review.
- SAGA: Policy Enforcement for Multi-Agent Security, 2025, [arXiv:2504.21034]. Preprint; conference-reviewed at NDSS 2026.
- Mei, K.; et al. Context Engineering: A Survey of 1,400 Papers on Effective Context Management for LLM Agents, 2025, [arXiv:2507.13334]. Preprint, under review.
- Zhou, Y.; et al. SHIELDA: Modular Exception Taxonomy for Harness-Level Tool Failure Handling, 2025, [arXiv:2508.07935]. Preprint, under review.
- Wang, P.; et al. Hell or High Water: Evaluating Agent Recovery from External Tool Failures, 2025, [arXiv:2508.11027]. Preprint, under review.
- Vuddanti, A.; et al. PALADIN: Structured Tool Failure Recovery for LLM Agents. In Proceedings of The Fourteenth International Conference on Learning Representations (ICLR 2026), 2026, [arXiv:2509.25238]. Peer-reviewed, ICLR 2026.
- AgentBound: MCP-Permission-Manifest-Based Access Control for Agent Harnesses, 2025, [arXiv:2510.21236]. Preprint, under review.
- Multi-Agent Baseline Study: When Do Multi-Agent Systems Outperform Single-Agent Alternatives?, 2026, [arXiv:2601.12307]. Preprint, under review.
- AOrchestra: Fine-Tuned Orchestrator Synthesis for Multi-Agent Coordination, 2026, [arXiv:2602.03786]. Preprint, under review.
- AgentSys: Secure and Dynamic LLM Agents Through Explicit Hierarchical Memory Management, 2026, [arXiv:2602.07398]. Preprint, under review.
- MAS-FIRE: Failure Cascade Propagation in Multi-Agent Tool Execution, 2026, [arXiv:2602.19843]. Preprint, under review.
- SkillFortify: Supply Chain Verification for Agent Skill Manifests Under Dolev-Yao Attacker Model, 2026, [arXiv:2603.00195]. Preprint, under review.
- Comprehensive Survey of Memory Management for LLM Agents: The Write–Manage–Read Loop, 2026, [arXiv:2603.07670]. Preprint, under review.
- AEGIS: Framework-Agnostic Pre-Execution Intercept Layer for Harness Security, 2026, [arXiv:2603.12621]. Preprint, under review.
- Sigdel, A.; Baral, R. Schema First: Controlled Experiments on JSON Schema Validation at Harness Tool Registration Boundaries, 2026, [arXiv:2603.13404]. Preprint, under review.
- Policy-First Control: Three-Lifecycle-Point Tool Governance for Agent Harnesses, 2026, [arXiv:2603.18059]. Preprint, under review.
- OpenDev: Building AI Coding Agents for the Terminal—Scaffolding, Harness, Context Engineering, 2026, [arXiv:2603.05344]. Preprint, under review.
- Aethir. The Rise of OpenClaw and AI Agents: GPU Demand Is Surging. https://aethir.com, 2026. [Practitioner report] Non-peer-reviewed industry report.
- Anthropic. Demystifying Evals for AI Agents. https://www.anthropic.com/engineering, 2026. [Practitioner report] Official documentation, non-peer-reviewed.
- Anthropic. Building Effective Agents. https://www.anthropic.com/engineering, 2026. [Practitioner report] Official documentation, non-peer-reviewed.
- Böckeler, B. Harness Engineering. https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html, 2026. [Practitioner report] Fowler Exploring Generative AI series, non-peer-reviewed.
- BrowserUse. Sandboxing AI Agents with microVMs. https://browseruse.com/blog, 2026. [Practitioner report] Product blog, non-peer-reviewed.
- can1357. I Improved 15 LLMs at Coding in One Afternoon. Only the Harness Changed. http://blog.can.ac/2026/02/12/the-harness-problem/, 2026. [Practitioner report] Non-peer-reviewed. Key data point: 6.7%→68.3% task completion (harness-only change). SAFE to cite with [PR] label.
- Cardoso, L. Sandboxes for AI. https://www.luiscardoso.dev/blog/sandboxes-for-ai, 2026. [Practitioner report] Non-peer-reviewed practitioner blog.
- Chase, H. Context Engineering Our Way to Long-Horizon Agents. Sequoia Capital Podcast, 2026. [Practitioner report] Non-peer-reviewed practitioner account.
- Cursor Research Team. Towards Self-Driving Codebases. https://cursor.com/blog/self-driving-codebases, 2026. [Practitioner report] Non-peer-reviewed.
- Deloitte. AI Infrastructure Compute Strategy. https://deloitte.com, 2025. [Practitioner report] Non-peer-reviewed industry report.
- Gray, A. Minions: Stripe’s One-Shot, End-to-End Coding Agents. https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents, 2026. [Practitioner report] Non-peer-reviewed. Key data: 1300 PRs/week, 0 human-written code, 500 tools in Toolshed. SAFE to cite with [PR] label.
- Gupta, A. 2025 Was Agents. 2026 Is Agent Harnesses. https://medium.com, 2026. [Practitioner report] Non-peer-reviewed practitioner account.
- Hashimoto, M. My AI Adoption Journey. https://mitchellh.com/writing/my-ai-adoption-journey, 2026. [Practitioner report] Non-peer-reviewed. Cited for AGENTS.md harness-as-corrective-mechanism framing.
- Khmel, M. Everyone Agrees That Inference Demand Is Exploding. https://medium.com, 2026. [Practitioner report] Non-peer-reviewed. Cited for 1000x projection — present as industry projection, NOT established finding.
- Lopopolo, R. Harness Engineering: Leveraging Codex in an Agent-First World. https://openai.com/index/harness-engineering/, 2026. [Practitioner report] OpenAI Engineering Blog, non-peer-reviewed. Key data: 1M LoC, 0 hand-written, 3.5 PRs/eng/day. SAFE to cite with [PR] label.
- METR Research Team. Many SWE-bench-Passing PRs Would Not Be Merged Into Main. https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/, 2026. [Non-peer-reviewed technical note] METR Technical Notes. Key data: 50% non-merge rate, 24.2pp gap vs. human baseline. SAFE to cite with [Non-peer-reviewed technical note] label.
- OpenRouter; a16z. State of AI Report. https://a16z.com, 2026. [Practitioner report] Non-peer-reviewed industry report. Cited for 13T tokens/week figure — present as industry report, NOT peer-reviewed finding.
- Schmid, P. The Importance of Agent Harness in 2026. https://www.philschmid.de/agent-harness-2026, 2026. [Practitioner report] Non-peer-reviewed practitioner blog.
- Snorkel AI. Terminal-Bench 2.0. https://snorkel.ai, 2025. [Practitioner report] Industry benchmark documentation, non-peer-reviewed.
- The New Stack. MCP Production Roadmap: Remaining Blockers for Enterprise Tool Protocol Adoption. https://thenewstack.io, 2026. [Practitioner report] Non-peer-reviewed practitioner analysis.
- Zylon.ai. What Is OpenClaw? A Practical Guide to the Agent Harness Behind the Hype. https://zylon.ai, 2026. [Practitioner report] Non-peer-reviewed practitioner analysis.
- EleutherAI. lm-evaluation-harness. https://github.com/EleutherAI/lm-evaluation-harness, 2023. Open-source evaluation framework.
- LangChain. Agent Frameworks, Runtimes, and Harnesses. https://langchain.com/blog, 2025. [Practitioner report] Official documentation, non-peer-reviewed.
- OWASP. LLM01:2025 Prompt Injection. https://owasp.org/www-project-top-10-for-large-language-model-applications/, 2025. Industry security standard, non-peer-reviewed.
- Lee, Y.; Nair, R.; Zhang, Q.; Lee, K.; Khattab, O.; Finn, C. Meta-Harness: End-to-End Optimization of Model Harnesses, 2026, [arXiv:2603.28052]. Preprint, under review.
- OpenAI. Harness Engineering: Leveraging Codex in an Agent-First World. https://openai.com/index/harness-engineering/, 2026. [Practitioner report] Blog post, February 2026.
- LangChain. Improving Deep Agents with Harness Engineering. https://blog.langchain.com/improving-deep-agents-with-harness-engineering/, 2026. [Practitioner report] Blog post, March 2026.
- Anthropic. Agent Skills Open Standard Specification. https://agentskills.io, 2025. Open standard, December 2025.
- Pan, L.; et al. Natural-Language Agent Harnesses. arXiv preprint 2026, [arXiv:2603.25723]. Preprint. Introduces NLAH (Natural-Language Agent Harnesses) and IHR (Intelligent Harness Runtime).
- Anonymous. Harness Engineering for Language Agents: The Harness Layer as Control, Agency, and Runtime. Preprints.org 2026. Position paper; proposes CAR decomposition and HARNESSCARD reporting artifact. Manuscript 202603.1756.
- Chen, L.; Tong, P.; Jin, Z.; Sun, Y.; Ye, J.; Xiong, H. Plan-on-Graph: Self-Correcting Adaptive Planning of Large Language Model on Knowledge Graphs. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2024), 2024, [arXiv:2410.23875]. Peer-reviewed. Introduces PoG: a guidance/memory/reflection loop for KG-augmented planning.
- Wu, W.; et al. Aligning Large Language Models with Searcher Preferences. arXiv preprint 2026, [arXiv:2603.10473]. Preprint, under review. Introduces SearchLLM for open-ended generative search; a hierarchical multi-dimensional reward system separating bottom-line constraints from preference alignment; and a hybrid rule-based + LLM-judge evaluation stack with human-in-the-loop calibration.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.