Preprint
Article

This version is not peer-reviewed.

Agent Harness for Large Language Model Agents: A Survey

Submitted:

08 April 2026

Posted:

09 April 2026

You are already at the latest version

Abstract
The rapid deployment of large language model (LLM) based agents in production environments has surfaced a critical engineering problem: as agent tasks grow longer and more complex, task execution reliability increasingly depends not on the underlying model's capabilities, but on the infrastructure layer that wraps around it—the agent execution harness. This dependence suggests that the harness, not the model, is the binding constraint for real-world agent system performance. We treat the harness as a unified research object deserving systematic study, distinct from its individual component capabilities.This survey makes five contributions: (1) A formal definition of agent harness with labeled-transition-system semantics distinguishing safety and liveness properties of the execution loop. (2) A historical account tracing the harness concept from software test harnesses through reinforcement learning environments to modern LLM agent infrastructure, showing convergence toward a common architectural pattern. (3) An empirically-grounded taxonomy of 22 representative systems validated against a six-component completeness matrix (Execution environment, Tool integration, Context management, Scope negotiation, Loop management, Verification). (4) A systematic analysis of nine cross-cutting technical challenges—from sandboxing and evaluation to protocol standardization and compute economics—including an empirical protocol comparison (MCP vs.~A2A) and analysis of ultra-long-context model implications. (5) Identification of emerging research directions where harness-layer infrastructure remains underdeveloped relative to component capabilities. The evidence base is grounded in three peer-reviewed empirical studies (HAL on harness-level reliability, SWE-bench on evaluation infrastructure, and AgencyBench on cross-component coupling), supplemented by practitioner reports from large-scale deployments (OpenAI, Stripe, METR) that corroborate the binding constraint thesis. We scope our analysis to end-to-end agentic systems (excluding single-step language model use) and restrict our taxonomy to systems with public documentation, acknowledging that proprietary harness designs remain largely unstudied.
Keywords: 
;  
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings

© 2026 MDPI (Basel, Switzerland) unless otherwise stated