2.2. Formal Definition
Figure 3 illustrates the six-component architecture of an agent harness. The diagram shows how each component occupies a distinct governance layer: at the center, the Execution Loop (E) orchestrates the observe-think-act cycle, directing control flow among the other components. The Tool Registry (T) sits at the environment boundary, mediating every action the agent takes on the world through typed, schema-validated interfaces. The Context Manager (C) governs the information channel into the model, filtering and prioritizing what enters the context window at each step. The State Store (S) provides cross-turn and cross-session persistence, feeding recovery state back to the execution loop on failure. The Lifecycle Hooks (L) form an interception layer across all component boundaries, enabling authentication, audit, and policy enforcement without coupling to component logic. Finally, the Evaluation Interface (V) instruments the full execution stream—capturing typed action trajectories, intermediate states, and goal-completion signals—in a standardized format that external benchmark frameworks can consume. The arrows in the figure trace how a single execution step flows: from environment observation, through context assembly (C), into model inference, through tool dispatch (T), and back through state commit (S) before the next turn, with L intercepting each boundary and V recording the trajectory. Reading the figure from left to right maps roughly to the temporal sequence of a single harness step; reading it vertically maps to the isolation hierarchy from model-facing (C, S) to world-facing (T) to governance-facing (L, V).
Figure 3.
Overview of the proposed six-component agent harness architecture. Each component occupies a distinct governance layer, with arrows tracing a single execution step from observation through context assembly, model inference, tool dispatch, and state commit.
Figure 4.
Empirical evidence matrix: five independent studies demonstrating that harness-level changes—without model changes—produce substantial performance improvements. xAI’s Grok Code Fast 1 achieved a 10× improvement (6.7% → 68.3%) on SWE-bench from edit-tool format change alone (Boluk 2026 [Practitioner report]); LangChain’s DeepAgents improved from 52.8% to 66.5% (+26%) on TerminalBench with harness-only changes; Meta-Harness’s automated optimization reached 76.4% on TerminalBench-2, surpassing hand-engineered approaches. In every case, the model remained constant while the harness changed.
Definition 2.1 (Agent Harness). An agent harness is a software system that implements six runtime governance functions:
E — Execution loop: Manages the observe-think-act cycle, including turn sequencing, termination conditions, and error recovery
T — Tool registry: Maintains a typed, validated catalog of available tool interfaces; routes and monitors tool invocations
C — Context manager: Governs what information enters the model’s context window across turns, including compaction, retrieval, and prioritization strategies
S — State store: Persists task-relevant state across turns and, optionally, across sessions; provides recovery from partial failures
L — Lifecycle hooks: Pre- and post-invocation interception points for authentication, logging, policy enforcement, and instrumentation
V — Evaluation interface: Instruments the execution to capture action trajectories, intermediate states, and success signals for offline analysis through standardized hooks; the standardization of these hooks is what distinguishes the V-component from general logging
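The six governance functions above can be sketched as minimal component stubs. The sketch below is illustrative only; every class and method name is an assumption chosen for exposition, not drawn from any existing harness:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ToolRegistry:                      # T: typed, validated tool catalog
    tools: dict[str, Callable[..., Any]] = field(default_factory=dict)

    def invoke(self, name: str, **args: Any) -> Any:
        if name not in self.tools:       # reject unregistered actions
            raise KeyError(f"unknown tool: {name}")
        return self.tools[name](**args)

@dataclass
class ContextManager:                    # C: governs the context window
    window: list[str] = field(default_factory=list)
    limit: int = 8

    def add(self, item: str) -> None:
        self.window.append(item)
        self.window = self.window[-self.limit:]  # naive eviction policy

@dataclass
class StateStore:                        # S: cross-turn persistence
    state: dict[str, Any] = field(default_factory=dict)

class LifecycleHooks:                    # L: pre/post interception points
    def before(self, event: str) -> None: ...
    def after(self, event: str, result: Any) -> None: ...

@dataclass
class EvaluationInterface:               # V: standardized trajectory capture
    steps: list[dict[str, Any]] = field(default_factory=list)

class ExecutionLoop:                     # E: observe-think-act orchestration
    def __init__(self, t: ToolRegistry, c: ContextManager, s: StateStore,
                 l: LifecycleHooks, v: EvaluationInterface,
                 max_turns: int = 10):   # explicit termination condition
        self.t, self.c, self.s, self.l, self.v = t, c, s, l, v
        self.max_turns = max_turns
```

A production harness would add error recovery to E and schema validation to T; the stubs only fix the component boundaries.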
These six functions are not arbitrary: they correspond to the six principal failure modes observed in production agent deployments—execution runaway (addressed by E), tool misuse (T), context blowout (C), state loss on failure (S), unmonitored side effects (L), and unobservable behavior (V). A system that implements all six is, in a meaningful sense, operationally governed; a system that implements only some is partially governed; a system that implements none—a bare model call—is ungoverned.
The distinction between V (evaluation interface) and L (lifecycle hooks) merits explicit clarification, since any system that logs agent behavior might appear to satisfy both. The difference is functional scope and standardization. L provides pre- and post-invocation interception for operational purposes—authentication, access control, audit trails, and policy enforcement—without commitment to any particular data schema or downstream consumer. V, by contrast, provides structured trajectory capture with standardized schemas that benchmark frameworks, evaluation pipelines, and observability platforms can consume directly: action sequences with typed arguments, intermediate state snapshots, tool-call success/failure indicators, goal-completion signals, and per-step token consumption. A system with only L can tell you that a tool was called and when; a system with V can tell you whether that tool call advanced the agent toward its goal, in a format that enables cross-harness comparison. The operational implication is concrete: HAL’s standardized evaluation harness requires a V-component that produces trajectory records in a canonical format that HAL’s analysis infrastructure can process—a requirement that a harness providing only operational logging hooks cannot meet.
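The L/V contrast can be made concrete with a toy example: an L-style hook emits an operational audit line with no fixed schema, while a V-style record is a typed trajectory entry that an evaluation pipeline could consume. The field names below are a hypothetical sketch, not HAL's actual schema:

```python
import json
from dataclasses import dataclass, asdict

def lifecycle_log(event: str, tool: str) -> str:   # L: operational audit only
    return f"[audit] {event} tool={tool}"          # no commitment to a schema

@dataclass
class TrajectoryStep:                              # V: standardized capture
    step: int
    tool: str
    args: dict
    success: bool
    goal_progress: float   # did the call advance the agent toward its goal?
    tokens_used: int

record = TrajectoryStep(step=3, tool="edit_file", args={"path": "main.py"},
                        success=True, goal_progress=0.4, tokens_used=512)

print(lifecycle_log("post-invoke", "edit_file"))   # tells you *that* it ran
print(json.dumps(asdict(record)))                  # tells you what it achieved
```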
Necessary conditions. A system must implement at minimum E and T to qualify as a harness. Without E, there is no multi-step execution to govern; the system is a single-turn inference wrapper. Without T, the agent cannot act on the world; the system is a reasoning engine with no effectors. These two together constitute the minimal viable operational environment.
Sufficient conditions. A system implementing all six components with production-grade reliability—including error handling, authentication, observability integrations, and documented failure modes—qualifies as a full-stack harness.
Edge cases that test the definition. The definition becomes useful precisely at the boundaries. A simple ReAct loop (Yao et al., 2023) is not a harness: it implements E minimally (a while-loop with no error recovery or termination logic) and T partially (ad-hoc tool calls without a registry), and lacks C, S, L, V entirely. ReAct is a framework primitive from which a harness can be built, not a harness itself. LangGraph is a harder case: it provides DAG-based execution graph primitives (encoding E in graph topology) but takes no explicit position on context management, state persistence, security, or evaluation; the LTS analysis later in this section classifies it as a topology-encoded harness rather than a framework primitive. MemGPT is a capability module: it implements C and S with exceptional sophistication but has no execution loop, no tool registry, and no lifecycle hooks as standalone components. AIOS (Mei et al., COLM 2025), by contrast, qualifies as a full harness: it implements all six components with explicit OS-level abstractions, and its empirical 2.1× speedup from proper scheduling of concurrent agent requests demonstrates that E-level governance has quantifiable performance consequences. The Tree-of-Thoughts framework (Yao et al., NeurIPS 2023) is a further instructive case: by requiring the execution loop to maintain parallel reasoning branches, evaluate intermediate states, and backtrack from dead ends, ToT reveals that the E-component’s design space is substantially richer than linear ReAct-style loops assume. A harness supporting ToT-style reasoning must implement branching execution graphs, branch-level state isolation, and evaluation callbacks at intermediate steps—a superset of what single-path harnesses require. This illustrates the general principle that harness E-component requirements are determined partly by the planning architectures the harness is designed to host.
Formal semantics of the E-component. An under-developed dimension of harness theory is the formal semantics of the execution loop itself. The E-component can be characterized as a labeled transition system (LTS) over states Q = {idle, observing, invoking-model, dispatching-tool, awaiting-tool-result, committing-state, terminated}, an observable event alphabet Σ (model response tokens, tool invocations, tool results, human approvals, errors), and a transition function δ : Q × Σ → Q. This formalization reveals three correctness properties that informal descriptions cannot express: safety (the system never enters a state from which termination is unreachable, i.e., no execution runaway); liveness (every maximal execution eventually reaches a terminal state—mere reachability of termination is not enough; progress toward it must be guaranteed); and determinism (for reproducibility, δ must be a function rather than a relation, meaning environment non-determinism must be isolated at tool-call boundaries). Process algebra provides a complementary perspective: a harness’s concurrent sub-agent orchestration can be modeled in CCS or CSP, where the parallel composition of sub-agent processes P1 ‖ P2 ‖ … ‖ Pn must satisfy deadlock-freedom under the harness’s synchronization constraints. Xu (2025, JACM) notes that orchestration patterns in multi-agent systems exhibit exactly the concurrency hazards—deadlock, livelock, and priority inversion—that process algebra was designed to detect. The practical implication is twofold: E-component designs should expose their state machines explicitly in configuration so that validators can check well-formedness before deployment, and multi-agent harnesses should demonstrate absence-of-deadlock for their orchestration topologies, analogously to how concurrent operating systems require protocol verification for inter-process communication. No current production harness satisfies either requirement, representing a gap between formal adequacy and engineering practice that the research directions in §7 should begin to close.
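The LTS can be mechanized directly. The sketch below encodes Q and one plausible δ as a Python dictionary and checks the no-runaway safety condition (termination reachable from every reachable state) by graph search; the specific transitions are assumptions consistent with the state names above, not a specification of any real harness:

```python
STATES = {"idle", "observing", "invoking-model", "dispatching-tool",
          "awaiting-tool-result", "committing-state", "terminated"}

# delta : Q × Σ → Q, written as a dict keyed on (state, event) pairs
DELTA = {
    ("idle", "observe"): "observing",
    ("observing", "context-ready"): "invoking-model",
    ("invoking-model", "tool-call"): "dispatching-tool",
    ("invoking-model", "final-answer"): "committing-state",
    ("dispatching-tool", "dispatched"): "awaiting-tool-result",
    ("awaiting-tool-result", "tool-result"): "committing-state",
    ("awaiting-tool-result", "error"): "committing-state",  # recovery arc
    ("committing-state", "next-turn"): "idle",
    ("committing-state", "done"): "terminated",
}

def reachable(src: str) -> set[str]:
    """All states reachable from `src` under DELTA (simple graph search)."""
    seen, frontier = {src}, [src]
    while frontier:
        q = frontier.pop()
        for (s, _event), t in DELTA.items():
            if s == q and t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

# Safety (no runaway): the system never reaches a state from which
# the terminated state is unreachable.
safe = all("terminated" in reachable(q)
           for q in reachable("idle") if q != "terminated")
print(safe)   # → True
```

Deleting the recovery arc for the "error" event would leave awaiting-tool-result without an exit on failure, which is exactly the partial-δ pathology analyzed next.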
Formalization in Use: Classifying Systems via LTS. The LTS characterization of the E-component is not merely decorative; it provides a discriminative tool for the boundary cases analyzed in §2.3. Consider two contrasting systems. ReAct (Yao et al., 2023) is instructive as a non-harness case. A ReAct implementation can be written as a while-loop with an informal “stop if the model outputs a final answer” condition. Rendered in LTS terms: the state set Q collapses to {active, done}; there is no idle state awaiting context commitment, no awaiting-tool-result state capturing asynchronous returns, and no committing-state transition that guarantees persistence before the next observe step. The transition function δ is therefore partial—it is undefined for error inputs, since ReAct has no error recovery arc—violating the LTS safety property that termination must always be reachable. The initial state is q0 = active and the terminal set is F = {done}, but no path guarantees reaching F when a tool call returns an exception. This formal gap is precisely what practitioners observe as “execution runaway.” AutoGPT (Richards, 2023), by contrast, qualifies as a harness under the LTS analysis. Its execution loop implements a richer state space: q0 = idle (awaiting task input), with transitions through goal-parsing, sub-task decomposition, internet-tool invocation, and state-persistence steps before cycling back to idle. The terminal condition F = {goal-achieved, max-steps-exceeded} is explicit in the codebase. The function δ is total over the documented event alphabet Σ—including exception events, which route to an error-recovery state rather than causing silent failure. AutoGPT’s notorious reliability problems arise not because its LTS is incomplete but because its transitions were implemented without the production-grade guarantees (idempotent state writes, atomic commits) that a safety-critical LTS requires.
The distinction matters: ReAct fails to be a harness because it lacks the LTS structure; AutoGPT is a harness whose LTS structure is sound but whose implementation of that structure is not. The formalism draws this line precisely where intuition suggests it should be drawn. A third case—LangGraph—extends the analysis to a topology-encoded harness. LangGraph implements execution as a directed acyclic graph (DAG) of computation nodes. In LTS terms, Q is defined by the set of graph nodes; δ is defined by graph edges and conditional transition predicates attached to them; and the DAG topology guarantees liveness by construction—acyclicity ensures that no execution can loop indefinitely, so a terminal node is always reachable from any non-terminal node. The E-component is therefore present and formally well-behaved. However, the C-component is realized implicitly through graph topology rather than through an active context management policy: information flows between nodes via the graph structure, but no explicit context compaction, eviction, or prioritization mechanism governs what the model receives at each step. The consequence for classification is precise: LangGraph instantiates E and T (nodes invoke tools), satisfies the LTS safety and liveness properties by virtue of DAG structure, but realizes C implicitly rather than explicitly. We classify LangGraph as a topology-encoded harness—a harness in which C is derivable from the graph specification rather than from a separate runtime component. The LTS analysis thus discriminates three structurally distinct system classes: primitive non-harnesses (ReAct, in which δ is partial and safety fails), monolithic harnesses (AutoGPT, in which δ is total but implementation guarantees are weak), and topology-encoded harnesses (LangGraph, in which formal properties are established architecturally rather than imperatively). This three-way classification, derived from a uniform LTS framework, is not achievable by informal analysis alone.
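The partial/total contrast between the two δ functions can be checked mechanically. The transition tables below are toy reconstructions of the argument, not the systems' actual code; the ReAct-style table has no arc for error events, while the AutoGPT-style table is total over the event alphabet:

```python
EVENTS = {"model-step", "tool-ok", "tool-error", "final-answer"}

REACT_DELTA = {                         # partial: no "tool-error" arc
    ("active", "model-step"): "active",
    ("active", "tool-ok"): "active",
    ("active", "final-answer"): "done",
}

AUTOGPT_DELTA = {                       # total over EVENTS, with recovery
    ("active", "model-step"): "active",
    ("active", "tool-ok"): "active",
    ("active", "tool-error"): "recovering",   # explicit error-recovery state
    ("active", "final-answer"): "done",
    ("recovering", "model-step"): "active",
    ("recovering", "tool-ok"): "active",
    ("recovering", "tool-error"): "recovering",
    ("recovering", "final-answer"): "done",
}

def undefined_inputs(delta: dict, state: str, events: set) -> set:
    """Events for which `delta` has no transition out of `state`."""
    return {e for e in events if (state, e) not in delta}

print(undefined_inputs(REACT_DELTA, "active", EVENTS))    # → {'tool-error'}
print(undefined_inputs(AUTOGPT_DELTA, "active", EVENTS))  # → set()
```

The non-empty set for the ReAct-style table is the formal signature of execution runaway: an input the loop simply has no answer for.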
The full boundary case analysis for LangGraph, including its implicit C-component realization and comparison with LangChain as a framework primitive, appears in §2.3.
Figure 5 contrasts the labeled transition system structure of three representative systems: ReAct, AutoGPT, and LangGraph. The left panel shows ReAct’s collapsed two-state LTS (active → done), with no error-recovery arc and no committing-state intermediate—the incompleteness of δ over error inputs is visually apparent as a missing transition from the active state. The center panel shows AutoGPT’s richer state space, tracing the full path from idle through goal-parsing, tool-dispatch, state-persistence, and back to idle, with explicit error-recovery arcs that close the LTS under failure events; the gap between this formally sound structure and its weakly-guaranteed implementation is annotated. The right panel shows LangGraph’s DAG-encoded topology, where liveness follows from acyclicity rather than from explicit terminal-state specification—the C-component’s implicit realization through graph edges (rather than a separate runtime policy) is marked by a dashed border. Reading across the three panels illustrates the three system classes derived from the LTS analysis: primitive non-harness, monolithic harness, and topology-encoded harness, each with distinct formal properties and distinct engineering implications.
Figure 5.
LTS structure comparison of three representative systems: ReAct (primitive non-harness), AutoGPT (monolithic harness), and LangGraph (topology-encoded harness). The three panels illustrate how formal LTS analysis distinguishes system classes.