Harness Engineering for LLM Agents: A Survey of Harness Component Taxonomy, Evaluation, and Model–Harness Coevolution

Jinzhe Li; Yuan Wu; Yi Chang

doi:10.20944/preprints202606.2203.v1

Submitted:

27 June 2026

Posted:

30 June 2026

You are already at the latest version

Abstract

As large language model-driven autonomous agents are increasingly deployed in real-world long-horizon, open-environment tasks, foundation models expose systematic capability gaps in context retention, reliable tool invocation, persistent state management, and multi-step execution robustness. We understand the agent harness as the external execution support structure built around the model and treat it as a distinct performance lever that complements base model capability, positioning harness engineering as a growing area of research and engineering practice. Grounded in the scaffolding perspective from developmental psychology, we structure the field along three nested levels of analysis. At the structural level, we develop a unified taxonomy of harness components, mapping them to the specific capability gaps they compensate for and to their coupling with the core agent loop. At the fit level, we articulate a two-stage evaluation logic that distinguishes native model capability-gap diagnosis from assessments of compensation effectiveness and net benefit, and unpack the inherent multi-objective tradeoffs shaping harness design. At the dynamic level, we delineate the bidirectional coevolution mechanism between models and harnesses, explaining the shifting functional boundary where routine capability-bearing support migrates inward into model weights while constraint-bearing governance functions remain external. By synthesizing studies that are currently scattered across adjacent areas, we provide an organizing framework for understanding LLM agent harnesses in relation to the capability gaps they address. We further discuss open challenges and future directions in harness design, evaluation, and model--harness coevolution.

Keywords:

harness engineering

;

large language models

;

scaffolding

;

agent systems

;

evaluation

;

evolution

Subject:

Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

When a task exceeds the independent capacity of a single cognitive subject, a capability gap emerges. Developmental psychology characterizes a central mechanism for bridging such gaps as scaffolding: external support structures enable a learner to accomplish tasks that would otherwise remain beyond unaided reach [1]. As competence increases, effective scaffolding is expected to fade and transfer responsibility to the learner, while higher task demands may require new forms of support [2,3]. This cycle of construction, internalization, withdrawal, and reconstruction offers a useful lens for analyzing support structures that help complex systems move from fragile competence toward reliable performance.

Research on scaffolding can be organized around three progressively richer levels of analysis. At the structural level, studies ask what components constitute a scaffold and which capability gaps they compensate for [1]. At the fit level, they ask whether support is aligned with the actual learner need, and whether under-support or over-support produces a mismatch [2]. At the dynamic level, they ask how support is faded, how responsibility is transferred as competence changes, which capabilities become internalized, and which remain distributed across the subject and external artifacts [2,3,4].

The evolution of large language models (LLMs) and agentic systems exhibits a structurally analogous cycle. As foundation models improve in reasoning and generation, LLM-driven autonomous agents are deployed in software engineering, scientific work, information analysis, and other professional settings [5,6]. These deployments extend beyond conversational interaction into long-running tasks with stronger reliability constraints. Under realistic conditions, several systematic capability gaps become salient. Models require mediated interfaces to turn reasoning into external actions, making tool use and action grounding central deployment problems [7,8,9]. Attention and information localization degrade over long trajectories, causing important state to be forgotten or submerged in noise [10]. The lack of persistent cross-session state constrains continual learning from prior experience [11,12]. Multi-step execution amplifies error accumulation and state drift, gradually pushing trajectories away from the goal [13]. Reliability improvements also introduce latency and cost tradeoffs that must be managed at the workflow level [14]. These are not isolated defects. They are systemic gaps in cognition, execution, and governance that become visible when agents move from controlled demonstrations into open-ended environments.

Engineering practice compensates for these gaps by building support structures outside the model. These supports manage context, expose tools, preserve state, organize workflows, provide execution environments, externalize verification and guardrails, and record observable traces. Around the agent loop, they are commonly described in engineering practice as an agent harness [15,16]. Holding the foundation model fixed, changes to the harness can substantially alter end-to-end performance, making the harness a performance lever orthogonal to model capability [17,18]. As task complexity increases, the harness becomes a research and engineering object in its own right: harness engineering concerns how these external support structures are constructed, organized, evaluated, and governed [19].

Existing discussions of harness engineering remain fragmented. Context engineering, tool learning and function calling, memory mechanisms, interoperability protocols, safety governance, and evaluation methods have developed partially independent literatures and engineering communities [9,12,20,21,22,23]. These literatures often describe which components exist. They give less attention to why the components emerge, how they match specific capability gaps, and how they are withdrawn or reorganized as model generations improve. Broad surveys of LLM agents cover many of these support structures, but they are typically organized around the model or the agent as the primary object. The external harness therefore remains a dispersed set of auxiliary modules [5]. Engineering posts on context and skills show how runtime artifacts are increasingly organized outside the model, but they do not yet amount to a unified academic taxonomy of harness engineering [20,24]. We therefore argue that harness engineering must answer three linked questions corresponding to the three scaffolding levels: what a harness consists of, whether it compensates for the right gap at acceptable cost, and which supports become model-side capabilities rather than external governance layers.

Accordingly, we develop a structured synthesis of harnesses along the three levels of scaffolding analysis. At the structural level, Harness Component Taxonomy, we use capability gaps as the organizing axis, propose a unified taxonomy of seven component families, and explain their boundaries and coupling relationships with respect to the agent loop. At the fit level, Harness Evaluation, we distinguish model capability-gap diagnosis from harness compensation assessment. The former assesses native capability limits; the latter measures compensation effectiveness and net benefit under multi-objective constraints such as reliability, cost, autonomy, safety, and long-horizon continuity. At the dynamic level, Harness Evolution, we treat evolution as a bidirectional process. Harnesses promote capability internalization by accumulating trajectories, feedback, and experience (Harness → Model). Improved model capabilities in turn drive harness optimization at the interface, component, workflow, and closed-loop levels (Model → Harness). This dynamic distinction also clarifies why routine capability-bearing support can move toward the model, whereas constraint-bearing support remains tied to external verification, permission control, and governance. Together, these dynamics continually reshape the model–harness boundary. Figure 1 summarizes the literature organization that supports the three analytical levels.

2. Harness Foundations: From Model Capability Gaps to External Support

Building on the scaffolding lens introduced above, we first trace how external support structures have shifted across successive LLM capability boundaries. When models encounter a boundary in a particular capability, research and engineering practice first tend to construct targeted support outside the model to compensate for what the model cannot yet perform independently and reliably. When the corresponding behavior can be captured by stable training targets and repeatable feedback, these support patterns can move closer to model-side capability, and the former external support loses part of its role as a necessary condition. Improved model capability then expands deployment toward more complex and open-ended tasks, where new capability gaps become visible at a higher level. These developments reveal a cycle of gap exposure, external compensation, training internalization, and renewed gap exposure [1,207].

To date, this pattern can be seen in two earlier boundary shifts in the evolution of LLMs, while a third shift is still unfolding. The first was the intent-alignment cycle. Pretraining gave models strong language-generation capacity, but did not ensure that their outputs would match user intent; few-shot prompting therefore served as an external alignment scaffold before reinforcement learning from human feedback and instruction tuning internalized the relevant behavior into model weights. The second was the reasoning-capability cycle. Even after models could follow instructions, they often omitted intermediate reasoning steps on complex tasks; chain-of-thought prompting became an external reasoning scaffold, and later specialized training made step-by-step reasoning a more endogenous model capability. The third and current cycle concerns execution capability. As models are applied to long-running tasks in real environments, gaps in tool use, state maintenance, environmental feedback processing, side-effect control, and failure recovery become concentrated. We refer to the resulting system-level external execution support as the harness, synthesizing recent academic surveys and official platform documentation [40,207,208]. The remainder of this section follows the temporal logic of these boundary shifts: it first reviews alignment and reasoning, then characterizes the still-open execution cycle, and finally formalizes the concepts of harness and harness engineering.

2.1. Intent Alignment Cycle

Pretrained language models that have not undergone explicit alignment training commonly exhibit a gap in intent following. Although they can generate fluent language, they do not reliably produce outputs that match user goals. Their responses may deviate in style or stance, and they may produce harmful, biased, or factually unreliable content. Merely scaling model size does not reliably remove these problems [209,210,211,212].

Before alignment training became mature, few-shot prompting was a primary way to mitigate this gap. By embedding carefully selected demonstrations in the input prompt, developers could temporarily inject an expected behavioral pattern and steer the model toward the desired output form [210,213]. This intervention was external to the model itself. Its effectiveness depended heavily on the quality and coverage of the examples in a single prompt, and it could fail when the task format or wording shifted. It was also difficult to reuse reliably across users and tasks, making it insufficient as a scalable alignment mechanism.

Reinforcement learning from human feedback and later preference-optimization methods changed this arrangement. Through human preference annotation, reward-model construction, and policy optimization, the ability to follow human intent was trained into model behavior more systematically [209,214,215,216,217,218]. InstructGPT was a landmark result: a 1.3-billion-parameter aligned model was preferred in human evaluations over the 175-billion-parameter GPT-3 baseline and over prompting-based baselines of the same model family [209]. This demonstrated that intent alignment had moved from an external compensatory technique toward an internalized model capability.

The first cycle therefore illustrates a clear boundary shift. Manually designed alignment demonstrations lost their central role as the primary mechanism for intent following once preference-based post-training made instruction following a more stable model behavior. They remained useful in particular cases, but no longer served as the mandatory external structure for basic intent alignment.

2.2. Reasoning Capability Cycle

Intent alignment improved the fit between model outputs and user requests, but it did not by itself solve multi-step reasoning. Even when models understood instructions, they often defaulted to direct answers on arithmetic, symbolic manipulation, and multi-hop logical tasks, omitting intermediate steps and showing rapidly declining accuracy as complexity increased [219,220]. This gap motivated external support during the reasoning stage.

Chain-of-thought prompting became the canonical external reasoning scaffold. By adding examples of step-by-step reasoning to the prompt, researchers significantly improved performance on complex tasks. Even a zero-shot trigger such as “Let’s think step by step” could activate latent stepwise reasoning behavior [219,220]. Subsequent methods such as self-consistency and Tree of Thoughts extended this scaffold into richer forms of sampling, voting, and path search [221,222]. Surveys of chain-of-thought and prompting techniques further show that these methods developed into a broad engineering repertoire [223,224]. In essence, however, these techniques remained external interventions at inference time: the reasoning procedure was injected through prompting rather than being a stable capability of the model.

Specialized reasoning training then began to move this capability boundary inward. Systems such as OpenAI’s o1 models and DeepSeek-R1 used large-scale reinforcement learning to train models to produce and refine stepwise reasoning processes without relying on user-supplied demonstrations [225,226]. DeepSeek-R1-Zero is particularly relevant here because it showed that step-by-step reasoning behavior can emerge under reinforcement learning without supervised chain-of-thought data, suggesting that the structure is not simply imported from external exemplars but can arise endogenously under an appropriate training pressure [226].

Correspondingly, carefully engineered chain-of-thought prompts provide much smaller marginal benefits for newer reasoning models, and overly prescriptive reasoning prompts may even interfere with strong models’ internal reasoning processes [227,228]. The second cycle shows a related but more model-dependent boundary movement. As reasoning-oriented training made stepwise problem solving more endogenous to the model, chain-of-thought prompting shifted from a central scaffold for eliciting reasoning to a task- and model-dependent intervention.

2.3. Execution Capability Cycle and the Emergence of the Harness

As intent alignment and reasoning capability were progressively internalized, models approached the limits of many closed-form task settings. The center of gravity in research and deployment shifted from whether a model could reason correctly to whether it could execute reliably in real environments. This shift had an important precursor in ReAct, which interleaved reasoning and action so that a model could call external tools and interact with an environment over multiple turns [7]. It also contributed to the broader rise of LLM-based autonomous agents. Researchers and practitioners increasingly expected agents to complete goals in open-ended, long-horizon settings that require persistent interaction with the environment. In deployment, however, models exposed execution-level weaknesses: state is lost across multi-step interaction, tool calls may lack validation, errors cascade along execution chains, and agents often fail to recover after breakdowns [5,43,122,229,230].

Benchmarks quantify this execution gap. Leading models have achieved only low end-to-end success rates on realistic web tasks, initially solved fewer than four percent of real GitHub issues in SWE-bench, and completed less than one third of tasks autonomously in simulated enterprise settings [231,232,233]. Tool-agent-user interaction benchmarks further show that pass rates across repeated runs can remain far below single-run success rates, revealing that reliability cannot be reduced to one-shot accuracy [234]. Recent work on agent reliability makes the same point more generally: consistency, robustness, predictability, and bounded error severity are distinct operational dimensions that are obscured by a single success metric [78]. AgentBench and studies of self-correction likewise show that current LLMs still struggle to operate as reliable agents in complex interactive environments and cannot be assumed to repair their own failures without external feedback [235,236].

This divergence between single-run success and cross-run reliability marks a key difference between the execution cycle and the previous two cycles. Alignment and reasoning gaps could be narrowed substantially through scale and training. Execution reliability does not appear to converge monotonically with model size alone [78]. Moreover, many core requirements of execution settings, including permission control, side-effect constraints, and external-system connectivity, are valuable precisely because they remain independent of model weights and can be separately inspected, governed, and audited [22,69,197,198].

Engineering practice has therefore again turned first to support outside the model. Unlike the previous two cycles, however, the support structure is no longer limited to text-level prompt optimization. It includes task orchestration, state management, tool integration, safety constraints, observability, and failure recovery as system-level execution infrastructure [40,58,207,208,237]. This external structure, built around a language model and oriented toward reliable execution, is the harness. The systematic engineering practice around its design, construction, evaluation, and evolution is harness engineering.

The support functions inside the third cycle are not homogeneous. They can be divided into two classes with different evolutionary paths. The first class is capability-bearing support, including task orchestration, state tracking, tool-call decisions, and failure recovery. Tool-use training, native function calling, and model-side context management show how such functions move inward when the behavior admits stable training targets and repeatable feedback. The second class is constraint-bearing support, including safety guardrails, permission and identity management, side-effect control, and interfaces to external systems. The persistence of permission control, side-effect constraints, and audit requirements shows why these functions follow a different path: their value comes from separation from model weights, independent deployability, and auditability [22,69,197,198].

Early signs of inward migration are already visible for capability-bearing functions. Native function calling, self-supervised tool-use training, computer-use interfaces, model-driven context management, and API-oriented fine-tuning all suggest that parts of the harness that were previously external can become model capabilities [20,238,239,240,241]. Yet this migration is incomplete, and it does not imply that the harness will disappear. Current frontier systems still have significant reliability deficits [78,234]. At the same time, stronger models are entrusted with longer, more complex, and riskier tasks; METR reports that the length of tasks that frontier agents can complete at a fixed reliability threshold has been doubling on the order of months [242]. Thus, even as some individual supports weaken, the total need for system-level execution support may grow.

Constraint-bearing functions follow this external logic most clearly. Theoretical and empirical work on hallucination argues that some limitations of LLMs cannot be completely eliminated, which explains why retrieval augmentation and output validation remain important even as model scale increases [243,244]. These external structures should not be treated as temporary compensatory mechanisms. They are part of a governance and verification layer that remains outside the model because its purpose is to provide independent evidence and control.

The third cycle therefore remains open and does not point toward the disappearance of external support. Current systems reveal a shift in the harness rather than its removal: low-level capability support can contract, while governance, verification, and audit layers retain their external role. For this reason, the harness is both the current intermediate layer between model reasoning and reliable execution and a long-lived technical object whose form continues to evolve. The evolving model–harness boundary is the central theme of the following sections on Harness Component Taxonomy, evaluation, and evolution.

2.4. Definition and Conceptual Boundaries of the Harness

Based on the preceding analysis, a harness can be defined as a system-level external execution structure built around a language model and independent of model weights. Its core function is to transform the model’s reasoning and generation capability into reliable and controllable task execution in a real environment. The harness sits between the model and the task environment: inwardly, it receives and organizes model outputs; outwardly, it connects to tools, environmental feedback, and execution substrates. Its core components include task orchestration, cross-step state management, tool integration and call validation, safety guardrails and permission constraints, output verification, and failure detection and recovery. It is therefore a first-class architectural object rather than a peripheral implementation detail [208]. Within this architecture, the model performs reasoning and generation, while the harness turns reasoning results into a controlled execution process.

The harness is not merely an accessory to the model. It is an independent dimension of performance optimization. Holding model weights fixed, changes to the harness can substantially improve an agent’s end-to-end performance. For example, LangChain reports that harness-level changes such as prompts and middleware hooks moved the same coding model from roughly the top 30 to the top 5 on Terminal-Bench 2.0, with a score increase from 52.8 to 66.5 [76]. This separability raises the harness from surrounding engineering to an object of independent research value [207,208].

Three boundaries clarify the concept. First, a harness is broader than a prompt. A prompt is textual input to the model; the harness controls how model calls are organized, how context is assembled, how outputs are checked, and how results are passed to the environment. Prompt design is only one function within a harness. Second, a harness is distinct from an agent. A runnable agent can be decomposed as model plus harness: the model supplies core cognitive capability, while the harness supplies the execution framework that turns that capability into practical effect. The agent is the goal-oriented behavior that emerges from their combination, whereas the harness is the underlying support structure for that behavior [15]. Third, a harness differs from a traditional pipeline. A traditional pipeline is a predefined static workflow whose control logic is entirely specified by humans. A harness is built around model-mediated dynamic control. It must handle uncertain tool calls, state drift, error recovery, and partial delegation of control to the model rather than reducing execution to a single deterministic path [207].

Harness engineering can therefore be defined as the systematic practice of designing, building, evaluating, and evolving harnesses so that language models can execute real tasks reliably, controllably, and maintainably. Its focus is not primarily to modify the model itself, but to construct the execution environment around the model, specify intent clearly, and build feedback loops that support reliable task execution [54,71,85]. Its research problems include component architecture, interaction protocols, the dynamic boundary between harness and model capability, and the strategies by which harnesses should evolve as model generations change.

We can now draw a working boundary among model, harness, and environment. The model is responsible for reasoning and content generation. The environment provides the task setting and real feedback. The harness mediates between them, translating model reasoning into controlled and verifiable execution. Within the harness, capability-bearing functions move inward when they become trainable as stable model behavior, while constraint-bearing functions remain outside when their value depends on separability, auditability, and governance [20,40,197,198]. This boundary provides the basis for the remaining analysis. Harness Component Taxonomy decomposes harness component families according to the capability gaps they compensate for. Harness Evaluation begins from those gaps, diagnosing native capability limits before assessing the effectiveness, risks, and net benefit of harness compensation. Harness Evolution then analyzes how the model–harness boundary changes over time under the joint pressure of model improvement and system objectives.

3. Harness Component Taxonomy

An agent is grounded in a recurring loop: the model reads information, chooses an action, the action changes the environment, and the resulting feedback enters the next round of inference. This loop is the minimal form of sustained agent behavior, but it is insufficient for complex, long-horizon, or high-risk work. The model still depends on external support for limited attention, a closed native action space, non-persistent state, unstable multi-step control, external execution, result verification, and process recording. We map these recurrent gaps to seven component families: context management, tools and interfaces, state and memory, workflow and orchestration, execution environments, verification and guardrails, and observability. A harness surrounds the loop with these compensating structures, allowing the model to act continuously and more reliably in a concrete environment. The resulting design problem is anatomical as well as strategic. Anatomically, different components compensate for different gaps in the agent loop. Strategically, task settings determine which gaps dominate, which components must thicken, and which supports can remain lightweight. This section therefore first decomposes the harness into component families, then examines how design objectives reshape component emphasis, and finally compares representative implementations as different configurations of the same underlying loop. Figure 2 gives an overview of the component families and their associated capability gaps.

3.1. Context Management

Agent trajectories accumulate information as they proceed. Tool results, dialogue history, drafts, failed attempts, and intermediate outputs all compete for the same context window. Recent evaluations of dynamic agents and long-horizon real-world tasks show that, as trajectories stretch across many tool calls, long execution times, and million-token contexts, models suffer from more than degraded information localization. Planning, memory, and tool management also fail in compound ways. AgencyBench scenarios require roughly 90 tool calls and one million tokens on average, while WildClawBench tasks require approximately eight minutes of execution and more than 20 tool calls, making these settings closer to agent operating conditions than earlier static long-context benchmarks [17,25]. Once the model drifts in redundant context, later steps become harder to recover [10,25]. A larger context window is therefore not automatically a better one. Context management treats the window as a scarce resource: it asks what the model should see at the current step, where temporarily irrelevant information should go, and how complex tasks can avoid cross-contamination. Around these questions, context management components fall into selection and retrieval, compression and summarization, externalization and offloading, and isolation with progressive disclosure.

Selection and retrieval start from a central constraint: only part of the available material is relevant to the next step. They identify the evidence needed now, rather than inserting all potentially useful material at once. Just-in-time retrieval uses the task goal, current question, and tool state to retrieve necessary information, while leaving other content outside the window as paths, queries, links, or indexes [20,26]. If selection is too broad, redundant material dilutes model attention. If it is too narrow, key premises may be omitted. These components therefore work with state and memory components: memory preserves retrievable material, while context management decides which part to retrieve at each step.

Compression and summarization become necessary after useful information has already passed through the window. Some of that information is too important to discard, but too bulky to keep verbatim. Auto-compaction, structured summaries, budget-aware compression, and task-state rewriting can turn long histories into shorter records of goals, decisions, evidence, and pending work [27,28]. Their central risk is information loss. Budget-aware context management models compression as a sequential decision problem under an explicit context budget, and long-horizon evaluations show that erroneous compression or omitted premises can later become compound planning, memory, and tool-management failures [25,28]. Safer compression is therefore more than shortening. It preserves recoverable pointers, structured state, and verifiable evidence so that information moved out of the window can be retrieved again when needed.

Externalization and offloading keep useful material available without keeping it in the prompt. Structured notes, draft files, offloaded tool outputs, and file-system caches move large information blocks out of the resident context while retaining references to them [20,29,30]. This is useful for information that remains valuable but should not consume attention budget at the current step. The file system plays a double role: it provides an almost unbounded external workspace for context management while also serving execution environments and state-and-memory mechanisms. Externalization is not forgetting. It moves information from linguistic context into a location that tools can revisit.

Isolation and progressive disclosure matter when one task contains multiple branches, assumptions, or capabilities. Isolation gives subagents, subtasks, or separate workspaces their own contexts, preventing failed attempts, irrelevant evidence, or local assumptions from polluting the main task window [31]. Progressive disclosure stores capability descriptions, resources, and operational details in layers and loads them only when required; Agent Skills are a representative example [24,29]. Together, these mechanisms determine when the model should see more information. Isolation controls boundaries between tasks, while progressive disclosure controls how information unfolds inside a single capability. Both depend on orchestration and control flow, because control logic typically decides whether to launch a subtask and when to load a skill.

Context management turns the model’s limited attention budget into a schedulable system resource. Selection and retrieval decide what enters the window; compression and summarization reduce historical burden; externalization and offloading preserve recoverable pointers; and isolation with progressive disclosure controls information boundaries across tasks and capabilities. These components do not add model knowledge or replace long-term memory. Their distinctive role is to keep the current reasoning state clean, relevant, and recoverable when long tasks would otherwise be submerged by redundant history, irrelevant tool outputs, and failed branches. The corresponding risk is loss of continuity: if context is compressed, isolated, or offloaded too aggressively, key premises disappear from the active reasoning path.

3.2. Tools and Interfaces

Acting in the world requires more than text generation. To read external information, modify files, call services, or trigger real actions, a harness must translate linguistic intent into executable calls through tools and interfaces. The design problem has three parts: which actions the model may use, how actions are called reliably, and how action capabilities are discovered, organized, and governed at scale. The relevant components include tool-call contracts, agent-computer interface design, tool selection and organization, and interoperability protocols. Together, they define the action space. Execution environments run these actions, while verification and guardrails constrain their side effects [9,32].

Tool-call contracts give an exposed action a precise boundary. They specify whether the action can be called, with what parameters, and how the returned result enters later reasoning. Function calling has moved from a specialized tool-learning capability toward a native feature of mainstream models. The problem has therefore shifted from whether the model can call tools to whether it can call them stably, accurately, and explainably [9,32,33]. Vague contracts increase errors in parameter types, enum values, optional-field semantics, and return interpretation. Diagnostic work on multi-agent tool invocation also shows that parameter handling and schema-constraint conflicts are common failure sources [34]. Tool contracts clarify action boundaries by compressing natural-language intent into a constrained parameter space that the execution environment can run.

Agent-computer interface design addresses a different failure point: a tool may exist, but still be difficult for the model to use correctly. Tool names, descriptions, parameter explanations, and return formats enter the model’s decision process, so they operate as control signals rather than ordinary documentation [32]. Interface quality depends on functional completeness, but also on whether actions are sufficiently atomic, names reduce ambiguity, and returns contain only information needed for later reasoning. Recent work on rewriting tool descriptions shows that, at the scale of more than 150 candidate tools, rewritten descriptions reduce accuracy degradation and improve query-level success [35]. A 2026 study of terminal-style agent interfaces similarly reduces effective human-agent collaboration to representational compatibility, transparency of action, and low barriers for human participation [36]. The role of ACI is therefore not to add more tools. It is to reduce misunderstanding and misuse of the actions already available while making those actions easier for humans to supervise [35,36].

Tool selection and organization become more difficult as the action space grows. A small tool set can be placed directly in context, but tens or hundreds of tools consume attention budget, weaken selection signals, and increase the likelihood of incorrect choice [37]. Retrieval-based tool selection recalls tools dynamically according to the current task. Methods that jointly model tool capabilities and agent capabilities can identify appropriate actions more precisely than coarse-grained retrieval [38]. Beyond selection, call timing is a separate problem: some calls fill information gaps, while others merely add cost, latency, or noise. A harness must therefore judge whether a call is necessary, useful, and affordable [39]. Skills represent another organizational pattern. They package instructions, resources, and operating modes as capabilities loaded on demand, rather than keeping the full tool set resident in context [24,29].

Interoperability protocols enter when tools come from many systems rather than one local integration. As agents need access to code repositories, databases, enterprise applications, and external services, one-off integrations become difficult to maintain. Protocols such as MCP provide a uniform interface between AI applications and external data sources, tools, and services, and sit alongside ACP, A2A, ANP, and related interoperability efforts [21,40]. These protocols reduce integration cost, but they also foreground tool supply chains, security boundaries, and permission governance. If an external server’s tool descriptions, permission scope, or execution capabilities are untrusted, they can influence tool selection and expand side effects [41]. Interoperability is valuable not only because it connects more systems, but also because it exposes connection relationships, capability boundaries, and governance responsibilities more clearly.

The action space of an agent is therefore engineered, not merely appended. Tool-call contracts narrow action boundaries, ACI improves action comprehensibility, tool selection and organization control large-scale capability exposure, and interoperability protocols reduce cross-system integration cost while introducing governance boundaries. Their distinctive role is to turn linguistic intent into executable action contracts. More expressive tools expand the agent’s reachable environment, but more tools, broader permissions, and vaguer interfaces also increase misuse and privilege risk. Tools and interfaces therefore define the agent’s capability exposure as much as they expand its task coverage.

3.3. State and Memory

Agent work often outlives a single prompt. After a session reset, prior facts, preferences, decision rationales, and lessons from failed attempts disappear from the visible context. State and memory components persist such information outside model parameters and outside the current context window. Their functions are to record current task progress, save prior facts and experience, and provide reusable evidence for similar tasks [11,12]. The relevant components include working memory, episodic and semantic memory, procedural memory, memory management, and engineering primitives such as file systems and version-control systems. Their boundary with context management is functional: context management allocates attention inside the current window, while memory maintains persistent state outside it. The two are connected through write, manage, and read loops [12,20,42].

Working memory holds the state needed for the next few decisions. It records goals, constraints, the current plan, tool results, intermediate drafts, and unfinished todos. This information is closely coupled to the context window because the next step often depends on it directly [12,43]. Too little working memory causes later steps to lose prior reasoning. Too much consumes attention budget and accelerates context decay. A safer pattern keeps only the state required for the current decision in the window, writes the rest to external storage, and retains pointers for recovery [12,42]. Working memory is therefore the boundary layer between state-and-memory components and context management.

Episodic and semantic memory become relevant when past information should survive beyond the current task state. Episodic memory records temporally ordered interactions, tool returns, user feedback, and task trajectories, allowing an agent to revisit prior decisions in later steps or sessions. Semantic memory stores relatively stable facts, domain knowledge, user preferences, and entity relations, so the system does not rediscover the same information on every task [11,12,44]. Together, they determine how evidence from the past is retrieved. Vector stores are suitable for semantic-similarity recall, while knowledge graphs better express entity relations, multi-hop reasoning, and provenance, though at higher write and maintenance cost [45,46]. The key is not storing more information, but retrieving information that is relevant, trustworthy, and not overly redundant.

Procedural memory is needed when the system should remember how to perform a class of tasks. It preserves reusable steps, skills, conventions, repair strategies, and task scripts. This helps similar tasks avoid starting from scratch, rather than storing the raw record of a particular interaction [11,47]. Procedural memory may be distilled from successful trajectories, failed-attempt reflection, or human-curated skill instructions, and later used as callable operational knowledge [45,47]. It is complementary to Skills. Skills are runtime capability packages loaded on demand, while procedural memory is an accumulated store of cross-session practice. Without quality filtering, procedural memory can also fossilize bad practices, so it must be updated jointly with verification outcomes and human feedback.

Memory management is the control layer that keeps stored experience usable. It decides what to write, how to organize it, when to retrieve it, and when to forget it. Writing must judge whether information is worth long-term retention instead of turning all logs into noise. Management must consolidate, archive, or forget repeated, stale, or conflicting information. Reading must recall the entries most relevant to the current goal, not reinsert the entire memory store into the context window [11,12]. The main risks are under-writing, over-writing, and poor retrieval. These respectively create memory blind spots, retrieval pollution, and context inflation [12,45]. Memory-system quality therefore depends not only on the storage substrate, but also on whether write, manage, and read policies fit the task risk and context budget.

File systems and version control are often the simplest memory substrate for engineering tasks. Many coding agents do not rely on specialized memory databases. They use files, directory structures, logs, and intermediate artifacts as carriers of cross-step state [48,49]. This approach is readable, debuggable, and naturally connected to the execution environment. Artifacts left by an agent in the file system become both context for the next round of work and evidence for verification and audit [48,49]. Its limits are equally clear: the file system does not natively provide semantic search, conflict resolution, or cross-task generalization, so retrieval, indexing, or tiered storage are still needed at scale [46,48].

State becomes useful only when it remains retrievable and trustworthy. Working memory preserves continuity within the current task; episodic and semantic memory provide past evidence and stable knowledge; procedural memory stores reusable practice; memory management controls write and read quality; and file systems plus version control provide auditable engineering substrates. The distinctive gap here is persistence: without state and memory, the agent repeatedly loses task progress, prior evidence, and accumulated practice. Richer memory can support longer-term continuity, but unconstrained writing, management, and retrieval can turn memory into noise, bias, and stale information. Memory is therefore a trust problem as much as a storage problem.

3.4. Workflow and Orchestration

Complex tasks rarely finish in one model call. They unfold across observation, action, branching, and recovery, so isolated model calls must be organized into a process that can keep moving. Workflow and orchestration components provide that process. They decide the next action, decompose tasks, assign models or agents, and organize recovery after failure. The corresponding components include reasoning-action loops, planning and task decomposition, deterministic workflows and hooks, model routing, and subagents and handoffs [50,51]. These components depend on context management and verification: context management determines what each step can see, and verification and guardrails determine whether each step can continue.

Reasoning-action loops begin with the need to make progress before the whole answer is known. ReAct interleaves reasoning traces and actions, allowing the model to update later actions after observing environment feedback instead of producing a complete answer in one shot [7]. Such loops record the relationship between thinking, acting, and observing, so the agent can adjust after tool returns, execution failures, or state changes. Their value lies in decomposing long tasks into short steps that can be observed, interrupted, and verified. As loops grow longer, error, context inflation, and cost accumulate. The loop must therefore be designed together with context compression, verification checks, and stopping conditions [7,50].

Planning and task decomposition become necessary when a goal is too large to execute directly. Planning turns an abstract objective into ordered subgoals, and task decomposition further breaks complex tasks into locally executable and locally verifiable subtasks [50,52]. Plan selection weighs tradeoffs among candidate paths, reflection updates plans after failure, and memory-augmented planning incorporates prior experience [50,52]. Long-horizon evaluations in 2026 move this problem from static reasoning puzzles into real tool environments: AgencyBench requires agents to process queries, deliverables, and scoring rules across many real scenarios, while AgentEscapeBench requires agents to infer, execute, and revise novel tool procedures under explicit dependency graphs [25,53]. The conclusion is not that more search is always better. Long-horizon tasks require exploration, dependency tracking, local verification, and termination conditions within the same control flow. Without these constraints, extra steps become token, latency, and error-accumulation costs.

Deterministic workflows and hooks are useful when parts of the process should not depend on free-form model judgment. A deterministic workflow uses a state machine, directed acyclic graph, or explicit step table to specify preconditions, successor actions, and failure branches, making the process easier to inspect before execution and debug afterward [54,55]. Hooks attach deterministic logic at model calls, tool calls, or state transitions to perform parameter checks, permission decisions, format validation, retries, or compensating actions [15,55]. Their role is not to replace model reasoning, but to provide rails. Stable, repetitive, or high-risk parts are handled by rules, while open and judgment-heavy parts remain with the model.

Model routing matters because not every step needs the same model. In one agent process, intent classification, formatting, deep reasoning, code generation, and long-context analysis may demand different model capabilities. Sending everything to the same model wastes cost or under-provisions capability. Difficulty-aware routing decides whether to upgrade to a stronger model according to request complexity, while capability-aware routing chooses an executor according to task type and model strengths [56,57]. The value of routing is to make model choice a runtime decision that trades off quality, cost, and latency. If routing targets specialized agents rather than models, it enters the subagent and handoff layer.

Subagents and handoffs appear when one context or one role cannot carry the whole task. Subagents can have independent goals, contexts, and tool sets for parallel retrieval, local exploration, role specialization, or high-risk branch isolation [31]. Multi-agent orchestration must usually solve task allocation, result aggregation, and state transfer among agents [31,51]. Centralized topologies use one orchestrator to assign and collect work; decentralized topologies let agents negotiate peer to peer; hierarchical topologies constrain responsibilities and output forms through superior-subordinate structures [51,58]. The key handoff tradeoff is context-transfer granularity. Passing raw history is more complete, but costly and contaminating; passing structured summaries is cheaper, but may lose critical evidence [31].

Workflow and orchestration give the agent a way to advance, stop, and recover. Reasoning-action loops provide the execution rhythm, planning and decomposition provide goal structure, deterministic workflows and hooks provide inspectable control boundaries, model routing schedules capability and cost, and subagents and handoffs provide context isolation and parallel work allocation. The distinctive gap is multi-step control: isolated model calls do not by themselves preserve progress, assign responsibility, or decide when recovery is needed. Orchestration must therefore work with execution feedback, pass criteria from verification and guardrails, and trace evidence from observability.

3.5. Execution Environments

Many agent tasks become meaningful only when an action runs or changes external state. The model can generate commands, code, or operational intent, but completion often requires code execution, file changes, web access, or modified system state. Execution environments translate these intents into observable environmental changes and return runtime results, logs, screenshots, and file diffs to later reasoning and verification. They answer where execution happens, how execution state is preserved, how real interfaces are accessed, and how runtime conditions are prepared. The relevant components include sandboxes, file-system abstractions, browser and computer-use interfaces, and preinstalled runtimes and toolchains. These components receive action contracts defined by tools and interface protocols and provide external feedback signals to verification and guardrails.

Sandboxes are needed because generated actions may have real side effects. Agent-generated code and commands are difficult to audit fully before execution, so a sandbox must manage both isolation strength and resource constraints. Isolation limits contact with the host system, credentials, and network resources. Resource constraints limit CPU, memory, I/O, network egress, and session lifetime [59,60]. Containers, user-space kernels, and microVMs represent different tradeoffs between isolation strength and startup cost. Firecracker-style microVMs provide stronger boundaries for high-risk, multi-tenant, or massively parallel execution [60,61]. The purpose of a sandbox is not to make execution risk-free. It is to confine model actions to boundaries that are reclaimable, auditable, and interruptible.

File-system abstractions matter once execution leaves artifacts behind. Agents can use the file system to save intermediate artifacts, logs, and generated files, preserving traceable state across execution [48,49]. This structure serves three needs at once. It gives context management an external carrier outside the window, gives state and memory a persistent workspace, and gives verification and guardrails comparable runtime artifacts. Recent work shows that coding agents can use terminal tools and directory navigation to organize large text corpora as operable file structures, outperforming semantic-retrieval baselines on some long-context processing tasks [49]. The file system is therefore not merely storage. It is an environmental interface shared by execution, memory, and verification.

Browser and computer-use interfaces are needed when the relevant system has no clean API. A browser lets an agent access web pages, read the DOM or accessibility tree, submit forms, and observe page feedback. Computer use further covers desktop GUIs, file managers, terminals, and other applications without stable APIs [62,63]. These environments expand the action space and introduce stronger grounding requirements: the system must map natural-language intent to screen elements, page state, and multi-step interaction sequences. WindowsWorld and Odysseys shift evaluation from single-site or single-application tasks to cross-application, cross-site, and long-horizon real workflows. WindowsWorld reports poor success on multi-application GUI tasks, while Odysseys shows that even the strongest systems reach only 44.5 percent perfect success on long-horizon web tasks [64,65]. The key issue is therefore not opening an interface. It is converting interface state into reliable next actions across multiple steps, interfaces, and state dependencies.

Preinstalled runtimes and toolchains shape what the agent can try immediately. Language runtimes, package managers, build tools, test frameworks, version-control tools, and common command-line utilities jointly determine whether an agent can complete the generate-run-test-repair loop [59,66]. Insufficient toolchain coverage forces agents to install dependencies on the fly, adding network latency, version conflict, and supply-chain risk. Excessive coverage increases image size, maintenance cost, and attack surface. Long-horizon practice often separates environment initialization from sustained execution. Initialization prepares dependencies, directory structure, and base tools; later execution makes incremental progress within that environment and leaves artifacts, logs, and diffs for the next round [66].

Execution gives model output external consequences. Sandboxes provide execution boundaries, file systems provide cross-step workspaces, browser and computer-use interfaces provide access to real interfaces, and preinstalled runtimes and toolchains provide closed-loop capability. The distinctive gap is grounded action: without an execution environment, tool calls remain abstract action contracts rather than observable state changes. Yet execution expands both the agent’s capability boundary and its side-effect boundary, so greater execution authority must be paired with permission gates, verification loops, and observability.

3.6. Verification and Guardrails

Agent trajectories need feedback about both quality and permission. Without such feedback, models can produce incorrect judgments with high confidence, and single-step errors can propagate through multi-step trajectories. Recent long-horizon evaluations show that real agent tasks cannot be judged only by final answers. WildClawBench checks artifacts, side effects in environment state, and semantic correctness; AgencyBench uses executable scoring scripts and rubrics to map deliverables to comparable scores [17,25]. If critical judgment remains entirely inside the model, the system struggles to distinguish completed work, partial work, plausible-looking errors, and cases where it should not have proceeded. Verification and guardrails move these judgments into the surrounding process. They check whether results are reliable, determine whether actions are allowed, and decide whether high-risk steps need human confirmation. The corresponding components include result verification, deterministic checks, action guardrails, layered defense, and human-in-the-loop control. Execution environments provide tests, logs, and tool returns; observability provides traces and evidence; verification and guardrails use them to pass, retry, block, or escalate.

Result verification starts from a recurring uncertainty: the agent’s answer may look complete before the work is correct. In long-horizon tasks, final-answer scores can hide whether the intermediate path truly completed work, omitted side effects, or only partially satisfied the request. Recent evaluations therefore split verification into executable scripts, environment-state audits, artifact checks, and necessary semantic judgment [17,25]. A more stable pattern shifts feedback toward the environment: code interpreters, search engines, rule checkers, test suites, file diffs, and execution logs provide verifiable signals. Result verification does not prove that the full task goal is complete. It separates a portion of formalizable error from model judgment and places error localization closer to the real execution result.

Deterministic checks apply when part of correctness can be expressed as a rule. Linting, type checking, contract assertions, test-suite hooks, structured-output validation, executable scoring scripts, and environment-state audits respectively check formatting, type constraints, interface contracts, runtime results, downstream consumability, and action side effects [17,25]. Their value is not to replace semantic judgment, but to provide low-cost, reproducible negative feedback to multi-step agents. The closer a check is to the real environment, the more specific and expensive it becomes. Engineering practice therefore needs verification step limits, early-exit conditions, and failure fallback strategies, so that an agent does not consume unbounded tokens, calls, and runtime while pursuing local pass rates.

Action guardrails apply before or during an action, where the question is authorization rather than correctness. Guardrails move safety and policy constraints out of prompts and into runtime checks around model calls and tool execution [67,68]. Input guardrails block prompt injection, sensitive data leakage, and out-of-domain requests. Output guardrails check hallucination, harmful content, factual consistency, and structured format. Tool-call gates further check tool allowlists, parameter ranges, permission boundaries, idempotence, and reversibility [67,68,69]. Code-based guardrails compile some policies into executable rules, bringing policy checks closer to deterministic infrastructure [68]. Their shared function is to restrict free model action to boundaries that are authorized, auditable, and recoverable.

Layered defense is needed because no single checkpoint sees the whole attack surface. Agent attack surfaces include inputs, model outputs, tool calls, external protocols, and execution environments; a single filter cannot cover them all [22,67]. Layered guardrails place checkpoints at different locations, so if one layer is bypassed, a later layer can still check the new context, action parameters, or execution result [67,68]. This structure does not remove risk, because rule quality, changing attack patterns, and expanding tool sets leave blind spots [22,68]. Its value is to reduce the probability of single-point failure and turn safety judgment from a one-time prompt constraint into a continuous execution constraint.

Human-in-the-loop control enters where responsibility cannot be delegated fully to automated checks. HITL is not merely a confirmation button. It preserves human judgment at irreversible, destructive, or compliance-sensitive steps [70]. Approval gates can pause a flow and wait for human approval, asynchronous supervision can send high-risk actions to a review queue, and tiered intervention can decide by risk score whether to continue automatically, request review, or escalate [70]. The effectiveness of human supervision depends on the context supplied by observability. If approvers see only isolated requests without the task goal, tool trajectory, and risk explanation, approval degenerates into a formal click-through [70,71]. HITL therefore needs trace records, evidence displays, and organizational rules to carry real responsibility boundaries.

Reliable agent behavior depends on judgments that can be checked outside the model. Result verification supplies correctness signals, deterministic checks provide reproducible local constraints, action guardrails constrain side effects, layered defense reduces single-point failure, and HITL preserves human responsibility for high-risk decisions. The distinctive gap is external judgment: the model may propose or justify an action, but the harness must decide whether the result is correct, authorized, reversible, or worth escalating. The more judgment is externalized, the easier the system becomes to correct, audit, and govern, but latency, cost, and engineering complexity also increase. A harness combines feedforward and feedback to steer the system toward a target state; verification and guardrails supply the negative-feedback function. Whether this feedback works depends on external signals from the execution environment and evidence structures from observability.

3.7. Observability

Failures are difficult to diagnose when only the final output is visible. If an agent’s intermediate decisions are not recorded, hallucinations, loops, tool misuse, and cost anomalies are difficult to trace to concrete steps. Observability turns execution into inspectable evidence: it assesses output quality, locates where failures occur, and measures whether resource consumption is acceptable [23,72,73]. It includes tracing, evaluation harnesses, and cost and latency instrumentation. These records support verification, guardrails, and human review.

Tracing gives the execution process a timeline. It records model calls, tool calls, retrieval results, environment feedback, subagent handoffs, and final output. The value of structured traces is not saving more logs. It is preserving temporal relationships, causal relationships, and evidence provenance, so failure attribution can move from the whole agent to concrete locations such as planning, tools, context transfer, or evidence use [72,73,74]. OpenTelemetry’s GenAI semantic conventions bring model calls, tool execution, and agent orchestration into a common tracing framework, and tools such as Langfuse, LangSmith, and Arize Phoenix provide engineering support around trace recording, retrieval, and visualization [73]. Tracing turns otherwise invisible intermediate processes into comparable, reviewable, and reproducible process evidence.

Evaluation harnesses add judgment to the recorded process. Local evaluation checks whether an individual step is acceptable: whether the chosen tool is reasonable, parameters satisfy interface constraints, output format is consumable downstream, and the step advances its subgoal [15,74,76]. Whole-trajectory evaluation checks whether a trajectory is coherent, economical, and evidence-grounded: whether tool-call order satisfies dependencies, whether steps are disconnected, and whether the final answer is based on traceable evidence [15,74]. These evaluations expose problems hidden by final success rate. An agent may reach a correct answer through a redundant, brittle, or irreproducible path. Another may fail, but leave enough evidence to localize the flaw to tool definitions, context transfer, or control-flow design [74,76].

Cost and latency instrumentation records whether the process is economically sustainable. The total cost of a multi-step agent is usually accumulated across several expensive steps. If the system records only the final output, it is difficult to determine whether cost came from long prompts, redundant tool calls, inefficient model routing, or repeated verification [72,73]. Stepwise records of input tokens, output tokens, cache hits, time to first token, full generation latency, and failed retries help researchers compare the real cost of context compression, model routing, and tool design [72,73]. Cost and latency are therefore not merely post-deployment operations metrics; they are design dimensions of the harness itself [14,39].

An observable trajectory is easier to compare, debug, and review. Tracing supplies process records, evaluation harnesses provide quality judgments, and cost and latency instrumentation supply resource constraints. The distinctive gap is process evidence: without traces and instrumentation, verification and guardrails can only judge local results, and framework comparisons remain stuck at final success rates. With sufficient evidence, harness problems can be localized to components, steps, and interfaces, providing more concrete inputs for later repair [74,76].

4. Harness Design Tradeoffs Under Multiple Objectives

In practice, harness design is constrained by several goals at the same time. Production agent systems must balance reliability, meaning correct output, stable behavior, and controllable error; cost and latency, meaning acceptable per-run expense and response time; autonomy and task coverage, meaning the range of tasks the agent can advance independently; safety and side-effect control, meaning predictable and reversible effects on the external environment; and long-horizon continuity, meaning preservation of state and experience across steps and sessions [14,77]. A survey of more than 1,300 practitioners reports that 32 percent list output quality as a top barrier to production deployment. Cost ranks below quality, but still shapes call frequency, budget allocation, and sustainable operation at scale [77]. These objectives constrain each other. Reliability depends on verification steps, which consume latency and cost. Autonomy expands the action space and makes safety control harder. Tool coverage supports long-horizon tasks, but each call adds cost.

These tensions cannot be removed by adding every component at maximum strength. Yang and Zhu’s 2026 mathematical analysis of sequential agentic workflows shows that, under simultaneous latency and cost caps, maximizing reliability is a constrained convex optimization problem. The three objectives have an irreducible Pareto frontier: relaxing latency constraints allows more reasoning tokens for reliability, while tightening cost budgets lowers the reliability ceiling [14]. Autonomy and safety are similarly in tension. The larger the action radius and the broader the permissions, the harder it is to statically audit generated actions before execution. Risks such as tool misuse, memory corruption, and cross-agent exploitation also increase [22]. Autonomy is not a passive by-product of model capability. It is an architectural parameter that must be set deliberately, and higher autonomy requires thicker safety support [22,60]. Tool coverage and cost efficiency also conflict. Tool-use decision work in 2026 decomposes whether to call a tool into necessity, utility, and affordability, noting that redundant calls can increase cost, latency, and noise [39]. Yet in long-horizon tasks that depend on tool coverage to complete task branches, excessive tool pruning causes task failure rather than a small quality decrease [14,49]. Harness design is therefore a tradeoff problem under multiple objectives. The priority of objectives determines which capability gap is amplified, and external support thickens where the amplified gap lies.

Reliability becomes difficult when small errors accumulate across a trajectory. Under this priority, the harness must strengthen verification and guardrails, observability, and deterministic workflows inside the orchestration layer. Reliability-oriented harnesses emphasize process constraints: they place verification points before and after critical steps, preserve fallback paths after failure, and record trace evidence for error localization [78,79]. Long-horizon benchmarks in 2026 have expanded reliability evaluation from final pass rate to combinations of executable scripts, rubrics, environment-state audits, and semantic checks, showing that correctness judgment must be decomposed into externally verifiable signals [17,25]. Structured tracing records execution as evidence with causal relationships, allowing failure attribution to descend to concrete steps and tool calls rather than remaining at the level of overall success rate [72,74]. Deterministic control flow moves repetitive, stable, or high-risk steps out of model judgment and implements inspectable execution constraints through rules and hooks [54,55].

Cost and latency pressure begins with repeated small expenses. Context grows, tool calls accumulate, and model calls recur across the trajectory. Under this priority, the harness relies more heavily on context management, tool pruning inside tools and interfaces, and model routing inside workflow and orchestration. Context compression is not only a way to reduce input tokens each round. It dynamically decides which information to retain, compress, or externalize under the remaining context budget, while prioritizing structured state and recoverable pointers [27,28]. Tool exposure and call timing must be filtered by necessity, utility, and affordability to reduce the context, latency, and noise cost of redundant tool definitions and ineffective calls [39]. Model routing sends easier subtasks to lighter models, reducing the proportion of expensive model calls while preserving quality [56,75]. Under a fixed reliability target, allocating token budget according to each step’s marginal reliability gain, using a water-filling strategy, can outperform uniform allocation [14]. However, cost optimization has boundaries. Once compression or pruning removes information or actions required for completion, the cost appears as task failure rather than a small score decrease [14,49].

Autonomy is limited when the agent cannot reach the actions or environments required by the task. Under this priority, support shifts toward tools and interfaces, execution environments, and workflow and orchestration. Tool interfaces and ACI design expand the reachable action space; rewritten tool descriptions can reduce accuracy degradation by 29.23 percent when more than 150 candidate tools are available [35], and terminal-interface research emphasizes that interfaces must support both agent operation and human supervision [36]. Execution environments provide code execution, file operations, and browser access, turning model intent into observable environmental change [59,60]. Workflow and orchestration organize single-step progress into task decomposition, subagent delegation, error recovery, and termination judgment [50,51]. General AgentBench and WildClawBench show that cross-skill, cross-tool, real-runtime tasks significantly amplify degradation relative to specialized short-horizon evaluations, and that a single model can shift substantially across harnesses [17,80]. The bottleneck of autonomy expansion is not the number of components alone. It is whether action interfaces are clear, execution feedback is timely, and runtime boundaries are controlled. Autonomy expansion still has hard constraints: a broader action radius makes side effects harder to enumerate in advance, so the same objective also requires permission management and action-boundary constraints to keep exploration recoverable [22,60].

Safety pressure rises as actions become more consequential. Under this priority, support emphasizes permission gates, layered defense, and HITL within verification and guardrails, together with observability for the trace context needed in human approval. As agent autonomy increases, the attack surface expands from single prompts to tool-call chains, cross-session memory, and multi-agent interaction. Traditional chatbot guardrails do not cover self-prompting across turns, cross-tool linking, or persistent memory risks [22]. Permission gates and tool allowlists constrain reachable operations to a least-privilege set [69]. Layered guardrails distribute checkpoints through execution so that if one layer is bypassed, later layers can still block, reducing single-point failure [67,68]. Human approval gates preserve human judgment at irreversible or compliance-sensitive steps, but their effectiveness depends on context from observability. If approvers see only isolated requests without task trajectory and risk explanation, approval becomes formal confirmation [70,71]. Stronger safety support increases verification latency and constrains throughput, so safety configuration must be calibrated to concrete risk levels and compliance requirements rather than uniformly stacked on every step.

Long-horizon work fails when earlier state no longer guides later decisions. Under this priority, the harness must strengthen state and memory, together with recoverable structures inside context management. The bottleneck is not storage capacity, but whether causal relationships are preserved. Existing memory systems mainly fail by losing causal and objective information and by accumulating errors introduced by lossy similarity-based retrieval [81]. Large-scale trajectory analysis of cross-domain long-horizon tasks also shows that planning- and memory-related failures dominate, with consistent degradation patterns across strong model families [13]. Persistent memory systems store cross-session episodic information, semantic knowledge, and operational experience as retrievable external state, so agents need not rediscover the same information in every task [11,12,45]. File systems provide persistent workspaces; cross-step artifacts become both context for the next round and evidence for verification and audit [48,49]. Recoverable session structures divide long tasks into incremental phases with explicit artifacts and state, allowing interrupted sessions to resume from checkpoints [66,82].

Different objectives therefore thicken different parts of the harness. Reliability strengthens verification, guardrails, and observability. Cost strengthens context management, tool pruning, and model routing. Autonomy strengthens tool interfaces, execution environments, and orchestration. Safety strengthens permission gates, layered defense, and human approval. Long-horizon continuity strengthens state, memory, and recoverable workspaces. Objectives can be combined, but their support requirements often constrain one another: thicker safety support adds latency, broader autonomy expands side-effect boundaries, and lower cost can weaken tool coverage. This structural constraint cannot be eliminated by engineering; it is an intrinsic property of harness design [14]. Harness tradeoffs are therefore configuration decisions. Under a given task risk level, resource constraint, and capability boundary, designers decide which gaps external support should carry and which judgments remain with the model. Evaluation evidence can then guide later reconfiguration as objective priorities and model capabilities change.

5. Representative Harness Implementations

Representative systems show how the same observe-act-feedback loop can be configured in substantially different ways. In current agent research and engineering practice, the model often generates a next action from current context. The action changes external state through a tool or execution environment, and the new observation enters the next decision round until the task completes or a stop condition fires [7,54,59]. Variation appears most clearly along task horizon, execution boundary, permission radius, state persistence, and governance burden. Repository-level repair scenarios represented by SWE-agent have been further refined by late-2025 and 2026 evaluations of long-horizon software evolution and terminal work. Once similar agents move from isolated issue repair to cross-file evolution, real command-line operation, and multi-step test loops, the bottleneck shifts. Success no longer depends only on whether the final patch passes; it also depends on interfaces, environments, test feedback, and state preservation [83,84]. Codex places tasks into independent cloud sandboxes, completing execution, feedback, and updates inside isolation boundaries [85]. Source analysis of Claude Code shows that its core loop calls the model, runs tools, and repeats. Much of the code is concentrated in surrounding permission, compaction, and extension mechanisms [82]. Hermes Agent uses a resident main agent that operates over time. It accepts tasks, accumulates experience, and distills reusable skills through continuing perception-action-feedback [86]. These systems share the same abstract control pattern, but their concrete designs reflect the priority requirements of their task settings.

Repository repair gives the agent a concrete success condition. In SWE-agent-style settings, each run has a goal that can be verified by a test suite, so the primary engineering objective is task completion rate and reliable execution closure. This objective presses the gap into the interface layer and the feedback layer. Success depends directly on whether the agent can locate relevant code, apply an appropriate modification, run tests, and judge success. SWE-agent-like support therefore concentrates on tools and interfaces and verification and guardrails. Newer coding-agent evaluations show that single-issue repair does not cover long-horizon software evolution. SWE-EVO requires agents to make multi-step changes across an average of 21 files under high-level requirements, and the strongest models solve only about a quarter of instances, far below their performance on SWE-bench Verified [83]. Terminal-Bench moves evaluation into realistic command-line tasks, emphasizing terminal-native abilities such as file manipulation, system administration, debugging, and research-code reproduction [84]. The key support in single-issue repair is therefore compact interfaces, immediate test feedback, and executable verification, rather than complex long-term memory. Once the task shifts toward cross-version evolution or terminal-native long-horizon work, state management, environment preparation, and workflow support must thicken as well.

Cloud execution changes the design problem. In Codex-style settings, users submit tasks as requests, and multiple tasks proceed concurrently in isolated environments. The primary engineering objectives are isolation safety and cost control. This setting presses the gap into execution boundaries and context: model-generated code must run without affecting other tasks or production systems, while the token cost of each task must remain controlled. Codex therefore concentrates support in execution environments and context management. Each task runs in an independent environment preloaded with the target repository, making isolation a technical property rather than only a policy document [85]. A second issue is repository organization under finite context budget. Concise repository instructions in AGENTS.md reduce exploratory context cost by giving the agent stable local guidance before it spends tokens rediscovering project conventions [16]. Codex is therefore not designed around long-term memory accumulation. It is designed to let each request complete inside a clear, reclaimable execution boundary.

Local terminal work exposes the agent to a longer and less isolated development setting. In Claude Code-style use, a user may work on the same codebase within one or multiple sessions, so the primary engineering objectives are context continuity and controllable permissions. This setting presses the gap at both ends. Long sessions accumulate history, tool returns, and intermediate drafts, requiring active compression and persistence to preserve continuity. Local workspace actions directly affect the developer’s environment, making side effects less recoverable than in sandboxed settings and requiring earlier permission judgment. Claude Code therefore concentrates support in context management and verification and guardrails. Its five-layer compaction pipeline, append-oriented session storage, and subagent summaries address long-session context growth. Seven permission modes, an ML classifier, and hooks attach deterministic checks around tool calls. Approximately 98.4 percent of the code lives in infrastructure around the main loop [82]. Compared with SWE-agent, Claude Code also carries heavier workflow and orchestration support. MCP, plugins, Skills, and subagent delegation provide more flexible tool extension and capability loading because long-session tasks require the agent to introduce domain-specific tools and capabilities on demand [82].

Resident agents raise a different problem: the system keeps operating after any single task ends. In Hermes Agent-style settings, the primary engineering objectives are cross-session persistence and skill reuse. This setting presses the gap into memory: cross-session state must be retrievable after session reset, and recurring task patterns must be recognized and solidified into reusable skills. Otherwise, the agent cannot become more efficient as it runs longer. Hermes therefore concentrates support in state and memory and workflow and orchestration. Persistent memory uses structured files for environment facts, learned conventions, and user preferences, injecting them into the system prompt at session start so cross-session baseline state does not disappear with resets [86]. A self-improving skill loop extracts reusable patterns after tasks and writes them into structured skill files, allowing similar tasks to load them later rather than reason from scratch [86]. Short-lived subagents handle local execution and isolated exploration, while the main agent coordinates, accumulates, and evolves across longer time scales [86]. Compared with the other three systems, Hermes also has a stronger need for observability. Continuous operation requires tracking which skills were created and remain effective, which triggered errors, and which should be retired. Observability supplies the evidence base for skill evolution.

The comparison shows how task setting changes component emphasis. Repository-level repair presses the gap into interfaces and feedback, concentrating support in ACI and verification. Cloud parallelism presses the gap into execution isolation and cost, concentrating support in sandboxes and context management. Long-session development presses the gap into window continuity and permissions, concentrating support in compaction and guardrails. Resident autonomy presses the gap into cross-session memory and experience reuse, concentrating support in persistent state and skill evolution. The longer the task duration and the broader the execution permission, the more external support must shift from compact interfaces and verification toward recoverable, reusable memory and evolutionary structure. Where a gap is not prominent, support can remain lightweight to avoid unnecessary complexity [54,59].

6. Harness Evaluation: Diagnosing Model Capability Gaps and Assessing Harness Compensation

The core question in evaluating a harness is whether it can compensate for the capability gaps of a base model. We organize harness evaluation into two complementary dimensions. The first is model capability-gap diagnosis, which assesses the model’s native capability limits and identifies where the base model is deficient on its own. The second is harness compensation assessment, which examines whether the harness can effectively fill the corresponding gaps and improve overall agent performance.

6.1. Model Capability-Gap Diagnosis: Assessing Native Capability Limits

Before assessing compensation, we first need to characterize the native capability limits of the base model when it operates without harness support. This section reviews benchmark families that expose gaps across the seven core harness components.

6.1.1. Context and Attention Gaps

Recent long-context evaluations reveal that large models remain limited on long-horizon and long-document tasks. LongBench Pro builds a bilingual benchmark from natural long-form English and Chinese documents with 11 primary tasks and 25 secondary tasks [87]. Its design varies document length, language setting, dependency range, and reasoning difficulty. The results reinforce an important distinction: the ability to accept long inputs is not equivalent to stable long-document understanding. Task length, dependency span, and reasoning complexity all materially affect performance.

AgentLongBench embeds long-context evaluation into simulated agent-environment rollouts, stressing information retention, state tracking, and dynamic information synthesis over extended trajectories [88]. The evidence suggests that a model may complete isolated long-text retrieval tasks, yet still struggle to maintain a stable executable state across sustained interaction. A separate orchestration benchmark for 100k-token workflows further shows that frontier models remain prone to next-step decision errors and context-management failures under extreme context pressure [89]. Together, these studies indicate that closed-loop capabilities for state selection, key-content retention, context management, and execution recovery remain underdeveloped.

6.1.2. Tool-Use and Interface Gaps

Tool-use benchmarks consistently expose weaknesses in real user settings, cross-tool coordination, standardized interfaces, and long-horizon multi-tool workflows. WildToolBench evaluates compositional tool-call tasks derived from real user interactions, including implicit intent and dynamic instruction shifts [90]. Its results show that strong performance on standardized single-step tool calls does not generalize to realistic tool-use conditions. As user interaction becomes more complex, tool-selection errors, intent-tracking drift, and multi-tool failures increase.

Protocol-oriented benchmarks make the same limitation visible at larger scale. MCP-Bench evaluates agents against 28 real MCP servers and 250 tool categories, emphasizing cross-tool coordination, precise parameterization, input-output adaptation, and multi-step chain reasoning [91]. The results suggest that standardized protocol interfaces improve basic tool availability, but models still struggle to compose workflows across servers and satisfy compounded parameter constraints. Toolathlon extends this picture across 32 software applications and more than 600 tool functions, showing that realistic multi-application workflows remain difficult as execution depth and tool diversity increase [92]. MCP-Atlas further broadens the evaluation with 36 real MCP servers and 220 tools, where tool discovery, multi-server workflow composition, parameterization, and error recovery remain frequent failure points [93]. Overall, single-step calls are often more tractable, whereas real tool use still exposes substantial deficits in interface adaptation and cross-tool orchestration.

6.1.3. State and Memory Gaps

Memory evaluations center on incremental writing, multi-session retention, conflict resolution, continual learning during service time, and memory-enabled decision making. MemoryAgentBench reformulates static long-context data into incremental multi-turn interaction and tests accurate retrieval, test-time learning, long-range understanding, and conflict resolution [94]. The evidence indicates that models can handle basic pointwise retrieval, but memory stability degrades sharply under more realistic conditions. Performance falls when information arrives continuously, content updates dynamically, or contradictory facts must be merged.

STATE-Bench shifts the emphasis from retrieval accuracy to whether memory mechanisms improve real task execution, including policy validation, user-state calls, tool coordination, and repeated-error avoidance [95]. MemoryArena evaluates interdependent multi-session tasks and links memory write, retention, reading, and live decision making in a single loop [96]. Other recent benchmarks extend the same pattern. MemoryBench studies continual learning from accumulated user feedback, BEAM examines long-term memory over 100K- to 10M-token conversations, and MemGUI-Bench stresses cross-temporal and cross-spatial memory retention in mobile GUI agents [97,98,99]. The common conclusion is that current models still lack robust external state management, especially for write, update, conflict resolution, and memory-driven downstream decision making.

6.1.4. Workflow and Orchestration Gaps

Long-horizon orchestration benchmarks test whether models can maintain a global plan, actively acquire information, respect constraints, track progress, and suppress error accumulation. DeepPlanning evaluates agentic planning under verifiable constraints, including multi-day travel planning and multi-product shopping, where proactive information acquisition and global constrained optimization are both required [100]. Odysseys demonstrates that binary pass/fail scoring is insufficient for long-horizon web tasks because it hides partial completion and intermediate execution failures [65]. UltraHorizon further shows that trajectories spanning hundreds of tool calls and more than 200k tokens still expose premature termination, early strategy lock-in, and sustained degradation [101].

These results suggest that orchestration failures cluster around global plan maintenance, progress tracking, and error recovery under delayed feedback and long execution chains. In such settings, models remain unable to sustain coherent control flow over the full task horizon.

6.1.5. Execution Environment Gaps

Execution-environment benchmarks assess whether text-based decisions can be converted into precise actions in dynamic, verifiable settings. Gaia2 evaluates agents in asynchronous environments that evolve independently of the agent’s actions, paired with explicit action verification [102]. FeatureBench extends code-agent evaluation to complex feature development in a constructed and verifiable environment [103]. DynamicGUIBench studies dynamic GUI settings in which substantial interface changes occur between actions, stressing visual grounding and action selection under partial observability [104]. Terminal-Bench 2.0 shows that realistic command-line automation remains far from reliable deployment standards [84]. Work on deep research and browsing agents pushes the same evaluation logic toward professional, multi-stage workflows with explicit evidence and verification requirements [105,106,107].

The common conclusion is that models remain unreliable in dynamic, high-interference execution environments. They struggle to verify outcomes continuously and to adapt to environmental state changes, leaving a clear gap in grounded execution.

6.1.6. Verification and Guardrail Gaps

Safety and guardrail evaluations focus on long-horizon attacks, cross-turn intent drift, prompt injection, visual injection, sandbox escape, and other irreversible risks. AgentDyn highlights that harmful intent can be introduced through third-party data and environmental content, not only through explicit user prompts [108]. AgentLAB shows that defenses designed for single-turn interaction do not reliably mitigate long-horizon attacks such as intent hijacking, tool-chain attacks, task injection, and memory poisoning [109]. DeepContext further demonstrates that incremental intent drift across turns can bypass static guardrails [110].

VPI-Bench evaluates visual prompt injection against computer-use agents, showing that embedded malicious instructions in the interface can meaningfully steer behavior [111]. GhostEI-Bench and AgentHazard extend the security surface to mobile and on-device environments, where misleading UI content and third-party interaction can degrade or misdirect agent behavior [112,113]. Other benchmarks expand the failure map across execution isolation, mid-trajectory safety, privacy leakage, temporal evasion, and sandbox escape [114,115,116,117,118,119]. Overall, current models still lack autonomous correction and dynamic risk control in long-horizon, action-capable settings.

6.1.7. Observability Gaps

Observability evaluation asks whether an agent can produce complete, traceable, and attributable evidence after a failure. TraceElephant shows that full traces can improve attribution accuracy by up to 76 percent relative to partial observations [120]. MP-Bench argues that failures in multi-agent systems often admit multiple plausible attributions because of complex inter-agent dependencies and ambiguous trajectories [121]. MAST provides a large failure taxonomy over multi-agent traces and demonstrates that raw logs alone are not enough for diagnosis [122]. Together, these results show why performance scores alone do not explain why a task failed.

These findings support a broader conclusion: base models and native agent systems still do not generate sufficiently complete process evidence for debugging, root-cause analysis, and regression testing.

6.2. Harness Compensation Assessment: Effectiveness and Net Benefit

Once the native gaps are identified, the next question is whether external harness components compensate for them. This requires two tests: whether compensation is effective, and whether the resulting gain justifies the added system cost.

6.2.1. Effectiveness of Harness Compensation

The evidence suggests that external components can convert some unstable model capabilities into deployable system behavior, but the improvement is usually local rather than complete.

State and memory support directly targets the continuity failures exposed by long-horizon and multi-session evaluations. AMA-Bench shows that memory-system design itself becomes a bottleneck in long-horizon agentic applications, rather than a solved add-on to base-model capability [81]. LifeBench finds that even top systems achieve only modest accuracy on long-horizon multi-source memory tasks [123]. M3Exam, H2HMem, RealMem, and MemoryArena further show degradation when models must retain, reconstruct, or reuse information across sessions, parties, and modalities [96,124,125,126]. Explicit memory can therefore expand continuity mechanisms, but it does not automatically produce reliable project-level persistence.

Tool and interface support expands the agent’s reachable action space, but the evaluation evidence shows that reachability does not guarantee reliable composition. MCP-Atlas broadens the evaluated action space through real MCP servers, but tool selection, parameter configuration, and task understanding remain the dominant failure sources [93]. FinMCP-Bench shows that single-tool tasks can be handled reasonably well, while multi-tool and multi-turn settings still degrade noticeably [127]. MCP-Persona highlights underexploration of tool functions in personalized applications [128]. GTA-2, Agent-Diff, and the Data Agent Benchmark similarly indicate that richer tool interfaces make more tasks operationally reachable, but do not remove planning errors, state mismatches, or exploration pathologies [129,130,131].

Execution environments and orchestration make longer workflows operational, but they also reveal where state maintenance, protocol compliance, and multi-stage verification remain fragile. General AgentBench shows that performance can drop substantially once systems move from domain-specific settings to general, heterogeneous environments, while General Agent Evaluation complicates this picture by showing that general agents can match strongly customized domain agents on several benchmarks [80,132]. Long-horizon benchmarks then expose where degradation occurs when tasks require sustained interaction. Web, GUI, terminal, software-engineering, data-science, and research-task evaluations all point to recurring bottlenecks in state maintenance, protocol compliance, and multi-stage verification [17,65,84,102,104,133,134,135,136,137,138]. The harness can move the failure point, but not eliminate it.

Verification, guardrail, and observability mechanisms shift part of the reliability problem from model judgment to external evidence and control. Long-horizon safety benchmarks show that defenses and isolation are necessary in agent settings with tools, memory, and execution side effects [108,109,114,115,116,117,118,119]. At the same time, these mechanisms introduce new surfaces, monitoring burdens, and failure modes. TraceElephant and MP-Bench confirm that richer traces improve debugging and failure attribution, but this is fundamentally a diagnostic benefit rather than a guarantee of fewer failures [120,121].

Overall, external harness components provide measurable support. They usually transform model gaps into system-level management problems, rather than eliminating the gaps entirely.

6.2.2. Net Benefit of Harness Compensation

Whether compensation is justified is a stricter question. A harness is warranted only if its gains in reliability, safety, observability, autonomy, or long-horizon continuity outweigh the added complexity, latency, token cost, and operational burden.

At the level of overall architecture, SkillsBench shows that skill augmentation is not a guaranteed gain; its benefit depends strongly on task and environment structure [139]. General AgentBench and General Agent Evaluation likewise indicate that the apparent advantage of a complex harness often shrinks when the evaluation moves into heterogeneous settings [80,132]. In tool-heavy enterprise tasks, Terminal Agents Suffice for Enterprise Automation suggests that direct terminal/API interaction can outperform heavier tool-augmented stacks in both effectiveness and efficiency [140].

The same tradeoff appears in research-oriented workflows. DeepResearch Bench II, MMDeepResearch-Bench, IDRBench, DeepResearch-9K, and BrowseComp-V3 evaluate agents under deeper evidence chains, multimodal grounding, interaction, and verifiable browsing requirements [107,141,142,143,144]. These settings make quality assessment more realistic, but they also increase turns, tokens, and verification cost. Efficient Benchmarking of AI Agents underscores a related point: evaluating advanced agents is itself expensive, so cost must be part of the assessment [18]. This leads to a stricter criterion. A harness is justified only when its additional reliability, safety, observability, autonomy, or continuity benefits consistently outweigh the overhead it introduces.

6.3. Summary

Model capability-gap diagnosis and harness compensation assessment are two tightly linked stages in harness evaluation. Model capability-gap diagnosis decomposes the model’s limitations into specific, measurable deficits and gives harness design a concrete target. Harness compensation assessment extends the unit of analysis from the model to the full agent system, testing not only whether external components fill the gap, but also whether the resulting system remains economical and operationally viable.

Together, these two lines of evidence point to the same direction: agent evaluation must move beyond single-number task success and toward a system-level view that balances intrinsic capability, external compensation, deployment cost, and operational risk.

7. Harness Evolution

Although a harness systematically compensates for gaps in a base model’s native capabilities, the boundary between the two is not fixed. It is continuously reconstructed as the system evolves. On one side, execution traces, feedback, and historical experience generated by the harness become structured training signals. Through these signals, some externally supported capabilities move closer to model-side behavior. On the other side, improvements in native model capability narrow some of the original gaps. This can make parts of the compensation layer redundant and push the harness to remove obsolete functions, reorganize its architecture, and move toward higher-level system governance. These two forces form a closed loop of coevolution. This section analyzes that pattern in three steps: first, harness-driven model improvement through runtime signals; second, model-driven harness optimization; and third, the evolving model–harness boundary. Figure 3 illustrates this bidirectional coevolution loop and the changing boundary it produces.

7.1. Harness-Driven Model Improvement

The runtime process supported by a harness produces signals through which external support can become model-side capability. During deployment, the harness mediates tool use, environmental perception, state interaction, and error correction. Once these processes are recorded, validated, filtered, and abstracted, they become learning signals for post-training and test-time adaptation. Four signal types are especially important. Demonstration signals provide reproducible execution trajectories and answer how an agent should act effectively. Evaluation signals quantify task quality and define successful outcomes. Interaction signals create dynamic environments for trial and adaptation. Experience signals preserve historical behavior and distill reusable execution patterns.

Demonstration signals come from standardized recording, synthetic construction, and selective filtering of agent trajectories. A harness can integrate target parsing, reasoning, tool selection, and environmental feedback across multiple tool calls into structured reasoning-action, tool-use, or multi-turn interaction traces. These traces support supervised fine-tuning and imitation learning for standardized behavior. Existing work offers several trajectory-construction strategies. ASTRA synthesizes structurally grounded trajectories from tool-call topologies for supervised fine-tuning and reinforcement learning [145]. EigenData generates tool-grounded dialogue data with hierarchical multi-agent data engines while dynamically optimizing prompts and workflows [146]. Tool-R1 converts multi-step tool use into Python-code trajectories with variable dependencies and reuses high-quality samples through a dynamic sample queue [147]. DynaWeb combines expert trajectories, imagined rollouts, and real online interaction, thereby broadening the training-sample sources for web agents [148].

The value of a demonstration signal depends mainly on the structure and completeness of the trajectory. A trace that contains only the final result supports only outcome-expression learning. A complete trace that covers intermediate actions, real-time feedback, and error repair can teach the model process control for long-horizon tasks. ASTRA, EigenData, and Tool-R1 all indicate that clear boundaries among actions, states, and feedback help turn external execution behavior into standardized model action patterns [145,146,147].

Evaluation signals consist of verifiers, detection modules, unit tests, reward models, and environmental feedback. They define success criteria, compliant processes, and valid invocation rules, thereby giving model optimization a discriminative target. In practice, EigenData attaches automated checkers to individual samples and converts tool-use tasks into verifier-driven reinforcement learning problems [146]. ASTRA builds code-executable and rule-verifiable simulation environments that constrain training by both completion and efficiency [145]. Agent-RLVR uses unit tests to verify software-engineering trajectories and adds guidance mechanisms for failed retries [149]. AgentGym-RL supports multi-turn reinforcement learning in diverse realistic environments, allowing models to optimize directly from environmental rewards [150].

The design of the evaluation system directly determines the target and convergence direction of model optimization. Different tasks require different validation logic. Code tasks depend on test cases and repository state; tool-use tasks emphasize API accuracy and business effects; web-interaction tasks can be evaluated through page state and trajectory completeness. To stabilize evaluation, Agent World Model, GTM, and LiteResearcher construct simulation environments for code-driven, tool-simulation, and lightweight retrieval settings [151,152,153]. Long-horizon reinforcement-learning work further integrates reward shaping, data composition, algorithm choice, and environmental stability into the evaluation design space [154]. Only an evaluation system with clear boundaries and stable feedback can prevent model capability from drifting away from real task requirements.

Interaction signals are produced by standardized environments that are explorable, fault tolerant, and reproducible. Long-horizon agent tasks involve sequential decisions and partial observation: the model must act under incomplete environmental knowledge, receive feedback, and update its policy. AgentGym-RL decouples the environment, agent, and training module, and uses staged training to balance exploration and exploitation, allowing the model to learn short-interaction behavior before adapting to long-horizon tasks [150]. Agent World Model constructs thousands of SQL-database-backed standardized tool-interaction environments, giving training settings with stable state transitions and high-quality observations [151]. Other work models interactive tasks as partially observable Markov decision processes and trains agents directly in target environments [155], while long-horizon tool-use studies show that environmental stability is a central variable in reinforcement-learning performance [154].

Interaction signals are valuable because they allow the model to discover strategies through autonomous trial and error, compensating for the limited coverage of demonstration data. DynaWeb’s mixed training over expert traces, imagined rollouts, and online interaction shows that web-agent capability cannot be improved by static samples alone [148]. AgentGym-RL’s staged training mitigates long-horizon instability [150]. Agent-RLVR’s teacher-like guidance gives failed trajectories directions for improvement and error attribution, reducing exploration difficulty in complex settings [149]. Interaction signals therefore let the model move beyond reproducing existing trajectories and complete an exploration, failure, and correction loop inside a controlled environment.

Experience signals arise from the retention, reflection, compression, and reuse of historical execution trajectories. Trajectories become transferable capabilities only after they are stored, organized, and processed. Existing work summarizes this mechanism from several angles. Agent memory forms a write-manage-read loop that couples perception and action and turns a stateless generator into an adaptive agent [12]. Memory evolution can be divided into storage, reflection, and experience, showing a progression from trace retention to experience abstraction [156]. Graph-based memory represents entity relations, hierarchical semantics, and temporal dependencies for multi-turn dialogue, complex planning, and neurosymbolic reasoning [46]. Experiential reflective learning distills trajectories and feedback into general heuristics that help models reuse experience and avoid repeated errors [157]. In this signal system, the harness manages retention, filtering, association, deletion, and reuse.

Experience signals differ from demonstration signals in their level of abstraction. Demonstration signals focus on the concrete action sequence in a single task, whereas experience signals distill reusable patterns across task families. SIRI integrates scattered experience into discoverable, verifiable, and internalizable skill units [158]. MUSE-Autoskill optimizes the full life cycle of skills and enables modular management of external experience [159]. A systematization of knowledge on agentic skills divides the skill life cycle into discovery, practice, distillation, storage, composition, evaluation, and update, treating skills as long-term experience carriers and harness components [160]. Agentic context engineering treats context as an evolving playbook that supports retrieval, compression, isolation, and reuse of experience [161]. Complementary work identifies additional routes. Test-time self-improvement can update the model from newly generated task-adjacent data, while just-in-time reinforcement learning can use experience at runtime without gradient updates [162,163]. Experience signals make the harness a buffer layer for capability internalization: experience is first stored externally as memory or skills, then progressively feeds model improvement through retrieval augmentation, distillation, test-time adaptation, and related mechanisms.

Together, demonstration, evaluation, interaction, and experience signals constitute the main channels through which a harness transfers external capability toward the model. They are usually combined in supervised fine-tuning, reinforcement learning, retrieval augmentation, and continual learning. The value of the harness is therefore not limited to improving deployment-time performance. It also reveals a pathway for selecting, retaining, abstracting, and retraining from runtime experience. External support that is fixed, repetitive, and structurally stable is easiest to internalize. Open-ended, state-dependent, safety-critical long-horizon tasks still require the harness to provide environmental support, compliance verification, memory management, and risk governance.

7.2. Model-Driven Harness Optimization

The second half of the model–harness coevolution loop runs from model improvement back to harness reconstruction. As model capabilities change, the harness must adjust the assumptions it makes about model behavior, the interfaces through which the model acts, the components that remain necessary, and the governance mechanisms that constrain adaptation. In this sense, model-driven harness optimization refers to harness redesign under changing model capability boundaries. Drawing together recent work on harness adaptation, configuration search, meta-harness optimization, and retrospective self-improvement, we analyze this process at four levels: the interface layer, component layer, workflow layer, and closed-loop layer [19,164,165,166].

Interface-layer optimization focuses on the interaction paradigm between the model and external systems. It is a central mechanism for changing agent performance when model weights are fixed. Runtime harness adaptation shows that changing the interface alone can substantially affect behavior [167]. Human-oriented API documentation often does not fit automated model use, so tool interfaces must be rewritten around model cognition [35]. Meta-Harness shows that historical traces and scores can be used to optimize multiple harness-level choices, including prompts and other interaction specifications [165]. Context engineering extends this view by treating write, selection, compression, and isolation as design variables and by defining quality criteria such as relevance, sufficiency, isolation, economy, and provenance [20,27,168]. Additional engineering discussions frame the harness layer itself as a reportable and optimizable scientific object [169,170].

Component-layer optimization decomposes the harness into modules with independent functions and clear responsibilities. Agentic harness engineering treats observable harness artifacts as adaptation surfaces, including prompts, tools, middleware, skills, and configuration-like choices [19]. HARBOR formulates configuration optimization as constrained noisy Bayesian optimization over a mixed-variable space, jointly considering cost and benefit [164]. ARTEMIS uses semantically aware genetic operators for joint optimization of multidimensional agent configurations [171]. Surveys and systematizations further clarify that modern agent systems include reusable skills, memory-management mechanisms, and other iterative components that can be governed or optimized at the harness level [160,172,173]. Effective component optimization depends on observability and rollback boundaries, so that local changes can be evaluated, attributed, and reused independently.

Workflow-layer optimization focuses on the coordination logic among components. Workflow optimization surveys abstract agent workflows as agent computation graphs and distinguish static templates, pre-generated architectures, and dynamically revised runtime graphs [174]. AFlow searches for effective workflows through code modification and execution feedback [175]. Meta-tool optimization mines recurring tool-call sequences from historical traces and turns them into meta-tools [176]. JudgeFlow uses block-level judging to assign responsibility scores to workflow modules and localize failures [177]. MASPO further optimizes multi-agent workflows from system-wide prompts [178].

As tasks become more complex, fixed workflows become insufficient and the workflow architecture itself becomes a primary optimization object. Recent work extends this level toward modular coordination and system-level search. AgentFlow coordinates planner, executor, verifier, and generator modules through evolving memory and optimizes the planner inside the multi-turn loop [179]. MASS searches over block-level prompts, workflow topology, and workflow-level prompts in stages, reducing the cost of multi-agent design search [180]. MAPRO formulates multi-agent prompt optimization as maximum a posteriori inference and solves it through language-guided belief propagation [181]. Workflow optimization surveys provide a unified taxonomy by distinguishing methods that fix a reusable scaffold before deployment from those that select, generate, or revise workflows before or during execution [174].

Closed-loop optimization upgrades the harness from a single-run support mechanism into a continuously self-improving system. Meta-Harness allows agents to read historical configurations, scores, and traces and propose harness updates [165]. Agentic harness engineering uses state observations to make evolution detectable, attributable, and reversible [19]. Retrospective harness optimization uses only past trajectories, self-preference, and consistency checks to update the system without human labels [166]. Additional studies repair harness flaws from failed trajectories [182], while continual-harness frameworks emphasize diagnosis, modification, and evaluation during online adaptation [183,184].

The key advance at this level is that the harness becomes a meta-system capable of recording data, diagnosing failures, changing architecture, and constraining update direction through effect evaluation. In this form, closed-loop optimization is not only an efficiency technique; it also requires update admission, rollback boundaries, and validation criteria so that self-revision remains attributable and governable.

The four levels form a progression from interface, to functional module, to execution architecture, to autonomous evolution. Stronger models can help diagnose system failures and suggest architectural changes, but evolving memory, self-evolving agents, tool use, and autonomous behavior also introduce governance risks such as misevolution and safety vulnerabilities [172,200,205,206]. Observability, permission control, version rollback, and effect validation are therefore required mechanisms for adaptive harnesses, not optional additions.

The main result of model-driven harness reconstruction is not the weakening of the harness. Rather, the harness evolves from a deployment-time support layer into a more complex and more strictly governed adaptive system. The model–harness boundary changes, but system-level governance becomes more important.

7.3. The Evolving Model-Harness Boundary

The coevolution of model and harness continuously adjusts their functional boundary. This evolution follows three parallel paths: capability internalization, governance expansion, and system-level reorganization.

7.3.1. Capability Internalization

Capability internalization means that stable and standardized low-level support capabilities move closer to the model, reducing some burden on the harness. Capabilities with clear action spaces, stable feedback, and repeatable sampling are easiest to consolidate through post-training. AgentGym-RL uses multi-environment interaction and staged training to move models from short interactions toward long-horizon decision making [150]. Hindsight credit assignment decomposes sparse terminal rewards into finer process supervision for long-horizon agents [185]. In-distribution optimization combines automatic process-reward labeling with policy learning, enabling faster iteration of multi-turn execution behavior [186].

Stable training environments and reward design are preconditions for internalization. Micro-level execution patterns that once depended on hand-written prompts or fixed rules can increasingly be optimized as model policies when credit assignment, reward shaping, and environment design are well specified [154,187]. Local internalization, however, does not imply that the whole harness becomes weaker. Benchmarks for command-line, extremely long-horizon, and constraint-heavy planning tasks show that cross-context and cross-tool execution remains difficult for current agents [84,100,101]. In such settings, environment simulation and standardized tool protocols provide external structure for evaluating and operating complex toolchains [127,128]. Internalization therefore applies mainly to low-level local routines. In complex tasks, the harness remains the main carrier that links model capability and reliable execution.

7.3.2. Governance Expansion

Governance expansion means that as task complexity, environmental openness, and risk level increase, the harness expands its functional boundary in state management, tool coordination, security, and governance.

In memory and state management, the bottleneck of long-horizon tasks is not only context length, but also structured control over history, experience, and execution state. AMA-Bench argues that suboptimal memory-system design, rather than base model capability alone, can become a major bottleneck in agentic applications [81]. AMemGym separately shows that long-horizon assistant memory requires interactive evaluation rather than only static recall tests [188]. Multimodal-memory benchmarks further indicate that realistic user-agent interactions require memory mechanisms beyond text-only static recall [124]. Long-term memory management can therefore be understood partly as execution-state management: the harness provides the external layer for preserving, validating, and writing back state [12,189,190].

In tool-ecosystem security, protocols such as MCP standardize tool access but do not eliminate risks inside the invocation process. Open tool ecosystems create large cross-stage attack surfaces [191,192]. Tool-document poisoning and runtime manipulation of tool identity can lead to tool hijacking [193,194]. Because these risks arise at the boundary among model outputs, tool metadata, permissions, and runtime state, they motivate security controls outside model optimization alone, including trust boundaries, privilege control, and state governance in the harness [195].

In system governance, stronger models with broader action spaces increase the need for permission control, process auditing, and failure recovery. Governance shifts from single-output control to path-level control. ContextCov, AgentSpec, runtime governance, AARM, Agent Behavioral Contracts, and SafeHarness provide executable constraints, runtime enforcement, action-boundary management, formal contracts, and life-cycle security for agent deployment [196,197,198,199,200,201]. Static rules cannot cover such complex scenarios by themselves; the harness must supply an independent governance layer. In embodied and policy-constrained settings, governance also separates model cognition from execution oversight through policy checking, capability admission, monitoring, rollback, and human override [202].

7.3.3. System-Level Reorganization

System-level reorganization means that scattered and temporary low-level support mechanisms are reorganized into standardized, auditable, and reversible high-level mechanisms. The result is not a matter of functional addition or deletion, but a more disciplined system layer.

Natural-language agent harnesses externalize control logic into editable artifacts, moving design from implicit code conventions to explicit system architecture [169]. Code-as-harness work treats sandboxes, verification mechanisms, and permission boundaries as core foundations, making external support an engineerable system artifact [203]. Agentic harness engineering and retrospective harness optimization further show how components can evolve automatically and in closed loops without human labels [19,166]. These developments suggest a shift from scattered functional patching toward more explicit and auditable runtime-system construction.

In this process, the model can participate in failure diagnosis, pattern induction, and optimization proposal generation. Studies of harness benefit, misevolution, and evolving memory suggest that controllable evolution depends on explicit mechanisms for update admission, effect evaluation, rollback or recovery, and governance authority at the harness level [172,204,205].

Across these three paths, model and harness form a dynamic layered boundary rather than a fixed separation. Recent systems show that the model increasingly adapts to and consolidates localized, repetitive, and stable abilities, such as routine tool invocation and local plan repair. At the same time, harnesses carry long-cycle, cross-context, and high-risk functions, such as state management across contexts, permission control, audit trails, and long-term memory evolution. These functions are persistent, complex, and risk sensitive, making them difficult to fully internalize into model parameters.

The evolving model–harness boundary is therefore not a path toward mutual replacement. It follows a layered logic: low-level execution capability moves toward model internalization, while high-level governance responsibility concentrates in the harness. System-level reorganization makes this shifting boundary auditable, reversible, and governable. Together, these changes reveal how agent systems become easier to train, more reliable to operate, and easier to govern.

8. Open Challenges and Future Directions

Building on the preceding synthesis, we identify six future directions for harness engineering. These directions extend the paper’s core view of harnesses as external support structures: harnesses should become explicit research objects, comparable design artifacts, gap-indexed evaluation targets, training-signal generators, governable adaptive systems, and components of an automated model–harness coevolution loop.

8.1. Harness as a First-Class Scientific Object

Harness engineering should be studied as an independent object of inquiry rather than as a secondary implementation detail of model work. A useful research program would define stable abstractions, terminology, and design principles for component composition, just as model research standardized architectures and training objectives. The goal is to move from ad hoc engineering tricks to reusable knowledge about how support structures map to capability gaps [19,169].

8.2. Making Harness Design Explicit

Much of today’s harness logic still lives in controller code, prompt fragments, tool wrappers, and runtime conventions. This makes systems hard to compare, ablate, and transfer. A more explicit representation of the harness should surface stage structure, role separation, artifact contracts, state fields, failure categories, and verification thresholds so that the same design can be inspected across different systems [19,169].

8.3. Evaluating Harness Compensation

End-to-end success metrics conflate base-model ability and harness support. Future evaluation should therefore index results by capability gap and component family, and should ask which support reduced which residual failure. Counterfactual evaluation, component-level attribution, and structured error taxonomies would make it possible to judge not only whether a system works, but also whether the harness is appropriately sized for the gap it is meant to cover [14,19,74].

8.4. Designing Harnesses for Training

Harnesses should also be designed as sources of training signal, not only as inference-time scaffolds. By shaping trajectories, feedback, and reward channels, a harness determines which behaviors are internalized and which remain external. The challenge is to identify support structures that produce transferable execution capability rather than narrow prompt-specific behavior [145,149,150].

8.5. Self-Optimizing but Governable Harnesses

Harnesses will increasingly optimize prompts, tools, workflows, memory, and skills on their own. That makes governance part of the design problem rather than an afterthought. Future systems need update admission, effect evaluation, rollback, and permission boundaries so that self-modification remains auditable and reversible. In this view, capability-bearing components may evolve inward, but constraint-bearing components must remain subject to external control [165,166,172,204,205].

8.6. Toward Automated Model–Harness Coevolution

The deeper goal is an automated coevolution loop in which the system can decide which gaps should be internalized by the model and which should remain compensated by the harness. That requires the model–harness boundary itself to become observable, comparable, and recoverable. When this boundary moves, the system should be able to reconfigure interfaces, preserve evidence, and roll back unsafe changes without losing continuity. Such a loop would connect explicit design, gap-aware evaluation, training-time signal design, and governance into a single adaptive process [19,198,200,204].

9. Conclusion

We have framed harness engineering as a problem of allocating capability, compensation, and governance in LLM agent systems. Rather than treating tools, memory, guardrails, and other modules as isolated additions, this perspective relates them to the capability gaps they address, the tradeoffs they introduce, and the reasons some forms of support remain external even as models improve. Across the three levels developed in this survey, Harness Component Taxonomy organizes gap-indexed component families, Harness Evaluation distinguishes native capability-gap diagnosis from compensation effectiveness and net benefit assessment, and Harness Evolution explains why model improvement reorganizes external support rather than simply eliminating it. Routine capability-bearing support may move toward the model when it becomes trainable as stable behavior, whereas constraint-bearing support remains external because it provides independent verification, permission control, and governance. As agents operate over longer horizons with broader tool access and higher autonomy, harness design therefore becomes a first-class engineering problem concerned with where capability should reside, how external support should be configured, and how the model–harness boundary should evolve.

Table 1. Capability Gaps, Harness Components, and Evaluation Logic

Capability gap	Primary harness family	Diagnostic evidence	Compensation pattern	Residual risk and cost
Context and attention decay	Context management	Long-context and agent-rollout benchmarks expose loss of key state, degraded dependency tracking, and context-management errors under long trajectories [25,87,88].	Select only currently relevant evidence; compress histories into structured state; offload bulky artifacts to files or stores; isolate branches and load skills progressively.	Lossy summaries can omit causal premises; retrieval can miss needed evidence; over-compression makes later recovery expensive or impossible.
Unreliable tool use and interface adaptation	Tools and interfaces	Realistic tool benchmarks show failures in tool discovery, parameterization, multi-server composition, and dynamic intent tracking [90,91,93].	Constrain actions through schemas and tool-call contracts; improve agent-computer interfaces; retrieve or prune tools by task; standardize external access through protocols such as MCP.	Broader tool exposure raises context cost and selection noise; external tool metadata introduces supply-chain, privilege, and trust-boundary risks.
Non-persistent or poorly managed state	State and memory	Memory benchmarks show degradation under incremental updates, conflicting facts, multi-session tasks, and memory-driven downstream decisions [94,95,96].	Maintain working state for the current task; persist episodic, semantic, and procedural memory; manage write, consolidation, retrieval, and forgetting policies; use files and version control as auditable state substrates.	Unfiltered memory becomes stale, redundant, or poisoned; similarity retrieval can lose causal relations; memory maintenance adds storage, indexing, and governance burden.
Long-horizon workflow failure	Workflow and orchestration	Planning and web-task evaluations reveal plan drift, premature termination, delayed-feedback errors, and weak recovery across long execution chains [65,100,101].	Use reasoning-action loops, task decomposition, deterministic workflows, hooks, model routing, and subagent handoffs to structure progress, checkpoints, and recovery.	More control logic increases system complexity; fixed workflows can become brittle; subagent handoffs can lose evidence or duplicate work.
Ungrounded execution in changing environments	Execution environments	Execution benchmarks test whether decisions translate into correct commands, GUI actions, code changes, and environment-state updates [84,102,103,104].	Run actions in sandboxes, terminals, browsers, and prepared workspaces; preserve artifacts, logs, diffs, and environment feedback for subsequent reasoning and verification.	Execution adds runtime, dependency, and isolation cost; environment drift and flaky tools complicate attribution; broader execution authority expands side-effect boundaries.
Unsafe or noncompliant side effects	Verification and guardrails	Safety benchmarks expose long-horizon attacks, prompt and visual injection, memory poisoning, privacy leakage, and sandbox escape risks [108,109,111,119].	Externalize judgment into tests, validators, permission gates, allowlists, layered defenses, human approval, and policy checks before irreversible actions.	Guardrails add latency and false blocks; local checks can be bypassed by cross-step attacks; human approval requires trace context to avoid rubber-stamp decisions.
Uninspectable process and weak failure attribution	Observability	Trace-oriented studies show that final success scores hide causal failure points and that full traces improve attribution over partial observations [120,121,122].	Record traces, logs, metrics, token and latency profiles, tool-call chains, and evaluation judgments so failures can be localized to components, steps, or interfaces.	Process evidence increases storage, privacy, and compliance obligations; raw logs still require schema, summarization, and diagnostic tooling to become actionable.

Table 2. Representative Harness Implementations Across the Design Space

Archetype	Task boundary	Permission radius	State persistence	Verification mechanism	Failure recovery	Audit demand	Dominant cost items
SWE-agent-style repository repair	Bounded issue, patch, or benchmark instance with a concrete repository-level success condition.	Primarily repository-local shell, editor, search, and test actions inside a prepared coding environment.	Mostly run-scoped; durable state is the patch, logs, and modified workspace rather than a long-term memory store.	Unit tests, build feedback, diffs, executable scripts, and benchmark pass/fail signals close the generate–test–repair loop [83,84].	Iterative edit, rerun, and local repair; harder cases require environment reset or human diagnosis when tests are incomplete.	Moderate: traces must explain patch provenance and failed attempts, but side effects are usually confined to the repository.	Model calls, command execution, test runtime, dependency setup, and repeated context retrieval.
Codex-style cloud task agent	User-submitted coding or analysis request executed as an isolated task, often in parallel with other tasks.	Cloud sandbox preloaded with the target repository; execution authority is bounded by task environment isolation [85].	Task-scoped and reclaimable; persistent learning is not the main design center.	Sandbox execution, logs, diffs, tests, and user review; concise repository instructions reduce exploratory context cost [16].	Restart, rerun, or revise inside an independent sandbox; failed tasks can be discarded without contaminating other workspaces.	Medium to high: isolation evidence, task traces, and reviewable diffs matter for multi-tenant execution.	Sandbox time, parallel task scheduling, token use, repository indexing, and repeated verification.
Claude Code-style local terminal agent	Long local development session over a user workspace, possibly spanning multiple subtasks and tool extensions.	Local file system, shell, MCP servers, plugins, Skills, and other user-authorized tools; side effects touch the developer environment.	Session and workspace continuity through compaction, append-oriented session storage, summaries, and artifacts [82].	Permission modes, hooks, classifier-supported checks, tests, diffs, and user approval constrain tool calls and workspace changes.	Compaction, checkpointing, manual correction, subagent summaries, and permission escalation or denial after risky steps.	High: the user needs trace context for local file changes, tool calls, permission decisions, and reversible recovery.	Long-context management, tool latency, permission-review overhead, local environment setup, and human attention.
Hermes-style resident agent	Continuing agent operation across tasks, where recurring patterns should become reusable skills.	Resident main agent coordinates short-lived workers, memory files, skill stores, and task-specific tools over time [86].	Cross-session memory and procedural skill persistence are central; learned conventions and preferences are injected into future sessions.	Skill distillation checks, memory review, trace evidence, user feedback, and downstream task success determine whether experience should be reused.	Update, retire, or refine skills; isolate exploratory work in subagents; recover by reverting or rewriting corrupted memories.	Very high: continuous operation requires evidence about which memories and skills were created, triggered, effective, or harmful.	Memory curation, skill life-cycle management, trace storage, evaluation of self-improvements, and long-run governance.
Boundary case: embodied or enterprise workflow agent	Open-ended operational process rather than a code-only task; completion may involve business state, devices, or physical-world consequences.	Enterprise APIs, identity-scoped systems, robotic or GUI actuators, databases, and compliance-sensitive workflows.	Long-lived organizational or environmental state with identity, policy, and provenance constraints.	Policy engines, human-in-the-loop gates, environment-state audits, rollback checks, compliance logs, and external monitors [22,197,198].	Rollback, escalation, compensating transactions, manual override, quarantine, or incident review after unsafe actions.	Highest: audit must cover authority, data access, action rationale, approvals, and recovery because harms may outlive the session.	Governance latency, compliance review, monitoring infrastructure, integration maintenance, and opportunity cost of human approval.

References

Wood, D.; Bruner, J.S.; Ross, G. The Role of Tutoring in Problem Solving. Journal of Child Psychology and Psychiatry 1976, 17, 89–100. [CrossRef]
van de Pol, J.; Volman, M.; Beishuizen, J. Scaffolding in Teacher–Student Interaction: A Decade of Research. Educational Psychology Review 2010, 22, 271–296. [CrossRef]
Vygotsky, L.S. Mind in Society: The Development of Higher Psychological Processes; Harvard University Press: Cambridge, MA, USA, 1978.
Hollan, J.; Hutchins, E.; Kirsh, D. Distributed Cognition: Toward a New Foundation for Human-Computer Interaction Research. ACM Transactions on Computer-Human Interaction 2000, 7, 174–196. [CrossRef]
Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A Survey on Large Language Model Based Autonomous Agents. Frontiers of Computer Science 2024, 18, 186345. [CrossRef]
Xu, R.; Peng, J. A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications, 2025, [arXiv:cs.AI/2506.12594]. arXiv preprint arXiv:2506.12594.
Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models, 2022, [arXiv:cs.CL/2210.03629]. arXiv preprint arXiv:2210.03629.
Qin, Y.; Liang, S.; Ye, Y.; Zhu, K.; Yan, L.; Lu, Y.; Lin, Y.; Cong, X.; Tang, X.; Qian, B.; et al. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, 2023, [arXiv:cs.AI/2307.16789]. arXiv preprint arXiv:2307.16789. [CrossRef]
Qu, C.; Dai, S.; Wei, X.; Cai, H.; Wang, S.; Yin, D.; Xu, J.; Wen, J.R. Tool Learning with Large Language Models: A Survey. Frontiers of Computer Science 2024, 19, 198343, [arXiv:cs.CL/2405.17935]. [CrossRef]
Zeng, W.; Huang, Y.; He, J. LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth, 2026, [arXiv:cs.AI/2602.07962]. arXiv preprint arXiv:2602.07962.
Hu, Y.; Liu, S.; Yue, Y.; Zhang, G.; Liu, B.; Zhu, F.; Lin, J.; Guo, H.; Dou, S.; Xi, Z.; et al. Memory in the Age of AI Agents, 2025, [arXiv:cs.CL/2512.13564]. arXiv preprint arXiv:2512.13564.
Du, P. Memory for Autonomous LLM Agents:Mechanisms, Evaluation, and Emerging Frontiers, 2026, [arXiv:cs.AI/2603.07670]. arXiv preprint arXiv:2603.07670. [CrossRef]
Wang, X.J.; Bai, H.; Sun, Y.; Wang, H.; Zhang, S.; Hu, W.; Schroder, M.; Mutlu, B.; Song, D.; Nowak, R.D. The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break, 2026, [arXiv:cs.AI/2604.11978]. arXiv preprint arXiv:2604.11978.
Yang, Y.T.; Zhu, Q. Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs, 2026, [arXiv:cs.AI/2605.23929]. arXiv preprint arXiv:2605.23929.
LangChain. The Anatomy of an Agent Harness, 2026. LangChain blog.
Lopopolo, R. Harness engineering: leveraging Codex in an agent-first world, 2026. OpenAI blog.
Ding, S.; Dai, X.; Xing, L.; Ding, S.; Liu, Z.; JingYi, Y.; Yang, P.; Zhang, Z.; Wei, X.; Fang, X.; et al. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation, 2026, [arXiv:cs.CL/2605.10912]. arXiv preprint arXiv:2605.10912. [CrossRef]
Ndzomga, F. Efficient Benchmarking of AI Agents, 2026, [arXiv:cs.AI/2603.23749]. arXiv preprint arXiv:2603.23749.
Lin, J.; Liu, S.; Pan, C.; Lin, L.; Dou, S.; Xi, Z.; Huang, X.; Yan, H.; Han, Z.; Gui, T.; et al. Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses, 2026, [arXiv:cs.CL/2604.25850]. arXiv preprint arXiv:2604.25850. [CrossRef]
Anthropic. Effective context engineering for AI agents, 2025. Official Anthropic engineering post.
Ehtesham, A.; Singh, A.; Gupta, G.K.; Kumar, S. A survey of agent interoperability protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP), 2025, [arXiv:cs.AI/2505.02279]. arXiv preprint arXiv:2505.02279.
Su, H.; Luo, J.; Liu, C.; Yang, X.; Zhang, Y.; Dong, Y.; Zhu, J. A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents, 2025, [arXiv:cs.AI/2506.23844]. arXiv preprint arXiv:2506.23844.
Mohammadi, M.; Li, Y.; Lo, J.; Yip, W. Evaluation and Benchmarking of LLM Agents: A Survey. In Proceedings of the Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. ACM, 2025, KDD ’25, pp. 6129–6139. [CrossRef]
Anthropic. Equipping agents for the real world with Agent Skills, 2025. Official Anthropic engineering post.
Li, K.; Shi, J.; Xiao, Y.; Jiang, M.; Sun, J.; Wu, Y.; Fu, D.; Xia, S.; Cai, X.; Xu, T.; et al. AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts, 2026, [arXiv:cs.AI/2601.11044]. arXiv preprint arXiv:2601.11044.
Mei, L.; Yao, J.; Ge, Y.; Wang, Y.; Bi, B.; Cai, Y.; Liu, J.; Li, M.; Li, Z.Z.; Zhang, D.; et al. A Survey of Context Engineering for Large Language Models, 2025, [arXiv:cs.CL/2507.13334]. arXiv preprint arXiv:2507.13334.
LangChain. Context engineering in agents, 2025. LangChain docs page.
Wu, Y.; Zheng, Y.; Xu, T.; Zhang, Z.; Yu, Y.; Zhu, J.; Ma, C.; Lin, B.; Dong, B.; Zhu, H.; et al. ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents, 2026, [arXiv:cs.AI/2604.01664]. arXiv preprint arXiv:2604.01664. [CrossRef]
Anthropic. Introducing advanced tool use on the Claude Developer Platform, 2025. Official Anthropic engineering post.
Manus. Context Engineering for AI Agents: Lessons from Building Manus, 2025. Official Manus blog.
Anthropic. How we built our multi-agent research system, 2025. Official Anthropic engineering post.
Anthropic. Writing effective tools for agents – with agents, 2025. Official Anthropic engineering post.
Chen, Y.C.; Hsu, P.C.; Hsu, C.J.; Shiu, D.s. Enhancing Function-Calling Capabilities in LLMs: Strategies for Prompt Formats, Data Integration, and Multilingual Translation. In Proceedings of the Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Industry Track). Association for Computational Linguistics, 2025, pp. 99–111. [CrossRef]
Huang, D.; Malwe, G.; Wang, Z. When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems, 2026, [arXiv:cs.AI/2601.16280]. arXiv preprint arXiv:2601.16280.
Guo, R.; Dong, K.; Gao, X.; Das, K. Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use, 2026, [arXiv:cs.AI/2602.20426]. arXiv preprint arXiv:2602.20426. [CrossRef]
Masi, A.D. Terminal Is All You Need: Design Properties for Human-AI Agent Collaboration, 2026, [arXiv:cs.HC/2603.10664]. arXiv preprint arXiv:2603.10664.
Lumer, E.; Gulati, A.; Nizar, F.; Hedroits, D.; Mehta, A.; Hwangbo, H.; Subbiah, V.K.; Basavaraju, P.H.; Burke, J.A. Tool and Agent Selection for Large Language Model Agents in Production: A Survey. In Proceedings of the 2026 IEEE Conference on Artificial Intelligence (CAI). IEEE, 2026, pp. 701–708. [CrossRef]
Lumer, E.; Nizar, F.; Gulati, A.; Basavaraju, P.H.; Subbiah, V.K. Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems, 2025, [arXiv:cs.CL/2511.01854]. arXiv preprint arXiv:2511.01854.
Wu, Q.; Das, S.; Amani, M.; Nag, A.; Lee, S.; Gummadi, K.P.; Ravichander, A.; Zafar, M.B. To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling, 2026, [arXiv:cs.AI/2605.00737]. arXiv preprint arXiv:2605.00737. [CrossRef]
Anthropic. Model Context Protocol, 2024. Official MCP overview.
Lee, Y.; Choi, W.; Nam, D. Supply Chain Threats in the MCP Ecosystem: Attack Vectors and Mitigation Strategies. In Advances in Information and Computer Security; 2026; pp. 329–349. [CrossRef]
Packer, C.; Wooders, S.; Lin, K.; Fang, V.; Patil, S.G.; Stoica, I.; Gonzalez, J.E. MemGPT: Towards LLMs as Operating Systems, 2023, [arXiv:cs.AI/2310.08560]. arXiv preprint arXiv:2310.08560.
Sumers, T.R.; Yao, S.; Narasimhan, K.; Griffiths, T.L. Cognitive Architectures for Language Agents, 2023, [arXiv:cs.AI/2309.02427]. arXiv preprint arXiv:2309.02427.
Park, J.S.; O’Brien, J.C.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative Agents: Interactive Simulacra of Human Behavior, 2023, [arXiv:cs.HC/2304.03442]. arXiv preprint arXiv:2304.03442.
Jiang, D.; Li, Y.; Wei, S.; Yang, J.; Kishore, A.; Zhao, A.; Kang, D.; Hu, X.; Chen, F.; Li, Q.; et al. Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations, 2026, [arXiv:cs.CL/2602.19320]. arXiv preprint arXiv:2602.19320. [CrossRef]
Yang, C.; Zhou, C.; Xiao, Y.; Dong, S.; Zhuang, L.; Zhang, Y.; Wang, Z.; Hong, Z.; Yuan, Z.; Xiang, Z.; et al. Graph-based Agent Memory: Taxonomy, Techniques, and Applications, 2026, [arXiv:cs.AI/2602.05665]. arXiv preprint arXiv:2602.05665.
Fang, R.; Liang, Y.; Wang, X.; Wu, J.; Qiao, S.; Xie, P.; Huang, F.; Chen, H.; Zhang, N. Memp: Exploring Agent Procedural Memory, 2025, [arXiv:cs.CL/2508.06433]. arXiv preprint arXiv:2508.06433.
LangChain. How agents can use filesystems for context engineering, 2025. LangChain blog.
Cao, W.; Yin, X.; Dhingra, B.; Zhou, S. Coding Agents are Effective Long-Context Processors, 2026, [arXiv:cs.CL/2603.20432]. arXiv preprint arXiv:2603.20432.
Huang, X.; Liu, W.; Chen, X.; Wang, X.; Wang, H.; Lian, D.; Wang, Y.; Tang, R.; Chen, E. Understanding the planning of LLM agents: A survey, 2024, [arXiv:cs.AI/2402.02716]. arXiv preprint arXiv:2402.02716. [CrossRef]
Adimulam, A.; Gupta, R.; Kumar, S. The Orchestration of Multi-Agent Systems: Architectures, Protocols, and Enterprise Adoption, 2026, [arXiv:cs.MA/2601.13671]. arXiv preprint arXiv:2601.13671.
Wei, H.; Zhang, Z.; He, S.; Xia, T.; Pan, S.; Liu, F. PlanGenLLMs: A Modern Survey of LLM Planning Capabilities. In Proceedings of the Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2025, pp. 19497–19521. [CrossRef]
Guo, Z.; Li, Y.; Qiu, L.; Wang, X.; Xv, J.; Ru, D.; Li, X.; Zheng, X.; Cao, X.; Cai, X. AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents, 2026, [arXiv:cs.AI/2605.07926]. arXiv preprint arXiv:2605.07926.
Anthropic. Building Effective AI Agents, 2024. Official Anthropic research note.
Microsoft. Conductor, 2026. Official Microsoft GitHub repository.
Moslem, Y.; Kelleher, J.D. Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey, 2026, [arXiv:cs.NI/2603.04445]. arXiv preprint arXiv:2603.04445. [CrossRef]
Guo, X.; Wang, S.; Ji, C.; Zhao, X.; Xi, W.; Liu, Y.; Li, Q.; Deng, C.; Feng, J. Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference, 2025, [arXiv:cs.MA/2509.07571]. arXiv preprint arXiv:2509.07571.
Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework, 2023, [arXiv:cs.AI/2308.00352]. arXiv preprint arXiv:2308.00352.
Bui, N.D.Q. Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned, 2026, [arXiv:cs.AI/2603.05344]. arXiv preprint arXiv:2603.05344.
Harang, R. Practical Security Guidance for Sandboxing Agentic Workflows and Managing Execution Risk, 2026. NVIDIA Technical Blog, published January 30, 2026.
Agache, A.; Brooker, M.; Iordache, A.; Liguori, A.; Neugebauer, R.; Piwonka, P.; Popa, D.M. Firecracker: Lightweight Virtualization for Serverless Applications, 2020. Proc. 17th USENIX Symposium on Networked Systems Design and Implementation.
Ning, L.; Liang, Z.; Jiang, Z.; Qu, H.; Ding, Y.; Fan, W.; yong Wei, X.; Lin, S.; Liu, H.; Yu, P.S.; et al. A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models, 2025, [arXiv:cs.AI/2503.23350]. arXiv preprint arXiv:2503.23350. [CrossRef]
Wang, S.; Liu, W.; Chen, J.; Zhou, Y.; Gan, W.; Zeng, X.; Che, Y.; Yu, S.; Hao, X.; Shao, K.; et al. GUI Agents with Foundation Models: A Comprehensive Survey, 2024, [arXiv:cs.AI/2411.04890]. arXiv preprint arXiv:2411.04890.
Li, J.; Li, Y.; Zhao, C.; Xu, Z.; Hu, B.; Zhang, M. WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments, 2026, [arXiv:cs.AI/2604.27776]. arXiv preprint arXiv:2604.27776.
Jang, L.K.; Koh, J.Y.; Fried, D.; Salakhutdinov, R. Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks, 2026, [arXiv:cs.LG/2604.24964]. arXiv preprint arXiv:2604.24964.
Anthropic. Effective harnesses for long-running agents, 2025. Official Anthropic engineering post, published November 26, 2025.
Shamsujjoha, M.; Lu, Q.; Zhao, D.; Zhu, L. Swiss Cheese Model for AI Safety: A Taxonomy and Reference Architecture for Multi-Layered Guardrails of Foundation Model Based Agents, 2024, [arXiv:cs.SE/2408.02205]. arXiv preprint arXiv:2408.02205. [CrossRef]
Chennabasappa, S.; Nikolaidis, C.; Song, D.; Molnar, D.; Ding, S.; Wan, S.; Whitman, S.; Deason, L.; Doucette, N.; Montilla, A.; et al. LlamaFirewall: An Open Source Guardrail System for Building Secure AI Agents, 2025, [arXiv:cs.CR/2505.03574]. arXiv preprint arXiv:2505.03574.
OWASP Gen AI Security Project. LLM06:2025 Excessive Agency, 2025. OWASP Gen AI Security Project.
OpenAI. Human-in-the-loop, 2026. OpenAI Agents SDK documentation.
OpenAI. Safety practices for agent builders, 2026. OpenAI developer documentation.
Liu, G.; Solomon, S. AI Agent Observability - Evolving Standards and Best Practices, 2025. OpenTelemetry blog.
Crowder, S. How to debug and evaluate AI agents with observability, 2026. LangChain blog.
Wang, Y.; Zhang, J.; Cai, T.; Liu, Z.; Sun, Q.; Sun, Z.; Wu, Z.; Zhang, M.; Zhu, Y. From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents, 2026, [arXiv:cs.CR/2606.04990]. arXiv preprint arXiv:2606.04990.
Okamoto, M.; Erol, A.K.; Riedl, M. Explainable Model Routing for Agentic Workflows, 2026, [arXiv:cs.AI/2604.03527]. arXiv preprint arXiv:2604.03527.
Trivedy, V. Improving Deep Agents with Harness Engineering, 2026. LangChain blog.
LangChain. LangChain State of AI Agents Report: 2024 Trends, 2024. LangChain report.
Rabanser, S.; Kapoor, S.; Kirgis, P.; Liu, K.; Utpala, S.; Narayanan, A. Towards a Science of AI Agent Reliability, 2026, [arXiv:cs.AI/2602.16666]. arXiv preprint arXiv:2602.16666. [CrossRef]
Gupta, A. ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions, 2026, [arXiv:cs.AI/2601.06112]. arXiv preprint arXiv:2601.06112.
Li, X.; Ming, R.; Setlur, P.; Paladugu, A.; Tang, A.; Kang, H.; Shao, S.; Jin, R.; Xiong, C. Benchmark Test-Time Scaling of General LLM Agents, 2026, [arXiv:cs.AI/2602.18998]. arXiv preprint arXiv:2602.18998.
Zhao, Y.; Yuan, B.; Huang, J.; Yuan, H.; Yu, Z.; Xu, H.; Hu, L.; Shankarampeta, A.; Huang, Z.; Ni, W.; et al. AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications, 2026, [arXiv:cs.AI/2602.22769]. arXiv preprint arXiv:2602.22769. [CrossRef]
Liu, J.; Zhao, X.; Shang, X.; Shen, Z. Dive into Claude Code: The Design Space of Today’s and Future AI Agent Systems, 2026, [arXiv:cs.SE/2604.14228]. arXiv preprint arXiv:2604.14228.
Le, T.; Thai, M.V.T.; Manh, D.N.; Nhat, H.P.; Bui, N.D.Q. SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios, 2025, [arXiv:cs.SE/2512.18470]. arXiv preprint arXiv:2512.18470.
Merrill, M.A.; Shaw, A.G.; Carlini, N.; Li, B.; Raj, H.; Bercovich, I.; Shi, L.; Shin, J.Y.; Walshe, T.; Buchanan, E.K.; et al. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces, 2026, [arXiv:cs.SE/2601.11868]. arXiv preprint arXiv:2601.11868.
OpenAI. Introducing Codex, 2025. OpenAI blog.
Nous Research. Hermes Agent, 2026. Official Nous Research GitHub repository.
Chen, Z.; Wu, X.; Jia, J.; Gao, C.; Fu, Q.; Zhang, D.; Hu, S. LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark, 2026, [arXiv:cs.CL/2601.02872]. arXiv preprint arXiv:2601.02872. [CrossRef]
Fang, S.; Wang, Y.; Liu, X.; Lu, J.; Tan, C.; Chen, X.; Zheng, Y.; Huang, X.; Qiu, X. AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts, 2026, [arXiv:cs.CL/2601.20730]. arXiv preprint arXiv:2601.20730.
Jenova AI. Long-Context Agentic Orchestration Benchmark, 2026. Jenova.ai benchmark page.
Yu, P.; Liu, W.; Yang, Y.; Li, J.; Zhang, Z.; Feng, X.; Zhang, F. Benchmarking LLM Tool-Use in the Wild, 2026, [arXiv:cs.HC/2604.06185]. arXiv preprint arXiv:2604.06185.
Wang, Z.; Chang, Q.; Patel, H.; Biju, S.; Wu, C.E.; Liu, Q.; Ding, A.; Rezazadeh, A.; Shah, A.; Bao, Y.; et al. MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers, 2025, [arXiv:cs.CL/2508.20453]. arXiv preprint arXiv:2508.20453. [CrossRef]
Li, J.; Zhao, W.; Zhao, J.; Zeng, W.; Wu, H.; Wang, X.; Ge, R.; Cao, Y.; Huang, Y.; Liu, W.; et al. The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution, 2025, [arXiv:cs.CL/2510.25726]. arXiv preprint arXiv:2510.25726.
Bandi, C.; Dumitru, R.G.; Hertzberg, B.; Agarwal, D.; Boo, G.; Polakam, T.; Hassaan, S.; Da, J.; Kim, H.; Gupta, V.; et al. MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers, 2026, [arXiv:cs.SE/2602.00933]. arXiv preprint arXiv:2602.00933.
Hu, Y.; Wang, Y.; McAuley, J. Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions, 2025, [arXiv:cs.CL/2507.05257]. arXiv preprint arXiv:2507.05257. [CrossRef]
Liu, L.; Yadav, N. Introducing STATE-Bench: A benchmark for AI agent memory, 2026. Microsoft Open Source blog.
He, Z.; Wang, Y.; Zhi, C.; Hu, Y.; Chen, T.P.; Yin, L.; Chen, Z.; Wu, T.A.; Ouyang, S.; Wang, Z.; et al. MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks, 2026, [arXiv:cs.CL/2602.16313]. arXiv preprint arXiv:2602.16313.
Ai, Q.; Tang, Y.; Wang, C.; Long, J.; Su, W.; Liu, Y. MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems, 2025, [arXiv:cs.LG/2510.17281]. arXiv preprint arXiv:2510.17281. [CrossRef]
Tavakoli, M.; Salemi, A.; Ye, C.; Abdalla, M.; Zamani, H.; Mitchell, J.R. Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs, 2025, [arXiv:cs.CL/2510.27246]. arXiv preprint arXiv:2510.27246.
Liu, G.; Zhao, P.; Liang, Y.; Luo, Q.; Tang, S.; Chai, Y.; Lin, W.; Xiao, H.; Wang, W.; Chen, S.; et al. MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments, 2026, [arXiv:cs.DC/2602.06075]. arXiv preprint arXiv:2602.06075.
Zhang, Y.; Jiang, S.; Li, R.; Tu, J.; Su, Y.; Deng, L.; Guo, X.; Lv, C.; Lin, J. DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints, 2026, [arXiv:cs.AI/2601.18137]. arXiv preprint arXiv:2601.18137. [CrossRef]
Luo, H.; Zhang, H.; Zhang, X.; Wang, H.; Qin, Z.; Lu, W.; Ma, G.; He, H.; Xie, Y.; Zhou, Q.; et al. UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios, 2025, [arXiv:cs.AI/2509.21766]. arXiv preprint arXiv:2509.21766.
Froger, R.; Andrews, P.; Bettini, M.; Budhiraja, A.; Cabral, R.S.; Do, V.; Garreau, E.; Gaya, J.B.; Laurençon, H.; Lecanu, M.; et al. Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments, 2026, [arXiv:cs.AI/2602.11964]. arXiv preprint arXiv:2602.11964.
Zhou, Q.; Zhang, J.; Wang, H.; Hao, R.; Wang, J.; Han, M.; Yang, Y.; Wu, S.; Pan, F.; Fan, L.; et al. FeatureBench: Benchmarking Agentic Coding for Complex Feature Development, 2026, [arXiv:cs.SE/2602.10975]. arXiv preprint arXiv:2602.10975. [CrossRef]
Liu, E.; Pan, L.; Gao, Z.; Yang, Y.; Shi, C.; Liu, Y.; Wu, J.; Li, Q. Benchmarking and Improving GUI Agents in High-Dynamic Environments, 2026, [arXiv:cs.CV/2604.25380]. arXiv preprint arXiv:2604.25380.
Du, M.; Xu, B.; Zhu, C.; Wang, X.; Mao, Z. DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents, 2025, [arXiv:cs.CL/2506.11763]. arXiv preprint arXiv:2506.11763.
Chen, Z.; Ma, X.; Zhuang, S.; Nie, P.; Zou, K.; Liu, A.; Green, J.; Patel, K.; Meng, R.; Su, M.; et al. BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent, 2025, [arXiv:cs.CL/2508.06600]. arXiv preprint arXiv:2508.06600.
Zhang, H.; Zhou, J.; Li, B.; Zhou, B.; Shan, Y.; Lu, H.; Cao, Z.; Chen, J.; Han, Y.; Sheng, Z.; et al. BrowseComp-V3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents, 2026, [arXiv:cs.AI/2602.12876]. arXiv preprint arXiv:2602.12876. [CrossRef]
Li, H.; Wen, R.; Shi, S.; Zhang, N.; Vorobeychik, Y.; Xiao, C. AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments?, 2026, [arXiv:cs.CR/2602.03117]. arXiv preprint arXiv:2602.03117.
Jiang, T.; Wang, Y.; Liang, J.; Wang, T. AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks, 2026, [arXiv:cs.AI/2602.16901]. arXiv preprint arXiv:2602.16901.
Albrethsen, J.; Datta, Y.; Kumar, K.; Rajasekar, S. DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs, 2026, [arXiv:cs.AI/2602.16935]. arXiv preprint arXiv:2602.16935. [CrossRef]
Cao, T.; Lim, B.; Liu, Y.; Sui, Y.; Li, Y.; Deng, S.; Lu, L.; Oo, N.; Yan, S.; Hooi, B. VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents, 2025, [arXiv:cs.AI/2506.02456]. arXiv preprint arXiv:2506.02456.
Chen, C.; Song, X.; Chai, Y.; Yao, Y.; Zhao, H.; Li, L.; Li, J.; Teng, Y.; Liu, G.; Wang, Y. GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?, 2025, [arXiv:cs.CR/2510.20333]. arXiv preprint arXiv:2510.20333.
Liu, G.; Ye, J.; Liu, J.; Li, Y.; Liu, W.; Gao, P.; Luan, J.; Liu, Y. Mobile GUI Agents under Real-world Threats: Are We There Yet?, 2025, [arXiv:cs.CR/2507.04227]. arXiv preprint arXiv:2507.04227. [CrossRef]
Li, Y.; Luo, H.; Xie, Y.; Fu, Y.; Yang, Z.; Shao, S.; Ren, Q.; Qu, W.; Fu, Y.; Yang, Y.; et al. ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis, 2026, [arXiv:cs.AI/2604.02022]. arXiv preprint arXiv:2604.02022.
Chen, Y.S.; Huang, S.Y.; Yang, C.L.; Chen, Y.N. TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories, 2026, [arXiv:cs.CR/2604.07223]. arXiv preprint arXiv:2604.07223.
Yagoubi, F.E.; Badu-Marfo, G.; Mallah, R.A. AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems, 2026, [arXiv:cs.AI/2602.11510]. arXiv preprint arXiv:2602.11510. [CrossRef]
Ma, J.; Du, X.; Lin, R.; Bian, Y.; Chen, J.; Wang, J.; Yang, X.; Cui, S.; Meng, C.; Deng, X.; et al. Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions, 2026, [arXiv:cs.CR/2605.22321]. arXiv preprint arXiv:2605.22321.
Wang, Z.; Li, Y.; Wu, Y.; Liu, Z.; Chen, K.; Wai, F.K.; Chen, P.Y.; Thing, V.L.L.; Li, B.; Tao, D.; et al. Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents, 2026, [arXiv:cs.CR/2606.13385]. arXiv preprint arXiv:2606.13385.
Marchand, R.; Cathain, A.O.; Wynne, J.; Giavridis, P.M.; Deverett, S.; Wilkinson, J.; Gwartz, J.; Coppock, H. Quantifying Frontier LLM Capabilities for Container Sandbox Escape, 2026, [arXiv:cs.CR/2603.02277]. arXiv preprint arXiv:2603.02277.
Chen, M.; Wang, J.; Mu, F.; Wang, Y.; Liu, Z.; Feng, H.; Wang, Q. Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems, 2026, [arXiv:cs.MA/2604.22708]. arXiv preprint arXiv:2604.22708. [CrossRef]
In, Y.; Tanjim, M.; Subramanian, J.; Kim, S.; Bhattacharya, U.; Kim, W.; Park, S.; Sarkhel, S.; Park, C. Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation, 2026, [arXiv:cs.AI/2603.25001]. arXiv preprint arXiv:2603.25001.
Cemri, M.; Pan, M.Z.; Yang, S.; Agrawal, L.A.; Chopra, B.; Tiwari, R.; Keutzer, K.; Parameswaran, A.; Klein, D.; Ramchandran, K.; et al. Why Do Multi-Agent LLM Systems Fail?, 2025, [arXiv:cs.AI/2503.13657]. arXiv preprint arXiv:2503.13657.
Cheng, Z.; Wang, W.; Zhao, Y.; Ren, Z.; Chen, J.; Xu, R.; Huang, S.; Chen, Y.; Li, G.; Wang, M.; et al. LifeBench: A Benchmark for Long-Horizon Multi-Source Memory, 2026, [arXiv:cs.AI/2603.03781]. arXiv preprint arXiv:2603.03781.
Huang, Z.; Liu, W.; Tian, Z.; Chen, W.; Chen, J.; Wu, Y.; Zhang, F.; Guo, Q.; Zhou, X. M3Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions, 2026, [arXiv:cs.CL/2606.07402]. arXiv preprint arXiv:2606.07402. [CrossRef]
Zhu, S.; Yang, Y.; Wang, Z.; Shen, T.; Guo, D.; Yang, M.H. H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions, 2026, [arXiv:cs.CL/2606.09461]. arXiv preprint arXiv:2606.09461.
Bian, H.; Yao, Z.; Hu, S.; Xu, Z.; Zhang, S.; Guo, Y.; Yang, Z.; Han, X.; Wang, H.; Chen, R. RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction, 2026, [arXiv:cs.CL/2601.06966]. arXiv preprint arXiv:2601.06966.
Zhu, J.; Tian, Y.; Li, B.; Wu, K.; Liang, Z.; Li, J.; Zhang, X.; Guo, L.; Chen, F.; Liu, Y.; et al. FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol, 2026, [arXiv:cs.AI/2603.24943]. arXiv preprint arXiv:2603.24943. [CrossRef]
Wang, W.; Niu, P.; Zou, G.; Yang, X.; Wang, J.; Shi, H.; Du, Y.; Chai, J.; Pang, X.; Tang, S.; et al. MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation, 2026, [arXiv:cs.AI/2606.02470]. arXiv preprint arXiv:2606.02470.
Wang, J.; Liu, X.; Li, Y.; Zhang, S.; Wang, Y.; Shan, Z.; Le, X.; Chen, C.; Guan, X.; Tao, D. GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows, 2026, [arXiv:cs.CL/2604.15715]. arXiv preprint arXiv:2604.15715.
Pysklo, H.M.; Zhuravel, A.; Watson, P.D. Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation, 2026, [arXiv:cs.SE/2602.11224]. arXiv preprint arXiv:2602.11224. [CrossRef]
Ma, R.; Shankar, S.; Chen, R.; Lin, Y.; Zeighami, S.; Ghosh, R.; Gupta, A.; Gupta, A.; Gopal, T.; Parameswaran, A.G. Can AI Agents Answer Your Data Questions? A Benchmark for Data Agents, 2026, [arXiv:cs.DB/2603.20576]. arXiv preprint arXiv:2603.20576.
Bandel, E.; Yehudai, A.; Eden, L.; Sagron, Y.; Perlitz, Y.; Venezian, E.; Razinkov, N.; Ergas, N.; Ifergan, S.S.; Shlomov, S.; et al. General Agent Evaluation, 2026, [arXiv:cs.AI/2602.22953]. arXiv preprint arXiv:2602.22953.
Long, X.; Du, L.; Xu, Y.; Liu, F.; Wang, H.; Ding, N.; Li, Z.; Guo, J.; Tang, Y. LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks, 2026, [arXiv:cs.CL/2604.13072]. arXiv preprint arXiv:2604.13072.
Xu, X.; Yang, R.; Shen, H.; Xu, W.; Gao, B.; Wu, R.; Shi, K.; Xie, W.; Chen, X.; Wu, M.; et al. RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades, 2026, [arXiv:cs.SE/2605.15846]. arXiv preprint arXiv:2605.15846. [CrossRef]
SWE-Marathon. SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?, 2026. Project website.
Xu, K.; Lu, X.; Qiao, S.; Ding, Z.; Xu, H.; Liang, L.; Zhang, N. LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis, 2026, [arXiv:cs.LG/2605.30434]. arXiv preprint arXiv:2605.30434.
Chen, H.; Metelski, D.; Qi, L.; Xia, T.; Lee, J.; Brown, S.; Riley, K.; Wang, F.; Liu, T.Y.A.; MD, H.C.; et al. CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?, 2026, [arXiv:cs.CL/2605.16679]. arXiv preprint arXiv:2605.16679.
Xu, W.; Li, S.; Ye, T.; Cao, Q.; Chen, Y.; Gao, H.; Wang, Y.; Li, Q.; Li, K.; Xu, S.; et al. ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research, 2026, [arXiv:cs.LG/2606.07591]. arXiv preprint arXiv:2606.07591. [CrossRef]
Li, X.; Chen, W.; Liu, Y.; Zheng, S.; Chen, X.; He, Y.; Li, Y.; You, B.; Shen, H.; Sun, J.; et al. SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks, 2026, [arXiv:cs.AI/2602.12670]. arXiv preprint arXiv:2602.12670.
Bechard, P.; Ayala, O.M.; Chen, E.; Skelton, J.; Davasam, S.; Sunkara, S.; Yadav, V.; Rajeswar, S. Terminal Agents Suffice for Enterprise Automation, 2026, [arXiv:cs.SE/2604.00073]. arXiv preprint arXiv:2604.00073.
Li, R.; Du, M.; Xu, B.; Zhu, C.; Wang, X.; Mao, Z. DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report, 2026, [arXiv:cs.CL/2601.08536]. arXiv preprint arXiv:2601.08536.
Huang, P.; Zhong, Z.; Wan, Z.; Zhou, D.; Alam, S.; Wang, X.; Li, Z.; Dou, Z.; Zhu, L.; Xiong, J.; et al. MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents, 2026, [arXiv:cs.CV/2601.12346]. arXiv preprint arXiv:2601.12346. [CrossRef]
Feng, Y.; Huang, Q.; Xie, X.; Yang, Z.; Yu, J.; Chen, W.; Tung, A.K.H. IDRBench: Interactive Deep Research Benchmark, 2026, [arXiv:cs.CL/2601.06676]. arXiv preprint arXiv:2601.06676.
Wu, T.; Wang, Y.; Ma, X.; He, X.; Wang, S.; Yin, D.; Zhao, X. DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent, 2026, [arXiv:cs.AI/2603.01152]. arXiv preprint arXiv:2603.01152.
Tian, X.; Wang, H.; Chen, S.; Zhou, H.; Yu, K.; Zhang, Y.; Ouyang, J.; Yin, J.; Chen, J.; Guo, B.; et al. ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas, 2026, [arXiv:cs.CL/2601.21558]. arXiv preprint arXiv:2601.21558. [CrossRef]
Gao, J.; Chen, J.; He, C.; Xu, S.; Jin, D.; Wu, Y. From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents, 2026, [arXiv:cs.AI/2601.22607]. arXiv preprint arXiv:2601.22607.
Zhang, Y.; Zeng, Y.; Li, Q.; Hu, Z.; Han, K.; Zuo, W. Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use, 2025, [arXiv:cs.LG/2509.12867]. arXiv preprint arXiv:2509.12867.
Ding, H.; Liu, P.; Wang, J.; Ji, Z.; Cao, M.; Zhang, R.; Ai, L.; Yang, E.; Shi, T.; Yu, L. DynaWeb: Model-Based Reinforcement Learning of Web Agents, 2026, [arXiv:cs.CL/2601.22149]. arXiv preprint arXiv:2601.22149. [CrossRef]
Da, J.; Wang, C.; Deng, X.; Ma, Y.; Barhate, N.; Hendryx, S. Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards, 2025, [arXiv:cs.CL/2506.11425]. arXiv preprint arXiv:2506.11425.
Xi, Z.; Huang, J.; Liao, C.; Huang, B.; Guo, H.; Liu, J.; Zheng, R.; Ye, J.; Zhang, J.; Chen, W.; et al. AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning, 2025, [arXiv:cs.LG/2509.08755]. arXiv preprint arXiv:2509.08755.
Wang, Z.; Xu, C.; Liu, B.; Wang, Y.; Han, S.; Yao, Z.; Yao, H.; He, Y. Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning, 2026, [arXiv:cs.AI/2602.10090]. arXiv preprint arXiv:2602.10090. [CrossRef]
Ren, Z.; Zhang, X.; Qian, Z.; Gao, Y.; Shi, Y.; Zheng, S.; He, J. GTM: Simulating the World of Tools for AI Agents, 2025, [arXiv:cs.AI/2512.04535]. arXiv preprint arXiv:2512.04535.
Li, W.; Qu, B.; Pan, B.; Zhang, J.; Liu, Z.; Zhang, P.; Chen, W.; Zhang, B. LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent, 2026, [arXiv:cs.AI/2604.17931]. arXiv preprint arXiv:2604.17931.
Wu, X.; Sun, Q.; Zhang, R.; Song, C.; Wu, J.; Qi, Y.; Cheng, H. Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe, 2026, [arXiv:cs.LG/2603.21972]. arXiv preprint arXiv:2603.21972. [CrossRef]
Chen, K.; Cusumano-Towner, M.; Huval, B.; Petrenko, A.; Hamburger, J.; Koltun, V.; Krähenbühl, P. Reinforcement Learning for Long-Horizon Interactive LLM Agents, 2025, [arXiv:cs.LG/2502.01600]. arXiv preprint arXiv:2502.01600.
Luo, J.; Tian, Y.; Cao, C.; Luo, Z.; Lin, H.; Li, K.; Kong, C.; Yang, R.; Ma, J. From Storage to Experience, 2026, [arXiv:cs.AI/2605.06716]. arXiv preprint arXiv:2605.06716.
Allard, M.A.; Teinturier, A.; Xing, V.; Viaud, G. Experiential Reflective Learning for Self-Improving LLM Agents, 2026, [arXiv:cs.LG/2603.24639]. arXiv preprint arXiv:2603.24639. [CrossRef]
He, Z.; Li, Y.; Huang, F.; Chen, T.; Chen, S.; Li, X.; Yu, M.H.; Liu, X.; Wei, L.; Pan, L.; et al. SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training, 2026, [arXiv:cs.AI/2606.02355]. arXiv preprint arXiv:2606.02355.
Lin, H.; Li, P.; Song, J.; Jiang, F.; Zhang, T. MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation, 2026, [arXiv:cs.AI/2605.27366]. arXiv preprint arXiv:2605.27366.
Jiang, Y.; Li, D.; Deng, H.; Ma, B.; Wang, X.; Wang, Q.; Yu, G. SoK: Agentic Skills: Beyond Tool Use in LLM Agents, 2026, [arXiv:cs.CR/2602.20867]. arXiv preprint arXiv:2602.20867.
Zhang, Q.; Hu, C.; Upasani, S.; Ma, B.; Hong, F.; Kamanuru, V.; Rainton, J.; Wu, C.; Ji, M.; Li, H.; et al. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models, 2025, [arXiv:cs.LG/2510.04618]. arXiv preprint arXiv:2510.04618. [CrossRef]
Acikgoz, E.C.; Qian, C.; Ji, H.; Hakkani-Tür, D.; Tur, G. Self-Improving LLM Agents at Test-Time, 2025, [arXiv:cs.LG/2510.07841]. arXiv preprint arXiv:2510.07841.
Li, Y.; Lin, Z.; Deng, A.; Zhang, X.; He, Y.; Ji, S.; Cao, T.; Hooi, B. Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates, 2026, [arXiv:cs.LG/2601.18510]. arXiv preprint arXiv:2601.18510.
Sengupta, B.; Wang, J. HARBOR: Automated Harness Optimization, 2026, [arXiv:cs.LG/2604.20938]. arXiv preprint arXiv:2604.20938.
Lee, Y.; Nair, R.; Zhang, Q.; Lee, K.; Khattab, O.; Finn, C. Meta-Harness: End-to-End Optimization of Model Harnesses, 2026, [arXiv:cs.AI/2603.28052]. arXiv preprint arXiv:2603.28052. [CrossRef]
Pan, W.; Liu, S.; Lin, C.Y.; Zeng, J.; Tang, X.; Zhou, X.; Lu, Y.; Jia, X. Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference, 2026, [arXiv:cs.AI/2606.05922]. arXiv preprint arXiv:2606.05922.
Xu, T.; Wen, H.; Li, M. Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents, 2026, [arXiv:cs.AI/2605.22166]. arXiv preprint arXiv:2605.22166.
Vishnyakova, V.V. Context Engineering: From Prompts to Corporate Multi-Agent Architecture, 2026, [arXiv:cs.AI/2603.09619]. arXiv preprint arXiv:2603.09619.
Pan, L.; Zou, L.; Guo, S.; Ni, J.; Zheng, H.T. Natural-Language Agent Harnesses, 2026, [arXiv:cs.CL/2603.25723]. arXiv preprint arXiv:2603.25723.
He, C.; Zhou, X.; Wang, D.; Xu, H.; Liu, W.; Miao, C. Harness Engineering for Language Agents: The Harness Layer as Control, Agency, and Runtime, 2026. Preprints.org, . [CrossRef]
Brookes, P.; Voskanyan, V.; Giavrimis, R.; Truscott, M.; Ilieva, M.; Pavlou, C.; Staicu, A.; Adham, M.; Evers-Hood, W.; Gong, J.; et al. Evolving Excellence: Automated Optimization of LLM-based Agents, 2025, [arXiv:cs.SE/2512.09108]. arXiv preprint arXiv:2512.09108.
Lam, C.; Li, J.; Zhang, L.; Zhao, K. Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework, 2026, [arXiv:cs.AI/2603.11768]. arXiv preprint arXiv:2603.11768.
Tang, Z.; He, X.; Zhao, T.; Wei, F.; Liu, X.; Dong, P.; Wang, Q.; Li, Q.; Wang, H.; Chen, R.; et al. LLM Agent Memory: A Survey from a Unified Representation–Management Perspective, 2026. Preprints.org, . [CrossRef]
Yue, L.; Bhandari, K.R.; Ko, C.Y.; Patel, D.; Lin, S.; Zhou, N.; Gao, J.; Chen, P.Y.; Pan, S. From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents, 2026, [arXiv:cs.AI/2603.22386]. arXiv preprint arXiv:2603.22386.
Zhang, J.; Xiang, J.; Yu, Z.; Teng, F.; Chen, X.; Chen, J.; Zhuge, M.; Cheng, X.; Hong, S.; Wang, J.; et al. AFlow: Automating Agentic Workflow Generation, 2024, [arXiv:cs.AI/2410.10762]. arXiv preprint arXiv:2410.10762. [CrossRef]
Abuzakuk, S.; Kermarrec, A.M.; Sharma, R.; Veski, R.M.; de Vos, M. Optimizing Agentic Workflows using Meta-tools, 2026, [arXiv:cs.AI/2601.22037]. arXiv preprint arXiv:2601.22037.
Ma, Z.; Zhao, Z.; Hua, C.; Berto, F.; Park, J. JudgeFlow: Agentic Workflow Optimization via Block Judge, 2026, [arXiv:cs.AI/2601.07477]. arXiv preprint arXiv:2601.07477.
Wang, Z.; Liu, X.; Wang, L.; Shan, Z.; Wang, Y.; Song, Z.; Zhang, M. MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems, 2026, [arXiv:cs.AI/2605.06623]. arXiv preprint arXiv:2605.06623.
Li, Z.; Zhang, H.; Han, S.; Liu, S.; Xie, J.; Zhang, Y.; Choi, Y.; Zou, J.; Lu, P. In-the-Flow Agentic System Optimization for Effective Planning and Tool Use, 2025, [arXiv:cs.AI/2510.05592]. arXiv preprint arXiv:2510.05592.
Zhou, H.; Wan, X.; Sun, R.; Palangi, H.; Iqbal, S.; Vulić, I.; Korhonen, A.; Arık, S. Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies, 2025, [arXiv:cs.LG/2502.02533]. arXiv preprint arXiv:2502.02533. [CrossRef]
Zhang, Z.; Ge, L.; Li, H.; Zhu, W.; Zhang, C.; Ye, Y. MAPRO: Recasting Multi-Agent Prompt Optimization as Maximum a Posteriori Inference, 2025, [arXiv:cs.CL/2510.07475]. arXiv preprint arXiv:2510.07475.
Chen, M.; Wang, J.; Liu, Z.; Wang, Y.; Wang, Q. From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws, 2026, [arXiv:cs.SE/2606.06324]. arXiv preprint arXiv:2606.06324.
Seong, H.; Yin, L.; Zhang, H.; Shi, Z. The Last Harness You’ll Ever Build, 2026, [arXiv:cs.AI/2604.21003]. arXiv preprint arXiv:2604.21003.
Karten, S.; Zhang, J.; Upaa, T.; Feng, R.; Li, W.; Shi, C.; Jin, C.; Vodrahalli, K. Continual Harness: Online Adaptation for Self-Improving Foundation Agents, 2026, [arXiv:cs.LG/2605.09998]. arXiv preprint arXiv:2605.09998. [CrossRef]
Tan, H.Z.; Yang, X.W.; Chen, H.; Shao, J.J.; Wen, Y.; Shen, Y.; Luo, W.; Du, X.; Guo, L.Z.; Li, Y.F. Hindsight Credit Assignment for Long-Horizon LLM Agents, 2026, [arXiv:cs.LG/2603.08754]. arXiv preprint arXiv:2603.08754.
Zhang, Y.; Fang, M.; Chen, Z.; Pechenizkiy, M. Self-evolving LLM agents with in-distribution Optimization, 2026, [arXiv:cs.LG/2606.07367]. arXiv preprint arXiv:2606.07367.
Zhang, C. From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models, 2026, [arXiv:cs.CL/2604.09459]. arXiv preprint arXiv:2604.09459.
Jiayang, C.; Ru, D.; Qiu, L.; Li, Y.; Cao, X.; Song, Y.; Cai, X. AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations, 2026, [arXiv:cs.CL/2603.01966]. arXiv preprint arXiv:2603.01966. [CrossRef]
Chen, Y.; Lai, H.; Feng, Y.; Han, C.; Zhang, Q.; Lu, B.; Li, M.; Wang, X.; Wang, Z.; Xu, S.; et al. Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents, 2026, [arXiv:cs.AI/2606.06090]. arXiv preprint arXiv:2606.06090.
Rafique, M.; Bindschaedler, L. ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents, 2026, [arXiv:cs.AI/2604.10352]. arXiv preprint arXiv:2604.10352.
Zong, X.; Shen, Z.; Wang, L.; Lan, Y.; Yang, C. MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers, 2025, [arXiv:cs.CL/2512.15163]. arXiv preprint arXiv:2512.15163. [CrossRef]
Zhang, D.; Li, Z.; Luo, X.; Liu, X.; Li, P.; Xu, W. MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents, 2025, [arXiv:cs.CR/2510.15994]. arXiv preprint arXiv:2510.15994.
Liu, S.; Tang, X.; Yang, X.; Lin, L.; Zhou, B.; Xiao, W.; Liu, W. When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents, 2026, [arXiv:cs.CR/2605.24069]. arXiv preprint arXiv:2605.24069.
Lee, L.F.; Chang, Y.Y.; Yu, C.M.; Yeh, K.H. WebMCP Tool Surface Poisoning: Runtime Manipulation Attacks on LLM Agents, 2026, [arXiv:cs.CR/2606.06387]. arXiv preprint arXiv:2606.06387. [CrossRef]
Ling, Y.; Yu, S.; Chen, Z.; Fang, C. Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation, 2026, [arXiv:cs.CR/2606.10749]. arXiv preprint arXiv:2606.10749.
Sharma, R.K. ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files, 2026, [arXiv:cs.SE/2603.00822]. arXiv preprint arXiv:2603.00822.
Wang, H.; Poskitt, C.M.; Sun, J. AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents, 2025, [arXiv:cs.AI/2503.18666]. arXiv preprint arXiv:2503.18666.
Kaptein, M.; Khan, V.J.; Podstavnychy, A. Runtime Governance for AI Agents: Policies on Paths, 2026, [arXiv:cs.AI/2603.16586]. arXiv preprint arXiv:2603.16586. [CrossRef]
Errico, H. Autonomous Action Runtime Management (AARM): A System Specification for Securing AI-Driven Actions at Runtime, 2026, [arXiv:cs.CR/2602.09433]. arXiv preprint arXiv:2602.09433.
Bhardwaj, V.P. Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents, 2026, [arXiv:cs.AI/2602.22302]. arXiv preprint arXiv:2602.22302.
Lin, X.; Liu, Y.; Chen, Y.; Wu, Y.; Ning, Y.; Liu, Y.; Sun, N.; Zhang, S.; Chong, B.; Zhou, C.; et al. SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment, 2026, [arXiv:cs.CR/2604.13630]. arXiv preprint arXiv:2604.13630. [CrossRef]
Qin, X.; Luan, S.; See, J.; Boukhers, Z.; Yang, C.; Li, Z. Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution, 2026, [arXiv:cs.RO/2604.07833]. arXiv preprint arXiv:2604.07833.
Ning, X.; Tieu, K.; Fu, D.; Wei, T.; Li, Z.; Bei, Y.; Zou, J.; Ai, M.; Liu, Z.; Li, T.W.; et al. Code as Agent Harness, 2026, [arXiv:cs.CL/2605.18747]. arXiv preprint arXiv:2605.18747.
Lin, M.; Wu, J.; Wang, Z.; Shi, Z.; Sang, Y.; He, B.; Liu, Z.; Wei, T.; Wu, Z.; Zhang, Z.; et al. Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents, 2026, [arXiv:cs.AI/2605.30621]. arXiv preprint arXiv:2605.30621.
Shao, S.; Ren, Q.; Qian, C.; Wei, B.; Guo, D.; Yang, J.; Song, X.; Zhang, L.; Zhang, W.; Liu, D.; et al. Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents, 2025, [arXiv:cs.AI/2509.26354]. arXiv preprint arXiv:2509.26354. [CrossRef]
Doshi, A.; Hong, Y.; Xu, C.; Kang, E.; Kapravelos, A.; Kästner, C. Towards Verifiably Safe Tool Use for LLM Agents, 2026, [arXiv:cs.SE/2601.08012]. arXiv preprint arXiv:2601.08012.
Zaharia, M.; Khattab, O.; Chen, L.; Davis, J.Q.; Miller, H.; Potts, C.; Zou, J.; Carbin, M.; Frankle, J.; Rao, N.; et al. The Shift from Models to Compound AI Systems, 2024. BAIR blog.
Meng, Q.; Wang, Y.; Chen, L.; Li, Y.; Wu, W.; Jiang, W.; Wang, Q.; Lu, C.; Gao, Y.; Wu, Y.; et al. Agent Harness for Large Language Model Agents: A Survey. Preprints 2026. Preprint, . [CrossRef]
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training Language Models to Follow Instructions with Human Feedback. In Proceedings of the Advances in Neural Information Processing Systems, 2022, Vol. 35, pp. 27730–27744, [arXiv:cs.CL/2203.02155]. [CrossRef]
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, 2020, Vol. 33, pp. 1877–1901, [arXiv:cs.CL/2005.14165]. [CrossRef]
Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. ACM, 2021, pp. 610–623. [CrossRef]
Weidinger, L.; Mellor, J.; Rauh, M.; Griffin, C.; Uesato, J.; Huang, P.S.; Cheng, M.; Glaese, M.; Balle, B.; Kasirzadeh, A.; et al. Ethical and Social Risks of Harm from Language Models, 2021, [arXiv:cs.CL/2112.04359]. arXiv preprint arXiv:2112.04359, . [CrossRef]
Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys 2023, 55, 1–35. [CrossRef]
Christiano, P.F.; Leike, J.; Brown, T.B.; Martic, M.; Legg, S.; Amodei, D. Deep Reinforcement Learning from Human Preferences. In Proceedings of the Advances in Neural Information Processing Systems, 2017, Vol. 30, [arXiv:stat.ML/1706.03741]. [CrossRef]
Ziegler, D.M.; Stiennon, N.; Wu, J.; Brown, T.B.; Radford, A.; Amodei, D.; Christiano, P.; Irving, G. Fine-Tuning Language Models from Human Preferences, 2019, [arXiv:cs.CL/1909.08593]. arXiv preprint arXiv:1909.08593, . [CrossRef]
Stiennon, N.; Ouyang, L.; Wu, J.; Ziegler, D.M.; Lowe, R.; Voss, C.; Radford, A.; Amodei, D.; Christiano, P. Learning to Summarize from Human Feedback. In Proceedings of the Advances in Neural Information Processing Systems, 2020, Vol. 33, pp. 3008–3021, [arXiv:cs.CL/2009.01325]. [CrossRef]
Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022, [arXiv:cs.CL/2204.05862]. arXiv preprint arXiv:2204.05862, . [CrossRef]
Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C.D.; Ermon, S.; Finn, C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Proceedings of the Advances in Neural Information Processing Systems, 2023, Vol. 36, pp. 53728–53741, [arXiv:cs.LG/2305.18290]. [CrossRef]
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems, 2022, Vol. 35, pp. 24824–24837, [arXiv:cs.CL/2201.11903]. [CrossRef]
Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. In Proceedings of the Advances in Neural Information Processing Systems, 2022, Vol. 35, pp. 22199–22213, [arXiv:cs.CL/2205.11916]. [CrossRef]
Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.V.; Chi, E.H.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the International Conference on Learning Representations, 2023, [arXiv:cs.CL/2203.11171]. [CrossRef]
Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Cao, Y.; Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems, 2023, Vol. 36, [arXiv:cs.CL/2305.10601]. [CrossRef]
Chu, Z.; Chen, J.; Chen, Q.; Yu, W.; He, T.; Wang, H.; Peng, W.; Liu, M.; Qin, B.; Liu, T. Navigate through Enigmatic Labyrinth A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future. In Proceedings of the Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2024, pp. 1173–1203. [CrossRef]
Schulhoff, S.; Ilie, M.; Balepur, N.; Kahadze, K.; Liu, A.; Si, C.; Li, Y.; Gupta, A.; Han, H.; Schulhoff, S.; et al. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques, 2024, [arXiv:cs.CL/2406.06608]. arXiv preprint arXiv:2406.06608, . [CrossRef]
OpenAI. Learning to Reason with LLMs, 2024. OpenAI release, September 12, 2024.
DeepSeek-AI.; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature 2025, 645, 633–638, [arXiv:cs.CL/2501.12948]. [CrossRef]
Meincke, L.; Mollick, E.R.; Mollick, L.; Shapiro, D. Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting, 2025, [arXiv:cs.CL/2506.07142]. SSRN working paper, . [CrossRef]
OpenAI. Reasoning Best Practices, 2025. OpenAI developer documentation, accessed June 13, 2026.
Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. The Rise and Potential of Large Language Model Based Agents: A Survey. Science China Information Sciences 2025, 68, 121101, [arXiv:cs.AI/2309.07864]. [CrossRef]
Xie, Y.; Zhu, C.; Zhang, X.; Zhu, T.; Ye, D.; Qi, M.; Chen, H.; Zhou, W. From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration, 2026, [arXiv:cs.MA/2603.04474]. arXiv preprint arXiv:2603.04474.
Zhou, S.; Xu, F.F.; Zhu, H.; Zhou, X.; Lo, R.; Sridhar, A.; Cheng, X.; Ou, T.; Bisk, Y.; Fried, D.; et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. In Proceedings of the International Conference on Learning Representations, 2024, [arXiv:cs.AI/2307.13854].
Jimenez, C.E.; Yang, J.; Wettig, A.; Yao, S.; Pei, K.; Press, O.; Narasimhan, K. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In Proceedings of the International Conference on Learning Representations, 2024, [arXiv:cs.CL/2310.06770].
Xu, F.F.; et al. TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. In Proceedings of the Advances in Neural Information Processing Systems, 2025, [arXiv:cs.CL/2412.14161].
Yao, S.; Shinn, N.; Razavi, P.; Narasimhan, K. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. In Proceedings of the International Conference on Learning Representations, 2025, [arXiv:cs.AI/2406.12045].
Liu, X.; Yu, H.; Zhang, H.; Xu, Y.; Lei, X.; Lai, H.; Gu, Y.; Ding, H.; Men, K.; Yang, K.; et al. AgentBench: Evaluating LLMs as Agents. In Proceedings of the International Conference on Learning Representations, 2024, [arXiv:cs.AI/2308.03688]. [CrossRef]
Huang, J.; Chen, X.; Mishra, S.; Zheng, H.S.; Yu, A.W.; Song, X.; Zhou, D. Large Language Models Cannot Self-Correct Reasoning Yet. In Proceedings of the International Conference on Learning Representations, 2024, [arXiv:cs.CL/2310.01798].
Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. In Proceedings of the Conference on Language Modeling, 2024, [arXiv:cs.AI/2308.08155]. [CrossRef]
OpenAI. Function Calling and Other API Updates, 2023. OpenAI blog, June 13, 2023.
Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. In Proceedings of the Advances in Neural Information Processing Systems, 2023, Vol. 36, [arXiv:cs.CL/2302.04761]. [CrossRef]
Anthropic. Introducing Computer Use, a New Claude 3.5 Sonnet, and Claude 3.5 Haiku, 2024. Anthropic news, October 22, 2024.
Patil, S.G.; Zhang, T.; Wang, X.; Gonzalez, J.E. Gorilla: Large Language Model Connected with Massive APIs. In Proceedings of the Advances in Neural Information Processing Systems, 2024, Vol. 37, [arXiv:cs.CL/2305.15334]. [CrossRef]
Kwa, T.; West, B.; Becker, J.; Deng, A.; Garcia, K.; Hasin, M.; Jawhar, S.; Kinniment, M.; Rush, N.; Arx, S.V.; et al. Measuring AI Ability to Complete Long Software Tasks. In Proceedings of the Advances in Neural Information Processing Systems, 2025, Vol. 38, [arXiv:cs.AI/2503.14499]. [CrossRef]
Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Guo, Q.; Wang, M.; et al. Retrieval-Augmented Generation for Large Language Models: A Survey, 2023, [arXiv:cs.CL/2312.10997]. arXiv preprint arXiv:2312.10997, . [CrossRef]
Xu, Z.; Jain, S.; Kankanhalli, M. Hallucination is Inevitable: An Innate Limitation of Large Language Models, 2024, [arXiv:cs.CL/2401.11817]. arXiv preprint arXiv:2401.11817, . [CrossRef]

Figure 1. Tree-structured organization of the literature in the three analytical sections. The primary branches correspond to Harness Component Taxonomy, Harness Evaluation, and Harness Evolution, and the leaf nodes show the internal categories used to organize cited work.

Figure 2. The agent harness as external support structures around the core agent loop. At the center, the model performs reasoning and generation within the recurring action–feedback loop: it chooses an action that changes the environment, and the resulting feedback enters the next round of inference. Surrounding the loop are the seven harness component families, each annotated with the recurrent capability gap it compensates for: context management (limited attention), workflow and orchestration (unstable multi-step control), state and memory (non-persistent state), tools and interfaces (closed native action space), verification and guardrails (result verification), observability (process recording), and execution environments (external execution). The bottom layer denotes the task environment the harness mediates access to, including repositories, browsers, APIs, GUIs, files, and users or organizations.

Figure 3. The bidirectional coevolution loop between models and harnesses and its evolving boundary. The upper arc (Harness → Model) denotes harness-driven model improvement, in which runtime trajectories, feedback, and accumulated experience supported by the harness become training signals that move externally supported capabilities toward model-side behavior. The lower arc (Model → Harness) denotes model-driven harness optimization, in which improved native capability drives redesign of the harness. Between the two, the evolving model–harness boundary follows three parallel paths: capability internalization, in which stable, standardized, low-level support migrates inward toward the model; governance expansion, in which the harness thickens its state-management, tool-coordination, security, and governance functions as task complexity, environmental openness, and risk increase; and system-level reorganization, in which scattered and temporary support mechanisms are consolidated into standardized, auditable, and reversible runtime structures. The boundary is therefore layered and dynamic rather than fixed: routine capability-bearing support moves toward the model, while constraint-bearing governance remains external.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.