Submitted:
21 April 2026
Posted:
23 April 2026
Abstract
Keywords:
1. Introduction
| Public source | Year | Object | Why it matters for the harness layer |
|---|---|---|---|
| Anthropic evals note [4] | 2026 | agent harness / evaluation harness | defines the harness as the system that enables action and makes explicit that agent evaluation measures harness plus model |
| OpenAI harness note [5] | 2026 | harness engineering | names the broader practice and ties it to project guidance, constraints, cleanup loops, and long-horizon coding work |
| Anthropic application-harness note [9] | 2026 | long-running application harness | shows generator–evaluator structure, artifact handoffs, and context resets as first-class harness levers |
| Anthropic Managed Agents note [10] | 2026 | managed agents / meta-harness | argues for stable interfaces around evolving harnesses and decouples long-horizon sessions from any single sandbox implementation |
| OpenAI Agents SDK update [11] | 2026 | model-native harness / SDK | packages sandboxed execution, files, approvals, and resume bookkeeping as reusable harness primitives |
| OpenAI Codex notes [26,27] | 2026 | Codex harness | treats the harness as reusable runtime logic that can power multiple surfaces, not just a one-shot prompt wrapper |
| Anthropic long-running note [12] | 2025 | long-running harnesses | makes state externalization, progress tracking, and recovery first-class engineering levers |
| OpenAI developer guidance [6,7,8] | 2026 | AGENTS.md, long-horizon tasks | shows the harness as durable instruction plus executable verification and repair loops |
| MCP specification [28] | 2025 | protocol layer | moves tool interoperability and permission boundaries into an explicit systems interface |
2. From Software Engineering to Prompt, Context, & Harness Engineering
2.1. The Label Is New, the Design Problem Is Not
2.2. Canonical Public Examples
3. A Working Definition: The Harness Layer
Control.
Agency.
Runtime.
Two concrete mini-cases.
4. A Descriptive Audit of the Evidence Base
5. Why the Harness Layer Changes What Counts as Progress
Many reported agent gains can be partly harness-sensitive.
Evaluation must become harness-sensitive.
Reproducibility now depends on the harness layer.
6. Why NLP Should Treat the Harness Layer as an Explicit Object of Study
Two likely objections.
7. Research Questions from the Harness-Layer Lens
8. Design Patterns & Failure Modes
9. A Research & Reporting Agenda
First, study control as executable specification.
Second, treat agency as an interface question.
Third, treat runtime as a scientific variable.
Fourth, normalize reporting of the harness.
Fifth, build layer-aware baselines.
10. Conclusions
Acknowledgments
Limitations
Ethical Considerations
Appendix A. Evidence Base, Public Formulations, and Exhaustive Inventory
Appendix A.1. Evidence Base and Selection Logic

| View | Breakdown | Count |
|---|---|---|
| Total cited sources | unique cited sources | 75 |
| Scope split | in-scope harness-relevant works | 63 |
| Scope split | adjacent framing pieces | 12 |
| In-scope source type | papers or benchmarks | 38 |
| In-scope source type | engineering notes, protocol documents, developer guides, or technical articles | 25 |
| In-scope time band | 2023 or earlier | 12/63 |
| In-scope time band | 2024 | 10/63 |
| In-scope time band | 2025 | 10/63 |
| In-scope time band | 2026 | 31/63 |
| In-scope time band | 2024–2026 combined | 51/63 |
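
The counts in the table above can be cross-checked with simple arithmetic. The sketch below only restates the table's numbers; the variable names are illustrative and do not come from the paper.

```python
# Counts copied from the evidence-base table; variable names are illustrative.
total_sources = 75
in_scope, adjacent = 63, 12
papers_or_benchmarks, notes_and_guides = 38, 25
by_band = {"2023 or earlier": 12, "2024": 10, "2025": 10, "2026": 31}

# Each split should partition its parent count exactly.
assert in_scope + adjacent == total_sources
assert papers_or_benchmarks + notes_and_guides == in_scope
assert sum(by_band.values()) == in_scope

# Recompute the combined 2024-2026 band reported in the last row.
recent = sum(n for band, n in by_band.items() if band != "2023 or earlier")
print(f"2024-2026: {recent}/{in_scope} ({recent / in_scope:.0%})")
```

Run as written, this prints `2024-2026: 51/63 (81%)`, matching the combined row.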
Appendix A.2. Public Formulations and Official Examples
| Source | Year | Object or term | What it contributes to the concept |
|---|---|---|---|
| Anthropic evals note [4] | 2026 | agent harness / evaluation harness | defines the harness as the system that enables a model to act as an agent and stresses that agent evaluation measures harness plus model together |
| OpenAI harness note [5] | 2026 | harness engineering | names the broader practice and ties it to durable instructions, architectural constraints, cleanup loops, and long-horizon coding work |
| Anthropic application-harness note [9] | 2026 | long-running application harness | shows generator–evaluator structure, artifact handoffs, context resets, and task-list discipline as explicit harness levers |
| Anthropic Managed Agents note [10] | 2026 | managed agents / meta-harness | argues for stable interfaces that outlast any particular harness, session, or sandbox implementation |
| OpenAI Agents SDK update [11] | 2026 | model-native harness / SDK | packages sandbox execution, file inspection, approvals, and resume support as reusable harness primitives |
| OpenAI Codex harness note [26] | 2026 | Codex harness | treats the harness as reusable runtime logic and protocol support rather than a one-shot prompt surface |
| Anthropic long-running note [12] | 2025 | long-running harnesses | makes state externalization, progress tracking, clean-state discipline, and resumability central design levers |
| OpenAI developer guidance [6,7,8] | 2026 | AGENTS.md, long-horizon tasks | provides concrete examples of harness work as durable project guidance plus iterative plan–act–test–repair loops |
| MCP specification [28] | 2025 | protocol layer | shows that tool interoperability and permission boundaries can themselves be harness design objects |
Appendix A.3. Exhaustive in-Scope Works
| Work | Year | Type | Why it is in scope | Primary harness component(s) |
|---|---|---|---|---|
| Yao et al. [17] | 2022 | precursor paper | couples reasoning and acting in trajectories | agency, control |
| Schick et al. [18] | 2023 | paper | makes tool calls an explicit part of model behavior | interfaces, agency |
| Shinn et al. [68] | 2023 | paper | treats reflection as trajectory repair and memory support | runtime, feedback |
| Wang et al. [54] | 2023 | paper | shows skill libraries and resumable state for long tasks | runtime, feedback |
| Li et al. [69] | 2023 | paper | role-structured multi-agent execution | agency |
| Li et al. [19] | 2023 | benchmark | API tool use and tool evaluation | interfaces, observability |
| Liu et al. [20] | 2023 | benchmark | interactive agent evaluation | observability |
| Zhou et al. [21] | 2023 | benchmark | web interaction as environment-level action | interfaces, observability |
| Wu et al. [43] | 2024 | paper | orchestrated multi-agent conversations and control logic | agency, control |
| Hong et al. [58] | 2023 | paper | specialist-agent software workflows | agency |
| Mialon et al. [29] | 2023 | benchmark | general-assistant tasks that stress tool use and grounded action | observability, agency |
| Jimenez et al. [22] | 2023 | benchmark | executable software tasks and hidden tests | feedback, observability |
| Khattab et al. [70] | 2023 | paper | LM pipeline compilation and self-improvement | control, feedback |
| Qian et al. [59] | 2024 | paper | role-specialized software agents | agency |
| Wang et al. [23] | 2024 | paper | code as an executable action substrate | interfaces, agency |
| Xie et al. [25] | 2024 | benchmark | open computer-use environment | interfaces, observability |
| Yang et al. [24] | 2024 | paper | agent-computer interface for software engineering | interfaces, control |
| Yao et al. [60] | 2024 | benchmark | policy-aware tool-agent-user interaction | governance, observability |
| Wang et al. [32] | 2024 | paper | open software-agent platform with sandboxed runtime | interfaces, governance, runtime |
| Xu et al. [71] | 2024 | benchmark | consequential workplace tasks for agents | governance, observability |
| Pan et al. [53] | 2024 | paper | trainable loops with verifiers and repair | feedback, observability |
| Wei et al. [30] | 2025 | benchmark | persistent browsing and recovery | interfaces, runtime |
| Schluntz and Zhang [72] | 2024 | note | codifies practical agent patterns and the workflow/agent distinction | control, feedback |
| Anthropic [73] | 2025 | note | explicit reflection insertion in tool loops | feedback |
| Hadfield et al. [13] | 2025 | note | orchestrator–worker research system | agency |
| Rajasekaran et al. [3] | 2025 | note | state curation and compaction as engineering | runtime |
| Aizawa et al. [66] | 2025 | note | interface and tool-surface design for agents | interfaces, governance |
| Dworken et al. [67] | 2025 | note | security and containment for coding agents | governance |
| Wu et al. [74] | 2025 | note | large tool surfaces and long-running tool access | interfaces, runtime |
| Young [12] | 2025 | note | state externalization, resumability, and recovery | runtime |
| Grace et al. [4] | 2026 | note | agent-harness and evaluation-harness distinction | observability, feedback |
| Segato [55] | 2026 | note | infrastructure variance in agentic evaluation | observability |
| Model Context Protocol [28] | 2025 | protocol | tool interoperability and permissions as protocol objects | interfaces, governance |
| Lopopolo [5] | 2026 | note | explicit naming of harness engineering and its practical levers | control, runtime, governance, observability |
| Chen [26] | 2026 | note | reusable runtime logic and protocol support | runtime, interfaces |
| Bolin [27] | 2026 | note | stepwise action loop and runtime structure | agency, runtime |
| OpenAI [6] | 2026 | developer guide | durable in-repository instruction surfaces | control |
| OpenAI [7] | 2026 | developer guide | operational guidance for agentic coding loops | control, runtime |
| Choi [8] | 2026 | developer blog | long-horizon execution discipline and resumability | runtime, control |
| Böckeler [75] | 2026 | technical article | software-architecture articulation of the harness concept | control, governance, observability |
| Pan et al. [14] | 2026 | paper | externalizes harness behavior as portable natural-language control plus shared runtime | control, runtime |
| Lee et al. [15] | 2026 | paper | treats harness code itself as an optimization target and searches over harness designs | runtime, feedback |
| Bandel et al. [34] | 2026 | paper | frames general-agent evaluation as a unified protocol and infrastructure problem | interfaces, observability |
| Bandi et al. [35] | 2026 | benchmark | evaluates real MCP-server tool use with a containerized harness and rich diagnostics | interfaces, observability |
| Ursekar et al. [36] | 2026 | paper | introduces a reproducible evaluation harness for agent optimization | feedback, observability |
| Rafique and Bindschaedler [44] | 2026 | paper | makes memory durability and prompt-state residency explicit harness responsibilities | runtime |
| Jha et al. [46] | 2026 | benchmark | brings production-derived coding tasks and verification signals into agent evaluation | feedback, observability |
| Pradel et al. [47] | 2026 | paper | studies agent architectures on tool-and-project setup plus validation-heavy analysis tasks | interfaces, feedback, observability |
| Li et al. [50] | 2026 | benchmark | provides trajectory-level safety evaluation with long-horizon delayed triggers and diagnosis | governance, observability |
| Wang et al. [49] | 2026 | benchmark | diagnoses cross-domain long-horizon failures with trajectory-grounded attribution | runtime, observability |
| Stein et al. [51] | 2026 | paper | audits large trace collections for sparse safety violations and benchmark gaming | governance, observability |
| Rajasekaran [9] | 2026 | note | shows long-running application harness design with generator–evaluator structure and artifact handoffs | control, runtime |
| Martin et al. [10] | 2026 | note | introduces a meta-harness that decouples harness logic from sessions and sandboxes | interfaces, runtime, governance |
| OpenAI [11] | 2026 | product note | packages sandbox execution, files, approvals, and resume bookkeeping as reusable harness primitives | interfaces, runtime, control |
| Anthropic [61] | 2026 | note | studies classifier-mediated auto-approval as a harness governance mechanism | governance, control |
| Bui [33] | 2026 | technical article | explicitly distinguishes scaffolding, harness, and context engineering in terminal coding agents | control, runtime, interfaces |
| Merrill et al. [31] | 2026 | benchmark | hard terminal benchmark with published evaluation harness and multi-run agent analysis | interfaces, runtime, observability |
| Kapoor et al. [37] | 2025 | paper | standardizes large-scale agent evaluation harnesses and releases logs for cross-scaffold analysis | observability, runtime, feedback |
| Waters et al. [48] | 2026 | benchmark | adapts an agent harness to expert-written STEM workflows with mixed rubric and exact-match grading | feedback, observability, interfaces |
| Anthropic [56] | 2026 | note | documents benchmark-aware agent behavior in a web-enabled evaluation setting | observability, governance |
| OpenAI [62] | 2026 | note | makes monitoring infrastructure and trace review part of the deployment harness | governance, observability |
| Rabanser et al. [52] | 2026 | paper | decomposes agent performance into consistency, robustness, predictability, and safety metrics | observability, feedback |
| Ndzomga [57] | 2026 | paper | studies scaffold-driven distribution shift in cost-aware agent benchmarking | observability, runtime |
Appendix B. HarnessCard Materials
Appendix B.1. Expanded HarnessCard
| Field | What should be disclosed | Priority |
|---|---|---|
| Base model(s) | model name, version, decoding settings, and any finetuning or adapters | Required |
| Control artifacts | system instructions, AGENTS.md, repo maps, architecture rules, schemas, tests, linters, done-when criteria | Required |
| Runtime policy | memory type, compaction or summarization policy, checkpointing, retry or rollback policy, budget limits | Required |
| Action substrate | tools, APIs, browser or GUI access, code execution, interface schemas, MCP usage | Required |
| Execution topology | single-agent vs multi-agent structure, planner/verifier roles, reviewer loops, routing logic | Required |
| Feedback stack | tests, graders, reflection prompts, hidden checks, human interventions, or repair loops | Required |
| Governance layer | permissions, sandboxing, escalation rules, policy checks, provenance logging, audit support | Required |
| Observability | stored traces, replay support, latency and cost logging, failure categories | Required |
| Evaluation protocol | task set, number of runs, success criteria, variance treatment, held-out checks or budget limits | Required |
| Release artifacts | prompts or programs, tool specs, traces, configs, environment setup, reproducibility notes | Recommended |
| Known limitations and risks | unresolved failure modes, portability caveats, safety concerns, or red-team findings | Recommended |
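
One way to make the Required and Recommended priorities checkable is to hold a card as a small data structure and list the fields left empty. The following is a minimal sketch under stated assumptions: the `HarnessCard` class, its snake_case field names, and the `missing_fields` helper are hypothetical and only mirror the rows of the table above; they are not a published schema.

```python
from dataclasses import dataclass, fields

# Hypothetical encoding of the table rows; priorities follow the table.
REQUIRED = (
    "base_models", "control_artifacts", "runtime_policy", "action_substrate",
    "execution_topology", "feedback_stack", "governance_layer",
    "observability", "evaluation_protocol",
)
RECOMMENDED = ("release_artifacts", "known_limitations_and_risks")


@dataclass
class HarnessCard:
    # Every field is free text; an empty string means "not disclosed".
    base_models: str = ""
    control_artifacts: str = ""
    runtime_policy: str = ""
    action_substrate: str = ""
    execution_topology: str = ""
    feedback_stack: str = ""
    governance_layer: str = ""
    observability: str = ""
    evaluation_protocol: str = ""
    release_artifacts: str = ""
    known_limitations_and_risks: str = ""


def missing_fields(card: HarnessCard) -> dict[str, list[str]]:
    """Return the undisclosed (empty) fields, grouped by priority."""
    empty = [f.name for f in fields(card) if not getattr(card, f.name).strip()]
    return {
        "required": [name for name in empty if name in REQUIRED],
        "recommended": [name for name in empty if name in RECOMMENDED],
    }


# A partially filled card: two of the nine required fields are disclosed.
card = HarnessCard(
    base_models="model name and version",
    action_substrate="shell, file edits",
)
report = missing_fields(card)
print(len(report["required"]), "required fields still undisclosed")
```

A check of this kind could run before a paper or benchmark submission, so that an incomplete disclosure is caught mechanically rather than by a reviewer.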
Appendix B.2. Illustrative HarnessCard: Repository Coding Agent
| Field | Illustrative disclosure | Why it matters |
|---|---|---|
| Base model(s) | frontier coding model configured through repo or user profiles; effort tuned for long tasks | keeps model choice distinct from harness choice |
| Control artifacts | root-level AGENTS.md; repository map; build/test/lint commands; architecture rules; done-when criteria | reveals the durable instructions and constraints the agent actually reads |
| Runtime policy | repository treated as system of record; thread history; progress file; compaction near context limits; bounded retries | makes long-horizon state handling explicit |
| Action substrate | file edits, shell commands, test runs, diff generation, PR review, optional MCP tools | discloses what the model can actually do in the environment |
| Execution topology | plan → edit → run tools → observe → repair → update status → repeat; optional reviewer loop | captures the control structure rather than only the model |
| Feedback stack | failing tests, custom linter messages, self-review, grader checks, occasional human review | surfaces the verification signals that shape behavior |
| Governance layer | sandbox mode, approval policy for privileged actions, least-privilege connectors, audit trail | keeps permissions and safety visible rather than implicit |
| Observability | persisted thread events, replay support, latency and cost logs, categorized failures | makes debugging and comparison scientifically possible |
| Success criteria | merged change passes required checks, stays within budget, and leaves updated status artifacts | completion becomes operationally verifiable, not merely verbal |
| Known risks | stale docs, state drift, verifier overfitting, hidden human intervention, over-trusting automated review | shows why limitation disclosure belongs inside the reporting standard |
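
The execution topology and "bounded retries" entries in the table above describe a control loop, which can be sketched in a few lines. This is an illustrative toy, not any vendor's implementation: `run_loop`, `toy_act`, and `toy_check` are hypothetical names, and the model call and the test run are replaced by stand-in functions.

```python
from typing import Callable


def run_loop(
    act: Callable[[str], str],
    check: Callable[[str], tuple[bool, str]],
    task: str,
    max_attempts: int = 3,
) -> dict:
    """Act, verify, and repair with a bounded number of attempts."""
    status = {"task": task, "attempts": 0, "done": False, "log": []}
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        # Plan/edit: feed the previous verifier feedback into the next action.
        prompt = task if not feedback else f"{task}\nFix: {feedback}"
        artifact = act(prompt)
        # Run tools and observe: the verifier returns (passed, feedback).
        passed, feedback = check(artifact)
        # Update status: keep an externalized record of every attempt.
        status["attempts"] = attempt
        status["log"].append(
            {"attempt": attempt, "passed": passed, "feedback": feedback}
        )
        if passed:
            status["done"] = True
            break
    return status


# Stand-ins for a model call and a test run (both hypothetical).
def toy_act(prompt: str) -> str:
    return "patched" if "Fix:" in prompt else "draft"


def toy_check(artifact: str) -> tuple[bool, str]:
    return (True, "") if artifact == "patched" else (False, "tests failing")


result = run_loop(toy_act, toy_check, "add a unit test")
print(result["done"], result["attempts"])
```

With these stand-ins the first attempt fails, the feedback is fed back, and the second attempt passes, so the script prints `True 2`. The `status` dictionary plays the role of the progress file: it records attempts and verifier feedback outside the model's context.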
Appendix C. Search Strings, Glossary, and Additional Mini-Cases
| Term | Working meaning in this paper |
|---|---|
| Prompt engineering | writing and organizing instructions, examples, and role structure for desired model behavior |
| Context engineering | curating the evolving token state supplied to the model, including retrieval, memory, and tool context [3] |
| Agent harness / scaffold | the extra-model system that enables a model to act as an agent [4] |
| Harness layer | the extra-model layer that, in this paper’s working definition, couples control artifacts, mediated action interfaces, and runtime policies into governed execution |
| Harness engineering | the design and maintenance of the control, agency, and runtime layer around the model [5] |
| Meta-harness | a more stable interface layer around evolving task-specific harnesses, sessions, and sandboxes [10] |
| Evaluation harness | the system that turns tasks, metrics, graders, and infrastructure into an executable evaluation regime [4] |
| Action substrate | the interface through which the agent can act: code, shell, browser, GUI, APIs, or role-structured delegation |
| Task family | Control levers | Runtime levers | Agency levers and likely risks |
|---|---|---|---|
| Repository coding agent | repo map, AGENTS.md, tests, linters, architectural constraints | compaction, checkpoints, retries, cleanup passes, cost budgets | shell/file edit/PR review; risks include stale docs, verifier overfitting, and hidden human repair |
| Browser or research agent | source hierarchy, citation rules, task decomposition, grading rubric | search history, scratchpads, branching traces, escalation on uncertainty | browse, fetch, cite, summarize; risks include source drift, unsupported synthesis, and provenance loss |
| Enterprise support agent | policy text, workflow scripts, escalation rules, approval thresholds | queue state, customer history, retry and timeout policy, audit logs | tool/API access, human handoff, permissions; risks include privacy leakage, over-escalation, and inconsistent policy application |
| Agent optimizer / evaluation harness | target-agent spec, reference evaluation procedure, budget policy | versioned snapshots, multi-run aggregation, trace capture, replay | code edits plus edit–execute–evaluate loops; risks include grader gaming, reward misspecification, and infrastructure variance |
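
The "multi-run aggregation" lever in the last row can be sketched as grouping repeated runs by model and harness, so that spread across runs and differences between harnesses are reported separately. The records below are invented purely to exercise the code; the labels `model-a`, `harness-1`, and `harness-2` and the `aggregate` function are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical records: (model, harness, tasks solved out of 20 in one run).
# These numbers are made up for illustration and are not results.
runs = [
    ("model-a", "harness-1", 10), ("model-a", "harness-1", 12),
    ("model-a", "harness-2", 14), ("model-a", "harness-2", 16),
]


def aggregate(records):
    """Group runs by (model, harness) and report count, mean, and spread."""
    grouped = defaultdict(list)
    for model, harness, solved in records:
        grouped[(model, harness)].append(solved)
    return {
        key: {"runs": len(vals), "mean": mean(vals), "std": pstdev(vals)}
        for key, vals in grouped.items()
    }


summary = aggregate(runs)
for (model, harness), stats in summary.items():
    print(model, harness, stats)
```

Keying the summary on the (model, harness) pair rather than on the model alone keeps a score from being attributed to the model when the harness also changed.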
Appendix D. Expanded Timeline and Framework


References
- Naur, P.; Randell, B. Software Engineering: Report of a conference sponsored by the NATO Science Committee, Garmisch, Germany, 7–11 Oct. 1968; Scientific Affairs Division, NATO: Brussels, 1969.
- Liu, X.; Wang, J.; Yuan, X.; Sun, J.; Dong, G.; Di, P.; Wang, W.; Wang, D. Prompting frameworks for large language models: A survey. ACM Computing Surveys, 2023.
- Rajasekaran, P.; Dixon, E.; Ryan, C.; Hadfield, J.; Ayub, R.; Moran, H.; Rueb, C.; Jennings, C. Effective context engineering for AI agents, 2025. Anthropic engineering note.
- Grace, M.; Hadfield, J.; Olivares, R.; Jonghe, J.D. Demystifying evals for AI agents, 2026. Anthropic engineering note.
- Lopopolo, R. Harness engineering: leveraging Codex in an agent-first world, 2026; OpenAI engineering note.
- OpenAI. Custom instructions with AGENTS.md, 2026. OpenAI Codex developer guide.
- OpenAI. Best practices, 2026. OpenAI Codex developer guide.
- Choi, D. Run long horizon tasks with Codex, 2026. OpenAI developer blog.
- Rajasekaran, P. Harness design for long-running application development, 2026. Anthropic engineering note.
- Martin, L.; Cemaj, G.; Cohen, M. Scaling Managed Agents: Decoupling the brain from the hands, 2026. Anthropic engineering note.
- OpenAI. The next evolution of the Agents SDK, 2026. OpenAI product note.
- Young, J. Effective harnesses for long-running agents, 2025. Anthropic engineering note.
- Hadfield, J.; Zhang, B.; Lien, K.; Scholz, F.; Fox, J.; Ford, D. How we built our multi-agent research system, 2025. Anthropic engineering note.
- Pan, L.; Zou, L.; Guo, S.; Ni, J.; Zheng, H.T. Natural-Language Agent Harnesses. arXiv 2026, arXiv:2603.25723.
- Lee, Y.; Nair, R.; Zhang, Q.; Lee, K.; Khattab, O.; Finn, C. Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv 2026, arXiv:2603.28052.
- Meng, Q.; Wang, Y.; Chen, L.; Wang, Q.; Lu, C.; Wu, W.; Gao, Y.; Wu, Y.; Hu, Y. Agent Harness for Large Language Model Agents: A Survey. Preprints 2026.
- Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.R.; Cao, Y. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the Eleventh International Conference on Learning Representations, 2022.
- Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems 2023, 36, 68539–68551.
- Li, M.; Zhao, Y.; Yu, B.; Song, F.; Li, H.; Yu, H.; Li, Z.; Huang, F.; Li, Y. API-Bank: A comprehensive benchmark for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023; pp. 3102–3116.
- Liu, X.; Yu, H.; Zhang, H.; Xu, Y.; Lei, X.; Lai, H.; Gu, Y.; Ding, H.; Men, K.; Yang, K.; et al. AgentBench: Evaluating LLMs as agents. arXiv 2023, arXiv:2308.03688.
- Zhou, S.; Xu, F.F.; Zhu, H.; Zhou, X.; Lo, R.; Sridhar, A.; Cheng, X.; Ou, T.; Bisk, Y.; Fried, D.; et al. WebArena: A realistic web environment for building autonomous agents. arXiv 2023, arXiv:2307.13854.
- Jimenez, C.E.; Yang, J.; Wettig, A.; Yao, S.; Pei, K.; Press, O.; Narasimhan, K. SWE-bench: Can language models resolve real-world GitHub issues? arXiv 2023, arXiv:2310.06770.
- Wang, X.; Chen, Y.; Yuan, L.; Zhang, Y.; Li, Y.; Peng, H.; Ji, H. Executable code actions elicit better LLM agents. In Proceedings of the Forty-first International Conference on Machine Learning, 2024.
- Yang, J.; Jimenez, C.E.; Wettig, A.; Lieret, K.; Yao, S.; Narasimhan, K.; Press, O. SWE-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 2024, 37, 50528–50652.
- Xie, T.; Zhang, D.; Chen, J.; Li, X.; Zhao, S.; Cao, R.; Hua, T.J.; Cheng, Z.; Shin, D.; Lei, F.; et al. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems 2024, 37, 52040–52094.
- Chen, C. Unlocking the Codex harness: how we built the App Server, 2026. OpenAI engineering note.
- Bolin, M. Unrolling the Codex agent loop, 2026. OpenAI engineering note.
- Model Context Protocol. Model Context Protocol specification, version 2025-11-25, 2025.
- Mialon, G.; Fourrier, C.; Wolf, T.; LeCun, Y.; Scialom, T. GAIA: a benchmark for general AI assistants. In Proceedings of the Twelfth International Conference on Learning Representations, 2023.
- Wei, J.; Sun, Z.; Papay, S.; McKinney, S.; Han, J.; Fulford, I.; Chung, H.W.; Passos, A.T.; Fedus, W.; Glaese, A. BrowseComp: A simple yet challenging benchmark for browsing agents. arXiv 2025, arXiv:2504.12516.
- Merrill, M.A.; Shaw, A.G.; Carlini, N.; Li, B.; Raj, H.; Bercovich, I.; Shi, L.; Shin, J.Y.; Walshe, T.; Buchanan, E.K.; et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv 2026, arXiv:2601.11868.
- Wang, X.; Li, B.; Song, Y.; Xu, F.F.; Tang, X.; Zhuge, M.; Pan, J.; Song, Y.; Li, B.; Singh, J.; et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv 2024, arXiv:2407.16741.
- Bui, N.D. Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned. arXiv 2026, arXiv:2603.05344.
- Bandel, E.; Yehudai, A.; Eden, L.; Sagron, Y.; Perlitz, Y.; Venezian, E.; Razinkov, N.; Ergas, N.; Shachor Ifergan, S.; Shlomov, S.; et al. General Agent Evaluation. arXiv 2026, arXiv:2602.22953.
- Bandi, C.; Hertzberg, B.; Boo, G.; Polakam, T.; Da, J.; Hassaan, S.; Sharma, M.; Park, A.; Hernandez, E.; Rambado, D.; et al. MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers. arXiv 2026, arXiv:2602.00933.
- Ursekar, V.; Shanker, A.; Chatrath, V.; Xue, Y.E.; Denton, S. VeRO: An Evaluation Harness for Agents to Optimize Agents. arXiv 2026, arXiv:2602.22480.
- Kapoor, S.; Stroebl, B.; Kirgis, P.; Nadgir, N.; Siegel, Z.S.; Wei, B.; Xue, T.; Chen, Z.; Chen, F.; Utpala, S.; et al. Holistic agent leaderboard: The missing infrastructure for AI agent evaluation. arXiv 2025, arXiv:2510.11977.
- Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A survey on large language model based autonomous agents. Frontiers of Computer Science 2024, 18, 186345.
- Piccialli, F.; Chiaro, D.; Sarwar, S.; Cerciello, D.; Qi, P.; Mele, V. AgentAI: A comprehensive survey on autonomous agents in distributed AI for industry 4.0. Expert Systems with Applications 2025, 291, 128404.
- Luo, J.; Zhang, W.; Yuan, Y.; Zhao, Y.; Yang, J.; Gu, Y.; Wu, B.; Chen, B.; Qiao, Z.; Long, Q.; et al. Large language model agent: A survey on methodology, applications and challenges. arXiv 2025, arXiv:2503.21460.
- Yehudai, A.; Eden, L.; Li, A.; Uziel, G.; Zhao, Y.; Bar-Haim, R.; Cohan, A.; Shmueli-Scheuer, M. Survey on evaluation of LLM-based agents. arXiv 2025, arXiv:2503.16416.
- Zhou, C.; Chai, H.; Chen, W.; Guo, Z.; Shan, R.; Song, Y.; Xu, T.; Yang, Y.; Yu, A.; Zhang, W.; et al. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering. arXiv 2026, arXiv:2604.08224.
- Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversations. In Proceedings of the First Conference on Language Modeling, 2024.
- Rafique, M.; Bindschaedler, L. ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents. In Proceedings of the 6th European Workshop on Machine Learning and Systems (EuroMLSys ’26), 2026.
- Kontonis, V.; Zeng, Y.; Garg, S.; Chen, L.; Tang, H.; Wang, Z.; Awadallah, A.; Horvitz, E.; Langford, J.; Papailiopoulos, D. MEMENTO: Teaching LLMs to Manage Their Own Context. arXiv 2026, arXiv:2604.09852.
- Jha, S.; Paltenghi, M.; Maddila, C.; Murali, V.; Ugare, S.; Chandra, S. ProdCodeBench: A Production-Derived Benchmark for Evaluating AI Coding Agents. arXiv 2026, arXiv:2604.01527.
- Pradel, M.; Cadar, C.; Bouzenia, I. Evaluating LLM Agents on Automated Software Analysis Tasks. arXiv 2026, arXiv:2604.11270.
- Waters, K.; Nuzzi, L.; Looram, T.; Tomasiello, A.; Kamdoum, A.G.K.; Li, B.; Sileo, D.; Kretov, E.; Fournier-Facio, F.; Soloupis, G.; et al. COMPOSITE-STEM. arXiv 2026, arXiv:2604.09836.
- Wang, X.J.; Bai, H.; Sun, Y.; Wang, H.; Zhang, S.; Hu, W.; Schroder, M.; Mutlu, B.; Song, D.; Nowak, R.D. The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break. arXiv 2026, arXiv:2604.11978.
- Li, Y.; Luo, H.; Xie, Y.; Fu, Y.; Yang, Z.; Shao, S.; Ren, Q.; Qu, W.; Fu, Y.; Yang, Y.; et al. ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety. arXiv 2026, arXiv:2604.02022.
- Stein, A.; Brown, D.; Hassani, H.; Naik, M.; Wong, E. Detecting Safety Violations Across Many Agent Traces. arXiv 2026, arXiv:2604.11806.
- Rabanser, S.; Kapoor, S.; Kirgis, P.; Liu, K.; Utpala, S.; Narayanan, A. Towards a science of AI agent reliability. arXiv 2026, arXiv:2602.16666.
- Pan, J.; Wang, X.; Neubig, G.; Jaitly, N.; Ji, H.; Suhr, A.; Zhang, Y. Training software engineering agents and verifiers with SWE-Gym. arXiv 2024, arXiv:2412.21139.
- Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv 2023, arXiv:2305.16291.
- Segato, G. Quantifying infrastructure noise in agentic coding evals, 2026. Anthropic engineering note.
- Anthropic. Eval awareness in Claude Opus 4.6’s BrowseComp performance, 2026. Anthropic engineering note.
- Ndzomga, F. Efficient Benchmarking of AI Agents. arXiv 2026, arXiv:2603.23749.
- Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Wang, J.; Zhang, C.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In Proceedings of the Twelfth International Conference on Learning Representations, 2023.
- Qian, C.; Liu, W.; Liu, H.; Chen, N.; Dang, Y.; Li, J.; Yang, C.; Chen, W.; Su, Y.; Cong, X.; et al. ChatDev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1), 2024; pp. 15174–15186.
- Yao, S.; Shinn, N.; Razavi, P.; Narasimhan, K. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv 2024, arXiv:2406.12045.
- Anthropic. Claude Code auto mode: a safer way to skip permissions, 2026. Anthropic engineering note.
- OpenAI. How we monitor internal coding agents for misalignment, 2026. OpenAI safety note.
- He, C.; Zhou, X.; Wang, D.; Xu, H.; Liu, W.; Miao, C. The AutoResearch Moment: From Experimenter to Research Director. 2026.
- He, C.; Zhou, X.; Wang, D.; Xu, H.; Liu, W.; Miao, C. OpenClaw as Language Infrastructure: A Case-Centered Survey of a Public Agent Ecosystem in the Wild. 2026.
- He, C.; Zhou, X.; Wang, D.; Xu, H.; Liu, W.; Miao, C. Human-AI productivity claims should be reported as time-to-acceptance under explicit acceptance tests. 2026.
- Aizawa, K.; Zhang, B.; Witten, Z.; Jiang, D.; Al-Sheikh, S.; Bell, M.; Vo, M.; Chu, T.; Welsh, J.; Parra, D.S.; et al. Writing effective tools for agents — with agents, 2025. Anthropic engineering note.
- Dworken, D.; Weller-Davies, O.; Choi, M.; Wu, C.; Vorwerck, M.; Isken, A.; Bradwell, K.; Garcia, K. Beyond permission prompts: making Claude Code more secure and autonomous, 2025. Anthropic engineering note.
- Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 2023, 36, 8634–8652.
- Li, G.; Hammoud, H.; Itani, H.; Khizbullin, D.; Ghanem, B. CAMEL: Communicative agents for “mind” exploration of large language model society. Advances in Neural Information Processing Systems 2023, 36, 51991–52008.
- Khattab, O.; Singhvi, A.; Maheshwari, P.; Zhang, Z.; Santhanam, K.; Vardhamanan, S.; Haq, S.; Sharma, A.; Joshi, T.T.; Moazam, H.; et al. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv 2023, arXiv:2310.03714.
- Xu, F.F.; Song, Y.; Li, B.; Tang, Y.; Jain, K.; Bao, M.; Wang, Z.Z.; Zhou, X.; Guo, Z.; Cao, M.; et al. TheAgentCompany: benchmarking LLM agents on consequential real world tasks. arXiv 2024, arXiv:2412.14161.
- Schluntz, E.; Zhang, B. Building effective agents, 2024. Anthropic engineering note.
- Anthropic. The “think” tool: Enabling Claude to stop and think in complex tool use situations, 2025. Anthropic engineering note.
- Wu, B.; Jones, A.; Renault, A.; Tay, H.; Noble, J.; Picard, N.; Jiang, S.; et al. Introducing advanced tool use on the Claude Developer Platform, 2025. Anthropic engineering note.
- Böckeler, B. Harness engineering, 2026. Thoughtworks article.

| Frame | Main engineering question | Typical artifacts | What remains under-described if this frame is treated as sufficient |
|---|---|---|---|
| Software engineering | How should the system stay correct and maintainable? | modules, interfaces, tests, CI, operational procedures | model-facing instructions, evolving context, and agent-specific control policies |
| Prompt engineering | What should the model be told? | system prompts, examples, roles, output schemas | retrieval, memory, runtime policy, permissions, and tool mediation |
| Context engineering | What should the model see right now? | retrieved snippets, message history, tool descriptions, summaries, notes | action interfaces, approval logic, recovery policy, and observability over time |
| Harness engineering | How should a language agent be governed over time? | durable instructions, tool contracts, checkpoints, graders, budgets, approvals, traces | the agent is no longer reduced to the model alone; the harness layer becomes the thing that must be reported |

| Source type | Control | Agency | Interfaces | Runtime | Governance | Feedback | Observability |
|---|---|---|---|---|---|---|---|
| Papers / benchmarks | 5 | 8 | 13 | 11 | 5 | 12 | 22 |
| Notes / protocols / technical articles | 10 | 2 | 7 | 12 | 9 | 3 | 6 |

| Pattern | Typical harness levers | Representative works | Common strengths and failure modes |
|---|---|---|---|
| Single-agent tool loop | prompt assembly, tool schemas, light memory, bounded retries | ReAct [17], Toolformer [18], API-Bank [19] | simple and efficient, but brittle when tasks are long, under-specified, or tool-heavy |
| Executable action substrate | code as action language, interpreter feedback, self-debugging | CodeAct [23], OpenHands [32] | flexible and compositional, but can amplify side effects without strong governance |
| Agent–computer interface | constrained command surface, file editing, search, browser or GUI actions | SWE-agent [24], MCP-Atlas [35], AnalysisBench [47], OSWorld [25] | large gains from interface design, but failures shift to navigation, grounding, and tool parameterization |
| Orchestrator–worker topology | decomposition, role specialization, routing, verifier or reviewer roles | AutoGen [43], MetaGPT [58], ChatDev [59], Anthropic research system [13] | better coverage and parallelism, but coordination overhead and cascading errors remain common |
| Long-running harness | state externalization, checkpoints, progress files, resumability, and compaction | Voyager [54], ClawVM [44], MEMENTO [45], HORIZON [49], Anthropic long-running harnesses [9,12] | supports hours-long work, but state drift, context decay, and brittle recovery remain central risks |
| Policy-aware deployment | permissions, sandboxing, escalation, audit logs, protocol mediation | τ-bench [60], ATBench [50], auto mode [61], monitoring notes [62], MCP [28] | improves safety and accountability, but can reduce autonomy or hide control logic if under-reported |
| Evaluation / trajectory harness | task adapters, versioned snapshots, multi-run aggregation, trace auditing | General Agent Evaluation [34], VeRO [36], HAL [37], Terminal-Bench 2.0 [31], Meerkat [51] | improves comparability and benchmark integrity, but protocol assumptions, suite design, and contamination controls become part of the result |
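
As one illustration of the long-running harness row above, the following minimal Python sketch externalizes state to a progress file, checkpoints after each step, and retries a failing step a bounded number of times. It is not taken from any of the cited systems: the function name `run_with_checkpoints`, the JSON layout of the progress file, and the `max_retries` parameter are assumptions made for illustration.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, state_path, max_retries=2):
    """Run named steps in order, persisting progress after each one.

    `steps` is a list of (name, callable) pairs. Each callable receives the
    dict of results so far and returns a JSON-serializable value.
    All names here are hypothetical, chosen for this sketch only.
    """
    state_path = Path(state_path)
    # Resume from externalized state if an earlier session left a checkpoint.
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"done": [], "results": {}}

    for name, step in steps:
        if name in state["done"]:
            continue  # finished in an earlier session; do not redo it
        for attempt in range(max_retries + 1):
            try:
                state["results"][name] = step(state["results"])
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retry budget exhausted: escalate to the caller
        state["done"].append(name)
        # Checkpoint after every step so a crash loses at most one step.
        state_path.write_text(json.dumps(state))
    return state
```

Calling the function again with the same `state_path` skips every step already recorded as done, which is the resumability property the table refers to. A production harness would also need atomic writes and some form of compaction, both of which this sketch omits.
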

| Field | Minimum disclosure | Priority |
|---|---|---|
| Base model(s) | model name, version, and decoding or adaptation settings | Required |
| Control artifacts | instructions, repo maps, AGENTS.md, architecture rules, tests, linters, success criteria | Required |
| Runtime policy | memory or compaction strategy, checkpoints, retries, rollback or escalation policy, budgets | Required |
| Action substrate | tools, APIs, browser or GUI access, code execution, interface schemas, MCP usage | Required |
| Execution topology | single-agent vs multi-agent structure, verifier or reviewer roles, routing logic | Required |
| Feedback stack | tests, graders, hidden checks, reflection prompts, human interventions | Required |
| Governance / observability | permissions, sandboxing, provenance logs, replay support, failure categories | Required |
| Evaluation protocol | task set, number of runs, outcome criteria, variance treatment, budget limits | Required |
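
The required fields above can also be checked mechanically before a result is reported. The sketch below is one hypothetical encoding in Python: the key names mirror the table rows but are assumptions made here, not a schema proposed by any of the cited sources.

```python
# Hypothetical key names, one per row of the disclosure table.
REQUIRED_FIELDS = (
    "base_models",
    "control_artifacts",
    "runtime_policy",
    "action_substrate",
    "execution_topology",
    "feedback_stack",
    "governance_observability",
    "evaluation_protocol",
)


def missing_disclosures(manifest):
    """Return the required fields that are absent or empty in `manifest`."""
    return [field for field in REQUIRED_FIELDS if not manifest.get(field)]


# Example: a report that discloses only the model and the evaluation protocol.
# The values are placeholders, not real model or benchmark names.
report = {
    "base_models": {"name": "example-model", "version": "example-version"},
    "evaluation_protocol": {"task_set": "example-suite", "runs": 5},
}
print(missing_disclosures(report))
```

Treating an empty entry as missing is a design choice of this sketch: it stops a report from passing the check with placeholder keys that disclose nothing.
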

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).