Submitted: 21 March 2026
Posted: 23 March 2026
Abstract
Keywords:
1. Introduction
2. From Software Engineering to Prompt, Context, & Harness Engineering
2.1. The Label Is New, the Design Problem Is Not
2.2. Canonical Public Examples
| Frame | Main engineering question | Typical artifacts | What remains under-described if this frame is treated as sufficient |
|---|---|---|---|
| Software engineering | How should the system stay correct and maintainable? | modules, interfaces, tests, CI, operational procedures | model-facing instructions, evolving context, and agent-specific control policies |
| Prompt engineering | What should the model be told? | system prompts, examples, roles, output schemas | retrieval, memory, runtime policy, permissions, and tool mediation |
| Context engineering | What should the model see right now? | retrieved snippets, message history, tool descriptions, summaries, notes | action interfaces, approval logic, recovery policy, and observability over time |
| Harness engineering | How should a language agent be governed over time? | durable instructions, tool contracts, checkpoints, graders, budgets, approvals, traces | little by construction: the agent is no longer reduced to the model alone, and the harness layer itself becomes the object that must be reported |
3. A Working Definition: The Harness Layer
- Control.
- Agency.
- Runtime.
- Two concrete mini-cases.
4. A Lightweight Descriptive Audit of the Evidence Base
5. Why the Harness Layer Changes What Counts as Progress
- Many reported agent gains can be partly harness-sensitive.
- Evaluation must become harness-sensitive.
- Reproducibility now depends on the harness layer.
6. Why NLP Should Treat the Harness Layer as an Explicit Object of Study
- Two likely objections.
7. Research Questions from the Harness-Layer Lens
8. Design Patterns & Failure Modes
9. A Research & Reporting Agenda
- First, study control as executable specification.
- Second, treat agency as an interface question.
- Third, treat runtime as a scientific variable.
- Fourth, normalize reporting of the harness.
- Fifth, build layer-aware baselines.
10. Conclusion
Limitations
Ethical Considerations
Appendix Roadmap
Acknowledgments
Appendix A. Evidence Base, Public Formulations, and Exhaustive Inventory
Appendix A.1. Evidence Base and Selection Logic

| View | Breakdown | Count |
|---|---|---|
| Total cited sources | unique cited sources in the current draft | 49 |
| Scope split | in-scope harness-relevant works | 40 |
| Scope split | adjacent framing pieces | 9 |
| In-scope source type | papers or benchmarks | 22 |
| In-scope source type | engineering notes, protocol documents, developer guides, or technical articles | 18 |
| In-scope time band | 2023 or earlier | 12/40 |
| In-scope time band | 2024 | 10/40 |
| In-scope time band | 2025 | 9/40 |
| In-scope time band | 2026 | 9/40 |
| In-scope time band | 2024–2026 combined | 28/40 |
Appendix A.2. Public Formulations and Official Examples
| Source | Year | Object or term | What it contributes to the concept |
|---|---|---|---|
| Anthropic evals note (Grace et al. 2026) | 2026 | agent harness / evaluation harness | defines the harness as the system that enables a model to act as an agent and stresses that agent evaluation measures harness plus model together |
| OpenAI harness note (Lopopolo 2026) | 2026 | harness engineering | names the broader practice and ties it to durable instructions, architectural constraints, cleanup loops, and long-horizon coding work |
| OpenAI Codex harness note (Chen 2026) | 2026 | Codex harness | treats the harness as reusable runtime logic and protocol support rather than a one-shot prompt surface |
| Anthropic long-running note (Young 2025) | 2025 | long-running harnesses | makes state externalization, progress tracking, clean-state discipline, and resumability central design levers |
| OpenAI developer guidance (Choi 2026; OpenAI 2026a; OpenAI 2026b) | 2026 | AGENTS.md, long-horizon tasks | provides concrete examples of harness work as durable project guidance plus iterative plan–act–test–repair loops |
| MCP specification (Model Context Protocol 2025) | 2025 | protocol layer | shows that tool interoperability and permission boundaries can themselves be harness design objects |
Appendix A.3. Exhaustive In-Scope Works
| Work | Year | Type | Why it is in scope | Primary harness component(s) |
|---|---|---|---|---|
| Yao et al. (2022) | 2022 | precursor paper | couples reasoning and acting in trajectories | agency, control |
| Schick et al. (2023) | 2023 | paper | makes tool calls an explicit part of model behavior | interfaces, agency |
| Shinn et al. (2023) | 2023 | paper | treats reflection as trajectory repair and memory support | runtime, feedback |
| Wang et al. (2023) | 2023 | paper | shows skill libraries and resumable state for long tasks | runtime, feedback |
| Li et al. (2023) | 2023 | paper | role-structured multi-agent execution | agency |
| Li et al. (2023) | 2023 | benchmark | API tool use and tool evaluation | interfaces, observability |
| Liu et al. (2023) | 2023 | benchmark | interactive agent evaluation | observability |
| Zhou et al. (2023) | 2023 | benchmark | web interaction as environment-level action | interfaces, observability |
| Wu et al. (2024) | 2024 | paper | orchestrated multi-agent conversations and control logic | agency, control |
| Hong et al. (2023) | 2023 | paper | specialist-agent software workflows | agency |
| Mialon et al. (2023) | 2023 | benchmark | general-assistant tasks that stress tool use and grounded action | observability, agency |
| Jimenez et al. (2023) | 2023 | benchmark | executable software tasks and hidden tests | feedback, observability |
| Khattab et al. (2023) | 2023 | paper | LM pipeline compilation and self-improvement | control, feedback |
| Qian et al. (2024) | 2024 | paper | role-specialized software agents | agency |
| Wang et al. (2024a) | 2024 | paper | code as an executable action substrate | interfaces, agency |
| Xie et al. (2024) | 2024 | benchmark | open computer-use environment | interfaces, observability |
| Yang et al. (2024) | 2024 | paper | agent-computer interface for software engineering | interfaces, control |
| Yao et al. (2024) | 2024 | benchmark | policy-aware tool-agent-user interaction | governance, observability |
| Wang et al. (2024b) | 2024 | paper | open software-agent platform with sandboxed runtime | interfaces, governance, runtime |
| Xu et al. (2024) | 2024 | benchmark | consequential workplace tasks for agents | governance, observability |
| Pan et al. (2024) | 2024 | paper | trainable loops with verifiers and repair | feedback, observability |
| Wei et al. (2025) | 2025 | benchmark | persistent browsing and recovery | interfaces, runtime |
| Schluntz and Zhang (2024) | 2024 | note | codifies practical agent patterns and the workflow/agent distinction | control, feedback |
| Anthropic (2025) | 2025 | note | explicit reflection insertion in tool loops | feedback |
| Hadfield et al. (2025) | 2025 | note | orchestrator–worker research system | agency |
| Rajasekaran et al. (2025) | 2025 | note | state curation and compaction as engineering | runtime |
| Aizawa et al. (2025) | 2025 | note | interface and tool-surface design for agents | interfaces, governance |
| Dworken et al. (2025) | 2025 | note | security and containment for coding agents | governance |
| Wu et al. (2025) | 2025 | note | large tool surfaces and long-running tool access | interfaces, runtime |
| Young (2025) | 2025 | note | state externalization, resumability, and recovery | runtime |
| Grace et al. (2026) | 2026 | note | agent-harness and evaluation-harness distinction | observability, feedback |
| Segato (2026) | 2026 | note | infrastructure variance in agentic evaluation | observability |
| Model Context Protocol (2025) | 2025 | protocol | tool interoperability and permissions as protocol objects | interfaces, governance |
| Lopopolo (2026) | 2026 | note | explicit naming of harness engineering and its practical levers | control, runtime, governance, observability |
| Chen (2026) | 2026 | note | reusable runtime logic and protocol support | runtime, interfaces |
| Bolin (2026) | 2026 | note | stepwise action loop and runtime structure | agency, runtime |
| OpenAI (2026b) | 2026 | developer guide | durable in-repository instruction surfaces | control |
| OpenAI (2026a) | 2026 | developer guide | operational guidance for agentic coding loops | control, runtime |
| Choi (2026) | 2026 | developer blog | long-horizon execution discipline and resumability | runtime, control |
| Böckeler (2026) | 2026 | technical article | software-architecture articulation of the harness concept | control, governance, observability |
Appendix B. HarnessCard Materials
Appendix B.1. Expanded HarnessCard
| Field | What should be disclosed | Priority |
|---|---|---|
| Base model(s) | model name, version, decoding settings, and any finetuning or adapters | Required |
| Control artifacts | system instructions, AGENTS.md, repo maps, architecture rules, schemas, tests, linters, done-when criteria | Required |
| Runtime policy | memory type, compaction or summarization policy, checkpointing, retry or rollback policy, budget limits | Required |
| Action substrate | tools, APIs, browser or GUI access, code execution, interface schemas, MCP usage | Required |
| Execution topology | single-agent vs multi-agent structure, planner/verifier roles, reviewer loops, routing logic | Required |
| Feedback stack | tests, graders, reflection prompts, hidden checks, human interventions, or repair loops | Required |
| Governance layer | permissions, sandboxing, escalation rules, policy checks, provenance logging, audit support | Required |
| Observability | stored traces, replay support, latency and cost logging, failure categories | Required |
| Evaluation protocol | task set, number of runs, success criteria, variance treatment, held-out checks or budget limits | Required |
| Release artifacts | prompts or programs, tool specs, traces, configs, environment setup, reproducibility notes | Recommended |
| Known limitations and risks | unresolved failure modes, portability caveats, safety concerns, or red-team findings | Recommended |
Appendix B.2. Illustrative HarnessCard: Repository Coding Agent
| Field | Illustrative disclosure | Why it matters |
|---|---|---|
| Base model(s) | frontier coding model configured through repo or user profiles; effort tuned for long tasks | keeps model choice distinct from harness choice |
| Control artifacts | root-level AGENTS.md; repository map; build/test/lint commands; architecture rules; done-when criteria | reveals the durable instructions and constraints the agent actually reads |
| Runtime policy | repository treated as system of record; thread history; progress file; compaction near context limits; bounded retries | makes long-horizon state handling explicit |
| Action substrate | file edits, shell commands, test runs, diff generation, PR review, optional MCP tools | discloses what the model can actually do in the environment |
| Execution topology | plan → edit → run tools → observe → repair → update status → repeat; optional reviewer loop | captures the control structure rather than only the model |
| Feedback stack | failing tests, custom linter messages, self-review, grader checks, occasional human review | surfaces the verification signals that shape behavior |
| Governance layer | sandbox mode, approval policy for privileged actions, least-privilege connectors, audit trail | keeps permissions and safety visible rather than implicit |
| Observability | persisted thread events, replay support, latency and cost logs, categorized failures | makes debugging and comparison scientifically possible |
| Success criteria | merged change passes required checks, stays within budget, and leaves updated status artifacts | completion becomes operationally verifiable, not merely verbal |
| Known risks | stale docs, state drift, verifier overfitting, hidden human intervention, over-trusting automated review | shows why limitation disclosure belongs inside the reporting standard |
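The disclosure fields above can also be captured as a machine-readable record, which makes HarnessCards diffable across system versions. The sketch below is a minimal illustration in Python; the class and field names are this paper's HarnessCard vocabulary rendered as identifiers, not an established schema from any cited source.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class HarnessCard:
    """Minimal machine-readable HarnessCard sketch (field names illustrative)."""
    base_models: list[str]
    control_artifacts: list[str]
    runtime_policy: dict[str, str]
    action_substrate: list[str]
    execution_topology: str
    feedback_stack: list[str]
    governance: list[str]
    observability: list[str]
    known_risks: list[str] = field(default_factory=list)

# Values paraphrase the illustrative repository-coding-agent card above.
card = HarnessCard(
    base_models=["frontier coding model (version pinned in config)"],
    control_artifacts=["AGENTS.md", "repository map", "build/test/lint commands"],
    runtime_policy={"compaction": "near context limit", "retries": "bounded"},
    action_substrate=["file edits", "shell", "test runs", "PR review"],
    execution_topology="plan -> edit -> run tools -> observe -> repair -> repeat",
    feedback_stack=["failing tests", "linter messages", "self-review"],
    governance=["sandbox mode", "approval policy for privileged actions"],
    observability=["persisted thread events", "latency and cost logs"],
)

# Serialize for release artifacts or cross-run comparison.
print(json.dumps(asdict(card), indent=2))
```

Serializing the card alongside traces and configs is one way to satisfy the "Release artifacts" row without inventing a new reporting format.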
Appendix C. Search Strings, Glossary, and Additional Mini-Cases
| Term | Working meaning in this paper |
|---|---|
| Prompt engineering | writing and organizing instructions, examples, and role structure for desired model behavior |
| Context engineering | curating the evolving token state supplied to the model, including retrieval, memory, and tool context (Rajasekaran et al. 2025) |
| Agent harness / scaffold | the extra-model system that enables a model to act as an agent (Grace et al. 2026) |
| Harness layer | the extra-model layer that, in this paper’s working definition, couples control artifacts, mediated action interfaces, and runtime policies into governed execution |
| Harness engineering | the design and maintenance of the control, agency, and runtime layer around the model (Lopopolo 2026) |
| Evaluation harness | the system that turns tasks, metrics, graders, and infrastructure into an executable evaluation regime (Grace et al. 2026) |
| Action substrate | the interface through which the agent can act: code, shell, browser, GUI, APIs, or role-structured delegation |
| Task family | Control levers | Runtime levers | Agency levers and likely risks |
|---|---|---|---|
| Repository coding agent | repo map, AGENTS.md, tests, linters, architectural constraints | compaction, checkpoints, retries, cleanup passes, cost budgets | shell/file edit/PR review; risks include stale docs, verifier overfitting, and hidden human repair |
| Browser or research agent | source hierarchy, citation rules, task decomposition, grading rubric | search history, scratchpads, branching traces, escalation on uncertainty | browse, fetch, cite, summarize; risks include source drift, unsupported synthesis, and provenance loss |
| Enterprise support agent | policy text, workflow scripts, escalation rules, approval thresholds | queue state, customer history, retry and timeout policy, audit logs | tool/API access, human handoff, permissions; risks include privacy leakage, over-escalation, and inconsistent policy application |
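The runtime levers shared by all three task families (checkpoints, progress files, bounded retries, resumability) can be sketched as a small wrapper around an arbitrary agent step. Everything here is a hypothetical illustration under this paper's working definitions: the `progress.json` file name, the retry budget, and the `step` callable stand in for whatever state store and action loop a real harness uses.

```python
import json
from pathlib import Path

PROGRESS = Path("progress.json")  # hypothetical externalized-state file

def load_state() -> dict:
    # Resume from the externalized progress file if a prior run left one.
    if PROGRESS.exists():
        return json.loads(PROGRESS.read_text())
    return {"done": [], "attempts": {}}

def checkpoint(state: dict) -> None:
    # Persist state after every outcome so a crash or context reset is resumable.
    PROGRESS.write_text(json.dumps(state))

def run(tasks, step, max_retries: int = 2) -> dict:
    """Run each task with bounded retries, checkpointing after each attempt."""
    state = load_state()
    for task in tasks:
        if task in state["done"]:
            continue  # clean-state discipline: never redo finished work
        attempts = state["attempts"].get(task, 0)
        while attempts <= max_retries:
            try:
                step(task)
                state["done"].append(task)
                break
            except Exception:
                attempts += 1
                state["attempts"][task] = attempts
            finally:
                checkpoint(state)
    return state
```

The design choice worth noting is that the checkpoint happens in `finally`: the progress file records failed attempts as well as successes, which is what makes retry budgets and post-hoc failure categorization auditable.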
Appendix D. Expanded Timeline and Framework


References
- Aizawa, Ken, Barry Zhang, Zachary Witten, Daniel Jiang, Sami Al-Sheikh, Matt Bell, Maggie Vo, Theodora Chu, John Welsh, David Soria Parra, Adam Jones, Santiago Seira, Molly Vorwerc, Drew Roper, Christian Ryan, and Alexander Bricken. 2025. Writing effective tools for agents — with agents. Anthropic engineering note. [Google Scholar]
- Anthropic. 2025. The “think” tool: Enabling Claude to stop and think in complex tool use situations. Anthropic engineering note. [Google Scholar]
- Böckeler, Birgitta. 2026. Harness engineering. Thoughtworks article. [Google Scholar]
- Bolin, Michael. 2026. Unrolling the Codex agent loop. OpenAI engineering note. [Google Scholar]
- Chen, Celia. 2026. Unlocking the Codex harness: how we built the app server. OpenAI engineering note. [Google Scholar]
- Choi, Derrick. 2026. Run long horizon tasks with codex. OpenAI developer blog. [Google Scholar]
- Dworken, David, Oliver Weller-Davies, Meaghan Choi, Catherine Wu, Molly Vorwerck, Alex Isken, Kier Bradwell, and Kevin Garcia. 2025. Beyond permission prompts: making Claude code more secure and autonomous. Anthropic engineering note. [Google Scholar]
- Grace, Mikaela, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe. 2026. Demystifying evals for AI agents. Anthropic engineering note. [Google Scholar]
- Hadfield, Jeremy, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford. 2025. How we built our multi-agent research system. Anthropic engineering note. [Google Scholar]
- He, Chaoyue, Xin Zhou, Di Wang, Hong Xu, Wei Liu, and Chunyan Miao. 2026a. The autoresearch moment: From experimenter to research director. [Google Scholar]
- He, Chaoyue, Xin Zhou, Di Wang, Hong Xu, Wei Liu, and Chunyan Miao. 2026b. Human-ai productivity claims should be reported as time-to-acceptance under explicit acceptance tests. [Google Scholar] [CrossRef]
- He, Chaoyue, Xin Zhou, Di Wang, Hong Xu, Wei Liu, and Chunyan Miao. 2026c. Openclaw as language infrastructure: A case-centered survey of a public agent ecosystem in the wild. [Google Scholar]
- Hong, Sirui, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, and et al. 2023. Metagpt: Meta programming for a multi-agent collaborative framework. In The twelfth international conference on learning representations. [Google Scholar]
- Jimenez, Carlos E, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv arXiv:2310.06770. [Google Scholar]
- Khattab, Omar, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, and et al. 2023. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv arXiv:2310.03714. [Google Scholar] [CrossRef]
- Li, Guohao, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for "mind" exploration of large language model society. Advances in neural information processing systems 36: 51991–52008. [Google Scholar]
- Li, Minghao, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023. Api-bank: A comprehensive benchmark for tool-augmented llms. In Proceedings of the 2023 conference on empirical methods in natural language processing. pp. 3102–3116. [Google Scholar]
- Liu, Xiaoxia, Jingyi Wang, Xiaohan Yuan, Jun Sun, Guoliang Dong, Peng Di, Wenhai Wang, and Dongxia Wang. 2023. Prompting frameworks for large language models: A survey. ACM Computing Surveys. [Google Scholar]
- Liu, Xiao, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, and et al. 2023. Agentbench: Evaluating llms as agents. arXiv arXiv:2308.03688. [Google Scholar] [CrossRef]
- Lopopolo, Ryan. 2026. Harness engineering: leveraging Codex in an agent-first world. OpenAI engineering note. [Google Scholar]
- Luo, Junyu, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, and et al. 2025. Large language model agent: A survey on methodology, applications and challenges. arXiv arXiv:2503.21460. [Google Scholar] [CrossRef]
- Mialon, Grégoire, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. 2023. Gaia: a benchmark for general ai assistants. In The Twelfth International Conference on Learning Representations. [Google Scholar]
- Model Context Protocol. 2025. Model context protocol specification. Model Context Protocol specification, version 2025-11-25. [Google Scholar]
- Naur, Peter, and Brian Randell. 1969. Software Engineering: Report of a Conference Sponsored by the NATO Science Committee, Garmisch, Germany, 7–11 October 1968. Brussels: Scientific Affairs Division, NATO. [Google Scholar]
- OpenAI. 2026a. Best practices. OpenAI Codex developer guide. [Google Scholar]
- OpenAI. 2026b. Custom instructions with agents.md. OpenAI Codex developer guide. [Google Scholar]
- Pan, Jiayi, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. 2024. Training software engineering agents and verifiers with swe-gym. arXiv arXiv:2412.21139. [Google Scholar] [CrossRef]
- Piccialli, Francesco, Diletta Chiaro, Sundas Sarwar, Donato Cerciello, Pian Qi, and Valeria Mele. 2025. Agentai: A comprehensive survey on autonomous agents in distributed ai for industry 4.0. Expert Systems with Applications 291: 128404. [Google Scholar] [CrossRef]
- Qian, Chen, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, and et al. 2024. Chatdev: Communicative agents for software development. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers). pp. 15174–15186. [Google Scholar]
- Rajasekaran, Prithvi, Ethan Dixon, Carly Ryan, Jeremy Hadfield, Rafi Ayub, Hannah Moran, Cal Rueb, and Connor Jennings. 2025. Effective context engineering for AI agents. Anthropic engineering note. [Google Scholar]
- Schick, Timo, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems 36: 68539–68551. [Google Scholar]
- Schluntz, Erik, and Barry Zhang. 2024. Building effective agents. Anthropic engineering note. [Google Scholar]
- Segato, Gian. 2026. Quantifying infrastructure noise in agentic coding evals. Anthropic engineering note. [Google Scholar]
- Shinn, Noah, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Advances in neural information processing systems 36: 8634–8652. [Google Scholar]
- Wang, Guanzhi, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models. arXiv arXiv:2305.16291. [Google Scholar] [CrossRef]
- Wang, Lei, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, and et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science 18, 6: 186345. [Google Scholar] [CrossRef]
- Wang, Xingyao, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024. Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning. [Google Scholar]
- Wang, Xingyao, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, and et al. 2024. Openhands: An open platform for ai software developers as generalist agents. arXiv arXiv:2407.16741. [Google Scholar]
- Wei, Jason, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. 2025. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv arXiv:2504.12516. [Google Scholar] [CrossRef]
- Wu, Bin, Adam Jones, Artur Renault, Henry Tay, Jake Noble, Noah Picard, Sam Jiang, and et al. 2025. Introducing advanced tool use on the Claude developer platform. Anthropic engineering note. [Google Scholar]
- Wu, Qingyun, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, and et al. 2024. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling. [Google Scholar]
- Xie, Tianbao, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, and et al. 2024. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems 37: 52040–52094. [Google Scholar]
- Xu, Frank F, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, and et al. 2024. Theagentcompany: benchmarking llm agents on consequential real world tasks. arXiv arXiv:2412.14161. [Google Scholar]
- Yang, John, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37: 50528–50652. [Google Scholar]
- Yao, Shunyu, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv arXiv:2406.12045. [Google Scholar]
- Yao, Shunyu, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations. [Google Scholar]
- Yehudai, Asaf, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, and Michal Shmueli-Scheuer. 2025. Survey on evaluation of llm-based agents. arXiv arXiv:2503.16416. [Google Scholar] [CrossRef]
- Young, Justin. 2025. Effective harnesses for long-running agents. Anthropic engineering note. [Google Scholar]
- Zhou, Shuyan, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, and et al. 2023. Webarena: A realistic web environment for building autonomous agents. arXiv arXiv:2307.13854. [Google Scholar]

| Public source | Year | Object | Why it matters for the harness layer |
|---|---|---|---|
| Anthropic evals note (Grace et al. 2026) | 2026 | agent harness / evaluation harness | defines the harness as the system that enables action and makes explicit that agent evaluation measures harness plus model |
| OpenAI harness note (Lopopolo 2026) | 2026 | harness engineering | names the broader practice and ties it to project guidance, constraints, cleanup loops, and long-horizon coding work |
| OpenAI Codex notes (Bolin 2026; Chen 2026) | 2026 | Codex harness | treats the harness as reusable runtime logic that can power multiple surfaces, not just a one-shot prompt wrapper |
| Anthropic long-running note (Young 2025) | 2025 | long-running harnesses | makes state externalization, progress tracking, and recovery first-class engineering levers |
| OpenAI developer guidance (Choi 2026; OpenAI 2026a, b) | 2026 | AGENTS.md, long-horizon tasks | shows the harness as durable instruction plus executable verification and repair loops |
| MCP specification (Model Context Protocol 2025) | 2025 | protocol layer | moves tool interoperability and permission boundaries into an explicit systems interface |
| Source type | Control | Agency | Interfaces | Runtime | Governance | Feedback | Observability |
|---|---|---|---|---|---|---|---|
| Papers / benchmarks (n = 22) | 4 | 8 | 8 | 4 | 3 | 5 | 9 |
| Notes / protocols / technical articles (n = 18) | 6 | 2 | 4 | 8 | 5 | 3 | 4 |
| Pattern | Typical harness levers | Representative works | Common strengths and failure modes |
|---|---|---|---|
| Single-agent tool loop | prompt assembly, tool schemas, light memory, bounded retries | ReAct (Yao et al. 2022), Toolformer (Schick et al. 2023), API-Bank (Li et al. 2023) | simple and efficient, but brittle when tasks are long, under-specified, or tool-heavy |
| Executable action substrate | code as action language, interpreter feedback, self-debugging | CodeAct (Wang et al. 2024a), OpenHands (Wang et al. 2024b) | flexible and compositional, but can amplify side effects without strong governance |
| Agent–computer interface | constrained command surface, file editing, search, browser or GUI actions | SWE-agent (Yang et al. 2024), OSWorld (Xie et al. 2024) | large gains from interface design, but failures shift to navigation and grounding |
| Orchestrator–worker topology | decomposition, role specialization, routing, verifier or reviewer roles | AutoGen (Wu et al. 2024), MetaGPT (Hong et al. 2023), ChatDev (Qian et al. 2024), Anthropic research system (Hadfield et al. 2025) | better coverage and parallelism, but coordination overhead and cascading errors remain common |
| Long-running harness | state externalization, checkpoints, progress files, resumability | Voyager (Wang et al. 2023), BrowseComp (Wei et al. 2025), Anthropic long-running harnesses (Young 2025) | supports hours-long work, but state drift and context decay remain central risks |
| Policy-aware deployment | permissions, sandboxing, escalation, audit logs, protocol mediation | τ-bench (Yao et al. 2024), sandboxing notes (Dworken et al. 2025), MCP (Model Context Protocol 2025) | improves safety and accountability, but can reduce autonomy or hide control logic if under-reported |
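The single-agent tool loop in the first row of this table can be sketched as a minimal control structure. This is a sketch of the pattern, not any framework's interface: the `model` callable, the action dictionary shape, and the step budget are all assumptions made for illustration.

```python
def agent_loop(model, tools: dict, task: str, max_steps: int = 8):
    """Minimal single-agent tool loop: prompt -> tool call -> observe -> repeat.

    `model` is any callable mapping a transcript to an action dict, e.g.
    {"tool": "search", "args": {...}} or {"final": "answer"}. Hypothetical
    shape for illustration only.
    """
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):            # bounded loop: hard step budget
        action = model(transcript)
        if "final" in action:             # model declares the task done
            return action["final"], transcript
        tool = tools.get(action.get("tool"))
        if tool is None:                  # unknown tool: surface error as an observation
            observation = f"error: no such tool {action.get('tool')!r}"
        else:
            observation = tool(**action.get("args", {}))
        transcript.append({"role": "tool", "content": str(observation)})
    return None, transcript               # budget exhausted without an answer
```

Even at this size the loop shows why the pattern is brittle on long tasks: the only memory is the growing transcript, and the only recovery policy is the step budget, which is exactly the gap the long-running-harness row addresses.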
| Field | Minimum disclosure | Priority |
|---|---|---|
| Base model(s) | model name, version, and decoding or adaptation settings | Required |
| Control artifacts | instructions, repo maps, AGENTS.md, architecture rules, tests, linters, success criteria | Required |
| Runtime policy | memory or compaction strategy, checkpoints, retries, rollback or escalation policy, budgets | Required |
| Action substrate | tools, APIs, browser or GUI access, code execution, interface schemas, MCP usage | Required |
| Execution topology | single-agent vs multi-agent structure, verifier or reviewer roles, routing logic | Required |
| Feedback stack | tests, graders, hidden checks, reflection prompts, human interventions | Required |
| Governance / observability | permissions, sandboxing, provenance logs, replay support, failure categories | Required |
| Evaluation protocol | task set, number of runs, outcome criteria, variance treatment, budget limits | Required |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).