Submitted: 27 June 2026
Posted: 30 June 2026
Abstract
Keywords:
1. Introduction
2. Harness Foundations: From Model Capability Gaps to External Support
2.1. Intent Alignment Cycle
2.2. Reasoning Capability Cycle
2.3. Execution Capability Cycle and the Emergence of the Harness
2.4. Definition and Conceptual Boundaries of the Harness
3. Harness Component Taxonomy
3.1. Context Management
3.2. Tools and Interfaces
3.3. State and Memory
3.4. Workflow and Orchestration
3.5. Execution Environments
3.6. Verification and Guardrails
3.7. Observability
4. Harness Design Tradeoffs Under Multiple Objectives
5. Representative Harness Implementations
6. Harness Evaluation: Diagnosing Model Capability Gaps and Assessing Harness Compensation
6.1. Model Capability-Gap Diagnosis: Assessing Native Capability Limits
6.1.1. Context and Attention Gaps
6.1.2. Tool-Use and Interface Gaps
6.1.3. State and Memory Gaps
6.1.4. Workflow and Orchestration Gaps
6.1.5. Execution Environment Gaps
6.1.6. Verification and Guardrail Gaps
6.1.7. Observability Gaps
6.2. Harness Compensation Assessment: Effectiveness and Net Benefit
6.2.1. Effectiveness of Harness Compensation
6.2.2. Net Benefit of Harness Compensation
6.3. Summary
7. Harness Evolution
7.1. Harness-Driven Model Improvement
7.2. Model-Driven Harness Optimization
7.3. The Evolving Model-Harness Boundary
7.3.1. Capability Internalization
7.3.2. Governance Expansion
7.3.3. System-Level Reorganization
8. Open Challenges and Future Directions
8.1. Harness as a First-Class Scientific Object
8.2. Making Harness Design Explicit
8.3. Evaluating Harness Compensation
8.4. Designing Harnesses for Training
8.5. Self-Optimizing but Governable Harnesses
8.6. Toward Automated Model-Harness Coevolution
9. Conclusion
| Capability gap | Primary harness family | Diagnostic evidence | Compensation pattern | Residual risk and cost |
|---|---|---|---|---|
| Context and attention decay | Context management | Long-context and agent-rollout benchmarks expose loss of key state, degraded dependency tracking, and context-management errors under long trajectories [25,87,88]. | Select only currently relevant evidence; compress histories into structured state; offload bulky artifacts to files or stores; isolate branches and load skills progressively. | Lossy summaries can omit causal premises; retrieval can miss needed evidence; over-compression makes later recovery expensive or impossible. |
| Unreliable tool use and interface adaptation | Tools and interfaces | Realistic tool benchmarks show failures in tool discovery, parameterization, multi-server composition, and dynamic intent tracking [90,91,93]. | Constrain actions through schemas and tool-call contracts; improve agent-computer interfaces; retrieve or prune tools by task; standardize external access through protocols such as MCP. | Broader tool exposure raises context cost and selection noise; external tool metadata introduces supply-chain, privilege, and trust-boundary risks. |
| Non-persistent or poorly managed state | State and memory | Memory benchmarks show degradation under incremental updates, conflicting facts, multi-session tasks, and memory-driven downstream decisions [94,95,96]. | Maintain working state for the current task; persist episodic, semantic, and procedural memory; manage write, consolidation, retrieval, and forgetting policies; use files and version control as auditable state substrates. | Unfiltered memory becomes stale, redundant, or poisoned; similarity retrieval can lose causal relations; memory maintenance adds storage, indexing, and governance burden. |
| Long-horizon workflow failure | Workflow and orchestration | Planning and web-task evaluations reveal plan drift, premature termination, delayed-feedback errors, and weak recovery across long execution chains [65,100,101]. | Use reasoning-action loops, task decomposition, deterministic workflows, hooks, model routing, and subagent handoffs to structure progress, checkpoints, and recovery. | More control logic increases system complexity; fixed workflows can become brittle; subagent handoffs can lose evidence or duplicate work. |
| Ungrounded execution in changing environments | Execution environments | Execution benchmarks test whether decisions translate into correct commands, GUI actions, code changes, and environment-state updates [84,102,103,104]. | Run actions in sandboxes, terminals, browsers, and prepared workspaces; preserve artifacts, logs, diffs, and environment feedback for subsequent reasoning and verification. | Execution adds runtime, dependency, and isolation cost; environment drift and flaky tools complicate attribution; broader execution authority expands side-effect boundaries. |
| Unsafe or noncompliant side effects | Verification and guardrails | Safety benchmarks expose long-horizon attacks, prompt and visual injection, memory poisoning, privacy leakage, and sandbox escape risks [108,109,111,119]. | Externalize judgment into tests, validators, permission gates, allowlists, layered defenses, human approval, and policy checks before irreversible actions. | Guardrails add latency and false blocks; local checks can be bypassed by cross-step attacks; human approval requires trace context to avoid rubber-stamp decisions. |
| Uninspectable process and weak failure attribution | Observability | Trace-oriented studies show that final success scores hide causal failure points and that full traces improve attribution over partial observations [120,121,122]. | Record traces, logs, metrics, token and latency profiles, tool-call chains, and evaluation judgments so failures can be localized to components, steps, or interfaces. | Process evidence increases storage, privacy, and compliance obligations; raw logs still require schema, summarization, and diagnostic tooling to become actionable. |
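
As a minimal sketch of how several compensation patterns in the table above might combine, the following Python fragment validates a proposed tool call against a contract, checks an allowlist, requires approval before irreversible actions, and records every decision in a trace. The names `ToolSpec`, `Gate`, and the decision strings are hypothetical and are not taken from any of the surveyed systems; the `approve` callback is an assumed stand-in for a human or policy check.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolSpec:
    """Hypothetical tool-call contract: required argument names and types."""
    name: str
    required: dict[str, type]
    irreversible: bool = False  # irreversible tools need explicit approval


@dataclass
class Gate:
    """Illustrative guardrail: allowlist, schema check, approval, and trace."""
    allowlist: dict[str, ToolSpec]
    approve: Callable[[str, dict[str, Any]], bool]
    trace: list[dict[str, Any]] = field(default_factory=list)

    def check(self, tool: str, args: dict[str, Any]) -> str:
        decision = self._decide(tool, args)
        # Record every decision so a later failure can be localized to a step.
        self.trace.append({"tool": tool, "args": args, "decision": decision})
        return decision

    def _decide(self, tool: str, args: dict[str, Any]) -> str:
        spec = self.allowlist.get(tool)
        if spec is None:
            return "deny:not_allowlisted"
        for key, expected in spec.required.items():
            # A missing argument yields None, which fails the type check.
            if not isinstance(args.get(key), expected):
                return f"deny:bad_argument:{key}"
        if spec.irreversible and not self.approve(tool, args):
            return "deny:not_approved"
        return "allow"


# Example wiring; the approval callback always refuses in this sketch.
gate = Gate(
    allowlist={
        "read_file": ToolSpec("read_file", {"path": str}),
        "delete_file": ToolSpec("delete_file", {"path": str}, irreversible=True),
    },
    approve=lambda tool, args: False,
)
for tool, args in [("read_file", {"path": "a.txt"}), ("delete_file", {"path": "a.txt"})]:
    print(tool, gate.check(tool, args))
```

The sketch also shows the residual costs listed in the table: each check adds a step before execution, and the trace grows with every call, so a real deployment would still need retention and summarization policies.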

| Archetype | Task boundary | Permission radius | State persistence | Verification mechanism | Failure recovery | Audit demand | Dominant cost items |
|---|---|---|---|---|---|---|---|
| SWE-agent-style repository repair | Bounded issue, patch, or benchmark instance with a concrete repository-level success condition. | Primarily repository-local shell, editor, search, and test actions inside a prepared coding environment. | Mostly run-scoped; durable state is the patch, logs, and modified workspace rather than a long-term memory store. | Unit tests, build feedback, diffs, executable scripts, and benchmark pass/fail signals close the generate–test–repair loop [83,84]. | Iterative edit, rerun, and local repair; harder cases require environment reset or human diagnosis when tests are incomplete. | Moderate: traces must explain patch provenance and failed attempts, but side effects are usually confined to the repository. | Model calls, command execution, test runtime, dependency setup, and repeated context retrieval. |
| Codex-style cloud task agent | User-submitted coding or analysis request executed as an isolated task, often in parallel with other tasks. | Cloud sandbox preloaded with the target repository; execution authority is bounded by task environment isolation [85]. | Task-scoped and reclaimable; persistent learning is not the main design center. | Sandbox execution, logs, diffs, tests, and user review; concise repository instructions reduce exploratory context cost [16]. | Restart, rerun, or revise inside an independent sandbox; failed tasks can be discarded without contaminating other workspaces. | Medium to high: isolation evidence, task traces, and reviewable diffs matter for multi-tenant execution. | Sandbox time, parallel task scheduling, token use, repository indexing, and repeated verification. |
| Claude Code-style local terminal agent | Long local development session over a user workspace, possibly spanning multiple subtasks and tool extensions. | Local file system, shell, MCP servers, plugins, Skills, and other user-authorized tools; side effects touch the developer environment. | Session and workspace continuity through compaction, append-oriented session storage, summaries, and artifacts [82]. | Permission modes, hooks, classifier-supported checks, tests, diffs, and user approval constrain tool calls and workspace changes. | Compaction, checkpointing, manual correction, subagent summaries, and permission escalation or denial after risky steps. | High: the user needs trace context for local file changes, tool calls, permission decisions, and reversible recovery. | Long-context management, tool latency, permission-review overhead, local environment setup, and human attention. |
| Hermes-style resident agent | Continuing agent operation across tasks, where recurring patterns should become reusable skills. | Resident main agent coordinates short-lived workers, memory files, skill stores, and task-specific tools over time [86]. | Cross-session memory and procedural skill persistence are central; learned conventions and preferences are injected into future sessions. | Skill distillation checks, memory review, trace evidence, user feedback, and downstream task success determine whether experience should be reused. | Update, retire, or refine skills; isolate exploratory work in subagents; recover by reverting or rewriting corrupted memories. | Very high: continuous operation requires evidence about which memories and skills were created, triggered, effective, or harmful. | Memory curation, skill life-cycle management, trace storage, evaluation of self-improvements, and long-run governance. |
| Boundary case: embodied or enterprise workflow agent | Open-ended operational process rather than a code-only task; completion may involve business state, devices, or physical-world consequences. | Enterprise APIs, identity-scoped systems, robotic or GUI actuators, databases, and compliance-sensitive workflows. | Long-lived organizational or environmental state with identity, policy, and provenance constraints. | Policy engines, human-in-the-loop gates, environment-state audits, rollback checks, compliance logs, and external monitors [22,197,198]. | Rollback, escalation, compensating transactions, manual override, quarantine, or incident review after unsafe actions. | Highest: audit must cover authority, data access, action rationale, approvals, and recovery because harms may outlive the session. | Governance latency, compliance review, monitoring infrastructure, integration maintenance, and opportunity cost of human approval. |
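
The generate-test-repair loop described for the repository-repair archetype can be illustrated with a short Python sketch. It assumes two caller-supplied functions, `propose_patch` and `run_tests`, which are hypothetical placeholders for a model call and a test runner; `RepairResult`, the attempt budget, and the `"escalate"` status are likewise illustrative choices rather than features of any specific system in the table.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RepairResult:
    status: str            # "passed" or "escalate"
    attempts: int
    patch: Optional[str]   # the passing patch, if any
    log: list[str]         # run-scoped record of each attempt


def repair_loop(
    propose_patch: Callable[[str], str],
    run_tests: Callable[[str], tuple[bool, str]],
    issue: str,
    max_attempts: int = 3,
) -> RepairResult:
    """Iterate edit and rerun until tests pass or the budget is exhausted."""
    feedback = issue
    log: list[str] = []
    for attempt in range(1, max_attempts + 1):
        patch = propose_patch(feedback)
        passed, output = run_tests(patch)
        log.append(f"attempt {attempt}: {'pass' if passed else 'fail'}")
        if passed:
            return RepairResult("passed", attempt, patch, log)
        # Feed the test output back so the next proposal can repair locally.
        feedback = f"{issue}\nPrevious patch failed with:\n{output}"
    # Budget exhausted: hand off to environment reset or human diagnosis.
    return RepairResult("escalate", max_attempts, None, log)
```

Keeping the attempt log separate from the returned patch mirrors the table's distinction between run-scoped state and the durable artifact, and the explicit escalation status marks the point at which the row's harder cases fall back to reset or human diagnosis.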
References
- Wood, D.; Bruner, J.S.; Ross, G. The Role of Tutoring in Problem Solving. Journal of Child Psychology and Psychiatry 1976, 17, 89–100. [CrossRef]
- van de Pol, J.; Volman, M.; Beishuizen, J. Scaffolding in Teacher–Student Interaction: A Decade of Research. Educational Psychology Review 2010, 22, 271–296. [CrossRef]
- Vygotsky, L.S. Mind in Society: The Development of Higher Psychological Processes; Harvard University Press: Cambridge, MA, USA, 1978.
- Hollan, J.; Hutchins, E.; Kirsh, D. Distributed Cognition: Toward a New Foundation for Human-Computer Interaction Research. ACM Transactions on Computer-Human Interaction 2000, 7, 174–196. [CrossRef]
- Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A Survey on Large Language Model Based Autonomous Agents. Frontiers of Computer Science 2024, 18, 186345. [CrossRef]
- Xu, R.; Peng, J. A Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications, 2025, [arXiv:cs.AI/2506.12594]. arXiv preprint arXiv:2506.12594.
- Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models, 2022, [arXiv:cs.CL/2210.03629]. arXiv preprint arXiv:2210.03629.
- Qin, Y.; Liang, S.; Ye, Y.; Zhu, K.; Yan, L.; Lu, Y.; Lin, Y.; Cong, X.; Tang, X.; Qian, B.; et al. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, 2023, [arXiv:cs.AI/2307.16789]. arXiv preprint arXiv:2307.16789. [CrossRef]
- Qu, C.; Dai, S.; Wei, X.; Cai, H.; Wang, S.; Yin, D.; Xu, J.; Wen, J.R. Tool Learning with Large Language Models: A Survey. Frontiers of Computer Science 2024, 19, 198343, [arXiv:cs.CL/2405.17935]. [CrossRef]
- Zeng, W.; Huang, Y.; He, J. LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth, 2026, [arXiv:cs.AI/2602.07962]. arXiv preprint arXiv:2602.07962.
- Hu, Y.; Liu, S.; Yue, Y.; Zhang, G.; Liu, B.; Zhu, F.; Lin, J.; Guo, H.; Dou, S.; Xi, Z.; et al. Memory in the Age of AI Agents, 2025, [arXiv:cs.CL/2512.13564]. arXiv preprint arXiv:2512.13564.
- Du, P. Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers, 2026, [arXiv:cs.AI/2603.07670]. arXiv preprint arXiv:2603.07670. [CrossRef]
- Wang, X.J.; Bai, H.; Sun, Y.; Wang, H.; Zhang, S.; Hu, W.; Schroder, M.; Mutlu, B.; Song, D.; Nowak, R.D. The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break, 2026, [arXiv:cs.AI/2604.11978]. arXiv preprint arXiv:2604.11978.
- Yang, Y.T.; Zhu, Q. Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs, 2026, [arXiv:cs.AI/2605.23929]. arXiv preprint arXiv:2605.23929.
- LangChain. The Anatomy of an Agent Harness, 2026. LangChain blog.
- Lopopolo, R. Harness engineering: leveraging Codex in an agent-first world, 2026. OpenAI blog.
- Ding, S.; Dai, X.; Xing, L.; Ding, S.; Liu, Z.; JingYi, Y.; Yang, P.; Zhang, Z.; Wei, X.; Fang, X.; et al. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation, 2026, [arXiv:cs.CL/2605.10912]. arXiv preprint arXiv:2605.10912. [CrossRef]
- Ndzomga, F. Efficient Benchmarking of AI Agents, 2026, [arXiv:cs.AI/2603.23749]. arXiv preprint arXiv:2603.23749.
- Lin, J.; Liu, S.; Pan, C.; Lin, L.; Dou, S.; Xi, Z.; Huang, X.; Yan, H.; Han, Z.; Gui, T.; et al. Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses, 2026, [arXiv:cs.CL/2604.25850]. arXiv preprint arXiv:2604.25850. [CrossRef]
- Anthropic. Effective context engineering for AI agents, 2025. Official Anthropic engineering post.
- Ehtesham, A.; Singh, A.; Gupta, G.K.; Kumar, S. A survey of agent interoperability protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP), 2025, [arXiv:cs.AI/2505.02279]. arXiv preprint arXiv:2505.02279.
- Su, H.; Luo, J.; Liu, C.; Yang, X.; Zhang, Y.; Dong, Y.; Zhu, J. A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents, 2025, [arXiv:cs.AI/2506.23844]. arXiv preprint arXiv:2506.23844.
- Mohammadi, M.; Li, Y.; Lo, J.; Yip, W. Evaluation and Benchmarking of LLM Agents: A Survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. ACM, 2025, KDD ’25, pp. 6129–6139. [CrossRef]
- Anthropic. Equipping agents for the real world with Agent Skills, 2025. Official Anthropic engineering post.
- Li, K.; Shi, J.; Xiao, Y.; Jiang, M.; Sun, J.; Wu, Y.; Fu, D.; Xia, S.; Cai, X.; Xu, T.; et al. AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts, 2026, [arXiv:cs.AI/2601.11044]. arXiv preprint arXiv:2601.11044.
- Mei, L.; Yao, J.; Ge, Y.; Wang, Y.; Bi, B.; Cai, Y.; Liu, J.; Li, M.; Li, Z.Z.; Zhang, D.; et al. A Survey of Context Engineering for Large Language Models, 2025, [arXiv:cs.CL/2507.13334]. arXiv preprint arXiv:2507.13334.
- LangChain. Context engineering in agents, 2025. LangChain docs page.
- Wu, Y.; Zheng, Y.; Xu, T.; Zhang, Z.; Yu, Y.; Zhu, J.; Ma, C.; Lin, B.; Dong, B.; Zhu, H.; et al. ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents, 2026, [arXiv:cs.AI/2604.01664]. arXiv preprint arXiv:2604.01664. [CrossRef]
- Anthropic. Introducing advanced tool use on the Claude Developer Platform, 2025. Official Anthropic engineering post.
- Manus. Context Engineering for AI Agents: Lessons from Building Manus, 2025. Official Manus blog.
- Anthropic. How we built our multi-agent research system, 2025. Official Anthropic engineering post.
- Anthropic. Writing effective tools for agents – with agents, 2025. Official Anthropic engineering post.
- Chen, Y.C.; Hsu, P.C.; Hsu, C.J.; Shiu, D.s. Enhancing Function-Calling Capabilities in LLMs: Strategies for Prompt Formats, Data Integration, and Multilingual Translation. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Industry Track). Association for Computational Linguistics, 2025, pp. 99–111. [CrossRef]
- Huang, D.; Malwe, G.; Wang, Z. When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems, 2026, [arXiv:cs.AI/2601.16280]. arXiv preprint arXiv:2601.16280.
- Guo, R.; Dong, K.; Gao, X.; Das, K. Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use, 2026, [arXiv:cs.AI/2602.20426]. arXiv preprint arXiv:2602.20426. [CrossRef]
- Masi, A.D. Terminal Is All You Need: Design Properties for Human-AI Agent Collaboration, 2026, [arXiv:cs.HC/2603.10664]. arXiv preprint arXiv:2603.10664.
- Lumer, E.; Gulati, A.; Nizar, F.; Hedroits, D.; Mehta, A.; Hwangbo, H.; Subbiah, V.K.; Basavaraju, P.H.; Burke, J.A. Tool and Agent Selection for Large Language Model Agents in Production: A Survey. In Proceedings of the 2026 IEEE Conference on Artificial Intelligence (CAI). IEEE, 2026, pp. 701–708. [CrossRef]
- Lumer, E.; Nizar, F.; Gulati, A.; Basavaraju, P.H.; Subbiah, V.K. Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems, 2025, [arXiv:cs.CL/2511.01854]. arXiv preprint arXiv:2511.01854.
- Wu, Q.; Das, S.; Amani, M.; Nag, A.; Lee, S.; Gummadi, K.P.; Ravichander, A.; Zafar, M.B. To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling, 2026, [arXiv:cs.AI/2605.00737]. arXiv preprint arXiv:2605.00737. [CrossRef]
- Anthropic. Model Context Protocol, 2024. Official MCP overview.
- Lee, Y.; Choi, W.; Nam, D. Supply Chain Threats in the MCP Ecosystem: Attack Vectors and Mitigation Strategies. In Advances in Information and Computer Security; 2026; pp. 329–349. [CrossRef]
- Packer, C.; Wooders, S.; Lin, K.; Fang, V.; Patil, S.G.; Stoica, I.; Gonzalez, J.E. MemGPT: Towards LLMs as Operating Systems, 2023, [arXiv:cs.AI/2310.08560]. arXiv preprint arXiv:2310.08560.
- Sumers, T.R.; Yao, S.; Narasimhan, K.; Griffiths, T.L. Cognitive Architectures for Language Agents, 2023, [arXiv:cs.AI/2309.02427]. arXiv preprint arXiv:2309.02427.
- Park, J.S.; O’Brien, J.C.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative Agents: Interactive Simulacra of Human Behavior, 2023, [arXiv:cs.HC/2304.03442]. arXiv preprint arXiv:2304.03442.
- Jiang, D.; Li, Y.; Wei, S.; Yang, J.; Kishore, A.; Zhao, A.; Kang, D.; Hu, X.; Chen, F.; Li, Q.; et al. Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations, 2026, [arXiv:cs.CL/2602.19320]. arXiv preprint arXiv:2602.19320. [CrossRef]
- Yang, C.; Zhou, C.; Xiao, Y.; Dong, S.; Zhuang, L.; Zhang, Y.; Wang, Z.; Hong, Z.; Yuan, Z.; Xiang, Z.; et al. Graph-based Agent Memory: Taxonomy, Techniques, and Applications, 2026, [arXiv:cs.AI/2602.05665]. arXiv preprint arXiv:2602.05665.
- Fang, R.; Liang, Y.; Wang, X.; Wu, J.; Qiao, S.; Xie, P.; Huang, F.; Chen, H.; Zhang, N. Memp: Exploring Agent Procedural Memory, 2025, [arXiv:cs.CL/2508.06433]. arXiv preprint arXiv:2508.06433.
- LangChain. How agents can use filesystems for context engineering, 2025. LangChain blog.
- Cao, W.; Yin, X.; Dhingra, B.; Zhou, S. Coding Agents are Effective Long-Context Processors, 2026, [arXiv:cs.CL/2603.20432]. arXiv preprint arXiv:2603.20432.
- Huang, X.; Liu, W.; Chen, X.; Wang, X.; Wang, H.; Lian, D.; Wang, Y.; Tang, R.; Chen, E. Understanding the planning of LLM agents: A survey, 2024, [arXiv:cs.AI/2402.02716]. arXiv preprint arXiv:2402.02716. [CrossRef]
- Adimulam, A.; Gupta, R.; Kumar, S. The Orchestration of Multi-Agent Systems: Architectures, Protocols, and Enterprise Adoption, 2026, [arXiv:cs.MA/2601.13671]. arXiv preprint arXiv:2601.13671.
- Wei, H.; Zhang, Z.; He, S.; Xia, T.; Pan, S.; Liu, F. PlanGenLLMs: A Modern Survey of LLM Planning Capabilities. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2025, pp. 19497–19521. [CrossRef]
- Guo, Z.; Li, Y.; Qiu, L.; Wang, X.; Xv, J.; Ru, D.; Li, X.; Zheng, X.; Cao, X.; Cai, X. AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents, 2026, [arXiv:cs.AI/2605.07926]. arXiv preprint arXiv:2605.07926.
- Anthropic. Building Effective AI Agents, 2024. Official Anthropic research note.
- Microsoft. Conductor, 2026. Official Microsoft GitHub repository.
- Moslem, Y.; Kelleher, J.D. Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey, 2026, [arXiv:cs.NI/2603.04445]. arXiv preprint arXiv:2603.04445. [CrossRef]
- Guo, X.; Wang, S.; Ji, C.; Zhao, X.; Xi, W.; Liu, Y.; Li, Q.; Deng, C.; Feng, J. Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference, 2025, [arXiv:cs.MA/2509.07571]. arXiv preprint arXiv:2509.07571.
- Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S.K.S.; Lin, Z.; et al. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework, 2023, [arXiv:cs.AI/2308.00352]. arXiv preprint arXiv:2308.00352.
- Bui, N.D.Q. Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned, 2026, [arXiv:cs.AI/2603.05344]. arXiv preprint arXiv:2603.05344.
- Harang, R. Practical Security Guidance for Sandboxing Agentic Workflows and Managing Execution Risk, 2026. NVIDIA Technical Blog, published January 30, 2026.
- Agache, A.; Brooker, M.; Iordache, A.; Liguori, A.; Neugebauer, R.; Piwonka, P.; Popa, D.M. Firecracker: Lightweight Virtualization for Serverless Applications, 2020. Proc. 17th USENIX Symposium on Networked Systems Design and Implementation.
- Ning, L.; Liang, Z.; Jiang, Z.; Qu, H.; Ding, Y.; Fan, W.; Wei, X.Y.; Lin, S.; Liu, H.; Yu, P.S.; et al. A Survey of WebAgents: Towards Next-Generation AI Agents for Web Automation with Large Foundation Models, 2025, [arXiv:cs.AI/2503.23350]. arXiv preprint arXiv:2503.23350. [CrossRef]
- Wang, S.; Liu, W.; Chen, J.; Zhou, Y.; Gan, W.; Zeng, X.; Che, Y.; Yu, S.; Hao, X.; Shao, K.; et al. GUI Agents with Foundation Models: A Comprehensive Survey, 2024, [arXiv:cs.AI/2411.04890]. arXiv preprint arXiv:2411.04890.
- Li, J.; Li, Y.; Zhao, C.; Xu, Z.; Hu, B.; Zhang, M. WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments, 2026, [arXiv:cs.AI/2604.27776]. arXiv preprint arXiv:2604.27776.
- Jang, L.K.; Koh, J.Y.; Fried, D.; Salakhutdinov, R. Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks, 2026, [arXiv:cs.LG/2604.24964]. arXiv preprint arXiv:2604.24964.
- Anthropic. Effective harnesses for long-running agents, 2025. Official Anthropic engineering post, published November 26, 2025.
- Shamsujjoha, M.; Lu, Q.; Zhao, D.; Zhu, L. Swiss Cheese Model for AI Safety: A Taxonomy and Reference Architecture for Multi-Layered Guardrails of Foundation Model Based Agents, 2024, [arXiv:cs.SE/2408.02205]. arXiv preprint arXiv:2408.02205. [CrossRef]
- Chennabasappa, S.; Nikolaidis, C.; Song, D.; Molnar, D.; Ding, S.; Wan, S.; Whitman, S.; Deason, L.; Doucette, N.; Montilla, A.; et al. LlamaFirewall: An Open Source Guardrail System for Building Secure AI Agents, 2025, [arXiv:cs.CR/2505.03574]. arXiv preprint arXiv:2505.03574.
- OWASP Gen AI Security Project. LLM06:2025 Excessive Agency, 2025. OWASP Gen AI Security Project.
- OpenAI. Human-in-the-loop, 2026. OpenAI Agents SDK documentation.
- OpenAI. Safety practices for agent builders, 2026. OpenAI developer documentation.
- Liu, G.; Solomon, S. AI Agent Observability - Evolving Standards and Best Practices, 2025. OpenTelemetry blog.
- Crowder, S. How to debug and evaluate AI agents with observability, 2026. LangChain blog.
- Wang, Y.; Zhang, J.; Cai, T.; Liu, Z.; Sun, Q.; Sun, Z.; Wu, Z.; Zhang, M.; Zhu, Y. From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents, 2026, [arXiv:cs.CR/2606.04990]. arXiv preprint arXiv:2606.04990.
- Okamoto, M.; Erol, A.K.; Riedl, M. Explainable Model Routing for Agentic Workflows, 2026, [arXiv:cs.AI/2604.03527]. arXiv preprint arXiv:2604.03527.
- Trivedy, V. Improving Deep Agents with Harness Engineering, 2026. LangChain blog.
- LangChain. LangChain State of AI Agents Report: 2024 Trends, 2024. LangChain report.
- Rabanser, S.; Kapoor, S.; Kirgis, P.; Liu, K.; Utpala, S.; Narayanan, A. Towards a Science of AI Agent Reliability, 2026, [arXiv:cs.AI/2602.16666]. arXiv preprint arXiv:2602.16666. [CrossRef]
- Gupta, A. ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions, 2026, [arXiv:cs.AI/2601.06112]. arXiv preprint arXiv:2601.06112.
- Li, X.; Ming, R.; Setlur, P.; Paladugu, A.; Tang, A.; Kang, H.; Shao, S.; Jin, R.; Xiong, C. Benchmark Test-Time Scaling of General LLM Agents, 2026, [arXiv:cs.AI/2602.18998]. arXiv preprint arXiv:2602.18998.
- Zhao, Y.; Yuan, B.; Huang, J.; Yuan, H.; Yu, Z.; Xu, H.; Hu, L.; Shankarampeta, A.; Huang, Z.; Ni, W.; et al. AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications, 2026, [arXiv:cs.AI/2602.22769]. arXiv preprint arXiv:2602.22769. [CrossRef]
- Liu, J.; Zhao, X.; Shang, X.; Shen, Z. Dive into Claude Code: The Design Space of Today’s and Future AI Agent Systems, 2026, [arXiv:cs.SE/2604.14228]. arXiv preprint arXiv:2604.14228.
- Le, T.; Thai, M.V.T.; Manh, D.N.; Nhat, H.P.; Bui, N.D.Q. SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios, 2025, [arXiv:cs.SE/2512.18470]. arXiv preprint arXiv:2512.18470.
- Merrill, M.A.; Shaw, A.G.; Carlini, N.; Li, B.; Raj, H.; Bercovich, I.; Shi, L.; Shin, J.Y.; Walshe, T.; Buchanan, E.K.; et al. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces, 2026, [arXiv:cs.SE/2601.11868]. arXiv preprint arXiv:2601.11868.
- OpenAI. Introducing Codex, 2025. OpenAI blog.
- Nous Research. Hermes Agent, 2026. Official Nous Research GitHub repository.
- Chen, Z.; Wu, X.; Jia, J.; Gao, C.; Fu, Q.; Zhang, D.; Hu, S. LongBench Pro: A More Realistic and Comprehensive Bilingual Long-Context Evaluation Benchmark, 2026, [arXiv:cs.CL/2601.02872]. arXiv preprint arXiv:2601.02872. [CrossRef]
- Fang, S.; Wang, Y.; Liu, X.; Lu, J.; Tan, C.; Chen, X.; Zheng, Y.; Huang, X.; Qiu, X. AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts, 2026, [arXiv:cs.CL/2601.20730]. arXiv preprint arXiv:2601.20730.
- Jenova AI. Long-Context Agentic Orchestration Benchmark, 2026. Jenova.ai benchmark page.
- Yu, P.; Liu, W.; Yang, Y.; Li, J.; Zhang, Z.; Feng, X.; Zhang, F. Benchmarking LLM Tool-Use in the Wild, 2026, [arXiv:cs.HC/2604.06185]. arXiv preprint arXiv:2604.06185.
- Wang, Z.; Chang, Q.; Patel, H.; Biju, S.; Wu, C.E.; Liu, Q.; Ding, A.; Rezazadeh, A.; Shah, A.; Bao, Y.; et al. MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers, 2025, [arXiv:cs.CL/2508.20453]. arXiv preprint arXiv:2508.20453. [CrossRef]
- Li, J.; Zhao, W.; Zhao, J.; Zeng, W.; Wu, H.; Wang, X.; Ge, R.; Cao, Y.; Huang, Y.; Liu, W.; et al. The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution, 2025, [arXiv:cs.CL/2510.25726]. arXiv preprint arXiv:2510.25726.
- Bandi, C.; Dumitru, R.G.; Hertzberg, B.; Agarwal, D.; Boo, G.; Polakam, T.; Hassaan, S.; Da, J.; Kim, H.; Gupta, V.; et al. MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers, 2026, [arXiv:cs.SE/2602.00933]. arXiv preprint arXiv:2602.00933.
- Hu, Y.; Wang, Y.; McAuley, J. Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions, 2025, [arXiv:cs.CL/2507.05257]. arXiv preprint arXiv:2507.05257. [CrossRef]
- Liu, L.; Yadav, N. Introducing STATE-Bench: A benchmark for AI agent memory, 2026. Microsoft Open Source blog.
- He, Z.; Wang, Y.; Zhi, C.; Hu, Y.; Chen, T.P.; Yin, L.; Chen, Z.; Wu, T.A.; Ouyang, S.; Wang, Z.; et al. MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks, 2026, [arXiv:cs.CL/2602.16313]. arXiv preprint arXiv:2602.16313.
- Ai, Q.; Tang, Y.; Wang, C.; Long, J.; Su, W.; Liu, Y. MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems, 2025, [arXiv:cs.LG/2510.17281]. arXiv preprint arXiv:2510.17281. [CrossRef]
- Tavakoli, M.; Salemi, A.; Ye, C.; Abdalla, M.; Zamani, H.; Mitchell, J.R. Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs, 2025, [arXiv:cs.CL/2510.27246]. arXiv preprint arXiv:2510.27246.
- Liu, G.; Zhao, P.; Liang, Y.; Luo, Q.; Tang, S.; Chai, Y.; Lin, W.; Xiao, H.; Wang, W.; Chen, S.; et al. MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments, 2026, [arXiv:cs.DC/2602.06075]. arXiv preprint arXiv:2602.06075.
- Zhang, Y.; Jiang, S.; Li, R.; Tu, J.; Su, Y.; Deng, L.; Guo, X.; Lv, C.; Lin, J. DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints, 2026, [arXiv:cs.AI/2601.18137]. arXiv preprint arXiv:2601.18137. [CrossRef]
- Luo, H.; Zhang, H.; Zhang, X.; Wang, H.; Qin, Z.; Lu, W.; Ma, G.; He, H.; Xie, Y.; Zhou, Q.; et al. UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios, 2025, [arXiv:cs.AI/2509.21766]. arXiv preprint arXiv:2509.21766.
- Froger, R.; Andrews, P.; Bettini, M.; Budhiraja, A.; Cabral, R.S.; Do, V.; Garreau, E.; Gaya, J.B.; Laurençon, H.; Lecanu, M.; et al. Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments, 2026, [arXiv:cs.AI/2602.11964]. arXiv preprint arXiv:2602.11964.
- Zhou, Q.; Zhang, J.; Wang, H.; Hao, R.; Wang, J.; Han, M.; Yang, Y.; Wu, S.; Pan, F.; Fan, L.; et al. FeatureBench: Benchmarking Agentic Coding for Complex Feature Development, 2026, [arXiv:cs.SE/2602.10975]. arXiv preprint arXiv:2602.10975. [CrossRef]
- Liu, E.; Pan, L.; Gao, Z.; Yang, Y.; Shi, C.; Liu, Y.; Wu, J.; Li, Q. Benchmarking and Improving GUI Agents in High-Dynamic Environments, 2026, [arXiv:cs.CV/2604.25380]. arXiv preprint arXiv:2604.25380.
- Du, M.; Xu, B.; Zhu, C.; Wang, X.; Mao, Z. DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents, 2025, [arXiv:cs.CL/2506.11763]. arXiv preprint arXiv:2506.11763.
- Chen, Z.; Ma, X.; Zhuang, S.; Nie, P.; Zou, K.; Liu, A.; Green, J.; Patel, K.; Meng, R.; Su, M.; et al. BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent, 2025, [arXiv:cs.CL/2508.06600]. arXiv preprint arXiv:2508.06600.
- Zhang, H.; Zhou, J.; Li, B.; Zhou, B.; Shan, Y.; Lu, H.; Cao, Z.; Chen, J.; Han, Y.; Sheng, Z.; et al. BrowseComp-V3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents, 2026, [arXiv:cs.AI/2602.12876]. arXiv preprint arXiv:2602.12876. [CrossRef]
- Li, H.; Wen, R.; Shi, S.; Zhang, N.; Vorobeychik, Y.; Xiao, C. AgentDyn: Are Your Agent Security Defenses Deployable in Real-World Dynamic Environments?, 2026, [arXiv:cs.CR/2602.03117]. arXiv preprint arXiv:2602.03117.
- Jiang, T.; Wang, Y.; Liang, J.; Wang, T. AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks, 2026, [arXiv:cs.AI/2602.16901]. arXiv preprint arXiv:2602.16901.
- Albrethsen, J.; Datta, Y.; Kumar, K.; Rajasekar, S. DeepContext: Stateful Real-Time Detection of Multi-Turn Adversarial Intent Drift in LLMs, 2026, [arXiv:cs.AI/2602.16935]. arXiv preprint arXiv:2602.16935. [CrossRef]
- Cao, T.; Lim, B.; Liu, Y.; Sui, Y.; Li, Y.; Deng, S.; Lu, L.; Oo, N.; Yan, S.; Hooi, B. VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents, 2025, [arXiv:cs.AI/2506.02456]. arXiv preprint arXiv:2506.02456.
- Chen, C.; Song, X.; Chai, Y.; Yao, Y.; Zhao, H.; Li, L.; Li, J.; Teng, Y.; Liu, G.; Wang, Y. GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?, 2025, [arXiv:cs.CR/2510.20333]. arXiv preprint arXiv:2510.20333.
- Liu, G.; Ye, J.; Liu, J.; Li, Y.; Liu, W.; Gao, P.; Luan, J.; Liu, Y. Mobile GUI Agents under Real-world Threats: Are We There Yet?, 2025, [arXiv:cs.CR/2507.04227]. arXiv preprint arXiv:2507.04227. [CrossRef]
- Li, Y.; Luo, H.; Xie, Y.; Fu, Y.; Yang, Z.; Shao, S.; Ren, Q.; Qu, W.; Fu, Y.; Yang, Y.; et al. ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis, 2026, [arXiv:cs.AI/2604.02022]. arXiv preprint arXiv:2604.02022.
- Chen, Y.S.; Huang, S.Y.; Yang, C.L.; Chen, Y.N. TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories, 2026, [arXiv:cs.CR/2604.07223]. arXiv preprint arXiv:2604.07223.
- Yagoubi, F.E.; Badu-Marfo, G.; Mallah, R.A. AgentLeak: A Full-Stack Benchmark for Privacy Leakage in Multi-Agent LLM Systems, 2026, [arXiv:cs.AI/2602.11510]. arXiv preprint arXiv:2602.11510. [CrossRef]
- Ma, J.; Du, X.; Lin, R.; Bian, Y.; Chen, J.; Wang, J.; Yang, X.; Cui, S.; Meng, C.; Deng, X.; et al. Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions, 2026, [arXiv:cs.CR/2605.22321]. arXiv preprint arXiv:2605.22321.
- Wang, Z.; Li, Y.; Wu, Y.; Liu, Z.; Chen, K.; Wai, F.K.; Chen, P.Y.; Thing, V.L.L.; Li, B.; Tao, D.; et al. Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents, 2026, [arXiv:cs.CR/2606.13385]. arXiv preprint arXiv:2606.13385.
- Marchand, R.; Cathain, A.O.; Wynne, J.; Giavridis, P.M.; Deverett, S.; Wilkinson, J.; Gwartz, J.; Coppock, H. Quantifying Frontier LLM Capabilities for Container Sandbox Escape, 2026, [arXiv:cs.CR/2603.02277]. arXiv preprint arXiv:2603.02277.
- Chen, M.; Wang, J.; Mu, F.; Wang, Y.; Liu, Z.; Feng, H.; Wang, Q. Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems, 2026, [arXiv:cs.MA/2604.22708]. arXiv preprint arXiv:2604.22708. [CrossRef]
- In, Y.; Tanjim, M.; Subramanian, J.; Kim, S.; Bhattacharya, U.; Kim, W.; Park, S.; Sarkhel, S.; Park, C. Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation, 2026, [arXiv:cs.AI/2603.25001]. arXiv preprint arXiv:2603.25001.
- Cemri, M.; Pan, M.Z.; Yang, S.; Agrawal, L.A.; Chopra, B.; Tiwari, R.; Keutzer, K.; Parameswaran, A.; Klein, D.; Ramchandran, K.; et al. Why Do Multi-Agent LLM Systems Fail?, 2025, [arXiv:cs.AI/2503.13657]. arXiv preprint arXiv:2503.13657.
- Cheng, Z.; Wang, W.; Zhao, Y.; Ren, Z.; Chen, J.; Xu, R.; Huang, S.; Chen, Y.; Li, G.; Wang, M.; et al. LifeBench: A Benchmark for Long-Horizon Multi-Source Memory, 2026, [arXiv:cs.AI/2603.03781]. arXiv preprint arXiv:2603.03781.
- Huang, Z.; Liu, W.; Tian, Z.; Chen, W.; Chen, J.; Wu, Y.; Zhang, F.; Guo, Q.; Zhou, X. M3Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions, 2026, [arXiv:cs.CL/2606.07402]. arXiv preprint arXiv:2606.07402. [CrossRef]
- Zhu, S.; Yang, Y.; Wang, Z.; Shen, T.; Guo, D.; Yang, M.H. H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions, 2026, [arXiv:cs.CL/2606.09461]. arXiv preprint arXiv:2606.09461.
- Bian, H.; Yao, Z.; Hu, S.; Xu, Z.; Zhang, S.; Guo, Y.; Yang, Z.; Han, X.; Wang, H.; Chen, R. RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction, 2026, [arXiv:cs.CL/2601.06966]. arXiv preprint arXiv:2601.06966.
- Zhu, J.; Tian, Y.; Li, B.; Wu, K.; Liang, Z.; Li, J.; Zhang, X.; Guo, L.; Chen, F.; Liu, Y.; et al. FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol, 2026, [arXiv:cs.AI/2603.24943]. arXiv preprint arXiv:2603.24943. [CrossRef]
- Wang, W.; Niu, P.; Zou, G.; Yang, X.; Wang, J.; Shi, H.; Du, Y.; Chai, J.; Pang, X.; Tang, S.; et al. MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation, 2026, [arXiv:cs.AI/2606.02470]. arXiv preprint arXiv:2606.02470.
- Wang, J.; Liu, X.; Li, Y.; Zhang, S.; Wang, Y.; Shan, Z.; Le, X.; Chen, C.; Guan, X.; Tao, D. GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows, 2026, [arXiv:cs.CL/2604.15715]. arXiv preprint arXiv:2604.15715.
- Pysklo, H.M.; Zhuravel, A.; Watson, P.D. Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation, 2026, [arXiv:cs.SE/2602.11224]. arXiv preprint arXiv:2602.11224. [CrossRef]
- Ma, R.; Shankar, S.; Chen, R.; Lin, Y.; Zeighami, S.; Ghosh, R.; Gupta, A.; Gupta, A.; Gopal, T.; Parameswaran, A.G. Can AI Agents Answer Your Data Questions? A Benchmark for Data Agents, 2026, [arXiv:cs.DB/2603.20576]. arXiv preprint arXiv:2603.20576.
- Bandel, E.; Yehudai, A.; Eden, L.; Sagron, Y.; Perlitz, Y.; Venezian, E.; Razinkov, N.; Ergas, N.; Ifergan, S.S.; Shlomov, S.; et al. General Agent Evaluation, 2026, [arXiv:cs.AI/2602.22953]. arXiv preprint arXiv:2602.22953.
- Long, X.; Du, L.; Xu, Y.; Liu, F.; Wang, H.; Ding, N.; Li, Z.; Guo, J.; Tang, Y. LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks, 2026, [arXiv:cs.CL/2604.13072]. arXiv preprint arXiv:2604.13072.
- Xu, X.; Yang, R.; Shen, H.; Xu, W.; Gao, B.; Wu, R.; Shi, K.; Xie, W.; Chen, X.; Wu, M.; et al. RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades, 2026, [arXiv:cs.SE/2605.15846]. arXiv preprint arXiv:2605.15846. [CrossRef]
- SWE-Marathon. SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?, 2026. Project website.
- Xu, K.; Lu, X.; Qiao, S.; Ding, Z.; Xu, H.; Liang, L.; Zhang, N. LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis, 2026, [arXiv:cs.LG/2605.30434]. arXiv preprint arXiv:2605.30434.
- Chen, H.; Metelski, D.; Qi, L.; Xia, T.; Lee, J.; Brown, S.; Riley, K.; Wang, F.; Liu, T.Y.A.; MD, H.C.; et al. CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?, 2026, [arXiv:cs.CL/2605.16679]. arXiv preprint arXiv:2605.16679.
- Xu, W.; Li, S.; Ye, T.; Cao, Q.; Chen, Y.; Gao, H.; Wang, Y.; Li, Q.; Li, K.; Xu, S.; et al. ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research, 2026, [arXiv:cs.LG/2606.07591]. arXiv preprint arXiv:2606.07591. [CrossRef]
- Li, X.; Chen, W.; Liu, Y.; Zheng, S.; Chen, X.; He, Y.; Li, Y.; You, B.; Shen, H.; Sun, J.; et al. SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks, 2026, [arXiv:cs.AI/2602.12670]. arXiv preprint arXiv:2602.12670.
- Bechard, P.; Ayala, O.M.; Chen, E.; Skelton, J.; Davasam, S.; Sunkara, S.; Yadav, V.; Rajeswar, S. Terminal Agents Suffice for Enterprise Automation, 2026, [arXiv:cs.SE/2604.00073]. arXiv preprint arXiv:2604.00073.
- Li, R.; Du, M.; Xu, B.; Zhu, C.; Wang, X.; Mao, Z. DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report, 2026, [arXiv:cs.CL/2601.08536]. arXiv preprint arXiv:2601.08536.
- Huang, P.; Zhong, Z.; Wan, Z.; Zhou, D.; Alam, S.; Wang, X.; Li, Z.; Dou, Z.; Zhu, L.; Xiong, J.; et al. MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents, 2026, [arXiv:cs.CV/2601.12346]. arXiv preprint arXiv:2601.12346. [CrossRef]
- Feng, Y.; Huang, Q.; Xie, X.; Yang, Z.; Yu, J.; Chen, W.; Tung, A.K.H. IDRBench: Interactive Deep Research Benchmark, 2026, [arXiv:cs.CL/2601.06676]. arXiv preprint arXiv:2601.06676.
- Wu, T.; Wang, Y.; Ma, X.; He, X.; Wang, S.; Yin, D.; Zhao, X. DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent, 2026, [arXiv:cs.AI/2603.01152]. arXiv preprint arXiv:2603.01152.
- Tian, X.; Wang, H.; Chen, S.; Zhou, H.; Yu, K.; Zhang, Y.; Ouyang, J.; Yin, J.; Chen, J.; Guo, B.; et al. ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas, 2026, [arXiv:cs.CL/2601.21558]. arXiv preprint arXiv:2601.21558. [CrossRef]
- Gao, J.; Chen, J.; He, C.; Xu, S.; Jin, D.; Wu, Y. From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents, 2026, [arXiv:cs.AI/2601.22607]. arXiv preprint arXiv:2601.22607.
- Zhang, Y.; Zeng, Y.; Li, Q.; Hu, Z.; Han, K.; Zuo, W. Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use, 2025, [arXiv:cs.LG/2509.12867]. arXiv preprint arXiv:2509.12867.
- Ding, H.; Liu, P.; Wang, J.; Ji, Z.; Cao, M.; Zhang, R.; Ai, L.; Yang, E.; Shi, T.; Yu, L. DynaWeb: Model-Based Reinforcement Learning of Web Agents, 2026, [arXiv:cs.CL/2601.22149]. arXiv preprint arXiv:2601.22149. [CrossRef]
- Da, J.; Wang, C.; Deng, X.; Ma, Y.; Barhate, N.; Hendryx, S. Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards, 2025, [arXiv:cs.CL/2506.11425]. arXiv preprint arXiv:2506.11425.
- Xi, Z.; Huang, J.; Liao, C.; Huang, B.; Guo, H.; Liu, J.; Zheng, R.; Ye, J.; Zhang, J.; Chen, W.; et al. AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning, 2025, [arXiv:cs.LG/2509.08755]. arXiv preprint arXiv:2509.08755.
- Wang, Z.; Xu, C.; Liu, B.; Wang, Y.; Han, S.; Yao, Z.; Yao, H.; He, Y. Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning, 2026, [arXiv:cs.AI/2602.10090]. arXiv preprint arXiv:2602.10090. [CrossRef]
- Ren, Z.; Zhang, X.; Qian, Z.; Gao, Y.; Shi, Y.; Zheng, S.; He, J. GTM: Simulating the World of Tools for AI Agents, 2025, [arXiv:cs.AI/2512.04535]. arXiv preprint arXiv:2512.04535.
- Li, W.; Qu, B.; Pan, B.; Zhang, J.; Liu, Z.; Zhang, P.; Chen, W.; Zhang, B. LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent, 2026, [arXiv:cs.AI/2604.17931]. arXiv preprint arXiv:2604.17931.
- Wu, X.; Sun, Q.; Zhang, R.; Song, C.; Wu, J.; Qi, Y.; Cheng, H. Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe, 2026, [arXiv:cs.LG/2603.21972]. arXiv preprint arXiv:2603.21972. [CrossRef]
- Chen, K.; Cusumano-Towner, M.; Huval, B.; Petrenko, A.; Hamburger, J.; Koltun, V.; Krähenbühl, P. Reinforcement Learning for Long-Horizon Interactive LLM Agents, 2025, [arXiv:cs.LG/2502.01600]. arXiv preprint arXiv:2502.01600.
- Luo, J.; Tian, Y.; Cao, C.; Luo, Z.; Lin, H.; Li, K.; Kong, C.; Yang, R.; Ma, J. From Storage to Experience, 2026, [arXiv:cs.AI/2605.06716]. arXiv preprint arXiv:2605.06716.
- Allard, M.A.; Teinturier, A.; Xing, V.; Viaud, G. Experiential Reflective Learning for Self-Improving LLM Agents, 2026, [arXiv:cs.LG/2603.24639]. arXiv preprint arXiv:2603.24639. [CrossRef]
- He, Z.; Li, Y.; Huang, F.; Chen, T.; Chen, S.; Li, X.; Yu, M.H.; Liu, X.; Wei, L.; Pan, L.; et al. SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training, 2026, [arXiv:cs.AI/2606.02355]. arXiv preprint arXiv:2606.02355.
- Lin, H.; Li, P.; Song, J.; Jiang, F.; Zhang, T. MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation, 2026, [arXiv:cs.AI/2605.27366]. arXiv preprint arXiv:2605.27366.
- Jiang, Y.; Li, D.; Deng, H.; Ma, B.; Wang, X.; Wang, Q.; Yu, G. SoK: Agentic Skills: Beyond Tool Use in LLM Agents, 2026, [arXiv:cs.CR/2602.20867]. arXiv preprint arXiv:2602.20867.
- Zhang, Q.; Hu, C.; Upasani, S.; Ma, B.; Hong, F.; Kamanuru, V.; Rainton, J.; Wu, C.; Ji, M.; Li, H.; et al. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models, 2025, [arXiv:cs.LG/2510.04618]. arXiv preprint arXiv:2510.04618. [CrossRef]
- Acikgoz, E.C.; Qian, C.; Ji, H.; Hakkani-Tür, D.; Tur, G. Self-Improving LLM Agents at Test-Time, 2025, [arXiv:cs.LG/2510.07841]. arXiv preprint arXiv:2510.07841.
- Li, Y.; Lin, Z.; Deng, A.; Zhang, X.; He, Y.; Ji, S.; Cao, T.; Hooi, B. Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates, 2026, [arXiv:cs.LG/2601.18510]. arXiv preprint arXiv:2601.18510.
- Sengupta, B.; Wang, J. HARBOR: Automated Harness Optimization, 2026, [arXiv:cs.LG/2604.20938]. arXiv preprint arXiv:2604.20938.
- Lee, Y.; Nair, R.; Zhang, Q.; Lee, K.; Khattab, O.; Finn, C. Meta-Harness: End-to-End Optimization of Model Harnesses, 2026, [arXiv:cs.AI/2603.28052]. arXiv preprint arXiv:2603.28052. [CrossRef]
- Pan, W.; Liu, S.; Lin, C.Y.; Zeng, J.; Tang, X.; Zhou, X.; Lu, Y.; Jia, X. Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference, 2026, [arXiv:cs.AI/2606.05922]. arXiv preprint arXiv:2606.05922.
- Xu, T.; Wen, H.; Li, M. Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents, 2026, [arXiv:cs.AI/2605.22166]. arXiv preprint arXiv:2605.22166.
- Vishnyakova, V.V. Context Engineering: From Prompts to Corporate Multi-Agent Architecture, 2026, [arXiv:cs.AI/2603.09619]. arXiv preprint arXiv:2603.09619.
- Pan, L.; Zou, L.; Guo, S.; Ni, J.; Zheng, H.T. Natural-Language Agent Harnesses, 2026, [arXiv:cs.CL/2603.25723]. arXiv preprint arXiv:2603.25723.
- He, C.; Zhou, X.; Wang, D.; Xu, H.; Liu, W.; Miao, C. Harness Engineering for Language Agents: The Harness Layer as Control, Agency, and Runtime, 2026. Preprints.org. [CrossRef]
- Brookes, P.; Voskanyan, V.; Giavrimis, R.; Truscott, M.; Ilieva, M.; Pavlou, C.; Staicu, A.; Adham, M.; Evers-Hood, W.; Gong, J.; et al. Evolving Excellence: Automated Optimization of LLM-based Agents, 2025, [arXiv:cs.SE/2512.09108]. arXiv preprint arXiv:2512.09108.
- Lam, C.; Li, J.; Zhang, L.; Zhao, K. Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework, 2026, [arXiv:cs.AI/2603.11768]. arXiv preprint arXiv:2603.11768.
- Tang, Z.; He, X.; Zhao, T.; Wei, F.; Liu, X.; Dong, P.; Wang, Q.; Li, Q.; Wang, H.; Chen, R.; et al. LLM Agent Memory: A Survey from a Unified Representation–Management Perspective, 2026. Preprints.org. [CrossRef]
- Yue, L.; Bhandari, K.R.; Ko, C.Y.; Patel, D.; Lin, S.; Zhou, N.; Gao, J.; Chen, P.Y.; Pan, S. From Static Templates to Dynamic Runtime Graphs: A Survey of Workflow Optimization for LLM Agents, 2026, [arXiv:cs.AI/2603.22386]. arXiv preprint arXiv:2603.22386.
- Zhang, J.; Xiang, J.; Yu, Z.; Teng, F.; Chen, X.; Chen, J.; Zhuge, M.; Cheng, X.; Hong, S.; Wang, J.; et al. AFlow: Automating Agentic Workflow Generation, 2024, [arXiv:cs.AI/2410.10762]. arXiv preprint arXiv:2410.10762. [CrossRef]
- Abuzakuk, S.; Kermarrec, A.M.; Sharma, R.; Veski, R.M.; de Vos, M. Optimizing Agentic Workflows using Meta-tools, 2026, [arXiv:cs.AI/2601.22037]. arXiv preprint arXiv:2601.22037.
- Ma, Z.; Zhao, Z.; Hua, C.; Berto, F.; Park, J. JudgeFlow: Agentic Workflow Optimization via Block Judge, 2026, [arXiv:cs.AI/2601.07477]. arXiv preprint arXiv:2601.07477.
- Wang, Z.; Liu, X.; Wang, L.; Shan, Z.; Wang, Y.; Song, Z.; Zhang, M. MASPO: Joint Prompt Optimization for LLM-based Multi-Agent Systems, 2026, [arXiv:cs.AI/2605.06623]. arXiv preprint arXiv:2605.06623.
- Li, Z.; Zhang, H.; Han, S.; Liu, S.; Xie, J.; Zhang, Y.; Choi, Y.; Zou, J.; Lu, P. In-the-Flow Agentic System Optimization for Effective Planning and Tool Use, 2025, [arXiv:cs.AI/2510.05592]. arXiv preprint arXiv:2510.05592.
- Zhou, H.; Wan, X.; Sun, R.; Palangi, H.; Iqbal, S.; Vulić, I.; Korhonen, A.; Arık, S. Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies, 2025, [arXiv:cs.LG/2502.02533]. arXiv preprint arXiv:2502.02533. [CrossRef]
- Zhang, Z.; Ge, L.; Li, H.; Zhu, W.; Zhang, C.; Ye, Y. MAPRO: Recasting Multi-Agent Prompt Optimization as Maximum a Posteriori Inference, 2025, [arXiv:cs.CL/2510.07475]. arXiv preprint arXiv:2510.07475.
- Chen, M.; Wang, J.; Liu, Z.; Wang, Y.; Wang, Q. From Failed Trajectories to Reliable LLM Agents: Diagnosing and Repairing Harness Flaws, 2026, [arXiv:cs.SE/2606.06324]. arXiv preprint arXiv:2606.06324.
- Seong, H.; Yin, L.; Zhang, H.; Shi, Z. The Last Harness You’ll Ever Build, 2026, [arXiv:cs.AI/2604.21003]. arXiv preprint arXiv:2604.21003.
- Karten, S.; Zhang, J.; Upaa, T.; Feng, R.; Li, W.; Shi, C.; Jin, C.; Vodrahalli, K. Continual Harness: Online Adaptation for Self-Improving Foundation Agents, 2026, [arXiv:cs.LG/2605.09998]. arXiv preprint arXiv:2605.09998. [CrossRef]
- Tan, H.Z.; Yang, X.W.; Chen, H.; Shao, J.J.; Wen, Y.; Shen, Y.; Luo, W.; Du, X.; Guo, L.Z.; Li, Y.F. Hindsight Credit Assignment for Long-Horizon LLM Agents, 2026, [arXiv:cs.LG/2603.08754]. arXiv preprint arXiv:2603.08754.
- Zhang, Y.; Fang, M.; Chen, Z.; Pechenizkiy, M. Self-evolving LLM agents with in-distribution Optimization, 2026, [arXiv:cs.LG/2606.07367]. arXiv preprint arXiv:2606.07367.
- Zhang, C. From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models, 2026, [arXiv:cs.CL/2604.09459]. arXiv preprint arXiv:2604.09459.
- Jiayang, C.; Ru, D.; Qiu, L.; Li, Y.; Cao, X.; Song, Y.; Cai, X. AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations, 2026, [arXiv:cs.CL/2603.01966]. arXiv preprint arXiv:2603.01966. [CrossRef]
- Chen, Y.; Lai, H.; Feng, Y.; Han, C.; Zhang, Q.; Lu, B.; Li, M.; Wang, X.; Wang, Z.; Xu, S.; et al. Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents, 2026, [arXiv:cs.AI/2606.06090]. arXiv preprint arXiv:2606.06090.
- Rafique, M.; Bindschaedler, L. ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents, 2026, [arXiv:cs.AI/2604.10352]. arXiv preprint arXiv:2604.10352.
- Zong, X.; Shen, Z.; Wang, L.; Lan, Y.; Yang, C. MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers, 2025, [arXiv:cs.CL/2512.15163]. arXiv preprint arXiv:2512.15163. [CrossRef]
- Zhang, D.; Li, Z.; Luo, X.; Liu, X.; Li, P.; Xu, W. MCP Security Bench (MSB): Benchmarking Attacks Against Model Context Protocol in LLM Agents, 2025, [arXiv:cs.CR/2510.15994]. arXiv preprint arXiv:2510.15994.
- Liu, S.; Tang, X.; Yang, X.; Lin, L.; Zhou, B.; Xiao, W.; Liu, W. When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents, 2026, [arXiv:cs.CR/2605.24069]. arXiv preprint arXiv:2605.24069.
- Lee, L.F.; Chang, Y.Y.; Yu, C.M.; Yeh, K.H. WebMCP Tool Surface Poisoning: Runtime Manipulation Attacks on LLM Agents, 2026, [arXiv:cs.CR/2606.06387]. arXiv preprint arXiv:2606.06387. [CrossRef]
- Ling, Y.; Yu, S.; Chen, Z.; Fang, C. Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation, 2026, [arXiv:cs.CR/2606.10749]. arXiv preprint arXiv:2606.10749.
- Sharma, R.K. ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files, 2026, [arXiv:cs.SE/2603.00822]. arXiv preprint arXiv:2603.00822.
- Wang, H.; Poskitt, C.M.; Sun, J. AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents, 2025, [arXiv:cs.AI/2503.18666]. arXiv preprint arXiv:2503.18666.
- Kaptein, M.; Khan, V.J.; Podstavnychy, A. Runtime Governance for AI Agents: Policies on Paths, 2026, [arXiv:cs.AI/2603.16586]. arXiv preprint arXiv:2603.16586. [CrossRef]
- Errico, H. Autonomous Action Runtime Management (AARM): A System Specification for Securing AI-Driven Actions at Runtime, 2026, [arXiv:cs.CR/2602.09433]. arXiv preprint arXiv:2602.09433.
- Bhardwaj, V.P. Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents, 2026, [arXiv:cs.AI/2602.22302]. arXiv preprint arXiv:2602.22302.
- Lin, X.; Liu, Y.; Chen, Y.; Wu, Y.; Ning, Y.; Liu, Y.; Sun, N.; Zhang, S.; Chong, B.; Zhou, C.; et al. SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment, 2026, [arXiv:cs.CR/2604.13630]. arXiv preprint arXiv:2604.13630. [CrossRef]
- Qin, X.; Luan, S.; See, J.; Boukhers, Z.; Yang, C.; Li, Z. Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution, 2026, [arXiv:cs.RO/2604.07833]. arXiv preprint arXiv:2604.07833.
- Ning, X.; Tieu, K.; Fu, D.; Wei, T.; Li, Z.; Bei, Y.; Zou, J.; Ai, M.; Liu, Z.; Li, T.W.; et al. Code as Agent Harness, 2026, [arXiv:cs.CL/2605.18747]. arXiv preprint arXiv:2605.18747.
- Lin, M.; Wu, J.; Wang, Z.; Shi, Z.; Sang, Y.; He, B.; Liu, Z.; Wei, T.; Wu, Z.; Zhang, Z.; et al. Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents, 2026, [arXiv:cs.AI/2605.30621]. arXiv preprint arXiv:2605.30621.
- Shao, S.; Ren, Q.; Qian, C.; Wei, B.; Guo, D.; Yang, J.; Song, X.; Zhang, L.; Zhang, W.; Liu, D.; et al. Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents, 2025, [arXiv:cs.AI/2509.26354]. arXiv preprint arXiv:2509.26354. [CrossRef]
- Doshi, A.; Hong, Y.; Xu, C.; Kang, E.; Kapravelos, A.; Kästner, C. Towards Verifiably Safe Tool Use for LLM Agents, 2026, [arXiv:cs.SE/2601.08012]. arXiv preprint arXiv:2601.08012.
- Zaharia, M.; Khattab, O.; Chen, L.; Davis, J.Q.; Miller, H.; Potts, C.; Zou, J.; Carbin, M.; Frankle, J.; Rao, N.; et al. The Shift from Models to Compound AI Systems, 2024. BAIR blog.
- Meng, Q.; Wang, Y.; Chen, L.; Li, Y.; Wu, W.; Jiang, W.; Wang, Q.; Lu, C.; Gao, Y.; Wu, Y.; et al. Agent Harness for Large Language Model Agents: A Survey. Preprints 2026. Preprint. [CrossRef]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training Language Models to Follow Instructions with Human Feedback. In Proceedings of the Advances in Neural Information Processing Systems, 2022, Vol. 35, pp. 27730–27744, [arXiv:cs.CL/2203.02155]. [CrossRef]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, 2020, Vol. 33, pp. 1877–1901, [arXiv:cs.CL/2005.14165]. [CrossRef]
- Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. ACM, 2021, pp. 610–623. [CrossRef]
- Weidinger, L.; Mellor, J.; Rauh, M.; Griffin, C.; Uesato, J.; Huang, P.S.; Cheng, M.; Glaese, M.; Balle, B.; Kasirzadeh, A.; et al. Ethical and Social Risks of Harm from Language Models, 2021, [arXiv:cs.CL/2112.04359]. arXiv preprint arXiv:2112.04359. [CrossRef]
- Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys 2023, 55, 1–35. [CrossRef]
- Christiano, P.F.; Leike, J.; Brown, T.B.; Martic, M.; Legg, S.; Amodei, D. Deep Reinforcement Learning from Human Preferences. In Proceedings of the Advances in Neural Information Processing Systems, 2017, Vol. 30, [arXiv:stat.ML/1706.03741]. [CrossRef]
- Ziegler, D.M.; Stiennon, N.; Wu, J.; Brown, T.B.; Radford, A.; Amodei, D.; Christiano, P.; Irving, G. Fine-Tuning Language Models from Human Preferences, 2019, [arXiv:cs.CL/1909.08593]. arXiv preprint arXiv:1909.08593. [CrossRef]
- Stiennon, N.; Ouyang, L.; Wu, J.; Ziegler, D.M.; Lowe, R.; Voss, C.; Radford, A.; Amodei, D.; Christiano, P. Learning to Summarize from Human Feedback. In Proceedings of the Advances in Neural Information Processing Systems, 2020, Vol. 33, pp. 3008–3021, [arXiv:cs.CL/2009.01325]. [CrossRef]
- Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022, [arXiv:cs.CL/2204.05862]. arXiv preprint arXiv:2204.05862. [CrossRef]
- Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C.D.; Ermon, S.; Finn, C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Proceedings of the Advances in Neural Information Processing Systems, 2023, Vol. 36, pp. 53728–53741, [arXiv:cs.LG/2305.18290]. [CrossRef]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems, 2022, Vol. 35, pp. 24824–24837, [arXiv:cs.CL/2201.11903]. [CrossRef]
- Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. In Proceedings of the Advances in Neural Information Processing Systems, 2022, Vol. 35, pp. 22199–22213, [arXiv:cs.CL/2205.11916]. [CrossRef]
- Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.V.; Chi, E.H.; Narang, S.; Chowdhery, A.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In Proceedings of the International Conference on Learning Representations, 2023, [arXiv:cs.CL/2203.11171]. [CrossRef]
- Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Cao, Y.; Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems, 2023, Vol. 36, [arXiv:cs.CL/2305.10601]. [CrossRef]
- Chu, Z.; Chen, J.; Chen, Q.; Yu, W.; He, T.; Wang, H.; Peng, W.; Liu, M.; Qin, B.; Liu, T. Navigate through Enigmatic Labyrinth A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2024, pp. 1173–1203. [CrossRef]
- Schulhoff, S.; Ilie, M.; Balepur, N.; Kahadze, K.; Liu, A.; Si, C.; Li, Y.; Gupta, A.; Han, H.; Schulhoff, S.; et al. The Prompt Report: A Systematic Survey of Prompt Engineering Techniques, 2024, [arXiv:cs.CL/2406.06608]. arXiv preprint arXiv:2406.06608. [CrossRef]
- OpenAI. Learning to Reason with LLMs, 2024. OpenAI release, September 12, 2024.
- DeepSeek-AI; Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature 2025, 645, 633–638, [arXiv:cs.CL/2501.12948]. [CrossRef]
- Meincke, L.; Mollick, E.R.; Mollick, L.; Shapiro, D. Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting, 2025, [arXiv:cs.CL/2506.07142]. SSRN working paper. [CrossRef]
- OpenAI. Reasoning Best Practices, 2025. OpenAI developer documentation, accessed June 13, 2026.
- Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. The Rise and Potential of Large Language Model Based Agents: A Survey. Science China Information Sciences 2025, 68, 121101, [arXiv:cs.AI/2309.07864]. [CrossRef]
- Xie, Y.; Zhu, C.; Zhang, X.; Zhu, T.; Ye, D.; Qi, M.; Chen, H.; Zhou, W. From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration, 2026, [arXiv:cs.MA/2603.04474]. arXiv preprint arXiv:2603.04474.
- Zhou, S.; Xu, F.F.; Zhu, H.; Zhou, X.; Lo, R.; Sridhar, A.; Cheng, X.; Ou, T.; Bisk, Y.; Fried, D.; et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. In Proceedings of the International Conference on Learning Representations, 2024, [arXiv:cs.AI/2307.13854].
- Jimenez, C.E.; Yang, J.; Wettig, A.; Yao, S.; Pei, K.; Press, O.; Narasimhan, K. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? In Proceedings of the International Conference on Learning Representations, 2024, [arXiv:cs.CL/2310.06770].
- Xu, F.F.; et al. TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. In Proceedings of the Advances in Neural Information Processing Systems, 2025, [arXiv:cs.CL/2412.14161].
- Yao, S.; Shinn, N.; Razavi, P.; Narasimhan, K. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. In Proceedings of the International Conference on Learning Representations, 2025, [arXiv:cs.AI/2406.12045].
- Liu, X.; Yu, H.; Zhang, H.; Xu, Y.; Lei, X.; Lai, H.; Gu, Y.; Ding, H.; Men, K.; Yang, K.; et al. AgentBench: Evaluating LLMs as Agents. In Proceedings of the International Conference on Learning Representations, 2024, [arXiv:cs.AI/2308.03688]. [CrossRef]
- Huang, J.; Chen, X.; Mishra, S.; Zheng, H.S.; Yu, A.W.; Song, X.; Zhou, D. Large Language Models Cannot Self-Correct Reasoning Yet. In Proceedings of the International Conference on Learning Representations, 2024, [arXiv:cs.CL/2310.01798].
- Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. In Proceedings of the Conference on Language Modeling, 2024, [arXiv:cs.AI/2308.08155]. [CrossRef]
- OpenAI. Function Calling and Other API Updates, 2023. OpenAI blog, June 13, 2023.
- Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. In Proceedings of the Advances in Neural Information Processing Systems, 2023, Vol. 36, [arXiv:cs.CL/2302.04761]. [CrossRef]
- Anthropic. Introducing Computer Use, a New Claude 3.5 Sonnet, and Claude 3.5 Haiku, 2024. Anthropic news, October 22, 2024.
- Patil, S.G.; Zhang, T.; Wang, X.; Gonzalez, J.E. Gorilla: Large Language Model Connected with Massive APIs. In Proceedings of the Advances in Neural Information Processing Systems, 2024, Vol. 37, [arXiv:cs.CL/2305.15334]. [CrossRef]
- Kwa, T.; West, B.; Becker, J.; Deng, A.; Garcia, K.; Hasin, M.; Jawhar, S.; Kinniment, M.; Rush, N.; Arx, S.V.; et al. Measuring AI Ability to Complete Long Software Tasks. In Proceedings of the Advances in Neural Information Processing Systems, 2025, Vol. 38, [arXiv:cs.AI/2503.14499]. [CrossRef]
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Guo, Q.; Wang, M.; et al. Retrieval-Augmented Generation for Large Language Models: A Survey, 2023, [arXiv:cs.CL/2312.10997]. arXiv preprint arXiv:2312.10997. [CrossRef]
- Xu, Z.; Jain, S.; Kankanhalli, M. Hallucination is Inevitable: An Innate Limitation of Large Language Models, 2024, [arXiv:cs.CL/2401.11817]. arXiv preprint arXiv:2401.11817. [CrossRef]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).