Submitted: 10 December 2025
Posted: 12 December 2025
Abstract
Keywords:
1. Introduction

- Comprehensive Taxonomy. A unified taxonomy of modern tool and agent selection approaches, spanning manual, UI-driven, retrieval-based, and autonomous selection across tools, subagents, and MCPs.
- Production-Oriented Framework. An analysis of how real-world systems integrate agent and tool retrieval, execution, short-term and long-term memory, authentication, authorization, and human-in-the-loop pipelines to achieve scalable and secure agentic behavior.
- Evaluation and Open Challenges. A synthesis of evaluation methods for agent and tool retrieval and execution accuracy, cost, and latency, along with future research directions.
2. Overview of Tool Selection
2.1. Tools, MCPs, APIs, or Functions

2.1.1. Types and Categories of Tools
2.1.2. Auth Considerations and Human-in-the-Loop
2.2. Agents as Tools
2.2.1. Multi-Agent as Tools Architectures
- Supervisor architectures are suited for research and reasoning tasks that use parallel execution.
- Hand-off architectures are suited for autonomous, domain-isolated agents that do not need heavy orchestration.
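The contrast between the two architectures can be illustrated with a minimal sketch. It assumes that each agent is a plain Python function; the agent names (`search_agent`, `summary_agent`, `triage_agent`, `billing_agent`, `support_agent`), the routing rule, and the `max_hops` limit are hypothetical and serve only as illustration, not as the design of any particular framework. The supervisor fans a query out to its sub-agents in parallel and collects the results, while the hand-off loop lets each agent decide which agent, if any, takes over next.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical sub-agents: each maps a query string to a result string.
Agent = Callable[[str], str]


def search_agent(query: str) -> str:
    return f"search notes for '{query}'"


def summary_agent(query: str) -> str:
    return f"summary for '{query}'"


def supervisor(query: str, agents: Dict[str, Agent]) -> Dict[str, str]:
    """Supervisor pattern: a central orchestrator runs all sub-agents in
    parallel and gathers their outputs for later aggregation."""
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        futures = {name: pool.submit(agent, query) for name, agent in agents.items()}
        return {name: future.result() for name, future in futures.items()}


# Hand-off agents return a reply plus the name of the next agent (or None to stop).
HandoffAgent = Callable[[str], Tuple[str, Optional[str]]]


def triage_agent(query: str) -> Tuple[str, Optional[str]]:
    # Assumed routing rule: a keyword check stands in for an LLM decision.
    target = "billing" if "invoice" in query.lower() else "support"
    return f"triage: routing to {target}", target


def billing_agent(query: str) -> Tuple[str, Optional[str]]:
    return f"billing: handled '{query}'", None


def support_agent(query: str) -> Tuple[str, Optional[str]]:
    return f"support: handled '{query}'", None


HANDOFF_AGENTS: Dict[str, HandoffAgent] = {
    "triage": triage_agent,
    "billing": billing_agent,
    "support": support_agent,
}


def run_handoff(query: str, agents: Dict[str, HandoffAgent],
                start: str, max_hops: int = 5) -> List[str]:
    """Hand-off pattern: control passes from agent to agent with no central
    orchestrator; max_hops guards against hand-off loops."""
    transcript: List[str] = []
    current: Optional[str] = start
    for _ in range(max_hops):
        if current is None:
            break
        reply, current = agents[current](query)
        transcript.append(reply)
    return transcript


if __name__ == "__main__":
    print(supervisor("tool retrieval", {"search": search_agent, "summary": summary_agent}))
    print(run_handoff("Where is my invoice?", HANDOFF_AGENTS, start="triage"))
```

In this sketch the supervisor owns aggregation and concurrency, whereas the hand-off agents stay isolated and only need to know the name of the agent they delegate to.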
2.3. Why Tool Selection
2.3.1. Limitations of LLM Tool Calling
Number of Tools for LLM Agents
Tool description sensitivity
Context and caching pressure
3. Frontend or UI Tool Selection
3.1. Methods of Frontend or UI Tool Selection
3.1.1. Buttons
3.1.2. Mentions or Slash Commands
3.1.3. Natural Language
Voice
Text
Image and Video
3.2. Human-in-the-Loop in Front End
3.3. Frontend Auth Considerations
3.4. Multi-Turn Front End
4. Backend Tool Selection
4.1. Base AI Unit (Tool/Agent)
4.1.1. Inputs
4.1.2. Quantity Considerations
4.2. Tool, MCP, or Agent Retrieval
4.2.1. No Tool Retrieval
4.2.2. Pre-Retrieval or Indexing
Storage and Retrieval Types
Chunking of Tool Document Components
Metadata
Updates to the Tool Knowledge Base
4.2.3. Intra-Retrieval or Inference Time
Planner/CoT
Query Transformation
Knowledge Graph or Metadata Traversal
Agentic RAG
4.2.4. Post-Retrieval
Reranking
Self-Correction
Parameter Prediction Considerations
4.3. Tool, MCP, and Agent Execution
Coupling Retrieval to Execution or Execution to Retrieval
Available Tools and Agents
Methods
Function Calling
Agentic Workflows
Reinforcement Learning Tool Selection
Tool Call Result Context Engineering
Associated Prompts
Human-in-the-Loop
Authorization
4.4. Memory for Multi-Turn Tool Selection
4.4.1. Short-Term Memory for Dynamic Tool Management
4.4.2. Long-Term Memory for Personalized Tool Use
4.5. Evaluations
4.5.1. Tool Retrieval Metrics
4.5.2. Execution Metrics
4.5.3. Tool Analytics
5. Conclusions
Data Availability Statement
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).