Preprint
Article

This version is not peer-reviewed.

A Survey on AI Search with Large Language Models

Submitted:

23 July 2025

Posted:

24 July 2025


Abstract
Searching for accurate information is a complex task that demands significant effort. Although search engines have transformed the way we access information, they often struggle to understand intricate human intentions fully. Recently, Large Language Models (LLMs) have showcased impressive abilities in understanding and generating language. However, LLMs face limitations in acquiring external knowledge and accessing the most current information. AI search has evolved by integrating LLMs into the search process, enabling it to address complex real-world challenges through comprehensive information retrieval and multi-step reasoning, thereby enhancing our ability to browse and search the web effectively. In recent years, substantial progress has been made in refining AI search. This paper offers an in-depth review of these advancements, focusing on Text-based AI Search, Web Browsing Agents, Multimodal AI Search, Benchmarks, Software, and Products. We also examine the limitations of current AI search methods and explore promising future directions. For further details, please visit our website.

1. Introduction

Searching for information is a fundamental daily necessity for humans. To meet the demand for rapid access to desired information, key web search technologies like PageRank [1,2,3] have been developed to support information retrieval systems. These technologies power search engines such as Google, Bing, and Baidu, which efficiently retrieve relevant web pages in response to user queries, offering convenient access to information on the internet. Advances in natural language processing (NLP) [4] and information retrieval (IR) [5] have further enhanced machines’ ability to accurately extract content from the vast array of websites available online. However, as user queries become increasingly complex and the demand for precise, contextually relevant, and up-to-date responses grows, traditional search technologies encounter challenges in fully comprehending intricate human intentions. Consequently, users often need to manually open, read, and synthesize information from multiple web pages to answer complex questions.
Recently, Large Language Models (LLMs) [6] have captured significant attention in both academic and industrial domains. LLMs such as ChatGPT [7] and LLaMA [8] have demonstrated remarkable advancements in language understanding, reasoning, and information integration. However, LLMs face limitations in acquiring external knowledge and accessing the most current information. To address these challenges, researchers are integrating the impressive capabilities of LLMs with search engines and websites, aiming to enhance real-time evidence gathering and reflective reasoning. The complementary strengths of LLMs and search engines present an opportunity for synergy, where the reasoning abilities of LLMs are augmented by the vast web information accessible through search engines. This integration is revolutionizing the way we seek and synthesize web-based information, ushering in a new era of search technology known as Artificial Intelligence (AI) Search. In this survey, we provide an overview of recent advancements in the rapidly evolving field of AI Search. As depicted in Figure 1, we categorize the literature into five primary areas: (1) text-based AI search, (2) Web browsing agents, (3) multimodal AI search, (4) benchmarks, and (5) software and products.
The classic Text-based AI Search operates through a Retrieval-Augmented Generation (RAG) framework [105]. In this workflow, RAG retrieves relevant passages from search engines based on the input query and integrates them into the context of a Large Language Model (LLM) for generating responses. This enables the LLM to utilize external knowledge when addressing questions. Another approach within text-based AI Search is the deep search method, which acquires external information by interacting with search engines as part of an end-to-end coherent reasoning process to tackle complex information retrieval challenges. Unlike predefined workflows, this method allows the model to autonomously determine when to employ search-related tools during its reasoning, enhancing flexibility and effectiveness.
A Web Browsing Agent accomplishes specific tasks on target websites through a sequence of actions, utilizing a thought-action-observation paradigm. For instance, if you want a web agent to calculate the driving time from Shanghai to Beijing using OpenStreetMap, it would perform this task by interacting with the website. Web agents are classified into two primary paths: Generalist Deep Browsing Web Agents, which perform more complex web browsing tasks, especially across multiple types of web pages, and Specialist Parsing Web Agents, which employ dedicated training procedures to make the model focus specifically on action sequences or interface elements. Additionally, with the emergence of visual web-oriented benchmarks and the development of Multimodal Large Language Models, many agents now incorporate screenshots as sensory input to provide a more comprehensive understanding of the web environment.
Most current AI search methods are confined to text-only settings, overlooking the multimodal nature of user queries and the intertwined text-image information on websites, a gap that Multimodal AI Search aims to close.
This limitation is particularly significant given the complexity and interleaved nature of modern websites. For example, imagine capturing a photo of an antique at a museum without knowing its historical context. A multimodal AI search engine could match the photograph with an interleaved table of images and text retrieved from the Internet, thereby providing you with the history and story behind it. Thus, a multimodal AI search engine is essential for advancing information retrieval and analysis.
Furthermore, this paper offers a review of the Benchmarks relevant to these methods. Evaluating the search capabilities of AI models, particularly large language models (LLMs), is crucial for assessing their ability to effectively retrieve, filter, and reason over web-based information. This evaluation is essential for understanding the true web-browsing competence of LLMs and their potential to tackle real-world tasks that demand dynamic information retrieval. In recent years, significant efforts have been made to explore AI search from various perspectives. This paper concentrates on three key areas: text-based question-answering benchmarks, web agent benchmarks, and multimodal benchmarks. The Software and Products of AI Search, such as Perplexity [92], have the potential to change our daily lives. We introduce a wide array of state-of-the-art open-source and proprietary models, software, and mainstream AI search products, aiming to present a diverse and comprehensive overview of AI Search. Finally, we discuss the limitations of current AI search methods and explore promising future directions. To illustrate the evolution of AI search methods over time, Figure 2 presents a timeline of recent AI search technologies, related methods, and products.

3. Web Browsing Agent

The Web Browsing Agent is an AI-driven autonomous program designed to mimic human interactions within web browsers. It excels in tasks like information retrieval, task execution, and adapting to dynamic environments. This section delves into the definition, classification, and cutting-edge technologies related to Web Browsing Agents and Web Agents more broadly.

3.1. Agent

An agent is a system that perceives its environment, makes autonomous decisions, and executes tasks, aiming to simulate human cognition. With the emergence of LLMs, LLM-based agents have become a key research direction. The core of LLM-based autonomous agents lies in two aspects: architecture design and capability acquisition. In terms of architecture design, researchers aim to fully leverage the language understanding and generation capabilities of LLMs through various network structures and modular combinations (e.g., AgentVerse’s unified framework [152]). Regarding capability acquisition, two primary methods are employed [153]. First, fine-tuning optimizes performance for specialized tasks by training the model with domain-specific data. Second, prompt engineering elicits the model’s latent capabilities through carefully designed prompts.
Building upon general agent research, Web Agents’ key distinction lies in handling the diversity and dynamism of web pages, which imposes stricter demands on perception modules and safety-aware design. Most current Web Agent frameworks adopt a Markov Decision Process (MDP) formulation [49], where each decision step is governed by a 4-tuple (S, A, T, R): S (State Space): Represents the environment state, typically the current webpage’s HTML content. A (Action Space): Encompasses possible web interactions (e.g., button clicks, scrolling, text input). T (Transition Function): Defines how executing action A in state S alters the webpage state. R (Reward Function): Evaluates the quality of interactions to guide learning. While this MDP-based approach serves as the foundation, variants exist where steps are adapted (e.g., simplified or extended) based on practical requirements.
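To make the (S, A, T, R) formulation concrete, the following is a minimal sketch of a web-navigation episode expressed as an MDP. The two-page site, the `scripted_policy`, and all class and function names are hypothetical and chosen only for illustration; real Web Agent frameworks operate on live pages and learned policies.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class State:
    """S: the environment state, here a toy stand-in for a page's HTML."""
    url: str
    html: str


@dataclass(frozen=True)
class Action:
    """A: a web interaction such as a click, a scroll, or text input."""
    kind: str
    target: str = ""


# Hypothetical two-page site used only for this illustration.
PAGES = {
    "/home": State("/home", "<a id='search'>Search</a>"),
    "/results": State("/results", "<p>Answer page</p>"),
}


def transition(state: State, action: Action) -> State:
    """T: how executing an action in a state changes the page."""
    if state.url == "/home" and action == Action("click", "search"):
        return PAGES["/results"]
    return state  # any other action leaves the page unchanged


def reward(state: State, action: Action, next_state: State) -> float:
    """R: 1.0 when the answer page is first reached, otherwise 0.0."""
    reached = next_state.url == "/results" and state.url != "/results"
    return 1.0 if reached else 0.0


def run_episode(
    policy: Callable[[State], Action], start: State, max_steps: int = 5
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Roll out a policy and return the total reward and the visited trace."""
    state, total, trace = start, 0.0, []
    for _ in range(max_steps):
        action = policy(state)
        next_state = transition(state, action)
        total += reward(state, action, next_state)
        trace.append((state.url, action.kind, next_state.url))
        if next_state.url == "/results":
            break
        state = next_state
    return total, trace


def scripted_policy(state: State) -> Action:
    """A fixed rule standing in for an LLM policy."""
    if "search" in state.html:
        return Action("click", "search")
    return Action("scroll")


total, trace = run_episode(scripted_policy, PAGES["/home"])
```

In this sketch the state is the raw page, so the variants mentioned above (simplified or extended steps) would correspond to changing what `State` stores or how `transition` and `reward` are defined.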
Based on the adopted training strategies, current Web Agents can be categorized into two types shown in Figure 4: Generalist Deep Browsing Web Agents that enhance the model’s ability to perform more complex web browsing tasks, especially across multiple types of web pages; Specialist Parsing Web Agents that employ dedicated training procedures to make the model focus specifically on action sequences or interface elements [154].

3.2. Generalist Deep Browsing Web Agents

Due to the open-ended nature of Web Agent applications, conventional static dataset-based training methods exhibit significant limitations, particularly when handling complex web navigation tasks. To enhance Web Agent capabilities in such scenarios, it is crucial to dynamically prompt the model during training to facilitate optimal action selection in corresponding situations. Reinforcement learning (RL) has become a key technology that enables web agents to adapt to dynamic environments in real-time through exploration and interactive feedback.
For instance, WebAgent-R1 is the first purely end-to-end RL-trained Web Agent [46]. It employs a multi-turn end-to-end RL framework, where the agent is trained through online interactions guided by rule-based outcome rewards. During training, it extends the standard Group Relative Policy Optimization (GRPO) method into Multi-Group GRPO [155], utilizing multiple parallel interaction trajectories to enhance training efficacy. Additionally, WebAgent-R1 implements dynamic context compression for the state space (S) in the Markov Decision Process (MDP). When a new state arrives, earlier states are simplified to reduce context length while preserving complete history, thereby minimizing memory consumption. AutoWebGLM adopts a multi-stage training approach, integrating Supervised Fine-Tuning (SFT), RL, and Rejection Sampling Fine-Tuning (RFT) [47]. It also retains erroneous samples during training to facilitate learning from mistakes. Microsoft proposed an “API-first” Web Agent based on the CodeAct architecture [48], which replaces traditional browser interactions with API calls and selectively accesses browser APIs to retrieve feedback. For websites with limited APIs, the agent directly incorporates complete API documentation into its prompts. For websites with extensive APIs, it first generates a dictionary mapping each API to its documentation and then filters relevant APIs based on task descriptions. This dynamic approach enhances adaptability, with API-based prompts proving more concise and effective than direct browsing. AgentOccam simplifies the action space (A) by replacing multiple operations with functionally equivalent single actions and abstracting knowledge-dependent operations [50]. It also reduces the state space (S) by merging repetitive elements or structures and selectively replaying historical information, yielding a streamlined yet effective Web Agent workflow.
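As a rough illustration of how rule-based outcome rewards feed a group-based policy update, the sketch below computes group-normalized advantages in the style of standard GRPO: several trajectories are sampled for the same task, and each trajectory’s reward is normalized by the group mean and standard deviation. This is not the WebAgent-R1 implementation; the exact-match reward rule, the use of the population standard deviation, and the function names are assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import List


def outcome_reward(final_answer: str, expected: str) -> float:
    """Rule-based outcome reward (assumed rule): 1.0 on a normalized exact match."""
    same = final_answer.strip().lower() == expected.strip().lower()
    return 1.0 if same else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each trajectory's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical trajectories sampled for the same task.
answers = ["Beijing", "Shanghai", "Tianjin", "beijing "]
rewards = [outcome_reward(a, "Beijing") for a in answers]
advantages = group_relative_advantages(rewards)
```

Successful trajectories receive positive advantages and failed ones negative advantages; when every trajectory in a group gets the same reward, all advantages are zero and the group contributes no learning signal.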
Fine-tuning before RL is also critical. This stage establishes basic web interaction skills in the action space A, and its quality directly affects the effectiveness of the subsequent reinforcement learning. For example, Huawei’s Pangu DeepDiver integrates cold-start SFT with a reward allocation and scheduling mechanism [51], transitioning from lenient to strict scoring to stabilize RL training. This approach enhances the model’s ability to couple multiple reasoning and action steps.
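The idea of moving from lenient to strict scoring can be pictured with a simple schedule. The sketch below is a hypothetical illustration, not the DeepDiver mechanism: it assumes a graded answer score in [0, 1] and a pass threshold that tightens linearly during training; the names `strictness` and `scheduled_reward` and the default bounds are invented for this example.

```python
def strictness(step: int, total_steps: int, start: float = 0.5, end: float = 1.0) -> float:
    """Linearly tighten the pass threshold from `start` to `end` over training."""
    if total_steps <= 1:
        return end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + frac * (end - start)


def scheduled_reward(answer_score: float, step: int, total_steps: int) -> float:
    """Binary reward: 1.0 if the graded score clears the current threshold."""
    return 1.0 if answer_score >= strictness(step, total_steps) else 0.0


# The same partially correct answer is rewarded early but not late in training.
early = scheduled_reward(0.7, step=0, total_steps=100)
late = scheduled_reward(0.7, step=99, total_steps=100)
```

Under these assumptions, partially correct answers still produce a learning signal at the start of RL, while only fully correct answers are rewarded by the end.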
Multimodality is another way to improve the effectiveness of RL, and many web agents are moving in this direction. For example, WebVoyager leverages both visual (screenshots) and textual (HTML elements) modalities for interaction [52]. It utilizes the GPT-4V-ACT tool to annotate visual inputs (e.g., screenshots with numbered bounding boxes), which are then mapped to auxiliary text descriptions. This approach bypasses the need to parse complex HTML DOM or accessibility trees, simplifying structural representation. A detailed discussion of multimodal Web Agents is provided in Section 4.3.
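The mapping from numbered bounding boxes to auxiliary text descriptions can be sketched as follows. This is an illustrative data-structure example only, not the GPT-4V-ACT tool: the `Element` class, the reading-order labeling rule, and the description format are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Element:
    tag: str
    text: str
    box: Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def annotate(elements: List[Element]) -> Tuple[Dict[int, Element], str]:
    """Number elements in reading order and build the auxiliary text description."""
    ordered = sorted(elements, key=lambda e: (e.box[1], e.box[0]))
    marks = {i: e for i, e in enumerate(ordered, start=1)}
    lines = [f'[{i}] <{e.tag}> "{e.text}"' for i, e in marks.items()]
    return marks, "\n".join(lines)


def resolve_click(marks: Dict[int, Element], label: int) -> Tuple[int, int]:
    """Translate a numeric label chosen by the model into click coordinates."""
    x, y, w, h = marks[label].box
    return (x + w // 2, y + h // 2)


elements = [
    Element("button", "Search", (200, 10, 80, 20)),
    Element("input", "Query", (10, 10, 180, 20)),
    Element("a", "Next", (10, 300, 50, 20)),
]
marks, description = annotate(elements)
click_point = resolve_click(marks, 2)
```

With such a mapping, the model only needs to output a label such as `2`, and the agent resolves it to a concrete on-screen location without traversing the DOM.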

3.3. Specialist Parsing Web Agents

Due to the complexity of web environments and the diversity of user objectives, Web Agents can acquire information from multiple sources. Although accessing a broader range of information types is preferable in principle, focusing on a specific category for in-depth filtering and analysis can also yield effective results. This approach imposes lower demands on the model, making it suitable for lightweight Web Agents. Consequently, goal-specific Web Agents, which specialize in target elements or actions, have emerged. Training such specialized Web Agents typically requires well-defined objectives and dedicated datasets [55,56].
For example, WebDancer specializes in QA pair parsing, aiming to extract high-quality trajectories from QA pairs to guide fine-tuning and reinforcement learning [28]. To extend the reasoning depth and hop count of existing QA datasets, the authors developed two datasets: CRAWLQA and E2HQA. CRAWLQA collects data from root URLs of official websites such as arXiv, GitHub, and Wikipedia, while E2HQA constructs its corpus by reformulating initially simple questions into more complex, multi-step queries. A ReAct-based Web Agent employs rejection sampling to extract trajectories from these QA datasets, forming both short and long chain-of-thought (CoT) trajectories. During training, the agent proceeds to formal RL using QA data not utilized in the SFT phase, internalizing CoT generation as an active behavioral component of the model. This process leverages the Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) algorithm. Falcon-UI is another example [54], focusing on graphical user interface (GUI) interactions. For its training, raw data was sourced from Common Crawl, followed by standard deduplication and denoising procedures. The researchers then use APIs with varying resolutions and platform types to simulate diverse device environments (e.g., Android, iOS, Windows, and Linux). The data integration platform interacts with the GUI and records the newly generated interaction data. Unlike traditional full-page textual datasets, Falcon-UI exclusively logs visible elements, mimicking human-like interactions. The resulting hybrid dataset is then used to train Falcon-UI, significantly improving its GUI processing performance.
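Rejection sampling over QA pairs, as described above, can be sketched as sampling several agent trajectories per question and keeping only those whose final answer matches the reference. The code below is a minimal sketch under assumed interfaces: `rollout` stands in for a ReAct-style agent run, and the trajectory dictionary layout and the exact-match filter are hypothetical, not WebDancer’s actual pipeline.

```python
from typing import Callable, Dict, List

Trajectory = Dict[str, object]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().split())


def rejection_sample(
    question: str,
    reference: str,
    rollout: Callable[[str], Trajectory],
    num_samples: int = 8,
) -> List[Trajectory]:
    """Keep only sampled trajectories whose final answer matches the reference."""
    kept = []
    for _ in range(num_samples):
        trajectory = rollout(question)
        if normalize(str(trajectory["answer"])) == normalize(reference):
            kept.append(trajectory)
    return kept


# A stub agent that returns a fixed sequence of answers, for illustration only.
_answers = iter(["Paris", "Lyon", "paris ", "Rome"])


def stub_rollout(question: str) -> Trajectory:
    return {"question": question, "steps": ["search", "answer"], "answer": next(_answers)}


kept = rejection_sample("Capital of France?", "Paris", stub_rollout, num_samples=4)
```

The retained trajectories would then serve as supervised fine-tuning data, while questions not used at this stage remain available for the later RL phase.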
In some cases, depending on the task requirements, a Web Agent may employ multiple models. For instance, PhishAgent specializes in phishing website detection by identifying target website brands and their domains [57]. To recognize the brand of a target website, PhishAgent utilizes both textual and visual models. In many scenarios, textual information alone suffices for brand identification, in which case only the LLM is used. However, if textual cues are insufficient or obscured by adversarial attacks, PhishAgent activates its image-based brand extractor (IBE), built on a multimodal large language model (MLLM), which identifies brand names from webpage screenshots. Upon successful brand recognition, PhishAgent cross-references the target domain with authentic domain information, obtained through both offline and online interactions, to determine whether the site is a phishing site.
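The text-first, image-fallback decision flow described above can be summarized in a short sketch. Everything here is hypothetical: the extractors are trivial stand-ins for the LLM and MLLM calls, and `KNOWN_BRANDS` is an invented knowledge base rather than PhishAgent’s actual data or interface.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical brand-to-legitimate-domains knowledge base.
KNOWN_BRANDS: Dict[str, Set[str]] = {"examplebank": {"examplebank.com"}}


def text_brand(page_text: str) -> Optional[str]:
    """Stand-in for the text-based (LLM) brand extractor."""
    lowered = page_text.lower()
    for brand in KNOWN_BRANDS:
        if brand in lowered:
            return brand
    return None


def image_brand(screenshot: bytes) -> Optional[str]:
    """Stand-in for the image-based (MLLM) brand extractor."""
    return "examplebank" if b"examplebank-logo" in screenshot else None


def detect_phishing(
    domain: str,
    page_text: str,
    screenshot: bytes,
    text_extractor: Callable[[str], Optional[str]] = text_brand,
    image_extractor: Callable[[bytes], Optional[str]] = image_brand,
) -> Optional[bool]:
    """Return True (phishing), False (legitimate), or None (brand not identified)."""
    brand = text_extractor(page_text)
    if brand is None:  # fall back to the screenshot only when text is insufficient
        brand = image_extractor(screenshot)
    if brand is None or brand not in KNOWN_BRANDS:
        return None
    return domain.lower() not in KNOWN_BRANDS[brand]


legit = detect_phishing("examplebank.com", "Welcome to ExampleBank", b"")
spoof = detect_phishing("examp1ebank-login.net", "Sign in", b"..examplebank-logo..")
unknown = detect_phishing("unknown.org", "Hello", b"")
```

The fallback order mirrors the cost argument in the text: the cheaper textual path is tried first, and the multimodal model is invoked only when it fails.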

5. Benchmarks

5.1. Text-Based QA Benchmark

As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. A variety of widely used English benchmarks have been proposed to assess retrieval capabilities, including TriviaQA, HotpotQA, FEVER, KILT, GAIA, etc. These datasets cover multi-hop reasoning, knowledge-intensive QA, and fact checking, typically relying on structured sources like Wikipedia and StackExchange.
Traditional Benchmarks. Natural Questions (NQ) [65] is a large-scale QA dataset using real Google search queries and corresponding Wikipedia pages, requiring models to provide both long-form and short-form answers. TriviaQA [66] is a reading comprehension dataset characterized by complex, compositional questions with significant lexical variation from their evidence, often demanding multi-sentence reasoning. PopQA [67] is an entity-centric QA dataset designed to test factual knowledge recall across a long-tail distribution of entity popularity. HotpotQA [68] is a multi-hop QA dataset that requires reasoning across multiple documents and providing sentence-level supporting facts, making it a benchmark for explainable QA. 2WikiMultiHopQA [172] is a more challenging multi-hop QA dataset that integrates Wikipedia with Wikidata, using structured triples to explain complex reasoning paths. MuSiQue [69] is a multi-hop QA dataset emphasizing connected reasoning and includes unanswerable examples to challenge models that rely on shortcuts. FEVER [70] is a benchmark for fact verification, requiring systems to classify claims as SUPPORTED, REFUTED, or NOTENOUGHINFO against Wikipedia and provide sentence-level evidence. KILT [71] unifies 11 datasets spanning five knowledge-intensive NLP tasks under a single Wikipedia snapshot, providing a standardized framework for evaluating both task performance and evidence retrieval. GAIA [72] evaluates general-purpose AI assistants with real-world questions that require a combination of reasoning, tool use, and multi-modality, revealing a large gap between AI and human performance. The TREC Health Misinformation Track [173] provides datasets with binary “yes/no” health questions based on medical consensus to evaluate a system’s ability to combat health misinformation.
Modern Browsing Benchmarks. While traditional benchmarks mentioned above have effectively measured an AI’s ability to retrieve straightforward information through basic queries (e.g., single-hop fact lookup), their simplicity has led to saturation—modern models now achieve near-perfect scores on these tasks. This progress reveals a critical gap: real-world information needs often require persistent navigation through complex data landscapes. These challenges mirror the evolutionary jump from arithmetic tests to mathematical proofs—where success depends less on recall and more on strategic problem-solving.
BrowseComp [73] is a benchmark dataset introduced to evaluate web-browsing AI agents. It contains 1,266 challenging questions requiring persistent navigation of the internet to find entangled information. Key features include: (1) high difficulty: questions are designed to be unsolvable by humans within 10 minutes; (2) verifiability: short reference answers enable easy validation; (3) diverse topics spanning sports, fiction, and academic publications; and (4) core capability measurement focusing on persistence, factual reasoning, and creative search strategies. BrowseComp-ZH [74] is a high-difficulty Chinese web browsing evaluation dataset consisting of 289 multi-hop questions across 11 domains (e.g., Art, Film & TV, Medicine). Each question is reverse-engineered from verifiable factual answers and undergoes rigorous two-stage quality control to ensure retrieval difficulty and answer uniqueness. Figure 6 illustrates these two benchmarks and shows some complex and challenging queries. Mind2Web 2 [75] is another modern benchmark, with 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis.

5.2. Web Agent Benchmark

A Web Agent Benchmark is a standardized set of test tasks and evaluation frameworks designed to assess the performance of web agents. These benchmarks simulate interactive tasks in real-world web environments to quantify an agent’s capabilities in navigation, operation, and reasoning. A Web Agent Benchmark consists of two key components: tasks (data) and metrics. Tasks are the operational requirements posed to web agents, mimicking typical human web activities such as clicking buttons, filling out forms, or navigating between pages. More complex tasks may involve multi-step processes. Metrics are the standards used to evaluate a web agent’s performance, which vary depending on the agent’s functionality and objectives [78,79,81,83].
Mind2Web [76], shown in Figure 7, is the first dataset designed for developing and evaluating general-purpose Web Agents. Its tasks span five top-level domains (travel, shopping, services, entertainment, and information). Each task consists of three core components: (1) a task description, which outlines the high-level goal of the task; (2) an action sequence, the sequence of actions required to complete the task on the website, where each action includes the target element and the corresponding operation; and (3) a webpage snapshot, which captures the webpage environment during task execution. Together, these three components contain all the task-related information. WebArena [77] analyzed real-world web browser histories and abstracted four prominent categories: e-commerce, social forums, collaborative software development, and content management. For task design, the authors curated three task types: information seeking, which requires multi-page navigation; website navigation, which uses interactive elements (e.g., search functions and links) to locate specific information in webpages; and content/configuration manipulation, which creates, modifies, or configures content or settings. Task evaluation involves (1) comparing outputs for information-seeking tasks, and (2) reward-based assessment of intermediate states for navigation and manipulation tasks. WebChoreArena [80] adheres to WebArena’s design principles but introduces new tasks, whose key characteristics include: (1) an emphasis on memory-intensive analytical tasks; (2) reduced ambiguity in task instructions and evaluation, a notable departure from WebArena; and (3) template-based task construction and expansion. Experiments on GPT-4o indicate that WebChoreArena presents greater challenges than WebArena. WebCanvas [82] introduces a dynamic evaluation framework using “critical nodes”. Critical nodes refer to essential steps that must be completed in any viable path to accomplish a given web task.
To enhance realism, the authors derive Mind2Web-Live from tasks in the Mind2Web dataset. Mind2Web-Live includes critical nodes and meticulously annotated steps. Subsequent experiments demonstrate that in partially web environments, evaluating solely the final state or outcome is insufficient.
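To illustrate why intermediate steps matter, the sketch below scores a trajectory against a list of critical nodes. It is a simplified, hypothetical rendering of the idea rather than the WebCanvas evaluator: steps are represented as plain strings, nodes are assumed to be required in order, and the metric names are invented for the example.

```python
from typing import Dict, Sequence, Union


def critical_nodes_completed(trajectory: Sequence[str], critical_nodes: Sequence[str]) -> int:
    """Count critical nodes reached in order; other steps in between are ignored."""
    done = 0
    for step in trajectory:
        if done < len(critical_nodes) and step == critical_nodes[done]:
            done += 1
    return done


def evaluate(
    trajectory: Sequence[str], critical_nodes: Sequence[str]
) -> Dict[str, Union[float, bool]]:
    """Report partial progress as well as overall task success."""
    done = critical_nodes_completed(trajectory, critical_nodes)
    rate = done / len(critical_nodes) if critical_nodes else 1.0
    return {"completion_rate": rate, "task_success": done == len(critical_nodes)}


# Hypothetical shopping task: the agent searches and filters but never adds to cart.
trajectory = ["open_home", "search:laptop", "open_ad", "filter:price", "open_item"]
critical_nodes = ["search:laptop", "filter:price", "add_to_cart"]
result = evaluate(trajectory, critical_nodes)
```

A final-state check alone would report only failure here, whereas the node-level score shows that two of the three essential steps were completed, regardless of which detours the agent took.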
For task-specific objectives, general-purpose benchmarks often prove inadequate for evaluating Web Agent performance in their domains, necessitating dedicated benchmarks [85,86,87,88]. For instance, DeepShop focuses on e-commerce, generating query tasks across five popular online shopping categories [89]. During assessment, it adopts a fine-grained approach by separately evaluating product attributes, matching accuracy, and ranking performance, and then synthesizing these into a comprehensive evaluation. SafeArena [84], the first benchmark dedicated to assessing malicious use cases of Web Agents, comprises 250 safe tasks and 250 harmful tasks. Its evaluation metrics include the standard Task Completion Rate (TCR) and specialized measures rarely adopted by other benchmarks: Normalized Safety Score (NSS) and Rejection Rate.

5.3. MM Search Benchmark

Large language models (LLMs) have made significant strides in understanding and reasoning about live textual content when integrated with search engines. Despite these advancements, a crucial question remains: has the understanding of other modalities, such as visual knowledge in live contexts, been similarly addressed? Are there benchmarks for multimodal search methods?
MMSearch [58] introduced a multimodal AI search engine benchmark to thoroughly assess the searching performance of MLLMs, marking the first evaluation dataset to measure MLLMs’ capabilities in multimodal searching. LIVEVQA [90], depicted in Figure 8, is an automatically collected benchmark dataset specifically designed to evaluate current AI systems on their ability to answer questions requiring live visual knowledge. Multimodal Retrieval-Augmented Multimodal Generation (MRAMG) is another critical task, but existing benchmarks for it face a significant shortage of suitable datasets and scientifically rigorous evaluation metrics. MRAMG-Bench [91] is a novel benchmark created to comprehensively evaluate the MRAMG task. It consists of six meticulously curated English datasets, including 4,346 documents, 14,190 images, and 4,800 QA pairs, sourced from three domains: Web, Academia, and Lifestyle, across seven distinct data sources.
VisualWebArena [78] is primarily designed for visual web agent tasks, incorporating both textual and visual content from real-world environments. It comprises 910 real-world tasks across three distinct web environments. A key feature of VisualWebArena is that all tasks require agents to process and interpret visual information, rather than relying solely on textual or HTML-based cues. For evaluation, the metrics follow WebArena’s framework but extend it by incorporating image verification alongside the original two assessment methods.

6. Software and Products

The AI search ecosystem has rapidly diversified into general-purpose platforms, domain-specific tools, and integrated assistants, each leveraging large language models (LLMs), retrieval-augmented generation (RAG), and agentic workflows to redefine information retrieval. Below, we introduce the key products driving this transformation.
Global General-Purpose AI Search Engines. A pioneer in generative AI, ChatGPT Deep Research [7] integrates Bing’s real-time web search to provide concise, conversational responses, sparking a surge of interest among researchers in large language models. Perplexity Deep Research [92] combines GPT-4 and Claude 3 with real-time web crawling, providing source-attributed answers. Its Discover feature tracks trending topics, making it well suited to academic literature reviews and technical writing. You.com [174] prioritizes privacy and personalization, allowing model switching (e.g., GPT-4, Claude) mid-session. Its Smart mode offers free access, while Research mode supports deep investigations with citation exports. Gemini Deep Research [98] embeds multi-modal capabilities into Pixel phones and Wear OS, enabling real-time translation via camera and health data-driven recommendations, reinforcing its “hardware-software” synergy in high-end markets. Optimized for speed and cost-efficiency, Doubao [93] integrates seamlessly with Douyin for video-content searches. Yuanbao [94] redefines “search-as-service” by embedding within WeChat’s ecosystem. Its three-layer architecture, comprising a base model (trillion-parameter MoE), industry-specific tuning (e.g., medical diagnostics), and mini-program integration, enables seamless service execution (e.g., generating travel itineraries with bookings). This ecosystem approach has driven rapid adoption. Nano AI [95] is China’s first “super search agent” that autonomously plans tasks (e.g., travel itineraries, market reports) by integrating data from walled gardens. Its DeepSearch technology parses tables, formulas, and video comments, enabling cross-platform verification for reliable decision-making. Kimi [96] can process a 200K context window, which suits academic paper analysis. Users highlight its semantic search for Chinese literature. DeepSeek Search [99] represents a paradigm shift in cost-efficient, open-source AI search.
Quark DeepSearch [97] relies on the Qwen-QWQ inference model. Unlike traditional search engines that rely on keyword matching, the model understands natural language and performs semantic analysis to more accurately grasp user intent.
Domain-Specific AI Search Tools. MediSearch [100] provides evidence-based medical answers (e.g., drug interactions, treatment protocols), trusted by 74% of healthcare professionals for clinical decision support. Devv.ai [101] is a code-specific search engine offering real-time debugging snippets and GitHub integration. It supports Chinese queries but is limited to programming contexts. Consensus [102] accesses more than 200 million scientific papers, using NLP to extract hypotheses and methodologies. Researchers report 50% time savings in literature reviews.
Integrated AI Search Assistants. WallesAI [103] is a browser-sidebar assistant that reads PDFs, videos, and webpages, enabling cross-document Q&A and content export. Bing Chat [104], deeply integrated into Edge’s ecosystem, delivers citation-backed answers through real-time web indexing and source attribution, establishing a unified search-browser experience.

7. Challenges and Future Research

Despite notable progress, this field still faces many unresolved challenges, and there is considerable room for improvement. Based on the reviewed progress, we highlight several promising directions:
  • Methods. More complex problems lead to a prolonged search process and additional actions, resulting in an extended search context. This extended context can limit the effectiveness of AI Search methods and the ability of LLMs, causing search performance to degrade as the inference length increases.
  • Evaluations. There is a strong need for systematic and standardized evaluation frameworks in AI search. The datasets used for evaluation should be meticulously curated to closely resemble real-world scenarios, featuring complex, dynamic, and citation-supported answers.
  • Applications. The potential real-world applications of AI Search are significant. Beyond user scenarios, there are numerous applications across various industries. We hope to see the development of more AI Search software and products to enhance the interaction between humans and machines.

8. Conclusions

Seeking and accessing information is a fundamental daily need for humans. In this survey, we provide a thorough overview of the latest research on AI Search based on LLMs. Our goal is to identify and highlight areas that require further research and suggest potential avenues for future studies. We start by introducing the traditional information retrieval systems, large language models (LLMs), and AI Search based on LLMs. Subsequently, we classify existing studies into four categories: Text-based AI Search, Web Browsing Agent, Multimodal AI Search, and Benchmarks. Then, we spotlight a range of current and significant Software and products within the realm of AI search. Finally, we discuss the limitations of the current AI search methods and explore promising future directions.

References

  1. Brin, S.; Page, L. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems 1998, 30, 107–117.
  2. Berkhin, P. A survey on PageRank computing. Internet mathematics 2005, 2, 73–120.
  3. Burges, C.; Shaked, T.; Renshaw, E.; Lazier, A.; Deeds, M.; Hamilton, N.; Hullender, G. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, 2005, pp.
  4. Nadkarni, P.M.; Ohno-Machado, L.; Chapman, W.W. Natural language processing: an introduction. Journal of the American Medical Informatics Association 2011, 18, 544–551.
  5. Kobayashi, M.; Takeda, K. Information retrieval on the web. ACM computing surveys (CSUR) 2000, 32, 144–173.
  6. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv preprint arXiv:2303.18223 2023.
  7. ChatGPT Deep Research. https://openai.com/index/introducing-deep-research, 2022.
  8. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 2023.
  9. Lewis, P.S.H.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  10. Xu, F.; Shi, W.; Choi, E. RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation. CoRR 2310.
  11. Jiang, H.; Wu, Q.; Lin, C.Y.; Yang, Y.; Qiu, L. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
  12. Jin, J.; Li, X.; Dong, G.; Zhang, Y.; Zhu, Y.; Wu, Y.; Li, Z.; Ye, Q.; Dou, Z. Hierarchical Document Refinement for Long-context Retrieval-augmented Generation, 2025; arXiv:cs.CL/2505.10413]. [Google Scholar]
  13. Kim, G.; Kim, S.; Jeon, B.; Park, J.; Kang, J. Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models. In Proceedings of the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 2023. [CrossRef]
  14. Shi, W.; Min, S.; Yasunaga, M.; Seo, M.; James, R.; Lewis, M.; Zettlemoyer, L.; tau Yih, W. REPLUG: Retrieval-Augmented Black-Box Language Models. CoRR, 2301. [Google Scholar] [CrossRef]
  15. Yu, W.; Iter, D.; Wang, S.; Xu, Y.; Ju, M.; Sanyal, S.; Zhu, C.; Zeng, M.; Jiang, M. Generate rather than retrieve: Large language models are strong context generators. arXiv preprint arXiv:2209.10063, arXiv:2209.10063 2022.
  16. Wang, H.; Zhao, T.; Gao, J. BlendFilter: Advancing Retrieval-Augmented Large Language Models via Query Generation Blending and Knowledge Filtering. 2024; arXiv:cs.CL/2402.11129]. [Google Scholar]
  17. Wang, Y.; Li, P.; Sun, M.; Liu, Y. Self-knowledge guided retrieval augmentation for large language models. arXiv, 2023; arXiv:2310.05002 2023. [Google Scholar]
  18. Wang, H.; Xue, B.; Zhou, B.; Zhang, T.; Wang, C.; Chen, G.; Wang, H.; Wong, K.f. Self-DC: When to retrieve and When to generate? Self Divide-and-Conquer for Compositional Unknown Questions. arXiv, 2024; arXiv:2402.13514 2024. [Google Scholar]
  19. Ding, H.; Pang, L.; Wei, Z.; Shen, H.; Cheng, X. Retrieve Only When It Needs: Adaptive Retrieval Augmentation for Hallucination Mitigation in Large Language Models. 2024; arXiv:cs.CL/2402.10612]. [Google Scholar]
  20. Tan, J.; Dou, Z.; Zhu, Y.; Guo, P.; Fang, K.; Wen, J.R. Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs. CoRR, 2402. [Google Scholar] [CrossRef]
  21. Yao, S.; Zhao, J.; Yu, D.; Shafran, I.; Narasimhan, K.R.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the NeurIPS 2022 Foundation Models for Decision Making Workshop; 2022. [Google Scholar]
  22. Shao, Z.; Gong, Y.; Shen, Y.; Huang, M.; Duan, N.; Chen, W. Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy. 2023; arXiv:cs.CL/2305.15294]. [Google Scholar]
  23. Trivedi, H.; Balasubramanian, N.; Khot, T.; Sabharwal, A. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv, 2022; arXiv:2212.10509 2022. [Google Scholar]
  24. Jiang, Z.; Xu, F.F.; Gao, L.; Sun, Z.; Liu, Q.; Dwivedi-Yu, J.; Yang, Y.; Callan, J.; Neubig, G. Active Retrieval Augmented Generation. In Proceedings of the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023.
  25. Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. CoRR, 2310. [Google Scholar] [CrossRef]
  26. Li, X.; Dong, G.; Jin, J.; Zhang, Y.; Zhou, Y.; Zhu, Y.; Zhang, P.; Dou, Z. Search-o1: Agentic Search-Enhanced Large Reasoning Models. CoRR, 2501. [Google Scholar] [CrossRef]
  27. Li, X.; Jin, J.; Dong, G.; Qian, H.; Zhu, Y.; Wu, Y.; Wen, J.; Dou, Z. WebThinker: Empowering Large Reasoning Models with Deep Research Capability. CoRR, 2504. [Google Scholar] [CrossRef]
  28. Wu, J.; Li, B.; Fang, R.; Yin, W.; Zhang, L.; Tao, Z.; Zhang, D.; Xi, Z.; Jiang, Y.; Xie, P.; et al. WebDancer: Towards Autonomous Information Seeking Agency. 2025; arXiv:cs.CL/2505.22648]. [Google Scholar]
  29. Huang, L.; Liu, Y.; Jiang, J.; Zhang, R.; Yan, J.; Li, J.; Zhao, W.X. ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework. 2025; arXiv:cs.CL/2505.18105]. [Google Scholar]
  30. Yang, T.; Yao, Z.; Jin, B.; Cui, L.; Li, Y.; Wang, G.; Liu, X. Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents. 2025; arXiv:cs.AI/2505.12065]. [Google Scholar]
  31. Jin, J.; Li, X.; Dong, G.; Zhang, Y.; Zhu, Y.; Zhao, Y.; Qian, H.; Dou, Z. Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search. 2025; arXiv:cs.AI/2507.02652]. [Google Scholar]
  32. Wu, W.; Guan, X.; Huang, S.; Jiang, Y.; Xie, P.; Huang, F.; Cao, J.; Zhao, H.; Zhou, J. MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability. 2025; arXiv:cs.CL/2505.20285]. [Google Scholar]
  33. Wang, L.; Chen, H.; Yang, N.; Huang, X.; Dou, Z.; Wei, F. Chain-of-Retrieval Augmented Generation. 2025; arXiv:cs.IR/2501.14342]. [Google Scholar]
  34. Lee, Z.; Cao, S.; Liu, J.; Zhang, J.; Liu, W.; Che, X.; Hou, L.; Li, J. ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation. 2025; arXiv:cs.CL/2503.21729]. [Google Scholar]
  35. Shi, Z.; Yan, L.; Yin, D.; Verberne, S.; de Rijke, M.; Ren, Z. Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers. 2025; arXiv:cs.CL/2505.20128]. [Google Scholar]
  36. Hu, M.; Fang, T.; Zhang, J.; Ma, J.; Zhang, Z.; Zhou, J.; Zhang, H.; Mi, H.; Yu, D.; King, I. WebCoT: Enhancing Web Agent Reasoning by Reconstructing Chain-of-Thought in Reflection, Branching, and Rollback. 2025; arXiv:cs.CL/2505.20013]. [Google Scholar]
  37. Guan, X.; Zeng, J.; Meng, F.; Xin, C.; Lu, Y.; Lin, H.; Han, X.; Sun, L.; Zhou, J. DeepRAG: Thinking to Retrieve Step by Step for Large Language Models. 2025; arXiv:cs.AI/2502.01142]. [Google Scholar]
  38. Jin, B.; Zeng, H.; Yue, Z.; Yoon, J.; Arik, S.; Wang, D.; Zamani, H.; Han, J. Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. 2025; arXiv:cs.CL/2503.09516]. [Google Scholar]
  39. Song, H.; Jiang, J.; Min, Y.; Chen, J.; Chen, Z.; Zhao, W.X.; Fang, L.; Wen, J.R. R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning. 2025; arXiv:cs.AI/2503.05592]. [Google Scholar]
  40. Chen, M.; Li, T.; Sun, H.; Zhou, Y.; Zhu, C.; Wang, H.; Pan, J.Z.; Zhang, W.; Chen, H.; Yang, F.; et al. ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning. 2025; arXiv:cs.AI/2503.19470]. [Google Scholar]
  41. Sun, H.; Qiao, Z.; Guo, J.; Fan, X.; Hou, Y.; Jiang, Y.; Xie, P.; Zhang, Y.; Huang, F.; Zhou, J. ZeroSearch: Incentivize the Search Capability of LLMs without Searching. 2025; arXiv:cs.CL/2505.04588]. [Google Scholar]
  42. Zheng, Y.; Fu, D.; Hu, X.; Cai, X.; Ye, L.; Lu, P.; Liu, P. DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments. 2025; arXiv:cs.AI/2504.03160]. [Google Scholar]
  43. Mei, J.; Hu, T.; Fu, D.; Wen, L.; Yang, X.; Wu, R.; Cai, P.; Cai, X.; Gao, X.; Yang, Y.; et al. O2-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering. 2025; arXiv:cs.CL/2505.16582]. [Google Scholar]
  44. Li, K.; Zhang, Z.; Yin, H.; Zhang, L.; Ou, L.; Wu, J.; Yin, W.; Li, B.; Tao, Z.; Wang, X.; et al. WebSailor: Navigating Super-human Reasoning for Web Agent. 2025; arXiv:cs.CL/2507.02592]. [Google Scholar]
  45. Wang, Z.; Zheng, X.; An, K.; Ouyang, C.; Cai, J.; Wang, Y.; Wu, Y. StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization. 2025; arXiv:cs.CL/2505.15107]. [Google Scholar]
  46. Wei, Z.; Yao, W.; Liu, Y.; Zhang, W.; Lu, Q.; Qiu, L.; Yu, C.; Xu, P.; Zhang, C.; Yin, B.; et al. WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning. 2025; arXiv:cs.CL/2505.16421]. [Google Scholar]
  47. Lai, H.; Liu, X.; Iong, I.L.; Yao, S.; Chen, Y.; Shen, P.; Yu, H.; Zhang, H.; Zhang, X.; Dong, Y.; et al. AutoWebGLM: A Large Language Model-based Web Navigating Agent. In Proceedings of the Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2024. [CrossRef]
  48. Song, Y.; Xu, F.; Zhou, S.; Neubig, G. Beyond Browsing: API-Based Web Agents. 2025; arXiv:cs.CL/2410.16464]. [Google Scholar]
  49. Zhang, D.; Rama, B.; Ni, J.; He, S.; Zhao, F.; Chen, K.; Chen, A.; Cao, J. LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications. 2025; arXiv:cs.AI/2503.02950]. [Google Scholar]
  50. Yang, K.; Liu, Y.; Chaudhary, S.; Fakoor, R.; Chaudhari, P.; Karypis, G.; Rangwala, H. AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents. 2025; arXiv:cs.AI/2410.13825]. [Google Scholar]
  51. Shi, W.; Tan, H.; Kuang, C.; Li, X.; Ren, X.; Zhang, C.; Chen, H.; Wang, Y.; Shang, L.; Yu, F.; et al. Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning. 2025; arXiv:cs.CL/2505.24332]. [Google Scholar]
  52. He, H.; Yao, W.; Ma, K.; Yu, W.; Dai, Y.; Zhang, H.; Lan, Z.; Yu, D. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. In Proceedings of the Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). [CrossRef]
  53. Wu, J.; Li, B.; Fang, R.; Yin, W.; Zhang, L.; Tao, Z.; Zhang, D.; Xi, Z.; Jiang, Y.; Xie, P.; et al. WebDancer: Towards Autonomous Information Seeking Agency. 2025; arXiv:cs.CL/2505.22648]. [Google Scholar]
  54. Shen, H.; Liu, C.; Li, G.; Wang, X.; Zhou, Y.; Ma, C.; Ji, X. Falcon-UI: Understanding GUI Before Following User Instructions. 2024; arXiv:cs.CL/2412.09362]. [Google Scholar]
  55. Cho, J.; Kim, J.; Bae, D.; Choo, J.; Gwon, Y.; Kwon, Y.D. CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only. 2024; arXiv:cs.AI/2406.06947]. [Google Scholar]
  56. Lin, K.Q.; Li, L.; Gao, D.; Yang, Z.; Wu, S.; Bai, Z.; Lei, W.; Wang, L.; Shou, M.Z. ShowUI: One Vision-Language-Action Model for GUI Visual Agent. 2024; arXiv:cs.CV/2411.17465]. [Google Scholar]
  57. Cao, T.; Huang, C.; Li, Y.; Huilin, W.; He, A.; Oo, N.; Hooi, B. PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection. Proceedings of the AAAI Conference on Artificial Intelligence 2025, 39, 27869–27877. [Google Scholar] [CrossRef]
  58. Jiang, D.; Zhang, R.; Guo, Z.; Wu, Y.; Qiu, P.; Lu, P.; Chen, Z.; Song, G.; Gao, P.; Liu, Y.; et al. MMSearch: Unveiling the Potential of Large Models as Multi-modal Search Engines. In Proceedings of the The Thirteenth International Conference on Learning Representations.
  59. Wu, J.; Deng, Z.; Li, W.; Liu, Y.; You, B.; Li, B.; Ma, Z.; Liu, Z. MMSearch-R1: Incentivizing LMMs to Search. 2025; arXiv:cs.CV/2506.20670]. [Google Scholar]
  60. Pahuja, V.; Lu, Y.; Rosset, C.; Gou, B.; Mitra, A.; Whitehead, S.; Su, Y.; Awadallah, A. Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents 2025.
  61. Verma, G.; Kaur, R.; Srishankar, N.; Zeng, Z.; Balch, T.; Veloso, M. AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations. 2024; arXiv:cs.AI/2411.13451]. [Google Scholar]
  62. He, H.; Yao, W.; Ma, K.; Yu, W.; Zhang, H.; Fang, T.; Lan, Z.; Yu, D. OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization. 2024; arXiv:cs.CL/2410.19609]. [Google Scholar]
  63. Zheng, B.; Gou, B.; Kil, J.; Sun, H.; Su, Y. Gpt-4v (ision) is a generalist web agent, if grounded. arXiv, 2024; arXiv:2401.01614 2024. [Google Scholar]
  64. Gadiraju, S.S.; Liao, D.; Kudupudi, A.; Kasula, S.; Chalasani, C. InfoTech Assistant: A Multimodal Conversational Agent for InfoTechnology Web Portal Queries. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData); 2024; pp. 3264–3272. [Google Scholar] [CrossRef]
  65. Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 2019, 7, 453–466. [Google Scholar] [CrossRef]
  66. Joshi, M.; Choi, E.; Weld, D.S.; Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, arXiv:1705.03551 2017.
  67. Mallen, A.; Asai, A.; Zhong, V.; Das, R.; Khashabi, D.; Hajishirzi, H. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv arXiv:2212.10511, arXiv:2212.10511 2022.
  68. Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, arXiv:1809.09600 2018.
  69. Trivedi, H.; Balasubramanian, N.; Khot, T.; Sabharwal, A. MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics 2022, 10, 539–554. [Google Scholar] [CrossRef]
  70. Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Mittal, A. FEVER: a large-scale dataset for fact extraction and VERification. arXiv preprint arXiv:1803.05355, arXiv:1803.05355 2018.
  71. Petroni, F.; Piktus, A.; Fan, A.; Lewis, P.; Yazdani, M.; De Cao, N.; Thorne, J.; Jernite, Y.; Karpukhin, V.; Maillard, J.; et al. KILT: a benchmark for knowledge intensive language tasks. arXiv arXiv:2009.02252, arXiv:2009.02252 2020.
  72. Mialon, G.; Fourrier, C.; Wolf, T.; LeCun, Y.; Scialom, T. Gaia: a benchmark for general ai assistants. In Proceedings of the The Twelfth International Conference on Learning Representations; 2023. [Google Scholar]
  73. Wei, J.; Sun, Z.; Papay, S.; McKinney, S.; Han, J.; Fulford, I.; Chung, H.W.; Passos, A.T.; Fedus, W.; Glaese, A. Browsecomp: A simple yet challenging benchmark for browsing agents. arXiv arXiv:2504.12516, arXiv:2504.12516 2025.
  74. Zhou, P.; Leon, B.; Ying, X.; Zhang, C.; Shao, Y.; Ye, Q.; Chong, D.; Jin, Z.; Xie, C.; Cao, M.; et al. Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese. arXiv preprint arXiv:2504.19314, arXiv:2504.19314 2025.
  75. Gou, B.; Huang, Z.; Ning, Y.; Gu, Y.; Lin, M.; Qi, W.; Kopanev, A.; Yu, B.; Gutiérrez, B.J.; Shu, Y.; et al. Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge. 2025; arXiv:cs.AI/2506.21506]. [Google Scholar]
  76. Deng, X.; Gu, Y.; Zheng, B.; Chen, S.; Stevens, S.; Wang, B.; Sun, H.; Su, Y. Mind2Web: Towards a Generalist Agent for the Web. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; Levine, S., Eds. Curran Associates, Inc., Vol. 36; 2023; pp. 28091–28114. [Google Scholar]
  77. Zhou, S.; Xu, F.F.; Zhu, H.; Zhou, X.; Lo, R.; Sridhar, A.; Cheng, X.; Ou, T.; Bisk, Y.; Fried, D.; et al. WebArena: A Realistic Web Environment for Building Autonomous Agents. 2024; arXiv:cs.AI/2307.13854]. [Google Scholar]
  78. Koh, J.Y.; Lo, R.; Jang, L.; Duvvur, V.; Lim, M.; Huang, P.Y.; Neubig, G.; Zhou, S.; Salakhutdinov, R.; Fried, D. VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. In Proceedings of the Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). [CrossRef]
  79. Garg, D.; VanWeelden, S.; Caples, D.; Draguns, A.; Ravi, N.; Putta, P.; Garg, N.; Abraham, T.; Lara, M.; Lopez, F.; et al. REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites. 2025; arXiv:cs.AI/2504.11543]. [Google Scholar]
  80. Miyai, A.; Zhao, Z.; Egashira, K.; Sato, A.; Sunada, T.; Onohara, S.; Yamanishi, H.; Toyooka, M.; Nishina, K.; Maeda, R.; et al. WebChoreArena: Evaluating Web Browsing Agents on Realistic Tedious Web Tasks. 2025; arXiv:cs.CL/2506.01952]. [Google Scholar]
  81. Song, Y.; Thai, K.; Pham, C.M.; Chang, Y.; Nadaf, M.; Iyyer, M. BEARCUBS: A benchmark for computer-using web agents. 2025; arXiv:cs.AI/2503.07919]. [Google Scholar]
  82. Pan, Y.; Kong, D.; Zhou, S.; Cui, C.; Leng, Y.; Jiang, B.; Liu, H.; Shang, Y.; Zhou, S.; Wu, T.; et al. WebCanvas: Benchmarking Web Agents in Online Environments. 2024; arXiv:cs.CL/2406.12373]. [Google Scholar]
  83. Xu, K.; Kordi, Y.; Nayak, T.; Asija, A.; Wang, Y.; Sanders, K.; Byerly, A.; Zhang, J.; Van Durme, B.; Khashabi, D. TurkingBench: A Challenge Benchmark for Web Agents. In Proceedings of the Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). [CrossRef]
  84. Tur, A.D.; Meade, N.; Lù, X.H.; Zambrano, A.; Patel, A.; Durmus, E.; Gella, S.; Stańczak, K.; Reddy, S. SafeArena: Evaluating the Safety of Autonomous Web Agents. 2025; arXiv:cs.LG/2503.04957]. [Google Scholar]
  85. Zhu, Y.; Kellermann, A.; Bowman, D.; Li, P.; Gupta, A.; Danda, A.; Fang, R.; Jensen, C.; Ihli, E.; Benn, J.; et al. CVE-Bench: A Benchmark for AI Agents’ Ability to Exploit Real-World Web Application Vulnerabilities. 2025; arXiv:cs.CR/2503.17332]. [Google Scholar]
  86. Evtimov, I.; Zharmagambetov, A.; Grattafiori, A.; Guo, C.; Chaudhuri, K. WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks. 2025; arXiv:cs.CR/2504.18575]. [Google Scholar]
  87. Qiu, H.; Fabbri, A.; Agarwal, D.; Huang, K.H.; Tan, S.; Peng, N.; Wu, C.S. Evaluating Cultural and Social Awareness of LLM Web Agents. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025; Chiruzzo, L.; Ritter, A.; Wang, L., Eds., Albuquerque, New Mexico; 2025; pp. 3978–4005. [Google Scholar] [CrossRef]
  88. Luo, Y.; Li, Z.; Liu, J.; Cui, J.; Zhao, X.; Shen, Z. Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents. 2025; arXiv:cs.AI/2505.24878]. [Google Scholar]
  89. Lyu, Y.; Zhang, X.; Yan, L.; de Rijke, M.; Ren, Z.; Chen, X. DeepShop: A Benchmark for Deep Research Shopping Agents. 2025; arXiv:cs.IR/2506.02839]. [Google Scholar]
  90. Fu, M.; Peng, Y.; Liu, B.; Wan, Y.; Chen, D. LiveVQA: Live Visual Knowledge Seeking. arXiv, 2025; arXiv:2504.05288 2025. [Google Scholar]
  91. Yu, Q.; Xiao, Z.; Li, B.; Wang, Z.; Chen, C.; Zhang, W. MRAMG-Bench: A BeyondText Benchmark for Multimodal Retrieval-Augmented Multimodal Generation. arXiv, 2025; arXiv:2502.04176 2025. [Google Scholar]
  92. Research, P.D. https://www.perplexity.ai, 2022.
  93. Doubao. https://www.doubao.com, 2023.
  94. Yuanbao. https://yuanbao.tencent.com, 2024.
  95. AI, N. https://www.n.cn, 2025.
  96. Kimi. https://www.kimi.com, 2023.
  97. DeepSearch, Q. https://quark.sm.cn, 2025.
  98. Research, G.D. https://gemini.google/overview/deep-research, 2023.
  99. DeepSeek. https://www.deepseek.com, 2025.
  100. MediSearch. https://medisearch.io, 2023.
  101. Devv.ai. https://devv.ai/zh, 2023.
  102. Consensus. https://consensus.app, 2022.
  103. walles.ai. https://walles.ai/, 2023.
  104. Chat, B. http://bing.com, 2023.
  105. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Guo, Q.; Wang, M.; et al. A: Generation for Large Language Models, 2024; arXiv:cs.CL/2312.10997].
  106. Zhu, Y.; Yuan, H.; Wang, S.; Liu, J.; Liu, W.; Deng, C.; Chen, H.; Liu, Z.; Dou, Z.; Wen, J.R. Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107, arXiv:2308.07107 2023.
  107. Ramos, J.; et al. Using tf-idf to determine word relevance in document queries. In Proceedings of the Proceedings of the first instructional conference on machine learning.
  108. Robertson, S.E.; Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr. 2009, 3, 333–389. [Google Scholar] [CrossRef]
  109. Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.t. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the EMNLP; 2020; pp. 6769–6781. [Google Scholar]
  110. Xiong, L.; Xiong, C.; Li, Y.; Tang, K.F.; Liu, J.; Bennett, P.N.; Ahmed, J.; Overwijk, A. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In Proceedings of the ICLR; 2020. [Google Scholar]
  111. Wang, L.; Yang, N.; Huang, X.; Jiao, B.; Yang, L.; Jiang, D.; Majumder, R.; Wei, F. Text Embeddings by Weakly-Supervised Contrastive Pre-training. CoRR, 2212. [Google Scholar] [CrossRef]
  112. Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N.; Lian, D.; Nie, J.Y. C-Pack: Packed Resources For General Chinese Embeddings. 2024; arXiv:cs.CL/2309.07597]. [Google Scholar]
  113. Li, X.; Jin, J.; Zhou, Y.; Zhang, Y.; Zhang, P.; Zhu, Y.; Dou, Z. From matching to generation: A survey on generative information retrieval. ACM Transactions on Information Systems 2025, 43, 1–62. [Google Scholar] [CrossRef]
  114. Tay, Y.; Tran, V.; Dehghani, M.; Ni, J.; Bahri, D.; Mehta, H.; Qin, Z.; Hui, K.; Zhao, Z.; Gupta, J.P.; et al. Transformer Memory as a Differentiable Search Index. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, 2022, 2022., November 28 - December 9.
  115. Wang, Y.; Hou, Y.; Wang, H.; Miao, Z.; Wu, S.; Chen, Q.; Xia, Y.; Chi, C.; Zhao, G.; Liu, Z.; et al. A neural corpus indexer for document retrieval. Advances in Neural Information Processing Systems 2022, 35, 25600–25614. [Google Scholar]
  116. Li, X.; Dou, Z.; Zhou, Y.; Liu, F. Corpuslm: Towards a unified language model on corpus for knowledge-intensive tasks. In Proceedings of the Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp.
  117. Liu, T.Y.; et al. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 2009, 3, 225–331. [Google Scholar] [CrossRef]
  118. Khattab, O.; Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020, Virtual Event, China, July 25-30, 2020. [CrossRef]
  119. Sun, W.; Yan, L.; Ma, X.; Wang, S.; Ren, P.; Chen, Z.; Yin, D.; Ren, Z. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. In Proceedings of the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023. [CrossRef]
  120. Izacard, G.; Grave, E. Leveraging passage retrieval with generative models for open domain question answering. arXiv preprint arXiv:2007.01282, arXiv:2007.01282 2020.
  121. Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; Chang, M. Retrieval augmented language model pre-training. In Proceedings of the International conference on machine learning. PMLR; 2020; pp. 3929–3938. [Google Scholar]
  122. Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; van den Driessche, G.; Lespiau, J.B.; Damoc, B.; Clark, A.; et al. Improving Language Models by Retrieving from Trillions of Tokens. In Proceedings of the International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA. PMLR, 2022, Vol. 162, Proceedings of Machine Learning Research, pp.
  123. Ma, X.; Gong, Y.; He, P.; Zhao, H.; Duan, N. Query Rewriting in Retrieval-Augmented Large Language Models. In Proceedings of the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 2023. [CrossRef]
  124. Ram, O.; Levine, Y.; Dalmedigos, I.; Muhlgay, D.; Shashua, A.; Leyton-Brown, K.; Shoham, Y. In-context retrieval-augmented language models. arXiv preprint arXiv:2302.00083, arXiv:2302.00083 2023.
  125. Yu, Z.; Xiong, C.; Yu, S.; Liu, Z. Augmentation-adapted retriever improves generalization of language models as generic plug-in. arXiv preprint arXiv:2305.17331, arXiv:2305.17331 2023.
  126. Zhang, P.; Xiao, S.; Liu, Z.; Dou, Z.; Nie, J.Y. Retrieve Anything To Augment Large Language Models. CoRR, 2310. [Google Scholar] [CrossRef]
  127. Zhang, L.; Yu, Y.; Wang, K.; Zhang, C. ARL2: Aligning Retrievers for Black-box Large Language Models via Self-guided Adaptive Relevance Labeling. 2024; arXiv:cs.CL/2402.13542]. [Google Scholar]
  128. Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the Middle: How Language Models Use Long Contexts, 2023. arXiv:2307.03172.
  129. Cuconasu, F.; Trappolini, G.; Siciliano, F.; Filice, S.; Campagnano, C.; Maarek, Y.; Tonellotto, N.; Silvestri, F. R: Power of Noise, 2024; arXiv:cs.IR/2401.14887].
  130. Yang, H.; Li, Z.; Zhang, Y.; Wang, J.; Cheng, N.; Li, M.; Xiao, J. PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter. In Proceedings of the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023.
  131. Press, O.; Zhang, M.; Min, S.; Schmidt, L.; Smith, N.; Lewis, M. Measuring and Narrowing the Compositionality Gap in Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore; 2023; pp. 5687–5711. [Google Scholar] [CrossRef]
  132. Yoran, O.; Wolfson, T.; Ram, O.; Berant, J. Making Retrieval-Augmented Language Models Robust to Irrelevant Context. 2023; arXiv:cs.CL/2310.01558]. [Google Scholar]
  133. Huang, Y.; Chen, Y.; Zhang, H.; Li, K.; Fang, M.; Yang, L.; Li, X.; Shang, L.; Xu, S.; Hao, J.; et al. Deep Research Agents: A Systematic Examination And Roadmap. A: Research Agents, 2025; arXiv:cs.AI/2506.18096]. [Google Scholar]
  134. Zhang, W.; Li, Y.; Bei, Y.; Luo, J.; Wan, G.; Yang, L.; Xie, C.; Yang, Y.; Huang, W.C.; Miao, C.; et al. From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents. I, 2025; arXiv:cs.IR/2506.18959]. [Google Scholar]
  135. Xu, R.; Peng, J. Comprehensive Survey of Deep Research: Systems, Methodologies, and Applications. 2025; arXiv:cs.AI/2506.12594]. [Google Scholar]
  136. Sun, S.; Song, H.; Wang, Y.; Ren, R.; Jiang, J.; Zhang, J.; Bai, F.; Deng, J.; Zhao, W.X.; Liu, Z.; et al. SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis. 2025; arXiv:cs.CL/2505.16834]. [Google Scholar]
  137. Rafailov, R.; Sharma, A.; Mitchell, E.; Ermon, S.; Manning, C.D.; Finn, C. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2024; arXiv:cs.LG/2305.18290]. [Google Scholar]
  138. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. 2017; arXiv:cs.LG/1707.06347]. [Google Scholar]
  139. Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.K.; Wu, Y.; et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024; arXiv:cs.CL/2402.03300]. [Google Scholar]
  140. Hu, J.; Liu, J.K.; Shen, W. REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models. 2025; arXiv:cs.CL/2501.03262]. [Google Scholar]
  141. Li, K.; Zhang, Z.; Yin, H.; Zhang, L.; Ou, L.; Wu, J.; Yin, W.; Li, B.; Tao, Z.; Wang, X.; et al. WebSailor: Navigating Super-human Reasoning for Web Agent. 2025; arXiv:cs.CL/2507.02592]. [Google Scholar]
  142. Shi, W.; Tan, H.; Kuang, C.; Li, X.; Ren, X.; Zhang, C.; Chen, H.; Wang, Y.; Shang, L.; Yu, F.; et al. Pangu DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning. 2025; arXiv:cs.CL/2505.24332]. [Google Scholar]
  143. Shi, Y.; Li, S.; Wu, C.; Liu, Z.; Fang, J.; Cai, H.; Zhang, A.; Wang, X. Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs. 2025; arXiv:cs.CL/2505.11277]. [Google Scholar]
  144. Dong, G.; Chen, Y.; Li, X.; Jin, J.; Qian, H.; Zhu, Y.; Mao, H.; Zhou, G.; Dou, Z.; Wen, J.R. Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning. 2025; arXiv:cs.CL/2505.16410]. [Google Scholar]
  145. Lin, C.; Wen, Y.; Su, D.; Sun, F.; Chen, M.; Bao, C.; Lv, Z. Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation. 2025; arXiv:cs.CL/2506.05154.
  146. Qian, H.; Liu, Z. Scent of Knowledge: Optimizing Search-Enhanced Reasoning with Information Foraging. 2025; arXiv:cs.CL/2505.09316.
  147. Li, Y.; Luo, Q.; Li, X.; Li, B.; Cheng, Q.; Wang, B.; Zheng, Y.; Wang, Y.; Yin, Z.; Qiu, X. R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning. 2025; arXiv:cs.CL/2505.23794.
  148. Zhang, D.; Zhao, Y.; Wu, J.; Li, B.; Yin, W.; Zhang, L.; Jiang, Y.; Li, Y.; Tu, K.; Xie, P.; et al. EvolveSearch: An Iterative Self-Evolving Search Agent. 2025; arXiv:cs.CL/2505.22501.
  149. Sha, Z.; Cui, S.; Wang, W. SEM: Reinforcement Learning for Search-Efficient Large Language Models. 2025; arXiv:cs.CL/2505.07903.
  150. Wu, P.; Zhang, M.; Zhang, X.; Du, X.; Chen, Z.Z. Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty. 2025; arXiv:cs.CL/2505.17281.
  151. Jiang, P.; Xu, X.; Lin, J.; Xiao, J.; Wang, Z.; Sun, J.; Han, J. s3: You Don’t Need That Much Data to Train a Search Agent via RL. 2025; arXiv:cs.AI/2505.14146.
  152. Chen, W.; Su, Y.; Zuo, J.; Yang, C.; Yuan, C.; Chan, C.M.; Yu, H.; Lu, Y.; Hung, Y.H.; Qian, C.; et al. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors. 2023; arXiv:cs.CL/2308.10848.
  153. Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A survey on large language model based autonomous agents. Frontiers of Computer Science 2024, 18.
  154. Wu, J.; Yin, W.; Jiang, Y.; Wang, Z.; Xi, Z.; Fang, R.; Zhang, L.; He, Y.; Zhou, D.; Xie, P.; et al. WebWalker: Benchmarking LLMs in Web Traversal. 2025; arXiv:cs.CL/2501.07572.
  155. Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.K.; Wu, Y.; et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024; arXiv:cs.CL/2402.03300.
  156. Jin, Y.; Li, J.; Liu, Y.; Gu, T.; Wu, K.; Jiang, Z.; He, M.; Zhao, B.; Tan, X.; Gan, Z.; et al. Efficient multimodal large language models: A survey. arXiv 2024, arXiv:2405.10739.
  157. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
  158. OpenAI. Hello GPT-4o, 2024.
  159. Anthropic. Claude 3.5 Sonnet, 2024.
  160. Dai, W.; Li, J.; Li, D.; Tiong, A.M.H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems 2024, 36.
  161. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning. PMLR; 2023; pp. 19730–19742.
  162. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. In Proceedings of the NeurIPS; 2023.
  163. Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp.
  164. Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL technical report. arXiv 2025, arXiv:2502.13923.
  165. Team, G.; Anil, R.; Borgeaud, S.; Wu, Y.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805.
  166. Chen, Z.; Wu, J.; Wang, W.; Su, W.; Chen, G.; Xing, S.; Muyan, Z.; Zhang, Q.; Zhu, X.; Lu, L.; et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv 2023, arXiv:2312.14238.
  167. Sun, Q.; Cui, Y.; Zhang, X.; Zhang, F.; Yu, Q.; Luo, Z.; Wang, Y.; Rao, Y.; Liu, J.; Huang, T.; et al. Generative multimodal models are in-context learners. arXiv 2023, arXiv:2312.13286.
  168. Li, J.; Lu, W.; Fei, H.; Luo, M.; Dai, M.; Xia, M.; Jin, Y.; Gan, Z.; Qi, D.; Fu, C.; et al. A survey on benchmarks of multimodal large language models. arXiv 2024, arXiv:2408.08632.
  169. Wang, S.; Liu, W.; Chen, J.; Zhou, Y.; Gan, W.; Zeng, X.; Che, Y.; Yu, S.; Hao, X.; Shao, K.; et al. GUI agents with foundation models: A comprehensive survey. arXiv 2024, arXiv:2411.04890.
  170. He, H.; Yao, W.; Ma, K.; Yu, W.; Dai, Y.; Zhang, H.; Lan, Z.; Yu, D. WebVoyager: Building an end-to-end web agent with large multimodal models. arXiv 2024, arXiv:2401.13919.
  171. Pahuja, V.; Lu, Y.; Rosset, C.; Gou, B.; Mitra, A.; Whitehead, S.; Su, Y.; Awadallah, A. Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents. arXiv 2025, arXiv:2502.11357.
  172. Ho, X.; Nguyen, A.K.D.; Sugawara, S.; Aizawa, A. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. arXiv 2020, arXiv:2011.01060.
  173. Fernández-Pichel, M.; Pichel, J.C.; Losada, D.E. Evaluating Search Engines and Large Language Models for Answering Health Questions. arXiv 2024, arXiv:2407.12468.
  174. You.com. https://you.com, 2021.
Figure 1. Taxonomy of research on AI search, covering text-based AI search, web browsing agents, multimodal AI search, benchmarks, software, and products.
Figure 2. Timeline of recent AI search methods and related products, ordered primarily by the release dates of their respective technical papers.
Figure 3. Evolution of text-based AI search paradigms, from (a) standard RAG that retrieves once per query, to (b) advanced RAG workflows capable of multi-turn search and decision-making, and finally to (c) fully autonomous, reasoning-model-powered Deep Search.
Figure 4. Illustration of Web Agents. (a) A Generalist Deep Browsing Web Agent comprises three iterative stages: (i) Observation; (ii) Thought; (iii) Action. (b) A Specialist Analytical Agent follows three distinct phases: (i) Observation; (ii) Insight; (iii) Action. The loop terminates when the required information is obtained, returning results to the user.
Figure 5. Illustration of Multimodal AI Search. (a) The MMSearch [58] pipeline consists of three sequential stages carried out by a Multimodal Large Language Model (MLLM): (i) requery, (ii) rerank, and (iii) summarization. (b) A detailed view of MMSearch-R1 [59], highlighting the rollout process and the execution of the search tool.
Figure 6. Illustration of Modern Browsing Benchmarks with complex and challenging queries. (a) BrowseComp [73]. (b) BrowseComp-ZH [74].
Figure 7. Sample tasks of Mind2Web [76]. The benchmark tests a web agent’s generalizability across different tasks on the same website (a vs. b) and across similar tasks on different websites (a vs. c).
Figure 8. Illustration of four categories of LiveVQA [90]. A QA pair for basic image understanding, and two multimodal multi-hop QA pairs for deeper reasoning.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.