Submitted:
15 August 2025
Posted:
18 August 2025
Abstract
Keywords:
1. Introduction
2. Problem Formulation and Theoretical Foundations
2.1. General Framework
- Retrieval Stage: Compute the retrieval distribution p(d | q), which models the probability that document d is relevant to the query q.
- Generation Stage: Condition the generative model on both the query q and the retrieved documents d, and estimate the conditional distribution p(y | q, d) over outputs y.
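The two stages compose into a single predictive distribution once the output is marginalized over the retrieved documents. The following is a minimal sketch with made-up numbers, writing p(d | q) for the retrieval distribution and p(y | q, d) for the generator's distribution over outputs y; the symbol y, the function name `marginal_answer_probs`, and the probability tables are assumptions of this sketch, not values from the survey.

```python
import numpy as np

def marginal_answer_probs(p_doc_given_query, p_answer_given_query_doc):
    """Marginalize the generator's output distribution over retrieved documents.

    p_doc_given_query: shape (num_docs,), sums to 1.
    p_answer_given_query_doc: shape (num_docs, num_answers), each row sums to 1.
    """
    p_doc = np.asarray(p_doc_given_query, dtype=float)
    p_ans = np.asarray(p_answer_given_query_doc, dtype=float)
    # Weighted sum over documents: sum_d p(d | q) * p(y | q, d)
    return p_doc @ p_ans

# Toy example (illustrative numbers): 3 retrieved documents, 2 candidate answers.
retrieval = [0.6, 0.3, 0.1]
generation = [[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]]
answer_probs = marginal_answer_probs(retrieval, generation)
print(answer_probs)
```

A document with low retrieval probability contributes little to the result even when the generator is confident given that document, which is why retrieval quality bounds end-to-end quality in this formulation.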
2.2. Retrieval Models
2.3. Generative Modeling
- Early Fusion: Prepend the retrieved documents to the input prompt.
- Late Fusion: Combine outputs from multiple generations, each conditioned on a single retrieved document.
- Hierarchical Fusion: Use attention mechanisms to dynamically weight the influence of different documents during decoding [10].
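The difference between early and late fusion is easiest to see at the prompt-construction level. The sketch below is illustrative only: the prompt template, the separator, and both function names are assumptions, and the step that combines the per-document outputs in late fusion is deliberately left out.

```python
def early_fusion_prompt(query, documents, separator="\n\n"):
    """Build ONE prompt with all retrieved documents prepended to the query."""
    context = separator.join(documents)
    return f"{context}{separator}Question: {query}\nAnswer:"

def late_fusion_prompts(query, documents):
    """Build one prompt PER document; the outputs are combined afterwards."""
    return [f"{doc}\n\nQuestion: {query}\nAnswer:" for doc in documents]

docs = ["Paris is the capital of France.", "France is in Western Europe."]
question = "What is the capital of France?"

single_prompt = early_fusion_prompt(question, docs)
per_doc_prompts = late_fusion_prompts(question, docs)
```

Early fusion issues one generator call whose input grows with the number of documents, whereas late fusion issues one call per document with short inputs; which is cheaper depends on the generator's context-length scaling.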
2.4. Training Objectives
2.5. End-to-End Joint Training
2.6. Theoretical Considerations
2.7. Summary
3. Taxonomy of Efficient RAG Architectures
3.1. Taxonomy Dimensions
- Retriever Type: Indicates whether the retriever is sparse (e.g., BM25), dense (e.g., dual-encoder), hybrid, or learned end-to-end.
- Retriever Index: Describes the index type used for efficient lookup (e.g., inverted index, FAISS-based ANN, HNSW).
- Fusion Strategy: Specifies how retrieved documents are integrated into the generation process—early fusion, late fusion, or attention-based fusion.
- Generator Type: Identifies the underlying architecture of the generator (e.g., GPT, BART, T5).
- Training Paradigm: Denotes how the system is trained—modular (separate retriever and generator training), jointly (end-to-end), or via reinforcement learning.
- Context Handling: Refers to how the model deals with multiple retrieved contexts—concatenation, reranking, or hierarchical modeling.
- Efficiency Optimizations: Includes caching, document compression, distillation, or retrieval-time pruning techniques.
3.2. Comparison Table
| Model | Retriever Type | Index | Fusion | Generator | Training | Context Handling | Efficiency Techniques |
|---|---|---|---|---|---|---|---|
| REALM [15] | Dense (Dual Encoder) | FAISS | Early Fusion | BERT + MLM Head | Joint | Top-k Selection | Index Pretraining |
| RAG [16] | Dense + Sparse | IVF-Flat | Late Fusion (Marginalization) | BART | Modular | Reranking + Marginal | Fixed Top-k + Cache |
| FiD [17] | Dense (DPR) | Flat Index | Early Fusion (Encoder-level) | T5 | Modular | Input Concatenation | Passage Dropout |
| REPLUG [18] | Dense | Flat + Cache | Attention-based Fusion | Decoder-only LLM (OPT) | Plug-and-Play | Sparse Top-k Filtering | Cache Distillation |
| ColBERT-QA [19] | Late Interaction (ColBERT) | ANN + Inverted Lists | Late Fusion + Scoring | BART | Modular | Linear Scoring over Candidates | Index Pruning |
| RETRO [20] | Dense + Exact Match | Chunk Index + Hash Map | Contextual Fusion (Cross-Attn) | Transformer Decoder | Modular | Chunked Fusion | Chunk Compression |
| Atlas [21] | Dense + Learned | FAISS HNSW | Early Fusion | T5 | Joint (Retriever + Generator) | Full Joint Optimization | Retriever Fine-tuning |
| GNN-RAG [22] | Dense + Graph-based | HNSW + Graph Traversal | Attention Fusion + Graph Filtering | BART | Joint (Graph Encoder) | Relevance Propagation | Node Pruning |
| Dr. Retriever [23] | Dense + Memory-Augmented | Flat Memory | Early Fusion | GPT-2 | Modular | Dialogue Turn-Aware | Memory Compression |
| Jina-RAG [24] | Hybrid (Sparse + Dense) | Jina Index (Hybrid ANN) | Early Fusion + Rerank | Encoder-Decoder | Modular | Adaptive Retrieval Depth | Query-aware Filtering |
3.3. Analysis and Observations
Retriever Type and Indexing
Fusion Strategies
Training Paradigms
Efficiency Techniques
3.4. Concluding Remarks
4. Efficient Retrieval Methods
4.1. Sparse Retrieval
4.2. Dense Retrieval
- DPR (Dense Passage Retrieval): Uses BERT-based dual encoders trained with contrastive loss.
- ANCE: Applies hard negative mining and late-stage retraining [22].
- ColBERT: Utilizes late interaction via token-level embeddings and maximum-similarity aggregation.
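The maximum-similarity aggregation mentioned for ColBERT can be sketched in a few lines of NumPy. This is a toy illustration with hand-written two-dimensional token embeddings, not the ColBERT implementation; the function name `maxsim_score` and the example vectors are assumptions.

```python
import numpy as np

def maxsim_score(query_token_embs, doc_token_embs):
    """Late-interaction score: each query token takes its best-matching
    document token (max cosine similarity); the maxima are then summed."""
    q = np.asarray(query_token_embs, dtype=float)
    d = np.asarray(doc_token_embs, dtype=float)
    # L2-normalize so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sim = q @ d.T  # shape: (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # covers both query tokens
doc_b = np.array([[1.0, 0.0], [1.0, 0.1]])              # covers only the first well

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Because document token embeddings do not depend on the query, they can be computed and indexed offline; only the similarity matrix and the max-sum are evaluated at query time.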
4.3. Hybrid Retrieval
4.4. Learned and Adaptive Retrieval
- Learned Dense Retrieval (LDR): Retrieval scores are learned through a downstream generation loss.
- GNN-based Retrieval: Nodes in a document graph encode semantic and relational information; queries traverse the graph.
- Retriever Distillation: A smaller retriever mimics the behavior of a stronger teacher retriever (e.g., via KL divergence).
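For the retriever-distillation item, a KL-divergence objective over a shared candidate list might look as follows. The sketch assumes the teacher and student score the same candidate documents and that scores are turned into distributions with a temperature softmax; both choices, and all names and numbers, are illustrative assumptions.

```python
import numpy as np

def softmax(scores, temperature=1.0):
    z = np.asarray(scores, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL(teacher || student) over the same list of candidate documents."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = [4.0, 1.0, 0.5]
close_student = [3.9, 1.1, 0.4]  # ranks candidates like the teacher
far_student = [0.5, 1.0, 4.0]    # ranks candidates in reverse

loss_close = kl_distillation_loss(teacher, close_student)
loss_far = kl_distillation_loss(teacher, far_student)
```

The loss is zero only when the student reproduces the teacher's distribution over candidates, so it penalizes ranking disagreements more heavily where the teacher is confident.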
4.5. Indexing and ANN Search
- FAISS: Supports Product Quantization (PQ), Inverted File Index (IVF), and HNSW graphs.
- ScaNN and NGT: Optimized for high-speed ANN on GPU and CPU [23].
- Hierarchical Navigable Small World (HNSW): Graph-based method whose search cost grows approximately logarithmically with index size in practice.
4.6. Retrieval-Time Optimizations
Retrieval Caching
Dynamic Top-k Filtering
Query Pruning and Rewriting
Compression and Quantization
4.7. Evaluation Metrics
- Accuracy: Measured by recall@k, precision@k, and mean reciprocal rank (MRR). High recall makes it more likely that useful documents are passed to the generator.
- Latency and Throughput: Time per query and number of queries per second are essential for deployment.
4.8. Summary
5. Efficient Generation with Retrieved Contexts
5.1. Context Fusion Strategies
Early Fusion
Late Fusion
Retrieval-Aware Fusion (Attention-Based)
5.2. Efficiency-Oriented Modifications
Truncation and Chunking
Sparse Attention and Routing
Mixture-of-Experts (MoE)
Decoder Caching
Retrieval-Conditioned Prefixes
5.3. Parameter-Efficient Fine-Tuning
- LoRA (Low-Rank Adaptation): Updates low-rank matrices in attention layers while freezing the rest of the model.
- Adapters: Adds small bottleneck layers between transformer blocks that can be trained independently.
- Prefix/Prompt Tuning: Learns soft prompts or embeddings that condition the generation process.
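The LoRA item can be illustrated with a single linear layer. This is a forward-pass sketch in NumPy under stated assumptions: the class name `LoRALinear`, the rank, and the scaling factor `alpha` are illustrative, and no training loop is shown. The zero initialization of `B` follows the usual LoRA convention so that the adapted layer starts out identical to the frozen one.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                            # frozen, never updated
        self.A = rng.normal(0.0, 0.01, (rank, in_dim))  # trainable
        self.B = np.zeros((out_dim, rank))              # trainable, zero-initialized
        self.scale = alpha / rank

    def forward(self, x):
        # x: (batch, in_dim). Frozen path plus low-rank path.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

W = np.random.default_rng(1).normal(size=(64, 64))
layer = LoRALinear(W, rank=4)
x = np.ones((2, 64))
y = layer.forward(x)
```

With rank 4 on a 64 x 64 weight, the trainable update has 512 parameters against 4096 frozen ones; the ratio shrinks further as layer width grows.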
5.4. Batching and Parallelism
5.5. Evaluation of Generation Efficiency
- Latency per Token: Measures the time to produce each output token.
- Total Decode Time: Captures end-to-end time for generating the complete response.
- Throughput: Number of generation requests per second in a production setting.
- Context Utilization Ratio: Fraction of retrieved documents that contribute non-trivially to the final output (e.g., via attention or copy) [33].
5.6. Summary
6. Training Paradigms for Efficient RAG Systems
6.1. Two-Stage vs. End-to-End Training
Two-Stage Training
End-to-End Training
6.2. Contrastive and Supervised Retrieval Objectives
In-Batch Negative Contrastive Loss
Triplet Loss
Label Supervision
6.3. Generator Fine-Tuning Objectives
Sequence Likelihood
Denoising and Masked Objectives
Coverage and Saliency Regularization
Multi-Task Learning
6.4. Reinforcement Learning (RL) for RAG
Reward Functions
Retrieval Policy Learning
6.5. Distillation-Based Training
Retriever Distillation
Generator Distillation
6.6. Curriculum and Progressive Training
- Start with synthetic or noisy retrievals and progress to gold documents.
- Use short, high-confidence queries before scaling to long-form or ambiguous ones.
- Gradually increase the number of retrieved documents during training.
6.7. Memory and Compute Efficient Training
- Gradient Checkpointing: Saves memory by recomputing intermediate activations during backpropagation.
- Mixed Precision Training: Reduces memory and improves throughput using FP16 or BF16 arithmetic.
- Offline Retrieval: Freeze and precompute document embeddings for faster training.
- Retrieval Caching: Reuse document retrievals across epochs when retrieval changes slowly.
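Retrieval caching, the last item above, can be sketched with the standard library alone. The retriever here is a hypothetical stand-in that only counts its invocations; the cache size and all names are assumptions. The sketch is valid only while the retriever and index are frozen, since cached results go stale once either changes.

```python
from functools import lru_cache

RETRIEVER_CALLS = {"count": 0}

def expensive_retrieve(query):
    """Hypothetical stand-in for an ANN search over a frozen document index."""
    RETRIEVER_CALLS["count"] += 1
    return (f"doc-a-for:{query}", f"doc-b-for:{query}")

@lru_cache(maxsize=100_000)
def cached_retrieve(query):
    # Tuples are immutable, so callers cannot corrupt the cached result.
    return expensive_retrieve(query)

training_queries = ["who wrote hamlet", "capital of france"]
for epoch in range(3):
    for q in training_queries:
        docs = cached_retrieve(q)
```

Across three epochs the underlying retriever runs once per distinct query; calling `cached_retrieve.cache_clear()` after a retriever update would force fresh lookups.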
6.8. Summary
7. Benchmarks and Evaluation Protocols
7.1. Evaluation Dimensions
- Retrieval Quality: Measures how well the retriever selects relevant documents.
- Generation Quality: Evaluates the fluency, informativeness, and correctness of the output.
- Faithfulness: Assesses whether the generation adheres strictly to the retrieved evidence [43].
- Efficiency: Includes latency, memory usage, and throughput during inference and training.
7.2. Benchmark Datasets
7.3. Retrieval Evaluation Metrics
- Recall@k: Fraction of queries for which at least one relevant document appears in the top-k results [47].
- Precision@k: Fraction of top-k retrieved documents that are relevant.
- Mean Reciprocal Rank (MRR): Average inverse rank of the first relevant document.
- nDCG@k: Normalized Discounted Cumulative Gain for relevance-ranked documents.
- Embedding Similarity: Cosine similarity between query and gold document embeddings [48].
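The ranking metrics above reduce to a few lines each. The sketch below computes per-query values under binary relevance; averaging them over a query set gives the reported numbers. Function names and the toy ranking are illustrative assumptions.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Per-query hit: 1.0 if any relevant document is in the top-k, else 0.0."""
    return float(any(doc in relevant_ids for doc in ranked_ids[:k]))

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    return sum(doc in relevant_ids for doc in ranked_ids[:k]) / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """Inverse rank of the first relevant document (0.0 if none is retrieved)."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG with binary relevance gains."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked_ids[:k], start=1)
              if doc in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
```

For this toy ranking the first relevant document sits at rank 2, so the hit at k = 1 is 0.0, the hit at k = 2 is 1.0, and the reciprocal rank is 0.5.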
7.4. Generation Evaluation Metrics
- Exact Match (EM): Binary match between predicted and ground-truth answer.
- F1 Score: Token overlap between predicted and true answer [49].
- BLEU / ROUGE / METEOR: N-gram based precision and recall measures.
- BERTScore: Semantic similarity via BERT embeddings [50].
- Knowledge F1: Measures whether generation correctly uses information from retrieved evidence.
- Faithfulness / Hallucination Rate: Measures how often generation includes unsupported claims.
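Exact Match and token-level F1 are commonly computed after light answer normalization. The sketch below assumes a SQuAD-style normalization (lowercasing, removing punctuation and English articles); that normalization and the function names are assumptions, since the survey does not specify one.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = token_f1("Eiffel Tower in Paris", "the Eiffel Tower")
```

A verbose but correct prediction scores zero on Exact Match yet retains partial credit under F1, which is one reason the two are usually reported together.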
7.5. Joint Retrieval-Generation Evaluation
- Answer Recall vs. Evidence Recall: Measures how much answer correctness depends on evidence quality.
- Precision-Recall Curves: Derived from retrieval score thresholds that lead to successful generation.
7.6. Human Evaluation Protocols
- Factual Accuracy: Does the output contain any hallucinated or incorrect information?
- Relevance: Does the generation appropriately answer the query or summarize the source [53]?
- Fluency and Readability: Is the output grammatically sound and stylistically coherent [54]?
- Usefulness: Does the answer provide value in the context of the user’s query [55]?
7.7. Efficiency Metrics
- Latency: Total time from query input to generation output.
- Retrieval Cost: Time and resources required to retrieve documents [56].
- Memory Footprint: Peak GPU/CPU memory usage during inference.
- Throughput: Number of queries handled per second or per GPU hour.
7.8. Reproducibility and Reporting Practices
- Reporting performance with and without gold passages.
- Varying k (number of retrieved documents) to evaluate sensitivity.
- Isolating retrieval vs. generation contributions.
- Reporting both automatic and human evaluation metrics.
- Providing full code and data for replication [58].
7.9. Summary
8. Challenges and Open Research Problems in Efficient RAG
8.1. Retrieval-Generation Alignment Gap
Open Problem
Promising Directions
- Learn reward signals from generation loss gradients to guide retriever updates [62].
- Use reinforcement learning to jointly optimize retriever policies based on generation outcomes.
- Apply contrastive learning between relevant and spurious contexts with respect to output quality.
8.2. Faithfulness and Hallucination Control
Open Problem
Challenges
Research Directions
8.3. Scalability and Latency
Open Problem
Key Bottlenecks
- Dense retrievers require large vector indexes and complex similarity search mechanisms.
- Long document inputs increase memory and computation cost for generators [69].
- Joint retriever-generator training increases gradient memory footprint.
Future Work
- Explore sparse or hybrid retrieval techniques with efficient indexes (e.g., ColBERT, SPLADE).
- Develop lightweight generator architectures with hierarchical or segment-aware encoding [70].
- Compress retriever and generator models via quantization, pruning, or knowledge distillation.
8.4. Context Selection and Redundancy
Open Problem
Research Opportunities
- Use uncertainty-aware retrieval strategies that balance relevance and novelty.
- Implement context re-ranking or clustering to improve information diversity [72].
- Allow generators to attend adaptively over documents rather than fixed concatenations.
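One concrete way to balance relevance and novelty when selecting contexts is Maximal Marginal Relevance (MMR). MMR is named here as an example technique of this kind and is not attributed to any system in the survey; the function name, the trade-off weight `lambda_`, and the toy vectors are assumptions.

```python
import numpy as np

def mmr_select(query_vec, doc_vecs, k, lambda_=0.5):
    """Greedy MMR: trade off relevance to the query against similarity
    to documents that have already been selected."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    q = unit(np.asarray(query_vec, dtype=float))
    d = unit(np.asarray(doc_vecs, dtype=float))
    relevance = d @ q
    selected = []
    candidates = list(range(len(d)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((d[i] @ d[j] for j in selected), default=0.0)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.2]
docs = [[1.0, 0.0],    # relevant
        [1.0, 0.01],   # near-duplicate of document 0
        [0.6, 0.8]]    # less relevant but novel

diverse = mmr_select(query, docs, k=2, lambda_=0.5)
relevance_only = mmr_select(query, docs, k=2, lambda_=1.0)
```

With `lambda_=1.0` the selection degenerates to plain top-k and returns both near-duplicates; with `lambda_=0.5` the second slot goes to the novel document instead.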
8.5. Multimodal and Cross-Lingual RAG
Open Problems
- How can RAG incorporate images, tables, or structured data as retrieval contexts?
- How can we train RAG systems for low-resource languages without parallel corpora?
Research Directions
- Develop shared embedding spaces for cross-modal or cross-lingual retrieval [74].
- Use translation-based retrieval or multilingual pretrained retrievers (e.g., mDPR, LaBSE).
- Design modality-aware generators that fuse information from heterogeneous sources.
8.6. Evaluation Challenges
Key Limitations
Research Directions
8.7. Adaptability and Lifelong Learning
Open Problem
Emerging Solutions
8.8. Security, Privacy, and Bias
Risks
Research Challenges
- Design privacy-preserving retrieval mechanisms (e.g., using differential privacy).
- Detect and filter harmful content during retrieval or post-generation.
- Mitigate bias via dataset balancing, fairness objectives, or counterfactual generation.
8.9. Summary
9. Future Directions and Opportunities
9.1. Unified End-to-End Optimization
Opportunities
- Design differentiable retrievers using approximations of discrete retrieval (e.g., Gumbel-softmax, REINFORCE, in-batch negatives).
- Co-train retriever and generator with shared encoder backbones or dual encoders, enabling information flow between retrieval and generation stages [87].
- Investigate multitask and multi-objective frameworks combining retrieval, generation, and auxiliary tasks such as question decomposition or paraphrase generation [88].
9.2. Context-Aware and Query-Adaptive Retrieval
Ideas to Explore
- Develop query-aware context policies that determine how many documents to retrieve, based on uncertainty, ambiguity, or query intent [89].
- Implement retrieval cascades, where cheap shallow retrieval (e.g., BM25) is followed by dense reranking only when needed.
- Explore curriculum-based retrieval, where retrieval behavior evolves with the generator’s learning phase.
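A retrieval cascade of the kind described in the second item can be sketched as follows. The first stage uses a simple term-overlap score as a dependency-free stand-in for BM25, and the expensive reranker is passed in as a callable; the confidence rule (score gap between the top two candidates), the `margin` value, and all names are assumptions of this sketch.

```python
def lexical_scores(query, documents):
    """Cheap first stage: fraction of query terms found in each document.
    (A stand-in for BM25, chosen only to keep the sketch dependency-free.)"""
    terms = set(query.lower().split())
    return [len(terms & set(doc.lower().split())) / max(len(terms), 1)
            for doc in documents]

def cascade_retrieve(query, documents, rerank_fn, k=2, margin=0.2):
    """Return (top-k document indices, whether the expensive reranker ran)."""
    scores = lexical_scores(query, documents)
    order = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    shortlist = order[: max(2 * k, 2)]
    # The first stage counts as confident when the best score clearly
    # separates from the runner-up.
    confident = (len(shortlist) < 2
                 or scores[shortlist[0]] - scores[shortlist[1]] >= margin)
    if confident:
        return shortlist[:k], False
    reranked = sorted(shortlist,
                      key=lambda i: rerank_fn(query, documents[i]),
                      reverse=True)
    return reranked[:k], True

corpus = ["solar panels convert sunlight",
          "wind turbines convert wind",
          "cooking pasta recipes"]

def toy_reranker(query, doc):
    # Hypothetical expensive scorer; here it simply prefers documents about wind.
    return float("wind" in doc)

easy = cascade_retrieve("solar panels", corpus, toy_reranker)
hard = cascade_retrieve("convert energy", corpus, toy_reranker)
```

The first query is resolved by the cheap stage alone, while the second produces a tie between two candidates and triggers the reranker, so the expensive model runs only on ambiguous queries.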
9.3. Integration with Structured and Symbolic Knowledge
Challenges and Goals
- Design hybrid retrievers capable of jointly querying both free-text and structured knowledge.
- Encode graph-based retrieval results (e.g., paths in a knowledge graph) into textual contexts that preserve relational semantics.
- Use symbolic reasoning over retrieved facts to improve multi-hop and commonsense inference.
9.4. Multi-Hop and Compositional Reasoning
Promising Approaches
- Sequential retrieval pipelines where intermediate answers inform subsequent retrieval steps.
- Graph-based modeling of evidence chains, enabling backward and forward chaining reasoning [92].
- Integrate RAG with neuro-symbolic methods to trace and verify reasoning steps explicitly.
9.5. Personalized and Continual Retrieval-Augmented Learning
Future Directions
- Design long-term memory components that store user interactions, personal documents, or task-specific knowledge.
- Implement continual learning frameworks that update document encodings and retriever parameters without catastrophic forgetting [94].
- Explore federated or on-device learning techniques for privacy-aware personalization [95].
9.6. Interpretable and Transparent RAG Systems
Research Goals
- Provide traceability from generation output to specific retrieved passages [97].
- Develop explanation interfaces that highlight influential evidence spans and reasoning chains [98].
- Incorporate attribution mechanisms (e.g., attention maps, saliency methods) into training objectives to improve transparency.
9.7. RAG in Multimodal and Interactive Settings
Emerging Frontiers
- Design retrievers for multimodal corpora, supporting inputs like images, tables, or time-series data.
- Create multimodal encoders that align vision, language, and audio embeddings into shared retrieval spaces.
- Enable dialog-centric RAG that incorporates user feedback and performs live context updates [99].
9.8. Evaluation Frameworks for Next-Gen RAG
Opportunities
- Develop benchmark suites that test end-to-end performance across multiple axes (e.g., factuality, novelty, response diversity).
- Introduce task-agnostic metrics that quantify retrieval-generation alignment or evidence sensitivity.
- Embrace simulation environments (e.g., embodied QA, virtual assistants) for real-time interactive evaluation.
9.9. Synergies with Agent Architectures
Examples and Opportunities
- Integrate RAG with planning modules, memory stores, and tool execution layers.
- Use RAG to build agents that learn retrieval heuristics via reinforcement learning.
- Enable recursive RAG pipelines where generations trigger further retrievals, enabling autonomous workflows.
9.10. Summary
10. Conclusions
References
- Cheng, X.; Gao, S.; Liu, L.; Zhao, D.; Yan, R. Neural machine translation with contrastive translation memories. arXiv preprint arXiv:2212.03140, 2022.
- Xu, Z.; Gong, Y.; Zhou, Y.; Bao, Q.; Qian, W. arXiv preprint arXiv:2403.07905 [cs.DC], 2024.
- Cheng, D.; Huang, S.; Bi, J.; Zhan, Y.; Liu, J.; Wang, Y.; Sun, H.; Wei, F.; Deng, D.; Zhang, Q. UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation. arXiv preprint arXiv:2303.08518, 2023.
- Joshi, M.; Choi, E.; Weld, D.S.; Zettlemoyer, L. arXiv preprint arXiv:1705.03551 [cs.CL], 2017.
- Jeong, S.; Baek, J.; Cho, S.; Hwang, S.J.; Park, J.C. arXiv preprint arXiv:2403.14403 [cs.CL], 2024.
- Anderson, N.; Wilson, C.; Richardson, S.D. Lingua: Addressing Scenarios for Live Interpretation and Automatic Dubbing. In Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track).
- Ebner, S.; Xia, P.; Culkin, R.; Rawlins, K.; Durme, B.V. arXiv preprint arXiv:1911.03766 [cs.CL], 2020.
- AG2AI Contributors. AG2: A GitHub Repository for Advanced Generative AI Research. https://github.com/ag2ai/ag2, 2025. Accessed: 2025-01-15.
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- An, Z.; Ding, X.; Fu, Y.C.; Chu, C.C.; Li, Y.; Du, W. arXiv preprint arXiv:2408.00798 [cs.IR], 2024.
- Blog, Q. Agentic RAG: Combining RAG with Agents for Enhanced Information Retrieval. https://qdrant.tech/articles/agentic-rag/. Accessed: 2025-01-14.
- Dai, Z.; Zhao, V.Y.; Ma, J.; Luan, Y.; Ni, J.; Lu, J.; Bakalov, A.; Guu, K.; Hall, K.B.; Chang, M.W. Promptagator: Few-shot dense retrieval from 8 examples. arXiv preprint arXiv:2209.11755, 2022.
- Izacard, G.; Grave, E. Distilling knowledge from reader to retriever for question answering. arXiv preprint arXiv:2012.04584, 2020.
- Li, X.; Zhao, R.; Chia, Y.K.; Ding, B.; Bing, L.; Joty, S.; Poria, S. Chain of Knowledge: A Framework for Grounding Large Language Models with Structured Knowledge Bases. arXiv preprint arXiv:2305.13269, 2023.
- NVIDIA. Spectrum-X: End-to-End Networking for AI and High-Performance Computing. https://www.nvidia.com/en-us/networking/spectrumx/, 2025. Accessed: 2025-01-28.
- Anantha, R.; Bethi, T.; Vodianik, D.; Chappidi, S. Context Tuning for Retrieval Augmented Generation. arXiv preprint arXiv:2312.05708, 2023.
- Thakur, N.; Bonifacio, L.; Zhang, X.; Ogundepo, O.; Kamalloo, E.; Alfonso-Hermelo, D.; Li, X.; Liu, Q.; Chen, B.; Rezagholizadeh, M.; et al. A: When You Don’t Know. arXiv preprint arXiv:2312.11361 [cs.CL], 2024.
- Friel, R.; Belyi, M.; Sanyal, A. arXiv preprint arXiv:2407.11005 [cs.CL], 2024.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in neural information processing systems 2017, 30.
- Ma, Y.; Cao, Y.; Hong, Y.; Sun, A. Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples! ArXiv 2023, 8559.
- Repository, L.D. Auto Insurance Claims Workflow using LlamaCloud. https://github.com/run-llama/llamacloud-demo/blob/main/examples/document_workflows/auto_insurance_claims/auto_insurance_claims.ipynb, 2025. Accessed: 2025-01-13.
- Blog, N.D. Microsoft GraphRAG and Neo4j, 2024. Accessed: 2025-01-11.
- Zhong, Z.; Lei, T.; Chen, D. Training language models with memory augmentation. arXiv preprint arXiv:2205.12674, 2022.
- Ho, X.; Nguyen, A.K.D.; Sugawara, S.; Aizawa, A. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060, 2020.
- Nakano, R.; Hilton, J.; Balaji, S.; Wu, J.; Ouyang, L.; Kim, C.; Hesse, C.; Jain, S.; Kosaraju, V.; Saunders, W.; et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
- Singh, A.; Ehtesham, A.; Kumar, S.; Gupta, G.K.; Khoei, T.T. Encouraging Responsible Use of Generative AI in Education: A Reward-Based Learning Approach. In Proceedings of the Artificial Intelligence in Education Technologies: New Development and Innovative Practices; Schlippe, T.; Cheng, E.C.K.; Wang, T., Eds., Singapore; 2025; pp. 404–413.
- Zniyed, Y.; Nguyen, T.P.; et al. Enhanced network compression through tensor decompositions and pruning. IEEE Transactions on Neural Networks and Learning Systems 2024.
- Li, S.; Ji, H.; Han, J. arXiv preprint arXiv:2104.05919 [cs.CL], 2021.
- Wang, S.; Xu, Y.; Fang, Y.; Liu, Y.; Sun, S.; Xu, R.; Zhu, C.; Zeng, M. Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Husain, H.; Wu, H.H.; Gazit, T.; Allamanis, M.; Brockschmidt, M. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.
- Tutorial, L.A.R. LangGraph Adaptive RAG: Adaptive Retrieval-Augmented Generation Tutorial. https://langchain-ai.github.io/langgraph/tutorials/rag/langgraph_adaptive_rag/. Accessed: 2025-01-14.
- Narayan, S.; Cohen, S.B.; Lapata, M. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745, 2018.
- Cerny, T.; Abdelfattah, A.S.; Bushong, V.; Maruf, A.A.; Taibi, D. A: Architecture Reconstruction and Visualization Techniques. arXiv preprint arXiv:2207.02988 [cs.SE], 2022.
- Blog, N.D. Indexing Pipeline GraphRAG Image, 2024. Accessed: 2025-01-11.
- Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; Millican, K.; Van Den Driessche, G.B.; Lespiau, J.B.; Damoc, B.; Clark, A.; et al. Improving language models by retrieving from trillions of tokens. In Proceedings of the International conference on machine learning. PMLR; 2022; pp. 2206–2240.
- Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
- Berchansky, M.; Izsak, P.; Caciularu, A.; Dagan, I.; Wasserblat, M. Optimizing Retrieval-augmented Reader Models via Token Elimination. arXiv preprint arXiv:2310.13682, 2023.
- Hoshi, Y.; Miyashita, D.; Ng, Y.; Tatsuno, K.; Morioka, Y.; Torii, O.; Deguchi, J. RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models. arXiv preprint arXiv:2308.10633, 2023.
- Berchansky, M.; Izsak, P.; Caciularu, A.; Dagan, I.; Wasserblat, M. Optimizing retrieval-augmented reader models via token elimination. arXiv preprint arXiv:2310.13682, 2023.
- Rajput, S.; Mehta, N.; Singh, A.; Keshavan, R.H.; Vu, T.; Heldt, L.; Hong, L.; Tay, Y.; Tran, V.Q.; Samost, J.; et al. Recommender Systems with Generative Retrieval. arXiv preprint arXiv:2305.05065, 2023.
- Wang, X.; Chen, G.H.; Song, D.; Zhang, Z.; Chen, Z.; Xiao, Q.; Jiang, F.; Li, J.; Wan, X.; Wang, B.; et al. arXiv preprint arXiv:2308.08833 [cs.CL], 2024.
- Gupta, G.K.; Singh, A.; Manikandan, S.V.; Ehtesham, A. Digital Diagnostics: The Potential of Large Language Models in Recognizing Symptoms of Common Illnesses. AI 2025, 6.
- Shi, F.; Chen, X.; Misra, K.; Scales, N.; Dohan, D.; Chi, E.H.; Schärli, N.; Zhou, D. Large language models can be easily distracted by irrelevant context. In Proceedings of the International Conference on Machine Learning. PMLR; 2023; pp. 31210–31227.
- Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. arXiv preprint arXiv:2308.08155 [cs.AI], 2023.
- Geva, M.; Khashabi, D.; Segal, E.; Khot, T.; Roth, D.; Berant, J. Did Aristotle Use a Laptop? arXiv preprint arXiv:2101.02235 [cs.CL], 2021.
- Fan, A.; Jernite, Y.; Perez, E.; Grangier, D.; Weston, J.; Auli, M. ELI5: Long form question answering. arXiv preprint arXiv:1907.09190, 2019.
- Sciavolino, C.; Zhong, Z.; Lee, J.; Chen, D. Simple entity-centric questions challenge dense retrievers. arXiv preprint arXiv:2109.08535, 2021.
- Rau, D.; Déjean, H.; Chirkova, N.; Formal, T.; Wang, S.; Nikoulina, V.; Clinchant, S. arXiv preprint arXiv:2407.01102 [cs.CL], 2024.
- Xia, M.; Huang, G.; Liu, L.; Shi, S. Graph based translation memory for neural machine translation. In Proceedings of the AAAI conference on artificial intelligence, 2019, Vol.
- Geva, M.; Khashabi, D.; Segal, E.; Khot, T.; Roth, D.; Berant, J. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics 2021, 9, 346–361.
- Liu, Y.; Yavuz, S.; Meng, R.; Moorthy, M.; Joty, S.; Xiong, C.; Zhou, Y. Exploring the integration strategies of retriever and large language models. arXiv preprint arXiv:2308.12574, 2023.
- Yang, A.; Nagrani, A.; Seo, P.H.; Miech, A.; Pont-Tuset, J.; Laptev, I.; Sivic, J.; Schmid, C. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp.
- Yan, S.Q.; Gu, J.C.; Zhu, Y.; Ling, Z.H. arXiv preprint arXiv:2401.15884 [cs.CL], 2024.
- Liu, N.F.; Lin, K.; Hewitt, J.; Paranjape, A.; Bevilacqua, M.; Petroni, F.; Liang, P. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
- Trivedi, H.; Balasubramanian, N.; Khot, T.; Sabharwal, A. arXiv preprint arXiv:2108.00573 [cs.CL], 2022.
- Packer, C.; Fang, V.; Patil, S.G.; Lin, K.; Wooders, S.; Gonzalez, J.E. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023.
- Singh, A. Exploring Language Models: A Comprehensive Survey and Analysis. In Proceedings of the 2023 International Conference on Research Methodologies in Knowledge Management, Artificial Intelligence and Telecommunication Engineering (RMKMATE); 2023; pp. 1–4.
- Lin, X.V.; Chen, X.; Chen, M.; Shi, W.; Lomeli, M.; James, R.; Rodriguez, P.; Kahn, J.; Szilvasy, G.; Lewis, M.; et al. RA-DIT: Retrieval-Augmented Dual Instruction Tuning. arXiv preprint arXiv:2310.01352, 2023.
- Bajaj, P.; Campos, D.; Craswell, N.; Deng, L.; Gao, J.; Liu, X.; Majumder, R.; McNamara, A.; Mitra, B.; Nguyen, T.; et al. A: MARCO. arXiv preprint arXiv:1611.09268 [cs.CL], 2018.
- Jarvis, C.; Allard, J. A Survey of Techniques for Maximizing LLM Performance. https://community.openai.com/t/openai-dev-day-2023-breakout-sessions/505213#a-survey-of-techniques-for-maximizing-llm-performance-2, 2023.
- Thakur, N.; Reimers, N.; Rücklé, A.; Srivastava, A.; Gurevych, I. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663, 2021.
- Petroni, F.; Rocktäschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A.H.; Riedel, S. Language models as knowledge bases? arXiv preprint arXiv:1909.01066, 2019.
- Peng, B.; Zhu, Y.; Liu, Y.; Bo, X.; Shi, H.; Hong, C.; Zhang, Y.; Tang, S. A: Retrieval-Augmented Generation. arXiv preprint arXiv:2408.08921 [cs.AI], 2024.
- Mallen, A.; Asai, A.; Zhong, V.; Das, R.; Hajishirzi, H.; Khashabi, D. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022.
- Baek, J.; Aji, A.F.; Saffari, A. Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering. arXiv preprint arXiv:2306.04136, 2023.
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- Xu, P.; Ping, W.; Wu, X.; McAfee, L.; Zhu, C.; Liu, Z.; Subramanian, S.; Bakhturina, E.; Shoeybi, M.; Catanzaro, B. Retrieval meets long context large language models. arXiv preprint arXiv:2310.03025, 2023.
- Documentation, L. Agentic RAG Using Vertex AI. https://docs.llamaindex.ai/en/stable/examples/agent/agentic_rag_using_vertex_ai/. Accessed: 2025-01-14.
- Zhao, P.; Zhang, H.; Yu, Q.; Wang, Z.; Geng, Y.; Fu, F.; Yang, L.; Zhang, W.; Jiang, J.; Cui, B. A: Generation for AI-Generated Content, 2024; arXiv:cs.CV/2402.19473].
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D.; et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 2022, 35, 24824–24837. [Google Scholar]
- Robertson, S.; Zaragoza, H.; et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 2009, 3, 333–389. [Google Scholar] [CrossRef]
- Luo, Z.; Xu, C.; Zhao, P.; Geng, X.; Tao, C.; Ma, J.; Lin, Q.; Jiang, D. Augmented Large Language Models with Parametric Knowledge Guiding. arXiv preprint arXiv:2305.04757, arXiv:2305.04757 2023.
- Microsoft. Semantic Kernel Overview, 2025. https://learn.microsoft.com/en-us/semantic-kernel/overview/. Accessed: , 2025. 2 February.
- Xu, S.; Pang, L.; Shen, H.; Cheng, X.; Chua, T.S. Search-in-the-chain: Towards accurate, credible and traceable large language models for knowledgeintensive tasks. CoRR, vol. abs/2304.14732.
- Yasunaga, M.; Aghajanyan, A.; Shi, W.; James, R.; Leskovec, J.; Liang, P.; Lewis, M.; Zettlemoyer, L.; Yih, W.t. Retrieval-augmented multimodal language modeling. arXiv preprint arXiv:2211.12561, arXiv:2211.12561 2022.
- Wang, H.; Hu, M.; Deng, Y.; Wang, R.; Mi, F.; Wang, W.; Wang, Y.; Kwan, W.C.; King, I.; Wong, K.F. Large Language Models as Source Planner for Personalized Knowledge-grounded Dialogue. arXiv preprint arXiv:2310.08840, arXiv:2310.08840 2023.
- Jiang, H.; Wu, Q.; Lin, C.Y.; Yang, Y.; Qiu, L. Llmlingua: Compressing prompts for accelerated inference of large language models. arXiv preprint arXiv:2310.05736, arXiv:2310.05736 2023.
- Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. 2024; arXiv:cs.CL/2303.18223].
- Wen, T.H.; Gašić, M.; Mrkšić, N.; Rojas-Barahona, L.M.; Su, P.H.; Ultes, S.; Vandyke, D.; Young, S. Conditional Generation and Snapshot Learning in Neural Dialogue Systems. In Proceedings of the Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, 2016. [CrossRef]
- Wang, H.; Huang, W.; Deng, Y.; Wang, R.; Wang, Z.; Wang, Y.; Mi, F.; Pan, J.Z.; Wong, K.F. UniMS-RAG: A Unified Multi-source Retrieval-Augmented Generation for Personalized Dialogue Systems. arXiv preprint arXiv:2401.13256, arXiv:2401.13256 2024.
- Saha, S.; Junaed, J.A.; Saleki, M.; Sen Sharma, A.; Rifat, M.R.; Rahouti, M.; Ahmed, S.I.; Mohammed, N.; Amin, M.R. Vio-Lens: A Novel Dataset of Annotated Social Network Posts Leading to Different Forms of Communal Violence and its Evaluation. In Proceedings of the First Workshop on Bangla Language Processing (BLP-2023).
- Saha, S.; Junaed, J.A.; Saleki, M.; Sharma, A.S.; Rifat, M.R.; Rahouti, M.; Ahmed, S.I.; Mohammed, N.; Amin, M.R. Vio-lens: A novel dataset of annotated social network posts leading to different forms of communal violence and its evaluation. In Proceedings of the First Workshop on Bangla Language Processing (BLP-2023), 2023.
- Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- Merity, S.; Xiong, C.; Bradbury, J.; Socher, R. arXiv preprint arXiv:1609.07843 [cs.CL], 2016.
- Liu, X.; Lai, H.; Yu, H.; Xu, Y.; Zeng, A.; Du, Z.; Zhang, P.; Dong, Y.; Tang, J. WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences. arXiv preprint arXiv:2306.07906, 2023.
- LlamaIndex Blog. Agentic RAG with LlamaIndex. https://www.llamaindex.ai/blog/agentic-rag-with-llamaindex-2721b8a49ff6. Accessed: 2025-01-14.
- Shi, T.; Li, L.; Lin, Z.; Yang, T.; Quan, X.; Wang, Q. Dual-Feedback Knowledge Retrieval for Task-Oriented Dialogue Systems. arXiv preprint arXiv:2310.14528, 2023.
- Gottlob, G. Complexity results for nonmonotonic logics. Journal of Logic and Computation 1992, 2, 397–425.
- Asai, A.; Wu, Z.; Wang, Y.; Sil, A.; Hajishirzi, H. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv preprint arXiv:2310.11511, 2023.
- Steinberger, R.; Pouliquen, B.; Widiger, A.; Ignat, C.; Erjavec, T.; Tufiş, D.; Varga, D. The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06).
- LangChain. LangGraph CRAG: Contextualized Retrieval-Augmented Generation Tutorial. https://langchain-ai.github.io/langgraph/tutorials/rag/langgraph_crag/. Accessed: 2025-01-14.
- Jiang, Z.; Xu, F.F.; Gao, L.; Sun, Z.; Liu, Q.; Dwivedi-Yu, J.; Yang, Y.; Callan, J.; Neubig, G. arXiv preprint arXiv:2305.06983 [cs.CL], 2023.
- Thorne, J.; Vlachos, A.; Christodoulopoulos, C.; Mittal, A. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
- Melz, E. Enhancing LLM intelligence with ARM-RAG: Auxiliary rationale memory for retrieval augmented generation. arXiv preprint arXiv:2311.04177, 2023.
- crewAI Inc. crewAI: A GitHub Repository for AI Projects. https://github.com/crewAIInc/crewAI, 2025. Accessed: 2025-01-15.
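- Wang, H.; Hu, M.; Deng, Y.; Wang, R.; Mi, F.; Wang, W.; Wang, Y.; Kwan, W.C.; King, I.; Wong, K.F. Large Language Models as Source Planner for Personalized Knowledge-grounded Dialogue. arXiv preprint arXiv:2310.08840 [cs.CL], 2023.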
- Guo, Z.; Cheng, S.; Wang, Y.; Li, P.; Liu, Y. Prompt-Guided Retrieval Augmentation for Non-Knowledge-Intensive Tasks. arXiv preprint arXiv:2305.17653, 2023.
- Kim, G.; Kim, S.; Jeon, B.; Park, J.; Kang, J. Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models. arXiv preprint arXiv:2310.14696, 2023.
- Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. arXiv preprint arXiv:2303.17651 [cs.CL], 2023.
- LangChain. LangSmith: The Ultimate Toolkit for Debugging and Monitoring LLM Applications. https://www.langchain.com/langsmith, 2025. Accessed: 2025-01-28.
- Yang, S. Advanced RAG 01: Small-to-Big Retrieval. https://towardsdatascience.com/advanced-rag-01-small-to-big-retrieval-172181b396d4, 2023.
| Task | Dataset | Description | Retriever Input | Generator Target |
|---|---|---|---|---|
| Open-Domain QA | NaturalQuestions | Real-world questions from Google search logs | Question text | Answer string(s) |
| Open-Domain QA | TriviaQA | Trivia-style questions with evidence documents | Question | Answer string |
| Open-Domain QA | HotpotQA | Multi-hop QA requiring reasoning across documents | Question | Multi-sentence answer |
| Open-Domain QA | WebQuestions | QA from search queries linked to Freebase | Question | Named entity or phrase |
| Fact Checking | FEVER | Fact verification against Wikipedia | Claim statement | Label: SUPPORTS/REFUTES/NOT ENOUGH INFO |
| Fact Checking | SciFact | Scientific claim verification | Scientific claim | Label + rationale sentence |
| Summarization | QMSum | Meeting transcript summarization | Dialogue transcripts | Abstractive summary |
| Summarization | MultiNews | Multi-document news summarization | Related news articles | Summary paragraph |
| Knowledge-Intensive Tasks | KILT | Unified benchmark covering QA, fact checking, entity linking, etc. [46] | Task-specific | Varies |
| Long-Form Generation | ELI5 | Reddit-based QA with long-form answers | Open-ended question | Paragraph-length answer |
| Long-Form Generation | NarrativeQA | Story understanding and narrative response | Story + Question | Long answer |
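
The retriever-input and generator-target columns of the table above can be read as a per-dataset input/output contract. The following is a minimal Python sketch of that reading, not an implementation from the paper: the names `BenchmarkSpec`, `SPECS`, `datasets_for`, and `make_fever_example` are hypothetical, only a subset of rows is transcribed, and the sample claim is invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkSpec:
    """One row of the benchmark table (field names are illustrative)."""
    task: str
    dataset: str
    retriever_input: str
    generator_target: str


# A subset of rows transcribed from the table above.
SPECS = [
    BenchmarkSpec("Open-Domain QA", "NaturalQuestions", "Question text", "Answer string(s)"),
    BenchmarkSpec("Open-Domain QA", "TriviaQA", "Question", "Answer string"),
    BenchmarkSpec("Open-Domain QA", "HotpotQA", "Question", "Multi-sentence answer"),
    BenchmarkSpec("Open-Domain QA", "WebQuestions", "Question", "Named entity or phrase"),
    BenchmarkSpec("Fact Checking", "FEVER", "Claim statement",
                  "Label: SUPPORTS/REFUTES/NOT ENOUGH INFO"),
    BenchmarkSpec("Fact Checking", "SciFact", "Scientific claim",
                  "Label + rationale sentence"),
]

# The three FEVER labels named in the table.
FEVER_LABELS = ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO")


def datasets_for(task: str) -> list[str]:
    """Return the dataset names listed under a task, in table order."""
    return [spec.dataset for spec in SPECS if spec.task == task]


def make_fever_example(claim: str, label: str) -> dict[str, str]:
    """Pair a claim (retriever input) with its label (generator target)."""
    if label not in FEVER_LABELS:
        raise ValueError(f"unknown FEVER label: {label!r}")
    return {"retriever_input": claim, "generator_target": label}


if __name__ == "__main__":
    print(datasets_for("Fact Checking"))
    # The claim below is a made-up illustration, not a record from FEVER.
    print(make_fever_example("The Eiffel Tower is in Berlin.", "REFUTES"))
```

Keeping the contract in a small typed record makes it easy to check that a data loader feeds the retriever and the generator the fields the table prescribes for each dataset.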
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).