Submitted: 17 July 2025
Posted: 17 July 2025
Abstract
Keywords:
1. Introduction
2. Materials and Methods
2.1. Similarity Search
2.2. Large Language Models
- Decoder-only models (such as the Generative Pre-trained Transformer (GPT)), which are optimized for generating text from a prompt;
- Encoder-decoder models (such as the Text-to-Text Transfer Transformer (T5) or the Bidirectional and Auto-Regressive Transformer (BART)), which are well suited to tasks that transform one kind of text into another, such as translation or summarization;
- Retrieval-augmented models, which combine a language model with a retrieval system that looks up relevant external information during generation. This improves factual accuracy and reduces the likelihood of generating incorrect or made-up content.
2.3. Retrieval-Augmented Generation (RAG) Pipeline
- The retriever is responsible for finding the most relevant documents in an external source of information, such as a document collection or a vector database. It is called non-parametric because it does not store knowledge in the form of model parameters; instead, it performs a live search each time a question is asked, which means the knowledge can be updated without retraining the model;
- The generator is the LLM itself. It is a parametric model, meaning it holds knowledge in its internal parameters, acquired during training. The generator takes both the user's question and the retrieved documents as input and produces a written response.
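The two components above can be sketched as follows. This is a minimal, self-contained illustration with toy three-dimensional embeddings and hypothetical function names, not the paper's implementation: the retriever ranks stored embeddings by cosine similarity at question time, and the generator is reduced to a stub that stands in for the LLM call.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(query_embedding, corpus, top_k=3):
    """Non-parametric retriever: a live similarity search over stored
    embeddings; updating the corpus requires no model retraining."""
    ranked = sorted(corpus,
                    key=lambda d: cosine_similarity(query_embedding, d["embedding"]),
                    reverse=True)
    return ranked[:top_k]

def generate(question, chunks):
    """Stand-in for the parametric generator (the LLM itself); a real system
    would pass the question and the retrieved chunks to the model."""
    context = " ".join(c["text"] for c in chunks)
    return f"[LLM answer grounded in: {context}]"

# Toy corpus with stand-in embeddings.
corpus = [
    {"text": "Soil organic carbon prediction with transformers.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Deforestation monitoring with SAR imagery.", "embedding": [0.1, 0.9, 0.0]},
]
top = retrieve([1.0, 0.0, 0.0], corpus, top_k=1)
answer = generate("How is SOC predicted?", top)
```

Because retrieval happens per query, adding a new paper to the corpus list immediately makes it available to the generator.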
2.4. System Functionality
2.5. Technologies Used
2.6. Data Storage and Indexing Infrastructure
- IVFFlat partitions the vector space into clusters and performs search within the most relevant clusters. It requires pre-training and can be fast but may sacrifice some recall;
- HNSW, in contrast, builds a navigable multi-layer graph over the embeddings. It does not require prior training and offers higher recall and faster average query times, making it preferable in this implementation.
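For concreteness, the two index types can be created in pgvector as shown below. The table and column names (`chunks`, `embedding`) and the parameter values are illustrative, not taken from the paper; the statements use pgvector's documented syntax, where `ivfflat` takes a `lists` clustering parameter while `hnsw` takes graph parameters `m` and `ef_construction`, and `<=>` is the cosine distance operator.

```python
# Illustrative pgvector DDL; hypothetical table/column names.
CREATE_IVFFLAT = (
    "CREATE INDEX chunks_ivfflat_idx ON chunks "
    "USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);"
)
# HNSW needs no prior training pass over the data, unlike IVFFlat,
# which must first cluster the existing vectors.
CREATE_HNSW = (
    "CREATE INDEX chunks_hnsw_idx ON chunks "
    "USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);"
)
# Top-k nearest-neighbour query using the cosine distance operator `<=>`.
TOP_K_QUERY = (
    "SELECT id, title FROM chunks "
    "ORDER BY embedding <=> %(query_vec)s LIMIT 5;"
)
```

An IVFFlat index built on an empty or small table clusters poorly, which is another practical reason HNSW was preferred here.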
2.7. RAG Workflow Implementation
- A system instruction that sets the model’s role (e.g., “You are a helpful and accurate scientific paper expert. Your task is to concisely answer the user's question using only the provided information.”) and defines its limitations (e.g., “If the information is insufficient to answer the question, simply say: ‘I do not have enough information to answer this question accurately.’”);
- A formatted list of retrieved chunks, each labeled with the original paper title;
- The user’s original question, clearly separated at the end of the prompt.
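Assembling these three components can be sketched as follows. The system instruction is quoted from the text above; the exact label format for the context chunks and the function name are illustrative assumptions.

```python
def assemble_prompt(question, chunks):
    """Concatenate the three prompt components in the order described:
    system instruction, labeled context chunks, user question."""
    system = (
        "You are a helpful and accurate scientific paper expert. "
        "Your task is to concisely answer the user's question using only the "
        "provided information. If the information is insufficient to answer "
        "the question, simply say: 'I do not have enough information to "
        "answer this question accurately.'"
    )
    # Each chunk is labeled with its source paper title (label format is illustrative).
    context = "\n\n".join(f"Paper: {c['title']}\n{c['text']}" for c in chunks)
    return f"{system}\n\n{context}\n\nUser question: {question}"

prompt = assemble_prompt(
    "How does SSL-SoilNet improve SOC prediction?",
    [{"title": "SSL-SoilNet", "text": "Self-supervised pre-training on multimodal inputs."}],
)
```

Placing the question last keeps it clearly separated from the context, so the model does not confuse retrieved text with the user's request.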
- Temperature = 0.3, which controls the randomness of the output. A lower value like 0.3 makes the model more focused and deterministic, helping reduce speculative or made-up answers (hallucinations);
- Max_tokens = 500, which is set not because of display constraints but to curb overly verbose outputs. This prevents the model from generating unnecessarily long or speculative answers to simple questions, maintaining focus and relevance.
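In a chat-completions-style API call, the two parameters appear as shown below. The request shape and field names are a generic sketch rather than the paper's exact client code; the model name follows the Mistral 7B Instruct references cited later.

```python
prompt = "..."  # assembled system instruction, context chunks, and question
# Hypothetical chat-completions-style request body; only the two parameters
# discussed above are pinned, everything else is left at provider defaults.
request = {
    "model": "mistral-7b-instruct",
    "temperature": 0.3,  # low randomness to curb speculative, made-up answers
    "max_tokens": 500,   # caps verbosity, not display capacity
    "messages": [{"role": "user", "content": prompt}],
}
```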
2.8. Evaluation Methodology
3. Results
3.1. Final User Interface
3.2. Qualitative System Output
3.2.1. Semantic Search
- Assessing of Soil Erosion Risk Through Geoinformation Sciences and Remote Sensing -- A Review.
- TreeFormers -- An Exploration of Vision Transformers for Deforestation Driver Classification;
- Deep Learning tools to support deforestation monitoring in the Ivory Coast using SAR and Optical satellite imagery;
- Mapping Africa Settlements: High Resolution Urban and Rural Map by Deep Learning and Satellite Imagery;
- Contrasting local and global modeling with machine learning and satellite data: A case study estimating tree canopy height in African savannas;
- Dargana: fine-tuning EarthPT for dynamic tree canopy mapping from space.
3.2.2. Answer Generation
3.2.3. Citation Generation
3.2.4. Similarity Graph
- Green lines show a very high similarity (above 0.85);
- Orange lines show a moderately high similarity (between 0.75 and 0.85);
- Grey lines represent weaker but still meaningful similarity (between 0.65 and 0.75).
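The color thresholds above amount to a simple mapping from similarity score to edge style; a minimal sketch follows. The exact handling of the boundary values (e.g. whether 0.85 falls in the green or orange band) is an assumption, as is the convention that pairs below 0.65 are not drawn at all.

```python
def edge_color(similarity):
    """Map a cosine similarity score to the edge color used in the
    similarity graph; pairs below 0.65 produce no edge (None)."""
    if similarity > 0.85:
        return "green"   # very high similarity
    if similarity > 0.75:
        return "orange"  # moderately high similarity
    if similarity >= 0.65:
        return "grey"    # weaker but still meaningful similarity
    return None          # not drawn
```

Dropping pairs below 0.65 keeps the graph readable, since near-zero similarities would otherwise connect almost every pair of documents.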
3.3. Retrieval Performance Metrics
4. Discussion
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| RAG | Retrieval-Augmented Generation |
| PDF | Portable Document Format |
| RGB | Retrieval-Augmented Generation Benchmark |
| TF-IDF | Term Frequency-Inverse Document Frequency |
| LSA | Latent Semantic Analysis |
| LLM | Large Language Model |
| RNN | Recurrent Neural Network |
| LSTM | Long Short-Term Memory |
| GPT | Generative Pre-trained Transformer |
| T5 | Text-to-Text Transfer Transformer |
| BART | Bidirectional and Auto-Regressive Transformer |
| APA | American Psychological Association |
| MLA | Modern Language Association |
| ISO 690 | International Organization for Standardization |
| IEEE | Institute of Electrical and Electronics Engineers |
| AMA | American Medical Association |
| ACS | American Chemical Society |
| DOI | Digital Object Identifier |
| CSS | Cascading Style Sheets |
| IVFFlat | Inverted File Flat |
| HNSW | Hierarchical Navigable Small World |
| NLTK | Natural Language Toolkit |
| MB | Megabyte |
| ID | Identification |
| SQL | Structured Query Language |
| MRR | Mean Reciprocal Rank |
| MAP | Mean Average Precision |
| SSL | Self-supervised Learning |
| SOC | Soil Organic Carbon |
| API | Application Programming Interface |
| CPU | Central Processing Unit |
References
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; Riedel, S.; Kiela, D. Retrieval-augmented Generation for Knowledge-intensive NLP Tasks. Advances in Neural Information Processing Systems 2020, 33, 9459–9474. [Google Scholar] [CrossRef]
- Shuster, K.; Poff, S.; Chen, M.; Kiela, D.; Weston, J. Retrieval Augmentation Reduces Hallucination in Conversation. arXiv Prepr. arXiv:2104.07567 2021. [CrossRef]
- Zhang, Y.; Li, Y.; Cui, L.; Cai, D.; Liu, L.; Fu, T.; Huang, X.; Zhao, E.; Chen, Y.; Wang, L.; Luu, A.T.; Bi, W.; Shi, F.; Shi, S. Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv Prepr. arXiv:2309.01219 2023. [CrossRef]
- Izacard, G.; Lewis, P.; Lomeli, M.; Hosseini, L.; Petroni, F.; Schick, T.; Dwivedi-Yu, J.; Joulin, A.; Riedel, S.; Grave, E. Atlas: Few-shot Learning with Retrieval Augmented Language Models. J. Mach. Learn. Res. 2023, 24, 1–43. [Google Scholar] [CrossRef]
- Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H. Retrieval-augmented Generation for Large Language Models: A Survey. arXiv Prepr. arXiv:2312.10997 2023, 2, 1. [Google Scholar] [CrossRef]
- Arslan, M.; Ghanem, H.; Munawar, S.; Cruz, C. A Survey on RAG with LLMs. Procedia Comput. Sci. 2024, 246, 3781–3790. [Google Scholar] [CrossRef]
- Khan, A.A.; Hasan, M.T.; Kemell, K.K.; Rasku, J.; Abrahamsson, P. Developing Retrieval Augmented Generation (RAG) Based LLM Systems from PDFs: An Experience Report. arXiv Prepr. arXiv:2410.15944 2024. [CrossRef]
- Wang, X.; Wang, Z.; Gao, X.; Zhang, F.; Wu, Y.; Xu, Z.; Shi, T.; Wang, Z.; Li, S.; Qian, Q.; Yin, R.; Lv, C.; Zheng, X.; Huang, X. Searching for Best Practices in Retrieval-augmented Generation. arXiv Prepr. arXiv:2407.01219 2024. [CrossRef]
- Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking Large Language Models in Retrieval-augmented Generation. Proc. AAAI Conf. Artif. Intell. 2024, 38, 17754–17762. [Google Scholar] [CrossRef]
- Jiang, Z.; Xu, F.F.; Gao, L.; Sun, Z.; Liu, Q.; Dwivedi-Yu, J.; Yang, Y.; Callan, J.; Neubig, G. Active Retrieval Augmented Generation. Proc. Conf. Empir. Methods Nat. Lang. Process. 2023, 7969–7992. [Google Scholar] [CrossRef]
- Radeva, I.; Popchev, I.; Doukovska, L.; Dimitrova, M. Web Application for Retrieval-augmented Generation: Implementation and Testing. Electronics 2024, 13, 1361. [Google Scholar] [CrossRef]
- Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. ACM Trans. Intell. Syst. Technol. 2023. [CrossRef]
- Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; Ye, W.; Zhang, Y.; Chang, Y.; Yu, P.S.; Yang, Q.; Xie, X. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
- Pedro, R.; Castro, D.; Carreira, P.; Santos, N. From Prompt Injections to SQL Injection Attacks: How Protected is Your LLM-integrated Web Application? arXiv Prepr. arXiv:2308.01990 2023. [CrossRef]
- Kasneci, E.; Sessler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; Krusche, S.; Kutyniok, G.; Michaeli, T.; Nerdel, C.; Pfeffer, J.; Poquet, O.; Sailer, M.; Schmidt, A.; Seidel, T.; Stadler, M.; Weller, J.; Kuhn, J.; Kasneci, G. ChatGPT for Good? On Opportunities and Challenges of Large Language Models for Education. Learn. Individ. Differ. 2023, 103, 102274. [Google Scholar] [CrossRef]
- Chandrasekaran, D.; Mago, V. Evolution of semantic similarity—a survey. ACM Comput. Surv. 2021, 54, 1–37. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar] [CrossRef]
- pgvector. Available online: https://github.com/pgvector/pgvector (accessed on 12 July 2025).
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; Lavaud, L.R.; Lachaux, M.-A.; Stock, P.; Le Scao, T.; Lavril, T.; Wang, T.; Lacroix, T.; El Sayed, W. Mistral 7B. arXiv Prepr. arXiv:2310.06825 2023. [CrossRef]
- Artificial Analysis. Available online: https://artificialanalysis.ai/models/mistral-7b-instruct/providers/prompt-options/single/medium#summary (accessed on 12 July 2025).

| Aspect | Vector search | Similarity graph |
|---|---|---|
| Input | A single query vector | A single query vector or a set of document embeddings |
| Output | Ranked list of similar documents | Graph with nodes (documents) and edges (semantic similarity) |
| Scalability | Optimized for large-scale search | Less scalable; visualization becomes harder with many nodes |
| User interaction | Query response interaction | Exploratory navigation (interactive or static graphs) |
| Primary goal | Find what matches a query | Show how documents relate to each other |
| Query | Paper scope | LLM response |
|---|---|---|
| How can self-supervised learning be used to improve predictions of soil properties? | All papers | Self-supervised learning can be used to improve predictions of soil properties by learning the geographical link between multimodal features via self-supervised contrastive learning, as demonstrated in… The approach employs pretrained Vision Transformers (ViT) for image inputs and Transformers for climate data, before fine-tuning the model with ground reference samples… |
| How can self-supervised learning be used to improve predictions of soil properties? | SSL-SoilNet | Self-supervised learning can be used to improve predictions of soil properties by leveraging a large number of unlabeled samples during the pre-training phase. This approach, as demonstrated in… After the pre-training phase, the self-supervised model is fine-tuned using ground reference samples for the specific task… |
| How does SSL-SoilNet improve SOC prediction over traditional methods? | All papers | I do not have enough information to answer this question accurately. |
| How does SSL-SoilNet improve SOC prediction over traditional methods? | SSL-SoilNet | SSL-SoilNet improves SOC prediction over traditional methods by integrating image-based and climate-based information geographically, which contributes to enhanced prediction accuracy. This integration allows the model to better understand patterns in samples… |
| What are the main differences between coral reef preservation techniques in the Caribbean and the Pacific? | All papers | I do not have enough information to answer this question accurately. |
| Output citation (APA) | Type | Notes |
|---|---|---|
| Kakhani, N., Rangzan, M., Jamali, A., Attarchi, S., Alavipanah, S. K., Mommert, M., Tziolas, N., & Scholten, T. (2023). Ssl-soilnet: a hybrid transformer-based framework with self-supervised learning for large-scale soil organic carbon prediction. arXiv. https://arxiv.org/abs/2308.03586 | arXiv | Good |
| Pliego, M. U., Marín, R. M., Shi, N., Shibayama, T., Leth, U., & Sacristán, M. M. (2025). Transport-related surface detection with machine learning: analyzing temporal trends in madrid and vienna. arXiv. https://arxiv.org/abs/2503.15653 | arXiv | Good |
| Song, Y., She, M., & Köser, K. (2023). Advanced underwater image restoration in complex illumination conditions. ISPRS Journal of Photogrammetry and Remote Sensing, Volume 209, March , Pages 197-212. https://doi.org/10.1016/j.isprsjprs.2024.02.004 | Journal | Journal volume, issue, and page info not in APA format |
| Zaid, M. M. A., Mohammed, A. A., & Sumari, P. (2025). Remote sensing image classification using convolutional neural network (cnn) and transfer learning techniques. J. Comput. Sci, 21(3), 635-645. https://doi.org/10.3844/jcssp.2025.635.645 | Journal | Journal abbreviation should be written in full |
| Sheagren, C. D., Kadota, B. T., Patel, J. H., Chiew, M., & Wright, G. A. (2025). Accelerated cardiac parametric mapping using deep learning-refined subspace models. In International Workshop on Statistical Atlases and Computational Models of the Heart (pp. 369-379). Cham: Springer Nature Switzerland (). | Book chapter | Publisher parentheses should not be empty |
| Greco, G., Cena, C., Albertin, U., Martini, M., & Chiaberge, M. (2025). Fault injection analysis of real nvp normalising flow model for satellite anomaly detection. arXiv. https://arxiv.org/abs/2504.02015 | arXiv | Good |
| Weber, M. & Beneke, C. (2025). Pyvit-fuse: a foundation model for multi-sensor earth observation data. arXiv. https://arxiv.org/abs/2504.18770 | arXiv | Good |
| Method | Recall@3 | MRR@3 | Precision@3 | MAP@3 | Average per-query retrieval speed (ms) |
|---|---|---|---|---|---|
| TF-IDF baseline | 0.8333 | 0.7817 | 0.8039 | 0.9185 | 67.94 |
| Embedding (no index) | 0.8733 | 0.8361 | 0.8657 | 0.9664 | 508.74 |
| Embedding + HNSW | 0.8633 | 0.8239 | 0.8621 | 0.9609 | 243.42 |
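The cut-off metrics reported in the table above can be computed per query as sketched below; the values in the table are averages over the full query set. This is a minimal illustration with hypothetical document identifiers, assuming the standard definitions of Recall@k and MRR@k.

```python
def recall_at_k(relevant, retrieved, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = len(set(relevant) & set(retrieved[:k]))
    return hits / len(relevant)

def mrr_at_k(relevant, retrieved, k):
    """Reciprocal rank of the first relevant document within the top k
    (0.0 if no relevant document is retrieved)."""
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Single-query example: the one relevant paper "A" is ranked second of three.
r = recall_at_k({"A"}, ["B", "A", "C"], 3)  # -> 1.0
m = mrr_at_k({"A"}, ["B", "A", "C"], 3)     # -> 0.5
```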
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).