Submitted: 30 May 2026
Posted: 01 June 2026
Abstract
Keywords:
1. Introduction
2. Literature Review
2.1. Techniques
2.2. Languages
2.3. Nepali Legal Information Retrieval
3. Methodology
3.1. Data Collection
- Documents: A total of 10 federal legal documents, as shown in Table 1, including the Constitution and nine other widely used federal legal acts, were collected from the Law Commission website [24]. Many of the source files were available only as image-based PDF documents. Therefore, Tesseract OCR [25] was used to convert them into machine-readable text. The extracted text contained OCR-induced artifacts, formatting inconsistencies, and typographical errors, which were subsequently cleaned during preprocessing.
- Query Collection: To evaluate the performance of the retrieval models, a total of 50 query–relevance pairs were constructed, with five queries derived from each of the 10 legal documents (Table 1). The queries were collected from social media groups and online legal websites, supplemented with additional queries, and then curated and validated by legal experts to ensure their relevance and clarity. For each query, the corresponding relevant documents were identified to assess the accuracy of different retrieval models.
3.2. Chunking
- Article-level Chunks: Articles, being the most meaningful and self-contained units in legal statutes, were initially used as individual chunks. This approach was effective for retrieval models with large context windows.
- Fixed-size Sub-Article Chunks: For models with smaller context sizes, article-level chunks sometimes exceeded the maximum input length, leading to information truncation. To address this, each article was further divided into sub-chunks of fewer than 512 tokens, creating fixed-size sub-article chunks suitable for models with limited context windows.
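The fixed-size strategy above can be illustrated with a minimal sketch. It assumes whitespace-separated words as a stand-in for the model tokenizer's subword tokens, and it uses no overlap between sub-chunks; the function name `split_into_subchunks` and the `tokenize` parameter are hypothetical, and the paper does not specify the tokenizer or whether overlap was used.

```python
from typing import Callable, List


def split_into_subchunks(
    article: str,
    max_tokens: int = 512,
    tokenize: Callable[[str], List[str]] = str.split,
) -> List[str]:
    """Split one article into consecutive sub-chunks of fewer than max_tokens tokens.

    Assumption: `tokenize` approximates the embedding model's tokenizer.
    Whitespace splitting is only a placeholder; a real subword tokenizer
    usually produces more tokens per word.
    """
    if max_tokens < 2:
        raise ValueError("max_tokens must be at least 2")
    tokens = tokenize(article)
    limit = max_tokens - 1  # "fewer than 512 tokens" means at most 511 per chunk
    return [
        " ".join(tokens[start:start + limit])
        for start in range(0, len(tokens), limit)
    ]
```

An article that already fits within the limit is returned as a single chunk, so in this sketch the article-level and fixed-size strategies coincide for short articles.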
3.3. Retrieval Models Selection
3.4. Retrieval Pipeline
4. Evaluation
- Recall: measures the proportion of relevant documents successfully retrieved by the model out of all relevant documents available. A higher recall indicates that the model is effective in capturing a greater number of relevant items.
- Precision: quantifies the proportion of retrieved documents that are actually relevant. This metric ensures that the model does not return excessive irrelevant results, emphasizing accuracy over quantity.
- MRR: evaluates the rank position of the first relevant document in the retrieved list. This metric is particularly useful in scenarios where presenting a relevant result at the top of the ranking is critical, as it reflects the user experience in practical retrieval settings.
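The three metrics above can be sketched in a few lines. This is an illustrative implementation of the standard definitions, not the authors' evaluation code; the function names and the `runs`/`qrels` dictionaries are hypothetical, and each ranked list is assumed to contain no duplicate chunk identifiers.

```python
from typing import Callable, Dict, Sequence, Set


def _hits(ranked: Sequence[str], relevant: Set[str], k: int) -> int:
    """Number of relevant items among the top-k results (assumes unique ids)."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant)


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return _hits(ranked, relevant, k) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k slots that are filled by relevant items."""
    return _hits(ranked, relevant, k) / k


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant item in the top-k, or 0 if there is none."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(
    metric: Callable[[Sequence[str], Set[str], int], float],
    runs: Dict[str, Sequence[str]],
    qrels: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average a per-query metric over all queries (MRR@K when given reciprocal rank)."""
    return sum(metric(runs[q], qrels[q], k) for q in qrels) / len(qrels)
```

Passing `reciprocal_rank_at_k` to `mean_over_queries` yields MRR@K; passing the other two functions yields the averaged Recall@K and Precision@K.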
5. Results
- a. Multilingual models: Among all evaluated models, BAAI/bge-m3 achieved the best overall performance, with Recall@10 = 0.9233, Precision@1 = 0.7400, and MRR@10 = 0.8300. Jina followed closely: it reached essentially the same Recall@10 (0.923), but its MRR@10 was slightly lower (0.817), indicating that while it retrieves relevant documents effectively, its ranking quality is marginally inferior to that of BGE-m3. Alibaba-GTE and E5-large-instruct achieved moderate performance, trailing the top two models (see Appendix A, Tables A1–A6). Because these models handle large input contexts, chunking into smaller segments did not produce noticeable differences in the evaluation metrics, suggesting that they can process long passages without loss of semantic information.
- b. Monolingual models: Monolingual models, while somewhat effective, generally underperformed compared to their multilingual counterparts. We evaluated three monolingual models, the best-performing of which was BERT from IRIIS Lab, trained on 27.5 GB of Nepali data. Even so, it achieved only Recall@10 = 0.35 and MRR@10 = 0.16, while the other two monolingual models performed significantly worse (see Appendix A, Tables A1–A6). The underperformance of these models can be attributed to the limited size and domain mismatch of their training corpora. Since they were trained primarily on general-domain text scraped from news websites rather than legal documents, they were unable to generalize effectively to the legal retrieval task, highlighting the importance of large, domain-specific corpora for monolingual models. Additionally, the limitations imposed by these models' smaller input context sizes can be partially mitigated by using smaller chunks, as demonstrated by BERT's MRR increasing from 0.16 to 0.18 at K=10.
- c. Trend Analysis: Overall, Recall and MRR increased or plateaued with higher K values across all models. As K increases, the models retrieve more documents, which raises the likelihood of including relevant documents (higher Recall) and lets queries whose first relevant document lies lower in the ranking contribute a non-zero reciprocal rank (higher MRR). In contrast, Precision generally decreased with higher K, as retrieving more documents also increases the proportion of irrelevant ones in the top-K results. The complete results, including all models and chunking strategies, are presented in Appendix A.
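A short formulation may make these trends concrete. It assumes the standard per-query definitions averaged over the query set $Q$; the symbols $R_q$ (relevant set of query $q$), $h_q(K)$ (number of relevant items in the top $K$), and $r_q$ (rank of the first relevant item) are introduced here for illustration, and the paper does not state its exact averaging procedure.

```latex
\begin{align*}
\mathrm{Recall@}K    &= \frac{1}{|Q|}\sum_{q \in Q} \frac{h_q(K)}{|R_q|}, \\
\mathrm{Precision@}K &= \frac{1}{|Q|}\sum_{q \in Q} \frac{h_q(K)}{K}, \\
\mathrm{MRR@}K       &= \frac{1}{|Q|}\sum_{q \in Q} \frac{\mathbb{1}[r_q \le K]}{r_q}.
\end{align*}
% h_q(K) and 1[r_q <= K] never decrease as K grows,
% so Recall@K and MRR@K are non-decreasing in K.
% Since h_q(K) <= |R_q|, each query's precision is at most |R_q| / K
% once K >= |R_q|, so Precision@K shrinks roughly like 1/K.
```

Under these definitions, Recall@K and MRR@K can plateau but never fall as K grows, while Precision@K is bounded above by a term that decays with K, which is consistent with the rows reported in Appendix A.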
6. Conclusions
7. Future Works
Acknowledgments
Appendix A
Appendix A.1. Article Level Chunking
| Model/Recall | @1 | @2 | @3 | @4 | @5 | @6 | @7 | @8 | @9 | @10 |
| e5-large | 0.527 | 0.607 | 0.687 | 0.727 | 0.757 | 0.767 | 0.797 | 0.797 | 0.817 | 0.837 |
| bge | 0.667 | 0.807 | 0.867 | 0.907 | 0.913 | 0.923 | 0.923 | 0.923 | 0.923 | 0.923 |
| gte | 0.547 | 0.657 | 0.697 | 0.767 | 0.797 | 0.817 | 0.837 | 0.837 | 0.857 | 0.857 |
| qwen | 0.297 | 0.507 | 0.547 | 0.637 | 0.677 | 0.697 | 0.727 | 0.747 | 0.757 | 0.757 |
| kalm | 0.31 | 0.487 | 0.567 | 0.587 | 0.597 | 0.637 | 0.637 | 0.657 | 0.677 | 0.697 |
| jina | 0.627 | 0.787 | 0.867 | 0.907 | 0.917 | 0.917 | 0.917 | 0.917 | 0.923 | 0.923 |
| mpnet | 0.42 | 0.51 | 0.55 | 0.62 | 0.63 | 0.63 | 0.67 | 0.71 | 0.74 | 0.74 |
| nepberta | 0.07 | 0.08 | 0.11 | 0.12 | 0.14 | 0.14 | 0.16 | 0.16 | 0.18 | 0.2 |
| bert | 0.08 | 0.11 | 0.13 | 0.17 | 0.21 | 0.27 | 0.31 | 0.31 | 0.35 | 0.35 |
| roberta | 0.05 | 0.09 | 0.13 | 0.15 | 0.15 | 0.15 | 0.17 | 0.17 | 0.17 | 0.17 |
| Model/Precision | @1 | @2 | @3 | @4 | @5 | @6 | @7 | @8 | @9 | @10 |
| e5-large | 0.58 | 0.36 | 0.267 | 0.21 | 0.176 | 0.15 | 0.134 | 0.118 | 0.109 | 0.1 |
| bge | 0.74 | 0.47 | 0.34 | 0.265 | 0.216 | 0.183 | 0.157 | 0.138 | 0.122 | 0.11 |
| gte | 0.6 | 0.38 | 0.267 | 0.22 | 0.184 | 0.157 | 0.14 | 0.122 | 0.111 | 0.1 |
| qwen | 0.34 | 0.29 | 0.213 | 0.185 | 0.156 | 0.133 | 0.12 | 0.108 | 0.098 | 0.088 |
| kalm | 0.34 | 0.28 | 0.22 | 0.17 | 0.14 | 0.123 | 0.106 | 0.095 | 0.087 | 0.08 |
| jina | 0.72 | 0.46 | 0.34 | 0.265 | 0.216 | 0.18 | 0.154 | 0.135 | 0.122 | 0.11 |
| mpnet | 0.48 | 0.31 | 0.22 | 0.185 | 0.152 | 0.127 | 0.114 | 0.105 | 0.098 | 0.088 |
| nepberta | 0.1 | 0.06 | 0.053 | 0.045 | 0.04 | 0.033 | 0.031 | 0.028 | 0.027 | 0.026 |
| bert | 0.1 | 0.07 | 0.053 | 0.05 | 0.052 | 0.053 | 0.051 | 0.045 | 0.044 | 0.04 |
| roberta | 0.06 | 0.05 | 0.047 | 0.04 | 0.032 | 0.027 | 0.026 | 0.022 | 0.02 | 0.018 |
| Model/MRR | @1 | @2 | @3 | @4 | @5 | @6 | @7 | @8 | @9 | @10 |
| e5-large | 0.58 | 0.61 | 0.637 | 0.647 | 0.651 | 0.651 | 0.656 | 0.656 | 0.661 | 0.663 |
| bge | 0.74 | 0.8 | 0.82 | 0.83 | 0.83 | 0.83 | 0.83 | 0.83 | 0.83 | 0.83 |
| gte | 0.6 | 0.65 | 0.663 | 0.678 | 0.682 | 0.686 | 0.691 | 0.691 | 0.694 | 0.694 |
| qwen | 0.34 | 0.43 | 0.45 | 0.47 | 0.478 | 0.481 | 0.487 | 0.49 | 0.492 | 0.492 |
| kalm | 0.34 | 0.44 | 0.46 | 0.465 | 0.465 | 0.472 | 0.472 | 0.474 | 0.476 | 0.478 |
| jina | 0.72 | 0.78 | 0.807 | 0.817 | 0.817 | 0.817 | 0.817 | 0.817 | 0.817 | 0.817 |
| mpnet | 0.48 | 0.5 | 0.513 | 0.533 | 0.537 | 0.537 | 0.543 | 0.548 | 0.552 | 0.552 |
| nepberta | 0.1 | 0.1 | 0.113 | 0.113 | 0.117 | 0.117 | 0.12 | 0.12 | 0.122 | 0.124 |
| bert | 0.1 | 0.12 | 0.127 | 0.137 | 0.145 | 0.155 | 0.16 | 0.16 | 0.165 | 0.165 |
| roberta | 0.06 | 0.08 | 0.093 | 0.098 | 0.098 | 0.098 | 0.101 | 0.101 | 0.101 | 0.101 |
Appendix A.2. Fixed Size Chunking
| Model/Recall | @1 | @2 | @3 | @4 | @5 | @6 | @7 | @8 | @9 | @10 |
| e5-large | 0.504 | 0.665 | 0.695 | 0.715 | 0.715 | 0.745 | 0.775 | 0.781 | 0.821 | 0.841 |
| bge | 0.614 | 0.751 | 0.775 | 0.865 | 0.869 | 0.869 | 0.889 | 0.889 | 0.889 | 0.899 |
| gte | 0.514 | 0.621 | 0.675 | 0.705 | 0.765 | 0.805 | 0.815 | 0.815 | 0.825 | 0.825 |
| qwen | 0.294 | 0.451 | 0.535 | 0.605 | 0.665 | 0.705 | 0.705 | 0.725 | 0.725 | 0.725 |
| kalm | 0.28 | 0.44 | 0.541 | 0.571 | 0.581 | 0.601 | 0.621 | 0.621 | 0.641 | 0.641 |
| jina | 0.584 | 0.738 | 0.815 | 0.875 | 0.881 | 0.901 | 0.901 | 0.901 | 0.901 | 0.905 |
| mpnet | 0.417 | 0.507 | 0.567 | 0.577 | 0.607 | 0.633 | 0.673 | 0.673 | 0.703 | 0.723 |
| nepberta | 0.07 | 0.08 | 0.11 | 0.12 | 0.147 | 0.147 | 0.147 | 0.147 | 0.167 | 0.187 |
| bert | 0.09 | 0.14 | 0.167 | 0.167 | 0.197 | 0.247 | 0.287 | 0.287 | 0.297 | 0.357 |
| roberta | 0.07 | 0.07 | 0.12 | 0.16 | 0.21 | 0.21 | 0.24 | 0.24 | 0.24 | 0.24 |
| Model/Precision | @1 | @2 | @3 | @4 | @5 | @6 | @7 | @8 | @9 | @10 |
| e5-large | 0.56 | 0.4 | 0.28 | 0.22 | 0.176 | 0.153 | 0.137 | 0.122 | 0.116 | 0.106 |
| bge | 0.7 | 0.46 | 0.327 | 0.27 | 0.22 | 0.183 | 0.16 | 0.14 | 0.124 | 0.114 |
| gte | 0.58 | 0.38 | 0.28 | 0.22 | 0.192 | 0.167 | 0.146 | 0.128 | 0.116 | 0.104 |
| qwen | 0.34 | 0.27 | 0.22 | 0.185 | 0.16 | 0.143 | 0.123 | 0.112 | 0.1 | 0.09 |
| kalm | 0.32 | 0.25 | 0.213 | 0.17 | 0.14 | 0.12 | 0.109 | 0.095 | 0.087 | 0.078 |
| jina | 0.68 | 0.46 | 0.34 | 0.27 | 0.22 | 0.19 | 0.163 | 0.142 | 0.127 | 0.116 |
| mpnet | 0.5 | 0.32 | 0.24 | 0.185 | 0.156 | 0.137 | 0.123 | 0.108 | 0.1 | 0.092 |
| nepberta | 0.1 | 0.06 | 0.053 | 0.045 | 0.044 | 0.037 | 0.031 | 0.028 | 0.027 | 0.026 |
| bert | 0.12 | 0.09 | 0.073 | 0.055 | 0.052 | 0.053 | 0.051 | 0.045 | 0.042 | 0.044 |
| roberta | 0.08 | 0.04 | 0.047 | 0.045 | 0.048 | 0.04 | 0.04 | 0.035 | 0.031 | 0.028 |
| Model/MRR | @1 | @2 | @3 | @4 | @5 | @6 | @7 | @8 | @9 | @10 |
| e5-large | 0.56 | 0.63 | 0.643 | 0.648 | 0.648 | 0.652 | 0.657 | 0.657 | 0.664 | 0.666 |
| bge | 0.7 | 0.76 | 0.767 | 0.787 | 0.787 | 0.787 | 0.79 | 0.79 | 0.79 | 0.79 |
| gte | 0.58 | 0.63 | 0.65 | 0.655 | 0.663 | 0.67 | 0.67 | 0.67 | 0.672 | 0.672 |
| qwen | 0.34 | 0.41 | 0.443 | 0.458 | 0.47 | 0.48 | 0.48 | 0.483 | 0.483 | 0.483 |
| kalm | 0.32 | 0.4 | 0.44 | 0.445 | 0.449 | 0.452 | 0.455 | 0.455 | 0.457 | 0.457 |
| jina | 0.68 | 0.74 | 0.767 | 0.782 | 0.782 | 0.782 | 0.782 | 0.782 | 0.782 | 0.782 |
| mpnet | 0.5 | 0.53 | 0.543 | 0.543 | 0.551 | 0.555 | 0.56 | 0.56 | 0.565 | 0.567 |
| nepberta | 0.1 | 0.1 | 0.113 | 0.113 | 0.121 | 0.121 | 0.121 | 0.121 | 0.124 | 0.126 |
| bert | 0.12 | 0.15 | 0.163 | 0.163 | 0.167 | 0.177 | 0.183 | 0.183 | 0.183 | 0.189 |
| roberta | 0.08 | 0.08 | 0.1 | 0.11 | 0.118 | 0.118 | 0.124 | 0.124 | 0.124 | 0.124 |
References
- Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Dubrovnik, Croatia, 2023; pp. 2014–2037.
- Information retrieval. Available online: https://en.wikipedia.org/wiki/Information_retrieval.
- Sanderson, M.; Croft, W. B. The History of Information Retrieval Research. Proc. IEEE 2012, 100, 1444–1451.
- Koroteev, M. V. BERT: A Review of Applications in Natural Language Processing and Understanding. arXiv 2021.
- Chi, Z. XLM-E: Cross-lingual Language Model Pre-training via ELECTRA. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 6170–6182.
- Li, Z. A Classification Retrieval Approach for English Legal Texts. In 2019 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS); IEEE: Changsha, China, Jan. 2019; pp. 220–223.
- Zhang, N.; Pu, Y.-F.; Wang, P. An Ontology-based Approach for Chinese Legal Information Retrieval. In Proceedings of The 5th International Conference on Computer Engineering and Networks - PoS(CENet2015); Sissa Medialab: Shanghai, China, Oct. 2015; p. 076.
- Bal, B. K. Structure of Nepali Grammar. In PAN Localization Working Papers; Madan Puraskar Pustakalaya: Kathmandu, Nepal; pp. 332–396.
- Salton, G.; Fox, E. A.; Wu, H. Extended Boolean information retrieval. Commun. ACM 1983, 26(11), 1022–1036.
- Robertson, S.; Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr. 2009, 3(4), 333–389.
- Landauer, T. K.; Dumais, S. T. Latent semantic analysis. Scholarpedia 2008, 3, 4356.
- Xing, W.; Ghorbani, A. Weighted PageRank algorithm. In Proceedings of the Second Annual Conference on Communication Networks and Services Research; IEEE: Fredericton, NB, Canada, 2004; pp. 305–314.
- Church, K. W. Word2Vec. Nat. Lang. Eng. 2017, 23(1), 155–162.
- Khattab, O.; Zaharia, M. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval; ACM: Virtual Event, China, Jul. 2020; pp. 39–48.
- Voorhees, E. M.; Harman, D. K. The text REtrieval conference (TREC): history and plans for TREC-9. SIGIR Forum 1999, 33, 12–15.
- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Conference on Empirical Methods in Natural Language Processing, 2016. Available online: https://api.semanticscholar.org/CorpusID:11816014.
- Campos, D. F. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv 2016, abs/1611.09268. Available online: https://api.semanticscholar.org/CorpusID:1289517.
- Zhou, M.; et al. CCMusic: An Open and Diverse Database for Chinese Music Information Retrieval Research. Trans. Int. Soc. Music Inf. Retr. 2025, 8, 22–38.
- Lefebvre-Brossard, A.; Gazaille, S.; Desmarais, M. C. Alloprof: a new French question-answer education dataset and its use in an information retrieval case study. arXiv 2023, abs/2302.07738. Available online: https://api.semanticscholar.org/CorpusID:256868592.
- Information Retrieval Through Question Answering on Biomedical Research Papers. Available online: https://www.researchgate.net/profile/Shushanta-Pudasaini/publication/375011546_Question_Answering_on_Biomedical_Research_Papers_using_Transfer_Learning_on_BERT-Base_Models/links/6560bd5db1398a779dad41c0/Question-Answering-on-Biomedical-Research-Papers-using-Transfer-Learning-on-BERT-Base-Models.pdf.
- Ghimire, D. R.; Panday, S. P.; Shakya, A. Information Extraction from a Large Knowledge Graph in the Nepali Language. Natl. Coll. Comput. Stud. Res. J. 2024, 3(1), 33–49.
- Enhanced Retrieval for QA System Tailored for Nepali Legal Documents Focusing on PSC Exams Using GPT-4 and RAG Framework. Available online: https://www.scribd.com/document/844123533/Revised2-Unmasked-ENHANCED-RETRIEVAL-FOR-QA-SYSTEM-TAILORED-FOR-NEPALI-LEGAL-DOCUMENTS-FOCUSING-ON-PSC-EXAMS-USING-GPT-4-AND-RAG-FRAMEWORK-pdf.
- NepKanun: A RAG-Based Nepali Legal Assistant. 2025. Available online: https://openreview.net/forum?id=LuXTBI6GSh.
- Nepali Legal Documents. Available online: https://lawcommission.gov.np/.
- Smith, R. W. An Overview of the Tesseract OCR Engine. Ninth Int. Conf. Doc. Anal. Recognit. (ICDAR 2007) 2007, 2, 629–633.
- Timilsina, S.; Gautam, M.; Bhattarai, B. NepBERTa: Nepali Language Model Trained in a Large Corpus. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers); Association for Computational Linguistics: Online only, 2022; pp. 273–284.
- Thapa, P.; Nyachhyon, J.; Sharma, M.; Bal, B. K. Development of Pre-Trained Transformer-based Models for the Nepali Language. In Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025); Sarveswaran, K., Vaidya, A., Krishna Bal, B., Shams, S., Thapa, S., Eds.; International Committee on Computational Linguistics: Abu Dhabi, UAE, Jan. 2025; pp. 9–16. Available online: https://aclanthology.org/2025.chipsal-1.2/.
- Douze, M.; et al. The Faiss Library. IEEE Trans. Big Data 2025, 1–17.


| Document | No. of Queries |
| Constitution | 5 |
| Ecommerce Act | 5 |
| Education Act | 5 |
| Civil Code | 5 |
| Criminal Code | 5 |
| Consumer Protection Act | 5 |
| Foreign Employment Act | 5 |
| Social Security Act | 5 |
| Citizenship Act | 5 |
| Labor Act | 5 |

| Model | Type |
| 1. BAAI/bge-m3 | Multi |
| 2. jina-embeddings-v3 | Multi |
| 3. intfloat/multilingual-e5-large-instruct | Multi |
| 4. Alibaba-NLP/gte-multilingual-base | Multi |
| 5. Qwen/Qwen3-Embedding-0.6B | Multi |
| 6. KaLM-embedding-multilingual-mini-instruct-v2 | Multi |
| 7. paraphrase-multilingual-mpnet-base-v2 | Multi |
| 8. NepBERTA | Mono |
| 9. IRIISNEPAL/RoBERTA | Mono |
| 10. IRIISNEPAL/BERT | Mono |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).