Submitted: 18 May 2025
Posted: 20 May 2025
Abstract
Keywords:
1. Introduction
- To the best of our knowledge, we are the first to propose a multi-agent data generation framework specifically for retrieval tasks. The framework assigns generation, validation, and optimization to separate agents, so that the generated hard queries stay close to real-world user queries.
- For each generated sample, we introduce a dual-agent comparative verification scheme. One agent performs logical and rule-based validation by executing code, while the other conducts semantic validation through chain-of-thought reasoning, enabling the effective identification of truly hard queries.
- We propose a multi-agent group discussion mechanism, involving agents with expertise across broader domains, to perform final validation and optimization of the hard queries, thereby ensuring the reliability of the generated samples.
- We conduct extensive experiments on different datasets and backbone models. Experimental results show that our proposed method consistently outperforms existing methods.
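The dual-agent comparative verification described in the contributions above can be illustrated with a minimal sketch. It is not the authors' implementation: the token-containment rule, the stopword list, the stubbed CoT agent, and the "verdicts disagree" criterion for a hard candidate are all assumptions made for illustration, and every function name is hypothetical.

```python
import re
from typing import Callable, Set

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "the", "for", "of", "by", "with", "and", "in", "to"}


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def code_agent_verdict(query: str, passage: str) -> int:
    """Assumed rule: return 1 if every query term occurs in the passage, else 0."""
    return int(tokens(query) <= tokens(passage))


def is_hard_candidate(query: str, passage: str,
                      cot_agent: Callable[[str, str], int]) -> bool:
    """Assumed criterion: the rule-based and semantic verdicts disagree."""
    return code_agent_verdict(query, passage) != cot_agent(query, passage)


if __name__ == "__main__":
    passage = ("Distressed Baseball Cap - Mom Life (Black) Vintage style; "
               "Unconstructed style gives off a \"dad hat\" vibe.")
    # Stand-in for an LLM-backed CoT agent that judges both queries as matches.
    always_match = lambda q, p: 1
    for q in ["black baseball cap",
              "unstructured relaxed fit black cap for all seasons"]:
        print(q, "->", is_hard_candidate(q, passage, always_match))
```

Under these assumptions, a query whose terms all appear literally in the passage is not flagged, while a paraphrased query that a semantic judge still accepts is flagged as a hard candidate.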
2. Related Work
3. Background
4. Method
4.1. Stage 1. Contrastive Agents Based Hard Sample Generation
Contrastive Matchness Assessment via Code and CoT Agents
- Step 1. Understand the query and passage. Clarify user intent and identify product features.
- Step 2. Extract Key Terms. Highlight key attributes, synonyms, and functional components.
- Step 3. Apply Reasoning. Relate extracted query terms with passage attributes, accounting for synonyms and logical equivalences.
- Step 4. Generate Conclusion. Decide on match/mismatch based on accumulated evidence.
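The four steps above could be turned into a prompt for a CoT agent roughly as follows. This is a sketch under stated assumptions, not the prompt from Appendix A.3: the wording around the steps, the `build_cot_prompt` and `parse_verdict` helpers, and the convention that the model ends its answer with "match" or "mismatch" are all hypothetical.

```python
from typing import List, Tuple

# Step titles and descriptions taken from the list above.
COT_STEPS: List[Tuple[str, str]] = [
    ("Understand the query and passage",
     "Clarify user intent and identify product features."),
    ("Extract Key Terms",
     "Highlight key attributes, synonyms, and functional components."),
    ("Apply Reasoning",
     "Relate extracted query terms with passage attributes, "
     "accounting for synonyms and logical equivalences."),
    ("Generate Conclusion",
     "Decide on match/mismatch based on accumulated evidence."),
]


def build_cot_prompt(query: str, passage: str) -> str:
    """Assemble a step-by-step prompt (assumed wording) for one query-passage pair."""
    lines = ["Decide whether the passage matches the query. Reason step by step.", ""]
    for i, (title, detail) in enumerate(COT_STEPS, start=1):
        lines.append(f"Step {i}. {title}. {detail}")
    lines += ["", f"Query: {query}", f"Passage: {passage}", "",
              "End with a single line containing 'match' or 'mismatch'."]
    return "\n".join(lines)


def parse_verdict(response: str) -> int:
    """Map the last non-empty line of a response to 1 (match) or 0 (mismatch)."""
    non_empty = [ln for ln in response.splitlines() if ln.strip()]
    if not non_empty:
        raise ValueError("empty response")
    last = non_empty[-1].lower()
    if "mismatch" in last:  # check first, since 'mismatch' contains 'match'
        return 0
    if "match" in last:
        return 1
    raise ValueError(f"no verdict found in: {non_empty[-1]!r}")
```

The prompt string would be sent to whichever LLM backs the CoT agent; `parse_verdict` then converts its free-text answer into the binary label used elsewhere in the pipeline.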
4.2. Stage 2. Multi-Agent Group Discussion for Label Refinement
5. Experiments
5.1. Experiment Setup
5.2. Performance Overview
6. Conclusions and Future Work
Appendix
Appendix A Key Prompts
Appendix A.1 Prompt for QueryGen Agent


Appendix A.2 Prompt for Code Agent



Appendix A.3 Prompt for CoT Agent


Appendix A.4 Prompt for Discussion Agent

Appendix B Example
| Passage | Type | Query | Label |
|---|---|---|---|
| Distressed Baseball Cap - Mom Life (Black) Vintage style; Washed & distressed; Low profile crown. Unconstructed style gives off a "dad hat" vibe. Suitable for wear during summer, spring, winter, and fall. Prime features include adjustable closure, unstructured crown, and all-day relaxation. | Original | black baseball cap | 1 |
| | Easy | vintage unstructured dad hat for moms | 1 |
| | Hard | unstructured relaxed fit black cap for all seasons | 1 |
| KISS Magnetic Lashes, Crowd Pleaser, 1 Pair of Synthetic False Eyelashes With 5 Double Strength Magnets, Wind Resistant, Dermatologist Tested Fake Lashes Last Up To 16 Hours, Reusable Up To 15 Times | Original | magnetic false eyelashes | 1 |
| | Easy | wind resistant magnetic lashes without liner | 1 |
| | Hard | reusable fake eyelashes for winter tested by dermatologist | 1 |
| Queen Sheet Set - Hotel Luxury 1800 Bedding Sheets & Pillowcases - Extra Soft Cooling Bed Sheets - Deep Pocket up to 16 inch Mattress - Wrinkle, Fade, Stain Resistant - 4 Piece (Queen, White) | Original | 6 quart crockpot | 0 |
| | Easy | cool beds for boys | 0 |
| | Hard | soft queen blankets for winter warmth | 0 |
References
- Zheng, X.; Lv, F.; Wang, Z.; Liu, Q.; Zeng, X. Delving into E-Commerce Product Retrieval with Vision-Language Pre-training. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 3385–3389.
- Liu, S.; Chen, C.; Ding, K.; Wang, B.; Xu, K.; Lin, Y. Literature retrieval based on citation context. Scientometrics 2014, 101, 1293–1307.
- Li, S.; Lv, F.; Jin, T.; Lin, G.; Yang, K.; Zeng, X.; Wu, X.M.; Ma, Q. Embedding-based product retrieval in taobao search. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 3181–3189.
- Peng, W.; Li, G.; Jiang, Y.; Wang, Z.; Ou, D.; Zeng, X.; Xu, D.; Xu, T.; Chen, E. Large language model based long-tail query rewriting in taobao search. In Companion Proceedings of the ACM Web Conference 2024, 2024, pp. 20–28.
- Nguyen, D.A.; Mohan, R.K.; Yang, V.; Akash, P.S.; Chang, K.C.C. RL-based Query Rewriting with Distilled LLM for online E-Commerce Systems. arXiv preprint arXiv:2501.18056 2025.
- Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. CoRR 2024, abs/2402.03216, [2402.03216].
- Li, S.; Tang, Y.; Chen, S.; Chen, X. Conan-embedding: General Text Embedding with More and Better Negative Samples. CoRR 2024, abs/2408.15710, [2408.15710].
- Du, Y.; Li, S.; Torralba, A.; Tenenbaum, J.B.; Mordatch, I. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the Forty-first International Conference on Machine Learning, 2023.
- Liu, T.; Wang, X.; Huang, W.; Xu, W.; Zeng, Y.; Jiang, L.; Yang, H.; Li, J. Groupdebate: Enhancing the efficiency of multi-agent debate using group discussion. arXiv preprint arXiv:2409.14051 2024.
- Wu, Y.; Jia, F.; Zhang, S.; Li, H.; Zhu, E.; Wang, Y.; Lee, Y.T.; Peng, R.; Wu, Q.; Wang, C. Mathchat: Converse to tackle challenging math problems with llm agents. arXiv preprint arXiv:2306.01337 2023.
- Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020; Webber, B.; Cohn, T.; He, Y.; Liu, Y., Eds. Association for Computational Linguistics, 2020, pp. 6769–6781.
- Wang, L.; Yang, N.; Huang, X.; Jiao, B.; Yang, L.; Jiang, D.; Majumder, R.; Wei, F. Text Embeddings by Weakly-Supervised Contrastive Pre-training. CoRR 2022, abs/2212.03533, [2212.03533].
- van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. CoRR 2018, abs/1807.03748, [1807.03748].
- Kim, M.; Baek, S. Syntriever: How to Train Your Retriever with Synthetic Data from LLMs. CoRR 2025, abs/2502.03824, [2502.03824].
- Reddy, C.K.; Màrquez, L.; Valero, F.; Rao, N.; Zaragoza, H.; Bandyopadhyay, S.; Biswas, A.; Xing, A.; Subbian, K. Shopping queries dataset: A large-scale ESCI benchmark for improving product search. arXiv preprint arXiv:2206.06588 2022.
- Clement, C.B.; Bierbaum, M.; O’Keeffe, K.P.; Alemi, A.A. On the use of arxiv as a dataset. arXiv preprint arXiv:1905.00075 2019.
- Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276 2024.
- Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. Autogen: Enabling next-gen llm applications via multi-agent conversation. arXiv preprint arXiv:2308.08155 2023.
- Robertson, S.; Zaragoza, H.; et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 2009, 3, 333–389.
- Gao, T.; Yao, X.; Chen, D. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821 2021.





| Backbone | Method | Shopping Queries Recall@10 | Shopping Queries Precision@10 | Shopping Queries NDCG@10 | arXiv Recall@10 | arXiv Precision@10 | arXiv NDCG@10 |
|---|---|---|---|---|---|---|---|
| - | BM25 | 0.4664 | 0.3145 | 0.5115 | 0.3690 | 0.0370 | 0.2768 |
| Conan | Zero Shot | 0.4617 | 0.3156 | 0.5165 | 0.3527 | 0.0342 | 0.2619 |
| | Original Data | 0.4640 | 0.3182 | 0.5187 | 0.3615 | 0.0358 | 0.2743 |
| | In-Batch Negatives | 0.4816 | 0.3364 | 0.5458 | 0.3732 | 0.0371 | 0.2895 |
| | ContrastGen | 0.4961 | 0.3494 | 0.5534 | 0.3859 | 0.0386 | 0.3021 |
| BGE-M3 | Zero Shot | 0.5329 | 0.3566 | 0.5896 | 0.5886 | 0.0590 | 0.3550 |
| | Original Data | 0.5362 | 0.3857 | 0.6073 | 0.6087 | 0.0610 | 0.4346 |
| | In-Batch Negatives | 0.5420 | 0.3909 | 0.6048 | 0.5978 | 0.0599 | 0.3685 |
| | ContrastGen | 0.5565 | 0.4104 | 0.6362 | 0.6179 | 0.0618 | 0.4450 |

| Ratio | Shopping Queries Recall@10 | Shopping Queries Precision@10 | Shopping Queries NDCG@10 | arXiv Recall@10 | arXiv Precision@10 | arXiv NDCG@10 |
|---|---|---|---|---|---|---|
| 0 | 0.5302 | 0.3894 | 0.5937 | 0.6104 | 0.0611 | 0.4321 |
| 10:2 | 0.5242 | 0.3882 | 0.5961 | 0.6237 | 0.0625 | 0.4193 |
| 10:3 | 0.5255 | 0.3870 | 0.5920 | 0.6171 | 0.0618 | 0.4153 |
| 10:4 | 0.5232 | 0.3870 | 0.5906 | 0.6171 | 0.0618 | 0.4192 |
| 10:5 | 0.5272 | 0.3882 | 0.5927 | 0.6179 | 0.0619 | 0.4205 |
| 10:9 | 0.5353 | 0.3966 | 0.5957 | 0.6254 | 0.0626 | 0.4271 |
| 1:1 | 0.5213 | 0.3894 | 0.5911 | 0.6212 | 0.0622 | 0.4335 |
| 1:2 | 0.5265 | 0.3869 | 0.5942 | 0.6112 | 0.0612 | 0.4193 |

| Data Type | Shopping Queries Recall@10 | Shopping Queries Precision@10 | Shopping Queries NDCG@10 | arXiv Recall@10 | arXiv Precision@10 | arXiv NDCG@10 |
|---|---|---|---|---|---|---|
| Easy | 0.5110 | 0.3718 | 0.5854 | 0.6087 | 0.0610 | 0.4289 |
| Hard | 0.5361 | 0.3766 | 0.5981 | 0.6162 | 0.0617 | 0.4385 |
| Mix | 0.5327 | 0.3742 | 0.5973 | 0.6095 | 0.0610 | 0.4352 |

| CoT | Code | Discussion | Shopping Queries Recall@10 | Shopping Queries Precision@10 | Shopping Queries NDCG@10 | arXiv Recall@10 | arXiv Precision@10 | arXiv NDCG@10 |
|---|---|---|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | 0.5268 | 0.3659 | 0.5825 | 0.6154 | 0.0616 | 0.4292 |
| ✗ | ✓ | ✗ | 0.5268 | 0.3662 | 0.5915 | 0.6137 | 0.0615 | 0.4375 |
| ✓ | ✓ | ✓ | 0.5374 | 0.3770 | 0.6060 | 0.6179 | 0.0619 | 0.4450 |
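
The tables above report Recall@10, Precision@10, and NDCG@10. For readers who want to reproduce such numbers, the following is a minimal sketch of the standard definitions with binary relevance. Whether the reported results use binary or graded relevance, and how ties or queries without relevant passages are handled, is not stated here, so those choices are assumptions; the function names are hypothetical.

```python
import math
from typing import Hashable, Sequence, Set


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Fraction of relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0  # assumed convention for queries with no relevant items
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Fraction of the top-k slots occupied by relevant items."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def ndcg_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Binary-gain NDCG: DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]
    relevant = {"a", "c", "x"}
    print(recall_at_k(ranked, relevant, k=3))
    print(precision_at_k(ranked, relevant, k=3))
    print(ndcg_at_k(ranked, relevant, k=3))
```

Per-query scores computed this way would typically be averaged over all test queries to obtain dataset-level figures.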
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).