Submitted: 08 January 2026
Posted: 09 January 2026
Abstract
Keywords:
1. Introduction
- A Diverse, Multi-lingual, and Structured Dataset: We provide a benchmark dataset featuring complex, interdependent questions designed to cover realistic analytical scenarios. The dataset’s industry and linguistic diversity makes it a robust test for Deep Document Research systems/agents in the Finance domain.
- A Novel Evaluation Framework for Financial Reports: We develop a specialized evaluation framework that applies an assessment methodology tailored to each task type, enabling robust evaluation.
2. Competition Timeline
3. Task Definition
3.1. Structure of Question Guideline
3.2. Dataset Task Types
- Extraction: These tasks require direct retrieval of explicitly stated information from the report without transformation or interpretation.
- Calculation: These tasks require performing arithmetic operations on extracted facts to derive new quantitative metrics.
- Summary: These tasks require synthesizing, interpreting, and articulating insights from extracted facts (and calculated metrics) into coherent narratives.
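To make the distinction between the task types concrete, the sketch below derives a Calculation-type metric from two Extraction-type facts. The class name, field names, and figures are hypothetical and chosen only for illustration; they are not drawn from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedFacts:
    """Facts stated explicitly in a report (Extraction tasks). Hypothetical fields."""
    revenue: float
    net_income: float


def net_profit_margin(facts: ExtractedFacts) -> float:
    """Calculation task: derive a new metric (in percent) from extracted facts."""
    if facts.revenue == 0:
        raise ValueError("revenue must be non-zero")
    return 100.0 * facts.net_income / facts.revenue


# Illustrative values only, not taken from any real report.
facts = ExtractedFacts(revenue=1_200.0, net_income=150.0)
margin = net_profit_margin(facts)
print(f"Net profit margin: {margin:.1f}%")
```

A Summary task would then take facts and metrics such as these and turn them into a narrative, which is why it cannot be checked by a simple value comparison.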
4. Evaluation
- Accuracy: This protocol provides deterministic evaluation for questions with unambiguous, factual answers. We employ an advanced LLM to evaluate the correctness by comparing the predicted answer to the ground truth, assigning a score of 1 for correct matches and 0 otherwise. This method is applied to all grading items in the Extraction and Calculation categories.
- Claim-based Score: To accommodate responses with multiple factual elements, we employ a claim-based scoring method. First, an advanced LLM identifies three to five critical reference claims from the ground truth, with the number determined by the length and complexity of the reference answer. The LLM then evaluates whether each claim is substantively addressed in the predicted answer [6]. This method is applied to the majority of the grading items in the Summary category.
- Criterion-based Score: For grading items demanding nuanced reasoning, qualitative judgment, and depth of analysis, we implement a criterion-based evaluation approach [7] that emulates expert human assessment. First, an advanced LLM is prompted to adopt the role of a financial expert and generate a detailed 10-point scoring criterion based on the ground truth. This criterion deconstructs the ideal answer into its core analytical components. The LLM then evaluates the predicted answer against the criterion and outputs a score for each criterion. This method is applied to some of the Summary grading items.
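The three protocols can be sketched as small scoring functions. This is a minimal sketch, not the organizers' implementation: the LLM judgments are passed in as plain inputs, and the aggregation rules for the claim-based and criterion-based scores (fraction of claims addressed, and awarded points over the 10-point total) are assumptions, since the text above does not specify how per-claim and per-criterion verdicts are combined. All function names are hypothetical.

```python
from typing import Sequence


def accuracy_score(is_correct: bool) -> int:
    """Accuracy protocol: 1 for a correct match, 0 otherwise."""
    return 1 if is_correct else 0


def claim_based_score(claims_addressed: Sequence[bool]) -> float:
    """Assumed aggregation: fraction of reference claims that are addressed.

    The text states that three to five claims are checked per item;
    combining the verdicts as a mean is an assumption of this sketch.
    """
    if not 3 <= len(claims_addressed) <= 5:
        raise ValueError("expected three to five reference claims")
    return sum(claims_addressed) / len(claims_addressed)


def criterion_based_score(points_awarded: Sequence[float],
                          max_points: float = 10.0) -> float:
    """Assumed aggregation: total awarded points over the 10-point criterion."""
    total = sum(points_awarded)
    if not 0 <= total <= max_points:
        raise ValueError("awarded points must lie between 0 and max_points")
    return total / max_points


# Verdicts below stand in for LLM judge outputs; they are illustrative only.
print(accuracy_score(True))
print(claim_based_score([True, True, False, True]))
print(criterion_based_score([3.0, 2.5, 2.0]))
```

Normalizing the two graded protocols to the range 0 to 1 keeps them comparable with the binary accuracy score; how each is weighted into the marks of a grading item is not covered here.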
5. Dataset Description
5.1. Dataset Construction
- Step 1: Public Company Selection. We selected publicly listed companies from eight major financial markets: the United States (US), United Kingdom (UK), China (CN), Hong Kong (HK), Australia (AU), Singapore (SG), Malaysia (MY), and Indonesia (ID). This geographical diversity enables coverage of four distinct languages: English (EN), Simplified Chinese (zh-CN), Traditional Chinese (zh-HK), and Bahasa Indonesia (BI). The final dataset consists of 104 companies, with industry representation spanning 10 distinct sectors according to the Bloomberg Industry Classification Standard (BICS).
- Step 2: Document Preparation. Unlike FinDeepResearch, the FinDDR dataset consists exclusively of annual reports. We applied the following selection criteria:
- For each company, we selected two reports, covering the financial years 2023 and 2024, to maintain relevance and information diversity.
- For the US market, we used Form 10-K filings instead of generic annual reports, as the former comply with regulatory requirements and present a more balanced, objective view focused on material facts.
- For markets with multilingual reports (China, Hong Kong, and Indonesia), we selected the predominant local language version: Simplified Chinese for China, Traditional Chinese for Hong Kong, and Indonesian for Indonesia.
- Step 3: Reference Report Generation. We generate a reference report for each company using the two annual reports. During the generation phase, the system systematically processes each company’s documentation through the hierarchical analytical framework, extracting relevant information segments and synthesizing comprehensive responses for each of the six primary report sections. This generation process produces initial draft reports that capture the breadth and depth of information contained within the source annual reports, serving as the foundation for subsequent human expert refinement.
- Step 4: Two-Tier Expert Verification Framework. The final validation phase implements a dual-stage quality assurance protocol. The first round conducts section-based verification, where domain experts evaluate individual report sections for factual accuracy, analytical depth, and adherence to professional financial analysis standards. The second round performs cross-section review, examining the coherence, consistency, and comprehensive integration across all report sections. This verification process culminates in the production of finalized ground truth reports that serve as reference standards for participant evaluation, ensuring that the benchmark maintains the analytical rigor expected in financial research environments.
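The document selection rules in Steps 1 and 2 can be summarized as a small configuration. This is a sketch under stated assumptions: the languages for China, Hong Kong, and Indonesia follow Step 2, while English for the remaining five markets is inferred from the four-language total and is not stated explicitly. `ReportSpec` and `report_specs` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple

# Report language per market. CN, HK and ID follow Step 2; the other
# markets are assumed to be English (inferred, not stated in the text).
REPORT_LANGUAGE = {
    "US": "EN", "UK": "EN", "AU": "EN", "SG": "EN", "MY": "EN",
    "CN": "zh-CN", "HK": "zh-HK", "ID": "BI",
}
FINANCIAL_YEARS = (2023, 2024)


class ReportSpec(NamedTuple):
    market: str
    financial_year: int
    document_type: str
    language: str


def report_specs(market: str) -> list[ReportSpec]:
    """Return the two source documents expected for a company in `market`."""
    language = REPORT_LANGUAGE[market]  # raises KeyError for unknown markets
    # US companies use Form 10-K filings instead of generic annual reports.
    document_type = "Form 10-K" if market == "US" else "Annual report"
    return [ReportSpec(market, year, document_type, language)
            for year in FINANCIAL_YEARS]


for spec in report_specs("US") + report_specs("HK"):
    print(spec)
```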
5.2. Dataset Statistics
6. Competition Details
6.1. Participant Teams
- 13 Teams: SilverSight, Finsselaer, Token Refund, Financial Wizard, afinit, e0nia, SI4Fin, ICT-NDST, DeepSeek Your Report, LedgerLens, FinSight, DataLovers, and RUCFinAI.
- 16 Organizations: Fudan University, Shanghai Innovation Institute, DataGrand Inc, Rensselaer Polytechnic Institute, Microsoft Research Asia, Experian, afinit, Individual, A*STAR, Chinese Academy of Sciences, Shanghai University of International Business and Economics, The University of Technology Sydney, Renmin University of China, Rajiv Gandhi Institute of Petroleum Technology, Galgotias University, and Wells Fargo.
- 7 Countries: Singapore, China, US, India, South Korea, Australia, and Malaysia.
6.2. Competition Result
6.2.1. Main Result
6.2.2. Region Performance Result
6.2.3. Section Performance Result
6.2.4. Task Type Performance Result
7. Winning Teams’ Methods
7.1. SilverSight
7.2. Finsselaer
7.3. Token Refund
8. Conclusion
9. Organization Team
- Project Leader:
- Fengbin Zhu, National University of Singapore
- Chao Wang, 6Estates Pte Ltd
- Tianhui Tan, Asian Institute of Digital Finance
- Dataset Construction and Evaluation:
- Xiang Yao Ng, 6Estates Pte Ltd
- Ziyang Liu, 6Estates Pte Ltd
- Huanchang Zhuo, 6Estates Pte Ltd
- Min Xu, 6Estates Pte Ltd
- Stanley Marcelino, 6Estates Pte Ltd
- Jing Wang, 6Estates Pte Ltd
- Junfeng Li, National University of Singapore
- Chang Liu, Asian Institute of Digital Finance
- Xuan Yao, Asian Institute of Digital Finance
- Hao Zhuang, Asian Institute of Digital Finance
- Ruiqi Zheng, Asian Institute of Digital Finance
- Zixuan Wang, 6Estates Pte Ltd
- Xiaohan Ai, 6Estates Pte Ltd
- Lan Huang, 6Estates Pte Ltd
- Xin Lin, 6Estates Pte Ltd
- Advisor:
- Ke-Wei Huang, Asian Institute of Digital Finance
- Shuo Zhang, Bloomberg
- Fuli Feng, University of Science and Technology of China
- Huanbo Luan, 6Estates Pte Ltd
- Tat-Seng Chua, National University of Singapore
Appendix A. Expected Report Structure
- Company Overview: This section provides a concise overview of the company, including its basic information, industry background, key strengths, and strategic direction.
- Financial Performance: This section presents a detailed analysis of the company’s financial health, including key financial statements and performance metrics, to assess profitability, liquidity, and solvency.
- Business Analysis: This section provides a summary of a company’s business performance and strategies, offering readers a comprehensive understanding of the company’s business operations, competitive strengths, innovation efforts, and strategies.
- Risk Factors: This section identifies and discusses the principal risks the company faces, including market, financial, operational, and regulatory risks, along with the strategies in place to manage them.
- Corporate Governance: This section outlines the company’s governance framework, including the board of directors, executive leadership, governance policies, and practices, ensuring transparency and accountability.
- Future Outlook: This section provides management’s projections and strategic plans for the future, including anticipated market trends, growth opportunities, and the company’s road map for achieving its objectives.



Footnote 1: The latest name is FinDocResearch on the OpenFinArena Platform.
References
1. Edge, D.; Trinh, H.; Cheng, N.; Bradley, J.; Chao, A.; Mody, A.; Truitt, S.; Metropolitansky, D.; Ness, R.O.; Larson, J. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130 2024.
2. Guo, Z.; Xia, L.; Yu, Y.; Ao, T.; Huang, C. Lightrag: Simple and fast retrieval-augmented generation. arXiv preprint arXiv:2410.05779 2024.
3. OpenAI. Introducing deep research. https://openai.com/index/introducing-deep-research/. Accessed: 2025-07-18.
4. Gemini Google. Deep Research is now available on Gemini 2.5 Pro Experimental. https://blog.google/products/gemini/deep-research-gemini-2-5-pro-experimental/. Accessed: 2025-07-18.
5. Zhu, F.; Ng, X.Y.; Liu, Z.; Liu, C.; Zeng, X.; Wang, C.; Tan, T.; Yao, X.; Shao, P.; Xu, M.; et al. FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis. arXiv preprint arXiv:2510.13936 2025.
6. Ip, J.; Vongthongsri, K. deepeval, version 3.6.2 (Apache-2.0). August 2025. https://github.com/confident-ai/deepeval.
7. Zhang, X.; Li, C.; Zong, Y.; Ying, Z.; He, L.; Qiu, X. Evaluating the performance of large language models on gaokao benchmark. arXiv preprint arXiv:2305.12474 2023.
8. DeepSeek-AI. DeepSeek-V3.2-Exp: Boosting Long-Context Efficiency with DeepSeek Sparse Attention, 2025.
9. Team, O. Introducing GPT-5. https://openai.com/index/introducing-gpt-5/, 2025. Accessed: 2025-10-07.
10. Team, O. Introducing gpt-oss. https://openai.com/index/introducing-gpt-oss/, 2025. Accessed: 2025-11-07.
11. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388 2025.
12. Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216 2024.
13. Zhang, Y.; Li, M.; Long, D.; Zhang, X.; Lin, H.; Yang, B.; Xie, P.; Yang, A.; Liu, D.; Lin, J.; et al. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176 2025.
14. Team, M.A. Mistral OCR. https://mistral.ai/news/mistral-ocr, 2025. Accessed: 2025-03-06.
15. Livathinos, N.; Auer, C.; Lysak, M.; Nassar, A.; Dolfi, M.; Vagenas, P.; Ramis, C.B.; Omenetti, M.; Dinkla, K.; Kim, Y.; et al. Docling: An efficient open-source toolkit for ai-driven document conversion. arXiv preprint arXiv:2501.17887 2025.
16. Wang, J.; Fu, J.; Song, L.; Bian, J. From Complex to Atomic: Enhancing Augmented Generation via Knowledge-Aware Dual Rewriting and Reasoning.

| Statistic | Number |
|---|---|
| Basic Information | |
| Number of Languages | 4 |
| Number of Financial Markets | 8 |
| Number of Industries | 10 |
| Number of Selected Companies | 104 |
| Analytical Structure | |
| Number of Major Sections | 6 |
| Number of Subsections | 17 |
| Grading Items | |
| Number of Grading Items per Report | 183 |
| Full Marks for each Report | 240 |
| Total Number of Grading Items | 19,032 |

| Market | Sample | Validation | Test |
|---|---|---|---|
| US | 1 | 6 | 7 |
| UK | 1 | 6 | 6 |
| China | 1 | 6 | 5 |
| Hong Kong | 1 | 6 | 6 |
| Singapore | 1 | 6 | 6 |
| Australia | 1 | 6 | 5 |
| Indonesia | 1 | 6 | 6 |
| Malaysia | 1 | 6 | 7 |
| Total | 8 | 48 | 48 |
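
The figures in the statistics and split tables above are mutually consistent, which the short check below makes explicit. All numbers are copied from those tables; the variable names are introduced here for illustration.

```python
# Figures copied from the statistics table and the data split table.
NUM_COMPANIES = 104
ITEMS_PER_REPORT = 183
SPLIT_TOTALS = {"Sample": 8, "Validation": 48, "Test": 48}
TEST_BY_MARKET = {
    "US": 7, "UK": 6, "China": 5, "Hong Kong": 6,
    "Singapore": 6, "Australia": 5, "Indonesia": 6, "Malaysia": 7,
}

# One reference report per company, each with the same number of grading items.
total_items = NUM_COMPANIES * ITEMS_PER_REPORT

# The three splits partition the 104 companies.
companies_in_splits = sum(SPLIT_TOTALS.values())

# The per-market test counts add up to the test split total.
test_companies = sum(TEST_BY_MARKET.values())

print(total_items, companies_in_splits, test_companies)
```
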

| Industry | US | UK | CN | HK | SG | AU | ID | MY |
|---|---|---|---|---|---|---|---|---|
| Communication | 0 | 2 | 3 | 2 | 0 | 3 | 2 | 2 |
| Consumer Discretionary | 3 | 0 | 3 | 0 | 0 | 0 | 0 | 0 |
| Consumer Staples | 0 | 2 | 4 | 0 | 4 | 2 | 3 | 4 |
| Energy | 2 | 4 | 0 | 4 | 0 | 0 | 3 | 0 |
| Health Care | 3 | 0 | 0 | 0 | 4 | 3 | 0 | 0 |
| Industrials | 0 | 4 | 1 | 4 | 3 | 0 | 0 | 4 |
| Materials | 2 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
| Real Estate | 1 | 0 | 0 | 3 | 2 | 0 | 4 | 1 |
| Technology | 3 | 1 | 0 | 0 | 0 | 1 | 1 | 1 |
| Utilities | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 |

| Rank | Team | Model | Organization | Score |
|---|---|---|---|---|
| 1 | SilverSight | SilverSight Agent | Fudan Univ., Shanghai Innov. Inst., DataGrand | 197.66 |
| 2 | Finsselaer | FinFiler Agent | Rensselaer Polytechnic Institute | 184.50 |
| 3 | Token Refund | PIKE-Report | Microsoft Research Asia | 173.31 |
| 4 | Financial Wizard | Experian FinAgent | Experian | 171.01 |
| 5 | afinit | afinit fin report agent v2 | afinit | 158.40 |
| 6 | e0nia | aiar | Individual | 156.28 |
| 7 | Baseline | DeepSeek-v3.2 | Official | 156.20 |
| 8 | Baseline | GPT-5-MINI | Official | 150.72 |
| 9 | Baseline | GPT-5-NANO | Official | 149.10 |
| 10 | SI4Fin | GeminiFlashRAG | A*STAR | 140.81 |
| 11 | ICT-NDST | ICTDR | Chinese Academy of Sciences | 127.23 |
| 12 | DeepSeek Your Report | FinCMini Agent | Shanghai Univ. Intl. Business | 121.92 |
| 13 | Baseline | GPT-OSS-20B | Official | 113.41 |
| 14 | LedgerLens | AEGIS | Univ. Technology Sydney | 76.68 |
| 15 | FinSight | CAVM Agent | Renmin University of China | 71.94 |
| 16 | DataLovers | FinMAHRAG3 | Rajiv Gandhi Inst. et al. | 58.88 |
| 17 | RUCFinAI | DeepFin Agent | Renmin University of China | 51.29 |
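
To put the leaderboard scores on a common scale, they can be expressed as a share of the full marks. The sketch below assumes that the Score column is measured against the 240 full marks per report listed in the dataset statistics; the text does not state this explicitly. The three scores are copied from the table above.

```python
FULL_MARKS = 240  # full marks per report, from the dataset statistics table

# Top three scores copied from the main result table.
scores = {
    "SilverSight": 197.66,
    "Finsselaer": 184.50,
    "Token Refund": 173.31,
}

# Share of full marks, in percent (assumes Score is out of FULL_MARKS).
percent_of_full_marks = {
    team: 100.0 * score / FULL_MARKS for team, score in scores.items()
}

for team, pct in percent_of_full_marks.items():
    print(f"{team}: {pct:.2f}% of full marks")
```
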

| Rank | Team | US | UK | China | HK | Singapore | Australia | Indonesia | Malaysia |
|---|---|---|---|---|---|---|---|---|---|
| 1 | SilverSight | 199.21 | 204.94 | 207.11 | 188.31 | 206.28 | 192.29 | 188.95 | 195.03 |
| 2 | Finsselaer | 179.09 | 187.08 | 188.08 | 176.86 | 187.14 | 189.54 | 174.02 | 194.80 |
| 3 | Token Refund | 163.99 | 180.77 | 167.79 | 163.20 | 181.92 | 175.50 | 171.76 | 181.22 |
| 4 | Financial Wizard | 175.11 | 178.45 | 171.71 | 162.69 | 174.57 | 163.79 | 159.25 | 179.34 |
| 5 | afinit | 161.46 | 174.42 | 154.68 | 142.89 | 169.16 | 154.22 | 142.98 | 164.53 |
| 6 | e0nia | 149.12 | 160.05 | 149.96 | 152.86 | 156.58 | 168.54 | 143.95 | 169.24 |
| 7 | Baseline-DeepSeek-v3.2 | 156.74 | 162.88 | 139.55 | 140.00 | 152.59 | 169.61 | 153.28 | 171.71 |
| 8 | Baseline-GPT-5-MINI with File Search | 152.43 | 160.27 | 132.22 | 130.61 | 162.53 | 151.24 | 147.30 | 163.72 |
| 9 | Baseline-GPT-5-NANO | 142.92 | 159.74 | 141.23 | 134.11 | 162.12 | 174.96 | 121.03 | 159.10 |
| 10 | SI4Fin | 148.61 | 140.29 | 154.85 | 132.29 | 131.95 | 148.87 | 122.83 | 148.01 |
| 11 | ICT-NDST | 130.28 | 148.60 | 94.86 | 129.89 | 128.47 | 130.00 | 108.53 | 139.68 |
| 12 | DeepSeek Your Report | 153.07 | 82.47 | 126.35 | 106.98 | 139.93 | 131.30 | 72.70 | 154.26 |
| 13 | Baseline-GPT-OSS-20B | 116.98 | 101.28 | 92.39 | 109.06 | 108.94 | 119.57 | 115.50 | 136.61 |
| 14 | LedgerLens | 78.50 | 75.32 | 83.78 | 80.99 | 80.66 | 78.81 | 44.12 | 90.21 |
| 15 | FinSight | 71.45 | 69.89 | 55.22 | 74.73 | 65.93 | 85.37 | 83.55 | 69.35 |
| 16 | DataLovers | 59.54 | 63.95 | 47.72 | 49.15 | 64.07 | 70.68 | 43.98 | 70.11 |
| 17 | RUCFinAI | 58.15 | 47.22 | 40.09 | 51.66 | 50.26 | 63.45 | 35.35 | 61.47 |

| Rank | Team | | | | | | |
|---|---|---|---|---|---|---|---|
| 1 | SilverSight | 80.77 | 83.71 | 69.44 | 87.38 | 86.39 | 87.50 |
| 2 | Finsselaer | 60.32 | 81.43 | 67.47 | 77.69 | 75.91 | 82.64 |
| 3 | Token Refund | 58.05 | 81.23 | 61.32 | 64.94 | 63.87 | 69.89 |
| 4 | Financial Wizard | 59.68 | 79.59 | 61.97 | 55.50 | 65.39 | 70.61 |
| 5 | afinit | 44.14 | 77.40 | 42.24 | 60.75 | 62.22 | 70.46 |
| 6 | e0nia | 59.14 | 73.93 | 44.74 | 54.19 | 58.43 | 69.50 |
| 7 | Baseline-DeepSeek-v3.2 | 57.82 | 77.88 | 40.38 | 47.88 | 51.52 | 68.25 |
| 8 | Baseline-GPT-5-MINI with File Search | 43.95 | 81.03 | 38.18 | 38.56 | 43.65 | 60.86 |
| 9 | Baseline-GPT-5-NANO | 55.50 | 74.70 | 38.21 | 45.88 | 48.70 | 64.18 |
| 10 | SI4Fin | 58.86 | 59.97 | 51.18 | 49.38 | 58.09 | 68.00 |
| 11 | ICT-NDST | 43.64 | 62.77 | 33.35 | 39.00 | 44.78 | 58.25 |
| 12 | DeepSeek Your Report | 42.73 | 58.68 | 34.18 | 41.88 | 42.57 | 56.29 |
| 13 | Baseline-GPT-OSS-20B | 49.68 | 56.88 | 37.32 | 41.94 | 40.52 | 60.29 |
| 14 | LedgerLens | 25.09 | 42.27 | 15.15 | 28.13 | 28.09 | 45.71 |
| 15 | FinSight | 33.45 | 21.04 | 27.38 | 43.06 | 42.13 | 50.25 |
| 16 | DataLovers | 29.41 | 17.26 | 18.71 | 27.38 | 32.61 | 49.93 |
| 17 | RUCFinAI | 27.95 | 21.12 | 5.59 | 19.75 | 18.57 | 39.64 |

| Rank | Team | Extraction | Calculation | Summary |
|---|---|---|---|---|
| 1 | SilverSight | 87.52 | 74.25 | 80.84 |
| 2 | Finsselaer | 83.59 | 77.03 | 71.53 |
| 3 | Token Refund | 83.14 | 78.00 | 61.75 |
| 4 | Financial Wizard | 81.58 | 75.97 | 61.61 |
| 5 | afinit | 81.86 | 67.19 | 53.11 |
| 6 | e0nia | 79.10 | 63.19 | 54.68 |
| 7 | Baseline-DeepSeek-v3.2 | 83.06 | 67.53 | 50.12 |
| 8 | Baseline-GPT-5-MINI with File Search | 83.36 | 73.08 | 43.32 |
| 9 | Baseline-GPT-5-NANO | 77.71 | 68.28 | 47.88 |
| 10 | SI4Fin | 66.09 | 50.53 | 55.39 |
| 11 | ICT-NDST | 67.97 | 51.61 | 41.65 |
| 12 | DeepSeek Your Report | 60.49 | 57.92 | 40.89 |
| 13 | Baseline-GPT-OSS-20B | 52.36 | 44.50 | 44.09 |
| 14 | LedgerLens | 38.91 | 32.86 | 26.18 |
| 15 | FinSight | 23.80 | 22.75 | 37.13 |
| 16 | DataLovers | 21.62 | 17.36 | 29.11 |
| 17 | RUCFinAI | 24.08 | 19.22 | 19.92 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).







