Submitted: 08 May 2026
Posted: 09 May 2026
Abstract
Keywords:
1. Introduction
- Controlled isolation of fusion effects. We design a controlled protocol that holds the adapted backbone, specialist-agent interface, hard-stop rules, and data split fixed while varying only the fusion mechanism. This isolates the contribution of the decision layer itself, independent of model scale or prompt engineering, and reveals that gains are concentrated on approval-boundary cases.
- Case-specific threshold adaptation as a fusion principle. We show that the key mechanism is not better scoring but case-specific threshold selection: GRPO learns to lower the approval barrier when specialist signals converge on safety and raise it when specialist disagreement signals ambiguity. This accounts for of the BEC improvement over fixed-threshold baselines.
- Empirical characterization of fusion sensitivity. Through ablations, error analysis, and temporal diagnostics, we establish that fusion strategy choice matters disproportionately on boundary cases. This asymmetry motivates cost-sensitive fusion as a first-class design target for credit pipelines.
2. Related Work
Machine Learning and Multi-Agent Financial Reasoning
LLM Reasoning and Decision Fusion
3. Methodology
Algorithm 1: CreditAgent Architecture
3.1. Three-Layer Credit Analysis System
Layer 1: Evidence Filtering
Layer 2: Specialist Risk Analysis
- : a scalar risk score for one dimension (e.g., solvency). Larger values indicate lower risk.
- : structured evidence record as a fixed-schema list of cited input factors and textual rationale. We treat these as operational artifacts; explanation faithfulness is evaluated by experts.
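The two-part specialist output described above can be sketched as a small validated data structure. This is a minimal illustration, not the paper's actual schema: the class names, the field names (`risk_score`, `cited_factors`, `rationale`), and the rule that at least one factor must be cited are all assumptions introduced here.

```python
from dataclasses import dataclass
import json
import math


@dataclass(frozen=True)
class EvidenceRecord:
    """Fixed-schema evidence: cited input factors plus a textual rationale."""
    cited_factors: tuple[str, ...]
    rationale: str


@dataclass(frozen=True)
class SpecialistOutput:
    """Hypothetical container for one Layer 2 agent's output."""
    agent: str          # e.g. "Solvency"
    risk_score: float   # larger values indicate lower risk
    evidence: EvidenceRecord

    def __post_init__(self):
        if not math.isfinite(self.risk_score):
            raise ValueError("risk_score must be a finite number")
        # Assumption: an evidence record without any cited factor is invalid.
        if not self.evidence.cited_factors:
            raise ValueError("evidence must cite at least one input factor")

    def to_json(self) -> str:
        """Serialize to a JSON string with a stable key order."""
        return json.dumps({
            "agent": self.agent,
            "risk_score": self.risk_score,
            "evidence": {
                "cited_factors": list(self.evidence.cited_factors),
                "rationale": self.evidence.rationale,
            },
        }, sort_keys=True)


def parse_output(raw: str) -> SpecialistOutput:
    """Parse and validate a JSON string emitted by an agent."""
    data = json.loads(raw)
    ev = data["evidence"]
    return SpecialistOutput(
        agent=data["agent"],
        risk_score=float(data["risk_score"]),
        evidence=EvidenceRecord(tuple(ev["cited_factors"]), ev["rationale"]),
    )
```

Validating at construction time means a malformed agent response fails before it reaches the fusion layer, which is one plausible reason to fix the schema in advance.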
Layer 3: Decision Fusion
3.2. Dataset
3.3. Risk-Oriented Agent Optimization
3.4. Decision Setting and Business Efficiency Coefficient
4. Experiments
| Agent | Core Input | Focus | Risk Metric |
|---|---|---|---|
| Macro | PESTEL | Systematic | Market sensitivity |
| Basic | KYB/KYC | Identity | Fraud suspicion |
| History | Credit bureau | Velocity | Default risk |
| Solvency | Cash flow | Assets/debt | Liquidity buffer |
| Willingness | Behavioral | Intent | Sincerity |
| Fraud | Anomaly | Adversarial | Synthetic patterns |
Evaluation Metrics
4.1. Controlled Comparison Protocol
4.2. Main Results and Discussion
Statistical Reliability
Contribution of the Fusion Layer
4.3. Ablation Study
Blackboard Effect on Agent Consistency
4.4. Sensitivity Analysis
4.5. Exploratory Fairness Discussion
5. Conclusion
Appendix A. Analytical Properties of the Decision Rule
| Objective | Formal Statement | Reference |
|---|---|---|
| Variance Reduction | | Theorem A1 |
| Absolute Safety | deterministically | Theorem A2 |
| Consensus Gradient | when profitable | Theorem A3 |
Appendix A.1. Preliminaries and Notation
- : True repayment outcome, where denotes default
- : Raw input feature set (potentially dimensions)
- : Refined evidence tensor from Layer 1 (Eq. 1)
- : Set of Layer 2 specialist agents (Macro, Basic, History, Solvency, Willingness, Fraud)
- : Output tuple of agent , where is the Risk Magnitude and is the Evidence Chain
- Parallel Architecture (): Agents operate in isolation. Each agent computes its output independently, and no information is exchanged between agents.
- Hierarchical Architecture (): Agents operate via a shared state. Each agent computes its output conditioned on the global context, that is, the state of the Blackboard at the time of invocation.
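The difference between the two architectures can be illustrated with a toy sketch. Everything here is hypothetical: the function names, the sequential invocation order, the dictionary used as a Blackboard, and the two toy agents are assumptions made for illustration and are not taken from the paper.

```python
from typing import Callable

# An agent maps (evidence, blackboard snapshot) to a score. Hypothetical type.
Agent = Callable[[dict, dict], float]


def run_parallel(agents: dict[str, Agent], evidence: dict) -> dict[str, float]:
    """Each agent sees only the evidence; nothing is shared between agents."""
    return {name: agent(evidence, {}) for name, agent in agents.items()}


def run_hierarchical(agents: dict[str, Agent], evidence: dict) -> dict[str, float]:
    """Agents run in order and read a shared blackboard of earlier outputs."""
    blackboard: dict[str, float] = {}
    for name, agent in agents.items():
        # Pass a copy so an agent can read but not mutate the shared state.
        blackboard[name] = agent(evidence, dict(blackboard))
    return blackboard


def solvency(evidence: dict, board: dict) -> float:
    """Toy agent that ignores the blackboard."""
    return evidence["cash_flow_score"]


def fraud(evidence: dict, board: dict) -> float:
    """Toy agent: halve the score when the posted solvency score is weak."""
    base = evidence["consistency_score"]
    return base * 0.5 if board.get("solvency", 1.0) < 0.4 else base
```

With `cash_flow_score=0.3` and `consistency_score=0.8`, the parallel run leaves the fraud score untouched, while the hierarchical run lets the fraud agent react to the weak solvency score already on the blackboard.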
Appendix A.2. Information Processing and Variance Reduction
Appendix A.3. The Hard-Stop Mechanism
- with is the weighted risk score from Layer 2 (Eq. 2)
- is the GRPO policy-adjusted threshold
- is the critical veto-agent set
- represents deterministic rejection rules (e.g., active litigation) extracted from SFT agents
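A minimal sketch of how these four ingredients could combine into a decision is given below. It rests on several assumptions that the surviving text does not confirm: the weights are normalized to sum to one, approval requires the weighted score to be at least the threshold (consistent with larger scores meaning lower risk), and the threshold is passed in as a plain number rather than produced by a trained GRPO policy. All function and parameter names are hypothetical.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted Layer 2 score; weights are normalized to sum to one here."""
    total = sum(weights[name] for name in scores)
    return sum(weights[name] * s for name, s in scores.items()) / total


def decide(scores, weights, threshold, veto_flags, veto_agents):
    """Return "reject" on any hard stop, else compare score with threshold.

    veto_flags maps an agent name to True when one of its deterministic
    rejection rules (e.g. active litigation) has fired.
    """
    # Hard stop: a triggered rule from a veto agent rejects regardless of score.
    if any(veto_flags.get(agent, False) for agent in veto_agents):
        return "reject"
    # Assumed direction: a higher score means lower risk, so approve at or above
    # the (case-specific) threshold.
    return "approve" if weighted_score(scores, weights) >= threshold else "reject"
```

Checking the veto set before the score comparison is what makes the rejection deterministic: no threshold value, however low, can override a fired hard-stop rule.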
Appendix A.4. Utility Optimization via GRPO
Appendix B. Machine Learning and LLM-Enhanced Dynamic Filter Loop

Appendix B.1. Feature-Selection Pipeline
Data Cleaning: Enhancing the Signal-to-Noise Ratio
Feature Analysis: Defining the Information Space
Coarse Feature Filtering: Eliminating Semantic Noise Channels
Fine-Grained Feature Selection: Iterative Information Bottleneck
1. Redundancy Pruning: We compute the Pearson correlation matrix and remove features in the bottom 10%. Highly correlated features share large mutual information, implying redundancy about Y. Removing one of a correlated pair compresses the representation with minimal information loss regarding Y.
2. Relevance Validation: The remaining features are used to train an XGBoost model. The model’s feature importance scores approximate each feature’s contribution to reducing the prediction loss (e.g., cross-entropy, which is related to the conditional entropy). We remove the least important 10% of features.
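One round of the two steps above can be sketched in plain Python. Two substitutions are made and should be read as assumptions: redundancy is ranked by each feature's strongest absolute Pearson correlation with any other feature (one reading of the 10% rule), and the absolute correlation with the label stands in for XGBoost feature importance so that the example has no dependencies. The function names and the `frac` parameter are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def n_drop(k, frac):
    """Number of features to drop: frac of k, at least one, never all."""
    return min(max(1, int(k * frac)), k - 1)


def prune_round(features, y, frac=0.10):
    """One round: redundancy pruning, then relevance pruning.

    features maps a feature name to its list of values; y is the label list.
    """
    names = list(features)
    if len(names) < 3:
        return dict(features)  # too few features to prune twice
    # Step 1: redundancy = strongest |correlation| with any other feature.
    redundancy = {
        f: max(abs(pearson(features[f], features[g])) for g in names if g != f)
        for f in names
    }
    drop = sorted(names, key=redundancy.get, reverse=True)[:n_drop(len(names), frac)]
    names = [f for f in names if f not in drop]
    # Step 2: relevance proxy = |correlation with the label|. This is a
    # stand-in for the XGBoost feature importance used in the text.
    relevance = {f: abs(pearson(features[f], y)) for f in names}
    drop = sorted(names, key=relevance.get)[:n_drop(len(names), frac)]
    return {f: features[f] for f in names if f not in drop}
```

Calling `prune_round` repeatedly until a target feature count is reached gives the iterative loop; swapping the relevance proxy for a trained model's importance scores would bring the sketch closer to the procedure described.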
Final Validation: Incorporating Expert Priors
Appendix B.2. Theoretical Summary

Appendix B.3. Validation of Feature Selection
| Classification | Precision | Recall | F1-Score |
|---|---|---|---|
| Approved – Overdue | 0.7400 | 0.8350 | 0.7847 |
| Approved – Current | 0.8131 | 0.6817 | 0.7416 |
| Rejected | 0.8839 | 0.9133 | 0.8984 |
| Avg. | 0.8123 | 0.8100 | 0.8082 |
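As a consistency check on the table above, the last column can be recomputed from the precision and recall columns with the standard harmonic-mean formula. The sketch below copies the reported values and assumes the "Avg." row is an unweighted (macro) mean over the three classes, which is what the numbers suggest; the variable names are illustrative.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported score) copied from the table above.
rows = {
    "Approved - Overdue": (0.7400, 0.8350, 0.7847),
    "Approved - Current": (0.8131, 0.6817, 0.7416),
    "Rejected": (0.8839, 0.9133, 0.8984),
}

recomputed = {name: f1(p, r) for name, (p, r, _) in rows.items()}
macro_f1 = sum(reported for _, _, reported in rows.values()) / len(rows)

for name, (p, r, reported) in rows.items():
    print(f"{name}: recomputed {recomputed[name]:.4f}, reported {reported:.4f}")
print(f"macro-averaged score: {macro_f1:.4f}")
```

The recomputed values agree with the reported ones to within rounding of the four-decimal inputs, and the unweighted mean of the three reported scores reproduces the "Avg." entry.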
Appendix C. Limitations
Appendix D. Layer 2 Agents Training Data Details

Appendix D.1. Agent Architecture
- is the scalar Risk Magnitude for a specific dimension (e.g., Solvency), derived via Risk-Oriented Textualization (ROT)
- is the Evidence Chain in natural language, constrained by a JSON schema
Appendix D.2. Credit-Instruct Dataset Construction
Appendix D.3. Risk-Oriented Textualization (ROT)
Appendix D.4. Data Distribution
Appendix E. Models Configurations
| Model Name | Developer | Scale/Type | Access Method |
|---|---|---|---|
| Open-Source | | | |
| XuanYuan 2 | Duxiaoman-DI | 70B | Local (vLLM) |
| Fin-R1 | Financial AI Lab | 32B (Distilled) | Local (vLLM) |
| Seed OSS Instruct | ByteDance-Seed | 36B | Local (vLLM) |
| Llama 3.1 | Meta | 8B | Local (vLLM) |
| Llama 3.3 | Meta | 70B | Local (vLLM) |
| Qwen 3 Instruct | Alibaba | 4B, 30B | Local (vLLM) |
| Qwen 3 Thinking | Alibaba | 30B | Local (vLLM) |
| DeepSeek-R1-Distill-Qwen | deepseek-ai | 32B | Local (vLLM) |
| DeepSeek-R1-Distill-Llama | deepseek-ai | 8B | Local (vLLM) |
| Proprietary | | | |
| Gemini 3 Flash Preview | Google | Unknown | Vertex AI API |
| Claude 4.5 Sonnet | Anthropic | Unknown | Anthropic API |
| GPT 5.1 | OpenAI | Unknown | OpenAI API |
| Parameter | Configuration Value |
|---|---|
| GPU Hardware | 8 × NVIDIA H800 (80GB) |
| Operating System | Ubuntu 22.04 LTS |
| Inference Engine | vLLM (v0.6.3) / CUDA 12.4 |
| Decoding Strategy | Greedy Search () |
| Max Context Length | 16,384 Tokens |
| Precision | Bfloat16 |
Appendix F. Usage of Large Language Models
Appendix G. Details of Token Counts and Time Consumption
| Baselines | Completion Tokens | Prompt Tokens | Total Tokens |
|---|---|---|---|
| Single Model | 798.5 | 1878.3 | 2676.8 |
| Basic Agent | 1340.0 | 247.3 | 1587.3 |
| Solvency Agent | 1003.1 | 379.5 | 1382.6 |
| Willingness Agent | 1102.1 | 572.6 | 1674.7 |
| History Agent | 1210.9 | 411.9 | 1622.8 |
| Fraud Agent | 1160.9 | 260.9 | 1421.8 |
| Macro Agent | 873.9 | 514.4 | 1388.3 |
| Final Decision Agent | 805.8 | 3109.2 | 3915.0 |
| Baselines | Overall Tokens | Time Spent |
|---|---|---|
| Single Model | 2676.772 | 5.9074 |
| CreditAgent | 12992.502 | 20.1459 |
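The two tables above can be cross-checked with a few lines of arithmetic. The sketch below sums the per-agent totals from the first table and compares the result with the multi-agent row of the second, then derives the overhead ratios relative to the single model. Values are copied as reported; the time figures carry whatever unit the table uses, and the variable names are illustrative.

```python
# Total tokens per call, copied from the per-agent table above.
agent_totals = {
    "Basic": 1587.3, "Solvency": 1382.6, "Willingness": 1674.7,
    "History": 1622.8, "Fraud": 1421.8, "Macro": 1388.3,
    "Final Decision": 3915.0,
}
single_tokens, single_time = 2676.772, 5.9074
credit_tokens, credit_time = 12992.502, 20.1459

pipeline_tokens = sum(agent_totals.values())
token_ratio = credit_tokens / single_tokens
time_ratio = credit_time / single_time

print(f"sum of agent totals: {pipeline_tokens:.1f}")
print(f"token overhead: {token_ratio:.2f}x, time overhead: {time_ratio:.2f}x")
```

The six specialist agents plus the final decision agent account for the full multi-agent token count, and the time overhead is smaller than the token overhead.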

Appendix H. Case Study








Appendix I. More Related Work
Appendix I.1. Credit Assessment
| Work | Category | Model/Algorithm | Details |
|---|---|---|---|
| Sanz et al. [64] | LLMs | LLM-derived features | Extracts textual features complementing traditional variables in P2P lending platforms. Improves predictive performance while emphasizing explainability and fairness in credit decision-making. |
| Feng et al. [25] | LLMs | Instruction-tuned LLMs | Matches or exceeds state-of-the-art credit scoring methods while addressing bias concerns. Demonstrates comprehensive knowledge and adaptability for holistic financial risk evaluation. |
| Sanz et al. [65] | LLMs | LLM textual analysis | Addresses information gaps in P2P lending through textual analysis. Provides complementary signals to traditional credit variables, improving discrimination between creditworthy and risky borrowers with emphasis on explainability. |
| Tan et al. [75] | LLMs | GPT-4 + BERT fusion | Combines GPT-4’s generative capabilities for latent risk extraction with BERT’s bidirectional encoding. Multi-level semantic integration identifies implicit credit risks in unstructured financial documents with superior accuracy and interpretability. |
| Raliphada et al. [60] | LLMs | BERT, RoBERTa, LLaMA 3.2 | Analyzes financial news sentiment using pre-trained language models. RoBERTa’s contextual encoding demonstrates superior risk-relevant sentiment capture. Gradient boosting achieves highest accuracy with structured features. |
| Kang et al. [41] | LLMs | GPT-4o | Processes heterogeneous modalities (power time series, financial metrics, textual data) through attention mechanisms and contrastive learning. Integrates physical consumption data with traditional financial information through cross-modal representation optimization. |
| Puli et al. [58] | Machine Learning | Neural Networks, Random Forest | Demonstrates superior performance with credit-related, interest rate, and liquidity variables as most informative early warning indicators. |
| Zhang et al. [98] | Machine Learning | Neural Networks | Achieves better generalization and robustness than traditional neural networks by evaluating supply chain relationships and leading enterprise credit status alongside individual firm characteristics. |
| Machado et al. [48] | Machine Learning | Random Forest + SHAP | Combines internal banking records with external financial data. Outperforms other ML techniques in prediction accuracy and transparency for detecting early warning signals with SHAP-based explanations. |
| Zhang et al. [100] | Machine Learning | Ensemble Decision Trees | Leverages feature selection and ensemble decision trees for superior performance in processing large-scale loan application data for accurate risk prediction. |
| Blackwell et al. [8] | Statistical | Marginal Risk Model | Introduces marginal risk concept varying with outstanding balance. Enables refined credit limit strategies avoiding "all or nothing" approach while optimizing authorization decisions based on account-specific risk contours. |
| Cameron et al. [10] | Statistical | Count-Duration Models | Provides statistical frameworks for discrete, non-negative integer data. Links count processes to duration analysis for credit scoring and insurance pricing applications. |
Appendix I.2. MAS in Specific Domain
| Work | Domain | Agent Architecture | Key Mechanisms |
|---|---|---|---|
| LawLuo [74] | Legal consultation | Four collaborative agents simulating law firm operations | Role-specific fine-tuning, case graph-based RAG for personalized multi-turn consultation |
| AutoDefense [95] | LLM security | Small open-source LLMs with assigned defensive roles | Collaborative filtering of harmful responses, jailbreak attack defense |
| FinCon [91] | Financial decision-making | Manager-analyst hierarchy inspired by investment firms | Self-critiquing mechanisms, conceptual verbal reinforcement |
| EduPlanner [99] | Educational planning | Evaluator, optimizer, and question analyst agents | Skill-Tree structure, 5-D evaluation module, adversarial collaboration |
| Transactional Analysis [94] | Social interaction | Parent, Adult, and Child ego state agents | Transactional Analysis principles, context-aware psychological dynamics |
| MedAgent-Pro [83] | Medical diagnosis | Hierarchical reasoning agents with visual tool integration | Disease-level plan generation via RAG, patient-level personalized reasoning, step-wise reliability verification |
| MedOrch [31] | Medical reasoning | Modular tool-augmented multi-agent orchestration | Flexible tool integration, transparent traceable reasoning for diagnosis, imaging, and multimodal QA |
| ClinicalAgent [92] | Clinical trials | GPT-4 multi-agent system with external tool access | LEAST-TO-MOST decomposition, ReAct reasoning, clinical trial tool integration for outcome prediction |
| DynamiCare [68] | Clinical decision-making | Dynamic specialist agents with adaptive team composition | Multi-round interactive loops, iterative information gathering, adaptive strategy adjustment |
| MeNTi [101] | Medical calculators | Meta-tool with nested calling mechanism | Flexible calculator chaining, slot filling and unit conversion, 281 physician-curated medical tools |
| Agent Laboratory [67] | Scientific research | Autonomous research agents with human-in-the-loop | End-to-end research automation, literature review to report writing, iterative human feedback integration |
| WebAgent-R1 [84] | Web navigation | Multi-turn RL agents with thinking-based prompting | Asynchronous trajectory generation, binary task success rewards, test-time scaling via interactions |
| EMBODIEDBENCH [88] | Embodied AI | Multi-environment benchmark agents | Evaluation spanning household tasks to atomic navigation/manipulation, revealing low-level control limitations |
| Guided Search [93] | Non-serializable envs. | 1-step lookahead agents with trajectory selection | Learned action-value functions, trajectory selection for test-time search in non-restorable states |
| TRAIL [18] | Agent debugging | Trace evaluation framework for single/multi-agent systems | Formal error taxonomy, 148 human-annotated traces, benchmark for error identification in agent workflows |
| DiscoveryWorld [39] | Scientific discovery | Virtual environment with 24 parametric tasks | 24 parametric tasks across difficulty levels, complete discovery cycle evaluation (hypothesis to conclusion) |
| Agent-X [3] | Multimodal reasoning | Vision-centric agents for multi-step visual tasks | Step-level assessment of reasoning coherence, tool selection, and task completion over images/videos |
| G-Safeguard [79] | MAS security | Graph neural network-based anomaly detection system | Anomaly detection on utterance graphs, topological intervention for attack remediation across LLM backbones |
Appendix I.3. MAS Architecture
| Architecture | Work | Details |
|---|---|---|
| Hierarchical | DyLAN [46] | Dynamic hierarchical architecture with multi-turn interactions. Employs agent selection during inference and early termination for enhanced collaboration efficiency. Uses unsupervised agent importance scoring for automatic team optimization. |
| Hierarchical | LMA for SE [30] | Applies hierarchical task decomposition in software engineering lifecycle. Divides high-level requirements into sub-tasks assigned to specialized agents, mirroring agile methodologies. Enhances robustness through cross-examination and multi-agent validation, addressing LLM hallucination issues. Scales to complex systems by incorporating additional agents and reallocating tasks dynamically. |
| Hierarchical | PathFinder [27] | Four collaborative agents (Triage, Navigation, Description, Diagnosis) emulate pathologist decision-making. Iteratively navigates whole slide images, generates importance maps, produces natural language descriptions of diagnostically relevant patches, and synthesizes findings for final classification with inherent explainability. |
| Hierarchical | Pre-Act [62] | Enhances agent performance by generating multi-step execution plans with detailed reasoning before action. Incrementally refines the plan by incorporating previous steps and tool outputs until final response. Demonstrates substantial improvements in action accuracy and goal completion through fine-tuning on smaller models for practical deployment. |
| Decentralized | DMAS [14] | Turn-taking dialogue-based task planning where each robot’s LLM agent independently expresses opinions and considers peer feedback. Enables autonomous decision-making while maintaining collective progress toward task completion. |
| Decentralized | MaAS [97] | Optimizes an agentic supernet—a probabilistic distribution of agent architectures—to dynamically sample query-dependent multi-agent systems. Enables tailored resource allocation based on task difficulty and domain, achieving superior performance with significantly reduced inference costs compared to static, handcrafted multi-agent designs. |
| Centralized | ACORM [34] | Single LLM serves as central planner, generating actions for all agents based on global state information. Achieves centralized training with decentralized execution, simplifying MARL planning and enhancing scalability and inference efficiency. |
| Shared Memory | MetaGPT [33] | Maintains shared message pool allowing dynamic information observation and extraction. Enables parameter sharing where agents update weights based on new knowledge and synchronize with other agents, optimizing collaboration efficiency. |
| Shared Memory | Zep [61] | Introduces Graphiti, a temporally-aware knowledge graph engine that dynamically synthesizes unstructured conversational data and structured business data while maintaining historical relationships. Enables complex temporal reasoning and cross-session information synthesis beyond static document retrieval in RAG frameworks. |
Appendix J. Prior Knowledge in Finance for High-Stakes Financial Decision-Making
| CreditAgent Agent | Classical 5C’s | Risk Dimension | Primary Function |
|---|---|---|---|
| Macro Agent | Conditions | Macro-Economic Context | Systematic risk calibration |
| Basic Agent | Character | Identity & Basic Profile | KYB/KYC verification |
| History Agent | Character | Credit History | Behavioral velocity analysis |
| Solvency Agent | Capacity, Capital | Solvency & Asset Quality | Liquidity & capital adequacy |
| Willingness Agent | Character | Repayment Willingness | Intent signal extraction |
| Fraud Agent | Collateral* | Fraud & Consistency | Adversarial detection |
Appendix J.1. Macro-Economic Context and Systematic Risk
Appendix J.2. Identity Verification and Basic Profiling
Appendix J.3. Credit Behavior History and Velocity Analysis
Appendix J.4. Solvency, Liquidity, and Asset Quality
Appendix J.5. Repayment Willingness and Behavioral Signals
Appendix J.6. Fraud Detection and Consistency Verification
Appendix K. Expert Review Summary
| Summarized Professional Profiles |
|---|
| Profile A: Senior Credit Committee Director (15+ yrs). Expert in credit committee governance and cross-functional decision synthesis at a national commercial bank. |
| Profile B: Risk Modeling Lead (10+ yrs). Specializing in scorecard development, feature engineering, and model validation at a leading fintech company. |
| Profile C: Fraud Investigation Manager (8+ yrs). Focused on synthetic identity detection, network analysis, and adversarial pattern recognition. |
| Profile D: Macro-Economic Risk Analyst (7+ yrs). Expert in industry cycle analysis, regulatory impact assessment, and systematic risk calibration. |
| Profile E: Consumer Credit Underwriter (6+ yrs). Skilled in cash flow analysis, DTI assessment, and affordability verification for unsecured lending. |
| Profile F: Collections Strategy Director (10+ yrs). Expert in delinquency prediction, roll rate analysis, and recovery optimization. |
Appendix K.1. Hierarchical Decision Structure
Appendix K.2. Dimensional Decomposition Necessity
Appendix K.3. Hard-Stop Mechanism for Catastrophic Risk
Appendix K.4. Asymmetric Cost Structure
Appendix K.5. Information Consistency as Collateral Proxy
Appendix K.6. Dynamic Threshold Adjustment
Appendix L. Dataset Details
Appendix L.1. Core Risk Assessment Dimensions (Main Risk Domains)
- Borrower Basic Information: Mainly includes identity verification, income level, credit report authorization status, etc. Because strong identity fields (name, ID number, plaintext mobile number) are heavily desensitized, this dimension tends to be relatively weak and relies primarily on indirect or derived signals.
- Repayment Capacity: Focuses on objective financial metrics such as total credit limits granted, outstanding loan balances, unsettled debts, and remaining available credit, used to evaluate the borrower’s actual debt-carrying ability and liquidity buffer.
- Repayment Willingness: Captures the degree of funding urgency, tendency toward “borrow-to-repay” cycles, and dependence on external financing through recent and medium-to-long-term loan-approval inquiry frequency, number of inquiring institutions, multi-lender borrowing scores, micro-loan activity flags, and general consumption installment application behavior.
- Credit History and Lending Behavior: Concerns historical delinquency patterns (cumulative overdue periods, recent performance), credit card utilization rates, loan portfolio structure (proportion of small-amount loans, number of unsecured consumer loan institutions), abnormal account flags, etc., reflecting long-term credit discipline and current borrowing style.
- Fraud Risk and Consistency Check: Primarily relies on internal logical consistency of application materials and anomaly detection (especially cross-referencing or duplication in contact information, the same phone number linked to multiple identities, unusual relative relationships, etc.) to identify potential organized fraud, agency packaging, or fabricated application risks.
- Other Information and Auxiliary Signals: Includes report generation timestamps, business occurrence time, derived time-delta variables, executed interest rates, third-party model scores, suspicious large-value credit-related bank transfers, historical successful disbursement counts, etc., mainly serving for contextual supplementation, timeliness calibration, pricing reference, and model ensemble support.
| Dimension | Typical Signals and Representative Variables |
|---|---|
| Borrower Basic Information | Latest Monthly Income Level, Credit Report Authorization Time, Whether Authorized to Query Credit Report, ... |
| Repayment Capacity | Latest Approved Amount by Blaze, Account Credit Limit, Cash Loan Account Credit Limit, ... |
| Repayment Willingness | Number of Institutions Queried for Loan Approval, Number of Loan Approval Queries, Total Number of Queries, ... |
| Credit History and Lending Behavior | Highest Credit Card Utilization Rate, Average Credit Card Utilization Rate, ... |
| Fraud Risk and Consistency Check | Number of Different Contacts Corresponding to the Same Phone Number, Total Times Contacts Were Filled In, ... |
| Other Information and Auxiliary Signals | Credit Report Generation Date, Event Occurrence Time / Application Time, ... |
Appendix M. Statistical Characteristics

Appendix N. Sample Prompts








References
- Agosto, Arianna; Cerchiello, Paola; Giudici, Paolo. Bayesian learning models to measure the relative impact of esg factors on credit ratings. Int. J. Data Sci. Anal. 2025, 20(2), 357–368. [Google Scholar] [CrossRef]
- Amirizaniani, Maryam; Lavergne, Adrian; Okada, Elizabeth Snell; Chadha, Aman; Roosta, Tanya; Shah, Chirag. Developing a framework for auditing large language models using human-in-the-loop. In Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 2025; pp. 64–74. [Google Scholar]
- Ashraf, Tajamul; Saqib, Amal; Ghani, Hanan; AlMahri, Muhra; Li, Yuhao; Ahsan, Noor; Nawaz, Umair; Lahoud, Jean; Cholakkal, Hisham; Shah, Mubarak; Torr, Philip; Khan, Fahad Shahbaz; Anwer, Rao Muhammad; Khan, Salman. Agent-x: Evaluating deep multimodal reasoning in vision-centric agentic tasks. 2025. Available online: https://arxiv.org/abs/2505.24876.
- Baiden, John E. The 5 c’s of credit in the lending industry. Available at SSRN 1872804. 2011. [Google Scholar]
- Ben Abdelaziz, Fouad; Mrad, Fatma. Multiagent systems for modeling the information game in a financial market. Int. Trans. Oper. Res. 2023, 30(5), 2210–2223. [Google Scholar] [CrossRef]
- Berg-Kirkpatrick, Taylor; Burkett, David; Klein, Dan. An empirical investigation of statistical significance in nlp. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012; pp. 995–1005. [Google Scholar]
- Bhatore, Siddharth; Mohan, Lalit; Reddy, Y Raghu. Machine learning techniques for credit risk evaluation: a systematic literature review. J. Bank. Financ. Technol. 2020, 4(1), 111–138. [Google Scholar] [CrossRef]
- Blackwell, Martin; Sykes, Chris. The assignment of credit limits with a behaviour-scoring system. IMA J. Manag. Math. 1992, 4(1), 73–80. [Google Scholar] [CrossRef]
- Boz, Zeynep; Gunnec, Dilek; Birbil, S. Ilker; Öztürk, M. Kaan. Reassessment and monitoring of loan applications with machine learning. Appl. Artif. Intell. 2018, 32(9-10), 939–955. [Google Scholar] [CrossRef]
- Cameron, A Colin; Trivedi, Pravin K. 12 count data models for financial data. Handb. Stat. 1996, 14, 363–391. [Google Scholar]
- Chen, Fei; Ren, Wei; et al. On the control of multi-agent systems: A survey. Found. Trends Syst. Control 2019, 6(4), 339–499. [Google Scholar] [CrossRef]
- Chen, Tianqi; Guestrin, Carlos. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016; pp. 785–794. [Google Scholar]
- Chen, Weize; Su, Yusheng; Zuo, Jingwei; Yang, Cheng; Yuan, Chenfei; Chan, Chi-Min; Yu, Heyang; Lu, Yaxi; Hung, Yi-Hsin; Qian, Chen; et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. The Twelfth International Conference on Learning Representations, 2023. [Google Scholar]
- Chen, Yongchao; Arkin, Jacob; Zhang, Yang; Roy, Nicholas; Fan, Chuchu. Scalable multi-robot collaboration with large language models: Centralized or decentralized systems? In 2024 IEEE International Conference on Robotics and Automation (ICRA); IEEE, 2024; pp. 4311–4317. [Google Scholar]
- Chicco, Davide; Jurman, Giuseppe. The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21(1), 6. [Google Scholar] [CrossRef]
- De Toni, Giovanni; Viappiani, Paolo; Teso, Stefano; Lepri, Bruno; Passerini, Andrea. Personalized algorithmic recourse with preference elicitation. Transactions on Machine Learning Research, 2024. [Google Scholar]
- DeepSeek-AI. Deepseek-v3 technical report. arXiv preprint, 2024.
- Deshpande, Darshan; Gangal, Varun; Mehta, Hersh; Krishnan, Jitin; Kannappan, Anand; Qian, Rebecca. Trail: Trace reasoning and agentic issue localization. arXiv 2025, arXiv:2505.08638. [Google Scholar] [CrossRef]
- Dong, Guanting; Yuan, Hongyi; Lu, Keming; Li, Chengpeng; Xue, Mingfeng; Liu, Dayiheng; Wang, Wei; Yuan, Zheng; Zhou, Chang; Zhou, Jingren. How abilities in large language models are affected by supervised fine-tuning data composition. Proc. 62nd Annu. Meet. Assoc. Comput. Linguist. 2024, Volume 1, 177–198. [Google Scholar]
- Dorri, Ali; Kanhere, Salil S; Jurdak, Raja. Multi-agent systems: A survey. IEEE Access 2018, 6, 28573–28593. [Google Scholar] [CrossRef]
- Eren Erdogan, Lutfi; Lee, Nicholas; Kim, Sehoon; Moon, Suhong; Furuta, Hiroki; Anumanchipalli, Gopala; Keutzer, Kurt; Gholami, Amir. Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv E-Prints 2025, arXiv–2503. [Google Scholar]
- Fang, Alex; Madappally Jose, Albin; Jain, Amit; Schmidt, Ludwig; Toshev, Alexander; Shankar, Vaishaal. Data filtering networks. arXiv 2023, arXiv:2309.17425. [Google Scholar] [CrossRef]
- Fatemi, Sorouralsadat; Hu, Yuheng. Finvision: A multi-agent framework for stock market prediction. In Proceedings of the 5th ACM International Conference on AI in Finance, 2024; pp. 582–590. [Google Scholar]
- Fazlija, Bledar; Harder, Pedro. Using financial news sentiment for stock price direction prediction. Mathematics 2022, 10(13), 2156. [Google Scholar] [CrossRef]
- Feng, Duanyu; Dai, Yongfu; Huang, Jimin; Zhang, Yifang; Xie, Qianqian; Han, Weiguang; Chen, Zhengyu; Lopez-Lira, Alejandro; Wang, Hao. Empowering many, biasing a few: Generalist credit scoring through large language models. arXiv 2023, arXiv:2310.00566. [Google Scholar]
- Gandhar, Akash; Gupta, Kapil; Pandey, Aman Kumar; Raj, Dharm. Fraud detection using machine learning and deep learning. SN Comput. Sci. 2024, 5(5), 453. [Google Scholar] [CrossRef]
- Ghezloo, Fatemeh; Seyfioglu, Mehmet Saygin; Soraki, Rustin; Ikezogwo, Wisdom O.; Li, Beibin; Vivekanandan, Tejoram; Elmore, Joann G.; Krishna, Ranjay; Shapiro, Linda. Pathfinder: A multi-modal multi-agent system for medical diagnostic decision-making applied to histopathology. 2025. Available online: https://arxiv.org/abs/2502.08916.
- Guo, Taicheng; Chen, Xiuying; Wang, Yaqi; Chang, Ruidi; Pei, Shichao; Chawla, Nitesh V; Wiest, Olaf; Zhang, Xiangliang. Large language model based multi-agents: A survey of progress and challenges. arXiv 2024, arXiv:2402.01680. [Google Scholar] [CrossRef]
- Han, Zeyu; Gao, Chao; Liu, Jinyang; Zhang, Jeff; Zhang, Sai Qian. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv 2024, arXiv:2403.14608. [Google Scholar]
- He, Junda; Treude, Christoph; Lo, David. Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead. ACM Trans. Softw. Eng. Methodol. 2025a, 34(5), 1–30. [Google Scholar] [CrossRef]
- He, Yexiao; Li, Ang; Liu, Boyi; Yao, Zhewei; He, Yuxiong. Medorch: Medical diagnosis with tool-augmented reasoning agents for flexible extensibility. 2025b. Available online: https://arxiv.org/abs/2506.00235.
- Hoang, Daniel; Wiegratz, Kevin. Machine learning methods in finance: Recent applications and prospects. Eur. Financ. Manag. 2023, 29(5), 1657–1701. [Google Scholar] [CrossRef]
- Hong, Sirui; Zhuge, Mingchen; Chen, Jonathan; Zheng, Xiawu; Cheng, Yuheng; Wang, Jinlin; Zhang, Ceyao; Wang, Zili; Ka, Steven; Yau, Shing; Lin, Zijuan; et al. Metagpt: Meta programming for a multi-agent collaborative framework. The twelfth international conference on learning representations, 2023. [Google Scholar]
- Hu, Zican; Zhang, Zongzhang; Li, Huaxiong; Chen, Chunlin; Ding, Hongyu; Wang, Zhi. Attention-guided contrastive role representations for multi-agent reinforcement learning. arXiv 2023, arXiv:2312.04819. [Google Scholar]
- Huang, Jin; Ling, Charles X. Using auc and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 2005, 17(3), 299–310. [Google Scholar] [CrossRef]
- Huang, Kexin; Zhang, Serena; Wang, Hanchen; Qu, Yuanhao; Lu, Yingzhou; Roohani, Yusuf; Li, Ryan; Qiu, Lin; Li, Gavin; Zhang, Junze; et al. Biomni: A general-purpose biomedical ai agent. biorxiv 2025. [Google Scholar]
- Hurlin, Christophe; Pérignon, Christophe; Saurin, Sébastien. The fairness of credit scoring models. Manag. Sci. 2026, 72(1), 406–425. [Google Scholar] [CrossRef]
- Jajoo, Gautam; Chitale, Pranjal A; Agarwal, Saksham. Masca: Llm based-multi agents system for credit assessment. 2025. Available online: https://arxiv.org/abs/2507.22758.
- Jansen, Peter; Côté, Marc-Alexandre; Khot, Tushar; Bransom, Erin; Mishra, Bhavana Dalvi; Majumder, Bodhisattwa Prasad; Tafjord, Oyvind; Clark, Peter. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents. In Advances in Neural Information Processing Systems; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc., 2024; Volume 37, pp. 10088–10116. [Google Scholar] [CrossRef]
- Jin, Haibo; Zhang, Peiyan; Wang, Peiran; Luo, Man; Wang, Haohan. From hallucinations to jailbreaks: Rethinking the vulnerability of large foundation models. arXiv 2025, arXiv:2505.24232. [Google Scholar] [CrossRef]
- Kang, Li; Wang, Xiaocun; Zhao, Lei; Wang, Xijing; Xiang, Sheng. Multimodal large language model for enterprise credit assessment with power data enhancement. In Proceedings of the 2025 International Conference on Economic Management and Big Data Application, 2025; pp. 785–791. [Google Scholar]
- Karimi, Amir-Hossein; Barthe, Gilles; Schölkopf, Bernhard; Valera, Isabel. A survey of algorithmic recourse: contrastive explanations and consequential recommendations. ACM Comput. Surv. 2022, 55(5), 1–29. [Google Scholar] [CrossRef]
- Li, Guohao; Hammoud, Hasan; Itani, Hani; Khizbullin, Dmitrii; Ghanem, Bernard. Camel: Communicative agents for "mind" exploration of large language model society. Adv. Neural Inf. Process. Syst. 2023a, 36, 51991–52008. [Google Scholar]
- Li, Jiangtong; Bian, Yuxuan; Wang, Guoxuan; Lei, Yang; Cheng, Dawei; Ding, Zhijun; Jiang, Changjun. Cfgpt: Chinese financial assistant with large language model. arXiv 2023b, arXiv:2309.10654. [Google Scholar]
- Li, Xiangci; Ouyang, Jessica. Related work and citation text generation: A survey. arXiv 2024, arXiv:2404.11588. [Google Scholar] [CrossRef]
- Liu, Zijun; Zhang, Yanzhe; Li, Peng; Liu, Yang; Yang, Diyi. Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization. arXiv 2023, arXiv:2310.02170. [Google Scholar]
- Ma, Li; et al. Catboost: unbiased boosting with categorical features. Expert Systems with Applications, 2024. [Google Scholar]
- Machado, Marcos R; Chen, Daniel Tianfu; Osterrieder, Joerg R. An analytical approach to credit risk assessment using machine learning models. Decis. Anal. J. 2025, 100605. [Google Scholar] [CrossRef]
- Madero, Armando. Specialized multi-agent neural architecture for enhanced reasoning and minimal intervention in ai systems. Available at SSRN 5534058. 2025. [Google Scholar]
- Majumdar, Chitro; Scandizzo, Sergio; Mahanta, Ratanlal; Mandal, Avradip; Bhattacharjee, Swarnendu. A large language model for corporate credit scoring. 2025. Available online: https://arxiv.org/abs/2511.02593.
- McCracken, Lance M; DaSilva, Philomena; Skillicorn, Beth; Doherty, Richard. The cognitive fusion questionnaire: A preliminary study of psychometric properties and prediction of functioning in chronic pain. Clin. J. Pain 2014, 30(10), 894–901. [Google Scholar] [CrossRef]
- Moro, Sérgio; Cortez, Paulo; Rita, Paulo. An automated literature analysis on data mining applications to credit risk assessment. Artificial Intelligence in Financial Markets: Cutting Edge Applications for Risk Management, Portfolio Optimization and Economics 2016, pp. 161–177. [Google Scholar]
- Mulyanto, Sigit; Yonia, Dwika; Sutejo, Bambang. Finance loan risk assessment using machine learning for credit eligibility prediction and model optimization. IJISTECH 2025, 8(5), 303–311. Available online: https://www.ijistech.org/ijistech/index.php/ijistech/article/view/376. [CrossRef]
- Naili, Maryem; Lahrichi, Younes. The determinants of banks’ credit risk: Review of the literature and future research agenda. Int. J. Financ. Econ. 2022, 27(1), 334–360. [Google Scholar] [CrossRef]
- Park, Joon Sung; O’Brien, Joseph; Cai, Carrie Jun; Morris, Meredith Ringel; Liang, Percy; Bernstein, Michael S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, 2023; pp. 1–22. [Google Scholar]
- Pathi, Sai Prashanth. A multi-agent framework for personalized credit recommendations. Int. J. Multidiscip. Res. Growth Eval. 2025. [Google Scholar] [CrossRef]
- Popa, Alexandru; Sîrbu, Tiberiu-Iulian. From fragmentation to cohesion: An llm-based iterative approach to ontology and knowledge graph refinement. In 2025 25th International Conference on Control Systems and Computer Science (CSCS); IEEE, 2025; pp. 651–654. [Google Scholar]
- Puli, Sreenivasulu; Thota, Nagaraju; Subrahmanyam, A.C.V. Assessing machine learning techniques for predicting banking crises in india. J. Risk Financ. Manag. 2024, 17(4), 141. [Google Scholar] [CrossRef]
- Purificato, E.; Lorenzo, F.; Fallucchi, F.; De Luca, E. W. The use of responsible artificial intelligence techniques in the context of loan approval processes. Int. J. Human–Computer Interact. 2023, 39(7), 1543–1562. [Google Scholar] [CrossRef]
- Raliphada, Pfarelo; Olusanya, Micheal O; Olukanmi, Seun. Optimizing credit risk model classification using bert, roberta, and llama 3.2. In IEEE EUROCON 2025-21st International Conference on Smart Technologies; IEEE, 2025; pp. 1–6. [Google Scholar]
- Rasmussen, Preston; Paliychuk, Pavlo; Beauvais, Travis; Ryan, Jack; Chalef, Daniel. Zep: A temporal knowledge graph architecture for agent memory. 2025. Available online: https://arxiv.org/abs/2501.13956.
- Rawat, Mrinal; Gupta, Ambuje; Goomer, Rushil; Di Bari, Alessandro; Gupta, Neha; Pieraccini, Roberto. Pre-act: Multi-step planning and reasoning improves acting in llm agents. 2025. Available online: https://arxiv.org/abs/2505.09970.
- Sadok, Hicham; Sakka, Fadi; El Hadi El Maknouzi, Mohammed. Artificial intelligence and bank credit analysis: A review. Cogent Econ. Financ. 2022, 10(1), 2023262. [Google Scholar] [CrossRef]
- Sanz-Guerrero, Mario; Arroyo, Javier. Credit risk meets large language models: Building a risk indicator from loan descriptions in p2p lending. arXiv 2024a, arXiv:2401.16458. [Google Scholar] [CrossRef]
- Sanz-Guerrero, Mario; Arroyo, Javier. Credit risk meets large language models: Building a risk indicator from loan descriptions in peer-to-peer lending. Available at SSRN 4979155. 2024b. [Google Scholar]
- Eslam, Hussein Sayed; Alabrah, Amerah; Kamel, Hussein Rahouma; Muhammad, Zohaib; Badry, Rasha M. Machine learning and deep learning for loan prediction in banking: Exploring ensemble methods and data balancing. IEEE Access 2024, 12, 193997–194019. [Google Scholar] [CrossRef]
- Schmidgall, Samuel; Su, Yusheng; Wang, Ze; Sun, Ximeng; Wu, Jialian; Yu, Xiaodong; Liu, Jiang; Moor, Michael; Liu, Zicheng; Barsoum, Emad. Agent laboratory: Using llm agents as research assistants. 2025. Available online: https://arxiv.org/abs/2501.04227.
- Shang, Tianqi; He, Weiqing; Zheng, Charles; Li, Lingyao; Shen, Li; Zhao, Bingxin. Dynamicare: A dynamic multi-agent framework for interactive and open-ended medical decision-making. 2025. Available online: https://arxiv.org/abs/2507.02616.
- Shao, Zhihong; Wang, Peiyi; Zhu, Qihao; Xu, Runxin; Song, Junxiao; Bi, Xiao; Zhang, Haowei; Zhang, Mingchuan; Li, Y. K.; Wu, Y.; Guo, Daya. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. 2024. Available online: https://arxiv.org/abs/2402.03300.
- Shi, Jinxin; Zhao, Jiabao; Wu, Xingjiao; Xu, Ruyi; Jiang, Yuan-Hao; He, Liang. Mitigating reasoning hallucination through multi-agent collaborative filtering. Expert Syst. With Appl. 2025, 263, 125723. [Google Scholar] [CrossRef]
- Simonetto, Thibault; Ghamizi, Salah; Cordy, Maxime. Tabularbench: Benchmarking adversarial robustness for tabular deep learning in real-world use-cases. Adv. Neural Inf. Process. Syst. 2024, 37, 78394–78430. [Google Scholar]
- Song, Yu; Wang, Yuyan; Ye, Xin; Zaretzki, Russell; Liu, Chuanren. Loan default prediction using a credit rating-specific and multi-objective ensemble learning scheme. Inf. Sci. 2023, 629, 599–617. [Google Scholar] [CrossRef]
- Sun, Chuanneng; Huang, Songjun; Pompili, Dario. Llm-based multi-agent decision-making: Challenges and future directions. IEEE Robotics and Automation Letters, 2025. [Google Scholar]
- Sun, Jingyun; Dai, Chengxiao; Luo, Zhongze; Chang, Yangbo; Li, Yang. Lawluo: A multi-agent collaborative framework for multi-round chinese legal consultation. arXiv 2024, arXiv:2407.16252. [Google Scholar]
- Tan, Huirong; Xie, Yanruixue. Financial text analysis and credit risk assessment using a gpt-4 and improved bert fusion model. PLoS ONE 2025, 20(11), e0336217. [Google Scholar] [CrossRef]
- Tavasoli, Ahmadreza; Sharbaf, Maedeh; Madani, Seyed Mohamad. Responsible innovation: A strategic framework for financial llm integration. arXiv 2025, arXiv:2504.02165. [Google Scholar] [CrossRef]
- Tian, Xu; Tian, ZongYi; Khatib, Saleh FA; Wang, Yan. Machine learning in internet financial risk management: A systematic literature review. PLoS ONE 2024, 19(4), e0300195. [Google Scholar] [CrossRef]
- Wan, Xiangpeng; Deng, Haicheng; Zou, Kai; Xu, Shiqi. Enhancing the efficiency and accuracy of underlying asset reviews in structured finance: The application of multi-agent framework. 2024. Available online: https://arxiv.org/abs/2405.04294.
- Wang, Shilong; Zhang, Guibin; Yu, Miao; Wan, Guancheng; Meng, Fanci; Guo, Chongye; Wang, Kun; Wang, Yang. G-safeguard: A topology-guided security lens and treatment on llm-based multi-agent systems. 2025a. Available online: https://arxiv.org/abs/2502.11127.
- Wang, Xiaofeng; Zhang, Zhixin; Zheng, Jinguang; Ai, Yiming; Wang, Rui. Debt collection negotiations with large language models: An evaluation system and optimizing decision making with multi-agent. 2025b. Available online: https://arxiv.org/abs/2502.18228.
- Wang, Yuelin; Zhang, Yihan; Lu, Yan; Yu, Xinran. A comparative assessment of credit risk model based on machine learning——a case study of bank loan data. Procedia Comput. Sci. 2020a, 174, 141–149. 2019 International Conference on Identification, Information and Knowledge in the Internet of Things. Available online: https://www.sciencedirect.com/science/article/pii/S1877050920315830. ISSN 1877-0509. [CrossRef]
- Wang, Yuelin; Zhang, Yihan; Lu, Yan; Yu, Xinran. A comparative assessment of credit risk model based on machine learning——a case study of bank loan data. Procedia Comput. Sci. 2020b, 174, 141–149. [Google Scholar] [CrossRef]
- Wang, Ziyue; Wu, Junde; Cai, Linghan; Low, Chang Han; Yang, Xihong; Li, Qiaxuan; Jin, Yueming. Medagent-pro: Towards evidence-based multi-modal medical diagnosis via reasoning agentic workflow. 2025c. Available online: https://arxiv.org/abs/2503.18968.
- Wei, Zhepei; Yao, Wenlin; Liu, Yao; Zhang, Weizhi; Lu, Qin; Qiu, Liang; Yu, Changlong; Xu, Puyang; Zhang, Chao; Yin, Bing; Yun, Hyokun; Li, Lihong. Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning. 2025. Available online: https://arxiv.org/abs/2505.16421.
- Wright, Devin R; An, Jisun; Ahn, Yong-Yeol. Cognitive linguistic identity fusion score (clifs): A scalable cognition-informed approach to quantifying identity fusion from text. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025; pp. 11643–11673. [Google Scholar]
- Wu, Qingyun; Bansal, Gagan; Zhang, Jieyu; Wu, Yiran; Li, Beibin; Zhu, Erkang; Jiang, Li; Zhang, Xiaoyun; Zhang, Shaokun; Liu, Jiale; et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. First Conference on Language Modeling, 2024. [Google Scholar]
- Xu, Jun. Genai and llm for financial institutions: A corporate strategic survey. Available at SSRN 4988118. 2024.
- Yang, Rui; Chen, Hanyang; Zhang, Junyu; Zhao, Mark; Qian, Cheng; Wang, Kangrui; Wang, Qineng; Venkat Koripella, Teja; Movahedi, Marziyeh; Li, Manling; et al. Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. arXiv 2025, arXiv:2502.09560. [Google Scholar]
- Yao, Shunyu; Zhao, Jeffrey; Yu, Dian; Du, Nan; Shafran, Izhak; Narasimhan, Karthik R; Cao, Yuan. React: Synergizing reasoning and acting in language models. International Conference on Learning Representations (ICLR), 2023. [Google Scholar]
- Yao, Zijun; Liu, Yantao; Chen, Yanxu; Chen, Jianhui; Fang, Junfeng; Hou, Lei; Li, Juanzi; Chua, Tat-Seng. Are reasoning models more prone to hallucination? arXiv 2025, arXiv:2505.23646. [Google Scholar] [CrossRef]
- Yu, Yangyang; Yao, Zhiyuan; Li, Haohang; Deng, Zhiyang; Jiang, Yuechen; Cao, Yupeng; Chen, Zhi; Suchow, Jordan; Cui, Zhenyu; Liu, Rong; et al. Fincon: A synthesized llm multi-agent system with conceptual verbal reinforcement for enhanced financial decision making. Adv. Neural Inf. Process. Syst. 2024, 37, 137010–137045. [Google Scholar]
- Yue, Ling; Xing, Sixue; Chen, Jintai; Fu, Tianfan. Clinicalagent: Clinical trial multi-agent system with large language model-based reasoning. 2024. Available online: https://arxiv.org/abs/2404.14777.
- Zainullina, Karina; Golubev, Alexander; Trofimova, Maria; Polezhaev, Sergei; Badertdinov, Ibragim; Litvintseva, Daria; Karasik, Simon; Fisin, Filipp; Skvortsov, Sergei; Nekrashevich, Maksim; et al. Guided search strategies in non-serializable environments with applications to software engineering agents. arXiv 2025, arXiv:2505.13652. [Google Scholar] [CrossRef]
- Zamojska, Monika; Chudziak; et al. Games agents play: Towards transactional analysis in llm-based multi-agent systems. arXiv 2025, arXiv:2507.21354. [Google Scholar]
- Zeng, Yifan; Wu, Yiran; Zhang, Xiao; Wang, Huazheng; Wu, Qingyun. Autodefense: Multi-agent llm defense against jailbreak attacks. arXiv 2024, arXiv:2403.04783. [Google Scholar]
- Zhang, Guibin; Yue, Yanwei; Li, Zhixun; Yun, Sukwon; Wan, Guancheng; Wang, Kun; Cheng, Dawei; Yu, Jeffrey Xu; Chen, Tianlong. Cut the crap: An economical communication pipeline for llm-based multi-agent systems. arXiv 2024a, arXiv:2410.02506. [Google Scholar]
- Zhang, Guibin; Niu, Luyang; Fang, Junfeng; Wang, Kun; Bai, Lei; Wang, Xiang. Multi-agent architecture search via agentic supernet. 2025a. Available online: https://arxiv.org/abs/2502.04180.
- Zhang, Lang; Hu, Haiqing; Zhang, Dan. A credit risk assessment model based on svm for small and medium enterprises in supply chain finance. Financ. Innov. 2015, 1(1), 14. [Google Scholar] [CrossRef]
- Zhang, Xueqiao; Zhang, Chao; Sun, Jianwen; Xiao, Jun; Yang, Yi; Luo, Yawei. Eduplanner: Llm-based multi-agent systems for customized and intelligent instructional design. IEEE Transactions on Learning Technologies, 2025b. [Google Scholar]
- Zhang, Xuyang; Xu, Lidong; Li, Ningxin; Zou, Jianke. Research on credit risk assessment optimization based on machine learning. Preprints.org 2024b. [Google Scholar] [CrossRef]
- Zhu, Yakun; Wei, Shaohang; Wang, Xu; Xue, Kui; Zhang, Xiaofan; Zhang, Shaoting. Menti: Bridging medical calculator and llm agent with nested tool calling. 2025. Available online: https://arxiv.org/abs/2410.13610.






| Work | Modality | Arch. | Trans. | CR | #Feat. | #Samp. | STI | Decision Logic |
|---|---|---|---|---|---|---|---|---|
| XGBoost [12] | Tabular | SM | × | × | 39 | 300k | – | Discriminative |
| Omega2 [50] | Hybrid (RAG+GBDT) | C | × | × | 24 | 7.8k | – | Logistic |
| MASCA [38] | Textual | H | × | × | 20 | 200 | × | Majority Vote |
| MADeN [80] | Dynamic | H | × | × | 6 | 975 | | Interaction-based |
| Our Work | Multimodal | H | | | 100+ | 6,000 | | Score Fusion (Agent+GBDT) |

| Model Name | BEC ↑ | Acc. ↑ (%) | Std. ↓ | Acc. (G1) (%) | Acc. (G2) (%) | Acc. (G3) (%) |
|---|---|---|---|---|---|---|
| SVM | 0.6127 | 67.40 | 0.1045 | 62.10 | 65.40 | 74.70 |
| LightGBM | 0.6353 | 77.00 | 0.0812 | 73.50 | 75.80 | 81.70 |
| XGBoost | 0.6394 | 78.47 | 0.0768 | 75.20 | 77.10 | 83.11 |
| CatBoost | 0.6402 | 77.53 | 0.0792 | 74.00 | 76.20 | 82.39 |
| Gemini-3-Flash-Preview | 0.5803 | 69.71 | 0.1040 | 57.53 | 68.67 | 82.94 |
| Claude-Sonnet-4.5 | 0.6138 | 70.99 | 0.1392 | 70.31 | 54.30 | 88.37 |
| GPT-5.1 | 0.5671 | 66.36 | 0.0767 | 56.09 | 68.48 | 74.52 |
| Seed-OSS-36B-Instruct | 0.5612 | 70.51 | 0.2496 | 86.84 | 35.25 | 89.45 |
| Llama-3.1-8B-Instruct | 0.3056 | 45.03 | 0.1612 | 34.28 | 67.82 | 33.00 |
| Llama-3.3-70B-Instruct | 0.5192 | 58.27 | 0.0560 | 50.63 | 60.29 | 63.89 |
| Qwen3-4B-Instruct | 0.5634 | 63.91 | 0.1479 | 67.75 | 44.19 | 79.79 |
| Qwen3-30B-Thinking | 0.4757 | 63.12 | 0.3207 | 84.25 | 17.80 | 87.30 |
| Qwen3-30B-Instruct | 0.5798 | 68.72 | 0.2119 | 81.50 | 38.85 | 85.80 |
| DeepSeek-R1-Dist-32B | 0.5433 | 68.85 | 0.2685 | 84.60 | 31.05 | 90.90 |
| DeepSeek-R1-Dist-8B | 0.5602 | 61.41 | 0.0758 | 57.91 | 54.38 | 71.94 |
| XuanYuan2-70B-Chat | 0.1508 | 39.37 | 0.3210 | 16.24 | 84.77 | 17.10 |
| Fin-R1 | 0.5382 | 57.64 | 0.1089 | 63.05 | 42.45 | 67.43 |
| CreditAgent (SFT) (DS) | 0.6013 (↓0.1634) | 72.23 (↓11.09) | 0.2170 (↑0.0958) | 84.45 (↓6.05) | 41.74 (↓24.51) | 90.50 (↓2.70) |
| CreditAgent (SFT) (Seed) | 0.6190 (↓0.1457) | 73.66 (↓9.66) | 0.2012 (↑0.0800) | 83.50 (↓7.00) | 45.62 (↓20.63) | 91.85 (↓1.35) |
| CreditAgent (SFT) + Avg | 0.6862 (↓0.0785) | 78.40 (↓4.92) | 0.1313 (↑0.0101) | 78.95 (↓11.55) | 62.05 (↓4.20) | 94.20 (↑1.00) |
| CreditAgent (SFT) + DST | 0.7055 (↓0.0592) | 76.55 (↓6.77) | 0.0550 (↓0.0662) | 72.10 (↓18.40) | 73.25 (↑7.00) | 84.30 (↓8.90) |
| CreditAgent (GRPO) | 0.7647 | 83.32 | 0.1212 | 90.50 | 66.25 | 93.20 |

| Config. | BEC ↑ | Acc. ↑ (%) | Std. ↓ | G1 | G2 | G3 |
|---|---|---|---|---|---|---|
| FCA | 0.7647 | 83.32 | 0.1242 | 90.50 | 66.25 | 93.20 |
| w/o Macro | 0.6968 (↓0.0679) | 78.25 (↓5.07) | 0.1636 (↑0.0394) | 88.75 (↓1.75) | 55.15 (↓11.10) | 90.85 (↓2.35) |
| w/o {M,H,W} | 0.6288 (↓0.1359) | 73.85 (↓9.47) | 0.2010 (↑0.0768) | 85.25 (↓5.25) | 45.60 (↓20.65) | 90.70 (↓2.50) |
| w/o Layer 2 | 0.5089 (↓0.2558) | 60.34 (↓22.98) | 0.2142 (↑0.0900) | 72.31 (↓18.19) | 30.25 (↓36.00) | 78.45 (↓14.75) |

| Configuration | BEC ↑ |
|---|---|
| Parallel agents (no blackboard) | 0.6412 |
| Sequential agents without shared context | 0.6497 |
| Single LLM with structured prompt | 0.6836 |
| GBDT + Layer 3 fusion hybrid | 0.7043 |
| Full CreditAgent | 0.7647 |

| Group | XGBoost | GPT-5.1 | CreditAgent |
|---|---|---|---|
| 18–28 M | 77.3 | 72.6 | 63.8 |
| 18–28 F | 69.4 | 65.9 | 62.1 |
| 28–38 M | 72.1 | 68.4 | 63.2 |
| 28–38 F | 63.8 | 62.7 | 62.9 |
| 38–48 M | 64.3 | 61.2 | 62.4 |
| 38–48 F | 48.6 | 52.3 | 60.7 |
| DPD ↓ | 28.7 | 20.3 | 3.1 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).