Preprint
Article

This version is not peer-reviewed.

Learning to Fuse: Cost-Sensitive Credit Assessment via Hierarchical Multi-Agent Reinforcement Learning

Submitted:

08 May 2026

Posted:

09 May 2026


Abstract
Credit risk assessment requires both accurate prediction and a structured decomposition of how heterogeneous evidence contributes to each decision. Monolithic Large Language Models can incorporate unstructured evidence and natural-language reasoning into such workflows, but in high-stakes underwriting they may be distracted by noisy inputs, miss rare but decisive risk cues, and offer limited control over policy-dependent decision thresholds. We present CREDITAGENT, a hierarchical credit review system with three stages: evidence filtering, specialist risk analysis by agents, and decision fusion. Our central contribution is a controlled comparison: holding the adapted backbone, specialist-agent outputs, hard-stop rules, and data split fixed, we vary only the final fusion strategy to isolate the effect of hierarchical fusion on underwriting quality. On a held-out set of 6,000 personal credit cases from a Chinese financial institution, CREDITAGENT achieves 83.32% accuracy and a Business Efficiency Coefficient of 0.7647, outperforming flagship LLM baselines. We present these findings as an institution-specific case study while identifying which components are mechanism-portable (hierarchical fusion, GRPO training recipe) and which are institution-specific (hard-stop rules, cost ratios). To ensure reproducibility, we make code and dataset publicly available at https://github.com/kouzhizhuo/Credit_Agents.

1. Introduction

Automated credit risk assessment is a high-stakes problem that requires synthesizing structured quantitative signals with institution-specific qualitative evidence [52,76,87]. Traditional machine learning models such as XGBoost [12] remain strong on tabular benchmarks [71], but they provide limited support for explicitly decomposing how heterogeneous signals such as macroeconomic context, borrowing history, inquiry velocity, and fraud cues contribute to a committee credit decision [32,72,77]. In practice, credit review still depends heavily on human analysts and static rule stacks [5,28,33]. As shown in Figure 1, this workflow is labor-intensive and difficult to scale.
Large Language Models (LLMs) make it possible to incorporate unstructured evidence and natural-language reasoning into financial workflows [45,89]. However, deploying a single end-to-end LLM for credit decision-making introduces three practical problems: (i) irrelevant or noisy fields can distract the model from decision-relevant evidence [40]; (ii) rare but high-impact cues, such as fraud indicators or sudden inquiry spikes, can be weakened when embedded in long heterogeneous prompts [26,57]; and (iii) a single free-form response provides limited control over institution-specific approval thresholds and only weakly structured intermediate outputs for downstream review [1]. Existing multi-agent financial systems partly address specialization, but they often do not clearly separate evidence selection, specialist analysis, and final decision control.
(Q) How can end-to-end LLM-based credit decision systems integrate heterogeneous financial evidence, enforce domain policies, and produce committee-aligned, auditable decisions?
To address this, we build CreditAgent, a hierarchical system that separates credit review into three stages [55]. Layer ❶: Evidence Filtering. An LLM-guided feature-selection loop is combined with XGBoost importance estimates to choose a smaller subset of applicant fields for downstream processing [22]. Layer ❷: Specialist Risk Analysis. Six specialist agents each evaluate one predefined risk dimension and write a fixed-schema output (a bounded scalar score plus supporting input factors) to a shared blackboard; later agents can condition on earlier outputs [49]. Layer ❸: Decision Fusion. A fusion module aggregates specialist outputs via learned attention weights, applies institution-specific veto rules for catastrophic patterns, and selects the final approval decision using a threshold trained with GRPO [2,51,85].
To operationalize the agentic reasoning infrastructure, CreditAgent is trained in two stages: (i) supervised fine-tuning (SFT) [19] on Seed-OSS-36B-Instruct to instill the required output schema and domain reasoning patterns, and (ii) reinforcement learning with GRPO [69] to optimize the fusion layer under asymmetric business utility. On a held-out set of 6,000 Chinese personal credit cases from one institution, the system achieves 83.32% accuracy and a Business Efficiency Coefficient (BEC) of 0.7647, with the largest gains on approval-boundary cases. Our main contributions are:
  • Controlled isolation of fusion effects. We design a controlled protocol that holds the adapted backbone, specialist-agent interface, hard-stop rules, and data split fixed while varying only the fusion mechanism. This isolates the contribution of the decision layer itself, independent of model scale or prompt engineering, and reveals that gains are concentrated on approval-boundary cases.
  • Case-specific threshold adaptation as a fusion principle. We show that the key mechanism is not better scoring but case-specific threshold selection: GRPO learns to lower the approval barrier when specialist signals converge on safety and raise it when specialist disagreement signals ambiguity. This accounts for 62% of the BEC improvement over fixed-threshold baselines.
  • Empirical characterization of fusion sensitivity. Through ablations, error analysis, and temporal diagnostics, we establish that fusion strategy choice matters disproportionately on boundary cases. This asymmetry motivates cost-sensitive fusion as a first-class design target for credit pipelines.

3. Methodology

We model credit review as a hierarchical decision function $F : \mathcal{X} \to \{0, 1\}$, where $\mathcal{X}$ contains structured applicant features and unstructured records, and the output indicates reject (1) or approve (0). The framework decomposes this mapping into three stages: evidence filtering, specialist scoring, and decision fusion. Algorithm 1 summarizes the end-to-end pipeline.
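Algorithm 1: CreditAgent Architecture
  • Input: Raw evidence $x \in \mathcal{X}$; trained GBDT/XGBoost model $M_{\text{xgb}}$; filter agent $G_\psi$; specialist agents $\mathcal{A} = \{a_{\text{Macro}}, a_{\text{Basic}}, a_{\text{History}}, a_{\text{Solvency}}, a_{\text{Willingness}}, a_{\text{Fraud}}\}$; fusion network MLP; policy $\pi_\theta$; initial sparsity threshold $\gamma_0$.
  • Output: Decision $Y \in \{0, 1\}$ (1 = Reject, 0 = Approve), final score $S_{\text{final}}$, audit chain $\{o_j\}_{j=1}^{6}$.
  • /* Layer 1: Dynamic Data Governance */
  • $x \leftarrow \text{CoarseFilter}(x \mid a_{\text{txlog}}(x), a_{\text{creport}}(x))$
  • $x' \leftarrow x \odot \mathbb{I}[\text{XGB}(x, y) > \gamma]$ ▹ select, Eq. (1)
  • Initialize shared blackboard $S_{\text{shared}} \leftarrow \emptyset$
  • /* Layer 2: Multi-Agent Domain Reasoning */
  • for agent $a_j \in \mathcal{A}$ do
  •      $o_j = \{r_j, e_j\} \leftarrow a_j(P_j \,\|\, S_{\text{shared}})$, where $P_j = T(x'_j, g(x'_j))$
  •      $S_{\text{shared}} \leftarrow [S_{\text{shared}} \,\|\, o_j]$
  • end for
  • /* Layer 3: Score Fusion & Decision */
  • $\alpha_j \leftarrow \text{softmax}(\text{MLP}(E(e_j)))$ ▹ Eq. (2)
  • $S_{\text{final}} \leftarrow \sigma\big(\sum_j \alpha_j \cdot r_j\big)$
  • $\tau^* \leftarrow \pi_\theta(S_{\text{shared}})$ ▹ Eq. (3)
  • if $\exists k \in \mathcal{V} : \text{Hard-Stop}(e_k) = 1$ then ▹ hard-stop veto (Layer 3, Step 3)
  •      $Y \leftarrow 1$
  • else
  •      $Y \leftarrow \mathbb{I}[S_{\text{final}} < \tau^*]$ ▹ threshold rule (Layer 3, Step 3)
  • end if
  • return $Y$, $S_{\text{final}}$, $\{o_j\}_{j=1}^{6}$ ▹ Decision + Audit Chain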

3.1. Three-Layer Credit Analysis System

Layer 1: Evidence Filtering

Raw credit applications may contain over 10k features, many of which are irrelevant or redundant and would crowd the context windows of the downstream LLM agents. Layer 1 selects a compact subset of decision-relevant fields before specialist reasoning begins, using two coarse-filter agents ($a_{\text{txlog}}$ and $a_{\text{creport}}$ in Algorithm 1) followed by importance-based gating. We define a Recurrent Feature Selection loop in which a Dynamic Data Filter Agent $G_\psi$ interacts with a GBDT-based predictor. Given a raw feature vector $x \in \mathcal{X}$, the gating process produces a refined vector:
$$x' = x \odot \mathbb{I}\big[\text{XGB}(x, y) > \gamma\big] \tag{1}$$
In Eq. (1), $\odot$ is the Hadamard product, $\text{XGB}(x, y)$ is the feature-importance vector from an XGBoost model trained on historical repayment labels $y \in \{0, 1\}$, and $\gamma$ is a sparsity threshold dynamically adjusted by $G_\psi$ to satisfy context-window constraints while retaining predictive features.
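A minimal Python sketch of the gating in Eq. (1) follows. It is illustrative rather than the released implementation: the importance vector is passed in directly instead of being read from a trained XGBoost model, and `fit_gamma_to_budget` is a hypothetical stand-in for the filter agent $G_\psi$ that simply raises $\gamma$ until a feature budget is met.

```python
import numpy as np

def gate_features(x, importance, gamma):
    """Eq. (1): keep x_i where importance_i > gamma, zero out the rest."""
    mask = (importance > gamma).astype(x.dtype)
    return x * mask, mask

def fit_gamma_to_budget(importance, max_features, gamma0=0.0):
    """Hypothetical stand-in for G_psi: raise gamma step by step until
    at most `max_features` features pass the gate."""
    gamma = gamma0
    for candidate in np.sort(importance):
        if np.sum(importance > gamma) <= max_features:
            break
        gamma = max(gamma, candidate)
    return gamma

# Toy values (assumed for illustration only).
x = np.array([0.7, 12.0, 3.0, 0.0, 5.5])
importance = np.array([0.40, 0.05, 0.30, 0.01, 0.24])

gamma = fit_gamma_to_budget(importance, max_features=3)
x_refined, mask = gate_features(x, importance, gamma)
```

With these toy values the gate keeps the three most important features and zeroes the other two.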

Layer 2: Specialist Risk Analysis

Credit risk is multi-dimensional: solvency, fraud, repayment history, and macro conditions each require distinct assessment logic. Rather than processing all dimensions in a single prompt, Layer 2 routes the filtered evidence to six specialist agents $\mathcal{A} = \{a_1, \dots, a_6\}$, each fine-tuned on our Credit-Instruct dataset (Section 3.3). Each agent $a_j$ produces a fixed-schema output tuple $o_j = \{r_j, e_j\}$:
  • $r_j \in [0, 10]$: a scalar risk score for one dimension (e.g., solvency). Larger values indicate lower risk.
  • $e_j \in \mathcal{E}^*$: a structured evidence record, given as a fixed-schema list of cited input factors and a textual rationale. We treat these as operational artifacts; explanation faithfulness is evaluated by experts.
Agents execute sequentially. The global context state is updated as $S_{t+1} = [S_t \,\|\, o_{t+1}]$ with $S_0 = \emptyset$, so each agent can condition on upstream findings via a shared blackboard. The execution order follows standard underwriting review practice: macro context establishes the economic regime, identity checks gate further analysis, and fraud assessment runs last.
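The sequential blackboard update can be sketched in a few lines of Python. This is an assumption-laden illustration, not the system's code: the three stub agents, their rules, and the feature name `inquiries_90d` are hypothetical, and real specialists would be fine-tuned LLM calls returning constrained JSON.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class AgentOutput:
    agent: str
    risk_score: float      # r_j in [0, 10]; larger means lower risk
    evidence: List[str]    # e_j: cited input factors / rationale

Agent = Callable[[Dict[str, float], List[AgentOutput]], AgentOutput]

def run_committee(features: Dict[str, float], agents: List[Agent]) -> List[AgentOutput]:
    blackboard: List[AgentOutput] = []               # S_0 is empty
    for agent in agents:
        out = agent(features, list(blackboard))      # agent sees upstream outputs only
        if not 0.0 <= out.risk_score <= 10.0:
            raise ValueError(f"{out.agent}: score outside [0, 10]")
        blackboard.append(out)                       # S_{t+1} = [S_t || o_{t+1}]
    return blackboard

# Hypothetical stub agents standing in for the fine-tuned specialists.
def macro_agent(features, upstream):
    return AgentOutput("Macro", 7.0, ["regional_unemployment"])

def history_agent(features, upstream):
    score = 3.0 if features["inquiries_90d"] > 6 else 8.0
    return AgentOutput("History", score, ["inquiries_90d"])

def fraud_agent(features, upstream):
    flagged = [o.agent for o in upstream if o.risk_score < 4.0]
    return AgentOutput("Fraud", 5.0 if flagged else 9.0,
                       ["upstream:" + name for name in flagged])

board = run_committee({"inquiries_90d": 9}, [macro_agent, history_agent, fraud_agent])
```

Here the fraud stub lowers its score because the history stub, which ran earlier, already published a low score; in a parallel layout it would not see that output.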

Layer 3: Decision Fusion

Specialist agents may disagree, and the final operating point must balance default prevention against approval yield. Layer 3 aggregates all specialist outputs into a single binary decision via attention-weighted scoring, threshold optimization, and deterministic vetoes.
Step 1: Attention-weighted risk score. A global risk score $S_{\text{final}}$ is computed as in Eq. (2):
$$S_{\text{final}} = \sigma\Big(\sum_{j=1}^{n} \alpha_j r_j\Big), \qquad \alpha_j = \frac{\exp\big(\text{MLP}(E(e_j))\big)}{\sum_{k} \exp\big(\text{MLP}(E(e_k))\big)} \tag{2}$$
where $n = |\mathcal{A}| = 6$, $E : \mathcal{E}^* \to \mathbb{R}^{d_e}$ maps each evidence record to a frozen backbone embedding, $\text{MLP}(\cdot)$ outputs a scalar relevance logit, and $\sigma$ is the sigmoid function. The MLP is the only trainable component in this step; both the fusion weights $\alpha_j$ and the decision threshold are input-dependent.
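The fusion score in Eq. (2) reduces to a softmax over per-agent relevance logits followed by a sigmoid of the weighted sum. The sketch below assumes the logits are already available as plain numbers (in the paper they come from an MLP over frozen evidence embeddings, which is not reproduced here); the example scores and logits are made up.

```python
import numpy as np

def fuse_scores(risk_scores, relevance_logits):
    """Return (S_final, alpha) for one case, following Eq. (2)."""
    logits = np.asarray(relevance_logits, dtype=float)
    weights = np.exp(logits - logits.max())   # shift for numerical stability
    weights /= weights.sum()                  # alpha_j, sums to 1
    pooled = float(np.dot(weights, np.asarray(risk_scores, dtype=float)))
    s_final = 1.0 / (1.0 + np.exp(-pooled))   # sigmoid
    return s_final, weights

# Hypothetical specialist scores (Macro ... Fraud) and relevance logits.
s_final, alpha = fuse_scores([7.0, 3.0, 5.0, 6.0, 8.0, 4.0],
                             [0.2, 1.5, 0.1, 0.3, 0.0, 2.0])
```

One consequence worth noting: if the scores are used on the raw $[0, 10]$ scale, the weighted sum is non-negative and the sigmoid output lies in $[0.5, 1)$. Whether scores are rescaled before fusion is not stated in the text, so the sketch makes no assumption about it.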
Step 2: Threshold optimization via GRPO. We formulate threshold selection as a contextual bandit: the policy $\pi_\theta$ predicts a case-specific threshold $\tau^* = \pi_\theta(S_{\text{shared}})$ from the shared blackboard state. We optimize $\pi_\theta$ with Group Relative Policy Optimization (GRPO), as shown in Eq. (3):
$$J(\theta) = \mathbb{E}\Big[\sum_{i=1}^{G} \min\big(\rho_i(\theta)\,\hat{A}_i,\ \text{clip}(\rho_i(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_i\big)\Big] \tag{3}$$
where $G$ is the group size (sampled threshold actions), $\rho_i(\theta)$ is the importance ratio between the current and reference policy, $\hat{A}_i$ is the advantage relative to the group mean reward, and $\epsilon$ is the clipping radius. The reward signal combines $\text{BEC}_{\text{pre}}$ (defined in Section 3.4) with a stability regularizer:
$$R(\tau, S_{\text{shared}}) = \text{BEC}_{\text{pre}}(\tau) - \lambda_\tau \cdot |\tau - \bar{\tau}|$$
where $\bar{\tau}$ is the running mean threshold from the reference policy and $\lambda_\tau = 0.05$ penalizes both overly conservative and overly aggressive deviations. This ensures the policy explores around a sensible operating point without collapsing to extreme thresholds. In practice, we sample $G = 4$ candidate thresholds per case during training (sensitivity to this choice is analyzed in Section 4).
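To make the reward and the group-relative advantage concrete, the following sketch scores a group of $G = 4$ candidate thresholds. Several pieces are assumptions: the $\text{BEC}_{\text{pre}}$ values per threshold are invented, advantages are divided by the group standard deviation (a common GRPO normalization that the text does not spell out), and the surrogate is summed over the group as written in Eq. (3).

```python
import numpy as np

def reward(bec_pre, tau, tau_bar, lam_tau=0.05):
    """R = BEC_pre(tau) - lambda_tau * |tau - tau_bar|."""
    return bec_pre - lam_tau * abs(tau - tau_bar)

def group_advantages(rewards, eps=1e-8):
    """Advantage of each sample relative to its group (mean-centred, std-scaled)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Sum over the group of min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    ratios = np.asarray(ratios, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(np.sum(np.minimum(unclipped, clipped)))

# Hypothetical group of G = 4 sampled thresholds for one case.
taus = [0.40, 0.50, 0.60, 0.70]
bec_pre_values = [0.70, 0.76, 0.74, 0.65]   # assumed, not from the paper
tau_bar = 0.55

rewards = [reward(b, t, tau_bar) for b, t in zip(bec_pre_values, taus)]
advantages = group_advantages(rewards)
objective = clipped_surrogate(np.ones(4), advantages)  # ratios are 1 at the reference policy
```

The clipping only matters once the policy has moved away from the reference: a ratio of 1.5 with a positive advantage is capped at 1.2 times the advantage, while the same ratio with a negative advantage is left unclipped.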
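Step 3: Hard-stop vetoes. The final decision $\hat{Y}$ combines the learned threshold with deterministic safety rules:
$$\hat{Y} = \begin{cases} 1, & S_{\text{final}} < \pi_\theta(S_{\text{shared}}) \ \lor\ \exists k \in \mathcal{V} : \text{Hard-Stop}(e_k) = 1 \\ 0, & \text{otherwise} \end{cases}$$
where $\mathcal{V} \subseteq \mathcal{A}$ is the set of veto-enabled agents, and Hard-Stop denotes deterministic rules over structured credit-bureau flags for catastrophic patterns (confirmed fraud, active litigation, identity inconsistency). Because larger specialist scores indicate lower risk, a case is rejected ($\hat{Y} = 1$) when the fused score falls below the threshold or when any veto fires. Vetoes are triggered by explicit rule matches on constrained specialist JSON outputs.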
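The Step 3 rule is small enough to state as code. The sketch assumes the reject-indicator encoding of this section (1 = Reject, 0 = Approve) and assumes that the fused score is compared as "reject when $S_{\text{final}}$ is below the threshold", which follows from the convention that larger specialist scores indicate lower risk; the agent names used as dictionary keys are illustrative.

```python
from typing import Dict

def decide(s_final: float, tau: float, hard_stops: Dict[str, bool]) -> int:
    """Return 1 (reject) or 0 (approve).

    hard_stops maps each veto-enabled agent to whether its deterministic
    Hard-Stop rule matched. Any match rejects regardless of the score.
    """
    if any(hard_stops.values()):
        return 1
    return 1 if s_final < tau else 0

vetoed = decide(0.90, 0.60, {"Fraud": True, "Basic": False})     # strong score, but veto fires
approved = decide(0.90, 0.60, {"Fraud": False, "Basic": False})  # above threshold, no veto
rejected = decide(0.50, 0.60, {"Fraud": False, "Basic": False})  # below threshold
```

Checking the veto before the threshold makes the safety property explicit: no value of the learned threshold can turn a vetoed case into an approval.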

3.2. Dataset

The evaluation set contains 6,000 real-world Chinese personal credit records, designed to test default prediction and risk reasoning over structured credit-bureau data. It is split into three groups of 2,000 cases each (G1: High-pass risk; G2: Low-approval recovery; G3: Rejection stability); see Appendices L and M. In credit data, only approved cases have ground-truth repayment outcomes; rejected cases require outcome construction. In our dataset, the label for approved cases is obtained from subsequent repayment performance within the observation window, and the label for rejected cases is obtained from external bureau follow-up and later cross-institution repayment records. The data is fully released as a practical, comprehensive dataset for evaluating whether a model can identify latent behavioral risk under realistic underwriting evidence.

3.3. Risk-Oriented Agent Optimization

Mapping structured financial features directly into an LLM prompt can introduce semantic distortion [24]. We address this with three enhancements. First, we define a textualization operator $T$ that maps the agent-relevant refined feature slice $x'_j \subseteq x'$ to a structured semantic prompt. This operator is our risk-oriented textualization (ROT) step: it converts numeric evidence into bounded risk statements that the specialist agents can reason over more consistently. For each agent $a_j$, we introduce a risk-discretization function $g : \mathbb{R} \to \{\text{Low}, \text{Medium}, \text{High}\}$, defined by business-driven quantiles [63]. The resulting prompt $P_j = T(x'_j, g(x'_j))$ incorporates causal templates (e.g., "Solvency risk [High], the Debt-to-Income ratio is 85%") to stabilize the model’s internal reasoning.
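A small Python sketch of the ROT step: a discretizer $g$ built from two cutoffs and a template $T$ that reproduces the example sentence above. The cutoffs (36 and 50) are placeholders chosen for illustration, not the institution's business-driven quantiles, and the function names are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence

LEVELS = ("Low", "Medium", "High")

def discretize(value: float, cutoffs: Sequence[float]) -> str:
    """g: map a numeric feature to Low/Medium/High using two ascending cutoffs.
    A value equal to a cutoff falls into the higher bucket."""
    return LEVELS[bisect_right(cutoffs, value)]

def textualize(dimension: str, feature_name: str, value: float,
               cutoffs: Sequence[float], unit: str = "%") -> str:
    """T: render one feature as a bounded risk statement."""
    level = discretize(value, cutoffs)
    return f"{dimension} risk [{level}], the {feature_name} is {value:g}{unit}"

# Placeholder cutoffs for Debt-to-Income, assumed for this example only.
DTI_CUTOFFS = (36.0, 50.0)
statement = textualize("Solvency", "Debt-to-Income ratio", 85, DTI_CUTOFFS)
```

With these placeholder cutoffs, a Debt-to-Income ratio of 85 is rendered as the template sentence quoted in the paragraph above.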
Second, we enforce an explicit risk reasoning mechanism through a tripartite Credit-CoT instruction template: (i) Risk Identification extracts risk-bearing signals from the textualized input; (ii) Evidence Aggregation weighs these signals against offsetting positive indicators and upstream blackboard context; and (iii) Decision Justification outputs a constrained JSON rationale and a scalar risk judgment. We ensure output compliance by constrained decoding to the target JSON schema.
Third, we construct the Credit-Instruct dataset, comprising expert-justified triples (input, CoT_reasoning, decision), and use PEFT via LoRA to adapt the backbone model [29]. The training objective is
$$\min_\theta\ \mathcal{L}_{\text{CE}}(y, \hat{y}) + \beta \cdot \text{DPO}(O_A, O_B)$$
where $y$ is the ground-truth supervision label for the current agent task, $\hat{y}$ is the predicted label, $\beta$ controls the strength of preference alignment, and $\text{DPO}(O_A, O_B)$ is a Direct Preference Optimization term derived from expert pairwise rankings over a preferred output $O_A$ and a dispreferred output $O_B$ (construction details in Appendix D). In practice, the cross-entropy term supervises decision correctness, while the DPO term encourages preference-aligned reasoning traces.
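The combined objective can be illustrated for a single training example. The sketch uses the standard DPO loss on sequence log-probabilities under the policy and a frozen reference model; the weight `beta` and the DPO `temperature` are hypothetical values, since the text does not report them, and `temperature` is a separate knob from the paper's $\beta$.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_term(logp_a, logp_b, ref_logp_a, ref_logp_b, temperature=0.1):
    """Standard DPO loss for a preferred output A and a dispreferred output B."""
    margin = (logp_a - ref_logp_a) - (logp_b - ref_logp_b)
    return -log_sigmoid(temperature * margin)

def cross_entropy(p_true_label: float) -> float:
    """CE for one example, given the probability assigned to the true label."""
    return -math.log(p_true_label)

def agent_loss(p_true_label, logp_a, logp_b, ref_logp_a, ref_logp_b,
               beta=0.5, temperature=0.1):
    """L_CE + beta * DPO for one example (beta and temperature are assumed values)."""
    return cross_entropy(p_true_label) + beta * dpo_term(
        logp_a, logp_b, ref_logp_a, ref_logp_b, temperature)

# Policy prefers A more strongly than the reference does, so the DPO term shrinks.
neutral = dpo_term(-2.0, -2.0, -2.0, -2.0)
aligned = dpo_term(-1.0, -3.0, -2.0, -2.0)
loss = agent_loss(0.9, -1.0, -3.0, -2.0, -2.0)
```

When the policy and the reference agree, the DPO term equals $\log 2$; it falls below that as the policy shifts probability toward the expert-preferred output.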

3.4. Decision Setting and Business Efficiency Coefficient

Credit underwriting is an asymmetric decision problem: false approvals can trigger substantial default loss, while false rejections reduce approval yield and customer acquisition. In practice, many applicants are easy to classify because rules and standard models already agree on clearly safe or clearly risky cases. The main operational challenge is the set of approval-boundary cases, where predicted risk is close to the action threshold. Performance on these cases matters disproportionately, so beyond overall accuracy we evaluate decision quality with a task-specific utility metric, the Business Efficiency Coefficient (BEC), and also report per-group accuracy. The hierarchical design is intended to improve exactly these finely balanced decisions by preserving rare but important specialist signals. Standard metrics such as AUC [35] and F1-score [15] do not capture asymmetric business cost. We therefore define the average misclassification cost as in Eq. (4):
$$\text{AvgCost} = \frac{1}{N} \sum_{i=1}^{N} \Big[ C_{\text{FN}}\, \mathbb{I}(y_i = 1, \hat{y}_i = 0) + C_{\text{FP}}\, \mathbb{I}(y_i = 0, \hat{y}_i = 1) \Big], \tag{4}$$
where $y_i \in \{0, 1\}$ is the ground truth (1 = Default) and $\hat{y}_i \in \{0, 1\}$ is the model decision (1 = Reject). Following expert guidance in Appendix L, we set $C_{\text{FN}} = 3$ and $C_{\text{FP}} = 1$, reflecting that default loss is estimated to be about three times the opportunity cost of rejecting a good applicant. We also test robustness to this ratio in Section 4. Letting $\text{MaxCost} = \max(C_{\text{FN}}, C_{\text{FP}})$, we define
$$\text{BEC}_{\text{pre}} = 1 - \frac{\text{AvgCost}}{\text{MaxCost}}.$$
To reflect the stability of specialist reasoning, we add a consistency penalty based on the standard deviation of the six Layer 2 risk scores $\{r_j\}_{j=1}^{6}$:
$$\text{BEC} = \max\big(0,\ \text{BEC}_{\text{pre}} \times (1 - \lambda_{\text{bec}} \cdot \sigma_{\text{agents}})\big),$$
where $\lambda_{\text{bec}} = 0.1$ in the main experiments. BEC rewards decisions that are both cost-effective and stable across risk perspectives. Because BEC is aligned with the Layer 3 objective, we additionally report Accuracy and per-group accuracies as complementary checks (Proof in Appendix A).
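A short worked example of the metric in Python. Two details are assumptions rather than statements from the text: specialist scores are taken to be rescaled to $[0, 1]$ before computing $\sigma_{\text{agents}}$, and the per-case standard deviations are averaged over the evaluation set to obtain a single $\sigma_{\text{agents}}$.

```python
import numpy as np

def business_efficiency(y_true, y_pred, agent_scores,
                        c_fn=3.0, c_fp=1.0, lam_bec=0.1):
    """Return (BEC, BEC_pre, sigma_agents).

    y_true: 1 = default, 0 = repaid.  y_pred: 1 = reject, 0 = approve.
    agent_scores: array of shape (N, 6), one row of specialist scores per case,
    assumed here to be rescaled to [0, 1].
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))   # defaulter approved
    fp = np.sum((y_true == 0) & (y_pred == 1))   # good applicant rejected
    avg_cost = (c_fn * fn + c_fp * fp) / len(y_true)
    bec_pre = 1.0 - avg_cost / max(c_fn, c_fp)
    per_case_std = np.std(np.asarray(agent_scores, dtype=float), axis=1)
    sigma_agents = float(per_case_std.mean())    # aggregation is an assumption
    bec = max(0.0, bec_pre * (1.0 - lam_bec * sigma_agents))
    return bec, bec_pre, sigma_agents

y_true = [1, 0, 0, 1]
y_pred = [1, 0, 1, 0]          # one false rejection, one false approval

consensus = np.full((4, 6), 0.5)                             # agents fully agree
split = np.tile([0.0, 1.0, 0.0, 1.0, 0.0, 1.0], (4, 1))      # agents maximally split

bec_consensus, bec_pre, _ = business_efficiency(y_true, y_pred, consensus)
bec_split, _, sigma_split = business_efficiency(y_true, y_pred, split)
```

In this toy case the average cost is $(3 + 1)/4 = 1$, so $\text{BEC}_{\text{pre}} = 1 - 1/3 = 2/3$. Full agreement leaves BEC at $2/3$, while the split committee ($\sigma_{\text{agents}} = 0.5$) scales it by $0.95$.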

4. Experiments

We evaluate CreditAgent on a proprietary evaluation set designed to stress risk-boundary decisions in real-world lending. The set contains 6,000 temporally later, expert-validated Chinese personal credit cases partitioned into three groups that emphasize different underwriting regimes. The SFT stage uses a separate 40,000-case training corpus with a strict temporal cutoff before the evaluation window, ensuring no data leakage. Proprietary baselines (GPT-5.1, Claude-4.5, Gemini-3-Flash) are evaluated as zero-shot API models under the same input schema and prompting protocol; open-source backbones are run locally; CreditAgent variants receive task-specific SFT/GRPO adaptation (Section 3). Proprietary models serve to characterize untuned LLM behavior; the controlled architectural comparison is made within the CreditAgent variants. Model configurations are given in Appendix E.
Table 2. Agent specialization and input/output schema in Layer 2.
| Agent | Core Input | Focus | Risk Metric |
| Macro | PESTEL | Systematic | Market sensitivity |
| Basic | KYB/KYC | Identity | Fraud suspicion |
| History | Credit bureau | Velocity | Default risk |
| Solvency | Cash flow | Assets/debt | Liquidity buffer |
| Willingness | Behavioral | Intent | Sincerity |
| Fraud | Anomaly | Adversarial | Synthetic patterns |

Evaluation Metrics

Standard metrics alone do not capture asymmetric financial error costs. We therefore adopt a triple-metric evaluation: BEC, Accuracy, and Stability Factor ($\sigma$). Concretely, BEC converts false-negative and false-positive costs into a normalized $[0, 1]$ utility score using the cost matrix introduced in Section 3, with $C_{\text{FN}} : C_{\text{FP}} = 3 : 1$, so that larger values indicate better business utility under asymmetric lending risk. Accuracy measures raw prediction quality, while $\sigma$ is the standard deviation of group-wise accuracies and reflects regime imbalance. Because the Layer 3 policy is trained toward this cost-sensitive utility, we report Accuracy and per-group accuracies as independent checks that are not the direct optimization target.

4.1. Controlled Comparison Protocol

Our primary goal is not to maximize leaderboard performance across unrelated model families, but to isolate the effect of decision fusion in a hierarchical underwriting pipeline. Accordingly, all main ablations use a controlled protocol in which the adapted backbone (Seed-OSS-36B-Instruct with the same domain adaptation), the evidence-filtering stage, the six specialist agents and their output schema, the shared-blackboard execution order, the institution-specific hard-stop/veto rules, and the train/validation/test split are held fixed. The only component that changes is the fusion strategy: simple averaging, Dempster–Shafer theory (DST), SFT-only fusion, or SFT+GRPO fusion. This design allows us to attribute performance differences more credibly to policy-aware fusion rather than to upstream implementation variation. For completeness, we also report untuned proprietary LLMs as zero-shot reference points, but because these differ in pretraining, alignment, and prompting interface, they serve only to contextualize task difficulty rather than support causal claims.

4.2. Main Results and Discussion

The upper rows of Table 3 report untuned LLMs under a matched input protocol and are included as reference points only. These models are often highly uneven across risk regimes: for example, DeepSeek-R1-Distill-32B reaches 90.90% on the rejection-heavy Group 3 but only 31.05% on the approval-sensitive Group 2, yielding a BEC of 0.5433. The lower rows provide the controlled comparison, where the adapted backbone is fixed and only the fusion method changes. Under this setting, CreditAgent (GRPO) achieves the best overall utility, with BEC 0.7647, accuracy 83.32%, and Group 2 accuracy 66.25%. Classical tabular baselines on the same feature set and temporal split are notably weaker, reaching BEC 0.6394 for XGBoost and 0.6488 for Random Forest, with Group 2 accuracies of 50.68% and 37.14%, respectively. These results suggest that the hierarchical design is especially effective on approval-boundary cases, where decision quality matters most.

Statistical Reliability

We assess robustness over three random seeds for the GRPO stage, varying initialization and rollout sampling. CreditAgent (GRPO) obtains BEC 0.7647 ± 0.0041 and accuracy 83.32% ± 0.38% (mean ± standard deviation), while Group 2 accuracy is 66.25% ± 0.72% with a 95% bootstrap confidence interval of [64.8%, 67.7%]. A paired bootstrap test [6] shows that GRPO fusion significantly outperforms SFT+Avg on both BEC and accuracy over the full 6,000-case evaluation set (p < 0.01), and on Group 2 accuracy specifically (p < 0.005).
Its BEC gain over DST is also significant at p < 0.01, although DST remains more stable in terms of $\sigma$. Sensitivity analysis further shows that when the cost ratio $C_{\text{FN}} : C_{\text{FP}}$ varies from 2:1 to 5:1, GRPO retains the highest BEC throughout, and its advantage over SFT+Avg widens from ΔBEC = 0.054 to ΔBEC = 0.098 as asymmetry increases. Figure 3 shows that our agents' assessments align with human experts' evaluations.
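For readers unfamiliar with the procedure, one common form of the paired bootstrap is sketched below. It is not necessarily the exact variant of [6]: here the same resampled case indices are applied to both systems, and the p-value estimate is the fraction of resamples in which system A fails to beat system B on accuracy. The synthetic correctness vectors are invented for the example.

```python
import numpy as np

def paired_bootstrap_pvalue(correct_a, correct_b, n_resamples=2000, seed=0):
    """Estimate a one-sided p-value for 'A is more accurate than B'.

    correct_a, correct_b: 0/1 per-case correctness of the two systems on the
    same evaluation cases, in the same order.
    Returns (p_value_estimate, observed_accuracy_gap).
    """
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(a)
    idx = rng.integers(0, n, size=(n_resamples, n))       # shared indices keep the pairing
    deltas = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    return float(np.mean(deltas <= 0.0)), float(a.mean() - b.mean())

# Synthetic example: A is right on 80 of 100 cases, B on a 60-case subset of those.
correct_a = [1] * 80 + [0] * 20
correct_b = [1] * 60 + [0] * 40

p_value, gap = paired_bootstrap_pvalue(correct_a, correct_b)
```

The same function applies to BEC by replacing per-case correctness with per-case cost contributions, although the consistency penalty would then need to be recomputed per resample.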

Contribution of the Fusion Layer

Comparing SFT-only, average fusion, DST fusion, and GRPO fusion isolates the effect of the final decision layer. Committee-based fusion already outperforms plain SFT, indicating that the gain is not simply due to longer responses or larger context. DST is the most stable variant ($\sigma = 0.0550$), but it yields lower utility than the learned GRPO policy.
To test whether the same specialist outputs could be fused equally well by a non-LLM model, we train a cost-sensitive XGBoost classifier with $C_{\text{FN}} : C_{\text{FP}} = 3 : 1$ on the six specialist risk scores $\{r_j\}$ and embedded evidence vectors $\{E(e_j)\}$. This baseline reaches BEC 0.7043, above SFT-only (0.6190) and average fusion (0.6862) but below GRPO (0.7647), with the largest gap on Group 2 (58.90% vs. 66.25%). This pattern suggests that GRPO’s advantage comes from input-dependent threshold adaptation, not merely from stronger scoring. The training curves in Figure 4 tell the same story. Layer 2 SFT converges quickly, with loss falling from 1.0181 to 0.1559 over 300 steps, suggesting that the specialist agents learn stable role-specific reasoning. Layer 3 GRPO then improves the operating point, with reward increasing from about 3.6 to a peak of 5.044 at step 170 while validation reward closely tracks training reward. Together, these curves support the interpretation that the final gains come from optimizing the decision layer rather than from overfitting.

4.3. Ablation Study

To evaluate how each hierarchical component contributes to the final model, we conduct an ablation study using Seed-OSS-36B-Instruct as a fixed backbone. Table 4 measures how performance changes when specialist agents or the full Layer 2 committee are removed.
Removing Layer 2 causes the largest degradation, dropping BEC from 0.7647 to 0.5089 and accuracy from 83.32% to 60.34%, which shows that the gain does not come mainly from a stronger final classifier. Removing the Macro, History, and Willingness agents still leaves a usable system but lowers BEC by 0.1359 and Group 2 accuracy from 66.25% to 45.60%, indicating that these agents carry a substantial part of the approval-boundary signal. Removing only the Macro Agent has a smaller average effect but increases regime imbalance from 0.1242 to 0.1636, suggesting that macro context mainly stabilizes the decision boundary across groups. The ablation supports the view that the specialist agents supply the decision-relevant signals and that the hierarchical multi-agent system converts them into a utility-aware decision rule.

Blackboard Effect on Agent Consistency

The sequential blackboard allows each agent to condition on upstream findings, and its effect is substantial. Compared with a parallel variant in which all six agents receive only $x'$ and cannot access prior outputs, sequential execution increases the mean pairwise Pearson correlation among risk scores $\{r_j\}$ from 0.31 to 0.47 and reduces inter-agent dispersion $\sigma_{\text{agents}}$ from 0.182 to 0.134. More importantly, the downstream fusion layer reaches BEC 0.7647 on sequential outputs but only 0.6412 on parallel outputs, with an even larger gap on Group 2 (66.25% vs. 52.10%). The extended simplification study in Table 5 reinforces this result: removing the blackboard causes the largest single drop in BEC, showing that inter-agent information propagation, rather than agent count alone, is a key source of the gain. The weaker performance of the GBDT + Layer 3 hybrid further suggests that Layer 3 cannot fully compensate for the loss of specialist qualitative reasoning in Layer 2. These results indicate that the blackboard yields not only greater agreement among agents but also agreement that is more informative for fusion on boundary cases.

4.4. Sensitivity Analysis

The sensitivity study tests whether the GRPO gain is brittle or reflects a sensible operating region. We use the Seed-OSS-36B-Instruct backbone throughout to isolate optimization effects. Figure 5 shows a consistent pattern: moderate regularization outperforms no regularization, moderate clipping (0.2, 0.28) outperforms both narrower and wider trust regions, and moderate sampling diversity (p = 0.7) outperforms both lower and higher diversity. The sample-size sweep shows the clearest utility-compute trade-off: moving from n = 2 to n = 4 yields a large gain, while n = 8 adds only marginal benefit at noticeably higher cost. Figure 6 shows that GRPO remains robust across cost ratios.

4.5. Exploratory Fairness Discussion

Fairness is important in credit decision-making, but it is not a primary validated claim of this paper [37,59]. We therefore report only a limited post-hoc diagnostic. As shown in Table 6, on a held-out 1,500-case subset, CreditAgent exhibits substantially lower approval-rate disparity across age×gender groups (DPD = 3.1) than XGBoost (28.7) and GPT-5.1 (20.3), suggesting more uniform behavior across subgroups. This pattern is qualitatively consistent with greater reliance on explicit financial-behavior evidence rather than shortcut subgroup correlations. However, the subgroup sample sizes are too small for strong statistical conclusions, and we do not claim counterfactual fairness, calibration parity, or recourse guarantees [16,42]. Likewise, the architecture logs intermediate agent outputs that may aid human review.
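The disparity diagnostic can be computed as below. The exact definition of DPD used for Table 6 is not given in this section, so the sketch assumes the common one: the gap between the highest and lowest subgroup approval rates, expressed in percentage points. The group labels and decisions are toy data.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple

def demographic_parity_difference(
    decisions: Sequence[int], groups: Sequence[Hashable]
) -> Tuple[float, Dict[Hashable, float]]:
    """Return (max - min approval rate in percentage points, per-group rates).

    decisions use the paper's encoding: 1 = reject, 0 = approve.
    """
    approved = defaultdict(int)
    total = defaultdict(int)
    for decision, group in zip(decisions, groups):
        total[group] += 1
        approved[group] += int(decision == 0)
    rates = {g: 100.0 * approved[g] / total[g] for g in total}
    return max(rates.values()) - min(rates.values()), rates

# Toy data: two subgroups of four applicants each.
decisions = [0, 0, 1, 1, 0, 1, 1, 1]
groups = ["under40_F"] * 4 + ["under40_M"] * 4

dpd, rates = demographic_parity_difference(decisions, groups)
```

With small subgroups a single decision moves the rate by many points, which is one reason the text treats the reported gaps as exploratory.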

5. Conclusion

This work presented CreditAgent, a hierarchical multi-agent framework for high-stakes credit decision-making that mirrors institutional review practice by separating evidence organization, specialist risk assessment, and policy-aware decision fusion. Across a temporally ordered real-world benchmark, the full system consistently outperformed strong open and proprietary baselines, with the clearest gains appearing on approval-boundary cases, where decisions are both most difficult and most consequential. Our analyses further show that specialist decomposition is the primary source of performance improvement, while adaptive fusion and deterministic hard-stop rules are both necessary to translate specialist judgments into business-aligned and risk-sensitive final decisions.
Beyond overall utility, the results support a broader claim: for structured financial decision-making, architectural alignment with domain workflow can yield stronger and more stable behavior than monolithic reasoning alone. Additional analyses further strengthen this picture by showing competitiveness against classical ML baselines on matched structured inputs, prompt-level transfer across domains, and reduced demographic disparity relative to strong tabular and monolithic LLM baselines. Taken together, these findings suggest that hierarchical committee-style agent systems are a promising direction for reliable decision support in regulated, high-stakes settings.

Appendix A. Analytical Properties of the Decision Rule

This appendix records simple analytical properties of the proposed decision rule. These properties are not intended as general guarantees about credit-risk prediction accuracy, fairness, or deployment safety. Rather, they clarify deterministic consequences of the fusion design, such as the behavior of hard-stop vetoes and the effect of conditioning specialist outputs on shared context.
Table A1. Summary of theoretical guarantees for the CreditAgent architecture.
| Objective | Formal Statement | Reference |
| Variance Reduction | $\mathbb{E}[\text{Var}(R_j \mid x', S_{\text{shared}})] \le \mathbb{E}[\text{Var}(R_j \mid x')]$ | Theorem A1 |
| Absolute Safety | $V_k = 1 \Rightarrow \hat{Y} = 0$ deterministically | Theorem A2 |
| Consensus Gradient | $\partial\, \text{BEC} / \partial\, \sigma_{\text{agents}} < 0$ when profitable | Theorem A3 |

Appendix A.1. Preliminaries and Notation

We adopt notation consistent with the main paper (Section 3). Let $F : \mathcal{X} \to \mathcal{Y}$ denote the hierarchical decision function decomposed into three latent spaces: Data Gating ($\mathcal{Z}_1$), Domain Expertise ($\mathcal{Z}_2$), and Decision Fusion ($\mathcal{Z}_3$).
Definition A1 
(Credit Decision Environment). Let $(\Omega, \mathcal{F}, P)$ be a probability space. We define:
  • $Y \in \{0, 1\}$: True repayment outcome, where $Y = 1$ denotes default
  • $x \in \mathcal{X}$: Raw input feature set (potentially $> 10^5$ dimensions)
  • $x' = x \odot \mathbb{I}[\text{XGB}(x, y) > \gamma]$: Refined evidence tensor from Layer 1 (Eq. 1)
  • $\mathcal{A} = \{a_1, \dots, a_6\}$: Set of Layer 2 specialist agents (Macro, Basic, History, Solvency, Willingness, Fraud)
  • $o_j = \{r_j, e_j\}$: Output tuple of agent $a_j$, where $r_j \in [0, 1]$ is the Risk Magnitude and $e_j \in \mathcal{E}^*$ is the Evidence Chain
Definition A2 
(Hierarchical Context State). Following the State-Aware Routing Framework (Section 3.2), the global context state is updated recursively:
$$S_{t+1} = [S_t \,\|\, o_{t+1}]$$
We define the shared context available to agent $a_j$ at position $j$ in the routing sequence as:
$$S_{\text{shared}}^{(j)} = [o_1 \,\|\, \cdots \,\|\, o_{j-1}]$$
where ∥ denotes blackboard concatenation of previously published agent outputs. In particular, the outputs of the Macro Agent and Basic Agent often appear early in the routing order and therefore become part of the context received by later agents.
Definition A3 
(Architecture Comparison). The relationship between specialized agents is categorized by their information-sharing protocols:
Parallel Architecture ($\mathcal{P}$):
Agents operate in isolation. Each agent $a_j$ computes independently:
$$r_j^{\mathcal{P}} = f_j(x')$$
where no information is exchanged between agents.
Hierarchical Architecture ($\mathcal{H}$):
Agents operate via a shared state. Each agent $a_j$ computes its output conditioned on the global context:
$$r_j^{\mathcal{H}} = f_j\big(x' \,\|\, S_{\text{shared}}^{(j)}\big)$$
where $S_{\text{shared}}^{(j)}$ denotes the state of the Blackboard at the time of invocation.

Appendix A.2. Information Processing and Variance Reduction

We represent specialized agents as estimators of the hidden risk state $Y$. Let $x'$ be the refined feature vector from Layer 1.
Theorem A1 (Contextual Stability). The hierarchical conditioning of specialized agents on the shared context $S_{\text{shared}}$ reduces the expected conditional variance compared to a parallel architecture. Formally, let $R_j$ be the risk magnitude predicted by specialist agent $a_j$. Then:
$$\mathbb{E}\big[\mathrm{Var}(R_j \mid x', S_{\text{shared}})\big] \le \mathbb{E}\big[\mathrm{Var}(R_j \mid x')\big]$$
with equality if and only if $R_j \perp S_{\text{shared}} \mid x'$ (i.e., the shared context provides no additional information beyond $x'$).
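Proof. By the Law of Total Variance, we decompose the variance:
$$\mathrm{Var}(R_j \mid x') = \mathbb{E}\big[\mathrm{Var}(R_j \mid x', S_{\text{shared}}) \mid x'\big] + \mathrm{Var}\big(\mathbb{E}[R_j \mid x', S_{\text{shared}}] \mid x'\big)$$
Since variance is non-negative:
$$\mathrm{Var}\big(\mathbb{E}[R_j \mid x', S_{\text{shared}}] \mid x'\big) \ge 0$$
This directly implies:
$$\mathbb{E}\big[\mathrm{Var}(R_j \mid x', S_{\text{shared}}) \mid x'\big] \le \mathrm{Var}(R_j \mid x')$$
Taking expectations over $x'$ yields the result. Equality holds iff $\mathbb{E}[R_j \mid x', S_{\text{shared}}]$ is $x'$-measurable, i.e., when $S_{\text{shared}}$ provides no additional predictive information. □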
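As a sanity check on the decomposition used in the proof, the following sketch evaluates both sides exactly for a small discrete example with one fixed feature vector. The joint distribution is made up purely for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(s, r) of (S_shared, R_j) for one fixed x'.
joint = {
    (0, F(1, 5)): F(1, 4), (0, F(2, 5)): F(1, 4),
    (1, F(3, 5)): F(1, 4), (1, F(4, 5)): F(1, 4),
}


def mean_var(dist):
    """Mean and variance of a {value: weight} distribution (weights are normalized)."""
    total = sum(dist.values())
    mean = sum(v * p for v, p in dist.items()) / total
    var = sum(p * (v - mean) ** 2 for v, p in dist.items()) / total
    return mean, var


# Parallel case: variance of R_j without the shared context.
marginal = {}
for (s, r), p in joint.items():
    marginal[r] = marginal.get(r, 0) + p
_, var_parallel = mean_var(marginal)

# Hierarchical case: distribution of R_j given each value of S_shared.
p_s, cond = {}, {}
for (s, r), p in joint.items():
    p_s[s] = p_s.get(s, 0) + p
    cond.setdefault(s, {})[r] = p
stats = {s: mean_var(d) for s, d in cond.items()}

expected_cond_var = sum(p_s[s] * stats[s][1] for s in p_s)   # E[Var(R | S)]
overall_mean = sum(p_s[s] * stats[s][0] for s in p_s)
var_of_cond_mean = sum(p_s[s] * (stats[s][0] - overall_mean) ** 2 for s in p_s)
```

Here the unconditional variance is 1/20, which splits into an expected conditional variance of 1/100 and a between-context term of 1/25, so conditioning on the shared context removes most of the spread in this example.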
Remark A1 (Interpretation: Diluted Attention Prevention). The variance reduction quantifies the prevention of “diluted attention” described in Section 1 and addressed by the Shared Blackboard Architecture in Section 3.1. By conditioning on $S_{\text{shared}}$, specialized agents avoid the information fragmentation that occurs in single-agent LLM systems processing high-dimensional credit data.

Appendix A.3. The Hard-Stop Mechanism

As defined in Eq. (4), the final decision is governed by a weighted score gated by a “Hard-Stop” rule. We prove this ensures high-stakes safety.
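Definition A4 (Decision Function with Veto). Following Eq. (4), the final loan decision $\hat{Y}: \mathcal{X} \times \mathcal{S} \to \{0, 1\}$ is defined as:
$$\hat{Y} = \begin{cases} 1 & \text{if } S_{\text{final}} \ge \pi_\theta(S_{\text{shared}}) \text{ and } \forall k \in \mathcal{V},\ \mathrm{HardStop}(e_k) = 0 \\ 0 & \text{otherwise} \end{cases} \tag{A1}$$
where:
  • $S_{\text{final}} = \sigma\big(\sum_{j=1}^{n} \alpha_j \cdot r_j\big)$ with $n = |\mathcal{A}| = 6$ is the weighted risk score from Layer 2 (Eq. 2)
  • $\pi_\theta(S_{\text{shared}})$ is the GRPO policy-adjusted threshold
  • $\mathcal{V} \subseteq \mathcal{A}$ is the critical veto-agent set
  • $\mathrm{HardStop}(e_k)$ represents deterministic rejection rules (e.g., active litigation) extracted from SFT agents
We express the decision function equivalently as:
$$\hat{Y} = \mathbb{I}\big[S_{\text{final}} \ge \pi_\theta(S_{\text{shared}})\big] \times \prod_{k \in \mathcal{V}} \big(1 - V_k(x, S_{\text{shared}})\big) \tag{A2}$$
where $V_k \in \{0, 1\}$ is the binary veto indicator for critical agent $k$.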
Theorem A2 (Boundary Safety). The CreditAgent architecture guarantees that no positive indicators from specialist agents can override a deterministic fraud rejection. Formally, if any veto condition is triggered:
$$\exists\, k \in \mathcal{V}: V_k = 1 \implies \hat{Y} = 0$$
Proof. Suppose $\exists\, k^* \in \mathcal{V}$ such that $V_{k^*}(x, S_{\text{shared}}) = 1$ (i.e., $\mathrm{HardStop}(e_{k^*})$ is triggered). Then, from Eq. (A2):
$$\prod_{k \in \mathcal{V}} \big(1 - V_k(x, S_{\text{shared}})\big) = (1 - V_{k^*}) \cdot \prod_{k \neq k^*} (1 - V_k) = 0$$
Therefore $\hat{Y} = \mathbb{I}[S_{\text{final}} \ge \pi_\theta(S_{\text{shared}})] \times 0 = 0$, regardless of $S_{\text{final}}$. □
Proposition A1 (Diluted Attention Resistance). Let $\mathcal{A}^+ = \{a_j : r_j > \tau_j^{\text{good}}\}$ be the set of agents giving positive assessments. The architecture satisfies:
$$|\mathcal{A}^+| = n - 1 \ \wedge\ V_k = 1 \text{ for some } k \in \mathcal{V} \implies \hat{Y} = 0$$
That is, even a unanimous positive assessment from the other $n - 1$ agents cannot override the Fraud Agent's veto.
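The gating behaviour stated in Theorem A2 and Proposition A1 can be illustrated with a minimal sketch. The function name, the weights, the threshold value, and the direction of the threshold comparison are assumptions made for illustration. In the paper the threshold comes from the GRPO policy and the vetoes from the hard-stop rules.

```python
import math
from typing import Dict, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def final_decision(scores: Sequence[float], weights: Sequence[float],
                   threshold: float, vetoes: Dict[str, bool]) -> int:
    """Return 1 only if the fused score clears the threshold and no veto fired."""
    s_final = sigmoid(sum(a * r for a, r in zip(weights, scores)))
    passes_threshold = s_final >= threshold  # comparison direction is an assumption
    # Product of (1 - V_k): a single triggered veto zeroes the decision.
    veto_gate = math.prod(0 if fired else 1 for fired in vetoes.values())
    return int(passes_threshold) * veto_gate


# Hypothetical inputs: six agents with identical scores and unit weights.
scores = [0.9] * 6
weights = [1.0] * 6
no_veto = final_decision(scores, weights, threshold=0.5, vetoes={"fraud": False})
with_veto = final_decision(scores, weights, threshold=0.5, vetoes={"fraud": True})
```

With the same fused score, the decision flips from 1 to 0 as soon as the fraud veto is set, which is the behaviour the theorem describes.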

Appendix A.4. Utility Optimization via GRPO

To optimize the Business Efficiency Coefficient (BEC) defined in Eq. 5, the system must balance False Negatives ($C_{FN}$: approving a defaulting customer) and False Positives ($C_{FP}$: rejecting a good customer).
Definition A5 (Business Efficiency Coefficient). Following Eq. 4 and Eq. 5, the BEC is defined as:
$$\mathrm{BEC} = \max\big(0,\ \mathrm{BEC}_{\text{pre}} \times (1 - \lambda_{\text{bec}} \cdot \sigma_{\text{agents}})\big)$$
where:
$$\mathrm{BEC}_{\text{pre}} = 1 - \frac{\mathrm{AvgCost}}{\mathrm{MaxCost}}, \qquad \mathrm{MaxCost} = \max(C_{FN}, C_{FP})$$
and $\sigma_{\text{agents}}$ is the standard deviation of risk scores $\{r_1, \dots, r_n\}$ across Layer 2 agents.
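Theorem A3 (Consensus-Driven Optimization). The GRPO optimization with the BEC objective induces a gradient that monotonically decreases inter-agent variance when the system is profitable:
$$\frac{\partial\, \mathrm{BEC}}{\partial\, \sigma_{\text{agents}}} < 0 \quad \text{whenever } \mathrm{BEC}_{\text{pre}} > 0$$
Proof. From the BEC definition, on the region where the outer maximum is inactive (i.e., $1 - \lambda_{\text{bec}} \cdot \sigma_{\text{agents}} > 0$):
$$\frac{\partial\, \mathrm{BEC}}{\partial\, \sigma_{\text{agents}}} = \mathrm{BEC}_{\text{pre}} \cdot \frac{\partial}{\partial\, \sigma_{\text{agents}}} \big(1 - \lambda_{\text{bec}} \cdot \sigma_{\text{agents}}\big) = -\lambda_{\text{bec}} \cdot \mathrm{BEC}_{\text{pre}}$$
Since $\lambda_{\text{bec}} > 0$ (penalty intensity) and $\mathrm{BEC}_{\text{pre}} > 0$, we have $\frac{\partial\, \mathrm{BEC}}{\partial\, \sigma_{\text{agents}}} < 0$. □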
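A short numerical sketch of Definition A5 makes the consensus penalty concrete. The cost values, the average cost, and the penalty intensity below are hypothetical. The use of the population standard deviation is also an assumption, since the text does not say which estimator is used.

```python
from statistics import pstdev
from typing import Sequence


def bec(avg_cost: float, c_fn: float, c_fp: float,
        agent_risks: Sequence[float], lam: float) -> float:
    """BEC = max(0, BEC_pre * (1 - lam * sigma_agents))."""
    bec_pre = 1.0 - avg_cost / max(c_fn, c_fp)
    sigma_agents = pstdev(agent_risks)  # population std (assumption)
    return max(0.0, bec_pre * (1.0 - lam * sigma_agents))


# Hypothetical setting: C_FN = 5, C_FP = 1, AvgCost = 1, lambda_bec = 0.5.
consensus = bec(1.0, 5.0, 1.0, [0.5] * 6, lam=0.5)
dispersed = bec(1.0, 5.0, 1.0, [0.1, 0.9, 0.1, 0.9, 0.1, 0.9], lam=0.5)
```

With these numbers the pre-penalty value is 0.8. Full agreement leaves it unchanged, while a spread of 0.4 across agents lowers the BEC to about 0.64, matching the slope of $-\lambda_{\text{bec}} \cdot \mathrm{BEC}_{\text{pre}} = -0.4$ per unit of $\sigma_{\text{agents}}$.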
Remark A2 (Avoiding Degenerate Consensus). The GRPO optimization encourages agent consensus but is bounded by irreducible uncertainty $\sigma_{\min} = \mathbb{E}[\mathrm{Var}(Y \mid x', S_{\text{shared}})]$. The hierarchical architecture mitigates systematic bias by anchoring specialist agents on independently validated Macro and Basic context through the Shared Blackboard Architecture (Section 3.1).

Appendix B. Machine Learning LLM Enhanced Dynamic Filter Loop

In credit-underwriting scenarios, analyzing, processing, and predicting from high-dimensional structured raw data is critically important. Such datasets commonly contain hundreds or even thousands of feature variables describing a borrower’s profile, income, credit history, and consumption behavior. The core challenge is extracting meaningful predictive signals from this high-dimensional space, which is often contaminated with various forms of noise. We reframe this feature selection problem through an information-theoretic lens: the goal is to reconstruct the essential signal of a borrower’s creditworthiness from noisy observations by maximizing the transmitted information relevant to the target variable (e.g., default status) under constraints. The Layer-1 component of our Agent Framework introduces an automated, dynamic feature-selection pipeline designed to act as an iterative “denoising” and “information compression” system, as illustrated in Figure A1. Below, we describe this automated feature-selection agent in information-theoretic terms, followed by a theoretical summary and the experimental results of feature selection.
Figure A1. Pipeline of Feature Selection Agent.

Appendix B.1. Feature-Selection Pipeline

Appendix Data Cleaning: Enhancing the Signal-to-Noise Ratio

Large and complex datasets contain not only missing values, outliers, and format inconsistencies but also substantial redundancy and noise. We can model the observed raw data $X_{\text{raw}}$ as a combination of a latent true signal $X_{\text{true}}$ and an additive noise component $N$, such that $X_{\text{raw}} = X_{\text{true}} + N$. The noise $N$ encompasses random errors, missing entries, and inconsistencies. Manual cleaning is slow and prone to omissions. Our automated data-cleaning procedures aim to estimate and reduce $N$, thereby increasing the effective signal-to-noise ratio of the dataset. The output is a higher-fidelity, well-structured dataset $X_{\text{clean}}$ that serves as a superior basis for subsequent information extraction, satisfying $I(X_{\text{clean}}; Y) \ge I(X_{\text{raw}}; Y)$, where $I(\cdot\,;\cdot)$ denotes mutual information and $Y$ is the target.

Appendix Feature Analysis: Defining the Information Space

We perform a global analysis to understand the business objective (maximizing information about the target $Y$) and operational constraints, such as an upper bound on the number of allowable features (a form of bandwidth constraint). We then create initial feature meta-information. For each raw feature, we produce a human-interpretable description, effectively defining the alphabet and semantics for each information source. This step is foundational for calculating information-theoretic measures like entropy $H(X)$ and mutual information $I(X; Y)$. When data is distributed across multiple tables, we perform joins to consolidate information from multiple correlated sources, potentially increasing the total information available about $Y$.

Appendix Coarse Feature Filtering: Eliminating Semantic Noise Channels

The coarse screening stage performs an initial review of all candidate features to remove those constituting pure semantic noise. Features such as arbitrary IDs or session hashes are, by design, non-informative for predicting $Y$. From an information-theory perspective, their mutual information with the target is approximately zero:
$$I(F_{\text{id}}; Y) = H(Y) - H(Y \mid F_{\text{id}}) \approx 0$$
This indicates that knowing $F_{\text{id}}$ does not reduce the uncertainty $H(Y)$ about the target. According to the Data Processing Inequality, discarding such features incurs essentially no information loss about $Y$ and acts as a near-lossless compression step. We also remove duplicate fields and other non-informative attributes. A final manual audit confirms the reasonableness of this initial screening.
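The identity $I(F; Y) = H(Y) - H(Y \mid F)$ can be made concrete with a small plug-in estimate. The toy columns below are invented for illustration: `channel` is constructed to be unrelated to the target and `overdue_flag` to determine it.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(feature: Sequence[Hashable], target: Sequence[Hashable]) -> float:
    """I(F; Y) = H(Y) - H(Y | F), estimated from empirical frequencies."""
    n = len(target)
    h_y_given_f = 0.0
    for f_value, count in Counter(feature).items():
        subset = [y for f, y in zip(feature, target) if f == f_value]
        h_y_given_f += (count / n) * entropy(subset)
    return entropy(target) - h_y_given_f


y = [0, 0, 1, 1, 0, 0, 1, 1]
channel = ["a", "b", "a", "b", "a", "b", "a", "b"]   # unrelated to y by construction
overdue_flag = [0, 0, 1, 1, 0, 0, 1, 1]              # determines y by construction

mi_noise = mutual_information(channel, y)
mi_signal = mutual_information(overdue_flag, y)
```

In this example the unrelated column carries 0 bits about the target and the informative column carries the full 1 bit. A plug-in estimate of this kind is unreliable for columns with one distinct value per row, which is one reason identifier-like fields are better removed by rule than by estimated scores.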

Appendix Fine-Grained Feature Selection: Iterative Information Bottleneck

After coarse screening, we apply an iterative fine-grained selection process that embodies the principles of the Information Bottleneck method, seeking a compact representation $X_{\text{selected}}$ that preserves maximal information about $Y$ while minimizing redundancy. The process combines correlation-based pruning with XGBoost-based importance validation.
1. Redundancy Pruning: We compute the Pearson correlation matrix and remove features in the bottom 10%. Highly correlated features share large mutual information $I(F_i; F_j)$, implying redundancy about $Y$. Removing one of a correlated pair compresses the representation with minimal information loss regarding $Y$.
2. Relevance Validation: The remaining features train an XGBoost model. The model’s feature importance scores approximate each feature’s contribution to reducing the prediction loss (e.g., cross-entropy, related to the conditional entropy $H(Y \mid X)$). We remove the least important 10% of features.
This loop repeats, each iteration potentially refining the feature set to increase the information rate $I(X_{\text{selected}}; Y)$ per feature. The process terminates when a balance between model performance (a proxy for $I(X_{\text{selected}}; Y)$) and feature compactness is achieved, yielding a subset of approximately 150 features.
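The alternating structure of this loop can be sketched as follows. This is a structural sketch under stated assumptions, not the authors' implementation: the two scorers are hypothetical callables standing in for the correlation-based score and the XGBoost importance score, and the fixed target size replaces the performance versus compactness stopping rule described above.

```python
import math
from typing import Callable, Dict, List

# A scorer returns one score per feature; lower scores are pruned first.
Scorer = Callable[[List[str]], Dict[str, float]]


def drop_lowest(features: List[str], scores: Dict[str, float],
                fraction: float, min_keep: int) -> List[str]:
    """Remove the lowest-scoring fraction of features without going below min_keep."""
    n_drop = max(1, math.floor(len(features) * fraction))
    n_drop = min(n_drop, len(features) - min_keep)
    dropped = set(sorted(features, key=lambda f: scores[f])[:n_drop])
    return [f for f in features if f not in dropped]


def iterative_select(features: List[str], correlation_scorer: Scorer,
                     importance_scorer: Scorer, target_size: int,
                     fraction: float = 0.10) -> List[str]:
    """Alternate the two 10% pruning steps until target_size features remain."""
    selected = list(features)
    while len(selected) > target_size:
        selected = drop_lowest(selected, correlation_scorer(selected), fraction, target_size)
        if len(selected) > target_size:
            selected = drop_lowest(selected, importance_scorer(selected), fraction, target_size)
    return selected


# Toy usage: score each feature by its index so the outcome is easy to inspect.
names = [f"f{i}" for i in range(1000)]
by_index: Scorer = lambda feats: {f: float(f[1:]) for f in feats}
kept = iterative_select(names, by_index, by_index, target_size=150)
```

In practice the second scorer would retrain the model on the surviving features at every round, so that importance scores reflect the current feature set rather than the original one.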

Appendix Final Validation: Incorporating Expert Priors

To obtain a validated set of predictors, domain experts perform a final review. This step integrates rich prior knowledge, an external information source, into the system. From the 150 features, experts validated and annotated the 100 most relevant to credit review operations. This ensures the final feature set reflects not just statistical mutual information $I_{\text{statistical}}(F; Y)$, but also business-relevant information $I_{\text{business}}(F; Y)$, supporting actionability and interpretability. The final output is a robust, information-rich, and compact feature combination for downstream underwriting analysis.

Appendix B.2. Theoretical Summary

The pipeline can be formally viewed as an optimization problem:
$$\max_{X_{\text{selected}} \subseteq X_{\text{raw}}} I(X_{\text{selected}}; Y) \quad \text{subject to} \quad |X_{\text{selected}}| \le k$$
where $k$ is the desired feature count. The steps of cleaning, filtering, and iterative selection collectively work to approximate this optimum by systematically removing noise, eliminating non-informative signals, and compressing redundancy, all while preserving the information most valuable for predicting credit risk.
Figure A2. Feature Selection Results: Per-Class Performance and Macro-F1 Trends.

Appendix B.3. Validation of Feature Selection

To examine how feature selection affects predictive performance, we evaluated model metrics at multiple stages of the feature selection pipeline, progressively reducing the feature space from the full set of 2,515 features to the final 100 features used in our reported experiments. At each stage, we trained an XGBoost model and assessed its predictive performance. Figure A2 illustrates how the macro F1 score varies with the number of retained features (mean ± standard deviation over 5-fold cross-validation). Although using all 2,515 features yields a higher nominal score (macro F1 = 0.89), performance drops sharply during the initial reduction step and then stabilizes as fewer but more informative features are retained.
We further analyzed performance across the three major outcome classes in the dataset: (1) approved at credit review but subsequently delinquent, (2) approved and not delinquent, and (3) rejected at credit review. The per-class analysis in Figure A2 shows that the model maintains strong performance on the “rejected” class (F1 = 0.898), suggesting that the reduced feature set preserves key discriminative signals relevant to credit-rejection decisions. The corresponding classification results are reported in Table A2: the model performs best on the “rejected” class and attains an average F1 score of 80.8%, providing a solid basis for subsequent credit-review analysis.
We note that the apparent decline in predictive scores after aggressive feature pruning does not necessarily indicate degraded model quality. Feature selection can reduce model complexity, mitigate overfitting, and improve interpretability and deployment efficiency (e.g., faster inference and lower data acquisition costs).
Table A2. Classification Report of Feature Prediction.
| Classification | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| Approved – Overdue | 0.7400 | 0.8350 | 0.7847 |
| Approved – Current | 0.8131 | 0.6817 | 0.7416 |
| Rejected | 0.8839 | 0.9133 | 0.8984 |
| Avg. | 0.8123 | 0.8100 | 0.8082 |
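The “Avg.” row can be reproduced as the unweighted mean of the three per-class rows, as the short check below shows. The dictionary simply restates the values from Table A2.

```python
# Per-class (precision, recall, F1) values copied from Table A2.
rows = {
    "Approved - Overdue": (0.7400, 0.8350, 0.7847),
    "Approved - Current": (0.8131, 0.6817, 0.7416),
    "Rejected": (0.8839, 0.9133, 0.8984),
}

# Unweighted (macro) average of each column, rounded to four decimals.
macro = tuple(round(sum(col) / len(rows), 4) for col in zip(*rows.values()))
```

The three averages come out as 0.8123, 0.81, and 0.8082, which agree with the reported row.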

Appendix C. Limitations

Despite the significant improvements demonstrated by CreditAgent, our framework has several limitations that warrant consideration for future work. The system was evaluated on 6,000 credit cases from 2023 in a specific context, and its performance may not transfer to different geographic regions, regulatory environments, or economic conditions without recalibration of agent configurations and decision thresholds. The hierarchical multi-agent architecture incurs significant overhead, approximately 3.4× higher latency (20.15 s vs. 5.91 s per sample) and nearly 5× greater token consumption (12,993 vs. 2,677 tokens per case) compared to single-model approaches. While acceptable for non-time-critical credit review workflows, this limits real-time deployment scenarios. Furthermore, the specialized fine-tuning requirements (SFT and GRPO) demand substantial computational resources that may be prohibitive for resource-constrained institutions. These trade-offs between performance gains and computational costs represent important considerations for practical deployment.

Appendix D. Layer 2 Agents Training Data Details

Figure A3. Data Distribution Details for Layer 2 Specialized Agents.
Layer 2 of the CreditAgent framework implements a Multi-Agent Domain Expertise Review system, where specialized reasoning agents are deployed across six credit review dimensions. To ensure expert-level grounding, each agent $a_i$ is fine-tuned using Supervised Fine-Tuning (SFT) on a custom Credit-Instruct dataset.

Appendix D.1. Agent Architecture

The Layer 2 system employs a State-Aware Routing Framework to manage dependencies between specialized agents $\mathcal{A} = \{a_1, a_2, \dots, a_6\}$. Each SFT-aligned agent $a_j$ produces an output tuple:
$$o_j = \{r_j, e_j\}$$
where:
  • $r_j \in [0, 1]$ is the scalar Risk Magnitude for a specific dimension (e.g., Solvency), derived via Risk-Oriented Textualization (ROT)
  • $e_j \in \mathcal{E}^*$ is the Evidence Chain in natural language, constrained by a JSON schema
The global context state is updated recursively as:
$$S_{t+1} = [\, S_t \,\|\, o_{t+1} \,]$$
with $S_0 = \emptyset$, enabling downstream agents to condition their responses on the semantic outputs of upstream agents.

Appendix D.2. Credit-Instruct Dataset Construction

The Credit-Instruct dataset aligns the model’s latent space with expert reasoning paths (Chain-of-Thought) and domain-specific knowledge injection. The dataset comprises triples of:
D = { ( input , bad   hbox , decision ) }

Appendix D.3. Risk-Oriented Textualization (ROT)

To address semantic distortion during the mapping of structured financial features to the linguistic reasoning space, we define a textualization operator $\mathcal{T}$ that maps the agent-relevant refined feature slice $x_j \subseteq x'$ to a structured semantic prompt. For each agent $a_j$, a risk-discretization function is introduced:
$$g: \mathbb{R} \to \{\text{Low}, \text{Medium}, \text{High}\}$$
defined by business-driven quantiles. The resulting prompt incorporates causal templates:
$$P_j = \mathcal{T}\big(x_j, g(x_j)\big)$$
The framework utilizes PEFT via QLoRA to adapt the core backbone model, optimizing the objective:
$$\min_{\theta}\ \mathcal{L}_{CE}(y, \hat{y}) + \beta \cdot \mathrm{DPO}(O_A, O_B)$$
where $y$ is the ground-truth supervision label for the current agent task, $\hat{y}$ is the predicted label, $\beta$ controls the preference-alignment strength, and $\mathrm{DPO}(O_A, O_B)$ represents the Direct Preference Optimization loss based on expert pairwise rankings of a preferred output $O_A$ over a dispreferred output $O_B$. The Layer 2 committee consists of six specialized agents, each focusing on a distinct credit assessment dimension, as listed in Table 2.
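The textualization operator and the discretization function defined above can be sketched as follows. Everything specific in this sketch is hypothetical: the feature names, the cut-off values, the prompt wording, and the convention that larger values mean higher risk are assumptions for illustration, not the templates used in the paper.

```python
from typing import Dict, Tuple


def discretize(value: float, cutoffs: Tuple[float, float]) -> str:
    """g: R -> {Low, Medium, High}, using two quantile cut-offs per feature."""
    low_cut, high_cut = cutoffs
    if value <= low_cut:
        return "Low"
    if value <= high_cut:
        return "Medium"
    return "High"


def textualize(agent: str, features: Dict[str, float],
               cutoffs: Dict[str, Tuple[float, float]]) -> str:
    """T(x_j, g(x_j)): render one agent's feature slice as a structured prompt."""
    lines = [f"[{agent} agent] Assess the following evidence:"]
    for name, value in features.items():
        level = discretize(value, cutoffs[name])
        lines.append(f"- {name} = {value:g} (risk level: {level})")
    return "\n".join(lines)


# Hypothetical feature slice and cut-offs for a Solvency agent.
prompt = textualize(
    "Solvency",
    {"debt_to_income": 0.62, "inquiries_90d": 1},
    {"debt_to_income": (0.3, 0.5), "inquiries_90d": (2, 5)},
)
```

Features for which a larger value means lower risk, such as income, would need the inverse mapping, which is omitted here for brevity.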

Appendix D.4. Data Distribution

Figure A3 illustrates the training data distribution across the six specialized agents, showing the score distributions for Basic, History, Solvency, Willingness, Fraud, and Macro dimensions. The heterogeneous score distributions reflect the varying risk profiles encountered in real-world credit assessment scenarios, ensuring robust agent training across diverse risk regimes.

Appendix E. Models Configurations

During the experimental phase of this study, we evaluated a wide range of large language models. Table A3 summarizes the model names, developers, model scales, and the access method used for each model.
Table A3. Model architectures and parameter scales evaluated in the benchmark.
| Model Name | Developer | Scale/Type | Access Method |
| --- | --- | --- | --- |
| Open-Source | | | |
| XuanYuan 2 | Duxiaoman-DI | 70B | Local (vLLM) |
| Fin-R1 | Financial AI Lab | 32B (Distilled) | Local (vLLM) |
| Seed OSS Instruct | ByteDance-Seed | 36B | Local (vLLM) |
| Llama 3.1 | Meta | 8B | Local (vLLM) |
| Llama 3.3 | Meta | 70B | Local (vLLM) |
| Qwen 3 Instruct | Alibaba | 4B, 30B | Local (vLLM) |
| Qwen 3 Thinking | Alibaba | 30B | Local (vLLM) |
| DeepSeek-R1-Distill-Qwen | deepseek-ai | 32B | Local (vLLM) |
| DeepSeek-R1-Distill-Llama | deepseek-ai | 8B | Local (vLLM) |
| Proprietary | | | |
| Gemini 3 Flash Preview | Google | Unknown | Vertex AI API |
| Claude 4.5 Sonnet | Anthropic | Unknown | Anthropic API |
| GPT 5.1 | OpenAI | Unknown | OpenAI API |
The models evaluated in this study include a diverse set of open-source models, along with three carefully selected, widely adopted closed-source models. Testing for the open-source models was primarily conducted using vLLM-based deployment. The hardware specifications of the testing machines are detailed in Table A4.
Table A4. Inference parameters and environment details.
| Parameter | Configuration Value |
| --- | --- |
| GPU Hardware | 8 × NVIDIA H800 (80GB) |
| Operating System | Ubuntu 22.04 LTS |
| Inference Engine | vLLM (v0.6.3) / CUDA 12.4 |
| Decoding Strategy | Greedy Search ($T = 0$) |
| Max Context Length | 16,384 Tokens |
| Precision | Bfloat16 |

Appendix F. Usage of Large Language Models

The authors used large language models only to polish the prose: the complete draft was first written by the authors and later polished with the help of LLM-based assistants, including ChatGPT, Gemini, and Perplexity. The authors also used code assistants, including Cursor and Copilot, to implement their original design and ideas. The scientific contributions, technical methods, ideas, and core results are entirely the original work of the authors.

Appendix G. Details of Tokens Count and Time-Consuming

In the experiments, we quantified the average input and output token counts for a representative general-purpose single model, as well as for each of the six dimension-specific agents and the final decision agent within the CreditAgent framework. These results are summarized in Table A5. As shown, the Basic Agent, which performs foundational information analysis, exhibits the highest output token count. Furthermore, we benchmarked both total token consumption and inference latency for the single model and the end-to-end CreditAgent system; the comparative results are reported in Table A6.
Table A5. Average token counts for the end-to-end single model and each component of CreditAgent.
| Model / Agent | Completion Tokens | Prompt Tokens | Total Tokens |
| --- | --- | --- | --- |
| Single Model | 798.5 | 1878.3 | 2676.8 |
| Basic Agent | 1340.0 | 247.3 | 1587.3 |
| Solvency Agent | 1003.1 | 379.5 | 1382.6 |
| Willingness Agent | 1102.1 | 572.6 | 1674.7 |
| History Agent | 1210.9 | 411.9 | 1622.8 |
| Fraud Agent | 1160.9 | 260.9 | 1421.8 |
| Macro Agent | 873.9 | 514.4 | 1388.3 |
| Final Decision Agent | 805.8 | 3109.2 | 3915.0 |
Table A6. Comparison of overall token consumption and latency between the single model and CreditAgent.
| System | Overall Tokens (per case) | Time (s per sample) |
| --- | --- | --- |
| Single Model | 2676.772 | 5.9074 |
| Credit Agent | 12992.502 | 20.1459 |
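The overhead figures quoted in the limitations discussion follow directly from Tables A5 and A6, as the short calculation below illustrates. The values are copied from the tables and the variable names are arbitrary.

```python
# Values from Table A6 (tokens per case, seconds per sample).
single = {"tokens": 2676.772, "seconds": 5.9074}
credit_agent = {"tokens": 12992.502, "seconds": 20.1459}

token_ratio = credit_agent["tokens"] / single["tokens"]
latency_ratio = credit_agent["seconds"] / single["seconds"]

# Per-component totals from Table A5 (six specialist agents plus the final decision agent).
component_totals = [1587.3, 1382.6, 1674.7, 1622.8, 1421.8, 1388.3, 3915.0]
pipeline_tokens = sum(component_totals)
```

The token ratio is about 4.85 and the latency ratio about 3.41, and the per-component totals add up to roughly 12,992.5 tokens, matching the overall figure for the multi-agent system.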
Figure A4. Time analysis of agent output: (a) token count vs. processing time; (b) processing time distribution.
Figure A4 presents the time analysis of the agents’ output. As shown in (a), total tokens per sample and processing time are positively correlated, with a linear fit yielding $R^2 = 0.671$. The histogram in (b) shows that processing time follows a right-skewed distribution, with most samples processed within 15–25 seconds, consistent with the average latency reported in Table A6.

Appendix H. Case Study

In this section, we present four representative case studies to illustrate the application of our multi-agent credit risk assessment framework in diverse borrower scenarios. These cases are selected to cover a spectrum of credit profiles, ranging from high-risk applicants with potential delinquency issues to low-risk premium borrowers. Each case includes the original input data (borrower information, repayment assessments, credit history, etc.), ground truth labels (final rating, delinquency probability, and decision), and detailed scoring from specialized agents across key dimensions such as repayment ability, willingness, credit history, fraud risk, and application reasonableness. The agents’ analyses culminate in a final prediction, overall rating, identified risks, suggested measures, and decision recommendation.
Case Study 1 exemplifies a high-risk scenario where the borrower exhibits missing basic information, multi-lender borrowing, large unsettled balances, frequent inquiries, and contact anomalies. Despite a ground truth approval with "D" rating and "High" delinquency probability, the agents recommend rejection due to elevated risks.
Case Study 2 represents a premium low-risk profile with high credit limits, minimal inquiries, no overdues, and low fraud indicators. The ground truth and agents align on an "A" rating, "Low" delinquency probability, and approval, though noting minor issues like missing basics and high interest rates.
Case Study 3 highlights another high-risk case with extremely low income, severe multi-lender borrowing, frequent inquiries, and high credit utilization. Both ground truth and agents concur on a "D" rating, "High" delinquency probability, and rejection.
Case Study 4 demonstrates a good mid-tier profile with moderate income, strong repayment ability, low inquiries, no overdues, and minimal fraud risk. The ground truth and agents support a "B" rating, "Low" delinquency probability, and approval, with the primary concern being incomplete basic information.
These cases underscore the framework’s ability to integrate multi-dimensional data for nuanced risk evaluation, balancing quantitative metrics with qualitative insights. They also reveal common challenges, such as data incompleteness, and the importance of agent collaboration in decision-making. The detailed illustrations for each case category are provided in Figure A5 and Figure A6 for Case 1, Figure A7 and Figure A8 for Case 2, Figure A9 and Figure A10 for Case 3, and Figure A11 and Figure A12 for Case 4.
Figure A5. Case study 1-1.
Figure A6. Case study 1-2.
Figure A7. Case study 2-1.
Figure A8. Case study 2-2.
Figure A9. Case study 3-1.
Figure A10. Case study 3-2.
Figure A11. Case study 4-1.
Figure A12. Case study 4-2.

Appendix I. More Related Work

Appendix I.1. Credit Assessment

Credit assessment [7,50,100] constitutes a fundamental task in financial risk management, wherein institutions evaluate the creditworthiness of individuals or entities to determine their likelihood of fulfilling financial obligations. In the context of loan applications [9,53,66,81], credit assessment becomes particularly challenging due to the heterogeneous nature of applicant profiles and the need to balance risk against return. Early credit assessment systems predominantly relied on statistical models [8,10] and manually engineered expert knowledge, employing techniques such as logistic regression and scorecard models, which had limited capacity to capture complex non-linear relationships. The advent of machine learning transformed this landscape by offering superior capabilities in automatic feature learning and in handling high-dimensional data. Consequently, machine learning techniques [48,58] have been widely adopted for feature selection, feature fusion, and predicting applicants’ credit levels with improved discrimination power. The emergence of LLMs [41,60,65] introduces a paradigm shift in credit assessment by leveraging their generalization capabilities and semantic processing characteristics, which enable the incorporation of a broader spectrum of applicant features while rendering them interpretable. The evolution of credit assessment methodologies is illustrated by the works shown in Table A7.
Table A7. Evolution of credit assessment methodologies across statistical, machine learning, and LLM-based approaches.
| Work | Category | Model/Algorithm | Details |
| --- | --- | --- | --- |
| Sanz et al. [64] | LLMs | LLM-derived features | Extracts textual features complementing traditional variables in P2P lending platforms. Improves predictive performance while emphasizing explainability and fairness in credit decision-making. |
| Feng et al. [25] | LLMs | Instruction-tuned LLMs | Matches or exceeds state-of-the-art credit scoring methods while addressing bias concerns. Demonstrates comprehensive knowledge and adaptability for holistic financial risk evaluation. |
| Sanz et al. [65] | LLMs | LLM textual analysis | Addresses information gaps in P2P lending through textual analysis. Provides complementary signals to traditional credit variables, improving discrimination between creditworthy and risky borrowers with emphasis on explainability. |
| Tan et al. [75] | LLMs | GPT-4 + BERT fusion | Combines GPT-4’s generative capabilities for latent risk extraction with BERT’s bidirectional encoding. Multi-level semantic integration identifies implicit credit risks in unstructured financial documents with superior accuracy and interpretability. |
| Raliphada et al. [60] | LLMs | BERT, RoBERTa, LLaMA 3.2 | Analyzes financial news sentiment using pre-trained language models. RoBERTa’s contextual encoding demonstrates superior risk-relevant sentiment capture. Gradient boosting achieves highest accuracy with structured features. |
| Kang et al. [41] | LLMs | GPT-4o | Processes heterogeneous modalities (power time series, financial metrics, textual data) through attention mechanisms and contrastive learning. Integrates physical consumption data with traditional financial information through cross-modal representation optimization. |
| Puli et al. [58] | Machine Learning | Neural Networks, Random Forest | Demonstrates superior performance with credit-related, interest rate, and liquidity variables as most informative early warning indicators. |
| Zhang et al. [98] | Machine Learning | Neural Networks | Achieves better generalization and robustness than traditional neural networks by evaluating supply chain relationships and leading enterprise credit status alongside individual firm characteristics. |
| Machado et al. [48] | Machine Learning | Random Forest + SHAP | Combines internal banking records with external financial data. Outperforms other ML techniques in prediction accuracy and transparency for detecting early warning signals with SHAP-based explanations. |
| Zhang et al. [100] | Machine Learning | Ensemble Decision Trees | Leverages feature selection and ensemble decision trees for superior performance in processing large-scale loan application data for accurate risk prediction. |
| Blackwell et al. [8] | Statistical | Marginal Risk Model | Introduces marginal risk concept varying with outstanding balance. Enables refined credit limit strategies avoiding "all or nothing" approach while optimizing authorization decisions based on account-specific risk contours. |
| Cameron et al. [10] | Statistical | Count-Duration Models | Provides statistical frameworks for discrete, non-negative integer data. Links count processes to duration analysis for credit scoring and insurance pricing applications. |

Appendix I.2. MAS in Specific Domain

As AI agents demonstrate increasingly sophisticated capabilities, complex real-world tasks [67,68,101] are being delegated to autonomous systems for resolution. However, individual agents face inherent limitations in both functional scope and throughput capacity, which has motivated the adoption of MAS for tackling these challenging problems. Complex real-world tasks often demand diverse expertise [39,92], parallel processing capabilities [3], and coordinated action—requirements [18,79] that exceed the capabilities of even the most advanced individual agents. This has spurred significant research into MAS, where the collective intelligence emerging from specialized agent collaboration enables the solution of tasks that remain intractable for individual agents. Applications of MAS in specific domains are shown in Table A8.
Table A8. Overview of domain-specific MAS.
| Work | Domain | Agent Architecture | Key Mechanisms |
| --- | --- | --- | --- |
| LawLuo [74] | Legal consultation | Four collaborative agents simulating law firm operations | Role-specific fine-tuning, case graph-based RAG for personalized multi-turn consultation |
| AutoDefense [95] | LLM security | Small open-source LLMs with assigned defensive roles | Collaborative filtering of harmful responses, jailbreak attack defense |
| FinCon [91] | Financial decision-making | Manager-analyst hierarchy inspired by investment firms | Self-critiquing mechanisms, conceptual verbal reinforcement |
| EduPlanner [99] | Educational planning | Evaluator, optimizer, and question analyst agents | Skill-Tree structure, 5-D evaluation module, adversarial collaboration |
| Trans. Analy [94] | Social interaction | Parent, Adult, and Child ego state agents | Transactional Analysis principles, context-aware psychological dynamics |
| MedAgent-Pro [83] | Medical diagnosis | Hierarchical reasoning agents with visual tool integration | Disease-level plan generation via RAG, patient-level personalized reasoning, step-wise reliability verification |
| MedOrch [31] | Medical reasoning | Modular tool-augmented multi-agent orchestration | Flexible tool integration, transparent traceable reasoning for diagnosis, imaging, and multimodal QA |
| ClinicalAgent [92] | Clinical trials | GPT-4 multi-agent system with external tool access | LEAST-TO-MOST decomposition, ReAct reasoning, clinical trial tool integration for outcome prediction |
| DynamiCare [68] | Clinical decision-making | Dynamic specialist agents with adaptive team composition | Multi-round interactive loops, iterative information gathering, adaptive strategy adjustment |
| MeNTi [101] | Medical calculators | Meta-tool with nested calling mechanism | Flexible calculator chaining, slot filling and unit conversion, 281 physician-curated medical tools |
| Agent Laboratory [67] | Scientific research | Autonomous research agents with human-in-the-loop | End-to-end research automation, literature review to report writing, iterative human feedback integration |
| WebAgent-R1 [84] | Web navigation | Multi-turn RL agents with thinking-based prompting | Asynchronous trajectory generation, binary task success rewards, test-time scaling via interactions |
| EMBODIEDBENCH [88] | Embodied AI | Multi-environment benchmark agents | Evaluation spanning household tasks to atomic navigation/manipulation, revealing low-level control limitations |
| Guided Search [93] | Non-serializable envs. | 1-step lookahead agents with trajectory selection | Learned action-value functions, trajectory selection for test-time search in non-restorable states |
| TRAIL [18] | Agent debugging | Trace evaluation framework for single/multi-agent systems | Formal error taxonomy, 148 human-annotated traces, benchmark for error identification in agent workflows |
| DiscoveryWorld [39] | Scientific discovery | Virtual environment with 24 parametric tasks | 24 parametric tasks across difficulty levels, complete discovery cycle evaluation (hypothesis to conclusion) |
| Agent-X [3] | Multimodal reasoning | Vision-centric agents for multi-step visual tasks | Step-level assessment of reasoning coherence, tool selection, and task completion over images/videos |
| G-Safeguard [79] | MAS security | Graph neural network-based anomaly detection system | Anomaly detection on utterance graphs, topological intervention for attack remediation across LLM backbones |

Appendix I.3. MAS Architecture

To enable more effective message delivery and task coordination, multi-agent systems have evolved diverse architectural paradigms. Based on inter-agent communication patterns and message exchange mechanisms, these architectures can be categorized into four main types: hierarchical [27,30,46,62], decentralized [14,97], centralized [34], and shared memory [61]. Table A9 presents examples for each MAS category.
Table A9. Taxonomy of MAS architectures based on communication patterns and coordination mechanisms.
Architecture | Work | Details
Hierarchical | DyLAN [46] | Dynamic hierarchical architecture with multi-turn interactions. Employs agent selection during inference and early termination for enhanced collaboration efficiency. Uses unsupervised agent importance scoring for automatic team optimization.
Hierarchical | LMA for SE [30] | Applies hierarchical task decomposition in software engineering lifecycle. Divides high-level requirements into sub-tasks assigned to specialized agents, mirroring agile methodologies. Enhances robustness through cross-examination and multi-agent validation, addressing LLM hallucination issues. Scales to complex systems by incorporating additional agents and reallocating tasks dynamically.
Hierarchical | PathFinder [27] | Four collaborative agents (Triage, Navigation, Description, Diagnosis) emulate pathologist decision-making. Iteratively navigates whole slide images, generates importance maps, produces natural language descriptions of diagnostically relevant patches, and synthesizes findings for final classification with inherent explainability.
Hierarchical | Pre-Act [62] | Enhances agent performance by generating multi-step execution plans with detailed reasoning before action. Incrementally refines the plan by incorporating previous steps and tool outputs until final response. Demonstrates substantial improvements in action accuracy and goal completion through fine-tuning on smaller models for practical deployment.
Decentralized | DMAS [14] | Turn-taking dialogue-based task planning where each robot’s LLM agent independently expresses opinions and considers peer feedback. Enables autonomous decision-making while maintaining collective progress toward task completion.
Decentralized | MaAS [97] | Optimizes an agentic supernet (a probabilistic distribution of agent architectures) to dynamically sample query-dependent multi-agent systems. Enables tailored resource allocation based on task difficulty and domain, achieving superior performance with significantly reduced inference costs compared to static, handcrafted multi-agent designs.
Centralized | ACORM [34] | Single LLM serves as central planner, generating actions for all agents based on global state information. Achieves centralized training with decentralized execution, simplifying MARL planning and enhancing scalability and inference efficiency.
Shared Memory | MetaGPT [33] | Maintains shared message pool allowing dynamic information observation and extraction. Enables parameter sharing where agents update weights based on new knowledge and synchronize with other agents, optimizing collaboration efficiency.
Shared Memory | Zep [61] | Introduces Graphiti, a temporally-aware knowledge graph engine that dynamically synthesizes unstructured conversational data and structured business data while maintaining historical relationships. Enables complex temporal reasoning and cross-session information synthesis beyond static document retrieval in RAG frameworks.

Appendix J. Prior Knowledge in Finance About High-Stakes Financial Decision-Making

In the domain of financial risk control, decomposing a complex decision system into multiple specialized sub-agents is not merely a technical design choice but is deeply rooted in established financial business logic. The structure of our Layer 2 expert agents directly mirrors the organizational paradigm of institutional credit committees, where domain specialists, each responsible for a distinct risk dimension, collaborate to reach a lending decision. The classical 5C’s of Credit framework [4], comprising Character, Capacity, Capital, Conditions, and Collateral, is the canonical framework for credit analysis and is widely adopted in both academic literature and industry practice.
Table A10 summarizes the mapping between the classical 5C’s credit assessment framework and our agent architecture.
Table A10. Mapping between CreditAgent Layer 2 Architecture and Classical Credit Assessment Frameworks (5C’s of Credit, Baiden 2011).
CreditAgent Agent | Classical 5C’s | Risk Dimension | Primary Function
Macro Agent | Conditions | Macro-Economic Context | Systematic risk calibration
Basic Agent | Character | Identity & Basic Profile | KYB/KYC verification
History Agent | Character | Credit History | Behavioral velocity analysis
Solvency Agent | Capacity, Capital | Solvency & Asset Quality | Liquidity & capital adequacy
Willingness Agent | Character | Repayment Willingness | Intent signal extraction
Fraud Agent | Collateral* | Fraud & Consistency | Adversarial detection
* In unsecured consumer lending, Collateral is reinterpreted as information consistency verification.
In Baiden’s framework, Character encompasses the willingness and determination to meet a loan obligation, as discovered through investigation into customers’ payment habits and the way they manage their business affairs. This multi-faceted definition naturally decomposes into three specialized agents: the Basic Agent (identity authenticity), the History Agent (historical payment patterns), and the Willingness Agent (current repayment intent). Capacity (cash flow ability) and Capital (equity reserves indicating stakeholder commitment) are jointly assessed by the Solvency Agent, as these dimensions are inherently correlated in consumer credit contexts. The following subsections detail the prior financial knowledge integrated into each agent, specifying the primary functions performed and their grounding in established credit assessment principles.

Appendix J.1. Macro-Economic Context and Systematic Risk

Credit risk is dynamic: the same borrower profile may represent acceptable risk during economic expansion but elevated risk during economic contraction. Unlike traditional credit scoring processes that assume distributional stationarity, our Macro Agent processes PESTEL (Political, Economic, Social, Technological, Environmental, Legal) indicators, including interest rate trajectories, industry volatility indices, unemployment trends, and regulatory regime shifts, to compute a market-sensitivity credit score.
The underlying financial prior knowledge stems from well-established macro-credit linkages. For example, rising interest rates mechanically increase debt-servicing burdens for variable-rate borrowers while simultaneously depressing the asset valuations that serve as implicit collateral. In addition, unemployment shocks have asymmetric effects: job loss triggers immediate payment stress, whereas re-employment restores capacity only with a significant lag. These macro-credit transmission mechanisms constitute the domain knowledge that enables the Macro Agent to calibrate systematic risk adjustments across heterogeneous economic environments.

Appendix J.2. Identity Verification and Basic Profiling

Know Your Borrower (KYB) and Know Your Customer (KYC) compliance represents the foundational layer of credit risk assessment, serving both as a regulatory mandate and as the first line of defense against synthetic identity fraud, a rapidly growing threat vector in digital lending. The Basic Agent performs deterministic validation before probabilistic assessment begins, processing structured identity data to evaluate document consistency, contact information stability, employment verification, and digital footprint patterns. This dimension directly operationalizes the "Character" component of the 5C’s framework: as [4] notes, character is "generally discovered through interview and investigation" into the customer’s identity and business affairs.

Appendix J.3. Credit Behavior History and Velocity Analysis

Historical repayment behavior is one of the most predictive features in consumer credit modeling. However, raw delinquency counts fail to capture the velocity of credit deterioration, a critical leading indicator that distinguishes stable borrowers experiencing isolated hardship from those on deteriorating trajectories. The History Agent operationalizes this temporal dimension through roll rate analysis and behavioral velocity scoring.
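As an illustration of the roll rate idea, the following minimal sketch computes, for each delinquency bucket, the share of accounts that moved to a worse bucket between two monthly snapshots. The bucket encoding, the function name `roll_rates`, and the toy data are assumptions made for this example; they are not taken from the CreditAgent implementation.

```python
from collections import Counter

def roll_rates(prev_buckets, next_buckets):
    """Share of accounts in each bucket that rolled to a worse bucket.

    Buckets are integers (assumed encoding): 0 = current,
    1 = 1-30 days past due, 2 = 31-60 days past due, and so on.
    """
    if len(prev_buckets) != len(next_buckets):
        raise ValueError("snapshots must cover the same accounts")
    totals = Counter(prev_buckets)
    # Count accounts whose bucket got strictly worse month over month.
    rolled = Counter(p for p, n in zip(prev_buckets, next_buckets) if n > p)
    return {bucket: rolled[bucket] / totals[bucket] for bucket in sorted(totals)}

# Toy snapshots for eight accounts (hypothetical data).
last_month = [0, 0, 0, 0, 1, 1, 2, 2]
this_month = [0, 0, 0, 1, 2, 0, 3, 2]
rates = roll_rates(last_month, this_month)
print(rates)  # {0: 0.25, 1: 0.5, 2: 0.5}
```

A rising roll rate in the early buckets is the kind of deterioration signal that a raw delinquency count would not show.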

Appendix J.4. Solvency, Liquidity, and Asset Quality

Repayment capacity assessment requires distinguishing between solvency (the ability to meet total obligations over the long term) and liquidity (short-term cash availability). A borrower may be solvent but illiquid, or liquid but insolvent. The Solvency Agent jointly operationalizes the "Capacity" and "Capital" dimensions of the 5C’s framework through dual-axis evaluation. This distinction reflects established risk management practice, in which affordability assessment determines whether the borrower can sustain payments from regular income streams, while capital adequacy provides a buffer against income disruption [7].

Appendix J.5. Repayment Willingness and Behavioral Signals

Willingness to pay is a critical factor in credit risk. The Willingness Agent extracts intent signals from application behavior, inquiry patterns, and multi-lender dynamics that traditional scorecards systematically underweight. This agent directly addresses the core definition of "Character" in the 5C’s framework: [4] defines Character as the customer’s willingness and determination to meet a loan obligation, discovered through investigation into payment habits and the way they respond to adversity.

Appendix J.6. Fraud Detection and Consistency Verification

First-party fraud (applicant misrepresentation) and third-party fraud (identity theft, synthetic identity) represent non-credit-risk loss vectors that bypass traditional scoring entirely. Fraud signals are particularly susceptible to dilution when processed through monolithic systems, necessitating dedicated detection mechanisms [26]. The Fraud Agent implements adversarial detection by treating each application as potentially hostile input requiring multi-layered verification. In unsecured consumer lending, where physical collateral is absent, this agent reinterprets the "Collateral" dimension of the 5C’s framework as information consistency verification.

Appendix K. Expert Review Summary

The design and validation of our CreditAgent framework benefited significantly from consultations with senior professionals at major financial institutions in China. Through structured interviews with industry specialists, we gathered critical insights into credit committee operations, multi-dimensional risk assessment, and automated decisioning systems. These expert perspectives directly shaped the development of our hierarchical multi-agent architecture. Our consultation panel comprised seasoned specialists across credit underwriting, risk modeling, fraud investigation, and macro-economic analysis, each with 6 to over 15 years of hands-on experience, as shown in Table A11.
Table A11. Expert Consultation Panel Profiles.
Summarized Professional Profiles
Profile A: Senior Credit Committee Director (15+ yrs). Expert in credit committee governance and cross-functional decision synthesis at a national commercial bank.
Profile B: Risk Modeling Lead (10+ yrs). Specializing in scorecard development, feature engineering, and model validation at a leading fintech company.
Profile C: Fraud Investigation Manager (8+ yrs). Focused on synthetic identity detection, network analysis, and adversarial pattern recognition.
Profile D: Macro-Economic Risk Analyst (7+ yrs). Expert in industry cycle analysis, regulatory impact assessment, and systematic risk calibration.
Profile E: Consumer Credit Underwriter (6+ yrs). Skilled in cash flow analysis, DTI assessment, and affordability verification for unsecured lending.
Profile F: Collections Strategy Director (10+ yrs). Expert in delinquency prediction, roll rate analysis, and recovery optimization.
Key findings from our expert consultations highlighted several critical aspects that directly informed the CreditAgent architecture:

Appendix K.1. Hierarchical Decision Structure

Expert A emphasized that institutional credit committees operate through structured deliberation rather than holistic judgment, noting that a single reviewer cannot hold all dimensions in mind simultaneously. This insight directly motivated our three-layer architecture separating data governance, domain expertise, and decision fusion.

Appendix K.2. Dimensional Decomposition Necessity

Experts D, E, and F independently confirmed that credit risk assessment requires distinct analytical methodologies across dimensions. Expert D noted that macro factors require forward-looking scenario analysis, while historical behavior requires backward-looking pattern recognition, and that these are fundamentally different cognitive tasks. This validated our design of six specialized agents rather than a monolithic reasoning approach.

Appendix K.3. Hard-Stop Mechanism for Catastrophic Risk

Expert C advocated for deterministic rejection rules: because fraud losses are unrecoverable, positive signals from other dimensions should not override a verified fraud indicator. This principle directly shaped the veto authority granted to our Fraud Agent.
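The veto principle can be sketched as a deterministic check that runs before any score aggregation. In the minimal example below, the function name `fuse_with_hard_stop`, the agent names, the 0.5 threshold, and the plain average used for the non-veto path are all hypothetical placeholders; the average merely stands in for the fusion layer and is not the paper's method.

```python
def fuse_with_hard_stop(agent_scores, fraud_verified, threshold=0.5):
    """Return "reject" or "approve" for one application.

    agent_scores: dict of agent name -> risk score in [0, 1], higher = riskier.
    fraud_verified: True when a fraud indicator has been verified.
    """
    if fraud_verified:
        # Hard stop: a verified fraud indicator ends the review,
        # regardless of how favorable the other dimensions look.
        return "reject"
    if not agent_scores:
        raise ValueError("at least one agent score is required")
    # Placeholder fusion (assumption): unweighted mean of agent risk scores.
    mean_risk = sum(agent_scores.values()) / len(agent_scores)
    return "reject" if mean_risk >= threshold else "approve"

low_risk = {"macro": 0.1, "basic": 0.2, "history": 0.1,
            "solvency": 0.2, "willingness": 0.1}
print(fuse_with_hard_stop(low_risk, fraud_verified=True))   # reject
print(fuse_with_hard_stop(low_risk, fraud_verified=False))  # approve
```

Placing the check ahead of the aggregation keeps the rule auditable: the rejection reason does not depend on any learned weight.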

Appendix K.4. Asymmetric Cost Structure

Expert F provided quantitative context for misclassification costs. In unsecured consumer lending, a false negative (approving an applicant who later defaults) typically costs 3 times more than a false positive (rejecting a creditworthy applicant), with the exact ratio depending on product and recovery rates. This insight informed the asymmetric penalty structure in our business efficiency coefficient design.
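To make the effect of asymmetric costs concrete, the sketch below compares two hypothetical decision policies that have identical accuracy but different error types. The 3:1 cost ratio follows Expert F's estimate; the function `misclassification_cost`, the unit costs, and the toy labels are assumptions for illustration only and do not reproduce the Business Efficiency Coefficient itself.

```python
def misclassification_cost(y_true, y_pred, fn_cost=3.0, fp_cost=1.0):
    """Total cost of errors, with 1 = default (label) or reject (prediction).

    A false negative is a defaulter that was approved (label 1, prediction 0).
    A false positive is a good borrower that was rejected (label 0, prediction 1).
    """
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn_cost * fn + fp_cost * fp

y_true   = [1, 1, 0, 0, 0, 0]
policy_a = [0, 1, 0, 0, 0, 0]  # one approved defaulter (false negative)
policy_b = [1, 1, 1, 0, 0, 0]  # one rejected good borrower (false positive)

cost_a = misclassification_cost(y_true, policy_a)
cost_b = misclassification_cost(y_true, policy_b)
print(cost_a, cost_b)  # 3.0 1.0
```

Both policies classify five of six cases correctly, yet the first is three times as costly, which is why accuracy alone is an incomplete measure of underwriting quality.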

Appendix K.5. Information Consistency as Collateral Proxy

Expert C observed that in unsecured lending, "the degree to which application materials can be cross-validated serves the same risk-mitigation function as physical collateral in secured lending." This perspective supported our reinterpretation of the 5C’s Collateral dimension as consistency verification.

Appendix K.6. Dynamic Threshold Adjustment

Expert B highlighted limitations of static decision thresholds. Optimal approval thresholds shift with economic cycles, portfolio composition, and competitive dynamics. A system that cannot adapt its decision boundary will either over-reject during expansion or under-reject during contraction. This observation motivated our GRPO-based threshold optimization in Layer 3.
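The dependence of the optimal threshold on cost structure can be illustrated with a simple grid sweep that picks the rejection threshold minimizing total misclassification cost. This is a deliberately simplified stand-in, not the GRPO-based optimization used in Layer 3; the function `best_threshold`, the grid, and the toy scores are assumptions for this example.

```python
def best_threshold(scores, labels, fn_cost=3.0, fp_cost=1.0, grid=None):
    """Pick the rejection threshold with the lowest total cost.

    scores: predicted default risk in [0, 1]; an application is rejected
            when its score is at or above the threshold.
    labels: 1 = defaulted, 0 = repaid.
    """
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

    def cost(th):
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < th)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= th)
        return fn_cost * fn + fp_cost * fp

    # min() returns the first (lowest) threshold among ties.
    return min(grid, key=cost)

scores = [0.1, 0.2, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9]
labels = [0,   0,   1,    0,   0,    1,   1,   1]

symmetric = best_threshold(scores, labels, fn_cost=1.0)
asymmetric = best_threshold(scores, labels, fn_cost=3.0)
print(symmetric, asymmetric)  # 0.5 0.25
```

On this toy data, raising the false-negative cost from 1 to 3 moves the selected threshold from 0.5 to 0.25, so the policy rejects more borderline applications. A shift in portfolio composition or recovery rates would move the boundary in the same way, which is the behavior a static threshold cannot capture.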
These expert insights were instrumental in developing CreditAgent’s hierarchical structure, ensuring alignment with institutional credit committee practices while enabling the scalability and auditability required for production deployment.

Appendix L. Dataset Details

This dataset is derived from real Chinese personal credit reports. It contains approximately 6,000 samples, each corresponding to a genuine cash loan or consumer finance credit application scenario. Every sample is labeled with a forward-looking delinquency outcome, making it suitable for evaluating models’ default prediction and deep credit risk reasoning capabilities using purely structured credit bureau data.
Unlike transaction-log or consumption-flow benchmarks, this dataset focuses exclusively on classic credit bureau fields plus selected external enrichment variables. It tests whether models can extract hidden risk contradictions and behavioral patterns from multi-lender borrowing, inquiry spikes, repayment history, credit utilization, fraud signals, and related dimensions.

Appendix L.1. Core Risk Assessment Dimensions (Main Risk Domains)

In the risk control practice of consumer finance and online cash loans in China, core features are typically grouped into the following six primary risk assessment dimensions. These dimensions collectively cover almost all key predictive signals — from borrower profile to behavioral trajectory, credit performance, fraud indicators, and contextual auxiliary information — forming a relatively complete and layered risk evaluation framework:
  • Borrower Basic Information: Mainly includes identity verification, income level, credit report authorization status, etc. Because strong identity fields (name, ID number, plaintext mobile number) are heavily desensitized, this dimension tends to be relatively weak and relies primarily on indirect or derived signals.
  • Repayment Capacity: Focuses on objective financial metrics such as total credit limits granted, outstanding loan balances, unsettled debts, and remaining available credit, used to evaluate the borrower’s actual debt-carrying ability and liquidity buffer.
  • Repayment Willingness: Captures the degree of funding urgency, the tendency toward “borrow-to-repay” cycles, and dependence on external financing, as reflected in recent and medium-to-long-term loan-approval inquiry frequency, number of inquiring institutions, multi-lender borrowing scores, micro-loan activity flags, and general consumption installment application behavior.
  • Credit History and Lending Behavior: Covers historical delinquency patterns (cumulative overdue periods, recent performance), credit card utilization rates, loan portfolio structure (proportion of small-amount loans, number of unsecured consumer loan institutions), abnormal account flags, etc., reflecting long-term credit discipline and current borrowing style.
  • Fraud Risk and Consistency Check: Relies primarily on the internal logical consistency of application materials and on anomaly detection (especially cross-referencing or duplication in contact information, the same phone number linked to multiple identities, unusual relative relationships, etc.) to identify potential organized fraud, agency packaging, or fabricated application risks.
  • Other Information and Auxiliary Signals: Includes report generation timestamps, business occurrence time, derived time-delta variables, executed interest rates, third-party model scores, suspicious large-value credit-related bank transfers, historical successful disbursement counts, etc., mainly serving for contextual supplementation, timeliness calibration, pricing reference, and model ensemble support.
These six dimensions together constitute the most predictive feature set in consumer finance and cash loan approval models. In practical modeling and strategy formulation, different weights are usually assigned to each dimension according to customer type (new vs returning, first-time vs repeat borrowing), product positioning, and risk control window.
Indicator Mapping Table
Table A12. Mapping of Credit Bureau Features Across the Six Core Risk Dimensions (Partial).
Dimension | Typical Signals and Representative Variables
Borrower Basic Information | Latest Monthly Income Level, Credit Report Authorization Time, Whether Authorized to Query Credit Report, ...
Repayment Capacity | Latest Approved Amount by Blaze, Account Credit Limit, Cash Loan Account Credit Limit, ...
Repayment Willingness | Number of Institutions Queried for Loan Approval, Number of Loan Approval Queries, Total Number of Queries, ...
Credit History and Lending Behavior | Highest Credit Card Utilization Rate, Average Credit Card Utilization Rate, ...
Fraud Risk and Consistency Check | Number of Different Contacts Corresponding to the Same Phone Number, Total Times Contacts Were Filled In, ...
Other Information and Auxiliary Signals | Credit Report Generation Date, Event Occurrence Time / Application Time, ...

Appendix M. Statistical Characteristics

The dataset comprises a total of 6,000 real-world personal consumer loan samples, with transaction occurrence dates spanning from January 1, 2023 to December 31, 2023. All samples are authentic records of cash loan and consumer finance credit applications, covering the full credit evaluation cycle from credit inquiry authorization to final approval amount. The dataset contains rich multi-dimensional risk features, including approved credit limits, multi-platform lending behavior scores, contact person relationship chain characteristics, loan application timing, product pricing, historical transaction behavior, institutional query frequency, and recent borrowing amounts. Figure A13 provides a comprehensive statistical summary of key numerical fields.
Figure A13. Feature-Statistics Boxplots — Values Scaled by Powers of 10 (Scaling Annotated in Labels).

Appendix N. Sample Prompts

[Sample prompt images i001 to i008]

References

  1. Agosto, Arianna; Cerchiello, Paola; Giudici, Paolo. Bayesian learning models to measure the relative impact of esg factors on credit ratings. Int. J. Data Sci. Anal. 2025, 20(2), 357–368. [Google Scholar] [CrossRef]
  2. Amirizaniani, Maryam; Lavergne, Adrian; Okada, Elizabeth Snell; Chadha, Aman; Roosta, Tanya; Shah, Chirag. Developing a framework for auditing large language models using human-in-the-loop. In Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 2025; pp. 64–74. [Google Scholar]
  3. Ashraf, Tajamul; Saqib, Amal; Ghani, Hanan; AlMahri, Muhra; Li, Yuhao; Ahsan, Noor; Nawaz, Umair; Lahoud, Jean; Cholakkal, Hisham; Shah, Mubarak; Torr, Philip; Khan, Fahad Shahbaz; Anwer, Rao Muhammad; Khan, Salman. Agent-x: Evaluating deep multimodal reasoning in vision-centric agentic tasks. 2025. Available online: https://arxiv.org/abs/2505.24876.
  4. Baiden, John E. The 5 c’s of credit in the lending industry. Available at SSRN 1872804. 2011. [Google Scholar]
  5. Ben Abdelaziz, Fouad; Mrad, Fatma. Multiagent systems for modeling the information game in a financial market. Int. Trans. Oper. Res. 2023, 30(5), 2210–2223. [Google Scholar] [CrossRef]
  6. Berg-Kirkpatrick, Taylor; Burkett, David; Klein, Dan. An empirical investigation of statistical significance in nlp. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012; pp. 995–1005. [Google Scholar]
  7. Bhatore, Siddharth; Mohan, Lalit; Reddy, Y Raghu. Machine learning techniques for credit risk evaluation: a systematic literature review. J. Bank. Financ. Technol. 2020, 4(1), 111–138. [Google Scholar] [CrossRef]
  8. Blackwell, Martin; Sykes, Chris. The assignment of credit limits with a behaviour-scoring system. IMA J. Manag. Math. 1992, 4(1), 73–80. [Google Scholar] [CrossRef]
  9. Boz, Zeynep; Gunnec, Dilek; Birbil, S. Ilker; Öztürk, M. Kaan. Reassessment and monitoring of loan applications with machine learning. Appl. Artif. Intell. 2018, 32(9-10), 939–955. [Google Scholar] [CrossRef]
  10. Cameron, A Colin; Trivedi, Pravin K. Count data models for financial data. Handb. Stat. 1996, 14, 363–391. [Google Scholar]
  11. Chen, Fei; Ren, Wei; et al. On the control of multi-agent systems: A survey. Found. Trends Syst. Control 2019, 6(4), 339–499. [Google Scholar] [CrossRef]
  12. Chen, Tianqi; Guestrin, Carlos. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016; pp. 785–794. [Google Scholar]
  13. Chen, Weize; Su, Yusheng; Zuo, Jingwei; Yang, Cheng; Yuan, Chenfei; Chan, Chi-Min; Yu, Heyang; Lu, Yaxi; Hung, Yi-Hsin; Qian, Chen; et al. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. The Twelfth International Conference on Learning Representations, 2023. [Google Scholar]
  14. Chen, Yongchao; Arkin, Jacob; Zhang, Yang; Roy, Nicholas; Fan, Chuchu. Scalable multi-robot collaboration with large language models: Centralized or decentralized systems? In 2024 IEEE International Conference on Robotics and Automation (ICRA); IEEE, 2024; pp. 4311–4317. [Google Scholar]
  15. Chicco, Davide; Jurman, Giuseppe. The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21(1), 6. [Google Scholar] [CrossRef]
  16. De Toni, Giovanni; Viappiani, Paolo; Teso, Stefano; Lepri, Bruno; Passerini, Andrea. Personalized algorithmic recourse with preference elicitation. Transactions on Machine Learning Research, 2024. [Google Scholar]
  17. DeepSeek-AI. Deepseek-v3 technical report. arXiv preprint, 2024.
  18. Deshpande, Darshan; Gangal, Varun; Mehta, Hersh; Krishnan, Jitin; Kannappan, Anand; Qian, Rebecca. Trail: Trace reasoning and agentic issue localization. arXiv 2025, arXiv:2505.08638. [Google Scholar] [CrossRef]
  19. Dong, Guanting; Yuan, Hongyi; Lu, Keming; Li, Chengpeng; Xue, Mingfeng; Liu, Dayiheng; Wang, Wei; Yuan, Zheng; Zhou, Chang; Zhou, Jingren. How abilities in large language models are affected by supervised fine-tuning data composition. Proc. 62nd Annu. Meet. Assoc. Comput. Linguist. 2024, Volume 1, 177–198. [Google Scholar]
  20. Dorri, Ali; Kanhere, Salil S; Jurdak, Raja. Multi-agent systems: A survey. Ieee Access 2018, 6, 28573–28593. [Google Scholar] [CrossRef]
  21. Eren Erdogan, Lutfi; Lee, Nicholas; Kim, Sehoon; Moon, Suhong; Furuta, Hiroki; Anumanchipalli, Gopala; Keutzer, Kurt; Gholami, Amir. Plan-and-act: Improving planning of agents for long-horizon tasks. arXiv E-Prints 2025, arXiv–2503. [Google Scholar]
  22. Fang, Alex; Madappally Jose, Albin; Jain, Amit; Schmidt, Ludwig; Toshev, Alexander; Shankar, Vaishaal. Data filtering networks. arXiv 2023, arXiv:2309.17425. [Google Scholar] [CrossRef]
  23. Fatemi, Sorouralsadat; Hu, Yuheng. Finvision: A multi-agent framework for stock market prediction. In Proceedings of the 5th ACM International Conference on AI in Finance, 2024; pp. 582–590. [Google Scholar]
  24. Fazlija, Bledar; Harder, Pedro. Using financial news sentiment for stock price direction prediction. Mathematics 2022, 10(13), 2156. [Google Scholar] [CrossRef]
  25. Feng, Duanyu; Dai, Yongfu; Huang, Jimin; Zhang, Yifang; Xie, Qianqian; Han, Weiguang; Chen, Zhengyu; Lopez-Lira, Alejandro; Wang, Hao. Empowering many, biasing a few: Generalist credit scoring through large language models. arXiv 2023, arXiv:2310.00566. [Google Scholar]
  26. Gandhar, Akash; Gupta, Kapil; Pandey, Aman Kumar; Raj, Dharm. Fraud detection using machine learning and deep learning. SN Comput. Sci. 2024, 5(5), 453. [Google Scholar] [CrossRef]
  27. Ghezloo, Fatemeh; Seyfioglu, Mehmet Saygin; Soraki, Rustin; Ikezogwo, Wisdom O.; Li, Beibin; Vivekanandan, Tejoram; Elmore, Joann G.; Krishna, Ranjay; Shapiro, Linda. Pathfinder: A multi-modal multi-agent system for medical diagnostic decision-making applied to histopathology. 2025. Available online: https://arxiv.org/abs/2502.08916.
  28. Guo, Taicheng; Chen, Xiuying; Wang, Yaqi; Chang, Ruidi; Pei, Shichao; Chawla, Nitesh V; Wiest, Olaf; Zhang, Xiangliang. Large language model based multi-agents: A survey of progress and challenges. arXiv 2024, arXiv:2402.01680. [Google Scholar] [CrossRef]
  29. Han, Zeyu; Gao, Chao; Liu, Jinyang; Zhang, Jeff; Zhang, Sai Qian. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv 2024, arXiv:2403.14608. [Google Scholar]
  30. He, Junda; Treude, Christoph; Lo, David. Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead. ACM Trans. Softw. Eng. Methodol. 2025a, 34(5), 1–30. [Google Scholar] [CrossRef]
  31. He, Yexiao; Li, Ang; Liu, Boyi; Yao, Zhewei; He, Yuxiong. Medorch: Medical diagnosis with tool-augmented reasoning agents for flexible extensibility. 2025b. Available online: https://arxiv.org/abs/2506.00235.
  32. Hoang, Daniel; Wiegratz, Kevin. Machine learning methods in finance: Recent applications and prospects. Eur. Financ. Manag. 2023, 29(5), 1657–1701. [Google Scholar] [CrossRef]
  33. Hong, Sirui; Zhuge, Mingchen; Chen, Jonathan; Zheng, Xiawu; Cheng, Yuheng; Wang, Jinlin; Zhang, Ceyao; Wang, Zili; Ka, Steven; Yau, Shing; Lin, Zijuan; et al. Metagpt: Meta programming for a multi-agent collaborative framework. The twelfth international conference on learning representations, 2023. [Google Scholar]
  34. Hu, Zican; Zhang, Zongzhang; Li, Huaxiong; Chen, Chunlin; Ding, Hongyu; Wang, Zhi. Attention-guided contrastive role representations for multi-agent reinforcement learning. arXiv 2023, arXiv:2312.04819. [Google Scholar]
  35. Huang, Jin; Ling, Charles X. Using auc and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 2005, 17(3), 299–310. [Google Scholar] [CrossRef]
  36. Huang, Kexin; Zhang, Serena; Wang, Hanchen; Qu, Yuanhao; Lu, Yingzhou; Roohani, Yusuf; Li, Ryan; Qiu, Lin; Li, Gavin; Zhang, Junze; et al. Biomni: A general-purpose biomedical ai agent. biorxiv 2025. [Google Scholar]
  37. Hurlin, Christophe; Pérignon, Christophe; Saurin, Sébastien. The fairness of credit scoring models. Manag. Sci. 2026, 72(1), 406–425. [Google Scholar] [CrossRef]
  38. Jajoo, Gautam; Chitale, Pranjal A; Agarwal, Saksham. Masca: Llm based-multi agents system for credit assessment. 2025. Available online: https://arxiv.org/abs/2507.22758.
  39. Jansen, Peter; Côté, Marc-Alexandre; Khot, Tushar; Bransom, Erin; Mishra, Bhavana Dalvi; Majumder, Bodhisattwa Prasad; Tafjord, Oyvind; Clark, Peter. Discoveryworld: A virtual environment for developing and evaluating automated scientific discovery agents. In Advances in Neural Information Processing Systems; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc., 2024; Volume 37, pp. 10088–10116. [Google Scholar] [CrossRef]
  40. Jin, Haibo; Zhang, Peiyan; Wang, Peiran; Luo, Man; Wang, Haohan. From hallucinations to jailbreaks: Rethinking the vulnerability of large foundation models. arXiv 2025, arXiv:2505.24232. [Google Scholar] [CrossRef]
  41. Kang, Li; Wang, Xiaocun; Zhao, Lei; Wang, Xijing; Xiang, Sheng. Multimodal large language model for enterprise credit assessment with power data enhancement. In Proceedings of the 2025 International Conference on Economic Management and Big Data Application, 2025; pp. 785–791. [Google Scholar]
  42. Karimi, Amir-Hossein; Barthe, Gilles; Schölkopf, Bernhard; Valera, Isabel. A survey of algorithmic recourse: contrastive explanations and consequential recommendations. ACM Comput. Surv. 2022, 55(5), 1–29. [Google Scholar] [CrossRef]
  43. Li, Guohao; Hammoud, Hasan; Itani, Hani; Khizbullin, Dmitrii; Ghanem, Bernard. Camel: Communicative agents for "mind" exploration of large language model society. Adv. Neural Inf. Process. Syst. 2023a, 36, 51991–52008. [Google Scholar]
  44. Li, Jiangtong; Bian, Yuxuan; Wang, Guoxuan; Lei, Yang; Cheng, Dawei; Ding, Zhijun; Jiang, Changjun. Cfgpt: Chinese financial assistant with large language model. arXiv 2023b, arXiv:2309.10654. [Google Scholar]
  45. Li, Xiangci; Ouyang, Jessica. Related work and citation text generation: A survey. arXiv 2024, arXiv:2404.11588. [Google Scholar] [CrossRef]
  46. Liu, Zijun; Zhang, Yanzhe; Li, Peng; Liu, Yang; Yang, Diyi. Dynamic llm-agent network: An llm-agent collaboration framework with agent team optimization. arXiv 2023, arXiv:2310.02170. [Google Scholar]
  47. Ma, Li; et al. Catboost: unbiased boosting with categorical features. Expert Systems with Applications, 2024. [Google Scholar]
  48. Machado, Marcos R; Chen, Daniel Tianfu; Osterrieder, Joerg R. An analytical approach to credit risk assessment using machine learning models. Decis. Anal. J. 2025, 100605. [Google Scholar] [CrossRef]
  49. Madero, Armando. Specialized multi-agent neural architecture for enhanced reasoning and minimal intervention in ai systems. Available at SSRN 5534058. 2025. [Google Scholar]
  50. Majumdar, Chitro; Scandizzo, Sergio; Mahanta, Ratanlal; Mandal, Avradip; Bhattacharjee, Swarnendu. A large language model for corporate credit scoring. 2025. Available online: https://arxiv.org/abs/2511.02593.
  51. McCracken, Lance M; DaSilva, Philomena; Skillicorn, Beth; Doherty, Richard. The cognitive fusion questionnaire: A preliminary study of psychometric properties and prediction of functioning in chronic pain. Clin. J. Pain 2014, 30(10), 894–901. [Google Scholar] [CrossRef]
  52. Moro, Sérgio; Cortez, Paulo; Rita, Paulo. An automated literature analysis on data mining applications to credit risk assessment. Artificial Intelligence in Financial Markets: Cutting Edge Applications for Risk Management, Portfolio Optimization and Economics 2016, pp. 161–177. [Google Scholar]
  53. Mulyanto, Sigit; Yonia, Dwika; Sutejo, Bambang. Finance loan risk assessment using machine learning for credit eligibility prediction and model optimization. IJISTECH 2025, 8(5), 303–311. Available online: https://www.ijistech.org/ijistech/index.php/ijistech/article/view/376. [CrossRef]
  54. Naili, Maryem; Lahrichi, Younes. The determinants of banks’ credit risk: Review of the literature and future research agenda. Int. J. Financ. Econ. 2022, 27(1), 334–360. [Google Scholar] [CrossRef]
  55. Park, Joon Sung; O’Brien, Joseph; Cai, Carrie Jun; Morris, Meredith Ringel; Liang, Percy; Bernstein, Michael S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023; pp. 1–22. [Google Scholar]
  56. Pathi, Sai Prashanth. A multi-agent framework for personalized credit recommendations. Int. J. Multidiscip. Res. Growth Eval. 2025. [Google Scholar] [CrossRef]
  57. Popa, Alexandru; Sîrbu, Tiberiu-Iulian. From fragmentation to cohesion: An llm-based iterative approach to ontology and knowledge graph refinement. In 2025 25th International Conference on Control Systems and Computer Science (CSCS); IEEE, 2025; pp. 651–654. [Google Scholar]
  58. Puli, Sreenivasulu; Thota, Nagaraju; Subrahmanyam, A.C.V. Assessing machine learning techniques for predicting banking crises in india. J. Risk Financ. Manag. 2024, 17(4), 141. [Google Scholar] [CrossRef]
  59. Purificato, E.; Lorenzo, F.; Fallucchi, F.; De Luca, E. W. The use of responsible artificial intelligence techniques in the context of loan approval processes. Int. J. Human–Computer Interact. 2023, 39(7), 1543–1562. [Google Scholar] [CrossRef]
  60. Raliphada, Pfarelo; Olusanya, Micheal O; Olukanmi, Seun. Optimizing credit risk model classification using bert, roberta, and llama 3.2. In IEEE EUROCON 2025-21st International Conference on Smart Technologies; IEEE, 2025; pp. 1–6. [Google Scholar]
  61. Rasmussen, Preston; Paliychuk, Pavlo; Beauvais, Travis; Ryan, Jack; Chalef, Daniel. Zep: A temporal knowledge graph architecture for agent memory. 2025. Available online: https://arxiv.org/abs/2501.13956.
  62. Rawat, Mrinal; Gupta, Ambuje; Goomer, Rushil; Di Bari, Alessandro; Gupta, Neha; Pieraccini, Roberto. Pre-act: Multi-step planning and reasoning improves acting in llm agents. 2025. Available online: https://arxiv.org/abs/2505.09970.
  63. Sadok, Hicham; Sakka, Fadi; El Hadi El Maknouzi, Mohammed. Artificial intelligence and bank credit analysis: A review. Cogent Econ. Financ. 2022, 10(1), 2023262. [Google Scholar] [CrossRef]
  64. Sanz-Guerrero, Mario; Arroyo, Javier. Credit risk meets large language models: Building a risk indicator from loan descriptions in p2p lending. arXiv 2024a, arXiv:2401.16458. [Google Scholar] [CrossRef]
  65. Sanz-Guerrero, Mario; Arroyo, Javier. Credit risk meets large language models: Building a risk indicator from loan descriptions in peer-to-peer lending. Available at SSRN 4979155. 2024b. [Google Scholar]
  66. Eslam, Hussein Sayed; Alabrah, Amerah; Kamel, Hussein Rahouma; Muhammad, Zohaib; Badry, Rasha M. Machine learning and deep learning for loan prediction in banking: Exploring ensemble methods and data balancing. IEEE Access 2024, 12, 193997–194019. [Google Scholar] [CrossRef]
  67. Schmidgall, Samuel; Su, Yusheng; Wang, Ze; Sun, Ximeng; Wu, Jialian; Yu, Xiaodong; Liu, Jiang; Moor, Michael; Liu, Zicheng; Barsoum, Emad. Agent laboratory: Using llm agents as research assistants. 2025. Available online: https://arxiv.org/abs/2501.04227.
  68. Shang, Tianqi; He, Weiqing; Zheng, Charles; Li, Lingyao; Shen, Li; Zhao, Bingxin. Dynamicare: A dynamic multi-agent framework for interactive and open-ended medical decision-making. 2025. Available online: https://arxiv.org/abs/2507.02616.
  69. Shao, Zhihong; Wang, Peiyi; Zhu, Qihao; Xu, Runxin; Song, Junxiao; Bi, Xiao; Zhang, Haowei; Zhang, Mingchuan; Li, Y. K.; Wu, Y.; Guo, Daya. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. 2024. Available online: https://arxiv.org/abs/2402.03300.
  70. Shi, Jinxin; Zhao, Jiabao; Wu, Xingjiao; Xu, Ruyi; Jiang, Yuan-Hao; He, Liang. Mitigating reasoning hallucination through multi-agent collaborative filtering. Expert Syst. With Appl. 2025, 263, 125723. [Google Scholar] [CrossRef]
  71. Simonetto, Thibault; Ghamizi, Salah; Cordy, Maxime. Tabularbench: Benchmarking adversarial robustness for tabular deep learning in real-world use-cases. Adv. Neural Inf. Process. Syst. 2024, 37, 78394–78430. [Google Scholar]
  72. Song, Yu; Wang, Yuyan; Ye, Xin; Zaretzki, Russell; Liu, Chuanren. Loan default prediction using a credit rating-specific and multi-objective ensemble learning scheme. Inf. Sci. 2023, 629, 599–617. [Google Scholar] [CrossRef]
  73. Sun, Chuanneng; Huang, Songjun; Pompili, Dario. Llm-based multi-agent decision-making: Challenges and future directions. IEEE Robotics and Automation Letters, 2025. [Google Scholar]
  74. Sun, Jingyun; Dai, Chengxiao; Luo, Zhongze; Chang, Yangbo; Li, Yang. Lawluo: A multi-agent collaborative framework for multi-round chinese legal consultation. arXiv 2024, arXiv:2407.16252. [Google Scholar]
  75. Tan, Huirong; Xie, Yanruixue. Financial text analysis and credit risk assessment using a gpt-4 and improved bert fusion model. PLoS ONE 2025, 20(11), e0336217. [Google Scholar] [CrossRef]
  76. Tavasoli, Ahmadreza; Sharbaf, Maedeh; Madani, Seyed Mohamad. Responsible innovation: A strategic framework for financial llm integration. arXiv 2025, arXiv:2504.02165. [Google Scholar] [CrossRef]
  77. Tian, Xu; Tian, ZongYi; Khatib, Saleh FA; Wang, Yan. Machine learning in internet financial risk management: A systematic literature review. PLoS ONE 2024, 19(4), e0300195. [Google Scholar] [CrossRef]
  78. Wan, Xiangpeng; Deng, Haicheng; Zou, Kai; Xu, Shiqi. Enhancing the efficiency and accuracy of underlying asset reviews in structured finance: The application of multi-agent framework. 2024. Available online: https://arxiv.org/abs/2405.04294.
  79. Wang, Shilong; Zhang, Guibin; Yu, Miao; Wan, Guancheng; Meng, Fanci; Guo, Chongye; Wang, Kun; Wang, Yang. G-safeguard: A topology-guided security lens and treatment on llm-based multi-agent systems. 2025a. Available online: https://arxiv.org/abs/2502.11127.
  80. Wang, Xiaofeng; Zhang, Zhixin; Zheng, Jinguang; Ai, Yiming; Wang, Rui. Debt collection negotiations with large language models: An evaluation system and optimizing decision making with multi-agent. 2025b. Available online: https://arxiv.org/abs/2502.18228.
  81. Wang, Yuelin; Zhang, Yihan; Lu, Yan; Yu, Xinran. A comparative assessment of credit risk model based on machine learning ——a case study of bank loan data. Procedia Computer Science 2019 International Conference on Identification, Information and Knowledge in the Internet of Things, 2020a; 174, pp. 141–149. Available online: https://www.sciencedirect.com/science/article/pii/S1877050920315830, ISSN 1877-0509. [CrossRef]
  82. Wang, Yuelin; Zhang, Yihan; Lu, Yan; Yu, Xinran. A comparative assessment of credit risk model based on machine learning——a case study of bank loan data. Procedia Comput. Sci. 2020b, 174, 141–149. [Google Scholar] [CrossRef]
  83. Wang, Ziyue; Wu, Junde; Cai, Linghan; Low, Chang Han; Yang, Xihong; Li, Qiaxuan; Jin, Yueming. Medagent-pro: Towards evidence-based multi-modal medical diagnosis via reasoning agentic workflow. 2025c. Available online: https://arxiv.org/abs/2503.18968.
  84. Wei, Zhepei; Yao, Wenlin; Liu, Yao; Zhang, Weizhi; Lu, Qin; Qiu, Liang; Yu, Changlong; Xu, Puyang; Zhang, Chao; Yin, Bing; Yun, Hyokun; Li, Lihong. Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning. 2025. Available online: https://arxiv.org/abs/2505.16421.
  85. Wright, Devin R; An, Jisun; Ahn, Yong-Yeol. Cognitive linguistic identity fusion score (clifs): A scalable cognition-informed approach to quantifying identity fusion from text. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025; pp. 11643–11673. [Google Scholar]
  86. Wu, Qingyun; Bansal, Gagan; Zhang, Jieyu; Wu, Yiran; Li, Beibin; Zhu, Erkang; Jiang, Li; Zhang, Xiaoyun; Zhang, Shaokun; Liu, Jiale; et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. First Conference on Language Modeling, 2024. [Google Scholar]
  87. Xu, Jun. Genai and llm for financial institutions: A corporate strategic survey. Available at SSRN 4988118. 2024.
  88. Yang, Rui; Chen, Hanyang; Zhang, Junyu; Zhao, Mark; Qian, Cheng; Wang, Kangrui; Wang, Qineng; Venkat Koripella, Teja; Movahedi, Marziyeh; Li, Manling; et al. Embodiedbench: Comprehensive benchmarking multi-modal large language models for vision-driven embodied agents. arXiv 2025, arXiv:2502.09560. [Google Scholar]
  89. Yao, Shunyu; Zhao, Jeffrey; Yu, Dian; Du, Nan; Shafran, Izhak; Narasimhan, Karthik R; Cao, Yuan. React: Synergizing reasoning and acting in language models. International Conference on Learning Representations (ICLR), 2023. [Google Scholar]
  90. Yao, Zijun; Liu, Yantao; Chen, Yanxu; Chen, Jianhui; Fang, Junfeng; Hou, Lei; Li, Juanzi; Chua, Tat-Seng. Are reasoning models more prone to hallucination? arXiv 2025, arXiv:2505.23646. [Google Scholar] [CrossRef]
  91. Yu, Yangyang; Yao, Zhiyuan; Li, Haohang; Deng, Zhiyang; Jiang, Yuechen; Cao, Yupeng; Chen, Zhi; Suchow, Jordan; Cui, Zhenyu; Liu, Rong; et al. Fincon: A synthesized llm multi-agent system with conceptual verbal reinforcement for enhanced financial decision making. Adv. Neural Inf. Process. Syst. 2024, 37, 137010–137045. [Google Scholar]
  92. Yue, Ling; Xing, Sixue; Chen, Jintai; Fu, Tianfan. Clinicalagent: Clinical trial multi-agent system with large language model-based reasoning. 2024. Available online: https://arxiv.org/abs/2404.14777.
  93. Zainullina, Karina; Golubev, Alexander; Trofimova, Maria; Polezhaev, Sergei; Badertdinov, Ibragim; Litvintseva, Daria; Karasik, Simon; Fisin, Filipp; Skvortsov, Sergei; Nekrashevich, Maksim; et al. Guided search strategies in non-serializable environments with applications to software engineering agents. arXiv 2025, arXiv:2505.13652. [Google Scholar] [CrossRef]
  94. Zamojska, Monika; Chudziak; et al. Games agents play: Towards transactional analysis in llm-based multi-agent systems. arXiv 2025, arXiv:2507.21354. [Google Scholar]
  95. Zeng, Yifan; Wu, Yiran; Zhang, Xiao; Wang, Huazheng; Wu, Qingyun. Autodefense: Multi-agent llm defense against jailbreak attacks. arXiv 2024, arXiv:2403.04783. [Google Scholar]
  96. Zhang, Guibin; Yue, Yanwei; Li, Zhixun; Yun, Sukwon; Wan, Guancheng; Wang, Kun; Cheng, Dawei; Yu, Jeffrey Xu; Chen, Tianlong. Cut the crap: An economical communication pipeline for llm-based multi-agent systems. arXiv 2024a, arXiv:2410.02506. [Google Scholar]
  97. Zhang, Guibin; Niu, Luyang; Fang, Junfeng; Wang, Kun; Bai, Lei; Wang, Xiang. Multi-agent architecture search via agentic supernet. 2025a. Available online: https://arxiv.org/abs/2502.04180.
  98. Zhang, Lang; Hu, Haiqing; Zhang, Dan. A credit risk assessment model based on svm for small and medium enterprises in supply chain finance. Financ. Innov. 2015, 1(1), 14. [Google Scholar] [CrossRef]
  99. Zhang, Xueqiao; Zhang, Chao; Sun, Jianwen; Xiao, Jun; Yang, Yi; Luo, Yawei. Eduplanner: Llm-based multi-agent systems for customized and intelligent instructional design. IEEE Transactions on Learning Technologies, 2025b. [Google Scholar]
  100. Zhang, Xuyang; Xu, Lidong; Li, Ningxin; Zou, Jianke. Research on credit risk assessment optimization based on machine learning. Preprints.org 2024b. [Google Scholar] [CrossRef]
  101. Zhu, Yakun; Wei, Shaohang; Wang, Xu; Xue, Kui; Zhang, Xiaofan; Zhang, Shaoting. Menti: Bridging medical calculator and llm agent with nested tool calling. 2025. Available online: https://arxiv.org/abs/2410.13610.
Figure 1. Real-world Financial Institution’s Human-in-the-loop Credit Review Workflow.
Figure 2. Overview of CreditAgent. Layer 1 selects decision-relevant features via an XGBoost-guided filtering loop. Layer 2 routes filtered evidence to six specialist agents that write structured records to a shared blackboard. Layer 3 fuses specialist scores via learned attention weights, applies hard-stop vetoes, and selects a GRPO-optimized decision threshold.
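The Layer 3 step named in the Figure 2 caption (attention-weighted score fusion, hard-stop veto, decision threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the `SpecialistRecord` fields, the softmax normalization of the attention logits, the placeholder agent names, and the example scores are all assumptions made for illustration, and the threshold is passed in as a plain number rather than selected by a GRPO-trained policy.

```python
import math
from dataclasses import dataclass


@dataclass
class SpecialistRecord:
    """One blackboard record; every field name here is hypothetical."""
    agent: str
    risk_score: float        # assumed to lie in [0, 1]; higher means riskier
    hard_stop: bool = False  # True if an institution rule vetoes the case


def softmax(logits):
    """Turn raw attention logits into weights that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(records, attention_logits, threshold):
    """Return (decision, fused_score) for one credit case."""
    if len(records) != len(attention_logits):
        raise ValueError("one attention logit is required per specialist record")
    # Assumption: a hard-stop veto overrides the fused score entirely.
    if any(r.hard_stop for r in records):
        return "reject", 1.0
    weights = softmax(attention_logits)
    fused = sum(w * r.risk_score for w, r in zip(weights, records))
    return ("reject" if fused >= threshold else "approve"), fused


# Six placeholder specialists with equal attention logits (uniform weights).
scores = [0.2, 0.4, 0.6, 0.1, 0.3, 0.2]
records = [SpecialistRecord(f"agent_{i + 1}", s) for i, s in enumerate(scores)]
decision, fused = fuse(records, [0.0] * 6, threshold=0.5)
print(decision, round(fused, 4))
```

Checking the veto before fusing is a design choice of this sketch; the caption lists the veto after fusion, and either order gives the same decision when a veto always forces rejection.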
Figure 3. Model Performance Comparison Across Six Agents: Experts’ Evaluation vs. Agents’
Figure 4. Layer 2 agent training loss falls by 84.7%, and the Layer 3 GRPO reward score peaks at step 170 (5.0).
Figure 5. Sensitivity analysis of GRPO training and rollout parameters. n is the number of sampled responses per prompt.
Figure 6. Sensitivity of BEC to the cost ratio.
Table 1. Taxonomy of Related Work: Situating our Framework in the Literature. Arch.: Architecture. Trans.: Transaction Log. CR: Credit Report. #Feat.: Number of Features. #Samp.: Number of Samples. STI: Same-Tier Interaction. H: Hierarchical. C: Centralized. SM: Single Model.
Work Modality Arch. Trans. CR #Feat. #Samp. STI Decision Logic
XGBoost [12] Tabular SM × × 39 300k Discriminative
Omega2 [50] Hybrid (RAG+GBDT) C × × 24 7.8k Logistic
MASCA [38] Textual H × × 20 200 × Majority Vote
MADeN [80] Dynamic H × × 6 975 Interaction-based
Our Work Multimodal H 100+ 6,000 Score Fusion (Agent+GBDT)
Table 3. Performance comparison of CreditAgent variants against LLM baselines on the 6,000-case evaluation set. DST is Dempster-Shafer theory fusion. Avg denotes average score evaluation of the six specialist agents. Underlined values indicate the best performance.
Model Name BEC ↑ Acc. ↑ (%) Std. ↓ Acc. (G1) (%) Acc. (G2) (%) Acc. (G3) (%)
SVM 0.6127 67.40 0.1045 62.10 65.40 74.70
LightGBM 0.6353 77.00 0.0812 73.50 75.80 81.70
XGBoost 0.6394 78.47 0.0768 75.20 77.10 83.11
CatBoost 0.6402 77.53 0.0792 74.00 76.20 82.39
Gemini-3-Flash-Preview 0.5803 69.71 0.1040 57.53 68.67 82.94
Claude-Sonnet-4.5 0.6138 70.99 0.1392 70.31 54.30 88.37
GPT-5.1 0.5671 66.36 0.0767 56.09 68.48 74.52
Seed-OSS-36B-Instruct 0.5612 70.51 0.2496 86.84 35.25 89.45
Llama-3.1-8B-Instruct 0.3056 45.03 0.1612 34.28 67.82 33.00
Llama-3.3-70B-Instruct 0.5192 58.27 0.0560 50.63 60.29 63.89
Qwen3-4B-Instruct 0.5634 63.91 0.1479 67.75 44.19 79.79
Qwen3-30B-Thinking 0.4757 63.12 0.3207 84.25 17.80 87.30
Qwen3-30B-Instruct 0.5798 68.72 0.2119 81.50 38.85 85.80
DeepSeek-R1-Dist-32B 0.5433 68.85 0.2685 84.60 31.05 90.90
DeepSeek-R1-Dist-8B 0.5602 61.41 0.0758 57.91 54.38 71.94
XuanYuan2-70B-Chat 0.1508 39.37 0.3210 16.24 84.77 17.10
Fin-R1 0.5382 57.64 0.1089 63.05 42.45 67.43
CreditAgent (SFT) (DS) 0.6013 (↓0.1634) 72.23 (↓11.09) 0.2170 (↑0.0958) 84.45 (↓6.05) 41.74 (↓24.51) 90.50 (↓2.70)
CreditAgent (SFT) (Seed) 0.6190 (↓0.1457) 73.66 (↓9.66) 0.2012 (↑0.0800) 83.50 (↓7.00) 45.62 (↓20.63) 91.85 (↓1.35)
CreditAgent (SFT)+Avg 0.6862 (↓0.0785) 78.40 (↓4.92) 0.1313 (↑0.0101) 78.95 (↓11.55) 62.05 (↓4.20) 94.20 (↑1.00)
CreditAgent (SFT)+DST 0.7055 (↓0.0592) 76.55 (↓6.77) 0.0550 (↓0.0662) 72.10 (↓18.40) 73.25 (↑7.00) 84.30 (↓8.90)
CreditAgent (GRPO) 0.7647 83.32 0.1212 90.50 66.25 93.20
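The arrow annotations on the CreditAgent (SFT) rows of Table 3 appear to be signed differences from the CreditAgent (GRPO) row. As a sanity check of that reading, the sketch below recomputes the annotations for the SFT+DST row from the two rows' base values. The helper `annotate`, the dictionary layout, and the choice of decimal places per column are illustrative assumptions, not part of the paper's code.

```python
# Base values copied from Table 3 (GRPO row is the reference).
REFERENCE = {"BEC": 0.7647, "Acc": 83.32, "Std": 0.1212,
             "G1": 90.50, "G2": 66.25, "G3": 93.20}
SFT_DST = {"BEC": 0.7055, "Acc": 76.55, "Std": 0.0550,
           "G1": 72.10, "G2": 73.25, "G3": 84.30}

# Assumed display precision: four decimals for BEC and Std, two for accuracies.
DECIMALS = {"BEC": 4, "Acc": 2, "Std": 4, "G1": 2, "G2": 2, "G3": 2}


def annotate(variant, reference, decimals):
    """Format a value with its signed change relative to the reference."""
    diff = round(variant - reference, decimals)
    arrow = "↑" if diff > 0 else "↓" if diff < 0 else "="
    return f"{variant:.{decimals}f} ({arrow}{abs(diff):.{decimals}f})"


for metric, value in SFT_DST.items():
    print(metric, annotate(value, REFERENCE[metric], DECIMALS[metric]))
```

Note that the arrow only encodes the sign of the change; for Std, where lower is better, a ↓ is an improvement.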
Table 4. Ablation results of CreditAgent components. Figures in parentheses indicate performance change compared to FCA (Full CreditAgent). G1, G2, and G3 are accuracies (%). Config. refers to the model configuration. Underlined values indicate the best result.
Config. BEC↑ Acc.↑(%) Std.↓ G1 G2 G3
FCA 0.7647 83.32 0.1242 90.50 66.25 93.20
w/o Macro 0.6968 (↓0.0679) 78.25 (↓5.07) 0.1636 (↑0.0394) 88.75 (↓1.75) 55.15 (↓11.10) 90.85 (↓2.35)
w/o {M,H,W} 0.6288 (↓0.1359) 73.85 (↓9.47) 0.2010 (↑0.0768) 85.25 (↓5.25) 45.60 (↓20.65) 90.70 (↓2.50)
w/o Layer 2 0.5089 (↓0.2558) 60.34 (↓22.98) 0.2142 (↑0.0900) 72.31 (↓18.19) 30.25 (↓36.00) 78.45 (↓14.75)
Table 5. Architectural simplification study.
Configuration BEC ↑
Parallel agents (no blackboard) 0.6412
Sequential agents without shared context 0.6497
Single LLM with structured prompt 0.6836
GBDT + Layer 3 fusion hybrid 0.7043
Full CreditAgent 0.7647
Table 6. Demographic fairness analysis on the held-out test split. Lower DPD is better (M: male; F: female).
Group XGBoost GPT-5.1 CreditAgent
18–28 M 77.3 72.6 63.8
18–28 F 69.4 65.9 62.1
28–38 M 72.1 68.4 63.2
28–38 F 63.8 62.7 62.9
38–48 M 64.3 61.2 62.4
38–48 F 48.6 52.3 60.7
DPD ↓ 28.7 20.3 3.1
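The DPD row of Table 6 is numerically consistent with the gap between the largest and smallest per-group value in each column. The sketch below reproduces that arithmetic under the assumption that the per-group entries are rates in percent (the table does not state what the values measure) and that DPD here means a max-minus-min gap across the six age-by-gender groups. The function name `dpd` and the data layout are illustrative.

```python
# Per-group values copied from Table 6, in row order:
# 18–28 M, 18–28 F, 28–38 M, 28–38 F, 38–48 M, 38–48 F.
GROUP_VALUES = {
    "XGBoost":     [77.3, 69.4, 72.1, 63.8, 64.3, 48.6],
    "GPT-5.1":     [72.6, 65.9, 68.4, 62.7, 61.2, 52.3],
    "CreditAgent": [63.8, 62.1, 63.2, 62.9, 62.4, 60.7],
}


def dpd(values):
    """Gap between the most and least favoured group (assumed definition)."""
    return max(values) - min(values)


for model, values in GROUP_VALUES.items():
    print(model, round(dpd(values), 1))
```

Under this reading, a DPD of 0 would mean every group receives the same value, so the metric captures only the worst-case gap and says nothing about how the intermediate groups are distributed.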
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.