Preprint
Article

This version is not peer-reviewed.

CCLUPE: Benchmark for Credit Context Log Understanding and Prediction Evaluation

Submitted: 19 April 2026
Posted: 21 April 2026


Abstract
While Large Language Models (LLMs) have shown great promise in transforming credit risk assessment, existing evaluation frameworks are almost exclusively concerned with general financial NLP tasks and neglect the specific reasoning needed by practitioners. To address this, we develop the Credit Context Log Understanding and Prediction Evaluation (CCLUPE) benchmark. Unlike previous benchmarks, CCLUPE aims to capture and evaluate the intricate reasoning unique to each constituent of the Chinese credit market, where evaluations rest heavily on the integration and synthesis of complex transaction logs and the prediction of hidden financial behaviors. CCLUPE comprises more than 4,000 premium samples segmented into individual and micro-enterprise customers and distributed among 7 principal log types and 16 sub log types. A comprehensive assessment process involving upwards of 20 professional annotators guarantees the quality of the dataset. Moreover, we introduce Log-Score, a novel evaluation metric designed to incorporate log-misunderstanding penalties and assess multifaceted competencies. Even state-of-the-art models underperform on these high-stakes tasks. CCLUPE serves as a rigorous testbed for the next generation of financial LLMs, ensuring their robustness for deployment in complex, real-world credit scenarios.

1. Introduction

Large Language Models (LLMs) have demonstrated remarkable progress in financial applications [1,2], with their capabilities being evaluated through various benchmarks such as FinBen [3], Open FinLLM Leaderboard [4], and CALM [5]. The establishment of effective datasets and benchmarks has been instrumental in guiding model optimization and comparative analysis, significantly accelerating the development of financial domain-adapted LLMs.
Credit risk assessment [6,7], characterized by its knowledge-intensive nature and multi-modal data requirements, presents an ideal yet challenging application space for LLMs, particularly in areas such as borrower evaluation, transaction log analysis, and behavioral inference. However, existing benchmarks primarily focus on isolated financial NLP tasks or specific data modalities, failing to capture the multi-stage reasoning and causal inference required in real-world credit evaluation processes. Despite this need, there is currently a notable absence of comprehensive, high-quality benchmarks specifically designed for evaluating LLMs in credit risk assessment scenarios. In China, the absence of a unified credit scoring system, unlike the widely established FICO® Score in the United States or the Schufa Credit Score in Germany, presents unique challenges. Financial institutions must rely on nuanced, localized methods to assess creditworthiness, synthesizing transaction histories, credit reports, and contextual factors. This situation highlights the necessity for a specialized evaluation framework tailored to the diverse data sources and unique risk factors in the Chinese personal loan market.
Nevertheless, constructing a benchmark that accurately reflects the intricacies of this domain is a complex undertaking. It is non-trivial to design a comprehensive, high-quality credit risk evaluation benchmark, which presents several challenges: ❶ High-Dimensional Data Integration: Credit assessment requires the fusion of heterogeneous inputs, combining qualitative textual descriptions with quantitative, time-series transactional logs that reflect the applicant’s real situation. ❷ Multi-stage Reasoning: Real credit decisions involve complex causal inference and behavioral pattern recognition across multiple evaluation stages, demanding rigorous evaluation scenarios. ❸ Domain-Specific Precision: While general LLMs perform well on broad financial tasks, credit risk analysis requires both expert-level domain knowledge and near-zero tolerance for misunderstanding, where subtle errors can lead to significant financial exposure.
To address these challenges, we introduce CCLUPE (Credit Context Log Understanding and Prediction Evaluation), a comprehensive and high-quality benchmark for evaluating LLMs in credit risk log assessment with the following key features: ❶ Comprehensive Credit Context Coverage: CCLUPE incorporates 4,062 rigorously curated samples spanning personal and micro-enterprise client profiles, featuring 7 major log types and 16 subtypes. Each sample integrates textual credit reports with time-series transactional log data, reflecting real-world credit analysis workflows. ❷ Focus on Expenditure and Spending Pattern Recognition: Unlike previous benchmarks, CCLUPE specifically targets the ability of LLMs to interpret, reason about, and quantify borrower spending behaviors through causal inference and multi-stage reasoning. ❸ High Data Quality: We employed over 20 professional annotators and implemented rigorous validation mechanisms to ensure data integrity and annotation accuracy. ❹ Novel Evaluation Metrics: We introduce Log-Score, a robust evaluation metric that incorporates misunderstanding penalties and multi-dimensional capability assessment, addressing the financial sector’s low tolerance for inaccuracies. ❺ Challenge and Effectiveness: Extensive experiments demonstrate that even SOTA LLMs exhibit unsatisfactory performance on CCLUPE, highlighting significant challenges and the necessity for specialized research in credit risk assessment.
In summary, CCLUPE establishes a rigorous foundation for domain adaptation in financial AI. By focusing on multi-stage reasoning and behavioral inference, our benchmark provides a path for advancing LLM capabilities toward dependable deployment in complex, real-world credit scenarios.

3. CCLUPE Dataset: High-Quality Authentic Credit Log Scenario Dataset

CCLUPE is a benchmark grounded in authentic consumer credit transaction logs, covering both personal and micro-enterprise (SME) clients. Unlike benchmarks limited to general financial knowledge or synthetic data, CCLUPE employs a hierarchical framework with 7 knowledge domains, 16 sub-domains, and 4 cognitive levels. To ensure practical relevance, the dataset and questioning logic were developed in consultation with professional credit underwriters (details in Appendix B). Their expertise ensures that each evaluation task mirrors the specific spend patterns, risk signals, and decision-making logic critical to real-world loan approvals. This expert-validated structure enables a systematic evaluation of LLMs’ credit underwriting and reasoning capabilities in high-stakes financial scenarios.

3.1. Statistical Characteristics

Table 2 summarizes CCLUPE, which comprises 4,062 samples across two primary client categories (details in Appendix E). Personal Clients (2,931 samples) focus on individual consumption and repayment patterns. SMEs (1,131 samples) reflect operational cash flows and business financial health. This distribution mirrors real credit portfolios, where individual consumers predominate while SMEs represent a distinct segment requiring specialized evaluation.

Cognitive Levels

Questions are stratified across four cognitive levels based on an adapted Bloom’s taxonomy (Figure 1). Analysis & Synthesis constitutes the largest category, reflecting the integrative nature of credit risk evaluation. Comprehension & Calculation focuses on quantitative reasoning, such as ratio computation and trend analysis, while Evaluation & Inference targets higher-order predictive judgment. Finally, Memory & Recognition assesses fundamental pattern identification. This distribution covers the cognitive complexity spectrum essential for professional credit assessment, emphasizing synthesis and reasoning over rote recognition.

Knowledge Domain

The benchmark comprises seven core knowledge domains, weighted to reflect authentic credit priorities. Stability and Regularity of Cash Flow is the primary focus, as consistency is a key indicator of repayment capability. Other critical domains include Structure and Concentration, Liquidity and Pressure, and Conduct Characteristics, which evaluate diversification, financial stress, and behavioral patterns. Specialized or temporal factors such as Time Characteristics and Industry-Specific Behaviors complement the framework, ensuring a distribution that prioritizes cash flow stability as the leading predictive weight in underwriting.

3.2. Fine-Grained Credit Log Labels

Effective credit risk assessment requires fine-grained labeling to capture nuanced behavioral patterns distinguishing creditworthy users from high-risk ones. As illustrated in Figure 2, CCLUPE employs a hierarchical labeling framework tailored to two distinct client populations with fundamentally different transaction characteristics.

3.3. Question Design

Personal Client Labels

For personal clients, CCLUPE encompasses eight sub-domains capturing individual financial habits: Concentration (transaction distribution), Microloans (leverage indicators), Spikes (anomalies), Refunds (return frequency), Nighttime and Overseas (spatiotemporal patterns), Seasonality (cyclicality), and Liquidity (repayment resilience). This framework enables granular assessment of individual creditworthiness by identifying patterns ranging from routine consumption to acute financial stress.

Micro-Enterprise Client Labels

For SMEs, also eight sub-domains capture operational health and financial discipline: Repayments (regularity), Essentials (rent and inventory), and Subscription (recurring services) reflect operational stability. Risky (atypical movements) and Penalties (late charges) signal cash flow distress, while Nonessential spending evaluates management quality. Finally, Alignment assesses business-transaction consistency, and Liquidity targets working capital adequacy. This taxonomy ensures that LLMs are tested on indicators critical to real-world commercial underwriting.

Hierarchical Annotation

Each sample is orthogonally mapped to a primary knowledge domain and a corresponding client-specific sub-domain (Table 2, Appendix D). This nested multi-level structure facilitates both broad macro-level performance assessment and targeted, fine-grained diagnostic error analysis, allowing for the isolation of specific reasoning bottlenecks across diverse credit scenarios.
CCLUPE employs a systematic question design framework evaluating credit risk assessment capabilities across cognitive and knowledge dimensions, aligned with professional underwriting standards.

Question Types

The benchmark incorporates three question formats, each designed to probe multiple cognitive dimensions simultaneously. Single-choice questions primarily assess the recognition and interpretation of transaction patterns. Multiple-choice questions evaluate a model’s capacity for comprehensive analysis, requiring the identification of all relevant risk factors within a given scenario. Calculation tasks test quantitative reasoning through the computation of stability metrics and concentration ratios. Crucially, these formats are not restricted to isolated cognitive levels; rather, they serve as integrated probes where even a single-choice question may necessitate multi-step synthesis, ensuring that the four cognitive dimensions are systematically evaluated across all formats.

Cognitive Level Alignment

As depicted in Figure 1, questions are stratified across four cognitive levels to evaluate increasingly complex reasoning. Memory & Recognition requires the identification of explicit patterns directly observable in transaction logs. Comprehension & Calculation demands the interpretation and computation of derived metrics, such as cash flow ratios and liquidity trends. Analysis & Synthesis involves the strategic integration of heterogeneous features to construct coherent risk profiles. Finally, Evaluation & Inference demands high-order predictive judgments, requiring models to perform evidence synthesis and multi-step reasoning to forecast potential defaults.

Client-Type Differentiation

Personal client questions emphasize consumption behavior analysis, probing spending patterns and personal financial management. SME questions center on operational health, examining cash flow management, transaction-business alignment, and commercial payment behaviors.

3.4. Quality Assurance

CCLUPE implements a rigorous quality assurance framework encompassing four dimensions: distribution consistency, logical coherence, coverage completeness, and privacy preservation.

Distribution Consistency

Transaction logs maintain rigorous statistical alignment with authentic data distributions. We employ Kullback-Leibler (KL) divergence as a validation metric, ensuring $D_{\mathrm{KL}}(P_{\mathrm{generated}} \parallel P_{\mathrm{original}}) < 0.1$ across transaction amounts, frequencies, and temporal distributions. Beyond statistical parity, this constraint ensures that the synthetic logs faithfully preserve the underlying financial logic and sequential dependencies of genuine credit evaluation processes. Consequently, the data effectively captures the behavioral nuances and transition patterns critical for real financial decisions, yielding no significant deviation from authentic borrower profiles.
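As an illustration, the KL-divergence check can be sketched as a histogram comparison; the bin width and the smoothing epsilon below are illustrative assumptions, not values specified by the benchmark.

```python
import math
from collections import Counter

def kl_divergence(generated, original, bin_width=100.0, eps=1e-9):
    """Discrete D_KL(P_generated || P_original) over histogram bins of
    transaction amounts. bin_width and eps are illustrative choices."""
    def hist(values):
        counts = Counter(int(v // bin_width) for v in values)
        total = len(values)
        return {b: c / total for b, c in counts.items()}
    p, q = hist(generated), hist(original)
    # Sum over bins carrying generated mass; eps guards bins absent
    # from the original distribution.
    return sum(pv * math.log(pv / q.get(b, eps)) for b, pv in p.items())
```

Under this sketch, a generated log would pass validation only if the divergence stays below 0.1 on amounts, frequencies, and temporal features alike.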

Logical Coherence

To ensure integrity, transaction records undergo rigorous programmatic validation: refund-purchase pairs are verified, repayment schedules are synchronized with loan amounts, and balance trajectories are mathematically reconciled with transaction sequences. For SMEs, transaction types must align with declared business categories. Any samples containing contradictions or null values are automatically filtered to maintain dataset fidelity.
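A minimal sketch of such validation, assuming hypothetical field names (`type`, `amount`, `balance`) since the released schema is not shown here, might look like:

```python
def is_coherent(records):
    """Programmatic coherence checks on an ordered transaction log.
    Field names are assumptions; the released schema may differ."""
    # Reject null values outright.
    if any(v is None for r in records for v in r.values()):
        return False
    # Balance trajectory must reconcile with signed amounts.
    for prev, curr in zip(records, records[1:]):
        if abs(prev["balance"] + curr["amount"] - curr["balance"]) > 1e-6:
            return False
    # Every refund must match a purchase of the same magnitude.
    purchases = [-r["amount"] for r in records if r["type"] == "purchase"]
    return all(r["amount"] in purchases
               for r in records if r["type"] == "refund")
```

Samples failing any check would be filtered before annotation, mirroring the automatic fidelity gate described above.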

Coverage and Privacy

The dataset provides exhaustive coverage across all 7 domains, 16 sub-domains, and 4 cognitive levels for both client categories. To prioritize ethical research and open access, all entries were scrubbed of merchant names, precise timestamps, and personally identifiable information. This anonymization pipeline ensures full compliance with financial privacy standards while preserving the utility for public release.

3.5. Annotation Pipeline

Dataset Construction

As illustrated in Figure 2, the CCLUPE annotation pipeline comprises three stages. The data collection process integrates two sources: Transaction Logs containing structured records with timestamp, amount, type, counterparty, and balance fields; and Credit Analysis documentation providing domain expertise on assessment criteria and underwriting logic. A dual-track approach ensures comprehensive coverage of question generation. Human annotators with credit-analysis expertise generate questions using templates covering all cognitive levels and knowledge domains, producing transaction excerpts, questions, options, and verified answers. Concurrently, credit risk specialists design questions targeting specific risk indicators and behavioral patterns, ensuring alignment with professional standards.

Annotation Team

We recruited a balanced team of 24 annotators, comprising 14 junior annotators with foundational financial knowledge and 10 domain experts. The expert group included six financial industry professionals specializing in credit risk and four academics in STEM and Finance, all holding at least a Master’s degree. Junior annotators performed initial question reformulation and independent solving, while experts spearheaded data quality control, complex task design, and answer verification. The process involved approximately 800 cumulative man-hours. To ensure data integrity, 30% of the samples underwent independent multi-reviewer assessment; samples with inter-rater agreement below 90% were flagged for expert adjudication and revision. This rigorous pipeline yielded the final 4,062 validated samples, where inconsistent entries were either corrected via expert consensus or removed to maintain high reliability.
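The agreement gate described above can be sketched as follows, treating agreement as the share of reviewers choosing the modal answer; this is one plausible operationalization, since the paper does not specify the exact statistic.

```python
from collections import Counter

def flag_for_adjudication(reviews, threshold=0.90):
    """reviews maps sample id -> list of independent reviewer answers.
    Samples whose modal-answer share falls below the threshold are
    routed to expert adjudication."""
    flagged = []
    for sample_id, answers in reviews.items():
        modal_count = Counter(answers).most_common(1)[0][1]
        if modal_count / len(answers) < threshold:
            flagged.append(sample_id)
    return flagged
```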

4. CCLUPE Benchmark: Comprehensive Credit Evaluation

To ensure robust evaluation, we utilize a mixture of single-answer and multiple-choice questions (MCQ), alongside computational tasks across four cognitive levels. Unlike existing benchmarks, CCLUPE emphasizes multiple-answer MCQs to mirror the multi-faceted nature of credit risk, where a single transaction log often triggers concurrent risk indicators. This design necessitates that models identify all relevant patterns without over-selection—addressing a critical trade-off in real-world underwriting, where both missed risks (false negatives) and false alarms (false positives) carry significant financial consequences.

4.1. Transaction-Log Misunderstanding Penalty

In credit evaluations, flagging non-existent risk factors is as detrimental as overlooking actual risks. To quantify this asymmetry, we employ a precision-adjusted score. Let G denote the set of ground-truth correct options and M denote the set of options selected by the model. The score for a single item, $S_{\mathrm{item}}$, is defined as:
$$S_{\mathrm{item}} = \max\left(0,\ \frac{|G \cap M|}{|G|} - \lambda \cdot \frac{|M \setminus G|}{|M|}\right),$$
where | · | signifies cardinality, and λ > 1 is the penalty coefficient for incorrect selections (misunderstandings). This ensures that aggressive guessing strategies are penalized more heavily than conservative omissions.
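A direct implementation of this per-item score could read as follows; λ = 1.5 is an illustrative choice satisfying λ > 1, not a value fixed by the paper.

```python
def s_item(ground_truth, selected, lam=1.5):
    """Precision-adjusted item score with misunderstanding penalty.
    lam must exceed 1; 1.5 here is an illustrative assumption."""
    G, M = set(ground_truth), set(selected)
    recall = len(G & M) / len(G)
    # An empty selection incurs no precision penalty.
    penalty = lam * len(M - G) / len(M) if M else 0.0
    return max(0.0, recall - penalty)
```

With G = {A, B} and M = {A, C}, recall is 0.5 and the penalty is 0.75, so the item scores 0; the conservative omission M = {A} still earns 0.5, illustrating how aggressive guessing is punished harder than abstention.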

4.2. Balanced Multi-Dimensional Aggregation

To mitigate potential performance bias arising from class imbalance across domains and cognitive levels, we adopt a hierarchical macro-averaging strategy. This approach ensures an equitable evaluation that prevents majority-category performance from masking deficiencies in rarer but high-stakes reasoning tasks, thereby promoting a more realistic assessment of model generalizability.
First, we compute the Domain-Balanced Score ($S_{\mathrm{dom}}$) by averaging performance across D distinct knowledge domains, formulated hierarchically:
$$S_{\mathrm{dom}} = \frac{1}{D} \sum_{k=1}^{D} \bar{x}_k^{(\mathrm{dom})}, \quad \text{where} \quad \bar{x}_k^{(\mathrm{dom})} = \frac{1}{|Q_k|} \sum_{q \in Q_k} S_{\mathrm{item}}(q).$$
Here, Q k is the set of questions from domain k.
Similarly, the Cognitive-Balanced Score ( S c o g ) aggregates performance across L distinct levels of reasoning complexity:
$$S_{\mathrm{cog}} = \frac{1}{L} \sum_{j=1}^{L} \bar{x}_j^{(\mathrm{cog})}, \quad \text{where} \quad \bar{x}_j^{(\mathrm{cog})} = \frac{1}{|Q_j|} \sum_{q \in Q_j} S_{\mathrm{item}}(q).$$
The unified Knowledge-Unbiased Score is then calculated as a convex combination:
$$S_{\mathrm{unbiased}} = \eta \cdot S_{\mathrm{dom}} + (1 - \eta) \cdot S_{\mathrm{cog}},$$
with η set to 0.5 by default to ensure equal emphasis on domain breadth and cognitive depth.
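The two-level macro average and its convex combination can be sketched as:

```python
from collections import defaultdict

def macro_average(item_scores, group_labels):
    """Mean of per-group means: item i belongs to group_labels[i]."""
    buckets = defaultdict(list)
    for score, group in zip(item_scores, group_labels):
        buckets[group].append(score)
    return sum(sum(b) / len(b) for b in buckets.values()) / len(buckets)

def knowledge_unbiased_score(item_scores, domains, levels, eta=0.5):
    """eta = 0.5 weights domain breadth and cognitive depth equally."""
    s_dom = macro_average(item_scores, domains)  # S_dom over D domains
    s_cog = macro_average(item_scores, levels)   # S_cog over L levels
    return eta * s_dom + (1 - eta) * s_cog
```

Because every group contributes equally regardless of size, a rare but high-stakes domain cannot be drowned out by a large, easy one.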

4.3. Transaction-Log Score Evaluation

In processing transaction logs, identifying the bulk of risk patterns is insufficient if the model also fabricates non-existent indicators. To quantify this, we calculate the Transaction-Log Score ( L ) in two steps. First, we compute the dataset-wide Misunderstanding Rate ( M ), which represents the average ratio of incorrect selections across all N t o t a l transaction logs evaluated:
$$M = \frac{1}{N_{\mathrm{total}}} \sum_{q=1}^{N_{\mathrm{total}}} \frac{|M_q \setminus G_q|}{\max(1, |M_q|)}.$$
Here, $M_q \setminus G_q$ represents the set of fabricated risk codes selected by the model but not present in the ground truth. The $\max(1, |M_q|)$ term in the denominator avoids division by zero when the model abstains entirely; if the model selects only invalid options, the ratio attains its maximum of 1.0.
Finally, the Transaction-Log Score ( L ) is derived by discounting the Knowledge-Unbiased Score ( S u n b i a s e d ) based on this misunderstanding rate:
$$L = S_{\mathrm{unbiased}} \cdot (1 - M)^{\delta},$$
where $\delta \geq 1$ is the severity exponent. We set $\delta = 2$, imposing a quadratic penalty. This ensures that the Log-Score decays markedly faster than the misunderstanding rate increases, prioritizing analytical reliability over raw recall in the evaluation.
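Putting the pieces together, the dataset-wide misunderstanding rate and the final Log-Score can be sketched as:

```python
def transaction_log_score(s_unbiased, selections, delta=2):
    """selections: per-question (model_set, truth_set) pairs.
    delta = 2 gives the quadratic penalty used in the paper."""
    n = len(selections)
    # Average ratio of fabricated options; max(1, |M_q|) avoids
    # dividing by zero when the model abstains entirely.
    mis_rate = sum(len(m - g) / max(1, len(m)) for m, g in selections) / n
    return s_unbiased * (1 - mis_rate) ** delta
```

For example, if half the questions receive a fully fabricated selection (misunderstanding rate 0.5), a knowledge-unbiased score of 0.8 collapses to 0.8 × 0.25 = 0.2, showing how the quadratic exponent prioritizes reliability over raw recall.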

5. Experiments

5.1. Competing LLMs

To comprehensively evaluate the capabilities of current LLMs in the transaction log analysis and evaluation domain, we conducted experiments across a diverse range of model architectures and parameter scales. Our evaluation encompasses both proprietary and open-source models. The proprietary models include Gemini 3, Claude Sonnet 4.5, GPT 5, and GPT 4o. Among open-source alternatives, we selected DeepSeek V3 [14] and DeepSeek V3.2, Kimi K2 [15], Qwen 3 Max [16], GLM 4.6 [17], the Qwen 3 series [16], the Llama series, and the financial domain-adapted model Fin R1 [18].
Table 3. Performance Comparison across different evaluation dimensions.
Method Mem. Com. Ana. Eva. Sin. Mul. Cal. Avg. LogScore
Proprietary Models
Gemini 3 51.0 27.2 37.4 41.4 22.8 82.2 37.3 36.0 15.6
Claude Sonnet 4.5 40.2 35.9 30.9 40.7 44.4 13.5 35.5 34.9 15.6
Claude Sonnet 4.5 Think 46.9 42.6 39.4 50.9 51.8 13.9 41.9 42.9 22.3
GPT 5 41.1 34.6 36.0 39.2 39.3 27.7 32.3 36.5 15.4
GPT 5 Think 40.8 45.2 41.9 48.8 52.0 17.7 43.5 44.0 21.9
GPT 4o 57.4 27.6 29.9 37.1 22.6 66.7 24.2 32.7 13.1
GPT 4o Think 46.9 34.1 34.1 37.0 29.0 58.1 29.0 35.7 14.6
Open-source Models
DeepSeek V3.2 45.8 30.0 34.4 37.9 24.1 69.6 25.8 34.6 13.1
Kimi K2 50.1 24.0 33.7 37.1 17.0 84.6 22.6 32.6 12.0
Qwen 3 Max 54.8 29.6 41.6 45.3 24.8 91.0 34.2 39.6 16.3
DeepSeek V3 39.9 24.8 29.7 34.6 19.2 65.7 21.0 29.9 10.7
GLM 4.6 34.4 24.5 30.7 33.7 19.7 62.6 21.0 29.6 10.7
Qwen3 235B 36.4 25.6 24.7 34.8 34.9 11.4 30.6 31.1 10.5
Qwen3 30B 33.2 24.4 20.7 31.1 29.5 8.7 22.6 24.6 8.7
Qwen3 4B 11.1 9.3 8.4 13.0 12.5 0.3 9.7 9.7 2.5
Llama 3.3 70B 24.2 18.8 18.6 24.6 24.5 5.8 19.4 20.1 6.1
Llama 3.1 70B 22.4 14.0 11.0 19.4 18.6 0.2 12.9 14.3 4.2
Llama 3.1 8B 5.0 8.4 6.9 9.0 8.7 3.8 6.5 7.6 1.5
Fin R1 12.5 14.4 12.6 13.2 16.6 2.3 12.9 13.3 2.7

5.2. Evaluation Methods

Our experimental evaluation was conducted separately for proprietary and open-source models. Proprietary models and larger open-source models were evaluated through commercial API calls, while smaller open-source models were deployed locally. All local experiments were performed on a single NVIDIA H800-level GPU. We utilized vLLM for efficient local deployment and inference.

5.3. Main Results and Key Findings

Proprietary Models’ Performance.

Proprietary models demonstrate superior performance, and the gap with open-source models is most pronounced in multi-step financial reasoning tasks. GPT 5 Think and Claude Sonnet 4.5 Think achieve the highest average scores (44.0 and 42.9, respectively), with Claude Sonnet 4.5 Think also attaining the highest LogScore (22.3). For all proprietary model families (Claude Sonnet 4.5, GPT 5, and GPT 4o), adding a "Think" (reasoning) phase significantly increases both the LogScore and the overall average.

Open-Source Models’ Performance

Qwen 3 Max achieves competitive performance comparable to proprietary models, particularly excelling in Memory & Recognition. The data shows a significant performance jump between Llama 3.1 8B and Llama 3.3 70B, with the larger model nearly tripling the average score of the smaller version. Qwen 3 Max and Kimi K2 show exceptional performance in the multiple-choice ("Mul.") dimension, both scoring above 84.0 and beating most proprietary entries in that category.

Task-Specific Performance

All evaluated models exhibit a sharp performance decay when transitioning from single-turn to multi-turn reasoning. Proprietary frontier models, such as GPT 5 and Gemini 3, suffer an average degradation of approximately 39% in complex conversational contexts despite maintaining high aggregate scores. Calculation remains the primary bottleneck; even specialized reasoning models like GPT 5 Think achieve only 43.5% accuracy, underscoring a persistent deficiency in executing rigorous numerical logic within financial decision-making tasks.

Financial Domain Adaptation.

The LogScore reveals a significant disparity in domain expertise. Traditional open-source models, such as the Llama 3 series, consistently score below 10.0 , demonstrating limited utility for complex financial reasoning.
Table 4. Model Capability Decomposition.
Metric Proprietary Open-Source Gap
Avg. Aptitude 39.2 25.4 +13.8
Max Calc. 43.5 48.4 −4.9
Unreliability +112% +145% −33%

The Reasoning Premium.

The "Think" variants consistently enhance capabilities in Analysis & Evaluation; for instance, GPT-5-Think gains +9.6 points over its base counterpart. However, increased test-time compute does not fully resolve the multi-turn reliability deficit. Reasoning models, with an average LogScore of 21.0, still struggle with contextual persistence, indicating that extended internal reasoning alone is insufficient to maintain long-term conversational coherence in complex financial assessments.

Knowledge Performance Analysis

As the figure shows, there is a significant performance disparity across knowledge domains. Proprietary frontier models led by GPT-5-Think maintain a dominant lead in Stability and Regularity of Cash Flow, indicating superior capability in identifying acute risk signals. This varied performance across domains underscores that CCLUPE’s multi-faceted taxonomy effectively differentiates model capabilities, exposing specific knowledge deficits that aggregate scores obscure.

5.4. Stability Analysis

To validate the framework’s reliability, we conducted five independent iterations and analyzed performance variance across key dimensions (Table 5). Both proprietary GPT 5 Think and open-source Qwen 3 Max demonstrate remarkable stability, with standard deviations consistently below 1.0%. Specifically, GPT 5 Think variability ranges between 0.47% and 0.93%, while Qwen 3 Max fluctuates narrowly within 0.63%–0.74%. These sub-percentage variances indicate that performance gains are not artifacts of prompt sensitivity or stochastic noise [19]. Instead, our framework captures a stable, reproducible performance signal, substantiating the robustness of our metrics and the high quality of the CCLUPE dataset across disparate architectures and task types.
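The stability check reduces to computing sample standard deviations across independent runs; the helper below is a minimal sketch of that criterion, with the 1.0% threshold taken from the text.

```python
import statistics

def is_stable(run_scores, max_std=1.0):
    """Flag a metric as stable when the sample std-dev across
    independent evaluation iterations stays below max_std (in %)."""
    return statistics.stdev(run_scores) < max_std
```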

5.5. Cognitive Complexity and Logical Parsing

Cognitive Dimension Distribution: The generation logic specifically targets the "Integrative Nature" of credit evaluation. As seen in the Python implementation’s `parse_questions` function, questions are categorized into four cognitive levels following a modified Bloom’s taxonomy. This ensures the benchmark evaluates not just data retrieval, but the model’s ability to synthesize evidence (e.g., "Analysis & Synthesis") to form a credit opinion.
Anti-Leakage Design: A distinct feature of the CCLUPE benchmark is the prohibition of specific numeric values in the question stems (absolute prohibition clause). By forcing the LLM to use "fuzzy positioning" (e.g., "the last transaction at merchant X"), the benchmark ensures that the evaluator model must truly parse the table to find the values, rather than relying on values mentioned in the question to guess the answer.
Difficulty Stratification: The prompt enforces a 5:5:5 ratio for difficulty. This stratification allows for a granular performance analysis. For example, "High Difficulty" questions require ≥5 calculation steps (e.g., cross-referencing multiple months of balance and aggregate repayments), which is often where standard LLMs fail on the CCLUPE leaderboard.
Domain Perspective Mapping: By mapping questions to "8 Underwriting Perspectives," the benchmark provides an industry-standard review. The `eight_category` logic in the script ensures that "Other" categories (like "Buy Now Pay Later" or "Nighttime Spending") are explicitly captured, providing professional insights into latent borrower risk that traditional scorecards might overlook.
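The anti-leakage clause, for instance, can be enforced with a simple stem filter. The function below is a hypothetical sketch, not the benchmark’s released parsing implementation:

```python
import re

def violates_anti_leakage(question_stem):
    """True if the stem leaks a specific numeric value, which the
    absolute prohibition clause forbids. Models must then locate
    values via fuzzy positioning in the transaction table itself."""
    return re.search(r"\d", question_stem) is not None
```

A stem like "the last transaction at merchant X" passes, while any stem quoting an amount or date would be rejected at generation time.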

6. Conclusions

This paper introduces CCLUPE, a benchmark for credit risk assessment comprising 4,062 high-quality samples. The dataset requires the synthesis of trimodal data (textual reports, structured tables, and time-series transaction logs) to evaluate multi-stage reasoning and causal inference. Our results show that even SOTA LLMs struggle with this domain-specific logic, necessitating our proposed LogScore to penalize hallucinations and normalize cross-domain performance. CCLUPE establishes a rigorous testbed for deploying high-fidelity LLMs in real-world commercial underwriting.

7. Limitations

Despite the rigorous curation and substantial scale of CCLUPE, we acknowledge several limitations. First, while our multiple-choice and computational tasks enable objective evaluation, they may not fully encompass the open-ended complexities of real-world underwriting reports. Second, the inherent intricacy of credit logic posed challenges even for expert annotators, potentially introducing subtle biases despite our double-blind protocols. Third, although CCLUPE spans diverse credit sub-domains, it does not currently integrate audio-visual data or real-time macro-economic feeds, which are often utilized in institutional high-frequency risk monitoring. Finally, while our stability analysis confirms robustness under standard conditions, model performance may degrade when faced with noisy or malformed transaction data. Future work will investigate adversarial perturbations and real-time data fusion to further enhance the benchmark’s ecological validity in financial ecosystems.

Appendix A. More Related Work

While early financial LLM benchmarks made progress on credit risk evaluation, several other related works also merit attention. BizFinBench offers 6,781 Chinese business-oriented queries covering numerical computation, reasoning, information extraction, and prediction [4]. Moreover, Golden Touchstone builds on this with a bilingual suite addressing eight core financial NLP tasks [20]. FinMME tackles multimodal chart–text reasoning with over 11,000 samples across 18 sub-domains [21], while FinDABench focuses on numerical analysis, anomaly detection, and report generation [12]. More comprehensive efforts like FinTral/FinSet bring together nine task types across 23 datasets to assess financial QA, extraction, and misunderstanding detection [11]. Other works emphasize the value of incorporating narrative and textual data into credit risk models [22,23]. Studies such as Drinkall et al. [24], Golec and AlabdulJalil [25], and Bagalkotkar et al. [26] demonstrate that integrating borrower narratives with traditional financial indicators yields significant improvements in prediction accuracy by capturing deeper insights into borrower behavior and contextual factors.
Table A1. Comparison of Recent Financial LLM Benchmarks and Reasoning Frameworks.
Benchmark/Model Key Focus Language/Modality Notable Contribution
BizFinBench Real-world business queries Chinese / Text + Tables Introduces IteraJudge for automated bias-reduced evaluation via iterative refinement.
CCLUPE Credit transaction logs Chinese / Text + Timeseries Focuses on multi-stage reasoning and causal inference for credit risk appraisal.
Dianjin-R1 Complex financial reasoning Bilingual / Text Enhances reasoning via dual-reward reinforcement learning and structured supervision.
FinRAGBench-V Multimodal RAG Bilingual / Text + Tables Evaluates RAG specific to multimodal financial contexts.

Appendix A.1. Datasets and Methodological Tools

Beyond task-specific benchmarks, several foundational datasets and tools have been developed to support LLM-based credit risk assessment. These resources provide essential infrastructure for evaluating text understanding, structured data reasoning, and behavioral inference in financial applications. Existing work encompasses three key directions: (1) loan document understanding, which focuses on contract analysis and financial document comprehension [27]; (2) time-series and transaction data modeling, which addresses the temporal dynamics of financial behavior [28]; and (3) financial-domain LLM frameworks with RAG integration, which enhance model reliability through external knowledge retrieval [29,30].

Appendix B. Expert Consultation Process Record and Annotation

The design and validation of our research framework benefited significantly from consultations with senior professionals at one of the largest financial institutions, which serves over 2 billion users globally. Through structured interviews with industry specialists, we gathered critical insights into consumer credit evaluation. These expert perspectives directly shaped the development of CCLUPE. Our consultation panel comprised seasoned specialists across credit underwriting, risk control, and lending operations, each with 5 to over 10 years of hands-on experience, as Table A2 shows.
Table A2. Professional Profiles Summary.
Summarized Professional Profiles
Profile A: Senior Credit Underwriting Specialist (10+ yrs). Expert in loan approval processes and comprehensive bank statement analysis.
Profile B: Risk Control Expert (8+ yrs). Specializing in strategic policy design and risk management workflow optimization.
Profile C: Risk System Architect (10+ yrs). Proven track record in developing and scaling automated decisioning systems.
Profile D: Credit Review Analyst (6+ yrs). Focused on detailed credit reporting and complex borrower risk assessment.
Profile E: Micro-enterprise Credit Specialist (7+ yrs). Skilled in small business underwriting and financial statement audit.
Profile F: Risk Operations Manager (5+ yrs). Expert in portfolio monitoring, early warning indicators, and collection strategies.
Key findings from our expert consultations highlighted several critical aspects:
Structured Credit Assessment Workflow Experts emphasized the sequential nature of credit evaluation, typically beginning with identity verification and credit report review before proceeding to income validation and repayment capacity analysis. This insight directly influenced our hierarchical task design in CCLUPE, ensuring alignment with real-world underwriting workflows.
Document and Statement Interpretation Credit professionals heavily rely on bank statements, financial documents, and tabular data for income verification and cash flow analysis. Experts A and E particularly noted that accurate interpretation of transaction records and financial statements is fundamental to credit decisions.
Multi-dimensional Cross-validation Expert D highlighted the critical practice of cross-referencing multiple data sources, including credit reports, bank statements, and employment verification, to detect inconsistencies and potential fraud. This insight reinforced our inclusion of multi-document reasoning tasks and the integration of consistency-checking mechanisms in our evaluation framework.
Risk Policy and Regulatory Compliance Experts B and F emphasized the essential role of understanding risk control policies, regulatory requirements, and institution-specific credit guidelines. This validated our approach to incorporate policy-aware evaluation criteria and domain-specific knowledge assessment across various lending scenarios.
Data Quality and Anomaly Detection Multiple experts noted the prevalence of data quality issues in real-world credit applications, including incomplete documentation, inconsistent information, and potential fabrication. Expert C particularly stressed the importance of identifying anomalies in automated decisioning systems. This observation supported our decision to include robustness testing and misunderstanding detection metrics in our benchmark design.
Borrower Segmentation Complexity Experts A and E highlighted significant differences in assessment approaches between personal loans and small business credit, noting that SME lending requires additional evaluation of business viability and industry-specific risks. This insight informed our comprehensive coverage of diverse borrower profiles and lending scenarios.
These expert insights were instrumental in developing CCLUPE’s comprehensive structure, ensuring its relevance to real-world consumer credit operations while maintaining high standards of practical applicability and evaluation rigor. The consultation process validated our approach to creating a benchmark that effectively assesses AI systems’ capabilities in handling complex credit assessment tasks across the full lending lifecycle.
We recruited professional annotators through the platform and the university recruitment system, and provided compensation aligned with standard university research-assistant rates, ensuring that payment was fair and adequate for the local cost of living and the participants’ demographics.

Appendix C. Dataset Construction Prompt Sample

Appendix C.1. Personal Client Profile: Not-Overdue Clients

Table A3 documents the localized and translated prompt configuration used to generate synthetic transaction logs for the CCLUPE benchmark.
Table A3. LLM Prompt for Generating Non-Overdue Personal Credit Logs.
Augmentation Strategy: Pattern-Consistent Flow Generation
System Overview:
You are a financial credit analysis expert. Based on the provided raw bank transaction logs of high-credit-rating clients, you must synthesize highly realistic new client data that maintains the original statistical distribution but simulates unique behavioral features.
Translation of Input Prompt:
"The following content contains the complete raw transaction data for a non-overdue personal client. Please meticulously analyze its integrity, field semantics, data distribution, and transaction characteristics (e.g., consistent repayment records, absence of overdue status flags):
[Raw Data Context Provided Hierarchically]
Please generate a new set of personal transaction data for Person ID: {ID}, strictly adhering to:
1. Structural Integrity: Headers, field counts, and data types must be identical to the original; do not add or remove columns.
2. Volume: Generate between 50 and 300 independent transaction records.
3. Temporal Distribution: Transactions should be evenly distributed throughout the 2023 calendar year (Jan–Dec).
4. Data Variance: Textual fields (merchant names, remarks) must be distinct from the original samples to avoid data leakage.
5. Logic Consistency: No null values. Amounts must fluctuate realistically, transaction types must be diverse, and logical relationships between associated fields (e.g., Balance = Prev_Balance + Amount) must remain self-consistent.
Return the result directly in tab-separated format (Header + Data) without additional conversational text."

Appendix C.2. Synthesis and Evaluation Review

Pattern Analysis: The generation process for non-overdue clients focuses on Repayment Regularity and Cash Flow Stability. Unlike overdue samples, these records prioritize the cyclical nature of income (e.g., monthly payroll) and the punctuality of high-priority expenses like credit card repayments and utilities.
Merchant Diversification: To ensure the benchmark’s robustness, the LLM is instructed to generate a wide variety of merchant names (Textual fields). This forces models to generalize spending habits rather than relying on keyword matching of known "safe" merchants.
Logical Validation: Post-generation scripts (as seen in the Python implementation) enforce a strict volume check ([50, 300] rows). This ensures that the generated credit history is long enough to exhibit long-term financial patterns, such as "Nighttime Spending" or "Investment Redemption," which are key sub-domains of the CCLUPE benchmark.
Temporal Coherence: By enforcing a distribution across the full year of 2023, the dataset captures potential seasonal behaviors (e.g., increased utility bills in winter or travel expenses in summer) that are critical for evaluating an LLM’s understanding of time-series transactional data.
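The post-generation validation described above (volume bounds, no null values, full-year coverage, and balance self-consistency) can be sketched in Python. This is a minimal illustration, not the released implementation; the `date`, `amount`, and `balance` column names are assumptions.

```python
import csv
import io
from datetime import datetime

def validate_generated_log(tsv_text, min_rows=50, max_rows=300):
    """Illustrative post-generation checks for a synthetic TSV log:
    row volume in [min_rows, max_rows], no null fields, transactions
    spanning all 12 months, and Balance = Prev_Balance + Amount."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    # Volume check: enforce the [50, 300] row constraint.
    if not (min_rows <= len(rows) <= max_rows):
        return False
    prev_balance = None
    months = set()
    for r in rows:
        # No null or empty values allowed in any field.
        if any(v is None or v == "" for v in r.values()):
            return False
        months.add(datetime.strptime(r["date"], "%Y-%m-%d").month)
        balance, amount = float(r["balance"]), float(r["amount"])
        # Balance logic must remain self-consistent across the sequence.
        if prev_balance is not None and abs(prev_balance + amount - balance) > 1e-6:
            return False
        prev_balance = balance
    # Temporal check: transactions should cover Jan-Dec.
    return months == set(range(1, 13))
```

In practice such a check would be run on every LLM response, with failed generations retried.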

Appendix C.3. Personal Client Profile: Overdue Clients

Table A4 details the prompt engineering strategy used to generate synthetic transaction logs for clients with historical default patterns. The focus is on simulating behavioral indicators of credit risk without explicitly labeling transactions as "overdue."
Table A4. LLM Prompt for Generating Overdue-Prone Personal Credit Logs.
Augmentation Strategy: Feature-Induced Risk Simulation
System Overview:
You are a senior risk management consultant. Your task is to analyze raw bank statements of clients who eventually defaulted and synthesize realistic new datasets for ID: {ID}. The generated data must exhibit subtle precursors to financial distress while remaining structurally valid.
Translation of Input Prompt:
"The following content provides the complete raw transaction data for a personal client with a history of overdue payments. Please analyze its structure, field semantics, data distribution, and specific transaction characteristics:
[Full Transactional Context Provided]
Generate a new set of original transactional records for Person ID: {person_id}, strictly adhering to:
1. Structural Adherence: Column headers, field counts, and data types must perfectly match the original; no deviations permitted.
2. Data Volume: Synthesize between 50 and 300 independent transaction records.
3. Temporal Spread: Ensure transactions are evenly distributed across the 2023 calendar year (Jan–Dec).
4. Field Uniqueness: All text-based fields (e.g., merchant names, remarks) must be entirely different from the source samples to ensure data diversity.
5. Arithmetic Logic: Ensure zero null values. Balance trajectories must be mathematically sound (e.g., refund amounts must be negative and reflect in the balance logic).
6. Latent Feature Construction: The content should subtly reflect characteristics associated with potential delinquency (e.g., irregular income, high-frequency small loans, or spikes in nighttime spending), without explicitly using words like ’Overdue’ or ’Default’.
Return ONLY tab-separated content (Header + Data) with no conversational preamble."

Appendix C.4. Risk Pattern Synthesis and Evaluation

Behavioral Indicators: The synthesis of "Overdue" profiles shifts from regular consumption to Spike Recognition and Financial Stretching. The model is tasked with generating logs that include "Microloan" signals and "Spikes" in spending that exceed typical balance buffers, simulating the "liquidity pressure" mentioned in the CCLUPE core domains.
Implicit Feature Learning: A key requirement of the prompt is the exclusion of explicit "Overdue" keywords. This is designed to test the LLM’s capability in Causal Inference and Multi-stage Reasoning. Benchmarking models must infer risk from the sequence and nature of transactions (e.g., a sudden increase in revolving credit repayments) rather than simple text classification.
Contextual Integrity: By utilizing full-year records (Jan–Dec 2023), the generated samples provide sufficient historical depth to evaluate "Seasonality" and "Repayment Behavior Patterns." The Python implementation enforces a strict tab-separated format to ensure that multi-modal models can accurately parse the tabular time-series data without alignment errors.
Data Leakage Prevention: The strict requirement for "completely different merchant names" ensures that the benchmark remains a test of latent financial behavior recognition rather than a retrieval task. This forces the evaluator model to analyze the intent behind the transaction (e.g., a high-interest lender) regardless of the merchant’s pseudonym.
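The strict tab-separated contract mentioned above can be enforced with a parser that rejects any row whose field count deviates from the header, preventing the alignment errors the section describes. The function below is a minimal sketch, not the released code.

```python
def parse_tsv_strict(response_text):
    """Parse an LLM-returned tab-separated table (Header + Data),
    raising on any row whose field count does not match the header."""
    lines = [ln for ln in response_text.strip().splitlines() if ln.strip()]
    header = lines[0].split("\t")
    records = []
    for ln in lines[1:]:
        fields = ln.split("\t")
        if len(fields) != len(header):
            # A misaligned row would silently shift every later column,
            # so fail fast instead of guessing.
            raise ValueError(
                f"misaligned row: expected {len(header)} fields, got {len(fields)}"
            )
        records.append(dict(zip(header, fields)))
    return records
```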

Appendix C.5. Micro-Enterprise Profile: Not-Overdue SME Clients

Table A5 documents the prompt engineering framework for generating synthetic transaction logs for Small and Micro Enterprises (SMEs) with healthy credit profiles. The logic centers on business-related identifiers and operational health signals.
Table A5. LLM Prompt for Generating Healthy Micro-Enterprise Credit Logs.
Augmentation Strategy: Operational Attribute Simulation
System Overview:
You are an expert in industrial finance and SME credit underwriting. Your task is to analyze the bank statements of healthy micro-enterprises (including individual businesses and small factories) and synthesize a new dataset for SME ID: {micro_id}. The logs must reflect robust commercial activity and financial discipline.
Translation of Input Prompt:
"The following content contains the complete raw transaction data for a non-overdue micro-enterprise/individual business owner. Please carefully analyze its structural integrity, field definitions, and typical indicators of healthy business operations:
[Full SME Transactional Context Provided]
Generate a new set of original transaction data for SME Subject ID: {micro_id}, strictly adhering to:
1. Structural Preservation: Column headers, field counts, and data types must be identical to the provided original; no modifications allowed.
2. Business Volume: Generate 50 to 300 independent records that reflect the typical transaction frequency of a healthy SME.
3. Temporal Consistency: Transactions must span the full year of 2023 (Jan–Dec).
4. Merchant Diversity & SME Identity:
   – Counterparty Names: Must incorporate SME-specific identifiers such as ’Co., Ltd.’, ’Trading Store’, ’Individual Proprietorship’, ’Factory’, etc.
   – Remarks: Must indicate business-specific scenarios (e.g., procurement, inventory, payroll, operational rent).
5. Financial Logic: No empty values; transaction amounts must be consistent with micro-enterprise scales. Balance logic must remain self-consistent throughout the sequence.
Return the response using a tab-separated format (Header + Data) with no additional explanation."

Appendix C.6. Business Context and Domain Analysis

Operational Health Recognition: For SMEs, the prompt focuses on Stability and Regularity of Cash Flow. Unlike personal accounts, SME logs prioritize "Essentials" (operational rent, utilities) and "Repayments" (business loans). This is essential for the CCLUPE benchmark to evaluate if a model can distinguish between personal consumption and commercial procurement.
Identity-Specific Labeling: As specified in the Python implementation’s prompt, the LLM must generate counterparties with suffixes such as "Co., Ltd." or "Factory." This ensures that the generated data captures the "Alignment" domain of CCLUPE—verifying if the transactional behavior (e.g., bulk procurement of raw materials) is actually consistent with the stated business type of the micro-subject.
Scalability and Diversity: The use of `ThreadPoolExecutor` to generate data for 225 subjects (IDs 276–500) ensures that the benchmark provides a statistically significant population of SMEs. This volume allows for rigorous testing of an LLM’s Analysis & Synthesis capabilities by presenting various SME industry types, from services to small-scale manufacturing.
Benchmark Strategic Value: In the Chinese credit market context, where formal credit scores for small business owners are often incomplete, these transaction logs serve as the primary evidence of creditworthiness. By ensuring "completely differentiated textual fields," the prompt forces the evaluator model to recognize "Specialised Cash Flow Conduct"—such as seasonal restocking patterns around the National Day or Spring Festival—rather than relying on fixed merchant names.
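The `ThreadPoolExecutor` fan-out over the 225 SME subjects mentioned above can be sketched as follows. `generate_sme_log` is a stand-in for the actual LLM generation call, which is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_sme_log(micro_id):
    # Placeholder for the per-subject LLM generation call described in
    # the prompt tables; returns a tagged stub so the pattern is runnable.
    return micro_id, f"synthetic log for SME {micro_id}"

def generate_population(id_range, max_workers=8):
    """Fan out one generation task per SME subject, e.g. the
    225-subject healthy population (IDs 276-500)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = dict(pool.map(generate_sme_log, id_range))
    return results
```

Threads suit this workload because each task is dominated by network-bound LLM calls rather than local computation.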

Appendix C.7. Micro-Enterprise Profile: Overdue SME Clients

Table A6 documents the prompt engineering strategies used to simulate transactional distress in Small and Micro Enterprises (SMEs). This prompt requires the LLM to blend "operational scenarios" with "latent delinquency signatures" while maintaining business-specific terminology.
Table A6. LLM Prompt for Generating Overdue-Prone Micro-Enterprise Credit Logs.
Augmentation Strategy: Business Cycle & Distress Simulation
System Overview:
You are a specialist in commercial credit risk evaluation. Your objective is to analyze transaction logs of SMEs (small companies, trading stores, or factories) with delinquency histories and synthesize a new dataset for SME ID: {micro_id}. The logs must reflect genuine business operations while embedding precursors to credit failure.
Translation of Input Prompt:
"The following content provides the complete raw transaction data for an overdue micro-enterprise/individual business owner. Please meticulously analyze its structural integrity, field semantics, specific business transaction characteristics, and patterns associated with credit delinquency:
[Full SME Transactional Context and Default History Provided]
Generate a new set of original transaction data for SME Subject ID: {micro_id}, strictly adhering to:
1. Structural Fidelity: Column headers, field counts, and data types must precisely match the original; do not alter the schema.
2. Operational Volume: Generate 50 to 300 independent records that reflect the business activities of an SME under financial stress.
3. Temporal & Cyclical Distribution: Spread transactions across Jan–Dec 2023, ensuring business-specific cycles (e.g., seasonal procurement) are visible.
4. Differentiated SME Identity:
   – Counterparty Names: Must include commercial identifiers such as ’Company’, ’Trading Firm’, ’Proprietorship’, or ’Factory’.
   – Remarks: Must describe plausible commercial scenarios (e.g., supply chain payments, rent, or logistics fees).
5. Logical Consistency: No empty values allowed. Transaction amounts must be appropriate for small-scale operations. Associated fields (e.g., balance and loan repayment logic) must be calculated correctly.
6. Latent Overdue Features: Infuse the logs with subtle indicators of financial distress—such as high interest-debt service, irregular revenue patterns, or excessive inventory costs relative to sales—without explicitly using words like ’overdue’ or ’default’.
Return the response as tab-separated values (Header + Data) without any conversational text."

Appendix C.8. Synthesis Logic and Macro-Domain Review

Analysis of Financial Distress: For SME subjects, the prompt focuses on Cash Flow Structure and Concentration. The transition to "overdue" is often represented by a shift in counterparty dependencies or a decline in "stability and regularity" of revenue. These latent features allow the benchmark to evaluate if LLMs can perform Multi-stage Reasoning to detect business failure precursors.
Contextual Business Scenarios: In accordance with Rule 4 of the Python script, the LLM is forced to define "remarks" that ground the data in reality. For overdue SMEs, this might include "Urgent Inventory Repayment" or "Property Rent Adjustments," which test the LLM’s Comprehension & Calculation capabilities regarding business operating margins and liquidity pressure.
Benchmark Consistency: The generation of data for 225 SME subjects (IDs 26–250) mirrors the scale of the non-overdue population. This balanced distribution is vital for calculating the Log-Score mentioned in the CCLUPE core framework, as it ensures the model is evaluated on its ability to minimize "misunderstanding penalties" (false positives) in a mixed dataset.
Domain-Specific Penalties: Because real-world SME data carries its own domain-specific complexities, the synthetic generation ensures Arithmetic Logic consistency (e.g., negative refund values). This provides a clean but challenging "trimodal" environment (text, tabular, and time-series) to test the robustness of financial LLMs against professional underwriting standards.

Appendix C.9. Question Generation: Personal Underwriting Audit

Table A7 documents the prompt used to transform raw personal transaction logs into a structured 15-question credit audit. The prompt enforces hierarchical categorization and cognitive complexity mapping.
Table A7. LLM Prompt for Generating Personal Credit Underwriting Questions.
Instruction Set: Hierarchical Underwriting Question Synthesis
Task Overview:
Based on the provided bank statement data, generate 15 specialized multiple-choice questions (at least one multi-choice). Questions must be grounded in the data but satisfy the "Zero Data Leakage" rule for the question stem.
Translation of Input Prompt (Core Requirements):
"You are provided with a personal transaction table. Generate 15 questions based on the following classification logic:
1. Primary Category (8 Underwriting Perspectives):
Select from: Consumption Stability, Essential Expenditure Regularity, Repayment Patterns, High-Risk Merchant Transactions, Discretionary Concentration, Transaction Failures & Liquidity Pressure, Fund Retention Habits, and ’Others’ (specifically labeled in parentheses).
2. Secondary Category (24 Classification Logics):
Annotate each question with its specific logic mapping (e.g., ’Consumption Stability-Payroll regularity’).
3. Strict Prohibitions:
No Specific Numbers in Stems: Do not use amounts or specific dates in the question text. Use relative references like ’the repayment recorded in the table’ or ’a certain date.’
No Intermediate Disclosure: Do not reveal the number of transactions required for calculation in the stem.
4. Complexity & Cognition Mapping:
Difficulty Levels (5:5:5 ratio): Low (≥2 computation steps), Medium (≥3 steps), High (≥5 steps).
Cognitive Dimensions: Memory & Recognition, Comprehension & Calculation, Analysis & Synthesis, Evaluation & Inference.
Standardized Format:
8 Categories: [Type] | 24 Logics: [Class-Subclass] | Difficulty: [L/M/H] | Cognition: [Dimension]
[Question Body]
a. [Opt] b. [Opt] c. [Opt] d. [Opt]
Correct Answer: X"

Appendix C.10. Cognitive Complexity and Logical Parsing

Cognitive Dimension Distribution: The generation logic specifically targets the "Integrative Nature" of credit evaluation. As seen in the Python implementation’s `parse_questions` function, questions are categorized into four cognitive levels following a modified Bloom’s taxonomy. This ensures the benchmark evaluates not just data retrieval, but the model’s ability to synthesize evidence (e.g., "Analysis & Synthesis") to form a credit opinion.
Anti-Leakage Design: A distinct feature of the CCLUPE benchmark is the prohibition of specific numeric values in the question stems (absolute prohibition clause). By forcing the LLM to use "fuzzy positioning" (e.g., "the last transaction at merchant X"), the benchmark ensures that the evaluator model must truly parse the table to find the values, rather than relying on values mentioned in the question to guess the answer.
Difficulty Stratification: The prompt enforces a 5:5:5 ratio for difficulty. This stratification allows for a granular performance analysis. For example, "High Difficulty" questions require ≥5 calculation steps (e.g., cross-referencing multiple months of balance and aggregate repayments), which is often where standard LLMs fail on the CCLUPE leaderboard.
Domain Perspective Mapping: By mapping questions to "8 Underwriting Perspectives," the benchmark provides an industry-standard review. The `eight_category` logic in the script ensures that "Other" categories (like "Buy Now Pay Later" or "Nighttime Spending") are explicitly captured, providing professional insights into latent borrower risk that traditional scorecards might overlook.
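Parsing the standardized question format shown in Table A7 can be sketched with a small regex-based routine. The regex and field names below are illustrative assumptions; the parsing code in the released scripts may differ.

```python
import re

# Header line per the standardized format:
# 8 Categories: [Type] | 24 Logics: [Class-Subclass] | Difficulty: [L/M/H] | Cognition: [Dimension]
HEADER_RE = re.compile(
    r"8 Categories:\s*(?P<category>[^|]+)\|\s*"
    r"24 Logics:\s*(?P<logic>[^|]+)\|\s*"
    r"Difficulty:\s*(?P<difficulty>[LMH])\s*\|\s*"
    r"Cognition:\s*(?P<cognition>.+)"
)

def parse_question_block(block):
    """Split one generated question into its header labels, question
    body, options (a.-d.), and answer key."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    meta = {k: v.strip() for k, v in HEADER_RE.match(lines[0]).groupdict().items()}
    meta["body"] = lines[1]
    meta["options"] = [ln for ln in lines if re.match(r"^[a-d]\.", ln)]
    answer = next(ln for ln in lines if ln.startswith("Correct Answer:"))
    meta["answer"] = answer.split(":", 1)[1].strip()
    return meta
```

Capturing the category/logic/difficulty/cognition labels this way is what enables the stratified diagnostic analyses described in this appendix.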

Appendix C.11. Question Generation: Micro-Enterprise Credit Audit

Table A8 displays the prompt configuration for generating 15 specialized credit audit questions from raw SME operational logs. The prompt focuses on distinctive SME domains such as supply chain cycles and liquidity turnover.
Table A8. LLM Prompt for Generating Micro-Enterprise (SME) Credit Audit Questions.
Instruction Set: Complex Commercial-SME Underwriting Synthesis
Task Overview:
Analyze the provided micro-enterprise operational logs. Generate 15 multiple-choice questions (at least two multi-select). Questions must revolve around commercial scenarios: revenue collection frequency, procurement vs. sales alignment, upstream/downstream payment cycles, and tax/utility regularity.
Translation of Input Prompt (Core Requirements):
"You are provided with an SME operational transaction table. Generate 15 questions based on the following classification structure:
1. Primary Category (7 SME Audit Perspectives):
Must cover all 20 sub-categories across 7 domains:
Revenue/Expense Stability: Monthly frequency and industry-specific cost-to-income ratios.
Cash Flow Health: Non-operational funds vs. operational capital and turnover efficiency.
Business Activity: Daily transaction volume, QR code payment features, and nighttime trading records.
Counterparty Risk: Relying on single-source suppliers and related-party transaction ratios.
Payment Management: Account receivables/payables aging and tax/social security punctuality.
Specialized Behaviors: Cash-out patterns, seasonal fluctuations, and transaction disputes.
Loan Compliance: Operational loan repayment history and utility regularity.
2. Functional Constraints (Anti-Guessing):
Zero-Data Stem: No specific amounts, dates, or merchant names in the stems. Use relative locators like ’a certain partner recorded in the table.’
Step-Based Difficulty (5:5:5 Ratio):
   Low: ≥2-step judgment; Medium: ≥3-step analysis; High: ≥5-step calculation/inference.
Cognitive Dimensions: Memory & Recognition, Comprehension & Calculation, Analysis & Synthesis, Evaluation & Inference."

Appendix C.12. Commercial Reasoning and SME Domain Analysis

Operational Scenario Alignment: As illustrated in the Python implementation, the prompt forces the LLM to avoid personal consumption narratives. Instead, it prioritizes Industry-Specific Ratios (e.g., procurement as a percentage of revenue). This is a vital metric in the CCLUPE benchmark for testing whether a financial LLM can distinguish between a healthy business cycle and a fraudulent or declining one.
Multi-Step Calculation Logic: The "High Difficulty" questions in the SME category are significantly more complex than personal ones. According to the script’s instruction for "≥5 steps," a single question may require the evaluator to: 1) identify all QR code receipts, 2) aggregate them by month, 3) calculate the mean, 4) compare that to the supplier payment cycle, and 5) evaluate the liquidity buffer. This directly tests the LLM’s Multi-modal/Multi-stage Reasoning capabilities.
Anti-Leakage & Qualitative Options: Consistent with the "Absolute Prohibition" clause in the prompt, options are often qualitative (e.g., "Matched with monthly revenue" vs. "Significantly exceeding revenue") rather than just numeric. This ensures that the model is evaluated on its Financial Knowledge and its ability to perform Causal Inference across the transaction timeline.
Diverse Commercial Perspectives: By covering 20 sub-categories (e.g., "Upstream/Downstream Aging Management"), the CCLUPE benchmark ensures that data synthesis does not overlook the "latency" and "receivables" inherent to SME credit risk. The `parse_questions` function in the provided script ensures that these labels are captured for diagnostic error analysis, allowing researchers to pinpoint which specific business logic causes LLM "misunderstandings."

Appendix D. Dataset Details

The CCLUPE benchmark adopts a hierarchical taxonomy to map diverse transactional behaviors into actionable credit-risk signals. As detailed in the expert consultation process, indicators are partitioned into core knowledge domains and specialized behavioral sub-categories.
Main Behavioral Domains The core evaluation focuses on seven primary domains that represent the most significant predictors of creditworthiness in the Chinese market:
  • Consumption Stability: Evaluates the variance and consistency of spending over time to infer income reliability.
  • Essential Expenditure Regularity: Monitors fixed costs (utilities, rent, tax) that indicate fundamental financial discipline.
  • Repayment Behavior Patterns: Specifically tracks loan-related logs to identify punctuality and debt-servicing intent.
  • High-Risk Merchant Activities: Identifies interactions with gambling, high-interest lending, or speculative investment platforms.
  • Discretionary Spending Concentration: Analyzes the proportion of non-essential vs. essential spending to assess luxury-led risk.
  • Liquidity Pressure and Transaction Failures: Records insufficient fund errors and abrupt balance drops.
  • Funds-Flow Timing and Consumption Lag: Analyzes the delta between income receipt and major expenditures.
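The "Funds-Flow Timing and Consumption Lag" delta above can be made concrete as the average number of days between each income deposit and the first subsequent large expenditure. The sketch below uses assumed `date`/`type`/`amount` record fields and a hypothetical large-spend threshold.

```python
from datetime import date

def funds_flow_lag(transactions, large_threshold=1000.0):
    """Illustrative income-to-expenditure lag: mean days between each
    income deposit and the first large expenditure on or after it."""
    incomes = sorted(t["date"] for t in transactions if t["type"] == "income")
    expenses = sorted(
        t["date"] for t in transactions
        if t["type"] == "expense" and t["amount"] >= large_threshold
    )
    lags = []
    for inc in incomes:
        # First large spend at or after this income receipt.
        nxt = next((e for e in expenses if e >= inc), None)
        if nxt is not None:
            lags.append((nxt - inc).days)
    return sum(lags) / len(lags) if lags else None
```

A short lag suggests income is consumed almost immediately, which underwriters read as weaker fund retention.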
Indicator Mapping Table A9 provides a comprehensive overview of the indicators categorized by their relevance to Personal and Micro-enterprise profiles.
Cognitive Level Mapping In accordance with Bloom’s Taxonomy for financial reasoning, each question based on the above indicators is mapped to a cognitive depth:
1. Memory & Recognition: Identifying specific transaction types or merchant names within the log.
2. Comprehension & Calculation: Summarizing total spending or calculating debt-to-income ratios from raw entries.
3. Analysis & Synthesis: Detecting latent patterns (e.g., "seasonal restocking") across multiple months of data.
4. Evaluation & Inference: Making high-level risk judgments or predicting potential default based on observed behavioral anomalies.
Table A9. Detailed taxonomy of CCLUPE behavioral–consumption indicators across main and secondary categories.
Category | Behavioral Signals and Specialized Questions
Main Domains
Consumption Stability | Frequency of daily spending, month-over-month variance, and lifestyle consistency.
Expenditure Regularity | Timeliness of utility payments (electricity, water, heating), and insurance continuity.
Repayment Patterns | Credit card repayment cycles, bill-period alignment, and partial vs. full settlement.
High-Risk Activities | Transactions involving speculative assets, high-leverage platforms, or risky merchants.
Spending Concentration | Counterparty dependency, merchant relationship depth, and category-specific spikes.
Liquidity Pressure | Transaction failures (insufficient funds), frequent small-value borrowing, and balance volatility.
Funds-Flow Timing | Income-to-expenditure lag, salary deposit recency, and cash flow alignment.
Secondary/Other
Digital Footprint | Subscription service continuity, app engagement frequency, and digital wallet usage trends.
Temporal Patterns | Nighttime/early-morning consumption, seasonal spending spikes, and holiday surges.
Locational Logic | Off-location (travel) consumption analysis and cross-border spending patterns.
Social/Employment | Intergenerational transfer support, part-time employment markers, and payroll regularity.
Operational Health (SME only) | Inventory restock cycles, supplier relationship stability, and business-type alignment.
Transaction Quality | Refund rates, transaction reversals, and penalty fee occurrences.
Our dataset organizes behavioral–consumption indicators into 7 main categories and one "Others" category containing indicators considered of minor relevance. The specific subcategories facilitate precise characterization and analysis of consumer financial behavior. The main categories are Consumption Stability, Essential Expenditure Regularity, Repayment Behavior Patterns, High-Risk Merchant Activities, Discretionary Spending Concentration, Liquidity Pressure and Transaction Failures, Funds-Flow Timing and Consumption Lag, and Others, while the "Others" category includes Small Loan Transactions, Subscription Service Continuity, Late Payment Penalty, etc. The specific categories are shown in Figure 7. Each main category is further divided into specialized questions that capture distinct behavioral signals relevant for credit-risk inference.
Table A10. Behavioral–consumption indicators: 7 main categories and one "Others" category.
Main Categories: Consumption Stability; Essential Expenditure Regularity; Repayment Behavior Patterns; High-Risk Merchant Activities; Discretionary Spending Concentration; Liquidity Pressure and Transaction Failures; Funds-Flow Timing and Consumption Lag; Others.
"Others" Sub-Indicators: Small Loan Transactions; Subscription Service Continuity; Late Payment Penalty; Seasonal Spending Behavior; Installment Payment; Part-time Employment Indicators; Sudden Large-Amount Spending; Refunds and Transaction Reversals; Spending Diversification and Merchant Relationships; Cross-Border Consumption Characteristics; Nighttime and Early-Morning Consumption; Consumption Activity and Frequency; Consumption Cycles and Trigger Patterns; Off-Location Consumption Analysis; Intergenerational Consumption Support; Income-Consumption Cycle Alignment.

Appendix E. Statistical Characteristics

The CCLUPE benchmark consists of 4,062 high-quality samples, designed to bridge the gap between general financial NLP and specialized credit risk reasoning. Unlike synthetic benchmarks, CCLUPE is grounded in authentic transaction logs. Personal-client samples (N=2,931) reflect individual consumption and repayment hygiene, while SME samples (N=1,131) focus on operational cash flows, business continuity, and industry-specific financial health. The dataset features a hierarchical annotation framework spanning 7 core knowledge domains and 16 fine-grained sub-domains. As shown in Table A11, the Stability and Regularity of Cash Flow domain represents the largest segment (34.6%), mirroring real-world credit underwriting priorities, where consistent revenue is the primary indicator of repayment capacity.
Table A11. Comprehensive breakdown of the CCLUPE Dataset across domains and client types.
Core Domain | Sub-Domain Examples | Personal | SME | Total
Stability & Regularity | Payroll, Revenue Stability | 980 | 428 | 1,408
Cash Flow Conduct | Specific Industry Patterns | 310 | 210 | 520
Structure & Concentration | Merchant Diversity, Counterparty Risk | 645 | 212 | 857
Time Characteristics | Nighttime Trading, Seasonality | 82 | 36 | 118
Specialised Conduct | Cross-border, Tax Punctuality | 130 | 80 | 210
Liquidity & Pressure | Fund Reserves, Overdue Precursors | 680 | 136 | 816
Others | Miscellaneous Risk Factors | 104 | 29 | 133
Total | | 2,931 | 1,131 | 4,062
To evaluate deep reasoning rather than simple retrieval, we map questions to four cognitive levels. Analysis & Synthesis (43.8%) is the most frequent level, requiring models to integrate multi-source data (textual descriptions and time-series logs) to form a coherent credit opinion.
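The counts reported in this appendix can be cross-checked directly from Table A11 with a short sanity-check script (all numbers are copied from the tables above):

```python
# Per-domain (personal, SME) counts from Table A11.
domains = {
    "Stability & Regularity": (980, 428),
    "Cash Flow Conduct": (310, 210),
    "Structure & Concentration": (645, 212),
    "Time Characteristics": (82, 36),
    "Specialised Conduct": (130, 80),
    "Liquidity & Pressure": (680, 136),
    "Others": (104, 29),
}
personal = sum(p for p, s in domains.values())
sme = sum(s for p, s in domains.values())
total = personal + sme
print(personal, sme, total)  # 2931 1131 4062

# Largest domain: 1,408 / 4,062 ≈ 34.7% (reported as 34.6% in the text,
# i.e., truncated rather than rounded).
print(round((980 + 428) / total * 100, 1))
# Analysis & Synthesis cognitive level: 1,780 / 4,062 ≈ 43.8%.
print(round(1780 / total * 100, 1))
```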

Appendix F. Credit Review Domain Knowledge Domain Analysis

The fine-grained results in Table A12 reveal a stark divide in model performance across credit indicators. Proprietary frontier models, particularly the Claude Sonnet 4.5 and GPT-5 series, demonstrate localized mastery of acute risk signals. For example, GPT-5-Think achieves a state-of-the-art 62.5% in Spikes detection and 53.0% in Risky transaction identification. These high scores suggest that large-scale reinforcement learning from human feedback (RLHF) has successfully equipped these models to identify sudden behavioral anomalies in financial logs.

Appendix F.1. The Bottleneck of Concentration and Nonessential Spending

Despite successes in anomaly detection, nearly all models struggle with structural indicators such as Concentration (Conce.) and Nonessential Spending (Noness.). In the concentration domain, aggregate scores for open-source models often fall below 20%, and even a high-performing reasoning model like GPT-4o-Think manages only 22.4%. This performance floor suggests that models find it significantly more difficult to execute the multi-step aggregation required to assess portfolio-level risks, or to distinguish essential business operating expenses from non-essential outflows, highlighting a critical area for future dataset-led alignment.

Appendix F.2. Impact of Extended Reasoning

The inclusion of "Think" (internal reasoning) variants generally elevates performance in synthesis-heavy tasks such as Repayment (Repay.) and Asset Overshooting (Overs.). For instance, Claude-Sonnet-4.5-Think provides a substantial +20.0 point boost in Asset Overshooting over the standard Claude Sonnet 4.5. These reasoning-enhanced models are better at maintaining the long-range logical chains necessary to reconcile a borrower's historical debt obligations with their current liquidity trajectories. However, the persistent inaccuracy in the Others category across all models indicates that even extended reasoning cannot yet fully resolve the interpretive ambiguity of atypical or less-structured transaction types.
Table A12. Domain-specific Performance Comparison across different sectors and industries.
Method Align. Conce. Micro. Spikes Refun. Night. Overs. Seaso. Liqud. Repay. Essen. Risky Noness. Penal. Subsc. Others
Proprietary Models
Gemini 3 29.1 8.8 13.3 31.2 48.3 18.2 26.3 23.1 28.2 28.8 30.1 36.7 15.9 24.5 39.3 29.5
Claude Sonnet 4.5 47.3 41.4 40.0 50.0 32.8 31.8 35.0 33.6 28.7 32.6 33.8 48.9 39.7 30.2 30.5 32.9
Claude Sonnet 4.5 Think 54.1 48.3 36.7 50.0 37.8 47.0 55.0 44.9 38.4 43.0 42.8 54.6 48.3 34.0 35.6 38.7
GPT 5 39.7 32.8 33.3 56.2 35.3 28.8 35.0 30.8 30.0 32.1 35.8 39.2 37.1 26.4 35.6 30.6
GPT 5 Think 45.2 46.6 36.7 62.5 41.2 43.9 50.0 40.2 42.3 43.2 44.8 53.0 49.3 45.3 40.7 39.8
GPT 4o 33.6 13.8 10.0 43.8 20.2 24.2 25.0 13.1 23.6 22.3 23.8 33.8 14.4 17.0 25.4 28.2
GPT 4o Think 32.2 22.4 10.0 25.0 37.0 37.9 40.0 24.3 25.7 25.5 29.2 35.7 24.8 17.0 28.8 31.0
Open-source Models
DeepSeek V3.2 27.4 12.1 23.3 12.5 31.1 25.8 30.0 17.8 27.8 27.8 30.2 32.4 15.7 26.4 25.4 24.3
Kimi K2 26.0 6.9 16.7 18.8 35.3 27.3 25.0 15.9 23.7 19.7 22.6 31.9 11.2 9.4 18.6 25.4
Qwen3 Max 30.8 15.5 16.7 31.8 49.6 21.2 25.0 26.2 30.7 29.1 30.4 35.7 20.0 24.5 33.9 30.8
DeepSeek V3 22.6 15.5 30.0 31.2 20.2 16.7 20.0 16.8 23.8 18.2 22.6 29.7 13.6 17.0 20.3 22.4
GLM 4.6 23.3 15.5 13.3 25.0 26.9 21.2 20.0 15.9 23.7 19.1 23.6 26.5 16.3 13.2 25.4 24.5
Qwen3 235B 11.0 6.7 12.0 16.7 12.4 7.8 10.5 14.1 9.2 15.2 12.1 16.0 7.4 13.2 17.3 9.7
Qwen3 30B 37.0 31.0 23.3 25.0 25.2 24.2 30.0 19.6 20.0 22.1 21.4 34.1 23.2 22.6 27.1 22.4
Qwen3 4B 13.7 12.1 13.3 6.2 9.2 7.6 10.0 10.3 8.4 9.0 10.0 15.1 10.1 13.2 5.1 8.0
Llama 3.3 70B 28.8 22.4 13.3 31.2 18.5 21.2 30.0 13.1 21.3 19.5 15.3 28.9 18.1 11.3 18.6 18.4
Llama 3.1 70B 19.9 10.3 6.7 18.8 10.1 13.6 20.0 12.1 11.7 14.6 12.2 25.9 11.2 15.1 15.3 14.0
Llama 3.1 8B 6.2 6.9 10.0 6.2 5.9 4.5 10.0 6.5 6.2 7.7 7.3 8.9 10.7 11.3 6.8 6.6
Fin R1 16.4 17.2 16.7 18.8 12.6 10.6 15.0 20.6 10.9 14.3 13.6 15.1 15.7 13.2 15.3 10.0
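The reasoning gains discussed in Appendix F.2 can be recomputed from the Asset Overshooting (Overs.) column of Table A12 (scores copied verbatim; the dictionary is only a convenience for the arithmetic):

```python
# Asset Overshooting (Overs.) scores from Table A12.
overs = {
    "Claude Sonnet 4.5": 35.0,
    "Claude Sonnet 4.5 Think": 55.0,
    "GPT 5": 35.0,
    "GPT 5 Think": 50.0,
}

# Gain attributable to extended reasoning, per model family.
claude_gain = overs["Claude Sonnet 4.5 Think"] - overs["Claude Sonnet 4.5"]
gpt_gain = overs["GPT 5 Think"] - overs["GPT 5"]
print(claude_gain, gpt_gain)  # 20.0 15.0
```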

Appendix G. Dataset Comparison

Figure A1. Data Comparison with Related Works.
As illustrated in Table 1, we present a comprehensive comparison of CCLUPE with existing benchmarks relevant to financial and credit assessment domains. The comparison encompasses dataset scale, annotation quality, domain specificity, and modality coverage, revealing distinct characteristics and application scenarios for each benchmark.
CCLUPE (Ours) represents a significant advancement in credit assessment benchmarking, comprising 4,062 samples with expert-validated annotations. A defining characteristic of our dataset is its specialized focus on transaction log analysis, which constitutes the core data source in real-world credit decisioning workflows. CCLUPE uniquely supports trimodal inputs—text, tabular, and time-series data—enabling holistic evaluation of consumer financial behavior. The dataset’s design is grounded in extensive consultations with industry practitioners, ensuring alignment with authentic credit underwriting processes and risk control methodologies.
CALM offers the largest scale among existing credit-related benchmarks with 14,000 samples and expert annotations. However, its scope is confined to loan approval decisions with exclusively tabular modality, limiting its capacity to assess models’ reasoning capabilities across heterogeneous data formats. While CALM provides valuable resources for binary classification tasks, it lacks the behavioral consumption indicators and temporal transaction patterns essential for comprehensive credit risk inference.
Loan Approval Benchmark contains 3,065 samples targeting loan approval scenarios with text and tabular modalities. Notably, this benchmark lacks expert annotations, potentially compromising the reliability and domain authenticity of its ground-truth labels. Its coverage remains restricted to conventional approval decisions without incorporating nuanced behavioral signals derived from transaction histories.
FinRAGBench-V provides 1,394 expert-annotated samples combining text and tabular data. While it demonstrates rigorous annotation quality, the dataset is designed for general financial RAG tasks without domain-specific focus on credit assessment, limiting its applicability for evaluating credit-oriented AI systems.
MMMU (Finance) and MME-Finance represent general-purpose financial multimodal benchmarks with 390 and 1,171 samples respectively. Both datasets lack expert annotations and are primarily oriented toward visual chart interpretation and general financial knowledge assessment rather than credit-specific workflows. Their modality coverage (text+image and text+chart) does not encompass the tabular and time-series formats predominant in credit risk analysis.
This comparative analysis underscores CCLUPE’s distinctive positioning within the landscape of financial AI benchmarks. While existing datasets either emphasize general financial knowledge or provide limited credit-specific coverage, CCLUPE bridges this gap by delivering: (1) substantial scale with expert-validated annotations, (2) specialized focus on transaction-level behavioral analysis, (3) comprehensive trimodal coverage aligned with real-world credit assessment pipelines, and (4) hierarchical behavioral indicators derived from industry best practices. These characteristics establish CCLUPE as a rigorous and practically relevant benchmark for advancing AI capabilities in consumer credit evaluation.

Appendix H. Model Configurations

We conducted experiments across a diverse array of 15+ LLMs, encompassing proprietary frontier models and open-source financial adaptations. All models were evaluated between January and February 2025.
Table A13. Model architectures and parameter scales evaluated in the benchmark.
Model Name | Developer | Scale/Type | Access Method
Proprietary
Gemini 3 | Google | Unknown | Vertex AI API
Claude 4.5 Sonnet | Anthropic | Unknown | Anthropic API
GPT-5 | OpenAI | Unknown | OpenAI API
Open-Source
DeepSeek V3 | DeepSeek | 671B (MoE) | Local (vLLM)
Llama 3.3 | Meta | 70B | Local (vLLM)
Qwen 3 | Alibaba | 4B, 30B, 235B | Local (vLLM)
Fin R1 | Financial AI Lab | 32B (Distilled) | Local (vLLM)
Local models were deployed on a high-performance cluster. To eliminate stochastic variance and keep the focus on reasoning capability, the temperature was set to 0.0. We used the vLLM engine to handle the long-context requirements of full-year 2023 transaction logs, which can exceed 3,000 tokens per prompt.
Table A14. Inference parameters and environment details.
Parameter | Configuration Value
GPU Hardware | 8 × NVIDIA H800 (80GB)
Operating System | Ubuntu 22.04 LTS
Inference Engine | vLLM (v0.6.3) / CUDA 12.4
Decoding Strategy | Greedy Search (T = 0)
Max Context Length | 16,384 Tokens
Precision | Bfloat16
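With a 16,384-token window and multi-thousand-token logs, prompts must be budgeted before batching; a natural policy is to keep the most recent transactions when a log would overflow. A minimal illustrative sketch (the chars-per-token heuristic and the `fit_log` helper are our own assumptions, not part of the released pipeline):

```python
# Illustrative pre-processing: keep the newest transaction lines that fit
# within a token budget, approximating token count as len(text) // 4.
MAX_CONTEXT_TOKENS = 16_384

def approx_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); not a real tokenizer.
    return max(1, len(text) // 4)

def fit_log(lines: list[str], budget: int = MAX_CONTEXT_TOKENS) -> list[str]:
    """Return the newest suffix of `lines` whose total tokens fit `budget`."""
    kept, used = [], 0
    for line in reversed(lines):  # newest entries appear last in the raw log
        cost = approx_tokens(line)
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    return list(reversed(kept))
```

In practice a real tokenizer would replace `approx_tokens`, but the budgeting logic stays the same.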
A critical challenge in credit risk assessment is the cost of "misunderstanding" risks. To address this, CCLUPE introduces the Log-Score ($L$), which discounts a model's unbiased performance $S_{\text{unbiased}}$ based on its Misunderstanding Rate ($M$). The Misunderstanding Rate over $N_{\text{total}}$ logs is calculated as:

$$M = \frac{1}{N_{\text{total}}} \sum_{q=1}^{N_{\text{total}}} \frac{|M_q \setminus G_q|}{\max(1, |M_q|)}$$

where $M_q \setminus G_q$ is the set of risk indicators identified by the model for log $q$ that are absent from the ground truth $G_q$, i.e., fabricated indicators. The final Log-Score is then derived using a severity exponent $\delta$:

$$L = S_{\text{unbiased}} \cdot (1 - M)^{\delta}$$

In our experiments, we set $\delta = 2$ to impose a quadratic penalty, ensuring that analytical reliability is prioritized over raw recall.
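Under the definitions above, the metric is straightforward to implement. A minimal sketch in Python (function names are ours; each log's predicted indicators M_q and ground-truth indicators G_q are assumed to be given as sets):

```python
def misunderstanding_rate(predictions, ground_truths):
    """Average fraction of fabricated indicators |M_q \\ G_q| / max(1, |M_q|)."""
    rate = sum(
        len(m_q - g_q) / max(1, len(m_q))
        for m_q, g_q in zip(predictions, ground_truths)
    )
    return rate / len(predictions)

def log_score(s_unbiased, predictions, ground_truths, delta=2):
    """Discount the unbiased score by (1 - M)^delta (delta = 2 in the paper)."""
    m = misunderstanding_rate(predictions, ground_truths)
    return s_unbiased * (1 - m) ** delta

# Example: one log where the model reports 3 indicators but only 2 are real,
# so M = 1/3 and L = 0.9 * (2/3)^2 = 0.4.
preds = [{"late_payment", "night_spending", "gambling"}]
truth = [{"late_payment", "night_spending"}]
print(round(log_score(0.9, preds, truth), 2))  # 0.4
```

The quadratic exponent makes the discount mild for small M but severe as fabrication grows: a model that fabricates every indicator scores 0 regardless of raw accuracy.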

Appendix I. Sample Question

Figure A2. Sample Single Choice Question about Comprehension & Calculation of Personal Clients Log Information.
Figure A3. Sample Multiple Choice Question about Analysis & Synthesis of Micro-enterprise Clients Log Information.
Figure A4. Sample Single Choice Question about Analysis & Synthesis of Personal Clients Log Information.
Figure A5. Sample Multiple Choice Question about Comprehension & Calculation of Micro-enterprise Clients Log Information.
Figure A6. Sample Single Choice Question about Comprehension & Calculation of Micro-enterprise Clients Log Information.
Figure A7. Sample Single Choice Question about Comprehension & Calculation of Personal Clients Log Information.
Figure A8. Sample Calculation Question about Comprehension & Calculation of Personal Clients Log Information.
Figure A9. Sample Multiple Choice Question about Memory & Recognition of Micro-enterprise Clients Log Information.

References

1. Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; Mann, G. BloombergGPT: A large language model for finance. arXiv 2023, arXiv:2303.17564.
2. Liu, X.Y.; Wang, G.; Yang, H.; Zha, D. FinGPT: Democratizing internet-scale data for financial large language models. arXiv 2023, arXiv:2307.10485.
3. Xie, Q.; Han, W.; Chen, Z.; Xiang, R.; Zhang, X.; He, Y.; Xiao, M.; Li, D.; Dai, Y.; Feng, D.; et al. FinBen: A holistic financial benchmark for large language models. Advances in Neural Information Processing Systems 2024, 37, 95716–95743.
4. Lin, S.C.; Tian, F.; Wang, K.; Zhao, X.; Huang, J.; Xie, Q.; Borella, L.; White, M.; Wang, C.D.; Xiao, K.; et al. Open FinLLM Leaderboard: Towards financial AI readiness. arXiv 2025, arXiv:2501.10963.
5. Feng, D.; Dai, Y.; Huang, J.; Zhang, Y.; Xie, Q.; Han, W.; Chen, Z.; Lopez-Lira, A.; Wang, H. Empowering many, biasing a few: Generalist credit scoring through large language models. arXiv 2023, arXiv:2310.00566.
6. Chen, Z.; Li, S.; Smiley, C.; Ma, Z.; Shah, S.; Wang, W.Y. ConvFinQA: Exploring the chain of numerical reasoning in conversational finance question answering. arXiv 2022, arXiv:2210.03849.
7. Wang, N.; Yang, H.; Wang, C.D. FinGPT: Instruction tuning benchmark for open-source large language models in financial datasets. arXiv 2023, arXiv:2310.04793.
8. Xia, Y.; Shi, Z.; Du, X.; Zheng, Q. Extracting narrative data via large language models for loan default prediction: when talk isn't cheap. Applied Economics Letters 2025, 32, 481–486.
9. Sanz-Guerrero, M.; Arroyo, J. Credit risk meets large language models: Building a risk indicator from loan descriptions in P2P lending. arXiv 2024, arXiv:2401.16458.
10. Lei, Y.; Wang, Z.; Liu, C.; Wang, T.; Lee, D. FinLangNet: A novel deep learning framework for credit risk prediction using linguistic analogy in financial data. Proceedings of Preprint. ACM, New York, NY, USA, 2024; pp. 1–10.
11. Bhatia, G.; Cavusoglu, H.; Abdul-Mageed, M.; et al. FinTral: A family of GPT-4 level multimodal financial large language models. Findings of the Association for Computational Linguistics: ACL 2024, 2024, 13064–13087.
12. Liu, S.; Zhao, S.; Jia, C.; Zhuang, X.; Long, Z.; Zhou, J.; Zhou, A.; Lan, M.; Wu, Q.; Yang, C. FinDABench: Benchmarking financial data analysis ability of large language models. arXiv 2024, arXiv:2401.02982.
13. Lei, Y.; Wang, Z.; Liu, C.; Wang, T. Zigong 1.0: A large language model for financial credit. arXiv 2025, arXiv:2502.16159.
14. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. DeepSeek-V3 technical report. arXiv 2024, arXiv:2412.19437.
15. Team, K.; Bai, Y.; Bao, Y.; Chen, G.; Chen, J.; Chen, N.; Chen, R.; Chen, Y.; Chen, Y.; Chen, Y.; et al. Kimi K2: Open agentic intelligence. arXiv 2025, arXiv:2507.20534.
16. Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 technical report. arXiv 2025, arXiv:2505.09388.
17. Zeng, A.; Lv, X.; Zheng, Q.; Hou, Z.; Chen, B.; Xie, C.; Wang, C.; Yin, D.; Zeng, H.; Zhang, J.; et al. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. arXiv 2025, arXiv:2508.06471.
18. Liu, Z.; Guo, X.; Lou, F.; Zeng, L.; Niu, J.; Wang, Z.; Xu, J.; Cai, W.; Yang, Z.; Zhao, X.; et al. Fin-R1: A large language model for financial reasoning through reinforcement learning. arXiv 2025, arXiv:2503.16252.
19. Truong, S.; Tu, Y.; Liang, P.; Li, B.; Koyejo, S. Reliable and efficient amortized model-based evaluation. arXiv 2025, arXiv:2503.13335.
20. Wu, X.; Liu, J.; Su, H.; Lin, Z.; Qi, Y.; Xu, C.; Su, J.; Zhong, J.; Wang, F.; Wang, S.; et al. Golden Touchstone: A comprehensive bilingual benchmark for evaluating financial large language models. arXiv 2024, arXiv:2411.06272.
21. Luo, J.; Kou, Z.; Yang, L.; Luo, X.; Huang, J.; Xiao, Z.; Peng, J.; Liu, C.; Ji, J.; Liu, X.; et al. FinMME: Benchmark dataset for financial multi-modal reasoning evaluation. arXiv 2025, arXiv:2505.24714.
22. Azime, I.A.; Kanubala, D.D.; Afonja, T.; Fritz, M.; Valera, I.; Klakow, D.; Slusallek, P. Accept or deny? Evaluating LLM fairness and performance in loan approval across table-to-text serialization approaches. arXiv 2025, arXiv:2508.21512.
23. Jajoo, G.; Chitale, P.A.; Agarwal, S. MASCA: LLM-based multi-agents system for credit assessment. arXiv 2025, arXiv:2507.22758.
24. Drinkall, F.; Pierrehumbert, J.B.; Zohren, S. Forecasting credit ratings: A case study where traditional methods outperform generative LLMs. arXiv 2024, arXiv:2407.17624.
25. Golec, M.; AlabdulJalil, M. Interpretable LLMs for credit risk: A systematic review and taxonomy. arXiv 2025, arXiv:2506.04290.
26. Bagalkotkar, A.; Karmakar, A.; Arnson, G.; Linda, O. FairHome: A fair housing and fair lending dataset. arXiv 2024, arXiv:2409.05990.
27. Koreeda, Y.; Manning, C.D. ContractNLI: A dataset for document-level natural language inference for contracts. arXiv 2021, arXiv:2110.01799.
28. Zhang, X.; Luo, S.; Zhang, B.; Ma, Z.; Zhang, J.; Li, Y.; Li, G.; Yao, Z.; Xu, K.; Zhou, J.; et al. TableLLM: Enabling tabular data manipulation by LLMs in real office usage scenarios. arXiv 2024, arXiv:2403.19318.
29. Lee, J.; Roh, M. Multi-Reranker: Maximizing performance of retrieval-augmented generation in the FinanceRAG challenge. arXiv 2024, arXiv:2411.16732.
30. Zhao, S.; Jin, Z.; Li, S.; Gao, J. FinRAGBench-V: A benchmark for multimodal RAG with visual citation in the financial domain. arXiv 2025, arXiv:cs.
1
FICO® Scores, created by Fair Isaac Corporation, are used by 90% of top US lenders for credit risk assessment https://www.fico.com/en/products/fico-score
2
SCHUFA (Schutzgemeinschaft für allgemeine Kreditsicherung) is a German credit agency that calculates credit scores https://www.schufa.de/en/newsroom/creditworthiness/obtain-schufa-score-credit-report/
Figure 1. The Comprehensive Taxonomy, Data Examples and Statistical Characteristics of CCLUPE. The circular taxonomy diagram shows four core cognitive levels, knowledge categories and domains.
Figure 2. The annotation pipeline of CCLUPE. The process consists of three main stages: (1) Raw transaction log collection and dataset construction, (2) Questions design through credit analysis experts and human annotator, and (3) Quality Control checking where expert reviewers validate consistent annotations and resolve inconsistencies.
Table 1. Comparison with existing benchmarks. CCLUPE provides a comprehensive and high-quality dataset for the financial multimodal domain.
Table 2. Statistical characteristics of the CCLUPE dataset, including question types, cognitive levels, and knowledge domains.
Statistic Number
Dataset Overview
Total Samples 4,062
Cognitive Level Distribution
Memory & Recognition 343
Comprehension & Calculation 1,263
Analysis & Synthesis 1,780
Evaluation & Inference 676
Core Knowledge Domain
Stability and Regularity of Cash Flow 1,408
Characteristics of Cash Flow Conduct 520
Cash Flow Structure and Concentration 857
Time Characteristics of Cash Flow 118
Specialised Cash Flow Conduct 210
Fund Liquidity and Pressure 816
Others 133
Client Type Distribution
Personal Clients 2,931
Micro-enterprise Clients 1,131
Table 5. Model performance consistency across 5 independent runs. Results are reported as mean percentage accuracy with standard deviations.
Method Single. Multi. Cal. Avg.
GPT 5 Think 52.0±0.75 17.7±0.93 43.5±0.67 44.0±0.47
Qwen 3 Max 24.8±0.63 91.0±0.72 34.2±0.73 39.6±0.74
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.